BigQuery

Introduction

  • BQ is fully managed data warehouse

  • Takes care of underlying infrastructure

  • Different from a data lake - which is raw, unstructed data

  • Data warehouse contains structured and organized data

  • Data sources feed into a data laek, which are processed into your data warehouse for analysis and reporting

  • BQ has two main features

    • Fast SQL Engine for analytics (Analytics)

    • Storage layer for database (Storag)

  • Can also integrate with Vertex AI and other ML workflows

Storage Service

  • Automatically manages data that you ingest

  • Organizes data tables into units called datasets

  • Reference a table by using: project.dataset.table

  • Tables stored as highly compressed columns in Colossus storage

Query service

  • Interactive query service

  • Supports:

    • Web UI

    • bq

    • REST API

  • Connectors to other services such as Dataproc

  • Also run query jobs on data in other locations

    • CSV Cloud Storage

    • Data in Google Sheets

      • Note: BigQuery is most efficient when working with its own storage service

  • Controlling the amount of data processed for costs

    • Only select columns and data you actually want to process and return

    • Focus on critical rows and fields

Organization of Data in BQ

  • Divided into project, dataset and tables

  • Use multiple datasets to separate tables for different analytical domains, use project-level scoping to isolate datasets from each other

  • Every table has a schema

    • Parquet, Firestore export or Datastore export are self-describing

    • However you can enter schema manually through JSON file

  • BQ is a columnar store, designed to process columns, not rows

  • BQ stores data in distributed, sharded format with auto-optimization, balancing, etc.

  • You can also partition and cluster your table based on your access patterns to control costs

Last updated