BigQuery
Introduction
BQ is fully managed data warehouse
Takes care of underlying infrastructure
Different from a data lake - which is raw, unstructed data
Data warehouse contains structured and organized data
Data sources feed into a data laek, which are processed into your data warehouse for analysis and reporting
BQ has two main features
Fast SQL Engine for analytics (Analytics)
Storage layer for database (Storag)
Can also integrate with Vertex AI and other ML workflows
Storage Service
Automatically manages data that you ingest
Organizes data tables into units called datasets
Reference a table by using: project.dataset.table
Tables stored as highly compressed columns in Colossus storage
Query service
Interactive query service
Supports:
Web UI
bq
REST API
Connectors to other services such as Dataproc
Also run query jobs on data in other locations
CSV Cloud Storage
Data in Google Sheets
Note: BigQuery is most efficient when working with its own storage service
Controlling the amount of data processed for costs
Only select columns and data you actually want to process and return
Focus on critical rows and fields
Organization of Data in BQ
Divided into project, dataset and tables
Use multiple datasets to separate tables for different analytical domains, use project-level scoping to isolate datasets from each other
Every table has a schema
Parquet, Firestore export or Datastore export are self-describing
However you can enter schema manually through JSON file
BQ is a columnar store, designed to process columns, not rows
BQ stores data in distributed, sharded format with auto-optimization, balancing, etc.
You can also partition and cluster your table based on your access patterns to control costs
Last updated