Google BigQuery, a fully managed cloud data warehouse for analytics from Google Cloud Platform (GCP), is one of the most popular cloud-based analytics solutions. Because of its unique architecture and seamless integration with other GCP services, there are certain elements to consider as Google BigQuery best practices when migrating data to Google Cloud.
These GCP best practices help ensure that BigQuery cost and performance stay optimized, even in an existing Google Cloud environment.
BIGQUERY LOADING
There are a few pointers to consider before loading data into BigQuery, such as:
- Frequency: There are two ways to load data: batch loads or streaming inserts. Batch loads are free, but streaming inserts incur extra charges, currently $0.05 per GB sent.
- Data Source: Data can be loaded into BigQuery from Cloud Storage (the recommended approach), from Google services such as Google Analytics and Google Ads, or uploaded from local machines (see the load sketch after this list).
- File Format: BigQuery supports the following formats when loading data: Avro, JSON (newline delimited), CSV, Parquet and ORC. Some formats load faster than others, roughly in this order from fastest to slowest: Avro (compressed), Parquet / ORC, CSV, JSON, CSV (compressed), JSON (compressed).
- The binary Avro format is the most efficient for loading compressed data.
- Parquet and ORC formats are also good choices, as they can be loaded in parallel.
- For CSV and JSON, BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel.
- ELT or ETL Methods: It is often handy to transform data within BigQuery using SQL, an approach commonly referred to as Extract Load Transform (ELT). As data grows from terabytes to petabytes, and given the parallel processing capabilities that BigQuery offers, ELT is generally recommended over ETL. It is more efficient to load all the data into BigQuery first and then run your transformations on it; this way you do not have to worry about scaling your ETL infrastructure to match the increasing amount of data, especially in high-speed use cases. Another reason to prefer ELT over ETL is better support for exploratory data analysis.
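As a concrete illustration of batch loading from Cloud Storage in Avro format, here is a minimal sketch using the google-cloud-bigquery Python client; the project, bucket, file path, and table names are hypothetical placeholders, not values from this article.

```python
# Minimal sketch: batch-load compressed Avro files from Cloud Storage into BigQuery.
# All project, bucket, and table names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.my_dataset.my_table"        # hypothetical destination table
gcs_uri = "gs://my-bucket/exports/events-*.avro"   # hypothetical source files in GCS

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,      # Avro loads fastest, even when compressed
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Batch loads are free; only the storage of the loaded data is billed.
load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete

print(f"Loaded table now has {client.get_table(table_id).num_rows} rows")
```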
BIGQUERY BEST PRACTICES
- The cost of a BigQuery query depends on the amount of data scanned, not on the amount of data returned.
- Control projection: projection is the set of columns you read when executing your query. Always try to select only the columns you need, and avoid using SELECT * (select all) at all costs. Even adding a LIMIT clause will not reduce the cost of the query.
- Use the table preview option if you need to explore data; it is free.
- Always prefer partitioned tables (date or time partitioned) over sharded (date-named) tables.
- When querying wildcard tables, use the most granular prefix possible. For example, if you have table names like “TableName_YYYYMMDD” and you only need data from 2020, use “TableName_2020*” rather than “TableName_*” followed by a filter on the year.
- When querying partitioned tables, it is recommended to filter on the required partitions in the innermost SELECT first, and then add additional filters in outer queries.
- The ORDER BY clause is very expensive, so unless it is really necessary, limit its use as much as possible, or use it only in the outermost query to sort the final results rather than in subqueries.
- Avoid joining the same subquery several times in the same query. Instead, try saving that subquery as an intermediate table and querying it (see the intermediate-table sketch after this list). Storage cost is much lower than the cost of repeated querying.
- When joining two tables, start with the largest table for better performance.
- Avoid using JavaScript user-defined functions. If needed, use SQL user-defined functions.
- Do not use an external data source if you are seeking better performance, unless the data changes frequently.
- Before executing your query, check the query validator in the console or do a dry run in the CLI; both tell you approximately how much data will be scanned, so you can estimate the cost with the GCP Pricing Calculator (a dry-run sketch follows this list). Alternatively, a third-party add-on like BQ Mate estimates query cost directly.
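To illustrate projection control, partition filtering, and cost estimation together, here is a hedged dry-run sketch using the google-cloud-bigquery Python client; the dataset, table, and column names (including the assumed DATE partitioning column event_date) are made up for illustration.

```python
# Sketch: estimate how much data a query would scan before actually running it.
# Table and column names are hypothetical; event_date is assumed to be the
# partitioning column of a date-partitioned table.
from google.cloud import bigquery

client = bigquery.Client()

# Select only the needed columns (no SELECT *) and filter on the partition
# column as early as possible so only the relevant partitions are scanned.
query = """
    SELECT user_id, event_name
    FROM `my-project.my_dataset.events`
    WHERE event_date BETWEEN '2020-01-01' AND '2020-01-31'
"""

# dry_run=True validates the query and reports the bytes it would process
# without executing it, so no query charges are incurred.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=job_config)

print(f"This query would scan about {job.total_bytes_processed / 1024**3:.2f} GiB")
```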
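For the tip about repeated subqueries, one possible approach (again with hypothetical names) is to materialize the subquery once as an intermediate table and reference that table in subsequent queries:

```python
# Sketch: materialize a subquery that would otherwise be joined several times.
# All project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Write the subquery result to an intermediate table once...
client.query(
    """
    CREATE OR REPLACE TABLE `my-project.my_dataset.daily_user_stats` AS
    SELECT user_id, DATE(event_timestamp) AS day, COUNT(*) AS events
    FROM `my-project.my_dataset.events`
    GROUP BY user_id, day
    """
).result()

# ...then join against the (much smaller) intermediate table in later queries
# instead of re-running, and re-paying for, the same subquery.
```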
About Datametica
A Global Leader in Data Warehouse Modernization & Migration. We empower businesses by migrating their Data/Workload/ETL/Analytics to the Cloud by leveraging Automation.