Optimizing Queries

The general rule is to reduce the number of bytes your query scans, not just to make it more performant but, more importantly, to make it cheaper. Here are some recommendations on how to do that.

  • Avoid SELECT *: BigQuery uses column-based storage, so the number of bytes you scan is proportional to the number of columns you select. Query only the columns you actually need (see the first sketch after this list).

  • Leverage partitions and clusters: We partition most of our tables by ingestion_timestamp (the time we ingest the data into BigQuery) and cluster them by event_type and attribute_key. Use these columns in your WHERE clause to scan as little data as possible. Note that ingestion_timestamp should only be used for filtering when testing, never in production: we sometimes re-ingest data whose ingestion_timestamp is very distant from the actual block_timestamp (see the example after this list).

  • Late aggregation: Aggregate as late as possible. In particular, your subqueries should avoid GROUP BY clauses unless the aggregation drastically reduces the data before it is joined (see the sketch after this list).

  • Largest table first in the JOIN clause: When joining multiple tables, order them by decreasing size, starting with the largest (see the example after this list).

  • Reduce data before joining: Use a WHERE clause to eliminate all the data you don't need before joining; this makes your joins perform much better. Try to have filters on every joined table if possible (see the sketch after this list).

  • Order in WHERE matters: BigQuery assumes you've chosen the best order for your filters, so the first condition in your WHERE clause should be the one that eliminates the most data (see the example after this list).
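
To illustrate column pruning, here is a minimal sketch; the table name `project.dataset.events` is hypothetical, while event_type and block_timestamp are the columns mentioned above:

```sql
-- Scans every column in the table: maximal bytes scanned.
SELECT *
FROM `project.dataset.events`;

-- Scans only two columns; bytes scanned (and cost) drop proportionally.
SELECT event_type, block_timestamp
FROM `project.dataset.events`;
```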
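
For partitions and clusters, a sketch on the same hypothetical table; the filter values and the attribute_value column are illustrative:

```sql
-- Testing only: prune partitions via ingestion_timestamp and prune
-- storage blocks via the event_type / attribute_key clustering columns.
SELECT event_type, attribute_key, attribute_value
FROM `project.dataset.events`
WHERE ingestion_timestamp >= TIMESTAMP('2024-01-01')
  AND event_type = 'transfer'
  AND attribute_key = 'amount';

-- Production: filter on block_timestamp instead, since re-ingested rows
-- can carry an ingestion_timestamp far from their block_timestamp.
SELECT event_type, attribute_key, attribute_value
FROM `project.dataset.events`
WHERE block_timestamp >= TIMESTAMP('2024-01-01')
  AND event_type = 'transfer'
  AND attribute_key = 'amount';
```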
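
For late aggregation, the rule is to keep GROUP BY out of subqueries; this sketch shows the one case where an early GROUP BY pays off. The `project.dataset.daily_prices` table and its price_usd and day columns are hypothetical:

```sql
-- Exception to "aggregate last": this GROUP BY collapses many event
-- rows into one row per day, drastically shrinking the join input.
WITH daily AS (
  SELECT DATE(block_timestamp) AS day, COUNT(*) AS n_events
  FROM `project.dataset.events`
  GROUP BY day
)
SELECT d.day, d.n_events, p.price_usd
FROM daily AS d
JOIN `project.dataset.daily_prices` AS p
  ON p.day = d.day;
```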
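
For join order, a sketch assuming the hypothetical events table is large and `project.dataset.validators` (with illustrative address and moniker columns) is a small lookup table:

```sql
SELECT e.event_type, v.moniker
FROM `project.dataset.events` AS e      -- largest table first
JOIN `project.dataset.validators` AS v  -- smaller tables follow, by size
  ON v.address = e.attribute_value;
```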
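
For reducing data before the join, a sketch that pushes filters into both sides; the `active` flag and the date are illustrative:

```sql
SELECT e.attribute_value, v.moniker
FROM (
  SELECT attribute_value
  FROM `project.dataset.events`
  WHERE block_timestamp >= TIMESTAMP('2024-01-01')  -- filter before join
    AND event_type = 'delegate'
) AS e
JOIN (
  SELECT address, moniker
  FROM `project.dataset.validators`
  WHERE active = TRUE                               -- filter here too
) AS v
  ON v.address = e.attribute_value;
```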
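
Finally, for filter order, a sketch in which the event_type filter is assumed to eliminate the most rows; the selectivities are illustrative, so check them against your own data:

```sql
SELECT event_type, attribute_key, attribute_value
FROM `project.dataset.events`
WHERE event_type = 'transfer'     -- assumed most selective, so it goes first
  AND attribute_key = 'amount'    -- less selective filters follow
  AND attribute_value IS NOT NULL;
```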
