The general rule is that you need to reduce the number of bytes your query scans, not just to make it faster but, more importantly, to make it cheaper. Here are some recommendations on how to do that.
- Avoid `SELECT *`: BigQuery uses column-based storage, so the number of bytes you scan is proportional to the number of columns you select. Select only the columns you need (see the column-selection sketch after this list).
- Leverage partitions and clusters: We partition most of our tables by `ingestion_timestamp` (the time we ingest the data into BigQuery) and cluster them by `attribute_key`. Use these columns in your `WHERE` clause to make sure you scan as little data as possible (see the partition-pruning sketch after this list). Make sure that `ingestion_timestamp` is only used when testing and not in production environments: we sometimes reingest data whose `ingestion_timestamp` is very distant from the actual event time.
- Late aggregation: Aggregate as late as possible. Your subqueries should avoid `GROUP BY` clauses unless the aggregation drastically reduces the data before it is joined (see the late-aggregation sketch after this list).
- Largest table first in the `JOIN` clause: When joining multiple tables, join them in decreasing order of size (see the join-order sketch after this list).
- Reduce data when joining: Use the `WHERE` clause to eliminate all data you don't need before joining; this makes your joins perform much better. Try to have filters on all joined tables if possible (see the join-filter sketch after this list).
- Order in `WHERE` matters: BigQuery assumes that you've chosen the best order for your filters. The first part of your `WHERE` clause should always contain the filter that eliminates the most data (see the filter-order sketch after this list).
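
The column-selection sketch, using a hypothetical `my_dataset.events` table (the table and column names are assumptions, not our actual schema):

```sql
-- Scans every column in the table: bytes billed grow with schema width.
SELECT *
FROM my_dataset.events;

-- Scans only the two columns it reads, so it bills only their bytes.
SELECT user_id, attribute_key
FROM my_dataset.events;
```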
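
The partition-pruning sketch, again on the hypothetical `my_dataset.events` table; the date range and attribute value are made up:

```sql
SELECT user_id, attribute_key
FROM my_dataset.events
-- Filter on the partitioning column so BigQuery prunes whole partitions...
WHERE ingestion_timestamp >= TIMESTAMP('2024-01-01')
  AND ingestion_timestamp <  TIMESTAMP('2024-01-02')
  -- ...then on the clustering column so it skips blocks inside each partition.
  AND attribute_key = 'page_view';
```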
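
The late-aggregation sketch, with a hypothetical `my_dataset.users` dimension table; the point is that the `GROUP BY` happens once, after the join:

```sql
-- Join first, aggregate once at the end.
SELECT u.country, COUNT(*) AS event_count
FROM my_dataset.events AS e
JOIN my_dataset.users AS u
  ON u.user_id = e.user_id
GROUP BY u.country;
-- The exception: an early GROUP BY in a subquery is worthwhile when it
-- collapses a huge table to a few rows before the join.
```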
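
The join-order sketch; the table names and the sizes in the comments are illustrative:

```sql
SELECT e.attribute_key, c.country_name
FROM my_dataset.events AS e       -- largest table first (billions of rows)
JOIN my_dataset.users AS u        -- then smaller (millions of rows)
  ON u.user_id = e.user_id
JOIN my_dataset.countries AS c    -- smallest last (hundreds of rows)
  ON c.country_id = u.country_id;
```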
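
The join-filter sketch, with one filter per joined table (`signup_date` is an assumed column):

```sql
SELECT e.user_id, e.attribute_key
FROM my_dataset.events AS e
JOIN my_dataset.users AS u
  ON u.user_id = e.user_id
-- A filter on each joined table, so both inputs shrink before the join runs.
WHERE e.ingestion_timestamp >= TIMESTAMP('2024-01-01')
  AND u.signup_date >= DATE '2023-01-01';
```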
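
The filter-order sketch; which predicate is most selective depends on your data, and the values below are made up:

```sql
SELECT user_id
FROM my_dataset.events
WHERE attribute_key = 'rare_checkout_error'          -- most selective filter first
  AND ingestion_timestamp >= TIMESTAMP('2024-01-01')
  AND user_id IS NOT NULL;                           -- broad filters last
```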