BigQuery for data warehouse practitioners

Updated September 2017

This article explains how to use BigQuery (//what-is-bigquery) as a data warehouse, first mapping common data warehouse concepts to those in BigQuery, and then describing how to perform standard data-warehousing tasks in BigQuery.

Service model comparison

The following list maps standard data-warehouse concepts to their equivalents in BigQuery:

Data warehouse: The BigQuery service replaces the typical hardware setup for a traditional data warehouse. That is, it serves as a collective home for all analytical data in an organization.

Data mart: Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a Google Cloud project.

Data lake: Your data lake might contain files in Cloud Storage (/storage) or Google Drive (https://www.google.com/drive/), or transactional data in Bigtable (/bigtable). BigQuery can define a schema and issue queries directly on external data as federated data sources (/bigquery/federated-data-sources).

Tables and views: Tables and views function the same way in BigQuery as they do in a traditional data warehouse.

Grants: Identity and Access Management (IAM) is used to grant permission to perform specific actions in BigQuery.

Datasets

BigQuery organizes data tables into units called datasets. These datasets are scoped to your Google Cloud project. When you reference a table from the command line, in SQL queries, or in code, you refer to it by using the following construct:

project.dataset.table

These multiple scopes (project, dataset, and table) can help you structure your information logically. You can use multiple datasets to separate tables pertaining to different analytical domains, and you can use project-level scoping to isolate datasets from each other according to your business needs.
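
For example, the following standard SQL query references a table by its fully qualified name (a minimal sketch; the project, dataset, and table names are hypothetical):

SELECT order_id, amount
FROM `my-project.sales_mart.fact_orders`  -- project.dataset.table
WHERE order_date = DATE '2017-09-01'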

Here is a structural overview of BigQuery:

Provisioning and system sizing

You don't need to provision resources before using BigQuery, unlike many RDBMS systems. BigQuery allocates storage and query resources dynamically based on your usage patterns.

Storage resources are allocated as you consume them and deallocated as you remove data or drop tables.

Query resources are allocated according to query type and complexity. Each query uses some number of slots, which are units of computation that comprise a certain amount of CPU and RAM.

You don't have to make a minimum usage commitment to use BigQuery. The service allocates and charges for resources based on your actual usage. By default, all BigQuery customers have access to 2,000 slots for query operations. You can also reserve a fixed number of slots for your project. For details about which approach to use, see the Costs (#costs) section. To start using BigQuery, you create a project to host your data, and then you enable billing. For instructions, see the BigQuery Quickstart (/bigquery/quickstart-web-ui#before-you-begin).

Storage management

Internally, BigQuery stores data in a proprietary columnar format called Capacitor (/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format), which has a number of benefits for data warehouse workloads. BigQuery uses a proprietary format because it can evolve in tandem with the query engine, which takes advantage of deep knowledge of the data layout to optimize query execution. BigQuery uses query access patterns to determine the optimal number of physical shards and how they are encoded.

The data is physically stored on Google's distributed file system, called Colossus (https://cloud.google.com/files/storage_architecture_and_challenges.pdf), which ensures durability by using erasure encoding (https://wikipedia.org/wiki/Erasure_code) to store redundant chunks of the data on multiple physical disks. Moreover, the data is replicated to multiple data centers.

You can also run BigQuery queries on data outside of BigQuery storage, such as data stored in Cloud Storage, Google Drive, or Bigtable, by using federated data sources (/bigquery/federated-data-sources). However, these sources are not optimized for BigQuery operations, so they might not perform as well as data stored in BigQuery storage.

Maintenance

BigQuery is a fully managed service, which means that the BigQuery engineering team takes care of updates and maintenance for you. Upgrades shouldn't require downtime or hinder system performance.

Many traditional systems require resource-intensive vacuum processes to run at various intervals to reshuffle and sort data blocks and recover space. BigQuery has no equivalent of the vacuum process, because the storage engine continuously manages and optimizes how data is stored and replicated. Also, because BigQuery doesn't use indexes on tables, you don't need to rebuild indexes.

Backup and recovery

BigQuery addresses backup and disaster recovery at the service level. Also, by maintaining a complete 7-day history of changes against your tables, BigQuery allows you to query a point-in-time snapshot of your data by using either table decorators (/bigquery/table-decorators#snapshot_decorators) or SYSTEM_TIME AS OF in the FROM clause (/bigquery/docs/reference/standard-sql/query-syntax#from-clause). You can easily revert changes without having to request a recovery from backups. (When a table is explicitly deleted, its history is flushed after 7 days.)
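
For example, the following query reads the table as it existed one hour ago (a minimal sketch; the table name is hypothetical):

SELECT *
FROM `my-project.sales_mart.fact_orders`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)  -- point-in-time snapshot within the 7-day history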

Managing workflows

This section discusses administrative tasks, such as organizing datasets, granting permissions, and onboarding work in BigQuery. The section also discusses how to manage concurrent workloads, monitor the health of your data warehouse, and audit user access.

Organizing datasets

You can segment datasets into separate projects based on class of data or business unit, or consolidate them into common projects for simplicity.

You can invite a data analyst to collaborate on an existing dataset in any limited role that you define. When a data analyst logs in to the BigQuery web UI (https://bigquery.cloud.google.com), they see only the datasets that have been shared with them across projects. The activities that they can perform against each dataset vary, based on their role for that dataset.

Granting permissions

In a traditional RDBMS system, you grant permissions to view or modify tables by creating SQL grants and applying them to a given user within the database system. In addition, some RDBMS systems allow you to grant permissions to users in an external directory, such as LDAP. The BigQuery model for managing users and permissions resembles the latter model.

BigQuery provides predefined roles (/bigquery/docs/access-control#bigquery) for controlling access to resources. You can also create custom IAM roles (/bigquery/docs/access-control) consisting of your defined set of permissions, and then assign those roles to users or groups. You can assign a role to a Google email address or to a G Suite Group (https://support.google.com/a/answer/33329?hl=en).

An important aspect of operating a data warehouse is allowing shared but controlled access against the same data to different groups of users. For example, finance, HR, and marketing departments all access the same tables, but their levels of access differ. Traditional data warehousing tools make this possible by enforcing row-level security. You can achieve the same results in BigQuery by defining authorized views (/bigquery/docs/authorized-views) and row-level permissions (/bigquery/docs/authorized-views#row-level-permissions).
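
For example, a single authorized view can enforce row-level filtering based on the identity of the user running the query (a minimal sketch, assuming a hypothetical orders table and a mapping table of analyst emails to regions):

SELECT o.*
FROM `my-project.sales_mart.fact_orders` o
JOIN `my-project.sales_mart.analyst_regions` a
  ON o.region = a.region
WHERE a.analyst_email = SESSION_USER()  -- email address of the querying user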

Onboarding

Traditionally, onboarding new data analysts involved significant lead time. To enable analysts to run simple queries, you had to show them where data sources resided and set up ODBC connections and tools and access rights. Using Google Cloud, you can greatly accelerate an analyst's time to productivity.

To onboard an analyst on Google Cloud, you grant access to relevant project(s) (https://support.google.com/cloud/answer/6158846#add-members), introduce them to the Google Cloud Console and BigQuery web UI, and share some queries to help them get acquainted with the data:

The Cloud Console (https://console.cloud.google.com/) provides a centralized view of all assets in your Google Cloud environment. The most relevant asset to data analysts might be Cloud Storage buckets (/storage/docs/creating-buckets), where they can collaborate on files.

The BigQuery web UI presents the list of datasets that the analyst has access to. Analysts can perform tasks in the Cloud Console according to the role you grant them, such as viewing metadata, previewing data, executing queries, and saving and sharing queries.

Managing workloads and concurrency

BigQuery limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis. Specific policies vary depending on resource availability, user profile, service usage history, and other factors. For details, see the BigQuery quota policy (/bigquery/quota-policy).

BigQuery offers two types of query priorities (/bigquery/querying-data#interactive-batch): interactive and batch. By default, BigQuery runs interactive queries, which means that the query is executed as soon as possible. Interactive queries count towards query quotas (/bigquery/quota-policy#queries). Batch queries are queued and executed as soon as idle resources are available, usually within a few minutes.

BigQuery doesn't support fine-grained prioritization of interactive or batch queries. Given the speed and scale at which BigQuery operates, many traditional workload issues aren't applicable. If you need explicit query prioritization, you can separate your sensitive workloads into a project with an explicit number of reserved slots. Contact your Google representative to assist in becoming a flat-rate customer.

Monitoring and auditing

You can monitor BigQuery using Monitoring (/bigquery/docs/monitoring), where various charts and alerts are defined based on BigQuery metrics (/bigquery/docs/monitoring#metrics). For example, you can monitor system throughput using the Query Time metric or visualize query demand trends based on the Slots Allocated metric. When you need to plan ahead for a demanding query, you can use the Slots Available metric. To stay proactive about system health, you can create alerts (/bigquery/docs/monitoring#create-alert) based on thresholds that you define. Monitoring provides a self-service web-based portal. You can control access to the portal with a Monitoring Workspace (/monitoring/accounts/guide).

BigQuery automatically creates audit logs of user actions. You can export audit logs to another BigQuery dataset in a batch or as a data stream and use your preferred analysis tool to visualize the logs. For details, see Analyzing audit logs using BigQuery (/bigquery/audit-logs).

Managing data

This section discusses schema design considerations, denormalization (https://wikipedia.org/wiki/Denormalization), how partitioning works, and methods for loading data into BigQuery. The section concludes with a look at handling change in the warehouse while maintaining zero analysis downtime.

Designing schema

Follow these general guidelines to design the optimal schema for BigQuery:

Denormalize a dimension table that is larger than 10 gigabytes, unless you see strong evidence that the costs of data manipulation (UPDATE and DELETE operations) outweigh the benefits of optimal queries.

Keep a dimension table that is smaller than 10 gigabytes normalized, unless the table rarely goes through UPDATE and DELETE operations.

Take full advantage of nested and repeated fields in denormalized tables.

Denormalization

The conventional method of denormalizing data involves writing a fact, along with all its dimensions, into a flat table structure. For example, for sales transactions, you would write each fact to a record, along with the accompanying dimensions, such as order and customer information.

In contrast, the preferred method for denormalizing data takes advantage of BigQuery's native support for nested and repeated structures in JSON or Avro input data. Expressing records using nested and repeated structures can provide a more natural representation of the underlying data. In the case of the sales order, the outer part of a JSON structure contains the order and customer information, and the inner part of the structure contains the individual line items of the order, which are represented as nested, repeated elements.

"orderID": "ORDER", "custID": "EMAIL", "custName": "NAME", "timestamp": "TIME", "location": "LOCATION", "purchasedItems": [ { "sku": "SKU", "description": "DESCRIPTION", "quantity": "QTY", "price": "PRICE" }, { "sku": "SKU", "description": "DESCRIPTION", "quantity": "QTY", "price": "PRICE" }

Expressing records by using nested and repeated fields simplifies data load using JSON or Avro files. After you've created such a schema, you can perform SELECT, INSERT, UPDATE, and DELETE operations on any individual fields using a dot notation, for example, Order.Item.SKU. For examples, see the BigQuery documentation (/bigquery/docs/data).
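
For example, the following query aggregates the nested line items of the order schema shown above by using dot notation (a minimal sketch, assuming the data is loaded into a hypothetical orders table with numeric quantity and price fields):

SELECT
  item.sku AS sku,
  SUM(item.quantity * item.price) AS revenue  -- line-item fields referenced with dot notation
FROM `my-project.sales_mart.orders` AS o,
  UNNEST(o.purchasedItems) AS item            -- iterate over the repeated field
GROUP BY sku
ORDER BY revenue DESC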

Advantages of denormalization

BigQuery is essentially an analytical engine. It supports DML (https://wikipedia.org/wiki/Data_manipulation_language) actions, but it isn't meant to be used as an online transaction processing (OLTP) store. The Handling change (#handling_change) section provides guidelines for dealing with changes while maintaining zero analysis downtime and delivering optimal online analytical processing (OLAP) performance.

While normalized or partially normalized data structures, such as star schema or snowflake, are suitable for update/delete operations, they aren't optimal for OLAP workloads. When performing OLAP operations on normalized tables, multiple tables have to be JOINed to perform the required aggregations. JOINs (/bigquery/query-reference#joins) are possible with BigQuery and sometimes recommended on small tables. However, they are typically not as performant as denormalized structures.

The following graph compares query performance using JOINs to simple filters in relation to table size. Query performance shows a much steeper decay in the presence of JOINs.

Disadvantages of denormalization

Denormalized schemas aren't storage-optimal, but BigQuery's low cost of storage addresses concerns about storage inefficiency. You can contrast costs against gains in query speed to see why storage isn't a significant factor. One challenge when you work with denormalized schemas is maintaining data integrity. Depending on the frequency of change and how widespread it is, maintaining data integrity can require increased machine time and sometimes human time for testing and verification.

Pa"itioning tables

BigQuery supports partitioning tables by date (/bigquery/docs/creating-partitioned-tables). You enable partitioning during the table-creation process. BigQuery creates new date-based partitions automatically, with no need for additional maintenance. In addition, you can specify an expiration time for data in the partitions.

New data that is inserted into a partitioned table is written to the raw partition at the time of insert. To explicitly control which partition the data is loaded to, your load job can specify a particular date partition.

Loading data

Before data can be loaded into BigQuery for analytical workloads, it is typically stored in a Cloud Storage product (/products/storage) and in a format that is native to its origin. During early stages of migration to Google Cloud, the common pattern is to use existing extract, transform, and load (ETL) tools to transform data into the ideal schema for BigQuery. After the data is transformed, it is transferred to Cloud Storage as CSV, JSON, or Avro files, and from there loaded into BigQuery by using load jobs (/bigquery/loading-data) or streaming (/bigquery/streaming-data-into-bigquery). Alternatively, you can transfer files to Cloud Storage in the schema that is native to the existing on-premises data storage, load them into a set of staging tables in BigQuery, and then transform them into the ideal schema for BigQuery by using BigQuery SQL commands. These two approaches are visualized here:
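
As an illustration of the second approach, after raw files are loaded into a staging table, a SQL statement can reshape the data into the target schema (a minimal sketch; the staging and target table names and columns are hypothetical):

INSERT INTO `my-project.sales_mart.fact_orders`
  (order_id, customer_id, order_date, amount)
SELECT
  order_id,
  customer_id,
  PARSE_DATE('%Y%m%d', order_date_raw),  -- convert the source system's date format
  CAST(amount_raw AS FLOAT64)            -- cast the raw string to a numeric type
FROM `my-project.staging.orders_raw`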

As you expand your footprint in Google Cloud, you will probably capture your source data directly in Bigtable (/bigtable), Datastore (/datastore), or Cloud Spanner (/spanner) and use Dataflow (/dataflow) to ETL data into BigQuery in batches or streams.

Using load jobs

This section assumes that your data is in Cloud Storage as a collection of files in a supported file format. For more information about each data format, as well as specific requirements and features to consider when choosing a format, see BigQuery data formats (/bigquery/data-formats).

In addition to CSV, you can also use data files with delimiters other than commas by using the --field_delimiter flag. For details, see bq load flags (/bigquery/bq-command-line-tool#bq-load-flags).

BigQuery supports loading gzip compressed (/bigquery/preparing-data-for-loading#loading_compressed_and_uncompressed_data) files. However, loading compressed files isn't as fast as loading uncompressed files. For time-sensitive scenarios or scenarios in which transferring uncompressed files to Cloud Storage is bandwidth- or time-constrained, conduct a quick loading test to see which alternative works best.

Because load jobs are asynchronous, you don't need to maintain a client connection while the job is being executed. More importantly, load jobs don't affect your other BigQuery resources. A load job creates a destination table if one doesn't already exist.

BigQuery determines the data schema as follows:

If your data is in Avro format, which is self-describing, BigQuery can determine the schema directly.

If the data is in JSON or CSV format, BigQuery can auto-detect the schema (/bigquery/bq-command-line-tool#autodetect), but manual verification (/bigquery/bq-command-line-tool#autodetect) is recommended.

You can specify a schema explicitly by passing the schema as an argument to the load job (/bigquery/docs/loading-data-cloud-storage). Ongoing load jobs can append to the same table using the same procedure as the initial load, but do not require the schema to be passed with each job.

If your CSV files always contain a header row that needs to be ignored after the initial load and table creation, you can use the --skip_leading_rows flag to ignore the row. For details, see bq load flags (/bigquery/bq-command-line-tool#bq-load-flags).

BigQuery sets daily limits on the number and size of load jobs that you can perform per project and per table. In addition, BigQuery sets limits on the sizes of individual load files and records. For details, see Quota policy (/bigquery/quota-policy#import).

Because tables are set to a hard limit of 1,000 load jobs per day, micro-batching isn't advised. To achieve efficient high-volume or real-time loading of data, use streaming inserts in place of micro-batching.

You can launch load jobs through the BigQuery web UI. To automate the process, you can set up a Cloud Function (/functions) to listen to a Cloud Storage event (/functions/docs/calling/storage) that is associated with new files arriving in a given bucket and launch the BigQuery load job.

Using streaming inserts

As an alternative and complementary approach, you can stream data directly into BigQuery. Streamed data is made available immediately and can be queried alongside existing table data in real time.

For situations that can benefit from real-time information, such as fraud detection or monitoring system metrics, streaming can be a significant differentiator. However, unlike load jobs, which are free in BigQuery, there is a charge for streaming data. Therefore, it's important to use streaming in situations where the benefits outweigh the costs.

When you stream data to BigQuery tables, you send your records directly to BigQuery (/bigquery/streaming-data-into-bigquery) by using the BigQuery API. If you use Cloud Logging, you can also stream your Google Cloud project's logs directly into BigQuery (/logging/docs/export/configure_export), including request logs from App Engine and custom log information sent to Cloud Logging.

Handling change

Many data warehouses operate under strict Service Level Agreements (SLAs), demanding little to no downtime. While Google handles BigQuery's uptime, you control the availability and responsiveness of your datasets with your approach to reflecting change in the data.

All table modifications in BigQuery are ACID (https://wikipedia.org/wiki/ACID) compliant. This applies to DML operations, queries with destination tables, and load jobs. A table that goes through inserts, updates, and deletes while serving user queries handles the concurrency gracefully and transitions from one state to the next in an atomic fashion. Therefore, modifying a table doesn't require downtime. However, your internal process might require a testing and validation phase before making newly refreshed data available for analysis. Also, because DML operations compete with analytical workloads for slots, you might prefer to isolate them. For these reasons, you might introduce downtime. This article uses the term "analysis downtime" to avoid confusion with BigQuery service downtime.

You can apply most of the old and proven techniques for handling analysis downtime. This section expands on some of the known challenges and remedies.

Using BigQuery as an OLTP store is considered an anti-pattern. Because OLTP stores have a high volume of updates and deletes, they are a mismatch for the data warehouse use case. To decide which storage option best fits your use case, review the Cloud storage products (/products/storage) table.

Sliding time window

A traditional data warehouse, unlike a data lake, retains data only for a fixed amount of time, for example, the last 5 years. On each update cycle, new data is added to the warehouse and the oldest data rolls off, keeping the duration fixed. For the most part, this concept was employed to work around the limitations of older technologies.

BigQuery is built for scale and can scale out as the size of the warehouse grows, so there is no need to delete older data. By keeping the entire history, you can deliver more insight on your business. If the storage cost is a concern, you can take advantage of BigQuery's long term storage pricing (/bigquery/pricing#long-term-storage) by archiving older data and using it for special analysis when the need arises. If you still have good reasons for dropping older data, you can use BigQuery's native support for date-partitioned tables (/bigquery/docs/creating-partitioned-tables) and partition expiration (/bigquery/docs/managing-partitioned-tables#partition-expiration). In other words, BigQuery can automatically delete older data.

Changing schemas

While a data warehouse is designed and developed, it is typical to tweak table schemas by adding, updating, or dropping columns or even adding or dropping whole tables. Unless the change is in the form of an added column or table, it could break saved queries and reports that reference a deleted table, a renamed column, and so on.

After the data warehouse is in production, such changes go through strict change control. You might decide to handle minor schema changes during an analysis downtime, but for the most part, schema changes are rolled out as scheduled version upgrades. You design, develop, and test the upgrade in parallel while the previous version of the data warehouse is serving the analysis workloads. You follow the same approach in applying schema changes to a BigQuery data warehouse.

Slowly changing dimensions

A normalized data schema minimizes the impact of Slowly Changing Dimensions (SCD) (https://wikipedia.org/wiki/Slowly_changing_dimension) by isolating the change in the dimension tables. It is generally favorable over a denormalized schema, where SCD can cause widespread updates to the flat fact table. However, as discussed in the schema design section, use normalization carefully for BigQuery.

When it comes to SCD, there is no one-size-fits-all solution. It is important to understand the nature of the change and apply the most relevant solution or combinations of solutions to your problem. The remainder of this section outlines a few solutions and how to apply them to SCD types.

It is important to address slowly changing dimensions in the context of the ideal schema for BigQuery. Often you must sacrifice efficient SCD handling in exchange for optimized query performance, or the opposite.

Technique 1: view switching

This technique is based on two views of the data: "main" vs. "shadow". The trick is to hide the actual table and expose the "main" view to the users. On update cycles, the "shadow" view is created or updated and goes through data correctness tests while the users work against the "main" view. At switchover time, the "main" view is swapped with "shadow". The old "main" (now "shadow") could be torn down until the next update cycle or kept around for some workflows, depending on the rules and processes defined by the organization.

The two views could be based on a common table and differentiated by a column, for example, "view_type", or based on distinct tables. The former method is not recommended, because DML operations against the "shadow" view of the table could slow down user queries against the "main" view without offering any real benefits.
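
The switchover itself can be a single statement when the views are backed by distinct tables, by repointing the exposed "main" view at the newly validated table (a minimal sketch, assuming view DDL is available in your environment; all names are hypothetical):

-- Point the "main" view at the freshly tested shadow table
CREATE OR REPLACE VIEW `my-project.sales_mart.orders_main` AS
SELECT *
FROM `my-project.sales_mart.orders_shadow_20170918`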

While view switching offers zero analysis downtime, it has a higher cost because during the update cycle, two copies of the data exist. More importantly, if update cycles happen at a higher rate than 90 days, this approach could prevent your organization from taking advantage of long-term storage pricing (/bigquery/pricing#long-term-storage). Ninety days is based on the pricing policy at the time of this writing. Be sure to check the latest policy.

Sometimes different segments of data change at their own pace. For instance, sales data in North America is updated on a daily basis, while data for Asia Pacific is updated on a biweekly basis. In such situations, it is best to partition the table based on the driving factor for the change, Country in this example. View switching is then applied to the impacted partitions and not the entire data warehouse. At the time of this writing, you can only partition based on a custom data attribute, such as Country, by explicitly splitting the data into multiple tables.

Technique 2: in-place partition loading

When the change in data can be isolated by a partition and brief analysis downtime is tolerated, view switching might be overkill. Instead, data for the affected partitions can be staged in other BigQuery tables or exported to files in Cloud Storage, where they can be replaced during analysis downtime.

To replace data in a target partition with data from a query of another table:

bq query --use_legacy_sql=false --replace \
--destination_table 'flight_data.fact_flights_part$20140910' \
'select * from `ods.load_flights_20140910`'

To replace data in a target partition by loading from Cloud Storage:

bq load --replace \
--source_format=NEWLINE_DELIMITED_JSON 'flight_data.fact_flights_part$20140910' \
gs://{bucket}/load_flights_20140910.json

Technique 3: update data masking

A small and frequently changing dimension is a prime candidate for normalization. In this technique, updates to such a dimension are staged in an isolated table or view that is conditionally joined with the rest of the data:

SELECT
  f.order_id as order_id,
  f.customer_id as customer_id,
  IFNULL(u.customer_first_name, f.customer_first_name) as customer_first_name,
  IFNULL(u.customer_last_name, f.customer_last_name) as customer_last_name
FROM fact_table f
LEFT OUTER JOIN pending_customer_updates u
ON f.customer_id = u.customer_id

SCD Type 1: overwrite attribute value

Type 1 SCD overwrites the value of an attribute with new data without maintaining the history. For example, if the product "awesome moisturizer cream" was part of the "health and beauty" category and is now categorized as "cosmetics", the change looks like this:

Before:

PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY
123    | ABC    | awesome moisturizer cream - 100 oz | health and beauty

After:

PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY
123    | ABC    | awesome moisturizer cream - 100 oz | cosmetics

If the attribute is in a normalized dimension table, the change is very isolated. You simply update the impacted row in the dimension table. For smaller dimension tables with frequent type 1 updates, use Technique 3: update data masking (#technique_3).
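
For the normalized case, the type 1 change is a single DML statement (a minimal sketch; the dimension table name is hypothetical):

UPDATE `my-project.sales_mart.dim_product`
SET prd_category = 'cosmetics'  -- overwrite the attribute; no history is kept
WHERE prd_id = 'ABC'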

If the attribute is embedded in the fact table in a denormalized fashion, the change is rather widespread. You will have to update all fact rows where the attribute is repeated. In this case, use either Technique 2: in-place partition loading (#technique_2), or Technique 1: view switching. (#technique_1)

SCD Type 2: change attribute value and maintain history

This method tracks unlimited historical data by creating multiple records for a given natural key (https://wikipedia.org/wiki/Natural_key) with separate surrogate keys (https://wikipedia.org/wiki/Surrogate_key). For example, the same change that is illustrated in SCD type 1 would be handled as below:

Before:

PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY      | START_DATE  | END_DATE
123    | ABC    | awesome moisturizer cream - 100 oz | health and beauty | 31-Jan-2009 | NULL

After:

PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY      | START_DATE  | END_DATE
123    | ABC    | awesome moisturizer cream - 100 oz | health and beauty | 31-Jan-2009 | 18-Jul-2017
124    | ABC    | awesome moisturizer cream - 100 oz | cosmetics         | 19-Jul-2017 | NULL

If the attribute is in a normalized dimension table, the change is isolated. You simply update the previous row and add a new one in the dimension table. For smaller dimension tables with frequent type 1 updates, use Technique 3: update data masking (#technique_3).
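
A minimal sketch of that two-step type 2 change in a normalized dimension table (the table name and surrogate-key value are hypothetical):

-- Close out the current version of the dimension row
UPDATE `my-project.sales_mart.dim_product`
SET end_date = DATE '2017-07-18'
WHERE prd_id = 'ABC' AND end_date IS NULL

-- Insert the new version with a new surrogate key
INSERT INTO `my-project.sales_mart.dim_product`
  (prd_sk, prd_id, prd_desc, prd_category, start_date, end_date)
VALUES
  (124, 'ABC', 'awesome moisturizer cream - 100 oz', 'cosmetics', DATE '2017-07-19', NULL)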

If the attribute is embedded in the fact table in a denormalized fashion, the situation can be more favorable, as long as you don't maintain explicit start and end dates for the value and instead rely on the transaction dates. Because the previous value remains true for the date and time the previous transactions occurred, you don't need to change previous fact table rows. The fact table would look like this:

TRANSACTION_DATE | PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY      | UNITS | AMOUNT
18-Jul-2017      | 123    | ABC    | awesome moisturizer cream - 100 oz | health and beauty | 2     | 25.16
19-Jul-2017      | 124    | ABC    | awesome moisturizer cream - 100 oz | cosmetics         | 1     | 13.50

Querying data

BigQuery supports standard SQL queries and is compatible with ANSI SQL 2011. BigQuery's SQL reference (/bigquery/sql-reference) provides a comprehensive description of all functions, operators, and regex capabilities that are supported.

Prior to supporting standard SQL, BigQuery supported an alternate SQL version that is now referred to as Legacy SQL. We recommend using standard SQL in your queries. For more information, see Enabling standard SQL (/bigquery/sql-reference/enabling-standard-sql).

Because BigQuery supports nested and repeated fields as part of the data model, its SQL support has been extended to specifically support these field types. For example, using the GitHub public dataset (/bigquery/public-data/github), you could use the UNNEST (/bigquery/sql-reference/query-syntax#unnest) operator, which lets you iterate over a repeated field:

SELECT
  name,
  count(1) as num_repos
FROM
  `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
ORDER BY num_repos DESC
LIMIT 10

Interactive queries

The BigQuery web UI allows interactive querying of datasets and provides a consolidated view of datasets across projects that you have access to. The console also provides several useful features such as saving and sharing ad-hoc queries, tuning and editing historical queries, exploring tables and schemas, and gathering table metadata. Refer to the BigQuery web UI (/bigquery/bigquery-web-ui) for more details.

Automated queries

It is a common practice to automate the execution of queries based on a schedule or event and to cache the results for later consumption.

If you are using Airflow to orchestrate other automated activities and are already familiar with the tool, use the Apache Airflow API for BigQuery (http://airflow.incubator.apache.org/integration.html?highlight=bigquery#gcp) for this purpose. This blog post (/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow) walks you through the process of installing Airflow and creating a workflow against BigQuery.

For simpler orchestrations, you can rely on cron jobs. This blog post (/blog/big-data/2017/04/how-to-build-a-bi-dashboard-using-google-data-studio-and-bigquery) shows you how to encapsulate a query as an App Engine app and run it as a scheduled cron job.

Query optimization

Each time BigQuery executes a query, it executes a full-column scan. BigQuery doesn't use or support indexes. Because BigQuery performance and query costs are based on the amount of data scanned during a query, design your queries so that they reference only the columns that are relevant to the query. When using date-partitioned tables, ensure that only the relevant partitions are scanned. You can achieve this by using partition filters based on _PARTITIONTIME or _PARTITIONDATE (/bigquery/docs/querying-partitioned-tables).
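
For example, the following query scans only the columns it needs and only a single day's partition (a minimal sketch against a hypothetical ingestion-time partitioned table):

SELECT
  customer_id,
  SUM(amount) AS daily_total
FROM `my-project.sales_mart.fact_orders_part`
WHERE _PARTITIONTIME = TIMESTAMP('2017-09-10')  -- prune the scan to one partition
GROUP BY customer_id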

To understand the performance characteristics after a query executes, take a look at the detailed query plan explanation (/bigquery/query-plan-explanation). The explanation breaks down the stages that the query went through, the number of input/output rows handled at each stage, and the timing profile within each stage. Using the results from the explanation can help you understand and optimize your queries.

External sources

You can run queries on data that exists outside of BigQuery by using federated data sources (/bigquery/federated-data-sources), but this approach has performance implications. Use federated data sources only if the data must be maintained externally. You can also use query federation to perform ETL from an external source to BigQuery. This approach allows you to define ETL using familiar SQL syntax.

User-defined functions

BigQuery also supports user-defined functions (/bigquery/user-defined-functions) (UDFs) for queries that exceed the complexity of SQL. UDFs allow you to extend the built-in SQL functions; they take a list of values, which can be arrays or structs, and return a single value, which can also be an array or struct. UDFs are written in JavaScript and can include external resources, such as encryption or other libraries.
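
A minimal sketch of an inline JavaScript UDF in standard SQL (the function and the sample input are hypothetical):

CREATE TEMPORARY FUNCTION apply_discount(price FLOAT64, pct FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  return price * (1 - pct);  // arbitrary JavaScript runs here
""";

SELECT apply_discount(price, 0.1) AS discounted_price
FROM UNNEST([10.0, 25.5, 99.99]) AS price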

Query sharing

BigQuery allows collaborators to save and share queries between team members. This feature can be especially useful in data exploration exercises or as a means of coming up to speed on a new dataset or query pattern. For more information, see Saving and sharing queries (/bigquery/docs/saving-sharing-queries).

Analyzing data

This section presents various ways that you can connect to BigQuery and analyze the data. To take full advantage of BigQuery as an analytical engine, you should store the data in BigQuery storage. However, your specific use case might benefit from analyzing external sources either by themselves or JOINed with data in BigQuery storage.

Off-the-shelf tools

Google Data Studio (/data-studio), available in beta at the time of this writing, as well as many partner tools (/bigquery/partners) that are already integrated with BigQuery, can be used to draw analytics from BigQuery and build sophisticated interactive data visualizations.

If you find yourself in a situation where you have to choose a tool, you can find a comprehensive vendor comparison in Gartner's magic quadrant report (https://www.gartner.com/doc/3611117/magic-quadrant-business-intelligence-analytics) and the G2 score report (https://www.g2crowd.com/grid_report/documents/grid-for-business-intelligence-platforms-winter-2017) by G2 Crowd. Gartner's report can be obtained from many of our partner sites, such as Tableau (https://www.tableau.com/asset/2017-gartner-magic-quadrant).

Custom development

To build custom applications and platforms on top of BigQuery, you can use client libraries (/bigquery/client-libraries), which are available for most common programming languages, or you can use BigQuery's REST API (/bigquery/docs/reference/rest/v2) directly.

For a concrete example, refer to this tutorial (/solutions/bokeh-and-bigquery-dashboards), which uses Python libraries to connect to BigQuery and generate custom interactive dashboards.

All the methods for connecting to BigQuery essentially provide a wrapper around BigQuery's REST API. All connections to the BigQuery API are encrypted by using HTTPS, and permissions are enforced by using IAM policies (/bigquery/docs/access-control).

Third-pa"y connectors

To connect to BigQuery from an application that isn't natively integrated with BigQuery at the API level, you can use the BigQuery JDBC and ODBC drivers (/bigquery/partners/simba-beta-drivers). The drivers provide a bridge to interact with BigQuery for legacy applications or applications that cannot be easily modified, such as Microsoft Excel (/blog/big-data/2016/11/how-to-connect-bigquery-to-microsoft-excel-and-other-apps-with-our-new-odbc-driver). Although ODBC and JDBC support interacting with BigQuery using SQL, the drivers aren't as expressive as dealing with the API directly.

Costs

Most data warehouses serve multiple business entities within the organization. A common challenge is to analyze cost of operation per business entity. For guidance on slicing your bill and attributing cost to consumption, see Visualize Google Cloud billing using BigQuery and Data Studio (https://medium.com/google-cloud/visualize-gcp-billing-using-bigquery-and-data-studio-d3e695f90c08).

There are three primary cost dimensions for BigQuery: loading, storage, and query costs. This section discusses each dimension in detail.

Storing data

Storage pricing is prorated per MB, per second.

If a table hasn't been edited for 90 consecutive days, it is categorized as long-term storage and the price of storage for that table automatically drops by 50 percent to $0.01 per GB per month. There is no degradation of performance, durability, availability, or any other functionality when a table is considered long-term storage. When the data in a table is modified, BigQuery resets the timer on the table, and any data in the table returns to the normal storage price. Actions that don't directly manipulate the data, such as querying and creating views, don't reset the timer.

For more details, see BigQuery storage pricing (/bigquery/pricing#storage).

Loading data

You can load data into BigQuery by using a conventional load job, at no charge. After data is loaded, you pay for the storage as discussed above.

Streaming inserts are charged based on the amount of data that is being streamed. For details, see costs of streaming inserts listed under BigQuery storage pricing (/bigquery/pricing#storage).

Querying data

For queries, BigQuery offers two pricing models: on-demand and flat-rate.

In a multi-project situation where data is hosted in one project and made available for queries to users of other projects, the cost of storage and streaming is incurred in the hosting project, but the cost of queries is incurred in the project where the query is issued from.

On-demand pricing

In the on-demand model, BigQuery charges for the amount of data accessed during query execution. Because BigQuery uses a columnar storage format, only the columns relevant to your query are accessed. If you only run reports on a weekly or monthly basis, and you've performed queries on less than 1 TB of your data, you might find the cost of queries on your bill is very low. For more details on how queries are charged, see BigQuery query pricing (/bigquery/pricing#queries).

To help determine how much data any given query is going to scan beforehand, you can use the query validator in the web UI. In the case of custom development, you can set the dryRun flag in the API request so that BigQuery doesn't run the job and instead returns statistics about the job, such as how many bytes would be processed. Refer to the query API (/bigquery/docs/reference/rest/v2/jobs/query) for more details.

Flat-rate pricing

Customers who prefer more consistency of monthly expenses can choose to enable flat-rate pricing. To learn more, see BigQuery flat-rate pricing (/bigquery/pricing#flat_rate_pricing).

What's next?

BigQuery how-tos (/bigquery/docs/how-to)

BigQuery public datasets (/bigquery/public-data)

BigQuery on the Google Cloud podcast (https://www.gcppodcast.com/categories/bigquery)

BigQuery on Stack Overflow (http://stackoverflow.com/questions/tagged/google-bigquery)

r/bigquery on Reddit (https://www.reddit.com/r/bigquery)
