8/23/2020 Introduction to loading data | BigQuery | Google Cloud
Introduction to loading data
This page provides an overview of loading data into BigQuery.
Overview
There are many situations where you can query data without loading it (#alternatives_to_loading_data). For all other situations, you must first load your data into BigQuery before you can run queries.
To load data into BigQuery, you can:
Load a set of data records from Cloud Storage (/bigquery/docs/loading-data-cloud-storage) or from a local file (/bigquery/docs/loading-data-local). The records can be in Avro, CSV, JSON (newline delimited only), ORC, or Parquet format.
Export data from Datastore (/datastore/docs) or Firestore (/firestore/docs) and load the exported data into BigQuery.
Load data from other Google services (#loading_data_from_other_google_services), such as Google Ad Manager and Google Ads.
Stream data one record at a time using streaming inserts (/bigquery/streaming-data-into-bigquery).
Write data from a Dataflow pipeline to BigQuery.
Use DML (/bigquery/docs/reference/standard-sql/data-manipulation-language) statements to perform bulk inserts. Note that BigQuery charges for DML queries. See Data Manipulation Language pricing (/bigquery/pricing#dml).
Loading data into BigQuery from Drive is not currently supported, but you can query data in Drive by using an external table (/bigquery/external-data-drive).
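As an illustration of the DML option above, a bulk insert is an ordinary standard SQL statement run as a query; the dataset and table names below are invented for the sketch:

```python
# Illustrative bulk-insert DML statement (dataset and table names invented).
# Running this as a BigQuery query copies matching rows from a staging
# table into a target table in a single billed DML job.
query = """
INSERT INTO mydataset.target_table (name, score)
SELECT name, score
FROM mydataset.staging_table
WHERE score > 0
"""
```

Because this runs as a query job, it is billed under DML pricing rather than the free load-job path.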
You can load data into a new table or partition, append data to an existing table or partition, or overwrite a table or partition. For more information about working with partitions, see Managing partitioned tables (/bigquery/docs/managing-partitioned-tables). When your data is loaded into BigQuery, it is converted into columnar format for Capacitor
(/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format) (BigQuery's storage format).
Limitations
Loading data into BigQuery is subject to some limitations, depending on the location and format of the source data:
Limitations on loading local files (/bigquery/docs/loading-data-local#limitations)
Limitations on loading data from Cloud Storage (/bigquery/docs/loading-data-cloud-storage#limitations)
CSV limitations (/bigquery/docs/loading-data-cloud-storage-csv#limitations)
JSON limitations (/bigquery/docs/loading-data-cloud-storage-json#limitations)
Datastore export limitations (/bigquery/docs/loading-data-cloud-datastore#limitations)
Firestore export limitations (/bigquery/docs/loading-data-cloud-firestore#limitations)
Limitations on nested and repeated data (/bigquery/docs/nested-repeated#limitations)
Choosing a data ingestion format
When you are loading data, choose a data ingestion format based upon the following factors:
Schema support.
Avro, ORC, Parquet, Datastore exports, and Firestore exports are self-describing formats. BigQuery creates the table schema automatically based on the source data. For JSON and CSV data, you can provide an explicit schema, or you can use schema auto-detection (/bigquery/docs/schema-detect).
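For example, an explicit schema for CSV data can be written as a JSON file of field definitions, the format that the bq command-line tool accepts with its --schema flag; the field names and types here are illustrative:

```python
import json

# Explicit schema for a CSV load, as a list of field definitions.
# Field names, types, and modes below are invented for illustration.
schema = [
    {"name": "name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "age", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "signup_date", "type": "DATE", "mode": "NULLABLE"},
]

# Write the schema file that a load job can reference.
with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```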
Flat data or nested and repeated fields.
Avro, CSV, JSON, ORC, and Parquet all support flat data. Avro, JSON, ORC, Parquet, Datastore exports, and Firestore exports also support data with nested and repeated fields. Nested and repeated data is useful for expressing hierarchical data. Nested and
repeated fields also reduce duplication when denormalizing the data (#loading_denormalized_nested_and_repeated_data).
Embedded newlines.
When you are loading data from JSON files, the rows must be newline delimited. BigQuery expects newline-delimited JSON files to contain a single record per line.
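As a sketch, newline-delimited JSON can be produced from in-memory records by serializing each record onto its own line; the record contents here are made up for illustration:

```python
import json

# Sample records (invented for illustration).
records = [
    {"name": "alice", "score": 10},
    {"name": "bob", "score": 7},
]

# One complete JSON object per line: no enclosing array, no commas
# between records. This is the shape BigQuery expects for JSON loads.
ndjson = "\n".join(json.dumps(r) for r in records)

with open("records.json", "w") as f:
    f.write(ndjson + "\n")
```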
Encoding.
BigQuery supports UTF-8 encoding for both flat data and data with nested or repeated fields. BigQuery supports ISO-8859-1 encoding only for flat data, and only in CSV files.
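A quick sketch of the difference between the two encodings, using Python's standard codecs (the sample string is invented):

```python
# A flat CSV row containing non-ASCII characters (invented sample data).
row = "München,Straße\n"

# UTF-8 uses two bytes for each of these non-ASCII characters;
# ISO-8859-1 (Latin-1) uses one byte per character.
utf8_bytes = row.encode("utf-8")
latin1_bytes = row.encode("iso-8859-1")
```

Either byte stream is loadable as flat CSV, but only UTF-8 works for every format and for nested or repeated data.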
External limitations.
Your data might come from a document store database that natively stores data in JSON format. Or, your data might come from a source that only exports in CSV format.
Loading compressed and uncompressed data
The Avro binary format is the preferred format for loading both compressed and uncompressed data. Avro data is faster to load because the data can be read in parallel, even when the data blocks are compressed. Compressed Avro files are not supported, but compressed data blocks are. BigQuery supports the DEFLATE and Snappy codecs for compressed data blocks in Avro files.
Parquet binary format is also a good choice because Parquet's efficient, per-column encoding typically results in a better compression ratio and smaller files. Parquet files also leverage compression techniques that allow files to be loaded in parallel. Compressed Parquet files are not supported, but compressed data blocks are. BigQuery supports the Snappy, GZip, and LZO_1X codecs for compressed data blocks in Parquet files.
The ORC binary format offers benefits similar to those of the Parquet format. Data in ORC files is fast to load because data stripes can be read in parallel. The rows in each data stripe are loaded sequentially. To optimize load time, use a data stripe size of approximately 256 MB or less. Compressed ORC files are not supported, but compressed file footers and stripes are. BigQuery supports Zlib, Snappy, LZO, and LZ4 compression for ORC file footers and stripes.
For other data formats such as CSV and JSON, BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel.
Because uncompressed files are larger, using them can lead to bandwidth limitations and higher Cloud Storage costs for data staged in Cloud Storage prior to being loaded into BigQuery. Keep in mind that line ordering isn't guaranteed for compressed or uncompressed files. It's important to weigh these tradeoffs depending on your use case.
In general, if bandwidth is limited, compress your CSV and JSON files by using gzip before uploading them to Cloud Storage. Currently, when you load data into BigQuery, gzip is the only supported file compression type for CSV and JSON files. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave your files uncompressed.
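A minimal sketch of gzip-compressing a CSV file before upload, using only the Python standard library (file names and contents are illustrative):

```python
import csv
import gzip

# Sample rows (invented for illustration).
rows = [["id", "name"], ["1", "alice"], ["2", "bob"]]

# Write the CSV file first.
with open("data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Then gzip it; the .gz file is what you would upload to Cloud Storage.
with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    dst.write(src.read())
```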
Loading denormalized, nested, and repeated data
Many developers are accustomed to working with relational databases and normalized data schemas (https://en.wikipedia.org/wiki/Database_normalization). Normalization prevents duplicate data from being stored, and it provides consistency when the data is updated regularly.
BigQuery performs best when your data is denormalized. Rather than preserving a relational schema, such as a star or snowflake schema, you can improve performance by denormalizing your data and taking advantage of nested and repeated fields. Nested and repeated fields can maintain relationships without the performance impact of preserving a relational (normalized) schema.
The storage savings from using normalized data have less of an effect in modern systems. The increase in storage costs is worth the performance gains of using denormalized data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots (/bigquery/docs/slots), so that execution can be done in parallel.
To maintain relationships while denormalizing your data, you can use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.
For example, denormalizing an orders schema without using nested and repeated fields might require you to group the data by a field like order_id (when there is a one-to-many relationship). Because of the shuffling involved, grouping the data is less effective than denormalizing the data by using nested and repeated fields.
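As an illustrative sketch (the orders schema here is invented), a denormalized order can carry its line items as a nested, repeated field, while the fully flattened form duplicates the order-level columns into every row:

```python
# Denormalized order with a repeated, nested line_items field: the
# one-to-many relationship lives inside a single row, so no join or
# GROUP BY over order_id is needed at query time.
order = {
    "order_id": "A-1001",
    "customer": "alice",
    "line_items": [  # would map to a REPEATED RECORD field in the schema
        {"sku": "widget", "qty": 2, "price": 9.99},
        {"sku": "gadget", "qty": 1, "price": 24.50},
    ],
}

# The equivalent fully flattened form needs one row per line item and
# repeats the order-level columns in each row.
flattened = [
    {"order_id": order["order_id"], "customer": order["customer"], **item}
    for item in order["line_items"]
]
```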
In some circumstances, denormalizing your data and using nested and repeated fields doesn't result in increased performance. Avoid denormalization in these use cases:
You have a star schema with frequently changing dimensions.
BigQuery complements an Online Transaction Processing (OLTP) system with row-level mutation but can't replace it.
Nested and repeated fields are supported in the following data formats:
Avro
JSON (newline delimited)
ORC
Parquet
Datastore exports
Firestore exports
For information about specifying nested and repeated fields in your schema when you're loading data, see Specifying nested and repeated fields (/bigquery/docs/nested-repeated).
Loading data from other Google services
BigQuery Data Transfer Service
The BigQuery Data Transfer Service (/bigquery-transfer/docs/transfer-service-overview) automates loading data into BigQuery from these services:
Google Software as a Service (SaaS) apps
Campaign Manager (/bigquery-transfer/docs/doubleclick-campaign-transfer)
Cloud Storage (/bigquery-transfer/docs/cloud-storage-transfer)
Google Ad Manager (/bigquery-transfer/docs/doubleclick-publisher-transfer)
Google Ads (/bigquery-transfer/docs/adwords-transfer)
Google Merchant Center (/bigquery-transfer/docs/merchant-center-transfer) (beta)
Google Play (/bigquery-transfer/docs/play-transfer)
Search Ads 360 (/bigquery-transfer/docs/sa360-transfer) (beta)
YouTube Channel reports (/bigquery-transfer/docs/youtube-channel-transfer)
YouTube Content Owner reports (/bigquery-transfer/docs/youtube-content-owner-transfer)
External cloud storage providers
Amazon S3 (/bigquery-transfer/docs/s3-transfer)
Data warehouses
Teradata (/bigquery-transfer/docs/teradata-migration)
Amazon Redshift (/bigquery-transfer/docs/redshift-migration)
In addition, several third-party transfers (/bigquery-transfer/docs/third-party-transfer) are available in the Google Cloud Marketplace.
After you configure a data transfer, the BigQuery Data Transfer Service automatically schedules and manages recurring data loads from the source app into BigQuery.
Google Analytics 360
To learn how to export your session and hit data from a Google Analytics 360 reporting view into BigQuery, see BigQuery export (https://support.google.com/analytics/topic/3416089) in the Analytics Help Center.
For examples of querying Analytics data in BigQuery, see BigQuery cookbook (https://support.google.com/analytics/answer/4419694?hl=en) in the Analytics Help.
Dataflow
Dataflow (/dataflow/what-is-google-cloud-dataflow) can load data directly into BigQuery. For more information about using Dataflow to read from, and write to, BigQuery, see BigQuery I/O connector (https://beam.apache.org/documentation/io/built-in/google-bigquery) in the Apache Beam documentation.
Alternatives to loading data
You don't need to load data before running queries in the following situations:
Public datasets
Public datasets are datasets stored in BigQuery and shared with the public. For more information, see
BigQuery public datasets (/bigquery/public-data).
Shared datasets
You can share datasets stored in BigQuery. If someone has shared a dataset with you, you can run queries on that dataset without loading the data.
External data sources
You can skip the data loading process by creating a table that is based on an external data source. For information about the bene ts and limitations of this approach, see external data sources
(/bigquery/external-data-sources).
Logging files
Cloud Logging provides an option to export log files into BigQuery. See Exporting with the Logs Viewer (/logging/docs/export/configure_export) for more information.
Another alternative to loading data is to stream the data one record at a time. Streaming is typically used when you need the data to be immediately available. For information about streaming, see Streaming data into BigQuery (/bigquery/streaming-data-into-bigquery).
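As a sketch of the streaming path, each streamed row is sent to the tabledata.insertAll REST method; the request body shape below follows that API, though the helper function and sample rows are invented for illustration. Each row can carry an insertId that BigQuery uses for best-effort deduplication of retried inserts:

```python
import json
import uuid

def build_insert_all_body(rows):
    """Build a tabledata.insertAll request body for a list of row dicts.

    Illustrative helper: insertId values are generated here so that a
    retried request does not create duplicate rows.
    """
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            {"insertId": str(uuid.uuid4()), "json": row}
            for row in rows
        ],
    }

# Sample row (invented); payload is what would be POSTed to the API.
body = build_insert_all_body([{"name": "alice", "score": 10}])
payload = json.dumps(body)
```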
Quota policy
For information about the quota policy for loading data, see Load jobs (/bigquery/quotas#load_jobs) on the Quotas and limits page.
Pricing
Currently, there is no charge for loading data into BigQuery. For more information, see the pricing (/bigquery/pricing#free) page.
Next steps
To learn how to load data from Cloud Storage into BigQuery, see the documentation for your data format:
Avro (/bigquery/docs/loading-data-cloud-storage-avro)
CSV (/bigquery/docs/loading-data-cloud-storage-csv)
JSON (/bigquery/docs/loading-data-cloud-storage-json)
ORC (/bigquery/docs/loading-data-cloud-storage-orc)
Parquet (/bigquery/docs/loading-data-cloud-storage-parquet)
Datastore exports (/bigquery/docs/loading-data-cloud-datastore)
Firestore exports (/bigquery/docs/loading-data-cloud-firestore)
To learn how to load data from a local file, see Loading data from a local data source (/bigquery/docs/loading-data-local)
For information about streaming data, see Streaming data into BigQuery (/bigquery/streaming-data-into-bigquery)
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies (https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2020-06-26 UTC.