Querying Cloud Storage data

BigQuery supports querying Cloud Storage data in the following formats:

Comma-separated values (CSV)

JSON (newline-delimited)

Avro

ORC

Parquet

Datastore exports

Firestore exports

BigQuery supports querying Cloud Storage data from these storage classes (/storage/docs/storage-classes):

Standard

Nearline

Coldline

Archive

To query a Cloud Storage external data source, provide the Cloud Storage URI (#gcs-uri) path to your data and create a table that references the data source. The table used to reference the Cloud Storage data source can be a permanent table or a temporary table (#table-types).

Be sure to consider the location (/bigquery/external-data-sources#data-locations) of your dataset and Cloud Storage bucket when you query data stored in Cloud Storage.

Retrieving the Cloud Storage URI

To create an external table using a Cloud Storage data source, you must provide the Cloud Storage URI.

The Cloud Storage URI comprises your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the URI would be gs://mybucket/myfile.csv. If your data is separated into multiple files, you can use a wildcard in the URI. For more information, see Cloud Storage Request URIs (https://cloud.google.com/storage/docs/xml-api/reference-uris).

BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following source URI, though valid in Cloud Storage, does not work in BigQuery: gs://bucket/my//object//name.

To retrieve the Cloud Storage URI:

1. Open the Cloud Storage console.

Cloud Storage console (https://console.cloud.google.com/storage/browser)

2. Browse to the location of the object (file) that contains the source data.

3. At the top of the Cloud Storage console, note the path to the object. To compose the URI, replace gs://bucket/file with the appropriate path, for example, gs://mybucket/myfile.json. bucket is the Cloud Storage bucket name and file is the name of the object (file) containing the data.

You can also use the gsutil ls (/storage/docs/gsutil/commands/ls) command to list buckets or objects.
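
If you prefer to script this step, the sketch below uses the Python client library for Cloud Storage to list objects and print their gs:// URIs; the bucket name and prefix are hypothetical placeholders.

from google.cloud import storage

# List objects in a bucket and print their gs:// URIs.
# "mybucket" and the "sales/" prefix are hypothetical placeholders.
client = storage.Client()
for blob in client.list_blobs("mybucket", prefix="sales/"):
    print(f"gs://mybucket/{blob.name}")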

Permanent versus temporary external tables

You can query an external data source in BigQuery by using a permanent table or a temporary table. A permanent table is a table that is created in a dataset and is linked to your external data source. Because the table is permanent, you can use access controls (/bigquery/docs/access-control) to share the table with others who also have access to the underlying external data source, and you can query the table at any time.

When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Querying Cloud Storage data using permanent external tables

Required permissions and scopes

When you query external data in Cloud Storage using a permanent table, you need permissions to run a query job at the project level or higher, you need permissions that allow you to create a table that points to the external data, and you need permissions that allow you to access the table. When your external data is stored in Cloud Storage, you also need permissions to access the data in the Cloud Storage bucket.

BigQuery permissions

At a minimum, the following permissions are required to create and query an external table in BigQuery.

bigquery.tables.create

bigquery.tables.getData

bigquery.jobs.create

The following predefined IAM roles include both bigquery.tables.create and bigquery.tables.getData permissions:

bigquery.dataEditor

bigquery.dataOwner

bigquery.admin

The following predefined IAM roles include bigquery.jobs.create permissions:

bigquery.user

bigquery.jobUser

bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create external tables in the dataset, but bigquery.jobs.create permissions are still required to query the data.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions (/bigquery/docs/access-control).

Cloud Storage permissions

In order to query external data in a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard (#wildcard-support), you must also have storage.objects.list permissions.

The predefined IAM role storage.objectViewer (/storage/docs/access-control/iam) can be granted to provide both storage.objects.get and storage.objects.list permissions.
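
As an illustration, one way to grant that role on a bucket with the Python client library for Cloud Storage might look like the following sketch; the bucket name and member are assumptions, not values from this page.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("mybucket")  # hypothetical bucket name

# Grant storage.objectViewer to a user; this role carries both
# storage.objects.get and storage.objects.list on the bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"user:analyst@example.com"}}
)
bucket.set_iam_policy(policy)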

Scopes for Compute Engine instances

When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google Cloud products, including Cloud Storage. Applications running on the VM use the service account attached to the instance to call Google Cloud APIs.

If you set up a Compute Engine instance to run as the default Compute Engine service account (/compute/docs/access/create-enable-service-accounts-for-instances), and that service account accesses an external table linked to a Cloud Storage data source, the instance requires read-only access to Cloud Storage. The default Compute Engine service account is automatically granted the https://www.googleapis.com/auth/devstorage.read_only scope. If you create your own service account, apply the Cloud Storage read-only scope to the instance.

For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance (/compute/docs/access/create-enable-service-accounts-for-instances#changeserviceaccountandscopes). For more information on Compute Engine service accounts, see Service accounts (/compute/docs/access/service-accounts).
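
As a quick sanity check from inside a running instance, you can ask the metadata server which scopes the attached service account holds; this sketch assumes the requests package is available on the VM.

import requests

# Query the Compute Engine metadata server for the scopes granted to
# the default service account attached to this instance.
resp = requests.get(
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/scopes",
    headers={"Metadata-Flavor": "Google"},
)
print(resp.text)  # should include .../auth/devstorage.read_only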

Creating and querying a permanent external table

You can create a permanent table linked to your external data source by:

Using the Cloud Console or the classic BigQuery web UI

Using the bq command-line tool's mk command

Creating an ExternalDataConfiguration (/bigquery/docs/reference/rest/v2/tables#externaldataconfiguration) when you use the tables.insert (/bigquery/docs/reference/rest/v2/tables/insert) API method

Using the client libraries

To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. The data is not stored in the BigQuery table. Because the table is permanent, you can use access controls (/bigquery/docs/access-control) to share the table with others who also have access to the underlying external data source.

There are three ways to specify schema information when you create a permanent external table in BigQuery:

If you are using the tables.insert (/bigquery/docs/reference/rest/v2/tables/insert) API method to create a permanent external table, you create a table resource that includes a schema definition and an ExternalDataConfiguration (/bigquery/docs/reference/rest/v2/tables#externaldataconfiguration). Set the autodetect parameter to true to enable schema auto-detection (/bigquery/docs/schema-detect) for supported data sources.

If you are using the bq command-line tool to create a permanent external table, you can use a table definition file (/bigquery/external-table-definition), you can create and use your own schema file, or you can enter the schema inline with the bq tool. When you create a table definition file, you can enable schema auto-detection (/bigquery/docs/schema-detect) for supported data sources.

If you are using the console or the classic BigQuery web UI to create a permanent external table, you can enter the table schema manually or use schema auto-detection (/bigquery/docs/schema-detect) for supported data sources.
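
To illustrate the API-based approach, the following sketch uses the Python client library to create a permanent external table over a CSV file; the project, dataset, table, and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Describe the external data source: a CSV file in Cloud Storage.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://mybucket/sales.csv"]  # hypothetical URI
external_config.autodetect = True  # enable schema auto-detection

# Create a permanent table linked to the external data; the data
# itself is not stored in BigQuery.
table = bigquery.Table("myproject.mydataset.sales")  # hypothetical table ID
table.external_data_configuration = external_config
client.create_table(table)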

To create an external table:

1. Open the BigQuery web UI in the Cloud Console. Go to the Cloud Console (https://console.cloud.google.com/bigquery)

2. In the navigation panel, in the Resources section, expand your project and select a dataset.

3. Click Create table on the right side of the window.

4. On the Create table page, in the Source section:

For Create table from, select Cloud Storage.

In the Select file from Cloud Storage bucket field, browse for the file/Cloud Storage bucket, or enter the Cloud Storage URI (#gcs-uri). Note that you cannot include multiple URIs in the Cloud Console, but wildcards (/bigquery/docs/loading-data-cloud-storage#load-wildcards) are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.

For File format, select the format of your data. Valid formats for external Cloud Storage data include:

Comma-separated values (CSV)

JSON (newline delimited)

Avro

Datastore backup (also used for Firestore)

5. On the Create table page, in the Destination section:

For Dataset name, choose the appropriate dataset.

Verify that Table type is set to External table.

In the Table name field, enter the name of the table you're creating in BigQuery.

6. In the Schema section, enter the schema (/bigquery/docs/schemas) definition.

For JSON or CSV files, you can check the Auto-detect option to enable schema auto-detection (/bigquery/docs/schema-detect). Auto-detect is not available for Datastore exports, Firestore exports, and Avro files. Schema information for these file types is automatically retrieved from the self-describing source data.

For CSV and JSON files, you can enter schema information manually by:

Enabling Edit as text and entering the table schema as a JSON array. Note: You can view the schema of an existing table in JSON format by entering the following command in the bq command-line tool: bq show --format=prettyjson dataset.table.

Using Add field to manually input the schema.

7. Click Create table.

After the permanent table is created, you can run a query against the table as if it were a native BigQuery table. After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets.
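
For example, a sketch of querying the external table with the Python client library and saving the results as a table might look like this; all names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Query the permanent external table like any native table and write
# the results to a destination table.
job_config = bigquery.QueryJobConfig(
    destination="myproject.mydataset.sales_results"  # hypothetical table ID
)
query_job = client.query(
    "SELECT Region, Total_sales FROM `myproject.mydataset.sales`",
    job_config=job_config,
)
query_job.result()  # wait for the query job to finish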

Querying Cloud Storage data using temporary tables

To query an external data source without creating a permanent table, you run a command to combine:

A table definition file (/bigquery/external-table-definition) with a query

An inline schema definition with a query

A JSON schema definition file with a query

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table. Querying an external data source using a temporary table is supported by the bq command-line tool and the API.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Required permissions

When you query external data in Cloud Storage using a temporary table, you need permissions to run a query job at the project level or higher, and you need access to the dataset that contains the table that points to the external data. When you query data in Cloud Storage, you also need permissions to access the bucket that contains your data.

BigQuery permissions

At a minimum, the following permissions are required to query an external table in BigQuery using a temporary table.

bigquery.tables.getData

bigquery.jobs.create

The following predefined IAM roles include bigquery.tables.getData permissions:

bigquery.dataEditor

bigquery.dataOwner

bigquery.admin

The following predefined IAM roles include bigquery.jobs.create permissions:

bigquery.user

bigquery.jobUser

bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create and access external tables in the dataset, but bigquery.jobs.create permissions are still required to query the data.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions (/bigquery/docs/access-control).

Cloud Storage permissions

In order to query external data in a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard (#wildcard-support), you must also have storage.objects.list permissions.

The predefined IAM role storage.objectViewer (/storage/docs/access-control/iam) can be granted to provide both storage.objects.get and storage.objects.list permissions.

Creating and querying a temporary table

You can create and query a temporary table linked to an external data source by using the bq command-line tool, the API, or the client libraries.

You query a temporary table linked to an external data source using the bq query command with the --external_table_definition flag. When you use the bq command-line tool to query a temporary table linked to an external data source, you can identify the table's schema using:

A table definition file (/bigquery/external-table-definition) (stored on your local machine)

An inline schema denition

A JSON schema file (stored on your local machine)

(Optional) Supply the --location flag and set the value to your location (/bigquery/docs/locations).

To query a temporary table linked to your external data source using a table definition file, enter the following command.

bq --location=location query \
--external_table_definition=table::definition_file \
'query'

Where:

location is the name of your location (/bigquery/docs/locations). The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file (/bigquery/docs/bq-command-line-tool#setting_default_values_for_command-line_flags).

table is the name of the temporary table you're creating.

definition_file is the path to the table definition file (/bigquery/external-table-definition) on your local machine.

query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales using a table definition file named sales_def.

bq query \
--external_table_definition=sales::sales_def \
'SELECT Region, Total_sales FROM sales'
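
For context, a table definition file is a JSON description of the external data source, and such a file can be generated with the bq mkdef command. A minimal, hypothetical sales_def for CSV data might look like the following; the schema and URI are assumptions.

{
  "sourceFormat": "CSV",
  "sourceUris": ["gs://mybucket/sales.csv"],
  "schema": {
    "fields": [
      {"name": "Region", "type": "STRING"},
      {"name": "Quarter", "type": "STRING"},
      {"name": "Total_sales", "type": "INTEGER"}
    ]
  }
}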

To query a temporary table linked to your external data source using an inline schema definition, enter the following command.

bq --location=location query \
--external_table_definition=table::schema@source_format=Cloud Storage URI \
'query'

Where:

location is the name of your location (/bigquery/docs/locations). The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file (/bigquery/docs/bq-command-line-tool#setting_default_values_for_command-line_flags).

table is the name of the temporary table you're creating.

schema is the inline schema definition in the format field:data_type,field:data_type.

source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP (DATASTORE_BACKUP is also used for Firestore).

Cloud Storage URI is your Cloud Storage URI (#gcs-uri).

query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage with the following schema definition: Region:STRING,Quarter:STRING,Total_sales:INTEGER.

bq query \
--external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=gs://mybucket/sales.csv \
'SELECT Region, Total_sales FROM sales'

To query a temporary table linked to your external data source using a JSON schema file, enter the following command.

bq --location=location query \
--external_table_definition=table::schema_file@source_format=Cloud Storage URI \
'query'

Where:

location is the name of your location (/bigquery/docs/locations). The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file (/bigquery/docs/bq-command-line-tool#setting_default_values_for_command-line_flags).

table is the name of the temporary table you're creating.

schema_file is the path to the JSON schema file on your local machine.

source_format is CSV, NEWLINE_DELIMITED_JSON, AVRO, or DATASTORE_BACKUP (DATASTORE_BACKUP is also used for Firestore).

Cloud Storage URI is your Cloud Storage URI (#gcs-uri).

query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named sales linked to a CSV file stored in Cloud Storage using the /tmp/sales_schema.json schema file.

bq query \
--external_table_definition=sales::/tmp/sales_schema.json@CSV=gs://mybucket/sales.csv \
'SELECT Region, Total_sales FROM sales'
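
The equivalent temporary-table query through the Python client library might look like the following sketch; the schema and URI are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Define the external source and its schema inline.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://mybucket/sales.csv"]  # hypothetical URI
external_config.schema = [
    bigquery.SchemaField("Region", "STRING"),
    bigquery.SchemaField("Quarter", "STRING"),
    bigquery.SchemaField("Total_sales", "INTEGER"),
]

# Attach the definition under the name "sales" for this query only;
# no table is created in any BigQuery dataset.
job_config = bigquery.QueryJobConfig(table_definitions={"sales": external_config})
for row in client.query("SELECT Region, Total_sales FROM sales", job_config=job_config):
    print(row["Region"], row["Total_sales"])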

Querying externally partitioned data

See instructions for querying externally partitioned Cloud Storage data (/bigquery/docs/hive-partitioned-queries-gcs).

Wildcard support for Cloud Storage URIs

If your Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard (/bigquery/external-table-definition#wildcard-support) in the URI in the table definition file. You can also use a wildcard when you create an external table without using a table definition file.

To add a wildcard to the Cloud Storage URI, you append an asterisk (*) to the base-name. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the wildcard URI is gs://mybucket/fed-sample*. You can then use this wildcard URI in the Cloud Console, the classic web UI, the bq command-line tool, the API, or the client libraries.

On some platforms, when using the bq command-line tool, you might need to escape the asterisk.

You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
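
For example, a wildcard URI can be supplied anywhere a single-file URI is accepted, as in this Python sketch; the bucket, file, and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# One wildcard matches both fed-sample000001.csv and fed-sample000002.csv.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://mybucket/fed-sample*"]
external_config.autodetect = True

table = bigquery.Table("myproject.mydataset.fed_sample")  # hypothetical table ID
table.external_data_configuration = external_config
client.create_table(table)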

For Google Datastore exports, only one URI can be specified, and it must end with .backup_info or .export_metadata.

The asterisk wildcard character isn't allowed when you do the following:

Create external tables linked to Datastore or Firestore exports.

Load Datastore or Firestore export data from Cloud Storage.

The _FILE_NAME pseudo column

Tables based on external data sources provide a pseudo column named _FILE_NAME. This column contains the fully qualified path to the file to which the row belongs. This column is available only for tables that reference external data stored in Cloud Storage and Google Drive.

The _FILE_NAME column name is reserved, which means that you cannot create a column by that name in any of your tables. To select the value of _FILE_NAME, you must use an alias. The following example query demonstrates selecting _FILE_NAME by assigning the alias fn to the pseudo column.

bq query \
--project_id=project_id \
--use_legacy_sql=false \
'SELECT
  name,
  _FILE_NAME AS fn
FROM
  `dataset.table_name`
WHERE
  name LIKE "%Alex%"'

Where:

project_id is a valid project ID (this flag is not required if you use Cloud Shell or if you set a default project in the Cloud SDK)

dataset is the name of the dataset that stores the permanent external table

table_name is the name of the permanent external table
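
The same query through the Python client library might look like this sketch; the project, dataset, and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project="project_id")  # hypothetical project ID

# _FILE_NAME must be selected through an alias; "fn" is the alias here.
sql = """
    SELECT name, _FILE_NAME AS fn
    FROM `dataset.table_name`
    WHERE name LIKE '%Alex%'
"""
for row in client.query(sql):
    print(row["name"], row["fn"])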

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0). For details, see the Site Policies (https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2020-07-28 UTC.
