
Google-provided streaming templates

Google provides a set of open-source (https://github.com/GoogleCloudPlatform/DataflowTemplates) Dataflow templates. For general information about templates, see the Overview (/dataflow/docs/guides/templates/overview) page. For a list of all Google-provided templates, see the Get started with Google-provided templates (/dataflow/docs/guides/templates/provided-templates) page.

This page documents streaming templates:

- Pub/Sub Subscription to BigQuery (/dataflow/docs/guides/templates/provided-streaming#cloudpubsubsubscriptiontobigquery)
- Pub/Sub Topic to BigQuery (/dataflow/docs/guides/templates/provided-streaming#cloudpubsubtobigquery)
- Pub/Sub Avro to BigQuery (#cloudpubsubavrotobigquery)
- Pub/Sub to Pub/Sub (#cloudpubsubtocloudpubsub)
- Pub/Sub to Splunk (#cloudpubsubtosplunk)
- Pub/Sub to Cloud Storage Avro (#cloudpubsubtoavro)
- Pub/Sub to Cloud Storage Text (#cloudpubsubtogcstext)
- Pub/Sub to MongoDB (#cloudpubsubtomongodb)
- Cloud Storage Text to BigQuery (Stream) (#gcstexttobigquerystream)
- Cloud Storage Text to Pub/Sub (Stream) (#gcstexttocloudpubsubstream)
- Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery (Stream) (#dlptexttobigquerystreaming)
- Change Data Capture to BigQuery (Stream) (#change-data-capture)
- Apache Kafka to BigQuery (#kafka-to-bigquery)

Pub/Sub Subscription to BigQuery

Note: This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub Subscription to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub subscription and writes them to a BigQuery table. You can use the template as a quick solution to move Pub/Sub data to BigQuery. The template reads JSON-formatted messages from Pub/Sub and converts them to BigQuery elements.

Requirements for this pipeline:

- The Pub/Sub messages must be in JSON format, described here (https://developers.google.com/api-client-library/java/google-http-java-client/json). For example, messages formatted as {"k1":"v1", "k2":"v2"} may be inserted into a BigQuery table with two columns, named k1 and k2, with string data type.
- The output table must exist prior to running the pipeline. A sketch of a table-creation command follows this section.

Template parameters

Parameter          Description
inputSubscription  The Pub/Sub input subscription to read from, in the format of projects/<project>/subscriptions/<subscription>.
outputTableSpec    The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table>
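Because the output table must exist with a schema matching the incoming JSON before the pipeline runs, you can create it up front with the bq command-line tool. A minimal sketch matching the two-field example message above; the project, dataset, and table names here are hypothetical placeholders:

bq mk --table my-project:my_dataset.my_table k1:STRING,k2:STRING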
Running the Pub/Sub Subscription to BigQuery template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.
   Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub Subscription to BigQuery template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
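The GCLOUD tab of the original page is not reproduced in this excerpt. As a hedged sketch of the equivalent command-line invocation, assuming the template is staged at the public gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery path (replace JOB_NAME, REGION, and the parameter values with your own):

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
    --region REGION \
    --parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME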
Template source code

Java (https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java)

/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package com.google.cloud.teleport.templates;

import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQueryInsertError;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToTableRow;
import com.google.cloud.teleport.templates.common.ErrorConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.cloud.teleport.util.ResourceUtils;
import com.google.cloud.teleport.util.ValueProviderUtils;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.common.collect.ImmutableList;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToBigQuery} pipeline is a streaming pipeline which ingests data in JSON format
 * from Cloud Pub/Sub, executes a UDF, and outputs the resulting records to BigQuery. Any errors
 * which occur in the transformation of the data or execution of the UDF will be output to a
 * separate errors table in BigQuery. The errors table will be created if it does not exist prior
 * to execution. Both output and error tables are specified by the user as template parameters.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The Pub/Sub topic exists.
 *   <li>The BigQuery output table exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 *                  from a Pub/Sub Subscription or a Pub/Sub Topic.
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER} \
 * --useSubscription=${USE_SUBSCRIPTION} \
 * "
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * # Execute a pipeline to read from a Topic.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 *
 * # Execute a pipeline to read from a Subscription.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 * </pre>
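The excerpt above covers the license header, imports, and class javadoc; the class body continues in the linked source file. As a hedged illustration of the core pattern the javadoc describes (read JSON messages from a subscription, convert them to TableRow elements, append them to an existing table), here is a minimal, self-contained Beam pipeline. It is a sketch, not the template itself: it omits the template's UDF support and dead-letter output, and the subscription and table names are placeholders.

import com.google.api.services.bigquery.model.TableRow;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class MinimalPubSubToBigQuery {

  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true); // Pub/Sub is an unbounded source.

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read raw JSON strings from the subscription (placeholder name).
        .apply(
            "ReadPubSubSubscription",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
        // Decode each JSON message into a BigQuery TableRow. A message that is
        // not valid JSON throws here; the real template routes such failures
        // to a dead-letter table instead of failing the bundle.
        .apply(
            "JsonToTableRow",
            ParDo.of(
                new DoFn<String, TableRow>() {
                  @ProcessElement
                  public void processElement(ProcessContext context) throws IOException {
                    try (InputStream in =
                        new ByteArrayInputStream(
                            context.element().getBytes(StandardCharsets.UTF_8))) {
                      context.output(TableRowJsonCoder.of().decode(in));
                    }
                  }
                }))
        // Append rows to a table that must already exist (placeholder spec);
        // CREATE_NEVER mirrors the template's requirement that the output
        // table exists before the pipeline runs.
        .apply(
            "WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}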