Google-Provided Streaming Templates | Cloud Dataflow | Google Cloud

8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/

Google-provided streaming templates

Google provides a set of open-source (https://github.com/GoogleCloudPlatform/DataflowTemplates) Dataflow templates. For general information about templates, see the Overview (/dataflow/docs/guides/templates/overview) page. For a list of all Google-provided templates, see the Get started with Google-provided templates (/dataflow/docs/guides/templates/provided-templates) page.

This page documents streaming templates:

  Pub/Sub Subscription to BigQuery (/dataflow/docs/guides/templates/provided-streaming#cloudpubsubsubscriptiontobigquery)
  Pub/Sub Topic to BigQuery (/dataflow/docs/guides/templates/provided-streaming#cloudpubsubtobigquery)
  Pub/Sub Avro to BigQuery (#cloudpubsubavrotobigquery)
  Pub/Sub to Pub/Sub (#cloudpubsubtocloudpubsub)
  Pub/Sub to Splunk (#cloudpubsubtosplunk)
  Pub/Sub to Cloud Storage Avro (#cloudpubsubtoavro)
  Pub/Sub to Cloud Storage Text (#cloudpubsubtogcstext)
  Pub/Sub to MongoDB (#cloudpubsubtomongodb)
  Cloud Storage Text to BigQuery (Stream) (#gcstexttobigquerystream)
  Cloud Storage Text to Pub/Sub (Stream) (#gcstexttocloudpubsubstream)
  Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery (Stream) (#dlptexttobigquerystreaming)
  Change Data Capture to BigQuery (Stream) (#change-data-capture)
  Apache Kafka to BigQuery (#kafka-to-bigquery)

Pub/Sub Subscription to BigQuery

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub Subscription to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub subscription, converts them to BigQuery elements, and writes them to a BigQuery table. You can use the template as a quick solution to move Pub/Sub data to BigQuery.

Requirements for this pipeline:

  The Pub/Sub messages must be in JSON format, described here (https://developers.google.com/api-client-library/java/google-http-java-client/json). For example, messages formatted as {"k1":"v1", "k2":"v2"} may be inserted into a BigQuery table with two columns, named k1 and k2, with string data type.
  The output table must exist prior to running the pipeline.

Template parameters

  Parameter          Description
  inputSubscription  The Pub/Sub input subscription to read from, in the format of projects/<project>/subscriptions/<subscription>.
  outputTableSpec    The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table>

Running the Pub/Sub Subscription to BigQuery template

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page (https://console.cloud.google.com/dataflow) in the Cloud Console.
2. Click Create job from template.
3. Select the Pub/Sub Subscription to BigQuery template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.

Template source code

Java (…flowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java)

/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package com.google.cloud.teleport.templates;

import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQue

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToT
import com.google.cloud.teleport.templates.common.ErrorConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Failsafe
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Javascri
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.cloud.teleport.util.ResourceUtils;
import com.google.cloud.teleport.util.ValueProviderUtils;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.common.collect.ImmutableList;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToBigQuery} pipeline is a streaming pipeline which ingests data
 * from Cloud Pub/Sub, executes a UDF, and outputs the resulting records to BigQuery
 * which occur in the transformation of the data or execution of the UDF will be out
 * separate errors table in BigQuery. The errors table will be created if it does no
 * execution. Both output and error tables are specified by the user as template par
 *
 * Pipeline Requirements
 *
 * <ul>
 * <li>The Pub/Sub topic exists.
 * <li>The BigQuery output table exists.
 * </ul>
 *
 * Example Usage
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 * from a Pub/Sub Subscription or a Pub/Sub Topic.
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}
 * --useSubscription=${USE_SUBSCRIPTION}
 * "
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * # Execute a pipeline to read from a Topic.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 *
 * # Execute a pipeline to read from a Subscription.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 * </pre>
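
Step 4 of the console instructions requires the job name to match [a-z]([-a-z0-9]{0,38}[a-z0-9])?. The following is a minimal sketch of how that rule could be checked locally before submitting a job. The class name JobNameCheck and the method isValidJobName are hypothetical names chosen for illustration; they are not part of the template or of any Google library. Only the regular expression itself comes from this page.

```java
import java.util.regex.Pattern;

public class JobNameCheck {
    // Pattern quoted in step 4 of the console instructions.
    private static final Pattern JOB_NAME =
        Pattern.compile("[a-z]([-a-z0-9]{0,38}[a-z0-9])?");

    /** Returns true when the entire name matches the job-name pattern. */
    public static boolean isValidJobName(String name) {
        // matches() anchors the pattern to the whole input, so no ^ or $ is needed.
        return name != null && JOB_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidJobName("pubsub-to-bigquery-20200823")); // true
        System.out.println(isValidJobName("PubSub_To_BigQuery"));          // false
        System.out.println(isValidJobName("ends-with-hyphen-"));           // false
    }
}
```

Read this way, the pattern allows 1 to 40 characters: a lowercase letter first, then lowercase letters, digits, or hyphens, with a letter or digit in the last position.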

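
The template parameters table gives two formats: projects/<project>/subscriptions/<subscription> for inputSubscription and <my-project>:<my-dataset>.<my-table> for outputTableSpec. The sketch below shows one way to check the shape of both values and join them into the comma-separated string that the usage example passes to --parameters. The class TemplateParams and its methods are hypothetical. The patterns check only the overall shape; they are an assumption, not the naming rules that Pub/Sub or BigQuery actually enforce (for instance, a project ID that itself contains a colon would be rejected here).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateParams {
    // Shape only (assumption): projects/<project>/subscriptions/<subscription>
    private static final Pattern SUBSCRIPTION =
        Pattern.compile("projects/([^/]+)/subscriptions/([^/]+)");

    // Shape only (assumption): <my-project>:<my-dataset>.<my-table>
    private static final Pattern TABLE_SPEC =
        Pattern.compile("([^:]+):([^.:]+)\\.([^.:]+)");

    /** Returns {project, subscription} or throws if the shape is wrong. */
    public static String[] parseSubscription(String value) {
        Matcher m = SUBSCRIPTION.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a subscription path: " + value);
        }
        return new String[] {m.group(1), m.group(2)};
    }

    /** Returns {project, dataset, table} or throws if the shape is wrong. */
    public static String[] parseTableSpec(String value) {
        Matcher m = TABLE_SPEC.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a table spec: " + value);
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    /** Validates both values, then builds the comma-separated parameter string. */
    public static String toParametersArg(String subscription, String tableSpec) {
        parseSubscription(subscription);
        parseTableSpec(tableSpec);
        return "inputSubscription=" + subscription + ",outputTableSpec=" + tableSpec;
    }

    public static void main(String[] args) {
        // Placeholder resource names, not real resources.
        System.out.println(toParametersArg(
            "projects/my-project/subscriptions/my-sub",
            "my-project:my_dataset.my_table"));
    }
}
```

Failing fast on a malformed value locally is cheaper than discovering it after the streaming job has started, which is the only reason for the validation step in this sketch.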