8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud
Google-provided streaming templates
Google provides a set of open-source (https://github.com/GoogleCloudPlatform/DataflowTemplates) Dataflow templates. For general information about templates, see the Overview (/dataflow/docs/guides/templates/overview) page. For a list of all Google-provided templates, see the Get started with Google-provided templates (/dataflow/docs/guides/templates/provided-templates) page.
This page documents streaming templates:
Pub/Sub Subscription to BigQuery (/dataflow/docs/guides/templates/provided-streaming#cloudpubsubsubscriptiontobigquery)
Pub/Sub Topic to BigQuery (/dataflow/docs/guides/templates/provided-streaming#cloudpubsubtobigquery)
Pub/Sub Avro to BigQuery (#cloudpubsubavrotobigquery)
Pub/Sub to Pub/Sub (#cloudpubsubtocloudpubsub)
Pub/Sub to Splunk (#cloudpubsubtosplunk)
Pub/Sub to Cloud Storage Avro (#cloudpubsubtoavro)
Pub/Sub to Cloud Storage Text (#cloudpubsubtogcstext)
Pub/Sub to MongoDB (#cloudpubsubtomongodb)
Cloud Storage Text to BigQuery (Stream) (#gcstexttobigquerystream)
Cloud Storage Text to Pub/Sub (Stream) (#gcstexttocloudpubsubstream)
Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery (Stream) (#dlptexttobigquerystreaming)
Change Data Capture to BigQuery (Stream) (#change-data-capture)
Apache Kafka to BigQuery (#kafka-to-bigquery)
Pub/Sub Subscription to BigQuery
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 1/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Pub/Sub Subscription to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub subscription, converts them to BigQuery elements, and writes them to a BigQuery table. You can use the template as a quick solution to move Pub/Sub data to BigQuery.
Requirements for this pipeline:
The Pub/Sub messages must be in JSON format, described here (https://developers.google.com/api-client-library/java/google-http-java-client/json). For example, messages formatted as {"k1":"v1", "k2":"v2"} may be inserted into a BigQuery table with two columns, named k1 and k2, with string data type.
The output table must exist prior to running the pipeline.
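As an illustration of the message shape described above, here is a minimal sketch of building a flat JSON payload in which every key is a column name and every value is a string. The class and method names are hypothetical, the encoder is hand-rolled for the example only, and a real publisher would more likely use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of the template: builds a flat JSON object
// whose keys are column names and whose values are strings.
final class FlatJsonMessage {

    // Quotes a string as a JSON string literal, escaping the characters
    // that JSON requires to be escaped.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Serializes the map in insertion order as {"k1":"v1","k2":"v2"}.
    static String toJson(Map<String, String> columns) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : columns.entrySet()) {
            joiner.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("k1", "v1");
        row.put("k2", "v2");
        System.out.println(toJson(row));
    }
}
```

The resulting string would be published as the Pub/Sub message payload. The target table is assumed to already have string columns named k1 and k2.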
Template parameters
Parameter          Description
inputSubscription  The Pub/Sub input subscription to read from, in the format of projects/
outputTableSpec    The BigQuery output table location, in the format of
Running the Pub/Sub Subscription to BigQuery template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub Subscription to BigQuery template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
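The job name rule in step 4 can be checked before you open the console. The following is a minimal sketch using only java.util.regex; the class and method names are hypothetical, and the pattern is copied from the step above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks a candidate job name against the regular
// expression quoted in the console instructions.
final class JobNameCheck {

    private static final Pattern JOB_NAME =
        Pattern.compile("[a-z]([-a-z0-9]{0,38}[a-z0-9])?");

    // matches() requires the whole string to match, so the name must start
    // with a lowercase letter, may not end with a hyphen, and can be at most
    // 40 characters long.
    static boolean isValidJobName(String name) {
        return name != null && JOB_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"pubsub-to-bigquery", "a", "My-Job", "job-", "9lives"};
        for (String name : candidates) {
            System.out.println(name + " -> " + isValidJobName(name));
        }
    }
}
```

With these sample names, only "pubsub-to-bigquery" and "a" pass. The others fail because of an uppercase letter, a trailing hyphen, and a leading digit.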
Template source code
Java
owTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java)
/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package com.google.cloud.teleport.templates;

import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQue

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToT
import com.google.cloud.teleport.templates.common.ErrorConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Failsafe
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Javascri
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.cloud.teleport.util.ResourceUtils;
import com.google.cloud.teleport.util.ValueProviderUtils;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.common.collect.ImmutableList;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * The {@link PubSubToBigQuery} pipeline is a streaming pipeline which ingests data
 * from Cloud Pub/Sub, executes a UDF, and outputs the resulting records to BigQuery
 * which occur in the transformation of the data or execution of the UDF will be out
 * separate errors table in BigQuery. The errors table will be created if it does no
 * execution. Both output and error tables are specified by the user as template par
 *
 * Pipeline Requirements
 *
 * - The Pub/Sub topic exists.
 * - The BigQuery output table exists.
 *
 * Example Usage
 *
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 * from a Pub/Sub Subscription or a Pub/Sub Topic.
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}
 * --useSubscription=${USE_SUBSCRIPTION}
 * "
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * # Execute a pipeline to read from a Topic.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 *
 * # Execute a pipeline to read from a Subscription.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 */
public class PubSubToBigQuery {
  /** The log to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToBigQuery.class);

  /** The tag for the main output for the UDF. */
  public static final TupleTag

  /** The tag for the main output of the json transformation. */
  public static final TupleTag

  /** The tag for the dead-letter output of the udf. */
  public static final TupleTag

  /** The tag for the dead-letter output of the json to table row transform. */
  public static final TupleTag

  /** The default suffix for error tables if dead letter table is not specified. */
  public static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

  /** Pubsub message/string coder for pipeline. */
  public static final FailsafeElementCoder

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder
  /**
   * The {@link Options} class provides the custom execution options passed by the e
   * command-line.
   */
  public interface Options extends PipelineOptions, JavascriptTextTransformerOptions
    @Description("Table spec to write the output to")
    ValueProvider

    void setOutputTableSpec(ValueProvider

    @Description("Pub/Sub topic to read the input from")
    ValueProvider

    void setInputTopic(ValueProvider

    @Description(
        "The Cloud Pub/Sub subscription to consume from. "
            + "The name should be in the format of "
            + "projects/

    void setInputSubscription(ValueProvider

    @Description(
        "This determines whether the template reads from "
            + "a pub/sub subscription
    @Default.Boolean(false)
    Boolean getUseSubscription();

    void setUseSubscription(Boolean value);

    @Description(
        "The dead-letter table to output to within BigQuery in
            + "format. If it doesn't exist, it will be created during pipeline execu
    ValueProvider

    void setOutputDeadletterTable(ValueProvider

  /**
   * The main entry-point for pipeline execution. This method will start the pipelin
   * wait for it's execution to finish. If blocking execution is required, use the {
   * PubSubToBigQuery#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Opti

    run(options);
  }
  /**
   * Runs the pipeline to completion with the specified options. This method does no
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()}
   * object to block until the pipeline is finished running if blocking programmatic
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    Pipeline pipeline = Pipeline.create(options);

    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

    /*
     * Steps:
     * 1) Read messages in from Pub/Sub
     * 2) Transform the PubsubMessages into TableRows
     *    - Transform message payload via UDF
     *    - Convert UDF result to TableRow objects
     * 3) Write successful records out to BigQuery
     * 4) Write failed records out to BigQuery
     */

    /*
     * Step #1: Read messages in from Pub/Sub
     * Either from a Subscription or Topic
     */

    PCollection
                  .fromSubscription(options.getInputSubscription()));
    } else {
      messages =
          pipeline.apply(
              "ReadPubSubTopic",
              PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic(
    }

    PCollectionTuple convertedTableRows =
        messages
            /*
             * Step #2: Transform the PubsubMessages into TableRows
             */
            .apply("ConvertMessageToTableRow", new PubsubMessageToTableRow(options))

    /*
     * Step #3: Write the successful records out to BigQuery
     */
    WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErr
                    .to(options.getOutputTableSpec()));

    /*
     * Step 3 Contd.
     * Elements that failed inserts into BigQuery are extracted and converted to Fai
     */
    PCollection
    /*
     * Step #4: Write records that failed table row transformation
     * or conversion out to BigQuery deadletter table.
     */
    PCollectionList.of(
            ImmutableList.of(
                convertedTableRows.get(UDF_DEADLETTER_OUT),
                convertedTableRows.get(TRANSFORM_DEADLETTER_OUT)))
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "WriteFailedRecords",
            ErrorConverters.WritePubsubMessageErrors.newBuilder()
                .setErrorRecordsTable(
                    ValueProviderUtils.maybeUseDefaultDeadletterTable(
                        options.getOutputDeadletterTable(),
                        options.getOutputTableSpec(),
                        DEFAULT_DEADLETTER_TABLE_SUFFIX))
                .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJs
                .build());

    // 5) Insert records that failed insert into deadletter table
    failedInserts.apply(
        "WriteFailedRecords",
        ErrorConverters.WriteStringMessageErrors.newBuilder()
            .setErrorRecordsTable(
                ValueProviderUtils.maybeUseDefaultDeadletterTable(
                    options.getOutputDeadletterTable(),
                    options.getOutputTableSpec(),
                    DEFAULT_DEADLETTER_TABLE_SUFFIX))
            .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson()
            .build());

    return pipeline.run();
  }

  /**
   * If deadletterTable is available, it is returned as is, otherwise outputTableSpe
   * defaultDeadLetterTableSuffix is returned instead.
   */
  private static ValueProvider
        outputTableSpec,
        new SerializableFunction

  /**
   * The {@link PubsubMessageToTableRow} class is a {@link PTransform} which transfo
   * {@link PubsubMessage} objects into {@link TableRow} objects for insertion into
   * applying an optional UDF to the input. The executions of the UDF and transforma
   * TableRow} objects is done in a fail-safe way by wrapping the element with it's
   * inside the {@link FailsafeElement} class. The {@link PubsubMessageToTableRow} t
   * output a {@link PCollectionTuple} which contains all output and dead-letter {@l
   * PCollection}.
   *
   * The {@link PCollectionTuple} output will contain the following {@link PColle
   *
   * - {@link PubSubToBigQuery#UDF_OUT} - Contains all {@link FailsafeElement} r
   *   successfully processed by the optional UDF.
   * - {@link PubSubToBigQuery#UDF_DEADLETTER_OUT} - Contains all {@link Failsaf
   *   records which failed processing during the UDF execution.
   * - {@link PubSubToBigQuery#TRANSFORM_OUT} - Contains all records successfull
   *   JSON to {@link TableRow} objects.
   * - {@link PubSubToBigQuery#TRANSFORM_DEADLETTER_OUT} - Contains all {@link F
   *   records which couldn't be converted to table rows.
   */
  static class PubsubMessageToTableRow extends PTransform

    private final Options options;

    PubsubMessageToTableRow(Options options) {
      this.options = options;
    }
    @Override
    public PCollectionTuple expand(PCollection

      PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover f
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn())
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.

      // Convert the records which were successfully processed by the UDF into Table
      PCollectionTuple jsonToTableRowOut =
          udfOut
              .get(UDF_OUT)
              .apply(
                  "JsonToTableRow",
                  FailsafeJsonToTableRow.

      // Re-wrap the PCollections so we can return a single PCollectionTuple
      return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT))
          .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT))
          .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT))
          .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_
    }
  }

  /**
   * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMes
   * {@link FailsafeElement} class so errors can be recovered from and the original
   * output to a error records table.
   */
  static class PubsubMessageToFailsafeElementFn
      extends DoFn

      PubsubMessage message = context.element();
      context.output(
          FailsafeElement.of(message, new String(message.getPayload(), StandardChars
    }
  }
}
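The listing's Javadoc says that the dead-letter table is returned as is when it is available, and that a name built from outputTableSpec and the default suffix is used otherwise. The sketch below illustrates that rule with plain strings instead of ValueProvider. It is not the template's implementation: the class and method names are hypothetical, and it assumes that "available" means non-null and non-empty and that the fallback is a simple concatenation.

```java
// Hypothetical plain-String illustration of the dead-letter table fallback.
final class DeadletterTableSketch {

    // Same value as DEFAULT_DEADLETTER_TABLE_SUFFIX in the listing.
    static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

    // Assumption: a null or empty dead-letter table counts as "not available",
    // and the fallback name is outputTableSpec followed by the suffix.
    static String resolveDeadletterTable(String deadletterTable, String outputTableSpec) {
        if (deadletterTable != null && !deadletterTable.isEmpty()) {
            return deadletterTable;
        }
        return outputTableSpec + DEFAULT_DEADLETTER_TABLE_SUFFIX;
    }

    public static void main(String[] args) {
        String output = "my-project:dataset-id.output-table";
        // No dead-letter table given: falls back to the suffixed output table.
        System.out.println(resolveDeadletterTable(null, output));
        // Explicit dead-letter table: returned unchanged.
        System.out.println(
            resolveDeadletterTable("my-project:dataset-id.deadletter-table", output));
    }
}
```

Under these assumptions, leaving outputDeadletterTable unset for an output table named output-table would send failed records to output-table_error_records in the same dataset.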
Pub/Sub Topic to BigQuery
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Pub/Sub Topic to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub topic, converts them to BigQuery elements, and writes them to a BigQuery table. You can use the template as a quick solution to move Pub/Sub data to BigQuery.
Requirements for this pipeline:
The Pub/Sub messages must be in JSON format, described here (https://developers.google.com/api-client-library/java/google-http-java-client/json). For example, messages formatted as {"k1":"v1", "k2":"v2"} may be inserted into a BigQuery table with two columns, named k1 and k2, with string data type.
The output table must exist prior to pipeline execution.
Template parameters
Parameter        Description
inputTopic       The Pub/Sub input topic to read from, in the format of projects/
outputTableSpec  The BigQuery output table location, in the format of
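The example usage in the PubSubToBigQuery Javadoc on this page passes values shaped like inputTopic=projects/${PROJECT_ID}/topics/input-topic-name and outputTableSpec=${PROJECT_ID}:dataset-id.output-table. The sketch below assembles strings of that shape. The class and method names are hypothetical, and the formats are taken from that Javadoc example rather than from the parameter table.

```java
// Hypothetical helper that builds parameter values shaped like the ones in
// the template's Javadoc example usage.
final class TemplateParameterSketch {

    // projects/<project>/topics/<topic>, as in the Javadoc example.
    static String topicPath(String projectId, String topicName) {
        return String.format("projects/%s/topics/%s", projectId, topicName);
    }

    // <project>:<dataset>.<table>, as in the Javadoc example.
    static String tableSpec(String projectId, String dataset, String table) {
        return String.format("%s:%s.%s", projectId, dataset, table);
    }

    // Comma-separated key=value pairs, as passed to --parameters in the example.
    static String parameters(String inputTopic, String outputTableSpec) {
        return "inputTopic=" + inputTopic + ",outputTableSpec=" + outputTableSpec;
    }

    public static void main(String[] args) {
        String topic = topicPath("my-project", "input-topic-name");
        String table = tableSpec("my-project", "dataset-id", "output-table");
        System.out.println(parameters(topic, table));
    }
}
```

The project, topic, dataset, and table names here are placeholders. In the console, the same values would be entered in the parameter fields rather than joined into one string.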
Running the Pub/Sub Topic to BigQuery template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub Topic to BigQuery template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
Template source code
Java
owTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java)
/* * Copyright (C) 2018 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 *
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 14/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */
package com.google.cloud.teleport.templates;
import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQue
import com.google.api.services.bigquery.model.TableRow; import com.google.cloud.teleport.coders.FailsafeElementCoder; import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToT import com.google.cloud.teleport.templates.common.ErrorConverters; import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Failsafe import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Javascri import com.google.cloud.teleport.util.DualInputNestedValueProvider; import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput; import com.google.cloud.teleport.util.ResourceUtils; import com.google.cloud.teleport.util.ValueProviderUtils; import com.google.cloud.teleport.values.FailsafeElement; import com.google.common.collect.ImmutableList; import java.nio.charset.StandardCharsets; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.coders.CoderRegistry; import org.apache.beam.sdk.coders.StringUtf8Coder; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError; import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy; import org.apache.beam.sdk.io.gcp.bigquery.WriteResult; import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder; import org.apache.beam.sdk.options.Default; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.ValueProvider; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.Flatten; import org.apache.beam.sdk.transforms.MapElements;
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 15/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud
import org.apache.beam.sdk.transforms.PTransform; import org.apache.beam.sdk.transforms.ParDo; import org.apache.beam.sdk.transforms.SerializableFunction; import org.apache.beam.sdk.values.PCollection; import org.apache.beam.sdk.values.PCollectionList; import org.apache.beam.sdk.values.PCollectionTuple; import org.apache.beam.sdk.values.TupleTag; import org.slf4j.Logger; import org.slf4j.LoggerFactory;
/** * The {@link PubSubToBigQuery} pipeline is a streaming pipeline which ingests data * from Cloud Pub/Sub, executes a UDF, and outputs the resulting records to BigQuery * which occur in the transformation of the data or execution of the UDF will be out * separate errors table in BigQuery. The errors table will be created if it does no * execution. Both output and error tables are specified by the user as template par * *
Pipeline Requirements * *
- *
- The Pub/Sub topic exists. *
- The BigQuery output table exists. *
Example Usage * *
* # Set the pipeline vars * PROJECT_ID=PROJECT ID HERE * BUCKET_NAME=BUCKET NAME HERE * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read * from a Pub/Sub Subscription or a Pub/Sub Topic. * * # Set the runner * RUNNER=DataflowRunner * * # Build the template * mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \ * -Dexec.cleanupDaemonThreads=false \ * -Dexec.args=" \ * --project=${PROJECT_ID} \ * --stagingLocation=${PIPELINE_FOLDER}/staging \ * --tempLocation=${PIPELINE_FOLDER}/temp \*/ public class PubSubToBigQuery {https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 16/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud
* --templateLocation=${PIPELINE_FOLDER}/template \ * --runner=${RUNNER} * --useSubscription=${USE_SUBSCRIPTION} * " * * # Execute the template * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"` * * # Execute a pipeline to read from a Topic. * gcloud dataflow jobs run ${JOB_NAME} \ * --gcs-location=${PIPELINE_FOLDER}/template \ * --zone=us-east1-d \ * --parameters \ * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\ * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\ * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table" * * # Execute a pipeline to read from a Subscription. * gcloud dataflow jobs run ${JOB_NAME} \ * --gcs-location=${PIPELINE_FOLDER}/template \ * --zone=us-east1-d \ * --parameters \ * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\ * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\ * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table" *
/** The log to output status messages to. */ private static final Logger LOG = LoggerFactory.getLogger(PubSubToBigQuery.class);
/** The tag for the main output for the UDF. */ public static final TupleTag
/** The tag for the main output of the json transformation. */ public static final TupleTag
/** The tag for the dead-letter output of the udf. */ public static final TupleTag
/** The tag for the dead-letter output of the json to table row transform. */ public static final TupleTag
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 17/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud
new TupleTag
/** The default suffix for error tables if dead letter table is not specified. */ public static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";
/** Pubsub message/string coder for pipeline. */ public static final FailsafeElementCoder
/** String/String Coder for FailsafeElement. */ public static final FailsafeElementCoder
/** * The {@link Options} class provides the custom execution options passed by the e * command-line. */ public interface Options extends PipelineOptions, JavascriptTextTransformerOptions @Description("Table spec to write the output to") ValueProvider
void setOutputTableSpec(ValueProvider
@Description("Pub/Sub topic to read the input from") ValueProvider
void setInputTopic(ValueProvider
@Description( "The Cloud Pub/Sub subscription to consume from. " + "The name should be in the format of " + "projects/
void setInputSubscription(ValueProvider
@Description( "This determines whether the template reads from " + "a pub/sub subscription @Default.Boolean(false) Boolean getUseSubscription();
void setUseSubscription(Boolean value);
@Description( "The dead-letter table to output to within BigQuery in https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 18/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud + "format. If it doesn't exist, it will be created during pipeline execu ValueProvider void setOutputDeadletterTable(ValueProvider /** * The main entry-point for pipeline execution. This method will start the pipelin * wait for it's execution to finish. If blocking execution is required, use the { * PubSubToBigQuery#run(Options)} method to start the pipeline and invoke {@code * result.waitUntilFinish()} on the {@link PipelineResult}. * * @param args The command-line args passed by the executor. */ public static void main(String[] args) { Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Opti run(options); } /** * Runs the pipeline to completion with the specified options. This method does no * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} * object to block until the pipeline is finished running if blocking programmatic * required. * * @param options The execution options. * @return The pipeline result. 
*/ public static PipelineResult run(Options options) { Pipeline pipeline = Pipeline.create(options); CoderRegistry coderRegistry = pipeline.getCoderRegistry(); coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER); /* * Steps: * 1) Read messages in from Pub/Sub * 2) Transform the PubsubMessages into TableRows * - Transform message payload via UDF * - Convert UDF result to TableRow objects * 3) Write successful records out to BigQuery * 4) Write failed records out to BigQuery */ https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 19/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud /* * Step #1: Read messages in from Pub/Sub * Either from a Subscription or Topic */ PCollection PCollectionTuple convertedTableRows = messages /* * Step #2: Transform the PubsubMessages into TableRows */ .apply("ConvertMessageToTableRow", new PubsubMessageToTableRow(options)) /* * Step #3: Write the successful records out to BigQuery */ WriteResult writeResult = convertedTableRows .get(TRANSFORM_OUT) .apply( "WriteSuccessfulRecords", BigQueryIO.writeTableRows() .withoutValidation() .withCreateDisposition(CreateDisposition.CREATE_NEVER) .withWriteDisposition(WriteDisposition.WRITE_APPEND) .withExtendedErrorInfo() .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS) .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErr .to(options.getOutputTableSpec())); /* https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 20/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud * Step 3 Contd. * Elements that failed inserts into BigQuery are extracted and converted to Fai */ PCollection /* * Step #4: Write records that failed table row transformation * or conversion out to BigQuery deadletter table. 
*/ PCollectionList.of( ImmutableList.of( convertedTableRows.get(UDF_DEADLETTER_OUT), convertedTableRows.get(TRANSFORM_DEADLETTER_OUT))) .apply("Flatten", Flatten.pCollections()) .apply( "WriteFailedRecords", ErrorConverters.WritePubsubMessageErrors.newBuilder() .setErrorRecordsTable( ValueProviderUtils.maybeUseDefaultDeadletterTable( options.getOutputDeadletterTable(), options.getOutputTableSpec(), DEFAULT_DEADLETTER_TABLE_SUFFIX)) .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJs .build()); // 5) Insert records that failed insert into deadletter table failedInserts.apply( "WriteFailedRecords", ErrorConverters.WriteStringMessageErrors.newBuilder() .setErrorRecordsTable( ValueProviderUtils.maybeUseDefaultDeadletterTable( options.getOutputDeadletterTable(), options.getOutputTableSpec(), DEFAULT_DEADLETTER_TABLE_SUFFIX)) .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson() .build()); return pipeline.run(); https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 21/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud } /** * If deadletterTable is available, it is returned as is, otherwise outputTableSpe * defaultDeadLetterTableSuffix is returned instead. */ private static ValueProvider /** * The {@link PubsubMessageToTableRow} class is a {@link PTransform} which transfo * {@link PubsubMessage} objects into {@link TableRow} objects for insertion into * applying an optional UDF to the input. The executions of the UDF and transforma * TableRow} objects is done in a fail-safe way by wrapping the element with it's * inside the {@link FailsafeElement} class. The {@link PubsubMessageToTableRow} t * output a {@link PCollectionTuple} which contains all output and dead-letter {@l * PCollection}. * * The {@link PCollectionTuple} output will contain the following {@link PColle * * *
   * records which couldn't be converted to table rows.
   *
   */
  static class PubsubMessageToTableRow
      extends PTransform

    private final Options options;

    PubsubMessageToTableRow(Options options) {
      this.options = options;
    }

    @Override
    public PCollectionTuple expand(PCollection

      PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover f
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn())
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.

      // Convert the records which were successfully processed by the UDF into Table
      PCollectionTuple jsonToTableRowOut =
          udfOut
              .get(UDF_OUT)
              .apply(
                  "JsonToTableRow",
                  FailsafeJsonToTableRow.

      // Re-wrap the PCollections so we can return a single PCollectionTuple
      return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT))
          .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT))
          .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT))
          .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_
    }
  }

  /**
   * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMes
   * {@link FailsafeElement} class so errors can be recovered from and the original
   * output to a error records table.
   */
  static class PubsubMessageToFailsafeElementFn
      extends DoFn
Pub/Sub Avro to BigQuery
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Pub/Sub Avro to BigQuery template is a streaming pipeline that ingests Avro data from a Pub/Sub subscription into a BigQuery table. Any errors which occur while writing to the BigQuery table are streamed into a Pub/Sub dead-letter topic.
Requirements for this pipeline
The input Pub/Sub subscription must exist.
The schema file for the Avro records must exist on Cloud Storage.
The dead-letter Pub/Sub topic must exist.
The output BigQuery dataset must exist.
Template parameters
Parameter Description
schemaPath The Cloud Storage location of the Avro schema file. For example, gs://path/to/my/schema.avsc.
inputSubscription The Pub/Sub input subscription to read from. For example, projects/
outputTopic The Pub/Sub topic to use as a dead-letter for failed records. For example, projects/
outputTableSpec The BigQuery output table location. For example,
writeDisposition (Optional) The BigQuery WriteDisposition. For example, WRITE_APPEND, WRITE_EMPTY or WRITE_TRUNCATE. Default: WRITE_APPEND
createDisposition (Optional) The BigQuery CreateDisposition. For example, CREATE_IF_NEEDED, CREATE_NEVER. Default: CREATE_IF_NEEDED
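The two optional parameters above fall back to WRITE_APPEND and CREATE_IF_NEEDED when omitted. The following is a minimal sketch of that defaulting in plain Java, not the template's own code: the class `DispositionDefaults`, its local enums, and the case-insensitive parsing are all assumptions made for illustration.

```java
import java.util.Locale;

/** Hypothetical helper (not part of the template) that resolves the optional parameters. */
final class DispositionDefaults {

  // Local stand-ins for the values listed in the parameter table.
  enum WriteDisposition { WRITE_APPEND, WRITE_EMPTY, WRITE_TRUNCATE }

  enum CreateDisposition { CREATE_IF_NEEDED, CREATE_NEVER }

  /** Returns WRITE_APPEND when the parameter is absent or blank. */
  static WriteDisposition writeDisposition(String raw) {
    if (raw == null || raw.trim().isEmpty()) {
      return WriteDisposition.WRITE_APPEND;
    }
    // Assumption: accept any letter case; unknown values throw IllegalArgumentException.
    return WriteDisposition.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  /** Returns CREATE_IF_NEEDED when the parameter is absent or blank. */
  static CreateDisposition createDisposition(String raw) {
    if (raw == null || raw.trim().isEmpty()) {
      return CreateDisposition.CREATE_IF_NEEDED;
    }
    return CreateDisposition.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(writeDisposition(null));
    System.out.println(createDisposition("CREATE_NEVER"));
  }
}
```

Rejecting unknown values early, rather than silently substituting the default, keeps a misspelled parameter from turning into an unintended append.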
Running the Pub/Sub Avro to BigQuery template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub Avro to BigQuery template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
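The job-name rule in step 4 can be checked before you open the console. The sketch below applies the regular expression quoted in that step with `java.util.regex`; the class name `JobNameCheck` is hypothetical and the check is only a local convenience, not something the console or the template exposes.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-flight check for Dataflow job names. */
final class JobNameCheck {

  // The expression quoted in step 4: a lowercase letter, then up to 38 lowercase
  // letters, digits, or hyphens, ending in a letter or digit (40 characters at most).
  private static final Pattern JOB_NAME =
      Pattern.compile("[a-z]([-a-z0-9]{0,38}[a-z0-9])?");

  /** Returns true only when the whole name matches the expression. */
  static boolean isValid(String name) {
    return name != null && JOB_NAME.matcher(name).matches();
  }

  public static void main(String[] args) {
    for (String name : new String[] {"pubsub-avro-to-bq", "Avro-Job", "job-"}) {
      System.out.println(name + " valid: " + isValid(name));
    }
  }
}
```

`matches()` is used rather than `find()` because the whole name has to satisfy the expression, not just a substring of it.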
Template source code
Java
-binary-to-bigquery/src/main/java/com/google/cloud/teleport/v2/templates/PubsubAvroToBigQuery.java)
/*
 * Copyright (C) 2020 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.cloud.teleport.v2.options.BigQueryCommonOptions.WriteOptions;
import com.google.cloud.teleport.v2.options.PubsubCommonOptions.ReadSubscriptionOpti
import com.google.cloud.teleport.v2.options.PubsubCommonOptions.WriteTopicOptions;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;

/**
 * A Dataflow pipeline to stream Apache Avro
 * Pub/Sub into a BigQuery table.
 *
 * Any persistent failures while writing to BigQuery will be written to a Pub/Sub
 * topic.
 */
public final class PubsubAvroToBigQuery {

  /**
   * Validates input flags and executes the Dataflow pipeline.
   *
   * @param args command line arguments to the pipeline
   */
  public static void main(String[] args) {
    PubsubAvroToBigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(PubsubAvroToBigQueryOptions.class);

    run(options);
  }

  /**
   * Provides custom {@link org.apache.beam.sdk.options.PipelineOptions} required to
   * {@link PubsubAvroToBigQuery} pipeline.
   */
  public interface PubsubAvroToBigQueryOptions
      extends ReadSubscriptionOptions, WriteOptions, WriteTopicOptions {

    @Description("GCS path to Avro schema file.")
    @Required
    String getSchemaPath();

    void setSchemaPath(String schemaPath);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options execution parameters to the pipeline
   * @return result of the pipeline execution as a {@link PipelineResult}
   */
  private static PipelineResult run(PubsubAvroToBigQueryOptions options) {

    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    Schema schema = SchemaUtils.getAvroSchema(options.getSchemaPath());

    WriteResult writeResults =
        pipeline
            .apply(
                "Read Avro records",
                PubsubIO
                    .readAvroGenericRecords(schema)
                    .fromSubscription(options.getInputSubscription()))

            .apply(
                "Write to BigQuery",
                BigQueryIO.

    writeResults
        .getFailedInsertsWithErr()
        .apply(
            "Create error payload",
            ErrorConverters.BigQueryInsertErrorToPubsubMessage.

    // Execute the pipeline and return the result.
    return pipeline.run();
  }
}
Pub/Sub to Pub/Sub
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Pub/Sub to Pub/Sub template is a streaming pipeline that reads messages from a Pub/Sub subscription and writes the messages to another Pub/Sub topic. The pipeline also accepts an optional message attribute key and a value that can be used to filter the messages that should be written to the Pub/Sub topic. You can use this template to copy messages from a Pub/Sub subscription to another Pub/Sub topic with an optional message filter.
Requirements for this pipeline:
The source Pub/Sub subscription must exist prior to execution.
The destination Pub/Sub topic must exist prior to execution.
Template parameters
Parameter Description
inputSubscription Pub/Sub subscription to read the input from. For example, projects/
outputTopic Cloud Pub/Sub topic to write the output to. For example, projects/
filterKey [Optional] Filter events based on an attribute key. No filters are applied if filterKey is not specified.
filterValue [Optional] Filter attribute value to use in case a filterKey is provided. A null filterValue is used by default.
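How filterKey and filterValue combine is easier to see in code. The sketch below restates, in plain Java, the decision made by ExtractAndFilterEventsFn in the template source listed later in this section: no key means no filtering, a key with a null value forwards only messages that lack the attribute, and a key with a value forwards only messages whose attribute fully matches the value as a regular expression. The class `AttributeFilterSketch` is hypothetical, and a `Map<String, String>` stands in for the Pub/Sub message attributes.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical, Beam-free restatement of the template's attribute filter. */
final class AttributeFilterSketch {

  private final String filterKey;     // null disables filtering entirely
  private final Pattern filterValue;  // null means "forward only if the attribute is absent"

  AttributeFilterSketch(String filterKey, String filterValueRegex) {
    this.filterKey = filterKey;
    this.filterValue = filterValueRegex == null ? null : Pattern.compile(filterValueRegex);
  }

  /** Returns true when a message with these attributes would be written to the output topic. */
  boolean shouldForward(Map<String, String> attributes) {
    if (filterKey == null) {
      return true; // filtering is not enabled, so every message is forwarded
    }
    String value = attributes.get(filterKey);
    if (filterValue == null) {
      return value == null;
    }
    // matches() requires the complete attribute value to match, not a substring.
    return value != null && filterValue.matcher(value).matches();
  }

  public static void main(String[] args) {
    AttributeFilterSketch filter = new AttributeFilterSketch("env", "prod|staging");
    System.out.println(filter.shouldForward(Collections.singletonMap("env", "prod")));
    System.out.println(filter.shouldForward(Collections.singletonMap("env", "production")));
  }
}
```

Because the match must cover the whole value, a filterValue of `prod` does not forward a message whose attribute is `production`; use a pattern such as `prod.*` for prefix matching.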
Running the Pub/Sub to Pub/Sub template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub to Pub/Sub template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
Template source code
Java
a owTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToPubsub.java)
/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package com.google.cloud.teleport.templates;

import static com.google.common.base.Preconditions.checkNotNull;
import static org.apache.beam.vendor.guava.v20_0.com.google.common.base.Precondition

import com.google.auto.value.AutoValue;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
import javax.annotation.Nullable;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** A template that copies messages from one Pubsub subscription to another Pubsub
public class PubsubToPubsub {

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Opti

    options.setStreaming(true);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /**
     * Steps:
     *  1) Read PubSubMessage with attributes from input PubSub subscription.
     *  2) Apply any filters if an attribute=value pair is provided.
     *  3) Write each PubSubMessage to output PubSub topic.
     */
    pipeline
        .apply(
            "Read PubSub Events",
            PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputS
        .apply(
            "Filter Events If Enabled",
            ParDo.of(
                ExtractAndFilterEventsFn.newBuilder()
                    .withFilterKey(options.getFilterKey())
                    .withFilterValue(options.getFilterValue())
                    .build()))
        .apply("Write PubSub Events", PubsubIO.writeMessages().to(options.getOutputT

    // Execute the pipeline and return the result.
    return pipeline.run();
  }

  /**
   * Options supported by {@link PubsubToPubsub}.
   *
   * Inherits standard configuration options.
   */
  public interface Options extends PipelineOptions, StreamingOptions {
    @Description(
        "The Cloud Pub/Sub subscription to consume from. "
            + "The name should be in the format of "
            + "projects/

    void setInputSubscription(ValueProvider

    @Description(
        "The Cloud Pub/Sub topic to publish to. "
            + "The name should be in the format of "
            + "projects/

    void setOutputTopic(ValueProvider

    @Description(
        "Filter events based on an optional attribute key. "
            + "No filters are applied if a filterKey is not specified.")
    @Validation.Required
    ValueProvider

    void setFilterKey(ValueProvider

    @Description(
        "Filter attribute value to use in case a filterKey is provided. Accepts a va
            + " string as a filterValue. In case a regex is provided, the complete e
            + " should match in order for the message to be filtered. Partial matche
            + " substring) will not be filtered. A null filterValue is used by defau
    @Validation.Required
    ValueProvider

    void setFilterValue(ValueProvider

  /**
   * DoFn that will determine if events are to be filtered. If filtering is enabled,
   * publish events that pass the filter else, it will publish all input events.
   */
  @AutoValue
  public abstract static class ExtractAndFilterEventsFn extends DoFn

    private static final Logger LOG = LoggerFactory.getLogger(ExtractAndFilterEvents

    // Counter tracking the number of incoming Pub/Sub messages.
    private static final Counter INPUT_COUNTER = Metrics
        .counter(ExtractAndFilterEventsFn.class, "inbound-messages");

    // Counter tracking the number of output Pub/Sub messages after the user provide
    // is applied.
    private static final Counter OUTPUT_COUNTER = Metrics
        .counter(ExtractAndFilterEventsFn.class, "filtered-outbound-messages");

    private Boolean doFilter;
    private String inputFilterKey;
    private Pattern inputFilterValueRegex;
    private Boolean isNullFilterValue;

    public static Builder newBuilder() {
      return new AutoValue_PubsubToPubsub_ExtractAndFilterEventsFn.Builder();
    }

    @Nullable
    abstract ValueProvider

    @Nullable
    abstract ValueProvider

    @Setup
    public void setup() {

      if (this.doFilter != null) {
        return; // Filter has been evaluated already
      }

      inputFilterKey = (filterKey() == null ? null : filterKey().get());

      if (inputFilterKey == null) {

        // Disable input message filtering.
        this.doFilter = false;

      } else {

        this.doFilter = true; // Enable filtering.

        String inputFilterValue = (filterValue() == null ? null : filterValue().get(

        if (inputFilterValue == null) {

          LOG.warn(
              "User provided a NULL for filterValue. Only messages with a value of N
                  + " filterKey: {} will be filtered forward",
              inputFilterKey);

          // For backward compatibility, we are allowing filtering by null filterVal
          this.isNullFilterValue = true;
          this.inputFilterValueRegex = null;
        } else {

          this.isNullFilterValue = false;
          try {
            inputFilterValueRegex = getFilterPattern(inputFilterValue);
          } catch (PatternSyntaxException e) {
            LOG.error("Invalid regex pattern for supplied filterValue: {}", inputFil
            throw new RuntimeException(e);
          }
        }

        LOG.info(
            "Enabling event filter [key: " + inputFilterKey + "][value: " + inputFil
      }
    }

    @ProcessElement
    public void processElement(ProcessContext context) {

      INPUT_COUNTER.inc();
      if (!this.doFilter) {

        // Filter is not enabled
        writeOutput(context, context.element());
      } else {

        PubsubMessage message = context.element();
        String extractedValue = message.getAttribute(this.inputFilterKey);

        if (this.isNullFilterValue) {

          if (extractedValue == null) {
            // If we are filtering for null and the extracted value is null, we forw
            // the message.
            writeOutput(context, message);
          }

        } else {

          if (extractedValue != null
              && this.inputFilterValueRegex.matcher(extractedValue).matches()) {
            // If the extracted value is not null and it matches the filter,
            // we forward the message.
            writeOutput(context, message);
          }
        }
      }
    }

    /**
     * Write a {@link PubsubMessage} and increment the output counter.
     * @param context {@link ProcessContext} to write {@link PubsubMessage} to.
     * @param message {@link PubsubMessage} output.
     */
    private void writeOutput(ProcessContext context, PubsubMessage message) {
      OUTPUT_COUNTER.inc();

      context.output(message);
    }

    /**
     * Return a {@link Pattern} based on a user provided regex string.
     *
     * @param regex Regex string to compile.
     * @return {@link Pattern}
     * @throws PatternSyntaxException If the string is an invalid regex.
     */
    private Pattern getFilterPattern(String regex) throws PatternSyntaxException {
      checkNotNull(regex, "Filter regex cannot be null.");

      return Pattern.compile(regex);
    }

    /** Builder class for {@link ExtractAndFilterEventsFn}. */
    @AutoValue.Builder
    abstract static class Builder {

      abstract Builder setFilterKey(ValueProvider

      abstract Builder setFilterValue(ValueProvider

      abstract ExtractAndFilterEventsFn build();

      /**
       * Method to set the filterKey used for filtering messages.
       *
       * @param filterKey Lookup key for the {@link PubsubMessage} attribute map.
       * @return {@link Builder}
       */
      public Builder withFilterKey(ValueProvider

      /**
       * Method to set the filterValue used for filtering messages.
       *
       * @param filterValue Lookup value for the {@link PubsubMessage} attribute map
       * @return {@link Builder}
       */
      public Builder withFilterValue(ValueProvider

Pub/Sub to Splunk
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Pub/Sub to Splunk template is a streaming pipeline that reads messages from a Pub/Sub subscription and writes the message payload to Splunk via Splunk's HTTP Event Collector (HEC). Before writing to Splunk, you can also apply a JavaScript user-defined function to the message payload. Any messages that experience processing failures are forwarded to a Pub/Sub dead-letter topic for further troubleshooting and reprocessing.
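The dead-letter behavior described above rests on one idea: keep the original payload next to any error so that a failed message can be forwarded intact for reprocessing. The following is a minimal, Beam-free sketch of that pattern. The names `DeadLetterSketch`, `Failure`, and `Result` are hypothetical, and a `java.util.function.Function` stands in for the JavaScript UDF; it is an illustration of the idea, not the template's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of routing failed messages to a dead-letter output. */
final class DeadLetterSketch {

  /** Pairs the untouched original payload with the error that stopped processing. */
  static final class Failure {
    final String originalPayload;
    final String errorMessage;

    Failure(String originalPayload, String errorMessage) {
      this.originalPayload = originalPayload;
      this.errorMessage = errorMessage;
    }
  }

  /** Holds both outputs: transformed messages and dead-letter records. */
  static final class Result {
    final List<String> succeeded = new ArrayList<>();
    final List<Failure> deadLetter = new ArrayList<>();
  }

  /** Applies the transform to each payload; a failure is recorded instead of aborting the run. */
  static Result process(List<String> payloads, Function<String, String> transform) {
    Result result = new Result();
    for (String payload : payloads) {
      try {
        result.succeeded.add(transform.apply(payload));
      } catch (RuntimeException e) {
        result.deadLetter.add(new Failure(payload, String.valueOf(e.getMessage())));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Result result =
        process(
            Arrays.asList("{\"a\":1}", ""),
            payload -> {
              if (payload.isEmpty()) {
                throw new IllegalArgumentException("empty payload");
              }
              return payload.trim();
            });
    System.out.println("succeeded=" + result.succeeded.size()
        + " deadLetter=" + result.deadLetter.size());
  }
}
```

Storing the original payload rather than a partially transformed value is what makes replaying a dead-letter record meaningful once the cause of the failure is fixed.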
As an extra layer of protection for your HEC token, you can also pass in a Cloud KMS key along with the base64-encoded HEC token parameter encrypted with the Cloud KMS key. See the Cloud KMS API encryption endpoint (/kms/docs/reference/rest/v1/projects.locations.keyRings.cryptoKeys/encrypt) for additional details on encrypting your HEC token parameter.
Requirements for this pipeline:
The source Pub/Sub subscription must exist prior to running the pipeline.
The Pub/Sub dead-letter topic must exist prior to running the pipeline.
The Splunk HEC endpoint must be accessible from the Dataflow workers' network.
The Splunk HEC token must be generated and available.
Template parameters
Parameter Description
inputSubscription The Pub/Sub subscription from which to read the input. For example, projects/
token The Splunk HEC authentication token. This base64-encoded string can be encrypted with a Cloud KMS key for additional security.
url The Splunk HEC url. This must be routable from the VPC in which the pipeline runs. For example, https://splunk-hec-host:8088.
outputDeadletterTopic The Pub/Sub topic to forward undeliverable messages. For example, projects/
javascriptTextTransformGcsPath [Optional] The Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js.
javascriptTextTransformFunctionName [Optional] The name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...} then the function name is myTransform.
batchCount [Optional] The batch size for sending multiple events to Splunk. Default 1 (no batching).
parallelism [Optional] The maximum number of parallel requests. Default 1 (no parallelism).
disableCertificateValidation [Optional] Disable SSL certificate validation. Default false (validation enabled).
includePubsubMessage [Optional] Include the full Pub/Sub message in the payload. Default false (only the data element is included in the payload).
tokenKMSEncryptionKey [Optional] The Cloud KMS key to decrypt the HEC token string. If the Cloud KMS key is provided, the HEC token string must be passed in encrypted.
Running the Pub/Sub to Splunk template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub to Splunk template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
Template source code
a owTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToSplunk.java)
/*
 * Copyright (C) 2019 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.splunk.SplunkEvent;
import com.google.cloud.teleport.splunk.SplunkEventCoder;
import com.google.cloud.teleport.splunk.SplunkIO;
import com.google.cloud.teleport.splunk.SplunkWriteError;
import com.google.cloud.teleport.templates.common.ErrorConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Failsafe
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Javascri
import com.google.cloud.teleport.templates.common.PubsubConverters.PubsubReadSubscri
import com.google.cloud.teleport.templates.common.PubsubConverters.PubsubWriteDeadle
import com.google.cloud.teleport.templates.common.SplunkConverters;
import com.google.cloud.teleport.templates.common.SplunkConverters.SplunkOptions;
import com.google.cloud.teleport.util.KMSEncryptedNestedValueProvider;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonPrimitive;
import com.google.gson.JsonSerializer;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.MoreObjects;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableLis
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToSplunk} pipeline is a streaming pipeline which ingests data fr
 * Pub/Sub, executes a UDF, converts the output to {@link SplunkEvent}s and writes t
 * into Splunk's HEC endpoint. Any errors which occur in the execution of the UDF, c
 * {@link SplunkEvent} or writing to HEC will be streamed into a Pub/Sub topic.
 *
 * Pipeline Requirements
 *
 * Example Usage
 *
 *
 *
 *
 *
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 *                  from a Pub/Sub Subscription or a Pub/Sub Topic.
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToSplunk \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template/PubSubToSplunk \
 * --runner=${RUNNER}
 * "
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-splunk-$USER-`date +"%Y%m%d-%H%M%S%z"`
 * BATCH_COUNT=1
 * PARALLELISM=5
 *
 * # Execute the templated pipeline:
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template/PubSubToSplunk \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * token=my-splunk-hec-token,\
 * url=http://splunk-hec-server-address:8088,\
 * batchCount=${BATCH_COUNT},\
 * parallelism=${PARALLELISM},\
 * disableCertificateValidation=false,\
 * outputDeadletterTopic=projects/${PROJECT_ID}/topics/deadletter-topic-name,\
 * javascriptTextTransformGcsPath=gs://${BUCKET_NAME}/splunk/js/my-js-udf.js,\
 * javascriptTextTransformFunctionName=myUdf"
 *
 */
public class PubSubToSplunk {

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder

  /** Counter to track inbound messages from source. */
  private static final Counter INPUT_MESSAGES_COUNTER =
      Metrics.counter(PubSubToSplunk.class, "inbound-pubsub-messages");
/** The tag for successful {@link SplunkEvent} conversion. */ private static final TupleTag /** The tag for failed {@link SplunkEvent} conversion. */ private static final TupleTag /** The tag for the main output for the UDF. */ private static final TupleTag /** The tag for the dead-letter output of the udf. */ private static final TupleTag /** GSON to process a {@link PubsubMessage}. */ private static final Gson GSON = new GsonBuilder() .registerTypeAdapter( byte[].class, (JsonSerializer /** Logger for class. */ private static final Logger LOG = LoggerFactory.getLogger(PubSubToSplunk.class); private static final Boolean DEFAULT_INCLUDE_PUBSUBMESSAGE = false; /** * The main entry-point for pipeline execution. This method will start the pipelin * wait for it's execution to finish. If blocking execution is required, use the { * PubSubToSplunk#run(PubSubToSplunkOptions)} method to start the pipeline and inv * result.waitUntilFinish()} on the {@link PipelineResult}. * * @param args The command-line args passed by the executor. */ public static void main(String[] args) { PubSubToSplunkOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(PubSubToSplunkOpti run(options); https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 43/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud } /** * Runs the pipeline to completion with the specified options. This method does no * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} * object to block until the pipeline is finished running if blocking programmatic * required. * * @param options The execution options. * @return The pipeline result. */ public static PipelineResult run(PubSubToSplunkOptions options) { Pipeline pipeline = Pipeline.create(options); // Register coders. 
CoderRegistry registry = pipeline.getCoderRegistry(); registry.registerCoderForClass(SplunkEvent.class, SplunkEventCoder.of()); registry.registerCoderForType( FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER); /* * Steps: * 1) Read messages in from Pub/Sub * 2) Convert message to FailsafeElement for processing. * 3) Apply user provided UDF (if any) on the input strings. * 4) Convert successfully transformed messages into SplunkEvent objects * 5) Write SplunkEvents to Splunk's HEC end point. * 5a) Wrap write failures into a FailsafeElement. * 6) Collect errors from UDF transform (#3), SplunkEvent transform (#4) * and writing to Splunk HEC (#5) and stream into a Pub/Sub deadletter topic */ // 1) Read messages in from Pub/Sub PCollection // 2) Convert message to FailsafeElement for processing. PCollectionTuple transformedOutput = stringMessages .apply( "ConvertToFailsafeElement", MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor()) https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 44/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud .via(input -> FailsafeElement.of(input, input))) // 3) Apply user provided UDF (if any) on the input strings. .apply( "ApplyUDFTransformation", FailsafeJavascriptUdf. // 4) Convert successfully transformed messages into SplunkEvent objects PCollectionTuple convertToEventTuple = transformedOutput .get(UDF_OUT) .apply( "ConvertToSplunkEvent", SplunkConverters.failsafeStringToSplunkEvent( SPLUNK_EVENT_OUT, SPLUNK_EVENT_DEADLETTER_OUT)); // 5) Write SplunkEvents to Splunk's HEC end point. PCollection // 5a) Wrap write failures into a FailsafeElement. 
    PCollection
                  @ProcessElement
                  public void processElement(ProcessContext context) {
                    SplunkWriteError error = context.element();

                    FailsafeElement

                    if (error.statusMessage() != null) {
                      failsafeElement.setErrorMessage(error.statusMessage());
                    }

                    if (error.statusCode() != null) {
                      failsafeElement.setErrorMessage(
                          String.format("Splunk write status code: %d", error.status
                    }
                    context.output(failsafeElement);
                  }
                }));

    // 6) Collect errors from UDF transform (#3), SplunkEvent transform (#4)
    //    and writing to Splunk HEC (#5) and stream into a Pub/Sub deadletter topic
    PCollectionList.of(
            ImmutableList.of(
                convertToEventTuple.get(SPLUNK_EVENT_DEADLETTER_OUT),
                wrappedSplunkWriteErrors,
                transformedOutput.get(UDF_DEADLETTER_OUT)))
        .apply("FlattenErrors", Flatten.pCollections())
        .apply(
            "WriteFailedRecords",
            ErrorConverters.WriteStringMessageErrorsToPubSub.newBuilder()
                .setErrorRecordsTopic(options.getOutputDeadletterTopic())
                .build());

    return pipeline.run();
  }

  /**
   * The {@link PubSubToSplunkOptions} class provides the custom options passed by t
   * the command line.
   */
  public interface PubSubToSplunkOptions
      extends SplunkOptions,
          PubsubReadSubscriptionOptions,
          PubsubWriteDeadletterTopicOptions,
          JavascriptTextTransformerOptions {}

  /**
   * A {@link PTransform} that reads messages from a Pub/Sub subscription, increment
   * returns a {@link PCollection} of {@link String} messages.
   */
  private static class ReadMessages extends PTransform

    ReadMessages(ValueProvider

    @Override
    public PCollection
                    @Setup
                    public void setup() {
                      if (inputIncludePubsubMessageFlag != null) {
                        includePubsubMessage = inputIncludePubsubMessageFlag.get();
                      }
                      includePubsubMessage =
                          MoreObjects.firstNonNull(
                              includePubsubMessage, DEFAULT_INCLUDE_PUBSUBMESSAGE);
                      LOG.info("includePubsubMessage set to: {}", includePubsubMessa
                    }

                    @ProcessElement
                    public void processElement(ProcessContext context) {
                      if (includePubsubMessage) {
                        context.output(GSON.toJson(context.element()));
                      } else {
                        context.output(
                            new String(context.element().getPayload(), StandardChars
                      }
                    }
                  }))
          .apply(
              "CountMessages",
              ParDo.of(
                  new DoFn

  /**
   * Utility method to decrypt a Splunk HEC token.
   *
   * @param unencryptedToken The Splunk HEC token as a Base64 encoded {@link String}
   * @param kmsKey The Cloud KMS Encryption Key to decrypt the Splunk HEC token.
   * @return Decrypted Splunk HEC token.
   */
  private static ValueProvider

Pub/Sub to Cloud Storage Avro
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Pub/Sub to Cloud Storage Avro template is a streaming pipeline that reads data from a Pub/Sub topic and writes Avro files into the specified Cloud Storage bucket. This pipeline supports an optional user-provided window duration for windowed writes.
Requirements for this pipeline:
The input Pub/Sub topic must exist prior to pipeline execution.
Template parameters
Parameter Description
inputTopic Cloud Pub/Sub topic to subscribe to for message consumption. The topic name should be in the format of projects/
outputDirectory Output directory where output Avro files will be archived. Must end with a slash. For example: gs://example-bucket/example-directory/.
avroTempDirectory Directory for temporary Avro files. Must end with a slash. For example: gs://example-bucket/example-directory/.
outputFilenamePrefix [Optional] Output filename prefix for the Avro files.
outputFilenameSuffix [Optional] Output filename suffix for the Avro files.
outputShardTemplate [Optional] The shard template of the output file. Specified as repeating sequences of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with the shard number, or number of shards respectively. Default template format is 'W-P-SS-of-NN' when this parameter is not specified.
numShards [Optional] The maximum number of output shards produced when writing. Default maximum number of shards is 1.
windowDuration [Optional] The window duration in which data will be written. Defaults to 5m. Allowed formats are: Ns (for seconds, example: 5s), Nm (for minutes, example: 12m), Nh (for hours, example: 2h).
Running the Pub/Sub to Cloud Storage Avro template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub to Cloud Storage Avro template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
Template source code
Java
ataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToAvro.java)
/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.avro.AvroPubsubMessageRecord;
import com.google.cloud.teleport.io.WindowedFilenamePolicy;
import com.google.cloud.teleport.util.DurationUtils;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;

/**
 * This pipeline ingests incoming data from a Cloud Pub/Sub topic and outputs the ra
 * windowed Avro files at the specified output directory.
 *
 * Files output will have the following schema:
 *
 *   {
 *     "type": "record",
 *     "name": "AvroPubsubMessageRecord",
 *     "namespace": "com.google.cloud.teleport.avro",
 *     "fields": [
 *       {"name": "message", "type": {"type": "array", "items": "bytes"}},
 *       {"name": "attributes", "type": {"type": "map", "values": "string"}},
 *       {"name": "timestamp", "type": "long"}
 *     ]
 *   }
 *
 * Example Usage:
 *
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.${PIPELINE_NAME} \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/stag
 * --tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
 * --runner=DataflowRunner \
 * --windowDuration=2m \
 * --numShards=1 \
 * --inputTopic=projects/${PROJECT_ID}/topics/windowed-files \
 * --outputDirectory=gs://${PROJECT_ID}/temp/ \
 * --outputFilenamePrefix=windowed-file \
 * --outputFilenameSuffix=.avro \
 * --avroTempDirectory=gs://${PROJECT_ID}/avro-temp-dir/"
 *
 */
public class PubsubToAvro {
/** * Options supported by the pipeline. * *
Inherits standard configuration options. */ public interface Options extends PipelineOptions, StreamingOptions { @Description("The Cloud Pub/Sub topic to read from.") @Required ValueProvider
void setInputTopic(ValueProvider
@Description("The directory to output files to. Must end with a slash.") @Required ValueProvider
void setOutputDirectory(ValueProvider
@Description("The filename prefix of the files to write to.") @Default.String("output") ValueProvider
void setOutputFilenamePrefix(ValueProvider
@Description("The suffix of the files to write.") @Default.String("") ValueProvider
void setOutputFilenameSuffix(ValueProvider
@Description( "The shard template of the output file. Specified as repeating sequences " + "of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with
+ "shard number, or number of shards respectively") @Default.String("W-P-SS-of-NN") ValueProvider
void setOutputShardTemplate(ValueProvider
@Description("The maximum number of output shards produced when writing.") @Default.Integer(1) Integer getNumShards();
void setNumShards(Integer value);
@Description( "The window duration in which data will be written. Defaults to 5m. " + "Allowed formats are: " + "Ns (for seconds, example: 5s), " + "Nm (for minutes, example: 12m), " + "Nh (for hours, example: 2h).") @Default.String("5m") String getWindowDuration();
void setWindowDuration(String value);
@Description("The Avro Write Temporary Directory. Must end with /") @Required ValueProvider
void setAvroTempDirectory(ValueProvider
}
/** * Main entry point for executing the pipeline. * * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Opti options.setStreaming(true);
run(options); }
/**
* Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(Options options) { // Create the pipeline Pipeline pipeline = Pipeline.create(options);
/* * Steps: * 1) Read messages from PubSub * 2) Window the messages into minute intervals specified by the executor. * 3) Output the windowed data into Avro files, one per window by default. */ pipeline .apply( "Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic()) .apply("Map to Archive", ParDo.of(new PubsubMessageToArchiveDoFn())) .apply( options.getWindowDuration() + " Window", Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindo
// Apply windowed file writes. Use a NestedValueProvider because the filenam // policy requires a resourceId generated from the input value at runtime. .apply( "Write File(s)", AvroIO.write(AvroPubsubMessageRecord.class) .to( new WindowedFilenamePolicy( options.getOutputDirectory(), options.getOutputFilenamePrefix(), options.getOutputShardTemplate(), options.getOutputFilenameSuffix())) .withTempDirectory(NestedValueProvider.of( options.getAvroTempDirectory(), (SerializableFunction
// Execute the pipeline and return the result. return pipeline.run(); }
  /**
   * Converts an incoming {@link PubsubMessage} to the {@link AvroPubsubMessageRecor
   * copying it's fields and the timestamp of the message.
   */
  static class PubsubMessageToArchiveDoFn extends DoFn

Pub/Sub to Cloud Storage Text
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Pub/Sub to Cloud Storage Text template is a streaming pipeline that reads records from Pub/Sub and saves them as a series of Cloud Storage files in text format. The template can be used as a quick way to save data in Pub/Sub for future use. By default, the template generates a new file every 5 minutes.
Requirements for this pipeline:
The Pub/Sub topic must exist prior to execution.
The messages published to the topic must be in text format.
The messages published to the topic must not contain any newlines. Note that each Pub/Sub message is saved as a single line in the output file.
Template parameters
Parameter Description
inputTopic The Pub/Sub topic to read the input from. The topic name should be in the format projects/
outputDirectory The path and filename prefix for writing output files. For example, gs://bucket-name/path/. This value must end in a slash.
outputFilenamePrefix The prefix to place on each windowed file. For example, output-
outputFilenameSuffix The suffix to place on each windowed file, typically a file extension such as .txt or .csv.
outputShardTemplate The shard template defines the dynamic portion of each windowed file. By default, the pipeline uses a single shard for output to the file system within each window. This means that all data will land into a single file per window. The outputShardTemplate defaults to W-P-SS-of-NN where W is the window date range, P is the pane info, S is the shard number, and N is the number of shards. In case of a single file, the SS-of-NN portion of the outputShardTemplate will be 00-of-01.
Running the Pub/Sub to Cloud Storage Text template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub to Cloud Storage Text template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
Template source code
Java
DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToText.java)
/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.io.WindowedFilenamePolicy;
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.cloud.teleport.util.DurationUtils;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;

/**
 * This pipeline ingests incoming data from a Cloud Pub/Sub topic and
 * outputs the raw data into windowed files at the specified output
 * directory.
 *
 * Example Usage:
 *
 * mvn compile exec:java \
 -Dexec.mainClass=com.google.cloud.teleport.templates.${PIPELINE_NAME} \
 -Dexec.cleanupDaemonThreads=false \
 -Dexec.args=" \
 --project=${PROJECT_ID} \
 --stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging
 --tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
 --runner=DataflowRunner \
 --windowDuration=2m \
 --numShards=1 \
 --inputTopic=projects/${PROJECT_ID}/topics/windowed-files \
 --userTempLocation=gs://${PROJECT_ID}/tmp/ \
 --outputDirectory=gs://${PROJECT_ID}/output/ \
 --outputFilenamePrefix=windowed-file \
 --outputFilenameSuffix=.txt"
 *
 *
/** * Options supported by the pipeline.
* *
Inherits standard configuration options.
*/ public interface Options extends PipelineOptions, StreamingOptions { @Description("The Cloud Pub/Sub topic to read from.") @Required ValueProvider@Description("The directory to output files to. Must end with a slash.") @Required ValueProvider
@Description("The directory to output temporary files to. Must end with a slash. ValueProvider
@Description("The filename prefix of the files to write to.") @Default.String("output") @Required ValueProvider
@Description("The suffix of the files to write.") @Default.String("") ValueProvider
@Description("The shard template of the output file. Specified as repeating sequ + "of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with the + "shard number, or number of shards respectively") @Default.String("W-P-SS-of-NN") ValueProvider
@Description("The maximum number of output shards produced when writing.") @Default.Integer(1) Integer getNumShards(); void setNumShards(Integer value);
@Description("The window duration in which data will be written. Defaults to 5m. + "Allowed formats are: " + "Ns (for seconds, example: 5s), " + "Nm (for minutes, example: 12m), "
+ "Nh (for hours, example: 2h).") @Default.String("5m") String getWindowDuration(); void setWindowDuration(String value); }
/** * Main entry point for executing the pipeline. * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {
Options options = PipelineOptionsFactory .fromArgs(args) .withValidation() .as(Options.class);
options.setStreaming(true);
run(options); }
/** * Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(Options options) { // Create the pipeline Pipeline pipeline = Pipeline.create(options);
/* * Steps: * 1) Read string messages from PubSub * 2) Window the messages into minute intervals specified by the executor. * 3) Output the windowed files to GCS */ pipeline .apply("Read PubSub Events", PubsubIO.readStrings().fromTopic(options.getInp .apply( options.getWindowDuration() + " Window", Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindo
// Apply windowed file writes. Use a NestedValueProvider because the filenam
// policy requires a resourceId generated from the input value at runtime. .apply( "Write File(s)", TextIO.write() .withWindowedWrites() .withNumShards(options.getNumShards()) .to( new WindowedFilenamePolicy( options.getOutputDirectory(), options.getOutputFilenamePrefix(), options.getOutputShardTemplate(), options.getOutputFilenameSuffix())) .withTempDirectory(NestedValueProvider.of( maybeUseUserTempLocation( options.getUserTempLocation(), options.getOutputDirectory()), (SerializableFunction
// Execute the pipeline and return the result. return pipeline.run(); }
/** * Utility method for using optional parameter userTempLocation as TempDirectory. * This is useful when output bucket is locked and temporary data cannot be delete * * @param userTempLocation user provided temp location * @param outputLocation user provided outputDirectory to be used as the default t * @return userTempLocation if available, otherwise outputLocation is returned. */ private static ValueProvider
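The windowDuration option used by this template accepts values of the form Ns, Nm, or Nh. As a rough illustration of how such a value could be turned into a duration, here is a minimal sketch. The class name WindowDurationSketch and its parse method are hypothetical and are not the template's DurationUtils implementation, which may accept additional formats or behave differently on invalid input.

```java
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not the template's DurationUtils. */
final class WindowDurationSketch {

  // One or more digits followed by exactly one unit letter: s, m, or h.
  private static final Pattern FORMAT = Pattern.compile("(\\d+)([smh])");

  /** Parses values such as "5s", "12m", or "2h" into a java.time.Duration. */
  static Duration parse(String value) {
    Matcher matcher = FORMAT.matcher(value.trim());
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Expected Ns, Nm, or Nh but got: " + value);
    }
    long amount = Long.parseLong(matcher.group(1));
    switch (matcher.group(2)) {
      case "s":
        return Duration.ofSeconds(amount);
      case "m":
        return Duration.ofMinutes(amount);
      default: // "h" is the only remaining unit the pattern allows.
        return Duration.ofHours(amount);
    }
  }

  public static void main(String[] args) {
    System.out.println(parse("5s"));
    System.out.println(parse("12m"));
    System.out.println(parse("2h"));
  }
}
```

Rejecting anything outside the three documented formats keeps a mistyped value such as 5d from silently becoming an unexpected window size.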
Pub/Sub to MongoDB
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Pub/Sub to MongoDB template is a streaming pipeline that reads JSON-encoded messages from a Pub/Sub subscription and writes them to MongoDB as documents. If required, this pipeline supports additional transforms that can be included using a JavaScript user-defined function (UDF). Any errors that occur due to schema mismatch, malformed JSON, or while executing transforms are recorded in a BigQuery deadletter table along with the input message. The pipeline automatically creates the deadletter table if the table does not exist prior to execution.
Requirements for this pipeline:
The Pub/Sub Subscription must exist and the messages must be encoded in a valid JSON format.
The MongoDB cluster must exist and should be accessible from the Dataflow worker machines.
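The deadletter behavior described above (failed messages are recorded together with the original input instead of stopping the pipeline) can be sketched in plain Java. This is only an illustration of the pattern under simplifying assumptions: the names DeadLetterSketch, Failure, Result, and process are hypothetical, the sketch runs over an in-memory list rather than a Beam PCollection, and it uses integer parsing as a stand-in for JSON parsing and UDF execution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of dead-letter routing; not the template's code. */
final class DeadLetterSketch {

  /** Pairs the original payload with the error that stopped it. */
  static final class Failure {
    final String payload;
    final String errorMessage;

    Failure(String payload, String errorMessage) {
      this.payload = payload;
      this.errorMessage = errorMessage;
    }
  }

  /** Holds the two outputs: transformed records and dead-lettered inputs. */
  static final class Result<T> {
    final List<T> succeeded = new ArrayList<>();
    final List<Failure> deadLetters = new ArrayList<>();
  }

  /** Applies the transform to each message, diverting failures instead of rethrowing. */
  static <T> Result<T> process(List<String> messages, Function<String, T> transform) {
    Result<T> result = new Result<>();
    for (String message : messages) {
      try {
        result.succeeded.add(transform.apply(message));
      } catch (RuntimeException e) {
        // Keep the untouched input so it can be inspected or replayed later.
        result.deadLetters.add(new Failure(message, String.valueOf(e.getMessage())));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Result<Integer> result = process(Arrays.asList("1", "two", "3"), Integer::parseInt);
    System.out.println(
        result.succeeded.size() + " ok, " + result.deadLetters.size() + " dead-lettered");
  }
}
```

Keeping the original payload next to the error message is what makes a deadletter table useful for debugging: the failing record can be examined exactly as it arrived.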
Template parameters
Parameter Description
inputSubscription Name of the Pub/Sub subscription. For example: projects/
mongoDBUri Comma separated list of MongoDB servers. For example: 192.285.234.12:27017,192.287.123.11:27017
database Database in MongoDB to store the collection. For example: my-db.
collection Name of the collection inside MongoDB database. For example: my-collection.
deadletterTable BigQuery table that stores messages that failed processing (mismatched schema, malformed JSON, and so on). For example: project-id:dataset-name.table-name.
javascriptTextTransformGcsPath [Optional] Cloud Storage location of the JavaScript file containing the UDF transform. For example: gs://mybucket/filename.json.
javascriptTextTransformFunctionName [Optional] Name of JavaScript UDF. For example: transform.
batchSize [Optional] Batch size used for batch insertion of documents into MongoDB. Default: 1000.
batchSizeBytes [Optional] Batch size in bytes. Default: 5242880.
maxConnectionIdleTime [Optional] Maximum idle time allowed in milliseconds before a connection timeout occurs. Default: 60000.
sslEnabled [Optional] Boolean value indicating whether connection to MongoDB is SSL enabled. Default: true.
ignoreSSLCertificate [Optional] Boolean value indicating if the SSL certificate should be ignored. Default: true.
withOrdered [Optional] Boolean value enabling ordered bulk insertions into MongoDB. Default: true.
withSSLInvalidHostNameAllowed [Optional] Boolean value indicating if invalid host name is allowed for SSL connection. Default: true.
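Because mongoDBUri is given as a comma-separated list of host:port entries, it can be worth checking the value before launching a job. The following is a minimal sketch of such a check. MongoHostListSketch and HostPort are hypothetical names, and the sketch assumes only the simple host:port form shown in the table; it does not reflect how the template or the MongoDB driver actually parses the value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for a comma-separated host:port list. */
final class MongoHostListSketch {

  /** One parsed entry from the list. */
  static final class HostPort {
    final String host;
    final int port;

    HostPort(String host, int port) {
      this.host = host;
      this.port = port;
    }
  }

  /** Splits "host1:port,host2:port" into entries, rejecting malformed ones. */
  static List<HostPort> parse(String uri) {
    List<HostPort> hosts = new ArrayList<>();
    for (String entry : uri.split(",")) {
      String trimmed = entry.trim();
      int colon = trimmed.lastIndexOf(':');
      // Require a non-empty host before the colon and a non-empty port after it.
      if (colon <= 0 || colon == trimmed.length() - 1) {
        throw new IllegalArgumentException("Expected host:port but got: " + entry);
      }
      int port = Integer.parseInt(trimmed.substring(colon + 1));
      hosts.add(new HostPort(trimmed.substring(0, colon), port));
    }
    return hosts;
  }

  public static void main(String[] args) {
    for (HostPort hp : parse("host1:27017,host2:27018")) {
      System.out.println(hp.host + " -> " + hp.port);
    }
  }
}
```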
Running the Pub/Sub to MongoDB template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Pub/Sub to MongoDB template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
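The job name rule in step 4 can be checked ahead of time with the same regular expression. The sketch below is a hypothetical helper (JobNameSketch and isValid are illustrative names, not part of any Google Cloud client library) that applies the pattern to the whole name.

```java
import java.util.regex.Pattern;

/** Hypothetical validator for the job name rule quoted in step 4. */
final class JobNameSketch {

  // A lowercase letter, optionally followed by up to 38 letters, digits, or
  // hyphens and one final letter or digit, for at most 40 characters in total.
  private static final Pattern JOB_NAME = Pattern.compile("[a-z]([-a-z0-9]{0,38}[a-z0-9])?");

  /** Returns true only if the entire name matches the pattern. */
  static boolean isValid(String name) {
    return JOB_NAME.matcher(name).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValid("pubsub-to-mongodb")); // lowercase with inner hyphens
    System.out.println(isValid("PubSubToMongoDB"));   // contains uppercase letters
    System.out.println(isValid("job-"));              // ends with a hyphen
  }
}
```

Under this pattern a name must start with a lowercase letter, may not end with a hyphen, and may not exceed 40 characters.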
Template source code
Java
2/pubsub-to-mongodb/src/main/java/com/google/cloud/teleport/v2/templates/PubSubToMongoDB.java)
/* * Copyright (C) 2019 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */
package com.google.cloud.teleport.v2.templates;
import com.google.auto.value.AutoValue; import com.google.cloud.teleport.v2.coders.FailsafeElementCoder; import com.google.cloud.teleport.v2.transforms.ErrorConverters; import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer; import com.google.cloud.teleport.v2.utils.SchemaUtils; import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonSyntaxException; import java.nio.charset.StandardCharsets; import javax.annotation.Nullable; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.coders.CoderRegistry; import org.apache.beam.sdk.coders.StringUtf8Coder; import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder; import org.apache.beam.sdk.io.mongodb.MongoDbIO; import org.apache.beam.sdk.metrics.Counter; import org.apache.beam.sdk.metrics.Metrics; import org.apache.beam.sdk.options.Default; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.Validation; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.MapElements; import org.apache.beam.sdk.transforms.PTransform; import org.apache.beam.sdk.transforms.ParDo; import org.apache.beam.sdk.values.PCollection; import org.apache.beam.sdk.values.PCollectionTuple; import org.apache.beam.sdk.values.TupleTag; import org.apache.beam.sdk.values.TupleTagList; import org.apache.beam.sdk.values.TypeDescriptors; import org.apache.beam.vendor.guava.v20_0.com.google.common.base.Throwables; import org.bson.Document; import org.slf4j.Logger; import org.slf4j.LoggerFactory;
/** * The {@link PubSubToMongoDB} pipeline is a streaming pipeline which ingests data i * from PubSub, applies a Javascript UDF if provided and inserts resulting records a * in MongoDB. If the element fails to be processed then it is written to a deadlett * BigQuery. * *
Pipeline Requirements * *
- *
- The PubSub topic and subscriptions exist *
- The MongoDB is up and running
*
Example Usage * *
* # Set the pipeline vars * PROJECT_NAME=my-project * BUCKET_NAME=my-bucket * INPUT_SUBSCRIPTION=my-subscription * MONGODB_DATABASE_NAME=testdb * MONGODB_HOSTNAME=my-host:port * MONGODB_COLLECTION_NAME=testCollection * DEADLETTERTABLE=project:dataset.deadletter_table_name * * mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.v2.templates.PubSubToMongoDB \ * -Dexec.cleanupDaemonThreads=false \ * -Dexec.args=" \ * --project=${PROJECT_NAME} \ * --stagingLocation=gs://${BUCKET_NAME}/staging \ * --tempLocation=gs://${BUCKET_NAME}/temp \ * --runner=DataflowRunner \ * --inputSubscription=${INPUT_SUBSCRIPTION} \ * --mongoDBUri=${MONGODB_HOSTNAME} \ * --database=${MONGODB_DATABASE_NAME} \ * --collection=${MONGODB_COLLECTION_NAME} \ * --deadletterTable=${DEADLETTERTABLE}" **/ public class PubSubToMongoDB { /** * Options supported by {@link PubSubToMongoDB} * *
Inherits standard configuration options. */
/** The tag for the main output of the json transformation. */ public static final TupleTag
/** The tag for the dead-letter output of the json to table row transform. */ public static final TupleTag
/** Pubsub message/string coder for pipeline. */
public static final FailsafeElementCoder
/** String/String Coder for FailsafeElement. */ public static final FailsafeElementCoder
/** The log to output status messages to. */ private static final Logger LOG = LoggerFactory.getLogger(PubSubToMongoDB.class);
/** * The {@link Options} class provides the custom execution options passed by the e * command-line. * *
Inherits standard configuration options, options from {@link * JavascriptTextTransformer.JavascriptTextTransformerOptions}. */ public interface Options extends JavascriptTextTransformer.JavascriptTextTransformerOptions, PipelineOp @Description( "The Cloud Pub/Sub subscription to consume from." + "The name should be in the format of " + "projects/
void setInputSubscription(String inputSubscription);
@Description("The MongoDB database to push the Documents to.") @Validation.Required String getDatabase();
void setDatabase(String database);
@Description( "The host addresses of the MongoDB" + "Multiple addresses to be specified with a comma separated value e.g." + "host1:port,host2:port,host3:port") @Validation.Required String getMongoDBUri();
void setMongoDBUri(String mongoDBUri);
@Description("The Collection in mongoDB to put documents to.") @Validation.Required
String getCollection();
void setCollection(String collection);
    @Description( "The dead-letter table to output to within BigQuery in
    void setDeadletterTable(String deadletterTable);

    @Description("Batch size in number of documents. Default: 1024")
    @Default.Long(1024)
    Long getBatchSize();

    void setBatchSize(Long batchSize);

    @Description("Batch size in number of bytes. Default: 5242880 (5mb)")
    @Default.Long(5242880)
    Long getBatchSizeBytes();

    void setBatchSizeBytes(Long batchSizeBytes);

    @Description("Maximum Connection idle time in ms. Default: 60000")
    @Default.Integer(60000)
    int getMaxConnectionIdleTime();

    void setMaxConnectionIdleTime(int maxConnectionIdleTime);

    @Description("Specify if SSL is enabled. Default: true")
    @Default.Boolean(true)
    Boolean getSslEnabled();

    void setSslEnabled(Boolean sslEnabled);

    @Description("Specify whether to ignore SSL certificate. Default: true")
    @Default.Boolean(true)
    Boolean getIgnoreSSLCertificate();

    void setIgnoreSSLCertificate(Boolean ignoreSSLCertificate);

    @Description("Enable ordered bulk insertions. Default: true")
    @Default.Boolean(true)
    Boolean getWithOrdered();

    void setWithOrdered(Boolean withOrdered);

    @Description("Enable invalidHostNameAllowed for ssl connection. Default: true")
    @Default.Boolean(true)
    Boolean getWithSSLInvalidHostNameAllowed();

    void setWithSSLInvalidHostNameAllowed(Boolean withSSLInvalidHostNameAllowed);
  }

  /** DoFn that will parse the given string elements as Bson Documents. */
  private static class ParseAsDocumentsFn extends DoFn

    @ProcessElement
    public void processElement(ProcessContext context) {
      context.output(Document.parse(context.element()));
    }
  }

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line.
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Opti

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coders for pipeline
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();

    coderRegistry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

    /*
     * Steps: 1) Read PubSubMessage with attributes from input PubSub subscription.
     *        2) Apply Javascript UDF if provided.
     *        3) Write to MongoDB
     *
     */

    LOG.info("Reading from subscription: " + options.getInputSubscription());

    PCollectionTuple convertedPubsubMessages =
        pipeline
            /*
             * Step #1: Read from a PubSub subscription.
             */
            .apply(
                "Read PubSub Subscription",
                PubsubIO.readMessagesWithAttributes()
                    .fromSubscription(options.getInputSubscription()))
            /*
             * Step #2: Apply Javascript Transform and transform, if provided and tr
             * the PubsubMessages into Json documents.
             */
            .apply(
                "Apply Javascript UDF",
                PubSubMessageToJsonDocument.newBuilder()
                    .setJavascriptTextTransformFunctionName(
                        options.getJavascriptTextTransformFunctionName())
                    .setJavascriptTextTransformGcsPath(options.getJavascriptTextTran
                    .build());

    /*
     * Step #3a: Write Json documents into MongoDB using {@link MongoDbIO.write}.
     */
    convertedPubsubMessages
        .get(TRANSFORM_OUT)
        .apply(
            "Get Json Documents",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayl
        .apply("Parse as BSON Document", ParDo.of(new ParseAsDocumentsFn()))
        .apply(
            "Put to MongoDB",
            MongoDbIO.write()
                .withBatchSize(options.getBatchSize())
                .withUri(String.format("mongodb://%s", options.getMongoDBUri()))
                .withDatabase(options.getDatabase())
                .withCollection(options.getCollection())
                .withIgnoreSSLCertificate(options.getIgnoreSSLCertificate())
                .withMaxConnectionIdleTime(options.getMaxConnectionIdleTime())
                .withOrdered(options.getWithOrdered())
                .withSSLEnabled(options.getSslEnabled())
                .withSSLInvalidHostNameAllowed(options.getWithSSLInvalidHostNameAllo

    /*
     * Step 3b: Write elements that failed processing to deadletter table via {@link
     */
    convertedPubsubMessages
        .get(TRANSFORM_DEADLETTER_OUT)
        .apply(
            "Write Transform Failures To BigQuery",
            ErrorConverters.WritePubsubMessageErrors.newBuilder()
                .setErrorRecordsTable(options.getDeadletterTable())
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    // Execute the pipeline and return the result.
    return pipeline.run();
  }

  /**
   * The {@link PubSubMessageToJsonDocument} class is a {@link PTransform} which tra
   * {@link PubsubMessage} objects into JSON objects for insertion into MongoDB whil
   * optional UDF to the input. The executions of the UDF and transformation to Json
   * in a fail-safe way by wrapping the element with its original payload inside th
   * FailsafeElement} class. The {@link PubSubMessageToJsonDocument} transform will
   * PCollectionTuple} which contains all output and dead-letter {@link PCollection}
   *
   * The {@link PCollectionTuple} output will contain the following {@link PColle
   *
   *
*/
@AutoValue public abstract static class PubSubMessageToJsonDocument extends PTransform
public static Builder newBuilder() { return new AutoValue_PubSubToMongoDB_PubSubMessageToJsonDocument.Builder(); }
@Nullable public abstract String javascriptTextTransformGcsPath();
@Nullable public abstract String javascriptTextTransformFunctionName();
@Override public PCollectionTuple expand(PCollection
// Map the incoming messages into FailsafeElements so we can recover from fail // across multiple transforms. PCollection
// If a Udf is supplied then use it to parse the PubSubMessages. if (javascriptTextTransformGcsPath() != null) { return failsafeElements.apply( "InvokeUDF", JavascriptTextTransformer.FailsafeJavascriptUdf.
/** Builder for {@link PubSubMessageToJsonDocument}. */ @AutoValue.Builder public abstract static class Builder { public abstract Builder setJavascriptTextTransformGcsPath( String javascriptTextTransformGcsPath);
public abstract Builder setJavascriptTextTransformFunctionName( String javascriptTextTransformFunctionName);
public abstract PubSubMessageToJsonDocument build(); } }
/** * The {@link ProcessFailsafePubSubFn} class processes a {@link FailsafeElement} c * {@link PubsubMessage} and a String of the message's payload {@link PubsubMessag * into a {@link FailsafeElement} of the original {@link PubsubMessage} and a JSON * been processed with {@link Gson}. * *
   * If {@link PubsubMessage#getAttributeMap()} is not empty then the message att
   * serialized along with the message payload.
   */
  static class ProcessFailsafePubSubFn extends DoFn

    private static final Counter successCounter =
        Metrics.counter(PubSubMessageToJsonDocument.class, "successful-json-conversi

    private static Gson gson = new Gson();

    private static final Counter failedCounter =
        Metrics.counter(PubSubMessageToJsonDocument.class, "failed-json-conversion");

    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage pubsubMessage = context.element().getOriginalPayload();

      JsonObject messageObject = new JsonObject();

      try {
        if (pubsubMessage.getPayload().length > 0) {
          messageObject = gson.fromJson(new String(pubsubMessage.getPayload()), Json
        }

        // If message attributes are present they will be serialized along with the
        if (pubsubMessage.getAttributeMap() != null) {
          pubsubMessage.getAttributeMap().forEach(messageObject::addProperty);
        }

        context.output(FailsafeElement.of(pubsubMessage, messageObject.toString()));
        successCounter.inc();

      } catch (JsonSyntaxException e) {
        context.output(
            TRANSFORM_DEADLETTER_OUT,
            FailsafeElement.of(context.element())
                .setErrorMessage(e.getMessage())
                .setStacktrace(Throwables.getStackTraceAsString(e)));
        failedCounter.inc();
      }
    }
  }

  /**
   * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMes
   * {@link FailsafeElement} class so errors can be recovered from and the original
   * output to a error records table.
   */
  static class PubsubMessageToFailsafeElementFn extends DoFn

Cloud Storage Text to BigQuery (Stream)

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions.
For more information, see the launch stage descriptions (/products#product-launch-stages).

The Cloud Storage Text to BigQuery (Stream) pipeline is a streaming pipeline that allows you to stream text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF) that you provide, and output the result to BigQuery.

Requirements for this pipeline:

Create a JSON formatted BigQuery schema file that describes your output table.

  {
    "fields": [{
      "name": "location",
      "type": "STRING"
    }, {
      "name": "name",
      "type": "STRING"
    }, {
      "name": "age",
      "type": "STRING"
    }, {
      "name": "color",
      "type": "STRING"
    }, {
      "name": "coffee",
      "type": "STRING",
      "mode": "REQUIRED"
    }, {
      "name": "cost",
      "type": "NUMERIC",
      "mode": "REQUIRED"
    }]
  }

Create a JavaScript (.js) file with your UDF function that supplies the logic to transform the lines of text. Note that your function must return a JSON string.

For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.

  function transform(line) {
    var values = line.split(',');

    var obj = new Object();
    obj.location = values[0];
    obj.name = values[1];
    obj.age = values[2];
    obj.color = values[3];
    obj.coffee = values[4];
    var jsonString = JSON.stringify(obj);

    return jsonString;
  }

Template parameters

Parameter                            Description
javascriptTextTransformGcsPath       Cloud Storage location of your JavaScript UDF. For example: gs://my_bucket/my_function.js.
JSONPath                             Cloud Storage location of your BigQuery schema file, described as a JSON. For example: gs://path/to/my/schema.json.
javascriptTextTransformFunctionName  The name of the JavaScript function you wish to call as your UDF. For example: transform.
outputTable                          The fully qualified BigQuery table. For example: my-project:dataset.table
inputFilePattern                     Cloud Storage location of the text you'd like to process. For example: gs://my-bucket/my-files/text.txt.
bigQueryLoadingTemporaryDirectory    Temporary directory for BigQuery loading process. For example: gs://my-bucket/my-files/temp_dir
outputDeadletterTable                Table for messages that failed to reach the output table (also known as the dead-letter table). For example: my-project:dataset.my-deadletter-table. If it doesn't exist, it will be created during pipeline execution. If not specified,

Running the Cloud Storage Text to BigQuery (Stream) template

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Cloud Storage Text to BigQuery template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.

Template source code
Java
mplates/blob/master/src/main/java/com/google/cloud/teleport/templates/TextToBigQueryStreaming.java)

/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package com.google.cloud.teleport.templates;

import com.google.api.client.json.JsonFactory;
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToT
import com.google.cloud.teleport.templates.common.ErrorConverters.WriteStringMessage
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Failsafe
import com.google.cloud.teleport.util.ResourceUtils;
import com.google.cloud.teleport.util.ValueProviderUtils;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.common.base.Charsets;
import com.google.common.collect.ImmutableList;
import com.google.common.io.ByteStreams;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.extensions.gcp.util.Transport;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Watch.Growth;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link TextToBigQueryStreaming} is a streaming version of {@link TextIOToBigQ
 * that reads text files, applies a JavaScript UDF and writes the output to BigQuery
 * continuously polls for new files, reads them row-by-row and processes each record
 * The polling interval is set at 10 seconds.
* * Example Usage: * * * {@code mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.templates.TextToBigQueryStreaming \ * -Dexec.args="\ * --project=${PROJECT_ID} \ * --stagingLocation=gs://${STAGING_BUCKET}/staging \ * --tempLocation=gs://${STAGING_BUCKET}/tmp \ * --runner=DataflowRunner \ * --inputFilePattern=gs://path/to/input* \ * --JSONPath=gs://path/to/json/schema.json \ * --outputTable={$PROJECT_ID}:${OUTPUT_DATASET}.${OUTPUT_TABLE} \ * --javascriptTextTransformGcsPath=gs://path/to/transform/udf.js \ * --javascriptTextTransformFunctionName=${TRANSFORM_NAME} \ * --bigQueryLoadingTemporaryDirectory=gs://${STAGING_BUCKET}/tmp \ * --outputDeadletterTable=${PROJECT_ID}:${ERROR_DATASET}.${ERROR_TABLE}" * } *
*/ public class TextToBigQueryStreaming {
private static final Logger LOG = LoggerFactory.getLogger(TextToBigQueryStreaming.
/** The tag for the main output for the UDF. */ private static final TupleTag
/** The tag for the dead-letter output of the udf. */ private static final TupleTag
/** The tag for the main output of the json transformation. */ private static final TupleTag
/** The tag for the dead-letter output of the json to table row transform. */ private static final TupleTag
/** The default suffix for error tables if dead letter table is not specified. */ private static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";
/** Default interval for polling files in GCS. */ private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(10)
/** Coder for FailsafeElement. */ private static final FailsafeElementCoder
private static final JsonFactory JSON_FACTORY = Transport.getJsonFactory();
/** * Main entry point for executing the pipeline. This will run the pipeline asynchr * blocking execution is required, use the {@link * TextToBigQueryStreaming#run(TextToBigQueryStreamingOptions)} method to start th * and invoke {@code result.waitUntilFinish()} on the {@link PipelineResult} * * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {
// Parse the user options passed from the command-line TextToBigQueryStreamingOptions options = PipelineOptionsFactory.fromArgs(args) .withValidation() .as(TextToBigQueryStreamingOptions.class);
run(options); }
/** * Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(TextToBigQueryStreamingOptions options) {
// Create the pipeline Pipeline pipeline = Pipeline.create(options);
// Register the coder for pipeline FailsafeElementCoder
CoderRegistry coderRegistry = pipeline.getCoderRegistry(); coderRegistry.registerCoderForType(coder.getEncodedTypeDescriptor(), coder);
/* * Steps: * 1) Read from the text source continuously. * 2) Convert to FailsafeElement. * 3) Apply Javascript udf transformation. * - Tag records that were successfully transformed and those * that failed transformation. * 4) Convert records to TableRow. * - Tag records that were successfully converted and those * that failed conversion. * 5) Insert successfully converted records into BigQuery. * - Errors encountered while streaming will be sent to deadletter table. * 6) Insert records that failed into deadletter table. */
PCollectionTuple transformedOutput = pipeline
// 1) Read from the text source continuously. .apply( "ReadFromSource", TextIO.read() .from(options.getInputFilePattern()) .watchForNewFiles(DEFAULT_POLL_INTERVAL, Growth.never()))
// 2) Convert to FailsafeElement. .apply( "ConvertToFailsafeElement", MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor()) .via(input -> FailsafeElement.of(input, input)))
// 3) Apply Javascript udf transformation. .apply( "ApplyUDFTransformation", FailsafeJavascriptUdf.
.setSuccessTag(UDF_OUT) .setFailureTag(UDF_DEADLETTER_OUT) .build());
PCollectionTuple convertedTableRows = transformedOutput
// 4) Convert records to TableRow. .get(UDF_OUT) .apply( "ConvertJSONToTableRow", FailsafeJsonToTableRow.
WriteResult writeResult = convertedTableRows
// 5) Insert successfully converted records into BigQuery. .get(TRANSFORM_OUT) .apply( "InsertIntoBigQuery", BigQueryIO.writeTableRows() .withJsonSchema(getSchemaFromGCS(options.getJSONPath())) .to(options.getOutputTable()) .withExtendedErrorInfo() .withoutValidation() .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED) .withWriteDisposition(WriteDisposition.WRITE_APPEND) .withMethod(Method.STREAMING_INSERTS) .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErr .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDi
// Elements that failed inserts into BigQuery are extracted and converted to Fai PCollection
// 6) Insert records that failed transformation or conversion into deadletter ta PCollectionList.of(
ImmutableList.of( transformedOutput.get(UDF_DEADLETTER_OUT), convertedTableRows.get(TRANSFORM_DEADLETTER_OUT), failedInserts)) .apply("Flatten", Flatten.pCollections()) .apply( "WriteFailedRecords", WriteStringMessageErrors.newBuilder() .setErrorRecordsTable( ValueProviderUtils.maybeUseDefaultDeadletterTable( options.getOutputDeadletterTable(), options.getOutputTable(), DEFAULT_DEADLETTER_TABLE_SUFFIX)) .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJs .build());
return pipeline.run(); }
/** * Method to wrap a {@link BigQueryInsertError} into a {@link FailsafeElement}. * * @param insertError BigQueryInsert error. * @return FailsafeElement object. * @throws IOException */ static FailsafeElement
FailsafeElement
String rowPayload = JSON_FACTORY.toString(insertError.getRow()); String errorMessage = JSON_FACTORY.toString(insertError.getError());
failsafeElement = FailsafeElement.of(rowPayload, rowPayload); failsafeElement.setErrorMessage(errorMessage);
} catch (IOException e) { throw new RuntimeException(e); }
return failsafeElement; }
/** * Method to read a BigQuery schema file from GCS and return the file contents as * * @param gcsPath Path string for the schema file in GCS. * @return File contents as a string. */ private static ValueProvider
String schema; try (ReadableByteChannel rbc = FileSystems.open(sourceResourceId)) { try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) { try (WritableByteChannel wbc = Channels.newChannel(baos)) { ByteStreams.copy(rbc, wbc); schema = baos.toString(Charsets.UTF_8.name()); LOG.info("Extracted schema: " + schema); } } } catch (IOException e) { LOG.error("Error extracting schema: " + e.getMessage()); throw new RuntimeException(e); } return schema; } }); }
/** * The {@link TextToBigQueryStreamingOptions} class provides the custom execution * by the executor at the command-line. */ public interface TextToBigQueryStreamingOptions extends TextIOToBigQuery.Options { @Description( "The dead-letter table to output to within BigQuery in
void setOutputDeadletterTable(ValueProvider
Cloud Storage Text to Pub/Sub (Stream)
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
This template creates a streaming pipeline that continuously polls for new text files uploaded to Cloud Storage, reads each file line by line, and publishes each line as a string to a Pub/Sub topic. The template publishes records from newline-delimited JSON files or CSV files to a Pub/Sub topic for real-time processing. You can use this template to replay data to Pub/Sub.
Currently, the polling interval is fixed and set to 10 seconds. This template does not set any timestamp on the individual records, so the event time will be equal to the publishing time during execution. If your pipeline relies on an accurate event time for processing, you should not use this pipeline.
Requirements for this pipeline:
Input files must be in newline-delimited JSON or CSV format. Records that span multiple lines in the source files can cause issues downstream, as each line within the files will be published as a message to Pub/Sub.
The Pub/Sub topic must exist prior to execution.
The pipeline runs indefinitely and needs to be terminated manually.
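The first requirement can be illustrated with a small JDK-only sketch. It does not use the template's code: the class name LineSplitDemo and the sample records are hypothetical, and splitting on line breaks only approximates how a line-oriented reader turns each line into one message.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical demo: one output element per input line, as a line-oriented reader would emit. */
final class LineSplitDemo {

    /** Splits file content on line terminators; each returned string stands for one message payload. */
    static List<String> toMessages(String fileContent) {
        return fileContent.lines().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two records, one per line: each line is a complete JSON document.
        String compact = "{\"id\":1,\"name\":\"a\"}\n{\"id\":2,\"name\":\"b\"}\n";
        // One pretty-printed record spread over four lines: no single line is a complete JSON document.
        String pretty = "{\n  \"id\": 1,\n  \"name\": \"a\"\n}\n";

        System.out.println(toMessages(compact).size()); // 2 messages, one per record
        System.out.println(toMessages(pretty).size());  // 4 messages, all fragments of one record
    }
}
```

Under these assumptions, a subscriber that parses each message as JSON would succeed on the compact file and fail on every message from the pretty-printed one, which is why records should be kept to a single line.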
Template parameters
Parameter         Description
inputFilePattern  The input file pattern to read from. For example, gs://bucket-name/files/*.json.
outputTopic       The Pub/Sub topic to write to. The name should be in the format of projects/
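The example usage in the template source passes the topic as --outputTopic=projects/${PROJECT_ID}/topics/${TOPIC_NAME}. A minimal sketch of building and loosely checking a value of that shape follows; the class TopicPath and its regular expression are hypothetical helpers for illustration, not part of the template, and the check is deliberately looser than Pub/Sub's actual naming rules.

```java
/** Hypothetical helper for the projects/<project>/topics/<topic> shape used in the example usage. */
final class TopicPath {

    /** Builds a topic path from a project ID and a topic name. */
    static String of(String projectId, String topicName) {
        return String.format("projects/%s/topics/%s", projectId, topicName);
    }

    /** Loose shape check only (assumption): two non-empty segments with no slashes in them. */
    static boolean looksLikeTopicPath(String value) {
        return value != null && value.matches("projects/[^/]+/topics/[^/]+");
    }

    public static void main(String[] args) {
        String path = of("my-project", "my-topic");
        System.out.println(path);                           // projects/my-project/topics/my-topic
        System.out.println(looksLikeTopicPath(path));       // true
        System.out.println(looksLikeTopicPath("my-topic")); // false: bare topic name
    }
}
```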
Running the Cloud Storage Text to Pub/Sub (Stream) template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Cloud Storage Text to Pub/Sub (Stream) template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
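The job name rule in step 4 can be checked before submitting a job. The sketch below uses only java.util.regex with the pattern quoted in that step; the class name JobNameCheck and the sample names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: checks candidate job names against the pattern quoted in step 4. */
final class JobNameCheck {

    // Pattern copied from the console instructions: a lowercase letter, optionally followed by up to
    // 38 lowercase letters, digits, or hyphens and a final lowercase letter or digit.
    private static final Pattern JOB_NAME = Pattern.compile("[a-z]([-a-z0-9]{0,38}[a-z0-9])?");

    /** Returns true only when the whole name matches the pattern. */
    static boolean isValidJobName(String name) {
        return name != null && JOB_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidJobName("text-to-pubsub-1")); // true
        System.out.println(isValidJobName("Text-To-PubSub"));   // false: uppercase letters
        System.out.println(isValidJobName("job-"));             // false: ends with a hyphen
    }
}
```

Read this way, the pattern limits names to 40 characters, requires a leading lowercase letter, and forbids a trailing hyphen.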
Template source code
Java
wTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/TextToPubsubStream.java)
/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.templates.TextToPubsub.Options;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;
/**
 * The {@code TextToPubsubStream} is a streaming version of the {@code TextToPubsub} pipeline that
 * publishes records to Cloud Pub/Sub from a set of files. The pipeline continuously polls for new
 * files, reads them row-by-row and publishes each record as a string message. The polling interval
 * is fixed at 10 seconds. At the moment, publishing messages with attributes is
 * unsupported.
 *
 * Example Usage:
 *
 * {@code mvn compile exec:java \
    -Dexec.mainClass=com.google.cloud.teleport.templates.TextToPubsubStream \
    -Dexec.args=" \
    --project=${PROJECT_ID} \
    --stagingLocation=gs://${STAGING_BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/stagi
    --tempLocation=gs://${STAGING_BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
    --runner=DataflowRunner \
    --inputFilePattern=gs://path/to/*.csv \
    --outputTopic=projects/${PROJECT_ID}/topics/${TOPIC_NAME}"
 * }
 */
public class TextToPubsubStream extends TextToPubsub {
  private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(10);

  /**
   * Main entry-point for the pipeline. Reads in the
   * command-line arguments, parses them, and executes
   * the pipeline.
   *
   * @param args Arguments passed in from the command-line.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    Options options = PipelineOptionsFactory
      .fromArgs(args)
      .withValidation()
      .as(Options.class);

    run(options);
  }

  /**
   * Executes the pipeline with the provided execution
   * parameters.
   *
   * @param options The execution parameters.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Read from the text source.
     *  2) Write each text record to Pub/Sub
     */
    pipeline
      .apply(
        "Read Text Data",
        TextIO.read()
          .from(options.getInputFilePattern())
          .watchForNewFiles(DEFAULT_POLL_INTERVAL, Watch.Growth.never()))
      .apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
  }
}
Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery (Stream)
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery template is a streaming pipeline that reads CSV files from a Cloud Storage bucket, calls the Cloud Data Loss Prevention (Cloud DLP) API for de-identification, and writes the de-identified data into the specified BigQuery table. This template supports using both a Cloud DLP inspection template (/dlp/docs/creating-templates) and a Cloud DLP de-identification template (/dlp/docs/creating-templates-deid). This lets you inspect data for potentially sensitive information and de-identify it, or de-identify structured data directly when the columns to de-identify are specified and no inspection is needed.
Requirements for this pipeline:
The input data to tokenize must exist.
The Cloud DLP templates must exist (for example, DeidentifyTemplate and InspectTemplate). See Cloud DLP templates (/dlp/docs/concepts-templates) for more details.
The BigQuery dataset must exist.
Template parameters
Parameter Description
inputFilePattern The CSV file(s) to read input data records from. Wildcards are also accepted. For example, gs://mybucket/my_csv_filename.csv or gs://mybucket/file-*
dlpProjectId The ID of the Cloud DLP project that owns the Cloud DLP API resource. This Cloud DLP project can be the same project that owns the Cloud DLP templates, or it can be a separate project. For example, my_dlp_api_project.
deidentifyTemplateName The Cloud DLP de-identification template to use for API requests, specified with the pattern projects/{template_project_id}/deidentifyTemplates/{deIdTemplateId}. For example, projects/my_project/deidentifyTemplates/100.
datasetName The BigQuery dataset for sending tokenized results.
batchSize The chunking/batch size for sending data to inspect and/or detokenize. In the case of a CSV file, batchSize is the number of rows in a batch. You must determine the batch size based on the size of the records and the size of the file. Note that the Cloud DLP API has a payload size limit of 524 KB per API call.
inspectTemplateName [Optional] The Cloud DLP inspection template to use for API requests, specified with the pattern projects/{template_project_id}/inspectTemplates/{idTemplateId}. For example, projects/my_project/inspectTemplates/100.
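The batchSize guidance above comes down to simple arithmetic. The following is a minimal sketch, not part of the template: the class and method names are hypothetical, it assumes that "524 KB" means 524,000 bytes, and it assumes you already know an average row size from sampling the file. Real requests carry additional serialization overhead, so treat the result as an upper bound rather than a safe value.

```java
/** Hypothetical helper (not part of the template) for choosing a batchSize value. */
final class BatchSizeEstimator {
    // Assumption: the documented 524 KB per-call limit is interpreted as 524,000 bytes.
    static final int MAX_PAYLOAD_BYTES = 524_000;

    /**
     * Returns the largest number of rows whose combined size stays within the limit.
     * A result of 0 means that even a single average row would not fit.
     */
    static int maxRowsPerBatch(int avgRowBytes, int headerBytes) {
        if (avgRowBytes <= 0 || headerBytes < 0) {
            throw new IllegalArgumentException(
                "row size must be positive and header size must be non-negative");
        }
        // Space left for rows once the header line has been accounted for.
        int available = MAX_PAYLOAD_BYTES - headerBytes;
        return Math.max(0, available / avgRowBytes);
    }
}
```

Choosing a batchSize somewhat below the estimate leaves headroom for rows that are larger than average.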
Running the Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery (Stream) template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
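The job name rule in step 4 can be checked before you submit a job. The following is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and the regular expression is copied from the step above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the template) that pre-checks a job name. */
final class JobNameValidator {
    // Pattern from step 4: a lowercase letter, then up to 39 more characters
    // (lowercase letters, digits, hyphens) that must not end with a hyphen.
    private static final Pattern JOB_NAME =
        Pattern.compile("[a-z]([-a-z0-9]{0,38}[a-z0-9])?");

    static boolean isValid(String jobName) {
        // matches() requires the entire string to match the pattern.
        return jobName != null && JOB_NAME.matcher(jobName).matches();
    }
}
```

For example, isValid("dlp-text-to-bigquery-20200823") returns true, while a name that contains uppercase letters or underscores, such as "DLP_Job", is rejected.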
Template source code
Java
tes/blob/master/src/main/java/com/google/cloud/teleport/templates/DLPTextToBigQueryStreaming.java)
/*
 * Copyright (C) 2018 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableCell;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.common.base.Charsets;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.DeidentifyContentRequest.Builder;
import com.google.privacy.dlp.v2.DeidentifyContentResponse;
import com.google.privacy.dlp.v2.FieldId;
import com.google.privacy.dlp.v2.ProjectName;
import com.google.privacy.dlp.v2.Table;
import com.google.privacy.dlp.v2.Value;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileIO.ReadableFile;
import org.apache.beam.sdk.io.ReadableFileCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.Element;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * The {@link DLPTextToBigQueryStreaming} is a streaming pipeline that reads CSV files from a
 * storage location (e.g. Google Cloud Storage), uses Cloud DLP API to inspect, classify, and mask
 * sensitive information (e.g. PII Data like passport or SIN number) and at the end stores
 * obfuscated data in BigQuery (Dynamic Table Creation) to be used for various purposes, e.g. data
 * analytics, ML model. Cloud DLP inspection and masking can be configured by the user and can make
 * use of over 90 built in detectors and masking techniques like tokenization, secure hashing, date
 * shifting, partial masking, and more.
 *
 * Pipeline Requirements
 *
 * - DLP Templates exist (e.g. deidentifyTemplate, InspectTemplate)
 * - The BigQuery Dataset exists
 *
 * Example Usage
 *
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/dlp-text-to-bigquery
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}"
 *
 * # Execute the template
 * JOB_NAME=dlp-text-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputFilePattern=gs:// / .csv, batchSize=15,datasetName=
 */
public class DLPTextToBigQueryStreaming {

  public static final Logger LOG = LoggerFactory.getLogger(DLPTextToBigQueryStreaming.class);
  /** Default interval for polling files in GCS. */
  private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(30);
  /** Expected only CSV file in GCS bucket. */
  private static final String ALLOWED_FILE_EXTENSION = String.valueOf("csv");
  /** Regular expression that matches valid BQ table IDs. */
  private static final String TABLE_REGEXP = "[-\\w$@]{1,1024}";
  /** Default batch size if value not provided in execution. */
  private static final Integer DEFAULT_BATCH_SIZE = 100;
  /** Regular expression that matches valid BQ column name. */
  private static final String COLUMN_NAME_REGEXP = "^[A-Za-z_]+[A-Za-z_0-9]*$";
  /** Default window interval to create side inputs for header records. */
  private static final Duration WINDOW_INTERVAL = Duration.standardSeconds(30);
  /**
   * Main entry point for executing the pipeline. This will run the pipeline asynchronously. If
   * blocking execution is required, use the {@link
   * DLPTextToBigQueryStreaming#run(TokenizePipelineOptions)} method to start the pipeline and
   * invoke {@code result.waitUntilFinish()} on the {@link PipelineResult}
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    TokenizePipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TokenizePipelineOptions.class);
    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(TokenizePipelineOptions options) {
    // Create the pipeline
    Pipeline p = Pipeline.create(options);
    /*
     * Steps:
     *   1) Read from the text source continuously based on default interval e.g. 30 seconds
     *       - Setup a window for 30 secs to capture the list of files emitted.
     *       - Group by file name as key and ReadableFile as a value.
     *   2) Create a side input for the window containing list of headers per file.
     *   3) Output each readable file for content processing.
     *   4) Split file contents based on batch size for parallel processing.
     *   5) Process each split as a DLP table content request to invoke API.
     *   6) Convert DLP Table Rows to BQ Table Row.
     *   7) Create dynamic table and insert successfully converted records into BQ.
     */
PCollection
>> csvFiles = p /* * 1) Read from the text source continuously based on default interval e * - Setup a window for 30 secs to capture the list of files emited. * - Group by file name as key and ReadableFile as a value. */ .apply( "Poll Input Files", FileIO.match() .filepattern(options.getInputFilePattern()) .continuously(DEFAULT_POLL_INTERVAL, Watch.Growth.never())) .apply("Find Pattern Match", FileIO.readMatches().withCompression(Compre .apply("Add File Name as Key", WithKeys.of(file -> getFileName(file))) .setCoder(KvCoder.of(StringUtf8Coder.of(), ReadableFileCoder.of())) .apply( "Fixed Window(30 Sec)", Window. >into(FixedWindows.of(WINDOW_INTERVA .triggering(Repeatedly.forever( AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Dur .discardingFiredPanes() .withAllowedLateness(Duration.ZERO)) .apply(GroupByKey.create()); /* * Side input for the window to capture list of headers for each file emited so * used in the next transform. */ final PCollectionView
>>> headerMap = csvFiles
// 2) Create a side input for the window containing list of headers par .apply( "Create Header Map", ParDo.of( new DoFn
>, KV @ProcessElement public void processElement(ProcessContext c) { String fileKey = c.element().getKey(); c.element() .getValue() .forEach( file -> { try (BufferedReader br = getReader(file)) { c.output(KV.of(fileKey, getFileHeaders(br)));
} catch (IOException e) { LOG.error("Failed to Read File {}", e.getMessage throw new RuntimeException(e); } }); } })) .apply("View As List", View.asList());
PCollection
> bqDataMap = csvFiles // 3) Output each readable file for content processing. .apply( "File Handler", ParDo.of( new DoFn
>, KV { c.output(KV.of(fileKey, file)); }); } }))
// 4) Split file contents based on batch size for parallel processing. .apply( "Process File Contents", ParDo.of( new CSVReader( NestedValueProvider.of( options.getBatchSize(), batchSize -> { if (batchSize != null) { return batchSize; } else { return DEFAULT_BATCH_SIZE; } }), headerMap)) .withSideInputs(headerMap))
// 5) Create a DLP Table content request and invoke DLP API for each pro .apply( "DLP-Tokenization", ParDo.of( new DLPTokenizationDoFn( options.getDlpProjectId(), options.getDeidentifyTemplateName(), options.getInspectTemplateName())))
// 6) Convert DLP Table Rows to BQ Table Row .apply("Process Tokenized Data", ParDo.of(new TableRowProcessorDoFn()));
// 7) Create dynamic table and insert successfully converted records into BQ. bqDataMap.apply( "Write To BQ", BigQueryIO.
>write() .to(new BQDestination(options.getDatasetName(), options.getDlpProjectId( .withFormatFunction( element -> { return element.getValue(); }) .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEED .withoutValidation() .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())); return p.run(); }
/** * The {@link TokenizePipelineOptions} interface provides the custom execution opt * the executor at the command-line. */ public interface TokenizePipelineOptions extends DataflowPipelineOptions {
@Description("The file pattern to read records from (e.g. gs://bucket/file-*.csv ValueProvider
getInputFilePattern(); void setInputFilePattern(ValueProvider
value); @Description( "DLP Deidentify Template to be used for API request " + "(e.g.projects/{project_id}/deidentifyTemplates/{deIdTemplateId}") @Required ValueProvider
getDeidentifyTemplateName(); void setDeidentifyTemplateName(ValueProvider
value); @Description( "DLP Inspect Template to be used for API request " + "(e.g.projects/{project_id}/inspectTemplates/{inspectTemplateId}") ValueProvider
getInspectTemplateName(); void setInspectTemplateName(ValueProvider
value); @Description( "DLP API has a limit for payload size of 524KB /api call. " + "That's why dataflow process will need to chunk it. User will have to + "on how they would like to batch the request depending on number of ro + "and how big each row is.") @Required ValueProvider
getBatchSize(); void setBatchSize(ValueProvider
value); @Description("Big Query data set must exist before the pipeline runs (e.g. pii-d ValueProvider
getDatasetName(); void setDatasetName(ValueProvider
value); @Description("Project id to be used for DLP Tokenization") ValueProvider
getDlpProjectId();
void setDlpProjectId(ValueProvider
value); } /** * The {@link CSVReader} class uses experimental Split DoFn to split each csv file * chunks and process it in non-monolithic fashion. For example: if a CSV file has * batch size is set to 15, then initial restrictions for the SDF will be 1 to 7 a * restriction will be {{1-2},{2-3}..{7-8}} for parallel executions. */ static class CSVReader extends DoFn
, KV > { private ValueProvider
batchSize; private PCollectionView >>> headerMap; /** This counter is used to track number of lines processed against batch size. private Integer lineCount;
List
csvHeaders; public CSVReader( ValueProvider
batchSize, PCollectionView >>> headerMap) { lineCount = 1; this.batchSize = batchSize; this.headerMap = headerMap; this.csvHeaders = new ArrayList<>(); }
@ProcessElement public void processElement(ProcessContext c, RestrictionTracker
csvHeaders = getHeaders(c.sideInput(headerMap), fileKey); if (csvHeaders != null) { List
dlpTableHeaders = csvHeaders.stream() .map(header -> FieldId.newBuilder().setName(header).build()) .collect(Collectors.toList()); List rows = new ArrayList<>(); Table dlpTable = null; /** finding out EOL for this restriction so that we know the SOL */ int endOfLine = (int) (i * batchSize.get().intValue()); int startOfLine = (endOfLine - batchSize.get().intValue());
/** skipping all the rows that's not part of this restriction */ br.readLine(); Iterator
csvRows = CSVFormat.DEFAULT.withSkipHeaderRecord().parse(br).iterator(); for (int line = 0; line < startOfLine; line++) { if (csvRows.hasNext()) { csvRows.next(); } } /** looping through buffered reader and creating DLP Table Rows equals t while (csvRows.hasNext() && lineCount <= batchSize.get()) { CSVRecord csvRow = csvRows.next(); rows.add(convertCsvRowToTableRow(csvRow)); lineCount += 1; } /** creating DLP table and output for next transformation */ dlpTable = Table.newBuilder().addAllHeaders(dlpTableHeaders).addAllRows( c.output(KV.of(fileKey, dlpTable));
LOG.debug( "Current Restriction From: {}, Current Restriction To: {}," + " StartofLine: {}, End Of Line {}, BatchData {}", tracker.currentRestriction().getFrom(), tracker.currentRestriction().getTo(), startOfLine, endOfLine, dlpTable.getRowsCount());
} else {
throw new RuntimeException("Header Values Can't be found For file Key " } } } }
/** * SDF needs to define a @GetInitialRestriction method that can create a restric * the complete work for a given element. For our case this would be the total n * for each CSV file. We will calculate the number of split required based on to * rows and batch size provided. * * @throws IOException */
@GetInitialRestriction public OffsetRange getInitialRestriction(@Element KV
csvFi int rowCount = 0; int totalSplit = 0; try (BufferedReader br = getReader(csvFile.getValue())) { /** assume first row is header */ int checkRowCount = (int) br.lines().count() - 1; rowCount = (checkRowCount < 1) ? 1 : checkRowCount; totalSplit = rowCount / batchSize.get().intValue(); int remaining = rowCount % batchSize.get().intValue(); /** * Adjusting the total number of split based on remaining rows. For example: * 15 for 100 rows will have total 7 splits. As it's a range last split will * range {7,8} */ if (remaining > 0) { totalSplit = totalSplit + 2;
} else { totalSplit = totalSplit + 1; } }
LOG.debug("Initial Restriction range from 1 to: {}", totalSplit); return new OffsetRange(1, totalSplit); }
/** * SDF needs to define a @SplitRestriction method that can split the intital res * number of smaller restrictions. For example: a intital rewstriction of (x, N) * produces pairs (x, 0), (x, 1), …, (x, N-1) as output. */ @SplitRestriction public void splitRestriction( @Element KV
csvFile,@Restriction OffsetRange range, Ou /** split the initial restriction by 1 */ for (final OffsetRange p : range.split(1, 1)) { out.output(p); } } @NewTracker public OffsetRangeTracker newTracker(@Restriction OffsetRange range) { return new OffsetRangeTracker(new OffsetRange(range.getFrom(), range.getTo()))
}
    private Table.Row convertCsvRowToTableRow(CSVRecord csvRow) {
      /** convert from CSV row to DLP Table Row */
      Iterator<String> valueIterator = csvRow.iterator();
      Table.Row.Builder tableRowBuilder = Table.Row.newBuilder();
      while (valueIterator.hasNext()) {
        String value = valueIterator.next();
        if (value != null) {
          tableRowBuilder.addValues(Value.newBuilder().setStringValue(value.toString()).build());
        } else {
          // Missing values are sent to DLP as empty strings.
          tableRowBuilder.addValues(Value.newBuilder().setStringValue("").build());
        }
      }
      return tableRowBuilder.build();
    }
private List
getHeaders(List >> headerMap, String return headerMap.stream() .filter(map -> map.getKey().equalsIgnoreCase(fileKey)) .findFirst() .map(e -> e.getValue()) .orElse(null); } } /** * The {@link DLPTokenizationDoFn} class executes tokenization request by calling * DLP table as a content item as CSV file contains fully structured data. DLP tem * de-identify, inspect) need to exist before this pipeline runs. As response from * received, this DoFn ouptputs KV of new table with table id as key. */ static class DLPTokenizationDoFn extends DoFn
, KV private ValueProvider dlpProjectId; private DlpServiceClient dlpServiceClient; private ValueProvider deIdentifyTemplateName; private ValueProvider inspectTemplateName; private boolean inspectTemplateExist; private Builder requestBuilder; private final Distribution numberOfRowsTokenized = Metrics.distribution(DLPTokenizationDoFn.class, "numberOfRowsTokenizedDistro private final Distribution numberOfBytesTokenized = Metrics.distribution(DLPTokenizationDoFn.class, "numberOfBytesTokenizedDistr
public DLPTokenizationDoFn( ValueProvider
dlpProjectId, ValueProvider deIdentifyTemplateName, ValueProvider inspectTemplateName) { this.dlpProjectId = dlpProjectId; this.dlpServiceClient = null; this.deIdentifyTemplateName = deIdentifyTemplateName; this.inspectTemplateName = inspectTemplateName; this.inspectTemplateExist = false; } @Setup public void setup() { if (this.inspectTemplateName.isAccessible()) { if (this.inspectTemplateName.get() != null) { this.inspectTemplateExist = true; } } if (this.deIdentifyTemplateName.isAccessible()) { if (this.deIdentifyTemplateName.get() != null) { this.requestBuilder = DeidentifyContentRequest.newBuilder() .setParent(ProjectName.of(this.dlpProjectId.get()).toString()) .setDeidentifyTemplateName(this.deIdentifyTemplateName.get()); if (this.inspectTemplateExist) { this.requestBuilder.setInspectTemplateName(this.inspectTemplateName.get( } } } }
    @StartBundle
    public void startBundle() throws SQLException {

      try {
        // A new DLP client is created for every bundle and closed in finishBundle().
        this.dlpServiceClient = DlpServiceClient.create();

      } catch (IOException e) {
        LOG.error("Failed to create DLP Service Client: {}", e.getMessage());
        throw new RuntimeException(e);
      }
    }

    @FinishBundle
    public void finishBundle() throws Exception {
      if (this.dlpServiceClient != null) {
        this.dlpServiceClient.close();
      }
    }
    @ProcessElement
    public void processElement(ProcessContext c) {
      String key = c.element().getKey();
      Table nonEncryptedData = c.element().getValue();
      // Wrap the table in a content item and send it to the DLP de-identify API.
      ContentItem tableItem = ContentItem.newBuilder().setTable(nonEncryptedData).build();
      this.requestBuilder.setItem(tableItem);
      DeidentifyContentResponse response =
          dlpServiceClient.deidentifyContent(this.requestBuilder.build());
      Table tokenizedData = response.getItem().getTable();
      numberOfRowsTokenized.update(tokenizedData.getRowsList().size());
      numberOfBytesTokenized.update(tokenizedData.toByteArray().length);
      c.output(KV.of(key, tokenizedData));
    }
  }
/** * The {@link TableRowProcessorDoFn} class process tokenized DLP tables and conver * BigQuery Table Row. */ public static class TableRowProcessorDoFn extends DoFn
, KV @ProcessElement public void processElement(ProcessContext c) {
Table tokenizedData = c.element().getValue(); List
headers = tokenizedData.getHeadersList().stream() .map(fid -> fid.getName()) .collect(Collectors.toList()); List outputRows = tokenizedData.getRowsList(); if (outputRows.size() > 0) { for (Table.Row outputRow : outputRows) { if (outputRow.getValuesCount() != headers.size()) { throw new IllegalArgumentException( "CSV file's header count must exactly match with data element count" } c.output( KV.of( c.element().getKey(), createBqRow(outputRow, headers.toArray(new String[headers.size()])
} } }
private static TableRow createBqRow(Table.Row tokenizedValue, String[] headers) TableRow bqRow = new TableRow(); AtomicInteger headerIndex = new AtomicInteger(0); List
cells = new ArrayList<>(); tokenizedValue .getValuesList() .forEach( value -> { String checkedHeaderName = checkHeaderName(headers[headerIndex.getAndIncrement()].toString( bqRow.set(checkedHeaderName, value.getStringValue()); cells.add(new TableCell().set(checkedHeaderName, value.getStringValu }); bqRow.setF(cells); return bqRow; } } /** * The {@link BQDestination} class creates BigQuery table destination and table sc * the CSV file processed in earlier transformations. Table id is same as filename * same as file header columns. */ public static class BQDestination extends DynamicDestinations
, KV > { private ValueProvider
datasetName; private ValueProvider projectId; public BQDestination(ValueProvider
datasetName, ValueProvider pr this.datasetName = datasetName; this.projectId = projectId; } @Override public KV
getDestination(ValueInSingleWindow
@Override public TableDestination getTable(KV
destination) { TableDestination dest = new TableDestination(destination.getKey(), "pii-tokenized output data from LOG.debug("Table Destination {}", dest.getTableSpec()); return dest; } @Override public TableSchema getSchema(KV
destination) { TableRow bqRow = destination.getValue(); TableSchema schema = new TableSchema(); List
fields = new ArrayList (); List cells = bqRow.getF(); for (int i = 0; i < cells.size(); i++) { Map object = cells.get(i); String header = object.keySet().iterator().next(); /** currently all BQ data types are set to String */ fields.add(new TableFieldSchema().setName(checkHeaderName(header)).setType(" } schema.setFields(fields); return schema; } }
  private static String getFileName(ReadableFile file) {
    String csvFileName = file.getMetadata().resourceId().getFilename().toString();
    /** taking out .csv extension from file name e.g fileName.csv->fileName */
    String[] fileKey = csvFileName.split("\\.", 2);

    if (!fileKey[1].equals(ALLOWED_FILE_EXTENSION) || !fileKey[0].matches(TABLE_REGEXP)) {
      throw new RuntimeException(
          "[Filename must contain a CSV extension "
              + " BQ table name must contain only letters, numbers, or underscores ["
              + fileKey[1]
              + "], ["
              + fileKey[0]
              + "]");
    }
    /** returning file name without extension */
    return fileKey[0];
  }
  private static BufferedReader getReader(ReadableFile csvFile) {
    BufferedReader br = null;
    ReadableByteChannel channel = null;
    /** read the file and create buffered reader */
    try {
      channel = csvFile.openSeekable();

    } catch (IOException e) {
      LOG.error("Failed to Read File {}", e.getMessage());
      throw new RuntimeException(e);
    }

    if (channel != null) {

      br = new BufferedReader(Channels.newReader(channel, Charsets.ISO_8859_1.name()));
    }

    return br;
  }
  private static List<String> getFileHeaders(BufferedReader reader) {
    List<String> headers = new ArrayList<>();
    try {
      // The first CSV record is treated as the header row.
      CSVRecord csvHeader = CSVFormat.DEFAULT.parse(reader).getRecords().get(0);
      csvHeader.forEach(
          headerValue -> {
            headers.add(headerValue);
          });
    } catch (IOException e) {
      LOG.error("Failed to get csv header values: {}", e.getMessage());
      throw new RuntimeException(e);
    }
    return headers;
  }

  private static String checkHeaderName(String name) {
    /** some checks to make sure BQ column names don't fail e.g. special characters */
    String checkedHeader = name.replaceAll("\\s", "_");
    checkedHeader = checkedHeader.replaceAll("'", "");
    checkedHeader = checkedHeader.replaceAll("/", "");
    if (!checkedHeader.matches(COLUMN_NAME_REGEXP)) {
      throw new IllegalArgumentException("Column name can't be matched to a valid format " + name);
    }
    return checkedHeader;
  }
}
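The split arithmetic in getInitialRestriction and processElement above can be easier to follow in isolation. The following is a minimal sketch that reproduces only that arithmetic with the Java standard library. The class and method names are hypothetical and no Beam types are used. Clamping the window end to the row count is a simplification of the template, which simply stops reading when the file runs out of rows.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone model of how the CSV rows are divided into batches. */
final class CsvSplitPlanner {
    /** Exclusive upper bound of the initial range [1, upper), as in getInitialRestriction. */
    static int upperBound(int dataRows, int batchSize) {
        int rowCount = Math.max(1, dataRows);
        int totalSplit = rowCount / batchSize;
        // A partial last batch needs one extra unit range.
        return (rowCount % batchSize > 0) ? totalSplit + 2 : totalSplit + 1;
    }

    /** Zero-based [start, end) data-row window for each unit range [i, i + 1). */
    static List<int[]> rowWindows(int dataRows, int batchSize) {
        List<int[]> windows = new ArrayList<>();
        int upper = upperBound(dataRows, batchSize);
        for (int i = 1; i < upper; i++) {
            int endOfLine = i * batchSize;
            int startOfLine = endOfLine - batchSize;
            windows.add(new int[] {startOfLine, Math.min(endOfLine, dataRows)});
        }
        return windows;
    }
}
```

With 100 data rows and a batch size of 15, upperBound returns 8, so the unit ranges are [1, 2) through [7, 8). That gives seven batches, the last of which covers rows 90 to 99.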
Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub (Stream)
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub template is a streaming pipeline that reads Pub/Sub messages with change data from a MySQL database and writes the records to BigQuery. A Debezium connector captures changes to the MySQL database and publishes the changed data to Pub/Sub. The template then reads the Pub/Sub messages and writes them to BigQuery.
You can use this template to sync MySQL databases and BigQuery tables. The pipeline writes the changed data to a BigQuery staging table and intermittently updates a BigQuery table replicating the MySQL database.
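The staging-then-replica flow described above can be pictured with a small in-memory model. This is only an illustrative sketch, not the template's implementation: the class, the field names, and the operation strings are hypothetical, a list stands in for the staging (changelog) table, and a map keyed by primary key stands in for the replica table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of a changelog that is periodically merged into a replica. */
final class ChangelogReplicaSketch {
    /** One change event; these fields are illustrative, not the Debezium schema. */
    static final class Change {
        final String primaryKey;
        final String operation; // "INSERT", "UPDATE", or "DELETE" (assumed values)
        final String row;

        Change(String primaryKey, String operation, String row) {
            this.primaryKey = primaryKey;
            this.operation = operation;
            this.row = row;
        }
    }

    private final List<Change> changelog = new ArrayList<>();          // staging table stand-in
    private final Map<String, String> replica = new LinkedHashMap<>(); // replica table stand-in
    private int applied = 0; // number of changelog entries already merged

    /** Every incoming change is appended to the changelog first. */
    void append(Change change) {
        changelog.add(change);
    }

    /** Periodic step: replays the not-yet-applied changes in arrival order. */
    void updateReplica() {
        for (; applied < changelog.size(); applied++) {
            Change change = changelog.get(applied);
            if ("DELETE".equals(change.operation)) {
                replica.remove(change.primaryKey);
            } else {
                replica.put(change.primaryKey, change.row);
            }
        }
    }

    Map<String, String> replica() {
        return replica;
    }
}
```

In this model the replica lags behind the changelog until updateReplica runs, which mirrors the role that the update interval plays in the template.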
Requirements for this pipeline:
The Debezium connector must be deployed (https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).
The Pub/Sub messages must be serialized in a Beam Row (https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html).
Template parameters
Parameter Description
inputSubscriptions The comma-separated list of Pub/Sub input subscriptions to read from, in the format of , , ...
changeLogDataset The BigQuery dataset to store the staging tables, in the format of
replicaDataset The location of the BigQuery dataset to store the replica tables, in the format of
updateFrequencySecs Optional: The interval at which the pipeline updates the BigQuery table replicating the MySQL database.
Running the Change Data Capture using Debezium and MySQL from Pub/Sub to BigQuery template
To run this template, perform the following steps:
1. On your local machine, clone the DataflowTemplates repository (https://github.com/GoogleCloudPlatform/DataflowTemplates).
2. Change to the v2/cdc-parent directory.
3. Ensure that the Debezium connector is deployed (https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).
4. Using Maven, run the Dataflow template. You must replace the following values in this example:
Replace PROJECT_ID with your project ID.
Replace YOUR_SUBSCRIPTIONS with your comma-separated list of Pub/Sub subscription names.
Replace YOUR_CHANGELOG_DATASET with your BigQuery dataset for changelog data, and replace YOUR_REPLICA_DATASET with your BigQuery dataset for replica tables.
mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
    --inputSubscriptions=YOUR_SUBSCRIPTIONS \
    --updateFrequencySecs=300 \
    --changeLogDataset=YOUR_CHANGELOG_DATASET \
    --replicaDataset=YOUR_REPLICA_DATASET \
    --project=PROJECT_ID"
Apache Kafka to BigQuery
This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).
The Apache Kafka to BigQuery template is a streaming pipeline that ingests text data from Apache Kafka, executes a user-defined function (UDF), and outputs the resulting records to BigQuery. Any errors that occur while transforming the data, executing the UDF, or inserting into the output table are written to a separate errors table in BigQuery. If the errors table does not exist prior to execution, it is created.
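The choice of errors table can be sketched in plain Java. This mirrors the ObjectUtils.firstNonNull call in the template source further down this page: the outputDeadletterTable value is used when it is set, otherwise the output table spec with the _error_records suffix. The class and method names (DeadletterTables, resolve) are hypothetical and exist only for illustration.

```java
import java.util.Objects;

/** Hypothetical helper, not part of the template: resolves the errors table name. */
final class DeadletterTables {
  /** Same value as DEFAULT_DEADLETTER_TABLE_SUFFIX in the template source. */
  static final String DEFAULT_SUFFIX = "_error_records";

  private DeadletterTables() {}

  /**
   * Returns outputDeadletterTable when it is non-null; otherwise returns the
   * output table spec with the default suffix appended.
   */
  static String resolve(String outputDeadletterTable, String outputTableSpec) {
    Objects.requireNonNull(outputTableSpec, "outputTableSpec is required");
    return outputDeadletterTable != null
        ? outputDeadletterTable
        : outputTableSpec + DEFAULT_SUFFIX;
  }

  public static void main(String[] args) {
    // No dead-letter table given: falls back to the suffixed output table.
    System.out.println(resolve(null, "my-project:dataset.table"));
    // Explicit dead-letter table: used as-is.
    System.out.println(
        resolve("my-project:dataset.my-deadletter-table", "my-project:dataset.table"));
  }
}
```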
Requirements for this pipeline
The output BigQuery table must exist.
The Apache Kafka broker server must be running and be reachable from the Dataflow worker machines.
The Apache Kafka topics must exist and the messages must be encoded in a valid JSON format.
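Before launching a job, it can help to sanity-check the Kafka-related values locally. The following is a minimal sketch, assuming the comma-separated formats of the bootstrapServers and inputTopics parameters described on this page. KafkaParams and its methods are hypothetical names, trimming whitespace around entries is an assumption, and a successful TCP connection from your own machine does not prove that the Dataflow workers can reach the broker.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the template: pre-flight checks for Kafka parameters. */
final class KafkaParams {
  /** One broker address split into host and port. */
  static final class HostPort {
    final String host;
    final int port;

    HostPort(String host, int port) {
      this.host = host;
      this.port = port;
    }
  }

  private KafkaParams() {}

  /** Splits a comma-separated value, trimming whitespace and dropping empty entries. */
  static List<String> splitCsv(String value) {
    List<String> parts = new ArrayList<>();
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        parts.add(trimmed);
      }
    }
    return parts;
  }

  /** Parses entries such as 35.70.252.199:9092 and rejects entries without a valid port. */
  static List<HostPort> parseBootstrapServers(String bootstrapServers) {
    List<HostPort> servers = new ArrayList<>();
    for (String entry : splitCsv(bootstrapServers)) {
      int colon = entry.lastIndexOf(':');
      if (colon <= 0 || colon == entry.length() - 1) {
        throw new IllegalArgumentException("Expected host:port but got: " + entry);
      }
      int port = Integer.parseInt(entry.substring(colon + 1));
      if (port < 1 || port > 65535) {
        throw new IllegalArgumentException("Port out of range in: " + entry);
      }
      servers.add(new HostPort(entry.substring(0, colon), port));
    }
    return servers;
  }

  /** Attempts a TCP connection; returns false on timeout or any other I/O failure. */
  static boolean isReachable(HostPort server, int timeoutMillis) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(server.host, server.port), timeoutMillis);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Parsing only; call isReachable(server, 2000) separately to probe a broker.
    for (HostPort server : parseBootstrapServers("35.70.252.199:9092")) {
      System.out.println(server.host + ":" + server.port);
    }
    System.out.println(splitCsv("messages"));
  }
}
```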
Template parameters
Parameter Description
outputTableSpec The BigQuery output table location to write the Apache Kafka messages to, in the format of my-project:dataset.table
inputTopics The Apache Kafka input topics to read from in a comma-separated list. For example: messages
bootstrapServers The host address of the running Apache Kafka broker servers in a comma-separated list, each host address in the format of 35.70.252.199:9092
javascriptTextTransformGcsPath (Optional) Cloud Storage location path to the JavaScript UDF. For example: gs://my_bucket/my_function.js
javascriptTextTransformFunctionName (Optional) The name of the JavaScript function to call as your UDF. For example: transform
outputDeadletterTable (Optional) The BigQuery output table location to write deadletter records to, in the format of my-project:dataset.my-deadletter-table. If it doesn't exist, the table is created during pipeline execution. If not specified, the outputTableSpec value with the suffix _error_records is used instead.
Running the Apache Kafka to BigQuery template
Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)
1. Go to the Dataflow page in the Cloud Console.
Go to the Dataflow page (https://console.cloud.google.com/dataflow)
2. Click Create job from template.
3. Select the Apache Kafka to BigQuery template from the Dataflow template drop-down menu.
4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
5. Enter your parameter values in the provided parameter fields.
6. Click Run Job.
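The job name rule in step 4 can be checked before you open the Console. This is a minimal sketch that applies the regular expression from step 4 to the whole string; the JobNames class is a hypothetical helper, not a Dataflow API.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not a Dataflow API: validates job names locally. */
final class JobNames {
  /** The regular expression quoted in the Console instructions. */
  static final Pattern VALID_JOB_NAME =
      Pattern.compile("[a-z]([-a-z0-9]{0,38}[a-z0-9])?");

  private JobNames() {}

  /** Returns true when the entire name matches the pattern (at most 40 characters). */
  static boolean isValid(String jobName) {
    return jobName != null && VALID_JOB_NAME.matcher(jobName).matches();
  }

  public static void main(String[] args) {
    String[] candidates = {"kafka-to-bigquery", "Kafka-To-BigQuery", "kafka-job-"};
    for (String candidate : candidates) {
      System.out.println(candidate + " -> " + isValid(candidate));
    }
  }
}
```

The pattern requires a lowercase first letter, allows lowercase letters, digits, and hyphens in the middle, and forbids a trailing hyphen.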
Template source code
ster/v2/kafka-to-bigquery/src/main/java/com/google/cloud/teleport/v2/templates/KafkaToBigQuery.java)
/*
 * Copyright (C) 2019 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.FailsafeJsonToTableRow;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters.WriteKafkaMessageErrors;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.collect.ImmutableMap;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.commons.lang3.ObjectUtils;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link KafkaToBigQuery} pipeline is a streaming pipeline which ingests text data from Kafka,
 * executes a UDF, and outputs the resulting records to BigQuery. Any errors which occur in the
 * transformation of the data, execution of the UDF, or inserting into the output table are
 * inserted into a separate errors table in BigQuery. The errors table will be created if it does
 * not exist prior to execution. Both output and error tables are specified by the user as
 * parameters.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The Kafka topic exists and the message is encoded in a valid JSON format.
 *   <li>The BigQuery output table exists.
 *   <li>The Kafka brokers are reachable from the Dataflow worker machines.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set some environment variables
 * PROJECT=my-project
 * TEMP_BUCKET=my-temp-bucket
 * OUTPUT_TABLE=${PROJECT}:my_dataset.my_table
 * TOPICS=my-topics
 * JS_PATH=my-js-path-on-gcs
 * JS_FUNC_NAME=my-js-func-name
 * BOOTSTRAP=my-comma-separated-bootstrap-servers
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Build and upload image
 * mvn clean package \
 * -Dimage=${TARGET_GCR_IMAGE} \
 * -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 * -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 * -Dapp-root=${APP_ROOT} \
 * -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create an image spec in GCS that contains the path to the image
 * {
 *    "docker_template_spec": {
 *       "docker_image": $TARGET_GCR_IMAGE
 *     }
 *  }
 *
 * # Execute template:
 * API_ROOT_URL="https://dataflow.googleapis.com"
 * TEMPLATES_LAUNCH_API="${API_ROOT_URL}/v1b3/projects/${PROJECT}/templates:launch"
 * JOB_NAME="kafka-to-bigquery`date +%Y%m%d-%H%M%S-%N`"
 *
 * time curl -X POST -H "Content-Type: application/json" \
 *     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 *     "${TEMPLATES_LAUNCH_API}"`
 *     `"?validateOnly=false"`
 *     `"&dynamicTemplate.gcsPath=${TEMP_BUCKET}/path/to/image-spec"`
 *     `"&dynamicTemplate.stagingLocation=${TEMP_BUCKET}/staging" \
 *     -d '
 *      {
 *       "jobName":"'$JOB_NAME'",
 *       "parameters": {
 *           "outputTableSpec":"'$OUTPUT_TABLE'",
 *           "inputTopics":"'$TOPICS'",
 *           "javascriptTextTransformGcsPath":"'$JS_PATH'",
 *           "javascriptTextTransformFunctionName":"'$JS_FUNC_NAME'",
 *           "bootstrapServers":"'$BOOTSTRAP'"
 *        }
 *       }
 *      '
 * </pre>
 */
public class KafkaToBigQuery {

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(KafkaToBigQuery.class);

  /** The tag for the main output for the UDF. */
  private static final TupleTag

  /** The tag for the main output of the json transformation. */
  static final TupleTag

  /** The tag for the dead-letter output of the udf. */
  static final TupleTag

  /** The tag for the dead-letter output of the json to table row transform. */
  static final TupleTag

  /** The default suffix for error tables if dead letter table is not specified. */
  private static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

  /** String/String Coder for FailsafeElement. */
  private static final FailsafeElementCoder

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options extends PipelineOptions {

    @Description("Table spec to write the output to")
    @Required
    String getOutputTableSpec();

    void setOutputTableSpec(String outputTableSpec);

    @Description("Kafka Bootstrap Servers")
    @Required
    String getBootstrapServers();

    void setBootstrapServers(String bootstrapServers);

    @Description("Kafka topic(s) to read the input from")
    @Required
    String getInputTopics();

    void setInputTopics(String inputTopics);

    @Description(
        "The dead-letter table to output to within BigQuery in

    void setOutputDeadletterTable(String outputDeadletterTable);

    @Description("Gcs path to javascript udf source")
    String getJavascriptTextTransformGcsPath();

    void setJavascriptTextTransformGcsPath(String javascriptTextTransformGcsPath);

    @Description("UDF Javascript Function Name")
    String getJavascriptTextTransformFunctionName();

    void setJavascriptTextTransformFunctionName(String javascriptTextTransformFunctionName);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for its execution to finish. If blocking execution is required, use the {@link
   * KafkaToBigQuery#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coder for pipeline
    FailsafeElementCoder
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(coder.getEncodedTypeDescriptor(), coder);

    List

    /*
     * Steps:
     *  1) Read messages in from Kafka
     *  2) Transform the messages into TableRows
     *     - Transform message payload via UDF
     *     - Convert UDF result to TableRow objects
     *  3) Write successful records out to BigQuery
     *  4) Write failed records out to BigQuery
     */
    PCollectionTuple convertedTableRows =
        pipeline
            /*
             * Step #1: Read messages in from Kafka
             */
            .apply(
                "ReadFromKafka",
                KafkaIO.

            /*
             * Step #2: Transform the Kafka Messages into TableRows
             */
            .apply("ConvertMessageToTableRow", new MessageToTableRow(options));

    /*
     * Step #3: Write the successful records out to BigQuery
     */
    WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .to(options.getOutputTableSpec()));

    /*
     * Step 3 Contd.
     * Elements that failed inserts into BigQuery are extracted and converted to FailsafeElement
     */
    PCollection
            .setCoder(FAILSAFE_ELEMENT_CODER);

    /*
     * Step #4: Write failed records out to BigQuery
     */
    PCollectionList.of(convertedTableRows.get(UDF_DEADLETTER_OUT))
        .and(convertedTableRows.get(TRANSFORM_DEADLETTER_OUT))
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "WriteTransformationFailedRecords",
            WriteKafkaMessageErrors.newBuilder()
                .setErrorRecordsTable(
                    ObjectUtils.firstNonNull(
                        options.getOutputDeadletterTable(),
                        options.getOutputTableSpec() + DEFAULT_DEADLETTER_TABLE_SUFFIX))
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    /*
     * Step #5: Insert records that failed BigQuery inserts into a deadletter table.
     */
    failedInserts.apply(
        "WriteInsertionFailedRecords",
        ErrorConverters.WriteStringMessageErrors.newBuilder()
            .setErrorRecordsTable(
                ObjectUtils.firstNonNull(
                    options.getOutputDeadletterTable(),
                    options.getOutputTableSpec() + DEFAULT_DEADLETTER_TABLE_SUFFIX))
            .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
            .build());

    return pipeline.run();
  }

  /**
   * The {@link MessageToTableRow} class is a {@link PTransform} which transforms incoming Kafka
   * Message objects into {@link TableRow} objects for insertion into BigQuery while applying a UDF
   * to the input. The executions of the UDF and transformation to {@link TableRow} objects are done
   * in a fail-safe way by wrapping the element with its original payload inside the {@link
   * FailsafeElement} class. The {@link MessageToTableRow} transform will output a {@link
   * PCollectionTuple} which contains all output and dead-letter {@link PCollection}.
   *
   * The {@link PCollectionTuple} output will contain the following {@link PCollection}:
   *
   *
    private final Options options;

    MessageToTableRow(Options options) {
      this.options = options;
    }

    @Override
    public PCollectionTuple expand(PCollection

      PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover from failures
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new MessageToFailsafeElementFn()))
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.

      // Convert the records which were successfully processed by the UDF into TableRow objects.
      PCollectionTuple jsonToTableRowOut =
          udfOut
              .get(UDF_OUT)
              .apply(
                  "JsonToTableRow",
                  FailsafeJsonToTableRow.
                      .build());

      // Re-wrap the PCollections so we can return a single PCollectionTuple
      return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT))
          .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT))
          .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT))
          .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_OUT));
    }
  }

  /**
   * The {@link MessageToFailsafeElementFn} wraps a Kafka Message with the {@link FailsafeElement}
   * class so errors can be recovered from and the original message can be output to an error records
   * table.
   */
  static class MessageToFailsafeElementFn extends DoFn

    @ProcessElement
    public void processElement(ProcessContext context) {
      KV

  /**
   * Method to wrap a {@link BigQueryInsertError} into a {@link FailsafeElement}.
   *
   * @param insertError BigQueryInsert error.
   * @return FailsafeElement object.
   */
  protected static FailsafeElement

    FailsafeElement

      failsafeElement =
          FailsafeElement.of(
              insertError.getRow().toPrettyString(), insertError.getRow().toPrettyString());
      failsafeElement.setErrorMessage(insertError.getError().toPrettyString());

    } catch (IOException e) {
      LOG.error("Failed to wrap BigQuery insert error.");
      throw new RuntimeException(e);
    }
    return failsafeElement;
  }
}
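The fail-safe pattern that MessageToTableRow relies on can be illustrated without Beam. The following is a simplified sketch, not the template's implementation: each message keeps its original payload, a transform (standing in for the UDF) is applied, and any message whose transform throws is routed to a dead-letter list together with the error message. FailsafeSketch, Element, Result, and applyFailsafe are hypothetical names, and plain lists stand in for the PCollection outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of fail-safe wrapping with dead-letter routing. */
final class FailsafeSketch {
  /** Simplified stand-in for FailsafeElement: original payload, current payload, error. */
  static final class Element {
    final String originalPayload;
    final String payload;
    final String errorMessage; // null when processing succeeded

    Element(String originalPayload, String payload, String errorMessage) {
      this.originalPayload = originalPayload;
      this.payload = payload;
      this.errorMessage = errorMessage;
    }
  }

  /** Holds the main output and the dead-letter output of one processing step. */
  static final class Result {
    final List<Element> output = new ArrayList<>();
    final List<Element> deadLetter = new ArrayList<>();
  }

  private FailsafeSketch() {}

  /** Applies the transform to every message; failures keep the original payload. */
  static Result applyFailsafe(List<String> messages, Function<String, String> transform) {
    Result result = new Result();
    for (String message : messages) {
      try {
        result.output.add(new Element(message, transform.apply(message), null));
      } catch (RuntimeException e) {
        // The original message survives so it can be written to an errors table.
        result.deadLetter.add(new Element(message, message, e.getMessage()));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Function<String, String> udf =
        s -> {
          if (s.trim().isEmpty()) {
            throw new IllegalArgumentException("empty message");
          }
          return s.toUpperCase();
        };
    Result result = applyFailsafe(Arrays.asList("{\"id\": 1}", " ", "{\"id\": 2}"), udf);
    System.out.println("succeeded: " + result.output.size());
    System.out.println("dead-lettered: " + result.deadLetter.size());
  }
}
```

Keeping the original payload next to the error is what allows a failed record to be inspected or replayed later from the errors table.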
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies (https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2020-08-20 UTC.