
Google-provided streaming templates

Google provides a set of open-source (https://github.com/GoogleCloudPlatform/DataflowTemplates) Dataflow templates. For general information about templates, see the Overview (/dataflow/docs/guides/templates/overview) page. For a list of all Google-provided templates, see the Get started with Google-provided templates (/dataflow/docs/guides/templates/provided-templates) page.

This page documents streaming templates:

Pub/Sub Subscription to BigQuery (/dataflow/docs/guides/templates/provided-streaming#cloudpubsubsubscriptiontobigquery)

Pub/Sub Topic to BigQuery (/dataflow/docs/guides/templates/provided-streaming#cloudpubsubtobigquery)

Pub/Sub Avro to BigQuery (#cloudpubsubavrotobigquery)

Pub/Sub to Pub/Sub (#cloudpubsubtocloudpubsub)

Pub/Sub to Splunk (#cloudpubsubtosplunk)

Pub/Sub to Cloud Storage Avro (#cloudpubsubtoavro)

Pub/Sub to Cloud Storage Text (#cloudpubsubtogcstext)

Pub/Sub to MongoDB (#cloudpubsubtomongodb)

Cloud Storage Text to BigQuery (Stream) (#gcstexttobigquerystream)

Cloud Storage Text to Pub/Sub (Stream) (#gcstexttocloudpubsubstream)

Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery (Stream) (#dlptexttobigquerystreaming)

Change Data Capture to BigQuery (Stream) (#change-data-capture)

Apache Kafka to BigQuery (#kafka-to-bigquery)

Pub/Sub Subscription to BigQuery


This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub Subscription to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub subscription and writes them to a BigQuery table. You can use the template as a quick solution to move Pub/Sub data to BigQuery. The template reads JSON-formatted messages from Pub/Sub and converts them to BigQuery elements.

Requirements for this pipeline:

The Pub/Sub messages must be in JSON format, described here. (https://developers.google.com/api-client-library/java/google-http-java-client/json) For example, messages formatted as {"k1":"v1", "k2":"v2"} may be inserted into a BigQuery table with two columns, named k1 and k2, with string data type.

The output table must exist prior to running the pipeline.
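To satisfy the second requirement, you can create the destination table ahead of time, for example with the bq command-line tool. The sketch below uses placeholder project, dataset, and table names, and a two-column STRING schema that simply mirrors the {"k1":"v1", "k2":"v2"} example above.

# Placeholder project, dataset, and table names.
bq mk --dataset my-project:my_dataset
# Two STRING columns matching the example message {"k1":"v1", "k2":"v2"}.
bq mk --table my-project:my_dataset.my_table k1:STRING,k2:STRING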

Template parameters

Parameter          Description
inputSubscription  The Pub/Sub input subscription to read from, in the format of projects/<project>/subscriptions/<subscription>.
outputTableSpec    The BigQuery output table location, in the format of <project>:<dataset>.<table>.

Running the Pub/Sub Subscription to BigQuery template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Pub/Sub Subscription to BigQuery template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
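The GCLOUD tab launches the same job from the command line. A minimal sketch, assuming the Google-hosted template path gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery and placeholder project, subscription, and table names:

gcloud dataflow jobs run pubsub-subscription-to-bq-job \
    --gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
    --region us-central1 \
    --parameters \
inputSubscription=projects/my-project/subscriptions/my-subscription,\
outputTableSpec=my-project:my_dataset.my_table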

Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java)

/* * Copyright (C) 2018 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */

package com.google.cloud.teleport.templates;

import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQue

import com.google.api.services.bigquery.model.TableRow;


import com.google.cloud.teleport.coders.FailsafeElementCoder; import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToT import com.google.cloud.teleport.templates.common.ErrorConverters; import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Failsafe import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Javascri import com.google.cloud.teleport.util.DualInputNestedValueProvider; import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput; import com.google.cloud.teleport.util.ResourceUtils; import com.google.cloud.teleport.util.ValueProviderUtils; import com.google.cloud.teleport.values.FailsafeElement; import com.google.common.collect.ImmutableList; import java.nio.charset.StandardCharsets; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.coders.CoderRegistry; import org.apache.beam.sdk.coders.StringUtf8Coder; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError; import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy; import org.apache.beam.sdk.io.gcp.bigquery.WriteResult; import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder; import org.apache.beam.sdk.options.Default; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.ValueProvider; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.Flatten; import org.apache.beam.sdk.transforms.MapElements; import org.apache.beam.sdk.transforms.PTransform; import org.apache.beam.sdk.transforms.ParDo; import org.apache.beam.sdk.transforms.SerializableFunction; import org.apache.beam.sdk.values.PCollection; import org.apache.beam.sdk.values.PCollectionList; import org.apache.beam.sdk.values.PCollectionTuple; import org.apache.beam.sdk.values.TupleTag; import org.slf4j.Logger; import org.slf4j.LoggerFactory;

/** * The {@link PubSubToBigQuery} pipeline is a streaming pipeline which ingests data


* from Cloud Pub/Sub, executes a UDF, and outputs the resulting records to BigQuery * which occur in the transformation of the data or execution of the UDF will be out * separate errors table in BigQuery. The errors table will be created if it does no * execution. Both output and error tables are specified by the user as template par * *

Pipeline Requirements * *

    *
  • The Pub/Sub topic exists. *
  • The BigQuery output table exists. *
* *

Example Usage * *

 * # Set the pipeline vars * PROJECT_ID=PROJECT ID HERE * BUCKET_NAME=BUCKET NAME HERE * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read * from a Pub/Sub Subscription or a Pub/Sub Topic. * * # Set the runner * RUNNER=DataflowRunner * * # Build the template * mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \ * -Dexec.cleanupDaemonThreads=false \ * -Dexec.args=" \ * --project=${PROJECT_ID} \ * --stagingLocation=${PIPELINE_FOLDER}/staging \ * --tempLocation=${PIPELINE_FOLDER}/temp \ * --templateLocation=${PIPELINE_FOLDER}/template \ * --runner=${RUNNER} * --useSubscription=${USE_SUBSCRIPTION} * " * * # Execute the template * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"` * * # Execute a pipeline to read from a Topic. * gcloud dataflow jobs run ${JOB_NAME} \ * --gcs-location=${PIPELINE_FOLDER}/template \ * --zone=us-east1-d \


* --parameters \ * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\ * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\ * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table" * * # Execute a pipeline to read from a Subscription. * gcloud dataflow jobs run ${JOB_NAME} \ * --gcs-location=${PIPELINE_FOLDER}/template \ * --zone=us-east1-d \ * --parameters \ * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\ * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\ * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table" *

*/ public class PubSubToBigQuery {

/** The log to output status messages to. */ private static final Logger LOG = LoggerFactory.getLogger(PubSubToBigQuery.class);

/** The tag for the main output for the UDF. */ public static final TupleTag> UDF_OUT = new TupleTag>() {};

/** The tag for the main output of the transformation. */ public static final TupleTag TRANSFORM_OUT = new TupleTag() {}

/** The tag for the dead-letter output of the udf. */ public static final TupleTag> UDF_DEADLETTE new TupleTag>() {};

/** The tag for the dead-letter output of the json to table row transform. */ public static final TupleTag> TRANSFORM_DEA new TupleTag>() {};

/** The default suffix for error tables if dead letter table is not specified. */ public static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

/** Pubsub message/string coder for pipeline. */ public static final FailsafeElementCoder CODER = FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder

/** String/String Coder for FailsafeElement. */ public static final FailsafeElementCoder FAILSAFE_ELEMENT_CODER = FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());


/** * The {@link Options} class provides the custom execution options passed by the e * command-line. */ public interface Options extends PipelineOptions, JavascriptTextTransformerOptions @Description("Table spec to write the output to") ValueProvider getOutputTableSpec();

void setOutputTableSpec(ValueProvider value);

@Description("Pub/Sub topic to read the input from") ValueProvider getInputTopic();

void setInputTopic(ValueProvider value);

@Description( "The Cloud Pub/Sub subscription to consume from. " + "The name should be in the format of " + "projects//subscriptions/.") ValueProvider getInputSubscription();

void setInputSubscription(ValueProvider value);

@Description( "This determines whether the template reads from " + "a pub/sub subscription @Default.Boolean(false) Boolean getUseSubscription();

void setUseSubscription(Boolean value);

@Description( "The dead-letter table to output to within BigQuery in : getOutputDeadletterTable();

void setOutputDeadletterTable(ValueProvider value); }

/** * The main entry-point for pipeline execution. This method will start the pipelin * wait for it's execution to finish. If blocking execution is required, use the { * PubSubToBigQuery#run(Options)} method to start the pipeline and invoke {@code * result.waitUntilFinish()} on the {@link PipelineResult}. *


* @param args The command-line args passed by the executor. */ public static void main(String[] args) { Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Opti

run(options); }

/** * Runs the pipeline to completion with the specified options. This method does no * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} * object to block until the pipeline is finished running if blocking programmatic * required. * * @param options The execution options. * @return The pipeline result. */ public static PipelineResult run(Options options) {

Pipeline pipeline = Pipeline.create(options);

CoderRegistry coderRegistry = pipeline.getCoderRegistry(); coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

/* * Steps: * 1) Read messages in from Pub/Sub * 2) Transform the PubsubMessages into TableRows * - Transform message payload via UDF * - Convert UDF result to TableRow objects * 3) Write successful records out to BigQuery * 4) Write failed records out to BigQuery */

/* * Step #1: Read messages in from Pub/Sub * Either from a Subscription or Topic */

PCollection messages = null; if (options.getUseSubscription()) { messages = pipeline.apply( "ReadPubSubSubscription", PubsubIO.readMessagesWithAttributes()


.fromSubscription(options.getInputSubscription())); } else { messages = pipeline.apply( "ReadPubSubTopic", PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic( }

PCollectionTuple convertedTableRows = messages /* * Step #2: Transform the PubsubMessages into TableRows */ .apply("ConvertMessageToTableRow", new PubsubMessageToTableRow(options))

/* * Step #3: Write the successful records out to BigQuery */ WriteResult writeResult = convertedTableRows .get(TRANSFORM_OUT) .apply( "WriteSuccessfulRecords", BigQueryIO.writeTableRows() .withoutValidation() .withCreateDisposition(CreateDisposition.CREATE_NEVER) .withWriteDisposition(WriteDisposition.WRITE_APPEND) .withExtendedErrorInfo() .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS) .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErr .to(options.getOutputTableSpec()));

/* * Step 3 Contd. * Elements that failed inserts into BigQuery are extracted and converted to Fai */ PCollection> failedInserts = writeResult .getFailedInsertsWithErr() .apply( "WrapInsertionErrors", MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor()) .via((BigQueryInsertError e) -> wrapBigQueryInsertError(e))) .setCoder(FAILSAFE_ELEMENT_CODER);


/* * Step #4: Write records that failed table row transformation * or conversion out to BigQuery deadletter table. */ PCollectionList.of( ImmutableList.of( convertedTableRows.get(UDF_DEADLETTER_OUT), convertedTableRows.get(TRANSFORM_DEADLETTER_OUT))) .apply("Flatten", Flatten.pCollections()) .apply( "WriteFailedRecords", ErrorConverters.WritePubsubMessageErrors.newBuilder() .setErrorRecordsTable( ValueProviderUtils.maybeUseDefaultDeadletterTable( options.getOutputDeadletterTable(), options.getOutputTableSpec(), DEFAULT_DEADLETTER_TABLE_SUFFIX)) .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJs .build());

// 5) Insert records that failed insert into deadletter table failedInserts.apply( "WriteFailedRecords", ErrorConverters.WriteStringMessageErrors.newBuilder() .setErrorRecordsTable( ValueProviderUtils.maybeUseDefaultDeadletterTable( options.getOutputDeadletterTable(), options.getOutputTableSpec(), DEFAULT_DEADLETTER_TABLE_SUFFIX)) .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson() .build());

return pipeline.run(); }

/** * If deadletterTable is available, it is returned as is, otherwise outputTableSpe * defaultDeadLetterTableSuffix is returned instead. */ private static ValueProvider maybeUseDefaultDeadletterTable( ValueProvider deadletterTable, ValueProvider outputTableSpec, String defaultDeadLetterTableSuffix) { return DualInputNestedValueProvider.of( deadletterTable,


outputTableSpec, new SerializableFunction, String>() { @Override public String apply(TranslatorInput input) { String userProvidedTable = input.getX(); String outputTableSpec = input.getY(); if (userProvidedTable == null) { return outputTableSpec + defaultDeadLetterTableSuffix; } return userProvidedTable; } }); }

/** * The {@link PubsubMessageToTableRow} class is a {@link PTransform} which transfo * {@link PubsubMessage} objects into {@link TableRow} objects for insertion into * applying an optional UDF to the input. The executions of the UDF and transforma * TableRow} objects is done in a fail-safe way by wrapping the element with it's * inside the {@link FailsafeElement} class. The {@link PubsubMessageToTableRow} t * output a {@link PCollectionTuple} which contains all output and dead-letter {@l * PCollection}. * *

The {@link PCollectionTuple} output will contain the following {@link PColle * *

    *
  • {@link PubSubToBigQuery#UDF_OUT} - Contains all {@link FailsafeElement} r * successfully processed by the optional UDF. *
  • {@link PubSubToBigQuery#UDF_DEADLETTER_OUT} - Contains all {@link Failsaf * records which failed processing during the UDF execution. *
  • {@link PubSubToBigQuery#TRANSFORM_OUT} - Contains all records successfull * JSON to {@link TableRow} objects. *
  • {@link PubSubToBigQuery#TRANSFORM_DEADLETTER_OUT} - Contains all {@link F * records which couldn't be converted to table rows. *
*/ static class PubsubMessageToTableRow extends PTransform, PCollectionTuple> {

private final Options options;

PubsubMessageToTableRow(Options options) { this.options = options; }


@Override public PCollectionTuple expand(PCollection input) {

PCollectionTuple udfOut = input // Map the incoming messages into FailsafeElements so we can recover f // across multiple transforms. .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()) .apply( "InvokeUDF", FailsafeJavascriptUdf.newBuilder() .setFileSystemPath(options.getJavascriptTextTransformGcsPath() .setFunctionName(options.getJavascriptTextTransformFunctionNam .setSuccessTag(UDF_OUT) .setFailureTag(UDF_DEADLETTER_OUT) .build());

// Convert the records which were successfully processed by the UDF into Table PCollectionTuple jsonToTableRowOut = udfOut .get(UDF_OUT) .apply( "JsonToTableRow", FailsafeJsonToTableRow.newBuilder() .setSuccessTag(TRANSFORM_OUT) .setFailureTag(TRANSFORM_DEADLETTER_OUT) .build());

// Re-wrap the PCollections so we can return a single PCollectionTuple return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT)) .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT)) .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT)) .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_ } }

/** * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMes * {@link FailsafeElement} class so errors can be recovered from and the original * output to a error records table. */ static class PubsubMessageToFailsafeElementFn extends DoFn> { @ProcessElement public void processElement(ProcessContext context) {


PubsubMessage message = context.element(); context.output( FailsafeElement.of(message, new String(message.getPayload(), StandardChars } } }

Pub/Sub Topic to BigQuery

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub Topic to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub topic and writes them to a BigQuery table. You can use the template as a quick solution to move Pub/Sub data to BigQuery. The template reads JSON-formatted messages from Pub/Sub and converts them to BigQuery elements.

Requirements for this pipeline:

The Pub/Sub messages must be in JSON format, described here. (https://developers.google.com/api-client-library/java/google-http-java-client/json) For example, messages formatted as {"k1":"v1", "k2":"v2"} may be inserted into a BigQuery table with two columns, named k1 and k2, with string data type.

The output table must exist prior to pipeline execution.

Template parameters

Parameter        Description
inputTopic       The Pub/Sub input topic to read from, in the format of projects/<project>/topics/<topic>.
outputTableSpec  The BigQuery output table location, in the format of <project>:<dataset>.<table>.


Running the Pub/Sub Topic to BigQuery template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Pub/Sub Topic to BigQuery template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
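The equivalent gcloud launch is sketched below, assuming the Google-hosted template path gs://dataflow-templates/latest/PubSub_to_BigQuery and placeholder project, topic, and table names:

gcloud dataflow jobs run pubsub-topic-to-bq-job \
    --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --region us-central1 \
    --parameters \
inputTopic=projects/my-project/topics/my-topic,\
outputTableSpec=my-project:my_dataset.my_table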

Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java)

The Pub/Sub Topic to BigQuery template is built from the same PubSubToBigQuery.java source file as the Pub/Sub Subscription to BigQuery template; see the full listing in the previous section.

Pub/Sub Avro to BigQuery

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub Avro to BigQuery template is a streaming pipeline that ingests Avro data from a Pub/Sub subscription into a BigQuery table. Any errors which occur while writing to the BigQuery table are streamed into a Pub/Sub dead-letter topic.

Requirements for this pipeline:

The input Pub/Sub subscription must exist.

The schema file for the Avro records must exist on Cloud Storage.

The dead-letter Pub/Sub topic must exist.

The output BigQuery dataset must exist.
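For example, the schema file referenced by the schemaPath parameter is a plain Avro schema (.avsc) uploaded to Cloud Storage. The record name, fields, and bucket below are placeholders only:

cat > schema.avsc <<'EOF'
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "k1", "type": "string"},
    {"name": "k2", "type": "string"}
  ]
}
EOF
gsutil cp schema.avsc gs://my-bucket/schemas/schema.avsc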


Template parameters

Parameter          Description
schemaPath         The Cloud Storage location of the Avro schema file. For example, gs://path/to/my/schema.avsc.
inputSubscription  The Pub/Sub input subscription to read from. For example, projects/<project>/subscriptions/<subscription>.
outputTopic        The Pub/Sub topic to use as a dead-letter for failed records. For example, projects/<project>/topics/<topic>.
outputTableSpec    The BigQuery output table location. For example, <project>:<dataset>.<table>. Depending on the createDisposition specified, the output table may be created automatically using the user-provided Avro schema.
writeDisposition   (Optional) The BigQuery WriteDisposition. For example, WRITE_APPEND, WRITE_EMPTY or WRITE_TRUNCATE. Default: WRITE_APPEND
createDisposition  (Optional) The BigQuery CreateDisposition. For example, CREATE_IF_NEEDED, CREATE_NEVER. Default: CREATE_IF_NEEDED

Running the Pub/Sub Avro to BigQuery template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Pub/Sub Avro to BigQuery template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
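This is one of the newer (v2) templates, which are typically launched as Flex Templates rather than classic templates. A rough gcloud sketch, assuming a hypothetical template spec file and placeholder resource names:

gcloud beta dataflow flex-template run pubsub-avro-to-bq-job \
    --template-file-gcs-location gs://path/to/pubsub-avro-to-bigquery-spec.json \
    --region us-central1 \
    --parameters \
schemaPath=gs://my-bucket/schemas/schema.avsc,\
inputSubscription=projects/my-project/subscriptions/my-subscription,\
outputTableSpec=my-project:my_dataset.my_table,\
outputTopic=projects/my-project/topics/my-deadletter-topic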

Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/pubsub-binary-to-bigquery/src/main/java/com/google/cloud/teleport/v2/templates/PubsubAvroToBigQuery.java)

/* * Copyright (C) 2020 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */ package com.google.cloud.teleport.v2.templates;

import com.google.cloud.teleport.v2.options.BigQueryCommonOptions.WriteOptions; import com.google.cloud.teleport.v2.options.PubsubCommonOptions.ReadSubscriptionOpti import com.google.cloud.teleport.v2.options.PubsubCommonOptions.WriteTopicOptions; import com.google.cloud.teleport.v2.transforms.BigQueryConverters; import com.google.cloud.teleport.v2.transforms.ErrorConverters; import com.google.cloud.teleport.v2.utils.SchemaUtils; import org.apache.avro.Schema; import org.apache.avro.generic.GenericRecord; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.coders.AvroCoder; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition; import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;


import org.apache.beam.sdk.io.gcp.bigquery.WriteResult; import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.Validation.Required;

/** * A Dataflow pipeline to stream Apache Avro * Pub/Sub into a BigQuery table. * *

Any persistent failures while writing to BigQuery will be written to a Pub/Sub * topic. */ public final class PubsubAvroToBigQuery {

/** * Validates input flags and executes the Dataflow pipeline. * * @param args command line arguments to the pipeline */ public static void main(String[] args) { PubsubAvroToBigQueryOptions options = PipelineOptionsFactory.fromArgs(args).withValidation() .as(PubsubAvroToBigQueryOptions.class);

run(options); }

/** * Provides custom {@link org.apache.beam.sdk.options.PipelineOptions} required to * {@link PubsubAvroToBigQuery} pipeline. */ public interface PubsubAvroToBigQueryOptions extends ReadSubscriptionOptions, WriteOptions, WriteTopicOptions {

@Description("GCS path to Avro schema file.") @Required String getSchemaPath();

void setSchemaPath(String schemaPath); }

/** * Runs the pipeline with the supplied options. *


* @param options execution parameters to the pipeline * @return result of the pipeline execution as a {@link PipelineResult} */ private static PipelineResult run(PubsubAvroToBigQueryOptions options) {

// Create the pipeline. Pipeline pipeline = Pipeline.create(options);

Schema schema = SchemaUtils.getAvroSchema(options.getSchemaPath());

WriteResult writeResults = pipeline .apply( "Read Avro records", PubsubIO .readAvroGenericRecords(schema) .fromSubscription(options.getInputSubscription()))

.apply( "Write to BigQuery", BigQueryIO.write() .to(options.getOutputTableSpec()) .useBeamSchema() .withMethod(Method.STREAMING_INSERTS) .withWriteDisposition(WriteDisposition.valueOf(options.getWriteD .withCreateDisposition( CreateDisposition.valueOf(options.getCreateDisposition())) .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErr .withExtendedErrorInfo());

writeResults .getFailedInsertsWithErr() .apply( "Create error payload", ErrorConverters.BigQueryInsertErrorToPubsubMessage.newBui .setPayloadCoder(AvroCoder.of(schema)) .setTranslateFunction( BigQueryConverters.TableRowToGenericRecordFn.of(schema)) .build()) .apply( "Write failed records", PubsubIO.writeMessages().to(options.getOutputTopic()));

// Execute the pipeline and return the result. return pipeline.run();


} }

Pub/Sub to Pub/Sub

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub to Pub/Sub template is a streaming pipeline that reads messages from a Pub/Sub subscription and writes the messages to another Pub/Sub topic. The pipeline also accepts an optional message attribute key and a value that can be used to filter the messages that should be written to the Pub/Sub topic. You can use this template to copy messages from a Pub/Sub subscription to another Pub/Sub topic with an optional message filter.

Requirements for this pipeline:

The source Pub/Sub subscription must exist prior to execution.

The destination Pub/Sub topic must exist prior to execution.

Template parameters

Parameter          Description
inputSubscription  Pub/Sub subscription to read the input from. For example, projects/<project>/subscriptions/<subscription>.
outputTopic        Cloud Pub/Sub topic to write the output to. For example, projects/<project>/topics/<topic>.
filterKey          [Optional] Filter events based on an attribute key. No filters are applied if filterKey is not specified.
filterValue        [Optional] Filter attribute value to use in case a filterKey is provided. A null filterValue is used by default.


Running the Pub/Sub to Pub/Sub template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Pub/Sub to Pub/Sub template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
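A minimal gcloud sketch of the same launch, assuming the Google-hosted template path gs://dataflow-templates/latest/Cloud_PubSub_to_Cloud_PubSub; the attribute key and value used for filtering, along with the resource names, are placeholders:

gcloud dataflow jobs run pubsub-to-pubsub-job \
    --gcs-location gs://dataflow-templates/latest/Cloud_PubSub_to_Cloud_PubSub \
    --region us-central1 \
    --parameters \
inputSubscription=projects/my-project/subscriptions/my-subscription,\
outputTopic=projects/my-project/topics/my-topic,\
filterKey=severity,\
filterValue=ERROR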

Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToPubsub.java)

/* * Copyright (C) 2018 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software


* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */

package com.google.cloud.teleport.templates;

import static com.google.common.base.Preconditions.checkNotNull; import static org.apache.beam.vendor.guava.v20_0.com.google.common.base.Precondition

import com.google.auto.value.AutoValue; import java.util.regex.Pattern; import java.util.regex.PatternSyntaxException; import javax.annotation.Nullable; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage; import org.apache.beam.sdk.metrics.Counter; import org.apache.beam.sdk.metrics.Metrics; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.StreamingOptions; import org.apache.beam.sdk.options.Validation; import org.apache.beam.sdk.options.ValueProvider; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.ParDo; import org.slf4j.Logger; import org.slf4j.LoggerFactory;

/** An template that copies messages from one Pubsub subscription to another Pubsub public class PubsubToPubsub {

/** * Main entry point for executing the pipeline. * * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {

// Parse the user options passed from the command-line Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Opti


options.setStreaming(true);

run(options); }

/** * Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(Options options) { // Create the pipeline Pipeline pipeline = Pipeline.create(options);

/** * Steps: * 1) Read PubSubMessage with attributes from input PubSub subscription. * 2) Apply any filters if an attribute=value pair is provided. * 3) Write each PubSubMessage to output PubSub topic. */ pipeline .apply( "Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputS .apply( "Filter Events If Enabled", ParDo.of( ExtractAndFilterEventsFn.newBuilder() .withFilterKey(options.getFilterKey()) .withFilterValue(options.getFilterValue()) .build())) .apply("Write PubSub Events", PubsubIO.writeMessages().to(options.getOutputT

// Execute the pipeline and return the result. return pipeline.run(); }

/** * Options supported by {@link PubsubToPubsub}. * *

Inherits standard configuration options. */ public interface Options extends PipelineOptions, StreamingOptions { @Description(


"The Cloud Pub/Sub subscription to consume from. " + "The name should be in the format of " + "projects//subscriptions/.") @Validation.Required ValueProvider getInputSubscription();

void setInputSubscription(ValueProvider inputSubscription);

@Description( "The Cloud Pub/Sub topic to publish to. " + "The name should be in the format of " + "projects//topics/.") @Validation.Required ValueProvider getOutputTopic();

void setOutputTopic(ValueProvider outputTopic);

@Description( "Filter events based on an optional attribute key. " + "No filters are applied if a filterKey is not specified.") @Validation.Required ValueProvider getFilterKey();

void setFilterKey(ValueProvider filterKey);

@Description( "Filter attribute value to use in case a filterKey is provided. Accepts a va + " string as a filterValue. In case a regex is provided, the complete e + " should match in order for the message to be filtered. Partial matche + " substring) will not be filtered. A null filterValue is used by defau @Validation.Required ValueProvider getFilterValue();

void setFilterValue(ValueProvider filterValue); }

/** * DoFn that will determine if events are to be filtered. If filtering is enabled, * publish events that pass the filter else, it will publish all input events. */ @AutoValue public abstract static class ExtractAndFilterEventsFn extends DoFn

private static final Logger LOG = LoggerFactory.getLogger(ExtractAndFilterEvents


// Counter tracking the number of incoming Pub/Sub messages. private static final Counter INPUT_COUNTER = Metrics .counter(ExtractAndFilterEventsFn.class, "inbound-messages");

// Counter tracking the number of output Pub/Sub messages after the user provide // is applied. private static final Counter OUTPUT_COUNTER = Metrics .counter(ExtractAndFilterEventsFn.class, "filtered-outbound-messages");

private Boolean doFilter; private String inputFilterKey; private Pattern inputFilterValueRegex; private Boolean isNullFilterValue;

public static Builder newBuilder() { return new AutoValue_PubsubToPubsub_ExtractAndFilterEventsFn.Builder(); }

@Nullable abstract ValueProvider filterKey();

@Nullable abstract ValueProvider filterValue();

@Setup public void setup() {

if (this.doFilter != null) { return; // Filter has been evaluated already }

inputFilterKey = (filterKey() == null ? null : filterKey().get());

if (inputFilterKey == null) {

// Disable input message filtering. this.doFilter = false;

} else {

this.doFilter = true; // Enable filtering.

String inputFilterValue = (filterValue() == null ? null : filterValue().get(

if (inputFilterValue == null) {


LOG.warn( "User provided a NULL for filterValue. Only messages with a value of N + " filterKey: {} will be filtered forward", inputFilterKey);

// For backward compatibility, we are allowing filtering by null filterVal this.isNullFilterValue = true; this.inputFilterValueRegex = null; } else {

this.isNullFilterValue = false; try { inputFilterValueRegex = getFilterPattern(inputFilterValue); } catch (PatternSyntaxException e) { LOG.error("Invalid regex pattern for supplied filterValue: {}", inputFil throw new RuntimeException(e); } }

LOG.info( "Enabling event filter [key: " + inputFilterKey + "][value: " + inputFil } }

@ProcessElement public void processElement(ProcessContext context) {

INPUT_COUNTER.inc(); if (!this.doFilter) {

// Filter is not enabled writeOutput(context, context.element()); } else {

PubsubMessage message = context.element(); String extractedValue = message.getAttribute(this.inputFilterKey);

if (this.isNullFilterValue) {

if (extractedValue == null) { // If we are filtering for null and the extracted value is null, we forw // the message. writeOutput(context, message); }


} else {

if (extractedValue != null && this.inputFilterValueRegex.matcher(extractedValue).matches()) { // If the extracted value is not null and it matches the filter, // we forward the message. writeOutput(context, message); } } } }

/** * Write a {@link PubsubMessage} and increment the output counter. * @param context {@link ProcessContext} to write {@link PubsubMessage} to. * @param message {@link PubsubMessage} output. */ private void writeOutput(ProcessContext context, PubsubMessage message) { OUTPUT_COUNTER.inc(); context.output(message); }

/** * Return a {@link Pattern} based on a user provided regex string. * * @param regex Regex string to compile. * @return {@link Pattern} * @throws PatternSyntaxException If the string is an invalid regex. */ private Pattern getFilterPattern(String regex) throws PatternSyntaxException { checkNotNull(regex, "Filter regex cannot be null."); return Pattern.compile(regex); }

/** Builder class for {@link ExtractAndFilterEventsFn}. */ @AutoValue.Builder abstract static class Builder {

abstract Builder setFilterKey(ValueProvider<String> filterKey);

abstract Builder setFilterValue(ValueProvider<String> filterValue);

abstract ExtractAndFilterEventsFn build();


/**
 * Method to set the filterKey used for filtering messages.
 *
 * @param filterKey Lookup key for the {@link PubsubMessage} attribute map.
 * @return {@link Builder}
 */
public Builder withFilterKey(ValueProvider<String> filterKey) {
  checkArgument(filterKey != null, "withFilterKey(filterKey) called with null input.");
  return setFilterKey(filterKey);
}

/**
 * Method to set the filterValue used for filtering messages.
 *
 * @param filterValue Lookup value for the {@link PubsubMessage} attribute map.
 * @return {@link Builder}
 */
public Builder withFilterValue(ValueProvider<String> filterValue) {
  checkArgument(filterValue != null, "withFilterValue(filterValue) called with null input.");
  return setFilterValue(filterValue);
}
}
}
}

Pub/Sub to Splunk

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub to Splunk template is a streaming pipeline that reads messages from a Pub/Sub subscription and writes the message payload to Splunk via Splunk's HTTP Event Collector (HEC). Before writing to Splunk, you can also apply a JavaScript user-defined function to the message payload. Any messages that experience processing failures are forwarded to a Pub/Sub dead-letter topic for further troubleshooting and reprocessing.


As an extra layer of protection for your HEC token, you can also pass in a Cloud KMS key along with the base64-encoded HEC token parameter encrypted with the Cloud KMS key. See the Cloud KMS API encryption endpoint (/kms/docs/reference/rest/v1/projects.locations.keyRings.cryptoKeys/encrypt) for additional details on encrypting your HEC token parameter.
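As a sketch of that workflow, the same encryption can be done from the command line with the gcloud CLI instead of calling the REST endpoint directly. The key ring and key names below are placeholder values, not values defined by this template:

# Encrypt the plaintext HEC token with a Cloud KMS key, then base64-encode the ciphertext.
# Key ring and key names are examples only.
echo -n "my-plaintext-hec-token" | gcloud kms encrypt \
    --location=global \
    --keyring=my-keyring \
    --key=hec-token-key \
    --plaintext-file=- \
    --ciphertext-file=- \
  | base64 -w 0
# Pass the printed string as the token parameter, and the key resource name
# (projects/<project-id>/locations/global/keyRings/my-keyring/cryptoKeys/hec-token-key)
# as tokenKMSEncryptionKey.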

Requirements for this pipeline:

The source Pub/Sub subscription must exist prior to running the pipeline.

The Pub/Sub dead-letter topic must exist prior to running the pipeline.

The Splunk HEC endpoint must be accessible from the Dataflow workers' network.

The Splunk HEC token must be generated and available.
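For reference, the source subscription and dead-letter topic from the first two requirements can be created ahead of time with the gcloud CLI; the names below are placeholders:

# Create an input topic, a subscription on it, and a dead-letter topic (example names).
gcloud pubsub topics create my-input-topic
gcloud pubsub subscriptions create my-input-subscription --topic=my-input-topic
gcloud pubsub topics create my-deadletter-topic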

Template parameters

Parameter Description

inputSubscription The Pub/Sub subscription from which to read the input. For example, projects/<project-id>/subscriptions/<subscription-name>.

token The Splunk HEC authentication token. This base64-encoded string can be encrypted with a Cloud KMS key for additional security.

url The Splunk HEC URL. This must be routable from the VPC in which the pipeline runs. For example, https://splunk-hec-host:8088.

outputDeadletterTopic The Pub/Sub topic to forward undeliverable messages to. For example, projects/<project-id>/topics/<topic-name>.

javascriptTextTransformGcsPath [Optional] The Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js.

javascriptTextTransformFunctionName[Optional] The name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...} then the function name is myTransform.


batchCount [Optional] The batch size for sending multiple events to Splunk. Default 1 (no batching).

parallelism [Optional] The maximum number of parallel requests. Default 1 (no parallelism).

disableCertificateValidation [Optional] Disable SSL certificate validation. Default false (validation enabled).

includePubsubMessage [Optional] Include the full Pub/Sub message in the payload. Default false (only the data element is included in the payload).

tokenKMSEncryptionKey [Optional] The Cloud KMS key to decrypt the HEC token string. If the Cloud KMS key is provided, the HEC token string must be passed in encrypted.

Running the Pub/Sub to Splunk template

CONSOLEGCLOUD (#gcloud)API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Pub/Sub to Splunk template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
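You can also launch the job from the command line with gcloud. This is a minimal sketch, assuming the Google-provided template is staged at gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk (verify the path for your release); all bracketed values are placeholders:

JOB_NAME=pubsub-to-splunk-$USER-`date +"%Y%m%d-%H%M%S%z"`
gcloud dataflow jobs run ${JOB_NAME} \
    --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk \
    --region=us-central1 \
    --parameters \
"inputSubscription=projects/<project-id>/subscriptions/<subscription-name>,\
token=<splunk-hec-token>,\
url=https://splunk-hec-host:8088,\
outputDeadletterTopic=projects/<project-id>/topics/<deadletter-topic-name>,\
batchCount=1,\
parallelism=1,\
disableCertificateValidation=false"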

 Template source code


(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToSplunk.java)

/* * Copyright (C) 2019 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */

package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.splunk.SplunkEvent;
import com.google.cloud.teleport.splunk.SplunkEventCoder;
import com.google.cloud.teleport.splunk.SplunkIO;
import com.google.cloud.teleport.splunk.SplunkWriteError;
import com.google.cloud.teleport.templates.common.ErrorConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.PubsubConverters.PubsubReadSubscriptionOptions;
import com.google.cloud.teleport.templates.common.PubsubConverters.PubsubWriteDeadletterTopicOptions;
import com.google.cloud.teleport.templates.common.SplunkConverters;
import com.google.cloud.teleport.templates.common.SplunkConverters.SplunkOptions;
import com.google.cloud.teleport.util.KMSEncryptedNestedValueProvider;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonPrimitive;
import com.google.gson.JsonSerializer;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;


import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.MoreObjects;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToSplunk} pipeline is a streaming pipeline which ingests data from Cloud
 * Pub/Sub, executes a UDF, converts the output to {@link SplunkEvent}s and writes those records
 * into Splunk's HEC endpoint. Any errors which occur in the execution of the UDF, conversion to
 * {@link SplunkEvent} or writing to HEC will be streamed into a Pub/Sub topic.
 *
 *

Pipeline Requirements * *

    *
  • The source Pub/Sub subscription exists. *
  • HEC end-point is routable from the VPC where the Dataflow job executes. *
  • Deadletter topic exists. *
* *

Example Usage * *

 * # Set the pipeline vars * PROJECT_ID=PROJECT ID HERE * BUCKET_NAME=BUCKET NAME HERE * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read * from a Pub/Sub Subscription or a Pub/Sub Topic. *


* # Set the runner * RUNNER=DataflowRunner * * # Build the template * mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToSplunk \ * -Dexec.cleanupDaemonThreads=false \ * -Dexec.args=" \ * --project=${PROJECT_ID} \ * --stagingLocation=${PIPELINE_FOLDER}/staging \ * --tempLocation=${PIPELINE_FOLDER}/temp \ * --templateLocation=${PIPELINE_FOLDER}/template/PubSubToSplunk \ * --runner=${RUNNER} * " * * # Execute the template * JOB_NAME=pubsub-to-splunk-$USER-`date +"%Y%m%d-%H%M%S%z"` * BATCH_COUNT=1 * PARALLELISM=5 * * # Execute the templated pipeline: * gcloud dataflow jobs run ${JOB_NAME} \ * --gcs-location=${PIPELINE_FOLDER}/template/PubSubToSplunk \ * --zone=us-east1-d \ * --parameters \ * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\ * token=my-splunk-hec-token,\ * url=http://splunk-hec-server-address:8088,\ * batchCount=${BATCH_COUNT},\ * parallelism=${PARALLELISM},\ * disableCertificateValidation=false,\ * outputDeadletterTopic=projects/${PROJECT_ID}/topics/deadletter-topic-name,\ * javascriptTextTransformGcsPath=gs://${BUCKET_NAME}/splunk/js/my-js-udf.js,\ * javascriptTextTransformFunctionName=myUdf" *

*/ public class PubSubToSplunk {

/** String/String Coder for FailsafeElement. */
public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
    FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

/** Counter to track inbound messages from source. */ private static final Counter INPUT_MESSAGES_COUNTER = Metrics.counter(PubSubToSplunk.class, "inbound-pubsub-messages");


/** The tag for successful {@link SplunkEvent} conversion. */
private static final TupleTag<SplunkEvent> SPLUNK_EVENT_OUT = new TupleTag<SplunkEvent>() {};

/** The tag for failed {@link SplunkEvent} conversion. */
private static final TupleTag<FailsafeElement<String, String>> SPLUNK_EVENT_DEADLETTER_OUT =
    new TupleTag<FailsafeElement<String, String>>() {};

/** The tag for the main output for the UDF. */
private static final TupleTag<FailsafeElement<String, String>> UDF_OUT =
    new TupleTag<FailsafeElement<String, String>>() {};

/** The tag for the dead-letter output of the udf. */
private static final TupleTag<FailsafeElement<String, String>> UDF_DEADLETTER_OUT =
    new TupleTag<FailsafeElement<String, String>>() {};

/** GSON to process a {@link PubsubMessage}. */ private static final Gson GSON = new GsonBuilder() .registerTypeAdapter( byte[].class, (JsonSerializer) (bytes, type, jsonSerializationContext) -> new JsonPrimitive(new String(bytes, StandardCharsets.UTF_8))) .create();

/** Logger for class. */ private static final Logger LOG = LoggerFactory.getLogger(PubSubToSplunk.class);

private static final Boolean DEFAULT_INCLUDE_PUBSUBMESSAGE = false;

/**
 * The main entry-point for pipeline execution. This method will start the pipeline but will not
 * wait for its execution to finish. If blocking execution is required, use the {@link
 * PubSubToSplunk#run(PubSubToSplunkOptions)} method to start the pipeline and invoke {@code
 * result.waitUntilFinish()} on the {@link PipelineResult}.
 *
 * @param args The command-line args passed by the executor.
 */
public static void main(String[] args) {

  PubSubToSplunkOptions options =
      PipelineOptionsFactory.fromArgs(args).withValidation().as(PubSubToSplunkOptions.class);

run(options);


}

/**
 * Runs the pipeline to completion with the specified options. This method does not wait until the
 * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
 * object to block until the pipeline is finished running if blocking programmatic execution is
 * required.
 *
 * @param options The execution options.
 * @return The pipeline result.
 */
public static PipelineResult run(PubSubToSplunkOptions options) {

Pipeline pipeline = Pipeline.create(options);

// Register coders. CoderRegistry registry = pipeline.getCoderRegistry(); registry.registerCoderForClass(SplunkEvent.class, SplunkEventCoder.of()); registry.registerCoderForType( FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

/* * Steps: * 1) Read messages in from Pub/Sub * 2) Convert message to FailsafeElement for processing. * 3) Apply user provided UDF (if any) on the input strings. * 4) Convert successfully transformed messages into SplunkEvent objects * 5) Write SplunkEvents to Splunk's HEC end point. * 5a) Wrap write failures into a FailsafeElement. * 6) Collect errors from UDF transform (#3), SplunkEvent transform (#4) * and writing to Splunk HEC (#5) and stream into a Pub/Sub deadletter topic */

// 1) Read messages in from Pub/Sub
PCollection<String> stringMessages =
    pipeline.apply(
        "ReadMessages",
        new ReadMessages(options.getInputSubscription(), options.getIncludePubsubMessage()));

// 2) Convert message to FailsafeElement for processing. PCollectionTuple transformedOutput = stringMessages .apply( "ConvertToFailsafeElement", MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())


.via(input -> FailsafeElement.of(input, input)))

// 3) Apply user provided UDF (if any) on the input strings.
.apply(
    "ApplyUDFTransformation",
    FailsafeJavascriptUdf.<String>newBuilder()
        .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
        .setFunctionName(options.getJavascriptTextTransformFunctionName())
        .setSuccessTag(UDF_OUT)
        .setFailureTag(UDF_DEADLETTER_OUT)
        .build());

// 4) Convert successfully transformed messages into SplunkEvent objects PCollectionTuple convertToEventTuple = transformedOutput .get(UDF_OUT) .apply( "ConvertToSplunkEvent", SplunkConverters.failsafeStringToSplunkEvent( SPLUNK_EVENT_OUT, SPLUNK_EVENT_DEADLETTER_OUT));

// 5) Write SplunkEvents to Splunk's HEC end point.
PCollection<SplunkWriteError> writeErrors =
    convertToEventTuple
        .get(SPLUNK_EVENT_OUT)
        .apply(
            "WriteToSplunk",
            SplunkIO.writeBuilder()
                .withToken(maybeDecrypt(options.getToken(), options.getTokenKMSEncryptionKey()))
                .withUrl(options.getUrl())
                .withBatchCount(options.getBatchCount())
                .withParallelism(options.getParallelism())
                .withDisableCertificateValidation(options.getDisableCertificateValidation())
                .build());

// 5a) Wrap write failures into a FailsafeElement.
PCollection<FailsafeElement<String, String>> wrappedSplunkWriteErrors =
    writeErrors.apply(
        "WrapSplunkWriteErrors",
        ParDo.of(
            new DoFn<SplunkWriteError, FailsafeElement<String, String>>() {

@ProcessElement public void processElement(ProcessContext context) { SplunkWriteError error = context.element();


FailsafeElement<String, String> failsafeElement =
    FailsafeElement.of(error.payload(), error.payload());

if (error.statusMessage() != null) { failsafeElement.setErrorMessage(error.statusMessage()); }

if (error.statusCode() != null) {
  failsafeElement.setErrorMessage(
      String.format("Splunk write status code: %d", error.statusCode()));
}
context.output(failsafeElement);
}
}));

// 6) Collect errors from UDF transform (#4), SplunkEvent transform (#5) // and writing to Splunk HEC (#6) and stream into a Pub/Sub deadletter topic PCollectionList.of( ImmutableList.of( convertToEventTuple.get(SPLUNK_EVENT_DEADLETTER_OUT), wrappedSplunkWriteErrors, transformedOutput.get(UDF_DEADLETTER_OUT))) .apply("FlattenErrors", Flatten.pCollections()) .apply( "WriteFailedRecords", ErrorConverters.WriteStringMessageErrorsToPubSub.newBuilder() .setErrorRecordsTopic(options.getOutputDeadletterTopic()) .build());

return pipeline.run(); }

/**
 * The {@link PubSubToSplunkOptions} class provides the custom options passed by the executor at
 * the command line.
 */
public interface PubSubToSplunkOptions
    extends SplunkOptions,
        PubsubReadSubscriptionOptions,
        PubsubWriteDeadletterTopicOptions,
        JavascriptTextTransformerOptions {}

/**
 * A {@link PTransform} that reads messages from a Pub/Sub subscription, increments a counter and
 * returns a {@link PCollection} of {@link String} messages.


 */
private static class ReadMessages extends PTransform<PBegin, PCollection<String>> {
  private final ValueProvider<String> subscriptionName;
  private final ValueProvider<Boolean> inputIncludePubsubMessageFlag;
  private Boolean includePubsubMessage;

  ReadMessages(
      ValueProvider<String> subscriptionName,
      ValueProvider<Boolean> inputIncludePubsubMessageFlag) {
    this.subscriptionName = subscriptionName;
    this.inputIncludePubsubMessageFlag = inputIncludePubsubMessageFlag;
  }

  @Override
  public PCollection<String> expand(PBegin input) {
    return input
        .apply(
            "ReadPubsubMessage",
            PubsubIO.readMessagesWithAttributes().fromSubscription(subscriptionName))
        .apply(
            "ExtractMessageIfRequired",
            ParDo.of(
                new DoFn<PubsubMessage, String>() {

@Setup
public void setup() {
  if (inputIncludePubsubMessageFlag != null) {
    includePubsubMessage = inputIncludePubsubMessageFlag.get();
  }
  includePubsubMessage =
      MoreObjects.firstNonNull(includePubsubMessage, DEFAULT_INCLUDE_PUBSUBMESSAGE);
  LOG.info("includePubsubMessage set to: {}", includePubsubMessage);
}

@ProcessElement
public void processElement(ProcessContext context) {
  if (includePubsubMessage) {
    context.output(GSON.toJson(context.element()));
  } else {
    context.output(
        new String(context.element().getPayload(), StandardCharsets.UTF_8));
  }
}
}))
.apply(
    "CountMessages",


    ParDo.of(
        new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext context) {
            INPUT_MESSAGES_COUNTER.inc();
            context.output(context.element());
          }
        }));
}
}

/** * Utility method to decrypt a Splunk HEC token. * * @param unencryptedToken The Splunk HEC token as a Base64 encoded {@link String} * @param kmsKey The Cloud KMS Encryption Key to decrypt the Splunk HEC token. * @return Decrypted Splunk HEC token. */ private static ValueProvider maybeDecrypt( ValueProvider unencryptedToken, ValueProvider kmsKey) { return new KMSEncryptedNestedValueProvider(unencryptedToken, kmsKey); } }

Pub/Sub to Cloud Storage Avro

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub to Cloud Storage Avro template is a streaming pipeline that reads data from a Pub/Sub topic and writes Avro files into the specified Cloud Storage bucket. The pipeline supports an optional user-provided window duration for performing windowed writes.

Requirements for this pipeline:

The input Pub/Sub topic must exist prior to pipeline execution.
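For example, the input topic can be created beforehand with the gcloud CLI (the names below are placeholders):

# Create the input topic before launching the pipeline.
gcloud pubsub topics create my-input-topic
# The Cloud Storage bucket referenced by outputDirectory and avroTempDirectory must also exist,
# for example created with: gsutil mb gs://example-bucket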


Template parameters

Parameter Description

inputTopic The Cloud Pub/Sub topic to subscribe to for message consumption. The topic name should be in the format projects/<project-id>/topics/<topic-name>.

outputDirectory The output directory where output Avro files are archived. Must end with a slash. For example, gs://example-bucket/example-directory/.

avroTempDirectory The directory for temporary Avro files. Must end with a slash. For example, gs://example-bucket/example-directory/.

outputFilenamePrefix [Optional] The output filename prefix for the Avro files.

outputFilenameSuffix [Optional] The output filename suffix for the Avro files.

outputShardTemplate [Optional] The shard template of the output file. Specified as repeating sequences of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with the shard number, or number of shards respectively. The default template format is 'W-P-SS-of-NN' when this parameter is not specified.

numShards [Optional] The maximum number of output shards produced when writing. The default maximum number of shards is 1.

windowDuration [Optional] The window duration in which data will be written. Defaults to 5m. Allowed formats are: Ns (for seconds, example: 5s), Nm (for minutes, example: 12m), Nh (for hours, example: 2h).

Running the Pub/Sub to Cloud Storage Avro template

CONSOLEGCLOUD (#gcloud)API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.


3. Select the Pub/Sub to Cloud Storage Avro template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
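A gcloud equivalent of the console steps above is sketched below. The template path gs://dataflow-templates/latest/Cloud_PubSub_to_Avro is an assumption rather than a value from this page, so verify it before use; bracketed values are placeholders:

JOB_NAME=pubsub-to-avro-$USER-`date +"%Y%m%d-%H%M%S%z"`
gcloud dataflow jobs run ${JOB_NAME} \
    --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Avro \
    --region=us-central1 \
    --parameters \
"inputTopic=projects/<project-id>/topics/<topic-name>,\
outputDirectory=gs://<bucket-name>/output/,\
avroTempDirectory=gs://<bucket-name>/avro-temp/,\
outputFilenamePrefix=windowed-file,\
outputFilenameSuffix=.avro,\
windowDuration=5m"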

 Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToAvro.java)

/* * Copyright (C) 2018 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */

package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.avro.AvroPubsubMessageRecord; import com.google.cloud.teleport.io.WindowedFilenamePolicy; import com.google.cloud.teleport.util.DurationUtils; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.io.AvroIO; import org.apache.beam.sdk.io.FileBasedSink; import org.apache.beam.sdk.io.fs.ResourceId;


import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage; import org.apache.beam.sdk.options.Default; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.StreamingOptions; import org.apache.beam.sdk.options.Validation.Required; import org.apache.beam.sdk.options.ValueProvider; import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.ParDo; import org.apache.beam.sdk.transforms.SerializableFunction; import org.apache.beam.sdk.transforms.windowing.FixedWindows; import org.apache.beam.sdk.transforms.windowing.Window;

/** * This pipeline ingests incoming data from a Cloud Pub/Sub topic and outputs the ra * windowed Avro files at the specified output directory. * *

Files output will have the following schema: * *

 * { * "type": "record", * "name": "AvroPubsubMessageRecord", * "namespace": "com.google.cloud.teleport.avro", * "fields": [ * {"name": "message", "type": {"type": "array", "items": "bytes"}}, * {"name": "attributes", "type": {"type": "map", "values": "string"}}, * {"name": "timestamp", "type": "long"} * ] * } * 
* *

Example Usage: * *

 * mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.templates.${PIPELINE_NAME} \ * -Dexec.cleanupDaemonThreads=false \ * -Dexec.args=" \ * --project=${PROJECT_ID} \ * --stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/stag * --tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \


* --runner=DataflowRunner \ * --windowDuration=2m \ * --numShards=1 \ * --topic=projects/${PROJECT_ID}/topics/windowed-files \ * --outputDirectory=gs://${PROJECT_ID}/temp/ \ * --outputFilenamePrefix=windowed-file \ * --outputFilenameSuffix=.avro * --avroTempDirectory=gs://${PROJECT_ID}/avro-temp-dir/" *

*/ public class PubsubToAvro {

/** * Options supported by the pipeline. * *

Inherits standard configuration options. */ public interface Options extends PipelineOptions, StreamingOptions { @Description("The Cloud Pub/Sub topic to read from.") @Required ValueProvider getInputTopic();

void setInputTopic(ValueProvider value);

@Description("The directory to output files to. Must end with a slash.") @Required ValueProvider getOutputDirectory();

void setOutputDirectory(ValueProvider value);

@Description("The filename prefix of the files to write to.") @Default.String("output") ValueProvider getOutputFilenamePrefix();

void setOutputFilenamePrefix(ValueProvider value);

@Description("The suffix of the files to write.") @Default.String("") ValueProvider getOutputFilenameSuffix();

void setOutputFilenameSuffix(ValueProvider value);

@Description( "The shard template of the output file. Specified as repeating sequences " + "of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with


+ "shard number, or number of shards respectively") @Default.String("W-P-SS-of-NN") ValueProvider getOutputShardTemplate();

void setOutputShardTemplate(ValueProvider value);

@Description("The maximum number of output shards produced when writing.") @Default.Integer(1) Integer getNumShards();

void setNumShards(Integer value);

@Description( "The window duration in which data will be written. Defaults to 5m. " + "Allowed formats are: " + "Ns (for seconds, example: 5s), " + "Nm (for minutes, example: 12m), " + "Nh (for hours, example: 2h).") @Default.String("5m") String getWindowDuration();

void setWindowDuration(String value);

@Description("The Avro Write Temporary Directory. Must end with /") @Required ValueProvider getAvroTempDirectory();

void setAvroTempDirectory(ValueProvider value);

}

/** * Main entry point for executing the pipeline. * * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {

Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setStreaming(true);

run(options); }

/**


* Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(Options options) { // Create the pipeline Pipeline pipeline = Pipeline.create(options);

/*
 * Steps:
 *   1) Read messages from PubSub
 *   2) Window the messages into minute intervals specified by the executor.
 *   3) Output the windowed data into Avro files, one per window by default.
 */
pipeline
    .apply(
        "Read PubSub Events",
        PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic()))
    .apply("Map to Archive", ParDo.of(new PubsubMessageToArchiveDoFn()))
    .apply(
        options.getWindowDuration() + " Window",
        Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))))

// Apply windowed file writes. Use a NestedValueProvider because the filenam // policy requires a resourceId generated from the input value at runtime. .apply( "Write File(s)", AvroIO.write(AvroPubsubMessageRecord.class) .to( new WindowedFilenamePolicy( options.getOutputDirectory(), options.getOutputFilenamePrefix(), options.getOutputShardTemplate(), options.getOutputFilenameSuffix())) .withTempDirectory(NestedValueProvider.of( options.getAvroTempDirectory(), (SerializableFunction) input -> FileBasedSink.convertToFileResourceIfPossible(input))) /*.withTempDirectory(FileSystems.matchNewResource( options.getAvroTempDirectory(), Boolean.TRUE)) */ .withWindowedWrites() .withNumShards(options.getNumShards()));


// Execute the pipeline and return the result. return pipeline.run(); }

/**
 * Converts an incoming {@link PubsubMessage} to the {@link AvroPubsubMessageRecord} class by
 * copying its fields and the timestamp of the message.
 */
static class PubsubMessageToArchiveDoFn extends DoFn<PubsubMessage, AvroPubsubMessageRecord> {
  @ProcessElement
  public void processElement(ProcessContext context) {
    PubsubMessage message = context.element();
    context.output(
        new AvroPubsubMessageRecord(
            message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
  }
}
}

Pub/Sub to Cloud Storage Text

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub to Cloud Storage Text template is a streaming pipeline that reads records from Pub/Sub and saves them as a series of Cloud Storage files in text format. The template can be used as a quick way to save data in Pub/Sub for future use. By default, the template generates a new file every 5 minutes.

Requirements for this pipeline:

The Pub/Sub topic must exist prior to execution.

The messages published to the topic must be in text format.


The messages published to the topic must not contain any newlines. Note that each Pub/Sub message is saved as a single line in the output file; see the example below.
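For example, a single-line test message that satisfies these requirements can be published with the gcloud CLI (the topic name is a placeholder):

# Publish a one-line text message; newlines inside the message body would break the
# one-message-per-line output format described above.
gcloud pubsub topics publish my-input-topic \
    --message="2020-08-23T10:00:00Z,severity=INFO,worker started"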

Template parameters

Parameter Description

inputTopic The Pub/Sub topic to read the input from. The topic name should be in the format projects/<project-id>/topics/<topic-name>.

outputDirectory The path and filename prefix for writing output files. For example, gs://bucket-name/path/. This value must end in a slash.

outputFilenamePrefix The prefix to place on each windowed file. For example, output-

outputFilenameSuffix The suffix to place on each windowed file, typically a file extension such as .txt or .csv.

outputShardTemplate The shard template defines the dynamic portion of each windowed file. By default, the pipeline uses a single shard for output to the file system within each window. This means that all data will land in a single file per window. The outputShardTemplate defaults to W-P-SS-of-NN where W is the window date range, P is the pane info, S is the shard number, and N is the number of shards. In case of a single file, the SS-of-NN portion of the outputShardTemplate will be 00-of-01.

Running the Pub/Sub to Cloud Storage Text template

CONSOLEGCLOUD (#gcloud)API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.


3. Select the Pub/Sub to Cloud Storage Text template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
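The same job can be started from the command line. This is a sketch assuming the Google-provided template path gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text (confirm the path for your release); bracketed values are placeholders:

JOB_NAME=pubsub-to-gcs-text-$USER-`date +"%Y%m%d-%H%M%S%z"`
gcloud dataflow jobs run ${JOB_NAME} \
    --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text \
    --region=us-central1 \
    --parameters \
"inputTopic=projects/<project-id>/topics/<topic-name>,\
outputDirectory=gs://<bucket-name>/output/,\
outputFilenamePrefix=output-,\
outputFilenameSuffix=.txt"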

 Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToText.java)

/* * Copyright (C) 2018 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */

package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.io.WindowedFilenamePolicy; import com.google.cloud.teleport.util.DualInputNestedValueProvider; import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput; import com.google.cloud.teleport.util.DurationUtils; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.io.FileBasedSink;


import org.apache.beam.sdk.io.TextIO; import org.apache.beam.sdk.io.fs.ResourceId; import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.options.Default; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.StreamingOptions; import org.apache.beam.sdk.options.Validation.Required; import org.apache.beam.sdk.options.ValueProvider; import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider; import org.apache.beam.sdk.transforms.SerializableFunction; import org.apache.beam.sdk.transforms.windowing.FixedWindows; import org.apache.beam.sdk.transforms.windowing.Window;

/** * This pipeline ingests incoming data from a Cloud Pub/Sub topic and * outputs the raw data into windowed files at the specified output * directory. * *

Example Usage: * *

 * mvn compile exec:java \ -Dexec.mainClass=com.google.cloud.teleport.templates.${PIPELINE_NAME} \ -Dexec.cleanupDaemonThreads=false \ -Dexec.args=" \ --project=${PROJECT_ID} \ --stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging --tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \ --runner=DataflowRunner \ --windowDuration=2m \ --numShards=1 \ --inputTopic=projects/${PROJECT_ID}/topics/windowed-files \ --userTempLocation=gs://${PROJECT_ID}/tmp/ \ --outputDirectory=gs://${PROJECT_ID}/output/ \ --outputFilenamePrefix=windowed-file \ --outputFilenameSuffix=.txt" * 
*

*/ public class PubsubToText {

/** * Options supported by the pipeline.


* *

Inherits standard configuration options.

*/ public interface Options extends PipelineOptions, StreamingOptions { @Description("The Cloud Pub/Sub topic to read from.") @Required ValueProvider getInputTopic(); void setInputTopic(ValueProvider value);

@Description("The directory to output files to. Must end with a slash.") @Required ValueProvider getOutputDirectory(); void setOutputDirectory(ValueProvider value);

@Description("The directory to output temporary files to. Must end with a slash. ValueProvider getUserTempLocation(); void setUserTempLocation(ValueProvider value);

@Description("The filename prefix of the files to write to.") @Default.String("output") @Required ValueProvider getOutputFilenamePrefix(); void setOutputFilenamePrefix(ValueProvider value);

@Description("The suffix of the files to write.") @Default.String("") ValueProvider getOutputFilenameSuffix(); void setOutputFilenameSuffix(ValueProvider value);

@Description("The shard template of the output file. Specified as repeating sequ + "of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with the + "shard number, or number of shards respectively") @Default.String("W-P-SS-of-NN") ValueProvider getOutputShardTemplate(); void setOutputShardTemplate(ValueProvider value);

@Description("The maximum number of output shards produced when writing.") @Default.Integer(1) Integer getNumShards(); void setNumShards(Integer value);

@Description("The window duration in which data will be written. Defaults to 5m. + "Allowed formats are: " + "Ns (for seconds, example: 5s), " + "Nm (for minutes, example: 12m), "


+ "Nh (for hours, example: 2h).") @Default.String("5m") String getWindowDuration(); void setWindowDuration(String value); }

/** * Main entry point for executing the pipeline. * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {

Options options = PipelineOptionsFactory .fromArgs(args) .withValidation() .as(Options.class);

options.setStreaming(true);

run(options); }

/** * Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(Options options) { // Create the pipeline Pipeline pipeline = Pipeline.create(options);

/*
 * Steps:
 *   1) Read string messages from PubSub
 *   2) Window the messages into minute intervals specified by the executor.
 *   3) Output the windowed files to GCS
 */
pipeline
    .apply("Read PubSub Events", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
    .apply(
        options.getWindowDuration() + " Window",
        Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))))

// Apply windowed file writes. Use a NestedValueProvider because the filenam


// policy requires a resourceId generated from the input value at runtime. .apply( "Write File(s)", TextIO.write() .withWindowedWrites() .withNumShards(options.getNumShards()) .to( new WindowedFilenamePolicy( options.getOutputDirectory(), options.getOutputFilenamePrefix(), options.getOutputShardTemplate(), options.getOutputFilenameSuffix())) .withTempDirectory(NestedValueProvider.of( maybeUseUserTempLocation( options.getUserTempLocation(), options.getOutputDirectory()), (SerializableFunction) input -> FileBasedSink.convertToFileResourceIfPossible(input))));

// Execute the pipeline and return the result. return pipeline.run(); }

/** * Utility method for using optional parameter userTempLocation as TempDirectory. * This is useful when output bucket is locked and temporary data cannot be delete * * @param userTempLocation user provided temp location * @param outputLocation user provided outputDirectory to be used as the default t * @return userTempLocation if available, otherwise outputLocation is returned. */ private static ValueProvider maybeUseUserTempLocation( ValueProvider userTempLocation, ValueProvider outputLocation) { return DualInputNestedValueProvider.of( userTempLocation, outputLocation, new SerializableFunction, String>() { @Override public String apply(TranslatorInput input) { return (input.getX() != null) ? input.getX() : input.getY(); } }); } }


Pub/Sub to MongoDB

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Pub/Sub to MongoDB template is a streaming pipeline that reads JSON-encoded messages from a Pub/Sub subscription and writes them to MongoDB as documents. If required, this pipeline supports additional transforms that can be included using a JavaScript user-defined function (UDF). Any errors that occur due to schema mismatch, malformed JSON, or while executing transforms are recorded in a BigQuery deadletter table along with the input message. The pipeline automatically creates the deadletter table if it does not exist prior to execution.

Requirements for this pipeline:

The Pub/Sub Subscription must exist and the messages must be encoded in a valid JSON format.

The MongoDB cluster must exist and must be accessible from the Dataflow worker machines.

Template parameters

Parameter Description

inputSubscription Name of the Pub/Sub subscription. For example: projects/<project-id>/subscriptions/<subscription-name>

mongoDBUri Comma separated list of MongoDB servers. For example: 192.285.234.12:27017,192.287.123.11:27017

database Database in MongoDB to store the collection. For example: my-db.

collection Name of the collection inside MongoDB database. For example: my-collection.

deadletterTable BigQuery table that stores messages that failed processing (mismatched schema, malformed JSON, and so on). For example: project-id:dataset-name.table-name.

javascriptTextTransformGcsPath [Optional] Cloud Storage location of the JavaScript file containing the UDF transform. For example: gs://mybucket/filename.json.

javascriptTextTransformFunctionName[Optional] Name of JavaScript UDF. For example: transform.

batchSize [Optional] Batch size used for batch insertion of documents into MongoDB. Default: 1000.

batchSizeBytes [Optional] Batch size in bytes. Default: 5242880.

maxConnectionIdleTime [Optional] Maximum idle time allowed in seconds before connection time out occurs. Default: 60000.

sslEnabled [Optional] Boolean value indicating whether connection to MongoDB is SSL enabled. Default: true.

ignoreSSLCertificate [Optional] Boolean value indicating if the SSL certificate should be ignored. Default: true.

withOrdered [Optional] Boolean value enabling ordered bulk insertions into MongoDB. Default: true.

withSSLInvalidHostNameAllowed [Optional] Boolean value indicating if invalid host name is allowed for SSL connection. Default: true.

Running the Pub/Sub to MongoDB template

CONSOLEGCLOUD (#gcloud)API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.


3. Select the Pub/Sub to MongoDB template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
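Because this template lives under the v2 directory of the templates repository, it is run as a Dataflow Flex Template from the command line. The sketch below assumes a template spec file at gs://dataflow-templates/latest/flex/Cloud_PubSub_to_MongoDB, which should be checked against the template's actual release location; bracketed values are placeholders:

JOB_NAME=pubsub-to-mongodb-$USER-`date +"%Y%m%d-%H%M%S%z"`
gcloud beta dataflow flex-template run ${JOB_NAME} \
    --region=us-central1 \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/Cloud_PubSub_to_MongoDB \
    --parameters \
inputSubscription=projects/<project-id>/subscriptions/<subscription-name>,\
mongoDBUri=<mongodb-host:port>,\
database=<database-name>,\
collection=<collection-name>,\
deadletterTable=<project-id>:<dataset-name>.<table-name>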

 Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/pubsub-to-mongodb/src/main/java/com/google/cloud/teleport/v2/templates/PubSubToMongoDB.java)

/* * Copyright (C) 2019 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */

package com.google.cloud.teleport.v2.templates;

import com.google.auto.value.AutoValue; import com.google.cloud.teleport.v2.coders.FailsafeElementCoder; import com.google.cloud.teleport.v2.transforms.ErrorConverters; import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer; import com.google.cloud.teleport.v2.utils.SchemaUtils; import com.google.cloud.teleport.v2.values.FailsafeElement;


import com.google.gson.Gson; import com.google.gson.JsonObject; import com.google.gson.JsonSyntaxException; import java.nio.charset.StandardCharsets; import javax.annotation.Nullable; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.coders.CoderRegistry; import org.apache.beam.sdk.coders.StringUtf8Coder; import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage; import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder; import org.apache.beam.sdk.io.mongodb.MongoDbIO; import org.apache.beam.sdk.metrics.Counter; import org.apache.beam.sdk.metrics.Metrics; import org.apache.beam.sdk.options.Default; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptions; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.Validation; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.MapElements; import org.apache.beam.sdk.transforms.PTransform; import org.apache.beam.sdk.transforms.ParDo; import org.apache.beam.sdk.values.PCollection; import org.apache.beam.sdk.values.PCollectionTuple; import org.apache.beam.sdk.values.TupleTag; import org.apache.beam.sdk.values.TupleTagList; import org.apache.beam.sdk.values.TypeDescriptors; import org.apache.beam.vendor.guava.v20_0.com.google.common.base.Throwables; import org.bson.Document; import org.slf4j.Logger; import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToMongoDB} pipeline is a streaming pipeline which ingests data in JSON format
 * from PubSub, applies a Javascript UDF if provided and inserts resulting records as Documents
 * in MongoDB. If the element fails to be processed then it is written to a deadletter table in
 * BigQuery.
 *
 *

Pipeline Requirements * *

    *
  • The PubSub topic and subscriptions exist *
  • The MongoDB is up and running


    *

* *

Example Usage * *

 * # Set the pipeline vars * PROJECT_NAME=my-project * BUCKET_NAME=my-bucket * INPUT_SUBSCRIPTION=my-subscription * MONGODB_DATABASE_NAME=testdb * MONGODB_HOSTNAME=my-host:port * MONGODB_COLLECTION_NAME=testCollection * DEADLETTERTABLE=project:dataset.deadletter_table_name * * mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.v2.templates.PubSubToMongoDB \ * -Dexec.cleanupDaemonThreads=false \ * -Dexec.args=" \ * --project=${PROJECT_NAME} \ * --stagingLocation=gs://${BUCKET_NAME}/staging \ * --tempLocation=gs://${BUCKET_NAME}/temp \ * --runner=DataflowRunner \ * --inputSubscription=${INPUT_SUBSCRIPTION} \ * --mongoDBUri=${MONGODB_HOSTNAME} \ * --database=${MONGODB_DATABASE_NAME} \ * --collection=${MONGODB_COLLECTION_NAME} \ * --deadletterTable=${DEADLETTERTABLE}" * 
*/ public class PubSubToMongoDB { /** * Options supported by {@link PubSubToMongoDB} * *

Inherits standard configuration options. */

/** The tag for the main output of the json transformation. */
public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_OUT =
    new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

/** The tag for the dead-letter output of the json to table row transform. */
public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_DEADLETTER_OUT =
    new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

/** Pubsub message/string coder for pipeline. */


public static final FailsafeElementCoder<PubsubMessage, String> CODER =
    FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder.of());

/** String/String Coder for FailsafeElement. */
public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
    FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

/** The log to output status messages to. */ private static final Logger LOG = LoggerFactory.getLogger(PubSubToMongoDB.class);

/**
 * The {@link Options} class provides the custom execution options passed by the executor at the
 * command-line.
 *
 * Inherits standard configuration options, options from {@link
 * JavascriptTextTransformer.JavascriptTextTransformerOptions}.
 */
public interface Options
    extends JavascriptTextTransformer.JavascriptTextTransformerOptions, PipelineOptions {
  @Description(
      "The Cloud Pub/Sub subscription to consume from. "
          + "The name should be in the format of "
          + "projects/<project-id>/subscriptions/<subscription-name>.")
  @Validation.Required
  String getInputSubscription();

void setInputSubscription(String inputSubscription);

@Description("The MongoDB database to push the Documents to.") @Validation.Required String getDatabase();

void setDatabase(String database);

@Description( "The host addresses of the MongoDB" + "Multiple addresses to be specified with a comma separated value e.g." + "host1:port,host2:port,host3:port") @Validation.Required String getMongoDBUri();

void setMongoDBUri(String mongoDBUri);

@Description("The Collection in mongoDB to put documents to.") @Validation.Required


String getCollection();

void setCollection(String collection);

@Description( "The dead-letter table to output to within BigQuery in :

void setDeadletterTable(String deadletterTable);

@Description("Batch size in number of documents. Default: 1000") @Default.Long(1024) Long getBatchSize();

void setBatchSize(Long batchSize);

@Description("Batch size in number of bytes. Default: 5242880 (5mb)") @Default.Long(5242880) Long getBatchSizeBytes();

void setBatchSizeBytes(Long batchSizeBytes);

@Description("Maximum Connection idle time in ms. Default: 60000") @Default.Integer(60000) int getMaxConnectionIdleTime();

void setMaxConnectionIdleTime(int maxConnectionIdleTime);

@Description("Specify if SSL is enabled. Default: true") @Default.Boolean(true) Boolean getSslEnabled();

void setSslEnabled(Boolean sslEnabled);

@Description("Specify whether to ignore SSL certificate. Default: true") @Default.Boolean(true) Boolean getIgnoreSSLCertificate();

void setIgnoreSSLCertificate(Boolean ignoreSSLCertificate);

@Description("Enable ordered bulk insertions. Default: true") @Default.Boolean(true) Boolean getWithOrdered();


void setWithOrdered(Boolean withOrdered);

@Description("Enable invalidHostNameAllowed for ssl connection. Default: true") @Default.Boolean(true) Boolean getWithSSLInvalidHostNameAllowed();

void setWithSSLInvalidHostNameAllowed(Boolean withSSLInvalidHostNameAllowed); }

/** DoFn that will parse the given string elements as Bson Documents. */ private static class ParseAsDocumentsFn extends DoFn {

@ProcessElement public void processElement(ProcessContext context) { context.output(Document.parse(context.element())); } }

/** * Main entry point for executing the pipeline. * * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {

// Parse the user options passed from the command-line.
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

run(options);
}

/** * Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(Options options) {

// Create the pipeline Pipeline pipeline = Pipeline.create(options);

// Register the coders for pipeline CoderRegistry coderRegistry = pipeline.getCoderRegistry();


coderRegistry.registerCoderForType( FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

/* * Steps: 1) Read PubSubMessage with attributes from input PubSub subscription. * 2) Apply Javascript UDF if provided. * 3) Write to MongoDB * */

LOG.info("Reading from subscription: " + options.getInputSubscription());

PCollectionTuple convertedPubsubMessages =
    pipeline
        /*
         * Step #1: Read from a PubSub subscription.
         */
        .apply(
            "Read PubSub Subscription",
            PubsubIO.readMessagesWithAttributes()
                .fromSubscription(options.getInputSubscription()))
        /*
         * Step #2: Apply Javascript Transform, if provided, and transform
         * the PubsubMessages into Json documents.
         */
        .apply(
            "Apply Javascript UDF",
            PubSubMessageToJsonDocument.newBuilder()
                .setJavascriptTextTransformFunctionName(
                    options.getJavascriptTextTransformFunctionName())
                .setJavascriptTextTransformGcsPath(options.getJavascriptTextTransformGcsPath())
                .build());

/*
 * Step #3a: Write Json documents into MongoDB using {@link MongoDbIO.write}.
 */
convertedPubsubMessages
    .get(TRANSFORM_OUT)
    .apply(
        "Get Json Documents",
        MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
    .apply("Parse as BSON Document", ParDo.of(new ParseAsDocumentsFn()))
    .apply(


"Put to MongoDB", MongoDbIO.write() .withBatchSize(options.getBatchSize()) .withUri(String.format("mongodb://%s", options.getMongoDBUri())) .withDatabase(options.getDatabase()) .withCollection(options.getCollection()) .withIgnoreSSLCertificate(options.getIgnoreSSLCertificate()) .withMaxConnectionIdleTime(options.getMaxConnectionIdleTime()) .withOrdered(options.getWithOrdered()) .withSSLEnabled(options.getSslEnabled()) .withSSLInvalidHostNameAllowed(options.getWithSSLInvalidHostNameAllo

/* * Step 3b: Write elements that failed processing to deadletter table via {@link */ convertedPubsubMessages .get(TRANSFORM_DEADLETTER_OUT) .apply( "Write Transform Failures To BigQuery", ErrorConverters.WritePubsubMessageErrors.newBuilder() .setErrorRecordsTable(options.getDeadletterTable()) .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA) .build());

// Execute the pipeline and return the result. return pipeline.run(); }

/** * The {@link PubSubMessageToJsonDocument} class is a {@link PTransform} which tra * {@link PubsubMessage} objects into JSON objects for insertion into MongoDB whil * optional UDF to the input. The executions of the UDF and transformation to Json * in a fail-safe way by wrapping the element with it's original payload inside th * FailsafeElement} class. The {@link PubSubMessageToJsonDocument} transform will * PCollectionTuple} which contains all output and dead-letter {@link PCollection} * *

The {@link PCollectionTuple} output will contain the following {@link PColle * *

    *
  • {@link PubSubToMongoDB#TRANSFORM_OUT} - Contains all records successfully * JSON objects. *
  • {@link PubSubToMongoDB#TRANSFORM_DEADLETTER_OUT} - Contains all {@link Fa * records which couldn't be converted to table rows. *
*/


@AutoValue public abstract static class PubSubMessageToJsonDocument extends PTransform, PCollectionTuple> {

public static Builder newBuilder() { return new AutoValue_PubSubToMongoDB_PubSubMessageToJsonDocument.Builder(); }

@Nullable public abstract String javascriptTextTransformGcsPath();

@Nullable public abstract String javascriptTextTransformFunctionName();

@Override public PCollectionTuple expand(PCollection input) {

// Map the incoming messages into FailsafeElements so we can recover from fail // across multiple transforms. PCollection> failsafeElements = input.apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()

// If a Udf is supplied then use it to parse the PubSubMessages. if (javascriptTextTransformGcsPath() != null) { return failsafeElements.apply( "InvokeUDF", JavascriptTextTransformer.FailsafeJavascriptUdf.newBuilde .setFileSystemPath(javascriptTextTransformGcsPath()) .setFunctionName(javascriptTextTransformFunctionName()) .setSuccessTag(TRANSFORM_OUT) .setFailureTag(TRANSFORM_DEADLETTER_OUT) .build()); } else { return failsafeElements.apply( "ProcessPubSubMessages", ParDo.of(new ProcessFailsafePubSubFn()) .withOutputTags(TRANSFORM_OUT, TupleTagList.of(TRANSFORM_DEADLETTER_ } }

/** Builder for {@link PubSubMessageToJsonDocument}. */ @AutoValue.Builder public abstract static class Builder { public abstract Builder setJavascriptTextTransformGcsPath( String javascriptTextTransformGcsPath);


public abstract Builder setJavascriptTextTransformFunctionName( String javascriptTextTransformFunctionName);

public abstract PubSubMessageToJsonDocument build(); } }

/** * The {@link ProcessFailsafePubSubFn} class processes a {@link FailsafeElement} c * {@link PubsubMessage} and a String of the message's payload {@link PubsubMessag * into a {@link FailsafeElement} of the original {@link PubsubMessage} and a JSON * been processed with {@link Gson}. * *

If {@link PubsubMessage#getAttributeMap()} is not empty then the message att * serialized along with the message payload. */ static class ProcessFailsafePubSubFn extends DoFn, FailsafeElement

private static final Counter successCounter = Metrics.counter(PubSubMessageToJsonDocument.class, "successful-json-conversi

private static Gson gson = new Gson();

private static final Counter failedCounter = Metrics.counter(PubSubMessageToJsonDocument.class, "failed-json-conversion")

@ProcessElement public void processElement(ProcessContext context) { PubsubMessage pubsubMessage = context.element().getOriginalPayload();

JsonObject messageObject = new JsonObject();

try { if (pubsubMessage.getPayload().length > 0) { messageObject = gson.fromJson(new String(pubsubMessage.getPayload()), Json }

// If message attributes are present they will be serialized along with the if (pubsubMessage.getAttributeMap() != null) { pubsubMessage.getAttributeMap().forEach(messageObject::addProperty); }

context.output(FailsafeElement.of(pubsubMessage, messageObject.toString()));


successCounter.inc();

} catch (JsonSyntaxException e) { context.output( TRANSFORM_DEADLETTER_OUT, FailsafeElement.of(context.element()) .setErrorMessage(e.getMessage()) .setStacktrace(Throwables.getStackTraceAsString(e))); failedCounter.inc(); } } }

/** * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMes * {@link FailsafeElement} class so errors can be recovered from and the original * output to a error records table. */ static class PubsubMessageToFailsafeElementFn extends DoFn> { @ProcessElement public void processElement(ProcessContext context) { PubsubMessage message = context.element(); context.output( FailsafeElement.of(message, new String(message.getPayload(), StandardChars } } }

Cloud Storage Text to BigQuery (Stream)

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Cloud Storage Text to BigQuery (Stream) pipeline is a streaming pipeline that allows you to stream text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF) that you provide, and output the result to BigQuery.

Requirements for this pipeline:

Create a JSON formatted BigQuery schema file that describes your output table.

{
  'fields': [{
    'name': 'location',
    'type': 'STRING'
  }, {
    'name': 'name',
    'type': 'STRING'
  }, {
    'name': 'age',
    'type': 'STRING'
  }, {
    'name': 'color',
    'type': 'STRING'
  }, {
    'name': 'coffee',
    'type': 'STRING',
    'mode': 'REQUIRED'
  }, {
    'name': 'cost',
    'type': 'NUMERIC',
    'mode': 'REQUIRED'
  }]
}

Create a JavaScript (.js) file with your UDF function that supplies the logic to transform the lines of text. Note that your function must return a JSON string.

For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.

function transform(line) {
  var values = line.split(',');

  var obj = new Object();
  obj.location = values[0];
  obj.name = values[1];
  obj.age = values[2];
  obj.color = values[3];
  obj.coffee = values[4];
  var jsonString = JSON.stringify(obj);

  return jsonString;
}
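To make the transformation concrete, here is an illustrative input line and the JSON string the function above returns for it. The values are made up for this example and are not part of the template:

Seattle,Alice,30,blue,latte

{"location":"Seattle","name":"Alice","age":"30","color":"blue","coffee":"latte"}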

Template parameters

Parameter Description

javascriptTextTransformGcsPath Cloud Storage location of your JavaScript UDF. For example: gs://my_bucket/my_function.js.

JSONPath Cloud Storage location of your BigQuery schema file, described as JSON. For example: gs://path/to/my/schema.json.

javascriptTextTransformFunctionName The name of the JavaScript function you wish to call as your UDF. For example: transform.

outputTable The fully qualified BigQuery table. For example: my-project:dataset.table

inputFilePattern Cloud Storage location of the text you'd like to process. For example: gs://my-bucket/my-files/text.txt.

bigQueryLoadingTemporaryDirectory Temporary directory for the BigQuery loading process. For example: gs://my-bucket/my-files/temp_dir

outputDeadletterTable Table for messages that failed to reach the output table (that is, the deadletter table). For example: my-project:dataset.my-deadletter-table. If it doesn't exist, it is created during pipeline execution. If not specified, <outputTable>_error_records is used instead.

Running the Cloud Storage Text to BigQuery (Stream) template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Cloud Storage Text to BigQuery template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
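If you prefer the command line (the GCLOUD tab above), a run of this template looks roughly like the following sketch. The template location gs://dataflow-templates/latest/Stream_GCS_Text_to_BigQuery, the region us-central1, and the job name are assumptions made for illustration, and the parameter values simply reuse the examples from the table above; verify the template path and substitute your own values before running.

# Illustrative sketch: replace the job name, region, template path, and parameter values with your own.
gcloud dataflow jobs run text-to-bigquery-streaming-example \
    --gcs-location gs://dataflow-templates/latest/Stream_GCS_Text_to_BigQuery \
    --region us-central1 \
    --parameters \
javascriptTextTransformGcsPath=gs://my_bucket/my_function.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://path/to/my/schema.json,\
inputFilePattern=gs://my-bucket/my-files/text.txt,\
outputTable=my-project:dataset.table,\
outputDeadletterTable=my-project:dataset.my-deadletter-table,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/my-files/temp_dir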

Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/TextToBigQueryStreaming.java)

/* * Copyright (C) 2018 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */

package com.google.cloud.teleport.templates;


import com.google.api.client.json.JsonFactory; import com.google.api.services.bigquery.model.TableRow; import com.google.cloud.teleport.coders.FailsafeElementCoder; import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToT import com.google.cloud.teleport.templates.common.ErrorConverters.WriteStringMessage import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.Failsafe import com.google.cloud.teleport.util.ResourceUtils; import com.google.cloud.teleport.util.ValueProviderUtils; import com.google.cloud.teleport.values.FailsafeElement; import com.google.common.base.Charsets; import com.google.common.collect.ImmutableList; import com.google.common.io.ByteStreams; import java.io.ByteArrayOutputStream; import java.io.IOException; import java.nio.channels.Channels; import java.nio.channels.ReadableByteChannel; import java.nio.channels.WritableByteChannel; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.coders.CoderRegistry; import org.apache.beam.sdk.coders.StringUtf8Coder; import org.apache.beam.sdk.extensions.gcp.util.Transport; import org.apache.beam.sdk.io.FileSystems; import org.apache.beam.sdk.io.TextIO; import org.apache.beam.sdk.io.fs.ResourceId; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError; import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy; import org.apache.beam.sdk.io.gcp.bigquery.WriteResult; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.ValueProvider; import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider; import org.apache.beam.sdk.transforms.Flatten; import org.apache.beam.sdk.transforms.MapElements; import org.apache.beam.sdk.transforms.SimpleFunction; import org.apache.beam.sdk.transforms.Watch.Growth; import org.apache.beam.sdk.values.PCollection; import org.apache.beam.sdk.values.PCollectionList; import org.apache.beam.sdk.values.PCollectionTuple; import org.apache.beam.sdk.values.TupleTag;


import org.joda.time.Duration; import org.slf4j.Logger; import org.slf4j.LoggerFactory;

/** * The {@link TextToBigQueryStreaming} is a streaming version of {@link TextIOToBigQ * that reads text files, applies a JavaScript UDF and writes the output to BigQuery * continuously polls for new files, reads them row-by-row and processes each record * The polling interval is set at 10 seconds. * *

Example Usage: * *

 * {@code mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.templates.TextToBigQueryStreaming \ * -Dexec.args="\ * --project=${PROJECT_ID} \ * --stagingLocation=gs://${STAGING_BUCKET}/staging \ * --tempLocation=gs://${STAGING_BUCKET}/tmp \ * --runner=DataflowRunner \ * --inputFilePattern=gs://path/to/input* \ * --JSONPath=gs://path/to/json/schema.json \ * --outputTable={$PROJECT_ID}:${OUTPUT_DATASET}.${OUTPUT_TABLE} \ * --javascriptTextTransformGcsPath=gs://path/to/transform/udf.js \ * --javascriptTextTransformFunctionName=${TRANSFORM_NAME} \ * --bigQueryLoadingTemporaryDirectory=gs://${STAGING_BUCKET}/tmp \ * --outputDeadletterTable=${PROJECT_ID}:${ERROR_DATASET}.${ERROR_TABLE}" * } * 
*/ public class TextToBigQueryStreaming {

private static final Logger LOG = LoggerFactory.getLogger(TextToBigQueryStreaming.

/** The tag for the main output for the UDF. */ private static final TupleTag> UDF_OUT = new TupleTag>() {};

/** The tag for the dead-letter output of the udf. */ private static final TupleTag> UDF_DEADLETTER_OUT new TupleTag>() {};

/** The tag for the main output of the json transformation. */ private static final TupleTag TRANSFORM_OUT = new TupleTag() {


/** The tag for the dead-letter output of the json to table row transform. */ private static final TupleTag> TRANSFORM_DEADLETTE new TupleTag>() {};

/** The default suffix for error tables if dead letter table is not specified. */ private static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

/** Default interval for polling files in GCS. */ private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(10)

/** Coder for FailsafeElement. */ private static final FailsafeElementCoder FAILSAFE_ELEMENT_CODER = FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

private static final JsonFactory JSON_FACTORY = Transport.getJsonFactory();

/** * Main entry point for executing the pipeline. This will run the pipeline asynchr * blocking execution is required, use the {@link * TextToBigQueryStreaming#run(TextToBigQueryStreamingOptions)} method to start th * and invoke {@code result.waitUntilFinish()} on the {@link PipelineResult} * * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {

// Parse the user options passed from the command-line TextToBigQueryStreamingOptions options = PipelineOptionsFactory.fromArgs(args) .withValidation() .as(TextToBigQueryStreamingOptions.class);

run(options); }

/** * Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(TextToBigQueryStreamingOptions options) {

// Create the pipeline Pipeline pipeline = Pipeline.create(options);


// Register the coder for pipeline FailsafeElementCoder coder = FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

CoderRegistry coderRegistry = pipeline.getCoderRegistry(); coderRegistry.registerCoderForType(coder.getEncodedTypeDescriptor(), coder);

/* * Steps: * 1) Read from the text source continuously. * 2) Convert to FailsafeElement. * 3) Apply Javascript udf transformation. * - Tag records that were successfully transformed and those * that failed transformation. * 4) Convert records to TableRow. * - Tag records that were successfully converted and those * that failed conversion. * 5) Insert successfully converted records into BigQuery. * - Errors encountered while streaming will be sent to deadletter table. * 6) Insert records that failed into deadletter table. */

PCollectionTuple transformedOutput = pipeline

// 1) Read from the text source continuously. .apply( "ReadFromSource", TextIO.read() .from(options.getInputFilePattern()) .watchForNewFiles(DEFAULT_POLL_INTERVAL, Growth.never()))

// 2) Convert to FailsafeElement. .apply( "ConvertToFailsafeElement", MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor()) .via(input -> FailsafeElement.of(input, input)))

// 3) Apply Javascript udf transformation. .apply( "ApplyUDFTransformation", FailsafeJavascriptUdf.newBuilder() .setFileSystemPath(options.getJavascriptTextTransformGcsPath()) .setFunctionName(options.getJavascriptTextTransformFunctionName(


.setSuccessTag(UDF_OUT) .setFailureTag(UDF_DEADLETTER_OUT) .build());

PCollectionTuple convertedTableRows = transformedOutput

// 4) Convert records to TableRow. .get(UDF_OUT) .apply( "ConvertJSONToTableRow", FailsafeJsonToTableRow.newBuilder() .setSuccessTag(TRANSFORM_OUT) .setFailureTag(TRANSFORM_DEADLETTER_OUT) .build());

WriteResult writeResult = convertedTableRows

// 5) Insert successfully converted records into BigQuery. .get(TRANSFORM_OUT) .apply( "InsertIntoBigQuery", BigQueryIO.writeTableRows() .withJsonSchema(getSchemaFromGCS(options.getJSONPath())) .to(options.getOutputTable()) .withExtendedErrorInfo() .withoutValidation() .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED) .withWriteDisposition(WriteDisposition.WRITE_APPEND) .withMethod(Method.STREAMING_INSERTS) .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErr .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDi

// Elements that failed inserts into BigQuery are extracted and converted to Fai PCollection> failedInserts = writeResult .getFailedInsertsWithErr() .apply( "WrapInsertionErrors", MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor()) .via(TextToBigQueryStreaming::wrapBigQueryInsertError));

// 6) Insert records that failed transformation or conversion into deadletter ta PCollectionList.of(


ImmutableList.of( transformedOutput.get(UDF_DEADLETTER_OUT), convertedTableRows.get(TRANSFORM_DEADLETTER_OUT), failedInserts)) .apply("Flatten", Flatten.pCollections()) .apply( "WriteFailedRecords", WriteStringMessageErrors.newBuilder() .setErrorRecordsTable( ValueProviderUtils.maybeUseDefaultDeadletterTable( options.getOutputDeadletterTable(), options.getOutputTable(), DEFAULT_DEADLETTER_TABLE_SUFFIX)) .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJs .build());

return pipeline.run(); }

/** * Method to wrap a {@link BigQueryInsertError} into a {@link FailsafeElement}. * * @param insertError BigQueryInsert error. * @return FailsafeElement object. * @throws IOException */ static FailsafeElement wrapBigQueryInsertError( BigQueryInsertError insertError) {

FailsafeElement failsafeElement; try {

String rowPayload = JSON_FACTORY.toString(insertError.getRow()); String errorMessage = JSON_FACTORY.toString(insertError.getError());

failsafeElement = FailsafeElement.of(rowPayload, rowPayload); failsafeElement.setErrorMessage(errorMessage);

} catch (IOException e) { throw new RuntimeException(e); }

return failsafeElement; }


/** * Method to read a BigQuery schema file from GCS and return the file contents as * * @param gcsPath Path string for the schema file in GCS. * @return File contents as a string. */ private static ValueProvider getSchemaFromGCS(ValueProvider gcsPat return NestedValueProvider.of( gcsPath, new SimpleFunction() { @Override public String apply(String input) { ResourceId sourceResourceId = FileSystems.matchNewResource(input, false)

String schema; try (ReadableByteChannel rbc = FileSystems.open(sourceResourceId)) { try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) { try (WritableByteChannel wbc = Channels.newChannel(baos)) { ByteStreams.copy(rbc, wbc); schema = baos.toString(Charsets.UTF_8.name()); LOG.info("Extracted schema: " + schema); } } } catch (IOException e) { LOG.error("Error extracting schema: " + e.getMessage()); throw new RuntimeException(e); } return schema; } }); }

/** * The {@link TextToBigQueryStreamingOptions} class provides the custom execution * by the executor at the command-line. */ public interface TextToBigQueryStreamingOptions extends TextIOToBigQuery.Options { @Description( "The dead-letter table to output to within BigQuery in : getOutputDeadletterTable();

void setOutputDeadletterTable(ValueProvider value); } }


Cloud Storage Text to Pub/Sub (Stream)

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

This template creates a streaming pipeline that continuously polls for new text files uploaded to Cloud Storage, reads each file line by line, and publishes strings to a Pub/Sub topic. The template publishes records from a newline-delimited file containing JSON records or from a CSV file to a Pub/Sub topic for real-time processing. You can use this template to replay data to Pub/Sub.

Currently, the polling interval is fixed and set to 10 seconds. This template does not set any timestamp on the individual records, so the event time will be equal to the publishing time during execution. If your pipeline relies on an accurate event time for processing, you should not use this pipeline.

Requirements for this pipeline:

Input files must be in newline-delimited JSON or CSV format. Records that span multiple lines in the source files can cause issues downstream, as each line within the files will be published as a message to Pub/Sub.

The Pub/Sub topic must exist prior to execution.

The pipeline runs indefinitely and needs to be terminated manually.

Template parameters

Parameter Description

inputFilePattern The input file pattern to read from. For example, gs://bucket-name/files/*.json.

outputTopic The Pub/Sub topic to write to. The name should be in the format of projects/<project-id>/topics/<topic-name>.


Running the Cloud Storage Text to Pub/Sub (Stream) template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Cloud Storage Text to Pub/Sub (Stream) template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
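If you prefer the command line (the GCLOUD tab above), a run looks roughly like the sketch below. The template location gs://dataflow-templates/latest/Stream_GCS_Text_to_Cloud_PubSub, the region, the job name, and the output topic name are assumptions for illustration; the input pattern reuses the example from the parameter table above.

# Illustrative sketch: replace the job name, region, template path, and parameter values with your own.
gcloud dataflow jobs run text-to-pubsub-streaming-example \
    --gcs-location gs://dataflow-templates/latest/Stream_GCS_Text_to_Cloud_PubSub \
    --region us-central1 \
    --parameters \
inputFilePattern=gs://bucket-name/files/*.json,\
outputTopic=projects/my-project/topics/my-topic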

Template source code

Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/TextToPubsubStream.java)

/* * Copyright (C) 2018 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT


* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */

package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.templates.TextToPubsub.Options; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.io.TextIO; import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.transforms.Watch; import org.joda.time.Duration;

/** * The {@code TextToPubsubStream} is a streaming version of {@code TextToPubsub} pip * publishes records to Cloud Pub/Sub from a set of files. The pipeline continuously * files, reads them row-by-row and publishes each record as a string message. The p * is fixed and equals to 10 seconds. At the moment, publishing messages with attrib * unsupported. * *

Example Usage: * *

 * {@code mvn compile exec:java \ -Dexec.mainClass=com.google.cloud.teleport.templates.TextToPubsubStream \ -Dexec.args=" \ --project=${PROJECT_ID} \ --stagingLocation=gs://${STAGING_BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/stagi --tempLocation=gs://${STAGING_BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \ --runner=DataflowRunner \ --inputFilePattern=gs://path/to/*.csv \ --outputTopic=projects/${PROJECT_ID}/topics/${TOPIC_NAME}" * } * 
* */ public class TextToPubsubStream extends TextToPubsub { private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(10)

/** * Main entry-point for the pipeline. Reads in the * command-line arguments, parses them, and executes


* the pipeline. * * @param args Arguments passed in from the command-line. */ public static void main(String[] args) {

// Parse the user options passed from the command-line Options options = PipelineOptionsFactory .fromArgs(args) .withValidation() .as(Options.class);

run(options); }

/** * Executes the pipeline with the provided execution * parameters. * * @param options The execution parameters. */ public static PipelineResult run(Options options) { // Create the pipeline. Pipeline pipeline = Pipeline.create(options);

/* * Steps: * 1) Read from the text source. * 2) Write each text record to Pub/Sub */ pipeline .apply( "Read Text Data", TextIO.read() .from(options.getInputFilePattern()) .watchForNewFiles(DEFAULT_POLL_INTERVAL, Watch.Growth.never())) .apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic())

return pipeline.run(); } }


Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery (Stream)

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery template is a streaming pipeline that reads CSV files from a Cloud Storage bucket, calls the Cloud Data Loss Prevention (Cloud DLP) API for de-identification, and writes the de-identified data into the specified BigQuery table. This template supports using both a Cloud DLP inspection template (/dlp/docs/creating-templates) and a Cloud DLP de-identification template (/dlp/docs/creating-templates-deid). This allows users to inspect potentially sensitive information and de-identify it, as well as de-identify structured data where columns are specified to be de-identified and no inspection is needed.

Requirements for this pipeline:

The input data to tokenize must exist

The Cloud DLP Templates must exist (for example, DeidentifyTemplate and InspectTemplate). See Cloud DLP templates (/dlp/docs/concepts-templates) for more details.

The BigQuery dataset must exist

Template parameters

Parameter Description

inputFilePattern The CSV file(s) to read input data records from. Wildcarding is also accepted. For example, gs://mybucket/my_csv_filename.csv or gs://mybucket/file-*

dlpProjectId Cloud DLP project ID that owns the Cloud DLP API resource. This Cloud DLP project can be the same project that owns the Cloud DLP templates, or it can be a separate project. For example, my_dlp_api_project.

deidentifyTemplateName Cloud DLP de-identification template to use for API requests, specified with the pattern projects/{template_project_id}/deidentifyTemplates/{deIdTemplateId}. For example, projects/my_project/deidentifyTemplates/100.

datasetName BigQuery dataset for sending tokenized results.

batchSize Chunking/batch size for sending data to inspect and/or detokenize. In the case of a CSV file, batchSize is the number of rows in a batch. Users must determine the batch size based on the size of the records and the sizing of the file. Note that the Cloud DLP API has a payload size limit of 524 KB per API call.

inspectTemplateName [Optional] Cloud DLP inspection template to use for API requests, specified with the pattern projects/{template_project_id}/identifyTemplates/{idTemplateId}. For example, projects/my_project/identifyTemplates/100.

Running the Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery (Stream) template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.
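If you prefer the command line (the GCLOUD tab above), a run looks roughly like the sketch below. The template location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery, the region, the job name, the dataset name, and the batch size are assumptions for illustration; the remaining values reuse the examples from the parameter table above.

# Illustrative sketch: replace the job name, region, template path, and parameter values with your own.
gcloud dataflow jobs run dlp-text-to-bigquery-streaming-example \
    --gcs-location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery \
    --region us-central1 \
    --parameters \
inputFilePattern=gs://mybucket/my_csv_filename.csv,\
dlpProjectId=my_dlp_api_project,\
deidentifyTemplateName=projects/my_project/deidentifyTemplates/100,\
inspectTemplateName=projects/my_project/identifyTemplates/100,\
datasetName=my_dataset,\
batchSize=100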

Template source code


Java

(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/DLPTextToBigQueryStreaming.java)

/* * Copyright (C) 2018 Google Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations under * the License. */

package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableCell; import com.google.api.services.bigquery.model.TableFieldSchema; import com.google.api.services.bigquery.model.TableRow; import com.google.api.services.bigquery.model.TableSchema; import com.google.cloud.dlp.v2.DlpServiceClient; import com.google.common.base.Charsets; import com.google.privacy.dlp.v2.ContentItem; import com.google.privacy.dlp.v2.DeidentifyContentRequest; import com.google.privacy.dlp.v2.DeidentifyContentRequest.Builder; import com.google.privacy.dlp.v2.DeidentifyContentResponse; import com.google.privacy.dlp.v2.FieldId; import com.google.privacy.dlp.v2.ProjectName; import com.google.privacy.dlp.v2.Table; import com.google.privacy.dlp.v2.Value; import java.io.BufferedReader; import java.io.IOException; import java.nio.channels.Channels; import java.nio.channels.ReadableByteChannel; import java.sql.SQLException; import java.util.ArrayList; import java.util.Iterator; import java.util.List;


import java.util.Map; import java.util.concurrent.atomic.AtomicInteger; import java.util.stream.Collectors; import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.PipelineResult; import org.apache.beam.sdk.coders.KvCoder; import org.apache.beam.sdk.coders.StringUtf8Coder; import org.apache.beam.sdk.io.Compression; import org.apache.beam.sdk.io.FileIO; import org.apache.beam.sdk.io.FileIO.ReadableFile; import org.apache.beam.sdk.io.ReadableFileCoder; import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations; import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy; import org.apache.beam.sdk.io.gcp.bigquery.TableDestination; import org.apache.beam.sdk.io.range.OffsetRange; import org.apache.beam.sdk.metrics.Distribution; import org.apache.beam.sdk.metrics.Metrics; import org.apache.beam.sdk.options.Description; import org.apache.beam.sdk.options.PipelineOptionsFactory; import org.apache.beam.sdk.options.Validation.Required; import org.apache.beam.sdk.options.ValueProvider; import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.DoFn.Element; import org.apache.beam.sdk.transforms.GroupByKey; import org.apache.beam.sdk.transforms.ParDo; import org.apache.beam.sdk.transforms.View; import org.apache.beam.sdk.transforms.Watch; import org.apache.beam.sdk.transforms.WithKeys; import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker; import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker; import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime; import org.apache.beam.sdk.transforms.windowing.FixedWindows; import org.apache.beam.sdk.transforms.windowing.Repeatedly; import org.apache.beam.sdk.transforms.windowing.Window; import org.apache.beam.sdk.values.KV; import org.apache.beam.sdk.values.PCollection; import org.apache.beam.sdk.values.PCollectionView; import org.apache.beam.sdk.values.ValueInSingleWindow; import org.apache.commons.csv.CSVFormat; import org.apache.commons.csv.CSVRecord; import org.joda.time.Duration; import org.slf4j.Logger;


import org.slf4j.LoggerFactory;

/** * The {@link DLPTextToBigQueryStreaming} is a streaming pipeline that reads CSV fil * storage location (e.g. Google Cloud Storage), uses Cloud DLP API to inspect, clas * sensitive information (e.g. PII Data like passport or SIN number) and at the end * obfuscated data in BigQuery (Dynamic Table Creation) to be used for various purpo * analytics, ML model. Cloud DLP inspection and masking can be configured by the us * use of over 90 built in detectors and masking techniques like tokenization, secur * shifting, partial masking, and more. * *

Pipeline Requirements * *

    *
  • DLP Templates exist (e.g. deidentifyTemplate, InspectTemplate) *
  • The BigQuery Dataset exists *
* *

Example Usage * *

 * # Set the pipeline vars * PROJECT_ID=PROJECT ID HERE * BUCKET_NAME=BUCKET NAME HERE * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/dlp-text-to-bigquery * * # Set the runner * RUNNER=DataflowRunner * * # Build the template * mvn compile exec:java \ * -Dexec.mainClass=com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming \ * -Dexec.cleanupDaemonThreads=false \ * -Dexec.args=" \ * --project=${PROJECT_ID} \ * --stagingLocation=${PIPELINE_FOLDER}/staging \ * --tempLocation=${PIPELINE_FOLDER}/temp \ * --templateLocation=${PIPELINE_FOLDER}/template \ * --runner=${RUNNER}" * * # Execute the template * JOB_NAME=dlp-text-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"` * * gcloud dataflow jobs run ${JOB_NAME} \ * --gcs-location=${PIPELINE_FOLDER}/template \


* --zone=us-east1-d \ * --parameters \ * "inputFilePattern=gs:///.csv, batchSize=15,datasetName= */ public class DLPTextToBigQueryStreaming {

public static final Logger LOG = LoggerFactory.getLogger(DLPTextToBigQueryStreamin /** Default interval for polling files in GCS. */ private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(30) /** Expected only CSV file in GCS bucket. */ private static final String ALLOWED_FILE_EXTENSION = String.valueOf("csv"); /** Regular expression that matches valid BQ table IDs. */ private static final String TABLE_REGEXP = "[-\\w$@]{1,1024}"; /** Default batch size if value not provided in execution. */ private static final Integer DEFAULT_BATCH_SIZE = 100; /** Regular expression that matches valid BQ column name . */ private static final String COLUMN_NAME_REGEXP = "^[A-Za-z_]+[A-Za-z_0-9]*$"; /** Default window interval to create side inputs for header records. */ private static final Duration WINDOW_INTERVAL = Duration.standardSeconds(30);

/** * Main entry point for executing the pipeline. This will run the pipeline asynchr * blocking execution is required, use the {@link * DLPTextToBigQueryStreaming#run(TokenizePipelineOptions)} method to start the pi * invoke {@code result.waitUntilFinish()} on the {@link PipelineResult} * * @param args The command-line arguments to the pipeline. */ public static void main(String[] args) {

TokenizePipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(TokenizePipelineOp run(options); }

/** * Runs the pipeline with the supplied options. * * @param options The execution parameters to the pipeline. * @return The result of the pipeline execution. */ public static PipelineResult run(TokenizePipelineOptions options) { // Create the pipeline


Pipeline p = Pipeline.create(options); /* * Steps: * 1) Read from the text source continuously based on default interval e.g. 30 * - Setup a window for 30 secs to capture the list of files emited. * - Group by file name as key and ReadableFile as a value. * 2) Create a side input for the window containing list of headers par file. * 3) Output each readable file for content processing. * 4) Split file contents based on batch size for parallel processing. * 5) Process each split as a DLP table content request to invoke API. * 6) Convert DLP Table Rows to BQ Table Row. * 7) Create dynamic table and insert successfully converted records into BQ. */

PCollection>> csvFiles = p /* * 1) Read from the text source continuously based on default interval e * - Setup a window for 30 secs to capture the list of files emited. * - Group by file name as key and ReadableFile as a value. */ .apply( "Poll Input Files", FileIO.match() .filepattern(options.getInputFilePattern()) .continuously(DEFAULT_POLL_INTERVAL, Watch.Growth.never())) .apply("Find Pattern Match", FileIO.readMatches().withCompression(Compre .apply("Add File Name as Key", WithKeys.of(file -> getFileName(file))) .setCoder(KvCoder.of(StringUtf8Coder.of(), ReadableFileCoder.of())) .apply( "Fixed Window(30 Sec)", Window.>into(FixedWindows.of(WINDOW_INTERVA .triggering(Repeatedly.forever( AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Dur .discardingFiredPanes() .withAllowedLateness(Duration.ZERO)) .apply(GroupByKey.create());

/* * Side input for the window to capture list of headers for each file emited so * used in the next transform. */ final PCollectionView>>> headerMap = csvFiles


// 2) Create a side input for the window containing list of headers par .apply( "Create Header Map", ParDo.of( new DoFn>, KV

@ProcessElement public void processElement(ProcessContext c) { String fileKey = c.element().getKey(); c.element() .getValue() .forEach( file -> { try (BufferedReader br = getReader(file)) { c.output(KV.of(fileKey, getFileHeaders(br)));

} catch (IOException e) { LOG.error("Failed to Read File {}", e.getMessage throw new RuntimeException(e); } }); } })) .apply("View As List", View.asList());

PCollection> bqDataMap = csvFiles

// 3) Output each readable file for content processing. .apply( "File Handler", ParDo.of( new DoFn>, KV { c.output(KV.of(fileKey, file)); }); } }))


// 4) Split file contents based on batch size for parallel processing. .apply( "Process File Contents", ParDo.of( new CSVReader( NestedValueProvider.of( options.getBatchSize(), batchSize -> { if (batchSize != null) { return batchSize; } else { return DEFAULT_BATCH_SIZE; } }), headerMap)) .withSideInputs(headerMap))

// 5) Create a DLP Table content request and invoke DLP API for each pro .apply( "DLP-Tokenization", ParDo.of( new DLPTokenizationDoFn( options.getDlpProjectId(), options.getDeidentifyTemplateName(), options.getInspectTemplateName())))

// 6) Convert DLP Table Rows to BQ Table Row .apply("Process Tokenized Data", ParDo.of(new TableRowProcessorDoFn()));

// 7) Create dynamic table and insert successfully converted records into BQ. bqDataMap.apply( "Write To BQ", BigQueryIO.>write() .to(new BQDestination(options.getDatasetName(), options.getDlpProjectId( .withFormatFunction( element -> { return element.getValue(); }) .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEED .withoutValidation() .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

return p.run(); }


/** * The {@link TokenizePipelineOptions} interface provides the custom execution opt * the executor at the command-line. */ public interface TokenizePipelineOptions extends DataflowPipelineOptions {

@Description("The file pattern to read records from (e.g. gs://bucket/file-*.csv ValueProvider getInputFilePattern();

void setInputFilePattern(ValueProvider value);

@Description( "DLP Deidentify Template to be used for API request " + "(e.g.projects/{project_id}/deidentifyTemplates/{deIdTemplateId}") @Required ValueProvider getDeidentifyTemplateName();

void setDeidentifyTemplateName(ValueProvider value);

@Description( "DLP Inspect Template to be used for API request " + "(e.g.projects/{project_id}/inspectTemplates/{inspectTemplateId}") ValueProvider getInspectTemplateName();

void setInspectTemplateName(ValueProvider value);

@Description( "DLP API has a limit for payload size of 524KB /api call. " + "That's why dataflow process will need to chunk it. User will have to + "on how they would like to batch the request depending on number of ro + "and how big each row is.") @Required ValueProvider getBatchSize();

void setBatchSize(ValueProvider value);

@Description("Big Query data set must exist before the pipeline runs (e.g. pii-d ValueProvider getDatasetName();

void setDatasetName(ValueProvider value);

@Description("Project id to be used for DLP Tokenization") ValueProvider getDlpProjectId();


void setDlpProjectId(ValueProvider value); }

/** * The {@link CSVReader} class uses experimental Split DoFn to split each csv file * chunks and process it in non-monolithic fashion. For example: if a CSV file has * batch size is set to 15, then initial restrictions for the SDF will be 1 to 7 a * restriction will be {{1-2},{2-3}..{7-8}} for parallel executions. */ static class CSVReader extends DoFn, KV> {

private ValueProvider batchSize; private PCollectionView>>> headerMap; /** This counter is used to track number of lines processed against batch size. private Integer lineCount;

List csvHeaders;

public CSVReader( ValueProvider batchSize, PCollectionView>>> headerMap) { lineCount = 1; this.batchSize = batchSize; this.headerMap = headerMap; this.csvHeaders = new ArrayList<>(); }

@ProcessElement public void processElement(ProcessContext c, RestrictionTracker

csvHeaders = getHeaders(c.sideInput(headerMap), fileKey); if (csvHeaders != null) { List dlpTableHeaders = csvHeaders.stream() .map(header -> FieldId.newBuilder().setName(header).build()) .collect(Collectors.toList()); List rows = new ArrayList<>(); Table dlpTable = null; /** finding out EOL for this restriction so that we know the SOL */ int endOfLine = (int) (i * batchSize.get().intValue()); int startOfLine = (endOfLine - batchSize.get().intValue());


/** skipping all the rows that's not part of this restriction */ br.readLine(); Iterator csvRows = CSVFormat.DEFAULT.withSkipHeaderRecord().parse(br).iterator(); for (int line = 0; line < startOfLine; line++) { if (csvRows.hasNext()) { csvRows.next(); } } /** looping through buffered reader and creating DLP Table Rows equals t while (csvRows.hasNext() && lineCount <= batchSize.get()) {

CSVRecord csvRow = csvRows.next(); rows.add(convertCsvRowToTableRow(csvRow)); lineCount += 1; } /** creating DLP table and output for next transformation */ dlpTable = Table.newBuilder().addAllHeaders(dlpTableHeaders).addAllRows( c.output(KV.of(fileKey, dlpTable));

LOG.debug( "Current Restriction From: {}, Current Restriction To: {}," + " StartofLine: {}, End Of Line {}, BatchData {}", tracker.currentRestriction().getFrom(), tracker.currentRestriction().getTo(), startOfLine, endOfLine, dlpTable.getRowsCount());

} else {

throw new RuntimeException("Header Values Can't be found For file Key " } } } }

/** * SDF needs to define a @GetInitialRestriction method that can create a restric * the complete work for a given element. For our case this would be the total n * for each CSV file. We will calculate the number of split required based on to * rows and batch size provided. * * @throws IOException */


@GetInitialRestriction public OffsetRange getInitialRestriction(@Element KV csvFi

int rowCount = 0; int totalSplit = 0; try (BufferedReader br = getReader(csvFile.getValue())) { /** assume first row is header */ int checkRowCount = (int) br.lines().count() - 1; rowCount = (checkRowCount < 1) ? 1 : checkRowCount; totalSplit = rowCount / batchSize.get().intValue(); int remaining = rowCount % batchSize.get().intValue(); /** * Adjusting the total number of split based on remaining rows. For example: * 15 for 100 rows will have total 7 splits. As it's a range last split will * range {7,8} */ if (remaining > 0) { totalSplit = totalSplit + 2;

} else { totalSplit = totalSplit + 1; } }

LOG.debug("Initial Restriction range from 1 to: {}", totalSplit); return new OffsetRange(1, totalSplit); }

/** * SDF needs to define a @SplitRestriction method that can split the intital res * number of smaller restrictions. For example: a intital rewstriction of (x, N) * produces pairs (x, 0), (x, 1), …, (x, N-1) as output. */ @SplitRestriction public void splitRestriction( @Element KV csvFile,@Restriction OffsetRange range, Ou /** split the initial restriction by 1 */ for (final OffsetRange p : range.split(1, 1)) { out.output(p); } }

@NewTracker public OffsetRangeTracker newTracker(@Restriction OffsetRange range) { return new OffsetRangeTracker(new OffsetRange(range.getFrom(), range.getTo()))


}

private Table.Row convertCsvRowToTableRow(CSVRecord csvRow) { /** convert from CSV row to DLP Table Row */ Iterator valueIterator = csvRow.iterator(); Table.Row.Builder tableRowBuilder = Table.Row.newBuilder(); while (valueIterator.hasNext()) { String value = valueIterator.next(); if (value != null) { tableRowBuilder.addValues(Value.newBuilder().setStringValue(value.toString } else { tableRowBuilder.addValues(Value.newBuilder().setStringValue("").build()); } }

return tableRowBuilder.build(); }

private List getHeaders(List>> headerMap, String return headerMap.stream() .filter(map -> map.getKey().equalsIgnoreCase(fileKey)) .findFirst() .map(e -> e.getValue()) .orElse(null); } }

/** * The {@link DLPTokenizationDoFn} class executes tokenization request by calling * DLP table as a content item as CSV file contains fully structured data. DLP tem * de-identify, inspect) need to exist before this pipeline runs. As response from * received, this DoFn ouptputs KV of new table with table id as key. */ static class DLPTokenizationDoFn extends DoFn, KV private ValueProvider dlpProjectId; private DlpServiceClient dlpServiceClient; private ValueProvider deIdentifyTemplateName; private ValueProvider inspectTemplateName; private boolean inspectTemplateExist; private Builder requestBuilder; private final Distribution numberOfRowsTokenized = Metrics.distribution(DLPTokenizationDoFn.class, "numberOfRowsTokenizedDistro private final Distribution numberOfBytesTokenized = Metrics.distribution(DLPTokenizationDoFn.class, "numberOfBytesTokenizedDistr


public DLPTokenizationDoFn( ValueProvider dlpProjectId, ValueProvider deIdentifyTemplateName, ValueProvider inspectTemplateName) { this.dlpProjectId = dlpProjectId; this.dlpServiceClient = null; this.deIdentifyTemplateName = deIdentifyTemplateName; this.inspectTemplateName = inspectTemplateName; this.inspectTemplateExist = false; }

@Setup public void setup() { if (this.inspectTemplateName.isAccessible()) { if (this.inspectTemplateName.get() != null) { this.inspectTemplateExist = true; } } if (this.deIdentifyTemplateName.isAccessible()) { if (this.deIdentifyTemplateName.get() != null) { this.requestBuilder = DeidentifyContentRequest.newBuilder() .setParent(ProjectName.of(this.dlpProjectId.get()).toString()) .setDeidentifyTemplateName(this.deIdentifyTemplateName.get()); if (this.inspectTemplateExist) { this.requestBuilder.setInspectTemplateName(this.inspectTemplateName.get( } } } }

@StartBundle public void startBundle() throws SQLException {

try { this.dlpServiceClient = DlpServiceClient.create();

} catch (IOException e) { LOG.error("Failed to create DLP Service Client", e.getMessage()); throw new RuntimeException(e); } }

@FinishBundle public void finishBundle() throws Exception {


if (this.dlpServiceClient != null) { this.dlpServiceClient.close(); } }

@ProcessElement public void processElement(ProcessContext c) { String key = c.element().getKey(); Table nonEncryptedData = c.element().getValue(); ContentItem tableItem = ContentItem.newBuilder().setTable(nonEncryptedData).bu this.requestBuilder.setItem(tableItem); DeidentifyContentResponse response = dlpServiceClient.deidentifyContent(this.requestBuilder.build()); Table tokenizedData = response.getItem().getTable(); numberOfRowsTokenized.update(tokenizedData.getRowsList().size()); numberOfBytesTokenized.update(tokenizedData.toByteArray().length); c.output(KV.of(key, tokenizedData)); } }

/** * The {@link TableRowProcessorDoFn} class process tokenized DLP tables and conver * BigQuery Table Row. */ public static class TableRowProcessorDoFn extends DoFn, KV

@ProcessElement public void processElement(ProcessContext c) {

Table tokenizedData = c.element().getValue(); List headers = tokenizedData.getHeadersList().stream() .map(fid -> fid.getName()) .collect(Collectors.toList()); List outputRows = tokenizedData.getRowsList(); if (outputRows.size() > 0) { for (Table.Row outputRow : outputRows) { if (outputRow.getValuesCount() != headers.size()) { throw new IllegalArgumentException( "CSV file's header count must exactly match with data element count" } c.output( KV.of( c.element().getKey(), createBqRow(outputRow, headers.toArray(new String[headers.size()])


} } }

private static TableRow createBqRow(Table.Row tokenizedValue, String[] headers) TableRow bqRow = new TableRow(); AtomicInteger headerIndex = new AtomicInteger(0); List cells = new ArrayList<>(); tokenizedValue .getValuesList() .forEach( value -> { String checkedHeaderName = checkHeaderName(headers[headerIndex.getAndIncrement()].toString( bqRow.set(checkedHeaderName, value.getStringValue()); cells.add(new TableCell().set(checkedHeaderName, value.getStringValu }); bqRow.setF(cells); return bqRow; } }

/** * The {@link BQDestination} class creates BigQuery table destination and table sc * the CSV file processed in earlier transformations. Table id is same as filename * same as file header columns. */ public static class BQDestination extends DynamicDestinations, KV> {

private ValueProvider datasetName; private ValueProvider projectId;

public BQDestination(ValueProvider datasetName, ValueProvider pr this.datasetName = datasetName; this.projectId = projectId; }

@Override public KV getDestination(ValueInSingleWindow


@Override public TableDestination getTable(KV destination) { TableDestination dest = new TableDestination(destination.getKey(), "pii-tokenized output data from LOG.debug("Table Destination {}", dest.getTableSpec()); return dest; }

@Override public TableSchema getSchema(KV destination) {

TableRow bqRow = destination.getValue(); TableSchema schema = new TableSchema(); List fields = new ArrayList(); List cells = bqRow.getF(); for (int i = 0; i < cells.size(); i++) { Map object = cells.get(i); String header = object.keySet().iterator().next(); /** currently all BQ data types are set to String */ fields.add(new TableFieldSchema().setName(checkHeaderName(header)).setType(" }

schema.setFields(fields); return schema; } }

private static String getFileName(ReadableFile file) { String csvFileName = file.getMetadata().resourceId().getFilename().toString(); /** taking out .csv extension from file name e.g fileName.csv->fileName */ String[] fileKey = csvFileName.split("\\.", 2);

if (!fileKey[1].equals(ALLOWED_FILE_EXTENSION) || !fileKey[0].matches(TABLE_REGE throw new RuntimeException( "[Filename must contain a CSV extension " + " BQ table name must contain only letters, numbers, or underscores [ + fileKey[1] + "], [" + fileKey[0] + "]"); } /** returning file name without extension */ return fileKey[0]; }


private static BufferedReader getReader(ReadableFile csvFile) { BufferedReader br = null; ReadableByteChannel channel = null; /** read the file and create buffered reader */ try { channel = csvFile.openSeekable();

} catch (IOException e) { LOG.error("Failed to Read File {}", e.getMessage()); throw new RuntimeException(e); }

if (channel != null) {

br = new BufferedReader(Channels.newReader(channel, Charsets.ISO_8859_1.name() }

return br; }

private static List getFileHeaders(BufferedReader reader) { List headers = new ArrayList<>(); try { CSVRecord csvHeader = CSVFormat.DEFAULT.parse(reader).getRecords().get(0); csvHeader.forEach( headerValue -> { headers.add(headerValue); }); } catch (IOException e) { LOG.error("Failed to get csv header values}", e.getMessage()); throw new RuntimeException(e); } return headers; }

private static String checkHeaderName(String name) { /** some checks to make sure BQ column names don't fail e.g. special characters String checkedHeader = name.replaceAll("\\s", "_"); checkedHeader = checkedHeader.replaceAll("'", ""); checkedHeader = checkedHeader.replaceAll("/", ""); if (!checkedHeader.matches(COLUMN_NAME_REGEXP)) { throw new IllegalArgumentException("Column name can't be matched to a valid fo } return checkedHeader;

https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming/ 107/122 8/23/2020 Google-provided streaming templates | Cloud Dataflow | Google Cloud

} }

Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub (Stream)

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub template is a streaming pipeline that reads Pub/Sub messages with change data from a MySQL database and writes the records to BigQuery. A Debezium connector captures changes to the MySQL database and publishes the changed data to Pub/Sub. The template then reads the Pub/Sub messages and writes them to BigQuery.

You can use this template to sync MySQL databases and BigQuery tables. The pipeline writes the changed data to a BigQuery staging table and intermittently updates a BigQuery table replicating the MySQL database.

Requirements for this pipeline:

The Debezium connector must be deployed (https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).

The Pub/Sub messages must be serialized in a Beam Row. (https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html)
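For orientation, the following is a minimal, hypothetical sketch of what a change record expressed as a Beam Row could look like. The field names and types here (source_table, primary_key, operation, payload_json) are illustrative assumptions only; the actual schema is determined by the Debezium connector and the source MySQL tables, not by this sketch.

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class ChangeRecordRowSketch {
  public static void main(String[] args) {
    // Hypothetical schema for illustration only; the real schema comes from the
    // Debezium connector configuration and the replicated MySQL table.
    Schema changeRecordSchema =
        Schema.builder()
            .addStringField("source_table")
            .addInt64Field("primary_key")
            .addStringField("operation") // e.g. INSERT, UPDATE, DELETE
            .addStringField("payload_json")
            .build();

    // A Beam Row carrying one change event, in the general shape the pipeline deserializes.
    Row changeRecord =
        Row.withSchema(changeRecordSchema)
            .withFieldValue("source_table", "inventory.customers")
            .withFieldValue("primary_key", 42L)
            .withFieldValue("operation", "UPDATE")
            .withFieldValue("payload_json", "{\"first_name\":\"Ada\"}")
            .build();

    System.out.println(changeRecord);
  }
}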

Template parameters

Parameter Description


inputSubscriptions The comma-separated list of Pub/Sub input subscriptions to read from, in the format of <subscription>,<subscription>, ...

changeLogDataset The BigQuery dataset to store the staging tables, in the format of <my-dataset>

replicaDataset The location of the BigQuery dataset to store the replica tables, in the format of <my-dataset>

updateFrequencySecs (Optional) The interval at which the pipeline updates the BigQuery table replicating the MySQL database.

Running the Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub template

To run this template, perform the following steps:

1. On your local machine, clone the DataflowTemplates repository (https://github.com/GoogleCloudPlatform/DataflowTemplates).

2. Change to the v2/cdc-parent directory.

3. Ensure that the Debezium connector is deployed (https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).

4. Using Maven, run the Dataflow template. You must replace the following values in this example:

Replace PROJECT_ID with your project ID.

Replace YOUR_SUBSCRIPTIONS with your comma-separated list of Pub/Sub subscription names.

Replace YOUR_CHANGELOG_DATASET with your BigQuery dataset for changelog data, and replace YOUR_REPLICA_DATASET with your BigQuery dataset for replica tables.

mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
    --inputSubscriptions=YOUR_SUBSCRIPTIONS \
    --updateFrequencySecs=300 \
    --changeLogDataset=YOUR_CHANGELOG_DATASET \
    --replicaDataset=YOUR_REPLICA_DATASET \
    --project=PROJECT_ID"

Apache Kafka to BigQuery

This feature is covered by the Pre-GA Offerings Terms (/terms/service-terms#1) of the Google Cloud Platform Terms of Service. Pre-GA features may have limited support, and changes to pre-GA features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions (/products#product-launch-stages).

The Apache Kafka to BigQuery template is a streaming pipeline that ingests text data from Apache Kafka, executes a user-defined function (UDF), and outputs the resulting records to BigQuery. Any errors that occur in the transformation of the data, execution of the UDF, or insertion into the output table are inserted into a separate errors table in BigQuery. If the errors table does not exist prior to execution, it is created.

Requirements for this pipeline

The output BigQuery table must exist.

The Apache Kafka broker server must be running and be reachable from the Dataflow worker machines.

The Apache Kafka topics must exist and the messages must be encoded in a valid JSON format.
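As a concrete illustration of that JSON requirement, the short sketch below decodes a hypothetical Kafka message payload into a BigQuery TableRow using Beam's TableRowJsonCoder, which is the same row type the pipeline writes. The field names (ticker, price) are assumptions for illustration and are not prescribed by the template.

import com.google.api.services.bigquery.model.TableRow;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;

public class JsonMessageSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Kafka message value; each top-level JSON field maps to a column
    // of the output BigQuery table.
    String messageValue = "{\"ticker\":\"GOOG\",\"price\":1765.13}";

    // TableRowJsonCoder parses the JSON payload into a TableRow.
    TableRow row =
        TableRowJsonCoder.of()
            .decode(new ByteArrayInputStream(messageValue.getBytes(StandardCharsets.UTF_8)));

    System.out.println(row.get("ticker")); // GOOG
  }
}

Messages that cannot be parsed as JSON, or that fail in the UDF, are routed to the dead-letter table described by the outputDeadletterTable parameter below.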

Template parameters

Parameter Description

outputTableSpec The BigQuery output table location to write the Apache Kafka messages to, in the format of my-project:dataset.table

inputTopics The Apache Kafka input topics to read from in a comma-separated list. For example: messages


bootstrapServers The host address of the running Apache Kafka broker servers in a comma-separated list, each host address in the format of 35.70.252.199:9092

javascriptTextTransformGcsPath (Optional) Cloud Storage location path to the JavaScript UDF. For example: gs://my_bucket/my_function.js

javascriptTextTransformFunctionName (Optional) The name of the JavaScript function to call as your UDF. For example: transform

outputDeadletterTable (Optional) The BigQuery output table location to write deadletter records to, in the format of my-project:dataset.my-deadletter-table. If it doesn't exist, the table is created during pipeline execution. If not specified, <outputTableSpec>_error_records is used instead.

Running the Apache Kafka to BigQuery template

CONSOLE | GCLOUD (#gcloud) | API (#api)

Run from the Google Cloud Console (/dataflow/docs/templates/running-templates#console)

1. Go to the Dataflow page in the Cloud Console.

Go to the Dataflow page (https://console.cloud.google.com/dataflow)

2. Click Create job from template.

3. Select the Apache Kafka to BigQuery template from the Dataflow template drop-down menu.

4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.

5. Enter your parameter values in the provided parameter fields.

6. Click Run Job.

Template source code


(https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/kafka-to-bigquery/src/main/java/com/google/cloud/teleport/v2/templates/KafkaToBigQuery.java)

/*
 * Copyright (C) 2019 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.FailsafeJsonToTableRow;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters.WriteKafkaMessageErrors;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.collect.ImmutableMap;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.commons.lang3.ObjectUtils;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link KafkaToBigQuery} pipeline is a streaming pipeline which ingests text data from Kafka,
 * executes a UDF, and outputs the resulting records to BigQuery. Any errors which occur in the
 * transformation of the data, execution of the UDF, or inserting into the output table will be
 * inserted into a separate errors table in BigQuery. The errors table will be created if it does
 * not exist prior to execution. Both output and error tables are specified by the user as template
 * parameters.
 *
 * <p>Pipeline Requirements
 *
 * <ul>
 *   <li>The Kafka topic exists and the message is encoded in a valid JSON format.
 *   <li>The BigQuery output table exists.
 *   <li>The Kafka brokers are reachable from the Dataflow worker machines.
 * </ul>
 *
 * <p>Example Usage
 *
 * <pre>
 * # Set some environment variables
 * PROJECT=my-project
 * TEMP_BUCKET=my-temp-bucket
 * OUTPUT_TABLE=${PROJECT}:my_dataset.my_table
 * TOPICS=my-topics
 * JS_PATH=my-js-path-on-gcs
 * JS_FUNC_NAME=my-js-func-name
 * BOOTSTRAP=my-comma-separated-bootstrap-servers
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Build and upload image
 * mvn clean package \
 *   -Dimage=${TARGET_GCR_IMAGE} \
 *   -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 *   -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 *   -Dapp-root=${APP_ROOT} \
 *   -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create an image spec in GCS that contains the path to the image
 * {
 *   "docker_template_spec": {
 *     "docker_image": $TARGET_GCR_IMAGE
 *   }
 * }
 *
 * # Execute template:
 * API_ROOT_URL="https://dataflow.googleapis.com"
 * TEMPLATES_LAUNCH_API="${API_ROOT_URL}/v1b3/projects/${PROJECT}/templates:launch"
 * JOB_NAME="kafka-to-bigquery`date +%Y%m%d-%H%M%S-%N`"
 *
 * time curl -X POST -H "Content-Type: application/json" \
 *     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 *     "${TEMPLATES_LAUNCH_API}"`
 *     `"?validateOnly=false"`
 *     `"&dynamicTemplate.gcsPath=${TEMP_BUCKET}/path/to/image-spec"`
 *     `"&dynamicTemplate.stagingLocation=${TEMP_BUCKET}/staging" \
 *     -d '
 *      {
 *       "jobName":"'$JOB_NAME'",
 *       "parameters": {
 *           "outputTableSpec":"'$OUTPUT_TABLE'",
 *           "inputTopics":"'$TOPICS'",
 *           "javascriptTextTransformGcsPath":"'$JS_PATH'",
 *           "javascriptTextTransformFunctionName":"'$JS_FUNC_NAME'",
 *           "bootstrapServers":"'$BOOTSTRAP'"
 *        }
 *       }
 *      '
 * </pre>
 */
public class KafkaToBigQuery {

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(KafkaToBigQuery.class);

  /** The tag for the main output for the UDF. */
  private static final TupleTag<FailsafeElement<KV<String, String>, String>> UDF_OUT =
      new TupleTag<FailsafeElement<KV<String, String>, String>>() {};

  /** The tag for the main output of the json transformation. */
  static final TupleTag<TableRow> TRANSFORM_OUT = new TupleTag<TableRow>() {};

  /** The tag for the dead-letter output of the udf. */
  static final TupleTag<FailsafeElement<KV<String, String>, String>> UDF_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<KV<String, String>, String>>() {};

  /** The tag for the dead-letter output of the json to table row transform. */
  static final TupleTag<FailsafeElement<KV<String, String>, String>> TRANSFORM_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<KV<String, String>, String>>() {};

  /** The default suffix for error tables if dead letter table is not specified. */
  private static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

  /** String/String Coder for FailsafeElement. */
  private static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(
          NullableCoder.of(StringUtf8Coder.of()), NullableCoder.of(StringUtf8Coder.of()));

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options extends PipelineOptions {

    @Description("Table spec to write the output to")
    @Required
    String getOutputTableSpec();

    void setOutputTableSpec(String outputTableSpec);

    @Description("Kafka Bootstrap Servers")
    @Required
    String getBootstrapServers();

    void setBootstrapServers(String bootstrapServers);

    @Description("Kafka topic(s) to read the input from")
    @Required
    String getInputTopics();

    void setInputTopics(String inputTopics);

    @Description(
        "The dead-letter table to output to within BigQuery in <project-id>:<dataset>.<table> "
            + "format. If it doesn't exist, it will be created during pipeline execution.")
    String getOutputDeadletterTable();

    void setOutputDeadletterTable(String outputDeadletterTable);

    @Description("Gcs path to javascript udf source")
    String getJavascriptTextTransformGcsPath();

    void setJavascriptTextTransformGcsPath(String javascriptTextTransformGcsPath);

    @Description("UDF Javascript Function Name")
    String getJavascriptTextTransformFunctionName();

    void setJavascriptTextTransformFunctionName(String javascriptTextTransformFunctionName);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for its execution to finish. If blocking execution is required, use the {@link
   * KafkaToBigQuery#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until
   * the pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the
   * result object to block until the pipeline is finished running if blocking programmatic
   * execution is required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coder for pipeline
    FailsafeElementCoder<KV<String, String>, String> coder =
        FailsafeElementCoder.of(
            KvCoder.of(
                NullableCoder.of(StringUtf8Coder.of()), NullableCoder.of(StringUtf8Coder.of())),
            NullableCoder.of(StringUtf8Coder.of()));

    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(coder.getEncodedTypeDescriptor(), coder);

    List<String> topicsList = new ArrayList<>(Arrays.asList(options.getInputTopics().split(",")));

    /*
     * Steps:
     *  1) Read messages in from Kafka
     *  2) Transform the messages into TableRows
     *     - Transform message payload via UDF
     *     - Convert UDF result to TableRow objects
     *  3) Write successful records out to BigQuery
     *  4) Write failed records out to BigQuery
     */
    PCollectionTuple convertedTableRows =
        pipeline
            /*
             * Step #1: Read messages in from Kafka
             */
            .apply(
                "ReadFromKafka",
                KafkaIO.<String, String>read()
                    .withConsumerConfigUpdates(
                        ImmutableMap.of(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"))
                    .withBootstrapServers(options.getBootstrapServers())
                    .withTopics(topicsList)
                    .withKeyDeserializerAndCoder(
                        StringDeserializer.class, NullableCoder.of(StringUtf8Coder.of()))
                    .withValueDeserializerAndCoder(
                        StringDeserializer.class, NullableCoder.of(StringUtf8Coder.of()))
                    .withoutMetadata())

            /*
             * Step #2: Transform the Kafka Messages into TableRows
             */
            .apply("ConvertMessageToTableRow", new MessageToTableRow(options));

    /*
     * Step #3: Write the successful records out to BigQuery
     */
    WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .to(options.getOutputTableSpec()));

    /*
     * Step 3 Contd.
     * Elements that failed inserts into BigQuery are extracted and converted to FailsafeElement
     */
    PCollection<FailsafeElement<String, String>> failedInserts =
        writeResult
            .getFailedInsertsWithErr()
            .apply(
                "WrapInsertionErrors",
                MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                    .via(KafkaToBigQuery::wrapBigQueryInsertError))
            .setCoder(FAILSAFE_ELEMENT_CODER);

    /*
     * Step #4: Write failed records out to BigQuery
     */
    PCollectionList.of(convertedTableRows.get(UDF_DEADLETTER_OUT))
        .and(convertedTableRows.get(TRANSFORM_DEADLETTER_OUT))
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "WriteTransformationFailedRecords",
            WriteKafkaMessageErrors.newBuilder()
                .setErrorRecordsTable(
                    ObjectUtils.firstNonNull(
                        options.getOutputDeadletterTable(),
                        options.getOutputTableSpec() + DEFAULT_DEADLETTER_TABLE_SUFFIX))
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    /*
     * Step #5: Insert records that failed BigQuery inserts into a deadletter table.
     */
    failedInserts.apply(
        "WriteInsertionFailedRecords",
        ErrorConverters.WriteStringMessageErrors.newBuilder()
            .setErrorRecordsTable(
                ObjectUtils.firstNonNull(
                    options.getOutputDeadletterTable(),
                    options.getOutputTableSpec() + DEFAULT_DEADLETTER_TABLE_SUFFIX))
            .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
            .build());

    return pipeline.run();
  }

  /**
   * The {@link MessageToTableRow} class is a {@link PTransform} which transforms incoming Kafka
   * Message objects into {@link TableRow} objects for insertion into BigQuery while applying a UDF
   * to the input. The executions of the UDF and transformation to {@link TableRow} objects is done
   * in a fail-safe way by wrapping the element with its original payload inside the {@link
   * FailsafeElement} class. The {@link MessageToTableRow} transform will output a {@link
   * PCollectionTuple} which contains all output and dead-letter {@link PCollection}.
   *
   * <p>The {@link PCollectionTuple} output will contain the following {@link PCollection}:
   *
   * <ul>
   *   <li>{@link KafkaToBigQuery#UDF_OUT} - Contains all {@link FailsafeElement} records
   *       successfully processed by the UDF.
   *   <li>{@link KafkaToBigQuery#UDF_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which failed processing during the UDF execution.
   *   <li>{@link KafkaToBigQuery#TRANSFORM_OUT} - Contains all records successfully converted from
   *       JSON to {@link TableRow} objects.
   *   <li>{@link KafkaToBigQuery#TRANSFORM_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which couldn't be converted to table rows.
   * </ul>
   */
  static class MessageToTableRow
      extends PTransform<PCollection<KV<String, String>>, PCollectionTuple> {

    private final Options options;

    MessageToTableRow(Options options) {
      this.options = options;
    }

    @Override
    public PCollectionTuple expand(PCollection<KV<String, String>> input) {

      PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover from failures
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new MessageToFailsafeElementFn()))
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.<KV<String, String>>newBuilder()
                      .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                      .setFunctionName(options.getJavascriptTextTransformFunctionName())
                      .setSuccessTag(UDF_OUT)
                      .setFailureTag(UDF_DEADLETTER_OUT)
                      .build());

      // Convert the records which were successfully processed by the UDF into TableRow objects.
      PCollectionTuple jsonToTableRowOut =
          udfOut
              .get(UDF_OUT)
              .apply(
                  "JsonToTableRow",
                  FailsafeJsonToTableRow.<KV<String, String>>newBuilder()
                      .setSuccessTag(TRANSFORM_OUT)
                      .setFailureTag(TRANSFORM_DEADLETTER_OUT)
                      .build());

      // Re-wrap the PCollections so we can return a single PCollectionTuple
      return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT))
          .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT))
          .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT))
          .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_OUT));
    }
  }

  /**
   * The {@link MessageToFailsafeElementFn} wraps a Kafka Message with the {@link FailsafeElement}
   * class so errors can be recovered from and the original message can be output to an error
   * records table.
   */
  static class MessageToFailsafeElementFn
      extends DoFn<KV<String, String>, FailsafeElement<KV<String, String>, String>> {

    @ProcessElement
    public void processElement(ProcessContext context) {
      KV<String, String> message = context.element();
      context.output(FailsafeElement.of(message, message.getValue()));
    }
  }

  /**
   * Method to wrap a {@link BigQueryInsertError} into a {@link FailsafeElement}.
   *
   * @param insertError BigQueryInsert error.
   * @return FailsafeElement object.
   */
  protected static FailsafeElement<String, String> wrapBigQueryInsertError(
      BigQueryInsertError insertError) {

    FailsafeElement<String, String> failsafeElement;
    try {

      failsafeElement =
          FailsafeElement.of(
              insertError.getRow().toPrettyString(), insertError.getRow().toPrettyString());
      failsafeElement.setErrorMessage(insertError.getError().toPrettyString());

    } catch (IOException e) {
      LOG.error("Failed to wrap BigQuery insert error.");
      throw new RuntimeException(e);
    }
    return failsafeElement;
  }
}

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies (https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2020-08-20 UTC.
