
Deploying a pipeline

This document explains in detail how Dataflow deploys and runs a pipeline, and covers advanced topics like optimization and load balancing. If you are looking for a step-by-step guide on how to create and deploy your first pipeline, use Dataflow's quickstarts for Java (/dataflow/docs/quickstarts/quickstart-java-maven), Python (/dataflow/docs/quickstarts/quickstart-python) or templates (/dataflow/docs/quickstarts/quickstart-templates).

After you construct and test your Apache Beam pipeline, you can use the Dataflow managed service to deploy and execute it. Once on the Dataflow service, your pipeline code becomes a Dataflow job.

The Dataflow service fully manages Google Cloud services such as Compute Engine (/compute) and Cloud Storage (/storage) to run your Dataflow job, automatically spinning up and tearing down the necessary resources. The Dataflow service provides visibility into your job through tools like the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf) and the Dataflow Command-line Interface (/dataflow/pipelines/dataflow-command-line-intf).

You can control some aspects of how the Dataflow service runs your job by setting execution parameters (/dataflow/pipelines/specifying-exec-params) in your pipeline code. For example, the execution parameters specify whether the steps of your pipeline run on worker virtual machines, on the Dataflow service backend, or locally.

In addition to managing Google Cloud resources, the Dataflow service automatically performs and optimizes many aspects of distributed parallel processing. These include:

Parallelization and Distribution. Dataflow automatically partitions your data and distributes your worker code to Compute Engine instances for parallel processing.

Optimization. Dataflow uses your pipeline code to create an execution graph that represents your pipeline's PCollections and transforms, and optimizes the graph for the most efficient performance and resource usage. Dataflow also automatically optimizes potentially costly operations, such as data aggregations.

Automatic Tuning features. The Dataflow service includes several features that provide on-the-fly adjustment of resource allocation and data partitioning, such as Autoscaling and Dynamic Work Rebalancing. These features help the Dataflow service execute your job as quickly and efficiently as possible.

Pipeline lifecycle: from pipeline code to Dataflow job

When you run your Dataflow pipeline, Dataflow creates an execution graph from the code that constructs your Pipeline object, including all of the transforms and their associated processing functions (such as DoFns). This phase is called Graph Construction Time and runs locally on the computer where the pipeline is run.

During graph construction, Apache Beam locally executes the code from the main entry point of the pipeline code, stopping at the calls to a source, sink or transform step, and turning these calls into nodes of the graph. As a consequence, a piece of code in a pipeline's entry point (Java's main() method or the top-level of a Python script) locally executes on the machine that runs the pipeline, while the same code declared in a method of a DoFn object executes in the Dataflow workers.
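The following minimal sketch illustrates this split for the Java SDK. The bucket paths, class name, and transform names are placeholders; the point is that the body of main() runs at graph construction time on the launching machine, while the body of the anonymous DoFn runs later on the workers.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class WhereCodeRuns {
  public static void main(String[] args) {
    // Everything here executes locally, at graph construction time.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"))   // becomes a graph node
     .apply("ToUpper", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // This body executes later, on the Dataflow workers, once per element.
          c.output(c.element().toUpperCase());
        }
      }))
     .apply("WriteResult", TextIO.write().to("gs://my-bucket/output"));

    p.run();  // submits the translated graph to the configured runner
  }
}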

Also during graph construction, Apache Beam validates that any resources referenced by the pipeline (like Cloud Storage buckets, BigQuery tables, and Pub/Sub Topics or Subscriptions) actually exist and are accessible. The validation is done through standard API calls to the respective services, so it's vital that the user account used to run a pipeline has proper connectivity to the necessary services and is authorized to call their APIs. Before submitting the pipeline to the Dataflow service, Apache Beam also checks for other errors, and ensures that the pipeline graph doesn't contain any illegal operations.

The execution graph is then translated into JSON format, and the JSON execution graph is transmitted to the Dataflow service endpoint.

Graph construction also happens when you execute your pipeline locally, but the graph is not translated to JSON or transmitted to the service. Instead, the graph is run locally on the same machine where you launched your Dataflow program. See the documentation on configuring for local execution (/dataflow/pipelines/specifying-exec-params#LocalExecution) for more details.

The Dataflow service then validates the JSON execution graph. When the graph is validated, it becomes a job on the Dataflow service. You'll be able to see your job, its execution graph, status, and log information by using the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf).

Java: SDK 2.x

The Dataflow service sends a response to the machine where you ran your Dataflow program. This response is encapsulated in the object DataflowPipelineJob, which contains your Dataflow job's jobId. You can use the jobId to monitor, track, and troubleshoot your job using the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf) and the Dataflow Command-line Interface (/dataflow/pipelines/dataflow-command-line-intf). See the API reference for DataflowPipelineJob (https://beam.apache.org/documentation/sdks/javadoc/current/index.html?org/apache/beam/runners/dataflow/DataflowPipelineJob.html) for more information.

Execution graph

Dataflow builds a graph of steps that represents your pipeline, based on the transforms and data you used when you constructed your Pipeline object. This is the pipeline execution graph.

The WordCount (https://beam.apache.org/get-started/wordcount-example/) example, included with the Apache Beam SDKs, contains a series of transforms to read, extract, count, format, and write the individual words in a collection of text, along with an occurrence count for each word. The following diagram shows how the transforms in the WordCount pipeline are expanded into an execution graph:


Figure 1: WordCount Example Execution Graph

The execution graph often differs from the order in which you specified your transforms when you constructed the pipeline. This is because the Dataflow service performs various optimizations and fusions on the execution graph before it runs on managed cloud resources. The Dataflow service respects data dependencies when executing your pipeline; however, steps without data dependencies between them may be executed in any order.

You can see the unoptimized execution graph that Dataflow has generated for your pipeline when you select your job in the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf).

Parallelization and distribution

The Dataflow service automatically parallelizes and distributes the processing logic in your pipeline to the workers you've allotted to perform your job. Dataflow uses the abstractions in the programming model (/dataflow/model/programming-model-beam) to represent parallel processing functions; for example, your ParDo transforms cause Dataflow to automatically distribute your processing code (represented by DoFns) to multiple workers to be run in parallel.

Structuring your user code

You can think of your DoFn code as small, independent entities: there can potentially be many instances running on different machines, each with no knowledge of the others. As such, pure functions (functions that do not depend on hidden or external state, that have no observable side effects, and are deterministic) are ideal code for the parallel and distributed nature of DoFns.

The pure function model is not strictly rigid, however; state information or external initialization data can be valid for DoFn and other function objects, so long as your code does not depend on things that the Dataflow service does not guarantee. When structuring your ParDo transforms and creating your DoFns, keep the following guidelines in mind:

The Dataflow service guarantees that every element in your input PCollection is processed by a DoFn instance exactly once.

The Dataflow service does not guarantee how many times a DoFn will be invoked.

The Dataflow service does not guarantee exactly how the distributed elements are grouped—that is, it does not guarantee which (if any) elements are processed together.

The Dataflow service does not guarantee the exact number of DoFn instances that will be created over the course of a pipeline.

The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).

The Dataflow service serializes element processing per DoFn instance. Your code does not need to be strictly thread-safe; however, any state shared between multiple DoFn instances must be thread-safe.
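As an illustration of these guidelines, the following hypothetical DoFn is a pure function: its output depends only on the input element, it holds no mutable state, and re-invoking it on a retried bundle is harmless.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// A DoFn written as a pure function. The class name and element types are
// illustrative; any deterministic, side-effect-free DoFn has the same property.
class FormatWordCount extends DoFn<KV<String, Long>, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, Long> wordAndCount = c.element();
    c.output(wordAndCount.getKey() + ": " + wordAndCount.getValue());
  }
}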

See Requirements for User-Provided Functions (https://beam.apache.org/documentation/programming-guide/#requirements-for-writing-user-code-for-beam-transforms) in the programming model (/dataflow/model/programming-model-beam) documentation for more information about building your user code.

Error and exception handling

Your pipeline may throw exceptions while processing data. Some of these errors are transient (e.g., temporary difficulty accessing an external service), but some are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.

Dataflow processes elements in arbitrary bundles, and retries the complete bundle when an error is thrown for any element in that bundle. When running in batch mode, bundles including a failing item are retried 4 times. The pipeline will fail completely when a single bundle has failed 4 times. When running in streaming mode, a bundle including a failing item will be retried indefinitely, which may cause your pipeline to permanently stall.

When processing in batch mode, you might see a large number of individual failures before a pipeline job fails completely (which happens when any given bundle fails after four retry attempts). For example, if your pipeline attempts to process 100 bundles, Dataflow could theoretically generate several hundred individual failures until a single bundle reaches the 4-failure condition for exit.
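One common way to keep a single corrupt record from repeatedly failing its bundle is to catch the exception inside the DoFn and route the bad element to a separate "dead letter" output instead of rethrowing. The sketch below assumes lines is an existing PCollection<String>; the tags and transform names are illustrative.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Tags for the main output and for unparseable records.
final TupleTag<Integer> parsedTag = new TupleTag<Integer>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = lines.apply("ParseOrQuarantine",
    ParDo.of(new DoFn<String, Integer>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        try {
          c.output(Integer.parseInt(c.element().trim()));
        } catch (NumberFormatException e) {
          // Instead of throwing (which would fail and retry the whole bundle),
          // send the bad record to the dead-letter output.
          c.output(deadLetterTag, c.element());
        }
      }
    }).withOutputTags(parsedTag, TupleTagList.of(deadLetterTag)));

PCollection<Integer> parsed = results.get(parsedTag);
PCollection<String> badRecords = results.get(deadLetterTag);  // write these out for later inspection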

Fusion optimization

Once the JSON form of your pipeline's execution graph has been validated, the Dataflow service may modify the graph to perform optimizations. Such optimizations can include fusing multiple steps or transforms in your pipeline's execution graph into single steps. Fusing steps prevents the Dataflow service from needing to materialize every intermediate PCollection in your pipeline, which can be costly in terms of memory and processing overhead.

While all the transforms you've specified in your pipeline construction are executed on the service, they may be executed in a different order, or as part of a larger fused transform to ensure the most efficient execution of your pipeline. The Dataflow service respects data dependencies between the steps in the execution graph, but otherwise steps may be executed in any order.

Fusion example


The following diagram shows how the execution graph from the WordCount (/dataflow/examples/examples-beam) example included with the Apache Beam SDK for Java might be optimized and fused by the Dataflow service for efficient execution:

Figure 2: WordCount Example Optimized Execution Graph

Preventing fusion

There are a few cases in your pipeline where you may want to prevent the Dataflow service from performing fusion optimizations. These are cases in which the Dataflow service might incorrectly guess the optimal way to fuse operations in the pipeline, which could limit the Dataflow service's ability to make use of all available workers.

For example, one case in which fusion can limit Dataflow's ability to optimize worker usage is a "high fan-out" ParDo. In such an operation, you might have an input collection with relatively few elements, but the ParDo produces an output with hundreds or thousands of times as many elements, followed by another ParDo. If the Dataflow service fuses these ParDo operations together, parallelism in this step is limited to at most the number of items in the input collection, even though the intermediate PCollection contains many more elements.

You can prevent such a fusion by adding an operation to your pipeline that forces the Dataflow service to materialize your intermediate PCollection. Consider using one of the following operations:

You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.

You can pass your intermediate PCollection as a side input (https://beam.apache.org/documentation/programming-guide/#side-inputs) to another ParDo. The Dataflow service always materializes side inputs.

You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation.
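For example, the Reshuffle option might look like the following sketch in the Java SDK. Here expanded is assumed to be the output of the high-fan-out ParDo (a PCollection<String>), and ExpensiveDoFn is a placeholder for the downstream processing; inserting the Reshuffle breaks fusion so the expanded collection can be spread across many workers.

import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

expanded
    // Materializes the intermediate PCollection and prevents fusion with the next step.
    .apply("BreakFusion", Reshuffle.<String>viaRandomKey())
    .apply("Process", ParDo.of(new ExpensiveDoFn()));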

Combine optimization

Aggregation operations are an important concept in large-scale data processing. Aggregation brings together data that's conceptually far apart, making it extremely useful for correlating. The Dataflow programming model (/dataflow/model/programming-model-beam) represents aggregation operations as the GroupByKey, CoGroupByKey, and Combine transforms.

Dataflow's aggregation operations combine data across the entire data set, including data that may be spread across multiple workers. During such aggregation operations, it's often most efficient to combine as much data locally as possible before combining data across instances. When you apply a GroupByKey or other aggregating transform, the Dataflow service automatically performs partial combining locally before the main grouping operation.

Because the Dataflow service automatically performs partial local combining, it is strongly recommended that you do not attempt to make this optimization by hand in your pipeline code.

When performing partial or multi-level combining, the Dataflow service makes different decisions based on whether your pipeline is working with batch or streaming data. For bounded data, the service favors efficiency and will perform as much local combining as possible. For unbounded data, the service favors lower latency, and may not perform partial combining (as it may increase latency).
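To benefit from this behavior, express the aggregation with a Combine transform (or one of the built-in combiners such as Sum) rather than grouping the values and summing them yourself in a downstream DoFn. A minimal sketch, assuming perUserSpend is an existing PCollection<KV<String, Long>>:

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Because the aggregation is expressed as a Combine, the service can combine
// partial sums on each worker before the main grouping step.
PCollection<KV<String, Long>> totals =
    perUserSpend.apply("SumPerUser", Sum.<String>longsPerKey());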


Autotuning features

The Dataflow service contains several autotuning features that can further dynamically optimize your Dataflow job while it is running. These features include Autoscaling (/dataflow/docs/guides/deploying-a-pipeline#autoscaling) and Dynamic Work Rebalancing (https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow).

Autoscaling

With autoscaling enabled, the Dataflow service automatically chooses the appropriate number of worker instances required to run your job. The Dataflow service may also dynamically re-allocate more workers or fewer workers during runtime to account for the characteristics of your job. Certain parts of your pipeline may be computationally heavier than others, and the Dataflow service may automatically spin up additional workers during these phases of your job (and shut them down when they're no longer needed).

Java: SDK 2.x

Autoscaling is enabled by default on all batch Dataflow jobs and streaming jobs using Streaming Engine (#streaming-engine). You can disable autoscaling by specifying (/dataflow/pipelines/specifying-exec-params) the flag --autoscalingAlgorithm=NONE when you run your pipeline; if so, note that the Dataflow service sets the number of workers based on the --numWorkers option, which defaults to 3.

With autoscaling enabled, the Dataflow service does not allow user control of the exact number of worker instances allocated to your job. You might still cap the number of workers by specifying (/dataflow/pipelines/specifying-exec-params) the --maxNumWorkers option when you run your pipeline.

For batch jobs, the --maxNumWorkers flag is optional. The default is 1000. For streaming jobs using Streaming Engine, the --maxNumWorkers flag is optional. The default is 100. For streaming jobs not using Streaming Engine, the --maxNumWorkers flag is required.
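For example (the worker counts here are only illustrative), you might disable autoscaling and run with a fixed pool of five workers:

--autoscalingAlgorithm=NONE --numWorkers=5

or keep autoscaling enabled but cap it at 20 workers:

--maxNumWorkers=20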

Dataflow scales based on the parallelism of a pipeline. The parallelism of a pipeline is an estimate of the number of threads needed to most efficiently process data at any given time.

The parallelism is calculated every few minutes unless the bandwidth of an external service is too low. When the parallelism increases, Dataflow scales up and adds workers. When the parallelism decreases, Dataflow scales down and removes workers.

The following table summarizes when autoscaling increases or decreases the number of workers in batch (#batch-autoscaling) and streaming (#streaming-autoscaling) pipelines:

Scaling up

Batch pipelines: If the remaining work takes longer than spinning up new workers and the current workers are utilizing, on average, more than 5% of their CPUs, Dataflow may scale up. Sources with the following may limit the number of new workers: a small amount of data, un-splittable data (like compressed files), and data processed by I/O modules that don't split data. Sinks configured to write to a fixed number of shards, like a Cloud Storage destination writing to existing files, may limit the number of new workers.

Streaming pipelines: If a streaming pipeline is backlogged and workers are utilizing, on average, more than 20% of their CPUs, Dataflow may scale up. Backlogs are cleared within approximately 150 seconds, given the current throughput per worker.

Scaling down

Batch pipelines: If the remaining work takes less time than spinning up new workers and the current workers are utilizing, on average, more than 5% of their CPUs, Dataflow may scale down.

Streaming pipelines: If a streaming pipeline backlog is lower than 20 seconds and workers are utilizing, on average, less than 80% of the CPUs, Dataflow may scale down. After scaling down, the new number of workers utilize, on average, less than 75% of their CPUs.

No autoscaling

Batch pipelines: If I/O takes longer than data processing or workers are utilizing, on average, less than 5% of their CPUs, the parallelism isn't recalculated.

Streaming pipelines: If workers are utilizing, on average, less than 20% of their CPU, the parallelism isn't recalculated.

Batch autoscaling

For batch pipelines, Dataflow automatically chooses the number of workers based on both the amount of work in each stage of your pipeline and the current throughput at that stage. Dataflow determines how much data is being processed by the current set of workers and extrapolates how much time the rest of the work takes to process.


The number of workers is sub-linear to the amount of work. For instance, a job with twice the work has less than twice the workers.

If your pipeline uses a custom data source that you've implemented, there are a few methods you can implement that provide more information to the Dataflow service's autoscaling algorithm and potentially improve performance:

Java: SDK 2.x

In your BoundedSource subclass, implement the method getEstimatedSizeBytes. The Dataflow service uses getEstimatedSizeBytes when calculating the initial number of workers to use for your pipeline.

In your BoundedReader subclass, implement the method getFractionConsumed. The Dataflow service uses getFractionConsumed to track read progress and converge on the correct number of workers to use during a read.
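As a rough sketch of what these two hooks might look like, consider a hypothetical source that reads fixed-size records from a byte range [startOffset, endOffset) and tracks its position in currentOffset; the other required methods of the source and reader are omitted here.

// Inside your BoundedSource<MyRecord> subclass:
@Override
public long getEstimatedSizeBytes(PipelineOptions options) throws Exception {
  // Used by the service to pick the initial number of workers.
  return endOffset - startOffset;
}

// Inside your BoundedSource.BoundedReader<MyRecord> subclass:
@Override
public Double getFractionConsumed() {
  // Used by the service to track read progress; may return null if unknown.
  return (double) (currentOffset - startOffset) / (endOffset - startOffset);
}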

Streaming autoscaling

Streaming autoscaling is generally available for pipelines that use Streaming Engine (#streaming-engine). For pipelines that do not use Streaming Engine, streaming autoscaling is available in beta.

For more information about launch stage definitions, see the Product launch stages (/products#product-launch-stages) page.

Streaming autoscaling allows the Dataflow service to adaptively change the number of workers used to execute your streaming pipeline in response to changes in load and resource utilization. Streaming autoscaling is a free feature and is designed to reduce the costs of the resources used when executing streaming pipelines.

Without autoscaling, you choose a fixed number of workers by specifying numWorkers or num_workers to execute your pipeline. As the input workload varies over time, this number can become either too high or too low. Provisioning too many workers results in unnecessary extra cost, and provisioning too few workers results in higher latency for processed data. By enabling autoscaling, resources are used only as they are needed.


The objective of autoscaling streaming pipelines is to minimize backlog while maximizing worker utilization and throughput, and to react quickly to spikes in load. By enabling autoscaling, you don't have to choose between provisioning for peak load and fresh results. Workers are added as CPU utilization and backlog increase and are removed as these metrics come down. This way, you're paying only for what you need, and the job is processed as efficiently as possible.

Java: SDK 2.x

Custom unbounded sources

If your pipeline uses a custom unbounded source, the source must inform the Dataflow service about backlog. Backlog is an estimate of the input in bytes that has not yet been processed by the source. To inform the service about backlog, implement either one of the following methods in your UnboundedReader class.

getSplitBacklogBytes() - Backlog for the current split of the source. The service aggregates backlog across all the splits.

getTotalBacklogBytes() - The global backlog across all the splits. In some cases the backlog is not available for each split and can only be calculated across all the splits. Only the first split (split ID '0') needs to provide total backlog.
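A minimal sketch of the per-split variant, assuming your reader keeps bytesBehind, its own estimate of unread input bytes (the rest of the UnboundedReader implementation is omitted):

// Inside your UnboundedSource.UnboundedReader<MyRecord> subclass:
@Override
public long getSplitBacklogBytes() {
  // Report this split's backlog; BACKLOG_UNKNOWN tells the service no estimate is available.
  return bytesBehind >= 0 ? bytesBehind : BACKLOG_UNKNOWN;
}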

The Apache Beam repository contains several examples (https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java) of custom sources that implement the UnboundedReader class.

Enable streaming autoscaling

For streaming jobs using Streaming Engine (#streaming-engine), autoscaling is enabled by default.

To enable autoscaling for jobs not using Streaming Engine, set the following execution parameters (/dataow/pipelines/specifying-exec-params) when you start your pipeline:

--autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=N

For streaming jobs not using Streaming Engine, the minimum number of workers is 1/15th of the --maxNumWorkers value, rounded up.
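For example, with the illustrative value used in the pricing discussion below:

--autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=15

the job can scale between 1 worker (15 / 15, rounded up) and 15 workers.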


Streaming pipelines are deployed with a fixed pool of Persistent Disks (/compute/docs/disks#pdspecs), equal in number to --maxNumWorkers. Take this into account when you specify --maxNumWorkers, and ensure this value is a sufficient number of disks for your pipeline.

Note: If you've reached a scaling limit and want to raise the --maxNumWorkers, you must submit a new job with a higher --maxNumWorkers.

If you want to update a streaming autoscaling job that's not using Streaming Engine, make sure --maxNumWorkers remains the same (see the section on manually scaling streaming pipelines (#ManualScaling)). Not specifying the --autoscalingAlgorithm pipeline option in the Update command disables autoscaling for the updated job.

Usage and pricing

Compute Engine usage is based on the average number of workers, while Persistent Disk usage is based on the exact value of --maxNumWorkers. Persistent Disks are redistributed such that each worker gets an equal number of attached disks.

In the example above, where --maxNumWorkers=15, you pay (/dataflow/pricing) for between 1 and 15 Compute Engine instances and exactly 15 Persistent Disks.

Manually scaling a streaming pipeline

Until autoscaling is generally available in streaming mode, there is a workaround you can use to manually scale the number of workers running your streaming pipeline by using Dataflow's Update (/dataflow/pipelines/updating-a-pipeline) feature.

Java: SDK 2.x

To scale your streaming pipeline during execution, ensure that you set the following execution parameters (/dataflow/pipelines/specifying-exec-params) when you start your pipeline:

Set --maxNumWorkers equal to the maximum number of workers you want available to your pipeline.

Set --numWorkers equal to the initial number of workers you want your pipeline to use when it starts running.

Once your pipeline is running, you can Update your pipeline and specify a new number of workers using the --numWorkers parameter. The value you set for the new --numWorkers must be between N and --maxNumWorkers, where N is equal to --maxNumWorkers / 15.
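For example (the values are illustrative): launch the pipeline with

--numWorkers=3 --maxNumWorkers=15

and later, to scale the running job to 10 workers, rerun the same pipeline with the Update options, replacing <existing job name> with the name of the running job:

--update --jobName=<existing job name> --numWorkers=10 --maxNumWorkers=15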

Updating your pipeline replaces your running job with a new job, using the new number of workers, while preserving all state information associated with the previous job.

Note: Your pipeline's maximum scaling range depends on the number of persistent disks deployed when the pipeline starts. The Dataflow service deploys one persistent disk per worker at the maximum number of workers. Deploying extra Persistent Disks by setting --maxNumWorkers to a higher value than --numWorkers provides some benefits to your pipeline. Specifically, it allows you the flexibility to scale your pipeline to a larger number of workers after startup, and might provide improved performance (/compute/docs/disks/performance#size_price_performance). However, your pipeline might also incur additional cost for the extra Persistent Disks. Take note of the cost and quota implications of the additional Persistent Disk resources when planning your streaming pipeline and setting the scaling range.

Note: You cannot change the scaling range of a pipeline by using the Update feature. If you need to scale further, you must start a new pipeline and specify a higher value for --maxNumWorkers as the ceiling of your desired scaling range.

Dynamic Work Rebalancing

The Dataflow service's Dynamic Work Rebalancing feature allows the service to dynamically re-partition work based on runtime conditions. These conditions might include:

Imbalances in work assignments

Workers taking longer than expected to finish

Workers finishing faster than expected

The Dataflow service automatically detects these conditions and can dynamically reassign work to unused or underused workers to decrease your job's overall processing time.

Limitations

Dynamic Work Rebalancing only happens when the Dataflow service is processing some input data in parallel: when reading data from an external input source, when working with a materialized intermediate PCollection, or when working with the result of an aggregation like GroupByKey. If a large number of steps in your job are fused (#Optimization), there are fewer intermediate PCollections in your job and Dynamic Work Rebalancing will be limited to the number of elements in the source materialized PCollection. If you want to ensure that Dynamic Work Rebalancing can be applied to a particular PCollection in your pipeline, you can prevent fusion (#fusion-prevention) in a few different ways to ensure dynamic parallelism.

Dynamic Work Rebalancing cannot re-parallelize data finer than a single record. If your data contains individual records that cause large delays in processing time, they may still delay your job, since Dataflow cannot subdivide and redistribute an individual "hot" record to multiple workers.

Java: SDK 2.x

If you've set a fixed number of shards for your pipeline's final output (for example, by writing data using TextIO.Write.withNumShards), parallelization will be limited based on the number of shards that you've chosen.

This fixed-shards limitation can be considered temporary, and may be subject to change in future releases of the Dataflow service.

Working with Custom Data Sources

Java: SDK 2.x

If your pipeline uses a custom data source that you provide, you must implement the method splitAtFraction to allow your source to work with the Dynamic Work Rebalancing feature.

Caution: Using Dynamic Work Rebalancing with custom data sources is an extremely advanced use case. If you choose to implement splitAtFraction, it is critical that you test your code extensively and with maximum code coverage.

If you implement splitAtFraction incorrectly, records from your source may appear to get duplicated or dropped. See the API reference information on RangeTracker (https://beam.apache.org/documentation/sdks/javadoc/current/index.html?org/apache/beam/sdk/io/range/RangeTracker.html) for help and tips on implementing splitAtFraction.
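As a rough sketch only: for a reader whose position is an offset in the range [start, end) guarded by an OffsetRangeTracker named rangeTracker, splitAtFraction might delegate to the tracker as shown below. The MyRecord type and the createSourceForOffsetRange helper are hypothetical, and a real implementation must also coordinate carefully with the reading thread, as described in the RangeTracker documentation.

// Inside your BoundedSource.BoundedReader<MyRecord> subclass:
@Override
public BoundedSource<MyRecord> splitAtFraction(double fraction) {
  long splitOffset = rangeTracker.getPositionForFractionConsumed(fraction);
  if (rangeTracker.trySplitAtPosition(splitOffset)) {
    // This reader keeps [start, splitOffset); the returned residual source covers the rest.
    BoundedSource<MyRecord> residual = source.createSourceForOffsetRange(splitOffset, end);
    source = source.createSourceForOffsetRange(start, splitOffset);
    return residual;
  }
  return null; // the split point was not accepted; keep reading the full range
}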

Resource usage and management


The Dataflow service fully manages resources in Google Cloud on a per-job basis. This includes spinning up and shutting down Compute Engine (/compute) instances (occasionally referred to as workers or VMs) and accessing your project's Cloud Storage (/storage) buckets for both I/O and temporary file staging. However, if your pipeline interacts with Google Cloud data storage technologies like BigQuery (/bigquery) and Pub/Sub (/pubsub), you must manage the resources and quota for those services.

Dataflow uses a user-provided location in Cloud Storage (/storage) specifically for staging files. This location is under your control, and you should ensure that the location's lifetime is maintained as long as any job is reading from it. You can re-use the same staging location for multiple job runs, as the SDK's built-in caching can speed up the start time for your jobs.

Caution: Manually altering Dataflow-managed Compute Engine resources associated with a Dataflow job is an unsupported operation. You should not attempt to manually stop, delete, or otherwise control the Compute Engine resources that Dataflow has created to run your job. In addition, you should not alter any persistent disk resources associated with your Dataflow job.

Jobs

You may run up to 25 concurrent Dataflow jobs per Google Cloud project; however, this limit can be increased by contacting Google Cloud Support (/support). For more information, see Quotas (/dataflow/quotas#quota-increase).

The Dataflow service is currently limited to processing JSON job requests that are 20 MB in size or smaller. The size of the job request is specifically tied to the JSON representation of your pipeline; a larger pipeline means a larger request.

To estimate the size of your pipeline's JSON request, run your pipeline with the following option:

Java: SDK 2.x

--dataflowJobFile=< path to output file >

This command writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request; the actual size will be slightly larger due to some additional information included in the request.


For more information, see the troubleshooting page for "413 Request Entity Too Large" / "The size of serialized JSON representation of the pipeline exceeds the allowable limit" (/dataow/docs/guides/common-errors#json-request-too-large).

In addition, your job's graph size must not exceed 10 MB. For more information, see the troubleshooting page for "The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs." (/dataow/docs/guides/common-errors#job-graph-too-large).

Workers

The Dataflow service currently allows a maximum of 1000 Compute Engine instances per job. The default machine type is n1-standard-1 for batch jobs, and n1-standard-4 for streaming jobs. Therefore, when using the default machine types, the Dataflow service can allocate up to 4000 cores per job. If you need more cores for your job, you can select a larger machine type.

The Dataflow managed service now deploys Compute Engine virtual machines associated with Dataflow jobs using Managed Instance Groups (/compute/docs/instance-groups). A Managed Instance Group creates multiple Compute Engine instances from a common template and allows you to control and manage them as a group. That way, you don't need to individually control each instance associated with your pipeline.

You should not attempt to manage or otherwise interact directly with your Compute Engine Managed Instance Group; the Dataflow service will take care of that for you. Manually altering any Compute Engine resources associated with your Dataflow job is an unsupported operation.

You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement (/dataflow/sla).

Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. You can specify a machine type for your pipeline by setting the appropriate execution parameter (/dataflow/pipelines/specifying-exec-params) at pipeline creation time.


Caution: Shared core machine types such as f1 and g1 series workers are not supported under Dataflow's Service Level Agreement (/dataflow/sla).

Java: SDK 2.x

To change the machine type, set the --workerMachineType option.

The Dataflow service currently does not support jobs that use only preemptible virtual machines (/compute/docs/instances/preemptible). Instead, if you would like to save processing costs, consider using the FlexRS processing mode (/dataflow/docs/guides/flexrs), which uses a combination of preemptible and non-preemptible resources.

Resource quota

The Dataflow service checks to ensure that your Google Cloud project has the Compute Engine resource quota required to run your job, both to start the job and scale to the maximum number of worker instances. Your job will fail to start if there is not enough resource quota available.

Because a Dataflow job deploys Compute Engine virtual machines as a Managed Instance Group, you'll need to ensure your project satisfies some additional quota requirements. Specifically, your project will need one of the following types of quota for each concurrent Dataflow job that you want to run:

One Instance Group per job

One Managed Instance Group per job

One Instance Template per job

Caution: Manually changing your Dataflow job's Instance Template or Managed Instance Group is not recommended or supported. Use Dataflow's pipeline configuration options (/dataflow/pipelines/specifying-exec-params) instead.

Dataflow's Autoscaling (#AutoScaling) feature is limited by your project's available Compute Engine quota. If your job has sufficient quota when it starts, but another job uses the remainder of your project's available quota, the first job will run but not be able to fully scale.


However, the Dataow service does not manage quota increases for jobs that exceed the resource quotas in your project. You are responsible for making any necessary requests for additional resource quota, for which you can use the Google Cloud Console (https://console.cloud.google.com/).

Persistent disk resources

The Dataflow service is currently limited to 15 persistent disks per worker instance when running a streaming job. Each persistent disk is local to an individual Compute Engine virtual machine. Your job may not have more workers than persistent disks; a 1:1 ratio between workers and disks is the minimum resource allotment.

For jobs running on worker VMs, the default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode. Jobs using Streaming Engine (#streaming-engine) or Dataflow Shuffle (#cloud-dataflow-shuffle) run on the Dataflow service backend and use smaller disks.

Locations

By default, the Dataflow service deploys Compute Engine resources in the us-central1-f zone of the us-central1 region. You can override this setting by specifying (/dataflow/pipelines/specifying-exec-params) the --region parameter. If you need to use a specific zone for your resources, use the --zone parameter when you create your pipeline. However, we recommend that you only specify the region, and leave the zone unspecified. This allows the Dataflow service to automatically select the best zone within the region based on the available zone capacity at the time of the job creation request. For more information, see the regional endpoints (/dataflow/docs/concepts/regional-endpoints) documentation.

Streaming Engine

Currently, the Dataflow pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage. Dataflow's Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend.

Benefits of Streaming Engine


The Streaming Engine model has the following benefits:

A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs. Streaming Engine works best with smaller worker machine types (n1-standard-2 instead of n1-standard-4) and does not require Persistent Disk beyond a small worker boot disk, leading to less resource and quota consumption.

More responsive autoscaling (https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling) in response to variations in incoming data volume. Streaming Engine offers smoother, more granular scaling of workers.

Improved supportability, since you don’t need to redeploy your pipelines to apply service updates.

Most of the reduction in worker resources comes from offloading the work to the Dataflow service. For that reason, there is a charge (https://cloud.google.com/dataflow/pricing) associated with the use of Streaming Engine. However, the total bill for Dataflow pipelines using Streaming Engine is expected to be approximately the same compared to the total cost of Dataflow pipelines that do not use this option.

Using Streaming Engine

Streaming Engine is currently available for streaming pipelines in the following regions. It will become available in additional regions in the future.

us-west1 (Oregon)

us-central1 (Iowa)

us-east1 (South Carolina)

us-east4 (North Virginia)

northamerica-northeast1 (Montréal)

europe-west2 (London)

europe-west1 (Belgium)

europe-west4 (Netherlands)

europe-west3 (Frankfurt)


asia-southeast1 (Singapore)

asia-east1 (Taiwan)

asia-northeast1 (Tokyo)

australia-southeast1 (Sydney)

Updating (/dataflow/docs/guides/updating-a-pipeline) an already-running pipeline to use Streaming Engine is not currently supported.

If your pipeline is already running in production and you would like to use Streaming Engine, you need to stop your pipeline using the Dataflow Drain (/dataflow/docs/guides/stopping-a-pipeline#drain) option. Then, specify the Streaming Engine parameter and rerun your pipeline.

Java: SDK 2.x

Note: Streaming Engine requires the Apache Beam SDK for Java, version 2.10.0 or higher.

To use Streaming Engine for your streaming pipelines, specify the following parameter:

--enableStreamingEngine if you're using Apache Beam SDK for Java versions 2.11.0 or higher.

--experiments=enable_streaming_engine if you're using Apache Beam SDK for Java version 2.10.0.

If you use Dataflow Streaming Engine for your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Streaming Engine works best with smaller worker machine types, so we recommend that you set --workerMachineType=n1-standard-2. You can also set --diskSizeGb=30 because Streaming Engine only needs space for the worker boot image and local logs. These values are the default values.
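Putting these flags together, a streaming job that uses Streaming Engine might be launched with options such as the following (the project, region, and machine settings are illustrative):

--runner=DataflowRunner --project=my-project --region=us-central1 \
--enableStreamingEngine --workerMachineType=n1-standard-2 --diskSizeGb=30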

Dataflow Shuffle


Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. Currently, Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and Persistent Disk storage. The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.

Benefits of Dataflow Shuffle

The service-based Dataflow Shuffle has the following benefits:

Faster execution time of batch pipelines for the majority of pipeline job types.

A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.

Better autoscaling (/dataflow/service/dataflow-service-desc#autoscaling) since VMs no longer hold any shuffle data and can therefore be scaled down earlier.

Better fault tolerance; an unhealthy VM holding Dataflow Shuffle data will not cause the entire job to fail, as would happen if not using the feature.

Most of the reduction in worker resources comes from offloading the shuffle work to the Dataflow service. For that reason, there is a charge (/dataflow/pricing) associated with the use of Dataflow Shuffle. However, the total bill for Dataflow pipelines using the service-based Dataflow Shuffle implementation is expected to be less than or equal to the cost of Dataflow pipelines that do not use this option.

For the majority of pipeline job types, Dataflow Shuffle is expected to execute faster than the shuffle implementation running on worker VMs. However, the execution times might vary from run to run. If you are running a pipeline that has important deadlines, we recommend allocating sufficient buffer time before the deadline. In addition, consider requesting a bigger quota (/dataflow/quotas#quota-increase) for Shuffle.

Disk considerations

When using the service-based Dataflow Shuffle feature, you do not need to attach large Persistent Disks to your worker VMs. Dataflow automatically attaches a small 25 GB boot disk.


However, due to this small disk size, there are important considerations to be aware of when using Dataflow Shuffle:

A worker VM uses part of the 25 GB of disk space for the operating system, binaries, logs, and containers. Jobs that use a significant amount of disk and exceed the remaining disk capacity may fail when you use Dataflow Shuffle.

Jobs that use a lot of disk I/O may be slow due to the performance of the small disk. For more information about performance differences between disk sizes, see the Compute Engine Persistent Disk Performance (/compute/docs/disks/performance) page.

If any of these considerations apply to your job, you can use pipeline options (/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options) to specify a larger disk size.

Using Dataflow Shuffle

Service-based Dataflow Shuffle is currently available in the following regions:

us-west1 (Oregon)

us-central1 (Iowa)

us-east1 (South Carolina)

us-east4 (North Virginia)

northamerica-northeast1 (Montréal)

europe-west2 (London)

europe-west1 (Belgium)

europe-west4 (Netherlands)

europe-west3 (Frankfurt)

asia-southeast1 (Singapore)

asia-east1 (Taiwan)

asia-northeast1 (Tokyo)

australia-southeast1 (Sydney)


Dataflow Shuffle will become available in additional regions in the future.

Performance differences in the asia-northeast1 (Tokyo) region: We recommend using Dataflow Shuffle with large datasets (greater than 1 TB) when you run pipelines in the asia-northeast1 (Tokyo) region. Using Shuffle with smaller datasets in the asia-northeast1 (Tokyo) region does not give you the same performance advantages as Shuffle in other regions.

Java: SDK 2.x

To use the service-based Dataflow Shuffle in your batch pipelines, specify the following parameter: --experiments=shuffle_mode=service

If you use Dataflow Shuffle for your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Shuffle is currently available. Dataflow autoselects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Dataflow Flexible Resource Scheduling

Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques (/dataflow/docs/guides/flexrs#delayed_scheduling), the Dataflow Shuffle (/dataflow/docs/guides/deploying-a-pipeline#cloud-dataflow-shuffle) service, and a combination of preemptible virtual machine (VM) instances (/compute/docs/instances/preemptible) and regular VMs. By running preemptible VMs and regular VMs in parallel, Dataflow improves the user experience if Compute Engine stops preemptible VM instances during a system event. FlexRS helps to ensure that the pipeline continues to make progress and that you do not lose previous work when Compute Engine preempts (/compute/docs/instances/preemptible#what_is_a_preemptible_instance) your preemptible VMs. For more information about FlexRS, see Using Flexible Resource Scheduling in Dataflow (/dataflow/docs/guides/flexrs).

Dataflow Runner v2

The current production Dataflow runner utilizes language-specific workers when running Apache Beam pipelines. To improve scalability, generality, extensibility, and efficiency, the Dataflow runner is moving to a more services-based architecture. These changes include a more efficient and portable worker architecture packaged together with the Shuffle Service and Streaming Engine.

The new Dataflow runner, Dataflow Runner v2, is available for Python streaming pipelines. You are encouraged to try out Dataflow Runner v2 with your current workload before it is enabled by default on all new pipelines. You do not have to make any changes to your pipeline code to take advantage of this new architecture.

Dataflow Runner v2 requires the Apache Beam SDK for Python, version 2.21.0 or higher.

Benefits of using Dataflow Runner v2

Starting with Python streaming pipelines, new features will be available on Dataflow Runner v2 only. In addition, the improved efficiency of the Dataflow Runner v2 architecture could lead to performance improvements in your Dataflow jobs.

While using Dataflow Runner v2, you might notice a reduction in your bill. The billing model for Dataflow Runner v2 is not final yet, so your bill might increase back to near current levels as the new runner is enabled across all pipelines.

Using Dataflow Runner v2

Dataflow Runner v2 is available in regions that have Dataflow regional endpoints (/dataflow/docs/concepts/regional-endpoints).

Java: SDK 2.x

Dataflow Runner v2 is not available for Java at this time.

Debugging Dataflow Runner v2 jobs

To debug jobs using Dataflow Runner v2, you should follow standard debugging steps (/dataflow/docs/guides/troubleshooting-your-pipeline); however, be aware of the following when using Dataflow Runner v2:

Dataflow Runner v2 jobs run two types of processes on the worker VM: the SDK process and the runner harness process. Depending on the pipeline and VM type, there might be one or more SDK processes, but there is only one runner harness process per VM.

SDK processes run user code and other language-specific functions, while the runner harness process manages everything else.

The runner harness process waits for all SDK processes to connect to it before starting to request work from Dataflow.

Jobs might be delayed if the worker VM downloads and installs dependencies during the SDK process startup. If an SDK process has issues, such as problems starting up or installing libraries, the worker reports its status as unhealthy.

Worker VM logs, available through the Logs Viewer (/logging/docs/view/logs-viewer-interface) or the Dataflow monitoring interface (/dataflow/docs/guides/using-monitoring-intf), include logs from the runner harness process as well as logs from the SDK processes.

To diagnose problems in your user code, examine the worker logs from the SDK processes. If you find any errors in the runner harness logs, please contact Support (https://console.cloud.google.com/support) to file a bug.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies (https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2020-08-19 UTC.
