8/23/2020 Deploying a pipeline | Cloud Dataflow | Google Cloud
Deploying a pipeline
This document explains in detail how Dataflow deploys and runs a pipeline, and covers advanced topics like optimization and load balancing. If you are looking for a step-by-step guide on how to create and deploy your first pipeline, use Dataflow's quickstarts for Java (/dataflow/docs/quickstarts/quickstart-java-maven), Python (/dataflow/docs/quickstarts/quickstart-python) or templates (/dataflow/docs/quickstarts/quickstart-templates).
After you construct and test your Apache Beam pipeline, you can use the Dataflow managed service to deploy and execute it. Once on the Dataflow service, your pipeline code becomes a Dataflow job.
The Dataflow service fully manages Google Cloud services such as Compute Engine (/compute) and Cloud Storage (/storage) to run your Dataflow job, automatically spinning up and tearing down the necessary resources. The Dataflow service provides visibility into your job through tools like the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf) and the Dataflow Command-line Interface (/dataflow/pipelines/dataflow-command-line-intf).
You can control some aspects of how the Dataflow service runs your job by setting execution parameters (/dataflow/pipelines/specifying-exec-params) in your pipeline code. For example, the execution parameters specify whether the steps of your pipeline run on worker virtual machines, on the Dataflow service backend, or locally.
In addition to managing Google Cloud resources, the Dataflow service automatically performs and optimizes many aspects of distributed parallel processing. These include:
Parallelization and Distribution. Dataflow automatically partitions your data and distributes your worker code to Compute Engine instances for parallel processing.
Optimization. Dataflow uses your pipeline code to create an execution graph that represents your pipeline's PCollections and transforms, and optimizes the graph for the most efficient performance and resource usage. Dataflow also automatically optimizes potentially costly operations, such as data aggregations.
Automatic Tuning features. The Dataflow service includes several features that provide on-the-fly adjustment of resource allocation and data partitioning, such as Autoscaling and Dynamic Work Rebalancing. These features help the Dataflow service execute your job as quickly and efficiently as possible.
Pipeline lifecycle: from pipeline code to Dataflow job
When you run your Dataflow pipeline, Dataflow creates an execution graph from the code that constructs your Pipeline object, including all of the transforms and their associated processing functions (such as DoFns). This phase is called Graph Construction Time and runs locally on the computer where the pipeline is run.
During graph construction, Apache Beam locally executes the code from the main entry point of the pipeline code, stopping at the calls to a source, sink or transform step, and turning these calls into nodes of the graph. As a consequence, a piece of code in a pipeline's entry point (Java's main() method or the top-level of a Python script) locally executes on the machine that runs the pipeline, while the same code declared in a method of a DoFn object executes in the Dataflow workers.
Also during graph construction, Apache Beam validates that any resources referenced by the pipeline (like Cloud Storage buckets, BigQuery tables, and Pub/Sub Topics or Subscriptions) actually exist and are accessible. The validation is done through standard API calls to the respective services, so it's vital that the user account used to run a pipeline has proper connectivity to the necessary services and is authorized to call their APIs. Before submitting the pipeline to the Dataflow service, Apache Beam also checks for other errors, and ensures that the pipeline graph doesn't contain any illegal operations.
The execution graph is then translated into JSON format, and the JSON execution graph is transmitted to the Dataflow service endpoint.
Graph construction also happens when you execute your pipeline locally, but the graph is not translated to JSON or transmitted to the service. Instead, the graph is run locally on the same machine where you launched your Dataflow program. See the documentation on configuring for local execution (/dataflow/pipelines/specifying-exec-params#LocalExecution) for more details.
The Dataflow service then validates the JSON execution graph. When the graph is validated, it becomes a job on the Dataflow service. You'll be able to see your job, its execution graph, status, and log information by using the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf).
Java: SDK 2.x
The Dataflow service sends a response to the machine where you ran your Dataflow program. This response is encapsulated in the object DataflowPipelineJob, which contains your Dataflow job's jobId. You can use the jobId to monitor, track, and troubleshoot your job using the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf) and the Dataflow Command-line Interface (/dataflow/pipelines/dataflow-command-line-intf). See the API reference for DataflowPipelineJob (https://beam.apache.org/documentation/sdks/javadoc/current/index.html?org/apache/beam/runners/dataflow/DataflowPipelineJob.html) for more information.
Execution graph
Dataflow builds a graph of steps that represents your pipeline, based on the transforms and data you used when you constructed your Pipeline object. This is the pipeline execution graph.
The WordCount (https://beam.apache.org/get-started/wordcount-example/) example, included with the Apache Beam SDKs, contains a series of transforms to read, extract, count, format, and write the individual words in a collection of text, along with an occurrence count for each word. The following diagram shows how the transforms in the WordCount pipeline are expanded into an execution graph:
Figure 1: WordCount Example Execution Graph
The execution graph often differs from the order in which you specified your transforms when you constructed the pipeline. This is because the Dataflow service performs various optimizations and fusions on the execution graph before it runs on managed cloud resources. The Dataflow service respects data dependencies when executing your pipeline; however, steps without data dependencies between them can be executed in any order.
You can see the unoptimized execution graph that Dataflow has generated for your pipeline when you select your job in the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf).
Parallelization and distribution
The Dataflow service automatically parallelizes and distributes the processing logic in your pipeline to the workers you've allotted to perform your job. Dataflow uses the abstractions in the programming model (/dataflow/model/programming-model-beam) to represent parallel processing functions; for example, your ParDo transforms cause Dataflow to automatically distribute your processing code (represented by DoFns) to multiple workers to be run in parallel.
Structuring your user code
You can think of your DoFn code as small, independent entities: there can potentially be many instances running on different machines, each with no knowledge of the others. As such, pure functions (functions that do not depend on hidden or external state, that have no observable side effects, and are deterministic) are ideal code for the parallel and distributed nature of DoFns.
The pure function model is not strictly rigid, however; state information or external initialization data can be valid for DoFn and other function objects, so long as your code does not depend on things that the Dataflow service does not guarantee. When structuring your ParDo transforms and creating your DoFns, keep the following guidelines in mind:
The Dataflow service guarantees that every element in your input PCollection is processed by a DoFn instance exactly once.
The Dataflow service does not guarantee how many times a DoFn will be invoked.
The Dataflow service does not guarantee exactly how the distributed elements are grouped; that is, it does not guarantee which (if any) elements are processed together.
The Dataflow service does not guarantee the exact number of DoFn instances that will be created over the course of a pipeline.
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
The Dataflow service serializes element processing per DoFn instance. Your code does not need to be strictly thread-safe; however, any state shared between multiple DoFn instances must be thread-safe.
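The guidelines above favor pure, retry-tolerant processing functions. The following sketch contrasts the two styles in plain Python (this is not the Beam API; function names like `retry_safe_process` are illustrative only): a fixed temp-file name breaks when Dataflow runs backup copies of the same work, while a name derived from the element makes retries harmless.

```python
import hashlib
import os
import tempfile

def fragile_process(element, scratch_dir):
    # BAD: a fixed file name collides when the service runs backup
    # copies of the same work item on several workers at once.
    path = os.path.join(scratch_dir, "scratch.txt")
    with open(path, "w") as f:
        f.write(element.upper())
    return path

def retry_safe_process(element, scratch_dir):
    # GOOD: a name derived from the element is unique and stable, so
    # retries and duplicate attempts rewrite the same file with
    # identical content instead of corrupting a shared one.
    digest = hashlib.sha256(element.encode()).hexdigest()[:16]
    path = os.path.join(scratch_dir, f"scratch-{digest}.txt")
    with open(path, "w") as f:
        f.write(element.upper())
    return path

if __name__ == "__main__":
    d = tempfile.mkdtemp()
    # Simulate a retry: processing the same element twice is harmless.
    p1 = retry_safe_process("hello", d)
    p2 = retry_safe_process("hello", d)
    assert p1 == p2 and open(p1).read() == "HELLO"
```

The same reasoning applies to any external side effect: make it idempotent, or key it by something unique to the element being processed.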
See Requirements for User-Provided Functions (https://beam.apache.org/documentation/programming-guide/#requirements-for-writing-user-code-for-beam-transforms) in the programming model (/dataflow/model/programming-model-beam) documentation for more information about building your user code.
Error and exception handling
Your pipeline may throw exceptions while processing data. Some of these errors are transient (e.g., temporary difficulty accessing an external service), but some are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.
Dataflow processes elements in arbitrary bundles, and retries the complete bundle when an error is thrown for any element in that bundle. When running in batch mode, bundles including a failing item are retried 4 times. The pipeline will fail completely when a single bundle has failed 4 times. When running in streaming mode, a bundle including a failing item will be retried indefinitely, which may cause your pipeline to permanently stall.
When processing in batch mode, you might see a large number of individual failures before a pipeline job fails completely (which happens when any given bundle fails after four retry attempts). For example, if your pipeline attempts to process 100 bundles, Dataflow could theoretically generate several hundred individual failures until a single bundle reaches the 4-failure condition for exit.
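The batch retry behavior described above can be sketched as a small simulation. This is plain Python, not Dataflow internals; the four-attempt limit restates the rule in this section, and the function names are invented for illustration:

```python
def run_batch_bundle(bundle, process, max_attempts=4):
    """Retry a whole bundle; any failing element fails the attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            # The complete bundle is reprocessed on every attempt,
            # not just the element that failed.
            return [process(e) for e in bundle]
        except ValueError:
            if attempt == max_attempts:
                raise RuntimeError("bundle failed 4 times; job fails")

def parse(e):
    # Stand-in for user processing code with a permanent error case.
    if e == "corrupt":
        raise ValueError(e)
    return int(e)

print(run_batch_bundle(["1", "2"], parse))  # [1, 2]
try:
    run_batch_bundle(["1", "corrupt"], parse)
except RuntimeError as err:
    print(err)  # bundle failed 4 times; job fails
```

In streaming mode the loop above would simply never give up, which is why a single permanently failing element can stall a streaming pipeline.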
Fusion optimization
Once the JSON form of your pipeline's execution graph has been validated, the Dataflow service may modify the graph to perform optimizations. Such optimizations can include fusing multiple steps or transforms in your pipeline's execution graph into single steps. Fusing steps prevents the Dataflow service from needing to materialize every intermediate PCollection in your pipeline, which can be costly in terms of memory and processing overhead.
While all the transforms you've specified in your pipeline construction are executed on the service, they may be executed in a different order, or as part of a larger fused transform to ensure the most efficient execution of your pipeline. The Dataflow service respects data dependencies between the steps in the execution graph, but otherwise steps may be executed in any order.
Fusion example
The following diagram shows how the execution graph from the WordCount (/dataflow/examples/examples-beam) example included with the Apache Beam SDK for Java might be optimized and fused by the Dataflow service for efficient execution:
Figure 2: WordCount Example Optimized Execution Graph
Preventing fusion
There are a few cases in your pipeline where you may want to prevent the Dataflow service from performing fusion optimizations. These are cases in which the Dataflow service might incorrectly guess the optimal way to fuse operations in the pipeline, which could limit the Dataflow service's ability to make use of all available workers.
For example, one case in which fusion can limit Dataflow's ability to optimize worker usage is a "high fan-out" ParDo. In such an operation, you might have an input collection with relatively few elements, but the ParDo produces an output with hundreds or thousands of times as many elements, followed by another ParDo. If the Dataflow service fuses these ParDo operations together, parallelism in this step is limited to at most the number of items in the input collection, even though the intermediate PCollection contains many more elements.
You can prevent such a fusion by adding an operation to your pipeline that forces the Dataflow service to materialize your intermediate PCollection. Consider using one of the following operations:
You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.
You can pass your intermediate PCollection as a side input (https://beam.apache.org/documentation/programming-guide/#side-inputs) to another ParDo. The Dataflow service always materializes side inputs.
You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation.
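To see why materializing the intermediate PCollection matters, consider a toy model (plain Python, not the Beam API; the fan-out factor and collection sizes are invented) where a stage's available parallelism equals the number of elements in the collection that feeds it:

```python
def high_fanout(element, factor=1000):
    # e.g. one input row expands into many derived records
    return [f"{element}-{i}" for i in range(factor)]

inputs = ["a", "b", "c"]  # small input collection

# Fused: both ParDos run inside one stage, so parallelism is bounded
# by the 3 input elements, not the 3000 intermediate ones.
fused_parallelism = len(inputs)

# With fusion prevented (e.g. via a Reshuffle), the intermediate
# collection is materialized and the second ParDo can fan out across
# every intermediate element.
intermediate = [x for e in inputs for x in high_fanout(e)]
materialized_parallelism = len(intermediate)

print(fused_parallelism, materialized_parallelism)  # 3 3000
```

The numbers are artificial, but the ratio is the point: after a 1000x fan-out, breaking fusion raises the parallelism ceiling by the same factor.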
Combine optimization
Aggregation operations are an important concept in large-scale data processing. Aggregation brings together data that's conceptually far apart, making it extremely useful for correlating. The Dataflow programming model (/dataflow/model/programming-model-beam) represents aggregation operations as the GroupByKey, CoGroupByKey, and Combine transforms.
Dataflow's aggregation operations combine data across the entire data set, including data that may be spread across multiple workers. During such aggregation operations, it's often most efficient to combine as much data locally as possible before combining data across instances. When you apply a GroupByKey or other aggregating transform, the Dataflow service automatically performs partial combining locally before the main grouping operation.
Because the Dataflow service automatically performs partial local combining, it is strongly recommended that you do not attempt to make this optimization by hand in your pipeline code.
When performing partial or multi-level combining, the Dataflow service makes different decisions based on whether your pipeline is working with batch or streaming data. For bounded data, the service favors efficiency and will perform as much local combining as possible. For unbounded data, the service favors lower latency, and may not perform partial combining (as it may increase latency).
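Partial local combining (sometimes called combiner lifting) can be sketched in plain Python: each worker pre-combines its own bundle, so only one small partial result per worker crosses the network for the final merge. The per-worker bundle split below is illustrative, not how the service actually partitions data:

```python
from collections import Counter

def local_partial_combine(bundle):
    # Runs on each worker: combine locally before the shuffle.
    return Counter(bundle)

def global_combine(partials):
    # Runs after the shuffle: merge the small per-worker results.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Three workers, each holding one bundle of elements.
bundles = [["a", "b", "a"], ["b", "b", "c"], ["a", "c"]]
partials = [local_partial_combine(b) for b in bundles]
result = global_combine(partials)
print(dict(result))  # {'a': 3, 'b': 3, 'c': 2}
```

Here 8 raw elements shrink to 3 small partial counts before the shuffle, which is exactly the saving the service gets by lifting the combiner ahead of the grouping step.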
Autotuning features
The Dataflow service contains several autotuning features that can further dynamically optimize your Dataflow job while it is running. These features include Autoscaling (/dataflow/docs/guides/deploying-a-pipeline#autoscaling) and Dynamic Work Rebalancing (https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow).
Autoscaling
With autoscaling enabled, the Dataflow service automatically chooses the appropriate number of worker instances required to run your job. The Dataflow service may also dynamically re-allocate more workers or fewer workers during runtime to account for the characteristics of your job. Certain parts of your pipeline may be computationally heavier than others, and the Dataflow service may automatically spin up additional workers during these phases of your job (and shut them down when they're no longer needed).
Java: SDK 2.x
Autoscaling is enabled by default on all batch Dataflow jobs and streaming jobs using Streaming Engine (#streaming-engine). You can disable autoscaling by specifying (/dataflow/pipelines/specifying-exec-params) the flag --autoscalingAlgorithm=NONE when you run your pipeline; if so, note that the Dataflow service sets the number of workers based on the --numWorkers option, which defaults to 3.
With autoscaling enabled, the Dataflow service does not allow user control of the exact number of worker instances allocated to your job. You might still cap the number of workers by specifying (/dataflow/pipelines/specifying-exec-params) the --maxNumWorkers option when you run your pipeline.
For batch jobs, the --maxNumWorkers flag is optional. The default is 1000. For streaming jobs using Streaming Engine, the --maxNumWorkers flag is optional. The default is 100. For streaming jobs not using Streaming Engine, the --maxNumWorkers flag is required.
Dataflow scales based on the parallelism of a pipeline. The parallelism of a pipeline is an estimate of the number of threads needed to most efficiently process data at any given time.
The parallelism is calculated every few minutes unless the bandwidth of an external service is too low. When the parallelism increases, Dataflow scales up and adds workers. When the parallelism decreases, Dataflow scales down and removes workers.
The following summarizes when autoscaling increases or decreases the number of workers in batch (#batch-autoscaling) and streaming (#streaming-autoscaling) pipelines:

Scaling up

Batch pipelines: If the remaining work takes longer than spinning up new workers, and the current workers are utilizing, on average, more than 5% of their CPUs, Dataflow may scale up. Sources with the following may limit the number of new workers: a small amount of data, un-splittable data (like compressed files), and data processed by I/O modules that don't split data. Sinks configured to write to a fixed number of shards, like a Cloud Storage destination writing to existing files, may also limit the number of new workers.

Streaming pipelines: If a streaming pipeline is backlogged and workers are utilizing, on average, more than 20% of their CPUs, Dataflow may scale up. Backlogs are cleared within approximately 150 seconds, given the current throughput per worker.

Scaling down

Batch pipelines: If the remaining work takes less time than spinning up new workers, and the current workers are utilizing, on average, more than 5% of their CPUs, Dataflow may scale down.

Streaming pipelines: If a streaming pipeline backlog is lower than 20 seconds and workers are utilizing, on average, less than 80% of their CPUs, Dataflow may scale down. After scaling down, the new number of workers utilizes, on average, less than 75% of their CPUs.

No autoscaling

Batch pipelines: If I/O takes longer than data processing, or workers are utilizing, on average, less than 5% of their CPUs, the parallelism isn't recalculated.

Streaming pipelines: If workers are utilizing, on average, less than 20% of their CPU, the parallelism isn't recalculated.
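The streaming rules above can be paraphrased as a small decision function. This is a loose sketch, not the actual algorithm: the 20%, 80%, and 20-second figures come from the streaming rules quoted above, while treating a backlog above roughly 150 seconds as "backlogged" is our own reading, and the real service uses more signals than these two:

```python
def streaming_scaling_action(backlog_seconds, avg_cpu_utilization):
    """Simplified restatement of the streaming autoscaling rules."""
    if avg_cpu_utilization < 0.20:
        return "none"        # parallelism isn't recalculated
    if backlog_seconds > 150 and avg_cpu_utilization > 0.20:
        return "scale up"    # backlogged and CPUs busy
    if backlog_seconds < 20 and avg_cpu_utilization < 0.80:
        return "scale down"  # backlog nearly drained, CPUs idle-ish
    return "hold"

print(streaming_scaling_action(300, 0.9))  # scale up
print(streaming_scaling_action(5, 0.5))    # scale down
print(streaming_scaling_action(60, 0.1))   # none
```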
Batch autoscaling
For batch pipelines, Dataflow automatically chooses the number of workers based on both the amount of work in each stage of your pipeline and the current throughput at that stage. Dataflow determines how much data is being processed by the current set of workers and extrapolates how much time the rest of the work takes to process.
The number of workers is sub-linear to the amount of work. For instance, a job with twice the work has less than twice the workers.
If your pipeline uses a custom data source that you've implemented, there are a few methods you can implement that provide more information to the Dataflow service's autoscaling algorithm and potentially improve performance:
Java: SDK 2.x
In your BoundedSource subclass, implement the method getEstimatedSizeBytes. The Dataflow service uses getEstimatedSizeBytes when calculating the initial number of workers to use for your pipeline.
In your BoundedReader subclass, implement the method getFractionConsumed. The Dataflow service uses getFractionConsumed to track read progress and converge on the correct number of workers to use during a read.
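The two methods map naturally onto a source that knows its total size and its current read position. Here is a plain-Python analogue (the snake_case names mirror the Java getEstimatedSizeBytes and getFractionConsumed; these classes are invented for illustration, not the real Beam classes):

```python
class ToyBoundedSource:
    """Toy bounded source over an in-memory byte string."""

    def __init__(self, data: bytes):
        self.data = data

    def get_estimated_size_bytes(self):
        # Analogue of BoundedSource.getEstimatedSizeBytes: used to
        # pick the initial number of workers.
        return len(self.data)

class ToyBoundedReader:
    def __init__(self, source: ToyBoundedSource):
        self.source = source
        self.position = 0

    def advance(self, n: int):
        self.position = min(self.position + n, len(self.source.data))

    def get_fraction_consumed(self):
        # Analogue of BoundedReader.getFractionConsumed: lets the
        # service track progress and converge on a worker count.
        total = len(self.source.data)
        return self.position / total if total else 1.0

src = ToyBoundedSource(b"x" * 1000)
reader = ToyBoundedReader(src)
reader.advance(250)
print(src.get_estimated_size_bytes(), reader.get_fraction_consumed())  # 1000 0.25
```

The key property is that the fraction is cheap to compute and monotonically increases as the reader advances, so the service can extrapolate the remaining work from it.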
Streaming autoscaling
Streaming autoscaling is generally available for pipelines that use Streaming Engine (#streaming-engine). For pipelines that do not use Streaming Engine, streaming autoscaling is available in beta.
For more information about launch stage definitions, see the Product launch stages (/products#product-launch-stages) page.
Streaming autoscaling allows the Dataflow service to adaptively change the number of workers used to execute your streaming pipeline in response to changes in load and resource utilization. Streaming autoscaling is a free feature and is designed to reduce the costs of the resources used when executing streaming pipelines.
Without autoscaling, you choose a fixed number of workers by specifying numWorkers or num_workers to execute your pipeline. As the input workload varies over time, this number can become either too high or too low. Provisioning too many workers results in unnecessary extra cost, and provisioning too few workers results in higher latency for processed data. By enabling autoscaling, resources are used only as they are needed.
The objective of autoscaling streaming pipelines is to minimize backlog while maximizing worker utilization and throughput, and quickly react to spikes in load. By enabling autoscaling, you don't have to choose between provisioning for peak load and fresh results. Workers are added as CPU utilization and backlog increase and are removed as these metrics come down. This way, you're paying only for what you need, and the job is processed as efficiently as possible.
Java: SDK 2.x
Custom unbounded sources
If your pipeline uses a custom unbounded source, the source must inform the Dataflow service about backlog. Backlog is an estimate of the input in bytes that has not yet been processed by the source. To inform the service about backlog, implement either one of the following methods in your UnboundedReader class.
getSplitBacklogBytes() - Backlog for the current split of the source. The service aggregates backlog across all the splits.
getTotalBacklogBytes() - The global backlog across all the splits. In some cases the backlog is not available for each split and can only be calculated across all the splits. Only the first split (split ID '0') needs to provide total backlog.
The Apache Beam repository contains several examples (https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java) of custom sources that implement the UnboundedReader class.
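In plain terms, backlog is just "bytes received but not yet processed" per split, summed by the service. A toy analogue (the method name mirrors the Java getSplitBacklogBytes; the class, fields, and numbers are invented for illustration):

```python
class ToyUnboundedReader:
    """Toy analogue of one UnboundedReader split reporting backlog."""

    def __init__(self, split_id, received_bytes, processed_bytes):
        self.split_id = split_id
        self.received_bytes = received_bytes
        self.processed_bytes = processed_bytes

    def get_split_backlog_bytes(self):
        # Unprocessed input for this split only; the service sums
        # this value across all splits of the source.
        return max(self.received_bytes - self.processed_bytes, 0)

splits = [
    ToyUnboundedReader(0, 5_000, 4_000),
    ToyUnboundedReader(1, 8_000, 1_000),
]
total_backlog = sum(s.get_split_backlog_bytes() for s in splits)
print(total_backlog)  # 8000
```

A growing total backlog at steady CPU utilization is the signal that triggers the streaming scale-up behavior described earlier on this page.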
Enable streaming autoscaling
For streaming jobs using Streaming Engine (#streaming-engine), autoscaling is enabled by default.
To enable autoscaling for jobs not using Streaming Engine, set the following execution parameters (/dataflow/pipelines/specifying-exec-params) when you start your pipeline:
--autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=N
For streaming jobs not using Streaming Engine, the minimum number of workers is 1/15th of the --maxNumWorkers value, rounded up.
Streaming pipelines are deployed with a fixed pool of Persistent Disks (/compute/docs/disks#pdspecs), equal in number to --maxNumWorkers. Take this into account when you specify --maxNumWorkers, and ensure this value is a sufficient number of disks for your pipeline.
Note: If you've reached a scaling limit and want to raise the --maxNumWorkers, you must submit a new job with a higher --maxNumWorkers.
If you want to update a streaming autoscaling job that's not using Streaming Engine, make sure --maxNumWorkers remains the same (see the section on manually scaling streaming pipelines (#ManualScaling)). Not specifying the --autoscalingAlgorithm pipeline option in the Update command disables autoscaling for the updated job.
Usage and pricing
Compute Engine usage is based on the average number of workers, while Persistent Disk usage is based on the exact value of --maxNumWorkers. Persistent Disks are redistributed such that each worker gets an equal number of attached disks.
For example, with --maxNumWorkers=15, you pay (/dataflow/pricing) for between 1 and 15 Compute Engine instances and exactly 15 Persistent Disks.
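The 1/15th minimum and the fixed disk pool can be checked with a little arithmetic. The formulas restate the rules in this section for non-Streaming-Engine streaming jobs; the helper names are ours:

```python
import math

def streaming_worker_bounds(max_num_workers):
    # Minimum worker count is 1/15th of --maxNumWorkers, rounded up;
    # the maximum is --maxNumWorkers itself.
    return math.ceil(max_num_workers / 15), max_num_workers

def disks_per_worker(max_num_workers, current_workers):
    # The fixed pool of Persistent Disks equals --maxNumWorkers and
    # is redistributed evenly across the workers currently running.
    return max_num_workers / current_workers

print(streaming_worker_bounds(15))   # (1, 15)
print(streaming_worker_bounds(100))  # (7, 100)
print(disks_per_worker(15, 5))       # 3.0
```

So with --maxNumWorkers=15 the job can run with anywhere from 1 to 15 workers, and at 5 workers each one carries 3 of the 15 disks.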
Manually scaling a streaming pipeline
Until autoscaling is generally available in streaming mode, there is a workaround you can use to manually scale the number of workers running your streaming pipeline by using Dataflow's Update (/dataflow/pipelines/updating-a-pipeline) feature.
Java: SDK 2.x
To scale your streaming pipeline during execution, ensure that you set the following execution parameters (/dataflow/pipelines/specifying-exec-params) when you start your pipeline:
Set --maxNumWorkers equal to the maximum number of workers you want available to your pipeline.
Set --numWorkers equal to the initial number of workers you want your pipeline to use when it starts running.
Once your pipeline is running, you can Update your pipeline and specify a new number of workers using the --numWorkers parameter. The value you set for the new --numWorkers must be between N and --maxNumWorkers, where N is equal to --maxNumWorkers / 15.
Updating your pipeline replaces your running job with a new job, using the new number of workers, while preserving all state information associated with the previous job.
Note: Your pipeline's maximum scaling range depends on the number of persistent disks deployed when the pipeline starts. The Dataflow service deploys one persistent disk per worker at the maximum number of workers. Deploying extra Persistent Disks by setting --maxNumWorkers to a higher value than --numWorkers provides some benefits to your pipeline; specifically, it allows you the flexibility to scale your pipeline to a larger number of workers after startup, and might provide improved performance (/compute/docs/disks/performance#size_price_performance). However, your pipeline might also incur additional cost for the extra Persistent Disks. Take note of the cost and quota implications of the additional Persistent Disk resources when planning your streaming pipeline and setting the scaling range.
Note: You cannot change the scaling range of a pipeline by using the Update feature. If you need to scale further, you must start a new pipeline and specify a higher value for --maxNumWorkers as the ceiling of your desired scaling range.
Dynamic Work Rebalancing
The Dataflow service's Dynamic Work Rebalancing feature allows the service to dynamically re-partition work based on runtime conditions. These conditions might include:
Imbalances in work assignments
Workers taking longer than expected to finish
Workers finishing faster than expected
The Dataflow service automatically detects these conditions and can dynamically reassign work to unused or underused workers to decrease your job's overall processing time.
Limitations
Dynamic Work Rebalancing only happens when the Dataflow service is processing some input data in parallel: when reading data from an external input source, when working with a materialized intermediate PCollection, or when working with the result of an aggregation like GroupByKey. If a large number of steps in your job are fused (#Optimization), there are fewer intermediate PCollections in your job and Dynamic Work Rebalancing will be limited to the number of elements in the source materialized PCollection. If you want to ensure that Dynamic Work Rebalancing can be applied to a particular PCollection in your pipeline, you can prevent fusion (#fusion-prevention) in a few different ways to ensure dynamic parallelism.
Dynamic Work Rebalancing cannot re-parallelize data finer than a single record. If your data contains individual records that cause large delays in processing time, they may still delay your job, since Dataflow cannot subdivide and redistribute an individual "hot" record to multiple workers.
Java: SDK 2.x
If you've set a fixed number of shards for your pipeline's final output (for example, by writing data using TextIO.Write.withNumShards), parallelization will be limited based on the number of shards that you've chosen.
The fixed-shards limitation can be considered temporary, and may be subject to change in future releases of the Dataflow service.
Working with Custom Data Sources
Java: SDK 2.x
If your pipeline uses a custom data source that you provide, you must implement the method splitAtFraction to allow your source to work with the Dynamic Work Rebalancing feature.
Caution: Using Dynamic Work Rebalancing with custom data sources is an extremely advanced use case. If you choose to implement splitAtFraction, it is critical that you test your code extensively and with maximum code coverage.
If you implement splitAtFraction incorrectly, records from your source may appear to get duplicated or dropped. See the API reference information on RangeTracker (https://beam.apache.org/documentation/sdks/javadoc/current/index.html? org/apache/beam/sdk/io/range/RangeTracker.html) for help and tips on implementing splitAtFraction.
Resource usage and management
The Dataflow service fully manages resources in Google Cloud on a per-job basis. This includes spinning up and shutting down Compute Engine (/compute) instances (occasionally referred to as workers or VMs) and accessing your project's Cloud Storage (/storage) buckets for both I/O and temporary file staging. However, if your pipeline interacts with Google Cloud data storage technologies like BigQuery (/bigquery) and Pub/Sub (/pubsub), you must manage the resources and quota for those services.
Dataflow uses a user-provided location in Cloud Storage (/storage) specifically for staging files. This location is under your control, and you should ensure that the location's lifetime is maintained as long as any job is reading from it. You can re-use the same staging location for multiple job runs, as the SDK's built-in caching can speed up the start time for your jobs.
Caution: Manually altering Dataflow-managed Compute Engine resources associated with a Dataflow job is an unsupported operation. You should not attempt to manually stop, delete, or otherwise control the Compute Engine instances that Dataflow has created to run your job. In addition, you should not alter any persistent disk resources associated with your Dataflow job.
Jobs
You may run up to 25 concurrent Dataflow jobs per Google Cloud project; however, this limit can be increased by contacting Google Cloud Support (/support). For more information, see Quotas (/dataflow/quotas#quota-increase).
The Dataflow service is currently limited to processing JSON job requests that are 20 MB in size or smaller. The size of the job request is specifically tied to the JSON representation of your pipeline; a larger pipeline means a larger request.
To estimate the size of your pipeline's JSON request, run your pipeline with the following option:
Java: SDK 2.x

--dataflowJobFile=<path to output file>

This option writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request; the actual size will be slightly larger due to some additional information included in the request.
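As a concrete sketch, you can compare the emitted job file against the 20 MB limit from the shell. The file contents below are a stand-in; in a real run, point JOB_FILE at the file written by --dataflowJobFile.

```shell
# Hypothetical sketch: check the serialized job file against the 20 MB
# JSON request limit. The job file here is a tiny stand-in.
JOB_FILE=$(mktemp)
printf '{"steps": []}' > "$JOB_FILE"

SIZE=$(wc -c < "$JOB_FILE" | tr -d ' ')
LIMIT=$((20 * 1024 * 1024))   # 20 MB request limit

if [ "$SIZE" -gt "$LIMIT" ]; then
  echo "Job request is likely too large: ${SIZE} bytes"
else
  echo "Within limit: ${SIZE} bytes"
fi
```

Because the actual request is slightly larger than the file, leave some headroom rather than aiming just under the limit.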
For more information, see the troubleshooting page for "413 Request Entity Too Large" / "The size of serialized JSON representation of the pipeline exceeds the allowable limit" (/dataflow/docs/guides/common-errors#json-request-too-large).

In addition, your job's graph size must not exceed 10 MB. For more information, see the troubleshooting page for "The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs." (/dataflow/docs/guides/common-errors#job-graph-too-large).
Workers
The Dataflow service currently allows a maximum of 1000 Compute Engine instances per job. The default machine type is n1-standard-1 for batch jobs, and n1-standard-4 for streaming jobs. When using the default machine types, the Dataflow service can therefore allocate up to 4000 cores per job. If you need more cores for your job, you can select a larger machine type.

The Dataflow managed service now deploys Compute Engine virtual machines associated with Dataflow jobs using Managed Instance Groups (/compute/docs/instance-groups). A Managed Instance Group creates multiple Compute Engine instances from a common template and allows you to control and manage them as a group. That way, you don't need to individually control each instance associated with your pipeline.

You should not attempt to manage or otherwise interact directly with your Compute Engine Managed Instance Group; the Dataflow service will take care of that for you. Manually altering any Compute Engine resources associated with your Dataflow job is an unsupported operation.

You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement (/dataflow/sla).

Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. You can specify a machine type for your pipeline by setting the appropriate execution parameter (/dataflow/pipelines/specifying-exec-params) at pipeline creation time.
Caution: Shared core machine types such as f1 and g1 series workers are not supported under Dataflow's Service Level Agreement (/dataflow/sla).

Java: SDK 2.x
To change the machine type, set the --workerMachineType option.
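For example, a hedged sketch of the launch arguments for a pipeline that needs a larger machine type; the project name and machine type below are placeholders, not recommendations:

```shell
# Hypothetical sketch: assemble Dataflow launch options with a custom
# machine type. Project and machine type are placeholders.
PIPELINE_ARGS="--runner=DataflowRunner \
--project=my-project \
--region=us-central1 \
--workerMachineType=n1-highmem-8"

# In a real run you would pass these to your pipeline's main class, e.g.:
#   mvn compile exec:java -Dexec.mainClass=com.example.MyPipeline \
#       -Dexec.args="${PIPELINE_ARGS}"
echo "${PIPELINE_ARGS}"
```

Remember that billing is by vCPU and GB of memory, so a larger machine type raises per-worker cost.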
The Dataflow service currently does not support jobs that use only preemptible virtual machines (/compute/docs/instances/preemptible). Instead, if you would like to save processing costs, consider using the FlexRS processing mode (/dataflow/docs/guides/flexrs), which uses a combination of preemptible and non-preemptible resources.
Resource quota
The Dataflow service checks to ensure that your Google Cloud project has the Compute Engine resource quota required to run your job, both to start the job and scale to the maximum number of worker instances. Your job will fail to start if there is not enough resource quota available.

Because your Dataflow job deploys Compute Engine virtual machines as a Managed Instance Group, you'll need to ensure your project satisfies some additional quota requirements. Specifically, your project will need one of each of the following types of quota for each concurrent Dataflow job that you want to run:
One Instance Group per job
One Managed Instance Group per job
One Instance Template per job
Caution: Manually changing your Dataflow job's Instance Template or Managed Instance Group is not recommended or supported. Use Dataflow's pipeline configuration options (/dataflow/pipelines/specifying-exec-params) instead.

Dataflow's Autoscaling (#AutoScaling) feature is limited by your project's available Compute Engine quota. If your job has sufficient quota when it starts, but another job uses the remainder of your project's available quota, the first job will run but not be able to fully scale.
However, the Dataflow service does not manage quota increases for jobs that exceed the resource quotas in your project. You are responsible for making any necessary requests for additional resource quota, for which you can use the Google Cloud Console (https://console.cloud.google.com/).
Persistent disk resources
The Dataflow service is currently limited to 15 persistent disks per worker instance when running a streaming job. Each persistent disk is local to an individual Compute Engine virtual machine. Your job may not have more workers than persistent disks; a 1:1 ratio between workers and disks is the minimum resource allotment.

For jobs running on worker VMs, the default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode. Jobs using Streaming Engine (#streaming-engine) or Dataflow Shuffle (#cloud-dataflow-shuffle) run on the Dataflow service backend and use smaller disks.
Locations
By default, the Dataflow service deploys Compute Engine resources in the us-central1-f zone of the us-central1 region. You can override this setting by specifying (/dataflow/pipelines/specifying-exec-params) the --region parameter. If you need to use a specific zone for your resources, use the --zone parameter when you create your pipeline. However, we recommend that you only specify the region, and leave the zone unspecified. This allows the Dataflow service to automatically select the best zone within the region based on the available zone capacity at the time of the job creation request. For more information, see the regional endpoints (/dataflow/docs/concepts/regional-endpoints) documentation.
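Following that recommendation, a hedged sketch of the relevant flags pins only the region and leaves zone selection to the service (the region value is illustrative):

```shell
# Hypothetical sketch: specify the region and let Dataflow pick the zone.
REGION_ARGS="--runner=DataflowRunner --region=europe-west1"

# Only add --zone if you genuinely need a specific zone, e.g.:
#   REGION_ARGS="${REGION_ARGS} --zone=europe-west1-b"
echo "${REGION_ARGS}"
```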
Streaming Engine
Currently, the Dataflow pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage. Dataflow's Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend.

Benefits of Streaming Engine
The Streaming Engine model has the following benefits:

A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs. Streaming Engine works best with smaller worker machine types (n1-standard-2 instead of n1-standard-4) and does not require Persistent Disk beyond a small worker boot disk, leading to less resource and quota consumption.

More responsive autoscaling (https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling) in response to variations in incoming data volume. Streaming Engine offers smoother, more granular scaling of workers.
Improved supportability, since you don’t need to redeploy your pipelines to apply service updates.
Most of the reduction in worker resources comes from offloading the work to the Dataflow service. For that reason, there is a charge (https://cloud.google.com/dataflow/pricing) associated with the use of Streaming Engine. However, the total bill for Dataflow pipelines using Streaming Engine is expected to be approximately the same as the total cost of Dataflow pipelines that do not use this option.
Using Streaming Engine
Streaming Engine is currently available for streaming pipelines in the following regions. It will become available in additional regions in the future.
us-west1 (Oregon)
us-central1 (Iowa)
us-east1 (South Carolina)
us-east4 (Northern Virginia)
northamerica-northeast1 (Montréal)
europe-west2 (London)
europe-west1 (Belgium)
europe-west4 (Netherlands)
europe-west3 (Frankfurt)
asia-southeast1 (Singapore)
asia-east1 (Taiwan)
asia-northeast1 (Tokyo)
australia-southeast1 (Sydney)
Updating (/dataflow/docs/guides/updating-a-pipeline) an already-running pipeline to use Streaming Engine is not currently supported.

If your pipeline is already running in production and you would like to use Streaming Engine, you need to stop your pipeline using the Dataflow Drain (/dataflow/docs/guides/stopping-a-pipeline#drain) option. Then, specify the Streaming Engine parameter and rerun your pipeline.

Java: SDK 2.x
Note: Streaming Engine requires the Apache Beam SDK for Java, version 2.10.0 or higher.
To use Streaming Engine for your streaming pipelines, specify the following parameter:
--enableStreamingEngine if you're using Apache Beam SDK for Java versions 2.11.0 or higher.
--experiments=enable_streaming_engine if you're using Apache Beam SDK for Java version 2.10.0.
If you use Dataflow Streaming Engine for your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Streaming Engine works best with smaller worker machine types, so we recommend that you set --workerMachineType=n1-standard-2. You can also set --diskSizeGb=30 because Streaming Engine only needs space for the worker boot image and local logs. These values are the default values.
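Putting these options together, a hedged sketch of a Streaming Engine launch for Beam Java SDK 2.11.0 or higher (the project name is a placeholder):

```shell
# Hypothetical sketch: Streaming Engine launch options for Beam Java 2.11.0+.
# Note that --region is one of the Streaming Engine regions and --zone is
# deliberately omitted so Dataflow can pick a zone within the region.
SE_ARGS="--runner=DataflowRunner \
--project=my-project \
--region=us-central1 \
--enableStreamingEngine \
--workerMachineType=n1-standard-2 \
--diskSizeGb=30"
echo "${SE_ARGS}"
```

For SDK version 2.10.0, the equivalent flag would be --experiments=enable_streaming_engine instead of --enableStreamingEngine.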
Dataflow Shuffle
Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. Currently, Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and Persistent Disk storage. The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.

Benefits of Dataflow Shuffle

The service-based Dataflow Shuffle has the following benefits:
Faster execution time of batch pipelines for the majority of pipeline job types.
A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.
Better autoscaling (/dataflow/service/dataflow-service-desc#autoscaling), since VMs no longer hold any shuffle data and can therefore be scaled down earlier.

Better fault tolerance; an unhealthy VM holding Dataflow Shuffle data will not cause the entire job to fail, as would happen if not using the feature.

Most of the reduction in worker resources comes from offloading the shuffle work to the Dataflow service. For that reason, there is a charge (/dataflow/pricing) associated with the use of Dataflow Shuffle. However, the total bill for Dataflow pipelines using the service-based Dataflow Shuffle implementation is expected to be less than or equal to the cost of Dataflow pipelines that do not use this option.

For the majority of pipeline job types, Dataflow Shuffle is expected to execute faster than the shuffle implementation running on worker VMs. However, the execution times might vary from run to run. If you are running a pipeline that has important deadlines, we recommend allocating sufficient buffer time before the deadline. In addition, consider requesting a bigger quota (/dataflow/quotas#quota-increase) for Shuffle.
Disk considerations
When using the service-based Dataflow Shuffle feature, you do not need to attach large Persistent Disks to your worker VMs. Dataflow automatically attaches a small 25 GB boot disk.
However, due to this small disk size, there are important considerations to be aware of when using Dataflow Shuffle:

A worker VM uses part of the 25 GB of disk space for the operating system, binaries, logs, and containers. Jobs that use a significant amount of disk and exceed the remaining disk capacity may fail when you use Dataflow Shuffle.
Jobs that use a lot of disk I/O may be slow due to the performance of the small disk. For more information about performance differences between disk sizes, see the Compute Engine Persistent Disk Performance (/compute/docs/disks/performance) page.
If any of these considerations apply to your job, you can use pipeline options (/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options) to specify a larger disk size.
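For instance, a hedged sketch of overriding the disk size for a disk-heavy Shuffle job (the 100 GB value is illustrative, not a recommendation):

```shell
# Hypothetical sketch: request a larger worker disk for a Shuffle job that
# needs more local disk than the small default boot disk provides.
DISK_ARGS="--experiments=shuffle_mode=service --diskSizeGb=100"
echo "${DISK_ARGS}"
```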
Using Dataflow Shuffle

Service-based Dataflow Shuffle is currently available in the following regions:
us-west1 (Oregon)
us-central1 (Iowa)
us-east1 (South Carolina)
us-east4 (Northern Virginia)
northamerica-northeast1 (Montréal)
europe-west2 (London)
europe-west1 (Belgium)
europe-west4 (Netherlands)
europe-west3 (Frankfurt)
asia-southeast1 (Singapore)
asia-east1 (Taiwan)
asia-northeast1 (Tokyo)
australia-southeast1 (Sydney)
Dataflow Shuffle will become available in additional regions in the future.

Performance differences in the asia-northeast1 (Tokyo) region: We recommend using Dataflow Shuffle with large datasets (greater than 1 TB) when you run pipelines in the asia-northeast1 (Tokyo) region. Using Shuffle with smaller datasets in the asia-northeast1 (Tokyo) region does not give you the same performance advantages as Shuffle in other regions.
Java: SDK 2.x

To use the service-based Dataflow Shuffle in your batch pipelines, specify the following parameter: --experiments=shuffle_mode=service

If you use Dataflow Shuffle for your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Shuffle is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.
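A hedged sketch of a batch launch with service-based shuffle enabled (the project name is a placeholder):

```shell
# Hypothetical sketch: enable service-based Dataflow Shuffle for a batch job.
# --zone is deliberately omitted so Dataflow can pick a zone in the region.
SHUFFLE_ARGS="--runner=DataflowRunner \
--project=my-project \
--region=us-central1 \
--experiments=shuffle_mode=service"
echo "${SHUFFLE_ARGS}"
```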
Dataflow Flexible Resource Scheduling

Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques (/dataflow/docs/guides/flexrs#delayed_scheduling), the Dataflow Shuffle (/dataflow/docs/guides/deploying-a-pipeline#cloud-dataflow-shuffle) service, and a combination of preemptible virtual machine (VM) instances (/compute/docs/instances/preemptible) and regular VMs. By running preemptible VMs and regular VMs in parallel, Dataflow improves the user experience if Compute Engine stops preemptible VM instances during a system event. FlexRS helps to ensure that the pipeline continues to make progress and that you do not lose previous work when Compute Engine preempts (/compute/docs/instances/preemptible#what_is_a_preemptible_instance) your preemptible VMs. For more information about FlexRS, see Using Flexible Resource Scheduling in Dataflow (/dataflow/docs/guides/flexrs).
Dataflow Runner v2

The current production Dataflow runner utilizes language-specific workers when running Apache Beam pipelines. To improve scalability, generality, extensibility, and efficiency, the Dataflow runner is moving to a more services-based architecture. These changes include a more efficient and portable worker architecture packaged together with the Shuffle Service and Streaming Engine.

The new Dataflow runner, Dataflow Runner v2, is available for Python streaming pipelines. You are encouraged to try out Dataflow Runner v2 with your current workload before it is enabled by default on all new pipelines. You do not have to make any changes to your pipeline code to take advantage of this new architecture.
Dataflow Runner v2 requires the Apache Beam SDK for Python, version 2.21.0 or higher.

Benefits of using Dataflow Runner v2

Starting with Python streaming pipelines, new features will be available on Dataflow Runner v2 only. In addition, the improved efficiency of the Dataflow Runner v2 architecture could lead to performance improvements in your Dataflow jobs.

While using Dataflow Runner v2, you might notice a reduction in your bill. The billing model for Dataflow Runner v2 is not final yet, so your bill might increase back to near current levels as the new runner is enabled across all pipelines.

Using Dataflow Runner v2

Dataflow Runner v2 is available in regions that have Dataflow regional endpoints (/dataflow/docs/concepts/regional-endpoints).
Java: SDK 2.x

Dataflow Runner v2 is not available for Java at this time.
Debugging Dataflow Runner v2 jobs

To debug jobs using Dataflow Runner v2, you should follow standard debugging steps (/dataflow/docs/guides/troubleshooting-your-pipeline); however, be aware of the following when using Dataflow Runner v2:

Dataflow Runner v2 jobs run two types of processes on the worker VM: the SDK process and the runner harness process. Depending on the pipeline and VM type, there might be one or more SDK processes, but there is only one runner harness process per VM.

SDK processes run user code and other language-specific functions, while the runner harness process manages everything else.

The runner harness process waits for all SDK processes to connect to it before starting to request work from Dataflow.
Jobs might be delayed if the worker VM downloads and installs dependencies during the SDK process startup. If there are issues in an SDK process, such as starting up or installing libraries, the worker reports its status as unhealthy.
Worker VM logs, available through the Logs Viewer (/logging/docs/view/logs-viewer-interface) or the Dataflow monitoring interface (/dataflow/docs/guides/using-monitoring-intf), include logs from the runner harness process as well as logs from the SDK processes.

To diagnose problems in your user code, examine the worker logs from the SDK processes. If you find any errors in the runner harness logs, please contact Support (https://console.cloud.google.com/support) to file a bug.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies (https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2020-08-19 UTC.