
Deploying a pipeline

This document explains in detail how Dataflow deploys and runs a pipeline, and covers advanced topics like optimization and load balancing. If you are looking for a step-by-step guide on how to create and deploy your first pipeline, use Dataflow's quickstarts for Java (/dataflow/docs/quickstarts/quickstart-java-maven), Python (/dataflow/docs/quickstarts/quickstart-python) or templates (/dataflow/docs/quickstarts/quickstart-templates).

After you construct and test your Apache Beam pipeline, you can use the Dataflow managed service to deploy and execute it. Once on the Dataflow service, your pipeline code becomes a Dataflow job.

The Dataflow service fully manages Google Cloud services such as Compute Engine (/compute) and Cloud Storage (/storage) to run your Dataflow job, automatically spinning up and tearing down the necessary resources. The Dataflow service provides visibility into your job through tools like the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf) and the Dataflow Command-line Interface (/dataflow/pipelines/dataflow-command-line-intf).

You can control some aspects of how the Dataflow service runs your job by setting execution parameters (/dataflow/pipelines/specifying-exec-params) in your pipeline code. For example, the execution parameters specify whether the steps of your pipeline run on worker virtual machines, on the Dataflow service backend, or locally.

In addition to managing Google Cloud resources, the Dataflow service automatically performs and optimizes many aspects of distributed parallel processing. These include:

Parallelization and Distribution. Dataflow automatically partitions your data and distributes your worker code to Compute Engine instances for parallel processing.

Optimization. Dataflow uses your pipeline code to create an execution graph that represents your pipeline's PCollections and transforms, and optimizes the graph for the most efficient performance and resource usage. Dataflow also automatically optimizes potentially costly operations, such as data aggregations.

Automatic Tuning features. The Dataflow service includes several features that provide on-the-fly adjustment of resource allocation and data partitioning, such as Autoscaling and Dynamic Work Rebalancing. These features help the Dataflow service execute your job as quickly and efficiently as possible.

Pipeline lifecycle: from pipeline code to Dataflow job

When you run your Dataflow pipeline, Dataflow creates an execution graph from the code that constructs your Pipeline object, including all of the transforms and their associated processing functions (such as DoFns). This phase is called Graph Construction Time and runs locally on the computer where the pipeline is run.

During graph construction, Apache Beam locally executes the code from the main entry point of the pipeline code, stopping at the calls to a source, sink or transform step, and turning these calls into nodes of the graph. As a consequence, a piece of code in a pipeline's entry point (Java's main() method or the top-level of a Python script) locally executes on the machine that runs the pipeline, while the same code declared in a method of a DoFn object executes in the Dataflow workers.
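The following minimal sketch illustrates this split for the Java SDK. The bucket paths, class name, and transform names are placeholders; the point is that the body of main() runs at graph construction time on the launching machine, while the body of the anonymous DoFn runs later on the workers.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class WhereCodeRuns {
  public static void main(String[] args) {
    // Everything here executes locally, at graph construction time.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"))   // becomes a graph node
     .apply("ToUpper", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // This body executes later, on the Dataflow workers, once per element.
          c.output(c.element().toUpperCase());
        }
      }))
     .apply("WriteResult", TextIO.write().to("gs://my-bucket/output"));

    p.run();  // submits the translated graph to the configured runner
  }
}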

Also during graph construction, Apache Beam validates that any resources referenced by the pipeline (like Cloud Storage buckets, BigQuery tables, and Pub/Sub Topics or Subscriptions) actually exist and are accessible. The validation is done through standard API calls to the respective services, so it's vital that the user account used to run a pipeline has proper connectivity to the necessary services and is authorized to call their APIs. Before submitting the pipeline to the Dataflow service, Apache Beam also checks for other errors, and ensures that the pipeline graph doesn't contain any illegal operations.

The execution graph is then translated into JSON format, and the JSON execution graph is transmitted to the Dataflow service endpoint.

Graph construction also happens when you execute your pipeline locally, but the graph is not translated to JSON or transmitted to the service. Instead, the graph is run locally on the same machine where you launched your Dataflow program. See the documentation on configuring for local execution (/dataflow/pipelines/specifying-exec-params#LocalExecution) for more details.

The Dataflow service then validates the JSON execution graph. When the graph is validated, it becomes a job on the Dataflow service. You'll be able to see your job, its execution graph, status, and log information by using the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf).

Java: SDK 2.x

The Dataflow service sends a response to the machine where you ran your Dataflow program. This response is encapsulated in the object DataflowPipelineJob, which contains your Dataflow job's jobId. You can use the jobId to monitor, track, and troubleshoot your job using the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf) and the Dataflow Command-line Interface (/dataflow/pipelines/dataflow-command-line-intf). See the API reference for DataflowPipelineJob (https://beam.apache.org/documentation/sdks/javadoc/current/index.html?org/apache/beam/runners/dataflow/DataflowPipelineJob.html) for more information.

Execution graph

Dataflow builds a graph of steps that represents your pipeline, based on the transforms and data you used when you constructed your Pipeline object. This is the pipeline execution graph.

The WordCount (https://beam.apache.org/get-started/wordcount-example/) example, included with the Apache Beam SDKs, contains a series of transforms to read, extract, count, format, and write the individual words in a collection of text, along with an occurrence count for each word. The following diagram shows how the transforms in the WordCount pipeline are expanded into an execution graph:


Figure 1: WordCount Example Execution Graph

The execution graph often differs from the order in which you specified your transforms when you constructed the pipeline. This is because the Dataflow service performs various optimizations and fusions on the execution graph before it runs on managed cloud resources. The Dataflow service respects data dependencies when executing your pipeline; however, steps without data dependencies between them may be executed in any order.

You can see the unoptimized execution graph that Dataflow has generated for your pipeline when you select your job in the Dataflow Monitoring Interface (/dataflow/pipelines/dataflow-monitoring-intf).

Parallelization and distribution

The Dataflow service automatically parallelizes and distributes the processing logic in your pipeline to the workers you've allotted to perform your job. Dataflow uses the abstractions in the programming model (/dataflow/model/programming-model-beam) to represent parallel processing functions; for example, your ParDo transforms cause Dataflow to automatically distribute your processing code (represented by DoFns) to multiple workers to be run in parallel.

Structuring your user code

You can think of your DoFn code as small, independent entities: there can potentially be many instances running on different machines, each with no knowledge of the others. As such, pure functions (functions that do not depend on hidden or external state, that have no observable side effects, and are deterministic) are ideal code for the parallel and distributed nature of DoFns.

The pure function model is not strictly rigid, however; state information or external initialization data can be valid for DoFn and other function objects, so long as your code does not depend on things that the Dataflow service does not guarantee. When structuring your ParDo transforms and creating your DoFns, keep the following guidelines in mind:

The Dataflow service guarantees that every element in your input PCollection is processed by a DoFn instance exactly once.

The Dataflow service does not guarantee how many times a DoFn will be invoked.

The Dataflow service does not guarantee exactly how the distributed elements are grouped—that is, it does not guarantee which (if any) elements are processed together.

The Dataflow service does not guarantee the exact number of DoFn instances that will be created over the course of a pipeline.

The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).

The Dataflow service serializes element processing per DoFn instance. Your code does not need to be strictly thread-safe; however, any state shared between multiple DoFn instances must be thread-safe.
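As an illustration of these guidelines, the following hypothetical DoFn is a pure function: its output depends only on the input element, it holds no mutable state, and re-invoking it on a retried bundle is harmless.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// A DoFn written as a pure function. The class name and element types are
// illustrative; any deterministic, side-effect-free DoFn has the same property.
class FormatWordCount extends DoFn<KV<String, Long>, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, Long> wordAndCount = c.element();
    c.output(wordAndCount.getKey() + ": " + wordAndCount.getValue());
  }
}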

See Requirements for User-Provided Functions (https://beam.apache.org/documentation/programming-guide/#requirements-for-writing-user-code-for-beam-transforms) in the programming model (/dataflow/model/programming-model-beam) documentation for more information about building your user code.

Error and exception handling

Your pipeline may throw exceptions while processing data. Some of these errors are transient (e.g., temporary difficulty accessing an external service), but some are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.

Dataflow processes elements in arbitrary bundles, and retries the complete bundle when an error is thrown for any element in that bundle. When running in batch mode, bundles including a failing item are retried 4 times. The pipeline will fail completely when a single bundle has failed 4 times. When running in streaming mode, a bundle including a failing item will be retried indefinitely, which may cause your pipeline to permanently stall.

When processing in batch mode, you might see a large number of individual failures before a pipeline job fails completely (which happens when any given bundle fails after four retry attempts). For example, if your pipeline attempts to process 100 bundles, Dataflow could theoretically generate several hundred individual failures until a single bundle reaches the 4-failure condition for exit.
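One common way to keep a single corrupt record from repeatedly failing its bundle is to catch the exception inside the DoFn and route the bad element to a separate "dead letter" output instead of rethrowing. The sketch below assumes lines is an existing PCollection<String>; the tags and transform names are illustrative.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Tags for the main output and for unparseable records.
final TupleTag<Integer> parsedTag = new TupleTag<Integer>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = lines.apply("ParseOrQuarantine",
    ParDo.of(new DoFn<String, Integer>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        try {
          c.output(Integer.parseInt(c.element().trim()));
        } catch (NumberFormatException e) {
          // Instead of throwing (which would fail and retry the whole bundle),
          // send the bad record to the dead-letter output.
          c.output(deadLetterTag, c.element());
        }
      }
    }).withOutputTags(parsedTag, TupleTagList.of(deadLetterTag)));

PCollection<Integer> parsed = results.get(parsedTag);
PCollection<String> badRecords = results.get(deadLetterTag);  // write these out for later inspection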

Fusion optimization

Once the JSON form of your pipeline's execution graph has been validated, the Dataflow service may modify the graph to perform optimizations. Such optimizations can include fusing multiple steps or transforms in your pipeline's execution graph into single steps. Fusing steps prevents the Dataflow service from needing to materialize every intermediate PCollection in your pipeline, which can be costly in terms of memory and processing overhead.

While all the transforms you've specified in your pipeline construction are executed on the service, they may be executed in a different order, or as part of a larger fused transform to ensure the most efficient execution of your pipeline. The Dataflow service respects data dependencies between the steps in the execution graph, but otherwise steps may be executed in any order.

Fusion example


The following diagram shows how the execution graph from the WordCount (/dataflow/examples/examples-beam) example included with the Apache Beam SDK for Java might be optimized and fused by the Dataflow service for efficient execution:

Figure 2: WordCount Example Optimized Execution Graph

Preventing fusion

There are a few cases in your pipeline where you may want to prevent the Dataflow service from performing fusion optimizations. These are cases in which the Dataflow service might incorrectly guess the optimal way to fuse operations in the pipeline, which could limit the Dataflow service's ability to make use of all available workers.

For example, one case in which fusion can limit Dataflow's ability to optimize worker usage is a "high fan-out" ParDo. In such an operation, you might have an input collection with relatively few elements, but the ParDo produces an output with hundreds or thousands of times as many elements, followed by another ParDo. If the Dataflow service fuses these ParDo operations together, parallelism in this step is limited to at most the number of items in the input collection, even though the intermediate PCollection contains many more elements.

You can prevent such a fusion by adding an operation to your pipeline that forces the Dataflow service to materialize your intermediate PCollection. Consider using one of the following operations:

You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.

You can pass your intermediate PCollection as a side input (https://beam.apache.org/documentation/programming-guide/#side-inputs) to another ParDo. The Dataflow service always materializes side inputs.

You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation.
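For example, the Reshuffle option might look like the following sketch in the Java SDK. Here expanded is assumed to be the output of the high-fan-out ParDo (a PCollection<String>), and ExpensiveDoFn is a placeholder for the downstream processing; inserting the Reshuffle breaks fusion so the expanded collection can be spread across many workers.

import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

expanded
    // Materializes the intermediate PCollection and prevents fusion with the next step.
    .apply("BreakFusion", Reshuffle.<String>viaRandomKey())
    .apply("Process", ParDo.of(new ExpensiveDoFn()));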

Combine optimization

Aggregation operations are an important concept in large-scale data processing. Aggregation brings together data that's conceptually far apart, making it extremely useful for correlating. The Dataflow programming model (/dataflow/model/programming-model-beam) represents aggregation operations as the GroupByKey, CoGroupByKey, and Combine transforms.

Dataflow's aggregation operations combine data across the entire data set, including data that may be spread across multiple workers. During such aggregation operations, it's often most efficient to combine as much data locally as possible before combining data across instances. When you apply a GroupByKey or other aggregating transform, the Dataflow service automatically performs partial combining locally before the main grouping operation.

Because the Dataflow service automatically performs partial local combining, it is strongly recommended that you do not attempt to make this optimization by hand in your pipeline code.

When performing partial or multi-level combining, the Dataflow service makes different decisions based on whether your pipeline is working with batch or streaming data. For bounded data, the service favors efficiency and will perform as much local combining as possible. For unbounded data, the service favors lower latency, and may not perform partial combining (as it may increase latency).
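To benefit from this behavior, express the aggregation with a Combine transform (or one of the built-in combiners such as Sum) rather than grouping the values and summing them yourself in a downstream DoFn. A minimal sketch, assuming perUserSpend is an existing PCollection<KV<String, Long>>:

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Because the aggregation is expressed as a Combine, the service can combine
// partial sums on each worker before the main grouping step.
PCollection<KV<String, Long>> totals =
    perUserSpend.apply("SumPerUser", Sum.<String>longsPerKey());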


Autotuning features

The Dataflow service contains several autotuning features that can further dynamically optimize your Dataflow job while it is running. These features include Autoscaling (/dataflow/docs/guides/deploying-a-pipeline#autoscaling) and Dynamic Work Rebalancing (https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow).

Autoscaling

With autoscaling enabled, the Dataflow service automatically chooses the appropriate number of worker instances required to run your job. The Dataflow service may also dynamically re-allocate more workers or fewer workers during runtime to account for the characteristics of your job. Certain parts of your pipeline may be computationally heavier than others, and the Dataflow service may automatically spin up additional workers during these phases of your job (and shut them down when they're no longer needed).

Java: SDK 2.x

Autoscaling is enabled by default on all batch Dataflow jobs and streaming jobs using Streaming Engine (#streaming-engine). You can disable autoscaling by specifying (/dataflow/pipelines/specifying-exec-params) the flag --autoscalingAlgorithm=NONE when you run your pipeline; if so, note that the Dataflow service sets the number of workers based on the --numWorkers option, which defaults to 3.

With autoscaling enabled, the Dataflow service does not allow user control of the exact number of worker instances allocated to your job. You might still cap the number of workers by specifying (/dataflow/pipelines/specifying-exec-params) the --maxNumWorkers option when you run your pipeline.

For batch jobs, the --maxNumWorkers flag is optional. The default is 1000. For streaming jobs using Streaming Engine, the --maxNumWorkers flag is optional. The default is 100. For streaming jobs not using Streaming Engine, the --maxNumWorkers flag is required.
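For example (the worker counts here are only illustrative), you might disable autoscaling and run with a fixed pool of five workers:

--autoscalingAlgorithm=NONE --numWorkers=5

or keep autoscaling enabled but cap it at 20 workers:

--maxNumWorkers=20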

Dataflow scales based on the parallelism of a pipeline. The parallelism of a pipeline is an estimate of the number of threads needed to most efficiently process data at any given time.

The parallelism is calculated every few minutes unless the bandwidth of an external service is too low. When the parallelism increases, Dataflow scales up and adds workers. When the parallelism decreases, Dataflow scales down and removes workers.

The following table summarizes when autoscaling increases or decreases the number of workers in batch (#batch-autoscaling) and streaming (#streaming-autoscaling) pipelines:

Scaling up

Batch pipelines: If the remaining work takes longer than spinning up new workers and the current workers are utilizing, on average, more than 5% of their CPUs, Dataflow may scale up. Sources with the following may limit the number of new workers: a small amount of data, un-splittable data (like compressed files), and data processed by I/O modules that don't split data. Sinks configured to write to a fixed number of shards, like a Cloud Storage destination writing to existing files, may limit the number of new workers.

Streaming pipelines: If a streaming pipeline is backlogged and workers are utilizing, on average, more than 20% of their CPUs, Dataflow may scale up. Backlogs are cleared within approximately 150 seconds, given the current throughput per worker.

Scaling down

Batch pipelines: If the remaining work takes less time than spinning up new workers and the current workers are utilizing, on average, more than 5% of their CPUs, Dataflow may scale down.

Streaming pipelines: If a streaming pipeline backlog is lower than 20 seconds and workers are utilizing, on average, less than 80% of the CPUs, Dataflow may scale down. After scaling down, the new number of workers utilize, on average, less than 75% of their CPUs.

No autoscaling

Batch pipelines: If I/O takes longer than data processing or workers are utilizing, on average, less than 5% of their CPUs, the parallelism isn't recalculated.

Streaming pipelines: If workers are utilizing, on average, less than 20% of their CPU, the parallelism isn't recalculated.

Batch autoscaling

For batch pipelines, Dataflow automatically chooses the number of workers based on both the amount of work in each stage of your pipeline and the current throughput at that stage. Dataflow determines how much data is being processed by the current set of workers and extrapolates how much time the rest of the work takes to process.


The number of workers is sub-linear to the amount of work. For instance, a job with twice the work has less than twice the workers.

If your pipeline uses a custom data source that you've implemented, there are a few methods you can implement that provide more information to the Dataflow service's autoscaling algorithm and potentially improve performance:

Java: SDK 2.x

In your BoundedSource subclass, implement the method getEstimatedSizeBytes. The Dataflow service uses getEstimatedSizeBytes when calculating the initial number of workers to use for your pipeline.

In your BoundedReader subclass, implement the method getFractionConsumed. The Dataflow service uses getFractionConsumed to track read progress and converge on the correct number of workers to use during a read.
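As a rough sketch of what these two hooks might look like, consider a hypothetical source that reads fixed-size records from a byte range [startOffset, endOffset) and tracks its position in currentOffset; the other required methods of the source and reader are omitted here.

// Inside your BoundedSource<MyRecord> subclass:
@Override
public long getEstimatedSizeBytes(PipelineOptions options) throws Exception {
  // Used by the service to pick the initial number of workers.
  return endOffset - startOffset;
}

// Inside your BoundedSource.BoundedReader<MyRecord> subclass:
@Override
public Double getFractionConsumed() {
  // Used by the service to track read progress; may return null if unknown.
  return (double) (currentOffset - startOffset) / (endOffset - startOffset);
}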

Streaming autoscaling

Streaming autoscaling is generally available for pipelines that use Streaming Engine (#streaming-engine). For pipelines that do not use Streaming Engine, streaming autoscaling is available in beta.

For more information about launch stage definitions, see the Product launch stages (/products#product-launch-stages) page.

Streaming autoscaling allows the Dataflow service to adaptively change the number of workers used to execute your streaming pipeline in response to changes in load and resource utilization. Streaming autoscaling is a free feature and is designed to reduce the costs of the resources used when executing streaming pipelines.

Without autoscaling, you choose a fixed number of workers by specifying numWorkers or num_workers to execute your pipeline. As the input workload varies over time, this number can become either too high or too low. Provisioning too many workers results in unnecessary extra cost, and provisioning too few workers results in higher latency for processed data. By enabling autoscaling, resources are used only as they are needed.


The objective of autoscaling streaming pipelines is to minimize backlog while maximizing worker utilization and throughput, and to react quickly to spikes in load. By enabling autoscaling, you don't have to choose between provisioning for peak load and fresh results. Workers are added as CPU utilization and backlog increase and are removed as these metrics come down. This way, you're paying only for what you need, and the job is processed as efficiently as possible.

Java: SDK 2.x

Custom unbounded sources

If your pipeline uses a custom unbounded source, the source must inform the Dataflow service about backlog. Backlog is an estimate of the input in bytes that has not yet been processed by the source. To inform the service about backlog, implement either one of the following methods in your UnboundedReader class.

getSplitBacklogBytes() - Backlog for the current split of the source. The service aggregates backlog across all the splits.

getTotalBacklogBytes() - The global backlog across all the splits. In some cases the backlog is not available for each split and can only be calculated across all the splits. Only the first split (split ID '0') needs to provide total backlog.
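A minimal sketch of the per-split variant, assuming your reader keeps bytesBehind, its own estimate of unread input bytes (the rest of the UnboundedReader implementation is omitted):

// Inside your UnboundedSource.UnboundedReader<MyRecord> subclass:
@Override
public long getSplitBacklogBytes() {
  // Report this split's backlog; BACKLOG_UNKNOWN tells the service no estimate is available.
  return bytesBehind >= 0 ? bytesBehind : BACKLOG_UNKNOWN;
}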

The Apache Beam repository contains several examples (https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java) of custom sources that implement the UnboundedReader class.

Enable streaming autoscaling

For streaming jobs using Streaming Engine (#streaming-engine), autoscaling is enabled by default.

To enable autoscaling for jobs not using Streaming Engine, set the following execution parameters (/dataow/pipelines/specifying-exec-params) when you start your pipeline:

--autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=N

For streaming jobs not using Streaming Engine, the minimum number of workers is 1/15th of the --maxNumWorkers value, rounded up.
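For example, with the illustrative value used in the pricing discussion below:

--autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=15

the job can scale between 1 worker (15 / 15, rounded up) and 15 workers.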


Streaming pipelines are deployed with a fixed pool of Persistent Disks (/compute/docs/disks#pdspecs), equal in number to --maxNumWorkers. Take this into account when you specify --maxNumWorkers, and ensure this value is a sufficient number of disks for your pipeline.

Note: If you've reached a scaling limit and want to raise the --maxNumWorkers, you must submit a new job with a higher --maxNumWorkers.

If you want to update a streaming autoscaling job that's not using Streaming Engine, make sure --maxNumWorkers remains the same (see the section on manually scaling streaming pipelines (#ManualScaling)). Not specifying the --autoscalingAlgorithm pipeline option in the Update command disables autoscaling for the updated job.

Usage and pricing

Compute Engine usage is based on the average number of workers, while Persistent Disk usage is based on the exact value of --maxNumWorkers. Persistent Disks are redistributed such that each worker gets an equal number of attached disks.

In the example above, where --maxNumWorkers=15, you pay (/dataflow/pricing) for between 1 and 15 Compute Engine instances and exactly 15 Persistent Disks.

Manually scaling a streaming pipeline

Until autoscaling is generally available in streaming mode, there is a workaround you can use to manually scale the number of workers running your streaming pipeline by using Dataflow's Update (/dataflow/pipelines/updating-a-pipeline) feature.

Java: SDK 2.x

To scale your streaming pipeline during execution, ensure that you set the following execution parameters (/dataflow/pipelines/specifying-exec-params) when you start your pipeline:

Set --maxNumWorkers equal to the maximum number of workers you want available to your pipeline.

Set --numWorkers equal to the initial number of workers you want your pipeline to use when it starts running.

Once your pipeline is running, you can Update your pipeline and specify a new number of workers using the --numWorkers parameter. The value you set for the new --numWorkers must be between N and --maxNumWorkers, where N is equal to --maxNumWorkers / 15.
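For example (the values are illustrative): launch the pipeline with

--numWorkers=3 --maxNumWorkers=15

and later, to scale the running job to 10 workers, rerun the same pipeline with the Update options, replacing <existing job name> with the name of the running job:

--update --jobName=<existing job name> --numWorkers=10 --maxNumWorkers=15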

Updating your pipeline replaces your running job with a new job, using the new number of workers, while preserving all state information associated with the previous job.

Note: Your pipeline's maximum scaling range depends on the number of persistent disks deployed when the pipeline starts. The Dataflow service deploys one persistent disk per worker at the maximum number of workers. Deploying extra Persistent Disks by setting --maxNumWorkers to a higher value than --numWorkers provides some benefits to your pipeline. Specifically, it allows you the flexibility to scale your pipeline to a larger number of workers after startup, and might provide improved performance (/compute/docs/disks/performance#size_price_performance). However, your pipeline might also incur additional cost for the extra Persistent Disks. Take note of the cost and quota implications of the additional Persistent Disk resources when planning your streaming pipeline and setting the scaling range.

Note: You cannot change the scaling range of a pipeline by using the Update feature. If you need to scale further, you must start a new pipeline and specify a higher value for --maxNumWorkers as the ceiling of your desired scaling range.

Dynamic Work Rebalancing

The Dataflow service's Dynamic Work Rebalancing feature allows the service to dynamically re-partition work based on runtime conditions. These conditions might include:

Imbalances in work assignments

Workers taking longer than expected to finish

Workers finishing faster than expected

The Dataflow service automatically detects these conditions and can dynamically reassign work to unused or underused workers to decrease your job's overall processing time.

Limitations

Dynamic Work Rebalancing only happens when the Dataflow service is processing some input data in parallel: when reading data from an external input source, when working with a materialized intermediate PCollection, or when working with the result of an aggregation like GroupByKey. If a large number of steps in your job are fused (#Optimization), there are fewer intermediate PCollections in your job and Dynamic Work Rebalancing will be limited to the number of elements in the source materialized PCollection. If you want to ensure that Dynamic Work Rebalancing can be applied to a particular PCollection in your pipeline, you can prevent fusion (#fusion-prevention) in a few different ways to ensure dynamic parallelism.

Dynamic Work Rebalancing cannot re-parallelize data finer than a single record. If your data contains individual records that cause large delays in processing time, they may still delay your job, since Dataflow cannot subdivide and redistribute an individual "hot" record to multiple workers.

Java: SDK 2.x

If you've set a fixed number of shards for your pipeline's final output (for example, by writing data using TextIO.Write.withNumShards), parallelization will be limited based on the number of shards that you've chosen.

This fixed-shards limitation can be considered temporary, and may be subject to change in future releases of the Dataflow service.

Working with Custom Data Sources

Java: SDK 2.x

If your pipeline uses a custom data source that you provide, you must implement the method splitAtFraction to allow your source to work with the Dynamic Work Rebalancing feature.

Caution: Using Dynamic Work Rebalancing with custom data sources is an extremely advanced use case. If you choose to implement splitAtFraction, it is critical that you test your code extensively and with maximum code coverage.

If you implement splitAtFraction incorrectly, records from your source may appear to get duplicated or dropped. See the API reference information on RangeTracker (https://beam.apache.org/documentation/sdks/javadoc/current/index.html?org/apache/beam/sdk/io/range/RangeTracker.html) for help and tips on implementing splitAtFraction.
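As a rough sketch only: for a reader whose position is an offset in the range [start, end) guarded by an OffsetRangeTracker named rangeTracker, splitAtFraction might delegate to the tracker as shown below. The MyRecord type and the createSourceForOffsetRange helper are hypothetical, and a real implementation must also coordinate carefully with the reading thread, as described in the RangeTracker documentation.

// Inside your BoundedSource.BoundedReader<MyRecord> subclass:
@Override
public BoundedSource<MyRecord> splitAtFraction(double fraction) {
  long splitOffset = rangeTracker.getPositionForFractionConsumed(fraction);
  if (rangeTracker.trySplitAtPosition(splitOffset)) {
    // This reader keeps [start, splitOffset); the returned residual source covers the rest.
    BoundedSource<MyRecord> residual = source.createSourceForOffsetRange(splitOffset, end);
    source = source.createSourceForOffsetRange(start, splitOffset);
    return residual;
  }
  return null; // the split point was not accepted; keep reading the full range
}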

Resource usage and management


The Dataflow service fully manages resources in Google Cloud on a per-job basis. This includes spinning up and shutting down Compute Engine (/compute) instances (occasionally referred to as workers or VMs) and accessing your project's Cloud Storage (/storage) buckets for both I/O and temporary file staging. However, if your pipeline interacts with Google Cloud data storage technologies like BigQuery (/bigquery) and Pub/Sub (/pubsub), you must manage the resources and quota for those services.

Dataflow uses a user-provided location in Cloud Storage (/storage) specifically for staging files. This location is under your control, and you should ensure that the location's lifetime is maintained as long as any job is reading from it. You can re-use the same staging location for multiple job runs, as the SDK's built-in caching can speed up the start time for your jobs.

Caution: Manually altering Dataflow-managed Compute Engine resources associated with a Dataflow job is an unsupported operation. You should not attempt to manually stop, delete, or otherwise control the Compute Engine resources that Dataflow has created to run your job. In addition, you should not alter any persistent disk resources associated with your Dataflow job.

Jobs

You may run up to 25 concurrent Dataflow jobs per Google Cloud project; however, this limit can be increased by contacting Google Cloud Support (/support). For more information, see Quotas (/dataflow/quotas#quota-increase).

The Dataflow service is currently limited to processing JSON job requests that are 20 MB in size or smaller. The size of the job request is specifically tied to the JSON representation of your pipeline; a larger pipeline means a larger request.

To estimate the size of your pipeline's JSON request, run your pipeline with the following option:

Java: SDK 2.x

--dataflowJobFile=< path to output file >

This command writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request; the actual size will be slightly larger due to some additional information included in the request.


For more information, see the troubleshooting page for "413 Request Entity Too Large" / "The size of serialized JSON representation of the pipeline exceeds the allowable limit" (/dataow/docs/guides/common-errors#json-request-too-large).

In addition, your job's graph size must not exceed 10 MB. For more information, see the troubleshooting page for "The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs." (/dataow/docs/guides/common-errors#job-graph-too-large).

Workers

The Dataflow service currently allows a maximum of 1000 Compute Engine instances per job. The default machine type is n1-standard-1 for batch jobs, and n1-standard-4 for streaming jobs. Therefore, when using the default machine types, the Dataflow service can allocate up to 4000 cores per job. If you need more cores for your job, you can select a larger machine type.

The Dataflow managed service now deploys Compute Engine virtual machines associated with Dataflow jobs using Managed Instance Groups (/compute/docs/instance-groups). A Managed Instance Group creates multiple Compute Engine instances from a common template and allows you to control and manage them as a group. That way, you don't need to individually control each instance associated with your pipeline.

You should not attempt to manage or otherwise interact directly with your Compute Engine Managed Instance Group; the Dataflow service will take care of that for you. Manually altering any Compute Engine resources associated with your Dataflow job is an unsupported operation.

You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement (/dataflow/sla).

Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. You can specify a machine type for your pipeline by setting the appropriate execution parameter (/dataflow/pipelines/specifying-exec-params) at pipeline creation time.


Caution: Shared core machine types such as f1 and g1 series workers are not supported under Dataflow's Service Level Agreement (/dataflow/sla).

Java: SDK 2.x

To change the machine type, set the --workerMachineType option.

The Dataflow service currently does not support jobs that use only preemptible virtual machines (/compute/docs/instances/preemptible). Instead, if you would like to save processing costs, consider using the FlexRS processing mode (/dataflow/docs/guides/flexrs), which uses a combination of preemptible and non-preemptible resources.

Resource quota

The Dataflow service checks to ensure that your Google Cloud project has the Compute Engine resource quota required to run your job, both to start the job and scale to the maximum number of worker instances. Your job will fail to start if there is not enough resource quota available.

Because a Dataflow job deploys Compute Engine virtual machines as a Managed Instance Group, you'll need to ensure your project satisfies some additional quota requirements. Specifically, your project will need one of the following types of quota for each concurrent Dataflow job that you want to run:

One Instance Group per job

One Managed Instance Group per job

One Instance Template per job

Caution: Manually changing your Dataflow job's Instance Template or Managed Instance Group is not recommended or supported. Use Dataflow's pipeline configuration options (/dataflow/pipelines/specifying-exec-params) instead.

Dataflow's Autoscaling (#AutoScaling) feature is limited by your project's available Compute Engine quota. If your job has sufficient quota when it starts, but another job uses the remainder of your project's available quota, the first job will run but not be able to fully scale.


However, the Dataow service does not manage quota increases for jobs that exceed the resource quotas in your project. You are responsible for making any necessary requests for additional resource quota, for which you can use the Google Cloud Console (https://console.cloud.google.com/).

Persistent disk resources

The Dataflow service is currently limited to 15 persistent disks per worker instance when running a streaming job. Each persistent disk is local to an individual Compute Engine virtual machine. Your job may not have more workers than persistent disks; a 1:1 ratio between workers and disks is the minimum resource allotment.

For jobs running on worker VMs, the default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode. Jobs using Streaming Engine (#streaming-engine) or Dataflow Shuffle (#cloud-dataflow-shuffle) run on the Dataflow service backend and use smaller disks.

Locations

By default, the Dataflow service deploys Compute Engine resources in the us-central1-f zone of the us-central1 region. You can override this setting by specifying (/dataflow/pipelines/specifying-exec-params) the --region parameter. If you need to use a specific zone for your resources, use the --zone parameter when you create your pipeline. However, we recommend that you only specify the region, and leave the zone unspecified. This allows the Dataflow service to automatically select the best zone within the region based on the available zone capacity at the time of the job creation request. For more information, see the regional endpoints (/dataflow/docs/concepts/regional-endpoints) documentation.

Streaming Engine

Currently, the Dataflow pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage. Dataflow's Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend.

Benefits of Streaming Engine


The Streaming Engine model has the following benefits:

A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs. Streaming Engine works best with smaller worker machine types (n1-standard-2 instead of n1-standard-4) and does not require Persistent Disk beyond a small worker boot disk, leading to less resource and quota consumption.

More responsive autoscaling (https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling) in response to variations in incoming data volume. Streaming Engine offers smoother, more granular scaling of workers.

Improved supportability, since you don’t need to redeploy your pipelines to apply service updates.

Most of the reduction in worker resources comes from offloading the work to the Dataflow service. For that reason, there is a charge (https://cloud.google.com/dataflow/pricing) associated with the use of Streaming Engine. However, the total bill for Dataflow pipelines using Streaming Engine is expected to be approximately the same compared to the total cost of Dataflow pipelines that do not use this option.

Using Streaming Engine

Streaming Engine is currently available for streaming pipelines in the following regions. It will become available in additional regions in the future.

us-west1 (Oregon)

us-central1 (Iowa)

us-east1 (South Carolina)

us-east4 (North Virginia)

northamerica-northeast1 (Montréal)

europe-west2 (London)

europe-west1 (Belgium)

europe-west4 (Netherlands)

europe-west3 (Frankfurt)


asia-southeast1 (Singapore)

asia-east1 (Taiwan)

asia-northeast1 (Tokyo)

australia-southeast1 (Sydney)

Updating (/dataflow/docs/guides/updating-a-pipeline) an already-running pipeline to use Streaming Engine is not currently supported.

If your pipeline is already running in production and you would like to use Streaming Engine, you need to stop your pipeline using the Dataflow Drain (/dataflow/docs/guides/stopping-a-pipeline#drain) option. Then, specify the Streaming Engine parameter and rerun your pipeline.

Java: SDK 2.x

Note: Streaming Engine requires the Apache Beam SDK for Java, version 2.10.0 or higher.

To use Streaming Engine for your streaming pipelines, specify the following parameter:

--enableStreamingEngine if you're using Apache Beam SDK for Java versions 2.11.0 or higher.

--experiments=enable_streaming_engine if you're using Apache Beam SDK for Java version 2.10.0.

If you use Dataflow Streaming Engine for your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Streaming Engine works best with smaller worker machine types, so we recommend that you set --workerMachineType=n1-standard-2. You can also set --diskSizeGb=30 because Streaming Engine only needs space for the worker boot image and local logs. These values are the default values.
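Putting these flags together, a streaming job that uses Streaming Engine might be launched with options such as the following (the project, region, and machine settings are illustrative):

--runner=DataflowRunner --project=my-project --region=us-central1 \
--enableStreamingEngine --workerMachineType=n1-standard-2 --diskSizeGb=30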

Dataflow Shuffle


Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. Currently, Dataflow uses a shuffle implementation that runs entirely on worker virtual machines and consumes worker CPU, memory, and Persistent Disk storage. The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.

Benefits of Dataflow Shuffle

The service-based Dataflow Shuffle has the following benefits:

Faster execution time of batch pipelines for the majority of pipeline job types.

A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.

Better autoscaling (/dataflow/service/dataflow-service-desc#autoscaling) since VMs no longer hold any shuffle data and can therefore be scaled down earlier.

Better fault tolerance; an unhealthy VM holding Dataflow Shuffle data will not cause the entire job to fail, as would happen if not using the feature.

Most of the reduction in worker resources comes from offloading the shuffle work to the Dataflow service. For that reason, there is a charge (/dataflow/pricing) associated with the use of Dataflow Shuffle. However, the total bill for Dataflow pipelines using the service-based Dataflow Shuffle implementation is expected to be less than or equal to the cost of Dataflow pipelines that do not use this option.

For the majority of pipeline job types, Dataflow Shuffle is expected to execute faster than the shuffle implementation running on worker VMs. However, the execution times might vary from run to run. If you are running a pipeline that has important deadlines, we recommend allocating sufficient buffer time before the deadline. In addition, consider requesting a bigger quota (/dataflow/quotas#quota-increase) for Shuffle.

Disk considerations

When using the service-based Dataflow Shuffle feature, you do not need to attach large Persistent Disks to your worker VMs. Dataflow automatically attaches a small 25 GB boot disk.


However, due to this small disk size, there are important considerations to be aware of when using Dataflow Shuffle:

A worker VM uses part of the 25 GB of disk space for the operating system, binaries, logs, and containers. Jobs that use a significant amount of disk and exceed the remaining disk capacity may fail when you use Dataflow Shuffle.

Jobs that use a lot of disk I/O may be slow due to the performance of the small disk. For more information about performance differences between disk sizes, see the Compute Engine Persistent Disk Performance (/compute/docs/disks/performance) page.

If any of these considerations apply to your job, you can use pipeline options (/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options) to specify a larger disk size.

Using Dataflow Shuffle

Service-based Dataflow Shuffle is currently available in the following regions:

us-west1 (Oregon)

us-central1 (Iowa)

us-east1 (South Carolina)

us-east4 (North Virginia)

northamerica-northeast1 (Montréal)

europe-west2 (London)

europe-west1 (Belgium)

europe-west4 (Netherlands)

europe-west3 (Frankfurt)

asia-southeast1 (Singapore)

asia-east1 (Taiwan)

asia-northeast1 (Tokyo)

australia-southeast1 (Sydney)


Dataflow Shuffle will become available in additional regions in the future.

Performance differences in the asia-northeast1 (Tokyo) region: We recommend using Dataflow Shuffle with large datasets (greater than 1 TB) when you run pipelines in the asia-northeast1 (Tokyo) region. Using Shuffle with smaller datasets in the asia-northeast1 (Tokyo) region does not give you the same performance advantages as Shuffle in other regions.

Java: SDK 2.x

To use the service-based Dataflow Shuffle in your batch pipelines, specify the following parameter: --experiments=shuffle_mode=service

If you use Dataflow Shuffle for your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Shuffle is currently available. Dataflow autoselects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Dataflow Flexible Resource Scheduling

Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques (/dataflow/docs/guides/flexrs#delayed_scheduling), the Dataflow Shuffle (/dataflow/docs/guides/deploying-a-pipeline#cloud-dataflow-shuffle) service, and a combination of preemptible virtual machine (VM) instances (/compute/docs/instances/preemptible) and regular VMs. By running preemptible VMs and regular VMs in parallel, Dataflow improves the user experience if Compute Engine stops preemptible VM instances during a system event. FlexRS helps to ensure that the pipeline continues to make progress and that you do not lose previous work when Compute Engine preempts (/compute/docs/instances/preemptible#what_is_a_preemptible_instance) your preemptible VMs. For more information about FlexRS, see Using Flexible Resource Scheduling in Dataflow (/dataflow/docs/guides/flexrs).

Dataflow Runner v2

The current production Dataflow runner utilizes language-specific workers when running Apache Beam pipelines. To improve scalability, generality, extensibility, and efficiency, the Dataflow runner is moving to a more services-based architecture. These changes include a more efficient and portable worker architecture packaged together with the Shuffle Service and Streaming Engine.

The new Dataflow runner, Dataflow Runner v2, is available for Python streaming pipelines. You are encouraged to try out Dataflow Runner v2 with your current workload before it is enabled by default on all new pipelines. You do not have to make any changes to your pipeline code to take advantage of this new architecture.

Dataflow Runner v2 requires the Apache Beam SDK for Python, version 2.21.0 or higher.

Benefits of using Dataflow Runner v2

Starting with Python streaming pipelines, new features will be available on Dataflow Runner v2 only. In addition, the improved efficiency of the Dataflow Runner v2 architecture could lead to performance improvements in your Dataflow jobs.

While using Dataflow Runner v2, you might notice a reduction in your bill. The billing model for Dataflow Runner v2 is not final yet, so your bill might increase back to near current levels as the new runner is enabled across all pipelines.

Using Dataflow Runner v2

Dataflow Runner v2 is available in regions that have Dataflow regional endpoints (/dataflow/docs/concepts/regional-endpoints).

Java: SDK 2.x

Dataflow Runner v2 is not available for Java at this time.

Debugging Dataflow Runner v2 jobs

To debug jobs using Dataflow Runner v2, you should follow standard debugging steps (/dataflow/docs/guides/troubleshooting-your-pipeline); however, be aware of the following when using Dataflow Runner v2:

Dataflow Runner v2 jobs run two types of processes on the worker VM: the SDK process and the runner harness process. Depending on the pipeline and VM type, there might be one or more SDK processes, but there is only one runner harness process per VM.

SDK processes run user code and other language-specific functions, while the runner harness process manages everything else.

The runner harness process waits for all SDK processes to connect to it before starting to request work from Dataflow.

Jobs might be delayed if the worker VM downloads and installs dependencies during the SDK process startup. If an SDK process has issues, such as problems starting up or installing libraries, the worker reports its status as unhealthy.

Worker VM logs, available through the Logs Viewer (/logging/docs/view/logs-viewer-interface) or the Dataflow monitoring interface (/dataflow/docs/guides/using-monitoring-intf), include logs from the runner harness process as well as logs from the SDK processes.

To diagnose problems in your user code, examine the worker logs from the SDK processes. If you find any errors in the runner harness logs, please contact Support (https://console.cloud.google.com/support) to file a bug.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies (https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2020-08-19 UTC.
