Leveraging Beam's Batch-Mode for Robust Recoveries and Late-Data Processing of Streaming Pipelines
Devon Peticolas - Oden Technologies
I regret how long this talk’s name is...
Devon Peticolas, Principal Engineer
Jie Zhang, Sr. Data Engineer
In This Talk
● Why does Oden need batch recoveries for our streaming jobs?
● How we make our streaming jobs also run in batch mode.
● How we orchestrate our streaming pipeline to run in batch-mode daily.
A little about Oden
Oden’s Customers
Medium to large manufacturers in plastics extrusion, injection molding, pipes, chemicals, and paper and pulp.
Process and Quality Engineers looking to centralize, analyze, and act on their data.
Plant managers who are looking to optimize logistics, output, and cost.
Interactive Time-series Analysis
● Compare performance across different equipment.
● Visualize hourly uptime and key custom metrics.
● Calculations for analyzing and optimizing factory performance.
Real Time Manufacturing Data
● Streaming second-by-second metrics.
● Interactive app that prompts on production state changes and collects user input.
Reporting and Alerting
● Daily summaries on key process metrics from continuous intervals of production work.
● Real-time email and text alerts on target violations and concerning trends.
How we use (and misuse) Apache Beam
How Oden Uses Apache Beam - Reading Events
[Diagram: PLC -> Raw Events]
How Oden Uses Apache Beam - Acquisition
[Diagram: PLC -> Raw Events -> Acquisition -> Metrics-A]
How Oden Uses Apache Beam - Calculated Metrics
User Has: diameter-x and diameter-y
User Wants: avg-diameter = (diameter-x + diameter-y) / 2
● Metrics need to be computed in real-time.
● Components can be read from different devices with different clocks.
● Formulas are stored in postgres.
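A minimal sketch of the avg-diameter calculation, assuming components are aligned by flooring their timestamps into fixed windows. The 1-second window size and the `avg_diameter` function are illustrative assumptions, not Oden's implementation. A window only produces a value when both components are present, which is one way readings from devices with slightly different clocks can still be combined.

```python
from collections import defaultdict


def avg_diameter(readings, window_s=1):
    """readings: iterable of (metric_name, epoch_seconds, value).

    Returns {window_start: average} for windows that contain both components.
    """
    windows = defaultdict(dict)
    for name, ts, value in readings:
        window_start = int(ts // window_s) * window_s
        # Keep the last value seen for each component in the window.
        windows[window_start][name] = value

    out = {}
    for start, comps in sorted(windows.items()):
        if "diameter-x" in comps and "diameter-y" in comps:
            out[start] = (comps["diameter-x"] + comps["diameter-y"]) / 2
    return out


readings = [
    ("diameter-x", 100.2, 4.0),
    ("diameter-y", 100.7, 6.0),  # different clock, same 1-second window
    ("diameter-x", 101.1, 5.0),  # no matching diameter-y in this window
]
result = avg_diameter(readings)
```

Here only the window starting at 100 emits a value, because the window starting at 101 is missing diameter-y.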
[Diagram: PLC -> Raw Events -> Acquisition -> Metrics-A -> Calculated Metrics and Windowed Calculated Metrics -> Metrics-B]
How Oden Uses Apache Beam - State Change Detection
[Diagram: the same pipeline, with State Change Detection added and writing to Postgres]
How Oden Uses Apache Beam - Rollups
Raw Metrics t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19
First Rollups rollup [t0, t2) rollup [t2, t4) rollup [t4, t6) rollup [t6, t8) rollup [t8, t10) rollup [t10, t12) rollup [t12, t14) rollup [t14, t16) rollup [t16, t18) rollup [t18, t20)
Second Rollups rollup [t0, t4) rollup [t4, t8) rollup [t8, t12) rollup [t12, t16) rollup [t16, t20)
SELECT [t6, t17)
● Creates special “rollup” metrics before writing to TSDB
● Optimizes aggregations, e.g. for SELECT [t6, t17):
○ sum([t6, t17)) = sum([t6, t8)) + sum([t8, t12)) + sum([t12, t16)) + val(t16)
○ count([t6, t17)) = count([t6, t8)) + count([t8, t12)) + count([t12, t16)) + 1
○ mean([t6, t17)) = sum([t6, t17)) / count([t6, t17))
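The decomposition above can be sketched as a greedy walk over the rollup tiers. This is an illustration only: the `build_rollups` and `query` helpers, and the choice of integer sample values, are assumptions made for the example rather than the production logic.

```python
def build_rollups(values, size):
    """Precompute (sum, count) for aligned half-open windows [start, start + size)."""
    return {
        start: (sum(values[start:start + size]), len(values[start:start + size]))
        for start in range(0, len(values), size)
    }


def query(values, tiers, start, end):
    """Aggregate [start, end) using the largest aligned rollup that fits at each step."""
    total, count, t = 0, 0, start
    while t < end:
        for size in sorted(tiers, reverse=True):
            if t % size == 0 and t + size <= end:
                s, c = tiers[size][t]
                total, count, t = total + s, count + c, t + size
                break
        else:
            # No rollup fits here, so fall back to the raw point.
            total, count, t = total + values[t], count + 1, t + 1
    return total, count, total / count


values = list(range(20))  # t0..t19, where the value at ti is i
tiers = {2: build_rollups(values, 2), 4: build_rollups(values, 4)}
total, count, mean = query(values, tiers, 6, 17)
```

For SELECT [t6, t17) the walk reads [t6, t8), [t8, t12), [t12, t16), and the raw point t16: four reads instead of eleven.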
[Diagram: PLC -> Raw Events -> Acquisition -> Metrics-A -> Calculated Metrics and Windowed Calculated Metrics -> Metrics-B -> Rollups -> Metrics-C -> TSDB, with State Change Detection writing to Postgres]
How Oden Uses Apache Beam - In Summary
[Diagram: the complete pipeline, as on the previous slide]
How Oden Uses Apache Beam - In Summary (Reality)
[Diagram: the same pipeline with extra jobs and sinks (a new TSDB, BigQuery, Rollups (600s), a (60s) job, and several copies of Windowed Calculated Metrics), annotated with these callouts:]
● Google-provided Dataflow Template that DROPS DATA when redeployed
● Also BigQuery because we don’t trust our DS team with either TSDB
● We built this to close a deal and now I’m not sure if we can kill it
● Every time I remember this exists I get sad
● Probably a waste of money
● Definitely a waste of money
● Flaming Rainbow Bridge to Asgard
● API that uses regex as a language parser with terrifying effectiveness
How Oden Uses Apache Beam - Common Trends
● Downstream writes are idempotent.
● LOTS of windowing keyed by the metric.
● Real-time processing is needed for users to make real-time decisions.
Connectivity for factories is hard
How Oden Uses Apache Beam - Disconnects
[Diagram: the streaming pipeline. Callout: Undelivered Data Backs Up]
[Diagram: the streaming pipeline. Callout: Downstream jobs are flooded with late events]
[Diagram: multiple PLCs feeding the streaming pipeline. Callout: One bad apple ruins the bunch]
Attempts at Late Data Handling - Complex Triggering Logic
/* The Window described attempts to both be prompt but not needlessly retrigger. It's designed to account for the
 * following cases...
 *
 * - All data is coming on-time. The watermark at any given time is roughly the current time.
 * - Data is being backfilled from some subset of metrics and the watermark is ahead of the event time of the windows
 *   for those metrics.
 * - Data is being backfilled for some subset of metrics but the watermark has been stuck to be earlier than
 *   event time for most metrics.
 * - Any of cases 1, 2, or 3 but where late data has arrived due to some uncontrollable situation (i.e. a single
 *   metric for a pane gets stuck in pubsub for days and then is released).
● Trigger normally with watermark when things are on-time.
public class GroupByCalcMetricIDandTimestampDoFn extends DoFn
2019-10-18 Pager Duty Incident: 2586
How Oden Uses Apache Beam - It all fell apart
THERE MUST BE A BETTER WAY
What do our users need?
[Chart: access frequency vs. age of data, from new to old. New data is used for real-time decision making; old data is used for historical analysis.]
[Chart: the same axes with “Very Late Data” marked, contrasting handling late data in real-time with handling late data later.]
How Oden Uses Apache Beam - Late Events Capture
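A minimal sketch of the idea behind late-events capture, assuming lateness is measured as processing time minus event time and capped by a fixed threshold. The one-hour cap, the `route` function, and the two output lists are hypothetical stand-ins; on the slides the late branch is a Late-Events topic that a PubSub To GCS job writes to GCS.

```python
MAX_LATENESS_S = 3600  # assumed cap; the talk does not state the real value


def route(events, now_s, max_lateness_s=MAX_LATENESS_S):
    """Split (metric, event_time_s, value) tuples into on-time and late lists."""
    on_time, late = [], []
    for event in events:
        _, event_time_s, _ = event
        if now_s - event_time_s <= max_lateness_s:
            on_time.append(event)  # stays in the real-time pipeline
        else:
            late.append(event)     # diverted for later batch processing
    return on_time, late


now = 1_626_307_200
events = [
    ("diameter-x", now - 5, 4.0),           # arrived promptly
    ("diameter-y", now - 2 * 86_400, 6.0),  # stuck for two days
]
on_time, late = route(events, now)
```

Capping lateness this way keeps one slow device from forcing every downstream window to stay open.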
[Diagram: the streaming pipeline, with a Late-Events topic read by a PubSub To GCS job that writes to GCS]
How Oden Uses Apache Beam - Disconnects
[Diagram: multiple PLCs feeding the pipeline, with late events captured in GCS. Callout: “One bad apple ruins the bunch” is replaced by “One bad apple is OK”]
How Oden Uses Apache Beam - Recovery Scripts
● Create a bunch of temporary topics and subscriptions.
● Deploy “recovery” versions of the jobs listening to these topics.
● [...] GCS and into PubSub topics
● Well orchestrated but very manual process.
# Windowed and Calc Metrics Recovery
#
# This script will attempt to recover calculated metrics from raw metric data.
# The dataflow jobs it will setup will look like this:
# A: BigQuery -> TOPIC1
#    Gets raw metrics from BQ and dumps them into TOPIC1.
# B: TOPIC1.a -> calcmetrics -> TOPIC2
# C: TOPIC1.b-$res -> windowed metrics with resolution $res -> TOPIC2
# D: TOPIC2.a -> pipe -> TOPIC1
#    C runs calcmetrics on TOPIC1, which contains the input metrics, and outputs
#    to TOPIC2. Consumer A of TOPIC2 will pipe back into TOPIC1 for dependent
#    calculated metrics.
# E: TOPIC2.b -> pipe -> OUTPUT_TOPIC
# F: TOPIC2.c -> rollup60 -> OUTPUT_TOPIC
# G: TOPIC2.d -> rollup600 -> OUTPUT_TOPIC
# H: TOPIC2.e -> bigquery_ingest
#
# Other env vars:
# BIGQUERY_QUERY you can set this to a query to override the default behavior of forming a
#                query based on a start and end time from metrics.metrics_it
# SKIP_CALCMETRICS set this to any value to skip creating the calcmetrics job. This has the
#                effect of running only the windowed metrics and rollup jobs on the input data
[Script, condensed: requires the TOKEN, STARTTIME, and ENDTIME env vars. setup_pubsub creates the temporary topics and subscriptions with --expiration-period=2d. run_bigquery_ingest and run_recovery_pipeline launch jobs A-H with mvn and gcloud, using n2-standard-8 workers, --maxNumWorkers=10, and --allowedLateness=31556952. cleanup_bigquery_ingest cancels the inputpipe job, cleanup_recovery_pipeline drains the other jobs, and cleanup_pubsub deletes the topics and subscriptions.]
full_recovery() {
  set -x
  setup_pubsub
  run_bigquery_ingest
  run_recovery_pipeline
  set +x
  echo "Pipelines setup, when $TOPIC2-d has 0 unacked messages, press enter!"
  echo "https://console.cloud.google.com/cloudpubsub/subscription/detail/$TOPIC2-d?project=$PROJECT"
  read -p "Press enter to continue to cleanup"
  cleanup_bigquery_ingest
  cleanup_recovery_pipeline
  cleanup_pubsub
}

CMD=$1
case $CMD in
  ingest)
    echo "Only running bigquery-ingest job."
    run_bigquery_ingest
    echo "Ingestion job has been setup. When it no longer shows any throughput, press enter to"
    echo "clean it up."
    read -p "Press enter to continue to cleanup..."
    cleanup_bigquery_ingest
    ;;
  *)
    full_recovery
    ;;
esac
Batch-Mode is Better
Batch vs Streaming Jobs In Beam
Streaming
● “Unbounded” PCollections
● Generally talks to real-time queues
● Processed piece-by-piece
● More wasted worker overhead
Batch
● “Bounded” PCollections
● Generally talks to files / databases
● Processed in stages
● Workers run until they’re done
● A little cheaper *
Batch vs Streaming Sources/Sinks
Streaming
● PubSubIO
● KafkaIO
● TextIO*
● AvroIO*
● GenerateSequence*
Batch
● JdbcIO
● BigQueryIO
● TextIO
● AvroIO
● GenerateSequence
Batch vs Streaming Jobs In Beam
[Diagram: Streaming: Topic A -> Streaming Job -> Topic B. Batch: GCS A -> Batch Job -> GCS B.]
Streaming Jobs - Simple
[Diagram: Source: PubSubIO (Read) -> Transform -> Sink: PubSubIO (Write)]
Batch Jobs - Simple
[Diagram: Source: TextIO (Read) -> Transform -> Sink: TextIO (Write)]
Paths shown on the diagram:
gs://dataflow.oden-qa.io/metric-a/2021-07-15/metrics-*.ndjson
gs://dataflow.oden-qa.io/metric-b/2021-07-15/
Callout: When the late metrics arrived
Streaming Jobs - Complex
[Diagram: side-input path: Source: Generate Sequence (5 min) -> Map Elements (Internal Config API) -> Config PCollection -> View -> Side-input. Main path: Source: PubSubIO (Read) -> Transform -> Sink: PubSubIO (Write).]
Side-input source: GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5L))
Batch Jobs - Complex
[Diagram: the same shape, with Generate Sequence (1 element) on the side-input path and TextIO (Read) -> Transform -> TextIO (Write) on the main path.]
Side-input source: GenerateSequence.from(0).to(1)
Running Streaming Jobs in Batch - With if-statements
Pipeline pipeline = Pipeline.create(options);
// Compare strings with equals(); == compares object references in Java.
if ("PUBSUB".equals(options.getConsumerMode())) {
  pipeline.apply(
    "ReadFromPubsub",
● The “core” transform is abstracted into a generic PTransform
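The if-statement approach can be sketched without Beam as a mode flag that selects the source while the core transform stays the same. This is a plain-Python stand-in, not the Beam API: the `FILE` mode name, the in-memory queue, and the doubling transform are all hypothetical.

```python
import json
from pathlib import Path


def core_transform(metrics):
    """The mode-agnostic part of the job (placeholder logic: double every value)."""
    return [{**m, "value": m["value"] * 2} for m in metrics]


def read_source(mode, location=None, queue=None):
    """Pick the source from a mode flag, mirroring the if-statement approach."""
    if mode == "PUBSUB":
        # Stand-in for a streaming source: drain an in-memory queue.
        return list(queue or [])
    if mode == "FILE":
        # Stand-in for a batch source: newline-delimited JSON files.
        return [
            json.loads(line)
            for path in sorted(Path(location).glob("metrics-*.ndjson"))
            for line in path.read_text().splitlines()
            if line.strip()
        ]
    raise ValueError(f"unknown consumer mode: {mode}")


def run(mode, location=None, queue=None):
    # Only the read step differs between modes; the transform is shared.
    return core_transform(read_source(mode, location, queue))


streamed = run("PUBSUB", queue=[{"metric": "diameter-x", "value": 2.0}])
```

Because `core_transform` never sees where its input came from, the same logic can be exercised locally against files and deployed against a queue.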
Running Streaming Jobs in Batch - With EventIO
// public interface RollupOptions extends EventIOOptions
Pipeline pipeline = Pipeline.create(options);
pipeline
  .apply(
    "ReadPlainMetricsFromSource",
    EventIO.
● All source/sink handling is moved to shared transforms we call EventIO.
● Makes developing locally easy!
● Gets complicated for multi-source/sink jobs and for managing template builds.
Automating Batch Recoveries with Airflow
Airflow - Overview
● Scheduler, orchestrator, and monitor of DAGs of tasks.
● DAGs and dependency behavior are defined in python.
● Built-in “Operators” which let you easily build tasks for popular services.
● Expressive API for common patterns such as backfilling, short-circuiting, and timezone management for ETL.
Airflow - Running Batch Dataflow Jobs
# Import paths assume Airflow 1.10.x.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG("hello_world_v1", schedule_interval="0 * * * *")
task_1 = PythonOperator(
    task_id="python_hello",
    python_callable=lambda: print("hello"),
    dag=dag,
)
task_2 = BashOperator(
    task_id="bash_world",
    bash_command="echo world",
)
task_1 >> task_2
[DAG graph: python_hello -> bash_world]
● Batch dataflow jobs are built as templates and launched from Airflow DAG tasks.
● The DAG structure mirrors the DAG of streaming dataflow jobs.
● GCS buckets are used as intermediaries.
# Assumes the Google provider package (or its 1.10 backport) is installed.
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

dag = DAG("late_data_v1", schedule_interval="30 0 * * *")
LATE_METRICS = "gs://dataflow.oden-qa.io/metric-a/{{ds}}/"
CALC_METRICS = "gs://dataflow.oden-qa.io/metric-b/{{ds}}/"
RLUP_METRICS = "gs://dataflow.oden-qa.io/metric-c/{{ds}}/"
calculated_metrics_df_job = DataflowTemplatedJobStartOperator(
    task_id="calculated-metrics-df-job",
    template="gs://oden-dataflow-templates/latest/batch_calculated_metrics",
    parameters={
        "source": LATE_METRICS,
        "sink": CALC_METRICS + "metrics-*.ndjson",
    },
    dag=dag,
)
rollup_metrics_df_job = DataflowTemplatedJobStartOperator(
    task_id="rollup-metrics-df-job",
    template="gs://oden-dataflow-templates/latest/batch_rollup_metrics",
    parameters={
        "source": CALC_METRICS,
        "sink": RLUP_METRICS + "metrics-*.ndjson",
    },
    dag=dag,
)
calculated_metrics_df_job >> rollup_metrics_df_job
[DAG graph: calculated-metrics-df-job -> rollup-metrics-df-job]
Late Data Airflow DAG - GCS Wildcards
Callout on {{ds}}: Airflow “execution_date” macro
LATE_METRICS = "gs://dataflow.oden-qa.io/metric-a/{{ds}}/metrics-*.ndjson"
CALC_METRICS = "gs://dataflow.oden-qa.io/metric-b/{{ds}}/metrics-*.ndjson"
RLUP_METRICS = "gs://dataflow.oden-qa.io/metric-c/{{ds}}/metrics-*.ndjson"
ALL_METRICS = "gs://dataflow.oden-qa.io/metric-[abc]/{{ds}}/metrics-*.ndjson"
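To illustrate how the date macro and the wildcards combine, here is a small sketch that renders the pattern for one run date and checks object paths against it. It uses Python's `fnmatch` as an approximation of GCS matching, and the file name `metrics-0001.ndjson` is a hypothetical example.

```python
from datetime import date
from fnmatch import fnmatchcase

TEMPLATE = "gs://dataflow.oden-qa.io/metric-[abc]/{ds}/metrics-*.ndjson"


def pattern_for(execution_date):
    """Render the pattern for one run, substituting the date as YYYY-MM-DD."""
    return TEMPLATE.format(ds=execution_date.isoformat())


def matches(object_path, execution_date):
    # fnmatch is only an approximation: its "*" also matches "/",
    # whereas a GCS "*" stops at path separators.
    return fnmatchcase(object_path, pattern_for(execution_date))


day = date(2021, 7, 15)
hit = matches("gs://dataflow.oden-qa.io/metric-b/2021-07-15/metrics-0001.ndjson", day)
miss = matches("gs://dataflow.oden-qa.io/metric-d/2021-07-15/metrics-0001.ndjson", day)
```

The first path matches because `b` is in the character class `[abc]`; the second does not because `d` is outside it.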
Callout on metric-[abc]: GCS wildcard
Streaming Pipeline
[Diagram: Raw Events -> Acquisition -> Metrics-A -> Calculated Metrics and Windowed Calculated Metrics -> Metrics-B -> Rollups -> Metrics-C -> TSDB, with State Change Detection writing to Postgres]
Late Data Airflow DAG
[Diagram: the same stages, starting from Late Events and ending with a ToTSDB step that writes Metrics-A, Metrics-B, and Metrics-C to TSDB]
Late Data Airflow DAG - Detecting Late Data in GCS
# Only works in >=1.10.15
from airflow.contrib.sensors.gcs_sensor import GoogleCloudStoragePrefixSensor

check_for_late_data = GoogleCloudStoragePrefixSensor(
    dag=dag,
    task_id="check-for-late-data",
    soft_fail=True,
    timeout=60 * 2,
    # bucket takes the bucket name only; the object path goes in prefix.
    bucket="dataflow.oden-qa.io",
    prefix="metric-a/{{ds}}/",
)
...
check_for_late_data >> ...
[DAG graph: check-for-late-data -> rollup-metrics-df-job]
Soft-fail causes “skipped” downstream behavior when there is no late data for that day.
Late Data Airflow DAG - Putting it all together
schedule_interval="30 0 * * *"
[DAG graph with tasks: check-for-late-data, acquisition-df-job, calculated-metrics-df-job, windowed-metrics-df-job, state-change-events-df-job, rollup-metrics-df-job, gcs-to-tsdb]
Problem Recap
● Oden uses Beam to process metrics for real-time and historical use-cases.
● The real-time processing was broken by factory-specific network partitions.
● This was partly solved by complex windowing, triggering, and Beam state.
● Ultimately, we decided to cap “lateness” and push late data into GCS to preserve our real-time applications.
Solution Recap
● All of our dataflow jobs can run in batch or streaming mode.
● We only use streaming mode for recent data that our users need ASAP.
● Late data is handled nightly with batch jobs orchestrated by an Airflow DAG.
● Streaming jobs now run on a smaller deployment, and autoscaling happens less frequently.
What’s next?
● Using the late-data Airflow DAG to backfill old customer data.
● Using the late-data Airflow DAG for “low-priority” metrics to save money.
● Replacing the late-data “fork” with continuously written avro-files.
● The “After-Party Cannon” for continuous testing in our QA environment.
Thank You
Let’s talk about beam! [email protected]
We’re hiring! oden.io/careers