Kettle Beam Matt Casters Neo4j Chief Solutions Architect, Kettle Project Founder, Hop Lead Architect

Updated: Nov 22th 2019 Topic

➢ Kettle essentials ➢ What is Apache Beam? ➢ What is the Kettle Beam project? ➢ How does it work? ➢ Which platforms are supported? ➢ What’s next? ➢ What does not work yet? ➢ How can I contribute? ➢ Q&A

2 Kettle Essentials

Define an ETL workload Run the ETL workload ➢ GUI (workstation/browser) ➢ Anywhere you have a JVM ➢ XML standard ➢ GUI, Local, Remote, … ➢ Java API (SDK) ➢ But also inside engines… ➢ ETL Metadata Injection ○ Kettle Clustering ○ Map/Reduce ○ Storm ○ Spark (EE)

3 Kettle Essentials

How does it work?

Define Kettle Configure Execute Metadata Options

Transformations Log levels Local Jobs Parameters Remote ... Clustered “Run Configuration”

Metadata driven execution on an engine

4 Apache Beam

➢ Originates from a paper called “The dataflow model” https://dl.acm.org/citation.cfm?doid=2824032.2824076

The data flow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing

5 Apache Beam

➢ Google released an open source SDK for it in 2014 → The Apache Beam Direct runner → Local execution ➢ Google released the SDK for their GCP DataFlow platform → The DataFlow runner → Execution on - DataFlow ➢ Spark and a bunch of other platforms followed quickly ➢ Lots of work getting done on right now

6 Apache Beam

FOSDEM 2019 Presentation on Apache Beam by Maximilian Michels

Article & Video of presentation Slide deck

→ Technical!!

7 Apache Beam

➢ 2018 top 3 Apache Project by commits ➢ 2018 top project by dev@ list activity

2019 Technology of the Year Award Winner

8 Apache Beam

Capabilities matrix:

→ https://beam.apache.org/documentation/runners/capability-matrix

Capabilities depend on the underlying “Engine” or “Runner”

9 Apache Beam

What results are being calculated?

10 Apache Beam

Where in event time?

11 Apache Beam

When in processing time?

12 Apache Beam

How do refinements of results relate?

13 Apache Beam

How does it work?

Define an Configure Execute on a Apache Beam Pipeline Runner Pipeline Options

Java / Python Specific options DirectRunner per Runner DataFlowRunner SparkRunner FlinkRunner ...

14 Kettle Beam: Why?

➢ Recognizes the architectural similarities ➢ Allows Kettle to focus on the needs, metadata, code ➢ Re-uses the abstraction layers ➢ Does away with the need to focus on one engine ➢ Recognizes that innovation happens at an incredible pace ➢ Apache Beam shields us from making bad bets.

15 Kettle Beam: Diff

➢ Any differences between Kettle and Beam? ➢ Kettle has strict separation of metadata and engine! ➢ But that made execution on Apache Beam easier!

Define Configure Kettle Execute Options Metadata Define an Configure Apache Execute on Pipeline Beam a Runner Options Pipeline

16 Kettle Beam: a brief history in time

➢ Started right after #PCM18 in Bologna ➢ The Beam spark was officially ignited by the world famous Kettle advocate and personal troller Diethard Steiner ➢ Originally I was supposed to show a “Hello, world!\n” ➢ But as these things go, things got out of hand… ➢ Many nights, days and weekends ➢ Haven’t had this much fun since I started Kettle

17 Kettle Beam: features

➢ Kettle transformations ➢ Beam File Definitions → No more cumbersome “split field” steps ➢ Beam Job Configurations → Handles all the Apache Beam Pipeline Options → Generic approach but only Direct / DataFlow / Spark for now ➢ Beam Run Configuration → Not possible with a Kettle plugin but Pentaho is working on it ➢ Execution in a job, Pan, Kitchen, Maitre, Slave server, ...

18 Kettle Beam: GUI specifics

➢ Pipeline metrics integration for Direct runner and DataFlow → Spark / Flink are harder, require some plumbing (demo later) ➢ Step status & pipeline metrics in Spoon!

19 Kettle Beam: supported operations

➢ Step-to-step data transfer ➢ Copy data to multiple outputs (NOT distribute) ➢ Combine data from multiple input steps ➢ Source data from particular (info) step(s) ➢ Target data to particular steps ➢ ...

20 Kettle Beam: supported steps

➢ All non-aggregating steps doing read / write ○ Calculator, Select Values, … ○ But also: Stream Lookup, Switch/Case, Filter Rows, ... ➢ Merge Join: converted into Beam API ➢ Memory Group By: converted into Beam API ➢ Ability to reduce parallelism: ○ Input steps: Since thread non-Beam input step ○ Output steps: Recognize BEAM_SINGLE as desire to run single threaded ➢ Batch up records: better performance in Kettle output steps 21 Kettle Beam: Extra I/O

➢ Beam Input / Output : Text file reading and writing ➢ Beam Kafka Consume / Produce ➢ Beam BigQuery Input / Output ➢ Beam GCP PubSub Subscribe / Publish

22 Kettle Beam: Streaming

➢ Apache Beam Terminology ○ Bounded data source: finite streaming ○ Unbounded data source: infinitely streaming ➢ “Streaming” Pipeline uses an unbounded data source ○ Kafka Consume, PubSub Subscribe, … ○ OR a bounded data source which is “timestamped” ➢ Enter the Beam Timestamp step ➢ Allows combining Bounded and Unbounded data ➢ Example: split a large file up into chunks of 10 minutes

23 Kettle Beam: Windowing

➢ A window allows you to consider parts of a data stream ○ Global window: everything ○ Fixed window: fixed period of time ○ Sliding window: fixed period, repeated at intervals ○ Session window: No more data after X minutes makes a window (min 5 minutes) ➢ Enter the Beam Window step ➢ Example: write Kafka data to a files

24 Kettle Beam: Demo!!!1!İ!

➢ The Kettle Beam Examples project ➢ Show the File Definitions ➢ Show the Beam Job Configs ➢ Show the Run Configurations ➢ Run unit tests in Kettle ➢ Execute on the Flink runner locally ➢ Execute on Dataflow (Google Cloud Dataflow) ➢ Execute on Spark (AWS with Flintrock) ➢ Execute on Flink in a VM

25 Kettle Beam: Next steps

➢ Release 1.0 ➢ Remote execution, metrics/status updates ○ For Spark/Flink ○ Using Carte plugins ○ Remote pipeline metrics → Carte ← Spoon ➢ Triggers?

26 Kettle Beam: Where is the light?

The Kettle Beam project code on GitHub → https://github.com/mattcasters/kettle-beam Diethard’s blog post on Kettle Beam → http://diethardsteiner.github.io/pdi/2018/12/01/Kettle-Beam.html My YouTube channel with status updates → https://www.youtube.com/mattcasters Download → http://www.kettle.be

This presentation: http://beam.kettle.be

27 Kettle Beam: step into the light!

➢ Try it yourself ➢ Any feedback is good feedback! ○ “This doesn’t work!” ○ Write documentation ○ Request features ➢ Spread the word about Kettle, Apache Beam and Kettle Beam ➢ Help out coding ○ Easier than you think ○ Ask for help.

28 Q&A

29 Kettle Beam Matt Casters Neo4j Chief Solutions Architect