DataWorks Summit Berlin 2018
Breathing New Life into Apache Oozie with Workflow Manager
April 19, 2018

Artem Ervits – Hortonworks
Clay Baenziger – Bloomberg

© 2018 Bloomberg Finance L.P. All rights reserved.

Poll:

• Who here uses Oozie? — In production? With Kerberos? — Do you use Hue with Oozie? — How many workflows do you have in production? 1-10? 10-50? 50+? — How many actions does the largest workflow contain? 1-10? 10-50? 50+? — Do you use (or want to use) Oozie with HBase? Spark? Python? Deployment automation?

• Do you like XML? — Do you have a favorite editor for Oozie workflows?

Open Source Workflow Managers

(Incubating) • Luigi by Spotify • Azkaban by LinkedIn • (And of course) Apache Oozie

Introduction to Oozie

• Oozie is a workflow scheduler system to manage jobs. • Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions. • Oozie coordinator jobs are recurrent Oozie workflow jobs triggered by time and data availability. • Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs as well as system specific jobs out of the box. • Oozie is a scalable, reliable and extensible system. - Paraphrased from http://oozie.apache.org

Actions: • Map/Reduce • Java • E-Mail • Hive • Shell • Decision • Pig • Spark • Fork • HDFS • Sub-Workflow • Join
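To make the "DAG of actions" idea concrete, here is a minimal sketch that assembles a small workflow definition with a fork, two HDFS (fs) actions, and a join, then checks that every transition points at a node that exists. The schema version, the workflow name, the node names, and the `/tmp/demo` paths are assumptions chosen for illustration; they do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; pick the one your Oozie release supports.
WF_NS = "uri:oozie:workflow:0.5"
ET.register_namespace("", WF_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the workflow namespace."""
    return f"{{{WF_NS}}}{tag}"


def build_workflow(name: str) -> ET.Element:
    """Build start -> fork -> (step-a, step-b) -> join -> end, with a kill node."""
    branches = ("step-a", "step-b")  # hypothetical action names
    root = ET.Element(q("workflow-app"), {"name": name})
    ET.SubElement(root, q("start"), {"to": "fan-out"})

    fork = ET.SubElement(root, q("fork"), {"name": "fan-out"})
    for branch in branches:
        ET.SubElement(fork, q("path"), {"start": branch})

    for branch in branches:
        action = ET.SubElement(root, q("action"), {"name": branch})
        fs = ET.SubElement(action, q("fs"))  # the HDFS action
        ET.SubElement(fs, q("mkdir"), {"path": "${nameNode}/tmp/demo/" + branch})
        ET.SubElement(action, q("ok"), {"to": "fan-in"})
        ET.SubElement(action, q("error"), {"to": "fail"})

    ET.SubElement(root, q("join"), {"name": "fan-in", "to": "end"})
    kill = ET.SubElement(root, q("kill"), {"name": "fail"})
    ET.SubElement(kill, q("message")).text = "Workflow failed"
    ET.SubElement(root, q("end"), {"name": "end"})
    return root


def dangling_transitions(root: ET.Element) -> set:
    """Return transition targets that do not match any named node."""
    names = {node.get("name") for node in root if node.get("name")}
    targets = set()
    for element in root.iter():
        for attr in ("to", "start"):
            if element.get(attr):
                targets.add(element.get(attr))
    return targets - names


if __name__ == "__main__":
    workflow = build_workflow("demo-wf")
    print(ET.tostring(workflow, encoding="unicode"))
    print("dangling transitions:", dangling_transitions(workflow))
```

The check is only a local sanity test on node names; it does not replace validation by the Oozie server against the workflow schema.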

Oozie Release Timeline

• 1.x released in 2010. Yahoo! project with two GitHub releases. Added support for workflow jobs.

• 2.x released in 2011. Still with Yahoo! with nine GitHub releases. Added support for coordinator jobs.

• 3.x released in 2013. Project under Apache. Added support for bundle jobs and HBase credentials.

• 4.x released in 2014. Added support for Hive/HCatalog, Spark integration and Oozie server high availability.

• 5.0 released April 2018. Removes support for Hadoop 1, adds support for Hadoop 3, YARN AM instead of MR launcher, new actions, code cleanup.

- Adapted from: Apache Oozie by Mohammad Kamrul Islam and Aravind Srinivasan

Oozie Complaints

• Launcher jobs as map tasks • Dated UI • Confusing object model – workflows, coordinators, bundles • Complicated setup • XML • DAG visualization • SLA alerting • Fine grained authorization • Easy access to log files

Oozie Complaints Improvements

• Launcher jobs as map tasks – solved by Oozie 5.0.0, OOZIE-1770 • Dated UI – OOZIE-2683, targeted for Oozie 5.X (Hue and Workflow Manager today) • Confusing object model – jobs API, patch available, targeted for 5.X, OOZIE-2339 • Complicated setup – can deploy with embedded Jetty in Oozie 5.0.0, OOZIE-2666 • XML – fluent job API, patch available, targeted for 5.X, OOZIE-2339 • DAG visualization – solved by Oozie 5.0.0, OOZIE-2406 • SLA alerting – since Oozie 4.0.0, OOZIE-1294 • Fine grained authorization – targeted for Oozie 5.X, OOZIE-3196 • Easy access to log files – solved by Oozie 5.0.0, OOZIE-2296

Oozie Launcher – Prior to Release 5.0

• MR launcher job

Oozie Launcher – Release 5.0

• OYA: OOZIE-1770: Create Oozie Application Master for YARN — Removes MR launcher job

• Design Doc

Oozie Documentation – Before Release 5.0 and After

Documentation redesign

OOZIE-3163: Improve documentation rendering: use fluido skin and better config

Oozie Workflow Visualization – Prior to 5.0 and After

Prior to 5.0: JUNG • 5.0 and after: GraphViz

OOZIE-2406: Completely rewrite Graph Generator code

Oozie Fluent Job API – Apache Oozie 5.X (Preview)

OOZIE-2339: Provide an API for writing jobs based on the XSD schemas

Apache Ambari

Ambari Provides: • Provisioning of a Hadoop Cluster

• Management of a Hadoop Cluster

• Monitoring of a Hadoop Cluster — A Metrics System for metrics collection — An Alert Framework — A dashboard for monitoring the Hadoop cluster

-Paraphrased from http://ambari.apache.org

Ambari Views

• Ambari Views “offer a systematic way to plug-in UI capabilities to surface custom visualization, management and monitoring features in Ambari Web. A "view" is a way of extending Ambari that allows 3rd parties to plug in new resource types along with the APIs, providers and UI to support them. In other words, a view is an application that is deployed into the Ambari container.”

• Key takeaways: — One does not need an Ambari-managed (administered) cluster — Third parties can build view packages to run in the Ambari framework too — Major views available: (YARN) Capacity Scheduler, (HDFS) Files, HAWQ, Hive, Pig, Storm, Tez, (YARN ATS) Jobs, (Oozie) Workflow Manager

• Alternatives: Cloudera Hue, bespoke applications

Workflow Manager – Motivation

• Oozie workflows are defined in XML – too verbose — Provide GUI workflow builder and editor — Reduce possibility of user-introduced errors — Provide browser-based workflow manager • Integration with File Browser — Includes S3 support — Can replace existing Oozie web UI • Oozie is hard-coded to display only 25 actions — WFM doesn’t have this limit; tested with 300+ action nodes • Oozie is scalable — Can scale WFM by standing up multiple Ambari Views servers

Workflow Manager – Workflow Editor Example

Workflow Manager: • Available as an Ambari View • Enables visual editing of Oozie workflows • Integrated with file browser • Reduces user input errors • Minimal input required

Workflow Manager – Execution View Example

• Integrated Dashboard with Workflow Manager View • Manage Oozie jobs • Drill down to logs

Workflow Manager – Workflow Design Component

Workflow Manager – Workflow Dashboard Component

Good Documentation: HDP 2.6 – Workflow Manager Basics

Workflow Manager Examples with HBase

• Setup Oozie – Server and Workflows • Data Definition – Tables, ACLs • Compactions – Operational

HBase – Setup

Oozie needs HBase Configuration:

• Oozie Server Code (to support HBase delegation tokens) — In libexec (see Server JARs list) — In oozie-site.xml: oozie.credentials.credentialclasses = hbase=org.apache.oozie.action.hadoop.HbaseCredentials,…

• Server JARs (copy the following to Oozie’s libexec): — hbase-common.jar — hbase-client.jar — hbase-server.jar — hbase-protocol.jar — hbase-hadoop2-compat.jar

• Client Workflow Code: — Add to workflow.xml: […] — All your normal HBase security settings in the credential section
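As a rough illustration of the two configuration pieces named on this slide, the sketch below generates the oozie-site.xml property and a workflow credentials section. The credential name `hbase-cred` and the sample security setting are assumptions, since the slide elides the actual workflow.xml content, and the fragments omit XML namespaces for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def add_properties(parent: ET.Element, settings: Dict[str, str]) -> None:
    """Append Hadoop-style <property><name/><value/></property> entries."""
    for name, value in settings.items():
        prop = ET.SubElement(parent, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value


def oozie_site_fragment(credential_classes: Dict[str, str]) -> ET.Element:
    """Server side: map credential type names to their implementing classes."""
    config = ET.Element("configuration")
    joined = ",".join(f"{key}={cls}" for key, cls in credential_classes.items())
    add_properties(config, {"oozie.credentials.credentialclasses": joined})
    return config


def workflow_credentials(name: str, settings: Dict[str, str]) -> ET.Element:
    """Client side: a credentials section whose type matches the server mapping key."""
    creds = ET.Element("credentials")
    cred = ET.SubElement(creds, "credential", {"name": name, "type": "hbase"})
    add_properties(cred, settings)
    return creds


if __name__ == "__main__":
    site = oozie_site_fragment(
        {"hbase": "org.apache.oozie.action.hadoop.HbaseCredentials"}
    )
    print(ET.tostring(site, encoding="unicode"))

    # Hypothetical values; use your cluster's real HBase security settings.
    creds = workflow_credentials(
        "hbase-cred", {"hbase.security.authentication": "kerberos"}
    )
    print(ET.tostring(creds, encoding="unicode"))
```

An action then opts in by referencing the credential name through its `cred` attribute, for example `cred="hbase-cred"` on the action element.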

HBase – Data Definition

HBase Shell:

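Shell action elements (workflow.xml):

    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>hbase</exec>
    <argument>shell</argument>
    <argument>-n</argument>
    <argument>create_my_table.rb</argument>

create_my_table.rb:

    # Keep only an exact match for the target table
    tables = list.select { |table| table.eql?('my_table') }
    # Create the table only if it does not exist yet
    if tables.empty?
      create 'my_table', {NAME => 'my_col'}
    end
    exit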

HBase – Compactions

HBASE-19528: Major Compaction Tool • Automatically scales compaction to selected number of servers • Requires read ability to /hbase

usage: MajorCompactor [-cf <arg>] [-dryRun] -servers <arg> -table <arg> [...]

Usage instructions
 -cf <arg>          column families: comma separated eg: a,b,c
 -dryRun            Dry run, will just output a list of regions that require compaction based on parameters passed
 -minModTime <arg>  Compact if store files have modification time < minModTime
 -servers <arg>     Concurrent servers compacting
 -table <arg>       table name
 ...
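The options above could be assembled into a command line for an Oozie shell action roughly as follows. This is a sketch, not the tool's own wrapper: the fully qualified class name and the `hbase <class>` launch form are assumptions to verify against your HBase release, and only the flags listed on the slide are used.

```python
import shlex
from typing import List, Optional, Sequence

# Assumed class name for the tool added by HBASE-19528; verify before use.
COMPACTOR_CLASS = "org.apache.hadoop.hbase.util.compaction.MajorCompactor"


def major_compactor_argv(
    table: str,
    servers: int,
    column_families: Optional[Sequence[str]] = None,
    min_mod_time: Optional[int] = None,
    dry_run: bool = False,
) -> List[str]:
    """Build an argument vector from the flags shown in the usage text."""
    if servers < 1:
        raise ValueError("servers must be at least 1")
    argv = ["hbase", COMPACTOR_CLASS, "-servers", str(servers), "-table", table]
    if column_families:
        # -cf takes a comma-separated list, e.g. a,b,c
        argv += ["-cf", ",".join(column_families)]
    if min_mod_time is not None:
        # Passed through unchanged; the slide does not state the unit.
        argv += ["-minModTime", str(min_mod_time)]
    if dry_run:
        argv.append("-dryRun")
    return argv


if __name__ == "__main__":
    argv = major_compactor_argv("my_table", 2, ["a", "b", "c"], dry_run=True)
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Starting with `dry_run=True` lists the regions that would be compacted before any work is scheduled, which is a low-risk way to check the parameters.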

More Resources

• Apache Oozie Mailing Lists: http://oozie.apache.org/mail-lists.html • Artem’s Oozie Resources: — 12-Part Series on WFM: http://bit.ly/2syKUIh — Oozie Examples: https://github.com/dbist/oozie-examples • Clay’s Past Oozie Presentations: — Code Deployment via Oozie: Apache BigData http://bit.ly/2sP2qbj — HBase Multi-Tenancy with Oozie: DataWorks Summit http://bit.ly/2rw7FIR

Demo!

Questions?
