Apache Oozie Apache Oozie Get a Solid Grounding in Apache Oozie, the Workflow Scheduler System for “In This Book, the Managing Hadoop Jobs
Total Page:16
File Type:pdf, Size:1020Kb
Apache Oozie Apache Oozie Apache Get a solid grounding in Apache Oozie, the workflow scheduler system for “In this book, the managing Hadoop jobs. In this hands-on guide, two experienced Hadoop authors have striven for practitioners walk you through the intricacies of this powerful and flexible platform, with numerous examples and real-world use cases. practicality, focusing on Once you set up your Oozie server, you’ll dive into techniques for writing the concepts, principles, and coordinating workflows, and learn how to write complex data pipelines. tips, and tricks that Advanced topics show you how to handle shared libraries in Oozie, as well developers need to get as how to implement and manage Oozie’s security capabilities. the most out of Oozie. ■ Install and confgure an Oozie server, and get an overview of A volume such as this is basic concepts long overdue. Developers ■ Journey through the world of writing and confguring will get a lot more out of workfows the Hadoop ecosystem ■ Learn how the Oozie coordinator schedules and executes by reading it.” workfows based on triggers —Raymie Stata ■ Understand how Oozie manages data dependencies CEO, Altiscale ■ Use Oozie bundles to package several coordinator apps into Oozie simplifies a data pipeline “ the managing and ■ Learn about security features and shared library management automating of complex ■ Implement custom extensions and write your own EL functions and actions Hadoop workloads. ■ Debug workfows and manage Oozie’s operational details This greatly benefits Apache both developers and Mohammad Kamrul Islam works as a Staff Software Engineer in the data operators alike.” engineering team at Uber. He’s been involved with the Hadoop ecosystem —Alejandro Abdelnur Islam & Srinivasan & Srinivasan Islam since 2009, and is a PMC member and a respected voice in the Oozie com- Creator of Apache Oozie munity. He has worked in the Hadoop teams at LinkedIn and Yahoo. Aravind Srinivasan is a Lead Application Architect at Altiscale, a Hadoop- as-a-service company, where he helps customers with Hadoop application O o z i e design and architecture. He has been involved with Hadoop in general and Oozie in particular since 2008. THE WORKFLOW SCHEDULER FOR HADOOP DATA Twitter: @oreillymedia facebook.com/oreilly US $39.99 CAN $45.99 ISBN: 978-1-449-36992-7 Mohammad Kamrul Islam & Aravind Srinivasan Apache Oozie Apache Oozie Apache Get a solid grounding in Apache Oozie, the workflow scheduler system for “In this book, the managing Hadoop jobs. In this hands-on guide, two experienced Hadoop authors have striven for practitioners walk you through the intricacies of this powerful and flexible platform, with numerous examples and real-world use cases. practicality, focusing on Once you set up your Oozie server, you’ll dive into techniques for writing the concepts, principles, and coordinating workflows, and learn how to write complex data pipelines. tips, and tricks that Advanced topics show you how to handle shared libraries in Oozie, as well developers need to get as how to implement and manage Oozie’s security capabilities. the most out of Oozie. ■ Install and confgure an Oozie server, and get an overview of A volume such as this is basic concepts long overdue. Developers ■ Journey through the world of writing and confguring will get a lot more out of workfows the Hadoop ecosystem ■ Learn how the Oozie coordinator schedules and executes by reading it.” workfows based on triggers —Raymie Stata ■ Understand how Oozie manages data dependencies CEO, Altiscale ■ Use Oozie bundles to package several coordinator apps into Oozie simplifies a data pipeline “ the managing and ■ Learn about security features and shared library management automating of complex ■ Implement custom extensions and write your own EL functions and actions Hadoop workloads. ■ Debug workfows and manage Oozie’s operational details This greatly benefits Apache both developers and Mohammad Kamrul Islam works as a Staff Software Engineer in the data operators alike.” engineering team at Uber. He’s been involved with the Hadoop ecosystem —Alejandro Abdelnur Islam & Srinivasan & Srinivasan Islam since 2009, and is a PMC member and a respected voice in the Oozie com- Creator of Apache Oozie munity. He has worked in the Hadoop teams at LinkedIn and Yahoo. Aravind Srinivasan is a Lead Application Architect at Altiscale, a Hadoop- as-a-service company, where he helps customers with Hadoop application O o z i e design and architecture. He has been involved with Hadoop in general and Oozie in particular since 2008. THE WORKFLOW SCHEDULER FOR HADOOP DATA Twitter: @oreillymedia facebook.com/oreilly US $39.99 CAN $45.99 ISBN: 978-1-449-36992-7 Mohammad Kamrul Islam & Aravind Srinivasan Apache Oozie Mohammad Kamrul Islam & Aravind Srinivasan Apache Oozie by Mohammad Kamrul Islam and Aravind Srinivasan Copyright © 2015 Mohammad Islam and Aravindakshan Srinivasan. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/ institutional sales department: 800-998-9938 or [email protected]. Editors: Mike Loukides and Marie Beaugureau Indexer: Lucie Haskins Production Editor: Colleen Lobner Interior Designer: David Futato Copyeditor: Gillian McGarvey Cover Designer: Ellie Volckhausen Proofreader: Jasmine Kwityn Illustrator: Rebecca Demarest May 2015: First Edition Revision History for the First Edition 2015-05-08: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781449369927 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Apache Oozie, the cover image of a binturong, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-449-36992-7 [LSI] Table of Contents Foreword. ix Preface. xi 1. Introduction to Oozie. 1 Big Data Processing 1 A Recurrent Problem 1 A Common Solution: Oozie 2 A Simple Oozie Job 4 Oozie Releases 10 Some Oozie Usage Numbers 12 2. Oozie Concepts. 13 Oozie Applications 13 Oozie Workflows 13 Oozie Coordinators 15 Oozie Bundles 18 Parameters, Variables, and Functions 19 Application Deployment Model 20 Oozie Architecture 21 3. Setting Up Oozie. 23 Oozie Deployment 23 Basic Installations 24 Requirements 24 Build Oozie 25 Install Oozie Server 26 Hadoop Cluster 28 iii Start and Verify the Oozie Server 29 Advanced Oozie Installations 31 Configuring Kerberos Security 31 DB Setup 32 Shared Library Installation 34 Oozie Client Installations 36 4. Oozie Workfow Actions. 39 Workflow 39 Actions 40 Action Execution Model 40 Action Definition 42 Action Types 43 MapReduce Action 43 Java Action 52 Pig Action 56 FS Action 59 Sub-Workflow Action 61 Hive Action 62 DistCp Action 64 Email Action 66 Shell Action 67 SSH Action 70 Sqoop Action 71 Synchronous Versus Asynchronous Actions 73 5. Workfow Applications. 75 Outline of a Basic Workflow 75 Control Nodes 76 <start> and <end> 77 <fork> and <join> 77 <decision> 79 <kill> 81 <OK> and <ERROR> 82 Job Configuration 83 Global Configuration 83 Job XML 84 Inline Configuration 85 Launcher Configuration 85 Parameterization 86 EL Variables 87 EL Functions 88 iv | Table of Contents EL Expressions 89 The job.properties File 89 Command-Line Option 91 The config-default.xml File 91 The <parameters> Section 92 Configuration and Parameterization Examples 93 Lifecycle of a Workflow 94 Action States 96 6. Oozie Coordinator. 99 Coordinator Concept 99 Triggering Mechanism 100 Time Trigger 100 Data Availability Trigger 100 Coordinator Application and Job 101 Coordinator Action 101 Our First Coordinator Job 101 Coordinator Submission 103 Oozie Web Interface for Coordinator Jobs 106 Coordinator Job Lifecycle 108 Coordinator Action Lifecycle 109 Parameterization of the Coordinator 110 EL Functions for Frequency 110 Day-Based Frequency 110 Month-Based Frequency 111 Execution Controls 112 An Improved Coordinator 113 7. Data Trigger Coordinator. 117 Expressing Data Dependency 117 Dataset 117 Example: Rollup 122 Parameterization of Dataset Instances 124 current(n) 125 latest(n) 128 Parameter Passing to Workflow 132 dataIn(eventName): 132 dataOut(eventName) 133 nominalTime() 133 actualTime() 133 dateOffset(baseTimeStamp, skipInstance, timeUnit) 134 formatTime(timeStamp, formatString) 134 Table of Contents | v A Complete Coordinator Application 134 8. Oozie Bundles. 137 Bundle Basics 137 Bundle Definition 137 Why Do We Need Bundles? 138 Bundle Specification 140 Execution Controls 141 Bundle State Transitions 145 9. Advanced Topics. 147 Managing Libraries in Oozie 147 Origin of JARs in Oozie 147 Design Challenges 148 Managing Action JARs 149 Supporting the User’s JAR 152 JAR Precedence in