Hortonworks Data Platform (Hdp®) Developer Quick Start 4 Days
Total Page:16
File Type:pdf, Size:1020Kb
TRAINING OFFERING | DEV-201 HORTONWORKS DATA PLATFORM (HDP®) DEVELOPER QUICK START 4 DAYS This 4 day training course is designed for developers who need to create applications to analyze Big Data stored in Apache Hadoop using Apache Pig and Apache Hive, and developing applications on Apache Spark. Topics include: Essential understanding of HDP & its capabilities, Hadoop, YARN, HDFS, MapReduce/Tez, data ingestion, using Pig and Hive to perform data analytics on Big Data and an introduction to Spark Core, Spark SQL, Apache Zeppelin, and additional Spark features. PREREQUISITES Students should be familiar with programming principles and have experience in software development. SQL and light scripting knowledge is also helpful. No prior Hadoop knowledge is required. TARGET AUDIENCE Developers and data engineers who need to understand and develop applications on HDP FORMAT 50% Lecture/Discussion 50% Hands-0n Labs AGENDA SUMMARY Day 1: HDP Essentials and an Introduction to Pig Day 2: Apache Hive Day 3: Apache Spark Day 4: Apache Spark Continued DAY 1 OBJECTIVES • Describe the Case for Hadoop • Describe the Trends of Volume, Velocity and Variety • Discuss the Importance of Open Enterprise Hadoop • Describe the Hadoop Ecosystem Frameworks Across the Following Five Architectural Categories: o Data Management o Data Access o Data Governance & Integration o Security o Operations • Describe the Function and Purpose of the Hadoop Distributed File System (HDFS) • List the Major Architectural Components of HDFS and their Interactions • Describe Data Ingestion • Describe Batch/Bulk Ingestion Options • Describe the Streaming Framework Alternatives • Describe the Purpose and Function of MapReduce • Describe the Purpose and Components of YARN • Describe the Major Architectural Components of YARN and their Interactions • Define the Purpose and Function of Apache Pig • Work with the Grunt Shell • Work with Pig Latin Relation Names and Field Names • Describe the Pig Data Types and Schema DAY 1 LABS AND DEMONSTRATIONS • Starting an HDP Cluster • Using HDFS Commands • Demonstration: Understanding Apache Pig • Getting Started with Apache Pig • Exploring Data with Pig About Hortonworks Hortonworks is a leading innovator at creating, distributing and supporting enterprise-ready open data platforms. Our mission is to manage the world’s data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark. Our open Connected Data Platforms power Modern Data Applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest. Along with our 1600+ partners, we provide the expertise, training and services that allows our customers to unlock the transformational value of data across any line of business. We are Powering the Future of Data™. Contact For further information visit www.hortonworks.com +1 408 675-0983 +1 855 8-HORTON INTL: +44 (0) 20 3826 1405 © 2011-2016 Hortonworks Inc. All Rights Reserved. Privacy Policy | Terms of Service DAY 2 OBJECTIVES • Demonstrate Common Operators Such as: o ORDER BY o CASE o DISTINCT o PARALLEL o FOREACH • Understand how Hive Tables are Defined and Implemented • Use Hive to Explore and Analyze Data Sets • Explain and Use the Various Hive File Formats • Create and Populate a Hive Table that Uses ORC File Formats • Use Hive to Run SQL-like Queries to Perform Data Analysis • Use Hive to Join Datasets Using a Variety of Techniques • Write Efficient Hive Queries • Explain the Uses and Purpose of HCatalog • Use HCatalog with Pig and Hive DAY 2 LABS AND DEMONSTRATIONS • Splitting a Dataset • Joining Datasets • Preparing Data for Apache Hive • Understanding Apache Hive Tables • Demonstration: Understanding Partitions and Skew • Analyzing Big Data with Apache Hive • Demonstration: Computing Ngrams • Joining Datasets in Apache Hive • Computing Ngrams of Emails in Avro Format • Using HCatalog with Apache Pig About Hortonworks Hortonworks is a leading innovator at creating, distributing and supporting enterprise-ready open data platforms. Our mission is to manage the world’s data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark. Our open Connected Data Platforms power Modern Data Applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest. Along with our 1600+ partners, we provide the expertise, training and services that allows our customers to unlock the transformational value of data across any line of business. We are Powering the Future of Data™. Contact For further information visit www.hortonworks.com +1 408 675-0983 +1 855 8-HORTON INTL: +44 (0) 20 3826 1405 © 2011-2016 Hortonworks Inc. All Rights Reserved. Privacy Policy | Terms of Service DAY 3 OBJECTIVES • Describe How to Perform a Multi-Table/File Insert • Define and Use Views • Define and Use Clauses and Windows • List the Hive File Formats Including: o Text Files o SequenceFile o RCFile o ORC File • Define Hive Optimization • Use Apache Zeppelin to Work with Spark • Describe the Purpose and Benefits of Spark • Define Spark REPLs and Application Architecture • Explain the Purpose and Function of RDDs • Explain Spark Programming Basics • Define and Use Basic Spark Transformations • Define and Use Basic Spark Actions • Invoke Functions for Multiple RDDs, Create Named Functions and Use Numeric Operations DAY 3 LABS • Advanced Apache Hive Programming • Introduction to Apache Spark REPLs and Apache Zeppelin • Creating and Manipulating RDDs • Creating and Manipulating Pair RDDs About Hortonworks Hortonworks is a leading innovator at creating, distributing and supporting enterprise-ready open data platforms. Our mission is to manage the world’s data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark. Our open Connected Data Platforms power Modern Data Applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest. Along with our 1600+ partners, we provide the expertise, training and services that allows our customers to unlock the transformational value of data across any line of business. We are Powering the Future of Data™. Contact For further information visit www.hortonworks.com +1 408 675-0983 +1 855 8-HORTON INTL: +44 (0) 20 3826 1405 © 2011-2016 Hortonworks Inc. All Rights Reserved. Privacy Policy | Terms of Service DAY 4 OBJECTIVES • Define and Create Pair RDDs • Perform Common Operations on Pair RDDs • Name the Various Components of Spark SQL and Explain their Purpose • Describe the Relationship Between DataFrames, Tables and Contexts • Use Various Methods to Create and Save DataFrames and Tables • Understand Caching, Persisting and the Different Storage Levels • Describe and Implement Checkpointing • Create an Application to Submit to the Cluster • Describe Client vs Cluster Submission with YARN • Submit an Application to the Cluster • List and Set Important Configuration Items DAY 4 LABS • Creating and Saving DateFrames and Tables • Working with DataFrames • Building and Submitting Applications to YARN Revised 10/06/2017 About Hortonworks Hortonworks is a leading innovator at creating, distributing and supporting enterprise-ready open data platforms. Our mission is to manage the world’s data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark. Our open Connected Data Platforms power Modern Data Applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest. Along with our 1600+ partners, we provide the expertise, training and services that allows our customers to unlock the transformational value of data across any line of business. We are Powering the Future of Data™. Contact For further information visit www.hortonworks.com +1 408 675-0983 +1 855 8-HORTON INTL: +44 (0) 20 3826 1405 © 2011-2016 Hortonworks Inc. All Rights Reserved. Privacy Policy | Terms of Service .