Hadoop Training Overview

Hadoop is an open-source software framework for distributed storage and processing of very large data sets on clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be handled automatically by the framework. To facilitate understanding and hands-on practice, this course helps developers comprehend big data technologies such as Hive, Pig, MapReduce, and others, and provides real-world experience with a big data cluster. Developers will not only perform ETL using these technologies but also learn how different performance techniques are applied in live projects.

Big data engineers are taking on a more prominent role in the IT industry and are primarily responsible for:

- Setting up the big data cluster using the Hadoop and Spark ecosystems
- Designing and developing data ingestion frameworks using tools such as Sqoop, Hive, and Spark
- Applying processing rules and performing data transformation using tools such as Hive, Pig, and Spark
- Keeping up to date with new technologies and executing POCs to introduce effective tools
- Developing new visualization tools or using existing visualization tools

About the Course

Our instructors will explain the following technologies to help build the skills required to be a successful big data engineer.

- Hadoop Distributed File System (HDFS) and Java MapReduce overview
- Apache Sqoop – Data ingestion between Hadoop and RDBMS
- Hive – Data processing in Hadoop
- Flume and Kafka – Data ingestion from web logs into Hadoop
- Oozie – Developing workflows using the Hadoop ecosystem
- Apache Spark with Python – Data processing using Python as the programming language
- Apache Spark with Scala – Data processing using Spark and Scala as the programming language
- Apache Spark Data Frames – Spark with SQL-type interfaces and data processing in Hadoop
- Setting up Hadoop and Spark clusters – Quick guide

Hadoop Training Syllabus

Category | Topic | Description
Data Ingest | Sqoop | Understand Sqoop import and export in detail
Data Ingest | Flume | Understand ingesting data into HDFS using Flume
Data Ingest | HDFS | Understand HDFS commands to copy data back and forth from HDFS
Transform, Stage, and Store | Spark with Scala/Python | Core Spark API such as read/write data, joins, aggregations, and filters, as well as sorting and ranking
Data Analysis | Hive | Create tables and load data into tables using Hive
Data Analysis | Impala | Create tables and load data into tables using Impala
Data Analysis | Avro-tools | Use Avro-tools to convert Avro files into readable formats such as JSON, handle schema evolution, and more

Use Case - Building DWH

This exercise demonstrates how to perform functions already familiar to you, but in Hadoop. Seamless integration is critical when evaluating any new infrastructure; it is therefore important to be able to perform your current functions without breaking any regular BI reports or workloads over the dataset you plan to migrate. retail_db is a well-formed database with six tables: Departments, Categories, Products, Order Items, Orders, and Customers.

All developers need to build data pipelines in a Hadoop cluster based on the above datasets and meet all the aforementioned goals while performing this activity.

Exercises

Developers shall successfully complete all of the following exercises, divided into four sections.

1. Data Ingest Using Sqoop

- Import data from a MySQL database into HDFS using Sqoop
- Export data to a MySQL database from HDFS using Sqoop
- Change the delimiter and file format of data during import using Sqoop
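To make the first two bullets concrete, here is a minimal sketch (all illustrative snippets in this outline use Python): a small wrapper that shells out to the Sqoop CLI. The JDBC URL, credentials file, table name, and HDFS paths are placeholders rather than values from the course material, and a working sqoop client is assumed on the node running the script.

    # Hypothetical wrapper around the Sqoop CLI; connection details and paths
    # are placeholders. Swapping --fields-terminated-by for --as-avrodatafile
    # (or --as-parquetfile) changes the file format instead of the delimiter.
    import subprocess

    sqoop_import = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://mysql-host:3306/retail_db",
        "--username", "retail_user",
        "--password-file", "/user/training/sqoop.password",
        "--table", "orders",
        "--target-dir", "/user/training/retail_db/orders",
        "--fields-terminated-by", "|",   # change the delimiter during import
        "--num-mappers", "4",
    ]

    subprocess.run(sqoop_import, check=True)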

2. Data Ingest Using HDFS

- Load data into and out of HDFS using the Hadoop File System (FS) commands
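The FS commands themselves are one-liners; a hedged sketch follows, wrapped in Python only to keep every example in one language, with the local and HDFS paths as assumed placeholders.

    # hdfs dfs -put copies local data into HDFS, -get copies it back out,
    # and -ls verifies the result. Paths are illustrative placeholders.
    import subprocess

    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/training/staging"], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", "orders.csv", "/user/training/staging/"], check=True)
    subprocess.run(["hdfs", "dfs", "-ls", "/user/training/staging"], check=True)
    subprocess.run(["hdfs", "dfs", "-get", "/user/training/staging/orders.csv", "orders_copy.csv"], check=True)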

3. Transform, Stage, Store Using Spark with Python/Scala

- Load data from HDFS and store results back to HDFS using Spark
- Join disparate datasets using Spark
- Calculate aggregate statistics (e.g., average or sum) using Spark
- Filter data into a smaller dataset using Spark
- Write a query that produces ranked or sorted data using Spark
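A minimal PySpark sketch tying the exercises above together, assuming the retail_db tables were imported as CSV; the HDFS paths and column names are assumptions about that export, not values given in the course material.

    # Hypothetical end-to-end job: read from HDFS, join, filter, aggregate,
    # sort, and write results back to HDFS.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("retail-transform").getOrCreate()

    orders = spark.read.option("header", "true").csv("/user/training/retail_db/orders")
    items = spark.read.option("header", "true").csv("/user/training/retail_db/order_items")

    daily_revenue = (
        orders.join(items, orders.order_id == items.order_item_order_id)      # join disparate datasets
              .filter(orders.order_status == "COMPLETE")                       # filter to a smaller dataset
              .groupBy("order_date")
              .agg(F.round(F.sum(F.col("order_item_subtotal").cast("double")), 2)
                    .alias("revenue"))                                         # aggregate statistics
              .orderBy(F.col("revenue").desc())                                # ranked / sorted output
    )

    daily_revenue.write.mode("overwrite").csv("/user/training/retail_db/daily_revenue")
    spark.stop()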

4. Data Analysis Using Hive/Impala and Avro

- Read and/or create a table in the Hive metastore in a given schema
- Extract an Avro schema from a set of data files using Avro-tools
- Create a table in the Hive metastore using the Avro file format and an external schema file
- Improve query performance by creating partitioned tables in the Hive metastore
- Evolve an Avro schema by changing JSON files

Hadoop Development Course Content

1. Introduction to Big Data and Hadoop (week 1) Learning objectives: In this training session, you will understand big data, the limitations of existing solutions to the big data problem, how Hadoop solves it, the common Hadoop ecosystem components, Hadoop architecture, HDFS, the anatomy of file writes and reads, and how the MapReduce framework works.

Topics: Big data, limitations and solutions of existing data analytics architecture, Hadoop, Hadoop features, Hadoop ecosystem, Hadoop 2.x core components, Hadoop storage: HDFS, Hadoop processing: MapReduce framework, and Hadoop different distributions.

2. Hadoop Architecture and HDFS (week 1) Learning objectives: In this module, you will learn about the Hadoop cluster architecture, important configuration files in a Hadoop cluster, data loading techniques, and how to set up single-node and multi-node Hadoop clusters.

Topics: Hadoop 2.x cluster architecture - Federation and High Availability, a typical production Hadoop cluster, Hadoop cluster modes, common Hadoop shell commands, Hadoop 2.x configuration files, single-node and multi-node cluster setup, and basics of Hadoop administration.

3. Hadoop MapReduce Framework (week 2) Learning objectives: In this class, you will gain an understanding of the Hadoop MapReduce framework and how MapReduce works on data stored in HDFS. You will also learn about concepts such as input splits in MapReduce, the Combiner, and the Partitioner, and see demos of MapReduce on different data sets.

Topics: MapReduce use cases, traditional vs. MapReduce way, why MapReduce?, Hadoop 2.x MapReduce architecture, Hadoop 2.x MapReduce components, YARN MR application execution flow, YARN workflow, anatomy of the MapReduce program, a demo on MapReduce, input splits, the relation between input splits and HDFS blocks, MapReduce Combiner and Partitioner, a demo on de-identifying a health care data set, and a demo on a weather data set.
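To make the map, shuffle/sort, and reduce contract concrete, here is a minimal word-count sketch written for Hadoop Streaming in Python; the classroom demos use the Java MapReduce API, which follows the same flow, and the streaming jar location and input/output paths are environment-specific.

    # Minimal Hadoop Streaming word count: run the same file as the mapper
    # ("map" mode) and the reducer ("reduce" mode). The framework sorts the
    # mapper output by key before it reaches the reducer.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")                  # emit (key, value) pairs

    def reducer():
        current, total = None, 0
        for line in sys.stdin:                       # keys arrive grouped and sorted
            word, count = line.rstrip("\n").rsplit("\t", 1)
            if word != current and current is not None:
                print(f"{current}\t{total}")
                total = 0
            current = word
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mode = sys.argv[1] if len(sys.argv) > 1 else "map"
        mapper() if mode == "map" else reducer()

Saved, say, as wordcount.py (a hypothetical name), it can be tested locally with: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce, before submitting it through the Hadoop Streaming jar.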

4. Advanced MapReduce (week 2) Learning objectives: In this module, you will learn advanced MapReduce concepts such as counters, distributed cache, MRUnit, reduce join, custom input format, sequence input format, and XML parsing.

Topics: Counters, distributed cache, MRUnit, reduce join, custom input format, sequence input format, XML file parsing using MapReduce.

5. Pig (week 3) Learning objectives: In this session, you will learn about Pig, the types of use cases in which Pig can be used, the tight coupling between Pig and MapReduce, Pig Latin scripting, Pig running modes, Pig UDFs, Pig streaming, and testing Pig scripts. The module also includes a demo on a healthcare data set.

Topics: About Pig, MapReduce vs. Pig, Pig use cases, programming structure in Pig, Pig running modes, Pig components, Pig execution, Pig Latin programs, data models in Pig, Pig data types, shell and utility commands, Pig Latin relational operators, file loaders, the GROUP operator, the COGROUP operator, joins and COGROUP, union, diagnostic operators, specialized joins in Pig, built-in functions (eval, load and store, math, string, and date functions), Pig UDFs, Piggybank, parameter substitution (Pig macros and Pig parameter substitution), Pig streaming, testing Pig scripts with PigUnit, an aviation use case in Pig, and a Pig demo on a healthcare data set.

6. Hive (week 4) Learning objectives: This module will help you understand Hive concepts, Hive data types, loading and querying data in Hive, running Hive scripts, and Hive UDFs.

Topics: Hive background, Hive use case, about Hive, Hive vs. Pig, Hive architecture and components, metastore in Hive, limitations of Hive, comparison with traditional database, Hive data types and data models, partitions and buckets, Hive tables (managed tables and external tables), importing data, querying data, managing outputs, Hive script, Hive UDF, retail use case in Hive, Hive demo on healthcare data set.
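As an illustrative sketch of the external-table and querying topics above: the statements are plain Hive QL, issued here through a Hive-enabled SparkSession so the examples stay in Python; the database, table, columns, and HDFS location are assumptions, and the same DDL runs unchanged in the Hive shell or Beeline.

    # Create and query an external Hive table over files already in HDFS.
    # Dropping an external table removes only the metastore entry, not the data.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-external-table")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("CREATE DATABASE IF NOT EXISTS retail")

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS retail.orders_ext (
            order_id INT,
            order_date STRING,
            order_customer_id INT,
            order_status STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/user/training/retail_db/orders'
    """)

    spark.sql("""
        SELECT order_status, COUNT(*) AS order_count
        FROM retail.orders_ext
        GROUP BY order_status
        ORDER BY order_count DESC
    """).show()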

7. Advanced Hive and Sqoop (week 5) Learning objectives: In this class, you will understand advanced Hive concepts such as UDFs, dynamic partitioning, Hive indexes and views, and optimizations in Hive. You will also acquire in-depth knowledge of Sqoop architecture and its components.

Topics: Hive QL: joining tables, dynamic partitioning, custom map/reduce scripts, Hive indexes and views, Hive query optimizers, Hive Thrift Server, user-defined functions, and Sqoop import/export functions.
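A hedged sketch of dynamic partitioning, assuming the retail.orders_ext table from the previous example exists: each row is routed to a partition derived from its own data at insert time, which is what later enables partition pruning.

    # Dynamic partitioning: Hive derives the order_month partition from the
    # value computed in the SELECT list, creating partitions as needed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partitioning")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS retail.orders_by_month (
            order_id INT,
            order_date STRING,
            order_status STRING
        )
        PARTITIONED BY (order_month STRING)
    """)

    spark.sql("""
        INSERT OVERWRITE TABLE retail.orders_by_month PARTITION (order_month)
        SELECT order_id, order_date, order_status,
               substr(order_date, 1, 7) AS order_month
        FROM retail.orders_ext
    """)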

8. Processing Distributed Data with Apache Spark (week 6) Learning objectives: In this module, you will learn about the Spark ecosystem and its components, how Scala is used in Spark, and SparkSession. You will also be taught how to work with RDDs in Spark. A demo will be performed on running an application on a Spark cluster and comparing the performance of MapReduce and Spark.

Topics: What is Apache Spark, the Spark ecosystem, Spark components, history of Spark and Spark versions/releases, Spark as a polyglot, what is Scala?, why Scala?, SparkSession, and RDDs.
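A small, self-contained sketch of the SparkSession and RDD topics above; the input is inlined so it runs on any Spark installation, and the names and data are illustrative only.

    # SparkSession is the entry point; an RDD is an immutable, partitioned
    # collection processed in parallel through transformations and actions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize([
        "hadoop stores data in hdfs",
        "spark processes data in memory",
    ])

    word_counts = (
        lines.flatMap(lambda line: line.split())   # transformation: one word per element
             .map(lambda word: (word, 1))          # transformation: pair RDD
             .reduceByKey(lambda a, b: a + b)      # shuffle and aggregation
    )

    print(word_counts.collect())                   # action: triggers the computation
    spark.stop()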

10. Oozie and Hadoop Project (week 6) Learning objectives: In this training, you will understand how to work with multiple Hadoop ecosystem components together in a Hadoop implementation to solve big data problems. We will discuss multiple data sets and the specifications of the project. This module also includes a Flume and Sqoop demo and covers the Apache Oozie workflow scheduler for Hadoop jobs.

Topics: Flume and Sqoop demo, Oozie, Oozie components, Oozie workflow, scheduling with Oozie, a demo on Oozie workflows, Oozie Coordinator, Oozie commands, Oozie web console, Oozie for MapReduce, Pig, Hive, and Sqoop, combining the flow of MapReduce, Pig, and Hive in Oozie, and a Hadoop project demo.