Big Data Essentials from MapR®

Covering big data essentials, the MapR Converged Data Platform®, basic cluster administration, and an overview of Apache® Spark

3-day virtual instructor-led course

Curriculum

This course opens with an introduction to big data and the Apache Hadoop ecosystem, continues with the basics of MapR cluster administration, and ends with hands-on Spark training.
• Duration: 3 days
• Format: virtual and instructor-led, delivered on a live online classroom platform

Syllabus

This course introduces students to the basics of big data computing, the Apache Hadoop ecosystem, the MapR Converged Data Platform, and the essentials of Apache Spark. Students learn big data concepts and how different tools and roles help solve real-world big data problems. The course covers how to install, configure, and administer the MapR Converged Data Platform, then how to use Spark’s interactive shell to load and inspect data.

This course is targeted at developers and engineers who have programming experience but no prior knowledge of Hadoop.

Prerequisites

A background in Linux system administration:
• Navigate the Linux file system
• Use an editor at the command-line interface
• Add users/groups and execute common commands
Knowledge of either Scala or Python is required to complete the Spark labs.
Basic knowledge of SQL is helpful.

What’s Included

• Access to a multi-node cluster (one per student)
• Slide guide/transcript
• Lab guide and associated files
• Converged Enterprise Edition trial license

Copyright © MapR Data Technologies 2017 – All Rights Reserved 1

For more information, visit www.emtecinc.com

Course Syllabus This course is taught over 3 days. The lesson start and end times are determined by the instructor in collaboration with the students. The lessons might span the days differently than outlined here.

Big Data Essentials: Day 1

Lesson 1 – Introduction to Big Data
• Define big data
• Summarize the history of big data computing
• Define key terms in big data computing

Lesson 2 – The Big Data Pipeline
• Organize the steps in the data pipeline
• Explain the role of administrators
• Explain the role of developers
• Explain the role of data analysts

Lesson 3 – Solving Big Data Problems with Apache Hadoop
• Data Warehouse Optimization
• Recommendation Engine
• Large Scale Log Analysis

Lesson 4 – Core Elements of Apache Hadoop
• Compare and contrast local and distributed file systems
• Explain data management in the Hadoop file system
• Summarize the MapReduce algorithm

Lesson 5 – Apache Hadoop Ecosystem Components
• Define the Apache Hadoop Ecosystem
• Administration: ZooKeeper, YARN
• Ingestion: Flume, Oozie, Sqoop
• Processing: Spark, HBase, Pig
• Analysis: Hive, Drill, Mahout

Lesson 6 – MapR Converged Data Platform
• Review key components of HDFS
• Understand key components of MapR-FS
• Compare and contrast MapR-FS and HDFS

Lesson 7 – MapR-DB
• Compare and contrast MapR-DB, HBase, and traditional databases
• Describe components and features of MapR-DB
• Understand table replication in MapR-DB

Lesson 8 – MapR Streams
• Compare and contrast real-time and batch processing
• Describe key components of MapR Streams

Cluster Administration: Day 2

Get Started
• Understand the lab environment
• Connect to your cluster
• Lab: Log into your nodes

Lesson 1: Prepare for Installation
• Plan the service layout
• Lab: Plan a service layout
• Prepare and verify cluster hardware
• Lab: Audit the cluster
• Test nodes
• Lab: Run pre-install tests


Lesson 2: Install the MapR Converged Data Platform
• Install the MapR Converged Data Platform
• Lab: Install a MapR cluster
• Add a MapR license
• Lab: Install a license and explore the MCS

Lesson 5: Configure Topology
• Define topology
• Configure node topology
• Lab: Configure cluster node topology

Lesson 6: Configure MapR Volumes
• Volumes and volume properties
• Configure volumes
• Lab: Create volumes and set quotas

Spark Essentials: Day 3

Lesson 1 – Introduction to Apache Spark
• Describe the features of Apache Spark
• Define Spark components

Lesson 2 – Load and Inspect Data in Spark
• Describe different ways of getting data into Spark
• Create and use Resilient Distributed Datasets (RDDs)
• Apply transformations to RDDs
• Use actions on RDDs
• Lab: Load and Inspect Data in RDD
• Cache intermediate RDDs
• Use Spark DataFrames for simple queries
• Lab: Load and Inspect Data in DataFrames

Lesson 3 – Build a Simple Spark Application
• Define the lifecycle of a Spark program
• Define the function of SparkContext
• Lab: Create the application
• Define different ways to run a Spark application
• Run your Spark application
• Lab: Launch the application

Wrap Up and Questions
