TRAINING OFFERING | SCI-241 DATA PLATFORM (HDP®): Data Science: Theory and Application

4 DAYS

This course provides instruction on the theory and practice of data science, including machine learning and natural language processing. It introduces many of the core concepts behind today’s most commonly used algorithms and puts them to work in practical applications. We’ll discuss concepts and key algorithms in all of the major areas (Classification, Regression, Clustering, and Dimensionality Reduction) and include a primer on Neural Networks. We’ll focus on both single-server tools and frameworks (Python, NumPy, pandas, SciPy, Scikit-learn, NLTK, TensorFlow, Jupyter) and large-scale tools and frameworks (Spark MLlib, Stanford CoreNLP, TensorFlowOnSpark/Horovod/MLeap, Apache Zeppelin).

PREREQUISITES

Students must have experience with Python, Scala, and Spark; prior exposure to statistics and probability; and a basic understanding of Hadoop principles. While brief reviews of these topics are offered, students new to Hadoop are encouraged to attend the HDP Overview: Essentials (HDP-123) course and HDP Developer: 2.3 (DEV-343), as well as the language-specific introduction courses.

TARGET AUDIENCE

Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Spark/Hadoop.

FORMAT

50% Lecture/Discussion, 50% Hands-On Labs

AGENDA SUMMARY

Day 1: Introducing Data Science, SciKit-Learn, HDFS, Reviewing Spark apps, DataFrames and NOSQL, Reviewing Mathematics, Statistics, and Probability, HDP and HDF and Apache NiFi, and Kafka with Structured Streaming

Day 2: Algorithms in Spark ML and SciKit-Learn: Linear Regression, Logistic Regression, Support Vectors, Decision Trees, Random Forests, KNN, Spam Classifier

Day 3: Algorithms in Spark ML and SciKit-Learn: K-Means & GMM Clustering, Essential TensorFlow, NLP with NLTK, NLP with Stanford CoreNLP, Sentiment Analysis, Dimensionality Reduction

Day 4: Algorithms in Spark ML and SciKit-Learn: HyperParameter Tuning, K-Fold Validation, Ensemble Methods, ML Pipelines in SparkML, TensorFlow on Spark, Horovod, MLeap

About Hortonworks

Hortonworks is a leading innovator in creating, distributing, and supporting enterprise-ready open data platforms. Our mission is to manage the world’s data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark. Our open Connected Data Platforms power Modern Data Applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest. Along with our 1600+ partners, we provide the expertise, training, and services that allow our customers to unlock the transformational value of data across any line of business. We are Powering the Future of Data™.

Contact

For further information visit www.hortonworks.com
+1 408 675-0983
+1 855 8-HORTON
INTL: +44 (0) 20 3826 1405
© 2011-2018 Hortonworks Inc. All Rights Reserved.

DAY 1 OBJECTIVES

• Discuss aspects of Data Science, the team members, and the various roles in the team
• Discuss use cases for Data Science
• Discuss the current State of the Art and its future direction
• Review HDFS, Spark, Jupyter, and Zeppelin
• Work with SciKit-Learn, Pandas, NumPy, Matplotlib, and Seaborn
• Review and use Spark DataFrames and NOSQL in ETL
• Review and use Apache NiFi to create and manage data flows
• Review and use Spark Structured Streaming with Kafka
• Review essential Mathematics, Statistics, and Probability used in ML with Zeppelin

DAY 1 LABS AND DEMONSTRATIONS

• Hello, ML w/ SciKit-Learn (30 min, using Jupyter, with visualizations in Matplotlib & Seaborn)
• Spark REPLs, Spark Submit, & Zeppelin Review (30 min, pre-built apps to be executed all 3 ways; reviews the DF-functional paradigm)
• HDFS Review (15-20 min, moving data to/from HDFS)
• Spark DataFrames and Files (20-30 min, JSON, CSV, Parquet, ORC, Avro files)
• Spark DataFrames and NOSQL (MariaDB, Mongo)
• NiFi Review (30 min, essentials of moving data to/from HDFS with NiFi)
• Kafka and Structured Streaming Review (30 min, reviewing an app that streams data from Kafka)
• Essential Math Review (30 min, graphing, plotting, probability, lead-in to gradient descent)

DAY 2 OBJECTIVES

• Discuss categories and use cases of the various ML Algorithms
• Understand the similarities and differences in classification and regression categories
• Understand Linear Regression, Logistic Regression, and Support Vectors
• Understand Decision Trees and their limitations
• Understand Random Forests and Gradient Boosted Trees
• Understand Nearest-Neighbors
• Discuss and demonstrate a Spam Classifier

DAY 2 LABS AND DEMONSTRATIONS

• Linear Regression as a Projection (30 min, includes visualization)
• Logistic Regression (30 min, includes visualization)
• Support Vectors (30 min)
• Decision Trees (30 min)
• Random Forests (30 min)
• Linear Regression as a Classifier (30 min, includes visualization)
• KNN (30 min, includes visualization)
• Demo: Creating a Spam Classifier with MLlib (30 min)

DAY 3 OBJECTIVES

• Discuss and understand Clustering Algorithms
• Discuss and understand Neural Networks, particularly Convolutional, Recurrent, and LSTM networks
• Work with TensorFlow to create a basic neural network
• Discuss Natural Language Processing
• Compare and contrast NLTK and Stanford CoreNLP
• Discuss and demonstrate Sentiment Analysis
• Discuss Dimensionality Reduction Algorithms

DAY 3 LABS AND DEMONSTRATIONS

• K-Means Clustering (30 min, includes visualization)
• GMM Clustering (30 min, includes visualization)
• Essential TensorFlow (30 min)
• NLTK
• Stanford NLP
• Sentiment Analysis
• Dimensionality Reduction with PCA (30 min)

DAY 4 OBJECTIVES

• Discuss Hyper-Parameter Tuning and K-Fold Validation
• Understand Ensemble Models
• Discuss ML Pipelines in Spark MLlib
• Discuss ML in production and real-world issues
• Demonstrate TensorFlowOnSpark
• Describe real-world use cases of ML

DAY 4 LABS AND DEMONSTRATIONS

• Hyper-parameter tuning (30 min, includes visualization)
• K-Fold Validation (30 min)
• Ensemble Methods (30 min)
• ML Pipelines in SparkML (30 min)
• Demo: TensorFlowOnSpark (20-30 min)
• Demo: Use Cases

Revised 23 August 2018
