HORTONWORKS DATA PLATFORM (HDP®): Data Science: Theory and Application
TRAINING OFFERING | SCI-241
4 DAYS

This course provides instruction on the theory and practice of data science, including machine learning and natural language processing. It introduces many of the core concepts behind today’s most commonly used algorithms and applies them in practical settings. We’ll discuss concepts and key algorithms in all of the major areas: Classification, Regression, Clustering, and Dimensionality Reduction, including a primer on Neural Networks. We’ll focus on both single-server tools and frameworks (Python, NumPy, pandas, SciPy, Scikit-learn, NLTK, TensorFlow, Jupyter) and large-scale tools and frameworks (Spark MLlib, Stanford CoreNLP, TensorFlowOnSpark/Horovod/MLeap, Apache Zeppelin).

PREREQUISITES
Students must have experience with Python, Scala, and Spark, prior exposure to statistics and probability, and a basic understanding of big data and Hadoop principles. While brief reviews of these topics are offered, students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials (HDP-123) and HDP Developer: Apache Spark 2.3 (DEV-343) courses, as well as the language-specific introduction courses.

TARGET AUDIENCE
Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Spark/Hadoop.
FORMAT
50% Lecture/Discussion
50% Hands-On Labs

AGENDA SUMMARY
Day 1: Introducing Data Science, SciKit-Learn, HDFS, Reviewing Spark apps, DataFrames and NoSQL, Reviewing Mathematics, Statistics, and Probability, HDP and HDF and Apache NiFi, and Kafka with Structured Streaming
Day 2: Algorithms in Spark ML and SciKit-Learn: Linear Regression, Logistic Regression, Support Vectors, Decision Trees, Random Forests, KNN, Spam Classifier
Day 3: Algorithms in Spark ML and SciKit-Learn: K-Means & GMM Clustering, Essential TensorFlow, NLP with NLTK, NLP with Stanford CoreNLP, Sentiment Analysis, Dimensionality Reduction
Day 4: Algorithms in Spark ML and SciKit-Learn: Hyper-Parameter Tuning, K-Fold Validation, Ensemble Methods, ML Pipelines in SparkML, TensorFlow on Spark, Horovod, MLeap

About Hortonworks
Hortonworks is a leading innovator in creating, distributing and supporting enterprise-ready open data platforms. Our mission is to manage the world’s data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark. Our open Connected Data Platforms power Modern Data Applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest. Along with our 1600+ partners, we provide the expertise, training and services that allow our customers to unlock the transformational value of data across any line of business. We are Powering the Future of Data™.

Contact
For further information visit www.hortonworks.com
+1 408 675-0983
+1 855 8-HORTON
INTL: +44 (0) 20 3826 1405
© 2011-2018 Hortonworks Inc. All Rights Reserved.
Privacy Policy | Terms of Service

DAY 1 OBJECTIVES
• Discuss aspects of Data Science, the team members, and the various roles in the team
• Discuss use cases for Data Science
• Discuss the current State of the Art and its future direction
• Review HDFS, Spark, Jupyter, and Zeppelin
• Work with SciKit-Learn, Pandas, NumPy, Matplotlib, and Seaborn
• Review and use Spark DataFrames and NoSQL in ETL
• Review and use Apache NiFi to create and manage data flows
• Review and use Spark Structured Streaming with Kafka
• Review essential Mathematics, Statistics, and Probability used in ML with Zeppelin

DAY 1 LABS AND DEMONSTRATIONS
• Hello, ML w/ SciKit-Learn (30 min, using Jupyter, with visualizations in Matplotlib & Seaborn)
• Spark REPLs, Spark Submit, & Zeppelin Review (30 min, pre-built apps to be executed all 3 ways; reviews DF-functional paradigm)
• HDFS Review (15-20 min, moving data to/from HDFS)
• Spark DataFrames and Files (20-30 min, JSON, CSV, Parquet, ORC, Avro files)
• Spark DataFrames and NoSQL (MariaDB, Mongo)
• NiFi Review (30 min, essentials of moving data to/from HDFS with NiFi)
• Kafka and Structured Streaming Review (30 min, reviewing an app that streams data from Kafka)
• Essential Math Review (30 min, graphing, plotting, probability, lead-in to gradient descent)
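
The Essential Math Review lab above leads in to gradient descent. As a rough illustration only (this is not taken from the course materials), the sketch below fits a straight line by batch gradient descent on mean squared error using plain Python. The function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions chosen for the example.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error.

    Hypothetical helper for illustration; hyperparameters are arbitrary.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(f"slope={w:.4f} intercept={b:.4f}")
```

On this noise-free data the fitted slope and intercept should approach 2 and 1. A learning rate that is too large makes the updates diverge, which is one reason the step size is usually treated as a tunable parameter.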
DAY 2 OBJECTIVES
• Discuss categories and use cases of the various ML Algorithms
• Understand the similarities and differences in classification and regression categories
• Understand Linear Regression, Logistic Regression, and Support Vectors
• Understand Decision Trees and their limitations
• Understand Random Forests and Gradient Boosted Trees
• Understand Nearest-Neighbors
• Discuss and demonstrate a Spam Classifier

DAY 2 LABS AND DEMONSTRATIONS
• Linear Regression as a Projection (30 min, includes visualization)
• Logistic Regression (30 min, includes visualization)
• Support Vectors (30 min)
• Decision Trees (30 min)
• Random Forests (30 min)
• Linear Regression as a Classifier (30 min, includes visualization)
• KNN (30 min, includes visualization)
• Demo: Creating a Spam Classifier with MLlib (30 min)
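
To give a flavor of the nearest-neighbors topic listed for Day 2, here is a minimal sketch of a k-nearest-neighbors classifier in plain Python. It is not the course's lab code; the function name `knn_predict`, the choice of Euclidean distance, and the toy points and labels are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among the k nearest
    training points (Euclidean distance). Hypothetical helper."""
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and take the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two well-separated toy groups (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["blue", "blue", "blue", "red", "red", "red"]

print(knn_predict(points, labels, (1.2, 1.4)))
print(knn_predict(points, labels, (6.8, 6.9)))
```

This brute-force version compares the query against every training point, which is fine for small data; library implementations typically add spatial indexes and distance weighting.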
DAY 3 OBJECTIVES
• Discuss and understand Clustering Algorithms
• Discuss and understand Neural Networks, particularly Convolutional, Recurrent and LSTMs
• Work with TensorFlow to create a basic neural network
• Discuss Natural Language Processing
• Compare and contrast NLTK and Stanford CoreNLP
• Discuss and demonstrate Sentiment Analysis
• Discuss Dimensionality Reduction Algorithms

DAY 3 LABS AND DEMONSTRATIONS
• K-Means Clustering (30 min, includes visualization)
• GMM Clustering (30 min, includes visualization)
• Essential TensorFlow (30 min)
• NLTK
• Stanford NLP
• Sentiment Analysis
• Dimensionality Reduction with PCA (30 min)
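
For readers unfamiliar with the K-Means Clustering topic listed for Day 3, the following is a small sketch of the standard assignment/update loop (Lloyd's algorithm) in plain Python. It is an illustration under stated assumptions, not the course's lab: the function name `kmeans`, the fixed starting centroids, and the toy points are hypothetical, and practical implementations usually choose initial centroids with a scheme such as k-means++.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid update until nothing moves. Hypothetical helper."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged: centroids did not move
        centroids = new_centroids
    return centroids, assignments


# Two obvious toy clusters and deliberately simple starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(assignments)
```

Passing the starting centroids explicitly keeps the sketch deterministic; with random initialization, results can differ from run to run.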
DAY 4 OBJECTIVES
• Discuss Hyper-Parameter Tuning and K-Fold Validation
• Understand Ensemble Models
• Discuss ML Pipelines in Spark MLlib
• Discuss ML in production and real-world issues
• Demonstrate TensorFlowOnSpark
• Describe real-world use cases of ML

DAY 4 LABS AND DEMONSTRATIONS
• Hyper-parameter tuning (30 min, includes visualization)
• K-Fold Validation (30 min)
• Ensemble Methods (30 min)
• ML Pipelines in SparkML (30 min)
• Demo: TensorFlowOnSpark (20-30 min)
• Demo: Use Cases

Revised 23 August 2018