Data Science Using Spark: an Introduction Topics Covered
Total Page:16
File Type:pdf, Size:1020Kb
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED . Introduction to Spark . Getting Started with Spark . Programming in Spark . Data Science with Spark . What next? 2 DATA SCIENCE PROCESS Exploratory Data Analysis Raw data is Data is Clean collected processed Data Machine Learning Real World Algorithms Statistical Models Build Communicate Data Product ---------------- visualizations Make Decisions ----------------Report Findings Source: Doing Data Science by Rachel Schutt & Cathy O’Neil 3 DATA SCIENCE & DATA MINING Distinctions are blurred Business Data Science Analytics Knowledge Discovery Domain Knowledge Data Mining Visual Data Mining (Structured) (Unstructured) Big Data Data Mining Data Mining Engineering Natural Language Text Mining Web Mining Processing Machine Database Management Statistics Learning Management Science 4 WHAT DO WE NEED TO SUPPORT DATA SCIENCE WORK? Data Input /Output . Ability to read data in multiple formats . Ability to read data from multiple sources . Ability to deal with Big Data (Volume, Velocity, Veracity and Variety) Data Transformations . Easy to describe and perform transformations on rows and columns of data . Requires abstraction of data and a dataflow paradigm Model Development . Library of Data Science Algorithms . Ability to import / export models from other sources . Data Science pipelines / workflow Development Analytics Applications Development . Seamless integration with programming languages / IDEs 5 INTRODUCTION TO SPARK SHARAN KALWANI 6 WHAT IS SPARK? ▪ A distributed computing platform designed to be ▪ Fast ▪ General Purpose ▪ A general engine that allows combination of multiple types of computations ▪ Batch ▪ Interactive ▪ Iterative ▪ SQL Queries ▪ Text Processing ▪ Machine learning 7 Fast/Speed . Computations in memory . Faster than MR even for disk computations Generality . Designed for a wide range of workloads . Single Engine to combine batch, interactive, iterative, streaming algorithms. Has rich high-level libraries and simple native APIs in Java, Scala and Python. Reduces the management burden of maintaining separate tools. 8 SPARK UNIFIED STACK 9 CLUSTER MANAGERS ▪ Can run on a variety of cluster managers ▪ Hadoop YARN - Yet Another Resource Negotiator is a cluster management technology and one of the key features in Hadoop 2. ▪ Apache Mesos - abstracts CPU, memory, storage, and other compute resources away from machines, enabling fault-tolerant and elastic distributed systems. ▪ Spark Standalone Scheduler – provides an easy way to get started on an empty set of machines. ▪ Spark can leverage existing Hadoop infrastructure 10 SPARK HISTORY ▪ Started in 2009 as a research project in UC Berkeley RAD lab which became AMP Lab. ▪ Spark researchers found that Hadoop MapReduce was inefficient for iterative and interactive computing. ▪ Spark was designed from the beginning to be fast for interactive, iterative with support for in-memory storage and fault-tolerance. ▪ Apart from UC Berkeley, Databricks, Yahoo! and Intel are major contributors. ▪ Spark was open sourced in March 2010 and transformed into Apache Foundation project in June 2013. 11 SPARK VS HADOOP Hadoop MapReduce .Mostly suited for batch jobs .Difficulty to program directly in MR .Batch doesn’t compose well for large apps .Specialized systems needed as a workaround Spark .Handles batch, interactive, and real-time within a single framework .Native integration with Java, Python, Scala .Programming at a higher level of abstraction .More general than MapReduce CONFIDENTIAL AND PROPRIETARY 12 GETTING STARTED WITH SPARK 13 GETTING STARTED WITH SPARK …..NOT COVERED TODAY! There are multiple ways of using Spark ▪ Certified Spark Distributions ▪ Datastax Enterprise (Cassandra + Spark) ▪ HortonWorks HDP ▪ MAPR ▪ Local/Standalone ▪ Databricks cloud ▪ Amazon AWS EC2 14 LOCAL MODE ▪ Install Java JDK 6/7 on MacOSX or Windows http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads- 1880260.html ▪ Install Python 2.7 using Anaconda (only on Windows) https://store.continuum.io/cshop/anaconda/ ▪ Download Apache Spark from Databricks, unzip the downloaded file to a convenient location http://training.databricks.com/workshop/usb.zip ▪ Connect to the newly created spark-training directory ▪ Run the interactive Scala shell (REPL) ./spark/bin/spark-shell val data = 1 to 1000 val distData = sc.parallelize(data) val filteredData = distData.filter(s => s<25) filteredData.collect() 15 DATABRICKS CLOUD ▪ A hosted data platform powered by Apache Spark ▪ Features ▪ Exploration and Visualization ▪ Managed Spark Clusters ▪ Production Pipelines ▪ Support for 3rd party apps (Tableau, Pentaho, Qlik View) ▪ Databricks Cloud Trail http://databricks.com/registration ▪ Demo 16 DATABRICKS CLOUD ▪ Workspace ▪ Tables ▪ Clusters 17 DATABRICKS CLOUD ▪ Notebooks ▪ Python ▪ Scala ▪ SQL ▪ Visualizations ▪ Markup ▪ Comments ▪ Collaboration 18 DATABRICKS CLOUD ▪ Tables ▪ Hive tables ▪ SQL ▪ Dbfs ▪ S3 ▪ CSV ▪ Databases 19 AMAZON EC2 ▪ Launch a Linux instance on EC2 and setup EC2 Keys 20 AMAZON EC2 ▪ Setup an EC2 pair from the AWS console 21 AMAZON EC2 ▪ Spark binary ships with a spark-ec2 script to manage clusters on EC2 ▪ Launching Spark cluster on EC2 ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name> ▪ Running Applications ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name> ▪ Terminating a cluster ./spark-ec2 destroy <cluster-name> ▪ Accessing data in S3 s3n://<bucket>/path 22 PROGRAMMING IN SPARK SHARAN KALWANI 23 Spark Cluster • Mesos • YARN • Standalone 24 Scala – Scalable Language ▪ Scala is a multi-paradigm programming language with focus on the functional programming paradigm. ▪ In functional programming functions are used and they use variables that are immutable. ▪ Every operator, variable and function is an object. ▪ Scala generates bytecode that runs on the top of any JVM and can also use any of the java libraries. ▪ Spark is completely written in Scala. ▪ Spark SQL, GraphX, Spark Streaming etc. are libraries written in Scala. ▪ Scala Crash Course by Holden Karau @databricks lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf 25 Spark Model Write programs in terms of transformations on distributed datasets Resilient Distributed Datasets (RDDs) . Read-only collections of objects that can be stored in memory or disk across a cluster . Partitions are automatically rebuilt on failure . Parallel functional transformations ( map, filter, ..) . Familiar Scala collections API for distributed data and computation . Lazy transformations 26 Spark Core RDD – Resilient Distributed Dataset ▪ A primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel. ▪ Two Types ▪ Parallelized Scala collections ▪ Hadoop datasets ▪ Transformations and Actions can be performed on RDDs. Transformations ▪ Operate on an RDD and return a new RDD. ▪ Are Lazily Evaluated Actions ▪ Return a value after running a computation on a RDD. ▪ The DAG is evaluated only when an action takes place. 27 Spark Shell Interactive Queries and prototyping Local, YARN, Mesos Static type checking and auto complete 28 Spark compared to Java (native Hadoop) 29 Spark compared to Java (native Hadoop) 30 Spark Streaming • Real time computation similar to Storm • Input distributed to memory for fault tolerance • Streaming input in to sliding windows of RDDs • Kafka, Flume, Kinesis, HDFS 31 32 Spark Streaming 33 DATA SCIENCE USING SPARK 34 WHAT DO WE NEED TO SUPPORT DATA SCIENCE WORK? Data Input /Output . Ability to read data in multiple formats . Ability to read data from multiple sources . Ability to deal with Big Data (Volume, Velocity, and Variety) Data Transformations . Easy to describe and perform transformations on rows and columns of data . Requires abstraction of data and a dataflow paradigm Model Development . Library of Data Science Algorithms . Ability to import / export models from other sources . Data Science pipelines / workflow Development Analytics Applications Development . Seamless integration with programming languages / IDEs 35 WHY SPARK FOR DATA SCIENCE? Distributed In-memory Platform FAST Scalable Small to Big Data; Well integrated into the Big Data Ecosystem SPARK is HOT Expressive Simple, higher level abstractions for describing computations Flexible Extendible, Multiple language bindings (Scala, Java, Python, R) CONFIDENTIAL AND PROPRIETARY 36 Traditional Data Science Tools SAS RapidMiner Matlab And many others…. R SPSS Designed to work on single machines Proprietary & Expensive 37 What is available in Spark? Analytics Workflows (ML Pipeline) Library of Algorithms(MLlib, R packages, Mahout?, Graph Algorithms) Extensions to RDD (SchemaRDD, RRDD, RDPG, DStreams) Basic RDD (Transformations & Actions) 38 DATA TYPES FOR DATA SCIENCE (MLLIB) Single Machine Data Types Distributed Data Types (supported by RDDs) Distributed Matrix Local Vector . RowMatrix Labeled Point . IndexedRowMatrix Local Matrix . CoordinateMatrix 39 Schema RDDs 40 41 R to Spark Dataflow Worker Local tasks Spark R Executor broadcast vars R pacakges Spark Java R Context Spark (ref. in R) Context Worker Spark tasks Executor R broadcast vars R pacakges 42 43 MESOS SHARK SPARK 44.