Spark and DB2 -- A perfect couple

Pallavi Priyadarshini
pallavipr@in..com
Senior Technical Staff Member, Spark Technology Center, IBM
Session # E11 | Wed, May 25 (1 PM – 2 PM) | Platform: Cross platform

Agenda

 Objective 1: Spark fundamentals relevant to database integration

 Objective 2: Integration between Spark and IBM data servers through DataFrame API

 Objective 3: Loading DB2 data into Spark and writing Spark data into DB2

 Objective 4: Spark Use Cases

 Demo

Power of data. Simplicity of design. Speed of innovation.

Background

What is Spark

 An Apache Foundation open source project. Not a product.

 An in-memory compute engine that works with data. Not a data store.

 Enables highly iterative analysis on large volumes of data at scale

 Unified environment for data scientists, developers and data engineers

 Radically simplifies process of developing intelligent apps fueled by data.

History of Spark

. 2002 – MapReduce @ Google
. 2004 – MapReduce paper
. 2006 – Hadoop @ Yahoo
. 2008 – Hadoop Summit
. 2010 – Spark paper
. 2014 – Apache Spark top-level
. 2014 – 1.2.0 release in December
. 2015 – 1.3.0 release in March
. 2015 – 1.5 released in Sep
. 2016 – 1.6 released in Jan

[Figure: Activity for 6 months in 2014 (from Matei Zaharia – 2014 Spark Summit)]

. Spark is HOT!!!
. Most active project in Hadoop ecosystem
. One of top 3 most active Apache projects
. Databricks founded by the creators of Spark from UC Berkeley’s AMPLab

Why Spark?

 Spark is open, accelerating community innovation

 Spark is fast: up to 100x faster than Hadoop MapReduce for in-memory workloads

 Spark is about all data for large-scale data processing

 Spark supports agile data science to iterate rapidly

 Spark can be integrated with IBM solutions

Our partner ecosystem

IBM announces major commitment to advance Apache® Spark™ - The Analytics Operating System

…the most significant open source project of the next decade.

Our commitment to Spark

Founding member of AMPLab

 Contributing to Core

Open Source SystemML

 Flexible algorithms that can learn and make predictions based on data

 Our largest contribution to Open Source since Linux

 Unifies fractured machine learning environments and establishes standard

Our commitment to Spark

Educate one million data professionals

 Big Data University MOOC

 Partnerships with Databricks, AMPLab, DataCamp and MetiStream

Establish Spark Technology Center

 Build solutions to solve business problems

 Meetups, hackathons, and open source projects.

IBM is building on Spark

• IBM Analytics
• IBM Commerce
• IBM
• IBM Research
• IBM Cloud

Quarks from IBM (announced Feb 2016)
• Open-source platform for building IoT applications
• Light-weight & embeddable
• Integrates with Spark

Spark Internals and Integration with IBM products

Hadoop MapReduce Challenges

Ease of Development
• Need deep Java skills
• Few abstractions available for analysts

In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle

Combine Workflows
• Only suitable for batch workloads
• Rigid processing model

Spark Advantages

Ease of Development
• Easier APIs
• Python, Scala, Java

In-Memory Performance
• Resilient Distributed Datasets
• Unify processing

Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch

Spark Libraries

IBM Analytics for Apache Spark offering

What it is:

. Fully-managed Spark environment accessible on-demand, as a service

What you get:

. Access to Spark’s next-generation performance and capabilities, including built-in machine learning and other libraries
. Pay only for what you use
. No lock-in – 100% standard Spark runs on any standard distribution
. IBM hosted, managed, secure environment
. Elastic scaling – start with experimentation, extend to development and scale to production, all within the same environment
. Quick start – service is immediately ready for analysis, skipping setup hurdles, hassles and time
. Peace of mind – fully managed and secured, no DBAs or other admins necessary

Use case 1 – Spark for daily weather

An IBM Business

Clusters running hot:
 ~30 billion API requests per day
 ~120 million active mobile users
 #3 most active mobile user base
 Billions of events per day (1.3M/sec)
 ~360 PB of traffic daily
 Need to keep data forever

The use case:
 Efficient batch + streaming analysis
 Self-serve data science
 BI / visualization tool support

Use case 2 – Public Wi-Fi management

SolutionInc is a leader in managing public Internet access in hotels, convention centers, airports, coffee shops and other public venues around the world. The company offers on-premise and cloud-based solutions to provide high-demand public Wi-Fi.

Solution: By using IBM’s managed Spark service, IBM Analytics for Apache Spark, SolutionInc explored its big data, uncovering behavior patterns as customers visited and moved between different locations.

Benefits: The ability to help clients optimize their capacity at different locations differentiates SolutionInc from competitors and opens the door to the development of new service offerings.

Use case 3 – Retail Analytics

Smarter Data, Inc. leverages advanced data science technologies – predictive and prescriptive analytics – to help companies achieve relevance with their customers both online and in a retail environment, and manage the demands of digital-age business challenges.

Solution: SmarterData uses IBM® Analytics for Apache Spark to build next-generation retail analytics apps that combine operational and contextual data to give clients a new understanding of consumer desires.

Benefits: SmarterData’s clients can now perform real-time analysis, utilizing everything from point-of-sale data to weather data, empowering in-store employees to take immediate action on the shop floor.

Use case 4 – Predictive Maintenance in Oil and Gas

Pre-emptive maintenance and action using automated ESP mode detection and determination of pump failure and stoppage propensities.

DATA
• Pump Sensors (Discrete Time Values)
• Maintenance Data
• Well Test Data

AUTOMATED PUMP MODE DETECTION
Advanced analytical models identify patterns in historical data and automatically identify and categorize different modes of ESP operation. These modes characterize and distinguish different behaviors of a pump.

TELEMETRY TRENDS AND MODE TRANSITION PLOTS
Compare sensor trends across multiple pumps or visually inspect transitions and mode-specific anomalies.

FAILURE AND STOPPAGE PROPENSITY DETERMINATION
Monitor pump stoppage and failure propensities in near real-time. Take pre-emptive actions to maintain production levels and lower operational costs.

[Plot: series GLDU 448, GLDU 360, AICFU 448, Blakeney 140, GLDU 362; x-axis: Time]

RDDs – Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Write programs in terms of transformations on distributed datasets

• Automatically rebuilt on failure

Operations

• Transformations - e.g. map, filter, groupBy

• Actions - count, collect, save
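The split between lazy transformations and actions can be sketched in PySpark roughly as follows. This is a minimal illustration, not code from the talk: the function name `count_long_words`, its parameters, and the sample input are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
def count_long_words(sc, words, min_len):
    """Count the words longer than min_len using RDD operations.

    `sc` is assumed to be an existing SparkContext; the function name and
    its arguments are illustrative only.
    """
    # parallelize() distributes a local collection across the cluster as an RDD.
    rdd = sc.parallelize(words)

    # Transformations are lazy: map and filter only describe new RDDs here,
    # nothing is computed yet.
    lengths = rdd.map(len)
    long_ones = lengths.filter(lambda n: n > min_len)

    # An action triggers the computation and returns a result to the driver.
    return long_ones.count()
```

In the `pyspark` shell, where `sc` is predefined, a call such as `count_long_words(sc, ["spark", "db2", "rdd", "dataframe"], 4)` would run the whole pipeline when `count()` is reached.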

Working with RDDs

[Diagram: RDD Graph, DAG of Tasks]

Spark execution

. Each Spark application runs as a set of processes coordinated by the Spark context object (driver program)
– Spark context connects to Cluster Manager (standalone, Mesos/YARN)
– Spark context acquires executors (JVM instance) on worker nodes
– Spark context sends tasks to the executors

[Diagram: Driver Program (SparkContext), Cluster Manager, and two Worker Nodes, each with an Executor holding a Cache and Tasks]

About Spark SQL

. Spark's module for working with structured data, part of core since Spark 1.0

. Query structured data inside Spark programs using SQL or DataFrames API

. Bindings in Python, Scala, Java and R

. Apply functions to results of SQL queries

. Query and join different data sources

DataFrames

. Extension to RDD API

. A distributed collection of rows organized into named columns.

. Conceptually equivalent to a table in a relational database

. An abstraction for selecting, filtering, aggregating and plotting structured data

. Can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Less Code: Input and Output

. Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

Less Code: High Level Operations

Common operations can be expressed concisely as calls to the DataFrame API:

. Selecting required columns

. Joining different data sources

. Aggregation (count, sum, average, etc)

. Filtering
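The operations listed above can be chained in a few lines of PySpark. The sketch below is illustrative only: the function name and the column names `customer_id`, `country`, and `amount` are assumptions, and `orders` and `customers` stand for any two DataFrames (for example, one loaded from a database over JDBC and one read from a JSON file).

```python
def revenue_by_country(orders, customers, min_amount):
    """Join two DataFrames, then select, filter, and aggregate.

    The column names (customer_id, country, amount) are hypothetical and
    min_amount is assumed to be a number.
    """
    return (
        orders
        .join(customers, "customer_id")             # joining different data sources
        .select("country", "amount")                # selecting required columns
        .filter("amount >= {}".format(min_amount))  # filtering with a SQL expression
        .groupBy("country")                         # grouping rows for aggregation
        .sum("amount")                              # aggregation
    )
```

The result is itself a DataFrame, with the aggregated column named `sum(amount)` by default, so further operations can be chained onto it.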

DataFrames Example

Creating DataFrame

Using DataFrame

Using SQL with DataFrame
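The three steps named above (creating a DataFrame, using it, and querying it with SQL) might look roughly like this in PySpark. This is a hedged sketch rather than the code shown on the original slide: `adults_two_ways`, the JSON path, and the `name` and `age` fields are hypothetical, and `session` is assumed to be a SparkSession (Spark 2.0 or later).

```python
def adults_two_ways(session, json_path):
    """Return the same query result via the DataFrame API and via SQL.

    `session` is assumed to be a SparkSession; `json_path` and the
    name/age fields are hypothetical.
    """
    # Creating a DataFrame: the schema is inferred from the JSON records.
    people = session.read.json(json_path)

    # Using the DataFrame: filter rows, then keep two columns.
    by_api = people.filter("age >= 18").select("name", "age")

    # Using SQL with the DataFrame: register it under a name, then query it.
    # (Spark 1.x used people.registerTempTable("people") for this step.)
    people.createOrReplaceTempView("people")
    by_sql = session.sql("SELECT name, age FROM people WHERE age >= 18")

    return by_api, by_sql
```

Both return values are DataFrames describing the same rows; nothing is executed until an action such as `show()` or `collect()` is called on them.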

Key Integration Points

1. Your primary world is Spark …

… but you want to reach out to existing relational data sources such as DB2, dashDB, BigSQL

This means bringing the data from “DB2” to Spark

2. Your primary world is “DB2”...

… but you want to reach out into the Spark world to perform analytic computations that would be awkward or impossible in relational SQL

This means invoking Spark from “DB2”, passing it some input data, performing the Spark “computation” and then returning the answer, in parallel, back to “DB2”

Both are interesting and valid use cases!

IBM Data Server Integration with Spark

[Diagram labels: Apache Spark, IBM Data Server Dialect, IBM JDBC Driver, z/OS]
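Loading a DB2 table into Spark and writing a DataFrame back through the JDBC driver could be sketched as below. This is an assumption-laden illustration, not the demo code from the talk: the helper names are hypothetical, `session` is assumed to be a SparkSession (or a Spark 1.x SQLContext, which also exposes `read`), and the IBM JDBC driver JAR is assumed to be on the Spark classpath.

```python
# Class name of the IBM Data Server Driver for JDBC (JCC).
DB2_DRIVER = "com.ibm.db2.jcc.DB2Driver"


def db2_url(host, port, database):
    """Build a DB2 JDBC URL of the form jdbc:db2://host:port/database."""
    return "jdbc:db2://{}:{}/{}".format(host, port, database)


def db2_properties(user, password):
    """Connection properties that Spark passes through to the JDBC driver."""
    return {"user": user, "password": password, "driver": DB2_DRIVER}


def read_db2_table(session, url, table, properties):
    """Load a DB2 table into a Spark DataFrame (hypothetical helper)."""
    return session.read.jdbc(url=url, table=table, properties=properties)


def write_db2_table(df, url, table, properties, mode="append"):
    """Write a Spark DataFrame into a DB2 table (hypothetical helper)."""
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)
```

A call might look like `read_db2_table(spark, db2_url("db2host.example.com", 50000, "SAMPLE"), "EMPLOYEE", db2_properties(user, password))`, where the host, port, database, and table names are placeholders. The driver JAR can be supplied when launching Spark, for example with the `--jars` and `--driver-class-path` options.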

Invoking Spark from a DB2 application

Application Program

SELECT people.name, people.address
FROM TABLE (syshadoop.EXECSPARK (
    language => 'scala',
    class => 'com.ibm.biginsights.bigsql.examples.ReadJSON',
    path => '/user/bigsql/files/persons.json')
) AS people
WHERE people.country = 'CANADA'

High Level Spark Integration Layout for DB2 LUW Based MPP Systems

Demo: Spark Integration with DB2

. Demo 1 – Reading DB2 data from Spark shell

. Demo 2 – Joining DB2 data with JSON data in Spark using IntelliJ/Eclipse

Summary

 Spark is a powerful, rapidly evolving, integrative analytic run-time environment
 IBM is making a large investment in its future/success
 We have already delivered effective tools for accessing IBM databases from Spark
− Starting in earnest with Spark 1.6
 We are on the cusp of completing development on advanced new capability to invoke Spark computations from inside an SQL statement.
 This capability will be delivered in stages across the IBM SQL Engines
− BigSQL
− dashDB
− BLU Helix
− DB2 LUW
− Sailfish
− DB2 for z/OS
− IDAA vNext

Thank You!

Pallavi Priyadarshini
IBM

Please fill out your session evaluation before leaving!

Session: E11
Title: Spark and DB2 – A Perfect Couple

Photo by Steve from Austin, TX, USA