Spark and DB2 -- A perfect couple
Pallavi Priyadarshini, pallavipr@in.ibm.com
Senior Technical Staff Member, Spark Technology Center, IBM
Session # E11 | Wed, May 25 (1 PM – 2 PM) | Platform: Cross platform
Agenda
Objective 1: Spark fundamentals relevant to database integration
Objective 2: Integration between Spark and IBM data servers through DataFrame API
Objective 3: Loading DB2 data into Spark and writing Spark data into DB2
Objective 4: Spark Use Cases
Demo
Power of data. Simplicity of design. Speed of innovation.
Background
What is Spark
An Apache Foundation open source project. Not a product.
An in-memory compute engine that works with data. Not a data store.
Enables highly iterative analysis on large volumes of data at scale
Unified environment for data scientists, developers and data engineers
Radically simplifies process of developing intelligent apps fueled by data.
History of Spark
. 2002 – MapReduce @ Google
. 2004 – MapReduce paper
. 2006 – Hadoop @ Yahoo
. 2008 – Hadoop Summit
. 2010 – Spark paper
. 2014 – Apache Spark top-level
. 2014 – 1.2.0 release in December
. 2015 – 1.3.0 release in March
. 2015 – 1.5 released in Sep
. 2016 – 1.6 released in Jan
[Chart: Activity for 6 months in 2014 (from Matei Zaharia, 2014 Spark Summit)]
. Spark is HOT!!!
. Most active project in Hadoop ecosystem
. One of top 3 most active Apache projects
. Databricks founded by the creators of Spark from UC Berkeley’s AMPLab
Why Spark?
Spark is open, accelerating community innovation
Spark is fast: up to 100x faster than Hadoop MapReduce for in-memory workloads
Spark handles all kinds of data in large-scale data processing
Spark supports agile data science to iterate rapidly
Spark can be integrated with IBM solutions
Our partner ecosystem
IBM announces major commitment to advance Apache® Spark™: The Analytics Operating System
…the most significant open source project of the next decade.
Our commitment to Spark
Founding member of AMPLab
Contributing to Core
Open Source SystemML
Flexible algorithms that can learn and make predictions based on data
Our largest contribution to Open Source since Linux
Unifies fractured machine learning environments and establishes standard
Our commitment to Spark
Educate one million data professionals
Big Data University MOOC
Partnerships with Databricks, AMPLab, DataCamp and MetiStream
Establish Spark Technology Center
Build solutions to solve business problems
Meetups, hackathons, and open source projects.
IBM is building on Spark
• IBM Analytics
• IBM Commerce
• IBM Watson
• IBM Research
• IBM Cloud
Quarks from IBM (Announced Feb 2016)
• Open-source platform for building IoT applications
• Light-weight & embeddable
• Integrates with Spark
Spark Internals and Integration with IBM products
Hadoop MapReduce Challenges
Ease of Development
• Need deep Java skills
• Few abstractions available for analysts
In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle
Combine Workflows
• Only suitable for batch workloads
• Rigid processing model
Spark Advantages
Ease of Development
• Easier APIs
• Python, Scala, Java
In-Memory Performance
• Resilient Distributed Datasets
• Unify processing
Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch
Spark Libraries
IBM Analytics for Apache Spark offering
What it is:
. Fully-managed Spark environment accessible on-demand as a service
. IBM hosted, managed, secure environment
What you get:
. Access to Spark’s next-generation performance and capabilities, including built-in machine learning and other libraries
. Pay only for what you use
. No lock-in – 100% standard Spark runs on any standard distribution
. Elastic scaling – start with experimentation, extend to development and scale to production, all within the same environment
. Quick start – service is immediately ready for analysis, skipping setup hurdles, hassles and time
. Peace of mind – fully managed and secured, no DBAs or other admins necessary
Use case 1: Spark for daily weather
The Weather Company, An IBM Business
The Weather Company clusters running hot:
. ~30 billion API requests per day
. ~120 million active mobile users
. #3 most active mobile user base
. Billions of events per day (1.3M/sec)
. ~360 PB of traffic daily
The use case:
. Efficient batch + streaming analysis
. Self-serve data science
. BI / visualization tool support
. Need to keep data forever
Use case 2: Public Wi-Fi management
SolutionInc is a leader in managing public Internet access in hotels, convention centers, airports, coffee shops and other public venues around the world. The company offers on-premise and cloud-based solutions for high-demand public Wi-Fi.
Solution: By using IBM’s managed Spark service, IBM Analytics for Apache Spark, SolutionInc explored its big data, uncovering behavior patterns as customers visited and moved between different locations.
Benefits: The ability to help clients optimize their capacity at different locations differentiates SolutionInc from competitors and opens the door to the development of new service offerings.
Use case 3: Retail Analytics
Smarter Data, Inc. uses advanced data science technologies (predictive and prescriptive analytics) to help companies stay relevant to their customers, both online and in a retail environment, and to manage the demands of digital-age business.
Solution: SmarterData uses IBM® Analytics for Apache Spark to build next-generation retail analytics apps that combine operational and contextual data to give clients a new understanding of consumer desires.
Benefits: SmarterData’s clients can now perform real-time analysis, utilizing everything from point-of-sale data to weather data, empowering in-store employees to take immediate action on the shop floor.
Use case 4: Predictive Maintenance in Oil and Gas
Pre-emptive maintenance and action using automated ESP mode detection and determination of pump failure and stoppage propensities.
DATA
• Pump Sensors (Discrete Time Values)
• Maintenance Data
• Well Test Data
AUTOMATED PUMP MODE DETECTION
Advanced analytical models identify patterns in historical data and automatically identify and categorize different modes of ESP operation. These modes characterize and distinguish different behaviors of a pump.
TELEMETRY TRENDS AND MODE TRANSITION PLOTS
Compare sensor trends across multiple pumps or visually inspect transitions and mode-specific anomalies.
FAILURE AND STOPPAGE PROPENSITY DETERMINATION
Monitor pump stoppage and failure propensities in near real-time. Take pre-emptive actions to maintain production levels and lower operational costs.
[Plot: series labelled GLDU 448, GLDU 360, AICFU 448, Blakeney 140, GLDU 362; x-axis: Time]
RDDs: Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
• Write programs in terms of transformations on distributed datasets
• Automatically rebuilt on failure
Operations
• Transformations - e.g. map, filter, groupBy
• Actions - count, collect, save
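The split between transformations and actions can be illustrated with a toy model. This is a sketch in plain Python, not Spark's API: `ToyRDD` is a hypothetical class that only records transformations and replays them when an action is called, which is also why a lost result can be rebuilt from its recorded steps.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, single-process, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = source          # re-iterable input data
        self._steps = tuple(steps)     # recorded transformations (the "lineage")

    # Transformations: return a new ToyRDD, do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Replay the recorded steps over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger evaluation and return a result to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


# Building the pipeline computes nothing; collect() and count() do the work.
evens_squared = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
print(evens_squared.count())    # 5
```

In PySpark the same pipeline would read `sc.parallelize(range(10)).filter(...).map(...).collect()`, with the work distributed across executors instead of run in one process.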
Working with RDDs
[Figure: RDD graph and the resulting DAG of tasks]
Spark execution
. Each Spark application runs as a set of processes coordinated by the Spark context object (driver program)
– Spark context connects to Cluster Manager (standalone, Mesos or YARN)
– Spark context acquires executors (JVM instances) on worker nodes
– Spark context sends tasks to the executors
[Diagram: Driver Program (SparkContext) connects to the Cluster Manager and to Worker Nodes; each Worker Node runs an Executor with a Cache and Tasks]
About Spark SQL
. Spark's module for working with structured data, part of core since Spark 1.0
. Query structured data inside Spark programs using SQL or DataFrames API
. Bindings in Python, Scala, Java and R
. Apply functions to results of SQL queries
. Query and join different data sources
DataFrames
. Extension to RDD API
. A distributed collection of rows organized into named columns.
. Conceptually equivalent to a table in a relational database
. An abstraction for selecting, filtering, aggregating and plotting structured data
. Can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
Less Code: Input and Output
. Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
Less Code: High Level Operations
Common operations can be expressed concisely as calls to the DataFrame API:
. Selecting required columns
. Joining different data sources
. Aggregation (count, sum, average, etc)
. Filtering
DataFrames Example
Creating DataFrame
Using DataFrame
Using SQL with DataFrame
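A minimal PySpark sketch of the three steps named above (not the code from the original slides). The sample rows, column names and function names are illustrative assumptions. `spark` is assumed to be a `SparkSession`; on Spark 1.6 it would be a `SQLContext`, with `registerTempTable` in place of `createOrReplaceTempView`.

```python
def create_people(spark):
    """Creating a DataFrame: local rows plus a list of column names."""
    rows = [("Alice", 34, "CANADA"), ("Bob", 28, "INDIA"), ("Carol", 45, "CANADA")]
    return spark.createDataFrame(rows, ["name", "age", "country"])


def names_over_30(people):
    """Using the DataFrame API: filter rows, then select a column."""
    return people.filter("age > 30").select("name")


def average_age_by_country(people):
    """Using the DataFrame API: group and aggregate."""
    return people.groupBy("country").agg({"age": "avg"})


def canadians_via_sql(spark, people):
    """Using SQL: expose the DataFrame under a name, then query it."""
    people.createOrReplaceTempView("people")
    return spark.sql("SELECT name, age FROM people WHERE country = 'CANADA'")


# Typical use inside a PySpark session (requires a Spark installation):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("dataframes-example").getOrCreate()
#   people = create_people(spark)
#   names_over_30(people).show()
#   average_age_by_country(people).show()
#   canadians_via_sql(spark, people).show()
```

Each function returns a new DataFrame and nothing is computed until an action such as `show()` or `collect()` is called.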
Key Integration Points
1. Your primary world is Spark …
… but you want to reach out to existing relational data sources such as DB2, dashDB, BigSQL
This means bringing the data from “DB2” to Spark
2. Your primary world is “DB2”...
… but you want to reach out into the Spark world to perform analytic computations that would be awkward or impossible in relational SQL
This means invoking Spark from “DB2”, passing it some input data, performing the Spark “computation” and then returning the answer, in parallel, back to “DB2”
Both are interesting and valid use cases!
IBM Data Server Integration with Spark
IBM Data Server Dialect
IBM JDBC Driver / Apache Spark z OS
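Bringing DB2 data into Spark, and writing Spark data back into DB2, goes through Spark's JDBC data source with the IBM JDBC driver on the classpath. The following is a hedged sketch: the helper function names are hypothetical, and host, port, database, table names and credentials are placeholders. `DataFrameReader.jdbc` and `DataFrameWriter.jdbc` are the standard Spark calls.

```python
def db2_connection(host, port, database, user, password):
    """Build the JDBC URL and connection properties for a DB2 server."""
    url = "jdbc:db2://{0}:{1}/{2}".format(host, port, database)
    properties = {
        "user": user,
        "password": password,
        "driver": "com.ibm.db2.jcc.DB2Driver",  # driver class in the IBM JDBC driver jar
    }
    return url, properties


def load_table(spark, url, properties, table):
    """Read a DB2 table into a Spark DataFrame."""
    return spark.read.jdbc(url, table, properties=properties)


def save_table(df, url, properties, table, mode="append"):
    """Write a Spark DataFrame into a DB2 table."""
    df.write.jdbc(url, table, mode=mode, properties=properties)


# Typical use (placeholder values; requires Spark and the IBM JDBC driver jar):
#   url, props = db2_connection("dbhost", 50000, "SAMPLE", "db2inst1", "secret")
#   employees = load_table(spark, url, props, "EMPLOYEE")
#   save_table(employees.filter("SALARY > 50000"), url, props, "HIGH_EARNERS")
```

The driver jar (for example `db2jcc4.jar`) has to be visible to both the driver and the executors, for instance through `--jars` and `--driver-class-path` when launching the Spark shell.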
Invoking Spark from a DB2 application
Application Program
SELECT people.name, people.address
FROM TABLE (syshadoop.EXECSPARK (
         language => 'scala',
         class    => 'com.ibm.biginsights.bigsql.examples.ReadJSON',
         path     => '/user/bigsql/files/persons.json')
     ) AS people
WHERE people.country = 'CANADA'
High Level Spark Integration Layout for DB2 LUW Based MPP Systems
Demo
Demo: Spark Integration with DB2
. Demo 1 – Reading DB2 data from Spark shell
. Demo 2 – Joining DB2 data with JSON data in Spark using IntelliJ/Eclipse
Summary
. Spark is a powerful and rapidly evolving integrative analytic run-time environment
. IBM is making a large investment in its future/success
. We have already delivered effective tools for accessing IBM databases from Spark
  − Starting in earnest with Spark 1.6
. We are on the cusp of completing development on advanced new capability to invoke Spark computations from inside an SQL statement.
. This capability will be delivered in stages across the IBM SQL Engines
  − BigSQL
  − dashDB
  − BLU Helix
  − DB2 LUW
  − Sailfish
  − DB2 for z/OS
  − IDAA vNext
Thank You!
Pallavi Priyadarshini, IBM
pallavipr@in.ibm.com
Please fill out your session evaluation before leaving!
Session: E11
Title: Spark and DB2 – A Perfect Couple
Photo by Steve from Austin, TX, USA