Spark and DB2 -- a Perfect Couple
Pallavi Priyadarshini
Senior Technical Staff Member, Spark Technology Center, IBM
Session # E11 | Wed, May 25 (1 PM – 2 PM) | Platform: Cross platform

Agenda
• Objective 1: Spark fundamentals relevant to database integration
• Objective 2: Integration between Spark and IBM data servers through DataFrame API
• Objective 3: Loading DB2 data into Spark and writing Spark data into DB2
• Objective 4: Spark Use Cases
• Demo

Power of data. Simplicity of design. Speed of innovation.

Background

What is Spark
• An Apache Foundation open source project. Not a product.
• An in-memory compute engine that works with data. Not a data store.
• Enables highly iterative analysis on large volumes of data at scale
• Unified environment for data scientists, developers and data engineers
• Radically simplifies the process of developing intelligent apps fueled by data.

History of Spark
• 2002 – MapReduce @ Google
• 2004 – MapReduce paper
• 2006 – Hadoop @ Yahoo
• 2008 – Hadoop Summit
• 2010 – Spark paper
• 2014 – Apache Spark top-level
• 2014 – 1.2.0 release in December
• 2015 – 1.3.0 release in March
• 2015 – 1.5 released in Sep
• 2016 – 1.6 released in Jan
[Figure: Activity for 6 months in 2014 (from Matei Zaharia – 2014 Spark Summit)]
• Spark is HOT!!!
• Most active project in Hadoop ecosystem
• One of top 3 most active Apache projects
• Databricks founded by the creators of Spark from UC Berkeley’s AMPLab

Why Spark?
• Spark is open, accelerating community innovation
• Spark is fast: up to 100x faster than Hadoop MapReduce
• Spark is about all data for large-scale data processing
• Spark supports agile data science to iterate rapidly
• Spark can be integrated with IBM solutions

Our partner ecosystem

IBM announces major commitment to advance Apache® Spark™
The Analytics Operating System
…the most significant open source project of the next decade.

Our commitment to Spark
• Founding member of AMPLab
• Contributing to Core
• Open Source SystemML
• Flexible algorithms that can learn and make predictions based on data
• Our largest contribution to Open Source since Linux
• Unifies fractured machine learning environments and establishes standard
• Educate one million data professionals
• Big Data University MOOC
• Partnerships with Databricks, AMPLab, DataCamp and MetiStream
• Establish Spark Technology Center
• Build solutions to solve business problems
• Meetups, hackathons, and open source projects.

IBM is building on Spark
• IBM Analytics
• IBM Commerce
• IBM Watson
• IBM Research
• IBM Cloud

Quarks from IBM (announced Feb 2016)
• Open-source platform for building IoT applications
• Light-weight & embeddable
• Integrates with Spark

Spark Internals and Integration with IBM products

Hadoop MapReduce Challenges
Ease of Development
• Need deep Java skills
• Few abstractions available for analysts
In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle
Combine Workflows
• Only suitable for batch workloads
• Rigid processing model

Spark Advantages
Ease of Development
• Easier APIs
• Python, Scala, Java
In-Memory Performance
• Resilient Distributed Datasets
• Unify processing
Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch

Spark Libraries
[Figure: Spark Libraries]

IBM Analytics for Apache Spark offering
What it is:
• Fully-managed Spark environment accessible on-demand, as a service
• IBM hosted, managed, secure environment
What you get:
• Access to Spark’s next-generation performance and capabilities, including built-in machine learning and other libraries
• Pay only for what you use
• No lock-in – 100% standard Spark runs on any standard distribution
• Elastic scaling – start with experimentation, extend to development and scale to production, all within the same environment
• Quick start – service is immediately ready for analysis, skipping setup hurdles, hassles and time
• Peace of mind – fully managed and secured, no DBAs or other admins necessary

Use Case 1 – Spark for daily weather
The Weather Company (an IBM Business) clusters running hot:
• ~30 billion API requests per day
• ~120 million active mobile users
• #3 most active mobile user base
• Billions of events per day (1.3M/sec)
• ~360 PB of traffic daily
The use case:
• Efficient batch + streaming analysis
• Self-serve data science
• BI / visualization tool support
• Need to keep data forever

Use Case 2 – Public Wi-Fi management
SolutionInc is a leader in managing public Internet access in hotels, convention centers, airports, coffee shops and other public venues around the world. The company offers on-premise and cloud-based solutions for high-demand public Wi-Fi.
Solution
By using IBM’s managed Spark service, IBM Analytics for Apache Spark, SolutionInc explored its big data, uncovering behavior patterns as customers visited and moved between different locations.
Benefits
The ability to help clients optimize their capacity at different locations differentiates SolutionInc from competitors and opens the door to the development of new service offerings.

Use Case 3 – Retail Analytics
Smarter Data, Inc. leverages advanced data science technologies (predictive and prescriptive analytics) to help companies achieve relevance with their customers both online and in a retail environment, and manage the demands of digital-age business challenges.
Solution
SmarterData uses IBM® Analytics for Apache Spark to build next-generation retail analytics apps that combine operational and contextual data to give clients a new understanding of consumer desires.
Benefits
SmarterData’s clients can now perform real-time analysis, utilizing everything from point-of-sale data to weather data, empowering in-store employees to take immediate action on the shop floor.

Use Case 4 – Predictive Maintenance in Oil and Gas
Pre-emptive maintenance and action using automated ESP mode detection and determination of pump failure and stoppage propensities.

DATA
• Pump Sensors (Discrete Time Values)
• Maintenance Data
• Well Test Data

AUTOMATED PUMP MODE DETECTION
Advanced analytical models identify patterns in historical data and automatically identify and categorize different modes of ESP operation. These modes characterize and distinguish different behaviors of a pump.

TELEMETRY TRENDS AND MODE TRANSITION PLOTS
Compare sensor trends across multiple pumps or visually inspect transitions and mode-specific anomalies.
[Figure: per-pump plot over time; series labels GLDU 448, GLDU 360, AICFU 448, Blakeney 140, GLDU 362]

FAILURE AND STOPPAGE PROPENSITY DETERMINATION
Monitor pump stoppage and failure propensities in near real-time. Take pre-emptive actions to maintain production levels and lower operational costs.

RDDs – Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
• Write programs in terms of transformations on distributed datasets
• Automatically rebuilt on failure
Operations
• Transformations – e.g. map, filter, groupBy
• Actions – e.g. count, collect, save

Working with RDDs
[Figure: Working with RDDs]

RDD Graph / DAG of Tasks
[Figure: RDD Graph and DAG of Tasks]

Spark execution
• Each Spark application runs as a set of processes coordinated by the Spark context object (driver program)
  – Spark context connects to Cluster Manager (standalone, Mesos/Yarn)
  – Spark context acquires executors (JVM instance) on worker nodes
  – Spark context sends tasks to the executors
[Figure: Driver Program (SparkContext), Cluster Manager, and Worker Nodes, each with an Executor holding a Cache and Tasks]

About Spark SQL
• Spark's module for working with structured data, part of core since Spark 1.0
• Query structured data inside Spark programs using SQL or DataFrames API
• Bindings in Python, Scala, Java and R
• Apply functions to results of SQL queries
• Query and join different data sources

DataFrames
• Extension to RDD API
• A distributed collection of rows organized into named columns. Conceptually equivalent to a table in a relational database
• An abstraction for selecting, filtering, aggregating and plotting structured data
• Can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Less Code: Input and Output
• Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

Less Code: High Level Operations
Common operations can be expressed concisely as calls to the DataFrame API:
• Selecting required columns
• Joining different data sources
• Aggregation (count, sum, average, etc)
• Filtering

DataFrames Example
• Creating DataFrame
• Using DataFrame
• Using SQL with DataFrame

Key Integration Points
1. Your primary world is Spark …
   … but you want to reach out to existing relational data sources such as DB2, dashDB, BigSQL
   This means bringing the data from “DB2” to Spark
2. Your primary world is “DB2” …
   … but you want to reach out into the Spark world to perform analytic computations that would be awkward or impossible in relational SQL
   This means invoking Spark from “DB2”, passing it some input data, performing the Spark “computation” and then returning the answer, in parallel, back to “DB2”
Both are interesting and valid use cases!

IBM Data Server Integration with Spark
[Figure: IBM Data Server Integration with Spark; labels: Apache Spark, IBM Data Server Dialect, IBM JDBC Driver, z/OS]

Invoking Spark from a DB2 application
Application Program:

    SELECT people.name, people.address
    FROM TABLE (syshadoop.EXECSPARK (
             language => 'scala',
             class    => 'com.ibm.biginsights.bigsql.examples.ReadJSON',
             path     => '/user/bigsql/files/persons.json')
         ) AS people
    WHERE people.country = 'CANADA'

High Level Spark Integration Layout
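
The first integration point above (bringing "DB2" data into Spark) and Objective 3 of the agenda (loading DB2 data into Spark and writing Spark data back) could look roughly like the PySpark sketch below. It is not taken from the slides. It assumes the IBM JDBC driver jar is on the Spark classpath and that `spark` is an existing SQLContext or SparkSession. The host, port, database, table names and credentials are all hypothetical placeholders.

```python
# Minimal sketch: moving data between DB2 and Spark over JDBC with PySpark.
# Helper names and all connection details below are illustrative assumptions.


def db2_jdbc_url(host, port, database):
    """Build a DB2 JDBC URL of the form jdbc:db2://host:port/database."""
    return "jdbc:db2://{}:{}/{}".format(host, port, database)


def db2_connection_properties(user, password):
    """Connection properties handed to Spark's JDBC data source."""
    return {
        "user": user,
        "password": password,
        # Driver class shipped in the IBM Data Server Driver for JDBC.
        "driver": "com.ibm.db2.jcc.DB2Driver",
    }


def load_db2_table(spark, url, table, properties):
    """Read a DB2 table into a Spark DataFrame.

    `spark` is any object exposing `.read` (SQLContext or SparkSession).
    """
    return spark.read.jdbc(url=url, table=table, properties=properties)


def save_to_db2(df, url, table, properties, mode="append"):
    """Write a Spark DataFrame into a DB2 table ("append" adds rows)."""
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)
```

With a live session this might be used as `df = load_db2_table(spark, db2_jdbc_url("db2host.example.com", 50000, "SAMPLE"), "EMPLOYEE", db2_connection_properties("user", "secret"))`, followed by ordinary DataFrame operations and a `save_to_db2(...)` call to write results back. Keeping the URL and properties in small helpers is only a convenience for the sketch. The same values can be passed to `read.jdbc` and `write.jdbc` directly.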