Apache Spark Implementation on IBM Z/OS
Total Page:16
File Type:pdf, Size:1020Kb
Front cover Apache Spark Implementation on IBM z/OS Lydia Parziale Joe Bostian Ravi Kumar Ulrich Seelbach Zhong Yu Ye Redbooks International Technical Support Organization Apache Spark Implementation on IBM z/OS August 2016 SG24-8325-00 Note: Before using this information and the product it supports, read the information in “Notices” on page vii. First Edition (August 2016) This edition applies to Version 2, Release 2 of IBM z/OS (product number 5650 ZOS), Apache Spark 1.5.2 © Copyright International Business Machines Corporation 2016. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Notices . vii Trademarks . viii IBM Redbooks promotions . ix Preface . xi Authors. xi Now you can become a published author, too . xii Comments welcome. xiii Stay connected to IBM Redbooks . xiii Chapter 1. Architectural overview . 1 1.1 Open source analytics on z/OS. 2 1.1.1 Benefits of Spark on z/OS. 2 1.1.2 Drawbacks of implementing off-platform analytics . 3 1.1.3 A new chapter in analytics . 4 1.2 Planning your environment . 4 1.3 Reference architecture . 6 1.3.1 Spark server architecture . 6 1.3.2 Spark environment architecture . 7 1.3.3 Implementation with Jupyter Notebooks . 9 1.3.4 Scala IDE . 12 1.4 Security . 13 Chapter 2. Components and extensions. 15 2.1 Apache Spark component overview . 16 2.1.1 Resilient Distributed Datasets and caching. 16 2.1.2 Components of a Spark cluster on z/OS. 17 2.1.3 Monitoring . 18 2.1.4 Spark and Hadoop . 21 2.2 Mainframe Data Services for IBM z/OS Platform for Apache Spark . 21 2.2.1 Virtual tables . 23 2.2.2 Virtual views . 23 2.2.3 SQL queries . 23 2.2.4 MDSS JDBC driver . 23 2.2.5 IBM z/OS Platform for Apache Spark Interface for CICS/TS . 24 2.3 Spark SQL. 25 2.3.1 Reading from z/OS data source into a DataFrame. 26 2.3.2 Writing DataFrame to a DB2 for z/OS table using saveTable method . 26 2.4 Streaming . 27 2.5 GraphX . 28 2.5.1 System G . 28 2.6 MLlib . 28 2.7 Spark R . 29 © Copyright IBM Corp. 2016. All rights reserved. iii Chapter 3. Installation and configuration . 31 3.1 Installing IBM z/OS Platform for Apache Spark . 32 3.2 The Mainframe Data Service for Apache Spark . 32 3.2.1 Installing the MDSS started task. 32 3.2.2 Configuring access to DB2 . 37 3.2.3 Configuring access to IMS databases. 38 3.2.4 The ISPF Panels. 39 3.2.5 Installing and configuring Bash . 45 3.2.6 Check for /usr/bin/env . 46 3.3 Installing workstation components . 46 3.3.1 Installing Data Service Studio . 46 3.3.2 Installing the JDBC driver on the workstation . 48 3.4 Configuring Apache Spark for z/OS . 49 3.4.1 Create log and worker directories . 49 3.4.2 Apache Spark directory structure . 50 3.4.3 Create directories and local configuration. 50 3.4.4 Installing the Data Server JDBC driver . 53 3.4.5 Modifying the log4j configuration . 54 3.4.6 Adding the Spark binaries to your PATH . 55 3.5 Verifying the installation . 55 3.6 Starting the Spark daemons . 57 Chapter 4. Spark application development on z/OS . 61 4.1 Setting up the development environment . 62 4.1.1 Installing Scala IDE. 62 4.1.2 Installing Data Server Studio plugins into Scala IDE . 62 4.1.3 Installing and using sbt . 63 4.2 Accessing VSAM data as an RDD . 65 4.2.1 Defining the data mapping . 65 4.2.2 Building and running the application . 70 4.3 Accessing sequential files and PDS members . 71 4.4 Accessing IBM DB2 data as a DataFrame . 71 4.5 Joining DB2 data with VSAM . 72 4.6 IBM IMS data to DataFrames . ..