This presentation focuses on key technologies in DB2 11 for z/OS that enable seamless integration between DB2 and BigInsights, using both structured and unstructured data. We show how to enable the DB2 server to integrate with BigInsights. Using a scenario common to all DB2 for z/OS users, we demonstrate how Jaql, JAQL_SUBMIT, and HDFS_READ can be used to build an integration solution between BigInsights and DB2 through a machine-data analysis.

Big data comes from many sources; it is much more than traditional data sources. In order to capitalize on the breakthrough opportunities we've discussed, you definitely need to look beyond traditional data sources. But at the same time, don't forget that big data comes from those traditional sources too: transactional and application data is growing at a significant rate, and although it is structured, that data is large and is contained in many different structures. Big data includes machine data: logs, web logs, instrumentation data, network data. Data generated by machines is multiplying quickly, and it contains valuable insights that need to be discovered. Social data also needs to be incorporated. Most social data is really textual data, and the valuable insights remain locked within that text and its many possible meanings. Most of that data isn't valuable, or has a very short expiry date during which it is valuable. That makes social data very challenging: extracting insight from largely textual content in very little time. And enterprise content must be amalgamated as well; it comes in many forms, and also in significant volume.

Businesses started realizing that there were huge opportunities to further lower risk and cost and create more up-sell and cross-sell opportunities by looking at information on Facebook, Twitter, and LinkedIn, studying data from telemetry devices (machine to machine), and detecting customer sentiment in emails, audio, and video. The challenge has been how to integrate all this "noise". Slowly, over time, the circle of trust will widen to include other forms of "differently structured" data. We no longer say semi-structured or unstructured; we call it "differently structured". For example, email is structured: From, To, date and time stamp, Subject, attachments, and a main body containing sentences, verbs, nouns, adjectives, prepositions, and closing remarks. It is just structured differently from relational data. The reason we do this is to enhance and augment our knowledge about entities relevant to our business. That way, we can gain deeper insights that help lower business risks and costs, and increase revenue and profit through innovative business models.

Our product management, engineering, marketing, CTP, and other teams have all been working together to better understand the big data market. We have done surveys, met with analysts and studied their findings, and met in person with customers and prospects (over 300 meetings), and we are confident that we found market "sweet spots" for big data. These five use cases are our sweet spots. They will resonate with the majority of prospects that you meet with. In the coming slides we'll cover each of these in detail; we'll walk through the need, the value, and a customer example.

Hadoop is an open-source software framework that supports data-intensive distributed applications. It is designed to run on large clusters of commodity hardware. The Hadoop framework is designed more for batch processing than for interactive use, so the emphasis is on high throughput of data access.

Above the file system is the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible.

HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework that provides high-throughput access to application data.

A Hadoop instance typically has a single NameNode, and a cluster of DataNodes forms the HDFS cluster. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS.

Large files are broken into blocks of fixed size (default 64 MB) and distributed across multiple machines. Blocks are replicated, and block replicas are distributed across servers and racks. HDFS achieves reliability by replicating the data across multiple nodes, and hence does not require RAID storage. With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. This policy cuts inter-rack write traffic, which generally improves write performance, and the chance of rack failure is far less than that of node failure. DataNodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.

HDFS was designed to handle very large files. It was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations. HDFS applications need a write-once-read-many access model for files: a file, once created, written, and closed, need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access.

Multiple Mappers send data to multiple Shuffles, which then pass the data on to the Reducers: divide and conquer plus massive parallelism. Take an example: an input file is split into four blocks, and each block contains a list of receipts. There are four mappers; each reads one block in parallel, reads each receipt, and generates a (seller, amount) pair. All the pairs for seller1 are sent to reducer R1, and all seller2 pairs are sent to R2. Each reducer then calculates the total amount for its seller.
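For orientation, the result this MapReduce job produces is the same as a plain SQL aggregation. The RECEIPTS table and its columns below are illustrative stand-ins for the input file, not anything defined in the presentation.

  -- Hypothetical RECEIPTS table standing in for the receipt records in the file blocks.
  -- The map phase emits (seller, amount) pairs, the shuffle groups them by seller,
  -- and the reduce phase sums each group. In SQL terms, that is simply:
  SELECT SELLER,
         SUM(AMOUNT) AS TOTAL_AMOUNT
    FROM RECEIPTS
   GROUP BY SELLER;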

The Map function only cares about the current key and value. The Reduce function only cares about the current key and its values. A Mapper can invoke Map on an arbitrary number of input keys and values, or on just some fraction of the input data set. A Reducer can invoke Reduce on an arbitrary number of the unique keys, but with all the values for each of those keys.

I'll give you a quick summary of BigInsights now, and then we'll dive into details in the next several charts. BigInsights is IBM's strategic platform for managing and analyzing persistent big data. As you'll see, it's based on open source and IBM technologies. Internally, the BigInsights project is being run like a start-up. By that, I mean that IBM is engaging deeply with a number of early customers to shape the future direction of our product. We're purposefully keeping our plans flexible to accommodate rapidly changing requirements in this emerging technology area.

Some of the characteristics that distinguish BigInsights include its built-in support for analytics, its integration with other enterprise software, and its production readiness. We'll talk more about these topics shortly. But before I leave this chart, I want to point out that IBM is uniquely positioned to provide customers with the necessary software, hardware, services, and research advances in the world of big data.

The Standard and Enterprise Editions are IBM’s supported production offerings. They contain IBM-unique technologies in addition to open source technologies. For those who want to work with a free, non-production version of BigInsights, we also offer our Quick Start Edition. It’s similar in content to Standard, plus it lets you experiment with Big R and text analytics as well. Details on the different editions are available as supplemental slides in this deck and in the announcement materials.

There are a lot of technologies used inside BigInsights. For DB2 for z/OS to connect to BigInsights, we use Jaql.


Big data and business analytics represent the new IT battleground. Here are some stats:
• IDC estimates the big data market will reach $16.9 billion by 2015, and that enterprises will invest more than $120 billion that same year to capture the business impact of analytics, across hardware, software, and services.
• The "digital universe" will grow to 2.7 ZB in 2012, up 48% from 2011, and is rocketing toward nearly 8 ZB by 2015 (IDC).
• 53% of business leaders don't have access to the information from across their organizations that they need to do their jobs (IBM CMO Study).
• Organizations applying analytics to data for competitive advantage are 2.2x more likely to substantially outperform their industry peers (MIT/IBV Report).

The amount and types of data being captured for business analysis are growing. A classic example of this large superset of data is web logs, which contain unstructured raw data.

Increasingly, unstructured data is being stored on new frameworks. These infrastructures encompass hardware and software support such as new file systems, query languages, and appliances; a prime example is Hadoop.

So what is Hadoop?
• A Java-based framework that supports data-intensive distributed applications and allows applications to work with thousands of nodes and petabytes of data.
• The Hadoop framework is ideal for distributed processing of large data sets.
• It utilizes a distributed file system that is designed to be highly fault tolerant, allows high-throughput access to data, and is suitable for applications that have large data sets.

The DB2 11 goal is to connect DB2 with IBM's Hadoop-based BigInsights big data platform, and to provide customers a way to integrate their traditional applications on DB2 for z/OS with big data analytics. Analytics jobs can be specified in Jaql (a JSON query language) and submitted to IBM's big data platform, and the results are stored in the Hadoop Distributed File System (HDFS). DB2 11 also provides a user-defined function to access the Hadoop file system so that applications can read the analytic results stored in HDFS.
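As a rough sketch of that flow, the statements below submit a Jaql job from DB2 and then read the result file back from HDFS. The host names, file paths, and the exact argument order of JAQL_SUBMIT and HDFS_READ are assumptions for illustration only; check the DB2 11 documentation for the real signatures and URL forms in your installation.

  -- Submit a Jaql analytics job to the BigInsights server
  -- (argument order assumed: script, HDFS result file, server URL, options).
  SELECT JAQL_SUBMIT(
           '<jaql analysis script>',                 -- the analytics job, written in Jaql
           'hdfs:///path/result.csv',                -- assumed result file in HDFS
           'http://bihost.example.com:8080',         -- assumed BigInsights server URL
           '')                                       -- options
    FROM SYSIBM.SYSDUMMY1;

  -- Read the analytic result back into SQL through the generic table UDF.
  SELECT T.C1, T.C2
    FROM TABLE(HDFS_READ('hdfs:///path/result.csv', ''))
         AS T(C1 VARCHAR(100), C2 INTEGER);          -- column names/types are illustrative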

(Remember that traditional table UDFs require that the output schema of the UDF be specified statically at function creation time. A different external user-defined table function would therefore be needed for each Hadoop file that produces a different output schema.)

DB2 11 will provide a table UDF (HDFS_READ) to read the big data analytic result from HDFS so that it can be used in an SQL query. Since the shape of HDFS_READ's output table varies, DB2 11 will also support a generic table UDF capability, which improves the usability of HDFS_READ.

Without this, a different external user-defined table function would be needed for each Hadoop file that produces a different output schema. DB2 11 will therefore implement a new kind of user-defined table function, called a generic table UDF, whose output schema is determined at query compile time. Generic table UDFs are polymorphic, which increases reusability: the same table function can read different Hadoop files and produce different output tables.

====== JSON & Jaql
JSON (JavaScript Object Notation) is a text-based open standard designed for human-readable data interchange. It is derived from the JavaScript scripting language for representing simple data structures and associative arrays, called objects. Despite its relationship to JavaScript, it is language independent, with parsers available for many languages. The JSON format is often used for serializing and transmitting structured data over a network connection. It is used primarily to transmit data between a server and a web application, serving as an alternative to XML.
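The polymorphism shows up in the typed correlation clause: the same HDFS_READ function can be given a different output description for each file it reads, resolved when the query is compiled. The file names and column definitions below are illustrative assumptions.

  -- Same generic table UDF, two different result shapes.
  SELECT *
    FROM TABLE(HDFS_READ('hdfs:///jobs/result1.csv', ''))
         AS T(JOBNAME VARCHAR(8), START_TS TIMESTAMP);      -- illustrative shape 1

  SELECT *
    FROM TABLE(HDFS_READ('hdfs:///sales/result2.csv', ''))
         AS T(SELLER VARCHAR(20), TOTAL DECIMAL(15,2));     -- illustrative shape 2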

Jaql is a JSON query language whose input/output is designed for extensibility. Input can be anything that produces JSON values, and output can be anything that consumes JSON values. Examples are:
• Any Hadoop InputFormat and OutputFormat
• HDFS files, Facebook's Thrift format, HBase tables, Facebook's Hive partitioned files
• Queries on these are transformed into map/reduce
• Local files
• JDBC sources

Java MapReduce: most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop).

Jaql is a functional, declarative query language that is designed to process large data sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into "low-level" queries consisting of MapReduce jobs.

SQL within Jaql: Jaql integrates an SQL expression that should make it easier for users with an SQL background to write MapReduce scripts in Jaql for the BigInsights environment. SQL within Jaql also makes it easier to integrate existing SQL applications and tooling with Jaql.

The HDFS_READ table function returns one row for each record (or line) in the file. If the number of result columns, m, is less than the number of fields in each record, the first m fields of each record are returned. If a field has no value (two adjacent commas), a null value is returned for the corresponding column.
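A small worked example of those rules, under the same assumptions about the function's arguments; the file name and column names are hypothetical.

  -- Suppose each record in the HDFS file has three comma-separated fields,
  -- and the second field of this record is empty:
  --   JOB00123,,2014-06-14
  -- Declaring only two result columns returns the first two fields of each record;
  -- the empty second field comes back as NULL.
  SELECT T.JOBID, T.OWNER
    FROM TABLE(HDFS_READ('hdfs:///idz1470/jobs.csv', ''))
         AS T(JOBID VARCHAR(8), OWNER VARCHAR(8));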


This one defines a Jaql variable "syslog", which reads our syslog file "hdfs:///idz1470/syslog3sec.txt" from HDFS using the "lines" operator. That result is "piped" to the "count" operator (the arrow "->" operator is a "pipe" in Jaql), and the result of count is written to the file "hdfs:///idz1470/iod00s/lab3e1.csv". In this exercise, however, the count operator is not actually used: the lines are simply returned as strings, and the GENERIC TABLE returned by HDFS_READ contains a single column of type VARCHAR(116) to hold the lines from the syslog.
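A hedged sketch of what this exercise might look like from the DB2 side. The file names come from the notes above; the server URL and the exact JAQL_SUBMIT argument order are assumptions.

  -- Jaql: read the syslog lines and write them, unchanged, to the output file.
  SELECT JAQL_SUBMIT(
           'syslog = read(lines("hdfs:///idz1470/syslog3sec.txt")); ' ||
           'syslog -> write(lines("hdfs:///idz1470/iod00s/lab3e1.csv"));',
           'hdfs:///idz1470/iod00s/lab3e1.csv',
           'http://bihost.example.com:8080',          -- assumed server URL
           '')
    FROM SYSIBM.SYSDUMMY1;

  -- Read the result back: one VARCHAR(116) column per syslog line.
  SELECT T.LOGLINE
    FROM TABLE(HDFS_READ('hdfs:///idz1470/iod00s/lab3e1.csv', ''))
         AS T(LOGLINE VARCHAR(116));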

Most of this is similar to LAB3E1, but in this exercise we apply a filter to the lines before counting them: filter(strPos($,"$HASP373")>=0). The filter decides which lines to "keep", and here it uses the Jaql built-in function strPos to determine whether there is an occurrence of the string "$HASP373" (a "start job" message code) in the line. A non-negative position of "$HASP373" indicates that it occurred at least once.
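The change from the previous sketch is just the filter step inside the submitted script. This is only a rough sketch: the output file name is an assumed placeholder, and the exact Jaql needed to write the scalar count to a CSV file may differ.

  SELECT JAQL_SUBMIT(
           '[ read(lines("hdfs:///idz1470/syslog3sec.txt")) ' ||
           '    -> filter(strPos($, "$HASP373") >= 0) ' ||      -- keep only $HASP373 lines
           '    -> count() ] ' ||
           '  -> write(del("hdfs:///idz1470/iod00s/lab3e2.csv"));',  -- assumed output path
           'hdfs:///idz1470/iod00s/lab3e2.csv',
           'http://bihost.example.com:8080',
           '')
    FROM SYSIBM.SYSDUMMY1;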

Perhaps we want to make access to the BigInsights cluster transparent for users of DB2 SQL. By creating a view on HDFS_READ, we can make the results coming from BigInsights look like any other DB2 table. The benefit of creating a view as in LAB4E3 is that access is as if it were a regular DB2 table, and the data is always current with the BigInsights cluster, because the data is pulled over the network each time the view is accessed. However, in cases where the data is accessed frequently and the result set is large, the network traffic may become an issue.

An alternative is to create a real table in DB2 to hold the results of the SELECT from HDFS_READ. We can use a DB2 INSERT statement to populate our new table. The advantage of doing this is that it is a real table with data stored in DB2, so access behaves just as you would expect when accessing a DB2 table directly: it can be indexed and cached in the buffer pool for performance. The downside is that if the data on the BigInsights cluster changes, the data must be refreshed manually in DB2.

Sometimes this is not sufficient for the kinds of analysis that we want to do. In this exercise, we want to join the SYSLOG "table" to itself. Suppose we knew that an anomaly occurred on the system around the time that the job SMFDMP28 was run. We want to find which jobs might have been affected by this anomaly, so we want to analyze which jobs started after SMFDMP28. (A hedged SQL sketch of the view, the table copy, and this self-join appears a few paragraphs below.)

CO:Z is a third-party application that uses Dataset Pipes to connect to an SSH server on z/OS and transfer a z/OS data set to a Linux client system and into Hadoop.

Apache Sqoop is a tool for data transfers between relational databases and Hadoop. One significant benefit of Sqoop is that it is easy to use and can work with a variety of systems: with one tool, you can import or export data from all databases supporting the JDBC interface, using the same command-line arguments exposed by Sqoop.
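As noted above, here is a sketch of the view over HDFS_READ, the materialized DB2 copy, and the self-join on SMFDMP28. The view, table, and column names, the file URL, and the assumption that the result file carries a job name and a start timestamp are all illustrative, not the lab's actual definitions.

  -- 1. A view that makes the BigInsights result look like an ordinary DB2 table;
  --    every query against it pulls current data over the network.
  CREATE VIEW SYSLOG_V (JOBNAME, START_TS) AS
    SELECT T.JOBNAME, T.START_TS
      FROM TABLE(HDFS_READ('hdfs:///idz1470/iod00s/lab4.csv', ''))  -- illustrative file
           AS T(JOBNAME VARCHAR(8), START_TS TIMESTAMP);

  -- 2. A real DB2 table holding a copy of the result; indexable and cacheable,
  --    but it must be refreshed manually when the HDFS data changes.
  CREATE TABLE SYSLOG_T (JOBNAME VARCHAR(8), START_TS TIMESTAMP);
  INSERT INTO SYSLOG_T
    SELECT JOBNAME, START_TS FROM SYSLOG_V;

  -- 3. Self-join: which jobs started after SMFDMP28 started?
  SELECT B.JOBNAME, B.START_TS
    FROM SYSLOG_T A, SYSLOG_T B
   WHERE A.JOBNAME = 'SMFDMP28'
     AND B.START_TS > A.START_TS;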

Sqoop ships with specialized connectors for MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, IBM DB2 LUW, and Netezza.

For DB2 for z/OS, Sqoop's generic JDBC connector with the DB2 JDBC driver is used to connect.

Sqoop import from DB2 to Hadoop: importing 978.46 MB of data from DB2 to Hadoop via Sqoop, split by the primary key column and using 9 map tasks:

  sqoop import --driver com.ibm.db2.jcc.DB2Driver \
    --connect jdbc:db2://9.30.137.207:59000/TOMLOC1 \
    --username SYSADM --password whoopi \
    --table SYSADM.DOOPTB01 --split-by C1 \
    --target-dir "/user/biadmin/NY1Gimport" -m 9

Veristorm's vStorm Enterprise consists of two capabilities, both of which run exclusively on Linux on IBM System z:
• vStorm Connect – a set of native data collectors that uses a graphical interface to facilitate the movement of z/OS data to Hadoop
• zDoop – the industry's first commercial, fully supported implementation of the open source Apache Hadoop project on Linux on System z

vStorm Enterprise is proving to be quite efficient. Veristorm worked with us to run a series of tests that demonstrated solid scalability, and as I mentioned earlier their 2-2-2 test demonstrated agility in driving a representative workload. But the real value of Veristorm's product comes from how being able to run Hadoop directly on System z addresses some of the most critical needs of mainframe shops.

A secure pipe for data. Virtually all mainframe files and logs have the same degree of sensitivity as operational relational data, and therefore require the same governance and security models. Transmitting such sensitive data over public networks, to racks of commodity servers, breaks those governance and security models and puts data at risk of compromise. An on-platform solution keeps data within the confines of the mainframe in order to meet compliance challenges without adding the complexity of coordinating multiple governance and security zones or worrying about firewalls. vStorm Enterprise is fully integrated with the z/OS security manager, RACF, so there is no need to establish unique and potentially inconsistent credentials to use vStorm Enterprise; security is built in, right out of the box. Keeping mainframe data on the mainframe is a critical requirement for many clients, and zDoop enables Hadoop analysis of mainframe data without that data ever leaving the security zone of the mainframe.

An easy-to-use ingestion engine. vStorm Connect simplifies the transfer of data from z/OS to Hadoop without the end user having to be concerned with outdated file types, COBOL copybooks, or code page conversions; and since it runs in Linux on System z, it facilitates that transfer efficiently without driving up the load on the z/OS environment.

Templates for agile deployment. One of the hardest tasks in managing a Hadoop analysis project is sizing the clusters. Underestimate, and the project must be put on hold until capacity is added. Commodity servers may be cheap, but they still have to be procured, inserted into the data center, configured, loaded with software, and made known to the Hadoop management system. Overestimate, and unused capacity sits idle, negating the perceived cost benefit of using commodity infrastructure in the first place. This is not a very agile or friendly process: if you are in an environment where your analytics needs are likely to change frequently, you will struggle to keep capacity in line with processing requirements. vStorm Enterprise, running on the mainframe, executes within a fully virtualized environment. It offers, out of the box, templates so that new virtual servers can be added to the Hadoop cluster with just a few basic commands (or via automation software). This is a true cloud environment where users can obtain and return capacity on demand to suit their needs.

Mainframe efficiencies. Security, simplicity, and agility are the true hallmark value propositions of vStorm Enterprise for analyzing mainframe data. But there are also some basic efficiencies that arise simply from running Hadoop in a scale-up (mainframe) versus scale-out (commodity) environment. Most notably, consider that Hadoop, to compensate for the unreliability of commodity storage and compute nodes, by default triplicates all data for availability purposes. At some volume, all of this extra storage and associated maintenance undercuts the economics of using Hadoop on commodity systems. Running Hadoop with the quality of a mainframe underneath it can prove to be more economical in the long run because there is very low risk in running with a single copy of the data. And it can greatly simplify I/O planning by removing bottlenecks.

The last two whitepapers were written by me.

BigInsights Quick Start was introduced on June 14, 2013.

If you're looking to get your hands on enterprise-grade Hadoop features with guided learning, you can download Quick Start, watch the tutorial videos, and get started today.