
Welcome to A Roadmap for the DB2 Professional. Our goal today will be to provide an overview of what is going on in the world of Big Data and put it into context for DB2 developers, DBAs, and users.

1 2 Big Data is an industry meme that is gaining traction and cannot be ignored if you wish to continue pursuing a data management career. But what is Big Data? Does it differ greatly from Oracle, SQL Server, DB2, and other relational systems? And if so, how?

This session will provide a roadmap to Big Data terminology, use cases, and technology. Attend this session to wade through the hype and start your journey toward discovering what Big Data is, and what it can do for you and your company.

3 From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020). From now until 2020, the digital universe will about double every two years.

4 Couple the current growth rate for data with the amount of existing data and you have a problem seeking a solution… a Big Data solution.

5 Talking about terabyte‐ and petabyte‐sized databases and data warehouses is becoming common these days. The chart shown on the slide outlines the measurements used when discussing data storage size. Keep in mind, though, that the figures in this chart are "rough" guides. Some people speak about disk storage in terms of "powers of ten" instead of "powers of two" – in other words, they refer to a kilobyte as 1,000 bytes instead of 1,024 bytes. However you choose to measure it, though, the amount of data we store, use, and manage is increasing very rapidly.
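Just to make the powers-of-ten versus powers-of-two distinction concrete, here is a quick Python sketch of my own (not from the slide) that prints both interpretations of each unit:

```python
# "Powers of ten" vs. "powers of two" storage units.
# The gap is small for a kilobyte but widens as the units grow.
units = ["KB", "MB", "GB", "TB", "PB", "EB"]

for power, name in enumerate(units, start=1):
    decimal = 1000 ** power   # powers-of-ten definition (1 KB = 1,000 bytes)
    binary = 1024 ** power    # powers-of-two definition (1 KB = 1,024 bytes)
    gap_pct = (binary - decimal) / decimal * 100
    print(f"1 {name}: {decimal:,} vs {binary:,} bytes ({gap_pct:.1f}% difference)")
```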

6 Big Data represents a major shift in IT requirements and technology. The phenomenal growth of data and the speed at which it is being generated offer an opportunity to uncover information and trends that were heretofore unknown.

7 But things are still a bit confusing as we introduce new technologies to help us deal with Big Data, as this Database Landscape map from 451 Research makes clear!

8 There are many industry and business trends driving the Big Data phenomenon. Not all of them are about more data, but many are.

We can contrast Traditional data with Big Data in various ways. Traditional data is formatted, typically in relational databases in rows and columns, whereas Big Data adds in many types of unstructured data. With Traditional data we are looking at tens of terabytes of data, and usually much less than that, whereas Big Data can range from hundreds of terabytes to petabytes or more. With Traditional data we have what is basically static data stored in the database – it can change over time, but it is not usually constantly flowing into the system as with Big Data. These are the high-level underlying changes that are driving the Big Data phenomenon.

9 Big Data has grown somewhat organically over time, driven in part by very large data requirements with extreme availability needs, such as social media websites or the streaming data measurements taken by medical devices – and, really, lots of different devices and machines. Actually, we are at the point today where machines generate much more data than humans… OK, so we have lots of data, but at the same time, analytics has exploded, with newer, more sophisticated tools being delivered for deriving useful observations from large data sets using sophisticated algorithms. Really, the analytics is the motivating factor for Big Data. We don't just store or access a bunch of data because we can… we do it to learn something that will give us a business advantage… that is what analytics is.

10 The driving factors for the growth in Big Data analytics are mostly focused around providing insight into business practices and conditions that can deliver a competitive advantage. By analyzing large amounts of data that were previously unavailable – or were difficult or impossible to analyze in a cost-effective manner – businesses can uncover hidden opportunities. Better customer service, increased revenue, and improved business practices are the goals that drive Big Data analytics programs.

Drivers include:
• Using big data to gain competitive advantages over business rivals
• Better understanding customer needs, preferences, and buying decisions
• Driving increased revenue/finding new revenue opportunities
• Improving organizational efficiency and profitability
• Using system/network log data to help improve IT operations
• Tracking sentiment toward your company and products on social networks

11 I'm sure you've at least heard the term Big Data, or you wouldn't be interested in this session. But it seems to mean different things to different people and in different contexts. The cynic in me wants to say there's no universal definition because the marketers want to keep it nebulous. You know the drill—every product adapts, at least in the marketing literature, to become part of the next big thing. In this case, the next big thing is Big Data, and the term "Big Data" is being hyped these days and, in some cases, it is being applied imprecisely and perhaps even inappropriately. Big Data does not mean just any type of analytics or conventional business intelligence. As we will learn throughout today's presentation, Big Data can mean a lot of different things and encompasses a lot of different technologies.

But as data professionals – that is, people who care about how data is treated, managed, and used – we should embrace the term and use it to our advantage. Most of the recent industry memes have been process and programming related, instead of data related: for example, cloud, SOA, virtualization, and SaaS.

Too often in our careers we have seen data management projects and requirements being ignored. Well, now with Big Data being promoted we can use that to our advantage to get funding and attention for our data-related projects.

12 There are many types and forms of analytical processing that can be performed on our data, whether Big or Traditional. With Big Data, analytics takes a driving role though because there is so much data that it is not possible to process it using traditional methods.

As we see here in this survey conducted by TechTarget, the types of analytical processing that are common with Big Data projects are growing extremely fast.

13 And the same survey also shows that more and more organizations have either already started a Big Data and Analytics program or will soon be embarking on one.

14 But what about a firmer definition of Big Data? Well, Forrester Research defines Big Data in the context of what it calls the 4 V's: Volume, Velocity, Variety, and Variability. But this is derivative of earlier work by META Group (now Gartner) that defined Big Data management in terms of three of those Vs – Data Volume, Velocity, and Variety – and this was published in February 2001. At any rate, let's examine these Vs in a little more depth.

The first V is Volume and that’s the obvious one, right? In order for “data” to be “big,” you must have a lot of it. And most of us do in some form or another. A recent survey published by IDC claims that the volume of data under management by the year 2020 will be 44 times greater than what was managed in 2009.

But Volume is only the first dimension of the Big Data challenge. Velocity refers to the increased speed of data arriving in our systems, along with the growing number and frequency of business transactions being conducted. Variety refers to the increasing growth in both structured and unstructured data being managed, as well as multiple formats (as opposed to just relational data). Variability refers to the multiple formats of the data, not just relational and perhaps not relational at all.

Others have tried to add more V's to the Big Data definition as well. I've seen and heard people add Verification, Value, Veracity, Vicinity, Vision, and Validation to this discussion. Although some of these additional Vs add value to the discussion (for example, Veracity – we want our data to be true, right?), defining everything in terms of a word that starts with a V is a bit of a fabricated stretch. With that in mind, I think I'll add another V to that list… Oy Vey!

15 Nevertheless, the Three or Four Vs are used commonly enough that it makes sense to walk through them here… even though they will not give a PRECISE definition of what is BIG.

16 Large data volume is the first V and it is driven by many factors, including social media data (on sites like Facebook, Twitter, and Instagram), data being collected from sensors (RFID, etc.), system logs, unstructured data (like images, audio, video, and other data), and streaming data.

The data volume is impacted by both internally and externally generated data and, in some cases, it may be too voluminous to store for any length of time (hence streaming data).

17 Volume is the characteristic most associated with big data, but there is no set definition so drawing a line is arbitrary.

18 The speed at which data is being generated and collected has increased… and it is continuing to increase. One aspect of velocity is the progression from batch up to real-time. Another aspect to consider is the on-going, incessant generation of data on social media, from devices, by sensors, and more.

19 Velocity: The expectation in most organizations, regardless of size, is that data will be available within a 24‐hour period.

20 More types of data are being generated and stored than ever before. We’ve already touched on the structured versus unstructured data aspect, but digging a little deeper we see increased usage of both. On the structured data side of things we, of course, have the relational databases, but other sources of Big Data include pre-relational databases, flat files, VSAM, and so on.

On the unstructured data side of things, we see the various types of data shown on the slide above. Of course, I am not a big fan of the term “unstructured” because there IS a structure to these files. If you don’t believe me go ahead and try to open up a Word document in a text editor. It is just differently-structured… but I will defer to the commonly-used term throughout the rest of the presentation.

21 2 billion Internet users in the world today

7 TB of data processed by Twitter every day

10 TB of data processed by Facebook every day

22 Variety: Big data consists of different types of data and data sources. Variety is about managing the complexity of multiple data types, including structured, semi‐structured and unstructured data. Organizations need to integrate and analyze data from a complex array of both traditional and non‐traditional information sources, from within and outside the enterprise. With the explosion of sensors, smart devices and social collaboration technologies, data is being generated in countless forms, including: text, web data, tweets, sensor data, audio, video, click streams, log files and more.

23 The fourth V is not as often discussed as the other 3, and frequently you will see talk of The Three Vs of Big Data. I kept it in because it is part of the Forrester research paper.

24 My "favorite" Big Data term is polyglot persistence. It is a ridiculously convoluted term to describe a rather simple concept. Don't let the multiple syllables frighten you away. All it really means is using different database systems for different applications and use cases based upon how the database supports the needs of the application. Which kind of makes sense, doesn't it?

The general idea of polyglot persistence is to use multiple data storage technologies, chosen based upon the way data is being used by individual applications.

(The source of the graphic is from the authors of the book NoSQL Distilled)
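To make polyglot persistence a bit more concrete, here is a small Python sketch of my own (not from the book or the slide) for a hypothetical web storefront: transient shopping-cart data goes to a simple key/value store while completed orders go to a relational store. An in-memory dict and SQLite are just stand-ins for real products, and the names are made up.

```python
import json
import sqlite3

# Stand-in "relational" store for completed orders (sqlite3, purely for illustration).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer TEXT, total REAL)")

# Stand-in "key/value" store for transient shopping-cart/session data.
session_store = {}

def save_cart(session_id, cart):
    # Cart data is schema-less and short-lived: a key/value store fits.
    session_store[session_id] = json.dumps(cart)

def place_order(order_id, customer, total):
    # Completed orders need durability and ACID guarantees: relational fits.
    with orders_db:
        orders_db.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer, total))

save_cart("sess-42", {"items": [{"sku": "BOOK-123", "qty": 1}]})
place_order("ord-1001", "Jane Doe", 29.95)
print(orders_db.execute("SELECT * FROM orders").fetchall())
print(json.loads(session_store["sess-42"]))
```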

25 26 So what is Big Data then? We've talked about a lot of different things but we haven't really pinned down a definition yet. Personally, I think all this talk about V's and NoSQL just muddies the water. To me, Big Data is so simple it needs no definition. It's like saying Big Dog… you immediately know what I'm talking about. Big Data is all about a lot of data. Big Data doesn't have to be NoSQL. And you don't have to sit there counting up your V's to see if you're doing it. Real-time analytics on large relational databases and data warehouses qualifies as Big Data to me. And it should to you, too.

As a data bigot, I see the Big Data trend as a good thing. A lot of the more recent trends have been process-oriented (e.g., object-oriented programming, Web services, SOA). But the data is more important than the code, and it always will be. However, now that we are storing, processing, and managing extremely large amounts of data --- Big Data --- there are issues that arise as we administer the physical database structures that store that data.

27 Frequently, Big Data is coupled with NoSQL database systems. The biggest difference between a NoSQL Database Management System (DBMS) and a relational DBMS is that NoSQL doesn’t rely on SQL for accessing data (at least in the beginning). There are no hard-and-fast rules as to how NoSQL databases store data. But NoSQL does not exactly mean no SQL, at least not anymore. The movement (and its name) originally gained popularity when the primary providers did not use SQL. But these days there isn’t exactly much rigor in terms of defining exactly what a NoSQL database system is, or what it must be able to do. And many in the field are redefining NoSQL as NOSQL, where the NO stands for “Not Only.”

At a high level, NoSQL implies non-relational, distributed, flexible, and scalable. Many are also open source. NoSQL grew out of the perceived need for "modern" database systems to support web initiatives. Additionally, some common attributes of NoSQL DBMSes include the lack of a fixed schema, data clustering, and an "eventually consistent" capability (instead of the typical ACID transaction capability).

So NoSQL really does not mean that SQL is not used. And that is a good thing, because SQL is the lingua franca of database access, and therefore I believe that adding SQL support to the NoSQL database offerings will help to boost their popularity. The next question usually asked is "If they are not relational, what are they?" And the answer is that there is not a single data model followed by the NoSQL providers. Instead, there are four popular types of NoSQL database offerings: document stores, column stores, key/value pairs, and graph databases. We'll expand on those in a moment.

NoSQL database systems are becoming popular for big data implementations and for non-traditional, non-OLTP types of applications (different types of NoSQL offerings offer advantages for different types of applications, and we'll get to that soon, too).

28 By simplifying what is supported in a system the designer can be freed to focus primarily on the business drivers.

Of course, simple systems are focused systems, and we run the risk of returning to the days of siloed systems and applications. Do we really want to do that?

29 There is no DDL needed… Modeling becomes a statistical process – write queries to find exceptions and “normalize” data

Data validation can be done using tools like XML Schema and business rules systems
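As a simple illustration of the "no DDL up front, query for exceptions" idea, here is a small Python sketch of my own; the field names and business rules are hypothetical:

```python
# Schema-on-read: no DDL up front. Instead, scan the documents that arrived
# and flag the ones that break a business rule so they can be "normalized" later.
documents = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},          # rule violation: missing email
    {"id": 3, "email": "c@example.com"},          # rule violation: missing age
]

rules = {
    "email": lambda doc: bool(doc.get("email")),
    "age":   lambda doc: isinstance(doc.get("age"), int) and 0 < doc["age"] < 130,
}

exceptions = [
    (doc["id"], rule_name)
    for doc in documents
    for rule_name, check in rules.items()
    if not check(doc)
]
print(exceptions)   # -> [(2, 'email'), (3, 'age')]
```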

30 There are four common types of NoSQL database systems gaining popularity these days:

1. Column Store

2. Document Store

3. Key/Value

4. Graph

31 A columnar DBMS turns the traditional notion of a table on its side, storing data as sections of columns rather than as rows. By changing the focus from the row to the column, column databases can achieve performance gains when a large amount of data is aggregated for a single column. Don't get too focused on the relational analogy though. In a column database system you can structure the rows as collections of columns. A single row in a column-family database can contain many columns, grouped together into related column families… and you can retrieve columnar data for multiple entities by iterating through a column family. The flexibility of column families gives your applications a wide scope to perform complex queries and data analyses, similar in many ways to the functionality supported by a relational database.

Column-family databases are designed to hold vast amounts of data (hundreds of millions, or billions of rows containing hundreds of columns), while at the same time providing very fast access to this data coupled with an efficient storage mechanism. A well-designed column-family database is inherently faster and more scalable than a relational database that holds an equivalent volume of data. However, this performance comes at a price; a typical column-family database is designed to support a specific set of queries and as a result it tends to be less generalized than a relational database. Therefore, to make best use of a column-family database, you should design the column families to optimize the most common queries that your applications run.
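Here is a toy Python sketch of my own (not tied to any particular product) showing why a column layout helps a single-column aggregate: the row layout has to touch every column of every row, while the column layout scans only the one column it needs.

```python
# Toy contrast of row storage vs. column storage for one aggregate query:
#   SELECT SUM(amount) FROM sales
# Row layout: every row (all columns) is touched just to read one column.
rows = [
    {"order_id": 1, "customer": "A", "region": "EAST", "amount": 100.0},
    {"order_id": 2, "customer": "B", "region": "WEST", "amount": 250.0},
    {"order_id": 3, "customer": "C", "region": "EAST", "amount": 75.0},
]
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: each column is stored contiguously, so the aggregate only
# scans the one column it needs (and a column of similar values compresses well).
columns = {
    "order_id": [1, 2, 3],
    "customer": ["A", "B", "C"],
    "region":   ["EAST", "WEST", "EAST"],
    "amount":   [100.0, 250.0, 75.0],
}
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns == 425.0
```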

32 Examples of columnar databases include DataStax (Cassandra), Cloudera, and DB2 with BLU Acceleration.

32 Data warehousing and CRM applications can benefit from column stores.

33 A document store manages and stores data at the document level. A document is essentially an object and is commonly stored as XML, JSON, BSON, etc. A document database is ideally suited for high performance, high availability, and easy scalability. An "interesting" aspect of document databases is that different documents within a collection can have different schemas. So, it is possible (in MongoDB, for example) for one document to have five fields and another document to have seven fields.

From a relational standpoint, you can think of a document store collection as being somewhat equivalent to a table; and a document is somewhat equivalent to a row. Of course, these comparisons are not exact as the two models of storing data are different.

MongoDB is the most popular document database, but others include Couchbase, RavenDB and MarkLogic.
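A minimal pymongo sketch of the flexible-schema idea follows; it assumes a MongoDB server running on localhost, and the database, collection, and field names are made up for illustration.

```python
from pymongo import MongoClient  # pip install pymongo; assumes mongod on localhost

client = MongoClient("mongodb://localhost:27017")
products = client["storefront"]["products"]

# Two documents in the same collection with different "schemas" --
# the second has extra fields the first does not.
products.insert_one({"sku": "BOOK-123", "title": "DB2 Developer's Guide", "price": 59.99})
products.insert_one({"sku": "DVD-456", "title": "Big Data Basics", "price": 19.99,
                     "runtime_minutes": 94, "region_code": 1})

# Queries still work across the mixed documents.
for doc in products.find({"price": {"$lt": 60}}):
    print(doc["sku"], doc["title"])
```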

34 You might consider using a document store for web storefront applications, real-time analytical processing, or to front a blog or content management system. They are not very well-suited for the complex transactions typified by traditional relational applications, though.

But do not use a document store if you need complex transactions spanning multiple operations… or when queries are required against varying aggregate structures.

35 The key/value database system is useful when all access to the database is done using a primary key. There typically is no fixed data model or schema. Each key identifies an arbitrary "lump" of data. A key/value pair database is useful for shopping cart data or storing user profiles. It is not useful when there are complex relationships between data elements or when data needs to be queried by anything other than the primary key.

You can think of K/V stores like dictionaries… easy to look up the definition (value) of a word (key) using the word, but can you look up a word using the definition?

Examples of key/value database systems include Berkeley DB, Oracle NoSQL, and HBase, among others.
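Here is a minimal sketch of the key/value access pattern using the redis-py client; it assumes a Redis server on localhost, and the keys and values are made up.

```python
import json
import redis  # pip install redis; assumes a Redis server on localhost:6379

r = redis.Redis(host="localhost", port=6379)

# All access is by primary key; the value is just an opaque "lump" of data.
r.set("cart:session-42", json.dumps({"items": [{"sku": "BOOK-123", "qty": 1}]}))
r.set("profile:user-7", json.dumps({"name": "Jane", "preferred_format": "ebook"}))

cart = json.loads(r.get("cart:session-42"))
print(cart["items"])

# What a key/value store will NOT do for you: "find every cart containing BOOK-123"
# would require scanning all keys -- like looking up a word from its definition.
```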

36 37 Finally, we have the graph database, which uses graph structures with nodes, edges, and properties to represent and store data. In a graph database every element contains a direct pointer to its adjacent element and no index lookups are necessary. Social networks, routing and dispatch systems, and location-aware systems are the prime use cases for graph databases. Some examples include GraphBase and Meronymy.
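To show what "direct pointers instead of index lookups" means in practice, here is a toy Python adjacency-list sketch of my own (not any particular graph product) that walks a small, made-up social network two hops out:

```python
from collections import deque

# Toy graph: each node carries a direct list of its adjacent nodes,
# so traversal follows pointers instead of doing index lookups.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  ["alice"],
}

def friends_of_friends(start):
    """Breadth-first walk two hops out -- a classic social-network query."""
    seen, frontier, result = {start}, deque([(start, 0)]), set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == 2:
            result.add(node)
            continue
        for neighbor in follows.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return result

print(friends_of_friends("alice"))   # -> {'dave', 'erin'}
```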

38 39 NewSQL is a class of modern relational database management systems that provide the same scalable performance as NoSQL systems for online workloads while still maintaining the ACID guarantees of traditional database systems.

The term was first used by 451 Group analyst Matthew Aslett in a 2011 research paper discussing the rise of new database systems as challengers to established vendors.

The vendors in this space are all newer entries… but it is safe to assume that both NoSQL and traditional relational/SQL database systems will pressure these vendors by adding capabilities.

40 41 42 In the 1990s the “common wisdom” was that Object DBMS products were going to displace RDBMS

In the early 2000s the “common wisdom” was that XML DBMS products were going to supplant RDBMS

Now, we have NoSQL DBMS and… “According to analysis by Wikibon’s David Floyer (and highlighted in the Wall Street Journal), the NoSQL database market is expected to grow at a compound annual growth rate of nearly 60% between 2011 and 2017.”

• What do you think will happen?

43 44 Everything we've stated to this point notwithstanding, the 2013 BI and Data Warehousing survey reports that structured transaction data is still the BIGGEST aspect of most Big Data projects today. Of course, this particular survey may have been a bit skewed because of the folks being surveyed – that is, data warehousing professionals who traditionally work with relational databases.

Nonetheless, it is still an interesting data point… especially for relational folks like us!

45 And the same survey places mainstream RDBMS as still being at the top of the heap in terms of Big Data technologies.

46 And just to cover our bases, here is another survey, this one conducted by Database Trends & Applications magazine, that shows the same thing as we saw on the previous slide… so perhaps we are on to something here?

47 The top tool used by Data Scientists, as reported in the 2013 Data Science Salary Survey conducted by O’Reilly, is SQL on relational databases. SQL is the meat and potatoes of data analysis, and has not been displaced by other tools for the most part.

What might be more surprising is the rise in usage of tools like R and Python, which were the two most commonly used individual tools, even above Excel, which for years has been the go‐to option for spreadsheets and basic analysis. R and Python are likely popular because they are easily accessible and effective open source tools for analysis. More traditional statistical programs such as SAS and SPSS were far less common than R and Python.

48 NoSQL will NOT replace relational/SQL database systems:
• The major DBMSes (DB2, Oracle, SQL Server) are entrenched in most organizations and adeptly handle OLTP requirements (as well as many analytical requirements)
• NoSQL can be added on a project basis where and when it makes sense

Adding NoSQL database systems can make sense as part of an enterprise infrastructure that can handle both unstructured and structured data. Remember when Object databases were going to replace relational?

Major relational/SQL database systems (like DB2) will incorporate NoSQL capabilities over time. Note the column store capabilities of BLU Acceleration in DB2 10.5 for LUW.

49 With all of that in mind, let’s briefly look at Hadoop, which is one of the more pervasive technologies used in Big Data projects….

50 So what is Hadoop? Well, let’s start with what it is not – it is NOT a database management system. Hadoop is an open source software library that delivers a framework for processing large data sets across a distributed cluster of commodity computing resources. It is designed to scale up from small to large numbers of nodes and is fault tolerant. It is not designed for real-time access but for batch processing.

--- Cloudera touts its product lineup as an “enterprise data hub” in order to distinguish it from competitive offerings from vendors such as Hortonworks and MapR. It’s worth noting that MapR has been pretty aggressive itself about adding new features, many of which are shipping. And Hortonworks, although somewhat more measured on the innovation front, is a close partner with many of the big companies (including Microsoft and Red Hat).

51 HDFS is the Hadoop Distributed File System. HDFS runs on commodity hardware. It stores large files typically in the range of gigabytes to terabytes across different machines.

The job tracker schedules map or reduce jobs to task trackers with awareness of the data.

The two main parts of Hadoop are a data processing framework and HDFS. HDFS is a rack-aware file system designed to handle data effectively. HDFS implements a single-writer, multiple-reader model and supports operations to read, write, and delete files, and operations to create and delete directories.

The data processing framework is the tool used to process the data; it is a Java-based system known as MapReduce.

52 "There's so much hype around it now that people think it does pretty much anything," Stirman said. "The reality is that it's a very complex piece of technology that is still raw and needs a lot of care and handling to make it do something worthwhile and valuable." From the article at http://searchbusinessanalytics.techtarget.com/feature/Handling-the-hoopla-When-to-use-Hadoop-and-when-not-to

53 A graphical depiction of the MapReduce process…
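If you prefer code to pictures, here is a conceptual single-process Python sketch of the map / shuffle / reduce flow for the classic word-count example; it only illustrates the idea and is not Hadoop's actual API.

```python
from collections import defaultdict

documents = ["big data is big", "db2 loves data"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key (Hadoop does this across the cluster).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: collapse each key's values to a single result.
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)   # -> {'big': 2, 'data': 2, 'is': 1, 'db2': 1, 'loves': 1}
```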

54 IBM offers a connector to allow DB2 users to access data that is stored in Hadoop. The basic goal is to enable traditional applications on DB2 z/OS to access Big Data analytics via integrating DB2 for z/OS with IBM's Hadoop-based BigInsights platform.

How is this done? Well:
1. Analytics jobs can be specified using JSON (JAQL), submitted to IBM InfoSphere BigInsights, and the results stored in the Hadoop Distributed File System (HDFS).
2. A prepackaged table UDF (HDFS_READ) can be used to read the Big Data result from HDFS in an SQL query.

55 The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014 by Mike Gualtieri and Noel Yuhanna, February 27, 2014

56 The other term that is bandied about within the NoSQL community is polyglot persistence. Don’t let the multiple syllables frighten you away. All it really means is using different database systems for different applications and use cases based upon how the database supports the needs of the application. Which kind of makes sense, doesn’t it?

57 BASE is not appropriate for financial systems, but for web store fronts it can make sense.

(Example: amazon.com – changing the price of a book)

58 In 2000, Eric Brewer introduced the CAP Theorem for distributed computing systems (as opposed to traditional concurrency models for computing systems, which assume a central concurrency manager).

The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency (all nodes see the same data at the same time)
• Availability (a guarantee that every request receives a response about whether it was successful or failed)
• Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

Note that there is a different definition of consistency used in CAP than is used in ACID. In ACID, the C means that a transaction preserves the database rules and reports always get a consistent value. In contrast, the C in CAP refers only to single-copy consistency, a subset of ACID consistency. A read from a cluster must verify that multiple nodes all have the same value.
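Here is a toy Python sketch of that "a read must verify that multiple nodes agree" idea, with made-up replica values and a simple quorum rule; real systems are far more sophisticated about this.

```python
from collections import Counter

# Toy read against three replicas of the same key. Under eventual consistency
# a replica can lag, so the reader checks that a quorum agrees on the value.
replicas = {"node1": "v2", "node2": "v2", "node3": "v1"}   # node3 is stale

def quorum_read(replica_values, quorum=2):
    value, votes = Counter(replica_values.values()).most_common(1)[0]
    if votes >= quorum:
        return value   # enough copies agree; the stale node can be repaired later
    raise RuntimeError("no quorum -- cannot return a consistent value")

print(quorum_read(replicas))    # -> 'v2'
```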

59 60 Sharding is the process of storing data records across multiple machines to handle the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data or provide acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.

A database shard is a horizontal partition in a database (or search engine). Each individual partition is referred to as a shard or database shard.

Some NoSQL offerings provide automatic sharding. When one node in a cluster is bearing too much of the load, the data is automatically "sharded"… meaning that half of the data is copied to another processor, and thereby each processor gets half the load.
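A minimal hash-based sharding sketch in Python follows; the shard count and keys are hypothetical, and unlike real automatic sharding it does not rebalance or split when a shard gets hot.

```python
import hashlib

NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}   # stand-ins for separate machines

def shard_for(key):
    # Hash the key and map it to one of the shards (horizontal partitioning).
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

for customer_id in ("cust-1001", "cust-1002", "cust-1003", "cust-1004"):
    put(customer_id, {"id": customer_id})

print({shard: len(data) for shard, data in shards.items()})  # rows per shard
print(get("cust-1003"))
```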

61 Stream computing is another Big Data and analytics term that you may have heard. Basically, it involves the ingestion of data - structured or unstructured - from arbitrary sources and the processing of it without necessarily persisting it. Any digitized data is fair game for stream computing. As the data streams it is analyzed and processed in a problem-specific manner. The "sweet spot" applications for stream computing are situations in which devices produce large amounts of instrumentation data on a regular basis. The data is difficult for humans to interpret easily and is likely to be too voluminous to be stored in a database somewhere. Examples of types of data that are well-suited for stream computing include healthcare, weather, telephony, stock trades, and so on.

By analyzing large streams of data and looking for trends, patterns, and "interesting" data, stream computing can solve problems that were not practical to address using traditional computing methods. Another useful way of thinking about this is as RTAP - real-time analytical processing (as opposed to OLAP - online analytical processing).
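Here is a tiny Python sketch of the stream-computing pattern: readings flow through, a rolling summary and alerts are produced, and nothing is persisted. The sensor feed and threshold are made up for illustration.

```python
import random

def sensor_stream(n=1000):
    # Stand-in for an instrument feed; in real stream computing the data
    # arrives continuously and is never fully persisted.
    for _ in range(n):
        yield random.gauss(98.6, 0.8)    # e.g., patient temperature readings

alert_threshold = 101.0
count, running_sum = 0, 0.0

for reading in sensor_stream():
    count += 1
    running_sum += reading               # keep only a tiny rolling summary
    if reading > alert_threshold:
        print(f"ALERT: reading {reading:.1f} at sample {count}")

print(f"processed {count} readings, mean {running_sum / count:.2f}")
```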

62 Data visualization uses visual representations of data to uncover patterns and trends hidden in data. The data/information has been abstracted in some schematic form that renders it more digestible. Visualization tools are really about knowledge compression. They can take a large amount of information and render it down into a single graphic or image, and present it in a manner easily comprehended by the end user.

According to Friedman (2008) the "main goal of data visualization is to communicate information clearly and effectively through graphical means. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key-aspects in a more intuitive way. Yet designers often fail to achieve a balance between form and function, creating gorgeous data visualizations which fail to serve their main purpose — to communicate information". -- Vitaly Friedman (2008), "Data Visualization and Infographics," in: Graphics, Monday Inspiration, January 14th, 2008

A few examples are shown on the slide, but there are many different types of data visualization that can be used. For more details check out the Wikipedia page at http://en.wikipedia.org/wiki/Data_visualization

63 Organizations with visualization tools report a 20% or greater improvement in time-to-information and time-to-decision… These results indicate that data can be accessed more quickly when using visualization tools, and that can enable decision makers to identify the most important issues more quickly, allowing action to be taken immediately. And the decisions were not blind, knee-jerk decisions, but informed ones based on data. Compared to organizations without visualization tools, performance gains were 1.8x to 11x greater.

Real-Time Data Visualization can further speed up the process, leading to more opportunities, greater output, and lower costs.

64 JSON is used frequently by mobile-style applications.

Document store database systems excel at storing/using JSON documents (e.g. MongoDB)

65 DB2 has partnered with MongoDB and has implemented the MongoDB API. This allows applications to be easily ported to DB2. These applications can be written in any language that supports that API.

You can think of DB2’s JSON support like the old XML Extender. It is not a data type in DB2. There are UDFs to support the creation and retrieval of documents, and an API for manipulating the data. Indexes can be defined as well.

Learn more at: http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/

- and - http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson2/

66 67 Interesting article comparing and contrasting R and Python: http://inside-bigdata.com/2013/12/09/data-science-wars-python-vs-r/

68 Apache Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems, such as the MapR Data Platform (MDP). Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

69 Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin.

70 Another way of thinking about Big Data is in terms of what is big… what is big at your site is not necessarily big at mine. So what value can be derived from defining a threshold for what constitutes "big data"? It is a relative thing. The most important criterion, in my opinion, is that the size poses significant challenges to manage in order to deliver the required service the business demands for accessing, analyzing, and, generally, using the data. So from a relativist viewpoint, let's take a look at DB2 for z/OS as of Version 10. There are hard limits on what defines the maximum container size for DB2 data. If your shop approaches these sizes, how could anyone argue that you are not dealing with Big Data? Frankly, I think there is a lot of room under the 128TB maximum for a DB2 universal table space that would still represent BIG data, don't you?

At any rate, even as the world of Big Data is upon the world of DB2 for z/OS already, at least in terms of large amounts of data, we see DB2 adapting to the world of Big Data, too. The IBM DB2 Analytics Accelerator (IDAA) is a workload-optimized appliance that blends System z and Netezza technologies to deliver mixed-workload performance for complex analytic needs. It runs complex queries up to 2000x faster while retaining single-record lookup speed, and it eliminates costly query tuning while offloading query processing. IDAA can deliver fantastic performance for complex business analysis over many rows of data. Even better, it is completely transparent to your application – it looks and behaves just like DB2 for z/OS… but faster.

71 And if you've been following the news recently you will have heard about BLU Acceleration for DB2, too. Basically, BLU Acceleration adds a column store and other Big Data capabilities to DB2 10.5 for LUW. Now keep in mind, BLU is not available on DB2 for z/OS yet, but I wanted to briefly discuss it here in this section on Big Data and DB2 for z/OS because IBM indicates it will be on z/OS soon. A column store physically stores data as sections of columns rather than as rows of data. By doing so, large analytical queries where aggregates are computed over large numbers of similar data items can be optimized. But BLU Acceleration is not just a column store. There are other Big Data features it delivers, but we won't go into those here…

IBM also delivered three additional capabilities and improvements with BLU Acceleration. First there is "actionable compression," which can deliver up to 10x storage space savings. It is called "actionable" because (1) new algorithms enable many predicates to be evaluated without having to decompress and (2) the most frequently occurring values are compressed the most. The second new feature of BLU is exploitation of the SIMD (Single Instruction, Multiple Data) capabilities of modern CPUs, enabling a single instruction to act upon multiple items at the same time. And finally, BLU adds data-skipping technology. You can probably guess what this does, but the basic idea is to skip over data that is not required in order to deliver an answer set for a query.

71 So let’s bring the discussion back around to DB2 then, since that is what we are all here to talk about. If we think in terms of the DB2 table space level, let’s think about what granularity of measurement we should use to determine what is large… Do we talk in terms of number of rows or number of pages? Or just the amount of disk space consumed? And do we count just the base data or add up the space used by indexes on that data as well? And what about compressed data? Do we care about the total size if it were uncompressed or just what we really have to manage in a compressed state?

And what about the type of data? Is a 20 GB database consisting solely of traditional data (that is, numbers and characters; dates and times) bigger than a 50 GB database that contains non-traditional BLOBs and CLOBs? From a purely physical perspective the answer is obvious, but from a management perspective the answer is more nebulous. It may indeed be more difficult to administer the 20 GB database of traditional data than the 50 GB database of large objects because traditional data consists of more disparate attributes and is likely to change more frequently.

Our primary concern should be how the large amount of data impacts our job in terms of managing the database and keeping the data available for use. So here is some advice… First of all, when determining what is BIG DATA for your shop, do it in terms of the number of pages, not the number of rows. You can use this number to easily compare the size of one table space to another, whereas you can't really compare things if you are using the number of rows, because row size can vary dramatically from table to table. And count everything that is being persistently stored in the database: data, indexes, free space, etc. If it is being stored it must be managed, and therefore it impacts TCO. Stripping out everything but normalized data only matters when you are worrying about who has the biggest database, and we should be more worried about assuring the availability and manageability of our big databases!

72 73 74 75 76 The latest stable release of Hive as of early 2014 is V0.12.0 (October 15, 2013). The latest stable release of Pig as of early 2014 is V0.12.0, too!

77 78 79 80 81 82