This session provides an overview of BigData technology from the perspective of a relational DBA.

1 Overview of the topics

2 Evolution of web search engines. The versioning is only meant to show stages of evolution, not as version identifiers of a specific search engine product.

3 Today, search engines are highly complex pieces of software that involve a lot of predictive analytics, and they form a central part of the business model of the respective companies.

4 When you consider the topics and figures in the "small print" of this slide, you will realize that relational systems can run into issues with these requirements.

One of the major points is clearly that relational systems provide transaction stability, which is unnecessary for a large part of the data we are talking about here and therefore nothing but processing overhead. Given that the ACID principle requires such overhead – particularly for logging and locking – it is pretty obvious that data engines which do not need to fulfil this requirement have a huge advantage in terms of speed. They will obviously outperform any ACID-compliant system (provided their general code quality and level of sophistication is comparable).

Side note: For a visualization of growth, take a look at http://www.worldometers.info. Core figures on World Population, Government & Economics, Society & Media, Environment, Food, Water, Energy and Health are counted up live – pretty impressive to watch these figures.

5 The definitions below are taken from Wikipedia (http://www.wikipedia.org).

A key-value store, or key-value database, is a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash. Dictionaries contain a collection of objects, or records, which in turn have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data within the database.

Key-value stores work in a very different fashion from the better known relational databases (RDB). RDBs pre-define the data structure in the database as a series of tables containing fields with well defined data types. Exposing the data types to the database program allows it to apply a number of optimizations. In contrast, key-value systems treat the data as a single opaque collection which may have different fields for every record. This offers considerable flexibility and more closely follows modern concepts like object-oriented programming. Because optional values are not represented by placeholders as in most RDBs, key-value stores often use far less memory to store the same database, which can lead to large performance gains in certain workloads.

Performance, a lack of standardization and other issues limited key-value systems to niche uses for many years, but the rapid move to cloud computing after 2010 has led to a renaissance as part of the broader NoSQL movement. A subclass of the key-value store is the document-oriented database, which offers additional tools that use the metadata in the data to provide a richer key-value database that more closely matches the use patterns of RDBM systems. Some graph databases are also key-value stores internally, adding the concept of the relationships (pointers) between records as a first class data type.
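To make the key-value paradigm concrete, here is a minimal, product-agnostic Python sketch; the keys, field names and records are invented for illustration.

    # A toy key-value store: the "database" only understands keys and opaque values.
    # All keys, field names and records below are hypothetical.
    store = {}

    # put: the engine does not inspect or type-check the value
    store["customer:4711"] = {"name": "Jane Doe", "city": "Zurich"}
    store["customer:4712"] = {"name": "John Doe"}          # different fields are fine
    store["order:0815"] = b'\x00\x01\x02raw-bytes'          # even raw bytes are fine

    # get: retrieval is always by key - there is no "WHERE city = ..." on the engine level
    record = store.get("customer:4711")
    print(record)

The point is that the engine never interprets the value; any filtering by field content is the application's job.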

A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns of data rather than as rows of data. In comparison, most relational DBMSs store data in rows. A column-oriented DBMS has advantages for data warehouses, clinical data analysis,[1] customer relationship management (CRM) systems, library card catalogs, and other ad hoc inquiry systems[2] where aggregates are computed over large numbers of similar data items.

It is possible to achieve some of the benefits of column-oriented and row-oriented organization with any DBMS. Denoting one as column-oriented refers to both the ease of expression of a column-oriented structure and the focus on optimizations for column-oriented workloads.[2][3] This approach is in contrast to row-oriented or row store databases and to correlation databases, which use a value-based storage structure. Column-oriented storage is closely related to database normalization due to the way it restricts the database schema design. However, it was often found to be too restrictive in practice, and thus many column-oriented databases such as Google's BigTable do allow "column groups" to avoid frequently needed joins.
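To make the row-versus-column distinction tangible, here is a small Python sketch (not tied to any specific product); the table and column names are made up.

    # Hypothetical "sales" table with three rows.
    # Row-oriented layout: one record per row, as a typical RDBMS stores it.
    rows = [
        {"id": 1, "region": "EU", "amount": 100},
        {"id": 2, "region": "US", "amount": 250},
        {"id": 3, "region": "EU", "amount": 175},
    ]

    # Column-oriented layout: one array per column.
    columns = {
        "id":     [1, 2, 3],
        "region": ["EU", "US", "EU"],
        "amount": [100, 250, 175],
    }

    # An aggregate over one column only touches that column's data,
    # which is the typical advantage for analytical workloads.
    total = sum(columns["amount"])
    print(total)  # 525

Because an aggregate such as the sum above only reads one column, a columnar engine can scan far less data (and compress it better) than a row store that has to read entire records.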

6 The definitions below are taken from Wikipedia (http://www.wikipedia.org).

A document-oriented database or document store is a computer program designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown[1] with the use of the term NoSQL itself.

Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often moot due to tools in the systems, conceptually the document store is designed to offer a richer experience with modern programming techniques. XML databases are a specific subclass of document-oriented databases that are optimized to extract their metadata from XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.

Document databases contrast strongly with the traditional relational database (RDB). Relational databases are strongly typed during database creation, and store repeated data in separate tables that are defined by the programmer. In an RDB, every instance of data has the same format as every other, and changing that format is generally difficult. Document databases get their type information from the data itself, normally store all related information together, and allow every instance of data to be different from any other. This makes them more flexible in dealing with change and optional values, maps more easily into program objects, and often reduces database size. This makes them attractive for programming modern web applications, which are subject to continual change in place, and where speed of deployment is an important issue.
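A minimal sketch of the document idea, using plain JSON strings in Python; the document contents are invented.

    import json

    # Hypothetical customer documents: each one may have a different shape,
    # and the type information lives in the data itself.
    documents = [
        '{"name": "Jane Doe", "city": "Zurich", "orders": [4711, 4712]}',
        '{"name": "John Doe", "loyalty_level": "gold"}',
    ]

    # A document engine can look inside the structure, e.g. to build an index
    # on a field that only some documents contain.
    for doc in documents:
        parsed = json.loads(doc)
        print(parsed["name"], "->", parsed.get("loyalty_level", "n/a"))

Each document carries its own structure, so adding a new field to one customer does not require any schema change.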

In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. Most graph databases are NoSQL in nature and store their data in a key-value store or document-oriented database. In general terms, they can be considered to be key-value databases with the additional relationship concept added. Relationships allow the values in the store to be related to each other in a free form way, as opposed to traditional relational databases where the relationships are defined within the data itself. These relationships allow complex hierarchies to be quickly traversed, addressing one of the more common performance problems found in traditional key-value stores. Most graph databases also add the concept of tags or properties, which are essentially relationships lacking a pointer to another document.
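The following toy Python sketch illustrates the node/edge/property idea; the people, company and relationship types are hypothetical.

    # A tiny property graph held in plain Python structures (illustrative only).
    nodes = {
        "alice": {"type": "person"},
        "bob":   {"type": "person"},
        "acme":  {"type": "company"},
    }
    edges = [
        ("alice", "KNOWS",     "bob",  {"since": 2015}),
        ("alice", "WORKS_FOR", "acme", {"role": "DBA"}),
    ]

    # Traversal follows pointers (edges) instead of joining tables:
    def neighbours(node, relationship):
        """Return all nodes reachable from `node` via `relationship` edges."""
        return [dst for src, rel, dst, _props in edges
                if src == node and rel == relationship]

    print(neighbours("alice", "KNOWS"))  # ['bob']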

7 Today's complex applications need to fulfil a pretty broad range of requirements, and it is becoming more and more reasonable to use the optimal data management system for each purpose.

However, this can easily mean that not all data is stored in an RDBMS, even though it is highly likely that such a system will still be required for part of the applications.

Polyglot persistence is the common term for the fact that an application uses multiple data management systems for the various purposes it serves, as the sketch below illustrates.
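As a deliberately simplified Python sketch of polyglot persistence: an in-memory SQLite database stands in for the transactional RDBMS part, and a plain dictionary stands in for a key-value session store; the table and key names are made up.

    import sqlite3

    # Relational store for data that needs ACID guarantees (orders, balances, ...).
    rdbms = sqlite3.connect(":memory:")
    rdbms.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    rdbms.execute("INSERT INTO orders VALUES (1, 99.90)")
    rdbms.commit()

    # Key-value store for volatile data (session state, shopping cart, ...),
    # modelled here as a plain dictionary.
    session_cache = {}
    session_cache["session:abc123"] = {"user": "jane", "cart": [1]}

    # The application combines both: the order record comes from the RDBMS,
    # the session context from the key-value side.
    amount = rdbms.execute("SELECT amount FROM orders WHERE id = 1").fetchone()[0]
    print(session_cache["session:abc123"]["user"], "ordered for", amount)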

8 The list of NoSQL products on this slide is far from exhaustive; I just picked some of the most prominent representatives of the four categories.

When we look at the various representatives of the four major categories of these data stores, you might find that you have probably heard of some of them for a few years already (e.g. Lotus Notes, which is a perfectly valid implementation of a Document Store). And maybe most of us "relational" people have heard of SAP HANA, which basically is a Column Store.

And we also know that DB2 now implements a lot of functionality in terms of Column and Graph Store, plus in-memory computing to speed things up a hundred times. But DB2 still provides the luxury of transaction stability – and it could possibly be another ten or a hundred times faster without this feature.

9 Hadoop is one of the most prominent buzzwords in the "Big Data Buzzworld".

It basically consists of a file system (HDFS), which scales out to thousands of nodes and provides out-of-the-box data redundancy. This makes it a great basis to create large MPP file system clusters.

As for the data processing, Hadoop also goes for a typical MPP approach: the invoker node sends out the request (e.g. count the number of occurrences of a word inside the relevant data) to all participating nodes in the cluster. Each node computes its partial result locally and ships it back to the invoker, which finally sums up the partial results.

Nothing new, and nothing a DB2 person would not already be fluent with if he or she has ever been exposed to DB2 DPF.

The difference being that Hadoop version 1 data processing is purely batch.

10 The data processing for Hadoop v1 is called MapReduce because the first major step of the execution maps values into key-value pairs and the second step reduces the result into a single final file (per node).

All of this processing is file based and therefore not known for being particularly fast by default. Speed primarily comes from the fact that the work is spread across many nodes. Also, if applicable to the specific task, an additional combine step can run as an intermediate operation in memory (and therefore very fast). Such an operation combines the partial results which have been created by analyzing the mass of individual files on the server.

The final result of a MapReduce operation on every single participating node is one single output file, which is shipped back to the invoker, which then computes the final result across all nodes.
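A deliberately simplified, single-process Python sketch of the word count example can illustrate the map / combine / reduce flow; in a real Hadoop cluster the mapped pairs would be written to files, shuffled, and the partial result files of all nodes merged by the invoker. The input lines are invented.

    from collections import defaultdict

    # Hypothetical input: the lines each node would read from its local HDFS blocks.
    lines = ["big data is big", "data lakes hold big data"]

    # Map: emit a (key, value) pair for every word.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Combine/Reduce: sum the values per key into one partial result per node.
    partial = defaultdict(int)
    for word, count in mapped:
        partial[word] += count

    print(dict(partial))  # e.g. {'big': 3, 'data': 3, 'is': 1, ...}
    # The invoker would then merge such partial results from all nodes
    # into the final result.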

11 Nowadays, with Hadoop v2 (and above), the entire infrastructure has evolved enormously. HDFS and MapReduce are still part of the game, but they are surrounded by an entire zoo of other processing methods and functionality.

Apart from heavily improved and extended core Hadoop processing, the cluster monitoring, security systems, and data import/export functionality have been extended as well. Resource allocation is controlled by YARN cluster management, and newer processing technologies include SQL-based querying (don't think of ANSI SQL here!), Graph and Columnar stores, Elasticsearch, Tez as an online counterpart to the batch-based MapReduce processing, various ways of data interchange, streaming and ingestion, machine learning, and cluster management and monitoring implementations. ZooKeeper helps to keep all these parts alive and in sync, and it is further supported by timeline and quorum services.

And we haven't even mentioned Spark, which is becoming a more and more important part of Hadoop implementations, because Spark provides Streaming data, SQL, Machine Learning, Graph data and in-memory processing. In itself a pretty sophisticated approach, Spark is likely to become a follow-up technology to the original Hadoop stacks.
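For comparison, the classic word count expressed against Spark's RDD API looks roughly like the sketch below, assuming a local PySpark installation; the HDFS paths are placeholders.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")

    counts = (sc.textFile("hdfs:///path/to/input")      # placeholder path
                .flatMap(lambda line: line.split())      # map: line -> words
                .map(lambda word: (word, 1))             # map: word -> (word, 1)
                .reduceByKey(lambda a, b: a + b))        # reduce: sum per word

    counts.saveAsTextFile("hdfs:///path/to/output")      # placeholder path
    sc.stop()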

12 The general idea of Hadoop stacks is to create one common big data lake, where you just throw in your data and then start fishing for whatever (new) species might be dwelling in the lake. Depending on the desired result, one might use a net, a rope, a spear or even dynamite to catch "digital food".

13 The idea of collecting and combining all sorts of data, from social media profiles and utterances as well as many other data sources, leads to many new findings about customers (and other individuals), allowing for astonishingly accurate information about an individual's behavior.

It is obvious that new business opportunities can arise from such information.

14 Using car insurance as an example, let's collect the data which might be available to better understand all the details of a roadside incident.

The number and types of data sources that could be collected and correlated to improve the quality of the picture and give us a really clear view are increasingly diverse and, at the same time, more accurate than ever before.

Understanding actual situations can lead to better risk evaluation, allowing us to shape a product according to the individual customer requirements.

15 Data Scientists have the ability and knowledge to correlate various data sources and derive new information and insight from them.

While individuals labeled themselves as "Data Scientists" in the recent past, the complexity of the overall requirements has increased pretty quickly, so that most of the current successful data science work and results come from specialized teams which combine the necessary skills.

16 Getting and cleaning data requires knowledge of available data sources, how to access them (with proper performance considerations…), and how to increase data quality by filtering out noise, aligning different terms with the same meaning, eliminating typos and other inaccuracies in order to create valid base data.

Exploratory data analysis usually takes place on test data sets with known contents. You typically start out with certain assumptions and start analyzing the data so that the results either confirm or contradict your assumptions. Gradual refinement then leads to new insight which can be applied to other data sets of unknown content. Obviously, the accuracy of such results and modeling depends on the quality of the input data. The more artistic part of the process is to define models which are not overly adapted to the input data but instead sufficiently generic to be applied to other data sets.

Reproducible research means that the entire research process adheres to scientific standards of reproducibility so that subsequent execution of the process will lead to the same type of insight.

Statistical inference means creating valid conclusions based on statistical methods.

Regression analysis and modeling aims to model relationships between one dependent and one or more independent variables. It is mainly used to describe quantitative correlations or to predict values of the dependent variable.
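As a minimal, dependency-free Python illustration of the idea, the sketch below fits a simple linear model by ordinary least squares; the kilometres-versus-claims sample data is invented.

    def simple_linear_regression(xs, ys):
        """Fit y = slope * x + intercept by ordinary least squares."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                 / sum((x - mean_x) ** 2 for x in xs))
        intercept = mean_y - slope * mean_x
        return slope, intercept

    # Hypothetical data: kilometres driven per year vs. claims cost.
    kilometres = [5000, 12000, 20000, 30000]
    claims     = [150, 320, 510, 780]

    slope, intercept = simple_linear_regression(kilometres, claims)
    print(f"predicted cost for 25000 km: {slope * 25000 + intercept:.0f}")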

Machine learning helps to create algorithms which improve with the increasing insight that is derived from the data, and thus provide increasing accuracy over time.

A data product can be anything from a reproducible report to a highly interactive application working on the data and allowing for repetitive and individual types of detail analysis.

17 The IoT (Internet of Things) basically means that all kinds of devices are connected to the internet and are providing their data over this distribution channel.

There is a large predominance of data-generating devices (like health measuring devices, RFID tags (transponders), etc.), which can be used to monitor a nearly endless number of events and statuses (e.g. transport logistics).

But given the similarly endless opportunities for new and enhanced business processes, the number of consuming devices and applications is of course increasing heavily as well.

From the masses of new data generated in this area, a large part has a pretty short half-life: the value of such data very often decays within seconds or minutes. Examples:
- A Twitter feed containing negative statements about a brand is valuable as long as it makes sense to react to it. After only a few hours or days, any reaction would be useless (i.e. the damage has already occurred).
- Health data of a customer showing that he or she is at high risk of suffering a heart attack is worthless once the event has happened. There is only a window of minutes or seconds in which such knowledge could avoid injuries or fatalities.

From such types of data arises the new requirement of real-time data analysis, in order to make any use of such valuable data before it loses its value. Streaming Data Analysis tries to detect deviations from known patterns in order to react immediately and avoid negative effects of all sorts.
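One very simple way to picture such streaming analysis is a rolling check of each incoming value against the recent pattern; the Python sketch below flags readings that deviate strongly from the moving average (window size, threshold and the heart-rate values are invented).

    from collections import deque
    from statistics import mean, stdev

    WINDOW = 10          # number of recent readings that define the "known pattern"
    THRESHOLD = 3.0      # how many standard deviations count as a deviation

    recent = deque(maxlen=WINDOW)

    def process(reading):
        """Check one incoming value against the recent pattern, then keep it."""
        if len(recent) >= WINDOW:
            avg, sd = mean(recent), stdev(recent)
            if sd > 0 and abs(reading - avg) > THRESHOLD * sd:
                print(f"ALERT: reading {reading} deviates from the recent pattern")
        recent.append(reading)

    # Hypothetical heart-rate stream: stable values followed by a sudden spike.
    for value in [72, 74, 71, 73, 75, 72, 74, 73, 72, 74, 140]:
        process(value)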

18 With all these additional and new requirements, it becomes obvious that the ideal solution for data management does not exist. There are major dependencies on the use cases and the actual data which is involved.

The idea of tearing down walls between existing (data) silos by providing a big data lake shows limitations, particularly with streaming data analysis.

As a result, integration solutions which are able to work across the many (and further increasing number of) data silos will become more prominent.

At the same time, legal and data governance aspects definitely do not reduce the complexity which comes with such highly distributed systems of polyglot persistence.

19 The conclusions on this slide do not need further explanations.

20 As the BigData world is still new and very agile, we are dealing with a wealth of new developments and products and languages and specific solution approaches, so that quite a few of the specialists involved run the risk of drowning in the multitude of tools.

Developers scream for things they already know. Structured data and standardized languages provide an extremely high level of comfort and therefore also boost productivity. This is far less the case in the BigData world, and many initiatives (once again: many – different ones, of course) try to improve the situation. At the current time, SQL is very popular because it provides a known standard which one can rely on. While this might change again within only a few months, the need for standards is obvious, and SQL is by no means a bad answer to many requirements. However, do not necessarily think of ANSI-standardized SQL: new SQL engines are schema-free and able to run against non-tabular data.
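As one hedged illustration of that direction (not a reference to any specific BigData SQL engine), even SQLite's JSON functions allow SQL-style filtering over schema-free documents; the table and field names below are made up, and the example assumes an SQLite build that includes the JSON1 functions.

    import sqlite3

    # Requires an SQLite build with the JSON1 functions (standard in recent builds).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (doc TEXT)")  # no fixed columns, just documents
    db.executemany("INSERT INTO events VALUES (?)", [
        ('{"type": "click", "user": "jane", "page": "/pricing"}',),
        ('{"type": "purchase", "user": "john", "amount": 42.0}',),
    ])

    # SQL over non-tabular data: filter and project on fields inside the documents.
    rows = db.execute("""
        SELECT json_extract(doc, '$.user') AS user
        FROM events
        WHERE json_extract(doc, '$.type') = 'click'
    """).fetchall()
    print(rows)  # [('jane',)]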

21 BigData = unstructured data – true? True!

BigData analytics = analyzing unstructured data? – Basically true.

Analytics without structures? – Not true!

Analytics is always about finding similarities and differences, meaning that patterns must be built and found.

It's just that in the BigData world you don't have clear, static structures. This world is more about specifying a structure or pattern on the fly and starting to group data into these structures.
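A small Python sketch of such an on-the-fly structure: the "schema" only exists in the pattern that is applied while reading the raw data; the log format and field names are invented.

    import re
    from collections import Counter

    # Raw, schema-less input lines (hypothetical log format).
    raw = [
        "2017-05-01 10:15:02 user=jane action=login",
        "2017-05-01 10:15:09 user=john action=search q=insurance",
        "2017-05-01 10:16:41 user=jane action=logout",
    ]

    # The "schema" only exists in this pattern, applied while reading the data.
    pattern = re.compile(r"user=(?P<user>\w+) action=(?P<action>\w+)")

    actions_per_user = Counter()
    for line in raw:
        match = pattern.search(line)
        if match:
            actions_per_user[match.group("user")] += 1

    print(actions_per_user)  # Counter({'jane': 2, 'john': 1})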

And most of these structures and pattern definitions are hidden in the processing code, and spread across the entire infrastructure. Not necessarily an environment which is built for stability and reproducibility.

With the increasing usage of BigData approaches, the need for standardization and stabilization of such environments will grow. And we're once again back to the challenges we saw in mainframe environments of the 60s – last millennium stuff rearing its ugly head… Fortunately, answers have been provided before, meaning that solutions will become available this time as well.

22 The tools which are available in a BigData world are numerous. Some may be fancy and ingenious for specific requirements, others less edgy and more down-to-earth – but then probably not sufficiently versatile or scalable or whatever.

In any case, the current market consists of more tools than are likely to survive in the long term, and we will continue to see a lot of fluctuation, even though the first hints of stabilization seem to be becoming visible.

23 The learning curve for BigData environments is impressively steep. The methods, technologies and tooling are at least partly disruptive, and this disruption comes from the fact that they stem from very different paradigms.

What has grown over a few years in kind of a distant world can be thrown at you with full power, and in this case it has the potential to drown you. Don't get fooled by the complexity, and don't try to understand everything from the very beginning. It takes some time to adapt to different ways of thinking, but they are just as valid as the ones we DBAs have held for decades.

24 Very often, we see a major gap between the knowledge and ways of thinking of "traditional" data people and BigData people.

One of the issues is that the BigData world has developed disruptive technologies which grew on a "planet" which was completely different from traditional dwellings. This is probably a precondition to allow disruptive technologies to grow at all.

However, today, these worlds need to grow together, to understand each other, in order to make a data evolution become real and pervasive.

25 It takes at least the guts to accept that the people "on the other side of the fence" (i.e. on all sides of any fence) are doing a valid job and are doing it to the best of their knowledge. From this mutual understanding and interest, it will become possible to implement the best possible solutions.

26 Having an in-depth understanding of data and of requirements towards data quality, management, performance, etc. is a highly important skill. Many of the stages which the RDBMS products and people have gone through in the past decades will eventually pop up again in the newer worlds. This is clearly where you can play to your strengths.

However, as the requirements towards data, the ways of using data, the reasons for generating data have changed, the respective technologies are adapting.

27 Try to understand the general concepts without going too much into the technical details. Many of these platforms and products are still at an early stage (and once upon a time, your DB2 was there too, while the IMS and ISAM/HSAM/VSAM people were wondering what the heck you might be doing). They might have stability issues, be less reliable, and not be as smart and self-managing as a modern RDBMS. Not YET…

So, it is good advice to try and establish a basic understanding of the differences (and similarities) of the BigData world in order to provide the best possible support. For a future in which data plays an even more important role – and with you still at the center of it all!

28 29 Pete is an "IBM Champion for Analytics" with 25 years of experience in the IT industry. For the first 12 years, he mainly concentrated on IBM Mainframe Technology, providing technical support and consultancy for Software Development methods, infrastructures and tools, before starting to focus on databases.

12 years ago, he crossed the platform barrier and got fluent with OO paradigms, J2EE, SOA, Web infrastructures, ... - plus of course with DB2 and other DBMSs on Unix and Windows platforms.

He currently works as Coordinating Analyst and Technical Consultant for Database Systems at AXA Technology Services in Switzerland.

Pete is an IBM Certified DB2 Advanced DBA, Application Developer and Solution Developer on both z/OS and LUW platforms for all DB2 versions since 7.

For the last few years, he has been a frequent speaker at IDUG NA and EMEA Conferences. Being an active IDUG volunteer in various positions since 2009, Pete is currently serving on the Board of Directors. You can contact him through [email protected] or via LinkedIn.

30