This session provides an overview of BigData technology from the perspective of a relational DBA.

1 Overview of the topics

2 Evolution of web search engines. The versioning is only meant to show stages of evolution, not as version identifiers of a specific search engine product.

3 Today, search engines are highly complex pieces of software that involve a lot of predictive analytics, and they form a central part of the business model of the respective companies.

4 When you consider the topics and figures in the "small print" of this slide, you will realize that relational systems can run into issues with these requirements.

One of the major points is clearly that relational systems provide transaction stability, which is unnecessary for a large part of the data we are talking about here and therefore nothing but processing overhead. Given that the ACID principle requires such overhead – particularly for logging and locking – it is pretty obvious that data engines which do not need to fulfil this requirement have a huge advantage in terms of speed. They will obviously outperform any ACID-compliant system (provided their general code quality and level of sophistication is comparable).

Side note: For a visualization of growth, take a look at http://www.worldometers.info. Core figures on World Population, Government & Economics, Society & Media, Environment, Food, Water, Energy and Health are counted up live – pretty impressive to watch these figures.

5 The definitions below are taken from Wikipedia (http://www.wikipedia.org).

A key-value store, or key-value database, is a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash. Dictionaries contain a collection of objects, or records, which in turn have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data within the database.

Key-value stores work in a very different fashion from the better known relational databases (RDB). RDBs pre-define the data structure in the database as a series of tables containing fields with well defined data types. Exposing the data types to the database program allows it to apply a number of optimizations. In contrast, key-value systems treat the data as a single opaque collection which may have different fields for every record. This offers considerable flexibility and more closely follows modern concepts like object-oriented programming. Because optional values are not represented by placeholders as in most RDBs, key-value stores often use far less memory to store the same database, which can lead to large performance gains in certain workloads.

Performance, a lack of standardization and other issues limited key-value systems to niche uses for many years, but the rapid move to cloud computing after 2010 has led to a renaissance as part of the broader NoSQL movement. A subclass of the key-value store is the document-oriented database, which offers additional tools that use the metadata in the data to provide a richer key-value database that more closely matches the use patterns of RDBM systems. Some graph databases are also key-value stores internally, adding the concept of the relationships (pointers) between records as a first class data type.
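To make the key-value paradigm concrete, here is a minimal, product-agnostic Python sketch; the keys, field names and records are invented for illustration.

    # A toy key-value store: the "database" only understands keys and opaque values.
    # All keys, field names and records below are hypothetical.
    store = {}

    # put: the engine does not inspect or type-check the value
    store["customer:4711"] = {"name": "Jane Doe", "city": "Zurich"}
    store["customer:4712"] = {"name": "John Doe"}          # different fields are fine
    store["order:0815"] = b'\x00\x01\x02raw-bytes'          # even raw bytes are fine

    # get: retrieval is always by key - there is no "WHERE city = ..." on the engine level
    record = store.get("customer:4711")
    print(record)

The point is that the engine never interprets the value; any filtering by field content is the application's job.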

A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns of data rather than as rows of data. In comparison, most relational DBMSs store data in rows. A column-oriented DBMS has advantages for data warehouses, clinical data analysis,[1] customer relationship management (CRM) systems, library card catalogs, and other ad hoc inquiry systems[2] where aggregates are computed over large numbers of similar data items.

It is possible to achieve some of the benefits of column-oriented and row-oriented organization with any DBMS. Denoting one as column-oriented refers to both the ease of expression of a column-oriented structure and the focus on optimizations for column-oriented workloads.[2][3] This approach is in contrast to row-oriented or row store databases and to correlation databases, which use a value-based storage structure. Column-oriented storage is closely related to database normalization due to the way it restricts the database schema design. However, it was often found to be too restrictive in practice, and thus many column-oriented databases such as Google's BigTable do allow "column groups" to avoid frequently needed joins.
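To make the row-versus-column distinction tangible, here is a small Python sketch (not tied to any specific product); the table and column names are made up.

    # Hypothetical "sales" table with three rows.
    # Row-oriented layout: one record per row, as a typical RDBMS stores it.
    rows = [
        {"id": 1, "region": "EU", "amount": 100},
        {"id": 2, "region": "US", "amount": 250},
        {"id": 3, "region": "EU", "amount": 175},
    ]

    # Column-oriented layout: one array per column.
    columns = {
        "id":     [1, 2, 3],
        "region": ["EU", "US", "EU"],
        "amount": [100, 250, 175],
    }

    # An aggregate over one column only touches that column's data,
    # which is the typical advantage for analytical workloads.
    total = sum(columns["amount"])
    print(total)  # 525

Because an aggregate such as the sum above only reads one column, a columnar engine can scan far less data (and compress it better) than a row store that has to read entire records.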

6 The definitions below are taken from Wikipedia (http://www.wikipedia.org).

A document-oriented database or document store is a computer program designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown[1] with the use of the term NoSQL itself.

Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often moot due to tools in the systems, conceptually the document store is designed to offer a richer experience with modern programming techniques. XML databases are a specific subclass of document-oriented databases that are optimized to extract their metadata from XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.

Document databases contrast strongly with the traditional relational database (RDB). Relational databases are strongly typed during database creation, and store repeated data in separate tables that are defined by the programmer. In an RDB, every instance of data has the same format as every other, and changing that format is generally difficult. Document databases get their type information from the data itself, normally store all related information together, and allow every instance of data to be different from any other. This makes them more flexible in dealing with change and optional values, maps more easily into program objects, and often reduces database size. This makes them attractive for programming modern web applications, which are subject to continual change in place, and where speed of deployment is an important issue.
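A minimal sketch of the document idea, using plain JSON strings in Python; the document contents are invented.

    import json

    # Hypothetical customer documents: each one may have a different shape,
    # and the type information lives in the data itself.
    documents = [
        '{"name": "Jane Doe", "city": "Zurich", "orders": [4711, 4712]}',
        '{"name": "John Doe", "loyalty_level": "gold"}',
    ]

    # A document engine can look inside the structure, e.g. to build an index
    # on a field that only some documents contain.
    for doc in documents:
        parsed = json.loads(doc)
        print(parsed["name"], "->", parsed.get("loyalty_level", "n/a"))

Each document carries its own structure, so adding a new field to one customer does not require any schema change.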

In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. Most graph databases are NoSQL in nature and store their data in a key-value store or document-oriented database. In general terms, they can be considered to be key-value databases with the additional relationship concept added. Relationships allow the values in the store to be related to each other in a free form way, as opposed to traditional relational databases where the relationships are defined within the data itself. These relationships allow complex hierarchies to be quickly traversed, addressing one of the more common performance problems found in traditional key-value stores. Most graph databases also add the concept of tags or properties, which are essentially relationships lacking a pointer to another document.
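The following toy Python sketch illustrates the node/edge/property idea; the people, company and relationship types are hypothetical.

    # A tiny property graph held in plain Python structures (illustrative only).
    nodes = {
        "alice": {"type": "person"},
        "bob":   {"type": "person"},
        "acme":  {"type": "company"},
    }
    edges = [
        ("alice", "KNOWS",     "bob",  {"since": 2015}),
        ("alice", "WORKS_FOR", "acme", {"role": "DBA"}),
    ]

    # Traversal follows pointers (edges) instead of joining tables:
    def neighbours(node, relationship):
        """Return all nodes reachable from `node` via `relationship` edges."""
        return [dst for src, rel, dst, _props in edges
                if src == node and rel == relationship]

    print(neighbours("alice", "KNOWS"))  # ['bob']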

7 Today's complex applications need to fulfil a pretty broad range of requirements, and it is becoming more and more reasonable to use the optimal data management system for each purpose.

However, this can easily mean that not all data is stored in an RDBMS, even though it is highly likely that such a system will still be required for part of the applications.

Polyglot persistence is the common term for the fact that an application uses multiple data management systems for the various purposes it serves, as the sketch below illustrates.
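As a deliberately simplified Python sketch of polyglot persistence: an in-memory SQLite database stands in for the transactional RDBMS part, and a plain dictionary stands in for a key-value session store; the table and key names are made up.

    import sqlite3

    # Relational store for data that needs ACID guarantees (orders, balances, ...).
    rdbms = sqlite3.connect(":memory:")
    rdbms.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    rdbms.execute("INSERT INTO orders VALUES (1, 99.90)")
    rdbms.commit()

    # Key-value store for volatile data (session state, shopping cart, ...),
    # modelled here as a plain dictionary.
    session_cache = {}
    session_cache["session:abc123"] = {"user": "jane", "cart": [1]}

    # The application combines both: the order record comes from the RDBMS,
    # the session context from the key-value side.
    amount = rdbms.execute("SELECT amount FROM orders WHERE id = 1").fetchone()[0]
    print(session_cache["session:abc123"]["user"], "ordered for", amount)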

8 The list of NoSQL products on this slide is far from exhaustive; I just picked some of the most prominent representatives of the four categories.

When we look at the various representatives of the four major categories of these data stores, you might find that you have probably heard of some of them for a few years already (e.g. Lotus Notes, which is a perfectly valid implementation of a Document Store). And maybe most of us "relational" people have heard of SAP HANA, which basically is a Column Store.

And we also know that DB2 now implements a lot of functionality in terms of Column and Graph Store, plus in-memory computing to speed things up a hundred times. But DB2 still provides the luxury of transaction stability – and it could possibly be another ten or a hundred times faster without this feature.

9 Hadoop is one of the most prominent buzzwords in the "Big Data Buzzworld".

It basically consists of a file system (HDFS), which scales out to thousands of nodes and provides out-of-the-box data redundancy. This makes it a great basis to create large MPP file system clusters.

As for the data processing, Hadoop also goes for a typical MPP approach: the invoker node sends out the request (e.g. count the number of occurrences of a word inside the relevant data) to all participating nodes in the cluster. Each node computes its partial result locally and ships it back to the invoker, which finally sums up the partial results.

Nothing new, and nothing a DB2 person would not already be fluent with if he or she has ever been exposed to DB2 DPF.

The difference being that Hadoop version 1 data processing is purely batch.

10 The data processing for Hadoop v1 is called MapReduce because the first major step of the execution maps values into key-value pairs and the second step reduces the result into a single final file (per node).

All of this processing is file based and therefore not known for being particularly fast by default. Speed primarily comes from the fact that the work is spread across many nodes. Also, if applicable to the specific task, an additional combine step can run as an intermediate operation in memory (and therefore very fast). Such an operation combines the partial results which have been created by analyzing the mass of individual files on the server.

The final result of a MapReduce operation on every single participating node is one single output file, which is shipped back to the invoker, which then computes the final result across all nodes.
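A deliberately simplified, single-process Python sketch of the word count example can illustrate the map / combine / reduce flow; in a real Hadoop cluster the mapped pairs would be written to files, shuffled, and the partial result files of all nodes merged by the invoker. The input lines are invented.

    from collections import defaultdict

    # Hypothetical input: the lines each node would read from its local HDFS blocks.
    lines = ["big data is big", "data lakes hold big data"]

    # Map: emit a (key, value) pair for every word.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Combine/Reduce: sum the values per key into one partial result per node.
    partial = defaultdict(int)
    for word, count in mapped:
        partial[word] += count

    print(dict(partial))  # e.g. {'big': 3, 'data': 3, 'is': 1, ...}
    # The invoker would then merge such partial results from all nodes
    # into the final result.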

11 Nowadays, with Hadoop v2 (and above), the entire infrastructure has evolved enormously. HDFS and MapReduce are still part of the game, but they are surrounded by an entire zoo of other processing methods and functionality.

Apart from heavily improved and extended core Hadoop processing, the cluster monitoring, security systems, and data import/export functionality have been extended as well. Resource allocation is controlled by YARN cluster management, and newer processing technologies include SQL-based querying (don't think of ANSI SQL here!), Graph and Columnar stores, Elasticsearch, Tez as an online counterpart to the batch-based MapReduce processing, various ways of data interchange, streaming and ingestion, machine learning, and cluster management and monitoring implementations. ZooKeeper helps to keep all these parts alive and in sync, and it is further supported by timeline and quorum services.

And we haven't even mentioned Spark, which is becoming a more and more important part of Hadoop implementations, because Spark provides Streaming data, SQL, Machine Learning, Graph data and in-memory processing. In itself a pretty sophisticated approach, Spark is likely to become a follow-up technology to the original Hadoop stacks.
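For comparison, the classic word count expressed against Spark's RDD API looks roughly like the sketch below, assuming a local PySpark installation; the HDFS paths are placeholders.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")

    counts = (sc.textFile("hdfs:///path/to/input")      # placeholder path
                .flatMap(lambda line: line.split())      # map: line -> words
                .map(lambda word: (word, 1))             # map: word -> (word, 1)
                .reduceByKey(lambda a, b: a + b))        # reduce: sum per word

    counts.saveAsTextFile("hdfs:///path/to/output")      # placeholder path
    sc.stop()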

12 The general idea of Hadoop stacks is to create one common big data lake, where you just throw in your data and then start fishing for whatever (new) species might be dwelling in the lake. Depending on the desired result, one might use a net, a rope, a spear or even dynamite to catch "digital food".

13 The idea of collecting and combining all sorts of data, from social media profiles and utterances as well as many other data sources, leads to many new findings about customers (and other individuals), allowing for astonishingly accurate information about an individual's behavior.

It is obvious that new business opportunities can arise from such information.

14 Using car insurance as an example, let's collect the data which might be available to better understand all the details of a roadside incident.

The number and types of data sources that could be collected and correlated to improve the quality of the picture and give us a really clear view are increasingly diverse and, at the same time, more accurate than ever before.

Understanding actual situations can lead to better risk evaluation, allowing us to shape a product according to the individual customer requirements.

15 Data Scientists have the ability and knowledge to correlate various data sources and derive new information and insight from them.

While individuals labeled themselves as "Data Scientists" in the recent past, the complexity of the overall requirements has increased pretty quickly, so that most of the current successful data science work and results come from specialized teams which combine the necessary skills.

16 Getting and cleaning data requires knowledge of available data sources, how to access them (with proper performance considerations…), and how to increase data quality by filtering out noise, aligning different terms with the same meaning, eliminating typos and other inaccuracies in order to create valid base data.

Exploratory data analysis usually takes place on test data sets with known contents. You typically start out with certain assumptions and start analyzing the data so that the results either confirm or contradict your assumptions. Gradual refinement then leads to new insight which can be applied to other data sets of unknown content. Obviously, the accuracy of such results and modeling depends on the quality of the input data. The more artistic part of the process is to define models which are not overly adapted to the input data but instead sufficiently generic to be applied to other data sets.

Reproducible research means that the entire research process adheres to scientific standards of reproducibility so that subsequent execution of the process will lead to the same type of insight.

Statistical inference means creating valid conclusions based on statistical methods.

Regression analysis and modeling aims to model relationships between one dependent and one or more independent variables. It is mainly used to describe quantitative correlations or to predict values of the dependent variable.
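As a minimal, dependency-free Python illustration of the idea, the sketch below fits a simple linear model by ordinary least squares; the kilometres-versus-claims sample data is invented.

    def simple_linear_regression(xs, ys):
        """Fit y = slope * x + intercept by ordinary least squares."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                 / sum((x - mean_x) ** 2 for x in xs))
        intercept = mean_y - slope * mean_x
        return slope, intercept

    # Hypothetical data: kilometres driven per year vs. claims cost.
    kilometres = [5000, 12000, 20000, 30000]
    claims     = [150, 320, 510, 780]

    slope, intercept = simple_linear_regression(kilometres, claims)
    print(f"predicted cost for 25000 km: {slope * 25000 + intercept:.0f}")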

Machine learning helps to create algorithms which improve with the increasing insight that is derived from the data, and thus provide increasing accuracy over time.

A data product can be anything from a reproducible report to a highly interactive application working on the data and allowing for repetitive and individual types of detail analysis.

17 The IoT (Internet of Things) basically means that all kinds of devices are connected to the internet and are providing their data over this distribution channel.

There is a large predominance of data-generating devices (like health measuring devices, RFID tags (transponders), etc.), which can be used to monitor a nearly endless number of events and statuses (e.g. transport logistics).

But given the similarly endless opportunities for new and enhanced business processes, the number of consuming devices and applications is of course increasing heavily as well.

From the masses of new data generated in this area, a large part has a pretty short half-life: the value of such data very often decays within seconds or minutes. Examples:
- A Twitter feed containing negative statements about a brand is valuable as long as it makes sense to react to it. After only a few hours or days, any reaction would be useless (i.e. the damage has already occurred).
- Health data of a customer showing that he or she is at high risk of suffering a heart attack is worthless once the event has happened. There is only a window of minutes or seconds in which such knowledge could avoid injuries or fatalities.

From such types of data arises the new requirement of real-time data analysis, in order to make any use of such valuable data before it loses its value. Streaming Data Analysis tries to detect deviations from known patterns in order to react immediately and avoid negative effects of all sorts.
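One very simple way to picture such streaming analysis is a rolling check of each incoming value against the recent pattern; the Python sketch below flags readings that deviate strongly from the moving average (window size, threshold and the heart-rate values are invented).

    from collections import deque
    from statistics import mean, stdev

    WINDOW = 10          # number of recent readings that define the "known pattern"
    THRESHOLD = 3.0      # how many standard deviations count as a deviation

    recent = deque(maxlen=WINDOW)

    def process(reading):
        """Check one incoming value against the recent pattern, then keep it."""
        if len(recent) >= WINDOW:
            avg, sd = mean(recent), stdev(recent)
            if sd > 0 and abs(reading - avg) > THRESHOLD * sd:
                print(f"ALERT: reading {reading} deviates from the recent pattern")
        recent.append(reading)

    # Hypothetical heart-rate stream: stable values followed by a sudden spike.
    for value in [72, 74, 71, 73, 75, 72, 74, 73, 72, 74, 140]:
        process(value)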

18 With all these additional and new requirements, it becomes obvious that the ideal solution for data management does not exist. There are major dependencies on the use cases and the actual data which is involved.

The idea of tearing down walls between existing (data) silos by providing a big data lake shows limitations, particularly with streaming data analysis.

As a result, integration solutions which are able to work across the many (and further increasing number of) data silos will become more prominent.

At the same time, legal and data governance aspects definitely do not reduce the complexity which comes with such highly distributed systems of polyglot persistence.

19 The conclusions on this slide do not need further explanations.

20 As the BigData world is still new and very agile, we are dealing with a wealth of new developments and products and languages and specific solution approaches, so that quite a few of the specialists involved run the risk of drowning in the multitude of tools.

Developers scream for things they already know. Structured data and standardized languages provide an extremely high level of comfort and therefore also boost productivity. This is far less the case in the BigData world, and many initiatives (once again: many – different ones, of course) try to improve the situation. At the current time, SQL is very popular because it provides a known standard which one can rely on. While this might change again within only a few months, the need for standards is obvious, and SQL is by no means a bad answer to many requirements. However, do not necessarily think of ANSI-standardized SQL: new SQL engines are schema-free and able to run against non-tabular data.
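As one hedged illustration of that direction (not a reference to any specific BigData SQL engine), even SQLite's JSON functions allow SQL-style filtering over schema-free documents; the table and field names below are made up, and the example assumes an SQLite build that includes the JSON1 functions.

    import sqlite3

    # Requires an SQLite build with the JSON1 functions (standard in recent builds).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (doc TEXT)")  # no fixed columns, just documents
    db.executemany("INSERT INTO events VALUES (?)", [
        ('{"type": "click", "user": "jane", "page": "/pricing"}',),
        ('{"type": "purchase", "user": "john", "amount": 42.0}',),
    ])

    # SQL over non-tabular data: filter and project on fields inside the documents.
    rows = db.execute("""
        SELECT json_extract(doc, '$.user') AS user
        FROM events
        WHERE json_extract(doc, '$.type') = 'click'
    """).fetchall()
    print(rows)  # [('jane',)]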

21 BigData = unstructured data – true? True!

BigData analytics = analyzing unstructured data? – Basically true.

Analytics without structures? – Not true!

Analytics is always about finding similarities and differences, meaning that patterns must be built and found.

It's just that in the BigData world you don't have clear, static structures. This world is more about specifying a structure or pattern on the fly and starting to group data into these structures.
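A small Python sketch of such an on-the-fly structure: the "schema" only exists in the pattern that is applied while reading the raw data; the log format and field names are invented.

    import re
    from collections import Counter

    # Raw, schema-less input lines (hypothetical log format).
    raw = [
        "2017-05-01 10:15:02 user=jane action=login",
        "2017-05-01 10:15:09 user=john action=search q=insurance",
        "2017-05-01 10:16:41 user=jane action=logout",
    ]

    # The "schema" only exists in this pattern, applied while reading the data.
    pattern = re.compile(r"user=(?P<user>\w+) action=(?P<action>\w+)")

    actions_per_user = Counter()
    for line in raw:
        match = pattern.search(line)
        if match:
            actions_per_user[match.group("user")] += 1

    print(actions_per_user)  # Counter({'jane': 2, 'john': 1})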

And most of these structures and pattern definitions are hidden in the processing code, and spread across the entire infrastructure. Not necessarily an environment which is built for stability and reproducibility.

With the increasing usage of BigData approaches, the need for standardization and stabilization of such environments will grow. And we're once again back to the challenges we saw in mainframe environments of the 60s – last millennium stuff rearing its ugly head… Fortunately, answers have been provided before, meaning that solutions will become available this time as well.

22 The tools which are available in a BigData world are numerous. Some may be fancy and ingenious for specific requirements, others less edgy and more down-to-earth – but then probably not sufficiently versatile or scalable or whatever.

In any case, the current market consists of more tools than are likely to survive in the long term, and we will continue to see a lot of fluctuation, even though the first hints of stabilization seem to be becoming visible.

23 The learning curve for BigData environments is impressively steep. The methods, technologies and tooling are at least partly disruptive, and this disruption comes from the fact that they stem from very different paradigms.

What has grown over a few years in kind of a distant world can be thrown at you with full power, and in this case it has the potential to drown you. Don't get fooled by the complexity, and don't try to understand everything from the very beginning. It takes some time to adapt to different ways of thinking, but they are just as valid as the ones we DBAs have held for decades.

24 Very often, we see a major gap between the knowledge and ways of thinking of "traditional" data people and BigData people.

One of the issues is that the BigData world has developed disruptive technologies which grew on a "planet" which was completely different from traditional dwellings. This is probably a precondition to allow disruptive technologies to grow at all.

However, today, these worlds need to grow together, to understand each other, in order to make a data evolution become real and pervasive.

25 It takes at least the guts to accept that the people "on the other side of the fence" (i.e. on all sides of any fence) are doing a valid job and are doing it to the best of their knowledge. From this mutual understanding and interest, it will become possible to implement the best possible solutions.

26 Having an in-depth understanding of data and of requirements towards data quality, management, performance, etc. is a highly important skill. Many of the stages which the RDBMS products and people have gone through in the past decades will eventually pop up again in the newer worlds. This is clearly where you can play to your strengths.

However, as the requirements towards data, the ways of using data, the reasons for generating data have changed, the respective technologies are adapting.

27 Try to understand the general concepts without going too much into the technical details. Many of these platforms and products are still at an early stage (and once upon a time, your DB2 was there too, while the IMS and ISAM/HSAM/VSAM people were wondering what the heck you might be doing). They might have stability issues, be less reliable, and not be as smart and self-managing as a modern RDBMS. Not YET…

So, it is good advice to try and establish a basic understanding of the differences (and similarities) of the BigData world in order to provide the best possible support. For a future in which data plays an even more important role – and with you still at the center of it all!

28 29 Pete is an "IBM Champion for Analytics" with 25 years of experience in the IT industry. For the first 12 years, he mainly concentrated on IBM Mainframe Technology, providing technical support and consultancy for Software Development methods, infrastructures and tools, before starting to focus on databases.

12 years ago, he crossed the platform barrier and got fluent with OO paradigms, J2EE, SOA, Web infrastructures, ... - plus of course with DB2 and other DBMSs on Unix and Windows platforms.

He currently works as Coordinating Analyst and Technical Consultant for Database Systems at AXA Technology Services in Switzerland.

Pete is an IBM Certified DB2 Advanced DBA, Application Developer and Solution Developer on both z/OS and LUW platforms for all DB2 versions since 7.

For the last few years, he has been a frequent speaker at IDUG NA and EMEA Conferences. Being an active IDUG volunteer in various positions since 2009, Pete is currently serving on the Board of Directors. You can contact him through [email protected] or via LinkedIn.

30