Masaryk University Faculty of Informatics

Tools for analysis

Master’s Thesis

Bc. Martin Macák

Brno, Spring 2018


Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Martin Macák

Advisor: doc. Ing. RNDr. Barbora Bühnová, Ph.D.


Acknowledgements

I would like to thank my supervisor, doc. Ing. RNDr. Barbora Bühnová, Ph.D., for offering me the opportunity to work on this thesis. Her support, guidance, and patience greatly helped me to finish it. I would also like to thank her for introducing me to the great team of people in the CERIT-SC Big Data project. From this team, I would especially like to thank RNDr. Tomáš Rebok, Ph.D., who many times found the time to provide me with useful advice, and Bruno Rossi, PhD, who gave me the opportunity to present the results of this thesis at the LaSArIS seminar. I would also like to express my gratitude for the support of my family, my parents, Jana and Alexander, and the best sister, Nina. My thanks also belong to my supportive friends, mainly Bc. Tomáš Milo, Bc. Peter Kelemen, Bc. Jaroslav Davídek, Bc. Štefan Bojnák, and Mgr. Ondřej Gasior. Lastly, I would like to thank my girlfriend, Bc. Iveta Vidová, for her patience and support.

Abstract

This thesis focuses on the design of a Big Data tool selection diagram, which can help to choose the right open source tools for a given Big Data problem. The thesis includes a classification of the tools into components and proposes a Big Data tool architecture for a general Big Data problem, which illustrates the communication between those components. Selected components are then researched in more detail, creating an overview of current Big Data tools. Based on this overview, the initial version of the Big Data tool selection diagram, which contains storage and processing tools, is created. The thesis then proposes a process for the diagram's validation and provides a set of tests as examples. These tests are implemented by comparing the relevant results of a solution using the tool chosen by the diagram with a solution using another tool.

Keywords

Big Data, Big Data tools, Big Data architecture, Big Data storage, Big Data processing


Contents

1 Introduction

2 Big Data
  2.1 Characteristics
  2.2 Big Data system requirements
    2.2.1 Scalability
    2.2.2 Distribution models
    2.2.3 Consistency

3 State of the Art in Big Data Tools

4 Big Data Tools Architecture
  4.1 Related work
  4.2 Classification
  4.3 Proposed architecture

5 Big Data Storage Systems
  5.1 Relational database management systems
    5.1.1 Data warehouse databases
    5.1.2 NewSQL database management systems
    5.1.3 Summary
  5.2 NoSQL database management systems
    5.2.1 Key-value stores
    5.2.2 Document stores
    5.2.3 Column-family stores
    5.2.4 Graph databases
    5.2.5 Multi-model databases
    5.2.6 Summary
  5.3 Time-series database management systems
    5.3.1 InfluxDB
    5.3.2 Riak TS
    5.3.3 OpenTSDB
    5.3.4 Druid
    5.3.5 SiriDB
    5.3.6 TimescaleDB
    5.3.7 Prometheus
    5.3.8 KairosDB
    5.3.9 Summary
  5.4 Distributed file systems
    5.4.1 Hadoop Distributed File System
    5.4.2 SeaweedFS
    5.4.3 Perkeep
    5.4.4 Summary

6 Big Data Processing Systems
  6.1 Batch processing systems
    6.1.1 MapReduce
    6.1.2 Alternatives
  6.2 Stream processing systems
    6.2.1 …
    6.2.2 Alternatives
  6.3 Graph processing systems
    6.3.1 …
    6.3.2 Alternatives
  6.4 High-level representation tools
    6.4.1 …
    6.4.2 …
    6.4.3 Summingbird
    6.4.4 Alternatives
  6.5 General-purpose processing systems
    6.5.1 …
    6.5.2 …
    6.5.3 Alternatives
  6.6 Summary

7 Tool Selection Diagram
  7.1 Validation

8 Attachments

9 Conclusion
  9.1 Future directions

Bibliography

List of Tables

5.1 Basic summary of relational database management systems
5.2 Basic summary of NoSQL database management systems
5.3 Basic summary of time-series database management systems
5.4 Basic summary of distributed file systems
6.1 Basic summary of processing systems
7.1 Results of the first test
7.2 Results of the second test
7.3 Results of the extended second test


1 Introduction

Nowadays, we are surrounded by Big Data in many forms. Big Data can be seen in several domains, such as the Internet of Things, social media, medicine, and astronomy [1]. They are used, for example, in data mining, machine learning, predictive analytics, and statistical techniques. Big Data brings many problems to developers because they have to build systems that can handle working with this type of data and its properties, such as huge volume, heterogeneity, or generation speed. Currently, open source solutions are very popular in this domain. Therefore, multiple open source Big Data tools were created to allow working with this type of data. However, their enormous number, specific aims, and fast evolution make it confusing to choose the right solution for a given Big Data problem.

We believe that creating a Big Data tool selection diagram would be a valid response to this issue. Such a diagram should be able to recommend the set of tools that should be used for the given Big Data problem. The elements of the output set should be based on the properties of this problem. As this is beyond the scope of a master's thesis, this thesis creates the initial version of the Big Data tool selection diagram, which is expected to be updated and extended in the future.

This thesis is organized as follows. Fundamental information about the Big Data domain and its specifics is introduced in chapter 2. Chapter 3 describes the challenges in Big Data tools. The proposed architecture of Big Data tools is described in chapter 4. Chapter 5 contains the overview of Big Data storage tools, and chapter 6 contains the overview of Big Data processing tools. Chapter 7 presents the tool selection diagram and its validation. Contents attached to this thesis are described in chapter 8. Chapter 9 concludes the thesis.


2 Big Data

This chapter contains the fundamental information about the Big Data domain. It should give the reader the necessary knowledge to understand the following chapters.

2.1 Characteristics

Big Data are typically defined by five properties, called the "5 Vs of Big Data" [2].

∙ Volume: The used data have such a large size that they cannot fit into a single server, or the performance of analysis of those data on a single server is low. Data growth over time is also a relevant factor. Therefore, systems that want to work with Big Data have to be scalable.

∙ Variety: The structure of the used data can be heterogeneous. Data can be classified by their structure into three categories: structured data with a defined structure, for example, CSV files and spreadsheets; semi-structured data with a flexible structure, for example, JSON and XML; and unstructured data without a structure, for example, images and videos [3].

∙ Velocity: Data sources generate real-time data at a fast rate. For example, 136,000 photos are uploaded every minute on a single social network [4]. So the system has to be able to handle lots of data at a reasonable speed.

∙ Veracity: Some data may have low quality, and they cannot be considered trustworthy. So technologies should handle this kind of data too.

∙ Value: This property refers to the ability to extract value from the data. Therefore, systems have to provide useful benefits from the acquired data.

Many other definitions emerged, including a five-part definition [5], 7 Vs [6], 10 Vs [7, 8], and 42 Vs [9]. However, the 5 Vs definition is still considered a popular standard.

2.2 Big Data system requirements

2.2.1 Scalability

Scalability is the ability of a system to manage increased demands. This ability is very relevant because of the Big Data volume. Scalability can be categorized into vertical and horizontal scaling [10]. Vertical scaling involves adding more processors, memory, or faster hardware, typically into a single server. Most of the software can then benefit from it. However, vertical scaling requires high financial investments, and there is a certain limit to this scaling.

Horizontal scaling means adding more servers into a group of cooperating servers, called a cluster. These servers may be cheap commodity machines, so the financial investment is relatively low. With this method, the system can scale as much as needed. However, it brings many complexities that software has to handle, which is reflected in the limited amount of software that can run on such systems.

2.2.2 Distribution models

A distribution model may bring many essential benefits when working with Big Data. The system can store more data, handle more read or write operations per unit of time, and provide availability even when there are network problems or a server crashes. However, distribution brings complexity to the system, so it is not recommended when those benefits are not needed [11]. Although many Big Data tools are designed to run on a cluster, there is a possibility that for some use cases distribution is not needed and a single server is sufficient. If it is not, then there are three options: use sharding, use replication, or combine them and use both.

Sharding is a technique that puts different parts of the data onto different servers. This technique improves read and write effectiveness and is therefore very valuable for system performance. Theoretically, if, for example, five servers are used and the data are appropriately sharded, each server has to handle only 20% of the total read and write operations, because in the ideal case, each user only has to communicate with one server.


However, this ideal case cannot be simply achieved. Data that are commonly accessed together should be stored together on one server. Also, there should be an effort to keep the percentage of operations handled by each server the same. It does not necessarily mean distributing the data evenly, because there may be other factors that affect it, for example, physical location or some domain-specific rules.

Replication is a technique that copies the data over multiple servers. Without it, the data are lost when a server crashes. Replication can have two forms: master-slave and peer-to-peer. Both bring problems with the consistency of data, which will be presented in 2.2.3.

1. Master-slave replication declares one server as the master. This server is responsible for any updates of the data. Its data is replicated to the other servers, which are called slaves. Slaves cannot handle write requests, but they can be used for processing read requests. If the data in the master changes, then the slaves have to be updated. With this technique, the cluster can handle more read requests. However, write requests are still handled by a single server. This technique provides availability for read operations. When there is a problem with a slave, the user can read from another one. Even when there is a problem with the master, the user can still read the data. However, the crash of the master disables the handling of write operations, so a user has to wait until the master is restored or a new master is appointed. Because of that, the master is considered a single point of failure.

2. Peer-to-peer replication solves the problem of master-slave replication by not having a single point of failure. All servers in the cluster are equal; they all handle write and read operations. After a write operation on one server, the others synchronize. When some server crashes, a user can still access the data from another one.

Sharding and replication can be used together. Sharding increases the performance, while replication adds reliability and availability.
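For illustration, the following minimal sketch shows how the two techniques can work together. It is not taken from any reviewed tool; the server names, the replica count, and the hash function are chosen only for this example. A key is hashed to pick its primary shard, and the value is then copied to the next two servers:

import zlib

SERVERS = ["server-0", "server-1", "server-2", "server-3", "server-4"]
REPLICAS = 3  # one primary shard plus two replica copies

def placement(key):
    # A stable hash picks the primary shard for the key.
    shard = zlib.crc32(key.encode()) % len(SERVERS)
    # The value is also replicated to the two following servers.
    return [SERVERS[(shard + i) % len(SERVERS)] for i in range(REPLICAS)]

print(placement("user:42"))  # e.g. ['server-3', 'server-4', 'server-0']

With five servers, each primary shard receives roughly 20% of the operations, matching the ideal case described above, while the replicas keep the data available when a server crashes.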


2.2.3 Consistency

When replication is used, it provides many benefits. On the other hand, it brings problems with consistency. When two users update the same data at the same time, each on a different server, they create a write-write conflict [12]. This can happen only in peer-to-peer replication, because master-slave replication has only one server that handles write operations. Pessimistic or optimistic approaches can solve this problem. There is also a possibility of a read-write conflict, where the user's write was not yet synchronized to all servers, and another user reads the data from one of those servers [12]. This inconsistency may last a short time, but eventually, all servers will be updated.

The inconsistencies above can be solved by two methods: strong consistency and eventual consistency. The choice between strong and eventual consistency depends on the specific use case. Strong consistency is not always desired, because Brewer's CAP theorem [13] declares that the stronger the consistency, the lower the availability of the system. It states that any networked shared-data system can have at most two of the following properties:

∙ consistency

∙ availability

∙ tolerance of network partitions

This theorem was formally proven two years later [14]. Brewer's later paper [15] suggested that the tradeoff between consistency and availability has to be considered only when the network is partitioned. Generally, the tolerance of network partitions cannot be forfeited in wide-area systems. Therefore, in the majority of Big Data solutions, designers have to balance between consistency and availability. For example, when storing financial data, strong consistency should be chosen. When storing statuses of some social network, eventual consistency is sufficient.
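To make the tradeoff concrete, some of the databases reviewed later let the client choose a consistency level per operation. A hedged example in Cassandra's cqlsh shell (the CONSISTENCY command is part of cqlsh; the accounts and posts tables are hypothetical):

CONSISTENCY QUORUM;
-- stronger consistency, lower availability: suitable for the financial data
SELECT balance FROM accounts WHERE id = 42;

CONSISTENCY ONE;
-- eventual consistency, higher availability: sufficient for social statuses
SELECT status FROM posts WHERE id = 42;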

3 State of the Art in Big Data Tools

This chapter contains challenges and problems of open source Big Data tools that were identified during the research in this domain.

Each Big Data tool focuses only on a specific field of Big Data usage. For example, Kibana aims at visualization, and HDFS targets storage. Although some tools focus on multiple fields, no such tool aims at them all. Therefore, the solution of a specific Big Data problem may consist of a set of tools that have to be used together. So the solution implementer has to have practical experience with all of the chosen tools.

In each field, tools can be specialized for some specific use cases, and for each use case, there can be several different approaches to realization. This fact results in another issue: the enormous number of Big Data tools. Each of these tools has some advantages and some weaknesses [16], so the choice of the right tool in a specific field is a non-trivial act. A Big Data architect has to have theoretical knowledge about all available tools and must be able to decide which tool fits the problem best. Since the solution of a Big Data problem may consist of a set of tools, choosing the right solution is even harder.

The current weaknesses of Big Data tools are caused mainly by their immaturity. The majority of Big Data tools were released no more than 15 years ago. This immaturity results in rapid improvements, which brings a constant need to seek out news about all of them. In addition, there are still many new tools emerging, so there is also a need to seek them out.

The next complication is that the development of many tools has stopped, and those tools died. This may happen for various reasons: for example, the company or the community lost interest in the development, or there was no proper documentation and support, which caused insufficient desire to use the tool. This fact makes it necessary to filter out dead tools from previous surveys.


4 Big Data Tools Architecture

This chapter describes a proposed classification of Big Data tools into components. The chapter also contains a proposed architecture of tools for a general Big Data problem that is based on this classification. The architecture illustrates communication between its components and helps to visualize the solution of a general Big Data problem.

4.1 Related work

Multiple studies have proposed architectures of Big Data systems. A Big Data analytics architecture in the healthcare domain is proposed in [17]. The study [18] describes a reference architecture for Big Data systems in the national security application domain. An architecture for IoT Big Data analytics is proposed in [19]. The study [20] introduces a software reference architecture for semantic-aware Big Data systems. A Big Data analytics architecture for an agro-advisory system is presented in [21]. The study [22] describes the architecture of a cross-sectorial Big Data platform for the process industry domain.

All of these studies are focused only on a specific Big Data domain. Moreover, many of them are too detailed and complicated, which can have a negative impact on the visualization of a specific Big Data solution. Still, the studies [23, 24] were found, which present a general reference architecture for Big Data systems. However, we have decided to propose an architecture which we believe is more straightforward, illustrative, and suitable for the visualization of a general Big Data solution.

4.2 Classification

The proposed classification consists of seven components, each with a specific purpose. Their detailed description can be found in 4.3. The components are:

∙ transferring,

∙ resource management,


∙ storage,

∙ processing,

∙ advanced analytics,

∙ orchestration,

∙ presentation.

Every relevant Big Data tool can be classified into one of these components. Some tools can also be classified into more components because of their features. For example, Apache Kudu can be used as a storage tool, but also as a transferring tool. The Big Data solution does not have to consist of tools from all components; they are optional, and their presence is based on the use case. Also, the Big Data solution does not have to consist of only one tool from a component; it is possible to have more of them. For example, choosing multiple persistence tools based on the task and combining them in one solution was first mentioned in 2008. This approach is called polyglot persistence1. For example, in some information system, there can be one database for storing the user information, one database for storing the user sessions, and one for storing financial data. There is also another term, polyglot processing – using multiple processing tools in one solution – which applies to data processing. Lambda architecture2 is a leading example of polyglot processing. However, it was criticised because of the need to maintain code in two different processing tools, so the Kappa architecture3 was proposed [25]. In the Big Data transferring component, it is common to use multiple tools too, for example, the combination of Flume and Kafka [26].

1. http://www.sleberknight.com/blog/sleberkn/entry/polyglot_persistence
2. http://lambda-architecture.net/
3. http://kappa-architecture.com/

4.3 Proposed architecture

The proposed architecture was designed to be simple and clear. It contains seven tool components and data sources. The architecture can be seen in figure 4.1, and a detailed description of the components is below.

[Figure 4.1 here: a layered diagram with data sources at the bottom, the transferring component above them, then storage (memory / file system), processing, and advanced analytics with presentation at the top, while resource management and orchestration span all layers.]

Figure 4.1: The proposed architecture for a general Big Data problem

∙ Transferring component is responsible for moving the data from a source to a sink. It means that we can transfer the data from an external data source to the internal system, but also move the data between two internal storage systems. Many of those tools can filter or transform the data. Some of the popular transferring tools are Apache Sqoop, Flume, Kafka, or NiFi.

∙ Storage component is responsible for storing the data. In batch processing, data is stored persistently in a data store or file system that is typically distributed. When stream processing


is used, the use case may not need persistent storage, and only memory is utilized. Some of the popular distributed file systems are HDFS and QFS, and some of the popular data stores are MongoDB, Cassandra, VoltDB, and Redis.

∙ Resource management component is responsible for running tasks across the whole cluster. It manages CPU, memory, and storage, so it has to evaluate how and where to run a given task. Some of the popular resource management tools are YARN and Mesos.

∙ Orchestration component takes care of scheduling repeated operations, for example, transformation, transferring, or processing. It is responsible for the right order of operations at the given time. Some of the popular orchestration tools are Oozie and Azkaban.

∙ Processing component processes the given data. It can pre-process the data, which means detecting errors or incomplete data. Also, it can transform the data and save them back to storage. Alternatively, it can process the data and then give the result to the presentation component. It can also be extended by an advanced analytics tool. Some of the popular processing tools are Hadoop MapReduce, Spark, or Storm.

∙ Advanced analytics component handles advanced tasks, for example, machine learning, deep learning, and predictive modeling. Often this component is realized by a library, for example, Deeplearning4J and MLlib, but there are also tools, such as Mahout or ELKI.

∙ Presentation component takes care of presenting the results of a given task. It also handles receiving the given tasks. It may be a visualization tool, which represents the results in graphical formats, like Kibana or Apache Zeppelin, but it can also be a REST API or a command line.

5 Big Data Storage Systems

This chapter classifies available Big Data storage options into groups. Every group is described, including the typical use cases in which its tools can be used. Several groups are further divided into subgroups, and each group then contains an overview of its open source tools. Every reviewed tool had its latest commit on GitHub in 2018, at most four months before the writing of this thesis, so they are all considered active. This chapter reviews the storage tools at a high level. The base structure of each tool review is:

1. basic information (creation date, implementation language, developers),

2. important internal or external features which affect the usability,

3. suggested special use cases,

4. companies that use this tool.

If some property of the specific tool was not identified, it is omitted in the text. At the end of each section, a table can be found. This table contains the summary of important factors that were identified in this thesis for each tool.

5.1 Relational database management systems

Relational database management system (RDBMS) is a technology that was designed to support data management of all types regardless of the format. It stores data in a set of relations (tables). Each relation has tuples (rows), and each tuple represents an entity. This entity is described through attributes (columns); each attribute contains a single value for each entity. This value may refer to another tuple in the same or another relation, which creates a relationship between those entities.

It is a very mature technology. It uses the structured query language (SQL), which allows complex queries such as grouping, aggregates, or joins. It provides ACID properties. The most common traditional open source RDBMSs are PostgreSQL1, MySQL2, and MariaDB3. Their typical use case is handling structured transactional data that can fit into a single machine, for example, in an accounting system.

The main drawback of RDBMSs is the need to have a predefined schema before these databases can store the data. Because of that, they can efficiently store only structured data, and any following schema changes may be difficult to handle. The next problem is the increasing need for scalability, which causes RDBMSs some issues. Although they can scale horizontally, they were not designed to run efficiently on clusters [27]. To tackle this problem, relational databases for data warehouse systems were created. However, they targeted only OLAP workloads [28]. So later, to target OLTP read-write workloads, scalable RDBMSs, called NewSQL database management systems, were created.
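As an illustration of the complex queries mentioned above, the following SQL combines a join, grouping, and aggregates. It is a generic sketch over two hypothetical tables (customers and orders), not tied to any specific RDBMS:

SELECT c.name, COUNT(*) AS order_count, SUM(o.total) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;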

5.1.1 Data warehouse databases

Data warehouse databases provide most of the same features as traditional relational databases. Their advantage is horizontal scalability. They are commonly used in large companies, so they are mostly commercial tools. Mostly, they are not designed for transactional work; they are optimized for reads, so querying runs very efficiently, and data can be analyzed fast. Their typical use case is a process in which huge structured data are stored and then, after some time, analyzed.

Greenplum4 was created in 2003 by the Greenplum company. Since 2013, it has been a part of Pivotal Software, and in 2015, it was released as open source. It is written in C.

Greenplum is based on PostgreSQL. It can be viewed as several modified PostgreSQL instances working together, using the massively parallel processing (MPP) approach. It can also be used as a key-value store or document store.

This database was successfully deployed in, for example, fraud analytics, financial risk management, and manufacturing optimization [29]. It is used, for example, by Orange, Comcast, DISH, and Ford.

1. https://www.postgresql.org/
2. https://www.mysql.com/
3. https://mariadb.org/
4. https://greenplum.org/

MySQL Cluster5, written in C and C++, was created in 2004. Now, it is supported by Oracle.

MySQL Cluster provides multiple interfaces to the database. In addition to standard and common languages, it also provides a NoSQL API. This API can be used to bypass the SQL layer of the database and allows faster access to the tables. It was designed to be highly available: it allows schema updates, upgrades of servers, and backups without downtime.

MySQL Cluster is used, for example, by PayPal, Spotify, and Nokia.

Postgres-XL6 was created in 2012, and it was released as open source in 2014. Nowadays, it is supported by 2ndQuadrant. It is written in C.

Like Greenplum, it is a database that uses several PostgreSQL instances. It has similar features to Greenplum, but its advantage is that it is being developed by the same company as PostgreSQL.

5.1.2 NewSQL database management systems

NewSQL database management systems were designed for OLTP read-write workloads. They can scale horizontally much better than traditional RDBMSs, maintain ACID properties, and support SQL. NewSQL databases typically store the data in memory, allowing better performance.

VoltDB7 was released in 2008, developed by VoltDB Inc. It is written in Java and C++.

5. https://www.mysql.com/products/cluster/
6. https://www.postgres-xl.org/
7. https://www.voltdb.com/


Although VoltDB stores the data in memory for speed, it can provide full disk persistence. It can also be used as a key-value store or document store. It is designed to store and analyze the data in real time; therefore, it can act as a storage and processing tool at the same time.

VoltDB is used, for example, by Huawei, Nokia, Orange, and Airpush.

TiDB8 is a NewSQL database, released in 2016 by PingCAP. It is written in Go.

It is a hybrid transactional and analytical processing database, which means it can serve OLAP and OLTP workloads. It is compatible with MySQL, so a user can simply replace a MySQL solution with this database.

TiDB is used, for example, by Mobike and Yiguo.

CockroachDB9 was released in 2014. Nowadays, it is supported by Cockroach Labs. It is written in Go.

This database is built on the key-value engine RocksDB. Although the user cannot access this key-value engine directly, CockroachDB can be used as a key-value store by creating a table with two columns, in which one column is a primary key. The operations typical for key-value stores will then translate into key-value operations instead of SQL.

CockroachDB's main aim is to be highly scalable, transactional, and resilient. Hence, it has worse performance than in-memory NewSQL databases. It is used, for example, by Baidu, Kindred, Tierion, and Heroic Labs.

CrateDB10 is a NewSQL database, released in 2016 by Crate.io, and written in Java.

This database does not support ACID; it is consistent only at the row level. CrateDB can be used as a key-value and document store.

8. https://pingcap.com/en/
9. https://www.cockroachlabs.com/product/cockroachdb/
10. https://crate.io/products/cratedb/


It is also designed to handle time-series data. In a benchmark against the time-series NoSQL database InfluxDB, CrateDB had almost ten times higher query throughput [30]. It is used, for example, by Alpla, Clickdrive.io, Clearvoice, and DriveNow.

5.1.3 Summary

Relational database management systems are great for structured data. They can be divided into three categories: traditional databases, which run best on a single server; data warehouse databases, which are scalable, mostly older tools that target OLAP workloads; and NewSQL databases, which are scalable, newer tools that focus on OLTP read-write workloads.

Table 5.1 shows the summary of important factors that were reviewed in this thesis for each scalable relational database management system that was discovered. As can be seen, C is the most popular language among data warehouse databases, whereas in NewSQL systems, Java and Go are the most popular. For other factors, like performance and scalability, follow-up research should be accompanied by relevant benchmarks between those tools.

Tool          | Language  | Maturity / Origin | NewSQL | Used in popular companies
Greenplum     | C         | 2003              | no     | yes
MySQL Cluster | C, C++    | 2004              | no     | yes
Postgres-XL   | C         | 2012              | no     | no
VoltDB        | Java, C++ | 2008              | yes    | yes
TiDB          | Go        | 2016              | yes    | yes
CockroachDB   | Go        | 2014              | yes    | yes
CrateDB       | Java      | 2016              | yes    | yes

Table 5.1: Basic summary of relational database management systems

5.2 NoSQL database management systems

NoSQL database management systems were designed to overcome the drawbacks of RDBMSs. They operate without a schema, so they can store even semi-structured and unstructured data. They can run better in a cluster. Typically, there are four main categories of NoSQL databases: key-value stores, document stores, column-family stores, and graph databases. However, this research identified a multi-model category as well.

5.2.1 Key-value stores

In this model, each stored value is associated with a unique key and can be accessed only with this key. The value is typically some blob of bits, which allows storing anything in the database. Some key-value stores may allow having a structure in their values to increase the querying capability. However, the data are typically expected to be accessed using a key. The basic operations supported by all key-value stores are:

∙ put the value for a key

∙ get the value for a key

∙ delete a key-value

The advantage of a key-value data model is its simplicity. These stores can provide low latency and high throughput. On the other hand, if there is a demand for more complex operations, key-value stores can be ineffective because those operations will have to be performed in the application. Their typical use case is handling any data that are just being stored and retrieved by a key, for example, data caching, session storage, and profile storage.
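As a sketch of the three basic operations, the following redis-cli session (with hypothetical key names; Redis itself is reviewed below) stores a session under a key, reads it back, and deletes it. EXPIRE additionally shows the automatic expiration mentioned in the Redis review:

SET session:42 "user=Martin"    # put the value for a key
GET session:42                  # get the value for a key
EXPIRE session:42 3600          # delete the key automatically after an hour
DEL session:42                  # delete a key-value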

Redis11 is an in-memory data store, written in C, and created in 2009. Currently, it is supported by Redis Labs.

Although Redis stores data in memory, it can utilize the disk for persistence. It supports partitioning using the Redis Cluster, transactions, and user-defined scripts. Redis provides more complex data structures for storing the data; thanks to them, secondary indexing can be used. Redis can set an expiration date on each key, so after the specified time, the key-value pair is automatically deleted.

11. https://redis.io/


This store is being used as a very fast database for simple operations, and additionally as a cache. It can also be used as a message broker thanks to its publish/subscribe feature.

Redis is used, for example, by GitHub, StackOverflow, Coinbase, Twitter, Uber, Trello, or Slack.

Aerospike12 is a flash-optimized distributed database, created in 2010. It was first known as Citrusleaf, but in 2012 it was renamed. Then in 2014, it went open source. It is written in C, created by the company Aerospike.

It uses a hybrid memory architecture; it supports DRAM and flash disks [31]. Typically, indexes are stored in DRAM and data on the flash disks. Aerospike has implemented its own system to access flash disks directly, bypassing the operating system's file system. This access is optimized and parallelized across multiple flash disks, which results in better throughput. Aerospike can run only in DRAM, but the hybrid architecture provides better performance. It also decreases the number of necessary servers in the cluster because flash disks enable better vertical scalability than using DRAM alone.

It provides complex data types like lists, maps, and geospatial data, which may be used with secondary indices. It supports user-defined functions. Aerospike supports two modes, Available mode and Strong consistency mode; thus, it is possible to choose which kind of database, according to the CAP theorem, to use. It is also possible to set data access permissions for users.

This database is used, for example, by AppNexus, InMobi, AdForm, Yashi, and Curse.

Riak KV13 is a distributed database, written in Erlang. It was created in 2009 by Basho Technologies. However, in 2017, this company crashed, and Riak is now supported by Erlang Solutions.

It aims at ensuring data availability and partition tolerance. It uses peer-to-peer replication, making it easy to add or remove nodes based on actual needs.

12. https://www.aerospike.com/
13. http://basho.com/products/riak-kv/


Riak KV can be used as a document store with many querying capabilities thanks to Riak Search and Riak Data Types [32]. They integrate Solr14 for indexing and querying and Riak for storage and distribution. On top of Riak KV, a time-series database, Riak TS, is built.

This database is used, for example, by Uber, Yammer, and Sendy.

GridDB15 is a specialized in-memory database optimized for IoT, released in 2016. It is written in C++ by Toshiba.

It extends the basic key-value data model to a key-container data model. A container can have two types: collection and time-series. The collection container is general-purpose; the time-series container, on the other hand, is used for time-series data. Unlike Riak, GridDB can work with both types in a single installation.

GridDB supports transactions within a single container. Consistency is tunable: a database user can choose strong consistency or eventual consistency. This database can be queried via TQL, but only the SELECT statement is allowed.

GridDB can be used in situations where there is a need to deal with time-series data, for example, as a weather record storage system.

Apache Accumulo16 is a distributed store, written in Java. Originally, it was called Cloudbase, created in 2008 by the NSA. Then in 2011, it was released as open source and renamed to Accumulo. Nowadays, it is backed by the Apache Software Foundation.

It stores sorted key-value pairs, which allows fast retrieval of single keys as well as ranges of keys. It is used on the top of HDFS. It provides cell-based access control: there is an option to grant or refuse user access to a particular key-value pair; every key contains a visibility field, which takes care of this.

Infinispan17 is a distributed in-memory key-value data store, created by RedHat in 2009. It is written in Java.

14. http://lucene.apache.org/solr/
15. https://griddb.net/en/
16. https://accumulo.apache.org/
17. http://infinispan.org/


Infinispan offers advanced functionality. It is fully transactional, but this option can be disabled if higher performance is needed. Data can be stored to disk if necessary. The maximum number or size of entries kept in memory can be configured; the others are moved to a persistent store. It also supports indexing and querying.

5.2.2 Document stores

Document stores are similar to key-value stores. In this case, however, the value can have a structure. The defined structure of stored data grants more flexibility in data access. The values are expected to be retrieved by a query, although it is possible to use only a key. These stores support more complex queries and indexes based on the structure of the document. Popular structure formats are XML, JSON, and BSON. Their typical use case is handling semi-structured data which can be queried, for example, in e-commerce, blog posts, and user profile storage.

MongoDB18 is a distributed database, written in C++, developed by 10gen in 2009. In 2013, this company was renamed to MongoDB Inc.

This database stores data in collections. Each collection can contain documents, and each document represents an entity. This entity is described through fields. Data are stored in a binary representation called BSON (Binary JSON). Currently, MongoDB is ACID compliant for a single document, but in the newest version, MongoDB 4.0, which is scheduled for summer 2018, this database will add support for multi-document ACID transactions.

It uses its own query language instead of SQL. A query which returns all users who are named Martin looks like this:

db.users.find({ name: "Martin" })

MongoDB is a general-purpose database. It is used, for example, by Google, UPS, Facebook, eBay, and Adobe.

18. https://www.mongodb.com/


Apache CouchDB19 was created in 2005. Since 2008, it has been backed by the Apache Software Foundation. It is written in Erlang.

CouchDB stores data in the JSON data format. It is ACID compliant for a single document. It provides a REST API, but it can only be used for querying via the primary key. For complex queries, a view has to be built first. This can be done by using the MapReduce paradigm20 in JavaScript. A solution which returns all users who are named Martin looks like this:

curl -H 'Content-Type: application/json' \
     -X POST http://127.0.0.1:5984/db \
     -d '{
       "_id": "_design/users",
       "_rev": "1-C1687D17",
       "views": {
         "name": {
           "map": "function(doc) { if (doc.name) { emit(doc.name, doc); } }"
         }
       }
     }'

curl -X GET http://127.0.0.1:5984/db/_design/users/_view/name?key="Martin"

This database is used, for example, by Samsung Mobile, GrubHub, and IBM.

19. http://couchdb.apache.org/
20. described in section 6.1.1


Couchbase Server21, originally known as Membase, was created in 2010 by NorthScale. Later, NorthScale merged with CouchOne, creating the company Couchbase and renaming this database to Couchbase Server. It is written in modules, in C, C++, Go, Erlang, and Java.

This database stores data in key-value or JSON format. It has a built-in cache, which greatly increases its performance. It is ACID compliant on a single document. It uses the Non-First normal form Query Language (N1QL) for querying. A query which returns all users who are named Martin looks like this:

SELECT * FROM users WHERE name = 'Martin'

Couchbase Server is used, for example, by Doodle, Viber, BD, and eBay.

RethinkDB22 is a document store, created in 2009. Nowadays, it is supported by the Linux Foundation. It is written in C++.

This database was designed for realtime web use cases. It has implemented a push architecture, which means an application does not need to poll for changes; instead, changes are periodically pushed from the database to the application. RethinkDB is ACID compliant on a single document. It stores JSON documents.

This database uses the RethinkDB Query Language (ReQL) for querying. This language has many features, like lambda expressions, MapReduce, joins, and the construction of queries in the programming language that is being used. A query which returns all users who are named Martin looks like this:

r.db('db').table('users')
    .filter({'name': 'Martin'})
    .run()

RethinkDB can be used in real-time situations, such as collaborative web applications, multiplayer games, or realtime marketplaces. It is used, for example, by NASA, NodeCraft, Workshape.io, Narrative Clip, and Mediafly.

21. https://www.couchbase.com/products/server
22. https://www.rethinkdb.com/


RavenDB was created in 2010, written in C#. Nowadays, it is supported by Hibernating Rhinos.

RavenDB was designed to target the .NET ecosystem. It runs natively on Windows, which may be problematic for some other databases. In the newest version, RavenDB can also be run on Linux, using the .NET Core framework. This database is ACID compliant on a single document. For querying, RavenDB uses the Raven Query Language (RQL), but it also supports higher-level querying using Language Integrated Query (LINQ). A query which returns all users who are named Martin looks like this:

// RQL
from users where name = 'Martin'

// LINQ
var users = session
    .Query<User>()
    .Where(x => x.name == "Martin")
    .ToList();

ElasticSearch was released in 2010. It is being developed and supported by Elastic, written in Java.

ElasticSearch is a distributed real-time document store that has every field indexed. This allows users to use it as a real-time full-text search engine. It stores data in JSON format. To define queries, ElasticSearch provides a Query Domain Specific Language (Query DSL) based on JSON, which is used over the REST API. A solution which returns all users who are named Martin looks like this:

curl -X GET "localhost:9200/users/_search" \
     -H 'Content-Type: application/json' \
     -d '{
       "query": {
         "match": { "name": "Martin" }
       }
     }'


ElasticSearch is used, for example, by eBay, Wikipedia, Facebook, and Blizzard.

5.2.3 Column-family stores

Column-family stores are also called wide column stores. Their principle is storing the data in column families as rows that have a row key and multiple columns. Each column consists of a name and a value. This provides an easy way to add new columns to an existing row. They support ACID only at a row level. Their typical use case is similar to document stores; however, column-family stores typically provide fewer querying capabilities but better scalability than document stores. They can be used to handle data, for example, in e-commerce, blog posts, and user profile storage systems.
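The data model can be sketched conceptually as a two-level map. The following illustration uses plain Python dictionaries, not a real client API: a row key points to column families, and each family holds the columns of that row, which may differ from row to row:

users = {
    "row-001": {                                   # row key
        "info":  {"name": "Martin", "age": "25"},  # column family "info"
        "stats": {"logins": "12"},                 # column family "stats"
    },
    "row-002": {
        "info": {"name": "Tomas"},  # a row can have a different set of columns
    },
}

print(users["row-001"]["info"]["name"])  # access: row key -> family -> column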

Apache HBase23 was created in 2008 by Powerset. Nowadays, it is backed by the Apache Software Foundation. It is written in Java.

This database is built on the top of HDFS (see 5.4.1), allowing real-time read and write operations. It only provides CRUD operations, and it does not have a query language. To query this database, other tools have to be used to map HBase tables into theirs. Typically, high-level representation tools (see 6.4), like Apache Hive, Apache Phoenix, or Apache Pig, are used for this use case. A solution with Apache Hive, which returns all users who are named Martin, looks like this:

CREATE TABLE users(key INT, name STRING, age INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'myusers');

SELECT * FROM users WHERE name = 'Martin';

HBase is used, for example, by Facebook, eBay, Pinterest, and Imgur.

23. https://hbase.apache.org/

Apache Cassandra24 was developed at Facebook and released in 2008. Nowadays, it is backed by the Apache Software Foundation. It is written in Java.

Cassandra is a highly available system that is designed to be very fast, especially with write operations. This database uses the Cassandra Query Language (CQL) for querying. By default, CQL does not allow executing queries that involve filtering of columns which are not indexed, because of the performance issues; this behavior can be explicitly overridden with the ALLOW FILTERING clause. If name has a secondary index, then a query which returns all users who are named Martin looks like this:

SELECT * FROM users WHERE name = 'Martin';

It is used, for example, by Uber, Spotify, eBay, and Netflix.

ScyllaDB25 was released in 2015 by ScyllaDB Inc. This database is written in C++.

This database was designed as a replacement for Cassandra. Its API is compatible with Cassandra, but internally it is much improved. Based on many benchmarks, ScyllaDB performs considerably better than Cassandra [33]. This database uses CQL too, so a query which returns all users who are named Martin looks the same as in Cassandra:

SELECT * FROM users WHERE name = 'Martin';

It is used, for example, by IBM, Intel, CERN, and Samsung.

5.2.4 Graph databases

Graph databases allow storing entities and the relationships between these entities. Entities can have different types of relationships between them. Each entity and relationship can also have some properties. These databases are specialized to handle highly connected data. They are typically used, for example, in social network, logistics, and e-commerce systems.

24. http://cassandra.apache.org/
25. https://www.scylladb.com/


Neo4j26 was created in 2002 and went open source in 2007. Nowadays, it is supported by Neo4j, Inc. It is written in Java.

Neo4j was designed for easy usability and high performance. It has the largest community of all graph databases. It is ACID compliant. For querying, it supports the Gremlin query language, but the better-supported option is the Cypher query language, which was created by Neo4j. A query written in Cypher, which returns all mutual friends of Martin and Tomas, looks like this:

MATCH (martin:Person {name: "Martin"})
      -[:FRIEND]-(mutualFriend:Person)
      -[:FRIEND]-(tomas:Person {name: "Tomas"})
RETURN mutualFriend

Neo4j is used, for example, by eBay, NASA, Cisco, Microsoft, and Orange.

JanusGraph27 was released in 2017 by the Linux Foundation. It was forked from the Titan database, which is now dead. It is written in Java.

It is a transactional database that can be built on the top of various stores, such as Cassandra or HBase. It uses the Gremlin query language for querying. A query which returns all mutual friends of Martin and Tomas looks like this:

mutualFriends = g
    .V(g.V().has('name', 'Martin').next())
    .both('friend')
    .where(
        both('friend').is(
            eq(g.V().has('name', 'Tomas').next())
        )
    )

JanusGraph is used, for example, by CELUM, Netflix, and G DATA.

26. https://neo4j.com/
27. http://janusgraph.org/


Dgraph28 was released in 2016 by Dgraph Labs, Inc. This database is written in Go.

It was designed to be a high-performance graph database which provides ACID. It handles data natively, so it cannot run on the top of other databases like JanusGraph can. Clients can communicate with this database via gRPC or HTTP APIs.

This database uses GraphQL+- for querying. It is a modified version of GraphQL, developed by Facebook. Dgraph simplified this language, but also added advanced features to interact with the graph data. Because all GraphQL+- queries return a subgraph, and nowadays there is no supported option for the intersection, a query which returns all mutual friends of Martin and Tomas has to be divided into two database queries. The first query gets all of Martin's friends, and the database returns them as JSON:

{
  martinFriends(func: eq(name, "Martin")) {
    friend { name }
  }
}

Then the application needs to parse that result and put it into the second query. The following example illustrates the situation when Martin has four friends: Peter, Jane, John, and Kate:

{
  mutualFriendsWithTomas(func: eq(name, "Tomas")) {
    friend @filter(
      anyofterms(name, "Peter Jane John Kate")
    ) {
      name
    }
  }
}

28. https://dgraph.io/


Apache S2Graph29 was released in 2015. It is developed by the Apache Software Foundation, currently incubating. It is written in Scala.

This database is built on the top of HBase. It provides an HTTP API, but it also supports Gremlin for querying. So a query which returns all mutual friends of Martin and Tomas looks the same as in the JanusGraph database:

mutualFriends = g
    .V(g.V().has('name', 'Martin').next())
    .both('friend')
    .where(
        both('friend').is(
            eq(g.V().has('name', 'Tomas').next())
        )
    )

HyperGraphDB30 was created in 2010 by Kobrix Software. This database is written in Java.

It was designed specifically for artificial intelligence and semantic web projects. It can store hypergraphs, which are graphs that contain edges that can point to more than two nodes. This database also extends this definition by allowing an edge to point to another edge. It supports ACID but does not have any query language. It is an embedded database; therefore, it comes in the form of a library and can be queried only via its API from the application process in which it runs.

5.2.5 Multi-model databases

Multi-model databases integrate multiple data models, and the data can be accessed by only one query language that covers all supported data models [34]. This combination of data models in a single database allows implementing polyglot persistence without using various databases.

29. https://s2graph.apache.org/
30. http://www.hypergraphdb.org/


This approach provides better maintainability and also does not require the knowledge of multiple databases. Typically, these databases support key-value, document, and graph data models.

These databases can be used in use cases which have to deal with multiple data models in one solution. These use cases include access management, traffic management, logistics, the internet of things, recommendation engines, network infrastructure management, social networks, and e-commerce systems.

ArangoDB31, formerly known as AvocadoDB, was created in 2011 by triAGENS. Later, the database and its company were renamed to ArangoDB. It is written in C++.

This database uses the ArangoDB Query Language (AQL) for querying. It is similar to SQL, but this language only supports reading and modifying data. However, it is extended to support graph querying too. A query which returns all animal species that live in the cities of Germany would look like this in SQL:

SELECT species FROM Animals A, Cities B, Countries C
WHERE A.city = B.id
AND B.country = C.id
AND C.name = 'Germany'

In ArangoDB, this query looks like this:

FOR a IN Animals
  FOR b IN Cities
    FOR c IN Countries
      FILTER a.city == b.id
      FILTER b.country == c.id
      FILTER c.name == "Germany"
      RETURN a.species

ArangoDB is used, for example, by Thomson Reuters, FlightStats, InfoCamere, and Oxford University.

31. https://www.arangodb.com/


OrientDB32 was created in 2010, written in Java. Nowadays, it is being developed by CallidusCloud.

This database supports ACID. Instead of join operations, it can connect the entities with links, just like in a graph database. It uses SQL for querying; this language is extended to manipulate graphs. It also provides a SQL-Match option, which has a syntax similar to Cypher. A query which returns all animal species that live in the cities of Germany would look like this in SQL:

SELECT species FROM Animals A, Cities B, Countries C
WHERE A.city = B.id
AND B.country = C.id
AND C.name = 'Germany'

In OrientDB, this query can be simplified to:

SELECT species FROM Animals
WHERE city.country.name = 'Germany'

OrientDB is used, for example, by Comcast, Accenture, Sky, and the United Nations.

5.2.6 Summary

NoSQL database management systems were designed to be highly scalable and to operate with semi-structured and unstructured data. They can be divided into five categories based on their data model:

∙ key-value stores that store the data as key-value pairs,

∙ document stores that store the data as key-value pairs, but the value can have an internal structure,

∙ column-family stores that store the data in column families that provide a flexible structure,

∙ graph databases that store the data as entities and relationships between them,

∙ multi-model databases that combine multiple data models.

32. https://orientdb.com/


Table 5.2 shows the summary of important factors that were identified in this thesis for each NoSQL database management system that was discovered. As can be seen, Java is the most popular language. Also, supporting a query language is a common practice in NoSQL database management systems. For other factors, like performance and scalability, follow-up research should be accompanied by relevant benchmarks between those tools.

Tool             | Language                 | Maturity / Origin | Data model    | Query language  | Used in popular companies
Redis            | C                        | 2009              | key-value     | –               | yes
Scalaris         | Erlang                   | 2008              | key-value     | –               | –
Aerospike        | C                        | 2010              | key-value     | AQL             | yes
Riak KV          | Erlang                   | 2009              | key-value     | –               | yes
GridDB           | C++                      | 2016              | key-value     | TQL             | no
Apache Accumulo  | Java                     | 2008              | key-value     | –               | no
Infinispan       | Java                     | 2009              | key-value     | Ickle           | no
MongoDB          | C++                      | 2009              | document      | own language    | yes
CouchDB          | Erlang                   | 2005              | document      | –               | yes
Couchbase        | C, C++, Go, Erlang, Java | 2010              | document      | N1QL            | yes
RethinkDB        | C++                      | 2009              | document      | ReQL            | yes
RavenDB          | C#                       | 2010              | document      | RQL             | no
ElasticSearch    | Java                     | 2010              | document      | Query DSL       | yes
Apache HBase     | Java                     | 2008              | column-family | –               | yes
Apache Cassandra | Java                     | 2008              | column-family | CQL             | yes
ScyllaDB         | C++                      | 2015              | column-family | CQL             | yes
Neo4j            | Java                     | 2002              | graph         | Cypher, Gremlin | yes
JanusGraph       | Java                     | 2017              | graph         | Gremlin         | yes
Dgraph           | Go                       | 2016              | graph         | GraphQL+-       | no
Apache S2Graph   | Scala                    | 2015              | graph         | Gremlin         | no
HyperGraphDB     | Java                     | 2010              | graph         | –               | no
ArangoDB         | C++                      | 2011              | multi-model   | AQL             | yes
OrientDB         | Java                     | 2010              | multi-model   | SQL             | yes

Table 5.2: Basic summary of NoSQL database management systems

5.3 Time-series database management systems

Time-series database management systems are specialized to handle time-series data. Those data typically arrive in time order, and almost every arriving record is considered a new one. The records typically contain a timestamp and represent change over time. A timestamp can have precision down to nanoseconds. Delete operations are rare; they typically appear only over a large range, removing the old data. Update operations typically never occur. In this kind of database, data often have a period after which they expire. Time-series databases are used, for example, in monitoring, IoT sensors, and real-time analytics.

5.3.1 InfluxDB

InfluxDB33 was created in 2013 by Errplane, which was later renamed to InfluxData Inc. It is written in Go.

It stores points in measurements, which are units that contain related points, for example, water level measurements. Each point in a measurement has a timestamp, a tagset for metadata, and a fieldset for measured data. For writes and querying, it provides a REST API, but also a query language, InfluxQL. A query which returns temperatures from 7.5.2018 looks like this:

SELECT * FROM "temperature"
WHERE time >= '2018-05-07T00:00:00Z'
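Writes follow the same point structure. A hedged example of InfluxDB's line protocol (the measurement and tag names are hypothetical), which the write endpoint of the REST API accepts:

temperature,location=brno,sensor=a1 value=21.5 1525651200000000000

Here temperature is the measurement, location and sensor form the tagset, value=21.5 is the fieldset, and the trailing number is a nanosecond timestamp (7.5.2018 00:00:00 UTC).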

InfluxDB is used, for example, by Cisco, IBM, eBay, and BBOXX.

5.3.2 Riak TS

Riak TS34 was released in 2016 by Basho Technologies. In 2017, the company crashed, but Riak is now supported by Erlang Solutions. It is written in Erlang.

This database stores data similarly to a traditional relational database, in tables with columns and rows. Officially, it is considered a NoSQL database, but that is disputable: not only does it need a predefined schema for its tables, there is also no possibility to alter a created table [35]. This database is built on top of Riak KV, so each row maps to a key-value pair. It provides a REST API and a subset of SQL for querying. A query which returns temperatures from 7.5.2018 looks like this:

SELECT * FROM temperature
WHERE time >= '2018-05-07 00:00:00'

33. https://www.influxdata.com/time-series-platform/influxdb/
34. http://basho.com/products/riak-ts/


5.3.3 OpenTSDB

OpenTSDB35 was released in 2011. The initial development was supported by StumbleUpon, but nowadays Yahoo! supports it. It is written in Java.

This database is built on HBase. Each data point consists of a timestamp, a value, a metric name, and a set of tags. It provides a REST API. A query which returns temperatures from 7.5.2018 looks like this:

curl -X GET "http://localhost:4242/query/?start=2018/05/07-00:00:00&m=temperature"

5.3.4 Druid

Druid36 is a column-oriented data store that was released in 2012 by Metamarkets. It is written in Java. Druid is designed for business intelligence queries on event data. It provides real-time data ingestion, analysis, and fast aggregations. It has to use MySQL or PostgreSQL for metadata. Data consist of a timestamp, dimensions, which are fields that can be filtered and grouped by, and metrics, which are fields that can be aggregated. Querying can be done through a REST API. In 2017, an experimental query language, Druid SQL, was also added, but it is still marked as experimental. A query using the REST API, which returns temperatures between 7.5.2018 and 21.5.2018, looks like this:

{
  "queryType": "timeseries",
  "dataSource": "temperature",
  "granularity": "all",
  "intervals": ["2018-05-07T00:00:00.000/2018-05-21T00:00:00.000"]
}

It is used, for example, by eBay, Cisco, Netflix, and PayPal.

35. http://opentsdb.net/
36. http://druid.io/


5.3.5 SiriDB

SiriDB37 was created in 2016 by Transceptor Technology. It is written in C. This database stores data in points, as pairs of a timestamp and a value, where the value can only be of a numeric data type. It provides an HTTP API for querying, but also its own query language. A query which returns temperatures after 7.5.2018 looks like this:

select * from "temperature" after "2018-05-07"

5.3.6 TimescaleDB

TimescaleDB38 is a relational database that was released in 2017 by Timescale, Inc. This database is written in C. It is developed as an extension to PostgreSQL, specialized for time-series data. A query which returns temperatures after 7.5.2018 looks like this:

SELECT * FROM temperature WHERE time > '2018-05-07'
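Because TimescaleDB is a PostgreSQL extension, a regular table is turned into a time-partitioned hypertable with a single function call. A minimal sketch, assuming the extension is installed and an empty temperature table with a time column already exists:

CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('temperature', 'time');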

5.3.7 Prometheus

Prometheus39 is a monitoring system and a time-series database, released in 2014. It is written in Go. Prometheus is very similar to InfluxDB. This database is more specialized in metrics and has a more powerful built-in query language. However, queries can only be performed relative to the current time. Also, Prometheus relaxes consistency in favor of higher availability than InfluxDB does. A query which returns temperatures from the last week looks like this:

temperature[1w]

It is used, for example, by Docker, SoundCloud, JustWatch, and Branch.
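As an example of the richer built-in language, PromQL also provides functions over time ranges; a sketch, assuming a temperature metric is being scraped:

avg_over_time(temperature[1w])

This returns the average of the last week's samples instead of the raw series.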

37. http://siridb.net/
38. https://www.timescale.com/
39. https://prometheus.io/


5.3.8 KairosDB

KairosDB40 is a time-series database that was released in 2013. It is written in Java. This database is designed on top of the column-family store Cassandra. It provides a REST API, and in 2017 it also added support for CQL. A query using the REST API, which returns temperatures from the last week, looks like this:

{ "start_relative":{"value":1,"unit":"weeks"}, "metrics":[ { "name":"temperature" } ] }

This database is used, for example, by Proofpoint, Signal, and Enbase.

5.3.9 Summary

Time-series database management systems are specialized to handle time-series data. They are based on relational or NoSQL database management systems. However, some of them are hard to categorize into one of those groups and are mostly referred to only as time-series database management systems. Table 5.3 shows the summary of the important factors that were identified in this thesis for each time-series database management system that was discovered. As can be seen, most of these tools are only a few years old, and all of them provide a query language. For other factors, like performance and scalability, further research should be accompanied by relevant benchmarks of those tools.

40. https://kairosdb.github.io/

36 5. Big Data Storage Systems

Tool         Language  Maturity / Origin  Query language
InfluxDB     Go        2013               InfluxQL
Riak TS      Erlang    2016               SQL
OpenTSDB     Java      2011               REST API
Druid        Java      2012               Druid SQL
SiriDB       C         2016               own query language
TimescaleDB  C         2017               SQL
Prometheus   Go        2014               PromQL
KairosDB     Java      2013               CQL

Table 5.3: Basic summary of time-series database management systems

5.4 Distributed file systems

Distributed file systems can be used for storing data too. Their advantage is simplicity. These systems are designed to store large data sets reliably and to stream them to a user application. However, the files do not provide any querying capabilities, so a distributed file system can be viewed similarly to a key-value store. The main differences are that the majority of distributed file systems work better for a small number of large files, whereas key-value stores work better for a large number of small files [36, 37]. File systems also do not support many key-value store features, like transactions or efficient retrieval of ranges of values. Their typical use case is when there is a need to store and retrieve big files by a key, for example, as storage for ISO image files. Some can also work with smaller files and be used, for example, as a personal storage system. It is difficult and beyond the scope of this thesis to make a detailed overview of all actual distributed file systems. Only a subset of them will be covered in more detail, to give the reader a basic overview of them. All systems that were identified in this thesis are mentioned in table 5.4 at the end of this section. This thesis found only a few studies that compare distributed file systems. A study [38] from 2013 compares six open source distributed file systems: HDFS, MooseFS, iRODS, Ceph, GlusterFS, and Lustre. First, it describes each tool's architecture, naming conventions, client access, cache consistency, replication, synchronization, load balancing, and fault detection. Then it compares the scalability, transparency, fault tolerance, system setup, accessibility, availability, and performance of each tool.


A year later, a survey [39] discussed the issues in designing distributed file systems, reviewed the taxonomy, and made an overview of four distributed file systems, of which only one is open source. In 2015, CERN published a paper [40] that states how distributed file systems are used, how their architecture has evolved, what the relevant techniques in distributed file systems are, and what the future challenges are.

5.4.1 Hadoop Distributed File System

The most popular open source distributed file system is the Hadoop Distributed File System41 (HDFS). It was created in 2006 as a part of Apache Hadoop, written in Java. Originally, it was mostly supported by the Yahoo! company [41]. Nowadays, many popular companies42 support Apache Hadoop, for example, Cloudera and Hortonworks. It has a master-slave architecture [41], where the NameNode is the master and the DataNodes are the slaves. HDFS can detect server failures and recover automatically. Many other distributed file systems have features similar to HDFS, with little difference in their usage. For example, the Quantcast File System (QFS) was designed as an alternative to HDFS, written in C and C++, to achieve better performance. A benchmark [42] from 2012 states that QFS is faster at writing and reading. HDFS is designed mostly for batch processing, so it aims for high throughput instead of low latency [43]. HDFS is used, for example, by Yahoo!, Facebook, Twitter, LinkedIn, and eBay.
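Working with HDFS from the command line resembles an ordinary file system; the file and directory names below are illustrative:

hdfs dfs -mkdir -p /data
hdfs dfs -put ubuntu.iso /data/ubuntu.iso
hdfs dfs -ls /data
hdfs dfs -get /data/ubuntu.iso ./ubuntu.iso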

5.4.2 SeaweedFS

SeaweedFS43 was created in 2011 and is written in Go. It was designed to be fast and simple to use. It is not POSIX compliant. Although it can deal with large files, it was designed to handle smaller ones. SeaweedFS is used by SANMS, Techbay, and ZTO Express.

41. http://hadoop.apache.org/
42. https://wiki.apache.org/hadoop/Support
43. https://github.com/chrislusf/seaweedfs


5.4.3 Perkeep

Perkeep44 was formerly known as Camlistore, which was created in 2011. It is written in Go and JavaScript. This tool was designed as a reliable personal storage system that can store data for the lifetime of its user. Perkeep can store any data and supports searching, visualizing, and sharing it.

5.4.4 Summary

Distributed file systems are used for storing files that can be distributed across several servers. They provide reliability, but not querying. Many databases can be built on top of them. Table 5.4 shows the summary of the important factors that were identified in this thesis for each distributed file system that was discovered. As can be seen, C is the most popular language. Usage in popular companies was not included because of the vague usage descriptions of many tools. For other factors, like performance, scalability, and internal feature differences, further research should be performed in more depth and accompanied by relevant benchmarks of those tools.

Tool        Language           Maturity / Origin
HDFS        Java               2006
QFS         C and C++          2008
SeaweedFS   Go                 2011
iRODS       C++                2012
CephFS      C++                2006
GlusterFS   C                  2005
Lustre      C                  2001
OpenAFS     C                  2000
XtreemFS    Java               2008
MooseFS     C                  2008
LizardFS    C++                2013
RozoFS      C                  2011
MogileFS    Perl               2004
Tahoe-LAFS  Python             2007
LeoFS       Erlang             2012
Perkeep     Go and JavaScript  2011
IPFS        Go                 2014
Upspin      Go                 2017
KBFS        Go                 2017
FastDFS     C                  2014

Table 5.4: Basic summary of distributed file systems

44. https://perkeep.org/

6 Big Data Processing Systems

This chapter classifies the available Big Data processing options into groups. Every group is described, including the typical use cases where its tools can be used. Then, for each group, there is an overview of its most popular open source tools. The base structure of the tool review is:

1. basic information (creation date, implementation language, developers),

2. important internal or external features which affect the usability,

3. suggested special use cases,

4. companies that use this tool.

If some property of the specific tool was not identified, it is omitted in the text. At the end of this chapter, a table can be found. This table contains the summary of important factors that were identified in this thesis for each tool.

6.1 Batch processing systems

Batch processing systems process the data in tasks which can run without user interaction. Typically, they are performed on large static data, called a batch, and can run for a significant time [44].

6.1.1 Apache Hadoop MapReduce Hadoop MapReduce1 was created in 2006 as a part of Apache Hadoop, written in Java. Originally, it was mostly supported by Yahoo! com- pany [41]. Nowadays, many popular companies2 support Apache Hadoop, for example, Cloudera and Hortonworks.

1. http://hadoop.apache.org/
2. https://wiki.apache.org/hadoop/Support


It uses the MapReduce programming model, which was designed for scalable parallel applications that process a large amount of data. Its main advantage is that it can run in parallel on any number of servers and can be re-executed if a server crashes [45]. MapReduce takes key-value pairs as input and returns key-value pairs as output. The computation is expressed by implementing the Map and Reduce functions. Map takes a set of key-value pairs and produces another set of key-value pairs, based on the required functionality. This output is then grouped by key and passed to Reduce as a set of key-value pairs in which each value consists of the set of values grouped under that key. Reduce takes this as input and merges the values, based on the required functionality. This process results in a set of key-value pairs. A Hadoop MapReduce implementation, written in Scala, that takes a set of id-text pairs and returns the set of word-frequency pairs stating how many times each word occurred in the input, looks like this:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

class WordCountMapper extends Mapper[Object, Text, Text, IntWritable] {
  override def map(key: Object, value: Text,
      context: Mapper[Object, Text, Text, IntWritable]#Context) = {
    // Emit (word, 1) for every word in the input text.
    value.toString.split("\\W+")
      .foreach(word => context.write(new Text(word), new IntWritable(1)))
  }
}

class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
      context: Reducer[Text, IntWritable, Text, IntWritable]#Context) = {
    // Sum all partial counts collected for this word.
    val sum = values.asScala.foldLeft(0)((t, i) => t + i.get)
    context.write(key, new IntWritable(sum))
  }
}

The disadvantage of this programming model is that the output of each Map and Reduce has to be stored in the local file system before the computation may continue. This process grants fault tolerance, but it dramatically affects performance. It results in significant inefficiency, for example, in join operations, iterative processing, and stream processing. The dataflow process is also a problem. Tasks often have to be non-trivially reimplemented because of their different dataflow, for example, join tasks. Also, each operation has to be written in this manner, which may affect the reusability of the code [46]. Because of the performance limitations, Hadoop MapReduce is mostly used in batch processing without many iterations and relationships in the data. Many popular companies use it, for example, Yahoo!, IBM, Facebook, Twitter, LinkedIn, Spotify, Adobe, and eBay.

6.1.2 Alternatives

Different open source Big Data tools can be used in batch processing. However, all the found tools are specialized for a specific usage and are therefore introduced in other categories.

6.2 Stream processing systems

Stream processing systems process the flowing streams of data, possibly from multiple data sources. These systems are typically deployed as running tasks that run until cancellation. They can be used, for example, in monitoring, smart device applications, and machine learning [47]. Typically, a stream processing tool processes much smaller data than a batch processing system; however, in this case, latency is more important.

6.2.1 Apache Storm

Apache Storm3 is a distributed real-time computation system that was created in 2011. It is written in Java. Storm works with streams of data as infinite sequences of tuples, which can also be processed in parallel. The logic of the application is represented as a Storm topology, which is a graph of spouts and bolts. Spouts are sources of streams, and bolts are processing units that take streams as input and, after processing, can emit multiple streams as output. So, to implement a Storm application that would take, for example, Twitter posts and return the set of word-frequency pairs stating

3. http://storm.apache.org/

how many times each word has been used in those posts, these steps should be done in order:

1. create a spout that periodically takes Twitter posts and emits a stream of the text of the posts created in that period,

2. create a bolt that takes this text of posts, splits it into words, and emits them,

3. create a bolt that takes these words, does a word count on them, and emits the result.

Storm is used, for example, by Yahoo!, Twitter, Spotify, Cerner, and Groupon. The full code sample is not shown because of its complexity; a minimal sketch of the counting bolt from step 3 follows.
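The sketch assumes Storm's Java API (package names as in Storm 1.x) used from Scala; a complete topology would also need the spout and the splitting bolt from steps 1 and 2:

import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.tuple.{Fields, Tuple, Values}
import scala.collection.mutable

// Bolt from step 3: receives single words and emits a running count per word.
class WordCountBolt extends BaseBasicBolt {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val word = input.getStringByField("word")
    counts(word) += 1
    collector.emit(new Values(word, Long.box(counts(word))))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word", "count"))
}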

6.2.2 Alternatives

Other open source Big Data tools are designed for stream processing. Apache Samza4 can be used in the same set of use cases as Storm. Apache Gearpump5 is better in IoT use cases and in use cases that need to upgrade the running application without interruption. Riemann6 is an event system with a robust stream processing language. There are also tools that were designed as a replacement for Storm, such as Apache Heron7 and JStorm8. They are both backward compatible with Storm. The general-purpose processing systems are also designed for stream processing. They are described in section 6.5.

6.3 Graph processing systems

Graph processing systems were designed to solve the performance problems of MapReduce on graph datasets. Many graph algorithms are iterative and require complex joins of data, which are Hadoop MapReduce’s weaknesses [46].

4. http://samza.apache.org/
5. https://gearpump.apache.org/
6. http://riemann.io/
7. http://incubator.apache.org/projects/heron.html
8. http://jstorm.io/


6.3.1 Apache Giraph

Apache Giraph9 was created in 2012 by Yahoo!, written in Java. It is designed for batch processing of large graph data and runs on top of Apache Hadoop, using MapReduce. Giraph uses the Bulk Synchronous Parallel (BSP) programming model. This model uses a message-passing interface (MPI) to support scalability and parallelism across multiple servers. In the BSP programming model, all algorithms are implemented from the point of view of a vertex. The computation is represented as a sequence of supersteps, where each superstep defines what each participating vertex has to do. A vertex can, for example, send a message to another vertex, change its state to inactive, and execute a function. Supersteps run synchronously one after another. If a vertex is inactive, it does not perform computing, but it can become active again by receiving a message from another vertex. The program ends when all vertices are set to inactive. The Giraph implementation, written in Java [48], that implements Dijkstra's algorithm to find the shortest paths from the source vertex to all others, looks like this:

public void compute(Iterator<DoubleWritable> msgIterator) {
    if (getSuperstep() == 0) {
        setVertexValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // if you are a source vertex, set min to 0, else infinity
    double min = (getContext().getConfiguration()
            .getLong(SOURCE_ID, SOURCE_ID_DEFAULT) == getVertexId().get())
            ? 0d : Double.MAX_VALUE;
    // read all received messages
    while (msgIterator.hasNext()) {
        min = Math.min(min, msgIterator.next().get());
    }
    if (min < getVertexValue().get()) {
        setVertexValue(new DoubleWritable(min));
        // send a message to all neighbors
        for (Edge<LongWritable, FloatWritable> edge : getOutEdgeMap().values()) {
            sendMsg(edge.getDestVertexId(),
                    new DoubleWritable(min + edge.getEdgeValue().get()));
        }
    }
    voteToHalt(); // set this vertex to inactive
}

9. http://giraph.apache.org/

Giraph is used, for example, by Facebook, Zynga, and The Apache Software Foundation.

6.3.2 Alternatives

Apache Hama10 is a tool that also uses the BSP computing model. However, it was designed more generally, not only for graph problems. Real-time graph processing is provided, for example, by the graph processing library GraphJet11, which is used as a recommendation system at Twitter [49]. The general-purpose processing systems are also designed for graph processing. They are described in section 6.5.

6.4 High-level representation tools

High-level representation tools are designed to simplify programming on top of a specific set of tools. The task can be written in a high-level programming language, leaving the optimizations to the backend engine. Such a programming language can improve code reusability and maintainability, and allows implementing otherwise non-trivial tasks more quickly, for example, joins in Hadoop MapReduce [46]. However, high-level representation tools can be used only on data with a structure.

6.4.1 Apache Hive

Apache Hive12 is a high-level representation tool that was created in 2009 by Facebook. It is written in Java. It runs on top of Hadoop and allows accessing data with the query language HiveQL. A query in this language is afterwards compiled into MapReduce. Hive can also be used on top of some other technologies, like the HBase database. The query, which returns how many users are from each city, looks like this:

10. https://hama.apache.org/
11. https://github.com/twitter/GraphJet
12. https://hive.apache.org/


SELECT word, count(1) AS count
FROM (SELECT explode(split(cities, ' ')) AS word FROM users) tempUsers
GROUP BY word

Hive is used, for example, by Facebook, RocketFuel, and Grooveshark.

6.4.2 Apache Pig

Apache Pig13 was created in 2008 by the Yahoo! company. It is written in Java. Pig provides a scripting environment which can be used on top of Hadoop MapReduce. It can be used in situations where the programmer does not want to rely on an SQL query optimizer but wants to have greater control over the task. It uses the Pig Latin language. The program consists of a sequence of operations, where each operation represents a single data transformation, for example, join, filter, map, or reduce. It is then compiled to a sequence of MapReduce tasks. The program, which returns how many users are from each city, looks like this:

lines   = LOAD '/users/cities.txt' AS (sentence:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(sentence, ' ')) AS word;
grouped = GROUP words BY word;
result  = FOREACH grouped GENERATE group, COUNT(words);

Pig is used, for example, by Twitter, MITRE, and LinkedIn.

6.4.3 Summingbird

Summingbird is a library created in 2013 by Twitter. It is written in Scala. Summingbird allows writing programs as native Scala or Java collection transformations. These programs can then be executed on

13. https://pig.apache.org/

some batch and stream processing tools, for example, the batch processing Hadoop MapReduce and the stream processing Storm. This option helps in hybrid systems that use both of these types of technologies for the same workflow, so the processing logic does not have to be written in two different languages. The program, written in Scala, which returns the set of word-frequency pairs stating how many times each word was in the input, looks like this [50]:

def wordCount[P <: Platform[P]](
    source: Producer[P, String],
    store: P#Store[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.sumByKey(store)

6.4.4 Alternatives

Many other high-level representation tools are designed for better performance of the SQL approach. Tools like Apache Impala14, Presto15, and Apache Phoenix16 provide an SQL query language similar to Hive's. However, they do not compile their language to Hadoop MapReduce. Instead, they use their own optimized engines to increase the task performance. Apache Drill17 is a tool that supports standard SQL and allows users to query semi-structured data directly. It supports more types of data sources, for example, MongoDB and cloud storages. As an alternative to non-SQL high-level representation tools, there is Cascading18, which provides a Java API that can be used to define workflows on top of various tools, for example, Apache Hadoop and the general-purpose Apache Flink. There are many extensions to Cascading that extend it with features of other languages, like Clojure, Scala, and Python19.

14. https://impala.apache.org/
15. https://prestodb.io/
16. https://phoenix.apache.org/
17. https://drill.apache.org/
18. http://www.cascading.org/projects/cascading/
19. http://www.cascading.org/extensions/

There is also Apache Beam20, which provides a unified programming model that can be used for batch and stream processing and can be run on multiple systems, for example, Gearpump and the general-purpose Apache Samza, Apache Flink, and Apache Spark. It has ambitions to unify all data processing systems under one API. The general-purpose processing systems also support built-in high-level representation languages. They are described in section 6.5.

6.5 General-purpose processing systems

General-purpose processing systems were designed for many types of data processing scenarios. They shine in complex problems where there is a need to use multiple types of processing systems but a desire to maintain only one.

6.5.1 Apache Spark

Apache Spark21 was created in 2009 by AMPLab at UC Berkeley22. It is written in Scala. Spark improves on the performance of Hadoop MapReduce by loading the data into memory. It uses Resilient Distributed Datasets (RDDs), which represent collections of data. Each RDD can be transformed by functions that process each element in the collection and return a new RDD with the transformed elements. Spark uses a lazy mechanism: it loads the data into memory and processes them only after an action function is called, for example, first, which returns the first element of the RDD, or count, which returns the number of elements in the RDD. A used RDD can be cached in memory for the following operations, so Spark performs very well in iterative use cases. An implementation, written in Scala, that takes a text file and returns the set of word-frequency pairs stating how many times each word was in the input, looks like this:

20. https://beam.apache.org/
21. https://spark.apache.org/
22. https://amplab.cs.berkeley.edu/

val counts = textFile
  .flatMap(_.split("\\W+"))
  .map((_, 1))        // pair every word with the count 1
  .reduceByKey(_ + _) // sum the counts per word

Spark provides many libraries that enhance its capabilities. Spark SQL is a high-level representation tool that allows using SQL querying. Spark can use GraphX for graph processing and Spark Streaming for stream processing. All those libraries can be combined for a complex problem. Spark is used, for example, by eBay, Autodesk, Amazon, and Shopify.

6.5.2 Apache Flink

Apache Flink23, originally called Stratosphere, was created in 2009 by the Technical University of Berlin. It is written in Java. It was designed as a general-purpose tool that is specialized in stream processing. It treats batch processing as a special case of stream processing of fixed data. Flink uses the Parallelization Contracts (PACTs) programming model [51]. One PACT contains one second-order function, named an Input Contract, that takes one or more datasets as input, and a first-order function that should process the input. A PACT can also contain an optional Output Contract, which describes specific characteristics of the output data that can be important to the optimizer. The program can contain multiple PACTs attached in a specific sequence. An implementation, written in Scala, that takes a text file and returns the set of word-frequency pairs stating how many times each word was in the input, looks like this:

val counts = textFile
  .flatMap(_.split("\\W+"))
  .map((_, 1))  // pair every word with the count 1
  .groupBy(0)   // group by the word (field 0)
  .sum(1)       // sum the counts (field 1)

23. https://flink.apache.org/


Flink also provides several libraries that give Flink additional functionality. The Table API provides SQL querying, and Gelly is used for graph processing. Flink is used, for example, by Uber, Zalando, King, and Mux.

6.5.3 Alternatives

Other open source Big Data tools that can be categorized as general-purpose processing systems are younger, for example, Apache Apex24 and Onyx25.

6.6 Summary

Processing systems can be categorized into five categories, based on the type of processing:

∙ batch processing systems, which are focused on the analysis of typically large, static data,

∙ stream processing systems, which are focused on the analysis of flowing streams of data,

∙ graph processing systems, which are focused on the analysis of graph data,

∙ high-level representation tools, which are designed to provide a high-level programming language on top of other technologies,

∙ general-purpose processing systems, which are focused on the analysis of complex problems that would otherwise need multiple processing tools.

Table 6.1 shows the summary of the important factors that were identified in this thesis for each processing system that was discovered. As can be seen, Java is the most popular language. Also, each of these tools is used in popular companies. For other factors, like performance and scalability, further research should be accompanied by relevant benchmarks of those tools.

24. https://apex.apache.org/
25. http://www.onyxplatform.org/


Tool                     Language     Maturity / Origin  Classification   Used in popular companies
Apache Hadoop MapReduce  Java         2006               batch            ✓
Apache Storm             Java         2011               stream           ✓
Apache Samza             Java, Scala  2014               stream           ✓
Apache Gearpump          Scala        2014               stream           ✓
Riemann                  Clojure      2012               stream           ✓
Apache Heron             Java         2016               stream           ✓
JStorm                   Java         2013               stream           ✓
Hydra                    Java         2013               stream           ✓
Apache Giraph            Java         2012               graph            ✓
Apache Hama              Java         2012               graph            ✓
GraphJet                 Java         2016               graph            ✓
Apache Hive              Java         2009               high-level       ✓
Apache Pig               Java         2008               high-level       ✓
Summingbird              Scala        2013               high-level       ✓
Apache Impala            C++          2012               high-level       ✓
Apache Phoenix           Java         2014               high-level       ✓
Presto                   Java         2012               high-level       ✓
Apache Drill             Java         2013               high-level       ✓
Cascading                Java         2009               high-level       ✓
Apache Beam              Java         2016               high-level       ✓
Apache HAWQ              C            2013               high-level       ✓
—                        Java         2014               high-level       ✓
—                        C++          2015               high-level       ✓
Apache Spark             Scala        2009               general-purpose  ✓
Apache Flink             Java         2009               general-purpose  ✓
Apache Apex              Java         2012               general-purpose  ✓
Onyx                     Clojure      2015               general-purpose  ✓

Table 6.1: Basic summary of processing systems

7 Tool Selection Diagram

This chapter contains the initial version of the Big Data open source tool selection diagram, which was created based on the knowledge acquired in this thesis. In the future, the diagram is expected to be updated by further research and the results of relevant benchmarks, and extended by other categories from the proposed architecture. This diagram can be used to determine the solution to a given Big Data problem. It is designed as an activity diagram with decision and action nodes. The user can traverse the diagram and, based on the guard conditions, arrive at an activity. Each activity represents the selection of one of the mentioned tools. The guard conditions in each decision node are not mutually exclusive, so the user can end up with a set of tools that has to be used as a solution to the problem. Tests need to be implemented to validate this diagram. In this thesis, two examples of tests are proposed and then implemented by comparing the relevant results of the solution using a tool that was chosen by the diagram and the solution using another tool. This tool was chosen as a popular alternative that may seem suitable and usable for the given problem. To implement these tests precisely and validate a specific use case selection, the tester should compare the results of the tool chosen by the diagram with the results of every other tool. In this thesis, the proposed tests are written in the PHP language. Their source code can be found in the attachment.


Figure 7.1: Tool Selection Diagram

7.1 Validation

The data chosen for the validation of the proposed diagram hold information about the traffic in the Plzeň region in 20151. After decompression, their size is 19.98 GB, and they consist of 272,658,647 records. The data are structured and stored in a CSV file. To work with this file, I had to make these transformations:

∙ putting the header from the documentation into the first line of the CSV file, so that the tools can obtain the names of the fields,

∙ replacing the vertical bar separator with a comma, because MongoDB does not allow specifying a separator for CSV importing, and the import does not work with a vertical bar as a separator,

∙ replacing the empty string (1, "", 2) with null (1, , 2), because MongoDB's CSV parser accepts only data that complies with RFC 41802 and therefore treats double quotes as an escaped quote [52].
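One possible way to perform all three transformations, written in Scala; the file names and the header line are illustrative:

import scala.io.Source
import java.io.PrintWriter

val out = new PrintWriter("traffic-clean.csv")
out.println("id,time,lane,speed") // header line taken from the dataset documentation
for (line <- Source.fromFile("traffic.csv").getLines()) {
  out.println(line
    .replace("|", ",")    // vertical bar separator -> comma
    .replace("\"\"", "")) // quoted empty string -> empty (null) field
}
out.close()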

To illustrate the structure of these data, the first fifty records from the transformed file were put in the attachment. The proposed diagram has been tested on two different problems: a simple analysis of the whole dataset, and repeated storing and getting of random records on a small subset of the dataset. In both tests, the measurements were performed five times. From them, the average values were computed and compared.

First test The first test was a basic analysis of the whole dataset. In this test, the desired output from the database is all dates on which a detected vehicle had a speed of 44 km/h. For this problem, the diagram recommends using a traditional RDBMS, so MySQL was chosen. Its analysis time is compared with the document store MongoDB. The results in table 7.1 show that for querying structured data that fit on a single server, MySQL is faster than MongoDB.
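In essence, the measured analysis boils down to a query like the following; the table and column names are assumptions, as the actual schema follows the dataset documentation:

SELECT date FROM records WHERE speed = 44;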

1. http://doprava.plzensky-kraj.cz/opendata/doprava/rok/
2. https://tools.ietf.org/html/rfc4180


Database  Time in seconds
MySQL     464.58
MongoDB   521.07

Table 7.1: Results of the first test

Second test The second test was a repeated storing and getting of random records on a small subset of the chosen dataset. In this test, the chosen databases already had 1,000 records stored. Then, 1,000 iterations of a task were performed; each iteration writes a new record and randomly reads a single record by a key. This is intended to simulate a caching workflow. For this problem, the diagram recommends using Redis. This database was compared with Riak KV. The total time of this given workflow was analyzed.

Database  Time in seconds
Redis     0.22
Riak KV   5.17

Table 7.2: Results of the second test

The results in table 7.2 show that for a workflow with the same number of random reads and writes, Redis is 23.5 times faster than Riak KV. To find out whether there exists a read/write ratio at which Riak KV can be faster than Redis, the read and write times were analyzed separately.

Database  Read time in seconds  Write time in seconds
Redis     0.11                  0.10
Riak KV   2.20                  2.97

Table 7.3: Results of the extended second test

The results in table 7.3 show that Redis performs faster than Riak KV in both operations. In conclusion, for a workflow which consists of writing new records and random reads by a key, with any read/write ratio, Redis is faster than Riak KV.

8 Attachments

The following files were attached to the electronic version of the thesis:

∙ BigDataToolsTestings.zip: The complete set of source files of the tests performed in this thesis.

∙ PlzenData.csv: The first fifty records from the transformed file, in the CSV format.

∙ lshwOutput.html: The hardware specification of the server that was used for testing.


9 Conclusion

This thesis was focused on the design of a Big Data tool selection diagram, which can help to choose the right tools for a given Big Data problem. The thesis included the tool classification into components and proposed a Big Data tool architecture for a general Big Data problem, which illustrates the communication between them. The storage and processing components were chosen, and an overview of the actual Big Data tools from these components was made. This overview further classified the tools, presented their typical use cases, and described the identified tools in more detail. It also proposed the relevant factors of these tools, which can help to choose between them. Based on the knowledge gained from this overview, the initial version of the Big Data tool selection diagram was created. At the time of writing this thesis, it contains the storage and processing tools. However, it is expected to be extended in the future. In the end, the thesis proposed the process of diagram validation and provided a set of tests as examples. They compared the relevant results of a solution to a specific problem using the tool that was chosen by the diagram and a solution using another tool.

9.1 Future directions

As this thesis initiates a longer project that is to be continued, there are numerous directions for future work.

∙ The analysis of other categories from the proposed architecture can improve the knowledge of open source tools that were not covered in this thesis. Such an overview can help with choosing the right tool from the other categories and may result in an extension of the tool selection diagram.

∙ The improvement of the current tool selection diagram is also possible. There is a need to validate each transition in it by using tests that would compare the relevant results of the solution using the tool chosen by the diagram and all other tools.


This process can result in a more detailed diagram, with more options and transitions. However, it can also reduce their number, because some tools may show poor results against the others in the same category. In this case, such tools would be discarded.

∙ Because the changes in this domain are still quite frequent, there is also a need to repeat the surveys of storage and processing tools in the future. This thesis makes newer surveys in those categories easier, because they can check for updates in the tools stated here and then only try to find newly developed tools.

∙ In this thesis, the tools were reviewed mainly by their usability, and the survey had a practical focus. Further research can include more technically oriented surveys that would, for example, compare the details of the implementations of those tools.

Bibliography

1. AGRAHARI, Anurag; RAO, Dharmaji. A Review paper on Big Data: Technologies, Tools and Trends. International Research Journal of Engineering and Technology (IRJET). 2017, vol. 4, pp. 640–649. ISSN 2395-0056. Available also from: https://irjet.net/archives/V4/i10/IRJET-V4I10112.pdf.
2. DEMCHENKO, Y.; GROSSO, P.; LAAT, C. de; MEMBREY, P. Addressing big data issues in Scientific Data Infrastructure. In: 2013 International Conference on Collaboration Technologies and Systems (CTS). 2013, pp. 48–55. Available from DOI: 10.1109/CTS.2013.6567203.
3. GANDOMI, Amir; HAIDER, Murtaza. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management. 2015, vol. 35, no. 2, pp. 138. ISSN 0268-4012. Available from DOI: https://doi.org/10.1016/j.ijinfomgt.2014.10.007.
4. How Much Data Is Created on the Internet Each Day? [online] [visited on 2018-05-11]. Available from: https://dzone.com/articles/how-much-data-is-created-on-the-internet-each-day.
5. DEMCHENKO, Y.; LAAT, C. de; MEMBREY, P. Defining architecture components of the Big Data Ecosystem. In: 2014 International Conference on Collaboration Technologies and Systems (CTS). 2014, pp. 104–112. Available from DOI: 10.1109/CTS.2014.6867550.
6. Understanding Big Data: The Seven V's [online] [visited on 2018-05-18]. Available from: http://dataconomy.com/2014/05/seven-vs-big-data/.
7. The 10 Vs of Big Data [online] [visited on 2018-05-18]. Available from: https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx.
8. Top 10 Big Data Challenges – A Serious Look at 10 Big Data V's [online] [visited on 2018-05-18]. Available from: https://mapr.com/blog/top-10-big-data-challenges-serious-look-10-big-data-vs/.


9. The 42 V's of Big Data and Data Science [online] [visited on 2018-05-18]. Available from: https://www.elderresearch.com/blog/42-v-of-big-data.
10. SINGH, Dilpreet; REDDY, Chandan K. A survey on platforms for big data analytics. Journal of Big Data. 2014, vol. 2, no. 1, pp. 3–4. ISSN 2196-1115. Available from DOI: 10.1186/s40537-014-0008-6.
11. SADALAGE, P.J.; FOWLER, M. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. In: Pearson Education, 2012, chap. 4, pp. 37–45. ISBN 9780133036121.
12. SADALAGE, P.J.; FOWLER, M. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. In: Pearson Education, 2012, chap. 5, pp. 47–60. ISBN 9780133036121.
13. Towards Robust Distributed Systems. 2000. Available also from: https://people.eecs.berkeley.edu/~brewer/PODC2000.pdf.
14. GILBERT, Seth; LYNCH, Nancy. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-tolerant Web Services. SIGACT News. 2002, vol. 33, no. 2, pp. 51–59. ISSN 0163-5700. Available from DOI: 10.1145/564585.564601.
15. BREWER, E. CAP twelve years later: How the "rules" have changed. Computer. 2012, vol. 45, no. 2, pp. 23–29. ISSN 0018-9162. Available from DOI: 10.1109/MC.2012.37.
16. AKIDAU, Tyler et al. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing. Proc. VLDB Endow. 2015, vol. 8, no. 12, pp. 1792–1793. ISSN 2150-8097. Available from DOI: 10.14778/2824032.2824076.
17. WANG, Yichuan; KUNG, LeeAnn; BYRD, Terry Anthony. Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change. 2018, vol. 126, pp. 3–13. ISSN 0040-1625. Available from DOI: https://doi.org/10.1016/j.techfore.2015.12.019.


18. KLEIN, John; BUGLAK, Ross; BLOCKOW, David; WUTTKE, Troy; COOPER, Brenton. A Reference Architecture for Big Data Systems in the National Security Domain. In: Proceedings of the 2nd International Workshop on BIG Data Software Engineering. Austin, Texas: ACM, 2016, pp. 51–57. BIGDSE '16. ISBN 978-1-4503-4152-3. Available from DOI: 10.1145/2896825.2896834.
19. MARJANI, M.; NASARUDDIN, F.; GANI, A.; KARIM, A.; HASHEM, I. A. T.; SIDDIQA, A.; YAQOOB, I. Big IoT Data Analytics: Architecture, Opportunities, and Open Research Challenges. IEEE Access. 2017, vol. 5, pp. 5247–5261. Available from DOI: 10.1109/ACCESS.2017.2689040.
20. NADAL, Sergi; HERRERO, Victor; ROMERO, Oscar; ABELLÓ, Alberto; FRANCH, Xavier; VANSUMMEREN, Stijn; VALERIO, Danilo. A software reference architecture for semantic-aware Big Data systems. Information and Software Technology. 2017, vol. 90, pp. 75–92. ISSN 0950-5849. Available from DOI: https://doi.org/10.1016/j.infsof.2017.06.001.
21. SHAH, P.; HIREMATH, D.; CHAUDHARY, S. Big Data Analytics Architecture for Agro Advisory System. In: 2016 IEEE 23rd International Conference on High Performance Computing Workshops (HiPCW). 2016, pp. 43–49. Available from DOI: 10.1109/HiPCW.2016.015.
22. SARNOVSKY, Martin; BEDNAR, Peter; SMATANA, Miroslav. Big Data Processing and Analytics Platform Architecture for Process Industry Factories. Big Data and Cognitive Computing. 2018, vol. 2, no. 1. ISSN 2504-2289. Available from DOI: 10.3390/bdcc2010003.
23. PÄÄKKÖNEN, Pekka; PAKKALA, Daniel. Reference Architecture and Classification of Technologies, Products and Services for Big Data Systems. Big Data Research. 2015, vol. 2, no. 4, pp. 166–186. ISSN 2214-5796. Available from DOI: https://doi.org/10.1016/j.bdr.2015.01.001.
24. GÖKALP, Mert; KAYABAY, Kerem; ZAKI, Mohamed; KOÇYIĞIT, Altan; ERHAN EREN, P; NEELY, Andy. Big-Data Analytics Architecture for Businesses: a comprehensive review on new open-source big-data tools. 2017.


25. Questioning the Lambda Architecture: The Lambda Architecture has its merits, but alternatives are worth exploring. [online] [visited on 2018-05-18]. Available from: https://www.oreilly.com/ideas/questioning-the-lambda-architecture.
26. Flafka: Big Data Solution for Data Silos [online] [visited on 2018-05-18]. Available from: https://www.datasciencecentral.com/profiles/blogs/6448529:BlogPost:541864.
27. SADALAGE, P.J.; FOWLER, M. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. In: Pearson Education, 2012, chap. 1, pp. 8–9. ISBN 9780133036121.
28. PAVLO, Andrew; ASLETT, Matthew. What's Really New with NewSQL? SIGMOD Rec. 2016, vol. 45, no. 2, pp. 45–55. ISSN 0163-5808. Available from DOI: 10.1145/3003665.3003674.
29. PRESSER, Marshall. Data Warehousing with Greenplum: Open Source Massively Parallel Data Analytics. 1st. O'Reilly Media, Inc, 2017.
30. CRATE.IO. CrateDB for Time Series: How CrateDB compares to specialized time series data stores. 2017. Technical report.
31. SRINIVASAN, V.; BULKOWSKI, Brian; CHU, Wei-Ling; SAYYAPARAJU, Sunil; GOODING, Andrew; IYER, Rajkumar; SHINDE, Ashish; LOPATIC, Thomas. Aerospike: Architecture of a Real-time Operational DBMS. Proc. VLDB Endow. 2016, vol. 9, no. 13, pp. 1389–1400. ISSN 2150-8097. Available also from: http://www.vldb.org/pvldb/vol9/p1389-srinivasan.pdf.
32. Implementing a Document Store [online] [visited on 2018-05-18]. Available from: https://docs.basho.com/riak/kv/2.2.3/developing/usage/document-store/.
33. ScyllaDB vs Cassandra: Benchmarks [online] [visited on 2018-05-18]. Available from: https://www.scylladb.com/product/benchmarks/.
34. ARANGODB. What is a multi-model database and why use it? 2016. Technical report.
35. Riak TS - Data Modeling Basics: Creating a Table [online] [visited on 2018-05-18]. Available from: https://github.com/cvitter/Riak-TS-Data-Modeling/blob/master/Data%20Modeling%20Basics.md#creating-a-table.


36. SADALAGE, P.J.; FOWLER, M. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. In: Pearson Education, 2012, chap. 14. ISBN 9780133036121.
37. Small files in Hadoop [online] [visited on 2018-05-18]. Available from: https://medium.com/arabamlabs/small-files-in-hadoop-88708e2f6a46.
38. DEPARDON, Benjamin; LE MAHEC, Gaël; SÉGUIN, Cyril. Analysis of Six Distributed File Systems. 2013. Available also from: https://hal.inria.fr/hal-00789086. Research Report.
39. RANI, L. Sudha; SUDHAKAR, K.; KUMAR, S. Vinay. Distributed File Systems: A Survey. 2014.
40. BLOMER, J. A Survey on Distributed File System Technology. Journal of Physics: Conference Series. 2015, vol. 608, no. 1, pp. 012039. Available also from: http://stacks.iop.org/1742-6596/608/i=1/a=012039.
41. SHVACHKO, Konstantin; KUANG, Hairong; RADIA, Sanjay; CHANSLER, Robert. The Hadoop Distributed File System. In: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–10. MSST '10. ISBN 978-1-4244-7152-2. Available from DOI: 10.1109/MSST.2010.5496972.
42. Performance Comparison to HDFS [online] [visited on 2018-05-18]. Available from: https://github.com/quantcast/qfs/wiki/Performance-Comparison-to-HDFS.
43. BORTHAKUR, Dhruba. The Hadoop Distributed File System: Architecture and Design. 2007, vol. 11, pp. 21.
44. Big Data Battle: Batch Processing vs Stream Processing [online] [visited on 2018-05-14]. Available from: https://medium.com/@gowthamy/big-data-battle-batch-processing-vs-stream-processing-5d94600d8103.
45. SAKR, Sherif. Big Data 2.0 Processing Systems: A Survey. In: 1st. Springer Publishing Company, Incorporated, 2016, pp. 15–19. ISBN 3319387758, 9783319387758.


46. SAKR, Sherif. Big Data 2.0 Processing Systems: A Survey. In: 1st. Springer Publishing Company, Incorporated, 2016. ISBN 3319387758, 9783319387758.
47. A Gentle Introduction to Stream Processing [online] [visited on 2018-05-18]. Available from: https://medium.com/@srinathperera/what-is-stream-processing-1eadfca11b97.
48. Shortest Paths Example [online] [visited on 2018-05-18]. Available from: https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example.
49. SHARMA, Aneesh; JIANG, Jerry; BOMMANNAVAR, Praveen; LARSON, Brian; LIN, Jimmy. GraphJet: Real-time Content Recommendations at Twitter. Proc. VLDB Endow. 2016, vol. 9, no. 13, pp. 1281–1292. ISSN 2150-8097. Available from DOI: 10.14778/3007263.3007267.
50. Summingbird [online] [visited on 2018-05-18]. Available from: https://github.com/twitter/summingbird.
51. ALEXANDROV, Alexander; HEIMEL, Max; MARKL, Volker; BATTRÉ, Dominic; HUESKE, Fabian; NIJKAMP, Erik; EWEN, Stephan; KAO, Odej; WARNEKE, Daniel. Massively Parallel Data Analysis with PACTs on Nephele. Proc. VLDB Endow. 2010, vol. 3, no. 1-2, pp. 1625–1628. ISSN 2150-8097. Available from DOI: 10.14778/1920841.1921056.
52. MongoDB 3.6 mongoimport [online] [visited on 2018-05-18]. Available from: https://docs.mongodb.com/manual/reference/program/mongoimport/index.html.
