Tools for Big Data Analysis
Masaryk University
Faculty of Informatics

Tools for Big Data Analysis

Master’s Thesis
Bc. Martin Macák
Brno, Spring 2018

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Martin Macák

Advisor: doc. Ing. RNDr. Barbora Bühnová, Ph.D.

Acknowledgements

I would like to thank my supervisor, doc. Ing. RNDr. Barbora Bühnová, Ph.D., for offering me the opportunity to work on this thesis. Her support, guidance, and patience greatly helped me to finish it. I would also like to thank her for introducing me to the great team of people in the CERIT-SC Big Data project. From this team, I would especially like to thank RNDr. Tomáš Rebok, Ph.D., who many times found time to provide me with useful advice, and Bruno Rossi, PhD, who gave me the opportunity to present the results of this thesis at the LaSArIS seminar. I would also like to express my gratitude for the support of my family: my parents, Jana and Alexander, and the best sister, Nina. My thanks also belong to my supportive friends, mainly Bc. Tomáš Milo, Bc. Peter Kelemen, Bc. Jaroslav Davídek, Bc. Štefan Bojnák, and Mgr. Ondřej Gasior. Lastly, I would like to thank my girlfriend, Bc. Iveta Vidová, for her patience and support.

Abstract

This thesis focuses on the design of a Big Data tool selection diagram, which can help to choose the right open source tools for a given Big Data problem. The thesis includes the classification of tools into components and proposes a Big Data tool architecture for a general Big Data problem, which illustrates the communication between those components.
The thesis then selects some of those components and researches them in more detail, creating an overview of current Big Data tools. Based on this overview, the initial version of the Big Data tool selection diagram, which contains storage and processing tools, is created. The thesis then proposes a process for validating the diagram and provides a set of tests as examples. These tests are implemented by comparing the relevant results of a solution using the tool chosen by the diagram with a solution using another tool.

Keywords

Big Data, Big Data tools, Big Data architecture, Big Data storage, Big Data processing

Contents

1 Introduction
2 Big Data
  2.1 Characteristics
  2.2 Big Data system requirements
    2.2.1 Scalability
    2.2.2 Distribution models
    2.2.3 Consistency
3 State of the Art in Big Data Tools
4 Big Data Tools Architecture
  4.1 Related work
  4.2 Classification
  4.3 Proposed architecture
5 Big Data Storage Systems
  5.1 Relational database management systems
    5.1.1 Data warehouse databases
    5.1.2 NewSQL database management systems
    5.1.3 Summary
  5.2 NoSQL database management systems
    5.2.1 Key-value stores
    5.2.2 Document stores
    5.2.3 Column-family stores
    5.2.4 Graph databases
    5.2.5 Multi-model databases
    5.2.6 Summary
  5.3 Time-series database management systems
    5.3.1 InfluxDB
    5.3.2 Riak TS
    5.3.3 OpenTSDB
    5.3.4 Druid
    5.3.5 SiriDB
    5.3.6 TimescaleDB
    5.3.7 Prometheus
    5.3.8 KairosDB
    5.3.9 Summary
  5.4 Distributed file systems
    5.4.1 Hadoop Distributed File System
    5.4.2 SeaweedFS
    5.4.3 Perkeep
    5.4.4 Summary
6 Big Data Processing Systems
  6.1 Batch processing systems
    6.1.1 Apache Hadoop MapReduce
    6.1.2 Alternatives
  6.2 Stream processing systems
    6.2.1 Apache Storm
    6.2.2 Alternatives
  6.3 Graph processing systems
    6.3.1 Apache Giraph
    6.3.2 Alternatives
  6.4 High-level representation tools
    6.4.1 Apache Hive
    6.4.2 Apache Pig
    6.4.3 Summingbird
    6.4.4 Alternatives
  6.5 General-purpose processing systems
    6.5.1 Apache Spark
    6.5.2 Apache Flink
    6.5.3 Alternatives
  6.6 Summary
7 Tool Selection Diagram
  7.1 Validation
8 Attachments
9 Conclusion
  9.1 Future directions
Bibliography

List of Tables

5.1 Basic summary of relational database management systems
5.2 Basic summary of NoSQL database management systems
5.3 Basic summary of time-series database management systems
5.4 Basic summary of distributed file systems
6.1 Basic summary of processing systems
7.1 Results of the first test
7.2 Results of the second test
7.3 Results of the extended second test

1 Introduction

Nowadays, we are surrounded by Big Data in many forms. Big Data can be seen in several domains, such as the Internet of Things, social media, medicine, and astronomy [1]. They are used, for example, in data mining, machine learning, predictive analytics, and statistical techniques. Big Data brings many problems to developers because they have to build systems that can handle this type of data and its properties, such as huge volume, heterogeneity, or generation speed. Currently, open source solutions are very popular in this domain. Therefore, multiple open source Big Data tools have been created to allow working with this type of data. However, their enormous number, specific aims, and fast evolution make it confusing to choose the right solution for a given Big Data problem.
We believe that creating a Big Data tool selection diagram would be a valid response to this issue. Such a diagram should be able to recommend the set of tools that should be used for a given Big Data problem. The elements of the output set should be based on the properties of this problem. As a complete diagram is beyond the scope of a master’s thesis, this thesis creates the initial version of the Big Data tool selection diagram, which is expected to be updated and extended in the future.

This thesis is organized as follows. Fundamental information about the Big Data domain and its specifics is introduced in chapter 2. Chapter 3 describes the state of the art in Big Data tools. The proposed architecture of Big Data tools is described in chapter 4. Chapter 5 contains the overview of Big Data storage tools, and chapter 6 contains the overview of Big Data processing tools. Chapter 7 presents the tool selection diagram and its validation. Contents attached to this thesis are described in chapter 8. Chapter 9 concludes the thesis.

2 Big Data

This chapter contains the fundamental information about the Big Data domain. It should give the reader the necessary knowledge to understand the following chapters.

2.1 Characteristics

Big Data are typically defined by five properties, called the "5 Vs of Big Data" [2].

∙ Volume: The data are so large that they cannot fit onto a single server, or the performance of analyzing them on a single server is low. Data growth over time is also a relevant factor. Therefore, systems that want to work with Big Data have to be scalable.

∙ Variety: The structure of the data can be heterogeneous. Data can be classified by their structure into three categories: structured data with a defined structure (for example, CSV files and spreadsheets), semi-structured data with a flexible structure (for example, JSON and XML), and unstructured data without a structure (for example, images and videos) [3].

∙ Velocity: Data sources generate real-time data at a fast rate.
For example, on Facebook, 136,000 photos are uploaded every minute [4]. So the system has to be able to handle lots of data at a reasonable speed.

∙ Veracity: Some data may be of low quality and cannot be considered trustworthy, so technologies should handle this kind of data too.

∙ Value: This property refers to the ability to extract value from the data. Therefore, systems have to provide useful benefits from the acquired data.

Many other definitions have emerged, including a five-part definition [5] and 7 Vs [6], 10 Vs [7, 8], and 42 Vs [9] definitions. However, the 5 Vs definition is still considered a popular standard.

2.2 Big Data system requirements

2.2.1 Scalability

Scalability is the ability of a system to manage increased demands. This ability is very relevant because of the Big Data volume. Scalability can be categorized into vertical and horizontal scaling [10].

Vertical scaling involves adding more processors, memory, or faster hardware, typically into a single server. Most software can then benefit from it. However, vertical scaling requires high financial investments, and there is a certain limit to this scaling.

Horizontal scaling means adding more servers into a group of cooperating servers, called a cluster. These servers may be cheap commodity machines, so the financial investment is relatively low. When this method is used, the system can scale as much as needed. However, it brings many complexities that software has to handle, which is reflected in the limited amount of software that can run on such systems.
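One of the complexities that horizontal scaling introduces is deciding which server in the cluster stores which record. The following minimal Python sketch illustrates one common approach, hash-based sharding; the server names and record keys are purely illustrative and not taken from this thesis.

```python
# Illustrative sketch: hash-based sharding across a small cluster.
# Server names and keys are hypothetical examples.

import hashlib

SERVERS = ["node-a", "node-b", "node-c"]  # cheap commodity machines

def shard_for(key: str, servers: list) -> str:
    """Map a record key to one server by hashing the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

records = ["user:1", "user:2", "user:3", "user:4"]
placement = {key: shard_for(key, SERVERS) for key in records}

# Adding a fourth server changes the modulus, so this naive scheme
# may relocate many existing keys -- one example of the complexity
# horizontal scaling pushes onto the software layer.
moved = sum(
    1 for key in records
    if shard_for(key, SERVERS + ["node-d"]) != placement[key]
)
```

Contrast this with vertical scaling, where the same records simply stay on the single, upgraded server and no placement logic is needed; production systems typically use more elaborate schemes (for example, consistent hashing) precisely to reduce the number of keys that move when the cluster grows.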