Study Materials for Big Data Processing Tools
MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Study materials for Big Data processing tools

BACHELOR'S THESIS

Martin Durkáč

Brno, Spring 2021

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Martin Durkáč

Advisor: RNDr. Martin Macák

Acknowledgements

I would like to thank my supervisor RNDr. Martin Macák for all the support and guidance. His constant feedback helped me to improve and finish the thesis. I would also like to express my gratitude towards my colleagues at Greycortex for letting me use their resources to develop the practical part of the thesis.

Abstract

This thesis focuses on providing study materials for the Big Data seminar. The thesis is divided into six chapters plus a conclusion: the first chapter introduces Big Data in general, four chapters contain information about Big Data processing tools, and the sixth chapter describes the study materials provided in this thesis. For each Big Data processing tool, the thesis contains a practical demonstration and an assignment for the seminar in the attachments. All the assignments are provided both with and without solutions. The four Big Data processing tools covered in this thesis are Apache Hadoop for general batch processing, Apache Hive for SQL systems, Apache Spark Streaming for stream processing, and the GraphFrames package of Apache Spark for graph processing.
Keywords

Big Data, Apache Hadoop, HDFS, YARN, MapReduce, Apache Hive, Apache Spark, Spark Streaming, GraphFrames

Contents

Introduction
1 Big data
2 Batch processing
  2.1 Characteristics
  2.2 Hadoop Distributed File System
  2.3 Yet Another Resource Negotiator
  2.4 Hadoop MapReduce
  2.5 Application structure
    2.5.1 Writable
    2.5.2 Record Reader
    2.5.3 Mapper
    2.5.4 Reducer
    2.5.5 Comparator
    2.5.6 Partitioner
  2.6 Apache Hadoop alternative
3 Structured data processing using SQL
  3.1 Characteristics
  3.2 Architecture
  3.3 HiveServer2
  3.4 Hive Metastore
  3.5 Limitations
  3.6 HiveQL
  3.7 Apache Hive alternative
4 Stream processing
  4.1 Apache Spark
  4.2 Resilient Distributed Datasets
  4.3 Spark Streaming
  4.4 Apache Spark Streaming alternative
5 Graph processing
  5.1 GraphFrames
  5.2 DataFrame API
  5.3 GraphFrames features
    5.3.1 Motif finding
    5.3.2 Breadth-first search
    5.3.3 Subgraphs
    5.3.4 Other built-in graph algorithms
  5.4 GraphFrames alternative
6 Study materials
  6.1 Tasks and demonstrations
  6.2 Usage
7 Conclusion
A Attachments
Bibliography

List of Tables

2.1 Differences between Apache Hadoop and Apache Spark
3.1 Differences between the SQL-92 standard and HiveQL
4.1 Examples of actions in Apache Spark
4.2 Examples of transformations in Apache Spark

List of Figures

2.1 Example distribution of HDFS blocks across a cluster with replication factor two
2.2 MapReduce data flow
2.3 Example of the Apache Hadoop application structure
3.1 Apache Hive architecture
4.1 Apache Spark architecture with the Spark Standalone cluster manager and HDFS storage
4.2 Spark Streaming usage of D-Streams
5.1 Apache Giraph architecture

Listings

2.1 Writable example
2.2 Record reader initialize method example
2.3 Record reader nextKeyValue method example
2.4 Record reader createRecordReader method example
2.5 Reducer example
3.1 Word count query in Apache Hive

Introduction

Today we are surrounded by Big Data in many forms, and our daily interactions are affected by it. Big Data is used in social media, medicine, IoT (Internet of Things), and many other areas [1]. It is here to stay and continues to grow. Like other areas of computer science, Big Data is continuously evolving, and many IT workers, analysts, and economists do not understand it and cannot work with it. This is mainly because of its properties, such as huge volume, great variety, and high velocity [2]. The challenge of Big Data management is coping with all three of these properties in the most efficient way possible.

The primary purpose of this thesis is to provide study materials for the most popular Big Data tools in each category of Big Data, as described in [3]. These are Apache Hadoop for general-purpose batch processing and Apache Hive for big SQL systems. For both stream and graph processing, the thesis provides materials about Apache Spark with its Spark Streaming library and GraphFrames package. Each of these chapters also contains a section about one or more alternatives to the given Big Data tool, describing differences, advantages, and disadvantages in comparison.

The study materials from the thesis explain and describe all of the Big Data tools mentioned above, including a tutorial for downloading and installing each tool and a demonstration on a simple example. Furthermore, there are assignments for each Big Data tool with the solutions provided. These tutorials are compatible with most Linux distributions. System requirements are mentioned in the setup part of the tutorials.

The outcome of the thesis is to provide people with study materials about the previously mentioned Big Data tools. The materials contain all the important information needed to start using these tools and to improve skills in the Big Data field of computer science. The thesis is organized as follows.
Chapter 1 introduces Big Data and the processing concepts related to it. Apache Hadoop and its components are introduced and described in Chapter 2. Chapter 3 contains information about Apache Hive. Chapters 4 and 5 describe Apache Spark Streaming and Apache Spark GraphFrames, respectively. In Chapter 6, the usage of the thesis and the materials provided are explained. Chapter 7 concludes the thesis.

1 Big data

This thesis is focused on describing options for working with Big Data using current Big Data processing platforms. Big Data is the notion of data growing exponentially in size, coming in unstructured form from all the digital devices, computation systems, usage of the Internet, and other applications in people's daily lives. As of 2012, the number of Internet users was around 2.27 billion [3]. In 2021, the number of Internet users reached 4.66 billion [4] and continues to grow. More Internet users result in a huge and increasing amount of user-generated content, such as hundreds of hours of video uploaded to YouTube, hundreds of thousands of tweets on Twitter, and millions of Google searches happening every minute. Many companies are continuously gathering massive datasets containing information about customer interactions, product sales, and other information that may help improve their business. This is only a fraction of the new data that is being created.

Big Data can be described using multiple properties. The number of properties ranges from three to seven, depending on which definition is used. The most important properties describing Big Data are [3]:

• Volume,
• Variety,
• Velocity.

The Volume property refers to the huge amount of data, which can span billions of rows and millions of columns. The Variety of data is another challenge of Big Data, since no exact format of the data is defined: the data comes in different formats, from different data sources, and in different data structures.
Velocity describes the trend where most of the data has been created in the most recent years. The speed at which new data is generated keeps increasing, and there is now a need not only to analyze already stored data but also to perform real-time analysis on these enormous volumes of data.

Additionally, there are other definitions adding more properties describing Big Data [5]. These further properties are:

• Variability - the problem of the interpretation of data, which changes depending on the context, especially in areas such as language processing,
• Visualization - the data is readable and accessible,
• Veracity - the accuracy and trustworthiness of the data, since incomplete data may be useless,
• Value - the storage and analysis of the data would have no use if they could not be turned into value.

There are many ways to categorize Big Data tools. This thesis focuses on Big Data tools and the categories used for processing the data. One way to classify Big Data processing tools, and the way it is done in this thesis, is into these four categories [3]:

• Batch processing,
• Structured data processing,
• Stream processing,
• Graph processing.

2 Batch processing

Batch processing is the simplest way to handle unstructured data in very large volumes [6]. The data are collected, stored, and then processed all at once, which may take a long time. Since the volumes are so large, the main principles of this type of processing are the ability to run on low-cost, unreliable commodity hardware while remaining highly fault-tolerant, being highly scalable, and being able to run in parallel in a cluster with multiple worker nodes. This chapter contains information about the batch processing tool Apache Hadoop.
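The map-and-reduce idea behind Hadoop's batch processing can be illustrated without the Hadoop API at all. The following sketch simulates the classic word-count example's map, shuffle, and reduce phases using plain Java collections; the class and method names are illustrative only and are not part of the Hadoop API, which is introduced later in this chapter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the MapReduce word-count flow using plain Java
// collections. It mirrors the logic of a Hadoop Mapper and Reducer but
// deliberately avoids the Hadoop API so it runs standalone.
public class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // "Shuffle" + "Reduce" phase: group the pairs by key and sum the values.
    // In Hadoop, the grouping happens between the Mapper and the Reducer.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data big tools", "data tools");
        System.out.println(reduce(map(lines))); // prints {big=2, data=2, tools=2}
    }
}
```

In a real Hadoop job, the map and reduce steps run in parallel on many worker nodes, and the framework handles the grouping, fault tolerance, and data distribution over HDFS.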