An Introduction to Big Data Technologies
University of the Aegean
Information and Communication Systems Engineering
Intelligent Information Systems

Thesis

An Introduction to Big Data Technologies

George Peppas
supervised by Dr. Manolis Maragkoudakis

October 18, 2016

Contents

1 Introduction
  1.1 Why Big Data
  1.2 Big Data Applications Today
    1.2.1 Bioinformatics
    1.2.2 Finance
    1.2.3 Commerce
2 Related work
  2.1 Big Data Programming Models
    2.1.1 In-Memory Database Systems
    2.1.2 MapReduce Systems
    2.1.3 Bulk Synchronous Parallel (BSP) Systems
    2.1.4 Big Data and Transactional Systems
  2.2 Big Data Platforms
    2.2.1 Hortonworks
    2.2.2 Cloudera
  2.3 Miscellaneous technologies stack
    2.3.1 Mahout
    2.3.2 Apache Spark and MLlib
    2.3.3 Apache ORC
    2.3.4 Hadoop Distributed File System
    2.3.5 Hive
    2.3.6 Pig
    2.3.7 HBase
    2.3.8 Flume
    2.3.9 Oozie
    2.3.10 Ambari
    2.3.11 Avro
    2.3.12 Sqoop
    2.3.13 HCatalog
    2.3.14 BigTop
  2.4 Data Mining and Machine Learning introduction
    2.4.1 Data Mining
    2.4.2 Machine Learning
  2.5 Data Mining and Machine Learning Tools
    2.5.1 WEKA
    2.5.2 SciKit-Learn
    2.5.3 RapidMiner
    2.5.4 Spark MLlib
    2.5.5 H2O Flow
3 Methods
  3.1 Classification
    3.1.1 Feature selection
    3.1.2 Dimensionality reduction (PCA)
  3.2 Clustering
    3.2.1 Expectation - Maximization (EM)
    3.2.2 Agglomerative
  3.3 Association rule learning
4 Setup and experimental results
  4.1 Performance Measurement Methodology
  4.2 Example using iris data
    4.2.1 RapidMiner
    4.2.2 Spark MLlib (Scala)
    4.2.3 WEKA
    4.2.4 SciKit-Learn
    4.2.5 H2O Flow
    4.2.6 Summary
  4.3 Experiments on Big data sets
    4.3.1 Loading the Big data sets
    4.3.2 SVM Spark MLlib
    4.3.3 Dimensionality reduction - PCA Spark MLlib
    4.3.4 Expectation-Maximization Spark MLlib
    4.3.5 Naive Bayes TF-IDF Spark MLlib
    4.3.6 Hierarchical clustering Spark MLlib
    4.3.7 K-means Spark MLlib
    4.3.8 Association Rules
    4.3.9 Data and Results
5 Conclusions and future work

1 Introduction

The primary purpose of this work is to provide an introduction to the technologies and platforms available for performing big data analysis. We first introduce what big data is and how it is used today. Moving deeper, we describe the programming models, such as in-memory database systems and MapReduce systems, that make big data analytics possible. Next we give a short introduction to the most popular platforms in the sector, followed by an intermediate-level reference to the many miscellaneous technologies that accompany the big data platforms. The areas of data mining and machine learning are also introduced, beginning with the conventional models and moving to the most advanced ones. We reference the most popular tools for data analysis and expose their flaws and weaknesses in handling big data sets. After describing the methods that we will use, we continue with the examples and experiments. By the end of this work the reader should understand how the big data technologies are combined and how to start conducting his or her own experiments.

1.1 Why Big Data

Big Data is driving radical changes in traditional data analysis platforms.
To perform any kind of analysis on such voluminous and complex data, scaling up the hardware platform becomes inevitable, and choosing the right hardware and software platform becomes a crucial decision. Several big data platforms with different characteristics are available, and choosing the right one requires in-depth knowledge of the capabilities of all of them. In order to decide whether we need a big data platform at all, and further which of these platforms suits our case, we need to answer some questions. How quickly do we need the results? How big is the data to be processed? Does the model building require several iterations or a single iteration? At the systems level, one has to look carefully into the following concerns: Will there be a need for more data processing capability in the future? Is the rate of data transfer critical for this application? Is there a need to handle hardware failures within the application? How should the system scale?

• Horizontal scaling: distributing the workload across many servers, which may even be commodity machines. It is also known as "scale out": multiple independent machines are added together to improve processing capability, and typically multiple instances of the operating system run on separate machines. Horizontal scaling platforms include peer-to-peer networks, Hadoop and Spark.

• Vertical scaling: installing more processors, more memory and faster hardware, typically within a single server. It is also known as "scale up" and usually involves a single instance of an operating system. The most popular vertical scale-up paradigms are High Performance Computing (HPC) clusters, multicore processors, Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) [27].

To handle future workloads, one will always have to add hardware. Peer-to-peer networks involve millions of machines connected in a network; it is a decentralized and distributed network architecture in which the nodes (known as peers) both serve and consume resources.

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. Big data represents information assets characterized by such high Volume, Velocity and Variety as to require specific technology and analytical methods for their transformation into value [9].

There is often confusion about what big data actually is. Put simply, it is just data, like any other data; but when we try to manage it, analyze it, or even read it, or interact with it in any way, we cannot, because of the 3 V's (see below). For example, you cannot simply open a 1 TB text file in a text editor, nor can you open a spreadsheet (or CSV file) of that size. You need a new tool and a new approach: what we knew until now about data and how we handle it will not work on these data sets. We need new tools, new algorithms and new ways to analyze and store them, as the short sketch below illustrates.
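As a minimal sketch of such a new approach, and assuming Apache Spark (introduced later in Section 2.3.2) with a local master setting and an HDFS path chosen purely for illustration, the following Scala snippet counts the lines and words of a text file far too large to open in an editor. Spark splits the file into partitions and processes them in parallel, so no single machine ever holds the whole file in memory.

    import org.apache.spark.{SparkConf, SparkContext}

    object BigFileCount {
      def main(args: Array[String]): Unit = {
        // "local[*]" and the HDFS path below are illustrative placeholders;
        // on a real cluster a proper master URL and data set path would be used.
        val conf = new SparkConf().setAppName("BigFileCount").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // The file is read as an RDD: it is split into blocks that are
        // processed in parallel, never loaded whole into one machine's memory.
        val lines = sc.textFile("hdfs:///data/huge-file.txt")
        val lineCount = lines.count()
        val wordCount = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).count()

        println(s"lines: $lineCount, words: $wordCount")
        sc.stop()
      }
    }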
Big data can be described by the following characteristics, also known as the 3 V's:

• Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

• Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time.

• Variety: The type and nature of the data. This helps the people who analyze it to use the resulting insight effectively. Big data draws from text, images, audio and video, and it completes missing pieces through data fusion.

According to a report from the International Data Corporation (IDC), in 2011 the overall created and copied data volume in the world was 1.8 ZB (≈ 10^21 bytes), having increased by nearly nine times within five years [12], and this figure is expected to at least double every two years in the near future. Nowadays, big data related to the services of Internet companies grows rapidly. For example, Google processes data of hundreds of petabytes (PB), Facebook generates log data of over 10 PB per month, Baidu, a Chinese company, processes data of tens of PB, and Taobao, a subsidiary of Alibaba, generates data of tens of terabytes (TB) for online trading per day. NIST defines big data as "data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis, or the data which may be effectively processed with important horizontal zoom technologies", a definition that focuses on the technological aspect of big data [8].

Many challenges arose with big data. With the development of Internet services, indexes and queried contents grew rapidly, so search engine companies had to face the challenge of handling such big data. Google created the GFS and MapReduce programming models to cope with the challenges brought about by data management and analysis at Internet scale. The sharply increasing data deluge of the big data era brings huge challenges for data acquisition, storage, management and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, RDBMSs apply only to structured data, not to semi-structured or unstructured data, and they rely on increasingly expensive hardware. It is apparent that traditional RDBMSs cannot handle the huge volume and heterogeneity of big data.
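To make the MapReduce programming model mentioned above more concrete, the sketch below shows the classic word-count job written in Scala against Hadoop's MapReduce API; the class names and the input/output paths taken from args are assumptions made for illustration rather than code from this thesis, and the model itself is discussed in detail in Section 2.1.2.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import scala.collection.JavaConverters._

    // Map phase: emit a (word, 1) pair for every token of every input line.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()

      override def map(key: LongWritable, value: Text,
                       context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.toLowerCase.split("\\W+").filter(_.nonEmpty).foreach { w =>
          word.set(w)
          context.write(word, one)
        }
    }

    // Reduce phase: the framework groups pairs by word; sum the counts per word.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
        context.write(key, new IntWritable(values.asScala.map(_.get).sum))
    }

    // Driver: wires mapper and reducer into a job; paths come from the command line.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenMapper])
        job.setMapperClass(classOf[TokenMapper])
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }

The mapper emits a (word, 1) pair for every token, the framework shuffles all pairs with the same word to one reducer, and the reducer sums the counts; this split into independent map and reduce steps is what allows the job to scale out across commodity machines.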