University of the Aegean Information and Communication Systems Engineering Intelligent Information Systems

Thesis

An Introduction to Big Data Technologies

George Peppas

supervised by Dr. Manolis Maragkoudakis

October 18, 2016

Contents

1 Introduction
  1.1 Why Big Data
  1.2 Big Data Applications Today
    1.2.1 Bioinformatics
    1.2.2 Finance
    1.2.3 Commerce

2 Related work
  2.1 Big Data Programming Models
    2.1.1 In-Memory Database Systems
    2.1.2 MapReduce Systems
    2.1.3 Bulk Synchronous Parallel (BSP) Systems
    2.1.4 Big Data and Transactional Systems
  2.2 Big Data Platforms
    2.2.1 Hortonworks
    2.2.2 Cloudera
  2.3 Miscellaneous technologies stack
    2.3.1 Mahout
    2.3.2 Apache Spark and MLlib
    2.3.3 Apache ORC
    2.3.4 Hadoop Distributed File System
    2.3.5 Hive
    2.3.6 Pig
    2.3.7 HBase
    2.3.8 Flume
    2.3.9 Oozie
    2.3.10 Ambari
    2.3.11 Avro
    2.3.12
    2.3.13 HCatalog
    2.3.14 BigTop
  2.4 Data Mining and Machine Learning introduction
    2.4.1 Data Mining
    2.4.2 Machine Learning
  2.5 Data Mining and Machine Learning Tools
    2.5.1 WEKA
    2.5.2 SciKit-Learn
    2.5.3 RapidMiner
    2.5.4 Spark MLlib
    2.5.5 H2O Flow

3 Methods
  3.1 Classification
    3.1.1 Feature selection
    3.1.2 Dimensionality reduction (PCA)
  3.2 Clustering
    3.2.1 Expectation - Maximization (EM)
    3.2.2 Agglomerative
  3.3 Association rule learning

4 Setup and experimental results
  4.1 Performance Measurement Methodology
  4.2 Example using iris data
    4.2.1 Rapidminer
    4.2.2 Spark MLlib (scala)
    4.2.3 WEKA
    4.2.4 SciKit - Learn
    4.2.5 H2O Flow
    4.2.6 Summary
  4.3 Experiments on Big data sets
    4.3.1 Loading the Big data sets
    4.3.2 SVM Spark MLlib
    4.3.3 Dimensionality reduction - PCA Spark MLlib
    4.3.4 Expectation-Maximization Spark MLlib
    4.3.5 Naive Bayes TF-IDF Spark MLlib
    4.3.6 Hierarchical clustering Spark MLlib
    4.3.7 K-means Spark MLlib
    4.3.8 Association Rules
    4.3.9 Data and Results

5 Conclusions and future work

1 Introduction

The primary purpose of this work is to provide an introduction to the technologies and platforms available for performing big data analysis. We will first introduce what big data is and when and how it is used today. Moving deeper, we will describe the programming models, such as in-memory database systems and MapReduce systems, that make big data analytics possible. Next we will give a short introduction to the most popular platforms in the sector. An intermediate-level reference will be given for the many miscellaneous technologies that accompany the big data platforms. The areas of data mining and machine learning will also be introduced, beginning with the conventional models and moving to the more advanced ones. We will reference the most popular tools for data analysis and expose their flaws and weaknesses in handling big data sets. After describing the methods that we will use, we will continue with the examples and experiments. At the end of this work the reader will be able to understand how the big data technologies are combined together and how someone can start to conduct their own experiments.

1.1 Why Big Data

Big Data is driving radical changes in traditional data analysis platforms. To perform any kind of analysis on such voluminous and complex data, scaling up the hardware platforms becomes imminent, and choosing the right hardware/software platform becomes a crucial decision. There are several big data platforms available with different characteristics, and choosing the right platform requires in-depth knowledge about the capabilities of all these platforms. In order to decide whether we need a big data platform, and further to choose which of these platforms is suitable for our case, we need to answer some questions. How quickly do we need to get the results? How big is the data to be processed? Does the model building require several iterations or a single iteration? At the systems level, one has to meticulously look into the following concerns: Will there be a need for more data processing capability in the future? Is the rate of data transfer critical for this application? Is there a need for handling hardware failures within the application? How about scaling?

• Horizontal Scaling: Horizontal scaling involves distributing the workload across many servers, which may even be commodity machines. It is also known as "scale out", where multiple independent machines are added together in order to improve the processing capability. Typically, multiple instances of the operating system run on separate machines.

• Vertical Scaling: Vertical scaling involves installing more processors, more memory and faster hardware, typically within a single server. It is also known as "scale up" and it usually involves a single instance of an operating system.

• Horizontal scaling platforms: these include peer-to-peer networks, Hadoop and Spark.

• Vertical scaling platforms: The most popular vertical scale-up paradigms are High Performance Computing Clusters (HPC), multicore processors, Graphics Processing Units (GPU) and Field Programmable Gate Arrays (FPGA) [27].

To handle future workloads, one will always have to add hardware. Peer-to-peer networks involve millions of machines connected in a network. It is a decentralized and distributed network architecture where the nodes in the network (known as peers) serve as well as consume resources.

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. Big Data represents the information assets characterized by such a high Volume, Velocity and Variety as to require specific technology and analytical methods for their transformation into value [9]. We often get confused about what big data is. To put it simply, it is just data, like any other data, but when we try to manage it, analyze it, or even read it and in any way interact with it, we cannot, because of the 3V's (see below). For example, you cannot simply open a 1 TB text file with your notepad, right? Nor can you open a spreadsheet (or CSV) of that size. You need a new tool and a new approach. What we knew until now about data and how we handle it is not going to work on these data sets. We need new tools, new algorithms, and new ways to analyze and store them. Big data can be described by the following characteristics, also known as the 3V's:

• Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

• Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time.

• Variety: The type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio and video; plus it completes missing pieces through data fusion.

According to a report from International Data Corporation (IDC), in 2011 the overall created and copied data volume in the world was 1.8 ZB (≈ 10^21 bytes), which had increased by nearly nine times within five years [12]. This figure will double at least every two years in the near future. Nowadays, big data related to the services of Internet companies grows rapidly. For example, Google processes data of hundreds of Petabytes (PB), Facebook generates log data of over 10 PB per month, Baidu, a Chinese company, processes data of tens of PB, and Taobao, a subsidiary of Alibaba, generates data of tens of Terabytes (TB) for online trading per day.

NIST defines big data as "Big data shall mean the data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis or the data which may be effectively processed with important horizontal zoom technologies", a definition which focuses on the technological aspect of big data [8].

Many challenges around big data have arisen. With the development of Internet services, indexes and queried contents were growing rapidly. Therefore, search engine companies had to face the challenges of handling such big data. Google created the GFS and MapReduce programming models to cope with the challenges brought about by data management and analysis at the Internet scale. The sharply increasing data deluge in the big data era brings huge challenges for data acquisition, storage, management and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs only apply to structured data, rather than semi-structured or unstructured data. In addition, RDBMSs increasingly utilize more and more expensive hardware. It is apparent that traditional RDBMSs cannot handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the requirements on infrastructure for big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For permanent storage and management of large-scale disordered datasets, distributed file systems and NoSQL databases are good choices. Such programming frameworks have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy big data analysis systems [8].

A good case for studying the use of big data is the Internet of Things (IoT). IoT refers to the networked interconnection of everyday objects, which are often equipped with ubiquitous intelligence. IoT will increase the ubiquity of the Internet by integrating every object for interaction via embedded systems, which leads to a highly distributed network of devices communicating with human beings as well as with other devices. Thanks to rapid advances in the underlying technologies, IoT is opening tremendous opportunities for a large number of novel applications that promise to improve the quality of our lives. In recent years, IoT has gained much attention from researchers and practitioners around the world [50]. The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured features, noise, and high redundancy. Although current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion and then IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

Presently, Hadoop is an open source framework for storing and processing large datasets using clusters of commodity hardware. Hadoop is designed to scale up to hundreds and even thousands of nodes, is highly fault tolerant, and is widely used in big data applications in industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, its biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. A number of well-known agencies use Hadoop to conduct distributed computation. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle [8].

The Hadoop platform contains the following two important components. The Hadoop Distributed File System (HDFS) is a distributed file system that is used to store data across a cluster of commodity machines while providing high availability and fault tolerance. Hadoop YARN is a resource management layer that schedules jobs across the cluster. The programming model used in Hadoop is MapReduce.

Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data [8].

As we see below in Figure 1, in 60 seconds worldwide we have more data generated than we can handle. For example, how would we analyze 11 million instant messages in 60 seconds to extract information that could help us defend our homeland security? There are 695,000 Facebook status updates, and among them may be the information that could help us save a person, e.g. from suicide. There are 168 million emails, many of them spam or illegal. Can we put people to read all this information? No. Can we use conventional databases and analytic tools to analyze them? No. We would need more than 60 seconds to store them and even more to analyze them, so in the next 60 seconds we are already flooded with data. Figure 2 below gives more numbers that convey the urgency: 300 billion dollars were saved in America's budget thanks to big data analysis; 200 PB of data were generated by a project in China, and with traditional databases we cannot even store such data; 750 million pictures were uploaded to Facebook, and in order to filter these pictures and decide whether they have appropriate content we need a big data analytics approach. Figure 3 gives more numbers to understand the general picture. 90% of the world's data was created in the last two years; imagine how difficult it is for companies to scale up so fast in order to handle the flood of data. 25 quintillion bytes are generated every day; if you cannot store them, you lose information that could be beneficial, and if you store them, you need a way to analyze them. In both cases we need big data tools to convert the data into useful information.

Figure 1: Public data increment [28]

Figure 2: The continuously increasing big data

Figure 3: Big data in numbers [29]

1.2 Big Data Applications Today

Next we describe some important applications of big data today in vital sectors of the real world. The usage of big data analytics is not limited to the sectors below; it is already spreading to any sector that has or will have huge amounts of data. Interdisciplinary cases are also a hot topic, where the possible combinations between data from different disciplines are theoretically infinite and the correlations are hidden from the human mind, but we will not expand on this area.

1.2.1 Bioinformatics

Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel. These methods can be scaled to handle big data using distributed and parallel computing technologies. Usually big data tools perform computation in batch mode and are not optimized for iterative processing and high data dependency among operations. In recent years, parallel, incremental, and multi-view machine learning algorithms have been proposed. Similarly, graph-based architectures and in-memory big data tools have been developed to minimize I/O cost and optimize iterative processing. However, there is a lack of standard big data architectures and tools for many important bioinformatics problems, such as fast construction of co-expression and regulatory networks and salient module identification, detection of complexes over growing protein-protein interaction data, fast analysis of massive DNA, RNA, and protein sequence data, and fast querying on incremental and heterogeneous disease networks.

Data are being generated by a variety of sources other than people and servers, such as sensors embedded into phones and wearable devices, video surveillance cameras, MRI scanners, and set-top boxes. Considering the annual growth of data generation, the digital universe - the data we generate annually - will reach 44 zettabytes, or 44 trillion gigabytes, by the year 2020, which is ten times the size of the digital universe in 2013. The volume of data is growing fast in bioinformatics research. Big data sources are no longer limited to particle physics experiments or search-engine logs and indexes. With the digitization of all processes and the availability of high-throughput devices at lower costs, data volume is rising everywhere, including in bioinformatics research. For instance, the size of a single sequenced human genome is approximately 200 gigabytes. The data size in bioinformatics has been increasing dramatically in recent years. The European Bioinformatics Institute (EBI), one of the largest biology-data repositories, had approximately 40 petabytes of data about genes, proteins, and small molecules in 2014, in comparison to 18 petabytes in 2013. EBI has installed a cluster, the Hinxton data centre cluster, with 17,000 cores and 74 terabytes of RAM, to process its data, and its computing power is increased almost every month.

More importantly, EBI is not the only organization involved in massive bio-data storage. There are many other organizations that are storing and processing huge collections of biological databases and distributing them around the world, such as the National Center for Biotechnology Information (NCBI) in the USA and the National Institute of Genetics in Japan. There are primarily five types of data that are massive in size and used heavily in bioinformatics research: i) gene expression data, ii) DNA, RNA, and protein sequence data, iii) protein-protein interaction (PPI) data, iv) pathway data, and v) gene ontology (GO). Other types of data, such as human disease networks and disease-gene association networks, are also used and are highly important for many research directions, including disease diagnosis.

Supervised, unsupervised, and hybrid machine learning approaches are the most widely used tools for descriptive and predictive analytics on big data. Apart from that, various techniques from mathematics have been used in big data analytics. The problem of big data volume can be somewhat minimized by dimensionality reduction. Linear mapping methods, such as principal component analysis (PCA) and singular value decomposition (SVD), as well as non-linear mapping methods, such as Sammon's mapping, kernel principal component analysis, and Laplacian eigenmaps, have been widely used for dimensionality reduction. Another important tool used in big data analytics is mathematical optimization. Subfields of optimization, such as constraint satisfaction programming, dynamic programming, and heuristics and metaheuristics, are widely used in AI and machine learning problems. Other important optimization methods include multi-objective and multi-modal optimization methods, such as Pareto optimization and evolutionary algorithms, respectively. Statistics is considered a counterpart to machine learning, differentiated by the data model versus the algorithmic model, respectively. The two fields have subsumed ideas from each other. Statistical concepts, such as expectation-maximization and PCA, are widely adopted in machine learning problems. Similarly, machine learning techniques, such as probably approximately correct learning, are used in applied statistics. Both of these tools have been heavily used for big data analytics [17].

1.2.2 Finance

Security and fraud detection: It is important for banks and insurance companies to seek ways to reduce the number of potential losses as well as the costs from claims processing and collection recovery situations. Mainly, this is done by attempting to gain insight into customer interactions and behaviors across different channels. One possible way is using graph analytics, which is among the most promising big data applications scientists are working with today. These new types of analytics can be used to fight "whiplash for cash" scams. With graph analysis, insurance companies can perform analysis on huge amounts of data with much better performance and in less time compared to traditional techniques in order to find suspicious behaviors. Thus, they can follow the whole fraud trail of all the people and cars involved to detect whether they are involved in other accidents, along with the different degrees of connection among them [25].

Fraud-related losses, on average, amount to US$9,000 for every US$1 million in revenue. This significant amount of loss can be prevented by identifying relevant insights through the use of big data. With the help of the right infrastructure, such as Hadoop, e-commerce firms can analyze data at an aggregated level to identify fraud relating to credit cards, product returns and identity theft. In addition, e-commerce firms are able to identify fraud in real time by combining transaction data with customers' purchase history, web logs, social feeds, and geospatial location data from smartphone apps. For example, Visa has installed a big data-enabled fraud management system that allows the inspection of 500 different aspects of a transaction, with this system saving US$2 billion in potential losses annually [1].

Credit Scoring (Figure 4): Undoubtedly, one of the major sectors that has seen unprecedented new solutions leveraging big data is lending and credit scoring. For decades, credit scores based on basic financial transactions served as the norm for all credit activities in the financial services space. Essentially, the new data sources go beyond the quantitative data available from banks and assess qualitative concepts like behavior, willingness, ability, etc. The growth in segments such as P2P lending and SME financing is a result of these innovative scoring models. Examples of such startups include Credit Sesame, Faircent, OnDeck, Kabbage, LendingClub, Prosper, ZestFinance and Vouch Financial [2].

Figure 4: FinTech [2]

Customer Acquisition: The cost of customer acquisition drops drastically when we compare physical to digital channels, providing huge benefits to both financial services firms and startups. Place - one of the four Ps of marketing - has come to be dominated by the digital channel for both customers and clients.

Increasingly, customers' preference for digital channels, coupled with the low-cost advantages for clients (especially in financial services), makes this a major focus area. Leveraging big data, financial services are moving to digital channels to acquire customers. The growth in the number of offerings that are moving online - direct investment plans, online savings/deposit account opening, automated advisory services - provides a clear indication of the importance of digital channels for financial services.

Marketing, Customer Retention, and Loyalty Programs: Contextual and personalized engagements, be it in product/service advertising or discount offerings, have become the norm for many new-age companies. Analytic solutions that combine historic transactional data with external information sources increase the overall conversion rate. Many financial services firms partner with, acquire, or invest in startups and growth-stage companies, and are actively pursuing these services. Firms are effectively leveraging these solutions to increase cross-sell and upsell opportunities, understand customer requirements and provide customized packaging. Card-linked offers and customized reward solutions are some of the offerings being provided by FinTech firms.

Risk Management: World over, real-time payments have taken center stage in the past decade and hence there is a requirement for enhanced risk management solutions in this new environment. Predictive analytics that utilizes device identification, biometrics, behavior analytics, etc. (each solution alone or in combination) is a major driving factor for better risk management solutions in the fraud and authentication space. Firms that execute well on eradicating vulnerable access points would benefit not only in terms of lower losses but also through increased stickiness of their solutions. Apart from banks' own initiatives, various regulations are also enforcing rules that make it vital for banks to store and manage more information about payments. Hence, apart from just storing this data, banks look at building powerful algorithms that mine this data and provide actionable insights. Some startup solutions in this space are BillGuard, Centrifuge, Feedzai, Klarna, etc.

Investment Management: Investment management as a segment has witnessed innovation on multiple fronts. While robo-advisory solutions take the spotlight in the segment, there are other solutions that leverage the power of big data to provide efficient investment management: the ability to utilize search data, combine multiple macroeconomic factors, quantify the latest news and insights, and combine all of these to provide potential upside/downside scenarios. There are also solutions developed to detect specific market anomalies and provide preventive action steps in the investment portfolio. Specific startup solutions in this space include Wealthfront, EidoSearch, SigFig, Betterment, LearnVest, Personal Capital, Jemstep, etc. [2]

1.2.3 Commerce

In the past few years, an explosion of interest in big data has occurred in both academia and the e-commerce industry. This explosion is driven by the fact that e-commerce firms that inject big data analytics (BDA) into their value chain experience 5-6% higher productivity than their competitors.

A recent study by the BSA Software Alliance in the United States (USA) indicates that BDA contributes to 10% or more of the growth for 56% of firms. Therefore, 91% of Fortune 1000 companies are investing in BDA projects, an 85% increase from the previous year.

While the use of emerging internet-based technologies provides e-commerce firms with transformative benefits (e.g., real-time customer service, dynamic pricing, personalized offers or improved interaction), BDA can further solidify these impacts by enabling informed decisions based on critical insights. Specifically, in the e-commerce context, "big data enables merchants to track each user's behavior and connect the dots to determine the most effective ways to convert one-time customers into repeat buyers". Big data analytics (BDA) enables e-commerce firms to use data more efficiently, drive a higher conversion rate, improve decision making and empower customers (Miller 2013). From the perspective of transaction cost theory in e-commerce, BDA can benefit online firms by improving market transaction cost efficiency (e.g., buyer-seller interaction online), managerial transaction cost efficiency (e.g., process efficiency such as the recommendation algorithms used by Amazon) and time cost efficiency (e.g., searching, bargaining and after-sale monitoring). Drawing on the resource-based view (RBV), it has been argued that BDA is a distinctive competence of the high-performance business process that supports business needs, such as identifying loyal and profitable customers, determining the optimal price, detecting quality problems, or deciding the lowest possible level of inventory [1].

Figure 5: Global growth in e-commerce and big data analytics (BDA) [1]

Big data focuses on three main characteristics: the data itself, the analytics of the data, and the presentation of the results of the analytics that allow the creation of business value in terms of new products or services. The sheer volume of academic and industry research provides evidence of the importance of big data in many functional areas of e-commerce, including marketing, human resources management, production and operations, and finance.

While the significance of big data in making strategic decisions is recognized and understood, there is still a lack of consensus on the operational definition of big data analytics (BDA). It is thus prudent to analyze the definitions of BDA mentioned in previous studies in order to identify their common themes.

The ultimate challenge of BDA is to generate business value from this explosion of big data. The term 'value' in the context of big data implies the generation of economically worthy insights and/or benefits by analyzing big data through extraction and transformation. We define the business value of BDA as the transactional, informational and strategic benefits for e-commerce firms. Whereas transactional value focuses on improving efficiency and cutting costs, informational value sheds light on real-time decision making, and strategic value deals with gaining competitive advantages. For example, by injecting analytics into e-commerce, managers could derive overall business value by serving customer needs (79%); creating new products and services (70%); expanding into new markets (72%); and increasing sales and revenue (76%).

Amazon, the online retail giant, is a classic example of enhancing business value and firm performance using big data. Indeed, the firm was able to generate about 30% of its sales through analytics (e.g., through its recommendation engine) (The Economist 2011). Similarly, Kiron reported that Match.com was able to earn over a 50% increase in revenue in the past two years, while the company's subscriber base for its core business reached 1.8 million. The IBM case study illustrated that greater data sharing and analytics could improve patient outcomes. For example, the Premier Healthcare Alliance was able to reduce expenditure by US$2.85 billion. The Automercados Plaza's grocery chain was able to earn a nearly 30% rise in revenue and a total of US$7 million increase in profitability each year by implementing information integration throughout the organization. Furthermore, the company avoided losses on over 30% of its products by scheduling price reductions to sell perishable products on time. In addition to adding value for business in financial terms, the use of big data can add benefit in non-financial parameters such as customer satisfaction, customer retention, or improving business processes.

Personalization: The first application of big data for e-commerce firms is the provision of personalized services or customized products. Studies have argued that consumers typically like to shop with the same retailer using diverse channels, and that big data from these diverse channels can be personalized in real time. Real-time data analytics enables firms to offer personalized services comprising special content and promotions to customers. In addition, these personalized services assist firms to separate loyal customers from new customers and to make promotional offers. Personalization can increase sales by 10% or more and provide five to eight times the ROI on marketing expenditures.

Dynamic pricing: In today's extremely competitive market environment, customers are considered 'king'. Therefore, to attract new customers, e-commerce firms must be active and vibrant while setting a competitive price. Amazon.com's dynamic pricing system monitors competing prices and alerts Amazon every 15 seconds, which has resulted in a 35% increase in sales.

To offer competitive prices to customers on the eve of possible increases in sales (such as at Christmas or other festive times), Amazon processes big data by taking into account competitors' pricing, product sales, customer actions, and any regional or geographical preferences. Access to this information through the use of big data is likely to enable e-commerce firms to establish dynamic pricing [1].

2 Related work

In this section we introduce the work that has been done previously on big data technologies, models and systems. Although big data science is a hot field, it is also a very rapidly growing one, and that has led to technology fragmentation. We will try to show the most important technologies and how they work to solve the big data problems that have come up.

2.1 Big Data Programming Models The major types of Big Data programming models you will encounter are the following [36]:

• Massively parallel processing (MPP) database systems: EMC's Greenplum and IBM's Netezza are examples of such systems.

• In-memory database systems: Examples include Oracle Exalytics and SAP HANA.

• MapReduce systems: These systems include Hadoop, which is the most general-purpose of all the Big Data systems.

• Bulk synchronous parallel (BSP) systems: Apache Hama and Apache Giraph are well-known examples.

2.1.1 In-Memory Database Systems

From an operational perspective, in-memory database systems are identical to MPP systems. The implementation difference is that each node has a significant amount of memory, and most data is preloaded into memory. SAP HANA operates on this principle. Other systems, such as Oracle Exalytics, use specialized hardware to ensure that multiple hosts are housed in a single appliance. At its core, an in-memory database is like an in-memory MPP database with a SQL interface.

One of the major disadvantages of the commercial implementations of in-memory databases is that there is considerable hardware and software lock-in. Also, given that these systems use proprietary and very specialized hardware, they are usually expensive. Trying to use commodity hardware for in-memory databases increases the size of the cluster very quickly. Consider, for example, a commodity server that has 25 GB of RAM. Trying to host a 1 TB in-memory database will need more than 40 hosts (accounting for other activities that need to be performed on the server). 1 TB is not even that big, and we are already up to a 40-node cluster [36].
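As a rough back-of-the-envelope check of that estimate, the small sketch below (plain Java; the 1 TB dataset and 25 GB-per-node figures come from the example above, while the 60% usable-memory fraction is an illustrative assumption) computes the required node count:

// Rough cluster-size estimate for hosting a dataset entirely in memory
// on commodity nodes, using the figures from the example above.
public class InMemorySizing {
    public static void main(String[] args) {
        double datasetGb = 1024.0;      // 1 TB of data to keep in memory
        double ramPerNodeGb = 25.0;     // RAM of the commodity server in the example
        double usableFraction = 0.6;    // assumption: the rest is left for the OS and other activities

        int nodes = (int) Math.ceil(datasetGb / (ramPerNodeGb * usableFraction));
        System.out.println("Nodes needed: " + nodes);
        // Prints about 69 nodes under the headroom assumption; even if all
        // 25 GB per node were usable, roughly 41 nodes would still be required.
    }
}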

The following describes how the in-memory database programming model meets the attributes we defined earlier for the Big Data systems:

• Data is split by state in the earlier example. Each node loads data into memory.

• Each node contains all the necessary application libraries to work on its own subset.

• Each node reads data local to its nodes. The exception is when you apply a query that does not respect how the data is distributed; in this case, each task needs to fetch its own data from other nodes.

• Because data is cached in memory, the Sequential Data Read attribute does not apply except when the data is read into memory the first time.

2.1.2 MapReduce Systems

MapReduce [10] is the basic data processing scheme used in Hadoop. It breaks the entire task into two parts, known as mappers and reducers. At a high level, mappers read the data from HDFS, process it and generate some intermediate results for the reducers. Reducers are used to aggregate the intermediate results and generate the final output, which is again written to HDFS. A typical Hadoop job involves running several mappers and reducers across different nodes in the cluster.

MapReduce is also a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. In MapReduce, the computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of terabytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine [31].

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.

16 the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system [10]. Limitations of MapReduce One of the major drawbacks of MapReduce is its inefficiency in running iterative algorithms. MapReduce is not designed for iterative processes. Some of the important characteristics of Hadoop’s implementation of MapRe- duce are the following:

• It uses commodity-scale hardware. Note that commodity scale does not imply laptops or desktops. The nodes are still enterprise scale, but they use commonly available components.

• Data does not need to be partitioned among nodes based on any predefined criteria.

• The user needs to define only two separate processes: map and reduce.

At a very high level, a MapReduce system needs the user to define a map process and a reduce process. When Hadoop is used to implement MapReduce, the data is typically distributed in 64 MB-128 MB blocks, and each block is replicated twice (a replication factor of 3 is the default in Hadoop). In the example of computing sales for the year 2000 ordered by state, the entire sales data would be loaded into the Hadoop Distributed File System (HDFS) as blocks (64 MB-128 MB in size). When the MapReduce process is launched, the system would first transfer all the application libraries (comprising the user-defined map and reduce processes) to each node [36].

Each node will schedule a map task that sweeps the blocks comprising the sales data file. Each Mapper (on the respective node) will read records of the block and filter out the records for the year 2000. Each Mapper will then output a record comprised of a key/value pair, where the key is the state and the value is the sales number from the given record if the sales record is for the year 2000. Finally, a configurable number of Reducers will receive the key/value pairs from each of the Mappers. Keys will be assigned to specific Reducers to ensure that a given key is received by one and only one Reducer. Each Reducer will then add up the sales values for all the key/value pairs received. The data format received by the Reducer is a key (the state) and a list of values for that key (sales records for the year 2000). The output is written back to HDFS. The client will then sort the result by state after reading it from HDFS. The last step can be delegated to the Reducer because the Reducer receives its assigned keys in sorted order; in this example, however, we need to restrict the number of Reducers to one to achieve this. Because communication between Mappers and Reducers causes network I/O, it can lead to bottlenecks [36]. A code sketch of this example is given further below. This is how the MapReduce programming model meets the attributes defined earlier for the Big Data systems:

• Data is split into large blocks on HDFS. Because HDFS is a distributed file system, the data blocks are distributed across all the nodes redundantly.

• The application libraries, including the map and reduce application code, are propagated to all the task nodes.

• Each node reads data local to its nodes. Mappers are launched on all the nodes and read the data blocks local to themselves (in most cases, the mapping between tasks and disk blocks is up to the scheduler, which may allocate remote blocks to map tasks to keep all nodes busy).

• Data is read sequentially for each task, one large block at a time (blocks are typically of size 64 MB-128 MB).

One of the important limitations of the MapReduce paradigm, as we mentioned before, is that it is not suitable for iterative algorithms. A vast majority of data science algorithms are iterative by nature and eventually converge to a solution. When applied to such algorithms, the MapReduce paradigm requires each iteration to be run as a separate MapReduce job, and each iteration often uses the data produced by its previous iteration. But because each MapReduce job reads fresh from persistent storage, each iteration needs to store its results in persistent storage for the next iteration to work on. This process leads to unnecessary I/O and significantly impacts the overall throughput [36].

A MapReduce algorithm proceeds in rounds, where each round has three phases: map, shuffle, and reduce. As all machines execute a program in the same way, next we focus on one specific machine M.

Map. In this phase, M generates a list of key-value pairs (k, v) from its local storage. While the key k is usually numeric, the value v can contain arbitrary information. As clarified shortly, the pair (k, v) will be transmitted to another machine in the shuffle phase, such that the recipient machine is determined solely by k.

Shuffle. Let L be the list of key-value pairs that all the machines produced in the map phase. The shuffle phase distributes L across the machines, adhering to the constraint that pairs with the same key must be delivered to the same machine. That is, if (k, v1), (k, v2), ..., (k, vx) are the pairs in L having a common key k, all of them will arrive at an identical machine.

Reduce. M incorporates the key-value pairs received from the previous phase into its local storage. Then, it carries out whatever processing is needed on its local data. After all machines have completed the reduce phase, the current round terminates.

It is clear from the above that the machines communicate only in the shuffle phase, whereas in the other phases each machine executes the algorithm sequentially, focusing on its own storage. Overall, parallel computing happens mainly in reduce. The major role of map and shuffle is to swap data among the machines, so that computation can take place on different combinations of objects [31].
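To make the sales walkthrough above concrete, here is a minimal sketch of the "sales for the year 2000, aggregated by state" job written against Hadoop's Java MapReduce API. The input layout (CSV lines of the form state,year,amount) and all class and field names are assumptions made for illustration; they are not part of the cited example [36].

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesByState {

  // Mapper: filter records for the year 2000 and emit (state, sale amount).
  public static class SalesMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");   // assumed layout: state,year,amount
      if (fields.length == 3 && "2000".equals(fields[1].trim())) {
        context.write(new Text(fields[0].trim()),
                      new DoubleWritable(Double.parseDouble(fields[2].trim())));
      }
    }
  }

  // Reducer: sum all sale amounts received for a given state.
  public static class SalesReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text state, Iterable<DoubleWritable> amounts, Context context)
        throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable amount : amounts) {
        total += amount.get();
      }
      context.write(state, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sales-by-state-2000");
    job.setJarByClass(SalesByState.class);
    job.setMapperClass(SalesMapper.class);
    job.setReducerClass(SalesReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    job.setNumReduceTasks(1);   // a single reducer receives the state keys in sorted order
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Restricting the job to a single reducer is what makes the state keys arrive already sorted, which is why the walkthrough notes that the final sort can be delegated to the Reducer.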

Simplified View. Let us number the t machines of the MapReduce system arbitrarily from 1 to t. In the map phase, all our algorithms will adopt the convention that M generates a key-value pair (k, v) if and only if it wants to send v to machine k. In other words, the key field is explicitly the id of the recipient machine. This convention admits a conceptually simpler modeling. In describing our algorithms, we will combine the map and shuffle phases into one called map-shuffle. By saying succinctly that "in the map-shuffle phase, M delivers v to machine k", we mean that M creates (k, v) in the map phase, which is then transmitted to machine k in the shuffle phase. The equivalence also explains why the simplification is only at the logical level, while physically all our algorithms are still implemented in the standard MapReduce paradigm [31].

A MapReduce example that counts the appearances of each word in a set of documents:

Listing 1: MapReduce Example

function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit (word, sum)

Figure 6: The structure of a MapReduce job [35]

Here, each document is split into words, and each word is counted by the map function, using the word as the result key.

The framework puts together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearances of that word.

Some of the major analytics areas are: (Big) Data Analytics, Text Analytics, Web Analytics, Network Analytics, and Mobile Analytics. Let's note some applications of big data with high impact:

• E-Commerce: Recommender systems, Social media monitoring and analysis, Crowd-sourcing systems, Social and virtual games

• E-Government and Politics 2.0: Ubiquitous government services, Equal access and public services, Citizen engagement and participation, Political campaigns and e-polling

• Science and Technology: S&T innovation, Hypothesis testing, Knowledge discovery

• Health: Human and plant genomics, Healthcare decision support, Patient community analysis, Genomics and sequence data, Electronic health records (EHR)

• Security and Public Safety: Crime analysis, Computational criminology, Terrorism informatics, Open-source intelligence, Cyber security

Although it is very hard for a human to grasp the concept of big data and the 3V's, the statistics shown earlier can help us understand this fast-growing and emerging area.

Figure 7: MapReduce model [36]

2.1.3 Bulk Synchronous Parallel (BSP) Systems

The BSP class of systems operates very similarly to the MapReduce approach. However, instead of the MapReduce job terminating at the end of its processing cycle, a BSP system is composed of a list of processes (identical to the map processes) that synchronize on a barrier, send data to the Master node, and exchange relevant information. Once the iteration is completed, the Master node will indicate to each processing node to resume the next iteration.

Synchronizing on a barrier is a commonly used concept in parallel programming. It is used when many threads are responsible for performing their own tasks, but need to agree on a checkpoint before proceeding. This pattern is needed when all threads need to have completed a task up to a certain point before the decision is made to proceed or abort with respect to the rest of the computation (in parallel or in sequence). Synchronization barriers are used all the time in real-world processes; for example, carpool mates often meet at a designated place before proceeding in a single car. The overall process is only as fast as the last person (or thread) arriving at the barrier. The BSP method of execution allows each map-like process to cache its previous iteration's data, significantly improving the throughput of the overall process. BSP systems are discussed further in the Data Science chapter of [36]; they are relevant to iterative algorithms.
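The barrier pattern itself can be illustrated with a few lines of plain Java (a minimal sketch using java.util.concurrent.CyclicBarrier rather than a real BSP framework; the worker and superstep counts are arbitrary): each worker finishes its local work for the current superstep, waits at the barrier, and only when every worker has arrived does the group move on to the next superstep.

import java.util.concurrent.CyclicBarrier;

public class BarrierSketch {
  private static final int WORKERS = 4;
  private static final int SUPERSTEPS = 3;

  public static void main(String[] args) {
    // The barrier action runs once per superstep, after every worker has arrived;
    // in a real BSP system this is where messages would be exchanged between nodes.
    CyclicBarrier barrier = new CyclicBarrier(WORKERS,
        () -> System.out.println("--- all workers reached the barrier, next superstep ---"));

    for (int w = 0; w < WORKERS; w++) {
      final int id = w;
      new Thread(() -> {
        try {
          for (int step = 0; step < SUPERSTEPS; step++) {
            System.out.println("worker " + id + " finished local work for superstep " + step);
            barrier.await();   // wait until every worker has completed this superstep
          }
        } catch (Exception e) {
          Thread.currentThread().interrupt();
        }
      }).start();
    }
  }
}

Real BSP frameworks wrap exactly this superstep/barrier cycle, but keep each worker's data in memory between supersteps, which is where the throughput advantage over repeated MapReduce jobs comes from.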

2.1.4 Big Data and Transactional Systems

It is important to understand how the concept of transactions has evolved in the context of Big Data. This discussion is relevant to NoSQL databases. Hadoop has HBase as its NoSQL data store. Alternatively, you can use Cassandra or NoSQL systems available in the cloud, such as Amazon Dynamo. Although most RDBMS users expect ACID properties in databases, these properties come at a cost. When the underlying database needs to handle millions of transactions per second at peak time, it is extremely challenging to respect ACID features in their purest form.

Some compromises are necessary, and the motivation behind these compromises is encapsulated in what is known as the CAP theorem (also known as Brewer's theorem). CAP is an acronym for the following:

• Consistency: All nodes see the same copy of the data at all times.

• Availability: A guarantee that every request receives a response about success or failure within a reasonable and well-defined time interval.

• Partition tolerance: The system continues to perform despite failure of its parts.

The theorem goes on to prove that in any system only two of the preceding features are achievable, not all three. Now, let's examine various types of systems:

• Consistent and available: A single RDBMS with ACID properties is an example of a system that is consistent and available. It is not partition-tolerant; if the RDBMS goes down, users cannot access the data.

• Consistent and partition-tolerant: A clustered RDBMS is such a system. Distributed transactions ensure that all users will always see the same data (consistency), and the distributed nature of the data will ensure that the system remains available despite the loss of nodes. However, by virtue of distributed transactions, the system will be unavailable for durations of time when two-phase commits are being issued. This limits the number of simultaneous transactions that can be supported by the system, which in turn limits the availability of the system.

• Available and partition-tolerant: The types of systems classified as "eventually consistent" fall into this category. Consider a very popular e-commerce web site such as Amazon.com. Imagine that you are browsing through the product catalogs and notice that two units of a certain item are available for sale. By the nature of the buying process, you are aware that between you noticing that a certain number of items are available and issuing the buy request, someone could come in first and buy the items. So there is little incentive for always showing the most updated value, because inventory changes. Inventory changes will be propagated to all the nodes serving the users. Preventing the users from browsing inventory while this propagation is taking place, in order to provide the most current value of the inventory, would limit the availability of the web site, resulting in lost sales. Thus, we have sacrificed consistency for availability, and partition tolerance allows multiple nodes to display the same data (although there may be a small window of time in which each user sees different data, depending on the nodes they are served by) [36].

2.2 Big Data Platforms

Today there are two main players in the big data platform arena: Hortonworks and Cloudera. Both offer enterprise-ready Hadoop distributions. Cloudera has a commercial license, while Hortonworks has an open source license.

2.2.1 Hortonworks

The Hortonworks tutorial is detailed and accurate. The directions are ideal for both beginners and experts, with screenshots and a very neat structure. The tutorial is free and accessible on the web in web form, with very easy navigation through chapters. The Hortonworks Data Platform (HDP) is a platform that is used for storing, processing, and analyzing large volumes of data. The platform is designed to deal with data from many sources and formats. It includes various data management projects such as the Hadoop Distributed File System, MapReduce, Pig, Hive, HBase and ZooKeeper, plus additional components.

Every technology included in the HDP platform is described in detail, from the installation process to "how to use" cases, with easy-to-follow instructions. The platform is offered as a sandbox (a pre-configured VM) for ease of use and a quick "pick up and start". In this way the pre-configured VM eliminates all the possible mismatches that usually come up due to the different operating system versions of each user. If someone wants more "professional" training, Hortonworks offers online courses based on your role path, such as Developer, System Admin or Data Analyst, focusing on the technologies and cases that are most relevant to your role.

2.2.2 Cloudera

Cloudera offers a section called "Cloudera University". This section provides training based on your location and, more importantly, based on your role. Although there are three main roles (Developers, Administrators, Data Analysts), there is an extra level of depth for more specific technologies to train in, as you can see in Figure 8; for example, a developer may want to be trained only on the MapReduce section. All of that comes with a price, as it is not free. Being more enterprise-oriented, Cloudera even offers a certification so that someone can demonstrate their expertise.

Figure 8: Cloudera Training Courses

Comparing the two platforms, we can conclude that Hortonworks is more open to the public, with a detailed and freely accessible tutorial, whereas Cloudera is more enterprise-oriented, offering more options in paid training but at a higher cost than Hortonworks.

2.3 Miscellaneous technologies stack

There are several tools that help us handle data and platform settings. Below we present some of the basic tools [21] that come integrated and bundled with the platforms.

2.3.1 Mahout

Mahout is a data mining library.

It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the MapReduce model.

Mahout is an open source machine learning library from Apache. The algorithms it implements fall under the broad umbrella of machine learning or collective intelligence. This can mean many things, but at the moment for Mahout it means primarily recommender engines (collaborative filtering), clustering, and classification. It is also scalable. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. In its current incarnation, these scalable machine learning implementations in Mahout are written in Java, and some portions are built upon Apache's Hadoop distributed computation project. Finally, it is a Java library. It does not provide a user interface, a prepackaged server, or an installer. It is a framework of tools intended to be used and adapted by developers.

Figure 9: Simplified illustration of component interaction in a Mahout user-based recommender [23]

Mahout began life in 2008 as a subproject of Apache's Lucene project, which provides the well-known open source search engine of the same name. Lucene provides advanced implementations of search, text mining, and information-retrieval techniques. In the universe of computer science, these concepts are adjacent to machine learning techniques like clustering and, to an extent, classification. As a result, some of the work of the Lucene committers that fell more into these machine learning areas was spun off into its own subproject. Soon after, Mahout absorbed the Taste open source collaborative filtering project.

Mahout contains a recommender engine (several types of them, in fact), beginning with conventional user-based and item-based recommenders. It includes implementations of several other algorithms as well, but for now we'll explore a simple user-based recommender. The quality of recommendations is largely determined by the quantity and quality of data. "Garbage in, garbage out" has never been more true than here. Having high-quality data is a good thing, and generally, having lots of it is also good. Recommender algorithms are data-intensive by nature; their computations access a great deal of information. Runtime performance is therefore greatly affected by the quantity of data and its representation. Intelligently choosing data structures can affect performance by orders of magnitude, and, at scale, it matters a lot.

Mahout's clustering algorithms run as Hadoop jobs that can cluster large data easily; Mahout can also use k-means in in-memory mode to cluster a set of points on the plane. Mahout can be used on a wide range of classification projects, but the advantage of Mahout over other approaches becomes striking as the number of training examples gets extremely large. What large means can vary enormously. Up to about 100,000 examples, other classification systems can be efficient and accurate. But generally, as the input exceeds 1 to 10 million training examples, something scalable like Mahout is needed [23].

Listing 2: A simple user-based recommender program with Mahout

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class RecommenderIntro {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("intro.csv"));          // Load data file
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity); // Create recommender engine
    List<RecommendedItem> recommendations = recommender.recommend(1, 1);  // For user 1, recommend 1 item
    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}

Figure 10: Mahout is most appropriate [23]

The reason Mahout has an advantage with larger data sets is that as input data increases, the time or memory requirements for training may not increase linearly in a non-scalable system. A system that slows by a factor of 2 with twice the data may be acceptable, but if 5 times as much data input results in the system taking 100 times as long to run, another solution must be found. This is the sort of situation in which Mahout shines. In general, the classification algorithms in Mahout require resources that

increase no faster than the number of training or test examples, and in most cases the computing resources required can be parallelized. This allows you to trade off the number of computers used against the time the problem takes to solve. When the number of training examples is relatively small, traditional data mining approaches work as well or better than Mahout. But as the number of examples increases, Mahout's scalable and parallel algorithms are better with regard to time. The increased time required by non-scalable algorithms is often due to the fact that they require unbounded amounts of memory as the number of training examples grows. Extremely large data sets are becoming increasingly widespread. With the advent of more electronically stored data, the expense of data acquisition can decrease enormously. Using increased data for training is desirable, because it typically improves accuracy. As a result, the number of large data sets that require scalable learning is increasing, and as it does, the scalable classifiers in Mahout are becoming more and more widely useful. [23]

2.3.2 Apache Spark and MLlib Apache Spark is a fast and general engine for large-scale data processing, and MLlib is Apache Spark's scalable machine learning library. Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, and Python that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets. Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN-enabled workloads in the enterprise. Apache Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009. For some users, this means monitoring a continuous stream of server log data and taking action immediately in the case of component failure. For others, it means monitoring a stream of market data for signals and then taking action in real time or powering real-time analytic dashboards. This is being done on dedicated Hadoop clusters today. Note that the schema and the data are stored in separate files. The schema is only applied when the data is queried, a technique called 'schema-on-read'. This gives you the flexibility to query the data with SQL while it is still in a format usable by other systems as well. CDH offers SQL-like queries through Hive and Impala: Hive works by translating SQL queries into MapReduce jobs, so it is best for large batch jobs and applying flexible transformations, while Impala is significantly faster and is intended to have low enough latency for interactive queries and data exploration. Hue provides a web-based interface for many of the tools in CDH. Apache Spark was designed as a computing platform to be fast, general-purpose, and easy to use. It extends the MapReduce model and takes it to a whole other level. The speed comes from the in-memory computations: applications running in memory allow for much faster processing and response times.

You can run batch applications such as MapReduce-type jobs or iterative algorithms that build upon each other. You can also run interactive queries and process streaming data with your application. Spark also provides a number of libraries which you can easily use to expand beyond the basic Spark capabilities, such as machine learning algorithms, SQL, streaming, and graph processing. Spark runs on Hadoop clusters such as Hadoop YARN or Apache Mesos, or even standalone with its own scheduler. Spark is a general engine for large-scale data processing that supports Java, Scala and Python and, for certain tasks, it is tested to be up to 100 times faster than Hadoop MapReduce.

Listing 3: Word Count example

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Listing 4: Pi Estimation example

val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

Listing 5: Text Search example

val textFile = sc.textFile("hdfs://...")

// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Figure 11: Spark Architecture

2.3.3 Apache ORC The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. The Apache Tez project is aimed at building an application framework which allows for a complex directed acyclic graph of tasks for processing data. It is currently built atop Apache Hadoop YARN. Back in January 2013, Hortonworks created ORC files as part of the initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hive. The focus was on enabling high-speed processing and reducing file sizes. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written. Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions. Many large Hadoop users have adopted ORC. For instance, Facebook uses ORC to save tens of petabytes in their data warehouse and demonstrated that ORC is significantly faster than RCFile or Parquet. Yahoo uses ORC to store their production data and has released some of their benchmark results. ORC files are divided into stripes that are roughly 64 MB by default. The stripes in a file are independent of each other and form the natural unit of distributed work. Within each stripe, the columns are separated from each other so the reader can read just the columns that are required [3].
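As an illustration of how ORC fits into the stack described above, the following Scala sketch writes a DataFrame out as ORC files and reads it back through Spark SQL's HiveContext. The table name and HDFS path are assumptions made for the example, not part of the cited text.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object OrcExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orc-example"))
    val hiveContext = new HiveContext(sc)

    // Hypothetical Hive table; any DataFrame can be persisted the same way.
    val df = hiveContext.table("page_views")
    df.write.format("orc").save("hdfs:///warehouse/page_views_orc")

    // Reading the files back: the columnar layout means only the selected
    // columns are read and decompressed.
    val orcDf = hiveContext.read.format("orc").load("hdfs:///warehouse/page_views_orc")
    orcDf.select("url").show()
  }
}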

2.3.4 Hadoop Distributed File System HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data. The HDFS architecture consists of the following components [26]:

• NameNode: maintains the namespace hierarchy and file system metadata such as block locations. The namespace and metadata are stored in RAM but periodically flushed to disk; a modification log keeps the on-disk image up to date.
• DataNodes: store HDFS file data in the local file system and receive commands from the NameNode that instruct them to replicate blocks to other nodes, re-register or shut down, remove local block replicas, or send an immediate block report.
• HDFS Client: a code library that exports the HDFS file system interface to applications; it reads data by transferring data from a DataNode directly, and writes data by setting up a node-to-node pipeline and sending data to the first DataNode.

File I/O operations and replica management. An application adds data to HDFS by creating a new file and writing data to it; all files are read and append only, and HDFS implements a single-writer, multiple-reader model. When there is need for a new block, the NameNode allocates a new block ID and determines a list of DataNodes to host replicas of the block; data is sent to the DataNodes in a pipeline fashion and may not be visible to readers until the file is closed. No DataNode contains more than one replica of any block and no rack contains more than two replicas of the same block. The Hadoop Distributed File System is designed to store very large data sets reliably, to stream these datasets to user applications at high bandwidth, and to distribute storage and computation tasks across thousands of servers so that resources can scale with demand while remaining economical. The HDFS architecture consists of a single NameNode, many DataNodes and the HDFS client. Hadoop is an open source project that was inspired by Google's proprietary Google File System and MapReduce framework [26]. Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates. Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance. HDFS applications need a write-once-read-many access model for files.

Figure 12: The Hadoop Distributed File System [26]

A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future. A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located. HDFS has a master/slave architecture. An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of Datanodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of Datanodes. The Namenode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from the file system's clients. The Datanodes also perform block creation, deletion, and replication upon instruction from the Namenode. [6] The Namenode and Datanode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports

Figure 13: HDFS Architecture [6]

Java can run the Namenode or the Datanode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the Namenode software. Each of the other machines in the cluster runs one instance of the Datanode software. The architecture does not preclude running multiple Datanodes on the same machine, but in a real deployment that is rarely the case. The existence of a single Namenode in a cluster greatly simplifies the architecture of the system. The Namenode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the Namenode. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the Namenode machine. It talks the ClientProtocol with the Namenode. The Datanodes talk to the Namenode using the DatanodeProtocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol.

By design, the Namenode never initiates any RPCs. Instead, it only responds to RPC requests issued by Datanodes or clients. HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different Datanode. The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are Namenode failures, Datanode failures and network partitions [6].
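To make the client interaction described above concrete, here is a minimal Scala sketch that uses the standard Hadoop FileSystem API to create, write, and read back a small file. The NameNode URI and file path are assumptions for the example; in practice they come from the cluster configuration.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsClientSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical NameNode address; normally taken from core-site.xml.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf)

    // Write a small file: data flows through a DataNode pipeline, not the NameNode.
    val out = fs.create(new Path("/tmp/example.txt"))
    out.writeBytes("hello hdfs\n")
    out.close()

    // Read it back by streaming directly from a DataNode.
    val in = fs.open(new Path("/tmp/example.txt"))
    val buf = new Array[Byte](1024)
    val n = in.read(buf)
    println(new String(buf, 0, n))
    in.close()
  }
}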

2.3.5 Hive Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc. Data in Hive is organized into the following (a minimal DDL sketch using these concepts follows this list):
• Tables - These are analogous to tables in relational databases. Each table has a corresponding HDFS directory. The data in a table is serialized and stored in files within that directory. Users can associate tables with the serialization format of the underlying data. Hive provides built-in serialization formats which exploit compression and lazy de-serialization. Users can also add support for new data formats by defining custom serialize and de-serialize methods (called SerDes) written in Java. The serialization format of each table is stored in the system catalog and is automatically used by Hive during query compilation and execution. Hive also supports external tables on data stored in HDFS, NFS or local directories.
• Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory. Suppose data for table T is in the directory /wh/T. If T is partitioned on columns ds and ctry, then data with a particular ds value 20090101 and ctry value US will be stored in files within the directory /wh/T/ds=20090101/ctry=US.
• Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.
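As the minimal sketch referred to above, the following Scala snippet issues a HiveQL DDL statement through Spark's HiveContext to create a table that is partitioned by date and country and bucketed on a user id. The table and column names are hypothetical and chosen only to mirror the /wh/T/ds=.../ctry=... layout described in the list.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HivePartitionedTableSketch {
  def main(args: Array[String]): Unit = {
    val hiveContext = new HiveContext(new SparkContext(new SparkConf().setAppName("hive-ddl")))

    // Hypothetical table: one HDFS sub-directory per (ds, ctry) partition,
    // rows hash-bucketed on userid within each partition.
    hiveContext.sql(
      """CREATE TABLE IF NOT EXISTS page_views (userid BIGINT, url STRING)
        |PARTITIONED BY (ds STRING, ctry STRING)
        |CLUSTERED BY (userid) INTO 32 BUCKETS
        |STORED AS ORC""".stripMargin)
  }
}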

Hive supports primitive column types (integers, floating point numbers, generic strings, dates and booleans) and nestable collection types such as array and map. Users can also define their own types programmatically. It provides a

SQL-like query language called HiveQL which supports select, project, join, aggregate, union all and sub-queries in the from clause. HiveQL supports data definition (DDL) statements to create tables with specific serialization formats, and partitioning and bucketing columns. Users can load data from external sources and insert query results into Hive tables via the load and insert data manipulation (DML) statements respectively. HiveQL currently does not support updating and deleting rows in existing tables. HiveQL supports multi-table insert, where users can perform multiple queries on the same input data using a single HiveQL statement. Hive optimizes these queries by sharing the scan of the input data. Hive is also very extensible: it supports user defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java. [33]

Figure 14: HIVE Architecture [33]

Listing 6: HiveQL example

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds = '2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION (ds = '2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION (ds = '2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school

REDUCE subq2.school, subq2.meme, subq2.cnt USING 'top10.py' AS (school, meme, cnt)
FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
      FROM (MAP b.school, a.status USING 'meme-extractor.py' AS (school, meme)
            FROM status_updates a JOIN profiles b
            ON (a.userid = b.userid)) subq1
      GROUP BY subq1.school, subq1.meme
      DISTRIBUTE BY school, meme
      SORT BY school, meme, cnt DESC) subq2;

Hive architecture: the main components of Hive are the following.

• External Interfaces - Hive provides both user interfaces like command line (CLI) and web UI, and application programming interfaces (API) like JDBC and ODBC.
• The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in PHP, Perl, Python etc. (A minimal JDBC sketch follows this list.)
• The Metastore is the system catalog. All other components of Hive interact with the metastore.
• The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution. On receiving the HiveQL statement, from the Thrift server or other interfaces, it creates a session handle which is later used to keep track of statistics like execution time, number of output rows, etc.
• The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates this statement into a plan which consists of a DAG of map-reduce jobs.

• The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in a topological order. Hive currently uses Hadoop as its execution engine.
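As noted in the list above, the Thrift-based JDBC driver lets ordinary Java or Scala programs submit HiveQL. The sketch below is only an assumption-laden example: the HiveServer2 host, port, credentials, and table name are placeholders, not a prescribed setup.

import java.sql.DriverManager

object HiveJdbcSketch {
  def main(args: Array[String]): Unit = {
    // The HiveServer2 JDBC driver ships with Hive; connection details are hypothetical.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "hive", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT category, COUNT(1) FROM urls GROUP BY category")
    while (rs.next()) {
      println(rs.getString(1) + "\t" + rs.getLong(2))
    }
    rs.close(); stmt.close(); conn.close()
  }
}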

Hive is an Apache sub-project, with an active user and developer community both within and outside Facebook. The Hive warehouse instance in Facebook contains over 700 terabytes of usable data and supports over 5000 queries on a daily basis.

2.3.6 Pig Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL). Unfortunately, the map-reduce model has its own set of limitations. Its one-input, two-stage data flow is extremely rigid. To perform tasks having a different data flow, e.g., joins or n stages, inelegant workarounds have to be devised. Also, custom code has to be written for even the most common operations, e.g., projection and filtering. These factors lead to code that is difficult to reuse and maintain, and in which the semantics of the analysis task are obscured. Moreover, the opaque nature of the map and reduce functions impedes the ability of the system to perform optimizations. Pig Latin combines the best of both worlds: high-level declarative querying in the spirit of SQL, and low-level, procedural programming à la map-reduce. Example: Suppose we have a table urls: (url, category, pagerank). The following is a simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category.

Listing 7: SQL query

SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000

An equivalent Pig Latin program is the following.

Listing 8: Pig Latin program

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

As evident from the above example, a Pig Latin program is a sequence of steps, much like in a programming language, each of which carries out a single data transformation. This characteristic is immediately appealing to many programmers. At the same time, the transformations carried out in each step are fairly high-level, e.g., filtering, grouping, and aggregation, much like in SQL. The use of such high-level primitives renders low-level manipulations (as required in map-reduce) unnecessary. The overarching design goal of Pig is to be appealing to experienced programmers for performing ad-hoc analysis of extremely large data sets. Consequently, Pig Latin has a number of features that might seem surprising when viewed from a traditional database and SQL perspective. In this section, we describe the features of Pig, and the rationale behind them. "I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables

are, and where you are in the process of analyzing your data." - Jasmine Novak, Engineer, Yahoo! Since Pig Latin is geared toward processing web-scale data, it does not make sense to consider non-parallel evaluation. Consequently, we have only included in Pig Latin a small set of carefully chosen primitives that can be easily parallelized. Language primitives that do not lend themselves to efficient parallel evaluation (e.g., non-equi-joins, correlated subqueries) have been deliberately excluded. Such operations can, of course, still be carried out by writing UDFs. However, since the language does not provide explicit primitives for such operations, users are aware of how efficient their programs will be and whether they will be parallelized. Pig's target demographic is experienced procedural programmers who prefer map-reduce style programming over the more declarative, SQL-style programming, for stylistic reasons as well as the ability to control the execution plan. Pig aims for a sweet spot between these two extremes, offering high-level data manipulation primitives such as projection and join, but in a much less declarative style than SQL. [22]

2.3.7 HBase HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily. HBase adds a distributed, fault-tolerant, scalable database, built on top of the HDFS file system, with random real-time read/write access to data. Each HBase table is stored as a multidimensional sparse map, with rows and columns, each cell having a time stamp. A cell value at a given row and column is uniquely identified by (Table, Row, Column-Family:Column, Timestamp) -> Value. HBase has its own Java client API, and tables in HBase can be used both as an input source and as an output target for MapReduce jobs through TableInput/TableOutputFormat (a minimal client sketch is given at the end of this subsection). There is no HBase single point of failure. HBase uses Zookeeper, another Hadoop subproject, for management of partial failures. All table accesses are by the primary key. Secondary indices are possible through additional index tables; programmers need to denormalize and replicate. There is no SQL query language in base HBase. However, there is also a Hive/HBase integration project that allows HiveQL statements access to HBase tables for both reading and inserting. Also, there is the independent HBql project (author P. Ambrose [36]) to add a dialect of SQL and JDBC bindings for HBase. A table is made up of regions. Each region is defined by a startKey and endKey, may live on a different node, and is made up of several HDFS files and blocks, each of which is replicated by Hadoop. Columns can be added on-the-fly to tables, with only the parent column families being fixed in a schema. Each cell is tagged by column family and column name, so programs can always identify what type of data item a given cell contains. In addition to being

able to scale to petabyte size data sets, we may note the ease of integration of disparate data sources into a small number of HBase tables for building a data workspace, with different columns possibly defined (on-the-fly) for different rows in the same table. Such facility is also important. (See the biological integration discussion below.) In addition to HBase, other scalable random access databases are now available. HadoopDB is a hybrid of MapReduce and a standard relational database system. HadoopDB uses PostgreSQL for the database layer (one PostgreSQL instance per data chunk per node), Hadoop for the communication layer, and an extended version of Hive for a translation layer. Also, there are non-Hadoop based scalable alternatives, also based on the Google BigTable concept, such as Hypertable and Cassandra. And there are other so-called NoSQL scalable databases of possible interest: Project Voldemort, Dynamo (used for Amazon's Simple Storage Service (S3)), and Tokyo Tyrant, among others [32].
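The following Scala sketch (referred to above) shows the basic HBase client calls for the write-and-read-by-key pattern just described. The table name, column family, row key and values are hypothetical, and the connection settings are assumed to come from the hbase-site.xml on the classpath.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseClientSketch {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("users"))

    // Write one cell: (table, row, family:qualifier, timestamp) -> value.
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alyssa"))
    table.put(put)

    // Random read by primary key.
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

    table.close(); conn.close()
  }
}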

2.3.8 Flume Flume is a framework for populating Hadoop with data. Agents are populated throughout one's IT infrastructure - inside web servers, application servers and mobile devices, for example - to collect data and integrate it into Hadoop. Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system). From the master you could check agent status in a web UI, as well as push out configuration centrally from the UI or via a command line shell (both really communicating via Zookeeper to the worker agents). In a regular Portable Operating System Interface (POSIX) style filesystem, if you open a file and write data, it still exists on disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to disk. Furthermore, if that writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists). In HDFS, the file exists only as a directory entry; it shows as having zero length until the file is closed. This means that if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so you can close them as soon as possible. The problem is that Hadoop doesn't like lots of tiny files. Since the HDFS metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each mapper is assigned a single block of a file as input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionately high compared to the data it is processing. This kind of block fragmentation also results in more mapper tasks, increasing the overall job run times [14].

Figure 15: Flume Agent [14]

2.3.9 Oozie Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages - such as MapReduce, Pig and Hive - and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed. Oozie remains the most sophisticated and powerful workflow scheduler for managing Apache Hadoop jobs, although simpler open source alternatives have been introduced. Hadoop, Pig, Hive, and many other projects provide the foundation for storing and processing large amounts of data in an efficient way. Most of the time, it is not possible to perform all required processing with a single MapReduce, Pig, or Hive job. Multiple MapReduce, Pig, or Hive jobs often need to be chained together, producing and consuming intermediate data and coordinating their flow of execution. At Yahoo!, as developers started doing more complex processing using Hadoop, multistage Hadoop jobs became common. This led to several ad hoc solutions to manage the execution and interdependency of these multiple Hadoop jobs. As these solutions started to be widely used, several issues emerged. It was hard to track errors and it was difficult to recover from failures. It was not easy to monitor progress. The solution was Oozie. Oozie is an orchestration system for Hadoop jobs. Oozie is designed to run multistage Hadoop jobs as a single job: an Oozie job. Oozie jobs can be configured to run on demand or periodically. Oozie jobs running on demand are called workflow jobs. Oozie jobs running periodically are called coordinator jobs. There is also a third type of Oozie job called bundle jobs. A bundle job is a collection of coordinator jobs managed as a single job. [15] In the identity-WF example (Figure 17), Oozie runs a MapReduce job called identity-MR. If the MapReduce job completes successfully, the workflow job ends normally. If the MapReduce job fails to execute correctly, Oozie kills the workflow.

2.3.10 Ambari Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. Its development is led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform. Ambari is a visual dashboard for monitoring the health of the Hadoop cluster. It uses the REST

Figure 16: Oozie in the Hadoop ecosystem [15]

Figure 17: identity-WF Oozie workflow example [15]

APIs provided by Ambari Server. [36]
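As a rough illustration of those REST APIs, the sketch below issues a single authenticated GET against an Ambari server to list the clusters it manages. The host, port, and credentials are assumptions, and in a real deployment the endpoint and authentication scheme should be checked against the Ambari documentation.

import java.net.{HttpURLConnection, URL}
import java.util.Base64
import scala.io.Source

object AmbariRestSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical Ambari server address and default-style credentials.
    val url = new URL("http://ambari-server:8080/api/v1/clusters")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    val auth = Base64.getEncoder.encodeToString("admin:admin".getBytes("UTF-8"))
    conn.setRequestProperty("Authorization", s"Basic $auth")
    // Print the JSON listing of clusters returned by the server.
    println(Source.fromInputStream(conn.getInputStream).mkString)
  }
}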

2.3.11 Avro Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls. Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.

Listing 9: Example serialization and deserialization code in Python

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open("user.avsc").read())  # need to know the schema to write

writer = DataFileWriter(open("users.avro", "w"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})

Figure 18: Ambari [36]

writer.close()

Listing 10: Deserialization

reader = DataFileReader(open("users.avro", "r"), DatumReader())  # no need to know the schema to read
for user in reader:
    print user
reader.close()

2.3.12 Sqoop Sqoop is a connectivity tool for moving data from non-Hadoop data stores - such as relational databases and data warehouses - into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target. Sqoop is a core member of the Hadoop ecosystem, and plug-ins are provided and supported by several major SQL and ETL vendors. Sqoop is now part of integral ETL and processing pipelines run by some of the largest users of Hadoop. Sqoop enabled users with vast troves of information stored in existing SQL tables to use new analytic tools like MapReduce and Apache Hive. As Sqoop matures, a renewed focus on SQL-oriented analytics continues to make it relevant: systems like Cloudera Impala and Dremel-style analytic engines offer powerful distributed analytics with SQL-based languages, using the common data substrate offered by HDFS. The variety of data sources and analytic targets presents a challenge in setting up effective data transfer pipelines. Data sources can have a variety of

subtle inconsistencies: different DBMS providers may use different dialects of SQL, treat data types differently, or use distinct techniques to offer optimal transfer speeds. Depending on whether you're importing to Hive, Pig, Impala, or your own MapReduce pipeline, you may want to use a different file format or compression algorithm when writing data to HDFS. Sqoop helps the data engineer tasked with scripting such transfers by providing a compact but powerful tool that flexibly negotiates the boundaries between these systems and their data layouts. A significant strength of Sqoop is its ability to work with all major and minor database systems and enterprise data warehouses. To abstract the different behavior of each system, Sqoop introduced the concept of connectors: all database-specific operations are delegated from core Sqoop to the specialized connectors. Sqoop itself bundles many such connectors; you do not need to download anything extra in order to run Sqoop. The most general connector bundled with Sqoop is the Generic JDBC Connector that utilizes only the JDBC interface. This will work with every JDBC-compliant database system. In addition to this generic connector, Sqoop also ships with specialized connectors for MySQL, Oracle, PostgreSQL, Microsoft SQL Server, and DB2, which utilize special properties of each particular database system. You do not need to explicitly select the desired connector, as Sqoop will automatically do so based on your JDBC URL. Let's see an example of how Sqoop works. First, Sqoop will connect to the database to fetch table metadata: the number of table columns, their names, and the associated data types. For example, for table cities, Sqoop will retrieve information about the three columns: id, country, and city, with int, VARCHAR, and VARCHAR as their respective data types. Depending on the particular database system and the table itself, other useful metadata can be retrieved as well (for example, Sqoop can determine whether the table is partitioned or not). At this point, Sqoop is not transferring any data between the database and your machine; rather, it's querying the catalog tables and views. Based on the retrieved metadata, Sqoop will generate a Java class and compile it using the JDK and Hadoop libraries available on your machine. Next, Sqoop will connect to your Hadoop cluster and submit a MapReduce job. Each mapper of the job will then transfer a slice of the table's data. As MapReduce executes multiple mappers at the same time, Sqoop will be transferring data in parallel to achieve the best possible performance by utilizing the potential of your database server. Each mapper transfers the table's data directly between the database and the Hadoop cluster. To avoid becoming a transfer bottleneck, the Sqoop client acts as the overseer rather than as an active participant in transferring the data. This is a key tenet of Sqoop's design. Code example: You have a table in a relational database (e.g., MySQL) and you need to transfer the table's contents into Hadoop's Distributed File System (HDFS). Importing one table with Sqoop is very simple: you issue the Sqoop import command and specify the database credentials and the name of the table to transfer. [34]

Listing 11: A simple Sqoop Code example

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities

2.3.13 HCatalog HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored. One of the major obstacles to adopting Hadoop in the enterprise is that Hadoop implementations still require a low-level understanding of the system, including working with files in the distributed file system. Users of databases and ETL systems are used to working with abstractions such as databases and tables, and Hive supports this abstraction. Hive is not suitable for ETL, however; Pig and MapReduce programs are better suited for ETL offloading tasks. Although Pig and MapReduce still work with files in the file system, ETL users and vendors that develop tools prefer to work with the familiar abstraction of databases and tables even when using Pig and MapReduce programs. HCatalog, the new API that provides this abstraction, brings familiar abstractions to Pig and MapReduce. It is designed to support ETL tool vendors in communicating with HDFS using abstractions familiar to database users. HCatalog is designed to facilitate the use of Hadoop in a mature enterprise. HCatalog is a table and storage management layer that abstracts HDFS files into a familiar relational view of databases and tables, which enables users of various Hadoop APIs such as Pig, Hive, and MapReduce to be insulated from the low-level details of HDFS-based data storage. Instead of having to know which directories and files contain the data, the users of these APIs can simply provide the name of the database and table to access the data. [36] Using HCatalog, Hive, Pig and MapReduce can collectively use the same metadata. A table can be generated via tools such as Pig and MapReduce and queried by using Hive. An ETL platform can now use Hive, Pig, and MapReduce in the same pipeline and interface using logical abstractions such as "databases" and "tables" instead of physical abstractions such as HDFS folders and files, which enables ETL pipelines to become arbitrarily complex and consequently more effective. This is not unlike a user plugging a Java program into a mostly SQL query-based DataStage (IBM product) pipeline to execute a complex array of tasks in one step. Similar to this example, users can use Hive for simple tasks, plug in Pig scripts for moderately complex tasks, and use MapReduce for extremely complex components. All these components can be part of the same logical pipeline. HCatalog enables this integration and simplifies the use of Hadoop for a typical enterprise user. [36]

Figure 19: HCatalog for a SQL tools user [36]

HCatalog provides a relational view for HDFS files that have a SerDe defined for them. A SerDe is a serializer/deserializer component that allows Hive to read data into a table and write data back to the HDFS in any format. Out of the box, HCatalog supports the following types of files:

• The Record Columnar File (RCFile) format is used to support column-based storage in the HDFS.
• The Optimized Row Columnar (ORC) format is a more efficient version of the RCFile format (this efficiency is with respect to storage and data retrieval).
• TextFile is a typical character-delimited file format. The delimiter can be custom and does not always need to be a comma.
• JavaScript Object Notation (JSON) is a DOM-based format that is similar to XML except that it is more lightweight.
• SequenceFile is a native Hadoop-based binary format. It allows storage of binary data as key-value pairs where keys do not need to be unique, and the keys and values can be complex writable types. It supports very efficient storage and retrieval mechanisms.

The figure below shows the high-level architecture of HCatalog. The key features are as follows:

• The Hive metastore client is used by HCatalog classes to retrieve and update schema definitions for the databases, tables, and partitions of tables stored as files in the HDFS.
• Similar to Hive, HCatalog uses a SerDe to read and write a record to a table.

Figure 20: HCatalog supported interfaces [36]

• HCatalog uses specialized I/O format classes (HCatalogInputFormat and HCatalogOutputFormat) to allow MapReduce programs to work with common abstractions such as tables. These I/O formats abstract away the low-level details of working with the actual files in the file system. These classes use the SerDes defined for the underlying file system to read and write records to the tables. The information about appropriate SerDes for the file is retrieved from the Hive metastore.

• HCatalog also provides specialized load and store classes (HCatLoader and HCatStorer) to allow Pig to interact with the tables defined in HCatalog. These classes in turn use the I/O format classes mentioned earlier.

Figure 21: High level HCatalog interfaces [36]

WebHCat is a REST API for HCatalog. Using WebHCat and its underlying security schemes, programs can securely connect and perform operations on HCatalog through a general REST-based API.

• REST-based API calls are available to manage databases, tables, partitions, columns, and table properties.
• PUT calls are used to create/update; GET calls are used to describe or get listings; and DELETE calls are used to drop databases, tables, partitions, and columns.

The operations that WebHCat allows through a REST interface are the following:
DDL operations
• Execute a DDL command
• List and describe databases and tables in databases
• Create/drop/alter databases and tables

• Manage table partitions, including operations such as creating and dropping partitions
• Manage table properties
Job management

• Remotely start and manage MapReduce jobs
• Remotely start and manage Pig jobs
• Execute Hive queries and commands

• Remotely obtain job status and manage job execution
HCatalog makes Hive easier to use, and both authentication models are identical. But because HCatalog also supports Pig and MapReduce, the Hive authorization model is inadequate for HCatalog. The default Hive authorization model uses familiar database concepts such as users and roles that are granted permissions to perform DDL and DML operations on databases and tables. However, unlike databases, Hive does not have full control over the data stored in its tables; these tables are stored in the HDFS file system as files. The table abstraction is simply an interface. Users can access the data by going to the files in the HDFS, even if their rights to a table based on those files are revoked. Conversely, users might not be able to access the table, even if they have access rights to the files underlying the table in the HDFS. HCatalog does not use this default model; instead it uses storage-based authorization. Users' privileges for a database and table are inherited from the privileges they have on the files in the file system, which is more secure because it is not possible to subvert the security of the system by changing the abstraction levels. Although it might be more limited than traditional database-style security in the sense that fine-grained, column-level permissions are not

feasible, it is more appropriate because HCatalog supports Pig and MapReduce in addition to Hive. From an enterprise perspective, Hadoop solves the same problems that databases have solved for decades, except that it does this when the scale of data is so large that it cannot be done using databases without investing in expensive MPP-based database systems. Yet enterprises have been cautious about adopting Hadoop and making it an integral part of their operations systems. The reasons for this caution have been Hadoop's lack of tooling and its low-level interfaces. HCatalog is a step in the direction of abstracting away these complexities for high-level BI and ETL users. It is a promising API that allows high-level tools to be developed. These tools not only bring Hadoop to the traditional enterprise user but also boost productivity by allowing users to work with familiar abstractions such as databases and tables instead of files, InputFormats and OutputFormats. [36]

2.3.14 BigTop Bigtop is an Apache Foundation project for infrastructure engineers and data scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark. It is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole. Bigtop packages Hadoop RPMs and DEBs so that you can manage and maintain your Hadoop cluster, provides an integrated smoke-testing framework alongside a suite of over 50 test files, and provides Vagrant recipes, raw images, and (work-in-progress) Docker recipes for deploying Hadoop from zero.

2.4 Data Mining and Machine Learning introduction The rapid growth of interest in data mining follows from the confluence of several recent trends: (1) the falling cost of large data storage devices and the increasing ease of collecting data over networks, (2) the development of robust and efficient machine learning algorithms to process this data, and (3) the falling cost of computational power, enabling the use of computationally intensive methods for data analysis. The field of data mining, sometimes referred to as knowledge discovery from databases, machine learning, or advanced data analysis, has already produced highly practical applications in areas such as credit card fraud detection, medical outcomes analysis, predicting customer purchase behavior, predicting the interests of web users, and optimizing manufacturing processes. It has also led to a set of fascinating scientific questions about how computers might automatically learn from experience. [20]

2.4.1 Data Mining Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. [40] The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:

1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation

Data mining involves six common classes of tasks:

• Anomaly detection (outlier/change/deviation detection) - The identification of unusual data records that might be interesting, or data errors that require further investigation.
• Association rule learning (dependency modelling) - Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis (a minimal Spark MLlib sketch of this task follows this list).
• Clustering - The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
• Classification - The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
• Regression - Attempts to find a function which models the data with the least error.
• Summarization - Providing a more compact representation of the data set, including visualization and report generation.
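To ground the association rule task mentioned in the list above, here is a small market-basket sketch using Spark MLlib's FP-Growth implementation on a toy in-memory data set; the item names and thresholds are made up for the example, and real workloads would read the transactions from HDFS instead.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object MarketBasketSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("market-basket"))

    // Toy transactions standing in for real purchase data.
    val transactions = sc.parallelize(Seq(
      Array("bread", "milk"),
      Array("bread", "diapers", "beer"),
      Array("milk", "diapers", "beer"),
      Array("bread", "milk", "diapers", "beer")))

    // Mine frequent itemsets, then derive rules above a confidence threshold.
    val model = new FPGrowth().setMinSupport(0.5).run(transactions)
    model.generateAssociationRules(0.8).collect().foreach { rule =>
      println(rule.antecedent.mkString(",") + " => " + rule.consequent.mkString(",") +
        " (confidence " + rule.confidence + ")")
    }
  }
}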

Data mining can unintentionally be misused, and can then produce results which appear to be significant but which do not actually predict future behavior, cannot be reproduced on a new sample of data, and bear little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process and thus a train/test split - when applicable at all - may not be sufficient to prevent this from happening. The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.
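A minimal Spark MLlib sketch of the train/test evaluation just described, assuming a labelled data set in LIBSVM format at a hypothetical HDFS path: the data is split into a training part and a held-out part, a classifier is fitted on the first, and the area under the ROC curve is computed on the second.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

object HoldoutEvaluationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("holdout-eval"))

    // Hypothetical labelled data set in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/spam.libsvm")

    // Hold out 30% of the examples; the model never sees them during training.
    val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)
    model.clearThreshold() // return raw scores instead of 0/1 labels

    // Score the held-out set and measure the area under the ROC curve.
    val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println("Area under ROC = " + metrics.areaUnderROC())
  }
}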

2.4.2 Machine Learning Machine learning [43] is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions. Machine learning is closely related to and often overlaps with computational statistics, a discipline which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible. Example applications include spam filtering, optical character recognition (OCR), search engines and computer vision. Machine learning is sometimes conflated with data mining, where the latter sub-field focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning "signal" or "feedback" available to a learning system. These are:

• Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
• Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
• Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal. Another example is learning to play a game by playing against an opponent.

Between supervised and unsupervised learning is semi-supervised learning, where the teacher gives an incomplete training signal: a training set with some (often many) of the target outputs missing. Transduction is a special case of this principle where the entire set of problem instances is known at learning time, except that part of the targets are missing. Among other categories of machine learning problems, learning to learn learns its own inductive bias based on previous experience. Developmental learning, elaborated for robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers, using guidance mechanisms such as active learning, maturation, motor synergies, and imitation. Another categorization of machine learning tasks arises when one considers the desired output of a machine-learned system:
• In classification, inputs are divided into two or more classes, and the learner must produce a model that assigns unseen inputs to one (or multi-label classification) or more of these classes. This is typically tackled in a supervised way. Spam filtering is an example of classification, where the inputs are email (or other) messages and the classes are "spam" and "not spam".
• In regression, also a supervised problem, the outputs are continuous rather than discrete.

• In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are not known beforehand, making this typically an unsupervised task.
• Density estimation finds the distribution of inputs in some space.

• Dimensionality reduction simplifies inputs by mapping them into a lower-dimensional space. Topic modeling is a related problem, where a program

is given a list of human language documents and is tasked to find out which documents cover similar topics.
Machine learning and data mining often employ the same methods and overlap significantly. They can be roughly distinguished as follows:
• Machine learning focuses on prediction, based on known properties learned from the training data.
• Data mining focuses on the discovery of (previously) unknown properties in the data. This is the analysis step of Knowledge Discovery in Databases.
The two areas overlap in many ways: data mining uses many machine learning methods, but often with a slightly different goal in mind. On the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy. Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in Knowledge Discovery and Data Mining (KDD) the key task is the discovery of previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data. Machine learning also has intimate ties to optimization: many learning problems are formulated as minimization of some loss function on a training set of examples. Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples). The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.
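As a toy illustration of formulating learning as loss minimization, the following self-contained Scala sketch fits a one-parameter linear model by gradient descent on the mean squared error of a small, made-up training set.

object LossMinimizationSketch {
  // Squared-error loss for a one-dimensional linear model y ~ w * x.
  def loss(w: Double, data: Seq[(Double, Double)]): Double =
    data.map { case (x, y) => math.pow(w * x - y, 2) }.sum / data.size

  def main(args: Array[String]): Unit = {
    // Toy training set generated from y = 2x with a little noise.
    val train = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8))
    var w = 0.0
    val lr = 0.01
    // Plain gradient descent: repeatedly step against the gradient of the training loss.
    for (_ <- 1 to 200) {
      val grad = train.map { case (x, y) => 2 * (w * x - y) * x }.sum / train.size
      w -= lr * grad
    }
    println(s"learned w = $w, training loss = ${loss(w, train)}")
  }
}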

2.5 Data Mining and Machine Learning Tools

2.5.1 WEKA

Waikato Environment for Knowledge Analysis (WEKA) is recognized as a landmark system in data mining and machine learning. It has achieved widespread acceptance within academia and business circles, and has become a widely used tool for data mining research. The WEKA project aims to provide a comprehensive collection of machine learning algorithms and data preprocessing tools to researchers and practitioners alike. It allows users to quickly try out and compare different machine learning methods on new data sets. Its modular, extensible architecture allows sophisticated data mining processes to be built up from the wide collection of base learning algorithms and tools provided.

Regardless of which user interface is used, it is important to provide the Java virtual machine that runs WEKA with a sufficient amount of heap space (for example via the JVM's -Xmx option). The need to prespecify the amount of memory required, which should be set lower than the amount of physical memory of the machine to avoid swapping, is perhaps the biggest stumbling block to the successful application of WEKA in practice. [13]

Figure 22: Waikato Environment for Knowledge Analysis (WEKA)

2.5.2 SciKit-Learn

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.org. [24]

Figure 23: Scikit-learn.

2.5.3 RapidMiner

RapidMiner is a software platform developed by the company of the same name that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and supports all steps of the data mining process including data preparation, results visualization, validation and optimization. RapidMiner is developed on an open core model, with the RapidMiner Basic Edition available for download under the AGPL license. The Professional Edition starts at $1,999 and is available from the developer. [46]

Figure 24: Rapidminer Logo

2.5.4 Spark MLlib

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed. [19]

Figure 25: Apache Spark ecosystem.

2.5.5 H2O Flow

H2O is open-source software for big-data analysis. It is produced by the start-up H2O.ai (formerly 0xdata), which launched in 2011 in Silicon Valley. The speed and flexibility of H2O allow users to fit hundreds or thousands of potential models as part of discovering patterns in data. With H2O, users can throw models at data to find usable information, allowing H2O to discover patterns. Using H2O, Cisco estimates 20 thousand models of its customers' propensities to buy each month. H2O's mathematical core is developed under the leadership of Arno Candel; after H2O was rated as the best "open-source Java machine learning project" by GitHub's programming members, Candel was named to the first class of "Big Data All Stars" by Fortune in 2014. The firm's scientific advisors are experts on statistical learning theory and mathematical optimization.

H2O Flow is a notebook-style open-source user interface for H2O. It is a web-based interactive environment that allows you to combine code execution, text, mathematics, plots, and rich media in a single document, similar to iPython Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work - all within Flow's browser-based environment.

Figure 26: H2O Logo

3 Methods

In mathematics and computer science, an algorithm is a self-contained step-by-step set of operations to be performed. Algorithms perform calculation, data processing, and/or automated reasoning tasks.

The words 'algorithm' and 'algorism' come from the name al-Khwarizmi. Al-Khwarizmi was a Persian mathematician, astronomer, geographer, and scholar. An algorithm is an effective method that can be expressed within a finite amount of space and time and in a well-defined formal language for calculating a function. Starting from an initial state and initial input (perhaps empty), the instructions describe a computation that, when executed, proceeds through a finite number of well-defined successive states, eventually producing "output" and terminating at a final ending state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input.

The concept of algorithm has existed for centuries; however, a partial formalization of what would become the modern algorithm began with attempts to solve the Entscheidungsproblem (the "decision problem") posed by David Hilbert in 1928. Subsequent formalizations were framed as attempts to define "effective calculability" or "effective method"; those formalizations included the Gödel–Herbrand–Kleene recursive functions of 1930, 1934 and 1935, Alonzo Church's lambda calculus of 1936, Emil Post's "Formulation 1" of 1936, and Alan Turing's Turing machines of 1936-7 and 1939. Giving a formal definition of algorithms, corresponding to the intuitive notion, remains a challenging problem.

In computer systems, an algorithm is basically an instance of logic written in software by software developers to be effective for the intended "target" computer(s) to produce output from given (perhaps null) input. An optimal algorithm, even running on old hardware, would produce faster results than a non-optimal (higher time complexity) algorithm for the same purpose running on more efficient hardware; that is why algorithms, like computer hardware, are considered technology.

Let's look at some of the top algorithms [49]:

C4.5 and beyond. Systems that construct classifiers are one of the commonly used tools in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by its values for a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs.

The k-means algorithm. The k-means algorithm is a simple iterative method to partition a given dataset into a user-specified number of clusters, k. This algorithm has been discovered by several researchers across different disciplines, most notably Lloyd (1957, 1982), Forgey (1965), Friedman and Rubin (1967), and McQueen (1967). Gray and Neuhoff provide a nice historical background for k-means placed in the larger context of hill-climbing algorithms.

Support vector machines. In today's machine learning applications, support vector machines (SVM) are considered a must-try: they offer one of the most robust and accurate methods among all well-known algorithms. SVM has a sound theoretical foundation, requires only a dozen examples for training, and is insensitive to the number of dimensions. In addition, efficient methods for training SVM are also being developed at a fast pace. In a two-class learning task, the aim of SVM is to find the best classification function to distinguish between members of the two classes in the training data. The metric for the concept of the "best" classification function can be realized geometrically. For a linearly separable dataset, a linear classification function corresponds to a separating hyperplane f(x) that passes through the middle of the two classes, separating the two. Because there are many such linear hyperplanes, what SVM additionally guarantees is that the best such function is found by maximizing the margin between the two classes. Intuitively, the margin is defined as the amount of space, or separation, between the two classes as defined by the hyperplane. Geometrically, the margin corresponds to the shortest distance between the closest data points and a point on the hyperplane. Having this geometric definition allows us to explore how to maximize the margin, so that even though there are an infinite number of hyperplanes, only a few qualify as the solution to SVM.

Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, a parallel SVM algorithm (PSVM) has been developed, which reduces memory use by performing a row-based, approximate matrix factorization and which loads only essential data to each machine to perform parallel computation. To make large-scale dataset training practical and fast, parallelization on distributed computers is necessary. Although SMO-based algorithms are the preferred choice on a single computer, they are difficult to parallelize. PSVM is a practical, parallel approximate implementation to speed up SVM training on today's distributed computing infrastructures for dealing with Web-scale problems. PSVM is not the sole solution to speed up SVMs: algorithmic approaches can be more effective when memory is not a constraint or kernels are not used. Nor is the algorithmic approach the only avenue to speed up SVM training: data-processing approaches can divide a serial algorithm (e.g., LIBSVM) into subtasks on subsets of training data to achieve good speedup. (Data-processing and algorithmic approaches complement each other and can be used together to handle large-scale training.) [4]

The Apriori algorithm. One of the most popular data mining approaches is to find frequent itemsets from a transaction dataset and derive association rules. Finding frequent itemsets (itemsets with frequency larger than or equal to a user-specified minimum support) is not trivial because of its combinatorial explosion. Once frequent itemsets are obtained, it is straightforward to generate association rules with confidence larger than or equal to a user-specified minimum confidence.

Figure 27: Apriori algorithm

The EM algorithm. Finite mixture distributions provide a flexible and mathematically based approach to the modeling and clustering of data observed on random phenomena. We focus here on the use of normal mixture models, which can be used to cluster continuous data and to estimate the underlying density function. These mixture models can be fitted by maximum likelihood via the EM (Expectation - Maximization) algorithm.

PageRank. PageRank was presented and published by Sergey Brin and Larry Page at the Seventh International World Wide Web Conference (WWW7) in April 1998. It is a search ranking algorithm using hyperlinks on the Web. Based on the algorithm, they built the search engine Google, which has been a huge success. Now, every search engine has its own hyperlink based ranking method. PageRank produces a static ranking of Web pages in the sense that a PageRank value is computed for each page off-line and it does not depend on search queries. The algorithm relies on the democratic nature of the Web by using its vast link structure as an indicator of an individual page's quality. In essence, PageRank interprets a hyperlink from page x to page y as a vote, by page x, for page y. However, PageRank looks at more than just the sheer number of votes, or links, that a page receives. It also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages more "important". This is exactly the idea of rank prestige in social networks.

AdaBoost. Ensemble learning deals with methods which employ multiple learners to solve a problem. The generalization ability of an ensemble is usually significantly better than that of a single learner, so ensemble methods are very attractive. The AdaBoost algorithm proposed by Yoav Freund and Robert Schapire is one of the most important ensemble methods, since it has a solid theoretical foundation, very accurate prediction, great simplicity (Schapire said it needs only just 10 lines of code), and wide and successful applications.

kNN: k-nearest neighbor classification. One of the simplest, and rather trivial, classifiers is the Rote classifier, which memorizes the entire training data and performs classification only if the attributes of the test object match one of the training examples exactly. An obvious drawback of this approach is that many test records will not be classified because they do not exactly match any of the training records. A more sophisticated approach, k-nearest neighbor (kNN) classification, finds a group of k objects in the training set that are closest to the test object, and bases the assignment of a label on the predominance of a particular class in this neighborhood. There are three key elements of this approach: a set of labeled objects, e.g., a set of stored records, a distance or similarity metric to compute distance between objects, and the value of k, the number of nearest neighbors. To classify an unlabeled object, the distance of this object to the labeled objects is computed, its k-nearest neighbors are identified, and the class labels of these nearest neighbors are then used to determine the class label of the object. (A minimal sketch of this procedure is given at the end of this subsection.)

Naive Bayes. Given a set of objects, each of which belongs to a known class, and each of which has a known vector of variables, our aim is to construct a rule which will allow us to assign future objects to a class, given only the vectors of variables describing the future objects. Problems of this kind, called problems of supervised classification, are ubiquitous, and many methods for constructing such rules have been developed. One very important one is the naive Bayes method, also called idiot's Bayes, simple Bayes, and independence Bayes. This method is important for several reasons. It is very easy to construct, not needing any complicated iterative parameter estimation schemes. This means it may be readily applied to huge data sets. It is easy to interpret, so users unskilled in classifier technology can understand why it is making the classification it makes. And finally, it often does surprisingly well: it may not be the best possible classifier in any particular application, but it can usually be relied on to be robust and to do quite well. General discussion of the naive Bayes method and its merits is given in the literature.

CART. The 1984 monograph, "CART: Classification and Regression Trees," co-authored by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone [9], represents a major milestone in the evolution of Artificial Intelligence, Machine Learning, non-parametric statistics, and data mining. The work is important for the comprehensiveness of its study of decision trees, the technical innovations it introduces, its sophisticated discussion of tree-structured data analysis, and its authoritative treatment of large sample theory for trees. While CART citations can be found in almost any domain, far more appear in fields such as electrical engineering, biology, medical research and financial topics than, for example, in marketing research or sociology where other tree methods are more popular. This section is intended to highlight key themes treated in the CART monograph so as to encourage readers to return to the original source for more detail.

The CART decision tree is a binary recursive partitioning procedure capable of processing continuous and nominal attributes both as targets and predictors. Data are handled in their raw form; no binning is required or recommended. Trees are grown to a maximal size without the use of a stopping rule and then pruned back (essentially split by split) to the root via cost-complexity pruning. The next split to be pruned is the one contributing least to the overall performance of the tree on training data (and more than one split may be removed at a time). The procedure produces trees that are invariant under any order preserving transformation of the predictor attributes. The CART mechanism is intended to produce not one, but a sequence of nested pruned trees, all of which are candidate optimal trees. The "right sized" or "honest" tree is identified by evaluating the predictive performance of every tree in the pruning sequence. CART offers no internal performance measures for tree selection based on the training data, as such measures are deemed suspect. Instead, tree performance is always measured on independent test data (or via cross validation) and tree selection proceeds only after test-data-based evaluation. If no test data exist and cross validation has not been performed, CART will remain agnostic regarding which tree in the sequence is best. This is in sharp contrast to methods such as C4.5 that generate preferred models on the basis of training data measures. The CART mechanism includes automatic (optional) class balancing, automatic missing value handling, and allows for cost-sensitive learning, dynamic feature construction, and probability tree estimation. The final reports include a novel attribute importance ranking. The CART authors also broke new ground in showing how cross validation can be used to assess performance for every tree in the pruning sequence, given that trees in different CV folds may not align on the number of terminal nodes.
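To make the kNN description above concrete, the following is a minimal, self-contained Scala sketch of the idea (plain Scala, not MLlib); the toy data, the Euclidean distance and the choice of k = 3 are purely illustrative:

object KnnExample {
  // Euclidean distance between two feature vectors
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Classify a query point by majority vote among its k nearest labeled neighbors
  def classify(train: Seq[(Array[Double], String)], query: Array[Double], k: Int): String =
    train
      .map { case (features, label) => (distance(features, query), label) }
      .sortBy(_._1)          // sort by distance to the query
      .take(k)               // keep the k closest neighbors
      .groupBy(_._2)         // group them by class label
      .maxBy(_._2.size)._1   // the predominant class wins

  def main(args: Array[String]): Unit = {
    val train = Seq(
      (Array(1.0, 1.0), "A"), (Array(1.2, 0.8), "A"),
      (Array(5.0, 5.0), "B"), (Array(4.8, 5.1), "B"))
    println(classify(train, Array(1.1, 0.9), k = 3)) // expected: "A"
  }
}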

There are different ways an algorithm can model a problem based on its interaction with the experience or environment, or whatever we want to call the input data. It is popular in machine learning and artificial intelligence textbooks to first consider the learning styles that an algorithm can adopt. There are only a few main learning styles or learning models that an algorithm can have and we'll go through them here with a few examples of algorithms and problem types that they suit. This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think about the roles of the input data and the model preparation process, and to select the one that is most appropriate for your problem in order to get the best result. Let's take a look at three different learning styles in machine learning algorithms [7]:

Supervised Learning. Input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a time. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data. Example problems are classification and regression. Example algorithms include Logistic Regression and the Back Propagation Neural Network.

Unsupervised Learning. Input data is not labelled and does not have a known result. A model is prepared by deducing structures present in the input data. This may be to extract general rules, it may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity. Example problems are clustering, dimensionality reduction and association rule learning. Example algorithms include the Apriori algorithm and k-Means.

Semi-Supervised Learning. Input data is a mixture of labelled and unlabelled examples. There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions. Example problems are classification and regression. Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabelled data.

When crunching data to model business decisions, you are most typically using supervised and unsupervised learning methods. A hot topic at the moment is semi-supervised learning methods in areas such as image classification, where there are large datasets with very few labelled examples.

Algorithms are often grouped by similarity in terms of their function (how they work), for example tree-based methods and neural network inspired methods. I think this is the most useful way to group algorithms and it is the approach we will use here. This is a useful grouping method, but it is not perfect. There are still algorithms that could just as easily fit into multiple categories, like Learning Vector Quantization, which is both a neural network inspired method and an instance-based method. There are also categories that have the same name describing both the problem and the class of algorithm, such as Regression and Clustering. We could handle these cases by listing algorithms twice or by selecting the group that subjectively is the best fit. I like this latter approach of not duplicating algorithms, to keep things simple. In this section I list many of the popular machine learning algorithms grouped the way I think is the most intuitive. It is not exhaustive in either the groups or the algorithms, but I think it is representative and will be useful to you to get an idea of the lay of the land.

Regression Algorithms. Regression is concerned with modelling the relationship between variables, iteratively refined using a measure of error in the predictions made by the model. Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning. This may be confusing because we can use regression to refer to both the class of problem and the class of algorithm. Really, regression is a process. The most popular regression algorithms are listed below; a short MLlib sketch follows the list:

• Ordinary Least Squares Regression (OLSR)
• Linear Regression
• Logistic Regression
• Stepwise Regression
• Multivariate Adaptive Regression Splines (MARS)
• Locally Estimated Scatterplot Smoothing (LOESS)
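As a rough illustration of a regression algorithm in the tool used later in this thesis, the sketch below fits a simple linear model with spark.mllib's LinearRegressionWithSGD inside the spark-shell. The tiny synthetic data set, the number of iterations and the step size are illustrative assumptions, not values used in the experiments:

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

// Tiny synthetic data set where the label is roughly 2 * feature (illustrative)
val points = sc.parallelize(Seq(
  LabeledPoint(2.0, Vectors.dense(1.0)),
  LabeledPoint(4.0, Vectors.dense(2.0)),
  LabeledPoint(6.0, Vectors.dense(3.0))))

// Fit a linear model by stochastic gradient descent over the squared-error loss
val model = LinearRegressionWithSGD.train(points, 200, 0.1)

// Mean squared error of the fitted model on the training points
val mse = points.map(p => math.pow(model.predict(p.features) - p.label, 2)).mean()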

Instance-based Algorithms. Instance-based learning models a decision problem with instances or examples of training data that are deemed important or required by the model. Such methods typically build up a database of example data and compare new data to the database using a similarity measure in order to find the best match and make a prediction. For this reason, instance-based methods are also called winner-take-all methods and memory-based learning. Focus is put on the representation of the stored instances and the similarity measures used between instances. The most popular instance-based algorithms are k-Nearest Neighbor (kNN) and Learning Vector Quantization (LVQ). Closely related are the regularization algorithms: extensions of regression methods that penalize model complexity, favoring simpler models that tend to generalize better. The most popular regularization algorithms are:

• Ridge Regression
• Least Absolute Shrinkage and Selection Operator (LASSO)
• Elastic Net
• Least-Angle Regression (LARS)

Decision Tree Algorithms. Decision tree methods construct a model of decisions made based on actual values of attributes in the data. Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are trained on data for classification and regression problems. Decision trees are often fast and accurate and a big favorite in machine learning. The most popular decision tree algorithms are listed below, followed by a short MLlib sketch:

• Classification and Regression Tree (CART)

• Iterative Dichotomiser 3 (ID3)
• C4.5 and C5.0 (different versions of a powerful approach)
• Chi-squared Automatic Interaction Detection (CHAID)
• Decision Stump
• M5
• Conditional Decision Trees
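As a small illustration in the spark-shell, a CART-style tree can be trained with spark.mllib's DecisionTree. The sketch below assumes an RDD of LabeledPoint such as the parsedData built for the iris example later in this thesis; the parameter values are illustrative:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint

// parsedData: RDD[LabeledPoint] prepared as in the iris listing of chapter 4 (assumption)
val Array(training, test) = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)

// Train a tree for 3 classes, no categorical features, Gini impurity, depth 5, 32 bins
val model = DecisionTree.trainClassifier(training, 3, Map[Int, Int](), "gini", 5, 32)

// Evaluate accuracy on the held-out test set
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = predictionAndLabel.filter(x => x._1 == x._2).count().toDouble / test.count()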

Bayesian Algorithms. Bayesian methods are those that explicitly apply Bayes' Theorem for problems such as classification and regression. The most popular Bayesian algorithms are:

• Naive Bayes
• Gaussian Naive Bayes
• Multinomial Naive Bayes
• Averaged One-Dependence Estimators (AODE)
• Bayesian Belief Network (BBN)
• Bayesian Network (BN)

Clustering Algorithms. Clustering, like regression, describes both the class of problem and the class of methods. Clustering methods are typically organized by their modelling approach, such as centroid-based and hierarchical. All methods are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality. The most popular clustering algorithms are listed below; a short MLlib sketch follows the list:

• k-Means
• k-Medians
• Expectation Maximisation (EM)
• Hierarchical Clustering
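A minimal spark.mllib sketch of centroid-based clustering follows. It assumes the same iris4.csv file used in the classification listing of chapter 4; the values k = 3 and 20 iterations are illustrative:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Feature vectors only (no labels), read from the first four numeric columns
val features = sc.textFile("iris4.csv").map { line =>
  val parts = line.split(',')
  Vectors.dense(parts(0).toDouble, parts(1).toDouble, parts(2).toDouble, parts(3).toDouble)
}.cache()

// Cluster into k = 3 groups with at most 20 iterations
val model = KMeans.train(features, 3, 20)

// Within Set Sum of Squared Errors: lower means tighter clusters
val wssse = model.computeCost(features)
println(s"Cluster centers: ${model.clusterCenters.mkString(", ")}, WSSSE = $wssse")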

Association Rule Learning Algorithms. Association rule learning methods extract rules that best explain observed relationships between variables in data. These rules can discover important and commercially useful associations in large multidimensional datasets that can be exploited by an organisation. The most popular association rule learning algorithms are:

• Apriori algorithm
• Eclat algorithm

Artificial Neural Network Algorithms. Artificial Neural Networks are models that are inspired by the structure and/or function of biological neural networks. They are a class of pattern-matching methods commonly used for regression and classification problems, but they are really an enormous subfield comprised of hundreds of algorithms and variations for all manner of problem types. Note that I have separated out Deep Learning from neural networks because of the massive growth and popularity in the field. Here we are concerned with the more classical methods. The most popular artificial neural network algorithms are:

• Perceptron
• Back-Propagation
• Hopfield Network
• Radial Basis Function Network (RBFN)

Deep Learning Algorithms. Deep Learning methods are a modern update to Artificial Neural Networks that exploit abundant cheap computation. They are concerned with building much larger and more complex neural networks, and as commented above, many methods are concerned with semi-supervised learning problems where large datasets contain very little labeled data. The most popular deep learning algorithms are:

• Deep Boltzmann Machine (DBM)
• Deep Belief Networks (DBN)
• Convolutional Neural Network (CNN)

In machine learning, a convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex. Individual neurons of the animal cortex are arranged in such a way that they respond to overlapping regions tiling the visual field, which can mathematically be described by a convolution operation. Convolutional networks were inspired by biological processes and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing. They have wide applications in image and video recognition, recommender systems and natural language processing. CNNs have become the method of choice for processing visual and other two-dimensional data. A CNN is composed of one or more convolutional layers with fully connected layers (matching those in typical artificial neural networks) on top. It also uses tied weights and pooling layers. In particular, max-pooling is often used in Fukushima's convolutional architecture. This architecture allows CNNs to take advantage of the 2D structure of input data. In comparison with other deep architectures, convolutional neural networks have shown superior results in both image and speech applications. They can also be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate, making them a highly attractive architecture to use. Examples of applications in Computer Vision include DeepDream [39].

• Stacked Auto-Encoders

Dimensionality Reduction Algorithms. Like clustering methods, dimensionality reduction seeks and exploits the inherent structure in the data, but in this case in an unsupervised manner, in order to summarise or describe data using less information. This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.

• Principal Component Analysis (PCA)
• Principal Component Regression (PCR)
• Partial Least Squares Regression (PLSR)
• Sammon Mapping
• Multidimensional Scaling (MDS)
• Projection Pursuit
• Linear Discriminant Analysis (LDA)
• Mixture Discriminant Analysis (MDA)
• Quadratic Discriminant Analysis (QDA)
• Flexible Discriminant Analysis (FDA)

Ensemble Algorithms. Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction. Much effort is put into what types of weak learners to combine and the ways in which to combine them. This is a very powerful class of techniques and as such is very popular; a short MLlib Random Forest sketch follows the list.

• Boosting
• Bootstrapped Aggregation (Bagging)
• AdaBoost
• Stacked Generalization (blending)
• Gradient Boosting Machines (GBM)
• Gradient Boosted Regression Trees (GBRT)
• Random Forest
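As a short illustration of an ensemble of trees in spark.mllib, the sketch below trains a Random Forest on training/test RDDs of LabeledPoint like those used elsewhere in this thesis; the number of trees and the remaining parameter values are illustrative assumptions:

import org.apache.spark.mllib.tree.RandomForest

// training and test are RDD[LabeledPoint], e.g. produced by randomSplit
// as in the decision tree sketch above (assumption)
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()   // all features treated as continuous
val numTrees = 50
val featureSubsetStrategy = "auto"              // let MLlib pick the features tried per node
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Accuracy of the ensemble on the held-out test set
val accuracy = test.map(p => (model.predict(p.features), p.label))
  .filter(x => x._1 == x._2).count().toDouble / test.count()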

3.1 Classification

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into "spam" or "non-spam" classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance. [48]

3.1.1 Feature selection

Feature selection approaches try to find a subset of the original variables (also called features or attributes). There are three strategies: filter (e.g. information gain), wrapper (e.g. search guided by accuracy), and embedded (features are selected to be added or removed while building the model, based on the prediction errors) approaches. In some cases, data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for three reasons:

1. simplification of models to make them easier to interpret by researchers/users,
2. shorter training times,
3. enhanced generalization by reducing overfitting (formally, reduction of variance).

The central premise when using a feature selection technique is that the data contains many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information. Redundant or irrelevant features are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated. Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points).

Figure 28: The pseudocode for feature selection from the thesaurus [18].
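As a concrete (hedged) illustration, spark.mllib offers a simple filter-style selector based on the chi-squared test. The sketch below assumes an RDD of LabeledPoint named data whose features are discrete (categorical-coded); both the name data and the choice of keeping two features are illustrative:

import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint

// data: RDD[LabeledPoint] with discrete, categorical-coded features (assumption)
val selector = new ChiSqSelector(2)            // keep the 2 most predictive features
val transformer = selector.fit(data)           // chi-squared test of each feature against the label
val filteredData = data.map { lp =>
  LabeledPoint(lp.label, transformer.transform(lp.features))
}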

3.1.2 Dimensionality reduction (PCA)

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, by obtaining a set of "uncorrelated" principal variables. It can be divided into feature selection and feature extraction.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. PCA is sensitive to the relative scaling of the original variables. [45]
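A minimal spark.mllib sketch of PCA follows, assuming an RDD[Vector] of (ideally centered and scaled) observations named features, such as the one built in the k-means sketch earlier; the choice of two principal components is illustrative:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// features: RDD[Vector] of observations (assumption)
val mat = new RowMatrix(features)

// Top 2 principal components, i.e. eigenvectors of the covariance matrix
val pc = mat.computePrincipalComponents(2)

// Project every row onto the 2-dimensional principal subspace
val projected = mat.multiply(pc)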

3.2 Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics and data compression.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties. [38]

3.2.1 Expectation - Maximization (EM)

The EM algorithm is used to find (locally) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. The expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. [41]

The EM algorithm is an efficient iterative procedure to compute the Maximum Likelihood (ML) estimate in the presence of missing or hidden data. In ML estimation, we wish to estimate the model parameter(s) for which the observed data are the most likely. Each iteration of the EM algorithm consists of two processes: the E-step and the M-step. In the expectation, or E-step, the missing data are estimated given the observed data and the current estimate of the model parameters. This is achieved using the conditional expectation, explaining the choice of terminology. In the M-step, the likelihood function is maximized under the assumption that the missing data are known. The estimates of the missing data from the E-step are used in lieu of the actual missing data. Convergence is assured since the algorithm is guaranteed to increase the likelihood at each iteration [5].

The expectation-maximization (EM) method can facilitate maximizing likelihood functions that arise in statistical estimation problems. In the classical EM paradigm, one iteratively maximizes the conditional log-likelihood of a single unobservable complete data space, rather than maximizing the intractable likelihood function for the measured or incomplete data. EM algorithms update all parameters simultaneously, which has two drawbacks: 1) slow convergence, and 2) difficult maximization steps due to coupling when smoothness penalties are used.

In a variety of signal processing applications, direct calculations of maximum-likelihood (ML), maximum a posteriori (MAP), or maximum penalized-likelihood parameter estimates are intractable due to the complexity of the likelihood functions or to the coupling introduced by smoothness penalties or priors. EM algorithms and generalized EM (GEM) algorithms have proven to be useful for iterative parameter estimation in many such contexts. In the classical formulation of an EM algorithm, one supplements the observed measurements, or incomplete data, with a single complete-data space whose relationship to the parameter space facilitates estimation. An EM algorithm iteratively alternates between an E-step, calculating the conditional expectation of the complete-data log-likelihood, and an M-step, simultaneously maximizing that expectation with respect to all of the unknown parameters. EM algorithms are most useful when the M-step is easier than maximizing the original likelihood. The simultaneous update used by a classical EM algorithm necessitates overly informative complete-data spaces, which in turn leads to slow convergence. The authors of [11] show improved convergence rates by updating the parameters sequentially in small groups.

Figure 29: EM algorithm

Some key points for EM: (1) EM is typically used to compute maximum likelihood estimates given incomplete samples. (2) The EM algorithm estimates the parameters of a model iteratively. Applications include discovering the value of latent variables, estimating the parameters of HMMs, estimating parameters of finite mixtures, unsupervised learning of clusters, and filling in missing data in samples.
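In spark.mllib, EM for normal (Gaussian) mixture models is exposed through the GaussianMixture class. The sketch below assumes an RDD[Vector] of continuous observations named points, and the choice of three components is illustrative:

import org.apache.spark.mllib.clustering.GaussianMixture

// points: RDD[Vector] of continuous observations (assumption)
val gmm = new GaussianMixture().setK(3).setMaxIterations(100).run(points)

// Each component is a multivariate normal whose parameters were fitted by EM
gmm.gaussians.zip(gmm.weights).foreach { case (g, w) =>
  println(s"weight=$w mean=${g.mu} covariance=${g.sigma}")
}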

3.2.2 Agglomerative

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: [42]

• Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

• Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Figure 30: Agglomerative algorithm

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached [30].

Figure 31: Agglomerative algorithm pseudocode [30]

Figure 32: Agglomerative algorithm diagram

3.3 Association rule learning

Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. [37]

Association rule mining means finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. Applications include basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Basic Concepts. Given (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in a visit), find all rules that correlate the presence of one set of items with that of another set of items; e.g., 98% of people who purchase tires and auto accessories also get automotive services done [16]

Figure 33: Association rule learning Example [16]
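spark.mllib does not ship Apriori itself, but it provides a parallel FP-growth implementation that solves the same frequent-itemset and association-rule problem. The sketch below uses a tiny illustrative basket data set and illustrative support/confidence thresholds:

import org.apache.spark.mllib.fpm.FPGrowth

// Each transaction is an array of item names (illustrative basket data)
val transactions = sc.parallelize(Seq(
  Array("onions", "potatoes", "burger"),
  Array("onions", "potatoes", "burger", "beer"),
  Array("onions", "milk")))

// Frequent itemsets with support >= 0.5, then rules with confidence >= 0.8
val model = new FPGrowth().setMinSupport(0.5).setNumPartitions(2).run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("{", ",", "}")} : ${itemset.freq}")
}
model.generateAssociationRules(0.8).collect().foreach { rule =>
  println(s"${rule.antecedent.mkString(",")} => ${rule.consequent.mkString(",")} conf=${rule.confidence}")
}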

4 Setup and experimental results

In this section we'll conduct experiments on data sets with the conventional data analytics tools but also with the advanced big data technologies. We'll introduce the Hadoop (big data) file system and how to set up an experiment on a big data platform. Using the methods introduced above, we'll test and time the response of the engine and platform. We'll also analyze and comment on the code written in Scala for the Spark engine in order to conduct the experiments.

4.1 Performance Measurement Methodology

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0.

\[ F_{\text{measure}} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]

In pattern recognition and information retrieval with binary classification, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance.

In information retrieval, a perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved), whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved).

\[ \text{Precision} = \frac{TruePositive}{TruePositive + FalsePositive} \]

\[ \text{Recall} = \frac{TruePositive}{TruePositive + FalseNegative} \]

In a classification task, a precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly), whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C). [44]

\[ \text{Accuracy} = \frac{TruePositive + TrueNegative}{TruePositive + TrueNegative + FalsePositive + FalseNegative} \]

In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection [1] in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity). The ROC curve is thus the sensitivity as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (area under the probability distribution from −∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis. ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making. [47]
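These measures can be computed directly in spark.mllib. The sketch below assumes an RDD[(Double, Double)] of (prediction, label) pairs named predictionAndLabel, like the one produced in the classification listings of this chapter; note that for the ROC area, raw scores rather than hard 0/1 predictions would normally be supplied:

import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}

// predictionAndLabel: RDD[(Double, Double)] of (prediction, true label) (assumption)
val multi = new MulticlassMetrics(predictionAndLabel)
println("Confusion matrix:\n" + multi.confusionMatrix)
println("Precision(class 1) = " + multi.precision(1.0))
println("Recall(class 1)    = " + multi.recall(1.0))
println("F-measure(class 1) = " + multi.fMeasure(1.0))

// For binary problems, the ROC curve and its area are available as well
val binary = new BinaryClassificationMetrics(predictionAndLabel)
println("Area under ROC = " + binary.areaUnderROC())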

4.2 Example using iris data

First we will use the "iris" data set as a sample in order to demonstrate our platforms and present a guide on how to conduct experiments in an efficient way. We expect that all platforms will perform well on such a small data set.

4.2.1 Rapidminer

The first operator retrieves the data from the iris file, defines the type of each attribute and denotes which attribute is the class.

Next we use the "Split Validation" operator to perform a simple validation, i.e. to evaluate our model after we split the data into a training set and a test set. The Split Validation operator is a nested operator with two subprocesses: a training subprocess and a testing subprocess. The training subprocess is used for learning or building a model. The trained model is then applied in the testing subprocess, where the performance of the model is also measured. Inside the operator we can find our model, which in this case is a Naive Bayes classification model. The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. The input port expects an ExampleSet; in our example process it is the output of the Select Attributes operator, but the output of other operators can also be used as input. The Naive Bayes classification model is delivered from the output port and can then be applied on unseen data sets for prediction of the label attribute. The next operator, in the "Testing" area box, is "Apply Model", which applies an already learnt or trained model on an ExampleSet. A model is first trained on an ExampleSet; information related to the ExampleSet is learnt by the model. That model can then be applied on another ExampleSet, usually for prediction. Finally we use the "Performance" operator to perform an evaluation. It delivers a list of performance criteria values. These performance criteria are automatically determined in order to fit the learning task type. The following criteria are added for binominal classification tasks: Accuracy, Precision and Recall. As we see in the result panel, the accuracy is 91.67%. We also have each class's precision and recall available in order to evaluate our model's performance. Although the "setosa" class had 100% precision and recall, this is not usually the case for real world problems, and it depends on the nature of the problem whether to define the result as acceptable or successful.

Figure 34: RapidMiner Studio testing Naive Bayes Performance on iris data.

4.2.2 Spark MLlib (scala)

Spark MLlib (Machine Learning Library) is used through a cli (command line interface), so in order to use it we need a shell of a Unix system, or more specifically a Linux OS. In our experiments we use CentOS, which is the OS of the Hortonworks VM. The Spark engine has two (or more) ways to retrieve the data. We will use the easiest method, which is reading the data directly from the csv file. The file must be saved in HDFS (Hadoop Distributed File System) in order for Spark to read it without any path related "hacks". Later, in other experiments, we will see the second method, where we read the data straight from the Hadoop file system after we import it. The process below is performed in the programming language Scala; it can also be done in Python. First we import the classes that we will use from the MLlib framework. Naive Bayes is the model that we will use to classify the data. We will also need the Vector data structure and LabeledPoint for data manipulation. Now that we have all the classes we need, we can import the data from the csv file into memory. Next we prepare the data format to be compatible with MLlib standards. Then we have to split the data into training and test sets (60% training and 40% test). The next step is the model creation. As we said above, the model is Naive Bayes and we use the training dataset to train it. After the training is completed, it is time to apply it to our test set and gather the results. Measuring the performance can be done with various methods; in this case we use accuracy. The result is 96.08% (rounded), which is pretty close to the previous result that RapidMiner gave us.

Listing 12: MLlib Iris Naive Bayes Classification

// Iris data Naive Bayes classification
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Load data from the csv file into memory
val data = sc.textFile("iris4.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(4).toDouble, Vectors.dense(
    parts(0).toDouble, parts(1).toDouble, parts(2).toDouble, parts(3).toDouble))
}

// Split data into training (60%) and test (40%)
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

// Create the model (naive Bayes) for classification
val model = NaiveBayes.train(training, lambda = 1.0)
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))

// Measure performance based on accuracy
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

Spark shell output: "accuracy: Double = 0.9607843137254902"

4.2.3 WEKA

WEKA has a GUI version, which is most often used for experiments. It can read the data from a database or from a file, which is the usual way to parse the data. We use the iris data for this experiment and as a classifier we pick NaiveBayes. With a 60% split between training and test set, the experiment duration was a few seconds. With an accuracy of 93.3333%, the experiment can be considered successful. In the summary section we can notice several pieces of information that we will skip for now. The Detailed Accuracy By Class section of the results is very helpful for realizing which class has better performance and which needs more work to excel.

Figure 35: WEKA iris test

4.2.4 SciKit - Learn

SciKit-Learn has automated the import procedure for well-known data sets such as iris. So the first thing we need to do is to import the datasets module in order to retrieve iris. Next we load the iris data into memory and it is ready for use. GaussianNB implements the Gaussian Naive Bayes algorithm for classification. We then fit the model on the data and evaluate its predictions. The results are as expected, with no surprise.

Listing 13: SciKit Iris Naive Bayes Classification

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load the iris data and fit a Gaussian Naive Bayes classifier
iris = datasets.load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
print(u"Accuracy = {0:0.2f}%".format(
      float((iris.target == y_pred).sum()) / float(iris.data.shape[0]) * 100))

Output:
Number of mislabeled points out of a total 150 points : 6
Accuracy = 96.00%

4.2.5 H2O Flow

H2O Flow has a web based interface which helps us compose the commands needed to use the framework. We will present only the code that we used and skip the GUI part, as it is unnecessary here. First we import the file; based on the file name extension, H2O recognizes the type, which in this case is ARFF. The next step builds our model based on the Naive Bayes classifier. The framework sets a model id in order to help us identify our built model. With "getModel" and "predict" we are finally ready to make the prediction.

Figure 36: H2O Flow web interface

Listing 14: H2O Iris Naive Bayes Classification

importFiles ["/Users/georgepeppas/Desktop/data/iris.arff"]
setupParse paths: ["/Users/georgepeppas/Desktop/data/iris.arff"]

parseFiles
  paths: ["/Users/georgepeppas/Desktop/data/iris.arff"]
  destination_frame: "iris.hex"
  parse_type: "ARFF"
  separator: 44
  number_columns: 5
  single_quotes: false
  column_names: ["sepallength","sepalwidth","petallength","petalwidth","class"]
  column_types: ["Numeric","Numeric","Numeric","Numeric","Enum"]
  delete_on_done: true
  check_header: -1
  chunk_size: 4194304

buildModel 'naivebayes', {"model_id":"naivebayes-e22ac51c-40ab-4906-978f-57e126f3b101",
  "training_frame":"iris.hex","validation_frame":"iris.hex","response_column":"class",
  "ignored_columns":[],"ignore_const_cols":true,"laplace":0,"min_sdev":0.001,"eps_sdev":0,
  "min_prob":0.001,"eps_prob":0,"compute_metrics":true,"score_each_iteration":false,
  "max_confusion_matrix_size":20,"max_hit_ratio_k":0,"max_runtime_secs":0}

getModel "naivebayes-e22ac51c-40ab-4906-978f-57e126f3b101"
predict model: "naivebayes-e22ac51c-40ab-4906-978f-57e126f3b101"
predict model: "naivebayes-e22ac51c-40ab-4906-978f-57e126f3b101", frame: "iris.hex",
  predictions_frame: "prediction-5305488b-f445-4c62-9507-1274b061766c"

After the code is executed, the result is shown as a matrix in the web interface; the calculated accuracy is equal to 98%.

Figure 37: H2O iris test

4.2.6 Summary

We sum up all the experiments on the iris data in the table below. In order to understand the results better, we also visualize them.

Table 1: Iris data

Platform        Accuracy   Time   Memory
Spark MLlib     96.08%     <1s    15%
Weka            93.33%     <1s    25%
Rapidminer      91.67%     <1s    20%
Scikit - Learn  96.00%     <1s    10%
H2O Flow        98.00%     <1s    35%

[Bar chart: classification accuracy (%) per platform on the iris data — H2O Flow 98.00, Spark MLlib 96.08, SciKit-Learn 96.00, WEKA 93.33, RapidMiner 91.67.]

4.3 Experiments on Big data sets

In this section we'll use only the Spark engine with the MLlib framework, both of which are optimized for big data sets. Conventional tools are not capable of handling the size of these datasets. All algorithms in MLlib are optimized to work in a distributed way, in order to take advantage of the distributed nature of the Hadoop file system and the Spark engine.

4.3.1 Loading the Big data sets

In order to handle big data files we need tools beyond the usual ones. For example, you can't open a csv or text file that is 3 GB with Notepad or similar text editors. In this case we used UNIX commands in order to read and edit rows of data. Our dataset was auto generated with the RapidMiner "Generate Massive Data" operator, which generates huge amounts of data for testing purposes. With 5 attributes and one label we generate 100,000,000 rows, which is 3 GB in size, and we call the file gen.csv. First, we have to add that file into the HDFS (Hadoop) file system. As we can see in figure 38, our file is placed on the Hadoop system.

Figure 38: Add my file to hdfs

Let's see how the file looks on the Ambari web platform (figure 39). Instead of using the UNIX environment, we can also upload the file through the Ambari web platform.

Figure 39: Ambari file system

Ambari gives us the option to change the permissions as we want. In figure 40 we give full permissions; although this is not recommended, it makes our life easier in this experiment.

Figure 40: Ambari permission

In order to run Hive SQL queries on the data we need to create the table where we will then load our data, so we select the Hive interface through the Ambari web platform. As we see in figure 41, we create a table named gen3 with the same attributes that our file has.

Figure 41: Hive Table creation

Our table is now created (figure 42) and we are ready to load the data.

Figure 42: Table gen3

As we see in figure 43, with a simple SQL command we can load the data into our table from the csv file. To make sure that everything has gone right, we select the first 100 rows from table gen3, as shown in figure 44.

Figure 43: Load gen.csv data to gen3 table via hive sql

Figure 44: Select our data from table gen3

Last, we make sure that nothing is lost by counting our rows (remember we had 100,000,000 rows, about 3GB in size) with the simple Hive SQL command shown in figure 45 (an equivalent scripted version from the Spark shell is sketched after the figure).

Figure 45: Count rows of table gen3
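The table creation, data load, and row count shown in figures 41 to 45 can also be scripted from the Spark shell instead of the Ambari Hive view. The sketch below is only indicative: the HDFS path and the column names (att1 to att5 and label) are assumptions and must be adapted to the actual layout of gen.csv.

import org.apache.spark.sql.hive.HiveContext

// Hive-enabled SQL context on top of the existing SparkContext (sc)
val hiveContext = new HiveContext(sc)

// Create the target table (column names and types are assumed here)
hiveContext.sql("""CREATE TABLE IF NOT EXISTS gen3
  (att1 DOUBLE, att2 DOUBLE, att3 DOUBLE, att4 DOUBLE, att5 DOUBLE, label STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'""")

// Load the csv file already placed in HDFS and verify that no rows were lost
hiveContext.sql("LOAD DATA INPATH '/user/admin/gen.csv' INTO TABLE gen3")
hiveContext.sql("SELECT COUNT(*) FROM gen3").show()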

Now everything is in place: our data are available both in the Hive table and in the csv file inside the Hadoop file system. Next we conduct our experiments on these data to see how the system responds to a volume that other platforms would not even be able to load.

4.3.2 SVM Spark MLlib Classification aims to divide items into categories. The most common classification type is binary classification, where there are two categories, usually named positive and negative. If there are more than two categories, it is called multiclass classification. spark.mllib supports two linear methods for classification: linear Support Vector Machines (SVMs) and logistic regression. Linear SVMs support only binary classification, while logistic regression supports both binary and multiclass classification problems. For both methods, spark.mllib supports L1 and L2 regularized variants. The training data set is represented by an RDD of LabeledPoint in MLlib, where labels are class indices starting from zero: 0, 1, 2, ... Note that in the mathematical formulation a binary label y is denoted as either +1 (positive) or -1 (negative), which is convenient for the formulation; however, the negative label is represented by 0 in spark.mllib instead of -1, to be consistent with multiclass labeling. First we open the Spark shell from UNIX with the command "spark-shell" in order to write the code or run our .scala file, by default in the Scala language. In this experiment we create and test an SVM model in order to evaluate the platform's response and performance. The linear SVM is a standard method for large-scale classification tasks.

Listing 15: Linear Support Vector Machines (SVMs)

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Load data from file
val ttdata = sc.textFile("gen.csv")
val header = ttdata.first()
val tdata = ttdata.filter(row => row != header)

// Transform the data to a LabeledPoint RDD; the 'negative'/'positive' text label
// is mapped to a 0/1 label with a small trick on the count of 'e' characters
val data = tdata.map { line =>
  val parts = line.split(';')
  LabeledPoint(
    (parts(5).count(_ == 'e').toDouble - 2) * (-1),
    Vectors.dense(parts(0).toDouble, parts(1).toDouble, parts(2).toDouble,
      parts(3).toDouble, parts(4).toDouble))
}

// Split data into training (60%) and test (40%)
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run the training algorithm to build the model
val numIterations = 2
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold so that raw scores are returned
model.clearThreshold()

// Compute raw scores on the test set
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

Spark and MLlib had no problem processing that amount of data, although training took 18 minutes to finish (a simple way to time such a run directly in the shell is sketched after figure 46). The results shown in figure 46 are perfect, as we expected because of the few, simple features of our auto-generated file.

Figure 46: Spark MLlib SVM results
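The 18 minutes reported above refers to the time the job took to finish. If one wants to record such timings directly in the Spark shell, a minimal sketch (purely illustrative, reusing training and numIterations from Listing 15) is to wrap the training call:

// Simple wall-clock timing around model training
val t0 = System.nanoTime()
val timedModel = SVMWithSGD.train(training, numIterations)
val minutes = (System.nanoTime() - t0) / 1e9 / 60
println(s"Training took $minutes minutes")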

4.3.3 Dimensionality reduction - PCA Spark MLlib Dimensionality reduction is the process of reducing the number of variables under consideration. It can be used to extract latent features from raw and noisy features or to compress data while maintaining its structure. MLlib provides two closely related models for dimensionality reduction: Principal Components Analysis (PCA) and Singular Value Decomposition (SVD). Principal component analysis (PCA) is a statistical method that finds a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. The columns of the rotation matrix are called principal components. PCA is widely used for dimensionality reduction, and MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format. The following code demonstrates how to compute principal components on source vectors and use them to project the vectors into a low-dimensional space:

Listing 16: Dimensionality reduction - PCA

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// val spConfig = (new SparkConf).setMaster("local").setAppName("SparkPCADemo")
// val sc = new SparkContext(spConfig)   // not needed inside spark-shell

// Load and parse the data file; each row becomes a dense vector.
val data = sc.textFile("pca.csv").map { line =>
  val values = line.split(';').map(_.toDouble)
  Vectors.dense(values)
}

// Wrap the RDD of vectors in a distributed row-oriented matrix.
val mat: RowMatrix = new RowMatrix(data)

// Compute the top 4 principal components.
// Principal components are stored in a local dense matrix.
val pc: Matrix = mat.computePrincipalComponents(4)

// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)

Figure 47: Auto generated data used for PCA

Another experiment, this time with facial images, uses the Labeled Faces in the Wild (LFW) dataset. This dataset contains over 13,000 images of faces, generally taken from the Internet and belonging to well-known public figures; the faces are labeled with the person's name.

Listing 17: Dimensionality reduction - PCA images

val path = "lfw/*"
val rdd = sc.wholeTextFiles(path)
val first = rdd.first

// wholeTextFiles returns an RDD of key-value pairs, where the key is the file
// location and the value is the content of the entire text file.

// We use custom code to read the images, so we do not need the "file:" part of
// the path; we remove it with the following map function:
val files = rdd.map { case (fileName, content) => fileName.replace("file:", "") }

// Next, we see how many files we are dealing with:
println(files.count)
// we have 13000 images to work with

// We convert the images to grayscale in order to represent them as a plain
// two-dimensional matrix. The built-in Java Abstract Window Toolkit (AWT)
// contains various basic image-processing functions.
import java.awt.image.BufferedImage
def loadImageFromFile(path: String): BufferedImage = {
  import javax.imageio.ImageIO
  import java.io.File
  ImageIO.read(new File(path))
}

// Loading the first image into our Spark shell
val aePath = "lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg"
val aeImage = loadImageFromFile(aePath)

// Our raw 250 x 250 images represent 187,500 data points per image using three
// color components, which for 13000 images amounts to roughly 2.4 billion data
// points. If we convert to grayscale and resize the images to, say, 50 x 50
// pixels, we only require 2500 data points per image. MLlib's PCA model works
// best on tall-and-skinny matrices with fewer than 10,000 columns; with 2500
// columns (one per pixel) we are well below this restriction.
def processImage(image: BufferedImage, width: Int, height: Int): BufferedImage = {
  val bwImage = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY)
  val g = bwImage.getGraphics()
  g.drawImage(image, 0, 0, width, height, null)
  g.dispose()
  bwImage
}

// Test this out on our sample image: convert to grayscale and resize to 100 x 100
val grayImage = processImage(aeImage, 100, 100)

// Save the processed image to a temporary location so that we can read it back
import javax.imageio.ImageIO
import java.io.File
ImageIO.write(grayImage, "jpg", new File("/tmp/pca/aeGray.jpg"))

// Extract the actual feature vectors that will be the input to our
// dimensionality reduction model
def getPixelsFromImage(image: BufferedImage): Array[Double] = {
  val width = image.getWidth
  val height = image.getHeight
  val pixels = Array.ofDim[Double](width * height)
  image.getData.getPixels(0, 0, width, height, pixels)
}

// Combine these three functions into one utility function that takes a file
// location together with the desired width and height and returns the raw
// Array[Double] containing the pixel data
def extractPixels(path: String, width: Int, height: Int): Array[Double] = {
  val raw = loadImageFromFile(path)
  val processed = processImage(raw, width, height)
  getPixelsFromImage(processed)
}

// Applying this function to each element of the RDD of image file paths gives us
// a new RDD that contains the pixel data for each image
val pixels = files.map(f => extractPixels(f, 50, 50))
println(pixels.take(10).map(_.take(10).mkString("", ",", ", ...")).mkString("\n"))

// Create an MLlib Vector instance for each image and cache the RDD to speed up
// later computations
import org.apache.spark.mllib.linalg.Vectors
val vectors = pixels.map(p => Vectors.dense(p))
vectors.setName("image-vectors")
vectors.cache

// Standardize the input data (subtract the column means)
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.feature.StandardScaler
val scaler = new StandardScaler(withMean = true, withStd = false).fit(vectors)

// Use the returned scaler to transform the raw image vectors to vectors with
// the column means subtracted
val scaledVectors = vectors.map(v => scaler.transform(v))

// Training a dimensionality reduction model: dimensionality reduction models in
// MLlib require vectors as inputs, but unlike clustering, which operated on an
// RDD[Vector], PCA and SVD computations are provided as methods on a
// distributed RowMatrix.

// Now that we have extracted our image pixel data into vectors, we instantiate
// a new RowMatrix and call computePrincipalComponents to compute the top K
// principal components of our distributed matrix
val matrix = new RowMatrix(scaledVectors)
val K = 10
val pc = matrix.computePrincipalComponents(K)

// Dimensions of the resulting matrix: the matrix of principal components has
// 2500 rows and 10 columns, i.e. the top 10 principal components
val rows = pc.numRows
val cols = pc.numCols
println(rows, cols)
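As in the first PCA example, the scaled image vectors can then be projected onto these components. The short continuation below is a sketch that reuses matrix and pc from the listing above:

// Project each 2500-dimensional image vector into the 10-dimensional space
// spanned by the top principal components
val projectedImages = matrix.multiply(pc)
println(projectedImages.numRows, projectedImages.numCols)
// expected: (number of images, 10)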

4.3.4 Expectation-Maximization Spark MLlib A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.mllib implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples. In the following example, after loading and parsing the data, we use a GaussianMixture object to cluster the data into two clusters; the number of desired clusters is passed to the algorithm. We then output the parameters of the mixture model.

Listing 18: Expectation-Maximization

import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("gen.csv")
val parsedData = data.map(s => Vectors.dense(s.trim.split(';').map(_.toDouble))).cache()

// Cluster the data into two classes using GaussianMixture
val gmm = new GaussianMixture().setK(2).run(parsedData)

// Save and load the model
gmm.save(sc, "target/org/apache/spark/GaussianMixtureExample/GaussianMixtureModel")
val sameModel = GaussianMixtureModel.load(sc,
  "target/org/apache/spark/GaussianMixtureExample/GaussianMixtureModel")

// Output the parameters of the maximum-likelihood model
for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}

4.3.5 Naive Bayes TF-IDF Spark MLlib In this experiment we use feature extraction techniques such as the term frequency-inverse document frequency (TF-IDF) term weighting scheme and feature hashing. TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus; it is often used as a weighting factor in information retrieval and text mining, weighting each term in a piece of text. Feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array, and it is used to deal with high-dimensional data. Next we extract the TF-IDF features from the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups, which can be found at http://qwone.com/~jason/20Newsgroups/. This dataset splits the available data into training and test sets comprising 60 percent and 40 percent of the documents respectively.

Listing 19: Text processing and Naive Bayes classification

// Read files from the Hadoop file system; Spark scans the directory structure.
val path = "20news-bydate-train/*"
val rdd = sc.wholeTextFiles(path)
val text = rdd.map { case (file, text) => text }
// Total input paths to process (should be 11314 for our data)
println(text.count)

// Newsgroup topics
val newsgroups = rdd.map { case (file, text) => file.split("/").takeRight(2).head }
val countByGroup = newsgroups.map(n => (n, 1)).reduceByKey(_ + _)
  .collect.sortBy(-_._2).mkString("\n")
println(countByGroup)

// Apply whitespace tokenization, converting each token to lowercase
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
println(whiteSpaceSplit.distinct.count)
// number of unique tokens: 402978

// More tokenization: split each raw document on non-word characters
// using a regular expression pattern
val nonWordSplit = text.flatMap(t => t.split("""\W+""").map(_.toLowerCase))
println(nonWordSplit.distinct.count)
// reduces the number of unique tokens to 130126

// Filter out numbers and tokens that are words mixed with numbers
val regex = """[^0-9]*""".r
val filterNumbers = nonWordSplit.filter(token => regex.pattern.matcher(token).matches)
println(filterNumbers.distinct.count)
// now we have 84912 tokens

// Remove stop words
val tokenCounts = filterNumbers.map(t => (t, 1)).reduceByKey(_ + _)
val oreringDesc = Ordering.by[(String, Int), Int](_._2)
println(tokenCounts.top(20)(oreringDesc).mkString("\n"))  // print top tokens

// Create a set of stop words
val stopwords = Set("the", "a", "an", "of", "or", "in", "for", "by", "on", "but",
  "is", "not", "with", "as", "was", "if", "they", "are", "this", "and", "it",
  "have", "from", "at", "my", "be", "that", "to")
val tokenCountsFilteredStopwords = tokenCounts.filter { case (k, v) => !stopwords.contains(k) }
println(tokenCountsFilteredStopwords.top(20)(oreringDesc).mkString("\n"))

// Remove any tokens that are only one character in length
val tokenCountsFilteredSize = tokenCountsFilteredStopwords.filter { case (k, v) => k.size >= 2 }
println(tokenCountsFilteredSize.top(20)(oreringDesc).mkString("\n"))

// Exclude terms based on frequency
val oreringAsc = Ordering.by[(String, Int), Int](-_._2)
println(tokenCountsFilteredSize.top(20)(oreringAsc).mkString("\n"))

// Exclude tokens that only occur once
val rareTokens = tokenCounts.filter { case (k, v) => v < 2 }.map { case (k, v) => k }.collect.toSet
val tokenCountsFilteredAll = tokenCountsFilteredSize.filter { case (k, v) => !rareTokens.contains(k) }
println(tokenCountsFilteredAll.top(20)(oreringAsc).mkString("\n"))

// Counting the number of unique tokens shows that only 51801 are left
println(tokenCountsFilteredAll.count)
// we have reduced the feature dimension from 402,978 to 51,801

// Combine all our filtering logic into one function, which we can apply
// to each document in our RDD
def tokenize(line: String): Seq[String] = {
  line.split("""\W+""")
    .map(_.toLowerCase)
    .filter(token => regex.pattern.matcher(token).matches)
    .filterNot(token => stopwords.contains(token))
    .filterNot(token => rareTokens.contains(token))
    .filter(token => token.size >= 2)
    .toSeq
}

// Check this function (should output the same 51801 unique token count)
println(text.flatMap(doc => tokenize(doc)).distinct.count)

// Tokenize each document in our RDD and print the first part
val tokens = text.map(doc => tokenize(doc))
println(tokens.first.take(20))

// Training a TF-IDF model: MLlib transforms each document, in the form of
// processed tokens, into a vector representation.
import org.apache.spark.mllib.linalg.{SparseVector => SV}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF
// The default feature dimension is 2^20 (around 1 million); we choose 2^18
// (around 260,000), since with about 50,000 tokens we should not experience a
// significant number of hash collisions.
val dim = math.pow(2, 18).toInt
// The transform function of HashingTF maps each input document (a sequence of
// tokens) to an MLlib Vector.
val hashingTF = new HashingTF(dim)
val tf = hashingTF.transform(tokens)
tf.cache

// Inspect the first element to see if everything is ok
val v = tf.first.asInstanceOf[SV]
println(v.size)
println(v.values.size)
println(v.values.take(10).toSeq)
println(v.indices.take(10).toSeq)
// The dimension of each sparse vector of term frequencies is 262,144, the
// number of non-zero entries is only 706, and the last two lines of output
// show the frequency counts and vector indices for the first few entries.

// Compute the inverse document frequency for each term in the corpus
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
val v2 = tfidf.first.asInstanceOf[SV]
println(v2.values.size)
println(v2.values.take(10).toSeq)
println(v2.indices.take(10).toSeq)
// still 706 non-zero entries; the new values are the frequencies weighted by the IDF

// Analyzing the TF-IDF weightings: compute the minimum and maximum TF-IDF
// weights across the entire corpus
val minMaxVals = tfidf.map { v =>
  val sv = v.asInstanceOf[SV]
  (sv.values.min, sv.values.max)
}
val globalMinMax = minMaxVals.reduce { case ((min1, max1), (min2, max2)) =>
  (math.min(min1, min2), math.max(max1, max2))
}
println(globalMinMax)
// minimum TF-IDF is zero, while the maximum is 66155.39470409753

// Compute the TF-IDF representation for a few of the terms that appear in the
// list of top occurrences that we previously computed, such as you, do, and we:
val common = sc.parallelize(Seq(Seq("you", "do", "we")))
val tfCommon = hashingTF.transform(common)
val tfidfCommon = idf.transform(tfCommon)
val commonVector = tfidfCommon.first.asInstanceOf[SV]
println(commonVector.values.toSeq)

// Apply the same transformation to a few less common terms
val uncommon = sc.parallelize(Seq(Seq("telescope", "legislation", "investment")))
val tfUncommon = hashingTF.transform(uncommon)
val tfidfUncommon = idf.transform(tfUncommon)
val uncommonVector = tfidfUncommon.first.asInstanceOf[SV]
println(uncommonVector.values.toSeq)

// Train a classifier using our TF-IDF transformed vectors as input.
// We use the naive Bayes model in MLlib, which supports multiple classes.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Assign a numeric index to each class
val newsgroupsMap = newsgroups.distinct.collect().zipWithIndex.toMap
val zipped = newsgroups.zip(tfidf)
val train = zipped.map { case (topic, vector) => LabeledPoint(newsgroupsMap(topic), vector) }
train.cache
// the label is the class index and the features are the TF-IDF vector

// Pass the RDD to the naive Bayes train function
val model = NaiveBayes.train(train, lambda = 0.1)

// Evaluate the performance of the model on the test dataset
val testPath = "20news-bydate-test/*"
val testRDD = sc.wholeTextFiles(testPath)
val testLabels = testRDD.map { case (file, text) =>
  val topic = file.split("/").takeRight(2).head
  newsgroupsMap(topic)
}

// Transforming the text in the test dataset follows the same procedure as for
// the training data; zip the test class labels with the TF-IDF vectors and
// create our test RDD[LabeledPoint]
val testTf = testRDD.map { case (file, text) => hashingTF.transform(tokenize(text)) }
val testTfIdf = idf.transform(testTf)
val zippedTest = testLabels.zip(testTfIdf)
val test = zippedTest.map { case (topic, vector) => LabeledPoint(topic, vector) }

// Compute the predictions and true class labels; we use this RDD to compute
// accuracy and the multiclass weighted F-measure for our model
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
val metrics = new MulticlassMetrics(predictionAndLabel)
println(accuracy)
println(metrics.weightedFMeasure)

// The simple multiclass naive Bayes model achieves close to 80 percent for
// both accuracy and F-measure:
// 0.7915560276155071
// 0.7810675969031116

The Naive Bayes accuracy equals 0.7915560276155071 and the F-measure score equals 0.7810675969031116.

4.3.6 Hierarchical clustering Spark MLlib Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering. Bisecting k-means is a kind of hierarchical clustering. Hierarchical clustering is one of the most commonly used methods of cluster analysis, which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types. Agglomerative: a "bottom up" approach, where each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. Divisive: a "top down" approach, where all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.

Listing 20: Hierarchical clustering

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Load and parse the data
def parse(line: String): Vector = Vectors.dense(line.split(";").map(_.toDouble))
val data = sc.textFile("gen.csv").map(parse).cache()

// Cluster the data into 6 clusters with BisectingKMeans
val bkm = new BisectingKMeans().setK(6)
val model = bkm.run(data)

// Show the compute cost and the cluster centers
println(s"Compute Cost: ${model.computeCost(data)}")
model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
  println(s"Cluster Center ${idx}: ${center}")
}

4.3.7 K-means Spark MLlib K-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters. We use the KMeans object to cluster the data into two clusters, with the number of desired clusters passed to the algorithm. We then compute the Within Set Sum of Squared Errors (WSSSE). You can reduce this error measure by increasing k; in fact, the optimal k is usually the one where there is an "elbow" in the WSSSE graph (a sketch of such an elbow search follows the listing below).

Listing 21: K-Means

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("gen.csv")
val parsedData = data.map(s => Vectors.dense(s.split(';').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing the Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

// Save and load the model
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")

Within Set Sum of Squared Errors = 0.11999999999994547
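As noted above, a good value of k usually shows up as an "elbow" in the WSSSE curve. The sketch below, which reuses parsedData and numIterations from Listing 21 (the candidate range of k is an arbitrary choice for illustration), simply prints the WSSSE for several values of k so the elbow can be spotted:

// Compute WSSSE for a range of candidate k values; the k at which the
// decrease flattens out (the "elbow") is a reasonable choice
for (k <- 2 to 10) {
  val model = KMeans.train(parsedData, k, numIterations)
  println(s"k = $k, WSSSE = ${model.computeCost(parsedData)}")
}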

4.3.8 Association Rules AssociationRules implements a parallel rule generation algorithm for constructing rules that have a single item as the consequent. Confidence is an indication of how often the rule has been found to be true, while support is an indication of how frequently the item set appears in the database. Many algorithms for generating association rules have been proposed over time; some well-known algorithms are Apriori, Eclat and FP-Growth.

Listing 22: Association Rules

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val freqItemsets = sc.parallelize(Seq(
  new FreqItemset(Array("a"), 150000L),
  new FreqItemset(Array("b"), 350000L),
  new FreqItemset(Array("a", "b"), 120000L)
))

val ar = new AssociationRules().setMinConfidence(0.8)
val results = ar.run(freqItemsets)

results.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" +
    rule.consequent.mkString(",") + "]," + rule.confidence)
}

The AssociationRules class generates association rules from an RDD[FreqItemset[Item]].

Figure 48: Association Rules Results

Based on the results, we can say that 80% of the time a customer buys product "a", product "b" is bought as well; this matches the confidence of the rule a => b, which is 120,000 / 150,000 = 0.8.

4.3.9 Data and Results Below we summarize the data and algorithms used in the previous experiments, together with the results and time performance. In order to understand the performance of the Spark engine we need to focus on the computation time of each experiment and combine it with the complexity and size of each dataset.

As we notice in table 2, Spark with the MLlib framework can handle files whose size can fairly be called big data. Although the experiments were conducted on a single machine, the engine worked seamlessly. The times shown in the table, however, are not good enough, and we gain no significant advantage over other frameworks if we do not use a cluster of distributed machines and nodes. All the big data technologies that we introduced are optimized to run in a distributed environment, and their architecture was designed with that in mind. In order to take advantage of Spark's potential we need to repeat the experiments in a distributed environment, where all files and MLlib algorithms will run in parallel using all the available CPU power and other hardware resources that are limited on a single node.

Table 2: Algorithms, Data and Results

Algorithm in MLlib                           Data                                                                              Time     Results
SVM                                          Auto generated (5 binary attributes, 100,000,000 rows, 3.2GB, gen.csv)            4h35m    100% ROC
PCA                                          Auto generated (1000 attributes, 100,000 rows, pca.csv)                           25m      Top 4 principal components computed
PCA                                          Labeled Faces in the Wild (LFW), 13,000 face images labeled with the person's name  2h32m  Top 10 principal components computed
Naive Bayes (TF-IDF)                         20,000 newsgroup text documents, partitioned evenly across 20 newsgroups          19m22s   Accuracy 0.791556, F1 0.781067
Expectation-Maximization                     Auto generated (5 binary attributes, 100,000,000 rows, 3.2GB, gen.csv)            3h52m    Data clustered into two classes
Hierarchical clustering (Bisecting K-means)  Auto generated (5 binary attributes, 100,000,000 rows, 3.2GB, gen.csv)            6h42m    Data clustered into 6 clusters
K-means                                      Artificially generated (3 double attributes, 20,000,000 rows, 650MB, kmeans.csv)  27m35s   2 clusters, 20 iterations, WSSSE = 0.119999
Association Rules                            Auto generated transactional data of artificial products "a" and "b", 620,000 rows  12m24s  80% confidence

5 Conclusions and future work

The performance of a platform can be assessed using many variables and measurement methodologies. Instead of choosing one platform for any problem, we should choose the right platform for the pre-specified problem. Memory and CPU usage may vary from case to case, although some facts are stable. For example, the programming language of the solution implementation (it usually depends on the platform, but this is not strict, since some platforms let you choose from a set of languages) can be crucial. Python tends to be slower than Java (or Scala in our case, both of which use the optimized JVM) if the implementation is similar and based on the same algorithm. On a given platform this can easily be compared by implementing the same solution in different languages, but when the platform changes we cannot easily conclude whether decreased performance comes from the platform, the language runtime, the framework's implementation of the algorithm, or even the operating system we use; it can be any combination of these parameters. It is neither productive nor a good idea to test every platform on every data set, and this matters even more in the real world, where time (time to market in some cases) is more valuable than anything due to high competitiveness. In the future we need to test big data platforms in a multi-node cluster, where the performance can be significantly higher. These platforms are designed for many nodes working in a parallel architecture, and only then can we exploit their full potential. Platforms that are not optimized for big data usually work on a single node, and that can impose limitations on computation time and data storage.

Big data platforms used to store and manage the data (Hortonworks and Cloudera) can be very useful and are currently a good option for analyzing big data. Choosing the right management platform and combining it with the right big data analysis framework or platform for every specific problem can be difficult and time consuming, but it is crucial for the final result and the performance of the system. If the data can fit into the system memory, then clusters are usually not required and the entire dataset can be processed on a single machine; platforms such as GPUs and multicore CPUs can be used to speed up the data processing in this case. If the data does not fit into the system memory, then one has to look at cluster options such as Hadoop or Spark. If one needs to process a large amount of data and does not have strict constraints on processing time, then one can look into systems which can scale out to process huge amounts of data, such as Hadoop or peer-to-peer networks; these platforms can handle large-scale data but usually take more time to deliver the results. On the other hand, if one needs to optimize the system for speed rather than the size of the data, then one should consider systems which are more capable of real-time processing, such as GPUs or FPGAs. GPUs are preferable over other platforms when the user has strict real-time constraints and the data fit into the system memory, while HPC clusters can handle more data than GPUs but are less suitable for real-time processing. Another direction is to investigate the possibility of combining multiple platforms to solve a particular application problem; for example, attempts to merge horizontal scaling platforms such as Hadoop with vertical scaling platforms such as GPUs are also gaining some recent attention [27]. Looking into the future, we can say that the development of highly sophisticated platforms has just started and there is a long way of growth and R&D ahead. The complexity of these platforms is currently a big limitation for the average user adopting their architectures. Fragmented environments with complex settings and limited knowledge may push users towards solutions from the biggest companies, where they feel safer about the future, although open source communities have shown in the past that they can meet such challenges. Creating standards for cooperation and communication between platforms and frameworks can work against fragmentation and in favor of performance. Big data are here to stay, and big data analysis is one of the greatest challenges of this century, with many benefits along the way. Many aspects of society and areas such as the Internet of Things, public administration, finance, security and many more are gathering data ready for analysis in order to extract useful information, correlations or predictions and to prevent or control situations based on our interests.

References

[1] Shahriar Akter and Samuel Fosso Wamba. Big data analytics in e-commerce: a systematic review and agenda for future research. Electronic Markets, 26(2):173–194, 2016.
[2] Amit. How big data analytics, AI and machine learning is being leveraged across fintech.
[3] Apache. ORC is an Apache project, 2016.
[4] Ron Bekkerman, Mikhail Bilenko, and John Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
[5] Sean Borman. The expectation maximization algorithm: a short tutorial. 2004.
[6] Dhruba Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 11(2007):21, 2007.
[7] Jason Brownlee. A tour of machine learning algorithms, 2013.
[8] Min Chen, Shiwen Mao, and Yunhao Liu. Big data: A survey. Mobile Networks and Applications, 19(2):171–209, 2014.
[9] Andrea De Mauro, Marco Greco, and Michele Grimaldi. A formal definition of big data based on its essential features. Library Review, 65(3):122–135, 2016.
[10] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[11] Jeffrey A Fessler and Alfred O Hero. Space-alternating generalized expectation-maximization algorithm. IEEE Transactions on Signal Processing, 42(10):2664–2677, 1994.
[12] John Gantz and David Reinsel. Extracting value from chaos. IDC iView, 1142:1–12, 2011.
[13] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[14] Steve Hoffman. Apache Flume: Distributed Log Collection for Hadoop. Packt Publishing Ltd, 2013.
[15] Mohammad Kamrul Islam and Aravind Srinivasan. Apache Oozie: The Workflow Scheduler for Hadoop. O'Reilly Media, Inc., 2015.
[16] Hillol Kargupta. Association rule learning.
[17] Hirak Kashyap, Hasin Afzal Ahmed, Nazrul Hoque, Swarup Roy, and Dhruba Kumar Bhattacharyya. Big data analytics in bioinformatics: A machine learning perspective. arXiv preprint arXiv:1506.05101, 2015.
[18] Nooshin Maghsoodi and Mohammad Mehdi Homayounpour. Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection. Journal of the American Society for Information Science and Technology, 62(10):2055–2066, 2011.
[19] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. MLlib: Machine learning in Apache Spark. arXiv preprint arXiv:1505.06807, 2015.
[20] Tom M Mitchell. Machine learning and data mining. Communications of the ACM, 42(11):30–36, 1999.
[21] Soumendra Mohanty, Madhu Jagadeesh, and Harsha Srivatsa. Big Data Imperatives: Enterprise Big Data Warehouse, BI Implementations and Analytics. Apress, 2013.
[22] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110. ACM, 2008.
[23] Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. 2012.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[25] Ali Rebaie. Fighting fraud.
[26] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10. IEEE, 2010.
[27] Dilpreet Singh and Chandan K Reddy. A survey on platforms for big data analytics. Journal of Big Data, 2(1):1, 2014.
[28] Target Rich Solutions. Big data: Predictive analytics.
[29] ITworld staff. Spotlight on big data.
[30] Stanford. Hierarchical agglomerative clustering.
[31] Yufei Tao, Wenqing Lin, and Xiaokui Xiao. Minimal MapReduce algorithms. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 529–540. ACM, 2013.
[32] Ronald C Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.
[33] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.
[34] Kathleen Ting and Jarek Jarcec Cecho. Apache Sqoop Cookbook. O'Reilly Media, Inc., 2013.
[35] Jeffrey D Ullman. Designing good MapReduce algorithms. XRDS: Crossroads, The ACM Magazine for Students, 19(1):30–34, 2012.
[36] Sameer Wadkar and Madhu Siddalingaiah. In Pro Apache Hadoop, pages 399–401. Springer, 2014.
[37] Wikipedia. Association rule learning — Wikipedia, the free encyclopedia, 2016. [Online; accessed 2-April-2016].
[38] Wikipedia. Cluster analysis — Wikipedia, the free encyclopedia, 2016. [Online; accessed 31-March-2016].
[39] Wikipedia. Convolutional neural network — Wikipedia, the free encyclopedia, 2016. [Online; accessed 1-October-2016].
[40] Wikipedia. Data mining — Wikipedia, the free encyclopedia, 2016. [Online; accessed 27-March-2016].
[41] Wikipedia. Expectation-maximization algorithm — Wikipedia, the free encyclopedia, 2016. [Online; accessed 21-March-2016].
[42] Wikipedia. Hierarchical clustering — Wikipedia, the free encyclopedia, 2016. [Online; accessed 2-April-2016].
[43] Wikipedia. Machine learning — Wikipedia, the free encyclopedia, 2016. [Online; accessed 23-March-2016].
[44] Wikipedia. Precision and recall — Wikipedia, the free encyclopedia, 2016. [Online; accessed 22-March-2016].
[45] Wikipedia. Principal component analysis — Wikipedia, the free encyclopedia, 2016. [Online; accessed 31-March-2016].
[46] Wikipedia. RapidMiner — Wikipedia, the free encyclopedia, 2016. [Online; accessed 2-April-2016].
[47] Wikipedia. Receiver operating characteristic — Wikipedia, the free encyclopedia, 2016. [Online; accessed 17-September-2016].
[48] Wikipedia. Statistical classification — Wikipedia, the free encyclopedia, 2016. [Online; accessed 31-March-2016].
[49] Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J McLachlan, Angus Ng, Bing Liu, S Yu Philip, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.
[50] Feng Xia, Laurence T Yang, Lizhe Wang, and Alexey Vinel. Internet of things. International Journal of Communication Systems, 25(9):1101, 2012.