Hive, Spark, Presto for Interactive Queries on Big Data
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

NIKITA GUREEV

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

TRITA-EECS-EX-2018:468
www.kth.se

Abstract

Traditional relational database systems cannot be used efficiently to analyze data with large volume and varied formats, i.e. big data. Apache Hadoop is one of the first open-source tools that provides a distributed data storage system and resource manager. The space of big data processing has been growing fast over the past years, and many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data. Some of the early tools have become widely adopted, with Apache Hive being one of them. However, with the recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark and Presto.

In this thesis these technologies are examined and benchmarked in order to determine their performance for the task of interactive business intelligence queries. The benchmark is representative of interactive business intelligence queries and uses a star-shaped schema. The performance of Hive Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data at different data volumes and levels of concurrency. A short analysis and conclusions are presented, with reasoning about the choice of framework and data format for a system that would run interactive queries on big data.

Keywords: Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL, Presto, Big Data

Abstract (translated from Swedish)

Traditional relational database systems cannot be used efficiently to analyze large data volumes and file formats, such as big data. Apache Hadoop is one of the first open-source tools to provide a distributed data storage and resource management system. The field of big data processing has grown quickly in recent years, and many technologies have been introduced in the big data ecosystem to handle the problem of processing large data volumes. Some of the early tools have become commonplace, Apache Hive being one of them. With new advances in the field, there are now tools that are better suited to interactive analysis of big data, such as Apache Spark and Presto.

In this thesis these technologies are analyzed with benchmarks to determine their performance for the task of interactive business intelligence queries. These benchmarks are representative of interactive business intelligence queries and use star-shaped schemas. Performance is examined for Hive Tez, Hive LLAP, Spark SQL, and Presto with text, ORC, and Parquet data for different volumes and degrees of parallelism. A short analysis and summary is presented, with reasoning about the choice of framework and data format for a system that executes interactive queries on big data.

Keywords: Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL, Presto, Big Data

Contents

1 Introduction
  1.1 Problem
  1.2 Purpose
  1.3 Goals
  1.4 Benefits, Ethics and Sustainability
  1.5 Methods
  1.6 Outline
2 Big Data
  2.1 Hadoop
  2.2 Hadoop Distributed File System
  2.3 YARN
3 SQL-on-Hadoop
  3.1 Hive
  3.2 Presto
  3.3 Spark
  3.4 File Formats
4 Experiments
  4.1 Data
  4.2 Experiment Setup
  4.3 Performance Tuning
5 Results
  5.1 Single User Execution
  5.2 File Format Comparison
  5.3 Concurrent Execution
6 Conclusions
  6.1 Single User Execution
  6.2 File Format Comparison
  6.3 Concurrent Execution
  6.4 Future Work

1 Introduction

The space of big data processing has been growing fast over the past years [1].
Companies are making analytics of big data a priority, which means that interactive querying of the collected data becomes an important part of decision making. With growing data volume the process of analytics becomes less interactive, as it takes a long time to process the data before the business receives insights. Recent advances in big data processing allow interactive queries, as opposed to only long-running data processing jobs, to be performed on big data. Interactive queries are low-latency, sometimes ad hoc queries that analysts can run over the data to gain valuable insights. The most important feature in this case is a fast response from the data processing tool, which shortens the feedback loop and makes data exploration more interactive for the analyst.

Many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data, and some of the early tools have become widely adopted [2], with Apache Hive¹ being one of them. However, with recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark² and Presto³. In this thesis Hive, Spark, and Presto are examined and benchmarked in order to determine their relative performance for the task of interactive queries.

Several works were taken into account during the writing of this thesis. Similar work was performed by atScale in 2016 [3], which claims to be the first work on the topic of big data analytics. The report is done well, but its main issue is that, at the current pace of development of these technologies, results from several years ago can become outdated and less relevant when deciding which data processing framework to use. Another work in a similar vein is SQL Engines for Big Data Analytics [4], but the main focus of that work is the domain of bioinformatics, which lessens its relevance for business intelligence.
The work was also done in 2015, making it even older than the atScale report. Performance Comparison of Hive Impala and Spark SQL [5] from 2015 was also considered, but has its drawbacks. Several other works served as references in choosing the method and setting up the benchmark, including Sparkbench [6], BigBench [7], and Making Sense of Performance in Data Analytics Frameworks [8].

¹ Apache Hive - https://hive.apache.org/
² Apache Spark - https://spark.apache.org/
³ Presto - https://prestodb.io/

1.1 Problem

How is the performance of interactive business intelligence queries affected by using Hive, Spark, or Presto with varying data volume, file format, and number of concurrent users?

1.2 Purpose

The purpose of this thesis is to assess the possible performance impact of switching from Hive to Spark or Presto for interactive queries. Using the latest versions of the frameworks makes the work more relevant, as all three frameworks are undergoing rapid development. Given the focus on interactive queries, several aspects of the experiments are changed from previous works, including the choice of benchmark, the experimental environment, and the file format.

1.3 Goals

The main goal of this thesis is to produce an assessment of Hive, Spark, and Presto for interactive queries on big data with varying data volume, data format, and number of concurrent users. The results are used to motivate a suggested choice of framework for interactive queries when an existing system is reworked or a new system is planned.

1.4 Benefits, Ethics and Sustainability

The main benefit of this thesis is a fair, independently conducted comparison of several big data processing frameworks in terms of interactive queries. This will help with the choice of tools when implementing a system for running analytical queries with constraints on responsiveness and speed, on hardware and data corresponding to the setup in this work.
As this thesis uses recent, state-of-the-art versions of the frameworks in question, it includes all of the improvements that were absent from previous similar works, while ensuring that no framework operates under suboptimal conditions and that no framework is given special treatment or tuning.

1.5 Methods

An empirical method is used, as analytical methods cannot be efficiently applied to the presented problem within the resource and time constraints [9]. The results will be collected by generating data of different volumes, implementing an interactive query suite, tuning the performance of the frameworks, and running the query suite on the data. This follows the approach established by the most relevant previous works [3], [4], [5], with changes made in line with the focus of this thesis.

1.6 Outline

In the Big Data section the big data ecosystem is described, with emphasis on Hadoop and YARN. In the SQL-on-Hadoop section the data processing frameworks are presented: first Hive, then Presto, then Spark. The ORC and Parquet file formats are also briefly described. In the Experiments section the benchmark and experimental setup are described. In the Results section all of the experimental results are outlined and briefly described. In the Conclusions section the results are summarized and conclusions are drawn, with future work outlined.

2 Big Data

This thesis project is focused on comparing the performance of several big data frameworks in the domain of interactive business intelligence queries. Initially, work in the big data space focused on long-running jobs, but with the advance of big data processing tools it is becoming more common for companies to be able to execute interactive queries over aggregated data. In this section the big data ecosystem is described, together with a common Hadoop setup.

2.1 Hadoop

Apache Hadoop is a data processing framework targeted at distributed processing of large volumes of data on one or more clusters of nodes running on commodity hardware.