DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2018

Hive, Spark, Presto for Interactive Queries on Big Data

NIKITA GUREEV

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE TRITA-EECS-EX-2018:468

Abstract

Traditional relational database systems cannot be used efficiently to analyze data of large volume and varied formats, i.e. big data. Apache Hadoop is one of the first open-source tools that provides a distributed data storage system and resource manager. The space of big data processing has been growing fast over the past years, and many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data; some of the early tools have become widely adopted, with Apache Hive being one of them. However, with the recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark and Presto. In this thesis these technologies are examined and benchmarked in order to determine their performance for the task of interactive business intelligence queries. The benchmark is representative of interactive business intelligence queries and uses a star-shaped schema. The performance of Hive Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data at different volumes and levels of concurrency. A short analysis and conclusions are presented, with reasoning about the choice of framework and data format for a system that would run interactive queries on big data.

Keywords: Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL, Presto, Big Data

Sammanfattning

Traditional relational database systems cannot be used efficiently to analyze large data volumes and varied file formats, i.e. big data. Apache Hadoop is one of the first open-source tools that provides a distributed data storage and resource management system. The field of big data processing has grown quickly over the past years, and many technologies have been introduced in the big data ecosystem to handle the problem of processing large data volumes; some of the early tools have become commonly used, Apache Hive being one of them. With new advances in the field there are now tools better suited for interactive analysis of big data, such as Apache Spark and Presto. In this thesis these technologies are analyzed with benchmarks in order to determine their performance for the task of interactive business intelligence queries. The benchmarks are representative of interactive business intelligence queries and use a star-shaped schema. The performance of Hive Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data for different volumes and degrees of parallelism. A short analysis and summary are presented with reasoning about the choice of framework and data format for a system that executes interactive queries on big data.

Keywords: Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL, Presto, Big Data

Contents

1 Introduction
   1.1 Problem
   1.2 Purpose
   1.3 Goals
   1.4 Benefits, Ethics and Sustainability
   1.5 Methods
   1.6 Outline

2 Big Data
   2.1 Hadoop
   2.2 Hadoop Distributed File System
   2.3 YARN

3 SQL-on-Hadoop
   3.1 Hive
   3.2 Presto
   3.3 Spark
   3.4 File Formats

4 Experiments
   4.1 Data
   4.2 Experiment Setup
   4.3 Performance Tuning

5 Results
   5.1 Single User Execution
   5.2 File Format Comparison
   5.3 Concurrent Execution

6 Conclusions
   6.1 Single User Execution
   6.2 File Format Comparison
   6.3 Concurrent Execution
   6.4 Future Work

1 Introduction

The space of big data processing has been growing fast over the past years [1]. Companies are making big data analytics a priority, which means that interactive querying of the collected data becomes an important part of decision making. With growing data volume the process of analytics becomes less interactive, as it takes a long time to process the data before the business receives insights. Recent advances in big data processing allow interactive queries, as opposed to only long-running data processing jobs, to be performed on big data. Interactive queries are low-latency, sometimes ad hoc queries that analysts can run over the data to gain valuable insights. The most important feature in this case is a fast response from the data processing tool, which shortens the feedback loop and makes data exploration more interactive for the analyst.

Many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data, and some of the early tools have become widely adopted [2], with Apache Hive1 being one of them. However, with recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark2 and Presto3. In this thesis Hive, Spark, and Presto are examined and benchmarked in order to determine their relative performance for the task of interactive queries.

There are several works taken into account during the writing of this thesis. Similar work was performed by atScale in 2016 [3], which claims to be the first work on the topic of big data analytics. The report is well done, but the main issue is that, with the current pace of development of these technologies, results from several years earlier can become outdated and less relevant when deciding which data processing framework to use. Another work in a similar vein is SQL Engines for Big Data Analytics [4], but the main focus of that work is the domain of bioinformatics, which lessens its relevance for business intelligence. That work was also done in 2015, making it even older than the atScale report. Performance Comparison of Hive, Impala and Spark SQL [5] from 2015 was also considered, but has its drawbacks. Several other works served as references in choosing the method and setting up the benchmark, including Sparkbench [6], BigBench [7], and Making Sense of Performance in Data Analytics Frameworks [8].

1.1 Problem

How is the performance on interactive business intelligence queries impacted by using Hive, Spark or Presto with variable data volume, file format, and number of concurrent users?

1Apache Hive - https://hive.apache.org/
2Apache Spark - https://spark.apache.org/
3Presto - https://prestodb.io/

1.2 Purpose

The purpose of this thesis is to assess the possible performance impact of switching from Hive to Spark or Presto for interactive queries. Using the latest versions of the frameworks makes the work more relevant, as all three frameworks are undergoing rapid development. Considering the focus on interactive queries, several aspects of the experiments are changed from the previous works, including the choice of benchmark, the experimental environment, and the file formats.

1.3 Goals

The main goal of this thesis is to produce an assessment of Hive, Spark, and Presto for interactive queries on big data of different volumes, data formats, and numbers of concurrent users. The results are used to motivate a suggested choice of framework for interactive queries when an existing system is reworked or a new system is planned.

1.4 Benefits, Ethics and Sustainability

The main benefit of this thesis is an independently conducted, fair comparison of several big data processing frameworks for interactive queries. This will help with the choice of tools when implementing a system for analytical querying with constraints on responsiveness and speed, on hardware and data corresponding to the setup in this work. As this thesis uses state-of-the-art versions of the frameworks in question, it includes the improvements that were absent from previous similar works, while ensuring that no framework operates under suboptimal conditions and no framework is given special treatment or tuning.

1.5 Methods

An empirical method is used, as analytical methods cannot be efficiently applied to the presented problem within the resource and time constraints [9]. The results are collected by generating data of different volumes, implementing an interactive query suite, tuning the performance of the frameworks, and running the query suite on the data. This follows the approach established by the most relevant previous works [3], [4], [5], with changes in line with the focus of this thesis.

1.6 Outline

In the Big Data section the big data ecosystem is described, with emphasis on Hadoop and YARN. In the SQL-on-Hadoop section the data processing frameworks are presented: first Hive, then Presto, then Spark. The ORC and Parquet file formats are also briefly described. In the Experiments section the benchmark and experimental setup are described. In the Results section all of the experimental results are outlined and briefly described. In the Conclusions section the results are summarized, conclusions are drawn, and future work is outlined.

2 Big Data

This thesis project is focused on comparing the performance of several big data frameworks in the domain of interactive business intelligence queries. Initially, work in the big data space focused on long-running jobs, but with the advance of big data processing tools it has become more common for companies to execute interactive queries over aggregated data. In this section the big data ecosystem is described, along with a common Hadoop setup.

2.1 Hadoop

Apache Hadoop is a data processing framework targeted at distributed processing of large volumes of data on one or more clusters of nodes running on commodity hardware. Hadoop is an open-source project under the Apache Software Foundation, with Java being the main implementation language [10]. The main components of the Hadoop project are:

• Hadoop Common: the common utilities that support the other Hadoop modules
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
• Hadoop YARN: a framework for job scheduling and cluster resource management
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets

Hadoop is an efficient solution for big data processing, as it enables large-scale data processing workloads relatively cheaply by using commodity hardware clusters [11]. Hadoop provides fast and efficient data processing and fault tolerance.

2.1.1 MapReduce

MapReduce [12] is a shared-nothing architecture, meaning that every node is independent and self-sufficient, for processing large datasets with a distributed algorithm on clusters of commodity hardware. Hadoop uses MapReduce as the underlying programming paradigm. MapReduce expresses distributed computations on large volumes of data as a sequence of distributed operations on key-value pair datasets. The Hadoop MapReduce framework utilizes a cluster of nodes and executes user-defined MapReduce jobs across the machines in the cluster.

Figure 1: A MapReduce computation

Figure 1 shows a MapReduce computation. Each MapReduce computation can be separated into three phases: map, combine, and reduce.

In the map phase the input data is split by the framework into a large number of small fragments, each of which is assigned to a map task. The framework takes care of distributing the map tasks across the cluster of nodes on which it operates. Each map task consumes key-value pairs from the fragment assigned to it and produces intermediate key-value pairs after processing it [12].

The combine, or shuffle and sort, phase is performed next on the intermediate key-value pairs. The main objective of this phase is to prepare all of the tuples for the reduce phase by placing tuples with the same key together and partitioning them by the number of machines used in the reduce phase.

Finally, in the reduce phase each reduce task consumes the fragment of intermediate tuples assigned to it. For each of the tuples a user-defined reduce function is invoked and produces an output key-value pair. The framework distributes the workload across the cluster of nodes.

One important aspect of MapReduce is that because each map or reduce task is completely independent of all other concurrent map or reduce tasks, it can safely run in parallel on different keys and data. Hadoop provides locality awareness, meaning that on a large cluster of machines it tries to schedule map tasks on the machines that store the data the map needs to run on.

Figure 2: Hadoop MapReduce architecture

Figure 2 shows the architecture of the Hadoop MapReduce framework. It is implemented using a master/worker architecture. There is a single master server called the Jobtracker and a number of worker servers called Tasktrackers, one per node in the cluster. Users interact with the Jobtracker by sending it MapReduce jobs, which are put in a pending jobs queue. The jobs in the queue are executed first-in-first-out (FIFO). The Jobtracker assigns the individual map and reduce tasks to Tasktrackers, which handle the task execution and data motion across the MapReduce phases. The data itself is commonly stored on a distributed file system, frequently the Hadoop Distributed File System [13].

2.2 Hadoop Distributed File System

The main purpose of the Hadoop Distributed File System (HDFS) [13] is to reliably store very large files, which do not fit on the hard drive of a single node, across the nodes of a large cluster. The initial inspiration for HDFS was the Google File System [14]. HDFS is highly fault-tolerant and is designed to run on commodity machines. A commodity machine is a readily available computing component for parallel computing, used to get the greatest amount of useful computation at low cost [11]. In this environment hardware failure is expected and needs to be handled routinely. One HDFS instance can consist of thousands of server nodes, with each node being responsible for part of the data. HDFS makes failure detection and automatic recovery a core architectural goal.

Some of the projects in the Hadoop ecosystem rely on streaming access to their data sets. The initial goal of HDFS was to provide batch rather than interactive access, with emphasis on high throughput rather than low latency of data access. With this in mind, HDFS was designed with some of the POSIX4 semantics traded off for an increase in data throughput and to enable streaming access to file system data. One of the assumptions HDFS makes is that applications relying on it require a write-once-read-many access model for files: once a file is created, written to, and closed, it does not need to be changed. This assumption simplifies the resolution of data coherency issues and enables high data throughput. Data replication is one of the main architectural goals of HDFS. It is designed to reliably store very large files across many nodes in a cluster. Each file is stored as a sequence of blocks of the same fixed size, except for the last block, which can be smaller than or equal to the configured block size. The blocks are replicated in order to provide fault tolerance, with the block size and replication factor being configurable by the application.

Figure 3: HDFS architecture. Image adapted from HDFS Design page in Apache wiki6

4POSIX - http://www.opengroup.org/austin/papers/posix_faq.html
6HDFS Design page in Apache wiki - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Figure 3 shows the architecture of HDFS. HDFS has a master/worker architecture: a single cluster consists of a single master node called the Namenode and multiple worker nodes called Datanodes, typically one per node in the cluster. The Namenode manages the file system namespace and regulates file access by clients. The Datanodes manage the storage they are attached to and are responsible for serving read and write requests from clients. HDFS exposes the file system namespace and allows users to store data in files. A file is split into one or more blocks, and the Datanodes are responsible for their storage. The Namenode executes the file system operations and is responsible for opening, closing, and renaming files and directories. The Datanodes execute block creation, deletion, and replication on requests from the Namenode.

HDFS supports a hierarchical file organization. Users can create and remove files, move a file from one directory to another, or rename a file. The Namenode maintains the file system namespace and records any change to the namespace or its properties. The number of replicas of a file that should be maintained by HDFS can be specified by the user and is called the replication factor of that file, which is stored by the Namenode.

In addition, the Namenode is responsible for the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the Datanodes in the cluster. A Datanode sends a list of all blocks stored on it in a Blockreport. If a Heartbeat is received, the Datanode is functioning and healthy. An absence of Heartbeats points to a network partition, i.e. a subset of Datanodes losing connectivity to the Namenode. The Namenode then marks all of the Datanodes without recent Heartbeats as dead and stops forwarding any new requests to them. All of the data stored on dead Datanodes is unavailable to HDFS, and new replicas are placed if the death of Datanodes lets the replication factor of some files fall below the specified value.

Replica placement is a critical factor in providing fault tolerance. HDFS aims to place replicas of data on different racks in order to prevent data loss in case a whole rack of servers experiences failure. A simple but non-optimal policy is to place each replica on a unique rack, which provides fault tolerance against entire rack failures and read bandwidth from separate racks, at an increased cost for writes, which need to be transferred across multiple racks. For the common case of a replication factor of three, HDFS places one replica on the local rack, another on a remote rack, and the final one on a different machine in the same remote rack. The rationale is that node failure is much more common than rack failure, and having two copies in the same remote rack improves write performance without trading off too much reliability and read performance.

HDFS provides data integrity by using checksums. Corruption of the data stored in an HDFS block is possible due to disk failures, network faults, or bugs in software. When a client retrieves a file, it verifies the content against a checksum stored in a hidden file in the same HDFS namespace as the file.

2.3 YARN

Yet Another Resource Negotiator (YARN) [15] is the Apache Hadoop component dedicated to resource management for the Hadoop ecosystem. Initially Hadoop was focused on running MapReduce jobs for web crawls, but it later became used more widely for very different applications. The initial design tightly coupled the programming model with the resource management infrastructure and centralized the handling of job control flow, leading to scalability issues in the scheduler. The aim of YARN is to remedy that. The main requirements for YARN are listed as follows.

1. Scalability - an inherent requirement from running on big data
2. Multi-tenancy - as resources are typically requested by multiple concurrent jobs
3. Serviceability - decoupling of upgrade dependencies in order to accommodate slight differences in Hadoop versions
4. Locality Awareness - a key requirement to minimize the overhead of sending data over the network
5. High Cluster Utilization - making the cluster economical and minimizing time spent unused
6. Reliability/Availability - continuous monitoring of jobs to provide fault tolerance
7. Secure and Auditable Operation - a critical feature for a multi-tenant system
8. Support for Programming Model Diversity - required by the ever-growing Hadoop ecosystem
9. Flexible Resource Model - the separation of map and reduce tasks can bottleneck resources and requires careful resource management
10. Backward Compatibility - to facilitate adoption

Figure 4: YARN System Architecture. Image adapted from YARN Design page in Apache wiki8

Figure 4 shows the architecture of YARN. The two main entities are the global ResourceManager (RM) and the per-application ApplicationMaster (AM). The ResourceManager monitors how resources are used and the liveness of nodes, enforces resource allocation invariants, and acts as an arbiter in resource contention among tenants. The responsibilities of the ApplicationMaster are coordinating the logical plan for a job by issuing resource requests to the ResourceManager, generating the physical plan from the received resources, and coordinating the execution of that plan.

The ResourceManager, acting as a central authority, can ensure fairness and locality requirements across tenants. Based on the requests issued by applications, the ResourceManager dynamically allocates leases, or containers, to applications to be executed on particular nodes. Each node runs a daemon called the NodeManager (NM) that helps enforce and monitor these assignments. NodeManagers are also responsible for tracking resource availability, reporting node failures, and managing the lifecycle of containers. From these snapshots of NM state the RM can create a global view of the cluster state.

Each job submitted to the RM goes through an admission control phase, during which security credentials are validated and administrative checks are performed, due to the secure and auditable operation requirement. If everything is in order, the job state is set to accepted and it is passed to the Scheduler.

8YARN Design page in Apache wiki - https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

All of the accepted applications are recorded in persistent storage in order to recover in case of a ResourceManager failure. After that the Scheduler acquires the necessary resources and sets the job state to running: a container is allocated for the ApplicationMaster, which is spawned on one of the nodes in the cluster. The ApplicationMaster manages all of the details of the application lifecycle, such as resource consumption elasticity, management of flow execution, failure handling, and local optimizations. This provides scalability and programming model flexibility, as the ApplicationMaster can be implemented in any programming language, as long as it satisfies the few assumptions YARN makes.

Commonly, the ApplicationMaster will require resources from multiple nodes in the cluster. In order to obtain the leases, the AM sends a resource request to the RM with locality preferences and container properties. The ResourceManager attempts to satisfy the requests from each application according to the specified availability and scheduling policies. Each time a resource request is granted, a lease is generated and pulled by the AM with its heartbeat. A token-based security mechanism guarantees that a request is authentic when it is passed from the AM to an NM. On container start, the AM encodes an application-specific launch request. Running containers can directly report status and liveness to the AM and receive framework-specific commands from it without any dependencies on YARN.

3 SQL-on-Hadoop

With the development of the big data ecosystem, the tools for big data processing have become more sophisticated and easier to use. One of the major requirements for analytics tools is the ability to execute ad hoc queries on the data stored in the system. SQL, being one of the main languages used for data querying9, became the standard that most frameworks try to integrate, resulting in a number of SQL-like languages. In this section several big data processing frameworks, their architecture, and their usage are presented.

3.1 Hive

Apache Hive [16] is a data warehouse system that supports read and write operations over datasets in distributed storage. Queries are written in a SQL-like language called HiveQL10. HiveServer2 (HS2) is a service that enables clients to run Hive queries11.

Figure 5: Hive System Architecture. Image adapted from Hive Design page in Apache wiki13

Figure 5 shows the major components of Hive and how it interacts with Hadoop. There is a user interface (UI) used for system interaction, which initially was only a command line interface (CLI).

9https://www.tiobe.com/tiobe-index/
10Hive Language Manual - https://cwiki.apache.org/confluence/display/Hive/LanguageManual
11https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview
13Hive Design page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/Design

The Driver is the component that receives the queries, provides the session handles, and offers an API modelled on the JDBC/ODBC interfaces14. The Compiler parses the query, performs semantic analysis of the query blocks and expressions, and is responsible for creating an execution plan with respect to the table and partition metadata, which is stored in the Metastore. The Metastore stores all of the structural information about the different tables and partitions in the warehouse, including information about columns and their types, the serializers and deserializers used to read and write data to HDFS, and the location of the data. The Execution Engine is the component that actually executes the plan created by the Compiler. The plan is represented by a Directed Acyclic Graph (DAG) of stages. The component is responsible for dependency management between stages and for executing the stages on the appropriate components.

Figure 5 also shows how a query typically flows through the system. A call from the UI to the Driver is issued (step 1). Next, the Driver creates a session handle for the query and sends the query to the Compiler, which generates an execution plan for it (step 2). After that the Compiler gets the required metadata from the Metastore (steps 3, 4). The Compiler then generates the plan, type-checking the query and pruning partitions based on the predicates in the query (step 5). The execution plan consists of stages, each being a map or reduce job, a metadata operation, or an HDFS operation. The Execution Engine submits the stages to the responsible components (steps 6-6.3). In each map or reduce task the deserializer associated with the table is used to read the data, and intermediate results are written to HDFS files using a serializer. These files are used by the subsequent map and reduce tasks, with a fetch results call from the Driver retrieving the final query result (steps 7-9).

3.1.1 Hive Data Model

There are three main abstractions in the Hive data model: tables, partitions, and buckets. Hive tables are similar to tables in relational databases. Rows in a Hive table are organized into typed columns. Tables support filter, union, and join operations. All the data in a table is stored in an HDFS directory. External tables, which can be created on pre-existing files or directories in HDFS, are also supported by Hive [17]. Each table can have one or more partition keys that determine how the data is stored. The partitions allow the system to prune data based on query predicates, so that not all of the stored data has to be scanned for every query. The data stored in a partition can further be divided into buckets based on the hash of a column. Each bucket is stored as a file in the partition directory. Buckets allow more efficient evaluation of queries that depend on a sample of the data. In addition to primitive types, Hive supports arrays and maps. Users can also create user-defined types composed of primitives and collections.
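As a brief illustration of partitions and buckets, the hypothetical HiveQL DDL below sketches a partitioned and bucketed table; the table and column names are illustrative and not part of the benchmark schema used later in this thesis.

create table page_views (
    user_id bigint,
    url string,
    ip string
)
partitioned by (view_date string)        -- one HDFS subdirectory per value of view_date
clustered by (user_id) into 32 buckets   -- rows hashed on user_id, one file per bucket
stored as textfile;

Queries that filter on view_date then only need to read the matching partition directories, and queries that sample on user_id can be served from individual bucket files.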

14https://docs.oracle.com/javase/8/docs/technotes/guides/jdbc/

The Serializer and Deserializer (SerDe) and object inspector interfaces provide the necessary hooks to serialize and deserialize data, and User-Defined Functions (UDFs) have their own object inspectors.

3.1.2 Tez

The initial design of Hive relied on MapReduce for execution, which was later replaced by Apache Tez [18]. The foundation for building Tez was laid when resource management was broken out of Hadoop into YARN [15], which enabled an architectural shift: MapReduce could be viewed as just one of the applications to be run on top of YARN. Tez made the next step by creating a flexible and extensible foundation with support for arbitrary data-flow oriented frameworks.

Apache Tez provides an API that allows clear modelling of logical and physical data flow graphs. This allows users to model computation as a DAG with finer decomposition compared to classic vertex-edge models. In addition, Application Programming Interfaces (APIs) to modify the DAG definition on the fly are available, enabling more sophisticated optimizations in query execution. Tez also supports efficient and scalable implementations of YARN features - locality awareness, security, reuse of resources, and fault tolerance.

Tez solves the problem of orchestrating and executing a distributed data application on Hadoop, providing cluster resource negotiation, fault tolerance, elasticity, security, and performance optimization. Tez is composed of several key APIs that define the data processing and orchestration framework that applications need to implement to provide an execution context. This allows Tez to be application agnostic. Computation in Tez is modelled as a directed acyclic graph, which is natural, as data flows from sources to sinks with transformations happening on the in-between vertices. A single vertex in the Tez DAG API is a representation of some data processing and is a single step in the transformation. A user provides a processor that defines the underlying logic to handle data. It is quite common for multiple vertices to be executed in parallel across many concurrent tasks. An edge of the DAG is a physical and logical representation of data movement from the vertex that produces data to the one that consumes it. Tez supports one-to-one, broadcast, and scatter-gather data movement between producers and consumers.

After the DAG API defines the structure of the data pipeline, the Runtime API is used to inject application logic. While a DAG vertex represents a data processing step, the actual transformation is applied by executing tasks on the machines in the cluster. Each task is defined as a composition of a set of inputs, a processor, and outputs (IPO). The processor is defined for each task by the vertex, the output classes of the incoming edges define the inputs, and the input classes of the outgoing edges define the outputs.

Some of the complexity, such as the underlying data transport, partitioning, and aggregation of shards, is hidden by this representation and configured after the creation of the IPO objects. IPO configuration is done through binary payloads. The processor is presented with inputs and outputs and communicates with the framework by exchanging events through a context object. Fault tolerance is achieved by re-executing tasks to regenerate data upon reception of an ErrorEvent. Tez does not specify any data format and is not part of the data plane during the execution of the DAG: the data transfer is performed by the inputs and outputs, with Tez serving as a router between producers and consumers. This gives Tez minimal overhead and makes it data format agnostic, allowing the inputs, processors, and outputs to choose their data formats themselves.

To enable dynamic reconfiguration of the DAG and to adapt the execution on the fly, Tez uses a VertexManager. Each vertex in the DAG is associated with a VertexManager responsible for reconfiguring the vertex during runtime. A number of different state machines are used to control the lifecycle of vertices and tasks, which interact with the VertexManager through states. The VertexManager is notified of state transitions through a context object, facilitating decisions on DAG configuration changes.

Apache Tez provides automatic partition cardinality estimation as a runtime optimization. For instance, this is used to solve a common problem in MapReduce - determining the required number of reduce tasks, which depends on the volume of data shuffled from the mappers. Tez can produce a statistical estimate of the total data size and estimate the total required number of reducers at runtime. In addition, by having fine-grained control over DAG vertices, Tez provides scheduling optimizations: locality-aware scheduling, minimizing out-of-order scheduling, and reducing the number of scheduling deadlocks, with specific deadlock detection and preemption to take care of these situations.

Multi-tenancy is a common requirement in data processing, and the discrete task-based processing model of Apache Tez supports it well. Resource allocation and deallocation with the task as the unit enables high utilization of cluster resources, with computational power being transferred from applications that no longer require it to those that do. In Tez each task is executed in a container process, which provides this resource scalability and elasticity.

Hive 0.13 was the first version to take advantage of Tez, utilizing the increased efficiency of translating SQL queries into Tez DAGs instead of MapReduce jobs.
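As a small illustration, switching a Hive session between execution engines comes down to a configuration setting; this is a sketch and omits any cluster-specific tuning.

set hive.execution.engine=tez;
-- every HiveQL statement issued after this point in the session is compiled
-- into a Tez DAG instead of a chain of MapReduce jobs; setting the value
-- back to mr restores the classic MapReduce execution path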

18 Figure 6: MapReduce and Tez for Hive. Images adapted from Tez page in Apache wiki16

Before switching to Tez, Hive used MapReduce for query execution, which led to potential inefficiencies due to several factors. One instance is a query with more than one reduce sink, where the sinks cannot be combined because their partition keys are not correlated. In MapReduce this leads to separate MapReduce jobs being executed for a single query, as shown in Figure 6. Each of the MapReduce jobs reads from and writes to HDFS, in addition to shuffling data. In Tez, on the other hand, the query plan is pipelined and linked directly, which is referred to as a map-reduce-reduce pattern.

3.1.3 LLAP

In Hive 2.0 the Live Long And Process (LLAP) functionality was added17. Building on the improvements introduced with [17] and [18], a new initiative was started to introduce asynchronous spindle-aware I/O, pre-fetching and caching of column chunks, and multithreaded JIT-friendly operator pipelines. LLAP provides a hybrid execution model that consists of a long-lived daemon, which replaces direct communication with the HDFS Datanode, and a tightly integrated DAG-based framework. The LLAP daemons include caching, pre-fetching, and some query processing functionality. Small and short queries are mostly passed directly to and processed by the daemon, while heavy queries remain the responsibility of the YARN containers. In addition, similar to how the Datanode is designed, LLAP daemons are accessible to other applications, which can be useful when a relational view of the data is preferable to a file-centric one.

16Tez page in Apache wiki - http://tez.apache.org/ 17LLAP page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LLAP

The API offers an InputFormat that can be used by other data processing frameworks.

Figure 7: LLAP. Image adapted from LLAP page in Apache wiki18

Figure 7 shows how a job is processed with LLAP and a Tez AM, which coordinates the whole execution. Initially, the input query is passed into LLAP. After that, the costly operations such as shuffles are performed in separate containers during the reduce stage. LLAP can be accessed by multiple concurrent queries or applications at the same time.

To achieve the goals of providing JIT optimization and caching while reducing startup costs, the daemon is executed on the worker nodes in the cluster and handles I/O operations, caching, and execution of query fragments. Any request to an LLAP node contains the location of the data and the associated metadata. While concerns about data locality are left to YARN, the processing of local and remote data locations is handled by LLAP. Fault tolerance overall is simplified, as any data node can be used to execute any query fragment, which is arranged by the Tez AM. Similar to Tez, direct communication between Hive nodes is permitted.

The daemon always tries to use multiple threads for I/O and for reading from compressed formats. As soon as data becomes ready, it is passed on for execution so that the previous batch can be processed concurrently with the preparation of the next one. LLAP uses an RLE-encoded columnar format for caching, minimizing copy operations in I/O, execution and cache, and also enabling vectorized processing. The daemon caches the data itself as well as indexes and metadata for input files, sometimes even for data that is not currently in the cache. The eviction policy is pluggable, but the default is an LRFU policy.

18LLAP page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LLAP

A column chunk is the unit of data in the cache, which is a compromise between storage efficiency and low overhead in data processing. Dynamic runtime filtering is achieved using a bloom filter [19].

In order to preserve the scalability and flexibility of Hive, LLAP works within the existing process-based Hive execution model. The daemons are optional and Hive can bypass them even when they are deployed and running. Unlike MapReduce or Tez, LLAP is not an execution engine by itself; the existing Hive execution engines are used to schedule and monitor the overall execution. LLAP results can be a partial result of a Hive query or can be passed to an external Hive task. Resource management remains the responsibility of YARN, with the YARN container delegation model being used by LLAP. Data caching happens off-heap in order to overcome JVM memory setting limitations: the daemon can initially use only a small amount of CPU and memory, but additional resources can later be allocated based on the workload. For partial execution, LLAP can take fragments of a query, e.g. a projection or a filter, and work with them. For security and stability, only Hive code and some UDFs are accepted by LLAP, and the code is localized and executed on the fly. Concurrent execution of several query fragments from different queries and even sessions is allowed. Users can gain direct access to LLAP nodes using the client API.
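As an illustration, and assuming a Hive 2.x installation where LLAP daemons are already deployed, enabling LLAP for a session comes down to configuration settings such as the following; exact property names and defaults can vary between versions and distributions.

set hive.execution.engine=tez;      -- LLAP runs on top of the Tez engine
set hive.execution.mode=llap;       -- route work to the LLAP daemons instead of plain containers
set hive.llap.execution.mode=all;   -- allow all suitable query fragments to run inside the daemons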

3.2 Presto

Presto [20] is a distributed, massively parallel query execution engine developed by Facebook in 201319. Presto can process data from multiple sources such as HDFS, Hive, and Cassandra, and many other connectors are available, making the integration of new data sources easy. Presto is designed to query data where it is stored, rather than moving it into a single storage system. Presto is able to combine multiple data sources in a single query, which makes data integration easier. However, Presto cannot act as a general purpose relational database, as it is not designed to handle Online Transaction Processing (OLTP).

19https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920/

21 Figure 8: Presto System Architecture

Figure 8 shows the architecture of Presto. There are two types of nodes in a Presto cluster: coordinators and workers. The main responsibilities of the Presto coordinator are parsing the input queries, planning query execution, and managing the worker nodes. A Presto worker is responsible for executing tasks and processing data fetched from connectors and exchanged between workers.

22 Figure 9: Presto Task Execution

Presto does not use MapReduce, but instead uses message passing to execute queries, removing the overhead of transitions between map and reduce phases that is present in Hive, as shown in Figure 9. This leads to improvements in performance, as all of the stages are pipelined without any overhead from disk interaction. However, it reduces fault tolerance, as lost data needs to be recomputed, and another limitation is that data chunks need to fit in memory. Presto is first and foremost a distributed system running on a cluster of nodes. The Presto query engine is optimized for interactive analysis of large volumes of data and supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto supports most of the standard data types commonly used in SQL.

When an input query is sent from the client node, it is first received by the coordinator, which parses it and creates a query execution plan. The scheduler then assigns the worker nodes that are closest to the data, to minimize the network overhead of data transfer. The workers perform data extraction through connectors when necessary, execute the distributed query plan, and return results to the coordinator, which returns the final result of the query to the client. Presto is not specific to Hadoop and can work with multiple data sources in the same query, making Presto well suited to cloud environments that do not use HDFS as storage. There is a variety of connectors to data sources, including HDFS, Amazon S3, MySQL, Apache Hive, PostgreSQL, Redis, and more20.
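As a sketch of this capability, the hypothetical query below joins a Hive table with a MySQL table in a single statement. The catalog and schema names (hive.ssb, mysql.crm) are assumptions that depend entirely on how the Presto catalogs are configured.

select o.lo_orderdate, c.c_name, o.lo_revenue
from hive.ssb.lineorder o        -- data stored in HDFS, exposed through the Hive connector
join mysql.crm.customer c        -- table living in an external MySQL database
    on o.lo_custkey = c.c_custkey
limit 10;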

3.3 Spark

Apache Spark is a unified engine for distributed data processing that was created in 2009 at the University of California, Berkeley [21]. It was first released in 2010 and has since become one of the most active open-source projects in big data processing. It is part of the Apache Software Foundation and has had over 1000 contributors [22]. Spark offers several frameworks (Figure 10), including Spark SQL for analytics, Spark Streaming, MLlib for machine learning, and GraphX for graph-specific processing.

Figure 10: Spark ecosystem. Image adapted from Apache Spark page21

20Presto Connectors - https://prestodb.io/docs/current/connector.html 21Apache Spark - https://spark.apache.org/

24 3.3.1 Architecture

Each Spark program has a driver that runs the execution of the different concurrent operations on the nodes in the cluster. The main abstraction is the resilient distributed dataset (RDD), which represents a collection of objects partitioned over the Spark nodes in the cluster. Most commonly an RDD is created by reading a file from HDFS. Another important abstraction in Spark is the concept of shared variables. By default, during each concurrent execution of a function, a copy of each variable is sent to each task. However, in some cases variables need to be shared across tasks, or with the driver. For these cases Spark supports two types of shared variables: broadcast variables, which are used to cache a value in memory, and accumulators, which are only mutated by adding to them, for instance sums and counts.

Figure 11: Spark architecture. Image adapted from Cluster Overview page in Apache wiki23

Figure 11 shows the architecture of Apache Spark. Spark utilizes a master/worker architecture with a single Driver node and multiple Worker nodes. Each Worker node contains an Executor that receives and executes the application code from the SparkContext running on the Driver. Each application has separate executor processes, independent from other applications, which remain for the whole lifecycle of the application and can run tasks in multiple threads. This provides isolation of applications, at the cost of them being unable to directly share data without using an external storage system. Spark supports four different cluster managers: the Spark standalone manager, Apache Mesos, Hadoop YARN, or Kubernetes.

23Cluster Overview page in Apache wiki - https://spark.apache.org/docs/latest/cluster-overview.html

Spark is agnostic to the underlying cluster manager. The only requirement is that Spark is able to spawn executor processes that can communicate with each other. Spark can therefore even run on a cluster manager that also supports other applications, as is the case for Mesos and YARN.

3.3.2 RDD

The key programming abstraction in Spark is the Resilient Distributed Dataset (RDD), a fault-tolerant collection of objects partitioned over the Spark nodes in the cluster that can be processed concurrently. Spark offers APIs in Java, Scala, Python, and R, through which users can pass functions to be executed on the Spark cluster. Commonly, RDDs are first read from an external source, such as HDFS, and then transformed using operations such as map, filter, or groupBy. RDDs are lazily evaluated, so an efficient execution plan can be created for the transformations specified by the user.

Some operations performed in Spark trigger a shuffle event. The shuffle is the mechanism Spark uses to re-distribute data across partitions so that it is grouped differently. Commonly this includes copying the data across nodes and executors, which makes shuffling costly and complex. In Spark the distribution of data across nodes is typically not done in preparation for any specific operation, and a single task operates on a single partition; e.g. to organize the data for a single reduceByKey, an all-to-all operation is required. This leads to Spark reading all of the values for each key across all of the partitions and bringing the values together in order to compute the result for each key.

The performance impact of the shuffle operation is significant. A shuffle involves expensive I/O operations on disk, CPU load for data serialization, and I/O over the network to transmit the data to other Spark nodes. The shuffle operation consists of map and reduce phases, not to be confused with the map and reduce Spark operations. Spark keeps the results of individual map tasks in memory until they do not fit, then spills them to disk, sorted by target partition and stored in a single file. Reduce tasks read the blocks relevant to them. In addition to simple read/write interaction with the disk, a shuffle operation generates a lot of intermediate files. Depending on the Spark version, these files can be preserved until the corresponding RDDs are no longer used and only then garbage collected. This can lead to long-running Spark jobs consuming a lot of disk space.

In order to provide fault tolerance Spark uses a lineage-based approach. In contrast with the common strategies of replication or checkpointing, Spark tracks the graph of transformations leading to each partition of an RDD and reruns the necessary operations on the base data if an RDD partition is lost. For shuffle operations the data is persisted locally on the sender node in case of receiver failure. This approach can significantly increase the efficiency of data-intensive workloads.

Spark also provides options for explicit data sharing across computations. RDDs are not persisted by default and are recomputed, but there is a setting to persist them in memory for rapid reuse. The data can spill to disk if it does not fit in a node's memory, and the option to persist data to disk is also available. Data sharing can provide a speedup of up to several orders of magnitude for interactive queries and iterative algorithms that reuse data. The persist() and cache() methods of an RDD are used for this. Different storage levels can be specified for these functions, for instance allowing the RDD to be stored as serialized or deserialized Java objects in memory, either with or without spilling to disk, or directly on disk, with additional options to replicate partitions on another Spark node in the cluster. The storage levels give the developer a trade-off between memory and CPU usage.

Several factors should be taken into consideration when deciding which storage level to use. First, if the data fits in memory it is best to leave it stored there, as this is the most CPU-efficient option and allows operations on the RDD to execute the fastest. Second, if the data does not fit in memory in deserialized form, storing it in memory serialized should be the second choice; this is more space-efficient but introduces serialization and deserialization overhead, while still being reasonably fast. Third, if both previous levels are impossible, the use of disk should be a last resort: only if the computation of a partition is so expensive that reading it from disk would be faster should the data be persisted on disk, and in many cases recomputing a partition is less costly. Finally, the replication option should be used only for faster fault recovery, since by default Spark already provides fault tolerance for all partitions on all storage levels [23].

3.3.3 Dataframe, Dataset, and Spark SQL

Unlike the basic RDD API, the interface provided by Spark SQL contains significantly more information about how the data is structured and how the computation is performed. Dataframes are distributed collections of data added to Spark in order to execute SQL-like queries on top of the Spark engine. Datasets are an improvement over Dataframes that combine the benefits of RDDs, such as strong typing and powerful lambda functions, with the extra optimizations that Spark is able to apply to Dataframes using the Catalyst optimizer. Datasets can be constructed from Java objects and manipulated using functional transformations. This API is available only in the Java and Scala programming languages.

27 Figure 12: Spark SQL and Hive integration. Image adapted from Apache Spark SQL25

Spark SQL can read data from an existing Hive installation by communicating with the Metastore (Figure 12). Spark does not include all of the dependencies that Hive needs, but if they are detected on the classpath, Spark automatically loads them. The same goes for all of the worker nodes with respect to Hive serializers and deserializers (SerDes). Spark also supports starting a Thrift server26; in this thesis, however, direct loading of files from HDFS was used.
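To illustrate the direct-loading approach, a Spark SQL session can expose a directory of files in HDFS as a temporary view and query it with plain SQL. This is a sketch; the path below is an assumption and not the actual location used in the experiments.

create temporary view lineorder
using parquet
options (path 'hdfs:///ssb/sf10/lineorder');   -- directory of Parquet files for the lineorder table

select count(*) from lineorder;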

3.4 File Formats

Storage of big data comes with additional challenges. The trade-off between volume and efficiency of access becomes more important, as every improvement in compression rate may mean a major difference in the total storage space required. Efficiency of access, on the other hand, is necessary for running analytical queries on the data in an interactive manner. This subsection presents two file formats that are commonly used for storing big data.

3.4.1 ORC File Format

The Optimized Row Columnar (ORC) file format is an efficient way to store data for Hive. It improves Hive performance by having a single output file per task, reducing the load on the NameNode, introducing a lightweight index that is stored inside the file, and allowing concurrent reading of the same file by several readers. In addition it enables block-mode compression depending on the type of data, with run-length encoding for integer types and dictionary encoding for string types. The metadata is stored using Protocol Buffers27, allowing fields to be added and removed.

25Apache Spark SQL - https://spark.apache.org/sql/
26https://hortonworks.com/tutorial/spark-sql-thrift-server-example/
27https://github.com/google/protobuf/wiki

28 Figure 13: ORC File Format. Image adapted from ORC Language Manual page in Apache wiki29

Each ORC file consists of stripes, which are groups of row data. Additional information on the content is stored in the file footer, such as the list of stripes in the file, the number of rows per stripe, and column-level aggregates for count, min, max, and sum. Compression parameters and the size after compression are stored in the postscript. The stripe footer contains a directory of stream locations.

29ORC Language Manual page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
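As a sketch, a Hive table backed by ORC can be created from an existing table as shown below; the table names and the choice of ZLIB compression are illustrative assumptions.

create table sales_orc
stored as orc
tblproperties ('orc.compress'='ZLIB')   -- SNAPPY and NONE are other common choices
as select * from sales_text;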

29 3.4.2 Parquet File Format

Apache Parquet is another columnar file format popular in the Hadoop ecosystem. The main motivation for this format is to have an efficient columnar data representation with support for complex nested data structures.

Figure 14: Parquet File Format. Image adapted from Parquet documentation page in Apache wiki31

The Parquet documentation illustrates the file layout as follows:

4-byte magic number "PAR1"
<column chunks for each of the N columns, repeated for each of the M row groups>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

In this layout, also shown in Figure 14, there are N columns in the table, split into M row groups. The file metadata contains the start locations of all the column metadata. The metadata is written after the data to allow single-pass writing. Parquet supports efficient compression and encoding schemes on a per-column level and is open for extension. Parquet supports dictionary encoding, bit packing, and run-length encoding to compress the data. Snappy32 and ZLIB33 are supported codecs. Many data processing frameworks support this file format, including Apache Hive, Apache Crunch, Cassandra, Apache Spark, Presto, and others34.

31Parquet documentation page in Apache wiki - https://parquet.apache.org/documentation/latest/

32https://github.com/google/snappy 33https://zlib.net/ 34https://cwiki.apache.org/confluence/display/Hive/Parquet
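Analogously to the ORC example, a Parquet-backed Hive table can be created with a sketch like the one below; the table names and the Snappy codec are illustrative assumptions.

create table sales_parquet
stored as parquet
tblproperties ('parquet.compression'='SNAPPY')   -- the codec is configurable per table
as select * from sales_text;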

4 Experiments

The main focus of the thesis is business intelligence queries on big data. With this in mind, several experiments are planned. First, the benchmark is run on data of increasing volume, to gain an understanding of how the frameworks' performance changes as the volume of data grows. Second, several data formats are examined in order to better understand how each framework deals with each format, and which framework works best with which format. Finally, as the planned system operates in a concurrent setting, the performance of the frameworks with different numbers of concurrent users is examined.

4.1 Data

The benchmark used to assess the Big Data frameworks in this thesis is the Star Schema Benchmark [24]. The choice of the benchmark was motivated by the fact that it closely models the typical schemas used in OLAP workloads. The benchmark is based on TPC-H35, but makes a number of modifications to it. It uses a star schema that is much more frequently used for interactive querying. This leads to a smaller number of joins and makes it more useful for testing interactive business intelligence queries.

35http://www.tpc.org/tpch/

Figure 15: Star Schema Benchmark data model

Figure 15 shows the schema used in the benchmark. The data for this schema was generated for different volumes, defined by scale factor (SF). The tables below show the data volume for scale factors 1, 10, 20, and 75 in different file formats.

           SF1        SF10      SF20      SF75
Customer   2.7 Mb     27.3 Mb   54.8 Mb   206.8 Mb
Lineorder  571.3 Mb   5.8 Gb    11.6 Gb   44.4 Gb
Dates      224.6 Kb   224.6 Kb  224.6 Kb  224.6 Kb
Parts      16.3 Mb    65.7 Mb   82.2 Mb   115.4 Mb
Supplier   162.8 Kb   1.6 Mb    3.2 Mb    12.2 Mb
Total      590.8 Mb   5.9 Gb    11.8 Gb   44.7 Gb

Table 1: Data size, by table and scale factor, Text format

           SF1        SF10      SF20      SF75
Customer   722.6 Kb   7.0 Mb    14.0 Mb   52.7 Mb
Lineorder  118.1 Mb   1.2 Gb    2.4 Gb    9.3 Gb
Dates      10.8 Kb    10.8 Kb   10.8 Kb   10.8 Kb
Parts      1.9 Mb     7.8 Mb    9.7 Mb    13.6 Mb
Supplier   48.7 Kb    473.1 Kb  946.1 Kb  3.5 Mb
Total      120.9 Mb   1.2 Gb    2.5 Gb    9.3 Gb

Table 2: Data size, by table and scale factor, ORC format

           SF1       SF10      SF20      SF75
Customer   1.2 Mb    12 Mb     24 Mb     89 Mb
Lineorder  172 Mb    1.7 Gb    3.4 Gb    14 Gb
Dates      40 Kb     40 Kb     40 Kb     40 Kb
Parts      2.5 Mb    9.3 Mb    11 Mb     16 Mb
Supplier   92 Kb     812 Kb    1.6 Mb    5.9 Mb
Total      176 Mb    1.7 Gb    3.5 Gb    14 Gb

Table 3: Data size, by table and scale factor, Parquet format
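The ORC and Parquet copies of the generated data can be produced from the text-format Hive tables with statements along the following lines; the exact loading procedure and table names are assumptions made for illustration.

create table lineorder_orc
stored as orc
as select * from lineorder;

create table lineorder_parquet
stored as parquet
as select * from lineorder;

-- the same pattern is repeated for the customer, dates, parts, and supplier tables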

4.1.1 Queries

Query 1 The query is meant to quantify the amount of revenue increase that would have resulted from eliminating certain companywide discounts in a given percentage range for products shipped in a given year. In each set of queries the specific queries are denoted as Q1.1, Q1.2, and so on.

select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
    and d_year = [YEAR]
    and lo_discount between [DISCOUNT] - 1 and [DISCOUNT] + 1
    and lo_quantity < [QUANTITY];

There are three queries generated from this template with specific values for YEAR, DISCOUNT, and QUANTITY. The selectivities below are given for scale factor 1. Q1.1 sets YEAR to 1993, DISCOUNT to 2, and QUANTITY to 25, which leaves approximately 11600 rows after filtering. For Q1.2 a year and month are specified instead of a year only, namely January 1994; DISCOUNT is set to between 4 and 6, and QUANTITY to between 26 and 35, leaving approximately 4000 rows. For Q1.3 the filtering is even stricter: only the 6th week of 1994 is specified for the time filter, DISCOUNT is set to between 5 and 7, and QUANTITY to between 26 and 35 as before. That leaves approximately 500 rows.
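For concreteness, substituting the Q1.1 parameters given above into the template yields:

select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
    and d_year = 1993
    and lo_discount between 1 and 3
    and lo_quantity < 25;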

Query 2 For the second set of queries, restrictions are placed on two dimensions. The query compares revenue for selected product classes and for suppliers in a selected region, grouped by more restrictive product classes and all years of orders.

select sum(lo_revenue), d_year, p_brand1
from lineorder, date, part, supplier
where lo_orderdate = d_datekey
  and lo_partkey = p_partkey
  and lo_suppkey = s_suppkey
  and p_category = [CATEGORY]
  and s_region = [REGION]
group by d_year, p_brand1
order by d_year, p_brand1;

Three queries are generated from this template with specific values for CATEGORY and REGION; the estimated row counts after filtering are given for scale factor 1. Q2.1 uses MFGR#12 as CATEGORY, which accounts for approximately 4% of the orders, and sets REGION to AMERICA, selecting around 48,000 rows. Q2.2 switches from a single category to a range of brands, between MFGR#2221 and MFGR#2228, and changes REGION to ASIA; the added restrictions result in around 10,000 rows being selected. Q2.3 limits the search to the single brand MFGR#2339 and changes REGION to EUROPE, which yields around 1,200 rows. Each of the selections is disjoint and, in addition, separate from Q1, so there is no overlap and no caching effect between these scenarios. A sketch of the instantiated queries is shown below.
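As an illustration, Q2.1 substitutes its parameters directly into the template, while Q2.2 additionally replaces the category predicate with a brand range, as described above. The sketch below assumes the brand range is expressed on the p_brand1 column; the exact predicate produced by the benchmark generator may differ slightly.

-- Q2.1: CATEGORY = MFGR#12, REGION = AMERICA
select sum(lo_revenue), d_year, p_brand1
from lineorder, date, part, supplier
where lo_orderdate = d_datekey
  and lo_partkey = p_partkey
  and lo_suppkey = s_suppkey
  and p_category = 'MFGR#12'
  and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;

-- Q2.2 would replace the category predicate with a brand range, e.g.:
--   p_brand1 between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'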

Query 3 In the third query suite, restrictions are placed on three dimensions, including the remaining dimension, customer. The query provides revenue volume for lineorder transactions by customer nation, supplier nation, and year, within a given region and a certain time period.

select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_region = [REGION]
  and s_region = [REGION]
  and d_year >= [YEAR1] and d_year <= [YEAR2]
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;

Four queries are generated from this template with specific values for the years and REGION. Q3.1 uses ASIA as REGION, leading to 20% of the rows being selected, and the years 1992 and 1997, bringing the total number of rows selected to around 205,000. Q3.2 specifies a single nation instead of a region, UNITED STATES, selecting 4% of the rows; the years are not changed, and the number of rows returned is approximately 8,000. Q3.3 tightens the location restriction to two cities in the UK and adds grouping by these cities to the query; the year filter is unchanged. This yields around 320 rows. Finally, in Q3.4 the timespan is reduced to a single month while the location is unchanged, returning only around 5 rows. An instantiation of Q3.1 is given below.
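For illustration, substituting the Q3.1 parameters described above into the template gives the following query; Q3.2 and Q3.3 would replace the region predicates with nation and city predicates respectively, as described above.

-- Q3.1: REGION = ASIA, years 1992 to 1997
select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_region = 'ASIA'
  and s_region = 'ASIA'
  and d_year >= 1992 and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;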

Query 4 The last query suite represents a "what-if" sequence of the OLAP type. It starts with a group by on two dimensions and rather weak constraints on three dimensions, and measures the aggregate profit, defined as (lo_revenue - lo_supplycost).

select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profit
from date, customer, supplier, part, lineorder
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_partkey = p_partkey
  and lo_orderdate = d_datekey
  and c_region = [REGION]
  and s_region = [REGION]
  and p_mfgr = [CATEGORY]
group by d_year, c_nation
order by d_year, c_nation;

Three queries are generated from this template with specific values for REGION and CATEGORY. Q4.1 specifies AMERICA as the region and allows two possible categories, returning around 96,000 rows. Q4.2 additionally restricts the time to 1997 or 1998, with all other filtering conditions unchanged, which selects around 27,000 rows. Q4.3 adds a restriction on nation and narrows the category to a single one, returning only approximately 500 rows. The complete table of filtering factors for each query can be found in [24]. A sketch of Q4.1 is shown below.
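As a sketch, Q4.1 can be instantiated as shown below, assuming the two manufacturer categories are MFGR#1 and MFGR#2 as in the SSB specification [24]; Q4.2 would add a d_year filter for 1997 and 1998, and Q4.3 would further restrict the supplier nation and the category, as described above.

-- Q4.1: REGION = AMERICA, two categories allowed (assumed MFGR#1 and MFGR#2)
select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profit
from date, customer, supplier, part, lineorder
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_partkey = p_partkey
  and lo_orderdate = d_datekey
  and c_region = 'AMERICA'
  and s_region = 'AMERICA'
  and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by d_year, c_nation
order by d_year, c_nation;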

4.2 Experiment Setup

The experiments were run on a cluster of 9 nodes, each having 32 Gb of RAM, 12 cores, and 100 Gb of SSD storage. The cluster was set up on AWS EMR 5.13.036, which runs Apache Hive 2.3.2, Apache Spark 2.3, and Presto 0.194. The ORC file format version used was 1.4.3 and the Parquet file format version was 1.8.2. The cluster runs Hadoop 2.8.337, with YARN acting as the resource manager and HDFS as the distributed file system. As the main focus of the thesis is interactive queries, the containers are assumed to be long-lived.

36 https://aws.amazon.com/about-aws/whats-new/2018/04/support-for-spark-2-3-0-on-amazon-emr-release-5-13-0/
37 https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5130-release

The time needed to spin up containers is not included in the benchmark results; only the actual query execution time is tracked. For each query, one timestamp is recorded right before the query is submitted and a second timestamp is recorded right after the results are collected. For concurrent execution, several benchmark runs are started simultaneously, and the resource waiting time impacts the runtime of the query suite.

4.3 Performance Tuning

For Hive the following additional settings were altered:

hive.llap.execution.mode=all
hive.prewarm.enabled=true
hive.vectorized.execution.enabled=true
hive.vectorized.execution.reduce.enabled=true
hive.vectorized.execution.reduce.groupby.enabled=true

This ensured that the containers were pre-warmed and that vectorization was enabled for LLAP. LLAP mode all was chosen instead of only so that all queries return results even if not enough resources are available for LLAP to execute the query. In addition, force-local-scheduling was set to true in order to ensure that all possible calls to HDFS were attempted locally, taking advantage of data locality. The ORC file configuration was:

"orc.compress"="SNAPPY",
"orc.compress.size"="262144",
"orc.create.index"="true",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="3000"

The Parquet file configuration used the same parameter values for compression and index creation. A sketch of how these properties can be applied when creating the ORC tables is given below.
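For illustration, these ORC properties are typically supplied as table properties when the ORC tables are created in Hive. The sketch below shows how the lineorder table could be converted from an existing text-format table; the table name lineorder_text is assumed for illustration and this is not necessarily the exact DDL used in the experiments.

-- Create an ORC copy of the lineorder table with the properties listed above
create table lineorder_orc
stored as orc
tblproperties (
  "orc.compress"="SNAPPY",
  "orc.compress.size"="262144",
  "orc.create.index"="true",
  "orc.stripe.size"="268435456",
  "orc.row.index.stride"="3000"
)
as select * from lineorder_text;

A Parquet table can be created analogously with stored as parquet and the corresponding compression setting.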

5 Results

The benchmark results are presented in this section, which is divided into three subsections. The first subsection considers single user execution; the second presents the file format comparison for each framework; the third presents the results for concurrent execution, grouped by data scale and by framework, together with a separate comparison of Spark and Presto.

5.1 Single User Execution

In this section the results for a single user execution grouped by file format and framework are presented. First, the text format results are presented, then ORC, then Parquet. In the final subsection the direct comparison between Spark on Parquet and Presto on ORC is presented.

5.1.1 Text format

Figure 16: Average execution times for text format, SF1
Figure 17: Average execution times for text format, SF10

Figure 18: Average execution times for text format, SF20
Figure 19: Average execution times for text format, SF75

Figures 16-19 show the charts for average execution times for text format data, scale factors 1, 10, 20, and 75.

Figure 16 shows the results for text for scale factor 1. On all of the queries Presto was the fastest and Hive LLAP was second fastest; on Q1.2, Q1.3, Q2.3, Q3.3, and Q3.4 their execution times were comparable. Spark was the third fastest, being slower than Hive LLAP on most of the queries. Spark was faster only on Q3.1 and tied with Hive LLAP on Q1.1 and Q4.1. Hive Tez was the slowest on all of the queries, with by far the worst execution time on Q1.1.

Figure 17 shows the results for text for scale factor 10. Again, Presto was the fastest, with most of its execution times being twice as fast as the closest competitor. One major difference is that Spark became faster than Hive LLAP on most of the queries, with Q1.2, Q1.3, and Q2.1 being the only exceptions. Hive LLAP is faster than Hive Tez on every query except Q2.2, where the execution times are roughly the same. Hive Tez is still the slowest. There is still a notable spike for Q1.1, where the difference between Hive LLAP and Hive Tez execution times is the largest.

Figure 18 shows the results for text for scale factor 20. Presto maintains its status as the fastest, with only Hive LLAP almost matching its execution times for Q1.2 and Q1.3. Spark is the second fastest on some of the queries (Q2.1, Q4.1, Q4.2, Q4.3), but Hive LLAP matches its times for Q1.1 and Q2.2 - Q3.4, and is faster for Q1.2 and Q1.3. Another key difference is that the execution times for Hive Tez are much closer to Hive LLAP and Spark, Q1.1 being the only exception, where it is much slower.

Figure 19 shows the results for text for scale factor 75. Hive LLAP is the fastest framework for Q1.2, Q1.3, and Q4.3, narrowly beating Presto, which ties on Q3.4 and is faster on all other queries. Hive Tez is slower than or ties with Hive LLAP on most of the queries. The most striking difference is that Spark is the slowest framework on all of the queries, which was an unexpected result given the previous runs.

Figure 20: Execution time growth, text format, Hive Tez
Figure 21: Execution time growth, text format, Hive LLAP

Figure 22: Execution time growth, text format, Spark
Figure 23: Execution time growth, text format, Presto

Figures 20-23 show the execution time growth for each framework on text format data. The data size growth corresponds to the scale factors 1, 10, 20, and 75.

Figure 20 shows the execution time growth for Hive Tez. The execution time growth is the slowest for Q1.2, Q1.3, and Q4.3: it grows steeply from SF1 to SF10 but stays stable across SF10, SF20, and SF75. The growth is the fastest for Q2.1, especially for SF75. For all of the other queries the growth rate corresponds to the change in scale factor.

Figure 21 shows the execution time growth for Hive LLAP. The growth pattern is similar to Hive Tez, but in addition to Q1.2, Q1.3, and Q4.3, the queries Q3.3 and Q3.4 show very slow execution time growth from SF10 to SF75. Again, Q2.1 shows the steepest incline in execution time, while the other queries grow in correspondence with the scale factor.

Figure 22 shows the execution time growth for Spark. In contrast to Hive, Spark shows very stable execution time growth for all of the queries. Only Q2.2 shows a slightly sharper increase in execution time from SF20 to SF75. However, the absolute values for Spark are larger for SF75.

Figure 23 shows the execution time growth for Presto. Similar to Spark, Presto shows very stable growth in execution time, closely corresponding to the growth in scale. There is no noticeable overhead, and the absolute execution times are the lowest among all of the frameworks.

5.1.2 ORC

Figure 24: Average execution times for ORC format, SF1
Figure 25: Average execution times for ORC format, SF10

Figure 26: Average execution times for ORC format, SF20
Figure 27: Average execution times for ORC format, SF75

Figures 24-27 show the charts for average execution times for ORC format data, scale factors 1, 10, 20, and 75.

Figure 24 shows the results for ORC for scale factor 1. Similar to the text results, Presto is by far the fastest on scale factor 1. Spark is the second fastest on almost all of the queries, except for Q1.1 and Q2.1, where Hive LLAP was faster. Hive Tez was the slowest on all of the queries.

Figure 25 shows the results for ORC for scale factor 10. The results for SF 10 differ substantially, with Spark and Presto being very fast, Presto being only slightly faster. A striking change from the text format is that the Hive LLAP execution times are very similar to Tez, being only slightly faster on most queries. Hive Tez is still the slowest, with a noticeably slow start on Q1.1.

Figure 26 shows the results for ORC for scale factor 20. The execution time growth is similar to the previous chart, with Presto being the fastest or tying with Spark. The difference is that Hive LLAP is slightly slower than Hive Tez on some of the executions. However, Hive Tez is twice as slow on Q1.1.

Figure 27 shows the results for ORC for scale factor 75. There are some interesting differences to note compared to scale factor 20. Spark is the fastest on Q2.1, Q2.2, Q2.3, Q4.2, and Q4.3. Presto is still faster on Q1.1 - Q1.3 and Q3.1 - Q4.1. Execution times for Hive Tez and Hive LLAP are similar, with LLAP being slightly faster on all of the queries.

Figure 28: Execution time growth, ORC format, Hive Tez
Figure 29: Execution time growth, ORC format, Hive LLAP

Figure 30: Execution time growth, ORC format, Spark
Figure 31: Execution time growth, ORC format, Presto

Figures 28-31 show the execution time growth for each framework on ORC format data.

Figure 28 shows the execution time growth for Hive Tez on ORC. The growth pattern is very similar to the text format, but the rate is more stable. The execution times follow a very similar pattern on SF10, SF20, and SF75, without sudden spikes.

Figure 29 shows the execution time growth for Hive LLAP on ORC. Most of the queries have a very stable rate of growth, with Q1.2 and Q1.3 growing slowly, and Q2.1, Q3.1, Q4.1, and Q4.2 taking much longer, growing at a rate corresponding to the data volume increase.

Figure 30 shows the execution time growth for Spark on ORC. There are a few interesting features in this chart. The growth is very slow across SF1, SF10, and SF20, much slower than the growth rate of the data volume. This trend continues for Q1.1 - Q2.3, Q4.2, and Q4.3 up to SF75, showing a very small increase. In contrast, Q3.1 - Q4.1 show a very sharp surge in execution time for SF75.

Figure 31 shows the execution time growth for Presto on ORC. Presto shows the most consistent growth pattern with respect to data volume growth. The differences in individual query times amplify, but the pattern is the same for all scales.

5.1.3 Parquet

Figure 32: Average execution times for Parquet, SF1
Figure 33: Average execution times for Parquet, SF10

Figure 34: Average execution times for Parquet, SF20
Figure 35: Average execution times for Parquet, SF75

Figures 32-35 show the charts for average execution times for Parquet data, scale factors 1, 10, 20, and 75.

Figure 32 shows the results for Parquet for scale factor 1. There are several interesting features in this chart. First, Spark and Presto show considerably better execution times compared to Hive. On Q1.1 - Q2.3 Presto is faster than or tied with Spark, with the largest difference on Q1.1. However, on Q3.1 - Q4.3 the situation is reversed, with Spark being the fastest.

Figure 33 shows the results for Parquet for scale factor 10. The trend from SF 1 continues, with Spark and Presto being the definite leaders, but now Presto starts to fall behind. On Q1.1 - Q1.3 it still maintains the lead, but on all other queries Spark is clearly faster.

Figure 34 shows the results for Parquet for scale factor 20. The same pattern continues, as Spark execution times grow much more slowly with the increased data volume. Only Q1.1 shows Presto being faster than Spark.

Figure 35 shows the results for Parquet for scale factor 75. Starting from this chart Spark is the leader on all queries. In addition, Presto execution times are close to Hive LLAP on Q2.2 and Q2.3, for the first time in the Parquet experiments.

Figure 36: Execution time growth, Parquet, Hive Tez
Figure 37: Execution time growth, Parquet, Hive LLAP

Figure 38: Execution time growth, Parquet, Spark
Figure 39: Execution time growth, Parquet, Presto

Figures 36-39 show the execution time growth for each framework on Parquet data. The data size growth corresponds to the scale factors 1, 10, 20, and 75. The charts for Hive and Presto show a similar picture, with growth that corresponds to the data volume growth. Spark, however, shows a radically different picture: its execution time grows much more slowly, showing only a slight increase as the data volume grows.

5.1.4 Spark Parquet and Presto ORC

Presto shows the best times on ORC and Spark shows the best times on Parquet, so in this section the two are compared directly side by side.

Figure 40: Spark Parquet and Presto ORC average execution times for SF1
Figure 41: Spark Parquet and Presto ORC average execution times for SF10

Figure 42: Spark Parquet and Presto ORC average execution times for SF20
Figure 43: Spark Parquet and Presto ORC average execution times for SF75

Figures 40-43 show the average execution times for Spark on Parquet data and Presto on ORC data. The data size growth corresponds to the scale factors 1, 10, 20, and 75.

Figure 40 shows the comparison on SF1 data. Spark is slower on all of the queries, and it is evident that Q1.1 takes considerably more time for it to process. Q2.1 also takes more time than the others.

Figure 41 shows the comparison on SF10 data. The situation is quite different, as Spark is now faster on Q2.1 - Q4.3, with only the first three queries being faster on Presto. Q1.1 still shows the overhead that the first execution on a container introduces in Spark, while Presto does not display that behaviour.

Figure 42 shows the comparison on SF20 data. The overall picture is similar, but the time difference on Q2.1 - Q4.3 grows substantially.

Figure 43 shows the comparison on SF75 data. The difference becomes apparent, as Presto becomes much slower than Spark on Q1.2 - Q4.3.

5.2 File Format Comparison

Figure 44: Average execution times for Hive Tez, SF1
Figure 45: Average execution times for Hive Tez, SF10

Figure 46: Average execution times for Hive Tez, SF20
Figure 47: Average execution times for Hive Tez, SF75

Figures 44-47 show the average execution times for Hive Tez on ORC, Parquet, and text format data, grouped by scale factor. All of the charts show an overall improvement in execution time for the columnar formats.

Figure 44 shows the execution times for Hive Tez on ORC, Parquet, and text files for SF 1. For text and ORC the biggest difference in execution time is on Q1.1, Q1.3, and Q4.3, where the time for ORC is roughly half that for text. An interesting feature of this chart is that on Q3.2 and Q3.3 the ORC execution time is larger than for text. On most of the other queries the execution times are similar. Most of the queries are faster on Parquet than on text and ORC, with the exception of Q1.1 and Q4.1, where ORC is faster.

Figure 45 shows the execution times for Hive Tez on ORC, Parquet, and text files for SF 10. In contrast to the previous chart, the ORC execution time is smaller than text on every query; most of the queries show an improvement of 25-50%. Also in contrast to the previous chart, Parquet is slower than ORC on Q1.1 - Q1.3 and Q4.3. On the other queries the difference is similar in comparison to ORC, and greater in comparison to text.

Figure 46 shows the execution times for Hive Tez on ORC, Parquet, and text files for SF 20. The text and ORC comparison is similar to SF1: almost all queries are faster on ORC, with the exception of Q3.1. The difference in execution time is smaller than for SF10, but still present, especially on Q1.1 - Q1.3 and Q3.2 - Q4.2. The trend of Parquet execution times becoming slower than ORC continues, with Q3.1, Q4.1, and Q4.2 being slower on Parquet than ORC. The difference between Parquet and text closes, with Q3.1 being the fastest on text.

Figure 47 shows the execution times for Hive Tez on ORC, Parquet, and text files for SF 75. In contrast to the two previous charts, ORC is slower than text on Q3.2 - Q4.2. However, the trend of Q1.1 - Q1.3 being substantially faster on ORC continues. On almost every other query the execution times are comparable. For Parquet the situation is somewhat reversed, with only Q1.1 - Q1.3 and Q3.1 being slower on Parquet; the other queries show Parquet to be faster than ORC, in some cases quite substantially. For instance, Q2.1 - Q2.3 and Q3.3 - Q3.4 are around 50% faster on Parquet.

Figure 48: Average execution times for Hive LLAP, SF1
Figure 49: Average execution times for Hive LLAP, SF10

Figure 50: Average execution times for Hive LLAP, SF20
Figure 51: Average execution times for Hive LLAP, SF75

Figures 48-51 show the average execution times for Hive LLAP on ORC, Parquet, and text format data, grouped by scale factor.

Figure 48 shows the execution times for Hive LLAP on ORC, Parquet, and text files for scale factor 1. There are a few interesting features: similar to Tez, LLAP on ORC shows a marked improvement on Q1.1, especially compared to Parquet. However, on Q2.1 - Q2.3 and Q3.2 - Q3.4 ORC is slower than text, and on those queries as well as Q4.2 it is slower than Parquet.

Figure 49 shows the execution times for Hive LLAP on ORC, Parquet, and text files for scale factor 10. In contrast to the previous chart, the execution times for ORC are lower on all of the queries; for many queries the difference is around 50%. The only queries with similar execution times are Q2.1 and Q3.1. For Parquet there is an improvement, as on Q2.1 - Q4.2 it is tied with or faster than ORC and text. On Q1.1 - Q1.3 and Q4.3 ORC is still the fastest format.

Figure 50 shows the execution times for Hive LLAP on ORC, Parquet, and text files for scale factor 20. The situation is similar to Tez, as the queries are still faster on ORC than on text, with the exception of Q3.1. The trend for Parquet is also similar: on Q1.1 - Q1.3 ORC is faster, while on all other queries the difference is smaller but Parquet remains slightly faster.

Figure 51 shows the execution times for Hive LLAP on ORC, Parquet, and text files for scale factor 75. The pattern of execution times closely resembles the same scale factor for Tez (figure 47). Q1.1 - Q1.3 are substantially faster on ORC, while Q3.2 - Q4.3 are faster on Parquet.

Figure 52: Average execution times for Spark, SF1
Figure 53: Average execution times for Spark, SF10

Figure 54: Average execution times for Spark, SF20
Figure 55: Average execution times for Spark, SF75

Figures 52-55 show the average execution times for Spark on ORC, Parquet, and text format data, grouped by scale factor. There is a major difference from the same charts for Hive, as Parquet is clearly faster on every query.

Figure 52 shows the execution times for Spark on ORC, Parquet, and text files for scale factor 1. Parquet is faster on all of the queries; however, the time difference is smaller on Q1.1 - Q1.3 and Q3.3 - Q3.4, compared to the other queries, where the times for Parquet were two or more times smaller than for ORC.

Figure 53 shows the execution times for Spark on ORC, Parquet, and text files for scale factor 10. Similar to scale factor 1, the execution times for Parquet are smaller than for text and ORC, but the difference lies in the scale: here most of the ORC times are 70-80% faster than text, and Parquet improves on those times further.

Figure 54 shows the execution times for Spark on ORC, Parquet, and text files for scale factor 20. The same trend continues with scale factor 20, with ORC being 80-90% faster than text on most of the queries and Parquet being even faster, with a similar improvement over ORC.

Figure 55 shows the execution times for Spark on ORC, Parquet, and text files for scale factor 75. This chart has some interesting features. While some of the query times (Q1.1 - Q2.3 and Q4.2, Q4.3) still show a major difference in runtime on the scale of the previous charts, Q3.1 - Q4.1 show much less of a difference, with ORC times around a third of the times for text. The trend for Parquet is remarkable, however, as its times remain substantially smaller than both ORC and text.

Figure 56: Average execution times for Presto, SF1
Figure 57: Average execution times for Presto, SF10

Figure 58: Average execution times for Presto, SF20
Figure 59: Average execution times for Presto, SF75

Figures 56-59 show the average execution times for Presto on ORC, Parquet, and text format data, grouped by scale factor. The pattern of execution times is even more consistent than for Spark in terms of ORC and text. For all of the queries ORC is the fastest, being more than twice as fast as text and slightly faster than Parquet. This trend does not change across SF1, SF10, SF20, and SF75, showing a remarkably consistent picture.

5.3 Concurrent Execution

In this section the results for concurrent execution are presented. The first subsection groups them by the scale factor of the data, the second subsection groups them by framework, and the final subsection provides a direct comparison of Spark on Parquet and Presto on ORC.

5.3.1 Grouped by Scale Factor

Figure 60: Average execution times for 1 user, SF1
Figure 61: Average execution times for 2 users, SF1

Figure 62: Average execution times for 5 users, SF1

Figures 60-62 show the average execution times for SF1 data in ORC format, grouped by framework, for 1, 2, and 5 users.

Figure 60 shows the execution times for SF1 ORC data with a single user for all frameworks. Presto is by far the fastest on scale factor 1. Spark is the second fastest on almost all of the queries, except for Q1.1 and Q2.1, where Hive LLAP was faster. Hive Tez was the slowest on all of the queries.

Figure 61 shows the execution times for SF1 ORC data with two concurrent users for all frameworks. Presto maintains its status as the fastest framework with ease, being two or more times faster than Spark. Spark easily holds second place, with Hive LLAP and Hive Tez being much slower. This differs from the single-user execution, where Hive LLAP was closer to Spark on several queries and outperformed it on two.

Figure 62 shows the execution times for SF1 ORC data with five concurrent users for all frameworks. Once again, Presto is the indisputable leader, with execution times more than twice as fast as Spark. The difference in execution times between Spark and Hive LLAP has narrowed, with some of the queries having almost the same results (Q1.2, Q1.3, Q4.1, Q4.3). Hive Tez is the slowest. A notable feature is the overhead for both Hive Tez and LLAP on Q1.1, which takes more than three times the time that Spark requires for it.

Figure 63: Average execution times for 1 user, SF10
Figure 64: Average execution times for 2 users, SF10

Figure 65: Average execution times for 5 users, SF10

Figures 63-65 show the average execution times for SF10 data in ORC format, grouped by framework, for 1, 2, and 5 users.

Figure 63 shows the execution times for SF10 ORC data with a single user for all frameworks. The results for SF 10 differ substantially from SF 1 (figure 60), with Spark and Presto being very fast, Presto being only slightly faster. Hive Tez is still the slowest, with a noticeably slow start on Q1.1.

Figure 64 shows the execution times for SF10 ORC data with two concurrent users for all frameworks. There are several changes from the single user: the difference between Spark and Presto became even smaller, with Spark being slightly faster on several queries. In addition, Hive LLAP and Hive Tez have very similar execution times on most of the queries.

Figure 65 shows the execution times for SF10 ORC data with five concurrent users for all frameworks. There are a few key differences compared to the previous chart. First, Spark is twice as fast as Presto on most of the queries (Q2.1 - Q4.3), with Presto maintaining the lead on Q1.1 - Q1.3. Second, Hive LLAP execution times are clearly smaller than Hive Tez on every query.

Figure 66: Average execution times for 1 user, SF20
Figure 67: Average execution times for 2 users, SF20

Figure 68: Average execution times for 5 users, SF20

Figures 66-68 show the average execution times for SF20 data in ORC format, grouped by framework, for 1, 2, and 5 users.

Figure 66 shows the execution times for SF20 ORC data with a single user for all frameworks. Presto is the fastest or ties with Spark. The difference from scale factor 10 is that Hive LLAP is slightly slower than Hive Tez on some of the executions. However, Hive Tez is twice as slow on Q1.1.

Figure 67 shows the execution times for SF20 ORC data with two concurrent users for all frameworks. This chart has several key differences. Presto becomes slower than Spark on Q2.1 - Q4.3. Hive Tez execution times grow very fast compared to Hive LLAP, especially on Q3.2 - Q4.3.

Figure 68 shows the execution times for SF20 ORC data with five concurrent users for all frameworks. The results for five users reinforce the results for two users. Spark has comparable (Q1.1 - Q1.3) or faster (Q2.1 - Q4.3) execution times than Presto on all of the queries. Hive LLAP outperforms Hive Tez on every query, with the largest difference on Q3.1 - Q4.3.

5.3.2 Grouped by Framework

Figure 69: Average execution times for Hive Tez, SF1
Figure 70: Average execution times for Hive Tez, SF10

Figure 71: Average execution times for Hive Tez, SF20

Figures 69-71 show the average execution times for Hive Tez on ORC format data, grouped by scale factor.

Figure 69 shows the execution times for 1, 2, and 5 users for Hive Tez on ORC data at scale factor 1. A notable feature of this chart is the growth in execution times when going from 2 users to 5: some queries took more than five times longer than with two concurrent users.

Figure 70 shows the execution times for 1, 2, and 5 users for Hive Tez on ORC data at scale factor 10. There are several interesting features in this chart compared to scale factor 1. The growth in execution time when going from 1 to 2 users closely resembles the first chart; however, the situation with 5 users is very different. For some of the queries the growth is much slower, e.g. Q3.1 - Q4.3, while for other queries the execution time growth is more similar to scale factor 1, e.g. Q1.1 and Q2.1 - Q2.3. Q1.2 and Q1.3 seem to grow much more slowly than the others.

Figure 71 shows the execution times for 1, 2, and 5 users for Hive Tez on ORC data at scale factor 20. The pattern of growth becomes more stable compared to scale factor 10. The only outlier is Q3.1, which seems to require significantly more time to execute for 5 users, while most other queries have a more predictable execution time growth.

Figure 72: Average execution times for Hive LLAP, SF1
Figure 73: Average execution times for Hive LLAP, SF10

Figure 74: Average execution times for Hive LLAP, SF20

Figures 72-74 show the average execution times for Hive LLAP on ORC format data, grouped by scale factor.

Figure 72 shows the execution times for 1, 2, and 5 users for Hive LLAP on ORC data at scale factor 1. Most of the query execution times show a much more stable growth pattern compared to Hive Tez, with the exception of Q1.1, which seems to incur a lot of overhead from concurrent executions.

Figure 73 shows the execution times for 1, 2, and 5 users for Hive LLAP on ORC data at scale factor 10. There are differences from the previous chart. The growth in execution time is stable, including Q1.1. Queries Q1.2 and Q1.3 grow much more slowly than the others, while the remaining queries show execution time growth corresponding to the change in the number of concurrent users.

Figure 74 shows the execution times for 1, 2, and 5 users for Hive LLAP on ORC data at scale factor 20. The chart for scale factor 20 has the same growth patterns as the one for scale factor 10.

Figure 75: Average execution times for Spark, SF1
Figure 76: Average execution times for Spark, SF10

Figure 77: Average execution times for Spark, SF20

Figures 75-77 show the average execution times for Spark on ORC format data, grouped by scale factor.

Figure 75 shows the execution times for 1, 2, and 5 users for Spark on ORC data at scale factor 1. Spark shows a very different rate of growth compared to Hive. For the transition from a single user to two concurrent users Spark has almost no increase in runtime. For the increase to five concurrent users the rate of growth is smaller than for Hive on most of the queries.

Figure 76 shows the execution times for 1, 2, and 5 users for Spark on ORC data at scale factor 10. The trend is even more visible for scale factor 10: Spark shows only a slight increase in execution time going from one to two to five users.

Figure 77 shows the execution times for 1, 2, and 5 users for Spark on ORC data at scale factor 20. The execution times for scale factor 20 are even more impressive, showing almost no difference between the three runs.

Figure 78: Average execution times for Presto, SF1
Figure 79: Average execution times for Presto, SF10

Figure 80: Average execution times for Presto, SF20

Figures 78-80 show the average execution times for Presto on ORC format data, grouped by scale factor.

Figure 78 shows the execution times for 1, 2, and 5 users for Presto on ORC data at scale factor 1. On scale factor 1 Presto shows a growth pattern similar to Hive LLAP, with most queries having steady growth corresponding to the increase in concurrent users. The exception is Q1.1, which shows an overhead for concurrent execution.

Figure 79 shows the execution times for 1, 2, and 5 users for Presto on ORC data at scale factor 10. The execution time growth gets worse with the increase in volume; however, execution time still grows more slowly than the number of concurrent users.

Figure 80 shows the execution times for 1, 2, and 5 users for Presto on ORC data at SF 20. Scale factor 20 reinforces the pattern established by scale factor 10, but the execution time growth gets slightly worse for 5 users.

5.3.3 Presto ORC and Spark Parquet on concurrent users

Figure 81: Average execution times for Spark for concurrent users, SF10, Parquet
Figure 82: Average execution times for Presto for concurrent users, SF10, ORC

Figure 83: Total execution time growth for Spark and Presto for concurrent users, SF10

Figures 81 and 82 show the average execution times for Spark on Parquet and Presto on ORC for concurrent users on scale factor 10 data. The frameworks show very different growth patterns: Spark's execution times grow very slowly, while Presto shows similar growth up to 5 users but much slower execution for 10 and 25 users. Figure 83 shows the total execution time growth for Spark and Presto. It is evident that Spark shows much slower growth, while Presto has a dramatic increase in execution time.

6 Conclusions

In this section the conclusions and future work are presented. First, the conclusions for Single User Execution are presented, in which the performance of each framework with a single user is examined. Then the File Format Comparison results are outlined, in which the performance of every framework on the different file formats is examined. Finally, in Concurrent Execution the performance of the frameworks with multiple concurrent users is presented.

6.1 Single User Execution

Presto is the fastest framework for queries with a single user on ORC under most conditions, while Spark is the fastest on Parquet for a single user. Spark and Hive LLAP are second fastest on ORC in different situations. On text data Hive LLAP maintained a lead over Spark, especially on scale factor 75 (figure 19). On ORC data the picture is quite different: Presto was clearly the fastest on scale factors 1 and 75, but Spark competed with it closely on scale factors 10 and 20. Hive LLAP gained less from ORC than Spark and Presto, making it the third fastest on ORC. On Parquet the situation is substantially different, with Spark being the absolute leader. The difference is especially visible in figures 42 and 43: Presto execution times grow much faster than Spark, and at SF75 this becomes very apparent.

The execution time growth patterns on both text and ORC show a similar picture. For Hive Tez and Hive LLAP the query execution times vary more (figures 20, 21, 28, 29) than for Presto and Spark (figures 22, 23, 30, 31), which both show a stable execution time growth pattern consistent with the data volume growth as the scale factor increases. Some queries deviate from this pattern, having much slower execution time growth than others, especially on the ORC format. On Hive and Presto, Parquet (figures 36, 37, 39) shows a growth pattern similar to ORC (figures 28, 29, 31), while Spark shows slow growth compared to the other frameworks (figure 38), leading to the conclusion that Spark will perform better with growing data volume, as long as the cluster RAM can handle it.

Overall, the results show that single user execution was most performant with Spark on Parquet for SF10 - SF75 (figures 41, 42, and 43), and with Presto on ORC for SF1 (figure 40).

6.2 File Format Comparison

In terms of ORC and text execution time comparisons, Hive showed unexpected results. Hive Tez performed faster on ORC for scale factors 1, 10, and 20, but for scale factor 75 it showed better times on some of the queries on text files (figures 44 - 47). Hive LLAP showed better times for ORC on scale factors 10 and 20 (figures 49 and 50), but had mixed results for SF 1 and 75 (figures 48 and 51). Spark and Presto performed faster on ORC, with a substantial reduction in execution time, especially Spark on SF 10 and 20 (figures 52 - 59).

For Parquet the trends for Hive are similar to ORC, with the times for Parquet and ORC (figures 44 - 51) being similar in most cases. For Spark the situation is very different, as Parquet shows virtually no growth compared to text and ORC as the data volume grows (figures 52 - 55). Presto shows similar growth on ORC and Parquet, being very consistent (figures 56 - 59).

Overall, the decision on which format to pick depends heavily on the choice of framework. On ORC Presto showed the best execution times, while Spark showed the second best, although its growth pattern with scale was less favourable. On Parquet Spark easily outperformed the other frameworks and its execution time growth was much slower, leading to better performance on large data volumes. In addition, the data volume after compression can be taken into account, as shown in tables 1 - 3. At scale factor 75 the difference is already more than 40%, with ORC providing the smaller data volume. With strict constraints on data storage, the compression factor may well be a major part of the decision. To conclude, the experiments showed that Spark on Parquet is the most performant choice, but Presto on ORC can be a solid option under some conditions.

6.3 Concurrent Execution

For concurrent execution the fastest framework at extremely small scale (scale factor 1) was Presto, while Spark outperformed it on scale factors 10, 20, and 75 on ORC and Parquet. In addition to being the fastest, Spark shows a slow growth rate as the number of concurrent users increases, with only a slight increase in processing time and very small overhead. Presto was faster on small-scale data, but the overhead from concurrent users made it slower as data volume and concurrency grew. Spark shows very good isolation between users, as is evident from the execution time growth charts, with only a slight execution time increase for a growing number of concurrent users. The difference is especially evident when comparing Presto on ORC and Spark on Parquet side by side: total execution time greatly increased with more than 5 users for Presto, while Spark maintained a steady execution time even with 25 concurrent users. During the runs Presto appeared to encounter more resource management problems, with query times varying significantly. Spark, on the other hand, seemed to execute queries in roughly the same time without using all of the cluster resources for every query.

6.4 Future Work

There are several possible directions for future work. First, the resource usage of the queries could be examined in order to determine the possible bottlenecks that the different frameworks encounter. Second, the file formats could be explored in more detail, possibly including the RCFile38 or Avro39 formats in the comparison. Lastly, another avenue for future work is to assess the additional overhead from resource provisioning by YARN and include it in the results, which could affect the concurrency results.

38 https://cwiki.apache.org/confluence/display/Hive/RCFile
39 https://avro.apache.org/

References

[1] Big Data Statistics and Facts for 2017 report by Waterford Technologies. Accessed: 2018-05-12. https://www.waterfordtechnologies.com/big-data-interesting-facts/.

[2] DB-Engines Ranking. Accessed: 2018-05-12. https://db-engines.com/en/ranking.

[3] The Business Intelligence on Hadoop Benchmark Q4 2016 report by atScale. Accessed: 2017-12-24. http://info.atscale.com/atscale-business-intelligence-on-hadoop-benchmark.

[4] Xue R. SQL Engines for Big Data Analytics [G2 Pro gradu, diplomityö]; 2015. Available from: http://urn.fi/URN:NBN:fi:aalto-201512165719.

[5] Li X, Zhou W. Performance Comparison of Hive, Impala and Spark SQL. In: Proceedings of the 2015 7th International Conference on Intelligent Human-Machine Systems and Cybernetics - Volume 01. IHMSC '15. Washington, DC, USA: IEEE Computer Society; 2015. p. 418–423. Available from: http://dx.doi.org/10.1109/IHMSC.2015.95.

[6] Li M, Tan J, Wang Y, Zhang L, Salapura V. SparkBench: A Comprehensive Benchmarking Suite for in Memory Data Analytic Platform Spark. In: Proceedings of the 12th ACM International Conference on Computing Frontiers. CF '15. New York, NY, USA: ACM; 2015. p. 53:1–53:8. Available from: http://doi.acm.org/10.1145/2742854.2747283.

[7] Ivanov T, Beer M. Evaluating Hive and Spark SQL with BigBench. CoRR. 2015;abs/1512.08417. Available from: http://arxiv.org/abs/1512.08417.

[8] Ousterhout K, Rasti R, Ratnasamy S, Shenker S, Chun BG. Making Sense of Performance in Data Analytics Frameworks. In: 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). Oakland, CA: USENIX Association; 2015. p. 293–307. Available from: https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/ousterhout.

[9] Håkansson A. Portal of research methods and methodologies for research projects and degree projects. In: Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering (FECS). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp); 2013. p. 1.

[10] Apache Hadoop project. Accessed: 2017-11-24. http://hadoop.apache.org/.

[11] Dorband J. Commodity Cluster Computing for Remote Sensing Applications using Red Hat LINUX; 2003.

[12] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Communications of the ACM. 2008;51(1):107–113.

[13] Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop Distributed File System. In: Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on. IEEE; 2010. p. 1–10.

[14] Ghemawat S, Gobioff H, Leung ST. The Google File System. vol. 37. ACM; 2003.

[15] Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, et al. Apache Hadoop YARN: Yet another resource negotiator. In: Proceedings of the 4th annual Symposium on Cloud Computing. ACM; 2013. p. 5.

[16] Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment. 2009;2(2):1626–1629.

[17] Huai Y, Chauhan A, Gates A, Hagleitner G, Hanson EN, O'Malley O, et al. Major Technical Advancements in Apache Hive. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD '14. New York, NY, USA: ACM; 2014. p. 1235–1246. Available from: http://doi.acm.org/10.1145/2588555.2595630.

[18] Saha B, Shah H, Seth S, Vijayaraghavan G, Murthy A, Curino C. Apache Tez: A unifying framework for modeling and building data processing applications. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM; 2015. p. 1357–1369.

[19] Song H, Dharmapurikar S, Turner J, Lockwood J. Fast hash table lookup using extended bloom filter: an aid to network processing. ACM SIGCOMM Computer Communication Review. 2005;35(4):181–192.

[20] Shainman M. The Power of Presto Open Source for the Enterprise. Teradata; 2016.

[21] Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, et al. Apache Spark: A Unified Engine for Big Data Processing. Commun ACM. 2016 Oct;59(11):56–65. Available from: http://doi.acm.org/10.1145/2934664.

[22] Apache Spark Survey 2016. Accessed: 2017-12-12. http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/2016_Spark_Survey/2016_Spark_Infographic.pdf?t=1520448521980.

[23] Kleppmann M. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media; 2017. Available from: https://books.google.se/books?id=BM7woQEACAAJ.

[24] O'Neil PE, O'Neil EJ, Chen X. The Star Schema Benchmark (SSB); 2009.
