Flare: Optimizing Apache Spark with Native Compilation for Scale-Up Architectures and Medium-Size Data
Grégory M. Essertel1, Ruby Y. Tahboub1, James M. Decker1, Kevin J. Brown2, Kunle Olukotun2, Tiark Rompf1
1Purdue University, 2Stanford University
{gesserte,rtahboub,decker31,tiark}@purdue.edu, {kjbrown,kunle}@stanford.edu
Abstract

In recent years, Apache Spark has become the de facto standard for big data processing. Spark has enabled a wide audience of users to process petabyte-scale workloads due to its flexibility and ease of use: users are able to mix SQL-style relational queries with Scala or Python code, and have the resultant programs distributed across an entire cluster, all without having to work with low-level parallelization or network primitives.

However, many workloads of practical importance are not large enough to justify distributed, scale-out execution, as the data may reside entirely in the main memory of a single powerful server. Still, users want to use Spark for its familiar interface and tooling. In such scale-up scenarios, Spark's performance is suboptimal, as Spark prioritizes handling data size over optimizing the computations on that data. For such medium-size workloads, performance may still be of critical importance if jobs are computationally heavy, need to be run frequently on changing data, or interface with external libraries and systems (e.g., TensorFlow for machine learning).

We present Flare, an accelerator module for Spark that delivers order of magnitude speedups on scale-up architectures for a large class of applications. Inspired by query compilation techniques from main-memory database systems, Flare incorporates a code generation strategy designed to match the unique aspects of Spark and the characteristics of scale-up architectures, in particular processing data directly from optimized file formats and combining SQL-style relational processing with external frameworks such as TensorFlow.

1 Introduction

Modern data analytics applications require a combination of different programming paradigms, spanning relational, procedural, and map-reduce-style functional processing. Systems like Apache Spark [8] have gained enormous traction thanks to their intuitive APIs and ability to scale to very large data sizes, thereby commoditizing petabyte-scale (PB) data processing for large numbers of users. But thanks to its attractive programming interface and tooling, people are also increasingly using Spark for smaller workloads. Even for companies that also have PB-scale data, there is typically a long tail of tasks of much smaller size, which make up a very important class of workloads [17, 44]. In such cases, Spark's performance is suboptimal. For such medium-size workloads, performance may still be of critical importance if there are many such jobs, if individual jobs are computationally heavy, or if they need to be run very frequently on changing data. This is the problem we address in this paper. We present Flare, an accelerator module for Spark that delivers order of magnitude speedups on scale-up architectures for a large class of applications. A high-level view of Flare's architecture can be seen in Figure 1b.

Figure 1: Flare system overview: (a) architecture of Spark adapted from [8]; (b) Flare generates code for entire queries, eliminating the RDD layer, and orchestrating parallel execution optimized for shared memory architectures. Flare also integrates with TensorFlow.

Inspiration from In-Memory Databases. Flare is based on native code generation techniques that have been pioneered by in-memory databases (e.g., HyPer [35]). Given the multitude of front-end programming paradigms, it is not immediately clear that looking at relational databases is the right idea. However, we argue that this is indeed the right strategy: despite the variety of front-end interfaces, contemporary Spark is, at its core, an SQL engine and query optimizer [8]. Rich front-end APIs are increasingly based on DataFrames, which are internally represented very much like SQL query plans. DataFrames provide a deferred API, i.e., calls only construct a query plan, but do not execute it immediately. Thus, front-end abstractions do not interfere with query optimization. Previous generations of Spark relied critically on arbitrary UDFs, but this is becoming less and less of a concern as more and more functionality is implemented on top of DataFrames.

With main-memory databases in mind, it follows that one may look to existing databases for answers on improving Spark's performance. A piece of low-hanging fruit seems to be simply translating all DataFrame query plans to an existing best-of-breed main-memory database (e.g., HyPer [35]). However, such systems are full database systems, not just query engines, and would require data to be stored in a separate, internal format specific to the external system. As data may be changing rapidly, loading this data into an external system is undesirable, for reasons of both storage size and the inherent overhead associated with data loading. Moreover, it is unclear how such a system would retain the ability to interact with other systems (e.g., TensorFlow [3] for machine learning).

Another logical alternative would be to build a new system which is overall better optimized than Spark for the particular use case of medium-size workloads and scale-up architectures. While some effort has been made in this vein (e.g., Tupleware [17]), such systems forfeit the ability to leverage existing libraries and frameworks built on top of Spark, including the associated tooling. Whereas a system that competes with Spark must replicate all of this functionality, our goal instead was to build a drop-in module capable of handling workloads for which Spark is not optimized, preferably using methodologies seen in these best-of-breed, external systems (e.g., HyPer).

Native Query Compilation. Indeed, the need to accelerate CPU computation prompted the development of a code generation engine that ships with Spark since version 1.4, called Tungsten [8]. However, despite following some of the methodology set forth by HyPer, there are a number of challenges facing such a system, which cause Tungsten to yield suboptimal results by comparison. First, due to the fact that Spark resides in a Java-based ecosystem, Tungsten generates Java code, which (somewhat obviously) yields inferior performance to native execution as seen in HyPer. However, generating native code within Spark poses a challenge of interfacing with the JVM when dealing with, e.g., data loading. Another challenge comes from Spark's reliance on resilient distributed datasets (RDDs) as its main internal execution abstraction. Mapping query operators to RDDs imposes boundaries between code generation regions, which incurs nontrivial runtime overhead. Finally, having a code generation engine capable of interfacing with external frameworks and libraries, particularly machine-learning oriented frameworks like TensorFlow and PyTorch, is also challenging due to the wide variety of data representations which may be used.

End-to-End Datapath Optimization. In solving the problem of generating native code and working within the Java environment, we focus specifically on the issue of data processing. When working with data directly from memory, it is possible to use the Java Native Interface (JNI) and operate on raw pointers. However, when processing data directly from files, fine-grained interaction between decoding logic in Java and native code would be required, which is both cumbersome and presents high overhead. To resolve this problem, we elect to reimplement file processing for common formats in native code as well. This provides a fully compiled data path, which in turn provides significant performance benefits. While this does present a problem in calling Java UDFs (user-defined functions) at runtime, we can simply fall back to Spark's existing execution in such a case, as these instances appear rare in most use cases considered. We note in passing that existing work (e.g., Tupleware [17], Froid [40]) has presented other solutions for this problem which could be adopted within our method as well.

Fault Tolerance on Scale-Up Architectures. In addition, we must overcome the challenge of working with Spark's reliance on RDDs. For this, we propose a simple solution: when working in a scale-up, shared memory environment, remove RDDs and bypass all fault tolerance mechanisms, as they are not needed in such architectures (seen in Figure 1b). The presence of RDDs fundamentally limits the scope of query compilation to individual query stages, which prevents optimization at the granularity of full queries. Without RDDs, we compile whole queries and eliminate the preexisting boundaries across query stages. This also enables the removal of artifacts of distributed architectures, such as Spark's use of HashJoinExchange operators even if the query is run on a single core.

Interfacing with External Code. Looking now to the issue of having a robust code generation engine capable of interfacing with external libraries and frameworks within Spark, we note that most performance-critical external frameworks are also embracing deferred APIs. This is particularly true for machine learning frameworks, which are based on a notion of execution graphs. This includes popular frameworks like TensorFlow [3], Caffe [24], and ONNX [1], though this list is far from exhaustive. As such, we focus on frameworks with APIs that follow this pattern. Importantly, many of these systems already have a native execution backend, which allows for speedups by generating all required glue code and keeping the entire data path within native code.

Contributions. The main intellectual contribution of this paper is to demonstrate and analyze some of the underlying issues contained in the Spark runtime, and to show that the HyPer query compilation model must be adapted in certain ways to achieve good results in Spark (and, most likely, in systems with a similar architecture, like Flink [16]), most importantly to eliminate codegen boundaries as much as possible. For Spark, this means generating code not at the granularity of operator pipelines, but compiling whole Catalyst operator trees at once (which may include multiple SQL queries and subqueries), generating specialized code for data structures, for file loading, etc.

We present Flare, an accelerator module for Spark that solves these (and other) challenges which currently prevent Spark from achieving optimal performance on scale-up architectures for a large class of applications. Building on query compilation techniques from main-memory database systems, Flare incorporates a code generation strategy designed to match the unique aspects of Spark and the characteristics of scale-up architectures, in particular processing data directly from optimized file formats and combining SQL-style relational processing with external libraries such as TensorFlow.

This paper makes the following specific contributions:

• We identify key impediments to performance for medium-sized workloads running on Spark on a single machine in a shared memory environment, and present a novel code generation strategy able to overcome these impediments, including the overhead inherent in boundaries between compilation regions (Section 2).

• We present Flare's architecture and discuss some implementation choices. We show how Flare is capable of optimizing data loading and dealing with parallel execution, as well as working efficiently on NUMA systems. This is a result of Flare compiling whole queries, as opposed to individual query stages, which results in an end-to-end optimized data path (Section 3).

• We show how Flare's compilation model efficiently extends to external user-defined functions. Specifically, we discuss Flare's ability to integrate with other frameworks and domain-specific languages, in particular machine learning frameworks like TensorFlow that provide compilation facilities of their own (Section 4).

• We evaluate Flare in comparison to Spark on TPC-H, reducing the gap to best-of-breed relational query engines, and on benchmarks involving external libraries. In both settings, Flare exhibits order-of-magnitude speedups. Our evaluation spans single-core, multi-core, and NUMA targets (Section 5).

Finally, we survey related work in Section 6, and draw conclusions in Section 7.

2 Background

Apache Spark [55, 56] is today's most widely-used big data framework. The core programming abstraction comes in the form of an immutable, implicitly distributed collection called a resilient distributed dataset (RDD). RDDs serve as high-level programming interfaces, while also transparently managing fault tolerance.

We present a short example using RDDs (from [8]), which counts the number of errors in a (potentially distributed) log file:

  val lines = spark.sparkContext.textFile("...")
  val errors = lines.filter(s => s.startsWith("ERROR"))
  println("Total errors:" + errors.count())

Spark's RDD abstraction provides a deferred API: in the above example, the calls to textFile and filter merely construct a computation graph. In fact, no actual computation occurs until errors.count is invoked.

The directed, acyclic computation graph represented by an RDD describes the distributed operations in a rather coarse-grained fashion: at the granularity of map, filter, etc. While this level of detail is enough to enable demand-driven computation, scheduling, and fault tolerance via selective recomputation along the "lineage" of a result [55], it does not provide a full view of the computation applied to each element of a dataset. For example, in the code snippet shown above, the argument to lines.filter is a normal Scala closure. This makes integration between RDDs and arbitrary external libraries much easier, but it also means that the given closure must be invoked as-is for every element in the dataset.

As such, the performance of RDDs suffers from two limitations: first, limited visibility for analysis and optimization (especially standard optimizations, e.g., join reordering for relational workloads); and second, substantial interpretive overhead, i.e., function calls for each processed tuple. Both issues have been ameliorated with the introduction of the Spark SQL subsystem [8].

Figure 2: (a) The cost of the join lineitem ⋈ orders (select * from lineitem, orders where l_orderkey = o_orderkey) with different operators: sort-merge join, 14,937 ms; broadcast-hash join, 4,775 ms (of which 2,232 ms in the exchange); Flare's in-memory hash join, 136 ms. (b) Spark's hash join plan shows two separate code generation regions, which communicate through Spark's runtime system.

2.1 The Power of Multi-Stage APIs

The chief addition of Spark SQL is an alternative API based on DataFrames. A DataFrame is conceptually equivalent to a table in a relational database, i.e., a collection of rows with named columns. However, like RDDs, the DataFrame API records operations, rather than computing the result. Therefore, we can write the same example as before:

  val lines = spark.read.textFile("...")
  val errors = lines.filter($"value".startsWith("ERROR"))
  println("Total errors:" + errors.count())

Indeed, this is quite similar to the RDD API in that only the call to errors.count will trigger actual execution. Unlike RDDs, however, DataFrames capture the full computation/query to be executed. We can obtain the internal representation using errors.explain(), which produces the following output:

  == Physical Plan ==
  *Filter StartsWith(value#894, ERROR)
  +- *Scan text [value#894] Format: ...TextFileFormat@18edbdbb,
     InputPaths: ..., ReadSchema: struct

2.2 Catalyst and Tungsten

With the addition of Spark SQL, Spark also introduced a query optimizer known as Catalyst [8]. We elide the details of Catalyst's optimization strategy, as they are largely irrelevant here. After Catalyst has finished optimizing a query plan, Spark's execution backend, known as Tungsten, takes over. Tungsten aims to improve Spark's performance by reducing the allocation of objects on the Java Virtual Machine (JVM) heap, controlling off-heap memory management, employing cache-
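To make the notion of a deferred (multi-stage) API concrete, the following is a minimal Python sketch, purely illustrative and not Spark's actual implementation: transformations like filter only record a plan node, and nothing runs until an action like count() is invoked, mirroring the RDD and DataFrame examples above.

```python
# Minimal sketch of a deferred collection API (hypothetical illustration).
# Transformations build a plan; only an action such as count() executes it.

class Plan:
    """A node in the recorded computation graph."""
    def __init__(self, op, parent=None, arg=None):
        self.op, self.parent, self.arg = op, parent, arg

class DataFrame:
    def __init__(self, plan):
        self.plan = plan

    @staticmethod
    def source(rows):
        return DataFrame(Plan("source", arg=rows))

    def filter(self, pred):
        # Deferred: record the operation, do not run it.
        return DataFrame(Plan("filter", parent=self.plan, arg=pred))

    def count(self):
        # Action: walk the recorded plan and execute it as one loop.
        filters, node = [], self.plan
        while node.op != "source":
            filters.append(node.arg)
            node = node.parent
        n = 0
        for row in node.arg:
            if all(pred(row) for pred in filters):
                n += 1
        return n

lines = DataFrame.source(["ERROR disk", "ok", "ERROR net"])
errors = lines.filter(lambda s: s.startswith("ERROR"))
print("Total errors:", errors.count())  # -> Total errors: 2
```

Because the full plan is available before execution, an engine is free to inspect, optimize, and compile it, which is exactly the property that both Spark SQL and Flare exploit.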
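The whole-query compilation idea described above can likewise be sketched in a few lines. Flare emits C code compiled with GCC; this hypothetical Python sketch substitutes Python source generation purely for illustration, showing how a recorded plan can be collapsed into one fused loop instead of being interpreted operator-by-operator with a function call per tuple.

```python
# Sketch of whole-query code generation (illustrative only; Flare
# generates C compiled with GCC, not Python). A filter+count query is
# emitted as source for a single fused loop and compiled once.

def compile_filter_count(predicate_src):
    """Generate and compile a fused filter+count over a list of rows.

    predicate_src is a Python expression over the variable `row`,
    e.g. "row.startswith('ERROR')".
    """
    src = (
        "def query(rows):\n"
        "    n = 0\n"
        "    for row in rows:\n"
        f"        if {predicate_src}:\n"
        "            n += 1\n"
        "    return n\n"
    )
    env = {}
    exec(compile(src, "<generated>", "exec"), env)
    return env["query"]

query = compile_filter_count("row.startswith('ERROR')")
print(query(["ERROR disk", "ok", "ERROR net"]))  # -> 2
```

The generated loop contains no per-tuple function calls and no boundaries between operators, which is the effect Flare achieves for entire Catalyst operator trees.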
Query  Postgres  Spark SQL  HyPer  Flare
Q1     72599     17692      498    550
Q2     5064      21062      47     98
Q3     20790     28039      969    481
Q4     6032      20693      725    544
Q5     17904     40008      821    527
Q6     9071      2826       207    212
Q7     16408     28262      804    596
Q8     20003     55346      467    1016
Q9     31728     98216      1782   2688
Q10    16409     27585      846    1429
Q11    4729      16193      111    53
Q12    15755     15431      460    643
Q13    16152     25632      3307   3499
Q14    9369      5913       283    256
Q15    9607      11538      227    175
Q16    5662      30973      1112   679
Q17    66254     66089      485    1924
Q18    30104     47542      3061   1150
Q19    12852     11062      1666   2236
Q20    22795     27448      345    836
Q21    184736    78896      1349   1533
Q22    6900      8687       177    241

Figure 10: Performance comparison (running time in ms) of Postgres, HyPer, Spark SQL, and Flare at SF10.
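The per-query speedups of Flare over Spark SQL follow directly from the Figure 10 measurements; a small Python helper (illustrative only, not part of either system) recomputes a few of them:

```python
# Recompute Flare's speedup over Spark SQL from the Figure 10
# measurements (running time in ms, TPC-H SF10). Illustrative helper.

spark_sql = {"Q1": 17692, "Q6": 2826, "Q2": 21062}
flare     = {"Q1": 550,   "Q6": 212,  "Q2": 98}

def speedup(q):
    return spark_sql[q] / flare[q]

for q in ("Q1", "Q6", "Q2"):
    print(q, round(speedup(q), 1))
```

These ratios reproduce the roughly 32x (Q1), 13x (Q6), and 200x (Q2) figures discussed in the accompanying text.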
…in particular by reordering joins. While Spark's Catalyst optimizer [8] is also cost-based, the default configuration does not perform any kind of join reordering. Hence, we match the join ordering of the query plans in Spark SQL and Flare with HyPer's, with a small number of exceptions: the original join ordering given in the TPC-H reference outperformed the HyPer plans for Q5, Q9, Q10, and Q11 in Spark SQL, and for Q10 in Flare. For these queries, we kept the original join ordering as is. For Spark SQL, this difference is mainly due to Catalyst picking sort-merge joins over hash joins. It is worth pointing out that the HyPer and Postgres plans can use indexes on primary keys, which may give them an additional advantage.

Figure 10 gives the absolute execution time of Postgres, HyPer, Spark SQL, and Flare for all TPC-H queries. For all systems, data loading time is excluded, i.e., only execution time is reported. In Spark and Flare, we use persist to ensure that the data is loaded from memory. At first glance, the performance of Flare and HyPer lies within the same range, and both notably outperform Postgres and Spark in all queries. Similarly, Spark's performance is comparable to Postgres's in most of the queries. Unlike the other systems, Postgres does not compile queries at runtime, and relies on the Volcano model [23] for query evaluation, which incurs significant overhead. Hence, we can see that Spark's query compilation does not provide a significant advantage over a standard interpreted query engine on most queries.

At a closer look, Flare outperforms Spark SQL in the aggregate queries Q1 and Q6 by 32x and 13x, respectively. We observe that Spark is 200x slower than Flare in nested queries (e.g., Q2). After examining the execution plan of Q2, we found that Catalyst's plan does not detect all patterns that help avoid re-computation, e.g., a table which has been previously scanned or sorted. In join queries, e.g., Q5, Q10, Q14, etc., Flare is faster than Spark SQL by 19x-76x. Likewise, in the join variants outer join (Q13), semi-join (Q21), and anti-join (Q22), Flare is faster by 7x, 51x, and 36x, respectively.

The single-core performance gap between Spark SQL and Flare is attributed to the bottlenecks identified in Sections 2.3 and 2.4: first, the overhead associated with low-level data access on the JVM; second, Spark SQL's distributed-first strategy, which employs costly distributed operators, e.g., sort-merge join and broadcast hash join, even when running on a single core; and third, internal bottlenecks in in-memory processing, the overhead of RDD operations, and communication through Spark's runtime system. By compiling entire queries, instead of isolated query stages, Flare effectively avoids these bottlenecks.

HyPer [35] is a state-of-the-art compiled relational query engine. A precursory look shows that Flare is faster than HyPer by 10%-60% in Q4-Q5, Q7, and Q14-Q16, and faster by 2x in Q3, Q11, and Q18. On the other hand, HyPer is faster than Flare by 20%-60% in Q9, Q10, Q12, and Q21, and faster by 2x-4x in Q2, Q8, Q17, and Q20. This performance gap is, in part, attributed to (1) HyPer's use of specialized operators like GroupJoin [32], and (2) its use of indexes on primary keys, as seen in Q2, Q8, etc., whereas Flare (like Spark SQL) currently does not support indexes.

In summary, while both Flare and HyPer generate native code at runtime, subtle implementation differences in query evaluation and code generation can result in faster code. For instance, HyPer uses proper decimal precision numbers, whereas Flare follows Spark in using double precision floating point values, which are native to the architecture. Furthermore, HyPer generates LLVM code, whereas Flare generates C code which is compiled with GCC.

Compilation Time. We compared the compilation time for each TPC-H query on Spark and Flare (results omitted). For Spark, we measured the time to generate the physical plan, which includes Java code generation and compilation. We do not quantify JVM-internal JIT compilation, as this is hard to measure, and code may be recompiled multiple times. For Flare, we measured C code generation and compilation with GCC. Both systems spend a similar amount of time on code generation and compilation. Compilation time depends on the complexity of the query, but is less than 1s for all queries, i.e., well in line with interactive, exploratory usage.

Parallel Scaling. In this experiment, we compare the scalability of Spark SQL and Flare. The experiment focuses on absolute performance and on the Configuration that Outperforms a Single Thread (COST) metric proposed by McSherry et al. [30]. We pick four queries that represent aggregate and join variants.

Figure 11 presents speedup numbers for Q6, Q13, Q14, and Q22 when scaled up to 32 cores. At first glance, Spark appears to have good speedups in Q6 and Q13, whereas Flare's Q6 speedup drops at high core counts. However, examining the absolute running times, Flare is faster than Spark SQL by 14x. Furthermore, Spark SQL needs an estimated 16 cores in Q6 to match the performance of Flare's single core. In scaling up Q13, Flare is consistently faster by 6x-7x up to 16 cores. Similarly, Flare continues to outperform Spark by 12x-23x in Q14 and by 13x-22x in Q22.

Figure 11: Scaling up Flare and Spark SQL in SF20, without NUMA optimizations: Spark has good nominal speedups (top), but Flare has better absolute running time in all configurations (bottom). For both systems, NUMA effects for 32 cores are clearly visible (benchmark machine: 96 cores, 3 TB RAM across 4 CPU sockets, i.e., 24 cores and 750 GB each).

What appears to be good scaling for Spark actually reveals that the runtime incurs significant overhead. In particular, we would expect Q6 to become memory-bound as we increase the level of parallelism. In Flare we can directly observe this effect as a sharp drop from 16 to 32 cores. Since our machine has 18 cores per socket, with 32 cores we start accessing non-local memory (NUMA). The reason Spark scales better is that its internal overhead, which does not contribute anything to query evaluation, is trivially parallelizable and hides the memory bandwidth effects. In summary, Flare scales as expected for both memory- and CPU-bound workloads, and reflects the hardware characteristics of the workload, which means that query execution takes good advantage of the available resources – with the exception of multiple CPU sockets, a problem we address next.

As a next step, we evaluate NUMA optimizations in Flare and show that these enable us to scale queries like Q6 to higher core counts. In particular, we pin threads to individual cores and lay out memory such that most accesses are to the local memory region attached to each socket (Figure 12). Q6 performs better when the threads are dispatched on different sockets. This is due to the computation being bounded by the memory bandwidth; when dividing the threads among multiple sockets, we multiply the available bandwidth proportionally. However, as Q1 is more computation-bound, dispatching the threads on different sockets has little effect. For both Q1 and Q6, we see scaling up to the capacity of the machine (up to 72 cores), with a maximum speedup of 46x and 58x for Q1 and Q6, respectively.

Figure 12: Scaling up Flare for SF100 with NUMA optimizations on different configurations: threads pinned to one, two, or four sockets. The speedups relative to a single thread are shown on top of the bars (benchmark machine: 72 cores, 1 TB RAM across 4 CPU sockets, i.e., 18 cores and 250 GB each).

Optimized Data Loading. An often overlooked part of data processing is data loading. Flare contains an optimized implementation for both CSV files and the columnar Apache Parquet format.2 We show loading times for each of the TPC-H tables in Table 1.

2All Parquet files tested were uncompressed and encoded using PLAIN encoding.

Table      #Tuples    Postgres  HyPer   Spark   Spark    Flare  Flare
                      CSV       CSV     CSV     Parquet  CSV    Parquet
CUSTOMER   1500000    7067      1102    11664   9730     329    266
LINEITEM   59986052   377765    49408   471207  257898   11167  10668
NATION     25         1         8       106     110      <1     <1
ORDERS     15000000   60214     33195   85985   54124    2028   1786
PART       2000000    8807      1393    11154   7601     351    340
PARTSUPP   8000000    37408     5265    28748   17731    1164   1010
REGION     5          1         8       102     90       <1     <1
SUPPLIER   100000     478       66      616     522      28     16

Table 1: Loading time in ms for TPC-H SF10 in Postgres, HyPer, Spark, and Flare.

Full table read. From the data in Table 1, we see that in both Spark and Flare, the Parquet file readers outperform the CSV file readers in most scenarios, despite this being a worst-case scenario for Parquet. Spark's CSV reader was faster in only one case: reading nation, a table with only 25 rows. In all other cases, Spark's Parquet reader was 1.33x-1.81x faster. However, Flare's highly optimized CSV reader operates at a level of performance closer to its Parquet reader, with all tables except supplier showing less than a 1.25x speedup from using Parquet.

Performing queries. Figure 13 shows the speedups gained from executing queries without preloading data, for both systems. Whereas reading an entire table gives Spark and Flare marginal speedups, reading just the required data gives speedups in the range of 2x-144x (Q16 remained the same) for Spark and 60%-14x for Flare. Across systems, Flare's Parquet reader demonstrated between a 2.5x-617x speedup over Spark's Parquet reader, and between 34x-101x over Spark's CSV reader. While the speedup over Spark lessens slightly at higher scale factors, we found that Flare's Parquet reader consistently performed on average at least one order of magnitude faster on each query, regardless of scale factor.

Figure 13: Speedup for TPC-H SF1 when streaming data from SSD on a single thread (CSV and Parquet readers in Spark and Flare).

In nearly every case, reading from a Parquet file in Flare is approximately 2x-4x slower than in-memory processing. However, reading from a Parquet file in Spark is rarely significantly slower than in-memory processing. These results show that while reading from Parquet certainly provides performance gains for Spark when compared to reading from CSV, Spark's overall performance bottleneck does not lie in the cost of reading from SSD as compared to in-memory processing.

TensorFlow. We evaluate the performance of the Flare and TensorFlow integration with Spark. We run the query shown in Figure 9, which embeds a UDF that performs classification via machine learning over the data (based on a pre-trained model). As shown in Figure 14, using Flare in conjunction with TensorFlow provides speedups of over 1,000,000x when compared to PySpark, and 60x when Spark calls the TensorFlow UDF through JNI. Thus, while we can see that interfacing with an object file gives an important speedup to Spark, data loading ultimately becomes the bottleneck for the system. Flare, however, can optimize the data layout to reduce the amount of data copied to the bare minimum, and eliminate essentially all of the inefficiencies on the boundary between Spark and TensorFlow.

6 Related Work

Cluster Computing Frameworks. Such frameworks typically implement a combination of parallel, distributed, relational, procedural, and MapReduce computations. The MapReduce model [18], realized in Hadoop [6], performs big data analysis on shared-nothing, potentially unreliable machines. Twister [20] and HaLoop [14] support iterative MapReduce workloads by avoiding reading unnecessary data and keeping invariant data between iterations. Likewise, Spark [55, 56] tackles the issue of data reuse among MapReduce jobs or applications by explicitly persisting intermediate results in memory. Along the same lines, the need for an expressive programming model to perform analytics on structured and semistructured data motivated Hive [52], Dremel [31], Impala [26], Shark [53], Spark SQL [8], and many others. SnappyData [41] integrates Spark with a transactional main-memory database to realize a unified engine that supports streaming, analytics, and transactions. Asterix [10], Stratosphere / Apache Flink [5], and Tupleware [17] are other systems that improve over Spark in various dimensions, including UDFs and performance, and which inspired the design of Flare. While these systems are impressive, Flare sets itself apart by accelerating actual Spark workloads instead of proposing a competing system, and by demonstrating relational performance on par with HyPer [35] on the full set of TPC-H queries. Moreover, in contrast to systems like Tupleware that mainly integrate UDFs at the LLVM level, Flare uses higher-level knowledge about specific external systems, such as TensorFlow. Similar to Tupleware, Flare's main targets are small clusters of powerful machines where faults are statistically improbable.

Query Compilation. Recently, code generation for SQL queries has regained momentum. Historic efforts go back all the way to System R [9]. Query compilation can be realized using code templates, e.g., Spade [22] or HIQUE [27]; general purpose compilers, e.g., HyPer [35] and Hekaton [19]; or DSL compiler frameworks, e.g., Legobase [25], DryadLINQ [54], DBLAB [45], and LB2 [49].

Embedded DSL Frameworks and Intermediate Languages. These address the compromise between productivity and performance in writing programs that can run under diverse programming models. Voodoo [39] addresses compiling portable query plans that can run on CPUs and GPUs.
#Data Points  Spark   Spark + JNI  Flare
200           11909   990          0.064
2000          522471  3178         0.503

Figure 14: Running time (ms) of the query in Figure 9 using TensorFlow in Spark and Flare.

Voodoo's intermediate algebra is expressive and captures hardware optimizations, e.g., multicores, SIMD, etc. Furthermore, Voodoo is used as an alternative back-end for MonetDB [12]. Delite [46], a general purpose compiler framework, implements high-performance DSLs (e.g., for SQL, machine learning, graphs, and matrices), provides parallel patterns, and generates code for heterogeneous targets. The Distributed Multiloop Language (DMLL) [13] provides rich collections and parallel patterns and supports big-memory NUMA machines. Weld [38] is another recent system that aims to provide a common runtime for diverse libraries, e.g., SQL and machine learning; it is similar to DMLL in supporting nested parallel structures. Steno [33] performs optimizations similar to DMLL to compile LINQ queries, and uses the DryadLINQ [54] runtime for distributed execution. Nagel et al. [34] generate efficient code for LINQ queries.

Performance Evaluation. In data analytics frameworks, performance evaluation aims to identify bottlenecks and to study the parameters that impact performance the most, e.g., workload, scale-up/scale-out resources, probability of faults, etc. A recent study [36] on a single Spark cluster revealed that CPU, not I/O, is the source of bottlenecks. McSherry et al. [30] proposed the COST (Configuration that Outperforms a Single Thread) metric, and showed that in many cases, single-threaded programs can outperform big data processing frameworks running on large clusters. TPC-H [51] is a decision support benchmark that consists of 22 analytical queries that address several "choke points," e.g., aggregates, large joins, arithmetic computations, etc. [11].

7 Conclusions

Modern data analytics applications need to combine multiple programming models and make efficient use of modern hardware with large memory, many cores, and NUMA capabilities. We introduce Flare: a new backend for Spark that brings relational performance on par with the best SQL engines, and that also enables highly optimized heterogeneous workloads with external ML systems. Most importantly, all of this comes without giving up the expressiveness of Spark's high-level APIs. We believe that multi-stage APIs, in the spirit of DataFrames, and compiler systems like Flare will play an increasingly important role in the future to satisfy the increasing demand for flexible and unified analytics with high efficiency.

Acknowledgments

This work was supported in part by NSF awards 1553471 and 1564207, DOE award DE-SC0018050, and a Google Faculty Research Award.

References

[1] ONNX. https://github.com/onnx/onnx.

[2] OpenMP. http://openmp.org/.

[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2015.

[4] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[5] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, et al. The Stratosphere platform for big data analytics. The VLDB Journal, 23(6):939–964, 2014.

[6] Apache. Hadoop. http://hadoop.apache.org/.

[7] Apache. Parquet. https://parquet.apache.org/.

[8] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational data processing in Spark. In SIGMOD, pages 1383–1394. ACM, 2015.

[9] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, et al. System R: Relational approach to database management. TODS, 1(2):97–137, 1976.

[10] A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. Asterix: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185–216, 2011.

[11] P. Boncz, T. Neumann, and O. Erling. TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark. In Technology Confer-

[17] A. Crotty, A. Galakatos, K. Dursun, T. Kraska, C. Binnig, U. Çetintemel, and S. Zdonik. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, 2015.

[18] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

[19] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's memory-optimized OLTP engine. In SIGMOD, pages 1243–1254. ACM, 2013.

[20] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In HPDC, pages 810–818. ACM, 2010.

[21] Y. Futamura. Partial evaluation of computation process — an approach to a compiler-compiler. Transactions of the Institute of Electronics and Communication Engineers of Japan, 54-C(8):721–728, 1971.

[22] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo. SPADE: The System S declarative stream processing engine. In SIGMOD, pages 1123–1134. ACM, 2008.

[23] G. Graefe.
Volcano-anextensible and parallel query ence on Performance Evaluation and Benchmark- evaluation system. TKDE, 6(1):120–135, 1994. ing, pages 61–76. Springer, 2013. [24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, [12] P. A. Boncz, M. Zukowski, and N. Nes. Mon- J. Long, R. Girshick, S. Guadarrama, and T. Dar- etdb/x100: Hyper-pipelining query execution. In rell. Caffe: Convolutional architecture for fast fea- CIDR, 2005. ture embedding. arXiv preprint arXiv:1408.5093, [13] K. J. Brown, H. Lee, T. Rompf, A. K. Sujeeth, 2014. C. De Sa, C. Aberger, and K. Olukotun. Have ab- [25] Y. Klonatos, C. Koch, T. Rompf, and H. Chafi. straction and eat performance, too: Optimized het- Building efficient query engines in a high-level lan- erogeneous computing with parallel patterns. CGO guage. PVLDB, 7(10):853–864, 2014. 2016, pages 194–205. ACM, 2016. [26] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovyt- [14] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. sky, C. Ching, A. Choi, J. Erickson, M. Grund, Haloop: efficient iterative data processing on large D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Ku- clusters. PVLDB, 3(1-2):285–296, 2010. mar, A. Leblang, N. Li, I. Pandis, H. Robinson, [15] C. Calcagno, W. Taha, L. Huang, and X. Leroy. D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, Implementing multi-stage languages using ASTs, S. Wanderman-Milne, and M. a. Yoder. Impala: A gensym, and reflection. GPCE, pages 57–76, 2003. modern, open-source SQL engine for Hadoop. In CIDR, 2015. [16] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream [27] K. Krikellas, S. D. Viglas, and M. Cintra. Gener- and batch processing in a single engine. IEEE Data ating code for holistic query evaluation. In ICDE, Eng. Bull., 38(4):28–38, 2015. pages 613–624. IEEE, 2010. [28] J. Laskowski. Mastering Apache Spark 2. https: [40] K. Ramachandra, K. Park, K. V. Emani, A. Halver- //www.gitbook.com/book/jaceklaskowski/ son, C. Galindo-Legaria, and C. Cunningham. 
mastering-apache-spark/details, 2016. Froid: Optimization of imperative programs in a relational database. Proceedings of the VLDB En- [29] F. McSherry. Scalability! but at what COST. dowment, 11(4), 2017. http://www.frankmcsherry.org/graph/ scalability/cost/2015/01/15/COST.html, [41] J. Ramnarayan, B. Mozafari, S. Wale, S. Menon, 2015. N. Kumar, H. Bhanawat, S. Chakraborty, Y. Maha- jan, R. Mishra, and K. Bachhav. Snappydata: A [30] F. McSherry, M. Isard, and D. G. Murray. Scalabil- hybrid transactional analytical store built on Spark. ity! but at what COST? In HotOS, 2015. In SIGMOD, pages 2153–2156, 2016. [31] S. Melnik, A. Gubarev, J. J. Long, G. Romer, [42] T. Rompf and N. Amin. Functional Pearl: A SQL S. Shivakumar, M. Tolton, and T. Vassilakis. to C compiler in 500 lines of code. In ICFP, 2015. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330– [43] T. Rompf and M. Odersky. Lightweight Modu- 339, 2010. lar Staging: a pragmatic approach to runtime code generation and compiled DSLs. Commun. ACM, [32] G. Moerkotte and T. Neumann. Accelerating 55(6):121–130, 2012. queries with group-by and join by groupjoin. Pro- ceedings of the VLDB Endowment, 4(11), 2011. [44] A. Rowstron, D. Narayanan, A. Donnelly, G. O’Shea, and A. Douglas. Nobody ever got fired [33] D. G. Murray, M. Isard, and Y. Yu. Steno: au- for using Hadoop on a cluster. In Proceedings tomatic optimization of declarative queries. In of the 1st International Workshop on Hot Topics ACM SIGPLAN Notices, volume 46, pages 121– in Cloud Data Processing, HotCDP ’12, pages 131, 2011. 2:1–2:5, New York, NY, USA, 2012. ACM. [34] F. Nagel, G. Bierman, and S. D. Viglas. Code gen- eration for efficient query processing in managed [45] A. Shaikhha, Y. Klonatos, L. Parreaux, L. Brown, runtimes. VLDB, 7(12):1095–1106, 2014. M. Dashti, and C. Koch. How to architect a query compiler. In SIGMOD, pages 1907–1922. ACM, [35] T. Neumann. 
Efficiently compiling efficient query 2016. plans for modern hardware. PVLDB, 4(9):539–550, 2011. [46] A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun. Delite: [36] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, A compiler architecture for performance-oriented and B. Chun. Making sense of performance in data embedded domain-specific languages. TECS, analytics frameworks. In NSDI, pages 293–307, 13(4s):134, 2014. 2015. [47] W. Taha. A gentle introduction to multi-stage pro- [37] L. Page, S. Brin, R. Motwani, and T. Winograd. gramming. In Domain-Specific Program Genera- The pagerank citation ranking: Bringing order to tion, pages 30–50. Springer, 2004. the web. Technical Report 1999-66, Stanford Info- Lab, November 1999. [48] W. Taha and T. Sheard. MetaML and multi-stage programming with explicit annotations. Theor. [38] S. Palkar, J. J. Thomas, A. Shanbhag, Comput. Sci., 248(1-2):211–242, 2000. D. Narayanan, H. Pirk, M. Schwarzkopf, S. Ama- rasinghe, M. Zaharia, and S. InfoLab. Weld: [49] R. Y. Tahboub, G. M. Essertel, and T. Rompf. How A common runtime for high performance data to architect a query compiler, revisited. SIGMOD analytics. In CIDR, 2017. ’18, pages 307–322. ACM, 2018.
[39] H. Pirk, O. Moll, M. Zaharia, and S. Mad- [50] T. X. Team. XLA – TensorFlow Com- den. Voodoo - a vector algebra for portable piled. Post on the Google Developers Blog, database performance on modern hardware. VLDB, 2017. http://developers.googleblog.com/ 9(14):1707–1718, 2016. 2017/03/xla-tensorflow-compiled.html. [51] The Transaction Processing Council. TPC-H Ver- computing using a high-level language. OSDI, 8, sion 2.15.0. 2008. [52] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. [55] M. Zaharia, M. Chowdhury, M. J. Franklin, Hive: a warehousing solution over a map-reduce S. Shenker, and I. Stoica. Spark: cluster comput- framework. PVLDB, 2(2):1626–1629, 2009. ing with working sets. In USENIX, HotCloud’10, 2010. [53] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In ACM SIGMOD, pages 13–24, [56] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Arm- 2013. brust, A. Dave, X. Meng, J. Rosen, S. Venkatara- man, M. J. Franklin, A. Ghodsi, J. Gonzalez, [54] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlings- S. Shenker, and I. Stoica. Apache Spark: a uni- son, P. K. Gunda, and J. Currey. DryadLINQ: A fied engine for big data processing. Commun. ACM, system for general-purpose distributed data-parallel 59(11):56–65, 2016.