contributed articles

DOI:10.1145/2934664

Apache Spark: A Unified Engine for Big Data Processing

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

BY MATEI ZAHARIA, REYNOLD S. XIN, PATRICK WENDELL, TATHAGATA DAS, MICHAEL ARMBRUST, ANKUR DAVE, XIANGRUI MENG, JOSH ROSEN, SHIVARAM VENKATARAMAN, MICHAEL J. FRANKLIN, ALI GHODSI, JOSEPH GONZALEZ, SCOTT SHENKER, AND ION STOICA

Key insights
- A simple programming model can capture streaming, batch, and interactive workloads and enable new applications that combine them.
- Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
- In six years, Apache Spark has grown to 1,000 contributors and thousands of deployments.

THE GROWTH OF data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges. As data sizes have outpaced the capabilities of single machines, users have needed new systems to scale out computations to multiple nodes. As a result, there has been an explosion of new cluster programming models targeting diverse computing workloads.1,4,7,10 At first, these models were relatively specialized, with new models developed for new workloads; for example, MapReduce4 supported batch processing, but Google also developed Dremel13 for interactive SQL queries and Pregel11 for iterative graph algorithms. In the open source Apache Hadoop stack, systems like Storm1 and Impala9 are also specialized. Even in the relational database world, the trend has been to move away from "one-size-fits-all" systems.18 Unfortunately, most big data applications need to combine many different processing types. The very nature of "big data" is that it is diverse and messy; a typical pipeline will need MapReduce-like code for data loading, SQL-like queries, and iterative machine learning. Specialized engines can thus create both complexity and inefficiency; users must stitch together disparate systems, and some applications simply cannot be expressed efficiently in any engine.

In 2009, our group at the University of California, Berkeley, started the Apache Spark project to design a unified engine for distributed data processing. Spark has a programming model similar to MapReduce but extends it with a data-sharing abstraction called "Resilient Distributed Datasets," or RDDs.25 Using this simple extension, Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing2,6,26 (see Figure 1). These implementations use the same optimizations as specialized engines (such as column-oriented processing and incremental updates) and achieve similar performance but run as libraries over a common engine, making them easy and efficient to compose. Rather than being specific to these workloads, we claim this result is more general; when augmented with data sharing, MapReduce can emulate any distributed computation, so it should also be possible to run many other types of workloads.24

Spark's generality has several important benefits. First, applications are easier to develop because they use a unified API. Second, it is more efficient to combine processing tasks; whereas prior systems required writing the data to storage to pass it to another engine, Spark can run diverse functions over the same data, often in memory. Finally, Spark enables new applications (such as interactive queries on a graph and streaming machine learning) that were not possible with previous systems. One powerful analogy for the value of unification is to compare smartphones to the separate portable devices that existed before them (such as cameras, cellphones, and GPS gadgets). In unifying the functions of these devices, smartphones enabled new applications that combine their functions (such as video messaging and Waze) that would not have been possible on any one device.

Since its release in 2010, Spark has grown to be the most active open source project for big data processing, with more than 1,000 contributors. The project is in use in more than 1,000 organizations, ranging from technology companies to banking, retail, biotechnology, and astronomy. The largest publicly announced deployment has more than 8,000 nodes.22

[Image: Analyses performed using Spark of brain activity in a larval zebrafish: (left) matrix factorization to characterize functionally similar regions (as depicted by different colors) and (right) embedding dynamics of whole-brain activity into lower-dimensional trajectories. Source: Jeremy Freeman and Misha Ahrens, Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA.]


Figure 1. Apache Spark software stack, with specialized processing libraries (Streaming, SQL, ML, Graph) implemented over the core engine.

As Spark has grown, we have sought to keep building on its strength as a unified engine. We (and others) have continued to build an integrated standard library over Spark, with functions from data import to machine learning. Users find this ability powerful; in surveys, we find the majority of users combine multiple of Spark's libraries in their applications.

As parallel data processing becomes common, the composability of processing functions will be one of the most important concerns for both usability and performance. Much of data analysis is exploratory, with users wishing to combine library functions quickly into a working pipeline. However, for "big data" in particular, copying data between different systems is anathema to performance. Users thus need abstractions that are general and composable. In this article, we introduce the Spark programming model and explain why it is highly general. We also discuss how we leveraged this generality to build other processing tasks over it. Finally, we summarize Spark's most common applications and describe ongoing development work in the project.

Programming Model
The key programming abstraction in Spark is RDDs, which are fault-tolerant collections of objects partitioned across a cluster that can be manipulated in parallel. Users create RDDs by applying operations called "transformations" (such as map, filter, and groupBy) to their data.

Spark exposes RDDs through a functional programming API in Scala, Java, Python, and R, where users can simply pass local functions to run on the cluster. For example, the following Scala code creates an RDD representing the error messages in a log file, by searching for lines that start with ERROR, and then prints the total number of errors:

lines = spark.textFile("hdfs://...")
errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())

The first line defines an RDD backed by a file in the Hadoop Distributed File System (HDFS) as a collection of lines of text. The second line calls the filter transformation to derive a new RDD from lines. Its argument is a Scala function literal or closure.a Finally, the last line calls count, another type of RDD operation called an "action" that returns a result to the program (here, the number of elements in the RDD) instead of defining a new RDD.

a The closures passed to Spark can call into any existing Scala or Python library or even reference variables in the outer program. Spark sends read-only copies of these variables to worker nodes.

Spark evaluates RDDs lazily, allowing it to find an efficient plan for the user's computation. In particular, transformations return a new RDD object representing the result of a computation but do not immediately compute it. When an action is called, Spark looks at the whole graph of transformations used to create an execution plan. For example, if there were multiple filter or map operations in a row, Spark can fuse them into one pass, or, if it knows that data is partitioned, it can avoid moving it over the network for groupBy.5 Users can thus build up programs modularly without losing performance.

Finally, RDDs provide explicit support for data sharing among computations. By default, RDDs are "ephemeral" in that they get recomputed each time they are used in an action (such as count). However, users can also persist selected RDDs in memory for rapid reuse. (If the data does not fit in memory, Spark will also spill it to disk.) For example, a user searching through a large set of log files in HDFS to debug a problem might load just the error messages into memory across the cluster by calling

errors.persist()

After this, the user can run a variety of queries on the in-memory data:

// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL")).count()

// Fetch back the time fields of errors that
// mention PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP"))
  .map(line => line.split('\t')(3))
  .collect()

This data sharing is the main difference between Spark and previous computing models like MapReduce; otherwise, the individual operations (such as map and groupBy) are similar. Data sharing provides large speedups, often as much as 100x, for interactive queries and iterative algorithms.23 It is also the key to Spark's generality, as we discuss later.
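The snippets above are fragments of a session. For readers who want to run them end to end, the following is a minimal self-contained version (a sketch: the application setup and variable names are ours, the hdfs:// path is a placeholder as in the text, and the field index follows the example above):

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

    // Transformations are lazy: nothing runs until an action is called.
    val lines = sc.textFile("hdfs://...") // placeholder path, as in the text
    val errors = lines.filter(s => s.startsWith("ERROR"))

    // Keep the filtered data in memory so later actions reuse it.
    errors.persist()

    // Actions: each one triggers a job over the persisted errors RDD.
    println("Total errors: " + errors.count())
    println("MySQL errors: " + errors.filter(s => s.contains("MySQL")).count())
    val phpTimes = errors.filter(s => s.contains("PHP"))
      .map(line => line.split('\t')(3))
      .collect()
    println("PHP error times: " + phpTimes.take(10).mkString(", "))

    sc.stop()
  }
}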

Fault tolerance. Apart from providing data sharing and a variety of parallel operations, RDDs also automatically recover from failures. Traditionally, distributed computing systems have provided fault tolerance through data replication or checkpointing. Spark uses a different approach called "lineage."25 Each RDD tracks the graph of transformations that was used to build it and reruns these operations on base data to reconstruct any lost partitions. For example, Figure 2 shows the RDDs in our previous query, where we obtain the time fields of errors mentioning PHP by applying two filters and a map. If any partition of an RDD is lost (for example, if a node holding an in-memory partition of errors fails), Spark will rebuild it by applying the filter on the corresponding block of the HDFS file. For "shuffle" operations that send data from all nodes to all other nodes (such as reduceByKey), senders persist their output data locally in case a receiver fails.

Lineage-based recovery is significantly more efficient than replication in data-intensive workloads. It saves both time, because writing data over the network is much slower than writing it to RAM, and storage space in memory. Recovery is typically much faster than simply rerunning the program, because a failed node usually contains multiple RDD partitions, and these partitions can be rebuilt in parallel on other nodes.

A longer example. Figure 3 shows an implementation of logistic regression in Spark. It uses batch gradient descent, a simple iterative algorithm that computes a gradient function over the data repeatedly as a parallel sum. Spark makes it easy to load the data into RAM once and run multiple sums. As a result, it runs faster than traditional MapReduce. For example, in a 100GB job (see Figure 4), MapReduce takes 110 seconds per iteration because each iteration loads the data from disk, while Spark takes only one second per iteration after the first load.

Integration with storage systems. Much like Google's MapReduce, Spark is designed to be used with multiple external systems for persistent storage. Spark is most commonly used with cluster file systems like HDFS and key-value stores like S3 and Cassandra. It can also connect with Apache Hive as a data catalog. RDDs usually store only temporary data within an application, though some applications (such as the Spark SQL JDBC server) also share RDDs across multiple users.2 Spark's design as a storage-system-agnostic engine makes it easy for users to run computations against existing data and join diverse data sources.

RDDs in Higher-Level Libraries
The RDD programming model provides only distributed collections of objects and functions to run on them. Using RDDs, however, we have built a variety of higher-level libraries on Spark, targeting many of the use cases of specialized computing engines. The key idea is that if we control the data structures stored inside RDDs, the partitioning of data across nodes, and the functions run on them, we can implement many of the execution techniques in other engines. Indeed, as we show in this section, these libraries often achieve state-of-the-art performance on each task while offering significant benefits when users combine them. We now discuss the four main libraries included with Apache Spark.

Figure 2. Lineage graph for the third query in our example; boxes represent RDDs, and arrows represent transformations.

lines
  -> filter(line.startsWith("ERROR")) -> errors
  -> filter(line.contains("PHP")) -> PHP errors
  -> map(line.split('\t')(3)) -> time fields

Figure 3. A Scala implementation of logistic regression via batch gradient descent in Spark.

// Load data into an RDD
val points = sc.textFile(...).map(readPoint).persist()

// Start with a random parameter vector
var w = DenseVector.random(D)

// On each iteration, update param vector with a sum
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w.dot(p.x))))-1) * p.y
  }.reduce((a, b) => a+b)
  w -= gradient
}

Figure 4. Performance of logistic regression in Hadoop MapReduce vs. Spark for 100GB of data on 50 m2.4xlarge EC2 nodes. [Bar chart: running time in seconds vs. number of iterations (1, 5, 10, 20) for Hadoop and Spark.]
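Figure 3 leaves readPoint, the vector type, and the constants unspecified. Below is one self-contained way to fill those gaps, a sketch that assumes a whitespace-separated input file of a +/-1 label followed by D features, and uses plain Scala arrays in place of a vector library:

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double) // y is the label, +1 or -1

  // Assumed input format: one point per line, "label f1 f2 ... fD".
  def readPoint(line: String): Point = {
    val nums = line.trim.split("\\s+").map(_.toDouble)
    Point(nums.tail, nums.head)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkLRSketch"))
    val D = 10          // number of features (assumed)
    val ITERATIONS = 20

    // Load the data into an RDD once and keep it in memory, as in Figure 3.
    val points = sc.textFile(args(0)).map(readPoint).persist()

    // Start with a random parameter vector.
    var w = Array.fill(D)(2 * scala.util.Random.nextDouble() - 1)

    // On each iteration, compute the gradient as a parallel sum.
    for (_ <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        val margin = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi } // unit step size
    }

    println("Final w: " + w.mkString(", "))
    sc.stop()
  }
}

Because points is persisted, only the first iteration pays the cost of reading and parsing the input; later iterations scan the in-memory partitions, which is the effect Figure 4 measures.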


SQL and DataFrames. One of the most common data processing paradigms is relational queries. Spark SQL2 and its predecessor, Shark,23 implement such queries on Spark, using techniques similar to analytical databases. For example, these systems support columnar storage, cost-based optimization, and code generation for query execution. The main idea behind these systems is to use the same data layout as analytical databases (compressed columnar storage) inside RDDs. In Spark SQL, each record in an RDD holds a series of rows stored in binary format, and the system generates code to run directly against this layout.

Beyond running SQL queries, we have used the Spark SQL engine to provide a higher-level abstraction for basic data transformations called DataFrames,2 which are RDDs of records with a known schema. DataFrames are a common abstraction for tabular data in R and Python, with programmatic methods for filtering, computing new columns, and aggregation. In Spark, these operations map down to the Spark SQL engine and receive all its optimizations. We discuss DataFrames more later.

One technique not yet implemented in Spark SQL is indexing, though other libraries over Spark (such as IndexedRDDs3) do use it.

Spark Streaming. Spark Streaming26 implements incremental stream processing using a model called "discretized streams." To implement streaming over Spark, we split the input data into small batches (such as every 200 milliseconds) that we regularly combine with state stored inside RDDs to produce new results. Running streaming computations this way has several benefits over traditional distributed streaming systems. For example, fault recovery is less expensive due to using lineage, and it is possible to combine streaming with batch and interactive queries. (A minimal code sketch of this model appears at the end of this section.)

GraphX. GraphX6 provides a graph computation interface similar to Pregel and GraphLab,10,11 implementing the same placement optimizations as these systems (such as vertex partitioning schemes) through its choice of partitioning function for the RDDs it builds.

MLlib. MLlib,14 Spark's machine learning library, implements more than 50 common algorithms for distributed model training. For example, it includes the common distributed algorithms of decision trees (PLANET), Latent Dirichlet Allocation, and Alternating Least Squares matrix factorization.

Combining processing tasks. Spark's libraries all operate on RDDs as the data abstraction, making them easy to combine in applications. For example, Figure 5 shows a program that reads some historical Twitter data using Spark SQL, trains a K-means clustering model using MLlib, and then applies the model to a new stream of tweets. The data tasks returned by each library (here the historic tweet RDD and the K-means model) are easily passed to other libraries. Apart from compatibility at the API level, composition in Spark is also efficient at the execution level, because Spark can optimize across processing libraries. For example, if one library runs a map function and the next library runs a map on its result, Spark will fuse these operations into a single map. Likewise, Spark's fault recovery works seamlessly across these libraries, recomputing lost data no matter which libraries produced it.

Performance. Given that these libraries run over the same engine, do they lose performance? We found that by implementing the optimizations we just outlined within RDDs, we can often match the performance of specialized engines. For example, Figure 6 compares Spark's performance on three simple tasks (a SQL query, streaming word count, and Alternating Least Squares matrix factorization) versus other engines. While the results vary across workloads, Spark is generally comparable with specialized systems like Storm, GraphLab, and Impala.b For stream processing, although we show results from a distributed implementation on Storm, the per-node throughput is also comparable to commercial streaming engines like Oracle CEP.26

b One area in which other designs have outperformed Spark is certain graph computations.12,16 However, these results are for algorithms with low ratios of computation to communication (such as PageRank) where the latency from synchronized communication in Spark is significant. In applications with more computation (such as the ALS algorithm) distributing the application on Spark still helps.

Even in highly competitive benchmarks, we have achieved state-of-the-art performance using Apache Spark. In 2014, we entered the Daytona GraySort benchmark (http://sortbenchmark.org/) involving sorting 100TB of data on disk and tied for a new record with a specialized system built only for sorting on a similar number of machines. As in the other examples, this was possible because we could implement both the communication and CPU optimizations necessary for large-scale sorting inside the RDD model.
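Here is the promised sketch of the discretized-stream model: a minimal streaming word count. It assumes Spark Streaming's classic Scala API, a one-second batch interval (the text above mentions batches as small as 200 milliseconds), and a socket text source (for example, nc -lk 9999) standing in for a real ingest pipeline:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Each batch interval (here, one second) becomes one small RDD.
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed source: a text stream on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same operations used on batch RDDs apply to each micro-batch.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}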

Applications
Apache Spark is used in a wide range of applications. Our surveys of Spark users have identified more than 1,000 companies using Spark, in areas from Web services to biotechnology to finance. In academia, we have also seen applications in several scientific domains. Across these workloads, we find users take advantage of Spark's generality and often combine multiple of its libraries. Here, we cover a few top use cases. Presentations on many use cases are also available on the Spark Summit conference website (http://www.spark-summit.org).

Batch processing. Spark's most common applications are for batch processing on large datasets, including Extract-Transform-Load workloads to convert data from a raw format (such as log files) to a more structured format and offline training of machine learning models. Published examples of these workloads include page personalization and recommendation at Yahoo!; managing a data lake at Goldman Sachs; graph mining at Alibaba; financial Value at Risk calculation; and text mining of customer feedback at Toyota. The largest published use case we are aware of is an 8,000-node cluster at Chinese social network Tencent that ingests 1PB of data per day.

While Spark can process data in memory, many of the applications in this category run only on disk. In such cases, Spark can still improve performance over MapReduce due to its support for more complex operator graphs.

Interactive queries. Interactive use of Spark falls into three main classes. First, organizations use Spark SQL for relational queries, often through business-intelligence tools like Tableau. Examples include eBay and Baidu. Second, developers and data scientists can use Spark's Scala, Python, and R interfaces interactively through shells or visual notebook environments. Such interactive use is crucial for asking more advanced questions and for designing models that eventually lead to production applications and is common in all deployments. Third, several vendors have developed domain-specific interactive applications that run on Spark. Examples include Tresata (anti-money laundering), Trifacta (data cleaning), and PanTera (large-scale visualization, as in Figure 7).

Stream processing. Real-time processing is also a popular use case, both in analytics and in real-time decision-making applications. Published use cases for Spark Streaming include network security monitoring at Cisco, prescriptive analytics at Samsung SDS, and log mining at Netflix. Many of these applications also combine streaming with batch and interactive queries. For example, video company Conviva uses Spark to continuously maintain a model of content distribution server performance, querying it automatically when it moves clients across servers, in an application that requires substantial parallel work for both model maintenance and queries.

Figure 5. Example combining the SQL, machine learning, and streaming libraries in Spark.

// Load historical data as an RDD using Spark SQL
val trainingData = sql(
  "SELECT location, language FROM old_tweets")

// Train a K-means model using MLlib
val model = new KMeans()
  .setFeaturesCol("location")
  .setPredictionCol("language")
  .fit(trainingData)

// Apply the model to new tweets in a stream
TwitterUtils.createStream(...)
  .map(tweet => model.predict(tweet.location))

Figure 6. Comparing Spark's performance with several widely used specialized systems for SQL, streaming, and machine learning. Data is from Zaharia24 (SQL query and streaming word count) and Sparks et al.17 (alternating least squares matrix factorization). [Charts: SQL response time in seconds for Impala (disk), Impala (mem), Redshift, Spark (disk), and Spark (mem); streaming throughput in records/s for Storm and Spark; machine learning response time in hours for Mahout, MATLAB, Spark, and GraphLab.]

Figure 7. PanTera, a visualization application built on Spark that can interactively filter data. Source: PanTera.


Scientific applications. Spark has also been used in several scientific domains, including large-scale spam detection,19 image processing,27 and genomic data processing.15 One example that combines batch, interactive, and stream processing is the Thunder platform for neuroscience at Howard Hughes Medical Institute, Janelia Farm.5 It is designed to process brain-imaging data from experiments in real time, scaling up to 1TB/hour of whole-brain imaging data from organisms (such as zebrafish and mice). Using Thunder, researchers can apply machine learning algorithms (such as clustering and Principal Component Analysis) to identify neurons involved in specific behaviors. The same code can be run in batch jobs on data from previous runs or in interactive queries during live experiments. Figure 8 shows an example image generated using Spark.

Figure 8. Visualization of neurons in the zebrafish brain created with Spark, where each neuron is colored based on the direction of movement that correlates with its activity. Source: Jeremy Freeman and Misha Ahrens of Janelia Research Campus.

Spark components used. Because Spark is a unified data-processing engine, the natural question is how many of its libraries organizations actually use. Our surveys of Spark users have shown that organizations do, indeed, use multiple components, with over 60% of organizations using at least three of Spark's APIs. Figure 9 outlines the usage of each component in a July 2015 Spark survey by Databricks that reached 1,400 respondents. We list the Spark Core API (just RDDs) as one component and the higher-level libraries as others. We see that many components are widely used, with Spark Core and SQL as the most popular. Streaming is used in 46% of organizations and machine learning in 54%. While not shown directly in Figure 9, most organizations use multiple components; 88% use at least two of them, 60% use at least three (such as Spark Core and two libraries), and 27% use at least four components.

Figure 9. Percent of organizations using each Spark component, from the Databricks 2015 Spark survey; https://databricks.com/blog/2015/09/24/. [Bar chart: fraction of users for Core, SQL, Streaming, MLlib, and GraphX, on a scale from 0% to 100%.]

Deployment environments. We also see growing diversity in where Apache Spark applications run and what data sources they connect to. While the first Spark deployments were generally in Hadoop environments, only 40% of deployments in our July 2015 Spark survey were on the Hadoop YARN cluster manager. In addition, 52% of respondents ran Spark on a public cloud.

Why Is the Spark Model General?
While Apache Spark demonstrates that a unified cluster programming model is both feasible and useful, it would be helpful to understand what makes cluster programming models general, along with Spark's limitations. Here, we summarize a discussion on the generality of RDDs from Zaharia.24 We study RDDs from two perspectives. First, from an expressiveness point of view, we argue that RDDs can emulate any distributed computation, and will do so efficiently in many cases unless the computation is sensitive to network latency. Second, from a systems point of view, we show that RDDs give applications control over the most common bottleneck resources in clusters (network and storage I/O) and thus make it possible to express the same optimizations for these resources that characterize specialized systems.

Expressiveness perspective. To study the expressiveness of RDDs, we start by comparing RDDs to the MapReduce model, which RDDs build on. The first question is what computations can MapReduce itself express? Although there have been numerous discussions about the limitations of MapReduce, the surprising answer here is that MapReduce can emulate any distributed computation.

To see this, note that any distributed computation consists of nodes that perform local computation and occasionally exchange messages. MapReduce offers the map operation, which allows local computation, and reduce, which allows all-to-all communication. Any distributed computation can thus be emulated, perhaps somewhat inefficiently, by breaking down its work into timesteps, running maps to perform the local computation in each timestep, and batching and exchanging messages at the end of each step using a reduce.
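As a sketch of this argument in Spark's own API: one timestep (a "superstep") is a map that runs each node's local computation plus a keyed shuffle that delivers messages. The names superstep, State, Msg, and compute are ours, not part of any library:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object Emulation {
  // One timestep of an arbitrary message-passing computation. compute takes
  // a node's state and inbound messages and returns its new state plus
  // outbound messages addressed by destination node ID.
  def superstep[State: ClassTag, Msg: ClassTag](
      nodes: RDD[(Long, State)],    // node ID -> local state
      inbox: RDD[(Long, Seq[Msg])], // node ID -> messages received
      compute: (State, Seq[Msg]) => (State, Seq[(Long, Msg)])
  ): (RDD[(Long, State)], RDD[(Long, Seq[Msg])]) = {
    // "Map" phase: local computation on each node's state and inbox.
    val stepped = nodes.leftOuterJoin(inbox).mapValues {
      case (state, msgs) => compute(state, msgs.getOrElse(Seq.empty))
    }.cache()
    val newStates = stepped.mapValues(_._1)
    // "Reduce" phase: all-to-all exchange, grouping messages by destination.
    val newInbox = stepped
      .flatMap { case (_, (_, out)) => out }
      .groupByKey()
      .mapValues(_.toSeq)
    (newStates, newInbox)
  }
}

Each call costs one shuffle; iterating it yields exactly the chain of map and reduce rounds described next.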

A series of MapReduce steps will capture the whole result, as in Figure 10. Recent theoretical work has formalized this type of emulation by showing that MapReduce can simulate many computations in the Parallel Random Access Machine model.8 Repeated MapReduce is also equivalent to the Bulk Synchronous Parallel model.20

Figure 10. Emulating an arbitrary distributed computation with MapReduce. (a) MapReduce provides primitives for local computation (map) and all-to-all communication (reduce). (b) By chaining these steps together, we can emulate any distributed computation. The main costs for this emulation are the latency of the rounds and the overhead of passing state across steps.

While this line of work shows that MapReduce can emulate arbitrary computations, two problems can make the "constant factor" behind this emulation high. First, MapReduce is inefficient at sharing data across timesteps because it relies on replicated external storage systems for this purpose. Our emulated system may thus become slower due to writing out its state after each step. Second, the latency of the MapReduce steps determines how well our emulation will match a real network, and most MapReduce implementations were designed for batch environments with minutes to hours of latency.

RDDs and Spark address both of these limitations. On the data-sharing front, RDDs make data sharing fast by avoiding replication of intermediate data and can closely emulate the in-memory "data sharing" across time that would happen in a system composed of long-running processes. On the latency front, Spark can run MapReduce-like steps on large clusters with 100ms latency; nothing intrinsic to the MapReduce model prevents this. While some applications need finer-grain timesteps and communication, this 100ms latency is enough to implement many data-intensive workloads, where the amount of computation that can be batched before a communication step is high.

In summary, RDDs build on MapReduce's ability to emulate any distributed computation but make this emulation significantly more efficient. Their main limitation is increased latency due to synchronization in each communication step, but this latency is often not a factor.

Systems perspective. Independent of the emulation approach to characterizing Spark's generality, we can take a systems approach. What are the bottleneck resources in cluster computations? And can RDDs use them efficiently? Although cluster applications are diverse, they are all bound by the same properties of the underlying hardware. Current datacenters have a steep storage hierarchy that limits most applications in similar ways. For example, a typical Hadoop cluster might have the following characteristics:

Local storage. Each node has local memory with approximately 50GB/s of bandwidth, as well as 10 to 20 local disks, for approximately 1GB/s to 2GB/s of disk bandwidth;

Links. Each node has a 10Gbps (1.3GB/s) link, or approximately 40x less than its memory bandwidth and 2x less than its aggregate disk bandwidth; and

Racks. Nodes are organized into racks of 20 to 40 machines, with 40Gbps-80Gbps bandwidth out of each rack, or 2x-5x lower than the in-rack network performance.

Given these properties, the most important performance concern in many applications is the placement of data and computation in the network. Fortunately, RDDs provide the facilities to control this placement; the interface lets applications place computations near input data (through an API for "preferred locations" for input sources25), and RDDs provide control over data partitioning and co-location (such as specifying that data be hashed by a given key, as sketched in the code below). Libraries (such as GraphX) can thus implement the same placement strategies used in specialized systems.6


Beyond network and I/O bandwidth, the most common bottleneck tends to be CPU time, especially if data is in memory. In this case, however, Spark can run the same algorithms and libraries used in specialized systems on each node. For example, it uses columnar storage and processing in Spark SQL, native BLAS libraries in MLlib, and so on. As we discussed earlier, the only area where RDDs clearly add a cost is network latency, due to the synchronization at parallel communication steps.

One final observation from a systems perspective is that Spark may incur extra costs over some of today's specialized systems due to fault tolerance. For example, in Spark, the map tasks in each shuffle operation save their output to local files on the machine where they ran, so reduce tasks can re-fetch it later. In addition, Spark implements a barrier at shuffle stages, so the reduce tasks do not start until all the maps have finished. This avoids some of the complexity that would be needed for fault recovery if one "pushed" records directly from maps to reduces in a pipelined fashion. Although removing some of these features would speed up the system, Spark often performs competitively despite them. The main reason is an argument similar to our previous one: many applications are bound by an I/O operation (such as shuffling data across the network or reading it from disk), and beyond this operation, optimizations (such as pipelining) add only a modest benefit. We have kept fault tolerance "on" by default in Spark to make it easy to reason about applications.

Ongoing Work
Apache Spark remains a rapidly evolving project, with contributions from both industry and research. The codebase size has grown by a factor of six since June 2013, with most of the activity in new libraries. More than 200 third-party packages are also available.c In the research community, multiple projects at Berkeley, MIT, and Stanford build on Spark, and many new libraries (such as GraphX and Spark Streaming) came from research groups. Here, we sketch four of the major efforts.

DataFrames and more declarative APIs. The core Spark API was based on functional programming over distributed collections that contain arbitrary types of Scala, Java, or Python objects. While this approach was highly expressive, it also made programs more difficult to automatically analyze and optimize. The Scala/Java/Python objects stored in RDDs could have complex structure, and the functions run over them could include arbitrary code. In many applications, developers could get suboptimal performance if they did not use the right operators; for example, the system on its own could not push filter functions ahead of maps.

To address this problem, we extended Spark in 2015 to add a more declarative API called DataFrames2 based on the relational algebra. Data frames are a common API for tabular data in Python and R. A data frame is a set of records with a known schema, essentially equivalent to a database table, that supports operations like filtering and aggregation using a restricted "expression" API. Unlike working in the SQL language, however, data frame operations are invoked as function calls in a more general programming language (such as Python and R), allowing developers to easily structure their program using abstractions in the host language (such as functions and classes). Figure 11 and Figure 12 show examples of the API.

Figure 11. Example of Spark's DataFrame API in Python. Unlike Spark's core API, DataFrames have a schema with named columns (such as age and city) and take expressions in a limited language (such as age > 20) instead of arbitrary Python functions.

users.where(users["age"] > 20)
  .groupBy("city")
  .agg(avg("age"), max("income"))

Figure 12. Working with DataFrames in Spark's R API. We load a distributed DataFrame using Spark's JSON data source, then filter and aggregate using standard R column expressions.

people <- read.df(context, "./people.json", "json")

# Filter people by age
adults <- filter(people, people$age > 20)

# Count the number of people by city
summarize(groupBy(adults, adults$city), count = n(adults$id))
## city count
## 1 Cambridge 1
## 2 San Francisco 6
## 3 Berkeley 4

Spark's DataFrames offer a similar API to single-node packages but automatically parallelize and optimize the computation using Spark SQL's query planner. User code thus receives optimizations (such as predicate pushdown, operator reordering, and join algorithm selection) that were not available under Spark's functional API. To our knowledge, Spark DataFrames are the first library to perform such relational optimizations under a data frame API.d

While DataFrames are still new, they have quickly become a popular API. In our July 2015 survey, 60% of respondents reported using them. Because of the success of DataFrames, we have also developed a type-safe interface over them called Datasetse that lets Java and Scala programmers view DataFrames as statically typed collections of Java objects, similar to the RDD API, and still receive relational optimizations. We expect these APIs to gradually become the standard abstraction for passing data between Spark libraries.

Performance optimizations. Much of the recent work in Spark has been on performance. In 2014, the Databricks team spent considerable effort to optimize Spark's network and I/O primitives, allowing Spark to jointly set a new record for the Daytona GraySort challenge.f Spark sorted 100TB of data 3x faster than the previous record holder based on Hadoop MapReduce, using 10x fewer machines. This benchmark was not executed in memory but rather on (solid-state) disks. In 2015, one major effort was Project Tungsten,g which removes Java Virtual Machine overhead from many of Spark's code paths by using code generation and non-garbage-collected memory. One benefit of doing these optimizations in a general engine is that they simultaneously affect all of Spark's libraries; machine learning, streaming, and SQL all became faster from each change.

R language support. The SparkR project21 was merged into Spark in 2015 to provide a programming interface in R. The R interface is based on DataFrames and uses almost identical syntax to R's built-in data frames. Other Spark libraries (such as MLlib) are also easy to call from R, because they accept DataFrames as input.

c One package index is available at https://spark-packages.org/
d One reason optimization is possible is that Spark's DataFrame API uses lazy evaluation where the content of a DataFrame is not computed until the user asks to write it out. The data frame APIs in R and Python are eager, preventing optimizations like operator reordering.
e https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
f http://sortbenchmark.org/ApacheSpark2014.pdf
g https://databricks.com/blog/2015/04/28/


Research libraries. Apache Spark continues to be used to build higher-level data processing libraries. Recent projects include Thunder for neuroscience,5 ADAM for genomics,15 and Kira for image processing in astronomy.27 Other research libraries (such as GraphX) have been merged into the main codebase.

Conclusion
Scalable data processing will be essential for the next generation of computer applications but typically involves a complex sequence of processing steps with different computing systems. To simplify this task, the Spark project introduced a unified programming model and engine for big data applications. Our experience shows such a model can efficiently support today's workloads and brings substantial benefits to users. We hope Apache Spark highlights the importance of composability in programming libraries for big data and encourages development of more easily interoperable libraries.

All Apache Spark libraries described in this article are open source at http://spark.apache.org/. Databricks has also made videos of all Spark Summit conference talks available for free at https://spark-summit.org/.

Acknowledgments
Apache Spark is the work of hundreds of open source contributors who are credited in the release notes at https://spark.apache.org. Berkeley's research on Spark was supported in part by National Science Foundation CISE Expeditions Award CCF-1139158, Lawrence Berkeley National Laboratory Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, IBM, The Thomas and Stacey Siebel Foundation, Adobe, Apple, Arimo, Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata, and VMware.

References
1. Apache Storm project; http://storm.apache.org
2. Armbrust, M. et al. Spark SQL: Relational data processing in Spark. In Proceedings of the ACM SIGMOD/PODS Conference (Melbourne, Australia, May 31–June 4). ACM Press, New York, 2015.
3. Dave, A. IndexedRDD project; http://github.com/amplab/spark-indexedrdd
4. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth OSDI Symposium on Operating Systems Design and Implementation (San Francisco, CA, Dec. 6–8). USENIX Association, Berkeley, CA, 2004.
5. Freeman, J., Vladimirov, N., Kawashima, T., Mu, Y., Sofroniew, N.J., Bennett, D.V., Rosen, J., Yang, C.-T., Looger, L.L., and Ahrens, M.B. Mapping brain activity at scale with cluster computing. Nature Methods 11, 9 (Sept. 2014), 941–950.
6. Gonzalez, J.E. et al. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of the 11th OSDI Symposium on Operating Systems Design and Implementation (Broomfield, CO, Oct. 6–8). USENIX Association, Berkeley, CA, 2014.
7. Isard, M. et al. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the EuroSys Conference (Lisbon, Portugal, Mar. 21–23). ACM Press, New York, 2007.
8. Karloff, H., Suri, S., and Vassilvitskii, S. A model of computation for MapReduce. In Proceedings of the ACM-SIAM SODA Symposium on Discrete Algorithms (Austin, TX, Jan. 17–19). ACM Press, New York, 2010.
9. Kornacker, M. et al. Impala: A modern, open-source SQL engine for Hadoop. In Proceedings of the Seventh Biennial CIDR Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 4–7, 2015).
10. Low, Y. et al. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the 38th International VLDB Conference on Very Large Databases (Istanbul, Turkey, Aug. 27–31, 2012).
11. Malewicz, G. et al. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD/PODS Conference (Indianapolis, IN, June 6–11). ACM Press, New York, 2010.
12. McSherry, F., Isard, M., and Murray, D.G. Scalability! But at what COST? In Proceedings of the 15th HotOS Workshop on Hot Topics in Operating Systems (Kartause Ittingen, Switzerland, May 18–20). USENIX Association, Berkeley, CA, 2015.
13. Melnik, S. et al. Dremel: Interactive analysis of Web-scale datasets. Proceedings of the VLDB Endowment 3 (Sept. 2010), 330–339.
14. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., and Talwalkar, A. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research 17, 34 (2016), 1–7.
15. Nothaft, F.A., Massie, M., Danford, T., Zhang, Z., Laserson, U., Yeksigian, C., Kottalam, J., Ahuja, A., Hammerbacher, J., Linderman, M., Franklin, M.J., Joseph, A.D., and Patterson, D.A. Rethinking data-intensive science using scalable analytics systems. In Proceedings of the ACM SIGMOD/PODS Conference (Melbourne, Australia, May 31–June 4). ACM Press, New York, 2015.
16. Shun, J. and Blelloch, G.E. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN PPoPP Symposium on Principles and Practice of Parallel Programming (Shenzhen, China, Feb. 23–27). ACM Press, New York, 2013.
17. Sparks, E.R., Talwalkar, A., Smith, V., Kottalam, J., Pan, X., Gonzalez, J.E., Franklin, M.J., Jordan, M.I., and Kraska, T. MLI: An API for distributed machine learning. In Proceedings of the IEEE ICDM International Conference on Data Mining (Dallas, TX, Dec. 7–10). IEEE Press, 2013.
18. Stonebraker, M. and Cetintemel, U. 'One size fits all': An idea whose time has come and gone. In Proceedings of the 21st International ICDE Conference on Data Engineering (Tokyo, Japan, Apr. 5–8). IEEE Computer Society, Washington, D.C., 2005, 2–11.
19. Thomas, K., Grier, C., Ma, J., Paxson, V., and Song, D. Design and evaluation of a real-time URL spam filtering service. In Proceedings of the IEEE Symposium on Security and Privacy (Oakland, CA, May 22–25). IEEE Press, 2011.
20. Valiant, L.G. A bridging model for parallel computation. Commun. ACM 33, 8 (Aug. 1990), 103–111.
21. Venkataraman, S. et al. SparkR; http://dl.acm.org/citation.cfm?id=2903740
22. Xin, R. and Zaharia, M. Lessons from running large-scale Spark workloads; http://tinyurl.com/large-scale-spark
23. Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., and Stoica, I. Shark: SQL and rich analytics at scale. In Proceedings of the ACM SIGMOD/PODS Conference (New York, June 22–27). ACM Press, New York, 2013.
24. Zaharia, M. An Architecture for Fast and General Data Processing on Large Clusters. Ph.D. thesis, Electrical Engineering and Computer Sciences Department, University of California, Berkeley, 2014; https://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
25. Zaharia, M. et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the Ninth USENIX NSDI Symposium on Networked Systems Design and Implementation (San Jose, CA, Apr. 25–27, 2012).
26. Zaharia, M. et al. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the 24th ACM SOSP Symposium on Operating Systems Principles (Farmington, PA, Nov. 3–6). ACM Press, New York, 2013.
27. Zhang, Z., Barbary, K., Nothaft, N.A., Sparks, E., Zahn, O., Franklin, M.J., Patterson, D.A., and Perlmutter, S. Scientific computing meets big data technology: An astronomy use case. In Proceedings of the IEEE International Conference on Big Data (Santa Clara, CA, Oct. 29–Nov. 1). IEEE, 2015.

Matei Zaharia ([email protected]) is an assistant professor of computer science at Stanford University, Stanford, CA, and CTO of Databricks, San Francisco, CA.

Reynold S. Xin ([email protected]) is the chief architect on the Spark team at Databricks, San Francisco, CA.

Patrick Wendell ([email protected]) is the vice president of engineering at Databricks, San Francisco, CA.

Tathagata Das ([email protected]) is a software engineer at Databricks, San Francisco, CA.

Michael Armbrust ([email protected]) is a software engineer at Databricks, San Francisco, CA.

Ankur Dave ([email protected]) is a graduate student in the Real-Time, Intelligent and Secure Systems Lab at the University of California, Berkeley.

Xiangrui Meng ([email protected]) is a software engineer at Databricks, San Francisco, CA.

Josh Rosen ([email protected]) is a software engineer at Databricks, San Francisco, CA.

Shivaram Venkataraman ([email protected]) is a Ph.D. student in the AMPLab at the University of California, Berkeley.

Michael J. Franklin ([email protected]) is the Liew Family Chair of Computer Science at the University of Chicago and Director of the AMPLab at the University of California, Berkeley.

Ali Ghodsi ([email protected]) is the CEO of Databricks and adjunct faculty at the University of California, Berkeley.

Joseph E. Gonzalez ([email protected]) is an assistant professor in EECS at the University of California, Berkeley.

Scott Shenker ([email protected]) is a professor in EECS at the University of California, Berkeley.

Ion Stoica ([email protected]) is a professor in EECS and co-director of the AMPLab at the University of California, Berkeley.

Watch the authors discuss their work in this exclusive Communications video: http://cacm.acm.org/videos/spark

Copyright held by the authors. Publication rights licensed to ACM. $15.00
