contributed articles

DOI:10.1145/2934664

Apache Spark: A Unified Engine for Big Data Processing

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

BY MATEI ZAHARIA, REYNOLD S. XIN, PATRICK WENDELL, TATHAGATA DAS, MICHAEL ARMBRUST, ANKUR DAVE, XIANGRUI MENG, JOSH ROSEN, SHIVARAM VENKATARAMAN, MICHAEL J. FRANKLIN, ALI GHODSI, JOSEPH GONZALEZ, SCOTT SHENKER, AND ION STOICA

Key insights
- A simple programming model can capture streaming, batch, and interactive workloads and enable new applications that combine them.
- Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
- In six years, Apache Spark has grown to 1,000 contributors and thousands of deployments.

THE GROWTH OF data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges. As data sizes have outpaced the capabilities of single machines, users have needed new systems to scale out computations to multiple nodes. As a result, there has been an explosion of new cluster programming models targeting diverse computing workloads.1,4,7,10 At first, these models were relatively specialized, with new models developed for new workloads; for example, MapReduce4 supported batch processing, but Google also developed Dremel13 for interactive SQL queries and Pregel11 for iterative graph algorithms. In the open source Apache Hadoop stack, systems like Storm1 and Impala9 are also specialized. Even in the relational database world, the trend has been to move away from "one-size-fits-all" systems.18 Unfortunately, most big data applications need to combine many different processing types. The very nature of "big data" is that it is diverse and messy; a typical pipeline will need MapReduce-like code for data loading, SQL-like queries, and iterative machine learning. Specialized engines can thus create both complexity and inefficiency; users must stitch together disparate systems, and some applications simply cannot be expressed efficiently in any engine.

In 2009, our group at the University of California, Berkeley, started the Apache Spark project to design a unified engine for distributed data processing. Spark has a programming model similar to MapReduce but extends it with a data-sharing abstraction called "Resilient Distributed Datasets," or RDDs.25 Using this simple extension, Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing2,6,26 (see Figure 1). These implementations use the same optimizations as specialized engines (such as column-oriented processing and incremental updates) and achieve similar performance but run as libraries over a common engine, making them easy and efficient to compose. Rather than being specific to these workloads, we claim this result is more general; when augmented with data sharing, MapReduce can emulate any distributed computation, so it should also be possible to run many other types of workloads.24

Spark's generality has several important benefits. First, applications are easier to develop because they use a unified API. Second, it is more efficient to combine processing tasks; whereas prior systems required writing the data to storage to pass it to another engine, Spark can run diverse functions over the same data, often in memory. Finally, Spark enables new applications (such as interactive queries on a graph and streaming machine learning) that were not possible with previous systems. One powerful analogy for the value of unification is to compare smartphones to the separate portable devices that existed before them (such as cameras, cellphones, and GPS gadgets). In unifying the functions of these devices, smartphones enabled new applications that combine their functions (such as video messaging and Waze) that would not have been possible on any one device.

Since its release in 2010, Spark has grown to be the most active open source project for big data processing, with more than 1,000 contributors. The project is in use in more than 1,000 organizations, ranging from technology companies to banking, retail, biotechnology, and astronomy. The largest publicly announced deployment has more than 8,000 nodes.22

[Image: Analyses performed using Spark of brain activity in a larval zebrafish: (left) matrix factorization to characterize functionally similar regions (as depicted by different colors) and (right) embedding dynamics of whole-brain activity into lower-dimensional trajectories. Source: Jeremy Freeman and Misha Ahrens, Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA.]


Figure 1. Apache Spark software stack, with specialized processing libraries (Streaming, SQL, ML, Graph) implemented over the core engine.

As Spark has grown, we have sought to keep building on its strength as a unified engine. We (and others) have continued to build an integrated standard library over Spark, with functions from data import to machine learning. Users find this ability powerful; in surveys, we find the majority of users combine multiple of Spark's libraries in their applications.

As parallel data processing becomes common, the composability of processing functions will be one of the most important concerns for both usability and performance. Much of data analysis is exploratory, with users wishing to combine library functions quickly into a working pipeline. However, for "big data" in particular, copying data between different systems is anathema to performance. Users thus need abstractions that are general and composable. In this article, we introduce the Spark programming model and explain why it is highly general. We also discuss how we leveraged this generality to build other processing tasks over it. Finally, we summarize Spark's most common applications and describe ongoing development work in the project.

Programming Model
The key programming abstraction in Spark is RDDs, which are fault-tolerant collections of objects partitioned across a cluster that can be manipulated in parallel. Users create RDDs by applying operations called "transformations" (such as map, filter, and groupBy) to their data.

Spark exposes RDDs through a functional programming API in Scala, Java, Python, and R, where users can simply pass local functions to run on the cluster. For example, the following Scala code creates an RDD representing the error messages in a log file, by searching for lines that start with ERROR, and then prints the total number of errors:

lines = spark.textFile("hdfs://...")
errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())

The first line defines an RDD backed by a file in the Hadoop Distributed File System (HDFS) as a collection of lines of text. The second line calls the filter transformation to derive a new RDD from lines. Its argument is a Scala function literal or closure.a Finally, the last line calls count, another type of RDD operation called an "action" that returns a result to the program (here, the number of elements in the RDD) instead of defining a new RDD.

a The closures passed to Spark can call into any existing Scala or Python library or even reference variables in the outer program. Spark sends read-only copies of these variables to worker nodes.

Spark evaluates RDDs lazily, allowing it to find an efficient plan for the user's computation. In particular, transformations return a new RDD object representing the result of a computation but do not immediately compute it. When an action is called, Spark looks at the whole graph of transformations used to create an execution plan. For example, if there were multiple filter or map operations in a row, Spark can fuse them into one pass, or, if it knows that data is partitioned, it can avoid moving it over the network for groupBy.5 Users can thus build up programs modularly without losing performance.

Finally, RDDs provide explicit support for data sharing among computations. By default, RDDs are "ephemeral" in that they get recomputed each time they are used in an action (such as count). However, users can also persist selected RDDs in memory for rapid reuse. (If the data does not fit in memory, Spark will also spill it to disk.) For example, a user searching through a large set of log files in HDFS to debug a problem might load just the error messages into memory across the cluster by calling

errors.persist()

After this, the user can run a variety of queries on the in-memory data:

// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL")).count()

// Fetch back the time fields of errors that
// mention PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP"))
  .map(line => line.split('\t')(3))
  .collect()

This data sharing is the main difference between Spark and previous computing models like MapReduce; otherwise, the individual operations (such as map and groupBy) are similar. Data sharing provides large speedups, often as much as 100x, for interactive queries and iterative algorithms.23 It is also the key to Spark's generality, as we discuss later.
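The snippets above are fragments of a session. For readers who want to run them end to end, the following is a minimal self-contained version (a sketch: the application setup and variable names are ours, the hdfs:// path is a placeholder as in the text, and the field index follows the example above):

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

    // Transformations are lazy: nothing runs until an action is called.
    val lines = sc.textFile("hdfs://...") // placeholder path, as in the text
    val errors = lines.filter(s => s.startsWith("ERROR"))

    // Keep the filtered data in memory so later actions reuse it.
    errors.persist()

    // Actions: each one triggers a job over the persisted errors RDD.
    println("Total errors: " + errors.count())
    println("MySQL errors: " + errors.filter(s => s.contains("MySQL")).count())
    val phpTimes = errors.filter(s => s.contains("PHP"))
      .map(line => line.split('\t')(3))
      .collect()
    println("PHP error times: " + phpTimes.take(10).mkString(", "))

    sc.stop()
  }
}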

Fault tolerance. Apart from providing data sharing and a variety of parallel operations, RDDs also automatically recover from failures. Traditionally, distributed computing systems have provided fault tolerance through data replication or checkpointing. Spark uses a different approach called "lineage."25 Each RDD tracks the graph of transformations that was used to build it and reruns these operations on base data to reconstruct any lost partitions. For example, Figure 2 shows the RDDs in our previous query, where we obtain the time fields of errors mentioning PHP by applying two filters and a map. If any partition of an RDD is lost (for example, if a node holding an in-memory partition of errors fails), Spark will rebuild it by applying the filter on the corresponding block of the HDFS file. For "shuffle" operations that send data from all nodes to all other nodes (such as reduceByKey), senders persist their output data locally in case a receiver fails.

Lineage-based recovery is significantly more efficient than replication in data-intensive workloads. It saves both time, because writing data over the network is much slower than writing it to RAM, and storage space in memory. Recovery is typically much faster than simply rerunning the program, because a failed node usually contains multiple RDD partitions, and these partitions can be rebuilt in parallel on other nodes.

A longer example. Figure 3 shows an implementation of logistic regression in Spark. It uses batch gradient descent, a simple iterative algorithm that computes a gradient function over the data repeatedly as a parallel sum. Spark makes it easy to load the data into RAM once and run multiple sums. As a result, it runs faster than traditional MapReduce. For example, in a 100GB job (see Figure 4), MapReduce takes 110 seconds per iteration because each iteration loads the data from disk, while Spark takes only one second per iteration after the first load.

Integration with storage systems. Much like Google's MapReduce, Spark is designed to be used with multiple external systems for persistent storage. Spark is most commonly used with cluster file systems like HDFS and key-value stores like S3 and Cassandra. It can also connect with Apache Hive as a data catalog. RDDs usually store only temporary data within an application, though some applications (such as the Spark SQL JDBC server) also share RDDs across multiple users.2 Spark's design as a storage-system-agnostic engine makes it easy for users to run computations against existing data and join diverse data sources.

RDDs in Higher-Level Libraries
The RDD programming model provides only distributed collections of objects and functions to run on them. Using RDDs, however, we have built a variety of higher-level libraries on Spark, targeting many of the use cases of specialized computing engines. The key idea is that if we control the data structures stored inside RDDs, the partitioning of data across nodes, and the functions run on them, we can implement many of the execution techniques in other engines. Indeed, as we show in this section, these libraries often achieve state-of-the-art performance on each task while offering significant benefits when users combine them. We now discuss the four main libraries included with Apache Spark.

Figure 2. Lineage graph for the third query in our example; boxes represent RDDs, and arrows represent transformations.

lines
  -> filter(line.startsWith("ERROR")) -> errors
  -> filter(line.contains("PHP")) -> PHP errors
  -> map(line.split('\t')(3)) -> time fields

Figure 3. A Scala implementation of logistic regression via batch gradient descent in Spark.

// Load data into an RDD
val points = sc.textFile(...).map(readPoint).persist()

// Start with a random parameter vector
var w = DenseVector.random(D)

// On each iteration, update param vector with a sum
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w.dot(p.x))))-1) * p.y
  }.reduce((a, b) => a+b)
  w -= gradient
}

Figure 4. Performance of logistic regression in Hadoop MapReduce vs. Spark for 100GB of data on 50 m2.4xlarge EC2 nodes. [Bar chart: running time in seconds vs. number of iterations (1, 5, 10, 20) for Hadoop and Spark.]
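Figure 3 leaves readPoint, the vector type, and the constants unspecified. Below is one self-contained way to fill those gaps, a sketch that assumes a whitespace-separated input file of a +/-1 label followed by D features, and uses plain Scala arrays in place of a vector library:

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double) // y is the label, +1 or -1

  // Assumed input format: one point per line, "label f1 f2 ... fD".
  def readPoint(line: String): Point = {
    val nums = line.trim.split("\\s+").map(_.toDouble)
    Point(nums.tail, nums.head)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkLRSketch"))
    val D = 10          // number of features (assumed)
    val ITERATIONS = 20

    // Load the data into an RDD once and keep it in memory, as in Figure 3.
    val points = sc.textFile(args(0)).map(readPoint).persist()

    // Start with a random parameter vector.
    var w = Array.fill(D)(2 * scala.util.Random.nextDouble() - 1)

    // On each iteration, compute the gradient as a parallel sum.
    for (_ <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        val margin = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi } // unit step size
    }

    println("Final w: " + w.mkString(", "))
    sc.stop()
  }
}

Because points is persisted, only the first iteration pays the cost of reading and parsing the input; later iterations scan the in-memory partitions, which is the effect Figure 4 measures.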


SQL and DataFrames. One of the most common data processing paradigms is relational queries. Spark SQL2 and its predecessor, Shark,23 implement such queries on Spark, using techniques similar to analytical databases. For example, these systems support columnar storage, cost-based optimization, and code generation for query execution. The main idea behind these systems is to use the same data layout as analytical databases (compressed columnar storage) inside RDDs. In Spark SQL, each record in an RDD holds a series of rows stored in binary format, and the system generates code to run directly against this layout.

Beyond running SQL queries, we have used the Spark SQL engine to provide a higher-level abstraction for basic data transformations called DataFrames,2 which are RDDs of records with a known schema. DataFrames are a common abstraction for tabular data in R and Python, with programmatic methods for filtering, computing new columns, and aggregation. In Spark, these operations map down to the Spark SQL engine and receive all its optimizations. We discuss DataFrames more later.

One technique not yet implemented in Spark SQL is indexing, though other libraries over Spark (such as IndexedRDDs3) do use it.

Spark Streaming. Spark Streaming26 implements incremental stream processing using a model called "discretized streams." To implement streaming over Spark, we split the input data into small batches (such as every 200 milliseconds) that we regularly combine with state stored inside RDDs to produce new results. Running streaming computations this way has several benefits over traditional distributed streaming systems. For example, fault recovery is less expensive due to using lineage, and it is possible to combine streaming with batch and interactive queries. (A minimal code sketch of this model appears at the end of this section.)

GraphX. GraphX6 provides a graph computation interface similar to Pregel and GraphLab,10,11 implementing the same placement optimizations as these systems (such as vertex partitioning schemes) through its choice of partitioning function for the RDDs it builds.

MLlib. MLlib,14 Spark's machine learning library, implements more than 50 common algorithms for distributed model training. For example, it includes the common distributed algorithms of decision trees (PLANET), Latent Dirichlet Allocation, and Alternating Least Squares matrix factorization.

Combining processing tasks. Spark's libraries all operate on RDDs as the data abstraction, making them easy to combine in applications. For example, Figure 5 shows a program that reads some historical Twitter data using Spark SQL, trains a K-means clustering model using MLlib, and then applies the model to a new stream of tweets. The data tasks returned by each library (here the historic tweet RDD and the K-means model) are easily passed to other libraries. Apart from compatibility at the API level, composition in Spark is also efficient at the execution level, because Spark can optimize across processing libraries. For example, if one library runs a map function and the next library runs a map on its result, Spark will fuse these operations into a single map. Likewise, Spark's fault recovery works seamlessly across these libraries, recomputing lost data no matter which libraries produced it.

Performance. Given that these libraries run over the same engine, do they lose performance? We found that by implementing the optimizations we just outlined within RDDs, we can often match the performance of specialized engines. For example, Figure 6 compares Spark's performance on three simple tasks (a SQL query, streaming word count, and Alternating Least Squares matrix factorization) versus other engines. While the results vary across workloads, Spark is generally comparable with specialized systems like Storm, GraphLab, and Impala.b For stream processing, although we show results from a distributed implementation on Storm, the per-node throughput is also comparable to commercial streaming engines like Oracle CEP.26

b One area in which other designs have outperformed Spark is certain graph computations.12,16 However, these results are for algorithms with low ratios of computation to communication (such as PageRank) where the latency from synchronized communication in Spark is significant. In applications with more computation (such as the ALS algorithm) distributing the application on Spark still helps.

Even in highly competitive benchmarks, we have achieved state-of-the-art performance using Apache Spark. In 2014, we entered the Daytona GraySort benchmark (http://sortbenchmark.org/) involving sorting 100TB of data on disk and tied for a new record with a specialized system built only for sorting on a similar number of machines. As in the other examples, this was possible because we could implement both the communication and CPU optimizations necessary for large-scale sorting inside the RDD model.
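Here is the promised sketch of the discretized-stream model: a minimal streaming word count. It assumes Spark Streaming's classic Scala API, a one-second batch interval (the text above mentions batches as small as 200 milliseconds), and a socket text source (for example, nc -lk 9999) standing in for a real ingest pipeline:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Each batch interval (here, one second) becomes one small RDD.
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed source: a text stream on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same operations used on batch RDDs apply to each micro-batch.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}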

Applications
Apache Spark is used in a wide range of applications. Our surveys of Spark users have identified more than 1,000 companies using Spark, in areas from Web services to biotechnology to finance. In academia, we have also seen applications in several scientific domains. Across these workloads, we find users take advantage of Spark's generality and often combine multiple of its libraries. Here, we cover a few top use cases. Presentations on many use cases are also available on the Spark Summit conference website (http://www.spark-summit.org).

Batch processing. Spark's most common applications are for batch processing on large datasets, including Extract-Transform-Load workloads to convert data from a raw format (such as log files) to a more structured format and offline training of machine learning models. Published examples of these workloads include page personalization and recommendation at Yahoo!; managing a data lake at Goldman Sachs; graph mining at Alibaba; financial Value at Risk calculation; and text mining of customer feedback at Toyota. The largest published use case we are aware of is an 8,000-node cluster at Chinese social network Tencent that ingests 1PB of data per day.

While Spark can process data in memory, many of the applications in this category run only on disk. In such cases, Spark can still improve performance over MapReduce due to its support for more complex operator graphs.

Interactive queries. Interactive use of Spark falls into three main classes. First, organizations use Spark SQL for relational queries, often through business-intelligence tools like Tableau. Examples include eBay and Baidu. Second, developers and data scientists can use Spark's Scala, Python, and R interfaces interactively through shells or visual notebook environments. Such interactive use is crucial for asking more advanced questions and for designing models that eventually lead to production applications and is common in all deployments. Third, several vendors have developed domain-specific interactive applications that run on Spark. Examples include Tresata (anti-money laundering), Trifacta (data cleaning), and PanTera (large-scale visualization, as in Figure 7).

Stream processing. Real-time processing is also a popular use case, both in analytics and in real-time decision-making applications. Published use cases for Spark Streaming include network security monitoring at Cisco, prescriptive analytics at Samsung SDS, and log mining at Netflix. Many of these applications also combine streaming with batch and interactive queries. For example, video company Conviva uses Spark to continuously maintain a model of content distribution server performance, querying it automatically when it moves clients across servers, in an application that requires substantial parallel work for both model maintenance and queries.

Figure 5. Example combining the SQL, machine learning, and streaming libraries in Spark.

// Load historical data as an RDD using Spark SQL
val trainingData = sql(
  "SELECT location, language FROM old_tweets")

// Train a K-means model using MLlib
val model = new KMeans()
  .setFeaturesCol("location")
  .setPredictionCol("language")
  .fit(trainingData)

// Apply the model to new tweets in a stream
TwitterUtils.createStream(...)
  .map(tweet => model.predict(tweet.location))

Figure 6. Comparing Spark's performance with several widely used specialized systems for SQL, streaming, and machine learning. Data is from Zaharia24 (SQL query and streaming word count) and Sparks et al.17 (alternating least squares matrix factorization). [Charts: SQL response time in seconds for Impala (disk), Impala (mem), Redshift, Spark (disk), and Spark (mem); streaming throughput in records/s for Storm and Spark; machine learning response time in hours for Mahout, MATLAB, Spark, and GraphLab.]

Figure 7. PanTera, a visualization application built on Spark that can interactively filter data. Source: PanTera.


Scientific applications. Spark has also been used in several scientific domains, including large-scale spam detection,19 image processing,27 and genomic data processing.15 One example that combines batch, interactive, and stream processing is the Thunder platform for neuroscience at Howard Hughes Medical Institute, Janelia Farm.5 It is designed to process brain-imaging data from experiments in real time, scaling up to 1TB/hour of whole-brain imaging data from organisms (such as zebrafish and mice). Using Thunder, researchers can apply machine learning algorithms (such as clustering and Principal Component Analysis) to identify neurons involved in specific behaviors. The same code can be run in batch jobs on data from previous runs or in interactive queries during live experiments. Figure 8 shows an example image generated using Spark.

Figure 8. Visualization of neurons in the zebrafish brain created with Spark, where each neuron is colored based on the direction of movement that correlates with its activity. Source: Jeremy Freeman and Misha Ahrens of Janelia Research Campus.

Spark components used. Because Spark is a unified data-processing engine, the natural question is how many of its libraries organizations actually use. Our surveys of Spark users have shown that organizations do, indeed, use multiple components, with over 60% of organizations using at least three of Spark's APIs. Figure 9 outlines the usage of each component in a July 2015 Spark survey by Databricks that reached 1,400 respondents. We list the Spark Core API (just RDDs) as one component and the higher-level libraries as others. We see that many components are widely used, with Spark Core and SQL as the most popular. Streaming is used in 46% of organizations and machine learning in 54%. While not shown directly in Figure 9, most organizations use multiple components; 88% use at least two of them, 60% use at least three (such as Spark Core and two libraries), and 27% use at least four components.

Figure 9. Percent of organizations using each Spark component, from the Databricks 2015 Spark survey; https://databricks.com/blog/2015/09/24/. [Bar chart: fraction of users for Core, SQL, Streaming, MLlib, and GraphX, on a scale from 0% to 100%.]

Deployment environments. We also see growing diversity in where Apache Spark applications run and what data sources they connect to. While the first Spark deployments were generally in Hadoop environments, only 40% of deployments in our July 2015 Spark survey were on the Hadoop YARN cluster manager. In addition, 52% of respondents ran Spark on a public cloud.

Why Is the Spark Model General?
While Apache Spark demonstrates that a unified cluster programming model is both feasible and useful, it would be helpful to understand what makes cluster programming models general, along with Spark's limitations. Here, we summarize a discussion on the generality of RDDs from Zaharia.24 We study RDDs from two perspectives. First, from an expressiveness point of view, we argue that RDDs can emulate any distributed computation, and will do so efficiently in many cases unless the computation is sensitive to network latency. Second, from a systems point of view, we show that RDDs give applications control over the most common bottleneck resources in clusters (network and storage I/O) and thus make it possible to express the same optimizations for these resources that characterize specialized systems.

Expressiveness perspective. To study the expressiveness of RDDs, we start by comparing RDDs to the MapReduce model, which RDDs build on. The first question is what computations can MapReduce itself express? Although there have been numerous discussions about the limitations of MapReduce, the surprising answer here is that MapReduce can emulate any distributed computation.

To see this, note that any distributed computation consists of nodes that perform local computation and occasionally exchange messages. MapReduce offers the map operation, which allows local computation, and reduce, which allows all-to-all communication. Any distributed computation can thus be emulated, perhaps somewhat inefficiently, by breaking down its work into timesteps, running maps to perform the local computation in each timestep, and batching and exchanging messages at the end of each step using a reduce.
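As a sketch of this argument in Spark's own API: one timestep (a "superstep") is a map that runs each node's local computation plus a keyed shuffle that delivers messages. The names superstep, State, Msg, and compute are ours, not part of any library:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object Emulation {
  // One timestep of an arbitrary message-passing computation. compute takes
  // a node's state and inbound messages and returns its new state plus
  // outbound messages addressed by destination node ID.
  def superstep[State: ClassTag, Msg: ClassTag](
      nodes: RDD[(Long, State)],    // node ID -> local state
      inbox: RDD[(Long, Seq[Msg])], // node ID -> messages received
      compute: (State, Seq[Msg]) => (State, Seq[(Long, Msg)])
  ): (RDD[(Long, State)], RDD[(Long, Seq[Msg])]) = {
    // "Map" phase: local computation on each node's state and inbox.
    val stepped = nodes.leftOuterJoin(inbox).mapValues {
      case (state, msgs) => compute(state, msgs.getOrElse(Seq.empty))
    }.cache()
    val newStates = stepped.mapValues(_._1)
    // "Reduce" phase: all-to-all exchange, grouping messages by destination.
    val newInbox = stepped
      .flatMap { case (_, (_, out)) => out }
      .groupByKey()
      .mapValues(_.toSeq)
    (newStates, newInbox)
  }
}

Each call costs one shuffle; iterating it yields exactly the chain of map and reduce rounds described next.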

A series of MapReduce steps will capture the whole result, as in Figure 10. Recent theoretical work has formalized this type of emulation by showing that MapReduce can simulate many computations in the Parallel Random Access Machine model.8 Repeated MapReduce is also equivalent to the Bulk Synchronous Parallel model.20

Figure 10. Emulating an arbitrary distributed computation with MapReduce. (a) MapReduce provides primitives for local computation (map) and all-to-all communication (reduce). (b) By chaining these steps together, we can emulate any distributed computation. The main costs for this emulation are the latency of the rounds and the overhead of passing state across steps.

While this line of work shows that MapReduce can emulate arbitrary computations, two problems can make the "constant factor" behind this emulation high. First, MapReduce is inefficient at sharing data across timesteps because it relies on replicated external storage systems for this purpose. Our emulated system may thus become slower due to writing out its state after each step. Second, the latency of the MapReduce steps determines how well our emulation will match a real network, and most MapReduce implementations were designed for batch environments with minutes to hours of latency.

RDDs and Spark address both of these limitations. On the data-sharing front, RDDs make data sharing fast by avoiding replication of intermediate data and can closely emulate the in-memory "data sharing" across time that would happen in a system composed of long-running processes. On the latency front, Spark can run MapReduce-like steps on large clusters with 100ms latency; nothing intrinsic to the MapReduce model prevents this. While some applications need finer-grain timesteps and communication, this 100ms latency is enough to implement many data-intensive workloads, where the amount of computation that can be batched before a communication step is high.

In summary, RDDs build on MapReduce's ability to emulate any distributed computation but make this emulation significantly more efficient. Their main limitation is increased latency due to synchronization in each communication step, but this latency is often not a factor.

Systems perspective. Independent of the emulation approach to characterizing Spark's generality, we can take a systems approach. What are the bottleneck resources in cluster computations? And can RDDs use them efficiently? Although cluster applications are diverse, they are all bound by the same properties of the underlying hardware. Current datacenters have a steep storage hierarchy that limits most applications in similar ways. For example, a typical Hadoop cluster might have the following characteristics:

Local storage. Each node has local memory with approximately 50GB/s of bandwidth, as well as 10 to 20 local disks, for approximately 1GB/s to 2GB/s of disk bandwidth;

Links. Each node has a 10Gbps (1.3GB/s) link, or approximately 40x less than its memory bandwidth and 2x less than its aggregate disk bandwidth; and

Racks. Nodes are organized into racks of 20 to 40 machines, with 40Gbps-80Gbps bandwidth out of each rack, or 2x-5x lower than the in-rack network performance.

Given these properties, the most important performance concern in many applications is the placement of data and computation in the network. Fortunately, RDDs provide the facilities to control this placement; the interface lets applications place computations near input data (through an API for "preferred locations" for input sources25), and RDDs provide control over data partitioning and co-location (such as specifying that data be hashed by a given key, as sketched in the code below). Libraries (such as GraphX) can thus implement the same placement strategies used in specialized systems.6


Beyond network and I/O bandwidth, the most common bottleneck tends to be CPU time, especially if data is in memory. In this case, however, Spark can run the same algorithms and libraries used in specialized systems on each node. For example, it uses columnar storage and processing in Spark SQL, native BLAS libraries in MLlib, and so on. As we discussed earlier, the only area where RDDs clearly add a cost is network latency, due to the synchronization at parallel communication steps.

One final observation from a systems perspective is that Spark may incur extra costs over some of today's specialized systems due to fault tolerance. For example, in Spark, the map tasks in each shuffle operation save their output to local files on the machine where they ran, so reduce tasks can re-fetch it later. In addition, Spark implements a barrier at shuffle stages, so the reduce tasks do not start until all the maps have finished. This avoids some of the complexity that would be needed for fault recovery if one "pushed" records directly from maps to reduces in a pipelined fashion. Although removing some of these features would speed up the system, Spark often performs competitively despite them. The main reason is an argument similar to our previous one: many applications are bound by an I/O operation (such as shuffling data across the network or reading it from disk), and beyond this operation, optimizations (such as pipelining) add only a modest benefit. We have kept fault tolerance "on" by default in Spark to make it easy to reason about applications.

Ongoing Work
Apache Spark remains a rapidly evolving project, with contributions from both industry and research. The codebase size has grown by a factor of six since June 2013, with most of the activity in new libraries. More than 200 third-party packages are also available.c In the research community, multiple projects at Berkeley, MIT, and Stanford build on Spark, and many new libraries (such as GraphX and Spark Streaming) came from research groups. Here, we sketch four of the major efforts.

DataFrames and more declarative APIs. The core Spark API was based on functional programming over distributed collections that contain arbitrary types of Scala, Java, or Python objects. While this approach was highly expressive, it also made programs more difficult to automatically analyze and optimize. The Scala/Java/Python objects stored in RDDs could have complex structure, and the functions run over them could include arbitrary code. In many applications, developers could get suboptimal performance if they did not use the right operators; for example, the system on its own could not push filter functions ahead of maps.

To address this problem, we extended Spark in 2015 to add a more declarative API called DataFrames2 based on the relational algebra. Data frames are a common API for tabular data in Python and R. A data frame is a set of records with a known schema, essentially equivalent to a database table, that supports operations like filtering and aggregation using a restricted "expression" API. Unlike working in the SQL language, however, data frame operations are invoked as function calls in a more general programming language (such as Python and R), allowing developers to easily structure their program using abstractions in the host language (such as functions and classes). Figure 11 and Figure 12 show examples of the API.

Figure 11. Example of Spark's DataFrame API in Python. Unlike Spark's core API, DataFrames have a schema with named columns (such as age and city) and take expressions in a limited language (such as age > 20) instead of arbitrary Python functions.

users.where(users["age"] > 20)
  .groupBy("city")
  .agg(avg("age"), max("income"))

Figure 12. Working with DataFrames in Spark's R API. We load a distributed DataFrame using Spark's JSON data source, then filter and aggregate using standard R column expressions.

people <- read.df(context, "./people.json", "json")

# Filter people by age
adults <- filter(people, people$age > 20)

# Count the number of people by city
summarize(groupBy(adults, adults$city), count = n(adults$id))
## city count
## 1 Cambridge 1
## 2 San Francisco 6
## 3 Berkeley 4

Spark's DataFrames offer a similar API to single-node packages but automatically parallelize and optimize the computation using Spark SQL's query planner. User code thus receives optimizations (such as predicate pushdown, operator reordering, and join algorithm selection) that were not available under Spark's functional API. To our knowledge, Spark DataFrames are the first library to perform such relational optimizations under a data frame API.d

While DataFrames are still new, they have quickly become a popular API. In our July 2015 survey, 60% of respondents reported using them. Because of the success of DataFrames, we have also developed a type-safe interface over them called Datasetse that lets Java and Scala programmers view DataFrames as statically typed collections of Java objects, similar to the RDD API, and still receive relational optimizations. We expect these APIs to gradually become the standard abstraction for passing data between Spark libraries.

Performance optimizations. Much of the recent work in Spark has been on performance. In 2014, the Databricks team spent considerable effort to optimize Spark's network and I/O primitives, allowing Spark to jointly set a new record for the Daytona GraySort challenge.f Spark sorted 100TB of data 3x faster than the previous record holder based on Hadoop MapReduce, using 10x fewer machines. This benchmark was not executed in memory but rather on (solid-state) disks. In 2015, one major effort was Project Tungsten,g which removes Java Virtual Machine overhead from many of Spark's code paths by using code generation and non-garbage-collected memory. One benefit of doing these optimizations in a general engine is that they simultaneously affect all of Spark's libraries; machine learning, streaming, and SQL all became faster from each change.

R language support. The SparkR project21 was merged into Spark in 2015 to provide a programming interface in R. The R interface is based on DataFrames and uses almost identical syntax to R's built-in data frames. Other Spark libraries (such as MLlib) are also easy to call from R, because they accept DataFrames as input.

c One package index is available at https://spark-packages.org/
d One reason optimization is possible is that Spark's DataFrame API uses lazy evaluation where the content of a DataFrame is not computed until the user asks to write it out. The data frame APIs in R and Python are eager, preventing optimizations like operator reordering.
e https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
f http://sortbenchmark.org/ApacheSpark2014.pdf
g https://databricks.com/blog/2015/04/28/


Research libraries. Apache Spark continues to be used to build higher-level data processing libraries. Recent projects include Thunder for neuroscience,5 ADAM for genomics,15 and Kira for image processing in astronomy.27 Other research libraries (such as GraphX) have been merged into the main codebase.

Conclusion
Scalable data processing will be essential for the next generation of computer applications but typically involves a complex sequence of processing steps with different computing systems. To simplify this task, the Spark project introduced a unified programming model and engine for big data applications. Our experience shows such a model can efficiently support today's workloads and brings substantial benefits to users. We hope Apache Spark highlights the importance of composability in programming libraries for big data and encourages development of more easily interoperable libraries.

All Apache Spark libraries described in this article are open source at http://spark.apache.org/. Databricks has also made videos of all Spark Summit conference talks available for free at https://spark-summit.org/.

Acknowledgments
Apache Spark is the work of hundreds of open source contributors who are credited in the release notes at https://spark.apache.org. Berkeley's research on Spark was supported in part by National Science Foundation CISE Expeditions Award CCF-1139158, Lawrence Berkeley National Laboratory Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, IBM, The Thomas and Stacey Siebel Foundation, Adobe, Apple, Arimo, Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata, and VMware.

References
1. Apache Storm project; http://storm.apache.org
2. Armbrust, M. et al. Spark SQL: Relational data processing in Spark. In Proceedings of the ACM SIGMOD/PODS Conference (Melbourne, Australia, May 31–June 4). ACM Press, New York, 2015.
3. Dave, A. IndexedRDD project; http://github.com/amplab/spark-indexedrdd
4. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth OSDI Symposium on Operating Systems Design and Implementation (San Francisco, CA, Dec. 6–8). USENIX Association, Berkeley, CA, 2004.
5. Freeman, J., Vladimirov, N., Kawashima, T., Mu, Y., Sofroniew, N.J., Bennett, D.V., Rosen, J., Yang, C.-T., Looger, L.L., and Ahrens, M.B. Mapping brain activity at scale with cluster computing. Nature Methods 11, 9 (Sept. 2014), 941–950.
6. Gonzalez, J.E. et al. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of the 11th OSDI Symposium on Operating Systems Design and Implementation (Broomfield, CO, Oct. 6–8). USENIX Association, Berkeley, CA, 2014.
7. Isard, M. et al. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the EuroSys Conference (Lisbon, Portugal, Mar. 21–23). ACM Press, New York, 2007.
8. Karloff, H., Suri, S., and Vassilvitskii, S. A model of computation for MapReduce. In Proceedings of the ACM-SIAM SODA Symposium on Discrete Algorithms (Austin, TX, Jan. 17–19). ACM Press, New York, 2010.
9. Kornacker, M. et al. Impala: A modern, open-source SQL engine for Hadoop. In Proceedings of the Seventh Biennial CIDR Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 4–7, 2015).
10. Low, Y. et al. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the 38th International VLDB Conference on Very Large Databases (Istanbul, Turkey, Aug. 27–31, 2012).
11. Malewicz, G. et al. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD/PODS Conference (Indianapolis, IN, June 6–11). ACM Press, New York, 2010.
12. McSherry, F., Isard, M., and Murray, D.G. Scalability! But at what COST? In Proceedings of the 15th HotOS Workshop on Hot Topics in Operating Systems (Kartause Ittingen, Switzerland, May 18–20). USENIX Association, Berkeley, CA, 2015.
13. Melnik, S. et al. Dremel: Interactive analysis of Web-scale datasets. Proceedings of the VLDB Endowment 3 (Sept. 2010), 330–339.
14. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., and Talwalkar, A. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research 17, 34 (2016), 1–7.
15. Nothaft, F.A., Massie, M., Danford, T., Zhang, Z., Laserson, U., Yeksigian, C., Kottalam, J., Ahuja, A., Hammerbacher, J., Linderman, M., Franklin, M.J., Joseph, A.D., and Patterson, D.A. Rethinking data-intensive science using scalable analytics systems. In Proceedings of the ACM SIGMOD/PODS Conference (Melbourne, Australia, May 31–June 4). ACM Press, New York, 2015.
16. Shun, J. and Blelloch, G.E. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN PPoPP Symposium on Principles and Practice of Parallel Programming (Shenzhen, China, Feb. 23–27). ACM Press, New York, 2013.
17. Sparks, E.R., Talwalkar, A., Smith, V., Kottalam, J., Pan, X., Gonzalez, J.E., Franklin, M.J., Jordan, M.I., and Kraska, T. MLI: An API for distributed machine learning. In Proceedings of the IEEE ICDM International Conference on Data Mining (Dallas, TX, Dec. 7–10). IEEE Press, 2013.
18. Stonebraker, M. and Cetintemel, U. 'One size fits all': An idea whose time has come and gone. In Proceedings of the 21st International ICDE Conference on Data Engineering (Tokyo, Japan, Apr. 5–8). IEEE Computer Society, Washington, D.C., 2005, 2–11.
19. Thomas, K., Grier, C., Ma, J., Paxson, V., and Song, D. Design and evaluation of a real-time URL spam filtering service. In Proceedings of the IEEE Symposium on Security and Privacy (Oakland, CA, May 22–25). IEEE Press, 2011.
20. Valiant, L.G. A bridging model for parallel computation. Commun. ACM 33, 8 (Aug. 1990), 103–111.
21. Venkataraman, S. et al. SparkR; http://dl.acm.org/citation.cfm?id=2903740
22. Xin, R. and Zaharia, M. Lessons from running large-scale Spark workloads; http://tinyurl.com/large-scale-spark
23. Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., and Stoica, I. Shark: SQL and rich analytics at scale. In Proceedings of the ACM SIGMOD/PODS Conference (New York, June 22–27). ACM Press, New York, 2013.
24. Zaharia, M. An Architecture for Fast and General Data Processing on Large Clusters. Ph.D. thesis, Electrical Engineering and Computer Sciences Department, University of California, Berkeley, 2014; https://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
25. Zaharia, M. et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the Ninth USENIX NSDI Symposium on Networked Systems Design and Implementation (San Jose, CA, Apr. 25–27, 2012).
26. Zaharia, M. et al. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the 24th ACM SOSP Symposium on Operating Systems Principles (Farmington, PA, Nov. 3–6). ACM Press, New York, 2013.
27. Zhang, Z., Barbary, K., Nothaft, N.A., Sparks, E., Zahn, O., Franklin, M.J., Patterson, D.A., and Perlmutter, S. Scientific computing meets big data technology: An astronomy use case. In Proceedings of the IEEE International Conference on Big Data (Santa Clara, CA, Oct. 29–Nov. 1). IEEE, 2015.

Matei Zaharia ([email protected]) is an assistant professor of computer science at Stanford University, Stanford, CA, and CTO of Databricks, San Francisco, CA.

Reynold S. Xin ([email protected]) is the chief architect on the Spark team at Databricks, San Francisco, CA.

Patrick Wendell ([email protected]) is the vice president of engineering at Databricks, San Francisco, CA.

Tathagata Das ([email protected]) is a software engineer at Databricks, San Francisco, CA.

Michael Armbrust ([email protected]) is a software engineer at Databricks, San Francisco, CA.

Ankur Dave ([email protected]) is a graduate student in the Real-Time, Intelligent and Secure Systems Lab at the University of California, Berkeley.

Xiangrui Meng ([email protected]) is a software engineer at Databricks, San Francisco, CA.

Josh Rosen ([email protected]) is a software engineer at Databricks, San Francisco, CA.

Shivaram Venkataraman ([email protected]) is a Ph.D. student in the AMPLab at the University of California, Berkeley.

Michael J. Franklin ([email protected]) is the Liew Family Chair of Computer Science at the University of Chicago and Director of the AMPLab at the University of California, Berkeley.

Ali Ghodsi ([email protected]) is the CEO of Databricks and adjunct faculty at the University of California, Berkeley.

Joseph E. Gonzalez ([email protected]) is an assistant professor in EECS at the University of California, Berkeley.

Scott Shenker ([email protected]) is a professor in EECS at the University of California, Berkeley.

Ion Stoica ([email protected]) is a professor in EECS and co-director of the AMPLab at the University of California, Berkeley.

Watch the authors discuss their work in this exclusive Communications video: http://cacm.acm.org/videos/spark

Copyright held by the authors. Publication rights licensed to ACM. $15.00
