Trash Day: Coordinating Garbage Collection in Distributed Systems
Martin Maas⋆†∗   Tim Harris†   Krste Asanović⋆   John Kubiatowicz⋆
⋆ University of California, Berkeley   † Oracle Labs, Cambridge

∗ Work was started while at Oracle Labs, Cambridge. UC Berkeley research partially funded by the STARnet Center for Future Architecture Research, and ASPIRE Lab industrial sponsors Intel, Google, Huawei, LG, NVIDIA, Oracle, and Samsung.

Abstract

Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications.

In this paper, we show that distributed applications suffer from each node's language runtime system making GC-related decisions independently. We first demonstrate this problem on two widely-used systems (Apache Spark and Apache Cassandra). We then propose solving this problem using a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes.

We present initial results to demonstrate that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload, and in improving GC-related tail-latencies in an interactive setting.

1 Introduction

Garbage-collected languages are the dominant choice for distributed applications in cloud data centers. Languages such as C#, Go, Java, JavaScript/node.js, PHP/Hack, Python, Ruby and Scala account for a large portion of code in this environment. Popular frameworks such as Hadoop, Spark and Zookeeper are written in these languages, cloud services such as Google AppEngine and Microsoft Azure target these languages directly, and companies such as Twitter [7] or Facebook [4, 5] build a significant portion of their software around them.

This reflects the general trend towards higher-level languages. Their productivity and safety properties often outweigh performance disadvantages, particularly for companies that cannot afford the engineering workforce to maintain large native code bases.

Garbage collection (GC) underpins many of the advantages of high-level languages. GC helps productivity because it reduces the engineering effort of explicitly managing pointer ownership. It also eliminates a large class of bugs, increases safety and avoids many sources of memory leaks (the latter is very important in cloud settings where applications often run for a long time).

GC performs well for many single-machine workloads. However, as we show in Section 2, it is a double-edged sword in the cloud setting. In latency-critical applications such as web servers or databases, GC pauses can cause requests to take unacceptably long times (this is true even for minor GC pauses on the order of milliseconds, as sub-millisecond latency requirements are increasingly common). This is exacerbated in applications that are composed of hundreds of services, where the overall latency depends on the slowest component (as is common in data center workloads [10]). GC also poses a problem for applications that distribute live data across nodes, since pauses can make a node's data unavailable.

In our work, we investigate the sources of GC-related problems in current data center applications. We first show how to alleviate these problems by coordinating GC pauses between different nodes, such that they occur at times that are convenient for the application. We then show how a Holistic Runtime System [15] can be used to achieve this in a general way, using an approach we call Holistic Garbage Collection (Section 3). We finally present a work-in-progress Holistic Runtime System currently under development at UC Berkeley (Section 4).
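As a back-of-the-envelope illustration of the "slowest component" effect (an illustrative calculation, not a measurement from this work): if a request fans out to N services and each service is in a GC pause at any given moment with independent probability p, the request observes at least one pause with probability

    P(at least one pause) = 1 - (1 - p)^N.

Even a seemingly negligible p = 1% gives 1 - 0.99^100 ≈ 63% for N = 100 services, so pauses that are rare on any single node can dominate tail latency at data center scale.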
2 GC in Distributed Applications

Data center applications written in high-level languages are typically deployed by running each process within its own, independent language runtime system (such as a Java Virtual Machine or Common Language Runtime). Frameworks such as Hadoop or Spark hence run over multiple runtime systems on different nodes, communicating through libraries such as Apache Thrift [2].

A consequence of this approach is that each runtime system makes decisions independently, including over when to perform GC. In practice, this means that GC pauses occur on different nodes at different times, based on when memory fills up and needs to be collected. Depending on the collector and workload, these pauses can range from milliseconds to multiple seconds.

In this section, we show how GC pauses cause problems in two representative real-world systems, Apache Spark [20] and Apache Cassandra [14]. We next demonstrate how even simple strategies can alleviate these problems. In Section 3, we then show how these strategies can be generalized to fit a wider range of systems.

Figure 1: Impact of GC on the superstep durations of Spark PageRank (darker = more nodes performing GC during a superstep; white = no GC). This does not count minor collections, which occur much more frequently but have negligible impact. Panels: (a) Baseline System (no coordination); (b) Coordinating GC (stop-the-world everywhere). Axes: superstep vs. execution time (s).

We use the commodity OpenJDK Hotspot JVM (using the GC settings provided by each application). There are specialized systems – such as those used in real-time scenarios – that limit or even eliminate GC pauses by running GC concurrently with the application [18]. However, this usually comes at the cost of reduced overall performance or increased resource utilization (e.g., from barrier or trap handling). Furthermore, these specialized runtime systems still incur pauses if memory fills up faster than it can be collected. To our knowledge, none of these systems are widely used in cloud settings.

We perform all experiments on a cluster of 2-socket Xeon E5-2660 machines with 256GB RAM, connected through Infiniband. All our workloads run on dedicated nodes, but the network is shared with other jobs.

Figure 2: Memory (old generation) occupancy and GC pauses of Spark PageRank. Each of the colors represents a node, and vertical lines indicate the start of a GC. Panels: (a) Baseline System (no coordination); (b) Coordinating GC (stop-the-world everywhere). Axes: execution time (s) vs. old-generation occupancy (%).
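Occupancy traces like those in Figure 2 can be observed on an unmodified HotSpot JVM through the standard java.lang.management API. The following is a minimal sketch (illustrative only; the class name, the pool-name matching, and the one-second sampling interval are arbitrary choices rather than the tooling used for the measurements reported here) that samples the old-generation pool:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Samples the occupancy of the old-generation heap pool once per second.
// Pool names depend on the collector in use ("PS Old Gen", "CMS Old Gen",
// "G1 Old Gen", "Tenured Gen", ...), so we match on the name suffix.
public class OldGenProbe {
    // Returns old-generation occupancy as a fraction in [0, 1], or -1 if no
    // matching pool is found. Assumes -Xmx is set so that getMax() is defined.
    public static double oldGenOccupancy() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.endsWith("Old Gen") || name.endsWith("Tenured Gen")) {
                MemoryUsage u = pool.getUsage();
                return (double) u.getUsed() / u.getMax();
            }
        }
        return -1.0;
    }

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            System.out.printf("old-gen occupancy: %.1f%%%n", 100 * oldGenOccupancy());
            Thread.sleep(1000);  // sample once per second
        }
    }
}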
2.1 Case Study I: Apache Spark

Apache Spark is a distributed computation framework representative of a class of applications often called batch workloads. Spark jobs perform large-scale, long-running computations on distributed data sets, and performance is measured in overall job execution time.

Problem. When running a job on a cluster, Spark spawns a worker process on each node. Each worker then performs a series of tasks on its local data. Occasionally, workers communicate data between nodes (for example, during shuffle operations). In those cases, no node can continue until data has been exchanged with every other node (equivalent to a cluster-wide barrier).

This synchronization causes problems in the presence of GC pauses: if even a single node is stuck in GC during such an operation, no other node can make progress, and therefore all nodes stall for a significant amount of time. Worse, once the stalled node finishes its GC, execution will continue and may quickly trigger a GC pause on a different node, since nodes are unlikely to allocate any memory while stalling idly and therefore only trigger GC while actually performing productive work.

To illustrate this problem, Figure 1a shows a PageRank computation using Spark on an 8-node cluster (the same workload as in the original Spark paper [20]). We show the time that each PageRank superstep takes, as well as how many nodes incur a GC pause during that superstep. We clearly see that while superstep times are homogeneous in the absence of GC, they increase significantly as soon as at least one node is performing a collection. Figure 2a shows the reason for this: memory occupancy increases independently on the different nodes, and once it reaches a threshold, a collection is performed, independently of the state of the other nodes.

Figure (caption not recovered): Query Latency (ms) over Time (s); (a) Baseline System (no coordination).

Strategy. Instead of nodes collecting independently once their memory fills up, we want all nodes to perform GC at the same time – this way, none of the nodes wastes time waiting for another node to finish collecting. This is reminiscent of the use of gang scheduling in multi-threaded, synchronization-heavy workloads. Figure 1b and Figure 2b show the effect of this strategy: we instrumented all the JVMs in our Spark cluster to track their occupancy, and as soon as any one node reached an occupancy threshold (80% in this case), we triggered a GC on all nodes in the Spark cluster. Even on our small cluster, the PageRank computation completed 15% faster overall, without tuning or modifications to the JVM. This effect will become significantly more pronounced on larger cluster sizes, since this increases the likelihood of some node incurring a GC during each superstep.
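The following self-contained sketch illustrates this coordination strategy; it is illustrative only. The Worker interface, the LocalWorker class, and the 100 ms polling interval are arbitrary choices; a real deployment would reach each worker JVM over RPC, and the experiment above tracked old-generation rather than whole-heap occupancy (the probe sketch in Section 2 shows how to read that pool). On a stock HotSpot JVM, System.gc() requests a full, stop-the-world collection unless -XX:+DisableExplicitGC is set.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.List;

// Sketch of coordinated "stop-the-world everywhere" collection: as soon as
// any worker is nearly out of memory, every worker collects at the same time,
// so the pauses overlap instead of stalling one superstep after another.
public class CoordinatedGc {

    // Stand-in for a worker JVM; a real deployment would implement this with
    // an RPC connection to an agent running inside each remote runtime.
    interface Worker {
        double heapOccupancy();   // fraction of the heap in use, 0..1
        void triggerFullGc();     // ask this worker to collect now
    }

    // Example implementation that targets the JVM it runs in. Whole-heap
    // occupancy is used for brevity; the old-generation pool could be read
    // as in the probe shown in Section 2.
    static class LocalWorker implements Worker {
        public double heapOccupancy() {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            return (double) heap.getUsed() / heap.getMax();   // assumes -Xmx is set
        }
        public void triggerFullGc() {
            System.gc();   // full, stop-the-world GC unless -XX:+DisableExplicitGC
        }
    }

    static final double THRESHOLD = 0.80;   // same threshold as in the experiment above

    // Poll all workers; when any one is nearly full, have everyone collect together.
    // (A real implementation would debounce repeated triggers after a collection.)
    static void coordinate(List<Worker> workers) throws InterruptedException {
        while (true) {
            boolean anyNearlyFull =
                    workers.stream().anyMatch(w -> w.heapOccupancy() >= THRESHOLD);
            if (anyNearlyFull) {
                workers.forEach(Worker::triggerFullGc);
            }
            Thread.sleep(100);   // coarse polling suffices for multi-second GC cycles
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Single-JVM demo; in the real setting there would be one Worker per node.
        coordinate(List.of(new LocalWorker()));
    }
}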