Broom: Sweeping out Garbage Collection from Big Data Systems
Ionel Gog∗  Jana Giceva†∗  Malte Schwarzkopf  Kapil Vaswani‡  Dimitrios Vytiniotis‡  Ganesan Ramalingam‡  Manuel Costa‡  Derek Murray⋆♦  Steven Hand⋆♦  Michael Isard⋆
University of Cambridge   †ETH Zurich   ‡Microsoft Research   ⋆Unaffiliated
∗Parts of the work were done while at Microsoft Research Silicon Valley.   ♦Now at Google, Inc.

Abstract

Many popular systems for processing “big data” are implemented in high-level programming languages with automatic memory management via garbage collection (GC). However, high object churn and large heap sizes put severe strain on the garbage collector. As a result, applications underperform significantly: GC increases the runtime of typical data processing tasks by up to 40%. We propose to use region-based memory management instead of GC in distributed data processing systems. In these systems, many objects have clearly defined lifetimes. Hence, it is natural to allocate these objects in fate-sharing regions, obviating the need to scan a large heap. Regions can be memory-safe and could be inferred automatically. Our initial results show that region-based memory management reduces emulated Naiad vertex runtime by 34% for typical data analytics jobs.

1 Introduction

Memory-managed languages dominate the landscape of systems for computing with “big data”: Hadoop, Spark [22], DryadLINQ [21] and Naiad [15] are only some examples of systems running atop a Java Virtual Machine (JVM) or the .NET common language runtime (CLR). Managed languages are attractive as they offer strong typing, automated memory management and higher-order functions. These features improve the productivity of system developers and end-users.

However, these benefits do not come for free. Data processing tasks stress the runtime GC by allocating a large number of objects [4, 17]. This results in long GC pauses that reduce application throughput or cause “straggler” tasks [14, 18]. In §2, we show that the impact of GC on job runtime can range between 20 and 40%.

In this paper, we argue that a different approach can be used in distributed data processing systems. Their operation is highly structured: most such systems are based on an implicit or explicit graph of stateful data-flow operators executed by worker threads. These operators perform event-based processing of arriving input data, and therefore behave as independent actors. For example, at any one time, MapReduce runs a “map” or “reduce” function (i.e., operator) in each task [5], Dryad [11] and Naiad [15] execute a “vertex” per worker thread, and Spark runs an “action” per task [22]. Each data-flow operator’s objects live at most as long as the operator itself. Moreover, they are often grouped in logical batches – e.g., according to keys or timestamps – that can be freed atomically. This architecture presents an opportunity to revisit standard memory management, because:

1. Actors explicitly share state via message-passing.
2. The state held by actors consists of many fate-sharing objects with common lifetimes.
3. End-users only supply code fragments to system-defined operators, which makes automatic program transformations and region annotations practical.

In §3, we illustrate these points with reference to Naiad.

Region-based memory management [19] works well for sets of related objects in the absence of implicit sharing. While writing programs using regions is difficult in the general case, this old concept is a good fit for the restricted domain of distributed data processing systems (§4). In addition, region-based allocation can offer memory safety and may be as transparent to the user as GC-based memory management. We sketch how this can be achieved in a distributed data processing system in §5.

Using Broom, a proof-of-concept implementation of region-based memory allocation for Naiad vertices, we show that region-based memory management eliminates the overheads of GC and improves execution time by up to 34% in memory-intensive operators (§6).
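To make the fate-sharing region idea concrete before turning to the motivating measurements, the following is a minimal sketch, in C# (the language of Listing 1), of what region-scoped allocation could look like from an operator author’s point of view. The Region type, its Allocate method and the ProcessBatch helper are illustrative assumptions, not Broom’s actual API; a real region allocator would carve objects out of a contiguous arena rather than merely tracking references.

// Hypothetical sketch only, not Broom's API: every object allocated through a
// Region fate-shares it and is reclaimed together with it when the region is
// disposed, so the collector never needs to trace these objects individually.
using System;
using System.Collections.Generic;

public sealed class Region : IDisposable
{
    // Stand-in implementation so the sketch compiles and runs; a real region
    // allocator would place objects in a contiguous arena off the GC heap.
    private readonly List<object> objects = new List<object>();

    public T Allocate<T>() where T : new()
    {
        var obj = new T();
        objects.Add(obj);
        return obj;
    }

    public void Dispose()
    {
        objects.Clear();   // the whole batch of fate-sharing objects dies here
    }
}

public static class RegionExample
{
    // All per-batch state lives in one region and disappears atomically when
    // the using block ends; no per-object tracing is required.
    public static void ProcessBatch(IEnumerable<string> batch)
    {
        using (var region = new Region())
        {
            var counts = region.Allocate<Dictionary<string, int>>();
            foreach (var record in batch)
            {
                int c;
                counts.TryGetValue(record, out c);
                counts[record] = c + 1;
            }
            Console.WriteLine("distinct keys in batch: " + counts.Count);
        }   // region reclaimed here, instead of relying on a heap scan
    }

    public static void Main()
    {
        ProcessBatch(new[] { "uk", "de", "uk", "fr", "uk" });
    }
}

The point of the sketch is the lifetime discipline, not the implementation: a logical batch’s objects are created together, used together, and released in one step when the batch is done.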
2 Motivation

We illustrate the effect of GC on Naiad’s performance using two simple experiments:

1. We measure the fraction of job runtime spent in GC for two data-intensive batch jobs: TPC-H Q17 and a join-heavy business analytics workflow (§2.1).
2. We measure the effects of GC stalls when computing strongly connected components, an iterative workflow with frequent synchronization (§2.2).

We run Naiad v0.4 on Linux using Mono v2.10.8.1 with the generational GC (sgen) enabled.

2.1 Batch processing workflows

We run two typical batch processing workflows with high object churn on a single machine.¹ The first is query 17 from the TPC-H benchmark and the second is “shopper”, a business intelligence workflow that selects users from the same country who bought a specific product (a JOIN–SELECT–JOIN workflow). In these experiments, Naiad uses eleven worker threads on the 12-core machine.

¹ AMD Opteron 4243 (12× 3.1 GHz) with 64 GB of DDR3-1600.

With the default GC configuration, we found that the TPC-H workflow spends around 25% of its runtime in GC, while the “shopper” workflow reaches about 37% (Figure 1). This makes sense: “shopper” generates many small objects that are subsequently freed in minor collections of the 4 MB young generation heap. Increasing the size of the young generation heap reduces the number of objects promoted to the next generation in “shopper”, and thus the overall number of collections (Figure 2). This reduces the time spent on GC for “shopper” (where many objects die young), but the increased young generation heap size does not help TPC-H Q17 (which uses stateful JOINs). In fact, the number of minor collections in TPC-H Q17 increases with young generation heap size as they are traded for major ones (Figure 2).

Figure 1: The Naiad TPC-H Q17 and “shopper” workflows spend 20–40% of their total runtime on GC, independent of the young generation heap size. (Time spent on GC [%] plotted against young generation heap size [MB; log2].)

Figure 2: Increasing the young generation heap trades more minor collections for fewer major collections in TPC-H, and reduces total collections for “shopper”. (Two panels, TPC-H Q17 and “shopper”; minor and major collection counts plotted against young generation heap size [MB; log2].)

This experiment looked at the GC behavior of a single data-intensive process that does not communicate. In the next experiment, we show that the problem is exacerbated when dependencies exist between processes.

2.2 Synchronized iterative workflow

We run an incremental strongly connected components workflow on a graph of 15M vertices and 80M edges. Each incremental step changes one random edge in the graph. The computation synchronizes all nodes after every graph change. We collect a trace of 24 Naiad processes on four machines (six per machine) by logging the timestamps of the synchronization steps and the times at which the processes run their GC.

Figure 3 shows a subset of the trace, with each gray vertical bar corresponding to a synchronization step at the end of an iteration. GC pauses are shown as horizontal bars for major (orange) and minor (green) collections. It is evident that GC invocations sometimes delay a synchronization by tens of milliseconds.

Figure 3: Trace of incremental strongly connected components in 24 parallel Naiad processes: uncoordinated GC pauses (orange and green bars) delay synchronization barriers (gray vertical lines). (Naiad process ID plotted against experiment time [ms].)

It is also worth observing that a GC in one process is occasionally immediately followed by another GC in a different process (e.g. at times 170–180, 430–460, 530–560 and 720–750). This occurs because some changes to the graph affect state in several processes. Downstream processes may trigger a GC as soon as they receive messages from upstream ones that have just finished their GC. In other words, the GCs are effectively serialized when a parallel execution would offer better performance.
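The traces above hinge on knowing when each process performs minor and major collections. The paper does not describe its instrumentation, so the following is only a hedged sketch of one way to record per-process GC activity from managed code, by polling the runtime’s collection counters; it approximates when collections happen, but not how long they pause the process.

// Hedged sketch: log the wall-clock times at which each generation's
// collection count changes, roughly the kind of per-process GC trace that
// underlies Figure 3. GC.CollectionCount and GC.MaxGeneration are standard
// .NET/Mono APIs; the 1 ms polling granularity is an arbitrary choice.
using System;
using System.Diagnostics;
using System.Threading;

public static class GcTrace
{
    public static void Run(CancellationToken token)
    {
        var clock = Stopwatch.StartNew();
        int generations = GC.MaxGeneration + 1;   // minor + major on Mono sgen
        var lastCounts = new int[generations];
        for (int g = 0; g < generations; g++)
            lastCounts[g] = GC.CollectionCount(g);

        while (!token.IsCancellationRequested)
        {
            for (int g = 0; g < generations; g++)
            {
                int count = GC.CollectionCount(g);
                if (count != lastCounts[g])
                {
                    // A generation-g collection completed since the last poll.
                    Console.WriteLine("{0} ms: gen {1} collections = {2}",
                                      clock.ElapsedMilliseconds, g, count);
                    lastCounts[g] = count;
                }
            }
            Thread.Sleep(1);
        }
    }
}

Running such a logger on a background thread in every Naiad process, alongside logging the synchronization timestamps, would yield a trace of the same shape as Figure 3: GC events per process over time.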
3 Case study: Naiad

We illustrate the memory management patterns common in distributed data processing using Naiad as an example. Listing 1 shows the code of the Naiad Aggregate vertex that groups and aggregates input data. The inputs to the vertex are batched in Message objects, each associated with a logical timestamp. The results of the aggregation are stored in a dictionary keyed by the logical timestamp (line 2). Upon receiving a message, Naiad calls the user-provided OnReceive method (ln. 4). The method processes the input and applies the Aggregate function to each entry (ln. 12), storing the results. Finally, the OnNotify method is called once the actor is guaranteed to receive no more messages with a logical timestamp less than or equal to the one passed (ln. 15).

Listing 1: The Naiad Aggregate vertex, which groups and aggregates input data (simplified).

 1  public class AggregateActor {
 2    private Dictionary<Time, Dictionary<K, V>> state;
 3
 4    public void OnReceive(Time time, Message msg) {
 5      if (state[time] == null) {
 6        state[time] = new Dictionary<K, V>();
 7        NotifyAt(time);
 8      }
 9      foreach (var entry in msg) {
10        var key = SelectKey(entry);
11        state[time][key] =
12          Aggregate(state[time][key], entry);
13      }
14    }
15    public void OnNotify(Time time) {
16      // Remove state for timestamp time
17      state.Remove(time);
18      Send(outgoingMsg);
19    }
20  }
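The per-timestamp structure of this state is exactly the fate-sharing pattern identified in §1: everything reachable from state[time] becomes garbage at the same instant, inside OnNotify. Purely as an illustration – reusing the hypothetical Region type sketched after §1 and the same simplified style and undeclared Naiad helpers as Listing 1, not Broom’s actual design, which later sections (§4–§6) discuss – the vertex could place each timestamp’s state in its own region and free it wholesale:

public class RegionAggregateActor {
  // Hypothetical variant of Listing 1: one region per logical timestamp, so
  // that all aggregation state for a timestamp fate-shares that region.
  private Dictionary<Time, Region> regions;
  private Dictionary<Time, Dictionary<K, V>> state;

  public void OnReceive(Time time, Message msg) {
    if (state[time] == null) {
      regions[time] = new Region();
      // The per-timestamp dictionary (and, ideally, every entry later added
      // to it) is allocated in the timestamp's region, not on the GC heap.
      state[time] = regions[time].Allocate<Dictionary<K, V>>();
      NotifyAt(time);
    }
    foreach (var entry in msg) {
      var key = SelectKey(entry);
      state[time][key] = Aggregate(state[time][key], entry);
    }
  }

  public void OnNotify(Time time) {
    Send(outgoingMsg);
    state.Remove(time);
    regions[time].Dispose();  // reclaim every object for this timestamp at once
    regions.Remove(time);
  }
}

Under this discipline, no heap scan is needed to reclaim the aggregation state: once OnNotify fires for a timestamp, its region and everything inside it can be released in a single step.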