Caching in the Multiverse

Mania Abdi†, Amin Mosayyebzadeh, Mohammad Hossein Hajkazemi†, Ata Turk‡, Orran Krieger, Peter Desnoyers†
†Northeastern University, Boston University, ‡State Street

Abstract

To get good performance for data stored in object storage services like S3, data analysis clusters need to cache data locally. Recently these caches have started taking into account higher-level information from the analysis framework, allowing prefetching based on predictions of future data accesses. There is, however, a broader opportunity; rather than using this information to predict one future, we can use it to select a future that is best for caching. This paper provides preliminary evidence that we can exploit the directed acyclic graph (DAG) of inter-task dependencies used by data-parallel frameworks such as Spark, PIG and Hive to improve application performance, by optimizing caching for the critical path through the DAG for the application. We present experimental results for PIG running TPC-H queries, showing completion time improvements of up to 23% vs our implementation of MRD, a state-of-the-art DAG-based prefetching system, and improvements of up to 2.5x vs LRU caching. We then discuss the broader opportunity for building a system based on this approach.

1 Introduction

Modern data analytics platforms (e.g. Spark [37], PIG [20], or Hive [27]) are often coupled with external data storage services such as Amazon S3 [3] and Azure Data Lake Store [17], resulting in storage bottlenecks [4, 15, 18, 23]. Multiple caching solutions have been developed to address this storage bottleneck; solutions such as Pacman [5], Tachyon/Alluxio [2, 16] and Apache Ignite [6] allow datasets to be cached within the local cluster. However, given finite cache, and often even more limited bandwidth for fetching data into the cache, the performance of this cache depends on its caching policy, and recent studies show that traditional caching policies (e.g. LRU) for this workload perform poorly relative to task-specific ones [5, 9, 15, 21].

Higher-level analysis frameworks such as PIG [20], Hive [27] and SPARK [37] compile user programs into an execution plan consisting of multiple jobs (for example, MapReduce [11] or Tez [24] jobs) and a directed acyclic graph (DAG) of dependencies between these jobs. Jobs are then scheduled in parallel, within the constraints set by these dependencies. Jobs can take minutes to even hours [9], resulting in execution plans which identify data accesses far into the future. Exploiting this knowledge of future access patterns results in significant improvements in caching performance vs. LRU and other history-based algorithms, as shown by works such as MemTune [33], LRC [36] and Netco [15].

These existing efforts use application DAG information to predict future data accesses, and then prefetch data into the cache and manage the cache contents based on those predictions. In doing so, they are not taking advantage of a fundamental opportunity. Rather than caching data given a prediction of task execution, can we exploit the information provided by the DAG to influence the order of task execution to enable more effective caching? That is, rather than managing/prefetching the cache based on one prediction of the future universe, can we select a universe for which caching will be more effective?

This paper provides preliminary evidence that the answer is yes. In a simple, semi-automated experiment, we show that caching can be used to optimize the critical path through the DAG, and present experimental results showing completion time improvements for TPC-H queries of as much as 2.5x over LRU and 23% over MRD [22], the state-of-the-art DAG-based approach (and in all cases no worse than MRD).

We next provide more background on the opportunity, present our initial evidence, and then discuss the research challenges and effort to exploit this opportunity in a systematic way.

2 Background and Motivation

To explain DAG-guided caching, we consider the PIG workflow management framework, which compiles the user's query into a directed acyclic graph (DAG) of MapReduce [11], Spark [37] or Tez [24] jobs.

[Figure 1: (a) PIG Latin script for the query; (b) DAG of MapReduce jobs J1-J7 with inputs I1-I6-2 and outputs O1-O6; (c) jobs scheduled in breadth-first stages over time t1-t10.]

Figure 1: Execution of Query #8 from TPC-H benchmark using Pig framework. In (a) we see the query script with jobs and inputs identified in purple circles and grey squares, respectively. (b) shows the directed acyclic graph (DAG) produced by the Pig compiler, with the corresponding jobs and inputs; (c) is the Pig schedule of the DAG to the MapReduce framework, with jobs executed in breadth-first stages.

In Figure 1a, we see TPC-H Query 8 in PIG Latin [25], which is compiled by PIG into the execution plan DAG in Figure 1b. Tasks are then sorted by dependency into stages, and submitted for execution, resulting in a timeline such as is seen in Figure 1c.

In this environment, the caching policy can use not only information from previous requests, but knowledge of future requests derived from this execution plan (we restrict our discussion to PIG; however the same approach may be used with Spark and other systems which expose internal dependencies), including i) the dependency graph, ii) the job type, and iii) the job input datasets and their sizes. In addition, to predict the execution timeline, we need to iv) predict individual job execution times, which may be done with data from past executions of the same job type (sort/join/etc. in PIG; application executable in some systems), and v) network and storage system bandwidth, which may be known or can be measured.

In Table 1 we see the jobs created by the PIG compiler (J1, J2, ..., J7) as well as the inputs to each job (I1, I2, ..., I6-2) with their respective sizes. Runtime for each job (in arbitrary units) is predicted without cached input ("baseline runtime") as well as for the case where each input dataset is cached (expressed as runtime improvement over baseline). For this example we assume that speedups are additive, and that they are all-or-nothing; i.e. if only one block of input for J2 is fetched (I2 is 2 units in size) there is no speedup, while J2 completes in 1 time unit if both units of I2 are fetched. (We take this simplifying assumption from Pacman [5].)

Figure 2(a) shows the execution plan without prefetching and with LRU cache management; completion time is 9 units, as there is no input data re-use and thus all inputs are read from remote storage; at the bottom of the figure we see a timeline of the jobs running at each unit of time.

MRD uses prior runtime information to predict the order and timing of data requests, prefetching data and evicting other data to improve performance. Prefetching allows job J6 to run faster, resulting in a completion time of 8 units rather than 9.

At each point in time, MRD fetches the dataset which will be requested the soonest in the future; ties are broken arbitrarily. Stage 1 requires 10 units of input, but we have 6 units of cache; we show inputs for J1 and J2 being prefetched, but only part of the two inputs to J3, I3-1, so stage 1 ends at t4 as before. Prefetching of inputs to J5 begins at t3, when those inputs become more valuable than data already in the cache; however input I5-2 cannot be completely loaded because the inputs to J3, which is still running, occupy 3 units of cache, and prefetching I5-1 gives a speedup of 0 to J5. Finally at t6 we prefetch I6-1 but not I6-2, giving a speedup of 1 to job J6 and completing the entire workflow one unit sooner.

By taking speedup information into account, we can do much better, as shown in Figure 2(c). For each stage of job execution, we prefetch the set of inputs (subject to available cache space) which will result in the largest decrease in overall execution time, or nothing if no decrease is possible. Thus inputs to J1 and J2 are ignored, as they cannot cause stage 1 to complete in less than 2 time units. However, the inputs to job J3 are fetched in their entirety, giving a 2-unit speedup and finishing the job (and thus the stage) by t2. There is not enough cache space to prefetch for J5, so none of its inputs are prefetched; as a result there is sufficient room to fetch all inputs to J6 at the beginning of stage 2 (t3), giving a 1-unit speedup and completing the workflow in 6 units of time.
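The MRD baseline above reduces to a very small selection rule. The following Python listing is our own simplification with hypothetical names, not code from MRD itself: among datasets not yet in the cache, the next prefetch target is the one whose predicted request time is soonest, with ties broken arbitrarily.

def mrd_next_prefetch(predicted_request_time, cached):
    # predicted_request_time: {dataset: predicted time of its next request}
    pending = {d: t for d, t in predicted_request_time.items() if d not in cached}
    return min(pending, key=pending.get) if pending else None

# At t0 in Figure 2(b) nothing is cached and the stage-1 inputs are all
# predicted to be requested first, so one of them is selected.
print(mrd_next_prefetch({"I1": 1, "I2": 1, "I3-1": 1, "I3-2": 1, "I5-1": 3}, set()))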


Figure 2: Different cache management decisions for Query #8 from the TPC-H benchmark. Job Ji takes input datasets Ij (or Ii-j, where Ii-j is the jth input of job i). The figure distinguishes prefetched blocks, eviction-candidate blocks, and LRU demand-cached blocks. LRU completes at t9, MRD completes at t8, and Near-Optimal Caching (NOC) completes at t6.
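To make the per-stage selection of Figure 2(c) concrete, the following Python sketch (ours, with hypothetical names; not the authors' implementation) enumerates the subsets of a stage's inputs that fit in the cache and picks the one that minimizes the stage completion time under the additive, all-or-nothing speedup model described above.

from itertools import chain, combinations

def stage_time(jobs, cached):
    # A stage ends when its slowest job ends; each fully cached input
    # reduces its job's runtime by the predicted improvement.
    return max(job["baseline"] - sum(imp for name, _, imp in job["inputs"]
                                     if name in cached)
               for job in jobs)

def best_prefetch_set(jobs, cache_units):
    inputs = [(name, size) for job in jobs for name, size, _ in job["inputs"]]
    subsets = chain.from_iterable(combinations(inputs, k)
                                  for k in range(len(inputs) + 1))
    feasible = (s for s in subsets if sum(size for _, size in s) <= cache_units)
    best = min(feasible, key=lambda s: stage_time(jobs, {name for name, _ in s}))
    chosen = {name for name, _ in best}
    return chosen, stage_time(jobs, chosen)

# Stage 1 of Query #8, using the Table 1 numbers (runtimes in arbitrary units):
stage1 = [{"baseline": 2, "inputs": [("I1", 2, 1)]},
          {"baseline": 2, "inputs": [("I2", 2, 1)]},
          {"baseline": 4, "inputs": [("I3-1", 3, 1), ("I3-2", 3, 1)]}]
print(best_prefetch_set(stage1, cache_units=6))  # J3's two inputs, stage time 2

Run on the stage-1 jobs of Table 1 with 6 units of cache, it selects J3's two inputs and predicts a stage time of 2 units, matching the figure. The brute-force enumeration is only for exposition; realistic schedules would need the approximations discussed below.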

Given the above information and assumptions, we can (a) determine the feasibility of any prefetching/eviction schedule (e.g. does it fit in cache) and (b) estimate its completion time. From this, in turn, we can (in theory) determine the optimal prefetching schedule. In theory this problem is no doubt NP-complete; however, for practical job schedules (especially with PIG's stage-based scheduling) it is likely that good approximations may be found.

This approach leads to our suggested caching/prefetching strategy, which we call Near-Optimal Caching (NOC). NOC pulls the DAG, along with the job type (e.g. filter, sort, etc.) and the input data sets for each job in the DAG, and extracts their sizes from the storage system. In addition, we estimate the job performance under two scenarios: (1) when the data is in the cache, and (2) when the data is not in the cache. We assume a staged execution model such as that used by PIG, where all jobs complete before a stage ends. We use the scheduling information of the Pig execution framework, and for each stage we select the input datasets that could decrease the stage runtime the most by being in cache.

Based on prior measurements of external storage throughput, for each to-be-prefetched dataset we calculate its read latency from the backend. This allows us to begin prefetching a dataset in time to have it in the cache when it is required.

Table 1: Job characteristics and predicted runtime improvement for Query #8 (runtimes in arbitrary units).

Job | Start | Baseline Runtime | Inputs | Size | Runtime Improvement
J1  | 1     | 2                | I1     | 2    | 1
J2  | 1     | 2                | I2     | 2    | 1
J3  | 1     | 4                | I3-1   | 3    | 1
    |       |                  | I3-2   | 3    | 1
J4  | 5     | 1                | {}     | -    | -
J5  | 5     | 2                | I5-1   | 2    | 0
    |       |                  | I5-2   | 2    | 1
J6  | 7     | 2                | I6-1   | 2    | 1
    |       |                  | I6-2   | 2    | 0
J7  | 9     | 1                | {}     | -    | -

3 Evaluation

We put together a simple experimental testbed to get a feeling for the performance gains that could be obtained with NOC. The experimental environment included a four-node compute cluster and a four-node storage cluster. The compute cluster nodes have 2x Xeon E5-2660 2.20GHz CPUs (16 total cores, 32 threads), 128GB RAM and 2x 10GbE NICs, and the storage cluster nodes have 2x Intel Xeon E5-2660 CPUs (28 total cores, 56 threads), 256GB RAM, 12x 2TB 7.2K SATA Seagate HDDs and 2x 10GbE NICs. On the storage cluster we deployed Ceph [32] 12.2.7. On the compute nodes we deployed PIG [20] 0.17.0 on top of Hadoop [11] 2.8.4, with Alluxio [2] 1.8.0 as an in-memory cache layer. We dedicated 6GB of memory on each node to Alluxio, for a total of 24GB of cache, and restricted Hadoop to 50GB of memory on each node.

For evaluation, we used a subset of the TPC-H [28] benchmark transformed into PIG Latin and run on a dataset of 32GB. We ran the queries sequentially in order, clearing the cache before each query. Before the experiment, we executed all queries with and without caching and recorded the execution time of each job within the query; in other words, we ran each query and gathered statistics both when the query's input is in the cache and when it is not.

We also modified Pig to output the execution plan (the DAG of MapReduce jobs) before launching the query. Following the procedure explained in Section 2, we created an I/O plan specifying which datasets should be prefetched/cached and when each prefetching request should be scheduled. We further modified Pig to provide timing and staging information as the query is executed by the framework (this required a 300-line patch to the PIG framework).

We processed the execution plan to create scripts that run in parallel with the execution of the query. These scripts issue load requests to Alluxio to prefetch data and free requests to evict data from the cache (a sketch of such a driver appears at the end of this section). A separate script was generated that emulated the NOC and MRD algorithms. We compare the performance of these algorithms to LRU, implemented natively by Alluxio. The results of this experiment are shown in Figure 3.

Figure 3: Effect of NOC data prefetching on query execution end-to-end latency.

With this simple experiment, we see that NOC has substantial gains over LRU, achieves at least as good performance as our emulation of MRD, and in the best case (i.e., Q19) NOC outperforms MRD by 22.8%. In the queries where NOC and MRD have identical performance, they choose the same datasets to evict and prefetch, while in other cases NOC chooses datasets based on future stages, as shown in Figure 2 for query 9.

This experiment suggests that a NOC implementation may well offer significant performance advantages. These results are, we believe, quite pessimistic. NOC will offer significantly more advantages in an environment where the bandwidth to storage is limited. The scheduling by Pig into stages, where jobs whose dependencies are met are not scheduled until all jobs in the previous stage have completed, greatly limits our opportunity to optimize in the current experimental environment. We also believe the degrees of freedom available with NOC will be more valuable when caching intermediate data sets (in these experiments we only cached input data sets) and when multiple tasks are being executed concurrently on the cluster and the cache resources are being shared.
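As an illustration of the prefetch/evict scripts described above, here is a minimal sketch (ours, not the scripts used in the experiments) of a driver that replays an I/O plan against Alluxio. It assumes the Alluxio 1.8 shell commands "alluxio fs load" and "alluxio fs free", a plan with hypothetical fields (path, size_bytes, needed_at, evict_at, in seconds relative to query start), and a previously measured backend throughput.

import subprocess, threading, time

MEASURED_BACKEND_BYTES_PER_SEC = 200e6  # assumed value from a prior measurement

def replay_io_plan(io_plan, query_start):
    def run_at(delay, cmd):
        threading.Timer(max(0.0, delay), subprocess.call, args=(cmd,)).start()

    now = time.time() - query_start
    for item in io_plan:
        # Start the load early enough that the dataset is resident when needed.
        lead = item["size_bytes"] / MEASURED_BACKEND_BYTES_PER_SEC
        run_at(item["needed_at"] - lead - now,
               ["alluxio", "fs", "load", item["path"]])
        # Free the dataset once no remaining job in the plan needs it.
        run_at(item["evict_at"] - now,
               ["alluxio", "fs", "free", item["path"]])

A driver emulating NOC versus MRD would differ only in how the I/O plan itself is produced, not in how it is replayed.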

4 Challenges and Future Work

What are the challenges in applying near-optimal caching to realistic systems, and what are the longer-term questions in pursuing this approach?

Runtime and speedup estimation: How can we estimate the speedup due to prefetching an input data set? Inferring speedup is a straightforward extension of the methods (e.g. regression) already in use to estimate job runtime based on job type and input data sets. More information may in fact be available from this data than can be used by current algorithms: e.g. if jobs were found to have partial speedups with partial data (instead of all-or-nothing), or if speedups for multiple inputs were not additive. It is an open research question, however, whether we can actually schedule based on these more realistic models. Estimating job runtime with different levels of accuracy has been studied extensively [10, 14, 30]. Future work will compute a confidence interval to limit the consequences of inaccurate prediction: if insufficient information is available to predict the runtime accurately, the confidence interval will be wide and the strategy degrades to be the same as MRD (a sketch of such an estimate follows this list).

Data Locality: Execution engines favor data locality when allocating resources for job execution; cache management decisions (i.e. where to cache a prefetched data item) thus impact scheduler job placement. Is this interaction a problem, and if so, how can the execution scheduler and cache manager cooperate?

Running Concurrent Queries: Workflow management frameworks run multiple queries to reach higher resource utilization. Considering one query at a time does not necessarily lead to an optimum decision, and might even degrade the performance of the other queries. A cache manager needs to take into account jobs from other queries and decide which datasets should be fetched to improve the overall performance of the system (even though some queries' performance may deteriorate). Since query runtimes can be predicted by NOC, we propose using this information to perform shortest-job-first (SJF) scheduling across multiple queries [31].

Query priorities: Many analysis systems handle both interactive and batch queries. Prefetching for interactive queries is difficult, yet even interactive use often has many repeated patterns, leading to the question of whether we can predict and prefetch for future queries, thus improving interactive performance.

Writing to Cache: In the discussion above, all data sets have been assumed to be inputs. Writeback caching violates this assumption: intermediate results are kept in cache [2, 16] before being written back to external storage. The size of these results is not known, but must be estimated based on job parameters, and write data must be prioritized in the cache until it can be written back.

Other platforms: It should be possible to apply this strategy to other DAG-based frameworks, such as Hive, Spark, etc.

Other job types: Some fraction of jobs (20% in the Alibaba [1] traces) will not use frameworks with DAG information that can be used for prefetching. How can we handle those jobs and the cache they use, without starving either them or the DAG-based jobs? Can we infer dependencies and cache size utilizations for them?
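As a minimal illustration of the confidence-checked runtime estimate mentioned under "Runtime and speedup estimation" (our own sketch with hypothetical names and data, not part of the prototype), the model below fits runtime against input size per job type and flags predictions whose residual spread is too large, in which case a scheduler would fall back to MRD-style behavior.

import numpy as np

def fit_runtime_model(history):
    # history: list of (input_size_gb, runtime_sec) pairs for one job type.
    sizes, runtimes = map(np.array, zip(*history))
    slope, intercept = np.polyfit(sizes, runtimes, deg=1)
    residual_std = float(np.std(runtimes - (slope * sizes + intercept)))
    return slope, intercept, residual_std

def predict_runtime(model, input_size_gb, max_relative_uncertainty=0.25):
    slope, intercept, residual_std = model
    estimate = slope * input_size_gb + intercept
    # ~95% band under a (strong) normal-residual assumption; if it is wide
    # relative to the estimate, the caller should ignore speedups (MRD-like).
    too_uncertain = 2 * residual_std > max_relative_uncertainty * estimate
    return estimate, too_uncertain

# Hypothetical past runs of one join job type: (input size in GB, runtime in s).
model = fit_runtime_model([(4, 41), (8, 85), (16, 160), (32, 330)])
print(predict_runtime(model, 24))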

5 Related Work

Cache management policies and prefetching have been studied extensively in many areas of computer system design [7, 12, 13, 19, 26, 29], including for data analytics. Traditionally, such work [15, 34] exploits job execution history to speculate on future dataset access patterns. However, there is also a small amount of recent work that attempts to use future information extracted from the higher-level framework [8, 22, 35, 36] for dataset caching and prefetching.

Mithril [34] proposes a prefetching layer that applies data mining algorithms to block access history to decide what to prefetch. Netco [15] takes advantage of dataset access history to prefetch the best candidates, based on available cache size and network bandwidth, to meet deadline SLOs such as maximizing the number of finished jobs. Our approach differs from these methods in that it relies on future information.

Among work relying on future information, both SADP [8] and MRD [22] rely on information from the higher-level framework to find and prefetch the dataset used nearest in the future. LRC [36] prioritizes caching the datasets which have the largest number of dependent jobs, and evicts datasets with the least number of dependencies. Although all of these take advantage of future information, they only consider a single dimension, i.e. job dependency; we take into account other dimensions as well, e.g. a dataset's contribution to total runtime.

6 Conclusion

Locally caching data sets is critical for compute clusters that use storage from object storage services. Researchers have started using scheduling information (i.e. job DAGs) from existing analytic environments to manage these caches. The fundamental insight of this work is that, rather than prefetching based on a prediction of job execution derived from this information, we can use this information to influence the execution order in order to enable more effective caching. We developed a simple experimental environment that used DAGs extracted from PIG to prefetch data along the critical path of execution through the DAG, and evicted cached data that was no longer required. Even with this very simple experimental environment we obtained up to 22.8% improved performance, providing strong preliminary evidence of the power of this approach.

7 Discussion Topics

• What other analytic systems can we use? What changes do we need to apply to popular systems such as SPARK to take advantage of NOC?

• How feasible is it to apply the same idea to frameworks such as Hive?

• What other workflow management frameworks can this idea be applied to?

• We consider the network bandwidth to the backend storage to be unlimited; in practice, however, network bandwidth is limited. What changes do we need to make to deal with this limitation?

• How can we generate more realistic workloads (e.g., with concurrent queries) to stress the proposed idea?

Acknowledgment

We would like to acknowledge the feedback of the anonymous reviewers and our shepherd, Yu Hua, as well as contributions by Larry Rudolph. This work was supported in part by the Mass Open Cloud (massopen.cloud) and its industrial partners.

References

[1] Alibaba Cloud Traces. https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018.

[2] Alluxio - Open Source Memory Speed Virtual Distributed Storage. https://www.alluxio.org.

[3] Amazon Web Services, Inc. Amazon Simple Storage Service (S3) — Cloud Storage — AWS. Available at aws.amazon.com/s3/.

[4] Amazon EMR. https://aws.amazon.com/emr/.

[5] Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Warfield, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, and Ion Stoica. PACMan: Coordinated Memory Caching for Parallel Jobs. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 267–280, San Jose, CA, 2012. USENIX.

[6] Apache Ignite. https://ignite.apache.org/.

[7] Daniel S. Berger, Ramesh K. Sitaraman, and Mor Harchol-Balter. AdaptSize: Orchestrating the Hot Object Memory Cache in a Content Delivery Network. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 483–498, Boston, MA, 2017. USENIX Association.

[8] C. Chen, T. Hsia, Y. Huang, and S. Kuo. Scheduling-Aware Data Prefetching for Data Processing Services in Cloud. In 2017 IEEE 31st International Conference on Advanced Information Networking and Applications (AINA), pages 835–842, March 2017.

[9] Yanpei Chen, Sara Alspaugh, and Randy Katz. Interactive Analytical Processing in Big Data Systems: A Cross-industry Study of MapReduce Workloads. Proc. VLDB Endow., 5(12):1802–1813, August 2012.

[10] Yue Cheng, M. Safdar Iqbal, Aayush Gupta, and Ali R. Butt. CAST: Tiering Storage for Data Analytics in the Cloud. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, pages 45–56, New York, NY, USA, 2015. ACM.

[11] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107–113, January 2008.

[12] H. Falahati, M. Abdi, A. Baniasadi, and S. Hessabi. ISP: Using idle SMs in hardware-based prefetching. In The 17th CSI International Symposium on Computer Architecture and Digital Systems (CADS 2013), pages 3–8, Oct 2013.

[13] H. Falahati, S. Hessabi, M. Abdi, and A. Baniasadi. Power-efficient prefetching on GPGPUs. The Journal of Supercomputing, 71(8):2808–2829, Aug 2015.

[14] Virajith Jalaparti, Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. Bridging the Tenant-provider Gap in Cloud Services. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 10:1–10:14, New York, NY, USA, 2012. ACM.

[15] Virajith Jalaparti, Chris Douglas, Mainak Ghosh, Ashvin Agrawal, Avrilia Floratou, Srikanth Kandula, Ishai Menache, Joseph Seffi Naor, and Sriram Rao. Netco: Cache and I/O Management for Analytics over Disaggregated Stores. In Proceedings of the ACM Symposium on Cloud Computing, SoCC '18, pages 186–198, New York, NY, USA, 2018. ACM.

[16] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In Proceedings of the ACM Symposium on Cloud Computing, SOCC '14, pages 6:1–6:15, New York, NY, USA, 2014. ACM.

[17] Microsoft Datalake. http://azure.microsoft.com/en-us/solutions/data-lake.

[18] HDInsight. https://azure.microsoft.com/en-us/services/hdinsight/.

[19] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, NSDI '13, pages 385–398, Berkeley, CA, USA, 2013. USENIX Association.

[20] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1099–1110, New York, NY, USA, 2008. ACM.

[21] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun. Making Sense of Performance in Data Analytics Frameworks. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, NSDI '15, pages 293–307, Berkeley, CA, USA, 2015. USENIX Association.

[22] Tiago B. G. Perez, Xiaobo Zhou, and Dazhao Cheng. Reference-distance Eviction and Prefetching for Cache Management in Spark. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, pages 88:1–88:10, New York, NY, USA, 2018. ACM.

[23] Raghu Ramakrishnan, Baskar Sridharan, John R. Douceur, Pavan Kasturi, Balaji Krishnamachari-Sampath, Karthick Krishnamoorthy, Peng Li, Mitica Manu, Spiro Michaylov, Rogério Ramos, Neil Sharman, Zee Xu, Youssef Barakat, Chris Douglas, Richard Draves, Shrikant S. Naidu, Shankar Shastry, Atul Sikaria, Simon Sun, and Ramarathnam Venkatesan. Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 51–63, New York, NY, USA, 2017. ACM.

[24] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and Carlo Curino. Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1357–1369, New York, NY, USA, 2015. ACM.

[25] Savvas Savvides. TPCH-PIG. https://github.com/ssavvides/tpch-pig.

[26] Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li. RIPQ: Advanced Photo Caching on Flash for Facebook. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST '15, pages 373–386, Berkeley, CA, USA, 2015. USENIX Association.

[27] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pages 996–1005, March 2010.

[28] TPC-H benchmark. http://www.tpc.org/tpch/.

[29] Steven P. Vanderwiel and David J. Lilja. Data Prefetch Mechanisms. ACM Comput. Surv., 32(2):174–199, June 2000.

[30] Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 363–378, Santa Clara, CA, 2016. USENIX Association.

[31] Raajay Viswanathan, Ganesh Ananthanarayanan, and Aditya Akella. CLARINET: WAN-Aware Optimization for Analytics Queries. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 435–450, Savannah, GA, 2016. USENIX Association.

[32] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-performance Distributed File System. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI '06, pages 307–320, Berkeley, CA, USA, 2006. USENIX Association.

[33] L. Xu, M. Li, L. Zhang, A. R. Butt, Y. Wang, and Z. Z. Hu. MEMTUNE: Dynamic Memory Management for In-Memory Data Analytic Platforms. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 383–392, May 2016.

[34] Juncheng Yang, Reza Karimi, Trausti Sæmundsson, Avani Wildani, and Ymir Vigfusson. Mithril: Mining Sporadic Associations for Cache Prefetching. In Proceedings of the 2017 Symposium on Cloud Computing, SoCC '17, pages 66–79, New York, NY, USA, 2017. ACM.

[35] Y. Yu, W. Wang, J. Zhang, and K. B. Letaief. LERC: Coordinated Cache Management for Data-Parallel Systems. In GLOBECOM 2017 - 2017 IEEE Global Communications Conference, pages 1–6, Dec 2017.

[36] Y. Yu, W. Wang, J. Zhang, and K. B. Letaief. LRC: Dependency-aware cache management for data analytics clusters. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9, May 2017.

[37] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud '10, 2010.