Nectar: Automatic Management of Data and Computation in Datacenters

Pradeep Kumar Gunda, Lenin Ravindranath,∗ Chandramohan A. Thekkath, Yuan Yu, Li Zhuang
Microsoft Research Silicon Valley

∗ L. Ravindranath is affiliated with the Massachusetts Institute of Technology and was a summer intern on the Nectar project.

Abstract

Managing data and computation is at the heart of datacenter computing. Manual management of data can lead to data loss, wasteful consumption of storage, and laborious bookkeeping. Lack of proper management of computation can result in lost opportunities to share common computations across multiple jobs or to compute results incrementally.

Nectar is a system designed to address the aforementioned problems. It automates and unifies the management of data and computation within a datacenter. In Nectar, data and computation are treated interchangeably by associating data with its computation. Derived datasets, which are the results of computations, are uniquely identified by the programs that produce them, and together with their programs, are automatically managed by a datacenter-wide caching service. Any derived dataset can be transparently regenerated by re-executing its program, and any computation can be transparently avoided by using previously cached results. This enables us to greatly improve datacenter management and resource utilization: obsolete or infrequently used derived datasets are automatically garbage collected, and shared common computations are computed only once and reused by others.

This paper describes the design and implementation of Nectar, and reports on our evaluation of the system using analytic studies of logs from several production clusters and an actual deployment on a 240-node cluster.

1 Introduction

Recent advances in distributed execution engines (MapReduce [7], Dryad [18], and Hadoop [12]) and high-level language support (Sawzall [25], Pig [24], BOOM [3], HIVE [17], SCOPE [6], DryadLINQ [29]) have greatly simplified the development of large-scale, data-intensive, distributed applications. However, major challenges still remain in realizing the full potential of data-intensive distributed computing within datacenters. In current practice, a large fraction of the computations in a datacenter is redundant and many datasets are obsolete or seldom used, wasting vast amounts of resources in a datacenter.

As one example, we quantified the wasted storage in our 240-node experimental Dryad/DryadLINQ cluster. We crawled this cluster and noted the last access time for each data file. We discovered that around 50% of the files had not been accessed in the last 250 days.

As another example, we examined the execution statistics of 25 production clusters running data-parallel applications. We estimated that, on one such cluster, over 7000 hours of redundant computation can be eliminated per day by caching intermediate results. (This is approximately equivalent to shutting off 300 machines daily.) Cumulatively, over all clusters, this figure is over 35,000 hours per day.

Many of the resource issues in a datacenter arise from a lack of efficient management of either data or computation, or both. This paper describes Nectar: a system that manages the execution environment of a datacenter and is designed to address these problems.

A key feature of Nectar is that it treats data and computation in a datacenter interchangeably in the following sense. Data that has not been accessed for a long period may be removed from the datacenter and substituted by the computation that produced it. Should the data be needed in the future, the computation is rerun. Similarly, instead of executing a user's program, Nectar can partially or fully substitute the results of that computation with data already present in the datacenter. Nectar relies on certain properties of the programming environment in the datacenter to enable this interchange of data and computation.

Computations running on a Nectar-managed datacenter are specified as programs in LINQ [20]. LINQ comprises a set of operators to manipulate datasets of .NET objects. These operators are integrated into high-level .NET programming languages (e.g., C#), giving programmers direct access to .NET libraries as well as traditional language constructs such as loops, classes, and modules. The datasets manipulated by LINQ can contain objects of an arbitrary .NET type, making it easy to compute with complex data such as vectors, matrices, and images. All of these operators are functional: they transform input datasets to new output datasets. This property helps Nectar reason about programs to detect program and data dependencies. LINQ is a very expressive and flexible language, e.g., the MapReduce class of computations can be trivially expressed in LINQ.

Data stored in a Nectar-managed datacenter fall into one of two classes: primary or derived. Primary datasets are created once and accessed many times. Derived datasets are the results produced by computations running on primary and other derived datasets. Examples of typical primary datasets in our datacenters are click and query logs. Examples of typical derived datasets are the results of thousands of computations performed on those click and query logs.

In a Nectar-managed datacenter, all access to a derived dataset is mediated by Nectar. At the lowest level of the system, a derived dataset is referenced by the LINQ program fragment or expression that produced it. Programmers refer to derived datasets with simple pathnames that contain a simple indirection (much like a UNIX symbolic link) to the actual LINQ programs that produce them. By maintaining this mapping between a derived dataset and the program that produced it, Nectar can reproduce any derived dataset after it is automatically deleted. Primary datasets are referenced by conventional pathnames, and are not automatically deleted.

A Nectar-managed datacenter offers the following advantages.

1. Efficient space utilization. Nectar implements a cache server that manages the storage, retrieval, and eviction of the results of all computations (i.e., derived datasets). As well, Nectar retains the description of the computation that produced a derived dataset. Since programmers do not directly manage datasets, Nectar has considerable latitude in optimizing space: it can remove unused or infrequently used derived datasets and recreate them on demand by rerunning the computation. This is a classic trade-off of storage and computation.

2. Reuse of shared sub-computations. Many applications running in the same datacenter share common sub-computations. Since Nectar automatically caches the results of sub-computations, they will be computed only once and reused by others. This significantly reduces redundant computations, resulting in better resource utilization.

3. Incremental computations. Many datacenter applications repeat the same computation on a sliding window of an incrementally augmented dataset. Again, caching in Nectar enables us to reuse the results of old data and only compute incrementally for the newly arriving data.

4. Ease of content management. With derived datasets uniquely named by LINQ expressions, and automatically managed by Nectar, there is little need for developers to manage their data manually. In particular, they do not have to be concerned about remembering the location of the data. Executing the LINQ expression that produced the data is sufficient to access the data, and incurs negligible overhead in almost all cases because of caching. This is a significant advantage because most datacenter applications consume a large amount of data from diverse locations and keeping track of the requisite filepath information is often a source of bugs.

Our experiments show that Nectar, on average, could improve space utilization by at least 50%. As well, incremental and sub-computations managed by Nectar provide an average speedup of 30% for the programs running on our clusters. We provide a detailed quantitative evaluation of the first three benefits in Section 4. We have not done a detailed user study to quantify the fourth benefit, but the experience from our initial deployment suggests that there is evidence to support the claim.

Some of the techniques we used, such as dividing datasets into primary and derived and reusing the results of previous computations via caching, are reminiscent of earlier work in version management systems [15], incremental database maintenance [5], and functional caching [16, 27]. Section 5 provides a more detailed analysis of our work in relation to prior research.

This paper makes the following contributions to the literature:

• We propose a novel and promising approach that automates and unifies the management of data and computation in a datacenter, leading to substantial improvements in datacenter resource utilization.

• We present the design and implementation of our system, including a sophisticated program rewriter and static program dependency analyzer.

• We present a systematic analysis of the performance of our system from a real deployment on 240 nodes as well as analytical measurements.

TidyFS. The program store keeps all DryadLINQ programs that have ever executed successfully. The data store is used to store all derived streams generated by DryadLINQ programs. The Nectar cache server provides cache hits to the program rewriter on the client side. It also implements a replacement policy that deletes cache entries of least value. Any stream in the data store that is not referenced by any cache entry is deemed to be garbage and deleted permanently by the Nectar garbage collector. Programs in the program store are never deleted and are used to recreate a deleted derived stream if it is needed in the future. A simple example of a program is shown in Ex-

Figure 1: Nectar architecture. (Diagram labels: DryadLINQ Program P; Nectar Client-Side Library; Program Rewriter; Lookup; Hits; P'; Nectar Cluster-Wide Services; Cache Server; Garbage Collector; DryadLINQ/Dryad; Nectar Program Store; Distributed FS; Nectar Data Store.)
