MixApart: Decoupled Analytics for Shared Storage Systems

Madalin Mihailescu†, Gokul Soundararajan†, Cristiana Amza
University of Toronto, NetApp, Inc.†

Abstract

Distributed file systems built for data analytics and enterprise storage systems have very different functionality requirements. For this reason, enabling analytics on enterprise data commonly introduces a separate analytics storage silo. This generates additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos.

MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems. The front-end caching layer enables the local storage performance required by data analytics. The shared storage back-end simplifies data management.

We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides (i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and (ii) comparable performance to an ideal Hadoop setup without data ingest, at similar cluster sizes.

1 Introduction

We design a novel method for enabling distributed data analytics to use data stored on enterprise storage. Distributed data analytics frameworks, such as MapReduce [12, 19] and Dryad [22], are utilized by organizations to analyze increasing volumes of information. Activities range from analyzing e-mails, log data, or transaction records, to executing clustering algorithms for customer segmentation. Using such data flow frameworks, data partitions are loaded from an underlying commodity distributed file system, passed through a series of operators executed in parallel on the compute nodes, and the results are written back to the file system [19, 22]. The file system component of these frameworks [13, 21] is a standalone storage system; the file system stores the data on local server drives, with redundancy (typically, 3-way replication) to provide fault tolerance and data availability.

While the benefits of these frameworks are well understood, it is not clear how to leverage them for performing analysis of enterprise data. Enterprise data, e.g., corporate documents, user home directories, critical log data, and e-mails, need to be secured from tampering, protected from failures, and archived for long-term retention; high-value data demands the enterprise-level data management features provided by enterprise storage. Traditional analytics file systems [13], however, have been designed for low-value data, i.e., data that can be regenerated on a daily basis; low-value data do not need enterprise-level data management. Consequently, these file systems, while providing high throughput for parallel computations, lack many of the management features essential for the enterprise environment, e.g., support for standard protocols (NFS), storage efficiency mechanisms such as deduplication, strong data consistency, and data protection mechanisms for active and inactive data [26].

These disparities create an environment where two dedicated reliable storage systems (silos) are required for running analytics on enterprise data: a feature-rich enterprise silo that manages the total enterprise data, and a high-throughput analytics silo storing enterprise data for analytics. This setup leads to a substantial upfront investment as well as complex workflows to manage data across different file systems. Enterprise analytics typically works as follows: (i) an application records data into the enterprise silo, (ii) an extraction workflow reads the data and ingests it into the analytics silo, (iii) an analytics program is executed and the results of the analytics are written into the analytics file system, and (iv) the results of the analytics are loaded back into the enterprise silo. The workflow repeats as new data arrives.
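For concreteness, the following is a minimal sketch of such an ingest-then-compute pipeline, written as a Python driver around the standard Hadoop command-line tools; the NFS mount point, HDFS paths, and job JAR are hypothetical placeholders, not artifacts of any particular deployment.

```python
import subprocess

# Hypothetical locations: an NFS mount of the enterprise silo and staging
# paths inside the dedicated analytics silo (HDFS).
ENTERPRISE_NFS = "file:///mnt/enterprise/logs"     # enterprise silo (NFS mount)
ANALYTICS_HDFS = "hdfs://analytics/staging/logs"   # analytics silo
RESULTS_HDFS   = "hdfs://analytics/results/logs"
RESULTS_LOCAL  = "/mnt/enterprise/results"         # results loaded back

def run(cmd):
    """Run one step of the workflow, failing loudly if it breaks."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# (ii) extraction workflow: bulk-copy new data into the analytics silo.
run(["hadoop", "distcp", "-update", ENTERPRISE_NFS, ANALYTICS_HDFS])

# (iii) run the analytics job against the ingested copy.
run(["hadoop", "jar", "analytics-job.jar", ANALYTICS_HDFS, RESULTS_HDFS])

# (iv) load the results back into the enterprise silo.
run(["hdfs", "dfs", "-copyToLocal", RESULTS_HDFS, RESULTS_LOCAL])
```

Each iteration of this loop copies data wholesale between the two silos, regardless of which files the jobs will actually touch; this duplication and workflow complexity is precisely what MixApart aims to remove.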
To address the above limitations, we design MixApart to facilitate scalable distributed processing of datasets stored in existing enterprise storage systems, thereby displacing the dedicated analytics silo. MixApart replaces the distributed file system component in current analytics frameworks with an end-to-end storage solution, XDFS, that consists of:

• a stateless analytics caching layer, built out of local server disks and co-located with the compute nodes, for data analytics performance,
• a data transfer scheduler that schedules transfers of data from consolidated enterprise storage into the disk cache, just in time, in a controlled manner, and
• the enterprise shared storage system, for data reliability and data management.

The design space for MixApart includes two main alternatives: (i) unified caching, by having the XDFS cache interposed on the access path of both the analytics and the enterprise workloads, and (ii) dedicated caching, by having the XDFS cache serve analytics workloads only. Moreover, in terms of implementation, for each design option, we can either leverage the existing HDFS [13] codebase for developing our cache, or leverage existing file caching solutions, e.g., for NFS or AFS [8] file systems. With MixApart we explore the design of a dedicated cache with an implementation based on the existing Hadoop/HDFS codebase. Our main reasons are prototyping speed and fair benchmarking against Hadoop.

We deployed MixApart on a 100-core Amazon EC2 cluster, and evaluated its performance with micro-benchmarks and production workload traces from Facebook. In general, the results show that MixApart provides comparable performance to Hadoop, at similar compute scales. In addition, MixApart improves storage efficiency and simplifies data management. First, the stateless cache design removes the redundancy requirements in current analytics file systems, thus lowering storage demands. Moreover, with classic analytics deployments, ingest is usually run periodically as a way to synchronize two storage systems, independent of job demands; in contrast, our integrated solution provides a dynamic ingest of only the needed data. Second, MixApart eliminates cross-silo data workflows by relying on enterprise storage for data management. The data in the cache is kept consistent with the associated data in shared storage, thus enabling data freshness for analytics, transparently, when the underlying enterprise data changes.
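To illustrate the kind of freshness check this implies, the sketch below compares a cached copy's recorded attributes against the origin file on shared storage and re-fetches it on change. This is a simplified illustration, assuming the cache tracks the origin's modification time and size; it is not the actual XDFS consistency mechanism.

```python
import os
from dataclasses import dataclass

@dataclass
class CachedFile:
    """Metadata kept by the cache for one file copied from shared storage."""
    shared_path: str     # origin file on the enterprise share (e.g., an NFS mount)
    cache_path: str      # local cached copy on a compute node disk
    origin_mtime: float  # origin modification time at caching
    origin_size: int     # origin size at caching

def is_fresh(entry: CachedFile) -> bool:
    """A cached copy is fresh only if the origin has not changed since caching."""
    st = os.stat(entry.shared_path)
    return st.st_mtime == entry.origin_mtime and st.st_size == entry.origin_size

def refetch(entry: CachedFile):
    # Placeholder transfer; a real system would route this through the data
    # transfer scheduler rather than issuing an unthrottled copy.
    st = os.stat(entry.shared_path)
    with open(entry.shared_path, "rb") as src, open(entry.cache_path, "wb") as dst:
        dst.write(src.read())
    entry.origin_mtime, entry.origin_size = st.st_mtime, st.st_size

def open_for_task(entry: CachedFile):
    """Serve a map task from the cache, transparently re-fetching stale data."""
    if not is_fresh(entry):
        refetch(entry)
    return open(entry.cache_path, "rb")
```

The point of the check is that analytics always observes the current enterprise data without any explicit re-ingest step by the user.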
2 MixApart Use Cases

MixApart allows computations to analyze data stored on enterprise storage systems, while using a caching layer for performance. We describe scenarios where we expect this decoupled design to provide high functional value.

Analyzing data on enterprise storage: Companies can leverage their existing investments in enterprise storage and enable analytics incrementally. There exist many kinds of file-based data, such as source-code repositories, e-mails, and log files; these files are generated by traditional applications but currently require a workflow to ingest the data into an analytics filesystem. MixApart allows a single storage back-end to manage and service data for both enterprise and analytics workloads. Data analytics, using the same filesystem namespace, can analyze enterprise data with no additional ingest workflows.

Cross-data center deployments: MixApart's decoupled design also allows for independent scaling of the compute and storage layers. This gives the flexibility of placing the analytics compute tier on cloud infrastructures, such as Amazon EC2 [3], while keeping the data on-premise. In this scenario, upfront hardware purchases are replaced by the pay-as-you-go cloud model. Cross-data center deployments benefit from efforts such as AWS Direct Connect [2] that enable high-bandwidth connections between private data centers and clouds.

3 MapReduce Workload Analysis

Data analytics frameworks process data by splitting a user-submitted job into several tasks that run in parallel. In the input data processing phase, e.g., the Map phase, tasks read data partitions from the underlying distributed file system, compute the intermediate results by running the computation specified in the task, and shuffle the output to the next set of operators, e.g., the Reduce phase. The bulk of the data processed is read in this initial phase.

Recent studies [10, 16] of production workloads deployed at Facebook, Microsoft Bing, and Yahoo! make three key observations: (i) there is high data reuse across jobs, (ii) the input phases are, on average, CPU-intensive, and (iii) the I/O demands of jobs are predictable. Based on the above workload characteristics, we motivate our MixApart design decisions. We then show that, with typical I/O demands and data reuse rates, MixApart can sustain large compute cluster sizes, equivalent to those supported by current deployments using dedicated storage.

3.1 High Data Reuse across Jobs

Production workloads exhibit high data reuse across jobs, with only 11%, 6%, and 7% of jobs from Facebook, Bing, and Yahoo!, respectively, reading a file once [10]. Using the job traces collected, Ananthanarayanan et al. estimate that in-memory cache hit rates of 60% are possible by allocating 32GB of memory on each machine, with an optimal cache replacement policy [10].

MixApart employs a distributed cache layer built from local server drives; this disk cache is two orders of magnitude larger than an in-memory cache. For example, Amazon EC2 [3] instances typically provide about 1TB

[Figure 1: Compute Cluster Size. Panels (a) 1Gbps and (b) 10Gbps plot # of Tasks (log scale, 1 to 100000) against Data Reuse Ratio (0 to 0.99), with curves labeled 0.5s, 20s, and 40s.]
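To convey the intuition behind Figure 1, here is a rough back-of-envelope sketch of how many concurrent map tasks a shared-storage link can feed once cache hits are served from local disks. This is an illustrative calculation, not the exact model used to produce the figure, and the bandwidth, per-task I/O rate, and reuse ratio below are hypothetical.

```python
def sustainable_map_tasks(storage_bw_mb_s, data_reuse_ratio, task_io_mb_s):
    """
    Rough estimate of concurrent map tasks a shared-storage link can sustain:
    cache hits are served locally, so only the miss fraction
    (1 - data_reuse_ratio) of each task's read rate crosses the link.
    """
    miss_ratio = 1.0 - data_reuse_ratio
    if miss_ratio <= 0.0:
        return float("inf")   # fully cached: shared storage is no longer the bottleneck
    demand_per_task_mb_s = miss_ratio * task_io_mb_s
    return int(storage_bw_mb_s / demand_per_task_mb_s)

# Hypothetical example: a 10 Gbps link (~1250 MB/s), map tasks that read a
# 64 MB block in 40 s (~1.6 MB/s each), and a 0.6 data reuse ratio.
print(sustainable_map_tasks(1250, 0.6, 64 / 40))   # -> 1953 concurrent tasks
```

Under these simplifying assumptions, raising the data reuse ratio (and hence the cache hit rate) or lowering the per-task I/O rate directly multiplies the number of tasks the same shared-storage link can sustain, which is the trend the figure illustrates.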