Scalable Collaborative Caching and Storage Platform for Data Analytics
Scalable Collaborative Caching and Storage Platform for Data Analytics

by

Timur Malgazhdarov

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2018 by Timur Malgazhdarov

Abstract

Scalable Collaborative Caching and Storage Platform for Data Analytics

Timur Malgazhdarov
Master of Applied Science
Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2018

The emerging Big Data ecosystem has brought about a dramatic proliferation of paradigms for analytics. In the race for the best performance, each new engine enforces tight coupling of analytics execution with caching and storage functionality. This one-for-all approach has led either to oversimplifications, where traditional functionality was dropped, or to additional configuration options that created more confusion about optimal settings. We avoid user confusion by following an integrated multi-service approach in which we assign responsibilities to decoupled services. In our solution, called Gluon, we build a collaborative cache tier that connects state-of-the-art analytics engines with a variety of storage systems. We use both open-source and proprietary technologies to implement our architecture. We show that Gluon caching can achieve a 2.5x-3x speedup compared to uncustomized Spark caching while displaying higher resource utilization efficiency. Finally, we show how Gluon can integrate traditional storage back-ends without significant performance loss when compared to vanilla analytics setups.

Acknowledgements

I would like to thank my supervisor, Professor Cristiana Amza, for her knowledge, guidance, and support. It was my privilege and honor to work under Professor Amza's supervision.
I would also like to thank my examination committee members: Professor Eyal de Lara, Professor Ashvin Goel, and Professor Ashish Khisti for their valuable comments and feedback. I am truly grateful to my colleagues and lab mates: Dr. Stelios Sotiriadis, Seyed Ali Jokar, and Arnamoy Bhattacharyya for their knowledge, help, and support. Last but not least, I would like to thank my family, especially my mother Nurgul Yessetova, for her understanding, love, and support.

Contents

Acknowledgements
Contents
1 Introduction
2 Background
  2.1 Analytics Engines
    2.1.1 Hadoop MapReduce (HMR)
    2.1.2 Spark
    2.1.3 Specialized Graph Processing
  2.2 Resource Managers
  2.3 Storage platforms
    2.3.1 Network-attached storage
    2.3.2 Storage Area Networks
    2.3.3 Distributed systems with direct-attached storage
  2.4 Distributed Cache
    2.4.1 Alluxio
  2.5 Common Solutions
    2.5.1 Vanilla Hadoop Solution
    2.5.2 Vanilla Spark Solution
  2.6 Conclusion
3 Thesis Idea and Design
  3.1 Thesis Idea
  3.2 Usability Issues
    3.2.1 Case Study: Spark
    3.2.2 HDFS
  3.3 Proposed Design
    3.3.1 Collaborative caching layer
    3.3.2 Service Decoupling and Modularity
    3.3.3 Consolidated Storage Layer
  3.4 Summary
4 Implementation
  4.1 Components
    4.1.1 Alluxio
    4.1.2 Server SAN
    4.1.3 GFS2
    4.1.4 YARN
  4.2 Control and Data Flow
  4.3 Connecting storage component
    4.3.1 Server SAN to filesystem connection
  4.4 Connecting GFS2 with Analytics Engines
  4.5 Cache integration
    4.5.1 GFS2 to Alluxio connection
  4.6 Spark integration
  4.7 Additional optimizations
    4.7.1 Asynchronous Delete
    4.7.2 File consistency checker
  4.8 Summary
5 Evaluation
  5.1 Environment Setup
    5.1.1 Benchmarks
  5.2 Comparative evaluation using Spark
    5.2.1 Spark count
    5.2.2 Logistic Regression
    5.2.3 PageRank
    5.2.4 Gluon job statistics
    5.2.5 Discussion
  5.3 Comparative evaluation using Hadoop MapReduce
    5.3.1 DFSIO
    5.3.2 Terasort
    5.3.3 PageRank
    5.3.4 Discussion
  5.4 Graph Processing Framework - Hama
  5.5 Conclusion
6 Related Work
  6.1 Caching in Analytics
  6.2 HPC and shared storage integrations
  6.3 Full-stack integrations
  6.4 Conclusion
7 Future Work and Final Remarks
Bibliography

Chapter 1

Introduction

Several data analytics paradigms have recently been proposed to accommodate the growing needs of Big Data. Each new paradigm brought with it specialization for a particular need of data analytics workloads. At the same time, each such specialization had as a side effect a significant departure from existing data processing paradigms. From a usability perspective, this trend makes it increasingly difficult to analyse the trade-offs of existing offerings and to determine the appropriate platform support, including interfaces, environments, settings, and configurations, for both functionality and optimal performance. In other words, as many different paradigms have proliferated to facilitate various data management needs, they have made usability and platform management and integration itself a growing concern.

For example, the initial MapReduce offerings, such as Apache Hadoop [8], came with a departure from traditional approaches to data processing. Relational data access typically used SQL-based interfaces to data maintained by consolidated storage back-ends. Newer data analytics systems, such as Hadoop, not only introduced a new Java-based data processing language; they also required that data reside in a distributed fashion on compute nodes, which formed a separate data silo for data analytics. Spark [61] came later with yet another data processing language, Scala, and with an even more pronounced decoupling from persistent data storage concerns. Both paradigms imply that input data and intermediate data are stored in a distributed fashion on new commodity distributed file systems, such as HDFS [52]. Moreover, both Apache Hadoop and Spark had their own data caching techniques, with the only commonality being the data locality and distributed file system principles.
On the other hand, Apache Hama [50] and Giraph [22] have recently been introduced for better support of graph-based data analytics as compared with Apache Hadoop and Spark. The BSP [56] data processing paradigm, which they proposed, strays from the data locality principle used in all former data analytics paradigms. This makes typical performance enhancements for distributed data analytics, such as network traffic avoidance and effective caching, difficult or impossible.

In this work, we propose a scalable, unified caching and storage platform for data analytics, called Gluon. Our unified platform provides performance, robustness, and ease of use for any data analytics paradigm currently in use, with little or no modification. Gluon comes with two essential services for integration of platform support for all types of data analytics. First, our Gluon caching layer supports global collaborative caching across the memories of all participating compute (and storage) nodes. Second, Gluon supports full integration of the collaborative caching service with traditional consolidated storage back-end services.

With Gluon, we emphasize the principle of data locality for in-memory data on any compute node. At the same time, we take full advantage of fast remote memory access when opportunities for memory availability in collaborating nodes exist. Such opportunities may be present for a variety of reasons. For example, compute nodes may be temporarily idle due to imperfect load balancing, such as that created by fault-induced stragglers or skewed workloads. Furthermore, unused memory may be available on back-end storage nodes, which can be leveraged by compute nodes. Whenever data would normally be evicted from the local in-memory cache on any compute node, Gluon has the capability to push the data to be evicted to a remote node. Conversely, Gluon fetches remote in-memory data on demand from collaborating nodes upon subsequent local access.
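The eviction-push and on-demand-fetch behavior described above can be illustrated with a minimal sketch. This is not Gluon's actual implementation, and all names (`PeerCache`, `put`, `get`) are hypothetical; it only shows the idea of an LRU cache that parks evicted blocks in a peer's free memory instead of discarding them, and pulls them back on a later local miss.

```python
# Illustrative sketch of collaborative caching (hypothetical names, not
# Gluon's real code): on local eviction, push the victim block to a peer
# with spare capacity; on a local miss, probe peers before falling back
# to the storage tier.
from collections import OrderedDict

class PeerCache:
    """Simplified per-node cache participating in a collaborative tier."""

    def __init__(self, capacity, peers=None):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block_id -> data, in LRU order
        self.peers = peers if peers is not None else []

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            # Evict the least-recently-used block ...
            victim_id, victim = self.blocks.popitem(last=False)
            # ... but push it to a collaborating node instead of dropping it.
            self._push_to_peer(victim_id, victim)

    def get(self, block_id):
        if block_id in self.blocks:            # local in-memory hit
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        for peer in self.peers:                # remote in-memory hit
            if block_id in peer.blocks:
                data = peer.blocks.pop(block_id)
                self.put(block_id, data)       # re-admit locally on access
                return data
        return None                            # miss: read from storage tier

    def _push_to_peer(self, block_id, data):
        for peer in self.peers:
            if len(peer.blocks) < peer.capacity:
                peer.blocks[block_id] = data   # park block in remote memory
                return
        # No free remote memory: the block is dropped; persistent copies
        # remain in the consolidated storage back-end.
```

For example, a busy node with a two-block budget evicts its oldest block to an idle peer, and a later local access fetches it back from that peer's memory rather than from disk.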
We currently opt for disjoint caching