High Performance File System and I/O Middleware Design for Big Data on HPC Clusters

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Nusrat Sharmin Islam, M.Sc.

Graduate Program in Computer Science and Engineering

The Ohio State University

2016

Dissertation Committee:

Dr. Dhabaleswar K. (DK) Panda, Advisor
Dr. Ponnuswamy Sadayappan
Dr. Radu Teodorescu
Dr. Xiaoyi Lu

© Copyright by

Nusrat Sharmin Islam

2016

Abstract

Hadoop MapReduce and Spark are the two most popular Big Data processing frameworks of recent times, and Hadoop Distributed File System (HDFS) is the underlying file system for MapReduce, Spark, the Hadoop Database (HBase), as well as SQL query engines like Hive and Spark SQL. HDFS, along with these upper-level middleware, is now extensively used on High Performance Computing (HPC) systems. Large-scale HPC systems, by necessity, are equipped with high performance interconnects like InfiniBand, heterogeneous storage devices like RAM Disk, SSD, and HDD, and parallel file systems like Lustre. Non-Volatile Memory (NVM) is emerging and making its way into HPC systems. Hence, the performance of HDFS, and in turn, of the upper-level middleware and applications, depends heavily on how it has been designed and optimized to take the system resources and architecture into account. But HDFS was initially designed to run over low-speed interconnects and disks on commodity clusters. As a result, it cannot efficiently utilize the resources available in HPC systems. For example, HDFS uses Java Sockets for communication, which leads to the overhead of multiple data copies and offers little overlapping among different phases of operations. Besides, due to the tri-replicated data blocks, HDFS suffers from huge I/O bottlenecks and requires a large volume of local storage. The existing data placement and access policies in HDFS are oblivious to the performance and persistence characteristics of the heterogeneous storage media on modern HPC systems. In addition, even though parallel file systems are optimized for a large number of concurrent accesses, Hadoop jobs running over Lustre suffer from huge contention due to the bandwidth limitation of the shared file system.

This work addresses several of these critical issues in HDFS while proposing efficient and scalable file system and I/O middleware for Big Data on HPC clusters. It proposes an RDMA-Enhanced design of HDFS to improve the communication performance of write and replication. It also proposes a Staged Event-Driven Architecture (SEDA)-based approach to maximize overlapping among different phases of HDFS operations. It proposes a hybrid design (Triple-H) to reduce the I/O bottlenecks and local storage requirements of HDFS through enhanced data placement policies that can efficiently utilize the heterogeneous storage devices available in HPC platforms. This thesis studies the impact of in-memory file systems for Hadoop and Spark and presents acceleration techniques for iterative applications with intelligent use of in-memory and heterogeneous storage. It further proposes advanced data access strategies that take into account locality, topology, and storage types for Hadoop and Spark on heterogeneous (storage) HPC clusters. This thesis carefully analyzes the challenges of incorporating NVM into a Big Data file system and proposes an NVM-based design of HDFS (NVFS) that leverages the byte-addressability of NVM for HDFS I/O and RDMA communication. It also co-designs Spark and HBase to utilize the NVM in NVFS in a cost-effective manner by identifying the performance-critical data for each of these upper-level middleware. It also proposes the design of a burst buffer system using RDMA-based Memcached for integrating Hadoop with Lustre.

The designs proposed in this thesis have been evaluated on a 64-node (1,024-core) testbed on TACC Stampede and a 32-node (768-core) testbed on SDSC Comet. These designs increase HDFS throughput by up to 7x, improve the performance of Big Data applications by up to 79%, and reduce the local storage requirements by 66% compared to default HDFS.

To Ammu and Abbu.

Acknowledgments

This work was made possible through the love and support of several people who stood by me through the years of my doctoral research. I would like to take this opportunity to thank all of them.

My advisor, Dr. Dhabaleswar K. Panda, for his guidance and support throughout my doctoral studies. The depth of his knowledge and experience has helped me grow as a researcher. I have learned a lot from his sincerity, diligence, and devotion towards work. My dissertation committee members, Dr. P. Sadayappan and Dr. R. Teodorescu, for agreeing to serve on the committee and for their valuable feedback.

My collaborators, Dr. X. Lu and other senior members of the lab for taking the time to listen to my ideas and giving me useful suggestions. I would also like to thank all my colleagues who have helped me in one way or another throughout my graduate studies.

My friends, here at Columbus, for making this journey easier for me. Special thanks to

Sanjida Ahmad, who just like an elder sister has stood beside me through thick and thin.

Thanks to my friends at home, Nadia, Drabir, Tareq, Aysha, and Pavel, for their company that always invigorates me for a new start. My friends, Susmita, Amit, and Rashed, for being there for me, no matter what. These people never forget to remind me that I can achieve anything that I want. I am really thankful to them for boosting my confidence each and every day.

My cousins, Mishu, Muna, and others for believing in me and helping me face life's challenges. Thanks to my sister, Sadia Islam, and brother-in-law, Nafiz Tamim, for never letting me miss home even after being far away from it. My brother, Dr. Mohaimenul

Islam for always cheering me up and taking pride in what I do.

My Husband, Md. Wasi-ur-Rahman for his love, patience, and understanding. His constant encouragement has been a driving factor for me. My daughter, Nayirah for helping me forget all the hurdles and stresses that I went through during my Ph.D. Even in the most difficult times, her smile lights up my entire world. Her presence in my life has taught me how to be more organized and manage time well.

Last but not least, my Parents, Mazeda Islam and Dr. Shahidul Islam. I consider myself privileged to be born to parents who have never asked me to do anything differently just because I am a woman. I have always idolized my dad for his prudence, honesty, and hard work; my mom has supported my work with all her strength. I am truly grateful to them for all the sacrifices that they have made so that I can be where I am today. The greatest achievement of my life is to make my parents proud through this work. I would like to thank them from the core of my heart for giving me the resilience and inspiration to move forward in all my endeavors.

Vita

2002-2007 ...... B.Sc., Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
2007-2008 ...... Lecturer, Dept. of Computer Science and Engineering, BRAC University
2008-2010 ...... Lecturer, Dept. of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
2010-Present ...... Ph.D., Computer Science and Engineering, The Ohio State University, USA
2010-2011 ...... Graduate Fellow, The Ohio State University, USA
2011-Present ...... Graduate Research Associate, Dept. of Computer Science and Engineering, The Ohio State University, USA
Summer 2014 ...... Software Engineer Intern, Oracle America Inc., USA

Publications

N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage, In 2016 IEEE International Conference on Big Data (IEEE BigData ’16), December 2016.

N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, In International Conference on Supercomputing (ICS ’16), June 2016.

N. S. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Cluster, In 2015 IEEE International Conference on Big Data (IEEE BigData ’15), October 2015.

N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-based Key-Value Store, In The 44th International Conference on Parallel Processing (ICPP ’15), September 2015.

N. S. Islam, X. Lu, M. W. Rahman, D. Shankar and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’15), May 2015.

N. S. Islam, X. Lu, M. W. Rahman and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, In The 23rd International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC ’14), June 2014.

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, In The Int’l Conference for High Performance Computing, Networking, Storage and Analysis (SC ’12), November 2012.

M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?, In 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS ’16), in conjunction with SC ’16, November 2016.

M. W. Rahman, N. S. Islam, X. Lu, D. Shankar, and D. K. Panda, MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers, In 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD ’16), October 2016.

M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, A Comprehensive Study of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters, In IEEE Transactions on Parallel and Distributed Systems, July 2016.

X. Lu, M. W. Rahman, N. S. Islam, D. Shankar, and D. K. Panda, Accelerating Big Data Processing on Modern HPC Clusters, In Conquering Big Data with High Performance Computing - Springer International Publishing, July 2016.

D. Shankar, X. Lu, M. W. Rahman, N. S. Islam, and D. K. Panda, Characterizing and Benchmarking Stand-alone Hadoop MapReduce on Modern HPC Clusters, In The Journal of Supercomputing - Springer, June 2016.

D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, In The 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS ’16), May 2016.

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar and D. K. Panda, High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, In 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS ’15), May 2015.

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, In Special Issue of LNCS, WBDB, January 2014.

N. S. Islam, X. Lu, M. W. Rahman, R. Rajachandrasekar and D. K. Panda, In-Memory I/O and Replication for HDFS with Memcached: Early Experiences, In 2014 IEEE International Conference on Big Data (IEEE BigData ’14), October 2014.

M. W. Rahman, X. Lu, N. S. Islam and D. K. Panda, Performance Modeling for RDMA-Enhanced Hadoop MapReduce, In 43rd International Conference on Parallel Processing (ICPP ’14), September 2014.

D. Shankar, X. Lu, M. W. Rahman, N. S. Islam and D. K. Panda, A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, In 5th Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE), 40th International Conference on Very Large Data Bases (VLDB ’14), September 2014.

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, In 20th International European Conference on Parallel Processing (Euro-Par ’14), August 2014.

X. Lu, M. W. Rahman, N. S. Islam, D. Shankar and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, In Int’l Symposium on High-Performance Interconnects (HotI ’14), August 2014.

M. W. Rahman, X. Lu, N. S. Islam and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, In International Conference on Supercomputing (ICS ’14), June 2014.

M. W. Rahman, X. Lu, N. S. Islam and D. K. Panda, Does RDMA-based Enhanced Hadoop MapReduce Need a New Performance Model?, In ACM Symposium on Cloud Computing (SoCC ’13), October 2013.

X. Lu, N. S. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, In Int’l Conference on Parallel Processing (ICPP ’13), October 2013.

N. S. Islam, X. Lu and D. K. Panda, Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?, In 21st Annual Symposium on High-Performance Interconnects (HOTI ’13), August 2013.

M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang and D. K. Panda, High Performance RDMA-Based Design of Hadoop MapReduce over InfiniBand, In Int’l Workshop on High Performance Data Intensive Computing (HPDIC ’13), in conjunction with IPDPS, May 2013.

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, H. Wang and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Cluster, In Second Workshop on Big Data Benchmarking (WBDB ’12), December 2012.

X. Ouyang, N. S. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang and D. K. Panda, SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks, In Int’l Conference on Parallel Processing (ICPP ’12), September 2012.

J. Vienne, J. Chen, M. W. Rahman, N. S. Islam, H. Subramoni and D. K. Panda, Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems, In Int’l Symposium on High-Performance Interconnects (HOTI ’12), August 2012.

M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. S. Islam, H. Subramoni, C. Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, In Int’l Symposium on Performance Analysis of Systems and Software (ISPASS ’12), April 2012.

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. S. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, In Int’l Conference on Parallel Processing (ICPP ’11), September 2011.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Problem Statement
   1.2 Research Framework
   1.3 Organization of the Thesis
2. Background
   2.1 Hadoop
      2.1.1 Hadoop Distributed File System (HDFS)
         2.1.1.1 Overview of HDFS Design
      2.1.2 MapReduce
      2.1.3 HBase
      2.1.4 Hive
   2.2 InfiniBand and UCR
   2.3 Heterogeneous Storage Devices on Modern HPC Clusters
      2.3.1 Node-Local Storage Devices
      2.3.2 Non-Volatile Memory
      2.3.3 Parallel File System
   2.4 In-Memory Computing Framework and Storage
      2.4.1 Spark
         2.4.1.1 Spark SQL
      2.4.2 Tachyon
      2.4.3 Memcached
   2.5 Benchmarks, Workloads, and Applications
3. High Performance RDMA-based Design of HDFS over InfiniBand
   3.1 Proposed Design
      3.1.1 Design Overview
      3.1.2 New Components in the Hybrid Architecture
      3.1.3 Connection and Buffer Management
      3.1.4 Communication Flow using RDMA over InfiniBand
   3.2 Experimental Results
      3.2.1 Experimental Setup
      3.2.2 Optimal Packet-size for Different Interconnects/Protocols
      3.2.3 Micro-benchmark Level Evaluations on Different Interconnects
      3.2.4 HDFS Communication Time
      3.2.5 Evaluation with TestDFSIO
   3.3 Benefit of RDMA-based HDFS in HBase
   3.4 Related Work
   3.5 Summary
4. Maximizing Overlapping in RDMA-Enhanced HDFS
   4.1 Proposed Design
      4.1.1 Architectural Overview
      4.1.2 Design of SOR-HDFS
      4.1.3 Overlapping among Different Phases
         4.1.3.1 Data Read, Processing, Replication, and I/O
         4.1.3.2 Communication and Data Write
      4.1.4 SOR-HDFS with Parallel Replication
   4.2 Experimental Results
      4.2.1 Experimental Setup
      4.2.2 Parameter Tuning of SOR-HDFS Threads
      4.2.3 Performance Analysis of SOR-HDFS
      4.2.4 Evaluation with Parallel Replication
      4.2.5 Evaluation using HDFS Microbenchmark
      4.2.6 Evaluation using Enhanced DFSIO of HiBench
      4.2.7 Evaluation using HBase
   4.3 Related Work
   4.4 Summary
5. Hybrid HDFS with Heterogeneous Storage and Advanced Data Placement Policies
   5.1 Proposed Architecture and Design
      5.1.1 Design Considerations
      5.1.2 Architecture
      5.1.3 Data Placement Policies
         5.1.3.1 Performance-sensitive
         5.1.3.2 Storage-sensitive
      5.1.4 Design Details
   5.2 Experimental Results
      5.2.1 Performance Analysis of Triple-H
      5.2.2 Evaluation with Triple-H Default Mode
         5.2.2.1 TestDFSIO
         5.2.2.2 Data Generation Benchmarks
      5.2.3 Evaluation with Triple-H Lustre-Integrated Mode
         5.2.3.1 TestDFSIO
         5.2.3.2 Sort
      5.2.4 Evaluation with Applications
   5.3 Related Work
   5.4 Summary
6. Accelerating Iterative Applications with In-Memory and Heterogeneous Storage
   6.1 System Architectures to Deploy In-Memory File Systems
   6.2 Adapting Triple-H for Iterative Applications
      6.2.1 Optimizing Triple-H for Spark
         6.2.1.1 Enhanced Connection Management for Spark in Triple-H
      6.2.2 Design Scope for Iterative Applications
      6.2.3 Proposed Design
   6.3 Experimental Results
      6.3.1 Identifying the Impact of Different Parameters
      6.3.2 Primitive Operations
      6.3.3 Hadoop Workloads on HDFS, Tachyon and Triple-H
      6.3.4 Hadoop Workloads on Lustre, Tachyon and Triple-H
      6.3.5 Spark Workloads on HDFS, Tachyon and Triple-H
      6.3.6 Spark Workloads on Lustre, Tachyon and Triple-H
      6.3.7 Fault Tolerance of HDFS, Tachyon and Triple-H
      6.3.8 Evaluation with Iterative Workloads
         6.3.8.1 Synthetic Iterative Benchmark
         6.3.8.2 K-Means
      6.3.9 Summary of Performance Characteristics
   6.4 Related Work
   6.5 Summary
7. Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage
   7.1 Proposed Design
      7.1.1 Data Access Strategies
         7.1.1.1 Greedy
         7.1.1.2 Hybrid
      7.1.2 Data Placement Strategies
      7.1.3 Design and Implementation
   7.2 Performance Evaluation
      7.2.1 Experimental Setup
      7.2.2 Evaluation with RandomWriter and TeraGen
      7.2.3 Evaluation with TestDFSIO
      7.2.4 Evaluation with Hadoop MapReduce Sort and TeraSort
      7.2.5 Evaluation with Spark Sort and TeraSort
      7.2.6 Evaluation with Hive and Spark SQL
   7.3 Related Work
   7.4 Summary
8. High Performance Design for HDFS with Byte-Addressability of NVM and RDMA
   8.1 Design Scope
      8.1.1 NVM as HDFS Storage
      8.1.2 NVM for RDMA
      8.1.3 Performance Gap Analysis
   8.2 Proposed Design and Implementation
      8.2.1 Design Overview
      8.2.2 Design Details
         8.2.2.1 Block Access (NVFS-BlkIO)
         8.2.2.2 Memory Access (NVFS-MemIO)
      8.2.3 Implementation
      8.2.4 Cost-effective Performance Schemes for using NVM in HDFS
         8.2.4.1 Spark
         8.2.4.2 HBase
      8.2.5 NVFS-based Burst Buffer for Spark over Lustre
   8.3 Performance Evaluation
      8.3.1 Performance Analysis
      8.3.2 Evaluation with Hadoop MapReduce
         8.3.2.1 TestDFSIO
         8.3.2.2 Data Generation Benchmark (TeraGen)
         8.3.2.3 SWIM
      8.3.3 Evaluation with Spark
         8.3.3.1 TeraSort
         8.3.3.2 PageRank
      8.3.4 Evaluation with HBase
   8.4 Related Work
   8.5 Summary
9. Accelerating I/O Performance through RDMA-based Key-Value Store
   9.1 Design
      9.1.1 Proposed Deployment
      9.1.2 Internal Architecture
         9.1.2.1 Key-Value Store-based Burst Buffer System
         9.1.2.2 Integration of Hadoop with Lustre through Key-Value Store-based Burst Buffer
      9.1.3 Proposed Schemes to Integrate Hadoop with Lustre
         9.1.3.1 Asynchronous Lustre Write (ALW)
         9.1.3.2 Asynchronous Disk Write (ADW)
         9.1.3.3 Bypass Local Disk (BLD)
   9.2 Experimental Results
      9.2.1 Experimental Setup
      9.2.2 Evaluation with TestDFSIO
      9.2.3 Evaluation with RandomWriter and Sort
      9.2.4 Evaluation with PUMA
   9.3 Related Work
   9.4 Summary
10. Future Research Directions
   10.1 Designing Kudu over RDMA and NVM
   10.2 Study the Impact of the Proposed Data Placement and Access Policies on Energy Efficiency
   10.3 Accelerating Erasure Coding with RDMA and KNL/GPU
   10.4 Designing High Performance I/O Subsystem for Deep/Machine Learning Applications
11. Conclusion and Contributions
   11.1 Software Release and its Impact

Bibliography

List of Tables

1.1 Performance vs Capacity comparison for [32]
2.1 Benchmark parameter list
3.1 Optimal Network Packet-Sizes for Different Interconnects/Protocols
3.2 Split-up Times for a Single Block Transmission in HDFS
4.1 Packet rate and communication time
5.1 Read and Write performance comparison in different Modes of Triple-H (OSU RI Cluster B)
6.1 Fault tolerance of Triple-H and Tachyon
6.2 Normalized execution times over Tachyon-HDFS and Triple-H (HHH)
6.3 Normalized execution times over Tachyon-Lustre and Triple-H (HHH-L)
8.1 SWL performance over RAMDisk
8.2 Comparison of Communication Time
8.3 Comparison of I/O Time
8.4 Summary of benefits for SWIM
9.1 Determining optimal number of Memcached servers
9.2 Performance comparison of RandomWriter and Sort with 4 and 6 maps per host using ALW
9.3 Performance comparison of RandomWriter and Sort with 4 and 6 maps per host using BLD

List of Figures

1.1 HDFS and Storage Hierarchy in HPC System
1.2 The Proposed Research Framework
2.1 HDFS Architecture
3.1 Architecture and Design of RDMA-Enhanced HDFS
3.2 HDFS-RDMA design with hybrid communication support
3.3 Optimal network packet size evaluation for different file sizes
3.4 Micro-benchmark evaluation in different clusters for HDD
3.5 Micro-benchmark evaluation in different clusters for SSD
3.6 Communication times in HDFS in OSU RI
3.7 TestDFSIO benchmark evaluation over different interconnects
3.8 YCSB evaluation with a single region server
3.9 YCSB evaluation with multiple (32) region servers
4.1 Default vs. Proposed HDFS Architecture
4.2 Architectural Overview of SOR-HDFS
4.3 Read stage
4.4 Packet Processing Stage
4.5 Different stages of SOR-HDFS
4.6 Parameter tuning for SOR-HDFS threads
4.7 Disk write throughput in different clusters
4.8 Breakdown time for 64MB block write (OSU RI Cluster B)
4.9 Performance analysis of SOR-HDFS (OSU RI Cluster B)
4.10 Write time evaluation using HDFS microbenchmark (SWL)
4.11 Write time evaluation using HDFS microbenchmark (SWL)
4.12 Enhanced DFSIO throughput evaluation of Intel HiBench
4.13 Evaluation of HBase Put throughput
5.1 Architecture and Design of Triple-H
5.2 Performance-sensitive Data Placement Policies
5.3 Replication during data placement in Triple-H
5.4 Eviction-Promotion Manager
5.5 Comparison among different Data Placement Policies in OSU RI Cluster B
5.6 Comparison between Performance-sensitive Data Placement Policies
5.7 Evaluation of TestDFSIO in TACC Stampede (Default mode)
5.8 Performance comparison with data generation benchmarks (Default mode)
5.9 Evaluation of TestDFSIO (Lustre-Integrated mode)
5.10 Evaluation of Sort (Lustre-Integrated mode)
6.1 System architectures for deploying Hadoop MapReduce and Spark on top of in-memory file systems (Tachyon & HHH)
6.2 Enhanced connection management for supporting Spark over Triple-H
6.3 Iterations and profiling of K-Means
6.4 Added functional units in Triple-H for iterative applications
6.5 Impact of blocksize (OSU RI Cluster B)
6.6 Impact of concurrent containers and tasks (OSU RI Cluster B)
6.7 Evaluation of RandomWriter and Sort (SDSC Gordon)
6.8 Evaluation of Grep and MR-MSPolygraph
6.9 Evaluation of Spark Standalone mode (SDSC Gordon)
6.10 Evaluation of Spark workloads over YARN
6.11 Fault tolerance of different file systems (OSU RI Cluster B)
6.12 Evaluation of iterative workloads (OSU RI Cluster B)
7.1 Default data access strategy
7.2 Proposed data access strategies
7.3 Overview of NBD Strategy
7.4 Proposed design
7.5 Evaluation of RandomWriter and TeraGen on OSU RI2
7.6 Evaluation of the data access strategies
7.7 Selecting optimal number of connections in NBD strategy on OSU RI Cluster B
7.8 Selecting optimal number of readers in SBD strategy on OSU RI Cluster B
7.9 Evaluation of TestDFSIO on OSU RI2
7.10 Evaluation of Hadoop MapReduce workloads on OSU RI2
7.11 Evaluation of Hadoop MapReduce workloads on SDSC Comet
7.12 Evaluation of Spark workloads on OSU RI2
7.13 Evaluation of Spark workloads on SDSC Comet
7.14 Evaluation of Spark SQL on SDSC Comet
8.1 Performance Characteristics of NVM and SSD
8.2 NVM for HDFS I/O
8.3 NVM for HDFS over RDMA
8.4 Performance gaps among different storage devices
8.5 Architecture of NVFS
8.6 Design of NVFS-BlkIO
8.7 Design of NVFS-MemIO
8.8 Internal data structures for Block Architecture
8.9 Design of NVFS-based Burst Buffer for Spark over Lustre
8.10 HDFS Microbenchmark
8.11 TestDFSIO
8.12 Data generation benchmark, TeraGen
8.13 SWIM (SDSC Comet)
8.14 Evaluation with Spark workloads
8.15 YCSB 100% insert (SDSC Comet)
8.16 YCSB 50% read, 50% update (Cluster B)
9.1 Deployment and Architecture
9.2 Persistence Manager
9.3 Hadoop Write Flow
9.4 Evaluation of TestDFSIO (OSU RI)
9.5 Evaluation of TestDFSIO (TACC Stampede)
9.6 Evaluation of RandomWriter and Sort (OSU RI)
9.7 Evaluation of RandomWriter and Sort (TACC Stampede)
9.8 Evaluation with PUMA [66]
10.1 Network Architecture of Kudu, Courtesy [11]
10.2 Impact of interconnects and storage on Kudu operations

Chapter 1: Introduction

Massive data is being generated every minute through different Internet services, such as Facebook, Twitter, and Google, as well as numerous smart phone apps. In this data deluge, timely collection, storage, and analysis of these data are fundamental for efficient business solutions. Not only in business: data production in many diverse fields, including biomedical research, Internet search, finance, and scientific computing, is expanding at an astonishing rate. According to [4], 2.5 exabytes of data were generated each day throughout the year 2014. Total data production will be 44 times greater by the end of this decade [27] compared to today. A recent IDC report [40] claims that the digital universe is doubling in size every two years and will multiply 10-fold between 2013 and 2020, from 4.4 trillion gigabytes to 44 trillion gigabytes. Data generation and analytics have taken an enormous leap in recent times due to the continuously growing popularity of social gaming and geospatial applications. According to [19], at any given hour, around ten million people are logged into their favorite social game, like Candy Crush Saga from King [10] or Clash of Clans from Supercell [14]. The more recent craze of Pokemon GO has further emphasized the inevitability of data growth and the need for faster processing and very large storage capacities; in other words, for technological advancements in Big Data middleware in the near future. According to a recent Datanami article [20], people are now spending twice as much time playing Pokemon GO as they are on Facebook. It also mentions that social media traffic from Twitter, Snapchat, and YouTube is going down because of the massive usage of Pokemon GO. This phenomenon presents significant challenges not only in operating on this data to perform large-scale analysis but also in managing, storing, and protecting the sheer volume and diversity of the data.

During the last decade, the Apache Hadoop [16] platform has become one of the most prominent open-source frameworks for handling Big Data analytics, and Hadoop Distributed File System (HDFS) is the underlying storage engine of Hadoop. The current ecosystem of Hadoop contains the legacy batch processing framework, MapReduce [16], as well as the in-memory DAG execution framework, Spark [111], for iterative and interactive processing. HDFS, which is well-known for its scalability and reliability, provides the storage capabilities for these processing frameworks. With the advent of the Information Age, Big Data systems like Hadoop and HDFS are being widely deployed on High Performance Computing (HPC) clusters. The International Data Corporation (IDC) study [44] on Latest Trends in High Performance Computing (HPC) Usage and Spending indicated that 67% of HPC sites were running High Performance Data Analysis (HPDA) workloads. IDC forecasts that HPDA revenues will experience robust growth from 2012 to 2017, reaching almost $1.4 billion in 2017, versus $743.8 million in 2012. HPC departments everywhere appear to be bearing the brunt of this transition. A recent example is the ‘Pivotal Analytics Workbench’ [65], a 1000+ node HPC cluster used for regular testing of the Apache Hadoop open source code base. Even though such systems are being used for Hadoop deployments, current Hadoop middleware components do not leverage many of the HPC cluster features.

Modern HPC clusters are equipped with high performance interconnects like InfiniBand [42] that provide low latency and high throughput data transmission.

[Figure 1.1: HDFS and Storage Hierarchy in HPC System. (a) HDFS deployment (cluster with homogeneous storage characteristics); (b) HDFS deployment (cluster with heterogeneous storage characteristics); (c) NVM in the Storage Hierarchy.]

The multi-core compute nodes have large memory along with heterogeneous storage devices like RAM Disk, SSD, and HDD [32]. Moreover, the traditional Beowulf architecture [87, 88] model has been followed for building these clusters [32, 85], where the compute nodes are provisioned with either disk-less or limited-capacity local storage [30], while a sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre, is provided for fast and scalable data storage and access. Figures 1.1(a) and 1.1(b) show examples of how heterogeneous storage devices and Lustre are deployed on most modern HPC systems, building clusters with homogeneous or heterogeneous storage characteristics, respectively, and how HDFS runs in these setups with the DataNodes located on the compute nodes.

Table 1.1: Performance vs Capacity comparison for [32]

Type             | Peak Bandwidth | Capacity
RAM Disk (local) | ≈ 6.61 GBps    | ≈ 32 GB
SSD (local)      | ≈ 2.32 GBps    | ≈ 300 GB
HDD (local)      | ≈ 267.2 MBps   | ≈ 80 GB
Lustre           | ≈ 817.1 MBps   | ≈ 1.6 PB

All such heterogeneous storage devices on modern HPC clusters have different performance and storage characteristics. Table 1.1 shows the different types of storage media available in one of the leadership-class HPC clusters, SDSC Gordon [32]. It is evident from this table that the amounts of local storage space on RAM Disk, SSD, and HDD are negligible compared to the vast installation of Lustre. But from the angle of peak data access bandwidth, RAM Disk and SSD are much faster than Lustre. Moreover, emerging Non-Volatile Memory (NVM) is making its way into HPC clusters. NVMs offer byte-addressability with near-DRAM performance and low power consumption for I/O-intensive applications. NVMs can not only augment the overall memory capacity, but also provide persistence while bringing in significant performance improvement. As depicted in Figure 1.1(c), NVMs are, thus, excellent contenders to co-exist with RAM and SSDs in large scale clusters and server deployments. This further implies that it is the software overheads that cause performance bottlenecks when NVM is used to replace disk-based storage systems.

The outstanding performance requirements in HPC environments place unprecedented demands on the performance of the supporting storage systems. The default design of HDFS cannot efficiently utilize the advanced features of the resources available on HPC platforms. It is, therefore, critical to re-think the architecture of HDFS and to design high performance I/O subsystems for Big Data that consider multiple factors, including exploitation of the underlying system resources, data locality, and application characteristics, along with identifying the performance-critical data, data access patterns, storage characteristics, persistence and fault tolerance requirements, and minimizing the storage space usage. Since most Big Data applications are I/O-intensive in nature, the HPC cluster resources can play a crucial role in re-designing the I/O subsystem for Big Data. This leads to the following broad challenge: Can the resources on HPC platforms be leveraged to design high performance file systems and I/O middleware that can improve the performance and scalability of a range of data-intensive applications? If such file systems and I/O middleware can be designed and developed in an efficient manner, it will not only benefit a wide variety of upper-layer frameworks and query engines, such as MapReduce, Spark, HBase, Hive, Spark SQL, etc., but will also lead to an acceleration in the effective use of Big Data file systems and I/O middleware on leadership-class HPC clusters.

1.1 Problem Statement

In this thesis, we address the above-mentioned challenges with a focus on the I/O requirements of Big Data middleware and applications. To summarize, this thesis addresses the following research questions:

1. Can we re-design HDFS to take advantage of high performance interconnects and exploit advanced features such as RDMA? What are the challenges here?

2. How can we maximize overlapping among different stages of the HDFS Write operation? Can we adopt the high-throughput SEDA-based approach to achieve this?

3. Is it possible to design HDFS with a hybrid architecture to take advantage of the heterogeneous storage devices on HPC clusters? Can we propose effective data placement policies for HDFS to reduce the I/O bottlenecks?

4. Can we also propose advanced acceleration techniques to adapt HDFS with in-memory and heterogeneous storage for iterative benchmarks and applications?

5. How can we design efficient data access strategies for Hadoop and Spark on HPC systems with heterogeneous storage characteristics?

6. Can we re-design HDFS to leverage the byte-addressability of NVM?

7. How can we take advantage of a high-performance burst buffer system to accelerate the I/O performance of Big Data applications on HPC clusters? Can a key-value store such as Memcached be used for this? What are the challenges to integrate Hadoop with Lustre through this burst buffer?

1.2 Research Framework

Figure 1.2 illustrates the overall scope of this thesis. Through this framework, this thesis aims to address the challenges enumerated in Section 1.1 as follows:

[Figure 1.2: The Proposed Research Framework. The framework layers Big Data applications, workloads, and benchmarks (HBase, Hadoop MapReduce, Hive, Spark, Spark SQL) over the proposed high performance file system and I/O middleware (hybrid HDFS with heterogeneous storage and advanced data placement, efficient data access for heterogeneous (storage) clusters, selective caching for iterative jobs, leveraging NVM for Big Data I/O, KV-store (Memcached) based burst buffer, maximized stage overlapping, and RDMA-Enhanced HDFS), built on top of storage technologies (HDD, SSD, RAM Disk, and NVM), networking technologies/protocols (InfiniBand, 10/40/100 GigE, RDMA), and the parallel file system (Lustre).]

1. Can we re-design HDFS to take advantage of high performance interconnects

and exploit advanced features such as RDMA? What are the challenges here?

HDFS cannot leverage the high performance communication features of HPC clusters. High performance networks such as InfiniBand [42] provide low latency and high throughput data transmission. Over the past decade, the scientific and parallel computing domains, with the Message Passing Interface (MPI) as the underlying basis for most applications, have made extensive use of these advanced networks. Implementations of MPI, such as MVAPICH2 [60], achieve low one-way latencies in the range of 1-2µs. On the other hand, even the best implementation of sockets on InfiniBand achieves 20-25µs one-way latency [39]. Recent studies [39, 48, 90, 102] highlight the huge performance improvements possible for different Big Data middleware using InfiniBand networks and faster disk technologies such as SSDs.

HDFS is communication intensive due to its distributed nature. All existing communication protocols [16] of HDFS are layered on top of TCP/IP. Due to the byte-stream communication nature of TCP/IP, multiple data copies are required, which results in poor performance in terms of both latency and throughput. Consequently, even though the underlying system is equipped with high performance interconnects such as InfiniBand, HDFS cannot fully utilize the hardware capability and obtain peak performance. As a result, a highly scalable RDMA-based design of HDFS is necessary to exploit the full potential of the underlying interconnect.

2. How can we maximize overlapping among different stages of the HDFS Write operation? Can we adopt the high-throughput SEDA-based approach to achieve this?

During HDFS Write, a data block is transferred as packets from clients to DataNodes. Each packet goes through processing and replication; finally, it is stored inside the DataNode. The vanilla HDFS adopts the One-Block-One-Thread (OBOT) architecture to process data sequentially. The OBOT design is a good trade-off between simplicity and performance for default Hadoop running over low-speed interconnect clusters. The RDMA-Enhanced design of HDFS reduces the communication overhead to a great extent. Even though this design can support task/block-level parallelism, there is no overlapping among different packets belonging to the same block. With RDMA-based communication, the data transfer time is significantly reduced, which results in a higher message rate on the DataNode side. Due to the sequential data processing in the default OBOT architecture, incoming data packets must wait for the I/O stage of the previous packet to complete before they are read and processed. This phenomenon is more pronounced in RDMA-Enhanced HDFS than in the default design, since the performance bottleneck moves from data transmission to data persistence. It is, therefore, critical to design RDMA-Enhanced HDFS with a higher-throughput architecture that maximizes overlapping among different stages.
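
To make the contrast with the sequential OBOT model concrete, the sketch below shows the general idea behind a staged, SEDA-style pipeline: each stage has its own bounded queue and worker thread, so one packet can be processed while the previous packet of the same block is still being written to storage. This is only an illustrative sketch with hypothetical class and stage names; it is not the SOR-HDFS implementation described later in this thesis.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Minimal SEDA-style pipeline: receive -> process/replicate -> I/O.
    // Each stage runs in its own thread and hands packets to the next stage
    // through a bounded queue, so different packets of the same block can be
    // in different stages at the same time.
    public class StagedPipelineSketch {
        static class Packet {
            final long blockId; final int seqNo; final byte[] data;
            Packet(long blockId, int seqNo, byte[] data) {
                this.blockId = blockId; this.seqNo = seqNo; this.data = data;
            }
        }

        private final BlockingQueue<Packet> readQ = new LinkedBlockingQueue<>(128);
        private final BlockingQueue<Packet> ioQ = new LinkedBlockingQueue<>(128);

        // Stage 1: packets arriving from the network are enqueued by a receiver.
        public void onPacketReceived(Packet p) throws InterruptedException {
            readQ.put(p);
        }

        // Stage 2: processing/replication worker.
        private final Thread processStage = new Thread(() -> {
            try {
                while (true) {
                    Packet p = readQ.take();
                    // ... verify checksum, forward to the next DataNode, etc.
                    ioQ.put(p);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 3: I/O worker persists packets to local storage.
        private final Thread ioStage = new Thread(() -> {
            try {
                while (true) {
                    Packet p = ioQ.take();
                    // ... append p.data to the block file and send the acknowledgment.
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        public void start() { processStage.start(); ioStage.start(); }
    }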

3. Is it possible to design HDFS with a hybrid architecture to take advantage of the

heterogeneous storage devices on HPC clusters? Can we propose effective data

placement policies for HDFS to reduce the I/O bottlenecks?

Most modern HPC clusters [32] are equipped with heterogeneous storage devices such as RAM, SSD, and HDD. Current large-scale HPC systems also take advantage of enterprise parallel file systems like Lustre. All such heterogeneous storage devices on modern HPC clusters have different performance and storage characteristics. Even though the amounts of local storage space on RAM Disk, SSD, and HDD are negligible compared to the vast installation of Lustre, the peak data access bandwidths of RAM Disk and SSD are much higher than that of Lustre. Due to the fast access speed, in-memory I/O for Hadoop has become more and more popular in recent times. Many studies [21, 96] have examined the impact of caching to accelerate HDFS Read performance. But, as stated in [37], HDFS I/O is often dominated by the Write operation. Therefore, optimizing the I/O performance of HDFS Write through intelligent data placement is as important as optimizing that of HDFS Read. The default design of HDFS cannot fully leverage the heterogeneous storage devices on modern HPC clusters for performance-sensitive applications. The limitation comes from the existing data storage policies and their ignorance of data usage patterns. Besides, due to the tri-replicated data blocks, HDFS requires a large volume of local storage space, which makes the deployment of HDFS challenging on HPC systems. Recent studies have also paid attention to incorporating heterogeneous storage media (e.g., SSD) [68, 69] and parallel file systems [70, 91] into HDFS. In this context, it is imperative that we design efficient data placement policies to minimize the I/O bottlenecks and local storage requirements of HDFS on large-scale HPC systems with heterogeneous storage architecture.
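
As a simple illustration of the kind of placement decision this question targets, the sketch below chooses a storage directory for a new block by walking a RAM Disk -> SSD -> HDD -> Lustre hierarchy and falling back whenever a device does not have enough usable space. The mount points and the headroom rule are hypothetical; the actual Triple-H policies proposed in this thesis are considerably more involved (they also consider data usage patterns and replication).

    import java.io.File;

    // Hypothetical storage hierarchy; the mount points are placeholders.
    public class PlacementSketch {
        private static final File[] TIERS = {
            new File("/dev/shm/hdfs"),          // RAM Disk
            new File("/ssd/hdfs"),              // local SSD
            new File("/hdd/hdfs"),              // local HDD
            new File("/lustre/scratch/hdfs")    // parallel file system
        };

        /** Pick the fastest tier that still has room for one block plus headroom. */
        public static File chooseDirectory(long blockSizeBytes) {
            for (File tier : TIERS) {
                long free = tier.getUsableSpace();
                // Keep roughly 10% headroom so a device is never filled completely.
                if (tier.isDirectory() && free > blockSizeBytes + tier.getTotalSpace() / 10) {
                    return tier;
                }
            }
            // Last resort: the parallel file system (assumed effectively unlimited).
            return TIERS[TIERS.length - 1];
        }

        public static void main(String[] args) {
            System.out.println("Place 128 MB block in: " + chooseDirectory(128L << 20));
        }
    }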

4. Can we propose advanced acceleration techniques to adapt HDFS with in-memory

and heterogeneous storage for iterative benchmarks and applications?

HDFS is not optimized for iterative applications due to the huge I/O bottlenecks incurred by storing the output of each iteration to disk and reading it back from disk. The hybrid design of HDFS with in-memory and heterogeneous storage devices significantly reduces these I/O bottlenecks. The efficient data placement policies further open up new design opportunities to accelerate iterative benchmarks and applications. Thus, if we can identify the hot data for such applications and place it intelligently in the underlying storage devices, we will be able to improve the overall application performance.
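
One simple way to act on "hot" data, sketched below, is to count accesses per block and promote a block to the in-memory tier once it has been read more than a threshold number of times, as iteration inputs and outputs typically are. The class name and threshold are hypothetical simplifications of the selective caching scheme proposed later in this thesis.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Tracks per-block read counts and flags blocks that should be promoted
    // to the in-memory (RAM Disk) tier. Purely illustrative.
    public class HotDataTracker {
        private final Map<Long, AtomicInteger> readCounts = new ConcurrentHashMap<>();
        private final int hotThreshold;

        public HotDataTracker(int hotThreshold) { this.hotThreshold = hotThreshold; }

        /** Call on every block read; returns true when the block becomes hot. */
        public boolean recordRead(long blockId) {
            int count = readCounts
                .computeIfAbsent(blockId, id -> new AtomicInteger())
                .incrementAndGet();
            return count == hotThreshold; // promote exactly once
        }

        public static void main(String[] args) {
            HotDataTracker tracker = new HotDataTracker(2);
            System.out.println(tracker.recordRead(42L)); // false: first read
            System.out.println(tracker.recordRead(42L)); // true: promote to memory tier
        }
    }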

5. How can we design efficient data access strategies for Hadoop and Spark on

HPC systems with heterogeneous storage characteristics?

By default, the scheduler launches MapReduce and Spark tasks considering data locality in order to minimize data movement. The launched tasks read the data in HDFS without considering the storage type or the bandwidth of the storage devices on the local node. For most commodity clusters, this makes perfect sense, as the underlying interconnect is assumed to be slow and, hence, data movement is expensive. Besides, HDFS was initially designed for homogeneous systems, and the assumption was that the nodes would have hard disks as the storage devices [16]. Recently, the concept of heterogeneous storage devices has been introduced in HDFS. Consequently, a variety of data placement policies have been proposed in the literature [34] to utilize the heterogeneous storage devices efficiently. However, even though, with heterogeneous storage, data is replicated on different types of storage devices across different nodes, only locality is considered when accessing the data in the default HDFS distributions. As a result, applications and frameworks running over HDFS cannot utilize the hardware capabilities and obtain peak performance in the presence of many concurrent tasks and jobs. It is, therefore, critical to re-visit the data access strategies in HDFS for HPC systems with heterogeneous storage characteristics and fast interconnects.
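
The sketch below conveys the flavor of a storage-aware access strategy: instead of always preferring the local replica, the client ranks replicas by storage type and only prefers locality to break ties. The Replica record and the ranking are hypothetical simplifications; the Greedy and Hybrid strategies evaluated in this thesis are more nuanced (for example, they also balance load across remote nodes).

    import java.util.Comparator;
    import java.util.List;

    // Chooses which replica of a block to read on a fast-interconnect cluster.
    public class ReplicaSelectorSketch {
        enum StorageType { RAM_DISK, SSD, HDD }   // fastest to slowest

        static class Replica {
            final String host; final StorageType type;
            Replica(String host, StorageType type) { this.host = host; this.type = type; }
        }

        /** Prefer faster storage first; break ties in favor of the local host. */
        static Replica choose(List<Replica> replicas, String localHost) {
            return replicas.stream()
                .min(Comparator
                    .comparingInt((Replica r) -> r.type.ordinal())
                    .thenComparing(r -> r.host.equals(localHost) ? 0 : 1))
                .orElseThrow(() -> new IllegalArgumentException("no replicas"));
        }

        public static void main(String[] args) {
            List<Replica> replicas = List.of(
                new Replica("node1", StorageType.HDD),      // local
                new Replica("node7", StorageType.RAM_DISK), // remote but faster
                new Replica("node9", StorageType.SSD));
            // With RDMA-class interconnects, reading the remote RAM Disk replica
            // can be cheaper than reading the local HDD replica.
            System.out.println(choose(replicas, "node1").host); // prints node7
        }
    }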

6. Can we re-design HDFS to leverage the byte-addressability of NVM?

For performance-sensitive applications, in-memory storage is being increasingly used for HDFS on HPC systems. Even though HPC clusters are equipped with large memory per compute node, using this memory for storage can lead to degraded computation performance due to competition for physical memory between computation and I/O. Recent studies [34, 54] propose in-memory data placement policies to increase the write throughput, but in-memory data placement makes the task of persistence challenging. Non-Volatile Memory (NVM), being persistent, is emerging as a way to significantly improve I/O performance while ensuring persistence. NVMs not only augment the memory capacity on the compute nodes, but also offer byte-addressability with near-DRAM performance. In the presence of RDMA and Lustre on HPC systems, NVMs introduce additional research challenges. In this context, HDFS can take advantage of NVM for RDMA-based communication and I/O. Even though supercomputing systems are being equipped with high performance NVMs, the default design of HDFS is not capable of exploiting their full potential for application performance. Thus, it is important to re-design the storage architecture of HDFS to leverage the byte-addressability of NVM for improving the overall performance and scalability of upper-level middleware and applications.
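
To illustrate what byte-addressable access buys over block I/O, the sketch below memory-maps a file that is assumed to live on a DAX-mounted NVM device and updates a few bytes in place, instead of streaming a whole block back out. The mount path is a placeholder, and mapping a file this way only approximates persistent-memory programming (which would use explicit cache-line flushes rather than force()).

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class NvmByteAccessSketch {
        public static void main(String[] args) throws IOException {
            // Assume /mnt/pmem0 is an NVM device mounted with DAX (hypothetical path).
            try (FileChannel ch = FileChannel.open(Paths.get("/mnt/pmem0/block_0001"),
                    StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

                // Byte-addressable update: touch only the bytes that changed,
                // instead of rewriting the whole 4 KB block through a stream.
                buf.position(128);
                buf.put("new-metadata".getBytes(StandardCharsets.US_ASCII));

                buf.force();   // flush the mapped region; a stand-in for persistence barriers
            }
        }
    }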

7. How can we take advantage of a high-performance burst buffer system to accelerate the I/O performance of Big Data applications on HPC clusters? Can a key-value store such as Memcached be used for this? What are the challenges to integrate HDFS with Lustre through this burst buffer?

Hadoop MapReduce jobs are often run on top of the parallel file systems on HPC clusters. For write-intensive applications, running the job over Lustre can also avoid the overhead of tri-replication that is present in HDFS [82]. Although parallel file systems are optimized for concurrent accesses by large-scale applications, write overheads can still dominate the run times of data-intensive applications. In HPC systems, burst buffer systems are often used to handle the bandwidth limitation of shared file system access [55]. Burst buffers typically buffer the checkpoint data from HPC applications, but there has been very little research on optimizing such burst buffers for Big Data in particular. For instance, checkpoints are read only during application restarts, and thus checkpointing is a write-intensive operation. On the other hand, there is a variety of Hadoop applications: some are write-intensive, some are read-intensive, while some have an equal amount of reads and writes. Besides, the output of one job is often used as the input of the next one; in such scenarios, data locality in the cluster impacts the performance of the latter to a great extent. Moreover, considering the fact that the real data has to be stored to the file system through the burst buffer layer, ensuring fault tolerance of the data is much more important than in the checkpointing domain. We, therefore, believe that designing a key-value store-based burst buffer and integrating Hadoop with Lustre can guarantee optimal performance for a variety of Big Data applications.
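
A minimal sketch of the Asynchronous-Lustre-Write flavor of this idea is shown below: the write is acknowledged as soon as the data sits in the key-value-store-backed buffer, and a background thread flushes it to the Lustre-mounted path. For simplicity, an in-process ConcurrentHashMap stands in for the RDMA-based Memcached servers, and the Lustre mount point is hypothetical; the actual schemes proposed in this thesis are described in a later chapter.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Buffers block writes in a key-value store and flushes them to Lustre
    // asynchronously, so the client sees buffer latency rather than Lustre latency.
    public class BurstBufferSketch {
        private final ConcurrentHashMap<String, byte[]> kvStore = new ConcurrentHashMap<>();
        private final ExecutorService flusher = Executors.newSingleThreadExecutor();
        private final Path lustreDir = Paths.get("/lustre/scratch/user/burst"); // placeholder

        /** Returns as soon as the block is buffered; persistence happens in the background. */
        public void writeBlock(String blockId, byte[] data) {
            kvStore.put(blockId, data);
            flusher.submit(() -> {
                try {
                    Files.createDirectories(lustreDir);
                    Files.write(lustreDir.resolve(blockId), data);
                    kvStore.remove(blockId);        // evict once it is safely on Lustre
                } catch (IOException e) {
                    e.printStackTrace();            // a real design would retry or report
                }
            });
        }

        /** Reads hit the buffer first and fall back to Lustre. */
        public byte[] readBlock(String blockId) throws IOException {
            byte[] cached = kvStore.get(blockId);
            return cached != null ? cached : Files.readAllBytes(lustreDir.resolve(blockId));
        }

        public void shutdown() { flusher.shutdown(); }
    }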

1.3 Organization of the Thesis

Chapter 2 introduces the necessary background relevant to this thesis. Chapter 3 presents the RDMA-Enhanced design of HDFS, while Chapter 4 describes the design to maximize overlapping among different stages of HDFS operations. Chapter 5 discusses the hybrid design of HDFS with heterogeneous storage and advanced data placement policies. Chapter 6 proposes acceleration techniques for iterative applications with in-memory and heterogeneous storage. Chapter 7 presents the efficient data access strategies for Hadoop and Spark on clusters with heterogeneous storage characteristics. Chapter 8 discusses the NVM-based design of HDFS, and Chapter 9 describes the burst buffer design for Hadoop over Lustre. Some future research directions are discussed in Chapter 10 and, finally, the dissertation concludes in Chapter 11.

Chapter 2: Background

2.1 Hadoop

Hadoop is a popular software framework for distributed storage and processing of very large data sets on computer clusters. It is the open-source implementation of the MapReduce programming model, and Hadoop Distributed File System (HDFS) is the underlying file system of Hadoop.

2.1.1 Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system that is used as the primary storage for a Hadoop cluster. Figure 2.1 illustrates the basic architecture of HDFS. An HDFS cluster consists of two types of nodes: NameNode and DataNode. The NameNode manages the file system namespace. It maintains the file system tree and stores all the metadata. The DataNodes, on the other hand, act as the storage system for the HDFS files. HDFS divides large files into blocks of size 64 MB. Each block is stored as an independent file in the local file system of the DataNodes. HDFS usually replicates each block to three (the default replication factor) DataNodes. In this way, HDFS guarantees data availability and fault tolerance. The HDFS client contacts the NameNode during any kind of file system operation. When the client wants to write a file to HDFS, it gets the block IDs and a list of DataNodes for each block from the NameNode. Each block is split into smaller packets and sent to the first DataNode in the pipeline. The first DataNode then replicates each of the packets to the subsequent DataNodes. Packet transmission in HDFS is pipelined; a DataNode can receive packets from the previous DataNode while it is replicating data to the next DataNode. If the client is running on a DataNode, then the block will first be written to the local file system.
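
From a client's point of view, all of this pipelining is hidden behind the standard Hadoop FileSystem API: the client simply opens an output stream and HDFS takes care of splitting the data into blocks and packets and replicating them. The snippet below is a minimal example of writing and reading a file with the Java API; the NameNode address and the file path are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import java.nio.charset.StandardCharsets;

    public class HdfsClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder NameNode URI

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/sample.txt");

                // Write: the DFSClient contacts the NameNode for block locations and
                // streams packets to the DataNode pipeline (replication factor 3).
                try (FSDataOutputStream out = fs.create(file, (short) 3)) {
                    out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
                }

                // Read: the client fetches block locations and reads from the nearest DataNode.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }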

When a client reads an HDFS file, it first contacts the NameNode to check its access permission and gets the block IDs and locations for each of the blocks. For each block belonging to the file, the client connects with the nearest DataNode and reads the block.

[Figure 2.1: HDFS Architecture, showing the Client and Zookeeper interacting with the NameNode and multiple DataNodes.]

2.1.1.1 Overview of HDFS Design

The main components of HDFS are DFSClient and DataNode. The upper-layer frameworks send data to the file system by creating instances of DFSClient that send data to the DataNodes. The DataNodes store these data by creating block files on the local storage devices. The detailed designs of DFSClient and DataNode are as follows:

DataStreamer: The DataStreamer thread runs as a daemon on the DFSClient side. Data packets from the upper layer are put into the DataQueue associated with this thread. The DataStreamer is responsible for sending the block header and the data packets for each block to different DataNodes via Java sockets. Due to the byte-stream model of the Java socket, each packet is converted to a stream of bytes before being written to the socket. After a packet is sent, it is inserted into the AckQueue. Application performance depends on the speed at which the DataStreamer sends the data packets, as new packets can only be inserted into the DataQueue up to a certain limit.

ResponseProcessor: The ResponseProcessor thread waits for the acknowledgment for each packet. Once the acknowledgment for a packet arrives, the packet is removed from the AckQueue.

DataXceiverServer: In the DataNode, a listener thread runs in the DataXceiverServer. This thread continuously monitors the Java socket for incoming connection requests.

DataXceiver: When the listener in the DataXceiverServer accepts a socket connection, it creates a DataXceiver thread. The DataXceiver acts as a daemon to process the incoming data. The DFSClient sends a write request for a block to the DataXceiver, which then parses the header and takes action accordingly (calls writeBlock()). Inside the writeBlock() function, header processing as well as replication takes place, and an acknowledgment is sent back to the client. Then a BlockReceiver is created for the incoming block, which waits to receive the packets belonging to that block. The DFSClient, after getting an acknowledgment for the block header, starts sending packets for the corresponding block. The BlockReceiver, upon receiving the byte stream from the socket, de-serializes and processes each packet. The packet is then replicated to the next DataNode in the pipeline and flushed into the local file system. Then the sequence number of the packet is enqueued into the acknowledgment queue. The PacketResponder sends an acknowledgment for each packet to either the client or the previous DataNode in the pipeline. Acknowledgments are also sent using Java sockets.
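
The interplay between the DataQueue and the AckQueue described above can be summarized with the simplified sketch below: a streamer thread drains the data queue and parks packets in the ack queue, while a response thread retires them as acknowledgments arrive. This is a stripped-down illustration, not the actual HDFS DataStreamer code.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Simplified model of the DFSClient-side DataStreamer/ResponseProcessor pair.
    public class DataStreamerSketch {
        // Bounded, so the writer blocks when it gets too far ahead of the DataNodes.
        private final BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>(80);
        private final BlockingQueue<byte[]> ackQueue = new LinkedBlockingQueue<>();

        /** Called by the upper layer for every packet of the block being written. */
        public void enqueuePacket(byte[] packet) throws InterruptedException {
            dataQueue.put(packet); // blocks once the limit is reached
        }

        // DataStreamer: sends packets to the first DataNode, then parks them in ackQueue.
        private final Thread streamer = new Thread(() -> {
            try {
                while (true) {
                    byte[] packet = dataQueue.take();
                    // sendToDataNode(packet);  // socket (or RDMA) send would go here
                    ackQueue.put(packet);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // ResponseProcessor: removes a packet once its acknowledgment arrives.
        private final Thread responseProcessor = new Thread(() -> {
            try {
                while (true) {
                    // waitForAck();            // would block on the ack stream here
                    byte[] acked = ackQueue.take();
                    // 'acked' is now safely replicated and can be forgotten.
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        public void start() { streamer.start(); responseProcessor.start(); }
    }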

2.1.2 MapReduce

MapReduce [36] is the processing engine of Hadoop. The NameNode hosts a JobTracker that is responsible for farming out tasks to different nodes in the cluster. The DataNodes host the TaskTrackers, one per DataNode. TaskTrackers are responsible for the successful completion of a MapReduce job. Each TaskTracker can launch multiple map and reduce tasks that deal with the job input and output, respectively.
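
To make the roles of the map and reduce tasks concrete, the snippet below is the canonical word-count job written against the org.apache.hadoop.mapreduce API. It is included only as an illustration of the programming model; the input and output paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    ctx.write(word, ONE);             // emitted by a map task
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum)); // written to HDFS by a reduce task
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }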

2.1.3 HBase

HBase is a Java-based database that runs on top of the Hadoop framework [16]. It is used to host very large tables with many billions of entries and provides capabilities similar to Google’s BigTable [22]. It is developed as part of the Apache Software Foundation’s

Apache Hadoop project [16] and runs on top of HDFS. It flushes its in-memory data to

HDFS whenever the size of its memory store reaches a particular threshold (default is

64 MB).

2.1.4 Hive

Hive [1] provides a SQL interface for querying the data stored in Hadoop. Each Hive table corresponds to a directory in HDFS. The SQL queries are launched as MapReduce,

Spark, or Tez [3] jobs to execute the analyses on the distributed data. Hive acts as an abstraction layer to integrate SQL-like queries (HiveQL) with the underlying Java-based

Hadoop layer. Hive was initially developed at Facebook and is now being used in many organizations, including Amazon, Netflix, etc.

2.2 InfiniBand and UCR

InfiniBand [42] is an industry-standard switched fabric that is designed for interconnecting nodes in High End Computing (HEC) clusters. It is a high-speed, general purpose I/O interconnect that is widely used by scientific computing centers world-wide. The recently released TOP500 [97] rankings from November 2016 reveal that more than 37.4% of the computing systems use InfiniBand as their primary interconnect. One of the main features of InfiniBand is Remote Direct Memory Access (RDMA). This feature allows software to remotely read the memory contents of a remote process without any software involvement at the remote side. This feature is very powerful and can be used to implement high performance communication protocols. InfiniBand has started making inroads into the commercial domain with the recent convergence around RDMA over Converged Enhanced Ethernet (RoCE) [89].

InfiniBand Verbs InfiniBand Host Channel Adapters (HCA) and other network equip- ments can be accessed by the upper layer software using an interface called Verbs. The verbs interface is a low level communication interface that follows the Queue Pair (or com- munication end-points) model. Queue pairs are required to establish a channel between the two communicating entities. Each queue pair has a certain number of work queue elements.

Upper-level software places a work request on the queue pair that is then processed by the

HCA. When a work element is completed, it is placed in the completion queue. Upper level software can detect completion by polling the completion queue. Verbs that are used to transfer data are completely OS-bypassed.

InfiniBand IP Layer InfiniBand also provides a driver for implementing the IP layer, al- lowing socket applications to make use of InfiniBand networks. This exposes the Infini-

Band device as just another network interface available from the system with an IP address.

InfiniBand devices are presented as ib0, ib1 and so on. However, it does not provide OS- bypass. This layer is often called IP-over-IB or IPoIB in short. We will use this terminology

18 in the thesis. There are two modes available for IPoIB. One is the datagram mode, imple- mented over Unreliable Datagram (UD), and the other is connected mode, implemented over RC. The connected mode offers better performance since it leverages reliability from the hardware. In this thesis, we have used connected mode IPoIB.

Unified Communication Runtime (UCR): The Unified Communication Runtime (UCR) [47] is a light-weight, high performance communication runtime, designed and developed at The Ohio State University. It aims to unify the communication runtime requirements of scientific parallel programming models, such as MPI and Partitioned Global Address Space (PGAS), with those of data-center middleware, such as Memcached [31], HBase [95], and MapReduce [36].

UCR is designed as a native library to extract high performance from advanced network technologies. The UCR project draws from the design of the MVAPICH2 software. MVAPICH2 [60] is a popular MPI library implementation that provides MPI-3 standard compliance. It is used by more than 2,700 organizations in 83 countries and is also distributed by popular InfiniBand software stacks and distributions such as RedHat and SUSE.

2.3 Heterogeneous Storage Devices on Modern HPC Clusters

2.3.1 Node-Local Storage Devices

RAM Disk is a memory-based file system, faster than modern SSDs, with massively increased read/write performance for all workload types. Solid state drives (SSDs), on the other hand, provide persistent storage with high throughput and low latency and are well-suited for Big Data applications. SATA SSDs are widely used due to their low cost and high performance compared to disk-based hard drives (HDDs). The SATA interface limits the capacity of the bus that transfers data from the SSD to the processor, but SATA SSDs are still suitable as a low-cost option where performance is not the major factor. The current generation of PCIe flash SSDs has thus become very popular in modern HPC clusters due to their high bandwidth and fast random access. While they provide unmatched performance, the adoption of PCIe SSDs is inhibited by differing implementations and unique drivers. This has led to the definition of the NVM Express (NVMe) standard to enable faster adoption and interoperability of PCIe SSDs. The benefits of NVMe with PCIe over traditional SATA SSDs are reduced latency, increased IOPS, and lower power consumption, in addition to durability. Hard-disk drives (HDDs) furnish the largest amount of data storage, but perform poorly compared to SSDs.

2.3.2 Non-Volatile Memory

Non-Volatile Memory (NVM), or NVRAM, provides high data reliability and byte addressability. It is therefore closing the performance, capacity, and cost-per-bit gaps between volatile main memory (DRAM) and persistent storage. The performance of NVM is orders of magnitude faster than that of SSD and disk.

2.3.3 Parallel File System

Lustre [108] is one of the most commonly deployed global file systems on supercomputing clusters and in data centers. It provides POSIX-compliant, stateful, object-based storage to the end applications. Its architecture has two major components: the Meta Data Server (MDS) and the Object Storage Server (OSS). The MDS is responsible for storing the metadata of all files, whereas the OSSs store the actual file data. To access a file, a client first obtains its metadata and other file attributes from the primary MDS. Subsequent file I/O operations are performed directly between the client and the OSS. The various components in a Lustre deployment communicate with each other using the Lustre Network (LNET), which supports a variety of interconnects such as InfiniBand and Ethernet.

2.4 In-Memory Computing Framework and Storage

2.4.1 Spark

While MapReduce [28] has revolutionized Big Data processing for data-intensive batch applications on commodity clusters, it has been demonstrated to be a poor fit for low-latency interactive applications and iterative computations, such as machine learning and graph algorithms. As a result, newer data-processing frameworks such as Spark [111] have emerged. Spark's architecture is built around the concept of a Resilient Distributed Dataset (RDD) [110], which is a fault-tolerant collection of objects distributed across a set of nodes that can be operated on in parallel. Spark can be run in standalone mode or over YARN [98].

Spark can be run using one of two deployment modes:

Standalone: This is the default deployment mode. In this mode, Spark can run on top of HDFS. It uses a Master daemon which coordinates the workers that host the executors.

Spark over YARN [98]: In this mode, Apache YARN takes responsibility for launching the Spark applications. The YARN ResourceManager provides the available resources to the Spark applications, whereas the YARN NodeManager launches the Spark workers that run the executors. This mode supports security and provides better integration with YARN's resource management policies. Applications can run in either YARN-cluster or YARN-client mode. The difference between these two modes is that the application driver runs locally in YARN-client mode, whereas in YARN-cluster mode the driver is co-located with YARN's ApplicationMaster.
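As a rough illustration of how the deployment mode is selected, the following sketch uses the Spark 1.x Java API; the application name, master URL, and data are illustrative, and in practice the master is usually supplied through spark-submit rather than hard-coded.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal Spark (1.x Java API) sketch: the master URL selects the deployment
// mode, "spark://<host>:7077" for standalone or "yarn-client"/"yarn-cluster"
// when running over YARN. Host name and application name are illustrative.
public class SparkModeExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("mode-example")
                .setMaster("yarn-client");      // or "spark://master-host:7077"
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4));
        int sumOfSquares = data.map(x -> x * x).reduce((a, b) -> a + b);
        System.out.println(sumOfSquares);

        sc.stop();
    }
}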

2.4.1.1 Spark SQL

Spark SQL [13] provides an interface to query structured data inside Spark programs. It can also be used to query data in existing Hive tables. It provides an abstraction to access a variety of data sources such as Avro, Parquet, ORC, JSON, and JDBC. It is also possible to join data from multiple sources within a single Spark SQL query.
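A minimal sketch of issuing a Spark SQL query from a Java program is shown below, assuming the Spark 1.x API (SQLContext/DataFrame); the input path, table name, and column names are illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Sketch of querying structured data with Spark SQL (Spark 1.x API);
// the input path and the table/column names are illustrative.
public class SparkSQLExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("sparksql-example"));
        SQLContext sqlContext = new SQLContext(sc);

        DataFrame records = sqlContext.read().json("hdfs:///data/records.json");
        records.registerTempTable("records");

        DataFrame result = sqlContext.sql(
                "SELECT category, COUNT(*) AS cnt FROM records GROUP BY category");
        result.show();

        sc.stop();
    }
}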

2.4.2 Tachyon

Tachyon [54] is a memory-centric distributed storage system for Spark and MapReduce applications. Tachyon is usually deployed on top of a distributed file system like HDFS, Lustre, or GlusterFS. By leveraging lineage for fault tolerance, it provides faster in-memory read and write accesses for primitive file system operations. For complex workloads like MapReduce and Spark, however, Tachyon by default depends on the fault tolerance mechanism of the underlying file system.

2.4.3 Memcached

Memcached was proposed by Fitzpatrick [31] to cache database request results. It was primarily designed to improve the performance of the LiveJournal website. Due to its generic nature and open-source distribution [12], it was quickly adopted in a wide variety of environments. Using Memcached, spare memory in data-center servers can be aggregated to speed up look-ups of frequently accessed information, such as database query results, results of API calls, or web-page rendering elements. Memcached is usually deployed in Web 2.0 architectures as a caching layer to improve performance for various client operations. It is also being used for Big Data computing on HPC systems.
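As a simple illustration of this look-aside caching pattern, the following sketch uses the spymemcached Java client (one of several available clients); the server address, key, and expiry value are illustrative.

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Minimal look-aside caching sketch using the spymemcached client library;
// server address and key are illustrative.
public class MemcachedLookup {
    public static void main(String[] args) throws Exception {
        MemcachedClient mc =
                new MemcachedClient(new InetSocketAddress("cache-host", 11211));

        String key = "query:recent-results";
        Object cached = mc.get(key);
        if (cached == null) {
            String value = runExpensiveQuery();    // e.g., a database query
            mc.set(key, 300, value);               // cache for 300 seconds
            cached = value;
        }
        System.out.println(cached);
        mc.shutdown();
    }

    static String runExpensiveQuery() { return "result"; }
}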

2.5 Benchmarks, Workloads, and Applications

HDFS Micro-benchmarks: The HDFS micro-benchmark suite proposed in [45] has five benchmarks for testing standalone HDFS. These are:

Sequential Write Latency (HDFS-SWL): This benchmark takes the file name and size as inputs and outputs the total time to write this file to HDFS. The HDFS write is performed sequentially by dividing the file into a set of blocks. For this, the benchmark invokes the HDFS create() API to get an instance of FSDataOutputStream. Data bytes are then written to HDFS using the write() method of FSDataOutputStream. The benchmark starts a timer just after creating the file and stops it after the file is closed. The time measured in this way is reported as the latency of the sequential write for the file specified by the user.
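A condensed sketch of this measurement loop, using the standard Hadoop FileSystem API, is shown below; it is not the benchmark's actual source, and the file path and sizes are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Condensed sketch of the HDFS-SWL measurement (not the benchmark source):
// create the file, stream the requested number of bytes, close, and report
// the elapsed time. File name and sizes are illustrative.
public class SequentialWriteLatency {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long fileSize = 1024L * 1024 * 1024;      // 1 GB
        byte[] buffer = new byte[64 * 1024];      // 64 KB application writes

        long start = System.currentTimeMillis();
        FSDataOutputStream out = fs.create(new Path("/bench/swl.dat"));
        for (long written = 0; written < fileSize; written += buffer.length) {
            out.write(buffer);
        }
        out.close();                              // timer stops after close()
        long elapsed = System.currentTimeMillis() - start;

        System.out.println("Sequential write latency: " + elapsed + " ms");
    }
}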

Sequential or Random Read Latency (HDFS-SRL or HDFS-RRL): This benchmark takes the file name, size, access pattern (random or sequential), and seek interval (for random only) as inputs and outputs the time to read the file from HDFS. For this, the benchmark invokes the HDFS open() API to get an instance of FSDataInputStream. Data bytes are then read from HDFS using the sequential or random read() method of FSDataInputStream. The benchmark starts a timer just after opening the file and stops it after the read completes. The time measured in this way is reported as the read latency for the data size specified by the user.
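A corresponding sketch for the read-latency benchmarks is shown below; again, this is an illustration rather than the benchmark's source, and the file path and seek interval are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Condensed sketch of the HDFS-SRL/RRL measurement (not the benchmark source):
// sequential reads simply stream the file; random reads seek() by a fixed
// interval before each read. Path and seek interval are illustrative.
public class ReadLatency {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/bench/swl.dat");
        boolean random = args.length > 0 && args[0].equals("random");
        long seekInterval = 4L * 1024 * 1024;     // 4 MB, for random reads only

        byte[] buffer = new byte[64 * 1024];
        long fileLen = fs.getFileStatus(file).getLen();

        long start = System.currentTimeMillis();
        FSDataInputStream in = fs.open(file);
        long pos = 0;
        while (pos < fileLen) {
            if (random) in.seek(pos);             // jump by the seek interval
            int n = in.read(buffer);
            if (n < 0) break;
            pos += random ? seekInterval : n;
        }
        in.close();
        System.out.println("Read latency: "
                + (System.currentTimeMillis() - start) + " ms");
    }
}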

Sequential Write Throughput (HDFS-SWT): The user can input the number of concurrent writers and the size of data (in MB) per writer. The benchmark outputs the throughput per writer by dividing the data size by the write time required for it. The total throughput is calculated by multiplying the average throughput per writer by the number of writers.

In order to launch multiple HDFS clients (writers) at the same time, there is a job-launcher in the benchmark suite. The job-launcher is a shell script that starts the writer processes on different nodes. It also aggregates the throughput values from the different writers and outputs the total write throughput.

Sequential Read Throughput (HDFS-SRT): This benchmark works in a similar manner to the one for the write workload. Here the inputs are the number of concurrent readers and the read size per reader. It calculates the total throughput for sequential reads. In this case also, the Java-based job-launcher aggregates the read throughput from the different readers and outputs the total throughput.

Sequential Read-Write Throughput (HDFS-SRWT): This benchmark calculates the total throughput when HDFS reads and writes occur simultaneously. The benchmark takes the numbers of readers and writers and the size per reader and writer as inputs. The read/write ratio can be varied by varying the number of concurrent readers and writers and also the data size for each.

Table 2.1: Benchmark parameter list

Benchmark       File Name   File Size   HDFS Parameters   Readers   Writers   Random/Seq Read   Seek Interval
HDFS-SWL            X           X              X
HDFS-SRL/RRL        X           X              X                                     X            X (RRL)
HDFS-SWT                        X              X                        X
HDFS-SRT                        X              X             X
HDFS-SRWT                       X              X             X          X

In all these benchmarks, the users can also provide different HDFS configuration parameters as input. Table 2.1 lists the parameters of our benchmark suite. Each benchmark can report the configuration parameters in use as part of its output. The benchmark suite also calculates statistics such as the minimum, maximum, and average latency and throughput.

TestDFSIO: DFSIO is an HDFS benchmark that is available as part of the Hadoop distribution. It measures the I/O performance of HDFS. It is implemented as a MapReduce job in which each map task opens one file to perform sequential writes or reads, and measures the data I/O size and execution time of that task. A single reduce task is launched at the end of the map tasks to aggregate the performance results of all the map tasks.

Enhanced DFSIO: Enhanced DFSIO is included in the HiBench benchmark suite from Intel. This benchmark consists of two consecutive MapReduce jobs. The first job runs the TestDFSIO test while sampling the number of bytes read/written at fixed time intervals in each map task; during the reduce and post-processing stage, the samples of each map task are linearly interpolated and re-sampled at a fixed plot rate to align the time series between map tasks. The re-sampled points at the same timestamp of all map tasks are then summed up to compute the total number of bytes read/written by all the map tasks at that timestamp. The second job calculates the average of the aggregated throughput values of each time slot during the steady periods and outputs it as the overall aggregated throughput.

Yahoo! Cloud Serving Benchmark (YCSB): The goal of the Yahoo! Cloud Serving Benchmark (YCSB) [26] is to facilitate performance comparisons of different key/value-pair and cloud data serving systems. It defines a core set of benchmarks for four widely used systems: HBase, Cassandra [94], PNUTS [25], and a simple sharded MySQL implementation. The core workload for HBase consists of six different workloads, each representing a different application scenario. Zipfian and Uniform distribution modes are used in YCSB for record selection in the database. Besides that, customization of workloads is also possible. In addition to these different workloads, there are a number of runtime parameters that can be defined while running YCSB.

TeraGen and TeraSort: TeraGen is an I/O-intensive benchmark that generates the input data for TeraSort and stores it in the underlying file system. TeraSort, on the other hand, is shuffle-intensive. It reads data from the file system, sorts it, and then writes the output back to the file system. These benchmarks have both Hadoop MapReduce and Spark implementations.

RandomWriter and Sort: RandomWriter [76] is an I/O-intensive MapReduce benchmark that writes data to the file system. The parallelism in the data write comes from the concurrent tasks running per host. The benchmark writes a specific amount of data to the file system and finally reports the execution time to generate this data. The Sort [84] benchmark runs on the data generated by RandomWriter. This benchmark is shuffle-intensive but performs equal amounts of reads and writes from the file system.

SWIM: The Statistical Workload Injector for MapReduce (SWIM) [86] contains suites of workloads consisting of many MapReduce jobs, with complex data, arrival, and computation patterns. The MapReduce workloads in SWIM are generated by sampling traces of MapReduce jobs from Facebook. The suite also contains the generators for the data and workloads.

PUMA: The PUrdue MApReduce Benchmark Suite (PUMA) [18, 66] represents a range of MapReduce applications exhibiting different characteristics. The suite has 13 benchmarks; some are compute-intensive, some are I/O-intensive, while others involve heavy shuffle. PUMA also contains three benchmarks from the Hadoop distribution.

MR-MSPolygraph: MR-MSPolygraph [50] is a MapReduce-based implementation for parallelizing peptide identification from mass spectrometry data. This application searches for the query peptides in the reference file and counts the number of matching peptides. The application does not have a reduce phase and is, hence, a map-intensive workload.

CloudBurst: CloudBurst [78] is a highly parallelized read-mapping algorithm for next-generation DNA sequence mapping. The MapReduce implementation of CloudBurst reads the input DNA sequence data in parallel and maps the data to the human genome or other reference genomes. Finally, the application reports the alignment time along with the alignments for each read. This application is used for a variety of biological analyses, including SNP discovery, genotyping, and personal genomics.

Chapter 3: High Performance RDMA-based Design of HDFS over InfiniBand

HDFS is a communication-intensive middleware because of its distributed nature. All existing communication protocols [16] of HDFS are layered on top of TCP/IP. Due to the byte-stream communication nature of TCP/IP, multiple data copies are required, which results in poor performance in terms of both latency and throughput. Consequently, even though the underlying system is equipped with high performance interconnects such as InfiniBand, HDFS cannot fully utilize the hardware capability and obtain peak performance. These issues lead us to investigate the impact of the network on overall HDFS performance through detailed profiling and analysis. In this chapter, we also describe the design challenges for re-designing HDFS to take advantage of high performance interconnects such as InfiniBand and exploit advanced features like RDMA, and finally, we present an RDMA-Enhanced design of HDFS.

3.1 Proposed Design

In this section, we propose a hybrid design for HDFS that supports both socket- and RDMA-based communication. We concentrate on HDFS write because it is more network intensive. Our design extends the existing HDFS and uses UCR for InfiniBand communication via the JNI interfaces.

3.1.1 Design Overview

In this section, we present an overview of the RDMA-Enhanced design of HDFS, which we call HDFSoIB. In HDFSoIB, we re-design the communication layer of HDFS by using RDMA over InfiniBand instead of the Java-Socket interface. In this hybrid design, we use RDMA for HDFS write and replication via the JNI interface, whereas all other HDFS operations go over the Java-Socket. The JNI layer bridges the Java-based HDFS with the communication library written in native C. Figure 3.1(a) illustrates the design of HDFSoIB. The objective of this research is to find out how much improvement we can achieve in HDFS performance just by enhancing the communication layer. Therefore, in this design we keep the existing HDFS architecture intact.

3.1.2 New Components in the Hybrid Architecture

In the hybrid design, we apply our modifications to the DFSClient and the DataNode server. Figure 3.1(b) depicts the architecture of our proposed HDFSoIB design. The major components that we add on the DFSClient side are:

Figure 3.1: Architecture and Design of RDMA-Enhanced HDFS: (a) Design Overview; (b) Architecture of RDMA-Enhanced HDFS

Connection: In the hybrid design, we introduce a new Java object called Connection. We use UCR as the communication library. As indicated in Chapter 2.2, UCR is an end-point based library. An end-point is analogous to a socket connection. Before establishing an end-point, UCR has to initialize the IB device and create a UCR CTX. The Connection object holds the UCR CTX id returned by the communication library. It is also responsible for providing the end-points for UCR communication. We maintain a pre-allocated buffer for each end-point to avoid intermediate data copies between the JNI layer and the UCR library. The size of this buffer is limited to the HDFS packet-size in order to keep a low memory footprint. The Connection object also holds the pointer to this buffer.

The JNI Adaptive Interface: The UCR library is implemented in native C code. The JNI interface enables the Java code in HDFS to make use of the UCR library functions for communication.

RDMADataStreamer: Data packets that are put into the DataQueue are sent by the RDMADataStreamer. The RDMADataStreamer runs as a daemon in the RDMA-Enabled DFSClient (RDMADFSClient); the RDMADFSClient redirects the client to the appropriate sender thread (RDMADataStreamer or DataStreamer) depending on whether it wants to send data via RDMA or the Java socket. The RDMADataStreamer dynamically creates an end-point to the DataNode where it wants to send the HDFS blocks. It gets the data packets from the DataQueue and puts the corresponding bytebuffer into the buffer registered for this particular end-point. The data is then sent over RDMA by invoking the UCR library functions via the JNI interface.

RDMAResponseProcessor: The RDMAResponseProcessor receives acknowledgments for the packets sent over RDMA.

Similar to the client side, the Connection object and JNI interface components exist on the DataNode side as well. The HDFS DataNode server side is extended by adding the following components:

RDMADataXceiverServer: This is a listener thread which waits for connection requests over RDMA. When the listener accepts an incoming request, an end-point is created between the client and the server. This end-point is equivalent to a socket connection and can be used for data transfer between the respective nodes.

RDMADataXceiver: After an RDMA connection or end-point is established, the RDMADataXceiverServer spawns an RDMADataXceiver thread associated with that connection. The RDMADFSClient sends the write request for a block to the RDMADataXceiver, which then parses the header and takes action accordingly (calls writeBlock()). Inside the writeBlock() function, header processing takes place and an acknowledgment is sent back to the client. Then an RDMABlockXceiver is created for the incoming block, which waits to receive the data packets belonging to that block. The RDMADFSClient, after getting an acknowledgment for the block header, starts sending the data packets for the corresponding block. The RDMABlockXceiver, upon receiving the data buffer, processes the packet. It is then given to the RDMAReplicator.

RDMAReplicator: Each DataNode has a Connection object to be used for replication. The RDMAReplicator creates an end-point to the next DataNode in the pipeline and replicates the data packets. The packet is then flushed to the local storage.

RDMAPacketResponder: Receives acknowledgments from the downstream DataNode and sends acknowledgments to the upstream DataNode or the RDMADFSClient.

Figure 3.2: HDFS-RDMA design with hybrid communication support

3.1.3 Connection and Buffer Management

The DFSClient sends data blocks to different DataNodes for replication, as advised by the NameNode. Each RDMADFSClient creates an RDMA Connection object (UCR CTX) and registers a JNI buffer for communication. End-points are dynamically created in this Connection object when data is sent over RDMA. On the DataNode side as well, we maintain a Connection object to be used for replication. This Connection object is initialized when the DataNode server starts. The end-points are dynamically established during replication.

In the existing HDFS implementation, the DFSClient establishes a new socket connection with the DataNode server each time it wants to send a block to it, and the DataNode side creates a new receiver thread for each block. In our design, we eliminate the overhead of connection creation for each block by using the pre-established Connection object. The overhead of end-point creation, on the other hand, is negligible. Therefore, end-points are established on demand and cached for subsequent use. Furthermore, a single receiver thread in the DataNode is responsible for handling all the blocks coming from a particular client to that end-point. In HDFS, data packets as well as the block headers are Java objects. However, RDMA uses memory semantics. Therefore, we use Java direct buffers, which allow us to take advantage of zero-copy data transfer. Besides, the size of the JNI buffer registered per Connection object is limited to the HDFS packet-size, as this is the maximum amount of data to be transferred at a time using this buffer. This also helps in maintaining a low memory footprint. The Connection objects are also created during system initialization. Therefore, the start-up time should be negligible for large data-intensive applications.
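The following sketch illustrates, in plain Java with illustrative names (it is not the HDFSoIB code itself), the buffer-management policy described above: one pre-allocated direct buffer per end-point, bounded by the HDFS packet-size, with end-points cached after first use.

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the connection/buffer management policy: a long-lived
// Connection context, one pre-allocated direct buffer per end-point bounded by
// the HDFS packet-size, and end-points created on demand and cached for reuse.
public class RdmaConnectionCache {
    static final int PACKET_SIZE = 512 * 1024;   // optimal RDMA packet-size

    // end-point id -> registered direct buffer (direct buffers allow zero-copy
    // hand-off to the native JNI/UCR layer)
    private final Map<String, ByteBuffer> endpoints = new HashMap<>();

    ByteBuffer getEndpoint(String dataNode) {
        return endpoints.computeIfAbsent(dataNode,
                dn -> ByteBuffer.allocateDirect(PACKET_SIZE));
    }

    void sendPacket(String dataNode, byte[] packet, int len) {
        ByteBuffer buf = getEndpoint(dataNode);  // cached after first use
        buf.clear();
        buf.put(packet, 0, len);
        buf.flip();
        // hand 'buf' to the native communication library via JNI (omitted)
    }
}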

3.1.4 Communication Flow using RDMA over InfiniBand

The hybrid design supports both RDMA- and socket-based communication. Figure 3.2 illustrates the communication flow in our new design. The RDMADFSClient can choose which interconnect/protocol to use for an operation. Two nodes must establish an end-point before they can communicate with each other. When the DFSClient wants to write a file, it first contacts the NameNode over IPC to get a block ID and the list of DataNodes for that block. The block is sent to the first DataNode by the client and then replicated to the subsequent DataNodes. In the hybrid design, the communication between the client and the first DataNode, as well as among the DataNodes, takes place over RDMA. Therefore, the RDMADFSClient first establishes an end-point with the first DataNode server in the pipeline and uses it for data transfer. On the DataNode server side, the RDMADataXceivers create RDMABlockXceivers that monitor their respective end-points for the arrival of data packets. Each block is replicated by the corresponding RDMAReplicator over RDMA. The RDMAResponseProcessors in the RDMADFSClient receive the acknowledgments on their corresponding end-points.

3.2 Experimental Results

In this section, we perform detailed profiling and evaluation of the proposed HDFSoIB design.

3.2.1 Experimental Setup

The experiments are performed on four different clusters. They are:

OSU RI Cluster A: This in-house cluster has nodes with dual quad-core Xeon processors operating at 2.67 GHz. Each node is equipped with 12 GB RAM and one 160 GB HDD. Each node also has an MT26428 QDR ConnectX HCA (32 Gbps) with a PCI-Ex Gen2 interface. The nodes are interconnected with a Mellanox QDR switch. Each node runs Red Hat Enterprise Linux Server release 6.1 (Santiago). The Lustre installation on this cluster has 12 TB capacity and is accessible through InfiniBand QDR.

OSU RI Cluster B: This cluster has nine nodes. Each node has dual quad-core Xeon processors operating at 2.67 GHz and is equipped with 24 GB RAM, two 1 TB HDDs, a single 300 GB OCZ VeloDrive PCIe SSD, and a 12 GB RAM Disk, running Red Hat Enterprise Linux Server release 6.1. Nodes in this cluster have MT26428 QDR ConnectX HCAs (32 Gbps data rate) and are interconnected with a Mellanox QDR switch.

OSU WCI: This cluster consists of 64 compute nodes with dual quad-core Intel Xeon processors operating at 2.33 GHz, 6 GB RAM, and a PCIe 1.1 interface. Each node is equipped with a ConnectX DDR IB HCA (16 Gbps data rate). Sixteen of the compute nodes have Chelsio T320 10GbE Dual Port Adapters with TCP Offload capabilities. Each node runs Red Hat Enterprise Linux Server release 5.5 (Tikanga), with kernel version 2.6.30.10 and OpenFabrics version 1.5.3. The IB cards on the nodes are interconnected using a 144-port Silverstorm IB DDR switch, while the 10 GigE cards are connected using a Fulcrum Focalpoint 10 GigE switch.

SDSC Gordon: The Gordon Compute Cluster at SDSC [32] is composed of 1,024 dual-socket compute nodes connected by a dual-rail, QDR InfiniBand, 3D torus network. Each compute node has two eight-core Intel EM64T Xeon E5 2.6 GHz (Sandy Bridge) processors, 64 GB of DRAM, and a 32 GB RAM Disk, with CentOS 6.4 as the operating system release. The cluster has a total of sixteen 300 GB Intel 710 solid state drives distributed among the compute nodes.

3.2.2 Optimal Packet-size for Different Interconnects/Protocols

In this set of experiments, we vary the HDFS packet-size from 64 KB to 1 MB and measure the file write times for different file sizes (from 1 GB to 5 GB) using the HDFS Write benchmark SWL, as described in Chapter 2.5. We perform these experiments for 1 GigE, IPoIB, and RDMA on OSU RI Cluster B and for 1 GigE, IPoIB, 10 GigE, and RDMA on the OSU WCI cluster. On both clusters, the experiments are run on four DataNodes. For a particular interconnect/protocol, the packet-size for which we obtain the smallest write latency for most of the file sizes is chosen as the optimal packet-size for the corresponding interconnect/protocol.
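For reference, the HDFS packet-size used in these experiments can be set through the client configuration; the property names below are the usual ones (dfs.client-write-packet-size in Hadoop 2.x, dfs.write.packet.size in older releases), and this snippet is only a sketch.

import org.apache.hadoop.conf.Configuration;

// Sketch of setting the HDFS packet-size before creating a FileSystem/DFSClient.
// Both the newer and older property names are set defensively; 512 KB matches
// the optimal RDMA value found in these experiments.
public class PacketSizeConfig {
    public static Configuration withPacketSize(int bytes) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.client-write-packet-size", bytes);
        conf.setInt("dfs.write.packet.size", bytes);
        return conf;
    }

    public static void main(String[] args) {
        Configuration conf = withPacketSize(512 * 1024);
        System.out.println(conf.get("dfs.client-write-packet-size"));
    }
}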

Figure 3.3 shows the optimal packet-sizes for different interconnects/protocols on OSU RI Cluster B. As can be observed from the figure, a 64 KB packet-size provides the minimum latency for most of the file sizes on the 1 GigE network. Therefore, 64 KB is chosen as the optimal packet-size for the 1 GigE network. Similarly, the optimal packet-sizes for IPoIB and RDMA are 128 KB and 512 KB, respectively. The optimal packet-size values determined for the different interconnects/protocols are listed in Table 3.1. In all these experiments, the HDFS block-size is kept fixed at 64 MB.

Table 3.1: Optimal Network Packet-Sizes for Different Interconnects/Protocols

                              OSU WCI                            OSU RI Cluster B
File Size (GB)   1GigE    10GigE   IPoIB    UCR-IB       1GigE    IPoIB    UCR-IB
1                64KB     64KB     128KB    512KB        64KB     128KB    512KB
2                64KB     64KB     128KB    512KB        64KB     128KB    512KB
3                64KB     64KB     128KB    512KB        64KB     128KB    512KB
4                64KB     128KB    128KB    512KB        64KB     128KB    512KB
5                64KB     64KB     128KB    256KB        64KB     128KB    512KB

The experiments are run using both HDD and SSD as the underlying storage medium in the DataNode servers. The optimal packet-sizes identified for HDD and SSD are the same. However, for HDD, the optimal packet-sizes primarily benefit the communication; the bottleneck remains in the I/O. Therefore, even though 64 KB, 128 KB, and 512 KB are the optimal packet-sizes for 1 GigE, IPoIB, and RDMA, respectively, we do not observe much variation in the file write times for different packet-sizes for a particular interconnect/protocol.

Figure 3.3: Optimal network packet size evaluation for different file sizes: (a) 1GigE; (b) IPoIB (32Gbps); (c) HDFSoIB (32Gbps)

Similar experiments are performed on the OSU WCI cluster, and the optimal packet-sizes for the different interconnects on this cluster are also listed in Table 3.1.

Table 3.2: Split-up Times for a Single Block Transmission in HDFS

                 1GigE     IPoIB (32Gbps)   HDFSoIB (32Gbps)
Communication    325 ms    77 ms            47 ms
Processing       143 ms    139 ms           112 ms
I/O              299 ms    131 ms           110 ms

3.2.3 Micro-benchmark Level Evaluations on Different Interconnects

In our study, we perform a comprehensive profiling to understand HDFS behavior and determine that, while writing a 64 MB block to HDFS, time is spent on three main parts: (1) Communication, (2) Processing, and (3) I/O. We measure the time taken by each of these parts for a 64 MB block write by placing timers inside the HDFS code. Table 3.2 shows these three times measured over different interconnects on OSU RI Cluster A using HDD. These numbers are measured using a single DataNode, with a replication factor of one. The use of RDMA over InfiniBand improves the communication time. Here, the I/O time is the total time required to flush all the data packets of the block to the file system. Since the packet-sizes used for IPoIB and RDMA are 128 KB and 512 KB, respectively, we perform aggregated I/O operations in these two cases compared to 1 GigE. As a result, each of these interconnects gains in terms of I/O time to some extent. After receiving each packet, HDFS performs some processing on it; as part of this processing, some Java functions are called on a per-packet basis. Since the number of packets is reduced in our design, we also gain in terms of processing time.

We design a micro-benchmark that measures and reports HDFS file write times. Using this micro-benchmark, we measure the latency of writing files ranging from 1 GB to 5 GB. We run these experiments on both the OSU WCI and OSU RI clusters. In each of these experiments, we use the optimal packet-sizes obtained in our previous experiments for the different interconnects.

Figure 3.4: Micro-benchmark evaluation in different clusters for HDD: (a) total file write times with HDD (OSU WCI with 4 DataNodes); (b) total file write times with HDD (OSU RI Cluster A with 32 DataNodes)

Figure 3.4(a) shows the file write times for different file sizes over 1 GigE, IPoIB, 10 GigE, and RDMA on OSU WCI. We observe that the RDMA-based design outperforms 10 GigE by 14% and IPoIB by 20% for the 5 GB file size. Figure 3.4(b) shows the file write times for different file sizes over 1 GigE, IPoIB, and RDMA on OSU RI Cluster A. As we observe from this figure, the RDMA-based design outperforms the socket-based design for all the file sizes, and the overall gain in terms of latency is 15% over IPoIB for the 5 GB file size. In this experiment, HDD is used as the underlying storage device in the DataNodes.

Figure 3.5(a) shows the file write times on OSU WCI using SSD. Due to the decrease in I/O times with SSD, the total file write times are lower than those in Figure 3.4(a). Our design achieves an improvement of 16% over 10 GigE and 25% over IPoIB for the 5 GB file size. Figure 3.5(b) shows the file write times for different file sizes over three different interconnects on OSU RI Cluster B. From this figure, we observe that our design improves the write latency by 12% over IPoIB for the 5 GB file size.

Figure 3.5: Micro-benchmark evaluation in different clusters for SSD: (a) total file write times with SSD (OSU WCI with 4 DataNodes); (b) total file write times with SSD (OSU RI Cluster B with 4 DataNodes)

3.2.4 HDFS Communication Time

In order to measure the communication time during HDFS Write, we design a benchmark that mimics the communication pattern in HDFS. In this benchmark, the client first transfers the data to the first DataNode in the pipeline. The data is then replicated to the subsequent DataNodes and, finally, the client gets back the acknowledgment. Figure 3.6 shows the times spent in communication during HDFS writes of different file sizes, measured over different interconnects on OSU RI Cluster A with 32 DataNodes using HDD. The new design gains an improvement of 30% over IPoIB and 56% over 10 GigE in terms of communication time. We also measure the communication times on OSU RI Cluster B with four SSD nodes and observe similar gains. This clearly demonstrates the capability of the native RDMA-based design to provide low-latency data transmission.

Figure 3.6: Communication times in HDFS on OSU RI

3.2.5 Evaluation with TestDFSIO

DFSIO is a file system MapReduce benchmark in Hadoop that measures the I/O performance of HDFS. In this benchmark, each map task opens one file for sequential read/write, and the Hadoop job measures the data I/O size and the execution time for writing a particular sized file [90]. There is only a single reduce task, which aggregates the results from the different map tasks running in this benchmark.

In our experiments, we run DFSIO write with file sizes ranging from 80 GB to 120 GB. We perform these experiments on 32 nodes on SDSC Gordon. Figure 3.7(a) shows the throughput results of DFSIO with SSD. Here we can see that our design outperforms IPoIB (32Gbps) by 28% for the 120 GB file size. Figure 3.7(b) shows the throughput results of DFSIO with multiple HDDs per node over three different interconnects on OSU RI Cluster B. These experiments are performed on four DataNodes. From this figure we observe that, as we switch from a single HDD to two HDDs per node, the throughput over IPoIB (32Gbps) increases by 66.5%, whereas for HDFSoIB the increase in throughput is 75.8%. Moreover, our design gains 24% over IPoIB (32Gbps) for a single HDD per node, whereas the gain is 31% for two HDDs per node. Therefore, the RDMA-based design (HDFSoIB) is able to achieve much better throughput than the socket-based design when the I/O bottlenecks are reduced.

Figure 3.7: TestDFSIO benchmark evaluation over different interconnects: (a) TestDFSIO with SSD (32 DataNodes on SDSC Gordon); (b) TestDFSIO with multiple HDDs per node (4 DataNodes on OSU RI Cluster B)

3.3 Benefit of RDMA-based HDFS in HBase

In this section, we evaluate the HBase Put operation with HDFS as the underlying file system. The most useful use-case for HBase with HDFS is bulk writes or updates. With a large amount of data insertion, HBase eventually flushes its in-memory data to HDFS. HBase has a MemStore which holds the in-memory modifications to any particular region of data. The MemStore triggers a flush operation when it reaches its size limit (the default is 64 MB). During the flush, HBase writes the MemStore to HDFS as an HFile instance. These HFiles are written per flush of the MemStore, and a compaction of all these store files is needed when the number of store files reaches a particular threshold (the default is three) for any one HStore. Although the MemStore flush to the HStore, the compaction of HStore files, and HBase updates to the MemStore all happen concurrently in different threads, high operational latency for HDFS Writes can affect the overall performance of the HBase Put operation. HBase also has an HLog instance which keeps flushing its log data to HDFS. These HLogs are the basis of HBase data replication and failure recovery, and thus must be kept in HDFS.

With the new design of HDFS, we conducted experiments to measure the overall Put latency in HBase. These experiments are run on OSU RI Cluster A on a QDR platform (32Gbps), and we use HDD as the DataNode storage for HDFS. A modified version of workload A (100% update) of YCSB (described in Chapter 2) has been used as our benchmark. Figure 3.8 shows these results for the HBase Put operation with a 1 KB message size and a single region server. For the insertion of 360 K records, an average latency of 204 µs is achieved, whereas with IPoIB the average latency is 252 µs. The throughput achieved by the RDMA-based design is 4.41 Kops/sec, whereas IPoIB obtains a throughput of 3.63 Kops/sec. Our HDFSoIB design obtains an overall performance improvement of around 20% in both average latency and throughput.

Figure 3.8: YCSB evaluation with a single region server: (a) Put latency; (b) Put throughput

In order to have an increased number of DFSClients in our experiments, we increase the number of region servers to 32 on OSU RI Cluster A. With 32 region servers, we observe similar performance improvements for the new HDFS design on the QDR platform (32 Gbps). For the insertion of 480 K records, each with a 1 KB message size, an average latency of 201 µs is achieved, which is 26% less than that of IPoIB (272 µs). The throughput for 480 K record insertion is 4.42 Kops/sec, which is also 24% higher than that of IPoIB (3.35 Kops/sec). Figure 3.9 illustrates these results.

Figure 3.9: YCSB evaluation with multiple (32) region servers: (a) Put latency; (b) Put throughput

3.4 Related Work

In the HPC field, RDMA has been used to speed up file systems' I/O performance. Wu et al. designed PVFS over InfiniBand to improve PVFS I/O performance [105]. Ouyang et al. designed an RDMA-based job migration framework that uses RDMA to improve the recovery of large-sized jobs [62]. Yu et al. proposed an RDMA-capable storage protocol over wide area networks to speed up NFS performance [109]. These studies illustrate that RDMA can benefit traditional distributed and parallel file systems. We share similar objectives with these research directions and investigate the benefit of RDMA in the Hadoop environment.

An RDMA-based high-performance design of HBase over InfiniBand was presented in [39], where the HBase data-query performance was improved by optimizing the communication cost using RDMA. The memory-block semantics supported by RDMA-capable networks were mapped to the object transmission primitives used in HBase. The study by Wang et al. [103] reveals that the Merge operation in MapReduce can be accelerated by exploiting the benefits of RDMA in such a manner that the data does not have to be copied to disk. They have also developed a shuffle-merge-reduce pipeline that works in conjunction with the RDMA-based merge.

In [90], the authors examined the impact of high-speed interconnects such as 1GigE, 10GigE, and InfiniBand (IB), using protocols such as TCP, IP-over-IB (IPoIB), and Sockets Direct Protocol (SDP), on HDFS performance. Their findings also revealed that these faster interconnects have a larger impact on performance if SSDs are used in the DataNodes, as this reduces the local I/O costs. The findings in this work have led us to further investigate the fundamental communication overheads in native HDFS and redesign it to leverage advanced features offered by InfiniBand, such as RDMA.

3.5 Summary

In this work, we have proposed a hybrid design for HDFS that incorporates communication over conventional sockets as well as RDMA over InfiniBand. The new design is able to provide low latency and high throughput for HDFS Write operations, as it leverages the RDMA capability of high performance networks like InfiniBand. Our design achieves a 30% gain in communication time and increases HDFS write throughput by 28% over IPoIB (32Gbps). For HBase, the performance of the Put operation is improved by 26% with the proposed design.

Chapter 4: Maximizing Overlapping in RDMA-Enhanced HDFS

The DFSClient sends data to the DataNode in a pipelined manner. As a result, there is a lot of potential to exploit overlapping among the different phases of operations within a single HDFS block write. This is particularly true for the RDMA-Enhanced design of HDFS, where data packets arrive very quickly at the DataNode side. Thus, waiting for the completion of the I/O of the previous packet before receiving the next packet of the same block is extremely inefficient and fails to leverage the full benefits offered by RDMA. Considering all these aspects, we propose SOR-HDFS, a SEDA-based approach to maximize overlapping in HDFS over RDMA. Figure 4.1 depicts the differences between our proposed design and the existing ones. In the default architecture, the different stages of the HDFS write operation happen sequentially for the packets belonging to a block, whereas in the proposed architecture, these phases can overlap with each other.

This is achieved by assigning the task of each of the stages to a dedicated thread pool, as is done in a Staged Event-Driven Architecture (SEDA) [104]. SEDA is one of the most commonly used design approaches for Internet services in the distributed computing area and has been considered a high-throughput software architecture. It decomposes a complex processing logic into a set of stages connected by queues. For each stage, a dedicated thread pool is in charge of processing the events on the corresponding queue. By performing admission control on each event queue, the whole system with multiple thread pools can maximally overlap the various processing stages. The HDFS Write operation also consists of several stages, as demonstrated in Figure 4.1(a). However, the sequential execution of these stages prevents HDFS from fully utilizing the hardware capabilities (such as RDMA) and obtaining peak performance. These issues lead us to the following broad questions and design challenges:

Figure 4.1: Default vs. Proposed HDFS Architecture: (a) sequential data processing in default HDFS (One-Block-One-Thread Architecture); (b) overlapped data processing in proposed HDFS (Staged Event-Driven Architecture)

1. How can we re-design HDFS to exploit maximum overlapping among the different stages of the HDFS Write operation?

2. Can we adopt the high-throughput SEDA-based approach for achieving maximum overlapping in HDFS?

3. How much performance improvement can we obtain while maintaining the correct processing of data with the SEDA-based approach using RDMA for communication?

4. Can we guarantee performance benefits across different HPC clusters with different configurations using the proposed design?

4.1 Proposed Design

In this section, we describe the architecture and design details of SOR-HDFS over InfiniBand.

4.1.1 Architectural Overview

The primary goal of SOR-HDFS is to maximize overlapping in HDFS write with a SEDA-based architecture. A stage is the fundamental unit of a SEDA-based architecture. In our design, we have divided the operations of HDFS write and replication into four stages: (1) Read, (2) Packet Processing, (3) Replication, and (4) I/O.

Figure 4.2(a) depicts a high-level overview of the SOR-HDFS architecture with pipelined replication support. In this design, after data is received via RDMA from the JNI layer, it is first read into a Java I/O stream. The received packet is then replicated after some processing operations. The data packet is also written to the disk file. In the default architecture of HDFS, all these stages are handled sequentially by a single thread per block, whereas in the proposed design of SOR-HDFS, each of the stages is handled by a different thread pool; thus the operations of different stages can also overlap at the packet level. In this way, SOR-HDFS can achieve task-level parallelism as well as overlapping within blocks.

Figure 4.2: Architectural Overview of SOR-HDFS: (a) Pipelined Replication; (b) Parallel Replication

4.1.2 Design of SOR-HDFS

In this section, we discuss the design of different stages of SOR-HDFS in detail.

Read Stage: Figure 4.3 shows the architecture of the Read stage. This stage consists of an RDMA-based receiver that receives data in an RDMA Connection object. The Connection object can support multiple end-points, and each DFSClient connects to one of the end-points. This stage also has a pool of buffers that can be used for data received at any end-point. In the HDFSoIB design, we had a single buffer per end-point, which is well-suited to the default architecture: data packets belonging to a block go through the different phases sequentially, as in Figure 4.1(a), and by the time the receiver goes to read the next packet, the buffer is free and can be reused. With the SEDA-based architecture, however, while one packet is in the I/O phase, many other packets may arrive at the receiver from the same DFSClient, which keeps sending packets back to back. Thus, having a common pool of buffers is more reasonable for robust handling of data packets in SOR-HDFS.

The buffer pool is a set of Java Direct Buffers used to obtain data directly from the JNI layer. These buffers are registered when the DataNode starts up and are used in a Round-Robin manner. Each buffer has an associated flag bit to indicate whether the buffer is free or not. In the JNI layer, when data comes to an end-point, it is assigned a free buffer. The RDMA-based receiver thread in the Java layer receives data using a non-blocking receive() and can thus acquire data from multiple end-points at the same time. After each invocation of receive(), the receiver thread returns the buffer pointer of each received packet to the Packet Processing stage.

Figure 4.3: Read stage

Packet Processing Stage: Figure 4.4 depicts the architecture of the Packet Processing stage. The Read stage returns a tuple for each packet and passes it to the Packet Processing stage. This stage has a pool of worker threads that wait on a Process Request Queue (PRQ). The DFSClient sends two types of packets to the DataNode for each block: the header and the data packets. The Process Request Controller (PRC) is responsible for choosing free worker threads from the pool and assigning them to the incoming packets. When a worker thread gets the header for a block, the PRC assigns it to that particular block and passes all the subsequent data packets belonging to this block to that particular worker thread. In this way, the sequence of packets within a block is maintained by the PRC. Packet processing includes reading the different fields of the data packet, verification of the packet sequence number, and checksum validation.

Each worker thread puts the data packets into the queues for the Replication and I/O stages after the processing is complete.

Figure 4.4: Packet Processing Stage

Replication Stage: Figure 4.5(a) shows the components of the RDMA-based Replication stage of our design. This stage consists of a pool of replicator threads that replicate the data packets to the downstream DataNodes via an RDMA Connection object. The RDMA Connection object supports multiple end-points; thus, data can be replicated to any of the DataNodes using this single Connection object. The pool of replicator threads waits on a Replication Request Queue (RRQ). The Replication Request Controller (RRC) finds a free thread from the pool and assigns it to replicate a block. The RRC preserves the sequence of packets belonging to a particular block by assigning them to the appropriate replicator threads. The replicator sends out the packets over RDMA in a non-blocking manner. This stage also has a Responder thread that receives the acknowledgments coming to the different end-points of the RDMA Connection object.

Figure 4.5: Different stages of SOR-HDFS: (a) RDMA-based Replication stage; (b) I/O stage

I/O Stage: Figure 4.5(b) shows the architecture of the I/O stage of SOR-HDFS. This stage consists of a pool of I/O threads. The worker threads in the Packet Processing stage aggregate multiple data packets into the aggregation cache and put an I/O request in the I/O Request Queue (IRQ). The I/O Request Controller (IRC) assigns the incoming requests to the appropriate I/O threads such that data is written sequentially. The I/O threads flush the aggregated data from the cache to the disk file. The I/O requests contain the pointer to the cache location that needs to be flushed as well as the pointer to the disk file to which the aggregated data should be flushed. This type of aggregated write helps reduce I/O bottlenecks.

The number of threads in each stage is tuned to maximize the utilization of the corresponding system resource. A generic sketch of such a stage is shown below.
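The sketch below shows one such stage in generic Java terms (a bounded event queue, a dedicated thread pool, and a hand-off to the next stage's queue); it is a conceptual illustration of the SEDA pattern used here, not the SOR-HDFS implementation, and all names are illustrative.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Generic sketch of one SEDA stage: a bounded event queue, a dedicated thread
// pool, and a hand-off to the next stage's queue (e.g., PRQ -> RRQ/IRQ).
public class Stage<E> {
    private final BlockingQueue<E> requests;
    private final ExecutorService workers;
    private final BlockingQueue<E> nextStage;   // null for the last stage

    public Stage(int queueCapacity, int threads, BlockingQueue<E> nextStage) {
        this.requests = new ArrayBlockingQueue<>(queueCapacity);
        this.workers = Executors.newFixedThreadPool(threads);
        this.nextStage = nextStage;
        for (int i = 0; i < threads; i++) {
            workers.submit(this::runWorker);
        }
    }

    // Admission control: a full queue blocks the producer of the previous stage.
    public void submit(E event) throws InterruptedException {
        requests.put(event);
    }

    private void runWorker() {
        try {
            while (true) {
                E event = requests.take();        // wait for work
                process(event);                   // stage-specific handling
                if (nextStage != null) nextStage.put(event);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    protected void process(E event) { /* e.g., checksum validation */ }
}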

Design of DFSClient: In the pipelined replication technique, the DFSClient sends data to the first DataNode, and packets are sent back to back over RDMA. This means that the DFSClient does not wait for the acknowledgment of the previous packet before it sends the next packet. To efficiently handle this type of pipelined data transfer, the SOR-HDFS design has a single RDMA Connection with multi-end-point support. The DFSClient can dynamically create an end-point with respect to a DataNode. The client side also maintains a pool of buffers (Java Direct Buffers) that are used in a Round-Robin manner to send data in a non-blocking fashion. In our previous design of RDMA-Enhanced HDFS, there was a single buffer associated with each end-point. Thus, even if there are data packets available in the DataQueue, the DFSClient has to wait until this buffer is free before it can send the next packet. In the SOR-HDFS design, we have minimized this wait time by managing a pool of buffers on the client side. The RDMADataStreamer can send packets back-to-back to keep the pipeline full and thus make efficient utilization of the network bandwidth. In this design, we have also decoupled the buffers from the end-points. Thus, any buffer can be used to send data to any of the end-points created by the DFSClient.

4.1.3 Overlapping among Different Phases

4.1.3.1 Data Read, Processing, Replication, and I/O

In SOR-HDFS, each stage of HDFS write is managed by a separate unit. Therefore, when the worker threads in the Packet Processing stage are busy processing one packet, the RDMA-based receiver thread in the Read stage is free to receive the subsequent packets of the same block. The receiver can also receive packets coming from any client in parallel using the single RDMA Connection object. Again, the worker threads can process subsequent packets while the replicator and I/O threads are busy replicating or flushing the packets received earlier. In this way, in addition to providing task-level parallelism, SOR-HDFS achieves overlapping among the various stages of the same block.

4.1.3.2 Communication and Data Write

The DFSClient writes data to the DataNodes in a pipelined manner. The client keeps sending data packets back to back to the first DataNode in the pipeline. The first DataNode also replicates the data in a pipelined manner to the subsequent DataNodes. Thus, even in the default architecture of both the socket-based and RDMA-based designs of HDFS, the Packet Processing, Replication, and I/O phases are overlapped with the data transfer from the client or upstream DataNodes. The proposed design also preserves this overlapping between communication and the other phases of HDFS write, because data is sent in a pipelined manner from the DFSClient using non-blocking send() and the DataNode side also provides a pool of buffers to hold the incoming data packets. Moreover, the use of RDMA-based communication makes the data transfer much faster than in the default socket-based design.

4.1.4 SOR-HDFS with Parallel Replication

Figure 4.2(b) shows the design of SOR-HDFS with parallel replication technique. The main difference between Figure 4.2(b) and Figure 4.2(a) is in the replication phase. In the parallel replication technique, the DFSClient is responsible for writing all the replicas in parallel. Usually, in MapReduce jobs, a DFSClient writes data to the local DataNode

first and then it is replicated to the subsequent DataNodes. Thus, in the common case, the DFSClient creates a single end-point and sends all the data through it. But in the parallel replication technique, the DFSClient establishes three (default replication factor is three) different end-points to write three replicas. In this replication scheme, overlapping is achieved among Read, Packet Processing and I/O phases in the DataNode side. As illustrated in Figure 4.2(b), after the RDMA-based receiver receives data in the Read Phase, the data is given to the worker threads in the Packet Processing phase. The worker threads process the packets and store them in the aggregation cache. Then the I/O threads are notified by the worker threads. The I/O threads then write the aggregated data to disk.

Though the Replication stage is absent on the DataNode side, packet replication is actually overlapped with the other phases, because, while the DataNode performs processing and I/O of a packet, the packet is being replicated by the DFSClient in parallel. Thus the parallel replication scheme also achieves overlapping among the Read, Packet Processing, Replication, and I/O phases.
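A rough sketch of the client side of this scheme is given below: the client holds one end-point per target DataNode and pushes the same block to all of them concurrently. The Endpoint interface and sendBlock() are stand-ins for the actual RDMA write path and are not part of the real API.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative client-driven parallel replication: one end-point per replica,
// all replicas written concurrently instead of through a pipeline.
class ParallelReplicationClient {
    interface Endpoint { void sendBlock(byte[] block) throws Exception; }

    // Default replication factor is three, so three sender tasks run at once.
    private final ExecutorService senders = Executors.newFixedThreadPool(3);

    // Returns only when every replica has been written, which is when the
    // write can be acknowledged to the application.
    void writeReplicas(byte[] block, List<Endpoint> endpoints) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        for (Endpoint ep : endpoints) {
            pending.add(senders.submit(() -> { ep.sendBlock(block); return null; }));
        }
        for (Future<?> f : pending) {
            f.get();   // surfaces any replication failure to the caller
        }
    }
}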

4.2 Experimental Results

In this section, we present the detailed performance evaluations of SOR-HDFS design

and compare the performance with that of the default architecture over various intercon-

nects, protocols and storage systems.

4.2.1 Experimental Setup

The experiments are performed on three different HPC clusters. They are:

OSU RI Cluster B: The cluster configurations are described in Chapter 3.2.

TACC Stampede: Each node on TACC Stampede [85] is a dual-socket system containing two octa-core Intel Sandy Bridge (E5-2680) processors, running at 2.70GHz. It has 32 GB of memory, a SE10P (B0-KNC) co-processor, and a Mellanox IB FDR MT4099 HCA

(56 Gbps data rate). The host processors are running CentOS release 6.3 (Final). Each node is equipped with a single 80 GB HDD and 16 GB RAM Disk.

SDSC Gordon: The cluster configurations are described in Chapter 3.2.

Figure 4.6: Parameter tuning for SOR-HDFS threads. (a) Network bandwidth for different numbers of processes; (b) tuning the number of worker threads in the Packet Processing stage.

4.2.2 Parameter Tuning of SOR-HDFS Threads

Selecting the Number of RDMA-based Receiver Threads: In our new design, we have used a single RDMA Connection that supports multiple end-points for different numbers of concurrent maps/clients. The DataNode can simultaneously receive data from multiple clients using this Connection object. Therefore, we have used a single receiver thread that receives data from different clients on different end-points. Since RDMA connection creation is more expensive than socket creation, the use of a single connection, i.e., a single receiver thread, minimizes the connection creation overhead.

Figure 4.7: Disk write throughput in different clusters for varying numbers of writer threads and record sizes. (a) OSU RI Cluster B; (b) SDSC Gordon; (c) TACC Stampede.

Selecting the Number of RDMA-based Replicator Threads: In the Replication stage of the proposed architecture, we use a pool of replicator threads in each DataNode. These threads replicate data to the downstream DataNodes. Thus, the number of replicator threads is related to the NIC bandwidth of the DataNodes. In order to find out what might be an optimal value of the number of replicator threads, we ran the OSU-Multi-Pair-Bandwidth

(inter-node) test over IB on different clusters. Figure 4.6(a) shows the IB bandwidth between a pair of nodes for 512KB (HDFS packet size in SOR-HDFS) message size on three

Table 4.1: Packet rate and communication time

                       IPoIB          SOR-HDFS
Communication Time     77 ms          37 ms
Packet Rate            360 pkt/sec    460 pkt/sec

different clusters for different number of concurrent processes per node. From this fig-

ure, we observe that four concurrent processes can utilize the network bandwidth most efficiently on OSU RI Cluster B and SDSC Gordon. That is why we have used four repli-

cator threads in our experiments on these clusters. We performed similar tests on TACC

Stampede and found that the optimal number of replicator threads is eight in this cluster.

Selecting the Number of I/O Threads: The I/O threads in the I/O stage of our design

flush the cached data to disk. Therefore, the number of I/O threads is determined by

the maximum disk bandwidth as well as the amount of aggregated data. In order to find out

the optimal number of I/O threads on different clusters, we ran write tests with IOzone for

varying number of concurrent writers and record sizes. In these tests, each writer writes a

64MB file to disk. Figure 4.7 shows the results of our IOzone tests on different clusters.

The Z-axis indicates the write throughput for a particular (number of writer threads, record size) tuple. Based on these results, for OSU RI Cluster B (HDD nodes), we have selected two

I/O threads and an aggregated data size of 2 MB for SOR-HDFS. The optimal number of

I/O threads on TACC Stampede is one, whereas, on SDSC Gordon we can use up to eight

I/O threads, because the storage in SDSC Gordon is SSD.

Selecting the Number of Worker Threads: In our design, the worker threads are respon- sible for processing of the received packets. These threads perform processing operations on the data packets received from the clients running on the local DataNode as well as those from the replicator threads of the other upstream DataNodes. Thus, the number of these

threads is closely related to the number of concurrent clients running on the DataNode and also the number of threads replicating to this DataNode. In our experiments, we tuned the number of worker threads while keeping the other thread-counts fixed (as determined from the above-mentioned experiments) and found that in all the three clusters, 16 is the optimal number. Figure 4.6(b) shows the results of our tuning.

Figure 4.8: Breakdown of the time for a 64MB block write (OSU RI Cluster B). (a) Sequential operations in the default architecture; (b) overlapped stages in SOR-HDFS; (c) impact of I/O aggregation on execution time.

4.2.3 Performance Analysis of SOR-HDFS

SOR-HDFS not only incorporates RDMA-based communication but also maximizes overlapping among different phases of HDFS write operations. In this section, we perform comprehensive profiling in order to analyze the performance of SOR-HDFS.

Overlapping Efficiency: Figure 4.8 depicts the duration of the overlapped stages for a

64MB HDFS block write compared to the sequential execution in the default architecture.

As demonstrated in the figure, the default architecture requires 323.2ms to complete the operations of various stages, whereas the time spent in SOR-HDFS is only 207.5ms. Thus,

SOR-HDFS achieves much higher pipelined efficiency via overlapping.

Figure 4.9: Performance analysis of SOR-HDFS (OSU RI Cluster B). (a) Impact of parallel replication on execution time per map; (b) impact of parallel replication on TestDFSIO execution time.

RDMA-based Communication: The overlapping efficiency of SOR-HDFS is increased by the RDMA-based communication. As shown in the previous chapter for HDFSoIB, RDMA can reduce the HDFS communication time by 30%. In our new design, we have enhanced the communication layer with non-blocking semantics that increase the utilization of the network bandwidth. Thus data comes to the DataNode at a much faster rate, enabling maximum overlapping among the various stages. As shown in Table 4.1, the communication time for a 64MB block is 77ms over IPoIB (32Gbps), whereas, for RDMA (32Gbps), it is

37ms. We obtained these numbers by designing two benchmarks that mimic the HDFS communication pattern, one over sockets and one over the underlying native library for RDMA communication. We also profiled the rate at which packets arrive at the DataNode in the HDFS layer and found that the packet rate for IPoIB (32Gbps) is 360 pkt/sec and for SOR-HDFS (32Gbps), it is 460 pkt/sec. Thus data packets arrive at a much faster rate in SOR-HDFS.

Impact of I/O Aggregation: I/O aggregation is a technique for reducing the disk contention among concurrent writers by writing large chunks of data instead of many smaller pieces. In [81], the authors discuss that writing data at the granularity of packets by concurrent clients introduces huge contention at the disk level and degrades the I/O bandwidth.

Thus aggregating the data packets into a larger chunk can enhance the I/O performance by reducing contention.

Figure 4.8(c) illustrates the benefits of I/O aggregation in SOR-HDFS. This figure shows that I/O aggregation can reduce the average execution time per map (this is a TestDF-

SIO job with a total of 16 maps, 4 map slots per node) by 10%.

Figure 4.10: Write time evaluation using the HDFS microbenchmark (SWL) on OSU RI Cluster B with 4 DataNodes. (a) HDD; (b) SSD.

4.2.4 Evaluation with Parallel Replication

Figure 4.9(a) shows the evaluation results of parallel replication with SOR-HDFS. From this figure we observe that parallel replication can reduce the average execution time per map compared to that of pipelined replication. Figure 4.9(b) shows the execution times of the TestDFSIO benchmark on OSU RI Cluster B. This test is performed on four SSD DataNodes and we varied the file size from 5 GB to 20 GB. This figure shows that parallel replication can enhance the performance of the TestDFSIO benchmark by up to 10% in terms of execution time. Though parallel replication can offer additional benefits for numerous HDFS applications, the pipelined replication technique is widely used in data center environments. Therefore, in the next section, we will present results of SOR-HDFS with pipelined replication support and compare them with those of other interconnects and protocols with the default architecture.

Figure 4.11: Write time evaluation using the HDFS microbenchmark (SWL). (a) SDSC Gordon with 8 DataNodes; (b) TACC Stampede with 16 DataNodes.

4.2.5 Evaluation using HDFS Microbenchmark

In this section, we have evaluated our design using the HDFS microbenchmark of Sequential Write Latency (SWL) [45]. This benchmark writes a file to HDFS and reports the time taken to complete the file write. Figure 4.10 shows the performance of our design on OSU RI Cluster B. On HDD DataNodes, SOR-HDFS reduces the latency of the SWL benchmark by 25% over IPoIB (32Gbps) and 12% compared to the previous best RDMA-based design of HDFS for 20 GB file size. On the SSD platform, SOR-HDFS achieves further performance improvement by reducing the latency by 34% and 18% over IPoIB (32Gbps) and HDFSoIB, respectively.

We have also performed microbenchmark level evaluations on SDSC Gordon and TACC

Stampede. On both of these clusters, we varied the data size from 8GB to 32GB. We used a cluster size of 8 on SDSC Gordon and 16 on TACC Stampede. On SDSC Gordon, our design improves the performance of SWL by 22% over IPoIB (32Gbps) and 12% over HDFSoIB (32Gbps). On TACC Stampede, our design shows a benefit of up to 57% over IPoIB (56Gbps) and 33% compared to HDFSoIB (56Gbps). The overlapping among different stages of data write as well as RDMA-based communication leads to this large performance benefit for SOR-HDFS.

Figure 4.12: Enhanced DFSIO throughput evaluation of Intel HiBench. (a) OSU RI Cluster B; (b) SDSC Gordon; (c) TACC Stampede.

4.2.6 Evaluation using Enhanced DFSIO of Intel HiBench

In this set of experiments, we evaluate our design with the Enhanced DFSIO benchmark of Intel HiBench on three different clusters. On OSU RI Cluster B, we perform the Enhanced DFSIO test on four SSD DataNodes and vary the file size from 8 to 32 GB. In these experiments, each writer (map) writes 1 GB of data. Thus for 32 GB data size, we have used

32 writers. As observed from Figure 4.12(a), SOR-HDFS increases the aggregated write throughput of Enhanced DFSIO by 37% over IPoIB (32Gbps) and 17% over HDFSoIB

(32Gbps) in OSU RI Cluster B.

We have performed Enhanced DFSIO test on SDSC Gordon and TACC Stampede also.

In both of these clusters, we varied the data size with varying cluster sizes. Figure 4.12(b) shows the aggregated write throughput for three different cluster sizes on SDSC Gordon. In this experiment, we vary the cluster size from 8 to 32 DataNodes and the data size is varied from 32 GB (on 8 nodes) to 128 GB (on 32 nodes). SOR-HDFS achieves an improvement

of up to 47% over IPoIB (32Gbps) and 20% over HDFSoIB (32Gbps) in aggregated write throughput.

Figure 4.12(c) shows the aggregated write throughput for three different cluster sizes on TACC Stampede. In this experiment, we vary the cluster size from 16 to 64 DataNodes and the data size is varied from 64 GB (on 16 nodes) to 256 GB (on 64 nodes). Each writer writes 1 GB data. Thus for 256 GB data we ran a total of 256 writers. In this experiment,

SOR-HDFS achieves an improvement of up to 60% over IPoIB (56Gbps) and 30% over

HDFSoIB (56Gbps) in aggregated write throughput. These gains are achieved because of the SEDA-based architecture, which is particularly designed to maximize throughput.

4.2.7 Evaluation using HBase

The Hadoop database, HBase, uses HDFS as the underlying file system. Therefore, HDFS performance also influences the performance of HBase operations. In our new design, we have introduced overlapping among different stages of the HDFS write operation. Therefore, we evaluate the performance of the HBase Put operation by running HBase on top of SOR-

HDFS. For this experiment, we have used the YCSB benchmark. The most appropriate case for evaluating HBase on top of HDFS is the bulk Put operation. With a large amount of data insertion, HBase eventually flushes its in-memory data to HDFS. HBase has a MemStore which holds the newly inserted data in memory. The MemStore triggers a flush operation when it reaches its size limit (default is 64 MB). During the flush, HBase writes the MemStore to HDFS as an HFile instance. HBase also stores the HLogs in HDFS. Therefore, we load the data from workload A of YCSB on different clusters. The load operation of YCSB inserts the specified number of records into HBase using the Put operation.

On OSU RI Cluster B, we run the YCSB load experiment on four DataNodes. We vary the number of records from 1M to 4M and the size of each record is 1KB. Each record has

Figure 4.13: Evaluation of HBase Put throughput. (a) OSU RI Cluster B; (b) SDSC Gordon; (c) TACC Stampede.

10 fields and the size of each field is 100 Bytes. On OSU RI Cluster B, our design achieves an improvement of 23% over IPoIB (32Gbps) and 9% over HDFSoIB (32Gbps) in HBase

Put throughput. Figure 4.13(a) shows the results of our evaluations.

We perform tests with the YCSB benchmark on SDSC Gordon and TACC Stampede.

On both these clusters, we vary the number of records from 8M to 32M. On SDSC Gor- don, we use 16 DataNodes and on TACC Stampede, the test is run on 32 DataNodes. As shown in Figure 4.13(b), our design achieves a performance gain of up to 25% over IPoIB

(32Gbps) and 14% over HDFSoIB (32Gbps) on SDSC Gordon. The design can also improve the performance of the HBase Put operation by up to 29% over HDFSoIB (56Gbps) on

TACC Stampede, as depicted in Figure 4.13(c).

Since both the data and the logs are flushed from HBase to HDFS, the overlapping between different stages of SOR-HDFS as well as the RDMA-based data transfer helps the HBase Put operation complete faster. Also, since multiple threads (data and logs) from HBase invoke HDFS write on the same DataNode, the I/O aggregation technique and careful tuning of the number of threads in different stages boost HBase Put performance.

4.3 Related Work

Design and optimization of the Hadoop Distributed File System has been an active topic in both academia and industry. Two categories of related work are summarized in this section.

Optimizations in default HDFS architecture: The default HDFS architecture has al- ready been analyzed for its drawbacks and limitations in many prior works. For instance,

Shvachko [82] has studied the correlation between the HDFS design goals and the feasibility of achieving them with the current system architecture. This work has revealed the factors that limit growth and provided several conclusions that highlight the dependencies between the components in HDFS that cause them. By analyzing the performance of HDFS, Shafer et al. [81] identified three critical issues: architectural bottlenecks that cause delays in scheduling new tasks, portability limitations posed by the Java implementation, and assumptions about storage management. Their work offers some suggestions for potential improvements to the HDFS architecture that can help achieve a better balance between performance and portability. Other research works dealt with improving the performance of HDFS by exploring the interoperability between HDFS and existing POSIX-compliant parallel file systems such as PVFS [92], Ceph [58], GPFS [33], GFarm [59], etc. These studies either provide interfaces that the Hadoop environment can use to replace HDFS with the corresponding file system, or propose HDFS-specific optimizations in their file systems that make them compatible for use in the Hadoop ecosystem. However, none of these works attempts to optimize the internal HDFS design to exploit maximum overlapping among different phases of write operations. In this work, we propose a SEDA-based approach to redesign the HDFS architecture and we have shown promising results.

Optimizing Hadoop components by RDMA: In HPC, RDMA has been widely used to speed up the communication performance of parallel file systems. Wu et al. [105] designed PVFS over InfiniBand to improve its performance. Ouyang et al. [62] designed an RDMA-based job migration framework to improve the recovery of large-sized jobs. From these studies, we have seen that RDMA can benefit the traditional parallel file systems. One of our earlier works [90] examined the impact of high-speed interconnects such as 10 GigE and InfiniBand

(IB), using protocols such as TCP, IP-over-IB (IPoIB), and Sockets Direct Protocol (SDP), on HDFS performance. Our findings also revealed that these faster interconnects yield a larger improvement in HDFS performance. We further proposed the RDMA-Enhanced

HDFS design in Chapter 3 to improve HDFS performance. But this design kept the default architecture intact. Some recent studies [39, 57, 72, 103] have also shown that other

Hadoop components, like MapReduce, HBase, RPC, etc., can be improved significantly by leveraging RDMA.

4.4 Summary

In this work, we proposed SOR-HDFS, a new HDFS design that exploits maximum overlapping among different phases of the HDFS write operation by leveraging the event-driven principles of the SEDA architecture. In this design, we enhance HDFS write performance by overlapping different stages of data transfer and I/O. Performance evaluations show that the new design improves the write throughput of the TestDFSIO benchmark by up to 64% over IPoIB and 30% over the previous best RDMA-enhanced design of HDFS

(HDFSoIB). Our design can also achieve a performance gain of up to 64% over IPoIB and

30% over HDFSoIB for the Enhanced DFSIO benchmark of Intel HiBench. SOR-HDFS also improves HBase Put operation throughput by up to 53% over IPoIB and 29% compared to HDFSoIB.

Chapter 5: Hybrid HDFS with Heterogeneous Storage and Advanced Data Placement Policies

For data-intensive applications, the biggest bottleneck of HDFS lies in the large number of I/O operations. Consequently, in-memory I/O for Hadoop has become more and more popular in recent times [35]. But memory is a scarce resource. As a result, for data-intensive applications, the size of the hot data itself may exceed the total amount of available memory in the cluster [37]. Additionally, to store data in memory, an application must dedicate sufficient memory, which may not always be feasible or desirable. Also, little has been discussed in the literature about how to ensure persistence and reliability of the in-memory data. HPC clusters are usually equipped with node-local heterogeneous storage devices like RAM Disk, SSD, HDD, etc., as well as shared parallel file systems like Lustre. But HDFS cannot efficiently utilize the available storage devices. The limitation comes from the existing data placement policies and their ignorance of data usage patterns. Typically, heterogeneous storage media can be utilized in one of two ways: hierarchical or flat. However, most of the research on HDFS is based on the hierarchical deployment [68, 69]. The performance potential of a hybrid (hierarchical with flat) architecture is yet to be investigated in the literature. Moreover, due to the tri-replicated data blocks,

HDFS requires 3x the amount of local storage space as the data size (x), which makes the deployment of HDFS challenging on HPC platforms.

This thesis addresses these problems with a hybrid HDFS design which we call Triple-H (A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture). In order to reduce the I/O bottlenecks of HDFS, Triple-H takes advantage of RAM Disk and SSD-based data buffering and caching. Triple-H incorporates advanced data placement policies to efficiently utilize the available storage media in the presence of heterogeneous storage systems on HPC clusters. Triple-H further reduces the local storage requirements through an integrated design with Lustre. The following sections describe these proposed solutions.

5.1 Proposed Architecture and Design

In this section, we propose the architecture and design details of Triple-H.

5.1.1 Design Considerations

Buffering/caching in memory is expensive. In-memory caching may not always improve performance if the working set of the application is larger than the total available memory on the cluster. Also, providing persistence to the memory-resident data is challenging. Moreover, using additional memory for caching may affect the performance if the application itself requires a large amount of memory. In our design, although we use RAM Disk as a primary-level buffer-cache, we enlarge the size of the buffer-cache by using SSD. We also ensure fault tolerance of the data in RAM Disk through SSD-based staging. However, using SSD for storing all the data in a long-running Hadoop cluster is also expensive [37]. Therefore, HDDs are used for storing the data that are less critical to job performance. Furthermore, as discussed in Chapter 1, deploying HDFS on HPC systems is difficult as it requires a huge amount of local storage space. Therefore, in this study, we propose an integrated design of HDFS with Lustre that can significantly reduce the local storage requirement on HPC systems.

Figure 5.1: Architecture and design of Triple-H. (a) Deployment; (b) architecture of Triple-H; (c) Lustre-based Storage Module.

5.1.2 Architecture

We assume that each node on the cluster is equipped with different types of storage devices like RAM Disk, SSD, and HDD; the nodes are homogeneous in storage characteristics. Figure 5.1(a) depicts the proposed deployment for Triple-H. In our design, each

HDFS DataNode runs a RAM Disk and SSD-based buffer-cache on top of the disk-based storage system. Also, our design uses the parallel file system (i.e. Lustre) installation in

HPC clusters for storing data from Hadoop applications. Triple-H can run in two different modes: Default (HHH) and Lustre-Integrated (HHH-L). If Lustre is available in the environment, the Default mode uses it as the storage for cold data; otherwise it runs as a traditional HDFS deployment. In both of these modes, data placement follows a hybrid storage usage pattern, whereas data caching goes through the storage hierarchy (Lustre to HDD, HDD to SSD, and SSD to RAM Disk). The cold data is also evicted along the opposite direction of the hierarchy.

Figure 5.1(b) depicts the architecture of our design. The major components of Triple-H are:

Placement Policy Selector: The Placement Policy Selector resides on the DFSClient side. It

assigns weights to different files according to the policies proposed in Section 5.1.3.

Data Placement Policies: The data placement policies define the rules to efficiently uti-

lize the different types of storage available in HPC systems. We discuss the policies in

Section 5.1.3 in detail.

Data Placement Engine: The Data Placement Engine chooses the appropriate storage

volume based on the weights generated by the DFSClient as well as the availability of

the storage capacity and writes the data in the target volume. It can also detect data that

are not performance-critical (e.g. data coming to the DataNode as a result of a balancing

operation) and stores them to HDD or Lustre. To integrate HDFS with Lustre, there is a

separate module in our design that is shown in Figure 5.1(c).

5.1.3 Data Placement Policies

In this section, we propose the data placement and replication policies for Triple-H. We

design the placement policies such that the latency for replicated data placement can be

hidden from the clients for performance-critical applications without sacrificing fault tol-

erance and reliability. We also consider the limitation of local disk space per node on HPC

clusters while designing the policies. Thus, we propose two kinds of placement policies:

5.1.3.1 Performance-sensitive

Figure 5.2 demonstrates the Performance-sensitive data placement policies.

Greedy Placement: In this policy, the incoming data is placed in the high-performance

storage layer in a greedy manner by the DataNodes. So, as long as there is space available in

the RAM Disk-based primary buffer-cache, the DataNode writes all the data there. Then it

switches to the next level of storage based on SSD, and so on. Let us assume that each file $f_i$ consists of two blocks $b_{i1}$ and $b_{i2}$, where $i$ is the file ID. With this policy, the Data Placement Engine tries to greedily place the files $f_1$, $f_2$, $f_3$, and so on in the high-performance storage layer. But it is possible that all the replicas of a particular file are placed into the RAM

Disk in this policy. Therefore, we propose to follow a hybrid replication scheme as shown in Figure 5.3(a). In the hybrid replication scheme, the first replica of a file is placed in the RAM Disk-based buffer-cache in a greedy manner while the other two are stored in the

SSD-based secondary buffer-cache. Alternatively, two replicas can be placed in RAM Disk and the other one in SSD. Since Triple-H asynchronously flushes the replica in RAM Disk to SSD, this scheme can ensure a similar level of fault tolerance as default HDFS.

Figure 5.2: Performance-sensitive data placement policies: Greedy placement and Load-balanced placement of blocks $b_{ij}$ ($i$ = file ID, $j$ = block ID) across RAM Disk, SSD, HDD, and Lustre.

Load-balanced Placement: This policy spreads the amount of I/O across multiple storage

devices. The Placement Policy Selector randomly assigns weights to different files and

the Data Placement Engine stores the files accordingly. Weight assignment is done on a per-file basis and all blocks belonging to a file get the same weight. As shown in Figure 5.2,

this policy may store file $f_1$ to RAM Disk, $f_2$ to HDD, $f_3$ to SSD, and so on. In our current implementation, we store all the data from the DFSClient to the high-performance storage layers consisting of RAM Disk and SSD. So the weight assignment is done according to the following formula:

$$w = \begin{cases} 1 & \text{place to RAM Disk} \\ 0 & \text{place to SSD} \end{cases}$$

This policy also follows the hybrid replication scheme for the files destined to RAM

Disk (w = 1).

In order to implement the placement policies mentioned above, we modify the Java

class VolumeChoosingPolicy in the Apache Hadoop distribution such that it can incorporate the policies discussed here. If there are multiple devices of the same type, the target volume is selected in a round-robin manner within that storage layer in our design.
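A simplified sketch of such weight-driven volume selection is shown below. It is not the actual modified VolumeChoosingPolicy; the StorageTier enumeration, the Volume wrapper, and the way the weight reaches the DataNode are assumptions made purely for illustration.

import java.io.File;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative weight-driven volume selection: weight 1 maps to the RAM Disk
// tier and weight 0 to the SSD tier; within a tier, volumes are picked in a
// round-robin manner, and the selection falls through the storage hierarchy
// when the requested tier has no space (on-demand weight adjustment).
class HybridVolumeChooser {
    enum StorageTier { RAM_DISK, SSD, HDD }

    static class Volume {
        final StorageTier tier;
        final String mountPoint;
        Volume(StorageTier tier, String mountPoint) { this.tier = tier; this.mountPoint = mountPoint; }
        long availableBytes() { return new File(mountPoint).getUsableSpace(); }
    }

    private final AtomicInteger rrIndex = new AtomicInteger();

    Volume chooseVolume(List<Volume> volumes, int weight, long blockSize) {
        StorageTier wanted = (weight == 1) ? StorageTier.RAM_DISK : StorageTier.SSD;
        Volume chosen = pickRoundRobin(volumes, wanted, blockSize);
        if (chosen != null) return chosen;
        for (StorageTier tier : StorageTier.values()) {      // fall back level by level
            chosen = pickRoundRobin(volumes, tier, blockSize);
            if (chosen != null) return chosen;
        }
        throw new IllegalStateException("no volume has space for this block");
    }

    private Volume pickRoundRobin(List<Volume> volumes, StorageTier tier, long blockSize) {
        int start = rrIndex.getAndIncrement();
        for (int i = 0; i < volumes.size(); i++) {
            Volume v = volumes.get(Math.floorMod(start + i, volumes.size()));
            if (v.tier == tier && v.availableBytes() >= blockSize) return v;
        }
        return null;
    }
}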

Figure 5.3: Replication during data placement in Triple-H. (a) Performance-sensitive placement; (b) Storage-sensitive placement.

5.1.3.2 Storage-sensitive

This placement policy is suitable for HPC systems that usually are equipped with large installations of parallel file systems like Lustre and limited storage space in the compute nodes. With this policy, one replica is stored in local storage (RAM Disk or SSD) and one copy is written to Lustre through the Parallel File system-based Storage Module discussed in Section 5.1.4. Figure 5.3(b) shows the replication scheme for this placement policy.

Lustre provides reliability of the stored files and the overall storage saving is one-third compared to HDFS.

5.1.4 Design Details

In this section, we discuss our design features and implementation.

Buffering and Caching through Hybrid Buffer-Cache: In our design, there is a RAM

Disk-based Buffer-Cache in each DataNode. We buffer (cache) the data written to (read

from) HDFS in RAM Disk which is the primary buffer-cache in Triple-H. The size of the

buffer-cache is enlarged by using SSD. Triple-H hides the cost of disk access during HDFS

Write and Read by placing the data in this buffer-cache.

Figure 5.4: Eviction/Promotion Manager.

Storage Volume Selection and Data Placement through On-demand Weight Adjustment: The Data Placement Engine calculates the available space in the storage layer as

requested by the Placement Policy Selector. If there is sufficient space available for the

file, it is stored in the requested medium. But if the available space is not enough, then

the On-demand Weight Adjustment unit modifies the weight of that replica and places it in

the next level of the storage hierarchy. However, for each replica, it remembers the weight

assigned by the Placement Policy Selector so that whenever there is space available in the

requested level, the file can be moved there. The On-demand Weight Adjustment unit as-

signs a weight $w_i$ to each replica during initial placement. Over time, the priority, hence the weight of the replica, changes based on its access pattern. The weight ($w_i$) is updated

according to the following equation.

$$w_{i+1} = w_i + f(x, y, z), \quad i = 0, 1, 2, 3, \ldots$$

Here, x = access count, y = access time, and z = available storage space. f(x, y, z) represents the weight factor which is estimated by the On-demand Weight Adjustment unit.

Input: access count x, access time y, available space z
Output: an integer representing the decision
  currTime ← System.currentTime()
  interval ← currTime − y
  if (interval ≥ threshold_time OR x < threshold_count) AND z ≤ threshold_space then
      return −1
  else if interval < threshold_time AND x ≥ threshold_count then
      return +1
  end
  return 0
Algorithm 1: Pseudo-code for calculating the weight factor
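For concreteness, a direct Java rendering of Algorithm 1 could look like the sketch below. The field names mirror the user-defined parameters threshold_time, threshold_count, and threshold_space described below; the class itself is illustrative and not part of the actual implementation.

// Illustrative Java rendering of Algorithm 1 (weight factor calculation);
// the thresholds are the user-defined parameters described in the text.
class WeightFactorCalculator {
    private final long thresholdTimeMs;     // threshold_time
    private final long thresholdCount;      // threshold_count
    private final double thresholdSpace;    // threshold_space (available-space bound)

    WeightFactorCalculator(long thresholdTimeMs, long thresholdCount, double thresholdSpace) {
        this.thresholdTimeMs = thresholdTimeMs;
        this.thresholdCount = thresholdCount;
        this.thresholdSpace = thresholdSpace;
    }

    // x = access count, y = last access time (ms), z = available storage space
    int weightFactor(long x, long y, double z) {
        long interval = System.currentTimeMillis() - y;
        if ((interval >= thresholdTimeMs || x < thresholdCount) && z <= thresholdSpace) {
            return -1;   // cold replica: its weight decreases (candidate for eviction)
        } else if (interval < thresholdTimeMs && x >= thresholdCount) {
            return +1;   // hot replica: its weight increases (candidate for promotion)
        }
        return 0;        // keep the replica at its current level
    }
}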

Whenever a block is accessed by a reader, the accessTime and accessCount of

the corresponding replica are updated. We implement this by extending the ReplicaInfo

class of the Apache Hadoop distribution. The weight factor f(x, y, z) of a file is calculated according to Algorithm 1. In Algorithm 1, threshold_time, threshold_count, and threshold_space

are user-defined parameters. Based on f(x, y, z), the weight of each replica is updated pe-

riodically.

For Hadoop clusters with job histories, the On-demand Weight Adjustment unit can

adjust the file weights based on how performance-critical they are.

Persistence of RAM Disk data through SSD-based Staging Unit: Data stored in RAM

Disk cannot sustain power/node failures. Therefore, in our design, we asynchronously flush all data placed into the RAM Disk to an SSD-based staging layer. The staging is implemented via a pool of background threads that receive the data from the BlockReceiver and flush it to SSD by creating a Java I/O stream. In default HDFS, the writer process returns after data is placed in the OS buffer of the DataNode and does not wait till it is synced to persistent storage. Therefore, we argue that our design can provide a similar level of fault tolerance as the default architecture. If SSD is not available, data is flushed to HDD asynchronously.
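One possible structure for this asynchronous staging is sketched below: each block buffered in RAM Disk is handed to a small pool of background threads that writes a copy into an SSD staging directory. The directory path, the byte-array hand-off from the BlockReceiver, and the error handling are all simplifying assumptions for illustration.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative asynchronous SSD staging: the write path returns as soon as the
// data sits in RAM Disk, while background threads persist a copy to SSD.
class SsdStagingUnit {
    private final ExecutorService stagingThreads = Executors.newFixedThreadPool(2);
    private final String ssdStagingDir;   // e.g. "/ssd/hdfs-staging" (illustrative)

    SsdStagingUnit(String ssdStagingDir) {
        this.ssdStagingDir = ssdStagingDir;
    }

    // Called after a block has been placed into the RAM Disk buffer-cache.
    void stageAsync(String blockName, byte[] data) {
        stagingThreads.submit(() -> {
            try (FileOutputStream out = new FileOutputStream(ssdStagingDir + "/" + blockName)) {
                out.write(data);
                out.getFD().sync();   // make the SSD copy durable
            } catch (IOException e) {
                // A real design would report the failure so that the RAM Disk
                // replica is not considered persisted.
            }
            return null;
        });
    }
}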

Data Movement through Eviction/Promotion Manager: This module, as shown in Figure 5.4, is responsible for evicting the cold data and making space for hot data in the buffer-cache. It also promotes the hot data from SSD/HDD/Lustre to RAM Disk/SSD/HDD. In each DataNode, we maintain a list of blockIds. There is a pool of threads running in the DataNode that periodically scan the list to identify the access patterns of the blocks. The decision of eviction/promotion, $D$, is made as follows:

$$D = \begin{cases} \text{promote()} & w_{i+1} > w_i \\ \text{evict()} & w_{i+1} < w_i \\ \text{keep()} & \text{otherwise} \end{cases}$$

Based on the outcome of D, the Eviction/Promotion Manager moves the block to the higher level of storage (promote()) or evicts it to the lower level (evict()) or keeps it in the same level (keep()). Since the DFSClient reads each file sequentially, blocks of the same

file are moved during the same window of data movement. After a replica is moved, its access information is reset.
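The periodic scan performed by this module could be organized as in the sketch below. BlockMeta, the WeightFunction interface (standing in for the weight-factor calculation above), and the move routines are placeholders for the DataNode's internal block metadata and data-movement code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative Eviction/Promotion Manager: a scheduled scan updates each
// replica's weight and moves the block up or down the storage hierarchy.
class EvictionPromotionManager {
    interface WeightFunction { int delta(long accessCount, long lastAccessMs, double availableSpace); }

    static class BlockMeta { long blockId; int weight; long accessCount; long lastAccessMs; }

    private final Map<Long, BlockMeta> blocks = new ConcurrentHashMap<>();
    private final WeightFunction weightFunction;
    private final ScheduledExecutorService scanner = Executors.newScheduledThreadPool(1);

    EvictionPromotionManager(WeightFunction weightFunction) {
        this.weightFunction = weightFunction;
    }

    void start(long periodSeconds) {
        scanner.scheduleAtFixedRate(this::scan, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    private void scan() {
        for (BlockMeta b : blocks.values()) {
            int delta = weightFunction.delta(b.accessCount, b.lastAccessMs, availableSpaceFraction());
            if (delta > 0)      promote(b);     // move one level up, e.g. HDD -> SSD
            else if (delta < 0) evict(b);       // move one level down, e.g. SSD -> HDD
            b.weight += delta;
            if (delta != 0) b.accessCount = 0;  // reset access information after a move
        }
    }

    private double availableSpaceFraction() { return 0.5; /* placeholder */ }
    private void promote(BlockMeta b) { /* copy the block to the higher storage level */ }
    private void evict(BlockMeta b)   { /* copy the block to the lower storage level */ }
}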

Storage Efficiency via integrated Parallel File System-based Storage: We integrate

HDFS with Lustre in our design to operate in the Lustre-Integrated mode and follow the

Storage-sensitive placement policy for this. Figure 5.1(c) depicts the design of our Lustre

integration module. In this design, we use a Lustre client from the BlockReceiver

and place one copy of data in HDFS and the other in Lustre. The Lustre client and HDFS

BlockReceiver work independently in separate threads but the HDFS Write operation returns to the DFSClient only after both the writes have completed. Thus HDFS holds a single replica of the file in the local storage and the integration with Lustre offers reliability and fault tolerance. During HDFS Read, the file can be read from the local storage or

Lustre or both. In this way, the contention on the shared file system can be minimized.
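A simplified sketch of this dual-write path is shown below, treating the Lustre client simply as a POSIX write to a directory on a Lustre mount. The directory paths, the byte-array interface, and the two-thread structure are assumptions for illustration; the real design couples this logic with the HDFS BlockReceiver.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative HHH-L write path: one copy goes to local storage and one to a
// Lustre mount point; the call returns only after both copies have completed.
class LustreIntegratedWriter {
    private final ExecutorService writers = Executors.newFixedThreadPool(2);
    private final String localDir;    // e.g. "/ramdisk/hdfs/current" (illustrative)
    private final String lustreDir;   // e.g. "/lustre/hadoop/blocks" (illustrative)

    LustreIntegratedWriter(String localDir, String lustreDir) {
        this.localDir = localDir;
        this.lustreDir = lustreDir;
    }

    void writeBlock(String blockName, byte[] data) throws Exception {
        Future<?> local  = writers.submit(() -> { writeTo(localDir  + "/" + blockName, data); return null; });
        Future<?> lustre = writers.submit(() -> { writeTo(lustreDir + "/" + blockName, data); return null; });
        local.get();    // the HDFS write acknowledges the client only after
        lustre.get();   // both the local and the Lustre copies are done
    }

    private void writeTo(String path, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path)) {
            out.write(data);
        }
    }
}

Running the two writes in independent threads but joining on both before acknowledging the client is what lets HHH-L keep a single local replica while still providing Lustre-backed reliability.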

We integrate Triple-H with SOR-HDFS which has been presented in Chapter 4. So

Triple-H supports RDMA-based communication and SEDA [104] for HDFS write and replication.

5.2 Experimental Results

In this section, we evaluate the Triple-H design and compare its performance with those of HDFS and Lustre for various Hadoop workloads and applications. The experiments are performed on three different clusters: OSU RI Cluster B, SDSC Gordon, and

TACC Stampede. The cluster configurations are described in Chapter 3.2 and Chapter 4.2.

In our evaluations, HDFS block size and Lustre stripe size are set to 128MB. We use threshold_time = 10 mins, threshold_count = no. of concurrent maps, and threshold_space = 70% for RAM Disk, 40% for SSD, and 80% for HDD.

5.2.1 Performance Analysis of Triple-H

In this section, we analyze the performance of Triple-H using the TestDFSIO bench-

mark. First, we compare the data placement policies proposed in Section 5.1.3. For this, we

use four nodes on OSU RI Cluster B and run tests with four concurrent maps on top of RAM

Disk, SSD, and HDD. Even though RAM Disk does not guarantee tolerance to node/power

failure for default HDFS, we do this to ensure fairness in our comparison of the placement policies against the default round-robin one. We evaluate the Performance-sensitive placement policies with HHH; whereas to evaluate the Storage-sensitive policy, we use HHH-L. As shown in Figure 5.5(a), our data placement policies perform up to 92% better than the default round-robin policy in HDFS for all the data sizes, and Load-balanced offers higher throughput than Greedy. Another observation is that the throughput of the Performance-sensitive policies is much higher than that of the Storage-sensitive policy. This is because the Performance-sensitive policies store all the replicas in the high-performance storage layer. But the Storage-sensitive policy has to access Lustre over IPoIB

(32Gbps); whereas, in HHH, network transmission for data replication goes over RDMA.

Figure 5.5(b) demonstrates the performance of TestDFSIO read over the data generated using different placement policies. As observed from the figure, the read throughput for both of the Performance-sensitive placement policies is much higher (up to 5x) than that of default HDFS. The read throughput of HDFS is higher than that of Storage-sensitive data placement for the 30 GB file size. This is because HDFS has better data locality compared to HHH-L due to the higher replication factor. As the data size increases to 50 GB, the I/O overhead increases and the read throughput for Storage-sensitive data placement becomes higher than that of HDFS.

Figure 5.5: Comparison among different data placement policies on OSU RI Cluster B. (a) Write performance; (b) read performance; (c) HDFS vs. Triple-H.

Figure 5.5(c) shows the comparison of our proposed Performance-sensitive policies

with those in default HDFS (Hadoop 2.6.0). These experiments are performed on eight

nodes on OSU RI Cluster B with the TestDFSIO write test. The Hot (default) policy in de-

fault HDFS uses the available storage devices in a round-robin manner. The All SSD pol-

icy places all the replicas on SSD, whereas, the Lazy Persist policy places the blocks with

replication factor of one to RAM Disk and others to HDD. As observed from the figure, our

policies perform better than all the three policies of default HDFS for performance-critical

data. Compared to the All SSD policy (best for default HDFS), Greedy and Load-balanced increase the throughput by up to 1.9x and 2x, respectively.

We also compare the performance of Greedy policy with that of Load-balanced using the TestDFSIO write benchmark for 30 GB data size on four nodes of OSU RI Cluster

B and TACC Stampede for different number of concurrent maps per node. As observed from Figure 5.6(a), for two concurrent maps, Greedy performs better than Load-balanced.

As the concurrency increases, Load-balanced outperforms Greedy. This is because, on

OSU RI Cluster B, the usable size of RAM Disk is small, 8GB. For Greedy policy, the contention in the storage layer increases as the concurrency increases and Load-balanced starts performing better. On the other hand, there is no SSD on the compute nodes on TACC

Stampede. Thus, Greedy policy shows higher throughput than that of Load-balanced as shown in Figure 5.6(b). In our subsequent experiments, we use Load-balanced policy on

OSU RI Cluster B and Greedy on TACC Stampede. For similar reasons, we use Greedy on

SDSC Gordon. In order to demonstrate the benefits of our proposed policies over different storage policies of default HDFS, we use the Hot policy on OSU RI Cluster B, All SSD on

SDSC Gordon, and Lazy Persist on TACC Stampede for default HDFS.

Figure 5.6: Comparison between the Performance-sensitive data placement policies for TestDFSIO write with different numbers of concurrent maps. (a) OSU RI Cluster B; (b) TACC Stampede.

Triple-H supports RDMA-based communication. Therefore, we compare the performance of Triple-H with that of SOR-HDFS over both HDD and SSD. As observed from Table 5.1, the write throughput of HHH for 30 GB data size is 16x and 1.6x higher than that of SOR-HDFS over HDD and SSD, respectively. HHH also increases the read throughput by 9x compared to that over HDD and 1.6x over SSD.

Table 5.1: Read and write performance comparison in different modes of Triple-H (OSU RI Cluster B)

                  Default                                        Lustre-Integrated
Operation         SOR-HDFS (HDD)   SOR-HDFS (SSD)   HHH          Lustre    HHH-L
Read (MBps)       53.3             318.08           492.17       152.47    238.3
Write (MBps)      10.3             83.2             129.43       62.29     55.7

The write throughput of HHH-L is slightly worse than that of MapReduce over Lustre.

This is because, while running MapReduce over Lustre, we write only one copy of data to

Lustre. But HHH-L writes one copy to local storage and one to Lustre. On the other hand,

HHH-L improves the read throughput by 1.5x over Lustre. The reason behind this is that

HHH-L has higher data locality than that of MapReduce running over Lustre as HHH-L places one replica to local storage of the DataNodes.

Our design has less than 5% overhead for staging. Data caching done by the Evic-

tion/Promotion Manager shows good benefit as is evident from the fact that during the first

run of TestDFSIO read of 10 GB file on OSU RI Cluster B (we place the data to SSD and

HDD following the default round-robin placement policy, threshold_time = 1 min), we get a throughput of 988.5 MBps. On the second run, the read throughput increases to 1063.4

MBps as some of the blocks move to SSD from HDD.

5.2.2 Evaluation with Triple-H Default Mode

In this section, we evaluate HHH-Default.

5.2.2.1 TestDFSIO

Figure 5.7(a) demonstrates the results of TestDFSIO write evaluations on TACC Stampede. Here we vary the data size from 40 GB on 8 DataNodes to 160 GB on 32 DataNodes.

The figure shows that, our design can improve the write throughput by 7x over default

HDFS. The benefit comes from the efficient data placement to the buffer-cache via our proposed policies as well as RDMA-based write and replication. We also evaluate TestDF-

SIO read performance and the results are shown in Figure 5.7(b). Reduction of disk I/O in

Triple-H increases the read throughput by 2x here.

Figure 5.7: Evaluation of TestDFSIO on TACC Stampede (Default mode). (a) Write throughput; (b) read throughput.

Figure 5.8: Performance comparison with data generation benchmarks (Default mode). (a) OSU RI Cluster B: RandomTextWriter; (b) SDSC Gordon: TeraGen; (c) TACC Stampede: RandomWriter.

5.2.2.2 Data Generation Benchmarks

In this section, we discuss the performance of Triple-H using data generation benchmarks like TeraGen, RandomTextWriter, and RandomWriter. For RandomTextWriter, we use eight nodes on OSU RI Cluster B. Triple-H reduces the execution time of this benchmark by 48% for 60 GB data size over the default architecture (Figure 5.8(a)). The performance gain for TeraGen on 32 nodes on SDSC Gordon, as shown in Figure 5.8(b), is up to 42% with Triple-H. We also perform similar experiments on TACC Stampede with

RandomWriter, as shown in Figure 5.8(c). Here, we vary the data size with the cluster size.

As observed from the figure, for 120 GB data generation on a 32-node cluster, Triple-H achieves a benefit of 3x over HDFS. The presence of the buffer-cache along with the efficient placement policies during HDFS write leads to significant performance benefits for the I/O-intensive data generation benchmarks.

5.2.3 Evaluation with Triple-H Lustre-Integrated Mode

In this section, we evaluate HHH-L.

5.2.3.1 TestDFSIO

Figure 5.9(a) demonstrates the results of our TestDFSIO write evaluations on eight

DataNodes on SDSC Gordon. The figure shows that, the write throughput of our design

Figure 5.9: Evaluation of TestDFSIO (Lustre-Integrated mode). (a) TestDFSIO write (SDSC Gordon); (b) TestDFSIO read (SDSC Gordon).

is up to 7x better than default HDFS, but MapReduce over Lustre shows slightly better throughput than HHH-L. The benefit of our design over default HDFS comes from the reduced I/O due to fewer replicas. On the other hand, Lustre shows higher write throughput as the amount of I/O is half of that in Triple-H when MapReduce runs over Lustre. We also measure the DFS space used in each DataNode and find that the total storage used by HDFS is 118.11GB for writing 40GB (40,000MB) data; whereas in HHH-L, it is

79.2GB. Thus, our design saves 33% storage space over HDFS.

We also evaluate the read performance of HHH-L using the TestDFSIO read benchmark. As observed from Figure 5.9(b), our design increases the read throughput by 15% over Lustre. However, the read throughput of default HDFS is higher than both Lustre and HHH-L as the default architecture maximizes data locality via replication. Nevertheless, latency-sensitive workloads with a similar number of HDFS Read and Write operations can benefit from our design over default HDFS. On the other hand, read-intensive workloads can show better performance over Lustre using our design.

Figure 5.10: Evaluation of Sort (Lustre-Integrated mode). (a) Sort (OSU RI Cluster B); (b) Sort (SDSC Gordon); (c) Sort (TACC Stampede).

5.2.3.2 Sort

We evaluate our HHH-L design with the Sort benchmark. On OSU RI Cluster B, we run Sort on eight DataNodes. As observed from Figure 5.10(a), HHH-L improves the Sort performance by up to 40% over HDFS and 46% over Lustre.

Figure 5.10(b) shows the performance of our evaluations on SDSC Gordon. These experiments are performed on 16 DataNodes while the data size is varied from 40 GB to

80 GB. HHH-L reduces the execution time by up to 21% over default HDFS and 54% compared to MapReduce over Lustre. Sort benchmark has equal amount of HDFS Read and Write operations. On SDSC Gordon, the compute nodes are equipped with local SSDs and Lustre is accessed via 10 GigE Ethernet. Therefore, our design leads to larger benefit over Lustre compared to that over HDFS.

We also evaluate the Sort benchmark performance on TACC Stampede. Figure 5.10(c) shows the results of our experiments. On this cluster, we vary the data size from 40 GB on eight nodes to 160 GB on 32 nodes. For 160 GB Sort on 32 nodes, 603 (out of 640) maps are launched as data-local; the map phase takes 537s in HHH-L but 919s for MapReduce over Lustre. The data locality introduced in our design helps improve the performance of the map phase over Lustre. The write phase of job output is faster in our design compared

to HDFS due to less replication and in-memory I/O. In this way, our design reduces the execution time of Sort by up to 23% over default HDFS and up to 24% over Lustre.

5.2.4 Evaluation with Applications

We evaluate our design with the CloudBurst [78] application on TACC Stampede with

16 DataNodes. We use HHH with the default dataset of CloudBurst. We run the application with 160 maps while the other configurations are kept as default. CloudBurst is designed for parallel read-mapping through MapReduce and particularly optimized for mapping sequence data to a reference genome. The time taken by the alignment phase is 60.24 sec in default HDFS; it is 48.3 sec in our design. So, Triple-H accelerates the alignment phase by

19% over default HDFS. We also evaluate the HHH-L design with SequenceCount from the

PUMA workloads repository [66]. This experiment is performed on 8 DataNodes on OSU

RI Cluster B. Although SequenceCount reads and writes data from/to HDFS/Lustre, it also has computation. The execution time of SequenceCount is 2050s, 2308s, and 1893s over

HDFS, Lustre and HHH-L. HHH-L reduces the execution time of SequenceCount by 17% compared to MapReduce over Lustre. We gain over HDFS because of the reduction of I/O overheads; compared to Lustre, we offer better data locality which leads to improvement in performance.

5.3 Related Work

Extensive research has been carried out in the recent past to improve the performance of HDFS. For HDFS, the read throughput has been improved by caching data in memory or using explicit caching systems [21, 46, 96]. The Apache Hadoop community [96] has also proposed a centralized cache management scheme in HDFS, which is an explicit caching mechanism that allows users to specify paths to be cached by HDFS. But currently, this caching mechanism can only help HDFS read operations. The authors in [46]

have studied improving HDFS read and write performance by utilizing Memcached. Recently, new in-memory computing systems, like Spark [54, 111], are emerging. These systems cache the data in memory and support lineage-based data recovery (re-computation), which is more expensive than replication [110]. Some recent studies [68, 69] also pay attention to incorporating heterogeneous storage media (e.g. SSD) in HDFS. The authors in [68] deal with data distribution in the presence of nodes that do not have uniform storage characteristics, whereas [69] caches data in SSD. Researchers in [91] present

HDFS-specific optimizations for PVFS, and the authors of [70] propose to store the cold data of a Hadoop cluster on network-attached file systems. Most of these designs are based on hierarchical storage deployment. In this work, we propose a hybrid (hierarchical with flat) architecture to take full advantage of the heterogeneous storage resources on modern HPC clusters.

5.4 Summary

In this work, we proposed Triple-H, a novel design for HDFS on HPC clusters with heterogeneous storage architecture. Triple-H follows a hybrid storage usage pattern that can minimize the I/O bottlenecks and ensure efficient storage utilization with a similar level of fault tolerance as HDFS in HPC environments. In our design, we introduced a RAM Disk and SSD-based Buffer-Cache to hide the cost of disk access during Read/Write operations and ensure reliability by utilizing SSD-based staging. We also proposed effective data placement policies for our architecture, integrating it with parallel file system installations such as Lustre. From our comprehensive performance evaluations of Triple-H, we observed

7x benefit for TestDFSIO write and 2x improvement for TestDFSIO read, as compared to default HDFS. For data-generation benchmarks, we observed a reduction in job execution time of up to 3x. For Sort benchmark, our design can show up to 24% improvement over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is

accelerated by 19%. Our design also improves the performance of SequenceCount by 17% with a local storage space reduction of 66%.

Chapter 6: Accelerating Iterative Applications with In-Memory and Heterogeneous Storage

The previous chapter describes the design of Triple-H that exploits the in-memory and heterogeneous storage available in HPC systems to minimize the I/O bottlenecks of HDFS.

Another popular in-memory file system in the literature is Alluxio/Tachyon [54]. These advanced file systems have shown significant performance improvement for different sets of workloads by bringing in-memory storage into their designs. They have also efficiently utilized heterogeneous storage such as SSD, HDD, and even parallel file systems like Lustre and GlusterFS, that are usually available on large-scale data centers or HPC clusters.

These heterogeneous design choices introduce new possibilities of different deployment strategies for in-memory file systems in these environments. Different system architectures and deployment modes have different performance and fault tolerance characteristics and thus may fit with different sets of workloads and applications. Therefore, there is a critical need for a systematic performance characterization study of these in-memory file systems with different modes for a representative set of benchmarks and applications.

Additionally, HDFS, which was primarily designed for batch applications, cannot provide optimal performance for iterative jobs. Iterative applications suffer from huge I/O bottlenecks resulting from persisting and replicating the output of each iteration to HDFS.

Even though in-memory computing frameworks like Spark are optimized for iterative applications, they consume significant memory for caching the data from the intermediate iterations.

Spark also aggressively uses memory to store different types of data. Tachyon along with the in-memory computing framework Spark [111] is known to accelerate iterative applications by eliminating I/O bottlenecks. But upper-level frameworks like MapReduce and

Spark cannot take advantage of lineage without significant modifications in the frameworks themselves. As a result, the performance of Tachyon becomes bounded by the underlying

file system i.e. HDFS.

In this work, we study the impact of in-memory file systems like Triple-H and Tachyon on the performance and fault tolerance of Hadoop and Spark. We optimize the Triple-H design to support different execution modes of Spark and propose advanced accelerating techniques to adapt Triple-H for iterative applications. We present an orthogonal way for

Spark to cache the output from the intermediate iterations of an iterative application to

Triple-H. Through this, Spark can utilize the available memory for other purposes like computation, intermediate data, etc. Our experimental results show that Triple-H outperforms

Tachyon for Hadoop MapReduce and Spark workloads on HPC clusters. The proposed design is also able to improve the performance of iterative MapReduce as well as Spark applications.

6.1 System Architectures to Deploy In-Memory File Systems

In this section, we discuss the system architectures to deploy Tachyon [54] and Triple-H on HPC clusters.

Tachyon [54] is a memory-centric distributed storage system for Spark and MapReduce applications. By leveraging lineage information, it provides faster in-memory read and write accesses. For data-intensive applications like those of Hadoop MapReduce or Spark,

Tachyon supports two primary deployment modes. They are:

(1) Tachyon as a Write-Through Cache (Tachyon-WTC)

(2) Tachyon as a Cache (Tachyon-C)

Figure 6.1: System architectures for deploying Hadoop MapReduce and Spark on top of in-memory file systems (Tachyon & HHH): (a) Tachyon as a Write-Through Cache, (b) Tachyon as a Cache, (c) HHH

Figure 6.1(a) depicts the first mode. In this mode, Tachyon uses HDFS or any POSIX

compliant file system as the underlying file system (underFS) and acts as a write-through

cache itself. Applications write data to the in-memory layer of Tachyon, which then flushes

it to the underFS in a synchronous manner. This is the default deployment mode of

Tachyon for MapReduce and Spark applications.

The second deployment mode, as shown in Figure 6.1(b), uses Tachyon just as a cache.

In this mode, the data generation phase of MapReduce and Spark applications writes data directly to the underFS of Tachyon. Since lineage is not enabled for such applications, data reliability is guaranteed by the existing fault tolerance mechanism offered by the file system. When a job runs on the generated data, Tachyon reads the data and caches it in the in-memory layer. Thus, Tachyon acts just as a cache in this mode.

In both of these modes, write performance is bounded by that of the underlying file system. For example, if HDFS is used as the underFS of Tachyon, the write performance of Hadoop MapReduce and Spark jobs depends on the underlying storage and network speeds for HDFS replication. In the Tachyon-WTC mode, the read operation can proceed at memory speed as long as the data can be found in the memory layer of Tachyon. Beyond that, the data is read from the underlying file system. On the other hand, in the Tachyon-C mode, data is read from the in-memory layer of Tachyon as it caches it from the underlying

file system with an LRU eviction policy.

Tachyon also supports another mode of data write. In this mode, the entire data is cached in memory as long as it fits. This data is not persisted to the underlying file system. This mode can be set using the MUST_CACHE parameter for Tachyon write; the data size stored in this mode equals the available memory size. The primitive operations, e.g., basic read and write, use the native Tachyon APIs and can ensure fault tolerance through lineage and periodic checkpointing. However, Tachyon does not offer fault tolerance for more complex workloads on Hadoop MapReduce and Spark because lineage is not enabled for them. Therefore, the fault tolerance for these workloads entirely depends on the built-in fault tolerance mechanism of the underFS.

As depicted in Figure 6.1(c), Triple-H also has three different modes: Default (HHH),

Lustre-Integrated (HHH-L), and In-Memory (HHH-M). HHH and HHH-L are described in

Chapter 5.1.2. HHH-M is a variation of the Default mode that uses memory in a greedy manner for data placement.

Limitations of Tachyon: The major limitation of Tachyon for running MapReduce and

Spark workloads comes from the lack of any mode to use it as a standalone in-memory

file system with fault tolerance [15]. Primitive operations can use Tachyon APIs to support

lineage. But there is no configuration parameter to enable this feature for MapReduce and

Spark; the frameworks should be modified to support this [54], which turns out to be a daunting task considering the large number of frameworks depending on distributed file systems like HDFS or Tachyon. Therefore, the performance of Tachyon is bounded by the underlying file system.

Table 6.1 summarizes the fault tolerance mechanisms offered by Triple-H and Tachyon for different types of workloads. Triple-H provides RDMA-based replication for fault tolerance of all workloads in the HHH-M and HHH modes. On the other hand, Tachyon supports lineage for primitive operations. But for MapReduce and Spark applications, the upper-layer frameworks have to be modified to take advantage of lineage.

Table 6.1: Fault tolerance of Triple-H and Tachyon

File system | Operations        | Fault Tolerance                                                             | How provided
HHH-M       | All               | In-memory replication with lazy persistence                                 | RDMA-based replication
HHH         | All               | Hybrid replication                                                          | RDMA-based replication
HHH-L       | All               | Lustre-based fault tolerance                                                | Synchronous Lustre write
Tachyon     | Primitive         | Lineage                                                                     | Tachyon APIs
Tachyon     | MapReduce, Spark  | Fault tolerance of underFS (lineage support needs framework modifications)  | Synchronous write to underFS

For both Tachyon and Triple-H, computation is co-located with storage to ensure better data locality for MapReduce and Spark applications. At the same time, in order to minimize memory contention, the memory usage is tunable for both file systems. When RAMDisk is

full, both HHH (default) and Tachyon (over HDFS) fall back to local SSDs/HDDs. HHH gains in performance via its enhanced placement policies. Moreover, Tachyon workers have to cross process boundaries to persist or cache the data to/from the underlying file system in both modes. This overhead is not present in the HHH-M and HHH modes. For

HHH-L, the DataNode processes write to Lustre to make the data fault tolerant. But reads can proceed from local storage within the process boundaries. Besides, any application framework compatible with HDFS can run on top of Triple-H without any framework-level modifications.

Figure 6.2: Enhanced connection management for supporting Spark over Triple-H

6.2 Adapting Triple-H for Iterative Applications

In this section, we first present our designs for running Spark over Triple-H and then propose accelerating techniques for iterative workloads.

6.2.1 Optimizing Triple-H for Spark

Spark is a popular in-memory computing framework that is known to offer better per-

formance for iterative applications over MapReduce. In Chapter 5, we proposed Triple-H

design that is optimized to run MapReduce applications. But Spark can run in different

modes as discussed in Chapter 2. Each of these modes has a different type of interaction with the underlying file system, and some of these interactions are quite different from that of

MapReduce. For example, the MapReduce framework provides process-level parallelism,

whereas Spark, in its standalone mode, offers thread-level parallelism while invoking the

file system APIs. On the other hand, when Spark runs in YARN-Client mode, multiple

threads are spawned in the same process that is launched by the YARN scheduler during

file system invocation. The YARN-cluster mode has the same characteristics as MapRe-

duce in terms of file system interaction. In order to accommodate these varying types of

interactions, we propose a novel design for connection management in the client side of

Triple-H.

6.2.1.1 Enhanced Connection Management for Spark in Triple-H

In Chapter 4, we described the default design for the Triple-H client side for RDMA

communication. Thread-level parallelism brings additional challenges for managing

and mapping different end-points with the corresponding threads. For this, we assign an

id to each thread during its creation and store the <ep_id, thrd_id> tuples in a hash map. The threads can use the buffers allocated per client process in a round-robin manner for sending data. All the end-points are created in a single RDMA connection object.

Figure 6.2 depicts the proposed connection management for supporting Spark over Triple-

H.

92 A single receiver thread is created to receive acknowledgements from the DataNodes

for all the sender threads that are spawned from the Spark framework. The reason behind

this is that all the acknowledgements are received on a single RDMA connection object at

different end-points. After receiving an acknowledgement in an end-point, the receiver

thread looks up the corresponding <ep_id, thrd_id> tuple in the hash map and notifies the appropriate thread. In this way, with the proposed connection management, Triple-H can support different execution modes of Spark.
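To make this bookkeeping concrete, the following minimal Java sketch illustrates how the <ep_id, thrd_id> map and the single receiver thread could interact. The class and method names (e.g., SparkClientConnectionManager, onAckReceived) are hypothetical and stand in for the actual Triple-H client internals; end-point creation, RDMA transfers, and acknowledgement polling are not shown.

import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the client-side <ep_id, thrd_id> bookkeeping described above.
public class SparkClientConnectionManager {

    // ep_id -> thrd_id, populated when a sender thread is created and bound to an end-point.
    private final Map<Integer, Integer> epToThread = new ConcurrentHashMap<>();

    // Per-thread mailbox used by the single receiver thread to wake up sender threads.
    private final Map<Integer, BlockingQueue<Integer>> ackQueues = new ConcurrentHashMap<>();

    /** Called by a sender thread during its creation, after it is assigned an end-point. */
    public void register(int threadId, int endpointId) {
        epToThread.put(endpointId, threadId);
        ackQueues.put(threadId, new ArrayBlockingQueue<>(64));
    }

    /** A sender thread blocks here until the receiver thread posts its acknowledgement. */
    public void awaitAck(int threadId) throws InterruptedException {
        ackQueues.get(threadId).take();
    }

    /**
     * Single receiver loop entry point: for every acknowledgement observed on an
     * end-point, look up the owning sender thread and notify it.
     */
    public void onAckReceived(int endpointId) {
        Integer threadId = epToThread.get(endpointId);
        if (threadId != null) {
            ackQueues.get(threadId).offer(endpointId);
        }
    }
}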

6.2.2 Design Scope for Iterative Applications

Iterative applications generate many different types of data (input, output, intermediate, control, etc.) that are stored to the underlying file system. Blindly storing all the data in high performance storage is not economical and may not be necessary to extract performance from the application. It is, therefore, challenging to determine the right set of data to store in the high performance storage layer for iterative applications. As a representative, we first analyze K-Means and propose advanced acceleration techniques to adapt Triple-H for such applications.

K-Means is a popular clustering algorithm [56] that divides the input dataset into k clus-

ters. It is iterative in nature and starts the clustering based on some initial cluster centroids.

Each iteration of this algorithm can be run as a MapReduce job where it generates the

input data for the next iteration in the pipeline. Each record in the input dataset is a d-

dimensional tuple. As demonstrated in Figure 6.3(a), the map phase computes the similar-

ity of the input records with those of the starting centroids and emits the <centroid id,

record id> tuples. This is the intermediate data that is then processed in the reduce phase. Reducers calculate new centroids based on the average of the similarities of all the

tuples in a cluster and output the records and their current centroids along with the new centroids for each cluster. The new centroids are used for the next iteration, and the algorithm continues until the change in the centroids falls below some threshold.

Figure 6.3: Iterations and profiling of K-Means: (a) data cached in each K-Means iteration, (b) I/O in different stages of K-Means
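For reference, a condensed Hadoop sketch of one such iteration is given below. The class names, the configuration key kmeans.centroids, and the comma-separated record format are assumptions made for illustration; the actual benchmark implementation used in our evaluation may differ, and the driver loop that chains iterations is omitted.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch of one K-Means iteration expressed as a Hadoop MapReduce job.
public class KMeansIteration {

    // Parses a comma-separated d-dimensional point.
    static double[] parsePoint(String line) {
        String[] parts = line.trim().split(",");
        double[] p = new double[parts.length];
        for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
        return p;
    }

    public static class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centroids;

        @Override
        protected void setup(Context context) {
            // Centroids from the previous iteration, e.g. "1.0,2.0;3.5,0.2;..." (assumed key).
            String[] specs = context.getConfiguration().get("kmeans.centroids").split(";");
            centroids = new double[specs.length][];
            for (int i = 0; i < specs.length; i++) centroids[i] = parsePoint(specs[i]);
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            double[] point = parsePoint(record.toString());
            int closest = 0;
            double best = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dist = 0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - centroids[c][d];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; closest = c; }
            }
            // Intermediate data of this iteration: <centroid id, record>.
            context.write(new IntWritable(closest), record);
        }
    }

    public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable centroidId, Iterable<Text> members, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text member : members) {
                double[] p = parsePoint(member.toString());
                if (sum == null) sum = new double[p.length];
                for (int d = 0; d < p.length; d++) sum[d] += p[d];
                count++;
                // Re-emit the record with its current centroid; this feeds the next iteration.
                context.write(centroidId, member);
            }
            StringBuilder next = new StringBuilder("centroid:");
            for (int d = 0; d < sum.length; d++) next.append(d == 0 ? "" : ",").append(sum[d] / count);
            // New centroid for this cluster, consumed by the driver before the next iteration.
            context.write(centroidId, new Text(next.toString()));
        }
    }
}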

The output from each iteration is written to HDFS and read by the next iteration, if there is any. Therefore, in our design, we cache the output from each iteration so that it can be read faster when the next iteration starts. All other data and control files are written to disk.

Figure 6.3(b) shows the profiling results of a single iteration of K-Means with 30GB input and six initial centroids. This graph is generated through analysis of the Hadoop execution logs. The intermediate data, as shown in Figure 6.3(b), is generated in the early stage of job execution by the map processes. As observed from the figure, after the completion of all map tasks, the reduce phase starts generating the output data. The size of the output data, which equals the sum of the input data size and the centroid size, is significant and plays an important role in the performance of K-Means. Since this output data is used as the input for the next iteration, there is a good opportunity for performance gain if this data is stored in the high performance storage devices in HDFS.

6.2.3 Proposed Design

In order to accelerate iterative applications over HDFS, we buffer the output data from each iteration in the RAMDisk and SSD-based buffer-cache of Triple-H. To further adapt

Triple-H for iterative applications, we propose the following enhancements:

Selective Caching: Selective Caching refers to the caching mechanism based on the type of data that is being generated during job runtime. In order to perform selective caching on write, we add a Selection Unit on the DataNode side of the Triple-H design. Every block coming to a DataNode goes through this unit. If the data is detected to be the output of an iteration coming from a map or reduce task, it is cached in the high performance storage layer (RAMDisk or SSD). Otherwise, it goes to other storage devices (HDD or Lustre). The

HDFS client side provides hints related to per-task meta information in the block header to the DataNode, indicating whether a block is part of the job output or not. In this way, only performance-sensitive data goes to the hybrid buffer-cache, while the non-critical data is stored in slower storage, reducing contention on the buffer-cache. Figure 6.4 presents these added features in Triple-H with their different functional units.

Figure 6.4: Added functional units in Triple-H for iterative applications
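A minimal sketch of the Selection Unit's routing decision is shown below. The BlockHeader hint and the storage-tier paths are illustrative assumptions rather than actual HDFS or Triple-H classes.

import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the DataNode-side Selection Unit described above.
public class SelectionUnit {

    /** Client-supplied hint carried in the block header (assumed layout). */
    public static final class BlockHeader {
        final long blockId;
        final boolean isJobOutput; // true if this block belongs to an iteration's output
        public BlockHeader(long blockId, boolean isJobOutput) {
            this.blockId = blockId;
            this.isJobOutput = isJobOutput;
        }
    }

    private final Path ramDiskDir = Paths.get("/dev/shm/dn");  // high-performance tier (assumed path)
    private final Path ssdDir     = Paths.get("/mnt/ssd/dn");  // SSD staging tier (assumed path)
    private final Path hddDir     = Paths.get("/mnt/hdd/dn");  // slower tier, e.g. HDD or Lustre mount

    /** Route an incoming block to a storage tier based on the client hint. */
    public Path chooseTarget(BlockHeader header, long freeRamDiskBytes, long blockSize) {
        if (header.isJobOutput) {
            // Performance-critical data: RAM Disk first, SSD as the staging buffer when full.
            return freeRamDiskBytes >= blockSize ? ramDiskDir : ssdDir;
        }
        // Non-critical data bypasses the hybrid buffer-cache to reduce contention.
        return hddDir;
    }
}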

Optimization in the Eviction Algorithm: We also optimize the cache eviction algorithm of Triple-H for iterative applications. Iterative applications require clearing the cache during each iteration to make space for the new data to be read in the next iteration. Therefore, as opposed to the Triple-H eviction algorithm discussed in Chapter 5.1.4, instead of using a fixed interval for the eviction daemon to wake up, our design evicts the data from the RAMDisk after a block is read in the map phase.

In this way, the output data of an iteration can be written to RAMDisk more frequently in Triple-H. Eviction from SSD still follows policies similar to those proposed for Triple-H.

Thus, the cold data gets evicted to HDD or Lustre.
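The following minimal sketch illustrates the read-triggered eviction step, under the assumption that a block evicted from RAMDisk is demoted to the SSD tier; the class name and file-level handling are illustrative and not the actual DataNode implementation.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of read-triggered eviction; in Triple-H this logic would live in the
// DataNode and operate on actual block files rather than arbitrary paths.
public class ReadTriggeredEviction {

    private final Path ssdDir;

    public ReadTriggeredEviction(Path ssdDir) {
        this.ssdDir = ssdDir;
    }

    /**
     * Called after a block on RAM Disk has been served to a map task: demote it to SSD
     * immediately so the next iteration's output can reuse the RAM Disk space.
     */
    public void onBlockRead(Path ramDiskBlock) throws IOException {
        Path target = ssdDir.resolve(ramDiskBlock.getFileName());
        Files.move(ramDiskBlock, target, StandardCopyOption.REPLACE_EXISTING);
    }
}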

6.3 Experimental Results

In this section, we present a detailed performance evaluation of Hadoop MapReduce and Spark workloads over Tachyon and Triple-H. For our evaluations, we use Hadoop

2.6.0, Spark 1.3.0 and Tachyon 0.6.1. In all the experiments, we mention the number of

DataNodes as the cluster size. Each number reported here is an average of three iterations.

On OSU RI Cluster B, we use RAMDisk, SSD, and HDD as HDFS data directories with the default RoundRobin placement policy for default HDFS. When Tachyon runs on top of HDFS, RAMDisk is used by Tachyon. For Triple-H, we use RAMDisk, SSD, and HDD as the data directories. On SDSC Gordon, we use the All SSD [34] placement policy for default HDFS.

6.3.1 Identifying the Impact of Different Parameters

First we identify the impact of different file system and framework parameters on the performance of MapReduce and Spark workloads over in-memory file systems Tachyon and Triple-H. The parameters that we consider are the ones that impact I/O performance.

These are:

BlockSize: The blocksize is an important file system parameter that determines the unit of data write. By default, the HDFS block size is 128MB, as is Triple-H's, and the Tachyon blocksize is 1GB. Since Tachyon uses HDFS or Lustre as the underlying file system, in this set of experiments, we evaluate the impact of HDFS blocksize and Lustre stripe size on the write performance of Tachyon. We do these experiments on eight DataNodes on OSU RI

Cluster B using the RandomWriter benchmark. The total number of maps is 64 for 60GB data size.

As observed from Figure 6.5(a), a 128MB HDFS blocksize minimizes the execution time of RandomWriter over all three file systems. HHH reduces the execution time by up to 41% over HDFS, and the performance of HDFS is better than that of Tachyon. This is because, even though Tachyon writes data to the RAMDisk (12GB), it also synchronously persists and replicates the data in HDFS. Since lineage is not enabled here, replication is important for data fault tolerance.

Figure 6.5: Impact of blocksize (OSU RI Cluster B): (a) HDFS blocksize, (b) Lustre stripe size, (c) Tachyon blocksize

Next, we evaluate the impact of Lustre stripe size on Tachyon (underFS is Lustre) write performance. As observed from Figure 6.5(b), a stripe size of 256MB minimizes the execution time of RandomWriter over all three file systems, with the execution times being very similar to one another. This is because Tachyon write performance is bounded

by that of Lustre. HHH-L also writes one copy of data to RAMDisk and the other to Lustre in a synchronous manner. However, this write in HHH-L happens in an overlapped manner.

Therefore, the performance of HHH-L is slightly better than Tachyon.

We also determine the optimal blocksize for Tachyon write. We vary the blocksize from 256MB to 2GB while running a 60GB RandomWriter test on eight nodes of OSU

RI Cluster B. HDFS is used as the underFS of Tachyon. As shown in Figure 6.5(c), the default blocksize of 1GB minimizes the execution time over the others. Therefore, in our subsequent experiments on OSU RI Cluster B, we use 128MB blocksize for HDFS and

Triple-H, 256MB stripe size for Lustre and 1GB blocksize for Tachyon.

Concurrent Containers: In this experiment, we vary the number of concurrent containers in MapReduce and observe the impact on the execution time of 60GB RandomWriter. On

OSU RI Cluster B, each node has eight processor cores; therefore, we vary the number of concurrent containers from four to eight. As observed from Figure 6.6(a), the execution time of RandomWriter is minimized for eight concurrent containers. Increasing the number of containers introduces more parallelism in terms of the number of concurrent maps, which leads to reduced execution time for this benchmark. The execution time for HHH is reduced by 41% over HDFS and 60% over Tachyon.

Figure 6.6: Impact of concurrent containers and tasks (OSU RI Cluster B): (a) concurrent containers (Hadoop), (b) concurrent tasks (Spark)
Figure 6.7: Evaluation of RandomWriter and Sort (SDSC Gordon): (a) RandomWriter, (b) Sort

Concurrent Tasks: We also evaluate the impact of the number of concurrent tasks per host using the TeraGen benchmark of Spark in Standalone mode. These experiments are run on eight nodes on OSU RI Cluster B for 60GB data size. The number of concurrent tasks per host is varied from four to eight. As shown in Figure 6.6(b), increasing the number of concurrent tasks per host reduces the execution time of Spark TeraGen, and the time is minimized for eight concurrent tasks due to higher concurrency. The performance of HHH is better by 52% over HDFS and 54% over Tachyon-WTC.

6.3.2 Primitive Operations

In this section, we evaluate the performance of primitive level operations over Tachyon and Triple-H. In this experiment, we write 1,000 files (around 12GB) to both Tachyon and

Triple-H. Tachyon runs on top of HDFS and we use the copyFromLocal command to copy

the files to Tachyon. Tachyon takes 16s to copy all the files to its in-memory layer and no

file is written to the underFS as the MUST_CACHE feature is on. On the other hand, HHH-

M (HDFS) with replication factor of three takes 45s (80s) to write (put) the same amount

of data. If we use replication factor of one, the write time is reduced to 16s. Compared to

HHH-M with replication factor of three, Tachyon performs 2.8x better, and the performance

is similar to that of HHH-M with replication factor of one. For the primitive operations,

Tachyon uses its native file system APIs with no replication to enable lineage. On the other

hand, HHH-M uses replication to ensure fault tolerance which increases the latency of this

operation. However, while executing Hadoop MapReduce and Spark workloads, Tachyon

does not support lineage and depends on the underlying file system for fault tolerance.

6.3.3 Hadoop Workloads on HDFS, Tachyon and Triple-H

In this section, we evaluate the performance of two MapReduce benchmarks, Ran-

domWriter and Sort over HDFS, Tachyon and Triple-H. For Tachyon, we consider both the

modes as discussed in Section 6.1. Tachyon in its MUST_CACHE mode sacrifices fault

tolerance for MapReduce and Spark applications, as lineage is not enabled. For fair com-

parison, we evaluate both HHH and Tachyon in their default modes that guarantee fault

tolerance. We perform these experiments on SDSC Gordon and vary the data size from

50GB on 8 nodes to 200GB on 32 nodes.

Figure 6.8: Evaluation of Grep and MR-MSPolygraph: (a) Grep (SDSC Gordon), (b) MR-MSPolygraph (OSU RI Cluster B)
Figure 6.9: Evaluation of Spark Standalone mode (SDSC Gordon): (a) TeraGen, (b) TeraSort

As observed from Figure 6.7(a), HHH reduces the execution time of RandomWriter by up to 56% over HDFS and 47% over Tachyon. As demonstrated in the figure, there is not much difference in the performance of HDFS and Tachyon-WTC. This is because, even though Tachyon stores the data in memory, it synchronously persists the data to HDFS and replicates it to multiple DataNodes. RandomWriter in Tachyon-C writes all the data to

HDFS, resulting in performance similar to the latter. Figure 6.7(b) shows the performance of the Sort benchmark. As observed from the figure, Tachyon-WTC reduces the execution time of Sort by 15% for 200GB data size on 32 nodes. The improvement in the case of Tachyon-C is 5%. The reason that Tachyon-WTC speeds up Sort is that the map phase of the Sort benchmark finds the input splits in memory while reading them. Thus, the map phase

finishes faster in this case. For Tachyon-C, data has to be loaded to memory at the start of the benchmark and then the job starts. The initial loading phase takes some time here.

HHH improves the performance of Sort by 31% over HDFS and 19% over Tachyon-WTC for 200GB data size on a cluster size of 32. The gain for HHH comes from in-memory data reads in the map phase and RDMA-based writes with hybrid replication using the enhanced placement policies in the reduce phase. The performance of Tachyon-WTC and Tachyon-C is bounded by that of HDFS for benchmarks like RandomWriter and Sort, whereas HHH improves the performance over default HDFS.

6.3.4 Hadoop Workloads on Lustre, Tachyon and Triple-H

In this section, we evaluate the performance of Grep and MR-MSPolygraph over Lustre, Tachyon-WTC (over Lustre) and HHH-L. For the Grep benchmark, we vary the cluster size from 8 to 32 and generate 50GB data on 8 nodes using the RandomTextWriter benchmark. The data sizes on 16 and 32 nodes are 100GB and 200GB, respectively. When

Tachyon-WTC and HHH-L generate the data, it is cached in the in-memory layer as well as written to Lustre. Thus, when the Grep benchmark is run, it gets to read the data from memory. The performance of Grep is largely influenced by data locality. Both HHH-L and

Tachyon-WTC improve data locality in the Hadoop Cluster compared to MapReduce run over Lustre. Therefore, both Tachyon and HHH-L show large improvements of 76% and

78%, respectively, over Lustre. HHH-L performs up to 8% better than Tachyon because data write in the reduce phase gets the benefit of overlapping in HHH-L. Such overlapping is not present in the Tachyon architecture.

The MR-MSPolygraph application reads data from the file system while matching the input protein sequences with those of the reference file. This is a read-intensive application with locality playing an important role in performance. We run this experiment on eight nodes on OSU RI Cluster B with a total of 1,000 maps. The number of concurrent containers/maps is varied from four to eight. Both HHH-L and Tachyon-WTC improve the

read performance over Lustre and thus achieve similar performance benefits, the maximum benefit being 79% over Lustre for HHH-L.

6.3.5 Spark Workloads on HDFS, Tachyon and Triple-H

In this section, we evaluate Spark TeraGen and TeraSort over HDFS, Tachyon and

HHH. We run these experiments on SDSC Gordon. We vary the cluster size from 8 to 32 and the data size from 50GB to 200GB. The number of concurrent tasks is 32 on

8 nodes, and 128 on 32 nodes.

Figure 6.9(a) shows the results of our evaluations with TeraGen using the Standalone mode. As observed from the figure, HHH reduces the execution time by up to 2.3x over

HDFS and 2.4x over Tachyon-WTC. Running TeraGen with Tachyon-WTC and Tachyon-C does not lead to significant improvement over HDFS. This is because, for Spark workloads also, Tachyon does not support lineage and thus, the performance is bounded by that of

HDFS. Figure 6.9(b) shows the results of our evaluations with TeraSort. As observed from the figure, HHH reduces the execution time of TeraSort by up to 17% over HDFS and 25.2% over Tachyon-C. TeraSort performs worse over Tachyon-C than over HDFS because a couple of task failures are seen during the execution of the job. These failures are due to out of memory and no-space-left-on-RAMDisk errors coming from Tachyon. TeraSort did not run successfully in Tachyon-WTC mode; every time, the jobs failed with out of memory errors. TeraSort being shuffle-intensive, the gain of HHH for this benchmark is not as high as that for TeraGen.

Figure 6.10 shows the results of our evaluations using Spark over YARN. HHH improves the performance of TeraGen over YARN-cluster by up to 2.1x over HDFS and

2x over Tachyon-WTC. Tachyon-C performs similarly to HDFS for TeraGen, as this mode writes data directly to HDFS. Tachyon-WTC also does not lead to significant improvement

Figure 6.10: Evaluation of Spark workloads over YARN: (a) TeraGen on SDSC Gordon, (b) TeraSort on SDSC Gordon, (c) WordCount on OSU RI Cluster B

in performance since it is bounded by that of HDFS. Figure 6.10(b) depicts that HHH improves the performance of TeraSort in YARN-cluster mode by 16% over HDFS. However,

TeraSort did not run successfully in the Tachyon-WTC and Tachyon-C modes. Every time, the experiments failed with out of memory errors.

6.3.6 Spark Workloads on Lustre, Tachyon and Triple-H

In this section, we evaluate the Spark WordCount job using the YARN-client mode over

Lustre, Tachyon-WTC and HHH-L. WordCount is a compute-intensive workload. Before running the WordCount job, we generate 40GB to 60GB data using the RandomTextWriter benchmark on eight nodes on OSU RI Cluster B. WordCount reads data from the underlying file system and writes the output to it. Figure 6.10(c) shows the results of our evaluations. As observed from the figure, there is not much difference in the execution times of WordCount over different file systems. HHH-L improves the performance by up to

4% over Lustre and by up to 2% over Tachyon-WTC. WordCount being CPU-bound, the performance is not much influenced by the in-memory file systems.

6.3.7 Fault Tolerance of HDFS, Tachyon and Triple-H

In this section, we evaluate HHH and Tachyon in terms of fault tolerance for MapReduce and Spark workloads and compare with that of HDFS. For this, we use the Sort

benchmark with 20GB data size on four nodes on OSU RI Cluster B. For all three

file systems, the RandomWriter benchmark first generates 20GB data. Then at the very beginning of the Sort experiment, we turn some nodes of the cluster off. As observed from

Figure 6.11, HDFS, Tachyon-WTC, Tachyon-C, and HHH can tolerate the failures of one and two nodes by successfully completing the Sort test. The overhead for a two-node failure is up to 104%, 81.5%, 73.2%, and 55% for HDFS, Tachyon-WTC, Tachyon-C, and HHH, respectively. Here, HHH and HDFS read the data from the third DataNode, and the Tachyon master instructs the worker on the third node to cache the data from the underFS and continue with execution. For failures of more than two nodes, none of the file systems can complete the job. The reason behind this is that the job loses access to all the replicas of some data blocks for HDFS and HHH (replication factor = 3). Since Tachyon depends on HDFS for fault tolerance, it also cannot finish the job in case of such failures.

Figure 6.11: Fault tolerance of different file systems (OSU RI Cluster B)
Figure 6.12: Evaluation of iterative workloads (OSU RI Cluster B): (a) synthetic workload, (b) K-Means

6.3.8 Evaluation with Iterative Workloads
6.3.8.1 Synthetic Iterative Benchmark

First we develop a MapReduce-based iterative benchmark consisting of three jobs. The

first job writes 60GB data to the file system. The second job reads the output from the

first job, sorts it, and then generates new output, which is the input of the third one. The

third job scans its input and writes the data back to the file system. The amounts of input and output data for the last two jobs are equal (60GB). We run these experiments on eight nodes on OSU RI Cluster B. Figure 6.12(a) shows the total time along with the breakdown times for this iterative job over different file systems. As observed from the figure, HHH with enhancements for iterative applications (HHH-iterative) performs 12% better than the basic HHH (Default mode) design. HHH-iterative also outperforms Tachyon by up to 28%.

6.3.8.2 K-Means

We also evaluate our design with MapReduce and Spark-based K-Means from Intel

HiBench [43] with two different data sizes. For both the cases, each of the input records is a 20-dimensional tuple. K-Means starts the clustering with five centroids and runs for

five iterations. These experiments are performed on eight DataNodes on OSU RI Cluster

B. As observed from Figure 6.12(b), for 20Million records, HHH-iterative gains by 8% over HDFS and 6.5% over Tachyon-WTC for MapReduce-based K-Means. For 100Million records, the gain increases to 13% over HDFS and 9% over Tachyon-WTC. As the data size increases, the selective caching and eviction algorithm of HHH-iterative prove to be more effective in minimizing the I/O bottlenecks over both HDFS and Tachyon. For smaller data sizes, Tachyon can cache all the data and HDFS takes advantage of the OS cache.

For Spark-based K-Means, HHH-iterative reduces the execution time by 15% over HDFS and 5.5% over Tachyon for 20Million records. However, for 100Million records, Spark

K-Means failed with out of memory errors.

6.3.9 Summary of Performance Characteristics

To summarize, we present the performances of the above-mentioned workloads in a tabular manner. Table 6.2 shows the performance characteristics of different MapReduce

and Spark workloads over Tachyon (over HDFS) and HHH. In this table, the job execution times of different benchmarks are normalized with respect to that of HHH, and a lower value indicates better performance. As observed from the table, HHH performs better than

Tachyon for all the workloads due to its enhanced designs. For RandomWriter and Sort,

Tachyon-WTC performs better than Tachyon-C. On the other hand, for TeraGen and Tera-

Sort, Tachyon-C outperforms Tachyon-WTC. This is because Tachyon-C does better memory management (caching during read only) for the Spark workloads than Tachyon-WTC.

Table 6.2: Normalized execution times over Tachyon-HDFS and Triple-H (HHH)

            | Hadoop MapReduce      | Spark
            | RandomWriter | Sort   | TeraGen | TeraSort
Tachyon-WTC | 1.89         | 1.23   | 2.36    | N/A
Tachyon-C   | 1.95         | 1.36   | 2.29    | 1.34

Table 6.3: Normalized execution times over Tachyon-Lustre and Triple-H (HHH-L)

            | Hadoop MapReduce       | Spark
            | Grep | MR-MSPolygraph  | WordCount
Tachyon-WTC | 1.08 | 1.01            | 1.02

Table 6.3 shows the normalized execution times (w.r.t. HHH-L) of MapReduce and

Spark workloads over Tachyon (over Lustre) and HHH-L. As observed from the table, although the performance of Tachyon is similar to that of HHH-L, HHH-L still performs better than Tachyon in all cases.

For iterative applications like K-Means, HHH gains over both HDFS and Tachyon. For smaller data sizes, Tachyon gets to cache the entire dataset for subsequent iterations and thus shows higher gains than for larger ones. However, CPU-bound workloads do

not show as much benefit as I/O-intensive ones over in-memory file systems. In terms of

tolerance to faults during application execution, both HHH and Tachyon can recover from

N − 1 (N = replication factor) node failures, similar to HDFS. MapReduce and Spark

frameworks need major revisions to leverage lineage of Tachyon.

6.4 Related Work

With the increasing use of HPC clusters for data-intensive computing, much research is being dedicated to harnessing the power of their heterogeneous storage devices to minimize the I/O bottlenecks. In the previous chapter, we introduced Triple-H, a novel design

for HDFS on HPC clusters with heterogeneous storage architecture. The use of hybrid stor-

age usage patterns, RAMDisk and SSD-based Buffer-Cache, and effective data placement

policies ensures efficient storage utilization and has proven to enhance the performance of

HDFS write by 7x and HDFS read by 2x. Yet another in-memory file system, Tachyon [54],

has shown promise in achieving high throughput writes and reads, without compromising

fault tolerance. Tachyon enables in-memory storage while providing fault tolerance by

leveraging the well-known technique of lineage in the storage layer, and addresses timely data recovery from failures using asynchronous checkpointing algorithms in the background. But without framework-level modifications, upper-layer middleware cannot take advantage of lineage. Moreover, none of these studies has proposed an enhanced design to accelerate HDFS for iterative applications. In this work, we propose acceleration techniques to adapt Triple-H for iterative MapReduce and Spark jobs.

6.5 Summary

In this work, we have characterized two file systems in the literature, Tachyon and Triple-H, and discussed their impacts on the performance and fault tolerance of Hadoop MapReduce

and Spark applications. We also proposed enhancements in the Triple-H architecture for iterative applications. Our evaluations show that Tachyon is 5x faster than HDFS for primitive operations. On the other hand, complex workloads on Hadoop MapReduce and Spark cannot leverage lineage without major framework-level modifications. Therefore, for such workloads, the performance of Tachyon is bounded by that of the underlying file system.

Triple-H, on the other hand, has improved performance compared to both HDFS and Lustre. Our evaluations show that Triple-H outperforms Tachyon by up to 47% and 2.4x for

Hadoop and Spark workloads, respectively. Triple-H also accelerates K-Means by 15% over HDFS and 9% over Tachyon.

Chapter 7: Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage

Hadoop MapReduce and Spark access data from HDFS considering data locality only.

Figure 7.1 shows the default data access scheme used by Hadoop and Spark in HDFS. HPC clusters host heterogeneous nodes with different types of storage devices. For such clusters, different replicas of a data block reside in different types of storage devices. But while accessing the data, HDFS does not take into account the type of storage the data is in. It is only the topological distance of the data that is considered by HDFS. Moreover, the interconnect in

HPC systems also provides low-latency and high-throughput communication. Recent studies [57, 71, 73, 74] have demonstrated the performance improvement of different Big Data middleware by taking advantage of HPC technologies. As an example, Triple-H proposed in Chapter 5 leverages the benefits from Remote Direct Memory Access (RDMA) over

InfiniBand and heterogeneous storage to deliver better performance for Hadoop and Spark workloads. Recently, Cloudera also stressed the importance of storage types while reading data from HDFS [5]. The community has also been talking about locality- and caching-aware task scheduling for both Hadoop and Spark [6, 8]. But Big Data frameworks like

Hadoop and Spark can use a variety of schedulers, like Mesos [2], YARN [98], etc., to run their applications. In order to add support for storage types as well as locality, each of these schedulers has to be modified. On the other hand, most of the Big Data middleware

use HDFS as the underlying file system. As a result, if the concept of storage types can be introduced (in addition to data locality) while reading data from HDFS, it saves a lot of effort for the framework programmer. It is, therefore, critical to rethink the data access strategies for Hadoop and Spark from HDFS on HPC clusters.

Figure 7.1: Default data access strategy

In this work, we address these challenges by proposing enhanced data access strategies for Hadoop and Spark on HPC clusters with heterogeneous storage characteristics. We also present efficient data placement policies for such platforms. For this we re-design HDFS to accommodate the proposed access and placement strategies so that Hadoop and Spark frameworks can exploit the benefits in a transparent manner.

7.1 Proposed Design

In this section, we propose different data access and placement strategies for Hadoop and Spark on HPC clusters. We also present the associated designs introduced in HDFS in order to realize the enhanced strategies.

Figure 7.2: Proposed data access strategies: (a) Greedy, (b) Hybrid

7.1.1 Data Access Strategies
7.1.1.1 Greedy

In the Greedy strategy, the remote high performance storage devices are prioritized over

local low performance storage. This strategy is particularly suitable for cases in which the

local storage devices on the Hadoop cluster are mostly hard disks. But some nodes in

the cluster are equipped with large memory or SSD and replicas of the data blocks are

also hosted by these nodes in SSD or in-memory storage. For such cases, we propose to

read data from the remote SSD or in-memory storage in a greedy manner. Figure 7.2(a)

demonstrates the Greedy strategy of data access. As illustrated in the figure, the tasks

launched in DataNode1 and DataNodeN, instead of reading data from the local storage de-

vices (HDDs), access the data in SSD storage in DataNode2 over the interconnect. Tasks

launched in DataNode2 read data locally as they see that the local storage has higher per-

formance than those on most of the other nodes in the cluster. When the performance gaps

among the storage devices are large and the underlying interconnect is fast, this method

performs better for large data sizes.
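A minimal sketch of this greedy selection rule is given below. The Replica type, the storage-type ordering, and the tie-breaking on locality are assumptions made for illustration and do not correspond to the actual HDFS selection code.

import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the Greedy access strategy: prefer the replica on the fastest
// storage type, falling back to the local copy only as a tie-breaker.
public class GreedyReplicaSelector {

    public enum StorageType { RAM_DISK, SSD, HDD } // ordered fastest to slowest

    public static final class Replica {
        final String dataNode;
        final StorageType storage;
        final boolean isLocal;
        public Replica(String dataNode, StorageType storage, boolean isLocal) {
            this.dataNode = dataNode;
            this.storage = storage;
            this.isLocal = isLocal;
        }
    }

    /** Pick the replica to read: fastest storage first, local copy preferred on ties. */
    public Replica select(List<Replica> replicas) {
        return replicas.stream()
                .min(Comparator
                        .comparingInt((Replica r) -> r.storage.ordinal())
                        .thenComparingInt(r -> r.isLocal ? 0 : 1))
                .orElseThrow(() -> new IllegalArgumentException("no replica available"));
    }
}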

7.1.1.2 Hybrid

In the Hybrid strategy, data is read from both local as well as remote storage devices.

Figure 7.2(b) depicts the Hybrid read strategy. Considering the topology, the bandwidth of the underlying interconnect and available storage devices, we propose different variations of the Hybrid read strategy. These are as follows:

Balanced: The purpose of this data access strategy is to distribute the load among the nodes

in the cluster. In this strategy, some of the tasks read data from the local storage devices

(irrespective of the type of the storage) and the rest read from the remote high performance

storage. The percentage of tasks to read from local vs. remote storage can be indicated by

a configuration parameter.

Network Bandwidth-driven (NBD): In this scheme, tasks send read requests to remote nodes with high performance storage. The number of read requests that can be efficiently served by the remote node is bounded by the NIC bandwidth. Therefore, when the remote node reaches the saturation point of its network bandwidth, it sends a Read_Fail message

to the client. The client then reads the data available in its local node (if any), irrespective

of the type of storage. Otherwise (for rack-local tasks), the read request is re-sent to another

node that hosts a replica of the desired data.

As depicted in Figure 7.3, when Client3 sends the Read_Req to the DataNode at time

t3, it is already serving the requests from Client1 and Client2. The network bandwidth

is saturated (by assumption) at this point and, therefore, the DataNode sends a Read_Fail

message to Client3.

Figure 7.3: Overview of NBD Strategy

Storage Bandwidth-driven (SBD): In this scheme, tasks send read requests to remote

nodes with high performance storage. The number of read requests that can be efficiently

served by the remote node is bounded by the storage bandwidth of that node. Therefore,

when the remote node reaches the saturation point of its storage bandwidth, it sends a

Read_Fail message to the client. The client then reads the data available on its local node, irrespective of the type of storage, or retries the read from another node.

Data Locality-driven (DLD): This strategy is a trade-off between the topological distances among nodes and the high bandwidth offered by the high performance storage devices like

SSD and RAM Disk. In this scheme, the client reads data from the local storage devices as long as the data is available there. In this scenario, the client does not take into account the types of storage devices on the local node. If the data is not found locally, the client chooses to read the replica from a remote node that has high performance storage. If multiple replicas are hosted by different remote nodes having the same type of storage device, the client chooses the node (with high performance storage) that is topologically the closest.
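The following sketch illustrates the DLD decision, under the assumption that the replica list arrives topologically sorted from the NameNode (distance 0 = node-local); the Replica type is a hypothetical placeholder for the actual HDFS located-block information.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the Data Locality-driven (DLD) strategy described above.
public class DldReplicaSelector {

    public enum StorageType { RAM_DISK, SSD, HDD }

    public static final class Replica {
        final String dataNode;
        final StorageType storage;
        final int topologicalDistance; // 0 = node-local, 1 = rack-local, 2 = off-rack
        public Replica(String dataNode, StorageType storage, int topologicalDistance) {
            this.dataNode = dataNode;
            this.storage = storage;
            this.topologicalDistance = topologicalDistance;
        }
    }

    public Replica select(List<Replica> replicas) {
        // 1. A local replica wins regardless of its storage type.
        Optional<Replica> local = replicas.stream()
                .filter(r -> r.topologicalDistance == 0)
                .findFirst();
        if (local.isPresent()) return local.get();

        // 2. Otherwise prefer high performance storage; break ties by topological distance.
        return replicas.stream()
                .min(Comparator
                        .comparingInt((Replica r) -> r.storage.ordinal())
                        .thenComparingInt(r -> r.topologicalDistance))
                .orElseThrow(() -> new IllegalArgumentException("no replica available"));
    }
}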

7.1.2 Data Placement Strategies

For HPC clusters with heterogeneous storage characteristics, we propose two types of placement like those in Triple-H as discussed in Chapter 5.1.3. But the Triple-H design assumes a homogeneous cluster in terms of the storage devices. Therefore, in this work, before the data placement is done, Hadoop or Spark tasks identify the storage types along with the DataNodes that host them. Our proposed placement strategies are:

Greedy: This strategy places data in DataNodes having high performance storage like

RAM Disk in a greedy manner. In order to guarantee fault tolerance, it stores two replicas in RAM Disk and one in SSD. In this way, this strategy makes sure one of the replicas is synchronously persisted. The copies in RAM Disk are persisted in a lazy manner.

Balanced: This strategy balances the load of data placement between two types of high performance storage devices, namely, RAM Disk and SSD. As a result, some of the tasks store data in RAM Disk and the others in SSD. Depending on how performance-critical a job is, the percentage of tasks to store data in RAM Disk (or SSD) can be selected.
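A minimal sketch of the two placement choices follows. How the Balanced strategy maps a task id to a storage tier, and the exact replica mix beyond what is described above, are assumptions of this sketch; selecting the concrete DataNodes that host each storage type is omitted.

// Hypothetical sketch of replica-tier selection for the proposed placement strategies
// (replication factor 3).
public class PlacementStrategy {

    public enum StorageType { RAM_DISK, SSD }

    /** Greedy: two replicas in RAM Disk (lazily persisted) and one in SSD (synchronously persisted). */
    public StorageType[] greedyTargets() {
        return new StorageType[] { StorageType.RAM_DISK, StorageType.RAM_DISK, StorageType.SSD };
    }

    /**
     * Balanced: split the load between RAM Disk and SSD; a user-supplied percentage decides
     * which fraction of the tasks stores its data in RAM Disk.
     */
    public StorageType balancedPrimaryTarget(int taskId, int ramDiskPercent) {
        return (taskId % 100) < ramDiskPercent ? StorageType.RAM_DISK : StorageType.SSD;
    }
}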

7.1.3 Design and Implementation

When Hadoop, Spark, or similar upper layer middleware store data to HDFS, they send write requests to the HDFS Client side. The client contacts the NameNode to select the desired DataNodes (for three replicas) to place the replicas. On the other hand, when these frameworks access data from HDFS, they send read requests for the required files to the HDFS client side. The client contacts the NameNode to get the list of blocks for the requested file. The client then chooses the best DataNode to read the blocks (each

file is divided into multiple blocks) from. In order to incorporate the proposed enhanced placement/read strategies, we introduce some new components in HDFS. Figure 7.4 shows the architecture of our proposed design.

Figure 7.4: Proposed design

The new design components introduced in HDFS client side are:

Placement/Access Strategy Selector: HDFS can run with any of the proposed placement/access strategies based on a user-provided configuration parameter. The client retrieves the selected strategy and passes it to the DataNode Selector.

DataNode Selector: The DataNode selector chooses the best DataNode to store/read a block to/from based on the selected strategy and the available storage types in the cluster.

For placement, the client passes the selected strategy to the NameNode and gets the list of DataNodes that host the required types of storage devices. In order to select the best

DataNode to read a block from based on the selected strategy and the available storage types in the cluster, we incorporate two more components here:

1. StorageType Fetcher: The DataNode selector passes the list of DataNodes that hold

the replicas of a block to the StorageType Fetcher. This module gets the storage types

of the replicas from the NameNode. The NameNode has information of the storage

types of all the replicas. The DataNode, after storing a block, sends the storage type

to the NameNode in the block reports.

2. Locality Detector: The Locality Detector gets the list of nodes that host the replicas

of the requested blocks and determines if the data is available locally or not. The

list of DataNodes for a block is returned by the NameNode in a topologically sorted

manner (since most of the blocks are replicated to three DataNodes by default, the

length of this list is three) and, therefore, the first node in the list is the local node for

data-local tasks.

Weight Distributor: This module is activated when the user selects the Balanced strat-

egy. It gets the user-provided percentage and assigns weights (0 or 1) to the launched tasks

accordingly based on their task ids so that the DataNode Selector can select some of the

tasks (having weight 0) to read data locally and the rest (with weight 1) are redirected to

the remote nodes with high performance storage devices.

In order to realize the proposed data access strategies, we modify the existing DataNode

selection algorithm in HDFS by supplying the parameters obtained from Locality Detec-

tor, StorageType Fetcher, and Weight Distributor.
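The sketch below illustrates, at a high level, how the Weight Distributor output could feed the modified selection step for the Balanced strategy. The class and parameter names are hypothetical and only stand in for the parameters supplied to the DataNode selection algorithm.

// Hypothetical sketch: weight 0 -> read locally, weight 1 -> redirect to a remote node
// holding a replica on high performance storage.
public class BalancedSelection {

    /** Weight Distributor: assign a 0/1 weight from the task id and the user-provided percentage. */
    public static int weightFor(int taskId, int remoteReadPercent) {
        return (taskId % 100) < remoteReadPercent ? 1 : 0;
    }

    /**
     * Modified selection step: localReplica may be null when the task is only rack-local;
     * fastRemoteReplica is the topologically closest replica on RAM Disk/SSD (from the
     * Locality Detector and StorageType Fetcher outputs).
     */
    public static String chooseDataNode(int weight, String localReplica, String fastRemoteReplica) {
        if (weight == 0 && localReplica != null) {
            return localReplica;          // read locally, irrespective of storage type
        }
        return fastRemoteReplica;         // redirected to remote high performance storage
    }
}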

We choose the HDFS client side rather than the NameNode to host these components

because if each client has the knowledge of the access strategy and gets the list of DataN-

odes (and storage types) that store the replicas of the blocks, each of them can compute

the desired DataNode in parallel through the DataNode selection algorithm. Otherwise,

the NameNode has to determine the desired DataNode for each client, which would be a

bottleneck for large number of concurrent clients.

The design components introduced in the DataNode side are:

Placement Engine: Stores the data block in the desired type of storage based on the

selected placement strategy.

116 Connection Tracker: This component is enabled for the NBD scheme. It keeps track of the number of read requests Nc accepted by the DataNode. After Nc reaches a user-supplied threshold Tc (= Number of connections that saturate the network bandwidth), the DataNode does not accept any more read requests. When Nc becomes less than Tc due to some read requests being completed, the DataNode starts accepting read requests again.

Storage Monitor: This component is enabled for the SBD scheme. When the DataNode receives a read request, a reader thread is spawned to perform the I/O. This thread reads the data block from the storage device and sends it to the client. The Storage Monitor keeps track of the number of reader threads Nr spawned in the DataNode. After Nr reaches a user-supplied threshold Tr (= the number of concurrent readers that saturate the storage bandwidth), the DataNode starts sending Read_Fail messages to the clients.
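A minimal sketch of this DataNode-side admission control is shown below; the class name and the use of atomic counters are illustrative choices rather than the actual implementation, with Tc and Tr corresponding to the user-supplied saturation thresholds.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the Connection Tracker (NBD) and Storage Monitor (SBD) logic.
public class ReadAdmissionControl {

    private final int networkThreshold;  // Tc: connections that saturate the network bandwidth
    private final int storageThreshold;  // Tr: readers that saturate the storage bandwidth
    private final AtomicInteger activeConnections = new AtomicInteger(); // Nc
    private final AtomicInteger activeReaders = new AtomicInteger();     // Nr

    public ReadAdmissionControl(int networkThreshold, int storageThreshold) {
        this.networkThreshold = networkThreshold;
        this.storageThreshold = storageThreshold;
    }

    /** NBD: accept a read request only while Nc < Tc; otherwise the client receives Read_Fail. */
    public boolean tryAcceptConnection() {
        if (activeConnections.incrementAndGet() <= networkThreshold) return true;
        activeConnections.decrementAndGet();
        return false; // caller replies with Read_Fail
    }

    public void connectionClosed() { activeConnections.decrementAndGet(); }

    /** SBD: admit a reader thread only while Nr < Tr; otherwise the client receives Read_Fail. */
    public boolean tryStartReader() {
        if (activeReaders.incrementAndGet() <= storageThreshold) return true;
        activeReaders.decrementAndGet();
        return false; // caller replies with Read_Fail
    }

    public void readerFinished() { activeReaders.decrementAndGet(); }
}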

When the Greedy or DLD scheme is selected by the user, the StorageType Fetcher and

Locality Detector along with the enhanced DataNode selection algorithm can select the appropriate DataNodes to read the blocks from.

7.2 Performance Evaluation

In this section, we present a detailed performance evaluation of Hadoop MapReduce and Spark workloads over HDFS using our proposed design. We evaluate different workloads using our enhanced data access strategies and compare them with those using the default read scheme. For our evaluations, we use Hadoop 2.6.0 and Spark

1.6.0, Hive 1.1.0, BigDataBench V3.2, and the Postgres database REL9_1_22. In all the experiments, we mention the number of DataNodes as the cluster size and set up half of the

DataNodes with SSD storage while the rest have only HDD. For the Balanced strategy, we use a 50%-50% ratio of local vs. remote (high performance storage) reads. We incorporate our proposed SSD-RAM Disk-based placement policies along with the access strategies

in the RDMA-Enhanced designs proposed in Chapter 3 and Chapter 4. In the graphs, the

bars corresponding to this design are indicated by the RDMA prefix. On the other hand,

for Disk-SSD-based placement we use the existing One SSD policy in default Hadoop over

IP-over-InfiniBand (IPoIB).

In this study, we perform the following experiments:

1. Evaluations with RandomWriter and TeraGen, 2. Evaluations with TestDFSIO, 3. Evaluations with Hadoop Sort and TeraSort, 4. Evaluations with Spark Sort and TeraSort, and

uations with Hadoop Sort and TeraSort, 4. Evaluations with Spark Sort and TeraSort, and

5. Evaluations with Hive and Spark SQL.

For Hadoop Sort and TeraSort, we use the benchmark implementations that come with

the Apache Hadoop distribution. For Spark workloads, we use the implementations pro-

vided by Intel HiBench [9].

7.2.1 Experimental Setup

The experiments are performed on four different clusters. They are:

OSU RI Cluster A: The cluster configurations are described in Chapter 3.2.

OSU RI Cluster B: The cluster configurations are described in Chapter 3.2.

OSU RI2: There are 17 storage nodes in this cluster. These nodes are equipped with two fourteen-core Xeon E5-2680 v4 2.4 GHz processors. Each node is equipped with 512 GB RAM and one 2 TB HDD. Each node is also equipped with MT4115 EDR ConnectX HCAs (100

Gbps data rate). The nodes are interconnected with two Mellanox EDR switches. Each node runs CentOS 7.

SDSC Comet: Each compute node on SDSC Comet [79] has two twelve-core Intel

Xeon E5-2680 v3 (Haswell) processors, 128GB DDR4 DRAM, and 320GB of local SATA-

SSD with CentOS operating system. The network topology in this cluster is 56Gbps FDR

InfiniBand with rack-level full bisection bandwidth and 4:1 over-subscription cross-rack bandwidth.

Figure 7.5: Evaluation of RandomWriter and TeraGen on OSU RI2: (a) RandomWriter, (b) TeraGen

7.2.2 Evaluation with RandomWriter and TeraGen

In order to determine the effectiveness of our proposed placement strategies, we evaluate with data generation benchmarks, RandomWriter and TeraGen. These experiments are performed on OSU RI2. We vary the data sizes from 80GB on 4 DataNodes to 320GB on

16 DataNodes. As observed from Figure 7.5(a), our proposed Greedy placement strategy reduces the execution time of RandomWriter by 38% over the default ONE SSD scheme.

The improvement for the Balanced strategy is 26%.

Figure 7.6: Evaluation of the data access strategies: (a) evaluation of TestDFSIO Read (OSU RI Cluster A and B), (b) comparison among the hybrid access strategies (OSU RI Cluster B)

We run similar experiments with TeraGen. As shown in Figure 7.5(b), our proposed

placement strategies gain by up to 46% (Greedy) and 30% (Balanced) over the default

one. Since the proposed policies perform better selection of the DataNodes with high

performance storage devices, they bring in significant speed up for the data generation

benchmarks.

Figure 7.7: Selecting the optimal number of connections in the NBD strategy on OSU RI Cluster B

Figure 7.8: Selecting the optimal number of readers in the SBD strategy on OSU RI Cluster B

7.2.3 Evaluation with TestDFSIO

In this section, we evaluate the Hadoop TestDFSIO read workload using our proposed data access strategies and compare the performances with those of the default (locality-aware read) scheme.

First, we run our experiments on OSU RI on 16 nodes. Eight of these nodes have SSDs

and the rest have only HDDs. The SSD nodes are selected from OSU RI Cluster B and the

rest of the nodes are on OSU RI Cluster A. Data placement is done following the One SSD placement policy [34]. The experiments are run with eight concurrent maps per node. As observed from Figure 7.6(a), the Greedy scheme increases the read throughput by up to

4x over the default read scheme. The Hybrid read performs even better than the Greedy read, increasing read throughput by up to 11% over the Greedy read. The reason behind this is that when all the clients try to read greedily from the SSD nodes, at one point, the performance becomes bounded by the bandwidth of those nodes. On the other hand, the

Hybrid scheme (DLD) distributes the load across all the nodes, making some clients read locally, while others perform remote reads from the SSD nodes.

Figure 7.9: Evaluation of TestDFSIO on OSU RI2 [(a) TestDFSIO Read on Disk-SSD-based system (24 concurrent tasks), (b) TestDFSIO Read on Disk-SSD-based system (48 concurrent tasks), (c) TestDFSIO Read on SSD-RAM Disk-based system (24 concurrent tasks); y-axis: Average Throughput (MBps)]

Next, we perform experiments to compare different Hybrid read strategies. These ex-

periments are performed on eight nodes on OSU RI Cluster B. Four of the nodes have SSDs

and the rest have only HDDs. The experiments are run with eight concurrent clients per

node. As observed from Figure 7.6(b), the Balanced scheme performs better than the de-

fault read in terms of TestDFSIO read latency. In the presence of many concurrent clients,

distributing the load across multiple nodes in the Balanced scheme helps improve the performance. The DLD approach further improves the execution time by reading locally when available. But when remote reads are performed, the tasks select only the nodes with SSD, rather than reading from the remote HDD nodes. The number of rack-local tasks launched in this experiment is 22. The NBD and SBD strategies initially behave like Greedy but force the clients to read locally when the network (NBD) or storage (SBD) bandwidth is saturated.

In order to determine the optimal number of connections supported by a DataNode in the NBD scheme, we run the OSU-MPI-Multi-Pair-Bandwidth (inter-node) [17] test on

OSU RI Cluster B with a message size of 512KB (HDFS packet size) and find that four connections provide optimal performance, as depicted in Figure 7.7, since the bandwidth is maximized at this point. Similarly, to find the optimal number of I/O threads that saturate the disk bandwidth, we run the IOzone test on the SSD nodes with a varying number of reader threads. Each thread reads a file (one file per thread) of size 128MB (HDFS block size) with a record size of 512KB. Figure 7.8 shows the results of our IOzone experiments.

As observed from the figure, four concurrent readers maximize the total read throughput.

Therefore, for the SBD approach, we use four reader threads as the threshold on the DataNode side. The reason the NBD and SBD approaches perform worse than Balanced is that they force local reads under failure (local reads occur only when remote reads fail), whereas the Balanced approach does not have to go through the overhead of failed reads. For NBD and SBD, the loads on the DataNodes keep changing during the course of the job because, when the requested blocks are sent to the client, the connection with that client is closed by the DataNode. Therefore, we send each read request (even after receiving a Read Fail) from the HDFS client to the DataNode with high performance storage so that the client can read data from the remote storage once the DataNode is able to accept further requests.
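To make the above decision logic concrete, the following is a minimal Java sketch of a threshold-based replica selection in the NBD/SBD spirit. The thresholds follow the measured saturation points (four connections, four readers), but the class, method, and parameter names are hypothetical; the actual client-side code is not reproduced in this work.

```java
// Hypothetical sketch of the NBD/SBD read decision; not the actual HDFS/NVFS code.
public class HybridReadSelector {
    static final int NBD_LIMIT = 4;  // connections that saturate network bandwidth (Figure 7.7)
    static final int SBD_LIMIT = 4;  // reader threads that saturate SSD bandwidth (Figure 7.8)

    /** Returns the host the client should read a block replica from. */
    String selectReplica(String localNode, String remoteSsdNode,
                         int activeConnections, int activeReaders, boolean useNbd) {
        boolean saturated = useNbd ? activeConnections >= NBD_LIMIT
                                   : activeReaders >= SBD_LIMIT;
        // Greedy by default: prefer the remote high performance (SSD) node and
        // fall back to the local replica only once the remote node is saturated.
        return saturated ? localNode : remoteSsdNode;
    }
}
```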

Figure 7.10: Evaluation of Hadoop MapReduce workloads on OSU RI2 [(a) Sort, (b) TeraSort; y-axis: Execution Time (s), x-axis: Cluster Size:Data Size (GB); schemes: Default, Balanced, DLD, RDMA-Default, RDMA-Balanced, RDMA-DLD]

We also evaluate the performance of TestDFSIO read on eight nodes on Cluster OSU

RI2. As observed from Figure 7.9(a), the read throughput is maximized by the DLD scheme. Both DLD and Balanced perform better than the default read approach in HDFS with DLD offering a benefit of up to 33%; DLD also outperforms Balanced by almost 20%.

These tests are run with 24 concurrent tasks per node. In these tests, 20 rack-local tasks were launched. The figure clearly shows the benefits of the proposed access schemes. This further proves that in the presence of rack-local tasks (not data-local), it is always more efficient to read from the high performance storage devices through the proposed access schemes, rather than accessing data in the default approach. Even though the DLD strategy is performing better than Balanced for 24 concurrent tasks, Balanced does better than DLD for 48 concurrent tasks, which is evident from Figure 7.9(b). As the number of concurrent tasks increases, balancing the load across the nodes in the cluster starts performing better.

This is because, when 48 concurrent tasks read from the local disks, the disk bandwidths are easily saturated for the DLD scheme. Increasing the number of SSD nodes from 4 to 6 also shows a similar trend, with the Balanced strategy performing up to 13.5% better than DLD.

For six SSD DataNodes, even though the SSD-resident blocks are spread across more nodes, the number of blocks going to SSD remains the same for the same data size, as the data was generated with the One SSD scheme. As a result, the absolute values of the throughputs increase as shown in Figure 7.9(c), but Balanced still does better than DLD. We perform similar TestDFSIO experiments with 24 concurrent tasks per node on an SSD-RAM Disk-based cluster on 8 DataNodes on RI2. The data was written to HDFS following the Greedy placement policy proposed in Section 7.1.2. As shown in Figure 7.9(c), both Balanced and DLD yield higher throughput than the Default scheme. But DLD outperforms Balanced by 7% for 480GB data size on 8 nodes. When the number of concurrent tasks is increased to 48, Balanced becomes better than DLD. Another observation from Figure 7.9(c) is that the performance gaps between the proposed access schemes are not as high as for the Disk-SSD-based system. This is because, on all the nodes, the tasks get high performance storage like PCIe SSD and RAM Disk locally, and hence the gap is smaller here.

Figure 7.11: Evaluation of Hadoop MapReduce workloads on SDSC Comet [(a) Sort, (b) TeraSort; y-axis: Execution Time (s), x-axis: Cluster Size:Data Size (GB); schemes: Default, Balanced, RDMA-Default, RDMA-Balanced]

7.2.4 Evaluation with Hadoop MapReduce Sort and TeraSort

In this section, we evaluate our design with the MapReduce Sort benchmark on OSU

RI2. For these experiments, the data size is varied from 80GB on 4 nodes to 320GB on

16 nodes. The experiments are run with 24 concurrent tasks per node. As observed from

Figure 7.10(a), the Balanced read strategy demonstrates optimal performance for the Sort benchmark and reduces the execution time by up to 32% for 320GB data size on 16 nodes

for the Disk-SSD-based cluster. For the SSD-RAM Disk-based system with the Greedy data placement policy, the Balanced strategy outperforms the default scheme by 23%. Even though the DLD approach performs better than the default read strategy, Balanced works better than DLD. The reason is that the Balanced scheme distributes the I/O load across multiple nodes while prioritizing the high performance storage nodes; therefore, in the presence of concurrent tasks, it helps minimize the contention. Since Sort has equal amounts of I/O, shuffle, and computation, minimizing the I/O pressure accelerates the other phases of the benchmark; the Balanced approach therefore performs better than DLD.

Figure 7.10(b) shows the results of our evaluations with Hadoop TeraSort. For these experiments, the data size is varied from 80GB on 4 nodes to 320GB on 16 nodes. The experiments are run with 24 concurrent tasks per node on Cluster B. The figure clearly shows that the Balanced read strategy reduces the execution time of TeraSort by up to 11% for 320GB data size on 16 nodes for Disk-SSD-based system. For SSD-RAM Disk-based system, the Balanced strategy outperforms the default data access strategy by up to 10%.

We further evaluate our design using Hadoop Sort and TeraSort benchmarks on SDSC

Comet. On this cluster, we vary the data size from 25GB on 4 nodes to 100GB on 16 nodes.

We use the Balanced strategy for data access here. As observed from Figure 7.11(a), our design reduces the execution time of the Sort benchmark by up to 21% for 100GB data size on 16 nodes for the Disk-SSD-based system. For the SSD-RAM Disk-based system, the gain of the Balanced strategy is 13% over the default one. The benefit comes from the enhanced data access strategy that accelerates the operations by balancing out the I/O load. Similarly, our experiments with TeraSort on Cluster C yield an improvement of

10% over default HDFS on Disk-SSD-based system. On SSD-RAM Disk-based system, our gain is 12%. Figure 7.11(b) shows the results of our evaluations with TeraSort.

Figure 7.12: Evaluation of Spark workloads on OSU RI2 [(a) Sort, (b) TeraSort; y-axis: Execution Time (s), x-axis: Cluster Size:Data Size (GB); schemes: Default, Balanced, DLD, RDMA-Default, RDMA-Balanced, RDMA-DLD]

7.2.5 Evaluation with Spark Sort and TeraSort

In this section, we evaluate our design with the Spark Sort benchmark on OSU RI2. For these experiments, the data size is varied from 80GB on 4 nodes to 320GB on 16 nodes. The experiments are run with 24 concurrent tasks per node. As observed from Figure 7.12(a), the Balanced read strategy demonstrates optimal performance for the Sort benchmark and reduces the execution time by up to 17% for 320GB data size on 16 nodes on Disk-SSD- based system. For SSD-RAM Disk-based system, the gain for the Balanced strategy is 7%.

Even though the DLD approach performs better than the default read strategy, Balanced works better than DLD.

Figure 7.12(b) shows the results of our evaluations with Spark TeraSort. For these experiments, the data size is varied from 80GB on 4 nodes to 320GB on 16 nodes. The experiments are run with 24 concurrent tasks per node on OSU RI2. The figure clearly shows that the Balanced read strategy reduces the execution time of TeraSort by up to 9% for 320GB data size on 16 nodes for the Disk-SSD-based cluster. On the other hand, the gain on the SSD-RAM Disk-based system is 7% over the default scheme.

We further evaluate our design using the Spark Sort and TeraSort benchmarks on SDSC

Comet. On this cluster, we vary the data size from 25GB on 4 nodes to 100GB on 16

nodes. We use the Balanced strategy for data access here. As observed from Figure 7.13(a), our design reduces the execution time of the Sort benchmark by up to 14% for 100GB data size on 16 nodes for the Disk-SSD-based system. For the SSD-RAM Disk-based cluster, our gain is 13%. Similarly, our experiments with Spark TeraSort on SDSC Comet yield an improvement of 9% over default HDFS. Figure 7.13(b) depicts the results of our evaluations with Spark TeraSort.

Figure 7.13: Evaluation of Spark workloads on SDSC Comet [(a) Sort, (b) TeraSort; y-axis: Execution Time (s), x-axis: Cluster Size:Data Size (GB); schemes: Default, Balanced, RDMA-Default, RDMA-Balanced]

7.2.6 Evaluation with Hive and Spark SQL

In this section, the proposed design is evaluated in a Disk-SSD-based system with Hive aggregation query on four DataNodes on OSU RI Cluster B. For this, we use the workload and datasets [63] available from Brown University. We perform the experiment with 33GB data that are written to HDFS according to the One SSD policy. After that, the aggregation query is launched as a MapReduce job with 136 map and 142 reduce tasks and we use the

Balanced data access scheme. The query takes a total time of 995.25s to complete in default

HDFS. The proposed data access schemes improve the performance by 8.1%, resulting in an execution time of 914.56s. The benefit comes from the efficient access strategy that improves the data access speed of the aggregation workload by spreading the I/O load among the DataNodes.

Figure 7.14: Evaluation of Spark SQL on SDSC Comet [y-axis: Execution Time (s), x-axis: Data Size (GB); schemes: Default, Balanced, RDMA-Balanced]

Next, we evaluate Spark SQL with our proposed design. For this, we use the Select query from BigDataBench. We run this experiment on eight DataNodes on SDSC Comet and vary the data size from 48GB to 80GB. We use the proposed Greedy placement pol- icy along with the Balanced data access scheme on SSD-RAM Disk-based setup. Spark runs in the standalone mode with full subscription (24 concurrent tasks per node). We use

96GB memory for each Spark Worker. First, we run the Postgres database then start Hive and Spark. Figure 7.14 shows the results of our evaluations. As observed from the figure, our proposed design improves the performance of the Select workload by up to 30% over

default HDFS. The select query reads data from HDFS and finally writes the selected data

into a table in HDFS. Therefore, this query benefits from both our placement and ac-

cess schemes as indicated by the RDMA-Balanced bar in the graph. On the other hand, the

benefit for the same workload when only the Balanced access is enabled (data placement

via Greedy strategy is disabled), is up to 12% over the default data access scheme. Since

the Select query involves both HDFS read and write, it benefits most when the proposed

strategies for both placement and access are enabled.

7.3 Related Work

In Chapter 5, we introduced Triple-H, a novel design for HDFS on HPC clusters with heterogeneous storage architecture. The use of hybrid storage patterns, RAM Disk and SSD-based Buffer-Cache, and effective data placement policies ensures efficient storage utilization and has proven to significantly enhance the performance of Hadoop. In

Chapter 6, we have shown that Triple-H can accelerate the performance of Spark as well as iterative workloads. Yet another in-memory file system, Tachyon [54], has shown promise in achieving high throughput writes and reads without compromising fault tolerance, via lineage. Tachyon also supports the usage of heterogeneous storage devices [15]. All of these above-mentioned file systems assume that each node in the cluster has a homogeneous storage architecture with RAM Disk, SSD, and HDD. But on many occasions, HPC clusters are equipped with nodes having heterogeneous storage characteristics, i.e., not all the nodes have the same type of storage devices. This chapter addresses the challenges associated with obtaining high performance in placement and access of data for Hadoop and

Spark on such clusters.

7.4 Summary

In this work, we proposed efficient data placement and access strategies for Hadoop and Spark considering both data locality and storage types. For this, we re-designed HDFS to accommodate the enhanced placement (access) strategies for writing (reading) data on heterogeneous HPC clusters. Our evaluations show that the proposed data access strategies can improve the read performance of HDFS by up to 33% compared to the default locality-aware data access. The performances of the data generation benchmarks are improved by up to 46% by the proposed data placement schemes. The execution times of Hadoop and Spark Sort are also reduced by up to 32% and 17%, respectively. The performances of Hadoop

and Spark TeraSort are also improved by up to 11% by our design. The proposed designs further improve the performances of Hive and Spark SQL.

Chapter 8: High Performance Design for HDFS with Byte-Addressability of NVM and RDMA

HDFS stores the data files in local storage on the DataNodes. For performance-sensitive applications, in-memory storage is being increasingly used for HDFS on HPC systems [34].

Even though HPC clusters are equipped with large memory per compute node, using this memory for storage can lead to degraded computation performance due to competition for physical memory between computation and I/O. The huge memory requirements of applications from diverse domains, such as deep learning, neuroscience, and astrophysics, make memory contention a critical issue and call for rethinking the deployment of in-memory file systems on

HPC platforms. As indicated in a recent trace from Cloudera [23], 34% of jobs have outputs as large as their inputs. The performance of such write-intensive jobs is bound by the data placement efficiency of the underlying file system. Triple-H, proposed in Chapter 5, along with some recent studies [34, 54], has proposed in-memory data placement policies to increase the write throughput. But in-memory data placement makes the task of persistence challenging. NVMs, being non-volatile, promise new opportunities for persistence and fault tolerance. NVMs can not only augment the overall memory capacity, but also provide persistence while bringing in significant performance improvement.

Besides, HPC clusters are usually equipped with high performance interconnects and protocols, such as RDMA. Non-volatile memory (NVM) technology is also emerging and

making its way into HPC systems. Since the initial design considerations for HDFS focused on the technologies mostly used in commodity clusters, extracting performance benefits through the usage of advanced storage technologies, such as NVMs along with RDMA, requires reassessment of those design choices for different file system operations.

Figure 8.1: Performance Characteristics of NVM and SSD [(a) TestDFSIO Write, (b) TestDFSIO Read: total throughput (MBps) vs. number of concurrent maps per node for SATA, PCIe, and NVMe SSD; (c) I/O Time (ms) for an HDFS block on PCIe SSD, NVMe SSD, and NVRAMDisk]

Moreover, the use of NVMs is becoming popular in NVMe SSDs. Figures 8.1(a) and 8.1(b) show the performance comparisons of NVMe and traditional PCIe SSD with those of SATA SSD. As observed from the figures, both NVMe and PCIe SSD significantly improve performance over SATA SSD. On the other hand, even though NVMe standardizes the interface of SSDs, there is little performance increase over traditional PCIe ones. Figure 8.1(c) illustrates the I/O times for a single HDFS block across different storage devices. The figure clearly shows that the I/O time reduces by 12% when the block is stored to a RAMDisk backed by NVM (which we call NVRAMDisk). But even this is not the optimal use of NVM, as it cannot fully exploit the byte-addressability and obtain peak performance. All these lead us to the following broad challenges:

1. What are the alternative techniques to use NVM for HDFS?

2. Can RDMA-based communication in HDFS leverage NVM on HPC systems?

3. How can we re-design HDFS to fully exploit the byte-addressability of NVM to

improve the I/O performance?

4. Can we propose advanced, cost-effective accelerating techniques for Spark and HBase

to utilize the NVM in HDFS? How can we adapt the NVM-based HDFS design to

be used as a burst buffer layer for running Spark jobs over parallel file systems like

Lustre?

The rest of this chapter describes and evaluates the NVM-based HDFS design, which we call NVFS (NVM- and RDMA-aware HDFS).

8.1 Design Scope

In this section, we discuss our design scope.

Figure 8.2: NVM for HDFS I/O [(a) Block Access: the HDFS writer/reader uses file system APIs through the NVMe interface and driver over PCIe, (b) Memory Access: the HDFS writer/reader uses memory semantics through a direct memory interface over PCIe/DDR/others]

8.1.1 NVM as HDFS Storage

In this section, we discuss different design alternatives to incorporate NVMs for HDFS

storage. Typically, NVM can be accessed in either of two modes:

Figure 8.3: NVM for HDFS over RDMA [(a) D-to-N over RDMA, (b) N-to-D over RDMA, (c) N-to-N over RDMA; each panel shows whether the RDMADFSClient and RDMADFSServer buffers reside in DRAM or NVM]

Block Access: Figure 8.2(a) shows how NVMs can be used in block access mode for

HDFS. NVM can be mounted as an HDFS data directory in the DataNode side. After the data is received, it can be written to the corresponding HDFS block file via HDFS I/O APIs by the writer threads. Similarly, the readers can read data blocks using the HDFS file Read

APIs. These I/O operations go through the NVMe interface in fixed blocks, just the way

flash and disk storage are accessed. RAMDisk can be backed by NVM instead of DRAM.

HDFS can access NVM in block mode (via RAMDisk) in such case.

Memory Access (NVRAM): Figure 8.2(b) depicts the way to use NVM in memory access

mode for HDFS. The NVM card usually supports a direct memory interface with memory

mapping and addressing. In this mode, the writer/reader threads on HDFS DataNodes can

use memory semantics like direct load and store to access the data in NVRAM. But this

requires the HDFS storage engine to be re-designed using memory semantics instead of file

system APIs.
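The following Java sketch contrasts the two access modes described above. It is illustrative only: the mount point /mnt/nvm is a hypothetical path for a block-mode NVM device, and a direct ByteBuffer stands in for a true NVRAM mapping in memory mode.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Illustrative contrast of the two NVM access modes; not NVFS code.
public class NvmAccessModes {

    // Block access: NVM is mounted as a file system and data goes through
    // the regular file I/O APIs (and the NVMe block interface underneath).
    static void blockWrite(String blockFile, byte[] packet) throws IOException {
        try (FileOutputStream out = new FileOutputStream(blockFile, true)) {
            out.write(packet);
        }
    }

    // Memory access: the writer uses load/store (put/get) semantics on a
    // byte-addressable region instead of file system calls.
    static void memoryWrite(ByteBuffer nvram, byte[] packet) {
        nvram.put(packet);
    }

    public static void main(String[] args) throws IOException {
        byte[] packet = new byte[512 * 1024];                  // one 512KB HDFS packet
        memoryWrite(ByteBuffer.allocateDirect(packet.length), packet);
        blockWrite("/mnt/nvm/blk_0001", packet);               // requires the hypothetical NVM mount to exist
    }
}
```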

8.1.2 NVM for RDMA

In this section, we discuss different design alternatives to incorporate NVM in HDFS

over RDMA. In RDMA-based HDFS designs presented in Chapter 3 and Chapter 4, the

memory for RDMA communication between RDMADFSClient and RDMADFSServer is

allocated from DRAM. The communication library provides APIs to register memory from

DRAM in both the client and server side. Communication also occurs among the DataN-

odes for HDFS replication. Since these data transfers happen in a non-blocking fashion, a

large amount of memory (allocated from DRAM) is needed for communication. Moreover,

many concurrent clients and RDMADFSServer can co-exist in a single machine, which can

lead to huge memory-contention. Thus, RDMA communication takes memory from the

DRAM which otherwise could be used by heavy-weight applications.

The presence of NVMs in the HDFS DataNodes (we assume that each HDFS DataN-

ode has an NVM) offers the opportunity to utilize them for RDMA communication. This

requires the NVM to be used in memory access mode (NVRAM). Figure 8.3 depicts the

three possible alternatives to use NVRAM for RDMA.

DRAM to NVRAM (D-to-N) over RDMA: In this approach, the communication buffers

in RDMADFSClient are allocated from DRAM; the server side registers memory from

NVRAM. After the RDMADFSServer receives data, the data is replicated to the next

DataNode in the pipeline. In this case, the first DataNode acts as a client and allocates memory from DRAM for the sender side buffers. The receiver side (RDMADFSServer) receives data in buffers allocated in NVRAM.

NVRAM to DRAM (N-to-D) over RDMA: In this approach, the communication buffers in RDMADFSClient as well as the buffers for replication are allocated in NVRAM; the

server side registers memory for RDMA communication from DRAM.

NVRAM to NVRAM (N-to-N) over RDMA: This approach allocates memory from NVRAM

in both RDMADFSClient and RDMADFSServer. RDMA-based replication also uses buffers allocated in NVRAM.

The N-to-N over RDMA approach incurs a higher penalty compared to traditional RDMA as well as the other two approaches using NVRAM. The performance characteristics of D-to-N and N-to-D are similar. Moreover, D-to-N does not need NVMs to be present on the client side. Since we want to use more of the NVM space for storage on the DataNode (server) side and each RDMADFSServer has to serve multiple concurrent RDMADFSClients, we follow the D-to-N over RDMA approach in our design.

8.1.3 Performance Gap Analysis

In order to make the opportunities for performance gain with NVM concrete, we evaluated HDFS block I/O times over different memory and storage configurations. For this, we designed a benchmark (in Java) that mimics the HDFS file write/read APIs. This benchmark was run locally on a single node to measure the time required to write/read one HDFS block (128MB) over different storage (or memory) configurations. The block is written in

512KB chunks (HDFS packet size). The NVM is accessed in memory mode (NVRAM).

For performance characterization over DRAM, the chunks are written to (read from) a hash map. Here, we assumed that NVRAM is 10x slower than DRAM in write performance and has similar read performance [61, 64]. Therefore, while calculating the HDFS block

I/O times over NVRAM, we followed the same approach as in DRAM and added extra latency for simulation. On the other hand, for writing (reading) to (from) RAMDisk, PCIe

SSD, or NVMe SSD, we used HDFS file I/O APIs. We also measured the time to write

(read) to (from) NVRAMDisk using the file system APIs, but added extra latency to mimic the NVRAM behavior.
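A minimal Java sketch of this characterization benchmark is shown below. Only the parameters come from the text (128MB block, 512KB packets, a hash map for the DRAM case, and an assumed 10x write slowdown for NVRAM); the structure and names are ours, not the actual benchmark code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the single-node block I/O probe described above.
public class BlockIoProbe {
    static final int BLOCK_SIZE  = 128 * 1024 * 1024;   // one HDFS block
    static final int PACKET_SIZE = 512 * 1024;           // HDFS packet size
    static final double NVRAM_WRITE_SLOWDOWN = 10.0;     // assumed, following [61, 64]

    public static void main(String[] args) {
        Map<Integer, byte[]> dram = new HashMap<>();      // DRAM case: packets go into a hash map
        byte[] packet = new byte[PACKET_SIZE];

        long start = System.nanoTime();
        for (int seq = 0; seq < BLOCK_SIZE / PACKET_SIZE; seq++) {
            dram.put(seq, packet.clone());                // write one 512KB chunk
        }
        double dramMs = (System.nanoTime() - start) / 1e6;

        // NVRAM case: same code path plus a simulated extra latency, since
        // writes are assumed 10x slower while reads match DRAM.
        double nvramMs = dramMs * NVRAM_WRITE_SLOWDOWN;
        System.out.printf("DRAM block write: %.3f ms, simulated NVRAM: %.3f ms%n",
                          dramMs, nvramMs);
    }
}
```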

As observed from Figure 8.4, NVRAM is 20.8x faster than RAMDisk, 24.9x faster than

PCIe SSD, and 24.8x faster than NVMe SSD for HDFS write. For HDFS read, NVRAM is 83.4x faster than RAMDisk, 107.67x faster than PCIe SSD, and 101.53x faster than

Figure 8.4: Performance gaps among different storage devices [write/read execution time (ms) of one HDFS block on DRAM, NVRAM, RAMDisk, NVRAMDisk, PCIe SSD, and NVMe SSD]

NVMe SSD. On the other hand, NVRAMDisk is 1.13x faster than both PCIe and NVMe

SSD for HDFS write; for HDFS read, it is 1.29x and 1.21x faster than PCIe and NVMe

SSD, respectively.

We also calculate the delay (extra latency) to be added per block for HDFS I/O for

NVRAM (compared to DRAM). The block write time for DRAM is 0.499ms; for NVRAM,

it is 4.99ms (NVRAM is 10x slower than DRAM). Therefore, the extra latency to be added

for NVRAM during HDFS write is (4.99 - 0.499) = 4.49ms per block. No extra latency

is added for HDFS block read as NVRAM has similar read performance as DRAM. We

followed this approach because, to the best of our knowledge, none of the available NVM

simulators can work with HDFS so far.

8.2 Proposed Design and Implementation

8.2.1 Design Overview

In this section, we present an overview of our proposed design, NVFS. Figure 8.5

presents its architecture.

NVFS is the high performance HDFS design using NVM and RDMA. For RDMA-

based communication between DFSClient and DataNode and replication among the DataN-

odes, we use the D-to-N over RDMA approach as discussed in Section 8.1.2. The RD-

MASender in the DFSClient sends data over RDMA and uses DRAM for buffer allocation.

Figure 8.5: Architecture of NVFS [applications (Hadoop MapReduce, Spark, HBase) run over the NVM- and RDMA-aware HDFS (NVFS); the DFSClient contains the Writer/Reader, DataStreamer, and RDMA Sender/Receiver, while the DataNode contains the Responder, Replicator, RDMA Receiver, and the NVFS-BlkIO/NVFS-MemIO paths over NVM and SSD]

On the other hand, the RDMAReceiver in the DataNode receives the data in buffers allo- cated in NVRAM. The data is replicated by the RDMAReplicator over RDMA. After the

data is received by the RDMAReceiver, it is given to the HDFS Writer which can access

NVM either in block access mode (NVFS-BlkIO) using the HDFS file system APIs or in

memory access mode (NVFS-MemIO). In this work, we present designs for HDFS I/O

using both of these modes.

We also present advanced and cost-effective acceleration techniques for Spark and

HBase by using NVM (with SSD through a hybrid design) for performance-critical data

only. For this, we co-design HDFS with HBase and Spark so that these middleware can

utilize the NVM in a cost-effective manner. We further propose enhancements to use the

NVM-based HDFS design as a burst buffer file system for running Spark jobs over parallel

file systems, such as Lustre.

8.2.2 Design Details

In this section, we discuss our proposed designs in detail. We propose designs for utiliz-

ing NVM both in block and memory access modes for HDFS while considering workload

characteristics (read-/write-intensive).

8.2.2.1 Block Access (NVFS-BlkIO)

In this section, we propose to use NVM in block access mode for HDFS storage on

the DataNodes. For RDMA communication, the buffers are allocated from NVRAM in the

HDFS DataNode side when the DataNodes start up. These buffers are used in a round-robin

manner for receiving data by the RDMADFSServer. In order to preserve the HDFS block

I/O structure, the data received in the NVRAM buffers via RDMA, are then encapsulated

into JAVA I/O streams and written to the block files created in NVM using the HDFS file

system APIs. Figure 8.6 describes this design.

Figure 8.6: Design of NVFS-BlkIO [RDMADFSClients write/read blocks through the RDMADFSServer (DataNode); the communication buffers in NVM are accessed in memory mode, while the block files (Blk1 ... BlkN) are accessed in block mode]

In order to accelerate the I/O, we propose to mmap the block files during HDFS write over RDMA. In this way, the NVM pages corresponding to the block files are mapped into the address space of RDMADFSServer, rather than paging them in and out of the kernel

buffer-cache. RDMADFSServer can then perform direct memory operations via ByteBuffer

put/get to the mmaped region and syncs the block file when the last packet of the block

arrives.
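A minimal sketch of this mmap-based block write is given below, assuming the block file resides on an NVM-backed mount (the path is supplied by the caller); it is illustrative and not the actual NVFS implementation.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of the mmap-based block write in NVFS-BlkIO.
public class MmapBlockWriter {
    static final int BLOCK_SIZE = 128 * 1024 * 1024;

    private final MappedByteBuffer region;

    MmapBlockWriter(String blockPath) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(blockPath, "rw");
             FileChannel channel = file.getChannel()) {
            // Map the block file into the server's address space so packets
            // bypass the kernel buffer-cache paging path.
            region = channel.map(FileChannel.MapMode.READ_WRITE, 0, BLOCK_SIZE);
        }
    }

    void writePacket(byte[] packet, boolean lastPacketOfBlock) {
        region.put(packet);          // direct memory operation into the mapped region
        if (lastPacketOfBlock) {
            region.force();          // sync the block file when the last packet arrives
        }
    }
}
```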

Figure 8.7: Design of NVFS-MemIO [(a) Flat Architecture: a single NVM buffer pool shared by communication and storage, with packets from different blocks interleaved, (b) Block Architecture: separate communication and storage buffer pools, with each block's packets kept together]

8.2.2.2 Memory Access (NVFS-MemIO)

In this section, we present our design to use NVM in memory access mode (NVRAM) for HDFS. We consider two different architectures for HDFS to incorporate NVRAM.

Flat Architecture: Figure 8.7(a) shows the Flat Architecture. In this design, we allocate a pool of buffers in the NVM on the HDFS DataNode. The buffers are allocated in a single linked list. We maintain two pointers: head and tail for the buffer pool. The head pointer always points to the first used buffer. The first free buffer is pointed to by the tail pointer. Initially, all the buffers in the buffer pool are free for use and head and tail point to the same buffer. As the RDMADFSServer receives data, the tail pointer keeps on moving forward to point to the first free buffer. The major ideas of this design are:

1. Shared buffers for communication and storage: The buffers allocated in NVM are

used for both RDMA-based communication and HDFS I/O. The RDMADFSServer

receives the data packets sent by the RDMADFSClient or upstream DataNodes. Each

packet is identified by the corresponding block id and packet sequence

number in the packet header. These packets, after being received, remain in those

buffers and the buffers are marked as used. In this way, whenever a new packet

arrives, the first free buffer pointed to by the tail pointer is returned and this buffer

acts as the RDMA receive buffer as well as the storage buffer for that packet. As

shown in Figure 8.7(a), in this architecture, a packet from Block2 can reside in the

buffer next to a packet from Block1 when multiple clients send data concurrently.

Therefore, this architecture does not preserve HDFS block structures and hence, is

called the Flat Architecture.

2. Overlap between Reader and Writer: Even though the tail pointer keeps on chang-

ing when data comes to the RDMADFSServer, in the presence of concurrent readers

and writers, the Flat Architecture does not need any synchronization to access the

buffer pool. Because, read request for a block comes only when it is entirely writ-

ten to HDFS. Therefore, scanning through the pool of buffers can continue without

any synchronization from head to tail as it was at the time of the arrival of the

read request. Therefore, this architecture can guarantee good overlap between HDFS

write and read.

3. Scan through the buffer pool during read: Due to the Flat Architecture, when a read

request for a block arrives, the reader thread in the DataNode side gets the first used

buffer in the linked list pointed to by the head pointer and has to scan through the

entire pool of buffers (up to the tail pointer in the worst case) to retrieve the packets

belonging to that block in sequence. For large buffer pool, this scan is very inefficient

which results in poor read performance.

This design has fast write performance, as the communication buffer for a piece of data also acts as the storage buffer. During HDFS write, the first free buffer can be retrieved in

O(1) time, as it is pointed to by the tail pointer. So, the time complexity to write a block with M packets is O(M). But in order to achieve this, all the storage (and communication) buffers have to be registered against the RDMA connection object, and the storage therefore depends on the underlying communication library. Moreover, this architecture needs to scan the entire pool of buffers to retrieve the required data packets. If the number of blocks received by the RDMADFSServer is N when the read request arrives, and each block contains at most

M packets, then, the time to read an entire block is O(M ∗ N). This makes read perfor-

mance very slow for large data sizes (large N). Considering all these, we propose another

architecture, which we call Block Architecture that offers modularity between storage and

communication library with enhanced read performance.

Block Architecture: Figure 8.7(b) shows the Block Architecture. In this design, we allo-

cate two different buffer pools in the NVM: one for RDMA communication, another for

storage. Each of the buffer pools is pointed to by a head pointer that returns the first free

buffer. The reason to separate out these two buffer pools is to eliminate the need to reg-

ister all the buffers against the RDMA connection object. In this way, storage operation

does not depend on the communication library and offers better modularity than the Flat

Architecture. The main ideas behind this design are:

1. Re-usable communication buffers: The communication buffers are used in a round-

robin manner. The RDMADFSServer receives the data in a free communication

buffer returned by the head pointer. After this, data is written to one of the free

storage buffers. The communication buffer is then freed and can be re-used for sub-

sequent RDMA data transfers.

2. HDFS Block Structure through the storage buffers: Figure 8.8 shows the internal

data structures for this design. The DataNode has a hash table that hashes on

the block ids. Each entry in the hash table points to a linked list of buffers.

This linked list contains the data packets belonging to the corresponding block only

and data packets belonging to the same block are inserted sequentially in the linked

list. In this way, this design preserves the HDFS block structures. This is why, we

name this architecture Block Architecture.

3. Overlapping between communication and storage: The encapsulation of data from

the communication to the storage buffers happens in separate threads. The receiver

thread in RDMADFSServer, after receiving a data packet, returns the buffer id to

a worker thread. The worker thread then performs the transfer of data to a storage

buffer; while the receiver thread can continue to receive subsequent data over RDMA.

In this way, the Block Architecture ensures good overlap between communication and

I/O.

4. Constant time retrieval of blocks during read: As the DataNode maintains a hash

table of block ids, the list of data packets can be retrieved in constant time (amor-

tized cost) for each block during read. The subsequent data packets for a block are

also obtained in constant time as they are in adjacent entries of the linked list.

5. Synchronization: Since separate buffer pools are used for communication and stor-

age, no synchronization is needed between the communication and storage phases.

The readers read from the linked lists corresponding to their respective blocks and

the writers write to the free storage buffers. So, no synchronization is needed for

concurrent readers and writers. Concurrent writers need to synchronize the pool of

storage buffers to retrieve the first free buffer.

In this architecture, the amortized time complexity to read a block is O(1) + O(M), where M is the number of packets per block. This time is independent of the total number of blocks received by the DataNode and therefore, the read performance is much faster than that of the Flat Architecture. On the other hand, the amortized time complexity to write a block is M ∗ (O(1) + O(W )). The first free buffer for RDMA is retrieved in constant time and W is the number of concurrent writers competing for a free storage buffer pointed to by the head pointer. Moreover, W << N and receiving the data is overlapped with I/O.

Thus, the write overhead is negligible compared to the Flat Architecture. Therefore, the overall performance of Block Architecture is better than Flat.

Figure 8.8: Internal data structures for Block Architecture [a hash table keyed by block id (Blk1 ... BlkN); each entry points to the linked list of that block's packets (Pkt1 ... PktM) stored in NVM]
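The sketch below captures the DataNode-side structures of Figure 8.8 in Java. Buffer-pool management, NVRAM allocation, and the RDMA path are omitted, and the class and method names are illustrative rather than the actual NVFS code.

```java
import java.nio.ByteBuffer;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the Block Architecture storage index (Figure 8.8). Each block is
// written by a single stream, so appends to one block's packet list are not contended.
public class BlockStore {
    // block id -> packets of that block, kept in packet-sequence order
    private final ConcurrentHashMap<Long, List<ByteBuffer>> blocks =
            new ConcurrentHashMap<>();

    /** Called by a worker thread after a packet is copied into a free storage buffer. */
    void appendPacket(long blockId, ByteBuffer storageBuffer) {
        blocks.computeIfAbsent(blockId, id -> new LinkedList<>())
              .add(storageBuffer);     // packets of one block stay contiguous in its own list
    }

    /** Readers locate a block in amortized O(1) time and read its M packets in O(M). */
    List<ByteBuffer> readBlock(long blockId) {
        return blocks.get(blockId);    // no scan over other blocks' packets, unlike the Flat design
    }
}
```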

In our design, we do not make any changes to HDFS client side APIs. Even though data

is stored to NVM using memory semantics by the DataNode, this happens in a transparent

manner and upper layer frameworks (HDFS clients) can still use HDFS file system APIs to

invoke the I/O operations.

8.2.3 Implementation

In this section, we discuss our implementation details.

NVFS-BlkIO: We assume that the RAMDisk is backed by NVRAM and simulate HDFS I/O using RAMDisk as a data directory while adding additional latency for the write operations. As obtained in Section 8.1.3, the extra latency added per block is 4.49ms (0ms) during HDFS write (read). Similarly, during RDMA communication for HDFS write and replication, we add extra latency of 4.49ms per block.

NVFS-MemIO: The Block Architecture offers better read performance and modularity

over the Flat Architecture. Therefore, we implement this design in this work. We consider

two types of memory allocation here.

1. Memory Allocation from JVM (NVFS-MemIO-JVM): Allocates memory from JVM.

2. Off-Heap Memory Allocation (NVFS-MemIO-JNI): Allocates off-heap memory through

Java nio direct buffers using JNI.

In both the cases, some buffers are pre-allocated to avoid the overhead of memory alloca-

tion. The rest of the buffers are dynamically allocated on demand. We overlap the dynamic

buffer allocation with communication so that this cost does not come into the critical path

of the application.
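A minimal sketch of such a buffer pool is shown below; the pool sizes and class name are illustrative, and only the two allocation paths (JVM heap vs. off-heap nio direct buffers) follow the description above.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the NVFS-MemIO buffer allocation strategies; names are illustrative.
public class StorageBufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();
    private final boolean offHeap;   // true = NVFS-MemIO-JNI style, false = NVFS-MemIO-JVM
    private final int bufferSize;

    StorageBufferPool(boolean offHeap, int bufferSize, int preAllocated) {
        this.offHeap = offHeap;
        this.bufferSize = bufferSize;
        for (int i = 0; i < preAllocated; i++) {   // pre-allocate to avoid allocation overhead later
            free.add(allocate());
        }
    }

    private ByteBuffer allocate() {
        // JVM mode uses heap buffers; JNI mode uses nio direct (off-heap) buffers.
        return offHeap ? ByteBuffer.allocateDirect(bufferSize) : ByteBuffer.allocate(bufferSize);
    }

    /** Remaining buffers are allocated on demand, off the critical path. */
    ByteBuffer take() {
        ByteBuffer buf = free.poll();
        return (buf != null) ? buf : allocate();
    }

    void release(ByteBuffer buf) {
        buf.clear();
        free.add(buf);
    }
}
```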

For both NVFS-BlkIO and NVFS-MemIO, the NVM behavior is obtained by the fol-

lowing equations for each block:

t_comm^(D-N) = t_comm^(D-D) + delay_comm

Here, t_comm^(D-N) represents the communication time for D-to-N over RDMA, t_comm^(D-D) is the communication time for RDMA over DRAM (traditional RDMA), and delay_comm = 4.49ms represents the added overhead.

t_io^(NVM) = t_io^(DRAM) + delay_io

Similarly, t_io^(NVM) represents the I/O time for NVM, t_io^(DRAM) is the I/O time for DRAM, and delay_io = 4.49ms represents the added overhead for write. For read, delay_io = 0.

8.2.4 Cost-effective Performance Schemes for using NVM in HDFS

NVMs are expensive. Therefore, for data-intensive applications, it is not feasible to store all the data in NVM. We propose to use NVM with SSD as a hybrid storage for

HDFS I/O. In our design, NVM can replace or co-exist with SSD through a configuration parameter. As a result, cost-effective, NVM-aware placement policies are needed to identify the appropriate data to go to NVMs. The idea behind this is to take advantage of the high IOPS of NVMs for performance-critical data; all others can go to SSD. Moreover, the performance-critical data can be different for different applications and frameworks. Consequently, it is challenging to determine the performance-critical data for different upper-level middleware over HDFS. In this section, we present optimization techniques to utilize

NVMs in a cost-effective manner for Spark and HBase.

8.2.4.1 Spark

Spark jobs often run on pre-existing data that were generated previously and stored on disk. For such jobs, we propose to write only the job output to NVM. To do this, the HDFS client identifies the output directory of a job provided by the application. The client then appends a flag bit, indicating buffering, to the header of each block to be written to the output directory. In other words, the headers of all the blocks belonging to the job output are updated with the buffering information. The DataNode parses the block header and writes the corresponding data packets to NVM. Periodically, these data are moved from NVM to SSD to free buffers in the NVM. By default, Spark aggressively uses memory to store

different types of data. We present an orthogonal design to write the job outputs to NVM

in HDFS, so that Spark can use the available memory for other purposes like computation,

intermediate data, etc.
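A sketch of this output-aware tagging is shown below; the flag value and method names are hypothetical, since the actual block-header format of NVFS is not reproduced here.

```java
// Hypothetical sketch of output-directory tagging for Spark job outputs.
public class OutputAwarePlacement {
    /** Client side: mark every block written under the job's output directory. */
    static byte bufferingFlag(String filePath, String jobOutputDir) {
        return filePath.startsWith(jobOutputDir) ? (byte) 1 : (byte) 0;
    }

    /** DataNode side: route the packet based on the flag parsed from the block header. */
    static String chooseStorage(byte bufferingFlag) {
        // Only job outputs take the NVM fast path; everything else goes to SSD.
        return (bufferingFlag == 1) ? "NVM" : "SSD";
    }
}
```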

8.2.4.2 HBase

With large amount of data insertions/updates, HBase eventually flushes its in-memory

data to HDFS. HBase holds the in-memory modifications in MemStore which triggers a

flush operation when it reaches its threshold. During the flush, HBase writes the MemStore

to HDFS as an HFile instance. High operational latency for HDFS write can affect the

overall performance of HBase Put operations [37]. The evaluations in Chapter 4.2 also demonstrate similar behavior. HBase also maintains HLog instances that are the basis of

HBase data replication and failure recovery and, thus, must be stored in HDFS. HBase periodically flushes its log data to HDFS. Each HBase operation has a corresponding log entry and the logs need to be persistent. Therefore, in this work, we propose to store only the Write Ahead Logs (WALs) in NVM. All other HBase files, along with the data tables, are stored in SSD. When a file is created in HDFS from HBase, the file path is distinguished based on its type. For example, a data file is always written under the directory /hbase/data.

On the other hand, WAL files are written in /hbase/WALs. While a block is sent from the

DFSClient to a DataNode, the client appends the file path (the file the block belongs to) to the block header. The DataNode, after receiving a block, parses the header and stores it to

NVM accordingly, i.e. if the block belongs to a WAL file, it goes to NVM, otherwise, it is written to SSD.
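The routing decision on the DataNode side can be summarized by the following sketch; the directory prefixes come from the text, while the method itself is illustrative.

```java
// Sketch of the WAL-aware placement for HBase files; not the actual NVFS code.
public class HBasePlacementPolicy {
    /** DataNode side: parse the file path from the block header and pick the device. */
    static String chooseStorage(String filePathFromBlockHeader) {
        if (filePathFromBlockHeader.startsWith("/hbase/WALs")) {
            return "NVM";   // Write Ahead Logs are performance-critical and must persist
        }
        return "SSD";       // data files (/hbase/data) and all other HBase files
    }
}
```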

8.2.5 NVFS-based Burst Buffer for Spark over Lustre

In order to handle the bandwidth limitation of shared file system access in HPC clusters, burst buffer systems [55] are often used; such systems buffer the bursty I/O and gradually

flush datasets to the back-end parallel file system. The most commonly used form of a burst buffer in current HPC systems is dedicated burst buffer nodes. These nodes can exist in the I/O forwarding layer or as a distinct subset of the compute nodes or even as entirely different nodes close to the processing resources (i.e., extra resources if available). A burst buffer can also be located inside the main memory of a compute node or as a fast non- volatile storage device placed in the compute node (i.e., NVRAM devices, SSD devices,

PCM devices, etc.). BurstFS [99] is an SSD-based distributed file system to be used as a burst buffer for scientific applications. It buffers the application checkpoints to the local

SSDs placed in the compute nodes, while asynchronously flushing them to the parallel file system.

Figure 8.9: Design of NVFS-based Burst Buffer for Spark over Lustre [Spark tasks on a compute node write through the DFSClient to the co-located DataNode, which buffers data in NVM and asynchronously flushes it to Lustre]

NVFS can, therefore, be used as the burst buffer for running Spark jobs over the parallel file system Lustre. For this, we propose to deploy the NVFS cluster with a replication factor of one

(or more depending on the fault tolerance requirements of the application). The tasks from

Spark jobs are co-located with the NVFS DataNodes in the compute partition of an HPC cluster. As a result, these tasks send their data to the DataNodes. These data are buffered in NVM and the DataNodes asynchronously flush the data to Lustre. To equip the NVFS

DataNodes with the burst buffer functionality, we add a pool of threads in the DataNode side to take care of the asynchronous flush to the parallel file system. This helps improve the I/O performance of Spark jobs by reducing the bottlenecks of shared file system access.

Figure 8.9 shows the proposed design.
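A minimal sketch of the DataNode-side flush pool is shown below; the Lustre path, pool size, and class names are illustrative assumptions, and block bookkeeping and retry logic are omitted.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the asynchronous burst-buffer flush described above.
public class BurstBufferFlusher {
    private final ExecutorService flushers = Executors.newFixedThreadPool(4);
    private final Path lustreDir = Paths.get("/lustre/nvfs-burst");  // hypothetical mount

    /** Called after a block is fully buffered in NVM; returns immediately. */
    void flushAsync(long blockId, ByteBuffer nvmBlock) {
        flushers.submit(() -> {
            Path target = lustreDir.resolve("blk_" + blockId);
            try (FileChannel out = FileChannel.open(target,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                out.write(nvmBlock.duplicate());  // assumes the buffer is already flipped for reading
            } catch (IOException e) {
                // a real design would retry and only then release the NVM buffer
            }
        });
    }
}
```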

8.3 Performance Evaluation

In this section, we present a detailed performance evaluation of the NVFS design in comparison to the default HDFS architecture. In this study, we perform the following sets of experiments: (1) Performance Analysis, (2) Evaluation with MapReduce, (3) Evaluation with Spark, and (4) Evaluation with HBase. In all our experiments, we use RDMA for Apache Hadoop 2.x 0.9.7 [38] released by The Ohio State University and JDK 1.7.0.

We also use OHB Microbenchmark 0.8 [38], YCSB-0.6.0, HBase 1.1.2, and Spark-1.5.1 for our evaluations. First we analyze our designs NVFS-BlkIO and NVFS-MemIO using the HDFS Microbenchmarks from OHB and then demonstrate the effectiveness of both of these designs using different MapReduce, Spark, and HBase workloads.

Figure 8.10: HDFS Microbenchmark [(a) SWL, (b) SRL: execution time (s) vs. data size (GB) for HDFS-IPoIB, HDFS-RDMA, NVFS-BlkIO, NVFS-MemIO-JVM, and NVFS-MemIO-JNI (56Gbps); (c) Flat vs. Block Architecture: total throughput (MBps) for memory-mapped access]

Figure 8.11: TestDFSIO [(a) OSU RI Cluster B, (b) SDSC Comet, (c) OSU Nowlab; average write/read throughput (MBps) for HDFS vs. NVFS-BlkIO and NVFS-MemIO-JNI]

The experiments are performed on three different clusters. They are:

OSU RI Cluster B: The cluster configurations are described in Chapter 3.2.

OSU Nowlab: There are four nodes with Intel P3700 NVMe SSDs on this cluster.

Each of them has dual eight-core 2.6 GHz Intel Xeon E5-2670 (Sandy Bridge) processors,

32GB DRAM, 16GB RAMDisk and is equipped with Mellanox 56Gbps FDR InfiniBand

HCAs with PCI Express Gen3 interfaces. The nodes run Red Hat Enterprise Linux Server release 6.5. Each of these nodes also have a 300GB OCZ VeloDrive PCIe SSD and INTEL

SSDSC2CW12 SATA SSD.

SDSC Comet: The cluster configurations are described in Chapter 7.2.

We allocate 1.5GB NVM space for the communication buffers that are used in a round- robin manner for non-blocking RDMA transfers. For storage buffers, we allocate 6GB

NVM space on OSU RI Cluster B and OSU Nowlab; on SDSC Comet, it is 10GB.

8.3.1 Performance Analysis

In this section, we analyze the performance of our proposed designs using the HDFS Microbenchmarks from OHB Microbenchmark 0.8. These experiments are performed on three DataNodes on OSU Nowlab. NVMe SSD is used as the HDFS data directory on each of the DataNodes.

150 Performance comparison using HDFS Microbenchmarks: We evaluate our designs using SWL and SRL from OHB Microbenchmarks. As observed from Figure 8.10(a),

NVFS-MemIO-JNI reduces the execution time by up to 67% over HDFS-IPoIB (56Gbps) and 53.95% over HDFS-RDMA (56Gbps) for 8GB data size. The performance of NVFS-

MemIO-JNI is better by up to 20% over NVFS-BlkIO. As observed from Figure 8.10(b),

NVFS-MemIO-JNI reduces the execution time of SRL by up to 43% over HDFS-IPoIB

(56Gbps) and 27% over NVFS-BlkIO. NVFS-MemIO-JNI also shows better performance than NVFS-MemIO-JVM. Therefore, unless otherwise stated, we use NVFS-MemIO-JNI for our subsequent experiments.

Figure 8.10(c) shows the performance comparisons between our two proposed archi- tectures for memory access mode of NVM. In this experiment, we use SWT and SRT for

8GB data size with four concurrent clients per node. The figure clearly shows that the read performance of the Block Architecture is 1.5x that of the Flat Architecture. Both designs offer similar write performance. This explains the benefit of the Block Architecture over Flat. In the presence of concurrent readers, read in the Block Architecture is faster, as each block and the corresponding data packets can be retrieved in constant time. Due to the overlap between communication and storage, the write performances are similar in both architectures.

Figure 8.12: Data generation benchmark, TeraGen [(a) OSU RI Cluster B, (b) SDSC Comet, (c) OSU Nowlab; execution time (s) vs. data size (GB) for HDFS vs. NVFS-MemIO-JNI]

151 Performance comparison over RAMDisk: We evaluate NVFS-BlkIO and NVFS-MemIO-

JNI and compare the performances with HDFS running over RAMDisk using the SWL

benchmark for 4GB data size. Table 8.1 shows the results for this experiment.

Table 8.1: SWL performance over RAMDisk

  IPoIB            22.72 sec
  RDMA             18.3 sec
  NVFS-BlkIO       15.21 sec
  NVFS-MemIO-JNI   12.26 sec

The table clearly shows that, NVFS-MemIO-JNI reduces the execution time by 46%,

33%, and 20% over HDFS-IPoIB, HDFS-RDMA, and NVFS-BlkIO, respectively. NVFS-

BlkIO also gains by 33% and 17% over HDFS-IPoIB and HDFS-RDMA, respectively.

Moreover, RAMDisk cannot ensure tolerance against process failures, which our designs

can, due to the non-volatility of NVM. NVFS-BlkIO gains due to memory speed write

using mmap, while NVFS-MemIO gains due to elimination of software overheads of the

file system APIs.

Comparison of communication time: We compare the communication times of HDFS over RDMA using DRAM vs. NVRAM. For this, we design a benchmark that mimics the

HDFS communication pattern. The benchmark transfers data from the client to a server which replicates it in a pipelined manner (no processing or I/O of the received data).

Table 8.2: Comparison of Communication Time

  IPoIB             10.73 sec
  RDMA              5.10 sec
  RDMA with NVRAM   5.68 sec

152 As shown in Table 8.2, the communication time for a 4GB file is 5.10s over RDMA using DRAM. On the other hand, when NVRAM is used for RDMA, this time increases to

5.68s due to the overheads of NVRAM. Compared to the communication time over IPoIB

(56Gbps), the proposed D-to-N over RDMA approach has 47% benefit.

Comparison of I/O time: In this study, we compare the I/O times needed by our designs with that of NVMe SSD using the SWL benchmark for 4GB data size.

Table 8.3: Comparison of I/O Time

  NVMe SSD         4.27 sec
  NVFS-BlkIO       3.14 sec
  NVFS-MemIO-JNI   1.83 sec

The I/O times are profiled on the first DataNode in the replication pipeline. First, we measure the I/O time per block and report the cumulative I/O time for the entire file. As observed from Table 8.3, NVFS-MemIO-JNI is 42% and 57% faster than NVFS-BlkIO and HDFS with NVMe SSD, respectively. NVFS-BlkIO is also 27% faster than HDFS with

NVMe SSD in terms of I/O time. Thus, our designs guarantee much better I/O performance than default HDFS with NVMe SSD as they leverage the byte-addressability of NVM.

8.3.2 Evaluation with Hadoop MapReduce

In this section, we evaluate MapReduce benchmarks.

8.3.2.1 TestDFSIO

Figure 8.11 shows the results of our evaluations with the TestDFSIO benchmark on three different clusters. For these experiments, we use both NVFS-BlkIO and NVFS-

MemIO-JNI. On OSU RI Cluster B, we run this test with 8GB data size on four DataNodes.

As observed from Figure 8.11(a), NVFS-BlkIO increases the write throughput by 3.3x over

153 PCIe SSD. The gain for NVFS-MemIO-JNI is 4x over HDFS using PCIe SSD over IPoIB

(32Gbps) for 8GB data size. For TestDFSIO read, our design NVFS-MemIO-JNI has up to 1.7x benefit over PCIe SSD on OSU RI Cluster B. The benefit for NVFS-BlkIO is up to 1.4x. As shown in Figure 8.11(b), on SDSC Comet with 32 DataNodes for 80GB data, our benefit is 2.5x for NVFS-BlkIO and 4x for NVFS-MemIO-JNI over SATA SSD, for

TestDFSIO write. Our gain for read is 1.2x. Figure 8.11(c) shows the results of TestDFSIO experiment for 8GB data on three DataNodes on OSU Nowlab, where we have 4x benefit over NVMe SSD for TestDFSIO write. Our gain for read is 2x.

8.3.2.2 Data Generation Benchmark (TeraGen)

Figure 8.12 shows our evaluations with the data generation benchmark TeraGen [93].

We use our design NVFS-MemIO-JNI here. Figure 8.12(a) shows the results of experi- ments that are run on four HDFS DataNodes on OSU RI Cluster B. As observed from the

figure, our design has 45% gain over HDFS using PCIe SSD with IPoIB (32Gbps). On

SDSC Comet, we vary the data size from 20GB on eight nodes to 80GB on 32 nodes. As observed from the Figure 8.12(b), our design reduces the execution time of TeraGen by up to 37% over HDFS using SATA SSD with IPoIB (56Gbps). As shown in Figure 8.12(c), on

OSU Nowlab with three DataNodes, our design has a gain of up to 44% over HDFS using

NVMe SSD with IPoIB (56Gbps). TeraGen is an I/O-intensive benchmark. The enhanced I/O-level designs using NVM and communication with the D-to-N over RDMA approach result in these performance gains across different clusters with different interconnects and storage configurations.

8.3.2.3 SWIM

In this section, we evaluate our design (NVFS-BlkIO) using the workloads provided in

Statistical Workload Injector for MapReduce (SWIM) [86]. The MapReduce workloads in

154 SWIM are generated by sampling traces of MapReduce jobs from Facebook. In our ex-

periment, we generate one such workload from historic Hadoop job traces from Facebook.

We run this experiment on eight DataNodes on SDSC Comet and generate 20GB data.

The generated workload consists of 50 representative MapReduce jobs. As observed from

Figure 8.13, 42 jobs are short duration with execution times less than 30s. For such jobs,

NVFS-BlkIO has a maximum benefit of 18.5% for job-42. Seven jobs have execution times

greater than (or equal to) 30s and less than 100s. For such jobs, our design has a maximum

gain of 34% for job-39. NVFS-BlkIO gains a maximum benefit of 37% for job-38; for this

job, the execution time of HDFS is 478s, whereas for NVFS-BlkIO, it is 300s. The overall

gain for our design is 18%.

Figure 8.13: SWIM (SDSC Comet) [x-axis: Job No. (1-50), y-axis: execution time (sec, log scale); HDFS (56Gbps) vs. NVFS-BlkIO (56Gbps)]

Table 8.4 summarizes the results for SWIM. The MapReduce jobs in SWIM read

(write) data from (to) HDFS. Therefore, our design, with enhanced I/O and communication performance, gains over HDFS with SATA SSD over IPoIB (56Gbps). Job-38 reads and writes a large amount of data from (to) HDFS (hdfs_bytes_read = 1073750104, hdfs_bytes_written = 17683184696) compared to the others. Therefore, the gain is also higher for this job.

Table 8.4: Summary of benefits for SWIM

  Job Duration   Execution Time (sec)   Number of Jobs   Benefit
  Short          < 30                   42               18.5%
  Medium         ≥ 30 and < 100         7                34%
  Long           ≥ 100                  1                37%

8.3.3 Evaluation with Spark

8.3.3.1 TeraSort

In this section, we evaluate our design (NVFS-MemIO-JNI) along with the cost-effective scheme using the Spark TeraSort workload. We run these experiments on four DataNodes on OSU RI Cluster B using the Spark over YARN mode. The data size is varied from 10GB to 14GB. The values of Spark executor and worker memory are set to 2GB each.

As observed from Figure 8.14(a), by buffering only the output of TeraSort to NVM, our design can reduce the execution time by up to 11% over HDFS using PCIe SSD with IPoIB

(32Gbps). The input data and other control information are stored on the PCIe SSD.

The reason behind this gain is the reduction of I/O and communication overheads while writing the TeraSort output to HDFS. Besides, TeraSort is shuffle-intensive and involves computation to sort the data. In our design, we use DRAM to mimic NVRAM behavior; this DRAM could otherwise be used for computation, as in default HDFS, and would not be consumed if real NVRAM hardware were available.

8.3.3.2 PageRank

In this section, we evaluate our NVFS-based burst buffer design using the Spark PageRank workload. We perform these experiments on eight DataNodes on SDSC Comet and compare the I/O times of PageRank when run over the NVFS-based burst buffer vs. Lustre.

Figure 8.14: Evaluation with Spark workloads. (a) Spark TeraSort with cost-effective scheme (OSU RI Cluster B); (b) Spark PageRank with NVFS-based burst buffer (SDSC Comet).

For this, we use our design NVFS-MemIO-JNI and PageRank is launched with 64 concurrent maps and reduces. Figure 8.14(b) illustrates the results of our evaluations.

As observed from the figure, our burst buffer design can reduce the I/O time (the time needed to write the output of different iterations to the file system) of Spark PageRank by up to 24% over Lustre. This is because, when Spark tasks write data to the NVFS-based burst buffer, most of the writes go to the same node where the tasks are launched. As a result, unlike Spark running over Lustre, the tasks do not have to write data over the network to the shared file system; rather, data is asynchronously flushed to Lustre. Besides, studies have already shown that, in terms of peak data access bandwidth, SSD is much faster than

Lustre [99]. The evaluations done in Chapter 5 and Chapter 6 also establish a similar trend.

NVM, being faster than SSD, offers much higher performance than Lustre. Therefore, our proposed burst buffer design can improve the I/O performance of Spark jobs over Lustre.

8.3.4 Evaluation with HBase

In this section, we evaluate our design (NVFS-BlkIO) with HBase using the YCSB benchmark. We perform these experiments on SDSC Comet. We vary the cluster size from

8 to 32 and the number of records is varied from 800K on 8 nodes to 3200K on 32 nodes.

Figure 8.15 shows the performance of our design for the 100% insert workload. As observed from the figure, our design reduces the average insert latency by up to 19% over

HDFS using SATA SSD with IPoIB (56Gbps); the increase in operation throughput is 21%.

Figure 8.16 shows the performance of our design for the 50% read, 50% update workload.

As observed from the figure, our design reduces the average read latency by up to 26%

over HDFS using SATA SSD; the average update latency is improved by up to 16%; the

increase in operation throughput is 20%. The proposed design flushes the WAL files to

NVM during the course of these operations. The WALs are also replicated using the D-to-N over RDMA approach. The reduction of I/O and communication times through the

proposed cost-effective solution helps improve the performance of HBase insert, update,

and read.

Figure 8.15: YCSB 100% insert (SDSC Comet). (a) Average Latency; (b) Throughput.

8.4 Related Work

There have been many research efforts to accelerate I/O performance through NVM. The

authors in [24] present a file system BPFS and a hardware architecture that are designed

around the properties of persistent, byte-addressable memory. In order to transmit atomic,

fine-grained updates to persistent storage, BPFS proposes a new technique called short-circuit shadow paging that helps BPFS provide strong reliability guarantees and better performance than traditional file systems. Pelley et al. [64] propose to leverage NVRAM as main memory to host all of the data sets for in-memory databases. The authors in [29] propose a new, light-weight POSIX-compliant file system, PMFS, that exploits the byte-addressability of NVRAM to avoid the overheads of block I/O. PMFS also leverages hardware features like processor paging and memory ordering. Fusion-io's NVMFS, which was formerly known as DFS [49], uses the virtualized flash storage layer for simplicity and performance. DFS ensures a much shorter datapath to the flash memory by leveraging the virtual to physical block allocation in the virtualized flash layer. SCMFS [106] is a file system for storage class memory and uses the virtual address space for implementation. In [67], the authors propose a hybrid file system to resolve the random write issues of SSDs by leveraging NVRAM. In [80], the authors perform a systematic performance study of various legacy Linux file systems with real-world workloads over NVM. In this work, the authors demonstrate that most of the POSIX-compliant file systems can be tuned to achieve comparable performance to PMFS. All these NVM-based file system studies concentrate on workloads over an unreplicated single machine. In Mojim [113], the authors propose RDMA- and NVM-based replication for storage subsystems in data centers. But this work focuses only on the write and replication part, whereas a lot of Big Data applications involve a significant amount of reads. None of the above-mentioned studies consider the special needs of a Big Data file system like HDFS. Besides, HDFS has to support a variety of upper-layer frameworks and middleware. Therefore, it is very important to maintain backward compatibility for HDFS even when exploiting the byte-addressability of NVMs. Moreover, the

high expense of NVMs makes cost-effectiveness a big issue to consider for Big Data applications. In this study, we consider all these aspects while designing HDFS over NVM and

RDMA.

Figure 8.16: YCSB 50% read, 50% update (Cluster B). (a) Average Read Latency; (b) Average Update Latency; (c) Throughput.

Both the Big Data and HPC communities have put significant effort into improving the performance of HDFS by leveraging resources from HPC platforms. In Chapter 3, we presented an RDMA-enhanced HDFS design to improve the performance of write and replication.

But this design kept the default HDFS architecture intact. In Chapter 4, a SEDA (Staged

Event-Driven Architecture) [104] based approach has been proposed to re-design the HDFS architecture. The authors in [21] have proposed in-memory data caching to maximize HDFS read throughputs. The Apache Hadoop community [96] has also proposed a centralized cache management scheme in HDFS, which is an explicit caching mechanism that allows users to specify paths to be cached by HDFS. Some recent studies [68, 69, 91] also pay attention to incorporating heterogeneous storage media (e.g., SSD and parallel file system) in HDFS. The authors in [68] deal with data distribution in the presence of SSD and

HDD nodes, whereas [69] caches data in SSD. Researchers in [91] present HDFS-specific optimizations for PVFS, and [70] proposes to store cold data of a Hadoop cluster to network-attached file systems. In Chapter 5, we proposed advanced data placement policies for

HDFS to efficiently utilize the heterogeneous storage devices on modern HPC clusters. In

Chapter 6, efficient data access strategies have been presented for Hadoop and Spark on

heterogeneous HPC clusters. But none of these studies has explored NVM technology

for HDFS. In this work, we, therefore, analyze the performance potential for incorporating

NVM in HDFS and finally propose designs to leverage the byte-addressability of NVMs

for HDFS communication and I/O.

8.5 Summary

In this work, we proposed a novel design of HDFS to leverage the byte-addressability of

NVM. Our design minimizes the memory contention between computation and I/O by allocating

memory from NVM for RDMA-based communication. We re-designed the HDFS storage

architecture to exploit the memory semantics of NVM for I/O operations. We also proposed

advanced and cost-effective techniques for accelerating Spark and HBase by intelligently

utilizing NVM in the underlying file system for only job outputs and Write Ahead Logs

(WALs), respectively. We further proposed enhancements to use our proposed NVFS as a

burst buffer layer for running Spark jobs over parallel file systems, such as Lustre. In-depth

performance evaluations show that our design can increase the read and write throughputs

of HDFS by up to 2x and 4x, respectively, over default HDFS. It also achieves up to 45%

benefit in terms of job execution time for data generation benchmarks. Our design reduces

the overall execution time of the SWIM workload by 18% while providing a maximum gain

of 37% for job-38. The proposed design can improve the performance of Spark TeraSort

by 11%. The performances of HBase insert, update, and read operations are improved by 21%, 16%, and 26%, respectively. Our burst buffer design can also improve the I/O performance of Spark PageRank by up to 24% over Lustre.

Chapter 9: Accelerating I/O Performance through RDMA-based Key-Value Store

Current large-scale HPC systems take advantage of enterprise parallel file systems like

Lustre that are designed for a range of different applications. Therefore, Hadoop MapReduce jobs are often run on top of the existing installations of the parallel file systems on

HPC clusters. For write-intensive applications, running the job over Lustre can also avoid the overhead of tri-replication that is present in HDFS [82]. Although parallel file systems are optimized for concurrent accesses by large-scale applications, write overheads can still dominate the run times of data-intensive workloads. Besides, there are applications that require a significant amount of reads from the underlying file system. The performances of such applications (e.g., Grep) are highly influenced by the data locality in the Hadoop cluster [101]. As a result, applications that have equal amounts of reads and writes suffer from poor write performance when run on top of HDFS; whereas, because of inadequate locality, reads perform sub-optimally when these applications run entirely on top of Lustre.

Such applications need a new design that efficiently integrates these two file systems and can offer the combined benefits of both architectures.

In HPC clusters, burst buffer systems are often used to handle the bandwidth limitation of shared file system access [55]; such systems temporarily buffer the bursty I/O and grad- ually flush datasets to the back-end parallel file system. In HPC systems, burst buffers are

typically used to buffer the checkpoint data from HPC applications. Checkpoints are read during application restarts only, and thus checkpointing is a write-intensive operation. On the other hand, there are a variety of Hadoop applications; some are write-intensive, some are read-intensive, while some have equal amounts of reads and writes. Besides, the output of one job is often used as the input of the next one; in such scenarios, data locality in the cluster impacts the performance of the latter to a great extent. Moreover, considering the fact that real data has to be stored to the file system through the burst buffer layer, ensuring fault tolerance of the data is much more important compared to that in the checkpointing domain.

In this work, we address these challenges by designing a key-value (RDMA-based

Memcached) store-based burst buffer system for Big Data Analytics. Considering the aspects of data locality and fault tolerance, we also propose different schemes to integrate

Hadoop with Lustre through this burst buffer. The following sections in this chapter describe the designs and evaluations.

9.1 Design

9.1.1 Proposed Deployment

Figure 9.1(a) depicts our proposed deployment to use a key-value store-based burst buffer system to integrate Hadoop with a parallel file system like Lustre on HPC clusters. In

HPC systems, the Hadoop jobs are launched on the compute nodes. The burst buffer pool is deployed as a separate set of servers. In our design, we use Memcached to develop the burst buffer, so the Memcached server daemon runs on each of these servers. The map and reduce tasks of the Hadoop job write their data to the burst buffer layer. The data is also

flushed to the parallel file system to guarantee persistence and fault tolerance. In order to

reduce the probability of data loss due to node failure, we propose to deploy the Hadoop

DataNodes and Memcached servers in two different failure domains.

The reason that we use a key-value store for buffering Hadoop I/O on top of the parallel

file system is that key-value stores provide flexible APIs to store the HDFS data packets

against corresponding block names (each HDFS block consists of a number of fixed-sized packets). The NameNode generates the block names, and they are unique per block. Also,

data buffered in this way can be read at memory I/O speed from the key-value store.

Figure 9.1: Deployment and Architecture. (a) Deployment; (b) Architecture.

9.1.2 Internal Architecture

Figure 9.1(b) demonstrates the internal architecture of our proposed design. We first design a burst buffer system using Memcached and then integrate Hadoop with Lustre through this buffer.

9.1.2.1 Key-Value Store-based Burst Buffer System

We use the RDMA-based Memcached to design a burst buffer system. For this, each

Memcached server must have a local SSD on it. The burst buffer system has two components: 1) RDMA-based libMemcached (client) and 2) RDMA-based Memcached (server).

Hadoop tasks write data to HDFS in order to maintain locality; HDFS then sends data

to the Memcached servers through RDMA-based libMemcached. The key is formed by

appending the packet sequence number to the corresponding block name, and the value is

the data packet. Thus, all the packets belonging to the same block have the same key prefix

(block name). We maintain the concept of blocks in the Memcached layer by modifying

the libMemcached hash function. Instead of hashing on the entire key, we hash only based

on the block name prefix of the key. In this way, all the data packets belonging to the same

block go to the same Memcached server.
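The sketch below illustrates this idea in Java; the class and method names, as well as the hash function, are hypothetical and are not the actual libMemcached implementation, which would apply an analogous modification inside its native hashing code. It only shows how a packet key could be built from the block name and sequence number and how hashing on the block-name prefix keeps all packets of a block on one server.

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch (names are hypothetical): packets of an HDFS block are keyed as
// "<blockName>_<packetSeqNo>", but server selection hashes only the block-name
// prefix so every packet of a block lands on the same Memcached server.
public class BlockPrefixPartitioner {

    // Build the key for one data packet of a block.
    public static String packetKey(String blockName, long packetSeqNo) {
        return blockName + "_" + packetSeqNo;
    }

    // Select a server by hashing only the block-name prefix of the key.
    public static int serverForKey(String key, int numServers) {
        int sep = key.lastIndexOf('_');
        String prefix = (sep > 0) ? key.substring(0, sep) : key;
        // Simple FNV-1a hash over the prefix bytes; libMemcached would use its
        // own hash, modified analogously to ignore the packet sequence number.
        int hash = 0x811C9DC5;
        for (byte b : prefix.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return Math.floorMod(hash, numServers);
    }
}
```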

The Memcached server side stores the data sent from the HDFS layer in the in-memory

slabs. Data can also be read by the HDFS clients (i.e., map or reduce tasks) from the

server. The burst buffer functionalities are introduced in the Memcached server by two

main components: Persistence Manager and Lustre Writer.

Persistence Manager: As demonstrated in Figure 9.2, the Persistence Manager

consists of a pool of threads attached to a queue. After completion of each set operation

memcached_set(), data is put into the queue. Whenever the first packet of a block arrives, a new file is created in the local SSD. This file is identified by the block name prefix in the key of the first packet. The writer threads get the data and the associated file pointer from the queue and write the data to the file. The file pointer is cached in a linked list so that subsequent packets from the same block can be written to this open file. The HDFS client sends the data packets belonging to a block sequentially to the Memcached server through

RDMA, and when the last packet arrives, the Persistence Manager makes sure that

the memcached_set() returns to the HDFS Client after all the packets, including the last one, are persisted. At this point, the block file is closed and the file pointer is removed from the linked list. Thus, in addition to having a persistent file per block in the SSD, the data packets belonging to the blocks are also kept in the in-memory slabs of the Memcached server. The updated data distribution strategy on the libMemcached side helps to preserve the block structure while data is written to the back-end SSDs on the Memcached servers and, subsequently, to the parallel file system.
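The following Java sketch mirrors the Persistence Manager logic described above. The actual component lives inside the C-based Memcached server; this version (with hypothetical class, field, and directory names) only illustrates the queue, writer-thread pool, and per-block file-pointer cache.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.*;

// Minimal sketch of the Persistence Manager: writer threads drain a queue of
// (block, packet) pairs and append each packet to a per-block file on the local
// SSD, keeping the stream open until the last packet of the block arrives.
public class PersistenceManager {

    private static class Packet {
        final String blockName; final byte[] data; final boolean last;
        Packet(String blockName, byte[] data, boolean last) {
            this.blockName = blockName; this.data = data; this.last = last;
        }
    }

    private final BlockingQueue<Packet> queue = new LinkedBlockingQueue<>();
    private final Map<String, FileOutputStream> openFiles = new ConcurrentHashMap<>();
    private final String ssdDir;

    public PersistenceManager(String ssdDir, int numWriters) {
        this.ssdDir = ssdDir;
        ExecutorService writers = Executors.newFixedThreadPool(numWriters);
        for (int i = 0; i < numWriters; i++) {
            writers.submit(this::drain);
        }
    }

    // Called after each memcached_set() completes on the server side.
    public void enqueue(String blockName, byte[] packet, boolean isLastPacket) {
        queue.add(new Packet(blockName, packet, isLastPacket));
    }

    private void drain() {
        try {
            while (true) {
                Packet p = queue.take();
                // Open (and cache) the per-block file on the first packet.
                FileOutputStream out = openFiles.computeIfAbsent(p.blockName, name -> {
                    try { return new FileOutputStream(ssdDir + "/" + name); }
                    catch (IOException e) { throw new RuntimeException(e); }
                });
                synchronized (out) {
                    out.write(p.data);
                    if (p.last) {            // last packet: persist and close the block file
                        out.getFD().sync();
                        out.close();
                        openFiles.remove(p.blockName);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```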

Lustre Writer: The Lustre Writer consists of a pool of threads attached to a

queue. This module is responsible for forwarding the data to Lustre. In our design, we

consider writing to the parallel file system in both asynchronous and synchronous manners.

The Lustre Writer operates according to the chosen mode.
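A minimal Java sketch of this component is given below, assuming a hypothetical class name and Lustre mount path; it shows a pool of threads draining a queue of block data, with the producer either returning immediately (asynchronous mode) or waiting for the write to complete (synchronous mode).

```java
import java.io.FileOutputStream;
import java.util.concurrent.*;

// Minimal sketch of the Lustre Writer: a thread pool writes queued block data to
// the Lustre mount point; the caller blocks only in synchronous mode.
public class LustreWriter {

    public enum Mode { ASYNCHRONOUS, SYNCHRONOUS }

    private final ExecutorService pool;
    private final String lustreDir;
    private final Mode mode;

    public LustreWriter(String lustreDir, int numThreads, Mode mode) {
        this.lustreDir = lustreDir;
        this.mode = mode;
        this.pool = Executors.newFixedThreadPool(numThreads);
    }

    // Forward one block's data to Lustre according to the configured mode.
    public void write(String blockName, byte[] data) throws Exception {
        Future<?> done = pool.submit(() -> {
            try (FileOutputStream out =
                     new FileOutputStream(lustreDir + "/" + blockName, true)) {
                out.write(data);
                out.getFD().sync();        // ensure the data reaches the file system
                return null;
            }
        });
        if (mode == Mode.SYNCHRONOUS) {
            done.get();                     // block until the Lustre write completes
        }
    }
}
```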

In this study, we also consider deployment of our proposed design on different clusters.

Therefore, if there is no SSD node on the cluster, the Persistence Manager in the

Memcached server is not enabled. For such scenarios, we propose different modes to

deploy our design. We support these modes via different integration schemes of Hadoop

with Lustre through the buffer layer. We discuss these schemes in detail in Section 9.1.3.

Also, data can be sent to the parallel file system by either the burst buffer or the HDFS layer

directly based on the chosen scheme. So, Lustre Writer can be enabled either in the

Memcached server or in the HDFS layer.

9.1.2.2 Integration of Hadoop with Lustre through Key-Value Store-based Burst Buffer

The interaction of Hadoop with the buffer layer is controlled by the following two major

components as depicted in Figure 9.1(b):

Figure 9.2: Persistence Manager.

I/O Forwarding Module: The I/O Forwarding Module in the HDFS Client is responsible

for interaction with the Memcached layer. It creates a libMemcached client object and

invokes the memcached_set() operation via JNI to pass the data to the Memcached

server. The JNI layer bridges the Java-based HDFS with the native libMemcached library.

While reading the buffered data, the HDFS Client uses the memcached_get() API.
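A sketch of this JNI bridge is shown below. All names (the native library, class, and method signatures) are hypothetical illustrations of the pattern described above, not the actual implementation; the real module wraps the RDMA-based libMemcached calls in native code.

```java
// Minimal sketch of the I/O Forwarding Module: native methods wrap
// memcached_set()/memcached_get(), and the Java side forwards each HDFS data
// packet keyed by block name and packet sequence number.
public class MemcachedForwarder {

    static {
        System.loadLibrary("memcachedjni"); // hypothetical JNI wrapper library
    }

    // Native wrappers around libMemcached calls (signatures are illustrative).
    private native long nativeCreateClient(String serverList);
    private native int nativeSet(long clientHandle, String key, byte[] value);
    private native byte[] nativeGet(long clientHandle, String key);

    private final long clientHandle;

    public MemcachedForwarder(String serverList) {
        this.clientHandle = nativeCreateClient(serverList);
    }

    // Forward one data packet of an HDFS block to the burst buffer.
    public void forwardPacket(String blockName, long seqNo, byte[] packet) {
        nativeSet(clientHandle, blockName + "_" + seqNo, packet);
    }

    // Read a buffered packet back (used by map/reduce tasks on a local miss).
    public byte[] readPacket(String blockName, long seqNo) {
        return nativeGet(clientHandle, blockName + "_" + seqNo);
    }
}
```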

BlockSweeper: Each DataNode has a BlockSweeper that wakes up periodically and removes data from the in-memory slabs of the Memcached servers. Typically, the

BlockSweeper kicks in at the end of a job or when no job is actively running on the cluster. The interval at which the BlockSweeper should wake up is configurable. If the

available memory on a Memcached server runs out while a job is running, data eviction

takes place following the LRU policy of Memcached.
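The periodic behavior of the BlockSweeper can be sketched as follows; the class, interfaces, and bookkeeping are hypothetical and only illustrate a configurable sweep interval that deletes completed blocks' keys from the Memcached slabs.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the BlockSweeper: a periodic task on each DataNode that
// removes the keys of completed blocks from the Memcached in-memory slabs.
public class BlockSweeper {

    public interface KeyValueClient {            // thin facade over the JNI layer
        void delete(String key);                 // maps to memcached_delete()
    }

    public interface SweptBlockTracker {         // hypothetical bookkeeping of sweepable blocks
        List<String> completedBlocks();
        List<String> packetKeys(String blockName);
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public BlockSweeper(KeyValueClient client, SweptBlockTracker tracker,
                        long intervalSeconds) {
        // The sweep interval is configurable, as described in the text.
        scheduler.scheduleAtFixedRate(() -> {
            for (String blockName : tracker.completedBlocks()) {
                for (String key : tracker.packetKeys(blockName)) {
                    client.delete(key);
                }
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
```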

9.1.3 Proposed Schemes to Integrate Hadoop with Lustre

Considering the aspects of data locality and fault tolerance, our design supports three

different modes: (1) Asynchronous Lustre Write (ALW), (2) Asynchronous Disk Write

(ADW), and (3) Bypass Local Disk (BLD). In this section, we propose different schemes

to integrate Hadoop with Lustre through the key-value store-based burst buffer system proposed in Section 9.1.2.1 to provide support for different modes.

Figure 9.3: Hadoop Write Flow. (a) ALW (Asynchronous Lustre Write); (b) ADW (Asynchronous Disk Write); (c) BLD (Bypass Local Disk).

9.1.3.1 Asynchronous Lustre Write (ALW)

Figure 9.3(a) shows the integration scheme to support ALW mode. In this scheme,

MapReduce jobs are scheduled on the compute nodes. Each map or reduce task essentially

launches an HDFS Client that writes one copy of data to the local storage on the DataNode

and the other copy is transferred to the Memcached server through the I/O Forwarding

Module. The DataNode forwards the data to the parallel file system in an asynchronous

manner through the Lustre Writer. The DataNode, while writing to Lustre, keeps track of the pending writes, and when a write completes, it informs the NameNode.

The I/O Forwarding Module writes the data packets of HDFS blocks to Memcached. Each Memcached server is backed by SSD or NVRAM. Therefore, on completion of the memcached_set() operation, the data is given to the Persistence Manager

that ensures data persistence by synchronously writing the data to SSD.

In this way, Memcached servers host an SSD copy along with the in-memory copy of a

data block. In this scheme, the Lustre Writer is activated on the DataNode side. The

Lustre Writer can also be enabled on the Memcached servers instead of the DataNodes. But the advantage of keeping it in the DataNode is that the DataNode can inform the

NameNode on completion of the asynchronous Lustre writes.

The map and reduce tasks of a MapReduce job read data from the local storage whenever possible. This ensures the benefit of data locality for Hadoop jobs. If, on the other hand, data is not available locally, they read from the in-memory slabs of the Memcached servers through the memcached_get() operation. In case of failures (data is not available in local storage or Memcached), data is read from Lustre.
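This fallback order can be sketched in Java as below; the class, directory layout, and naming of replica files are hypothetical and assume the MemcachedForwarder sketch shown earlier, serving only to illustrate the local-first, buffer-second, Lustre-last read path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Minimal sketch of the hybrid read path: try the local replica first, then the
// Memcached burst buffer (memcached_get() via the JNI forwarder), and finally
// fall back to the copy flushed to Lustre.
public class HybridBlockReader {

    private final String localStorageDir;
    private final String lustreDir;
    private final MemcachedForwarder forwarder;   // from the earlier sketch

    public HybridBlockReader(String localStorageDir, String lustreDir,
                             MemcachedForwarder forwarder) {
        this.localStorageDir = localStorageDir;
        this.lustreDir = lustreDir;
        this.forwarder = forwarder;
    }

    public byte[] readPacket(String blockName, long seqNo) throws IOException {
        // 1. Local storage: best locality, no network traffic.
        Path local = Paths.get(localStorageDir, blockName + "_" + seqNo);
        if (Files.exists(local)) {
            return Files.readAllBytes(local);
        }
        // 2. Burst buffer: in-memory slabs of the Memcached servers over RDMA.
        byte[] buffered = forwarder.readPacket(blockName, seqNo);
        if (buffered != null) {
            return buffered;
        }
        // 3. Failure case: read the persistent copy from Lustre.
        return Files.readAllBytes(Paths.get(lustreDir, blockName + "_" + seqNo));
    }
}
```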

This scheme meets the design challenges as follows:

1. Improves performance by hiding the latency of parallel file system access through

the high performance burst buffer system. The Memcached-based burst buffer layer

uses RDMA-based communication and SSD-based data persistence.

2. Improves data locality compared to MR jobs running over Lustre by placing data

in the local storage of the DataNodes. By reading data from the local storage and

Memcached, this design can achieve much better read performance.

3. Provides fault tolerance by storing one replica on local disk and one in the burst

buffer; these two replicas are in two different failure domains. As presented in [113],

two replicas on two different failure domains are enough for fault tolerance in data

centers. In our design, data is also written to the parallel file system asynchronously.

9.1.3.2 Asynchronous Disk Write (ADW)

Figure 9.3(b) shows our proposed scheme to support the ADW mode.

In this design, the HDFS Client writes one copy of data to the local disks on the DataNode asynchronously. The second copy of data is sent to the Memcached server through

the I/O Forwarding Module by the HDFS Client. In this scheme also, we use the

updated hash function to hash the key-value pairs based on the block name prefix rather

than the entire key so that all the data packets belonging to the same block are sent to the

same Memcached server.

In this scheme, we assume that the Memcached servers do not have any local SSD; this

scheme is applicable on clusters that have no SSD. So the Persistence Manager in

the Memcached server is not activated by this scheme. Also, this scheme prioritizes data

fault tolerance over write performance. Therefore, the Lustre Writer on the Memcached server writes data synchronously to Lustre. After completion of each set operation

using memcached_set(), data is put into the queue attached to the Lustre Writer.

Whenever the first packet of a block arrives, a new file is created in Lustre. Just like the

Persistence Manager as described in Section 9.1.2.1, the Lustre Writer also caches the file pointers. The pool of threads in the Lustre Writer increases overlapping between Memcached operations and Lustre I/O. At the same time, it guarantees that each data block is persisted synchronously.

The map and reduce tasks of a MapReduce job read data from the local storage or

Memcached. In case of failures (data is not available in local storage or Memcached), data is read from Lustre.

This scheme meets the design challenges as follows:

1. It can reduce I/O bottlenecks by asynchronously writing data to local disks.

2. Improves data locality compared to MapReduce jobs running over Lustre by placing

data in the local storage of the DataNodes. By reading data from the local storage

and Memcached, this design can achieve better read performance.

3. This scheme synchronously writes data to Lustre. Thus, it can offer the same level of

fault tolerance as Lustre.

9.1.3.3 Bypass Local Disk (BLD)

Figure 9.3(c) shows the internal design scheme to support the BLD mode. There are

HPC clusters that have a large amount of memory on the compute nodes while the local disk space is limited. For this type of cluster, application performance can be largely improved by avoiding disk I/O. Therefore, in this scheme, we propose to place one copy of data in RAM Disks on the compute nodes in a greedy manner. Since dedicating enough memory for storage may not always be feasible for the application, we keep a threshold on RAM Disk usage. When this usage threshold is reached, data is forwarded to the

Memcached server by the I/O Forwarding Module in the DataNode. We assume that the Memcached servers are not backed by SSD and that the key-value pairs are stored in the in-memory slabs only. The other copy of data is synchronously written to Lustre by the Lustre Writer in the HDFS Client. Thus, in this scheme, the Lustre Writer and Persistence Manager on the Memcached server are not enabled, and the scheme uses the key-value store as a buffer so that the stored data can be read at memory speed. Since persistence of the data in Memcached is not needed, we use the default hash function of libMemcached (hash on the entire key) here to distribute the data across different servers.

For this scheme, we do not send all the data to Memcached; only the data that cannot fit into the local RAM Disks is sent to the Memcached servers. Therefore, in order to make all the data fault-tolerant, we send it to Lustre directly from the HDFS Client side rather than from the Memcached servers.
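The threshold-based decision in BLD can be sketched as follows; the class name, threshold handling, and use of file-system statistics are hypothetical and only illustrate the adaptive choice between RAM Disk and Memcached described above.

```java
import java.io.File;

// Minimal sketch of the BLD placement decision: a block's locality copy goes to
// the local RAM Disk while its usage stays under a configured threshold; once
// the threshold is reached, the copy is forwarded to Memcached instead. The
// second copy is always written synchronously to Lustre by the HDFS client.
public class BldPlacementPolicy {

    public enum Target { RAM_DISK, MEMCACHED }

    private final File ramDiskRoot;
    private final double usageThreshold;          // e.g., 0.8 = use at most 80% of the RAM Disk

    public BldPlacementPolicy(String ramDiskPath, double usageThreshold) {
        this.ramDiskRoot = new File(ramDiskPath);
        this.usageThreshold = usageThreshold;
    }

    // Decide where the locality copy of the next block should go.
    public Target chooseTarget(long blockSizeBytes) {
        long total = ramDiskRoot.getTotalSpace();
        long used = total - ramDiskRoot.getUsableSpace();
        boolean fits = (used + blockSizeBytes) <= (long) (usageThreshold * total);
        return fits ? Target.RAM_DISK : Target.MEMCACHED;
    }
}
```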

In this scheme, instead of accessing the parallel file system, MapReduce jobs read data from the local storage or the Memcached servers.

This scheme meets the design challenges as follows:

1. It can entirely bypass disk I/O on local storage by adaptively switching to place data

to Memcached rather than slower local disks.

2. By reading data from the local storage and Memcached, this design can achieve much

better data locality and read performance compared to MapReduce run over Lustre.

3. This scheme offers the same level of fault tolerance as Lustre by synchronously writing

to Lustre.

9.2 Experimental Results

In this section, we perform evaluations with the proposed burst buffer design while

integrating Hadoop with Lustre through it using our proposed schemes.

9.2.1 Experimental Setup

The experiments are performed on three different clusters. They are:

OSU RI Cluster A: The cluster configurations are described in Chapter 3.2.

OSU RI Cluster B: The cluster configurations are described in Chapter 3.2.

TACC Stampede: The cluster configurations are described in Chapter 4.2.

9.2.2 Evaluation with TestDFSIO

First, we determine the optimal number of Memcached servers required to evaluate our designs on OSU RI. For this, we vary the number of DataNodes from 10 to 14 on OSU RI

Cluster A and the number of Memcached servers from 8 to 4, respectively. For Memcached server deployment, we use the large memory nodes on OSU RI Cluster B, and we use 20GB of memory for Memcached on each server. We run the TestDFSIO test with 100GB data size using 80 maps. For these experiments, we use the ALW mode proposed in Section 9.1.3.1.

Table 9.1: Determining the optimal number of Memcached servers

DataNodes    Memcached Servers    Write Throughput    Read Throughput
10           8                    12.72 MBps          237.5 MBps
12           6                    13.99 MBps          240.6 MBps
14           4                    14.03 MBps          67.8 MBps

As observed from Table 9.1, we can achieve the highest read throughput while using

12 DataNodes with 6 Memcached servers. In the ALW mode, we want to buffer the entire dataset in the Memcached servers. Therefore, while using 4 Memcached servers, each with 20GB of memory, the entire dataset does not fit in memory, and earlier key-value pairs get dropped to make space for the new ones. The write throughput in this case is the best because it has the largest number of DataNodes, and ALW writes one copy of data to local storage and the other to the Memcached servers. But during TestDFSIO read, this experiment reads the entire data from the DataNodes (if data is not available locally, it is read from another DataNode over IPoIB (32Gbps)), since not all the data is available in the Memcached servers. On the other hand, while using 12 DataNodes with 6 Memcached servers, the entire dataset can

fit in memory, and with this combination of servers, we achieve the highest read throughput with near-optimal write performance. The write throughput in this case is slightly less than that for the 14:4 case. This is because, due to fewer DataNodes, the I/O overhead during write is higher here. But since data is read from both the local storage and the Memcached servers over RDMA, the read throughput is much higher in this case. Therefore, in our subsequent experiments on OSU RI, we maintain the 2:1 ratio of DataNodes to Memcached servers and choose the dataset such that the entire data can fit in memory. On TACC Stampede, we use a 4:1 ratio of DataNodes to Memcached servers.

Besides, in our design, we follow a hybrid approach for data reads. We read data from local storage when available. In case data is not available locally, we read it from the

Memcached servers. The throughput of 40GB data read while using 16 DataNodes and 8

Memcached servers following this hybrid approach is 390.48 MBps. But if data is entirely read from the Memcached servers, this throughput reduces to 304.67 MBps. The reason behind this is that, even though in-memory reads reduce the I/O overheads, when data is read entirely from Memcached, the read throughput is bounded by the aggregated bandwidth of the Memcached servers. But reading data locally helps reduce the contention for network bandwidth and also scales with the number of DataNodes.

Figure 9.4 shows the results of our evaluations on OSU RI. While running these experiments with our designs, we use 16 nodes on OSU RI Cluster A as Hadoop DataNodes and 8 large memory nodes as Memcached servers from OSU RI Cluster B. In order to have a fair distribution of resources, we use 24 DataNodes while running TestDFSIO on HDFS and

Lustre. In all these cases, we perform these experiments with 64 maps and vary the data sizes from 60GB to 100GB. As observed from Figure 9.4, our ALW mode improves the write throughput by 2.6x over HDFS and 1.5x over Lustre for 100GB data size. Compared to HDFS, our approach can avoid the overhead of replication on local disks. Also, we hide the latency of shared file system access by using the Memcached-based burst buffer layer.

For ADW and BLD, we do not see significant gain in write performance compared to

Lustre. This is because Lustre writes only a single copy of data, whereas, in both ADW and

BLD, we write additional data copies on local storage to introduce locality. The write performance of BLD is slightly better than that of ADW because, unlike ADW, BLD can avoid the overhead of Memcached writes as long as the data fits in the local RAM Disks.

In terms of read performance, all three modes perform better than both HDFS and Lustre

with the improvement being up to 8x. The read throughput of ADW is slightly higher than that of BLD on Cluster A. This is because the RAM Disk size on Cluster A is small, and thus locality is reduced during reads in the BLD mode. However, due to RDMA operations over high performance interconnects, the difference in read performance is not significant between these two modes.

Figure 9.4: Evaluation of TestDFSIO (OSU RI). (a) TestDFSIO Write; (b) TestDFSIO Read.

Figure 9.5: Evaluation of TestDFSIO (TACC Stampede). (a) TestDFSIO Write; (b) TestDFSIO Read.

Figure 9.5 shows the results of our evaluations with TestDFSIO write and read on TACC

Stampede. While running these experiments with our designs, we use 8 DataNodes for

Hadoop and 2 servers for Memcached. For experiments with 16 and 32 DataNodes, the numbers of Memcached servers used are 4 and 8, respectively. Each Memcached server

uses 25GB of memory. In order to have a fair distribution of resources, we use 10, 20, and 40 DataNodes while running TestDFSIO on HDFS and Lustre. In these experiments, we vary the data sizes from 50GB on 10 nodes to 200GB on 40 nodes. As observed from

Figure 9.5(a), BLD improves the write throughput by 30x over HDFS. On TACC Stampede, local disk performance is very poor. Thus, bypassing the local disks leads to significant improvement over HDFS. Compared to Lustre, we do not see any obvious improvement in write performance. This is because, in our design (BLD), we write one copy of data to local RAM Disks so that subsequent readers can achieve sufficient data locality. Another copy is written to Lustre for fault tolerance. In terms of read throughput, our design has up to 64% improvement over Lustre and 15x over HDFS.

These experiments show that ALW performs best on OSU RI and BLD performs best on TACC Stampede (BLD writes faster than ADW and requires less memory for Memcached). Therefore, in our subsequent experiments, we use ALW on OSU RI and BLD on TACC Stampede.

9.2.3 Evaluation with RandomWriter and Sort

In this section, we evaluate our designs with the RandomWriter and Sort benchmarks. RandomWriter is write-intensive, while Sort has equal amounts of reads and writes.

Figure 9.6: Evaluation of RandomWriter and Sort (OSU RI). (a) RandomWriter; (b) Sort.

Figure 9.7: Evaluation of RandomWriter and Sort (TACC Stampede). (a) RandomWriter; (b) Sort.

Figure 9.6 shows the performances of RandomWriter and Sort on the OSU RI cluster using the ALW mode. We perform these experiments on 24 nodes for HDFS and Lustre. For our design, we use 16 DataNodes and 8 Memcached servers, each using 20GB of memory and backed by SSD. We run 4 concurrent maps per host and vary the data size from 40GB to 80GB. Therefore, for HDFS and Lustre, a total of 96 maps write the same amount of data as 64 maps in our design. As observed from the figure, even with less concurrency,

ALW reduces the execution time of RandomWriter by up to 36% over Lustre and 39% over

HDFS. As observed from Figure 9.6, our design improves Sort performance by 20% over

Lustre and by 18% over HDFS.

Table 9.2: Performance comparison of RandomWriter and Sort (execution time in seconds) with 4 and 6 maps per host using ALW

Data Size (GB)   RandomWriter ALW-4   RandomWriter ALW-6   Sort ALW-4   Sort ALW-6
40               79                   68                   301          318
60               107                  94                   623          638
80               123                  112                  948          967

As observed from Table 9.2, with a total of 96 maps (6 maps per host) like HDFS and

Lustre, our design can further reduce the execution time of RandomWriter. Increasing the concurrency does not help improve the performance of Sort. This is because the major bottleneck of Sort lies in the shuffle phase, and with increasing concurrency, the level of locality reduces. Even then, our design has up to 15% gain over HDFS and up to 19% gain over Lustre.

Figure 9.7 shows the performances of RandomWriter and Sort on TACC Stampede using the BLD mode. We perform these experiments on 10, 20, and 40 nodes for HDFS and Lustre. For our design, we use 8, 16, and 32 DataNodes with 2, 4, and 8 Memcached servers, respectively, each using 28GB of memory. We run 4 maps per host and vary the data size from 50GB to 200GB. Therefore, for HDFS and Lustre on 20 nodes, a total of 80 maps write the same amount of data as 64 maps in our design (16 DataNodes). As observed from the figure, BLD reduces the execution time of RandomWriter by up to 5x over HDFS.

But it does not improve the execution time over Lustre. This is because BLD writes one additional copy of data in local RAM Disks compared to Lustre. But BLD improves the performance of Sort by 28% over Lustre and by 19% over HDFS.

Table 9.3: Performance comparison of RandomWriter and Sort (execution time in seconds) with 4 and 5 maps per host using BLD

Data Size (GB)   RandomWriter BLD-4   RandomWriter BLD-5   Sort BLD-4   Sort BLD-5
50               59                   53                   677          721
100              61                   59                   768          801
200              62                   59                   858          893

In these evaluations, we run 4 maps per host. Therefore, for HDFS and Lustre more

concurrent maps run than that in our design. Running BLD with 5 concurrent maps per host

can make the total number of maps equal in all the cases. Table 9.3 shows the performance

of BLD with 4 maps and 5 maps per host.

As observed from the table, with an increased number of concurrent maps, the execution

time of RandomWriter reduces. But increasing the concurrency does not help improve the

performance of Sort. This is because the major bottleneck of Sort lies in the shuffle phase,

and with increasing concurrency, the level of locality reduces. Even then, our design has

up to 15% gain over HDFS and up to 25% gain over Lustre.

Figure 9.8: Evaluation with PUMA [66]. (a) Workloads (OSU RI); (b) Workloads (TACC Stampede).

9.2.4 Evaluation with PUMA

In this section, we evaluate the performance of different PUMA workloads like SequenceCount, RankedInvertedIndex, and Histogram rating. On the OSU RI cluster, we use the

ALW mode, and on TACC Stampede, we use BLD.

On OSU RI, our design has up to 34.5% improvement over Lustre and 40%

over HDFS for SequenceCount. For RankedInvertedIndex, the gain is 48% over HDFS and

27.3% over Lustre. Our gain for Histogram rating is up to 17% on this cluster.

On the other hand, on TACC Stampede, our design has up to 13% improvement over

Lustre and 23% over HDFS for SequenceCount. For RankedInvertedIndex, the gain is 34%

over HDFS and 16% over Lustre. Our gain for Histogram rating is up to 16% over Lustre.

The SequenceCount workload has almost thrice as much output as input. Thus, it is a write-intensive application. RankedInvertedIndex also has equal amounts of reads and writes. So our design has significant gains compared to both HDFS and Lustre for these workloads, because, in addition to reducing the I/O overheads, our design improves locality, which enhances the read performance. The amount of writes is less than that of reads for Histogram rating. But our design gains mainly due to improved read performance in

this case. However, Lustre on TACC Stampede uses the InfiniBand (56Gbps) verbs interface,

whereas on OSU RI, the interconnect to access Lustre is IPoIB (32Gbps). The Lustre installation on TACC Stampede is also much larger, with a higher number of I/O nodes than that on

OSU RI. TACC Stampede also has large memory on the compute nodes. For these reasons,

the performance of the workloads over Lustre is better on TACC Stampede. Therefore,

on OSU RI, our design has a higher gain over Lustre compared to that on TACC Stampede.

9.3 Related Work

Singh et al. [83] presented a dynamic caching mechanism for Hadoop by integrating

HDFS with Memcached. They discussed several practical design choices for the integration and chose to use Memcached as the metadata store for objects at HDFS DataNodes. However, they focused on HDFS read operation performance only. In [46], we

presented an in-memory design for HDFS using Memcached that improves both the read

and write performances. Researchers in [100] proposed a Memcached-based burst buffer

for HPC applications running over parallel file systems. In [73], the authors proposed an

RDMA-based design of YARN MapReduce over Lustre that stores the intermediate data of

MapReduce jobs in Lustre. This work also shows that Lustre read performance degrades with increasing concurrency. Even though HPC systems are running Big Data workloads over parallel file systems, very few studies focus on designing a burst buffer system for

Big Data. In this work, we propose a burst buffer for Big Data over parallel file systems.

9.4 Summary

In this work, we proposed to integrate Hadoop with Lustre through an RDMA-based key-value store. We designed a burst buffer system for Big Data analytics applications using RDMA-based Memcached [48] and integrated Hadoop with Lustre through this high-performance buffer layer. We also presented three schemes for integrating Hadoop with

Lustre, considering different aspects of I/O, data locality, and fault tolerance. Performance evaluations show that our design can improve the performance of TestDFSIO write by up to 2.6x over HDFS and 1.5x over Lustre. The gain in read throughput is up to 8x.

The execution time of Sort is reduced by up to 28% over Lustre and 19% over HDFS.

The performances of PUMA workloads like SequenceCount, RankedInvertedIndex, and

Histogram rating are also significantly improved by our design.

Chapter 10: Future Research Directions

Distributed file systems and I/O middleware are widely used in a range of diverse fields including Big Data, Deep and Machine Learning, Bio-informatics, Climate Change Prediction, etc. Consequently, the designs proposed in this dissertation are applicable to the

I/O subsystem of a wide variety of domains and applications. In this chapter, we discuss some of the future research directions that can be explored as a follow-up to the work done in this thesis.

10.1 Designing Kudu over RDMA and NVM

Typically, data stored in HDFS is static, as HDFS does not support random write or in-place update operations. This type of storage is sufficient for most of the Big Data processing frameworks like Hadoop MapReduce and Spark. But for fast analytics, OLAP workloads in particular, storage engines with low-latency random access support are emerging. One such example is Kudu [41], which supports fast sequential as well as random access to the underlying storage devices.

As a future work, Kudu can be enhanced with advanced HPC features similar to the design concepts of SOR-HDFS, Triple-H, and NVFS. The major design opportunities and challenges in Kudu are:

Figure 10.1: Network Architecture of Kudu, Courtesy [11]

RDMA-based Communication: Kudu stores data in tables, and a table is split into segments called tablets. As shown in Figure 10.1, Kudu replicates its tablets to meet the availability SLAs of different applications. The RDMA-based replication with the SEDA-based overlapping approach can be applied here for faster data replication.

Figure 10.2: Impact of interconnects and storage on Kudu operations. (a) Kudu insert over different interconnects; (b) Kudu insert with SSD-based log flushing.

To perform some initial experiments, we design a micro-benchmark that creates a table in Kudu and inserts records into that table. Each of these records gets replicated to three

tablet servers. These experiments are performed on four tablet servers (one master) on

OSU RI2. As observed from Figure 10.2(a), as the underlying interconnect is switched from 40GigE Ethernet to IPoIB (100Gbps), the total insert time to Kudu reduces by up to

20%. This proves that an RDMA-based design of Kudu can further accelerate the insert operation performance.
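A sketch of the kind of micro-benchmark described above is shown below, written against the Kudu Java client API. The master address, table name, schema, hash partitioning, and record count are placeholder assumptions; the benchmark used for the results in Figure 10.2 may differ in its details.

```java
import java.util.Arrays;
import java.util.Collections;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

// Create a Kudu table with three replicas and time a batch of record inserts,
// each of which is replicated to the tablet servers.
public class KuduInsertBenchmark {
    public static void main(String[] args) throws Exception {
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master:7051").build();

        Schema schema = new Schema(Arrays.asList(
            new ColumnSchema.ColumnSchemaBuilder("key", Type.INT64).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build()));

        client.createTable("insert_bench", schema,
            new CreateTableOptions()
                .setNumReplicas(3)                                   // tri-replicated tablets
                .addHashPartitions(Collections.singletonList("key"), 4));

        KuduTable table = client.openTable("insert_bench");
        KuduSession session = client.newSession();

        long start = System.currentTimeMillis();
        for (long i = 0; i < 100_000; i++) {                         // placeholder record count
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            row.addLong("key", i);
            row.addString("value", "record-" + i);
            session.apply(insert);
        }
        session.flush();
        System.out.println("Total insert time (ms): "
                           + (System.currentTimeMillis() - start));

        session.close();
        client.close();
    }
}
```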

Faster I/O with High Performance Storage Devices: After the data is received by the server, it is written to the underlying storage device. We, therefore, believe that enhancing the storage engine with advanced data placement policies along with high performance storage devices will significantly reduce the cost of data distribution. As observed from

Figure 10.2(b), by storing only the Kudu write-ahead logs to SSD, the insert operation performance can be improved by up to 16% over IPoIB (100Gbps). The performance over

40GigE is also improved by 14%. Moreover, the support of random write and update operations in Kudu promises new challenges for incorporating NVM in the architecture. NVM offers near-DRAM performance. So placing the performance-critical data in NVM with memory-access mode can be critical for real-time applications. Researchers can further explore these directions.

10.2 Study the Impact of the Proposed Data Placement and Access Policies on Energy Efficiency

In Chapter 5 and Chapter 9, we have proposed a variety of data placement and access policies for HDFS on HPC clusters with heterogeneous storage. These policies are targeted to maximize performance and, hence, prioritize utilizing high performance storage devices such as RAM Disk, SSD, NVM, etc. over Disk or the parallel file system. But energy efficiency also becomes a critical factor for large-scale HDFS deployment. Moreover, storage devices like SSD and NVM have very impressive power consumption characteristics. As

a future work, one could evaluate the impact of our proposed policies on energy efficiency for different upper-level middleware and applications. For example, evaluating the policies for MapReduce vs. Spark implementations of the same workload can be an interesting direction. There have been quite a few research efforts on energy efficiency in the Hadoop ecosystem [51–53]. In most of these works, the authors focus on cluster-level power supply control for energy efficiency. But very few studies have considered exploring the data placement and access policies for energy efficiency. One can also characterize and further optimize these policies for different upper-level middleware and applications.

10.3 Accelerating Erasure Coding with RDMA and KNL/GPU

Erasure coding is gaining momentum in the community for replicated systems like

HDFS, Memcached, etc. Even though erasure coding can save a lot of storage space, it degrades the read performance by reducing data locality. This is because a single block is typically striped across multiple nodes in most of the erasure coding techniques [7]. It also involves a large CPU overhead (computation) during reconstruction of the data blocks after node failures. Read performance can be accelerated by using RDMA to combine the stripes faster while reading the desired blocks. The encoding and decoding performance can be improved by using KNL or GPUs, which can significantly speed up the complex computations involved in these phases. Although researchers have proposed a range of algorithms [77,

107, 112] for erasure coding, very few of them focus on exploiting HPC technologies and resources to speed up the process. Researchers can explore these directions in the future.

10.4 Designing High Performance I/O Subsystem for Deep/Machine Learning Applications

On HPC clusters, Deep/Machine Learning applications operating on terabytes of data suffer from huge I/O bottlenecks due to accessing disk-based storage or a shared parallel file system. Existing solutions use in-memory double buffering schemes to reduce the overhead of bringing the data from the disk/file system to memory [75]. Although these techniques can accelerate the computation pipeline of Deep/Machine Learning applications, they still limit scale-up as they reduce the available memory for the current computation. An

NVRAM/SSD-based burst buffer can help the application by hiding the latency of shared

file system access. The NVRAMs can be located on the compute nodes so that data from the intermediate stages can be buffered locally. On the other hand, if a number of nodes on the cluster host NVRAM, those nodes can act as the burst buffer servers. The initial phases of the application can also be accelerated by pre-fetching and caching the input data to the burst buffer from the back-end file system. This can introduce overlapping among different phases of computation and I/O and thus, keep the execution pipeline full. Researchers can explore these designs and their evaluations in the future.

Chapter 11: Conclusion and Contributions

HDFS being the underlying storage engine for Hadoop MapReduce, Spark, HBase, etc., its performance and scalability are of supreme importance for a variety of Big Data applications. Even though HDFS was initially designed for commodity hardware, it is increasingly being used on HPC platforms. The outstanding performance requirements of HPC systems make the I/O and communication bottlenecks of HDFS a critical reason to rethink its storage architecture. The byte-stream communication nature of Java sockets, along with the default OBOT architecture for HDFS block write and replication, cannot fully exploit the underlying hardware capabilities and obtain peak performance. Besides, the existing data placement policies of HDFS cannot ensure efficient utilization of the heterogeneous storage devices for performance-sensitive applications. Moreover, due to the tri-replication of the data blocks, HDFS requires an excessive amount of local storage space, which makes its deployment challenging on HPC systems. Even though most HPC systems are equipped with a vast installation of parallel file systems like Lustre, the default architecture of HDFS does not provision their utilization for data storage in an optimal manner. Besides, HDFS does not consider the storage types and underlying interconnect while accessing data; upper-level middleware read data considering locality only, which results in performance deficiency on clusters with heterogeneous storage characteristics. Furthermore, the current HDFS design is also unable to leverage the byte-addressability of the emerging Non-Volatile Memory

(NVM) systems. All these lead to suboptimal performance for HDFS and, hence, for the upper-layer frameworks and applications on HPC systems.

In this thesis, we addressed several of these critical issues in HDFS. We designed an

RDMA-based HDFS with enhanced overlapping among different stages. We designed advanced data placement policies to efficiently utilize in-memory and heterogeneous storage devices for batch and iterative applications. Our proposed placement schemes can improve data locality over running Hadoop jobs directly with parallel file systems while reducing the local storage requirements. We proposed efficient data access strategies for Hadoop and

Spark on heterogeneous (storage characteristics) HPC clusters. We also proposed an NVM-based design of HDFS and co-designed it with Spark and HBase to accelerate these middleware in a cost-effective manner. Finally, we designed a key-value store-based burst buffer system to take advantage of the existing parallel file system for performance-sensitive Big

Data applications. The designs proposed in this thesis significantly improve the performance of a range of upper-level middleware including MapReduce, HBase, Hive, Spark, and Spark SQL, as well as Big Data applications and workloads like MR-MSPolygraph,

CloudBurst, PUMA, etc.

11.1 Software Release and its Impact

RDMA for Apache Hadoop is an RDMA-based implementation of the Apache Hadoop software and is released as part of the High-Performance Big Data (HiBD) project [38]. It contains high performance designs for Hadoop RPC, MapReduce, and HDFS over native

InfiniBand and RoCE (RDMA over Converged Ethernet). It also supports a plugin-based architecture compliant with Apache Hadoop 2.7.3, Hortonworks Data Platform (HDP)

2.5.0.3, and Cloudera Distribution including Hadoop (CDH) 5.8.2 for HDFS, MapReduce, and RPC and includes HDFS Micro-benchmarks as part of the OHB Micro-benchmarks

0.9.2 release. It also incorporates an RDMA-Memcached-based burst buffer design for Hadoop over Lustre, and the file system includes support for running Spark (default and RDMA version), HBase (default and RDMA version), Spark SQL, Hive, etc. The software is publicly available for the community. This package is being used by more than 200 organizations worldwide in 27 countries to accelerate Big Data applications. As of November 2016, more than 18,750 downloads have taken place from this project's site [38]. It is also deployed on some leadership-class HPC clusters. For example, the RDMA for Apache

Hadoop package is installed and publicly used in Comet at San Diego Supercomputer Center [79].

The duration of this work has spanned several release versions of the RDMA for Apache

Hadoop package, from version 0.9.1 to 1.1.0 (current). Many of the designs and their variants described in this thesis have already been released in this software package. All components of this thesis will eventually be released as part of the RDMA for Apache

Hadoop software.

Bibliography

[1] Apache Hive. http://hive.apache.org/.

[2] Apache Mesos. http://mesos.apache.org.

[3] Apache Tez. https://tez.apache.org/.

[4] Big data needs a new type of non-volatile memory. http://www.electronicsweekly.com/news/big-data-needs-a-new-type-of-non-volatile-memory-2015-10/.

[5] Cloudera: Data access based on Locality and Storage Type. http://blog.cloudera.com/blog/2014/08/new-in-cdh-5-1-hdfs-read-caching/.

[6] Cloudera Spark Roadmap. https://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/09/Cloudera_Spark_Roadmap_1.png.

[7] Erasure Coding Cloudera. http://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/.

[8] Future of Hadoop. https://www.datanami.com/2015/09/09/spark-is-the-future-of-hadoop-cloudera-says/.

[9] Intel HiBench. https://github.com/intel-hadoop/HiBench.

[10] King. https://king.com/.

[11] Kudu architecture. http://cloudera.github.io/kudu/docs/introduction.html#architectural_overview.

[12] Memcached: High-Performance, Distributed Memory Object Caching System. http://memcached.org/.

[13] Spark SQL. http://spark.apache.org/sql/.

[14] SuperCell. http://supercell.com/en/.

[15] Tachyon Project. http://tachyon-project.org.

[16] The Apache Hadoop Project. http://hadoop.apache.org/.

[17] The Ohio State Micro Benchmarks. http://mvapich.cse.ohio-state.edu/benchmarks/.

[18] Faraz Ahmad, Srimat T. Chakradhar, Anand Raghunathan, and T. N. Vijaykumar. Tarazu: Optimizing MapReduce on Heterogeneous Clusters. In 17th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.

[19] Alex Woodie. Game-Changer: The Big Data Behind Social Gaming. https://www.datanami.com/2014/06/23/game-changer-big-data-behind-social-gaming/.

[20] Alex Woodie. What Pokemon GO Means for Big Data. https://www.datanami.com/2016/08/01/pokemon-go-means-big-data.

[21] Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, and Ion Stoica. PACMan: Coordinated Memory Caching for Parallel Jobs. In 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012.

[22] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A Distributed Storage System for Structured Data. In The Proceedings of the Seventh Symposium on Operating System Design and Implementation (OSDI'06), WA, November 2006.

[23] Y. Chen, S. Alspaugh, and R. Katz. Interactive Analytical Processing in Big Data Systems: A Cross-industry Study of MapReduce Workloads. Proc. VLDB Endow., 5(12):1802–1813, August 2012.

[24] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, D. Burger, B. Lee, and D. Coetzee. Better I/O Through Byte-Addressable, Persistent Memory. In Symposium on Operating Systems Principles (SOSP '09). Association for Computing Machinery, Inc., October 2009.

[25] B. F. Cooper, R. Ramakrishnan, R. Sears, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. In 34th Intl. Conference on Very Large Data Bases, 2008.

[26] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking Cloud Serving Systems with YCSB. In The Proceedings of the ACM Symposium on Cloud Computing (SoCC), Indianapolis, Indiana, June 2010.

[27] CSC. Big Data Universe Beginning to Explode. http://www.csc.com/insights/flxwd/78931-big data universe beginning to explode.

[28] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Operating Systems Design and Implementation (OSDI), 2004.

[29] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System Software for Persistent Memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys ’14, pages 15:1–15:15, New York, NY, USA, 2014. ACM.

[30] Christian Engelmann, Hong Ong, and Stephen L. Scott. Middleware in Modern High Performance Computing System Architectures. In ICCS, Beijing, China, 2007.

[31] Brad Fitzpatrick. Distributed Caching with Memcached. Linux Journal, 2004:5–, August 2004.

[32] Gordon at San Diego Supercomputer Center. http://www.sdsc.edu/us/resources/gordon/.

[33] Karan Gupta, Reshu Jain, Himabindu Pucha, Prasenjit Sarkar, and Dinesh Subhraveti. Scaling Highly-Parallel Data-Intensive Supercomputing Applications on a Parallel Clustered Filesystem. In The Proceedings of Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), New Orleans, LA, November 2010.

[34] Hadoop 2.6 Storage Policies. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html.

[35] Hadoop 2.6 Storage Policies. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html.

[36] Hadoop Map Reduce. The Apache Hadoop Project. http://hadoop.apache.org/mapreduce/.

[37] T. Harter, D. Borthakur, S. Dong, A. Aiyer, L. Tang, A. Arpaci-Dusseau, and R. Arpaci-Dusseau. Analysis of HDFS Under HBase: A Facebook Messages Case Study. In 12th USENIX Conference on File and Storage Technologies (FAST), 2014.

[38] HiBD. http://hibd.cse.ohio-state.edu/.

[39] Jian Huang, Xiangyong Ouyang, Jithin Jose, Md. Wasi Rahman, Hao Wang, Miao Luo, Hari Subramoni, Chet Murthy, and Dhabaleswar K. Panda. High-Performance Design of HBase with RDMA over InfiniBand. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2012.

[40] IDC. The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things. http://www.emc.com/leadership/digital-universe/index.htm.

[41] Cloudera Inc. Kudu. https://blog.cloudera.com/blog/2015/09/kudu-new-apache-hadoop-storage-for-fast-analytics-on-fast-data/.

[42] InfiniBand Trade Association. http://www.infinibandta.org.

[43] Intel. HiBench Suite. https://github.com/intel-hadoop/HiBench.

[44] Intl. Data Corporation (IDC). New IDC Worldwide HPC End-User Study Identifies Latest Trends in High Performance Computing Usage and Spending. http://www.idc.com/getdoc.jsp?containerId=prUS24409313.

[45] N. S. Islam, X. Lu, M. W. Rahman, J. Jose, H. Wang, and D. K. Panda. A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters. In 2nd Workshop on Big Data Benchmarking (WBDB), 2012.

[46] N. S. Islam, X. Lu, M. W. Rahman, R. Rajachandrasekar, and D. K. Panda. In-Memory I/O and Replication for HDFS with Memcached: Early Experiences. In 2014 IEEE Intl. Conference on Big Data (IEEE BigData), 2014.

[47] J. Jose, M. Luo, S. Sur, and D. K. Panda. Unifying UPC and MPI Runtimes: Experience with MVAPICH. In Fourth Conference on Partitioned Global Address Space Programming Model (PGAS), Oct 2010.

[48] J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. S. Islam, X. Ouyang, H. Wang, S. Sur, and D. K. Panda. Memcached Design on High Performance RDMA Capable Interconnects. In Intl. Conference on Parallel Processing (ICPP), Sept 2011.

[49] W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn. DFS: A File System for Virtualized Flash Storage. Trans. Storage, 2010.

[50] A. Kalyanaraman, R. W. Cannon, B. Latt, and J. D. Baxter. MapReduce Implementation of a Hybrid Spectral Library-Database Search Method for Large-Scale Peptide Identification. Bioinformatics, 27, 2011.

[51] R. Kaushik and M. Bhandarkar. GreenHDFS: Towards an Energy-conserving, Storage-efficient, Hybrid Hadoop Compute Cluster. In Proceedings of the 2010 International Conference on Power Aware Computing and Systems, HotPower’10, pages 1–9, Berkeley, CA, USA, 2010. USENIX Association.

[52] Jacob Leverich and Christos Kozyrakis. Performance and Energy Efficiency of Big Data Applications in Cloud Environments: A Hadoop Case Study. Journal of Parallel and Distributed Computing, pages 80–89.

[53] Jacob Leverich and Christos Kozyrakis. On the Energy (In)efficiency of Hadoop Clusters. SIGOPS Oper. Syst. Rev., 44(1):61–65, March 2010.

[54] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In ACM Symposium on Cloud Computing (SoCC), 2014.

[55] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn. On the Role of Burst Buffers in Leadership-Class Storage Systems. In MSST/SNAPI, 2012.

[56] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 1982.

[57] Xiaoyi Lu, Nusrat S. Islam, Md. Wasi. Rahman, Jithin Jose, Hari Subramoni, Hao Wang, and Dhabaleswar K. Panda. High-Performance Design of Hadoop RPC with RDMA over InfiniBand. In IEEE 42nd Intl. Conference on Parallel Processing (ICPP), 2013.

[58] Carlos Maltzahn, Esteban Molina-Estolano, Amandeep Khurana, Alex J. Nelson, Scott A. Brandt, and Sage Weil. Ceph as a Scalable Alternative to the Hadoop Distributed File System. August 2010.

[59] Shunsuke Mikami, Kazuki Ohta, and Osamu Tatebe. Using the Gfarm File System as a POSIX Compatible Storage Platform for Hadoop MapReduce Applications. In The Proceedings of the 2011 IEEE/ACM 12th Intl. Conference on Grid Computing (GRID), France, September 2011.

[60] MVAPICH2: High Performance MPI over InfiniBand and iWARP. http://mvapich.cse.ohio-state.edu/.

[61] NVRAM. http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency/.

[62] X. Ouyang, S. Marcarelli, R. Rajachandrasekar, and D. K. Panda. RDMA-based Job Migration Framework for MPI over InfiniBand. In The Proceedings of the International Conference on Cluster Computing (Cluster), September 2010.

[63] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, July 2009.

[64] S. Pelley, T. F. Wenisch, B. T. Gold, and B. Bridge. Storage Management in the NVRAM Era. Proc. VLDB Endow., 7(2):121–132, October 2013.

[65] Pivotal. Pivotal Analytics Workbench. http://www.gopivotal.com/solutions/analytics-workbench.

[66] Purdue MapReduce Benchmarks Suite (PUMA). https://sites.google.com/site/farazahmad/pumabenchmarks.

[67] S. Qiu and A. L. N. Reddy. NVMFS: A Hybrid File System for Improving Random Write in Nand-Flash SSD. In IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), 2013.

[68] K. R, A. Anwar, and A. Butt. hatS: A Heterogeneity-Aware Tiered Storage for Hadoop. In 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2014.

[69] K. R, S. Iqbal, and A. Butt. VENU: Orchestrating SSDs in Hadoop Storage. In 2014 IEEE International Conference on Big Data (IEEE BigData), 2014.

[70] K. R, A. Khasymski, A. Butt, S. Tiwari, and M. Bhandarkar. AptStore: Dynamic Storage Management for Hadoop. In International Conference on Cluster Computing (CLUSTER), 2013.

[71] M. W. Rahman, N. Islam, X. Lu, and D. Panda. A Comprehensive Study of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters. IEEE Transactions on Parallel and Distributed Systems, 2016.

[72] M. W. Rahman, Nusrat S. Islam, Xiaoyi Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda. High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand. In HPDIC, in conjunction with IPDPS, Boston, MA, 2013.

[73] M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda. High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA. In 29th IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2015.

[74] M. W. Rahman, Xiaoyi Lu, Nusrat S. Islam, and D. K. Panda. HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects. In Intl. Conference on Supercomputing (ICS), 2014.

[75] R. Rajachandrasekar, X. Ouyang, X. Besseron, V. Meshram, and D. Panda. Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? In Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids 2011, held in conjunction with EuroPar, August 2011.

[76] RandomWriter. http://wiki.apache.org/hadoop/RandomWriter.

[77] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB'13, pages 325–336. VLDB Endowment, 2013.

[78] Michael C. Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009.

[79] SDSC Comet. http://www.sdsc.edu/services/hpc/hpc systems.html.

[80] P. Sehgal, S. Basu, K. Srinivasan, and K. Voruganti. An Empirical Study of File Systems on NVM. In IEEE 31st Symposium on Mass Storage Systems and Technologies (MSST), 2015.

[81] J. Shafer, S. Rixner, and A. L. Cox. The Hadoop Distributed Filesystem: Balancing Portability and Performance. In Intl. Symposium on Performance Analysis of Systems and Software (ISPASS'10), White Plains, NY, March 28-30, 2010.

[82] Konstantin Shvachko. HDFS Scalability: The Limits to Growth. login: The Magazine of USENIX, 2010.

[83] Gurmeet Singh, Puneet Chandra, and Rashid Tahir. A Dynamic Caching Mechanism for Hadoop using Memcached. http://cloudgroup.neu.edu.cn/slides/2014.5.30/ClouData-3rd.pdf.

[84] Sort. http://wiki.apache.org/hadoop/Sort.

[85] Stampede at TACC. http://www.tacc.utexas.edu/resources/hpc/stampede.

[86] Statistical Workload Injector for MapReduce. https://github.com/SWIMProjectUCB.

[87] Thomas Sterling, Ewing Lusk, and William Gropp. Beowulf Cluster Computing with Linux. MIT Press, 2003.

[88] Thomas L. Sterling, John Salmon, Donald J. Becker, and Daniel F. Savarese. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. MIT Press, 1999.

[89] H. Subramoni, P. Lai, M. Luo, and Dhabaleswar K. Panda. RDMA over Ethernet - A Preliminary Study. In Proceedings of the 2009 Workshop on High Performance Interconnects for Distributed Computing (HPIDC'09), 2009.

[90] S. Sur, H. Wang, J. Huang, X. Ouyang, and D. K. Panda. Can High Performance Interconnects Benefit Hadoop Distributed File System? In Workshop on Micro Architectural Support for Virtualization, Data Center Computing, and Clouds, in Conjunction with MICRO 2010, Atlanta, GA, December 2010.

[91] W. Tantisiriroj, S. Patil, G. Gibson, S. Son, S. Lang, and Robert Ross. On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.

[92] W. Tantisiriroj, S. Patil, G. Gibson, Seung Woo Son, S.J. Lang, and R.B. Ross. On the Duality of Data-Intensive File System Design: Reconciling HDFS and PVFS. In The Proceedings of Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, WA, November 2011.

[93] TeraGen. http://hadoop.apache.org/docs/r0.20.0/api/org/apache/hadoop/examples/terasort/TeraGen.html.

[94] The Apache Software Foundation. Apache Cassandra. http://cassandra.apache.org/.

[95] The Apache Software Foundation. Apache HBase. http://hbase.apache.org/.

[96] The Apache Software Foundation. Centralized Cache Management in HDFS. http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html.

[97] Top500 Supercomputing System. http://www.top500.org.

[98] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. In ACM Symposium on Cloud Computing (SOCC), 2013.

[99] T. Wang, K. Mohror, A. Moody, W. Yu, and K. Sato. BurstFS: A Distributed Burst Buffer File System for Scientific Applications. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015.

[100] T. Wang, S. Oral, Y. Wang, B. Settlemyer, and S. Atchley. BurstMem: A High-Performance Burst Buffer System for Scientific Applications. In the Proceedings of the 2014 IEEE International Conference on Big Data (IEEE BigData), 2014.

[101] Y. Wang, R. Goldstone, W. Yu, and T. Wang. Characterization and Optimization of Memory-Resident MapReduce on HPC Systems. In 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2014.

[102] Yandong Wang, Xinyu Que, Weikuan Yu, Dror Goldenberg, and Dhiraj Sehgal. Hadoop Acceleration through Network Levitated Merge. In The Proceedings of Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, WA, November 2011.

[103] Yandong Wang, Xinyu Que, Weikuan Yu, Dror Goldenberg, and Dhiraj Sehgal. Hadoop Acceleration through Network Levitated Merge. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC, 2011.

[104] Matt Welsh, David Culler, and Eric Brewer. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. In 18th ACM Symposium on Operating Systems Principles (SOSP), 2001.

[105] J. Wu, P. Wyckoff, and D. K. Panda. PVFS over InfiniBand: Design and Performance Evaluation. In The Proceedings of the 32nd International Conference on Parallel Processing (ICPP 2003), October 2003.

[106] X. Wu and A. L. N. Reddy. SCMFS: A File System for Storage Class Memory. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 39:1–39:11, New York, NY, USA, 2011. ACM.

[107] M. Xia, M. Saxena, M. Blaum, and D. Pease. A Tale of Two Erasure Codes in HDFS. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST’15, pages 213–226, Berkeley, CA, USA, 2015. USENIX Association.

[108] Xyratex. Lustre. http://wiki.lustre.org/index.php/Main Page.

[109] W. Yu, N. S. V. Rao, and J. S. Vetter. Experimental Analysis of InfiniBand Transport Services on WAN. In The Proceedings of the IEEE International Conference on Networking, Architecture, and Storage (NAS), June 2008.

[110] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012.

[111] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Computing with Working Sets. In 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud), 2010.

[112] H. Zhang, M. Dong, and H. Chen. Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication. In 14th USENIX Conference on File and Storage Technologies (FAST), pages 167–180, Santa Clara, CA, February 2016. USENIX Association.

[113] Y. Zhang, J. Yang, A. Memaripour, and S. Swanson. Mojim: A Reliable and Highly-Available Non-Volatile Memory System. In 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.
