Designing Fast, Resilient and Heterogeneity-Aware Key-Value Storage on Modern HPC Clusters
Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Dipti Shankar, B.E.

Graduate Program in Department of Computer Science and Engineering

The Ohio State University
2019

Dissertation Committee:
Dhabaleswar K. Panda, Advisor
Xiaoyi Lu, Co-Advisor
Feng Qin
Gagan Agrawal

© Copyright by Dipti Shankar 2019

Abstract

With the recent emergence of in-memory computing for Big Data analytics, memory-centric and distributed key-value storage has become vital to accelerating data processing workloads in high-performance computing (HPC) and data center environments. This has led to several research works focusing on advanced key-value store designs with Remote-Direct-Memory-Access (RDMA) and hybrid 'DRAM+NVM' storage designs. However, these existing designs are constrained by blocking store/retrieve semantics, which incurs additional complexity as high data availability and durability requirements are introduced. To cater to the performance, scalability, durability, and resilience needs of diverse key-value store-based workloads (e.g., online transaction processing, offline data analytics), it is therefore vital to fully exploit the resources on modern HPC systems. Moreover, to maximize server scalability and end-to-end performance, it is necessary to focus on designing an RDMA-aware communication engine that goes beyond optimizing the key-value store middleware for better client-side latencies.

Towards addressing this, in this dissertation we present a 'holistic approach' to designing high-performance, resilient and heterogeneity-aware key-value storage for HPC clusters, one that encompasses: (1) RDMA-enabled networking, (2) high-speed NVMs, (3) emerging byte-addressable persistent memory devices, and (4) SIMD-enabled multi-core CPU compute capabilities. We first introduce non-blocking API extensions to the RDMA-Memcached client that allow an application to separate the request issue and completion phases. This facilitates overlap opportunities by truly leveraging the one-sided characteristics of the underlying RDMA communication engine, while conforming to the basic Set/Get semantics. Secondly, we analyze the overhead of employing memory-efficient resilience via online Erasure Coding (EC). Based on this analysis, we extend our proposed RDMA-aware key-value store, which supports non-blocking API semantics, to overlap the EC encoding/decoding compute phases with the scatter/gather communication protocol involved in resiliently storing distributed key-value data objects.
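To make the issue/completion separation concrete, the sketch below models a non-blocking Set/Get client interface in C. All names here (kv_iset, kv_iget, kv_wait, kv_req_t) are hypothetical placeholders rather than the dissertation's actual RDMA-Memcached extensions, and the stubs complete eagerly against a one-slot in-memory table so the example is self-contained; a real client would post RDMA operations at issue time and reap their completions in kv_wait.

```c
/* Minimal sketch of non-blocking Set/Get extensions to a Memcached-style
 * client. Names and semantics are illustrative assumptions, not the
 * dissertation's API: the stubs complete eagerly in-process, whereas a
 * real client would post one-sided RDMA operations and poll for them. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int done; int status; } kv_req_t;  /* opaque request handle */

static char table_key[64], table_val[64];           /* one-slot stand-in "store" */

/* Issue phase: post a Set and return a handle without blocking. */
static kv_req_t *kv_iset(const char *key, const char *val)
{
    kv_req_t *r = malloc(sizeof(*r));
    snprintf(table_key, sizeof(table_key), "%s", key);
    snprintf(table_val, sizeof(table_val), "%s", val);
    r->done = 1; r->status = 0;                     /* stub: completes eagerly */
    return r;
}

/* Issue phase: post a Get that fills 'buf' on completion. */
static kv_req_t *kv_iget(const char *key, char *buf, size_t buflen)
{
    kv_req_t *r = malloc(sizeof(*r));
    r->done = 1;
    r->status = strcmp(table_key, key) ? -1 : 0;    /* -1: key not found */
    if (r->status == 0)
        snprintf(buf, buflen, "%s", table_val);
    return r;
}

/* Completion phase: wait on a handle, release it, return its status. */
static int kv_wait(kv_req_t *r)
{
    int s = r->status;                              /* real code would poll the completion queue */
    free(r);
    return s;
}

int main(void)
{
    char out[64];
    kv_req_t *r1 = kv_iset("key1", "hello");        /* issue both requests up front ... */
    kv_req_t *r2 = kv_iget("key1", out, sizeof(out));
    /* ... overlap communication with useful computation here ... */
    if (kv_wait(r1) == 0 && kv_wait(r2) == 0)       /* ... then reap completions */
        printf("key1 -> %s\n", out);
    return 0;
}
```

The key point is that multiple requests are in flight before any completion is reaped, which is what lets the client overlap RDMA communication with computation and, in the erasure-coded design, with the EC encode/decode phases.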
This work also examines durable key-value store designs for emerging persistent memory technologies. While the RDMA-based protocols employed in existing volatile DRAM-based key-value stores can be directly leveraged, we find that a more integrated approach is needed to fully exploit the fine-grained durability of these new byte-addressable storage devices. We propose 'RDMP-KV', which employs a hybrid 'server-reply/server-bypass' approach to 'durably' store individual key-value pair objects on remote persistent-memory-equipped servers via RDMA. RDMP-KV's runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability and remote data consistency.

Finally, the thesis explores SIMD-accelerated CPU-centric hash table designs that can enable higher server throughput. We propose an end-to-end SIMD-aware key-value store design, 'SCOR-KV', which introduces optimistic 'RDMA+SIMD'-aware client-centric request/response offloading protocols. SCOR-KV minimizes server-side data processing overheads to achieve better scalability, without compromising client-side latencies. With this as the basis, we demonstrate the potential performance gains of the proposed designs with online (e.g., YCSB) and offline (e.g., in-memory and distributed burst-buffer over Lustre for Hadoop I/O) workloads on small-scale and production-scale HPC clusters.

To my Advisors, Family and Friends

Acknowledgments

This work was made possible through the love and support of several people who stood by me through the many years of my doctoral program and all through my life leading up to it. I would like to take this opportunity to thank them all.

My advisor, Dr. Dhabaleswar K. Panda, for his guidance and support throughout my doctoral program. I have been able to grow, both personally and professionally, through my association with him. His dedication and hard work are a guide-book to live by. I admire his drive, commitment, and the energy he puts into each of his pursuits.

My co-advisor, Dr. Xiaoyi Lu, for believing in me and helping me get this far. This work would not have been possible without his direction and support. He helped me transform my doubts and insecurities into a driving force for bettering myself and pursuing a career in research. His optimistic outlook on life and his enthusiasm will always be an inspiration.

My husband, Manju G. Siddappa, for his love, support, and understanding. I admire his determination and courage to keep moving despite facing hard challenges. Thank you for being there for me every day for the past ten years.

My family - my parents, Dr. R. Shivashankar and G. S. Usharani, for giving me unconditional love, freedom, and support. They have always believed in me and encouraged me to pursue my goals, no matter how hard. I would like to thank my sister, Divya Shankar, and my brother-in-law, Debmalya Das Sharma, for always standing by me. I would also like to show my utmost gratitude to my parents-in-law, Siddappa R. and Radha K. R., for their support and acceptance.

My friends - I am very happy to have met and become friends with Dr. Hari Subramoni, Sourav Chakraborty, Jahanzeb Hashmi, Mohammadreza Bayatpour, Ammar Ahmed Awan, Shashank Gugnani, Haiyang Shi, Rajarshi Biswas, Haseeb Javed, Ching-Hsiang Chu, Jeff Smith, Jie Zhang, and Mark Arnold. They have given me memories that I will cherish for the rest of my life. I am also thankful for the support and constant encouragement from my friends Philip Carpenter, Dr. Karthik Rao, Aditi Raveesh, Arshiya Parveez, Kavya Naik, Pavana Basavaraj, Kshama Mahesh, and Ganesh S. Gowda.

I would finally like to thank all my colleagues at the Network-Based Computing Lab, who have helped me through my graduate studies.

Vita

2013-Present: Ph.D., Computer Science and Engineering, The Ohio State University, U.S.A.
2013-Present: Graduate Research Associate, The Ohio State University, U.S.A.
2017: Graduate Summer Research Intern, Oak Ridge National Lab, U.S.A.
2013-2019: M.S., Computer Science and Engineering, The Ohio State University, U.S.A.
2011-2013: Software Engineer, Oracle India Pvt. Ltd., Bangalore, India
2007-2011: B.E., Computer Science and Engineering, Rashtreeya Vidyalaya College of Engineering, Bangalore, India
Publications

D. Shankar, X. Lu, and D. K. Panda, SCOR-KV: SIMD-Aware Client-Centric and Optimistic RDMA-based Key-Value Store for Emerging CPU Architectures (Under Review)

D. Shankar, X. Lu, and D. K. Panda, RDMP-KV: Designing Remote Direct Memory Persistence based Key-Value Stores with NVRAM (Under Review)

D. Shankar, X. Lu, and D. K. Panda, Accelerating NVRAM-aware In-Memory Datastore with Remote Direct Memory Persistence, 10th Annual Non-Volatile Memories Workshop 2019 (NVMW 2019) [Poster]

D. Shankar, X. Lu, and D. K. Panda, High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads, in Proceedings of the 37th IEEE International Conference on Distributed Computing Systems (ICDCS 2017)

D. Shankar, X. Lu, and D. K. Panda, Boldio: A Hybrid and Resilient Burst-Buffer Over Lustre for Accelerating Big Data I/O [Short Paper], in Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016)

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016)

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads [Short Paper], in Proceedings of the 2015 IEEE International Conference on Big Data (IEEE BigData 2015)

D. Shankar, X. Lu, J. Jose, W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL?, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015) [Poster]

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters, The Journal of Supercomputing (SUPE), Springer, 2016

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, in Proceedings of the 5th Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE 2014)

H. Shi, X. Lu, D. Shankar, and D. K. Panda, UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems, in Proceedings of the 28th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2019)

X. Lu, D. Shankar, H. Shi, and D. K. Panda, Spark-uDAPL: Cost-Saving Big Data Analytics on Microsoft Azure Cloud with RDMA Networks, in Proceedings of the 2018 IEEE International Conference on Big Data (IEEE BigData 2018)

X. Lu, H. Shi, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of Big Data Workloads on OpenPOWER System, in Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017)