High Performance File System and I/O Middleware Design for Big Data on HPC Clusters

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Nusrat Sharmin Islam, M.Sc.

Graduate Program in Computer Science and Engineering

The Ohio State University

2016

Dissertation Committee:

Dr. Dhabaleswar K. (DK) Panda, Advisor
Dr. Ponnuswamy Sadayappan
Dr. Radu Teodorescu
Dr. Xiaoyi Lu

© Copyright by

Nusrat Sharmin Islam

2016

Abstract

Hadoop MapReduce and Spark are the two most popular Big Data processing frameworks of recent times, and Hadoop Distributed File System (HDFS) is the underlying file system for MapReduce, Spark, the Hadoop Database (HBase), as well as SQL query engines like Hive and Spark SQL. HDFS, along with these upper-level middleware, is now extensively used on High Performance Computing (HPC) systems. Large-scale HPC systems, by necessity, are equipped with high performance interconnects like InfiniBand, heterogeneous storage devices like RAM Disk, SSD, and HDD, and parallel file systems like Lustre. Non-Volatile Memory (NVM) is emerging and making its way into HPC systems. Hence, the performance of HDFS, and in turn, of the upper-level middleware and applications, depends heavily on how it has been designed and optimized to take the system resources and architecture into account. But HDFS was initially designed to run over low-speed interconnects and disks on commodity clusters. As a result, it cannot efficiently utilize the resources available in HPC systems. For example, HDFS uses Java Sockets for communication, which leads to the overhead of multiple data copies and offers little overlapping among different phases of operations. Besides, due to the tri-replicated data blocks, HDFS suffers from huge I/O bottlenecks and requires a large volume of local storage. The existing data placement and access policies in HDFS are oblivious to the performance and persistence characteristics of the heterogeneous storage media on modern HPC systems. In addition, even though parallel file systems are optimized for a large number of concurrent accesses, Hadoop jobs running over Lustre suffer from huge contention due to the bandwidth limitation of the shared file system.

This work addresses several of these critical issues in HDFS while proposing efficient and scalable file system and I/O middleware for Big Data on HPC clusters. It proposes an RDMA-Enhanced design of HDFS to improve the communication performance of write and replication. It also proposes a Staged Event-Driven Architecture (SEDA)-based approach to maximize overlapping among different phases of HDFS operations. It proposes a hybrid design (Triple-H) to reduce the I/O bottlenecks and local storage requirements of HDFS through enhanced data placement policies that can efficiently utilize the heterogeneous storage devices available in HPC platforms. This thesis studies the impact of in-memory file systems for Hadoop and Spark and presents acceleration techniques for iterative applications with intelligent use of in-memory and heterogeneous storage. It further proposes advanced data access strategies that take into account locality, topology, and storage types for Hadoop and Spark on heterogeneous (storage) HPC clusters. This thesis carefully analyzes the challenges of incorporating NVM into a Big Data file system and proposes an NVM-based design of HDFS (NVFS) that leverages the byte-addressability of NVM for HDFS I/O and RDMA communication. It also co-designs Spark and HBase to utilize the NVM in NVFS in a cost-effective manner by identifying the performance-critical data for each of these upper-level middleware. It also proposes the design of a burst buffer system using RDMA-based Memcached for integrating Hadoop with Lustre.

The designs proposed in this thesis have been evaluated on a 64-node (1,024-core) testbed on TACC Stampede and a 32-node (768-core) testbed on SDSC Comet. These designs increase HDFS throughput by up to 7x, improve the performance of Big Data applications by up to 79%, and reduce the local storage requirements by 66% compared to default HDFS.

To Ammu and Abbu.

Acknowledgments

This work was made possible through the love and support of several people who stood by me through the years of my doctoral research. I would like to take this opportunity to thank all of them.

My advisor, Dr. Dhabaleswar K. Panda, for his guidance and support throughout my doctoral studies. The depth of his knowledge and experience has helped me grow as a researcher. I have learned a lot from his sincerity, diligence, and devotion towards work. My dissertation committee members, Dr. P. Sadayappan and Dr. R. Teodorescu, for agreeing to serve on the committee and for their valuable feedback.

My collaborators, Dr. X. Lu and other senior members of the lab for taking the time to listen to my ideas and giving me useful suggestions. I would also like to thank all my colleagues who have helped me in one way or another throughout my graduate studies.

My friends, here at Columbus, for making this journey easier for me. Special thanks to

Sanjida Ahmad, who just like an elder sister has stood beside me through thick and thin.

Thanks to my friends at home, Nadia, Drabir, Tareq, Aysha, and Pavel, for their company that always invigorates me for a new start. My friends, Susmita, Amit, and Rashed, for being there for me, no matter what. These people never forget to remind me that I can achieve anything that I want. I am really thankful to them for boosting my confidence each and every day.

My cousins, Mishu, Muna, and others for believing in me and helping me face life's challenges. Thanks to my sister, Sadia Islam, and brother-in-law, Nafiz Tamim, for never letting me miss home even after being far away from it. My brother, Dr. Mohaimenul

Islam for always cheering me up and taking pride in what I do.

My Husband, Md. Wasi-ur-Rahman for his love, patience, and understanding. His constant encouragement has been a driving factor for me. My daughter, Nayirah for helping me forget all the hurdles and stresses that I went through during my Ph.D. Even in the most difficult times, her smile lights up my entire world. Her presence in my life has taught me how to be more organized and manage time well.

Last but not least, my Parents, Mazeda Islam and Dr. Shahidul Islam. I consider myself privileged to be born to parents who have never asked me to do anything differently just because I am a woman. I have always idolized my dad for his prudence, honesty, and hard work; my mom has supported my work with all her strength. I am truly grateful to them for all the sacrifices that they have made so that I can be where I am today. The greatest achievement of my life is to make my parents proud through this work. I would like to thank them from the core of my heart for giving me the resilience and inspiration to move forward in all my endeavors.

Vita

2002-2007 ...... B.Sc., Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
2007-2008 ...... Lecturer, Dept. of Computer Science and Engineering, BRAC University
2008-2010 ...... Lecturer, Dept. of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
2010-Present ...... Ph.D., Computer Science and Engineering, The Ohio State University, USA
2010-2011 ...... Graduate Fellow, The Ohio State University, USA
2011-Present ...... Graduate Research Associate, Dept. of Computer Science and Engineering, The Ohio State University, USA
Summer 2014 ...... Software Engineer Intern, Oracle America Inc., USA

Publications

N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage, In 2016 IEEE International Conference on Big Data (IEEE BigData ’16), December 2016.

N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, In International Conference on Supercomputing (ICS ’16), June 2016.

N. S. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Cluster, In 2015 IEEE International Conference on Big Data (IEEE BigData ’15), October 2015.

N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-based Key-Value Store, In The 44th International Conference on Parallel Processing (ICPP ’15), September 2015.

N. S. Islam, X. Lu, M. W. Rahman, D. Shankar and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’15), May 2015.

N. S. Islam, X. Lu, M. W. Rahman and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, In The 23rd International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC ’14), June 2014.

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, In The Int’l Conference for High Performance Computing, Networking, Storage and Analysis (SC ’12), November 2012.

M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?, In 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS ’16), in conjunction with SC ’16, November 2016.

M. W. Rahman, N. S. Islam, X. Lu, D. Shankar, and D. K. Panda, MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers, In 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD ’16), October 2016.

M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, A Comprehensive Study of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters, In IEEE Transactions on Parallel and Distributed Systems, July 2016.

X. Lu, M. W. Rahman, N. S. Islam, D. Shankar, and D. K. Panda, Accelerating Big Data Processing on Modern HPC Clusters, In Conquering Big Data with High Performance Computing - Springer International Publishing, July 2016.

D. Shankar, X. Lu, M. W. Rahman, N. S. Islam, and D. K. Panda, Characterizing and Benchmarking Stand-alone Hadoop MapReduce on Modern HPC Clusters, In The Journal of Supercomputing - Springer, June 2016.

D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, In The 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS ’16), May 2016.

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar and D. K. Panda, High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, In 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS ’15), May 2015.

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, In Special Issue of LNCS, WBDB, January 2014.

N. S. Islam, X. Lu, M. W. Rahman, R. Rajachandrasekar and D. K. Panda, In-Memory I/O and Replication for HDFS with Memcached: Early Experiences, In 2014 IEEE International Conference on Big Data (IEEE BigData ’14), October 2014.

M. W. Rahman, X. Lu, N. S. Islam and D. K. Panda, Performance Modeling for RDMA-Enhanced Hadoop MapReduce, In 43rd International Conference on Parallel Processing (ICPP ’14), September 2014.

D. Shankar, X. Lu, M. W. Rahman, N. S. Islam and D. K. Panda, A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, In 5th Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE), 40th International Conference on Very Large Data Bases (VLDB ’14), September 2014.

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, In 20th International European Conference on Parallel Processing (Euro-Par ’14), August 2014.

X. Lu, M. W. Rahman, N. S. Islam, D. Shankar and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, In Int’l Symposium on High-Performance Interconnects (HotI ’14), August 2014.

M. W. Rahman, X. Lu, N. S. Islam and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, In International Conference on Supercomputing (ICS ’14), June 2014.

M. W. Rahman, X. Lu, N. S. Islam and D. K. Panda, Does RDMA-based Enhanced Hadoop MapReduce Need a New Performance Model?, In ACM Symposium on Cloud Computing (SoCC ’13), October 2013.

X. Lu, N. S. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, In Int’l Conference on Parallel Processing (ICPP ’13), October 2013.

N. S. Islam, X. Lu and D. K. Panda, Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?, In 21st Annual Symposium on High-Performance Interconnects (HOTI ’13), August 2013.

M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang and D. K. Panda, High Performance RDMA-Based Design of Hadoop MapReduce over InfiniBand, In Int’l Workshop on High Performance Data Intensive Computing (HPDIC ’13), in conjunction with IPDPS, May 2013.

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, H. Wang and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Cluster, In Second Workshop on Big Data Benchmarking (WBDB ’12), December 2012.

X. Ouyang, N. S. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang and D. K. Panda, SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks, In Int’l Conference on Parallel Processing (ICPP ’12), September 2012.

J. Vienne, J. Chen, M. W. Rahman, N. S. Islam, H. Subramoni and D. K. Panda, Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems, In Int’l Symposium on High-Performance Interconnects (HOTI ’12), August 2012.

M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. S. Islam, H. Subramoni, C. Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, In Int’l Symposium on Performance Analysis of Systems and Software (ISPASS ’12), April 2012.

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. S. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, In Int’l Conference on Parallel Processing (ICPP ’11), September 2011.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Problem Statement
   1.2 Research Framework
   1.3 Organization of the Thesis
2. Background
   2.1 Hadoop
      2.1.1 Hadoop Distributed File System (HDFS)
         2.1.1.1 Overview of HDFS Design
      2.1.2 MapReduce
      2.1.3 HBase
      2.1.4 Hive
   2.2 InfiniBand and UCR
   2.3 Heterogeneous Storage Devices on Modern HPC Clusters
      2.3.1 Node-Local Storage Devices
      2.3.2 Non-Volatile Memory
      2.3.3 Parallel File System
   2.4 In-Memory Computing Framework and Storage
      2.4.1 Spark
         2.4.1.1 Spark SQL
      2.4.2 Tachyon
      2.4.3 Memcached
   2.5 Benchmarks, Workloads, and Applications
3. High Performance RDMA-based Design of HDFS over InfiniBand
   3.1 Proposed Design
      3.1.1 Design Overview
      3.1.2 New Components in the Hybrid Architecture
      3.1.3 Connection and Buffer Management
      3.1.4 Communication Flow using RDMA over InfiniBand
   3.2 Experimental Results
      3.2.1 Experimental Setup
      3.2.2 Optimal Packet-size for Different Interconnects/Protocols
      3.2.3 Micro-benchmark Level Evaluations on Different Interconnects
      3.2.4 HDFS Communication Time
      3.2.5 Evaluation with TestDFSIO
   3.3 Benefit of RDMA-based HDFS in HBase
   3.4 Related Work
   3.5 Summary
4. Maximizing Overlapping in RDMA-Enhanced HDFS
   4.1 Proposed Design
      4.1.1 Architectural Overview
      4.1.2 Design of SOR-HDFS
      4.1.3 Overlapping among Different Phases
         4.1.3.1 Data Read, Processing, Replication, and I/O
         4.1.3.2 Communication and Data Write
      4.1.4 SOR-HDFS with Parallel Replication
   4.2 Experimental Results
      4.2.1 Experimental Setup
      4.2.2 Parameter Tuning of SOR-HDFS Threads
      4.2.3 Performance Analysis of SOR-HDFS
      4.2.4 Evaluation with Parallel Replication
      4.2.5 Evaluation using HDFS Microbenchmark
      4.2.6 Evaluation using Enhanced DFSIO of HiBench
      4.2.7 Evaluation using HBase
   4.3 Related Work
   4.4 Summary
5. Hybrid HDFS with Heterogeneous Storage and Advanced Data Placement Policies
   5.1 Proposed Architecture and Design
      5.1.1 Design Considerations
      5.1.2 Architecture
      5.1.3 Data Placement Policies
         5.1.3.1 Performance-sensitive
         5.1.3.2 Storage-sensitive
      5.1.4 Design Details
   5.2 Experimental Results
      5.2.1 Performance Analysis of Triple-H
      5.2.2 Evaluation with Triple-H Default Mode
         5.2.2.1 TestDFSIO
         5.2.2.2 Data Generation Benchmarks
      5.2.3 Evaluation with Triple-H Lustre-Integrated Mode
         5.2.3.1 TestDFSIO
         5.2.3.2 Sort
      5.2.4 Evaluation with Applications
   5.3 Related Work
   5.4 Summary
6. Accelerating Iterative Applications with In-Memory and Heterogeneous Storage
   6.1 System Architectures to Deploy In-Memory File Systems
   6.2 Adapting Triple-H for Iterative Applications
      6.2.1 Optimizing Triple-H for Spark
         6.2.1.1 Enhanced Connection Management for Spark in Triple-H
      6.2.2 Design Scope for Iterative Applications
      6.2.3 Proposed Design
   6.3 Experimental Results
      6.3.1 Identifying the Impact of Different Parameters
      6.3.2 Primitive Operations
      6.3.3 Hadoop Workloads on HDFS, Tachyon and Triple-H
      6.3.4 Hadoop Workloads on Lustre, Tachyon and Triple-H
      6.3.5 Spark Workloads on HDFS, Tachyon and Triple-H
      6.3.6 Spark Workloads on Lustre, Tachyon and Triple-H
      6.3.7 Fault Tolerance of HDFS, Tachyon and Triple-H
      6.3.8 Evaluation with Iterative Workloads
         6.3.8.1 Synthetic Iterative Benchmark
         6.3.8.2 K-Means
      6.3.9 Summary of Performance Characteristics
   6.4 Related Work
   6.5 Summary
7. Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage
   7.1 Proposed Design
      7.1.1 Data Access Strategies
         7.1.1.1 Greedy
         7.1.1.2 Hybrid
      7.1.2 Data Placement Strategies
      7.1.3 Design and Implementation
   7.2 Performance Evaluation
      7.2.1 Experimental Setup
      7.2.2 Evaluation with RandomWriter and TeraGen
      7.2.3 Evaluation with TestDFSIO
      7.2.4 Evaluation with Hadoop MapReduce Sort and TeraSort
      7.2.5 Evaluation with Spark Sort and TeraSort
      7.2.6 Evaluation with Hive and Spark SQL
   7.3 Related Work
   7.4 Summary
8. High Performance Design for HDFS with Byte-Addressability of NVM and RDMA
   8.1 Design Scope
      8.1.1 NVM as HDFS Storage
      8.1.2 NVM for RDMA
      8.1.3 Performance Gap Analysis
   8.2 Proposed Design and Implementation
      8.2.1 Design Overview
      8.2.2 Design Details
         8.2.2.1 Block Access (NVFS-BlkIO)
         8.2.2.2 Memory Access (NVFS-MemIO)
      8.2.3 Implementation
      8.2.4 Cost-effective Performance Schemes for using NVM in HDFS
         8.2.4.1 Spark
         8.2.4.2 HBase
      8.2.5 NVFS-based Burst Buffer for Spark over Lustre
   8.3 Performance Evaluation
      8.3.1 Performance Analysis
      8.3.2 Evaluation with Hadoop MapReduce
         8.3.2.1 TestDFSIO
         8.3.2.2 Data Generation Benchmark (TeraGen)
         8.3.2.3 SWIM
      8.3.3 Evaluation with Spark
         8.3.3.1 TeraSort
         8.3.3.2 PageRank
      8.3.4 Evaluation with HBase
   8.4 Related Work
   8.5 Summary
9. Accelerating I/O Performance through RDMA-based Key-Value Store
   9.1 Design
      9.1.1 Proposed Deployment
      9.1.2 Internal Architecture
         9.1.2.1 Key-Value Store-based Burst Buffer System
         9.1.2.2 Integration of Hadoop with Lustre through Key-Value Store-based Burst Buffer
      9.1.3 Proposed Schemes to Integrate Hadoop with Lustre
         9.1.3.1 Asynchronous Lustre Write (ALW)
         9.1.3.2 Asynchronous Disk Write (ADW)
         9.1.3.3 Bypass Local Disk (BLD)
   9.2 Experimental Results
      9.2.1 Experimental Setup
      9.2.2 Evaluation with TestDFSIO
      9.2.3 Evaluation with RandomWriter and Sort
      9.2.4 Evaluation with PUMA
   9.3 Related Work
   9.4 Summary
10. Future Research Directions
   10.1 Designing Kudu over RDMA and NVM
   10.2 Study the Impact of the Proposed Data Placement and Access Policies on Energy Efficiency
   10.3 Accelerating Erasure Coding with RDMA and KNL/GPU
   10.4 Designing High Performance I/O Subsystem for Deep/Machine Learning Applications
11. Conclusion and Contributions
   11.1 Software Release and its Impact

Bibliography

List of Tables

1.1 Performance vs Capacity comparison for [32]
2.1 Benchmark parameter list
3.1 Optimal Network Packet-Sizes for Different Interconnects/Protocols
3.2 Split-up Times for a Single Block Transmission in HDFS
4.1 Packet rate and communication time
5.1 Read and Write performance comparison in different Modes of Triple-H (OSU RI Cluster B)
6.1 Fault tolerance of Triple-H and Tachyon
6.2 Normalized execution times over Tachyon-HDFS and Triple-H (HHH)
6.3 Normalized execution times over Tachyon-Lustre and Triple-H (HHH-L)
8.1 SWL performance over RAMDisk
8.2 Comparison of Communication Time
8.3 Comparison of I/O Time
8.4 Summary of benefits for SWIM
9.1 Determining optimal number of Memcached servers
9.2 Performance comparison of RandomWriter and Sort with 4 and 6 maps per host using ALW
9.3 Performance comparison of RandomWriter and Sort with 4 and 6 maps per host using BLD

List of Figures

1.1 HDFS and Storage Hierarchy in HPC System
1.2 The Proposed Research Framework
2.1 HDFS Architecture
3.1 Architecture and Design of RDMA-Enhanced HDFS
3.2 HDFS-RDMA design with hybrid communication support
3.3 Optimal network packet size evaluation for different file sizes
3.4 Micro-benchmark evaluation in different clusters for HDD
3.5 Micro-benchmark evaluation in different clusters for SSD
3.6 Communication times in HDFS in OSU RI
3.7 TestDFSIO benchmark evaluation over different interconnects
3.8 YCSB evaluation with a single region server
3.9 YCSB evaluation with multiple (32) region servers
4.1 Default vs. Proposed HDFS Architecture
4.2 Architectural Overview of SOR-HDFS
4.3 Read stage
4.4 Packet Processing Stage
4.5 Different stages of SOR-HDFS
4.6 Parameter tuning for SOR-HDFS threads
4.7 Disk write throughput in different clusters
4.8 Breakdown time for 64MB block write (OSU RI Cluster B)
4.9 Performance analysis of SOR-HDFS (OSU RI Cluster B)
4.10 Write time evaluation using HDFS microbenchmark (SWL)
4.11 Write time evaluation using HDFS microbenchmark (SWL)
4.12 Enhanced DFSIO throughput evaluation of Intel HiBench
4.13 Evaluation of HBase Put throughput
5.1 Architecture and Design of Triple-H
5.2 Performance-sensitive Data Placement Policies
5.3 Replication during data placement in Triple-H
5.4 Eviction-Promotion Manager
5.5 Comparison among different Data Placement Policies in OSU RI Cluster B
5.6 Comparison between Performance-sensitive Data Placement Policies
5.7 Evaluation of TestDFSIO in TACC Stampede (Default mode)
5.8 Performance comparison with data generation benchmarks (Default mode)
5.9 Evaluation of TestDFSIO (Lustre-Integrated mode)
5.10 Evaluation of Sort (Lustre-Integrated mode)
6.1 System architectures for deploying Hadoop MapReduce and Spark on top of in-memory file systems (Tachyon & HHH)
6.2 Enhanced connection management for supporting Spark over Triple-H
6.3 Iterations and profiling of K-Means
6.4 Added functional units in Triple-H for iterative applications
6.5 Impact of blocksize (OSU RI Cluster B)
6.6 Impact of concurrent containers and tasks (OSU RI Cluster B)
6.7 Evaluation of RandomWriter and Sort (SDSC Gordon)
6.8 Evaluation of Grep and MR-MSPolygraph
6.9 Evaluation of Spark Standalone mode (SDSC Gordon)
6.10 Evaluation of Spark workloads over YARN
6.11 Fault tolerance of different file systems (OSU RI Cluster B)
6.12 Evaluation of iterative workloads (OSU RI Cluster B)
7.1 Default data access strategy
7.2 Proposed data access strategies
7.3 Overview of NBD Strategy
7.4 Proposed design
7.5 Evaluation of RandomWriter and TeraGen on OSU RI2
7.6 Evaluation of the data access strategies
7.7 Selecting optimal number of connections in NBD strategy on OSU RI Cluster B
7.8 Selecting optimal number of readers in SBD strategy on OSU RI Cluster B
7.9 Evaluation of TestDFSIO on OSU RI2
7.10 Evaluation of Hadoop MapReduce workloads on OSU RI2
7.11 Evaluation of Hadoop MapReduce workloads on SDSC Comet
7.12 Evaluation of Spark workloads on OSU RI2
7.13 Evaluation of Spark workloads on SDSC Comet
7.14 Evaluation of Spark SQL on SDSC Comet
8.1 Performance Characteristics of NVM and SSD
8.2 NVM for HDFS I/O
8.3 NVM for HDFS over RDMA
8.4 Performance gaps among different storage devices
8.5 Architecture of NVFS
8.6 Design of NVFS-BlkIO
8.7 Design of NVFS-MemIO
8.8 Internal data structures for Block Architecture
8.9 Design of NVFS-based Burst Buffer for Spark over Lustre
8.10 HDFS Microbenchmark
8.11 TestDFSIO
8.12 Data generation benchmark, TeraGen
8.13 SWIM (SDSC Comet)
8.14 Evaluation with Spark workloads
8.15 YCSB 100% insert (SDSC Comet)
8.16 YCSB 50% read, 50% update (Cluster B)
9.1 Deployment and Architecture
9.2 Persistence Manager
9.3 Hadoop Write Flow
9.4 Evaluation of TestDFSIO (OSU RI)
9.5 Evaluation of TestDFSIO (TACC Stampede)
9.6 Evaluation of RandomWriter and Sort (OSU RI)
9.7 Evaluation of RandomWriter and Sort (TACC Stampede)
9.8 Evaluation with PUMA [66]
10.1 Network Architecture of Kudu, Courtesy [11]
10.2 Impact of interconnects and storage on Kudu operations

Chapter 1: Introduction

Massive data is being generated every minute through different Internet services, such as Facebook, Twitter, and Google, as well as numerous smart phone apps. In this data deluge, timely collection, storage, and analysis of these data are fundamental for efficient business solutions. Not only in business: data production in many diverse fields, including biomedical research, Internet search, finance, and scientific computing, is expanding at an astonishing rate. According to [4], 2.5 exabytes of data were generated each day throughout the year 2014. Total data production will be 44 times greater by the end of this decade [27] compared to today. A recent IDC report [40] claims that the digital universe is doubling in size every two years and will multiply 10-fold between 2013 and 2020, from 4.4 trillion gigabytes to 44 trillion gigabytes. Data generation and analytics have taken an enormous leap in recent times due to the continuously growing popularity of social gaming and geospatial applications. According to [19], at any given hour, around ten million people are logged into their favorite social game, like Candy Crush Saga from King [10] or Clash of Clans from Supercell [14]. The more recent craze of Pokemon GO has further emphasized the inevitability of data growth and the need for faster processing and very large storage capacities; in other words, for technological advancements in Big Data middleware in the near future. According to a recent Datanami article [20], people are now spending twice as much time playing Pokemon GO as they are on Facebook. It also mentions that social media traffic from Twitter, Snapchat, and YouTube is going down because of the massive usage of Pokemon GO. This phenomenon presents significant challenges not only in operating on this data to perform large-scale analysis but also in managing, storing, and protecting the sheer volume and diversity of the data.

During the last decade, the Apache Hadoop [16] platform has become one of the most prominent open-source frameworks for handling Big Data analytics, and Hadoop Distributed File System (HDFS) is the underlying storage engine of Hadoop. The current ecosystem of Hadoop contains the legacy batch processing framework, MapReduce [16], as well as the in-memory DAG execution framework, Spark [111], for iterative and interactive processing. HDFS, which is well-known for its scalability and reliability, provides the storage capabilities for these processing frameworks. With the advent of the Information Age, Big Data systems like Hadoop and HDFS are being widely deployed on High Performance Computing (HPC) clusters. The International Data Corporation (IDC) study [44] on Latest Trends in High Performance Computing (HPC) Usage and Spending indicated that 67% of HPC sites were running High Performance Data Analysis (HPDA) workloads. IDC forecasts that HPDA revenues will experience robust growth from 2012 to 2017, reaching almost $1.4 billion in 2017, versus $743.8 million in 2012. HPC departments everywhere appear to be bearing the brunt of this transition. A recent example is the ‘Pivotal Analytics Workbench’ [65], a 1000+ node HPC cluster used for regular testing of the Apache Hadoop open source code base. Even though such systems are being used for Hadoop deployments, current Hadoop middleware components do not leverage many of the HPC cluster features.

Modern HPC clusters are equipped with high performance interconnects like InfiniBand [42] that provide low latency and high throughput data transmission.

[Figure 1.1: HDFS and Storage Hierarchy in HPC System. (a) HDFS deployment (cluster with homogeneous storage characteristics); (b) HDFS deployment (cluster with heterogeneous storage characteristics); (c) NVM in the Storage Hierarchy.]

The multi-core compute nodes have large memory along with heterogeneous storage devices like RAM Disk, SSD, and HDD [32]. Moreover, the traditional Beowulf architecture [87, 88] model has been followed for building these clusters [32, 85], where the compute nodes are provisioned with either disk-less or limited-capacity local storage [30], while a sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre, is provided for fast and scalable data storage and access. Figures 1.1(a) and 1.1(b) show examples of how heterogeneous storage devices and Lustre are deployed on most modern HPC systems, building clusters with homogeneous or heterogeneous storage characteristics, respectively, and how HDFS runs in these setups with the DataNodes located on the compute nodes.

Table 1.1: Performance vs Capacity comparison for [32]

Type             | Peak Bandwidth | Capacity
RAM Disk (local) | ≈ 6.61 GBps    | ≈ 32 GB
SSD (local)      | ≈ 2.32 GBps    | ≈ 300 GB
HDD (local)      | ≈ 267.2 MBps   | ≈ 80 GB
Lustre           | ≈ 817.1 MBps   | ≈ 1.6 PB

All such heterogeneous storage devices on modern HPC clusters have different performance and storage characteristics. Table 1.1 shows the different types of storage media available in one of the leadership-class HPC clusters, SDSC Gordon [32]. It is evident from this table that the amounts of local storage space on RAM Disk, SSD, and HDD are negligible compared to the vast installation of Lustre. But from the angle of peak data access bandwidth, RAM Disk and SSD are much faster than Lustre. Moreover, emerging Non-Volatile Memory (NVM) is making its way into HPC clusters. NVMs offer byte-addressability with near-DRAM performance and low power consumption for I/O-intensive applications. NVMs can not only augment the overall memory capacity, but also provide persistence while bringing in significant performance improvement. As depicted in Figure 1.1(c), NVMs are, thus, excellent contenders to co-exist with RAM and SSDs in large scale clusters and server deployments. This further implies that it is the software overheads that cause performance bottlenecks when NVM is used to replace disk-based storage systems.

The outstanding performance requirements in HPC environments place unprecedented demands on the performance of the supporting storage systems. The default design of HDFS cannot efficiently utilize the advanced features of the resources available on HPC platforms. It is, therefore, critical to re-think the architecture of HDFS and to design high performance I/O subsystems for Big Data that consider multiple factors, including exploitation of the underlying system resources, data locality, and application characteristics, along with identifying the performance-critical data, data access patterns, storage characteristics, persistence and fault tolerance requirements, and minimizing the storage space usage. Since most Big Data applications are I/O-intensive in nature, the HPC cluster resources can play a crucial role in re-designing the I/O subsystem for Big Data. This leads to the following broad challenge: Can the resources on HPC platforms be leveraged to design high performance file systems and I/O middleware that can improve the performance and scalability of a range of data-intensive applications? If such file systems and I/O middleware can be designed and developed in an efficient manner, it will not only benefit a wide variety of upper-layer frameworks and query engines, such as MapReduce, Spark, HBase, Hive, Spark SQL, etc., but will also lead to an acceleration in the effective use of Big Data file systems and I/O middleware on leadership-class HPC clusters.

1.1 Problem Statement

In this thesis, we address the above-mentioned challenges with a focus on the I/O requirements of Big Data middleware and applications. To summarize, this thesis addresses the following research questions:

1. Can we re-design HDFS to take advantage of high performance interconnects and exploit advanced features such as RDMA? What are the challenges here?

2. How can we maximize overlapping among different stages of the HDFS Write operation? Can we adopt the high-throughput SEDA-based approach to achieve this?

3. Is it possible to design HDFS with a hybrid architecture to take advantage of the heterogeneous storage devices on HPC clusters? Can we propose effective data placement policies for HDFS to reduce the I/O bottlenecks?

4. Can we also propose advanced acceleration techniques to adapt HDFS with in-memory and heterogeneous storage for iterative benchmarks and applications?

5. How can we design efficient data access strategies for Hadoop and Spark on HPC systems with heterogeneous storage characteristics?

6. Can we re-design HDFS to leverage the byte-addressability of NVM?

7. How can we take advantage of a high-performance burst buffer system to accelerate the I/O performance of Big Data applications on HPC clusters? Can a key-value store such as Memcached be used for this? What are the challenges to integrate Hadoop with Lustre through this burst buffer?

1.2 Research Framework

Figure 1.2 illustrates the overall scope of this thesis. Through this framework, this thesis aims to address the challenges enumerated in Section 1.1 as follows:

[Figure 1.2: The Proposed Research Framework. The framework layers Big Data applications, workloads, and benchmarks (HBase, Hadoop MapReduce, Hive, Spark, Spark SQL) over the proposed high performance file system and I/O middleware (hybrid HDFS with heterogeneous storage and advanced data placement, efficient data access for heterogeneous (storage) clusters, selective caching for iterative jobs, leveraging NVM for Big Data I/O, KV-store (Memcached) based burst buffer, maximized stage overlapping, and RDMA-Enhanced HDFS), built on top of storage technologies (HDD, SSD, RAM Disk, and NVM), networking technologies/protocols (InfiniBand, 10/40/100 GigE, RDMA), and the parallel file system (Lustre).]

1. Can we re-design HDFS to take advantage of high performance interconnects

and exploit advanced features such as RDMA? What are the challenges here?

HDFS cannot leverage the high performance communication features of HPC clusters. High performance networks such as InfiniBand [42] provide low latency and high throughput data transmission. Over the past decade, the scientific and parallel computing domains, with the Message Passing Interface (MPI) as the underlying basis for most applications, have made extensive use of these advanced networks. Implementations of MPI, such as MVAPICH2 [60], achieve low one-way latencies in the range of 1-2µs. On the other hand, even the best implementation of sockets on InfiniBand achieves 20-25µs one-way latency [39]. Recent studies [39, 48, 90, 102] highlight the huge performance improvements possible for different Big Data middleware using InfiniBand networks and faster disk technologies such as SSDs.

HDFS is communication intensive due to its distributed nature. All existing communication protocols [16] of HDFS are layered on top of TCP/IP. Due to the byte-stream communication nature of TCP/IP, multiple data copies are required, which results in poor performance in terms of both latency and throughput. Consequently, even though the underlying system is equipped with high performance interconnects such as InfiniBand, HDFS cannot fully utilize the hardware capability and obtain peak performance. As a result, a highly scalable RDMA-based design of HDFS is necessary to exploit the full potential of the underlying interconnect.

2. How can we maximize overlapping among different stages of the HDFS Write operation? Can we adopt the high-throughput SEDA-based approach to achieve this?

During HDFS Write, a data block is transferred as packets from clients to DataNodes. Each packet goes through processing and replication; finally, it is stored inside the DataNode. The vanilla HDFS adopts the One-Block-One-Thread (OBOT) architecture to process data sequentially. The OBOT design is a good trade-off between simplicity and performance for default Hadoop running over low-speed interconnect clusters. The RDMA-Enhanced design of HDFS reduces the communication overhead to a great extent. Even though this design can support task/block-level parallelism, there is no overlapping among different packets belonging to the same block. With RDMA-based communication, the data transfer time is significantly reduced, which results in a higher message rate on the DataNode side. Due to the sequential data processing in the default OBOT architecture, incoming data packets must wait for the I/O stage of the previous packet to complete before they are read and processed. This phenomenon is more pronounced in RDMA-Enhanced HDFS than in the default design, since the performance bottleneck moves from data transmission to data persistence. It is, therefore, critical to design RDMA-Enhanced HDFS with a higher-throughput architecture that maximizes overlapping among different stages.
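
To make the contrast with the sequential OBOT model concrete, the sketch below shows the general idea behind a staged, SEDA-style pipeline: each stage has its own bounded queue and worker thread, so one packet can be processed while the previous packet of the same block is still being written to storage. This is only an illustrative sketch with hypothetical class and stage names; it is not the SOR-HDFS implementation described later in this thesis.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Minimal SEDA-style pipeline: receive -> process/replicate -> I/O.
    // Each stage runs in its own thread and hands packets to the next stage
    // through a bounded queue, so different packets of the same block can be
    // in different stages at the same time.
    public class StagedPipelineSketch {
        static class Packet {
            final long blockId; final int seqNo; final byte[] data;
            Packet(long blockId, int seqNo, byte[] data) {
                this.blockId = blockId; this.seqNo = seqNo; this.data = data;
            }
        }

        private final BlockingQueue<Packet> readQ = new LinkedBlockingQueue<>(128);
        private final BlockingQueue<Packet> ioQ = new LinkedBlockingQueue<>(128);

        // Stage 1: packets arriving from the network are enqueued by a receiver.
        public void onPacketReceived(Packet p) throws InterruptedException {
            readQ.put(p);
        }

        // Stage 2: processing/replication worker.
        private final Thread processStage = new Thread(() -> {
            try {
                while (true) {
                    Packet p = readQ.take();
                    // ... verify checksum, forward to the next DataNode, etc.
                    ioQ.put(p);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 3: I/O worker persists packets to local storage.
        private final Thread ioStage = new Thread(() -> {
            try {
                while (true) {
                    Packet p = ioQ.take();
                    // ... append p.data to the block file and send the acknowledgment.
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        public void start() { processStage.start(); ioStage.start(); }
    }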

3. Is it possible to design HDFS with a hybrid architecture to take advantage of the

heterogeneous storage devices on HPC clusters? Can we propose effective data

placement policies for HDFS to reduce the I/O bottlenecks?

Most modern HPC clusters [32] are equipped with heterogeneous storage devices such as RAM, SSD, and HDD. Current large-scale HPC systems also take advantage of enterprise parallel file systems like Lustre. All such heterogeneous storage devices on modern HPC clusters have different performance and storage characteristics. Even though the amounts of local storage space on RAM Disk, SSD, and HDD are negligible compared to the vast installation of Lustre, the peak data access bandwidths of RAM Disk and SSD are much higher than that of Lustre. Due to the fast access speed, in-memory I/O for Hadoop has become more and more popular in recent times. Many studies [21, 96] have examined the impact of caching to accelerate HDFS Read performance. But, as stated in [37], HDFS I/O is often dominated by the Write operation. Therefore, optimizing the I/O performance of HDFS Write through intelligent data placement is as important as optimizing that of HDFS Read. The default design of HDFS cannot fully leverage the heterogeneous storage devices on modern HPC clusters for performance-sensitive applications. The limitation comes from the existing data storage policies and their ignorance of data usage patterns. Besides, due to the tri-replicated data blocks, HDFS requires a large volume of local storage space, which makes the deployment of HDFS challenging on HPC systems. Recent studies have also paid attention to incorporating heterogeneous storage media (e.g., SSD) [68, 69] and parallel file systems [70, 91] into HDFS. In this context, it is imperative that we design efficient data placement policies to minimize the I/O bottlenecks and local storage requirements of HDFS on large-scale HPC systems with heterogeneous storage architecture.
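
As a simple illustration of the kind of placement decision this question targets, the sketch below chooses a storage directory for a new block by walking a RAM Disk -> SSD -> HDD -> Lustre hierarchy and falling back whenever a device does not have enough usable space. The mount points and the headroom rule are hypothetical; the actual Triple-H policies proposed in this thesis are considerably more involved (they also consider data usage patterns and replication).

    import java.io.File;

    // Hypothetical storage hierarchy; the mount points are placeholders.
    public class PlacementSketch {
        private static final File[] TIERS = {
            new File("/dev/shm/hdfs"),          // RAM Disk
            new File("/ssd/hdfs"),              // local SSD
            new File("/hdd/hdfs"),              // local HDD
            new File("/lustre/scratch/hdfs")    // parallel file system
        };

        /** Pick the fastest tier that still has room for one block plus headroom. */
        public static File chooseDirectory(long blockSizeBytes) {
            for (File tier : TIERS) {
                long free = tier.getUsableSpace();
                // Keep roughly 10% headroom so a device is never filled completely.
                if (tier.isDirectory() && free > blockSizeBytes + tier.getTotalSpace() / 10) {
                    return tier;
                }
            }
            // Last resort: the parallel file system (assumed effectively unlimited).
            return TIERS[TIERS.length - 1];
        }

        public static void main(String[] args) {
            System.out.println("Place 128 MB block in: " + chooseDirectory(128L << 20));
        }
    }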

4. Can we propose advanced acceleration techniques to adapt HDFS with in-memory

and heterogeneous storage for iterative benchmarks and applications?

HDFS is not optimized for iterative applications due to the huge I/O bottlenecks incurred by storing the output of each iteration to disk and reading it back from disk. The hybrid design of HDFS with in-memory and heterogeneous storage devices significantly reduces these I/O bottlenecks. The efficient data placement policies further open up new design opportunities to accelerate iterative benchmarks and applications. Thus, if we can identify the hot data for such applications and place it intelligently in the underlying storage devices, we will be able to improve the overall application performance.
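
One simple way to act on "hot" data, sketched below, is to count accesses per block and promote a block to the in-memory tier once it has been read more than a threshold number of times, as iteration inputs and outputs typically are. The class name and threshold are hypothetical simplifications of the selective caching scheme proposed later in this thesis.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Tracks per-block read counts and flags blocks that should be promoted
    // to the in-memory (RAM Disk) tier. Purely illustrative.
    public class HotDataTracker {
        private final Map<Long, AtomicInteger> readCounts = new ConcurrentHashMap<>();
        private final int hotThreshold;

        public HotDataTracker(int hotThreshold) { this.hotThreshold = hotThreshold; }

        /** Call on every block read; returns true when the block becomes hot. */
        public boolean recordRead(long blockId) {
            int count = readCounts
                .computeIfAbsent(blockId, id -> new AtomicInteger())
                .incrementAndGet();
            return count == hotThreshold; // promote exactly once
        }

        public static void main(String[] args) {
            HotDataTracker tracker = new HotDataTracker(2);
            System.out.println(tracker.recordRead(42L)); // false: first read
            System.out.println(tracker.recordRead(42L)); // true: promote to memory tier
        }
    }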

5. How can we design efficient data access strategies for Hadoop and Spark on

HPC systems with heterogeneous storage characteristics?

By default, the scheduler launches MapReduce and Spark tasks considering data locality in order to minimize data movement. The launched tasks read the data in HDFS without considering the storage type or the bandwidth of the storage devices on the local node. For most commodity clusters, this makes perfect sense, as the underlying interconnect is assumed to be slow and, hence, data movement is expensive. Besides, HDFS was initially designed for homogeneous systems, and the assumption was that the nodes would have hard disks as the storage devices [16]. Recently, the concept of heterogeneous storage devices has been introduced in HDFS. Consequently, a variety of data placement policies have been proposed in the literature [34] to utilize the heterogeneous storage devices efficiently. However, even though, with heterogeneous storage, data is replicated on different types of storage devices across different nodes, only locality is considered when accessing the data in the default HDFS distributions. As a result, applications and frameworks running over HDFS cannot utilize the hardware capabilities and obtain peak performance in the presence of many concurrent tasks and jobs. It is, therefore, critical to re-visit the data access strategies in HDFS for HPC systems with heterogeneous storage characteristics and fast interconnects.
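
The sketch below conveys the flavor of a storage-aware access strategy: instead of always preferring the local replica, the client ranks replicas by storage type and only prefers locality to break ties. The Replica record and the ranking are hypothetical simplifications; the Greedy and Hybrid strategies evaluated in this thesis are more nuanced (for example, they also balance load across remote nodes).

    import java.util.Comparator;
    import java.util.List;

    // Chooses which replica of a block to read on a fast-interconnect cluster.
    public class ReplicaSelectorSketch {
        enum StorageType { RAM_DISK, SSD, HDD }   // fastest to slowest

        static class Replica {
            final String host; final StorageType type;
            Replica(String host, StorageType type) { this.host = host; this.type = type; }
        }

        /** Prefer faster storage first; break ties in favor of the local host. */
        static Replica choose(List<Replica> replicas, String localHost) {
            return replicas.stream()
                .min(Comparator
                    .comparingInt((Replica r) -> r.type.ordinal())
                    .thenComparing(r -> r.host.equals(localHost) ? 0 : 1))
                .orElseThrow(() -> new IllegalArgumentException("no replicas"));
        }

        public static void main(String[] args) {
            List<Replica> replicas = List.of(
                new Replica("node1", StorageType.HDD),      // local
                new Replica("node7", StorageType.RAM_DISK), // remote but faster
                new Replica("node9", StorageType.SSD));
            // With RDMA-class interconnects, reading the remote RAM Disk replica
            // can be cheaper than reading the local HDD replica.
            System.out.println(choose(replicas, "node1").host); // prints node7
        }
    }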

6. Can we re-design HDFS to leverage the byte-addressability of NVM?

For performance-sensitive applications, in-memory storage is being increasingly used for HDFS on HPC systems. Even though HPC clusters are equipped with large memory per compute node, using this memory for storage can lead to degraded computation performance due to competition for physical memory between computation and I/O. Recent studies [34, 54] propose in-memory data placement policies to increase the write throughput, but in-memory data placement makes the task of persistence challenging. Non-Volatile Memory (NVM), being persistent, is emerging as a way to significantly improve I/O performance while ensuring persistence. NVMs not only augment the memory capacity on the compute nodes, but also offer byte-addressability with near-DRAM performance. In the presence of RDMA and Lustre on HPC systems, NVMs introduce additional research challenges. In this context, HDFS can take advantage of NVM for RDMA-based communication and I/O. Even though supercomputing systems are being equipped with high performance NVMs, the default design of HDFS is not capable of exploiting their full potential for application performance. Thus, it is important to re-design the storage architecture of HDFS to leverage the byte-addressability of NVM for improving the overall performance and scalability of upper-level middleware and applications.
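
To illustrate what byte-addressable access buys over block I/O, the sketch below memory-maps a file that is assumed to live on a DAX-mounted NVM device and updates a few bytes in place, instead of streaming a whole block back out. The mount path is a placeholder, and mapping a file this way only approximates persistent-memory programming (which would use explicit cache-line flushes rather than force()).

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class NvmByteAccessSketch {
        public static void main(String[] args) throws IOException {
            // Assume /mnt/pmem0 is an NVM device mounted with DAX (hypothetical path).
            try (FileChannel ch = FileChannel.open(Paths.get("/mnt/pmem0/block_0001"),
                    StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

                // Byte-addressable update: touch only the bytes that changed,
                // instead of rewriting the whole 4 KB block through a stream.
                buf.position(128);
                buf.put("new-metadata".getBytes(StandardCharsets.US_ASCII));

                buf.force();   // flush the mapped region; a stand-in for persistence barriers
            }
        }
    }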

7. How can we take advantage of a high-performance burst buffer system to accelerate the I/O performance of Big Data applications on HPC clusters? Can a key-value store such as Memcached be used for this? What are the challenges to integrate HDFS with Lustre through this burst buffer?

Hadoop MapReduce jobs are often run on top of the parallel file systems on HPC clusters. For write-intensive applications, running the job over Lustre can also avoid the overhead of tri-replication that is present in HDFS [82]. Although parallel file systems are optimized for concurrent accesses by large-scale applications, write overheads can still dominate the run times of data-intensive applications. In HPC systems, burst buffer systems are often used to handle the bandwidth limitation of shared file system access [55]. Burst buffers typically buffer the checkpoint data from HPC applications, but there has been very little research on optimizing such burst buffers for Big Data in particular. For instance, checkpoints are read only during application restarts, and thus checkpointing is a write-intensive operation. On the other hand, there is a variety of Hadoop applications: some are write-intensive, some are read-intensive, while some have an equal amount of reads and writes. Besides, the output of one job is often used as the input of the next one; in such scenarios, data locality in the cluster impacts the performance of the latter to a great extent. Moreover, considering the fact that the real data has to be stored to the file system through the burst buffer layer, ensuring fault tolerance of the data is much more important than in the checkpointing domain. We, therefore, believe that designing a key-value store-based burst buffer and integrating Hadoop with Lustre can guarantee optimal performance for a variety of Big Data applications.
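
A minimal sketch of the Asynchronous-Lustre-Write flavor of this idea is shown below: the write is acknowledged as soon as the data sits in the key-value-store-backed buffer, and a background thread flushes it to the Lustre-mounted path. For simplicity, an in-process ConcurrentHashMap stands in for the RDMA-based Memcached servers, and the Lustre mount point is hypothetical; the actual schemes proposed in this thesis are described in a later chapter.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Buffers block writes in a key-value store and flushes them to Lustre
    // asynchronously, so the client sees buffer latency rather than Lustre latency.
    public class BurstBufferSketch {
        private final ConcurrentHashMap<String, byte[]> kvStore = new ConcurrentHashMap<>();
        private final ExecutorService flusher = Executors.newSingleThreadExecutor();
        private final Path lustreDir = Paths.get("/lustre/scratch/user/burst"); // placeholder

        /** Returns as soon as the block is buffered; persistence happens in the background. */
        public void writeBlock(String blockId, byte[] data) {
            kvStore.put(blockId, data);
            flusher.submit(() -> {
                try {
                    Files.createDirectories(lustreDir);
                    Files.write(lustreDir.resolve(blockId), data);
                    kvStore.remove(blockId);        // evict once it is safely on Lustre
                } catch (IOException e) {
                    e.printStackTrace();            // a real design would retry or report
                }
            });
        }

        /** Reads hit the buffer first and fall back to Lustre. */
        public byte[] readBlock(String blockId) throws IOException {
            byte[] cached = kvStore.get(blockId);
            return cached != null ? cached : Files.readAllBytes(lustreDir.resolve(blockId));
        }

        public void shutdown() { flusher.shutdown(); }
    }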

1.3 Organization of the Thesis

Chapter 2 introduces the necessary background relevant to this thesis. Chapter 3 presents the RDMA-Enhanced design of HDFS, while Chapter 4 describes the design to maximize overlapping among different stages of HDFS operations. Chapter 5 discusses the hybrid design of HDFS with heterogeneous storage and advanced data placement policies. Chapter 6 proposes acceleration techniques for iterative applications with in-memory and heterogeneous storage. Chapter 7 presents the efficient data access strategies for Hadoop and Spark on clusters with heterogeneous storage characteristics. Chapter 8 discusses the NVM-based design of HDFS, and Chapter 9 describes the burst buffer design for Hadoop over Lustre. Some future research directions are discussed in Chapter 10 and, finally, the dissertation concludes in Chapter 11.

Chapter 2: Background

2.1 Hadoop

Hadoop is a popular software framework for distributed storage and processing of very large data sets on computer clusters. It is the open-source implementation of the MapReduce programming model, and Hadoop Distributed File System (HDFS) is the underlying file system of Hadoop.

2.1.1 Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system that is used as the primary storage for a Hadoop cluster. Figure 2.1 illustrates the basic architecture of HDFS. An HDFS cluster consists of two types of nodes: NameNode and DataNode. The NameNode manages the file system namespace. It maintains the file system tree and stores all the metadata. The DataNodes, on the other hand, act as the storage system for the HDFS files. HDFS divides large files into blocks of size 64 MB. Each block is stored as an independent file in the local file system of the DataNodes. HDFS usually replicates each block to three (the default replication factor) DataNodes. In this way, HDFS guarantees data availability and fault tolerance. The HDFS client contacts the NameNode during any kind of file system operation. When the client wants to write a file to HDFS, it gets the block IDs and a list of DataNodes for each block from the NameNode. Each block is split into smaller packets and sent to the first DataNode in the pipeline. The first DataNode then replicates each of the packets to the subsequent DataNodes. Packet transmission in HDFS is pipelined; a DataNode can receive packets from the previous DataNode while it is replicating data to the next DataNode. If the client is running on a DataNode, then the block will first be written to the local file system.
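
From a client's point of view, all of this pipelining is hidden behind the standard Hadoop FileSystem API: the client simply opens an output stream and HDFS takes care of splitting the data into blocks and packets and replicating them. The snippet below is a minimal example of writing and reading a file with the Java API; the NameNode address and the file path are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import java.nio.charset.StandardCharsets;

    public class HdfsClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder NameNode URI

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/sample.txt");

                // Write: the DFSClient contacts the NameNode for block locations and
                // streams packets to the DataNode pipeline (replication factor 3).
                try (FSDataOutputStream out = fs.create(file, (short) 3)) {
                    out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
                }

                // Read: the client fetches block locations and reads from the nearest DataNode.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }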

When a client reads an HDFS file, it first contacts the NameNode to check its access permission and gets the block IDs and locations for each of the blocks. For each block belonging to the file, the client connects with the nearest DataNode and reads the block.

[Figure 2.1: HDFS Architecture, showing the Client and Zookeeper interacting with the NameNode and multiple DataNodes.]

2.1.1.1 Overview of HDFS Design

The main components of HDFS are DFSClient and DataNode. The upper-layer frameworks send data to the file system by creating instances of DFSClient that send data to the DataNodes. The DataNodes store these data by creating block files on the local storage devices. The detailed designs of DFSClient and DataNode are as follows:

DataStreamer: The DataStreamer thread runs as a daemon on the DFSClient side. Data packets from the upper layer are put into the DataQueue associated with this thread. The DataStreamer is responsible for sending the block header and the data packets for each block to different DataNodes via Java sockets. Due to the byte-stream model of the Java socket, each packet is converted to a stream of bytes before being written to the socket. After a packet is sent, it is inserted into the AckQueue. Application performance depends on the speed at which the DataStreamer sends the data packets, as new packets can only be inserted into the DataQueue up to a certain limit.

ResponseProcessor: The ResponseProcessor thread waits for the acknowledgment for each packet. Once the acknowledgment for a packet arrives, the packet is removed from the AckQueue.

DataXceiverServer: In the DataNode, a listener thread runs in the DataXceiverServer. This thread continuously monitors the Java socket for incoming connection requests.

DataXceiver: When the listener in the DataXceiverServer accepts a socket connection, it creates a DataXceiver thread. The DataXceiver acts as a daemon to process the incoming data. The DFSClient sends a write request for a block to the DataXceiver, which then parses the header and takes action accordingly (calls writeBlock()). Inside the writeBlock() function, header processing as well as replication takes place, and an acknowledgment is sent back to the client. Then a BlockReceiver is created for the incoming block, which waits to receive the packets belonging to that block. The DFSClient, after getting an acknowledgment for the block header, starts sending packets for the corresponding block. The BlockReceiver, upon receiving the byte stream from the socket, de-serializes and processes each packet. The packet is then replicated to the next DataNode in the pipeline and flushed into the local file system. Then the sequence number of the packet is enqueued into the acknowledgment queue. The PacketResponder sends an acknowledgment for each packet to either the client or the previous DataNode in the pipeline. Acknowledgments are also sent using Java sockets.
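
The interplay between the DataQueue and the AckQueue described above can be summarized with the simplified sketch below: a streamer thread drains the data queue and parks packets in the ack queue, while a response thread retires them as acknowledgments arrive. This is a stripped-down illustration, not the actual HDFS DataStreamer code.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Simplified model of the DFSClient-side DataStreamer/ResponseProcessor pair.
    public class DataStreamerSketch {
        // Bounded, so the writer blocks when it gets too far ahead of the DataNodes.
        private final BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>(80);
        private final BlockingQueue<byte[]> ackQueue = new LinkedBlockingQueue<>();

        /** Called by the upper layer for every packet of the block being written. */
        public void enqueuePacket(byte[] packet) throws InterruptedException {
            dataQueue.put(packet); // blocks once the limit is reached
        }

        // DataStreamer: sends packets to the first DataNode, then parks them in ackQueue.
        private final Thread streamer = new Thread(() -> {
            try {
                while (true) {
                    byte[] packet = dataQueue.take();
                    // sendToDataNode(packet);  // socket (or RDMA) send would go here
                    ackQueue.put(packet);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // ResponseProcessor: removes a packet once its acknowledgment arrives.
        private final Thread responseProcessor = new Thread(() -> {
            try {
                while (true) {
                    // waitForAck();            // would block on the ack stream here
                    byte[] acked = ackQueue.take();
                    // 'acked' is now safely replicated and can be forgotten.
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        public void start() { streamer.start(); responseProcessor.start(); }
    }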

2.1.2 MapReduce

MapReduce [36] is the processing engine of Hadoop. The NameNode hosts a JobTracker that is responsible for farming out tasks to different nodes in the cluster. The DataNodes host the TaskTrackers, one per DataNode. TaskTrackers are responsible for the successful completion of a MapReduce job. Each TaskTracker can launch multiple map and reduce tasks that deal with the job input and output, respectively.
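
To make the roles of the map and reduce tasks concrete, the snippet below is the canonical word-count job written against the org.apache.hadoop.mapreduce API. It is included only as an illustration of the programming model; the input and output paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    ctx.write(word, ONE);             // emitted by a map task
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum)); // written to HDFS by a reduce task
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }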

2.1.3 HBase

HBase is a Java-based database that runs on top of the Hadoop framework [16]. It is used to host very large tables with many billions of entries and provides capabilities similar to Google’s BigTable [22]. It is developed as part of the Apache Software Foundation’s

Apache Hadoop project [16] and runs on top of HDFS. It flushes its in-memory data to

HDFS whenever the size of its memory store reaches a particular threshold (default is

64 MB).

2.1.4 Hive

Hive [1] provides a SQL interface for querying the data stored in Hadoop. Each Hive table corresponds to a directory in HDFS. The SQL queries are launched as MapReduce,

Spark, or Tez [3] jobs to execute the analyses on the distributed data. Hive acts as an abstraction layer to integrate SQL-like queries (HiveQL) with the underlying Java-based

Hadoop layer. Hive was initially developed at Facebook and is now being used in many organizations, including Amazon, Netflix, etc.

2.2 InfiniBand and UCR

InfiniBand [42] is an industry-standard switched fabric that is designed for interconnecting nodes in High End Computing (HEC) clusters. It is a high-speed, general purpose I/O interconnect that is widely used by scientific computing centers world-wide. The recently released TOP500 [97] rankings from November 2016 reveal that more than 37.4% of the computing systems use InfiniBand as their primary interconnect. One of the main features of InfiniBand is Remote Direct Memory Access (RDMA). This feature allows software to remotely read the memory contents of a remote process without any software involvement at the remote side. This feature is very powerful and can be used to implement high performance communication protocols. InfiniBand has started making inroads into the commercial domain with the recent convergence around RDMA over Converged Enhanced Ethernet (RoCE) [89].

InfiniBand Verbs InfiniBand Host Channel Adapters (HCA) and other network equip- ments can be accessed by the upper layer software using an interface called Verbs. The verbs interface is a low level communication interface that follows the Queue Pair (or com- munication end-points) model. Queue pairs are required to establish a channel between the two communicating entities. Each queue pair has a certain number of work queue elements.

Upper-level software places a work request on the queue pair that is then processed by the

HCA. When a work element is completed, it is placed in the completion queue. Upper level software can detect completion by polling the completion queue. Verbs that are used to transfer data are completely OS-bypassed.

InfiniBand IP Layer InfiniBand also provides a driver for implementing the IP layer, al- lowing socket applications to make use of InfiniBand networks. This exposes the Infini-

Band device as just another network interface available from the system with an IP address.

InfiniBand devices are presented as ib0, ib1 and so on. However, it does not provide OS- bypass. This layer is often called IP-over-IB or IPoIB in short. We will use this terminology

18 in the thesis. There are two modes available for IPoIB. One is the datagram mode, imple- mented over Unreliable Datagram (UD), and the other is connected mode, implemented over RC. The connected mode offers better performance since it leverages reliability from the hardware. In this thesis, we have used connected mode IPoIB.

Unified Communication Runtime (UCR): The Unified Communication Runtime (UCR) [47] is a light-weight, high performance communication runtime, designed and developed at The Ohio State University. It aims to unify the communication runtime requirements of scientific parallel programming models, such as MPI and Partitioned Global Address Space (PGAS), with those of data-center middleware, such as Memcached [31], HBase [95], and MapReduce [36].

UCR is designed as a native library to extract high performance from advanced network technologies. The UCR project draws from the design of the MVAPICH2 software. MVAPICH2 [60] is a popular MPI library implementation that provides MPI-3 standard compliance. It is used by more than 2,700 organizations in 83 countries and is also distributed by popular InfiniBand software stacks and distributions such as RedHat and SUSE.

2.3 Heterogeneous Storage Devices on Modern HPC Clusters

2.3.1 Node-Local Storage Devices

RAM Disk is a memory-based file system, faster than modern SSDs, with massively increased read/write performance for all workload types. Solid state drives (SSDs), on the other hand, provide persistent storage with high throughput and low latency and are well-suited for Big Data applications. SATA SSDs are widely used due to their low cost and high performance compared to disk-based hard drives (HDDs). The SATA interface limits the capacity of the bus that transfers data from the SSD to the processor, but SATA SSDs are still suitable as a low-cost option where performance is not the major factor. The current generation of PCIe flash SSDs has thus become very popular in modern HPC clusters due to their high bandwidth and fast random access. While they provide unmatched performance, the adoption of PCIe SSDs is inhibited by differing implementations and unique drivers. This has led to the definition of the NVM Express (NVMe) standard to enable faster adoption and interoperability of PCIe SSDs. The benefits of NVMe with PCIe over traditional SATA SSDs are reduced latency, increased IOPS, and lower power consumption, in addition to durability. Hard-disk drives (HDDs) furnish the largest amount of data storage, but perform poorly compared to SSDs.

2.3.2 Non-Volatile Memory

Non-Volatile Memory (NVM), or NVRAM, provides high data reliability and byte addressability. It is therefore closing the performance, capacity, and cost-per-bit gaps between volatile main memory (DRAM) and persistent storage. The performance of NVM is orders of magnitude faster than that of SSD and disk.

2.3.3 Parallel File System

Lustre [108] is one of the most commonly deployed global file systems on supercomputing clusters and in data centers. It provides POSIX-compliant, stateful, object-based storage to the end applications. Its architecture has two major components: the Meta Data Server (MDS) and the Object Storage Server (OSS). The MDS is responsible for storing the metadata of all files, whereas the OSSs store the actual file data. To access a file, a client first obtains its metadata and other file attributes from the primary MDS. Subsequent file I/O operations are performed directly between the client and the OSS. The various components in a Lustre deployment communicate with each other using the Lustre Network (LNET), which supports a variety of interconnects such as InfiniBand and Ethernet.

2.4 In-Memory Computing Framework and Storage

2.4.1 Spark

While MapReduce [28] has revolutionized Big Data processing for data-intensive batch applications on commodity clusters, it has been demonstrated to be a poor fit for low-latency interactive applications and iterative computations, such as machine learning and graph algorithms. As a result, newer data-processing frameworks such as Spark [111] have emerged. Spark's architecture is built around the concept of a Resilient Distributed Dataset (RDD) [110], which is a fault-tolerant collection of objects distributed across a set of nodes that can be operated on in parallel. Spark can be run in standalone mode or over YARN [98].

Spark can be run using one of two deployment modes:

Standalone: This is the default deployment mode. In this mode, Spark can run on top of HDFS. It uses a Master daemon which coordinates the workers that host the executors.

Spark over YARN [98]: In this mode, Apache YARN takes responsibility for launching the Spark applications. The YARN ResourceManager provides the available resources to the Spark applications, whereas the YARN NodeManager launches the Spark workers that run the executors. This mode supports security and provides better integration with YARN's resource management policies. Applications can run in either YARN-cluster or YARN-client mode. The difference between these two modes is that the application driver runs locally in YARN-client mode, whereas in YARN-cluster mode the driver is co-located with YARN's ApplicationMaster.
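As a rough illustration of how the deployment mode is selected, the following sketch uses the Spark 1.x Java API; the application name, master URL, and data are illustrative, and in practice the master is usually supplied through spark-submit rather than hard-coded.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal Spark (1.x Java API) sketch: the master URL selects the deployment
// mode, "spark://<host>:7077" for standalone or "yarn-client"/"yarn-cluster"
// when running over YARN. Host name and application name are illustrative.
public class SparkModeExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("mode-example")
                .setMaster("yarn-client");      // or "spark://master-host:7077"
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4));
        int sumOfSquares = data.map(x -> x * x).reduce((a, b) -> a + b);
        System.out.println(sumOfSquares);

        sc.stop();
    }
}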

2.4.1.1 Spark SQL

Spark SQL [13] provides an interface to query structured data inside Spark programs. It can also be used to query data in existing Hive tables. It provides an abstraction to access a variety of data sources such as Avro, Parquet, ORC, JSON, and JDBC. It is also possible to join data from multiple sources within a single Spark SQL query.
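A minimal sketch of issuing a Spark SQL query from a Java program is shown below, assuming the Spark 1.x API (SQLContext/DataFrame); the input path, table name, and column names are illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Sketch of querying structured data with Spark SQL (Spark 1.x API);
// the input path and the table/column names are illustrative.
public class SparkSQLExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("sparksql-example"));
        SQLContext sqlContext = new SQLContext(sc);

        DataFrame records = sqlContext.read().json("hdfs:///data/records.json");
        records.registerTempTable("records");

        DataFrame result = sqlContext.sql(
                "SELECT category, COUNT(*) AS cnt FROM records GROUP BY category");
        result.show();

        sc.stop();
    }
}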

2.4.2 Tachyon

Tachyon [54] is a memory-centric distributed storage system for Spark and MapReduce applications. Tachyon is usually deployed on top of a distributed file system like HDFS, Lustre, or GlusterFS. By leveraging lineage for fault tolerance, it provides faster in-memory read and write accesses for primitive file system operations. For complex workloads like MapReduce and Spark, however, Tachyon by default depends on the fault tolerance mechanism of the underlying file system.

2.4.3 Memcached

Memcached was proposed by Fitzpatrick [31] to cache database request results. It was primarily designed to improve the performance of the LiveJournal website. Due to its generic nature and open-source distribution [12], it was quickly adopted in a wide variety of environments. Using Memcached, spare memory in data-center servers can be aggregated to speed up look-ups of frequently accessed information, such as database query results, results of API calls, or web-page rendering elements. Memcached is usually deployed in Web 2.0 architectures as a caching layer to improve performance for various client operations. It is also being used for Big Data computing on HPC systems.
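As a simple illustration of this look-aside caching pattern, the following sketch uses the spymemcached Java client (one of several available clients); the server address, key, and expiry value are illustrative.

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Minimal look-aside caching sketch using the spymemcached client library;
// server address and key are illustrative.
public class MemcachedLookup {
    public static void main(String[] args) throws Exception {
        MemcachedClient mc =
                new MemcachedClient(new InetSocketAddress("cache-host", 11211));

        String key = "query:recent-results";
        Object cached = mc.get(key);
        if (cached == null) {
            String value = runExpensiveQuery();    // e.g., a database query
            mc.set(key, 300, value);               // cache for 300 seconds
            cached = value;
        }
        System.out.println(cached);
        mc.shutdown();
    }

    static String runExpensiveQuery() { return "result"; }
}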

2.5 Benchmarks, Workloads, and Applications

HDFS Micro-benchmarks: The HDFS micro-benchmark suite proposed in [45] has five benchmarks for testing standalone HDFS. These are:

Sequential Write Latency (HDFS-SWL): This benchmark takes the file name and size as inputs and outputs the total time to write this file to HDFS. The HDFS write is performed sequentially by dividing the file into a set of blocks. For this, the benchmark invokes the HDFS create() API to get an instance of FSDataOutputStream. Data bytes are then written to HDFS using the write() method of FSDataOutputStream. The benchmark starts a timer just after creating the file and stops it after the file is closed. The time measured in this way is reported as the latency of the sequential write for the file specified by the user.
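A condensed sketch of this measurement loop, using the standard Hadoop FileSystem API, is shown below; it is not the benchmark's actual source, and the file path and sizes are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Condensed sketch of the HDFS-SWL measurement (not the benchmark source):
// create the file, stream the requested number of bytes, close, and report
// the elapsed time. File name and sizes are illustrative.
public class SequentialWriteLatency {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long fileSize = 1024L * 1024 * 1024;      // 1 GB
        byte[] buffer = new byte[64 * 1024];      // 64 KB application writes

        long start = System.currentTimeMillis();
        FSDataOutputStream out = fs.create(new Path("/bench/swl.dat"));
        for (long written = 0; written < fileSize; written += buffer.length) {
            out.write(buffer);
        }
        out.close();                              // timer stops after close()
        long elapsed = System.currentTimeMillis() - start;

        System.out.println("Sequential write latency: " + elapsed + " ms");
    }
}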

Sequential or Random Read Latency (HDFS-SRL or HDFS-RRL): This benchmark takes the file name, size, access pattern (random or sequential), and seek interval (for random only) as inputs and outputs the time to read the file from HDFS. For this, the benchmark invokes the HDFS open() API to get an instance of FSDataInputStream. Data bytes are then read from HDFS using the sequential or random read() method of FSDataInputStream. The benchmark starts a timer just after opening the file and stops it after the read completes. The time measured in this way is reported as the read latency for the data size specified by the user.
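A corresponding sketch for the read-latency benchmarks is shown below; again, this is an illustration rather than the benchmark's source, and the file path and seek interval are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Condensed sketch of the HDFS-SRL/RRL measurement (not the benchmark source):
// sequential reads simply stream the file; random reads seek() by a fixed
// interval before each read. Path and seek interval are illustrative.
public class ReadLatency {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/bench/swl.dat");
        boolean random = args.length > 0 && args[0].equals("random");
        long seekInterval = 4L * 1024 * 1024;     // 4 MB, for random reads only

        byte[] buffer = new byte[64 * 1024];
        long fileLen = fs.getFileStatus(file).getLen();

        long start = System.currentTimeMillis();
        FSDataInputStream in = fs.open(file);
        long pos = 0;
        while (pos < fileLen) {
            if (random) in.seek(pos);             // jump by the seek interval
            int n = in.read(buffer);
            if (n < 0) break;
            pos += random ? seekInterval : n;
        }
        in.close();
        System.out.println("Read latency: "
                + (System.currentTimeMillis() - start) + " ms");
    }
}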

Sequential Write Throughput (HDFS-SWT): The user can input the number of concurrent writers and the size of data (in MB) per writer. The benchmark outputs the throughput per writer by dividing the data size by the write time required for it. The total throughput is calculated by multiplying the average throughput per writer by the number of writers.

In order to launch multiple HDFS clients (writers) at the same time, there is a job-launcher in the benchmark suite. The job-launcher is a shell script that starts the writer processes on different nodes. It also aggregates the throughput values from the different writers and outputs the total write throughput.

Sequential Read Throughput (HDFS-SRT): This benchmark works in a similar manner to the one for the write workload. Here the inputs are the number of concurrent readers and the read size per reader. It calculates the total throughput for sequential reads. In this case also, the Java-based job-launcher aggregates the read throughput from the different readers and outputs the total throughput.

Sequential Read-Write Throughput (HDFS-SRWT): This benchmark calculates the total throughput when HDFS reads and writes occur simultaneously. The benchmark takes the numbers of readers and writers and the size per reader and writer as inputs. The read/write ratio can be varied by varying the number of concurrent readers and writers and also the data size for each.

Table 2.1: Benchmark parameter list

Benchmark       File Name   File Size   HDFS Parameters   Readers   Writers   Random/Seq Read   Seek Interval
HDFS-SWL            X           X              X
HDFS-SRL/RRL        X           X              X                                     X            X (RRL)
HDFS-SWT                        X              X                        X
HDFS-SRT                        X              X             X
HDFS-SRWT                       X              X             X          X

In all these benchmarks, the users can also provide different HDFS configuration parameters as input. Table 2.1 lists the parameters of our benchmark suite. Each benchmark can report the configuration parameters in use as part of its output. The benchmark suite also calculates statistics such as the minimum, maximum, and average latency and throughput.

TestDFSIO: DFSIO is an HDFS benchmark that is available as part of the Hadoop distribution. It measures the I/O performance of HDFS. It is implemented as a MapReduce job in which each map task opens one file to perform sequential writes or reads, and measures the data I/O size and execution time of that task. A single reduce task is launched at the end of the map tasks to aggregate the performance results of all the map tasks.

Enhanced DFSIO: Enhanced DFSIO is included in the HiBench benchmark suite from Intel. This benchmark consists of two consecutive MapReduce jobs. The first job runs the TestDFSIO test while sampling the number of bytes read/written at fixed time intervals in each map task; during the reduce and post-processing stage, the samples of each map task are linearly interpolated and re-sampled at a fixed plot rate to align the time series between map tasks. The re-sampled points at the same timestamp of all map tasks are then summed up to compute the total number of bytes read/written by all the map tasks at that timestamp. The second job calculates the average of the aggregated throughput values of each time slot during the steady periods and outputs it as the overall aggregated throughput.

Yahoo! Cloud Serving Benchmark (YCSB): The goal of the Yahoo! Cloud Serving Benchmark (YCSB) [26] is to facilitate performance comparisons of different key/value-pair and cloud data serving systems. It defines a core set of benchmarks for four widely used systems: HBase, Cassandra [94], PNUTS [25], and a simple sharded MySQL implementation. The core workload for HBase consists of six different workloads, each representing a different application scenario. Zipfian and Uniform distribution modes are used in YCSB for record selection in the database. Besides that, customization of workloads is also possible. In addition to these different workloads, there are a number of runtime parameters that can be defined while running YCSB.

TeraGen and TeraSort: TeraGen is an I/O-intensive benchmark that generates the input data for TeraSort and stores it in the underlying file system. TeraSort, on the other hand, is shuffle-intensive. It reads data from the file system, sorts it, and then writes the output back to the file system. These benchmarks have both Hadoop MapReduce and Spark implementations.

RandomWriter and Sort: RandomWriter [76] is an I/O-intensive MapReduce benchmark that writes data to the file system. The parallelism in the data write comes from the concurrent tasks running per host. The benchmark writes a specific amount of data to the file system and finally reports the execution time to generate this data. The Sort [84] benchmark runs on the data generated by RandomWriter. This benchmark is shuffle-intensive but performs equal amounts of reads and writes from the file system.

SWIM: The Statistical Workload Injector for MapReduce (SWIM) [86] contains suites of workloads consisting of many MapReduce jobs, with complex data, arrival, and computation patterns. The MapReduce workloads in SWIM are generated by sampling traces of MapReduce jobs from Facebook. The suite also contains the generators for the data and workloads.

PUMA: The PUrdue MApReduce Benchmark Suite (PUMA) [18, 66] represents a range of MapReduce applications exhibiting different characteristics. The suite has 13 benchmarks; some are compute-intensive, some are I/O-intensive, while others involve heavy shuffle. PUMA also contains three benchmarks from the Hadoop distribution.

MR-MSPolygraph: MR-MSPolygraph [50] is a MapReduce-based implementation for parallelizing peptide identification from mass spectrometry data. This application searches for the query peptides in the reference file and counts the number of matching peptides. The application does not have a reduce phase and is, hence, a map-intensive workload.

CloudBurst: CloudBurst [78] is a highly parallelized read-mapping algorithm for next-generation DNA sequence mapping. The MapReduce implementation of CloudBurst reads the input DNA sequence data in parallel and maps the data to the human genome or other reference genomes. Finally, the application reports the alignment time along with the alignments for each read. This application is used for a variety of biological analyses, including SNP discovery, genotyping, and personal genomics.

Chapter 3: High Performance RDMA-based Design of HDFS over InfiniBand

HDFS is a communication-intensive middleware because of its distributed nature. All existing communication protocols [16] of HDFS are layered on top of TCP/IP. Due to the byte-stream communication nature of TCP/IP, multiple data copies are required, which results in poor performance in terms of both latency and throughput. Consequently, even though the underlying system is equipped with high performance interconnects such as InfiniBand, HDFS cannot fully utilize the hardware capability and obtain peak performance. These issues lead us to investigate the impact of the network on overall HDFS performance through detailed profiling and analysis. In this chapter, we also describe the design challenges for re-designing HDFS to take advantage of high performance interconnects such as InfiniBand and exploit advanced features like RDMA, and finally, we present an RDMA-Enhanced design of HDFS.

3.1 Proposed Design

In this section, we propose a hybrid design for HDFS that supports both socket- and RDMA-based communication. We concentrate on HDFS write because it is more network intensive. Our design extends the existing HDFS and uses UCR for InfiniBand communication via the JNI interfaces.

3.1.1 Design Overview

In this section, we present an overview of the RDMA-Enhanced design of HDFS, which we call HDFSoIB. In HDFSoIB, we re-design the communication layer of HDFS by using RDMA over InfiniBand instead of the Java-Socket interface. In this hybrid design, we use RDMA for HDFS write and replication via the JNI interface, whereas all other HDFS operations go over the Java-Socket. The JNI layer bridges the Java-based HDFS with the communication library written in native C. Figure 3.1(a) illustrates the design of HDFSoIB. The objective of this research is to find out how much improvement we can achieve in HDFS performance just by enhancing the communication layer. Therefore, in this design we keep the existing HDFS architecture intact.

3.1.2 New Components in the Hybrid Architecture

In the hybrid design, we apply our modifications to the DFSClient and the DataNode server. Figure 3.1(b) depicts the architecture of our proposed HDFSoIB design. The major components that we add on the DFSClient side are:

Figure 3.1: Architecture and Design of RDMA-Enhanced HDFS: (a) Design Overview; (b) Architecture of RDMA-Enhanced HDFS

Connection: In the hybrid design, we introduce a new Java object called Connection. We use UCR as the communication library. As indicated in Chapter 2.2, UCR is an end-point based library. An end-point is analogous to a socket connection. Before establishing an end-point, UCR has to initialize the IB device and create a UCR CTX. The Connection object holds the UCR CTX id returned by the communication library. It is also responsible for providing the end-points for UCR communication. We maintain a pre-allocated buffer for each end-point to avoid intermediate data copies between the JNI layer and the UCR library. The size of this buffer is limited to the HDFS packet-size in order to keep a low memory footprint. The Connection object also holds the pointer to this buffer.

The JNI Adaptive Interface: The UCR library is implemented in native C code. The JNI interface enables the Java code in HDFS to make use of the UCR library functions for communication.

RDMADataStreamer: Data packets that are put into the DataQueue are sent by the RDMADataStreamer. The RDMADataStreamer runs as a daemon in the RDMA-Enabled DFSClient (RDMADFSClient); the RDMADFSClient redirects the client to the appropriate sender thread (RDMADataStreamer or DataStreamer) depending on whether it wants to send data via RDMA or the Java socket. The RDMADataStreamer dynamically creates an end-point to the DataNode where it wants to send the HDFS blocks. It gets the data packets from the DataQueue and puts the corresponding bytebuffer into the buffer registered for this particular end-point. The data is then sent over RDMA by invoking the UCR library functions via the JNI interface.

RDMAResponseProcessor: The RDMAResponseProcessor receives acknowledgments for the packets sent over RDMA.

Similar to the client side, the Connection object and JNI interface components exist on the DataNode side as well. The HDFS DataNode server side is extended by adding the following components:

RDMADataXceiverServer: This is a listener thread which waits for connection requests over RDMA. When the listener accepts an incoming request, an end-point is created between the client and the server. This end-point is equivalent to a socket connection and can be used for data transfer between the respective nodes.

RDMADataXceiver: After an RDMA connection or end-point is established, the RDMADataXceiverServer spawns an RDMADataXceiver thread associated with that connection. The RDMADFSClient sends the write request for a block to the RDMADataXceiver, which then parses the header and takes action accordingly (calls writeBlock()). Inside the writeBlock() function, header processing takes place and an acknowledgment is sent back to the client. Then an RDMABlockXceiver is created for the incoming block, which waits to receive the data packets belonging to that block. The RDMADFSClient, after getting an acknowledgment for the block header, starts sending the data packets for the corresponding block. The RDMABlockXceiver, upon receiving the data buffer, processes the packet. It is then given to the RDMAReplicator.

RDMAReplicator: Each DataNode has a Connection object to be used for replication. The RDMAReplicator creates an end-point to the next DataNode in the pipeline and replicates the data packets. The packet is then flushed to the local storage.

RDMAPacketResponder: Receives acknowledgments from the downstream DataNode and sends acknowledgments to the upstream DataNode or the RDMADFSClient.

Figure 3.2: HDFS-RDMA design with hybrid communication support

3.1.3 Connection and Buffer Management

The DFSClient sends data blocks to different DataNodes for replication, as advised by the NameNode. Each RDMADFSClient creates an RDMA Connection object (UCR CTX) and registers a JNI buffer for communication. End-points are dynamically created in this Connection object when data is sent over RDMA. On the DataNode side as well, we maintain a Connection object to be used for replication. This Connection object is initialized when the DataNode server starts. The end-points are dynamically established during replication.

In the existing HDFS implementation, the DFSClient establishes a new socket connection with the DataNode server each time it wants to send a block to it, and the DataNode side creates a new receiver thread for each block. In our design, we eliminate the overhead of connection creation for each block by using the pre-established Connection object. The overhead of end-point creation, on the other hand, is negligible. Therefore, end-points are established on demand and cached for subsequent use. Furthermore, a single receiver thread in the DataNode is responsible for handling all the blocks coming from a particular client to that end-point. In HDFS, data packets as well as the block headers are Java objects. However, RDMA uses memory semantics. Therefore, we use Java direct buffers, which allow us to take advantage of zero-copy data transfer. Besides, the size of the JNI buffer registered per Connection object is limited to the HDFS packet-size, as this is the maximum amount of data to be transferred at a time using this buffer. This also helps in maintaining a low memory footprint. The Connection objects are also created during system initialization. Therefore, the start-up time should be negligible for large data-intensive applications.
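The following sketch illustrates, in plain Java with illustrative names (it is not the HDFSoIB code itself), the buffer-management policy described above: one pre-allocated direct buffer per end-point, bounded by the HDFS packet-size, with end-points cached after first use.

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the connection/buffer management policy: a long-lived
// Connection context, one pre-allocated direct buffer per end-point bounded by
// the HDFS packet-size, and end-points created on demand and cached for reuse.
public class RdmaConnectionCache {
    static final int PACKET_SIZE = 512 * 1024;   // optimal RDMA packet-size

    // end-point id -> registered direct buffer (direct buffers allow zero-copy
    // hand-off to the native JNI/UCR layer)
    private final Map<String, ByteBuffer> endpoints = new HashMap<>();

    ByteBuffer getEndpoint(String dataNode) {
        return endpoints.computeIfAbsent(dataNode,
                dn -> ByteBuffer.allocateDirect(PACKET_SIZE));
    }

    void sendPacket(String dataNode, byte[] packet, int len) {
        ByteBuffer buf = getEndpoint(dataNode);  // cached after first use
        buf.clear();
        buf.put(packet, 0, len);
        buf.flip();
        // hand 'buf' to the native communication library via JNI (omitted)
    }
}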

3.1.4 Communication Flow using RDMA over InfiniBand

The hybrid design supports both RDMA- and socket-based communication. Figure 3.2 illustrates the communication flow in our new design. The RDMADFSClient can choose which interconnect/protocol to use for an operation. Two nodes must establish an end-point before they can communicate with each other. When the DFSClient wants to write a file, it first contacts the NameNode over IPC to get a block ID and the list of DataNodes for that block. The block is sent to the first DataNode by the client and then replicated to the subsequent DataNodes. In the hybrid design, the communication between the client and the first DataNode, as well as among the DataNodes, takes place over RDMA. Therefore, the RDMADFSClient first establishes an end-point with the first DataNode server in the pipeline and uses it for data transfer. On the DataNode server side, the RDMADataXceivers create RDMABlockXceivers that monitor their respective end-points for the arrival of data packets. Each block is replicated by the corresponding RDMAReplicator over RDMA. The RDMAResponseProcessors in the RDMADFSClient receive the acknowledgments on their corresponding end-points.

3.2 Experimental Results

In this section, we perform detailed profiling and evaluation of the proposed HDFSoIB design.

3.2.1 Experimental Setup

The experiments are performed on four different clusters. They are:

OSU RI Cluster A: This in-house cluster has nodes with dual quad-core Xeon processors operating at 2.67 GHz. Each node is equipped with 12 GB RAM and one 160 GB HDD. Each node also has an MT26428 QDR ConnectX HCA (32 Gbps) with a PCI-Ex Gen2 interface. The nodes are interconnected with a Mellanox QDR switch. Each node runs Red Hat Enterprise Linux Server release 6.1 (Santiago). The Lustre installation on this cluster has 12 TB capacity and is accessible through InfiniBand QDR.

OSU RI Cluster B: This cluster has nine nodes. Each node has dual quad-core Xeon processors operating at 2.67 GHz and is equipped with 24 GB RAM, two 1 TB HDDs, a single 300 GB OCZ VeloDrive PCIe SSD, and a 12 GB RAM Disk, running Red Hat Enterprise Linux Server release 6.1. Nodes in this cluster have MT26428 QDR ConnectX HCAs (32 Gbps data rate) and are interconnected with a Mellanox QDR switch.

OSU WCI: This cluster consists of 64 compute nodes with dual quad-core Intel Xeon processors operating at 2.33 GHz, 6 GB RAM, and a PCIe 1.1 interface. Each node is equipped with a ConnectX DDR IB HCA (16 Gbps data rate). Sixteen of the compute nodes have Chelsio T320 10GbE Dual Port Adapters with TCP Offload capabilities. Each node runs Red Hat Enterprise Linux Server release 5.5 (Tikanga), with kernel version 2.6.30.10 and OpenFabrics version 1.5.3. The IB cards on the nodes are interconnected using a 144-port Silverstorm IB DDR switch, while the 10 GigE cards are connected using a Fulcrum Focalpoint 10 GigE switch.

SDSC Gordon: The Gordon Compute Cluster at SDSC [32] is composed of 1,024 dual-socket compute nodes connected by a dual-rail, QDR InfiniBand, 3D torus network. Each compute node has two eight-core Intel EM64T Xeon E5 2.6 GHz (Sandy Bridge) processors, 64 GB of DRAM, and a 32 GB RAM Disk, with CentOS 6.4 as the operating system release. The cluster has a total of sixteen 300 GB Intel 710 solid state drives distributed among the compute nodes.

3.2.2 Optimal Packet-size for Different Interconnects/Protocols

In this set of experiments, we vary the HDFS packet-size from 64 KB to 1 MB and measure the file write times for different file sizes (from 1 GB to 5 GB) using the HDFS Write benchmark SWL, as described in Chapter 2.5. We perform these experiments for 1 GigE, IPoIB, and RDMA on OSU RI Cluster B and for 1 GigE, IPoIB, 10 GigE, and RDMA on the OSU WCI cluster. On both clusters, the experiments are run on four DataNodes. For a particular interconnect/protocol, the packet-size for which we obtain the smallest write latency for most of the file sizes is chosen as the optimal packet-size for the corresponding interconnect/protocol.
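For reference, the HDFS packet-size used in these experiments can be set through the client configuration; the property names below are the usual ones (dfs.client-write-packet-size in Hadoop 2.x, dfs.write.packet.size in older releases), and this snippet is only a sketch.

import org.apache.hadoop.conf.Configuration;

// Sketch of setting the HDFS packet-size before creating a FileSystem/DFSClient.
// Both the newer and older property names are set defensively; 512 KB matches
// the optimal RDMA value found in these experiments.
public class PacketSizeConfig {
    public static Configuration withPacketSize(int bytes) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.client-write-packet-size", bytes);
        conf.setInt("dfs.write.packet.size", bytes);
        return conf;
    }

    public static void main(String[] args) {
        Configuration conf = withPacketSize(512 * 1024);
        System.out.println(conf.get("dfs.client-write-packet-size"));
    }
}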

Figure 3.3 shows the optimal packet-sizes for different interconnects/protocols on OSU RI Cluster B. As can be observed from the figure, a 64 KB packet-size provides the minimum latency for most of the file sizes on the 1 GigE network. Therefore, 64 KB is chosen as the optimal packet-size for the 1 GigE network. Similarly, the optimal packet-sizes for IPoIB and RDMA are 128 KB and 512 KB, respectively. The optimal packet-size values determined for the different interconnects/protocols are listed in Table 3.1. In all these experiments, the HDFS block-size is kept fixed at 64 MB.

Table 3.1: Optimal Network Packet-Sizes for Different Interconnects/Protocols

                              OSU WCI                            OSU RI Cluster B
File Size (GB)   1GigE    10GigE   IPoIB    UCR-IB       1GigE    IPoIB    UCR-IB
1                64KB     64KB     128KB    512KB        64KB     128KB    512KB
2                64KB     64KB     128KB    512KB        64KB     128KB    512KB
3                64KB     64KB     128KB    512KB        64KB     128KB    512KB
4                64KB     128KB    128KB    512KB        64KB     128KB    512KB
5                64KB     64KB     128KB    256KB        64KB     128KB    512KB

The experiments are run using both HDD and SSD as the underlying storage medium in the DataNode servers. The optimal packet-sizes identified for HDD and SSD are the same. However, for HDD, the optimal packet-sizes primarily benefit the communication; the bottleneck remains in the I/O. Therefore, even though 64 KB, 128 KB, and 512 KB are the optimal packet-sizes for 1 GigE, IPoIB, and RDMA, respectively, we do not observe much variation in the file write times for different packet-sizes for a particular interconnect/protocol.

Figure 3.3: Optimal network packet size evaluation for different file sizes: (a) 1GigE; (b) IPoIB (32Gbps); (c) HDFSoIB (32Gbps)

Similar experiments are performed on the OSU WCI cluster, and the optimal packet-sizes for the different interconnects on this cluster are also listed in Table 3.1.

Table 3.2: Split-up Times for a Single Block Transmission in HDFS

                 1GigE     IPoIB (32Gbps)   HDFSoIB (32Gbps)
Communication    325 ms    77 ms            47 ms
Processing       143 ms    139 ms           112 ms
I/O              299 ms    131 ms           110 ms

3.2.3 Micro-benchmark Level Evaluations on Different Interconnects

In our study, we perform a comprehensive profiling to understand HDFS behavior and determine that, while writing a 64 MB block to HDFS, time is spent on three main parts: (1) Communication, (2) Processing, and (3) I/O. We measure the time taken by each of these parts for a 64 MB block write by placing timers inside the HDFS code. Table 3.2 shows these three times measured over different interconnects on OSU RI Cluster A using HDD. These numbers are measured using a single DataNode, with a replication factor of one. The use of RDMA over InfiniBand improves the communication time. Here, the I/O time is the total time required to flush all the data packets of the block to the file system. Since the packet-sizes used for IPoIB and RDMA are 128 KB and 512 KB, respectively, we perform aggregated I/O operations in these two cases compared to 1 GigE. As a result, each of these interconnects gains in terms of I/O time to some extent. After receiving each packet, HDFS performs some processing on it; as part of this processing, some Java functions are called on a per-packet basis. Since the number of packets is reduced in our design, we also gain in terms of processing time.

We design a micro-benchmark that measures and reports HDFS file write times. Using this micro-benchmark, we measure the latency of writing files ranging from 1 GB to 5 GB. We run these experiments on both the OSU WCI and OSU RI clusters. In each of these experiments, we use the optimal packet-sizes obtained in our previous experiments for the different interconnects.

Figure 3.4: Micro-benchmark evaluation in different clusters for HDD: (a) total file write times with HDD (OSU WCI with 4 DataNodes); (b) total file write times with HDD (OSU RI Cluster A with 32 DataNodes)

Figure 3.4(a) shows the file write times for different file sizes over 1 GigE, IPoIB, 10 GigE, and RDMA on OSU WCI. We observe that the RDMA-based design outperforms 10 GigE by 14% and IPoIB by 20% for the 5 GB file size. Figure 3.4(b) shows the file write times for different file sizes over 1 GigE, IPoIB, and RDMA on OSU RI Cluster A. As we observe from this figure, the RDMA-based design outperforms the socket-based design for all the file sizes, and the overall gain in terms of latency is 15% over IPoIB for the 5 GB file size. In this experiment, HDD is used as the underlying storage device in the DataNodes.

Figure 3.5(a) shows the file write times on OSU WCI using SSD. Due to the decrease in I/O times with SSD, the total file write times are lower than those in Figure 3.4(a). Our design achieves an improvement of 16% over 10 GigE and 25% over IPoIB for the 5 GB file size. Figure 3.5(b) shows the file write times for different file sizes over three different interconnects on OSU RI Cluster B. From this figure, we observe that our design improves the write latency by 12% over IPoIB for the 5 GB file size.

Figure 3.5: Micro-benchmark evaluation in different clusters for SSD: (a) total file write times with SSD (OSU WCI with 4 DataNodes); (b) total file write times with SSD (OSU RI Cluster B with 4 DataNodes)

3.2.4 HDFS Communication Time

In order to measure the communication time during HDFS Write, we design a benchmark that mimics the communication pattern in HDFS. In this benchmark, the client first transfers the data to the first DataNode in the pipeline. The data is then replicated to the subsequent DataNodes and, finally, the client gets back the acknowledgment. Figure 3.6 shows the times spent in communication during HDFS writes of different file sizes, measured over different interconnects on OSU RI Cluster A with 32 DataNodes using HDD. The new design gains an improvement of 30% over IPoIB and 56% over 10 GigE in terms of communication time. We also measure the communication times on OSU RI Cluster B with four SSD nodes and observe similar gains. This clearly demonstrates the capability of the native RDMA-based design to provide low-latency data transmission.

Figure 3.6: Communication times in HDFS on OSU RI

3.2.5 Evaluation with TestDFSIO

DFSIO is a file system MapReduce benchmark in Hadoop that measures the I/O performance of HDFS. In this benchmark, each map task opens one file for sequential read/write, and the Hadoop job measures the data I/O size and the execution time for writing a particular sized file [90]. There is only a single reduce task, which aggregates the results from the different map tasks running in this benchmark.

In our experiments, we run DFSIO write with file sizes ranging from 80 GB to 120 GB. We perform these experiments on 32 nodes on SDSC Gordon. Figure 3.7(a) shows the throughput results of DFSIO with SSD. Here we can see that our design outperforms IPoIB (32Gbps) by 28% for the 120 GB file size. Figure 3.7(b) shows the throughput results of DFSIO with multiple HDDs per node over three different interconnects on OSU RI Cluster B. These experiments are performed on four DataNodes. From this figure we observe that, as we switch from a single HDD to two HDDs per node, the throughput over IPoIB (32Gbps) increases by 66.5%, whereas for HDFSoIB the increase in throughput is 75.8%. Moreover, our design gains 24% over IPoIB (32Gbps) for a single HDD per node, whereas the gain is 31% for two HDDs per node. Therefore, the RDMA-based design (HDFSoIB) is able to achieve much better throughput than the socket-based design when the I/O bottlenecks are reduced.

Figure 3.7: TestDFSIO benchmark evaluation over different interconnects: (a) TestDFSIO with SSD (32 DataNodes on SDSC Gordon); (b) TestDFSIO with multiple HDDs per node (4 DataNodes on OSU RI Cluster B)

3.3 Benefit of RDMA-based HDFS in HBase

In this section, we evaluate the HBase Put operation with HDFS as the underlying file system. The most useful use-case for HBase with HDFS is bulk writes or updates. With a large amount of data insertion, HBase eventually flushes its in-memory data to HDFS. HBase has a MemStore which holds the in-memory modifications to any particular region of data. The MemStore triggers a flush operation when it reaches its size limit (the default is 64 MB). During the flush, HBase writes the MemStore to HDFS as an HFile instance. These HFiles are written per flush of the MemStore, and a compaction of all these store files is needed when the number of store files reaches a particular threshold (the default is three) for any one HStore. Although the MemStore flush to the HStore, the compaction of HStore files, and HBase updates to the MemStore all happen concurrently in different threads, high operational latency for HDFS Writes can affect the overall performance of the HBase Put operation. HBase also has an HLog instance which keeps flushing its log data to HDFS. These HLogs are the basis of HBase data replication and failure recovery, and thus must be kept in HDFS.

With the new design of HDFS, we conducted experiments to measure the overall Put latency in HBase. These experiments are run on OSU RI Cluster A on a QDR platform (32Gbps), and we use HDD as the DataNode storage for HDFS. A modified version of workload A (100% update) of YCSB (described in Chapter 2) has been used as our benchmark. Figure 3.8 shows these results for the HBase Put operation with a 1 KB message size and a single region server. For the insertion of 360 K records, an average latency of 204 µs is achieved, whereas with IPoIB the average latency is 252 µs. The throughput achieved by the RDMA-based design is 4.41 Kops/sec, whereas IPoIB obtains a throughput of 3.63 Kops/sec. Our HDFSoIB design obtains an overall performance improvement of around 20% in both average latency and throughput.

Figure 3.8: YCSB evaluation with a single region server: (a) Put latency; (b) Put throughput

In order to have an increased number of DFSClients in our experiments, we increase the number of region servers to 32 on OSU RI Cluster A. With 32 region servers, we observe similar performance improvements for the new HDFS design on the QDR platform (32 Gbps). For the insertion of 480 K records, each with a 1 KB message size, an average latency of 201 µs is achieved, which is 26% less than that of IPoIB (272 µs). The throughput for 480 K record insertion is 4.42 Kops/sec, which is also 24% higher than that of IPoIB (3.35 Kops/sec). Figure 3.9 illustrates these results.

Figure 3.9: YCSB evaluation with multiple (32) region servers: (a) Put latency; (b) Put throughput

3.4 Related Work

In the HPC field, RDMA has been used to speed up file systems' I/O performance. Wu et al. designed PVFS over InfiniBand to improve PVFS I/O performance [105]. Ouyang et al. designed an RDMA-based job migration framework that uses RDMA to improve the recovery of large-sized jobs [62]. Yu et al. proposed an RDMA-capable storage protocol over wide area networks to speed up NFS performance [109]. These studies illustrate that RDMA can benefit traditional distributed and parallel file systems. We share similar objectives with these research directions and investigate the benefit of RDMA in the Hadoop environment.

An RDMA-based high-performance design of HBase over InfiniBand was presented in [39], where the HBase data-query performance was improved by optimizing the communication cost using RDMA. The memory-block semantics supported by RDMA-capable networks were mapped to the object transmission primitives used in HBase. The study by Wang et al. [103] reveals that the Merge operation in MapReduce can be accelerated by exploiting the benefits of RDMA in such a manner that the data does not have to be copied to disk. They have also developed a shuffle-merge-reduce pipeline that works in conjunction with the RDMA-based merge.

In [90], the authors examined the impact of high-speed interconnects such as 1GigE, 10GigE, and InfiniBand (IB), using protocols such as TCP, IP-over-IB (IPoIB), and Sockets Direct Protocol (SDP), on HDFS performance. Their findings also revealed that these faster interconnects have a larger impact on performance if SSDs are used in the DataNodes, as this reduces the local I/O costs. The findings in this work have led us to further investigate the fundamental communication overheads in native HDFS and redesign it to leverage advanced features offered by InfiniBand, such as RDMA.

3.5 Summary

In this work, we have proposed a hybrid design for HDFS that incorporates communication over conventional sockets as well as RDMA over InfiniBand. The new design is able to provide low latency and high throughput for HDFS Write operations, as it leverages the RDMA capability of high performance networks like InfiniBand. Our design achieves a 30% gain in communication time and increases HDFS write throughput by 28% over IPoIB (32Gbps). For HBase, the performance of the Put operation is improved by 26% with the proposed design.

Chapter 4: Maximizing Overlapping in RDMA-Enhanced HDFS

The DFSClient sends data to the DataNode in a pipelined manner. As a result, there is a lot of potential to exploit overlapping among the different phases of operations within a single HDFS block write. This is particularly true for the RDMA-Enhanced design of HDFS, where data packets arrive very quickly at the DataNode side. Thus, waiting for the completion of the I/O of the previous packet before receiving the next packet of the same block is extremely inefficient and fails to leverage the full benefits offered by RDMA. Considering all these aspects, we propose SOR-HDFS, a SEDA-based approach to maximize overlapping in HDFS over RDMA. Figure 4.1 depicts the differences between our proposed design and the existing ones. In the default architecture, the different stages of the HDFS write operation happen sequentially for the packets belonging to a block, whereas in the proposed architecture, these phases can overlap with each other.

This is achieved by assigning the task of each of the stages to a dedicated thread pool, as is done in a Staged Event-Driven Architecture (SEDA) [104]. SEDA is one of the most commonly used design approaches for Internet services in the distributed computing area and has been considered a high-throughput software architecture. It decomposes a complex processing logic into a set of stages connected by queues. For each stage, a dedicated thread pool is in charge of processing the events on the corresponding queue. By performing admission control on each event queue, the whole system with multiple thread pools can maximally overlap the various processing stages. The HDFS Write operation also consists of several stages, as demonstrated in Figure 4.1(a). However, the sequential execution of these stages prevents HDFS from fully utilizing the hardware capabilities (such as RDMA) and obtaining peak performance. These issues lead us to the following broad questions and design challenges:

Figure 4.1: Default vs. Proposed HDFS Architecture: (a) sequential data processing in default HDFS (One-Block-One-Thread Architecture); (b) overlapped data processing in proposed HDFS (Staged Event-Driven Architecture)

1. How can we re-design HDFS to exploit maximum overlapping among the different stages of the HDFS Write operation?

2. Can we adopt the high-throughput SEDA-based approach for achieving maximum overlapping in HDFS?

3. How much performance improvement can we obtain while maintaining the correct processing of data with the SEDA-based approach using RDMA for communication?

4. Can we guarantee performance benefits across different HPC clusters with different configurations using the proposed design?

4.1 Proposed Design

In this section, we describe the architecture and design details of SOR-HDFS over InfiniBand.

4.1.1 Architectural Overview

The primary goal of SOR-HDFS is to maximize overlapping in HDFS write with a SEDA-based architecture. A stage is the fundamental unit of a SEDA-based architecture. In our design, we have divided the operations of HDFS write and replication into four stages: (1) Read, (2) Packet Processing, (3) Replication, and (4) I/O.

Figure 4.2(a) depicts a high-level overview of the SOR-HDFS architecture with pipelined replication support. In this design, after data is received via RDMA from the JNI layer, it is first read into a Java I/O stream. The received packet is then replicated after some processing operations. The data packet is also written to the disk file. In the default architecture of HDFS, all these stages are handled sequentially by a single thread per block, whereas in the proposed design of SOR-HDFS, each of the stages is handled by a different thread pool; thus the operations of different stages can also overlap at the packet level. In this way, SOR-HDFS can achieve task-level parallelism as well as overlapping within blocks.

Figure 4.2: Architectural Overview of SOR-HDFS: (a) Pipelined Replication; (b) Parallel Replication

4.1.2 Design of SOR-HDFS

In this section, we discuss the design of different stages of SOR-HDFS in detail.

Read Stage: Figure 4.3 shows the architecture of the Read stage. This stage consists of an RDMA-based receiver that receives data in an RDMA Connection object. The Connection object can support multiple end-points, and each DFSClient connects to one of the end-points. This stage also has a pool of buffers that can be used for data received at any end-point. In the HDFSoIB design, we had a single buffer per end-point, which is well-suited to the default architecture: data packets belonging to a block go through the different phases sequentially, as in Figure 4.1(a), and by the time the receiver goes to read the next packet, the buffer is free and can be reused. With the SEDA-based architecture, however, while one packet is in the I/O phase, many other packets may arrive at the receiver from the same DFSClient, which keeps sending packets back to back. Thus, having a common pool of buffers is more reasonable for robust handling of data packets in SOR-HDFS.

The buffer pool is a set of Java Direct Buffers used to obtain data directly from the JNI layer. These buffers are registered when the DataNode starts up and are used in a Round-Robin manner. Each buffer has an associated flag bit to indicate whether the buffer is free or not. In the JNI layer, when data comes to an end-point, it is assigned a free buffer. The RDMA-based receiver thread in the Java layer receives data using a non-blocking receive() and can thus acquire data from multiple end-points at the same time. After each invocation of receive(), the receiver thread returns the buffer pointer of each received packet to the Packet Processing stage.

Figure 4.3: Read stage

Packet Processing Stage: Figure 4.4 depicts the architecture of the Packet Processing stage. The Read stage returns a tuple for each packet and passes it to the Packet Processing stage. This stage has a pool of worker threads that wait on a Process Request Queue (PRQ). The DFSClient sends two types of packets to the DataNode for each block: the header and the data packets. The Process Request Controller (PRC) is responsible for choosing free worker threads from the pool and assigning them to the incoming packets. When a worker thread gets the header for a block, the PRC assigns it to that particular block and passes all the subsequent data packets belonging to this block to that particular worker thread. In this way, the sequence of packets within a block is maintained by the PRC. Packet processing includes reading the different fields of the data packet, verification of the packet sequence number, and checksum validation.

Each worker thread puts the data packets into the queues for the Replication and I/O stages after the processing is complete.

Figure 4.4: Packet Processing Stage

Replication Stage: Figure 4.5(a) shows the components of the RDMA-based Replication stage of our design. This stage consists of a pool of replicator threads that replicate the data packets to the downstream DataNodes via an RDMA Connection object. The RDMA Connection object supports multiple end-points; thus, data can be replicated to any of the DataNodes using this single Connection object. The pool of replicator threads waits on a Replication Request Queue (RRQ). The Replication Request Controller (RRC) finds a free thread from the pool and assigns it to replicate a block. The RRC preserves the sequence of packets belonging to a particular block by assigning them to the appropriate replicator threads. The replicator sends out the packets over RDMA in a non-blocking manner. This stage also has a Responder thread that receives the acknowledgments coming to the different end-points of the RDMA Connection object.

Figure 4.5: Different stages of SOR-HDFS: (a) RDMA-based Replication stage; (b) I/O stage

I/O Stage: Figure 4.5(b) shows the architecture of the I/O stage of SOR-HDFS. This stage consists of a pool of I/O threads. The worker threads in the Packet Processing stage aggregate multiple data packets into the aggregation cache and put an I/O request in the I/O Request Queue (IRQ). The I/O Request Controller (IRC) assigns the incoming requests to the appropriate I/O threads such that data is written sequentially. The I/O threads flush the aggregated data from the cache to the disk file. The I/O requests contain the pointer to the cache location that needs to be flushed as well as the pointer to the disk file to which the aggregated data should be flushed. This type of aggregated write helps reduce I/O bottlenecks.

The number of threads in each stage is tuned to maximize the utilization of the corresponding system resource. A generic sketch of such a stage is shown below.
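The sketch below shows one such stage in generic Java terms (a bounded event queue, a dedicated thread pool, and a hand-off to the next stage's queue); it is a conceptual illustration of the SEDA pattern used here, not the SOR-HDFS implementation, and all names are illustrative.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Generic sketch of one SEDA stage: a bounded event queue, a dedicated thread
// pool, and a hand-off to the next stage's queue (e.g., PRQ -> RRQ/IRQ).
public class Stage<E> {
    private final BlockingQueue<E> requests;
    private final ExecutorService workers;
    private final BlockingQueue<E> nextStage;   // null for the last stage

    public Stage(int queueCapacity, int threads, BlockingQueue<E> nextStage) {
        this.requests = new ArrayBlockingQueue<>(queueCapacity);
        this.workers = Executors.newFixedThreadPool(threads);
        this.nextStage = nextStage;
        for (int i = 0; i < threads; i++) {
            workers.submit(this::runWorker);
        }
    }

    // Admission control: a full queue blocks the producer of the previous stage.
    public void submit(E event) throws InterruptedException {
        requests.put(event);
    }

    private void runWorker() {
        try {
            while (true) {
                E event = requests.take();        // wait for work
                process(event);                   // stage-specific handling
                if (nextStage != null) nextStage.put(event);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    protected void process(E event) { /* e.g., checksum validation */ }
}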

Design of DFSClient: In the pipelined replication technique, the DFSClient sends data to the first DataNode, and packets are sent back to back over RDMA. This means that the DFSClient does not wait for the acknowledgment of the previous packet before it sends the next packet. To efficiently handle this type of pipelined data transfer, the SOR-HDFS design has a single RDMA Connection with multi-end-point support. The DFSClient can dynamically create an end-point with respect to a DataNode. The client side also maintains a pool of buffers (Java Direct Buffers) that are used in a Round-Robin manner to send data in a non-blocking fashion. In our previous design of RDMA-Enhanced HDFS, there was a single buffer associated with each end-point. Thus, even if there are data packets available in the DataQueue, the DFSClient has to wait until this buffer is free before it can send the next packet. In the SOR-HDFS design, we have minimized this wait time by managing a pool of buffers on the client side. The RDMADataStreamer can send packets back-to-back to keep the pipeline full and thus make efficient utilization of the network bandwidth. In this design, we have also decoupled the buffers from the end-points. Thus, any buffer can be used to send data to any of the end-points created by the DFSClient.

4.1.3 Overlapping among Different Phases

4.1.3.1 Data Read, Processing, Replication, and I/O

In SOR-HDFS, each stage of HDFS write is managed by a separate unit. Therefore, when the worker threads in the Packet Processing stage are busy processing one packet, the RDMA-based receiver thread in the Read stage is free to receive the subsequent packets of the same block. The receiver can also receive packets coming from any client in parallel using the single RDMA Connection object. Again, the worker threads can process subsequent packets while the replicator and I/O threads are busy replicating or flushing the packets received earlier. In this way, in addition to providing task-level parallelism, SOR-HDFS achieves overlapping among the various stages of the same block.

4.1.3.2 Communication and Data Write

The DFSClient writes data to the DataNodes in a pipelined manner. The client keeps sending data packets back to back to the first DataNode in the pipeline. The first DataNode also replicates the data in a pipelined manner to the subsequent DataNodes. Thus, even in the default architecture of both the socket-based and RDMA-based designs of HDFS, the Packet Processing, Replication, and I/O phases are overlapped with the data transfer from the client or upstream DataNodes. The proposed design also preserves this overlapping between communication and the other phases of HDFS write, because data is sent in a pipelined manner from the DFSClient using non-blocking send() and the DataNode side also provides a pool of buffers to hold the incoming data packets. Moreover, the use of RDMA-based communication makes the data transfer much faster than in the default socket-based design.

4.1.4 SOR-HDFS with Parallel Replication

Figure 4.2(b) shows the design of SOR-HDFS with parallel replication technique. The main difference between Figure 4.2(b) and Figure 4.2(a) is in the replication phase. In the parallel replication technique, the DFSClient is responsible for writing all the replicas in parallel. Usually, in MapReduce jobs, a DFSClient writes data to the local DataNode

first and then it is replicated to the subsequent DataNodes. Thus, in the common case, the DFSClient creates a single end-point and sends all the data through it. But in the parallel replication technique, the DFSClient establishes three (default replication factor is three) different end-points to write three replicas. In this replication scheme, overlapping is achieved among Read, Packet Processing and I/O phases in the DataNode side. As illustrated in Figure 4.2(b), after the RDMA-based receiver receives data in the Read Phase, the data is given to the worker threads in the Packet Processing phase. The worker threads process the packets and store them in the aggregation cache. Then the I/O threads are notified by the worker threads. The I/O threads then write the aggregated data to disk.

Though the Replication stage is absent on the DataNode side, packet replication is actually overlapped with the other phases, because, while the DataNode performs processing and I/O of a packet, the packet is being replicated by the DFSClient in parallel. Thus the parallel replication scheme also achieves overlapping among the Read, Packet Processing, Replication, and I/O phases.
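A rough sketch of the client side of this scheme is given below: the client holds one end-point per target DataNode and pushes the same block to all of them concurrently. The Endpoint interface and sendBlock() are stand-ins for the actual RDMA write path and are not part of the real API.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative client-driven parallel replication: one end-point per replica,
// all replicas written concurrently instead of through a pipeline.
class ParallelReplicationClient {
    interface Endpoint { void sendBlock(byte[] block) throws Exception; }

    // Default replication factor is three, so three sender tasks run at once.
    private final ExecutorService senders = Executors.newFixedThreadPool(3);

    // Returns only when every replica has been written, which is when the
    // write can be acknowledged to the application.
    void writeReplicas(byte[] block, List<Endpoint> endpoints) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        for (Endpoint ep : endpoints) {
            pending.add(senders.submit(() -> { ep.sendBlock(block); return null; }));
        }
        for (Future<?> f : pending) {
            f.get();   // surfaces any replication failure to the caller
        }
    }
}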

4.2 Experimental Results

In this section, we present the detailed performance evaluations of SOR-HDFS design

and compare the performance with that of the default architecture over various intercon-

nects, protocols and storage systems.

4.2.1 Experimental Setup

The experiments are performed on three different HPC clusters. They are:

OSU RI Cluster B: The cluster configurations are described in Chapter 3.2.

TACC Stampede: Each node on TACC Stampede [85] is a dual-socket system containing two octa-core Intel Sandy Bridge (E5-2680) processors, running at 2.70GHz. It has 32 GB of memory, a SE10P (B0-KNC) co-processor, and a Mellanox IB FDR MT4099 HCA

(56 Gbps data rate). The host processors are running CentOS release 6.3 (Final). Each node is equipped with a single 80 GB HDD and 16 GB RAM Disk.

SDSC Gordon: The cluster configurations are described in Chapter 3.2.

Figure 4.6: Parameter tuning for SOR-HDFS threads. (a) Network bandwidth for different numbers of processes; (b) tuning the number of worker threads in the Packet Processing stage.

4.2.2 Parameter Tuning of SOR-HDFS Threads

Selecting the Number of RDMA-based Receiver Threads: In our new design, we have used a single RDMA Connection that supports multiple end-points for different numbers of concurrent maps/clients. The DataNode can simultaneously receive data from multiple clients using this Connection object. Therefore, we have used a single receiver thread that receives data from different clients on different end-points. Since RDMA connection creation is more expensive than socket creation, the use of a single connection, i.e., a single receiver thread, minimizes the connection creation overhead.

Figure 4.7: Disk write throughput in different clusters for varying numbers of writer threads and record sizes. (a) OSU RI Cluster B; (b) SDSC Gordon; (c) TACC Stampede.

Selecting the Number of RDMA-based Replicator Threads: In the Replication stage of the proposed architecture, we use a pool of replicator threads in each DataNode. These threads replicate data to the downstream DataNodes. Thus, the number of replicator threads is related to the NIC bandwidth of the DataNodes. In order to find out what might be an optimal value of the number of replicator threads, we ran the OSU-Multi-Pair-Bandwidth

(inter-node) test over IB on different clusters. Figure 4.6(a) shows the IB bandwidth between a pair of nodes for 512KB (HDFS packet size in SOR-HDFS) message size on three

Table 4.1: Packet rate and communication time

                       IPoIB          SOR-HDFS
Communication Time     77 ms          37 ms
Packet Rate            360 pkt/sec    460 pkt/sec

different clusters for different number of concurrent processes per node. From this fig-

ure, we observe that four concurrent processes can utilize the network bandwidth most efficiently on OSU RI Cluster B and SDSC Gordon. That is why we have used four repli-

cator threads in our experiments on these clusters. We performed similar tests on TACC

Stampede and found that the optimal number of replicator threads is eight in this cluster.

Selecting the Number of I/O Threads: The I/O threads in the I/O stage of our design

flush the cached data to disk. Therefore, the number of I/O threads is determined by

the maximum disk bandwidth as well as the amount of aggregated data. In order to find out

the optimal number of I/O threads on different clusters, we ran write tests with IOzone for

varying number of concurrent writers and record sizes. In these tests, each writer writes a

64MB file to disk. Figure 4.7 shows the results of our IOzone tests on different clusters.

The Z-axis indicates the write throughput for a particular (number of writer threads, record size) tuple. Based on these results, for OSU RI Cluster B (HDD nodes), we have selected two

I/O threads and an aggregated data size of 2 MB for SOR-HDFS. The optimal number of

I/O threads on TACC Stampede is one, whereas, on SDSC Gordon we can use up to eight

I/O threads, because the storage in SDSC Gordon is SSD.

Selecting the Number of Worker Threads: In our design, the worker threads are respon- sible for processing of the received packets. These threads perform processing operations on the data packets received from the clients running on the local DataNode as well as those from the replicator threads of the other upstream DataNodes. Thus, the number of these

threads is closely related to the number of concurrent clients running on the DataNode and also the number of threads replicating to this DataNode. In our experiments, we tuned the number of worker threads while keeping the other thread-counts fixed (as determined from the above-mentioned experiments) and found that in all the three clusters, 16 is the optimal number. Figure 4.6(b) shows the results of our tuning.

Figure 4.8: Breakdown of the time for a 64MB block write (OSU RI Cluster B). (a) Sequential operations in the default architecture; (b) overlapped stages in SOR-HDFS; (c) impact of I/O aggregation on execution time.

4.2.3 Performance Analysis of SOR-HDFS

SOR-HDFS not only incorporates RDMA-based communication but also maximizes overlapping among different phases of HDFS write operations. In this section, we perform comprehensive profiling in order to analyze the performance of SOR-HDFS.

Overlapping Efficiency: Figure 4.8 depicts the duration of the overlapped stages for a

64MB HDFS block write compared to the sequential execution in the default architecture.

As demonstrated in the figure, the default architecture requires 323.2ms to complete the operations of various stages, whereas the time spent in SOR-HDFS is only 207.5ms. Thus,

SOR-HDFS achieves much higher pipelined efficiency via overlapping.

Figure 4.9: Performance analysis of SOR-HDFS (OSU RI Cluster B). (a) Impact of parallel replication on execution time per map; (b) impact of parallel replication on TestDFSIO execution time.

RDMA-based Communication: The overlapping efficiency of SOR-HDFS is increased by the RDMA-based communication. As shown in the previous chapter for HDFSoIB, RDMA can reduce the HDFS communication time by 30%. In our new design, we have enhanced the communication layer with non-blocking semantics that increase the utilization of the network bandwidth. Thus data comes to the DataNode at a much faster rate, enabling maximum overlapping among the various stages. As shown in Table 4.1, the communication time for a 64MB block is 77ms over IPoIB (32Gbps), whereas, for RDMA (32Gbps), it is

37ms. We obtained these numbers by designing two benchmarks that mimic the HDFS communication pattern, one over sockets and one over the underlying native library for RDMA communication. We also profiled the rate at which packets arrive at the DataNode in the HDFS layer and found that the packet rate for IPoIB (32Gbps) is 360 pkt/sec and for SOR-HDFS (32Gbps), it is 460 pkt/sec. Thus data packets arrive at a much faster rate in SOR-HDFS.

Impact of I/O Aggregation: I/O aggregation is a technique for reducing the disk contention among concurrent writers by writing large chunks of data instead of many smaller pieces. In [81], the authors discuss that writing data at the granularity of packets by concurrent clients introduces huge contention at the disk level and degrades the I/O bandwidth.

Thus aggregating the data packets into a larger chunk can enhance the I/O performance by reducing contention.

Figure 4.8(c) illustrates the benefits of I/O aggregation in SOR-HDFS. This figure shows that I/O aggregation can reduce the average execution time per map (this is a TestDF-

SIO job with a total of 16 maps, 4 map slots per node) by 10%.

Figure 4.10: Write time evaluation using the HDFS microbenchmark (SWL) on OSU RI Cluster B with 4 DataNodes. (a) HDD; (b) SSD.

4.2.4 Evaluation with Parallel Replication

Figure 4.9(a) shows the evaluation results of parallel replication with SOR-HDFS. From this figure we observe that parallel replication can reduce the average execution time per map compared to that of pipelined replication. Figure 4.9(b) shows the execution times of the TestDFSIO benchmark on OSU RI Cluster B. This test is performed on four SSD DataNodes and we varied the file size from 5 GB to 20 GB. This figure shows that parallel replication can enhance the performance of the TestDFSIO benchmark by up to 10% in terms of execution time. Though parallel replication can offer additional benefits for numerous HDFS applications, the pipelined replication technique is widely used in data center environments. Therefore, in the next section, we will present results of SOR-HDFS with pipelined replication support and compare them with those of other interconnects and protocols with the default architecture.

Figure 4.11: Write time evaluation using the HDFS microbenchmark (SWL). (a) SDSC Gordon with 8 DataNodes; (b) TACC Stampede with 16 DataNodes.

4.2.5 Evaluation using HDFS Microbenchmark

In this section, we have evaluated our design using the HDFS microbenchmark of Sequential Write Latency (SWL) [45]. This benchmark writes a file to HDFS and reports the time taken to complete the file write. Figure 4.10 shows the performance of our design on OSU RI Cluster B. On HDD DataNodes, SOR-HDFS reduces the latency of the SWL benchmark by 25% over IPoIB (32Gbps) and 12% compared to the previous best RDMA-based design of HDFS for 20 GB file size. On the SSD platform, SOR-HDFS achieves further performance improvement by reducing the latency by 34% and 18% over IPoIB (32Gbps) and HDFSoIB, respectively.

We have also performed microbenchmark level evaluations on SDSC Gordon and TACC

Stampede. On both of these clusters, we varied the data size from 8GB to 32GB. We used a cluster size of 8 on SDSC Gordon and 16 on TACC Stampede. On SDSC Gordon, our design improves the performance of SWL by 22% over IPoIB (32Gbps) and 12% over HDFSoIB (32Gbps). On TACC Stampede, our design shows a benefit of up to 57% over IPoIB (56Gbps) and 33% compared to HDFSoIB (56Gbps). The overlapping among different stages of data write as well as RDMA-based communication leads to this large performance benefit for SOR-HDFS.

Figure 4.12: Enhanced DFSIO throughput evaluation of Intel HiBench. (a) OSU RI Cluster B; (b) SDSC Gordon; (c) TACC Stampede.

4.2.6 Evaluation using Enhanced DFSIO of Intel HiBench

In this set of experiments, we evaluate our design with the Enhanced DFSIO benchmark of Intel HiBench on three different clusters. On OSU RI Cluster B, we perform the Enhanced DFSIO test on four SSD DataNodes and vary the file size from 8 to 32 GB. In these experiments, each writer (map) writes 1 GB of data. Thus for 32 GB data size, we have used

32 writers. As observed from Figure 4.12(a), SOR-HDFS increases the aggregated write throughput of Enhanced DFSIO by 37% over IPoIB (32Gbps) and 17% over HDFSoIB

(32Gbps) in OSU RI Cluster B.

We have performed Enhanced DFSIO test on SDSC Gordon and TACC Stampede also.

In both of these clusters, we varied the data size with varying cluster sizes. Figure 4.12(b) shows the aggregated write throughput for three different cluster sizes on SDSC Gordon. In this experiment, we vary the cluster size from 8 to 32 DataNodes and the data size is varied from 32 GB (on 8 nodes) to 128 GB (on 32 nodes). SOR-HDFS achieves an improvement

of up to 47% over IPoIB (32Gbps) and 20% over HDFSoIB (32Gbps) in aggregated write throughput.

Figure 4.12(c) shows the aggregated write throughput for three different cluster sizes on TACC Stampede. In this experiment, we vary the cluster size from 16 to 64 DataNodes and the data size is varied from 64 GB (on 16 nodes) to 256 GB (on 64 nodes). Each writer writes 1 GB data. Thus for 256 GB data we ran a total of 256 writers. In this experiment,

SOR-HDFS achieves an improvement of up to 60% over IPoIB (56Gbps) and 30% over

HDFSoIB (56Gbps) in aggregated write throughput. These gains are achieved because of the SEDA-based architecture, which is particularly designed to maximize throughput.

4.2.7 Evaluation using HBase

The Hadoop database, HBase, uses HDFS as the underlying file system. Therefore, HDFS performance also influences the performance of HBase operations. In our new design, we have introduced overlapping among different stages of the HDFS write operation. Therefore, we evaluate the performance of the HBase Put operation by running HBase on top of SOR-

HDFS. For this experiment, we have used the YCSB benchmark. The most appropriate case for evaluating HBase on top of HDFS is the bulk Put operation. With a large amount of data insertion, HBase eventually flushes its in-memory data to HDFS. HBase has a MemStore which holds the newly inserted data in memory. The MemStore triggers a flush operation when it reaches its size limit (default is 64 MB). During the flush, HBase writes the MemStore to HDFS as an HFile instance. HBase also stores the HLogs in HDFS. Therefore, we load the data from workload A of YCSB on different clusters. The load operation of YCSB inserts the specified number of records into HBase using the Put operation.

On OSU RI Cluster B, we run the YCSB load experiment on four DataNodes. We vary the number of records from 1M to 4M and the size of each record is 1KB. Each record has

Figure 4.13: Evaluation of HBase Put throughput. (a) OSU RI Cluster B; (b) SDSC Gordon; (c) TACC Stampede.

10 fields and the size of each field is 100 Bytes. On OSU RI Cluster B, our design achieves an improvement of 23% over IPoIB (32Gbps) and 9% over HDFSoIB (32Gbps) in HBase

Put throughput. Figure 4.13(a) shows the results of our evaluations.

We perform tests with the YCSB benchmark on SDSC Gordon and TACC Stampede.

On both these clusters, we vary the number of records from 8M to 32M. On SDSC Gor- don, we use 16 DataNodes and on TACC Stampede, the test is run on 32 DataNodes. As shown in Figure 4.13(b), our design achieves a performance gain of up to 25% over IPoIB

(32Gbps) and 14% over HDFSoIB (32Gbps) on SDSC Gordon. The design can also improve the performance of the HBase Put operation by up to 29% over HDFSoIB (56Gbps) on

TACC Stampede, as depicted in Figure 4.13(c).

Since both the data and the logs are flushed from HBase to HDFS, the overlapping between different stages of SOR-HDFS as well as the RDMA-based data transfer helps the HBase Put operation complete faster. Also, since multiple threads (data and logs) from HBase invoke HDFS write on the same DataNode, the I/O aggregation technique and careful tuning of the number of threads in different stages boost HBase Put performance.

4.3 Related Work

Design and optimization of the Hadoop Distributed File System has been an active topic in both academia and industry. Two categories of related work are summarized in this section.

Optimizations in default HDFS architecture: The default HDFS architecture has al- ready been analyzed for its drawbacks and limitations in many prior works. For instance,

Shvachko [82] has studied the correlation between the HDFS design goals and the feasibility of achieving them with the current system architecture. This work has revealed the factors that limit growth and provided several conclusions that highlight the dependencies between the components in HDFS that cause them. By analyzing the performance of HDFS, Shafer et al. [81] identified three critical issues: architectural bottlenecks that cause delays in scheduling new tasks, portability limitations posed by the Java implementation, and assumptions about storage management. Their work offers some suggestions for potential improvements to the HDFS architecture that can help achieve a better balance between performance and portability. Other research works dealt with improving the performance of HDFS by exploring the interoperability between HDFS and existing POSIX-compliant parallel file systems such as PVFS [92], Ceph [58], GPFS [33], GFarm [59], etc. These studies either provide interfaces that the Hadoop environment can use to replace HDFS with the corresponding file system, or propose HDFS-specific optimizations in their file systems that make them compatible for use in the Hadoop ecosystem. However, none of these works attempts to optimize the internal HDFS design to exploit maximum overlapping among different phases of write operations. In this work, we propose a SEDA-based approach to redesign the HDFS architecture and we have shown promising results.

Optimizing Hadoop components by RDMA: In HPC, RDMA has been widely used to speed up the communication performance of parallel file systems. Wu et al. [105] designed PVFS over InfiniBand to improve its performance. Ouyang et al. [62] designed an RDMA-based job migration framework to improve the recovery of large-sized jobs. From these studies, we have seen that RDMA can benefit the traditional parallel file systems. One of our earlier works [90] examined the impact of high-speed interconnects such as 10 GigE and InfiniBand

(IB), using protocols such as TCP, IP-over-IB (IPoIB), and Sockets Direct Protocol (SDP), on HDFS performance. Our findings also revealed that these faster interconnects yield a larger improvement in HDFS performance. We further proposed the RDMA-Enhanced

HDFS design in Chapter 3 to improve HDFS performance. But this design kept the default architecture intact. Some recent studies [39, 57, 72, 103] have also shown that other

Hadoop components, like MapReduce, HBase, RPC, etc., can be improved significantly by leveraging RDMA.

4.4 Summary

In this work, we proposed SOR-HDFS, a new HDFS design that exploits maximum overlapping among different phases of the HDFS write operation by leveraging the event-driven principles of the SEDA architecture. In this design, we enhance HDFS write performance by overlapping different stages of data transfer and I/O. Performance evaluations show that the new design improves the write throughput of the TestDFSIO benchmark by up to 64% over IPoIB and 30% over the previous best RDMA-enhanced design of HDFS

(HDFSoIB). Our design can also achieve a performance gain of up to 64% over IPoIB and

30% over HDFSoIB for the Enhanced DFSIO benchmark of Intel HiBench. SOR-HDFS also improves HBase Put operation throughput by up to 53% over IPoIB and 29% compared to HDFSoIB.

Chapter 5: Hybrid HDFS with Heterogeneous Storage and Advanced Data Placement Policies

For data-intensive applications, the biggest bottleneck of HDFS lies in the large number of I/O operations. Consequently, in-memory I/O for Hadoop has become more and more popular in recent times [35]. But memory is a scarce resource. As a result, for data-intensive applications, the size of the hot data itself may exceed the total amount of available memory in the cluster [37]. Additionally, to store data in memory, an application must dedicate sufficient memory, which may not always be feasible or desirable. Also, little has been discussed in the literature about how to ensure persistence and reliability of the in-memory data. HPC clusters are usually equipped with node-local heterogeneous storage devices like RAM Disk, SSD, HDD, etc., as well as shared parallel file systems like Lustre. But HDFS cannot efficiently utilize the available storage devices. The limitation comes from the existing data placement policies and their ignorance of data usage patterns. Typically, heterogeneous storage media can be utilized in one of two ways: hierarchical or flat. However, most of the research on HDFS is based on the hierarchical deployment [68, 69]. The performance potential of a hybrid (hierarchical with flat) architecture is yet to be investigated in the literature. Moreover, due to the tri-replicated data blocks,

HDFS requires 3x the amount of local storage space as the data size (x), which makes the deployment of HDFS challenging on HPC platforms.

This thesis addresses these problems with a hybrid HDFS design which we call Triple-H (A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture). In order to reduce the I/O bottlenecks of HDFS, Triple-H takes advantage of RAM Disk and SSD-based data buffering and caching. Triple-H incorporates advanced data placement policies to efficiently utilize the available storage media in the presence of heterogeneous storage systems on HPC clusters. Triple-H further reduces the local storage requirements through an integrated design with Lustre. The following sections describe these proposed solutions.

5.1 Proposed Architecture and Design

In this section, we propose the architecture and design details of Triple-H.

5.1.1 Design Considerations

Buffering/caching in memory is expensive. In-memory caching may not always improve performance if the working set of the application is larger than the total available memory on the cluster. Also, providing persistence to the memory-resident data is challenging. Moreover, using additional memory for caching may affect the performance if the application itself requires a large amount of memory. In our design, although we use RAM Disk as a primary-level buffer-cache, we enlarge the size of the buffer-cache by using SSD. We also ensure fault tolerance of the data in RAM Disk through SSD-based staging. However, using SSD for storing all the data in a long-running Hadoop cluster is also expensive [37]. Therefore, HDDs are used for storing the data that are less critical to job performance. Furthermore, as discussed in Chapter 1, deploying HDFS on HPC systems is difficult as it requires a huge amount of local storage space. Therefore, in this study, we propose an integrated design of HDFS with Lustre that can significantly reduce the local storage requirement on HPC systems.

Figure 5.1: Architecture and design of Triple-H. (a) Deployment; (b) architecture of Triple-H; (c) Lustre-based Storage Module.

5.1.2 Architecture

We assume that each node on the cluster is equipped with different types of storage devices like RAM Disk, SSD, and HDD; the nodes are homogeneous in storage characteristics. Figure 5.1(a) depicts the proposed deployment for Triple-H. In our design, each

HDFS DataNode runs a RAM Disk and SSD-based buffer-cache on top of the disk-based storage system. Also, our design uses the parallel file system (i.e. Lustre) installation in

HPC clusters for storing data from Hadoop applications. Triple-H can run in two different modes: Default (HHH) and Lustre-Integrated (HHH-L). If Lustre is available in the environment, the Default mode uses it as the storage for cold data; otherwise it runs as a traditional HDFS deployment. In both of these modes, data placement follows a hybrid storage usage pattern, whereas data caching goes through the storage hierarchy (Lustre to HDD, HDD to SSD, and SSD to RAM Disk). The cold data is also evicted along the opposite direction of the hierarchy.

Figure 5.1(b) depicts the architecture of our design. The major components of Triple-H are:

Placement Policy Selector: The Placement Policy Selector resides on the DFSClient side. It

assigns weights to different files according to the policies proposed in Section 5.1.3.

Data Placement Policies: The data placement policies define the rules to efficiently uti-

lize the different types of storage available in HPC systems. We discuss the policies in

Section 5.1.3 in detail.

Data Placement Engine: The Data Placement Engine chooses the appropriate storage

volume based on the weights generated by the DFSClient as well as the availability of

the storage capacity and writes the data in the target volume. It can also detect data that

are not performance-critical (e.g. data coming to the DataNode as a result of a balancing

operation) and stores them to HDD or Lustre. To integrate HDFS with Lustre, there is a

separate module in our design that is shown in Figure 5.1(c).

5.1.3 Data Placement Policies

In this section, we propose the data placement and replication policies for Triple-H. We

design the placement policies such that the latency for replicated data placement can be

hidden from the clients for performance-critical applications without sacrificing fault tol-

erance and reliability. We also consider the limitation of local disk space per node on HPC

clusters while designing the policies. Thus, we propose two kinds of placement policies:

5.1.3.1 Performance-sensitive

Figure 5.2 demonstrates the Performance-sensitive data placement policies.

Greedy Placement: In this policy, the incoming data is placed in the high-performance

storage layer in a greedy manner by the DataNodes. So, as long as there is space available in

the RAM Disk-based primary buffer-cache, the DataNode writes all the data there. Then it

switches to the next level of storage based on SSD, and so on. Let us assume that each file $f_i$ consists of two blocks $b_{i1}$ and $b_{i2}$, where $i$ is the file ID. With this policy, the Data Placement Engine tries to greedily place the files $f_1$, $f_2$, $f_3$, and so on in the high-performance storage layer. But it is possible that all the replicas of a particular file are placed into the RAM

Disk in this policy. Therefore, we propose to follow a hybrid replication scheme as shown in Figure 5.3(a). In the hybrid replication scheme, the first replica of a file is placed in the RAM Disk-based buffer-cache in a greedy manner while the other two are stored in the

SSD-based secondary buffer-cache. Alternatively, two replicas can be placed in RAM Disk and the other one in SSD. Since Triple-H asynchronously flushes the replica in RAM Disk to SSD, this scheme can ensure a similar level of fault tolerance as default HDFS.

Figure 5.2: Performance-sensitive data placement policies: Greedy placement and Load-balanced placement of blocks $b_{ij}$ ($i$ = file ID, $j$ = block ID) across RAM Disk, SSD, HDD, and Lustre.

Load-balanced Placement: This policy spreads the amount of I/O across multiple storage

devices. The Placement Policy Selector randomly assigns weights to different files and

the Data Placement Engine stores the files accordingly. Weight assignment is done on a per-file basis and all blocks belonging to a file get the same weight. As shown in Figure 5.2,

this policy may store file $f_1$ to RAM Disk, $f_2$ to HDD, $f_3$ to SSD, and so on. In our current implementation, we store all the data from the DFSClient to the high-performance storage layers consisting of RAM Disk and SSD. So the weight assignment is done according to the following formula:

$$w = \begin{cases} 1 & \text{place to RAM Disk} \\ 0 & \text{place to SSD} \end{cases}$$

This policy also follows the hybrid replication scheme for the files destined to RAM

Disk (w = 1).

In order to implement the placement policies mentioned above, we modify the Java

class VolumeChoosingPolicy in the Apache Hadoop distribution such that it can incorporate the policies discussed here. If there are multiple devices of the same type, the target volume is selected in a round-robin manner within that storage layer in our design.
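A simplified sketch of such weight-driven volume selection is shown below. It is not the actual modified VolumeChoosingPolicy; the StorageTier enumeration, the Volume wrapper, and the way the weight reaches the DataNode are assumptions made purely for illustration.

import java.io.File;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative weight-driven volume selection: weight 1 maps to the RAM Disk
// tier and weight 0 to the SSD tier; within a tier, volumes are picked in a
// round-robin manner, and the selection falls through the storage hierarchy
// when the requested tier has no space (on-demand weight adjustment).
class HybridVolumeChooser {
    enum StorageTier { RAM_DISK, SSD, HDD }

    static class Volume {
        final StorageTier tier;
        final String mountPoint;
        Volume(StorageTier tier, String mountPoint) { this.tier = tier; this.mountPoint = mountPoint; }
        long availableBytes() { return new File(mountPoint).getUsableSpace(); }
    }

    private final AtomicInteger rrIndex = new AtomicInteger();

    Volume chooseVolume(List<Volume> volumes, int weight, long blockSize) {
        StorageTier wanted = (weight == 1) ? StorageTier.RAM_DISK : StorageTier.SSD;
        Volume chosen = pickRoundRobin(volumes, wanted, blockSize);
        if (chosen != null) return chosen;
        for (StorageTier tier : StorageTier.values()) {      // fall back level by level
            chosen = pickRoundRobin(volumes, tier, blockSize);
            if (chosen != null) return chosen;
        }
        throw new IllegalStateException("no volume has space for this block");
    }

    private Volume pickRoundRobin(List<Volume> volumes, StorageTier tier, long blockSize) {
        int start = rrIndex.getAndIncrement();
        for (int i = 0; i < volumes.size(); i++) {
            Volume v = volumes.get(Math.floorMod(start + i, volumes.size()));
            if (v.tier == tier && v.availableBytes() >= blockSize) return v;
        }
        return null;
    }
}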

Figure 5.3: Replication during data placement in Triple-H. (a) Performance-sensitive placement; (b) Storage-sensitive placement.

5.1.3.2 Storage-sensitive

This placement policy is suitable for HPC systems that usually are equipped with large installations of parallel file systems like Lustre and limited storage space in the compute nodes. With this policy, one replica is stored in local storage (RAM Disk or SSD) and one copy is written to Lustre through the Parallel File system-based Storage Module discussed in Section 5.1.4. Figure 5.3(b) shows the replication scheme for this placement policy.

Lustre provides reliability of the stored files and the overall storage saving is one-third compared to HDFS.

5.1.4 Design Details

In this section, we discuss our design features and implementation.

Buffering and Caching through Hybrid Buffer-Cache: In our design, there is a RAM

Disk-based Buffer-Cache in each DataNode. We buffer (cache) the data written to (read

from) HDFS in RAM Disk which is the primary buffer-cache in Triple-H. The size of the

buffer-cache is enlarged by using SSD. Triple-H hides the cost of disk access during HDFS

Write and Read by placing the data in this buffer-cache.

Figure 5.4: Eviction/Promotion Manager.

Storage Volume Selection and Data Placement through On-demand Weight Adjustment: The Data Placement Engine calculates the available space in the storage layer as

requested by the Placement Policy Selector. If there is sufficient space available for the

file, it is stored in the requested medium. But if the available space is not enough, then

the On-demand Weight Adjustment unit modifies the weight of that replica and places it in

the next level of the storage hierarchy. However, for each replica, it remembers the weight

assigned by the Placement Policy Selector so that whenever there is space available in the

requested level, the file can be moved there. The On-demand Weight Adjustment unit as-

signs a weight $w_i$ to each replica during initial placement. Over time, the priority, hence the weight of the replica, changes based on its access pattern. The weight ($w_i$) is updated

according to the following equation.

$$w_{i+1} = w_i + f(x, y, z), \quad i = 0, 1, 2, 3, \ldots$$

Here, x = access count, y = access time, and z = available storage space. f(x, y, z) represents the weight factor which is estimated by the On-demand Weight Adjustment unit.

Input: access count x, access time y, available space z
Output: an integer representing the decision
  currTime ← System.currentTime()
  interval ← currTime − y
  if (interval ≥ threshold_time OR x < threshold_count) AND z ≤ threshold_space then
      return −1
  else if interval < threshold_time AND x ≥ threshold_count then
      return +1
  end
  return 0
Algorithm 1: Pseudo-code for calculating the weight factor
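For concreteness, a direct Java rendering of Algorithm 1 could look like the sketch below. The field names mirror the user-defined parameters threshold_time, threshold_count, and threshold_space described below; the class itself is illustrative and not part of the actual implementation.

// Illustrative Java rendering of Algorithm 1 (weight factor calculation);
// the thresholds are the user-defined parameters described in the text.
class WeightFactorCalculator {
    private final long thresholdTimeMs;     // threshold_time
    private final long thresholdCount;      // threshold_count
    private final double thresholdSpace;    // threshold_space (available-space bound)

    WeightFactorCalculator(long thresholdTimeMs, long thresholdCount, double thresholdSpace) {
        this.thresholdTimeMs = thresholdTimeMs;
        this.thresholdCount = thresholdCount;
        this.thresholdSpace = thresholdSpace;
    }

    // x = access count, y = last access time (ms), z = available storage space
    int weightFactor(long x, long y, double z) {
        long interval = System.currentTimeMillis() - y;
        if ((interval >= thresholdTimeMs || x < thresholdCount) && z <= thresholdSpace) {
            return -1;   // cold replica: its weight decreases (candidate for eviction)
        } else if (interval < thresholdTimeMs && x >= thresholdCount) {
            return +1;   // hot replica: its weight increases (candidate for promotion)
        }
        return 0;        // keep the replica at its current level
    }
}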

Whenever a block is accessed by a reader, the accessTime and accessCount of

the corresponding replica are updated. We implement this by extending the ReplicaInfo

class of the Apache Hadoop distribution. The weight factor f(x, y, z) of a file is calculated according to Algorithm 1. In Algorithm 1, threshold_time, threshold_count, and threshold_space

are user-defined parameters. Based on f(x, y, z), the weight of each replica is updated pe-

riodically.

For Hadoop clusters with job histories, the On-demand Weight Adjustment unit can

adjust the file weights based on how performance-critical they are.

Persistence of RAM Disk data through SSD-based Staging Unit: Data stored in RAM

Disk cannot sustain power/node failures. Therefore, in our design, we asynchronously flush all data placed into the RAM Disk to an SSD-based staging layer. The staging is implemented via a pool of background threads that receive the data from the BlockReceiver and flush it to SSD by creating a Java I/O stream. In default HDFS, the writer process returns after data is placed in the OS buffer of the DataNode and does not wait till it is synced to persistent storage. Therefore, we argue that our design can provide a similar level of fault tolerance as the default architecture. If SSD is not available, data is flushed to HDD asynchronously.
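One possible structure for this asynchronous staging is sketched below: each block buffered in RAM Disk is handed to a small pool of background threads that writes a copy into an SSD staging directory. The directory path, the byte-array hand-off from the BlockReceiver, and the error handling are all simplifying assumptions for illustration.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative asynchronous SSD staging: the write path returns as soon as the
// data sits in RAM Disk, while background threads persist a copy to SSD.
class SsdStagingUnit {
    private final ExecutorService stagingThreads = Executors.newFixedThreadPool(2);
    private final String ssdStagingDir;   // e.g. "/ssd/hdfs-staging" (illustrative)

    SsdStagingUnit(String ssdStagingDir) {
        this.ssdStagingDir = ssdStagingDir;
    }

    // Called after a block has been placed into the RAM Disk buffer-cache.
    void stageAsync(String blockName, byte[] data) {
        stagingThreads.submit(() -> {
            try (FileOutputStream out = new FileOutputStream(ssdStagingDir + "/" + blockName)) {
                out.write(data);
                out.getFD().sync();   // make the SSD copy durable
            } catch (IOException e) {
                // A real design would report the failure so that the RAM Disk
                // replica is not considered persisted.
            }
            return null;
        });
    }
}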

Data Movement through Eviction/Promotion Manager: This module, as shown in Figure 5.4, is responsible for evicting the cold data and making space for hot data in the buffer-cache. It also promotes the hot data from SSD/HDD/Lustre to RAM Disk/SSD/HDD. In each DataNode, we maintain a list of blockIds. There is a pool of threads running in the DataNode that periodically scan the list to identify the access patterns of the blocks. The decision of eviction/promotion, $D$, is made as follows:

$$D = \begin{cases} \text{promote()} & w_{i+1} > w_i \\ \text{evict()} & w_{i+1} < w_i \\ \text{keep()} & \text{otherwise} \end{cases}$$

Based on the outcome of D, the Eviction/Promotion Manager moves the block to the higher level of storage (promote()) or evicts it to the lower level (evict()) or keeps it in the same level (keep()). Since the DFSClient reads each file sequentially, blocks of the same

file are moved during the same window of data movement. After a replica is moved, its access information is reset.
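The periodic scan performed by this module could be organized as in the sketch below. BlockMeta, the WeightFunction interface (standing in for the weight-factor calculation above), and the move routines are placeholders for the DataNode's internal block metadata and data-movement code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative Eviction/Promotion Manager: a scheduled scan updates each
// replica's weight and moves the block up or down the storage hierarchy.
class EvictionPromotionManager {
    interface WeightFunction { int delta(long accessCount, long lastAccessMs, double availableSpace); }

    static class BlockMeta { long blockId; int weight; long accessCount; long lastAccessMs; }

    private final Map<Long, BlockMeta> blocks = new ConcurrentHashMap<>();
    private final WeightFunction weightFunction;
    private final ScheduledExecutorService scanner = Executors.newScheduledThreadPool(1);

    EvictionPromotionManager(WeightFunction weightFunction) {
        this.weightFunction = weightFunction;
    }

    void start(long periodSeconds) {
        scanner.scheduleAtFixedRate(this::scan, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    private void scan() {
        for (BlockMeta b : blocks.values()) {
            int delta = weightFunction.delta(b.accessCount, b.lastAccessMs, availableSpaceFraction());
            if (delta > 0)      promote(b);     // move one level up, e.g. HDD -> SSD
            else if (delta < 0) evict(b);       // move one level down, e.g. SSD -> HDD
            b.weight += delta;
            if (delta != 0) b.accessCount = 0;  // reset access information after a move
        }
    }

    private double availableSpaceFraction() { return 0.5; /* placeholder */ }
    private void promote(BlockMeta b) { /* copy the block to the higher storage level */ }
    private void evict(BlockMeta b)   { /* copy the block to the lower storage level */ }
}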

Storage Efficiency via integrated Parallel File System-based Storage: We integrate

HDFS with Lustre in our design to operate in the Lustre-Integrated mode and follow the

Storage-sensitive placement policy for this. Figure 5.1(c) depicts the design of our Lustre

integration module. In this design, we use a Lustre client from the BlockReceiver

and place one copy of data in HDFS and the other in Lustre. The Lustre client and HDFS

BlockReceiver work independently in separate threads but the HDFS Write operation returns to the DFSClient only after both the writes have completed. Thus HDFS holds a single replica of the file in the local storage and the integration with Lustre offers reliability and fault tolerance. During HDFS Read, the file can be read from the local storage or

Lustre or both. In this way, the contention on the shared file system can be minimized.
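A simplified sketch of this dual-write path is shown below, treating the Lustre client simply as a POSIX write to a directory on a Lustre mount. The directory paths, the byte-array interface, and the two-thread structure are assumptions for illustration; the real design couples this logic with the HDFS BlockReceiver.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative HHH-L write path: one copy goes to local storage and one to a
// Lustre mount point; the call returns only after both copies have completed.
class LustreIntegratedWriter {
    private final ExecutorService writers = Executors.newFixedThreadPool(2);
    private final String localDir;    // e.g. "/ramdisk/hdfs/current" (illustrative)
    private final String lustreDir;   // e.g. "/lustre/hadoop/blocks" (illustrative)

    LustreIntegratedWriter(String localDir, String lustreDir) {
        this.localDir = localDir;
        this.lustreDir = lustreDir;
    }

    void writeBlock(String blockName, byte[] data) throws Exception {
        Future<?> local  = writers.submit(() -> { writeTo(localDir  + "/" + blockName, data); return null; });
        Future<?> lustre = writers.submit(() -> { writeTo(lustreDir + "/" + blockName, data); return null; });
        local.get();    // the HDFS write acknowledges the client only after
        lustre.get();   // both the local and the Lustre copies are done
    }

    private void writeTo(String path, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path)) {
            out.write(data);
        }
    }
}

Running the two writes in independent threads but joining on both before acknowledging the client is what lets HHH-L keep a single local replica while still providing Lustre-backed reliability.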

We integrate Triple-H with SOR-HDFS which has been presented in Chapter 4. So

Triple-H supports RDMA-based communication and SEDA [104] for HDFS write and replication.

5.2 Experimental Results

In this section, we evaluate the Triple-H design and compare its performance with those of HDFS and Lustre for various Hadoop workloads and applications. The experiments are performed on three different clusters: OSU RI Cluster B, SDSC Gordon, and

TACC Stampede. The cluster configurations are described in Chapter 3.2 and Chapter 4.2.

In our evaluations, HDFS block size and Lustre stripe size are set to 128MB. We use threshold_time = 10 mins, threshold_count = no. of concurrent maps, and threshold_space = 70% for RAM Disk, 40% for SSD, and 80% for HDD.

5.2.1 Performance Analysis of Triple-H

In this section, we analyze the performance of Triple-H using the TestDFSIO bench-

mark. First, we compare the data placement policies proposed in Section 5.1.3. For this, we

use four nodes on OSU RI Cluster B and run tests with four concurrent maps on top of RAM

Disk, SSD, and HDD. Even though RAM Disk does not guarantee tolerance to node/power

failure for default HDFS, we do this to ensure fairness in our comparison of the placement policies against the default round-robin one. We evaluate the Performance-sensitive placement policies with HHH; whereas to evaluate the Storage-sensitive policy, we use HHH-L. As shown in Figure 5.5(a), our data placement policies perform up to 92% better than the default round-robin policy in HDFS for all the data sizes, and Load-balanced offers higher throughput than Greedy. Another observation is that the throughput of the Performance-sensitive policies is much higher than that of the Storage-sensitive policy. This is because the Performance-sensitive policies store all the replicas in the high-performance storage layer. But the Storage-sensitive policy has to access Lustre over IPoIB

(32Gbps); whereas, in HHH, network transmission for data replication goes over RDMA.

Figure 5.5(b) demonstrates the performance of TestDFSIO read over the data generated using different placement policies. As observed from the figure, the read throughput for both of the Performance-sensitive placement policies is much higher (up to 5x) than that of default HDFS. The read throughput of HDFS is higher than that of Storage-sensitive data placement for the 30 GB file size. This is because HDFS has better data locality compared to HHH-L due to the higher replication factor. As the data size increases to 50 GB, the I/O overhead increases and the read throughput for Storage-sensitive data placement becomes higher than that of HDFS.

Figure 5.5: Comparison among different data placement policies on OSU RI Cluster B. (a) Write performance; (b) read performance; (c) HDFS vs. Triple-H.

Figure 5.5(c) shows the comparison of our proposed Performance-sensitive policies

with those in default HDFS (Hadoop 2.6.0). These experiments are performed on eight

nodes on OSU RI Cluster B with the TestDFSIO write test. The Hot (default) policy in de-

fault HDFS uses the available storage devices in a round-robin manner. The All SSD pol-

icy places all the replicas on SSD, whereas, the Lazy Persist policy places the blocks with

replication factor of one to RAM Disk and others to HDD. As observed from the figure, our

policies perform better than all the three policies of default HDFS for performance-critical

data. Compared to the All SSD policy (best for default HDFS), Greedy and Load-balanced increase the throughput by up to 1.9x and 2x, respectively.

We also compare the performance of Greedy policy with that of Load-balanced using the TestDFSIO write benchmark for 30 GB data size on four nodes of OSU RI Cluster

B and TACC Stampede for different number of concurrent maps per node. As observed from Figure 5.6(a), for two concurrent maps, Greedy performs better than Load-balanced.

As the concurrency increases, Load-balanced outperforms Greedy. This is because, on

OSU RI Cluster B, the usable size of RAM Disk is small, 8GB. For Greedy policy, the contention in the storage layer increases as the concurrency increases and Load-balanced starts performing better. On the other hand, there is no SSD on the compute nodes on TACC

Stampede. Thus, Greedy policy shows higher throughput than that of Load-balanced as shown in Figure 5.6(b). In our subsequent experiments, we use Load-balanced policy on

OSU RI Cluster B and Greedy on TACC Stampede. For similar reasons, we use Greedy on

SDSC Gordon. In order to demonstrate the benefits of our proposed policies over different storage policies of default HDFS, we use the Hot policy on OSU RI Cluster B, All SSD on

SDSC Gordon, and Lazy Persist on TACC Stampede for default HDFS.

Figure 5.6: Comparison between the Performance-sensitive data placement policies for TestDFSIO write with different numbers of concurrent maps. (a) OSU RI Cluster B; (b) TACC Stampede.

Triple-H supports RDMA-based communication. Therefore, we compare the performance of Triple-H with that of SOR-HDFS over both HDD and SSD. As observed from Table 5.1, the write throughput of HHH for 30 GB data size is 16x and 1.6x higher than that of SOR-HDFS over HDD and SSD, respectively. HHH also increases the read throughput by 9x compared to that over HDD and 1.6x over SSD.

Table 5.1: Read and write performance comparison in different modes of Triple-H (OSU RI Cluster B)

                  Default                                        Lustre-Integrated
Operation         SOR-HDFS (HDD)   SOR-HDFS (SSD)   HHH          Lustre    HHH-L
Read (MBps)       53.3             318.08           492.17       152.47    238.3
Write (MBps)      10.3             83.2             129.43       62.29     55.7

The write throughput of HHH-L is slightly worse than that of MapReduce over Lustre.

This is because, while running MapReduce over Lustre, we write only one copy of data to

Lustre. But HHH-L writes one copy to local storage and one to Lustre. On the other hand,

HHH-L improves the read throughput by 1.5x over Lustre. The reason behind this is that

HHH-L has higher data locality than that of MapReduce running over Lustre as HHH-L places one replica to local storage of the DataNodes.

Our design has less than 5% overhead for staging. Data caching done by the Evic-

tion/Promotion Manager shows good benefit as is evident from the fact that during the first

run of TestDFSIO read of 10 GB file on OSU RI Cluster B (we place the data to SSD and

HDD following the default round-robin placement policy, threshold_time = 1 min), we get a throughput of 988.5 MBps. On the second run, the read throughput increases to 1063.4

MBps as some of the blocks move to SSD from HDD.

5.2.2 Evaluation with Triple-H Default Mode

In this section, we evaluate HHH-Default.

5.2.2.1 TestDFSIO

Figure 5.7(a) demonstrates the results of TestDFSIO write evaluations on TACC Stampede. Here we vary the data size from 40 GB on 8 DataNodes to 160 GB on 32 DataNodes.

The figure shows that, our design can improve the write throughput by 7x over default

HDFS. The benefit comes from the efficient data placement to the buffer-cache via our proposed policies as well as RDMA-based write and replication. We also evaluate TestDF-

SIO read performance and the results are shown in Figure 5.7(b). Reduction of disk I/O in

Triple-H increases the read throughput by 2x here.

Figure 5.7: Evaluation of TestDFSIO on TACC Stampede (Default mode). (a) Write throughput; (b) read throughput.

Figure 5.8: Performance comparison with data generation benchmarks (Default mode). (a) OSU RI Cluster B: RandomTextWriter; (b) SDSC Gordon: TeraGen; (c) TACC Stampede: RandomWriter.

5.2.2.2 Data Generation Benchmarks

In this section, we discuss the performance of Triple-H using data generation benchmarks like TeraGen, RandomTextWriter, and RandomWriter. For RandomTextWriter, we use eight nodes on OSU RI Cluster B. Triple-H reduces the execution time of this benchmark by 48% for 60 GB data size over the default architecture (Figure 5.8(a)). The performance gain for TeraGen on 32 nodes on SDSC Gordon, as shown in Figure 5.8(b), is up to 42% with Triple-H. We also perform similar experiments on TACC Stampede with

RandomWriter, as shown in Figure 5.8(c). Here, we vary the data size with the cluster size.

As observed from the figure, for 120 GB data generation on a 32-node cluster, Triple-H achieves a benefit of 3x over HDFS. The presence of the buffer-cache along with the efficient placement policies during HDFS write leads to significant performance benefits for the I/O-intensive data generation benchmarks.

5.2.3 Evaluation with Triple-H Lustre-Integrated Mode

In this section, we evaluate HHH-L.

5.2.3.1 TestDFSIO

Figure 5.9(a) demonstrates the results of our TestDFSIO write evaluations on eight

DataNodes on SDSC Gordon. The figure shows that, the write throughput of our design

Figure 5.9: Evaluation of TestDFSIO (Lustre-Integrated mode). (a) TestDFSIO write (SDSC Gordon); (b) TestDFSIO read (SDSC Gordon).

is up to 7x better than default HDFS, but MapReduce over Lustre shows slightly better throughput than HHH-L. The benefit of our design over default HDFS comes from the reduced I/O due to fewer replicas. On the other hand, Lustre shows higher write throughput as the amount of I/O is half of that in Triple-H when MapReduce runs over Lustre. We also measure the DFS space used in each DataNode and find that the total storage used by HDFS is 118.11GB for writing 40GB (40,000MB) data; whereas in HHH-L, it is

79.2GB. Thus, our design saves 33% storage space over HDFS.

We also evaluate the read performance of HHH-L using the TestDFSIO read benchmark. As observed from Figure 5.9(b), our design increases the read throughput by 15% over Lustre. However, the read throughput of default HDFS is higher than both Lustre and HHH-L as the default architecture maximizes data locality via replication. Nevertheless, latency-sensitive workloads with a similar number of HDFS Read and Write operations can benefit from our design over default HDFS. On the other hand, read-intensive workloads can show better performance over Lustre using our design.

Figure 5.10: Evaluation of Sort (Lustre-Integrated mode). (a) Sort (OSU RI Cluster B); (b) Sort (SDSC Gordon); (c) Sort (TACC Stampede).

5.2.3.2 Sort

We evaluate our HHH-L design with the Sort benchmark. On OSU RI Cluster B, we run Sort on eight DataNodes. As observed from Figure 5.10(a), HHH-L improves the Sort performance by up to 40% over HDFS and 46% over Lustre.

Figure 5.10(b) shows the performance of our evaluations on SDSC Gordon. These experiments are performed on 16 DataNodes while the data size is varied from 40 GB to

80 GB. HHH-L reduces the execution time by up to 21% over default HDFS and 54% compared to MapReduce over Lustre. Sort benchmark has equal amount of HDFS Read and Write operations. On SDSC Gordon, the compute nodes are equipped with local SSDs and Lustre is accessed via 10 GigE Ethernet. Therefore, our design leads to larger benefit over Lustre compared to that over HDFS.

We also evaluate the Sort benchmark performance on TACC Stampede. Figure 5.10(c) shows the results of our experiments. On this cluster, we vary the data size from 40 GB on eight nodes to 160 GB on 32 nodes. For 160 GB Sort on 32 nodes, 603 (out of 640) maps are launched as data-local; the map phase takes 537s in HHH-L but 919s for MapReduce over Lustre. The data locality introduced in our design helps improve the performance of the map phase over Lustre. The write phase of job output is faster in our design compared

to HDFS due to less replication and in-memory I/O. In this way, our design reduces the execution time of Sort by up to 23% over default HDFS and up to 24% over Lustre.

5.2.4 Evaluation with Applications

We evaluate our design with the CloudBurst [78] application on TACC Stampede with

16 DataNodes. We use HHH with the default dataset of CloudBurst. We run the application with 160 maps while the other configurations are kept as default. CloudBurst is designed for parallel read-mapping through MapReduce and particularly optimized for mapping sequence data to a reference genome. The time taken by the alignment phase is 60.24 sec in default HDFS; it is 48.3 sec in our design. So, Triple-H accelerates the alignment phase by

19% over default HDFS. We also evaluate the HHH-L design with SequenceCount from the

PUMA workloads repository [66]. This experiment is performed on 8 DataNodes on OSU

RI Cluster B. Although SequenceCount reads and writes data from/to HDFS/Lustre, it also has computation. The execution time of SequenceCount is 2050s, 2308s, and 1893s over

HDFS, Lustre and HHH-L. HHH-L reduces the execution time of SequenceCount by 17% compared to MapReduce over Lustre. We gain over HDFS because of the reduction of I/O overheads; compared to Lustre, we offer better data locality which leads to improvement in performance.

5.3 Related Work

Extensive research has been carried out in the recent past to improve the performance of HDFS. For HDFS, the read throughput has been improved by caching data in memory or using explicit caching systems [21, 46, 96]. The Apache Hadoop community [96] has also proposed a centralized cache management scheme in HDFS, which is an explicit caching mechanism that allows users to specify paths to be cached by HDFS. But currently, this caching mechanism can only help HDFS read operations. The authors in [46]

have studied improving HDFS read and write performance by utilizing Memcached. Recently, new in-memory computing systems, like Spark [54, 111], are emerging. These systems cache the data in memory and support lineage-based data recovery (re-computation), which is more expensive than replication [110]. Some recent studies [68, 69] also pay attention to incorporating heterogeneous storage media (e.g. SSD) in HDFS. The authors in [68] deal with data distribution in the presence of nodes that do not have uniform storage characteristics, whereas [69] caches data in SSD. Researchers in [91] present

HDFS-specific optimizations for PVFS, and the authors of [70] propose to store the cold data of a Hadoop cluster on network-attached file systems. Most of these designs are based on hierarchical storage deployment. In this work, we propose a hybrid (hierarchical with flat) architecture to take full advantage of the heterogeneous storage resources on modern HPC clusters.

5.4 Summary

In this work, we proposed Triple-H, a novel design for HDFS on HPC clusters with heterogeneous storage architecture. Triple-H follows a hybrid storage usage pattern that can minimize the I/O bottlenecks and ensure efficient storage utilization with a similar level of fault tolerance as HDFS in HPC environments. In our design, we introduced a RAM Disk and SSD-based Buffer-Cache to hide the cost of disk access during Read/Write operations and ensure reliability by utilizing SSD-based staging. We also proposed effective data placement policies for our architecture, integrating it with parallel file system installations such as Lustre. From our comprehensive performance evaluations of Triple-H, we observed

7x benefit for TestDFSIO write and 2x improvement for TestDFSIO read, as compared to default HDFS. For data-generation benchmarks, we observed a reduction in job execution time of up to 3x. For Sort benchmark, our design can show up to 24% improvement over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is

accelerated by 19%. Our design also improves the performance of SequenceCount by 17% with a local storage space reduction of 66%.

Chapter 6: Accelerating Iterative Applications with In-Memory and Heterogeneous Storage

The previous chapter describes the design of Triple-H that exploits the in-memory and heterogeneous storage available in HPC systems to minimize the I/O bottlenecks of HDFS.

Another popular in-memory file system in the literature is Alluxio/Tachyon [54]. These advanced file systems have shown significant performance improvement for different sets of workloads by bringing in-memory storage into their designs. They have also efficiently utilized heterogeneous storage such as SSD, HDD, and even parallel file systems like Lustre and GlusterFS, that are usually available on large-scale data centers or HPC clusters.

These heterogeneous design choices introduce new possibilities of different deployment strategies for in-memory file systems in these environments. Different system architectures and deployment modes have different performance and fault tolerance characteristics and thus may fit with different sets of workloads and applications. Therefore, there is a critical need for a systematic performance characterization study of these in-memory file systems with different modes for a representative set of benchmarks and applications.

Additionally, HDFS, which was primarily designed for batch applications, cannot provide optimal performance for iterative jobs. Iterative applications suffer from huge I/O bottlenecks resulting from persisting and replicating the output of each iteration to HDFS.

Even though in-memory computing frameworks like Spark are optimized for iterative applications, they consume significant memory for caching the data from the intermediate iterations.

Spark also aggressively uses memory to store different types of data. Tachyon along with the in-memory computing framework Spark [111] is known to accelerate iterative applications by eliminating I/O bottlenecks. But upper-level frameworks like MapReduce and

Spark cannot take advantage of lineage without significant modifications in the frameworks themselves. As a result, the performance of Tachyon becomes bounded by the underlying

file system i.e. HDFS.

In this work, we study the impact of in-memory file systems like Triple-H and Tachyon on the performance and fault tolerance of Hadoop and Spark. We optimize the Triple-H design to support different execution modes of Spark and propose advanced accelerating techniques to adapt Triple-H for iterative applications. We present an orthogonal way for

Spark to cache the output from the intermediate iterations of an iterative application to

Triple-H. Through this, Spark can utilize the available memory for other purposes like computation, intermediate data, etc. Our experimental results show that Triple-H outperforms

Tachyon for Hadoop MapReduce and Spark workloads on HPC clusters. The proposed design is also able to improve the performance of iterative MapReduce as well as Spark applications.

6.1 System Architectures to Deploy In-Memory File Systems

In this section, we discuss the system architectures to deploy Tachyon [54] and Triple-H on HPC clusters.

Tachyon [54] is a memory-centric distributed storage system for Spark and MapReduce applications. By leveraging lineage information, it provides faster in-memory read and write accesses. For data-intensive applications like those of Hadoop MapReduce or Spark,

Tachyon supports two primary deployment modes. They are:

(1) Tachyon as a Write-Through Cache (Tachyon-WTC)

(2) Tachyon as a Cache (Tachyon-C)

Figure 6.1: System architectures for deploying Hadoop MapReduce and Spark on top of in-memory file systems (Tachyon & HHH): (a) Tachyon as a Write-Through Cache, (b) Tachyon as a Cache, (c) HHH

Figure 6.1(a) depicts the first mode. In this mode, Tachyon uses HDFS or any POSIX

compliant file system as the underlying file system (underFS) and acts as a write-through

cache itself. Applications write data to the in-memory layer of Tachyon, which then flushes

it to the underFS in a synchronous manner. This is the default deployment mode of

Tachyon for MapReduce and Spark applications.

The second deployment mode, as shown in Figure 6.1(b), uses Tachyon just as a cache.

In this mode, the data generation phase of MapReduce and Spark applications writes data directly to the underFS of Tachyon. Since lineage is not enabled for such applications, data reliability is guaranteed by the existing fault tolerance mechanism offered by the file system. When a job runs on the generated data, Tachyon reads the data and caches it in the in-memory layer. Thus, Tachyon acts just as a cache in this mode.

In both of these modes, write performance is bounded by that of the underlying file system. For example, if HDFS is used as the underFS of Tachyon, the write performance of Hadoop MapReduce and Spark jobs depends on the underlying storage and network speeds for HDFS replication. In the Tachyon-WTC mode, the read operation can proceed at memory speed as long as the data can be found in the memory layer of Tachyon. Beyond that, the data is read from the underlying file system. On the other hand, in the Tachyon-C mode, data is read from the in-memory layer of Tachyon as it caches it from the underlying

file system with an LRU eviction policy.

Tachyon also supports another mode of data write. In this mode, the entire data is cached in memory as long as it fits. This data is not persisted to the underlying file system. This mode can be set using the MUST_CACHE parameter for Tachyon write; the data size stored in this mode equals the available memory size. The primitive operations, e.g., basic read and write, use the native Tachyon APIs and can ensure fault tolerance through lineage and periodic checkpointing. However, Tachyon does not offer fault tolerance for more complex workloads on Hadoop MapReduce and Spark because lineage is not enabled for them. Therefore, the fault tolerance for these workloads entirely depends on the built-in fault tolerance mechanism of the underFS.

As depicted in Figure 6.1(c), Triple-H also has three different modes: Default (HHH),

Lustre-Integrated (HHH-L), and In-Memory (HHH-M). HHH and HHH-L are described in

Chapter 5.1.2. HHH-M is a variation of the Default mode that uses memory in a greedy manner for data placement.

Limitations of Tachyon: The major limitation of Tachyon for running MapReduce and

Spark workloads comes from the lack of any mode to use it as a standalone in-memory

file system with fault tolerance [15]. Primitive operations can use Tachyon APIs to support

lineage. But there is no configuration parameter to enable this feature for MapReduce and

Spark; the frameworks should be modified to support this [54], which turns out to be a daunting task considering the large number of frameworks depending on distributed file systems like HDFS or Tachyon. Therefore, the performance of Tachyon is bounded by the underlying file system.

Table 6.1 summarizes the fault tolerance mechanisms offered by Triple-H and Tachyon for different types of workloads. Triple-H provides RDMA-based replication for fault tolerance of all workloads in the HHH-M and HHH modes. On the other hand, Tachyon supports lineage for primitive operations. But for MapReduce and Spark applications, the upper-layer frameworks have to be modified to take advantage of lineage.

Table 6.1: Fault tolerance of Triple-H and Tachyon

File system | Operations        | Fault Tolerance                                                             | How provided
HHH-M       | All               | In-memory replication with lazy persistence                                 | RDMA-based replication
HHH         | All               | Hybrid replication                                                          | RDMA-based replication
HHH-L       | All               | Lustre-based fault tolerance                                                | Synchronous Lustre write
Tachyon     | Primitive         | Lineage                                                                     | Tachyon APIs
Tachyon     | MapReduce, Spark  | Fault tolerance of underFS (lineage support needs framework modifications)  | Synchronous write to underFS

For both Tachyon and Triple-H, computation is co-located with storage to ensure better data locality for MapReduce and Spark applications. At the same time, in order to minimize memory contention, the memory usage is tunable for both file systems. When RAMDisk is

full, both HHH (default) and Tachyon (over HDFS) fall back to local SSDs/HDDs. HHH gains in performance via its enhanced placement policies. Moreover, Tachyon workers have to cross process boundaries to persist or cache the data to/from the underlying file system in both modes. This overhead is not present in the HHH-M and HHH modes. For

HHH-L, the DataNode processes write to Lustre to make the data fault tolerant. But reads can proceed from local storage within the process boundaries. Besides, any application framework compatible with HDFS can run on top of Triple-H without any framework-level modifications.

Figure 6.2: Enhanced connection management for supporting Spark over Triple-H

6.2 Adapting Triple-H for Iterative Applications

In this section, we first present our designs for running Spark over Triple-H and then propose accelerating techniques for iterative workloads.

6.2.1 Optimizing Triple-H for Spark

Spark is a popular in-memory computing framework that is known to offer better per-

formance for iterative applications over MapReduce. In Chapter 5, we proposed Triple-H

design that is optimized to run MapReduce applications. But Spark can run in different

modes as discussed in Chapter 2. Each of these modes has a different type of interaction with the underlying file system, and some of these interactions are quite different from that of

MapReduce. For example, the MapReduce framework provides process-level parallelism,

whereas Spark, in its standalone mode, offers thread-level parallelism while invoking the

file system APIs. On the other hand, when Spark runs in YARN-Client mode, multiple

threads are spawned in the same process that is launched by the YARN scheduler during

file system invocation. The YARN-cluster mode has the same characteristics as MapRe-

duce in terms of file system interaction. In order to accommodate these varying types of

interactions, we propose a novel design for connection management in the client side of

Triple-H.

6.2.1.1 Enhanced Connection Management for Spark in Triple-H

In Chapter 4, we described the default design for the Triple-H client side for RDMA

communication. Thread-level parallelism brings additional challenges for managing

and mapping different end-points with the corresponding threads. For this, we assign an

id to each thread during its creation and store the <ep_id, thrd_id> tuples in a hash map. The threads can use the buffers allocated per client process in a round-robin manner for sending data. All the end-points are created in a single RDMA connection object.

Figure 6.2 depicts the proposed connection management for supporting Spark over Triple-

H.

92 A single receiver thread is created to receive acknowledgements from the DataNodes

for all the sender threads that are spawned from the Spark framework. The reason behind

this is that all the acknowledgements are received on a single RDMA connection object at

different end-points. After receiving an acknowledgement in an end-point, the receiver

thread looks up the corresponding <ep_id, thrd_id> tuple in the hash map and notifies the appropriate thread. In this way, with the proposed connection management, Triple-H can support different execution modes of Spark.
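To make this bookkeeping concrete, the following minimal Java sketch illustrates how the <ep_id, thrd_id> map and the single receiver thread could interact. The class and method names (e.g., SparkClientConnectionManager, onAckReceived) are hypothetical and stand in for the actual Triple-H client internals; end-point creation, RDMA transfers, and acknowledgement polling are not shown.

import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the client-side <ep_id, thrd_id> bookkeeping described above.
public class SparkClientConnectionManager {

    // ep_id -> thrd_id, populated when a sender thread is created and bound to an end-point.
    private final Map<Integer, Integer> epToThread = new ConcurrentHashMap<>();

    // Per-thread mailbox used by the single receiver thread to wake up sender threads.
    private final Map<Integer, BlockingQueue<Integer>> ackQueues = new ConcurrentHashMap<>();

    /** Called by a sender thread during its creation, after it is assigned an end-point. */
    public void register(int threadId, int endpointId) {
        epToThread.put(endpointId, threadId);
        ackQueues.put(threadId, new ArrayBlockingQueue<>(64));
    }

    /** A sender thread blocks here until the receiver thread posts its acknowledgement. */
    public void awaitAck(int threadId) throws InterruptedException {
        ackQueues.get(threadId).take();
    }

    /**
     * Single receiver loop entry point: for every acknowledgement observed on an
     * end-point, look up the owning sender thread and notify it.
     */
    public void onAckReceived(int endpointId) {
        Integer threadId = epToThread.get(endpointId);
        if (threadId != null) {
            ackQueues.get(threadId).offer(endpointId);
        }
    }
}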

6.2.2 Design Scope for Iterative Applications

Iterative applications generate many different types of data (input, output, intermediate, control, etc.) that are stored to the underlying file system. Blindly storing all the data in high performance storage is not economical and may not be necessary to extract performance from the application. It is, therefore, challenging to determine the right set of data to store in the high performance storage layer for iterative applications. As a representative, we first analyze K-Means and propose advanced acceleration techniques to adapt Triple-H for such applications.

K-Means is a popular clustering algorithm [56] that divides the input dataset into k clus-

ters. It is iterative in nature and starts the clustering based on some initial cluster centroids.

Each iteration of this algorithm can be run as a MapReduce job where it generates the

input data for the next iteration in the pipeline. Each record in the input dataset is a d-

dimensional tuple. As demonstrated in Figure 6.3(a), the map phase computes the similar-

ity of the input records with those of the starting centroids and emits the <centroid id,

record id> tuples. This is the intermediate data that is then processed in the reduce phase. Reducers calculate new centroids based on the average of the similarities of all the

tuples in a cluster and output the records and their current centroids along with the new centroids for each cluster. The new centroids are used for the next iteration, and the algorithm continues until the change in the centroids falls below some threshold.

Figure 6.3: Iterations and profiling of K-Means: (a) data cached in each K-Means iteration, (b) I/O in different stages of K-Means
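For reference, a condensed Hadoop sketch of one such iteration is given below. The class names, the configuration key kmeans.centroids, and the comma-separated record format are assumptions made for illustration; the actual benchmark implementation used in our evaluation may differ, and the driver loop that chains iterations is omitted.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch of one K-Means iteration expressed as a Hadoop MapReduce job.
public class KMeansIteration {

    // Parses a comma-separated d-dimensional point.
    static double[] parsePoint(String line) {
        String[] parts = line.trim().split(",");
        double[] p = new double[parts.length];
        for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
        return p;
    }

    public static class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centroids;

        @Override
        protected void setup(Context context) {
            // Centroids from the previous iteration, e.g. "1.0,2.0;3.5,0.2;..." (assumed key).
            String[] specs = context.getConfiguration().get("kmeans.centroids").split(";");
            centroids = new double[specs.length][];
            for (int i = 0; i < specs.length; i++) centroids[i] = parsePoint(specs[i]);
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            double[] point = parsePoint(record.toString());
            int closest = 0;
            double best = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dist = 0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - centroids[c][d];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; closest = c; }
            }
            // Intermediate data of this iteration: <centroid id, record>.
            context.write(new IntWritable(closest), record);
        }
    }

    public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable centroidId, Iterable<Text> members, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text member : members) {
                double[] p = parsePoint(member.toString());
                if (sum == null) sum = new double[p.length];
                for (int d = 0; d < p.length; d++) sum[d] += p[d];
                count++;
                // Re-emit the record with its current centroid; this feeds the next iteration.
                context.write(centroidId, member);
            }
            StringBuilder next = new StringBuilder("centroid:");
            for (int d = 0; d < sum.length; d++) next.append(d == 0 ? "" : ",").append(sum[d] / count);
            // New centroid for this cluster, consumed by the driver before the next iteration.
            context.write(centroidId, new Text(next.toString()));
        }
    }
}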

The output from each iteration is written to HDFS and read by the next iteration, if there is any. Therefore, in our design, we cache the output from each iteration so that it can be read faster when the next iteration starts. All other data and control files are written to disk.

Figure 6.3(b) shows the profiling results of a single iteration of K-Means with 30GB input and six initial centroids. This graph is generated through analysis of the Hadoop execution logs. The intermediate data, as shown in Figure 6.3(b), is generated in the early stage of job execution by the map processes. As observed from the figure, after the completion of all map tasks, the reduce phase starts generating the output data. The size of the output data, which equals the sum of the input data size and the centroid size, is significant and plays an important role in the performance of K-Means. Since this output data is used as the input for the next iteration, there is a good opportunity for performance gain if this data is stored in the high performance storage devices in HDFS.

6.2.3 Proposed Design

In order to accelerate iterative applications over HDFS, we buffer the output data from each iteration in the RAMDisk and SSD-based buffer-cache of Triple-H. To further adapt

Triple-H for iterative applications, we propose the following enhancements:

Selective Caching: Selective Caching refers to the caching mechanism based on the type of data that is being generated during job runtime. In order to perform selective caching on write, we add a Selection Unit on the DataNode side of the Triple-H design. Every block coming to a DataNode goes through this unit. If the data is detected to be the output of an iteration coming from a map or reduce task, it is cached in the high performance storage layer (RAMDisk or SSD). Otherwise, it goes to other storage devices (HDD or Lustre). The

HDFS client side provides hints related to per-task meta information in the block header to the DataNode, indicating whether a block is part of the job output or not. In this way, only performance-sensitive data goes to the hybrid buffer-cache, while the non-critical data is stored in slower storage, reducing contention on the buffer-cache. Figure 6.4 presents these added features in Triple-H with their different functional units.

Figure 6.4: Added functional units in Triple-H for iterative applications
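A minimal sketch of the Selection Unit's routing decision is shown below. The BlockHeader hint and the storage-tier paths are illustrative assumptions rather than actual HDFS or Triple-H classes.

import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the DataNode-side Selection Unit described above.
public class SelectionUnit {

    /** Client-supplied hint carried in the block header (assumed layout). */
    public static final class BlockHeader {
        final long blockId;
        final boolean isJobOutput; // true if this block belongs to an iteration's output
        public BlockHeader(long blockId, boolean isJobOutput) {
            this.blockId = blockId;
            this.isJobOutput = isJobOutput;
        }
    }

    private final Path ramDiskDir = Paths.get("/dev/shm/dn");  // high-performance tier (assumed path)
    private final Path ssdDir     = Paths.get("/mnt/ssd/dn");  // SSD staging tier (assumed path)
    private final Path hddDir     = Paths.get("/mnt/hdd/dn");  // slower tier, e.g. HDD or Lustre mount

    /** Route an incoming block to a storage tier based on the client hint. */
    public Path chooseTarget(BlockHeader header, long freeRamDiskBytes, long blockSize) {
        if (header.isJobOutput) {
            // Performance-critical data: RAM Disk first, SSD as the staging buffer when full.
            return freeRamDiskBytes >= blockSize ? ramDiskDir : ssdDir;
        }
        // Non-critical data bypasses the hybrid buffer-cache to reduce contention.
        return hddDir;
    }
}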

Optimization in the Eviction Algorithm: We also optimize the cache eviction algorithm of Triple-H for iterative applications. Iterative applications require clearing the cache during each iteration to make space for the new data to be read in the next iteration. Therefore, as opposed to the Triple-H eviction algorithm discussed in Chapter 5.1.4, instead of using a fixed interval for the eviction daemon to wake up, our design evicts the data from the RAMDisk after a block is read in the map phase.

In this way, the output data of an iteration can be written to RAMDisk more frequently in Triple-H. Eviction from SSD still follows policies similar to those proposed for Triple-H.

Thus, the cold data gets evicted to HDD or Lustre.
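The following minimal sketch illustrates the read-triggered eviction step, under the assumption that a block evicted from RAMDisk is demoted to the SSD tier; the class name and file-level handling are illustrative and not the actual DataNode implementation.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of read-triggered eviction; in Triple-H this logic would live in the
// DataNode and operate on actual block files rather than arbitrary paths.
public class ReadTriggeredEviction {

    private final Path ssdDir;

    public ReadTriggeredEviction(Path ssdDir) {
        this.ssdDir = ssdDir;
    }

    /**
     * Called after a block on RAM Disk has been served to a map task: demote it to SSD
     * immediately so the next iteration's output can reuse the RAM Disk space.
     */
    public void onBlockRead(Path ramDiskBlock) throws IOException {
        Path target = ssdDir.resolve(ramDiskBlock.getFileName());
        Files.move(ramDiskBlock, target, StandardCopyOption.REPLACE_EXISTING);
    }
}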

6.3 Experimental Results

In this section, we present a detailed performance evaluation of Hadoop MapReduce and Spark workloads over Tachyon and Triple-H. For our evaluations, we use Hadoop

2.6.0, Spark 1.3.0 and Tachyon 0.6.1. In all the experiments, we mention the number of

DataNodes as the cluster size. Each number reported here is an average of three iterations.

On OSU RI Cluster B, we use RAMDisk, SSD, and HDD as HDFS data directories with the default RoundRobin placement policy for default HDFS. When Tachyon runs on top of HDFS, RAMDisk is used by Tachyon. For Triple-H, we use RAMDisk, SSD, and HDD as the data directories. On SDSC Gordon, we use the All SSD [34] placement policy for default HDFS.

6.3.1 Identifying the Impact of Different Parameters

First we identify the impact of different file system and framework parameters on the performance of MapReduce and Spark workloads over in-memory file systems Tachyon and Triple-H. The parameters that we consider are the ones that impact I/O performance.

These are:

BlockSize: The blocksize is an important file system parameter that determines the unit of data write. By default, the HDFS block size is 128MB, as is Triple-H's, and the Tachyon blocksize is 1GB. Since Tachyon uses HDFS or Lustre as the underlying file system, in this set of experiments, we evaluate the impact of HDFS blocksize and Lustre stripe size on the write performance of Tachyon. We do these experiments on eight DataNodes on OSU RI

Cluster B using the RandomWriter benchmark. The total number of maps is 64 for 60GB data size.

As observed from Figure 6.5(a), a 128MB HDFS blocksize minimizes the execution time of RandomWriter over all three file systems. HHH reduces the execution time by up to 41% over HDFS, and the performance of HDFS is better than that of Tachyon. This is because, even though Tachyon writes data to the RAMDisk (12GB), it also synchronously persists and replicates the data in HDFS. Since lineage is not enabled here, replication is important for data fault tolerance.

Figure 6.5: Impact of blocksize (OSU RI Cluster B): (a) HDFS blocksize, (b) Lustre stripe size, (c) Tachyon blocksize

Next, we evaluate the impact of Lustre stripe size on Tachyon (underFS is Lustre) write performance. As observed from Figure 6.5(b), a stripe size of 256MB minimizes the execution time of RandomWriter over all three file systems, with the execution times being very similar to one another. This is because Tachyon write performance is bounded

by that of Lustre. HHH-L also writes one copy of data to RAMDisk and the other to Lustre in a synchronous manner. However, this write in HHH-L happens in an overlapped manner.

Therefore, the performance of HHH-L is slightly better than Tachyon.

We also determine the optimal blocksize for Tachyon write. We vary the blocksize from 256MB to 2GB while running a 60GB RandomWriter test on eight nodes of OSU

RI Cluster B. HDFS is used as the underFS of Tachyon. As shown in Figure 6.5(c), the default blocksize of 1GB minimizes the execution time over the others. Therefore, in our subsequent experiments on OSU RI Cluster B, we use 128MB blocksize for HDFS and

Triple-H, 256MB stripe size for Lustre and 1GB blocksize for Tachyon.

Concurrent Containers: In this experiment, we vary the number of concurrent containers in MapReduce and observe the impact on the execution time of 60GB RandomWriter. On

OSU RI Cluster B, each node has eight processor cores; therefore, we vary the number of concurrent containers from four to eight. As observed from Figure 6.6(a), the execution time of RandomWriter is minimized for eight concurrent containers. Increasing the number of containers introduces more parallelism in terms of the number of concurrent maps, which leads to reduced execution time for this benchmark. The execution time for HHH is reduced by 41% over HDFS and 60% over Tachyon.

Figure 6.6: Impact of concurrent containers and tasks (OSU RI Cluster B): (a) concurrent containers (Hadoop), (b) concurrent tasks (Spark)
Figure 6.7: Evaluation of RandomWriter and Sort (SDSC Gordon): (a) RandomWriter, (b) Sort

Concurrent Tasks: We also evaluate the impact of the number of concurrent tasks per host using the TeraGen benchmark of Spark in Standalone mode. These experiments are run on eight nodes on OSU RI Cluster B for 60GB data size. The number of concurrent tasks per host is varied from four to eight. As shown in Figure 6.6(b), increasing the number of concurrent tasks per host reduces the execution time of Spark TeraGen, and the time is minimized for eight concurrent tasks due to higher concurrency. The performance of HHH is better by 52% over HDFS and 54% over Tachyon-WTC.

6.3.2 Primitive Operations

In this section, we evaluate the performance of primitive level operations over Tachyon and Triple-H. In this experiment, we write 1,000 files (around 12GB) to both Tachyon and

Triple-H. Tachyon runs on top of HDFS and we use the copyFromLocal command to copy

the files to Tachyon. Tachyon takes 16s to copy all the files to its in-memory layer and no

file is written to the underFS as the MUST_CACHE feature is on. On the other hand, HHH-

M (HDFS) with replication factor of three takes 45s (80s) to write (put) the same amount

of data. If we use replication factor of one, the write time is reduced to 16s. Compared to

HHH-M with replication factor of three, Tachyon performs 2.8x better, and the performance

is similar to that of HHH-M with replication factor of one. For the primitive operations,

Tachyon uses its native file system APIs with no replication to enable lineage. On the other

hand, HHH-M uses replication to ensure fault tolerance which increases the latency of this

operation. However, while executing Hadoop MapReduce and Spark workloads, Tachyon

does not support lineage and depends on the underlying file system for fault tolerance.

6.3.3 Hadoop Workloads on HDFS, Tachyon and Triple-H

In this section, we evaluate the performance of two MapReduce benchmarks, Ran-

domWriter and Sort over HDFS, Tachyon and Triple-H. For Tachyon, we consider both the

modes as discussed in Section 6.1. Tachyon in its MUST_CACHE mode sacrifices fault

tolerance for MapReduce and Spark applications, as lineage is not enabled. For fair com-

parison, we evaluate both HHH and Tachyon in their default modes that guarantee fault

tolerance. We perform these experiments on SDSC Gordon and vary the data size from

50GB on 8 nodes to 200GB on 32 nodes.

Figure 6.8: Evaluation of Grep and MR-MSPolygraph: (a) Grep (SDSC Gordon), (b) MR-MSPolygraph (OSU RI Cluster B)
Figure 6.9: Evaluation of Spark Standalone mode (SDSC Gordon): (a) TeraGen, (b) TeraSort

As observed from Figure 6.7(a), HHH reduces the execution time of RandomWriter by up to 56% over HDFS and 47% over Tachyon. As demonstrated in the figure, there is not much difference in the performance of HDFS and Tachyon-WTC. This is because, even though Tachyon stores the data in memory, it synchronously persists the data to HDFS and replicates it to multiple DataNodes. RandomWriter in Tachyon-C writes all the data to

HDFS, resulting in performance similar to the latter. Figure 6.7(b) shows the performance of the Sort benchmark. As observed from the figure, Tachyon-WTC reduces the execution time of Sort by 15% for 200GB data size on 32 nodes. The improvement in the case of Tachyon-C is 5%. The reason that Tachyon-WTC speeds up Sort is that the map phase of the Sort benchmark finds the input splits in memory while reading them. Thus, the map phase

finishes faster in this case. For Tachyon-C, data has to be loaded to memory at the start of the benchmark and then the job starts. The initial loading phase takes some time here.

HHH improves the performance of Sort by 31% over HDFS and 19% over Tachyon-WTC for 200GB data size on a cluster size of 32. The gain for HHH comes from in-memory data reads in the map phase and RDMA-based writes with hybrid replication using the enhanced placement policies in the reduce phase. The performance of Tachyon-WTC and Tachyon-C is bounded by that of HDFS for benchmarks like RandomWriter and Sort, whereas HHH improves the performance over default HDFS.

6.3.4 Hadoop Workloads on Lustre, Tachyon and Triple-H

In this section, we evaluate the performance of Grep and MR-MSPolygraph over Lustre, Tachyon-WTC (over Lustre) and HHH-L. For the Grep benchmark, we vary the cluster size from 8 to 32 and generate 50GB data on 8 nodes using the RandomTextWriter benchmark. The data sizes on 16 and 32 nodes are 100GB and 200GB, respectively. When

Tachyon-WTC and HHH-L generate the data, it is cached in the in-memory layer as well as written to Lustre. Thus, when the Grep benchmark is run, it gets to read the data from memory. The performance of Grep is largely influenced by data locality. Both HHH-L and

Tachyon-WTC improve data locality in the Hadoop Cluster compared to MapReduce run over Lustre. Therefore, both Tachyon and HHH-L show large improvements of 76% and

78%, respectively, over Lustre. HHH-L performs up to 8% better than Tachyon because data write in the reduce phase gets the benefit of overlapping in HHH-L. Such overlapping is not present in the Tachyon architecture.

The MR-MSPolygraph application reads data from the file system while matching the input protein sequences with those of the reference file. This is a read-intensive application with locality playing an important role in performance. We run this experiment on eight nodes on OSU RI Cluster B with a total of 1,000 maps. The number of concurrent containers/maps is varied from four to eight. Both HHH-L and Tachyon-WTC improve the

read performance over Lustre and thus achieve similar performance benefits, the maximum benefit being 79% over Lustre for HHH-L.

6.3.5 Spark Workloads on HDFS, Tachyon and Triple-H

In this section, we evaluate Spark TeraGen and TeraSort over HDFS, Tachyon and

HHH. We run these experiments on SDSC Gordon. We vary the cluster size from 8 to 32 and the data size from 50GB to 200GB. The number of concurrent tasks is 32 on

8 nodes, and 128 on 32 nodes.

Figure 6.9(a) shows the results of our evaluations with TeraGen using the Standalone mode. As observed from the figure, HHH reduces the execution time by up to 2.3x over

HDFS and 2.4x over Tachyon-WTC. Running TeraGen with Tachyon-WTC and Tachyon-C does not lead to significant improvement over HDFS. This is because, for Spark workloads also, Tachyon does not support lineage and thus, the performance is bounded by that of

HDFS. Figure 6.9(b) shows the results of our evaluations with TeraSort. As observed from the figure, HHH reduces the execution time of TeraSort by up to 17% over HDFS and 25.2% over Tachyon-C. TeraSort performs worse over Tachyon-C than over HDFS because a couple of task failures are seen during the execution of the job. These failures are due to out of memory and no-space-left-on-RAMDisk errors coming from Tachyon. TeraSort did not run successfully in Tachyon-WTC mode; every time, the jobs failed with out of memory errors. TeraSort being shuffle-intensive, the gain of HHH for this benchmark is not as high as that for TeraGen.

Figure 6.10 shows the results of our evaluations using Spark over YARN. HHH improves the performance of TeraGen over YARN-cluster by up to 2.1x over HDFS and

2x over Tachyon-WTC. Tachyon-C performs similarly to HDFS for TeraGen, as this mode writes data directly to HDFS. Tachyon-WTC also does not lead to significant improvement

Figure 6.10: Evaluation of Spark workloads over YARN: (a) TeraGen on SDSC Gordon, (b) TeraSort on SDSC Gordon, (c) WordCount on OSU RI Cluster B

in performance since it is bounded by that of HDFS. Figure 6.10(b) depicts that HHH improves the performance of TeraSort in YARN-cluster mode by 16% over HDFS. However,

TeraSort did not run successfully in the Tachyon-WTC and Tachyon-C modes. Every time, the experiments failed with out of memory errors.

6.3.6 Spark Workloads on Lustre, Tachyon and Triple-H

In this section, we evaluate the Spark WordCount job using the YARN-client mode over

Lustre, Tachyon-WTC and HHH-L. WordCount is a compute-intensive workload. Before running the WordCount job, we generate 40GB to 60GB data using the RandomTextWriter benchmark on eight nodes on OSU RI Cluster B. WordCount reads data from the underlying file system and writes the output to it. Figure 6.10(c) shows the results of our evaluations. As observed from the figure, there is not much difference in the execution times of WordCount over different file systems. HHH-L improves the performance by up to

4% over Lustre and by up to 2% over Tachyon-WTC. WordCount being CPU-bound, the performance is not much influenced by the in-memory file systems.

6.3.7 Fault Tolerance of HDFS, Tachyon and Triple-H

In this section, we evaluate HHH and Tachyon in terms of fault tolerance for MapReduce and Spark workloads and compare with that of HDFS. For this, we use the Sort

benchmark with 20GB data size on four nodes on OSU RI Cluster B. For all three

file systems, the RandomWriter benchmark first generates 20GB data. Then at the very beginning of the Sort experiment, we turn some nodes of the cluster off. As observed from

Figure 6.11, HDFS, Tachyon-WTC, Tachyon-C, and HHH can tolerate the failures of one and two nodes by successfully completing the Sort test. The overhead for a two-node failure is up to 104%, 81.5%, 73.2%, and 55% for HDFS, Tachyon-WTC, Tachyon-C, and HHH, respectively. Here, HHH and HDFS read the data from the third DataNode, and the Tachyon master instructs the worker on the third node to cache the data from the underFS and continue with execution. For failures of more than two nodes, none of the file systems can complete the job. The reason behind this is that the job loses access to all the replicas of some data blocks for HDFS and HHH (replication factor = 3). Since Tachyon depends on HDFS for fault tolerance, it also cannot finish the job in case of such failures.

Figure 6.11: Fault tolerance of different file systems (OSU RI Cluster B)
Figure 6.12: Evaluation of iterative workloads (OSU RI Cluster B): (a) synthetic workload, (b) K-Means

6.3.8 Evaluation with Iterative Workloads
6.3.8.1 Synthetic Iterative Benchmark

First we develop a MapReduce-based iterative benchmark consisting of three jobs. The

first job writes 60GB data to the file system. The second job reads the output from the

first job, sorts it, and then generates new output, which is the input of the third one. The

third job scans its input and writes the data back to the file system. The amounts of input and output data for the last two jobs are equal (60GB). We run these experiments on eight nodes on OSU RI Cluster B. Figure 6.12(a) shows the total time along with the breakdown times for this iterative job over different file systems. As observed from the figure, HHH with enhancements for iterative applications (HHH-iterative) performs 12% better than the basic HHH (Default mode) design. HHH-iterative also outperforms Tachyon by up to 28%.

6.3.8.2 K-Means

We also evaluate our design with MapReduce and Spark-based K-Means from Intel

HiBench [43] with two different data sizes. For both the cases, each of the input records is a 20-dimensional tuple. K-Means starts the clustering with five centroids and runs for

five iterations. These experiments are performed on eight DataNodes on OSU RI Cluster

B. As observed from Figure 6.12(b), for 20Million records, HHH-iterative gains by 8% over HDFS and 6.5% over Tachyon-WTC for MapReduce-based K-Means. For 100Million records, the gain increases to 13% over HDFS and 9% over Tachyon-WTC. As the data size increases, the selective caching and eviction algorithm of HHH-iterative prove to be more effective in minimizing the I/O bottlenecks over both HDFS and Tachyon. For smaller data sizes, Tachyon can cache all the data and HDFS takes advantage of the OS cache.

For Spark-based K-Means, HHH-iterative reduces the execution time by 15% over HDFS and 5.5% over Tachyon for 20Million records. However, for 100Million records, Spark

K-Means failed with out of memory errors.

6.3.9 Summary of Performance Characteristics

To summarize, we present the performances of the above-mentioned workloads in a tabular manner. Table 6.2 shows the performance characteristics of different MapReduce

and Spark workloads over Tachyon (over HDFS) and HHH. In this table, the job execution times of different benchmarks are normalized with respect to that of HHH, and a lower value indicates better performance. As observed from the table, HHH performs better than

Tachyon for all the workloads due to its enhanced designs. For RandomWriter and Sort,

Tachyon-WTC performs better than Tachyon-C. On the other hand, for TeraGen and Tera-

Sort, Tachyon-C outperforms Tachyon-WTC. This is because Tachyon-C does better memory management (caching during read only) for the Spark workloads than Tachyon-WTC.

Table 6.2: Normalized execution times over Tachyon-HDFS and Triple-H (HHH)

            | Hadoop MapReduce      | Spark
            | RandomWriter | Sort   | TeraGen | TeraSort
Tachyon-WTC | 1.89         | 1.23   | 2.36    | N/A
Tachyon-C   | 1.95         | 1.36   | 2.29    | 1.34

Table 6.3: Normalized execution times over Tachyon-Lustre and Triple-H (HHH-L)

            | Hadoop MapReduce       | Spark
            | Grep | MR-MSPolygraph  | WordCount
Tachyon-WTC | 1.08 | 1.01            | 1.02

Table 6.3 shows the normalized execution times (w.r.t. HHH-L) of MapReduce and

Spark workloads over Tachyon (over Lustre) and HHH-L. As observed from the table, although the performance of Tachyon is similar to that of HHH-L, HHH-L still performs better than Tachyon in all cases.

For iterative applications like K-Means, HHH gains over both HDFS and Tachyon. For smaller data sizes, Tachyon gets to cache the entire dataset for subsequent iterations and thus shows higher gains than for larger ones. However, CPU-bound workloads do

not show as much benefit as I/O-intensive ones over in-memory file systems. In terms of

tolerance to faults during application execution, both HHH and Tachyon can recover from

N − 1 (N = replication factor) node failures, similar to HDFS. MapReduce and Spark

frameworks need major revisions to leverage lineage of Tachyon.

6.4 Related Work

With the increasing use of HPC clusters for data-intensive computing, much research is being dedicated to harnessing the power of their heterogeneous storage devices to minimize the I/O bottlenecks. In the previous chapter, we introduced Triple-H, a novel design

for HDFS on HPC clusters with heterogeneous storage architecture. The use of hybrid stor-

age usage patterns, RAMDisk and SSD-based Buffer-Cache, and effective data placement

policies ensures efficient storage utilization and has proven to enhance the performance of

HDFS write by 7x and HDFS read by 2x. Yet another in-memory file system, Tachyon [54],

has shown promise in achieving high throughput writes and reads, without compromising

fault tolerance. Tachyon enables in-memory storage while providing fault tolerance by

leveraging the well-known technique of lineage in the storage layer, and addresses timely data recovery from failures using asynchronous checkpointing algorithms in the background. But without framework-level modifications, upper-layer middleware cannot take advantage of lineage. Moreover, none of these studies has proposed an enhanced design to accelerate HDFS for iterative applications. In this work, we propose acceleration techniques to adapt Triple-H for iterative MapReduce and Spark jobs.

6.5 Summary

In this work, we have characterized two file systems in the literature, Tachyon and Triple-H, and discussed their impacts on the performance and fault tolerance of Hadoop MapReduce

and Spark applications. We also proposed enhancements in the Triple-H architecture for iterative applications. Our evaluations show that Tachyon is 5x faster than HDFS for primitive operations. On the other hand, complex workloads on Hadoop MapReduce and Spark cannot leverage lineage without major framework-level modifications. Therefore, for such workloads, the performance of Tachyon is bounded by that of the underlying file system.

Triple-H, on the other hand, has improved performance compared to both HDFS and Lustre. Our evaluations show that Triple-H outperforms Tachyon by up to 47% and 2.4x for

Hadoop and Spark workloads, respectively. Triple-H also accelerates K-Means by 15% over HDFS and 9% over Tachyon.

Chapter 7: Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage

Hadoop MapReduce and Spark access data from HDFS considering data locality only.

Figure 7.1 shows the default data access scheme used by Hadoop and Spark in HDFS. HPC clusters host heterogeneous nodes with different types of storage devices. For such clusters, different replicas of a data block reside in different types of storage devices. But while accessing the data, HDFS does not take into account the type of storage the data is in. It is only the topological distance of the data that is considered by HDFS. Moreover, the interconnect in

HPC systems also provides low-latency and high-throughput communication. Recent studies [57, 71, 73, 74] have demonstrated the performance improvement of different Big Data middleware by taking advantage of HPC technologies. As an example, Triple-H proposed in Chapter 5 leverages the benefits from Remote Direct Memory Access (RDMA) over

InfiniBand and heterogeneous storage to deliver better performance for Hadoop and Spark workloads. Recently, Cloudera also stressed the importance of storage types while reading data from HDFS [5]. The community has also been talking about locality- and caching-aware task scheduling for both Hadoop and Spark [6, 8]. But Big Data frameworks like

Hadoop and Spark can use a variety of schedulers, like Mesos [2], YARN [98], etc., to run their applications. In order to add support for storage types as well as locality, each of these schedulers has to be modified. On the other hand, most of the Big Data middleware

use HDFS as the underlying file system. As a result, if the concept of storage types can be introduced (in addition to data locality) while reading data from HDFS, it saves a lot of effort for the framework programmer. It is, therefore, critical to rethink the data access strategies for Hadoop and Spark from HDFS on HPC clusters.

Figure 7.1: Default data access strategy

In this work, we address these challenges by proposing enhanced data access strategies for Hadoop and Spark on HPC clusters with heterogeneous storage characteristics. We also present efficient data placement policies for such platforms. For this we re-design HDFS to accommodate the proposed access and placement strategies so that Hadoop and Spark frameworks can exploit the benefits in a transparent manner.

7.1 Proposed Design

In this section, we propose different data access and placement strategies for Hadoop and Spark on HPC clusters. We also present the associated designs introduced in HDFS in order to realize the enhanced strategies.

Figure 7.2: Proposed data access strategies: (a) Greedy, (b) Hybrid

7.1.1 Data Access Strategies
7.1.1.1 Greedy

In the Greedy strategy, the remote high performance storage devices are prioritized over

local low performance storage. This strategy is particularly suitable for cases in which the

local storage devices on the Hadoop cluster are mostly hard disks. But some nodes in

the cluster are equipped with large memory or SSD and replicas of the data blocks are

also hosted by these nodes in SSD or in-memory storage. For such cases, we propose to

read data from the remote SSD or in-memory storage in a greedy manner. Figure 7.2(a)

demonstrates the Greedy strategy of data access. As illustrated in the figure, the tasks

launched in DataNode1 and DataNodeN, instead of reading data from the local storage de-

vices (HDDs), access the data in SSD storage in DataNode2 over the interconnect. Tasks

launched in DataNode2 read data locally as they see that the local storage has higher per-

formance than those on most of the other nodes in the cluster. When the performance gaps

among the storage devices are large and the underlying interconnect is fast, this method

performs better for large data sizes.
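A minimal sketch of this greedy selection rule is given below. The Replica type, the storage-type ordering, and the tie-breaking on locality are assumptions made for illustration and do not correspond to the actual HDFS selection code.

import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the Greedy access strategy: prefer the replica on the fastest
// storage type, falling back to the local copy only as a tie-breaker.
public class GreedyReplicaSelector {

    public enum StorageType { RAM_DISK, SSD, HDD } // ordered fastest to slowest

    public static final class Replica {
        final String dataNode;
        final StorageType storage;
        final boolean isLocal;
        public Replica(String dataNode, StorageType storage, boolean isLocal) {
            this.dataNode = dataNode;
            this.storage = storage;
            this.isLocal = isLocal;
        }
    }

    /** Pick the replica to read: fastest storage first, local copy preferred on ties. */
    public Replica select(List<Replica> replicas) {
        return replicas.stream()
                .min(Comparator
                        .comparingInt((Replica r) -> r.storage.ordinal())
                        .thenComparingInt(r -> r.isLocal ? 0 : 1))
                .orElseThrow(() -> new IllegalArgumentException("no replica available"));
    }
}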

7.1.1.2 Hybrid

In the Hybrid strategy, data is read from both local as well as remote storage devices.

Figure 7.2(b) depicts the Hybrid read strategy. Considering the topology, the bandwidth of the underlying interconnect and available storage devices, we propose different variations of the Hybrid read strategy. These are as follows:

Balanced: The purpose of this data access strategy is to distribute the load among the nodes

in the cluster. In this strategy, some of the tasks read data from the local storage devices

(irrespective of the type of the storage) and the rest read from the remote high performance

storage. The percentage of tasks to read from local vs. remote storage can be indicated by

a configuration parameter.

Network Bandwidth-driven (NBD): In this scheme, tasks send read requests to remote nodes with high performance storage. The number of read requests that can be efficiently served by the remote node is bounded by the NIC bandwidth. Therefore, when the remote node reaches the saturation point of its network bandwidth, it sends a Read_Fail message

to the client. The client then reads the data available in its local node (if any), irrespective

of the type of storage. Otherwise (for rack-local tasks), the read request is re-sent to another

node that hosts a replica of the desired data.

As depicted in Figure 7.3, when Client3 sends the Read_Req to the DataNode at time

t3, it is already serving the requests from Client1 and Client2. The network bandwidth

is saturated (by assumption) at this point and, therefore, the DataNode sends a Read_Fail

message to Client3.

Figure 7.3: Overview of NBD Strategy

Storage Bandwidth-driven (SBD): In this scheme, tasks send read requests to remote

nodes with high performance storage. The number of read requests that can be efficiently

served by the remote node is bounded by the storage bandwidth of that node. Therefore,

when the remote node reaches the saturation point of its storage bandwidth, it sends a

Read_Fail message to the client. The client then reads the data available on its local node, irrespective of the type of storage, or retries the read from another node.

Data Locality-driven (DLD): This strategy is a trade-off between the topological distances among nodes and the high bandwidth offered by the high performance storage devices like

SSD and RAM Disk. In this scheme, the client reads data from the local storage devices as long as the data is available there. In this scenario, the client does not take into account the types of storage devices on the local node. If the data is not found locally, the client chooses to read the replica from a remote node that has high performance storage. If multiple replicas are hosted by different remote nodes having the same type of storage device, the client chooses the node (with high performance storage) that is topologically the closest.
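The following sketch illustrates the DLD decision, under the assumption that the replica list arrives topologically sorted from the NameNode (distance 0 = node-local); the Replica type is a hypothetical placeholder for the actual HDFS located-block information.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the Data Locality-driven (DLD) strategy described above.
public class DldReplicaSelector {

    public enum StorageType { RAM_DISK, SSD, HDD }

    public static final class Replica {
        final String dataNode;
        final StorageType storage;
        final int topologicalDistance; // 0 = node-local, 1 = rack-local, 2 = off-rack
        public Replica(String dataNode, StorageType storage, int topologicalDistance) {
            this.dataNode = dataNode;
            this.storage = storage;
            this.topologicalDistance = topologicalDistance;
        }
    }

    public Replica select(List<Replica> replicas) {
        // 1. A local replica wins regardless of its storage type.
        Optional<Replica> local = replicas.stream()
                .filter(r -> r.topologicalDistance == 0)
                .findFirst();
        if (local.isPresent()) return local.get();

        // 2. Otherwise prefer high performance storage; break ties by topological distance.
        return replicas.stream()
                .min(Comparator
                        .comparingInt((Replica r) -> r.storage.ordinal())
                        .thenComparingInt(r -> r.topologicalDistance))
                .orElseThrow(() -> new IllegalArgumentException("no replica available"));
    }
}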

7.1.2 Data Placement Strategies

For HPC clusters with heterogeneous storage characteristics, we propose two types of placement like those in Triple-H as discussed in Chapter 5.1.3. But the Triple-H design assumes a homogeneous cluster in terms of the storage devices. Therefore, in this work, before the data placement is done, Hadoop or Spark tasks identify the storage types along with the DataNodes that host them. Our proposed placement strategies are:

Greedy: This strategy places data in DataNodes having high performance storage like

RAM Disk in a greedy manner. In order to guarantee fault tolerance, it stores two replicas in RAM Disk and one in SSD. In this way, this strategy makes sure one of the replicas is synchronously persisted. The copies in RAM Disk are persisted in a lazy manner.

Balanced: This strategy balances the load of data placement between two types of high performance storage devices, namely, RAM Disk and SSD. As a result, some of the tasks store data in RAM Disk and the others in SSD. Depending on how performance-critical a job is, the percentage of tasks to store data in RAM Disk (or SSD) can be selected.
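A minimal sketch of the two placement choices follows. How the Balanced strategy maps a task id to a storage tier, and the exact replica mix beyond what is described above, are assumptions of this sketch; selecting the concrete DataNodes that host each storage type is omitted.

// Hypothetical sketch of replica-tier selection for the proposed placement strategies
// (replication factor 3).
public class PlacementStrategy {

    public enum StorageType { RAM_DISK, SSD }

    /** Greedy: two replicas in RAM Disk (lazily persisted) and one in SSD (synchronously persisted). */
    public StorageType[] greedyTargets() {
        return new StorageType[] { StorageType.RAM_DISK, StorageType.RAM_DISK, StorageType.SSD };
    }

    /**
     * Balanced: split the load between RAM Disk and SSD; a user-supplied percentage decides
     * which fraction of the tasks stores its data in RAM Disk.
     */
    public StorageType balancedPrimaryTarget(int taskId, int ramDiskPercent) {
        return (taskId % 100) < ramDiskPercent ? StorageType.RAM_DISK : StorageType.SSD;
    }
}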

7.1.3 Design and Implementation

When Hadoop, Spark, or similar upper layer middleware store data to HDFS, they send write requests to the HDFS Client side. The client contacts the NameNode to select the desired DataNodes (for three replicas) to place the replicas. On the other hand, when these frameworks access data from HDFS, they send read requests for the required files to the HDFS client side. The client contacts the NameNode to get the list of blocks for the requested file. The client then chooses the best DataNode to read the blocks (each

file is divided into multiple blocks) from. In order to incorporate the proposed enhanced placement/read strategies, we introduce some new components in HDFS. Figure 7.4 shows the architecture of our proposed design.

Figure 7.4: Proposed design

The new design components introduced in HDFS client side are:

Placement/Access Strategy Selector: HDFS can run with any of the proposed placement/access strategies based on a user-provided configuration parameter. The client retrieves the selected strategy and passes it to the DataNode Selector.

DataNode Selector: The DataNode selector chooses the best DataNode to store/read a block to/from based on the selected strategy and the available storage types in the cluster.

For placement, the client passes the selected strategy to the NameNode and gets the list of DataNodes that host the required types of storage devices. In order to select the best

DataNode to read a block from based on the selected strategy and the available storage types in the cluster, we incorporate two more components here:

1. StorageType Fetcher: The DataNode selector passes the list of DataNodes that hold

the replicas of a block to the StorageType Fetcher. This module gets the storage types

of the replicas from the NameNode. The NameNode has information of the storage

types of all the replicas. The DataNode, after storing a block, sends the storage type

to the NameNode in the block reports.

2. Locality Detector: The Locality Detector gets the list of nodes that host the replicas

of the requested blocks and determines if the data is available locally or not. The

list of DataNodes for a block is returned by the NameNode in a topologically sorted

manner (since most of the blocks are replicated to three DataNodes by default, the

length of this list is three) and, therefore, the first node in the list is the local node for

data-local tasks.

Weight Distributor: This module is activated when the user selects the Balanced strat-

egy. It gets the user-provided percentage and assigns weights (0 or 1) to the launched tasks

accordingly based on their task ids so that the DataNode Selector can select some of the

tasks (having weight 0) to read data locally and the rest (with weight 1) are redirected to

the remote nodes with high performance storage devices.

In order to realize the proposed data access strategies, we modify the existing DataNode

selection algorithm in HDFS by supplying the parameters obtained from Locality Detec-

tor, StorageType Fetcher, and Weight Distributor.
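The sketch below illustrates, at a high level, how the Weight Distributor output could feed the modified selection step for the Balanced strategy. The class and parameter names are hypothetical and only stand in for the parameters supplied to the DataNode selection algorithm.

// Hypothetical sketch: weight 0 -> read locally, weight 1 -> redirect to a remote node
// holding a replica on high performance storage.
public class BalancedSelection {

    /** Weight Distributor: assign a 0/1 weight from the task id and the user-provided percentage. */
    public static int weightFor(int taskId, int remoteReadPercent) {
        return (taskId % 100) < remoteReadPercent ? 1 : 0;
    }

    /**
     * Modified selection step: localReplica may be null when the task is only rack-local;
     * fastRemoteReplica is the topologically closest replica on RAM Disk/SSD (from the
     * Locality Detector and StorageType Fetcher outputs).
     */
    public static String chooseDataNode(int weight, String localReplica, String fastRemoteReplica) {
        if (weight == 0 && localReplica != null) {
            return localReplica;          // read locally, irrespective of storage type
        }
        return fastRemoteReplica;         // redirected to remote high performance storage
    }
}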

We choose the HDFS client side rather than the NameNode to host these components

because if each client has the knowledge of the access strategy and gets the list of DataN-

odes (and storage types) that store the replicas of the blocks, each of them can compute

the desired DataNode in parallel through the DataNode selection algorithm. Otherwise,

the NameNode has to determine the desired DataNode for each client, which would be a

bottleneck for large number of concurrent clients.

The design components introduced in the DataNode side are:

Placement Engine: Stores the data block in the desired type of storage based on the

selected placement strategy.

116 Connection Tracker: This component is enabled for the NBD scheme. It keeps track of the number of read requests Nc accepted by the DataNode. After Nc reaches a user-supplied threshold Tc (= Number of connections that saturate the network bandwidth), the DataNode does not accept any more read requests. When Nc becomes less than Tc due to some read requests being completed, the DataNode starts accepting read requests again.

Storage Monitor: This component is enabled for the SBD scheme. When the DataNode receives a read request, a reader thread is spawned to perform the I/O. This thread reads the data block from the storage device and sends it to the client. The Storage Monitor keeps track of the number of reader threads Nr spawned in the DataNode. After Nr reaches a user-supplied threshold Tr (= the number of concurrent readers that saturate the storage bandwidth), the DataNode starts sending Read_Fail messages to the clients.
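A minimal sketch of this DataNode-side admission control is shown below; the class name and the use of atomic counters are illustrative choices rather than the actual implementation, with Tc and Tr corresponding to the user-supplied saturation thresholds.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the Connection Tracker (NBD) and Storage Monitor (SBD) logic.
public class ReadAdmissionControl {

    private final int networkThreshold;  // Tc: connections that saturate the network bandwidth
    private final int storageThreshold;  // Tr: readers that saturate the storage bandwidth
    private final AtomicInteger activeConnections = new AtomicInteger(); // Nc
    private final AtomicInteger activeReaders = new AtomicInteger();     // Nr

    public ReadAdmissionControl(int networkThreshold, int storageThreshold) {
        this.networkThreshold = networkThreshold;
        this.storageThreshold = storageThreshold;
    }

    /** NBD: accept a read request only while Nc < Tc; otherwise the client receives Read_Fail. */
    public boolean tryAcceptConnection() {
        if (activeConnections.incrementAndGet() <= networkThreshold) return true;
        activeConnections.decrementAndGet();
        return false; // caller replies with Read_Fail
    }

    public void connectionClosed() { activeConnections.decrementAndGet(); }

    /** SBD: admit a reader thread only while Nr < Tr; otherwise the client receives Read_Fail. */
    public boolean tryStartReader() {
        if (activeReaders.incrementAndGet() <= storageThreshold) return true;
        activeReaders.decrementAndGet();
        return false; // caller replies with Read_Fail
    }

    public void readerFinished() { activeReaders.decrementAndGet(); }
}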

When the Greedy or DLD scheme is selected by the user, the StorageType Fetcher and

Locality Detector along with the enhanced DataNode selection algorithm can select the appropriate DataNodes to read the blocks from.

7.2 Performance Evaluation

In this section, we present a detailed performance evaluation of Hadoop MapReduce and Spark workloads over HDFS using our proposed design. We evaluate different workloads using our enhanced data access strategies and compare them with those using the default read scheme. For our evaluations, we use Hadoop 2.6.0 and Spark

1.6.0, Hive 1.1.0, BigDataBench V3.2, and the Postgres database REL9_1_22. In all the experiments, we mention the number of DataNodes as the cluster size and set up half of the

DataNodes with SSD storage while the rest have only HDD. For the Balanced strategy, we use a 50%-50% ratio of local vs. remote (high performance storage) reads. We incorporate our proposed SSD-RAM Disk-based placement policies along with the access strategies

in the RDMA-Enhanced designs proposed in Chapter 3 and Chapter 4. In the graphs, the

bars corresponding to this design are indicated by the RDMA prefix. On the other hand,

for Disk-SSD-based placement we use the existing One SSD policy in default Hadoop over

IP-over-InfiniBand (IPoIB).

In this study, we perform the following experiments:

1. Evaluations with RandomWriter and TeraGen, 2. Evaluations with TestDFSIO, 3. Evaluations with Hadoop Sort and TeraSort, 4. Evaluations with Spark Sort and TeraSort, and

uations with Hadoop Sort and TeraSort, 4. Evaluations with Spark Sort and TeraSort, and

5. Evaluations with Hive and Spark SQL.

For Hadoop Sort and TeraSort, we use the benchmark implementations that come with

the Apache Hadoop distribution. For Spark workloads, we use the implementations pro-

vided by Intel HiBench [9].

7.2.1 Experimental Setup

The experiments are performed on four different clusters. They are:

OSU RI Cluster A: The cluster configurations are described in Chapter 3.2.

OSU RI Cluster B: The cluster configurations are described in Chapter 3.2.

OSU RI2: There are 17 storage nodes in this cluster. These nodes are equipped with two fourteen-core Xeon E5-2680 v4 2.4 GHz processors. Each node is equipped with 512 GB RAM and one 2 TB HDD. Each node is also equipped with MT4115 EDR ConnectX HCAs (100

Gbps data rate). The nodes are interconnected with two Mellanox EDR switches. Each node runs CentOS 7.

SDSC Comet: Each compute node on SDSC Comet [79] has two twelve-core Intel

Xeon E5-2680 v3 (Haswell) processors, 128GB DDR4 DRAM, and 320GB of local SATA-

SSD with CentOS operating system. The network topology in this cluster is 56Gbps FDR

InfiniBand with rack-level full bisection bandwidth and 4:1 over-subscription cross-rack bandwidth.

Figure 7.5: Evaluation of RandomWriter and TeraGen on OSU RI2: (a) RandomWriter, (b) TeraGen

7.2.2 Evaluation with RandomWriter and TeraGen

In order to determine the effectiveness of our proposed placement strategies, we evaluate with data generation benchmarks, RandomWriter and TeraGen. These experiments are performed on OSU RI2. We vary the data sizes from 80GB on 4 DataNodes to 320GB on

16 DataNodes. As observed from Figure 7.5(a), our proposed Greedy placement strategy reduces the execution time of RandomWriter by 38% over the default ONE SSD scheme.

The improvement for the Balanced strategy is 26%.

Figure 7.6: Evaluation of the data access strategies: (a) evaluation of TestDFSIO Read (OSU RI Cluster A and B), (b) comparison among the hybrid access strategies (OSU RI Cluster B)

We run similar experiments with TeraGen. As shown in Figure 7.5(b), our proposed

placement strategies gain by up to 46% (Greedy) and 30% (Balanced) over the default

one. Since the proposed policies perform better selection of the DataNodes with high

performance storage devices, they bring in significant speed up for the data generation

benchmarks.

Figure 7.7: Selecting the optimal number of connections in the NBD strategy on OSU RI Cluster B

Figure 7.8: Selecting the optimal number of readers in the SBD strategy on OSU RI Cluster B

7.2.3 Evaluation with TestDFSIO

In this section, we evaluate the Hadoop TestDFSIO read workload using our proposed data access strategies and compare the performances with those of the default (locality-aware read) scheme.

First, we run our experiments on OSU RI on 16 nodes. Eight of these nodes have SSDs

and the rest have only HDDs. The SSD nodes are selected from OSU RI Cluster B and the

rest of the nodes are on OSU RI Cluster A. Data placement is done following the One SSD placement policy [34]. The experiments are run with eight concurrent maps per node. As observed from Figure 7.6(a), the Greedy scheme increases the read throughput by up to

4x over the default read scheme. The Hybrid read performs even better than the Greedy read, increasing read throughput by up to 11% over the Greedy read. The reason behind this is that when all the clients try to read greedily from the SSD nodes, at one point, the performance becomes bounded by the bandwidth of those nodes. On the other hand, the

Hybrid scheme (DLD) distributes the load across all the nodes, making some clients read locally, while others perform remote reads from the SSD nodes.

Figure 7.9: Evaluation of TestDFSIO on OSU RI2 [(a) TestDFSIO Read on Disk-SSD-based system (24 concurrent tasks), (b) TestDFSIO Read on Disk-SSD-based system (48 concurrent tasks), (c) TestDFSIO Read on SSD-RAM Disk-based system (24 concurrent tasks); y-axis: Average Throughput (MBps)]

Next, we perform experiments to compare different Hybrid read strategies. These ex-

periments are performed on eight nodes on OSU RI Cluster B. Four of the nodes have SSDs

and the rest have only HDDs. The experiments are run with eight concurrent clients per

node. As observed from Figure 7.6(b), the Balanced scheme performs better than the de-

fault read in terms of TestDFSIO read latency. In the presence of many concurrent clients,

distributing the load across multiple nodes in the Balanced scheme helps improve the performance. The DLD approach further improves the execution time by reading locally when available. But when remote reads are performed, the tasks select only the nodes with SSD, rather than reading from the remote HDD nodes. The number of rack-local tasks launched in this experiment is 22. The NBD and SBD strategies initially behave like Greedy but force the clients to read locally when the network (NBD) or storage (SBD) bandwidth is saturated.

In order to determine the optimal number of connections supported by a DataNode in the NBD scheme, we run the OSU-MPI-Multi-Pair-Bandwidth (inter-node) [17] test on

OSU RI Cluster B with a message size of 512KB (HDFS packet size) and find that four connections provide optimal performance, as depicted in Figure 7.7, since the bandwidth is maximized at this point. Similarly, to find the optimal number of I/O threads that saturate the disk bandwidth, we run the IOzone test on the SSD nodes with a varying number of reader threads. Each thread reads a file (one file per thread) of size 128MB (HDFS block size) with a record size of 512KB. Figure 7.8 shows the results of our IOzone experiments.

As observed from the figure, four concurrent readers maximize the total read throughput.

Therefore, for the SBD approach, we use four reader threads as the threshold on the DataNode side. The reason the NBD and SBD approaches perform worse than Balanced is that they force local reads under failure (local reads occur only when remote reads fail), whereas the Balanced approach does not have to go through the overhead of failed reads. For NBD and SBD, the loads on the DataNodes keep changing during the course of the job because, when the requested blocks are sent to the client, the connection with that client is closed by the DataNode. Therefore, we send each read request (even after receiving a Read Fail) from the HDFS client to the DataNode with high performance storage so that the client can read data from the remote storage once the DataNode is able to accept further requests.
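To make the above decision logic concrete, the following is a minimal Java sketch of a threshold-based replica selection in the NBD/SBD spirit. The thresholds follow the measured saturation points (four connections, four readers), but the class, method, and parameter names are hypothetical; the actual client-side code is not reproduced in this work.

```java
// Hypothetical sketch of the NBD/SBD read decision; not the actual HDFS/NVFS code.
public class HybridReadSelector {
    static final int NBD_LIMIT = 4;  // connections that saturate network bandwidth (Figure 7.7)
    static final int SBD_LIMIT = 4;  // reader threads that saturate SSD bandwidth (Figure 7.8)

    /** Returns the host the client should read a block replica from. */
    String selectReplica(String localNode, String remoteSsdNode,
                         int activeConnections, int activeReaders, boolean useNbd) {
        boolean saturated = useNbd ? activeConnections >= NBD_LIMIT
                                   : activeReaders >= SBD_LIMIT;
        // Greedy by default: prefer the remote high performance (SSD) node and
        // fall back to the local replica only once the remote node is saturated.
        return saturated ? localNode : remoteSsdNode;
    }
}
```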

Figure 7.10: Evaluation of Hadoop MapReduce workloads on OSU RI2 [(a) Sort, (b) TeraSort; y-axis: Execution Time (s), x-axis: Cluster Size:Data Size (GB); schemes: Default, Balanced, DLD, RDMA-Default, RDMA-Balanced, RDMA-DLD]

We also evaluate the performance of TestDFSIO read on eight nodes on Cluster OSU

RI2. As observed from Figure 7.9(a), the read throughput is maximized by the DLD scheme. Both DLD and Balanced perform better than the default read approach in HDFS with DLD offering a benefit of up to 33%; DLD also outperforms Balanced by almost 20%.

These tests are run with 24 concurrent tasks per node. In these tests, 20 rack-local tasks were launched. The figure clearly shows the benefits of the proposed access schemes. This further proves that in the presence of rack-local tasks (not data-local), it is always more efficient to read from the high performance storage devices through the proposed access schemes, rather than accessing data in the default approach. Even though the DLD strategy is performing better than Balanced for 24 concurrent tasks, Balanced does better than DLD for 48 concurrent tasks, which is evident from Figure 7.9(b). As the number of concurrent tasks increases, balancing the load across the nodes in the cluster starts performing better.

This is because, when 48 concurrent tasks read from the local disks, the disk bandwidths are easily saturated for the DLD scheme. Increasing the number of SSD nodes from 4 to 6 also shows a similar trend, with the Balanced strategy performing up to 13.5% better than DLD.

For six SSD DataNodes, even though the SSD-resident blocks are spread across more nodes, the number of blocks going to SSD remains the same for the same data size, as the data was generated with the One SSD scheme. As a result, the absolute values of the throughputs increase as shown in Figure 7.9(c), but Balanced still does better than DLD. We perform similar TestDFSIO experiments with 24 concurrent tasks per node on an SSD-RAM Disk-based cluster on 8 DataNodes on RI2. The data was written to HDFS following the Greedy placement policy proposed in Section 7.1.2. As shown in Figure 7.9(c), both Balanced and DLD yield higher throughput than the Default scheme. But DLD outperforms Balanced by 7% for 480GB data size on 8 nodes. When the number of concurrent tasks is increased to 48, Balanced becomes better than DLD. Another observation from Figure 7.9(c) is that the performance gaps between the proposed access schemes are not as high as for the Disk-SSD-based system. This is because, on all the nodes, the tasks get high performance storage like PCIe SSD and RAM Disk locally, and hence the gap is smaller here.

Figure 7.11: Evaluation of Hadoop MapReduce workloads on SDSC Comet [(a) Sort, (b) TeraSort; y-axis: Execution Time (s), x-axis: Cluster Size:Data Size (GB); schemes: Default, Balanced, RDMA-Default, RDMA-Balanced]

7.2.4 Evaluation with Hadoop MapReduce Sort and TeraSort

In this section, we evaluate our design with the MapReduce Sort benchmark on OSU

RI2. For these experiments, the data size is varied from 80GB on 4 nodes to 320GB on

16 nodes. The experiments are run with 24 concurrent tasks per node. As observed from

Figure 7.10(a), the Balanced read strategy demonstrates optimal performance for the Sort benchmark and reduces the execution time by up to 32% for 320GB data size on 16 nodes

for the Disk-SSD-based cluster. For the SSD-RAM Disk-based system with the Greedy data placement policy, the Balanced strategy outperforms the default scheme by 23%. Even though the DLD approach performs better than the default read strategy, Balanced works better than DLD. The reason is that the Balanced scheme distributes the I/O load across multiple nodes while prioritizing the high performance storage nodes; therefore, in the presence of concurrent tasks, it helps minimize the contention. Since Sort has equal amounts of I/O, shuffle, and computation, minimizing the I/O pressure accelerates the other phases of the benchmark; the Balanced approach therefore performs better than DLD.

Figure 7.10(b) shows the results of our evaluations with Hadoop TeraSort. For these experiments, the data size is varied from 80GB on 4 nodes to 320GB on 16 nodes. The experiments are run with 24 concurrent tasks per node on Cluster B. The figure clearly shows that the Balanced read strategy reduces the execution time of TeraSort by up to 11% for 320GB data size on 16 nodes for Disk-SSD-based system. For SSD-RAM Disk-based system, the Balanced strategy outperforms the default data access strategy by up to 10%.

We further evaluate our design using Hadoop Sort and TeraSort benchmarks on SDSC

Comet. On this cluster, we vary the data size from 25GB on 4 nodes to 100GB on 16 nodes.

We use the Balanced strategy for data access here. As observed from Figure 7.11(a), our design reduces the execution time of the Sort benchmark by up to 21% for 100GB data size on 16 nodes for the Disk-SSD-based system. For the SSD-RAM Disk-based system, the gain of the Balanced strategy is 13% over the default one. The benefit comes from the enhanced data access strategy that accelerates the operations by balancing out the I/O load. Similarly, our experiments with TeraSort on Cluster C yield an improvement of

10% over default HDFS on Disk-SSD-based system. On SSD-RAM Disk-based system, our gain is 12%. Figure 7.11(b) shows the results of our evaluations with TeraSort.

Figure 7.12: Evaluation of Spark workloads on OSU RI2 [(a) Sort, (b) TeraSort; y-axis: Execution Time (s), x-axis: Cluster Size:Data Size (GB); schemes: Default, Balanced, DLD, RDMA-Default, RDMA-Balanced, RDMA-DLD]

7.2.5 Evaluation with Spark Sort and TeraSort

In this section, we evaluate our design with the Spark Sort benchmark on OSU RI2. For these experiments, the data size is varied from 80GB on 4 nodes to 320GB on 16 nodes. The experiments are run with 24 concurrent tasks per node. As observed from Figure 7.12(a), the Balanced read strategy demonstrates optimal performance for the Sort benchmark and reduces the execution time by up to 17% for 320GB data size on 16 nodes on Disk-SSD- based system. For SSD-RAM Disk-based system, the gain for the Balanced strategy is 7%.

Even though the DLD approach performs better than the default read strategy, Balanced works better than DLD.

Figure 7.12(b) shows the results of our evaluations with Spark TeraSort. For these experiments, the data size is varied from 80GB on 4 nodes to 320GB on 16 nodes. The experiments are run with 24 concurrent tasks per node on OSU RI2. The figure clearly shows that the Balanced read strategy reduces the execution time of TeraSort by up to 9% for 320GB data size on 16 nodes for the Disk-SSD-based cluster. On the other hand, the gain on the SSD-RAM Disk-based system is 7% over the default scheme.

We further evaluate our design using the Spark Sort and TeraSort benchmarks on SDSC

Comet. On this cluster, we vary the data size from 25GB on 4 nodes to 100GB on 16

nodes. We use the Balanced strategy for data access here. As observed from Figure 7.13(a), our design reduces the execution time of the Sort benchmark by up to 14% for 100GB data size on 16 nodes for the Disk-SSD-based system. For the SSD-RAM Disk-based cluster, our gain is 13%. Similarly, our experiments with Spark TeraSort on SDSC Comet yield an improvement of 9% over default HDFS. Figure 7.13(b) depicts the results of our evaluations with Spark TeraSort.

Figure 7.13: Evaluation of Spark workloads on SDSC Comet [(a) Sort, (b) TeraSort; y-axis: Execution Time (s), x-axis: Cluster Size:Data Size (GB); schemes: Default, Balanced, RDMA-Default, RDMA-Balanced]

7.2.6 Evaluation with Hive and Spark SQL

In this section, the proposed design is evaluated in a Disk-SSD-based system with Hive aggregation query on four DataNodes on OSU RI Cluster B. For this, we use the workload and datasets [63] available from Brown University. We perform the experiment with 33GB data that are written to HDFS according to the One SSD policy. After that, the aggregation query is launched as a MapReduce job with 136 map and 142 reduce tasks and we use the

Balanced data access scheme. The query takes a total time of 995.25s to complete in default

HDFS. The proposed data access schemes improve the performance by 8.1%, resulting in an execution time of 914.56s. The benefit comes from the efficient access strategy that improves the data access speed of the aggregation workload by spreading the I/O load among the DataNodes.

Figure 7.14: Evaluation of Spark SQL on SDSC Comet [y-axis: Execution Time (s), x-axis: Data Size (GB); schemes: Default, Balanced, RDMA-Balanced]

Next, we evaluate Spark SQL with our proposed design. For this, we use the Select query from BigDataBench. We run this experiment on eight DataNodes on SDSC Comet and vary the data size from 48GB to 80GB. We use the proposed Greedy placement pol- icy along with the Balanced data access scheme on SSD-RAM Disk-based setup. Spark runs in the standalone mode with full subscription (24 concurrent tasks per node). We use

96GB memory for each Spark Worker. First, we run the Postgres database then start Hive and Spark. Figure 7.14 shows the results of our evaluations. As observed from the figure, our proposed design improves the performance of the Select workload by up to 30% over

default HDFS. The select query reads data from HDFS and finally writes the selected data

into a table in HDFS. Therefore, this query benefits from both our placement and ac-

cess schemes as indicated by the RDMA-Balanced bar in the graph. On the other hand, the

benefit for the same workload when only the Balanced access is enabled (data placement

via Greedy strategy is disabled), is up to 12% over the default data access scheme. Since

the Select query involves both HDFS read and write, it benefits most when the proposed

strategies for both placement and access are enabled.

7.3 Related Work

In Chapter 5, we introduced Triple-H, a novel design for HDFS on HPC clusters with heterogeneous storage architecture. The use of hybrid storage patterns, RAM Disk and SSD-based Buffer-Cache, and effective data placement policies ensures efficient storage utilization and has proven to significantly enhance the performance of Hadoop. In

Chapter 6, we have shown that Triple-H can accelerate the performance of Spark as well as iterative workloads. Yet another in-memory file system, Tachyon [54], has shown promise in achieving high throughput writes and reads without compromising fault tolerance, via lineage. Tachyon also supports the usage of heterogeneous storage devices [15]. All of these above-mentioned file systems assume that each node in the cluster has a homogeneous storage architecture with RAM Disk, SSD, and HDD. But on many occasions, HPC clusters are equipped with nodes having heterogeneous storage characteristics, i.e., not all the nodes have the same type of storage devices. This chapter addresses the challenges associated with obtaining high performance in placement and access of data for Hadoop and

Spark on such clusters.

7.4 Summary

In this work, we proposed efficient data placement and access strategies for Hadoop and Spark considering both data locality and storage types. For this, we re-designed HDFS to accommodate the enhanced placement (access) strategies for writing (reading) data on heterogeneous HPC clusters. Our evaluations show that the proposed data access strategies can improve the read performance of HDFS by up to 33% compared to the default locality-aware data access. The performances of the data generation benchmarks are improved by up to 46% by the proposed data placement schemes. The execution times of Hadoop and Spark Sort are also reduced by up to 32% and 17%, respectively. The performances of Hadoop

and Spark TeraSort are also improved by up to 11% by our design. The proposed designs further improve the performances of Hive and Spark SQL.

Chapter 8: High Performance Design for HDFS with Byte-Addressability of NVM and RDMA

HDFS stores the data files in local storage on the DataNodes. For performance-sensitive applications, in-memory storage is being increasingly used for HDFS on HPC systems [34].

Even though HPC clusters are equipped with large memory per compute node, using this memory for storage can lead to degraded computation performance due to competition for physical memory between computation and I/O. The huge memory requirements of applications from diverse domains, such as deep learning, neuroscience, and astrophysics, make memory contention a critical issue and call for rethinking the deployment of in-memory file systems on

HPC platforms. As indicated in a recent trace from Cloudera [23], 34% of jobs have outputs as large as their inputs. The performance of such write-intensive jobs is bound by the data placement efficiency of the underlying file system. Triple-H, proposed in Chapter 5, along with some recent studies [34, 54], has proposed in-memory data placement policies to increase the write throughput. But in-memory data placement makes the task of persistence challenging. NVMs, being non-volatile, promise new opportunities for persistence and fault tolerance. NVMs can not only augment the overall memory capacity, but also provide persistence while bringing in significant performance improvement.

Besides, HPC clusters are usually equipped with high performance interconnects and protocols, such as RDMA. Non-volatile memory (NVM) technology is also emerging and

making its way into HPC systems. Since the initial design considerations for HDFS focused on the technologies mostly used in commodity clusters, extracting performance benefits through the usage of advanced storage technologies, such as NVMs along with RDMA, requires reassessment of those design choices for different file system operations.

Figure 8.1: Performance Characteristics of NVM and SSD [(a) TestDFSIO Write, (b) TestDFSIO Read: total throughput (MBps) vs. number of concurrent maps per node for SATA, PCIe, and NVMe SSD; (c) I/O Time (ms) for an HDFS block on PCIe SSD, NVMe SSD, and NVRAMDisk]

Moreover, the use of NVMs is becoming popular in NVMe SSDs. Figures 8.1(a) and 8.1(b) show the performance comparisons of NVMe and traditional PCIe SSD with those of SATA SSD. As observed from the figures, both NVMe and PCIe SSD significantly improve performance over SATA SSD. On the other hand, even though NVMe standardizes the interface of SSDs, there is little performance increase over traditional PCIe ones. Figure 8.1(c) illustrates the I/O times for a single HDFS block across different storage devices. The figure clearly shows that the I/O time reduces by 12% when the block is stored to a RAMDisk backed by NVM (which we call NVRAMDisk). But even this is not the optimal use of NVM, as it cannot fully exploit the byte-addressability and obtain peak performance. All these lead us to the following broad challenges:

1. What are the alternative techniques to use NVM for HDFS?

2. Can RDMA-based communication in HDFS leverage NVM on HPC systems?

3. How can we re-design HDFS to fully exploit the byte-addressability of NVM to

improve the I/O performance?

4. Can we propose advanced, cost-effective accelerating techniques for Spark and HBase

to utilize the NVM in HDFS? How can we adapt the NVM-based HDFS design to

be used as a burst buffer layer for running Spark jobs over parallel file systems like

Lustre?

The rest of this chapter describes and evaluates the NVM-based HDFS design, which we call NVFS (NVM- and RDMA-aware HDFS).

8.1 Design Scope

In this section, we discuss our design scope.

Figure 8.2: NVM for HDFS I/O [(a) Block Access: the HDFS writer/reader uses file system APIs through the NVMe interface and driver over PCIe, (b) Memory Access: the HDFS writer/reader uses memory semantics through a direct memory interface over PCIe/DDR/others]

8.1.1 NVM as HDFS Storage

In this section, we discuss different design alternatives to incorporate NVMs for HDFS

storage. Typically, NVM can be accessed in either of two modes:

Figure 8.3: NVM for HDFS over RDMA [(a) D-to-N over RDMA, (b) N-to-D over RDMA, (c) N-to-N over RDMA; each panel shows whether the RDMADFSClient and RDMADFSServer buffers reside in DRAM or NVM]

Block Access: Figure 8.2(a) shows how NVMs can be used in block access mode for

HDFS. NVM can be mounted as an HDFS data directory in the DataNode side. After the data is received, it can be written to the corresponding HDFS block file via HDFS I/O APIs by the writer threads. Similarly, the readers can read data blocks using the HDFS file Read

APIs. These I/O operations go through the NVMe interface in fixed blocks, just the way

flash and disk storage are accessed. RAMDisk can be backed by NVM instead of DRAM.

HDFS can access NVM in block mode (via RAMDisk) in such case.

Memory Access (NVRAM): Figure 8.2(b) depicts the way to use NVM in memory access

mode for HDFS. The NVM card usually supports a direct memory interface with memory

mapping and addressing. In this mode, the writer/reader threads on HDFS DataNodes can

use memory semantics like direct load and store to access the data in NVRAM. But this

requires the HDFS storage engine to be re-designed using memory semantics instead of file

system APIs.
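The following Java sketch contrasts the two access modes described above. It is illustrative only: the mount point /mnt/nvm is a hypothetical path for a block-mode NVM device, and a direct ByteBuffer stands in for a true NVRAM mapping in memory mode.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Illustrative contrast of the two NVM access modes; not NVFS code.
public class NvmAccessModes {

    // Block access: NVM is mounted as a file system and data goes through
    // the regular file I/O APIs (and the NVMe block interface underneath).
    static void blockWrite(String blockFile, byte[] packet) throws IOException {
        try (FileOutputStream out = new FileOutputStream(blockFile, true)) {
            out.write(packet);
        }
    }

    // Memory access: the writer uses load/store (put/get) semantics on a
    // byte-addressable region instead of file system calls.
    static void memoryWrite(ByteBuffer nvram, byte[] packet) {
        nvram.put(packet);
    }

    public static void main(String[] args) throws IOException {
        byte[] packet = new byte[512 * 1024];                  // one 512KB HDFS packet
        memoryWrite(ByteBuffer.allocateDirect(packet.length), packet);
        blockWrite("/mnt/nvm/blk_0001", packet);               // requires the hypothetical NVM mount to exist
    }
}
```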

8.1.2 NVM for RDMA

In this section, we discuss different design alternatives to incorporate NVM in HDFS

over RDMA. In RDMA-based HDFS designs presented in Chapter 3 and Chapter 4, the

memory for RDMA communication between RDMADFSClient and RDMADFSServer is

allocated from DRAM. The communication library provides APIs to register memory from

DRAM in both the client and server side. Communication also occurs among the DataN-

odes for HDFS replication. Since these data transfers happen in a non-blocking fashion, a

large amount of memory (allocated from DRAM) is needed for communication. Moreover,

many concurrent clients and RDMADFSServer can co-exist in a single machine, which can

lead to huge memory-contention. Thus, RDMA communication takes memory from the

DRAM which otherwise could be used by heavy-weight applications.

The presence of NVMs in the HDFS DataNodes (we assume that each HDFS DataN-

ode has an NVM) offers the opportunity to utilize them for RDMA communication. This

requires the NVM to be used in memory access mode (NVRAM). Figure 8.3 depicts the

three possible alternatives to use NVRAM for RDMA.

DRAM to NVRAM (D-to-N) over RDMA: In this approach, the communication buffers

in RDMADFSClient are allocated from DRAM; the server side registers memory from

NVRAM. After the RDMADFSServer receives data, the data is replicated to the next

DataNode in the pipeline. In this case, the first DataNode acts as a client and allocates memory from DRAM for the sender side buffers. The receiver side (RDMADFSServer) receives data in buffers allocated in NVRAM.

NVRAM to DRAM (N-to-D) over RDMA: In this approach, the communication buffers in RDMADFSClient as well as the buffers for replication are allocated in NVRAM; the

server side registers memory for RDMA communication from DRAM.

NVRAM to NVRAM (N-to-N) over RDMA: This approach allocates memory from NVRAM

in both RDMADFSClient and RDMADFSServer. RDMA-based replication also uses buffers allocated in NVRAM.

The N-to-N over RDMA approach incurs a higher penalty compared to traditional RDMA as well as the other two approaches using NVRAM. The performance characteristics of D-to-N and N-to-D are similar. Moreover, D-to-N does not need NVMs to be present on the client side. Since we want to use more of the NVM space for storage on the DataNode (server) side and each RDMADFSServer has to serve multiple concurrent RDMADFSClients, we follow the D-to-N over RDMA approach in our design.

8.1.3 Performance Gap Analysis

In order to make the opportunities for performance gain with NVM concrete, we evaluated HDFS block I/O times over different memory and storage configurations. For this, we designed a benchmark (in Java) that mimics the HDFS file write/read APIs. This benchmark was run locally on a single node to measure the time required to write/read one HDFS block (128MB) over different storage (or memory) configurations. The block is written in

512KB chunks (HDFS packet size). The NVM is accessed in memory mode (NVRAM).

For performance characterization over DRAM, the chunks are written to (read from) a hash map. Here, we assumed that NVRAM is 10x slower than DRAM in write performance and has similar read performance [61, 64]. Therefore, while calculating the HDFS block

I/O times over NVRAM, we followed the same approach as in DRAM and added extra latency for simulation. On the other hand, for writing (reading) to (from) RAMDisk, PCIe

SSD, or NVMe SSD, we used HDFS file I/O APIs. We also measured the time to write

(read) to (from) NVRAMDisk using the file system APIs, but added extra latency to mimic the NVRAM behavior.
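A minimal Java sketch of this characterization benchmark is shown below. Only the parameters come from the text (128MB block, 512KB packets, a hash map for the DRAM case, and an assumed 10x write slowdown for NVRAM); the structure and names are ours, not the actual benchmark code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the single-node block I/O probe described above.
public class BlockIoProbe {
    static final int BLOCK_SIZE  = 128 * 1024 * 1024;   // one HDFS block
    static final int PACKET_SIZE = 512 * 1024;           // HDFS packet size
    static final double NVRAM_WRITE_SLOWDOWN = 10.0;     // assumed, following [61, 64]

    public static void main(String[] args) {
        Map<Integer, byte[]> dram = new HashMap<>();      // DRAM case: packets go into a hash map
        byte[] packet = new byte[PACKET_SIZE];

        long start = System.nanoTime();
        for (int seq = 0; seq < BLOCK_SIZE / PACKET_SIZE; seq++) {
            dram.put(seq, packet.clone());                // write one 512KB chunk
        }
        double dramMs = (System.nanoTime() - start) / 1e6;

        // NVRAM case: same code path plus a simulated extra latency, since
        // writes are assumed 10x slower while reads match DRAM.
        double nvramMs = dramMs * NVRAM_WRITE_SLOWDOWN;
        System.out.printf("DRAM block write: %.3f ms, simulated NVRAM: %.3f ms%n",
                          dramMs, nvramMs);
    }
}
```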

As observed from Figure 8.4, NVRAM is 20.8x faster than RAMDisk, 24.9x faster than

PCIe SSD, and 24.8x faster than NVMe SSD for HDFS write. For HDFS read, NVRAM is 83.4x faster than RAMDisk, 107.67x faster than PCIe SSD, and 101.53x faster than

Figure 8.4: Performance gaps among different storage devices [write/read execution time (ms) of one HDFS block on DRAM, NVRAM, RAMDisk, NVRAMDisk, PCIe SSD, and NVMe SSD]

NVMe SSD. On the other hand, NVRAMDisk is 1.13x faster than both PCIe and NVMe

SSD for HDFS write; for HDFS read, it is 1.29x and 1.21x faster than PCIe and NVMe

SSD, respectively.

We also calculate the delay (extra latency) to be added per block for HDFS I/O for

NVRAM (compared to DRAM). The block write time for DRAM is 0.499ms; for NVRAM,

it is 4.99ms (NVRAM is 10x slower than DRAM). Therefore, the extra latency to be added

for NVRAM during HDFS write is (4.99 - 0.499) = 4.49ms per block. No extra latency

is added for HDFS block read as NVRAM has similar read performance as DRAM. We

followed this approach because, to the best of our knowledge, none of the available NVM

simulators can work with HDFS so far.

8.2 Proposed Design and Implementation

8.2.1 Design Overview

In this section, we present an overview of our proposed design, NVFS. Figure 8.5

presents its architecture.

NVFS is the high performance HDFS design using NVM and RDMA. For RDMA-

based communication between DFSClient and DataNode and replication among the DataN-

odes, we use the D-to-N over RDMA approach as discussed in Section 8.1.2. The RD-

MASender in the DFSClient sends data over RDMA and uses DRAM for buffer allocation.

Figure 8.5: Architecture of NVFS [applications (Hadoop MapReduce, Spark, HBase) run over the NVM- and RDMA-aware HDFS (NVFS); the DFSClient contains the Writer/Reader, DataStreamer, and RDMA Sender/Receiver, while the DataNode contains the Responder, Replicator, RDMA Receiver, and the NVFS-BlkIO/NVFS-MemIO paths over NVM and SSD]

On the other hand, the RDMAReceiver in the DataNode receives the data in buffers allo- cated in NVRAM. The data is replicated by the RDMAReplicator over RDMA. After the

data is received by the RDMAReceiver, it is given to the HDFS Writer which can access

NVM either in block access mode (NVFS-BlkIO) using the HDFS file system APIs or in

memory access mode (NVFS-MemIO). In this work, we present designs for HDFS I/O

using both of these modes.

We also present advanced and cost-effective acceleration techniques for Spark and

HBase by using NVM (with SSD through a hybrid design) for performance-critical data

only. For this, we co-design HDFS with HBase and Spark so that these middleware can

utilize the NVM in a cost-effective manner. We further propose enhancements to use the

NVM-based HDFS design as a burst buffer file system for running Spark jobs over parallel

file systems, such as Lustre.

8.2.2 Design Details

In this section, we discuss our proposed designs in detail. We propose designs for utiliz-

ing NVM both in block and memory access modes for HDFS while considering workload

characteristics (read-/write-intensive).

8.2.2.1 Block Access (NVFS-BlkIO)

In this section, we propose to use NVM in block access mode for HDFS storage on

the DataNodes. For RDMA communication, the buffers are allocated from NVRAM in the

HDFS DataNode side when the DataNodes start up. These buffers are used in a round-robin

manner for receiving data by the RDMADFSServer. In order to preserve the HDFS block

I/O structure, the data received in the NVRAM buffers via RDMA, are then encapsulated

into JAVA I/O streams and written to the block files created in NVM using the HDFS file

system APIs. Figure 8.6 describes this design.

Figure 8.6: Design of NVFS-BlkIO [RDMADFSClients write/read blocks through the RDMADFSServer (DataNode); the communication buffers in NVM are accessed in memory mode, while the block files (Blk1 ... BlkN) are accessed in block mode]

In order to accelerate the I/O, we propose to mmap the block files during HDFS write over RDMA. In this way, the NVM pages corresponding to the block files are mapped into the address space of RDMADFSServer, rather than paging them in and out of the kernel

buffer-cache. RDMADFSServer can then perform direct memory operations via ByteBuffer

put/get to the mmaped region and syncs the block file when the last packet of the block

arrives.
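A minimal sketch of this mmap-based block write is given below, assuming the block file resides on an NVM-backed mount (the path is supplied by the caller); it is illustrative and not the actual NVFS implementation.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of the mmap-based block write in NVFS-BlkIO.
public class MmapBlockWriter {
    static final int BLOCK_SIZE = 128 * 1024 * 1024;

    private final MappedByteBuffer region;

    MmapBlockWriter(String blockPath) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(blockPath, "rw");
             FileChannel channel = file.getChannel()) {
            // Map the block file into the server's address space so packets
            // bypass the kernel buffer-cache paging path.
            region = channel.map(FileChannel.MapMode.READ_WRITE, 0, BLOCK_SIZE);
        }
    }

    void writePacket(byte[] packet, boolean lastPacketOfBlock) {
        region.put(packet);          // direct memory operation into the mapped region
        if (lastPacketOfBlock) {
            region.force();          // sync the block file when the last packet arrives
        }
    }
}
```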

Figure 8.7: Design of NVFS-MemIO [(a) Flat Architecture: a single NVM buffer pool shared by communication and storage, with packets from different blocks interleaved, (b) Block Architecture: separate communication and storage buffer pools, with each block's packets kept together]

8.2.2.2 Memory Access (NVFS-MemIO)

In this section, we present our design to use NVM in memory access mode (NVRAM) for HDFS. We consider two different architectures for HDFS to incorporate NVRAM.

Flat Architecture: Figure 8.7(a) shows the Flat Architecture. In this design, we allocate a pool of buffers in the NVM on the HDFS DataNode. The buffers are allocated in a single linked list. We maintain two pointers: head and tail for the buffer pool. The head pointer always points to the first used buffer. The first free buffer is pointed to by the tail pointer. Initially, all the buffers in the buffer pool are free for use and head and tail point to the same buffer. As the RDMADFSServer receives data, the tail pointer keeps on moving forward to point to the first free buffer. The major ideas of this design are:

1. Shared buffers for communication and storage: The buffers allocated in NVM are

used for both RDMA-based communication and HDFS I/O. The RDMADFSServer

receives the data packets sent by the RDMADFSClient or upstream DataNodes. Each

packet is identified by the corresponding block id and packet sequence

number in the packet header. These packets, after being received, remain in those

buffers and the buffers are marked as used. In this way, whenever a new packet

arrives, the first free buffer pointed to by the tail pointer is returned and this buffer

acts as the RDMA receive buffer as well as the storage buffer for that packet. As

shown in Figure 8.7(a), in this architecture, a packet from Block2 can reside in the

buffer next to a packet from Block1 when multiple clients send data concurrently.

Therefore, this architecture does not preserve HDFS block structures and hence, is

called the Flat Architecture.

2. Overlap between Reader and Writer: Even though the tail pointer keeps on chang-

ing when data comes to the RDMADFSServer, in the presence of concurrent readers

and writers, the Flat Architecture does not need any synchronization to access the

buffer pool. Because, read request for a block comes only when it is entirely writ-

ten to HDFS. Therefore, scanning through the pool of buffers can continue without

any synchronization from head to tail as it was at the time of the arrival of the

read request. Therefore, this architecture can guarantee good overlap between HDFS

write and read.

3. Scan through the buffer pool during read: Due to the Flat Architecture, when a read

request for a block arrives, the reader thread in the DataNode side gets the first used

buffer in the linked list pointed to by the head pointer and has to scan through the

entire pool of buffers (up to the tail pointer in the worst case) to retrieve the packets

belonging to that block in sequence. For large buffer pool, this scan is very inefficient

which results in poor read performance.

This design has fast write performance, as the communication buffer for a piece of data also acts as the storage buffer. During HDFS write, the first free buffer can be retrieved in

O(1) time, as it is pointed to by the tail pointer. So, the time complexity to write a block with M packets is O(M). But in order to achieve this, all the storage (and communication) buffers have to be registered against the RDMA connection object, and the storage therefore depends on the underlying communication library. Moreover, this architecture needs to scan the entire pool of buffers to retrieve the required data packets. If the number of blocks received by the RDMADFSServer is N when the read request arrives, and each block contains at most

M packets, then, the time to read an entire block is O(M ∗ N). This makes read perfor-

mance very slow for large data sizes (large N). Considering all these, we propose another

architecture, which we call Block Architecture that offers modularity between storage and

communication library with enhanced read performance.

Block Architecture: Figure 8.7(b) shows the Block Architecture. In this design, we allo-

cate two different buffer pools in the NVM: one for RDMA communication, another for

storage. Each of the buffer pools is pointed to by a head pointer that returns the first free

buffer. The reason to separate out these two buffer pools is to eliminate the need to reg-

ister all the buffers against the RDMA connection object. In this way, storage operation

does not depend on the communication library and offers better modularity than the Flat

Architecture. The main ideas behind this design are:

1. Re-usable communication buffers: The communication buffers are used in a round-

robin manner. The RDMADFSServer receives the data in a free communication

buffer returned by the head pointer. After this, data is written to one of the free

storage buffers. The communication buffer is then freed and can be re-used for sub-

sequent RDMA data transfers.

2. HDFS Block Structure through the storage buffers: Figure 8.8 shows the internal

data structures for this design. The DataNode has a hash table that hashes on

the block ids. Each entry in the hash table points to a linked list of buffers.

This linked list contains the data packets belonging to the corresponding block only

and data packets belonging to the same block are inserted sequentially in the linked

list. In this way, this design preserves the HDFS block structures. This is why, we

name this architecture Block Architecture.

3. Overlapping between communication and storage: The encapsulation of data from

the communication to the storage buffers happens in separate threads. The receiver

thread in RDMADFSServer, after receiving a data packet, returns the buffer id to

a worker thread. The worker thread then performs the transfer of data to a storage

buffer; while the receiver thread can continue to receive subsequent data over RDMA.

In this way, the Block Architecture ensures good overlap between communication and

I/O.

4. Constant time retrieval of blocks during read: As the DataNode maintains a hash

table of block ids, the list of data packets can be retrieved in constant time (amor-

tized cost) for each block during read. The subsequent data packets for a block are

also obtained in constant time as they are in adjacent entries of the linked list.

5. Synchronization: Since separate buffer pools are used for communication and stor-

age, no synchronization is needed between the communication and storage phases.

The readers read from the linked lists corresponding to their respective blocks and

the writers write to the free storage buffers. So, no synchronization is needed for

concurrent readers and writers. Concurrent writers need to synchronize the pool of

storage buffers to retrieve the first free buffer.

In this architecture, the amortized time complexity to read a block is O(1) + O(M), where M is the number of packets per block. This time is independent of the total number of blocks received by the DataNode and therefore, the read performance is much faster than that of the Flat Architecture. On the other hand, the amortized time complexity to write a block is M ∗ (O(1) + O(W )). The first free buffer for RDMA is retrieved in constant time and W is the number of concurrent writers competing for a free storage buffer pointed to by the head pointer. Moreover, W << N and receiving the data is overlapped with I/O.

Thus, the write overhead is negligible compared to the Flat Architecture. Therefore, the overall performance of Block Architecture is better than Flat.

Figure 8.8: Internal data structures for Block Architecture [a hash table keyed by block id (Blk1 ... BlkN); each entry points to the linked list of that block's packets (Pkt1 ... PktM) stored in NVM]
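The sketch below captures the DataNode-side structures of Figure 8.8 in Java. Buffer-pool management, NVRAM allocation, and the RDMA path are omitted, and the class and method names are illustrative rather than the actual NVFS code.

```java
import java.nio.ByteBuffer;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the Block Architecture storage index (Figure 8.8). Each block is
// written by a single stream, so appends to one block's packet list are not contended.
public class BlockStore {
    // block id -> packets of that block, kept in packet-sequence order
    private final ConcurrentHashMap<Long, List<ByteBuffer>> blocks =
            new ConcurrentHashMap<>();

    /** Called by a worker thread after a packet is copied into a free storage buffer. */
    void appendPacket(long blockId, ByteBuffer storageBuffer) {
        blocks.computeIfAbsent(blockId, id -> new LinkedList<>())
              .add(storageBuffer);     // packets of one block stay contiguous in its own list
    }

    /** Readers locate a block in amortized O(1) time and read its M packets in O(M). */
    List<ByteBuffer> readBlock(long blockId) {
        return blocks.get(blockId);    // no scan over other blocks' packets, unlike the Flat design
    }
}
```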

In our design, we do not make any changes to HDFS client side APIs. Even though data

is stored to NVM using memory semantics by the DataNode, this happens in a transparent

manner and upper layer frameworks (HDFS clients) can still use HDFS file system APIs to

invoke the I/O operations.

8.2.3 Implementation

In this section, we discuss our implementation details.

NVFS-BlkIO: We assume that the RAMDisk is backed by NVRAM and simulate HDFS I/O using RAMDisk as a data directory while adding additional latency for the write operations. As obtained in Section 8.1.3, the extra latency added per block is 4.49ms (0ms) during HDFS write (read). Similarly, during RDMA communication for HDFS write and replication, we add extra latency of 4.49ms per block.

NVFS-MemIO: The Block Architecture offers better read performance and modularity

over the Flat Architecture. Therefore, we implement this design in this work. We consider

two types of memory allocation here.

1. Memory Allocation from JVM (NVFS-MemIO-JVM): Allocates memory from JVM.

2. Off-Heap Memory Allocation (NVFS-MemIO-JNI): Allocates off-heap memory through

Java nio direct buffers using JNI.

In both the cases, some buffers are pre-allocated to avoid the overhead of memory alloca-

tion. The rest of the buffers are dynamically allocated on demand. We overlap the dynamic

buffer allocation with communication so that this cost does not come into the critical path

of the application.
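A minimal sketch of such a buffer pool is shown below; the pool sizes and class name are illustrative, and only the two allocation paths (JVM heap vs. off-heap nio direct buffers) follow the description above.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the NVFS-MemIO buffer allocation strategies; names are illustrative.
public class StorageBufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();
    private final boolean offHeap;   // true = NVFS-MemIO-JNI style, false = NVFS-MemIO-JVM
    private final int bufferSize;

    StorageBufferPool(boolean offHeap, int bufferSize, int preAllocated) {
        this.offHeap = offHeap;
        this.bufferSize = bufferSize;
        for (int i = 0; i < preAllocated; i++) {   // pre-allocate to avoid allocation overhead later
            free.add(allocate());
        }
    }

    private ByteBuffer allocate() {
        // JVM mode uses heap buffers; JNI mode uses nio direct (off-heap) buffers.
        return offHeap ? ByteBuffer.allocateDirect(bufferSize) : ByteBuffer.allocate(bufferSize);
    }

    /** Remaining buffers are allocated on demand, off the critical path. */
    ByteBuffer take() {
        ByteBuffer buf = free.poll();
        return (buf != null) ? buf : allocate();
    }

    void release(ByteBuffer buf) {
        buf.clear();
        free.add(buf);
    }
}
```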

For both NVFS-BlkIO and NVFS-MemIO, the NVM behavior is obtained by the fol-

lowing equations for each block:

t_comm^(D-N) = t_comm^(D-D) + delay_comm

Here, t_comm^(D-N) represents the communication time for D-to-N over RDMA, t_comm^(D-D) is the communication time for RDMA over DRAM (traditional RDMA), and delay_comm = 4.49ms represents the added overhead.

t_io^(NVM) = t_io^(DRAM) + delay_io

Similarly, t_io^(NVM) represents the I/O time for NVM, t_io^(DRAM) is the I/O time for DRAM, and delay_io = 4.49ms represents the added overhead for write. For read, delay_io = 0.

8.2.4 Cost-effective Performance Schemes for using NVM in HDFS

NVMs are expensive. Therefore, for data-intensive applications, it is not feasible to store all the data in NVM. We propose to use NVM with SSD as a hybrid storage for

HDFS I/O. In our design, NVM can replace or co-exist with SSD through a configuration parameter. As a result, cost-effective, NVM-aware placement policies are needed to identify the appropriate data to go to NVMs. The idea behind this is to take advantage of the high IOPS of NVMs for performance-critical data; all others can go to SSD. Moreover, the performance-critical data can be different for different applications and frameworks. Consequently, it is challenging to determine the performance-critical data for different upper-level middleware over HDFS. In this section, we present optimization techniques to utilize

NVMs in a cost-effective manner for Spark and HBase.

8.2.4.1 Spark

Spark jobs often run on pre-existing data that were generated previously and stored on disk. For such jobs, we propose to write only the job output to NVM. To do this, the HDFS client identifies the output directory of a job provided by the application. The client then appends a flag bit, indicating buffering, to the header of each block to be written to the output directory. In other words, the headers of all the blocks belonging to the job output are updated with the buffering information. The DataNode parses the block header and writes the corresponding data packets to NVM. Periodically, these data are moved from NVM to SSD to free buffers in the NVM. By default, Spark aggressively uses memory to store

different types of data. We present an orthogonal design to write the job outputs to NVM

in HDFS, so that Spark can use the available memory for other purposes like computation,

intermediate data, etc.
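A sketch of this output-aware tagging is shown below; the flag value and method names are hypothetical, since the actual block-header format of NVFS is not reproduced here.

```java
// Hypothetical sketch of output-directory tagging for Spark job outputs.
public class OutputAwarePlacement {
    /** Client side: mark every block written under the job's output directory. */
    static byte bufferingFlag(String filePath, String jobOutputDir) {
        return filePath.startsWith(jobOutputDir) ? (byte) 1 : (byte) 0;
    }

    /** DataNode side: route the packet based on the flag parsed from the block header. */
    static String chooseStorage(byte bufferingFlag) {
        // Only job outputs take the NVM fast path; everything else goes to SSD.
        return (bufferingFlag == 1) ? "NVM" : "SSD";
    }
}
```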

8.2.4.2 HBase

With large amount of data insertions/updates, HBase eventually flushes its in-memory

data to HDFS. HBase holds the in-memory modifications in MemStore which triggers a

flush operation when it reaches its threshold. During the flush, HBase writes the MemStore

to HDFS as an HFile instance. High operational latency for HDFS write can affect the

overall performance of HBase Put operations [37]. The evaluations in Chapter 4.2 also demonstrate similar behavior. HBase also maintains HLog instances that are the basis of

HBase data replication and failure recovery and, thus, must be stored in HDFS. HBase periodically flushes its log data to HDFS. Each HBase operation has a corresponding log entry and the logs need to be persistent. Therefore, in this work, we propose to store only the Write Ahead Logs (WALs) in NVM. All other HBase files, along with the data tables, are stored in SSD. When a file is created in HDFS from HBase, the file path is distinguished based on its type. For example, a data file is always written under the directory /hbase/data.

On the other hand, WAL files are written in /hbase/WALs. While a block is sent from the

DFSClient to a DataNode, the client appends the file path (the file the block belongs to) to the block header. The DataNode, after receiving a block, parses the header and stores it to

NVM accordingly, i.e. if the block belongs to a WAL file, it goes to NVM, otherwise, it is written to SSD.
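The routing decision on the DataNode side can be summarized by the following sketch; the directory prefixes come from the text, while the method itself is illustrative.

```java
// Sketch of the WAL-aware placement for HBase files; not the actual NVFS code.
public class HBasePlacementPolicy {
    /** DataNode side: parse the file path from the block header and pick the device. */
    static String chooseStorage(String filePathFromBlockHeader) {
        if (filePathFromBlockHeader.startsWith("/hbase/WALs")) {
            return "NVM";   // Write Ahead Logs are performance-critical and must persist
        }
        return "SSD";       // data files (/hbase/data) and all other HBase files
    }
}
```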

8.2.5 NVFS-based Burst Buffer for Spark over Lustre

In order to handle the bandwidth limitation of shared file system access in HPC clusters, burst buffer systems [55] are often used; such systems buffer the bursty I/O and gradually

flush datasets to the back-end parallel file system. The most commonly used form of a burst buffer in current HPC systems is dedicated burst buffer nodes. These nodes can exist in the I/O forwarding layer or as a distinct subset of the compute nodes or even as entirely different nodes close to the processing resources (i.e., extra resources if available). A burst buffer can also be located inside the main memory of a compute node or as a fast non- volatile storage device placed in the compute node (i.e., NVRAM devices, SSD devices,

PCM devices, etc.). BurstFS [99] is an SSD-based distributed file system to be used as a burst buffer for scientific applications. It buffers the application checkpoints to the local

SSDs placed in the compute nodes, while asynchronously flushing them to the parallel file system.

Figure 8.9: Design of NVFS-based Burst Buffer for Spark over Lustre [Spark tasks on a compute node write through the DFSClient to the co-located DataNode, which buffers data in NVM and asynchronously flushes it to Lustre]

NVFS can, therefore, be used as the burst buffer for running Spark jobs over the parallel file system Lustre. For this, we propose to deploy the NVFS cluster with a replication factor of one

(or more depending on the fault tolerance requirements of the application). The tasks from

Spark jobs are co-located with the NVFS DataNodes in the compute partition of an HPC cluster. As a result, these tasks send their data to the DataNodes. These data are buffered in NVM and the DataNodes asynchronously flush the data to Lustre. To equip the NVFS

DataNodes with the burst buffer functionality, we add a pool of threads in the DataNode side to take care of the asynchronous flush to the parallel file system. This helps improve the I/O performance of Spark jobs by reducing the bottlenecks of shared file system access.

Figure 8.9 shows the proposed design.
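A minimal sketch of the DataNode-side flush pool is shown below; the Lustre path, pool size, and class names are illustrative assumptions, and block bookkeeping and retry logic are omitted.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the asynchronous burst-buffer flush described above.
public class BurstBufferFlusher {
    private final ExecutorService flushers = Executors.newFixedThreadPool(4);
    private final Path lustreDir = Paths.get("/lustre/nvfs-burst");  // hypothetical mount

    /** Called after a block is fully buffered in NVM; returns immediately. */
    void flushAsync(long blockId, ByteBuffer nvmBlock) {
        flushers.submit(() -> {
            Path target = lustreDir.resolve("blk_" + blockId);
            try (FileChannel out = FileChannel.open(target,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                out.write(nvmBlock.duplicate());  // assumes the buffer is already flipped for reading
            } catch (IOException e) {
                // a real design would retry and only then release the NVM buffer
            }
        });
    }
}
```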

8.3 Performance Evaluation

In this section, we present a detailed performance evaluation of the NVFS design in comparison to the default HDFS architecture. In this study, we perform the following sets of experiments: (1) Performance Analysis, (2) Evaluation with MapReduce, (3) Evaluation with Spark, and (4) Evaluation with HBase. In all our experiments, we use RDMA for Apache Hadoop 2.x 0.9.7 [38] released by The Ohio State University and JDK 1.7.0.

We also use OHB Microbenchmark 0.8 [38], YCSB-0.6.0, HBase 1.1.2, and Spark-1.5.1 for our evaluations. First we analyze our designs NVFS-BlkIO and NVFS-MemIO using the HDFS Microbenchmarks from OHB and then demonstrate the effectiveness of both of these designs using different MapReduce, Spark, and HBase workloads.

Figure 8.10: HDFS Microbenchmark [(a) SWL, (b) SRL: execution time (s) vs. data size (GB) for HDFS-IPoIB, HDFS-RDMA, NVFS-BlkIO, NVFS-MemIO-JVM, and NVFS-MemIO-JNI (56Gbps); (c) Flat vs. Block Architecture: total throughput (MBps) for memory-mapped access]

Figure 8.11: TestDFSIO [(a) OSU RI Cluster B, (b) SDSC Comet, (c) OSU Nowlab; average write/read throughput (MBps) for HDFS vs. NVFS-BlkIO and NVFS-MemIO-JNI]

The experiments are performed on three different clusters. They are:

OSU RI Cluster B: The cluster configurations are described in Chapter 3.2.

OSU Nowlab: There are four nodes with Intel P3700 NVMe SSDs on this cluster.

Each of them has dual eight-core 2.6 GHz Intel Xeon E5-2670 (Sandy Bridge) processors,

32GB DRAM, 16GB RAMDisk and is equipped with Mellanox 56Gbps FDR InfiniBand

HCAs with PCI Express Gen3 interfaces. The nodes run Red Hat Enterprise Linux Server release 6.5. Each of these nodes also have a 300GB OCZ VeloDrive PCIe SSD and INTEL

SSDSC2CW12 SATA SSD.

SDSC Comet: The cluster configurations are described in Chapter 7.2.

We allocate 1.5GB NVM space for the communication buffers that are used in a round- robin manner for non-blocking RDMA transfers. For storage buffers, we allocate 6GB

NVM space on OSU RI Cluster B and OSU Nowlab; on SDSC Comet, it is 10GB.

8.3.1 Performance Analysis

In this section, we analyze the performance of our proposed designs using the HDFS Microbenchmarks from OHB Microbenchmark 0.8. These experiments are performed on three DataNodes on OSU Nowlab. NVMe SSD is used as the HDFS data directory on each of the DataNodes.

150 Performance comparison using HDFS Microbenchmarks: We evaluate our designs using SWL and SRL from OHB Microbenchmarks. As observed from Figure 8.10(a),

NVFS-MemIO-JNI reduces the execution time by up to 67% over HDFS-IPoIB (56Gbps) and 53.95% over HDFS-RDMA (56Gbps) for 8GB data size. The performance of NVFS-

MemIO-JNI is better by up to 20% over NVFS-BlkIO. As observed from Figure 8.10(b),

NVFS-MemIO-JNI reduces the execution time of SRL by up to 43% over HDFS-IPoIB

(56Gbps) and 27% over NVFS-BlkIO. NVFS-MemIO-JNI also shows better performance than NVFS-MemIO-JVM. Therefore, unless otherwise stated, we use NVFS-MemIO-JNI for our subsequent experiments.

Figure 8.10(c) shows the performance comparisons between our two proposed archi- tectures for memory access mode of NVM. In this experiment, we use SWT and SRT for

8GB data size with four concurrent clients per node. The figure clearly shows that the read performance of the Block Architecture is 1.5x that of the Flat Architecture. Both designs offer similar write performance. This explains the benefit of the Block Architecture over Flat. In the presence of concurrent readers, read in the Block Architecture is faster, as each block and the corresponding data packets can be retrieved in constant time. Due to the overlap between communication and storage, the write performances are similar in both architectures.

Figure 8.12: Data generation benchmark, TeraGen [(a) OSU RI Cluster B, (b) SDSC Comet, (c) OSU Nowlab; execution time (s) vs. data size (GB) for HDFS vs. NVFS-MemIO-JNI]

151 Performance comparison over RAMDisk: We evaluate NVFS-BlkIO and NVFS-MemIO-

JNI and compare the performances with HDFS running over RAMDisk using the SWL

benchmark for 4GB data size. Table 8.1 shows the results for this experiment.

Table 8.1: SWL performance over RAMDisk

  IPoIB            22.72 sec
  RDMA             18.3 sec
  NVFS-BlkIO       15.21 sec
  NVFS-MemIO-JNI   12.26 sec

The table clearly shows that, NVFS-MemIO-JNI reduces the execution time by 46%,

33%, and 20% over HDFS-IPoIB, HDFS-RDMA, and NVFS-BlkIO, respectively. NVFS-

BlkIO also gains by 33% and 17% over HDFS-IPoIB and HDFS-RDMA, respectively.

Moreover, RAMDisk cannot ensure tolerance against process failures, which our designs

can, due to the non-volatility of NVM. NVFS-BlkIO gains due to memory speed write

using mmap, while NVFS-MemIO gains due to elimination of software overheads of the

file system APIs.

Comparison of communication time: We compare the communication times of HDFS over RDMA using DRAM vs. NVRAM. For this, we design a benchmark that mimics the

HDFS communication pattern. The benchmark transfers data from the client to a server which replicates it in a pipelined manner (no processing or I/O of the received data).

Table 8.2: Comparison of Communication Time

  IPoIB             10.73 sec
  RDMA              5.10 sec
  RDMA with NVRAM   5.68 sec

152 As shown in Table 8.2, the communication time for a 4GB file is 5.10s over RDMA using DRAM. On the other hand, when NVRAM is used for RDMA, this time increases to

5.68s due to the overheads of NVRAM. Compared to the communication time over IPoIB

(56Gbps), the proposed D-to-N over RDMA approach has 47% benefit.

Comparison of I/O time: In this study, we compare the I/O times needed by our designs with that of NVMe SSD using the SWL benchmark for 4GB data size.

Table 8.3: Comparison of I/O Time

  NVMe SSD         4.27 sec
  NVFS-BlkIO       3.14 sec
  NVFS-MemIO-JNI   1.83 sec

The I/O times are profiled on the first DataNode in the replication pipeline. First, we measure the I/O time per block and report the cumulative I/O time for the entire file. As observed from Table 8.3, NVFS-MemIO-JNI is 42% and 57% faster than NVFS-BlkIO and HDFS with NVMe SSD, respectively. NVFS-BlkIO is also 27% faster than HDFS with

NVMe SSD in terms of I/O time. Thus, our designs guarantee much better I/O performance than default HDFS with NVMe SSD as they leverage the byte-addressability of NVM.

8.3.2 Evaluation with Hadoop MapReduce

In this section, we evaluate MapReduce benchmarks.

8.3.2.1 TestDFSIO

Figure 8.11 shows the results of our evaluations with the TestDFSIO benchmark on three different clusters. For these experiments, we use both NVFS-BlkIO and NVFS-

MemIO-JNI. On OSU RI Cluster B, we run this test with 8GB data size on four DataNodes.

As observed from Figure 8.11(a), NVFS-BlkIO increases the write throughput by 3.3x over

153 PCIe SSD. The gain for NVFS-MemIO-JNI is 4x over HDFS using PCIe SSD over IPoIB

(32Gbps) for 8GB data size. For TestDFSIO read, our design NVFS-MemIO-JNI has up to 1.7x benefit over PCIe SSD on OSU RI Cluster B. The benefit for NVFS-BlkIO is up to 1.4x. As shown in Figure 8.11(b), on SDSC Comet with 32 DataNodes for 80GB data, our benefit is 2.5x for NVFS-BlkIO and 4x for NVFS-MemIO-JNI over SATA SSD, for

TestDFSIO write. Our gain for read is 1.2x. Figure 8.11(c) shows the results of TestDFSIO experiment for 8GB data on three DataNodes on OSU Nowlab, where we have 4x benefit over NVMe SSD for TestDFSIO write. Our gain for read is 2x.

8.3.2.2 Data Generation Benchmark (TeraGen)

Figure 8.12 shows our evaluations with the data generation benchmark TeraGen [93].

We use our design NVFS-MemIO-JNI here. Figure 8.12(a) shows the results of experi- ments that are run on four HDFS DataNodes on OSU RI Cluster B. As observed from the

figure, our design has 45% gain over HDFS using PCIe SSD with IPoIB (32Gbps). On

SDSC Comet, we vary the data size from 20GB on eight nodes to 80GB on 32 nodes. As observed from the Figure 8.12(b), our design reduces the execution time of TeraGen by up to 37% over HDFS using SATA SSD with IPoIB (56Gbps). As shown in Figure 8.12(c), on

OSU Nowlab with three DataNodes, our design has a gain of up to 44% over HDFS using

NVMe SSD with IPoIB (56Gbps). TeraGen is an I/O-intensive benchmark. The enhanced I/O-level designs using NVM and communication with the D-to-N over RDMA approach result in these performance gains across different clusters with different interconnects and storage configurations.

8.3.2.3 SWIM

In this section, we evaluate our design (NVFS-BlkIO) using the workloads provided in

Statistical Workload Injector for MapReduce (SWIM) [86]. The MapReduce workloads in

154 SWIM are generated by sampling traces of MapReduce jobs from Facebook. In our ex-

periment, we generate one such workload from historic Hadoop job traces from Facebook.

We run this experiment on eight DataNodes on SDSC Comet and generate 20GB data.

The generated workload consists of 50 representative MapReduce jobs. As observed from

Figure 8.13, 42 jobs are short duration with execution times less than 30s. For such jobs,

NVFS-BlkIO has a maximum benefit of 18.5% for job-42. Seven jobs have execution times

greater than (or equal to) 30s and less than 100s. For such jobs, our design has a maximum

gain of 34% for job-39. NVFS-BlkIO gains a maximum benefit of 37% for job-38; for this

job, the execution time of HDFS is 478s, whereas for NVFS-BlkIO, it is 300s. The overall

gain for our design is 18%.

Figure 8.13: SWIM (SDSC Comet) [x-axis: Job No. (1-50), y-axis: execution time (sec, log scale); HDFS (56Gbps) vs. NVFS-BlkIO (56Gbps)]

Table 8.4 summarizes the results for SWIM. The MapReduce jobs in SWIM read

(write) data from (to) HDFS. Therefore, our design, with enhanced I/O and communication performance, gains over HDFS with SATA SSD over IPoIB (56Gbps). Job-38 reads and writes a large amount of data from (to) HDFS (hdfs_bytes_read = 1073750104, hdfs_bytes_written = 17683184696) compared to the others. Therefore, the gain is also higher for this job.

Table 8.4: Summary of benefits for SWIM

  Job Duration   Execution Time (sec)   Number of Jobs   Benefit
  Short          < 30                   42               18.5%
  Medium         ≥ 30 and < 100         7                34%
  Long           ≥ 100                  1                37%

8.3.3 Evaluation with Spark

8.3.3.1 TeraSort

In this section, we evaluate our design (NVFS-MemIO-JNI) along with the cost-effective scheme using the Spark TeraSort workload. We run these experiments on four DataNodes on OSU RI Cluster B using the Spark over YARN mode. The data size is varied from 10GB to 14GB. The values of Spark executor and worker memory are set to 2GB each.

As observed from Figure 8.14(a), by buffering only the output of TeraSort to NVM, our design can reduce the execution time by up to 11% over HDFS using PCIe SSD with IPoIB

(32Gbps). The input data and other control information are stored on the PCIe SSD.

The reason behind this gain is the reduction of I/O and communication overheads while writing the TeraSort output to HDFS. Besides, TeraSort is shuffle-intensive and involves computation to sort the data. In our design, we use DRAM to mimic NVRAM behavior; this DRAM could otherwise be used for computation, as in default HDFS, and would not be consumed if real NVRAM hardware were available.

8.3.3.2 PageRank

In this section, we evaluate our NVFS-based burst buffer design using the Spark PageRank workload. We perform these experiments on eight DataNodes on SDSC Comet and compare the I/O times of PageRank when run over the NVFS-based burst buffer vs. Lustre.

Figure 8.14: Evaluation with Spark workloads. (a) Spark TeraSort with cost-effective scheme (OSU RI Cluster B); (b) Spark PageRank with NVFS-based burst buffer (SDSC Comet).

For this, we use our design NVFS-MemIO-JNI and PageRank is launched with 64 concurrent maps and reduces. Figure 8.14(b) illustrates the results of our evaluations.

As observed from the figure, our burst buffer design can reduce the I/O time (the time needed to write the output of different iterations to the file system) of Spark PageRank by up to 24% over Lustre. This is because, when Spark tasks write data to the NVFS-based burst buffer, most of the writes go to the same node where the tasks are launched. As a result, unlike Spark running over Lustre, the tasks do not have to write data over the network to the shared file system; rather, data is asynchronously flushed to Lustre. Besides, studies have already shown that, in terms of peak data access bandwidth, SSD is much faster than

Lustre [99]. The evaluations done in Chapter 5 and Chapter 6 also establish a similar trend.

NVM, being faster than SSD, offers much higher performance than Lustre. Therefore, our proposed burst buffer design can improve the I/O performance of Spark jobs over Lustre.

8.3.4 Evaluation with HBase

In this section, we evaluate our design (NVFS-BlkIO) with HBase using the YCSB benchmark. We perform these experiments on SDSC Comet. We vary the cluster size from

8 to 32 and the number of records is varied from 800K on 8 nodes to 3200K on 32 nodes.

Figure 8.15 shows the performance of our design for the 100% insert workload. As observed from the figure, our design reduces the average insert latency by up to 19% over

HDFS using SATA SSD with IPoIB (56Gbps); the increase in operation throughput is 21%.

Figure 8.16 shows the performance of our design for the 50% read, 50% update workload.

As observed from the figure, our design reduces the average read latency by up to 26%

over HDFS using SATA SSD; the average update latency is improved by up to 16%; the

increase in operation throughput is 20%. The proposed design flushes the WAL files to

NVM during the course of these operations. The WALs are also replicated using the D-to-N over RDMA approach. The reduction of I/O and communication times through the

proposed cost-effective solution helps improve the performance of HBase insert, update,

and read.

Figure 8.15: YCSB 100% insert (SDSC Comet). (a) Average Latency; (b) Throughput.

8.4 Related Work

There have been many research efforts to accelerate I/O performance through NVM. The

authors in [24] present a file system BPFS and a hardware architecture that are designed

around the properties of persistent, byte-addressable memory. In order to transmit atomic,

fine-grained updates to persistent storage, BPFS proposes a new technique called short-circuit shadow paging that helps BPFS provide strong reliability guarantees and better performance than traditional file systems. Pelley et al. [64] propose to leverage NVRAM as main memory to host all of the data sets for in-memory databases. The authors in [29] propose a new, light-weight POSIX-compliant file system, PMFS, that exploits the byte-addressability of NVRAM to avoid the overheads of block I/O. PMFS also leverages hardware features like processor paging and memory ordering. Fusion-io's NVMFS, which was formerly known as DFS [49], uses the virtualized flash storage layer for simplicity and performance. DFS ensures a much shorter datapath to the flash memory by leveraging the virtual to physical block allocation in the virtualized flash layer. SCMFS [106] is a file system for storage class memory and uses the virtual address space for implementation. In [67], the authors propose a hybrid file system to resolve the random write issues of SSDs by leveraging NVRAM. In [80], the authors perform a systematic performance study of various legacy Linux file systems with real-world workloads over NVM. In this work, the authors demonstrate that most of the POSIX-compliant file systems can be tuned to achieve comparable performance to PMFS. All these NVM-based file system studies concentrate on workloads over an unreplicated single machine. In Mojim [113], the authors propose RDMA- and NVM-based replication for storage subsystems in data centers. But this work focuses only on the write and replication part, whereas a lot of Big Data applications involve a significant amount of reads. None of the above-mentioned studies consider the special needs of a Big Data file system like HDFS. Besides, HDFS has to support a variety of upper-layer frameworks and middleware. Therefore, it is very important to maintain backward compatibility for HDFS even when exploiting the byte-addressability of NVMs. Moreover, the

high expense of NVMs makes cost-effectiveness a big issue to consider for Big Data applications. In this study, we consider all these aspects while designing HDFS over NVM and

RDMA.

Figure 8.16: YCSB 50% read, 50% update (Cluster B). (a) Average Read Latency; (b) Average Update Latency; (c) Throughput.

Both the Big Data and HPC communities have put significant effort into improving the performance of HDFS by leveraging resources from HPC platforms. In Chapter 3, we presented an RDMA-enhanced HDFS design to improve the performance of write and replication.

But this design kept the default HDFS architecture intact. In Chapter 4, a SEDA (Staged

Event-Driven Architecture) [104] based approach has been proposed to re-design the HDFS architecture. The authors in [21] have proposed in-memory data caching to maximize HDFS read throughputs. The Apache Hadoop community [96] has also proposed a centralized cache management scheme in HDFS, which is an explicit caching mechanism that allows users to specify paths to be cached by HDFS. Some recent studies [68, 69, 91] also pay attention to incorporating heterogeneous storage media (e.g., SSD and parallel file system) in HDFS. The authors in [68] deal with data distribution in the presence of SSD and

HDD nodes, whereas [69] caches data in SSD. Researchers in [91] present HDFS-specific optimizations for PVFS, and [70] proposes to store cold data of a Hadoop cluster to network-attached file systems. In Chapter 5, we proposed advanced data placement policies for

HDFS to efficiently utilize the heterogeneous storage devices on modern HPC clusters. In

Chapter 6, efficient data access strategies have been presented for Hadoop and Spark on

heterogeneous HPC clusters. But none of these studies has explored NVM technology

for HDFS. In this work, we, therefore, analyze the performance potential for incorporating

NVM in HDFS and finally propose designs to leverage the byte-addressability of NVMs

for HDFS communication and I/O.

8.5 Summary

In this work, we proposed a novel design of HDFS to leverage the byte-addressability of

NVM. Our design minimizes the memory contention between computation and I/O by allocating

memory from NVM for RDMA-based communication. We re-designed the HDFS storage

architecture to exploit the memory semantics of NVM for I/O operations. We also proposed

advanced and cost-effective techniques for accelerating Spark and HBase by intelligently

utilizing NVM in the underlying file system for only job outputs and Write Ahead Logs

(WALs), respectively. We further proposed enhancements to use our proposed NVFS as a

burst buffer layer for running Spark jobs over parallel file systems, such as Lustre. In-depth

performance evaluations show that our design can increase the read and write throughputs

of HDFS by up to 2x and 4x, respectively, over default HDFS. It also achieves up to 45%

benefit in terms of job execution time for data generation benchmarks. Our design reduces

the overall execution time of the SWIM workload by 18% while providing a maximum gain

of 37% for job-38. The proposed design can improve the performance of Spark TeraSort

by 11%. The performances of HBase insert, update, and read operations are improved by 21%, 16%, and 26%, respectively. Our burst buffer design can also improve the I/O performance of Spark PageRank by up to 24% over Lustre.

Chapter 9: Accelerating I/O Performance through RDMA-based Key-Value Store

Current large-scale HPC systems take advantage of enterprise parallel file systems like

Lustre that are designed for a range of different applications. Therefore, Hadoop MapReduce jobs are often run on top of the existing installations of the parallel file systems on

HPC clusters. For write-intensive applications, running the job over Lustre can also avoid the overhead of tri-replication that is present in HDFS [82]. Although parallel file systems are optimized for concurrent accesses by large-scale applications, write overheads can still dominate the run times of data-intensive workloads. Besides, there are applications that require a significant amount of reads from the underlying file system. The performances of such applications (e.g., Grep) are highly influenced by the data locality in the Hadoop cluster [101]. As a result, applications that have equal amounts of reads and writes suffer from poor write performance when run on top of HDFS; whereas, because of inadequate locality, reads perform sub-optimally when these applications run entirely on top of Lustre.

Such applications need a new design that efficiently integrates these two file systems and can offer the combined benefits of both architectures.

In HPC clusters, burst buffer systems are often used to handle the bandwidth limitation of shared file system access [55]; such systems temporarily buffer the bursty I/O and grad- ually flush datasets to the back-end parallel file system. In HPC systems, burst buffers are

typically used to buffer the checkpoint data from HPC applications. Checkpoints are read during application restarts only, and thus checkpointing is a write-intensive operation. On the other hand, there are a variety of Hadoop applications; some are write-intensive, some are read-intensive, while some have equal amounts of reads and writes. Besides, the output of one job is often used as the input of the next one; in such scenarios, data locality in the cluster impacts the performance of the latter to a great extent. Moreover, considering the fact that real data has to be stored to the file system through the burst buffer layer, ensuring fault tolerance of the data is much more important compared to that in the checkpointing domain.

In this work, we address these challenges by designing a key-value (RDMA-based

Memcached) store-based burst buffer system for Big Data Analytics. Considering the aspects of data locality and fault tolerance, we also propose different schemes to integrate

Hadoop with Lustre through this burst buffer. The following sections in this chapter describe the designs and evaluations.

9.1 Design

9.1.1 Proposed Deployment

Figure 9.1(a) depicts our proposed deployment to use a key-value store-based burst buffer system to integrate Hadoop with a parallel file system like Lustre on HPC clusters. In

HPC systems, the Hadoop jobs are launched on the compute nodes. The burst buffer pool is deployed as a separate set of servers. In our design, we use Memcached to develop the burst buffer, so the Memcached server daemon runs on each of these servers. The map and reduce tasks of the Hadoop job write their data to the burst buffer layer. The data is also

flushed to the parallel file system to guarantee persistence and fault tolerance. In order to

reduce the probability of data loss due to node failure, we propose to deploy the Hadoop

DataNodes and Memcached servers in two different failure domains.

The reason that we use a key-value store for buffering Hadoop I/O on top of the parallel

file system is that key-value stores provide flexible APIs to store the HDFS data packets

against corresponding block names (each HDFS block consists of a number of fixed-sized packets). The NameNode generates the block names, and they are unique per block. Also,

data buffered in this way can be read at memory I/O speed from the key-value store.

Figure 9.1: Deployment and Architecture. (a) Deployment; (b) Architecture.

9.1.2 Internal Architecture

Figure 9.1(b) demonstrates the internal architecture of our proposed design. We first design a burst buffer system using Memcached and then integrate Hadoop with Lustre through this buffer.

9.1.2.1 Key-Value Store-based Burst Buffer System

We use the RDMA-based Memcached to design a burst buffer system. For this, each

Memcached server must have a local SSD on it. The burst buffer system has two components: 1) RDMA-based libMemcached (client) and 2) RDMA-based Memcached (server).

Hadoop tasks write data to HDFS in order to maintain locality; HDFS then sends data

to the Memcached servers through RDMA-based libMemcached. The key is formed by

appending the packet sequence number to the corresponding block name, and the value is

the data packet. Thus, all the packets belonging to the same block have the same key prefix

(block name). We maintain the concept of blocks in the Memcached layer by modifying

the libMemcached hash function. Instead of hashing on the entire key, we hash only based

on the block name prefix of the key. In this way, all the data packets belonging to the same

block go to the same Memcached server.
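The sketch below illustrates this idea in Java; the class and method names, as well as the hash function, are hypothetical and are not the actual libMemcached implementation, which would apply an analogous modification inside its native hashing code. It only shows how a packet key could be built from the block name and sequence number and how hashing on the block-name prefix keeps all packets of a block on one server.

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch (names are hypothetical): packets of an HDFS block are keyed as
// "<blockName>_<packetSeqNo>", but server selection hashes only the block-name
// prefix so every packet of a block lands on the same Memcached server.
public class BlockPrefixPartitioner {

    // Build the key for one data packet of a block.
    public static String packetKey(String blockName, long packetSeqNo) {
        return blockName + "_" + packetSeqNo;
    }

    // Select a server by hashing only the block-name prefix of the key.
    public static int serverForKey(String key, int numServers) {
        int sep = key.lastIndexOf('_');
        String prefix = (sep > 0) ? key.substring(0, sep) : key;
        // Simple FNV-1a hash over the prefix bytes; libMemcached would use its
        // own hash, modified analogously to ignore the packet sequence number.
        int hash = 0x811C9DC5;
        for (byte b : prefix.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return Math.floorMod(hash, numServers);
    }
}
```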

The Memcached server side stores the data sent from the HDFS layer in the in-memory

slabs. Data can also be read by the HDFS clients (i.e., map or reduce tasks) from the

server. The burst buffer functionalities are introduced in the Memcached server by two

main components: Persistence Manager and Lustre Writer.

Persistence Manager: As demonstrated in Figure 9.2, the Persistence Manager

consists of a pool of threads attached to a queue. After completion of each set operation

memcached_set(), data is put into the queue. Whenever the first packet of a block arrives, a new file is created in the local SSD. This file is identified by the block name prefix in the key of the first packet. The writer threads get the data and the associated file pointer from the queue and write the data to the file. The file pointer is cached in a linked list so that subsequent packets from the same block can be written to this open file. The HDFS client sends the data packets belonging to a block sequentially to the Memcached server through

RDMA, and when the last packet arrives, the Persistence Manager makes sure that

the memcached_set() returns to the HDFS Client after all the packets, including the last one, are persisted. At this point, the block file is closed and the file pointer is removed from the linked list. Thus, in addition to having a persistent file per block in the SSD, the data packets belonging to the blocks are also kept in the in-memory slabs of the Memcached server. The updated data distribution strategy on the libMemcached side helps to preserve the block structure while data is written to the back-end SSDs on the Memcached servers and, subsequently, to the parallel file system.
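The following Java sketch mirrors the Persistence Manager logic described above. The actual component lives inside the C-based Memcached server; this version (with hypothetical class, field, and directory names) only illustrates the queue, writer-thread pool, and per-block file-pointer cache.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.*;

// Minimal sketch of the Persistence Manager: writer threads drain a queue of
// (block, packet) pairs and append each packet to a per-block file on the local
// SSD, keeping the stream open until the last packet of the block arrives.
public class PersistenceManager {

    private static class Packet {
        final String blockName; final byte[] data; final boolean last;
        Packet(String blockName, byte[] data, boolean last) {
            this.blockName = blockName; this.data = data; this.last = last;
        }
    }

    private final BlockingQueue<Packet> queue = new LinkedBlockingQueue<>();
    private final Map<String, FileOutputStream> openFiles = new ConcurrentHashMap<>();
    private final String ssdDir;

    public PersistenceManager(String ssdDir, int numWriters) {
        this.ssdDir = ssdDir;
        ExecutorService writers = Executors.newFixedThreadPool(numWriters);
        for (int i = 0; i < numWriters; i++) {
            writers.submit(this::drain);
        }
    }

    // Called after each memcached_set() completes on the server side.
    public void enqueue(String blockName, byte[] packet, boolean isLastPacket) {
        queue.add(new Packet(blockName, packet, isLastPacket));
    }

    private void drain() {
        try {
            while (true) {
                Packet p = queue.take();
                // Open (and cache) the per-block file on the first packet.
                FileOutputStream out = openFiles.computeIfAbsent(p.blockName, name -> {
                    try { return new FileOutputStream(ssdDir + "/" + name); }
                    catch (IOException e) { throw new RuntimeException(e); }
                });
                synchronized (out) {
                    out.write(p.data);
                    if (p.last) {            // last packet: persist and close the block file
                        out.getFD().sync();
                        out.close();
                        openFiles.remove(p.blockName);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```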

Lustre Writer: The Lustre Writer consists of a pool of threads attached to a

queue. This module is responsible for forwarding the data to Lustre. In our design, we

consider writing to the parallel file system in both asynchronous and synchronous manners.

The Lustre Writer operates according to the chosen mode.
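A minimal Java sketch of this component is given below, assuming a hypothetical class name and Lustre mount path; it shows a pool of threads draining a queue of block data, with the producer either returning immediately (asynchronous mode) or waiting for the write to complete (synchronous mode).

```java
import java.io.FileOutputStream;
import java.util.concurrent.*;

// Minimal sketch of the Lustre Writer: a thread pool writes queued block data to
// the Lustre mount point; the caller blocks only in synchronous mode.
public class LustreWriter {

    public enum Mode { ASYNCHRONOUS, SYNCHRONOUS }

    private final ExecutorService pool;
    private final String lustreDir;
    private final Mode mode;

    public LustreWriter(String lustreDir, int numThreads, Mode mode) {
        this.lustreDir = lustreDir;
        this.mode = mode;
        this.pool = Executors.newFixedThreadPool(numThreads);
    }

    // Forward one block's data to Lustre according to the configured mode.
    public void write(String blockName, byte[] data) throws Exception {
        Future<?> done = pool.submit(() -> {
            try (FileOutputStream out =
                     new FileOutputStream(lustreDir + "/" + blockName, true)) {
                out.write(data);
                out.getFD().sync();        // ensure the data reaches the file system
                return null;
            }
        });
        if (mode == Mode.SYNCHRONOUS) {
            done.get();                     // block until the Lustre write completes
        }
    }
}
```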

In this study, we also consider deployment of our proposed design on different clusters.

Therefore, if there is no SSD node on the cluster, the Persistence Manager in the

Memcached server is not enabled. For such scenarios, we propose different modes to

deploy our design. We support these modes via different integration schemes of Hadoop

with Lustre through the buffer layer. We discuss these schemes in detail in Section 9.1.3.

Also, data can be sent to the parallel file system by either the burst buffer or the HDFS layer

directly based on the chosen scheme. So, Lustre Writer can be enabled either in the

Memcached server or in the HDFS layer.

9.1.2.2 Integration of Hadoop with Lustre through Key-Value Store-based Burst Buffer

The interaction of Hadoop with the buffer layer is controlled by the following two major

components as depicted in Figure 9.1(b):

Figure 9.2: Persistence Manager.

I/O Forwarding Module: The I/O Forwarding Module in the HDFS Client is responsible

for interaction with the Memcached layer. It creates a libMemcached client object and

invokes the memcached_set() operation via JNI to pass the data to the Memcached

server. The JNI layer bridges the Java-based HDFS with the native libMemcached library.

While reading the buffered data, the HDFS Client uses the memcached_get() API.
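A sketch of this JNI bridge is shown below. All names (the native library, class, and method signatures) are hypothetical illustrations of the pattern described above, not the actual implementation; the real module wraps the RDMA-based libMemcached calls in native code.

```java
// Minimal sketch of the I/O Forwarding Module: native methods wrap
// memcached_set()/memcached_get(), and the Java side forwards each HDFS data
// packet keyed by block name and packet sequence number.
public class MemcachedForwarder {

    static {
        System.loadLibrary("memcachedjni"); // hypothetical JNI wrapper library
    }

    // Native wrappers around libMemcached calls (signatures are illustrative).
    private native long nativeCreateClient(String serverList);
    private native int nativeSet(long clientHandle, String key, byte[] value);
    private native byte[] nativeGet(long clientHandle, String key);

    private final long clientHandle;

    public MemcachedForwarder(String serverList) {
        this.clientHandle = nativeCreateClient(serverList);
    }

    // Forward one data packet of an HDFS block to the burst buffer.
    public void forwardPacket(String blockName, long seqNo, byte[] packet) {
        nativeSet(clientHandle, blockName + "_" + seqNo, packet);
    }

    // Read a buffered packet back (used by map/reduce tasks on a local miss).
    public byte[] readPacket(String blockName, long seqNo) {
        return nativeGet(clientHandle, blockName + "_" + seqNo);
    }
}
```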

BlockSweeper: Each DataNode has a BlockSweeper that wakes up periodically and removes data from the in-memory slabs of the Memcached servers. Typically, the

BlockSweeper kicks in at the end of a job or when no job is actively running on the cluster. The interval at which the BlockSweeper should wake up is configurable. If the

available memory on a Memcached server runs out while a job is running, data eviction

takes place following the LRU policy of Memcached.
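The periodic behavior of the BlockSweeper can be sketched as follows; the class, interfaces, and bookkeeping are hypothetical and only illustrate a configurable sweep interval that deletes completed blocks' keys from the Memcached slabs.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the BlockSweeper: a periodic task on each DataNode that
// removes the keys of completed blocks from the Memcached in-memory slabs.
public class BlockSweeper {

    public interface KeyValueClient {            // thin facade over the JNI layer
        void delete(String key);                 // maps to memcached_delete()
    }

    public interface SweptBlockTracker {         // hypothetical bookkeeping of sweepable blocks
        List<String> completedBlocks();
        List<String> packetKeys(String blockName);
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public BlockSweeper(KeyValueClient client, SweptBlockTracker tracker,
                        long intervalSeconds) {
        // The sweep interval is configurable, as described in the text.
        scheduler.scheduleAtFixedRate(() -> {
            for (String blockName : tracker.completedBlocks()) {
                for (String key : tracker.packetKeys(blockName)) {
                    client.delete(key);
                }
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
```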

9.1.3 Proposed Schemes to Integrate Hadoop with Lustre

Considering the aspects of data locality and fault tolerance, our design supports three

different modes: (1) Asynchronous Lustre Write (ALW), (2) Asynchronous Disk Write

(ADW), and (3) Bypass Local Disk (BLD). In this section, we propose different schemes

to integrate Hadoop with Lustre through the key-value store-based burst buffer system proposed in Section 9.1.2.1 to provide support for different modes.

Figure 9.3: Hadoop Write Flow. (a) ALW (Asynchronous Lustre Write); (b) ADW (Asynchronous Disk Write); (c) BLD (Bypass Local Disk).

9.1.3.1 Asynchronous Lustre Write (ALW)

Figure 9.3(a) shows the integration scheme to support ALW mode. In this scheme,

MapReduce jobs are scheduled on the compute nodes. Each map or reduce task essentially

launches an HDFS Client that writes one copy of data to the local storage on the DataNode

and the other copy is transferred to the Memcached server through the I/O Forwarding

Module. The DataNode forwards the data to the parallel file system in an asynchronous

manner through the Lustre Writer. The DataNode, while writing to Lustre, keeps track of the pending writes, and when a write completes, it informs the NameNode.

The I/O Forwarding Module writes the data packets of HDFS blocks to Memcached. Each Memcached server is backed by SSD or NVRAM. Therefore, on completion of the memcached_set() operation, the data is given to the Persistence Manager

that ensures data persistence by synchronously writing the data to SSD.

In this way, Memcached servers host an SSD copy along with the in-memory copy of a

data block. In this scheme, the Lustre Writer is activated on the DataNode side. The

Lustre Writer can also be enabled on the Memcached servers instead of the DataNodes. But the advantage of keeping it in the DataNode is that the DataNode can inform the

NameNode on completion of the asynchronous Lustre writes.

The map and reduce tasks of a MapReduce job read data from the local storage whenever possible. This ensures the benefit of data locality for Hadoop jobs. If, on the other hand, data is not available locally, they read from the in-memory slabs of the Memcached servers through the memcached_get() operation. In case of failures (data is not available in local storage or Memcached), data is read from Lustre.
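This fallback order can be sketched in Java as below; the class, directory layout, and naming of replica files are hypothetical and assume the MemcachedForwarder sketch shown earlier, serving only to illustrate the local-first, buffer-second, Lustre-last read path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Minimal sketch of the hybrid read path: try the local replica first, then the
// Memcached burst buffer (memcached_get() via the JNI forwarder), and finally
// fall back to the copy flushed to Lustre.
public class HybridBlockReader {

    private final String localStorageDir;
    private final String lustreDir;
    private final MemcachedForwarder forwarder;   // from the earlier sketch

    public HybridBlockReader(String localStorageDir, String lustreDir,
                             MemcachedForwarder forwarder) {
        this.localStorageDir = localStorageDir;
        this.lustreDir = lustreDir;
        this.forwarder = forwarder;
    }

    public byte[] readPacket(String blockName, long seqNo) throws IOException {
        // 1. Local storage: best locality, no network traffic.
        Path local = Paths.get(localStorageDir, blockName + "_" + seqNo);
        if (Files.exists(local)) {
            return Files.readAllBytes(local);
        }
        // 2. Burst buffer: in-memory slabs of the Memcached servers over RDMA.
        byte[] buffered = forwarder.readPacket(blockName, seqNo);
        if (buffered != null) {
            return buffered;
        }
        // 3. Failure case: read the persistent copy from Lustre.
        return Files.readAllBytes(Paths.get(lustreDir, blockName + "_" + seqNo));
    }
}
```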

This scheme meets the design challenges as follows:

1. Improves performance by hiding the latency of parallel file system access through

the high performance burst buffer system. The Memcached-based burst buffer layer

uses RDMA-based communication and SSD-based data persistence.

2. Improves data locality compared to MR jobs running over Lustre by placing data

in the local storage of the DataNodes. By reading data from the local storage and

Memcached, this design can achieve much better read performance.

3. Provides fault tolerance by storing one replica on local disk and one in the burst

buffer; these two replicas are in two different failure domains. As presented in [113],

two replicas on two different failure domains are enough for fault tolerance in data

centers. In our design, data is also written to the parallel file system asynchronously.

9.1.3.2 Asynchronous Disk Write (ADW)

Figure 9.3(b) shows our proposed scheme to support the ADW mode.

In this design, the HDFS Client writes one copy of data to the local disks on the DataNode asynchronously. The second copy of data is sent to the Memcached server through

the I/O Forwarding Module by the HDFS Client. In this scheme also, we use the

updated hash function to hash the key-value pairs based on the block name prefix rather

than the entire key so that all the data packets belonging to the same block are sent to the

same Memcached server.

In this scheme, we assume that the Memcached servers do not have any local SSD; this

scheme is applicable on clusters that have no SSD. So the Persistence Manager in

the Memcached server is not activated by this scheme. Also, this scheme prioritizes data

fault tolerance over write performance. Therefore, the Lustre Writer on the Memcached server writes data synchronously to Lustre. After completion of each set operation

using memcached_set(), data is put into the queue attached to the Lustre Writer.

Whenever the first packet of a block arrives, a new file is created in Lustre. Just like the

Persistence Manager as described in Section 9.1.2.1, the Lustre Writer also caches the file pointers. The pool of threads in the Lustre Writer increases overlapping between Memcached operations and Lustre I/O. At the same time, it guarantees that each data block is persisted synchronously.

The map and reduce tasks of a MapReduce job read data from the local storage or

Memcached. In case of failures (data is not available in local storage or Memcached), data is read from Lustre.

This scheme meets the design challenges as follows:

1. It can reduce I/O bottlenecks by asynchronously writing data to local disks.

2. Improves data locality compared to MapReduce jobs running over Lustre by placing

data in the local storage of the DataNodes. By reading data from the local storage

and Memcached, this design can achieve better read performance.

3. This scheme synchronously writes data to Lustre. Thus, it can offer the same level of

fault tolerance as Lustre.

9.1.3.3 Bypass Local Disk (BLD)

Figure 9.3(c) shows the internal design scheme to support the BLD mode. There are

HPC clusters that have a large amount of memory on the compute nodes while the local disk space is limited. For this type of cluster, application performance can be largely improved by avoiding disk I/O. Therefore, in this scheme, we propose to place one copy of data in RAM Disks on the compute nodes in a greedy manner. Since dedicating enough memory for storage may not always be feasible for the application, we keep a threshold on RAM Disk usage. When this usage threshold is reached, data is forwarded to the

Memcached server by the I/O Forwarding Module in the DataNode. We assume that the Memcached servers are not backed by SSD and that the key-value pairs are stored in the in-memory slabs only. The other copy of data is synchronously written to Lustre by the Lustre Writer in the HDFS Client. Thus, in this scheme, the Lustre Writer and Persistence Manager on the Memcached server are not enabled, and the scheme uses the key-value store as a buffer so that the stored data can be read at memory speed. Since persistence of the data in Memcached is not needed, we use the default hash function of libMemcached (hash on the entire key) here to distribute the data across different servers.

For this scheme, we do not send all the data to Memcached; only the data that cannot fit into the local RAM Disks is sent to the Memcached servers. Therefore, in order to make all the data fault-tolerant, we send it to Lustre directly from the HDFS Client side rather than from the Memcached servers.
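The threshold-based decision in BLD can be sketched as follows; the class name, threshold handling, and use of file-system statistics are hypothetical and only illustrate the adaptive choice between RAM Disk and Memcached described above.

```java
import java.io.File;

// Minimal sketch of the BLD placement decision: a block's locality copy goes to
// the local RAM Disk while its usage stays under a configured threshold; once
// the threshold is reached, the copy is forwarded to Memcached instead. The
// second copy is always written synchronously to Lustre by the HDFS client.
public class BldPlacementPolicy {

    public enum Target { RAM_DISK, MEMCACHED }

    private final File ramDiskRoot;
    private final double usageThreshold;          // e.g., 0.8 = use at most 80% of the RAM Disk

    public BldPlacementPolicy(String ramDiskPath, double usageThreshold) {
        this.ramDiskRoot = new File(ramDiskPath);
        this.usageThreshold = usageThreshold;
    }

    // Decide where the locality copy of the next block should go.
    public Target chooseTarget(long blockSizeBytes) {
        long total = ramDiskRoot.getTotalSpace();
        long used = total - ramDiskRoot.getUsableSpace();
        boolean fits = (used + blockSizeBytes) <= (long) (usageThreshold * total);
        return fits ? Target.RAM_DISK : Target.MEMCACHED;
    }
}
```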

In this scheme, instead of accessing the parallel file system, MapReduce jobs read data from the local storage or the Memcached servers.

This scheme meets the design challenges as follows:

1. It can entirely bypass disk I/O on local storage by adaptively switching to place data

to Memcached rather than slower local disks.

2. By reading data from the local storage and Memcached, this design can achieve much

better data locality and read performance compared to MapReduce run over Lustre.

3. This scheme offers the same level of fault tolerance as Lustre by synchronously writing

to Lustre.

9.2 Experimental Results

In this section, we perform evaluations with the proposed burst buffer design while

integrating Hadoop with Lustre through it using our proposed schemes.

9.2.1 Experimental Setup

The experiments are performed on three different clusters. They are:

OSU RI Cluster A: The cluster configurations are described in Chapter 3.2.

OSU RI Cluster B: The cluster configurations are described in Chapter 3.2.

TACC Stampede: The cluster configurations are described in Chapter 4.2.

9.2.2 Evaluation with TestDFSIO

First, we determine the optimal number of Memcached servers required to evaluate our designs on OSU RI. For this, we vary the number of DataNodes from 10 to 14 on OSU RI

Cluster A and the number of Memcached servers from 8 to 4, respectively. For Memcached server deployment, we use the large memory nodes on OSU RI Cluster B, and we use 20GB of memory for Memcached on each server. We run the TestDFSIO test with 100GB data size using 80 maps. For these experiments, we use the ALW mode proposed in Section 9.1.3.1.

Table 9.1: Determining the optimal number of Memcached servers

DataNodes    Memcached Servers    Write Throughput    Read Throughput
10           8                    12.72 MBps          237.5 MBps
12           6                    13.99 MBps          240.6 MBps
14           4                    14.03 MBps          67.8 MBps

As observed from Table 9.1, we can achieve the highest read throughput while using

12 DataNodes with 6 Memcached servers. In the ALW mode, we want to buffer the entire dataset in the Memcached servers. Therefore, while using 4 Memcached servers, each with 20GB of memory, the entire dataset does not fit in memory, and earlier key-value pairs get dropped to make space for the new ones. The write throughput in this case is the best because it has the largest number of DataNodes, and ALW writes one copy of data to local storage and the other to the Memcached servers. But during TestDFSIO read, this experiment reads the entire data from the DataNodes (if data is not available locally, it is read from another DataNode over IPoIB (32Gbps)), since not all the data is available in the Memcached servers. On the other hand, while using 12 DataNodes with 6 Memcached servers, the entire dataset can

fit in memory, and with this combination of servers, we achieve the highest read throughput with near-optimal write performance. The write throughput in this case is slightly less than that for the 14:4 case. This is because, due to fewer DataNodes, the I/O overhead during write is higher here. But since data is read from both the local storage and the Memcached servers over RDMA, the read throughput is much higher in this case. Therefore, in our subsequent experiments on OSU RI, we maintain the 2:1 ratio of DataNodes to Memcached servers and choose the dataset such that the entire data can fit in memory. On TACC Stampede, we use a 4:1 ratio of DataNodes to Memcached servers.

Besides, in our design, we follow a hybrid approach for data reads. We read data from local storage when available. In case data is not available locally, we read it from the

Memcached servers. The throughput of 40GB data read while using 16 DataNodes and 8

Memcached servers following this hybrid approach is 390.48 MBps. But if data is entirely read from the Memcached servers, this throughput reduces to 304.67 MBps. The reason behind this is that, even though in-memory reads reduce the I/O overheads, when data is read entirely from Memcached, the read throughput is bounded by the aggregated bandwidth of the Memcached servers. But reading data locally helps reduce the contention for network bandwidth and also scales with the number of DataNodes.

Figure 9.4 shows the results of our evaluations on OSU RI. While running these experiments with our designs, we use 16 nodes on OSU RI Cluster A as Hadoop DataNodes and 8 large memory nodes as Memcached servers from OSU RI Cluster B. In order to have a fair distribution of resources, we use 24 DataNodes while running TestDFSIO on HDFS and

Lustre. In all these cases, we perform these experiments with 64 maps and vary the data sizes from 60GB to 100GB. As observed from Figure 9.4, our ALW mode improves the write throughput by 2.6x over HDFS and 1.5x over Lustre for 100GB data size. Compared to HDFS, our approach can avoid the overhead of replication on local disks. Also, we hide the latency of shared file system access by using the Memcached-based burst buffer layer.

For ADW and BLD, we do not see significant gain in write performance compared to

Lustre. This is because Lustre writes only a single copy of data, whereas, in both ADW and

BLD, we write additional data copies on local storage to introduce locality. The write performance of BLD is slightly better than that of ADW because, unlike ADW, BLD can avoid the overhead of Memcached writes as long as the data fits in the local RAM Disks.

In terms of read performance, all three modes perform better than both HDFS and Lustre

with the improvement being up to 8x. The read throughput of ADW is slightly higher than that of BLD on Cluster A. This is because the RAM Disk size on Cluster A is small, and thus locality is reduced during reads in the BLD mode. However, due to RDMA operations over high performance interconnects, the difference in read performance is not significant between these two modes.

Figure 9.4: Evaluation of TestDFSIO (OSU RI). (a) TestDFSIO Write; (b) TestDFSIO Read.

Figure 9.5: Evaluation of TestDFSIO (TACC Stampede). (a) TestDFSIO Write; (b) TestDFSIO Read.

Figure 9.5 shows the results of our evaluations with TestDFSIO write and read on TACC

Stampede. While running these experiments with our designs, we use 8 DataNodes for

Hadoop and 2 servers for Memcached. For experiments with 16 and 32 DataNodes, the numbers of Memcached servers used are 4 and 8, respectively. Each Memcached server

uses 25GB of memory. In order to have a fair distribution of resources, we use 10, 20, and 40 DataNodes while running TestDFSIO on HDFS and Lustre. In these experiments, we vary the data sizes from 50GB on 10 nodes to 200GB on 40 nodes. As observed from

Figure 9.5(a), BLD improves the write throughput by 30x over HDFS. On TACC Stampede, local disk performance is very poor. Thus, bypassing the local disks leads to significant improvement over HDFS. Compared to Lustre, we do not see any obvious improvement in write performance. This is because, in our design (BLD), we write one copy of data to local RAM Disks so that subsequent readers can achieve sufficient data locality. Another copy is written to Lustre for fault tolerance. In terms of read throughput, our design has up to 64% improvement over Lustre and 15x over HDFS.

These experiments show that ALW performs best on OSU RI and BLD performs best on TACC Stampede (BLD writes faster than ADW and requires less memory for Memcached). Therefore, in our subsequent experiments, we use ALW on OSU RI and BLD on TACC Stampede.

9.2.3 Evaluation with RandomWriter and Sort

In this section, we evaluate our designs with the RandomWriter and Sort benchmarks. RandomWriter is write-intensive, while Sort has equal amounts of reads and writes.

Figure 9.6: Evaluation of RandomWriter and Sort (OSU RI). (a) RandomWriter; (b) Sort.

Figure 9.7: Evaluation of RandomWriter and Sort (TACC Stampede). (a) RandomWriter; (b) Sort.

Figure 9.6 shows the performances of RandomWriter and Sort on the OSU RI cluster using the ALW mode. We perform these experiments on 24 nodes for HDFS and Lustre. For our design, we use 16 DataNodes and 8 Memcached servers, each using 20GB of memory and backed by SSD. We run 4 concurrent maps per host and vary the data size from 40GB to 80GB. Therefore, for HDFS and Lustre, a total of 96 maps write the same amount of data as 64 maps in our design. As observed from the figure, even with less concurrency,

ALW reduces the execution time of RandomWriter by up to 36% over Lustre and 39% over

HDFS. As observed from Figure 9.6, our design improves Sort performance by 20% over

Lustre and by 18% over HDFS.

Table 9.2: Performance comparison of RandomWriter and Sort (execution time in seconds) with 4 and 6 maps per host using ALW

Data Size (GB)   RandomWriter ALW-4   RandomWriter ALW-6   Sort ALW-4   Sort ALW-6
40               79                   68                   301          318
60               107                  94                   623          638
80               123                  112                  948          967

As observed from Table 9.2, with a total of 96 maps (6 maps per host) like HDFS and

Lustre, our design can further reduce the execution time of RandomWriter. Increasing the concurrency does not help improve the performance of Sort. This is because the major bottleneck of Sort lies in the shuffle phase, and with increasing concurrency, the level of locality reduces. Even then, our design has up to 15% gain over HDFS and up to 19% gain over Lustre.

Figure 9.7 shows the performances of RandomWriter and Sort on TACC Stampede using the BLD mode. We perform these experiments on 10, 20, and 40 nodes for HDFS and Lustre. For our design, we use 8, 16, and 32 DataNodes with 2, 4, and 8 Memcached servers, respectively, each using 28GB of memory. We run 4 maps per host and vary the data size from 50GB to 200GB. Therefore, for HDFS and Lustre on 20 nodes, a total of 80 maps write the same amount of data as 64 maps in our design (16 DataNodes). As observed from the figure, BLD reduces the execution time of RandomWriter by up to 5x over HDFS.

But it does not improve the execution time over Lustre. This is because BLD writes one additional copy of data in local RAM Disks compared to Lustre. But BLD improves the performance of Sort by 28% over Lustre and by 19% over HDFS.

Table 9.3: Performance comparison of RandomWriter and Sort (execution time in seconds) with 4 and 5 maps per host using BLD

Data Size (GB)   RandomWriter BLD-4   RandomWriter BLD-5   Sort BLD-4   Sort BLD-5
50               59                   53                   677          721
100              61                   59                   768          801
200              62                   59                   858          893

In these evaluations, we run 4 maps per host. Therefore, for HDFS and Lustre more

concurrent maps run than that in our design. Running BLD with 5 concurrent maps per host

can make the total number of maps equal in all the cases. Table 9.3 shows the performance

of BLD with 4 maps and 5 maps per host.

As observed from the table, with an increased number of concurrent maps, the execution

time of RandomWriter reduces. But increasing the concurrency does not help improve the

performance of Sort. This is because the major bottleneck of Sort lies in the shuffle phase,

and with increasing concurrency, the level of locality reduces. Even then, our design has

up to 15% gain over HDFS and up to 25% gain over Lustre.

Figure 9.8: Evaluation with PUMA [66]. (a) Workloads (OSU RI); (b) Workloads (TACC Stampede).

9.2.4 Evaluation with PUMA

In this section, we evaluate the performance of different PUMA workloads like SequenceCount, RankedInvertedIndex, and Histogram rating. On the OSU RI cluster, we use the

ALW mode, and on TACC Stampede, we use BLD.

On OSU RI, our design has up to 34.5% improvement over Lustre and 40%

over HDFS for SequenceCount. For RankedInvertedIndex, the gain is 48% over HDFS and

27.3% over Lustre. Our gain for Histogram rating is up to 17% on this cluster.

On the other hand, on TACC Stampede, our design has up to 13% improvement over

Lustre and 23% over HDFS for SequenceCount. For RankedInvertedIndex, the gain is 34%

over HDFS and 16% over Lustre. Our gain for Histogram rating is up to 16% over Lustre.

The SequenceCount workload has almost thrice as much output as input. Thus, it is a write-intensive application. RankedInvertedIndex also has equal amounts of reads and writes. So our design has significant gains compared to both HDFS and Lustre for these workloads, because, in addition to reducing the I/O overheads, our design improves locality, which enhances the read performance. The amount of writes is less than that of reads for Histogram rating. But our design gains mainly due to improved read performance in

this case. However, Lustre on TACC Stampede uses the InfiniBand (56Gbps) verbs interface,

whereas on OSU RI, the interconnect to access Lustre is IPoIB (32Gbps). The Lustre installation on TACC Stampede is also much larger, with a higher number of I/O nodes than that on

OSU RI. TACC Stampede also has large memory on the compute nodes. For these reasons,

the performance of the workloads over Lustre is better on TACC Stampede. Therefore,

on OSU RI, our design has a higher gain over Lustre compared to that on TACC Stampede.

9.3 Related Work

Singh et al. [83] presented a dynamic caching mechanism for Hadoop by integrating

HDFS with Memcached. They discussed several practical design choices for the integration and chose to use Memcached as the metadata store for objects at HDFS DataNodes. However, they focused on HDFS read operation performance only. In [46], we

presented an in-memory design for HDFS using Memcached that improves both the read

and write performances. Researchers in [100] proposed a Memcached-based burst buffer

for HPC applications running over parallel file systems. In [73], the authors proposed an

RDMA-based design of YARN MapReduce over Lustre that stores the intermediate data of

MapReduce jobs in Lustre. This work also shows that Lustre read performance degrades with increasing concurrency. Even though HPC systems are running Big Data workloads over parallel file systems, very few studies focus on designing a burst buffer system for

Big Data. In this work, we propose a burst buffer for Big Data over parallel file systems.

9.4 Summary

In this work, we proposed to integrate Hadoop with Lustre through an RDMA-based key-value store. We designed a burst buffer system for Big Data analytics applications using RDMA-based Memcached [48] and integrated Hadoop with Lustre through this high-performance buffer layer. We also presented three schemes for integrating Hadoop with

Lustre, considering different aspects of I/O, data locality, and fault tolerance. Performance evaluations show that our design can improve the performance of TestDFSIO write by up to 2.6x over HDFS and 1.5x over Lustre. The gain in read throughput is up to 8x.

The execution time of Sort is reduced by up to 28% over Lustre and 19% over HDFS.

The performances of PUMA workloads like SequenceCount, RankedInvertedIndex, and

Histogram rating are also significantly improved by our design.

Chapter 10: Future Research Directions

Distributed file systems and I/O middleware are widely used in a range of diverse fields including Big Data, Deep and Machine Learning, Bio-informatics, Climate Change Prediction, etc. Consequently, the designs proposed in this dissertation are applicable to the

I/O subsystem of a wide variety of domains and applications. In this chapter, we discuss some of the future research directions that can be explored as a follow-up to the work done in this thesis.

10.1 Designing Kudu over RDMA and NVM

Typically, data stored in HDFS is static, as HDFS does not support random write or in-place update operations. This type of storage is sufficient for most of the Big Data processing frameworks like Hadoop MapReduce and Spark. But for fast analytics, OLAP workloads in particular, storage engines with low-latency random access support are emerging. One such example is Kudu [41], which supports fast sequential as well as random access to the underlying storage devices.

As a future work, Kudu can be enhanced with advanced HPC features similar to the design concepts of SOR-HDFS, Triple-H, and NVFS. The major design opportunities and challenges in Kudu are:

Figure 10.1: Network Architecture of Kudu, Courtesy [11]

RDMA-based Communication: Kudu stores data in tables, and a table is split into segments called tablets. As shown in Figure 10.1, Kudu replicates its tablets to meet the availability SLAs of different applications. The RDMA-based replication with the SEDA-based overlapping approach can be applied here for faster data replication.

Figure 10.2: Impact of interconnects and storage on Kudu operations. (a) Kudu insert over different interconnects; (b) Kudu insert with SSD-based log flushing.

To perform some initial experiments, we design a micro-benchmark that creates a table in Kudu and inserts records into that table. Each of these records gets replicated to three

tablet servers. These experiments are performed on four tablet servers (one master) on

OSU RI2. As observed from Figure 10.2(a), as the underlying interconnect is switched from 40GigE Ethernet to IPoIB (100Gbps), the total insert time to Kudu reduces by up to

20%. This proves that an RDMA-based design of Kudu can further accelerate the insert operation performance.
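A sketch of the kind of micro-benchmark described above is shown below, written against the Kudu Java client API. The master address, table name, schema, hash partitioning, and record count are placeholder assumptions; the benchmark used for the results in Figure 10.2 may differ in its details.

```java
import java.util.Arrays;
import java.util.Collections;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

// Create a Kudu table with three replicas and time a batch of record inserts,
// each of which is replicated to the tablet servers.
public class KuduInsertBenchmark {
    public static void main(String[] args) throws Exception {
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master:7051").build();

        Schema schema = new Schema(Arrays.asList(
            new ColumnSchema.ColumnSchemaBuilder("key", Type.INT64).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build()));

        client.createTable("insert_bench", schema,
            new CreateTableOptions()
                .setNumReplicas(3)                                   // tri-replicated tablets
                .addHashPartitions(Collections.singletonList("key"), 4));

        KuduTable table = client.openTable("insert_bench");
        KuduSession session = client.newSession();

        long start = System.currentTimeMillis();
        for (long i = 0; i < 100_000; i++) {                         // placeholder record count
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            row.addLong("key", i);
            row.addString("value", "record-" + i);
            session.apply(insert);
        }
        session.flush();
        System.out.println("Total insert time (ms): "
                           + (System.currentTimeMillis() - start));

        session.close();
        client.close();
    }
}
```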

Faster I/O with High Performance Storage Devices: After the data is received by the server, it is written to the underlying storage device. We, therefore, believe that enhancing the storage engine with advanced data placement policies along with high performance storage devices will significantly reduce the cost of data distribution. As observed from

Figure 10.2(b), by storing only the Kudu write-ahead logs to SSD, the insert operation performance can be improved by up to 16% over IPoIB (100Gbps). The performance over

40GigE is also improved by 14%. Moreover, the support of random write and update operations in Kudu promises new challenges for incorporating NVM in the architecture. NVM offers near-DRAM performance. So placing the performance-critical data in NVM with memory-access mode can be critical for real-time applications. Researchers can further explore these directions.

10.2 Study the Impact of the Proposed Data Placement and Access Policies on Energy Efficiency

In Chapter 5 and Chapter 9, we have proposed a variety of data placement and access policies for HDFS on HPC clusters with heterogeneous storage. These policies are targeted to maximize performance and, hence, prioritize utilizing high performance storage devices such as RAM Disk, SSD, NVM, etc. over Disk or the parallel file system. But energy efficiency also becomes a critical factor for large-scale HDFS deployment. Moreover, storage devices like SSD and NVM have very impressive power consumption characteristics. As

a future work, one could evaluate the impact of our proposed policies on energy efficiency for different upper-level middleware and applications. For example, evaluating the policies for MapReduce vs. Spark implementations of the same workload can be an interesting direction. There have been quite a few research efforts on energy efficiency in the Hadoop ecosystem [51–53]. In most of these works, the authors focus on cluster-level power supply control for energy efficiency. But very few studies have considered exploring the data placement and access policies for energy efficiency. One can also characterize and further optimize these policies for different upper-level middleware and applications.

10.3 Accelerating Erasure Coding with RDMA and KNL/GPU

Erasure coding is gaining momentum in the community for replicated systems like

HDFS, Memcached, etc. Even though erasure coding can save a lot of storage space, it degrades the read performance by reducing data locality. This is because a single block is typically striped across multiple nodes in most of the erasure coding techniques [7]. It also involves a large CPU overhead (computation) during reconstruction of the data blocks after node failures. Read performance can be accelerated by using RDMA to combine the stripes faster while reading the desired blocks. The encoding and decoding performance can be improved by using KNL or GPUs, which can significantly speed up the complex computations involved in these phases. Although researchers have proposed a range of algorithms [77,

107, 112] for erasure coding, very few of them focus on exploiting HPC technologies and resources to speed up the process. Researchers can explore these directions in the future.

10.4 Designing High Performance I/O Subsystem for Deep/Machine Learning Applications

On HPC clusters, Deep/Machine Learning applications operating on terabytes of data suffer from huge I/O bottlenecks due to accessing disk-based storage or a shared parallel file system. Existing solutions use in-memory double buffering schemes to reduce the overhead of bringing the data from the disk/file system to memory [75]. Although these techniques can accelerate the computation pipeline of Deep/Machine Learning applications, they still limit scale-up as they reduce the available memory for the current computation. An

NVRAM/SSD-based burst buffer can help the application by hiding the latency of shared

file system access. The NVRAMs can be located on the compute nodes so that data from the intermediate stages can be buffered locally. On the other hand, if a number of nodes on the cluster host NVRAM, those nodes can act as the burst buffer servers. The initial phases of the application can also be accelerated by pre-fetching and caching the input data to the burst buffer from the back-end file system. This can introduce overlapping among different phases of computation and I/O and thus, keep the execution pipeline full. Researchers can explore these designs and their evaluations in the future.

Chapter 11: Conclusion and Contributions

HDFS being the underlying storage engine for Hadoop MapReduce, Spark, HBase, etc., its performance and scalability are of supreme importance for a variety of Big Data applications. Even though HDFS was initially designed for commodity hardware, it is increasingly being used on HPC platforms. The outstanding performance requirements of HPC systems make the I/O and communication bottlenecks of HDFS a critical reason to rethink its storage architecture. The byte-stream communication nature of Java sockets, along with the default OBOT architecture for HDFS block write and replication, cannot fully exploit the underlying hardware capabilities and obtain peak performance. Besides, the existing data placement policies of HDFS cannot ensure efficient utilization of the heterogeneous storage devices for performance-sensitive applications. Moreover, due to the tri-replication of the data blocks, HDFS requires an excessive amount of local storage space, which makes its deployment challenging on HPC systems. Even though most HPC systems are equipped with a vast installation of parallel file systems like Lustre, the default architecture of HDFS does not provision their utilization for data storage in an optimal manner. Besides, HDFS does not consider the storage types and underlying interconnect while accessing data; upper-level middleware read data considering locality only, which results in performance deficiency on clusters with heterogeneous storage characteristics. Furthermore, the current HDFS design is also unable to leverage the byte-addressability of the emerging Non-Volatile Memory

(NVM) systems. All these lead to suboptimal performance for HDFS and, hence, for the upper-layer frameworks and applications on HPC systems.

In this thesis, we addressed several of these critical issues in HDFS. We designed an

RDMA-based HDFS with enhanced overlapping among different stages. We designed advanced data placement policies to efficiently utilize in-memory and heterogeneous storage devices for batch and iterative applications. Our proposed placement schemes can improve data locality over running Hadoop jobs directly with parallel file systems while reducing the local storage requirements. We proposed efficient data access strategies for Hadoop and

Spark on heterogeneous (storage characteristics) HPC clusters. We also proposed an NVM-based design of HDFS and co-designed it with Spark and HBase to accelerate these middleware in a cost-effective manner. Finally, we designed a key-value store-based burst buffer system to take advantage of the existing parallel file system for performance-sensitive Big

Data applications. The designs proposed in this thesis significantly improve the performance of a range of upper-level middleware including MapReduce, HBase, Hive, Spark, and Spark SQL, as well as Big Data applications and workloads like MR-MSPolygraph,

CloudBurst, PUMA, etc.

11.1 Software Release and its Impact

RDMA for Apache Hadoop is an RDMA-based implementation of the Apache Hadoop software and is released as part of the High-Performance Big Data (HiBD) project [38]. It contains high performance designs for Hadoop RPC, MapReduce, and HDFS over native

InfiniBand and RoCE (RDMA over Converged Ethernet). It also supports a plugin-based architecture compliant with Apache Hadoop 2.7.3, Hortonworks Data Platform (HDP)

2.5.0.3, and Cloudera Distribution including Hadoop (CDH) 5.8.2 for HDFS, MapReduce, and RPC and includes HDFS Micro-benchmarks as part of the OHB Micro-benchmarks

0.9.2 release. It also incorporates an RDMA-Memcached-based burst buffer design for Hadoop over Lustre, and the file system includes support for running Spark (default and RDMA version), HBase (default and RDMA version), Spark SQL, Hive, etc. The software is publicly available for the community. This package is being used by more than 200 organizations worldwide in 27 countries to accelerate Big Data applications. As of November 2016, more than 18,750 downloads have taken place from this project's site [38]. It is also deployed on some leadership-class HPC clusters. For example, the RDMA for Apache

Hadoop package is installed and publicly used in Comet at San Diego Supercomputer Center [79].

The duration of this work has spanned several release versions of the RDMA for Apache

Hadoop package, from version 0.9.1 to 1.1.0 (current). Many of the designs and their variants described in this thesis have already been released in this software package. All components of this thesis will eventually be released as part of the RDMA for Apache

Hadoop software.

Bibliography

[1] Apache Hive. http://hive.apache.org/.

[2] Apache Mesos. http://mesos.apache.org.

[3] Apache Tez. https://tez.apache.org/.

[4] Big data needs a new type of non-volatile memory. http://www.electronicsweekly.com/news/big-data-needs-a-new-type-of-non-volatile-memory-2015-10/.

[5] Cloudera: Data access based on Locality and Storage Type. http://blog.cloudera.com/blog/2014/08/new-in-cdh-5-1-hdfs-read-caching/.

[6] Cloudera Spark Roadmap. https://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/09/Cloudera_Spark_Roadmap_1.png.

[7] Erasure Coding Cloudera. http://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/.

[8] Future of Hadoop. https://www.datanami.com/2015/09/09/spark-is-the-future-of-hadoop-cloudera-says/.

[9] Intel HiBench. https://github.com/intel-hadoop/HiBench.

[10] King. https://king.com/.

[11] Kudu architecture. http://cloudera.github.io/kudu/docs/introduction.html#architectural_overview.

[12] Memcached: High-Performance, Distributed Memory Object Caching System. http://memcached.org/.

[13] Spark SQL. http://spark.apache.org/sql/.

[14] SuperCell. http://supercell.com/en/.

[15] Tachyon Project. http://tachyon-project.org.

[16] The Apache Hadoop Project. http://hadoop.apache.org/.

[17] The Ohio State Micro Benchmarks. http://mvapich.cse.ohio-state.edu/benchmarks/.

[18] Faraz Ahmad, Srimat T. Chakradhar, Anand Raghunathan, and T. N. Vijaykumar. Tarazu: Optimizing MapReduce on Heterogeneous Clusters. In 17th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.

[19] Alex Woodie. Game-Changer: The Big Data Behind Social Gaming. https://www.datanami.com/2014/06/23/game-changer-big-data-behind-social-gaming/.

[20] Alex Woodie. What Pokemon GO Means for Big Data. https://www.datanami.com/2016/08/01/pokemon-go-means-big-data.

[21] Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, and Ion Stoica. PACMan: Coordinated Memory Caching for Parallel Jobs. In 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012.

[22] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A Distributed Storage System for Structured Data. In The Proceedings of the Seventh Symposium on Operating System Design and Implementation (OSDI'06), WA, November 2006.

[23] Y. Chen, S. Alspaugh, and R. Katz. Interactive Analytical Processing in Big Data Systems: A Cross-industry Study of MapReduce Workloads. Proc. VLDB Endow., 5(12):1802–1813, August 2012.

[24] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, D. Burger, B. Lee, and D. Coetzee. Better I/O Through Byte-Addressable, Persistent Memory. In Symposium on Operating Systems Principles (SOSP '09). Association for Computing Machinery, Inc., October 2009.

[25] B. F. Cooper, R. Ramakrishnan, R. Sears, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. In 34th Intl. Conference on Very Large Data Bases, 2008.

[26] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking Cloud Serving Systems with YCSB. In The Proceedings of the ACM Symposium on Cloud Computing (SoCC), Indianapolis, Indiana, June 2010.

[27] CSC. Big Data Universe Beginning to Explode. http://www.csc.com/insights/flxwd/78931-big data universe beginning to explode.

[28] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Operating Systems Design and Implementation (OSDI), 2004.

[29] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System Software for Persistent Memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys ’14, pages 15:1–15:15, New York, NY, USA, 2014. ACM.

[30] Christian Engelmann, Hong Ong, and Stephen L. Scott. Middleware in Modern High Performance Computing System Architectures. In ICCS, Beijing, China, 2007.

[31] Brad Fitzpatrick. Distributed Caching with Memcached. Linux Journal, 2004:5–, August 2004.

[32] Gordon at San Diego Supercomputer Center. http://www.sdsc.edu/us/resources/gordon/.

[33] Karan Gupta, Reshu Jain, Himabindu Pucha, Prasenjit Sarkar, and Dinesh Subhraveti. Scaling Highly-Parallel Data-Intensive Supercomputing Applications on a Parallel Clustered Filesystem. In The Proceedings of Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), New Orleans, LA, November 2010.

[34] Hadoop 2.6 Storage Policies. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html.

[35] Hadoop 2.6 Storage Policies. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html.

[36] Hadoop Map Reduce. The Apache Hadoop Project. http://hadoop.apache.org/mapreduce/.

[37] T. Harter, D. Borthakur, S. Dong, A. Aiyer, L. Tang, A. Arpaci-Dusseau, and R. Arpaci-Dusseau. Analysis of HDFS Under HBase: A Facebook Messages Case Study. In 12th USENIX Conference on File and Storage Technologies (FAST), 2014.

[38] HiBD. http://hibd.cse.ohio-state.edu/.

[39] Jian Huang, Xiangyong Ouyang, Jithin Jose, Md. Wasi Rahman, Hao Wang, Miao Luo, Hari Subramoni, Chet Murthy, and Dhabaleswar K. Panda. High-Performance Design of HBase with RDMA over InfiniBand. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2012.

[40] IDC. The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things. http://www.emc.com/leadership/digital-universe/index.htm.

[41] Cloudera Inc. Kudu. https://blog.cloudera.com/blog/2015/09/kudu-new-apache-hadoop-storage-for-fast-analytics-on-fast-data/.

[42] InfiniBand Trade Association. http://www.infinibandta.org.

[43] Intel. HiBench Suite. https://github.com/intel-hadoop/HiBench.

[44] Intl. Data Corporation (IDC). New IDC Worldwide HPC End-User Study Identifies Latest Trends in High Performance Computing Usage and Spending. http://www.idc.com/getdoc.jsp?containerId=prUS24409313.

[45] N. S. Islam, X. Lu, M. W. Rahman, J. Jose, H. Wang, and D. K. Panda. A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters. In 2nd Workshop on Big Data Benchmarking (WBDB), 2012.

[46] N. S. Islam, X. Lu, M. W. Rahman, R. Rajachandrasekar, and D. K. Panda. In-Memory I/O and Replication for HDFS with Memcached: Early Experiences. In 2014 IEEE Intl. Conference on Big Data (IEEE BigData), 2014.

[47] J. Jose, M. Luo, S. Sur, and D. K. Panda. Unifying UPC and MPI Runtimes: Experience with MVAPICH. In Fourth Conference on Partitioned Global Address Space Programming Model (PGAS), Oct 2010.

[48] J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. S. Islam, X. Ouyang, H. Wang, S. Sur, and D. K. Panda. Memcached Design on High Performance RDMA Capable Interconnects. In Intl. Conference on Parallel Processing (ICPP), Sept 2011.

[49] W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn. DFS: A File System for Virtualized Flash Storage. Trans. Storage, 2010.

[50] A. Kalyanaraman, R. W. Cannon, B. Latt, and J. D. Baxter. MapReduce Implementation of a Hybrid Spectral Library-Database Search Method for Large-Scale Peptide Identification. Bioinformatics, 27, 2011.

[51] R. Kaushik and M. Bhandarkar. GreenHDFS: Towards an Energy-conserving, Storage-efficient, Hybrid Hadoop Compute Cluster. In Proceedings of the 2010 International Conference on Power Aware Computing and Systems, HotPower’10, pages 1–9, Berkeley, CA, USA, 2010. USENIX Association.

[52] Jacob Leverich and Christos Kozyrakis. Performance and Energy Efficiency of Big Data Applications in Cloud Environments: A Hadoop Case Study. Journal of Parallel and Distributed Computing, pages 80–89.

[53] Jacob Leverich and Christos Kozyrakis. On the Energy (In)efficiency of Hadoop Clusters. SIGOPS Oper. Syst. Rev., 44(1):61–65, March 2010.

[54] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In ACM Symposium on Cloud Computing (SoCC), 2014.

[55] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn. On the Role of Burst Buffers in Leadership-Class Storage Systems. In MSST/SNAPI, 2012.

[56] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 1982.

[57] Xiaoyi Lu, Nusrat S. Islam, Md. Wasi. Rahman, Jithin Jose, Hari Subramoni, Hao Wang, and Dhabaleswar K. Panda. High-Performance Design of Hadoop RPC with RDMA over InfiniBand. In IEEE 42nd Intl. Conference on Parallel Processing (ICPP), 2013.

[58] Carlos Maltzahn, Esteban Molina-Estolano, Amandeep Khurana, Alex J. Nelson, Scott A. Brandt, and Sage Weil. Ceph as a Scalable Alternative to the Hadoop Distributed File System. August 2010.

[59] Shunsuke Mikami, Kazuki Ohta, and Osamu Tatebe. Using the Gfarm File System as a POSIX Compatible Storage Platform for Hadoop MapReduce Applications. In The Proceedings of the 2011 IEEE/ACM 12th Intl. Conference on Grid Computing (GRID), France, September 2011.

[60] MVAPICH2: High Performance MPI over InfiniBand and iWARP. http://mvapich.cse.ohio-state.edu/.

[61] NVRAM. http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency/.

[62] X. Ouyang, S. Marcarelli, R. Rajachandrasekar, and D. K. Panda. RDMA-based Job Migration Framework for MPI over InfiniBand. In The Proceedings of the International Conference on Cluster Computing (Cluster), September 2010.

[63] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, July 2009.

[64] S. Pelley, T. F. Wenisch, B. T. Gold, and B. Bridge. Storage Management in the NVRAM Era. Proc. VLDB Endow., 7(2):121–132, October 2013.

[65] Pivotal. Pivotal Analytics Workbench. http://www.gopivotal.com/solutions/analytics-workbench.

[66] Purdue MapReduce Benchmarks Suite (PUMA). https://sites.google.com/site/farazahmad/pumabenchmarks.

[67] S. Qiu and A. L. N. Reddy. NVMFS: A Hybrid File System for Improving Random Write in Nand-Flash SSD. In IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), 2013.

[68] K. R, A. Anwar, and A. Butt. hatS: A Heterogeneity-Aware Tiered Storage for Hadoop. In 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2014.

[69] K. R, S. Iqbal, and A. Butt. VENU: Orchestrating SSDs in Hadoop Storage. In 2014 IEEE International Conference on Big Data (IEEE BigData), 2014.

[70] K. R, A. Khasymski, A. Butt, S. Tiwari, and M. Bhandarkar. AptStore: Dynamic Storage Management for Hadoop. In International Conference on Cluster Computing (CLUSTER), 2013.

[71] M. W. Rahman, N. Islam, X. Lu, and D. Panda. A Comprehensive Study of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters. IEEE Transactions on Parallel and Distributed Systems, 2016.

[72] M. W. Rahman, Nusrat S. Islam, Xiaoyi Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda. High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand. In HPDIC, in conjunction with IPDPS, Boston, MA, 2013.

[73] M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda. High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA. In 29th IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2015.

[74] M. W. Rahman, Xiaoyi Lu, Nusrat S. Islam, and D. K. Panda. HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects. In Intl. Conference on Supercomputing (ICS), 2014.

[75] R. Rajachandrasekar, X. Ouyang, X. Besseron, V. Meshram, and D. Panda. Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? In Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids 2011, held in conjunction with EuroPar, August 2011.

[76] RandomWriter. http://wiki.apache.org/hadoop/RandomWriter.

[77] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB'13, pages 325–336. VLDB Endowment, 2013.

[78] Michael C. Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009.

[79] SDSC Comet. http://www.sdsc.edu/services/hpc/hpc systems.html.

[80] P. Sehgal, S. Basu, K. Srinivasan, and K. Voruganti. An Empirical Study of File Systems on NVM. In IEEE 31st Symposium on Mass Storage Systems and Technologies (MSST), 2015.

[81] J. Shafer, S. Rixner, and A. L. Cox. The Hadoop Distributed Filesystem: Balancing Portability and Performance. In Intl. Symposium on Performance Analysis of Systems and Software (ISPASS'10), White Plains, NY, March 28-30, 2010.

[82] Konstantin Shvachko. HDFS Scalability: The Limits to Growth. login: The Magazine of USENIX, 2010.

[83] Gurmeet Singh, Puneet Chandra, and Rashid Tahir. A Dynamic Caching Mechanism for Hadoop using Memcached. http://cloudgroup.neu.edu.cn/slides/2014.5.30/ClouData-3rd.pdf.

[84] Sort. http://wiki.apache.org/hadoop/Sort.

[85] Stampede at TACC. http://www.tacc.utexas.edu/resources/hpc/stampede.

[86] Statistical Workload Injector for MapReduce. https://github.com/SWIMProjectUCB.

[87] Thomas Sterling, Ewing Lusk, and William Gropp. Beowulf Cluster Computing with Linux. MIT Press, 2003.

[88] Thomas L. Sterling, John Salmon, Donald J. Becker, and Daniel F. Savarese. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. MIT Press, 1999.

[89] H. Subramoni, P. Lai, M. Luo, and Dhabaleswar K. Panda. RDMA over Ethernet - A Preliminary Study. In Proceedings of the 2009 Workshop on High Performance Interconnects for Distributed Computing (HPIDC'09), 2009.

[90] S. Sur, H. Wang, J. Huang, X. Ouyang, and D. K. Panda. Can High Performance Interconnects Benefit Hadoop Distributed File System? In Workshop on Micro Architectural Support for Virtualization, Data Center Computing, and Clouds, in Conjunction with MICRO 2010, Atlanta, GA, December 2010.

[91] W. Tantisiriroj, S. Patil, G. Gibson, S. Son, S. Lang, and Robert Ross. On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.

[92] W. Tantisiriroj, S. Patil, G. Gibson, Seung Woo Son, S.J. Lang, and R.B. Ross. On the Duality of Data-Intensive File System Design: Reconciling HDFS and PVFS. In The Proceedings of Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, WA, November 2011.

[93] TeraGen. http://hadoop.apache.org/docs/r0.20.0/api/org/apache/hadoop/examples/terasort/TeraGen.html.

[94] The Apache Software Foundation. Apache Cassandra. http://cassandra.apache.org/.

[95] The Apache Software Foundation. Apache HBase. http://hbase.apache.org/.

[96] The Apache Software Foundation. Centralized Cache Management in HDFS. http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html.

[97] Top500 Supercomputing System. http://www.top500.org.

[98] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. In ACM Symposium on Cloud Computing (SOCC), 2013.

[99] T. Wang, K. Mohror, A. Moody, W. Yu, and K. Sato. BurstFS: A Distributed Burst Buffer File System for Scientific Applications. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015.

[100] T. Wang, S. Oral, Y. Wang, B. Settlemyer, and S. Atchley. BurstMem: A High-Performance Burst Buffer System for Scientific Applications. In the Proceedings of the 2014 IEEE International Conference on Big Data (IEEE BigData), 2014.

[101] Y. Wang, R. Goldstone, W. Yu, and T. Wang. Characterization and Optimization of Memory-Resident MapReduce on HPC Systems. In 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2014.

[102] Yandong Wang, Xinyu Que, Weikuan Yu, Dror Goldenberg, and Dhiraj Sehgal. Hadoop Acceleration through Network Levitated Merge. In The Proceedings of Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, WA, November 2011.

[103] Yandong Wang, Xinyu Que, Weikuan Yu, Dror Goldenberg, and Dhiraj Sehgal. Hadoop Acceleration through Network Levitated Merge. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC, 2011.

[104] Matt Welsh, David Culler, and Eric Brewer. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. In 18th ACM Symposium on Operating Systems Principles (SOSP), 2001.

[105] J. Wu, P. Wyckoff, and D. K. Panda. PVFS over InfiniBand: Design and Performance Evaluation. In The Proceedings of the 32nd International Conference on Parallel Processing (ICPP 2003), October 2003.

[106] X. Wu and A. L. N. Reddy. SCMFS: A File System for Storage Class Memory. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 39:1–39:11, New York, NY, USA, 2011. ACM.

[107] M. Xia, M. Saxena, M. Blaum, and D. Pease. A Tale of Two Erasure Codes in HDFS. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST’15, pages 213–226, Berkeley, CA, USA, 2015. USENIX Association.

[108] Xyratex. Lustre. http://wiki.lustre.org/index.php/Main Page.

[109] W. Yu, N. S. V. Rao, and J. S. Vetter. Experimental Analysis of InfiniBand Transport Services on WAN. In The Proceedings of the IEEE International Conference on Networking, Architecture, and Storage (NAS), June 2008.

[110] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2012.

[111] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Computing with Working Sets. In 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud), 2010.

[112] H. Zhang, M. Dong, and H. Chen. Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication. In 14th USENIX Conference on File and Storage Technologies (FAST), pages 167–180, Santa Clara, CA, February 2016. USENIX Association.

[113] Y. Zhang, J. Yang, A. Memaripour, and S. Swanson. Mojim: A Reliable and Highly-Available Non-Volatile Memory System. In 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.
