DESIGNING HIGH-PERFORMANCE AND SCALABLE CLUSTERED NETWORK ATTACHED STORAGE WITH INFINIBAND

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Ranjit Noronha, MS

* * * * *

The Ohio State University

2008

Dissertation Committee:
Dhabaleswar K. Panda, Adviser
Ponnuswamy Sadayappan
Feng Qin

Approved by:
Adviser, Graduate Program in Computer Science and Engineering

© Copyright by

Ranjit Noronha

2008

ABSTRACT

The Internet age has exponentially increased the volume of digital media that is being

shared and distributed. Broadband Internet has made technologies such as high quality streaming video on demand possible. Large scale supercomputers also consume and create huge quantities of data. This media and data must be stored, cataloged and retrieved with high performance. Researching high-performance storage subsystems to meet the I/O demands of applications in modern scenarios is crucial.

Advances in microprocessor technology have given rise to relatively cheap off-the-shelf

hardware that may be put together as personal computers as well as servers. The servers may be connected together by networking technology to create farms or clusters of workstations (COWs). The evolution of COWs has significantly reduced the cost of ownership of high-performance clusters and has allowed users to build fairly large scale machines based on commodity server hardware.

As COWs have evolved, networking technologies like InfiniBand and 10 Gigabit Ethernet have also evolved. These networking technologies not only provide lower end-to-end latencies, but also allow for better messaging throughput between the nodes. This allows us to connect the clusters with high-performance interconnects at a relatively lower cost.

With the deployment of low-cost, high-performance hardware and networking technology, it is increasingly becoming important to design a storage system that can be shared across all the nodes in the cluster. Traditionally, the different components of the file system have been strung together using the network to connect them. The protocol generally used over the network is TCP/IP. The TCP/IP protocol stack in general has been shown to have poor performance, especially for high-performance networks like 10 Gigabit Ethernet or InfiniBand. This is largely due to the fragmentation and reassembly cost of TCP/IP. The cost of multiple copies also serves to severely degrade the performance of the stack. Also, TCP/IP has been shown to reduce the capacity of network attached storage systems because of problems like incast.

In this dissertation, we research the problem of designing high-performance communication subsystems for network attached storage (NAS) systems. Specifically, we delve into the issues and potential solutions with designing communication protocols for high-end single-server and clustered server NAS systems. Orthogonally, we also investigate how a caching architecture may potentially enhance the performance of a NAS system. Finally, we look at the potential performance implications of using some of these designs in two scenarios: over a long-haul network and when used as a basis for checkpointing parallel applications.

VITA

January 7, 1976 ...... Born - Mumbai, India

1998 ...... BE, Computer Engineering, Mumbai University, India
1998-2000 ...... Associate Software Engineer, Tata Infotech, Mumbai, India
2000 ...... Intern, Compaq Computer Corporation, Cupertino, CA, USA
2002 ...... MS, Computer Science, Binghamton University, Binghamton, NY, USA
2002-2008 ...... Graduate Research Associate, The Ohio State University, Columbus, OH, USA
2007 ...... Intern, NFS Group, , Austin, TX, USA.

PUBLICATIONS

Dissertation Related Research Publications

R. Noronha, X. Ouyang and D.K. Panda “Designing a High-Performance Clustered NAS: A Case Study with pNFS over RDMA on InfiniBand”. International Conference on High Performance Computing (HiPC), December 2008.

R. Noronha and D.K. Panda “IMCa: A High-Performance Caching Front-End for InfiniBand”. International Conference on Parallel Processing (ICPP), Sept 2008.

S. Narravula, H. Subramoni, P. Lai, B. Rajaraman, R. Noronha and D.K. Panda “Performance of HPC Middleware over InfiniBand WAN”. International Conference on Parallel Processing (ICPP), Sept 2008.

R. Noronha, Lei Chai, Spencer Shepler and D.K. Panda “Enhancing the Performance of NFSv4 with RDMA”. The 4th International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI’07), held with 24th IEEE Conference on Mass Storage Systems and Technologies (MSST), Sept 2007.

R. Noronha, Lei Chai, Thomas Talpey and D.K. Panda “Designing NFS With RDMA for Security, Performance and Scalability”. International Conference on Parallel Processing (ICPP), Sept 2007.

Lei Chai, X. Ouyang, R. Noronha and D.K. Panda “pNFS/PVFS2 over InfiniBand: Early Experiences”. PetaScale Workshop, held with SuperComputing, Nov 2007.

W. Yu, R. Noronha, S. Liang and D.K. Panda “Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre”. Communication Architecture for Clusters (CAC) Workshop, to be held in conjunction with Int’l Parallel and Distributed Processing Symposium (IPDPS ’06), April 2006.

Other Related Research Publications

R. Noronha and D.K. Panda “Improving Scalability of OpenMP Applications on Multi-Core Systems Using Large Page Support”. International Workshop on Multithreaded Architectures and Applications (MTAAP) to be held in conjunction with International Parallel and Distributed Processing Symposium (IPDPS), March 2007.

L. Chai, R. Noronha and D. K. Panda “Can Performance and Portability Exist Across Architectures?” The 6th IEEE International Symposium on Cluster Computing and the Grid (CCGrid), May 2006.

S. Liang and R. Noronha and D. K. Panda “Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Device”, IEEE Cluster Computing, Sept 2005.

P. Balaji and W. Feng and Q. Gao and R. Noronha and W. Yu and D. K. Panda “Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines”, IEEE Cluster Computing, Sept 2005.

L. Chai and R. Noronha and P. Gupta and R. Brown and D. K. Panda “Designing a Portable MPI-2 over Modern Interconnects Using uDAPL Interface”, EuroPVM/MPI, Sept 2005.

R. Noronha and D. K. Panda. “Performance Evaluation of MM5 on Clusters With Modern Interconnects: Scalability and Impact”, Euro-Par, Aug 2005.

R. Noronha and D. K. Panda “Can High Performance DSM Systems Designed With InfiniBand Features Benefit from PCI-Express?”, Fifth International Workshop on Distributed Shared Memory (held in conjunction with CCGrid 2005), May 2005.

R. Noronha and D. K. Panda “Reducing Diff Overhead in Software DSM Systems using RDMA Operations in InfiniBand”, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies (RAIT 2004), San Diego (CA, USA), in conjunction with the IEEE CLUSTER 2004 conference, Sept 2004.

R. Noronha and D. K. Panda “Designing High Performance DSM Systems using InfiniBand Features”, Fourth International Workshop on Distributed Shared Memory (held in conjunction with CCGrid 2004), April 2004.

R. Noronha and D. K. Panda “Implementing Treadmarks over GM on Myrinet: Challenges, Design Experiences and Performance Evaluation”, Workshop on Communication Architecture for Clusters (CAC 03), to be held in conjunction with IPDPS 03.

M. Otey and R. Noronha and G. Li and S. Parthasarathy and D.K. Panda “NIC-based Intrusion Detection: A feasibility study”, Workshop on Data Mining for Cyber Threat Analyses (DMCTA ’02), held in conjunction with IEEE International Conference on Data Mining.

Prior Unrelated Research Publications

R. Noronha and N. Abu-Ghazaleh “Using Programmable NICs for TimeWarp Optimization”, International Parallel and Distributed Processing Symposium (IPDPS ’02), April 2002.

R. Noronha and N. Abu-Ghazaleh “Early Cancellation: An Active-NIC Optimization for TimeWarp”, 16th Workshop on Parallel and Distributed Simulation (PADS’02), May 2002.

Master's Thesis

R. Noronha “Intelligent NICs: A Feasibility Study of Improving Performance of Distributed Applications by Programming Some of Their Components on the Network Interface Card”, Master's Thesis, State University of New York at Binghamton, Binghamton, New York, June 2001.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in:
Computer Architecture: Prof. D. K. Panda
Software Systems: Prof. S. Parthasarathy
Computer Networks: Prof. Dong Xuan

TABLE OF CONTENTS

Page

Abstract ...... ii

Vita ...... iv

List of Tables ...... xiii

List of Figures ...... xiv

Chapters:

1. Introduction ...... 1

1.1 Overview of Storage Terminology ...... 4
1.1.1 Concepts ...... 4
1.1.2 UNIX Notion of Files ...... 5
1.1.3 Mapping files to blocks on storage devices ...... 6
1.1.4 Techniques to improve file system performance ...... 8
1.2 Overview of Storage Architectures ...... 9
1.2.1 Direct Attached Storage (DAS) ...... 9
1.2.2 Network Storage Architectures: In-Band and Out-Of-Band Access ...... 11
1.2.3 Network Attached Storage (NAS) ...... 12
1.2.4 System Area Networks (SAN) ...... 13
1.2.5 Clustered Network Attached Storage (CNAS) ...... 14
1.2.6 Object Based Storage Systems (OBSS) ...... 16
1.3 Representative File Systems ...... 17
1.3.1 Network File Systems (NFS) ...... 18
1.3.2 Lustre File System ...... 19

1.4 Overview of Networking Technologies ...... 20
1.4.1 Fibre Channel (FC) ...... 21
1.4.2 10 Gigabit Ethernet ...... 22
1.4.3 InfiniBand ...... 23

2. Problem Statement ...... 26

2.1 Research Approach ...... 29
2.2 Dissertation Overview ...... 31

3. Single Server NAS: NFS on InfiniBand ...... 34

3.1 Why is NFS over RDMA important? ...... 34
3.2 Overview of the InfiniBand Communication Model ...... 39
3.2.1 Communication Primitives ...... 40
3.3 Overview of NFS/RDMA Architecture ...... 41
3.3.1 Inline Protocol for RPC Call and RPC Reply ...... 43
3.3.2 RDMA Protocol for bulk data transfer ...... 44
3.4 Proposed Read-Write Design and Comparison to the Read-Read Design ...... 46
3.4.1 Limitations in the Read-Read Design ...... 49
3.4.2 Potential Advantages of the Read-Write Design ...... 50
3.4.3 Proposed Registration Strategies For the Read-Write Protocol ...... 51
3.5 Experimental Evaluation of NFSv3 over InfiniBand ...... 57
3.5.1 Comparison of the Read-Read and Read-Write Design ...... 57
3.5.2 Impact of Registration Strategies ...... 60
3.5.3 Multiple Clients and Real Disks ...... 63
3.6 Summary ...... 65

4. Enhancing the Performance of NFSv4 with RDMA ...... 67

4.1 Design of NFSv4 with RDMA ...... 68
4.1.1 Compound Procedures ...... 68
4.1.2 Read and Write Operations ...... 70
4.1.3 Readdir/Readlink Operations ...... 71
4.2 Evaluation of NFSv4 over RDMA ...... 71
4.2.1 Experimental Setup ...... 71
4.2.2 Impact of RDMA on NFSv4 ...... 72
4.2.3 Comparison between NFSv4/TCP and NFSv4/RDMA ...... 72
4.3 Summary ...... 73

5. Performance in a WAN environment ...... 74

5.1 InfiniBand WAN: Range Extension ...... 74
5.2 WAN Experimental Setup ...... 76
5.3 Performance of NFS/RDMA with increasing delay ...... 76
5.4 NFS WAN performance characteristics with RDMA and TCP/IP ...... 76
5.5 Summary ...... 77

6. Clustered NAS: pNFS on InfiniBand ...... 79

6.1 NFSv4.1: Parallel NFS (pNFS) and Sessions ...... 82
6.2 Design Considerations for pNFS over RDMA ...... 85
6.2.1 Design of pNFS using a file layout ...... 85
6.2.2 RPC Connections from clients to MDS (Control Path) ...... 87
6.2.3 RPC Connections from MDS to DS (MDS-DS control path) ...... 88
6.2.4 RPC Connections from clients to data servers (Data paths) ...... 89
6.2.5 Sessions Design with RDMA ...... 90
6.3 Performance Evaluation ...... 90
6.3.1 Experimental Setup ...... 91
6.3.2 Impact of RPC/RDMA on Performance from the client to the Metadata Server ...... 91
6.3.3 RPC/RDMA versus RPC/TCP on metadata server to Data Server ...... 94
6.3.4 RPC/RDMA versus RPC/TCP from clients to DS ...... 97
6.4 Summary ...... 102

7. Caching in a clustered NAS environment ...... 104

7.1 Background ...... 107
7.1.1 Introduction to GlusterFS ...... 107
7.1.2 Introduction to MemCached ...... 107
7.2 Motivation ...... 108
7.3 Design of a Cache for File Systems ...... 110
7.3.1 Overall Architecture of Intermediate Memory Caching (IMCa) Layer ...... 111
7.3.2 Design for Management File System Operations in IMCa ...... 113
7.3.3 Data Transfer Operations ...... 114
7.3.4 Potential Advantages/Disadvantages of IMCa ...... 118
7.4 Performance Evaluation ...... 119

7.4.1 Experimental Setup ...... 120
7.4.2 Performance of Stat With the Cache ...... 120
7.4.3 Latency: Single Client ...... 122
7.4.4 Latency: Multiple Clients ...... 124
7.4.5 IOzone Throughput ...... 127
7.4.6 Read/Write Sharing Experiments ...... 129
7.5 Summary ...... 129

8. Evaluation of Check-pointing With High-Performance I/O ...... 131

8.1 Overview of Checkpoint Approaches and Issues ...... 133
8.1.1 Checkpoint Initiation ...... 133
8.1.2 Blocking versus Non-Blocking Checkpointing ...... 134
8.1.3 Application versus System Level Checkpointing ...... 134
8.2 Storage Systems and Checkpointing in MVAPICH2: Issues, Performance Evaluation and Impact ...... 136
8.2.1 Storage Subsystem Issues ...... 136
8.2.2 Impact of Different File Systems ...... 137
8.2.3 Impact of Checkpointing Interval ...... 139
8.2.4 Impact of System Size ...... 140
8.3 Summary ...... 142

9. Open Source Software Release and Its Impact ...... 143

10. Conclusions and Future Research Directions ...... 144

10.1 Summary of contributions ...... 144
10.1.1 A high-performance single-server network file system (NFSv3) over RDMA ...... 144
10.1.2 A high-performance single-server network file system (NFSv4) over RDMA ...... 145
10.1.3 NFS over RDMA in a WAN environment ...... 145
10.1.4 High-Performance parallel network file-system (pNFS) over InfiniBand ...... 146
10.1.5 Intermediate Caching Architecture ...... 147
10.1.6 System-Level Checkpointing With MVAPICH2 and Lustre ...... 147
10.2 Future Research Directions ...... 148
10.2.1 Investigation of registration modes ...... 148

10.2.2 Scalability to very large clusters ...... 148
10.2.3 Metadata Parallelization ...... 148
10.2.4 InfiniBand based for storage subsystems ...... 149
10.2.5 Cache Structure and Lookup for Parallel and Distributed File Systems ...... 149
10.2.6 Helper Core to Reduce Checkpoint Overhead ...... 150

Appendices:

A. Los Alamos National Laboratory Copyright Notice for Figure relating to IBM RoadRunner ...... 151

Bibliography ...... 152

LIST OF TABLES

Table Page

3.1 Communication Primitive Properties ...... 41

LIST OF FIGURES

Figure Page

1.1 IBM RoadRunner: the first petaflop cluster. Courtesy, Leroy N. Sanchez, LANL [26], copyright, Appendix A ...... 3

1.2 Local File System Protocol Stack. Courtesy, Welch, et.al. [40] ...... 11

1.3 In-Band Access ...... 12

1.4 Out-of-Band Access ...... 12

1.5 Network Attached Storage Protocols. Based on illustration by Welch, et.al. [40] 14

1.6 System Area Networks. Based on illustration by Welch, et.al. [40] . . . . . 15

1.7 Clustered NAS: High-Level Architecture ...... 17

1.8 Clustered NAS Forwarding Model: Protocol Stack, Courtesy Welch, et.al. [40] 17

1.9 Lustre Architecture ...... 21

1.10 Fibre Channel Connectors [3] ...... 22

1.11 Fibre Channel Topologies [3] ...... 22

1.12 InfiniBand Fibre Cables ...... 25

1.13 InfiniBand Switch ...... 25

2.1 Proposed Architecture ...... 32

3.1 Architecture of the NFS/RDMA stack in OpenSolaris ...... 42

3.2 RPC/RDMA header ...... 44

3.3 Read-Read Design ...... 45

3.4 Read-Write Design ...... 45

3.5 Registration points (Read-Write) ...... 45

3.6 Latency and Registration costs in InfiniBand on OpenSolaris ...... 53

3.7 IOzone Read Bandwidth ...... 58

3.8 IOzone Write Bandwidth ...... 58

3.9 IOzone Read Bandwidth with different registration strategies on OpenSolaris 61

3.10 IOzone Write Bandwidth with different registration strategies on OpenSolaris 61

3.11 Performance Impact of Client Registration Cache on IOzone Read Tests . . 62

3.12 FileBench OLTP Performance ...... 62

3.13 IOzone Read Bandwidth with different registration strategies on Linux ...... 63

3.14 IOzone Write Bandwidth with different registration strategies on Linux . . 63

3.15 Multiple clients IOzone Read Bandwidth (4GB on the server)...... 66

3.16 Multiple clients IOzone Read Bandwidth (8GB on the server)...... 66

4.1 IOzone Bandwidth Comparison ...... 72

5.1 Two geographically separated clusters connected by LongBow routers . . . 75

5.2 NFS/RDMA Performance With Different Delays ...... 77

5.3 NFS at 10 µs delay, comparison between RDMA and IPoIB ...... 77

5.4 NFS at 1,000 µs delay, comparison between RDMA and IPoIB ...... 77

6.1 pNFS high-level architecture ...... 84

6.2 pNFS detailed design ...... 86

6.3 Latency for small operations (GETFILELAYOUT) ...... 93

6.4 Latency for 1 and 2 processes ...... 95

6.5 Latency for 4,8 and 16 processes ...... 95

6.6 Timing breakdown (1 process) ...... 95

6.7 Breakdown 16 processes (I-IPoIB, R-RDMA) ...... 95

6.8 IOzone Throughput (Write) ...... 98

6.9 IOzone Throughput (Read) ...... 98

6.10 Zipf trace throughput (Write) ...... 100

6.11 Zipf trace throughput (Read) ...... 100

6.12 Performance of BTIO with ...... 101

7.1 Multiple clients IOzone Read Bandwidth with NFS/RDMA [76] ...... 109

7.2 Overall Architecture of the Intermediate Memory Caching (IMCa) Layer . . 111

7.3 Example of blocks requiring additional data transfers in IMCa ...... 115

7.4 Logical Flow for Read and Write operations in IMCa ...... 116

7.5 Stat latency with multiple clients and 1 MCD ...... 122

7.6 Read and Write Latency Numbers With One Client and 1 MCD...... 125

7.7 Read latency with 32 clients and varying number of MCDs. 4 DSs are used for Lustre...... 126

7.8 Read latency with 1 MCD and varying number of clients ...... 127

7.9 Read Throughput (Varying MCDs) ...... 128

7.10 Read Latency to a shared file ...... 128

8.1 Impact of different file systems on checkpoint performance at 16 nodes. Refer to Section 8.2.2 for an explanation of the numbers on the graph. . . . 139

8.2 Checkpoint Time: 36 processes ...... 141

8.3 Checkpoint Time: 16 processes ...... 141

8.4 Checkpoint Time With BT class D Lustre with 4 dataservers ...... 142

CHAPTER 1

INTRODUCTION

Humanity has increasingly become dependent on digital media for work and pleasure.

The Internet has transformed the way we work and play and has revolutionized the way information is created and distributed. With the exponential growth in the amount of information created by the digital age, storing this information has increasingly become a challenge. These challenges are complicated by legal requirements that this information be retrievable reliably many years into the future. Also, with the wide availability and use of broadband Internet, streaming digital media content has become widespread. This increasingly raises the bar on performance requirements for the storage subsystems.

At the other end of the spectrum, supercomputers are epitomized in the public consciousness by the chess competition between Kasparov and Deep Blue [59]. These supercomputers are being used at many sites to perform scientific simulations of nuclear and atomic phenomena. Advances in microprocessor technology have given rise to relatively cheap hardware that may be put together as personal computers as well as servers. The servers may be connected together by networking technology to create farms or clusters of workstations (COWs). The evolution of COWs has significantly reduced the cost of ownership of high-performance systems and has allowed users to build fairly large scale supercomputers based on clustered hardware [30, 94]. Figure 1.1 shows IBM RoadRunner, the first cluster to reach a petaflop using off-the-shelf components. Simulations on these large scale systems store and generate vast amounts of data. High-performance, timely access to application data becomes crucially important as it is often a limiting factor in the performance of these systems.

As clusters increase in sheer size, faults in these large scale systems become increasingly common [84, 37]. To protect against these faults, long running applications need to be checkpointed at regular intervals, so that they may be reliably restarted in the case of failures. Storing these vast amounts of checkpoint data may pose a considerable burden on the storage subsystem.

As COWs have evolved, commodity networking technologies like InfiniBand [8] and

10 Gigabit Ethernet [47] have also evolved. These networking technologies not only provide lower end-to-end latencies, but also allow for better messaging throughput between the nodes as compared to traditional Fibre Channel and Gigabit Ethernet. This allows clusters to be connected with high-performance interconnects at a relatively lower cost.

As clustering and networking technology have evolved, storage systems have also evolved to provide seamless, name space transparent, reliable and high-performance access to user-level data across clusters. However, these storage systems are not designed to take advantage of the properties of high-performance networks. In this dissertation, we look at the issues surrounding high-performance storage systems and networks. In this Chapter, we

provide a background and overview of storage research. The rest of this Chapter is organized as follows. In Section 1.1, we discuss basic file system concepts. Next, Section 1.2 discusses storage architectures. Following that, Section 1.3 discusses some commonly used file systems. Finally, Section 1.4 surveys networking technology used in storage environments.

In Chapter 2, we discuss the problems addressed in this dissertation and also provide an overview of the remaining Chapters.

Figure 1.1: IBM RoadRunner: the first petaflop cluster. Courtesy, Leroy N. Sanchez, LANL [26], copyright, Appendix A

1.1 Overview of Storage Terminology

In this section, we briefly describe the commonly used concepts and terminology relating to file systems. Note that some terminology is used interchangeably in the available literature. We will explain the differences and our usage in the text.

1.1.1 File System Concepts

Processes and users on UNIX systems and derivatives generally store information in

files that reside in directories. Directories themselves could contain directories, which could lead to fairly deep levels of nesting. The majority of modern storage devices such as disk drives are not aware of files or other process-level storage abstractions.

These storage devices are mostly responsible for storing and retrieving fixed-size units of data [54]. File systems are usually layered between the user and the storage device. They provide the file and directory abstraction to the user. Besides maintaining the basics such as file permissions, and providing and maintaining file pointers and descriptors, file systems are also responsible for the following:

• Moving information or data from processes to files and eventually to the underlying

storage devices and vice-versa.

• Allowing appropriately authorized access to files and directories.

• Arbitrating between multiple processes accessing the same file.

• Managing the blocks on the underlying storage device(s). This includes listing the sets of free blocks on the device, allocating these blocks to files and deallocating these blocks when the files shrink in size.

• Reducing the impact of slow access to and from the storage device.

• Keeping the file system consistent when storage devices fail or power outages occur.

• Providing data recovery when data is corrupted and fail-over modes in the face of

irrecoverable errors.

Most file systems face these issues, irrespective of whether they are on a single computer or distributed across a cluster of computers. We examine the concept of a file in Section 1.1.2.

Following that, we look at how a file system organizes blocks available on a storage device.

We then discuss how these blocks are used to store and access file and directory data.

Finally, we examine the common caching and buffering techniques used to optimize access to data.

1.1.2 UNIX Notion of Files

A UNIX file consists of a sequence of bytes from 0..N-1, where N is the length of the

file in bytes. A UNIX system does not assign any specific meaning to the bytes; it is up to the application to interpret them. On typical UNIX systems, files usually have a user-id and group-id. It is possible to specify separate read, write or execute permissions for the user and group. In addition, it is also possible to specify these access permissions for everybody else not included in the group-id. In order to access the file, a process will request that a particular file be opened. Once the file system determines that a particular process has appropriate permissions to access a given file, a file pointer or descriptor is returned to the calling process. The file pointer has an implicit byte location in the file, which starts at byte zero and is incremented by read/write operations. This file pointer may also be explicitly changed by seek operations. A file may be accessed concurrently by multiple different processes. Each of these processes will have its own independent file descriptor. Conflicting writes from different processes are usually resolved by maintaining sequential consistency. Sequential consistency is generally easy to maintain in a local file system, but much harder to maintain in a parallel or distributed file system. Files may be of different types: regular files that contain process-level data, and memory-mapped files which allow processes to read and write to files that are mapped to locations in memory.

Other types of files include socket files that are used for interprocess communication and device files that are used for communication between user-level and kernel-level entities.

File types and their implementation are discussed comprehensively by Pate, et al. [66]. File systems are usually implemented as part of the system kernel. The interface used to access files is called the Virtual File System (VFS) interface and is a POSIX standard [88]. Additional information on the UNIX file model is available in [66, 92, 54, 97].
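The byte-stream and file-pointer semantics described above are exposed to applications through POSIX system calls. The short C sketch below illustrates them; the file name /tmp/example.dat and the offsets used are hypothetical and serve only to show how open(), lseek() and read() manipulate the implicit file pointer.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char buf[64];

    /* Ask the file system to open an existing file; the returned
       descriptor carries an implicit byte offset, initially zero. */
    int fd = open("/tmp/example.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Explicitly reposition the implicit offset to byte 1024. */
    if (lseek(fd, 1024, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        return EXIT_FAILURE;
    }

    /* Read 64 bytes starting at the current offset; the offset is
       advanced by the number of bytes actually read. */
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("read %zd bytes\n", n);

    close(fd);
    return EXIT_SUCCESS;
}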

1.1.3 Mapping files to blocks on storage devices

File systems may be layered on top of a variety of different storage media such as magnetic tapes, magnetic disks, optical disks or floppy drives. The vast majority of file systems, including the ones we will consider in this dissertation, are layered on top of magnetic disks. We briefly discuss the architecture of magnetic disks. Storage media that is contained within the same enclosure as the compute device is usually referred to as Direct Attached Storage (DAS). DAS interfaces are usually based on SCSI or SATA. A magnetic disk is divided into a number of concentric tracks. In turn, each concentric track is divided into sectors or units of data. Because of the geometry of the disk, sectors on the outer tracks may be less densely packed than those on the inner tracks. Modern disk drives may have the same density of sectors on the outer tracks as on the inner tracks. As a result, the heads in modern disks may need to read and write data at different rates depending on the area of the disk they are accessing. The minimum unit of data that may be accessed from a magnetic disk is a sector. Modifying only one byte of data on a disk requires the entire sector containing that byte to be read into memory, modified and then written back to disk. This read-modify-write operation tends to be expensive and may reduce performance.

File systems must translate between the abstraction of a file and disk sectors. To achieve this, when a file is created, the file system will usually allocate a number of sectors on the disk for the file. We refer to this unit of allocation as a block. The list of blocks allocated to a file is stored in a data structure called an inode (index node). The inode also usually contains details about the file such as its creation time, permissions and so on. As the file increases in size, the inode for the file might not be large enough to hold the references to all the blocks for the file. Some of the block references in the inode might then point to secondary blocks called indirect blocks, which in turn contain references to data blocks. This may be extended to a tertiary or higher level, depending on the size of the file. Directories are also represented with inodes. The entries in the inode for a directory refer to the inodes of files and other subdirectories. When a process requests a file, the file system needs to access each of the inodes along the path to the file. To reduce the impact of these lookups, most file systems maintain an inode cache. The inode cache can dramatically improve the performance of accesses to extensively used inodes, such as the inode corresponding to the root directory of the file system or the system password file.
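The reach of the direct and indirect block pointers determines how large a file an inode can describe. The small C sketch below works through that arithmetic; the pointer counts and block size are assumptions chosen to resemble a classical ext2/ext3-style inode, not the exact parameters of any particular file system.

#include <stdio.h>

/* Illustrative calculation of the maximum file size representable by an
   inode with 12 direct pointers, one single-indirect block and one
   double-indirect block (assumed values, for the example only). */
int main(void)
{
    const unsigned long long block_size   = 4096;  /* bytes per block     */
    const unsigned long long ptr_size     = 4;     /* bytes per block ref */
    const unsigned long long direct       = 12;    /* direct pointers     */
    const unsigned long long per_indirect = block_size / ptr_size;

    unsigned long long max_blocks = direct
                                  + per_indirect                  /* single indirect */
                                  + per_indirect * per_indirect;  /* double indirect */

    printf("max file size ~ %llu MB\n",
           max_blocks * block_size / (1024 * 1024));
    return 0;
}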

1.1.4 Techniques to improve file system performance

I/O access performance depends on the performance of the underlying disk drive. Since disk drives are mechanical devices, disk drive performance is ultimately limited by the seek time of the head as it swings across the disk platter and is positioned in the correct location for reading the data. It also depends on the rate at which data may be read from the disk, which depends on the speed of rotation of the disk platter. Clearly, there are physical limitations to the performance of magnetic disk drive technology. The file system may attempt to reduce the impact of the disk by placing blocks of the same file adjacent to each other on the disk. However, this only improves the access performance of sequential workloads, i.e., it does nothing to improve the performance of random workloads. Also, path lookup requires fetching the inode block of each subdirectory in the path of the file. To reduce the access time of several operations, file systems usually maintain a data buffer cache. The buffer cache stores writes from a process and then writes them to the disk in stages. This process is called write-behind and hides the latency of the disk from the process. Also, blocks that are reused from the buffer cache experience very little overhead on reads. In addition, sequential I/O access patterns may be optimized by “reading ahead” the data and storing it in the buffer cache. Since the buffer cache is located in volatile memory, a crash might lose write data. Some systems use NVRAM (Non-Volatile RAM) to reduce the impact of losses due to a crash.
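On POSIX systems, an application can influence these buffer-cache policies from user space. The sketch below, which assumes a hypothetical trace file at /tmp/trace.dat, hints to the kernel that the access pattern is sequential so that read-ahead can be applied aggressively; a process that cannot tolerate losing write-behind data on a crash would similarly call fsync() to force dirty buffers to the device.

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/trace.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Advise the kernel that access will be sequential, enabling
       aggressive read-ahead into the buffer cache ... */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... or declare the accesses random, which typically disables
       read-ahead:
       posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); */

    char buf[8192];
    while (read(fd, buf, sizeof(buf)) > 0)
        ;  /* process the data */

    close(fd);
    return 0;
}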

1.2 Overview of Storage Architectures

With problems in science and technology becoming larger, clusters built with commodity off-the-shelf components are increasingly being deployed to tackle these problems. As the number of nodes in the cluster increases, it is important to provide uniform storage performance characteristics to all the nodes across the cluster. To enable these types of performance characteristics, it becomes necessary to use clustering techniques on the storage system itself. Broadly, storage architectures may be classified into Direct Attached Storage (DAS), Network Attached Storage (NAS), System Area Networks (SAN), Clustered NAS (CNAS) and Object Storage Systems (OSS). We discuss the salient features of each of these architectures next.

1.2.1 Direct Attached Storage (DAS)

Most modern operating systems such as Linux, BSD and Solaris that run on a generic computer box based on commodity components store configuration and application data on a local file system. The local file system in most operating systems is layered as shown in Figure 1.2. The applications generally run in user-space. Most local file systems are generally implemented within the operating system. To perform I/O, the application needs to trap into the operating system via system calls. The file system interface component of system calls, usually referred to as the Virtual File System (VFS) interface, is a POSIX standard [88]. Following that, the I/O call is processed by the File System User Component, which performs checks for user authorization, read/write access permission and quotas. Following these checks, the File System User Component may attempt to satisfy the request through a cache. The File System User Component may also attempt to prefetch I/O data from the storage subsystem, especially if the access pattern is sequential. If the request cannot be satisfied through the local cache, it is passed to the File System Storage Component, which translates the request into sectors, blocks and cylinders that may be interpreted by the underlying storage device. The underlying storage device may be directly attached to the compute node through a SCSI interface [27], or via a System Area Network (SAN) [55] such as Fibre Channel [100]. There are a variety of different local file systems such as ext3 [10], reiserFS [25], NTFS [20], ufs [31], Sun ZFS [33], NetApp WAFL [32] and SGI XFS [86]. These file systems are designed and optimized for a variety of different properties, namely performance, scalability, reliability and recovery. In this dissertation, we mainly focus on file systems that span multiple different client and storage nodes in a distributed and clustered environment. A local file system is an important component of a distributed file system, as it usually is responsible for managing the back-end storage nodes of the distributed file system. The characteristics of the local file system directly affect the properties of the distributed file system and may complicate its design, depending on the specific qualities we would like to achieve.

Figure 1.2: Local File System Protocol Stack. Courtesy, Welch, et al. [40]

1.2.2 Network Storage Architectures: In-Band and Out-Of-Band Access

Depending on how metadata is accessed, storage systems may be categorized as in-band (Figure 1.3) and out-of-band (Figure 1.4). In in-band systems, the metadata is stored together with the regular data on the same file system. As a result, access to metadata may become a bottleneck, since it may be delayed by access to large data blocks. Metadata usually includes information such as the location of different parts of a file, in addition to other information such as the permissions and modification times of the file. Fast, high-speed access to metadata information usually helps improve file system performance substantially. By storing the metadata and data on separate servers (out-of-band), the metadata access time may be substantially improved.

Figure 1.3: In-Band Access

Figure 1.4: Out-of-Band Access

1.2.3 Network Attached Storage (NAS)

In Network Attached Storage (NAS) systems, the file system is divided into two components, a client and a server, which reside on different nodes. The client component of the file system sends I/O requests from the application to the file system server over a network such as Gigabit Ethernet or InfiniBand. The server processes the client requests in a manner similar to that described in Section 1.2.1 and then sends the responses back to the client. The server node generally contains most of the components of the local file system. The file system protocol stack in the case of NAS is shown in Figure 1.5. This implies that the NAS generally has a single server or head. By locating the same files in a single location, NAS helps solve the problem of multiple different copies and versions of the same file and data residing in different local file systems. However, NAS exposes a new set of problems not present with a local file system. These problems include network round-trip latency, the possibility of dropped or duplicate non-idempotent I/O requests, increased overhead at the server head, limited scalability as the server increasingly becomes a bottleneck and, finally, increased exposure to failures, since the NAS head is a single point of failure. Usually, the problems with performance and failures may be alleviated to some extent by using Non-Volatile Memory (NVRAM) to store some transactions or components of transactions at the server [18]. Network File System (NFS) [36] and the Common Internet File System (CIFS) are examples of widely deployed NAS systems.

1.2.4 System Area Networks (SAN)

System Area Networks (SANs) provide high-speed access to data at the block level as shown in Figure 1.6. SANs may be implemented as either symmetric or asymmetric. In the symmetric implementation, each client runs part of the metadata server or block I/O manager. The clients use a distributed protocol to coordinate. The asymmetric implementation (pictured in Figure 1.6) uses a shared block-level manager, which the clients contact before attempting to access data through the SAN network. A SAN may be used as either a component of a local file system, or re-exported through the server of a NAS. Traditional SANs were built on Fibre Channel. As networks such as Gigabit Ethernet have become widespread, protocols such as iSCSI (SCSI over TCP/IP) have also become popular. Lately, with RDMA-enabled networks such as InfiniBand becoming popular, protocols such as iSER (iSCSI over RDMA) have also been widely deployed.

Figure 1.5: Network Attached Storage Protocols. Based on illustration by Welch, et al. [40]

1.2.5 Clustered Network Attached Storage (CNAS)

The single server in traditional NAS architectures becomes a bottleneck with an increasing number of clients. This limits the scalability of single-headed NAS. To alleviate the scalability limitations, multi-headed or clustered NAS architectures have evolved. In the clustered NAS architecture, shown in Figure 1.7, multiple NAS servers or heads share the same back-end storage. Since the data must still flow through the NAS servers, multiple memory copies within the NAS server, network protocol stack overhead and transmission overhead limit the achievable I/O performance. In addition, because the NAS servers add a level of indirection between the storage subsystem and the clients, this increases the cost of building such a system. Alternatively, the storage subsystem may be directly part of the NAS head, instead of part of a separate SAN. This could help reduce the cost of the clustered NAS system, but depending on the design of the clustered NAS, might reduce the broad availability of data to all NAS heads and introduce additional fault-tolerance problems.

Figure 1.6: System Area Networks. Based on illustration by Welch, et al. [40]

Architecturally, clustered NASs may be designed in two different ways. The first way is the SAN re-export approach shown in Figure 1.7. As discussed earlier, the SAN storage nodes are exposed to the clients through the NAS heads. SAN re-exports suffer from the same problems as traditional SANs with respect to block allocation and writes. In addition, the re-exporting introduces additional coherency problems between the NAS servers. There are two potential solutions to the coherency problem: force writes through to the SAN, or use the coherency protocol of the SAN. Examples of SAN re-export implementations include IBRIX [7], Polyserver [6] and IBM’s GPFS [83].

The second technique for designing a clustered NAS is the Request Forwarding Approach shown in Figure 1.8. In the Request Forwarding Approach, each NAS head is responsible for access to a subset of the storage space. A client forwards requests to the NAS head with which it communicates. The NAS head can choose to either service the request directly, or forward the request to the owner of the storage area. The benefit of servicing the request directly is that if the data is available in the cache, it can be served from the cache. This allows the NAS heads to be used for caching popular requests. However, one needs to worry about keeping the caches coherent. Forwarding the request to the owning NAS head eliminates the requirement to keep the caches coherent. However, the limited amount of cache and the bottleneck of a single server may reduce the effectiveness of this approach. Examples of clustered NAS systems that use the forwarding model are NetApp GX [65], Isilon [12] and IBRIX [7].

1.2.6 Object Based Storage Systems (OBSS)

Object Based Storage Systems are the latest evolution in storage technology. File systems are layered on top of object storage devices (OSDs) and use the OSD interfaces. The OSDs act as containers of data and file system attributes, and contain features for block management and security. Unlike traditional block-oriented storage systems, OSDs also understand high-level details such as data structure and layout. Usually, I/O data transfers are decoupled and take place out-of-band directly between the clients and the OSDs. Also, clustering techniques similar to those in clustered NAS environments are used to improve scalability and performance. Examples of Object Based Storage Systems include Lustre [43] and the Panasas file system [46].

Figure 1.7: Clustered NAS: High-Level Architecture

Figure 1.8: Clustered NAS Forwarding Model: Protocol Stack. Courtesy, Welch, et al. [40]

1.3 Representative File Systems

In this section, we discuss some representative file systems. We mainly focus on the

Network File System (NFS) and Lustre.

17 1.3.1 Network File Systems (NFS)

Network File System (NFS) [36] is ubiquitously used in most modern UNIX based clusters to export home directories. It allows users to transparently share file and I/O services on a variety of different platforms. The file system model consists of a hierarchy of directories with an optional set of files in each of these directories. The leaf directories do not contain additional entries for directories. NFS is based on the single server, multiple client model and is an example of a NAS as described in Section 1.2.3. Communication between the NFS client and the server is via the Open Network Computing (ONC) remote procedure call (RPC) [36], initially described in IETF RFC 1831 [36]. The first implementation of ONC RPC, also called Sun RPC, for a Unix type operating system was developed by Sun Microsystems [28]. Implementations for most other Unix-like operating systems, including Linux, have become available. RPC is an extension of local procedure calling semantics, and allows programs to make calls to remote nodes as if they were local procedure calls. RPC traditionally uses TCP/IP or UDP/IP as the underlying communication transport. Since the RPC calls may need to propagate between machines in a heterogeneous environment, the RPC stream is usually serialized with the eXternal Data Representation (XDR) standard [36] for encoding and decoding data streams.

NFS has seen three major generations of development. The first generation, NFS version 2, provided a stateless file access protocol between the server and client using RPC over UDP and exported a basic POSIX 32-bit filesystem. It was slow, especially for writing files. NFS version 3 added to the features of NFSv2 and provided several performance enhancements, including larger block data transfers, TCP-based transport and asynchronous writes, among many others. It also allowed for 64-bit files and improved write caching. Most modern clusters today use NFSv3. The NFSv2 and NFSv3 protocols are described further in [81].
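The XDR serialization step mentioned above can be illustrated with a short sketch. The field names and values below are purely illustrative and do not reflect the actual NFS argument layout; on recent Linux systems the ONC RPC/XDR routines may require linking against libtirpc rather than being provided by libc.

#include <rpc/rpc.h>   /* ONC RPC / XDR declarations */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char wire[256];
    XDR  xdrs;

    /* Values an RPC client might marshal into a call message. */
    int   version = 3;
    char *path    = "/export/home";

    /* Encode into a memory buffer in XDR (big-endian, 4-byte aligned)
       form, as RPC does before handing the stream to TCP or UDP. */
    xdrmem_create(&xdrs, wire, sizeof(wire), XDR_ENCODE);
    if (!xdr_int(&xdrs, &version) || !xdr_string(&xdrs, &path, 255)) {
        fprintf(stderr, "encode failed\n");
        return 1;
    }
    printf("encoded %u bytes\n", xdr_getpos(&xdrs));

    /* The receiver applies the same filters in XDR_DECODE mode; both
       sides agree on the wire format regardless of host byte order. */
    XDR   xdrd;
    int   rversion = 0;
    char  rpath[256];
    char *rp = rpath;
    memset(rpath, 0, sizeof(rpath));
    xdrmem_create(&xdrd, wire, sizeof(wire), XDR_DECODE);
    xdr_int(&xdrd, &rversion);
    xdr_string(&xdrd, &rp, 255);
    printf("decoded: version=%d path=%s\n", rversion, rpath);
    return 0;
}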

The revision to the NFSv3 protocol is NFSv4 [87]. Besides the functionality available in NFSv3, NFSv4 adds a number of additional mechanisms for security and performance, along with other features. The security features include the Generic Security Services (GSS) API, called RPCSEC GSS. NFSv3 used UNIX user ID security, which proved inadequate in wide-area networks. RPCSEC GSS allows the use of private key authentication schemes such as Kerberos version 5 as well as public key schemes. To improve performance, NFSv3 allows a client side cache to be maintained. However, the client side caches are not kept coherent, which may lead to inconsistencies between the caches. NFSv4 also does not keep the client caches coherent. It does allow files to be delegated to a client, if it is exclusively accessing that file. NFSv3 and NFSv4 are discussed further in Chapter 3. The time lag between the NFSv2 and NFSv3 protocols is 8 years, with a similar gap between the NFSv3 and NFSv4 protocols. To reduce the gap for relatively small enhancements to the protocol, minor versioning was added to the NFSv4 protocol. NFSv4.1 is the first minor revision to the NFSv4 protocol, which deals with parallel NFS (pNFS) and sessions. It is discussed further in Chapter 6.

1.3.2 Lustre File System

Lustre is an open-source, all-software, POSIX-compliant, clustered, object based file system. Lustre uses decoupled metadata and data paths as shown in Figure 1.9. The client contacts the Metadata Server (MDS) to create, open, close and lock a file. Once the information is obtained from the MDS, the clients can directly access the data through the Object Storage Servers (OSTs). Lustre demonstrates good performance on large sequential data I/O transfers because of the parallelism of the OSTs. In addition, Lustre also demonstrates fairly good throughput on file creations and accesses to small files. This is largely achieved through modifications to the underlying local disk file system ext3, which bunches a number of small files together on the same disk block, instead of spreading out these files on different areas of the disk. The disk seeks at the OSTs and MDSs still limit the file access performance. To alleviate some of these problems, Lustre has both client and server side caches. These caches are kept coherent through multi-state locks maintained by the MDS. Lustre supports multiple different networks natively, including InfiniBand [49], Elan4 [74] and Myrinet MX, in addition to socket based TCP/IP networks. Similar to NFS, Lustre also uses RPC to communicate between the clients, MDS and OSTs. To achieve reliability and high availability, Lustre allows the MDS to fail over to another standby server. In addition, OSTs that fail will not impact clients that do not have data stored on that particular OST. OSTs may also be taken offline for an upgrade or rebooted; clients that depend on that OST will experience a delay. Lustre has largely been used in HPC environments for large sequential workloads such as checkpointing. It is also used as a scratch space file system. In the July 2008 rankings of the TOP500 list of supercomputers, 6 of the top 10 use Lustre [30].
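The sequential bandwidth noted above comes from striping a file's data round-robin across the OSTs, so that large transfers engage several servers at once. The sketch below works through a generic round-robin striping calculation; the 1 MB stripe size and 4-OST stripe count are assumptions chosen for the example, not Lustre defaults for any particular installation.

#include <stdio.h>

/* For a given file offset, compute which data server (OST) and which
   offset within that server's object hold the byte, assuming simple
   round-robin striping. */
int main(void)
{
    const unsigned long long stripe_size  = 1 << 20;  /* 1 MB stripe */
    const unsigned long long stripe_count = 4;        /* number of OSTs */

    unsigned long long offsets[] = { 0, 1500000, 5242880, 123456789 };

    for (int i = 0; i < 4; i++) {
        unsigned long long off        = offsets[i];
        unsigned long long stripe_no  = off / stripe_size;
        unsigned long long ost_index  = stripe_no % stripe_count;
        unsigned long long obj_offset = (stripe_no / stripe_count) * stripe_size
                                      + off % stripe_size;
        printf("file offset %12llu -> OST %llu, object offset %llu\n",
               off, ost_index, obj_offset);
    }
    return 0;
}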

1.4 Overview of Networking Technologies

In this section, we discuss some important networks that have become prominent in high-performance storage.

Figure 1.9: Lustre Architecture

1.4.1 Fibre Channel (FC)

Fibre Channel (FC) is an ANSI standard used primarily for storage area networks (SANs) in enterprise networks. It runs primarily at Gigabit speeds over both twisted copper wire pairs and fiber-optic cables. FC mainly uses SCSI commands, which are transported over FC's native protocol, the Fibre Channel Protocol (FCP). FCP is responsible for transporting SCSI commands between the SCSI initiator, which is usually a client, and the SCSI target, which is usually a storage device. In order to achieve this, FC allows for three different types of topologies: point-to-point, loop and fabric, shown in Figure 1.11. Point-to-point involves connecting two devices back-to-back. For the loop topology, the devices are connected in the form of a ring. In the fabric topology, the devices are connected to a fabric switch, similar to Ethernet. FC is a layered protocol divided into five layers, FC0-FC4. FC0-FC2 form the physical layers of Fibre Channel and are referred to as FC-PH. FC3 is used to implement encryption and RAID. FC4 is a protocol mapping layer and implements protocols such as SCSI. Fibre Channel runs at speeds from 1 Gbit/s to 20 Gbit/s, though some of the higher speeds are not compatible with the lower speeds. Additional information on FC is available in [55, 2]. Figure 1.10 shows some examples of Fibre Channel connectors.

Figure 1.10: Fibre Channel Connectors [3]

Figure 1.11: Fibre Channel Topologies [3]

1.4.2 10 Gigabit Ethernet

10 Gigabit Ethernet (10 GbE) is the successor to the Gigabit Ethernet standard. It runs at 10 times the speed of Gigabit Ethernet. It is defined by IEEE standard 802.3-2008. 10 Gigabit Ethernet is currently a switch based technology and does not support Carrier Sense Multiple Access with Collision Detection (CSMA/CD) for shared access. The 10 GbE standard allows for different physical layer implementations. Currently, physical layers exist for LAN and WAN environments. For cabling, 10 GbE supports CX-4 and fibre optical cables. 10 Gigabit Ethernet network cards exist in both TCP/IP Offload Engine (TOE) and non-TOE modes. The Chelsio S310E is an example of a TOE [44], while the Myri 10G is an example of a non-offload 10 GbE NIC [24]. iWARP is a protocol which allows the user to use Remote Direct Memory Access (RDMA) over TCP/IP. TCP/IP is a compute intensive protocol whose performance is limited, especially on high-performance interconnects [39]. When iWARP is used with a TOE NIC, the performance limitations of TCP/IP can be significantly reduced or even eliminated under certain circumstances [77]. In addition, this allows RDMA to be used over Ethernet and in WAN environments. Internet SCSI (iSCSI) allows SCSI commands, which traditionally have been used in Fibre Channel environments, to be used in TCP/IP based environments [50]. Like iWARP, iSCSI suffers from considerable overhead because of TCP/IP. By using a TOE NIC, this overhead may be reduced considerably. iSCSI may then be used to build SANs with capabilities similar to those of Fibre Channel, but at considerably lower cost.

1.4.3 InfiniBand

The InfiniBand Architecture (IBA) [49] is an open specification designed for interconnecting compute nodes, I/O nodes and devices in a system area network. In an InfiniBand network, compute nodes are connected to the fabric by Host Channel Adapters (HCAs). InfiniBand allows communication through several combinations of connection-oriented, connectionless, reliable and unreliable communication semantics. InfiniBand enables the application programmer to use kernel-bypass RDMA operations. These operations enable two appropriately authenticated processes to directly Read and Write from each other's memory. InfiniBand is a popular interconnect in High Performance Computing (HPC) environments, with 24% of the most powerful supercomputers in the June 2008 ranking of the TOP 500 list using InfiniBand [30]. InfiniBand implementations run at several speeds, from Single Data Rate (SDR), with 2 GB/s bidirectional bandwidth, up to Quad Data Rate (QDR), with 8 GB/s bidirectional bandwidth. Processes using InfiniBand may also communicate with very low latency. Current generation InfiniBand adapters can communicate with a process-to-process latency of under 2 µs [15]. The IBA standard allows I/O nodes to be directly connected to the InfiniBand fabric by using a Target Channel Adapter (TCA). iSCSI may be implemented directly over RDMA and is called iSER. iSER eliminates the overhead of TCP/IP present in iSCSI type protocols. iSER allows us to build SAN type networks with InfiniBand [89]. Implementations of InfiniBand include those from Mellanox Technologies [14] and PathScale [9]. InfiniBand has also made its foray into the wide area network [78]. Figure 1.12 and Figure 1.13 are pictures of an InfiniBand fibre cable and switch, respectively.

Figure 1.12: InfiniBand Fibre Cables

Figure 1.13: InfiniBand Switch

CHAPTER 2

PROBLEM STATEMENT

Commodity clusters have come a long way with the first petascale cluster being unveiled [26]. The applications on these systems consume and generate vast quantities of data. The storage requirements from these applications have stretched current storage architectures to their breaking point. We identify the following broad limitations in current storage systems:

• Protocol Overhead with high-speed interconnects: Most storage systems and architectures use protocols such as TCP/IP and UDP/IP. These protocols were designed for slower speed networks such as Gigabit Ethernet. The memory bandwidth of machines is usually an order of magnitude higher than the Gigabit Ethernet network bandwidth. As a result, the two or more memory copies in the protocol stack and checksumming for reliability do not severely impact the ability of the CPU to keep the network busy. However, with 10 Gb/s networks such as InfiniBand, the ratio of CPU speed to network speed has shifted towards the network. The CPU is no longer able to keep the network pipeline filled. The problem only becomes worse as networks move to Double Data Rate (DDR) InfiniBand and Quad Data Rate (QDR) [58].

• Novel Communication Features of High-Speed Interconnects: Existing communication protocols are unable to take advantage of the communication architecture of high-performance interconnects such as InfiniBand and iWARP, including Remote Direct Memory Access (RDMA). These high-speed interconnects offer kernel-bypass, low-overhead, zero-copy, reliable, in-order messaging. Existing storage protocol stacks are not designed to take advantage of these novel network features.

• Scalability of network protocol stacks in parallel file system environments: To meet the storage performance demands of large clusters, clustered NAS (CNAS) systems have evolved. These CNAS systems have multiple data servers and one or more metadata servers. Storing data in parallel on one or more different data servers offers several advantages; the most important among them is the benefit of multiple streams of data from different servers. However, as the number of data servers increases, the multiple streams of data into a single node cause protocols such as TCP/IP to suffer from problems such as incast [46]. In incast, the overhead from multiple incoming data streams causes reliability timeouts in the TCP/IP stack, forcing extraneous retransmissions and reducing the throughput to a paltry percentage of the peak aggregate network bandwidth.

• Mechanical Limitations of Magnetic Disks: The vast majority of existing storage devices are magnetic disks. This is largely because of their vast storage capacity and reasonable cost. However, as with all mechanical devices, magnetic disks are fundamentally limited by the rotational speed of the disk platters in the bandwidth they can deliver to end-applications. Techniques such as RAID [67] might be able to help improve performance by stringing multiple disks together. However, the aggregate bandwidth of a RAID array is still limited to at most the sum of the bandwidths of the individual disks. In addition, RAID arrays usually exhibit fairly poor performance on random access patterns and write operations [53]. As CPUs and main memory increase in speed, there is a widening gap between what applications require and what can be offered by the storage subsystem.

• Limitations in Storage Systems In Long-Haul Networks: Geographically separated large scale clusters are becoming increasingly common [19]. Different components of large scale parallel applications may increasingly run in geographically disjoint portions of the cluster. Increasingly, the links between these geographically separated systems are high-performance interconnects such as InfiniBand and 10 Gigabit Ethernet [77]. Traditional storage systems are typically not designed and optimized for use in these geographically separated clusters connected by high-performance interconnects.

• Fault-tolerance and large scale HPC applications: Large scale applications gener-

ally run for long periods of time and generally solve valuable problems in science and

technology. To guard against unforeseen consequences such as faults in computer

systems, most large-scale applications use checkpoint schemes at regular intervals

of time. Depending on the characteristics of the underlying storage subsystem,

checkpointing might dramatically increase the running time of the application.

We now discuss our research approach to solving these problems in the next section.

2.1 Research Approach

To help overcome the limitations of storage subsystems discussed above, we use the following

six research approaches shown in Figure 2.1; RDMA enabled communication subsystem for

NFSv3 (RPC v3 with RDMA), RDMA enabled transport for NFSv4 (RPC v4 with RDMA),

NFS with RDMA in an InfiniBand WAN environment, RDMA enabled communication sub- system for parallel NFS (pNFS), Caching Techniques to improve the performance of a NAS

and Checkpointing with large-scale applications. We describe each of these components

below.

• RDMA enabled communication subsystem for NFSv3 (RPC v3 over RDMA). In

this component, we explore the trade-offs, merits and demerits of different strategies

for enabling the communication substrate Remote Procedure Call (RPC) in the Net-

work File System (NFSv3) with RDMA. We investigate the security, performance

and scalability implications of two different designs; namely a server oriented design

and a combination of a client and server oriented design. We also explore a variety

of communication buffer pinning and registration techniques.

• A communication subsystem for NFSv4 with RDMA (RPC v4 over RDMA).

We investigate how the techniques researched for NFSv3 may be applied to NFSv4,

the successor protocol to NFSv3. The NFSv4 protocol differs from NFSv3 in the

methods, techniques and semantics used to provide performance. In our research, we

look at how our existing RPC over RDMA design may be optimized for NFSv4.

• NFS communication subsystem in a WAN environment. Geographically sepa-

rated HPC clusters with high-speed links are increasingly being deployed. As part

of this research, we investigate the issues surrounding storage systems in a wide area

network (WAN) environment. Specifically, we evaluate the performance of RPC over

RDMA in an InfiniBand WAN environment.

• RDMA enabled communication subsystem for pNFS. Here we research how to

design RPC over RDMA for NFSv4.1. NFSv4.1 is composed of parallel NFS (pNFS)

and sessions. The pNFS architecture consists of different components; a metadata

server, clients and data-servers. pNFS is architected to communicate with a variety

of different file system architectures; file, block and object. We mainly focus on

designing a high-performance RPC over RDMA for pNFS file architecture. We look

at the trade-offs of different RPC over RDMA connections from client to metadata

server, from client to the data servers and from the metadata server to the data server.

We also provide a comprehensive evaluation of the design in terms of both micro-

benchmarks and applications.

• Caching techniques to improve the performance of a NAS. Modern storage systems

are limited by the performance of disk based storage, which is the most common

back-end for storing data. As CPU speeds increase as dictated by Moore's law, and with multi-core processors and improving memory bandwidth, the gap between CPU and

disk performance continues to widen. This translates into poor performance from the

storage subsystem. As a result, applications that are I/O bound are directly impacted

by these limitations. Parallel file systems are especially impacted by these limita-

tions, since the data is usually stored on the back-end on multiple disks. To reduce

the impact of these limitations, we propose, design and evaluate a data caching ar-

chitecture for a parallel/distributed file system. The caching architecture is especially

designed for environments where there is considerable read and write I/O sharing.

• Impact of high-performance networks on distributed and parallel file systems.

Large scale applications can run for hours and days at a time. These applications per-

form crucial roles in scientific and engineering fields. Their failures can have con-

siderable security, environmental and economic impact. Unfortunately, with large

scale commodity clusters, the likelihood and actual occurrence of faults is consider-

ably magnified. To guard against these faults, most applications employ some form

of checkpointing. To reduce the complexity of checkpointing, application transpar-

ent, system-level checkpointing is generally provided. System-level checkpointing

generally depends on the performance of the storage subsystem. As a part of this

dissertation, we evaluate the impact of the storage subsystem on the performance of

system-level checkpointing.

2.2 Dissertation Overview

In this dissertation, we propose, design and evaluate a high-performance storage archi- tecture over InfiniBand. Chapter 3 presents our design for Network File System (NFS),

NFSv3 over InfiniBand. Following that, in Chapter 4, we look at designing NFSv4 over

InfiniBand. Next, in Chapter 5, we evaluate these protocols in a wide area network (WAN)

Figure 2.1: Proposed Architecture

scenario. Chapter 6 looks at our design for parallel NFS and sessions. Following that,

Chapter 7 provides a discussion of the caching architecture for high-performance paral- lel and storage systems. After that, Chapter 8 evaluates the impact of high-performance

I/O on checkpointing of parallel applications. Next, Chapter 9 provides the details of the

open-source design availability. Finally, Chapter 10 presents our conclusions and future work.

CHAPTER 3

SINGLE SERVER NAS: NFS ON INFINIBAND

The Network File System (NFS) has become the dominant standard for sharing files in

a UNIX clustered environment. In this Chapter, we focus on the challenges of designing a high-performance communication substrate for NFS.

3.1 Why is NFS over RDMA important?

The Network File System (NFS) [36] protocol has become the de facto standard for sharing files among users in a distributed environment. Many sites currently have terabytes of storage data on their I/O servers. I/O servers with petabytes of data have also debuted.

Fast and scalable access to this data is critical. The ability of clients to cache this data for fast and efficient access is limited, partly because of the demands on main memory on the client, which is usually allocated by memory-hungry applications such as in-memory database servers. Also, for medium and large scale clusters and environments, the overhead of keeping client caches coherent quickly becomes prohibitively expensive. Under these conditions, it becomes important to provide efficient low-overhead access to data from the

NFS servers.

NFS performance has traditionally been constrained by several factors. First, it is based on the single server, multiple client model. With many clients accessing files from the

NFS server, the server may quickly become a bottleneck. Servers with 64-bit processors commonly have a large amount of main memory, typically 64GB or more. File systems like XFS can take advantage of the main memory on these servers to aggressively cache

and prefetch data from the disk for certain data access patterns [91]. This mitigates the

disk I/O bottleneck to some extent. Second, limitations in the Virtual file system (VFS)

interface, force data copying from the file system to the NFS server. This increases CPU

utilization on the server. Third, traditionally used communication protocols such as TCP

and UDP require additional copies in the stack. This further increases CPU utilization

and reduces the operation processing capability of the server. With an increasing number

of clients, protocols like TCP also suffer from problems like incast [46], which forces

timeouts in the communication stack, and reduces overall throughput and response time.

Finally, an order of magnitude difference in bandwidth between commonly used networks

like Gigabit Ethernet (125 MB/s) and the typical memory bandwidth of modern servers (2

GB/s or higher) can also be observed. This limits the striping width, resulting in more

complicated designs to alleviate this problem [46].

Modern high-performance networks such as InfiniBand provide low-latency and high-

bandwidth communication. For example, the current generation Single Data Rate (SDR)

NIC from Mellanox has a 4 byte message latency of less than 3µs and a bi-directional band-

width of up to 2 GB/s for large messages. Applications can also deploy mechanisms like

Remote Direct Memory Access (RDMA) for low-overhead communication. RDMA operations allow two appropriately authorized peers to read and write data directly from each other's address space. RDMA requires minimal CPU involvement on the local end, and no

CPU involvement on the remote end. Since RDMA can directly move data from the application buffers on one peer into the application buffers on another peer, it allows designers to consider zero-copy communication protocols. Designing the stack with RDMA may eliminate the copy overhead inherent in the TCP and UDP stacks. Additionally, the load on the CPU can be dramatically reduced. This benefits both the server and the client. Since the

utilization on the server is reduced, the server may potentially process requests at a higher

rate. On the client, additional CPU cycles may be allocated to the application. Finally,

an RDMA transport can better exploit the latency and bandwidth of a high-performance

network.

An initial design of NFS/RDMA [42] for the OpenSolaris operating system was designed by Callaghan, et al. This design allowed the client to read data from the server

through RDMA Read. An important design consideration for any new transport is that it

should be as secure as a transport based on TCP or UDP. Since RDMA requires buffers to be exposed, it is critical that only trusted entities be allowed to access these buffers. In most

NFS deployments, the server may be considered trustworthy; the clients cannot be trusted.

So, exposing server buffers makes the server vulnerable to snooping and malicious activ- ity by the client. Callaghan’s design exposed server buffers and therefore suffered from a security vulnerability. Also, inherent limitations in the design of RDMA Read reduce the number of RDMA Read operations that may be issued by a local peer to a remote peer. This

throttles the number of NFS operations that may be serviced concurrently, limiting perfor-

mance. Finally, Callaghan’s design did not address the issue of multiple buffer copies.

Our experiments with the original design of NFS/RDMA reveal that on two Opteron 2.2

GHz systems with x8 PCI-Express Single Data Rate (SDR) InfiniBand adapters capable

of a unidirectional bandwidth of 900 MegaBytes/s (MB/s), the IOzone [11] multi-threaded

Read bandwidth saturates at just under 375 MB/s.

In this Chapter, we take on the challenge of designing a high performance NFS over

RDMA for OpenSolaris. We discuss the design principles for designing NFS protocols

with RDMA. To this end we take an in-depth look at the security and buffer management

vulnerabilities in the original design of NFS over RDMA on OpenSolaris. We also demon-

strate the performance limitations of this RDMA Read based design. We propose and

evaluate an alternate design based on RDMA Read and RDMA Write. This design elimi-

nates the security risk to the server. We also look at the impact of the new design on buffer

management.

We try to evaluate the bottlenecks that arise while using RDMA as the underlying trans-

port. While RDMA operations may offer many benefits, they also have several constraints

that may essentially limit their performance. These constraints include the requirement

that all buffers meant for communication must be pinned and registered with the HCA.

Given that NFS operations are short lived, bursty and unpredictable, buffers may have to be registered and deregistered on the fly to conserve system resources and maintain appro- priate security restrictions in place on the system. We explore alternative designs that may

potentially reduce registration costs. Specifically, our experiments show that with appropriate registration strategies, an RDMA Write based design can achieve a peak IOzone Read throughput of over 700 MB/s on OpenSolaris and a peak Read bandwidth of close to 900

MB/s for Linux. Evaluation with an Online Transaction Processing (OLTP) workload shows that the higher throughput of our proposed design can improve performance by up to 50%. We also evaluate the scalability of the RDMA transport in a multi-client setting, with a RAID array of disks. This evaluation shows that the Linux NFS/RDMA design can provide an aggregate throughput of 900 MB/s to 7 clients, while NFS on a TCP transport saturates at

360 MB/s. We also demonstrate that NFS over an RDMA transport is constrained by the performance of the back-end file system, while a TCP transport itself becomes a bottleneck in a multi-client environment.

In this Chapter, we investigate and contribute to the following:

• A comprehensive discussion of the design considerations for implementing NFS/RDMA

protocols.

• A high performance implementation of NFS/RDMA for OpenSolaris, and a discus-

sion of its relationship to a similar implementation for Linux.

• An in-depth performance evaluation of both designs.

• Design considerations for the relative limitations and potential solutions to the prob-

lem of registration overhead.

• Application evaluation of the NFS/RDMA protocols, and the impact of registration

schemes such as Fast Memory Registration and All Physical Registration, and a

buffer registration cache design on performance.

• Impact of RDMA on the scalability of NFS protocols with multiple clients and real

disks supporting the back-end file system.

The rest of the Chapter is presented as follows. Section 3.2 provides an overview of the InfiniBand Communication model. Section 3.3 explores the existing NFS over RDMA architecture on OpenSolaris and Linux. In Section 3.4, we propose our alternate design based on RDMA Read and RDMA Write and compare it to the original design based on

RDMA Read only. Section 3.5 presents the performance evaluation of the design. Finally,

Section 3.6 presents our summary.

3.2 Overview of the InfiniBand Communication Model

InfiniBand uses the Reliable Connection (RC) model. In this model, each initiating node needs to be connected to every other node it wants to communicate with through a peer-to-peer connection called a queue-pair (send and receive work queues). The queue pairs are associated with a completion queue (CQ). The connections between different nodes need to be established before communication can be initiated. Communication operations, or Work Queue Entries (WQEs), are posted to a work queue. The completion of these communication operations is signaled by completion events on the completion queue. InfiniBand supports different types of communication primitives. These primitives are discussed next.
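As an illustration of these objects, the following is a minimal userspace sketch using the libibverbs API (the in-kernel transports discussed in this dissertation use analogous kernel verbs interfaces); it creates a completion queue and a Reliable Connection queue pair bound to it. Connection establishment and the transition of the queue pair to the ready-to-send state are omitted.

    #include <infiniband/verbs.h>

    /* Create an RC queue pair whose send and receive work queues report
     * completions to a single completion queue. */
    struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                                struct ibv_cq **cq_out)
    {
        struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
        if (cq == NULL)
            return NULL;
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,                 /* WQE completions signaled here  */
            .qp_type = IBV_QPT_RC,         /* Reliable Connection            */
            .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                         .max_send_sge = 4, .max_recv_sge = 4 },
        };
        *cq_out = cq;
        return ibv_create_qp(pd, &attr);   /* the peer-to-peer queue pair    */
    }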

3.2.1 Communication Primitives

InfiniBand supports two-sided communication operations, that require active involve-

ment from both the sender and receiver. These two-sided operations are known as Channel

Semantics. One-sided communication primitives, called Memory Semantics, do not require

involvement by the receiver. Channel and Memory semantics are discussed next.

Channel Semantics: Channel semantics or Send/Receive operations are traditionally

used for communication. A receive descriptor or RDMA Receive (RV) that points to a

pre-registered fixed length buffer, is usually posted on the receiver side to the receive queue before the RDMA Send (RS) can be initiated on the senders side. The receive descriptors are usually matched with the corresponding send in the order of the descriptor posting.

On the sender side, receiving a completion notification for the send indicates that the buffer used for sending may be reused. On the receiver side, getting a receive completion indicates that the data has arrived and is available for use. In addition, the receive buffer may be reused for another operation.
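A sketch of the channel semantics in userspace libibverbs terms is shown below; the memory region mr is assumed to have been registered beforehand on each side, and error handling is elided.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Receiver: pre-post a fixed-length, pre-registered buffer. */
    static int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, uint32_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)mr->addr, .length = len,
                               .lkey = mr->lkey };
        struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad;
        return ibv_post_recv(qp, &wr, &bad);   /* must precede the peer's Send */
    }

    /* Sender: the Send consumes the oldest matching receive descriptor. */
    static int post_send(struct ibv_qp *qp, struct ibv_mr *mr, uint32_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)mr->addr, .length = len,
                               .lkey = mr->lkey };
        struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                                  .opcode = IBV_WR_SEND,
                                  .send_flags = IBV_SEND_SIGNALED };
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);   /* completion => buffer reusable */
    }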

Memory Semantics: Memory semantics or Remote Direct Memory Access (RDMA) are one-sided operations initiated by one of the peers connected by a queue pair. The peer that initiates the RDMA operation (active peer) requires both an address (either virtual or physical), as well as a steering tag to the memory region on the remote peer (passive peer).

The steering tag is obtained through memory registration [49]. To prepare a region for a memory operation, the passive peer may need to do memory registration. Also a message exchange may be needed between the active and passive peers to obtain the message buffer

addresses and steering tags. RDMA operations are of two types, RDMA Write (RW) and

RDMA Read (RR). RDMA Read obtains the data from the memory area of the passive peer and deposits it in the memory area of the active peer. RDMA Write operations on the other hand move data from the memory area of the active peer to corresponding locations on the passive peer.
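The memory semantics can be sketched in the same style, assuming the same headers as the previous sketch; the remote address and steering tag (rkey) are assumed to have been obtained from the passive peer through a prior rendezvous exchange.

    /* Active peer: deposit data directly into the passive peer's buffer. */
    static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                               uint32_t len, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)local_mr->addr,
                               .length = len, .lkey = local_mr->lkey };
        struct ibv_send_wr wr = {
            .sg_list = &sge, .num_sge = 1,
            .opcode = IBV_WR_RDMA_WRITE,       /* IBV_WR_RDMA_READ for a Read   */
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;  /* passive peer's buffer address  */
        wr.wr.rdma.rkey        = rkey;         /* steering tag from registration */
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);   /* passive peer is not involved   */
    }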

A comparison of the different communication primitives in terms of Security (Receive

Buffer Exposed), Involvement of the receiver (Receive Buffer Pre-Posted), Buffer protec- tion (Steering Tag) and finally, Peer Message Exchanges for Receive Buffer Address and

Steering Tag (Rendezvous) is shown in Table 3.1.

                               Channel Semantics    Memory Semantics
Receive Buffer Exposed                                      X
Receive Buffer Pre-Posted             X
Steering Tag                                                X
Rendezvous                                                  X

Table 3.1: Communication Primitive Properties

3.3 Overview of NFS/RDMA Architecture

NFS is based on the single server, multiple client model. Communication between the

NFS client and the server is via the Open Network Computing (ONC) remote procedure call (RPC) [36]. RPC is an extension to the local procedure calling semantics, and allows programs to make calls to nodes on remote nodes as if it were a local procedure call.

Callaghan et.al. designed an initial implementation of RPC over RDMA [41] for NFS. This

existing architecture is shown in Figure 3.1. The Linux NFS/RDMA implementation has a similar architecture. The architecture was designed to allow transparency for applications accessing files through the Virtual File System (VFS) layer on the client. Accesses to the

file system go through VFS, and are routed to NFS. For an RDMA mount, NFS will make the RPC call to the server via RPC over RDMA. The RPC Call generally being small will go as an inline request. Inline requests are discussed in the next section. In the rest of the

Chapter, we use the terms RPC/RDMA and NFS/RDMA interchangeably.


Figure 3.1: Architecture of the NFS/RDMA stack in OpenSolaris

3.3.1 Inline Protocol for RPC Call and RPC Reply

The RPC Call and Reply are usually small and within a threshold, typically less than 1KB. In the RPC/RDMA protocol the call and reply may be transferred inline via

a copy based protocol similar to that used in MPI stacks such as MVAPICH [15]. The

copy based protocol uses the channel semantics of InfiniBand described in Section 3.2.1.

During startup (at mount time), after the InfiniBand connection is established, the client

and server each will establish a pool of send and receive buffers. The server posts receive

buffers from the pool on the connection. The client can send requests to the server up to

the maximum pool size using RDMA Send operations. This exercises a natural upper limit

on the number of requests that the client may send to the server. The upper limit can be

adjusted dynamically; this may result in better resource utilization. The OpenSolaris server

sets this limit at 128 per client; the Linux server sets it at 32 per client. At the time of

making the RPC Call, the client will prepend an RPC/RDMA header (Figure 3.2) to the

NFS Request passed down to it from the NFS layer as shown in Figure 3.1. It will post

a receive descriptor from the receive pool for the RPC Call, then issue the RPC Call to

the server through an RDMA Send operation. This will invoke an interrupt handler on the

OpenSolaris server that will copy out the request from the receive buffer and repost it to

the connection. (The Linux server does not do the copy, and reposts the receive descriptor

at a somewhat later time.) The request will then be placed in the server's task queue. A

transport context thread will eventually pick up the request that will then be decoded by the

RPC/RDMA layer on the server. Bulk data transfer chunks will be decoded and stored at

this point. The request will then be issued to the NFS layer. The NFS layer will then issue

XID (Transaction ID) | RPC/RDMA Version | Credits (Flow Control Field) | Message Type | Read, Write or Reply Chunk List | RPC Call or Reply Msg

Message types: 0 = An RPC call or Reply (RDMA_MSG); 1 = An RPC call or Reply with no body (RDMA_NOMSG); 2 = An RPC call or Reply with padding (RDMA_MSGP); 3 = Client signals reply completion (RDMA_DONE)

Figure 3.2: RPC/RDMA header

the request to the file system. On the return path from the file system, the request will pass

through the NFS layer. The NFS layer will encode the results and make the RPC Reply

back to the client. The interrupt handler at the client will wake up the thread parked on the request and control will eventually return to the application.
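The client-side inline path can be summarized with the following sketch; the transport structure and the helpers alloc_inline_buf() and post_recv_for_reply() are hypothetical stand-ins for the pool management described above, and post_send() is the channel-semantics call sketched in Section 3.2.1 (the headers of the earlier sketches are assumed).

    #include <string.h>
    #include <errno.h>

    #define INLINE_THRESHOLD 1024              /* typical inline size             */

    struct inline_buf { void *data; struct ibv_mr *mr; }; /* pre-registered buffer */
    struct xprt_state { struct ibv_qp *qp; int credits; };/* hypothetical state    */

    static int send_inline_call(struct xprt_state *x, const void *call, size_t len)
    {
        if (len > INLINE_THRESHOLD)
            return -EMSGSIZE;                  /* must use the long call (chunk) path */
        if (x->credits == 0)
            return -EBUSY;                     /* server receive pool exhausted   */

        struct inline_buf *b = alloc_inline_buf(x);
        memcpy(b->data, call, len);            /* copy-based, like an MPI eager path */
        post_recv_for_reply(x);                /* receive descriptor for the RPC Reply */
        x->credits--;                          /* one pre-posted server buffer consumed */
        return post_send(x->qp, b->mr, (uint32_t)len); /* RDMA Send of the inline call */
    }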

3.3.2 RDMA Protocol for bulk data transfer

NFS procedures such as READ, WRITE, READLINK and READDIR can transfer data

whose length is larger than the inline threshold [36]. Also, the RPC call itself can be larger

than the inline data threshold. The bulk data can be transferred in multiple ways. The

existing approach is to use RDMA Read only and is referred to as the Read-Read design.

Our approach is to use a combination of RDMA Read and RDMA Write operations and is

called the Read-Write design. We describe both these approaches in detail. Before we do

that, we define some essential terminologies.

Chunk Lists: These lists provide encoding for bulk data whose length is larger than the

inline threshold and should be moved via RDMA. A chunk list consists of a single counted

array of segments of one or more lists. Each of these lists is in turn a counted array of zero

or more segments. Each segment encodes a steering tag for a registered buffer, its length and its offset in the main buffer (a schematic encoding is sketched below, after the list of chunk types). Chunks can be of different types; Read chunks, Write chunks and Reply chunks.

• Read chunks used in the Read-Read and Read-Write design encode data that may be

RDMA Read from the remote peer.

• Write chunks used in the Read-Write design are used to RDMA Write data to the

remote peer.

• Reply chunks used in the Read-Write design are used for procedures such as READ-

DIR and READLINK, and are used to RDMA Write the entire NFS response.
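A schematic C encoding of a chunk segment and a chunk list, following the description above, is given below; the structure and field names are illustrative rather than the actual OpenSolaris or Linux definitions, and the real encoding is part of the XDR-marshalled RPC/RDMA header.

    #include <stdint.h>

    struct chunk_segment {
        uint32_t handle;     /* steering tag of the registered buffer        */
        uint32_t length;     /* number of bytes covered by this segment      */
        uint64_t offset;     /* offset of the segment in the main buffer     */
    };

    struct chunk_list {
        uint32_t             nsegments;   /* counted array of segments       */
        struct chunk_segment segments[];  /* read, write or reply chunks     */
    };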

The RPC Long Call is typically used when the RPC request itself is larger than the inline threshold. The RPC Long Reply is used in situations where the RPC Reply is larger than the inline size. Other bulk data transfer operations include READ and WRITE. All

these procedures are discussed in the next section.

Figure 3.3: Read-Read Design

Figure 3.4: Read-Write Design

Figure 3.5: Registration points (Read-Write)

3.4 Proposed Read-Write Design and Comparison to the Read-Read Design

In this section, we discuss our proposed Read-Write design, which is based on a com-

bination of RDMA Read and RDMA Write. We also compare with the original Read-Read

based design, which is based on RDMA Read. We discuss the limitations of the Read-Read

based design. Following that, we also discuss the advantages of the Read-Write design. We

look at registration strategies and designs in Section 3.4.3. The Read-Read based design is

shown in Figure 3.3. The Read-Write design is shown in Figure 3.4.

RPC Long Call: The RPC Long Call is typically used when the RPC request itself is

larger than the inline threshold. In this case, the client encodes a chunk list along with an

RDMA NOMSG flag in the header shown in Figure 3.2. It is always combined with other

NFS operations. The RPC Long Call is identical in both the Read-Read and Read-Write based designs. If the RPC Call message is larger than the inline size, the RPC Call from the client includes a Read Chunk List. The message type in the header in Figure 3.2 is set to RDMA NOMSG. When the server sees an RDMA NOMSG message type, it decodes

the read chunks encoded in the RPC/RDMA header and issues RDMA Reads to fetch these

chunks from the client. The data from these chunks constitutes the remainder of the header

(the fields Read, Write or Reply Chunk List onwards in Figure 3.2, which are overwritten by

the incoming data). The remainder of the header usually constitutes other NFS procedures

as discussed in Section 3.3.2, and is then decoded.
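For reference, the header fields from Figure 3.2 can be summarized with the following schematic declaration; the field and type names are illustrative, since the actual header is XDR encoded and the chunk lists are of variable length.

    #include <stdint.h>

    enum rdma_msg_type {
        RDMA_MSG   = 0,   /* an RPC call or reply follows inline             */
        RDMA_NOMSG = 1,   /* no inline body; the payload is in chunks        */
        RDMA_MSGP  = 2,   /* an RPC call or reply with padding               */
        RDMA_DONE  = 3,   /* client signals reply completion (Read-Read)     */
    };

    struct rpcrdma_header {
        uint32_t xid;      /* transaction ID, matching the RPC XID           */
        uint32_t version;  /* RPC/RDMA version                               */
        uint32_t credits;  /* flow control field                             */
        uint32_t type;     /* one of enum rdma_msg_type                      */
        /* the read, write or reply chunk lists and the RPC message follow   */
    };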

NFS Procedure WRITE: The NFS Procedure WRITE is similar in both the Read-

Read and Read-Write based designs. For an NFS procedure WRITE, the client encodes a

Read chunk list. On the server side, these read chunks are decoded, the RDMA Reads corresponding to each segment are issued and the server thread blocks till the RDMA Reads

complete. The operation is then handled by the NFS layer. Once the operation completes,

control is returned to the RPC layer, that sends an RPC Reply via the inline protocol de-

scribed in Section 3.3.1. In the simplest case, an NFS Procedure WRITE would generate

an RPC Call (RS) from the client to the server, followed by the WRITE (RR) issued by the

server to fetch the data from the client, and finally, the RPC Reply (RS) from the server to

the client.

NFS Procedure READ: In the Read-Read design the NFS server needs to encode a

Read chunk list in the RPC Reply for an NFS READ Procedure. The RPC Reply is then

returned to the client via the inline protocol in Section 3.3.1. The client decodes the Read

chunk lists and issues the RDMA Reads. Once the RDMA Reads complete, the client

issues an RDMA DONE to the server, that allows it to free its pre-registered buffers. So,

the simplest possible sequence of operations for an NFS Procedure READ is; RPC Call

(RS) from the client to the server, followed by an RPC Reply (RS) from the server to the

client, then a READ (RR) issued by the client to fetch the data from the server, and finally,

an RDMA DONE (RS) from the client to the server.

In the Read-Write design, for an NFS READ procedure, the client needs to encode a

Write chunk list in the RPC Call. The server decodes and stores the Write chunk list.

When the NFS procedure READ returns, the data is RDMA written back to the client. The

server then sends the RPC Reply back to the client with an encoded Write Chunk List. The

client uses this Write chunk list to determine how much data was returned in the READ

call. So, the simplest possible protocol operations would be; RPC Call from the client to the server, then a Read (RW) from the server to the client, and finally, an RPC Reply (RS) from the server to the client.
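The server side of this sequence can be sketched as follows; the request structure and the helpers nfs_read_into() and send_inline_reply() are hypothetical, and post_rdma_write() is the memory-semantics call sketched in Section 3.2.1. For brevity a single write segment is assumed and error handling is omitted.

    struct write_seg { uint64_t client_addr; uint32_t rkey; uint32_t length; };
    struct nfs_rdma_req { struct ibv_qp *qp; struct ibv_mr *reply_mr;
                          void *reply_buf; size_t count; };

    static int serve_nfs_read(struct nfs_rdma_req *rq, struct write_seg *seg)
    {
        /* 1. The write chunk list was decoded and saved when the call arrived. */
        size_t n = nfs_read_into(rq->reply_buf, rq->count);   /* back-end read   */

        /* 2. RDMA Write the data into the client's advertised buffer.           */
        post_rdma_write(rq->qp, rq->reply_mr, (uint32_t)n,
                        seg->client_addr, seg->rkey);

        /* 3. RDMA Send of the RPC Reply carrying the write chunk list with the
         *    returned length; its completion implies the Write has completed.   */
        return send_inline_reply(rq, n);
    }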

NFS Procedure READDIR and READLINK (RPC Long Reply): The RPC Long

Reply is typically used when the RPC Reply is larger than the inline size. The RPC Long

Reply is used in both the Read-Read and Read-Write designs but the mechanisms are dif-

ferent. It may either be used independently, or combined with other NFS operations.

The design of the NFS procedure READDIR/READLINK in the Read-Read design is

similar to the NFS Procedure READ in the Read-Read design. The server encodes a Read

chunk list in the RPC Reply, that the client decodes. The client then issues RDMA Read

to fetch the data from the server. Once the RDMA Reads complete, the client issues an

RDMA DONE to the server which allows the server to free its pre-registered buffers.

NFS Procedure READDIR and READLINK in the Read-Write design follows the de-

sign of the NFS READ procedure in the Read-Write design. The client needs to encode

a Long Reply chunk list in the RPC Call. The server decodes and stores the Long Reply chunk list. When the NFS procedure returns, the server uses the long reply chunk to RDMA

Write the data back to the client. The server then sends the RPC Reply back to the client with an encoded Long Reply Chunk List. The client uses this chunk list to determine how

much data was returned in the READDIR/READLINK call. In the simplest case, an RPC

Long Reply would entail the following sequence; RPC Call from the client to the server, then a Long Reply (RW) from the server to the client, and finally, an RPC Reply (RS) from

the server to the client.

Zero Copy Path for Direct I/O for the NFS READ procedure: In addition to the basic design, we also introduce a zero copy mechanism for user space addresses on the

NFS READ procedure path. This eliminates copies on the client side and translates into reduced CPU utilization on the client.

3.4.1 Limitations in the Read-Read Design

The Read-Read design has a number of limitations in terms of Security and Perfor- mance, and we discuss these issues in detail.

Security:

Server buffers exposed: An important design consideration for an RDMA enabled RPC transport is that it must not be less secure than other transports such as TCP. In the Read-

Read design, the server side buffers are exposed for RDMA operations from the client.

Since the steering tags are 32-bits [49] in length, a misbehaving or malicious client might attempt to guess them and thereby possibly read a buffer for which it did not have access to.

Malicious or Malfunctioning clients: The client needs to send an RDMA DONE message to the server to indicate that the buffers used for a Read or Reply chunk may be freed up. A malicious or malfunctioning client may never send the RDMA Done message, essentially tying up the server resources.

Performance:

Synchronous RDMA Read Limitations: The RDMA Reads issued from the NFS/RDMA server are synchronous operations. Once posted, the server typically has to wait for the

RDMA Read operation to complete. This is because the InfiniBand specification does not guarantee ordering between an RDMA Read and an RDMA Send on the same connection [49]. This may add considerable latency to the server thread.

Outstanding RDMA Reads: The number of RDMA Reads that can typically be serviced on a connection is governed by two parameters, the Inbound RDMA Read Queue Depth

(IRD) and the Outbound RDMA Read Queue Depth (ORD) [49]. The IRD governs the number of RDMA Reads that can be active at the remote peer; the ORD governs the number

of RDMA Reads that may be issued concurrently from the local peer. In the

current Mellanox implementation of InfiniBand, the maximum allowed value for IRD and

ORD is typically 8 [61]. So, parallelism is reduced at the server, especially for multi-

threaded workloads.

3.4.2 Potential Advantages of the Read-Write Design

The key design difference between the Read-Read (Figure 3.3) and Read-Write (Fig- ure 3.4) protocol is that RPC long replies and NFS READ data may be directly issued from the server. To enable these, the client needs to encode either a Write chunk list or a long reply chunk list (Section 3.3.2). Moving from a Read-Read based design to a Read-Write based design has several advantages. The Mellanox InfiniBand HCA has the ability to issue many RDMA Write operations in parallel [61]. This reduces the bottleneck for multi- threaded workloads. Also, since completion ordering between RDMA Write and RDMA

Sends is guaranteed in InfiniBand [49], the server does not have to wait for the RDMA

Writes from the long reply or the NFS READ operation to complete. The completion gen- erated by the RDMA Send for the RPC Reply will guarantee that the earlier RDMA Writes

have completed. This optimization also helps reduce the number of interrupts generated

on the server. The RDMA DONE message and its resulting interrupt are also eliminated.

The generation of the send completion interrupt on the server is sufficient to guarantee that

the RDMA operations from the buffers have completed and they may be deregistered. A similar guarantee also exists at the client, when an RPC Call message is received. The elimination of an additional message helps improve performance. Since the server buffers are no longer exposed and the client cannot initiate any RDMA operations to the server, the security of the server is now enhanced. One potential disadvantage of the Read-Write design is that the client buffers are now exposed and may be corrupted by the server. Since the server is usually a trusted entity in an NFS deployment, this issue is less of a concern.
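The ordering argument above can be made concrete with the following sketch (libibverbs headers as in the earlier sketches): the RDMA Write is posted unsignaled and chained to the signaled RDMA Send of the RPC Reply on the same queue pair, so a single completion (and interrupt) covers both operations. This is a simplified illustration and assumes the send queue is not allowed to fill with unsignaled requests.

    static int write_then_reply(struct ibv_qp *qp,
                                struct ibv_sge *data_sge, uint64_t raddr,
                                uint32_t rkey, struct ibv_sge *reply_sge)
    {
        struct ibv_send_wr wr_write = {
            .sg_list = data_sge, .num_sge = 1,
            .opcode  = IBV_WR_RDMA_WRITE,      /* unsignaled: no completion    */
        };
        wr_write.wr.rdma.remote_addr = raddr;
        wr_write.wr.rdma.rkey        = rkey;

        struct ibv_send_wr wr_reply = {
            .sg_list = reply_sge, .num_sge = 1,
            .opcode  = IBV_WR_SEND,            /* the RPC Reply                */
            .send_flags = IBV_SEND_SIGNALED,   /* one completion for the pair  */
        };
        wr_write.next = &wr_reply;             /* posted together, in order    */

        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr_write, &bad);
    }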

The final advantage of the Read-Write design is that the server no longer has to depend on the RDMA DONE message from the client to deregister and release its buffers.

3.4.3 Proposed Registration Strategies For the Read-Write Protocol

InfiniBand requires memory areas to be registered for communication operations [49,

63]. Registration is a multi-stage operation. Registration involves assigning physical pages

to the virtual area. Once physical pages have been assigned to the virtual area, the virtual

to physical address translation needs to be determined. In addition, the physical pages need

to be prepared for DMA operations initiated by the HCA. This involves making the pages

unswappable by the operating system, by pinning them. The virtual memory system may

perform both these operations. In addition, the HCA needs to be made aware of the trans-

lation of the virtual to physical addresses. The HCA also needs to assign a steering tag

that may be sent to remote peers for accessing the memory region in RDMA operations.

The virtual to physical translation and the steering tag are stored in the HCA’s Translation

Protection Table (TPT). This involves one transaction across the I/O bus. However, the

response time of the HCA may be quite high, depending on the load on the HCA, the orga-

nization of the TPT, allocation strategies, overhead in the TPT, and so on [63]. Because of

the combination of these factors, registration is an expensive operation and may constitute

a considerable overhead, especially when it is in the critical path. Deregistering a buffer

requires the actions from registration to be done in reverse. The virtual and physical trans-

lations and steering tags need to be flushed from the TPT (this involves a transaction across

the I/O bus). Once the TPT entries are invalidated, each of them is released. The pages

may then be unpinned. If the physical pages were assigned to the virtual memory region at

the time of registration, this mapping is torn down and the physical pages are released back

into the memory pool.
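In userspace libibverbs terms, the registration and deregistration steps described above collapse into the following pair of calls; this is a sketch of the cost being discussed, not the code used by the kernel transports.

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        if (buf == NULL)
            return NULL;
        /* Pin the pages, install the virtual-to-physical translation in the
         * TPT and obtain the steering tags (mr->lkey / mr->rkey).            */
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);
    }

    void release_buffer(struct ibv_mr *mr)
    {
        void *buf = mr->addr;
        ibv_dereg_mr(mr);   /* invalidate the TPT entries and unpin the pages */
        free(buf);
    }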

Figure 3.6 shows the half ping-pong latency at the InfiniBand Transport Layer (IBTL)

on OpenSolaris of a message with and without registration costs included (the setup is the

same as described earlier). Half ping-pong latency is measured by using two nodes. Node

1 sends a message (ping) of the appropriate size to node 2. Node 2 then responds with a

message of the same size (pong). The latency of the entire operation over 1,000 iterations is

measured, averaged out and then divided by 2. The latency without registration is measured

by registering a buffer of the appropriate size on both sides before starting the communi-

cation loop (that is timed). The latency with registration is measured by registering and

unregistering the buffer on both nodes inside the communication loop.
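The two measurements differ only in where the registration calls sit relative to the timed loop, as the following sketch shows; exchange() is a hypothetical ping-pong primitive and the connection setup is assumed.

    #include <time.h>

    double timed_pingpong(struct ibv_pd *pd, void *buf, size_t len,
                          int iters, int register_in_loop)
    {
        struct ibv_mr *mr = NULL;
        if (!register_in_loop)                 /* "without registration" curve */
            mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            if (register_in_loop)              /* "with registration" curve    */
                mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
            exchange(mr, len);                 /* ping to the peer, wait for pong */
            if (register_in_loop)
                ibv_dereg_mr(mr);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (!register_in_loop)
            ibv_dereg_mr(mr);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
        return us / iters / 2.0;               /* half ping-pong latency       */
    }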


Figure 3.6: Latency and Registration costs in InfiniBand on OpenSolaris

The registration/deregistration points in the RDMA transport are shown in Figure 3.5.

For example, an NFS procedure READ in the Read-Write protocol described earlier, re- quires a buffer registration at points 2 and 5, and a deregistration at points 8 and 10. From

Figure 3.5, we can see that the registration overhead comes about mainly because the trans- port has to register the buffer and deregister the buffer on every operation at the client and server. The registration occurs once at the client, and then at the server in the RPC call

path. Following that, deregistration happens once at the server, and then once at the client.

To reduce the cost of memory registration, different optimizations and registration modes

have been introduced. These include Fast Memory Registration [63, 49] and Physical Reg-

istration [49]. In addition, we propose a buffer registration cache. We discuss these next.

Fast Memory Registration (FMR): Fast Memory Registration [63, 49] allows for the

allocation of the TPT entries and steering tags at initialization, instead of at registration

time. The other operations of memory pinning, virtual to physical memory address trans-

lations and updating the HCA’s TPT entries remain the same. The allocated entries in

the TPT cache are then mapped to a virtual memory area. This technique is therefore not

dependent on the response time of the HCA to allocate and update the TPT entries and consequently, may be considerably faster than a regular registration call. The limitations of

FMR include the fact that it is restricted to privileged consumers (kernel), and the fact that the maximum registration area is fixed at initialization.

The Mellanox implementation of FMR [63] introduces additional optimizations to the

InfiniBand specification [49]. Similar to the specification [49], it defines a pool of steering tags that may be associated with a virtual memory area at the time of registration. The difference arises at deregistration. The steering tag and virtual memory address is placed on a queue. When the number of entries in the queue crosses a certain threshold called the dirty watermark, the invalidations for the entries are flushed to the HCA. This invalidates the TPT entries for the particular set of steering tags and virtual addresses in the queue.
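The deferred invalidation can be pictured with the following conceptual sketch; the names are hypothetical and do not correspond to the Mellanox FMR interface. Unmapped regions are merely queued, and the expensive TPT invalidations are flushed to the HCA in a batch once the dirty watermark is crossed.

    #define DIRTY_WATERMARK 32

    struct fmr_entry;                           /* an unmapped FMR region      */
    struct fmr_pool {
        struct fmr_entry *dirty[DIRTY_WATERMARK];
        int               ndirty;
    };

    void fmr_unmap(struct fmr_pool *pool, struct fmr_entry *e)
    {
        pool->dirty[pool->ndirty++] = e;        /* deregistration only queued  */
        if (pool->ndirty >= DIRTY_WATERMARK) {
            /* One batched HCA transaction invalidates all queued TPT entries;
             * until this point a stale steering tag remains usable.           */
            flush_tpt_invalidations(pool->dirty, pool->ndirty);
            pool->ndirty = 0;
        }
    }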

While this optimization can potentially improve performance, this introduces a security restriction. While the entries in the queue have not been flushed, there is a window of vulnerability after the deregistration call is made. During this window, a remote peer with the steering tag can access the virtual memory area. We have incorporated FMR calls

(Mellanox FMR [63]) in the regular registration path in RPC/RDMA. To allow FMR to work transparently, we use a fall-back path to regular registration calls in case the memory region to be registered is too large.

Design of the Buffer Registration Cache: An alternate registration strategy is to cre- ate a buffer registration cache. A registration cache [95] has been shown to considerably improve communication performance. Most registration caches have been implemented at

the user level and cache virtual addresses. Caching virtual addresses has been shown to

cause incorrect behavior in some cases [69]. Also, unless static limits are placed on the number of entries in the registration cache, the cache tends to expand endlessly, particu- larly in the face of applications with poor buffer reuse patterns. Finally, static limits may perform poorly depending on the dynamics of the application.

To alleviate some of these deficiencies, we have designed an alternate buffer registration cache on the server. As discussed earlier in Section 3.3, the NFS server state machine is split into two parts. The first part is on the RPC Call receive path where the NFS call is received and is issued to the file system. The second component is on return of control from the file system. Buffer allocation is done when the request is received on the server side and registration is executed when control returns from the file system. To model this behavior, we override the buffer allocation and registration calls and feed them to the registration cache module. This module allocates buffers of the appropriate size from a slab cache [51], for the request and then registers them when the registration request is made. If the buffer from the cache is already registered, no registration cost is encountered. The advantages of this setup are that the cache is no longer based on virtual address, and it is also linked to the systems slab cache, that may reclaim memory as needed. Since the server never sends a virtual address or steering tag to the client for any buffers in the registration cache, this is as secure as regular registration.
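The server-side cache can be sketched as below; the slab allocator and helper names are hypothetical. A buffer handed out by the cache that is already registered incurs no registration cost, and because the slab backs the cache the system may reclaim memory as needed.

    struct slab_cache;            /* hypothetical slab allocator               */

    struct cached_buf {
        void          *data;
        size_t         size;
        struct ibv_mr *mr;        /* NULL until the first registration         */
    };

    /* Called when the RPC request is received on the server.                  */
    struct cached_buf *rcache_alloc(struct slab_cache *slab, size_t size)
    {
        return slab_alloc(slab, size);          /* hypothetical slab interface  */
    }

    /* Called when control returns from the file system.                       */
    struct ibv_mr *rcache_register(struct ibv_pd *pd, struct cached_buf *b)
    {
        if (b->mr == NULL)                      /* cache miss: register once    */
            b->mr = ibv_reg_mr(pd, b->data, b->size,
                               IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
        return b->mr;                           /* cache hit: no registration   */
    }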

The server registration cache scheme described above can also be applied to the client side. However, in order to use the system slab cache, data needs to be copied from the application buffer to an intermediate NFS buffer. Therefore, compared with the zero-copy

path mentioned in Section 3.4, there is an extra data movement involved in the registration cache scheme, and we need to carefully study the trade-off between data copy and memory registration. Since a malfunctioning server may compromise the integrity of the client's buffers, this approach should only be used in deployments where the server is well tested.

All Physical Memory Registration: In addition to virtual addresses, communication in InfiniBand may also take place through physical addresses. Physical Registration takes two different forms, i.e. mapping all of physical memory and the Global Steering Tag op- timization. Mapping all of physical memory involves updating the HCA’s TPT entries to map all physical pages in the system with steering tags. This operation places a consider- able burden on the HCA in modern systems which may have GigaBytes of main memory and is usually not supported. The Global Steering Tag available to privileged consumers

(such as kernel processes) allows communication operations to use a special remote steer-

ing tag. The communication operation must use physical addresses. The consumer must pin the memory before communication starts and obtain a virtual to physical mapping, but does not need to register the mapping with the HCA.

Physical Registration can considerably reduce the impact of memory registration on communication, but is restricted to privileged consumers. The issue of security also needs to be considered. The Global Steering Tag potentially allows unfettered access to remote peers, that may have obtained the Remote Steering Tag through earlier communication with the peer. It should be used in environments where a level of trust exists between the peers.

In addition, the issue of integrity should be considered. The HCA is unable to do checks on incoming requests with physical addresses and an associated remote steering tag. Given

that the peers can corrupt each other's memory areas through a communication operation with an invalid physical address, the Global Steering Tag should be used in environments

where sufficient confidence exists in the correctness of the communication sub-system. In

the Linux implementation, we allocate a Global Steering Tag at initialization. By using

the Global Steering Tag, we only need to do virtual to physical address translation when

actually registering memory.
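The effect of the Global Steering Tag can be illustrated with the following sketch; virt_to_dma() and the initialization of global_stag are hypothetical placeholders for the privileged kernel facilities described above. Per-I/O work reduces to an address translation, with no per-buffer TPT update.

    #include <stdint.h>

    static uint32_t global_stag;                /* obtained once at initialization */

    void physical_sge(struct ibv_sge *sge, void *vaddr, uint32_t len)
    {
        sge->addr   = virt_to_dma(vaddr);        /* translation only, no TPT update */
        sge->length = len;
        sge->lkey   = global_stag;               /* steering tag covering physical memory */
    }

Because physical pages are generally not contiguous, requests are fragmented at page granularity in this mode, which corresponds to the 4 KB fragmentation noted in Section 3.5.3.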

3.5 Experimental Evaluation of NFSv3 over InfiniBand

In this section, we evaluate our proposed RDMA design with NFSv3. We first com-

pare the Read-Write design with the existing Read-Read design on OpenSolaris in Sec-

tion 3.5.1 (Linux did not have a Read-Read design). Following that, Section 3.5.2 dis-

cusses the impact of different registration strategies on NFS/RDMA performance, both at

the microbenchmark and at the application-level. Finally, in Section 3.5.3 we discuss how

RDMA affects the scalability of NFS protocols in an environment where the server stores

the data on a back-end RAID array and services multiple clients.

3.5.1 Comparison of the Read-Read and Read-Write Design

Figures 3.7 and 3.8 show the IOzone [11] Read and Write bandwidth respectively

with direct I/O on OpenSolaris. Performance of the Read-Read design is shown as RR.

Performance of the Read-Write design is shown as RW. The results were taken on dual

Opteron x2100’s with 2GB memory and Single Data Rate (SDR) x8 PCI-Express Infini-

Band Adapters [61]. These systems were running OpenSolaris build version 33. The back-

end file system used was tmpfs, which is a memory based file system. The IOzone file size

used was 128 MegaBytes to accommodate reasonable multi-threaded workloads (IOzone creates a separate file for each thread). The IOzone record size was varied from 128KB to

1MB. From the figure, we make the following observations. For both the Read-Read and

Figure 3.7: IOzone Read Bandwidth

Figure 3.8: IOzone Write Bandwidth

Read-Write design, the bandwidth increases with record size. The RPC/RDMA layer in

OpenSolaris does not fragment individual record sizes. The size on the wire corresponds exactly to the record size passed down from IOzone to the NFS layer. Larger messages have better bandwidth in InfiniBand. This translates into better IOzone bandwidth for larger record sizes. Since the size of the file is constant, the number of NFS operations is lower for larger record sizes. So, the improvement in bandwidth with larger record sizes is modest.

IOzone Write bandwidth is the same in both cases. This is to be expected as the NFS

WRITE path through the RPC/RDMA layer is the same on the client and server for both

the Read-Read and Read-Write designs. The WRITE bandwidth is slightly higher than the

READ bandwidth in both designs because the server has to do more work for READ procedures than for WRITE procedures, affecting its operation processing rate.

The Read-Write design performs better than the Read-Read design for all record sizes, for the READ procedure. The improvement in performance is approximately 47% with one thread at a record size of 128 KB, but decreases to about 5% at 8 threads. This improvement is primarily due to the elimination of the RDMA DONE message as well as the improved parallelism of issued RDMA Writes from the server. The READ bandwidth for the Read-

Read design saturates at 375 MB/s; the Read-Write design saturates at 400 MB/s. The bandwidth in both cases seems to saturate with an increasing number of threads, though the saturation in the case of the RDMA-Write design takes place much earlier than that of

the Read-Read design.

Client CPU utilization was measured using the IOzone [11] +u option. The utilization

corresponds to the percentage of the time the CPU is busy over the lifetime of the through-

put test. Since the CPU utilization for different record sizes is the same, we show only a

single line for the Read-Read and Read-Write designs in Figures 3.7 and 3.8. Client CPU

utilization is lower for Read-Write than for Read-Read for the NFS READ procedure. In

addition, the CPU utilization for the Read-Write design remains flat starting at only 2% at

1 thread increasing to about 5% at 8 threads. On the other hand, the CPU utilization for

the Read-Read design increases from about 4% at 1 thread to about 24% at 8 threads. This

is primarily because of elimination of data copies on the client direct I/O path. Due to the

data copy from tmpfs to NFS across the VFS layer, server utilization is close to 100%. So, we do not present server CPU utilization numbers in the graph.

3.5.2 Impact of Registration Strategies

From Section 3.4.3, we see that registration can constitute a substantial overhead in the

RPC/RDMA transport. We evaluate the impact of Fast Memory Registration (FMR) and

buffer registration cache at the micro-benchmark and application-level. We also look at the

performance benefits from the All Physical Registration mode in Linux.

Fast Memory Registration (FMR): We now look at the impact of FMR discussed in

Section 3.4.3 on RPC/RDMA performance. The maximum size of the registered area was

set to be 1MB. In addition, the FMR pool size was set to 512, which is sufficient for up to

512 parallel requests of 1MB. We evaluate the IOzone read and write bandwidth. Since the

bandwidth from the different record sizes is similar, we present results with only a 128KB

record size and a 128 MB file size. The results are shown in Figure 3.9 and Figure 3.10.

FMR can help improve Read bandwidth from about 350 MB/s to approximately 400 MB/s,

though this comes at the cost of increased client CPU utilization (Figure 3.9 shows an up-

per bound to CPU utilization shown by the legend CPU-Cache-Solaris. CPU Utilization

for FMR is between that of CPU-Cache-Solaris and CPU-Register-Solaris). This increased

client CPU utilization is to be expected, since the client is able to place more operations

per second on the wire, due to the better operation response time from the server. Improve-

ment in write bandwidth is modest, mainly because the time saving from the reduction in

registration cost is dwarfed by the serialization of RDMA Reads, due to the limitation in

the number of outstanding RDMA Reads (Section 3.4.1).

Figure 3.9: IOzone Read Bandwidth with different registration strategies on OpenSolaris

Figure 3.10: IOzone Write Bandwidth with different registration strategies on OpenSolaris

Buffer Registration Cache: The performance impact of the server registration cache on the IOzone Read and Write bandwidth is shown in Figure 3.9 and Figure 3.10 respec- tively. The registration cache dramatically improves performance for both the Read and

Write bandwidth which goes up to 730 MB/s and 515 MB/s respectively. The client CPU utilization is also increased, though this is to be expected with an increasing operation rate from the client. Again, the limited number of outstanding RDMA Reads bounds the improvement in Write throughput. Figure 3.11 shows the impact of the client registra- tion cache scheme on IOzone multi-thread Read test. From the figure we can see that it is beneficial to use the client registration cache when the record size is small. The peak

Read throughput doubles for 2KB record size by using registration cache. For large record size, the dynamic registration scheme yields higher throughput for 1-4 threads, because large memory copy is more expensive than registration. Since there is an extra data copy involved, the client registration cache scheme consumes more CPU cycles as expected.

61 800 80% 5000 450

700 70% 4500 400 )

4000 p o ) 350 /

600 60% ) s u s / / n 3500 p B s o c

i 300 p M t 500 50% s ( o a

u ( t

3000 ( z

i t u l 250 i n u p 400 40% t o h p

2500 i U

t g h a

U 200 u g z P o 300 30% i u 2000 l r i C o t h r 150 T U h 200 20% 1500 T U

1000 100 P 100 10% C 500 50 0 0% 1 2 3 4 5 6 0 0 Number of Threads 50 100 150 200 Number of Readers Throughput Register 2KB Throughput Cache 2KB Throughput Register 128KB Throughput Cache 128KB Throughput Register Throughput FMR CPU Register 2KB CPU Cache 2KB Throughtput Cache CPU Register CPU Register 128KB CPU Cache 128KB CPU FMR CPU Cache

Figure 3.11: Performance Impact of Figure 3.12: FileBench OLTP Perfor- Client Registration Cache on IOzone mance Read Tests

Impact of registration schemes on application performance: To evaluate the impact

of memory registration schemes on application performance, we have conducted exper-

iments using the online transaction processing (oltp) workload from FileBench [4]. We

tune the workload to use the mean I/O size equal to 128KB. The results are shown in Fig-

ure 3.12. The bars represent the throughput (operations/sec) and the lines represent the

client CPU utilization (us cpu/operation). From Figure 3.12 we can see that the registration

cache scheme improves throughput by up to 50% compared with the dynamic registration

scheme. This indicates that the improvement in raw read/write bandwidth has been trans-

lated into application performance. The CPU utilization is slightly higher because client

registration cache involves an additional memory copy, as discussed above. The FMR

scheme performs comparably with the dynamic registration scheme in this benchmark.

All Physical Memory Registration: From Figure 3.13 we can see that the all physical

memory registration mode yields the best Read throughput on Linux. It degrades the Write

performance compared with the FMR mode as shown in Figure 3.14 because in all-physical

mode the client cannot do local scatter/gather and so has to build more read chunks, there-

fore, each write request issues multiple RDMA Reads from the server that hits the limit of

incoming/outgoing RDMA Reads in InfiniBand.
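
As a rough illustration of this effect, assume the 4 KB chunking used in all-physical mode and a per-connection limit of four outstanding RDMA Reads (the limit of four is an assumed, typical value rather than a measured one):

\[ \frac{1\,\mathrm{MB}}{4\,\mathrm{KB}} = 256 \ \text{chunks per 1 MB write}, \qquad \text{of which at most } 4 \ \text{Reads can be in flight at once}, \]

so each large write degenerates into many dependent rounds of RDMA Reads at the server, which caps the achievable Write bandwidth.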


Figure 3.13: IOzone Read Bandwidth with different registration strategies on Linux.
Figure 3.14: IOzone Write Bandwidth with different registration strategies on Linux.

3.5.3 Multiple Clients and Real Disks

In this section, we discuss the impact of RDMA on an NFS setup with multiple clients.

We start with a description of the multi-client setup. Following that, we look at the perfor- mance with two different server configurations.

Multi-client Experimental Setup: We use the Linux NFS/RDMA design with the All

Physical Memory Registration mode described in Section 3.4.3 for multiple client exper-

iments. The server and clients are dual Intel 3.6 Xeon boxes with an InfiniBand DDR

HCA. The clients have 4GB of memory. The server was configured with 4GB and 8GB

of memory for each of the experiments below. The server has eight HighPoint SCSI disks

with RAID-0 striping. The disk array is formatted with the XFS file system [91]. For each disk, the single threaded read bandwidth with direct I/O and a record size of 4 KB

(NFS/RDMA fragments data requests into 4 KB chunks in the All Physical Memory Reg- istration Mode) for a 1GB file is approximately 30 MB/s. We use a single IOzone process per client. A 1GB file size per process with a 1MB record size is used for all the exper- iments. We compare the aggregate Read bandwidth of the Linux NFS/RDMA (RDMA) implementation with the regular NFS implementation over TCP on InfiniBand (IPoIB) and

Gigabit Ethernet (GigE).

Server with 4GB of main memory: Figure 3.15 shows the IOzone read bandwidth with multiple clients and a server with 4GB main memory. RDMA and IPoIB reach a peak aggregate bandwidth at three processes. RDMA peaks at 883 MB/s, while IPoIB reaches 326 MB/s. In comparison, GigE saturates at 107 MB/s with a single process and then the aggregate bandwidth goes down as the number of processes increases. The limited bandwidth of Gigabit Ethernet (peak theoretical bandwidth of 125 MB/s) may become a bottleneck with future high performance disks and server with large amounts of memory.

Analysis with the help of the System Activity Reporter (SAR) tool shows that the underlying

XFS file system on the server is able to cache the data in memory up to three processes.

So, RDMA is able to achieve the best performance mainly because of elimination of data copies at the client and server. IPoIB needs additional data copies in the TCP stack that limits the potential number of NFS operations that may be in transit at any point in time.

As the number of threads increases beyond three, some or all the data must be fetched from the disk. This results in a significant drop in bandwidth. Since, requests from different clients use different files, a disk seek needs to be performed for each request. So, disk seek times dominate and contribute to the overall reduction in throughput.

Server with 8GB of main memory: Figure 3.16 shows the IOzone read bandwidth with 8GB on the server. Clearly, XFS is able to take advantage of the larger memory and cache data for multiple processes. RDMA is able to maintain a peak bandwidth of above

900 MB/s up to seven threads, while IPoIB saturates at about 360 MB/s. At eight threads,

the RDMA aggregate bandwidth drops to 624.38 MB/s, while the IPoIB bandwidth drops to

300 MB/s. From Figures 3.15 and 3.16, we can conclude that NFS/RDMA is limited by the ability of the back-end server to service data requests. NFS/TCP is a bottleneck on current

generation systems. With servers with 64GB and larger main memories, NFS/RDMA is

likely to be the obvious choice for a scalable deployment.

3.6 Summary

In this Chapter, we have designed and evaluated a RPC protocol for high performance

RDMA networks such as InfiniBand on NFSv3. This design is based on a combination

of RDMA Read and RDMA Write. The design principles considered include NFS server

security, performance and scalability. To improve performance of the protocol, we have


Figure 3.15: Multiple clients IOzone Read Bandwidth (4GB on the server).
Figure 3.16: Multiple clients IOzone Read Bandwidth (8GB on the server).

incorporated several different registration mechanisms into our design. Our evaluations show that the NFS/RDMA design can achieve throughput close to that of the underlying network. We also evaluated our design with an OLTP workload. Our design can improve throughput of the OLTP workload by 50%. Finally, we also studied the scalability of NFS/RDMA with multiple clients. This evaluation shows that the Linux NFS/RDMA design can provide an aggregate throughput of 900 MB/s to 7 clients, while NFS on a TCP transport saturates at 360 MB/s. We observe that a TCP transport is itself a bottleneck when servicing multiple clients. By comparison, NFS/RDMA is able to maintain throughput even with multiple clients, provided the back-end file system is able to sustain it.

CHAPTER 4

ENHANCING THE PERFORMANCE OF NFSV4 WITH RDMA

While NFSv3 has been widely deployed, NFSv4 has been gaining traction. Since,

NFSv4 offers several advantages over NFSv3, it becomes crucial to design an RDMA trans- port for NFSv4. We investigate the challenges of designing an RDMA transport for NFSv4 on OpenSolaris.

NFSv4 introduced the concept of a single procedure COMPOUND, which can encom- pass several operations. This potentially reduces the number of trips to the server. However, it also potentially increases the size of each request sent to the server. A COMPOUND re- quest may potentially have an unbounded length. Correspondingly, the response may also be unbounded. This raises a challenge in a design with RDMA, which requires the length to be precisely specified. Also, COMPOUND procedures may have interactions with other encompassed operations. We demonstrate a design of NFSv4 over an RDMA transport on

InfiniBand, that can achieve an IOzone Read Bandwidth of over 700 MB/s and an IOzone

Write Bandwidth of over 500 MB/s. It attains performance close to that of the corresponding design of NFSv3 over RDMA and significantly outperforms NFSv4 over TCP.

4.1 Design of NFSv4 with RDMA

In this section, we discuss our design for NFSv4 over RDMA. We first discuss COM-

POUND procedures, followed by Read and Write operations and finally, Readdir and Read- link operations.

4.1.1 Compound Procedures

We first give an example of a COMPOUND procedure, followed by our design.

Example of a COMPOUND procedure: One of the benefits of a COMPOUND oper- ation is that it allows the concatenation of several operations into a single RPC call. On the downside, a COMPOUND operation makes simple operation larger than necessary. An- other disadvantage is the error processing involved in a COMPOUND operations. An error in an operation in a COMPOUND will result in the server rejecting and not processing all the remaining operations. As a result, the client may have to send multiple requests to the server to resolve the error. The error processing in a COMPOUND operation is discussed in detail in the IETF RFC 3530 [85]. An example where COMPOUND operation may ben- efit is the initial opening of a file. The opening of a file requires the procedures LOOKUP

(to convert the file name to a file handle), followed by OPEN (to store the file handle at the server) and finally, READ (to read the initial data from the file). Note that OPEN is a NFSv4 specific operation which stores state at the server not present in version 3 and earlier of the protocol. So, in version 3, the sequence we are trying to optimize would be

LOOKUP followed by READ. If the COMPOUND is successful, three round-trip opera- tions would have been achieved with a single round-trip to and from the server. However,

if the lookup failed, bandwidth would have been wasted for sending the OPEN and READ operations. Whether or not this might have an impact depends on the properties of the underlying interconnect.
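
As a sketch of the idea, the initial open could be packed into one COMPOUND as follows. The structures below are simplified stand-ins for the XDR-generated types, not the wire format of RFC 3530, and the leading PUTROOTFH is included only to establish a starting filehandle.

/*
 * Illustrative sketch of the initial open of a file expressed as a single
 * NFSv4 COMPOUND: LOOKUP + OPEN + READ in one round trip instead of three.
 */
#include <stdint.h>
#include <string.h>

enum nfs4_op { OP_PUTROOTFH, OP_LOOKUP, OP_OPEN, OP_READ };

struct nfs4_op_args {
    enum nfs4_op op;
    const char  *name;     /* used by LOOKUP and OPEN */
    uint64_t     offset;   /* used by READ */
    uint32_t     count;    /* used by READ */
};

struct nfs4_compound {
    uint32_t            numops;
    struct nfs4_op_args ops[8];
};

void build_open_compound(struct nfs4_compound *c, const char *fname)
{
    memset(c, 0, sizeof(*c));
    c->ops[c->numops++] = (struct nfs4_op_args){ .op = OP_PUTROOTFH };        /* set the current filehandle */
    c->ops[c->numops++] = (struct nfs4_op_args){ .op = OP_LOOKUP, .name = fname }; /* name -> filehandle */
    c->ops[c->numops++] = (struct nfs4_op_args){ .op = OP_OPEN,   .name = fname }; /* store state at the server */
    c->ops[c->numops++] = (struct nfs4_op_args){ .op = OP_READ,
                                                 .offset = 0, .count = 4096 };     /* read the initial data */
}

If any operation in the sequence fails, the server skips the remaining operations, which is exactly the error-handling behaviour described above.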

Design of a COMPOUND operation with RDMA: As discussed earlier, a COM-

POUND procedure consists of a sequence of operations. So, the first issue is to transfer a

COMPOUND operation from the client to the server. If the sequence of operations con- stitutes a small message (less than the inline threshold of 1KB), the request may go inline from the client to the server as a RDMA Send operation. If the sequence of operations constitutes a larger message, an RPC long call needs to be generated from the client to the server. In this case, the client will encode the data from the COMPOUND operation in a single RDMA chunk, that is registered. It will then do an RDMA send from the client to the server with the RDMA chunk encoded and the RDMA NOMSG flag set. For a given implementation of NFSv4, the client may be aware of the potential length of the reply for all procedures. But, in the general case, the transport may be used for a minor revision to NFSv4 (currently this revision is NFSv4.1). There might be additional procedures in- troduced in a minor version, that could generate a long reply. To make the design more general, the client always encodes a long reply chunk for every COMPOUND procedure.

As an optimization to reduce client memory usage, the client could instead encode a long reply chunk only for sequences of procedures that may actually generate a long reply.

The client should also take into account the preceding READDIR’s and READLINK’s when computing the length of the long reply. Additional chunks may be encoded to compensate for the preceding READDIR and READLINK procedures. When the server receives the RPC Call, it will decode and store the READ, WRITE and Reply chunks. Finally, control will be passed to the NFS layer for further processing. On return from the server, the RPC/RDMA layer will need to return the results of the COMPOUND procedure to the client. If the results are small, they can be returned inline. Otherwise, the last reply chunk sent from the client is used to return the data to the client. So, the reply chunk sent from the client to the server may remain unused. From this, it may seem that the client may need to waste memory for COMPOUND operations. Also, these memory regions would need to be registered. But, with appropriate design strategies, such as a server registration cache or physical registration, the impact of resource allocation and/or memory registration is likely to be minimal.
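
A minimal sketch of the send-side decision described in this section is shown below, assuming a 1 KB inline threshold. register_chunk, rpcrdma_send_inline and rpcrdma_send_nomsg are hypothetical helpers standing in for the real verbs and RPC/RDMA calls, not functions from the actual implementation.

/*
 * Sketch: send a COMPOUND inline if it fits under the inline threshold,
 * otherwise register it and issue an RPC long call (RDMA_NOMSG).  A reply
 * chunk is always advertised, since a COMPOUND reply may be long.
 */
#include <stddef.h>
#include <stdint.h>

#define RPCRDMA_INLINE_THRESHOLD 1024

struct rdma_chunk { uint64_t addr; uint32_t length; uint32_t rkey; };

/* Stub: a real implementation would call ibv_reg_mr() and fill in the rkey. */
static struct rdma_chunk register_chunk(const void *buf, size_t len)
{
    struct rdma_chunk c = { .addr = (uint64_t)(uintptr_t)buf,
                            .length = (uint32_t)len, .rkey = 0 };
    return c;
}

/* Stubs standing in for the transport-level send paths. */
static int rpcrdma_send_inline(const void *xdr, size_t len,
                               const struct rdma_chunk *reply)
{ (void)xdr; (void)len; (void)reply; return 0; }

static int rpcrdma_send_nomsg(const struct rdma_chunk *call,
                              const struct rdma_chunk *reply)
{ (void)call; (void)reply; return 0; }

int send_compound(const void *xdr_buf, size_t len,
                  void *reply_buf, size_t reply_len)
{
    /* Always advertise a reply chunk for the (possibly long) COMPOUND reply. */
    struct rdma_chunk reply = register_chunk(reply_buf, reply_len);

    if (len <= RPCRDMA_INLINE_THRESHOLD)
        return rpcrdma_send_inline(xdr_buf, len, &reply); /* small: goes inline */

    /* Long call: register the encoded COMPOUND and send only chunk
     * descriptors with the RDMA_NOMSG flag; the server pulls the body
     * from the client with an RDMA Read.                                   */
    struct rdma_chunk call = register_chunk(xdr_buf, len);
    return rpcrdma_send_nomsg(&call, &reply);
}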

4.1.2 Read and Write Operations

Read operations would follow the design of the READ procedure in NFSv3/RDMA, with one caveat. Write chunk lists now must account for the sequential ordering of Read operations in a COMPOUND procedure. The RPC layers at the client and the server need to take this into account, by appropriately marshaling the chunks at the client, storing them at the server and then using them to RDMA Write data back to the client. Similarly, Write

Operations follow the design of the WRITE procedure in NFSv3. Sequential ordering of operations is taken into account.

4.1.3 Readdir/Readlink Operations

Readdir and Readlink operations always encode a long reply chunk. This chunk needs to take into account the sequential ordering of operations in a COMPOUND procedure.

The encapsulating COMPOUND procedure itself will include a long reply chunk as the last chunk in the list. The design should account for this interaction.

4.2 Evaluation of NFSv4 over RDMA

In this section, we discuss the experimental evaluation of NFSv4 with an RDMA trans- port. First, we describe the setup in Section 4.2.1. Following that we compare the per- formance of NFSv4/RDMA with NFSv3/RDMA in section 4.2.2. Finally, we compare

NFSv4/RDMA with NFSv4/TCP in section 4.2.3.

4.2.1 Experimental Setup

We measure the IOzone [11] Read and Write bandwidth with direct I/O on OpenSolaris.

The results were taken on dual Opteron x2100’s with 2GB memory and Single Data Rate

(SDR) x8 PCI-Express InfiniBand Adapters. These systems were running OpenSolaris build version 33. The back-end file system used was tmpfs which is a memory based file system. The IOzone file size used was 128 MegaBytes to accommodate reasonable multi- threaded workloads (IOzone creates a separate file for each thread). The IOzone record size was set to be 128KB.


Figure 4.1: IOzone Bandwidth Comparison: (a) Read, (b) Write.

4.2.2 Impact of RDMA on NFSv4

Figures 4.1(a) and 4.1(b) shows the IOzone Read and Write bandwidth for NFSv4 and

NFSv3 over an RDMA transport. From the figures, we can see that v4 performs compa- rably with v3. The Read bandwidth saturates at 714 MB/s for both NFSv4 and NFSv3.

Similarly, Write Bandwidth saturates at 541.71 MB/s. It should be noted that NFSv4 per- forms slightly worse than NFSv3. This is to be expected because of the additional overhead of COMPOUND operation, as discussed in Section 4.1.1.

4.2.3 Comparison between NFSv4/TCP and NFSv4/RDMA

We also compare the performance of NFSv4 when the underlying transport is TCP. This is shown in Figures 4.1(a) and 4.1(b) (IOzone Read and Write Bandwidth, respectively).

For TCP, we used IPoIB (IP over InfiniBand) as the transport over the same InfiniBand link.

Clearly, NFSv4/RDMA substantially outperforms NFSv4/TCP. While

NFSv4/RDMA saturates at over 700 MB/s for IOzone Reads, NFSv4/TCP saturates at just over 200 MB/s. Similarly, for Writes, NFSv4/RDMA saturates at 541.71 MB/s, while

NFSv4/TCP saturates at slightly under 100 MB/s.

4.3 Summary

In this Chapter, we have designed and evaluated an RDMA transport for NFSv4. The design challenges included allowing COMPOUND procedures, which can potentially contain an unbounded number of operations, to use RDMA operations, which are length limited. Evaluation shows that NFSv4 with RDMA has performance similar to NFSv3 for an

RDMA transport. IOzone Read bandwidth saturated at over 700 MB/s, while IOzone Write bandwidth saturates at over 500 MB/s. NFSv4 over RDMA outperforms NFSv4 over TCP

(IPoIB) by a factor of roughly 3.5 for Reads and over 5 for Writes.

CHAPTER 5

PERFORMANCE IN A WAN ENVIRONMENT

As cluster-of-clusters connected by high-speed networks become increasingly wide-

spread, it becomes important to evaluate storage protocols in a Wide Area Network envi-

ronment. We have evaluated NFS over RDMA using Obsidian Longbow routers. First,

we provide an overview of InfiniBand WAN through router range extension.

5.1 InfiniBand WAN: Range Extension

Obsidian Longbows [21] allow for network transmission range extension. The range

extension is primarily for native InfiniBand fabrics over modern 10 Gigabit/s Wide Area

Networks (WAN). The Obsidian Longbows routers are usually deployed in pairs. Each pair establishes a point-to-point link between clusters with one Longbow at each end of the

link. Figure 5.1 shows a typical installation of Longbow routers in a WAN environment.

The Longbows use traditional IP based communication over SONET, ATM, 10 Gigabit

Ethernet and dark fiber applications. Existing Longbows routers may run at a maximum speed of 8 Gigabits/s (SDR). To compensate for the remaining 2 Gbps bandwidth, these

74 Obsidian Longbow routers can also encapsulate a pair of 1 Gigabit/s Ethernet traffic across the WAN link.

In the basic switch mode, the Longbows appear as a pair of two-ported switches to the

InfiniBand subnet manager as shown in Figure 5.1. Both the networks are then unified into one InfiniBand subnet. The InfiniBand network stacks are unaware of the unification, though a perceptible lag is added by the wire delays.

The Obsidian Longbow XR routers allow users to simulate WAN environments by adding delays on packet transmissions. The packets are queued, delayed and then transmit- ted over the WAN link. The delay is constant and may be used as a measure of the distance between the end-nodes in the WAN environment.


Figure 5.1: Two geographically separated clusters connected by Longbow routers

5.2 WAN Experimental Setup

The Obsidian Longbow routers connect two InfiniBand clusters and run at Single Data

Rate (SDR) 4X speeds (1 GB/s unidirectional bandwidth). The router allows us to simu-

late different distances. We simulate with 0 kilometers, 2 kilometers and 200 kilometers,

corresponding to a delay of 0µs, 10µs and 1,000µs respectively.
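
These settings are consistent with a one-way propagation delay of roughly 5 µs per kilometer of fiber (a signal speed of about 2 x 10^8 m/s); this mapping is an inference for illustration, not a value taken from the router documentation:

\[ 2\,\mathrm{km} \times 5\,\mu\mathrm{s/km} = 10\,\mu\mathrm{s}, \qquad 200\,\mathrm{km} \times 5\,\mu\mathrm{s/km} = 1{,}000\,\mu\mathrm{s}. \]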

5.3 Performance of NFS/RDMA with increasing delay

The single client, multi-threaded IOzone Read Bandwidth of NFS over RDMA with

different delays is shown in Figure 5.2. The InfiniBand HCAs on the local area network

run at Double Data Rate (2GB/s unidirectional bandwidth) while the WAN connections

run at SDR. As a result, the IOzone Read Bandwidth on the LAN is approximately 36% higher

than the corresponding WAN bandwidth at 0 µs delay. For the WAN scenario, we observe

that the peak bandwidth with increasing number of threads is approximately 700 MB/s at

0µs and 10µs. However, beyond that, at 1,000µs delay, the bandwidth drops to around 100

MB/s. The characteristics of the router are such that the bandwidth at 1,000µs for 4 KB messages is close to 100 MB/s. Since the NFS/RDMA design chunks the data into 4KB

pieces, the bandwidth is correspondingly lower at a delay of 1,000µs.

5.4 NFS WAN performance characteristics with RDMA and TCP/IP

We also compare the performance of NFS/RDMA and NFS/IPoIB. We compare two

different modes of IPoIB, Reliable Connection (RC) and Unreliable Datagram (UD). RC supports reliable in-order data transfers up to 2 GB and RDMA (TCP/IP MTUs are limited

76 to 64KB) . Unreliable Datagram (UD) is restricted to unreliable 4KB MTUs. Figure 5.3

shows the throughput with a WAN delay of 10µs. NFS/RDMA bandwidth is 40% bet- ter than IPoIB-RC and 250% better than IPoIB-UD. The data copies are eliminated with

NFS/RDMA, and this translates into better performance, since the TCP/IP protocol must

deal with fragmentation and reassembly, reliability, congestion management and check-

summing overhead. The situation is reversed with a delay of 1,000µs (Figure 5.4). Native

InfiniBand protocols are not designed to work in WAN environments. Correspondingly,

the NFS/RDMA bandwidth is lowest with a delay of 1,000µs.


Figure 5.2: NFS/RDMA Performance With Different Delays.
Figure 5.3: NFS at 10 µs delay, comparison between RDMA and IPoIB.
Figure 5.4: NFS at 1,000 µs delay, comparison between RDMA and IPoIB.

5.5 Summary

Geographically disparate clusters with high-throughput InfiniBand links are becoming

increasingly common. Storage protocols must be able to behave transparently in these sce-

narios and provide reasonable performance. In this Chapter, we evaluated the performance

77 of the NFS over RDMA protocol in an InfiniBand WAN environment. Evaluations of the

NFS/RDMA protocol in a WAN environment show that NFS/RDMA protocols provide better performance than TCP/IP over short distances of up to 10 kilometers, but perform

worse than TCP/IP as distances reach into the 100-1,000 kilometer range.

CHAPTER 6

CLUSTERED NAS: PNFS ON INFINIBAND

The explosive growth in multimedia, Internet and other content has caused a dramatic increase in the volume of media that needs to be stored, cataloged and accessed efficiently.

In addition, high-performance applications on large supercomputers process and create petabytes of application and checkpoint data. Modern single-headed nodes with a large number of disks (single headed Network Attached Storage (NAS)) may not have adequate capacity to store this data. Also, the single head or single server may potentially become a bottleneck with accesses from a large number of clients. Finally, a failure of the node or the disk may lead to a loss of data.

To deal with several of these problems, clustered NAS solutions have evolved. Clus- tered NAS solutions attempt to store the data across a number of storage servers. This has a number of benefits. First, we are no longer limited to the capacity of a single node.

Second, depending on the way data is striped across the nodes, with accesses from a large number of clients, the load will be more evenly distributed across the servers. Third, for large files, this architecture has the advantages of multiple streams of data from different nodes for better aggregate bandwidth for larger file sizes. Finally, clustered NAS allows

79 data to be stored redundantly across a number of different nodes [12, 102, 46], reducing

the likelihood of data loss.

Even though clustered NAS solutions provide several benefits in terms of capacity, enhanced load

capacity, better aggregate throughput and better fault-tolerance, they bring with them their

own set of unique problems. First, since the data-servers have now been de-coupled, any

given stream of data will require multiple network, usually TCP/IP, connections from the

clients to the data servers and metadata servers. TCP/IP connections have been shown to

have considerable overhead, mainly in terms of copying costs, fragmentation and reassem-

bly, reliability and congestion control. In addition, with multiple streams of incoming data

from multiple data-servers, TCP/IP connections have been shown to suffer from the prob-

lem of incast [70], which severely reduces the throughput. Second, TCP/IP with multiple

copies and considerable overhead is unable to take advantage of the high-performance net-

works like InfiniBand and 10 GbE [38, 78]. Third, with a single headed NAS, there is only

a single point of failure, making it easier to protect the data on the NAS. However, with a

clustered NAS, we have multiple data servers, with multiple failure points.

Modern high-performance networks such as InfiniBand provide low-latency and high-

bandwidth communication. For example, the current generation ConnectX NIC from Mel-

lanox has a 4 byte message latency of around 1µs and a bi-directional bandwidth of up to

4 GB/s for large messages. Applications can also deploy mechanisms like Remote Direct

Memory Access (RDMA) for zero-copy low-overhead communication. RDMA operations allow two appropriately authorized peers to read and write data directly from each other’s

address space. RDMA requires minimal CPU involvement on the local end, and no CPU

80 involvement on the remote end. Designing the stack with RDMA may eliminate the copy overhead inherent in the TCP and UDP stacks and reduce CPU utilization. As a result, a high-performance RDMA enabled network like InfiniBand might potentially reduce the overhead of TCP/IP connections in clustered NAS.

In the previous Chapters, we looked at how to design a Network File System (NFS)

(which is a single headed NAS) with RDMA operations for NFSv3 and NFSv4. In this

Chapter, we propose, design and evaluate a clustered Network Attached Storage (NAS).

This clustered NAS is based on parallel NFS (pNFS) with RDMA operations in InfiniBand.

While other parallel and clustered file systems such as Lustre [102] exist, we choose

pNFS since NFS is widely deployed and used. In this Chapter, we make the following contributions:

• An in-depth discussion of the tradeoffs in designing a high-performance pNFS with

an RPC/RDMA transport.

• An understanding of the issues with sessions, which provide exactly once semantics in

the face of network faults and the trade-offs in designing pNFS with sessions over

RDMA.

• A comprehensive performance evaluation with micro-benchmarks and applications

of a RDMA enabled pNFS design.

Our evaluations show that by enabling pNFS with an RDMA transport, we can de- crease the latency for small operations by up to 65% in some cases. Also, pNFS enabled with RDMA allows us to achieve a peak IOzone Write and Read aggregate throughput

81 of 1,872 MB/s and 5,029 MB/s respectively using a sequential trace with 8 data servers.

The RDMA enabled Write and Read aggregate throughput is 150% and 188% better than

the corresponding throughput with a TCP/IP transport. Also, evaluation with a Zipf trace

distribution allows us to achieve a maximum improvement of up to 27% when switching

transports from TCP/IP to RDMA. Finally, application evaluation with BTIO shows that

the RDMA enabled transport with pNFS performs better than with a TCP/IP transport by

up to 7%.

The rest of the Chapter is presented as follows. Section 6.1 looks at the parallel NFS and sessions extensions to NFSv4.1. Following that, Section 6.2 looks at the design con-

siderations for pNFS over RDMA. After that, Section 6.3 evaluates the performance of the design. Finally, Section 6.4 summarizes the contributions of this Chapter.

6.1 NFSv4.1: Parallel NFS (pNFS) and Sessions

In this section, we discuss pNFS and sessions, which are defined by the NFSv4.1 se-

mantics.

Parallel NFS (pNFS): The NFSv4.1 [79] standard defines two main components; namely

parallel NFS (pNFS) and sessions. The focus of pNFS is to make an NFSv4.1 client a

front-end for clustered NAS or parallel file-system. The pNFS architecture is shown in

Figure 6.1. The NFSv4.1 client can communicate with any parallel file system using the Layout

and I/O driver in concert with communications with the NFSv4.1 server. The NFSv4.1

server has multiple roles. It acts as a metadata server (MDS) for the parallel/cluster file

system. It sends information to the client on how to access the back-end cluster file system.

This information takes the form of GETDEVICEINFO, which returns information about a specific data-server in the cluster file system, usually an IP address and port number that

is stored by the client layout driver. The NFSv4.1 server is also responsible for communi-

cating with the data servers for file creation and deletion. The NFSv4.1 server may either

directly communicate with the data servers, or it may communicate with a metadata server,

that is responsible for talking to and controlling the data servers in the parallel file system.

The pNFS client uses the file layout and I/O driver for communicating with the data servers.

The layout driver is responsible for translating READ and WRITE requests from the up-

per layer into the corresponding protocol that the back-end parallel/cluster file system uses;

namely object, block and file. This is achieved through the additional NFS procedures GET-

FILELAYOUT (how the file is distributed across the data servers), RETURNFILELAYOUT

(after a file is closed), LAYOUTCOMMIT (commit changes to file layout at the metadata

server, after writes have been committed to data servers). Examples of pNFS designs for

object based systems include the PanFS file system [23]. Examples of file based systems

include those from Sun [22] and Network Appliances [65].

NFSv4.1 and sessions: Sessions are aimed at making the NFSv4 non-idempotent re-

quests resilient to network level faults. Traditionally, non-idempotent requests are taken

care of through the Duplicate Request Cache (DRC) at the server. The DRC has a limited

number of entries, and these entries are shared among all the clients. So, eventually some

entries will be evicted from the cache. In the face of network-level partitions, duplicate requests whose entries have been evicted from the DRC will be re-executed. Sessions solve this problem by requiring that each connection be allotted a fixed number of RPC slots in



Figure 6.1: pNFS high-level architecture

the DRC. The client is only allowed to issue requests up to the number of slots in the con-

nection. Because of this reservation policy, duplicate requests from the client to the server

in the face of network-level partitions will not be re-executed. We will consider design issues with sessions and RPC/RDMA in the following section.

RPC/RDMA for NFS: The existing RPC/RDMA design for Linux and OpenSolaris

is based on the Read-Write design discussed in Chapter 3. It consists of two protocols;

namely the inline protocol for small requests and the bulk data transfer protocol for large

operations. The inline protocol on Linux is enabled through the use of a set of persistent

buffers; (32 buffers of 1K each for Send and 32 buffers of 1K each for receives on Linux).

RPC Requests are sent using the persistent inline buffers. RPC replies are also received

using the persistent inline buffers. The responses for some NFS procedures such as READ

and READDIR might be quite large. These responses may be sent to the user via the bulk-

data transfer protocol, which uses RDMA Write to send large responses from the server to

84 the clients without a copy and RDMA Reads to pull data in from the client for procedures

such as Write. The design trade-offs for RPC/RDMA are discussed further in Chapter 3.

6.2 Design Considerations for pNFS over RDMA

In this section, we examine the considerations for a high-performance design of pNFS over RDMA. First, we look at the detailed architecture of pNFS with a file layout driver.

6.2.1 Design of pNFS using a file layout

As discussed in Section 6.1, pNFS can potentially use an object, block or file based model. In this Chapter, we use the file-based model for designing the pNFS architecture.

We now discuss the high-level design of the pNFS architecture.

pNFS Architecture: The detailed architecture is shown in Figure 6.2. The NFSv4.1 clients use a file layout driver which is responsible for communicating with the NFSv4 servers, that act as the data-servers. At the NFSv4.1 server, the sPNFS daemon runs in user-space. The sPNFS daemon communicates with the NFSv4.1 server in the kernel via the RPC PipeFS. The RPC PipeFS is essentially a wait queue in the kernel. The NFSv4.1 server enqueues requests from the clients via the control path, and these requests are then pushed to the sPNFS daemon via an upcall. The sPNFS daemon then processes each of these requests and makes a downcall into the kernel with the appropriate data/response for the requests. The NFSv4.1 requests which are sent up to the sPNFS daemon for processing



Figure 6.2: pNFS detailed design

include the NFSv4.1 procedures GETFILELAYOUT, RETURNFILELAYOUT, LAYOUT-

COMMIT and GETDEVICEINFO. These procedures are discussed in Section 6.1. In order to process these requests, the sPNFS daemon mounts an NFSv3 directory from each of the data-servers. For example, when a file layout is requested (GETFILELAY-

OUT), the sPNFS daemon may need to create the file on each of the data servers or open the file through the VFS DataServer Control Path.

sPNFS file creation: To create a file, the sPNFS daemon will open the file on the mount

of each of the data servers in create mode. It will then do a stat to make sure that the file actually got created or exists. It will then close the file (the file handle is static). This traffic will propagate via RPC calls through the MDS-DS control path. Finally, it will return the set of open file descriptors to the NFSv4.1 server as part of the response to the upcall. The

NFSv4.1 server will then reply to the NFSv4.1 client with the file layout. The client will

86 then use the layout received (through the file layout driver) to communicate with the NFSv4 data servers using the Data Paths.
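
A minimal user-space sketch of this create path is shown below, assuming the sPNFS daemon has each data server's export mounted under a hypothetical /spnfs/ds<i>/ prefix; the paths, constants and function name are illustrative only.

/*
 * Sketch of the sPNFS create path: open in create mode on every data-server
 * mount, stat to confirm, then close.  The downcall that returns the results
 * to the kernel NFSv4.1 server is omitted.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define NUM_DS 8

int spnfs_create(const char *name)
{
    char path[256];
    struct stat st;

    for (int ds = 0; ds < NUM_DS; ds++) {
        snprintf(path, sizeof(path), "/spnfs/ds%d/%s", ds, name);

        /* Open in create mode: this travels to the DS over the MDS-DS control path. */
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return -1;

        /* stat to confirm the file really exists on this data server. */
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        close(fd);   /* the NFS file handle itself stays valid after close */
    }
    /* A downcall would now return the per-DS results so the NFSv4.1 server
     * can hand the file layout back to the client.                          */
    return 0;
}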

sPNFS file deletion: File deletes are initiated by the NFS REMOVE procedure. The

REMOVE procedure is sent up to the sPNFS daemon through RPC PipeFS. The process of deleting a file is the opposite of creation. The sPNFS daemon will try to delete the file from each of the mount points. Once this is achieved, sPNFS will send a message to the

NFS kernel thread about this.

6.2.2 RPC Connections from clients to MDS (Control Path)

The RPC connections from the clients to the MDS may be through either RDMA or

TCP/IP. A majority of the communication from the clients to the metadata server is ex-

pected to be small operations or metadata intensive workloads. As a result, these workloads

may potentially benefit from the lower latency of RDMA. However, since NFS and RPC

are in the kernel, there is the cost of a context switch from user-space to kernel-space, in addition to the copying costs with the NFS and RPC stacks. Depending on factors such as the CPU speed and memory bus-bandwidth, these costs might dominate. Correspondingly,

the lower latency of RDMA might not provide much of a benefit in these cases. Another

important factor that needs to be considered is the memory utilization and scalability of

the MDS. The MDS is required to maintain RDMA enabled RPC connections with all the

clients. Each of these connections holds 32 1K send buffers and 32 2K receive buffers.

These buffers are not shared across all the connections. With a very large number of client

connections using RPC over RDMA, the MDS server might run out of buffers that might be

87 appropriately utilized. In these cases, using RPC over TCP might be more appropriate for the majority of clients, though the high copying cost associated with TCP/IP connections needs to be considered. If an RDMA enabled RPC transport can provide adequate benefit for small operations, it might be appropriate to use a few connections with RDMA for some clients that communicate frequently with the MDS and a TCP enabled RPC transport for the remaining connections. A final factor that needs to be considered is the disconnect time for a RDMA enabled RPC transport. RDMA enabled RPC connections are disconnected after 2 minutes idle time. Reestablishing a RDMA enabled RPC connection is a very ex- pensive operation because of the high-overhead of registering memory and reestablishing the eager protocol as discussed in Chapter 3. In comparison, RPC over TCP does not have such high-latencies for reestablishing the connections.
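
Using the buffer sizes quoted above, the per-client footprint on the MDS is straightforward to estimate (the client count of 10,000 is an illustrative assumption):

\[ 32 \times 1\,\mathrm{KB} + 32 \times 2\,\mathrm{KB} = 96\,\mathrm{KB} \ \text{per RDMA connection}, \qquad 10{,}000 \ \text{clients} \times 96\,\mathrm{KB} \approx 0.9\,\mathrm{GB}, \]

which illustrates why a mix of a few RDMA connections for the busiest clients and TCP connections for the rest may be the more scalable choice.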

6.2.3 RPC Connections from MDS to DS (MDS-DS control path)

It is possible to use either RPC over RDMA or RPC over TCP connections

between the MDS and DSes. The MDS-DS control path allows the MDS to control the

NFSv4 data-servers. This control is in the form of file creations and deletions. There are a

number of factors that affect the choice of a RPC enabled with RDMA or TCP connection from the metadata server to the data servers. As discussed earlier, the sPNFS daemon is multi-threaded. As a result, there are expected to be a large number of requests in flight, in parallel. So, the lower potential latency of RPC with RDMA is likely to provide a benefit in completion of these requests. Also, the fixed number of buffers per connection is expected to provide a better flow-control mechanism for a large number of outstanding parallel requests. Finally, the number of data servers is relatively small in comparison to

88 the number of clients. As a result, the MDS-DS control path is not likely to be severely affected by the buffer scalability issue that may potentially affect the Control Path.

6.2.4 RPC Connections from clients to data servers (Data paths)

The traffic from the clients to the data servers is expected to consist of small, medium and large transfers. Since 32K is the maximum payload for the cached I/O case, this is likely to be the most common transfer size over the network, depending on the stripe size of the file at the data servers. We also need to consider the issue of buffer scalability.

Since data-servers are expected to have connections from a large number of clients, and since each connection will have persistent buffers, this might cause a memory scalability issue. However, clients do not connect to a particular data server unless the data server is in the list of DSes returned in the file layout. As a result, not all clients will be connected to all data servers at any given time. Depending on the load on the back-end file system, using an RPC over RDMA connection from the data-servers to the client might not cause a large amount of overhead at the data-servers. Also, quiescent clients will be disconnected from the data-servers, further reducing the overhead. Since an RPC transport enabled with

RDMA has been shown to provide considerable benefits for large transfers, it might be beneficial to use RPC over RDMA between the clients and data-servers.

89 6.2.5 Sessions Design with RDMA

As discussed earlier, sessions provide exactly once semantics for all NFS procedures in the wake of network-level faults. To do this, sessions provide dedicated slots of buffers to each connection between the client and the servers. The client may only send requests up to the maximum number of slots per session. In order to design sessions with an RDMA enabled RPC transport, we associate the inline buffers in each connection with the minimum number of slots required from the connection. If the number of slots available is lower than the number requested, and the caller cannot accept a lower number, the session create request will fail. The disadvantage of the sessions design with RDMA is that advanced features of the InfiniBand network such as the Shared Receive Queue (SRQ) [80] cannot be used. SRQ enhances buffer scalability by having the buffers shared across all the InfiniBand connections. When the number of buffers falls below a certain watermark, an interrupt may be generated to post more buffers. Since sessions require that slots be guaranteed per connection, SRQ cannot be used.
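
The slot-reservation idea can be sketched as follows. The structures and the session_create helper are illustrative and assume 32 inline receive buffers per connection, as in the Linux design discussed earlier; they are not part of the NFSv4.1 wire protocol.

/*
 * Sketch: session slots are reserved out of the connection's dedicated
 * inline receive buffers, which is why SRQ-style sharing cannot be used.
 */
#include <stdint.h>

#define INLINE_RECV_BUFS 32

struct rdma_conn {
    uint32_t inline_bufs_total;   /* buffers posted for this connection */
    uint32_t slots_reserved;      /* buffers already promised to sessions */
};

struct nfs41_session {
    struct rdma_conn *conn;
    uint32_t          slots;      /* slots granted: bounds requests in flight */
};

int session_create(struct rdma_conn *c, uint32_t slots_wanted,
                   uint32_t slots_min, struct nfs41_session *s)
{
    uint32_t avail = c->inline_bufs_total - c->slots_reserved;
    uint32_t grant = slots_wanted < avail ? slots_wanted : avail;

    if (grant < slots_min)
        return -1;                 /* caller cannot accept fewer slots: fail */

    c->slots_reserved += grant;
    s->conn  = c;
    s->slots = grant;              /* the client may never have more than this
                                      many requests outstanding on the session */
    return 0;
}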

6.3 Performance Evaluation

In this section, we evaluate the performance of pNFS designed with an RPC over

RDMA transport. First, we discuss the experimental setup in Section 6.3.1. Following that, in Sections 6.3.2, 6.3.3 and 6.3.4, we look at the relative performance advantages of using an RDMA enabled RPC transport over a TCP/IP transport in different configurations involving the metadata server (MDS), Data Server (DS) and Client. Since sessions only

90 requires reservation of RDMA inline buffers, we do not evaluate the sessions portion of the

design.

6.3.1 Experimental Setup

To evaluate the performance of the RPC over RDMA enabled pNFS design (pNFS/RDMA), we used a 32-node cluster. Each node in the cluster is equipped with a dual Intel Xeon 3.6

GHz CPU and 2GB main memory. For InfiniBand communication, each node uses a Mellanox Double Data Rate (DDR) HCA. The nodes are equipped with SATA drives, which are used to mount the backend ext3 filesystem. A memory based filesystem, ramfs, is also used for some experiments. pNFS with sockets uses IP over InfiniBand (IPoIB) and we refer to this transport as pNFS/IPoIB; we use pNFS/IPoIB and pNFS/TCP interchangeably. All experiments using IPoIB are based on Reliable Connection mode (IPoIB-RC) and an MTU of 64KB, unless otherwise noted. We use pNFS/IPoIB-UD to explicitly mean an unreliable datagram mode of transport. IPoIB-UD uses a 2K MTU size.

6.3.2 Impact of RPC/RDMA on Performance from the client to the Metadata Server

The clients communicate with the MDS using either NFSv4 or NFSv4.1 procedures.

As Section 6.1 mentions, the vast majority of NFSv4.1 requests from the clients to the

MDS are expected to be procedures such as GETDEVICEINFO, GETDEVICELIST, GET-

FILELAYOUT and RETURNFILELAYOUT. These small procedures will potentially carry

91 small and medium size payloads. For example, GETFILELAYOUT returns a list of file handles, which is only a small amount of payload. A file handle can be encoded with no

more than 16 bytes of information (although a native file handle size may vary depending

on platforms). One of the largest deployments of a parallel file system Lustre [102] in

recent times is the TACC [94] cluster with 8,000 nodes containing 64,000 cores, serviced

by a bank of 1,000 data server nodes. With 1,000 data server nodes and the assumption

that a file is striped across all the data server nodes, the payload from GETFILELAYOUT

will only be 16K. Also, some of these operations such as GETDEVICEINFO are only ex-

ecuted at mount time and are not in the critical path. On the other hand, operations such

as CREATE, GETFILELAYOUT and RETURNFILELAYOUT are executed every time a file

is created, opened and closed. With a workload consisting of a large number of such oper-

ations (metadata intensive workloads) RPC/RDMA is likely to provide some benefit. Also,

LAYOUTCOMMIT is executed once a WRITE operation completes and is likely to be in

the critical path for workloads dominated by write operations.

To understand the relative performance of small operations when switching transports

from RPC/TCP to RPC/RDMA, we measured the latency of issuing a GETFILELAYOUT

(at the RPC layer) from the client to the MDS and the time required for it to complete,

averaged over 1024 times, while the payload from the MDS to the client was varied from

1 to 32K bytes. A 32K message can contain the information for more than 2,000 file han-

dles and might be considered large for contemporary, high-performance parallel file system

deployments. The measured latency is shown in Figure 6.3. As shown in Figure 6.3, the


Figure 6.3: Latency for small operations (GETFILELAYOUT)

latency with a 1 byte payload is 68µs with RPC/RDMA and 71µs with RPC/TCP. The relatively low improvement in performance is because the high access latency of the disk is a dominant portion of the overall latency. With larger accesses, the disk blocks are prefetched because of sequential access, and the performance improvement from using RPC/RDMA increases to up to 65%. The performance benefit of the RPC/RDMA connection from the client to the MDS must be weighed against the inline buffers, which need to be statically allocated per-client at the MDS. With an increasing number of clients, the RPC/RDMA connections may consume considerable memory resources. Since the MDS is likely to be the target of a mainly metadata intensive workload, it becomes imperative to maintain a large number of inline buffers in order to guarantee high throughput.

93 6.3.3 RPC/RDMA versus RPC/TCP on metadata server to Data Server

The connection from the MDS to the DSes may also consist of RPC/RDMA. The sP-

NFS daemon controls the DSes by mounting the exported directories from the data servers.

The sPNFS daemon creates, opens and deletes files in the exported directories. These calls are translated through the VFS layer to RPC calls. Thus the scalability of these

calls is directly impacted by the time required by the RPC operations to complete. To

gain insight into the relative scalability of the RPC/RDMA and RPC/TCP transports, we

measured the performance of the create portion of the sPNFS daemon's operation. In this multi-process benchmark, each process is synchronized in the start phase by a barrier. After being released from the barrier, each process performs a stat operation on the target file to check

its state, then opens this file in creation mode. These two operations are followed by a chmod to set the mode of this file, and a close operation to close this file. The close op-

eration is a portion of the process to open a pNFS file, and it is included to avoid running out of open file handles, a limited operating system resource. Each process performs each

of these operations on every one of the DS mounts. The time required for 1024 of these

operations is measured and averaged out. This test is performed for a RPC/RDMA and

RPC/TCP transport from the MDS to the DSes. These numbers are shown in Figures 6.4

and 6.5.
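
The per-process loop of this benchmark can be sketched as below. The /mds/ds<i>/ mount prefix, the file naming, and the omission of the barrier and timing code are simplifications for illustration; this is not the actual benchmark source.

/*
 * Sketch of the create benchmark: stat, open (create), chmod, close on every
 * data-server mount, repeated ITERATIONS times per process.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define NUM_DS      16
#define ITERATIONS  1024

void create_benchmark(int rank)
{
    char path[256];
    struct stat st;

    for (int i = 0; i < ITERATIONS; i++) {
        for (int ds = 0; ds < NUM_DS; ds++) {
            snprintf(path, sizeof(path), "/mds/ds%d/bench.%d.%d", ds, rank, i);

            stat(path, &st);                              /* stat the target file   */
            int fd = open(path, O_CREAT | O_RDWR, 0600);  /* open in creation mode  */
            if (fd < 0)
                continue;
            fchmod(fd, 0644);                             /* chmod to set its mode  */
            close(fd);                                    /* avoid fd exhaustion    */
        }
    }
}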

In Figures 6.4 and 6.5, we observe the following trends. RPC/RDMA performs worse

than RPC/TCP (indicated as IPoIB) for 1 process. Note that in this case, we are measuring

the time at the VFS level, whereas in Section 6.3.2, we are measuring the time at the RPC

layer. In the current scenario, the IPoIB-RC driver uses a ring of 128 receive buffers of

94 4500 30000 IPoIB-1Proc IPoIB-4 4000 RDMA-1Proc RDMA-4 25000 3500 IPoIB-2Proc IPoIB-8 RDMA-2Proc RDMA-8 3000 20000 IPoIB-16 2500 RDMA-16 15000 2000

Latency (us) 1500 Latency (us) 10000 1000 5000 500 0 0 1 2 4 8 16 1 2 4 8 16 Number of data servers Number of data servers

Figure 6.4: Latency for 1 and 2 processes. Figure 6.5: Latency for 4, 8 and 16 processes.


Figure 6.6: Timing breakdown (1 process). Figure 6.7: Timing breakdown (16 processes) (I: IPoIB, R: RDMA).

size 64K and 64 send buffers. On the other hand, RPC/RDMA uses 32-buffers of 1K. As a result, with an increasing number of data servers, and 1 process, more create and stat operations can be issued in parallel with IPoIB-RC than with RPC/RDMA (we issue 1,024 create operations and 1,024 stat operations for a total of 2,048 operations). However, with

95 IPoIB-RC all 128 receive buffers are shared across all the connections using SRQ [80].

With RPC/RDMA, each connection from an MDS to a DS is allocated a set of 32-buffers.

As a result, when the number of connections increases, RPC/RDMA has a dedicated set of buffers in which to receive messages, while IPoIB has a fixed number of buffers, and this might result in dropped messages with IPoIB. Also, with RPC/IPoIB, there are up to 5 copies from the application to the IP level. With RPC/RDMA, there are up to 3 copies from the application down to the RPC/RDMA layer. With an increasing number of processes, the larger number of copies in the case of IPoIB begins to dominate and IPoIB performs worse than RDMA. The copying cost with IPoIB and 1 client does not totally consume the CPU and so is not the dominant factor. As a result, with 1 process, RPC/TCP is able to perform better than RPC/RDMA. At 2 processes per node, RPC/RDMA and RPC/TCP perform comparably with an increasing number of data-servers. At 4 processes/node and above with RPC/RDMA, the time required to perform the create operations is lower than

RPC/TCP. At 16 processes at the MDS and 16 DSes, there is a maximum decrease in latency of 15%. The trends we have observed indicate that RPC/RDMA will perform better than RPC/TCP with a larger volume of operations. We have conducted a test with 32 client threads with both RPC/RDMA and RPC/TCP; RPC/RDMA exhibits a similar degree of improvement over RPC/TCP.

Also, Figures 6.6 and 6.7 show the timing breakdown for open (with create), stat, chmod and close at one and 16 processes at the MDS with varying number of data servers.

On the x-axis, the legend R-n stands for n data servers using RPC/RDMA, and I-n stands for n data servers using RPC/TCP. As expected, the time for open (with create) dominates.

96 The time for open is higher with RPC/RDMA than with RPC/TCP at 1 process. At 16

processes, RPC/TCP overhead becomes dominant and the time for open (with create) is

lower with RPC/RDMA than with RPC/TCP.

6.3.4 RPC/RDMA versus RPC/TCP from clients to DS

We measure the relative performance impact of changing the transport from RPC/RDMA

to RPC/TCP from the client to the data-servers. To measure the performance impact, we

use three different benchmarks: sequential throughput with IOzone, throughput of a Zipf

trace and a parallel application BTIO.

Sequential Throughput

We use IOzone [11] in cluster mode to measure the performance of a sequential workload modeling the throughput from the clients to the DSes. 8 nodes act as data servers, 8 nodes act as clients, and 1 node is designated as the metadata server. Each client node hosts one IOzone process. The benchmark is run on both the IPoIB Reliable Connection mode (IPoIB-RC) and IPoIB Unreliable Datagram mode (IPoIB-UD) to compare against

RPC/RDMA. The IOzone record size is kept at 32KB, the default cached I/O maximum size and the total file size per client used is 512MB. The Write and Read throughput while varying the number of data servers and clients (aggregate throughput) is shown in Fig- ures 6.8 and 6.9 respectively.

For Write, RPC/RDMA begins to outperform RPC/TCP as the number of data servers is increased beyond two. At 8 data servers and 8 clients, RPC/RDMA reaches its peak

97 write throughput of 1,872 MB/s, which is 22% higher than IPoIB-RC and 150% higher than IPoIB-UD. For Read, there is an improvement in performance for all cases. Using

RPC/RDMA achieves a peak read throughput of 5,029 MB/s at 8 clients and 8 data servers, which outperforms IPoIB-RC by 89% and IPoIB-UD by 188%.


Figure 6.8: IOzone Throughput (Write). Figure 6.9: IOzone Throughput (Read).

Throughput with a Zipf Trace

Zipf’s law [103], named after the Harvard linguistics professor George Kingsley Zipf

(1902-1950), is the observation that frequency of occurrence of some event (P), as a func-

tion of the rank (i) when the rank is determined by the above frequency of occurrence,

is a power-law function P_i ∼ 1/i^α, with the exponent α close to unity. Zipf distributions

have been shown to occur in a variety of different environments such as word distributions

98 in documents, web-page access patterns [56] and file and block distributions in storage sub-systems [34].

We modified IOzone to issue write and read requests, where the size and location of the Read or Write request follows a Zipf distribution [103] with an α=0.9. We used IO- zone to measure the throughput of the trace on a single node with one thread, issuing requests where the location and I/O size of the issued request follows a Zipf distribution.

We used a 512MB file size on both an ext3 as well as a ramfs backend file system. We compare pNFS/RDMA with pNFS/IPoIB-RC while varying the number of data servers.
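
For reference, a generic Zipf sampler of the kind needed to generate such a trace is sketched below; this is an illustrative implementation assuming a 32 KB block granularity over the 512MB file (16,384 blocks), not the actual IOzone modification used here.

/*
 * Illustrative Zipf (alpha = 0.9) sampler for request offsets: ranks are
 * weighted by 1/rank^alpha and drawn via the cumulative distribution.
 */
#include <math.h>
#include <stdlib.h>

#define ZIPF_ALPHA 0.9
#define NUM_BLOCKS 16384          /* 512 MB file / 32 KB blocks */

static double zipf_cdf[NUM_BLOCKS];

/* Call once before sampling. */
void zipf_init(void)
{
    double sum = 0.0;
    for (int i = 0; i < NUM_BLOCKS; i++)
        sum += 1.0 / pow(i + 1, ZIPF_ALPHA);        /* normalisation constant */

    double run = 0.0;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        run += (1.0 / pow(i + 1, ZIPF_ALPHA)) / sum;
        zipf_cdf[i] = run;                           /* cumulative probability */
    }
}

/* Draw a block index whose rank follows the Zipf distribution. */
int zipf_next_block(void)
{
    double u = (double)rand() / RAND_MAX;
    for (int i = 0; i < NUM_BLOCKS; i++)             /* linear scan: fine for a sketch */
        if (u <= zipf_cdf[i])
            return i;
    return NUM_BLOCKS - 1;
}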

The results for Write are shown in Figure 6.10, while the results for Read are shown in

Figure 6.11. We observe that the RPC transport used does not have a large impact on performance for Writes. Disk Filesystem Write performance is generally sensitive to the performance of the backend storage subsystem. The large majority of disks exhibit poor random Write performance [96]. Also, depending on the organization of the in-memory file system, ramfs based systems have also been shown to exhibit poor performance for ran- dom Write operations [35]. Correspondingly, for the Zipf based Write distribution, we see a very poor throughput of around 500 MB/s for both pNFS/RDMA and pNFS/IPoIB-RC.

On the other hand, the IOzone Read throughput is impacted by the underlying RPC trans- port. With a ramfs based file system, we see an improvement of 22% from pNFS/IPoIB-RC to pNFS/RDMA with 1 data server. The improvement in throughput from pNFS/IPoIB-RC to pNFS/RDMA increases to 27% at 8 data-servers. We are also able to achieve a peak throughput of 2073 MB/s with the Zipf trace at 8 data-servers. Since, the Zipf trace has an element of randomness, a portion of the Read data is cached at the client. As a result we

99 see some amount of cache effect in addition to network-level transfers, which reduces the potential performance improvement with pNFS/RDMA. Kanevsky, et.al. [34] observed a similar effect when using a Zipf trace with NFS/RDMA using a single server. Using the suggested technique of reduced caching at the client for NFS/RDMA, it may be possible to further enhance the performance of pNFS/RDMA when a workload has a Zipf distribution.


Figure 6.10: Zipf trace throughput (Write). Figure 6.11: Zipf trace throughput (Read).

Performance with a Parallel Scientific Application NAS BTIO

The NAS Parallel Benchmarks (NPB) [99] suite is used to measure the performance of

Computational Fluid Dynamic (CFD) codes on a parallel machine. One of the benchmarks

BT solves block tri-diagonal systems of equations in parallel. In addition to the computational phase of BT, BTIO adds additional code for check-pointing and verifying the data from the computation. The user may choose from three different modes

100 of performing I/O; namely simple mode, Fortran I/O mode and Full MPI I/O mode [62].

We use the Full MPI I/O mode in which MPI collective calls are used to aggregate Read and Write operations. We run BTIO with a class A size (that uses a 64x64x64 array) over pNFS/RDMA and pNFS/IPoIB. The results with an ext3 back-end file system at the data servers are shown in Figure 6.12.

In these experiments, we measured the performance of BTIO [99] (Million Opera-


Figure 6.12: Performance of BTIO with ext3

tions/second) while varying the number of data servers from one to eight and with one,

sixteen and sixty-four processes (BTIO requires a square number of processes). For the sixteen process case, we use eight client nodes (2 processes/node). For the 64 process case, we also use eight client nodes (eight processes per node). We observe the following trends at eight data servers. First, pNFS/IPoIB-RC and pNFS/RDMA perform comparably at one

process on one node, irrespective of the number of data servers. As the number of processes increases, pNFS/RDMA begins to perform better than pNFS/IPoIB-RC. This trend

is more easily seen with eight data servers and we discuss the trend for eight data servers.

At 16 processes (2 processes/node), BTIO over a pNFS/RDMA transport performs up to

4% better than over a pNFS/IPoIB-RC transport. At 64 processes (8 processes/node), this

increases to approximately 7%. In full MPI I/O mode, MPI collective calls are used to

aggregate smaller reads and writes from different processes into fewer write and read oper-

ations with larger sizes. As a result, the bandwidth of the transport impacts the performance of BTIO. This becomes apparent at larger numbers of processes per node and a greater number of data servers, because copying cost and contention become the dominant factors with the

TCP/IP transport.
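As an illustration of how collective I/O aggregates per-process requests, the following minimal sketch (not taken from BTIO; the file name, element count, and use of MPI_File_write_at_all are illustrative assumptions) shows each MPI process contributing a contiguous slice of a shared file through a single collective write:

/* Minimal sketch of a collective MPI-IO write, in the spirit of BTIO's
 * full MPI I/O mode.  File name and record size are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int count = 1 << 20;                 /* 1M doubles per process */
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *buf = malloc((size_t)count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = (double)rank;

    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its own slice; the collective call lets the MPI-IO
     * layer merge the per-process requests into fewer, larger operations. */
    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}

The benefit of the faster transport shows up precisely because the MPI-IO layer turns many small requests into a few large transfers whose cost is dominated by network bandwidth.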

6.4 Summary

In this Chapter, we propose, design and evaluate a high-performance clustered NAS.

The clustered NAS uses parallel NFS (pNFS) with an RDMA-enabled transport. We examine a number of design issues and trade-offs, in particular buffer management at the client, DS, and MDS, and the scalability of connections with increasing numbers of clients and data servers. We also look at how an RDMA transport may be designed with sessions, which gives us exactly-once semantics. Our evaluations show that by enabling pNFS with an RDMA transport, we can decrease the latency for small operations by up to 65% in some

cases. Also, pNFS enabled with RDMA allows us to achieve peak IOzone Write and Read aggregate throughputs of 1,800+ MB/s (150% better than TCP/IP) and 5,000+ MB/s (188% better than TCP/IP), respectively, using a sequential trace and 8 data servers. Evaluation with a Zipf trace distribution shows a maximum improvement of up to 27% when switching transports from TCP/IP to RDMA. Finally, application evaluation with BTIO shows that the RDMA-enabled transport with pNFS performs better than a transport with TCP/IP by up to 7%.

CHAPTER 7

CACHING IN A CLUSTERED NAS ENVIRONMENT

With the dawn of the Internet age and the rapid growth of multimedia and other traffic, there has been a dramatic increase in the amount of data that needs to be stored and accessed. In addition, commercial and scientific applications such as data mining and nuclear simulations generate and parse vast amounts of data during their runs. To meet the demand for access to this data, single-server file systems such as NFS [76] and GlusterFS [5], and parallel file systems such as Lustre [102], deployed over high-bandwidth interconnects like InfiniBand with high-performance storage at the servers, have become commonplace. However, even with these configurations, the performance of the file system under a variety of workloads is limited by the access latency of the disk. With a large number of requests to non-contiguous locations on the disk, the ability of the file system to cope is severely limited. In addition, striping data across parallel servers provides limited benefit in environments with many small files.

To reduce the load on the disk and enhance the performance of the file system, several different caching strategies and alternatives have been proposed [57, 52]. Generally, in most file systems, a cache exists at the server side. It might be part of the distributed file system, as with Lustre [102], or it might reside in the underlying file system, as with NFS. The server-side cache generally contains the latest data. It may be used to reduce the number of requests hitting the disk, and it also helps when there is a fair amount of read/write data sharing. However, the server-side cache is generally limited in size and shared by a large number of I/O threads. The limited size of the cache, in concert with policies like LRU, can reduce its effectiveness.

In addition to a server side cache, file system protocols like NFS [76] and Lustre [102] also provide a client side cache. A client side cache may provide a large benefit in terms of performance when most of the data is accessed locally, such as in the case of a user home directory. However, client side caches introduce cache coherency issues when there

is sharing of data between multiple clients. NFS does not offer strict cache coherency and

uses coarse timeouts to deal with the issue. Lustre [102] on the other hand uses locking

with the metadata server acting as a lock manager to implement client cache coherency.

Writes are flushed before locks are released. With a large number of clients, the overhead

of maintaining locks and keeping the client caches coherent increases. GlusterFS [5] does

not provide a client side cache in the default configuration.

In this Chapter, we propose, design and evaluate an InterMediate Caching architecture

(IMCa) between the client and the server for the GlusterFS [5] file system. We maintain

a bank of independent cache nodes with a large capacity. The file system is responsible

for storing information from a variety of different operations in the cache. Keeping the

information in the cache bank up-to-date is achieved through a number of different hooks

at the client and the server. Through these hooks, the client attempts to fetch the information for different operations from the cache before trying to get it from the back-end file server.

We expect multiple benefits from using this architecture. First, the file system clients can expect to retain the benefits of a client-side cache with a small penalty bounded by the network round-trip latency; with low-latency, high-performance networks like InfiniBand, this penalty is likely to be small. Second, since the number of caches is small in comparison to the number of clients, and these caches are lockless, keeping the caches coherent is considerably cheaper.

Finally, we expect to reap the benefits of a client cache without the associated scalability and coherency issues.

Our preliminary evaluations show that we can improve the performance of file system operations such as stat by up to 82% over the native design and up to 86% over a file system like Lustre. In addition, we show that the intermediate cache can improve the performance of data transfer operations with both single and multiple clients, and that in environments with read/write sharing of data, there is an overall improvement in file system performance. Finally, IMCa helps us achieve better scalability of file system operations.

The rest of this Chapter is organized as follows. Section 7.1 describes the background. After that, Section 7.2 motivates the need for a bank of caches. In Section 7.3, we discuss the design issues. Following that, Section 7.4 presents the evaluation. Finally, a summary is presented in Section 7.5.

7.1 Background

In this section, we discuss the file system GlusterFS and the dynamic web-content

caching daemon MemCached.

7.1.1 Introduction to GlusterFS

GlusterFS [5] is a clustered file system that scales the storage capacity of many servers to several petabytes. It aggregates various storage servers, or bricks, over an interconnect

such as InfiniBand or TCP/IP into one large parallel network file system. GlusterFS in its default configuration does not stripe the data, but instead distributes the namespace across all the servers. Internally, GlusterFS is based on the concept of translators. Translators may be applied at either the client or the server. Translators exist for Read Ahead and

Write Behind. In terms of design, a small portion of GlusterFS is in the kernel and the remaining portion is in userspace. Calls are translated from the kernel VFS to the userspace daemon through the Filesystem in Userspace (FUSE) module.

7.1.2 Introduction to MemCached

Memcached is an object-based caching system [45] developed by Danga Interactive for LiveJournal.com. It is traditionally used to enhance the performance of heavily loaded database applications or websites with dynamic content. Memcached is usually run as a daemon on spare nodes and listens for requests on a user-specified port.

The amount of memory used for caching is specified at startup. Internally, memcached implements Least Recently Used (LRU) as the cache replacement algorithm. Memcached

uses a lazy expiration algorithm; i.e., objects are evicted when the cache is full and a request is made to add an object, or when a request to fetch a data element finds that the object's time in the cache has expired. Memory management is based on slab allocation to reduce excessive fragmentation. Memcached currently limits the maximum size of a stored object to 1MB and the maximum length of a key to 256 bytes. The memcached daemon may be accessed through TCP/IP connections.

Clients usually manipulate data elements in memcached as (key, data) tuples. The API consists of functions such as set, replace, delete, prepend, and append. A number of libraries are available for accessing memcached daemons; one of them, libmemcache, is a C-based library [13].
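As a concrete illustration of the set/get interface described above, the following minimal sketch talks to a single memcached daemon directly over its ASCII text protocol; the host 127.0.0.1, the default port 11211, and the key demo:key are assumptions, and a real client such as libmemcache would add key hashing across daemons, connection pooling, and error handling.

/* Minimal sketch: storing and fetching one object via memcached's
 * ASCII text protocol over TCP.  Host, port and key are assumptions. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(11211) }; /* default port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);          /* assumed MCD host */
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    /* set <key> <flags> <exptime> <bytes>\r\n<data block>\r\n */
    const char *value = "hello";
    char cmd[256];
    int n = snprintf(cmd, sizeof(cmd), "set demo:key 0 0 %zu\r\n%s\r\n",
                     strlen(value), value);
    write(fd, cmd, n);

    char reply[512];
    n = (int)read(fd, reply, sizeof(reply) - 1);   /* expect "STORED\r\n" */
    reply[n > 0 ? n : 0] = '\0';
    printf("set -> %s", reply);

    /* get <key>\r\n returns "VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n" */
    write(fd, "get demo:key\r\n", 14);
    n = (int)read(fd, reply, sizeof(reply) - 1);
    reply[n > 0 ? n : 0] = '\0';
    printf("get -> %s", reply);

    close(fd);
    return 0;
}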

7.2 Motivation

We now consider the motivation for using an intermediate caching architecture in a file system. We look at some common problems in file system design that could potentially be solved through the use of a caching layer.

Single Server Bandwidth Drop With Multiple Clients. Protocols like NFS/RDMA attempt to offer the improved bandwidth of networks like InfiniBand to NFS. However,

NFS servers usually store most of the data on disk. The server is constrained by the ability of the disk to match the bandwidth of the network. Since the disk is usually much slower than the network, the benefit from using NFS/RDMA is reduced. The effect of this is shown in Figure 7.1(a) and Figure 7.1(b), which show the multi-client IOzone Read throughput

with different transports, namely NFS/RDMA (RDMA), NFS/TCP on InfiniBand (IPoIB)

and finally NFS/TCP on Gigabit Ethernet (GigE). In Figure 7.1(a), 4GB of server memory is used; in Figure 7.1(b), 8GB is used. The bandwidth available to the clients

seems to be related to the amount of memory on the server and falls off as the server runs

out of memory and is forced to fetch data from the disk.

Figure 7.1: Multiple-client IOzone Read bandwidth (MB/s) with NFS/RDMA (RDMA), NFS/IPoIB (IPoIB), and NFS over Gigabit Ethernet (GigE): (a) 4GB server memory, (b) 8GB server memory [76]

Parallel I/O Bandwidth From Multiple Servers. Parallel I/O attempts to use the

aggregate bandwidth of multiple servers. However, since the back-end servers ultimately use real disks, the benefit is diminished, especially for multiple streams that access data spread over different portions of the disk, causing increased disk seeking and reducing performance.

Performance For Small Files. Delivering good performance for small files is generally difficult. In data-center environments, a large number of small files are used [64]. Striping techniques generally used in parallel file systems are of limited use for small files.

Storing files on multiple independent servers can help reduce contention for small files, but still exposes these files to the limits of the disk on these servers.

Cache Coherency Problems. In file system environments, a client-side cache usually provides the best performance. Client caches may be coherent, as with Lustre [102], or non-coherent, as with NFS [76]. Non-coherent client-side caches are more scalable but have limited use in environments with read/write sharing. Coherent client-side caches can handle read/write sharing, but they have limited scalability.

Server Load Problems. Reducing the load on the server is generally crucial to improving the scalability of file system protocols. RDMA is generally proposed as a communication offload technique to reduce the impact of copying in protocols like TCP/IP. However,

RDMA cannot eliminate other copying overheads, such as those across the VFS layer, or other file-system-related overheads. Using an intermediate cache layer may help mitigate some of these problems. We will now look at the design and implementation of a layer of caching nodes.

7.3 Design of a Cache for File Systems

In this section, we consider the design of the Intermediate Memory Caching (IMCa) architecture for the GlusterFS [5] file system. First, we look at the overall block-level

architecture of IMCa in Section 7.3.1. Following that, we look at the potential non-data

file system operations that could be optimized in Section 7.3.2. In Section 7.3.3, we look

at the potential optimizations for data operations. Finally, we discuss some of the potential advantages and disadvantages of IMCa in Section 7.3.4.

7.3.1 Overall Architecture of Intermediate Memory Caching (IMCa) Layer

The architecture of IMCa is shown in Figure 7.2. It consists of three components: CMCache (Client Memory Cache), the MemCached daemon (MCD) array, and SMCache (Server Memory Cache). The first component, CMCache, is located at the GlusterFS client.

Figure 7.2: Overall Architecture of the Intermediate Memory Caching (IMCa) Layer: one or more GlusterFS clients (each with a CMCache translator) interact with an array of MemCached daemons (the MCD array) and with one or more GlusterFS servers (each with an SMCache translator, backed by ext3 on high-speed disks)

Client Memory Cache (CMCache): This is responsible for intercepting file system

operations at the client. It is implemented as a translator on the GlusterFS client as dis-

cussed in Section 7.1. Once these operations are intercepted CMCache determines whether

these requests have any interaction with the caching layer or not. If there is no interac-

tion, CMCache will propagate the request to the server. Interactions are generally in two

forms. In the first form, it may be possible to process the request from the client directly

by contacting the MCDs. In this case, CMCache will contact the MCDs and attempt to directly return the results for the requests. CMCache communicates with the MCDs through

TCP/IP.

MemCached MCD Array (MCD): This consists of an array of MemCached daemons

running on nodes usually set aside primarily for IMCa. The daemons may reside on nodes

that have other functions, since MCDs tend to use limited CPU cycles. To obtain maximum benefit from using IMCa, the nodes should be able to provide a sufficient amount of

memory to the daemons while they are running.

Server Memory Cache (SMCache): This is the final component of IMCa. It is located

on the GlusterFS server. SMCache is implemented as a translator at the GlusterFS server.

SMCache is divided into two parts. The first part of SMCache intercepts the calls coming

from GlusterFS clients. Depending on the type of operation from the GlusterFS client,

it may either pass the operation directly to the underlying file system, or perform certain

transformations on it before passing it to the underlying file system. The GlusterFS file

system uses the asynchronous model of processing requests as discussed in Section 7.1.

Initially, requests are issued to the file system and later when they complete, a callback

handler is called that processes these responses and returns the results back to the client.

The second part of SMCache maintains hooks in the callback handler. These hooks allow

SMCache to intercept the results of different operations and send them to MCDs if needed.

SMCache communicates with the MCDs using TCP/IP.

7.3.2 Design for Management File System Operations in IMCa

We now consider some of the design trade-offs for different management file system operations.

Stat operations: These are included in POSIX semantics. Stat applies to both files and directories. Stat generally returns information about the file size and the create and modify times, in addition to other information and statistics about the file. Stat operations are a popular way of determining updates to a particular file. For example, in a producer-consumer type of application, a producer will write or append to a file, and a consumer may look at the modification time on the file to determine whether an update has become available. This avoids the need and cost of explicit synchronization primitives such as locks, and is used in a number of web and database applications [64]. Since the data structures for stat are generally stored on the disk, stat operations usually have considerable latency, which makes them a natural candidate for caching. We have designed cache-based functionality for stat. At open, the MCD is updated with the contents of the file's stat structure by SMCache. The key used to locate an MCD consists of the absolute pathname of the file with the string :stat appended to it. SMCache uses

the default CRC32 hashing function in libmemcache [13] to locate the appropriate MCD.

For every read and write operation, the stat structure in the MCD is replaced with the most

recent value of stat by SMCache. CMCache then intercepts stat operations, attempts to

fetch the stat information from the MCD if available, and returns it to the client. If there is a

miss, which might happen if the stat entry was evicted from the MCD for example, the stat

request propagates to the server.
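A minimal sketch of the key construction and daemon selection just described; zlib's crc32() stands in here for libmemcache's internal hash (an assumption), and the example path and the helper name mcd_for_stat are hypothetical.

/* Sketch: map a file's stat entry to one of the MCDs.
 * key = "<absolute path>:stat", daemon index = hash(key) % n_mcd.
 * Compile with -lz; zlib's crc32() is only a stand-in for the
 * hashing done inside libmemcache. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

static int mcd_for_stat(const char *abs_path, int n_mcd,
                        char *key, size_t key_len)
{
    snprintf(key, key_len, "%s:stat", abs_path);       /* e.g. /home/u/f.dat:stat */
    uLong h = crc32(0L, (const Bytef *)key, (uInt)strlen(key));
    return (int)(h % (uLong)n_mcd);                    /* which MCD to contact */
}

int main(void)
{
    char key[4096];
    int idx = mcd_for_stat("/home/user/data/file.dat", 4, key, sizeof(key));
    printf("key=%s -> MCD %d of 4\n", key, idx);
    return 0;
}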

Create operations: These usually require allocation of resources on the disk. There is

not much potential for cache based optimizations. Create operations are directly forwarded

from the client to the server without any processing.

Delete operations: These operations usually require removal of items from the disk.

The potential for optimizations with delete operations is limited. Delete operations are

forwarded by the client to the server without any interception. When delete operations

are encountered, we remove the data elements from the cache to avoid false positives for requests from clients.

7.3.3 Data Transfer Operations

There are two types of file system operations that generally transfer data; i.e. Read and

Write. To implement Read and Write with IMCa, CMCache intercepts the Read and Write operations at the client. Before we discuss the protocols for these operations, we look at the issue of cache blocking for file system operations.

Need for Blocks in IMCa

Most modern disk based file systems store data as blocks [60]. Parallel file systems also tend to stripe large files across a number of data servers using a particular stripe width.

Generally, the larger the block size, the better bandwidth utilization from the disk and

network subsystems. Smaller block sizes, on the other hand, tend to favor lower latency, but also tend to introduce more fragmentation. IMCa uses a fixed block size to store file

system data in the cache. Since IMCa is designed as a generic caching layer and should

provide good performance for a variety of file sizes and workloads, the block size should be set with these trade-offs in mind. It should be kept small enough that small files may be stored efficiently, yet large enough to avoid excessive fragmentation and to maintain reasonable network bandwidth utilization. MemCached [45] has an upper limit of 1MB for stored data elements, as discussed in Section 7.1.

This places a natural upper bound on the size of data that may be stored in the cache.

Depending on the blocksize, IMCa may need to fetch or write additional blocks from/to

the MCDs above and beyond what is requested. This happens if the beginning or end of

the requested data element is not aligned with the boundary defined by the blocksize. This

is shown in Figure 7.3. As a result, data accesses and updates from/to the MCDs become more

expensive. This is discussed further in Section 7.3.3.


Figure 7.3: Example of blocks requiring additional data transfers in IMCa
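The block "compensation" of Figure 7.3 can be made concrete with a short arithmetic sketch; the 2K block size matches the value used later in the evaluation, while the helper name and the example request are hypothetical.

/* Sketch of the block alignment implied by Figure 7.3: a request
 * (offset, length) is expanded to whole IMCa blocks, so the cache may
 * serve more data than the application actually asked for. */
#include <stdio.h>

#define IMCA_BLOCKSIZE 2048L   /* 2K, the block size used in the evaluation */

static void aligned_range(long offset, long length,
                          long *first_block, long *nblocks)
{
    long end = offset + length;                    /* one past the last byte */
    *first_block = offset / IMCA_BLOCKSIZE;
    long last_block = (end - 1) / IMCA_BLOCKSIZE;
    *nblocks = last_block - *first_block + 1;
}

int main(void)
{
    long first, n;
    aligned_range(3000, 100, &first, &n);          /* falls entirely in block 1 */
    printf("blocks %ld..%ld (%ld bytes fetched for a 100-byte read)\n",
           first, first + n - 1, n * IMCA_BLOCKSIZE);
    return 0;
}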

Figure 7.4: Logical flow for Read and Write operations in IMCa: (a) Read (SMCache), (b) Read (CMCache), (c) Write (SMCache)

Design for Data Transfer Operations:

We now look at the protocols for Read and Write data transfer operations in IMCa. We

also consider the supporting functionality for data transfer operations such as Open and

Close.

Open: On the open, in CMCache, the absolute path of the file and the file descriptor are stored in a database so that this information may be accessed at a later point. At the server, the MCDs are purged of any data relating to the file when the Open operation is received.

Read: The algorithm for Read requests in CMCache is shown in Figure 7.4(b). On a

Read operation, CMCache appends the absolute path of the file (which was stored during the Open) with the offset in the file to generate a key. Since IMCa is based on a static

block size, the size of the Read data requested from the MCD may be equal to or greater than the current Read request size. CMCache generates keys that consist of the absolute pathname of the file (stored during the open) and the offsets from the Read request,

taking into account the IMCa blocksize. CMCache uses the keys to access the MCDs and

fetch the blocks. If there is a miss for any one of the keys, CMCache will forward the

Read request to the GlusterFS server. The cost of a miss is more expensive in the case of

IMCa, since it includes one or more round trips to the MCD before a miss can be detected. The SMCache Read algorithm is shown in Figure 7.4(a). Because of the

IMCa block size, the Read operation may potentially require the server to read additional data from the underlying file system. Once the Read operation returns from the filesystem, the server will append the full file path name with the block offset and update the MCDs

with the data. The server may need to send several blocks to the MCDs. Using an additional thread to update the MCDs at the server may potentially reduce the cost of

Reads at the server.
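A sketch of the CMCache read path of Figure 7.4(b) under the same assumptions; mcd_get() and forward_read_to_server() are hypothetical stand-ins for the MCD client and the GlusterFS RPC path, stubbed below so the example is self-contained, and the key format "<path>:<block offset>" is an illustrative choice rather than the exact on-the-wire encoding.

/* Sketch of the CMCache read path: one key per IMCa block, fall back to
 * the server on any miss.  The helpers are stubs, not GlusterFS code. */
#include <stdio.h>
#include <string.h>

#define IMCA_BLOCKSIZE 2048

static int mcd_get(const char *key, void *buf, size_t len)
{
    (void)key; (void)buf; (void)len;
    return -1;                         /* stub: pretend every lookup misses */
}

static int forward_read_to_server(const char *path, long offset,
                                  long length, void *buf)
{
    (void)path; (void)offset;
    memset(buf, 'x', (size_t)length);  /* stub: pretend the server answered */
    return 0;
}

static int imca_read(const char *abs_path, long offset, long length, char *out)
{
    long first = offset / IMCA_BLOCKSIZE;
    long last  = (offset + length - 1) / IMCA_BLOCKSIZE;
    char block[IMCA_BLOCKSIZE], key[4352];

    for (long b = first; b <= last; b++) {
        long blk_start = b * (long)IMCA_BLOCKSIZE;
        snprintf(key, sizeof(key), "%s:%ld", abs_path, blk_start);
        if (mcd_get(key, block, sizeof(block)) != 0)
            /* a single miss pushes the whole request to the GlusterFS server */
            return forward_read_to_server(abs_path, offset, length, out);

        /* copy only the part of this block the caller actually asked for */
        long from = (offset > blk_start) ? offset - blk_start : 0;
        long to   = (offset + length < blk_start + IMCA_BLOCKSIZE)
                        ? offset + length - blk_start : IMCA_BLOCKSIZE;
        memcpy(out + (blk_start + from - offset), block + from,
               (size_t)(to - from));
    }
    return 0;
}

int main(void)
{
    char out[100];
    /* a 100-byte read at offset 3000 touches exactly one 2K block */
    return imca_read("/home/user/data/file.dat", 3000, 100, out);
}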

Write: Write operations are persistent. This means that the Write operations must

propagate to the server where they need to be written to the filesystem. CMCache does

not intercept Write operations. At the server, the Write operation is issued to the file system

as shown in Figure 7.4(c). When the write operation completes, Read(s) are issued to the

underlying file system by SMCache that cover the Write area, accounting for the IMCa

blocksize. When the data is available, it is sent to the MCDs. Since there may

be multiple overlapping Writes to a particular record and because of the IMCa requirement

of a fixed block-size, neither CMCache nor SMCache can directly send the Write data to the

MCDs. Write latency may potentially be increased by the additional update of the MCDs at the server. Using an additional thread, as with Reads, can reduce the cost of this update.

Close: Closes propagate from the client to the server without any interception at the client.

When the close operation is intercepted by SMCache, it will attempt to discard the data for the file from the MCDs.

7.3.4 Potential Advantages/Disadvantages of IMCa

In this section, we discuss the potential advantages and disadvantages of IMCa.

Fewer Requests Hit the Server: The data server is generally a point of contention for different requests. In addition to communication contention, there may be considerable contention for the disk. IMCa may help reduce both these contentions at the server.

Latency for Requests Read From the Cache is Lower: With considerable percentage of Read sharing as well as Read/Write sharing patterns, a large number of requests could potentially be fielded directly from the MCDs. This might help reduce the latency for these patterns, in addition to reducing the load on the server.

MCDs are self-managing: Each cache in the MCD implements LRU. As the caches fill up, unused data will automatically be purged from the MCDs. There is no need to manage the cache by the client or the server. This reduces the overhead of IMCa. Additional caching nodes can be easily added. IMCa can transparently account for failures in MCDs.

Failures in MCDs do not impact correctness: Writes are always persistent in IMCa and are written successfully to the server filesystem before updating the MCDs. Irrespective of node failures in the MCDs, correctness is not impacted.

Additional Nodes Needed Especially For Caching: MCDs need an array of nodes on which to run the daemons. These nodes might also be used for other purposes, such as storing file system data or running web services.

Cold Misses Are Expensive: Reads on the client require one or more accesses to the

MCDs depending on the blocksize and the requested Read size. If any of these accesses results in a miss, the Read needs to be propagated to the server. As a result, misses are more expensive than in a regular file system.

Additional Blocks/Data Transfers Needed: In IMCa, data is stored in fixed-size blocks as a trade-off between bandwidth, latency, utilization, and fragmentation. If the block size is set too large, small Read requests are penalized, requiring additional data to be transferred from the MCDs. If the block size is set too small, large requests might require multiple trips to the MCDs to fetch the data.

Overhead and Delayed Updates: IMCa hooks into both Read and Write functions at the server through SMCache. In non-threaded mode, Read/Write data from the server needs to be fed to the MCDs before it is returned to the client. This may result in additional overhead at the server and in updates to the MCDs being delayed.

7.4 Performance Evaluation

In this section, we attempt to characterize the performance of IMCa in terms of latency and throughput of different operations. First, we look at the experimental setup.

7.4.1 Experimental Setup

We use a 64-node cluster connected with InfiniBand DDR HCAs. Each node is an 8-core Intel Clover-based system with 8GB of memory. The GlusterFS server runs on a node with a configuration identical to that specified above; it is also equipped with a RAID array of eight HighPoint disks on which all files used in the experiment reside. IP over InfiniBand

(IPoIB) with Reliable Connection (RC) is used as the communication transport between the

GlusterFS server and client, as well as between the components of IMCa, namely SMCache,

CMCache and the MCD array. The MCDs run on independent nodes and are allowed to use up to 6GB of main memory. Unless explicitly mentioned otherwise, SMCache and CMCache use a CRC32 hashing function for storing and locating data blocks on the MCDs. For

comparison, we also use the default configuration of Lustre 1.6.4.3 with a TCP transport

over IPoIB. The Lustre metadata server runs on a node separate from the data servers (DS).

7.4.2 Performance of Stat With the Cache

We look at the performance of the stat operation with IMCa as discussed in Sec-

tion 7.3.2.

Stat Benchmark: The benchmark used to measure the performance of stat consists of

two stages. In the first stage (untimed), a set of 262,144 files is created. In the second stage

(timed) of the benchmark, each of the nodes tries to perform a stat operation on each of the

262,144 files. The total time required to complete all 262,144 stats is collected from each

of the nodes and the maximum time among all of them is reported.
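A minimal sketch of the timed phase is shown below, assuming the 262,144 files were pre-created under a hypothetical mount point /mnt/glusterfs/statbench with a hypothetical file-name pattern; the real benchmark additionally gathers the per-node times and reports the maximum across all nodes.

/* Sketch of the timed stat phase run on each client node. */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>

#define NFILES 262144

int main(void)
{
    struct stat sb;
    struct timeval t0, t1;
    char path[256];

    gettimeofday(&t0, NULL);
    for (int i = 0; i < NFILES; i++) {
        snprintf(path, sizeof(path), "/mnt/glusterfs/statbench/file.%d", i);
        stat(path, &sb);                    /* the operation being measured */
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d stats in %.2f s\n", NFILES, secs);
    return 0;
}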

Performance With One MCD: The results from running this benchmark are shown in

Figure 7.5. Along the x-axis the number of nodes is varied. The y-axis shows the time in

seconds. Legend NoCache corresponds to GlusterFS in the default configuration (no client

side cache). Legend MCD (x) corresponds to GlusterFS with x MemCached daemons

running. From Figure 7.5, we can see that without the cache, the time required to complete

the stat operations increases at a much faster rate than with the cache nodes. With a single

MCD, the time required to complete the stat operations increases at a much slower rate. At

64 clients, with 1 MCD, there is an 82% reduction in the time required to complete the stat

operations as compared to without the cache. GlusterFS with a single MCD outperforms

Lustre with 4 DSs by 56% at 64 clients.

Performance With Multiple MCDs: With an increasing number of MCDs, there is a

reduction in the time needed to complete the stat operations. However, with an increasing

number of MCDs, there is a diminishing improvement in performance. For example, at

64 nodes, there is only a 23% reduction in time to complete the stat operation from 4 to

6 MCDs. The statistics from the MCDs show that the miss rate with increasing MCDs

beyond 2 is zero. This seems to suggest that 2 MCDs provide an adequate amount of cache

memory to completely contain the stat data of all the files from the workload. There is

little stress on the MCD memory subsystem beyond two MCDs. The overhead of the TCP/IP communication protocol is alleviated to some extent by going beyond two MCDs.

Using four or six MCDs provides some benefit, as may be seen from Figure 7.5. At 64

nodes, using GlusterFS with 6 MCDs, the time required to complete the stat operation is

86% lower than Lustre with 4 DSs.

Figure 7.5: Time to complete the stat operations (seconds) with multiple clients: No Cache, MCD (1), MCD (2), MCD (4), MCD (6), and Lustre-4DS

7.4.3 Latency: Single Client

In this experiment, we measure the latency of performing read and write operations.

Latency Benchmark: In the first part of the experiment, data is written to the file in a sequential manner. For a given record size r, 1024 records of record size r are written sequentially to the file. The Write time for that record size is measured as the average time of the 1024 operations. We measure the Write time of record sizes from 1 byte to a maximum record size in multiples of 2. In the second stage of the benchmark, we go back to the beginning of the file and perform the same operations for Read operations, varying the record size from 1 byte to the maximum record size, with the time for the Read being averaged over 1024 records for each given record size.
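A sketch of the Write phase of this benchmark for one record size is given below, assuming a hypothetical file location; the Read phase repeats the same loop with read() after seeking back to the beginning of the file, and the sweep over record sizes follows the powers-of-two progression described above.

/* Sketch of the per-record-size Write loop of the latency benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define NRECORDS 1024

static double avg_write_usec(int fd, size_t r)
{
    char *buf = calloc(1, r);
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < NRECORDS; i++)
        write(fd, buf, r);                 /* sequential records of size r */
    gettimeofday(&t1, NULL);

    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_usec - t0.tv_usec)) / NRECORDS;
}

int main(void)
{
    int fd = open("/mnt/glusterfs/latency.dat", O_CREAT | O_WRONLY, 0644);
    for (size_t r = 1; r <= 16384; r *= 2)  /* 1 byte up to the max record size */
        printf("record %6zu B: %8.1f us/write\n", r, avg_write_usec(fd, r));
    close(fd);
    return 0;
}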

Read Latency with different IMCa block sizes: The results from the latency benchmark for Read are shown in Figure 7.6(a) and Figure 7.6(b). For IMCa, we used block sizes of 256 bytes, 2K, and 8K. For the Read latency shown in Figure 7.6(a), for a record

size of 1 byte, there is a reduction of up to 45% in latency using one MCD over using No-

Cache, with a block size of 2K, and a 31% reduction in latency with an 8K IMCa block size. With an IMCa block size of 256, the reduction in Read latency increases to 59%. As discussed in Section 7.3, even for a Read operation of 1 byte, the client needs to fetch a complete block of data from the MCDs. So, we must fetch data in multiples of the mini- mum record size of IMCa. Smaller block sizes help reduce the latency of smaller Reads, but degrade the performance of larger Reads, since CMCache must make multiple trips to the MCDs. This may be seen in Figure 7.6(a), where beyond a record size of 8K, NoCache has lower latency than IMCa with a block size of 256 and has the lowest latency overall as the record size is further increased (Figure 7.6(b)). Since no Read at the client results in a miss from the MCDs, no read requests propagate to the server. We use a block size of 2K for the remaining experiments.

Comparison with Lustre: We use one and four data servers with Lustre, denoted by

1DS and 4DS respectively. Also, we use two different configurations for Lustre, warm cache (Warm) and cold cache (Cold). For the warm cache case, the Write phase of the benchmark is followed by the Read phase of the benchmark without any intermediate step.

For the cold cache case, after the Write phase of the benchmark, the Lustre client file system is unmounted and then remounted. This evicts any data from the client cache. Clearly, the warm cache case denoted by Lustre-4DS (Warm) provides the lowest Read latency in all cases (Figure 7.6(a)), since Reads are primarily satisfied from the local client cache (results for larger record sizes with a warm cache are not shown). The cold cache forces the client

to fetch the file from the data servers. So, Lustre-1DS (Cold) and Lustre-4DS (Cold) are

closer to IMCa in terms of performance. We discuss these results further in [75].

Write Latency: The Write latency is shown in Figure 7.6(c) with an IMCa block

size of 2K. Write introduces an additional Read operation in the critical path at the server

(Section 7.3). Correspondingly, Write latency with IMCa is worse than the NoCache case.

By offloading the additional Read to a separate thread, the additional latency of the Read

may be removed from the critical path and the Write latency can be reduced to the same

value as without the cache. IMCa provides little benefit for Write operations because of

the need for Writes to be persistent (Section 7.3.3). Correspondingly, we do not present the

results for Write for the remaining experiments.

7.4.4 Latency: Multiple Clients

The multi-client latency test starts with a barrier among all the processes. Once the

processes are released from this barrier, each process performs the latency test (with sepa-

rate files), described in Section 7.4.3. The Write and Read latency components, as well as each record size for Read and Write, are separated by a barrier. The latency for a particular

record size is the average of the times reported by each process for the given record size.

We present the numbers for the Read latency with 32 clients each running the latency

benchmark, while the MCDs are being varied. These latency numbers are shown in Fig-

ure 7.7(a) (Small Record sizes) and Figure 7.7(b) (Medium Record Sizes). From the figure,

we can see that there is a reduction of 82% in the latency when four MCDs are introduced

over the NoCache case for a 1 byte Read. Clearly, IMCa provides additional benefit in the

case of multiple clients as compared to the single client case. In addition, with 32 clients,

Figure 7.6: Read and Write Latency Numbers With One Client and 1 MCD: (a) Read (small record sizes), (b) Read (medium record sizes), (c) Write

and a single MCD, statistics taken from the MCDs show that there is an increasing number of MCD capacity misses. These capacity misses are reduced by increasing the number of

MCDs. The trend of increasing capacity misses may be seen more clearly while varying the number of clients and using a single MCD. These Read latency numbers are shown in Figure 7.8(a) and

Figure 7.7: Read latency with 32 clients and varying number of MCDs (4 DSs are used for Lustre): (a) Read (small record sizes), (b) Read (medium record sizes)

Figure 7.8(c). The Read latency at 32 clients is higher than with one client and increases

with increasing record size.

We also compare with Lustre at 32 clients (Figure 7.7(a) and Figure 7.7(b)). With a

cold cache, for small Reads less than 32 bytes, Lustre (Cold) has lower latency than IMCa

(4MCD). After 32 bytes, IMCa (4 MCD) delivers lower latency than Lustre (Cold). IMCa

with 1 and 2 MCDs also provides lower latency than Lustre beyond 8K and 2K, respectively.

Finally, Lustre (Warm) again produces the lowest latency overall. However, the latency for

IMCa (4 MCDs) increases at a slower rate with increasing record size and at 64K, IMCa (4

MCDs) has lower latency than Lustre (Warm). Similar trends can also be seen with varying number of clients (Figure 7.8(b) and Figure 7.8(d)).

Figure 7.8: Read latency with 1 MCD and varying number of clients: (a) Read (Small)-IMCa, (b) Read (Small)-Lustre (4DS), (c) Read (Medium)-IMCa, (d) Read (Medium)-Lustre (4DS)

7.4.5 IOzone Throughput

In this section, we discuss the impact of IMCa on I/O bandwidth. One of the benefits of a parallel file system with multiple data servers over a single-server architecture such as NFS is striping and the resulting improvement in aggregate bandwidth from multiple

data streams from multiple data servers. This is especially true with larger files and larger record sizes. Using multiple caches in the MCD array, it might be possible to gain the advantage of multiple parallel data servers while using a single I/O server. We use IOzone to measure the Read throughput of a 1GB file with a 2K block size. We replace the standard CRC32 hash function used by libmemcache [13] with a static modulo function (round-robin) for distributing the data blocks across the cache servers. We measured the

IOzone Read throughput with 1, 2 and 4 MCDs. These results are shown in Figure 7.9.

From these results, it can be seen that we can achieve an IOzone Read throughput of up to

868 MB/s with 8 IOzone threads and 4 MCDs. This is almost twice the corresponding number without the cache (417 MB/s) and well above that of Lustre-1DS (Cold) (325 MB/s). Clearly, adding cache servers helps provide better IOzone Read throughput.
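The static modulo (round-robin) placement can be sketched as follows; the helper name is hypothetical, and the point is simply that consecutive 2K blocks map to consecutive MCDs, so a large sequential read streams from all cache servers in parallel.

/* Sketch of round-robin block placement used for the throughput test. */
#include <stdio.h>

#define IMCA_BLOCKSIZE 2048L

static int mcd_for_block(long offset, int n_mcd)
{
    long block = offset / IMCA_BLOCKSIZE;
    return (int)(block % n_mcd);          /* static modulo instead of CRC32 */
}

int main(void)
{
    for (long off = 0; off < 8 * IMCA_BLOCKSIZE; off += IMCA_BLOCKSIZE)
        printf("offset %6ld -> MCD %d\n", off, mcd_for_block(off, 4));
    return 0;
}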

Figure 7.9: Read Throughput (Varying MCDs)

Figure 7.10: Read Latency to a shared file

7.4.6 Read/Write Sharing Experiments

To measure the impact of IMCa in an environment where file data is shared, we mod-

ified the latency benchmark described in Section 7.4.3 so that all the nodes use the same

file. In the write phase of the benchmark, only the root node writes the file data. In the read

phase of the benchmark, all the processes attempt to read from the file. Again, as with the

multi-client experiments (Section 7.4.4), the Read and Write portions, as well as the portions

for each record size are separated with barriers.

We measure the read latency, with and without IMCa and compare with Lustre-1DS

(Cold). With IMCa, we use one MCD. The read latency is shown in Figure 7.10. At 32

nodes, there is a 45% reduction in latency with IMCa over the NoCache case. Also, as

may be seen from Figure 7.10, IMCa provides a benefit that increases with an increase in

the number of nodes. Since we are using a single MCD, with all the clients trying to read

the data from the MCD in the same order, we see that the time even with IMCa increases

linearly. With a greater number of MCDs, we expect better performance. Because of space

limitations, we do not present the numbers for multiple MCDs here (they are available

in [75]). IMCa with 1 MCD provides slightly higher latency compared to Lustre-1DS

(Cold) up to 16 nodes. However, at 32 nodes, IMCa with 1 MCD has slightly lower latency

than Lustre-1DS (Cold).

7.5 Summary

In this Chapter, we have proposed, designed and evaluated an intermediate architecture

of caching nodes (IMCa) for the GlusterFS file system. The cache consists of a bank

of MemCached server nodes. We have looked at the impact of the intermediate cache

architecture on the performance of a variety of different file system operations such as stat,

Read and Write latency and throughput. We have also measured the impact of the caching hierarchy with single and multiple clients and in scenarios where there is data sharing. Our results show that the intermediate cache architecture can improve stat performance over only the server node cache by up to 82% and 86% better than Lustre. In addition, we also see an improvement in the performance of data transfer operations in most cases and for most scenarios. Finally, the caching hierarchy helps us to achieve better scalability of file system operations.

CHAPTER 8

EVALUATION OF CHECK-POINTING WITH HIGH-PERFORMANCE I/O

The petaflop era has dawned with the unveiling of the InfiniBand based RoadRunner cluster from IBM [26]. The RoadRunner cluster is a hybrid design consisting of 12,960

IBM PowerXCell 8i processors and 6,480 AMD Opteron processors. As we move towards an exaflop, the scale of clusters in terms of number of nodes will continue to increase. To take advantage of the scale of these clusters, applications will need to scale up in terms of the number of processes running on these nodes. Even as applications scale up in sheer size, a number of factors come together to increase the running time of these applications.

First, communication and synchronization overhead increases with larger scale. Second, applications data-sets continue to increase at a prodigious rate soaking up the increase in computing power. ENZO [48], AWM-Olsen [101], PSDNS [1] and MPCUGLES [93] are examples of real-world petascale applications that consume and generate vast quantities of data. In addition, some of these applications may potentially run for hours or days at a time. The results from these applications may potentially be delayed, or the application may not run to completion for a number of reasons. First, each node has a Mean Time

Between Failures (MTBF). As the scale of the cluster increases, it becomes more likely

that a particular node will become inoperable because of a failure. Second, with increasing

scale, different components in the application, software stack and hardware will be exer-

cised to their limit. As a result, bugs which would normally not show up at smaller scales may show up at larger scales, causing the application to malfunction and abort. To avoid losing a large number of computational cycles because of faults or malfunctions, it becomes necessary to save intermediate application results or application state at regular intervals.

This process is known as checkpointing. Saving checkpoints at regular intervals allows

the application to be restarted in the case of a failure from the nearest checkpoint instead

of from the beginning of the application. The overhead of checkpointing depends on the

checkpoint approach used, in concert with the frequency and granularity of checkpointing.

In addition, the characteristics of the underlying storage and file system play a crucial role

in the performance of checkpointing. There are several different approaches to taking a checkpoint and these are discussed next in Section 8.1. Following that, in Section 8.2, we examine the impact of the storage subsystem on checkpoint-level performance. Finally, we

conclude this Chapter and summarize our finding in Section 8.3.

8.1 Overview of Checkpoint Approaches and Issues

Several different approaches exist for checkpointing. They differ in the approach taken for checkpoint initiation, blocking versus non-blocking operation, as well as checkpoint size and content. We will discuss each of these issues next. Following that, we discuss an implementation of checkpoint/restart in MVAPICH2, a popular message passing library for parallel programs.

8.1.1 Checkpoint Initiation

There are two potential approaches to initiating a checkpoint: application initiated and system initiated. In application-initiated checkpointing, the application decides when to start a checkpoint and requests the system to initiate it. This is usually referred to as a synchronous checkpoint. Asynchronous checkpointing is also possible, in which the application requests that checkpoints be initiated by the system at regular time intervals. The other approach is system-level initiation, in which the system directly initiates the application checkpoint without interaction with the application. This might happen, for example, if the system administrator wanted to migrate all the jobs from one particular machine to another, or if a particular machine needs to be upgraded or rebooted. It might also happen if the job initiator requested that a checkpoint be taken.

8.1.2 Blocking versus Non-Blocking Checkpointing

A checkpoint may require the application be paused while the checkpoint is taken. This approach is usually easy to implement, though it might cause considerable overhead espe-

cially for long running and large scale parallel applications. Another approach is to allow

for a non-blocking checkpoint to take place while the application is running. This approach

is more complicated to implement, since the parallel application state might change while

the checkpoint is being obtained. A potential solution to this problem is to make the data

segments of the application read-only and use the technique of Copy On Write (COW).

Besides reducing the overhead on the application, non-blocking checkpointing allows the

checkpoint I/O to be scheduled when the I/O subsystem is lightly loaded.
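As a generic illustration of the Copy On Write idea (and explicitly not how BLCR or MVAPICH2 implement checkpointing), a process can fork() and let the child write out a frozen COW snapshot of the address space while the parent continues computing; the buffer size and output file name below are arbitrary.

/* Generic COW-style checkpoint sketch using fork(); illustration only. */
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    size_t n = 64 * 1024 * 1024;
    char *state = malloc(n);              /* stand-in for application state */

    pid_t pid = fork();
    if (pid == 0) {                       /* child: sees a frozen COW copy */
        int fd = open("ckpt.img", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        write(fd, state, n);              /* dump the snapshot (sketch only) */
        fsync(fd);                        /* durable only after the sync */
        close(fd);
        _exit(0);
    }

    /* parent keeps computing; pages it touches are copied lazily by the OS */
    for (size_t i = 0; i < n; i += 4096)
        state[i]++;

    waitpid(pid, NULL, 0);
    free(state);
    return 0;
}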

8.1.3 Application versus System Level Checkpointing

There are two potential techniques for taking a checkpoint. The first approach requires

the application to take its own checkpoint. Advantages include reduced checkpoint size,

since not all data within the application need to be checkpointed. Disadvantages include

the additional effort required to develop and debug the checkpointing components of the

application as well as the need to keep track of the data structures that changed between

checkpoint intervals. Transparent application-level checkpointing may be achieved through

compiler techniques [90]. Additionally, a hybrid approach is possible where the application

participates in the creation of a checkpoint but is assisted by a user-level library [71]. The

other approach is system-level checkpointing, where the checkpoint is performed trans-

parently without application modification through the system, usually the kernel such as

with Berkeley Lab Checkpoint/Restart (BLCR) [68]. BLCR has been combined with several MPI libraries such as LAM/MPI [98] and MVAPICH2 [73]. We discuss the MVAPICH2 checkpoint/restart facility next.

MVAPICH2 Checkpoint/Restart Facility

MVAPICH2 is a message passing library for parallel applications with native support for InfiniBand [15]. It includes support for checkpoint/restart (C/R) for the InfiniBand Gen2 device [73, 72]. Support for C/R is achieved through interaction between the MVAPICH2 library and the kernel-level BLCR module [68]. Taking a checkpoint involves bringing the processes to a consistent state through coordination, using the following steps. First, all the processes coordinate with one another, and the communication channels are flushed and then locked. Following that, all InfiniBand connections are torn down. Next, the

MVAPICH2 library on each node requests the BLCR kernel module to take a blocking checkpoint on the process. After that, this checkpoint data is written to an independent

file; one file per process. Finally, all processes again coordinate to rebuild the InfiniBand

reliable connections and take care of any inconsistencies in the network setup that might

have occurred and unlock the communication channels. Finally, the application continues

execution. Checkpointing in the MVAPICH2 library may involve considerable overhead,

because of the coordination requirement, the application execution suspension and the need

to store checkpoint data on the storage subsystem. MVAPICH2 supports both attended

(user-initiated) and unattended, interval based checkpointing. Next, we will evaluate the

impact of the storage subsystem on the performance of checkpointing in MVAPICH2. We

do not discuss or evaluate restart performance. Please refer to the work by Panda et al. [73,

72] for a discussion on restart performance.

8.2 Storage Systems and Checkpointing in MVAPICH2: Issues, Performance Evaluation and Impact

In this section, we attempt to understand the impact of the storage subsystem on check- point performance.

8.2.1 Storage Subsystem Issues

Taking a checkpoint is an expensive operation in MVAPICH2. This is mainly because of the need to exchange coordination messages between all the processes. As noted earlier, the final step in the checkpoint stage involves dumping the checkpoint data to a file. Since the data must be synced to the file before the checkpoint is completed, the time required to sync the checkpoint data to the file is an important component of the checkpoint time.

The characteristics of the underlying storage architecture and file system play an important role in the time required to sync the checkpoint to the file. In Section 8.2.2, we evaluate the impact of different file systems, namely ext3, NFS, and Lustre, on the performance of checkpointing. In Section 8.2.3, we evaluate the impact of the checkpointing interval on checkpoint performance. Following that, in Section 8.2.4, we look at the impact of increas-

ing system size on the checkpoint performance. For all experiments, we used a 64-node

cluster with InfiniBand DDR adapters and with RedHat Enterprise Linux 5 on all nodes.

The cluster is equipped with four storage nodes with RAID HighPoint drives.

8.2.2 Impact of Different File Systems

In the first experiment, we measured how MVAPICH2 checkpoint performance is impacted by different file systems. We expect that a local file system such as ext3 will offer

the best performance for dumping a checkpoint. However, the data stored on a local node

is subject to the vagaries of hardware and other unforeseen consequences on that partic-

ular node. For example, if the disk fails on that particular node, the checkpoint data will

be lost and it will not be possible to restart the application from a reasonable checkpoint.

Correspondingly, there will be no perceptible benefit to using the checkpointing approach.

Another possibility is to use an NFS-mounted directory. Most modern UNIX-based clusters

have NFS mounted directories. Since NFS is based on the multiple client, single server ar-

chitecture, a synchronous checkpoint will force the data to the NFS server, where it will be

immune to client node failures. However, with large scale parallel applications, the single

server in NFS might become a bottleneck and cause an unnecessary delay in performing

the checkpoint and subsequently in the execution of the application. Finally, it is possible

to use a parallel file system such as Lustre with data striped across multiple data servers

to reduce the time for the parallel write. We have evaluated these three options and they

are shown in Figure 8.1. We evaluated two applications BT and SP from the NAS Parallel

Benchmarks [16]. We used class C for both applications and ran them at a size of 16-nodes.

We used ext3 as the local file-system (local), NFS over TCP (NFS) and Lustre 1.6.4.3 with

four data servers using native InfiniBand as the underlying transport for the experiments.

The number at the top of each bar corresponds to the application execution time in seconds (as reported by the benchmark), the number in the middle corresponds to the check-

point data size in MegaBytes, per process and per checkpoint and finally, the number at the

bottom in each bar corresponds to the number of checkpoints taken. We used 30 and 60

seconds as the checkpoint interval. From Figure 8.1, we notice the following trends. First, the running time of the application is worse with NFS than the local file system and Lustre in all cases. The difference is largest with a 30 second checkpoint interval. With BT (30 second interval) the execution time with NFS is 2.47 times that of ext3. At a 60 second interval, the execution time with NFS is 1.34 times that of ext3. Similarly, with SP, at a 30 second interval, execution time with NFS is 2.04 times that of ext3 and at 60 second inter-

vals is 1.29 times that of ext3. Clearly, the single server of NFS becomes a bottleneck as

each process tries to write the checkpoint to the same server. The single server bottleneck is

also the reason behind the dramatic reduction in application execution when the checkpoint interval time is increased from 30 seconds to 60 seconds (1.93 with BT and 1.73 times with

SP).

The second important trend we observe is that parallel application execution time with

checkpointing enabled is lower when using Lustre with four data servers as compared to the local file system. Parallel application execution time with BT, when using Lustre is approximately 11% lower than ext3 irrespective of checkpoint interval time. For SP, the

application execution time with Lustre is approximately 17% lower than ext3 at 30 second

intervals and approximately 8% lower at 60 second intervals. We attribute this better per-

formance to two factors. First, the RAID drives on the storage nodes used by Lustre can

achieve IOzone bandwidth slightly over 400 MB/s as compared to the disks on the client

nodes which can attain approximately 70-80 MB/s. In addition, the write data is striped

across four data servers, giving us the advantage of parallel I/O. As a result, checkpoint

execution time is lower with Lustre. We use Lustre, with four data servers for all remaining experiments.

Figure 8.1: Impact of different file systems on checkpoint performance at 16 nodes. Refer to Section 8.2.2 for an explanation of the numbers on the graph.

8.2.3 Impact of Checkpointing Interval

We also evaluated the effect of changing the checkpoint interval. We keep the file system the same in all cases, Lustre with four data servers, which delivers the best performance, as discussed in Section 8.2.2. We used BT (class C) and SP (class D) for the evaluation. The performance at 36 nodes (one process/node) is shown in Figure 8.2 and

at 16 nodes (also one process/node) is shown in Figure 8.3. The checkpointing intervals chosen are 0 seconds (checkpointing disabled), 30 seconds, and 60 seconds. We can observe the following trends. First, as expected, when checkpointing is disabled we get the

lowest application execution time. When the checkpoint interval time is set to 30-seconds,

the application execution time is maximum. With BT, at 36 nodes the increase in time

is approximately 9% and at 16-nodes, the increase in time is approximately 5%. For SP,

the corresponding increases in time are approximately 7% and 5%, respectively. When the

checkpoint time interval is increased to 60-seconds, the difference in application time with

no-checkpoint and with checkpointing is lower as compared to that with a 30-second in-

terval. The increase in time for BT at 36 nodes is 3% and at 16 nodes is approximately

2%. Similarly, for SP the corresponding increase in time is approximately 2% and 3%.

Clearly, using a checkpoint interval of 60-seconds with a Lustre file system using four data

servers introduces very little overhead, even though checkpointing is a blocking operation.

We need to consider that the storage space at the data servers is a finite resource. Too many checkpoints might fill up the data server storage space. To deal with this situation,

MVAPICH2-C/R allows us to maintain the last N checkpoints, deleting the ones prior to that.
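A minimal sketch of such a keep-last-N retention policy is shown below; it is not the MVAPICH2-C/R implementation, and the checkpoint directory layout and file-name pattern (ckpt-*.img) are hypothetical.

import glob
import os

def prune_checkpoints(ckpt_dir, keep_last=3, pattern="ckpt-*.img"):
    """Delete all but the newest keep_last (>= 1) checkpoint files in ckpt_dir."""
    files = sorted(glob.glob(os.path.join(ckpt_dir, pattern)),
                   key=os.path.getmtime)          # oldest first
    for stale in files[:-keep_last]:
        os.remove(stale)

# Example: after every successful checkpoint, the job script could call
#   prune_checkpoints("/mnt/lustre/ckpts/job42", keep_last=3)
# so that at most three checkpoints ever occupy space on the data servers.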

8.2.4 Impact of System Size

Finally, we evaluate the overhead of checkpointing with MVAPICH2-C/R on BT performance as the system size is increased. For a scenario slightly different from the ones used previously, we used class D (the largest class available with NAS) and two differ-

ent system sizes, 16-nodes (1 process/node) and 64-nodes (1 process/node). The results

Figure 8.2: Checkpoint Time: 36 processes

Figure 8.3: Checkpoint Time: 16 processes

from these experiments are shown in Figure 8.4. We used two different checkpoint interval scenarios. The first, with no checkpoints, establishes an upper bound on the parallel application performance. The second involves taking checkpoints at two-minute in-

tervals (120 seconds). At 16-nodes, the increase in application time is approximately 22% and at 64-nodes is approximately 35%. The number of checkpoints created (the applica-

tion needs to run for a longer time) and the size of each checkpoint are much larger at

class D as compared to class C. This is responsible for the increase in the time required for checkpointing. Clearly, we need to take into account the application in-memory core size while deciding on the checkpoint interval and use longer checkpoint intervals for larger application sizes. Finally, with increasing system size, there is an improvement in parallel application performance, especially for a strong scaling application like BT. However, the reduction in application execution time exacerbates the checkpointing time component.

Figure 8.4: Checkpoint Time with BT class D, Lustre with 4 data servers

8.3 Summary

In this section, we evaluated the role the storage subsystem plays in the process of

checkpointing. We find primarily that the type of file system and disk storage directly

impacts the performance of checkpointing. Of the three file systems evaluated (NFS, ext3 and Lustre), Lustre was shown to perform the best because of striping, which spreads a

large write across different I/O data servers in parallel. The type of the underlying storage

I/O disk also plays an important role in the overhead introduced by checkpointing. We

also observe that by increasing the checkpoint interval time, the time lag introduced by

checkpointing may be reduced to a negligible level. Finally, with strong scaling applications, as the number of processes increases, application execution time is reduced.

Checkpointing time as a percentage of the parallel application execution time therefore increases, and this exacerbates the checkpointing overhead.

CHAPTER 9

OPEN SOURCE SOFTWARE RELEASE AND ITS IMPACT

In this dissertation, we have proposed, designed and evaluated communication substrates for a single server network file-system (NFS/RDMA) and a clustered file-system

(pNFS/RDMA) over InfiniBand. The NFS/RDMA code resulting from this research (NFSv3 and NFSv4) is open source and is available as a part of the OpenSolaris kernel [29]. The

OpenSolaris kernel and storage stack are used by a large community of academic, research and commercial users [17, 82]. In addition, the OpenSolaris design is interoperable with the Linux NFS/RDMA design, which is available as part of the Linux kernel. Finally, the pNFS/RDMA design will be made available as an open-source project.

CHAPTER 10

CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

In this dissertation, we have researched a high-performance NFS over InfiniBand (NFSv3

and NFSv4), NFS with RDMA in a WAN environment, pNFS over InfiniBand, an intermediate caching substrate for a distributed file system and, finally, an evaluation of high-performance checkpointing over a parallel file system. As discussed in the previous chapter, a majority of these designs are available in an open-source manner. Even though these

research components are primarily oriented towards storage subsystems, they may very easily be applied to other programming paradigms such as MPI.

10.1 Summary of contributions

Overall, our main contributions include:

10.1.1 A high-performance single-server network file system (NFSv3) over RDMA

Our design includes clients and servers for both Linux and OpenSolaris, as well as inter-

operability mechanisms between these two implementations. We have also compared the

tradeoffs and measured the performance of a design which exclusively uses RDMA Read

as well as a design which uses a combination of RDMA Reads and RDMA Writes. These

results show that the RDMA Read/RDMA Write based design performs better than the

RDMA Read only based design in terms of IOzone Read bandwidth and CPU utilization.

In addition, the RDMA Read/RDMA Write design exhibits better security characteristics as compared to the RDMA Read only based design. We also show that the peculiar nature of communication in NFS protocols forces memory registration overhead in Infini-

Band to the surface, limiting performance. Special registration modes as well as a buffer cache can considerably help enhance performance, though these mechanisms themselves have their own limitations, particularly in terms of security.

10.1.2 A high-performance single-server network file system (NFSv4) over RDMA

Our design for RPC/RDMA was also enhanced for NFSv4. NFSv4 enables higher performance through mechanisms like COMPOUND operations. We research the challenges of designing an RPC over RDMA protocol for COMPOUND operations. COMPOUND operations are potentially unbounded in length, while RDMA operations in InfiniBand are required to have a fixed length. Our evaluations show that, because of the overhead of

COMPOUND operations in InfiniBand, NFSv4 performance is slightly lower than NFSv3 performance.
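The following Python sketch illustrates, in a simplified form, the inline-versus-chunk decision that an RPC-over-RDMA layer must make for variable-length COMPOUND messages: small requests fit into the fixed, pre-posted receive buffer, while larger ones are advertised as chunks to be pulled with RDMA Read. The threshold and descriptor fields are assumptions for illustration and do not reflect the actual RPC/RDMA wire format.

INLINE_THRESHOLD = 1024   # assumed size of the fixed, pre-posted receive buffer

def encode_compound(xdr_payload: bytes):
    """Return ('inline', payload) if the COMPOUND fits the fixed RDMA buffer,
    else ('chunked', descriptor) advertising the payload for RDMA Read."""
    if len(xdr_payload) <= INLINE_THRESHOLD:
        return ("inline", xdr_payload)
    # Larger or unbounded COMPOUNDs: register the buffer and send only a chunk
    # descriptor; the peer pulls the data with RDMA Read (values are dummies).
    descriptor = {"length": len(xdr_payload), "rkey": 0xDEAD, "addr": 0x1000}
    return ("chunked", descriptor)

print(encode_compound(b"x" * 200)[0])    # -> inline
print(encode_compound(b"x" * 8192)[0])   # -> chunked

The extra registration and descriptor handling on the chunked path is one source of the COMPOUND overhead referred to above.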

10.1.3 NFS over RDMA in a WAN environment

We also investigated the NFS over RDMA protocol in a WAN environment. These evaluations show that the RPC over RDMA protocol outperforms TCP/IP up to distances

of 10 km, but beyond that, because of limitations in the InfiniBand RDMA protocol in

WAN environments, does not perform as well as the TCP/IP protocol. We attribute this drop in performance mainly to the limited buffering available on existing InfiniBand DDR

NICs. We expect the performance to be significantly better with the ConnectX NICs.

10.1.4 High-Performance parallel network file-system (pNFS) over InfiniBand

We also propose, design and evaluate a high-performance clustered NAS. The clustered

NAS uses parallel NFS (pNFS) with an RDMA enabled transport. We consider a number of

design considerations and trade-offs, in particular, buffer management at the client, DS and

MDS, and the scalability of the connections with an increasing number of clients and data servers. We also look at how an RDMA transport may be designed with sessions, which give us exactly-once semantics. Our evaluations show that by enabling pNFS with an RDMA transport, we can decrease the latency of small operations by up to 65% in some cases. Also, pNFS enabled with RDMA allows us to achieve a peak IOzone Write and Read aggregate throughput of 1,800+ MB/s (150% better than TCP/IP) and 5,000+ MB/s (188% improvement over

TCP/IP) respectively, using a sequential trace and 8 data servers. Also, evaluation with a

Zipf trace distribution allows us to achieve a maximum improvement of up to 27% when switching transports from TCP/IP to RDMA. Finally, application evaluation with BTIO shows that the RDMA enabled transport with pNFS performs better than a transport with

TCP/IP by up to 7%.
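To illustrate the exactly-once property that sessions provide, the sketch below models a slot table with per-slot sequence numbers: a retransmitted request that reuses the same (slot, sequence) pair receives the cached reply instead of re-executing a non-idempotent operation. The structure follows the NFSv4.1 sessions idea only in spirit; the class and field names are simplified assumptions.

class SessionSlot:
    def __init__(self):
        self.expected_seq = 1
        self.cached_reply = None

class Session:
    def __init__(self, n_slots=4):
        self.slots = [SessionSlot() for _ in range(n_slots)]

    def execute(self, slot_id, seq, operation):
        slot = self.slots[slot_id]
        if seq == slot.expected_seq - 1:        # duplicate (retransmission)
            return slot.cached_reply            # replay the reply, do not re-run
        if seq != slot.expected_seq:            # misordered request
            raise RuntimeError("bad sequence id")
        slot.cached_reply = operation()         # run the operation exactly once
        slot.expected_seq += 1
        return slot.cached_reply

s = Session()
print(s.execute(0, 1, lambda: "created file"))   # executes the operation
print(s.execute(0, 1, lambda: "created file"))   # replayed from the slot cache

Because each slot holds at most one in-flight request and one cached reply, the number of eager buffers the transport must reserve per connection is bounded by the slot count.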

10.1.5 Intermediate Caching Architecture

In this portion of the dissertation, we have proposed, designed and evaluated an intermediate architecture of caching nodes (IMCa) for the GlusterFS file system. The cache

consists of a bank of MemCached server nodes. We have looked at the impact of the

intermediate cache architecture on the performance of a variety of file system operations, such as stat, Read and Write latency and throughput. We have also measured the impact of the caching hierarchy with single and multiple clients and in scenarios where there is data sharing. Our results show that the intermediate cache architecture can improve stat performance by up to 82% over using only the server node cache and by up to 86% over Lustre.

In addition, we also see an improvement in the performance of data transfer operations in

most cases and for most scenarios. Finally, the caching hierarchy helps us to achieve better

scalability of file system operations.
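The sketch below shows the look-aside pattern that such an intermediate cache bank follows for stat requests: the client consults one cache server chosen by hashing the path, and falls back to the file system on a miss. A Python dict stands in for each Memcached server here; a real deployment would use an actual memcached client, and the placement policy and TTL are illustrative assumptions.

import os
import time

class CacheBank:
    def __init__(self, n_servers=4, ttl=2.0):
        self.servers = [dict() for _ in range(n_servers)]   # stand-ins for Memcached
        self.ttl = ttl

    def _server(self, key):
        return self.servers[hash(key) % len(self.servers)]  # simple placement policy

    def stat(self, path):
        srv = self._server(path)
        hit = srv.get(path)
        if hit and time.time() - hit[1] < self.ttl:
            return hit[0]                        # served from the cache bank
        attrs = os.stat(path)                    # miss: go to the file system
        srv[path] = (attrs, time.time())
        return attrs

bank = CacheBank()
print(bank.stat(".").st_mode)   # miss, populates the cache
print(bank.stat(".").st_mode)   # hit within the TTL window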

10.1.6 System-Level Checkpointing With MVAPICH2 and Lustre

As a part of the dissertation, we evaluated the role the storage subsystem plays in the

process of checkpointing. We find primarily that the type of file system and disk storage di-

rectly impacts the performance of checkpointing. Of the three file systems evaluated, NFS,

ext3 and Lustre, Lustre was shown to perform the best, because of striping which spreads a

large write across different I/O data servers in parallel. The type of the underlying storage

I/O disk also plays an important role in the overhead introduced by checkpointing. We also

observe that by increasing the checkpoint interval time, the time lag introduced by check-

pointing may be reduced to a negligible level. Finally, with strong scaling applications,

with an increase in the number of processes, application execution time is reduced. The

checkpoint component as a percentage of the overall application execution time is higher and, correspondingly, the checkpoint overhead is exacerbated.

10.2 Future Research Directions

We also propose the following future research directions:

10.2.1 Investigation of registration modes

As part of the future work, we would like to study how support for upcoming registration modes like memory windows will impact performance. Memory windows decouple registration into different stages, allowing us to use a pipeline for registration. This may allow the user to get the best of both worlds: secure registration of large areas and reduced overhead because of the pipelining.

10.2.2 Scalability to very large clusters

In addition, we would like to study how the shared receive queue (SRQ) support in InfiniBand will impact performance. SRQ allows us to cut down on the number of communication buffers per connection. This reduces the memory footprint of the communication stack with increasing cluster scale.
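A rough back-of-the-envelope comparison of the two buffering strategies is sketched below; the buffer counts and sizes are assumed values, intended only to show why a shared pool scales better than per-connection buffers as the number of clients grows.

def per_connection_mb(clients, bufs_per_conn=32, buf_kb=8):
    # Each connection pre-posts its own receive buffers.
    return clients * bufs_per_conn * buf_kb / 1024.0

def srq_mb(srq_depth=2048, buf_kb=8):
    # One shared pool of receive buffers serves every connection.
    return srq_depth * buf_kb / 1024.0

for clients in (64, 1024, 8192):
    print(f"{clients:5d} clients: per-conn {per_connection_mb(clients):8.1f} MB, "
          f"SRQ {srq_mb():6.1f} MB")

The per-connection footprint grows linearly with the number of clients, while the SRQ footprint stays fixed at the depth of the shared queue.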

10.2.3 Metadata Parallelization

Metadata parallelization allows a distributed file system to efficiently create and mutate millions of objects. This is especially relevant to environments where

millions of small files may be created per second. These scenarios occur in both scientific and enterprise environments. In addition, metadata parallelization may potentially utilize primitives on the NIC to enhance performance.

10.2.4 InfiniBand based Fault Tolerance for storage subsystems

As part of future directions, we would like to explore how to design a fault-tolerant pNFS enabled with RDMA. pNFS allows us to use multi-pathing to enable redundant data servers. Alternatively, an RDMA enabled pNFS design may take advantage of network-level features like Automatic Path Migration (APM) in InfiniBand. APM allows alternate paths in the network to be utilized when faults in the network cause the connection to break. We would also like to explore how the shared receive queue (SRQ) optimization may be used with an RDMA enabled RPC transport that uses sessions. Sessions require the reservation of slots or RDMA eager buffers per RPC connection. Dedicating a fixed number of buffers might have an impact on the scalability of larger systems deployed with pNFS.

Finally, we would like to evaluate the scalability of our RDMA enabled pNFS design.

10.2.5 Cache Structure and Lookup for Parallel and Distributed File Systems

As part of future work, we plan to investigate different hashing algorithms for distributing the data across the cache servers. In addition, we would also like to look at how network mechanisms like Remote Direct Memory Access (RDMA) in InfiniBand can help reduce the overhead of the cache bank and also provide stronger coherency. We also plan on researching how the set of cache servers may be integrated into a file system such as Lustre,

where it can potentially interact with the client and server caches. Finally, we would also like to study the relative scalability of a coherent client-side cache and a bank of intermediate cache nodes.
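As one candidate placement policy for such a cache bank, the sketch below implements consistent hashing on a ring, so that adding or removing a cache server relocates only a small fraction of the keys. This is offered as a possible direction rather than the scheme used in this dissertation; the hash function and virtual-node count are arbitrary illustrative choices.

import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=64):
        # Place vnodes points per server on the ring to smooth the distribution.
        self.ring = sorted(
            (self._h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def lookup(self, key):
        # Walk clockwise to the first ring point at or after the key's hash.
        i = bisect.bisect(self.keys, self._h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["cache0", "cache1", "cache2", "cache3"])
print(ring.lookup("/scratch/run1/output.dat"))
# Unlike simple modulo placement, adding or removing one cache server
# moves only roughly 1/N of the keys to a different server.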

10.2.6 Helper Core to Reduce Checkpoint Overhead

We also propose to use helper cores to reduce checkpoint overhead. Multicore clusters are becoming increasingly popular. Because of application limitations, such as power-of-two scaling and scheduling constraints, some of the cores on each node may go unused.

Also, as multicore systems evolve, some of the cores may be heterogeneous and designed for specific tasks. Some of these cores may be used for different tasks such as checkpoint compression or incremental checkpointing.
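The sketch below illustrates one simple way a spare core could be used: the checkpoint image is handed to a helper process that compresses it while the application continues. The process-based offload and zlib compression are illustrative choices, not the specific mechanism proposed here, and the file names in the usage note are hypothetical.

import multiprocessing as mp
import zlib

def compress_checkpoint(raw_path, out_path):
    # Read the raw checkpoint image and write a compressed copy.
    with open(raw_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as f:
        f.write(zlib.compress(data, 6))

def checkpoint_async(raw_path, out_path):
    """Hand the compression work to a helper process (scheduled by the OS onto
    a spare core) and return immediately to the application."""
    worker = mp.Process(target=compress_checkpoint, args=(raw_path, out_path))
    worker.start()
    return worker            # caller may join() before taking the next checkpoint

# Usage: w = checkpoint_async("ckpt-0001.img", "ckpt-0001.img.z"); ...; w.join()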

APPENDIX A

LOS ALAMOS NATIONAL LABORATORY COPYRIGHT NOTICE FOR FIGURE RELATING TO IBM ROADRUNNER

Unless otherwise indicated, this information has been authored by an employee or em-

ployees of the University of California, operator of the Los Alamos National Laboratory

under Contract No. W-7405-ENG-36 with the U.S. Department of Energy. The U.S. Gov- ernment has rights to use, reproduce, and distribute this information. The public may copy

and use this information without charge, provided that this Notice and any statement of authorship are reproduced on all copies. Neither the Government nor the University makes any warranty, express or implied, or assumes any liability or responsibility for the use of this information.

BIBLIOGRAPHY

[1] Direct Numerical Simulation (DNS). http://www.cfd-online.com/Wiki/Direct_numerical_simulation_(DNS).

[2] Fibre Channel over Ethernet (FCoE): Technology and Futures. http://www.fibrechannel.org/.

[3] Fibre Channel Wiki Page. http://en.wikipedia.org/wiki/Fibre Channel.

[4] FileBench. http://www.solarisinternals.com/si/tools/filebench/index.php.

[5] GlusterFS: Clustered File Storage that can scale to petabytes. http://www.glusterfs.org/.

[6] HP PolyServe. http://h18006.www1.hp.com/products/storage/software/polyserve/ support/hp polyserve.html.

[7] IBRIX FileSystem. http://www.ibrix.com/.

[8] Infiniband Trade Association. www.infinibandta.org.

[9] InfiniPath InfiniBand Adapter. http://www.pathscale.com/infinipath.php.

[10] Introducing ext3. http://www-128.ibm.com/developerworks/linux/library/l-fs7.html.

[11] IOzone Filesystem Benchmark. In http://www.iozone.org.

[12] Isilon IQ X-Series Brochure. http://www.isilon.com/.

[13] libmemcache. http://people.freebsd.org/~seanc/libmemcache.

[14] Mellanox Technologies. http://www.mellanox.com.

[15] MPI over InfiniBand Project. In http://nowlab.cse.ohio-state.edu/projects/mpi-iba/.

[16] NAS Parallel Benchmarks. http://www.nas.nasa.gov.

[17] NFS/RDMA Transport Update and Performance Analysis. http://opensolaris.org/os/project/nfsrdma/.

[18] Non-volatile random access memory. http://en.wikipedia.org/wiki/NVRAM.

[19] NSF Research Infrastructure Project. http://www.cse.ohio-state.edu/nsf ri.

[20] NTFS. http://en.wikipedia.org/wiki/NTFS.

[21] Obsidian Research Corp. http://www.obsidianresearch.com.

[22] OpenSolaris Project: NFS version 4.1 pNFS. http://opensolaris.org/os/project/nfsv41/.

[23] pNFS over Panasas OSD. http://www.pnfs.org/.

[24] Quick Reference to Myri-10G PHYs. http://www.myri.com/Myri-10G/documentation/Myri-10G_PHYs.pdf.

[25] ReiserFS. http://en.wikipedia.org/wiki/ReiserFS.

[26] RoadRunner Super Computer. http://www.lanl.gov/roadrunner/.

[27] SCSI STA. http://www.scsita.org/.

[28] Sun Microsystems, Inc. http://www.sun.com.

[29] The Open Solaris Project. http://www.opensolaris.org.

[30] Top 500 Super Computer Sites. In http://www.top500.org/.

[31] Unix File System (UFS). http://en.wikipedia.org/wiki/Unix_File_System.

[32] Write Anywhere File Layout (WAFL). http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout.

[33] ZFS File System. http://en.wikipedia.org/wiki/ZFS.

[34] Alexandros Batsakis and Randal Burns and Arkady Kanevsky and James Lentini and Thomas Talpey. AWOL: an adaptive write optimizations layer. In FAST’08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1–14, Berkeley, CA, USA, 2008. USENIX Association.

[35] An-I Wang and Geoffrey Kuenning and Peter Reiher and Gerald Popek. The Effects of Memory-Rich Environments on File System Microbenchmarks. In Proceedings of the 2003 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), Montreal, July 2003.

[36] B. Callaghan. NFS Illustrated. In Addison-Wesley Professional Computing Series, 1999.

[37] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. An analysis of data corruption in the storage stack. In FAST’08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1–16, Berkeley, CA, USA, 2008. USENIX Association.

[38] P. Balaji, W. Feng, Q. Gao, R. Noronha, W. Yu, and D. K. Panda. Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines. Technical Report LA-UR-05-2635, Los Alamos National Laboratory, June 2005.

[39] P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda. Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? In ISPASS ’04.

[40] Brent Welch and Marc Unangst. Cluster Storage and File System Technology. In Tutorial T3 at USENIX Conference on File and Storage Technologies (FAST ’07), San Jose, CA, 2007.

[41] Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, and Omer Asad. NFS over RDMA. In Proceedings of the ACM SIGCOMM workshop on Network-I/O convergence: Experience, Lessons, Implications, pages 196–208. ACM Press, 2003.

[42] Brent Callaghan and Tom Talpey. RDMA Transport for ONC RPC. http://www1.ietf.org/proceedings new/04nov/IDs/draft-ietf-nfsv4-rpcrdma-00.txt, 2004.

[43] Cluster File System, Inc. Lustre: A Scalable, High Performance File System. http://www.lustre.org/docs.html.

[44] Chelsio Communications. http://www.chelsio.com/.

[45] Danga Interactive. Memcached. http://www.danga.com/memcached/.

[46] David Nagle, Denis Serenyi, and Abbie Matthews. The Panasas ActiveScale Storage Cluster – Delivering Scalable High Bandwidth Storage. In Proceedings of Supercomputing ’04, November 2004.

[47] W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and D. K. Panda. Performance Characterization of 10-Gigabit Ethernet: From Head to TOE. Technical Report LA-UR-05-2635, Los Alamos National Laboratory, April 2005.

[48] G. Bryan. Fluid in the universe: Adaptive Mesh Refinement in cosmology. In Computing in Science and Engineering, volume 1, pages 46–53, March/April 1999.

[49] Infiniband Trade Association. http://www.infinibandta.org.

[50] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka and E. Zeidner. Internet Small Computer Systems Interface (iSCSI). http://tools.ietf.org/html/rfc3720#section-8.2.1.

[51] Jeff Bonwick. The Slab Allocator: An Object-Caching Kernel Memory Allocator. In USENIX Summer 1994 Technical Conference, 1994.

[52] Song Jiang, Xiaoning Ding, Feng Chen, Enhua Tan, and Xiaodong Zhang. Dulo: an effective buffer cache management scheme to exploit both temporal and spatial locality. In FAST’05: Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies, pages 8–8, Berkeley, CA, USA, 2005. USENIX Association.

[53] Hai Jin. High Performance Mass Storage and Parallel I/O: Technologies and Appli- cations. John Wiley & Sons, Inc., New York, NY, USA, 2001.

[54] John M. May. Parallel I/O For High Performance Computing. Morgan Kaufmann, 2001.

[55] Jon Tate and Fabiano Lucchese and Richard Moore. Introduction to Storage Area Networks. ibm.com/redbooks, July 2006.

[56] Lee Breslau and Pei Cao and Li Fan and Graham Phillips and Scott Shenker. Web Caching and Zipf-like Distributions: Evidence and Implications. In INFOCOM (1), pages 126–134, 1999.

[57] M. Dahlin, et al. Cooperative Caching: Using Remote Client Memory to Improve File System Performance. In Operating Systems Design and Implementation, pages 267–280, 1994.

[58] M. Koop and W. Huang and K. Gopalakrishnan and D. K. Panda. Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand. In 16th IEEE Int’l Symposium on Hot Interconnects (HotI16), August 2008.

[59] John Searle (III) Director: Vikram Jayanti Marc Ghannoum. Game Over - Kasparov and the Machine. ThinkFilm, 2003.

[60] John M. May. Parallel I/O for high performance computing. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.

[61] Mellanox Technologies. InfiniHost III Ex MHEA28-1TC Dual-Port 10Gb/s InfiniBand HCA Cards with PCI Express x8. http://www.mellanox.com/products/infinihost iii ex cards.php.

[62] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, Jul 1997.

[63] F. Mietke, R. Baumgartl, R. Rex, T. Mehlan, T. Hoefler, and W. Rehm. Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack. 8 2006. Accepted for publication at Euro-Par 2006 Conference.

[64] S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. Panda. Supporting strong coherency for active caches in multi-tier data-centers over infiniband. In SAN-03 Workshop (in conjunction with HPCA), 2004.

[65] Network Appliance. Netapp Ontap GX. http://www.netapp.com/us/products/storage-systems/data-ontap-gx/data-ontap-gx-tech-specs.html.

[66] Steve D. Pate. UNIX Filesystems: Evolution, Design and Implementation. Wiley Publishing Inc., 2003.

[67] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In ACM Sigmod Conference, 1988.

[68] Paul H. Hargrove and Jason C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters. In SciDAC, 6 2006.

[69] Pete Wyckoff and Jiesheng Wu. Memory Registration Caching Correctness. In CCGrid, May 2005.

[70] Amar Phanishayee, Elie Krevat, Vijay Vasudevan, David G. Andersen, Gregory R. Ganger, Garth A. Gibson, and Srinivasan Seshan. Measurement and analysis of TCP throughput collapse in cluster-based storage systems. In Proc. USENIX Conference on File and Storage Technologies, San Jose, CA, February 2008.

[71] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under unix. Technical report, Knoxville, TN, USA, 1994.

[72] Q. Gao, W. Huang, M. Koop, and D. K. Panda. Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand. In Int’l Conference on Parallel Processing (ICPP), XiAn, China, 9 2007.

[73] Q. Gao, W. Yu, W. Huang and D. K. Panda. Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand. In International Conference on Parallel Processing (ICPP), August 2006.

[74] Quadrics, Inc. Quadrics Linux Cluster Documentation. http://web1.quadrics.com/ onlinedocs/Linux/Eagle/html/.

[75] R. Noronha and D.K. Panda. IMCa: A High-Performance Caching Front-end for GlusterFS on InfiniBand. Technical Report OSU-CISRC-3/08-TR09, The Ohio State University, 2008.

[76] R. Noronha and L. Chai and T. Talpey and D.K. Panda. Designing NFS With RDMA for Security, Performance and Scalability. Technical Report OSU-CISRC-6/07-TR47, The Ohio State University, 2007.

[77] S. Narravula and A. Mamidala and A. Vishnu and G. Santhanaraman and D.K. Panda. High Performance MPI over iWARP: Early Experiences. In ICPP’07: Proceedings of the 2007 International Conference on Parallel Processing, 9 2007.

[78] S. Narravula and H. Subramoni and P. Lai and R. Noronha and D.K. Panda. Performance of HPC middleware over InfiniBand WAN. In International Conference on Parallel Processing (ICPP), 9 2008.

[79] S. Shepler, M. Eisler, and D. Noveck. NFS Version 4 Minor Version 1. http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-19.

[80] S. Sur, L. Chai, H.-W. Jin and D. K. Panda. Shared Receive Queue Based Scalable MPI Design for InfiniBand Clusters. In International Parallel and Distributed Processing Symposium (IPDPS), Rhodes Island, Greece, 4 2006.

[81] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and implementation of the sun network filesystem. pages 379–390, 1988.

[82] Sandia National Labs. Scalable IO. http://www.cs.sandia.gov/Scalable_IO/.

[83] Frank Schmuck and Roger Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In FAST ’02, pages 231–244. USENIX, January 2002.

[84] Bianca Schroeder and Garth A. Gibson. Understanding disk failure rates: What does an mttf of 1,000,000 hours mean to you? Trans. Storage, 3(3):8, 2007.

[85] Spencer Shepler, Brent Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck. NFS version 4 Protocol. http://www.ietf.org/rfc/rfc3530.txt.

[86] Silicon Graphics, Inc. CXFS: An Ultrafast, Truly Shared Filesystem for SANs. http://www.sgi.com/products/storage/cxfs.html.

[87] Spencer Shepler. NFS Version 4 (the inside story). http://www.usenix.org/events/fast05/tutorials/tutonfile.html.

[88] S.R. Kleiman. Vnodes: An Architecture for Multiple File System Types in Sun UNIX. In 1986 Summer USENIX Conference.

[89] Storage Networking Industry Association. iSCSI/iSER and SRP Protocols. http://www.snia.org.

[90] V. Strumpen. Compiler Technology for Portable Checkpoints. Submitted for publication (http://theory.lcs.mit.edu/strumpen/porch.ps.gz). citeseer.ist.psu.edu/strumpen98compiler.html, 1998.

[91] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS file system. In Proceedings of the USENIX 1996 Technical Conference, pages 1–14, San Diego, CA, USA, January 1996.

[92] Andrew S. Tanenbaum. Modern Operating Systems. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2001.

[93] Mahidhar Tatineni. SDSC HPC Resources. https://asc.llnl.gov/alliances/2005 sdsc.pdf.

[94] Texas Advanced Computing Center. 62,976-core Ranger Cluster. http://www.tacc.utexas.edu/resources/hpcsystems/.

[95] H. Tezuka, F. O’Carroll, A. Hori, and Y. Ishikawa. Pin-down cache: A virtual memory management technique for zero-copy communication. pages 308–315.

[96] Tom Barclay and Wyman Chong and Jim Gray. A Quick Look at Serial ATA (SATA) Disk Performance. Technical Report MSR-TR-2003-70, Microsoft Corporation, 2003.

[97] W. Richard Stevens. Advanced Programming in the UNIX environment. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1992.

[98] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. A job pause service under LAM/MPI+BLCR for transparent fault tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, Long Beach, CA, USA, March 26-30, 2007.

[99] Parkson Wong and Rob F. Van der Wijngaart. NAS Parallel Benchmarks I/O Version 2.4. Technical Report NAS-03-002, Computer Sciences Corporation, NASA Advanced Supercomputing (NAS) Division.

[100] ANSI X3T9.3. Fiber Channel - Physical and Signaling Interface (FC-PH), 4.2 Edition. November 1993.

[101] Y. Cui, R. Moore, K. Olsen, A. Chorasia, P. Maechling, B. Minister, S. Day, Y. Hui, J. Zhu, A. Majumdar and T. Jordan. Enabling very large earthquake simulations on Parallel Machines. In Lecture Notes in Computer Science, 2007.

[102] Rumi Zahir. Lustre Storage Networking Transport Layer. http://www.lustre.org/docs.html.

[103] George Kingsley Zipf. Human Behavior and the Principle of Least Effort. Addison- Wesley Press, 1949.
