Scalable and manageable storage systems
Khalil S. Amiri
December, 2000
CMU-CS-00-178
Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Thesis Committee: Garth A. Gibson, Chair Gregory R. Ganger M. Satyanarayanan Daniel P. Siewiorek John Wilkes, HP Laboratories
This research is sponsored by DARPA/ITO, through DARPA Order D306, and issued by Indian Head Division under contract N00174-96-0002. Additional support was provided by generous contributions of the member companies of the Parallel Data Consortium including 3COM Corporation, Compaq, Data General, EMC, Hewlett-Packard, IBM, Intel, LSI Logic, Novell, Inc., Seagate Technology, StorageTek, Quantum Corporation and Wind River Systems. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of any supporting organization or the US Government.
Keywords: storage, network-attached storage, storage management, distributed disk arrays, concurrency control, recovery, function partitioning.

Abstract
Emerging applications such as data warehousing, multimedia content distribution, electronic commerce and medical and satellite databases have substantial storage requirements that are growing at 3X to 5X per year. Such applications require scalable, highly-available and cost-effective storage systems. Traditional storage systems rely on a central controller (file server, disk array controller) to access storage and copy data between storage devices and clients, which limits their scalability. This dissertation describes an architecture, network-attached secure disks (NASD), that eliminates the single-controller bottleneck, allowing the throughput and bandwidth of an array to scale with increasing capacity up to the largest sizes desired in practice. NASD enables direct access from clients to shared storage devices, allowing aggregate bandwidth to scale with the number of nodes. In a shared storage system, each client acts as its own storage (RAID) controller, performing all the functions required to manage redundancy and access its data. As a result, multiple controllers can access and manage shared storage devices concurrently. Without proper provisions, this concurrency can corrupt redundancy codes and cause hosts to read incorrect data. This dissertation proposes a transactional approach to ensure correctness in highly concurrent storage device arrays. It proposes distributed device-based protocols that exploit trends towards increased device intelligence to ensure correctness while scaling well with system size. Emerging network-attached storage arrays consist of storage devices with excess cycles in their on-disk controllers, which can be used to execute filesystem function traditionally executed on the host. Programmable storage devices increase the flexibility in partitioning filesystem function between clients and storage devices.
However, the heterogeneity in resource availability among servers, clients and network links causes optimal function partitioning to change across sites and with time. This dissertation proposes an automatic approach which allows function partitioning to be changed and optimized at run-time by relying only on the black-box monitoring of functional components and of resource availability in the storage system.
This dissertation is dedicated to my mother, Aicha, and to my late father, Sadok.
Acknowledgements
The writing of a dissertation can be a daunting and taxing experience. Fortunately, I was lucky enough to have enjoyed the support of a large number of colleagues, family members, faculty and friends who made this process almost enjoyable. First and foremost, however, I am grateful to God for blessing me with a wonderful family and for giving me the opportunity and ability to complete my doctorate.

I would like to thank my advisor, Professor Garth Gibson, for his advice and encouragement during my time at Carnegie Mellon. I am also grateful to the members of my thesis committee, Gregory Ganger, Dan Siewiorek, Satya and John Wilkes, for agreeing to serve on the committee and for their valuable guidance and feedback. I am also indebted to John Wilkes for his mentoring and encouragement during my internships at HP Labs.

My officemates, neighbors, and collaborators on the NASD project, Fay Chang, Howard Gobioff, Erik Riedel and David Rochberg, were great colleagues and wonderful friends. I am also grateful to the members of the RAIDFrame and NASD research groups and to the members of the entire Parallel Data Laboratory, who always proved ready to engage in challenging discussions. In particular, I would like to thank Jeff Butler, Bill Courtright, Mark Holland, Eugene Feinberg, Paul Mazaitis, David Nagle, Berend Ozceri, Hugo Patterson, Ted Wong, and Jim Zelenka. In addition, I would like to thank LeAnn, Andy, Charles, Cheryl, Chris, Danner, John, Garth, Marc, Sean, and Steve.

I am grateful to Joan Digney for her careful proofreading of this document, for her valuable comments, and for her words of encouragement. I am also grateful to Patty for her friendship, support, assistance, and for running such a large laboratory so well. I am equally grateful to Karen for helping me with so many tasks and simply for being such a wonderful person.

During my time at Carnegie Mellon, I had several fruitful collaborations with colleagues within and outside the University.
In particular, I would like to thank David Petrou, who worked with me on ABACUS, for always proving ready to engage in challenging discussions or to start any design or
implementation task we thought was useful. I would also like to thank Richard Golding, who worked with me on the challenging scalability and fault-tolerance problems in storage clusters. Richard was always a source of good ideas, advice and encouragement, in addition to being great company.

I am deeply grateful to the members of the Parallel Data Lab's Industrial Consortium for helping me understand the practical issues surrounding our research proposals. In particular, I am grateful to Dave Anderson and Chris Mallakapali of Seagate, to Elizabeth Borowsky then of HP Labs, to Don Cameron of Intel, and to Satish Rege of Quantum.

I cannot imagine my six years in graduate school without the support of many friends in the United States and overseas. I would like to thank in particular Adel, Ferdaws, Gaddour, Karim, Najed, Rita, Sandra, Sean, and Sofiane.

Last but not least, I am infinitely grateful to my mother, Aicha, and my late father, Sadok. They provided me with unconditional love, advice, and support. My late father and his father before him presented me with admirable role models of wisdom and perseverance. Despite our long-distance relationship, my brothers and sisters, Charfeddine, Amel, Ghazi, Hajer, Nabeel, and Ines overwhelmed me with more love and support throughout my years in college and graduate school than I could have ever deserved. Many thanks go to my uncles and aunts and their respective families, for their unwavering support.
Khalil Amiri
Pittsburgh, Pennsylvania
December, 2000

Contents
1 Introduction
1.1 The storage management problem
1.1.1 The ideal storage system
1.1.2 The shortcomings of current systems
1.2 Dissertation research
1.3 Dissertation road map

2 Background
2.1 Trends in technology
2.1.1 Magnetic disks
2.1.2 Memory
2.1.3 Processors
2.1.4 Interconnects
2.2 Application demands on storage systems
2.2.1 Flexible capacity
2.2.2 High availability
2.2.3 High bandwidth
2.3 Redundant disk arrays
2.3.1 RAID level 0
2.3.2 RAID level 1
2.3.3 RAID level 5
2.4 Transactions
2.4.1 Transactions
2.4.2 Serializability
2.4.3 Serializability protocols
2.4.4 Recovery protocols
2.5 Summary

3 Network-attached storage devices
3.1 Trends enabling network-attached storage
3.1.1 Cost-ineffective storage systems
3.1.2 I/O-bound large-object applications
3.1.3 New drive attachment technology
3.1.4 Convergence of peripheral and interprocessor networks
3.1.5 Excess of on-drive transistors
3.2 Two network-attached storage architectures
3.2.1 Network SCSI
3.2.2 Network-Attached Secure Disks (NASD)
3.3 The NASD architecture
3.3.1 Direct transfer
3.3.2 Asynchronous oversight
3.3.3 Object-based interface
3.4 The Cheops storage service
3.4.1 Cheops design overview
3.4.2 Layout protocols
3.4.3 Storage access protocols
3.4.4 Implementation
3.5 Scalable bandwidth on Cheops-NASD
3.5.1 Evaluation environment
3.5.2 Raw bandwidth scaling
3.6 Other scalable storage architectures
3.6.1 Decoupling control from data transfer
3.6.2 Network-striping
3.6.3 Parallel and clustered storage systems
3.7 Summary
4 Shared storage arrays
4.1 Ensuring correctness in shared arrays
4.1.1 Concurrency anomalies in shared arrays
4.1.2 Failure anomalies in shared arrays
4.2 System description
4.2.1 Storage controllers
4.2.2 Storage managers
4.3 Storage access and management using BSTs
4.3.1 Fault-free mode
4.3.2 Degraded mode
4.3.3 Migrating mode
4.3.4 Reconstructing mode
4.3.5 The structure of BSTs
4.4 BST properties
4.4.1 BST Consistency
4.4.2 BST Durability
4.4.3 No atomicity
4.4.4 BST Isolation
4.5 Serializability protocols for BSTs
4.5.1 Evaluation environment
4.5.2 Centralized locking
4.5.3 Parallel lock servers
4.5.4 Device-served locking
4.5.5 Timestamp ordering
4.5.6 Sensitivity analysis
4.6 Caching storage controllers
4.6.1 Device-served leases
4.6.2 Timestamp ordering with validation
4.6.3 Latency and scaling
4.6.4 Network messaging
4.6.5 Read/write composition and locality
4.6.6 Sensitivity to lease duration
4.6.7 Timestamp log truncation
4.6.8 Implementation complexity
4.7 Recovery for BSTs
4.7.1 Failure model and assumptions
4.7.2 An overview of recovery for BSTs
4.7.3 Recovery in fault-free and migrating modes
4.7.4 Recovery in degraded and reconstructing modes
4.7.5 Storage manager actions in the recovering mode
4.7.6 The recovery protocols
4.8 Discussion
4.8.1 Relaxing read semantics
4.8.2 Extension to other RAID levels
4.8.3 Recovery
4.9 Summary

5 Adaptive and automatic function placement
5.1 Emerging storage systems: Active and Heterogeneous
5.1.1 Programmability: Active clients and devices
5.1.2 Heterogeneity
5.1.3 Assumptions and system description
5.2 Function placement in traditional filesystems
5.2.1 Parallel filesystems
5.2.2 Local area network filesystems
5.2.3 Wide area network filesystems
5.2.4 Storage area network filesystems
5.2.5 Active disk systems
5.2.6 A motivating example
5.3 Overview of the ABACUS system
5.3.1 Prototype design goals
5.3.2 Overview of the programming model
5.3.3 Object model
5.3.4 Run-time system
5.4 Automatic placement
5.4.1 Goals
5.4.2 Overview of the performance model
5.4.3 Single stack
5.4.4 Concurrent stacks
5.4.5 Cost-benefit model
5.4.6 Monitoring and measurement
5.5 A mobile filesystem
5.5.1 Overview
5.5.2 NASD object service
5.6 Evaluation of filesystem adaptation
5.6.1 Evaluation environment
5.6.2 File caching
5.6.3 Striping and RAID
5.6.4 Directory management
5.7 Supporting user applications
5.7.1 Example application scenarios
5.7.2 Case study: Filtering
5.8 Dynamic evaluation of ABACUS
5.8.1 Adapting to competition over resources
5.8.2 Adapting to changes in the workload
5.8.3 System overhead
5.8.4 Threshold benefit and history size
5.9 Discussion
5.9.1 Programming model and run-time system
5.9.2 Security
5.10 Related work
5.10.1 Process migration
5.10.2 Programming systems supporting object mobility
5.10.3 Mobile agent systems
5.10.4 Remote evaluation
5.10.5 Active disks
5.10.6 Application object partitioning
5.10.7 Database systems
5.10.8 Parallel programming systems
5.11 Summary

6 Conclusions and future work
6.1 Dissertation summary
6.1.1 Network-attached storage
6.1.2 Shared storage arrays
6.1.3 Dynamic and automatic function placement
6.2 Contributions
6.2.1 Results
6.2.2 Artifacts
6.3 Future work

List of Figures
2.1 The mechanisms in a magnetic disk
2.2 Trends in the cost of alternative storage media
2.3 Heterogeneity in campus networks
2.4 Annual storage capacity growth by application
2.5 RAID level 5 layout
2.6 RAID level 5 operations in fault-free mode
2.7 RAID level 5 operations in degraded mode
2.8 Timestamp ordering scenario in a database system
3.1 Server-attached-disk architecture
3.2 NetSCSI storage architecture
3.3 Network-attached secure disk architecture
3.4 NASD object attributes
3.5 Cheops architecture
3.6 Striping over NASDs
3.7 A NASD-optimized parallel filesystem
3.8 Scalable data-mining on Cheops/NASD
4.1 Scalable storage arrays
4.2 An example concurrency anomaly in a shared storage array
4.3 Inconsistencies due to untimely failures in a shared array
4.4 Overview of a shared storage array
4.5 Storage transactions used in fault-free mode
4.6 Storage transactions used in degraded mode
4.7 Storage transactions used in migrating mode
4.8 Storage transactions used in reconstructing mode
4.9 Storage-level operations
4.10 Scaling of centralized locking protocols
4.11 Scaling of centralized locking under increased contention
4.12 Effect of contention on callback locking
4.13 Device-served locking operation graph
4.14 Messaging overhead of the various protocols
4.15 Scaling of centralized and device-served locking
4.16 Scaling of centralized and device-served locking under increased contention
4.17 Timestamp ordering operation graph
4.18 Scaling of centralized and device-based protocols
4.19 Effect of read/write workload mix on messaging overhead
4.20 Effect of network variability on the fraction of operations delayed
4.21 Effect of network variability on host latency
4.22 Effect of faster disks on protocol scalability
4.23 Overview of a caching shared storage array
4.24 Device-served leasing operation graph
4.25 Timestamp ordering with validation operation graph
4.26 Scaling of caching storage controllers
4.27 Messaging overhead of the caching protocols
4.28 Effect of read/write workload mix on host latency
4.29 Effect of read/write workload mix on messaging overhead
4.30 Effect of lease duration on the fraction of operations delayed
4.31 Effect of lease duration on messaging overhead
4.32 Effect of lease duration on host latency
4.33 Effect of timestamp log size on host latency
4.34 Access and layout protocols in a shared storage array
4.35 Device states during the processing of a BST
4.36 Modes of a virtual object
4.37 Controller states during the execution of a BST
5.1 Function placement in parallel filesystems
5.2 Function placement in network filesystems
5.3 Function placement in active disk systems
5.4 A filesystem as an ABACUS application
5.5 A user-level program as an ABACUS application
5.6 ABACUS prototype architecture
5.7 Mobile object interface
5.8 Break-down of application execution time
5.9 A prototype object-based filesystem
5.10 Evaluation environment
5.11 Cache placement scenarios
5.12 Cache benchmark results
5.13 Example per-client placement of filesystem objects
5.14 RAID placement results
5.15 Directory management in ABACUSFS
5.16 Directory placement results
5.17 Timeline of function migration for two clients contending on a shared directory
5.18 Filter performance under ABACUS
5.19 Filter placement results
5.20 Competing filter benchmark
5.21 Timeline of the multiphasic cache benchmark
5.22 Cost and benefit estimates versus time
6.1 Scalable storage arrays
6.2 Managing heterogeneity

List of Tables
2.1 Trends in the performance of magnetic disks
2.2 Bandwidth of alternative Ethernet technologies
2.3 Cost trends for Ethernet Hubs
2.4 Cost trends for Ethernet Switches
2.5 Cost of down-time by application
3.1 The NASD interface
4.1 BSTs used in different modes
4.2 Baseline simulation parameters
4.3 Baseline performance results
4.4 Implementation complexity of the caching protocols
4.5 Summary of recovery actions taken upon failures and outages
5.1 Function placement in distributed storage systems
5.2 Copy placement results
5.3 Directory retries and function placement
5.4 Cache placement for applications with contrasting phases

Chapter 1
Introduction
Data sets stored on-line are growing at phenomenal rates, often doubling every year [Emanuel, 1997], and reaching several terabytes at typical e-commerce companies [Lycos, 1999]. Examples include repositories of satellite and medical images, data warehouses of business information, multimedia entertainment content, on-line catalogs, and attachment-rich email archives. The explosive growth of electronic commerce [News, 1998] is generating huge archives of business transaction records and customer shopping histories every week [EnStor, 1997]. It is also reported that NASA's new earth observing satellite will generate data sets up to three times the size of the Library of Congress every year.

Given these rapid growth rates, organizations face the need to incrementally scale their storage systems as demand for their services grows and their data sets expand. For many companies, estimating the growth rate is not an easy task [Lycos, 1999]. This makes the need to incrementally scale systems in response to unpredictable demand a pressing concern in practice. Traditional storage systems rely on file servers to copy data between clients and storage devices. File servers, unlike network switches, are not efficient at moving data between clients and storage nodes because they interpose synchronous control functions in the middle of the data path. As a result, file servers have emerged as a severe scalability bottleneck in storage systems. Consequently, to deliver acceptable bandwidth to clients, file servers have to be custom-built or high-end machines, imposing a high cost overhead. Expanding storage beyond a single file server's capacity is also costly because it requires acquiring a new server and because it requires system administrators to explicitly replicate files and balance load and capacity across the servers.
The importance of storage system performance and availability in practice leads to the employment of a plethora of manual and ad-hoc techniques to ensure high availability and balanced load. Not surprisingly, the cost of storage management continues to be the leading component in the cost
of ownership of computer systems [Gibson and Wilkes, 1996]. The cost of storage management is estimated to be three to six times the initial purchase cost of storage [Golding et al., 1995]. This is distressing given that the purchase cost of the storage subsystem dominates the cost of the other components in the system, making up to 75% of the purchase cost of the entire computer system in many enterprise data centers.

This dissertation proposes a novel architecture and novel algorithms that enable storage systems to be more cost-effectively scalable. Furthermore, the dissertation proposes an approach to ensure automatic load balancing across storage system components. Together, the solutions described in this dissertation promise to make storage systems more manageable and cost-effectively scalable.
1.1 The storage management problem
A storage system is a system of hardware and software components that provides a persistent repository for the storage of unstructured streams of bytes. A storage system typically includes the storage devices, such as magnetic disks, used for persistent storage; the storage controllers (processors) responsible for accessing and managing these devices; and the networks that connect the devices to the controllers.

The ideal storage system has good performance regardless of its size (i.e., its data can be stored and retrieved quickly), high availability (i.e., its data can be accessed despite partial faults in the components), and sufficient capacity (i.e., the system can seamlessly expand to meet growth in user storage requirements). Storage systems today fall short of this ideal in all these aspects. Currently, these shortcomings are addressed by manual management techniques, which are both costly and of limited effectiveness. They are typically performed by administrators who are rarely equipped to undertake such complex optimization and configuration decisions. Human expertise continues to be scarce and expensive, explaining the exorbitant costs associated with storage management.
1.1.1 The ideal storage system
The ideal storage system can be characterized by four major properties:
Cost-effective scaling: The ratio of total system cost to system performance (e.g. throughput or bandwidth) should remain low as the system increases in capacity. That is, doubling the
number of components (cost) of the system should double its performance. This requires that system resources be effectively utilized. These resources are often unevenly distributed across the different nodes in the storage system.
Availability: The storage system should continue to serve data even if a limited number of components fail.
Flexible capacity through easy and incremental growth: To meet expansion needs, increasing the size should be a simple and almost entirely automatic task.
Security: The storage system must enforce data privacy and integrity and allow users to define expressive access control policies.
1.1.2 The shortcomings of current systems
The storage systems that are most widely used in medium and high-end servers today are redundant disk arrays: multi-disk systems that use disk striping and parity codes to balance load across disks, provide increased bandwidth on large transfers, and ensure data availability despite disk failures [Patterson et al., 1988]. To scale the capacity of a storage system beyond the maximum capacity or performance of a single disk array, multiple disk arrays are used. Load and capacity are often balanced manually by moving volumes between the arrays. Expanding the system by a few disks sometimes requires the purchase of a new array, a major step increase in cost. This occurs when the capacity of the array has reached its maximum or when the load on the controller becomes too high. Management operations such as storage migration and reconstruction are either carried out off-line or performed on-line by the central array controller, restricting the system's scalability.

A storage system includes clients, array controllers and (potentially programmable) storage devices. The resources available at these nodes vary across systems and with time. Balancing the load across the system components often requires rewriting filesystems and applications to adjust the partitioning of function between nodes (clients and servers) to take advantage of changes in technology and applications. Filesystems have traditionally dictated that all user applications execute on the client, thereby opting for a "data shipping" approach, where file blocks are shipped to the client and processed there. Recently, however, with the availability of excess cycles in storage devices and servers, research has demonstrated dramatic benefits from exploiting these cycles to perform filtering, aggregation and other application-specific processing on data streams. Function shipping reduces the amount of data that has to be shipped back to the client and substantially improves performance.

Function shipping is not always the ideal choice, however. Devices and servers can be easily overloaded since their resources are limited compared to the aggregate client resources. Whether data shipping or function shipping performs better depends on current load conditions, the workload mix and application characteristics. Currently, optimal partitioning of function is the responsibility of the programmer and system administrator. This adds to the cost of storage management. To battle increasing storage system management costs, storage systems should scale cost-effectively and balance load automatically without manual intervention.

This dissertation identifies and addresses three key technical challenges to making storage systems more cost-effectively scalable and manageable.
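The parity arithmetic at the heart of such arrays is simple to sketch. The following is a minimal illustration (not the dissertation's code; function names are invented for this sketch) of the RAID level 5 small-write rule, in which the new parity is derived from the old data, the new data and the old parity:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write_parity(old_data: bytes, new_data: bytes,
                             old_parity: bytes) -> bytes:
    """RAID level 5 small-write rule:
    new parity = old parity XOR old data XOR new data."""
    return xor_blocks(old_parity, xor_blocks(old_data, new_data))
```

Because this rule is a read-modify-write sequence spanning a data device and a parity device, two controllers updating the same stripe concurrently can interleave their reads and writes and leave the parity inconsistent with the data; this is precisely the kind of race that the shared-array protocols described later must prevent.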
1.2 Dissertation research
The research in this dissertation consists of three relatively independent parts. The first part introduces a storage system architecture, network-attached secure disks (NASD), that enables cost-effective bandwidth scaling and incremental capacity growth. NASD modifies the storage device interface to allow it to transfer data directly to end clients without copying data through a centralized storage controller or file server. The basic idea is to have clients cache the mapping from a high-level file name to an object on a NASD device. The client then uses this cached mapping to map a file access onto a NASD device access. Data is transferred between the client and the NASD device without a copy through the server. Using a scalable switched network to connect the clients and the devices, the storage system can be expanded and its performance scaled by attaching additional NASD devices and clients to the network. This architecture is described in Chapter 3.

The NASD architecture allows clients direct access to NASD devices. By striping across NASD devices, clients can achieve high bandwidths on large transfers. The function that implements this striping across multiple storage devices, and which generally implements the RAID protocols (which maintain redundancy and manage the block layout maps), must therefore be executed at the client so that data travels from source to sink directly without being copied through a store-and-forward node. A NASD system therefore contains multiple controllers (at multiple clients) which can be accessing storage simultaneously. These controllers must be coordinated so that races do not corrupt redundancy codes or cause clients to read incorrect data.

The second part of the dissertation, namely Chapter 4, presents an approach based on lightweight transactions which allows storage controllers to be active concurrently. Specifically, multiple controllers can access shared devices while management tasks (such as data migration or reconstruction) are ongoing at other controllers. The protocols distribute the work of ensuring concurrency control and recovery to the endpoints. As a result, they do not suffer from the single-controller bottleneck of traditional arrays. Distributed protocols unfortunately suffer from increased implementation complexity and involve higher messaging overhead. Both are very undesirable in storage systems because they increase the chance of implementation bugs and limit performance. This part shows how complexity can be managed by breaking down the problem into subproblems and solving each subproblem separately. The proposed solution parallels previous work on database transaction theory. It relies on simple two-phased operations, called storage transactions, as a basic building block of the solution. This part also proposes an optimistic protocol based on timestamps derived from loosely synchronized clocks that exhibits good scalability, low latency and limited device-side state and complexity. This protocol is shown to work well for the random-access and contended workloads typical of clustered storage systems. Furthermore, it is shown to have robust performance across workloads and system parameters.

The third part of the dissertation tackles the problem of function partitioning in the context of data-intensive applications executing over distributed storage systems. Rapidly changing technologies cause a single storage system to be composed of multiple storage devices and clients with disparate levels of CPU and memory resources. The interconnection network is rarely a simple crossbar, and is usually quite heterogeneous. The bandwidth available between pairs of nodes depends on the physical link topology between the two nodes and also on the dynamic load on the entire network.
In addition to hardware heterogeneity, applications also have characteristics that vary with input arguments and with time. The research described in this part demonstrates that the performance of storage management and data-intensive applications can be improved significantly by adaptively partitioning their functions between storage servers and clients. Chapter 5 describes a programming system which allows applications to be composed of components that can be adaptively bound to the client or server at run-time. It proposes an approach wherein application components are observed as black boxes that communicate with each other. An on-line performance model is used to dynamically decide where to place and re-place components between cluster nodes. It also quantifies the benefits of adaptive function partitioning through several microbenchmarks.

The thesis of this dissertation can be summarized as follows:
1. Parallel hosts attached to an array of shared storage devices via a switched network can achieve scalable bandwidth and a shared virtual storage abstraction.
2. Using smart devices for distributed concurrency control in storage clusters achieves good scalability while requiring limited device-side state.
3. Proper function placement is crucial to the performance of storage management and data-intensive applications and can be decided based on the black-box monitoring of application components.
1.3 Dissertation road map
The rest of the dissertation is organized as follows. Chapter 2 provides background information useful for reading the rest of the dissertation. It discusses trends in the underlying hardware technologies and reviews the demands that emerging applications place on storage systems. It also covers some background on redundant disk arrays. The latter part of the chapter summarizes the basic theory of database transactions and reviews the different concurrency control and recovery protocols employed by transactional storage systems.
Chapter 3 is devoted to the NASD storage architecture, which enables cost-effective bandwidth scaling and incremental capacity growth. It reiterates the enabling trends, the changes required at the hosts and the storage device to enable direct transfer, and the proposed storage device interface. It also describes a prototype storage service that aggregates multiple NASD devices into a single shared virtual object space. This chapter shows that the storage service can deliver scalable bandwidth to bandwidth-hungry applications such as data mining.
Chapter 4 presents an approach that allows parallel storage controllers in a NASD system to be actively accessing and managing storage simultaneously. The approach is based on a modular design which uses lightweight transactions, called base storage transactions (BSTs), as a basic building block. The key property of BSTs is serializability. The chapter presents protocols that ensure serializability and recovery for BSTs with high scalability.
Chapter 5 tackles the problem of function partitioning in distributed storage systems. In particular, it shows that data-intensive applications can benefit substantially from the adaptive partitioning of their functions to avoid bottlenecked network links and overloaded nodes. This chapter describes the ABACUS prototype, which was designed and implemented to demonstrate the feasibility of adaptive function placement.
ABACUS consists of a programming model that allows applications to be composed of graphs of mobile component objects. A run-time system redirects method invocation between component objects regardless of their placement (at client or at server). ABACUS continuously collects measurements of object resource consumption and system load, and invokes an on-line performance model to evaluate the benefit of alternative placement decisions and adapt accordingly. The latter part of the chapter focuses on validating the programming model by describing a filesystem built on ABACUS and by measuring the benefits that the adaptive placement of function enables. Finally, Chapter 6 summarizes the conclusions of the thesis and ends with directions for future work.
Chapter 2
Background
This chapter presents background information useful for reading the rest of the dissertation. It starts by reviewing the trends that motivate this research. Section 2.1 summarizes the trends in the technologies driving the important storage system components, namely magnetic disks, processors, interconnects and memory. Section 2.2 surveys trends in application capacity and bandwidth requirements, which are growing rapidly. The second part of this background chapter covers the necessary background for Chapter 4. The solutions proposed in Chapter 4 specialize database transaction theory to storage semantics to implement a provably correct and highly concurrent parallel disk array. Section 2.3 contains a quick refresher on redundant disk arrays. Section 2.4 reviews database transactions, the traditional technique applied to building concurrent and fault-tolerant distributed and cluster systems.
2.1 Trends in technology
Storage systems contain all the components that go into larger computer systems. They rely on magnetic disks, DRAM, processors – both for on-disk controllers and array controllers – and interconnects to attach the disks to the array controllers and the array controllers to the hosts. Consequently, significant changes in the technologies driving the evolution of these components influence the design and architecture of storage systems. This section presents background on how these components function, as well as recent trends in their cost and performance.
Figure 2.1: The mechanisms in a magnetic disk. A magnetic disk comprises several platters attached to a rotating spindle. Storage on the platters is organized into concentric tracks which increase in circumference and capacity the farther they are from the spindle. Read/write heads are attached to a disk arm which is moved by an actuator to the proper track. Once the head is positioned over the proper track, the disk waits for the target sector to pass under the head. Data is transferred to or from the platter as the sector passes under the head. Only one head on one track is active at a time because the heads are rigidly linked and only one can be properly aligned at a time.
2.1.1 Magnetic disks
Figure 2.1 depicts the mechanisms in a magnetic disk. Data is stored in concentric tracks on parallel platters. A spindle rotates the platters at a fixed rotational speed. An arm moves laterally towards and away from the center of the platters to position the head on a particular track. A sector corresponds to a small angular portion of a track and typically stores 512 to 1024 bytes. Sectors represent the unit of addressability of a magnetic disk. Once the head is positioned on the proper track, the head waits until the sector rotates under it. At that time, data is transferred from the magnetic surface to the read buffer (in the case of a read request) or from the write buffer to the surface (in the case of a write). The latency of a disk access can therefore be broken down into three main components: seek, rotational and transfer latencies. Seek latency refers to the time it takes to position the read/write head over the proper track. This involves a mechanical movement that may require an acceleration in the beginning and a deceleration and repositioning in the end. As a result, although seek times have been improving, they have not kept up with the rates of improvement of silicon processors. While processing rates have improved by more than an order of magnitude, average seek times have only shrunk to half of their values of a decade ago (Table 2.1).
Disk characteristic       1980    1987    1990    1994          1999
Seek time (ms)            38.6    16.7    11.5    8(rd)/9(wr)   5
RPM                       3600    3600    5400    7200          10000
Rotational latency (ms)   5.5     8.3     5.5     4.2           3
Bandwidth (MB/s)          0.74    2.5     3-4.4   6-9           18-22.5
8KB transfer (ms)         65.2    28.3    19.1    13.1          9.6
1MB transfer (ms)         1382    425     244     123           62
Table 2.1: Latency and bandwidth of commodity magnetic disks over the past two decades. The 1980 disk is a 14-inch (diameter) IBM 3330, the 1987 disk is a 10.5-inch Fujitsu M2316A (Super Eagle), the 1990 and 1994 disks are 3.5-inch Seagate ST41600s, and the 1999 disk is a 3.5-inch Quantum Atlas 10k. Data for 1980 is from [Dewitt and Hawthorn, 1981], while data for 1987 and 1990 is from [Gibson, 1992]. The 1994 data is from [Dahlin, 1995].
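The three-component latency model described above can be sketched as a short back-of-the-envelope calculation. The function and parameter names below are illustrative, and the model deliberately ignores controller and queueing overheads, so it slightly underestimates the measured access times listed in Table 2.1.

```python
def access_time_ms(seek_ms, rpm, bandwidth_mb_s, transfer_kb):
    """Approximate disk access time as seek + rotational + transfer latency."""
    rotational_ms = 0.5 * 60_000 / rpm                     # average: half a rotation
    transfer_ms = (transfer_kb / 1024) / bandwidth_mb_s * 1000
    return seek_ms + rotational_ms + transfer_ms

# 1999 Quantum Atlas 10k: 5 ms seek, 10000 RPM, ~20 MB/s sustained bandwidth
small = access_time_ms(5, 10_000, 20, 8)      # ~8.4 ms: dominated by seek + rotation
large = access_time_ms(5, 10_000, 20, 1024)   # ~58 ms: dominated by transfer time
```

The two results illustrate the point made in the text: an 8 KB access is dominated by mechanical positioning, while a 1 MB access is dominated by the transfer term and thus benefits directly from density-driven bandwidth growth.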
The second component, rotational latency, refers to the time it takes for the sector to rotate under the read/write head. It is determined by the rotational speed of the disk. Rotational speeds have improved slowly over the past decade, at an average annualized rate of 13%. Higher rotational speeds reduce rotational latencies and improve transfer rates. Unfortunately, they are hard to increase because of electrical and manufacturing constraints. Table 2.1 shows that rotational speeds have almost doubled this past decade. The third component is transfer time, which is the time for the target sectors to pass under the read/write head. Disk transfer times are determined by the rotational speed and the storage density (in bytes per square inch). Disk areal densities continue to increase at 50 to 55% per year, leading to dramatic increases in sustained transfer rates, averaging 40% per year [Grochowski and Hoyt, 1996]. As shown in Table 2.1, disk performance has been steadily improving, with more pronounced gains for large transfers than for small accesses. Small accesses are still dominated by seek time, while large transfers can exploit the steady increase in sustained transfer rates. The transfer time for a 1 MB access is being halved every four years, while the transfer time for an 8 KB access is cut by only about 30% over the same four-year period. These trends have different implications for sequential and random workloads, since sequential scan-based applications benefit more from newer-generation disk drives than do random-access workloads. The cost of magnetic storage continues to be very competitive with other mass storage media alternatives. The cost per megabyte of magnetic storage continues to drop at an average rate of 40%
Figure 2.2: The cost per megabyte of magnetic disk storage compared to the cost of DRAM, paper and film. Magnetic disk prices have dropped at 40% per year, faster than DRAM and Flash. Magnetic disk storage is now more than an order of magnitude cheaper than DRAM/Flash. The horizontal band at the bottom of the graph represents the price range of paper and film. For the last few years, magnetic disk storage has been competing head to head with paper and film storage in dollars per megabyte. The graph is a simplified reproduction of the one reported in [Grochowski, 2000].
per year, falling to less than five cents per megabyte by 1998. As shown in Figure 2.2, it is becoming cheaper to store information on disk than on paper or on film. Furthermore, magnetic storage continues to be an order of magnitude cheaper than RAM.
2.1.2 Memory
There are three important kinds of memory technology: ROM, static RAM (SRAM) and dynamic RAM (DRAM). ROM, and its programmable variants PROM and EEPROM, can only be read, whereas RAM can be read and written. ROM is used to store permanent system memory or code that need not be updated or is updated very rarely, such as firmware. The contents of PROM chips can be updated (programmed) using special equipment. SRAM cells maintain the bit they store as long as current is continuously supplied to them. DRAM cells, by contrast, store charge in semiconductor capacitors and do not draw current continuously. A DRAM cell must therefore be refreshed many times per second to maintain the stored charge. Compared to DRAM, SRAM is roughly twenty times faster, but it consumes much more power and takes several times more chip real estate. In general, SRAM technology is used for fast memory banks such as registers and on-chip caches, while cheaper and denser DRAM chips are used for main memory. There are two kinds of DRAM technologies, the traditional asynchronous DRAM and the newer synchronous DRAM. Synchronous DRAM chips are clocked and are more suitable in systems where the processor's clock frequency is in the hundreds of megahertz. The leading type of synchronous DRAM is Rambus [Rambus, 2000]. Rambus memory works more like an internal bus than a conventional memory subsystem. It is based around a high-speed 16-bit bus running at a clock rate of 400 MHz. Transfers are accomplished on both the rising and falling edges of the clock, yielding an effective theoretical bandwidth of approximately 1.6 GB/s. The Rambus bus is narrower than conventional 64-bit system buses; the narrower width enables faster clocking, which in turn yields higher bandwidth. DRAM capacities have been growing at 60% per year, and DRAM bandwidth has been growing at 35% to 50% per year [Dahlin, 1995].
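The 1.6 GB/s figure quoted above follows directly from the bus parameters; a quick check of the arithmetic (variable names are illustrative):

```python
# Peak Rambus channel bandwidth from the parameters quoted above:
# a 16-bit bus clocked at 400 MHz, transferring on both clock edges.
bus_width_bits = 16
clock_hz = 400e6
transfers_per_cycle = 2          # data moves on rising and falling edges

peak_bytes_per_s = bus_width_bits / 8 * clock_hz * transfers_per_cycle
print(peak_bytes_per_s / 1e9)    # 1.6 (GB/s)
```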
DRAM and SRAM memory chips are called volatile memory technologies because they lose their contents once powered off. Flash memory is a non-volatile memory technology which does not require power to maintain its storage. Other non-volatile memory technologies rely on battery-backed RAM, a RAM memory bank with a power supply that can survive power failures by using batteries. RAM and non-volatile memory technologies are still not cost-competitive with magnetic disks. The dollar-per-megabyte cost of RAM and Flash, for instance, is still an order of magnitude higher than that of magnetic storage. Figure 2.2 shows the trends in cost per megabyte for DRAM versus magnetic disk technology and more traditional storage media such as paper and film. Dropping memory prices have resulted in larger memory sizes on client and server workstations, enabling some personal computing applications to fit entirely in memory and considerably improving response time for the end user. However, other emerging applications that use richer content such as video, audio and multimedia, as well as data archival and warehousing applications, are still compelled by the lower and more rapidly decreasing cost of magnetic storage to keep their massive data sets on magnetic disks.
2.1.3 Processors
Processing power has been steadily increasing at a rate of 50% to 55% per year, roughly doubling every two years. This persistent exponential growth is resulting in substantial processing power on client and server workstations. Storage devices (which utilize processors to perform command processing, caching and prefetching, and other internal control and management operations) have also benefited from increased processing power. The increase in processing speed of commodity processors and their dropping cost are making substantial computing power available at the on-disk processor for the execution of additional device-independent function. This general trend of processing power “moving” towards peripheral devices is expected to continue as processing becomes more abundant [Gray, 1997]. One result of the rapid sustained growth in processing power is the creation of a new generation of “commodity embedded servers” where the server workstation is integrated with the on-disk processor [Cobalt Networks, 1999]. One implication of such emerging embedded servers is that future large servers will be replaced by a cluster of commodity-priced embedded servers. Although embedded servers have substantial processing power, they are still limited in memory compared to clients. The sustained rapid growth in processor speeds is also leading to an increase in the variability of processor speeds within a cluster. Clusters expand incrementally, which means that machines are purchased regularly, typically every few months. New machines come with faster processors. Machines that were meant to be fast servers may easily be dwarfed by newly purchased desktop computers. Conversely, once-resourceful clients may be outstripped by newer and faster “embedded servers.”
2.1.4 Interconnects
Two interesting trends can be distinguished in networking technology. The first is the consistent increase in the bandwidth of “machine-room” or “cluster” networks, which is surpassing the rate at which client processors can send, receive and process data. The second is the increase in the heterogeneity of the networks used to provide connectivity past the machine room to the campus and the wide area. High-bandwidth networks are often deployed within a single machine room or within a building. Networks that span several buildings and provide connectivity within a campus or a wide area are still largely heterogeneous, consisting of several types and generations of networking technologies. LAN bandwidth has increased consistently over the past two decades. The dramatic increase in network bandwidth came with the introduction of switched networks to replace shared media such as the original Ethernet. Other increases are attributed to faster clock rates, optical links, and larger scales of integration in network adapters and switches, enabling faster sending, receiving and switching. However, faster network technologies were not widely adopted in all deployed networks, primarily for cost reasons. This has resulted in wide variations in networks and in bandwidth across different environments. Three classes of networks can be distinguished: backbone networks, machine-room or server networks, and client local area networks. Backbone networks carry a large amount of aggregate traffic and are characterized by a limited number of vendors and carriers and by large profit margins. As a result, backbone networks have been quick to adopt new and faster technologies. Similarly, but to a somewhat lesser degree, machine-room networks used to connect servers together into highly parallel clusters or to connect servers to storage have embraced new and faster interconnect technologies relatively quickly.
There is, however, a large number of server and storage vendors, which requires lengthier standardization processes and implementation delays before a technology becomes available in a server network. Local area networks typically connect a large number of client machines, dispersed across the campus and manufactured by a variety of vendors. Consequently, the introduction and deployment of new networking technologies in the client network has been a slow and infrequent process. New technologies also often have stringent distance and connectivity requirements, making them inadequate for the wider area. This high “barrier to entry” for new networking technologies into client networks is also dictated by the high cost associated with network upgrades. Client-side network adapter cards, closet equipment used for bridging and network extension, as well as physical links throughout the building, often must be upgraded together. Thus, to be viable, a new technology has to offer a substantial improvement in bandwidth.
Machine-room or server networks
Originally, networks were based on multiple clients sharing a common medium to communicate. One of the most popular early network technologies is Ethernet [Metcalfe and Boggs, 1976]. Ethernet is a specification invented by Xerox Corporation in the 1970s that operates at 10 Mb/s using a media access protocol known as carrier sense multiple access with collision detection (CSMA/CD). The term is currently used to refer to all CSMA/CD LANs, even those with bandwidths faster than 10 Mb/s and those that do not use coaxial cable for connectivity. Ethernet was widely adopted and was later standardized by the IEEE; the IEEE 802.3 specification was developed in 1980 based on the original Ethernet technology. Later, Fast (100 Mbit) Ethernet, a high-speed LAN technology offering increased bandwidth, was introduced. In shared 10 or 100 Mbit Ethernet, all hosts compete for the same bandwidth.
Networking technology       Year introduced   Interconnection equipment   Aggregate bandwidth (8 nodes, Mb/s)
Shared Ethernet             1982              Hub                         10
Switched Ethernet           1988              Switch                      80
Shared Fast Ethernet        1995              Switch                      100
Switched Fast Ethernet      1995              Switch                      800
Shared Gigabit Ethernet     1999              Hub                         1000
Switched Gigabit Ethernet   1999              Switch                      8000
Table 2.2: Bandwidth of currently available Ethernet networks. Bandwidth is increasing, especially in high-end networks. The year of introduction refers to the approximate date when the IEEE approved the standard. Prototype implementations usually precede that date by a few years.
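The aggregate-bandwidth column of Table 2.2 follows from a simple model, sketched below with illustrative names: on a shared medium all nodes contend for one link, while an ideal crossbar switch lets every node send and receive at full link speed simultaneously.

```python
def aggregate_mbps(link_mbps, nodes, switched):
    # Shared medium: all nodes contend for a single link's bandwidth.
    # Ideal crossbar switch: every node can transmit at full link speed
    # at the same time, so aggregate bandwidth scales with node count.
    return nodes * link_mbps if switched else link_mbps

print(aggregate_mbps(10, 8, switched=True))      # 80: the switched Ethernet row
print(aggregate_mbps(1000, 8, switched=False))   # 1000: the shared Gigabit row
```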
Hub per-port average price   1996   1998   2000
Ethernet (10BaseT)           87     71     15
Fast Ethernet (100BaseT)     174    110    15
Table 2.3: Cost trends for Ethernet Hubs. All prices are in US $. The price for 1996 and 1998 is taken from [GigabitEthernet, 1999]. The price for 2000 is the average price quoted by on-line retailers for major brands (http://www.mircowarehouse.com/)
Switch per-port average price   1996   1998   2000
Ethernet (10BaseT)              440    215    35
Fast Ethernet (100BaseT)        716    432    56
Gigabit Ethernet                -      -      1200
Table 2.4: Cost trends for Ethernet switches. All prices in US $. The prices for 1996 and 1998 are taken from [GigabitEthernet, 1999]. The price for 2000 is the average price quoted by on-line retailers for major brands (http://www.mircowarehouse.com/).
The increasing levels of integration in network hardware in the late eighties enabled cost-effective packet switching at the data-link and network layers [OSI Standard, 1986]. A switch replaces the repeater and effectively gives each device full 10 Mb/s bandwidth (or 100 Mb/s for Fast Ethernet) to the rest of the network by acting as a logical crossbar connection. Such switched architectures enable multiple pairs of nodes to communicate through a switch without the bandwidth degradation of shared Ethernet. The first Ethernet switch was created in 1988. Today, several switch-based networks exist, based on either copper or optical fiber cables, including ATM [Vetter, 1995], Fibre Channel [Benner, 1996], Gigabit Ethernet [GigabitEthernet, 1999], Myrinet [Boden et al., 1995] and ServerNet [Horst, 1995]. Switched networks are increasingly used to connect multiple nodes to form high-end clusters. These same networks are also used to connect storage devices and other peripherals, leading to the merger of interprocessor and peripheral storage interconnects. Single-room switched networks are highly reliable and can deliver high-bandwidth data to simultaneously communicating pairs of network nodes. Network bandwidth in high-end cluster networks has improved at an average of 40% to 45% per year. Over the period from 1980 to 1994, the aggregate network bandwidth available to a 16-node cluster of high-end desktop workstations increased by a factor of 128, from 10 Mb/s on shared Ethernet in the early 1980s to 1280 Mb/s on switched ATM, where 8 nodes send and 8 receive at the same rate.
Six years later, the aggregate bandwidth available to the same 16-node cluster has reached 8 Gb/s on switched Gigabit Ethernet. The growth in the bandwidth of Ethernet networks, the most cost-effective networking technology, is illustrated in Table 2.2. Table 2.3 and Table 2.4 illustrate the price per port for a hub-based (shared) network and a switch-based network. These tables show that new technologies start expensive and then their prices quickly drop, due to increased volumes and commoditization. These trends predict that machine-room clusters will soon be able to use switched networks to interconnect storage devices and servers at little or no extra cost relative to traditional technologies. This is the raw bandwidth available in the network, however, and not the observed end-to-end application bandwidth. Endpoint sender and receiver processing continues to limit the actual effective bandwidth seen by applications to a fraction of what is achievable in hardware. End application bandwidth increased only by a factor of four between 1984 and 1994 [Dahlin, 1995]. Network interfaces accessible directly from user level (e.g. U-Net [von Eicken et al., 1995], Hamlyn [Buzzard et al., 1996] and VIA [Intel, 1995]) help address the processing bottlenecks at the endpoints by avoiding the operating system when sending and receiving messages.
Figure 2.3: Networks that span more than a single room often have a substantial amount of heterogeneity as a result of incremental and unplanned growth. This figure shows the connectivity map of the campus of University of Antwerp (UIA campus) [University of Antwerp Information Service, 2000]. The numbers on the links represent bandwidth in Mb/s. This heterogeneity is very typical of large networks.
Client local and wide area networks
In parallel with developments in high-end single-room networks, the growth of the internet and of intranets has expanded the number of computers connected to networks within a single campus and within the wide area. The networks used to connect computers outside of the cluster or machine room are usually under more cost pressure. Moreover, network upgrades in wider areas occur less often due to the larger costs involved. As a result, while high-end environments use switched networks to connect their clients and servers, much slower networks still dominate many other environments. This will remain the case for the foreseeable future for several reasons. First, extremely large switched networks with hundreds or thousands of nodes will remain costly for the foreseeable future. Constructing a huge crossbar to enable a totally switched architecture for thousands of clients is still prohibitively expensive compared to a hierarchical topology. Usually, networks have hierarchical topologies which exploit the locality of traffic to reduce cost. For instance, several clients in a department may share a high-bandwidth switched network, but connect to other departmental networks via a lower-bandwidth link. In this manner, bottlenecks can be avoided most of the time. However, for certain traffic and data distribution patterns, “bottleneck links” can severely limit application response time. Second, distributed filesystems are expected to expand into private homes to provide users with better and more uniform data access and management software between home and office. Bandwidth to the home continues to make step increases with the introduction of cable modems, ISDN and ADSL. Although these technologies are several times faster than traditional phone-based modems, they are still limited to 1 to 2 Mb/s of bandwidth at best, substantially slower than standard LAN technologies such as 10 Mbit and 100 Mbit Ethernet.
Figure 2.3 illustrates the bandwidth heterogeneity in an actual deployed network.
2.2 Application demands on storage systems
Traditional as well as emerging applications make three important demands on storage systems. First, application storage requirements are growing at a rapid pace, often reaching 100% per year. Second, continuous data availability remains a pressing demand for most organizations. Third, many emerging applications employ algorithms that require high-bandwidth access to secondary storage.
2.2.1 Flexible capacity
Storage requirements are growing at a rapid pace. The explosive growth of electronic commerce is generating massive archives of transaction records documenting customer behavior and history, as well as inter-business commerce. Medical applications such as image databases of treated patient cells are generating massive archives of multimedia objects. Scientific applications continue to generate massive repositories of geological and astronomical data. A large and growing number of applications, from real-time astronomy to business and web data mining to satellite data storage and processing, require massive amounts of storage. This storage must be distributed across storage devices to provide the necessary bandwidth and reliability. For example, the storage of audio information from 10 radio stations for one year requires 1 TB of disk space. The accumulation of video information from one station can fill up to 5 TB in a single year. The average database size of a radiology department in a hospital is over 6 TB [Hollebeek, 1997].
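The audio figure above is easy to sanity-check. Assuming, hypothetically, that each station is archived as a compressed stream of roughly 32 kb/s (the bitrate is an assumption, not stated in the source):

```python
# Rough check of the 10-station audio archive estimate, assuming a
# hypothetical ~32 kb/s compressed stream per station.
stations = 10
bitrate_bps = 32 * 1000                  # assumed encoding rate, bits per second
seconds_per_year = 365 * 24 * 3600

bytes_per_year = stations * bitrate_bps / 8 * seconds_per_year
print(bytes_per_year / 1e12)             # ~1.26 TB, consistent with the 1 TB estimate
```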
Figure 2.4: Annual average storage capacity growth by application. The storage needs of multimedia and data warehousing applications are increasing at factors of three and five per year. The data is from 1997 [EnStor, 1997].
Figure 2.4 shows that multimedia and data warehousing applications are the most demanding in storage capacity growth. Like many emerging data-intensive applications, these applications often do not use relational databases but instead use filesystems or other custom non-relational data stores. Search, multimedia and data mining represent three important and common data-intensive applications.
2.2.2 High availability
Down-time is increasingly unacceptable in on-line services. Table 2.5 shows the average cost of down-time in dollars per minute for various traditional applications. Costs are expected to be higher with the increasing importance of on-line electronic commerce, where customers can immediately turn to the next service provider. The importance of data availability requires that all storage be replicated or protected by redundant copies. Furthermore, the need to recover from site disasters requires that data be remotely replicated or “backed up” regularly. It is crucial that data remain accessible at acceptable latencies during these management operations.
2.2.3 High bandwidth
Many web and business applications are fundamentally concerned with extracting patterns from large amounts of data. Whether it is a pattern in customer shopping preferences, or a pattern in
Application                   Cost of down-time ($ per minute)
ERP                           13,000
Supply Chain Management       11,000
Electronic Commerce           10,000
ATM/EFT/POS                   8,500
Internet Banking              7,000
Universal Personal Services   6,000
Customer Service Centre       3,700
Table 2.5: Average cost of down-time for various applications. The actual cost varies depending on the particular organization. ERP stands for enterprise resource planning software, which performs several tasks such as sales and billing, inventory management and human-resource-related processing. The numbers are reported by Stratus Technology [Jones, 1998].
links among home pages on the web, many emerging applications are concerned with extracting “patterns” or “knowledge” from massive data sets with little structure. This translates into multiple passes over the data to test, refine and validate hypotheses.
Multimedia applications
The growing size of data sets is making search a frequent and important operation for a large fraction of users. From email messages to employee records to research papers and legal documents, search is probably one of the most frequently executed operations. While indexing can help reduce the amount of data that must be read from secondary storage for some applications, it is not effective for searching emerging multimedia and temporal databases, or data archives without a structured schema. Image databases, for example, often support queries by image content. Usually, the interesting features (e.g. color, texture, edges) of the image are extracted and mapped onto feature vectors which represent the “fingerprint” of the image. Each image or multimedia object is associated with a feature vector, which is stored when the image is entered in the database. Feature vectors are used for content-based search. Performing the search in the “feature space” reduces the need to access “raw” objects. A feature vector contains several numerical attributes. The number of attributes in the vector is known as the “dimensionality” of the vector and, correspondingly, of the database.
The dimensionality of the feature vector is usually large because multimedia objects often have several interesting features, such as color, shape, and texture, and because each feature is usually represented by several numerical attributes. For example, color can be represented by the percentage of blue, red and green pixels in the image. Research on indexing multidimensional and multimedia databases has made significant strides to improve access latencies. Grid-based and tree-based schemes such as R-trees and X-trees have been proposed to index multidimensional databases. Tree-based data structures generalize the traditional B-tree index by splitting the database into overlapping regions. However, as the number of attributes that a query is conditioned on increases, the effectiveness of these indexing schemes becomes severely limited. As dimensionality increases, a range or nearest-neighbor query requires accessing a relatively large portion of the data set [Faloutsos, 1996]. Because disk mechanics heavily favor sequential scanning over random accesses, an indexing data structure that requires many random disk accesses often performs worse than a sequential scan of the data set. Thus, while these data structures have proven effective for low-dimensionality data, they are of little value in indexing higher-dimensionality data. Simple sequential scanning of the feature vectors therefore becomes preferable to index-based search, since sequential disk access bandwidths are much higher than random access bandwidths. The inherent large dimensionality of multimedia objects and the tendency of search queries to condition on many dimensions at once is known as the “dimensionality curse”. Temporal databases, which store sequences of time-based samples of data, such as video frames or daily stock quotes, are also plagued with the dimensionality curse.
These databases often support "query by example" interfaces, where the user provides an example sequence and asks for similar ones in the database. For example, a broker may search for all stocks that moved similarly over a given time period. A stock is often represented by a high-dimensionality vector, corresponding to the "fingerprint" of the stock in the time or frequency domain [Faloutsos, 1996].
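As an illustration of why scanning dominates at high dimensionality, content-based search over feature vectors reduces to a single sequential pass over the stored fingerprints. The sketch below is illustrative only: the color-histogram fingerprints and the Euclidean distance metric are assumptions, not a scheme from this dissertation.

```python
# Minimal sketch: content-based search as one sequential scan over
# feature vectors. Data layout and distance metric are assumptions.
import math

def nearest(query, feature_vectors):
    """Return (index, distance) of the stored vector closest to query."""
    best_i, best_d = -1, float("inf")
    for i, v in enumerate(feature_vectors):   # one sequential pass
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# Three images fingerprinted by (fraction red, fraction green, fraction blue).
db = [(0.7, 0.2, 0.1), (0.1, 0.1, 0.8), (0.3, 0.3, 0.4)]
idx, dist = nearest((0.65, 0.25, 0.1), db)
```

Because the scan touches every vector exactly once and in storage order, its cost is one sequential read of the data set, which is precisely the access pattern disks favor.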
Data mining
Businesses are accumulating large amounts of data from daily operation that they would like to analyze and "mine" for interesting patterns [Fayyad, 1998]. For instance, banks archive records of daily transactions, department stores archive records of point-of-sale transactions, and online-catalog servers archive customer browsing and shopping patterns. This information contains interesting patterns and "knowledge nuggets" about customer profiles that management would like to discover and exploit. Typical data mining operations search for the likelihood of certain hypotheses by analyzing the data for the frequency of occurrence of certain events. Most data mining algorithms require sequential scanning, often making multiple passes over the data. One example of a data mining application is one that discovers association rules in sales transactions [Agrawal and Srikant, 1994]. Given customer purchase records, the application extracts rules of the form "if a customer purchases items A and B, then they are also likely to purchase item X." This information enables stores to optimize inventory layout. Data mining applications comprise several components, some of which are data-intensive (data parallel), while others are more memory- and CPU-intensive. Frequent sets counting applications, for example, count the occurrences of pairs of items, triplets, four-tuples, etc. in consecutive passes. The later passes access the secondary storage system less and less, consuming more CPU as their working sets fit in main memory.
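The pair-counting pass of a frequent-sets computation can be sketched as a single sequential scan over the purchase records, in the style of [Agrawal and Srikant, 1994]. The basket data below is an illustrative assumption.

```python
# Hedged sketch of one frequent-pair counting pass: scan the
# transactions sequentially and count co-occurring item pairs.
from itertools import combinations
from collections import Counter

def count_pairs(transactions):
    counts = Counter()
    for items in transactions:            # one sequential pass over the data
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

# Illustrative purchase records; pairs bought together often suggest
# rules such as "if a customer purchases A and B, they also buy X".
baskets = [{"A", "B", "X"}, {"A", "B"}, {"A", "B", "X"}, {"B", "X"}]
counts = count_pairs(baskets)
```

Each later pass (triplets, four-tuples) scans the same data again but maintains larger in-memory counting state, which is why later passes shift from being I/O-bound to CPU-bound.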
Summary
Emerging important applications are data-intensive, comprising at least one component which sequentially scans the data set. Such applications require high-bandwidth storage systems. They also place pressure on interconnects by processing large amounts of data. They would, therefore, benefit from techniques that minimize data movement, especially over slow and overloaded networks.
2.3 Redundant disk arrays
Emerging applications require massive data sets, high bandwidth and continuous availability. This dissertation proposes a storage system architecture that provides scalable capacity and bandwidth while maintaining high availability. This research builds on previous work in high-availability and high-bandwidth storage systems, in particular disk arrays and transactional systems. The following sections provide the necessary background on these two topics. Traditionally, filesystems were contained completely on a single storage device. To allow filesystems to be larger than a single storage device, logical volume managers concatenate many storage devices under the abstraction of a single logical device. Logical volume managers perform a part of the storage management function, namely the aggregation of multiple devices to look like a single one, while employing a simple round-robin load balancing strategy across the devices. A logical volume manager is typically implemented in software as a device driver, and has at most a small effect on performance.
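The concatenation performed by a logical volume manager can be sketched as a simple address computation. Equal-sized devices are an illustrative assumption; real volume managers handle heterogeneous device sizes.

```python
# Hedged sketch of logical volume concatenation: the logical block
# space is the devices laid end to end. Equal device sizes assumed.
def map_concat(logical_block, device_size, num_devices):
    """Map a logical block to (device index, physical block) by concatenation."""
    device, physical = divmod(logical_block, device_size)
    if device >= num_devices:
        raise ValueError("logical block beyond end of volume")
    return device, physical

# A volume concatenating 3 devices of 1000 blocks each: logical block
# 2500 falls on the third device at offset 500.
location = map_concat(2500, 1000, 3)
```

Because the mapping is a pure arithmetic translation in the I/O path, it adds essentially no overhead, which is why a volume manager has at most a small effect on performance.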
Figure 2.5: Layout of data and parity blocks in a RAID level 5 system. Data blocks within a stripe are protected by a parity block which is the cumulative XOR of the data blocks. The parity block is rotated around the devices in the array. Each write to any of the disks needs to update the parity block. Rotating the parity block balances the parity write traffic across the devices.

In the late 1980s, in order to bridge the access gap between the storage subsystem and the processors, arrays of small, inexpensive disks (RAID) were proposed [Patterson et al., 1988] to replace large expensive disk systems and automate load balancing by striping data [M. Livny and Boral, 1987]. RAID arrays provide the illusion of a single logical device with high small-request parallelism and large-request bandwidth. By storing a partially redundant copy of the data as parity on one of the disks, RAIDs improved reliability in arrays with a high number of components.

2.3.1 RAID level 0

RAID level 0 writes data across the storage devices in an array, one segment at a time. Reading a large logical region requires reading multiple drives. The reads can be serviced in parallel, yielding N-fold increases in access bandwidth. This technique is known as "striping". Striping also balances load across the storage devices. When a volume is striped across N devices, random accesses to the volume yield balanced queues at the devices. Striping is known to remove hot spots if the stripe unit size is much smaller than the typical workload hot spot size.

2.3.2 RAID level 1

RAID 0 is not fault-tolerant. A disk failure results in data loss. RAID 1 writes a data block to two storage devices, essentially replicating the data. If one device fails, data can be retrieved from the replica. This process is also called "mirroring." Mirroring yields higher availability but imposes a high capacity cost. Only half the capacity of a storage system is useful for storing user data.
When replication is used, reads can be serviced from both replicas, and load can be balanced by selecting one copy at random, through round-robin selection, or by selecting the disk with the shortest queue.

Figure 2.6: RAID level 5 host writes and reads in the absence of faults. An arrow directed towards a device represents a physical device write, while an arrow originating at a device represents a read from the device. The + operation represents bitwise XOR. Different protocols are used depending on the number of units updated: a large write where all units are updated is shown in (a), a read-modify-write where fewer than half of the data units in a stripe are updated is shown in (b), and a reconstruct write where the number is larger than half is shown in (c). Some update protocols, like those shown in (b) and (c), consist of two phases separated by a synchronization point at the host. I/Os labelled (1) are performed in a first phase. At the end of this phase, the new parity is computed, and the parity and data are updated in a second phase (arrows labelled (2)).

2.3.3 RAID level 5

RAID level 5 employs a combination of striping and parity checking. The use of parity checking provides redundancy without the 100% capacity overhead of mirroring. In RAID level 5, a redundancy code is computed across a set of data blocks and stored on another device in the group. This allows the system to tolerate any single self-identifying device failure by recovering data from the failed device using the other data blocks in the group and the redundancy code [Patterson et al., 1988]. The block of parity that protects a set of data units is called a parity unit. A set of data units and their corresponding parity unit is called a parity stripe.
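The parity relationship just defined can be sketched directly: the parity unit is the cumulative XOR of the stripe's data units, and XORing the parity with the surviving units regenerates a lost unit. The byte-string block representation below is an illustrative assumption.

```python
# Hedged sketch of RAID level 5 parity arithmetic over one stripe.
from functools import reduce

def xor_blocks(*blocks):
    """Bitwise XOR of equal-sized blocks, as used for RAID level 5 parity."""
    return bytes(reduce(lambda x, y: x ^ y, bs) for bs in zip(*blocks))

# Parity over a three-unit stripe (illustrative one-byte "blocks").
d0, d1, d2 = b"\x0f", b"\xf0", b"\x33"
parity = xor_blocks(d0, d1, d2)

# Single-failure recovery: the lost unit is the XOR of the parity
# unit and the surviving data units.
recovered_d0 = xor_blocks(parity, d1, d2)
```

The same identity underlies all of the update protocols that follow: any write must leave the parity unit equal to the XOR over the stripe's data units.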
Figure 2.5 depicts the layout of blocks in a RAID level 5 array. Write operations in fault-free mode are handled in one of three ways, depending on the number of units being updated. In all cases, the update mechanisms are designed to guarantee the property that after the write completes, the parity unit holds the cumulative XOR over the corresponding data units. In the case of a large write (Figure 2.6(a)), since all the data units in the stripe are being updated, parity can be computed by the host as the XOR of the data units, and the data and parity blocks can be written in parallel.

Figure 2.7: RAID level 5 host writes and reads in the presence of a fault. An arrow directed towards a device represents a physical device write, while an arrow originating at a device represents a read from the device. The + operation represents bitwise XOR. A device marked with an X represents a failed device. Different write protocols are used depending on whether the blocks on the failed device are being updated or not. If the failed device is being updated, the blocks on the surviving devices are read in a first phase, and the new parity is computed by XORing these blocks with the block that was intended to be written to the failed device. As a result of the parity update, the update of the failed device is reflected in the parity device. This algorithm is shown in (a). If the failed device is not being updated, then the read-modify-write algorithm shown in (b) is used. A read request that touches the failed device is serviced by reading all the blocks in the stripe from the surviving devices and XORing them together to reconstruct the contents of the failed device (c).

If fewer than half of the data units in a stripe are being updated, the read-modify-write protocol is used (Figure 2.6(b)).
In this case, the prior contents of the data units being updated are read and XORed with the new data about to be written. This produces a map of the bit positions that need to be toggled in the parity unit. These changes are applied to the parity unit by reading its old contents, XORing it with the previously generated map, and writing the result back to the parity unit. Reconstruct-writes (Figure 2.6(c)) are invoked when the number of data units being updated is more than half of the number of data units in a parity stripe. In this case, the data units not being updated are read and XORed with the new data to compute the new parity. Then, the new data units and the parity unit are written. If a device has failed, the degraded-mode write protocols shown in Figure 2.7(a) and Figure 2.7(b) are used. Data on the failed device is recomputed by reading the entire stripe and XORing the blocks together, as shown in Figure 2.7(c). In degraded mode, all operational devices are accessed whenever any device is read or written.

2.4 Transactions

This dissertation proposes a scalable storage architecture based on giving clients direct access to storage devices over switched networks. In such a system, multiple clients can be concurrently accessing shared devices. Proper algorithms must be devised to ensure correctness in the presence of concurrent shared accesses, and also to ensure progress in the event of untimely failures of clients and devices. For the storage system to be scalable, these protocols must also scale well with system size, avoiding centralized bottlenecks. Transactions were developed for fault-tolerant, concurrent-access database systems, and later applied in the construction of other fault-tolerant distributed systems. Chapter 4 proposes an approach based on storage-specialized transactions and on distributed protocols that exploit trends towards increased device intelligence to distribute control work to the endpoints, avoiding centralization.
Because Chapter 4 builds on previous work in database transaction theory, a brief review of the main concepts in that field is necessary. This section reviews transaction theory as background for the discussions in Chapter 4. It may be skipped by readers comfortable with transactions and with database concurrency control and recovery algorithms. Database systems were developed to support mission-critical applications requiring highly concurrent and highly available access to shared data by a large number of users. A database system comprises two components that are responsible for this: a concurrency control component and a recovery component. Concurrency control refers to the ability to ensure concurrent users consistent access to the data despite the fact that the execution of their operations may be interleaved by the database engine. Recovery refers to the ability of the database to tolerate software and hardware failures in the middle of an operation. A central concept in database systems is that of a transaction, a program unit which performs a sequence of reads and writes to items in the database, with the accesses usually separated by some computation. A transaction is guaranteed to have strong correctness properties in the face of concurrency and failures, and is the unit on which the database system provides these properties.

2.4.1 Transactions

Database transactions [Gray et al., 1975, Eswaran et al., 1976] have four core properties, widely known as the ACID properties [Haerder and Reuter, 1983, Gray and Reuter, 1993]:

Atomicity: This is the "all or nothing" property of transactions. A transaction's effects will either be completed in their entirety or none of them will reach the database, even in the case of untimely failures.

Consistency: This property asserts that transactions must preserve high-level integrity constraints with respect to the data in the database.
These constraints depend on the contents and customer use of the particular database.

Isolation: This property concerns the concurrent execution of transactions. It states that intermediate changes of concurrent transactions must not be visible to each other, providing each transaction with the illusion that it is executing in isolation.

Durability: Durability means that the impact of committed transactions cannot be lost.

An example transaction is shown below. The transaction transfers a specified amount from account source_ac to account dest_ac. The transaction must first make sure that the source account has a sufficient balance. Then, the source account is decremented and the destination account is credited by the same amount. In the event that a failure occurs after the source account has been debited, but before the transaction has completed, the transaction mechanism ensures that the statements between the T.begin and the T.commit will either execute in their entirety, or none of their effects will reach the database (atomicity). Furthermore, once the transaction completes with a successful commit, its updates to the database can not be undone or lost (durability). This transaction does not result in loss or creation of money for the institution, since it transfers the same amount from the source account to the destination account, maintaining the invariant that the sum over all accounts remains fixed (consistency). Note that consistency depends on the semantics of the operations performed by the program and the semantics of the database. However, the atomicity and durability properties can be enforced by the database without knowledge of the particular semantics of the transaction.

1  T1.begin
2  if (source_ac.balance > amount)
3      source_ac.balance = source_ac.balance - amount;
4      dest_ac.balance = dest_ac.balance + amount;
5  else
6      T1.abort
7  T1.commit
2.4.2 Serializability

Database systems must execute transactions concurrently to meet demanding throughput requirements. This concurrency must not result in incorrect behavior. To reason about the correctness of concurrent transactions, we can abstract the details of the transaction program and focus on the accesses that it makes to the database. A transaction is described simply as a sequence of read and write requests to data items in the database. For example, the account transfer transaction above issues a read to the source account (Line 2) to make sure it has enough funds, then issues a write to decrement the source account by the transferred amount (Line 3), and finally issues a write to the destination account to increment it by the transferred amount (Line 4).

Accesses to data items by a transaction are denoted by ri[x] and wi[x], where x is the data item being accessed. More precisely, wi[x] denotes a write by transaction Ti to data item x. ci and ai denote a commit and an abort of transaction Ti, respectively. Using this terminology, a transaction that transfers funds from account x to account y and then finally commits can be described as:

    T1: r1[x] w1[x] r1[y] w1[y] c1

while a transaction that eventually aborts can be described as:

    T1: r1[x] a1

In this case, the transaction aborts after reading the source account and discovering that it does not have enough funds to cover the amount that needs to be transferred to the destination account y. Concurrent executions of transactions are described by execution histories, which show how the accesses performed by the transactions interleave with each other. An example history showing two transactions is H1:

    H1: r1[x] w1[x] r1[y] w1[y] c1 r2[x] w2[x] r2[z] w2[z] c2

H1 shows two transactions, T1 and T2. T1 transfers a given amount from source account x to destination account y. T2 transfers another amount from source account x to destination account z. H1 is called a serial history. A serial execution history is one where only one transaction is active at a time, and a transaction starts and finishes before the next one starts. Indeed, in the history above, T1 executes to completion and commits before T2 starts.

The traditional correctness requirement for achieving isolation of database transactions is serializability [Papadimitriou, 1979]. A history is said to be serializable if its results are equivalent to a serial execution. The notion of equivalence is often defined in terms of the "reads-from" relationship. A transaction Tj is said to "read from" Ti if Tj reads a data item x after Ti has written it, and there is no other access to x between wi[x] and rj[x]. Two executions H and H' are equivalent if they contain the same set of transactions, if they establish the same "reads-from" relationships between transactions, if the final value of each data item in the database is written by the same transaction in both histories (that is, if Ti is the last transaction to write to x in one history, it is also the last one to write to x in the other), and if the transactions that commit (abort) in one also commit (respectively abort) in the other. For example, the following execution H2 is serializable:

    H2: r1[x] w1[x] r2[x] w2[x] r1[y] w1[y] c1 r2[z] w2[z] c2

H2 is serializable because it is equivalent to a serial history, namely H1 above. In both histories, T1 and T2 eventually commit. The final value of x and the final value of z are both written by transaction T2 in both histories. Finally, both histories establish the same reads-from relationship: in both histories, T2 reads from T1. Using this definition of serializability, the execution below is not serializable:

    H3: r1[x] r2[x] w1[x] r1[y] w1[y] c1 w2[x] r2[z] w2[z] c2

H3 is not serializable because there is no equivalent serial history that satisfies all the conditions of equivalence described above. In particular, T2 does not read from T1, so T2 must appear before T1 in an equivalent serial history. At the same time, however, the final value of data item x is written by T2, so T2 must appear after T1 in an equivalent serial history. Hence, no such serial history can exist.

2.4.3 Serializability protocols

Serializability has traditionally been ensured using one of three approaches: locking, optimistic, and timestamp ordering methods. Each is a set of rules for when an access can be allowed, delayed, aborted or retried.

Batch locking

There are two widely used locking variants that achieve serializability: batch locking and two-phase locking. Batch locking is the most conservative approach. Locks on shared data items are used to enforce a serial order on execution. All locks are acquired when a transaction begins and released after it commits or aborts. It can be shown that the executions allowed by this locking scheme are serializable. Precisely, the executions are equivalent to a serial one where transactions appear in the order of their lock acquisition time. Batch locking has the advantage that it is simple to implement. Furthermore, it does not lead to deadlocks because all locks are acquired at the same time. Therefore, there is no "hold-and-wait", a necessary condition for deadlocks to occur [Tanenbaum, 1992]. The disadvantage of this scheme is that it severely limits concurrency, since locks blocking access to data from other transactions are potentially held for a long period.

Two-phase locking

Significant concurrency can be achieved with two-phase locking [Gray et al., 1975], a locking scheme where locks may be acquired one at a time, but no lock can be released until it is certain that no more locks will be needed.
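The two-phase discipline just stated can be sketched as follows. The single-threaded lock-table representation is an illustrative assumption; a real lock manager blocks the requester on conflict, while this sketch raises an error instead.

```python
# Hedged sketch of two-phase locking: locks are acquired one at a time
# as items are first touched (growing phase) and released only at
# commit or abort (shrinking phase).
class Transaction:
    def __init__(self, name, lock_table):
        self.name = name
        self.table = lock_table        # shared map: item -> owner name
        self.held = set()

    def lock(self, item):
        owner = self.table.get(item)
        if owner is not None and owner != self.name:
            # a real lock manager would block here (and risk deadlock)
            raise RuntimeError(f"{self.name} would block on {item}")
        self.table[item] = self.name   # growing phase
        self.held.add(item)

    def commit(self):
        for item in self.held:         # shrinking phase: release everything
            del self.table[item]
        self.held.clear()

table = {}
t1 = Transaction("T1", table)
t1.lock("x")
t1.lock("y")
t1.commit()
```

Because no lock is dropped before commit, any transaction that observes one of T1's updates is ordered entirely after T1, which is what makes the allowed executions serializable.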
Two-phase locking allows higher concurrency than batch locking, although it is still susceptible to holding locks for a long period of time while waiting for locks held by other transactions. This is because locks are acquired when an item is first accessed and must be held until the last lock that will be needed is finally acquired. Unlike batch locking, two-phase locking is susceptible to deadlocks. Consider two transactions T1 and T2 that both want to write to two data items x and y. Suppose that T1 acquired a lock on x and then T2 acquired a lock on y. Now, T1 requests the lock on y and T2 requests the lock on x. Both transactions will block waiting for the other one to release the other lock. Because of the two-phase nature of the locking protocol, a transaction cannot release any locks until it has no more locks to acquire. So, neither transaction will release the lock needed by the other, resulting in a deadlock. When two-phase locking is employed, a deadlock avoidance or detection and resolution strategy must also be implemented. Deadlocks can be prevented by ordering all the possible locks and then enforcing a discipline that requires that locks be acquired in that order. This breaks the "circular wait" condition that is a prerequisite for deadlocks to occur. Deadlocks can also be detected and resolved by aborting the appropriate blocked transactions. Deadlocks can be detected by recording the dependencies, if all dependencies can be enumerated and if the overhead of maintaining the dependency information is not prohibitive. Distributed systems can exhibit deadlocks involving transactions at multiple nodes, requiring dependency information distributed across sites to be collected and merged to discover a deadlock. In this case, a simple timeout-based scheme is likely to work better.
If a transaction cannot acquire a lock within a pre-specified period of time (the timeout), it is suspected of being involved in a deadlock and is therefore aborted and restarted.

Optimistic methods

Locking is also known as a pessimistic approach, since it presumes that contention is common and that it is worthwhile to lock every data item before accessing it. Optimistic methods are so called because they assume conflict is rare; they do not acquire locks before accessing shared data but instead validate at commit time that a transaction's execution was serializable [Eswaran et al., 1976, Kung and Robinson, 1981]. If a serializability violation is detected, the transaction is aborted and restarted. Optimistic protocols are desirable when locking and unlocking overhead is high (e.g. if it involves network messaging), when conflicts are rare, or when resources are plentiful and would be otherwise idle (e.g. multiprocessors).

Timestamp ordering

Timestamp ordering protocols select an a priori order of execution using some form of timestamps and then enforce that order [Bernstein and Goodman, 1980]. Most implementations verify timestamps as transactions execute read and write accesses to the database, but the more optimistic variants delay the checks until commit time [Adya et al., 1995]. In the simplest timestamp ordering approach, each transaction is tagged with a unique timestamp at the time it starts. In order to verify that reads and writes are proceeding in timestamp order, the database tags each data item with a pair of timestamps, rts and wts, which correspond to the largest timestamp of a transaction that read and wrote the data item, respectively. Basically, a read by transaction T with timestamp opts(T) to data item v is accepted if opts(T) >= wts(v); otherwise it is (immediately) rejected. A write is accepted if opts(T) >= wts(v) and opts(T) >= rts(v).
If an operation is rejected, its parent transaction is aborted and restarted with a new, larger timestamp. To avoid the abort of one transaction causing additional transactions to abort, a situation known as cascading aborts, reads are usually not allowed to read data items previously written by active (uncommitted) transactions. In fact, when an active transaction wants to update a data item, it first submits a "prewrite" to the database declaring its intention to write, but without actually updating the data. The database accepts a prewrite only if opts(T) >= wts(v) and opts(T) >= rts(v). A prewrite is committed when the active transaction T commits, and a write is then issued for each submitted prewrite. Only then is the new value updated in the database and made visible to readers. A transaction that issued a prewrite may abort, in which case its prewrites are discarded and any blocked requests are inspected in case they can be completed. If a prewrite with timestamp opts(T) has been provisionally accepted but not yet committed or aborted, any later operations from a transaction T' with opts(T') >= opts(T) are blocked until the prewrite with opts(T) is committed or aborted. To precisely describe the algorithms executed by the database upon the receipt of a read, prewrite or write request, a few more variables must be defined. To simplify the presentation of the algorithms, we will assume that the database contains a single data item, say v. Each data item has a queue of pending requests. We denote by minrts(v) the smallest timestamp of a queued read to data item v, while minpts(v) represents the smallest timestamp of a queued prewrite to v.
The algorithm executed upon the receipt of a prewrite request can be described as follows:

    prewrite(v, opts)
        if (opts < rts(v)) then
            return REJECT;          // opts is too far in the past
        else if (opts < wts(v)) then
            return REJECT;          // opts is too far in the past
        else
            // opts is bigger than the timestamp of any
            // transaction that read or wrote this item:
            // accept and put it on the prewrite queue
            enqueue(prewrite, v, opts);
            minpts(v) = MIN(minpts(v), opts);
            return(ACCEPT);

Writes are always accepted because the corresponding prewrite was already accepted. Writes may be queued until reads with lower timestamps are serviced. When a write is processed, its corresponding prewrite is removed from the service queue. Processing a write could increase minpts(v), resulting in some reads being serviced. By setting a special flag, the write algorithm triggers the database to inspect any requests queued behind the associated prewrite.

    write(v, opts)
        if (opts > minrts(v) || opts > minpts(v)) then
            // there is a read ahead of us in the queue,
            // so we wait so we can service it before
            // we update the data item and force it to be rejected;
            // or there is a prewrite before us
            enqueue(write, v, opts);
        else
            // write the data item to stable storage, update wts, and
            // set the flag so we scan the queue of waiting requests
            write v to store;
            dequeue(prewrite, v, opts);
            wts(v) = MAX(wts(v), opts);
            inspectQueue = true;

In general, a read can be serviced either immediately after it is received, or it may be queued and serviced later. When a read is processed after being removed from the service queue, minrts(v) could increase and cause some write requests to become eligible for service. When a read is processed, the inspectQueue flag is set, which causes the database to inspect the queue and see if any queued (write) requests can be serviced.
Similarly, when a write is processed, the inspectQueue flag is set so that queued (read and write) requests are inspected for service. Upon inspection of the queue, any requests that can be serviced are dequeued, one at a time, possibly leading to new values of minrts(v) and minpts(v) being computed. Upon the receipt of a request to read data item v by a transaction with timestamp opts, the database executes the following algorithm to decide to service the read immediately, put it on the queue of pending requests, or reject it.

    read(v, opts)
        if (opts < wts(v)) then
            return REJECT;          // opts is too far in the past
        else if (opts > minpts(v)) then
            // there is a prewrite before opts;
            // we cannot accept the read until we know
            // the fate of the prewrite: if the prewrite is
            // later confirmed, this read should return a value
            // more recent than the current value available now
            enqueue(read, v, opts);
            minrts(v) = MIN(minrts(v), opts);
        else
            // there is no prewrite in the queue ahead of opts,
            // so accept, update the timestamp, and set the flag
            // so we scan the queue of waiting writes behind this read
            rts(v) = MAX(rts(v), opts);
            inspectQueue = true;
            return committed value of v;

Figure 2.8 shows an example scenario illustrating how timestamp ordering works. Multiple concurrent transactions are active. They submit read, prewrite and write requests to the database. All the requests address a single data item in the database, denoted by v. The request queue shown in the figure represents the requests which are pending for that data item. Requests are queued if they cannot be handled immediately by the database, as explained above. Initially, v has a wts of 12, and the initial state of the database also shows that a prewrite has been accepted and is pending in the queue. The scenario proceeds as follows. First, the database receives a read whose opts is smaller than wts(v).
This read is rejected because it is supposed to occur before a write which has already occurred. The following read has a timestamp of /ɯ¡«¢ÙÝÞ , later than any action that has or is going to occur. Since there is a prewrite queued with a lower timestamp, this read cannot be serviced yet or the abort of the transaction doing the write would force the abort of this transaction (because it read a value that should have never been written). At this time, the database cannot know whether the prewrite will be confirmed with an actual write, or will be aborted. The read request is therefore put on the request queue behind the prewrite request. Shortly thereafter, a read with /ɯ¡«¢ÙÝ is received. 36 CHAPTER 2. BACKGROUND ø è4àè¥ù¥è¥á¥ç î ç¥ï¥ð¥ç¥á4àá àê 袤¢¤¥ç¥á¥á î þ ç¥ï¥ð¥ç¥á4àìñ¥ð¥ç¥ð¥ç é¥è4àè õ4àç 袡¥é ö þ à¥ç é¥è4àè õ4àç ö þ þ õ4àá àõ ç¥á4àè ë4á ߥç¥è¥é¦¥ êë¶àá4âå4÷¢© ö¨§ ë4ߥç4ô ߥõ4àç ä £¥è¢ 4âå¢ ¥í4÷ àá4âå¥æ ä î¥ú¥û¥ú¥ü¥ý âå¥ó¥í4÷ ö êë¶àá4âå¥ó ß4àá4â¥ã ߥç¥è¥é¦¥ êë¶àá4âå¥í¨© ö¨§ ë4ߥç4ô ߥõ4àç ä ߥç¥è¥é ñ¥ð¥ç¥ð¥ç¥é àá4âå¥æ ä £¥è¢ 4âå¢ ¥í4÷ âå¥ó¥í4÷ ö êë¶àá4âå¥í êë¶àá4âå¥ó ß4àá4â¥ã ߥç¥è¥é¦¥ êë¶àá4âå¥ò¨© ö¨§ ë4ߥç4ô ߥõ4àç ä ߥç¥è¥é àá4âå¥æ ä ¢ âå¥ó¥í4÷ §ìö £¥è¢ 4âå¢ ¥í4÷ âå¥ó¥í4÷ êë¶àá4âå¥í ö êë¶àá4âå¥ó ß4àá4âå¥ò ߥõ4à禥 £¥è¢ 4âå¢ ¥í4÷ ä ö¨§ § êë¶àá4âå¥ó¨© ߥç¥è¥é àá4âå¥ó ä ¢ êë¶àá4âìå¥í âå¢ ¥í4÷ ö ß4àá4âå¥ò àá4âå¥ó ä ¢ âå¢ ¥í4÷ §ìö úþ ë¶àÿ âå¢ ¥í4÷ ö ß4àá4âå¥í ߥç¥è¥é¦¥êë¶àá4âå¥í¨© ø ç¥ï¥ð¥ç¥ð¥ç¥é 袡¥é á¥ç¥ß¢£¥õ¢¤¥ç¥é Figure 2.8: A sample scenario of a database using a timestamp ordering protocol. The database is assumed §Ç §Ç ! to have a single data item Ç . At the outset, is , and is , and the database has accepted a prewrite with timestamp #" . The figure shows changes to the database state and the request queue as requests are received and handled. The hexagonal shapes represent requests and arrows show when a request arrives to the database and when a response is generated. First, a read request with $&%')(* ,+ is , , received and rejected, because -. is . 
The read arrived too late, since a transaction with a later timestamp had already written to x and committed. Later, a read request with a timestamp above the queued prewrite’s is received; because a prewrite with a lower timestamp is in the queue, the read is queued. Some time later, a read with a timestamp between the two is received and serviced, updating rts. Finally, the write corresponding to the prewrite is received. It is serviced first, then the queue is inspected and the queued read request is serviced, updating rts and leaving the queue empty. The reader can verify that the execution allowed by the protocol is indeed serializable; in particular, it is equivalent to a serial execution of the serviced requests in timestamp order: the immediately serviced read, followed by the write, followed by the queued read.

This third read is serviced immediately because its opts exceeds wts(x). The committed value of x is returned in response to this read. After the read is serviced, the write corresponding to the prewrite is received. It is serviced first, then the queue is inspected and the queued read request is serviced, updating rts and leaving the queue empty.

2.4.4 Recovery protocols

The all-or-nothing recovery property provided by database systems is known as atomicity. The recovery subsystem ensures that all transactions meet the atomicity property in the face of failures. There are three kinds of failures which a database system must handle:

Transaction failures. A transaction may decide to voluntarily abort. This abort is usually induced by the programmer or the user when forward progress cannot be made, or might be better achieved by restarting the transaction from the beginning. In the case of a transaction failure, any updates made by the failed transaction to the database must be undone.

System failures. Loss of the contents of volatile memory represents a system failure.
In the event of a system failure, transactions that committed must have their effects applied to the database on recovery. Any other active transactions which did not commit at the time of the failure must have their updates removed from the database to preserve the atomicity property.

Media failures. Loss of the contents of non-volatile storage represents a media failure. This requires restoring the database from an archival version.

A database is usually divided into disk pages, which are cached by a buffer manager. Active transactions access the database pages by reading and writing items in the pages cached by the buffer manager. Atomicity is achieved via some form of “logging.” Logging writes a description of the transaction to stable storage without overwriting the “old” values of the data items that the transaction intends to update. Every time the transaction writes a value in volatile memory, that value is written to the log. When the transaction requests to commit, the values it intends to update are all on the log. A final “commit” record is written to the log to mark the end of the new values written by the committing transaction. Once the commit record is written, the updated values can be written back to stable storage at the buffer manager’s leisure. No values can be written before commit time, however, unless such updates can be undone at abort time. One way logging can be used to achieve atomicity is as follows: if a failure occurs before the “commit” record is written to the log, none of the values are updated in the database and the transaction is considered to have aborted. This achieves atomicity since it enforces the “nothing” effect, where no effects whatsoever from the transaction reach the database.
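This commit-record rule, together with the redo pass described next, can be sketched in a few lines (Python; the in-memory lists are hypothetical stand-ins for stable storage, and the transaction names and values are illustrative):

```python
# Sketch of redo logging: new values go to the log; the database
# pages are updated only for transactions whose commit record made
# it to the log. Two structures stand in for stable storage.

log = []                  # append-only "stable" log
db = {"a": 1, "b": 2}     # database pages before the crash

def write(txn, key, value):
    log.append((txn, key, value))   # redo record: the new value

def commit(txn):
    log.append((txn, "COMMIT"))     # commit point

def recover(log):
    # Redo pass: apply only the updates of committed transactions.
    committed = {rec[0] for rec in log if rec[1:] == ("COMMIT",)}
    recovered = dict(db)
    for rec in log:
        if len(rec) == 3 and rec[0] in committed:
            txn, key, value = rec
            recovered[key] = value
    return recovered

write("T1", "a", 10); commit("T1")
write("T2", "b", 20)        # T2 crashes before its commit record
print(recover(log))         # -> {'a': 10, 'b': 2}
```

T1’s update survives because its commit record is on the log; T2’s update is discarded, enforcing the “nothing” effect for the uncommitted transaction.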
If a failure occurs after the “commit” record is written to the log (the commit point) but before all the changes are applied to the data items, the new values can be applied to the database by reading them from the log, thereby achieving the “all” effect and, again, atomicity. This logging technique is called Redo logging. It requires that a transaction that has not yet committed never write to stable storage. This requirement constrains the buffer manager, which cannot, for example, evict a dirty page (one written by an uncommitted transaction) by writing it back to its original location on disk. Other techniques such as Undo and Undo/Redo logging [Gray and Reuter, 1993] also achieve atomicity but dictate different constraints on the buffer manager. An Undo record logs the current committed value of the item, allowing the buffer manager to overwrite the database with an uncommitted value and still ensure that, on abort, the effect of the uncommitted transaction can be removed. Undo/Redo logging places the fewest constraints on the buffer manager regarding when blocks can be written back to the database, at the expense of more logging work.

2.5 Summary

Storage requirements are growing at a rapid pace, often exceeding 100% per year. This rapid growth is fueled by the explosion in the amount of business data which must be archived for later analysis, by the need to store massive data sets collected from ubiquitous sensors and from satellites, and by the popularity of new data types, such as audio and video, that inherently contain more information. Storage systems are still predominantly based on magnetic disks, whose sequential transfer rates have been improving consistently over the past decade at an average of 40% per year. Random access times, on the other hand, dominated by mechanical seek positioning, have improved much more slowly. Processors have maintained impressive rates of improvement, doubling in speed every 18 months.
Disk technology did not keep up, resulting in a severe “I/O gap.” To bridge the gap between processing and I/O subsystems, disk arrays were introduced. Data is striped across multiple disks, yielding higher bandwidth from parallel transfers. Disk arrays rely on a single controller to implement the striping and RAID functions.

The point-to-point bandwidth available to a communicating pair of nodes in a high-end cluster network has improved at an average of roughly 45% per year, from 10 Mb/s in 1980 to 1 Gb/s in 2000. Furthermore, the aggregate bandwidth available to a small 8-node cluster has jumped from 10 Mb/s in the 1980s to 8 Gb/s in 2000, thanks to switching technology that enables simultaneous communication between independent pairs of nodes without degradation in bandwidth. Machine room networks often use reliable switched networks such as Fibre-Channel, Myrinet and Gigabit Ethernet to build high-performance servers from clusters of commodity PCs and devices.

The increasing transfer rates of storage devices place strong pressure on the single disk array controller to deliver high bandwidth to streaming applications. The convergence of peripheral storage and inter-processor networks points to an architecture that eliminates the single controller bottleneck and allows direct access from multiple hosts to shared storage devices over switched networks. Such a shared storage system promises scalable bandwidth but introduces new challenges in ensuring correctness and availability in the presence of sharing and concurrency. Chapters 3 and 4 describe such an architecture and the algorithms required to ensure correctness and scalability.

The consistent increases in processing rates have resulted in processors of varying speeds coexisting in the same computer system. Storage clients, smart devices, and disk array controllers all tend to have disparate amounts of processing power and memory, which vary between sites.
Networks continue to be quite heterogeneous in their bandwidth because of complex topologies, cost pressures and the coexistence of several technology generations. This makes the static partitioning of function across the nodes of a distributed storage system, without regard to communication patterns and resource availability, undesirable in practice. Chapter 5 proposes an approach to automatically decide on optimal function placement in a distributed storage system.

Chapter 3

Network-attached storage devices

This chapter reviews the argument made by the NASD group at Carnegie Mellon [Gibson et al., 1997b, Gibson et al., 1998], which states that current storage architectures do not scale cost-effectively because they rely on file server machines to copy data between storage devices and clients. Unlike networks, file server machines must perform critical functions in the midst of the data path and therefore must be custom-built (and expensive) to meet the demands of bandwidth-hungry applications. Gibson et al. [Gibson et al., 1997b, Gibson et al., 1998] demonstrated that separating filesystem control messaging from data transfers improves scalability and reduces cost by eliminating the file server from the data access path. Based on this architecture, this chapter proposes a storage service where storage striping, aggregation and fault-tolerance functions are also distributed to the storage clients and the storage devices, thereby eliminating the need for the synchronous involvement of a centralized server machine.

The chapter is organized into two parts. The first part restates the principles developed by the NASD group working at Carnegie Mellon from 1995 to 1999. In particular, Section 3.1 summarizes the trends that mandate changing the current server-attached disk (SAD) architecture. Section 3.2 describes two network-attached storage architectures, network-attached SCSI (NetSCSI) and network-attached secure disks (NASD), while Section 3.3 discusses the NASD architecture in particular. A validation of the potential benefits of the NASD architecture was reported elsewhere [Gibson et al., 1997b, Gibson et al., 1998] and is only briefly reviewed in this chapter. The second part of the chapter reports on a striping storage service on top of NASDs. NASD enables clients to benefit from striped parallel transfers from multiple devices over switched networks. Section 3.4 describes a prototype storage service, Cheops, which delivers on that promise. Section 3.5 reports on the performance of bandwidth-hungry applications on top of Cheops/NASD. Section 3.6 describes alternative storage architectures that have been proposed to provide scalable storage. Section 3.7 summarizes the chapter.

3.1 Trends enabling network-attached storage

Network-attached storage devices enable direct data transfer between client and storage without invoking the file server in common data access operations. This requires significant changes both to hosts and storage devices. Such changes are becoming possible and compelling, however, thanks to the confluence of several overriding factors: (1) the cost-ineffective scaling of current storage architectures; (2) the increasing object sizes and data rates of many applications; (3) the availability of new attachment technologies; (4) the convergence of peripheral and interprocessor switched networks; and (5) an excess of on-drive transistors.

3.1.1 Cost-ineffective storage systems

Distributed filesystems [Sandberg et al., 1985, Howard et al., 1988] have been widely used to allow data to be stored and shared in a networked environment. These solutions rely on file servers as a bridge between storage and client networks.
In these systems, the server receives data on the storage network, encapsulates it into client protocols, and retransmits it on the clients’ network. This architecture, referred to as server-attached disk (SAD), is illustrated in Figure 3.1. Clients and servers share a network, and storage is attached directly to general-purpose workstations that provide distributed file services. This is costly both because the file server must have enough resources to handle bandwidth-hungry clients and because, when multiple servers are used, system administrators must manage capacity and load balancing across them.

Figure 3.1: Server-attached disks (SAD) are the familiar local area network distributed file systems. A client wanting data from storage sends a message to the file server (1), which sends a message to storage (2), which accesses the data and sends it back to the file server (3), which finally sends the requested data back to the client (4). Server-integrated disk (SID) is logically the same except that hardware and software in the file server machine may be specialized to the file service function.

While microprocessor performance is increasing dramatically and raw computational power would not normally be a concern for a file server, the work done by a file server is data- and interrupt-intensive and, with the poorer locality typical of operating systems, faster processors will provide much less benefit than their cycle time trends promise [Ousterhout, 1990, Chen and Bershad, 1993]. Typically, distributed file systems employ client caching to reduce this server load. For example, AFS clients use local disk to cache a subset of the global system’s files. While client caching is essential for high performance, increasing file sizes, computation sizes, and work-group sharing are all inducing more misses per cache block [Ousterhout et al., 1985, Baker et al., 1991]. At the same time, increased client cache sizes are making these misses more bursty.

When the post-client-cache server load is still too large, it can either be distributed over multiple servers or satisfied by a custom-designed high-end file server. Multiple-server distributed file systems attempt to balance load by partitioning the namespace and replicating static, commonly used files. This replication and partitioning is too often ad hoc, leading to the “hotspot” problem familiar from multiple-disk mainframe systems [Kim, 1986] and requiring frequent user-directed load balancing. Not surprisingly, custom-designed high-end file servers more reliably provide good performance, but can be an expensive solution [Hitz et al., 1990, Drapeau et al., 1994]. Since file server machines often do little other than service distributed file system requests, it makes sense to construct specialized systems that perform only file system functions and not general-purpose computation. This architecture, called server-integrated disk (SID), is not fundamentally different from SAD. Data must still move through the server machine before it reaches the network, but specialized servers can move this data more efficiently than general-purpose machines.
Since high-performance distributed file service benefits the productivity of most users, this server-integrated disk architecture occupies an important market niche [Hitz et al., 1990, Hitz et al., 1994]. However, this approach binds storage to a particular distributed file system, its semantics, and its performance characteristics. For example, most server-integrated disks provide NFS file service, whose inherent performance has long been criticized [Howard et al., 1988]. Furthermore, this approach is undesirable because it does not enable distributed file system and storage technology to evolve independently. Server striping, for instance, is not easily supported by any of the currently popular distributed file systems. Binding the storage interface to a particular distributed file system hampers the integration of such new features [Birrel and Needham, 1980].

3.1.2 I/O-bound large-object applications

Traditionally, distributed file system workloads have emphasized small accesses and small files, whose sizes are growing though not dramatically [TPC, 1998, Baker et al., 1991]. However, new workloads are much more I/O-bound. Examples include evolving data types such as multimedia, audio and video. In addition to richer content, storage bandwidth requirements continue to grow rapidly due to rapidly increasing client performance and more data-intensive algorithms. Applications such as data mining of retail transaction records, or of telecommunications call records to discover historical trends, employ data-intensive algorithms. Traditional server-based architectures are fundamentally not scalable; they can support these demanding bandwidth requirements only at a high cost.

3.1.3 New drive attachment technology

The same factors that are causing disk densities to improve by 60% per year are yielding improvements in disk bandwidth of 40% per year [Grochowski and Hoyt, 1996].
This is placing stricter constraints on the physical and electrical design of drive busses (e.g., SCSI or IDE), often dramatically reducing bus length. As a result, the storage industry has moved to drives that attach to scalable networks, such as Fibre-Channel, a serial, switched, packet-based peripheral network. Fibre-Channel [Benner, 1996] allows long cable lengths, more ports, and more bandwidth. Fibre-Channel-attached hosts and drives communicate via the SCSI protocol [ANSI, 1986, ANSI, 1993], encapsulating SCSI commands over the Fibre-Channel network.

3.1.4 Convergence of peripheral and interprocessor networks

Scalable computing is increasingly based on clusters of workstations. In contrast to the special-purpose, topologically regular, highly reliable, low-latency interconnects of massively parallel processors such as the SP2 and Paragon, clusters typically use Internet protocols over commodity LAN routers and switches. To make clusters effective, low-latency network protocols and user-level access to network adapters have been proposed, and a new adapter card interface, the Virtual Interface Architecture, is being standardized [Maeda and Bershad, 1993, von Eicken et al., 1995, Buzzard et al., 1996, Intel, 1995].

3.1.5 Excess of on-drive transistors

Disk drives have heavily exploited the increasing transistor density of inexpensive ASIC technology to both lower cost and increase performance, developing sophisticated special-purpose functional units and integrating them onto a small number of chips. For example, in 1998 Siemens’ TriCore integrated microcontroller and ASIC chip contained a 100 MHz, 3-way issue, superscalar 32-bit datapath with up to 2 MBytes of on-chip DRAM and customer-defined logic [TriCore News Release, 1997].
3.2 Two network-attached storage architectures

Eliminating the file server from the data path, so that data can be transferred directly from client to storage device, would improve bandwidth and eliminate the server as a bottleneck. This requires attaching storage devices to the network and making them accessible to clients. Simply attaching storage to a network, however, leaves unspecified the role of the network-attached storage device in the overall architecture of the distributed file system. The following subsections present two architectures that separate control messaging from data transfers but demonstrate substantially different functional decompositions between server, client and device. The first, the simpler network-attached disk design, network SCSI, minimizes modifications to the drive command interface, hardware and software. The second, network-attached secure disks, leverages the rapidly increasing processor capability of disk-embedded controllers to restructure the drive command interface and offload even more work from the file server.

3.2.1 Network SCSI

NetSCSI is a network-attached storage architecture that makes minimal changes to the hardware and software of SCSI disks. This architecture allows direct data transfers between client and device while retaining as much as possible of SCSI, the current dominant mid- and high-level storage device protocol. This is the natural evolution path for storage devices; Seagate’s Barracuda FC already provides packetized SCSI through Fibre-Channel network ports to directly attached hosts. File manager software translates client requests into commands to disks, but rather than returning data to the file manager to be forwarded, the NetSCSI disks send data directly to clients,
similar to the third-party transfers already supported by SCSI [Drapeau et al., 1994]. The efficient data transfer engines typical of fast drives ensure that the drive’s sustained bandwidth is available to clients. Further, by eliminating the file manager from the data path, its workload per active client decreases. However, the use of third-party transfer changes the drive’s role in the overall security of a distributed file system. While it is not unusual for distributed file systems to employ a security protocol between clients and servers (e.g., Kerberos authentication), disk drives do not yet participate in this protocol.

Figure 3.2: Network SCSI (NetSCSI) is a network-attached disk architecture designed for minimal changes to the disk’s command interface. However, because the network port on these disks may be connected to a hostile, broader network, preserving the integrity of on-disk file system structure requires a second port to a private (file-manager-owned) network or cryptographic support for a virtual private channel to the file manager. If a client wants data from a NetSCSI disk, it sends a message (1) to the distributed file system’s file manager, which processes the request in the usual way, sending a message over the private network to the NetSCSI disk (2). The disk accesses data, transfers it directly to the client (3), and sends its completion status to the file manager over the private network (4). Finally, the file manager completes the request with a status message to the client (5).
There are four interesting levels of security within the NetSCSI model [Gibson et al., 1997b]: (1) accident avoidance, with a second private network between file manager and disk, both locked in a physically secure room; (2) data transfer authentication, with clients and drives equipped with a strong cryptographic hash function; (3) data transfer privacy, with both clients and drives using encryption; and (4) secure key management, with a secure processor. Figure 3.2 shows the simplest security enhancement to NetSCSI: a second network port on each disk. Since SCSI disks execute every command they receive without an explicit authorization check, without a second port even well-meaning clients can generate erroneous commands and accidentally damage parts of the file system. The drive’s second network port provides protection from accidents while allowing SCSI command interpreters to continue following their normal execution model. This is the architecture employed in the High Performance Storage System [Watson and Coyne, 1995]. Assuming that the file manager and NetSCSI disks are locked in a secure room, this mechanism is acceptable for the trusted-network security model of NFS. Because file data still travels over the potentially hostile general network, NetSCSI disks are likely to demand greater security than simple accident avoidance. Cryptographic protocols can strengthen the security of NetSCSI. A strong cryptographic hash function, such as SHA [NIST, 1994], computed at the drive and at the client would allow data transfer authentication (i.e., the correct data was received only if the sender and receiver compute the same hash on the data). For some applications, data transfer authentication is insufficient, and communication privacy is required. To provide privacy, a NetSCSI drive must be able to encrypt and decrypt data.
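The data-transfer-authentication level can be sketched as follows (Python; a keyed SHA-1 digest from the standard hmac module stands in for the drive’s hash unit, and the shared key, which in practice would be established among file manager, client and drive, is purely illustrative):

```python
# Sketch of data-transfer authentication: the drive appends a keyed
# digest to the data it sends; the client recomputes the digest and
# accepts the transfer only if the two match.

import hashlib
import hmac

KEY = b"illustrative-key-from-file-manager"  # hypothetical shared key

def drive_send(data):
    # Drive computes the digest over the outgoing data.
    digest = hmac.new(KEY, data, hashlib.sha1).digest()
    return data, digest          # both travel over the hostile network

def client_receive(data, digest):
    # Client recomputes and compares in constant time.
    expected = hmac.new(KEY, data, hashlib.sha1).digest()
    return hmac.compare_digest(expected, digest)

data, tag = drive_send(b"file contents")
print(client_receive(data, tag))          # matching digest -> True
print(client_receive(b"tampered!", tag))  # altered data -> False
```

A plain (unkeyed) hash would only detect corruption; keying the hash is what lets the endpoints detect a hostile party substituting data in transit.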
NetSCSI drives can use cryptographic protocols to construct private virtual channels over the untrusted network. However, since keys will be stored in devices vulnerable to physical attack, the drives must still be kept in physically secure environments. If NetSCSI disks are equipped with secure coprocessors [Yee and Tygar, 1995], then keys can be protected and all data can be encrypted whenever it is outside the secure coprocessor, allowing the disks to be used in a variety of physically open environments.

3.2.2 Network-Attached Secure Disks (NASD)

With network-attached secure disks, the constraint of minimal change from the existing SCSI interface is relaxed. Instead, the focus is on selecting a command interface that reduces the number of client-storage interactions that must be relayed through the file manager, offloading more of the file manager’s work without integrating file system policy into the disk. Common, data-intensive operations, such as reads and writes, go straight to the disk, while less-common ones, including namespace and access control manipulations, go to the file manager. As opposed to NetSCSI, where a significant part of the security processing is performed on the file manager, NASD drives perform most of the processing needed to enforce the security policy. Specifically, the cryptographic functions and the enforcement of manager decisions are implemented at the drive, while policy decisions are made in the file manager. Authorization, in the form of a time-limited capability applicable to the file’s map and contents, is provided by the file manager, which still maintains control over storage access policy [Gobioff, 1999]. In order to service an access to a file, the logical access must be mapped onto physical sectors. This mapping can be maintained by the file manager and provided dynamically, as in Derived Virtual Devices (DVD) [Van Meter et al., 1996].
It can also be maintained by the drive. In this latter case, the filesystem authors must surrender detailed control over the layout of the files they create. With the “mapping metadata” that controls the layout of files maintained at the drive, a NASD drive exports a namespace of file-like objects. Files in the directory tree are mapped onto NASD objects by the file manager, but block management and allocation within a NASD object are the responsibility of the NASD device. In summary, both NetSCSI and NASD allow direct data transfer. However, NASD decomposes more of the file server’s function and delegates part of it to the storage device. This is attractive because it reduces file server load [Gibson et al., 1997b], allowing higher scalability, and because it allows computation resources to scale with capacity. The following section discusses the NASD architecture and its properties in more detail.

3.3 The NASD architecture

Like NetSCSI, network-attached secure disks (NASD) repartition file server function among client, device and residual file server. However, NASD delegates more function to the client and device to minimize client-server interactions in the common case. NASD does not advocate that all functions of the traditional file server need to be, or should be, migrated into storage devices. NASD devices do not dictate the semantics of the highest levels of distributed file system function – global naming, access control, concurrency control, and cache coherence. Nevertheless, NASD devices, as discussed in the next chapter, can provide mechanisms to support the efficient implementation of these high-level functions. High-level policy decisions, such as access control, naming and quota management, are still reserved to the residual file server, which is called the file manager. It is the large market for mass-produced disks that promises to make NASD cost-effective.
This mass production requires a standard interface that must be simple, efficient, and flexible to support a wide range of file system semantics across multiple technology generations. The NASD architecture can be summarized in its four key attributes: 1) direct transfer to clients; 2) asynchronous oversight by file managers; 3) secure interfaces supported by cryptography; and 4) the abstraction of variable-length objects.

Figure 3.3: Network-attached secure disks (NASD) are designed to offload more of the file system’s simple and performance-critical operations. For example, in one potential protocol a client, prior to reading a file, requests access to that file from the file manager (1), which delivers a capability to the authorized client (2). So equipped, the client may make repeated accesses to different regions of the file (3, 4) without contacting the file manager again, unless the file manager chooses to force reauthorization by revoking the capability (5).

3.3.1 Direct transfer

Data accessed by a filesystem client is transferred between NASD drive and client without indirection (store-and-forward) through a file server machine. Because control of naming is more appropriate to the higher-level file system, pathnames are not understood at the drive, and pathname resolution is split between the file manager and client. A single drive object can store the contents of a client file, although multiple objects may be logically linked by the file system into one client file. Such an interface provides support for banks of striped files [Hartman and Ousterhout, 1993] or logically contiguous chunks of complex files [de Jonge et al., 1993].
In any case, the mapping from a high-level pathname to an object is performed infrequently, and the results of this mapping are cached at the client. This allows the client to access the device directly in the common case. As an example of a possible NASD access sequence, consider the file read operation depicted in Figure 3.3. Before issuing its first read of a file, the client authenticates itself with the file manager and requests access to the file. If access is granted, the client receives the network location of the NASD drive containing the object, and a time-limited capability for accessing the object and for establishing a secure communications channel with the drive. After this point, the client may directly request access to data on NASD drives, using the appropriate capability [Gobioff, 1999].

3.3.2 Asynchronous oversight

Access control decisions made by a file manager must be enforced by a NASD drive. This enforcement implies authentication of the file manager’s decisions. The authenticated decision authorizes particular operations on particular groupings of storage. Because this authorization is asynchronous, a NASD device may be required to record an audit trail of operations performed, or to revoke authorization at the file manager’s discretion. For a small number of popular authentication systems, Kerberos [Neuman and Ts’o, 1994] for example, NASD drives could be built to directly participate, synchronously receiving a client identity and rights in an authenticated message from a file manager during its access control processing. Derived Virtual Devices use this approach, which is simplified by the availability of authenticated RPC packages, also used by filesystems such as AFS [Satyanarayanan, 1990].
An alternative that does not depend on the local environment's authentication system, and that does not depend on the synchronous granting of rights by the file manager, is to employ capabilities similar to those of the ICAP [Gong, 1989] or Amoeba [Mullender et al., 1990] distributed environments. Capabilities are transported to the device via a client but are only computable and mutable by the file manager and the NASD drive. Filesystem policy decisions, such as what operations a particular client can perform on a particular set of stored data, are encoded into capabilities by the file manager. These capabilities are given by the file manager to clients. A client presents these capabilities to the NASD drives, which enforce the access control policy by decrypting and examining the contents of the sealed capability, without synchronous recourse to the file manager. The problem of ensuring security in a NASD-based storage system is the topic of a recent dissertation [Gobioff, 1999]. The key observation here is that ensuring security in a NASD system does not require a central entity in the critical path of data transfer. File managers are contacted only infrequently and do not interfere with common-case data transfers. Moreover, the NASD device does not dictate how access control is enforced. It provides a mechanism for enforcing authorization which can be used by different types of filesystems. Efficient, secure communication defeats message replay attacks by uniquely timestamping messages with loosely synchronized clocks and enforcing uniqueness of all received timestamps within a skew threshold of each node's current time. Although all that is needed in a NASD drive is a readable, high-resolution, monotonically increasing counter, there are further advantages to using a clock whose rate is controlled by a network time protocol such as NTP [Mills, 1988]. Such a clock, for example, can be exploited to develop efficient timestamp ordering protocols for concurrency control. Such a protocol is described in Chapter 4.

3.3.3 Object-based interface

The set of storage units accessible to a client as the result of an asynchronous access control decision must be named and navigated by client and NASD. Because a NASD drive must enforce access control decisions, a file manager should describe a client's access rights in a relatively compact fashion. While it is possible for the file manager to issue one capability granting a client access to a block or block range, this approach results in a large number of capabilities. Furthermore, it requires a new capability to be issued every time a new block is allocated on a NASD device. It is therefore advantageous for the capability to refer to compact names that allow a client to simplify its view of accessible storage to a variable-length set of bytes, possibly directly corresponding to a file. Such names may be temporary. For example, Derived Virtual Devices (DVD) use the communications port identifier established by the file manager's decision to grant access to a group of storage units [Van Meter et al., 1996]. In the prototype interface, a NASD drive partitions its allocated storage into containers which we call objects. The NASD object interface enhances the ability of storage devices to manage themselves. Previous research has shown that storage subsystems can exploit detailed knowledge of their own resources to optimize on-disk block layout, prefetching and read-ahead strategies, and cache management [English and Stephanov, 1992, Chao et al., 1992, de Jonge et al., 1993, Patterson et al., 1995, Golding et al., 1995]. Magnetic disks are fixed-size block devices. Traditionally, filesystems are responsible for managing the actual blocks of the disks under their control.
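To make the capability mechanism concrete, the following sketch shows one way a file manager could seal a capability and a drive could check it without contacting the manager. This is an illustration only: the HMAC seal, the shared per-drive key, the expiry field, and all names here are assumptions of this sketch, not the actual protocol of [Gobioff, 1999].

```python
import hmac
import hashlib

# Assumed for illustration: a secret shared by the file manager and the drive.
DRIVE_KEY = b"key shared by file manager and drive"

def seal_capability(object_id, rights, expiry, key=DRIVE_KEY):
    """File manager side: seal a capability a client can carry but not forge."""
    payload = f"{object_id}:{','.join(rights)}:{expiry}".encode()
    mac = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"object_id": object_id, "rights": tuple(rights),
            "expiry": expiry, "mac": mac}

def drive_check(cap, op, now, key=DRIVE_KEY):
    """Drive side: recompute the seal and enforce rights and expiry locally,
    with no synchronous recourse to the file manager."""
    payload = f"{cap['object_id']}:{','.join(cap['rights'])}:{cap['expiry']}".encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, cap["mac"]):
        return False   # tampered or forged capability
    if now > cap["expiry"]:
        return False   # time-limited capability has lapsed
    return op in cap["rights"]

cap = seal_capability("obj42", ("read",), expiry=1000)
```

Because the seal covers the object identifier, the rights, and the expiry, a client cannot widen its own rights; the drive detects any modification when it recomputes the MAC.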
Using notions of the disk drive's physical parameters and geometry, filesystems maintain information about which blocks are in use, how blocks are grouped together into logical objects, and how these objects are distributed across the device [McKusick et al., 1984, McVoy and Kleiman, 1991]. Current SCSI disks offer virtual or logical fixed-size blocks named in a linear address space. Modern disks already transparently remap storage sectors to hide defective media and the variation in track densities across the disk. By locating blocks with sequential addresses at the locations closest in positioning time (adjacent where possible), SCSI supports the locality-based optimizations computed by filesystems using this now-obsolete disk model. More advanced SCSI devices exploit this virtual interface to transparently implement RAID, data compression, dynamic block remapping, and representation migration [Patterson et al., 1988, Wilkes et al., 1996, Holland et al., 1994].

Operation          Arguments                        Return values                 Description
CreatePartition()  partition                        status                        create a new partition (zero-sized)
RemovePartition()  partition                        status                        remove partition
ResizePartition()  partition, new size              status                        set partition size
CreateObj()        partition, initial attributes    new identifier, attributes,   create a new object on partition,
                                                    status                        optionally set its attributes
RemoveObj()        partition, identifier            status                        remove object from partition
GetAttr()          partition, identifier            attributes, status            get object attributes
SetAttr()          partition, identifier,           new attributes, status        change attributes, retrieving the
                   new attributes                                                 resulting attributes when complete
ReadObj()          partition, identifier, regions   data, length, status          read from a strided list of
                                                                                  scatter-gather regions in an object
WriteObj()         partition, identifier,           length, status                write to a strided list of
                   regions, data                                                  scatter-gather regions in an object

Table 3.1: A subset of the NASD interface.
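The object interface of Table 3.1 can be sketched as a toy in-memory model. This is only an illustration of the interface's shape: capability checks, attribute maintenance, partition quotas, and on-media layout are all omitted, and the return conventions are invented here.

```python
# Toy model of a subset of the Table 3.1 operations (names from the table;
# everything else is an assumption of this sketch).
class NasdDrive:
    def __init__(self):
        # partition -> {object id: {"data": bytearray, "attrs": dict}}
        self.partitions = {}
        self.next_id = 1

    def create_partition(self, p):
        self.partitions[p] = {}

    def create_obj(self, p, attrs=None):
        oid = self.next_id
        self.next_id += 1
        self.partitions[p][oid] = {"data": bytearray(), "attrs": dict(attrs or {})}
        return oid

    def write_obj(self, p, oid, regions):
        # regions: scatter-gather list of (offset, bytes) pairs
        obj = self.partitions[p][oid]
        for off, chunk in regions:
            if len(obj["data"]) < off + len(chunk):
                obj["data"].extend(b"\0" * (off + len(chunk) - len(obj["data"])))
            obj["data"][off:off + len(chunk)] = chunk
        return sum(len(c) for _, c in regions)

    def read_obj(self, p, oid, regions):
        # regions: scatter-gather list of (offset, length) pairs
        data = self.partitions[p][oid]["data"]
        return b"".join(bytes(data[off:off + ln]) for off, ln in regions)
```

Note that the client names a variable-length object, not a disk block: where data lands on the media is entirely the drive's decision.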
Storage devices export multiple partitions, each a flat object space, accessed through a logical read-write interface. Objects are created within a specific partition. Each object has attributes that can be queried and changed. Some attributes are maintained by the NASD device, such as last access time. Some attributes are "filesystem-specific" and are updated by filesystem code through a SetAttr() command. These attributes are opaque to the NASD device.

Figure 3.4: Structure of a NASD object's attributes. Some attributes are maintained by the device, such as create time, object size, and last data modify time. In addition to NASD-maintained attributes, each object has an access control version variable and a large filesystem-specific attribute space (256 bytes in our prototype) which is used by the filesystem for its own purposes.

The NASD interface abandons the notion that file managers understand and directly control storage layout. Instead, NASD drives store variable-length, logical byte streams called objects. Filesystems wanting to allocate storage for a new file request one or more objects to hold the file's data. Read and write operations apply to a byte region (or multiple regions) within an object. The layout of an object on the physical media is determined by the NASD drive. To exploit the locality decisions made by a file manager, sequential addresses in NASD objects should be allocated on the media to achieve fast sequential access.
For inter-object clustering, NASD breaks from SCSI's global address space and adopts a linked list of objects where proximity in the list encourages proximity on the media, similar to the Logical Disk model [de Jonge et al., 1993]. In addition to being data stores, NASD objects maintain associated metadata, called object attributes. Figure 3.4 depicts an example of NASD object attributes. Some attributes, such as create time, the last time the data was written, or the last time the attributes were modified, are maintained by the device and cannot be directly manipulated by external systems. These attributes are updated indirectly when the object is created, when the object's data is written, or when any attributes are modified, respectively. In addition to the NASD-maintained attributes, listed fully in [Gibson et al., 1997a], each NASD object has a large "filesystem-specific" attribute space (256 bytes in our prototype). These attributes are opaque to the NASD device, which treats them as a byte string. They are updated and manipulated by the filesystem for its own purposes. For instance, a UNIX-like filesystem may use these attributes to store the mode bits or the owner and group identifiers. The important subset of the NASD interface is summarized in Table 3.1; a more complete and detailed description of the NASD interface is provided in [Gibson et al., 1997a].

3.4 The Cheops storage service

A filesystem using NASD devices allows clients to make direct read and write accesses to the storage devices. Simulation studies reported in [Gibson et al., 1997b] demonstrated that an NFS file server can support an order of magnitude more clients if NASD-based devices are used in lieu of server-attached devices. AFS file servers were shown to support three times more clients.
Although NFS and AFS were later ported to use NASD to validate the NASD interface and gain further experience, scalability experiments with large client populations were not carried out. The design of AFS and NFS over NASD is reported in [Gibson et al., 1998]. In their NASD ports, both NFS and AFS mapped a single file or directory in the directory structure onto a single NASD object on a single device. Large file transfers could not benefit from parallel striped transfers. Furthermore, RAID across devices was not supported. To exploit the high bandwidth possible in a NASD storage architecture, the client-resident portion of a distributed filesystem needs to make large, parallel data requests across multiple NASD drives and to minimize copying, preferably bypassing operating system file caches. Cheops is a storage service that can be layered over NASD devices to accomplish this function. In particular, Cheops was designed to provide this function transparently to higher-level filesystems.

3.4.1 Cheops design overview

Cheops implements storage striping and RAID functions but not file naming and other directory services. This maintains the traditional "division of concerns" between filesystems and storage subsystems, such as RAID arrays. Cheops performs the function of a disk array controller in a traditional system. One of the design goals of Cheops was to scale to a very large number of nodes. Another goal was for Cheops to export a NASD interface, so that it can be transparently layered below filesystems ported to NASD. Cheops allows higher-level filesystems to manage a single logical object that is served by the Cheops storage management system [Gibson et al., 1998]. Cheops exports the illusion of a single virtual NASD device with the aggregate capacity of the underlying physical NASD devices. To the clients and file managers, there appears to be a single NASD device. This device is accessed via a local surrogate, called the Cheops clerk.

Figure 3.5: Recursing the NASD interface. Cheops exports a virtual NASD interface while managing multiple physical NASDs underneath. Cheops can be regarded as a logical volume manager and a software RAID controller that is specialized to NASD objects.

Figure 3.5 illustrates how Cheops clerks export to the client machines the illusion of a single virtual NASD device. The basic design goal of Cheops is to enable clients to perform direct parallel accesses on the NASD devices. This requires the striping function to be implemented in the client. Specifically, the Cheops clerk is responsible for mapping a logical access to the virtual NASD device onto a parallel access to the physical NASD devices. Figure 3.6 depicts how NASD, Cheops, and filesystem code fit together to enable direct parallel transfers to the client while maintaining the NASD abstraction to the filesystem. The figure contrasts this architecture with the traditional server-attached disk (SAD) architecture. In both architectures, an application request is first received by the client's representation of the filesystem (file clerk in Figure 3.6). In the SAD system, this request is forwarded to the file server as a file-level request. This request is processed by the server-side filesystem and mapped onto a block-level request to the local disk array controller or logical volume manager. The RAID controller maps this logical block access onto one or more physical block accesses to the server-attached storage devices. In the NASD system, the application request is received by the local client filesystem clerk.
The access is mapped onto a NASD object access by contacting the local "object cache" or the file manager in the case of a cache miss. Once the object is identified, the file access translates to an object access on the virtual NASD device exported by Cheops. The local Cheops clerk takes this object access and maps it onto one or more physical object accesses, possibly consulting the Cheops manager if a virtual-to-physical metadata mapping is not cached. These physical accesses are sent out on the network.

Figure 3.6: Function decomposition in NASD compared to traditional file server based storage. At each client, a file clerk performs file caching and namespace mapping between high-level filenames and virtual storage objects. The storage clerk on the client receives logical accesses to a virtual storage object space, and maps the accesses onto physical accesses to the NASD objects. Parallel transfers are then carried out from multiple NASDs into the storage clerk.

The Cheops clerks and managers at this layer implement the function of the RAID controller in the SAD system. Cheops managers manage the metadata maps and initiate management tasks such as reconstruction, backup and migration. Upon receipt of a lookup request from a clerk, the Cheops manager returns the mapping of the higher-level object onto a set of underlying NASD objects and the appropriate capability list.
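For a RAID level 0 layout, the clerk's translation from a logical byte range on the virtual object to per-device requests might look like the following sketch. The function name, parameters, and request format are assumptions made for illustration; the prototype's actual code differs.

```python
def map_raid0(offset, length, stripe_unit, physical_objs):
    """Split logical range [offset, offset+length) across striped NASD objects.

    Returns a list of (physical object, physical offset, length) requests
    that a clerk could issue in parallel to the NASD devices.
    """
    n = len(physical_objs)
    requests = []
    pos = offset
    end = offset + length
    while pos < end:
        stripe_idx = pos // stripe_unit           # which stripe unit, globally
        dev = stripe_idx % n                      # device holding this unit
        unit_off = pos % stripe_unit              # offset within the unit
        phys_off = (stripe_idx // n) * stripe_unit + unit_off
        chunk = min(stripe_unit - unit_off, end - pos)
        requests.append((physical_objs[dev], phys_off, chunk))
        pos += chunk
    return requests
```

A full-stripe read thus fans out into one request per device, which is what lets aggregate bandwidth scale with the number of NASDs.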
This mapping and capability list are used by the clerk to perform the accesses to the NASD devices. Cheops-exported objects have attributes as required by the NASD interface. These attributes can be embedded in one of the physical NASD objects or in a separate object. The first solution does not require allocating additional physical NASD objects but can disrupt the alignment of client accesses. The second solution allocates one NASD object per partition, or per group of objects, to store the attributes of the virtual objects in that partition. This is more desirable since it does not disrupt the alignment of client accesses, at the expense of slightly more implementation complexity. This layered approach does not adversely affect the file manager's overall control because that control is already largely asynchronous. For example, the file manager revokes a high-level capability in a physical NASD by updating the attributes of the object on the NASD device. Because the device compares the attributes encoded in the capability with those on the device during every access, modifying the attributes amounts to revoking the capability. This is implemented in Cheops as follows. The file manager invokes a SetAttr() on the virtual NASD device (through the local clerk) to modify the access version number attribute. This request is relayed by the local clerk to the Cheops manager. The Cheops manager revokes access to each physical object by updating its attributes before acknowledging the request to the file server. Recall that the clerks at many clients may have cached the "virtual to physical" mapping and the capabilities that allow them to access the underlying physical objects. The manager immediately disallows these clients from accessing storage by modifying the attributes of the underlying physical objects.
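The compare-on-every-access rule can be sketched in a few lines. This is a minimal illustration assuming each capability records the access-control version it was issued against; the structures and names are invented here.

```python
class PhysicalObject:
    def __init__(self):
        self.access_version = 1   # the access control version attribute

def issue_capability(obj):
    # the file manager encodes the current version into the capability
    return {"version": obj.access_version}

def drive_allows(obj, cap):
    # the drive compares capability attributes with the object's attributes
    # on every access, so a version bump invalidates all older capabilities
    return cap["version"] == obj.access_version

def revoke_all(obj):
    # the effect of the Cheops manager's SetAttr() on the physical object
    obj.access_version += 1
```

No message ever needs to reach the clients holding stale capabilities: their next access simply fails the version comparison at the device.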
Thus, to revoke a capability, the file manager sends a SetAttr() to the local clerk, the surrogate for all virtual objects managed by Cheops. This request is forwarded by the clerk to the Cheops manager. The manager in turn sends SetAttr() commands to the underlying physical NASD objects to modify their attributes, invalidating all outstanding physical capabilities.

3.4.2 Layout protocols

Cheops clerks cache the layout maps for recently accessed virtual objects locally. The layout map describes how a virtual object is mapped onto physical NASD objects, and provides capabilities to enable access to each physical NASD. Furthermore, the layout map describes how the object should be read and written. For example, a map may specify that an object is striped across four NASD objects with a stripe unit size of 64KB according to a RAID level 0 layout. A Cheops clerk uses the layout map to read and write the objects. In the current prototype implementation, the Cheops clerk as well as the manager contain code that allows them to access a statically defined set of possible layouts using their associated access algorithms. Chapter 5 describes a framework that enables the Cheops clerk to be dynamically extended to perform accesses to new architectures introduced by a Cheops manager. The layout cached by Cheops clerks may change as a result of a failure or a management operation. Disk failures require the use of different access protocols to read and write a virtual object. Furthermore, the layout map may change because of a storage migration initiated by a manager. To maintain the coherence of maps, layout maps are considered valid only for a limited period of time. Once that period, called the lease period, passes, the layout map is considered invalid and must be refreshed by the clerk. The clerk refreshes the map by contacting a storage manager to find out if the map has changed.
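The lease rule for cached layout maps can be sketched as follows. The lease length and field names are illustrative assumptions, not values from the prototype.

```python
LEASE_PERIOD = 30.0   # seconds; an assumed value for illustration

def map_is_valid(entry, now):
    # a cached layout map is trusted only within its lease period
    return now - entry["fetched_at"] < LEASE_PERIOD

def get_layout(cache, vobj, now, fetch_from_manager):
    """Return a layout map for virtual object vobj, refreshing on lease expiry."""
    entry = cache.get(vobj)
    if entry is None or not map_is_valid(entry, now):
        # contact a storage manager to find out if the map has changed
        entry = {"map": fetch_from_manager(vobj), "fetched_at": now}
        cache[vobj] = entry
    return entry["map"]
```

The expiry bound is what lets a manager change a layout safely even when a client has crashed or is partitioned: after one lease period, no clerk can still be acting on the old map.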
When a Cheops manager desires to change a layout map, for example because storage has migrated to a new device, it sends explicit messages to the clients to invalidate their cached maps. Messages must also be sent to the devices to revoke access, by invalidating the capabilities cached at the clients. Because the layout maps expire after a certain period, crashed clients and network partitions are handled easily in the same way: if a client does not acknowledge the invalidation, the manager waits until the lease expires and then performs its layout change. Fault recovery and concurrency control in Cheops deserve a deeper discussion, which is the subject of the following chapter.

3.4.3 Storage access protocols

Cheops clerks contain statically linked code that allows them to perform read and write accesses to several RAID architectures, namely RAID levels 0, 1 and 5. In Cheops, multiple storage accesses from many clients can be in progress concurrently. Furthermore, a device or host failure may occur in the middle of the execution of a storage access. Correctness must be ensured in the midst of this concurrency and in the case of untimely failures. The following chapter addresses this concurrency control and recovery problem. For the rest of this chapter, we will focus only on normal-case performance.

3.4.4 Implementation

The Cheops clerk is implemented as a library which can be linked with user-level filesystems and applications. It uses multiple worker threads from a statically sized thread pool to initiate multiple parallel RPCs to several NASD devices. Data returned by the NASD devices is copied only once, from the communication transport's buffers to the application space. The Cheops clerk does not induce an extra copy. Instead, it deposits data returned by each device directly in the application buffer at the proper offset. Cheops clerks maintain a least-recently-used cache of virtual-to-physical mappings.
The clerks also maintain a cache of NASD device tables. These tables map a NASD device identifier onto an IP address, a port, and a communication handle if a channel is still open between the clerk and the device. Cheops initially used DCE RPC over UDP/IP as the communication transport. A later implementation used a lightweight RPC package with better performance [Gibson et al., 1999]. Cheops builds on several earlier prototypes in network filesystems and in striping storage systems [Cao et al., 1994, Cabrera and Long, 1991, Lee and Thekkath, 1996].

Figure 3.7: A NASD-optimized parallel filesystem. NASD PFS is Cheops with a simple filesystem layer on top of it. This filesystem layer performs data and name caching. NASD PFS is used in conjunction with MPI for parallel applications in a cluster of workstations. The filesystem manages objects which are not locally backed by data. Instead, they are backed by a storage manager, Cheops, which redirects clients to the underlying component NASD objects. Our parallel filesystem extends a simple Unix filesystem interface with the SIO low-level interface [Corbett et al., 1996] and inherits a name service, directory hierarchy, and access controls from the filesystem.
In its RAID level 0 implementation, Cheops serves to demonstrate that the NASD interface can be effectively and efficiently virtualized, so that filesystems do not have to know about the underlying storage management that is ongoing inside Cheops. A parallel filesystem was used to demonstrate the bandwidth scaling advantages of NASD arrays, as depicted in Figure 3.7. Such a filesystem was used because high-bandwidth applications are often parallel applications. MPICH [Forum, 1995] for parallel program communications was used to implement a simple parallel filesystem, NASD PFS. NASD PFS is implemented in a library which in turn links to the Cheops clerk library. Because the NASD prototype used DCE on UDP for data transfer, the Cheops clerk uses DCE as well. Parallel cluster applications often employ low-latency system area networks, which, by offering network adapter support for protocol processing, can dramatically shorten the code overhead for bulk data transfer. Our prototype did not have such on-board protocol processing [Gibson et al., 1998]. Nevertheless, the applications described in this chapter are so data-intensive that they can accommodate the computational overhead of DCE RPC over UDP and still exhibit significant speed-ups.

3.5 Scalable bandwidth on Cheops-NASD

To evaluate the scalability of NASD/Cheops, a basic prototype implementation of Cheops was built to work with the prototype NASD device. The scalability of NASD/Cheops was compared to that of a traditional NFS file server with disks directly attached to the server. The hardware configurations of both systems were carefully constructed to provide a fair comparison. Both synthetic benchmarks (read/write) and I/O-intensive applications such as search and data mining were designed, implemented and evaluated on the SAD and NASD architectures.
3.5.1 Evaluation environment

Our experimental testbed consists of clients, file managers and NASD drives connected by a network. The NASD prototype is described in [Gibson et al., 1998]; it implements its own internal object access, cache, and disk space management modules (a total of 16,000 lines of code) and interacts minimally with Digital UNIX. For communications, the prototype uses DCE RPC 1.0.3 over UDP/IP. The scalability of NASD/Cheops was compared to that of a traditional NFS file server with disks directly attached to the server. The hardware for the NASD/Cheops system was composed of the following:

- NASDs: A NASD consists of SCSI disks (Seagate Medallist ST52160) attached to an old workstation. Two SCSI disks, with an average transfer rate of 3.5-4 MB/s each, were used as the magnetic storage media. Object data is striped across both disks, yielding an effective sequential transfer rate of up to 7.5 MB/s. The workstation is a DEC Alpha 3000/400 (133 MHz, 64 MB, Digital UNIX 3.2g). The SCSI disks are attached to the workstation via two 5 MB/s SCSI busses. The performance of this five-year-old machine is similar to what is available in high-end drives today and to what is predicted to be available in commodity drive controllers soon. We use two physical drives managed by a software striping driver to approximate the 10 MB/s rates we expect from more modern drives.

- Clients: Digital AlphaStation 255 machines (233 MHz, 128 MB, Digital UNIX 3.2g-3) were used as clients.

- File manager: One Alpha 3000/500 (150 MHz, 128 MB, Digital UNIX 3.2g-3) was used as a file manager.

- Network: The clients and the NASDs were connected by a 155 Mb/s OC-3 ATM network (Digital Gigaswitch/ATM).

The traditional NFS server configuration is referred to as the server-attached-disk (SAD) configuration. The more capable hardware used for the SAD system was composed of the following:

- File server: A Digital AlphaStation 500/500 (500 MHz, 256 MB, Digital UNIX 4.0b) was used as a "traditional file server". The file server machine was connected via two OC-3 ATM links to a Digital Gigaswitch/ATM, with half of the clients using each link.

- Storage devices: Eight SCSI disks (Seagate ST34501W, 13.5 MB/s) were attached to the file server workstation via two 40 MB/s Wide UltraSCSI busses, so that the peripheral SCSI links into the workstation can deliver the full bandwidth of the disks.

- Clients: As in the NASD case, AlphaStation 255 machines were used as clients.

- Network: As in the NASD case, the clients and the server were connected by a 155 Mb/s OC-3 ATM network (Digital Gigaswitch/ATM).

3.5.2 Raw bandwidth scaling

The parallel application employed in this demonstration is a data mining application that discovers association rules in sales transactions [Agrawal and Schafer, 1996], with synthetic data generated by a benchmarking tool. This application "discovers" rules of the form "if a customer purchases items A and B, then they are also likely to purchase item X" to be used for store layout or inventory decisions. It does this in several full scans over the data, first determining the items that occur most often in the transactions (the 1-itemsets) and then using this information to generate pairs of items that occur often (2-itemsets) and larger groupings (k-itemsets). Our parallel implementation avoids splitting records over 2 MB boundaries and uses a simple round-robin scheme to assign 2 MB chunks to clients. Each client is implemented as four producer threads and a single consumer. Producer threads read data in 512 KB requests (which is the stripe unit for Cheops objects in this configuration) and the consumer thread performs the frequent-sets computation, maintaining a set of itemset counts that are combined at a single master client.
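The frequent-sets scans described above follow an Apriori-style pass structure: one full scan counts single items, and the next counts only pairs whose members were individually frequent. The following single-node sketch illustrates that structure; the threshold, data, and function names are invented here, and the parallel producer/consumer threading of the real implementation is omitted.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Two full scans: frequent 1-itemsets, then candidate 2-itemsets."""
    counts = {}
    for t in transactions:                      # first scan: 1-itemsets
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent1 = {i for i, c in counts.items() if c >= min_count}

    pair_counts = {}
    for t in transactions:                      # second scan: 2-itemsets,
        # pruned to pairs of individually frequent items
        for pair in combinations(sorted(set(t) & frequent1), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    frequent2 = {p for p, c in pair_counts.items() if c >= min_count}
    return frequent1, frequent2
```

Each pass is a full sequential scan of the transaction file, which is why the benchmark is dominated by raw read bandwidth and stresses the storage system so effectively.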
This threading maximizes overlap and storage utilization. Figure 3.8 shows the bandwidth scalability of the most I/O-bound of the phases (the generation of 1-itemsets) processing a 300 MB sales transaction file.

Figure 3.8: Scaling of a parallel data mining application. The aggregate bandwidth computing frequent sets from 300 MB of sales transactions is shown. The NASD line shows the bandwidth of N clients reading from a single NASD PFS file striped across N NASD drives and scales linearly to 45 MB/s. All NFS configurations show the maximum achievable bandwidth with the given number of disks, each twice as fast as a NASD, and up to 10 clients spread over two OC-3 ATM links. The comparable NFS line shows the performance of all the clients reading from a single file striped across N disks on the server and bottlenecks near 20 MB/s. This configuration causes poor read-ahead performance inside the NFS server, so we add the NFS-parallel line where each client reads from a replica of the file on an independent disk through the one server. This configuration performs better than the single-file case, but only raises the maximum bandwidth from NFS to 22.5 MB/s.

A single NASD provides 6.2 MB/s per NASD drive, and our array scales linearly up to 45 MB/s with 8 NASD drives. In comparison, we also show the bandwidth achieved when NASD PFS fetches from a single higher-performance traditional NFS file instead of a Cheops NASD object. The NFS file server is an AlphaStation 500/500 (500 MHz, 256 MB, Digital UNIX 4.0b) with two OC-3 ATM links (half the clients communicate over each link), and eight Seagate ST34501W Cheetah disks (13.5 MB/s) attached over two 40 MB/s Wide UltraSCSI busses.
Using optimal code, this machine can internally read as much as 54.1 MB/s from these disks through the raw disk interface, doing nothing with the data. The graph of Figure 3.8 shows two application throughput lines for this server. The line marked NFS-parallel shows the performance of each client reading from an individual file on an independent disk on the one NFS file server, and achieves performance up to 22.5 MB/s. The results show that an NFS server (with 35+ MB/s of network bandwidth, 54 MB/s of disk bandwidth and a perfect sequential access pattern on each disk) loses much of its potential performance to CPU and interface limits. In comparison, each NASD is able to achieve 6.2 MB/s of the raw 7.5 MB/s available from its underlying dual Medallists. Finally, the line marked NFS is actually the configuration most comparable to the NASD experiment. It shows the bandwidth when all clients read from a single NFS file striped across N disks. This configuration, at 20.2 MB/s, is slower than NFS-parallel because its prefetching heuristics fail in the presence of multiple request streams to a single file. In summary, NASD PFS on Cheops delivers nearly all of the bandwidth of the NASD drives, while the same application using a powerful NFS server fails to deliver half the performance of the underlying Cheetah drives. The difference in performance is expected to widen with larger system sizes, as Cheops continues to exploit the available network bandwidth for aggregate parallel transfers. Server-based filesystems are limited, on the other hand, by the performance of the server machine. In very large systems, multiple file servers must be used, which complicates administration. Cheops can deliver a single virtual storage abstraction from a large collection of storage devices. Nevertheless, large systems bring challenges for both architectures.
The upcoming chapter discusses how a shared storage array, like Cheops, can ensure scalable concurrency control and fault-tolerance in very large systems.

3.6 Other scalable storage architectures

Scalable storage is not a new goal. Several storage systems have been designed with the goal of delivering high bandwidth to clients. Several others have focussed on scaling to large sizes. This section briefly reviews the alternative architectures that have been proposed in the literature and how they compare to Cheops/NASD. Previous storage systems reviewed in this section are organized into three categories. The first corresponds to systems which removed the file server machine from the data transfer path, thereby decoupling control from data transfer. The second corresponds to systems that striped data across multiple storage nodes to achieve high I/O data rates, while the third corresponds to systems that distributed the storage management and access functions to the nodes while tolerating node failures.

3.6.1 Decoupling control from data transfer

The availability of high-bandwidth disk arrays has recently highlighted the file server as the chief bottleneck in storage access. To eliminate the file server from the data path, several systems proposed decoupling control messaging in filesystems from actual data transfer. Examples include HPSS [Watson and Coyne, 1995] and RAID II [Drapeau et al., 1994], which focussed on moving the file server machine out of the data transfer path. These systems still relied on synchronous oversight of the file server on each access operation. However, data is transferred directly from storage to the client network without being copied through the file server machine.

RAID II

After building the initial redundant disk array prototype in 1989, the RAID group at U.C.
Berkeley discovered that parallel transfers from a disk array can achieve higher bandwidth than can be delivered by the memory system of the host workstation. The second RAID prototype, RAID II [Drapeau et al., 1994], was designed to address this bottleneck and deliver as much as possible of the array's bandwidth to file server clients. A custom-built crossbar memory system was used to connect disks directly to a high-speed network to which clients are attached. This allowed parallel transfers from the disks to the clients to use the high-speed crossbar and network without going through the server's memory. The RAID II prototype was used by a log-structured filesystem to deliver high read and write bandwidths to data-intensive applications. The RAID II system was shown to deliver up to 31 megabytes per second for sequential read operations, and 20 megabytes per second for large random accesses.

Mass Storage Reference Model

The Mass Storage System Reference Model, an early architecture for hierarchical storage subsystems, has advocated the separation of control and data paths for almost a decade [Miller88, IEEE94]. The Mass Storage System Reference Model was implemented in the High Performance Storage System (HPSS) [Watson and Coyne, 1995] and augmented with socket-level striping of file transfers over the multiple network interfaces found on mainframes and supercomputers. HPSS preserved the fixed block interface to storage and continued to rely on synchronous oversight of commands by the file manager. It constitutes a first step toward totally removing the file server from the data path. Cheops over NASD builds on this idea of decoupling control from data transfers. It avoids the file manager more aggressively, however, by allowing clients to cache long-term mappings which can be used to directly translate a high-level file access to a storage device access without recourse to the file manager.
3.6.2 Network-striping

To allow client applications high-bandwidth access to storage, several storage systems such as Swift [Cabrera and Long, 1991], Zebra [Hartman and Ousterhout, 1993] and Tiger [Bolosky et al., 1996] introduced the idea of striping data across multiple storage servers over a network.

Swift

To obtain cost-effective scalable bandwidth on a local area network, data must be striped across the network and across multiple servers. Swift [Cabrera and Long, 1991] is an early striping storage system that partitioned client data across multiple storage servers to provide high aggregate I/O rates to its clients. A "storage manager" decided file layout, stripe unit size and reliability mechanism. Users provided preallocation information such as size, reliability level, and data rate requirements. Cheops is similar to Swift in its architecture. Cheops differs from Swift in that it is designed to work with the NASD object interface. Furthermore, Swift, like Zebra, did not focus on the problem of ensuring correctness under highly concurrent write access to shared RAID storage. The protocols in the next chapter are devoted to addressing the concurrent RAID update problem which arises in distributed RAID systems.

Zebra

Another network filesystem that implemented direct client access to striped storage is the Zebra striped network filesystem [Hartman and Ousterhout, 1993], which is a log-structured file system striped over network storage nodes. Zebra stripes client logs of recent file system modifications across network storage servers and uses RAID level 4 to ensure fault-tolerance of each log. By logging many recent modifications before initiating a parallel write to all storage servers, Zebra avoids the small write problem of RAID level 4. As in the log-structured file system [Rosenblum, 1995], Zebra uses stripe cleaners to reclaim free space.
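The block-placement arithmetic underlying these striping systems can be illustrated with a small sketch. This is a hedged, illustrative example, not code from Swift, Zebra or Cheops; the function name and parameters are invented here. It shows how a virtual block number can be translated locally into a (device, physical block) pair, which is what lets a client transfer data to storage servers in parallel without consulting a central manager on each access.

```python
def locate(virtual_block, stripe_unit, num_devices):
    """Round-robin striping: map a virtual block number to a device
    and a physical block on that device."""
    unit_index = virtual_block // stripe_unit          # which stripe unit
    device = unit_index % num_devices                  # round-robin placement
    row = unit_index // num_devices                    # unit's row on that device
    physical_block = row * stripe_unit + (virtual_block % stripe_unit)
    return device, physical_block
```

For example, with a stripe unit of 4 blocks over 3 devices, blocks 0-3 land on device 0, blocks 4-7 on device 1, and block 12 wraps back around to device 0 at physical block 4.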
Zebra assumes clients are trusted; each time a client flushes a log to the storage servers, it notifies the file manager of the new location of the file blocks just written through a message, called a "delta", which is post-processed by the manager to resolve conflicts with the cleaner. Zebra lets each client write to the storage servers without going through the file manager, and coordinates the clients and the cleaners optimistically with file manager post-processing. By making clients responsible for allocating storage for new files across the storage servers, Zebra effectively delegates to the clients the responsibility of low-level storage management. While Zebra is a filesystem which integrates storage management and directory services, Cheops is a storage system that does not implement any namespace or directory services. Cheops was designed to provide the abstraction of virtual NASD objects from a collection of physical NASD devices.

Berkeley xFS

The limitations of using a single central fileserver have been widely recognized. xFS attempted to address some of the problems of a central file server by effectively replicating or distributing the file server among multiple machines [Anderson et al., 1996]. To exploit the economics of large systems that result from cobbling together many client purchases, the xFS file system distributes code, metadata and data over all clients, eliminating the need for a centralized storage system [Dahlin95]. This scheme naturally matches increasing client performance with increasing server performance. Instead of reducing the server workload, however, it takes the required computational power from another, frequently idle, client. xFS stripes data across multiple client machines and uses parity codes to mask the unavailability of clients.
The xFS approach still relies on an additional machine to store and forward data, rather than allowing the clients to communicate directly with storage.

Microsoft Tiger Video Fileserver

Tiger is a distributed fault-tolerant fileserver designed for delivering video data with real-time guarantees [Bolosky et al., 1996]. Tiger is designed to serve data streams at a constant rate to a large number of clients while supporting more traditional filesystem operations. Tiger uses a collection of commodity computers, called cubs, networked via an ATM switch to deliver high bandwidth to end clients. Data is mirrored and striped across all cubs and devices for fault-tolerance. Storage devices are attached to the cubs, which buffer the bursty disk transfers and deliver smooth constant-rate streams to clients. The main problem addressed by the Tiger design is that of efficiently balancing user load against limited disk, network and I/O bus resources. Tiger accomplishes this by carefully allocating data streams in a schedule that rotates across the disks. Cheops/NASD, like Tiger, uses switched networks and striped transfers to deliver high bandwidth. The function that Cheops provides is performed in Tiger by the intermediate computers (the cubs), where it is specialized for video service.

3.6.3 Parallel and clustered storage systems

Several systems distributed storage across nodes. In Swift, for example, storage is distributed across multiple nodes or "agents". However, a single storage manager is responsible for allocation and management. This leads to a single point of failure. Systems such as TickerTAIP [Cao et al., 1994] and Petal [Lee and Thekkath, 1996] attempted to distribute storage access and management functions to the participating storage nodes. Both systems avoid the single point of failure of the storage manager.
TickerTAIP

TickerTAIP is a parallel disk array architecture which distributed the function of the centralized RAID controller to the storage devices that make up the array [Cao et al., 1994]. TickerTAIP avoids the performance bottleneck and single point of failure of single disk array controllers. A host can send a request to any node in the disk array. The RAID update protocols are executed by the nodes. To ensure proper fault-tolerance in the event of a node failure, nodes log updates to other nodes before writing data to the platters. Cheops/NASD is similar to TickerTAIP in that accesses can be initiated by multiple nodes: Cheops clerks can be concurrently accessing shared devices, just as the TickerTAIP node controllers can be concurrently accessing the storage devices. Cheops, however, allows any client on the network with the proper access authorization to access storage, while TickerTAIP requires hosts to send the request to one of the array nodes, which in turn performs the access on behalf of the host. Moreover, while TickerTAIP distributes function to the array nodes, Cheops involves a larger number of client and storage nodes. The protocols in the next chapter focus on how synchronization protocols for RAID updates can scale to such a large number of nodes.

Petal

Petal is a distributed storage system that uses closely cooperating commodity computers to deliver a highly-available block-level storage service [Lee and Thekkath, 1996]. Petal nodes consist of commodity computers with locally attached disks. User data is striped across the nodes and mirrored for fault-tolerance. Petal allows blocks to be migrated between nodes by exporting a virtual block abstraction. Petal nodes collectively maintain the mapping from virtual to physical blocks and migrate blocks to balance load and use newly added devices. Petal is similar to Cheops in that it introduces a level of indirection to physical storage to allow management operations to be undertaken transparently.
Furthermore, Petal similarly uses striping across nodes. Although Petal uses commodity computers as storage servers, the functions executed by a Petal node can conceivably be implemented in the storage device controller. Hence, Petal can enable direct storage access like Cheops/NASD. Cheops/NASD differs from Petal in several aspects. First, Cheops/NASD offers an object interface which can be used to refer to a collection of related blocks. In this way, a filesystem can delegate the responsibility of storage management and allocation to the Cheops/NASD system. Petal, however, exports a block-level interface, which requires filesystems to perform traditional block allocation and de-allocation. Furthermore, Cheops/NASD assumes untrusted clients and therefore uses a capability-based security protocol for access control. Finally, Cheops supports RAID across devices. RAID has major implications for storage system design and performance. A study of the performance of RAID protocols in a Cheops-like system is conducted in simulation and reported on in the next chapter.

3.7 Summary

Storage bandwidth requirements continue to grow due to rapidly increasing client performance; new, richer content data types such as video; and data-intensive applications such as data mining. All high-bandwidth solutions to date incur a high overhead cost due to existing storage architectures' reliance on file servers as a bridge between storage and client networks. With dramatic performance improvements and cost reductions in microprocessors anticipated for the foreseeable future, high-performance personal computers will continue to proliferate, computational data sets will continue to grow, and distributed filesystem performance will increasingly impact human productivity and overall system acquisition and operating cost.
This chapter reported on a Network-Attached Secure Disks (NASD) architecture that enables cost-effective bandwidth scaling. NASD eliminates the server bottleneck by modifying storage devices so they can transfer data directly to clients. Further, NASD repartitions traditional file server functionality between the NASD drive, client and server. NASD does not advocate that all functions of the traditional file server need to be or should be migrated into storage devices. This chapter describes how legacy filesystems such as NFS and AFS can be ported to NASD. NASD enables clients to perform parallel data transfers to and from the storage devices. This chapter also describes a storage service, Cheops, which implements this function. Real applications running on top of a Cheops/NASD prototype demonstrate that NASD can provide scalable bandwidth. This chapter reports on experiments with a data mining application for which we achieve about 6 MB/s per client-drive pair in a system of up to 8 drives, for a total aggregate bandwidth of 45 MB/s, compared to the 22.5 MB/s achievable with NFS. For the client to conduct parallel transfers directly to and from the NASD devices, it must cache the stripe maps and capabilities required to resolve a file-level access and map it onto accesses to the physical NASD objects. The Cheops approach is to virtualize storage layout in order to make storage look more manageable to higher-level filesystems. Cheops avoids re-enlisting servers to synchronously resolve the virtual-to-physical mapping by decomposing and distributing its access and management functions such that the access function is executed at the client where the request is initiated. Cheops managers are responsible for authorization and oversight operations so that the participating clients always do the right thing. This essentially "recurses" the NASD interface by decomposing the Cheops functions similarly to those of a filesystem ported to NASD.
Despite the encouraging scaling results, a few challenges remain. First, RAID level 5 requires synchronization (stripe locking) to avoid corrupting shared parity codes when concurrent accesses to the same stripe are ongoing at the same time. More generally, Cheops requires synchronization protocols to coordinate access to shared devices by clerks when storage is being migrated or reconstructed. Moreover, the transitions when failures are detected have to be handled correctly. All these protocols must scale to the ambitious sizes that the NASD architecture allows. This problem is discussed in the following chapter.

Chapter 4

Shared storage arrays

Network-attached storage enables parallel clients to access potentially thousands of shared storage devices [Gibson et al., 1998, Lee and Thekkath, 1996] over cost-effective and highly-reliable switched networks [Horst, 1995, Benner, 1996]. Logically, this gives rise to a shared storage array with a large number of devices and concurrent storage controllers. Cheops, described in the previous chapter, is an example of a shared storage array. Cheops distributes the intelligence to access storage to the clients. In general, in a shared array, each client acts as the storage controller on behalf of the applications running on it, achieving scalable storage access bandwidths. In such systems, clients perform access tasks (read and write) and management tasks (storage migration and reconstruction of data on failed devices). Each task translates into multiple phases of low-level device I/Os, so that concurrent clients accessing shared devices can corrupt redundancy codes and cause hosts to read inconsistent data. This chapter is devoted to the problem of ensuring correctness in a shared storage array. The challenge is guaranteeing correctness without compromising scalability. The chapter is organized as follows.
Section 4.1 motivates this research by describing some hazards that can occur as a result of races or untimely failures in shared arrays. Section 4.2 defines a shared storage array in general terms. This general definition allows the solutions described in this chapter to be applied to any system that allows sharing of storage devices, regardless of the storage device interface (NASD or SCSI). Section 4.3 describes an approach which is based on breaking down the tasks that storage controllers perform into short-running two-phased transactions, called base storage transactions, or BSTs for short. The task of ensuring correctness is then reduced to guaranteeing that BSTs have a few desirable properties. Section 4.4 describes these properties and why they are needed. One key property of BSTs is that their execution is serializable. Section 4.5 focuses on protocols that ensure this serializability property. Section 4.6 discusses extensions of these protocols to shared arrays that perform distributed caching at the controllers. The other key property of BSTs is that they ensure array consistency upon recovery from untimely failures. Section 4.7 is devoted to a discussion of recovery protocols that ensure this consistency property.

Figure 4.1: Traditional storage systems (a) use a single controller. Shared arrays (b) use parallel cooperating controllers to access and manage storage. These storage controllers will be equivalently referred to as controllers, clients or hosts in this discussion. The term NASD or storage device will be used to refer to the device and its on-board controller.
Section 4.8 contains a discussion of how the proposed protocols can be optimized or generalized. Section 4.9 summarizes the chapter.

4.1 Ensuring correctness in shared arrays

Traditionally, when a system includes multiple storage devices, a storage controller, layered beneath the filesystem, is used to manage them. The controller can be a hardware device such as a RAID controller, or a software pseudo-device such as a logical volume manager. In both cases, a single storage controller is used to coordinate access to the storage devices, as depicted in Figure 4.1(a). In addition to performing storage access on behalf of clients, the storage controller also performs other "management" tasks. Storage management tasks include migrating data to balance load or utilize new devices [Lee and Thekkath, 1996], adapting storage representation to access pattern [Wilkes et al., 1996], backup, and the reconstruction of data on failed devices. Figure 4.1(b) depicts a shared storage array. In shared arrays, each client acts as the storage controller on behalf of the applications running on it, achieving scalable storage access bandwidths. Unfortunately, such shared storage arrays lack a central point to effect coordination. Because data is striped across several devices and often stored redundantly, a single logical I/O operation initiated by an application may involve sending requests to several devices. Unless proper concurrency control provisions are taken, these I/Os can become interleaved so that hosts see inconsistent data or corrupt the redundancy codes. These inconsistencies can occur even if the application processes running on the hosts are participating in an application-level concurrency control protocol, because storage systems can impose hidden relationships among the data they store, such as shared parity blocks.
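The hidden shared-parity relationship can be made concrete with a few lines of arithmetic. The following is a minimal sketch under invented names, not NASD or Cheops code; it reproduces the lost-update anomaly detailed in the next subsection: two hosts doing read-modify-write to different data blocks of the same RAID 5 stripe both compute new parity from the same pre-read parity, so whichever parity write lands last erases the other host's contribution.

```python
stripe = {"d1": 0b0011, "d2": 0b0101}
stripe["parity"] = stripe["d1"] ^ stripe["d2"]   # consistent XOR parity

def new_parity(stripe, block, new_value):
    """Parity a host computes during read-modify-write, using its
    pre-read of the old data block and the old parity block."""
    return stripe["parity"] ^ stripe[block] ^ new_value

# Both hosts pre-read the parity before either one writes it back.
parity_a = new_parity(stripe, "d1", 0b1111)      # host A updates d1
parity_b = new_parity(stripe, "d2", 0b1000)      # host B updates d2
stripe["d1"], stripe["d2"] = 0b1111, 0b1000
stripe["parity"] = parity_b                      # B's parity write lands last

# The parity no longer equals the XOR of the data blocks.
corrupt = stripe["parity"] != (stripe["d1"] ^ stripe["d2"])
```

A subsequent reconstruction of d1 from d2 and this parity would return the XOR of the stale parity with d2, not the value host A actually wrote.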
Scalable storage access and online storage management are crucial in the storage marketplace today [Golding et al., 1995]. In current storage systems, management operations are either done manually after taking the system off-line, or rely on a centralized implementation where a single storage controller performs all management tasks [Wilkes et al., 1996]. The Petal system distributes storage access and management tasks but assumes a simple redundancy scheme [Lee and Thekkath, 1996]: all data in Petal is assumed to be mirrored, and other RAID levels are not supported. The paramount importance of storage system throughput and availability has, in general, led to the employment of ad-hoc management techniques, contributing to annual storage management costs that are many times the purchase cost of storage [Golding et al., 1995].

4.1.1 Concurrency anomalies in shared arrays

Large collections of storage commonly employ redundant codes transparently with respect to applications so that simple and common device failures can be tolerated without invoking expensive higher-level failure and disaster recovery mechanisms. The most common in practice are the RAID levels (0, 1 and 5) described in Chapter 2. As an example of a concurrency anomaly in shared arrays, assume data is striped according to the very common RAID level 5 scheme. The conflicts that arise need not correspond to application write-write conflicts, which many applications control using higher-level protocols. Consider two clients writing to two different data blocks that happen to be in the same parity stripe, as shown in Figure 4.2. Because the data blocks are different, application-level concurrency control is satisfied, but since both data blocks are in the same parity stripe of the shared storage system, the parallel controllers at the hosts both pre-read the same parity block and use it to compute the new parity.
Later, both hosts write data to their independent blocks but overwrite the parity block such that it reflects only one host's update. The final state of the parity unit is therefore not the cumulative XOR of the data blocks. A subsequent failure of a data disk, say device 2, will lead to a reconstruction that does not reflect the last data value written to that device.

Figure 4.2: A time-line showing two concurrent Read-Modify-Write operations in a RAID with no concurrency control provisions taken. Initially, the data units on device 1 and device 2 contain D1 and D2, respectively, and the parity unit contains their XOR (D1 XOR D2). Although host A is updating device 1 and host B is updating device 2, they both need to update the parity. Both read the same version of the parity, but host A writes parity last, overwriting B's parity write and leaving the parity inconsistent.

In general, storage-level races can occur between concurrent accesses, or between concurrent access and management tasks, such as migration or reconstruction. These races may not be visible to application concurrency control protocols because the point of conflict is introduced by the underlying shared storage service.

4.1.2 Failure anomalies in shared arrays

In addition to anomalies that occur as a result of races, additional anomalies arise as a result of untimely failures. Consider a client performing a small RAID 5 write to a stripe. Assume that a host or power failure occurs after the data block is written but before the parity block is updated with the new value.
Upon recovery, the parity stripe will be inconsistent because the data block was updated but the update was not reflected in the parity. Figure 4.3 depicts an example scenario where an untimely failure leaves the parity inconsistent. If this is not detected and fixed, the array will remain in an inconsistent state, and a subsequent failure and reconstruction will recover incorrect data. The reconstructed data, computed from the surviving disks and the parity disk, will be different from the data that was last written to the failed device.

Figure 4.3: An example of an inconsistency that can arise when an untimely failure causes a RAID 5 write not to complete, so that only the data block is updated. The figure shows a RAID stripe (a), with initial values of the data and parity blocks, displayed in decimal (b). The parity block is initially the sum of the values in the data blocks (in a real array, the sum is done in base 2 using the bitwise XOR of the blocks). The figure also shows two scenarios. In the first scenario, a small write completes successfully (c), updating both data and parity blocks. In the second (d), only the data block is updated; a failure occurs before the parity is written. As a result, the array is left inconsistent.

Proper recovery procedures must be followed upon such untimely failures to detect such incompletions and reinstate the consistency of the parity code. The combination of concurrency and untimely failures can induce inconsistencies in the array and in the data read by end hosts, making ensuring correctness a challenging task in a shared storage array.
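The detection step of such a recovery procedure can be sketched as a parity scrub. This is an illustrative sketch only, with invented function names, not a protocol from this chapter: it recomputes a stripe's parity from its data blocks and reports, together with the correct value, any stripe left inconsistent by a partially completed write.

```python
from functools import reduce

def scrub_stripe(data_blocks, parity):
    """Return (was_consistent, correct_parity) for one RAID 5 stripe."""
    expected = reduce(lambda a, b: a ^ b, data_blocks)
    return parity == expected, expected

# Scenario (d) of Figure 4.3: the data block was written, the parity was not.
ok, fixed = scrub_stripe([7, 2, 5], 4)   # 4 is the stale pre-failure parity
```

Here the scrub reports the stripe as inconsistent and supplies the repaired parity (7 XOR 2 XOR 5 = 0), which a recovery procedure would write back before the stripe is trusted for reconstruction.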
Fortunately, transaction theory, as well as recent work on mechanizing error recovery in centralized disk arrays [Courtright, 1997], provides a good foundation from which a scalable solution to this problem can be devised.

4.2 System description

The discussion in this chapter does not assume that the storage devices are NASD disks. As such, the devices do not necessarily have to export an object interface. The solutions presented in this chapter apply to any storage system where clients are allowed access to a shared pool of storage devices. These devices can export a NASD or a traditional SCSI (block-level) interface. However, this chapter assumes that the storage devices store uniquely named blocks and act independently of each other.

Figure 4.4: In a basic storage array, parallel storage controllers receive host requests and coordinate access to the shared storage devices. Data blocks are cached at the device where they are stored. No caches are maintained by the controllers.

4.2.1 Storage controllers

This chapter will use the terms storage controller or host to refer to the piece of software that executes on the client and is responsible for storage access. In Cheops, the clerk is the storage controller. This corresponds to the Parallel Storage Controller (PSC) in Figure 4.1(b) and in Figure 4.4. Figure 4.4 depicts the architecture of a shared storage array. Storage controllers perform four main tasks, which can be divided into access tasks and management tasks. The access tasks are reads and writes (hostread and hostwrite). These tasks provide the semantics of reading or writing a disk drive or array. The management tasks are reconstruction and migration (reconstruct and migrate, respectively).
The hostread and hostwrite tasks are addressed to virtual objects, which may be files or whole volumes. Blocks within a virtual object may be mapped onto physical block ranges on one or more storage devices in a variety of ways: for example, Chapter 3's Cheops [Gibson et al., 1998], generalized shared storage [Amiri et al., 2000] or Petal's distributed virtual disks [Lee and Thekkath, 1996]. In all mapping schemes, the representation of a virtual object can be described by a stripe map which specifies how the object is mapped, what redundancy scheme is used, and how the object is accessed. For maximal scalability, stripe maps should be cached at the clients to exploit their processors for performing access tasks. Management functions will occasionally change the contents of a stripe map, for example during migration or reconstruction. However, copies of the maps cached by the hosts must be kept coherent. There are several ways to maintain the coherence of cached data, such as leases [Gray and Cheriton, 1989] or invalidation callbacks [Howard et al., 1988]. Leases are preferable to callbacks because they allow the system to make forward progress in the case of faults. For example, a client may acquire a lock and then become inaccessible. The rest of the nodes in the system may not know whether the client has crashed or whether a part of the network has become unavailable. Leases allow the rest of the nodes to make progress by associating a duration with a lock, known as the lease duration. If leases are used, the system can assume that a lock acquired by a client is no longer valid once the lease duration has expired, unless the client explicitly requests that it be refreshed. Clients of a shared storage array use leases to maintain coherence of maps and simplify fault-tolerance. A client receives a stripe map from the storage manager.
This stripe map is guaranteed to be coherent until a specified expiration time, after which it must be refreshed before it is used. Before making a change to a stripe map, a manager must invalidate all outstanding leases at the clients. For efficiency, a callback may be used to invalidate a cache faster than by waiting for lease expiration. However, if a manager cannot contact one of the clients, it must wait until the expiration of the lease before making a change to the map.

4.2.2 Storage managers

This work does not discuss how storage managers choose to manage virtual objects. It assumes that multiple managers exist and that a protocol is used for load balancing and fail-over. A scalable solution to the dynamic election of storage managers is presented in [Golding and Borowsky, 1999]. The only assumption made here is that, at any point in time, there is a unique storage manager associated with a virtual object. In a shared storage array, any client can access any virtual object. As a result, access tasks can be concurrently executing at multiple clients. Storage management tasks are slightly different, however. As already mentioned, storage management tasks include reconstruction and migration. Reconstruction refers to the re-computation of data on a failed device from redundant data on other devices in its stripe groups. Migration refers to moving data between devices, potentially changing the stripe unit size or the RAID level. Storage migration is useful for load and capacity balancing, and to enact performance optimizations. For example, when storage devices are added, objects stored on loaded servers are usually copied to the newly added devices to balance load and utilize the newly added resources. This is often known as "data migration." Another kind of migration is "representation migration," which can be used to optimize performance by adapting data layout to access pattern.
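The lease behavior described above (a cached map is valid until its expiration time and must then be refreshed; a manager that cannot reach a client waits out that client's lease before changing the map) can be sketched as follows. The class and field names here are hypothetical, invented for illustration; the text specifies only the behavior.

```python
import time

class CachedStripeMap:
    def __init__(self, mapping, lease_expires):
        self.mapping = mapping               # virtual-to-physical layout
        self.lease_expires = lease_expires   # absolute time granted by the manager

    def usable(self, now=None):
        """The cached map may be consulted only while its lease is unexpired."""
        now = time.time() if now is None else now
        return now < self.lease_expires

# A manager that cannot contact this client simply waits past lease_expires;
# after that point the client's stale copy is unusable, so the map can be
# changed safely without a callback.
m = CachedStripeMap({"obj1": ("dev3", 0, 1024)}, lease_expires=100.0)
```

A client whose lease has lapsed must re-fetch the map from the storage manager before issuing further accesses through it.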
As was demonstrated by the AutoRAID system [Wilkes et al., 1996], adapting storage representation to workload access patterns can improve performance. For instance, the ideal RAID level depends on the workload access pattern and the degree of storage utilization. Both RAID level 5 and level 1 representations are fault-tolerant, but behave differently under read and write workloads. A RAID level 5 small write results in four I/Os, while a RAID level 1 write results in two, making the latter preferable for write-intensive workloads. RAID level 1 has a 100% redundancy overhead, however, as opposed to 25% for RAID level 5 with a stripe width of 5 disks. This makes RAID level 5 preferable when capacity is short. Migrating between the two levels can adapt to changes in the workload. Storage management tasks are initiated by the storage manager for the virtual object. The storage manager can delegate "pieces" of the task (of migration or reconstruction) to clients or to other managers by giving them an exact description of the "operation plan." An example plan would specify reconstructing the first 10 blocks on a failed device and writing them to a replacement device.

4.3 Storage access and management using BSTs

Concurrency is an essential property of shared storage. In large clustered systems, for example, many clients often use stored data at the same time. In addition, because the reconstruction or copying of large virtual objects can take a long time, it is inevitable that most systems will want to allow concurrent access. As discussed in Section 4.1, however, it is hard to ensure correctness when concurrent storage controllers share access to data on multiple devices. Device failures, which can occur in the midst of concurrently executing tasks, complicate this even further.
Transaction theory, which was originally developed for database management systems, handles this complexity by grouping primitive read and write operations into transactions that exhibit the ACID properties (Atomicity, Consistency, Isolation and Durability) [Haerder and Reuter, 1983]. Databases, however, must correctly perform arbitrary transactions whose semantics are application-defined. Storage controllers, on the other hand, perform only four tasks, the semantics of which are well known. This a priori knowledge enables powerful specialization of transaction mechanisms for storage tasks [Courtright, 1997].

The following discussion establishes the correctness of storage controller tasks in terms of a durable serial execution, where tasks are executed one at a time and where failures occur only while the system is idle (in the gaps between task executions). The discussion then addresses how concurrency and failures are handled. As a start, each storage controller task is broken down into one or more simple operations, called base storage transactions (BSTs). BSTs are transactions specialized to storage controller tasks. Informally, BSTs can be thought of as operations that do not interleave with each other even when they are executing concurrently; they appear to have executed one at a time, in a serial order. Furthermore, even if a failure occurs while many BSTs are active, the array is found in a consistent state the next time it is accessed after the failure. Thus, for the rest of this section, BSTs can be assumed to execute serially, and failures can be assumed not to happen during operations. The properties of BSTs will be precisely described in the next section. What BSTs enable is a simplified view of the world where operations and failure anomalies do not occur simultaneously. This removes one level of complexity from storage controller software.
This section assumes the existence of BSTs and builds a solution based on them. The focus in this section is on how BSTs can be used to allow management tasks to be ongoing in the presence of access tasks at the clients.

The approach proposed here allows management tasks to be performed on-line, while clients are accessing virtual objects. It breaks down management tasks into short-running BSTs. This reduces the impact on time-sensitive host access tasks and enables a variety of performance optimizations [Holland et al., 1994, Muntz and Lui, 1990]. Furthermore, it requires clients accessing a virtual object to “adapt” their access protocols depending on the management task that is being performed.

Each virtual object is in one of four modes: Fault-free (the usual state), Degraded (when one device has failed), Reconstructing (when recovering from a failure) or Migrating (when moving data). The first two modes are access modes, where only access tasks are performed, and the second two are management modes, where both management and access tasks are allowed. Clients therefore perform four main high-level tasks: hostread, hostwrite, migrate and reconstruct. Hostread and hostwrite tasks can be performed under any of the modes. The migrate task occurs only in the migrating mode, and the reconstruct task occurs only in the reconstructing mode. Different BSTs are used in different modes, partly to account for device failures and partly to exploit knowledge about concurrent management and access tasks.

  Task         Fault-Free           Degraded             Migrating    Reconstructing
  Read         Read                 Degraded-Read        Read-Old     Degraded-Read
  Write        Large-Write,         Read-Modify-Write,   Multi-Write  Replacement-Write
               Read-Modify-Write,   Reconstruct-Write
               Reconstruct-Write
  Migrate      —                    —                    Copy-Range   —
  Reconstruct  —                    —                    —            Rebuild-Range

Table 4.1: BSTs used by access and management tasks. The BST used by a read or a write task depends on the mode that the object is in.
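The mode-dependent choice of BST can be expressed as a lookup keyed by (task, mode). The dictionary below is an illustrative sketch whose entries follow Table 4.1 and the prose of Sections 4.3.1 through 4.3.4; the function and variable names are assumptions made for the example:

```python
# (task, mode) -> list of candidate BSTs, mirroring Table 4.1.
BSTS = {
    ("read",  "fault-free"):     ["Read"],
    ("read",  "degraded"):       ["Degraded-Read"],
    ("read",  "migrating"):      ["Read-Old"],
    ("read",  "reconstructing"): ["Degraded-Read"],
    ("write", "fault-free"):     ["Large-Write", "Read-Modify-Write", "Reconstruct-Write"],
    ("write", "degraded"):       ["Read-Modify-Write", "Reconstruct-Write"],
    ("write", "migrating"):      ["Multi-Write"],
    ("write", "reconstructing"): ["Replacement-Write"],
    ("migrate",     "migrating"):      ["Copy-Range"],
    ("reconstruct", "reconstructing"): ["Rebuild-Range"],
}

def bsts_for(task, mode):
    """Return the BSTs a client may use for `task` while the object is in `mode`.
    Management tasks are allowed only in their corresponding management modes."""
    try:
        return BSTS[(task, mode)]
    except KeyError:
        raise ValueError(f"task {task!r} not allowed in mode {mode!r}")
```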
In the migrating mode, a read task invokes the Read-Old BST, which reads the old (original) copy of the block as shown in Figure 4.7. A write task invokes the Multi-Write BST, which writes to both the old and the new copies. In this mode, a special task (the migrate task) is ongoing in the background. This task invokes the Copy-Range BST to copy data from the original to the new location.

In the reconstructing mode, a read task invokes the Degraded-Read BST to recompute the data on the failed disk from the redundant copy in the parity. A write task writes to the degraded array and also updates the replacement copy. In the background, the reconstruct task is in progress in this mode. This task invokes the Rebuild-Range BST to rebuild data from the degraded array and write it to the replacement copy.

4.3.1 Fault-free mode

This discussion focuses on RAID level 5 because it is the most complex and general RAID level. The solutions described in the following sections can be readily applied to RAID levels 1 and 4, for example. Table 4.1 shows the BSTs used to perform each allowed task in each of these modes. The BSTs for Fault-free mode are straightforward, and are shown in Figure 4.5. Tasks are mapped onto BSTs as follows. An access task, hostread or hostwrite, is executed using one BST. Which BST is chosen depends on how much of the stripe is being updated.

Figure 4.5: Storage transactions used in fault-free mode. These BSTs are the basic fault-free write protocols of RAID level 5 described in Chapter 2.

4.3.2 Degraded mode

In degraded mode, only access tasks are allowed. The BST used depends on whether the failed device is being accessed or not. If the failed device is not accessed, the BST used is similar to the one used in fault-free mode. If the failed device is being read or written, all the surviving stripe units must be accessed, as shown in Figure 4.6.

Figure 4.6: Storage transactions used in degraded mode. These BSTs are the basic degraded-mode write protocols of RAID level 5 described in Chapter 2.

4.3.3 Migrating mode

To simplify exposition, assume a non-redundant virtual object as shown in Figure 4.7. Recall that migrating an object requires two steps: copying the data to a new location and then updating the map to point to the new location. In the migrating mode, the stripe map for the virtual object specifies the old and new physical locations. Hostwrite tasks update the old and new physical locations by invoking a Multi-Write BST. Thus, at any point during the migration, the target physical blocks are either empty (not yet written to) or contain the same contents as their associated source physical blocks. Hostread tasks invoke the read BST, which reads the physical blocks from their old locations. The read BST does not access the target physical blocks because they may still be empty. The migrate task can be ongoing in parallel using the copy BST.

Because the execution of BSTs is serializable, the Read-Old BST is correct in that it always returns the most recent data written to a block. To see why serializability is sufficient to guarantee correctness, consider any serial history composed of Multi-Write, Read-Old and Copy-Range BSTs. The Read-Old BST reads the contents of the original location, which is not updated by the Copy-Range BST, so Copy-Range BSTs can be ignored.
By inspection, since the Multi-Write BST is the only BST that updates the original blocks, and since a Multi-Write BST is invoked for each hostwrite task, it follows that a subsequent hostread task invoking a Read-Old BST reads the contents last written to the block.

Furthermore, when the migrate task completes, after having copied the entire original object to the target location, both the original and target objects are guaranteed to have the same contents. To see why serializability guarantees that at the conclusion of the migrate task both copies contain the same contents, consider this intuitive argument based on a serial execution of Multi-Write and Copy-Range BSTs. Read-Old BSTs can be ignored because they never update the contents of storage. In particular, and without loss of generality, consider a block in the target location. For any such block, there are exactly two cases: the last transaction that wrote the block is either a Multi-Write BST or a Copy-Range BST. If the last transaction that wrote to the block is a Copy-Range BST, then the contents of this block must be the same as those of its counterpart in the original location. This follows from the semantics of the Copy-Range BST, which copies data from the original to the target. If the last transaction that wrote to the block is a Multi-Write BST (in this case the Copy-Range was followed by a host update), then this transaction must have written the same contents to both copies, also by virtue of its semantics. Both cases lead to the desired conclusion that both copies contain the same contents. Thus, when the migrate task completes, the storage manager can revoke all outstanding leases, change the stripe map to point to the new location, and discard the old location.

4.3.4 Reconstructing mode

This mode is used when recovering from a disk failure (Figure 4.8).
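The case analysis above can be exercised on a toy serial history. The sketch below, with assumed function names, models the three migrating-mode BSTs over a small non-redundant object and checks that any serial interleaving of host updates and the copy leaves both copies identical:

```python
def multi_write(old, new, addr, value):
    # hostwrite in migrating mode: update both the old and the new copy.
    old[addr] = value
    new[addr] = value

def copy_range(old, new, addrs):
    # migrate task: copy blocks from the original to the target location.
    for a in addrs:
        new[a] = old[a]

def read_old(old, addr):
    # hostread in migrating mode: read the original copy only.
    return old[addr]

# A serial history of BSTs over a 4-block object.
old = {a: f"v{a}" for a in range(4)}
new = {}
multi_write(old, new, 2, "x")    # host update before the copy reaches block 2
copy_range(old, new, range(4))   # migrate task copies the whole object
multi_write(old, new, 0, "y")    # host update after the copy
```

After this history the last writer of each target block is either the Copy-Range (blocks 1-3) or a Multi-Write (block 0), and in both cases the copies agree, as the argument in the text predicts.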
The system declares a new block on a new disk to be the replacement block, then uses the reconstruct task to recover the contents of that block. This can occur in parallel with hostread and hostwrite tasks. All these tasks are aware of both the old and new mappings for the stripe, but the read BSTs use the “original array,” ignoring the replacement block altogether. Hostwrite tasks use BSTs that behave as if the original array were in degraded mode, but also update the replacement block on each write to the failed block. The reconstruct task rebuilds the data on the replacement block using the Rebuild-Range BST, which reads the surviving data and parity blocks in a stripe, computes the contents of the failed data block and writes it to the replacement disk. When the reconstruct task is done, the replacement block will reflect the data from the failed block, parity will be consistent, and the stripe enters fault-free mode. Note that reconstruction concurrent with host accesses may result in unnecessary, but still correct, work.

Figure 4.7: Storage transactions used in the migrating mode. The Multi-Write BST updates both the old location and the new location of the block. The Read-Old BST reads the old location, ignoring the new location. The migrate task invokes the Copy-Range BST to copy the contents of the blocks from the old location to the new one.

Figure 4.8: Storage transactions used in reconstructing mode. Write tasks that do not address the failed device use the Read-Modify-Write BST (a), as in degraded mode. Writes that address the failed device must update both the old location (the degraded array) and the new location (the replacement device); the write BST invoked in this case is called Replacement-Write (b). Read tasks invoke the Degraded-Read BST (c), as in degraded mode, to read the data from the old location, the degraded array. The reconstruct task invokes the Rebuild-Range BST (d) to reconstruct the contents of the failed device and write them to the replacement.

4.3.5 The structure of BSTs

As shown in the previous section, all the required BSTs share a common structure. Each BST maps onto one or more low-level read and write requests to the NASD devices. A low-level request is a read or write request to blocks within NASD objects. These are often referred to as device-level I/Os, or devread and devwrite. One property that will later be exploited by the concurrency control protocols is that storage transactions often have a two-phase nature. All base storage transactions are composed of either single-phase or two-phase collections of low-level requests, as shown in Figure 4.9. A single-phase hostread (hostwrite) breaks down into parallel devread (devwrite) requests, while a two-phase hostwrite breaks down into a first read phase (where devreads are sent to the disks) followed by a write phase (where devwrites are issued).

Figure 4.9: The general composition of storage-level host operations. Operations are either single-phase or two-phase. Two-phase operations have a read phase followed by a write phase. A computation (C) separates the two I/O phases. Host operations are denoted by hostread and hostwrite, while disk-level requests are denoted by devread and devwrite.
In fact, all the RAID architectures (including mirroring, single-failure-tolerating parity and double-failure-tolerating parity [Patterson et al., 1988, Blaum et al., 1994], as well as parity declustering [Holland et al., 1994], which is particularly appropriate for large storage systems) share the common characteristic that all logical reads and writes map onto device-level I/Os either in a single phase or in a read phase followed by a write phase.

4.4 BST properties

BSTs specialize the general transaction ACID properties to the limited tasks of shared storage controllers. Before describing the protocols that guarantee the key properties of BSTs in the next section, it is necessary to state the properties of BSTs precisely. BSTs’ ACID properties have a specific meaning specialized to shared arrays.

4.4.1 BST Consistency

In the context of shared redundant storage, consistency means that redundant storage blocks contain data that correctly encodes the corresponding data block values. For RAID level 5, this means that after each BST, the value of the parity block (P) is the XOR of the values of the last writes to each of the corresponding data blocks (D). Each of the BSTs described above has the property that, provided storage is consistent when the BST starts, storage does not fail while it is executing, and BSTs execute one at a time, then storage is consistent after the completion of the BST. This storage specialization of transaction consistency guides much of the rest of the work in this chapter.

4.4.2 BST Durability

The durability property states that storage regions written by a successful BST maintain the last written data and never revert to older data values. Durability must be preserved even after a device failure and a subsequent reconstruction. Because the primitive operations of BSTs transform stable storage (typically magnetic disks), durability of changes is not difficult.
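The RAID level 5 consistency condition of Section 4.4.1 can be stated as a predicate: the parity block equals the XOR of its data blocks. A minimal sketch (the function name is an assumption for illustration):

```python
def parity_consistent(data_blocks, parity_block):
    """RAID level 5 consistency: the parity block is the XOR of the last
    written values of the corresponding data blocks."""
    acc = bytes(len(parity_block))  # all-zero block, the XOR identity
    for d in data_blocks:
        acc = bytes(x ^ y for x, y in zip(acc, d))
    return acc == parity_block
```

Every BST is required to preserve this predicate when run alone on consistent storage; the serializability protocols of Section 4.5 extend the guarantee to concurrent executions.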
Because BSTs guarantee the consistency property, RAID level 5 reconstruction will yield the correct data (the last written values). Thus, consistency of the array and a serial application of BSTs guarantee durability. This property will not be discussed further.

4.4.3 No atomicity

Atomicity is the property that a transaction, once started, either completes entirely or terminates without changing any storage values. In the absence of knowledge of the function of a transaction, database systems must provide full atomicity by logging values to be changed before making any changes. The logs are preserved until all the changes are made, and if a failure occurs, the log of committed changes is re-applied during recovery [Haerder and Reuter, 1983].

However, storage subsystems (disks and disk arrays) have traditionally not provided this property. In current storage systems, if a client fails after some devices have been updated but before data has been sent to all devices, atomicity will be violated because some blocks have been written, and the values that should have been written to the others have been lost with the loss of the host’s memory. All existing storage systems have this problem, although database systems built on these non-atomic storage systems achieve atomicity using database-level logging. In agreement with current systems, BSTs are therefore not required to have, and in this work do not have, the “all-or-nothing” atomicity property. The consistency property of BSTs, however, states that upon such a failure, partially changed stripes must be detected and new parity recomputed to reflect the possibly incomplete changes applied to the data. The consistency property is upheld by detecting the changed stripes and by invoking a Rebuild-Range BST to make the parity consistent with the data blocks (even if these data blocks are not complete or correct with respect to the application).
Note that to detect a failure and take these actions, storage system code not on the failed host must know which BST was active. This is accomplished through the protocols described in Section 4.7. TickerTAIP [Cao et al., 1994] is a parallel disk array which exploited knowledge of storage semantics to similarly achieve this level of parity consistency without full write-ahead logging.

4.4.4 BST Isolation

In showing that the consistency and durability properties are met, BSTs were required to appear as if they executed serially, one at a time. This is the isolation property of BSTs. Precisely, this property states that the execution of concurrent BSTs is serializable, that is, it yields effects identical to those of at least one of the many possible serial execution sequences [Papadimitriou, 1979]. The effect is defined in terms of the values read by other BSTs. Serializability ensures that the concurrency anomalies described in Section 4.1 do not occur. Furthermore, it enables host access and management tasks to be ongoing simultaneously without leading to incorrectness. This property must be ensured at low overhead, however, because of the stringent performance demands placed on storage systems by higher-level applications. Furthermore, the protocols ensuring serializability must scale well to the sizes expected and desired in large network-attached storage arrays.

Traditional protocols ensuring serializability rely on a central node to order and serialize concurrent host accesses. This central node can become a performance bottleneck in large systems.

  Simulation parameter   Value
  System size            30 devices, 16 hosts, 20 threads per host; RAID level 5, 64 KB stripe unit; stripe width = 4 data + parity; 20,000 stripe units per device
  Host workload          Random think time (normally distributed, mean 80 ms, variance 10 ms); 65% reads; host access address is uniform random; access size varies uniformly between 1 and 6 stripe units
  Service time           Disk service time random (normally distributed, with mean positioning time of 8 ms and variance 1 ms); 15 MB/s transfer rate
  Network                10 MB/s bandwidth, with a random per-message overhead of 0.5-1 ms; mean host/lock-server message processing (request service) time: 750 microseconds

Table 4.2: Baseline simulation parameters. Host data is striped across the devices in a RAID level 5 layout, with 5 devices per parity group. Host access sizes vary uniformly between 1 and 6 stripe units, exercising all the possible RAID write algorithms discussed in Section 4.3. The region accessed on each device is about 1.22 GB, representing a 25% “active” region of a 5 GB disk.

The sensitivity of the protocols to the size of the region accessed, the read/write ratio, and network latency variability is explored later in this section. The following section analyzes four serializability algorithms for BSTs: two traditional centralized protocols, and two device-based distributed protocols that exploit trends toward increased device intelligence to achieve higher scalability.

4.5 Serializability protocols for BSTs

To provide serializability for storage transactions, three approaches are considered: centralized locking, distributed device-embedded locking, and timestamp ordering using loosely synchronized clocks. Each protocol is described and evaluated in simulation. The evaluation workload is composed of a fault-free random-access workload applied to RAID level 5 storage. All the presented protocols guarantee serializability for all hostread and hostwrite operations, but exhibit different latency and scaling characteristics.

4.5.1 Evaluation environment

The protocols were implemented in full detail in simulation, using the Pantheon simulator system [Wilkes, 1995].
The simulated cluster consists of hosts and disks connected by a network. Table 4.2 shows the baseline parameters of the experiments. Although the protocols were simulated in detail, the service times for hosts, controllers, links and storage devices were derived from simple distributions based on the observed behavior of Pentium-class servers communicating with 1997 SCSI disks [Technology, 1998] over a fast switched local area network (like Fibre Channel). Host clocks were allowed to drift within a practical few milliseconds of real time [Mills, 1988].

The performance of the protocols is compared under a variety of synthetically generated workloads and environmental conditions. The baseline workload represents the kind of sharing that is characteristic of OLTP workloads and cluster applications (databases and file servers), where load is dynamically balanced across the hosts or servers in the cluster, resulting in limited locality and mostly random accesses. This baseline system applies a moderate to high load on its storage devices, yielding about 50% sustained utilization.

4.5.2 Centralized locking

With server locking, a centralized lock server provides locking on low-level storage block ranges. A host acquires an exclusive (in the case of a devwrite) or a shared (in the case of a devread) lock on a set of target ranges by sending a lock message to the lock server. The host may then issue the devread or devwrite low-level I/O requests to the devices. When all I/O requests complete, the host sends an unlock message to the lock server. The lock server queues a host’s lock request if there is an outstanding lock on a block in the requested range. Once all the conflicting locks have been released, a response is returned to the host. However, server locking introduces a potential bottleneck at the server, and delays issuing the low-level I/O requests for at least one round trip of messaging to the lock server.
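The server-locking conflict rules just described (shared locks for devreads, exclusive locks for devwrites) can be sketched as follows. This is a simplified illustration, not the simulated implementation: it locks individual blocks rather than block ranges, and it refuses conflicting requests where the real server queues them.

```python
class LockServer:
    """Minimal centralized block-lock server (illustrative sketch)."""
    def __init__(self):
        self.readers = {}    # block -> count of shared holders
        self.writers = set() # blocks held exclusively

    def try_lock(self, blocks, exclusive):
        # Grant only if no conflicting lock is outstanding on any block;
        # the batch is granted or refused as a whole.
        for b in blocks:
            if b in self.writers or (exclusive and self.readers.get(b, 0) > 0):
                return False
        for b in blocks:
            if exclusive:
                self.writers.add(b)
            else:
                self.readers[b] = self.readers.get(b, 0) + 1
        return True

    def unlock(self, blocks, exclusive):
        # Sent by the host after all of its low-level I/O requests complete.
        for b in blocks:
            if exclusive:
                self.writers.discard(b)
            else:
                self.readers[b] -= 1
```

Shared locks on the same block coexist, while an exclusive request conflicts with any outstanding lock; this is the standard shared/exclusive compatibility rule the text relies on.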
Server locking is an example of batch locking, described in Chapter 2. As mentioned in that chapter, batch locking achieves serializability. Furthermore, because all locks are acquired in a single message, latency is minimized and deadlocks are avoided.

Callback locking [Howard et al., 1988, Lamb et al., 1991, Carey et al., 1994] is a popular variant of server locking which delays the unlock message, effectively caching the lock at the host, in the hope that the host will generate another access to the same block in the near future and be able to avoid sending a lock message to the server. In the case of a cache hit, lock acquisition messaging is avoided and access latency is reduced. If a host requests a lock from the lock server that conflicts with a lock cached by another host, the server contacts the host holding the conflicting lock (this is the callback message), asking it to relinquish the cached lock so the server can grant a lock to the new owner. In this case, the client requesting the lock incurs longer delays until the client(s) holding the conflicting locks are contacted by the lock server. One common optimization to callback locking is to have locks automatically expire after a specified period, known as the lease duration, so that callback messaging is reduced. Our implementation of centralized callback locking uses a lease duration of 30 seconds, during which consecutive lock acquisitions by different clients suffer the callback messaging latency overhead.

Figure 4.10: Scaling of server and callback locking. The bottom-most line represents the zero-overhead protocol, which has no concurrency control and provides no serializability guarantees. Centralized locking schemes bottleneck before delivering the maximum achievable throughput. The applied load was increased by increasing the number of active hosts in the system until saturation. The baseline workload (16 hosts) corresponds to the fifth point from the left in the graph.

Figure 4.10 highlights the scalability limitation of centralized locking protocols. It plots the host request-to-response latency of the protocols against throughput (number of hosts). Server locking bottlenecks substantially before delivering the maximum throughput that is attainable with the zero-overhead protocol. This is because the server’s CPU saturates handling network messaging and processing lock and unlock requests.

Callback locking reduces lock server load and lock acquisition latencies, provided that locks are commonly reused by the same host multiple times before a different host requests a conflicting lock. At one extreme, a host requests locks on its private data and never again interacts with the lock server until the lease expires. At the other extreme, each lock is used once by a host, and then is called back by a conflicting use.

Figure 4.11: Scaling of server and callback locking under increased contention. In this experiment, the region accessed by the hosts is restricted to only 40% of the region used in the baseline workload. Callback locking yields smaller benefits under this high-contention workload than under the baseline workload. Both centralized protocols bottleneck before delivering the full throughput of the storage system. The baseline workload (16 hosts) corresponds to the fifth point from the left in the graph.
This will induce the same number of messages as simple server locking, but the requesting host must wait on two other machines, one at a time, to obtain a lock that must be called back, potentially doubling latency before the disk I/O can occur. This pre-I/O latency can be even worse if a read lock is shared by a large number of hosts, since all the locks need to be recalled. Figure 4.10 shows that at the baseline workload, callback locking reduces latency relative to simple server locking by over 10%, but latency is still 15% larger than with the zero-overhead protocol. This benefit comes not from locality, as the workload contains little of it, but from the dominance of read traffic, which allows concurrent read locks at all hosts until the next write. While more benefit would be possible if the workload had more locality, the false sharing between independent accesses that share a parity block limits the potential benefits of locality.

Figure 4.11 shows the performance of the centralized protocols when the target region accessed by the hosts is restricted to 40% of the original region. The graph shows that the benefit of callback locking is reduced as the percentage of locks revoked by other conflicting clients before being reused increases with contention. This increases the pre-I/O latency and increases the load on the lock server. As the system size and load increase, the performance of both centralized callback locking and server locking becomes limited by the central server bottleneck.

Figure 4.12: The effect of contention on host latency for centralized locking protocols. Latency increases under callback locking as the fraction of disk targeted by host accesses decreases. Contention increases from right to left, as the % of disk accessed by the workload decreases. Callback locking reduces latencies compared to server locking under moderate contention, even for a random-access workload, because the baseline workload is mostly (65%) reads and shared locks are cached at each host. The graph shows that under high contention (small fraction of disk accessed), the latency of callback locking increases relative to that of server locking, because caching brings little benefit when locks are revoked by conflicting clients before they are reused.

Figure 4.12 shows the effect of contention on the relative performance of the locking protocols. When contention is high, closer to the origin on the x-axis of the graph, there is little benefit from lock caching. In that case, the percentage of the disk that the clients are accessing is relatively small, resulting in frequent lock callbacks and hand-offs between clients. When contention is low, callback locking shows a noticeable reduction in response time, as locks are reused by clients before they are revoked.

4.5.3 Parallel lock servers

The scalability bottleneck of centralized locking protocols can be avoided by distributing the locking work across multiple parallel lock servers. This can be achieved in two ways: a host can send a single request to one of the lock servers, which coordinate among themselves (ensuring proper concurrency control and freedom from deadlock) and return a single response to the host, or hosts can directly send parallel lock requests to the servers based on a partitioning of locks. As the first approach simply displaces the responsibility at the cost of more messaging, it is not discussed further. Instead, the focus in the remaining discussion is placed on partitioned locks. Because locks are partitioned across lock servers, whether by a static or dynamic scheme, there will always be some pairs of locks that are on different servers. When a host attempts to acquire multiple locks managed by multiple servers, deadlocks can arise.
This can be avoided if locks are statically ordered and all hosts acquire locks in the same prescribed order, but this implies that lock requests are sent to the lock servers one at a time (the next request is sent only after the previous reply is received) and not in parallel, increasing latency and lock hold time substantially. Alternatively, deadlocks can be detected via time-outs. If a lock request cannot be serviced at a lock server within a given time-out, a negative response is returned to the host. A host recovers from deadlock by releasing all the acquired locks and retrying lock acquisition from the beginning. While parallel lock servers employing a deadlock avoidance or detection scheme do not have the bottleneck problem, they suffer from increased messaging and usually longer pre-I/O latency. This induces longer lock hold times, increasing the chance of conflict compared to an unloaded single lock server.

4.5.4 Device-served locking

Given the paramount importance of I/O system scalability and the opportunity of increasing storage device intelligence, it seemed promising to investigate embedding lock servers in the devices. The goal is to reduce the cost of a scalable serializable storage array and, by specialization, to increase its performance. The specializations exploit the two-phase nature of BSTs to piggy-back lock messaging on the I/O requests, thereby reducing the total number of messages and the latency. In device-served locking, each device serves locks for the blocks stored on it. This balances the lock load over all the devices, enhancing scalability. Like any parallel lock server scheme, simple device-served locking increases the amount of messaging and the pre-I/O latency. Often, however, the lock/unlock messaging can be eliminated by piggy-backing these messages on the I/O requests, because the lock server and the storage device are the same. Lock requests are piggy-backed on the first I/O phase of a two-phase storage transaction.
To make recovery simple, this scheme requires that a host not issue any devwrite requests until all locks have been acquired, although it may issue devread requests. Therefore, restarting an operation in the lock acquisition phase does not require recovering the state of the blocks (since no data has been written yet).

Figure 4.13: The breakdown of a host operation with device-served locking and the piggy-backing optimization. A node represents a message exchange with a device. An 'L' node includes a lock operation, 'U' stands for an unlock operation, 'LR' represents the lock-and-devread operation, while 'WU' stands for devwrite-and-unlock. The edges represent control dependencies. A 'C' node represents a synchronization point at the host, where the host blocks until all preceding operations complete, restarting from the beginning if any of them fail. Lock operations can fail if the device times out before it can grant the lock ('A').

In the case of single-phase writes, a lock phase must be added, preceding the writes, since the lock and write requests cannot be bundled together without risking serializability. However, unlock messages can be piggy-backed onto write I/O requests as shown in Figure 4.13. In the case of single-phase reads, lock acquisition can be piggy-backed on the reads, reducing pre-I/O latency, but a separate unlock phase is required. The latency of this additional phase can be hidden from the application since the data has been received and the application is no longer waiting for anything.
In the case of two-phase writes, locks can be acquired during the first I/O phase (by piggy-backing the lock requests on the devread requests) and released during the second I/O phase (by piggy-backing the unlock messages onto the devwrite requests), totally hiding the latency and messaging cost of locking. This device-supported parallel locking is almost sufficient to eliminate the need for leased callback locks because two-phase writes have no latency overhead associated with locking and the overhead of unlocking for single-phase reads is not observable. Only single-phase writes would benefit from lock caching.

Device-served locking is more effective than the centralized locking schemes, as shown in Figure 4.15 and Figure 4.16. With the baseline workload of 16 hosts, device-served locking causes latencies only 6% larger than minimal.

Figure 4.14: The messaging overhead of the various protocols. Device-served locks require an explicit unlock message for each block accessed. In the absence of retries, timestamp ordering requires one message per read I/O request, resulting in lower messaging overhead.

Figure 4.15: Scalability of device-served locking compared to the centralized variants. Device-served locking comes within a few percent of delivering the full throughput of the storage system.

Figure 4.16: Scalability of device-served locking compared to the centralized variants under increased contention. In this experiment, host accesses are restricted to 40% of the region targeted by the baseline workload. Device-served locking comes within a few percent of delivering the full throughput of the storage system.

Despite its scalability, device-served locking has two disadvantages: it uses more messages than centralized locking protocols, as shown in Figure 4.14, and it has performance vulnerabilities under high contention due to its susceptibility to deadlocks. A component of the deadlock vulnerability disadvantage is the difficulty of configuring time-outs for device-served locks. Note that although deadlocks may be uncommon, they have two second-order effects: they cause a large number of requests to be queued behind the blocked (deadlocked) requests until time-outs unlock the affected data; and, when deadlocks involve multiple requests at multiple devices, time-outs lead to inefficiencies because they restart more operations than necessary to ensure progress. This results in an increased messaging load on the network and wasted device resources for more message and request processing.

4.5.5 Timestamp ordering

Timestamp ordering protocols are an attractive mechanism for distributed concurrency control over storage devices since they place no overhead on reads and are not susceptible to deadlocks. As in the database implementation, discussed in Section 2.4.2, timestamp ordering works by having hosts independently determine the same total order in which high-level operations should be serialized. By providing information about that order (in the form of a timestamp) in each I/O request, the intelligent storage devices can enforce the ordering and thereby serializability.
Since I/O requests 96 CHAPTER 4. SHARED STORAGE ARRAYS º;» ¼;½7¾e¿;ÀJÁ9Â7ÃÄa½7 » º;» ¼;½7¾e¿JÀ;Áy ÃÅÄa½ Â7» éAêyë&ìfí1é9ì éAêyë&ìfí1é9ì äXå äå äXå ç ç ç ºJ»¼J½ ¾e¿JÀ;Áy Äa»³Æ³Ç è Ü Ø Ø Ø æ æ æ Ú Ú Ú »³¾;Çe¿JÀ;Áy Äa»³Æ³Ç »³¾JÇe¿;ÀJÁ9Â7ÃÄa½7 » »³¾JÇe¿;ÀJÁ9Â7ÃÄa½7 » ÈaÉÊ;ËaÌ*Í*ÎJÏ9СÉÅÑ³Ò ÈaÐ Ï9Ò7Ó*Ô;Õ7СÌ*Í*Î;ÏyÐÉÅÑ³Ò ÈaÐ Ï9Ò7Ó*Ô;Õ7ÐÌHÍ*Î;ÏyÐÑгÎ;Ö Figure 4.17: The composition of host operations in the optimized timestamp ordering protocol. devread, de- vwrite, and prewrite requests are denoted by ‘R’, ‘W’ and ‘P’ nodes respectively. ‘RP’ denotes a read-and- prewrite request. A ‘C’ node represents a synchronization barrier at the host, where the host blocks until all preceding operations complete, restarting from the beginning if any of them fail. Some of the prewrites or reads may be rejected by the device because they did not pass the timestamp checks. In a two-phase write, the prewrite requests are piggy-backed with the reads. Much like device-served locks of Section 4.5.4, single phase writes still require a round of messages before the devwrite requests are issued. 79 70 60 50 · server locking ¸ callback lockling 40 á device-served locks â timestamp ordering ã 30 zero-overhead protocol latency (msec) 20 10 0 µ 500µ 1000 49.94 1300µ total throughput¶ (ops/sec) Figure 4.18: Scalability of device-based protocols compared to the centralized variants. Both device-based locking and timestamp ordering closely approximate the performance of the zero-overhead protocol, coming within a few percent of the maximal throughput and minimal latency achievable under the zero-overhead protocol. 4.5. SERIALIZABILITY PROTOCOLS FOR BSTS 97 are tagged with an explicit order according to which they have to be processed (if at all) at each device, deadlocks can not occur and all allowed schedules are serializable. Instead of deadlock problems, out-of-order requests will be rejected, causing their parent high-level operation to be aborted and retried with a later timestamp. 
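The device-side check that enforces this ordering can be sketched as follows, using per-block read (rts) and write (wts) timestamps as in the background chapter's refresher; an illustrative sketch, not the simulator's code:

```python
# Illustrative sketch of the basic timestamp ordering checks at a device.
# rts/wts are the per-block read and write timestamps; opts is the timestamp
# of the high-level operation carried by the request.

def check_read(opts, rts, wts):
    """Accept a read tagged opts unless a later write has already been
    applied; on acceptance, advance the block's read timestamp."""
    if opts < wts:
        return None, rts, wts          # reject: host retries with a later timestamp
    return "ok", max(rts, opts), wts

def check_write(opts, rts, wts):
    """Accept a write tagged opts unless a later read or write has occurred."""
    if opts < rts or opts < wts:
        return None, rts, wts
    return "ok", rts, opts
```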
The use of timestamp ordering based on loosely synchronized clocks for concurrency control and for efficient timestamp management has been demonstrated by the Thor client-server object-oriented database management system [Adya et al., 1995]. As in the database case, since each device is performing a local check, a write request may pass the check in some devices, but the high-level operation may abort due to failed checks in other devices. This can lead to partial effects of a host write operation being applied to the storage system. While atomicity in the event of failures is not a requirement of BSTs, it is not acceptable to allow non-failing storage systems to damage data routinely. To avoid partial updates and the associated complex and expensive undo operations they require, multi-device writes need consensus before allowing changes. By splitting the write protocol into a prewrite phase followed by a write phase, the protocol ensures that all the devices take the same decision. This is consistent with timestamp ordering as implemented in database and distributed systems.

The cluster of hosts share a loosely-synchronized clock that is used to generate timestamps. New timestamps are generated uniquely at a host by sampling the local clock, then appending the host's unique identifier to the least significant bits of the clock value. Each device maintains two timestamps associated with each block (rts and wts). The device similarly maintains a queue of requests (reads, writes and prewrites) which are awaiting service (e.g. blocked behind a prewrite). minpts denotes the smallest timestamp of a prewrite that has been accepted in a specific block's request queue. A refresher on the timestamp ordering protocol, with example scenarios and algorithms, is given in Section 2.4 of the background chapter.

The read-modify-write protocol will now be discussed as an example since it employs the piggy-backing optimization and is of reasonable complexity.
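The unique timestamp generation just described can be sketched as below; the 16-bit host-identifier width is an illustrative assumption:

```python
# Sketch of unique timestamp generation: sample the loosely synchronized
# local clock and append the host's unique identifier in the least
# significant bits, so no two hosts can generate the same timestamp.
# The 16-bit identifier width is an illustrative assumption.

HOST_ID_BITS = 16

def new_timestamp(clock_ticks, host_id):
    assert 0 <= host_id < (1 << HOST_ID_BITS)
    return (clock_ticks << HOST_ID_BITS) | host_id
```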
This protocol reads data and parity in a first phase, uses this data together with the "new data" to compute the new parity, then updates both data and parity. The host starts by generating a new timestamp, opts, then sends low-level I/O read requests to the data and parity devices, tagging each request with opts, and bundling each request with a prewrite request as shown in Figure 4.17. The device receiving a "read-and-prewrite" request performs the necessary timestamp checks both for a read and a prewrite, accepting the request only if both checks succeed; that is, opts > rts and opts > wts. An accepted request is queued if opts > minpts, because there is an outstanding prewrite with a lower timestamp; otherwise data is returned to the host and rts is updated, provided that opts > rts. When the host has received all the data from the accepted "read-and-prewrite" requests, it computes the new parity and sends the new data and parity in low-level write I/O requests also tagged with opts. The devices are guaranteed not to reject these writes by the acceptance of the prewrites. In addition to doing the write, each device will update wts, discard the corresponding prewrite request from the queue, and possibly increase minpts. The request queue is then inspected to see if any read or read-and-prewrite requests can now be completed. Under normal circumstances, the read-modify-write protocol does not incur any overhead, just like piggy-backed device-based locking.

Several optimizations can be applied to the implementation of timestamp ordering to achieve greater efficiency. These optimizations are described next. They were implemented in the simulation and all graphs include their effects.

Minimizing buffering overhead

The protocol as presented can induce a high space overhead to buffer written data when there is high overwrite and read activity to overlapping ranges.
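Returning to the read-modify-write protocol above, the device-side handling of a bundled read-and-prewrite request can be sketched as follows; a hypothetical illustration using the rts/wts/minpts state from the text (the completion of queued reads after an earlier write is elided for brevity):

```python
# Hypothetical sketch of per-block device state for the read-modify-write
# protocol: a read-and-prewrite tagged opts is accepted only if opts exceeds
# both rts and wts, queued behind an earlier accepted prewrite, and otherwise
# answered with data while rts advances. The later write is then guaranteed
# to succeed.

class Block:
    def __init__(self):
        self.rts = 0
        self.wts = 0
        self.prewrites = []    # timestamps of accepted, not-yet-written prewrites

    def min_pts(self):
        return min(self.prewrites) if self.prewrites else float("inf")

    def read_and_prewrite(self, opts):
        if opts <= self.rts or opts <= self.wts:
            return "reject"                # out of order: host retries later
        queued = opts > self.min_pts()     # an earlier prewrite is still pending
        self.prewrites.append(opts)
        if queued:
            return "queued"
        self.rts = max(self.rts, opts)
        return "data"

    def write(self, opts):
        """Second phase: cannot be rejected once the prewrite was accepted."""
        self.prewrites.remove(opts)
        self.wts = max(self.wts, opts)
        return "done"
```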
If the storage device has accepted multiple prewrites to a block, and their corresponding writes are received out of order, the writes with the largest timestamps have to be buffered and applied in timestamp order to satisfy any intermediate reads in order. Our approach for avoiding excessive write data buffering is to apply writes as soon as they arrive, rendering some later writes unnecessary and some later reads with a lower timestamp impossible. As a result, these later reads with a lower timestamp will have to be rejected. Although readers can starve in this protocol if there is a persistent stream of writes, this is unlikely at the storage layer. An important advantage of this immediate write processing approach is that storage devices can exploit their ability to stream data at high data rates to the disk surface.

Avoiding timestamp accesses

The reader may have already noticed that the protocols require that the pair of timestamps rts and wts associated with each disk block be durable, read before any disk operation, and written after every disk operation. A naive implementation might store these timestamps on disk, near the associated data. However, this would result in one extra disk access after reading a block (to update the block's rts), and one extra before writing a block (to read the block's previous wts). Doubling the number of disk accesses is not consistent with the desired goal of achieving "high performance".

                          Host latency    Throughput    Messages
                          (msec)          (ops/sec)     per operation
  Centralized locking     63.8            800           12.6
  Callback locking        57.0            800           11.2
  Device-served locking   51.8            800           14.8
  Timestamp ordering      51.6            800           10
  Zero-overhead protocol  48.7            800           8.6

Table 4.3: Summary performance results of the various protocols under the baseline workload. The zero-overhead protocol does not perform any control work, simply issuing I/Os to the devices; hence, it does not provide any correctness guarantees.

As long as all clocks are loosely synchronized (to differ by a bounded amount) and message delivery latency is also bounded, a device need not accept a request timestamped with a value much smaller than its current time. Hence, per-block timestamp information older than T seconds, for some value of T, can be discarded and a value of NOW - T used instead (where NOW stands for the current time). Moreover, if a device is re-initiated after a "crash" or power cycle, it can simply wait time T after its clock is synchronized before accepting requests, or record its initial synchronized time and reject all requests with earlier timestamps. Therefore, timestamps only need volatile storage, and only enough to record a few seconds of activity.

In our implementation, a device does not maintain a pair of timestamps for each block on the device. Instead, it maintains per-block read and write timestamps only for those blocks that have been accessed in the past T seconds. These recent per-block timestamps are maintained in a data structure known as the timestamp log. Periodically, every T seconds, the device truncates the log such that only timestamps that are within T seconds of the current time are maintained. T is known as the log truncation window. If an access is received to a block but the block's timestamp is not maintained in the log, the block is assumed to have a timestamp of NOW - T, where NOW stands for the current time.

To understand why log truncation does not result in unnecessary rejections, recall that host clocks are loosely synchronized to within tens of milliseconds and message latency is bounded, so a new request arriving at a device will have a timestamp that is within tens of milliseconds of the device's notion of current time. The device rejects a request if its timestamp opts does not exceed the maintained values of rts and wts.
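The timestamp log and its truncation can be sketched as follows; a hypothetical, volatile in-memory structure where any block absent from the log defaults to NOW - T:

```python
# Sketch of the timestamp log: per-block timestamps are kept in volatile
# memory only for blocks touched in the last T seconds; a block absent from
# the log is assumed to carry timestamps of NOW - T. Names are illustrative.

class TimestampLog:
    def __init__(self, truncation_window):
        self.T = truncation_window
        self.log = {}              # block -> (rts, wts, last_access_time)

    def get(self, block, now):
        if block in self.log:
            rts, wts, _ = self.log[block]
            return rts, wts
        return now - self.T, now - self.T    # truncated default

    def put(self, block, rts, wts, now):
        self.log[block] = (rts, wts, now)

    def truncate(self, now):
        """Run every T seconds: drop entries older than the truncation window."""
        self.log = {b: v for b, v in self.log.items() if now - v[2] <= self.T}
```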
In particular, the device will unnecessarily reject the request when opts exceeds the "real" read and write timestamps, rts and wts, but when opts fails to exceed the truncated values used by the device, rts' and wts'. Considering only rts, this means that:

    rts < opts <= rts'

Replacing rts' by its truncated value of NOW - T, the inequation becomes:

    rts < opts <= NOW - T

Thus, for a request to be rejected unnecessarily, it must be timestamped with a value opts that is more than T seconds in the past, where T is the log truncation window. This can be made impossible in practice by selecting a value of T that is many multiples of the clock skew window augmented by the network latency. A T of a few seconds largely satisfies this need. This minimizes the chance that a request will be received and rejected because of aggressive reduction of the number of accurately maintained timestamps kept in the device cache.

Figure 4.19: The effect of (read/write) workload composition on messaging overhead for the various protocols. Under read-intensive workloads, timestamp ordering comes close to the zero-overhead protocol. Its messaging overhead increases as the fraction of writes increases. However, it still performs the least amount of messaging across the range of workloads.

In addition to being highly scalable as illustrated in Figure 4.18, another advantage of timestamp ordering is that it uses the smallest amount of messaging compared to all the other protocols (Figure 4.19). It has no messaging overhead on reads, and with the piggy-backing optimization applied,
it can also eliminate the messaging overhead associated with read-modify-write operations.

4.5.6 Sensitivity analysis

This section reports on the performance of the protocols over unpredictable network latencies (causing interleaved message deliveries) and using faster disks.

Sensitivity to network variability

When several operations attempt to access a conflicting range, the succeeding operations are delayed until the first one completes. The probability of delay depends on the level of contention in the workload. But even for a fixed workload, the concurrency control protocol and environmental factors (e.g. network reordering of messages) can result in different delay behavior for the different protocols.

Figure 4.20: The effect of the variability in network message latencies on the fraction of operations delayed or retried for the various protocols. Message latency is varied uniformly between 500 microseconds and a maximum value, called the window size. This is the window of variability in network message latencies. Higher window sizes model a network that has highly unpredictable (variable) message latencies. The graph plots the fraction of operations delayed or retried against the size of the variability window.

As shown in Figure 4.20 and Figure 4.21, the fraction of operations delayed is highest for callback locking because it has the highest window of vulnerability to conflict (lock hold time). Moreover, its lock hold time is independent of message time variability because it is based on lease hold time.
Figure 4.21: The effect of the variability in network message latencies on host latency for the various protocols. Message latency is varied uniformly between 500 microseconds and a maximum window size. The graph plots latency against the size of the variability window.

Distributed device-based protocols both do better than callback locking and server locking because they exploit piggy-backing of lock/ordering requests on the I/Os, thereby avoiding the latency of communicating with the lock server before starting the I/O and shortening the window of vulnerability to conflict. Both device-based protocols, however, are potentially more sensitive to the message transport layer, or more precisely, to message arrival skew. Message arrival skew can cause deadlocks and restarts for device-served locks, and rejections and retries for timestamp ordering, because concurrent multi-device requests are serviced in a different order at different devices. Restarts and retries are also counted as delays and are therefore accounted for in Figure 4.20.

To investigate the effect of message skew on the delay and latency behavior of the protocols, an experiment was conducted where message latency variability was changed and the effects on performance measured. Message latency was modeled as a uniformly distributed random variable over a given window size, extending from 0.5 to ws milliseconds. A larger window size implies highly variable message latencies and leads to a higher probability of out-of-order message arrival. Figures 4.20 and 4.21 graph the fraction of operations delayed and host end-to-end latency against the network delay variability window size ws. All schemes suffer from increased variability because it also increases the mean message delivery time.
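The network model of this experiment amounts to sampling each message's latency uniformly over the variability window; a trivial sketch:

```python
# Sketch of the network model in the sensitivity experiment: per-message
# latency drawn uniformly from 0.5 ms up to the variability window ws.
import random

def message_latency_ms(ws, rng=None):
    """Latency in milliseconds, uniform over [0.5, ws]; a larger window ws
    models a more unpredictable network with more message reordering."""
    rng = rng or random.Random()
    return rng.uniform(0.5, ws)
```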
However, timestamp ordering and device-based locking slow down little more than the zero-overhead protocol. But high message variability plagues the centralized locking variants significantly more, since they perform more pre-I/O messaging.

Figure 4.22: The scalability of callback locking, device-served locking and timestamp ordering under a 40% disk cache hit rate. The bottom-most line represents the performance of the zero-overhead protocol. While timestamp ordering and device-served locking continue to approximate ideal performance at higher hit rates, callback locking bottlenecks at a fraction of the achievable throughput.

Sensitivity to faster disks

Finally, one experiment was carried out to try to anticipate the performance of the protocols when faster disks are used. Disk drive access performance is expected to keep growing as a result of evolutionary hardware technology improvements (e.g. higher densities leading to higher transfer rates), the introduction of new technologies (e.g. solid-state disks leading to reduced access times) or device economics (dropping memory prices and increased device functionality [Gibson et al., 1998] leading to larger on-disk cache memory sizes and thereby reduced access times). To simulate the effect of faster disks, an experiment was carried out where the hit rate of the disk cache was increased and the scalability of the protocols under these seemingly faster disks measured. As shown in Figure 4.22, callback locking does not keep up with the throughput of the faster disk drives. Device-served locks and timestamp ordering, on the other hand, continue to approximate ideal scaling behavior.
4.6 Caching storage controllers

In the discussion so far, we assumed that data caching is performed on the device side. The client-side controllers did not cache data or parity blocks. This design implies that a read by a BST executing at a storage controller invokes a message to the device storing that data. The device services the read and returns the data to the requesting controller. The controller passes this data to the host or uses it to compute the new parity, but in either case discards it after the BST completes.

Such a cache-less controller design is acceptable when the storage network connecting devices and controllers has relatively high bandwidth and low latency. In such a case, the time taken by a controller to read a block from a device's cache is comparable to a local read from the controller's own memory. If this holds, there is benefit from avoiding double-caching the data blocks at the storage controller. Controller-side caching is wasteful of memory since a block would soon be replicated in the device's cache and in the controller's cache. It also induces unnecessary coherence traffic when blocks cached at multiple controllers have to be kept up-to-date [Howard et al., 1988, Lamb et al., 1991, Carey et al., 1994]. This coherence traffic can have a negative effect on performance under high contention. False sharing occurs when higher-level software is writing to objects that are smaller than the block size of the controller's cache, called the "cache block size." In this case, two applications or application threads running on two clients can be writing to two objects which happen to fall in the same "storage block." Such a scenario induces coherence traffic between the controllers to keep their copies of the block up-to-date even though there may be no overlapping accesses. Another kind of false sharing arises from contention over parity blocks when two controllers write to the same stripe.
Because false sharing is expected at the storage layer, it is generally undesirable to cache at the parallel controller or host unless the network is substantially slower than local accesses. Another reason against caching is the fact that higher-level system software, e.g. filesystems and databases, have their own caches, which will absorb "locality induced" reads. Replicating these blocks in the host's filesystem buffer cache and in the controller's cache is wasteful of memory.

So, if the filesystem or database cache absorbs most application reads, when is caching at the parallel controller useful at all? Caching at the controller can yield performance benefits in a number of situations. First, caching at the controller avoids the network and reduces the load on the devices. When the network is slow, this can translate into dramatic benefits. Furthermore, offloading the devices improves scaling. For example, controller-side caching can eliminate the "read phase" in two-phase BSTs. Many BSTs described in Section 4.3 pre-read a data or a parity block before writing it. Caching these blocks on the controller side can avoid network transfers from the device caches. This can translate into dramatic reductions in latency when the network is overloaded, and when there is little write contention across controllers for blocks.

Figure 4.23: In a caching storage array, parallel storage controllers receive host requests and coordinate access to the shared storage devices. Data blocks are cached at the device where they are stored and also on the storage controllers. Consequently, some reads in a BST are serviced from the cache on the controller, avoiding a network message to the device.
Low contention increases the chance that a block cached at controller A is not updated by another controller before controller A accesses the same block again, making caching worthwhile. Caching at the parallel controller can also prove useful when the application working set does not fit in the filesystem cache. In this case, the controller's cache can be used to cache blocks that have been evicted from higher-level caches (the filesystem buffer cache, for example). The controller in this case must coordinate its cache replacement policy with the higher-level cache. Recent work has shown that such an approach can yield sizeable benefits for certain applications [Wong and Wilkes, 2000].

Figure 4.23 depicts a shared array where data blocks are cached at the parallel storage controllers as well as at the storage devices. When a storage controller caches data and parity blocks, pre-reads can be satisfied from its local cache. Precisely, a read by a BST executing locally can be serviced from the local cache. The distributed serializability protocols discussed in the previous section do not readily apply to this architecture. Both device-served locks and timestamp ordering rely on the storage device receiving requests to perform concurrency control checks before servicing them. Going to the device for serializability checks when the controller has the data in its cache seems to defeat the purpose of client-side caching.

This section focuses on the two distributed protocols which were shown in the previous sections to have nice scaling properties. It outlines how they can be extended to allow effective controller-side caches. We assume that for both protocols, storage controllers cache data blocks in a local cache, called the controller cache. A devwrite writes data through the cache to the storage device, updating both the local copy and the storage device. A devread is serviced by checking the cache first.
If a valid copy exists, the block is read locally; otherwise it is retrieved from the storage device.

4.6.1 Device-served leases

As in device-served locking in Section 4.5.4, locks are acquired from the devices. The protocol is also two-phase, guaranteeing serializability: all locks needed for the execution of a BST are acquired before any of them is released. The key difference between device-served locking and device-served leases is that in the latter, the locks are not released immediately after they are acquired in the first phase. Instead, locks are cached at the host. Consequently, both a block's contents and the associated lock are cached by the controller. A devread is serviced exclusively from the controller's local cache if both the block and a lock for it are found in the controller's cache. Like the callback locking approach of Section 4.5.2, locks expire after a specified time period, hence the name "lease". This time period is known as the lease duration. A lock is valid until the lease duration expires or until it is explicitly revoked by the device in a revokeLease message (which is analogous to a callback). When a lease expires or is revoked by the device that granted it, the block in the cache is considered invalid and is logically discarded from the cache. A BST devread would then have to be forwarded to the device. The devread is piggy-backed with a lock request as in device-served locking: the controller sends a lock-and-devread message to the device, which responds with the data once the lock can be granted to the requesting controller.

Figure 4.24 depicts how these BSTs break down into basic operations. The piggy-backing optimization is still applied. If a block is not found in the cache, a lock-and-devread request is sent to the device where the block is stored. However, locks are not released in the post-read phase, but are instead cached locally.
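The controller-side behavior just described can be sketched as a cache whose entries are valid only while their lease is; hypothetical names, not the thesis code:

```python
# Hypothetical sketch of a controller-side cache with leases: a devread is
# serviced locally only when both the block and an unexpired lease for it
# are present; an expired or revoked lease logically invalidates the block.

class LeaseCache:
    def __init__(self, lease_duration):
        self.duration = lease_duration
        self.entries = {}          # block -> (data, lease_expiry_time)

    def install(self, block, data, now):
        """Called on a lock-and-devread reply from the device."""
        self.entries[block] = (data, now + self.duration)

    def revoke(self, block):
        """revokeLease callback from the device."""
        self.entries.pop(block, None)

    def devread(self, block, now):
        entry = self.entries.get(block)
        if entry and now < entry[1]:
            return entry[0]        # local hit: block and lease both valid
        return None                # miss: send lock-and-devread to the device
```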
On the device side, requests to lock, read and write are serviced as in device-served locking, except for one difference. In device-served locking, locks are not cached and are released immediately after the BST completes. Thus a device that can not grant a lock due to a conflicting outstanding lock queues the request until the conflicting locks are released.

Figure 4.24: The breakdown of a host operation with device-served leasing with piggy-backing. An 'L' node represents a lock operation, 'LR' represents the lock-and-devread operation, while 'W' stands for devwrite. A lock ('L') request is satisfied locally if the lock is cached and the lease has not yet expired. An 'LR' is satisfied locally if the block is in the cache and the associated lease is valid. If the lease is invalid, then a message is sent to the device to refresh the lock ('L' request) or to refresh the lock and read the block ('LR' request). The edges between the nodes represent control dependencies. A 'C' node represents a synchronization point at the host as before.

However, in device-served leasing, the device does not wait for the controller to release the lock, since the controller is caching it and will not spontaneously release it. Thus, the device explicitly revokes all valid conflicting locks granted (locks that have not expired) by sending revokeLease messages to the caching controllers. To simplify recovery, it is required that a controller not update any block until the commit point, when all needed locks have been successfully acquired. However, in device-served leasing, locks can eventually expire, leading to some tricky complications.
It is possible that a controller acquires the first lock, but then waits so long to acquire the remaining locks that the first lock becomes invalid (its lease period passes). In this case, the first lock is re-acquired. The commit point is reached only after all the locks have been acquired and all of them are valid. On recovery from a soft fault (e.g. a power failure), the device cannot grant a lock to a controller if a valid conflicting lock exists at another. If information about outstanding leases at the device is maintained in volatile memory, the device must wait until a safe period after restart before it can grant locks again. This safe period can be as short as the lease duration, but not shorter. The recovery protocols discussed in Section 4.7 require, for performance reasons, that the device be able to identify the set of leases that may be valid and outstanding at the controllers in the event of a failure. This identification does not have to be accurate and can over-approximate the set, as long as the result is a relatively small fraction of the blocks on the disk.

4.6.2 Timestamp ordering with validation

Device-served leasing distributes control work across the devices and hosts, and has no central scalability bottleneck. Nevertheless, it has two serious disadvantages. First, it can suffer from degraded performance under contention: lock acquisition can have substantial latency due to revocation messaging, and spurious restarts under contention degrade performance further. Second, it is relatively complex to implement. Timestamp ordering has the scalability advantages of device-served leasing without the vulnerability to contention or to spurious restarts (since it is deadlock-free). Timestamp ordering, however, does not readily support caching: it has no locks which can be easily overloaded to ensure cache coherence.
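The commit-point rule above, that all leases must be simultaneously valid before any block is updated, re-acquiring any lease that lapsed in the meantime, can be sketched as follows. The helper names (`acquire_for_bst`, the `acquire` and `now` callbacks) are hypothetical.

```python
def acquire_for_bst(blocks, acquire, now, margin=0.0):
    """Acquire leases for every block of a BST and reach the commit point
    only when all of them are simultaneously valid (a sketch with
    hypothetical names).  `acquire(block)` returns the expiry time of the
    newly granted lease; `now()` returns the current time."""
    expiry = {b: acquire(b) for b in sorted(blocks)}
    while True:
        # Re-acquire any lease that lapsed while the others were collected.
        stale = [b for b, e in expiry.items() if now() + margin >= e]
        if not stale:
            return expiry   # commit point: all leases held and valid
        for b in stale:
            expiry[b] = acquire(b)
```

Sorting the blocks gives a fixed acquisition order; this is an illustrative choice, since the protocols in the text rely on time-outs rather than ordering to handle deadlock.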
To extend timestamp ordering to support controller-side caching in a straightforward manner, we replace cache hits with version-test messages to the devices. This has a negative effect only on read hit latency, that is, when all the data to be read is already in the controller's local cache. The controller caches data blocks together with their write timestamps, wts. Each block in a controller's cache is associated with a wts, which is the write timestamp publicized by the device for that block when the controller read the block into its cache. Intuitively, this wts can be thought of as a version number. If another controller writes its copy of the block, the wts for that block is updated on the device, logically making the versions cached by other controllers invalid. It therefore suffices to compare the local wts in the controller cache with the one maintained at the device to verify whether the device and controller blocks are the same. The controllers and devices behave as in basic timestamp ordering, except for the read phase of a single-phase or two-phase BST. In this case, the controller services the read from its local cache if a local copy of the block exists. To validate that the local version of the block is not stale, and thereby safe to read, a readIfInvalid message is sent to the device for each block. (The name means: read only if the local copy cached at the host is invalid. The device either returns an OK response validating that the local copy cached at the host is valid or, if the validation fails, returns the new contents of the block and the new associated wts.) This message contains the timestamp of the BST, opts, and the write timestamp of the block read from the cache, wts. If the wts of the cached block matches the wts maintained by the device, then the read performed by the
controller from its own cache is valid. In this case, the device updates the rts of the block to opts and returns an OK response. If the wts of the cached block differs from the one maintained at the device, then the read is not valid, and two things can happen. If opts > wts, the block is returned and the rts is updated to opts if opts exceeds rts. If opts < wts, the request is rejected and the client will retry with a higher timestamp. This protocol is called timestamp ordering with validation (TSOV).

Figure 4.25: The composition of host operations in the TSOV protocol. devread, devwrite, and prewrite requests are denoted by 'R', 'W' and 'P' nodes respectively. An 'RII' denotes a readIfInvalid request, which is issued in lieu of a read if the block is found in the cache. This request includes the wts of the cached block and instructs the device to read and return the block only if the cached version is invalid. If the cached block is valid, the rts is updated at the device and a positive reply is returned. An 'RP' denotes a read-and-prewrite request, issued if the block is not in the cache; if the block is found in the cache, a readIfInvalid-and-prewrite ('RIIP') is issued instead. A 'C' node represents a synchronization barrier at the host, as before.

Figure 4.25 depicts how BSTs break down into basic device operations. In TSOV, the device checks prewrites and writes as in the basic TSO protocol. Reads are handled differently, as explained above. In the case of a two-phase BST, the read becomes a read-and-prewrite request if the block is not found in the cache, or a readIfInvalid-and-prewrite request if it is.
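The device-side handling of a readIfInvalid request can be sketched directly from the rules above. The class and method names are hypothetical; the rts, wts and opts timestamps follow the text.

```python
class TsovDevice:
    """Device-side sketch of readIfInvalid in timestamp ordering with
    validation (TSOV); hypothetical names, per-block (rts, wts) pairs as
    described in Section 4.6.2."""

    def __init__(self):
        self.data = {}   # block -> contents
        self.rts = {}    # block -> read timestamp
        self.wts = {}    # block -> write timestamp

    def read_if_invalid(self, block, opts, cached_wts):
        wts = self.wts.get(block, 0)
        if cached_wts == wts:
            # Cached copy matches: validate it, update rts, ship no data.
            self.rts[block] = max(self.rts.get(block, 0), opts)
            return ("OK", None, wts)
        if opts > wts:
            # Stale copy: return the new contents and the new wts.
            self.rts[block] = max(self.rts.get(block, 0), opts)
            return ("DATA", self.data[block], wts)
        # opts < wts: the read would violate timestamp order; the client
        # retries the BST with a higher timestamp.
        return ("REJECT", None, wts)
```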
If validation succeeds, no data is returned by the device and the block in the local cache is used. The device maintains its timestamp log by rounding up all wts values that are more than T seconds in the past to the value (current time − T). As explained in Section 4.5.5, this does not compromise the correctness of basic timestamp ordering. In the case of TSOV, however, this rounding up can cause readIfInvalid requests to fail validation despite the fact that the cached copy at the host is valid. When the write timestamp associated with a block is truncated, i.e. increased, the next readIfInvalid request from a host will fail, because the wts maintained by the host will be smaller than the truncated value maintained by the device. This induces an unnecessary read of the data from the device. Therefore, in TSOV, the timestamp log truncation window must be longer than a few seconds. For the implementation evaluated in this section, a value of 5 minutes was used.

Figure 4.26: The scalability of device-served leases and timestamp ordering with validation. The storage controllers have a local cache big enough that no blocks are evicted because of limited cache capacity. The lease duration is 240 seconds, and the timestamp log truncation window is 300 seconds. The graph also shows the performance of device-served locking (which is also representative of basic timestamp ordering) when the controllers do not cache any blocks.

The following subsections compare and contrast the performance of the cache-coherent distributed protocols.
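The interaction between log truncation and validation can be illustrated with a small sketch (function names are hypothetical): a wts rounded up by truncation no longer matches the wts cached at a host, so validation fails even though the cached copy is actually current.

```python
def truncate_log(wts_log, now, window):
    """Round every wts more than `window` seconds old up to (now - window),
    as the device's log truncation does (hypothetical function name)."""
    floor = now - window
    return {block: max(t, floor) for block, t in wts_log.items()}

def validate(cached_wts, device_wts):
    """A cached copy passes validation only if the timestamps still match."""
    return cached_wts == device_wts
```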
In particular, the evaluation criteria are the latency of completing a BST, the amount of network messaging performed, the fraction of operations delayed or blocked, the size of device-side state, and the implementation complexity of the protocols (both at the devices and at the controllers).

4.6.3 Latency and scaling

Recall that a caching storage controller forwards devwrites to the device under both protocols. Devreads, however, are handled differently. Under TSOV, a control message is sent to the device to validate a locally cached block, and the local contents are read if the validation succeeds. If the validation fails, the device returns the new contents of the block, or the request is rejected (due to an unacceptable timestamp). Consider the case of low contention, where blocks cached at one controller are not soon written by another. In this case, the validation will often succeed. The difference between TSO and TSOV is that the latter converts the pre-reads or reads, in the case of a cache hit, into a control message exchange with the device. Device-served leasing (DLE) completely eliminates messaging in the case of a cache hit: the block is read from the cache if the locally cached lease is valid. However, when a lock is not cached locally at the host, one must be acquired from the device, and the device must recall all the conflicting outstanding leases cached by other hosts before responding to the requesting host. This induces latency when an exclusive lock is requested by a controller for a block that has been cached at many hosts in shared mode. The major vulnerability of device-served leasing is that this work is performed by the device and not by the controller, which can load the device in large systems. Under timestamp ordering with validation, a host A writes to the devices by sending a prewrite message in a first phase, followed by a write message in a second phase.
Other hosts that have a cached copy of the block written by host A are not notified. They discover that the block has been updated when and if they try to validate a read later; the work to keep the caches coherent is delayed until (and unless) an access is performed. When each host accesses a different part of the device, leasing works well. However, when controllers share access to the same devices and objects, device-served leasing can be expected to suffer because of increased messaging and device load. Shared arrays are not characterized by separate locality at each controller, however. First, cluster applications running on top of them often balance load dynamically across the hosts, resulting in blocks being accessed by multiple hosts within a short period of time. Second, even if applications have locality in the logical address space, RAID can map two logically different objects onto the same stripe set. This induces contention not only on the parity blocks but also on the data blocks themselves, because RAID write optimizations sometimes require reading blocks that are not being updated. Under high contention, the validation messages of TSOV will often fail, and the device will then return the new contents of the block; TSOV thus reduces in this case to basic timestamp ordering with no caching. DLE under high contention likewise generates many lease revocations, which make it perform like basic device-served locking. The revokeLease message in DLE is the equivalent of the unlock message in device-served locking. However, a device under DLE can be expected to suffer from longer blocking, because its locks are distributed across hosts and revocations tend to be queued and to wait for service at more nodes. More importantly, DLE is more vulnerable to deadlocks than basic device-served locking, because longer blocking times cause more spurious deadlock detection time-outs, each initiating
restarts, which further load the devices.

Figure 4.27: The messaging overhead of device-served leases and timestamp ordering with validation, compared to that of device-served locking. The latter assumes no controller-side caches. Note that under high load, device-served leasing (DLE) induces more messaging than device-served locking. This is because the fraction of operations retried (to avoid a likely deadlock) under high load is larger under DLE than under device-served locking. Deadlock-induced timeouts are more likely under DLE because lock hold times are longer under DLE (the lease duration) than under device-served locking (the duration of the operation).

The baseline workload and simulation parameters described in Table 4.2 are used to evaluate the caching protocols, except that the system is a third as large (10 devices and 8 hosts); simulating the more complex caching protocols requires more resources, making large-scale simulations impractical. All the graphs in Section 4.6 and later correspond to 8 hosts and 10 devices. Figure 4.26 plots latency versus throughput for TSOV, DLE and device-served locks. Device-served locking corresponds to the performance in the absence of host-side caching. Surprisingly, the graph shows that timestamp ordering exhibits lower latencies than DLE. Both TSOV and DLE reduce latencies compared to device-served locking without host-side caching. However, the caching benefit of DLE is somewhat offset by the increased load on the devices for lease management and recall, as well as by the increased messaging that leases induce when hosts contend for a shared storage space.
4.6.4 Network messaging

TSOV does not reduce the number of messages sent on the network relative to TSO, although it converts some data transfers into control messages, reducing the total number of bytes transferred. DLE, on the other hand, eliminates pre-reads and reads altogether when a request hits in the cache, but requires revocation messaging when data is shared. Figure 4.27 plots the average number of messages per operation (that is, per BST) for each protocol. TSOV has a relatively constant number of messages per BST, equal to that of basic TSO. Similarly, device-served locking has a relatively constant messaging overhead. DLE starts with the lowest messaging overhead when the number of hosts is small (two) and few lease revocations occur. As the number of hosts in the system increases, the amount of messaging required to revoke leases increases. At the same time, leases are revoked more often, requiring hosts to re-acquire them more frequently. Under high throughput, this degrades to worse than the performance of device-served locking because of the large number of operations retried under DLE, which is more vulnerable to deadlock detection time-outs. When a BST starts at a host, it often hits partially in the cache: some blocks and their leases are found in the cache while others are not. The blocks that are not in the cache, or for which no valid leases are cached, must be re-fetched from the device. This hold-and-wait condition of holding some locks locally while attempting to acquire the rest (from multiple devices) opens the possibility of deadlocks. Both device-served locking and DLE use a similar time-out based deadlock detection mechanism at the devices. However, DLE suffers many more timeout-induced restarts, because it holds locks longer by caching them and is therefore more vulnerable to deadlocks with the many more BSTs that start while the locks are locally cached.
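The timeout-based deadlock detection at the devices can be sketched as follows. This is an illustrative sketch only; the names, the FIFO queueing discipline and the single shared timeout value are assumptions, not details given in the text.

```python
class LockQueueDevice:
    """Sketch of the device-side time-out mechanism shared by device-served
    locking and DLE (hypothetical names): a request queued behind a
    conflicting lock for longer than `timeout` is rejected, and the
    requesting controller restarts the whole BST."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.holder = {}    # block -> controller currently holding the lock
        self.waiting = {}   # block -> [(controller, time queued)]

    def request(self, block, controller, now):
        if block not in self.holder:
            self.holder[block] = controller
            return "GRANTED"
        self.waiting.setdefault(block, []).append((controller, now))
        return "QUEUED"

    def tick(self, now):
        """Reject waiters queued past the timeout (a presumed deadlock)."""
        restarted = []
        for block, queue in self.waiting.items():
            keep = []
            for controller, since in queue:
                if now - since > self.timeout:
                    restarted.append(controller)  # controller must restart
                else:
                    keep.append((controller, since))
            self.waiting[block] = keep
        return restarted

    def release(self, block, now):
        queue = self.waiting.get(block, [])
        if queue:
            controller, _ = queue.pop(0)
            self.holder[block] = controller   # hand the lock to next waiter
            return controller
        self.holder.pop(block, None)
        return None
```

Under DLE the same mechanism fires more often: because leases are held for the lease duration rather than the duration of one operation, queued requests wait longer and more of them exceed the timeout.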
Furthermore, because the lease revocation work at some devices can take considerable time, deadlock detection time-outs can often expire in the meantime.

4.6.5 Read/write composition and locality

Under a predominantly read workload, where blocks are rarely invalidated by writes from other hosts, device-served leasing yields latencies similar to timestamp ordering with validation, as shown in Figure 4.28. Figure 4.29 graphs the messaging overhead of the protocols as a function of the read traffic ratio in the baseline workload. Under a predominantly read workload, device-served leases induce lower messaging overhead due to the large fraction of local cache hits. However, as the fraction of writes increases, and because hosts access a shared storage space uniformly at random in this workload, the number of callbacks increases. This makes the messaging overhead of device-served leases higher than that of timestamp ordering.

Figure 4.28: The effect of the read/write workload mix on host latency for device-served leases and timestamp ordering with validation.

When each host has its own "working set" of blocks that no other host accesses, acquiring a lock and caching it becomes more appealing. Under such a workload, DLE should of course exhibit lower latencies than TSOV, because it eliminates many read messages while TSOV converts them into control messages. While such a workload is not typical of clusters and of shared arrays, it is valuable for quantifying and bounding the benefit of DLE over TSOV. Under baseline parameters and perfect locality (no revocations from other hosts), DLE was found to exhibit 20% lower latencies than TSOV. In the reported experiments, the lease time was 240 seconds; this lease time was configured to give device-served leasing the best performance for this workload.
Nevertheless, under such a shared, uniformly random workload, timestamp ordering with validation is still preferable, and more robust to changes in contention, read/write composition and network message latency variability. The sensitivity of DLE to lease duration is explored below.

Figure 4.29: The effect of the read/write workload mix on messaging overhead for device-served leases and timestamp ordering with validation. Messaging overhead for both protocols decreases as the ratio of reads increases. When write traffic dominates, timestamp ordering with validation induces a lower messaging overhead than device-served leasing, which suffers from revocation messaging and time-out induced retries.

4.6.6 Sensitivity to lease duration

Lease duration impacts both the concurrency control and recovery protocols. Shorter leases make recovery faster, as discussed in Section 4.7. The duration of the lease can be good or bad for the performance of the concurrency control, depending on the workload. If the workload has high locality and low contention, longer leases are better because they allow one read to satisfy more accesses before the lease must be refreshed. Under high contention, however, shorter leases are better because they minimize delays due to revocations. When a short lease is requested from a device, a good fraction of the leases previously acquired by other hosts will already have expired, so few revocations result. This further reduces messaging overhead and device load, contributing to observably lower latencies.

To investigate the effect of lease duration on DLE, the end-to-end latency, the messaging overhead and the fraction of operations delayed were measured under different lease durations for the baseline configuration (8 hosts and 10 devices). Figure 4.31 shows the effect of lease duration on messaging overhead. Short leases require fewer revocations but must be refreshed more often. Long leases induce more revocations but do not require refreshes unless they are revoked. Medium-length leases are the worst under the baseline workload because the sum of both effects is largest for them. Figure 4.30 demonstrates that longer leases cause operations to be delayed more often while conflicting outstanding leases are being revoked. Figure 4.32 summarizes the net effect of lease duration on end-to-end latency.

Figure 4.30: The effect of lease duration on the fraction of operations delayed under device-served leases. The longer the lease duration, the higher the likelihood that an operation is delayed at the device waiting for conflicting granted leases to be revoked.

Figure 4.31: The effect of lease duration on the messaging overhead for device-served leases. Shorter leases result in lower lock cache hit rates at the controllers, but reduce the amount of revocation messaging needed when sharing occurs. Longer leases, on the other hand, reduce the need to refresh, but increase the likelihood of revocation messaging when sharing occurs. Their combined effect shows that the overall messaging overhead is minimized with shorter leases, although the difference is not dramatic.

Figure 4.32: The effect of lease duration on host latency. Very short leases result in relatively higher latencies.
Increasing lease durations beyond 240 seconds does not result in noticeably lower latencies. Under the baseline random workload, which has moderate load and contention, a lease of a few minutes is thus advisable.

4.6.7 Timestamp log truncation

Under timestamp ordering, devices maintain a log of recently modified read and write timestamps for the recently accessed blocks. To bound space overhead, the device periodically executes a log truncation algorithm, which deletes all timestamps older than T seconds, where T is the log size parameter. All blocks for which no timestamps (rts or wts) are found in the log are assumed by the timestamp verification algorithms to have rts and wts equal to (current time − T). For basic TSO, the log can be very small, holding only the last few seconds of timestamps, as described before. The timestamp log must be stored on non-volatile storage (e.g. NVRAM) to support the recovery protocols discussed in Section 4.7, and must therefore be truncated frequently to keep it small. Section 4.5.5 argued that a very small log size is sufficient for basic timestamp ordering because clock skew and message delay are bounded. For TSOV, the log cannot be very small, because cache validation requests would then be more likely to fail: if the log is truncated every few seconds, a host issuing a validation (RII or RIIP) several seconds after reading a block will have its validation fail, forcing the device to retransfer the block to the host.

Figure 4.33: The effect of timestamp log size on host latency for timestamp ordering with validation. The log size is measured in seconds of activity. A log size of 200 seconds implies that the device truncates all timestamps older than 200 seconds to "current time − 200 seconds". The graph shows that a timestamp log of a few minutes is sufficient to achieve good performance for the baseline workload.

Naturally, the timestamp cache "hit rate" (the fraction of accesses for which validations succeed) increases for TSOV with larger timestamp logs, or equivalently bigger values of T. For the graphs in this section, the timestamp log size at each device was set to accommodate 5 minutes of timestamp records. To estimate the size of this log, recall that a typical disk device services a maximum of about 100 disk operations per second, or 30,000 operations in 5 minutes. This assumes that all requests miss in the device's cache and are therefore serviced from the platters. If 32 bytes are used to store the (block, timestamp) pair in a log data structure, this five-minute log needs 960 KB. Smaller log sizes also offer good performance. Note that if the cache hit rate is high, then the traffic is likely to have more locality; a more localized working set is likely to need a smaller number of timestamp records. Figure 4.33 supports the argument that small log sizes are sufficient, showing that a log size of only 200 seconds provides good performance for the random workload of Table 4.2.

Table 4.4: Simulation prototype implementation complexity in lines of code.

                                              TSO    DLOCK   TSOV           DLE
  Storage device (lines of code)              1942   1629    1975           1749
  Caching support at device (lines of code)   —      —       33             120
                                                             (=1975-1942)   (=1749-1629)
  Storage controller (lines of code)          1821   1810    2008           2852
  Caching support at controller               —      —       187            1042
    (lines of code)                                          (=2008-1821)   (=2852-1810)

The numbers concern the basic concurrency protocols, excluding the recovery part, which is largely shared across all protocols. The second and fourth rows of the table show the additional lines of code added to the device and to the storage controller to support block caching in each case.
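The log sizing arithmetic can be checked directly, assuming as in the text roughly 100 operations per second per device and 32-byte (block, timestamp) records.

```python
def log_size_bytes(ops_per_sec, window_sec, bytes_per_record=32):
    """Back-of-the-envelope size of the per-device timestamp log: one
    record per serviced operation within the truncation window."""
    return ops_per_sec * window_sec * bytes_per_record
```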
This is simply the additional lines of code added to TSO to make it TSOV, and to device-served locking (DLOCK) to make it DLE.

4.6.8 Implementation complexity

Although device-served leasing and timestamp ordering have similar performance, device-served leasing is relatively more complex to implement. Table 4.4 shows the lines of code needed to implement each of the protocols in detailed simulation. Except for needing more robust error handling, the protocols could be readily transplanted into a running prototype; the lines of code may therefore be representative of their comparative implementation complexity in a real system. The table shows that while timestamp ordering and device-served locking are of comparable complexity, their caching counterparts are quite different. While it took only 187 lines to add caching support for timestamp ordering at the storage controller, over a thousand lines of code were needed to do the same for device-served locking. The reason behind this difference is that DLE must deal with the additional complexity of lease expiration and renewal, deadlock handling code, and lease reclamation logic. A lease held by a host can be reclaimed while an access is concurrently trying to acquire the locks. A lease from one device can expire because a lock request to another device touched by the same hostwrite was queued for a long time. All of these concerns are absent from the implementation of timestamp ordering with validation. In the latter, only the write timestamp of the block in the cache is recorded and sent in a validation message to the device. No deadlocks can occur, no leases can expire, and no callbacks are received from the device.

4.7 Recovery for BSTs

Various kinds of failures can occur in a shared storage array. Devices and controllers can crash due to software or hardware faults or due to loss of power. Correctness must be maintained in the event of such failures.
This section discusses how a shared array can recover from such untimely failures. In particular, it describes the protocols that ensure the consistency property for BSTs discussed in Section 4.4.

4.7.1 Failure model and assumptions

A shared storage array consists of four components: storage controllers, storage devices, storage managers and network links. From the perspective of this discussion, failures can be experienced by all four components. Network failures include operator, hardware or software faults that cause links to be unavailable, that is, incapable of transmitting messages between nodes. This discussion assumes that all network failures are transient, and that a reliable transport protocol, capable of masking transient link failures, is used. It also makes the simplifying assumption that a storage manager failure is always transient and masked; that is, the storage manager will eventually become available, so that communication with it will always succeed. Other work describes a technique to implement this in practice [Golding and Borowsky, 1999]. This discussion therefore focuses only on failures in clients and devices. A device can undergo a permanent failure, resulting in the loss of all data stored on it. A device can also undergo a transient failure, or outage, causing it to lose all the state stored in its volatile memory. Similarly, a controller can undergo a permanent failure from which it does not recover, or a transient failure, or outage, after which it eventually restarts but loses all state in its volatile memory. For the rest of the discussion, a "failure" of a device or a controller designates a permanent failure, while an "outage" designates a transient failure. Figure 4.34 shows the protocols used in a shared storage array. The storage controllers use BST-based access protocols to read and write virtual storage objects.
Access protocols involve controllers and devices. The layout maps used by the access protocols are fetched by the controller from the storage manager. Layout maps are associated with leases specifying the period of time during which they are valid. After this lease period expires, the controller must refresh the layout map by contacting the storage manager. A storage manager can invalidate a layout map by contacting the storage controller caching that map.

Figure 4.34: A shared storage array consists of devices, storage controllers, and storage managers. Storage controllers execute BSTs to access the devices. The BSTs require the controller to know the object's layout across the devices. A layout map protocol between the storage controller and the storage manager allows controllers to cache valid layout maps locally. If a storage controller discovers a device failure, it takes appropriate local actions to complete any active BSTs, then notifies the storage manager through the device failure notification protocol. Similarly, a device may encounter a controller failure after a BST has started but before it is finished. Such a device notifies the storage manager, which executes the proper recovery actions (device recovery protocol).

The storage manager changes a virtual object's mode or layout map only when no storage controllers are actively accessing that object. Precisely, a storage manager must make sure that no controller has a valid layout map in its cache when it performs a change to the layout map. Layout maps are stored (replicated) on stable storage.
A storage manager synchronously updates all replicas of a layout map when it switches a virtual object to a different mode or when it changes where storage for the object is allocated. This is acceptable given that virtual object maps are changed infrequently, when objects are migrated or when devices fail. For example, to switch a virtual object from fault-free mode to a migrating mode, the storage manager can wait until all outstanding leases expire. Alternatively, it can send explicit messages to invalidate the outstanding layout maps cached by storage controllers. At the end of this invalidation phase, the storage manager is assured that no storage controller is accessing storage since no valid layout maps are cached anywhere. At this time, the storage manager can switch the layout map of the virtual object and move it to a migrating mode. After this, the new map can be served to the storage controllers, which will use the BSTs specified in the map and corresponding to the migrating mode.

A storage controller can discover a failed device or can experience a device outage after a BST has started and before it has completed. Such exceptional conditions often require manager action to restore consistency and properly complete the BST. A controller-manager protocol allows controllers to report such failure conditions to the storage manager (Figure 4.34). Similarly, a storage device can block in the middle of executing a BST waiting for a controller message that never arrives. The controller may have failed or restarted and lost all its state as a result. In this case, the device must notify the storage manager to properly complete the BST and ensure consistency. A device-manager protocol (Figure 4.34) is defined to allow devices to report incomplete BSTs to the storage manager.

Figure 4.35: The states of a device during the processing of a BST. The device starts in the Inactive state. The transition to the Blocking state occurs with the receipt and the successful processing of a first phase prewrite request from a storage controller. If a second phase write message is received confirming the prewrite, the device updates storage and transitions to the Updated state. A reply is sent to the controller and a final transition to the Done state from the Updated state is performed. If the second phase message is a cancel, the device discards the prewrite and transitions to the Done state. If a time-out period passes while the device is in the Blocking state without the receipt of any message, the device transitions to the Recovering state. The arrows labeled with a “1” (“2”) refer to transitions that are caused by the receipt and processing of a first (second) phase message from the controller.

4.7.2 An overview of recovery for BSTs

The discussion assumes that all storage manager failures and network failures are masked, leaving four remaining kinds of failures of interest: device failures, device outages, controller failures and controller outages. The amount of work required to recover properly from a failure or outage depends largely on when the event occurs. Figure 4.35 shows the states of a device involved in a BST. The device starts in the Inactive state. This discussion assumes a timestamp ordering protocol, but the case of device-served locking is quite similar.
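Under timestamp ordering, the device's first phase admission check compares the request's timestamp against the timestamps already accepted for the block. The sketch below is illustrative Python; the field names and the deliberately simplified rule are assumptions, not the dissertation's protocol.

```python
def check_timestamp(opts, block):
    """Simplified first phase check under timestamp ordering: accept a
    prewrite only if its timestamp (opts) exceeds the highest read
    timestamp (rts) and write timestamp (wts) already accepted for the
    block. block is a dict holding 'rts', 'wts' and pending 'prewrites'."""
    if opts <= block["rts"] or opts <= block["wts"]:
        # An access with a later timestamp was already accepted:
        # this request arrived too late and must be rejected.
        return "REJECT"
    block["prewrites"].append(opts)  # block is held until phase two
    return "OK"
```

A device that returns OK here has moved the block into the Blocking state; a REJECT forces the controller to retry the BST with a fresh timestamp.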
Both protocols essentially associate an exclusive lock with the data block once the first phase request is accepted at the device. This exclusive lock is released only after the second phase message is received. In the Inactive state, the device receives a first phase request: a prewrite or a read-and-prewrite. If the request is rejected, a reply is sent back to the controller, the operation completes at the device, and the device moves to the Done state. A new operation must be initiated by the controller to retry the BST. If the request is accepted, the device transitions to the Blocking state. In this state, the block is locked and no access to it is allowed until a second phase message is received. The second phase message can be a cancel message or a write message containing the new contents of the block. If a cancel message is received, the device discards the prewrite and transitions to the Done state. If a write message is received confirming the prewrite, the device transitions to the Updated state once the data is transferred successfully to stable storage. The device then formulates a reply acknowledging the success of the write and sends it to the controller. Once the message is sent out, the device transitions to the Done state and the BST completes. If a time-out period passes and no second phase message is received, the device transitions to the Recovering state. From this state, the device notifies the storage manager of the incomplete BST and awaits the manager’s actions. The manager restores the array’s consistency before allowing the device to resume servicing requests to the virtual object. Similarly, if the device experiences a failure in the midst of processing a write message such that storage is partially updated, the device transitions to the Recovering state and notifies the storage manager. Figure 4.37 shows the states of a controller executing a BST. The figure shows a write BST.
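The device-side transitions described above can be sketched as a small state machine. This is an illustrative Python sketch; the state and message names follow Figure 4.35, but the class itself is an assumption, not the dissertation's code.

```python
class DeviceBST:
    """States of a device during one BST (after Figure 4.35):
    Inactive -> Blocking -> Updated -> Done, with Done reachable
    directly on a reject or cancel, and Recovering on a time-out."""

    def __init__(self):
        self.state = "Inactive"

    def first_phase(self, accept):
        # Receipt of a prewrite or read-and-prewrite request.
        assert self.state == "Inactive"
        self.state = "Blocking" if accept else "Done"
        return "ACCEPT" if accept else "REJECT"

    def second_phase(self, message):
        # Receipt of the confirming write or a cancel.
        assert self.state == "Blocking"
        if message == "write":
            self.state = "Updated"  # data written to stable storage
            self.state = "Done"     # reply sent back to the controller
        elif message == "cancel":
            self.state = "Done"     # prewrite discarded

    def timeout(self):
        # No second phase message arrived: notify the storage manager.
        if self.state == "Blocking":
            self.state = "Recovering"
```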
In the first phase, prewrite messages, possibly combined with reads, are sent out. These messages, marked with a “1” in Figure 4.35, cause the device to transition from the Inactive to the Blocking state if the prewrite is accepted, or to the Done state if the prewrite is rejected. Once all the replies to these first phase messages are collected by the controller, a second round of messages is sent out to confirm the write or to cancel it. These messages, marked by a “2” in Figure 4.35, cause the device to transition from Blocking to Done (in case of a cancel) or from Blocking to Updated (in case of a successfully processed write). The state diagram of Figure 4.35 is helpful in understanding the implications of a device or controller failure or outage while a BST is in progress. A controller failure or outage before any phase 1 messages are sent out is benign; that is, it does not require the involvement of the storage manager to ensure recovery. Recovery can be achieved by local actions at the nodes. In this case, upon restart, the controller must simply re-fetch the layout maps to begin access to storage. No device has transitioned to the Blocking state and no locks are held by the BST.

Figure 4.36: The modes of a virtual object and the events that trigger modal transitions. Benign failures and outages do not cause the object to switch modes. Critical device outages, device failures, and critical controller failures and outages cause the object to switch modes. A second critical failure while the object is in degraded, reconstructing or recovering modes is catastrophic and results in data unavailability.

If a controller experiences a failure or outage after all the second phase messages are sent out and successfully received by the devices, then the devices will update the blocks on persistent storage and complete the BST. The devices will transition from the Blocking to the Updated state and unilaterally to the Done state. Such a failure is also benign. However, a controller failure or outage after at least one device has transitioned to the Blocking state and before all the second phase messages are sent out to the devices is considered critical; that is, it requires special action by the storage manager to restore the array’s consistency and/or to ensure progress. Similarly, a device failure or outage before any device has reached the Updated state can be handled easily by the controller. Since no storage has been updated anywhere, a storage controller facing a permanent device failure or a device outage (inaccessible device) can simply abort the BST by multicasting a cancel message in the second phase to all the surviving devices. The device failure or outage can be considered to have occurred before the BST started. The storage controller notifies the storage manager of the problem and the storage manager moves the object to the proper mode. However, if a device failure or outage occurs after some devices have received and processed the second phase message and transitioned from the Blocking to the Updated state, recovery is slightly more complicated. The storage controller completes the second phase by writing to all the surviving devices.
This failure is considered critical because it requires special action by the storage manager on recovery. The manager must establish whether the device has permanently failed or has experienced an outage (restart). If the device has permanently failed, the failure can be considered to have occurred right after the device was updated. The BST is considered to have succeeded, but the virtual object must be moved to a degraded mode and reconstruction on a replacement device must soon be started. If the device has experienced an outage, the device should not be made accessible immediately upon restart since it still contains the old contents of the blocks. The storage manager must ensure that the data that the BST intended to write to the device is on stable storage before re-enabling access. This data can be reconstructed from the redundant copy written by the BST. A virtual object can be in one of several modes: Fault-Free, Migrating, Degraded, Reconstructing, Unavailable or Data-loss. The first four modes were already introduced. The last two were not because they do not pertain to the concurrency control discussion. In the last two modes (Unavailable and Data-loss), host accesses to the virtual object are not allowed. In the Data-loss mode, the object is not accessible at all. In the Unavailable mode, the storage manager is the only entity which can access the object. In this mode, the storage manager restores the consistency of the array. Once the array is consistent, the manager moves the object to an access mode and re-enables access to the object. Figure 4.36 represents a sketch of the different modes of a virtual object. The object starts out in Fault-Free mode. A permanent device failure causes a transition to the Degraded mode. The allocation of a replacement device induces a transition from the Degraded mode to the Reconstructing mode.
A second device failure or a critical outage (of a device or controller in the midst of a BST) while the object is in degraded or reconstructing modes results in data loss. The object transitions to the Data-loss mode and its data is no longer available. Critical outages of a controller or device while the object is in Fault-Free mode cause a transition to the Unavailable mode, where the object is not accessible until the array’s consistency is restored. In this mode, no storage controller can access the object because no new leases are issued and previous leases have expired. The storage manager consults a table of actions which specifies what recovery procedure to take for each kind of failure and BST. A compensating BST is executed to achieve parity consistency, the object is potentially switched to a new mode, and then new leases can be issued. Benign failures and outages, on the other hand, do not cause the object to change modes. They do not require any recovery work besides possibly a local action by the node experiencing the outage upon restart.

4.7.3 Recovery in fault-free and migrating modes

Under Fault-Free and Migrating modes, all devices are operational. The discussion will focus on a single failure or outage occurring at a time. The pseudo-code executed by the devices is given in Section 4.7.6.

Controller outage/failure. Controller failures and outages are essentially similar. A permanent controller failure can be regarded as a long outage. Because a storage controller loses all its state during an outage, it makes no difference to the storage system whether the controller restarts or not. Any recovery work required to restore the array’s consistency must proceed without the assistance of the failed storage controller. A controller outage amounts to losing all the layout maps cached at the client. After restart, the storage controller must re-fetch new layout maps from the storage manager to access virtual objects.
Upon restart, no special recovery work is performed by the controller. When the virtual object is in Fault-Free/Migrating mode and the storage controller experiences an outage while no BSTs are active, the outage is benign. The virtual object does not change modes as a result of the controller failure or outage. Upon restart of the storage controller, access to the virtual object can begin immediately. Critical controller failures and outages can occur in the midst of a BST’s execution. The controller can crash after some devices have accepted its prewrite request. These devices will move to the Blocking state and wait for a second phase message. This second phase message will never come because the controller has crashed. A storage device associates a time-out with each accepted prewrite request. If the corresponding second phase message (cancel or write) is not received within the time-out period, the storage manager is notified. The storage manager must restore the consistency of the array because it may have been corrupted as a result of the controller updating some but not all of the devices. The storage manager restores the consistency by recomputing parity. The storage manager does not guarantee that the data on the devices reflect the data the controller intended to write. To complete the write, the application must re-submit the write. This is correct since a hostwrite, like a write to a disk drive today, is not guaranteed to have completed and reached stable storage until the controller responds with a positive reply. In this case, the controller did not survive until the end of the BST and could not have responded to the application with a positive completion reply.

Figure 4.37: The algorithm followed by the controller when executing a BST. Ovals correspond to the state of the BST at the controller. Circles correspond to the processing of a message at the device. An arrow from an oval to a circle is a message from a controller to the device. An arrow from a circle to an oval is a reply from the device to the controller. The controller starts in the inactive state, issues a first round of prewrite messages, and once the replies are collected decides to commit or abort the BST. Once a decision is reached, second phase messages are sent out to the devices. Once the replies are received, the BST completes. The point right before sending out second phase write messages is called the commit point. A controller experiencing a device failure or outage before the commit point decides to abort the BST. If all devices accept the prewrite, the BST is committed and writes are sent out to the devices that remain available. All failures in the second phase are simply reported to the storage manager.

Device outage. A device outage when the device is in the Inactive state does not require much recovery work besides careful initialization procedures upon restart from the outage. Under timestamp ordering, the device must wait T seconds upon restart before starting to service new requests. The device can establish that no BSTs were in progress during the outage by inspecting its queue of accepted prewrites stored in NVRAM. If the queue is empty, the device waits for T seconds and starts accepting requests. If the queue contains some entries, the device enters the Recovering mode and notifies the storage manager.
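The restart check just described can be sketched as follows. This is an illustrative Python sketch of the device's initialization logic; the function and variable names are assumptions (the dissertation's own pseudo-code for this step appears in Section 4.7.6).

```python
T_SECONDS = 30  # illustrative value for the timestamp-ordering wait

def init_upon_restart(prewrite_queue, wait):
    """Device initialization after an outage. prewrite_queue is the queue
    of accepted prewrites kept in NVRAM; wait(seconds) delays request
    service. Returns 'OK' when the device may accept requests, or
    'RECOVERING' when the storage manager must first run recovery."""
    if not prewrite_queue:
        # No BST was in progress during the outage: wait out the
        # timestamp-ordering window, then start accepting requests.
        wait(T_SECONDS)
        return "OK"
    # An accepted prewrite never saw its second phase write:
    # notify the storage manager and stay inaccessible.
    return "RECOVERING"
```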
Notice that all BSTs used in a shared storage array are represented as directed acyclic graphs (Figure 4.37), in which it is possible to ensure that no device write begins until after all device reads are complete [Courtright, 1997]. This point, right before the storage controller sends out second phase write requests, is called the commit point. A storage controller may encounter a device outage before the commit point, that is, before any second phase messages are sent out. In this case, the storage controller notifies the storage manager that it suspects a device failure, after canceling its first phase requests at the other devices that have responded to it during the first phase. In any case, recovery is simply enacted by retrying the BST later when the device is back on-line. No special recovery work is required at the device or storage manager. If the storage controller has crossed the commit point but did not complete the writes everywhere, then the surviving devices will update storage (move to the Updated and Done states) while the device experiencing the outage will not. This device has accepted the prewrite but is not available to process the corresponding write. In this case, recovery proceeds as follows. Upon restart, the device inspects its queue of accepted prewrites stored in NVRAM and discovers that it has failed after accepting a prewrite and before processing the corresponding write. Such a device notifies the storage manager to perform recovery.

Device failure. A permanent device failure can occur (or be discovered) when the system is idle or when a BST is active. If the failure occurs during an idle period, the storage manager is notified. The storage manager revokes all leases to the virtual objects that have been allocated storage on that device. The virtual objects are moved to the degraded or reconstructing mode, marking the device as failed.
The object is moved to a reconstructing mode if the storage manager can reallocate storage space on a functional device to replace the space on the failed device. Given that the device is no longer available, data stored on the device must be reconstructed from redundant copies. The storage manager changes the layout map to reflect the new allocation and moves the object to a reconstructing mode. A task to reconstruct the contents of the failed device and write them to the replacement space is started. If there is no space available to allocate as a replacement, the object is moved to a degraded mode. In degraded mode, writes addressed to the failed device are reflected in the parity. In any case, storage controllers that fetch the new map will be required to use the degraded mode or the reconstructing mode BSTs as specified in the map. A permanent device failure can also be discovered after a BST has started. In this case, the storage controller must be careful in how it completes the BST. Before reaching the commit point, any BST encountering a permanent device failure (during a read) simply terminates, and its parent task reissues a new BST in degraded mode. This occurs as follows. The storage controller discovering the permanent failure aborts the current BST, discards the current layout map and notifies the storage manager of the failure. The storage manager verifies that the device has failed and then moves the virtual objects to degraded or reconstructing mode. The storage controller fetches the new map and restarts the access using a degraded mode BST. After the commit point, a BST can encounter a permanent device failure. A storage controller encountering a single device failure after one or more writes are sent to the devices simply completes. This is correct because an observer cannot distinguish between a single failure occurring after the commit point and the failure of that device immediately after the BST completes. The recovery protocol proceeds as follows. After the write completes, the storage manager is notified of the failed device. The storage manager moves the object to a degraded or reconstructing mode before re-enabling access to the object.

Lease expiration. A storage controller can experience a special kind of outage due to the expiration of a lease. Note that a lease can expire when the client is in the middle of executing a BST. If the expiration occurs before the commit point, then the second phase is not started. Instead, requests are sent to the devices to cancel the first phase and release the locks. The map is then refreshed by contacting the appropriate storage manager, effectively extending the lease. The access protocol at the storage controller checks the lease expiration time before starting the second phase to establish that the lease will remain valid for the duration of the second phase. Because all lock requests have been acquired in the first phase, the second phase will not last for a long time. Once all replies have been received from the devices, the BST is considered to have completed successfully. A storage manager may enter an object into a recovery mode during this second phase. This occurs if the second phase lasts for a long enough time that the lease expires. Since no client accesses are accepted in recovery mode and all locks are reset, the storage device will respond to the controller’s second phase message with an error code. The BST is considered failed by the storage controller, and the write must be retried to make sure that the contents of storage reflect the values in the write buffer.

4.7.4 Recovery in degraded and reconstructing modes

Under these modes, a device has failed in the array and therefore the array is not fault-tolerant. A second device failure or an untimely outage can result in data loss.

Controller outage/failure.
As in the Fault-Free and Migrating modes, a controller outage amounts to losing all the layout maps cached at the client. After restart, the storage controller must re-fetch new layout maps from the storage manager to access virtual objects. Benign failures, those that occur when no BST is active or while a BST is still in its issuing state, are straightforward to handle. Upon restart of the crashed controller, access can begin immediately. Controller failures and outages that occur in the midst of a BST can often lead to data loss because they corrupt the parity code, making the data on the failed device irreconstructible. If the outage occurs before the commit point but after some devices have reached the Blocking state, then no device has been updated. The devices will eventually time out and notify the storage manager. The storage manager, however, may not be able to ascertain whether the controller did or did not cross the commit point, and must in this case assume the controller may have partially updated the stripe and consequently declare data loss. In a few fortunate cases, the storage manager will be able to establish that no device has been updated and cancel (roll back) the BST without moving the object to the Data-loss state. If the storage manager finds that all the devices participating in the BST started by the failed controller are in the Blocking state, then it can conclude that no device has updated storage. In this case, the storage manager can cancel the prewrites and re-enable access to the object. If, on the other hand, the storage manager finds that at least one device has accepted a prewrite or write with a higher timestamp than that of the incomplete BST, it cannot establish whether that device rejected the prewrite of the incomplete BST or whether it accepted it. It is possible that this device has accepted the prewrite.
The storage controller could have sent a write message to the device and then crashed before updating the remaining devices. In this case, the device would have updated storage and completed the BST locally. It could have later accepted another prewrite or write with a higher timestamp. The storage manager must then take the conservative option, assume that the BST updated some devices but not all of them, and declare data loss. In order for the storage manager to accurately determine the fate of the incomplete BST at all the devices, it must have access to the history of writes serviced in the past. This can be achieved if the device maintains in NVRAM not only the accepted prewrites but also the recently serviced writes to the block.

Device outage. A critical device outage occurs when a device participates in the first phase of a BST and then experiences an outage before receiving the second phase message. In this case, recovery depends on when the outage is discovered. If the outage occurs during the Issuing or Blocked states, the storage controller will fail to contact the device and will therefore cancel the requests accepted at the other devices and inform the storage manager of the device’s inaccessibility. In this case, the access is simply retried later when the device is back on-line. No special recovery work is required at the device or storage manager. If the storage controller has written to all the storage devices it intended to update except for a device that experienced the outage after responding to the first phase message, then the storage controller completes the write to the surviving devices and writes the data intended for the crashed device to a designated scratch space. Then, it notifies the storage manager. The storage manager revokes all leases and moves the object to the recovery mode. When the device restarts, the data is copied from the scratch space to the device and access is re-enabled. If the outage was experienced by the replacement device, then the incomplete BST is compensated for by copying the data which was successfully written to the degraded array to the replacement device. In this case, the controller need not write to a scratch space since the data is already stored in the degraded array.

Device failure. A permanent device failure in degraded mode amounts to data loss. The object is moved to the Unavailable mode and its data is no longer accessible.

4.7.5 Storage manager actions in the recovering mode

The storage manager receives notifications of device outages and failures from controllers. The storage manager takes a simple sequence of actions upon the receipt of such notifications. It revokes all leases for the virtual object being updated, then executes the proper compensating BST to restore the array’s consistency, then moves the object to a different mode if necessary, and finally re-enables access to the virtual object. Upon a general power failure, all the devices and clients experience a transient failure and must restart fresh, losing all the state in their volatile memory. In this case, there is no surviving entity to notify the storage manager of the outage. All the devices and controllers experience a simultaneous outage. Controllers have no state and upon restart cannot assist the storage manager in restoring the array’s consistency. They cannot tell the storage manager which BSTs were in progress during the outage. The storage manager relies on the devices to efficiently perform recovery after a general outage. Upon restart, a device notifies the storage manager(s) that allocated space on it so that storage managers can carry out any recovery work before the device is allowed to start servicing client requests. The storage manager must determine whether and which nodes experienced a critical outage; that is, were actively processing a BST during the outage. This is achieved as follows.
Upon restart from an outage, a storage device inspects its queue of accepted prewrites. This information is stored in NVRAM on the device and therefore survives transient outages. If the queue is found empty, the device is made accessible to serve requests from the storage controllers. If the queue contains outstanding prewrites, then the storage manager knows that a BST was in progress during the outage. It can then execute the proper compensating action. Table 4.5 summarizes the recovery actions taken by the storage manager upon a critical outage or failure. The table shows the compensating transaction executed by the storage manager to complete the BST and achieve consistency. It also shows the new mode that the object will transition to (from the Unavailable mode) after recovery is completed.

Active BST     Object mode     Type of failure                  Compensating BST                  New object mode
Write          Fault-Free      Critical outage                  Rebuild-Range                     Fault-Free
Write          Fault-Free      Device failure                   None                              Degraded
Multi-Write    Migrating       Critical outage                  Rebuild-Range                     Migrating
Multi-Write    Migrating       Device failure (degraded array)  Rebuild-Range                     Migrating
Multi-Write    Migrating       Replacement device failure       Rebuild-Range                     Migrating
Copy-Range     Migrating       Critical outage                  Copy-Range                        Migrating
Copy-Range     Migrating       Device failure (degraded array)  Copy-Range                        Degraded
Copy-Range     Migrating       Replacement device failure       Restart copy task on new device   Migrating
Write          Degraded        Critical outage                  None                              Data Loss
Write          Degraded        Device failure                   None                              Data Loss
Write          Reconstructing  Critical outage                  None                              Data Loss
Write          Reconstructing  Device failure (degraded array)  None                              Data Loss
Write          Reconstructing  Replacement device failure       Restart recon task on new device  Reconstructing
Rebuild-Range  Reconstructing  Critical outage                  Rebuild-Range                     Reconstructing
Rebuild-Range  Reconstructing  Device failure (degraded array)  None                              Data Loss
Rebuild-Range  Reconstructing  Replacement device failure       Restart recon task on new device  Reconstructing

Table 4.5: The recovery actions taken upon a failure or outage. The table shows, for each critical outage or failure, the compensating BST executed by the storage manager to restore consistency as well as the new mode that the object is moved to after recovery is completed by the storage manager. “Write” is used to refer to all the write BSTs available in the specified mode.

Compensating storage transactions in degraded and reconstructing modes

In degraded mode, one of the data disks has already failed; a second disk or client failure in the middle of an access task, before the data on the first disk is reconstructed, is considered fatal. The same holds in the reconstructing mode. However, if the failure occurs in the middle of a reconstruct task (the Rebuild-Range BST), then the BST can simply be restarted. The degraded array is not corrupted by the failure, although the replacement block being written to may have been only partially updated. The BST can be restarted and the value of the failed data block recomputed and written to the replacement device.

Compensating transactions in Fault-Free and Migrating modes

Once the failed BST is detected and reported to the storage manager, the storage manager waits until all leases to the virtual object expire and then executes a compensating transaction to restore the array’s consistency. Once consistency is restored, the mode of the virtual object is updated if necessary and access re-enabled by granting layout maps to the requesting controllers. Each BST has a compensating BST which is executed exclusively by the storage manager in recovery mode whenever a BST fails.
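The dispatch of Table 4.5 can be sketched as a lookup from (active BST, object mode, type of failure) to (compensating action, new object mode). This Python sketch transcribes a few representative rows; the dictionary and function names are illustrative assumptions, not the dissertation's code.

```python
# (active BST, object mode, type of failure) ->
#     (compensating action, new object mode); a few rows of Table 4.5.
RECOVERY_ACTIONS = {
    ("Write", "Fault-Free", "critical outage"): ("Rebuild-Range", "Fault-Free"),
    ("Write", "Fault-Free", "device failure"): (None, "Degraded"),
    ("Multi-Write", "Migrating", "critical outage"): ("Rebuild-Range", "Migrating"),
    ("Copy-Range", "Migrating", "device failure"): ("Copy-Range", "Degraded"),
    ("Write", "Degraded", "critical outage"): (None, "Data Loss"),
    ("Rebuild-Range", "Reconstructing", "critical outage"): ("Rebuild-Range", "Reconstructing"),
}

def recover(active_bst, mode, failure):
    """Return the compensating action and the mode the object is moved to
    (from the Unavailable mode) once recovery completes."""
    return RECOVERY_ACTIONS[(active_bst, mode, failure)]
```

A storage manager structured this way would consult the table after revoking leases, run the compensating BST, switch the object's mode, and only then re-enable access.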
Under fault-free mode, once the suspect ranges are identified, consistency can be effectively re-established by recomputing the parity block from the stripe of data blocks. Under this mode, all devices are operational, so the Rebuild-Range BST can be used to recompute the parity block. In migrating mode, a BST is invoked either by an access task or by a migrate task. For BSTs that are invoked by hostwrite tasks, the compensating transactions are similar to the ones in Fault-Free mode. In the case of a copy BST that is invoked by a migrate task, the compensating transaction is the transaction itself. That is, the storage manager simply reissues the copy transaction.

4.7.6 The recovery protocols

This section presents the pseudo-code for the actions taken by the devices and storage controllers. Storage controllers are relatively simpler: they are stateless and do not have to execute any special algorithms on restart or recovery from an outage. The important task implemented by the storage controller is the execution of a BST. The pseudo-code below shows the algorithm, abstracting the details of the concurrency control algorithm and focusing on the recovery actions taken by the controller while in the midst of executing a BST. The storage controller can experience a device outage or failure, or simply a device that is not responding. Upon encountering such an event, the storage controller cancels the BST if it did not cross the commit point. Otherwise, it completes the BST. As soon as the BST is completed or canceled, the storage controller reports the problem to the storage manager, which in turn takes the recovery actions described above. Storage devices are more complicated, because they maintain state across outages and restarts, both of the controllers and of their own. Upon recovery, storage devices execute the function InitUponRestart(). During normal operation, requests are handled by invoking the HandleRequest function.
If a timeout is reached before the second-phase message is received, the HandleTimeout function is invoked. The pseudo-code reflects the steps already discussed for device recovery and timeout handling.

// Device actions

/* Device-side pseudo-code: handle request req from controller contr */
HandleRequest (req, contr)
    if (req.type == read)
        resp = checktimestamp(req.opts, req.blockid);
        if (resp == OK)
            send (data[req.blockid], OK) to contr;
        else
            send (REJECT) to contr;
        endif
    endif
    if (req.type == prewrite or read-and-prewrite)
        resp = checktimestamp(req.opts, req.blockid);
        if (resp == OK)
            enqueue (req, NOW + TIMEOUT) to timeoutq;
            send (OK, data) to contr;
        else
            send (REJECT) to contr;
        endif
    endif
    if (req.type == write)
        write req.buf to req.blockid on stable storage;
        /* discard request with timestamp opts from queue */
        timeoutq.discard(req.opts);
        send (OK) to contr;
    endif

/* Device-side pseudo-code to handle a timeout */
HandleTimeout (req)
    send (req, BST_NOT_COMPLETED) to storage manager;
end

/* Device-side pseudo-code to execute after restart from an outage */
InitUponRestart()
    if (prewriteq is empty)
        wait T seconds;
        return (OK); /* upon return, requests can be accepted */
    else
        send (NULL, BST_NOT_COMPLETED) to storage manager;
        return (RECOVERING);
    endif
end

// Controller actions

/* Execute a two-phase write BST to object with map objmap */
ExecuteTwoPhaseBST (bst, objmap)
    /* phase 1: send prewrites to all participating devices */
    for dev = 1 to bst.numdevices
        send (bst.prewrite[dev]) to bst.device[dev];
    endfor
    deadline = NOW + TIMEOUT;
    while (replies < numdevices and time < deadline)
        receive (resp);
        replies++;
    endwhile
    /* continued below ... */
    if (replies < numdevices)
        /* some devices did not respond */
        for dev = 1 to bst.numdevices
            send (CANCEL) to bst.device[dev];
        endfor
        /* notify manager of the non-responding devices */
        for each dev not in replies
            send (dev, NOT_RESPONDING) to storage manager;
        endfor
        Discard(objmap); /* discard layout map; must be re-fetched later */
        return (DEVICE_OUTAGE_OR_FAILURE);
    else if (numoks in replies < numdevices)
        /* rejection at one or more devices, send CANCELs */
        for dev = 1 to bst.numdevices
            send (CANCEL) to bst.device[dev];
        endfor
        return (RETRY);
    else if (numoks in replies == numdevices)
        /* phase 2: commit the writes */
        for dev = 1 to bst.numdevices
            send (OK, bst.data[dev]) to bst.device[dev];
        endfor
        replies = 0; deadline = NOW + TIMEOUT;
        while (replies < numdevices and time < deadline)
            receive (resp);
            replies++;
        endwhile
        if (replies < numdevices)
            /* some devices did not respond */
            slist = null; /* list of devices suspected of failure */
            for dev = 1 to bst.numdevices
                if (dev did not respond)
                    add dev to slist;
                endif
            endfor
            /* notify manager of the uncompleted BST */
            send (bst, BST_NOT_COMPLETED, slist) to storage manager;
            Discard(objmap);
            return (DEVICE_OUTAGE_OR_FAILURE);
        endif
    endif

4.8 Discussion

The discussion so far has focused on protocols that ensure serializability for all executing BSTs. Serializability is a "sufficient" guarantee since it makes a shared array behave like a centralized one, in which a single controller receives BSTs from all clients and executes them one at a time. This guarantee, however, can be too strong if certain assumptions hold regarding the semantics and structure of high-level software. Furthermore, the discussion has focused on RAID level 5 layouts, whereas a large number of disk array architectures have been proposed in the literature.
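For concreteness, the controller-side two-phase protocol of Section 4.7.6 can be condensed into an executable sketch. The Python below is illustrative only: `send` and `recv_replies` stand in for the real messaging layer and are hypothetical callables, with `recv_replies` returning a dict of replies from the devices that answered before the timeout.

```python
def execute_two_phase_bst(bst, devices, send, recv_replies):
    """Condensed rendering of the ExecuteTwoPhaseBST pseudo-code."""
    # Phase 1: issue prewrites to every device in the BST.
    for dev in devices:
        send(dev, ("PREWRITE", bst))
    replies = recv_replies(devices)
    if len(replies) < len(devices):
        # Some devices did not respond: cancel and report the outage;
        # the layout map must be re-fetched later.
        for dev in devices:
            send(dev, ("CANCEL", bst))
        return "DEVICE_OUTAGE_OR_FAILURE"
    if any(r != "OK" for r in replies.values()):
        # Rejection (timestamp check failed) at one or more devices.
        for dev in devices:
            send(dev, ("CANCEL", bst))
        return "RETRY"
    # Phase 2: past the commit point; send the writes.
    for dev in devices:
        send(dev, ("WRITE", bst))
    replies = recv_replies(devices)
    if len(replies) < len(devices):
        # The suspect list of silent devices is reported to the manager.
        return "DEVICE_OUTAGE_OR_FAILURE"
    return "OK"
```

Note that a missing reply in phase one leads to cancellation, while a missing reply in phase two does not: past the commit point the BST must be completed, and recovery is delegated to the storage manager.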
This section highlights the generality of the presented protocols by showing how they can be readily generalized to adapt to, and exploit, different application semantics and underlying data layouts. It also includes a discussion of the recovery protocols.

4.8.1 Relaxing read semantics

Many applications do not issue a concurrent hostread and hostwrite to the same block. For example, many filesystems, such as the Unix Fast File System [McKusick et al., 1996], do not send a write to the storage system unless an exclusive lock has been acquired on that block, which prohibits any other client or thread from initiating a read or write to that block until the write completes. Consequently, a read is never initiated while a write to that block is in progress. This property of many applications obviates the need to ensure the serializability of reads, because reads never occur concurrently with writes. In fault-free operation, where reads do not need to access parity blocks, a hostread accesses only data blocks for which the higher-level filesystem has already acquired (filesystem-level) locks. It follows that any concurrent writes to the same stripe must be updating data blocks other than the ones being accessed by the hostread. It is therefore unnecessary to acquire a lock on a data block before a read. Recall that in fault-free mode, only hostread and hostwrite tasks are allowed. Thus, if the higher-level filesystem ensures no read/write conflicts, hostreads can simply be mapped onto direct devreads with no timestamp checks or lock acquire/release work. This can speed up processing at the devices and reduce the amount of messaging on the network. Note that this read optimization cannot be applied in degraded or reconstructing modes: serializability checking is still required there, because contention can occur over the same data block even if the hosts do not issue concurrent hostreads and hostwrites to the same block.
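For concreteness, a RAID-5 small write updates its data block and the stripe's shared parity block via read-modify-write: new parity = old parity XOR old data XOR new data. A minimal sketch, purely illustrative and not the dissertation's implementation:

```python
def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(stripe, idx, new_data):
    """RAID-5 read-modify-write on a stripe represented as a list of
    data blocks with the parity block last (stripe[-1]).

    Reads the old data and old parity, then writes new data and
    new parity = old parity XOR old data XOR new data."""
    old = stripe[idx]
    stripe[-1] = xor(xor(stripe[-1], old), new_data)
    stripe[idx] = new_data
```

Because every small write reads and rewrites the stripe's parity block, two writes to different data blocks of the same stripe still conflict on parity. This is exactly why write serializability remains necessary even when higher-level software serializes reads and writes per data block.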
The performance evaluation results and conclusions do not change much even if concurrency control is not performed for fault-free mode reads. In this case, none of the protocols performs any control messaging on behalf of reads, so all achieve the minimal possible latency for hostreads. Concurrency control is still required for hostwrites, which can conflict on parity as well as data blocks. Note that regardless of what the higher-level software is doing, concurrency control must be ensured for two concurrent writes to the same stripe. This is because two writes to different data blocks can contend over the same parity block, something that is totally invisible to higher-level software (consider, for example, a Read-Modify-Write BST and a Reconstruct-Write BST, both shown in Figure 4.5, occurring concurrently in fault-free mode). Serializability of such write BSTs is essential so that parity is not corrupted. Similarly, the serializability of copy BSTs with ongoing writes is required for correctness regardless of the synchronization protocol used by higher-level software.

Finally, the deadlock problem associated with device-served locking variants can be eliminated by requiring clients to acquire locks on entire stripes. This breaks the "hold and wait" condition because clients do not have to acquire multiple locks per stripe; only a single lock is acquired. Stripe locking substantially reduces concurrency, however, especially with large stripe sizes. This in turn degrades I/O throughput and increases access latencies for applications that perform many small writes to shared large stripes.

4.8.2 Extension to other RAID levels

The approach discussed in this chapter can be extended in a straightforward manner to other RAID levels, including double-fault-tolerating architectures.
The reason is that all the read and write operations in all of the RAID architectures known to the author at the time of writing [Blaum et al., 1994, Holland et al., 1994] consist either of a single (read or write) phase or of a read phase followed by a write phase. Thus, the piggy-backing and timestamp validation approach described in the previous sections applies directly to these architectures as well.

4.8.3 Recovery

Multiple component failures can easily lead to an object ending in the Unavailable state. In practice, a poorly designed system can be vulnerable to correlated failures and outages of several components. For example, if disks share the same power source or cooling support, then multiple disks can experience faults at the same time. The likelihood of more than a single failure can be substantially reduced by designing support equipment so that it is not shared by disks in the same stripe group [Schulze et al., 1989]. Another scheme is to use uninterruptible power supplies (UPS). Multiple or successive failures can cause several transitions between modes. The following discussion focuses on a single failure or transition at a time.

While the concurrency control protocol work is largely distributed, the recovery work relies heavily on the storage manager. This is not a serious problem because recovery is expected to be rare. Furthermore, there need not be a single storage manager in the system: there can be many storage managers as long as, at any given time, there is a unique manager serving the layout map for a given virtual object. Thus, the storage manager's load can easily be distributed and parallelized across many nodes. When a manager fails, however, another storage manager must take over the virtual objects that it was responsible for.
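A minimal sketch of this arrangement is a directory that assigns each virtual object a unique manager and lets surviving managers take over a failed manager's objects. The hash-based assignment policy below is an assumption chosen for illustration, not the dissertation's mechanism:

```python
class ManagerDirectory:
    """Illustrative sketch: each virtual object has exactly one manager
    at any time; when a manager fails, its objects are reassigned among
    the surviving managers."""

    def __init__(self, managers):
        self.managers = list(managers)
        self.assignment = {}  # object id -> manager

    def assign(self, obj_id):
        # Hash-based placement spreads manager load across nodes
        # (placement policy is hypothetical).
        m = self.managers[hash(obj_id) % len(self.managers)]
        self.assignment[obj_id] = m
        return m

    def fail(self, manager):
        """Remove a failed manager and hand its objects to survivors."""
        self.managers.remove(manager)
        for obj, m in self.assignment.items():
            if m == manager:
                self.assignment[obj] = self.managers[
                    hash(obj) % len(self.managers)]
```

The invariant preserved throughout is the one stated above: at any given time a single manager serves the layout map for a given virtual object.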
Other work has investigated how storage managers can be assigned dynamically by the storage devices upon a failure, so that no static assignment of managers to devices is necessary [Golding and Borowsky, 1999]. That work is complementary to the solutions discussed in this dissertation, solving the related problem of ensuring fast parallel recovery when the system restarts after a general failure. The protocols described in [Golding and Borowsky, 1999] handle network partitions as well as device and manager outages.

TickerTAIP [Cao et al., 1994] is a parallel disk array architecture that distributed the function of the disk array controller to the storage nodes in the array. One of the design goals of TickerTAIP was to tolerate node faults. Hosts in TickerTAIP did not directly carry out RAID update protocols; these were executed by one storage node on behalf of the host, so host failure was not a concern. The protocols discussed in this dissertation generalize the recovery protocols of TickerTAIP to the case where the RAID update algorithms involve both the clients and the devices.

4.9 Summary

Shared storage arrays enable thousands of storage devices to be shared and directly accessed by hosts over switched storage-area networks. In such systems, storage access and management functions are often distributed to enable concurrent accesses from clients and servers to the base storage. Concurrent tasks can lead to inconsistencies in redundancy codes and in the data read by hosts. This chapter proposed a novel approach to constructing a scalable distributed storage management system that enables high concurrency between access and management tasks while ensuring correctness. The proposed approach breaks down the storage access and management tasks performed by storage controllers into two-phased operations (BSTs) such that correctness requires ensuring only the
serializability of the component BSTs and not of the parent tasks.

This chapter presented distributed protocols that exploit technology trends and BST properties to provide serializability for BSTs with high scalability, coming within a few percent of the performance of the ideal zero-overhead protocol. These protocols use message batching and piggy-backing to reduce BST latencies relative to centralized lock server protocols. In particular, both device-served locking and timestamp ordering achieve up to 40% higher throughput than server and callback locking for a small (30-device) system. Both distributed protocols exhibit superior scaling, falling short of the ideal protocol's throughput by only 5-10%.

The base protocols assume that within the shared storage array, data blocks are cached at the storage devices and not at the controllers. When controllers are allowed to cache data and parity blocks, the distributed protocols can be extended to guarantee serializability for reads and writes. This chapter demonstrated that timestamp ordering with validation, a timestamp-based protocol, performs better than device-served leasing, especially in the presence of contention and random-access workloads. In summary, the chapter concludes that timestamp ordering based on loosely synchronized clocks has robust performance across low and high contention and in the presence of device-side or host-side caching. At the same time, timestamp ordering requires limited state at the devices and does not suffer from deadlocks.

Chapter 5

Adaptive and automatic function placement

The previous chapter presented an approach based on lightweight transactions which allows storage controllers to be active concurrently. Specifically, multiple controllers can be accessing shared devices while management tasks are ongoing at other controllers. Performance results show that the protocols used to ensure correctness scale well with system size.
The approach described in the previous chapter enables controllers to be actively migrating blocks across devices and reconstructing data on failed devices while access tasks are ongoing. This enables load to be balanced across disks by migrating storage without disabling access, which improves the performance of data access for applications using the storage system.

Another issue that affects the performance of storage-intensive applications is how their functions are partitioned between the different nodes in the storage system. Judicious partitioning can reduce the amount of data communicated over bottlenecked links and avoid executing function on overloaded nodes. Rapidly changing technologies cause a single storage system to be composed of multiple storage devices and clients with disparate levels of CPU and memory resources. Moreover, the interconnection network is rarely a simple crossbar and is usually quite heterogeneous: the bandwidth available between a pair of nodes depends on the physical link topology between them. This chapter demonstrates that performance can be improved significantly for storage management and data-intensive applications by adaptively partitioning function between storage servers and clients. Function can be judiciously partitioned based on the availability and distribution of resources across nodes and on a few key workload characteristics, such as the bytes moved between functional components. The chapter also shows that the information necessary to decide on placement can be collected at run-time via continuous monitoring.

This chapter is organized as follows. Section 5.1 highlights the different aspects of heterogeneity in emerging storage systems and states the assumptions made by the discussions that follow. Section 5.2 reviews the function placement decisions of traditional filesystems and how they evolved to adapt to the underlying hardware and to the driving applications. Section 5.3 describes a programming system, called ABACUS, which allows applications to be composed of components that can move between client and server dynamically at run-time. Section 5.4 describes the performance model used by ABACUS to decide on the best placement of function and explains how the information needed by the performance model is transparently collected at run-time. Section 5.5 describes a filesystem that was designed and implemented on ABACUS. Section 5.6 reports on the performance of this filesystem on ABACUS under different synthetic workloads. Section 5.7 describes how ABACUS can be generalized to enable user-level applications to benefit from adaptive client-server function placement. Section 5.8 evaluates the algorithms used by ABACUS to decide on the optimal function placement under variations in node and network load and in workload characteristics. Section 5.9 discusses the advantages and limitations of the proposed approach. Section 5.10 discusses previous related work. Section 5.11 summarizes the chapter.

5.1 Emerging storage systems: Active and Heterogeneous

Two of the key characteristics of emerging and future storage systems are the general-purpose programmability of their nodes and the heterogeneity of resource richness across them. Heterogeneity mandates intelligent and dynamic partitioning of function to adapt to resource availability, while the programmability of nodes enables it. The increasing availability of excess cycles at on-device controllers is creating an opportunity for devices and low-level storage servers to subsume more host system functions.
One question that arises from the increased flexibility enabled by storage device programmability is how filesystem function should be partitioned between storage devices and their clients. Improper function partitioning between active devices and clients can put pressure on overloaded nodes and result in excessive data transfers over bottlenecked or slow network links. The heterogeneity in resource availability among servers, clients and network links, and the variability in workload mixes, cause the optimal partitioning to change across sites and over time.

5.1.1 Programmability: Active clients and devices

Storage systems consist of storage devices, client machines and the network links that connect them. Traditionally, storage devices have provided a basic block-level storage service while the host executed all of the filesystem code. Moore's law is making devices increasingly intelligent, and trends suggest that soon some of the filesystem or even application function can be subsumed by the device [Cobalt Networks, 1999, Seagate, 1999].

Disk drives have heavily exploited the increasing transistor density in inexpensive ASIC technology to both lower cost and increase performance by developing sophisticated special-purpose functional units and integrating them onto a small number of chips. For instance, in 1998 Siemens' TriCore integrated micro-controller and ASIC chip contained a 100 MHz 3-way issue super-scalar 32-bit data-path with up to 2 megabytes of on-chip dynamic RAM and customer-defined logic [TriCore News Release, 1997]. Regardless of which technology will prove most cost-effective for bringing additional computational power to storage devices (e.g. an embedded PC with multiple attached disks, or a programmable on-disk controller), the ability to execute code on relatively resource-poor storage servers creates an opportunity which must be carefully managed.
Although some function can benefit from executing closer to storage, storage devices can easily be overwhelmed. Active storage devices present the storage-area-network filesystem designer with the added flexibility of executing function on the client side or the device side. They also present a risk of degraded performance if function is partitioned badly.

5.1.2 Heterogeneity

Storage systems consist of highly heterogeneous components. There are two important aspects of this heterogeneity: the first is heterogeneity in resource levels across nodes and links, and the second is heterogeneity in node trust levels.

Heterogeneity in node resources

Storage systems are characterized by a wide range in the resources available at the different system components. Storage servers (single disks, storage appliances and servers) have varied processor speeds, memory capacities, and I/O bandwidths. Client systems (SMP servers, desktops, and laptops) also have varied processor speeds, memory capacities, network link speeds and levels of trustworthiness.

Some nodes may have "special" resources or capabilities which can be used to accelerate certain kinds of computations. Data-intensive applications perform different kinds of operations such as XOR, encoding, decoding and compression. These functions can benefit from executing on nodes that have special capabilities to accelerate these operations. Such capabilities can be hardware accelerators, configurable logic, or specialized processors.

Heterogeneity in node trust levels

In a distributed system, not all nodes are equally trusted to perform a given function: a client may not be trusted to enforce the access control policy that the server dictates, or it may not be trusted to maintain the integrity of certain data structures.
For instance, certain filesystem operations, such as directory updates, can only be safely completed by trusted entities, since they can potentially compromise the integrity of the filesystem itself. Where a function executes, therefore, has crucial implications for whether the privacy and integrity of data can be maintained. Traditionally, many client-server systems [Howard et al., 1988] assume non-trustworthy clients and partition function between client and server such that all potentially sensitive function is kept at the server. This assumption does not hold in the case of many emerging clusters, where clients and servers are equally trusted; such conservative designs under-perform when the resources of trusted clients go underutilized. Other systems, like NFS [Sandberg et al., 1985] and Vesta [Corbett and Feitelson, 1994], have assumed trusted clients. Such systems can suffer serious security breaches when deployed in hostile or compromised environments.

Filesystem designers do not know the trustworthiness of the clients at design time and hence are forced to make either a conservative assumption, presupposing all clients to be untrustworthy, or a liberal one, assuming clients will behave according to policy. In general, it is beneficial to allow trust-sensitive functions to be bound to cluster nodes at run-time according to site-specific security policies defined by system administrators. This way, the filesystem or application designer can avoid hard-coding assumptions about client trustworthiness. Such flexibility enhances filesystem configurability and allows a single filesystem implementation to serve in both paranoid and oblivious environments.

5.1.3 Assumptions and system description

This chapter does not assume a very specific storage system architecture.
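Run-time binding of trust-sensitive functions can be sketched as a placement check consulted per function. The policy table and function names below are hypothetical, chosen only to illustrate the idea of a site-specific policy replacing a hard-coded split:

```python
# Hypothetical site policy: which functions are trust-sensitive.
SITE_POLICY = {
    "directory_update": True,   # can compromise filesystem integrity
    "read_cache": False,
    "parity_compute": False,
}

def place(function, client_is_trusted):
    """Bind a function to 'client' or 'server' at run time.

    Unknown functions default to trust-sensitive (the conservative
    assumption a designer would otherwise have to hard-code)."""
    sensitive = SITE_POLICY.get(function, True)
    if sensitive and not client_is_trusted:
        return "server"
    return "client"
```

With such a check, the same filesystem binary serves a "paranoid" site (untrusted clients keep sensitive functions server-side) and an "oblivious" one (trusted clients run everything locally) without redesign.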
In fact, its goal is to arrive at a framework and a set of algorithms which enable a filesystem to be automatically optimized, at installation-time and at run-time, for the particular hardware available in the environment. It follows, therefore, that little should be assumed about the resource distribution or about the workload. Of course, some basic assumptions about the storage model and the kinds of entities in the system are required to make progress.

The discussion in this chapter is concerned with the partitioning of filesystem functions in active storage systems. Active storage systems consist of programmable storage servers, programmable storage clients and a network connecting them. This subsection describes these components in more detail.

Programmable storage servers

A storage server in an active storage system can be resource-poor or resource-rich. It can be a NASD device with a general-purpose processor, a disk array with similar capabilities, or a programmable file server machine. A "storage server" refers to a node with a general-purpose execution environment that is directly attached to storage. A storage server executes an integrated storage service that allows remote clients and local applications to access the storage devices directly attached to it. The interface to this base storage service can be NFS, HTTP, NASD or any interface allowing logical objects with many blocks to be efficiently named. The implementation used in the experiments reported in this chapter is built on a base storage service which implements the NASD object interface, but it will be clear from the discussion that the approach described in this chapter applies equally well to any other object-like or file-like base storage service.
Because all storage devices and file servers already contain a general-purpose processor capable of executing general-purpose programs, the specific meaning of "storage server programmability" in this context may not be clear. While storage devices are endowed today with general-purpose processors, the software executed on these processors is written entirely by the device manufacturers. Similarly, while NFS file servers are often general-purpose workstations, the function that administrators allow to execute on the server is limited to little beyond file service and the supporting services it requires (e.g. networking, monitoring and administration services).

This chapter assumes that programmable storage servers allow general-purpose client extensions to execute on their local processors, possibly subject to certain security and resource allocation policies. These client-provided functions can be downloaded at application run-time and are not known to the storage server or device manufacturer ahead of time. They can be part of the filesystem traditionally executed on the client's host system, or alternatively they can be part of user-level applications. All such functions, however, may have to obey certain restrictions to be able to execute on the programmable server. For example, they may be constrained to accessing persistent storage and other resources on the storage server through a specified set of interfaces. The interfaces exported by a programmable storage server to client functions can range from an entire POSIX-compliant UNIX environment to a limited set of simple interfaces.

This chapter assumes that programmable storage servers export a NASD interface. Client-downloaded extensions can be executed on the server and can access local storage through a NASD interface.
Client functions can use the server's processor to perform computations and can allocate a bounded amount of memory. They can access memory in their own address space or make invocations to other (local or remote) services through remote procedure calls.

Storage clients

Storage clients represent all nodes in the system that are not storage servers; alternatively, storage clients are network nodes that have no storage directly attached to them. They access storage by making requests on storage servers. They also have a general-purpose execution environment. The clients may or may not be trusted by the storage servers. Filesystem function that does not execute on the storage server must execute on the client. Storage clients include desktops, application servers, NASD file managers, or web servers connected to storage servers through a storage-area network.

5.2 Function placement in traditional filesystems

Currently, distributed filesystems, like most client-server applications, are constructed via remote procedure calls (RPC) [Birrell and Nelson, 1984]. A server exports a set of services, defined as remote procedure calls, that clients can invoke to build applications. Distributed filesystems have therefore traditionally divided their device-independent functions statically between client and server. Changes in the relative performance of processors and network links, and in the trust levels of nodes, across successive hardware generations make it hard for a "one-design-fits-all" function partitioning decision to provide robust performance in all customer configurations.

Consequently, distributed filesystems have historically evolved to adapt to changes in the underlying technologies and target workloads. Examples include parallel filesystems, local-area network filesystems, wide-area filesystems, and active disk systems. The following section describes these different systems and how they were specialized to their particular target environments.
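The static RPC split can be caricatured as a fixed server interface that clients can only invoke, never relocate. The sketch below is schematic (no real filesystem's API; the in-process "server" stands in for a remote one):

```python
class FileServer:
    """Schematic RPC-style server: the set of exported procedures is
    fixed at design time, so the client/server function split cannot
    adapt to the hardware it is deployed on."""

    def __init__(self):
        self.files = {}

    # Each method stands for one exported remote procedure.
    def read(self, name):
        return self.files.get(name, b"")

    def write(self, name, data):
        self.files[name] = data
        return len(data)

class Client:
    """Client-side stub: every operation crosses the client/server
    interface because the partition is static."""

    def __init__(self, server):
        self.server = server

    def copy(self, src, dst):
        # The data flows through the client even when both copies live
        # on the same server, a cost adaptive placement could avoid by
        # moving the copy function server-side.
        data = self.server.read(src)
        return self.server.write(dst, data)
```

The `copy` method illustrates the cost of a fixed split: a server-local operation still pays two network transfers, which is precisely the kind of overhead that motivates the adaptive placement explored in this chapter.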
[Figure: client/server interface dividing filesystem components]