Scalability of RAID Systems
Yan Li

Doctor of Philosophy
Institute of Computing Systems Architecture
School of Informatics
University of Edinburgh
2010

Abstract

RAID systems (Redundant Arrays of Inexpensive Disks) have dominated back-end storage systems for more than two decades and have grown continuously in size and complexity. Currently they face unprecedented challenges from data-intensive applications such as image processing, transaction processing and data warehousing. As the size of RAID systems increases, designers are faced with both performance and reliability challenges. These challenges include limited back-end network bandwidth, physical interconnect failures, correlated disk failures and long disk reconstruction times.

This thesis studies the scalability of RAID systems in terms of both performance and reliability through simulation, using a discrete event driven simulator for RAID systems (SIMRAID) developed as part of this project. SIMRAID incorporates two benchmark workload generators, based on the SPC-1 and Iometer benchmark specifications. Each component of SIMRAID is highly parameterised, enabling it to explore a large design space. To improve simulation speed, SIMRAID employs a set of abstraction techniques that extract the behaviour of the interconnection protocol without losing accuracy. Finally, to meet the technology trend toward heterogeneous storage architectures, SIMRAID provides a framework that allows easy modelling of different types of device and interconnection technique.

Simulation experiments were first carried out on the performance aspects of scalability. They were designed to answer two questions: (1) given a number of disks, which factors affect back-end network bandwidth requirements; and (2) given an interconnection network, how many disks can be connected to the system. The results show that the bandwidth requirement per disk is primarily determined by workload features and stripe unit size (a smaller stripe unit size has better scalability than a larger one), with cache size and RAID algorithm having very little effect on this value. The maximum number of disks is limited, as would be expected, by the back-end network bandwidth.

Studies of reliability have led to three proposals to improve the reliability and scalability of RAID systems. Firstly, a novel data layout called PCDSDF is proposed. PCDSDF combines the advantages of orthogonal data layouts and parity declustering data layouts, so that it can not only survive multiple disk failures caused by physical interconnect failures or correlated disk failures, but also has good degraded and rebuild performance. The generating process of PCDSDF is deterministic and time-efficient, and the number of stripes per rotation (namely the number of stripes needed to achieve rebuild workload balance) is small. Analysis shows that the PCDSDF data layout can significantly improve system reliability. Simulations performed on SIMRAID confirm the good performance of PCDSDF, which is comparable to that of other parity declustering data layouts, such as RELPR.

Secondly, a system architecture and rebuilding mechanism have been designed, aimed at fast disk reconstruction. This architecture is based on parity declustering data layouts and a disk-oriented reconstruction algorithm. It uses stripe groups instead of stripes as the basic distribution unit so that it can make use of the sequential nature of the rebuilding workload.
The design space of system factors such as parity declustering ratio, chunk size, private buffer size of surviving disks and free buffer size is explored to provide guidelines for storage system design. Thirdly, an efficient distributed hot spare allocation and assignment algorithm for general parity declustering data layouts has been developed. This algorithm avoids conflict problems in the process of assigning distributed spare space for the units on the failed disk. Simulation results show that it effectively solves the write bottleneck problem and, at the same time, incurs only a small increase in the average response time to user requests.

Publication List:

• Y. Li, T. Courtney, R. N. Ibbett and N. Topham, “On the Scalability of the Back-end Network of Storage Sub-Systems”, in SPECTS 2008, pp. 464-471, Edinburgh, UK.
• Y. Li and R. N. Ibbett, “SimRAID: An Efficient Performance Evaluation Tool for Modern RAID Systems”, in Poster Competition of University of Edinburgh Informatics Jamboree, Edinburgh, 2007. (First prize.)
• Y. Li, T. Courtney, R. N. Ibbett and N. Topham, “Work in Progress: On the Scalability of Storage Sub-System Back-end Network”, in WiP of FAST, San Jose, USA, 2007.
• Y. Li, T. Courtney, R. N. Ibbett and N. Topham, “Work in Progress: Performance Evaluation of RAID6 Systems”, in WiP of FAST, San Jose, USA, 2007.
• Y. Li, T. Courtney, F. Chevalier and R. N. Ibbett, “SimRAID: An Efficient Performance Evaluation Tool for RAID Systems”, in Proc. SCSC, pp. 431-438, Calgary, Canada, 2006.
• T. Courtney, F. Chevalier and Y. Li, “Novel technique for accelerated simulation of storage systems”, in IASTED PDCN, pp. 266-272, Innsbruck, Austria, 2006.
• Y. Li and A. Goel, “An Efficient Distributed Hot Sparing Scheme in a Parity Declustered RAID Organization”, under US patent application (application number 12247877).

Acknowledgements

Many thanks to my supervisor, Prof. Roland Ibbett, for his invaluable supervision, warm-hearted encouragement, patience in correcting my non-native English, and understanding throughout my PhD. Without him, this Ph.D. and thesis would never have happened. Thanks must also go to Prof. Nigel Topham, my current supervisor, for his supportive supervision and discussion and for his trust and support. Without his support and trust, I would not have been able to finish this thesis.

In addition, many thanks to my colleagues Tim Courtney and Franck Chevalier for their help and discussions. Their help has been invaluable to me, especially at the beginning of this PhD. My sincere thanks must go to David Dolman for solving HASE problems and maintaining HASE. Without his help and quick responses, I would not have been able to finish my experiments. Thanks also to my other colleagues in the HASE group, Juan Carballo and Worawan Marurngsith, for the useful discussions and help over the years.

Many thanks to Atul Goel and Tom Theaker from NetApp for their help and support during my internship at NetApp. I learned a lot from Atul about designing useful and practical systems. Chapter 6 of this thesis is based on my intern project. Tom Theaker has been a very supportive boss and I really appreciate the internship and job opportunities he gave me.

Many thanks to my friends Lu, Yuanyuan, Lin and Guo for their friendship. Talking with them has always been my pleasure. Finally, my sincere thanks to my beloved parents and my dear husband Zhiheng for their unwavering support through all these years.
Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Yan Li)

To Zhiheng and Grace.

Table of Contents

1 Introduction
1.1 RAID Systems
1.1.1 Data Striping
1.1.2 Redundancy
1.1.3 Orthogonal RAID
1.1.4 Degraded/Rebuilding Performance and Parity Declustering Data Layouts
1.2 Technology Trends in RAID System Architecture
1.3 RAID Systems Challenges
1.3.1 Performance Challenge - Limited Back-end Network Bandwidth
1.3.2 Reliability Challenges
1.4 Research Objectives
1.5 Thesis Contributions
1.6 Thesis Overview

2 RAID Systems
2.1 RAID System Implementation Issues
2.1.1 Basic RAID Levels
2.1.2 Reading and Writing Operations
2.1.3 Data Layout
2.1.4 Reconstruction Algorithms
2.2 Physical Components of RAID Systems
2.2.1 Hard Disk Drives
2.2.2 RAID Controller
2.2.3 Back-end Networks
2.2.4 Shelf-enclosures
2.3 Summary

3 Related Work
3.1 Performance Modelling and Evaluation
3.1.1 RAID System Performance
3.1.2 FC Network Performance
3.2 RAID System Reliability Study
3.2.1 RAID System Reliability Analysis
3.2.2 Improving System Reliability
3.3 Parity Declustering Data Layout Design
3.3.1 BIBD and BCBD
3.3.2 PRIME and RELPR
3.3.3 Nearly Random Permutation Based Mapping
3.3.4 Permutation Development Data Layout
3.3.5 String Parity Declustering Data Layout
3.4 Distributed Hot Sparing
3.5 Simulation of Storage Systems
3.5.1 Simulation Models of RAID System
3.5.2 Simulation Models of FC Network
3.6 Summary

4 The SIMRAID Simulation Model
4.1 HASE
4.1.1 Overview of HASE System
4.1.2 HASE Entity and Hierarchical Design
4.1.3 The HASE++ DES Engine
4.1.4 Event Scheduling and Time Advancing
4.1.5 Extensible Clock Mechanism
4.2 SIMRAID Model Design
4.2.1 Overview of SIMRAID
4.2.2 Transport Level Abstraction Techniques
4.2.3 Framework for Modelling Heterogeneous Systems