Highly-Available, Scalable Network Storage

Edward K. Lee
Digital Equipment Corporation, Systems Research Center
[email protected]

Abstract

The ideal storage system is always available, is incrementally expandable, scales in performance as new components are added, and requires no management. Existing storage systems are far from this ideal. The recent introduction of low-cost, scalable, high-performance networks allows us to re-examine the way we build storage systems and to investigate storage architectures that bring us closer to the ideal storage system. This document examines some of the issues and ideas in building such storage systems and describes our first scalable storage prototype.

1 Motivation

Today, managing large-scale storage systems is an expensive, complicated process. Various estimates suggest that for every $1 of storage, $5-$10 is spent to manage it. Adding a new storage device frequently requires a dozen or more distinct steps, many of which require reasoning about the system as a whole and are, therefore, difficult to automate. Moreover, the capacity and performance of each device in the system must be periodically monitored and balanced to reduce fragmentation and eliminate hot spots. This usually requires manually moving, partitioning, or replicating files and directories. For example, directories containing commonly used commands are often replicated to distribute load across several disks and file servers. Also, user directories are frequently partitioned across several file systems, resulting in naming artifacts such as /user1, /user2, etc.

Another important contributor to the high cost of managing storage systems is component failures. In most storage systems today, the failure of even a single component can make data inaccessible. Moreover, such failures can often bring an entire system to a halt until the failed component is repaired or replaced and the lost data restored. As computing environments become more distributed, the adverse effects of component failures become more widespread and frequent. Aside from the cost of maintaining the necessary staff and equipment, such failures could incur significant opportunity costs.

Products such as redundant disk arrays and logical volume managers often simplify the management of centralized storage systems by automatically balancing capacity and performance across disks, and by tolerating and automatically recovering from some component failures [4]. Most such products, however, do not support large-scale distributed environments. They cannot balance capacity or performance across multiple server nodes and cannot tolerate server, network, or site failures. They are effective in managing storage local to a given server, but once your system grows beyond the limits of a single server, you must face all the old management problems anew at a higher level. Moreover, because distributed systems, unlike centralized systems, can suffer from communication failures, the difficulty of the problems is significantly increased.

Of course, products to manage storage in distributed environments are also available. They collect information from throughout the system, summarize the information in an easy-to-understand format, and provide standard interfaces for configuring storage components. We are unaware of any, however, that provides the level of automation and integration offered by the best centralized storage management products. In this paper, we propose properties that are desirable for block-level distributed storage systems and ideas for implementing such storage systems. We describe the implementation of our first distributed storage prototype based on these ideas and conclude with a summary and directions for future work.

2 An Architecture for Scalable Storage

The desirable properties of a distributed storage architecture are straightforward. First, the system should be always available. Users should never be denied authorized access to data. Second, the system should be incrementally expandable, and both the capacity and throughput of the system should scale linearly as additional components are added. When components fail, the performance of the system should degrade only by the fraction of the failed components. Finally, even as the storage system's size increases, the overhead in managing the system should remain fixed.

Figure 1: Architecture for Scalable Storage. (Clients and storage servers attached to a scalable network; to the clients, the storage servers collectively appear as a single logical disk, /dev/netdisk.)

Figure 2: Round-Robin Data Striping. (Blocks D0-D15 of the logical disk are placed round-robin across Servers 0-3: D0, D4, D8, D12 on Server 0; D1, D5, D9, D13 on Server 1; and so on.)

Such a scalable storage architecture would allow us to build storage systems at any desired level of capacity or performance by simply putting together enough scalable storage components. Moreover, a large storage system would be no more difficult to manage than a small storage system. We could start with a small system and incrementally increase its capacity and performance as the needs of our system grew.

But how do we design such scalable storage systems? The key technological innovation that makes scalable storage systems feasible is commodity scalable (switch-based) networks and interconnects. Without scalable interconnects, the size of a computer system is limited to a single box, or at most a few boxes, each containing a few processing, storage and communication elements. Only so many CPUs, disks and network interfaces can be aggregated before the system interconnect, usually a bus, saturates. In contrast, scalable interconnects can support almost arbitrary levels of performance, allowing us to build systems with an arbitrary number of processing, storage and communication elements. Furthermore, we can group many elements together into specialized subsystems that provide superior performance but can be accessed and managed as if they were a single element.

Figure 1 illustrates these concepts when applied to the storage subsystem. Each scalable storage server consists of a processing element, some disks, and a network interface. In the ideal case, we plug the components into the scalable interconnect, turn them on, and the components auto-configure, auto-manage and communicate with each other to implement a single large, highly-available, high-performance storage system. In particular, the servers automatically balance capacity and performance across the entire storage system and use redundancy to automatically tolerate and recover from component failures. All the intelligence needed for implementing the storage system is contained within each individual scalable storage server; the storage servers do not require help from the clients or a centralized command center. To the clients, the storage servers collectively appear as a single large, highly-available, high-performance disk with multiple network interfaces. When an additional storage server is added, the storage system simply looks like a larger, higher-performance disk with more network connections.

To conclude, all systems, whether they are biological, social, or computational, are limited in size and effectiveness by the capabilities of their basic control and transportation infrastructure. The advent of commodity scalable networks and interconnects radically alters the basic assumptions we use for building computing systems. In particular, we believe that they will dramatically alter the way we build and use distributed storage systems.

3 Availability and Scalable Performance

The main technical challenges in designing scalable storage systems are availability and scalable performance. We have found that in large distributed systems, scalable performance is closely tied to good load balancing. Furthermore, because almost all schemes for providing high availability impose a performance penalty during the normal operation of the system, and also because failures should not cause disproportionate degradations in performance, the availability and performance issues are closely related. This section discusses the basic mechanisms for meeting these challenges: data striping and redundancy.

Figure 2 illustrates round-robin data striping. The solid rectangles represent blocks of storage on the specified storage server; the dotted rectangle emphasizes that the storage servers appear as a single logical disk to their clients. Each letter represents a block of data stored in the storage system. The figure shows that the first block of data, D0, is stored on storage server 0, the second block of data, D1, is stored on storage server 1, and so on. This effectively implements a RAID 0 [4], using storage servers instead of disks. The main problem with this arrangement is the availability of the system. Any single disk failure will result in data loss, and any single server failure will result in data unavailability.
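To make the round-robin placement of Figure 2 concrete, the following sketch maps a logical block number to a (server, local block) pair for a four-server configuration. The names, the constant NUM_SERVERS, and the use of C are our own illustrative choices; the paper does not specify the prototype's internal interfaces.

    /* Sketch of round-robin data striping over storage servers (cf. Figure 2).
     * Hypothetical names; the server count and block layout are illustrative. */
    #include <stdio.h>

    #define NUM_SERVERS 4

    struct location {
        int server;      /* which storage server holds the block         */
        int local_block; /* block index within that server's local disks */
    };

    /* Logical block i goes to server (i mod N), at local offset (i div N). */
    static struct location locate(int logical_block)
    {
        struct location loc;
        loc.server = logical_block % NUM_SERVERS;
        loc.local_block = logical_block / NUM_SERVERS;
        return loc;
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++) {
            struct location loc = locate(i);
            printf("D%-2d -> server %d, local block %d\n",
                   i, loc.server, loc.local_block);
        }
        return 0;   /* prints D0 -> server 0, D1 -> server 1, ..., D4 -> server 0, ... */
    }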

Figure 3: Mirrored-Striping. (Data blocks are striped round-robin across mirrored pairs of servers: Servers 0 and 1 each hold D0, D2, D4, D6; Servers 2 and 3 each hold D1, D3, D5, D7.)

Figure 4: Chained-Declustering. (Primary copies of D0-D7 are striped round-robin across Servers 0-3; each block's secondary copy is placed on the next server in the chain.)

In theory, you can implement a distributed RAID 5 using storage servers [2]. However, in practice, the complexity of the additional synchronization required for RAID 5 systems in comparison to a mirroring-based redundancy scheme is greatly amplified in a distributed environment. Also, RAID 5 has the undesirable property that the failure of a single storage server increases the average read load by a factor of two. This makes RAID 5 unacceptable for many performance-critical applications such as video-on-demand.

Figure 3 illustrates the mirrored-striping data placement scheme. Data is striped round-robin across a set of mirrored servers. This provides good reliability by protecting the system from both disk and server failures. However, should Server 0 fail, Server 1 must assume all of Server 0's load while servers 2 and 3 experience no load increase. Clearly, it would be better if all surviving servers experienced a 33% increase in the read load rather than having a single server experience a 100% increase in load.

Figure 4 illustrates the chained-declustered data placement scheme [7]. Note that Server 0's copy of D0 is on Server 1, Server 1's copy of D1 is on Server 2, and so on. Now, if Server 1 fails, servers 0 and 2 share Server 1's read load but Server 3 experiences no load increase. This is not perfect, but it is much better than having a single server bear all of Server 1's load. By performing dynamic load balancing, we can do much better. For example, since Server 3 has copies of some data from servers 0 and 2, servers 0 and 2 can offload some of their normal read load onto Server 3 and achieve uniform load balancing.

Although we have presented this example with four servers, the same type of offloading can be done with many more servers. Chaining the data placement allows each server to offload some of its read load to either the server immediately following or preceding the given server. By cascading the offloading across multiple servers, a uniform load can be maintained across all surviving servers. A disadvantage of this scheme is that it is less reliable than mirrored-striping. With mirrored-striping, if Server 1 failed, only the failure of Server 0 would result in data unavailability. With chained-declustering, if Server 1 fails, the failure of either Server 0 or Server 2 will result in data unavailability.

Figure 5 illustrates a variation on the chained-declustered data placement called multi-chained-declustering. Instead of having a single chain of stride one, multiple chains of varying strides are used. The system illustrated in Figure 5 uses chains of both stride one and stride two. If dynamic load balancing is not used, this scheme provides better load balancing than chained-declustering. For example, if Server 2 fails, servers 0, 1, 3, and 4 each experience a 25% increase in the read load, compared to 50% with chained-declustering. In large configurations, multi-chained-declustering has an additional benefit over chained-declustering. With chained-declustering, multiple failures can prevent uniform load balancing by breaking the chain in more than one place. This effect is most pronounced if the two failures occur close together in the chain. With multi-chained-declustering, this is not a problem since the chain of stride two can be used to skip over the failed servers. As expected, however, multi-chained-declustering is less reliable than chained-declustering.
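The placement rules behind Figures 4 and 5 can be written down directly. The sketch below is our reading of those figures rather than code from the prototype: each block's primary copy is placed round-robin, and its secondary copy is offset by the stride of the chain it belongs to (always one for chained-declustering; alternating one and two per stripe for the multi-chained variant of Figure 5, shown here with eight servers).

    /* Sketch of chained-declustered placement (Figures 4 and 5).
     * Our own illustrative reading of the figures, not the prototype's code. */
    #include <stdio.h>

    /* Chained-declustering: primary copy on server (b mod n). */
    static int primary_server(int block, int nservers)
    {
        return block % nservers;
    }

    /* Secondary copy on the next server in the chain (stride one). */
    static int chained_secondary(int block, int nservers)
    {
        return (block + 1) % nservers;
    }

    /* Multi-chained-declustering as in Figure 5: successive stripes use
     * chains of different strides (here stride 1, then stride 2). */
    static int multichained_secondary(int block, int nservers)
    {
        int stripe = block / nservers;          /* which row of blocks     */
        int stride = (stripe % 2 == 0) ? 1 : 2; /* alternate chain strides */
        return (primary_server(block, nservers) + stride) % nservers;
    }

    int main(void)
    {
        int n = 8;
        for (int b = 0; b < 16; b++)
            printf("D%-2d primary=%d chained-copy=%d multi-chained-copy=%d\n",
                   b, primary_server(b, n),
                   chained_secondary(b, n),
                   multichained_secondary(b, n));
        return 0;
    }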

Figure 5: Multi-Chained-Declustering. (Eight servers holding blocks D0-D15; the copies of D0-D7 are placed along a chain of stride one, while the copies of D8-D15 are placed along a chain of stride two.)

If, for example, Server 2 fails, the failure of any one of servers 0, 1, 3, and 4 results in data unavailability. As this example illustrates, in the absence of dynamic load balancing, a tradeoff exists between load balancing in the face of failures and the availability of the system.
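To make the 50% versus 25% comparison above concrete, the short calculation below assumes a simple read model: during normal operation, reads for a block are split evenly between its two copies; after a failure, all reads for an affected block go to the surviving copy, with no dynamic offloading. Under these assumptions, which are ours rather than the paper's, the calculation reproduces the figures quoted above: if Server 2 fails, its two chain neighbors each see a 50% read-load increase under chained-declustering, while servers 0, 1, 3, and 4 each see 25% under the two-stride layout of Figure 5.

    /* Read-load redistribution after a server failure, under a simple model:
     * reads normally split 50/50 between a block's two copies; after a failure,
     * all reads for an affected block go to the surviving copy.  Illustrative
     * assumptions only; the prototype also performs dynamic load balancing. */
    #include <stdio.h>

    #define NSERVERS 8
    #define NBLOCKS  16   /* two stripes of NSERVERS blocks, as in Figure 5 */

    static void report(const char *name, int multichained, int failed)
    {
        double normal[NSERVERS] = {0}, degraded[NSERVERS] = {0};

        for (int b = 0; b < NBLOCKS; b++) {
            int p = b % NSERVERS;                       /* primary copy      */
            int stripe = b / NSERVERS;
            int stride = multichained ? (stripe % 2 ? 2 : 1) : 1;
            int s = (p + stride) % NSERVERS;            /* secondary copy    */

            normal[p] += 0.5;                           /* reads split 50/50 */
            normal[s] += 0.5;

            if (p == failed)      degraded[s] += 1.0;   /* all reads to copy */
            else if (s == failed) degraded[p] += 1.0;
            else { degraded[p] += 0.5; degraded[s] += 0.5; }
        }

        printf("%s, server %d failed:\n", name, failed);
        for (int i = 0; i < NSERVERS; i++)
            if (i != failed)
                printf("  server %d: +%.0f%% read load\n",
                       i, 100.0 * (degraded[i] - normal[i]) / normal[i]);
    }

    int main(void)
    {
        report("chained-declustering", 0, 2);       /* neighbors 1 and 3: +50% each */
        report("multi-chained-declustering", 1, 2); /* servers 0, 1, 3, 4: +25% each */
        return 0;
    }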

Figure 6: Virtual Volumes. (File servers running AdvFS, NT FS, PC FS, and BSD FFS attach virtual volumes /dev/vol1 through /dev/vol5 over the scalable network.)

4 Virtual Volumes

The primary practical, in contrast to technical, problem in designing scalable storage systems is making them easy to manage. Designing systems to automatically tolerate and recover from component failures and to automatically balance load is an important part of the solution, but it is not sufficient. Although we have described how to build a single large block-level device with high availability and scalable performance, we have not provided any mechanism for sharing this device in heterogeneous distributed environments or for making consistent backups and restores of such large devices. For the scale of storage systems we are considering, the standard UNIX mechanism of statically partitioning a device into several fixed-size devices is inadequate. Our solution to the problem applies a well-known technique for sharing resources in computer systems; we virtualize the resource.

The primary storage management abstraction we support is the virtual volume. A virtual volume looks like a large block-level device but, unlike a real device, it can be created and destroyed as needed by clients of the scalable storage system. Each virtual volume gives each client access to the performance capabilities of the entire scalable storage system. Similar to the virtual address space abstraction supported by many operating systems, the creation of a virtual volume only allocates a range of virtual disk addresses and does not in itself allocate any physical storage. It is only when locations in the virtual volume are used that physical storage is allocated and mapped to an address in the virtual volume. Thus, a virtual volume need not have a definite size. This can be an advantage for advanced file systems that can dynamically grow in size. For some applications, it may be desirable to guarantee each virtual volume some minimum amount of physical storage. Likewise, it is probably desirable to place maximum size restrictions on volumes to prevent a single volume from using up all the physical storage. Figure 6 illustrates how virtual volumes might fit into heterogeneous distributed environments.

Virtual volumes are implemented by explicitly mapping each virtual volume identifier and offset to a corresponding physical disk address. The existence of this map between virtual and physical addresses allows us to play the same type of mapping games found in virtual memory systems. For example, by adding a facility for making virtual volumes copy-on-write, we can support an efficient mechanism for creating instantaneous snapshots of volumes.
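A minimal sketch of such a per-volume map, under our own assumptions about its shape (a table from virtual block number to physical block, populated on first write, with a trivial allocator), is shown below. A copy-on-write snapshot simply shares the parent volume's table until a block of the writable volume is next written; in a real implementation the old contents would also be copied for the new block, and the map itself would live on the storage servers. None of these names come from the prototype.

    /* Sketch of a virtual-volume block map with copy-on-write snapshots.
     * All names, sizes, and the allocation policy are illustrative assumptions;
     * the paper does not describe the prototype's actual data structures. */
    #include <stdio.h>

    #define VOL_BLOCKS 8            /* tiny volumes, for illustration */
    #define UNMAPPED   (-1)

    struct volume {
        int map[VOL_BLOCKS];        /* virtual block -> physical block, or UNMAPPED */
        int writable;               /* 0 for a read-only snapshot                   */
    };

    static int next_free = 0;       /* trivial physical-block allocator */

    static void vol_init(struct volume *v)
    {
        for (int i = 0; i < VOL_BLOCKS; i++)
            v->map[i] = UNMAPPED;   /* creation allocates no physical storage */
        v->writable = 1;
    }

    /* A snapshot shares the parent's mappings and is marked read-only. */
    static void vol_snapshot(const struct volume *parent, struct volume *snap)
    {
        *snap = *parent;
        snap->writable = 0;
    }

    /* Writing a virtual block maps it to a fresh physical block, so a snapshot
     * keeps pointing at the old one.  A real copy-on-write implementation would
     * also copy the old contents and reallocate only while a snapshot still
     * references the block. */
    static int vol_write(struct volume *v, int vblock)
    {
        if (!v->writable || vblock < 0 || vblock >= VOL_BLOCKS)
            return -1;
        v->map[vblock] = next_free++;
        return v->map[vblock];
    }

    int main(void)
    {
        struct volume vol, snap;
        vol_init(&vol);
        vol_write(&vol, 0);                 /* first use maps virtual block 0   */
        vol_snapshot(&vol, &snap);          /* instantaneous snapshot            */
        vol_write(&vol, 0);                 /* rewrite block 0 after the snapshot */
        printf("volume block 0 -> physical %d, snapshot block 0 -> physical %d\n",
               vol.map[0], snap.map[0]);    /* prints 1 and 0                    */
        return 0;
    }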

Such a snapshot facility is needed to make consistent file system dumps without making file systems unavailable for long periods of time. Periodic snapshots can also be used to recover accidentally deleted files and directories without resorting to tape backups. Finally, read-only snapshots of volumes can be mounted as read-only file systems on several file servers simultaneously, allowing commonly accessed files such as executables to be served from multiple machines without having to partition or replicate the files.

5 Our First Prototype

Over the past year, we have applied the ideas described in this paper to construct our first scalable storage prototype. The primary goal of the prototype is to study the high-availability and scalable-performance characteristics of the scalable storage system. Currently, the prototype survives any single component failure, including disk, server, and network failures. Appropriately partitioning the servers over two or more sites should allow the system to survive site failures, but this has not been tested. The prototype currently does not support virtual volumes. The prototype is implemented using Alpha/OSF workstations and the AN1 [9] (similar to switched FDDI) network. Due to the current lack of spare disks in our computing environment, we have thus far simulated the disks.

Our first prototype uses the chained-declustered data placement described previously. One nice property of this data placement is that, assuming an even number of servers, placing the even- and odd-numbered servers at different sites results in a system that can survive any single site failure. A system using the multi-chained-declustered data placement described in this paper can also be partitioned to survive site failures; however, this placement requires a minimum of three rather than two sites.

We use the fail-stop failure model in designing the prototype's redundancy mechanisms. That is, we assume that when a component fails, it ceases to operate. In particular, a failed component does not continue to operate incorrectly. A distributed voting mechanism based on timeouts ensures that all servers can agree on the failed or operational status of each other in the face of communication failures.

The software components of the scalable storage prototype can be broken into roughly two basic components: the server software that runs on each scalable storage server, and the client software that runs on each client of the scalable storage system. The client software looks just like a standard block-level device driver to its host operating system. From the clients' perspective, adding a scalable storage system is similar to adding a new disk controller. Once the scalable storage driver is installed, a client can access the scalable storage system just as if it were a locally attached disk.

A basic design principle in implementing the server and client software was to implement all the necessary functionality in the server software and to maintain only "hints" in the client software. This makes it unnecessary to implement complicated consistency protocols to constantly update information cached at clients, and ensures that a client cannot compromise the logical consistency of data stored in the scalable storage system. The client currently maintains only a small amount of high-level mapping information that is used to route read and write requests to the most appropriate server. If the request is sent to the wrong server, the server updates the information at the client and the client retries with the new information.

A primary example of this design philosophy is that data redundancy is implemented completely within the servers; the clients are completely unaware of the redundancy scheme that is used or, indeed, that any redundancy scheme is being used. This results in a particularly simple network interface that makes it easier to experiment with different redundancy schemes and makes it impossible for the clients to compromise the data redundancy mechanisms. This also means that the clients are not burdened with the additional work required for supporting data redundancy. For example, when a write request is generated by a client, the appropriate scalable storage server is responsible for generating any additional network requests needed for updating the corresponding replica.

Preliminary performance measurements of our untuned first prototype show that each server can service 1100 512-byte read requests per second and 400 512-byte write requests per second. For 8 KB requests, each server can service read requests at 4.8 MB/s and write requests at 1.6 MB/s. The reader may have noticed that the write performance for both small and 8 KB requests is approximately one-third the read performance. This is because for a read request, the storage system need only read the requested data from the (simulated) disk and send it out over the network, whereas for a write request, the storage system must receive the data and send it to the corresponding replica server, and the replica server must subsequently receive and process the data. Thus, a write request generates three times the network processing overhead generated by a read request. As expected, preliminary performance analysis of the system indicates that optimization of the networking code should result in significant performance improvements. The reader should bear in mind that the above describes the performance achieved by each server in the scalable storage system. We expect the performance of the system to scale linearly up to a hundred or more servers. The largest scalable storage system we have tested to date uses four servers.
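The hint-based request routing described earlier in this section can be sketched as a simple client-side retry loop. The message format, reply fields, and function names below are illustrative assumptions; the paper does not describe the prototype's actual wire protocol.

    /* Sketch of hint-based request routing on the client side.  The reply
     * format and all names are our own illustrative assumptions, not the
     * prototype's protocol. */
    #include <stdio.h>

    #define NSERVERS 4

    /* Authoritative placement, known only to the servers: block b lives on
     * server (b mod NSERVERS).  The client merely caches a guess (a "hint"). */
    static int authoritative_server(int block) { return block % NSERVERS; }

    struct reply {
        int ok;             /* was the request serviced here?                */
        int correct_server; /* if not, the server returns the corrected hint */
    };

    /* Server side: service the request or correct the client's hint. */
    static struct reply server_handle(int server, int block)
    {
        struct reply r;
        r.correct_server = authoritative_server(block);
        r.ok = (server == r.correct_server);
        return r;
    }

    /* Client side: send to the hinted server; on a miss, update the hint and
     * retry.  No consistency protocol is needed to keep hints up to date. */
    static int client_read(int *hint, int block)
    {
        for (;;) {
            struct reply r = server_handle(*hint, block);
            if (r.ok)
                return *hint;          /* request was serviced here   */
            *hint = r.correct_server;  /* stale hint: learn and retry */
        }
    }

    int main(void)
    {
        int hint = 0;                  /* client starts with a stale guess */
        int served_by = client_read(&hint, 6);
        printf("block 6 served by server %d (hint now %d)\n", served_by, hint);
        return 0;
    }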

6 Summary, Conclusions and Future Work

The ideal storage system is always available, incrementally expandable, scales in performance as new components are added, and requires very little management. When additional capacity or performance is needed, a new storage server is plugged into the network and turned on. The additional storage is then immediately and transparently accessible to all clients of the storage system. In particular, the capacity and load are automatically distributed uniformly across all storage servers in the system.

In this paper, we have presented ideas for building storage systems that come closer to this ideal. We've outlined some of the data striping and redundancy schemes that are suitable for distributed storage systems and briefly mentioned why RAID level 5 systems, while good for centralized storage systems, may not generalize well to distributed storage systems. We've also motivated the usefulness of virtual volumes in managing and allocating storage in distributed environments and some of the additional functionality, such as snapshots, that they can efficiently support. Finally, we have briefly described our first scalable storage prototype and tried to give an idea of the level of performance that can be expected.

Over the next year or two, we plan to finish our implementation of virtual volumes, include support for adding disks and servers on-line without making the storage system unavailable, build a larger prototype using real disks, and look for interesting applications for the system. One class of applications that currently appeals to us is providing a wide-area information service over the Internet.

In conclusion, we believe that large-scale distributed storage systems represent a very important opportunity for both research and product development. The availability, performance and manageability of existing storage systems are inadequate for building the large-scale information systems that are envisioned for the future. If we truly wish to build the massive video-on-demand, regional medical information systems, enterprise-wide information systems, and consumer-oriented information systems that the recent advances in telecommunications make possible, we must reexamine the basic building blocks we use in building large distributed systems. Irregular stones may be fine for building cottages, but we need modular bricks to build cathedrals. (Yes, it's not a very pretty analogy but I think that it makes a point.)

References

[1] Luis-Felipe Cabrera and Darrel D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. ACM Computing Systems, 4:405-436, Fall 1991.

[2] Pei Cao, Swee Boon Lim, Shivakumar Venkataraman, and John Wilkes. The TickerTAIP parallel RAID architecture. In Proc. International Symposium on Computer Architecture, pages 52-63, May 1993.

[3] Peter M. Chen, Edward K. Lee, Ann L. Drapeau, Ken Lutz, Ethan L. Miller, Srinivasan Seshan, Ken Shirriff, David A. Patterson, and Randy H. Katz. Performance and design evaluation of the RAID-II storage server. Journal of Distributed and Parallel Databases, 2(3):243-260, July 1994.

[4] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and David A. Patterson. RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2):145-185, June 1994.

[5] Joy Foglesong, George Richmond, Loellyn Cassel, Carole Hogan, John Kordas, and Michael Nemanic. The Livermore distributed storage: Implementation and experiences. Technical Report UCRL-102663, Lawrence Livermore National Laboratory, May 1990.

[6] John H. Hartman and John K. Ousterhout. Zebra: A striped network file system. In Proc. USENIX File Systems Workshop, pages 71-78, May 1992.

[7] Hui-I Hsiao and David J. DeWitt. Chained declustering: A new availability strategy for multiprocessor database machines. Technical Report CS TR 854, University of Wisconsin, Madison, June 1989.

[8] A. L. Narasimha Reddy and Prithviraj Banerjee. Gracefully degradable disk arrays. In Proc. International Symposium on Fault-Tolerant Computing, June 1991.

[9] Michael D. Schroeder, Andrew D. Birrell, Michael Burrows, Hal Murray, Roger M. Needham, Thomas L. Rodeheffer, Edwin H. Satterthwaite, and Charles P. Thacker. Autonet: A high-speed, self-configuring local area network using point-to-point links. Technical Report SRC-59, Digital Equipment Corporation, Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301-1044, April 1990.
