Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems
Andy D. Hospodor, Senior Member, IEEE ([email protected])
Ethan L. Miller ([email protected])
Storage Systems Research Center, University of California, Santa Cruz

This paper appeared in the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2004), College Park, MD, April 2004.

Abstract

As demand for storage bandwidth and capacity grows, designers have proposed the construction of petabyte-scale storage systems. Rather than relying upon a few very large storage arrays, these petabyte-scale systems have thousands of individual disks working together to provide aggregate storage system bandwidth exceeding 100 GB/s. However, providing this bandwidth to storage system clients becomes difficult due to limits in network technology. This paper discusses different interconnection topologies for large disk-based systems, drawing on previous experience from the parallel computing community. By choosing the right network, storage system designers can eliminate the need for expensive high-bandwidth communication links and provide a highly-redundant network resilient against single node failures. We analyze several different topology choices and explore the tradeoffs between cost and performance. Using simulations, we uncover potential pitfalls, such as the placement of connections between the storage system network and its clients, that may arise when designing such a large system.

1. Introduction

Modern high-performance computing systems require storage systems capable of storing petabytes of data (a petabyte, PB, is 2^50 bytes) and delivering that data to thousands of computing elements at aggregate speeds exceeding 100 GB/s. Just as high-performance computers have shifted from a few very powerful computation elements to networks of thousands of commodity-type computing elements, storage systems must make the transition from relatively few high-performance storage engines to thousands of networked commodity-type storage devices.

The first part of this shift is already occurring at the storage device level. Today, many storage subsystems utilize low-cost commodity storage in the form of 3.5" hard disk drives. The heads, media and electronics of these devices are often identical to storage used on desktop computers. The only remaining differentiator between desktop and server storage is the interface. At present, Fibre Channel and SCSI remain the choice of large, high-end storage systems, while the AT attachment (ATA) remains the choice of desktop storage. However, the introduction of the Serial ATA interface provides nearly equivalent performance and a greatly reduced cost to attach storage.

Most current designs for such petabyte-scale systems rely upon relatively large individual storage systems that must be connected by very high-speed networks in order to provide the required transfer bandwidths to each storage element. We have developed alternatives to this design technique using 1 Gb/s network speeds and small (4–12 port) switching elements to connect individual object-based storage devices, usually single disks. By including a small-scale switch on each drive, we develop a design that is more scalable and less expensive than using larger storage elements because we can use cheaper networks and switches. Moreover, our design is more resistant to failures—if a single switch or node fails, data can simply flow around it. Since failure of a single switch typically makes data from at least one storage element unavailable, maintaining less data per storage element makes the overall storage system more resilient.
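To get a feel for why modest link speeds can be sufficient, a back-of-the-envelope calculation helps. The sketch below is illustrative only: the 100 GB/s aggregate target and the 1 Gb/s links come from the text above, while the 80% usable-throughput factor is an assumption rather than a figure from the paper.

```python
# Back-of-the-envelope: how many commodity 1 Gb/s links are needed to
# carry a 100 GB/s aggregate bandwidth target.

TARGET_AGGREGATE = 100e9          # bytes/s, target from the text
LINK_RATE = 1e9 / 8               # 1 Gb/s expressed in bytes/s (125 MB/s)
LINK_EFFICIENCY = 0.8             # assumed usable fraction of the raw link rate

usable_per_link = LINK_RATE * LINK_EFFICIENCY
min_links = TARGET_AGGREGATE / usable_per_link

print(f"usable bandwidth per 1 Gb/s link: {usable_per_link / 1e6:.0f} MB/s")
print(f"minimum links to reach 100 GB/s:  {min_links:.0f}")
```

A system with thousands of such drive-and-switch nodes therefore has far more raw link capacity than this minimum; the harder problem, which the topologies discussed in this paper address, is arranging those links so that the aggregate bandwidth is actually reachable by clients.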
In this paper, we present alternative interconnection network designs for a petabyte-scale storage system built from individual nodes consisting of a disk drive and a 4–12 port gigabit network switch. We explore alternative network topologies, focusing on overall performance and resistance to individual switch failures. We are less concerned with disk failures—disks will fail regardless of interconnection network topology, and there is other research on redundancy schemes for massive-scale storage systems [16].

2. Background

There are many existing techniques that provide high-bandwidth file service, including RAID, storage area networks, and network-attached storage. However, these techniques cannot provide 100 GB/s on their own, and each has limitations that manifest in a petabyte-scale storage system. Rather, network topologies originally developed for massively parallel computers are better suited to construct massive storage systems.

2.1. Existing Storage Architectures

RAID (Redundant Array of Independent Disks) [2] protects a disk array against failure of an individual drive. However, the RAID system is limited to the aggregate performance of the underlying array. RAID arrays are typically limited to about 16 disks; larger arrays begin to suffer from reliability problems and issues of internal bandwidth. Systems such as Swift [9] have proposed the use of RAID on very large disk arrays by computing parity across subsections of disks, thus allowing the construction of larger arrays. However, such systems still suffer from a basic problem: connections between the disks in the array and to the outside world are limited by the speed of the interconnection network.
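As a reminder of how parity protection works at its simplest, the sketch below shows single-parity encoding over a stripe: one parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. This is generic RAID-style parity for illustration only, not Swift's particular scheme of computing parity across disk subsections.

```python
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """XOR equal-sized blocks together to form the parity block."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def rebuild(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Recover a single missing data block from the survivors and the parity."""
    return parity(surviving + [parity_block])

# A 4-block stripe: lose any one block and the XOR of the rest recovers it.
stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
p = parity(stripe)
assert rebuild(stripe[:2] + stripe[3:], p) == stripe[2]
```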
Storage area networks (SANs) aggregate many devices together at the block level. Storage systems are connected together via a network, typically Fibre Channel, through high-performance switches. This arrangement allows servers to share a pool of storage, and can enable the use of hundreds of disks in a single system. However, the primary use of SANs is to decouple servers from storage devices, not to provide high bandwidth. While SANs are appropriate for high I/O rate systems, they cannot provide high bandwidth without an appropriate interconnection network topology.

Network-attached storage (NAS) [6] is similar to SAN-based storage in that both designs have pools of storage connected to servers via networks. In NAS, however, individual devices present storage at the file level rather than the block level. This means that individual devices are responsible for managing their own data layout; in SANs, data layout is managed by the servers. While most existing network-attached storage is implemented in the form of CIFS- or NFS-style file systems, object-based storage [15] is fast becoming a good choice for large-scale storage systems [16]. Current object-based file systems such as Lustre [13] use relatively large storage nodes, each implemented as a standard network file server with dozens of disks. As a result, they must use relatively high-speed interconnections to provide the necessary aggregate bandwidth. In contrast, tightly coupling switches and individual disks can provide the same high bandwidth with much less expensive, lower-speed networks and switches.
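The block-level versus file/object-level distinction is easiest to see as an interface sketch. The two interfaces below are hypothetical and purely illustrative (they are not Lustre's API or any standardized object-storage command set); the point is simply who owns the data-layout decision in each model.

```python
from typing import Protocol

class BlockDevice(Protocol):
    """SAN-style block access: the client addresses raw blocks by logical
    block address, so the server decides where each file's data is laid out."""
    def read_blocks(self, lba: int, count: int) -> bytes: ...
    def write_blocks(self, lba: int, data: bytes) -> None: ...

class ObjectStore(Protocol):
    """Object-based access: the client names an object and a byte range,
    and the device manages its own on-disk layout internally."""
    def read(self, object_id: int, offset: int, length: int) -> bytes: ...
    def write(self, object_id: int, offset: int, data: bytes) -> None: ...
```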
2.2. Parallel Processing

Interconnection networks for computing elements have long been the subject of parallel computing research. No clear winner has emerged; rather, there are many different interconnection topologies, each with its own advantages and disadvantages, as discussed in Section 3. Traditional multiprocessors such as the Intel Touchstone Delta [14] and the CM-5 [10] segregated I/O nodes from computing nodes, typically placing I/O nodes at the edge of the parallel computer. File systems developed for these configurations were capable of high performance [3, 12] using a relatively small number of RAID-based arrays, eliminating the need for more complex interconnection networks in the storage system.

There have been a few systems that suggested embedding storage in a multiprocessor network. In RAMA [11], every multiprocessor compute node had its own disk and switch. RAMA, however, did not consider storage systems on the scale necessary for today's systems, and did not consider the wide range of topologies discussed in this paper.

Fortunately, storage systems place different, and somewhat less stringent, demands on the interconnection network than parallel processors. Computing nodes typically communicate using small, low-latency messages, but storage access involves large transfers and relatively high latency. As a result, parallel computers require custom network hardware, while storage interconnection networks, because of their tolerance for higher latency, can exploit commodity technologies such as gigabit Ethernet.

2.3. Issues with Large Storage Scaling

A petabyte-scale storage system must meet many demands: it must provide high bandwidth at reasonable latency, it must be both continuously available and reliable, it must not lose data, and its performance must scale as its capacity increases. Existing large-scale storage systems have some of these characteristics, but not all of them. For example, most existing storage systems are scaled by replacing the entire storage system in a "forklift upgrade." This approach is unacceptable in a system containing petabytes of data because the system is simply too large. While there are techniques for dynamically adding storage capacity to an existing system [7], the inability of such systems to scale in performance remains an issue.

Figure 1. Disk throughput as a function of transfer size (10000 RPM Fibre Channel vs. 7200 RPM ATA; request size in KB, throughput in MB/s).

(a) Disks connected to a single server. This configuration is susceptible to loss of availability if a single switch or server fails.

One petabyte of storage capacity requires about 4096 (2^12) storage devices of 250 GB each; disks with this capacity are appearing in the consumer desktop market in early 2004. A typical high-performance disk, the Seagate Cheetah 10K, has a 200 MB/s Fibre Channel interface and spins at 10000 RPM with a 5.3 ms average seek time and a sustained transfer rate of 44 MB/s. In a typical transaction processing environment, the Cheetah would service a 4 KB request in about 8.3 ms, for a maximum of 120 I/Os per second, or 0.5 MB/s from each drive.
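The per-request numbers above, and the shape of the curves in Figure 1, follow from a simple service-time model in which each random request pays an average seek, half a rotation, and a size-proportional transfer. The sketch below uses the Cheetah parameters quoted in the text; the half-rotation latency and the model itself are standard approximations rather than values given in the paper, and the results land within rounding of the 8.3 ms, 120 I/Os per second, and 0.5 MB/s figures above.

```python
# Simple disk service-time model: seek + half rotation + transfer.
# Parameters for the Seagate Cheetah 10K as quoted in the text above.

AVG_SEEK_S = 5.3e-3            # 5.3 ms average seek
RPM = 10000
SUSTAINED_RATE = 44e6          # 44 MB/s sustained transfer rate

ROTATIONAL_LATENCY_S = 0.5 * 60.0 / RPM   # half a rotation on average (3 ms)

def service_time(request_bytes: float) -> float:
    """Approximate time to service one random request of the given size."""
    return AVG_SEEK_S + ROTATIONAL_LATENCY_S + request_bytes / SUSTAINED_RATE

def throughput(request_bytes: float) -> float:
    """Delivered bandwidth (bytes/s) for back-to-back random requests."""
    return request_bytes / service_time(request_bytes)

# Transaction-processing case from the text: 4 KB random requests.
t = service_time(4096)
print(f"4 KB request: {t * 1e3:.1f} ms -> {1 / t:.0f} IO/s, "
      f"{throughput(4096) / 1e6:.2f} MB/s")

# Larger transfers approach the 44 MB/s sustained rate, as in Figure 1.
for size in (64e3, 1e6, 10e6):
    print(f"{size / 1e3:>6.0f} KB request: {throughput(size) / 1e6:5.1f} MB/s")
```

For small requests the fixed positioning cost dominates, collapsing per-drive bandwidth to roughly 0.5 MB/s, while multi-megabyte requests approach the 44 MB/s sustained rate, matching the trend Figure 1 shows for the 10000 RPM Fibre Channel drive.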