CHAPTER 24 Storage Subsystems

Up to this point, the discussions in Part III of this book have been on the disk drive as an individual storage device and how it is directly connected to a host system. This direct attach storage (DAS) paradigm dates back to the early days of mainframe computing, when disk drives were located close to the CPU and cabled directly to the computer system via some control circuits. This simple model of disk drive usage and configuration remained unchanged through the introduction of, first, the minicomputers and then the personal computers. Indeed, even today the majority of disk drives shipped in the industry are targeted for systems having such a configuration.

However, this simplistic view of the relationship between the disk drive and the host system does not tell the whole story for today's higher end computing environment. Sometime around the 1990s, computing evolved from being computation-centric to storage-centric. The motto is: "He who holds the data holds the answer." Global commerce, fueled by explosive growth in the use of the Internet, demands 24/7 access to an ever-increasing amount of data. The decentralization of departmental computing into networked individual workstations requires efficient sharing of data. Storage subsystems, basically a collection of disk drives, and perhaps some other storage devices such as tape and optical disks, that can be managed together, evolved out of necessity. The management software can either be run in the host computer itself or may reside in a dedicated processing unit serving as the storage controller. In this chapter, welcome to the alphabet soup world of storage subsystems, with acronyms like DAS, NAS, SAN, iSCSI, JBOD, RAID, MAID, etc.

There are two orthogonal aspects of storage subsystems to be discussed here. One aspect has to do with how multiple drives within a subsystem can be organized together, cooperatively, for better reliability and performance. This is discussed in Sections 24.1–24.3. A second aspect deals with how a storage subsystem is connected to its clients and accessed. Some form of networking is usually involved. This is discussed in Sections 24.4–24.6. A storage subsystem can be designed to have any organization and use any of the connection methods discussed in this chapter. Organization details are usually made transparent to user applications by the storage subsystem presenting one or more virtual disk images, which logically look like disk drives to the users. This is easy to do because logically a disk is no more than a drive ID and a logical address space associated with it. The storage subsystem software understands what organization is being used and knows how to map the logical addresses of the virtual disk to the addresses of the underlying physical devices. This concept is described as virtualization in the storage community.
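The mapping idea can be made concrete with a small sketch (not from the text; the class and method names are hypothetical). A virtual disk is little more than an ID, a logical block count, and a mapping function that the subsystem software implements differently for each organization discussed later in this chapter:

    from abc import ABC, abstractmethod

    class VirtualDisk(ABC):
        """What the host sees: a drive ID plus a logical address space."""
        def __init__(self, drive_id: str, num_blocks: int):
            self.drive_id = drive_id
            self.num_blocks = num_blocks

        @abstractmethod
        def map_block(self, lba: int) -> list:
            """Return the physical (disk_index, physical_block) locations that
            back logical block address `lba`; each organization overrides this."""

    class PassThroughDisk(VirtualDisk):
        """JBOD-style virtualization: one virtual disk maps 1:1 to one drive."""
        def __init__(self, drive_id: str, num_blocks: int, physical_index: int):
            super().__init__(drive_id, num_blocks)
            self.physical_index = physical_index

        def map_block(self, lba: int) -> list:
            return [(self.physical_index, lba)]

    print(PassThroughDisk("vd0", 1000, 3).map_block(42))   # [(3, 42)]

A striping, mirroring, or RAID organization would simply provide its own map_block() while presenting the same virtual disk image to the host.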
24.1 Data Striping

When a set of disk drives are co-located, such as all being mounted in the same rack, solely for the purpose of sharing physical resources such as power and cooling, there is no logical relationship between the drives. Each drive retains its own identity, and to the user it exhibits the same behavioral characteristics as those discussed in previous chapters. A term has been coined to describe this type of storage subsystem: JBOD, for just-a-bunch-of-disks. There is nothing more to be said about JBOD in the remainder of this chapter.

The simplest organizational relationship that can be established for a set of drives is that of data striping.

With data striping, a set of K drives are ganged together to form a data striping group or data striping array, and K is referred to as the stripe width. The data striping group is a logical entity whose logical address space is sectioned into fixed-size blocks called stripe units. The size of a stripe unit is called the stripe size and is usually specifiable in most storage subsystems by the administrator setting up the stripe array. These stripe units are assigned to the drives in the striping array in a round-robin fashion. This way, if a user's file is larger than the stripe size, it will be broken up and stored in multiple drives. Figure 24.1 illustrates how four user files of different sizes are stored in a striping array of width 3. File e takes up four stripe units and spans all three disks of the array, with Disk 1 holding two units, while Disks 2 and 3 hold one unit each. File f continues with seven stripe units and also spans all three disks, with multiple stripe units in each disk. File g is a small file requiring only one stripe unit and is all contained in one disk. File h is a medium size file and spans two of the three disks in the array.

Disk 1 Disk 2 Disk 3
e1 e2 e3
e4 f1 f2
f3 f4 f5
f6 f7 g1
h1 h2

FIGURE 24.1: An example of a data striping array with a stripe width = 3. Four user files e, f, g, and h of different sizes are shown.

The purpose of data striping is to improve performance. The initial intent for introducing striping was so that data could be transferred in parallel to/from multiple drives, thus cutting down data transfer time. Clearly, this makes sense only if the amount of data transferred is large. To access the drives in parallel, all drives involved must perform a seek operation and take a rotational latency overhead. Assuming the cylinder positions of all the arms are roughly in sync, the seek times for all drives are about the same. However, since most arrays do not synchronize the rotation of their disks,¹ the average rotational latency for the last drive out of K drives to be ready is R × K/(K + 1), where R is the time for one disk revolution. This is higher than the latency of R/2 for a single drive. Let FS be the number of sectors of a file. The I/O completion time without data striping is²

seek time + R/2 + FS × R/SPT    (EQ 24.1)

where SPT is the number of sectors per track. When data striping is used, the I/O time is

seek time + R × K/(K + 1) + FS × R/(K × SPT)    (EQ 24.2)

Assuming the seek times are the same in both cases, data striping is faster than non-striping when

R/2 + f × R/SPT > R × K/(K + 1) + f × R/(K × SPT)    (EQ 24.3)

which happens when

f > K × SPT / (2 × (K + 1))    (EQ 24.4)

Thus, the stripe size should be chosen to be at least K × SPT/(2(K + 1)) sectors in size³ so that smaller files do not end up being striped.
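As a rough illustration of these two points, the sketch below (not from the text; the drive parameters are invented) maps stripe units to drives round-robin and evaluates Equations 24.1, 24.2, and 24.4 for an assumed drive:

    # Sketch: round-robin stripe-unit placement plus the break-even size of EQ 24.4.
    def stripe_location(unit_index: int, stripe_width: int):
        """Map the n-th stripe unit of the logical volume to (disk, local unit)."""
        return unit_index % stripe_width, unit_index // stripe_width

    def io_time_non_striped(seek, R, FS, SPT):
        return seek + R / 2 + FS * R / SPT                  # EQ 24.1

    def io_time_striped(seek, R, FS, SPT, K):
        return seek + R * K / (K + 1) + FS * R / (K * SPT)  # EQ 24.2

    def break_even_sectors(K, SPT):
        return K * SPT / (2 * (K + 1))                      # EQ 24.4

    # Assumed parameters: 7200 RPM (R = 8.33 ms), 5 ms average seek,
    # 500 sectors per track, stripe width K = 4.
    R, seek, SPT, K = 8.33, 5.0, 500, 4
    for FS in (64, 1024, 16384):                            # request sizes in sectors
        print(FS,
              round(io_time_non_striped(seek, R, FS, SPT), 2),
              round(io_time_striped(seek, R, FS, SPT, K), 2))
    print("break-even size:", break_even_sectors(K, SPT), "sectors")

With these assumed parameters the break-even request size works out to 200 sectors: requests smaller than that finish sooner on a single drive, which is why very small files should not be spread across the stripe.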

¹ Actually, synchronizing the rotation of the disks would not necessarily help either, since the first stripe unit of a file in each of the drives may not be the same stripe unit in all the drives. As File f in Figure 24.1 illustrates, f1 and f2 are the second stripe unit in Disks 2 and 3, but f3 is the third stripe unit in Disk 1.
² Track skew is ignored in this simple analysis.
³ Later, in Section 24.3.2, there is an opposing argument for using a smaller stripe size.

That, however, is only part of the story. With striping, all the drives are tied up servicing a single command. Furthermore, the seek and rotational latency overhead of one command must be paid by every one of the drives involved. In other words, the overhead is paid K times. On the other hand, if striping is not used, then each drive can be servicing a different command. The seek and latency overhead of each command are paid for by only one disk, i.e., one time only. Thus, no matter what the stripe size and the request size are, data striping will never have a better total throughput for the storage subsystem as a whole when compared to non-striping. In the simplest case where all commands are of the same size f, the throughput for data striping is inversely proportional to Equation 24.2, while that for non-striping is K times the inverse of Equation 24.1. Only when both seek time and rotational latency are zero, as in sequential access, can the two be equal. Thus, as long as there are multiple I/Os that can keep individual disks busy, parallel transfer offers no throughput advantage. If the host system has only a single stream of long sequential accesses as its workload, then data striping would be a good solution for providing a faster response time.

This brings up another point. As just discussed, parallel data transfer, which was the original intent of data striping, is not as effective as it sounds for improving performance in the general case. However, data striping is effective in improving performance in general because of a completely different dynamic: its tendency to evenly distribute I/O requests to all the drives in a subsystem. Without data striping, a logical volume will be mapped to a physical drive. The files for each user of a storage subsystem would then likely end up being all placed in one disk drive. Since not all users are active all the time, this results in what is known as the 80/20 access rule of storage, where at any moment in time 80% of all I/Os are for 20% of the disk drives, which means the remaining 80% of the disk drives receive only 20% of the I/Os. In other words, a small fraction of the drives in the subsystem is heavily utilized, while most other drives are lightly used. Users of the heavily utilized drives, who are the majority of active users at the time, would experience slow response times due to long queueing delays,⁴ even though many other drives are sitting idle. With data striping, due to the way the logical address space is spread among the drives, each user's files are likely to be more or less evenly distributed to all the drives instead of all concentrated in one drive. As a result, the workload to a subsystem at any moment in time is likely to be roughly divided evenly among all the drives. It is this elimination of hot drives in a storage subsystem that makes data striping a useful strategy for performance improvement when supporting multiple users.

24.2 Data Mirroring

The technique of using data striping as an organization is purely for improving performance. It does not do anything to help a storage subsystem's reliability. Yet, in many applications and computing environments, high data availability and integrity are very important. The oldest method for providing data reliability is by means of replication. Bell Laboratories was one of the first, if not the first, to use this technique in the electronic switching systems (ESS) that they developed for telephony. Data about customers' equipment and routing information for phone lines must be available 24/7 for the telephone network to operate. "Dual copy" was the initial terminology used, but later the term "data mirroring" became popularized in the open literature. This is unfortunate because mirroring implies left-to-right reversal, but there definitely is no reversal of bit ordering when a second copy of data is made. Nevertheless, the term is now universally adopted.

Data mirroring provides reliability by maintaining two copies⁵ of data [Ng 1986, 1987]. To protect against disk drive failures, the two copies are kept on different disk drives.⁶ The two drives appear as one logical drive with one logical address space to the user. When one drive in a mirrored subsystem fails,

⁴ The queueing delay is inversely proportional to (1 − utilization). Thus, a drive with 90% utilization will have 5 times the queueing delay of a drive with 50% utilization.
⁵ For very critical data, more than two copies may be used.
⁶ A duplicate copy of certain critical data, such as the boot record and file system tables, may also be maintained in the same drive to guard against non-recoverable errors. This is independent of mirroring.

its data is available from another drive. At this point the subsystem is vulnerable to data loss should the second drive fail. To restore the subsystem back to a fault-tolerant state, the remaining copy of data needs to be copied to a replacement disk.

In addition to providing much improved data reliability, data mirroring can potentially also improve performance. When a read command requests a piece of data that is mirrored, there is a choice of two places from which this data can be retrieved. This presents an opportunity for some performance improvement. Several different strategies can be applied:

• Send the command to the drive that has the shorter seek distance. For random access, the average seek distance drops from 1/3 of full stroke to 5/24 (for no zone bit recording). This assumes that both drives are available for handling a command. Thus, its disadvantage is that it ties up both drives in servicing a single command, which is not good for throughput.
• Send the read command to both drives and take the data from the first one to complete. This is more effective than the previous approach in cutting the response time for the command, as it minimizes the sum of seek time and rotational latency. It also has the same disadvantage of tying up both drives in servicing a single command.
• Assign read commands to the two drives in a ping-pong fashion so that each drive services half of the commands. Load sharing by two drives improves throughput.
• Let one drive handle commands for the first half of the address space, and the other drive handle commands for the second half of the address space. This permits load sharing, while at the same time reducing the average seek distance to one-sixth of full stroke. However, if the workload is such that the address distribution is not even, one drive may become more heavily utilized than the other. Also, a drive may become idle while there is still a queue of commands for the other drive.
• A good strategy is to maintain a single queue of commands for both mirrored disks. Each drive that becomes available fetches another command from the shared queue. In this way, both drives get fully utilized. If RPO could be applied in selecting the next command to fetch, that would be the best strategy; a sketch of such a shared queue follows this list.
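The shared-queue policy from the last bullet can be sketched as follows (a hypothetical illustration, not the book's implementation; only read dispatching is shown, and a real controller would also fold RPO into the choice of the next command):

    import collections

    class MirroredPair:
        """Shared-queue read scheduling for a mirrored pair (sketch only).
        Writes are not modeled here; they would be issued to both drives."""

        def __init__(self, issue_to_drive):
            self.queue = collections.deque()
            self.busy = [False, False]
            self.issue_to_drive = issue_to_drive    # callback: (drive, cmd) -> None

        def submit(self, cmd):
            self.queue.append(cmd)
            self._dispatch()

        def _dispatch(self):
            # Whichever drive is free fetches the next command from the one queue,
            # so neither drive sits idle while commands are waiting.
            for drive in (0, 1):
                if not self.busy[drive] and self.queue:
                    self.busy[drive] = True
                    self.issue_to_drive(drive, self.queue.popleft())

        def complete(self, drive):
            # Called when a drive finishes its command; it then grabs another.
            self.busy[drive] = False
            self._dispatch()

    pair = MirroredPair(lambda drive, cmd: print("drive", drive, "starts", cmd))
    for lba in (10, 20, 30):
        pair.submit(("read", lba))
    pair.complete(0)        # drive 0 finishes and immediately fetches ("read", 30)

Because neither drive is ever idle while commands are waiting, the shared queue avoids the imbalance of the fixed address-split strategy.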
Mirrored disks have a write penalty in the sense that both drives containing the mirrored data need to be updated. One might argue that this is not really a penalty since there is no increase in workload on a per physical drive basis. If write caching is not on, a write is not complete until both drives have done the write. If both drives start the write command at the same time, the mechanical delay time is the larger of the sum of seek and latency of the two drives. If the drives do not start at the same time, then the delay for the write completion is even longer.

There are different ways that data can be mirrored in a storage subsystem [Thomasian & Blaum 2006]. Three of them are described in the following, followed by discussions on their performance and reliability.

24.2.1 Basic Mirroring

The simplest and most common form of mirroring is that of pairing off two disks so that they both contain exactly the same data image. If M is the number of disk drives in a storage subsystem, then there would be M/2 pairs of mirrored disks; M must be an even number. Figure 24.2 illustrates a storage subsystem with six drives organized into three sets of such mirrored disks. In this figure and in Figures 24.3 and 24.4, each letter represents logically one-half the content, or logical address space, of a physical drive.

24.2.2 Chained Decluster Mirroring

In this organization, for any one of the M drives in the subsystem, half of its content is replicated on a second disk, while the other half is replicated on a third disk [Hsiao & DeWitt 1993]. This is illustrated in Figure 24.3 for M = 6. In this illustration, half of the replication resides in a drive's immediate neighbors on both sides. Note that for this configuration, M does not have to be an even number, which makes this approach more flexible than the basic mirroring method.

Disk 1 Disk 2 Disk 3 Disk 4 Disk 5 Disk 6
A A C C E E

B B D D F F

FIGURE 24.2: Basic mirroring with M = 6 drives.

Disk 1 Disk 2 Disk 3 Disk 4 Disk 5 Disk 6
A B C D E F

F A B C D E

FIGURE 24.3: Chained decluster mirroring with M = 6 drives.

24.2.3 Interleaved Decluster Mirroring

In this third organization, for any one of the M drives in the subsystem, half of its content is divided evenly into (M − 1) partitions. Each partition is replicated on a different drive. The other half of the content of a drive consists of one partition from each of the other (M − 1) drives. Figure 24.4 illustrates such a subsystem with M = 6 drives. M also does not have to be an even number for this mirroring organization.

24.2.4 Mirroring Performance Comparison

When looking into the performance of a fault-tolerant storage subsystem, there are three operation modes to be considered. When all the drives are working, it is called the normal mode. Degraded mode is when a drive has failed and the subsystem has to make do with the remaining drives to continue servicing user requests. The time during which a replacement disk is being repopulated with data is called the rebuild mode.

Normal Mode

One factor that affects the performance of a subsystem is whether data striping is being used. Data striping can be applied on top of mirroring in the obvious way. Figure 24.5 illustrates applying the striping example of Figure 24.1 to basic mirroring. As discussed earlier, one benefit of data striping is that it tends to distribute the workload for a subsystem evenly among the drives. So, if data striping is used, then during normal mode all mirroring organizations have similar performance, as there is no difference in the I/O load among the drives. Any write request must go to two drives, and any read request can be handled by one of two drives.

When data striping is not used, hot data can make a small percentage of mirrored pairs very busy in the basic mirroring organization. For the example illustrated in Figure 24.2, high I/O activities for data in A and B will result in more requests going to Disks 1 and 2. Users of this pair of mirrored drives will experience a longer delay. With chained decluster mirroring, it can be seen in Figure 24.3 that activities for A and B can be handled by three drives instead of two, namely Disks 1, 2, and 3. This is an improvement over basic mirroring. Finally, with interleaved decluster mirroring, things are better still, as all drives may be involved in handling read requests for A and B, even though Disks 1 and 2 must handle more writes than the other drives.

Degraded Mode

During normal mode, a read request for any piece of data can be serviced by one of two possible disks,

Disk 1 Disk 2 Disk 3 Disk 4 Disk 5 Disk 6

A1 B1 C1 D1 E1 F1

A2 B2 C2 D2 E2 F2

A3 B3 C3 D3 E3 F3

A4 B4 C4 D4 E4 F4

A5 B5 C5 D5 E5 F5

F1 A1 B1 C1 D1 E1

E2 F2 A2 B2 C2 D2

D3 E3 F3 A3 B3 C3

C4 D4 E4 F4 A4 B4

B5 C5 D5 E5 F5 A5

FIGURE 24.4: Interleaved decluster mirroring with M = 6 drives.
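The three placements of Figures 24.2–24.4 can be generated programmatically; the following sketch (not from the text; function names are invented) builds each layout for illustration:

    def basic_mirroring(M):
        """Figure 24.2: drives 2i and 2i+1 hold identical copies (M must be even)."""
        layout = []
        for pair in range(M // 2):
            halves = [chr(ord("A") + 2 * pair), chr(ord("A") + 2 * pair + 1)]
            layout += [list(halves), list(halves)]
        return layout

    def chained_decluster(M):
        """Figure 24.3: drive i holds its own half plus a copy of drive i-1's half."""
        halves = [chr(ord("A") + i) for i in range(M)]
        return [[halves[i], halves[(i - 1) % M]] for i in range(M)]

    def interleaved_decluster(M):
        """Figure 24.4: drive i holds partitions 1..M-1 of its own half plus one
        partition copied from each of the other M-1 drives."""
        own = [[chr(ord("A") + i) + str(p) for p in range(1, M)] for i in range(M)]
        copies = [[own[(i - p) % M][p - 1] for p in range(1, M)] for i in range(M)]
        return [own[i] + copies[i] for i in range(M)]

    for row in zip(*interleaved_decluster(6)):    # prints the rows of Figure 24.4
        print(" ".join(row))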

Disk 1 Disk 2 Disk 3 Disk 4 Disk 5 Disk 6

e1 e1 e2 e2 e3 e3

e4 e4 f1 f1 f2 f2

f3 f3 f4 f4 f5 f5

f6 f6 f7 f7 g1 g1

h1 h1 h2 h2

FIGURE 24.5: An example of applying data striping to basic mirroring. Using the example of Figure 24.1, four user files e, f, g, and h are shown.

regardless of which mirroring organization is used. When one drive fails, in basic mirroring all read requests to the failed drive must now be handled by its mate, doubling its read load. For chained decluster mirroring, the read workload of the failed drive is borne by two drives. For the example of Figure 24.3, if Disk 3 fails, then read requests for data B and C will have to be handled by Disks 2 and 4, respectively. On the surface, the workload for these two drives would seem to be increased by 50%. However, the subsystem controller can divert much of the read workload for D to Disk 5 and much of the workload for

A to Disk 1. This action can ripple out to all the remaining good drives in the chained decluster array until every drive receives an equal workload. This can happen only if the array is subject to a balanced workload distribution to begin with, as in data striping. Finally, with interleaved decluster mirroring, reads for the failed drive are naturally shared by all remaining (M − 1) good drives, giving it the best degraded mode performance.

Rebuild Mode

A situation similar to degraded mode exists during rebuild. In addition to handling normal user commands, the mirrored data must be read in order for it to be rewritten onto the replacement drive. In basic mirroring, all data must come from the failed drive's mate, adding even more workload to its already doubled read workload. With chained decluster mirroring, half of the mirrored data comes from one drive, while the other half comes from another drive. For interleaved decluster mirroring, each of the (M − 1) remaining good drives is responsible for doing the rebuild read of 1/(M − 1)th of the mirrored data.

In summary, of the three different organizations, basic mirroring provides the worst performance, especially during degraded and rebuild modes. Interleaved decluster mirroring has the most balanced and graceful degradation in performance when a drive has failed, with chained decluster mirroring performing between the other two organizations.

24.2.5 Mirroring Reliability Comparison

For a user, a storage subsystem has failed if it loses any of his data. Any mirroring organization can tolerate a single drive failure, since another copy of the failed drive's data is available elsewhere. Repair entails replacing the failed drive with a good drive and copying the mirrored data from the other good drive, or drives, to the replacement drive. When a subsystem has M drives, there are M ways to have the first disk failure. This is where the similarity ends. Whether a second drive failure before the repair is completed will cause data loss depends on the organization and which drive is the second one to fail out of the remaining (M − 1) drives [Thomasian & Blaum 2006].

Basic Mirroring

With this organization, the data of a drive is replicated entirely on a second drive. Thus, only if this second drive fails will the subsystem suffer a data loss. In other words, there is only one failure out of (M − 1) possible second failures that will cause data loss. The mean time to data loss (MTTDL) for an M-disk basic mirroring subsystem is approximately

MTTDL ≈ MTTF² / (M × MTTR)    (EQ 24.5)

where MTTF is the mean time to failure for a disk, and MTTR is the mean time to repair (replace the failed drive and copy data). This assumes that both failure and repair are exponentially distributed.

Chained Decluster Mirroring

With this organization, the data of a drive is replicated onto two other drives. Thus, if either of these two drives also fails, the subsystem will suffer data loss, even though half, and not all, of the data of a drive is going to be lost. In other words, there are two possible second failures out of (M − 1) possible failures that will cause data loss. The MTTDL for an M-disk chained decluster mirroring subsystem is approximately

MTTDL ≈ MTTF² / (2M × MTTR)    (EQ 24.6)

which is one-half of that for basic mirroring.

Interleaved Decluster Mirroring

With this organization, the data of a drive is replicated onto all the other drives in the subsystem. Thus, if any one of these (M − 1) drives fails, the subsystem will suffer data loss, even though only 1/(M − 1)th, and not all, of the data of a drive is going to be lost. The MTTDL for an M-disk interleaved decluster mirroring subsystem is approximately

MTTDL ≈ MTTF² / (M × (M − 1) × MTTR)    (EQ 24.7)

which is (M − 1) times worse than that for basic mirroring.
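A small calculator for Equations 24.5–24.7 makes the comparison concrete (a sketch, not from the text; the MTTF and MTTR values below are invented, and the approximations assume exponentially distributed failures and repairs, as stated above):

    def mttdl_basic(M, mttf, mttr):
        return mttf ** 2 / (M * mttr)               # EQ 24.5

    def mttdl_chained(M, mttf, mttr):
        return mttf ** 2 / (2 * M * mttr)           # EQ 24.6

    def mttdl_interleaved(M, mttf, mttr):
        return mttf ** 2 / (M * (M - 1) * mttr)     # EQ 24.7

    M, mttf, mttr = 6, 500_000, 24                  # hours; hypothetical values
    for name, fn in [("basic", mttdl_basic),
                     ("chained decluster", mttdl_chained),
                     ("interleaved decluster", mttdl_interleaved)]:
        print(f"{name:22s} MTTDL = {fn(M, mttf, mttr):.3e} hours")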

In summary, interleaved decluster mirroring has the lowest reliability of the three organizations, while basic mirroring has the highest; chained decluster mirroring is in between these two. This ordering is exactly the reverse of that for performance. Thus, which mirroring organization to use in a subsystem is a performance versus reliability trade-off.

24.3 RAID

While data replication is an effective and simple means for providing high reliability, it is also an expensive solution because the number of disks required is doubled. A different approach that applies the technique of error correcting coding (ECC), where a small addition in redundancy provides fault protection for a larger amount of information, would be less costly. When one drive in a set of drives fails, which drive has failed is readily known. In ECC theory parlance, an error whose location is known is called an erasure. Erasure codes are simpler than ECCs for errors whose locations are not known. The simplest erasure code is that of adding a single parity bit. The parity P for information bits or data bits, say, A, B, C, D, and E, is simply the binary exclusive-OR of those bits:

P = A ⊕ B ⊕ C ⊕ D ⊕ E    (EQ 24.8)

A missing information bit, say, B, can be recovered by XORing the parity bit with the remaining good information bits, since Equation 24.8 can be rearranged to

B = P ⊕ A ⊕ C ⊕ D ⊕ E    (EQ 24.9)
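In practice the XOR is applied byte-wise across whole blocks rather than to single bits. The following sketch (not from the text; the block contents are made up) shows Equations 24.8 and 24.9 at block granularity:

    from functools import reduce

    def xor_blocks(*blocks: bytes) -> bytes:
        """Byte-wise XOR of equal-length blocks (the parity of EQ 24.8)."""
        return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

    # Five hypothetical 4-byte data blocks, one per data drive.
    A, B, C, D, E = (bytes([v] * 4) for v in (1, 2, 3, 4, 5))
    P = xor_blocks(A, B, C, D, E)               # EQ 24.8
    assert xor_blocks(P, A, C, D, E) == B       # EQ 24.9: rebuild the erased block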

The application of this simple approach to a set of disk drives was first introduced by Ken Ouchi of IBM with a U.S. patent issued in 1978 [Ouchi 1978]. A group of UC Berkeley researchers in 1988 coined the term redundant array of inexpensive drives⁷ (RAID) [Patterson et al. 1988] and created an organized taxonomy of different schemes for providing fault tolerance in a collection of disk drives. This was quickly adopted universally as a standard, much to the benefit of the storage industry. In fact, a RAID Advisory Board was created within the storage industry to promote and standardize the concept.

24.3.1 RAID Levels

The original Berkeley paper, now a classic, enumerated five classes of RAID, named Levels⁸ 1 to 5. Others have since tacked on additional levels. These different levels of RAID are briefly reviewed here [Chen et al. 1988].

RAID-0

This is simply data striping, as discussed in Section 24.1. Since data striping by itself involves no redundancy, calling this RAID-0 is actually a misnomer. However, marketing hype by storage subsystem vendors, eager to use the "RAID" label in their product brochures, has succeeded in making this terminology generally accepted, even by the RAID Advisory Board.

RAID-1

This is the same as basic mirroring. While this is the most costly solution for achieving higher reliability in terms of the percentage of redundant drives required, it is also the simplest to implement. In the early 1990s, EMC was able to get to the nascent market for high-reliability storage subsystems first with products based on this simple approach and had much success with it.

When data striping, or RAID-0, is applied on top of RAID-1, as illustrated in Figure 24.5, the resulting array is oftentimes referred to as RAID-10.

⁷ The word "inexpensive" was later replaced by "independent" by most practitioners and the industry.
⁸ The choice of using the word "level" is somewhat unfortunate, as it seems to imply some sort of ranking, when there actually is none.

RAID-2

In RAID-2, fault tolerance is achieved by applying ECC across a set of drives. As an example, if a (7,4)⁹ Hamming code is used, three redundant drives are added to every four data drives. A user's data is striped across the four data drives at the bit level, i.e., the striping size is one bit. Corresponding bits from the data drives are used to calculate three ECC bits, with each bit going into one of the redundant drives. Reads and writes, even for a single sector's worth of data, must be done in parallel to all the drives due to the bit-level striping. The RAID-2 concept was first disclosed by Michelle Kim in a 1986 paper. She also suggested synchronizing the rotation of all the striped disks.

RAID-2 is also a rather costly solution for achieving higher reliability. It is more complex than mirroring and yet does not have the flexibility of mirroring. It never was adopted by the storage industry, since RAID-3 is a similar but simpler and less costly solution.

RAID-3

This organization concept is similar to RAID-2 in that bit-level striping¹⁰ is used. However, instead of using a Hamming ECC code to provide error correction capability, it uses the simple parity scheme discussed previously to provide single drive failure (erasure) fault tolerance. Thus, only one redundant drive, called the parity drive, needs to be added, regardless of the number of data drives. Its low overhead and simplicity make RAID-3 an attractive solution for designing a high-reliability storage subsystem. Because its data access is inherently parallel due to byte-level striping, this architecture is suitable for applications that mostly transfer a large volume of data. Hence, it is quite popular with supercomputers. Since the drives cannot be individually accessed to provide high IOPS for small block data accesses, it is not a good storage subsystem solution for on-line transaction processing (OLTP) and other database-type applications.

RAID-4

This architecture recognizes the shortcoming of the lack of individual disk access in RAID-3. While retaining the use of single parity to provide for fault tolerance, it disassociates the concept of data striping from parity striping. While each parity byte is still the parity of all the corresponding bytes of the data drives, the user's data is free to be striped with any striping size. By choosing a striping width that is some multiple of sectors, accessibility of a user's data on individual disks is now possible. In fact, a user's data does not even have to be striped at all, i.e., the striping size can be equal to one disk drive.

RAID-4 is a more flexible architecture than RAID-3. Because of its freedom to choose its data striping size, data for any workload environment can be organized in the best possible optimization allowable under data striping, and yet it enjoys the same reliability as RAID-3. However, this flexibility does come at a price. In RAID-3, writes are always done in full stripes so that the parity can always be calculated from the new data. With RAID-4, as drives are individually accessible, a write request may be for only one of the drives. Take Equation 24.8 and generalize it so that each letter represents a block in one drive. Suppose a write command wants to change block C to C'. Because blocks A, B, D, and E are not changed and therefore not sent by the host as part of the write command, the new parity P',

P' = A ⊕ B ⊕ C' ⊕ D ⊕ E    (EQ 24.10)

cannot be computed from the command itself. One way to generate this new parity is to read the data blocks A, B, D, and E from their respective drives and then XOR them with the new data C'. This means that the write command in this example triggers four read commands before P' can be calculated and C' and P' can be finally written. In general, if the total number of data and parity drives is N, then a single small write command to a RAID-4 array becomes (N − 2) read commands and two write commands for the underlying drives.

⁹ An (n,k) code means the code word is n bits wide and contains k information bits.
¹⁰ In practice, byte-level striping is more likely to be used, as it is more convenient to deal with, but the concept and the net effect are still the same.

It can be observed that when Equations 24.8 and 24.10 are combined, the following equation results:

P ⊕ P' = C ⊕ C'    (EQ 24.11)

or

P' = P ⊕ C ⊕ C'    (EQ 24.12)

Therefore, an alternative method is to read the original data block C and the original parity block P, compute the new parity P' in accordance with Equation 24.12, and then write the new C' and P' back to those two drives. Thus, a single, small write command becomes two read-modify-write (RMW) commands, one to the target data drive and one to the parity drive. The I/O completion time for an RMW command is the I/O time for a read plus one disk revolution time.

Either method adds a lot of extra disk activity for a single, small write command. This is often referred to as the small write penalty. For N > 3, which is usually the case, the second method, involving RMW, is preferable since it affects two drives instead of all N drives, even though RMW commands take longer than regular commands.
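The two parity-update methods can be compared directly; this sketch (not from the text; the block values are made up, and xor_blocks() is the same helper as in the earlier parity sketch) verifies that Equation 24.10 and Equation 24.12 produce the same new parity:

    from functools import reduce

    def xor_blocks(*blocks: bytes) -> bytes:
        return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

    # A hypothetical parity group with five 4-byte data blocks; C is overwritten.
    A, B, C, D, E = (bytes([v] * 4) for v in (1, 2, 3, 4, 5))
    P = xor_blocks(A, B, C, D, E)
    C_new = bytes([9] * 4)

    # Method 1 (EQ 24.10): read the N - 2 unchanged data blocks and recompute.
    P_new_read_others = xor_blocks(A, B, C_new, D, E)

    # Method 2 (EQ 24.12): read only old C and old P, then XOR in the change
    # (two read-modify-write operations: one data drive, one parity drive).
    P_new_rmw = xor_blocks(P, C, C_new)

    assert P_new_read_others == P_new_rmw       # EQ 24.11 guarantees they agree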
RAID-5

With RAID-4, parity information is stored in one dedicated parity drive. Because of the small write penalty, the parity drive will become a bottleneck for any workload that has some amount of small writes, as it is involved in such writes for any of the data drives. Also, for a heavily read workload, the parity drive is underutilized. The Berkeley team recognized this unbalanced distribution of workload and devised the RAID-5 scheme as a solution.

RAID-5 uses an organization which is very similar to that of data striping. The address space of each disk drive in the group is partitioned into fixed-size blocks referred to as parity stripe blocks, and their size is called the parity stripe block size. The corresponding blocks from each disk together form a parity group. The number of disks in a parity group is referred to as the parity group width. Each parity group includes N¹¹ data blocks and one parity block. Finally, the parity blocks from different parity groups are distributed evenly among all the drives in the array. Figure 24.6 illustrates an example of a 6-disk RAID-5 with the placement of the parity blocks Pi rotated among all the drives. Note that Figure 24.6 is only a template for data placement, and the pattern can be repeated as many times as necessary depending on the parity stripe block size. For example, if drives of 60 GB are used here, and the block size is chosen to be 1 MB, then the pattern will be repeated 10,000 times.

As it is possible to access individual disk drives, RAID-5 has the same small block write penalty as RAID-4. However, since parity blocks are evenly distributed among all the drives in an array, the bottleneck problem associated with the dedicated parity drive of RAID-4 is eliminated. Because of this advantage of RAID-5, and everything else being equal, RAID-4 is not used in any storage subsystems.

The parity stripe block size does not have to be the same as the data stripe size. In fact, there is no requirement that data striping needs to be used in RAID-5 at all. Figure 24.7 illustrates how the example of data striping of Figure 24.1 can be organized in a 4-disk RAID-5 with the parity stripe block size 2× the data stripe size. While there is no requirement that the parity stripe size and the data stripe size must be equal, doing so will certainly make things a little easier to manage for the array control software, which needs to map the user's address space to the drives' address spaces while keeping track of where the parity blocks are.

RAID-6

RAID-5 offers single drive failure protection for the array. If even higher reliability of protection against double drive failures is desired, one more redundancy drive will be needed. The simple bitwise parity across all data blocks scheme is retained for the first redundancy block, but cannot be used for the

¹¹ RAID-5 is oftentimes described as an N + P array, with N being the equivalent number of data disks and P being the equivalent of one parity disk. Thus, a 6-disk RAID-5 is a 5 + P array.

Disk 1 Disk 2 Disk 3 Disk 4 Disk 5 Disk 6

D11 D12 D13 D14 D15 P1

D21 D22 D23 D24 P2 D26

D31 D32 D33 P3 D35 D36

D41 D42 P4 D44 D45 D46

D51 P5 D53 D54 D55 D56

P6 D62 D63 D64 D65 D66

FIGURE 24.6: An example of RAID-5 with 6 disks (5 + P). Parity block Pi is the parity of the corresponding Di data blocks from the other disks.
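The rotated parity placement in Figure 24.6 follows a simple rule; the sketch below (not from the text; names are invented) reproduces the figure's template for a parity group width of six, which a controller would repeat down the drives' address spaces as described above:

    def raid5_parity_group(group: int, width: int):
        """Labels of parity group `group` (1-based) across `width` disks,
        with the parity block rotated as in Figure 24.6."""
        parity_disk = (width - group) % width     # disk 6, 5, 4, ... for groups 1, 2, 3, ...
        return ["P%d" % group if col == parity_disk else "D%d%d" % (group, col + 1)
                for col in range(width)]

    for g in range(1, 7):                         # prints the 6-row template of Figure 24.6
        print("  ".join(raid5_parity_group(g, 6)))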

Disk 1 Disk 2 Disk 3 Disk 4

e1 e2 e3 P1-1

e4 f1 f2 P1-2

f3 f4 P2-1 f5

f6 f7 P2-2 g1

h1 P3-1 h2

P3-2

FIGURE 24.7: An example of RAID-5 where the parity stripe size and the data stripe size are different. P2-2 is the parity for f6, f7, and g1.

second redundancy. A more complex coding scheme is required. Furthermore, because updating this second redundancy also requires RMW for small writes, the new redundancy blocks are also distributed among all the drives in the array à la RAID-5. The resulting organization is RAID-6, as illustrated in Figure 24.8 for a 6-disk array.¹² Pi is the parity of the corresponding Di data blocks in the ith parity group, and Qi is the second redundancy block computed from the Di's and Pi.

When one drive in a RAID-6 array fails, requests for its content can be serviced by reconstructing the data from the remaining good drives using the simple parity P, in exactly the same way as RAID-5. Repair consists of replacing the failed drive and rebuilding its content from the other good drives. If another drive fails before the first failure can be repaired, then both P and the more complicated redundancy Q will be needed for data reconstruction. Some form of Reed-Solomon

¹² RAID-6 is oftentimes described as an N + P + Q array, with N being the equivalent number of data disks, P being the equivalent of one parity disk, and Q being the equivalent of a second redundant disk. Thus, a 6-disk RAID-6 is a 4 + P + Q array.

Disk 1 Disk 2 Disk 3 Disk 4 Disk 5 Disk 6

D11 D12 D13 D14 P1 Q1

D21 D22 D23 P2 Q2 D26

D31 D32 P3 Q3 D35 D36

D41 P4 Q4 D44 D45 D46

P5 Q5 D53 D54 D55 D56

Q6 D62 D63 D64 D65 P6

FIGURE 24.8: An example of RAID-6 with 6 disks (4 + P + Q).

code was suggested for the second redundancy in the original RAID-6 description. However, Reed-Solomon decoding logic is somewhat complex, and so later on others came up with alternative types of code, such as Even-Odd [Blaum et al. 1995], that uses simple XOR logic for the decoder. Regardless of what code is used, any array with two redundant drives to provide double failure protection can be referred to as RAID-6.

24.3.2 RAID Performance

Mirroring and RAID-4/5/6 have some performance similarities and dissimilarities [Ng 1989, Schwarz & Burkhard 1995]. RAID-3 has different performance characteristics because its component drives cannot be accessed individually. In the following, for ease of discussion, it will be assumed that the workload to an array is normally evenly distributed among all the disks, as is the case with data striping.

Normal Mode Performance

Mirroring has the best normal mode read performance because read requests can be serviced by either of two different drives. RAID-5/6 have the same read performance as that of JBOD, with RAID-4 slightly worse since it has one fewer data drive. RAID-3 has good read performance for large block requests, but the component drives are underutilized when servicing small block requests.

For write requests, RAID-3 performs the same as for reads. Both mirroring and RAID-4/5/6 have write penalties. In the case of mirroring, the penalty is that the write must be done to both disks. For RAID-4/5, a small block write request becomes two RMW requests. In the case of RAID-6, a small block write request becomes three RMW requests. For RAID-4/5/6, if a full data stripe is written, there is no penalty, since it is not necessary to read the old data. This argues for using a smaller data stripe size to increase the likelihood of writes being full stripe, which is contrary to the discussion on data striping in general in Section 24.1.

This is summarized in Table 24.1 for different arrays with N data disks, or N equivalent data disks. Each array is given a total workload of R reads and W small writes. The table shows the workload as seen by each drive in the array for various organizations. The total number of disks in an array is an indication of its relative cost in providing the equivalent of N disks of storage space. Mirroring has the lightest workload per drive and, therefore, should have the best performance, but it is also the most expensive solution. Not surprisingly, the double-fault-tolerant RAID-6 has a high workload per drive and thus low performance. RAID-3 also has low small block performance.