Reliable, Parallel Storage Architecture: RAID & Beyond
Garth Gibson
School of Computer Science & Dept. of Electrical and Computer Engineering, Carnegie Mellon University
[email protected] URL: http://www.cs.cmu.edu/Web/Groups/PDL/ Anonymous FTP on ftp.cs.cmu.edu in project/pdl
Patterson, Gibson, Katz, Sigmod, 88. Gibson, Hellerstein, Asplos III, 89. Gibson, MIT Press, 92. Gibson, Patterson, J. Parallel and Distributed Computing, 93. Drapeau, et al., ISCA, 94. Chen, Lee, Gibson, Katz, Patterson, ACM Computing Surveys, 94. Stodolsky, Courtright, Holland, Gibson, ACM Trans on Computer Systems, 94. Holland, Gibson, Siewiorek, J. Distributed and Parallel Databases, 94. Patterson, Gibson, Parallel and Distributed Information Systems, 94. Patterson, et al., CMU-CS-95-134, 95. Gibson, et al., Compcon, 95.
Carnegie Mellon Parallel Data Laboratory, Data Storage Systems Center
ISCA’95 Reliable, Parallel Storage Tutorial Slide 1/74 G. Gibson, CMU
Tutorial Outline
• RAID basics: striping, RAID levels, controllers
• recent advances in disk technology
• expanding RAID markets
• RAID reliability: high and higher
• RAID performance: fast recovery, small writes
• embedding storage management
• exploiting disk parallelism: deep prefetching
• RAID over the network
Review of RAID Basics (RAID in the 80s)
Motivating need for more storage parallelism
Striping for parallel transfer, load balancing
Rapid review of original RAID levels
Simple performance model of RAID levels 1, 3, 5
Basic controller design
RAID = Redundant Array of Inexpensive (Independent) Disks
Patterson, Gibson, Katz, Sigmod, 88. P. Massiglia, DEC, RAIDBook, 93.
What I/O Performance Crisis?
Existing huge gap in performance
• access time ratio, disk:DRAM, is 1000X that of SRAM:DRAM or on-chip:SRAM
Cache-ineffective applications stress the hierarchy
• video, data mining, scientific visualization, digital libraries
Increasing gap in performance
• 40-100+%/year for VLSI versus 20-40%/year for magnetic disk
Amdahl's law implies diminishing decreases in application response time when only CPU speed grows
Disk Arrays: Parallelism in Storage
More Disk Array Motivations
Volumetric and floorspace density
Exploit best technology trend to smaller disks
Increasing requirement for high reliability
Increasing requirement for high availability
Enabling technology: SCSI storage abstraction
Data Striping for Read Throughput
Parallelism will only be effective with
• load balance: high concurrency, small accesses
• parallel transfer: low concurrency, large accesses
Striping data provides both
• uniform load for small independent accesses: stripe unit large enough to contain single accesses
• parallel transfer for large accesses: stripe unit small enough to spread each access widely
(figure: blocks 0-7 striped round-robin across four disks)
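The round-robin placement sketched above is just integer arithmetic; a minimal sketch of the logical-to-physical map (function and parameter names are illustrative, not from the tutorial):

```python
def stripe_map(block, num_disks, stripe_unit_blocks):
    """Map a logical block to (disk, physical offset) under round-robin striping."""
    su = block // stripe_unit_blocks      # which stripe unit holds this block
    disk = su % num_disks                 # stripe units rotate across the disks
    stripe = su // num_disks              # full stripes already laid down
    offset = stripe * stripe_unit_blocks + block % stripe_unit_blocks
    return disk, offset

# With a one-block stripe unit on 4 disks, blocks 0-3 land on disks 0-3 at
# offset 0 and blocks 4-7 on disks 0-3 at offset 1, as in the figure.
assert stripe_map(5, 4, 1) == (1, 1)
```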
Livny, Khoshafian, Boral, Sigmetrics, 87.
Sensitivity to Stripe Unit Size
Simple model of 10 balanced, sync’d disks
(figure: response time at low loads (ms) and throughput at 50% utilization (IO/s) vs. request size, 1 KB to 1 MB, comparing non-striped, 32 KB-interleaved, 1 KB-interleaved, and byte-interleaved arrays)
Gibson, MIT Press, 92 (derived from J. Gray, VLDB, 90).
Selecting a Good Striping Unit (S.U.)
Balance benefit to penalty:
• benefit: parallel transfer shortens transfer time
• penalty: parallel positioning consumes more disk time
Stochastic simulation study of maximum throughput yields simple rules of thumb
• 16 disks, sync'd; find the stripe unit maximizing the minimum normalized throughput
Given I/O concurrency (conc)
• S.U. = 1/4 * (positioning time) * (transfer rate) * (conc - 1) + 1
Given zero workload knowledge
• S.U. = 2/3 * (positioning time) * (transfer rate)
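Both rules of thumb are directly computable; a sketch (the 15 ms positioning time and 5 MB/s transfer rate below are assumed example values, not from the study):

```python
def su_given_concurrency(pos_time_s, rate_bytes_s, conc):
    """Rule of thumb: S.U. = 1/4 * positioning time * transfer rate * (conc - 1) + 1."""
    return 0.25 * pos_time_s * rate_bytes_s * (conc - 1) + 1

def su_no_knowledge(pos_time_s, rate_bytes_s):
    """Zero-workload-knowledge rule: S.U. = 2/3 * positioning time * transfer rate."""
    return (2.0 / 3.0) * pos_time_s * rate_bytes_s

# With one outstanding request (conc = 1) the rule degenerates to a 1-byte
# stripe unit, i.e. pure parallel transfer; with no workload knowledge and
# 15 ms positioning at 5 MB/s, the rule suggests roughly a 50 KB stripe unit.
su = su_no_knowledge(0.015, 5e6)
```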
Chen, Patterson, ISCA, 90.
Adding Redundancy to Striped Arrays
Meet performance needs with more disks in array
• implies more disks vulnerable to failure
Striping all data implies a failure affects most files
• failure recovery involves processing a full dump and all incrementals
Provide failure protection with a redundant code == RAID
Single-failure-protecting codes
• a general single-error-correcting code is too powerful
• disk failures are self-identifying - called erasures
• fact: a T-error-detecting code is also a T-erasure-correcting code
Parity is single-disk-failure-correcting
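Since a failed disk is an erasure (the missing position is known), plain XOR parity recovers it; a minimal sketch over byte blocks:

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: [x ^ y for x, y in zip(a, b)], blocks))

data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]   # disks A, B, C
parity = xor_blocks(data)                         # P = XOR(A, B, C)

# Disk C fails; we know WHICH disk is gone, so XOR of survivors + parity
# reproduces it exactly:
rebuilt = xor_blocks([data[0], data[1], parity])
assert rebuilt == data[2]
```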
Synopsis of RAID Levels
RAID Level 0: Non-redundant
RAID Level 1: Mirroring
RAID Level 2: Byte-Interleaved, ECC
RAID Level 3: Byte-Interleaved, Parity
RAID Level 4: Block-Interleaved, Parity
RAID Level 5: Block-Interleaved, Distributed Parity
Chen, Lee, Gibson, Katz, Patterson, ACM Computing Surveys, 94.
Basic Tradeoffs in RAID Levels
Relative to a non-redundant, 16-disk (sync'd) array
(figure: percentage of non-redundant performance for capacity (C), large reads (R), small reads (r), large writes (W), and small writes (w) under RAID Level 1 Mirroring, RAID Level 3 Byte-Interleaved Parity, and RAID Level 5 Block-Interleaved Distributed Parity)
Patterson, Gibson, Katz, Sigmod, 88. Bitton, Gray, VLDB, 88.
Parity Placement
Throughput generally insensitive to parity placement
• stochastic workload on 2 groups of 4+1, 32 KB stripe units
(figure: left-symmetric parity placement rotates parity so that one read is possible at all times; measured MB/sec vs. request size (100 KB units) for RAID 0, left-symmetric variants (XLS, FLS, LS), right-symmetric (RS), and RAID 4/right-asymmetric/left-asymmetric placements)
Lee, Katz, Asplos IV, 91.
Stripe Unit (S.U.) Size in RAID
Writes in RAID must update the redundancy code
• full stripe overwrites can compute the new code in memory
• other accesses must pre-read disk blocks to determine the new code
• write work limits concurrency gains with large S.U.
Rerun Chen's ISCA 90 simulations in RAID 5
Given I/O concurrency (conc)
• read S.U. = 1/4 * (positioning time) * (transfer rate) * (conc - 1) + 1
• write S.U. = 1/20 * (positioning time) * (transfer rate) * (conc - 1) + 1
Given zero workload knowledge
• S.U. = 1/2 * (positioning time) * (transfer rate)
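The write rule's 1/20 coefficient pulls the best RAID 5 stripe unit well below the read rule's; a sketch with assumed example parameters (15 ms positioning, 5 MB/s transfer, 9 outstanding I/Os, all invented for illustration):

```python
def raid5_su(pos_time_s, rate_bytes_s, conc, coeff):
    """Chen/Lee rule: S.U. = coeff * positioning time * transfer rate * (conc - 1) + 1."""
    return coeff * pos_time_s * rate_bytes_s * (conc - 1) + 1

pos, rate, conc = 0.015, 5e6, 9
read_su = raid5_su(pos, rate, conc, 1 / 4)     # read-optimal stripe unit
write_su = raid5_su(pos, rate, conc, 1 / 20)   # write-optimal: 5x smaller
compromise = 0.5 * pos * rate                  # zero-knowledge rule splits the difference
```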
Chen, Lee, U. Michigan CSE-TR-181-93.
Modeling RAID 5 Performance
Queueing model for RAID 5 response time
• controller directly manages disk queues
• separate channels for each disk
• first-come-first-serve scheduling policies
• accesses are 1 to N stripe units, N = number of data disks
• detailed model of disk seek and rotate used
Model succeeds for the case of special parity handling
• separate, high-priority queue for parity R-M-W operations
• parity R-M-W not generated until the (last) data operation starts
• parity R-M-W does not preempt the current data operation
• implies parity update nearly always one rotation, finishing last
S. Chen, D. Towsley, J. Parallel and Distributed Computing, 93.
Example Controller: NCR (Symbios) 6299
Minimal buffering for RAID 5 parity calc. only
• low-cost, low-complexity controller - quick to market
• problem: concurrent disk accesses return out-of-order (TTD/CIOP)
• problem: max RAID 5 bandwidth is a single drive's bandwidth
(figure: controller block diagram - a Fast/Wide SCSI-2 host interconnect and five Fast SCSI-2 drive channels, each on an NCR 53C916, feed through a MUX into an NCR 53C920 XOR engine with a 64 KB read-modify-write SRAM buffer)
Chen, Lee, Gibson, Katz, Patterson, ACM Computing Surveys, 94.
Adding Buffering to RAID Controller
Performance advantages; added complexity
• parallel independent transfers to all disks
• XOR placement: stream through or separate transfer
• buffer management - extra buffers give higher concurrency
(figure: buffered controller - the host attaches to a high-bandwidth controller data bus carrying a large buffer and XOR engines)
Menon, Cortney, ISCA, 93.
Recap RAID Basics
Disk arrays respond to increasing requirements for I/O throughput, reliability, availability
Data striping for parallel transfer, load balancing
Stripe unit size important
• 1/2 to 2/3 of the (disk positioning time) * (transfer rate) product
A simple code, parity, protects against disk failure
Berkeley RAID levels
• mirroring, RAID 1, is costly, with faster small writes
• RAID 5 is cheaper, with faster large writes; a response time model exists
Controller design hinges on XOR, buffer mgmt
Current Trends in Magnetic Disk Technology
Rapid density increase is major cause of change
Access time and disk diameter lagging
Embedded intelligence for value-added
Interface technologies up in the air
Competing technologies have niche roles
Magnetic Disk Capacity
(figure: capacity at first ship, 1 MB to 10 GB on a log scale, vs. date of first ship, 1980-1996, for 8-14", 5.25", 3.5", 2.5", and 1.8" disks; capacity grows at roughly 85%/yr)
G. Milligan, Seagate, RAID'95 Forum, April 95.
Areal Density Driving Magnetic Disk Trends
1989 sea change in trends
• prior to 1989, density grew at 27% per year
• since 1989, density is growing at 60% per year
  - linear bit density growing at 20+% per year
  - tracks per inch growing at 30+% per year
• 2.5" and 3.5" products with 600+ Mbits/sq.inch in 1995
• spurred by wide acceptance of the SCSI open standard
• magnetic disk densities are greater than optical disk densities
Much higher densities demonstrated in the lab
• IBM: 1 Gbit/sq.inch in 1989, 3 Gbit/sq.inch in 1995
• CMU's DSSC target is 10 Gbits/sq.inch in 1998
Magnetic Disk Price Trends
Revenue per unit shipped constant over time
• density growing 60% per year -> price falling ~40% per year (1/1.6 ≈ 0.63)
• some claim price has been falling by 50% per year since 92
Current price leader 3.5": .23-.40 $/MB (.31 weighted)
• 2.5": .5-1.1 $/MB; 1.8": 1.1-2.9 $/MB
Total revenue stream growing slowly
• 1994 brought in $23 billion on 60 million units
• projected yearly revenue increase in billion dollars: 2.7 in 1995, 5.1 in 1996, 2.3 in 1997, 1.9 in 1998, 1.4 in 1999
• competition for market share may reduce vendors
B. Frank, Augur Visions Inc, J. Molina, Tech Forums, and G. Garrettson, Censtor, RAID'95 Forum, April 95.
Key Magnetic Disk Technologies
Magneto-resistive heads
• replace the single inductive (thin film) head with two: one inductive write head, one magneto-resistive read head
• write-wide, read-narrow; higher signal-to-noise ratio on read
Ever lower flying heights (< 2 microinches)
• reduced-mass sliders, thinner/smoother surfaces
• tolerating frequent "skiing"
Decoding interfering signals (PRML)
• bits too close to ignore interference
• decode based on a sequence of samples and neighbor bit values
Magnetic Disk Performance Trends
Access time decreases near a 1/year curve
• median access times: 25 ms in 1990, 15 ms in 1995, 10 ms in 1999
• minimum access times: 15 ms in 1990, 12 ms in 1995, 10 ms in 1996
Data rate growing with bits/inch and rpm
• long-term growth at 10%/year (fixed rpm)
• in the last couple of years, increased near 75% per year
• limited by decode circuit (analog/digital VLSI)
Disk Diameter Trends
Decreasing diameter dominated the 1980s
• 5.25" created the desktop market (16+ GB soon)
• 3.5" created the laptop market (4+ GB half-height; 500+ MB 19mm)
• 2.5" dominating the laptop market (200+ MB; IBM 720 MB)
• 1.8" creating the PCMCIA disk market (80+ MB)
Decreasing diameter trend slowed to a stop
• 1.8" market not 10X the 2.5" market
• 1.3" (HP Kittyhawk) discontinued
• vendors continue work on smaller disks to lower access time
Storage Interfaces
Small Computer System Interface (SCSI)
• dominates the market everywhere but the cheapest systems
SCSI provides a level of indirection
• linear block address space rather than track/head/sector
  - allows rapid introduction of non-standard geometries
• internal buffer separates bus from media
  - allows rapid introduction of non-standard media data rates
• internal controller command queueing
  - allows geometry-specific, real-time disk scheduling
• internal controller and buffer for caching, read-ahead, write-behind
But SCSI does too much handshaking
• short, slow cables, limited bus ports
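The indirection can be illustrated by what SCSI hides: the host sees only linear block addresses, while the drive applies whatever internal geometry mapping it likes; a toy sketch (the geometry numbers are invented, and real drives use far more elaborate, zoned mappings):

```python
def lba_to_chs(lba, heads, sectors_per_track):
    """One possible internal mapping; the host never sees or depends on it."""
    cylinder, rem = divmod(lba, heads * sectors_per_track)
    head, sector = divmod(rem, sectors_per_track)
    return cylinder, head, sector

# The vendor can change heads/sectors freely without breaking any host:
assert lba_to_chs(0, 4, 32) == (0, 0, 0)
assert lba_to_chs(129, 4, 32) == (1, 0, 1)
```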
Trends in Storage Interfaces: Serial Buses
Faster, smaller, longer, more ports
Multiple contenders:
• Serial Bus (P1394): isochronous desktop peripheral bus (10 MB/s)
• IBM SSA (X3T10.1): dual-ring packetized disk interface (20 MB/s)
• FibreChannel (X3T9.3): merge peripheral and network interconnect
  - lengths to 10 km, speeds to 100 MB/s, multiple classes of service
But, parallel SCSI is not dead yet
• SCSI-2 is 8 or 16 bits parallel, 5 or 10 MHz (5, 10, 20 MB/s)
• Fast-20, Fast-40 are 20, 40 MHz (20, 40, 80 MB/s)
SCSI-3 isolates SCSI from physical layer
What about other storage technologies?
Optical disk?
• limited market; magnetic disk density surpassed optical last year
• portable use possible, but contact and flying disks are ahead
• long-term role in publishing
Magnetic tape?
• very low-cost media in robots
• but low data rates (1-10 MB/s) and slow robot switch (4 min)
• long-term role in archival storage
Holographic and other advanced media?
• Tamarack, Austin TX, may deliver a 20 GB WORM jukebox
• high data rates possible (100 MB/s?)
Recap Magnetic Disk Technology
Rapid density increase is major cause of change
Access time and disk diameter lagging
Embedded intelligence for value-added
Interface technologies up in the air
Competing technologies have niche roles
Current Trends in RAID Marketplace
Rapid market development
• largest growth in storage subsystems, for the LAN environment
• broad range of competition available
RAID standards organization
• RAID Advisory Board focuses on defining/qualifying RAID levels
• RAID support may appear in the SCSI-3 specification
RAID Market Trends
Begun 6 years ago, $5.7 billion in 1994
• 23% IBM, 22% EMC, 12% DEC, 10% Compaq, 7% HP, 4% Hitachi, 4% STK, 3% Sun, 3% DG, 2% Tandem, 1% Symbios, 1% Mylex, 1% Auspex, 1% FWB, 1% NEC
Continued growth predicted
• in billion $: 7.8 in 95, 9.7 in 96, 11.6 in 97, 13.2 in 98
• 66% -> 75% into network/minicomputer/multi-user systems
• units shipped 94 thru 97: 400,000 becomes 1,200,000
  - subsystems: 213,000 x 3.4 -> 730,000
  - boards: 120,000 x 2.5 -> 293,000
  - software: 72,000 x 2 -> 140,000
J. Porter, DISK/TREND, 95.
RAID Product History Sample
(figure: $/MB on a log scale, $20 down to $1, vs. year shipped, 90-95: Compaq IDA, EMC Symmetrix, DG Clariion, NCR ADP-92, IBM 9337, Mylex DAC960, Sun FC Array, STK Iceberg, IBM RAMAC, HP AutoRAID; Sun reaches $.66/MB)
• Rule of thumb: a RAID MB costs 2x raw disk
Symbios, RAID'95 Forum, April 95.
RAID Advisory Board
To promote use and understanding of RAID
• 55-member organization since formation in 92
Education
• RAIDBook - technology and RAID level definitions
• publishes in Computer Technology Review
• hosts the Comdex RAID Technology Center
Standardization
• functional test suite: IO, buffering, queueing, error injection
• performance tests: synthetic TP, FS, DB, video, backup, scientific
• host interface spec: joint development of SCSI-3 RAID support
SCSI-3 Support for Storage Arrays
Storage Array Conversion Layer (SACL)
• defines, manages, accesses, reconstructs array components
• pushes mapping of data and parity into SCSI software
SACL objects
• p_extents - contiguous range of blocks on one device
• ps_extents - portion of a p_extent excluding redundancy
• redundancy groups - group of p_extents sharing protection
• volume sets - group of ps_extents contiguous in user data space
• spare - p_extent, device, or component available to replace a failure
G. Penokie, RAID'95 Forum, April 95.
Recap RAID Market Trends
Rapid market development
• largest growth in storage subsystems, for the LAN environment
• broad range of competition available
RAID standards organization
• RAID Advisory Board focuses on defining/qualifying RAID levels
• RAID support may appear in the SCSI-3 specification
RAID Reliability
Single-correcting codes (parity)
• coping with dependent failure modes (controllers, cables, power)
• contrasting mirroring (RAID 1) with parity (RAID 5)
• limitations in ever larger disk arrays
Double-correcting redundancy codes (RAID 6)
• basics: intersecting codewords allowing recovery choices
• binary, one bit per disk (2d-parity)
• non-binary, one symbol per disk (Reed-Solomon)
• binary, multiple bits per disk (IBM EvenOdd)
Increasing focus on electronics, software impact
Basic Disk Data Reliability Model
Exponential disk lifetime and repair time
• recovery must not fail; repair with an on-line hot spare
(figure: a parity group of G disks holding data A, B, C and parity P = XOR(A, B, C); a lost disk is rebuilt from the survivors, e.g. C = XOR(A, B, P))
disk failure rate: λ = 1/MTTF-disk; disk repair rate: µ = 1/MTTR-disk
Markov model: the all-working state 0 goes to one-failure state 1 at rate Gλ; state 1 goes to data-loss state 2 at rate (G-1)λ, and back to state 0 at rate µ
Given MTTF-disk >> MTTR-disk:
MTTDL-RAID = (MTTF-disk)² / (N × G × (G-1) × MTTR-disk), for N groups of G disks each
Patterson, Gibson, Katz, Sigmod, 88.
Arrays contain Support Hardware
More hardware than just disks in an array
• must defend against external power quality first
• combined effects of non-disk components rival disk failures
(figure: controller support hardware and MTTFs - AC power: 4.3 Khours; SCSI host bus adapter: 120 Khours; SCSI cable: 123 Khours; 300W power supply: 21,000 Khours; power cable: 10,000 Khours; fan: 195 Khours)
Schulze, Compcon, 89.
Orthogonal Parity Groups
Limit the exposure of each parity group to one disk
• organize support hardware into independent columns
(figure: string groups and parity groups arranged orthogonally, so each parity group contains at most one disk per string)
Mirror versus Parity Data Reliability
Simple mirrors less reliable and more costly
• 1 hour mean recovery to spare; 150 Khour mean disk and string lifetime
(figure: mean data lifetime (Khours) and 10-year reliability vs. mean string repair time, 1 to 1024 hours, for 70(1+1) mirrored arrays and 7(10+1) parity arrays with 7 or 14 spares, at D = 4, 18, and 72 data disks)
Gibson, Patterson, J. Parallel and Distributed Computing, 93.
Multiple Failure Tolerance?
I/O parallelism grows with processing speed
• larger arrays have more disks vulnerable to failure
But disk reliability has been growing at 50%/year
• allows a 125%/year increase in processing speed at fixed reliability
No need for more powerful failure tolerance
• will disk reliability continue to rise at this rate?
Increasing needs for highly reliable storage
• are customers prepared to pay the performance cost?
Simple, Fast Double Correcting Arrays
Borrow from earlier approaches (1 bit at a time)
• orthogonal parity groups
• double-error-detecting codes from memory systems
(figure: an extended Hamming layout of information and check disks, and a 2d-parity layout - information disks 0-15 in a 4x4 grid with row parity disks A-D and column parity disks E-H)
Overheads: check space versus check update time
• 2d-parity has minimal update time overhead (3 disk updates per small write), but check space grows as the square root of array size
• Hamming has lower space overhead, but higher average update time cost
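Row-and-column parity recovers any pair of lost disks, because each lost disk sits in some row or column codeword where it is the only loss; a minimal sketch over one-bit disk values (the 4x4 bit grid is invented for illustration):

```python
def encode_2d(grid):
    """Row parities (disks A-D) and column parities (disks E-H) for a 4x4 bit grid."""
    rows = [sum(r) % 2 for r in grid]
    cols = [sum(grid[i][j] for i in range(4)) % 2 for j in range(4)]
    return rows, cols

grid = [[1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 0, 1]]
rows, cols = encode_2d(grid)

# Lose disks (0,1) and (2,3): they share no codeword, so each is the sole
# loss in its own row and rebuilds from the remaining row bits + row parity.
rebuilt_a = (grid[0][0] + grid[0][2] + grid[0][3] + rows[0]) % 2
rebuilt_b = (grid[2][0] + grid[2][1] + grid[2][2] + rows[2]) % 2
assert (rebuilt_a, rebuilt_b) == (grid[0][1], grid[2][3])
```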
Gibson, Hellerstein, Asplos III, 89.
Non-binary Codes (P+Q in STK Iceberg)
Exploit existing hardware (Reed-Solomon)
Binary (b=1), modulo 2:
• C1 = (A+B) mod 2; C2 = (A+B) mod 2 - the two checks are identical
• so A = f(C1,C2)? B = g(C1,C2)? - a double erasure of A and B cannot be solved
Unique check disk pair for each data disk
Nonbinary (b=2), modulo 4:
• C1 = (A+B) mod 4; C2 = (A+2B) mod 4
• A = (2C1-C2) mod 4; B = (C2-C1) mod 4
Multiple data disks can share same check disk pair
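The modulo-4 arithmetic above can be checked exhaustively; a minimal sketch:

```python
def encode_pq(a, b):
    """Nonbinary (b=2) checks: C1 = (A+B) mod 4, C2 = (A+2B) mod 4."""
    return (a + b) % 4, (a + 2 * b) % 4

def decode_pq(c1, c2):
    """Recover both data symbols from the checks (double erasure of A and B)."""
    return (2 * c1 - c2) % 4, (c2 - c1) % 4

# Every 2-bit symbol pair round-trips, which binary parity cannot manage,
# because the two check equations are now linearly independent:
for a in range(4):
    for b in range(4):
        assert decode_pq(*encode_pq(a, b)) == (a, b)
```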
Gibson, MIT Press, 92.
IBM EvenOdd
disk0 disk1 disk2 disk3 disk4 | horiz parity | diag parity
  1     0     1     1     0   |      p0      |     d0
  0     1     1     0     0   |      p1      |     d1
  1     1     0     0     0   |      p2      |     d2
  0     1     0     1     1   |      p3      |     d3
Careful selection of intersecting codewords
• horizontal codewords: a bit from each disk at the same offset
  (p0, p1, p2, p3) = (1, 0, 0, 1)
• diagonal codewords: a bit from each disk at different offsets
  - main diagonal, {disk, offset} = (4,0), (3,1), (2,2), (1,3), has parity 1
  - add the main diagonal parity to the parity of each other diagonal
  (d0, d1, d2, d3) = (1, 1, 1, 1) + (1, 1, 0, 1) = (0, 0, 1, 0)
All computations are XOR (no Reed-Solomon hardware)
Small random write updates 3 disks (minimum)
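The example array above encodes mechanically; a sketch for this 4x5 case (5 data disks, and 5 is prime as EvenOdd requires):

```python
def evenodd_encode(data):
    """EvenOdd parities for an (m-1) x m bit array; m (the disk count) must be prime."""
    m = len(data[0])                                  # number of data disks
    horiz = [sum(row) % 2 for row in data]            # ordinary row parity
    # S = parity of the main diagonal: cells (r, c) with (r + c) mod m == m - 1
    s = sum(data[r][m - 1 - r] for r in range(m - 1)) % 2
    # diagonal i collects cells (r, c) with (r + c) mod m == i, then adds S
    diag = [(sum(data[r][(i - r) % m] for r in range(m - 1)) + s) % 2
            for i in range(m - 1)]
    return horiz, diag

# The 4x5 array from the table (rows = offsets, columns = disks 0-4):
data = [[1, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]]
horiz, diag = evenodd_encode(data)
assert horiz == [1, 0, 0, 1]    # (p0, p1, p2, p3) as on the slide
assert diag == [0, 0, 1, 0]     # (d0, d1, d2, d3) as on the slide
```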
Blaum, Brady, Bruck, Menon, ISCA, 94.
Trends to Higher Reliability: TQM
Drive reliabilities rising quickly
• 30,000 hr MTTF in 1987 -> 800,000 hr MTTF in 1995
• 50% of failures in electronics
Subsystem reliability rising more slowly
• much larger parts count in array controllers
• much larger software component in array controllers
Subsystem optimizations introduce new risk
• write-back cache failures significant
Increased emphasis on super-disk failure modes
Recap RAID Reliability
Basic reliability decreases linearly in group size
Dependent failures in support hw: orthogonal arrays
Parity + spares bests simple mirrors in reliability and cost
Need for multiple failure tolerance unclear
Multiple failure tolerance possible at low cost
But small-write penalty substantially increased
Real push is tolerance for electronics/software faults
RAID Performance
On-line failure recovery: high availability
Exploiting spares: faster recovery
Writeback caching: optimize write-induced work
Parity logging: defer parity update until efficient
Floating data and parity: remap for fast R-M-W
Log-structured: convert small to large writes
On-Line Failure Recovery Performance
(figure: accesses on a D D D P group for fault-free read (1 access), fault-free write (4 accesses: read then write of data and parity), degraded read, and degraded write, where the lost disk's contents must be reconstructed from all survivors)
Per-disk load increase in degraded mode ≈ 1 + r + 0.25w
• 50% throughput wall; long response times; long recovery time
Reducing Load Increase: Parity Declustering
(figure: a logical 4-disk array mapped onto a 7-disk physical array - each parity stripe of G units (data + parity) lands on a different subset of the C physical disks)
Declustering ratio = (G-1)/(C-1)
• per-disk failure-induced workload increase reduced
• entire array bandwidth available for reconstruction
• allows fault-free utilization > ~50%
• map parity groups using Balanced Incomplete Block Designs or random selection of permutations
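The declustering ratio sets the failure-induced load directly; a small sketch of the arithmetic (assuming, for illustration, that on a failure each surviving disk absorbs an alpha fraction of extra read work):

```python
def declustering_ratio(g, c):
    """alpha = (G-1)/(C-1): fraction of each surviving disk read
    per unit of reconstruction work."""
    return (g - 1) / (c - 1)

# RAID 5 is the degenerate case C == G: each survivor's read load doubles.
assert declustering_ratio(10, 10) == 1.0

# Declustering G=10 parity stripes over C=40 disks cuts the on-fault
# per-disk increase to 9/39, roughly 23%:
alpha = declustering_ratio(10, 40)
```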
Muntz, Lui, VLDB, 90. Holland, Gibson, Asplos V, 92. Merchant, Yu, FTCS, 92.
Comparing to Multiple RAID Level 5 Groups
RAID 5: 4 groups of 9+1 ⇒ 40 disks, 10% ovhd
Declustered: 1 group of C=40, G=10 ⇒ 40 disks, 10% ovhd
Performance during reconstruction
(figure: reconstruction time (sec) and user response time (ms) vs. user I/Os per second, 0-1000, for the 4x 9+1 RAID 5 and the 40/10 declustered array)
Holland, Gibson, Siewiorek, J. Distributed and Parallel Databases, 94.
Exploiting Hot Spares: Distributed Sparing
Distribute spare disk space in the same way as parity
• use the spare actuator(s) to improve fault-free performance
• no need to reconstruct spare units
• alleviates the spare-disk bottleneck at low declustering ratios
(figure: RAID Level 5 with dedicated sparing - every stripe is D D D P with an idle spare disk S - versus distributed sparing, where spare units S rotate across the disks along with parity P)
Menon, Mattson, ISCA, 92.
The Small-Write Throughput Problem
RAID Level 5 random small write costs 2X mirrors
• mirroring writes two copies: two disk accesses
• RAID 5 must toggle parity bits when data bits toggle
  - costs read then write of data, read then write of parity: four disk accesses!
Small writes non-negligible: OLTP, Network FS
Major approaches
• caching: delay writes in cache, schedule update later
• logging: delay parity update in a log, schedule update later
• dynamic mapping: rearrange parity mapping as needed
• most require fault tolerance of the in-memory cache or mapping tables
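The four accesses follow from the parity-toggle identity: new parity = old parity XOR old data XOR new data. A minimal sketch that counts them (single-byte "blocks"; all names illustrative):

```python
def raid5_small_write(disks, parity_disk, data_disk, new_data):
    """Read-modify-write of one data block plus its parity: 2 reads + 2 writes."""
    old_data = disks[data_disk]        # access 1: pre-read old data
    old_parity = disks[parity_disk]    # access 2: pre-read old parity
    disks[data_disk] = new_data        # access 3: write new data
    # access 4: write new parity, toggling exactly the bits the data toggled
    disks[parity_disk] = old_parity ^ old_data ^ new_data

disks = [0x0F, 0x33, 0x55, 0x0F ^ 0x33 ^ 0x55]   # 3 data blocks + parity on disk 3
raid5_small_write(disks, 3, 1, 0xA1)
assert disks[3] == disks[0] ^ disks[1] ^ disks[2]   # parity still covers all data
```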
Zero Latency Writes: Writeback Caching
Delay writes in NVRAM or in redundant controllers
Prioritize reads over delayed writes
Schedule writeback efficiently
• aggregate small writes into larger writes
• exploit idle time on the actuator
• greedy Shortest Access Time First scheduling
Achieve a deep queue; write costs 6-8 ms
Fast writes -> shorter queueing of reads behind writes
Widely used in most RAID products
Menon, Cortney, ISCA, 93. Solworth, Orji, Sigmod, 90.
Read Performance given Zero Latency Writes
• while idle time is available, read response is independent of writeback work
• idle time is exhausted according to writeback work
(figure: simulated read response time (ms) vs. throughput (IO/sec/disk) for 1:1 read/write 4K random I/Os, 4 MB cache (0.3%), on a 4-drive RAID Level 0 and a 5-drive RAID Level 5, cached and uncached)
Parity Logging Extension to RAID Level 5
Read/write data at block rates
• collect XOR of old and new data into a "parity update log"
Read/write log and parity at cylinder rates
• delay reintegration of parity updates into parity until efficient
Adds 2 more I/Os, but 4 of the I/Os are > 8 times faster
• disk seconds spent on small writes comparable to mirroring
(figure: disk data rate - 97 KB/s for block-sized accesses, 720 KB/s for track-sized, 1467 KB/s for cylinder-sized)
Stodolsky, Courtright, Holland, Gibson, ACM Trans on Computer Systems, 94. Bhide, Dias, IBM TR 17879, 92.
Logging Small Random Write Performance
Simulated results comparable to mirroring
(figure: average user response time (ms, 0-300) vs. IOs/second per disk (0-40) for RAID Level 5, parity logging, mirroring, and nonredundant arrays, with cache hits on write)
Stodolsky, Courtright, Holland, Gibson, ACM Trans on Computer Systems, 94.
Dynamic Mapping: Floating Data and Parity
Reduce the cost of one or both read-modify-writes in small random overwrites
Dynamically remap an “overwrite” to the closest free space after the read-modify
• 1.6 blocks to the next free block if 7% of space is hidden from the user
• read-modify-write response time is 10-20% longer than read-only
• disconnects contiguous data written at different times
• fault-tolerant mapping tables can be too large if data floats
Float parity only; with a high cache hit ratio on the mapping tables
• small random writes complete in little more than 2 access times
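The floating idea reduces the rotational delay of the write-back: instead of rewriting a block in place (a full rotation after the read), the new version goes to the nearest hidden free block and a mapping table is updated. The sketch below is illustrative (class and method names are invented, not Menon et al.'s code), modeling one cylinder's blocks as rotational positions.

```python
# Illustrative sketch of "floating" parity/data placement: each cylinder
# hides some free blocks; a rewrite lands in the free block rotationally
# closest after the head, and the old copy becomes free space.

class FloatingCylinder:
    def __init__(self, nblocks, free):
        self.nblocks = nblocks         # blocks per track/cylinder model
        self.free = set(free)          # hidden free block numbers
        self.map = {}                  # logical id -> physical block

    def place(self, pid):
        # initial placement: any free block
        blk = min(self.free)
        self.free.remove(blk)
        self.map[pid] = blk
        return blk

    def rewrite(self, pid, head_pos):
        # pick the free block with minimal rotational distance after the
        # head, so the new write starts almost immediately after the read
        old = self.map[pid]
        blk = min(self.free, key=lambda b: (b - head_pos) % self.nblocks)
        self.free.remove(blk)
        self.free.add(old)             # old copy is reclaimed as free space
        self.map[pid] = blk
        return blk
```

This is why the tables can get large: every floated block needs a persistent map entry, which is tolerable for parity alone but expensive if all data floats.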
Menon, Roche, Kasson, J. Parallel and Distributed Computing, 93.

Dynamic Mapping: Log-Structured
New write data is accumulated and written into an empty stripe
• large write optimization computes parity in memory
Background garbage collection of old data
Log-structured file system
• focus on physical contiguity for bandwidth
• garbage collection involves copying
Virtual parity striping
• focus on minimal parity update cost
• garbage collection involves parity recomputation
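The large-write optimization above is the whole point of writing into empty stripes: when every block of a stripe is new, parity is just the in-memory XOR of the new data, and no pre-reads are needed. A minimal sketch:

```python
from functools import reduce

# Minimal sketch of the large-write optimization: a full stripe of new
# data lets parity be computed in memory, so the stripe is written with
# no read-modify-write at all.

def full_stripe_write(new_blocks):
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                    new_blocks)
    # one large parallel write: N data blocks plus 1 parity block
    return list(new_blocks) + [parity]
```

Compare with the small-write case, which costs 4 I/Os per block; here the per-block cost approaches 1 + 1/N I/Os, at the price of garbage-collecting the overwritten copies later.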
Rosenblum, Ousterhout, Symp of Operating Systems Principles, 91.
Mogi, Kitsuregawa, Parallel and Distributed Info Systems, 94.

StorageTek Iceberg
• user data compressed in the cache
• accumulate a full stripe, compute P + Q
• find contiguous free space for the stripe
• parallel write, update mapping tables
Array: 8 × (13 + 2) disks
Double-error-correcting, log-structured, compressed data
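The “+2” in each 13+2 group is what makes the array double-error-correcting. StorageTek's actual code is not spelled out here; a standard way to get the same property is a RAID-6-style P+Q pair, where P is the XOR of the data blocks and Q is a Reed-Solomon checksum over GF(2^8) with a distinct coefficient g^i per data disk, as in this illustrative sketch:

```python
# Illustrative P+Q computation (a standard double-erasure code, not
# necessarily Iceberg's exact scheme): P is plain XOR; Q weights each
# data block by g^i in GF(2^8) (generator g = 2, polynomial 0x11d).

def gf_mul(a, b):
    # carry-less multiply in GF(2^8) modulo the polynomial 0x11d
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return p

def pq_checksums(blocks):
    p = bytearray(len(blocks[0]))
    q = bytearray(len(blocks[0]))
    for i, blk in enumerate(blocks):
        coeff = 1
        for _ in range(i):             # coefficient g^i for disk i
            coeff = gf_mul(coeff, 2)
        for j, byte in enumerate(blk):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)
```

Because every disk gets a distinct nonzero coefficient, any two erased blocks leave a solvable 2×2 system, which is what lets the array survive two simultaneous disk failures.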
Saunders, Storage Tech, RAID’95 Forum, April 95.

Recap RAID Performance
On-line recovery causes a drastic reduction in performance
• read workload doubles on surviving disks in RAID 5
• declustering parity allows tunable degradation and cost
• declustering avoids bottlenecks in multi-group RAID 5
• on-line spares are effective for very fast recovery
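The first two bullets reduce to simple arithmetic. Following the parity-declustering analysis (Holland & Gibson), if parity groups of size G are laid out over C disks, each surviving disk absorbs an extra fraction (G-1)/(C-1) of its normal read load during reconstruction; RAID 5 is the special case G = C, where the load doubles.

```python
# Back-of-envelope for declustered recovery load (illustrative, after
# Holland & Gibson): relative read load on a surviving disk while a
# failed disk is reconstructed, given array size C and group size G.

def degraded_read_load(C, G):
    return 1 + (G - 1) / (C - 1)

# RAID 5 (G == C): every survivor's read load doubles
raid5 = degraded_read_load(5, 5)
# Decluster the same G=5 groups over 21 disks: only 20% extra load
declustered = degraded_read_load(21, 5)
```

The knob is G/C: more disks per group of fixed size means gentler degradation but more disks whose failure matters.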
Reducing the cost of small writes in parity-based arrays
• write caching for immediate ack, write scheduling
• logging parity changes for later efficient processing
• floating allocation for fast read-modify-write
• dynamic remapping (log-structured) for the large write optimization
Trends in Transparency
Intelligent storage management
Rapid development of architectures
Aggressive prefetching to increase I/O parallelism
Trends in Storage Organization Transparency
Increasing microprocessor power in the controller
• controllers in 2000 are expected to have 200 MHz processors
Cost of managing storage per year is ~7X the storage cost
Strong push to embed management in storage
• dynamic adaptation to the workload
• selection of, and migration between, redundancy schemes
• support for backup
STK Iceberg, HP AutoRAID leading the push
• dynamic migration of storage representation
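The migration idea can be sketched as a policy, not a mechanism: keep recently written blocks in a mirrored (small-write-friendly) region and demote cold blocks to a RAID 5 (capacity-friendly) region. The class, threshold, and LRU rule below are hypothetical illustrations; AutoRAID's and Iceberg's real policies are more sophisticated.

```python
# Hypothetical sketch of dynamic migration between redundancy schemes:
# active data stays mirrored for cheap small writes; when the mirrored
# region overflows, the least-recently-written block is demoted to
# RAID 5 for cheap capacity. Names and policy are illustrative only.

class MigratingStore:
    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.mirrored = []             # LRU order, most recent last
        self.raid5 = set()

    def write(self, block):
        if block in self.mirrored:
            self.mirrored.remove(block)
        self.raid5.discard(block)      # a write promotes the block
        self.mirrored.append(block)
        while len(self.mirrored) > self.hot_capacity:
            cold = self.mirrored.pop(0)   # demote LRU block to RAID 5
            self.raid5.add(cold)
```

The payoff is that the expensive RAID 5 small-write path is reserved for data that is rarely overwritten.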
Rapid Prototyping and Evaluation for RAID
RAIDFrame: separate policy from mechanism
• express RAID functions as directed acyclic graphs (DAGs)
• execute DAGs on an engine unaware of the RAID architecture
• distributable, portable “RAID N reference model”
• roll-back and roll-forward error handling
• automatic construction of uncommon DAGs
• optimization of DAGs
[Figure: RAID level 5 small-write DAG (nodes H, Rod, Rop, Xor, Wnd, Wnp, T) beside the parity logging small-write DAG (nodes H, Rod, Xor, Wnd, Log, T)]
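The engine side of RAIDFrame can be sketched generically: it runs any node whose predecessors have completed, knowing nothing about RAID levels. The node names follow the slide (Rod = read old data, Rop = read old parity, Wnd/Wnp = write new data/parity, H/T = header/terminator); the scheduler itself is an illustration, not RAIDFrame's actual code.

```python
# Sketch of a DAG execution engine in the RAIDFrame style: nodes are
# opaque actions, edges are dependences, and the engine fires any node
# whose predecessors are all done. Illustrative, not RAIDFrame code.

def execute_dag(nodes, edges):
    # nodes: name -> callable; edges: (pred, succ) pairs
    preds = {n: set() for n in nodes}
    for a, b in edges:
        preds[b].add(a)
    done, order = set(), []
    while len(done) < len(nodes):
        ready = [n for n in nodes if n not in done and preds[n] <= done]
        for n in ready:                # these could run in parallel
            nodes[n]()
            done.add(n)
            order.append(n)
    return order

# The RAID level 5 small-write DAG from the slide, as data:
trace = []
dag = {n: (lambda n=n: trace.append(n))
       for n in ["H", "Rod", "Rop", "Xor", "Wnd", "Wnp", "T"]}
edges = [("H", "Rod"), ("H", "Rop"), ("Rod", "Xor"), ("Rop", "Xor"),
         ("Xor", "Wnp"), ("Rod", "Wnd"), ("Wnd", "T"), ("Wnp", "T")]
order = execute_dag(dag, edges)
```

Swapping in a parity-logging architecture means swapping the graph, not the engine, which is exactly the policy/mechanism split the slide describes.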
Gibson, et al., Compcon, 95.

Overcoming Disclosure Bottleneck: Prefetching
• expose concurrency
  - overlap I/O and computation
  - overlap I/O and think time
  - overlap I/O and I/O !!!!
• I/O optimization
  - seek scheduling
  - batch processing
• cache management
  - balance buffers between prefetch and demand
[Figure: application discloses future accesses F1, F2, F3, ..., which flow through the system as streams S1, S2, ..., buffers B1, B2, ..., and device requests D1, D2, ... to the devices]
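The “overlap I/O and I/O” bullet is the one that needs an array: with several hinted reads in flight at once, stall time divides by the prefetch depth. The toy model below uses a thread pool and a sleep as a stand-in for a disk access; `fake_read`, the timings, and the depth parameter are all invented for illustration.

```python
import concurrent.futures
import time

# Toy model of informed prefetching: the application disclosed its
# future reads (hints), so the filesystem can keep `depth` I/Os in
# flight at once. fake_read and its 10 ms latency are stand-ins.

def fake_read(block):
    time.sleep(0.01)                   # stand-in for one disk access
    return block

def run(hints, depth):
    # issue hinted reads with up to `depth` concurrently outstanding
    with concurrent.futures.ThreadPoolExecutor(max_workers=depth) as ex:
        futures = [ex.submit(fake_read, b) for b in hints]
        return [f.result() for f in futures]

hints = list(range(16))
t0 = time.time(); run(hints, depth=1); serial = time.time() - t0
t0 = time.time(); run(hints, depth=8); overlapped = time.time() - t0
# with 8 I/Os in flight, stall time shrinks roughly depth-fold
```

This is the mechanism behind the disk-scaling curves on the next slide: more disks only help once the hint depth keeps them all busy.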
Patterson, Gibson, Parallel and Distributed Information Systems, 94.

Examples: Informed Prefetching Filesystem
Annotated applications: text search (Agrep), 3D visualization (XDS), database join (Postgres), speech recognition (Sphinx)
Digital UNIX (OSF/1 v2), 5 Fast-SCSI buses, 15 HP 2247 disks
[Figure: change in elapsed time and change in I/O stall time (normalized, 0-1.0) vs. number of disks (0-16) for Agrep, XDS, Postgres, Sphinx]
Patterson, et al., CMU-CS-95-134, 95.

Recap Storage Transparency
Embedding intelligent storage management
• exploit embedded processor power to simplify configuration
• dynamic migration of representations within the storage system
Tools for rapid RAID architecture development
• definition, evaluation, and optimization tools
Beyond embarrassingly parallel I/O applications
• aggressive prefetching needed to reduce latency for many tasks
Trends in Network RAID
Avoid the file server workstation’s memory
• fastpath data from disk to network
Handle controller failure as another disk failure
Large arrays needing more than 1 controller
Support client file access faster than 1 controller
• stripe data on controllers & switched networks
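Striping over controllers works because block location is a pure address computation: any client can find the controller and disk for a block on its own, so bandwidth scales with the number of controllers on the switched network. The layout function below is an illustrative round-robin scheme; the parameters and names are not from any particular product.

```python
# Illustrative block layout for striping over multiple controllers:
# stripe units rotate round-robin across controllers, then across the
# disks behind each controller. Parameters are made up for the sketch.

def locate(block, stripe_unit, disks_per_ctrl, n_ctrls):
    stripe_idx, offset = divmod(block, stripe_unit)
    ctrl = stripe_idx % n_ctrls                     # spread across controllers first
    disk = (stripe_idx // n_ctrls) % disks_per_ctrl  # then across a controller's disks
    return ctrl, disk, offset
```

A large sequential read then naturally fans out: consecutive stripe units land on different controllers, so the transfers proceed in parallel over the switch.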
Fastpath Disk and Network
File server processor does not look at most bytes
• attach network and disks to the RAID controller with a fast bus
• scale network bandwidth and file server CPU separately
• separate high-bandwidth and low-latency traffic onto their own networks
Berkeley RAID-II prototype: 15-21 MB/s through LFS
[Figure: RAID-II architecture: Ethernet for control and low-latency transfers; TMC HIPPI source and destination ports with 8-port interleaved memory (128 MByte); XBUS card with 4-port interleaved memory (32 MB), XOR engine, and 4-by-8 32-bit crossbar; four VME disk controllers; file server attached via the VME control bus]
Drapeau, et al., ISCA, 94.

Embedded Controller Parallelism
[Figure: hosts connected to multiple controllers over a private controller network; each controller attached to its disks]
HP TickerTaip - private controller network
• scale controller compute power; tolerate controller failures
Borg Technologies promises a similar product
Cao, Lim, Venkataraman, Wilkes, ISCA, 93.
D. Stallmo, Borg Tech, RAID’95 Forum, April 95.

Exposed Controller Parallelism
[Figure: hosts and servers sharing the host network; each server attached to its disks]
Use the host network for controller/server traffic
• Zebra - log-structured file system over the network
• Swift - distributed server RAID
• Network-attached disk - Conner (ATM), Seagate (FibreChannel)
Hartman, Ousterhout, Symposium on Operating Systems Principles, 93.
Cabrera, Long, Computing Systems, 91.

Opportunity for Parallelism Increasing
[Figure: storage systems plotted by application file bandwidth vs. application performance, grouped by interconnect (parallel attached I/O, parallel block I/O fabric, switched LAN, shared media); systems include VAX-FFS 1980, MVS 1982, G-Nectar 1985, SUN-NFS 1985, AFS 1989, Auspex NFS 1989, MVS-RAID 1989, CM5-RAID 1991, Zebra server 1993, Coda 1993, Scotch PFS 1994, Intel Paragon PFS 1994]
Parallel File Systems for Parallel Programs
• Concurrent write sharing
• Globally shared file pointers
• Performance-hungry applications
• High bandwidth to large files
• Application-specific access methods
• Application control over basic PFS parameters
• Limited instances of any specific environment
• Emphasis on scalability and portability
• How much integration with the network?
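A globally shared file pointer is the simplest of the concurrent-write-sharing features above: many writers append through one pointer, and each must get a distinct, contiguous region atomically. The sketch below uses a lock in one address space purely to illustrate the semantics; it is not any particular PFS API, and a real implementation must provide the same atomicity across nodes.

```python
import threading

# Illustrative semantics of a globally shared file pointer: reserve()
# atomically claims [offset, offset + nbytes) for one writer, so
# concurrent appends never overlap. Not a real PFS interface.

class SharedFilePointer:
    def __init__(self):
        self.offset = 0
        self.lock = threading.Lock()

    def reserve(self, nbytes):
        with self.lock:
            start = self.offset
            self.offset += nbytes
            return start

ptr = SharedFilePointer()
regions = []
threads = [threading.Thread(target=lambda: regions.append(ptr.reserve(100)))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# every writer holds a distinct 100-byte region
```

The interesting design question, as the bullets note, is where this state lives: in a server, in a controller, or integrated with the network itself.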
Corbett, Feitelson, Proc Scalable High-Performance Computing Conf, 94.
Gibson, et al., Compcon, 95.

Recap RAID over the Network
Attach storage “closer” to the network
• avoid the workstation memory system
Stripe data over multiple controllers/servers
• private controller network in the box
• use the host LAN for redundancy and controller communication
Support explicit concurrent write sharing
• parallel file systems for parallel programs
• application assistance in data layout, prefetching, checkpointing
Tutorial Summary
• Basic RAID: levels 1, 3, 5 are useful
• Magnetic disk technology stronger than ever
• RAID market well beyond basic RAID
• Increasingly sophisticated function in the subsystem
• How much transparency is too much?
• Striping/RAID over the network is emerging
ISCA’95 Reliable, Parallel Storage Tutorial Slide 74/74 G. Gibson, CMU