Reliable, Parallel Storage Architecture: RAID & Beyond

Garth Gibson

School of Computer Science & Dept. of Electrical and Computer Engineering, Carnegie Mellon University

[email protected]
URL: http://www.cs.cmu.edu/Web/Groups/PDL/
Anonymous FTP on ftp.cs.cmu.edu in project/pdl

Patterson, Gibson, Katz, Sigmod, 88.
Gibson, Hellerstein, Asplos III, 89.
Gibson, MIT Press, 92.
Gibson, Patterson, J. Parallel and Distributed Computing, 93.
Drapeau, et al., ISCA, 94.
Chen, Lee, Gibson, Katz, Patterson, ACM Computing Surveys, 94.
Stodolsky, Courtright, Holland, Gibson, ACM Trans. on Computer Systems, 94.
Holland, Gibson, Siewiorek, J. Distributed and Parallel Databases, 94.
Patterson, Gibson, Parallel and Distributed Information Systems, 94.
Patterson, et al., CMU-CS-95-134, 95.
Gibson, et al., Compcon, 95.

Carnegie Mellon Parallel Data Laboratory / Data Storage Systems Center


Tutorial Outline

• RAID basics: striping, RAID levels, controllers
• recent advances in disk technology
• expanding RAID markets
• RAID reliability: high and higher
• RAID performance: fast recovery, small writes
• embedding storage management
• exploiting disk parallelism: deep prefetching
• RAID over the network


Review of RAID Basics (RAID in the 80s)

Motivating need for more storage parallelism

Striping for parallel transfer, load balancing

Rapid review of original RAID levels

Simple performance model of RAID levels 1, 3, 5

Basic controller design

RAID = Redundant Array of Inexpensive (Independent) Disks

Patterson, Gibson, Katz, Sigmod, 88. P. Massiglia, DEC, RAIDBook, 93.

What I/O Performance Crisis?

Existing huge gap in performance
• access time ratio, disk:DRAM, is 1000X the SRAM:DRAM or on-chip:SRAM ratio

Cache-ineffective applications stress the hierarchy
• video, data mining, scientific visualization, digital libraries

Increasing gap in performance
• 40-100+%/year VLSI versus 20-40%/year magnetic disk

Amdahl's law implies diminishing reductions in application response time


Disk Arrays: Parallelism in Storage


More Motivations

Volumetric and floorspace density

Exploit best technology trend to smaller disks

Increasing requirement for high reliability

Increasing requirement for high availability

Enabling technology: SCSI storage abstraction


Data Striping for Read Throughput

Parallelism will only be effective if
• load balance: high concurrency, small accesses
• parallel transfer: low concurrency, large accesses

Striping data provides both
• uniform load for small independent accesses: stripe unit large enough to contain single accesses
• parallel transfer for large accesses: stripe unit small enough to spread each access widely
[Figure: blocks 0-7 striped round-robin across four disks, two stripe units per disk.]
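To make the round-robin mapping concrete, here is a small sketch (not from the tutorial; block numbering from zero and the one-block stripe unit are illustrative assumptions) of how a non-redundant striped array turns a logical block number into a disk and an offset:

```python
# Minimal striping sketch (illustrative assumptions: no redundancy, blocks
# numbered from 0, `unit_blocks` logical blocks per stripe unit).

def stripe_map(logical_block: int, num_disks: int, unit_blocks: int):
    """Return (disk, block offset on that disk) for a striped array."""
    unit, within = divmod(logical_block, unit_blocks)     # which stripe unit
    disk = unit % num_disks                               # round-robin placement
    offset = (unit // num_disks) * unit_blocks + within   # depth on that disk
    return disk, offset

# With 4 disks and a 1-block stripe unit, blocks 0-7 land as in the figure:
print([stripe_map(b, num_disks=4, unit_blocks=1) for b in range(8)])
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```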

Livny, Khoshafian, Boral, Sigmetrics, 87.

Sensitivity to Stripe Unit Size

Simple model of 10 balanced, sync’d disks

[Charts: response time at low loads (ms) and throughput at 50% utilization (IO/s) vs. request size (1 KB to 1 MB), comparing non-striped, 32 KB-interleaved, 1 KB-interleaved, and byte-interleaved arrays.]

Gibson, MIT Press, 92 (derived from J. Gray, VLDB, 90).

Selecting a Good Striping Unit (S.U.)

Balance benefit against penalty:
• benefit: parallel transfer shortens transfer time
• penalty: parallel positioning consumes more disk time

Stochastic simulation study of maximum throughput yields simple rules of thumb
• 16 disks, sync'd; find the stripe unit maximizing the minimum normalized throughput

Given I/O concurrency (conc)
• S.U. = 1/4 * (positioning time) * (transfer rate) * (conc - 1) + 1

Given zero workload knowledge
• S.U. = 2/3 * (positioning time) * (transfer rate)
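A minimal sketch of the two rules of thumb above, assuming positioning time in seconds and transfer rate in KB/s so the stripe unit comes out in KB; the example disk parameters are invented for illustration:

```python
def stripe_unit_known_concurrency(pos_time_s, xfer_rate_kb_s, concurrency):
    """Rule of thumb when the I/O concurrency of the workload is known."""
    return 0.25 * pos_time_s * xfer_rate_kb_s * (concurrency - 1) + 1

def stripe_unit_no_knowledge(pos_time_s, xfer_rate_kb_s):
    """Rule of thumb when nothing is known about the workload."""
    return (2.0 / 3.0) * pos_time_s * xfer_rate_kb_s

# Invented example: 15 ms average positioning time, 4 MB/s media transfer rate.
print(stripe_unit_known_concurrency(0.015, 4000, concurrency=8))  # ~106 KB
print(stripe_unit_no_knowledge(0.015, 4000))                      # ~40 KB
```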

Chen, Patterson, ISCA, 90.

Adding Redundancy to Striped Arrays

Meet performance needs with more disks in the array
• implies more disks vulnerable to failure
Striping all data implies a failure affects most files
• failure recovery involves processing a full dump and all increments
Provide failure protection with a redundant code == RAID

Single-failure-protecting codes
• a general single-error-correcting code is too powerful
• disk failures are self-identifying - called erasures
• fact: a T-error-detecting code is also a T-erasure-correcting code

Parity is single-disk-failure-correcting
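The erasure-correcting property of parity can be shown in a few lines; the byte strings below are made-up stand-ins for disk blocks:

```python
# Parity as an erasure-correcting code: the parity block is the XOR of the
# data blocks, and any one lost block is the XOR of the survivors.

from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]     # made-up blocks on disks 1..3
parity = xor_blocks(data)              # block on the parity disk

# Disk 2 fails: an erasure, because we know *which* disk is gone.
surviving = [data[0], data[2], parity]
recovered = xor_blocks(surviving)
assert recovered == data[1]            # lost block reconstructed
```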


Synopsis of RAID Levels

RAID Level 0: Non-redundant

RAID Level 1: Mirroring

RAID Level 2: Byte-Interleaved, ECC

RAID Level 3: Byte-Interleaved, Parity

RAID Level 4: Block-Interleaved, Parity

RAID Level 5: Block-Interleaved, Distributed Parity

Chen, Lee, Gibson, Katz, Patterson, ACM Computing Surveys, 94.

Basic Tradeoffs in RAID Levels

Relative to a non-redundant, 16-disk (sync'd) array:

[Chart: relative capacity (C), large reads (R), small reads (r), large writes (W), and small writes (w), each as a percentage of the non-redundant array, for RAID Level 1 (mirroring), RAID Level 3 (byte-interleaved parity), and RAID Level 5 (block-interleaved, distributed parity).]

Patterson, Gibson, Katz, Sigmod, 88. Bitton, Gray, VLDB, 88.

Parity Placement

Throughput generally insensitive
• stochastic workload on 2 groups of 4+1, 32 KB stripe units

Left-symmetric placement keeps one read outstanding on every disk; layout for a 4+1 group:
0  1  2  3  P0
5  6  7  P1 4
10 11 P2 8  9
15 P3 12 13 14
P4 16 17 18 19

[Chart: MB/sec vs. request size (0-1000 KB) for RAID 0; XLS/FLS; LS; RAID 4, RA, LA; and RS parity placements.]

Lee, Katz, Asplos IV, 91.


Stripe Unit (S.U.) Size in RAID

Writes in RAID must update the redundancy code
• full stripe overwrites can compute the new code in memory
• other accesses must pre-read disk blocks to determine the new code
• write work limits concurrency gains with large S.U.

Rerun Chen's ISCA 90 simulations in RAID 5

Given I/O concurrency (conc)
• read S.U. = 1/4 * (positioning time) * (transfer rate) * (conc - 1) + 1
• write S.U. = 1/20 * (positioning time) * (transfer rate) * (conc - 1) + 1

Given zero workload knowledge
• S.U. = 1/2 * (positioning time) * (transfer rate)

Chen, Lee, U. Michigan CSE-TR-181-93.

Modeling RAID 5 Performance

Queueing model for RAID 5 response time
• controller directly manages disk queues
• separate channels for each disk
• first-come-first-serve scheduling policies
• accesses are 1 to N stripe units, N = number of data disks
• detailed model of disk seek and rotate used

Model succeeds for the case of special parity handling
• separate, high-priority queue for parity R-M-W operations
• parity R-M-W not generated until the (last) data operation starts
• parity R-M-W does not preempt the current data operation
• implies the parity update is nearly always one rotation, finishing last

S. Chen, D. Towsley, J. Parallel and Distributed Computing, 93.

Example Controller: NCR (Symbios) 6299

Minimal buffering for RAID 5 parity calculation only
• low cost and complexity controller - quick to market
• problem: concurrent disk accesses return out-of-order (TTD/CIOP)
• problem: max RAID 5 bandwidth is single drive bandwidth

[Block diagram: a Fast/Wide SCSI-2 host interconnect feeding multiple NCR 53C916 Fast SCSI-2 drive channels through a MUX/XOR path and an NCR 53C920 read/modify/write 64 KB SRAM buffer.]

Chen, Lee, Gibson, Katz, Patterson, ACM Computing Surveys, 94.

Adding Buffering to RAID Controller

Performance advantages; added complexity
• parallel independent transfers to all disks
• XOR placement: stream through or separate transfer
• buffer management - extra buffers give higher concurrency

[Block diagram: host interface, a large buffer, and an XOR engine sharing a high-bandwidth controller data bus with independent paths to the disks.]

Menon, Cortney, ISCA, 93.

Recap RAID Basics

Disk arrays respond to increasing requirements for I/O throughput, reliability, availability

Data striping for parallel transfer, load balancing
Stripe unit size important
• 1/2 to 2/3 of the (disk positioning time) x (transfer rate) product

A simple code, parity, protects against disk failure
Berkeley RAID levels
• Mirroring (RAID 1) is costly, with faster small writes
• RAID 5 is cheaper, with faster large writes; a response time model exists

Controller design hinges on XOR, buffer mgmt


Current Trends in Magnetic Disk Technology

Rapid density increase is major cause of change

Access time and disk diameter lagging

Embedded intelligence for value-added

Interface technologies up in the air

Competing technologies have niche roles


Magnetic Disk Capacity

[Chart: capacity (MB, log scale, 1 to 10,000) vs. date of first ship (1980-1996), growing ~85%/yr, with curves for 8-14", 5.25", 3.5", 2.5", and 1.8" drives.]

G. Milligan, Seagate, RAID'95 Forum, April 95.

Areal Density Driving Magnetic Disk Trends

1989 sea change in trends
• prior to 1989, density grew at 27% per year
• since 1989, density is growing at 60% per year
  linear bit density growing at 20+% per year
  tracks per inch growing at 30+% per year
• 2.5" and 3.5" products with 600+ Mbits/sq.inch in 1995
• spurred by wide acceptance of the SCSI open standard
• magnetic disk densities are greater than optical disk densities

Much higher densities demonstrated in the lab
• IBM: 1 Gbit/sq.inch in 1989, 3 Gbit/sq.inch in 1995
• CMU's DSSC target is 10 Gbits/sq.inch in 1998

Magnetic Disk Price Trends

Revenue per unit shipped constant over time
• density growing 60% per year -> price falling 40% per year
• some claim price has been falling by 50% per year since 92

Current price leader is 3.5": $0.23-0.40/MB ($0.31 weighted)
• 2.5": $0.5-1.1/MB; 1.8": $1.1-2.9/MB

Total revenue stream growing slowly
• 1994 brought in $23 billion on 60 million units
• projected yearly revenue increase in billions of dollars: 2.7 in 1995, 5.1 in 1996, 2.3 in 1997, 1.9 in 1998, 1.4 in 1999
• competition for market share may reduce vendors

B. Frank, Augur Visions Inc., J. Molina, Tech Forums, and G. Garrettson, Censtor, RAID'95 Forum, April 95.

Key Magnetic Disk Technologies

Magneto-resistive heads
• replace the single inductive (thin film) head with two heads: one inductive write head, one magneto-resistive read head
• write-wide, read-narrow; higher signal-to-noise on read

Ever lower flying heights (< 2 microinches)
• reduced-mass sliders, thinner/smoother surfaces
• tolerating frequent "skiing"

Decoding interfering signals (PRML)
• bits too close to ignore interference
• decode based on a sequence of samples and neighbor bit values

Magnetic Disk Performance Trends

Access time decreases near a 1/year curve
• median access times: 25 ms in 1990, 15 ms in 1995, 10 ms in 1999
• minimum access times: 15 ms in 1990, 12 ms in 1995, 10 ms in 1996

Data rate growing with bits/inch and rpm
• long term growth at 10%/year (fixed rpm)
• in the last couple of years increased near 75% per year
• limited by the decode circuit (analog/digital VLSI)

Disk Diameter Trends

Decreasing diameter dominated the 1980s
• 5.25" created the desktop market (16+ GB soon)
• 3.5" created the laptop market (4+ GB half-height; 500+ MB 19mm)
• 2.5" dominating the laptop market (200+ MB; IBM 720 MB)
• 1.8" creating the PCMCIA disk market (80+ MB)

Decreasing diameter trend slowed to a stop
• 1.8" market not 10X the 2.5" market
• 1.3" (HP Kittyhawk) discontinued
• vendors continue work on smaller disks to lower access time

Storage Interfaces

Small Computer System Interface (SCSI)
• dominates the market everywhere but the cheapest systems

SCSI provides a level of indirection
• linear block address space rather than track/head/sector: allows rapid introduction of non-standard geometries
• internal buffer separates bus from media: allows rapid introduction of non-standard media data rates
• internal controller command queueing: allows geometry-specific, real-time disk scheduling
• internal controller and buffer for caching, read-ahead, write-behind

But SCSI does too much handshaking
• short, slow cables, limited bus ports

Trends in Storage Interfaces: Serial Buses

Faster, smaller, longer, more ports

Multiple contenders:
• Serial Bus (P1394): isochronous desktop peripheral bus (10 MB/s)
• IBM SSA (X3T10.1): dual-ring packetized disk interface (20 MB/s)
• FibreChannel (X3T9.3): merges peripheral and network interconnect; lengths to 10 km, speeds to 100 MB/s, multiple classes of service

But, parallel SCSI is not dead yet
• SCSI-2 is 8 or 16 bits parallel, 5 or 10 MHz (5, 10, 20 MB/s)
• Fast-20, Fast-40 are 20, 40 MHz (20, 40, 80 MB/s)

SCSI-3 isolates SCSI from physical layer

What about other storage technologies?

Optical disk?
• limited market; magnetic disk density surpassed optical last year
• portable use possible, but contact and flying disks are ahead
• long term role in publishing

Magnetic tape?
• very low cost media in robots
• but low data rates (1-10 MB/s) and slow robot switch (4 min)
• long term role in archival storage

Holographic and other advanced media?
• Tamarack, Austin TX, may deliver a 20 GB WORM jukebox; high data rates possible (100 MB/s?)

Recap Magnetic Disk Technology

Rapid density increase is major cause of change

Access time and disk diameter lagging

Embedded intelligence for value-added

Interface technologies up in the air

Competing technologies have niche roles

Current Trends in RAID Marketplace

Rapid market development
• largest growth in storage subsystems for the LAN environment
• broad range of competition available

RAID standards organization
• RAID Advisory Board focuses on defining/qualifying RAID levels
• RAID support may appear in the SCSI-3 specification

RAID Market Trends

Begun 6 years ago, $5.7 billion in 1994
• 23% IBM, 22% EMC, 12% DEC, 10% Compaq, 7% HP, 4% Hitachi, 4% STK, 3% Sun, 3% DG, 2% Tandem, 1% Symbios, 1% Mylex, 1% Auspex, 1% FWB, 1% NEC

Continued growth predicted
• in billion $: 7.8 in 95, 9.7 in 96, 11.6 in 97, 13.2 in 98
• 66% -> 75% into network/minicomputer/multi-user systems
• units shipped 94 thru 97: 400,000 becomes 1,200,000
  subsystems: 213,000 x 3.4 -> 730,000
  boards: 120,000 x 2.5 -> 293,000
  software: 72,000 x 2 -> 140,000

J. Porter, DISK/TREND, 95.

RAID Product History Sample

[Chart: price ($/MB, roughly $20 down to $1) vs. year of introduction (1990-95) for Compaq IDA, EMC Symmetrix, DG Clariion, NCR ADP-92, IBM 9337, Mylex DAC960, Sun FC Array, STK Iceberg, IBM RAMAC, and HP AutoRAID; Sun at $.66/MB.]

• Rule of thumb: RAID costs about 2x raw disk per MB

Symbios, RAID'95 Forum, April 95.

RAID Advisory Board

To promote use and understanding of RAID
• 55-member organization since formation in 92

Education
• RAIDBook - technology and RAID level definitions
• publishes in Computer Technology Review
• hosts the Comdex RAID Technology Center

Standardization
• functional test suite: IO, buffering, queueing, error injection
• performance tests: synthetic TP, FS, DB, video, backup, scientific
• host interface spec: joint development of SCSI-3 RAID support

SCSI-3 Support for Storage Arrays

Storage Array Conversion Layer (SACL)
• defines, manages, accesses, reconstructs array components
• pushes mapping of data and parity into SCSI software

SACL objects
• p_extents - contiguous range of blocks on one device
• ps_extents - portion of a p_extent excluding redundancy
• redundancy groups - group of p_extents sharing protection
• volume sets - group of ps_extents contiguous in user data space
• spare - p_extent, device, or component available to replace a failure

G. Penokie, RAID'95 Forum, April 95.

Recap RAID Market Trends

Rapid market development
• largest growth in storage subsystems for the LAN environment
• broad range of competition available

RAID standards organization
• RAID Advisory Board focuses on defining/qualifying RAID levels
• RAID support may appear in the SCSI-3 specification

RAID Reliability

Single-correcting codes (parity)
• coping with dependent failure modes (controllers, cables, power)
• contrasting mirroring (RAID 1) with parity (RAID 5)
• limitations in ever larger disk arrays

Double-correcting redundancy codes (RAID 6)
• basics: intersecting codewords allowing recovery choices
• binary, one bit per disk (2d parity)
• non-binary, one symbol per disk (Reed-Solomon)
• binary, multiple bits per disk (IBM EvenOdd)

Increasing focus on electronics, software impact

Basic Disk Data Reliability Model

Exponential disk lifetime and repair time
• recovery must not fail; repair with an on-line hot spare

[Figure: parity group of G disks with data A, B, C and parity P; P = XOR(A, B, C), and after an erasure C = XOR(A, B, P).]

disk failure rate: λ = 1/MTTF-disk
disk repair rate: µ = 1/MTTR-disk

[Markov chain: state 0 (all disks up) -> state 1 at rate Gλ; state 1 -> state 0 at repair rate µ; state 1 -> state 2 (data loss) at rate (G-1)λ.]

For MTTF-disk >> MTTR-disk and N parity groups of G disks each:

MTTDL-RAID = (MTTF-disk)² / (N · G · (G-1) · MTTR-disk)

Patterson, Gibson, Katz, Sigmod, 88.
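A numeric sketch of the MTTDL formula above; the disk MTTF, repair time, and array shape below are assumed values, not figures from the slide:

```python
# Illustrative MTTDL calculation (assumed inputs: 150,000-hour disk MTTF,
# 1-hour repair to an on-line spare, 7 parity groups of 10+1 disks).

def mttdl_raid(mttf_disk_h, mttr_disk_h, groups, group_size):
    """Mean time to data loss for `groups` independent parity groups of
    `group_size` disks each, assuming MTTF-disk >> MTTR-disk."""
    return mttf_disk_h ** 2 / (groups * group_size * (group_size - 1) * mttr_disk_h)

hours = mttdl_raid(mttf_disk_h=150_000, mttr_disk_h=1.0, groups=7, group_size=11)
print(f"MTTDL ~ {hours:.3g} hours (~{hours / 8760:.0f} years)")
```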

Arrays contain Support Hardware

More hardware than just disks in an array
• must defend against external power quality first
• combined effects of non-disk components rival disk failures

[Figure: example support-hardware MTTFs around the controller: AC power 4.3 Khours; SCSI host bus adapter 120 Khours; SCSI cable 21,000 Khours; 300W power supply 123 Khours; power cable 10,000 Khours; fan 195 Khours.]

Schulze, Compcon, 89.

Orthogonal Parity Groups

Limit exposure of each parity group to one disk
• organize support hardware into independent columns

[Figure: grid of disks in which each column is a string group sharing support hardware and each row is a parity group, so a string failure costs at most one disk per parity group.]

Mirror versus Parity Data Reliability

Simple mirrors less reliable and more costly
1-hour mean recovery to spare; 150 Khour mean disk and string lifetime

[Charts: mean data lifetime (in 1000s of hours) and 10-year reliability vs. mean string repair time (1-1024 hours), comparing 70(1+1) mirrored configurations (D = 4, 18, 72) with 7(10+1) parity arrays plus 7 or 14 spares (D = 72).]

Gibson, Patterson, J. Parallel and Distributed Computing, 93.

Multiple Failure Tolerance?

I/O parallelism grows with processing speed
• larger arrays have more disks vulnerable to failure

But disk reliability has been growing at 50%/year
• allows 125%/year increase in processing speed at fixed reliability

No need for more powerful failure tolerance
• will disk reliability continue to rise at this rate?

Increasing needs for highly reliable storage
• are customers prepared to pay the performance cost?

Simple, Fast Double Correcting Arrays

Borrow from earlier approaches (1 bit at a time)
• orthogonal parity groups
• double-error-detecting codes from memory systems

[Figure: an extended Hamming layout of information and check disks next to a 2d-parity layout: a 4x4 grid of information disks 0-15 with one parity disk per row (A-D) and one per column (E-H).]

Overheads: check space versus check update time
• 2d-parity has minimal time overhead (3 disks per update), but check space grows as the square root of the data disks
• Hamming has lower space overhead, but higher average time cost
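A small sketch of the 2d-parity bookkeeping for the 4x4 layout in the figure: every small write must also update one row-parity disk and one column-parity disk, the time overhead of 3 noted above. The disk naming (A-D, E-H) follows the figure; the helper itself is illustrative only:

```python
# 2d-parity update sketch: data disk k sits at (row, col) = divmod(k, 4);
# a small write touches that disk plus its row-parity and column-parity
# disks. Disk names follow the figure (A-D row parity, E-H column parity).

ROW_PARITY = ["A", "B", "C", "D"]
COL_PARITY = ["E", "F", "G", "H"]

def disks_touched_by_write(data_disk: int):
    row, col = divmod(data_disk, 4)
    return [str(data_disk), ROW_PARITY[row], COL_PARITY[col]]

print(disks_touched_by_write(6))   # ['6', 'B', 'G']

# Any single failure is repaired from its row; two failures in the same
# row can still be repaired column-by-column, and vice versa.
```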

Gibson, Hellerstein, Asplos III, 89.

Non-binary Codes (P+Q in STK Iceberg)

Exploit existing hardware (Reed-Solomon)

Binary (b=1): C1 = C2 = (A + B) mod 2
• both checks carry the same information, so a double erasure of A and B cannot be solved: A = f(C1, C2)? B = g(C1, C2)?
• a unique check disk pair is needed for each data disk

Nonbinary (b=2): C1 = (A + B) mod 4, C2 = (A + 2B) mod 4
• the two checks are independent equations: A = (2C1 - C2) mod 4, B = (C2 - C1) mod 4
• multiple data disks can share the same check disk pair
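A quick exhaustive check of the nonbinary (b=2) example above, confirming that the two check equations recover any pair of erased 2-bit symbols (this is only the toy mod-4 illustration from the slide, not the Reed-Solomon arithmetic a real controller would use):

```python
# Toy P+Q check over 2-bit symbols with arithmetic mod 4.
for A in range(4):
    for B in range(4):
        C1 = (A + B) % 4
        C2 = (A + 2 * B) % 4
        assert (2 * C1 - C2) % 4 == A     # reconstruct A from the checks
        assert (C2 - C1) % 4 == B         # reconstruct B from the checks
print("mod-4 P+Q example recovers every (A, B) pair")
```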

Gibson, MIT Press, 92.

IBM EvenOdd

EvenOdd example (4 offsets x 5 data disks, plus horizontal and diagonal parity disks):

offset 0:  1 0 1 1 0   p0  d0
offset 1:  0 1 1 0 0   p1  d1
offset 2:  1 1 0 0 0   p2  d2
offset 3:  0 1 0 1 1   p3  d3
          (disks 0-4)  horiz diag

Careful selection of intersecting codewords
• horizontal codewords: a bit from each disk at the same offset
  (p0, p1, p2, p3) = (1, 0, 0, 1)
• diagonal codewords: a bit from each disk at a different offset
  the main diagonal, {disk, offset} = (4,0), (3,1), (2,2), (1,3), has parity 1
  add the main diagonal parity to the parity of each other diagonal:
  (d0, d1, d2, d3) = (1, 1, 1, 1) XOR (1, 1, 0, 1) = (0, 0, 1, 0)

All computations are XOR (no RS hardware)
Small random write updates 3 disks (minimum)
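The example above can be verified mechanically. The sketch below assumes diagonals group cells with (disk + offset) mod 5 constant, with the main diagonal's parity S folded into every other diagonal; this reproduces the slide's numbers:

```python
from functools import reduce

# data[offset][disk], transcribed from the table above (4 offsets, 5 data disks)
data = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
]

def xor(bits):
    return reduce(lambda a, b: a ^ b, bits)

p = [xor(row) for row in data]                              # horizontal parity disk
S = xor(data[off][4 - off] for off in range(4))             # main diagonal: disk + offset = 4
d = [S ^ xor(data[off][(k - off) % 5] for off in range(4))  # other diagonals: disk + offset = k
     for k in range(4)]

assert p == [1, 0, 0, 1] and S == 1 and d == [0, 0, 1, 0]   # matches the slide's example
```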

Blaum, Brady, Bruck, Menon, ISCA, 94.

Trends to Higher Reliability: TQM

Drive reliabilities rising quickly
• 30,000 hr MTTF in 1987 -> 800,000 hr MTTF in 1995
• 50% of failures in electronics

Subsystem reliability rising more slowly
• much larger parts count in array controllers
• much larger software component in array controllers

Subsystem optimizations introduce new risk
• write-back cache failures significant

Increased emphasis on super-disk failure modes

Recap RAID Reliability

Basic reliability decreases linearly in group size
Dependent failures in support hardware: orthogonal arrays
Parity + spares beats simple mirrors on reliability and cost
Need for multiple failure tolerance unclear
Multiple failure tolerance possible at low cost
But small write penalty substantially increased
Real push is tolerance for electronics/software faults

RAID Performance

On-line failure recovery: high availability

Exploiting spares: faster recovery

Writeback caching: optimize write-induced work

Parity logging: defer parity update until efficient

Floating data and parity: remap for fast R-M-W

Log-structured: convert small to large writes

On-Line Failure Recovery Performance

[Figure: disk accesses in a 3+1 group (D D D P) for a fault-free read (one data disk), a fault-free write (read-modify-write of data and parity), a degraded read (all surviving disks), and a degraded write.]

Per-disk load increase ≈ 1 + r + 0.25w
• 50% throughput wall; long response times; long recovery time

Reducing Load Increase: Parity Declustering

[Figure: a logical array of G=4 disks mapped onto a physical array of C=7 disks; parity stripes S, T, U, ... are spread so that each physical disk holds only part of any other disk's parity groups.]

Declustering ratio = (G-1)/(C-1) (see the sketch below)

• Per-disk failure-induced workload increase reduced
• Entire array bandwidth available for reconstruction
• Allows fault-free utilization > ~50%
• Map parity groups using Balanced Incomplete Block Designs or random selection of permutations
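A tiny sketch of the declustering-ratio arithmetic, under the usual interpretation that (G-1)/(C-1) is the fraction of each surviving disk that must be read to reconstruct a failed disk:

```python
# Declustering ratio sketch: G = parity group size, C = number of disks the
# groups are declustered over (C = G gives ordinary RAID 5).

def declustering_ratio(G: int, C: int) -> float:
    return (G - 1) / (C - 1)

print(declustering_ratio(G=4, C=7))    # 0.5  : each survivor reads half its data
print(declustering_ratio(G=10, C=40))  # ~0.23: the 40/10 array on the next slide
print(declustering_ratio(G=10, C=10))  # 1.0  : ordinary RAID 5, no declustering
```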

Muntz, Lui, VLDB, 90. Holland, Gibson, ASPLOS V, 92. Merchant, Yu, FTCS, 92.

Comparing to Multiple RAID Level 5 Groups

RAID 5: 4 groups of 9+1 ⇒ 40 disks, 10% overhead
Declustered: 1 group of C=40, G=10 ⇒ 40 disks, 10% overhead

Performance during reconstruction

[Charts: reconstruction time (sec) and user response time (ms) vs. user I/Os per second (0-1000), comparing the 4x 9+1 RAID 5 with the 40/10 declustered array.]

Holland, Gibson, Siewiorek, J. Distributed and Parallel Databases, 94.

Exploiting Hot Spares: Distributed Sparing

Distribute spare disk space in the same way as parity
• Use the spare actuator(s) to improve fault-free performance
• No need to reconstruct spare units
• Alleviate the spare-disk bottleneck at low declustering ratios

RAID Level 5 with dedicated sparing (offsets 0-4, disks 0-4):
D D D P S
D D P D S
D P D D S
P D D D S
D D D P S

RAID Level 5 with distributed sparing (offsets 0-4, disks 0-4):
D D D P S
D P S D D
S D D D P
D D P S D
P S D D D
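A toy generator that reproduces the 5-disk distributed-sparing map above; the particular rotation rule is just one that matches this figure, not a prescribed placement algorithm:

```python
# Toy distributed-sparing map (D = data, P = parity, S = spare). The
# rotation rule below is merely one that reproduces the figure's layout.

def distributed_sparing_map(disks: int = 5):
    rows = []
    for offset in range(disks):
        p = (disks - 2 - 2 * offset) % disks     # parity column for this offset
        s = (p + 1) % disks                      # spare column sits next to parity
        row = ["D"] * disks
        row[p], row[s] = "P", "S"
        rows.append(" ".join(row))
    return rows

print("\n".join(distributed_sparing_map()))
# D D D P S
# D P S D D
# S D D D P
# D D P S D
# P S D D D
```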

Menon, Mattson, ISCA, 92.

The Small-Write Throughput Problem

RAID Level 5 random small write costs 2X mirrors
• mirroring writes two copies: two disk accesses
• RAID 5 must toggle parity bits when data bits toggle: a read then write of the data, and a read then write of the parity: four disk accesses!
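The parity toggle works out to new parity = old parity XOR old data XOR new data, which is why the small write needs two pre-reads and two writes; a minimal sketch with invented block contents:

```python
# RAID 5 small-write read-modify-write sketch (block contents are made up).

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = b"\x0f\x0f\x0f\x0f"     # pre-read from the data disk
old_parity = b"\x55\x55\x55\x55"     # pre-read from the parity disk
new_data   = b"\xff\x00\xff\x00"     # the block being written

new_parity = xor_bytes(old_parity, xor_bytes(old_data, new_data))
# ...then write new_data to the data disk and new_parity to the parity disk.
```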

Small writes non-negligible: OLTP, Network FS

Major approaches
• Caching: delay writes in cache, schedule the update later
• Logging: delay the parity update in a log, schedule the update later
• Dynamic mapping: rearrange the parity mapping as needed
• most require fault-tolerance of the in-memory cache or mapping tables

Zero Latency Writes: Writeback Caching

Delay writes in NVRAM or in redundant controllers
Prioritize reads over delayed writes
Schedule writeback efficiently
• aggregate small writes into larger writes
• exploit idle time on the actuator
• greedy Shortest Access Time First scheduling

Achieve a deep queue; write costs 6-8 ms
Fast writes -> shorter queueing of reads behind writes

Widely used in most RAID products

Menon, Cortney, ISCA, 93. Solworth, Orji, Sigmod, 90.

Read Performance given Zero Latency Writes

• While idle time is available, read response is independent of writeback work
• Idle time is exhausted in proportion to the writeback work

[Chart: simulated read response time (ms) vs. throughput (IO/sec/disk, 0-80) for cached and uncached 4-drive RAID Level 0 and 5-drive RAID Level 5; 1:1 read/write mix, 4 KB random I/Os, 4 MB cache (0.3%).]

Parity Logging Extension to RAID Level 5

Read/Write data at block rates
• collect the XOR of old and new data into a "parity update log"
Read/Write the log and parity at cylinder rates
• delay reintegration of parity updates into the parity until efficient
Adds 2 more I/Os, but 4 I/Os are > 8 times faster
• disk seconds spent on small writes comparable to mirroring

[Chart: disk data rate by access granularity: 97 KB/s at block size, 720 KB/s at track size, 1467 KB/s at cylinder size.]
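A minimal sketch of the parity-logging write path described above (the data structures are assumptions for illustration, not the paper's implementation): small writes append the XOR of old and new data to a log, and a later batch pass folds the logged updates into parity at near-cylinder rates:

```python
# Parity-logging sketch (illustrative structures only).

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

parity_log = []                                   # append-only parity-update log

def small_write(old_data: bytes, new_data: bytes, block_no: int):
    parity_log.append((block_no, xor_bytes(old_data, new_data)))
    # write new_data to the data disk; parity is NOT updated yet

def reintegrate(parity_blocks: dict):
    """Batch pass: apply the logged updates to parity, then clear the log."""
    for block_no, update in parity_log:
        parity_blocks[block_no] = xor_bytes(parity_blocks[block_no], update)
    parity_log.clear()
```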

Stodolsky, Courtright, Holland, Gibson, ACM Trans. on Computer Systems, 94. Bhide, Dias, IBM TR 17879, 92.

Logging Small Random Write Performance

Simulated results comparable to mirroring

[Chart: average user response time (ms) vs. IOs/second per disk (0-40) for RAID Level 5, parity logging, mirroring, and nonredundant arrays, with cache hits on writes.]

Stodolsky, Courtright, Holland, Gibson, ACM Trans. on Computer Systems, 94.

Dynamic Mapping: Floating Data and Parity

Reduce the cost of one or both read-modify-writes in a small random overwrite

Dynamically remap the "overwrite" to the closest free space after the read-modify
• 1.6 blocks to the next free block if 7% of space is hidden from the user
• read-modify-write response time is 10-20% longer than read-only
• disconnects contiguous data written at different times
• fault-tolerant mapping tables can be too large if data floats

Floating parity only, with a high cache hit ratio
• small random writes complete in little more than 2 access times

Menon, Roche, Kasson, J. Parallel and Distributed Computing, 93.

Dynamic Mapping: Log-Structured

New write data is accumulated and written into an empty stripe
• the large-write optimization computes parity in memory

Background garbage collection of old data

Log-structured file system
• focus on physical contiguity for bandwidth
• garbage collection involves copying

Virtual parity striping
• focus on minimal parity update cost
• garbage collection involves parity recomputation

Rosenblum, Ousterhout, Symp. of Operating Systems Principles, 91. Mogi, Kitsuregawa, Parallel and Distributed Info Systems, 94.

StorageTek Iceberg

Write path (from the figure):
• user data is compressed into the cache
• accumulate a stripe, compute P + Q
• find contiguous free space for the stripe
• parallel write, update the mapping tables

Array organization: 8 x (13+2)

Double-correcting, log-structured, compressed data

Saunders, Storage Tech, RAID'95 Forum, April 95.

Recap RAID Performance

On-line recovery causes a drastic reduction in performance
• read workload doubles on surviving disks in RAID 5
• declustering parity allows tunable degradation and cost
• declustering avoids bottlenecks in multi-group RAID 5
• on-line spares effective for very fast recovery

Reducing the cost of small writes in parity-based arrays
• write caching for immediate ack, write scheduling
• logging parity changes for later efficient processing
• floating allocation for fast R-M-W
• dynamic remapping (log-structured) for large-write optimization

Trends in Transparency

Intelligent storage management

Rapid development of architectures

Aggressive prefetching to increase I/O parallelism

Trends in Storage Organization Transparency

Increasing microprocessor power in the controller
• controllers in 2000 expected to have 200 MHz processors

Cost of managing storage per year is ~7X the storage cost

Strong push to embed management in storage
• dynamic adaptation to the workload
• selection of and migration through redundancy schemes
• support for backup

STK Iceberg, HP AutoRAID leading the push
• dynamic migration of storage representation

Rapid Prototyping and Evaluation for RAID

RAIDFrame: separate policy from mechanism
• Express RAID functions as Directed Acyclic Graphs (DAGs)
• Execute DAGs on an engine unaware of the RAID architecture
• Distributable, portable "RAID N reference model"

[Figure: RAID Level 5 small-write DAG vs. parity logging small-write DAG, built from nodes such as Rod (read old data), Rop (read old parity), Xor, Wnd (write new data), Wnp (write new parity), and Log.]

• Roll-back and roll-forward error handling
• Automatic construction of uncommon DAGs
• Optimization of DAGs
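For illustration only, a small-write operation can be represented as a dependency graph that an architecture-unaware engine executes in topological order; the node set and edges below are an assumption in the spirit of the figure, not RAIDFrame's actual definition:

```python
# Hypothetical dependency structure for a RAID 5 small write, expressed as
# node -> set of predecessor nodes (H = header, T = terminator).

from graphlib import TopologicalSorter

raid5_small_write = {
    "Rod": {"H"},            # read old data after the header node
    "Rop": {"H"},            # read old parity
    "Xor": {"Rod", "Rop"},   # new parity = old parity ^ old data ^ new data
    "Wnd": {"Rod"},          # write new data once the old data is captured
    "Wnp": {"Xor"},          # write new parity
    "T":   {"Wnd", "Wnp"},   # terminator: complete when all writes finish
}

# An engine unaware of the RAID architecture just runs nodes in dependency order.
print(list(TopologicalSorter(raid5_small_write).static_order()))
```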

Gibson, et al., Compcon, 95.

Overcoming Disclosure Bottleneck: Prefetching

[Figure: application hints (F1, F2, F3, ...) flow through file, block, and device request levels (S1, ..., B1, ..., D1, ...) down to the devices.]

• Expose concurrency
  overlap I/O and computation
  overlap I/O and think time
  overlap I/O and I/O!
• I/O optimization
  seek scheduling
  batch processing
• Cache management
  balance buffers between prefetch and demand

Patterson, Gibson, Parallel and Distributed Information Systems, 94.

Examples: Informed Prefetching Filesystem

Annotated applications: text search, 3D visualization, join, speech recognition
Digital UNIX (OSF/1 v2), 5 Fast-SCSI strings, 15 HP2247 disks

[Charts: change in elapsed time and change in I/O stall time (normalized) vs. number of disks (0-16) for Agrep, XDS, Postgres, and Sphinx.]

Patterson, et al., CMU-CS-95-134, 95.

Recap Storage Transparency

Embedding intelligent storage management
• exploit embedded processor power to simplify configuration
• dynamic migration of representations within the storage system

Tools for rapid RAID architecture development
• definition, evaluation, optimization tools

Beyond embarrassingly parallel I/O applications
• aggressive prefetching needed to reduce latency for many tasks

Trends in Network RAID

Avoid file server workstation’s memory

• Fastpath data from disk to network

Handle controller failure as another disk failure

Large arrays needing more than 1 controller

Support client file access faster than 1 controller

• Stripe data on controllers & switched networks

Fastpath Disk and Network

File server processor does not look at most bytes
• attach network and disks to the RAID controller with a fast bus
• scale file network bandwidth and file server CPU separately
• separate high-bandwidth and low-latency traffic on their own networks
Berkeley RAID-II prototype: 15-21 MB/s through LFS

[Figure: Berkeley RAID-II XBUS card: HIPPI source/destination ports and VME links into a 32-bit crossbar, an interleaved memory, an XOR engine, and four VME disk controllers; Ethernet carries control and low-latency transfers for the file server.]

Drapeau, et al., ISCA, 94.

Embedded Controller Parallelism

[Figure: hosts connected to multiple array controllers that share a private controller network in front of the disks.]

HP TickerTaip - private controller network
• scale controller compute power; tolerate controller failures
Borg Technologies promises a similar product

Cao, Lim, Venkataraman, Wilkes, ISCA, 93. D. Stallmo, Borg Tech, RAID'95 Forum, April 95.

Exposed Controller Parallelism

[Figure: hosts and storage servers attached to the same host network; data is striped across the servers and their disks.]

Use the host network for controller/server traffic
• Zebra - log-structured file system over the network
• Swift - distributed server RAID
• Network-attached disk - Conner (ATM), Seagate (FibreChannel)

Hartman, Ousterhout, Symposium on Operating Systems Principles, 93. Cabrera, Long, Computing Systems, 91.

Opportunity for Parallelism Increasing

[Chart: application file bandwidth vs. application performance, placing systems from 1980-1994 (VAX-FFS, MVS, SUN-NFS, AFS, Auspex NFS, Coda, Zebra, MVS G-Nectar, MVS-RAID, CM5-RAID, Scotch PFS, Intel Paragon PFS) along shared-media, switched-LAN, parallel-attached I/O, and parallel block I/O fabric categories.]

Parallel File Systems for Parallel Programs

• Concurrent write sharing
• Globally shared file pointers

• Performance-hungry applications
• High bandwidth to large files
• Application-specific access methods
• Application control over basic PFS parameters

• Limited instances of any specific environment
• Emphasis on portability
• How much integration with the network?

Corbett, Feitelson, Proc. Scalable High-Performance Computing Conf., 94. Gibson, et al., Compcon, 95.

Recap RAID over the Network

Attach storage "closer" to the network
• avoid the workstation memory system

Stripe data over multiple controllers/servers
• private controller network in the box
• use the host LAN for redundancy and controller communication

Support explicit concurrent write sharing
• parallel file systems for parallel programs
• application assistance in data layout, prefetching, checkpointing

Tutorial Summary

• Basic RAID: Levels 1, 3, 5 useful

• Magnetic disk technology stronger than ever

• RAID market well beyond basic RAID

• Increasingly sophisticated function in the subsystem

• How much transparency is too much?

• Striping/RAID over network emerging
