Relational Database Systems 2 2

Relational Database Systems 2 2. Physical Data Storage

Wolf-Tilo Balke Jan-Christoph Kalo Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 2 Physical Data Storage

2.1 Introduction 2.2 Hard Disks 2.3 RAIDs 2.4 SANs and NAS 2.5 Case Study

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 2 2.1 Physical Storage Introduction

• DBMS needs to retrieve, update and process persistently stored data – Storage consideration is an important factor in planning a database system (physical layer) – Remember: The data has to be securely stored, but access to the data should be declarative!

Headquarters in Redwood City, CA

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 3 2.1 Physical Storage Introduction

• Data is stored on a storage media. Media highly differ in terms of – Random Access Speed – Random/Sequential Read/Write speed – Capacity – Cost per Capacity

EN 13.1 Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 4 2.1 Relevant Media Characteristics • Capacity: Quantifies the amount of data which can be stored – Base Units: 1 Bit, 1 Byte = 23 Bit = 8 Bit – Capacity units according to IEC, IEEE, NIST, etc: • Usually used for file sizes and primary storage (for higher degree of confusion, sometimes used with SI abbreviations…) • 1 KiB = 10241 Byte; 1 MiB = 10242 Byte ; 1 GiB = 10243 Byte; … – Capacity units according to SI: • Usually used for advertising secondary/tertiary storage • 1 KB = 10001 Byte ≈ 0.976 KiB; 1 MB = 10002 Byte ≈ 0.954 MiB; 1 GB = 10003 Byte ≈ 0.931 GiB; … – Especially used by the networking community: • 1 Kb = 10001 Bit = 0.125 KB ≈ 0.122 KiB ; 1 Mb = 10002 Bit = 0.125 MB ≈ 0.119 MiB

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 5 2.1 A Kilo-Joke

http://xkcd.com/ Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 6 2.1 Characteristic Parameters

• Random Access Time: Average time to access a random piece of data at a known media position – Usually measured in ms or ns – Within some media, access time can vary depending on position (e.g. hard disks) • Transfer Rate: Average amount consecutive of data which can be transferred per time unit – Usually measured in KB/sec, MB/sec, GB/sec,… – Sometimes also in Kb/sec, Mb/sec, Gb/sec

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 7 2.1 Other characteristics

• Volatile: Memory needs constant power to keep data – Dynamic: Dynamic volatile memory needs to be “refreshed” regularly to keep data – Static: No refresh necessary • Access Modes – Random Access: Any piece of data can be accessed in approximately the same time – Sequential Access: Data can only be accessed in sequential order • Write Mode – Mutable Storage: Can be read and written arbitrarily – Write Once Read Many (WORM) • Interesting for legal issues  Sarbanes-Oxley Act (2002)

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 8 2.1 Online, Nearline, Offline

• Online media – „always on“ – Each single piece of data can be accessed fast – e.g. hard drives, main memory • Nearline media – Compromise between online and offline – Offline media can automatically put “on line” – e.g. juke boxes, robot libraries • Offline media (disconnected media) – Not under direct control of processing unit – Have to be connected manually – e.g. box of backup tapes in basement

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 9 2.1 The Storage Hierarchy

• Media characteristics result in a storage hierarchy • DBMS optimize data distribution among the storage levels – Primary Storage: Fast, limited capacity, high price, usually volatile electronic storage • Frequently used data / current work data – Secondary Storage: Slower, large capacity, lower price • Main stored data – Tertiary Storage: Even slower, huge capacity, even lower price, usually offline • Backup and long term storage of not frequently used data

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 10 2.1 The Storage Hierarchy

Primary

Cache, RAM

~100 ns Speed Secondary Cost Flash, Magnetic Disks ~10 ms

Optical Disks, Tape Tertiary > 1 s

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 11 2.1 Storage Media – Examples

Type Media Size Random Transfer Characteristics Price Price/GB Acc. Speed Speed Pri L1-Processor Cache 32 KB 5 x 10-10 s 15.4 GB/sec Vol, Stat, RA,OL Pri DDR4-Ram 16 GB 3.6 x 10-8 s 66.8 GB/sec Vol, Dyn, Ra, € 90 6€ (RipJaws V OL ) Sec Harddrive SSD 512GB 6 x 10-6 s 550MB/sec Stat, RA, OL € 230 € 0.45 (Samsung 850 PRO) Sec Harddrive Magnetic 4000 GB 1.6 x 10-2 s 171MB/sec Stat, RA, OL € 240 € 0.06 (WD 4TB Red Pro)

Ter Blank recordable 4.7 GB 9.8 x 10-2 s 11 MB/sec Stat, RA, OF, € 0.15/Disk € 0.03 DVD-R disk WORM

Ter LTO-5 tape 1500 GB 58 s 280 MB/sec Stat, SA, OF € 15/Tape € 0.01 (TDK - LTO Ultrium 5 Data Cartridge) Pri= Primary, Sec=Secondary, Ter=Tertiary Vol=Volatile, Stat=Static, Dyn=Dynamic, RA=Random Access, SA=Sequential Access OL=Online, OF=Offline, WORM=Write Once Read Many

Last updated April 2016

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 12 2.2 Magnetic Disk Storage – HDs

• Hard drives are currently the standard for large, cheap and persistent storage – Usually used as the main storage media for most data in a DB • DBMS need to be optimized for efficient disk storage and access – Data access needs to be as fast as possible – Often used data should be accessible with highest speed, rarely needed data may take longer – Different data items needed for certain reoccurring tasks should also be stored/accessed together

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 13 2.2 HD – How does it work?

• Directionally magnetization of a ferromagnetic material • Realized on hard disk platters – Base platter made of non-magnetic aluminum or glass substrate – Magnetic grains worked into base platter to form magnetic regions • Each region represents 1 Bit – Read head can detect magnetization direction of each region – Write head may change direction

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 14 2.2 HD – Notable Technology Advances

• Giant Magnetoresistance Effect (GMR) – Discovered 1988 simultaneously by Peter Grünberg and Albert Fert • Both honored with the 2007 Nobel Prize in Physics – Allows the construction of efficient read heads: • The electric resistance of an alternating ferro- and non-magnetic material giantally changes within changing directed magnetic fields – http://www.research.ibm.com/research/demos/gmr/cyberdemo1.htm

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 15 2.2 HD – Notable Technology Advances

• Perpendicular Recording (used since 2005) – Longitudal Recording limited to ~200 Gb/inch2 due to superparamagnetic effect • Thermal energy may spontaneously change magnetic direction – Perpendicular recording allows for up to 1000 Gb/inch2 – Very simplified: Align magnetic field orthogonal to surface instead of parallel • Magnetic regions can be smaller

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 16 2.2 HD – Notable Technology Advances

• Usage of magnetic grains instead of continuous magnetic material – Between magnetic direction transitions, Neel Spikes are formed • Areas of unsure magnetic direction – Neel Spikes are larger for continuous materials – Magnetic regions can be smaller as the transition width can be reduced

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 17 2.2 HD – Basic Architecture

• A hard disk is made up of multiple double-sided platters – Platter sides are called surfaces – Platters are fixed on main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm) – Each surface has it’s own read and write head – Heads are attached to arms • Arms can position heads along the surface • Heads cannot move independently – Heads have no contact to surface and hover on top of an air bearing

EN 13.2 Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 18 2.2 HD – Basic Architecture

• Each surface is divided into circular tracks – Some disks may use spirals • All tracks of all surfaces with the same diameter are called cylinder – Data within the same cylinder can be accessed very efficiently

EN 13.2 Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 19 2.2 HD – Basic Architecture

• Each track is subdivided into sectors of equal capacity a) Fixed angle sector subdivision • Same number of sectors per track, changing density, constant speed b) Fixed data density • Outer tracks have more sectors than inner tracks • Transfer speed higher on outer tracks • Adjacent sectors can be grouped into clusters

EN 13.2 Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 20 2.2 HD – Reliability

• Hard drives are not completely reliable! – Drives do fail – Means for physical failure recovery are necessary • Backups • Redundancy • Hard drives age and wear down. Wear significantly increases by: – Contact cycles (head parking) – Spindle start-stop – Power-on hours – Operation outside ideal environment • Temperature too low/high • Unstable voltage

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 21 2.2 HD – Reliability

• Reliability measures are statistical values assuming certain usage patterns • Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25°C temperature • Server usage (all per year): 8,760 hours, 250 motor start/stops, 40°C temperature – Non-Recoverable read errors: A sector on the surface cannot be read anymore – the data is lost • Desktop disk: 1 per 1014 read bits, Server: 1 per 1015 read bits • Disk can detect this! – Maximum contact cycles: Maximum number of allowed head contacts (parking) • Usually around 50,000 cycles

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 22 2.2 HD – Reliability

– Mean Time Between Failure (MTBF): Statistically anticipated time for a large disk population failing to 50% • Drive manufactures usually use optimistic simulations to guess the MTBF • Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) – Manufacturers values – Annualized Failure Rate (AFR): Probability of a failure per year based on MTBF

• AFR = OperatingHoursPerYear / MTBFhours • Desktop: 0.34%, Server: 0.73%

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 23 2.2 HD – Reliability

• Failure rate during a hard disks lifespan is not constant • Can be better modeled by the “bathtub curve” having 3 components – Infant Mortality Rate – Wear Out Failures – Random Failures

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 24 2.2 Real World Failure Rates

Careful: 2+ year results are biased. See reference. • Report by Google – 100,000 consumer grade disks (80-400GB, ATA Interface, 5,400- 7,200 RPM) • Results (among others) – Drives fail often! – There is infant mortality – High usage increases infant mortality, but not later failure rates – Observed AFR is around 7% and MTBF 16.6 years!

Failure trends in a large disk drive population E. Pinheiro, W.-D. Weber, L. A. Barroso 5th USENIX conference on File and Storage Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Technologies, 2007 Informationssysteme 25 2.2 Reliability – Considerations

• Assume a storage need of 100 TB. Only following HDs are available – Capacity: 1 TB capacity each – MTBF: 100,000 hours each (ca. 11 years) • Consider using 100 of these disks independently (w/o RAID). – Total Storage: 100,000 GB = 100 TB – MTBF: 1,000 hours (ca. 42 days) – THIS IS BAD! • More sophisticated ways of using multiple disks are needed

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 26 2.2 Solid State Disk (SSD)

• Alternative to hard-drives: SSD – Use microchips which retain data in non-volatile memory chips and contain no moving parts • Use the same interface as hard disk drives – Easy replacement in most applications possible • Key components – Memory – Controller

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 27 2.2 Solid State Disk (SSD)

• Advantages – Low access time and latency – No moving parts  shock resistant • MTBF about 1.5 million hours – Lighter and more energy-efficient than HDDs • Disadvantages – Divided into blocks/pages • If one byte changes the whole page has to be written • The old page will be marked as stale • Only whole blocks can be deleted – Limited ability of being rewritten (between 3,000 and 100,000 cycles per page) • Wear leveling algorithms assure that write operations are equally distributed to the pages • 150 TB can be written during its lifetime

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 28 2.2 Memory

• Flash-Memory – Most SSDs use NAND-based flash memory – Retains memory even without power – Slower than DRAM solutions – Single-level cell versus multi-level cell – Wears down! • DRAM – Use volatile Random Access Memory – Ultrafast data access (< 10 microseconds) – Sometimes use internal battery or external power device to ensure data persistence – Only for Applications that require even faster access, but do not need data persistence after power loss

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 29 2.2 Non-volatile Memory

• Most memory today is volatile – e.g. DRAM – Looses information on power off – Problems with data persistence • Non-volatile memory keeps information even after shutdown – e.g. Flash-Memory, F-RAM, M-RAM, ReRam, CB-RAM

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 30 2.2 Magnetoresistive RAM

• Has been the most promising technology for years – In 2013 a 64MB MRAM main memory module was presented – Planned to replace existing technologies • Advantages to current DRAM solutions – MRAM is as fast as SRAM – Density is comparable to DRAM • Problems – After several years of research the storage density is far from SRAM and DRAM

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 31 2.2 Resistive RAM

• State-of-the-art research favors ReRAM – Has much smaller cell structure than MRAM – Lower energy consumption than flash memory – DRAM is still faster than ReRAM – 1TB of data fits on a chip about the size of a stamp – Compare different storage technologies

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 32 2.2 3D Xpoint

• In July 2015 Intel announced 3D Xpoint – Storage density is comparable to Flash storage – About 1000 times faster than Flash memory – Latency is 10 times smaller than SSD, but still 10 times higher than DRAM • Several other technologies are under development – CBRAM, FeRAM, SCRAM

Relational Database Systems 2 – Wolf-Tilo Balke– Institut für Informationssysteme 33 2.2 HD – Controller

• The disk controller organizes low level Inner System / Mainboard access to the disk – e.g. head positioning, error checking, signal Internal Bus processing – Usually integrated into the disk Host Bus – Provides unified and abstracted interface Adapter to access the disks (e.g. LBA) – Connects disk to a peripheral bus (e.g. IDE, SCSI, FiberChannel, SAS) Peripheral Bus • The host bus adapter (HBA) bridges between the peripheral bus and systems internal bus (like PCIe, PCI) Disk – Internal Bus usually integrated into systems Controller main board – Often confused as being the disk controller • DAS (Directly Attached Storage)

Mechanics

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 34 2.2 HD – Controller

• Sectors can be logically grouped to blocks by the operating system – Sectors in a block do not necessarily need to be adjacent – e.g. NTFS defaults to 4 KiB per block • 8 sectors on a modern disk • Hardware address of a block is combination of – Cylinder number, surface number, block number within track – Controller maps hardware address to logical block address (LBA)

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 35 2.2 HD – Controller

• Disk controller transfers content of whole blocks to buffer – Buffer resides in primary storage and can be accessed efficiently – Time needed to transfer a random block (4KiB/Block on ST3100034AS): (<10 msec) • Seek Time: Time needed to position head to correct cylinder (<8 msec) • Latency (Rotational Delay): Time until the correct block arrives below the head (<0.14 msec) • Block Transfer Time: Time to read all sectors of block (<0.01 msec) – Bulk Transfer Rate for n adjacent blocks (<20msec for n=10) • Seek time + Rotational Delay + n * Block Transfer Rate

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 36 2.2 HD – Controller

• Locating data on a disk is a major bottleneck – Try operating on data already in buffer – Aim for bulk transfer, avoid random block transfer

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 37 2.3 RAID

• A single HD is often not sufficient – Limited capacity – Limited speed – Limited reliability • Idea: Combine multiple HD into a RAID Array (Redundant Array of Independent Disks) – RAID Array treats multiple hardware disks as a single logical disk • More HDs for increased capacity • Parallel access for increased speed • Controlled redundancy for increased reliability

Silber 11.3 Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 38 2.3 RAID Controller

• The RAID controller connects to multiple hard disks – Disks are virtualized and appear to be just one single Internal Bus logical disk RAID Controller – The RAID controller acts as an extended specialized HBA

(Host Bus Adapter) Represented – as single Still DAS (Directly Attached logical Disk Storage) Peripheral Bus

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 39 2.3 RAID Principles – Mirroring

• Mirroring (or shadowing): Increases reliability by complete redundancy • Idea: Mirror Disks are exact copies of original disk – Not space efficient • Read speed can be n times as fast, write speed does not increase • Increases reliability. Assume – Two disks with a MTBF 11 years each • One original disk, one mirror disk • Assume disk failures are independent of each other (unrealistic) – Disk replacement time of 10 hours – ► MTBF of mirror system is >57,000 years!

Silber 11.3 Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 40 2.3 RAID Principles – Striping

• Striping: Improve performance by parallelism • Idea: Distribute data among all disks for increased performance • BitLevel Striping: Split all bits of a byte to the disks – e.g. for 8 disks, write i-th bit to disk i – Number of disk needs to be a power of 2 – Each disk is involved in each access • Access rate does not increase • Read and write transfer speed linearly increases • Simultaneous accesses not possible – Good for speeding up few, sequential and large accesses

Silber 11.3 Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 41 2.3 RAID Principles – Striping

• Block Level Striping: Distribute blocks among the disks – Only one disk is involved reading a specific block • Read and write speed of a single block not increased • Other disks still free to read/write other blocks • Read and write speed of multiple accesses increase – Good for large number of parallel accesses

Silber 11.3 Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 42 2.3 RAID Principles – Error Correction Codes

• Error Correction Codes: Increase reliability with computed redundancy • Hamming Codes (~1940) – Can detect and repair 1 bit errors within a set of n data bits by computing k parity bits • n = 2k - k – 1 • n=1, k=2; n=4, k=3; n = 11, k=4; n = 26, k=5; … – Especially used for in-memory and tape error correction • Media cannot detect errors autonomously • Not really used for hard drives anymore

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 43 2.3 RAID Principles – Error Correction Codes

• Interleaved Parity (Reed-Solomon Algorithm on GD(2) Galois Field) – Can repair 1-bit errors (when the error is known) – Hard Disks can detect read errors themselves, no need for complete Hamming codes – Basic Idea:

• From n data pieces D1,…,Dn compute a parity data Dp by combining data using logical XOR (eXclusive OR) – XOR is associative and commutative – Important: A XOR B XOR B = A

• i.e. Dp= D1 XOR D2 XOR … XOR Dn • Assume D2 was lost. It can be reconstructed by D2= Dp XOR D1 XOR D3 XOR … XOR Dn

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 44 2.3 RAID Principles – Interleaved Parity

0101 • Interleaved Parity. Example: XOR 1100 • A = 0101, B = 1100, C = 1011 XOR 1011 • P = 0010 = A XOR B XOR C P 0010

• C is lost. 0010 – P = A XOR B XOR C XOR 0101 – C = P XOR A XOR B XOR 1100 C 1011 – C = A XOR B XOR C XOR A XOR B – C = A XOR A XOR B XOR B XOR C – C = 0 XOR C – C = 1011

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 45 2.3 RAID in practical applications

• The 3 RAID principles can be combined in multiple ways – Not every combination is useful • This led to the definition of 7 core RAID levels – RAID 0 – RAID 6 – The most dominant levels are RAID 0, RAID 1, RAID 1+0, RAID 5 • In following examples, assume – A MTBF of 100,000 hours (11.42 years) per disk – A Mean Time to Repair (MTTR) of 6 hours – Failure rate is constant and failures between disks are independent

– MTBFraid is the mean time to data loss within the raid if each failing disk is replaced within the MTTR – D is the number of drives in the RAID set

– C=200 GB is capacity of one disk, Craid capacity of whole raid

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 46 2.3 RAID in practical applications

• Mean Time to Repair (MTTR) – MTTR = TimeToNotice + RebuildTime – Assume time to notice of 0.5 hours – Rebuild time is the time for completely writing back lost data • Assume disk capacity of 200GB • Write back speed of 10 MB/sec – Consisting of reading remaining disks – Computing parity / Reconstructing data • Rebuild time around 5.5 hours – During rebuild, a RAID is especially vulnerable – MTTR = 6 hours

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 47 2.3 RAID Levels

• File A (A1-Ax), File B (B1-Bx), File C (C1-Cx) • Raid 0 – Block-Level-Striping only – Increased parallel access and transfer speeds, reduced reliability – All disks contain data (0% overhead) – Works with any number of disks

– MTBFraid = MTBFdisk / D – 4 disks:

• MTBFraid = 2.86 years

• Craid = 800 GB (0 GB wasted (0%)) – Common size: 2 disks

• MTBFraid = 5.72 years

• Craid = 400 GB (0 GB wasted (0%))

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 48 2.3 RAID Levels

• Raid 1 – Mirroring only – Increased reliability, increased read transfer speed, low space efficiency D D-1 – MTBFraid = MTBFdisk / (D! * MTTR ) – 4 disks:

• MTBFraid = 2.2 trillion years

• Craid = 200 GB (600 GB wasted (75%)) • Age of universe may be around 15 billion years… – Common size: 2 disks

• MTBFraid = 95,130 years

• Craid = 200 GB (200 GB wasted (50%))

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 49 2.3 RAID Levels

• RAID 2 – Not used anymore in practice • was used in old mainframes – Bit-Level-Striping – Use Hamming Codes • Usually Hamming Code(7,4) – 4 data bits, 3 parity bits • Reliable 1-Bit error recovery (i.e. one disk may fail) – 3 redundant disks per 4 data disks (75% overhead) • Ratio better for larger number of disks 2 – MTBFraid = MTBFdisk / (D*(D-1) * MTTR) – 7 disks (does not really make sense for 4 – not comparable to other values)

• MTBFraid = 4,530 years • Craid = 800 GB (600 GB wasted (43%))

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 50 2.3 RAID Levels

• RAID 3 – Interleaved-Parity – Byte-Level-Striping – Dedicated Parity Disk • Bottleneck! Every write operation needs to update the parity disk. • No parallel writes – 1 redundant disk per n data disks • Overhead decreases with number of disks while reliability decreases • 25% overhead for 4 data disks 2 – MTBFraid = MTBFdisk / (D*(D-1) * MTTR) – 4 disks

• MTBFraid = 15,854 years • Craid = 600 GB (200 GB wasted (25%))

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 51 2.3 RAID Levels

• RAID 4 – Block-Level Striping – As RAID 3 otherwise – 4 disks (common size)

• MTBFraid = 15,854 years

• Craid = 600 GB (200 GB wasted (25%)) – 5 disks (also common size)

• MTBFraid = 9,513 years

• Craid = 800 GB (200 GB wasted (20%))

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 52 2.3 RAID Levels

• RAID 5 – Parity is distributed among the hard disks • May allow for parallel block writes – As RAID 4 otherwise – Bottleneck when writing many files smaller than a block • Whole parity block has to be read and re-written for each minor write – Can recover from a single disk failure

– MTBFraid and Craid as for RAID 3 & 4

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 53 2.3 RAID Levels

• RAID 6 – Two independent parity blocks distributed among the disks • May be implemented by parity on orthogonal data or by using Reed-Solomon on GF(28) – As RAID 5 otherwise – 2 redundant disks per n data disks • Can recover from a double disk failure • No vulnerability during single failure rebuild • Very suitable for larger arrays • Writer overhead due to more complicated parity computation 3 2 – MTBFraid = MTBFdisk / (D*(D-2)*(D-1) * MTTR ) – 4 disks

• MTBFraid = 132 million years • Craid = 400 GB (400 GB wasted (50%)) – 8 disks (common)

• MTBFraid = 9,437 years (~RAID 5 w. D=5) • Craid = 1,200 GB (400 GB wasted (25%))

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 54 2.3 Practical use of RAIDS

• Additionally, there are hybrid levels combing the core levels – RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, … • Raid 1+0 – Mirrored sets nested in a striped set • RAID 0 on sets of RAID 1 sets – Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size – Most performant RAID combination

– D1= Drives per RAID 1, D0=Number of RAID1 sets D1 D1-1 – MTBFraid = MTBFdisk / (D1! * MTTR ) / D0

– 4 disks: D1 = 2, D0= 2

• MTBFraid = 47,565 years

• Craid = 400 GB (400 GB wasted (50%))

– 6 disks: D1 = 2, D0= 3

• MTBFraid = 31,706 years

• Craid = 600 GB (600 GB wasted (50%))

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 55 2.4 Beyond RAID

• RAIDs controllers directly connect storage to the system bus – Storage available to only one system/ server/ application • Number of disks is limited – Consumer grade RAID: 2-4 disks – Enterprise grade RAID: 8-24+ disks • Solutions – NAS (Network Attached Storage): Provide abstracted file systems via network (software solution) – SAN (Storage Area Network): Virtualized logical storage within a specialized network on block level (hardware solution)

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 56 2.4 File Systems vs. Raw Devices

Application • Before discussing NAS, we need file Software systems • A file systems is a software for abstracting file operations on a logical storage device – Files are a collection of binary data File System • Creating, reading, writing, deleting, finding, organizing – How does a file access translate into top-level operations on a logical storage device? • e.g. which blocks have to be read/written? Logical • Bridge between application software Storage and (abstracted) hardware

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 57 2.4 File Systems vs. Raw Devices

Application • Raw Devices access allows Software applications to bypass the OS and the file system • Application may directly tune aspects of physical storage – There is still the hard drive controller…so, its not really direct • May lead to very efficient

implementations Logical – Used for e.g. high performance Storage database, system virtualization, etc

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 58 2.4 NAS – Network Attached Storage

• Idea: Provide a remote file system using already Application available network infrastructure Software – NAS: Network Attached Storage – Use specialized network protocols (e.g. CIFS, NFS, FTP, etc) – Easiest case: File Server (e.g. Linux+Samba) Network • Advantages: – Easy to setup, easy to use, cheap infrastructure – Allows sharing of storage among several systems – Abstracts on file system level (easy for most applications) File System • Disadvantages – Inefficient and slow

• large protocol and processing overhead NAS Server NAS – Abstracts on file system level (not suitable for special Logical purposes like raw devices or storage virtualization) Storage

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 59 2.4 SAN – Storage Area Network

• SANs offer specialized high-speed networks for storage devices Application – Usually uses local FibreChannel networks Software – Remote location may be connected via Ethernet or IP-WAN (Internet) – Network uses specialized storage protocols File System • iFCP (SCSI on FiberChannel) • iSCSI (SCSI on TCP/IP) • HyperSCSI (SCSI on raw ethernet) • SANs provide raw block level access to SAN logical storage devices – Logical disks of any size can be offered by the SAN – For a client system using a logical disk, it might appears like a local disk or RAID – Client system has full control over file systems on Logical logical disks Storage

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 60 2.4 SAN – Storage Area Network

SAN Switch

SAN SAN Ethernet HBA HBA Network

WAN-SAN Bus (HyperSCSI) NAS Protocol (CIFS) NAS SAN Switch SAN Switch Head

SAN Bus (iFCP)

SAN/RAID HBA

Peripheral Bus (SCSI, SAS, etc.)

SAN HBA SAN HBA

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 61 2.4 SAN – Storage Area Network

• Advantages: – Very efficient • Highly optimized local network infrastructure • Optimized protocols with low overhead – Very flexible (any number of systems may use any number of disks at any location) – Helps for disaster protection • SAN can transparently span to even remote locations – May also employ NAS heads for NAS-like behavior • Disadvantages – Expensive

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 62 2 Physical Storage

• There are different types of storage – Usually, there is a storage hierarchy • Faster, smaller, more expensive storage • Slower, bigger, less expensive storage • Hard drives are currently the most popular media – Mechanical device • High sequential transfer rates, • Bad random access times, low random transfer rates • Prone to failure – DBMS must be optimized for the used storage devices!

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 63 2 Next Lecture

• Access Pathes – Physical Data Access – Index Structures – Physical Tuning

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 64