Data Management Systems

• Storage Management
• Memory hierarchy
• Capacity and latencies
• Segments and file storage
• Locality and replacement policies
• The buffer
• Hardware evolution
• Storage techniques in context

Gustavo Alonso Institute of Computing Platforms Department of Computer Science ETH Zürich

Storage - Memory Hierarchy 1

In an ideal world …

The database would have an unlimited amount of memory that offers plenty of bandwidth for sequential and concurrent access and very low latency for random access, is persistent over time, and comes at a low cost.

Instead, databases rely on complex architectures and optimizations to provide the illusion of a large memory capacity and to hide the performance problems that arise when trying to deliver all those desirable properties at once.

The memory wall

• Main memory suffers from several issues:
  • There is never enough of it (application growth)
  • Memory outside the CPU chip (DRAM) is much slower than memory located in the CPU => memory wall
  • Processor-memory gap: processor speeds increased much faster than memory speeds
  • Price becomes a problem in the context of management (DRAM is expensive)
  • Main memory is not persistent
• Over time, a complex hierarchy evolved trying to address all these issues

CPU Registers

Caches

Main memory (DRAM)

External storage (local persistent storage)

External storage (remote persistent storage)

Archive storage

Looking at the memory hierarchy

• The memory hierarchy is a rather complex construct affected by many parameters
  • Capacity
  • Cost
  • Latency
  • Bandwidth
• It keeps evolving as the parameters of each component change over time
• It keeps evolving as new technology becomes available
• Disclaimer: numbers provided as a reference (they vary a lot)

Capacity

CPU Registers: 64-bit architecture, 16 x 64b general purpose, 32 x 512b AVX
Caches: L1i 32 KB, L1d 32 KB, L2 256 KB - 1 MB, L3 8 MB - 45 MB
Main memory (DRAM): 1 to 1000 GB
External storage (local persistent storage): few terabytes
External storage (remote persistent storage): many terabytes
Archive storage: petabytes

Latency

CPU Registers: sub-nanosecond (1 cycle)
Caches: L1 0.5-1 ns, L2 4-8 ns, L3 15-30 ns
Main memory (DRAM): 100 ns
External storage (local persistent storage): microseconds (SSD), milliseconds (HDD)
External storage (remote persistent storage): milliseconds
Archive storage: seconds, minutes

Access

CPU Registers: sub-nanosecond (1 cycle)
Caches and main memory (DRAM): byte addressable, random access
External storage (local and remote persistent storage): block addressable, sequential access
Archive storage

What does this all mean?

• The performance gaps between layers are huge (difficult to imagine at human scales)
• We process an increasing amount of data, resulting in even more pressure on the memory system
• Data movement is one of the major sources of energy consumption and inefficiencies in modern computers (and data centers)
• Performance and efficiency are largely determined by how well the database manages the movement of data across the hierarchy
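To get a feel for these gaps, the latencies quoted for the hierarchy can be rescaled so that one register access lasts one second. The following is a minimal sketch in Python. The concrete values (0.3 ns per cycle, 100 µs for an SSD, 10 ms for an HDD, and so on) are assumptions picked from the ranges given earlier, not measurements, and the function names are purely illustrative.

```python
# Representative latencies in nanoseconds.
# Assumed values, chosen only to show orders of magnitude.
LATENCY_NS = {
    "CPU register": 0.3,
    "L1 cache": 1,
    "L2 cache": 5,
    "L3 cache": 20,
    "Main memory (DRAM)": 100,
    "SSD": 100_000,       # 100 microseconds
    "HDD": 10_000_000,    # 10 milliseconds
}


def human_scale(latency_ns, base_ns=LATENCY_NS["CPU register"]):
    """Rescale a latency so that `base_ns` corresponds to one second."""
    return latency_ns / base_ns


def humanize(seconds):
    """Format a duration in seconds using the largest unit that fits."""
    units = (("years", 365 * 86_400), ("days", 86_400),
             ("hours", 3_600), ("minutes", 60))
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"


for layer, ns in LATENCY_NS.items():
    print(f"{layer:20s} {ns:>14,.1f} ns  ->  {humanize(human_scale(ns))}")
```

Under these assumptions, a DRAM access stretches to several minutes and an HDD access to roughly a year, which is why a database works so hard to avoid unnecessary trips to the lower layers.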

Locality (spatial and temporal)

• The unit of transfer between layers in the memory hierarchy is typically fixed
• To improve performance, it is important to exploit
  • Spatial locality (put together what belongs together)
  • Temporal locality (do at the same time things that require the same data)
• Managing the hierarchy amounts to improving spatial and temporal locality

[Figure: three queries (SELECT * FROM T WHERE X > 10; SELECT * FROM T; SELECT * FROM T WHERE Y = 20) and data items A, B, C, D, E grouped into a transfer unit]
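The effect of spatial locality with a fixed transfer unit can be illustrated by counting how many units must be moved to serve one request. This is only a sketch: the unit size of four items, the item numbering, and the two placements are hypothetical and chosen to make the contrast visible.

```python
def units_touched(placement, wanted, unit_size):
    """Count the distinct transfer units needed to read the items in `wanted`.

    `placement` lists the items in storage order; the item at position p
    lives in transfer unit p // unit_size.
    """
    position = {item: p for p, item in enumerate(placement)}
    return len({position[item] // unit_size for item in wanted})


UNIT_SIZE = 4                 # items per transfer unit (assumed)
wanted = [0, 1, 2, 3]         # items that one query needs together

# Items that belong together are stored next to each other.
clustered = list(range(16))

# The same items spread over the storage space.
scattered = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]

print(units_touched(clustered, wanted, UNIT_SIZE))  # 1
print(units_touched(scattered, wanted, UNIT_SIZE))  # 4
```

The query reads the same four items in both cases, but the scattered placement moves four times as much data across the hierarchy. Temporal locality is the complementary idea: if several queries need the same unit, running them close together in time lets them share a single transfer.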

What needs to be done?

• Enhance temporal and spatial locality (data organization, query scheduling)
• Make sure the data is available at the layer where it is needed, to hide the latency caused by getting data from lower layers (pre-fetching)
• Be clever about what to keep at each layer (caching strategies, replacement strategies)
• Keep track of modifications and write back to the lower layers (all the way to persistent storage) when needed
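Two of these tasks, deciding what to keep at a layer and writing modifications back to the layer below, can be sketched with a small page buffer. The example uses LRU as the replacement strategy purely as an illustration; the slides do not prescribe a particular policy, and the class `LRUBuffer`, its methods, and the dictionary standing in for the lower layer are all hypothetical. Pre-fetching is not covered.

```python
from collections import OrderedDict


class LRUBuffer:
    """Fixed-capacity page buffer with LRU replacement and write-back."""

    def __init__(self, capacity, storage):
        self.capacity = capacity
        self.storage = storage       # lower layer: dict of page_id -> data
        self.pages = OrderedDict()   # buffered pages, least recently used first
        self.dirty = set()           # pages modified since they were loaded
        self.misses = 0

    def _evict(self):
        # Remove the least recently used page; write it back if modified.
        victim, data = self.pages.popitem(last=False)
        if victim in self.dirty:
            self.storage[victim] = data
            self.dirty.discard(victim)

    def _load(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # now the most recently used
            return
        self.misses += 1
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page_id] = self.storage[page_id]

    def read(self, page_id):
        self._load(page_id)
        return self.pages[page_id]

    def write(self, page_id, data):
        self._load(page_id)
        self.pages[page_id] = data
        self.dirty.add(page_id)               # remember to write it back later

    def flush(self):
        # Push all pending modifications down to the lower layer.
        for page_id in self.dirty:
            self.storage[page_id] = self.pages[page_id]
        self.dirty.clear()


storage = {1: "a", 2: "b", 3: "c"}
buf = LRUBuffer(capacity=2, storage=storage)
buf.write(1, "A")   # miss; page 1 is changed in the buffer only
buf.read(2)         # miss
buf.read(3)         # miss; evicts page 1 (least recently used) and writes it back
print(storage[1], buf.misses)
```

The modification to page 1 reaches the lower layer only when the page is evicted or when `flush` is called, which is the "write back when needed" part of the list above.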

Reality is complex and getting even more so

• Managing the memory hierarchy was never easy
  • No perfect solution
  • Workload dependent
  • Many compromises needed
• Problem is becoming far more involved due to architectural developments
  • Multicore and NUMA
  • Non-Volatile Memory
  • and economies of scale
  • Network attached storage
  • Hardware Acceleration

Multicore and NUMA

[Figure: AMD Bulldozer]

Non-Volatile Memory (NVM)

Non-Volatile memory is a new form of memory combining characteristics of DRAM and persistent storage:
• Cheaper than DRAM
• Byte addressable
• Random access
• Persistent
• Faster than disks
• Can be used as
  • Memory
  • Local disk
  • Network attached

[Figure: the memory hierarchy (CPU Registers, Caches, Main memory (DRAM), External storage (local persistent storage), External storage (remote persistent storage), Archive storage) with NVM placed between main memory and local external storage]

Cloud computing

• The ephemeral nature of the computing infrastructure forces a separation of compute and storage.
• Gives more flexibility to the cloud provider
• Has changed the nature of “disk” and “storage” in fundamental ways
• Crucial for cloud native

[Figure: Compute layer, Network, Storage layer]

Network attached storage

• Storage devices offer limited bandwidth and relatively high latencies
• Motivated by cloud designs, networks are becoming faster and have more bandwidth
• The round-trip time in a data center is shorter than a seek operation on an HDD
• RDMA (Remote Direct Memory Access) reduces latencies by removing OS-related inefficiencies
• Eventually it might be faster to get data from the memory of a remote machine or remote storage device than from a local disk.

Hardware Acceleration

[Figure: Oracle M7 SPARC processor]

Summary

• Dealing with the memory hierarchy is a key aspect of the architecture of data management systems
• Very old problem, still relevant
• Many fundamental concepts still applicable today due to the way systems are evolving
