
Progress & Challenges for Management

Kathryn S. McKinley, Microsoft Research


The Storage Gap

[Chart: the digital universe (petabytes of data created) is growing far faster than installed storage capacity and capacity shipped, leaving a widening gap. Source: IDC]

Memory capacity for $10,000

[Chart: memory purchasable for $10,000*, 1980-2015. Capacity grew from roughly 1 MB to hundreds of GB (100 servers reach the TB range), still far below the 2^48-byte addressable limit of 64-bit processors. *Inflation-adjusted 2011 USD, from jcmit.com]

Memory technologies

Flash, DRAM, PCM (a chalcogenide resistor heated between electrodes), and DNA storage.

Heterogeneous Memory Systems

Mix of DRAM, Flash, and NVM; memory coupled with FPGAs & special-purpose ASICs.

Server software

Mobile + Web + Cloud

Traditional batch throughput + interactive latency.

Scaling up & out

Basic Server Architecture

[Diagram: an aggregator fans requests out to leaf servers.]

Server workloads scaled to memory sizes & CPUs

Applications must fit in memory

1000" Conviva customer search logs

100" 60 GB Amazon EC2 server, one core

10" """"Queries"per"second" Corollary

Queries per seconds 1" 1" 2" 4" 8" 16" 32" 64" 128" no swapping Raw uncompressedRaw"input"size" data size Elas7csearch" MongoDB" Cassandra"

Source: R. Agarwal [NSDI’15, NSDI’16] Road map

• Fitting in memory

• Better big memory hardware

• Implications for memory management

Data in memory is compressible

Compressed swapping: it is now cheaper to compress a page and copy it within memory than to write an uncompressed page out to disk.

Apple & VMware do it
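As a concrete illustration, here is a minimal C sketch of a compressed swap cache in the spirit of those systems. The `compress`/`uncompress`/`compressBound` calls are the real zlib API; the slot structure and the `swap_out`/`swap_in` names are hypothetical simplifications (no pool management or eviction policy).

```c
#include <stdlib.h>
#include <zlib.h>

#define PAGE_SIZE 4096

typedef struct {
    unsigned char *data;   /* compressed bytes, malloc'd */
    uLongf len;            /* compressed length */
} swapped_page_t;

/* "Swap out": compress the cold page into an in-memory slot
 * instead of writing 4096 bytes to disk. */
int swap_out(const unsigned char page[PAGE_SIZE], swapped_page_t *slot) {
    uLongf cap = compressBound(PAGE_SIZE);
    slot->data = malloc(cap);
    if (!slot->data) return -1;
    slot->len = cap;
    if (compress(slot->data, &slot->len, page, PAGE_SIZE) != Z_OK) {
        free(slot->data);
        return -1;
    }
    return 0;  /* page now occupies slot->len bytes, not 4096 */
}

/* "Swap in": decompress back into a real page frame. */
int swap_in(const swapped_page_t *slot, unsigned char page[PAGE_SIZE]) {
    uLongf out = PAGE_SIZE;
    int rc = uncompress(page, &out, slot->data, slot->len);
    return (rc == Z_OK && out == PAGE_SIZE) ? 0 : -1;
}
```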

Can we compute on compressed data?

Limit study: run programs, take heap dumps, and model the savings from compression techniques on objects and arrays: value-set indirection, value-set caching, deep-equal sharing, zeros, and hybrids. [Chart: compression rates of roughly 10-60% across the models.] Sartor et al. [ISMM’08, PLDI’10]

Reality of compressed arrays: up to 49% compression, but only 6% on average, and fast direct access becomes slow indirect access.

Arraylet layout: an array header with the first N elements inlined, then a spine of pointers to fixed-size arraylets holding the remaining elements. Arraylets live in a shared arraylet space, and all-zero chunks share a single zero arraylet.
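A minimal C sketch of that layout, assuming int elements, fixed arraylet and inline sizes, and illustrative field names (this is not the exact layout from Sartor et al.):

```c
#include <stddef.h>

#define ARRAYLET_ELEMS 256   /* elements per arraylet (assumed) */
#define INLINE_ELEMS   8     /* first N elements stored inline (assumed) */

/* One shared, read-only arraylet of zeros: all-zero chunks point here. */
static const int zero_arraylet[ARRAYLET_ELEMS];

typedef struct {
    size_t length;                    /* array header: element count */
    int inline_elems[INLINE_ELEMS];   /* first N elements, fast path */
    const int **spine;                /* pointers to arraylets for the rest */
} arraylet_array_t;

/* Inline elements are one load; the rest pay the spine indirection. */
static int get(const arraylet_array_t *a, size_t i) {
    if (i < INLINE_ELEMS)
        return a->inline_elems[i];
    size_t j = i - INLINE_ELEMS;
    return a->spine[j / ARRAYLET_ELEMS][j % ARRAYLET_ELEMS];
}

/* Share the zero arraylet for an all-zero chunk; a write to such a
 * chunk must first allocate a private arraylet (copy-on-write). */
static void share_zero_chunk(arraylet_array_t *a, size_t chunk) {
    a->spine[chunk] = zero_arraylet;
}
```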

Are databases compressible?

[Diagram: data scans give low storage cost but low throughput; indexes give high throughput but high storage cost. Source: R. Agarwal [NSDI’15, NSDI’16]]

Databases are compressible!

Succinct insight: queries execute directly on compressed data; the compression index serves as the search index.

• Burrows-Wheeler Transform (as in bzip) and suffix arrays
• New data structures & query algorithms
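To make the "index is the search structure" idea concrete, here is a toy C sketch that builds a suffix array and answers substring queries by binary search, with no scan of the data. Succinct goes much further, storing the index itself in compressed (BWT-based) form; this sketch works on uncompressed text and uses a deliberately naive O(n² log n) construction.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;  /* shared with the qsort comparator */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Build: sort all suffix start positions (toy construction). */
static int *build_sa(const char *t, int n) {
    int *sa = malloc(n * sizeof *sa);
    for (int i = 0; i < n; i++) sa[i] = i;
    text = t;
    qsort(sa, n, sizeof *sa, cmp_suffix);
    return sa;
}

/* Query: binary search for any suffix whose prefix matches pat. */
static int search(const char *t, const int *sa, int n, const char *pat) {
    int lo = 0, hi = n - 1, m = (int)strlen(pat);
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strncmp(t + sa[mid], pat, m);
        if (c == 0) return sa[mid];      /* match offset in text */
        if (c < 0) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

int main(void) {
    const char *t = "compressed indexes serve as search indexes";
    int n = (int)strlen(t);
    int *sa = build_sa(t, n);
    printf("'search' found at offset %d\n", search(t, sa, n, "search"));
    free(sa);
    return 0;
}
```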

Source: R. Agarwal [NSDI’15, NSDI’16]

Advantages
• Compression index is the data index
• No scans
• No decompression
• Queries: search, range, random access, regular expressions

Limitations
• More CPU operations per data access
• No in-place updates
• Preprocessing time

Source: R. Agarwal [NSDI’15, NSDI’16]

Order of magnitude more data
[Chart: total memory footprint (GB) vs. raw uncompressed data size, 1-128 GB. Cassandra, MongoDB, and Elasticsearch cross the 300 GB RAM line long before Succinct does. Source: R. Agarwal [NSDI’15, NSDI’16]]

Throughput
[Chart: queries per second (log scale, 1-10,000) vs. raw uncompressed data size, 1-128 GB, for DB-X (scans), Elasticsearch, and Succinct.]

Fitting in memory is necessary, but not sufficient.

Road map

• Fitting in memory

• Better big memory hardware

• Implications for operating system memory management

Scaling up & out

Basic Server Architecture

[Diagram: an aggregator fans requests out to leaf servers.]

Server workloads scaled to memory sizes & CPUs

Hardware organization
[Diagram: a process's virtual memory maps to physical memory through the page table. The TLB (Translation Lookaside Buffer) caches translations: hits serve the process directly; misses are filled from the page table. Translation sits on the CPU's critical path.]

Virtualization exacerbates TLB limitations
[Diagram: under virtualization, a process's virtual address translates through the guest page table and then the host page table to reach host physical memory.]

TLB misses are costly: a native x86-64 miss walks up to 4 page-table levels (4 memory accesses), and under virtualization a nested walk incurs 4 + 4×5 = 24 memory accesses!

Shadow page tables are one alternative [TACO’13].

Memory capacity for $10,000

[Chart: memory for $10,000* vs. L1 TLB entries, 1980-2015. Memory grew from roughly 1 MB to hundreds of GB, while L1 TLB entries grew only from about 10 to 100 (72-96 in recent processors). *Inflation-adjusted 2011 USD, from jcmit.com]

Bigger pages extend TLB reach

Large pages (2 MB and 1 GB on x86-64), aligned in physical memory.
• Explicit huge pages: programmer burden
• Transparent huge pages: internal fragmentation, up to half a page size wasted, depending on policy
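On Linux, one concrete way to request bigger pages is transparent huge pages via `madvise` (a real Linux API); the numbers in the comment work through the reach arithmetic: with 64 L1 TLB entries, reach is 64 × 4 KB = 256 KB for base pages versus 64 × 2 MB = 128 MB for 2 MB pages.

```c
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 1ul << 30;  /* 1 GB anonymous region */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint: back this range with 2 MB pages if the kernel can.
     * 64 TLB entries x 4 KB = 256 KB reach; x 2 MB = 128 MB reach. */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");  /* advisory; may be unsupported */

    /* ... touch the memory; khugepaged may promote aligned 2 MB chunks ... */
    munmap(p, len);
    return 0;
}
```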

One size does not fit all!

Paged memory

[Diagram: individual virtual pages map to scattered physical pages.]

Key observation: ranges of contiguity

[Diagram: a process's virtual memory is organized into contiguous segments: code, heap, stack, shared libraries.]

How much range contiguity is there? |Ranges| vs. |Pages|
[Chart: percentage of total application address space covered vs. number of entries (log scale, 1-10,000) for astar, mcf, cactusADM, canneal, graph500, memcached, and tigr. A few dozen ranges cover as much of the address space as thousands of pages.]

Idea: Redundant memory mappings

[Diagram: code, heap, stack, and shared-library ranges map to contiguous physical memory, redundantly with the ordinary page mappings.]

Range translations: Base, Limit, Offset

[Diagram: BASE and LIMIT delimit a contiguous region of virtual memory; OFFSET relocates it into physical memory.]

Range translation: maps contiguous virtual pages to contiguous physical pages with uniform protection.
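A functional model of that lookup in C. In RMM the equivalent lookup is performed in hardware by a range TLB alongside the regular TLB, so the structure and names below are only illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uintptr_t base;    /* first virtual address in the range */
    uintptr_t limit;   /* one past the last virtual address  */
    intptr_t  offset;  /* physical = virtual + offset        */
    unsigned  prot;    /* uniform protection bits for range  */
} range_translation_t;

/* Translate va through a table of ranges; fall back to paging on miss. */
static bool range_translate(const range_translation_t *rt, int n,
                            uintptr_t va, uintptr_t *pa) {
    for (int i = 0; i < n; i++) {
        if (va >= rt[i].base && va < rt[i].limit) {
            *pa = (uintptr_t)(va + rt[i].offset);  /* one add: no page walk */
            return true;
        }
    }
    return false;  /* miss: use the regular page-table walk */
}
```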

OS memory management support

Demand paging
[Diagram: physical pages are allocated one at a time, on first touch.]
Does not facilitate physical range creation.

Eager paging

[Diagram: eager paging allocates contiguous physical ranges up front, at allocation request time.]
Increases range sizes, but can waste memory.
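From user space, Linux's `MAP_POPULATE` flag (a real mmap flag) approximates the eager half of this contrast by pre-faulting an entire mapping at creation time; as the comments note, true eager paging with physically contiguous ranges needs allocator support inside the OS.

```c
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 256ul << 20;  /* 256 MB */

    /* Demand paging: physical frames arrive one fault at a time. */
    void *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Eager population: fault the whole range in now. */
    void *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    if (lazy == MAP_FAILED || eager == MAP_FAILED) { perror("mmap"); return 1; }

    /* Note: MAP_POPULATE does not guarantee physically contiguous frames;
     * building real ranges requires eager paging inside the OS allocator. */
    munmap(lazy, len);
    munmap(eager, len);
    return 0;
}
```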

It works: eliminates VM overheads

[Chart: execution-time overhead (-5% to 45%) for cactusADM, canneal, graph500, mcf, and tigr under 4 KB pages, clustered TLB (CTLB), transparent huge pages (THP), direct segments (DS), and RMM. RMM overheads: cactusADM 0.25%, canneal 0.40%, graph500 0.14%, mcf 0.26%, tigr 1.06%. ISCA’15, IEEE MICRO Top Picks]

Road map

• Fitting in memory

• Better big memory hardware

• Implications for operating system memory management

Variable page sizes and ranges require contiguity; base pages do not.

OS Buddy allocation

Design goals
• Power-of-2 page sizes
• Fast allocation
• Fast coalescing

[Diagram: recursive splitting of memory into power-of-2 buddies; a sketch follows. Source: codeproject.com]
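For reference, a minimal C sketch of a power-of-2 buddy allocator exhibiting those goals: allocation splits larger blocks down, and freeing coalesces with the buddy found by an XOR of the block offset. Pool size, names, and the absence of error handling are all simplifications.

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_ORDER 12           /* orders run from 2^12 = 4 KB ... */
#define MAX_ORDER 20           /* ... up to a 2^20 = 1 MB pool   */

static uint8_t pool[1u << MAX_ORDER];

typedef struct block { struct block *next; } block_t;
static block_t *free_lists[MAX_ORDER + 1];   /* one free list per order */

void buddy_init(void) {
    free_lists[MAX_ORDER] = (block_t *)pool;
    free_lists[MAX_ORDER]->next = NULL;
}

void *buddy_alloc(int order) {
    int o = order;
    while (o <= MAX_ORDER && !free_lists[o]) o++;   /* find a big-enough block */
    if (o > MAX_ORDER) return NULL;
    block_t *b = free_lists[o];
    free_lists[o] = b->next;
    while (o > order) {                             /* split down, keep halves */
        o--;
        block_t *buddy = (block_t *)((uint8_t *)b + (1u << o));
        buddy->next = free_lists[o];
        free_lists[o] = buddy;
    }
    return b;
}

void buddy_free(void *p, int order) {
    uintptr_t off = (uint8_t *)p - pool;
    while (order < MAX_ORDER) {
        uintptr_t buddy_off = off ^ ((uintptr_t)1 << order);  /* XOR finds buddy */
        block_t **cur = &free_lists[order];
        while (*cur && (uint8_t *)*cur != pool + buddy_off)   /* is buddy free? */
            cur = &(*cur)->next;
        if (!*cur) break;                           /* buddy in use: stop */
        *cur = (*cur)->next;                        /* unlink buddy       */
        off &= ~((uintptr_t)1 << order);            /* keep lower half    */
        order++;                                    /* coalesce upward    */
    }
    block_t *b = (block_t *)(pool + off);
    b->next = free_lists[order];
    free_lists[order] = b;
}
```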

Memory management is a space-time tradeoff

Lessons from garbage collection

GC Fundamentals
Allocation: free list, bump allocation
Identification: tracing (implicit), reference counting (explicit)
Reclamation: sweep-to-free, compact, evacuate

Canonical Garbage Collectors
• Mark-Sweep [McCarthy 1960]: free-list allocation + trace + sweep-to-free
• Mark-Compact [Styger 1967]: bump allocation + trace + compact
• Semi-Space [Cheney 1970]: bump allocation + trace + evacuate
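Of the three, Semi-Space is compact enough to sketch. Below is a toy C version of Cheney's algorithm for a minimal object model (a header plus pointer fields); real collectors need root enumeration, type information, and space-exhaustion checks, all assumed away here.

```c
#include <stddef.h>
#include <string.h>

#define SPACE_WORDS (1 << 17)

typedef struct obj {
    size_t nfields;            /* number of pointer fields */
    struct obj *forward;       /* forwarding pointer, set during GC */
    struct obj *fields[];      /* the fields (C99 flexible array) */
} obj_t;

static void *space_a[SPACE_WORDS], *space_b[SPACE_WORDS];
static char *from_space = (char *)space_a, *to_space = (char *)space_b;
static char *alloc_ptr = (char *)space_a;

static size_t obj_bytes(const obj_t *o) {
    return sizeof(obj_t) + o->nfields * sizeof(obj_t *);
}

/* Bump allocation: the fast path that Semi-Space buys. A real VM
 * checks for exhaustion here and calls gc() when the space fills. */
obj_t *alloc(size_t nfields) {
    obj_t *o = (obj_t *)alloc_ptr;
    alloc_ptr += sizeof(obj_t) + nfields * sizeof(obj_t *);
    o->nfields = nfields;
    o->forward = NULL;
    memset(o->fields, 0, nfields * sizeof(obj_t *));
    return o;
}

/* Evacuate one object to to-space, or return its existing copy. */
static obj_t *forward(obj_t *o, char **free) {
    if (!o) return NULL;
    if (o->forward) return o->forward;          /* already evacuated */
    obj_t *copy = (obj_t *)*free;
    memcpy(copy, o, obj_bytes(o));
    copy->forward = NULL;
    *free += obj_bytes(o);
    o->forward = copy;                          /* leave forwarding ptr */
    return copy;
}

/* Cheney's algorithm: copy roots, then scan to-space breadth-first;
 * the scan pointer chasing the free pointer is the whole trace. */
void gc(obj_t **roots, size_t nroots) {
    char *free = to_space, *scan = to_space;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = forward(roots[i], &free);
    while (scan < free) {
        obj_t *o = (obj_t *)scan;
        for (size_t i = 0; i < o->nfields; i++)
            o->fields[i] = forward(o->fields[i], &free);
        scan += obj_bytes(o);
    }
    char *tmp = from_space;                     /* flip the spaces */
    from_space = to_space;
    to_space = tmp;
    alloc_ptr = free;
}
```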

Mark-Sweep: free-list allocation + trace + sweep-to-free
✓ Simple, fast collection ✓ Space efficient ✗ Poor mutator locality
[Charts: garbage collection, mutator, and total performance (time vs. space), and minimum heap, for MarkSweep, SemiSpace, MarkCompact.]

Mark-Compact: bump allocation + trace + compact
✗ Expensive multi-pass collection ✓ Good locality ✓ Space efficient
[Charts: garbage collection, mutator, and total performance (time vs. space), and minimum heap, for MarkSweep, SemiSpace, MarkCompact. Actual data: geomean of DaCapo, jvm98, and jbb2000 on a 2.4 GHz Core 2 Duo.]

Semi-Space: bump allocation + trace + evacuate
✓ Good locality ✗ Space inefficient
[Charts: garbage collection, mutator, and total performance (time vs. space), and minimum heap, for MarkSweep, SemiSpace, MarkCompact.]

Performance Pathologies

Mark-Sweep: poor locality. Mark-Compact: expensive multi-pass collection. Semi-Space: space inefficient.
[Charts: actual data, geomean of DaCapo, jvm98, and jbb2000 on a 2.4 GHz Core 2 Duo.]

Mark-Region with Sweep-to-Region
Reclamation strategies: sweep-to-free → Mark-Sweep; compact → Mark-Compact; evacuate → Semi-Space; sweep-to-region → Mark-Region.

Mark-Region: bump allocation + trace + sweep-to-region

Naïve Mark-Region

• Contiguous allocation into regions: ✓ excellent locality; objects cannot span regions
• Simply mark objects and their region
• Free unmarked regions

Region size?
Large regions: ✓ more contiguous allocation ✗ fragmentation (false marking)
Small regions: ✗ increased metadata ✗ fragmentation (can’t fill blocks) ✗ constrained object sizes

Hierarchy: lines and blocks

[Diagram: a block spans N pages and is divided into lines of roughly one cache line; after a trace, blocks contain a mix of free and recyclable lines.]

• TLB locality (blocks), cache locality (lines)
• Block > 4 × max object size
• Objects may span lines
• Lines marked together with their objects
✓ Less fragmentation ✓ Fast common case

Allocation Policy (Recycling)
• Recycle partially marked blocks first
✓ Minimizes fragmentation ✓ Maximizes contiguity ✓ Shares free blocks among parallel allocators

Opportunistic Defragmentation
• Opportunistically evacuate fragmented blocks: lightweight, uses the same allocation mechanism; no cost in the common case (specialized GC)
• Identify source and target blocks; evacuate objects in source blocks, allocating into target blocks
• Opportunistic: leave an object in place if there is no space or the object is pinned

Immix Mark-Region

• 128-byte lines, 32 KB blocks
• Opportunistic defragmentation, triggered when partially full blocks exist at memory exhaustion
• Overflow allocation for medium objects (> 128 bytes)
• Implicit line marking
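A C sketch of the Immix allocation fast path implied by these points: bump allocation into "holes", i.e., runs of unmarked 128-byte lines inside a recycled 32 KB block. The struct layout and names are illustrative, not the actual Immix implementation; tracing, overflow allocation, and defragmentation are assumed away.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES  128
#define BLOCK_BYTES (32 * 1024)
#define LINES_PER_BLOCK (BLOCK_BYTES / LINE_BYTES)   /* 256 */

typedef struct {
    uint8_t line_marks[LINES_PER_BLOCK];  /* set by the trace for live lines */
    uint8_t data[BLOCK_BYTES];
} block_t;

typedef struct {
    block_t *block;
    int cursor;                /* next line index to consider   */
    uint8_t *bump, *limit;     /* current contiguous free run   */
} allocator_t;

static void allocator_init(allocator_t *a, block_t *b) {
    a->block = b;
    a->cursor = 0;
    a->bump = a->limit = b->data;   /* force a hole search on first alloc */
}

/* Find the next run of unmarked (free) lines in the block. */
static int next_hole(allocator_t *a) {
    block_t *b = a->block;
    int i = a->cursor;
    while (i < LINES_PER_BLOCK && b->line_marks[i]) i++;    /* skip live  */
    if (i == LINES_PER_BLOCK) return 0;                     /* block full */
    int start = i;
    while (i < LINES_PER_BLOCK && !b->line_marks[i]) i++;   /* extend run */
    a->bump  = b->data + start * LINE_BYTES;
    a->limit = b->data + i * LINE_BYTES;
    a->cursor = i;
    return 1;
}

/* Bump-allocate from the current hole, advancing to the next hole
 * when this one is exhausted. */
void *immix_alloc(allocator_t *a, size_t bytes) {
    while (a->bump + bytes > a->limit)
        if (!next_hole(a)) return NULL;   /* acquire fresh block / trigger GC */
    void *p = a->bump;
    a->bump += bytes;
    return p;
}
```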

Immix: bump allocation + trace + sweep-to-region
✓ Simple, fast collection ✓ Excellent locality ✓ Space efficient
[Charts: garbage collection, mutator, and total performance (time vs. space), and minimum heap, for Immix, SemiSpace, MarkCompact, MarkSweep. Actual data: geomean of DaCapo, jvm98, and jbb2000 on a 2.4 GHz Core 2 Duo.]

Lessons for OS

Metadata: with just 1.5% metadata, Immix is more space- and time-efficient than all of the zero-metadata algorithms.

Contiguity: applications allocate & access sequential virtual addresses, and Immix exploits this property. Why not the OS?

Looking forward: opportunities for OS memory management

Mark-Region (MR) for multiple page sizes
Page sizes = region sizes: page = MR line; big page = multiple pages = MR block (aligned); huge page = multiple big pages = multiple MR blocks, etc. [Diagram: pages allocated within big pages, big pages within huge pages, with promotion between sizes.]

Fragmentation management: better, because block statistics are always correct. Prevention: allocate into occupied blocks with free pages. Defragmentation: pick target & source blocks.

Incremental adoption in buddy
Buddy sizes = page sizes. [Diagram: page, big-page, and huge-page allocation with promotion between sizes.]

• Eliminate all “buddy” sizes ≠ potential page sizes
• Add metadata for blocks and pages
• Add allocation and defragmentation policies (sketched below)
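A hypothetical sketch of what that per-block metadata might look like: a free-page count per 2 MB block keeps the statistics "always correct", so promotion and allocation-policy checks are O(1) metadata reads rather than scans. All names and sizes here are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 512   /* 512 x 4 KB = one 2 MB big page */

typedef struct {
    uint16_t free_pages;                        /* always-correct statistic */
    uint64_t used_bitmap[PAGES_PER_BLOCK / 64]; /* which base pages are used */
} block_meta_t;

/* Prevention policy: prefer partially occupied blocks for new base pages,
 * keeping other blocks fully free and therefore promotable. */
static bool prefer_for_page_alloc(const block_meta_t *b) {
    return b->free_pages > 0 && b->free_pages < PAGES_PER_BLOCK;
}

/* Promotion check: a block with no base-page allocations can be mapped
 * (or handed out) as a single 2 MB big page. */
static bool promotable(const block_meta_t *b) {
    return b->free_pages == PAGES_PER_BLOCK;
}
```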

What about range allocation?

[Diagram: contiguous ranges of virtual memory mapped onto contiguous physical memory.]

Mark-Region for ranges
• Max range size = biggest block size
• Range allocation: base/bound allocation into partially filled blocks of pages; overflow allocation is harder due to variable sizes
• Fragmentation management: prevention is the same, allocate into occupied blocks with free pages; defragmentation uses the same mechanisms with new incremental policies that consider multiple page sizes and the maximum range size

Lots of challenges

• Heterogeneous memories: memory management across DRAM, Flash, and NVM
• CPU constraints: queries that execute directly on compressed data
• DNA storage

Collaborators (maybe YOU!)

Steve Blackburn, Karin Strauss, Vasileios Karakostas, Jayneel Gandhi, Mark D. Hill, Michael M. Swift, Osman S. Ünsal, Mario Nemirovsky, Furkan Ayar, Adrián Cristal, Jennifer Sartor, Martin Hirzel, Tiejun Gao
