Progress & Challenges for Virtual Memory Management
Kathryn S. McKinley Microsoft Research
The Storage Gap

[Chart: petabytes of data, by year: the "digital universe" (tens of millions of PB) grows far faster than installed capacity and capacity shipped, leaving a widening gap. Source: IDC]

Memory capacity for $10,000

[Chart: memory purchasable for $10,000 (inflation-adjusted 2011 USD, from jcmit.com) vs. year, 1980–2015, rising from about 1 MB to about 100 GB per server, with annotations for 2^31-, 2^48-, and 2^64-byte addressability.]
Memory technologies

• FLASH
• DRAM
• PCM [diagram: a chalcogenide layer between two electrodes, written by a resistive heater]
• DNA

Heterogeneous Memory Systems

• Mix of DRAM, Flash, NVM
• Memory coupled with FPGAs & special-purpose ASICs

Server software

• Mobile + Web + Cloud
• Traditional batch throughput + interactive latency

Scaling up & out
Basic Server Architecture

[Diagram: an aggregator fans requests out to server nodes and combines their responses.]

Server workloads scaled to memory sizes & CPUs

Applications must fit in memory
[Chart: queries per second vs. raw uncompressed input size (1–128 GB) for Conviva customer search logs on a 60 GB Amazon EC2 server with one core, running Elasticsearch, MongoDB, and Cassandra: throughput collapses once the data no longer fits in memory. Corollary: no swapping.]

Source: R. Agarwal [NSDI’15, NSDI’16]

Road map
• Fitting in memory
• Better big memory hardware
• Implications for operating system memory management

Data in memory is compressible

It is now cheaper to compress a page and write it to swap than to write the page out uncompressed.
Apple & VMware do it
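A quick illustration of why compressed swap pays off, using zlib as a stand-in for whichever compressor a real system uses (the page contents here are synthetic):

```python
import zlib

PAGE_SIZE = 4096

# A synthetic heap page: mostly zero bytes with a little live data,
# which is typical of the zero-heavy pages measured in limit studies.
page = bytearray(PAGE_SIZE)
page[0:16] = b"live object data"

compressed = zlib.compress(bytes(page))

# Swapping out the compressed page moves far fewer bytes than the raw page.
ratio = len(compressed) / PAGE_SIZE
print(f"raw {PAGE_SIZE} B -> compressed {len(compressed)} B ({ratio:.1%})")
```

The fewer bytes written, the cheaper the swap-out; the page is decompressed again on swap-in.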
✗ Can we compute on compressed data?

Java limit study

• Run the program, take heap dumps, and model the limits of compression for its objects and arrays

[Chart: compression rate (%) by technique — value-set indirection, value-set caching, deep-equal sharing, zeros, and hybrids.]

Sartor et al. [ISMM’08, PLDI’10]

Reality of compressed arrays

• Up to 49% compression, but only 6% on average
• Fast access to the first N elements, inlined after the header; slower, indirect access to the remaining elements through a spine of arraylet pointers
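A toy model of the arraylet layout just described, with illustrative names and sizes (not the paper's implementation): the first few elements are inlined for fast access, the rest go through a spine of arraylet pointers, and all-zero arraylets share one backing zero arraylet.

```python
# Hypothetical arraylet layout (names and sizes are illustrative).
FIRST_N = 4
ARRAYLET_SIZE = 8
ZERO_ARRAYLET = [0] * ARRAYLET_SIZE  # shared by every all-zero chunk

class ArrayletArray:
    def __init__(self, values):
        self.length = len(values)
        self.inline = values[:FIRST_N]              # fast path: inlined
        rest = values[FIRST_N:]
        self.spine = []
        for i in range(0, len(rest), ARRAYLET_SIZE):
            chunk = rest[i:i + ARRAYLET_SIZE]
            chunk += [0] * (ARRAYLET_SIZE - len(chunk))
            # Share the zero arraylet instead of storing zero-filled chunks.
            self.spine.append(ZERO_ARRAYLET if chunk == ZERO_ARRAYLET else chunk)

    def get(self, i):
        if i < FIRST_N:
            return self.inline[i]                   # no indirection
        j = i - FIRST_N
        return self.spine[j // ARRAYLET_SIZE][j % ARRAYLET_SIZE]  # slow path

a = ArrayletArray([7, 7, 7, 7, 5] + [0] * 20)
print(a.get(0), a.get(4), a.get(10))  # 7 5 0
```

Here two of the three arraylets are all zeros and collapse to the single shared zero arraylet, which is where the space savings come from.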
[Diagram: arraylet space — arraylets plus a shared zero arraylet.]

Are databases compressible?
[Diagram: Search(term) over posting lists of record IDs, e.g. {0, 10, 14, 16, 19, 26, 29}, {1, 4, 5, 8, 20, 22, 24}, …]

Data scans: low storage, low throughput. Indexes: high storage, high throughput.

Source: R. Agarwal [NSDI’15, NSDI’16]

Databases are compressible!
Succinct insight

Queries execute directly on compressed data: the compression index serves as the search index.

• Burrows-Wheeler Transform (bzip), suffix arrays
• New data structures & query algorithms
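A toy version of that insight: the index over the text is itself the search structure, so queries never scan the raw data. Real Succinct uses compressed suffix arrays built on the Burrows-Wheeler Transform; this sketch uses a plain, uncompressed suffix array for clarity.

```python
def build_suffix_array(text):
    # Positions of all suffixes, sorted by the suffix text.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """All positions of pattern, via binary search over the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound of matching suffixes
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # upper bound of matching suffixes
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(occurrences(text, sa, "ana"))  # [1, 3]
```

Every query is a pair of binary searches over the index, independent of how many records match, which is why no scan or decompression pass is needed.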
Source: R. Agarwal [NSDI’15, NSDI’16]

Advantages
• Compression index is the data index
• No scans
• No decompression
• Queries: search, range, random access, regular expressions

Limitations
• More CPU operations for data access
• No in-place updates
• Preprocessing time
Source: R. Agarwal [NSDI’15, NSDI’16]

Order of magnitude more data

[Chart: total footprint (GB) vs. raw uncompressed data size (1–128 GB) for Cassandra, MongoDB, and Elasticsearch against RAM capacity; Succinct keeps an order of magnitude more raw data resident. Source: R. Agarwal [NSDI’15, NSDI’16]]

Throughput

[Chart: queries per second (log scale) vs. raw uncompressed data size (1–128 GB) for DB-X (scans), Elasticsearch, and Succinct; Succinct sustains throughput as data grows.]

Fitting in memory is necessary, but not sufficient

Road map
• Fitting in memory
• Better big memory hardware
• Implications for operating system memory management

Scaling up & out
Basic Server Architecture

[Diagram: an aggregator fans requests out to server nodes.]

Server workloads scaled to memory sizes & CPUs
Hardware organization

[Diagram: a process’s virtual addresses are translated through the page table to physical memory.]

[Diagram: the TLB caches translations for the process on the CPU; hits bypass the page table, and misses trigger a walk that fills the TLB.]

TLB: Translation Lookaside Buffer
The TLB sits on the CPU critical path.

Virtualization exacerbates TLB limitations

[Diagram: the guest page table maps process virtual addresses to guest-physical addresses; the host page table maps those to host physical memory.]

TLB misses incur 4 + 4×5 = 24 memory accesses!
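The 4 + 4×5 figure can be checked with a back-of-envelope calculation, assuming 4-level radix page tables as on x86-64:

```python
LEVELS = 4  # radix page-table levels on x86-64

def walk_refs(virtualized=False):
    """Worst-case memory references to translate one address on a TLB miss."""
    if not virtualized:
        return LEVELS                      # one reference per level
    # 2D nested walk: each guest page-table entry lives at a guest-physical
    # address, so reading it costs a full host walk plus the entry itself
    # (5 refs), for each of the 4 guest levels; the final guest-physical
    # address then needs one more 4-level host walk.
    return LEVELS * (LEVELS + 1) + LEVELS  # 4*5 + 4 = 24

print(walk_refs(), walk_refs(virtualized=True))  # 4 24
```

So a virtualized TLB miss can cost 6× the memory references of a native one, which is why shadow page tables and better TLB reach matter.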
Shadow page tables [TACO’13]

Memory capacity for $10,000

[Chart (repeated): memory capacity per $10,000 vs. year, 1980–2015, now annotated with L1 TLB sizes of roughly 72, 96, and 100 entries: TLB capacity has barely grown while memory sizes exploded. *Inflation-adjusted 2011 USD, from jcmit.com]

Bigger pages extend TLB reach

• Aligned large pages (e.g., 1 MB or 1 GB)
• Explicit huge pages: programmer burden
• Transparent huge pages: internal fragmentation up to 0.5 × page size, depending on policy
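A quick illustration of TLB reach; the entry counts and page sizes below are illustrative (2 MB and 1 GB are the x86-64 large-page sizes), not any particular CPU's configuration:

```python
KB, MB, GB = 1024, 1024**2, 1024**3

def tlb_reach(entries, page_size):
    """Bytes of address space a fully populated TLB can cover."""
    return entries * page_size

def human(n):
    # Render a byte count with a power-of-two unit.
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            return f"{n} {unit}"
        n //= 1024
    return f"{n} TB"

# Illustrative L1 TLB sizes: reach grows with page size, not entry count.
for entries, size, name in [(64, 4 * KB, "4 KB"), (32, 2 * MB, "2 MB"), (4, GB, "1 GB")]:
    print(f"{entries} x {name} pages -> {human(tlb_reach(entries, size))} reach")

# But transparent large pages risk internal fragmentation of up to half a
# page per mapping, so a 1 GB page can strand up to 512 MB.
print(human(GB // 2), "worst-case internal fragmentation per 1 GB page")
```

A few dozen 4 KB entries cover well under a megabyte; the same hardware budget with 1 GB pages covers gigabytes, at the cost of fragmentation.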
One size does not fit all!

Paged memory

[Diagram: individual virtual pages map to scattered physical frames.]

Key observation: ranges of contiguity
[Diagram: a virtual address space with contiguous code, heap, stack, and shared-library ranges.]
How much range contiguity is there? |Ranges| vs. |Pages|

[Chart: total application address space covered (%) vs. number of entries (1–10,000) for astar, mcf, cactusADM, canneal, graph500, memcached, and tigr: a handful of ranges cover most of each address space, while page mappings need orders of magnitude more entries.]

Idea: Redundant memory mappings
[Diagram: the code, heap, stack, and shared-library ranges map redundantly — by pages and by range translations — to contiguous physical memory.]
Range translations: Base, Limit, Offset

[Diagram: a virtual range [BASE, LIMIT) translates to contiguous physical memory by adding OFFSET.]
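A range translation lookup can be sketched as a base–limit–offset check; the class and field names here are illustrative, not the RMM hardware interface.

```python
class RangeTranslation:
    """One redundant-mapping entry: virtual addresses in [base, limit)
    map to physical addresses at vaddr + offset, with uniform protection."""
    def __init__(self, base, limit, offset, prot="rw"):
        self.base, self.limit, self.offset, self.prot = base, limit, offset, prot

    def translate(self, vaddr):
        if self.base <= vaddr < self.limit:
            return vaddr + self.offset   # a single add, no page-table walk
        return None                      # miss: fall back to ordinary paging

heap = RangeTranslation(base=0x1000_0000, limit=0x1800_0000, offset=0x2000_0000)
assert heap.translate(0x1000_0040) == 0x3000_0040
assert heap.translate(0x2000_0000) is None  # outside the range -> page walk
```

One such entry covers an arbitrarily large contiguous range, so a handful of entries can replace thousands of TLB entries while the page tables remain as a redundant fallback.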
Range translation: maps contiguous virtual pages to contiguous physical pages with uniform protection.

OS memory management support
Demand paging

[Diagram: virtual pages get physical frames only on first touch, scattering them.]

Demand paging does not facilitate physical range creation.

Eager paging
[Diagram: eager paging backs the whole virtual range with contiguous physical frames at allocation time.]

Eager paging increases range sizes, but can waste memory.

It works: eliminates VM overheads
[Chart: address-translation overhead (% of execution time) for cactusADM, canneal, graph500, mcf, and tigr under 4 KB pages, clustered TLB (CTLB), transparent huge pages (THP), direct segments (DS), and RMM. RMM overheads: cactusADM 0.25%, canneal 0.40%, graph500 0.14%, mcf 0.26%, tigr 1.06%.]

ISCA’15, IEEE MICRO Top Picks

Road map
• Fitting in memory
• Better big memory hardware
• Implications for operating system memory management

Variable page sizes and ranges require contiguity; base pages do not.

OS buddy allocation
Design goals
• Power-of-2 page sizes
• Fast allocation
• Fast coalescing
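A minimal sketch of a power-of-two buddy allocator meeting those three goals; addresses are offsets into an abstract arena, and the structure is illustrative rather than the Linux implementation.

```python
class BuddyAllocator:
    """Minimal power-of-two buddy allocator over a fixed arena (a sketch)."""
    def __init__(self, total_order):
        self.total_order = total_order
        # free[k] holds start offsets of free blocks of size 2**k
        self.free = {k: set() for k in range(total_order + 1)}
        self.free[total_order].add(0)

    def alloc(self, order):
        k = order
        while k <= self.total_order and not self.free[k]:
            k += 1                          # smallest free block that fits
        if k > self.total_order:
            return None                     # out of memory
        addr = self.free[k].pop()
        while k > order:                    # split, freeing the upper buddy
            k -= 1
            self.free[k].add(addr + (1 << k))
        return addr

    def free_block(self, addr, order):
        while order < self.total_order:
            buddy = addr ^ (1 << order)     # buddy differs in exactly one bit
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)  # coalesce with the free buddy
            addr = min(addr, buddy)
            order += 1
        self.free[order].add(addr)

b = BuddyAllocator(total_order=4)           # a 16-unit arena
x = b.alloc(0)                              # splits down from the top block
y = b.alloc(1)
b.free_block(x, 0)
b.free_block(y, 1)                          # coalesces back to one top block
```

Splitting and coalescing are both O(log arena size), and the XOR buddy computation is why sizes must be powers of two.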
Source: codeproject.com

Memory management is a space-time tradeoff
Lessons from garbage collection

GC fundamentals
• Allocation: free list vs. bump allocation
• Identification: tracing (implicit) vs. reference counting (explicit)
• Reclamation: sweep-to-free, compact, evacuate

Canonical garbage collectors
• Mark-Sweep [McCarthy 1960]: free-list allocation + trace + sweep-to-free
• Mark-Compact [Styger 1967]: bump allocation + trace + compact
• Semi-Space [Cheney 1970]: bump allocation + trace + evacuate

Mark-Sweep
✓ Simple, fast collection
✓ Space efficient
✗ Poor mutator locality

[Charts throughout this section: garbage-collection time, mutator time, and total performance vs. heap space and minimum heap for MarkSweep, SemiSpace, and MarkCompact. Actual data: geomean of DaCapo, jvm98, and jbb2000 on a 2.4 GHz Core 2 Duo.]

Mark-Compact
✓ Good mutator locality
✓ Space efficient
✗ Expensive multi-pass collection

Semi-Space
✓ Good mutator locality
✗ Space inefficient

Performance pathologies
• Mark-Sweep: poor locality
• Mark-Compact: expensive multi-pass collection
• Semi-Space: space inefficient

Mark-Region with sweep-to-region reclamation

[Figure: reclamation strategies — sweep-to-free (Mark-Sweep), compact (Mark-Compact), evacuate (Semi-Space).]
Mark-Region: bump allocation + trace + sweep-to-region
Naïve Mark-Region
• Contiguous allocation into regions — ✓ excellent locality
• Objects cannot span regions
• Simply mark objects and their region
• Free unmarked regions

Region size?

Large regions
✓ More contiguous allocation
✗ Fragmentation (false marking)

Small regions
✗ Increased metadata
✗ Fragmentation (can’t fill blocks)
✗ Constrained object sizes

Hierarchy: lines and blocks
[Diagram: a block of N pages divided into lines of roughly one cache line each, with free and recyclable lines.]

• TLB locality, cache locality
• Block > 4 × max object size
• Objects span lines
• Lines marked with objects
✓ Less fragmentation
✓ Fast common case

Allocation policy (recycling)
• Recycle partially marked blocks first
✓ Minimizes fragmentation
✓ Maximizes contiguity
✓ Shares free blocks among parallel thread allocators

Opportunistic defragmentation

• Opportunistically evacuate fragmented blocks
  – Lightweight, uses the same allocation mechanism
  – No cost in the common case (specialized GC)
• Identify source and target blocks
• Evacuate objects in source blocks, allocating into target blocks
• Opportunistic: leave an object in place if there is no space, or it is pinned

Immix Mark-Region
• 128-byte lines, 32 KB blocks
• Opportunistic defragmentation, triggered when partially full blocks exist at memory exhaustion
• Overflow allocation for medium objects (> 128 bytes)
• Implicit marking
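A toy sketch of the sweep-to-region step, with simplified parameters (8 lines per block here, versus Immix's 256 × 128 B lines per 32 KB block): after the trace marks the lines live objects touch, the sweep classifies whole blocks from their line mark bits alone.

```python
LINES_PER_BLOCK = 8  # shrunk from Immix's 256 lines per 32 KB block

def sweep(block_line_marks):
    """Sweep-to-region: classify each block from its line mark bits."""
    free, recyclable, full = [], [], []
    for b, marks in enumerate(block_line_marks):
        marked = sum(marks)
        if marked == 0:
            free.append(b)            # whole block reclaimed at once
        elif marked < LINES_PER_BLOCK:
            recyclable.append(b)      # holes of unmarked lines are reusable
        else:
            full.append(b)
    return free, recyclable, full

marks = [
    [0] * 8,                  # dead block
    [1, 1, 0, 0, 1, 0, 0, 0], # partially marked: recycle its holes first
    [1] * 8,                  # fully live
]
print(sweep(marks))  # ([0], [1], [2])
```

No object is moved during the sweep; the allocator then bump-allocates into the holes of recyclable blocks first, which is the recycling policy above.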
Immix

✓ Simple, fast collection
✓ Excellent mutator locality
✓ Space efficient

[Charts: garbage-collection time, mutator time, and total performance vs. heap space and minimum heap for Immix, SemiSpace, MarkCompact, and MarkSweep; geomean of DaCapo, jvm98, and jbb2000 on a 2.4 GHz Core 2 Duo.]
Lessons for the OS

Metadata: Immix is more space- and time-efficient with 1.5% metadata than all the zero-metadata algorithms.

Contiguity: applications allocate & access sequential virtual addresses. Immix exploits this property. Why not the OS?
Looking forward: opportunities for OS memory management

Mark Region (MR) for multiple page sizes

• Page sizes = region sizes
• page = MR line
• big page = multiple pages = MR block (aligned)
• huge page = multiple big pages = multiple MR blocks, etc.
• Allocation at each size; promotion from pages to big pages to huge pages
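The promotion step above can be sketched as follows; `try_promote` and `PAGES_PER_BIG` are hypothetical names for illustration (512 base pages per big page matches x86-64's 4 KB → 2 MB ratio).

```python
PAGES_PER_BIG = 512   # e.g. 512 x 4 KB base pages per 2 MB big page (x86-64)

def try_promote(allocated_pages, big_page_index):
    """Promote an aligned run of base pages to one big-page mapping
    when every base page in the block is populated (a sketch)."""
    start = big_page_index * PAGES_PER_BIG
    block = range(start, start + PAGES_PER_BIG)
    if all(p in allocated_pages for p in block):
        return ("big", big_page_index)   # one TLB entry covers the block
    return ("base", big_page_index)      # keep individual base-page mappings

pages = set(range(0, 512))               # block 0 fully populated
pages |= {512, 513}                      # block 1 sparse
print(try_promote(pages, 0), try_promote(pages, 1))
```

Because MR blocks are aligned and their line (page) occupancy is tracked as allocator metadata, the promotion check is exactly a full-block test, mirroring how Immix decides a block is full.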
Fragmentation management — better, with block statistics always correct
• Prevention: allocate into occupied blocks with free pages
• Defragmentation: choose target & source blocks

Incremental adoption in the buddy allocator
• Buddy sizes = page sizes, with allocation and promotion as above
• Eliminate all “buddy” sizes ≠ potential page sizes
• Add metadata for blocks and pages
• Add allocation and defragmentation policies

What about range allocation?
[Diagram: virtual ranges mapped to contiguous physical memory.]

Mark Region for ranges

• Max range size = biggest block size
• Base & bound delimit a range, allocated from a partially filled block of pages
• Range allocation: overflow allocation, harder due to variable sizes
• Fragmentation management: prevention is the same — allocate into occupied blocks with free pages; defragmentation uses the same mechanisms with new incremental policies that consider multiple page sizes and the maximum range size

Lots of challenges
• Heterogeneous memories & their management
• CPU constraints: queries execute directly on compressed data
• DNA storage

Collaborators — maybe YOU!
Steve Blackburn, Karin Strauss, Vasileios Karakostas, Jayneel Gandhi, Mark D. Hill, Michael M. Swift, Osman S. Ünsal, Mario Nemirovsky, Furkan Ayar, Adrián Cristal, Jennifer Sartor, Martin Hirzel, Tiejun Gao