Progress & Challenges for Virtual Memory Management
Kathryn S. McKinley Microsoft Research
The Storage Gap

[Chart: petabytes of data, by year: the "digital universe" (tens of millions of PB) grows far faster than installed capacity and capacity shipped, leaving a widening gap. Source: IDC]

Memory capacity for $10,000

[Chart: memory purchasable for $10,000 (inflation-adjusted 2011 USD, from jcmit.com) vs. year, 1980–2015, rising from about 1 MB to about 100 GB per server, with annotations for 2^31-, 2^48-, and 2^64-byte addressability.]
Memory technologies

• FLASH
• DRAM
• PCM [diagram: a chalcogenide layer between two electrodes, written by a resistive heater]
• DNA

Heterogeneous Memory Systems

• Mix of DRAM, Flash, NVM
• Memory coupled with FPGAs & special-purpose ASICs

Server software

• Mobile + Web + Cloud
• Traditional batch throughput + interactive latency

Scaling up & out
Basic Server Architecture

[Diagram: an aggregator fans requests out to server nodes and combines their responses.]

Server workloads scaled to memory sizes & CPUs

Applications must fit in memory
[Chart: queries per second vs. raw uncompressed input size (1–128 GB) for Conviva customer search logs on a 60 GB Amazon EC2 server with one core, running Elasticsearch, MongoDB, and Cassandra: throughput collapses once the data no longer fits in memory. Corollary: no swapping.]

Source: R. Agarwal [NSDI’15, NSDI’16]

Road map
• Fitting in memory
• Better big memory hardware
• Implications for operating system memory management

Data in memory is compressible

It is now cheaper to compress a page and write it to swap than to write the page out uncompressed.
Apple & VMware do it
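A quick illustration of why compressed swap pays off, using zlib as a stand-in for whichever compressor a real system uses (the page contents here are synthetic):

```python
import zlib

PAGE_SIZE = 4096

# A synthetic heap page: mostly zero bytes with a little live data,
# which is typical of the zero-heavy pages measured in limit studies.
page = bytearray(PAGE_SIZE)
page[0:16] = b"live object data"

compressed = zlib.compress(bytes(page))

# Swapping out the compressed page moves far fewer bytes than the raw page.
ratio = len(compressed) / PAGE_SIZE
print(f"raw {PAGE_SIZE} B -> compressed {len(compressed)} B ({ratio:.1%})")
```

The fewer bytes written, the cheaper the swap-out; the page is decompressed again on swap-in.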
✗ Can we compute on compressed data?

Java limit study

• Run the program, take heap dumps, and model the limits of compression for its objects and arrays

[Chart: compression rate (%) by technique — value-set indirection, value-set caching, deep-equal sharing, zeros, and hybrids.]

Sartor et al. [ISMM’08, PLDI’10]

Reality of compressed arrays

• Up to 49% compression, but only 6% on average
• Fast access to the first N elements, inlined after the header; slower, indirect access to the remaining elements through a spine of arraylet pointers
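A toy model of the arraylet layout just described, with illustrative names and sizes (not the paper's implementation): the first few elements are inlined for fast access, the rest go through a spine of arraylet pointers, and all-zero arraylets share one backing zero arraylet.

```python
# Hypothetical arraylet layout (names and sizes are illustrative).
FIRST_N = 4
ARRAYLET_SIZE = 8
ZERO_ARRAYLET = [0] * ARRAYLET_SIZE  # shared by every all-zero chunk

class ArrayletArray:
    def __init__(self, values):
        self.length = len(values)
        self.inline = values[:FIRST_N]              # fast path: inlined
        rest = values[FIRST_N:]
        self.spine = []
        for i in range(0, len(rest), ARRAYLET_SIZE):
            chunk = rest[i:i + ARRAYLET_SIZE]
            chunk += [0] * (ARRAYLET_SIZE - len(chunk))
            # Share the zero arraylet instead of storing zero-filled chunks.
            self.spine.append(ZERO_ARRAYLET if chunk == ZERO_ARRAYLET else chunk)

    def get(self, i):
        if i < FIRST_N:
            return self.inline[i]                   # no indirection
        j = i - FIRST_N
        return self.spine[j // ARRAYLET_SIZE][j % ARRAYLET_SIZE]  # slow path

a = ArrayletArray([7, 7, 7, 7, 5] + [0] * 20)
print(a.get(0), a.get(4), a.get(10))  # 7 5 0
```

Here two of the three arraylets are all zeros and collapse to the single shared zero arraylet, which is where the space savings come from.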
[Diagram: arraylet space — arraylets plus a shared zero arraylet.]

Are databases compressible?
[Diagram: Search(term) over posting lists of record IDs, e.g. {0, 10, 14, 16, 19, 26, 29}, {1, 4, 5, 8, 20, 22, 24}, …]

Data scans: low storage, low throughput. Indexes: high storage, high throughput.

Source: R. Agarwal [NSDI’15, NSDI’16]

Databases are compressible!
Succinct insight

Queries execute directly on compressed data: the compression index serves as the search index.

• Burrows-Wheeler Transform (bzip), suffix arrays
• New data structures & query algorithms
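A toy version of that insight: the index over the text is itself the search structure, so queries never scan the raw data. Real Succinct uses compressed suffix arrays built on the Burrows-Wheeler Transform; this sketch uses a plain, uncompressed suffix array for clarity.

```python
def build_suffix_array(text):
    # Positions of all suffixes, sorted by the suffix text.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """All positions of pattern, via binary search over the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound of matching suffixes
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # upper bound of matching suffixes
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(occurrences(text, sa, "ana"))  # [1, 3]
```

Every query is a pair of binary searches over the index, independent of how many records match, which is why no scan or decompression pass is needed.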
Source: R. Agarwal [NSDI’15, NSDI’16]

Advantages
• Compression index is the data index
• No scans
• No decompression
• Queries: search, range, random access, regular expressions

Limitations
• More CPU operations for data access
• No in-place updates
• Preprocessing time
Source: R. Agarwal [NSDI’15, NSDI’16]

Order of magnitude more data

[Chart: total footprint (GB) vs. raw uncompressed data size (1–128 GB) for Cassandra, MongoDB, and Elasticsearch against RAM capacity; Succinct keeps an order of magnitude more raw data resident. Source: R. Agarwal [NSDI’15, NSDI’16]]

Throughput

[Chart: queries per second (log scale) vs. raw uncompressed data size (1–128 GB) for DB-X (scans), Elasticsearch, and Succinct; Succinct sustains throughput as data grows.]

Fitting in memory is necessary, but not sufficient

Road map
• Fitting in memory
• Better big memory hardware
• Implications for operating system memory management

Scaling up & out
Basic Server Architecture

[Diagram: an aggregator fans requests out to server nodes.]

Server workloads scaled to memory sizes & CPUs
Hardware organization

[Diagram: a process’s virtual addresses are translated through the page table to physical memory.]

[Diagram: the TLB caches translations for the process on the CPU; hits bypass the page table, and misses trigger a walk that fills the TLB.]

TLB: Translation Lookaside Buffer
The TLB sits on the CPU critical path.

Virtualization exacerbates TLB limitations

[Diagram: the guest page table maps process virtual addresses to guest-physical addresses; the host page table maps those to host physical memory.]

TLB misses incur 4 + 4×5 = 24 memory accesses!
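The 4 + 4×5 figure can be checked with a back-of-envelope calculation, assuming 4-level radix page tables as on x86-64:

```python
LEVELS = 4  # radix page-table levels on x86-64

def walk_refs(virtualized=False):
    """Worst-case memory references to translate one address on a TLB miss."""
    if not virtualized:
        return LEVELS                      # one reference per level
    # 2D nested walk: each guest page-table entry lives at a guest-physical
    # address, so reading it costs a full host walk plus the entry itself
    # (5 refs), for each of the 4 guest levels; the final guest-physical
    # address then needs one more 4-level host walk.
    return LEVELS * (LEVELS + 1) + LEVELS  # 4*5 + 4 = 24

print(walk_refs(), walk_refs(virtualized=True))  # 4 24
```

So a virtualized TLB miss can cost 6× the memory references of a native one, which is why shadow page tables and better TLB reach matter.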
Shadow page tables [TACO’13]

Memory capacity for $10,000

[Chart (repeated): memory capacity per $10,000 vs. year, 1980–2015, now annotated with L1 TLB sizes of roughly 72, 96, and 100 entries: TLB capacity has barely grown while memory sizes exploded. *Inflation-adjusted 2011 USD, from jcmit.com]

Bigger pages extend TLB reach

• Aligned large pages (e.g., 1 MB or 1 GB)
• Explicit huge pages: programmer burden
• Transparent huge pages: internal fragmentation up to 0.5 × page size, depending on policy
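A quick illustration of TLB reach; the entry counts and page sizes below are illustrative (2 MB and 1 GB are the x86-64 large-page sizes), not any particular CPU's configuration:

```python
KB, MB, GB = 1024, 1024**2, 1024**3

def tlb_reach(entries, page_size):
    """Bytes of address space a fully populated TLB can cover."""
    return entries * page_size

def human(n):
    # Render a byte count with a power-of-two unit.
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            return f"{n} {unit}"
        n //= 1024
    return f"{n} TB"

# Illustrative L1 TLB sizes: reach grows with page size, not entry count.
for entries, size, name in [(64, 4 * KB, "4 KB"), (32, 2 * MB, "2 MB"), (4, GB, "1 GB")]:
    print(f"{entries} x {name} pages -> {human(tlb_reach(entries, size))} reach")

# But transparent large pages risk internal fragmentation of up to half a
# page per mapping, so a 1 GB page can strand up to 512 MB.
print(human(GB // 2), "worst-case internal fragmentation per 1 GB page")
```

A few dozen 4 KB entries cover well under a megabyte; the same hardware budget with 1 GB pages covers gigabytes, at the cost of fragmentation.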
One size does not fit all!

Paged memory

[Diagram: individual virtual pages map to scattered physical frames.]

Key observation: ranges of contiguity
[Diagram: a virtual address space with contiguous code, heap, stack, and shared-library ranges.]
How much range contiguity is there? |Ranges| vs. |Pages|

[Chart: total application address space covered (%) vs. number of entries (1–10,000) for astar, mcf, cactusADM, canneal, graph500, memcached, and tigr: a handful of ranges cover most of each address space, while page mappings need orders of magnitude more entries.]

Idea: Redundant memory mappings
[Diagram: the code, heap, stack, and shared-library ranges map redundantly — by pages and by range translations — to contiguous physical memory.]
Range translations: Base, Limit, Offset

[Diagram: a virtual range [BASE, LIMIT) translates to contiguous physical memory by adding OFFSET.]
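A range translation lookup can be sketched as a base–limit–offset check; the class and field names here are illustrative, not the RMM hardware interface.

```python
class RangeTranslation:
    """One redundant-mapping entry: virtual addresses in [base, limit)
    map to physical addresses at vaddr + offset, with uniform protection."""
    def __init__(self, base, limit, offset, prot="rw"):
        self.base, self.limit, self.offset, self.prot = base, limit, offset, prot

    def translate(self, vaddr):
        if self.base <= vaddr < self.limit:
            return vaddr + self.offset   # a single add, no page-table walk
        return None                      # miss: fall back to ordinary paging

heap = RangeTranslation(base=0x1000_0000, limit=0x1800_0000, offset=0x2000_0000)
assert heap.translate(0x1000_0040) == 0x3000_0040
assert heap.translate(0x2000_0000) is None  # outside the range -> page walk
```

One such entry covers an arbitrarily large contiguous range, so a handful of entries can replace thousands of TLB entries while the page tables remain as a redundant fallback.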
Range translation: maps contiguous virtual pages to contiguous physical pages with uniform protection.

OS memory management support
Demand paging

[Diagram: virtual pages get physical frames only on first touch, scattering them.]

Demand paging does not facilitate physical range creation.

Eager paging
[Diagram: eager paging backs the whole virtual range with contiguous physical frames at allocation time.]

Eager paging increases range sizes, but can waste memory.

It works: eliminates VM overheads
[Chart: address-translation overhead (% of execution time) for cactusADM, canneal, graph500, mcf, and tigr under 4 KB pages, clustered TLB (CTLB), transparent huge pages (THP), direct segments (DS), and RMM. RMM overheads: cactusADM 0.25%, canneal 0.40%, graph500 0.14%, mcf 0.26%, tigr 1.06%.]

ISCA’15, IEEE MICRO Top Picks

Road map
• Fitting in memory
• Better big memory hardware
• Implications for operating system memory management

Variable page sizes and ranges require contiguity; base pages do not.

OS buddy allocation
Design goals
• Power-of-2 page sizes
• Fast allocation
• Fast coalescing
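A minimal sketch of a power-of-two buddy allocator meeting those three goals; addresses are offsets into an abstract arena, and the structure is illustrative rather than the Linux implementation.

```python
class BuddyAllocator:
    """Minimal power-of-two buddy allocator over a fixed arena (a sketch)."""
    def __init__(self, total_order):
        self.total_order = total_order
        # free[k] holds start offsets of free blocks of size 2**k
        self.free = {k: set() for k in range(total_order + 1)}
        self.free[total_order].add(0)

    def alloc(self, order):
        k = order
        while k <= self.total_order and not self.free[k]:
            k += 1                          # smallest free block that fits
        if k > self.total_order:
            return None                     # out of memory
        addr = self.free[k].pop()
        while k > order:                    # split, freeing the upper buddy
            k -= 1
            self.free[k].add(addr + (1 << k))
        return addr

    def free_block(self, addr, order):
        while order < self.total_order:
            buddy = addr ^ (1 << order)     # buddy differs in exactly one bit
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)  # coalesce with the free buddy
            addr = min(addr, buddy)
            order += 1
        self.free[order].add(addr)

b = BuddyAllocator(total_order=4)           # a 16-unit arena
x = b.alloc(0)                              # splits down from the top block
y = b.alloc(1)
b.free_block(x, 0)
b.free_block(y, 1)                          # coalesces back to one top block
```

Splitting and coalescing are both O(log arena size), and the XOR buddy computation is why sizes must be powers of two.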
Source: codeproject.com

Memory management is a space-time tradeoff
Lessons from garbage collection

GC fundamentals
• Allocation: free list vs. bump allocation
• Identification: tracing (implicit) vs. reference counting (explicit)
• Reclamation: sweep-to-free, compact, evacuate

Canonical garbage collectors
• Mark-Sweep [McCarthy 1960]: free-list allocation + trace + sweep-to-free
• Mark-Compact [Styger 1967]: bump allocation + trace + compact
• Semi-Space [Cheney 1970]: bump allocation + trace + evacuate

Mark-Sweep
✓ Simple, fast collection
✓ Space efficient
✗ Poor mutator locality

[Charts throughout this section: garbage-collection time, mutator time, and total performance vs. heap space and minimum heap for MarkSweep, SemiSpace, and MarkCompact. Actual data: geomean of DaCapo, jvm98, and jbb2000 on a 2.4 GHz Core 2 Duo.]

Mark-Compact
✓ Good mutator locality
✓ Space efficient
✗ Expensive multi-pass collection

Semi-Space
✓ Good mutator locality
✗ Space inefficient

Performance pathologies
• Mark-Sweep: poor locality
• Mark-Compact: expensive multi-pass collection
• Semi-Space: space inefficient

Mark-Region with sweep-to-region reclamation

[Figure: reclamation strategies — sweep-to-free (Mark-Sweep), compact (Mark-Compact), evacuate (Semi-Space).]
Mark-Region: bump allocation + trace + sweep-to-region
Naïve Mark-Region
• Contiguous allocation into regions — ✓ excellent locality
• Objects cannot span regions
• Simply mark objects and their region
• Free unmarked regions

Region size?

Large regions
✓ More contiguous allocation
✗ Fragmentation (false marking)

Small regions
✗ Increased metadata
✗ Fragmentation (can’t fill blocks)
✗ Constrained object sizes

Hierarchy: lines and blocks
[Diagram: a block of N pages divided into lines of roughly one cache line each, with free and recyclable lines.]

• TLB locality, cache locality
• Block > 4 × max object size
• Objects span lines
• Lines marked with objects
✓ Less fragmentation
✓ Fast common case

Allocation policy (recycling)
• Recycle partially marked blocks first
✓ Minimizes fragmentation
✓ Maximizes contiguity
✓ Shares free blocks among parallel thread allocators

Opportunistic defragmentation

• Opportunistically evacuate fragmented blocks
  – Lightweight, uses the same allocation mechanism
  – No cost in the common case (specialized GC)
• Identify source and target blocks
• Evacuate objects in source blocks, allocating into target blocks
• Opportunistic: leave an object in place if there is no space, or it is pinned

Immix Mark-Region
• 128-byte lines, 32 KB blocks
• Opportunistic defragmentation, triggered when partially full blocks exist at memory exhaustion
• Overflow allocation for medium objects (> 128 bytes)
• Implicit marking
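A toy sketch of the sweep-to-region step, with simplified parameters (8 lines per block here, versus Immix's 256 × 128 B lines per 32 KB block): after the trace marks the lines live objects touch, the sweep classifies whole blocks from their line mark bits alone.

```python
LINES_PER_BLOCK = 8  # shrunk from Immix's 256 lines per 32 KB block

def sweep(block_line_marks):
    """Sweep-to-region: classify each block from its line mark bits."""
    free, recyclable, full = [], [], []
    for b, marks in enumerate(block_line_marks):
        marked = sum(marks)
        if marked == 0:
            free.append(b)            # whole block reclaimed at once
        elif marked < LINES_PER_BLOCK:
            recyclable.append(b)      # holes of unmarked lines are reusable
        else:
            full.append(b)
    return free, recyclable, full

marks = [
    [0] * 8,                  # dead block
    [1, 1, 0, 0, 1, 0, 0, 0], # partially marked: recycle its holes first
    [1] * 8,                  # fully live
]
print(sweep(marks))  # ([0], [1], [2])
```

No object is moved during the sweep; the allocator then bump-allocates into the holes of recyclable blocks first, which is the recycling policy above.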
Immix

✓ Simple, fast collection
✓ Excellent mutator locality
✓ Space efficient

[Charts: garbage-collection time, mutator time, and total performance vs. heap space and minimum heap for Immix, SemiSpace, MarkCompact, and MarkSweep; geomean of DaCapo, jvm98, and jbb2000 on a 2.4 GHz Core 2 Duo.]
Lessons for the OS

Metadata: Immix is more space- and time-efficient with 1.5% metadata than all the zero-metadata algorithms.

Contiguity: applications allocate & access sequential virtual addresses. Immix exploits this property. Why not the OS?
Looking forward: opportunities for OS memory management

Mark Region (MR) for multiple page sizes

• Page sizes = region sizes
• page = MR line
• big page = multiple pages = MR block (aligned)
• huge page = multiple big pages = multiple MR blocks, etc.
• Allocation at each size; promotion from pages to big pages to huge pages
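The promotion step above can be sketched as follows; `try_promote` and `PAGES_PER_BIG` are hypothetical names for illustration (512 base pages per big page matches x86-64's 4 KB → 2 MB ratio).

```python
PAGES_PER_BIG = 512   # e.g. 512 x 4 KB base pages per 2 MB big page (x86-64)

def try_promote(allocated_pages, big_page_index):
    """Promote an aligned run of base pages to one big-page mapping
    when every base page in the block is populated (a sketch)."""
    start = big_page_index * PAGES_PER_BIG
    block = range(start, start + PAGES_PER_BIG)
    if all(p in allocated_pages for p in block):
        return ("big", big_page_index)   # one TLB entry covers the block
    return ("base", big_page_index)      # keep individual base-page mappings

pages = set(range(0, 512))               # block 0 fully populated
pages |= {512, 513}                      # block 1 sparse
print(try_promote(pages, 0), try_promote(pages, 1))
```

Because MR blocks are aligned and their line (page) occupancy is tracked as allocator metadata, the promotion check is exactly a full-block test, mirroring how Immix decides a block is full.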
Fragmentation management — better, with block statistics always correct
• Prevention: allocate into occupied blocks with free pages
• Defragmentation: choose target & source blocks

Incremental adoption in the buddy allocator
• Buddy sizes = page sizes, with allocation and promotion as above
• Eliminate all “buddy” sizes ≠ potential page sizes
• Add metadata for blocks and pages
• Add allocation and defragmentation policies

What about range allocation?
[Diagram: virtual ranges mapped to contiguous physical memory.]

Mark Region for ranges

• Max range size = biggest block size
• Base & bound delimit a range, allocated from a partially filled block of pages
• Range allocation: overflow allocation, harder due to variable sizes
• Fragmentation management: prevention is the same — allocate into occupied blocks with free pages; defragmentation uses the same mechanisms with new incremental policies that consider multiple page sizes and the maximum range size

Lots of challenges
• Heterogeneous memories & their management
• CPU constraints: queries execute directly on compressed data
• DNA storage

Collaborators — maybe YOU!
Steve Blackburn, Karin Strauss, Vasileios Karakostas, Jayneel Gandhi, Mark D. Hill, Michael M. Swift, Osman S. Ünsal, Mario Nemirovsky, Furkan Ayar, Adrián Cristal, Jennifer Sartor, Martin Hirzel, Tiejun Gao