Indian Institute of Science (IISc), Bangalore, India

Memory Management
Arkaprava Basu

Dept. of Computer Science and Automation, Indian Institute of Science
www.csa.iisc.ac.in/~arkapravab/

Memory is virtual!

▪ Application software sees virtual addresses
‣ int *p = malloc(64); returns a virtual address
‣ Load and store instructions carry virtual addresses
▪ Hardware uses (real) physical addresses
‣ e.g., to find data, to look up caches, etc.

▪ When an application executes (as a process), its virtual addresses are translated to physical addresses at runtime

Bird's view of virtual memory

Process 1 and Process 2 each have their own view of memory, and both can use the same address, e.g., 0x1000:

    // Program expects (*x)
    // to always be at
    // address 0x1000
    int *x = (int *)0x1000;

But there is only one physical address 0x1000 in physical memory (e.g., DRAM)! Virtual memory is a layer of indirection between s/w's view of memory and h/w's view. (Picture credit: Nima Honarmand, Stony Brook University)

Roles of software and hardware

1. The operating system creates and manages virtual-to-physical address mappings.
2. (Typically) Hardware performs the VA-to-PA translation at runtime.

Virtual memory terminology

▪ Virtual address space: a process's own view of memory (one per process)
▪ Virtual address: an address, e.g., 0x1000, within a virtual address space
▪ Physical address space: the real memory, e.g., DRAM
▪ Memory Management Unit (MMU): the hardware that translates virtual addresses to physical addresses

Why virtual memory?

▪ Relocation: eases programming by giving the application a contiguous view of memory
‣ But the actual physical memory can be scattered
▪ Resource management: the application writer can assume there is enough memory
‣ But the system can manage physical memory across concurrent processes
▪ Isolation: a bug in one process should not corrupt the memory of another under multi-programming
‣ Provide each process with a separate virtual address space
▪ Protection: enforce rules on what memory a process can or cannot access
▪ Sharing/communication: inter-process communication and sharing (see the sketch below)
‣ The same physical address can be mapped into two processes' virtual address spaces, at different virtual addresses
• Can be used for inter-process communication
• Very useful for code/library sharing
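A minimal sketch of sharing via the POSIX shared-memory API: two processes that map the same segment share the same physical pages, possibly at different virtual addresses. The segment name "/demo" and the omitted error handling are illustrative simplifications.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Both communicating processes run this; "/demo" names the segment. */
        int fd = shm_open("/demo", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);                  /* back it with one 4KB page */
        int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);     /* may land at a different VA
                                                 in each process */
        *p = 42;   /* visible to the other process via the same page frame */
        munmap(p, 4096);
        close(fd);
        return 0;
    }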

Aside: How does hardware know which software is the OS?

▪ x86 (Intel/AMD) protection rings: ring 0 to ring 3
▪ User applications run in ring 3 (least privileged / user mode)
▪ The OS runs in ring 0 (supervisor / privileged mode)
▪ What about the other rings?
‣ Typically not used
‣ SMM uses ring 2 for running firmware

Aside: What privileges does the OS get by running in ring 0?

▪ A subset of instructions is executable only in ring 0
‣ Typically ones that change system state, e.g., invlpg
▪ Some system state is accessible/modifiable only in ring 0
‣ e.g., the interrupt flag IF
▪ Parts of physical memory are accessible only in ring 0 (virtual memory helps enforce this)
▪ Physical addresses are visible to software only in ring 0 (ring 3 can never see a physical address)

Plan for the lectures

What is virtual memory?

Hardware implementations of virtual memory

Software management of memory (Linux)

Advanced topics in memory management

Implementing virtual memory

▪ Many ways to implement virtual memory, in increasing order of both usefulness/versatility and complexity:
‣ Base and bound registers
‣ Segmentation
‣ Paging

Base and bound registers

Each process's virtual address space maps to one contiguous region of physical memory, described by a pair of registers: a base (where the region starts) and a bound (how far it extends). Process 1 and Process 2 each get their own base and bound values.

Base and bound registers

▪ How it works:
‣ Each CPU core has base and bound registers
‣ Each process has its own values for base and bound
‣ Base and bound values are assigned by the OS
‣ On a context switch, a process's base and bound registers are saved/restored by the OS
▪ Example: the Cray-1 (a very old system)
▪ Advantage:
‣ Simple
▪ Disadvantages:
‣ Needs contiguous physical memory
‣ Deallocation of memory is possible only at the edge
‣ Cannot allocate more memory if there is no free memory at the edge
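A minimal sketch of base-and-bound translation under the scheme above; the register pair and the fault behavior are simplified for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t base;   /* start of the process's physical region */
        uint64_t bound;  /* size of the region in bytes */
    } bb_regs;

    /* Translate a virtual address; false means a fault would be raised. */
    static bool bb_translate(const bb_regs *r, uint64_t va, uint64_t *pa) {
        if (va >= r->bound)   /* bound check: VA must fall inside the region */
            return false;
        *pa = r->base + va;   /* relocation: add the base */
        return true;
    }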

Segmentation

Each process's virtual address space is divided into segments with typical semantics: code, data, and stack. Each segment occupies its own contiguous chunk of physical memory, so the process as a whole no longer needs one single contiguous region.

Segmentation: How it works?

A virtual address is a (segment id, offset) pair, e.g., (CS, 128). The segment id indexes a segment table; the hardware performs a protection and bound check, then adds the offset to the segment's base. An example segment table:

    Seg   Prot   Base      Length
    SS    R/W    0x0100    1024
    CS    R      0x30000   1024
    DS    R/W    0x40100   16384
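A minimal sketch of the lookup and checks using the table above; the encoding of segment ids and the fault path are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    enum prot { P_R = 1, P_W = 2 };

    typedef struct { uint8_t prot; uint64_t base; uint64_t length; } seg_entry;

    static const seg_entry seg_table[] = {
        { P_R | P_W, 0x0100,  1024  },   /* SS */
        { P_R,       0x30000, 1024  },   /* CS */
        { P_R | P_W, 0x40100, 16384 },   /* DS */
    };

    static bool seg_translate(unsigned seg_id, uint64_t offset, bool is_write,
                              uint64_t *pa) {
        const seg_entry *e = &seg_table[seg_id];
        if (offset >= e->length) return false;           /* bound check */
        if (is_write && !(e->prot & P_W)) return false;  /* protection check */
        *pa = e->base + offset;
        return true;
    }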

Segmentation: How it works?

▪ How is the segment id selected?
‣ In most cases, the compiler can infer it
• For example, 32-bit x86 has CS, SS, DS, ES (compiler-selected) and FS, GS (programmer-selected)
▪ Advantages of segmentation:
‣ Multiple memory regions with different protections
‣ No need to have all segments in memory all the time
‣ Ability to share segments across processes

Disadvantages of segmentation

As several processes' variable-sized segments are allocated and freed, physical memory develops holes between them: a new segment may not fit anywhere, so the free memory cannot be used although its total size is more than requested. (Partial picture credit: Nima Honarmand, Stony Brook University)

▪ External fragmentation: inability to use free memory because it is non-contiguous
‣ Allocating different-sized segments leaves free memory fragmented
▪ A segment of size n bytes requires n bytes of contiguous physical memory
▪ Not completely transparent to applications

Paging

▪ Idea: allocate/deallocate memory in same-sized chunks (pages)
‣ No external fragmentation (why? any free page frame can back any virtual page)
‣ Page sizes are typically small, e.g., 4KB

Paging

Each process's virtual address space is divided into fixed-size pages, and each page maps to a same-sized physical page frame; the frames backing a process's pages can be scattered anywhere in physical memory.

Advantages of paging

▪ No external fragmentation!
▪ Simplifies allocation and deallocation
▪ Provides the application a contiguous view of memory
‣ But the physical memory backing it can be very scattered

Almost all processors today employ paging.

Disadvantages of paging

▪ Internal fragmentation: memory wasted because allocation happens only in page-size units (e.g., 4KB)
‣ e.g., for a 128-byte allocation, a whole 4KB page is allocated
‣ Potentially wastes (4KB − 128) bytes

Paging: How it works (simplistic)?

A virtual address, e.g., 0x0001bbf21f, splits into a virtual page number (VPN = 0x0001bbf) and a page offset (0x21f). The VPN indexes the page table, an in-memory data structure maintained by the OS; each entry holds protection bits and a physical frame number (PFN), for example:

    VPN         Prot   PFN
    ...         ---    ---  (unmapped)
    0x0001bbf   R      0x30000
    ...         R/W    0x40100
    ...         R/W    0x60000

The PFN combined with the page offset gives the location within the physical page frames.
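A minimal sketch of this single-level translation with 4KB pages; the table layout and the fault behavior are simplified for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                        /* 4KB pages */
    #define OFFSET_MASK ((1ULL << PAGE_SHIFT) - 1)

    typedef struct { bool present; uint64_t pfn; } pte_t;

    static bool va_to_pa(const pte_t *page_table, uint64_t va, uint64_t *pa) {
        uint64_t vpn    = va >> PAGE_SHIFT;      /* virtual page number */
        uint64_t offset = va & OFFSET_MASK;      /* byte within the page */
        const pte_t *pte = &page_table[vpn];     /* single-level lookup */
        if (!pte->present)
            return false;                        /* would raise a page fault */
        *pa = (pte->pfn << PAGE_SHIFT) | offset;
        return true;
    }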

Paging: How it really works?

▪ A single-level page table adds too much overhead
‣ Consider a typical 48-bit virtual address space, a 4KB page size, and 8-byte page table entries (PTEs)
‣ Each page table would be 2^36 entries × 8 bytes = 512GB!
‣ Each process has its own page table (how many page tables for a multi-threaded process? still just one, since threads share the address space)
‣ 100s of processes → 100s of page tables
▪ Observation: the virtual address space is often sparsely allocated
‣ Not all possible addresses are allocated
‣ How many applications really use all 48 bits of virtual address space?
▪ Solution: a multilevel radix tree for the page table

Paging: 64-bit x86* page table

A 4-level, 512-ary radix tree indexed by the virtual page number. Each node has up to 512 children, and subtrees for unallocated VA regions are simply absent. The leaf level contains the PFN of each 4KB page. (How much of the VA region does an entry at each level represent? 4KB at the leaf, then 512× more per level up: 2MB, 1GB, 512GB.)

* i.e., Intel or AMD processors

Paging: 64-bit x86 page table

▪ The operating system maintains one page table per process (i.e., per virtual address space)
▪ The operating system creates, updates, and deletes page table entries (PTEs)
▪ The page table structure is part of the agreement between the OS and the hardware → the page table structure is ISA-specific
‣ e.g., IBM processors' page table structure is different from that of x86 processors (Intel and AMD)

Paging: 64-bit x86 page walk

▪ When an application reads/writes a memory location, it issues load or store instructions with virtual addresses
▪ Hardware needs to translate the virtual address to a physical address for each load/store
‣ Hardware looks up the translation in the OS-created page table
• Where does the page table live? In memory
‣ The process of looking up the page table is called a page table walk

The 48-bit virtual address splits into four 9-bit indices plus a 12-bit page offset: bits [47:39] index the top-level table, then [38:30], [29:21], and [20:12] index the next three levels, and bits [11:0] are the page offset. The cr3 register holds the root of the page table; each level's entry points to the table of the next level, and the last level yields the physical page frame number.
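A minimal sketch of the 4-level walk, assuming the walker can directly dereference page-table memory and using a simplified PTE layout; a real walker is a hardware state machine, not C code.

    #include <stdint.h>

    #define PRESENT  0x1ULL
    #define PFN_MASK 0x000ffffffffff000ULL  /* bits 51:12 hold the frame address */

    /* Returns the physical address for va, or (uint64_t)-1 to signal a fault. */
    static uint64_t page_walk(uint64_t cr3, uint64_t va) {
        uint64_t table = cr3 & PFN_MASK;          /* root of the radix tree */
        for (int level = 3; level >= 0; level--) {
            /* 9-bit index per level: bits [47:39], [38:30], [29:21], [20:12] */
            uint64_t idx = (va >> (12 + 9 * level)) & 0x1ff;
            uint64_t pte = ((uint64_t *)table)[idx];  /* one memory access/level */
            if (!(pte & PRESENT))
                return (uint64_t)-1;                  /* page fault to the OS */
            table = pte & PFN_MASK;                   /* next table, or the PFN */
        }
        return table | (va & 0xfff);                  /* frame base + page offset */
    }

Note the four dependent memory accesses, one per level; this is exactly why the walk is slow and worth caching.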

Who walks the page table?

▪ A hardware page table walker (PTW) walks the page table
‣ Input: the root of the page table (cr3) and the VPN
‣ Output: the physical page frame number, or a fault
‣ A hardware finite-state automaton in each CPU core
• Generates load-like "instructions" to access the page table
• Typical in x86 and ARM processors
▪ Alternative: a software page table walker
‣ An OS handler walks the page table
‣ Advantage: free to choose the page table format
‣ Disadvantage: slow → large address translation overhead
‣ Example: SPARC (Sun/Oracle) machines

Contents of a page table entry

A typical x86-64 PTE is 8 bytes; besides the PFN, it carries flag bits:
▪ P (Present bit): is the address present in memory (DRAM)?
▪ U/S (User/Supervisor bit): is the page accessible in supervisor mode (e.g., by the OS) only?
▪ R/W (Read/Write bit): is the page read-only?
▪ A (Accessed bit): has this page ever been accessed (loaded from/stored to)?
▪ D (Dirty bit): has the page been written to?
▪ XD (Execute-disable bit): is the page's content executable?
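A minimal sketch of decoding these flag bits; the bit positions follow the x86-64 PTE format.

    #include <stdint.h>
    #include <stdio.h>

    #define PTE_P  (1ULL << 0)    /* present */
    #define PTE_RW (1ULL << 1)    /* writable if set */
    #define PTE_US (1ULL << 2)    /* user-accessible if set */
    #define PTE_A  (1ULL << 5)    /* accessed */
    #define PTE_D  (1ULL << 6)    /* dirty */
    #define PTE_XD (1ULL << 63)   /* execute-disable */

    static void describe_pte(uint64_t pte) {
        printf("present=%d writable=%d user=%d accessed=%d dirty=%d no-exec=%d\n",
               !!(pte & PTE_P), !!(pte & PTE_RW), !!(pte & PTE_US),
               !!(pte & PTE_A), !!(pte & PTE_D), !!(pte & PTE_XD));
    }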

TLB: Making page walks faster

▪ Disadvantage of the walk: a single address translation can take up to 4 memory accesses!
▪ How to make address translation fast?
‣ Observation: locality of access is common in programs
‣ A Translation Lookaside Buffer (TLB) caches recently used virtual-to-physical address mappings
‣ It is a read-only cache of page table entries
‣ For address translation, the TLB is looked up first
‣ On a TLB miss, a page walk is performed
‣ A TLB hit is fast, but a page walk is slow

Where are the TLBs and page table walkers?

Each core (Core 0 to Core 3) has its own L1 TLB, L2 TLB, and page table walker (PTW) alongside its private L1 and L2 caches; the shared L3 cache sits across the interconnect.

Typical TLB hierarchy of modern processors

▪ Each core typically has:
‣ A 32-64 entry L1 TLB (fully associative or 4-8 way set-associative)
‣ A 1500-2500 entry L2 TLB (8-16 way set-associative)
‣ One or two page table walkers

Putting it all together

▪ Steps of address translation:
‣ A load/store instruction carries a virtual address
‣ The virtual address is divided into a virtual page number (VPN) and a page offset
‣ The TLBs are looked up for the desired VPN-to-page-frame-number (PFN) mapping
‣ On a hit, the PFN is concatenated with the offset, and the cache is looked up using the physical address
‣ On a miss, the PTW starts walking the page table to get the desired PFN
‣ If found, the address mapping is returned to the TLB
‣ If no entry exists for the VPN in the page table, a page fault is raised to the OS

Plan for the lectures: next, software management of memory (Linux).

Operations in virtual memory management

▪ Virtual address allocation
‣ Responsibility of the OS and libraries (e.g., malloc in glibc)
▪ Physical memory allocation
‣ Responsibility of the OS
‣ Note: applications never see physical memory
▪ Creating virtual-to-physical address mappings
‣ i.e., the page table entries (PTEs)
‣ Responsibility of the OS
▪ Performing address translation on every load/store
‣ Hardware PTWs walk the page tables
▪ Updating/removing address mappings
‣ Also called TLB shootdown
‣ Responsibility of the OS

Roles of HW and SW

Application/library: sees only virtual addresses
→ malloc(): virtual address allocation request (a.k.a. "memory allocation")
malloc libraries/runtimes
→ mmap(), sbrk(): virtual address allocation at page granularity
OS: physical memory allocation, page table allocation and update
Processor: address translation
DRAM: physical addresses, the real location of data/code

Virtual address allocation

▪ "Memory allocation" is actually virtual address allocation
▪ The operating system is responsible for allocating virtual addresses for an application

Typical virtual address layout of a process in Linux, from low to high: code at the bottom (0x00000), then static data, then the heap (dynamically allocated, grows upward), and the stack at the top (0x7ffffffff, grows downward).

Creation of a virtual address space

▪ The OS (the "loader") is responsible for creating the VA space on process creation

▪ The ELF binary of the application contains the VA layout
‣ A format/convention for how the application's code and data are to be laid out
‣ It has a header and sections like .text (code), .data (initialized data), and .bss (uninitialized data)
‣ Some other interesting sections:
• .got (data/code addresses for libraries), .plt (trampolines for libraries)
‣ It also contains the address of the first instruction to execute
‣ The OS's loader knows how to interpret ELF

Creation of a virtual address space

▪ Steps of process creation:
‣ The fork() system call creates a VA space that is a copy of the parent's VA space
‣ The execve(binary) system call reads the ELF and mmap()s its sections into the process's VA space
• It then transfers control to the first user instruction of the application

Representation of VA (in Linux)

▪ struct task_struct (the TCB) represents a process: pid, status, list of open files, list of signals, and a pointer to the VA space
▪ struct mm_struct represents a virtual address space: a pointer to the page table root, plus the start/end of the stack, code, and mmap regions
▪ VMAs (VM areas) represent contiguous chunks of allocated virtual address range
‣ Each holds a starting VA, an ending VA, and flags/protections (VM_READ, VM_WRITE, VM_SHARED, ...)
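A highly simplified sketch of these structures; the real definitions live in the kernel headers (<linux/sched.h>, <linux/mm_types.h>) and contain many more fields, so take this only as a mental model.

    #include <stdint.h>

    struct vm_area_struct {               /* one contiguous allocated VA range */
        uint64_t vm_start, vm_end;        /* [start, end) virtual addresses */
        unsigned long vm_flags;           /* VM_READ | VM_WRITE | VM_SHARED ... */
        struct vm_area_struct *vm_next;   /* next VMA of this address space */
    };

    struct mm_struct {                    /* one per virtual address space */
        void *pgd;                        /* pointer to the page table root */
        struct vm_area_struct *mmap;      /* VMAs: code, heap, stack, ... */
    };

    struct task_struct {                  /* the TCB */
        int pid;
        int status;
        struct mm_struct *mm;             /* the task's virtual address space */
        /* list of open files, signals, ... */
    };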

Aside: What is the difference between a process and a thread?

▪ Primary difference:
‣ Each process has a separate virtual address space
‣ All threads in a process share the same virtual address space
▪ In Linux terms: task_struct → one per thread; mm_struct → one per process
‣ The task_structs of all threads of a process point to the same mm_struct (and share the thread group id, tgid)

Aside: What about the stacks of each thread?

▪ Should each thread have the same stack or a different one?
‣ A different one, of course; otherwise how would different threads call different functions?
▪ But if threads share the entire virtual address space, don't they share the stack?
‣ NO.
▪ There is one main process stack (shown in the previous picture) and one separate stack for each thread that is created dynamically (e.g., using pthread_create)

Aside: Stacks in a multi-threaded application

In the process's virtual address layout, the main stack sits at the top; below it come the dynamically created stacks of thread 1, thread 2, ..., thread n, and below those the heap, static data, and code.

▪ How does each thread know which stack is its own? (Each thread's stack pointer register points into its own stack.)
▪ Is it okay to share the heap across multiple threads? (Yes; heap memory is shared process-wide.)

Dynamically allocating a virtual address range (a.k.a. "allocating memory")

▪ A user application or library requests VA allocation via system calls

Aside: What is a system call?

▪ A function call, but from user land into the OS
▪ An interface that exposes system functionality to user applications/libraries
‣ e.g., opening, reading, and writing files
‣ e.g., virtual address allocation
▪ What happens on a system call?
‣ A special instruction like sysenter is executed
‣ It switches from ring 3 to ring 0 and jumps to a specific system call handler in the OS
‣ The handler does the work of the invoked system call
‣ It returns to the user application by executing sysexit
▪ Linux has more than 300 system calls
‣ Each system call is identified by a unique integer number
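A minimal sketch of invoking system calls directly by number through the libc syscall() wrapper; the numbers shown are the x86-64 Linux ones.

    #define _GNU_SOURCE
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        syscall(SYS_write, 1, "hi\n", 3);   /* write(2) is number 1 on x86-64 */
        long pid = syscall(SYS_getpid);     /* getpid(2) is number 39 */
        (void)pid;
        return 0;
    }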

Allocating a virtual address range

▪ A user application or library requests VA allocation via system calls:

    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

‣ length is allocated at 4KB page granularity
‣ prot ➔ PROT_NONE, PROT_READ, PROT_WRITE, ...
‣ flags ➔ MAP_ANONYMOUS, MAP_SHARED, MAP_PRIVATE, ...
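A minimal usage sketch: asking the OS for one page of private, zero-filled anonymous memory.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4096;                       /* one 4KB page */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        p[0] = 'x';     /* first touch; see demand paging below */
        munmap(p, len); /* release the VA range back to the OS */
        return 0;
    }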

What happens on dynamic memory allocation (e.g., mmap())?

The OS adds a new VMA describing the allocated range, with its flags/protections (VM_READ, VM_WRITE, VM_SHARED, ...), to the process's mm_struct (a vma_cache remembers recently used VMAs).

Extending heap memory

▪ Heap: a special type of dynamically allocated contiguous memory region (data segment) that grows upward
▪ System call in Linux to extend the heap:
‣ void *sbrk(intptr_t increment);

Extending the heap via sbrk just moves the ending VA of the heap's VMA upward by the requested increment; as with mmap, only the VMA in the mm_struct changes.

Demand paging

▪ Note: when "memory" (i.e., VA) is allocated, no physical memory is allocated
▪ Why?
‣ Virtual addresses are abundant; physical memory is a scarce resource
▪ Physical memory is allocated for a page only when its virtual address is accessed for the first time
▪ Lazily allocating physical memory this way is called demand paging
▪ Advantage: physical memory is committed only if it is actually used

Role of page faults in demand paging

▪ Page fault: when the h/w page table walker fails to find the desired PTE, the processor raises a page fault to the OS
‣ The faulting address is stored in a specific register (e.g., cr2 in x86)
‣ Control transfers to a specific page fault handling routine in the OS
▪ The first access to an allocated memory region generates a page fault
▪ A page fault can also happen due to insufficient permission (e.g., write access to a read-only page) → a protection fault

Servicing a page fault

▪ The page fault routine is implemented inside the OS
▪ Its arguments are the faulting VA and the type of access (e.g., read or write)
▪ Steps of handling a page fault (sketched in code below):
‣ Check the VMA structures to see whether the VA is allocated
‣ If it is not allocated, or the permissions are insufficient, raise a segmentation fault to the application
‣ If it is allocated with the correct permission, find a physical page frame to map the page containing the faulting VA
‣ Update the page table to record the new VA→PA mapping
‣ Return
▪ The faulting instruction retries after the page fault handler returns

Next: how the OS allocates that physical page frame.
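A minimal sketch of this fault path; every helper here (find_vma, alloc_page_frame, map_page, send_sigsegv) is a hypothetical stand-in for the corresponding kernel machinery.

    #include <stdbool.h>
    #include <stdint.h>

    struct vma { uint64_t start, end; bool writable; };

    extern struct vma *find_vma(uint64_t va);        /* VMA lookup */
    extern uint64_t    alloc_page_frame(void);       /* e.g., buddy allocator */
    extern void        map_page(uint64_t va, uint64_t pfn, bool writable);
    extern void        send_sigsegv(void);           /* segmentation fault */

    void handle_page_fault(uint64_t fault_va, bool is_write) {
        struct vma *v = find_vma(fault_va);
        if (!v || (is_write && !v->writable)) {  /* unallocated or bad perms */
            send_sigsegv();
            return;
        }
        uint64_t pfn = alloc_page_frame();       /* demand-allocate a frame */
        map_page(fault_va & ~0xfffULL, pfn, v->writable); /* update page table */
        /* on return, the faulting instruction retries and now succeeds */
    }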

Physical memory users

Physical memory pages serve several users:
▪ Application allocations (anonymous memory)
▪ Files (the page cache)
▪ Device DMA buffers
▪ The kernel's own memory usage (e.g., kmalloc)

(Picture credit: Nima Honarmand, Stony Brook University)

Memory zones and page blocks

▪ Physical memory is divided into multiple zones, for example:
‣ ZONE_NORMAL (most commonly used)
‣ ZONE_DMA (typically used for DMA requests)
‣ ZONE_HIGHMEM (empty on x86-64; used on 32-bit systems)
▪ Each zone has a descriptor containing metadata about that zone, such as the number of free pages and the blocks of free pages
▪ Depending on the purpose of a physical memory allocation, memory from different zones can be used
▪ All these different zones of memory together form a "node" of memory
‣ There can be multiple such nodes in servers with NUMA (coming later under advanced topics)

Allocating physical memory

    struct page *alloc_pages(gfp_mask, order)

Behind this call sits a per-node physical memory allocator: each zone (DMA, NORMAL, HIGHMEM) has a per-CPU page cache in front of a buddy allocator.

▪ alloc_pages() is Linux's primary internal function for getting hold of free physical page frames (it is not a system call)
‣ Internal → only other kernel functions can invoke it; it is not exposed to users
‣ There are many gfp_mask values:
• The most common is GFP_KERNEL: most allocations use this flag (e.g., an allocation request made while servicing a system call); if not enough free physical memory is available, an allocation with this flag can block
• GFP_ATOMIC: sometimes a physical memory allocation cannot block (wait), for example within the critical section of an interrupt handler
‣ The zone allocator typically maintains a minimum number of free pages, particularly to satisfy GFP_ATOMIC requests
• FUN: see how many in /proc/sys/vm/min_free_kbytes on your Linux system
▪ alloc_pages(gfp_mask, order) requests 2^order contiguous physical page frames

Within the per-node allocator, the per-CPU page caches serve quick allocation requests for single page frames; everything else, or a request arriving when a page cache is empty, goes to the buddy allocator of the chosen zone.

Representing physical memory in Linux

▪ Ultimately, Linux manages physical memory at the granularity of 4KB pages (a.k.a. page frames)
▪ Linux represents physical memory with an array of struct page objects for each zone
‣ Also called page descriptors
‣ Think of them as per-page metadata
‣ One struct page object per 4KB physical page frame
‣ Given a physical address, the descriptor is easy to find
‣ struct page has a reverse pointer to the VA that maps the corresponding physical page frame (a.k.a. reverse mapping)

The buddy allocator

▪ The buddy allocator in Linux: its goal is to keep free physical memory as contiguous as possible (why? so that requests for contiguous multi-page blocks can still be satisfied)
▪ It is an array of free lists of page descriptors representing contiguous physical blocks of different sizes (2^order × 4KB): order 0 = 4KB, order 1 = 8KB, order 2 = 16KB, ..., order 4 = 64KB, ..., up to order 10
▪ Allocate from the smallest free list that fits the requested allocation size
‣ If there is no entry in that list, go to the next larger list, and so on
‣ Put the leftover blocks into the lower-order lists
▪ On freeing, merge two contiguous blocks in a free list and add the merged block to the next higher-order free list

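A minimal sketch of the order computation (a hypothetical helper, not the kernel's code): the smallest order whose block of 2^order pages covers a request of n pages.

static unsigned int order_for_pages(unsigned long n)
{
        unsigned int order = 0;
        while ((1UL << order) < n)   /* block of 2^order pages too small */
                order++;
        return order;                /* e.g., n = 3 pages → order 2 (16KB) */
}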
Allocating physical memory on page fault

▪ On a page fault, the fault handler asks the zone allocator to provide a free page (or contiguous free pages)
‣ This can potentially go to the buddy allocator
• Either because the per-CPU cache is empty or because the request is for more than one page frame
▪ The buddy allocator returns a pointer to the struct page corresponding to the chosen physical page frame(s)

Roles of HW and SW

▪ Application: sees only virtual addresses
▪ Libraries/runtimes: handle malloc(), i.e., virtual address allocation requests (a.k.a. "memory allocation")
▪ OS: serves mmap()/sbrk() (virtual address allocation at page granularity); physical memory allocation; page table allocation and update
▪ Processor: address translation
▪ DRAM: physical addresses, the real location of data/code

The problem with mmap, sbrk

▪ The minimum "memory" (virtual address) allocation size in mmap, sbrk is 4KB
‣ But most user applications allocate at smaller granularities

▪ A lot of memory is wasted ("internal fragmentation") if memory is allocated only via mmap, sbrk

▪ Other problem: system calls are slow

Where does malloc() fit in?

▪ malloc() is a function call implemented in a library (ring 3; user land). It is not part of the OS.
▪ malloc() allocates a virtual address range (like mmap())
‣ But unlike mmap, sbrk, malloc() is optimized for small allocation/de-allocation requests
‣ malloc() with a large size (e.g., >32KB) converts to mmap
▪ malloc() maintains free lists of small allocations
▪ If free memory is available in malloc()'s free lists, there is no need to go to the OS (no system call → faster)
▪ If it accumulates "enough" free memory, it returns memory to the OS via munmap

▪ Key goals of a malloc library:
‣ Reduce memory bloat = memory allocated beyond what the application asked for
‣ Reduce the number of system calls to the OS (e.g., mmap())
• System calls are slow
‣ Fast allocations/de-allocations
▪ Many approaches to building a malloc library
‣ Best fit, first fit, worst fit
▪ Many malloc() libraries exist
‣ glibc malloc, tcmalloc, dlmalloc
▪ You can write your own!

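A toy illustration of the large-size policy above (the 32KB threshold is the slide's example; toy_alloc is hypothetical, mmap/malloc are standard):

#include <stdlib.h>
#include <sys/mman.h>

#define MMAP_THRESHOLD (32 * 1024)

void *toy_alloc(size_t sz)
{
        if (sz >= MMAP_THRESHOLD) {          /* large: straight to mmap */
                void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                return p == MAP_FAILED ? NULL : p;
        }
        return malloc(sz);                   /* small: library free lists */
}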
Simple Algorithm: Bump Allocator

▪ Example sequence: malloc(6), malloc(12), malloc(20), malloc(5) carve out consecutive chunks of the virtual address space of the process
▪ Simply "bumps" up the free pointer on each allocation
▪ How does free() work?
‣ It doesn't; it's a no-op
▪ This is ideal for simple programs
‣ You only care about free() if you need the memory for something else
▪ What if memory is limited? → Need more complex allocators
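A minimal bump allocator sketch (names hypothetical; a real one would obtain its arena from mmap()):

#include <stddef.h>
#include <stdint.h>

static uint8_t arena[1 << 20];      /* fixed 1MB arena, for illustration */
static size_t  bump;                /* offset of the free pointer */

void *bump_malloc(size_t sz)
{
        sz = (sz + 7) & ~(size_t)7; /* keep 8-byte alignment */
        if (bump + sz > sizeof(arena))
                return NULL;        /* arena exhausted */
        void *p = &arena[bump];
        bump += sz;                 /* "bump" the free pointer */
        return p;
}

void bump_free(void *p) { (void)p; /* a no-op, as described above */ }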

Implementation issues

▪ Control fragmentation
▪ Splitting and coalescing
▪ Free space tracking
▪ Allocation strategy
▪ Allocation and free latency
▪ Implementation complexity

Fragmentation

▪ Undergrad review: What is it? Why does it happen?
‣ Happens due to variable-sized allocations
‣ Why do we care about fragmentation in VA?
• The VA may already be mapped to PA; more system call overheads, more page fault overheads
▪ What is:
‣ Internal fragmentation?
• Wasted space when you round an allocation up
‣ External fragmentation?
• When you end up with small chunks of free memory that are too small to be useful
▪ Which kind does our bump allocator have?

Splitting and Coalescing

▪ Split a free VA region into smaller ones upon allocation
‣ Why? To reduce/avoid internal fragmentation
▪ Coalesce a just-freed VA region with neighboring free objects upon deallocation
‣ Why? To reduce/avoid external fragmentation
▪ We need extra meta-data for these
‣ At least the free VA region's size
‣ Data/mechanisms to find the neighboring regions for coalescing

Keeping Per-region Meta-data

▪ Prepend the meta-data to the allocated object (as a header)
‣ On malloc(sz), look for a free object of size at least sz + sizeof(header)

[Figure] Allocated object (VA range): a header { int size; int status; /* other data */ } followed by the payload; the returned pointer (the return value of malloc()) points just past the header.
[Figure] Free object: a header { int size; void *next; }

• For free objects, the meta-data can be kept in the object itself

Slide credit: Nima Honarmand, Stony Brook University

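In C, those headers might look like this sketch (field names taken from the slide; payload_of is a hypothetical helper):

#include <stddef.h>

struct header {                     /* prepended to every allocated object */
        int size;                   /* usable bytes after the header */
        int status;                 /* e.g., allocated vs. free */
};

struct free_header {                /* for free objects, stored in place */
        int size;
        struct free_header *next;
        struct free_header *prev;
};

/* malloc() hands back the address just past the header. */
static void *payload_of(struct header *h) { return h + 1; }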
Tracking Free Regions

▪ Link the free objects in a doubly-linked list
‣ Using the next/prev fields in the free object header
‣ Keep the list head in a global variable
▪ malloc() is simple with this representation
‣ Traverse the free list
‣ Find a big-enough object
• Split it if necessary
‣ Return the pointer
▪ What about free()?
‣ Easy to add the object to the free list
‣ What about coalescing?
• Not easy to do dynamically on every free(); why?
• Can periodically traverse the free list and merge neighboring free objects

Slide credit: Nima Honarmand, Stony Brook University

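For instance, a first-fit walk over that list (a self-contained sketch; a real allocator would also split the object and fall back to mmap()/sbrk() on failure):

#include <stddef.h>

struct freehdr {
        size_t size;
        struct freehdr *next, *prev;
};

static struct freehdr *free_list;   /* global list head */

static struct freehdr *first_fit(size_t sz)
{
        for (struct freehdr *f = free_list; f != NULL; f = f->next) {
                if (f->size < sz)
                        continue;
                /* unlink from the doubly-linked free list */
                if (f->prev) f->prev->next = f->next;
                else         free_list     = f->next;
                if (f->next) f->next->prev = f->prev;
                return f;
        }
        return NULL;   /* no fit: grow the heap via mmap()/sbrk() */
}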
Performance Issues (1)

▪ Allocation
‣ Need to quickly find a big-enough object
‣ Searching a free list can take a long time
‣ Can use other data structures
• All sorts of trees have been proposed
‣ Or, can avoid searching altogether by having pools of same-size objects
▪ Segregated pools: on malloc(sz), round sz up to the next available object size, and allocate from the corresponding pool (any issue here?)

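A sketch of the size-class rounding for segregated pools (the class sizes below are made up for illustration):

#include <stddef.h>

static const size_t classes[] = { 16, 32, 64, 128, 256, 512 };

/* Round sz up to the next class; -1 means "too big, use the
 * general-purpose path". The rounding itself is the internal
 * fragmentation the slide asks about. */
static int class_for(size_t sz)
{
        for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
                if (sz <= classes[i])
                        return (int)i;
        return -1;
}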
Slide credit: Nima Honarmand, Stony Brook University

Performance Issues (2)

▪ Deallocation
‣ Returning a free object to the free list is easy and fast
‣ A bit more overhead if using other data structures
▪ Coalescing
‣ Not easy in any case
• Have to find the neighboring free objects
• Book-keeping can be complex
‣ Alternative: avoid coalescing by using segregated pools
• All objects of the same size, so no need to coalesce at all

Performance Issues (3)

▪ Allocation/de-allocation from multi-threaded applications
‣ Need locking for concurrent malloc()s and free()s
• Why? Lots of shared data structures
▪ Types of concurrency-related overheads:
1. Waiting for locks: contended locks cause serialized execution
• If locks are used, only one thread can allocate/deallocate at any point in time
2. lock/unlock is pure overhead, even when uncontended
• Often uses atomic instructions
• Can take tens of cycles
▪ Alternative: avoid concurrency issues by having per-thread heaps
‣ Or, at least, reduce contention by having multiple heaps and distributing the threads across them

Example high-performance malloc() library: tcmalloc

▪ Thread-local caches of free objects (one per thread): need no locking
▪ Central cache: free lists of size 1, size 2, …, size n: requires locking
▪ Page heap (backed by mmap()): requires locking
▪ For large allocations the page heap is used directly

Putting everything together

1. The application requests memory allocation via malloc() or mmap(), sbrk(), etc. The library may request VA allocation via mmap()/sbrk()
2. The OS creates/extends VMA regions to allocate the VA range
3. The OS returns the starting VA of the just-allocated VA range
4. The application performs a load/store on an address in the VA range
5. H/W looks up the TLB; it misses
6. The H/W page table walker walks the page table
7. The desired entry is not found (the PTE is not present)
8. H/W raises a page fault to the OS
9. The OS's page fault handler checks the VMA regions to ensure it is a legal access
10. The page fault handler obtains a free physical page frame for the VA page from the buddy allocator
11. It updates the page table to record the new VA-to-PA mapping and returns
12. The application retries the same instruction
13. This time the page table walker finds the mapping in the page table and loads it into the TLB
14. The next time the same page is accessed, it may hit in the TLB (→ no page walk)

High level view of memory allocation (so far)

[Figure] User (ring 3): in each process (Process 1, Process 2, …, Process n), the rest of the application calls the dynamic memory allocator via malloc()/free(); the allocator requests VA ranges from the kernel via brk()/mmap(); loads/stores can trigger page faults into the kernel (via h/w).
[Figure] Kernel (ring 0): the virtual address space manager and the page fault handler sit above the physical page frame allocator (zone allocator), which is reached via page_alloc()/page_free().

Partial picture credit: Nima Honarmand, Stony Brook University

Managing kernel’s own memory

The kernel needs memory too

▪ For what?
‣ Allocating TCBs (task_struct), address spaces (mm_struct), inodes, …
‣ Basically, for all sorts of meta-data

Kernel virtual address space

▪ The kernel has access to physical addresses
▪ But the processor does not allow any load/store (even from the kernel) to use a physical address
‣ Except during the initial part of the boot process (called real mode)
▪ Thus, the kernel also has to use virtual addresses to access memory

▪ How many virtual address spaces for the kernel?
‣ The kernel is a single (privileged) process
‣ So it has one virtual address space

Kernel virtual address space

[Figure] x86-64 virtual address space layout: the kernel VA occupies the upper half of the address space (47th VA bit set to 1); a guard band separates it from the user VA in the lower half (47th VA bit set to 0). Within the user VA, from the top down to address 0x0: stack, dynamically allocated memory, heap, static data, code.

Kernel virtual address space

[Figure] Every user process (user process 0, 1, 2, …) has its own user VA (code, static data, heap, dynamically allocated memory, stack) plus an identical copy of the kernel VA mapped on top.

Why copy Kernel VA?

▪ It avoids an address space switch (cr3 overwrite; TLB flush) when switching from user space (ring 3) to kernel space (ring 0) and vice-versa
‣ Such switches happen on every system call and on every return from a system call
‣ Whenever an interrupt arrives while in user space, we need to switch to kernel space
‣ We need to make switching between ring 0 and ring 3 fast
▪ Having a copy of the kernel address space attached to each user process's VA space avoids a page table switch (cr3 write; TLB flush)

How is the kernel VA copied to every process's VA?

▪ Every process has a parent process
▪ The first process is "init" in Linux
‣ All other processes are fork()-ed from either init or one of its children
‣ Try the pstree command on Linux
‣ Linux attaches the kernel page table to init's page table
▪ Every process starts by copying (inheriting) its parent's page table
‣ The parts of the page table belonging to the kernel are not modified
‣ Thus, through inheritance down the process tree, every user process's page table contains the kernel page table too

Key parts of kernel VA in Linux (x86-64)

▪ 64 TB of reserved VA space: direct mapped memory

Linux direct mapped memory

▪ A one-to-one mapping to the entire physical memory of the system
▪ D_VA = PA + Offset (the offset is a constant, set at boot time)
▪ Page tables for the direct mapped memory are set up during kernel initialization

▪ Why could direct mapping be helpful?
‣ The kernel often needs to find a reverse mapping: given a PA, it needs a VA to access it
• e.g., the kernel plans to copy the contents of a physical page to disk. It knows the PA of that page but needs a VA to access its contents (remember, the kernel also needs to access memory using VAs)
‣ With direct mapping, the VA for accessing a PA can simply be calculated (no need for any lookup)

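A sketch of that calculation (the constant's value varies across kernels and configurations; the kernel's real __va()/__pa() helpers do essentially this for direct mapped memory):

/* Illustrative constant only. */
#define DIRECT_MAP_OFFSET 0xffff888000000000UL

static inline void *pa_to_dva(unsigned long pa)
{
        return (void *)(pa + DIRECT_MAP_OFFSET);  /* no lookup needed */
}

static inline unsigned long dva_to_pa(void *dva)
{
        return (unsigned long)dva - DIRECT_MAP_OFFSET;
}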

vmalloc/ioremap

▪ Unlike direct mappings, these kernel mappings are created dynamically, on request
‣ Infrequently used
‣ Typically needed for mapping large buffers (e.g., in a device driver)

▪ ioremap is used for mapping memory-mapped I/O where the physical address region is not covered by the direct mapped memory

Key parts of kernel VA in Linux (x86-64)

▪ 1.5 TB of reserved VA space: driver/module mapping
▪ 512 MB of reserved VA space: kernel code (text), typically mapped towards the start of the PA space
▪ 32 TB of reserved VA space: vmalloc/ioremap
▪ 64 TB of reserved VA space: direct mapped memory

Need for a dynamic memory allocator for Linux's own use

[Figure] The same picture as before, extended: the rest of the kernel cannot comfortably use the physical page frame allocator directly (physical addresses, page granularity only), so the kernel gets its own dynamic memory allocator, reached via kmalloc() and vmalloc(), layered on top of the zone/buddy allocator (page_alloc()/page_free()).

Dynamic memory allocation in Kernel

▪ kmalloc() is the primary dynamic allocator in the kernel
‣ Like malloc(), but for the kernel's own use (it is part of the kernel)
• Designed primarily for sub-page-size allocations
• Does not handle large (e.g., >4MB) allocation requests

‣ void *kmalloc(size_t size, gfp_t flags)
‣ Returns a kernel virtual address (KVA) in the direct mapped region
‣ The flags are mostly the same as those accepted by the page_alloc() function
• e.g., GFP_KERNEL, GFP_ATOMIC
‣ Returns a contiguous KVA range that is mapped to contiguous PA
‣ Would there be a page fault on the first access to an address allocated via kmalloc()?
• NO; the page tables for the direct mapped KVA are set up during boot

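A minimal usage sketch (kmalloc, kfree, and GFP_KERNEL are real kernel APIs; struct foo and make_foo are hypothetical):

#include <linux/slab.h>

struct foo { int a, b; };

static struct foo *make_foo(void)
{
        /* GFP_KERNEL may sleep; use GFP_ATOMIC in interrupt context. */
        struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);
        if (!f)
                return NULL;
        f->a = f->b = 0;
        return f;          /* release later with kfree(f) */
}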
kmalloc()'s allocation policy: SLAB allocator

▪ Keeps slabs (contiguous VA/PA) of memory
▪ Each slab contains objects of the same size
▪ Keeps multiple slabs of various sizes
▪ On a kmalloc() call, rounds up to the nearest supported size and returns an object from the corresponding slab

vmalloc()

▪ Typically used for allocating really large buffers
▪ Returns a virtually contiguous KVA range that may not be physically contiguous
▪ Not part of the direct mapped kernel VA region
▪ Page table entries are created dynamically, as for user memory
▪ Rarely used inside the kernel


Miscellaneous related topics

Swapping (a.k.a. memory overcommit)

▪ Goal: Provide the illusion of more memory than is actually available
▪ Observation: Not every address space (process or file) uses all the memory it requests
‣ So allow memory overcommit
‣ i.e., allocate more virtual memory than there is physical memory

[Figure] Pages move between memory (DRAM) and a swap file on storage: swap out from DRAM to the swap file, swap in from the swap file back to DRAM.

How does swapping work?

▪ The OS attempts to figure out which physical page frames in memory are not actively used
‣ It uses the access bit ("A" bit) in the PTE
‣ The OS periodically clears the "A" bits of pages in memory
‣ After a little while, the OS checks whether the H/W has set the "A" bit of those pages
• If the "A" bit is set for a page → an actively used page
• If the "A" bit is unset for a page → not actively used → a candidate for swapping out to storage
• This algorithm is called the clock algorithm
• Linux uses an enhanced version of this basic algorithm
▪ The OS writes back the contents of the page to the swap file (swap out)
▪ It updates the PTE to clear the present ("P") bit

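A toy rendition of that clock scan (the array stands in for the per-PTE "A" bits; everything here is illustrative):

#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 1024

static bool accessed[NFRAMES];   /* stand-in for the per-PTE "A" bits */
static size_t hand;              /* the clock hand */

/* Pick a frame whose "A" bit stayed clear since the last sweep. */
static size_t clock_pick_victim(void)
{
        for (;;) {
                if (!accessed[hand]) {
                        size_t victim = hand;
                        hand = (hand + 1) % NFRAMES;
                        return victim;       /* candidate for swap out */
                }
                accessed[hand] = false;      /* clear; give a second chance */
                hand = (hand + 1) % NFRAMES;
        }
}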
▪ There will be a page fault if an application accesses a page that has been swapped out
‣ Because the "P" bit is unset
▪ The OS page fault handler figures out that the page had been swapped out
▪ The page fault handler brings the page back into memory (swap in)
▪ The application retries the faulting instruction

Need for reverse mapping

▪ The kernel scans physical page frames to select a victim to swap out to disk
▪ It needs to invalidate the VA-to-PA mapping(s) of the page frame
▪ Problem: given a PA, how do we find the VA(s) that map to that PA? (a.k.a. reverse mapping)

Reverse mapping via the page descriptor

[Figure, built up over several slides] The page descriptor of a physical page points to an anon_vma object; the anon_vma links the vma(s) of every process whose virtual memory maps that physical page (Process A, and also Process B after a fork). Following page descriptor → anon_vma → vma(s) yields the virtual address(es) backing a given physical page.

Copy-on-Write (CoW) optimization

▪ Remember fork()?
‣ It creates and starts a new process (child) which is an exact copy of the parent process, except for the return value
▪ A naïve approach would march through the address space and copy each page
‣ Most processes immediately exec() a new binary without using any of these pages
‣ Again, lazy is better!

▪ Memory regions:
‣ New copies of each vma are allocated for the child during fork
‣ As are page tables
▪ Pages in memory:
‣ In the page table (and the in-memory representation), clear the write bit and set the "COW" bit
• Is the COW bit hardware specified?
• No, the OS uses one of the available bits in the PTE
o But it does not have to; it can just keep the info in the VMA like other meta-data
‣ Make a new, writeable copy of the page on a write fault

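A toy sketch of that write-fault path (the PTE layout, frame array, and allocator are all stand-ins):

#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NFRAMES   64

static unsigned char frames[NFRAMES][PAGE_SIZE]; /* toy physical memory */
static unsigned long next_free = 1;              /* toy frame allocator */

struct pte { unsigned long pfn; bool write; bool cow; };

/* On a write fault to a CoW page: copy the frame, remap writeable. */
static void cow_write_fault(struct pte *pte)
{
        unsigned long new_pfn = next_free++;     /* really: buddy allocator */
        memcpy(frames[new_pfn], frames[pte->pfn], PAGE_SIZE);
        pte->pfn   = new_pfn;                    /* private copy for writer */
        pte->cow   = false;
        pte->write = true;                       /* writes allowed again */
}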
Operations in virtual memory management

▪ Virtual address allocation
‣ Responsibility of the OS and libraries (e.g., malloc in glibc)
▪ Physical memory allocation
‣ Responsibility of the OS
‣ Note: applications never see physical memory
▪ Creating virtual-to-physical address mappings
‣ i.e., the page table entries (PTEs)
‣ Responsibility of the OS
▪ Performing address translation on every load/store
‣ Hardware page table walkers (PTWs) walk the page tables
▪ Updating/removing address mappings
‣ Also called TLB shootdown
‣ Responsibility of the OS

Updating address mappings/permissions

▪ Why update PTEs?
‣ OS initiated
• Swapping, Copy-on-Write, page migration
‣ User initiated
• Changing page permissions, unmapping

▪ System call to change page permissions:
‣ mprotect(void *addr, size_t len, int new_prot)
▪ System call to free/unmap memory:
‣ int munmap(void *addr, size_t len);
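For example, taking write permission away from a mapping and later unmapping it (standard POSIX calls):

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        if (mprotect(p, len, PROT_READ) != 0)  /* drop write permission */
                return 1;
        return munmap(p, len);                 /* free the VA range */
}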

Steps of updating/deleting an address mapping

▪ Update the VMA regions/flags
▪ Update the page table entries covering the updated virtual address range (via loads/stores issued by the kernel)
‣ This does not happen in the TLB
‣ The TLB is a read-only cache of PTEs
▪ Invalidate stale PTEs in the TLBs of every CPU core that could be caching the stale page mapping
‣ This process is called TLB shootdown

Update VMA flags, delete/split VMAs

[Figure] struct task_struct (pid, status, open files, signals, …) points to struct mm_struct (pointer to the page table root; start/end of stack, code, and mmap space; a vma cache; the list of vma_areas). Each vma_area records its starting VA, ending VA, and flags/protection bits such as VM_READ, VM_WRITE, VM_SHARED.

Then: update the page table entries and issue a TLB shootdown.

Steps of TLB shootdown

▪ The OS on the initiator CPU core updates the page table entry
▪ The OS on the initiator core finds the set of other cores that may have a stale entry in their TLBs
▪ The OS on the initiator core sends an inter-processor interrupt (IPI) to the other cores in the list and waits for their acks
▪ The OS on the initiator core uses the invlpg instruction, or writes to cr3, to invalidate its local TLB entries while waiting for the acks
▪ The other cores context switch into an OS thread and invalidate entries in their local TLBs via invlpg or a write to cr3
▪ The other cores send acks to the initiator core
▪ The TLB shootdown completes after the initiator receives all the acks

Problems of TLB shootdown

▪ TLB shootdowns are slow
‣ They can take up to several microseconds
▪ TLB shootdown latency typically grows linearly with the number of cores
‣ Why?

▪ A good paper for understanding the intricacies of shootdowns:
‣ "DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory"

An alternative way of doing shootdowns

▪ ARM-based processors have a "TLBI" instruction
‣ Like invlpg, but it can invalidate a page mapping from all TLBs in the system
‣ It is like a broadcast invlpg


Non-uniform memory architecture (NUMA)

Building symmetric multiprocessors

Cache coherent NUMA systems

▪ Non-Uniform Memory Access (NUMA)
▪ Cache coherent

[Figure] Multiple sockets, each with its own directly attached memory, connected by a cache coherent link (e.g., Intel QPI). An access from a core to its own node's memory is a local memory access; an access to another node's memory is a remote memory access. All nodes share one physical address space (a single OS view).

Goal of a NUMA system

▪ Satisfy most memory accesses from the local memory node
‣ NUMA is a pure performance play; it is not about correctness

▪ On Linux servers, try numactl -H
‣ It will show you the NUMA topology of the system

Dividing physical memory into multiple memory nodes

struct page* alloc_pages(gfp_mask, order)

▪ The per-node allocator structure seen earlier (per-CPU page caches → buddy allocators → zones DMA/NORMAL/Highmem) is replicated once per memory node

Two ways to reduce remote NUMA accesses

Page placement

▪ User-directed
▪ Example: numactl --cpunodebind=0 --membind=0
‣ Ensures that your application runs on node 0 (socket 0) and that its memory is allocated from memory node 0
▪ Another example: numactl --localalloc
‣ Allocates memory from the local memory node
▪ Another example: numactl --interleave=all
‣ Interleaves memory across all memory nodes
‣ When might this be useful?

Page migration

▪ Migrate the contents of a page from one node to another
‣ Why?
• Consider a multi-threaded app: the first thread allocates all the memory and may get it allocated locally, but later, other threads access that same memory from another node
• Phase changes in a multi-threaded app: initially a memory region is accessed by one set of threads, but in a later phase the same region is accessed by a different set of threads
• The thread that allocated the memory may itself have been moved
▪ User-requested page migration:
‣ migratepages
• Migrates all pages of a process
‣ move_pages()
• Can move individual pages

AutoNUMA: Automatic NUMA balancing

▪ Objective: increase local memory accesses without user/programmer involvement
▪ Basic idea:
‣ Memory follows compute
• Migrate pages to the same node as the tasks/threads that access the memory
‣ Compute follows memory
• Schedule tasks/threads on the nodes where their memory is local
• AutoNUMA is one possible implementation of both of these ideas

Implementing AutoNUMA

▪ Implementing "memory follows compute":
‣ Periodically induce (false) NUMA page faults to infer which processes are accessing which memory
• Unset the "P" bit in the PTEs for ranges of memory and set an otherwise-unused "special" bit in each PTE
• The next access to such a page will fault, so the OS can see where the fault is coming from
• Migrate the page if the accesses are coming from a remote node
o Typically the implementation waits for two remote accesses from the same node before migrating, since migrations are costly


▪ Implementing "compute follows memory":
‣ Keep statistics of local/remote page faults in the task_struct
‣ At (re-)scheduling time, schedule the task where it will have most of its memory accesses local

▪ Other improvisations:
‣ Define task groups: sets of threads in a multi-threaded application that often access the same memory regions
‣ Try to schedule the threads of a task group on the node where their memory is


Miscellaneous paging topics

Super pages or huge pages: Reducing TLB misses

▪ TLB misses are long-latency operations
‣ Address translation overhead ≈ TLB miss overhead
▪ TLB reach = (# of TLB entries) × (page size)
‣ Higher TLB reach → (typically) a lower TLB miss rate
▪ How to increase TLB reach?
‣ Increase the TLB size
• Not always possible, due to h/w overheads
‣ Increase the page size
• Increases internal fragmentation
‣ Support multiple page sizes
• Larger page sizes beyond the base page size (e.g., 4KB)

Super pages: Reducing the number of TLB misses

▪ Larger page sizes are called superpages (or huge pages)

▪ Example page sizes on x86-64 (Intel, AMD) machines:
‣ 4KB (base page size), 2MB, 1GB (super page sizes)

▪ Challenge: how to decide which page size to use for mapping a given virtual address?
‣ Using larger pages increases internal fragmentation → memory bloat
‣ Using smaller page sizes can increase address translation overhead

Allocating super pages

▪ Three broad approaches (see the sketch below):
‣ User directed: mmap(…) with the MAP_HUGETLB flag
• Gets the VA range mapped with super pages
‣ Helper library: libhugetlbfs in Linux (discussed next)
• Needs user intervention, but makes the programmer's life easier
‣ Automatic, OS-guided allocation of large pages
• No user/programmer intervention needed
• Transparent huge pages (THP) in Linux
• Ashish will talk about it

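A user-directed example (MAP_HUGETLB is the Linux flag; the call fails if no huge pages are reserved on the system, so the error path matters in practice):

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;   /* one 2MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {    /* e.g., no huge pages reserved */
                perror("mmap");
                return 1;
        }
        return munmap(p, len);
}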
Linux's libhugetlbfs

▪ A helper library that can map memory segments (e.g., heap, code) of an application with huge pages
‣ No need to modify the program
‣ But you still need to change how the program is launched and/or recompile/re-link the binary
• Needs access to the application source code (for the most part)
• Cannot map parts of memory segments (e.g., parts of the heap) with large pages
▪ Example libhugetlbfs usage:
‣ To map the entire heap region with large pages:
• LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes


▪ To make use of large pages for code, data, and BSS:
‣ While compiling the application, add this linker flag: -Wl,--hugetlbfs-align
‣ Then, when executing the application:
• To map code pages with large pages: HUGETLB_ELFMAP=R
• To map the BSS and data segments: HUGETLB_ELFMAP=W