2/5/20

COMP 790: OS Implementation

Process Address Spaces and Binary Formats

Don Porter

Logical Diagram
[Figure: course map. Above the system call boundary (User): Binary Formats, Memory Allocators, Threads. Below it (Kernel): RCU, Networking, Sync, Memory Management, Device Drivers, CPU Scheduler. Hardware: Interrupts, Disk, Net, Consistency. A "Today's Lecture" label marks the topics covered here.]


Review
• We’ve seen how paging and segmentation work on x86
  – Maps logical addresses to physical pages
  – These are the low-level hardware tools
• This lecture: build up to higher-level abstractions
• Namely, the process address space

Definitions (can vary)
• Process is a virtual address space
  – 1+ threads of execution work within this address space
• A process is composed of:
  – Memory-mapped files
    • Includes program binary
  – Anonymous pages: no file backing
    • When the process exits, their contents go away


Address Space Layout
• Determined (mostly) by the application
• Determined at compile time
  – Link directives can influence this
    • See kern/kernel.ld in JOS; specifies kernel starting address
• OS usually reserves part of the address space to map itself
  – Upper GB on x86
• Application can dynamically request new mappings from the OS, or delete mappings

Simple Example
[Figure: Virtual Address Space from 0 to 0xffffffff, containing hello, heap, stk, and libc.so]
• “Hello world” binary specified load address
• Also specifies where it wants libc
• Dynamically asks kernel for “anonymous” pages for its heap and stack


In practice
• You can see (part of) the requested memory layout of a program using ldd:
    $ ldd /usr/bin/git
    linux-vdso.so.1 => (0x00007fff197be000)
    libz.so.1 => /lib/libz.so.1 (0x00007f31b9d4e000)
    libpthread.so.0 => /lib/libpthread.so.0 (0x00007f31b9b31000)
    libc.so.6 => /lib/libc.so.6 (0x00007f31b97ac000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f31b9f86000)

Problem 1: How to represent in the kernel?
• What is the best way to represent the components of a process?
  – Common question: what (if anything) is mapped at address x?
    • Page faults, new memory mappings, etc.
• Hint: a 64-bit address space is seriously huge
• Hint: some programs (like databases) map tons of data
  – Others map very little
• No one size fits all


Sparse representation
• Naïve approach might make a big array of pages
  – Mark empty space as unused
  – But this wastes OS memory
• Better idea: only allocate nodes in a data structure for memory that is mapped to something
  – Kernel data structure memory use proportional to complexity of address space!

Linux: vm_area_struct
• Linux represents portions of a process with a vm_area_struct, or vma
• Includes:
  – Start address (virtual)
  – End address (first address after vma) – why?
    • Memory regions are page aligned
  – Protection (read, write, execute, etc) – implication?
    • Different page protections means new vma
  – Pointer to file (if one)
  – Other bookkeeping
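The fields listed for a vma can be pictured with a small C sketch. This is a simplified stand-in, not the kernel's actual vm_area_struct: the struct name, the protection constants, and the helper functions are all hypothetical. It mainly illustrates one likely answer to the "why?" on the end address: with a half-open range [start, end), size and containment checks need no off-by-one adjustment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical protection bits; not the kernel's actual flag values. */
enum { VMA_READ = 1, VMA_WRITE = 2, VMA_EXEC = 4 };

/* Simplified stand-in for a vma; field names are illustrative only. */
struct vma {
    uintptr_t start;   /* first address in the region (page aligned) */
    uintptr_t end;     /* first address AFTER the region (page aligned) */
    int prot;          /* one protection setting for every page in the region */
    const char *file;  /* backing file name, or NULL for anonymous memory */
};

/* Half-open range [start, end): the size is a plain subtraction. */
size_t vma_size(const struct vma *v) { return v->end - v->start; }

/* Number of pages covered by the region. */
size_t vma_pages(const struct vma *v) { return vma_size(v) / PAGE_SIZE; }

/* True if addr falls inside the region; end itself is excluded. */
bool vma_contains(const struct vma *v, uintptr_t addr) {
    return addr >= v->start && addr < v->end;
}
```

Because protection is stored once per region, two adjacent pages with different permissions cannot share a struct, which is the "implication" the slide hints at.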


Simple list representation
[Figure: Process Address Space from 0 to 0xffffffff. An mm_struct (process) points to a linked list of vmas, each with start, end, and next fields: /bin/ls, anon (data), libc.so]

Simple list
• Linear traversal – O(n)
  – Shouldn’t we use a data structure with the smallest O?
• Practical system building question:
  – What is the common case?
  – Is it past the asymptotic crossover point?
• If tree traversal is O(log n), but adds bookkeeping overhead, which makes sense for:
  – 10 vmas: log 10 =~ 3; 10/2 = 5; Comparable either way
  – 100 vmas: log 100 =~ 7; 100/2 = 50; the tree starts making sense
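A minimal sketch of the linear lookup discussed above, assuming the vmas are kept on a singly linked list sorted by address. The struct layout and the name find_vma_list are hypothetical and much simpler than the kernel's own code; the point is only to show what the O(n) walk looks like.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative vma with only the fields the lookup needs. */
struct vma {
    uintptr_t start, end;   /* half-open range [start, end) */
    struct vma *next;       /* next vma in ascending address order */
};

/* Stand-in for the per-process structure that owns the list. */
struct mm {
    struct vma *head;
};

/*
 * Return the vma containing addr, or NULL if nothing is mapped there.
 * O(n) in the number of vmas; the sorted order allows an early exit.
 */
struct vma *find_vma_list(const struct mm *mm, uintptr_t addr) {
    for (struct vma *v = mm->head; v != NULL; v = v->next) {
        if (addr < v->start)
            return NULL;    /* already past the spot where addr would be */
        if (addr < v->end)
            return v;
    }
    return NULL;
}
```

With the three regions from the figure on the later "Insert at 0x40000" slide (0x1000-0x4000, 0x20000-0x21000, 0x100000-0x10f000), a lookup of 0x40000 stops at the third node and reports that nothing is mapped there.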


Common cases
• Many programs are simple
  – Only load a few libraries
  – Small amount of data
• Some programs are large and complicated
  – Databases
• Linux splits the difference and uses both a list and a red-black tree

Red-black trees
• (Roughly) balanced tree
• Read the Wikipedia article if you aren’t familiar with them
• Popular in real systems
  – Asymptotic average == worst case behavior
    • Insertion, deletion, search: log n
    • Traversal: n


Optimizations
• Using an RB-tree gets us logarithmic search time
• Other suggestions?
• Locality: If I just accessed region x, there is a reasonably good chance I’ll access it again
  – Linux caches a pointer in each process to the last vma looked up
  – Source code (mm/mmap.c) claims 35% hit rate

Memory mapping recap
• VM Area structure tracks regions that are mapped
  – Efficiently represent a sparse address space
  – On both a list and an RB-tree
    • Fast linear traversal
    • Efficient lookup in a large address space
  – Cache last lookup to exploit temporal locality
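The last-lookup cache can be sketched in a few lines of C. This is an illustration of the idea only: the struct layout, the hit counters, and the name find_vma_cached are assumptions, and a plain list stands in for the red-black tree as the slow path.

```c
#include <stddef.h>
#include <stdint.h>

struct vma {
    uintptr_t start, end;   /* half-open range [start, end) */
    struct vma *next;
};

struct mm {
    struct vma *head;       /* full set of vmas (a list here, for brevity) */
    struct vma *cache;      /* last vma a lookup returned, or NULL */
    unsigned long hits;     /* lookups answered by the cache */
    unsigned long lookups;  /* all lookups */
};

/* Check the cached vma first; fall back to the slow path on a miss. */
struct vma *find_vma_cached(struct mm *mm, uintptr_t addr) {
    mm->lookups++;
    struct vma *c = mm->cache;
    if (c != NULL && addr >= c->start && addr < c->end) {
        mm->hits++;
        return c;
    }
    for (struct vma *v = mm->head; v != NULL; v = v->next) {
        if (addr >= v->start && addr < v->end) {
            mm->cache = v;  /* remember it for the next lookup */
            return v;
        }
    }
    return NULL;            /* a miss leaves the cache unchanged */
}
```

Dividing hits by lookups gives the kind of hit rate the slide quotes; the counters exist here only to make that measurable.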


Linux APIs
• void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
• int munmap(void *addr, size_t length);
• How to create an anonymous mapping?
• What if you don’t care where a memory region goes (as long as it doesn’t clobber something else)?

Example 1:
• Let’s map a 1 page (4k) anonymous region for data, read-write at address 0x40000
• mmap((void *) 0x40000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  – Linux requires MAP_PRIVATE or MAP_SHARED in flags; without MAP_FIXED, addr is only a hint
  – Why wouldn’t we want exec permission?
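As a sketch of the "don't care where it goes" case, the function below passes NULL as the address so the kernel picks a free spot, then unmaps the page again. The function name anon_page_roundtrip is made up for this example; the mmap and munmap calls are the standard POSIX/Linux ones.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map one anonymous read-write page wherever the kernel likes,
 * touch it, and unmap it. Returns 0 on success, -1 on failure.
 */
int anon_page_roundtrip(void) {
    const size_t len = 4096;

    /* NULL address: no placement request, so nothing can be clobbered. */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Anonymous pages start out zero-filled. */
    int ok = (p[0] == 0 && p[len - 1] == 0);

    /* Writing is allowed because of PROT_WRITE. */
    memset(p, 0xab, len);
    ok = ok && (p[len - 1] == 0xab);

    if (munmap(p, len) != 0)
        return -1;
    return ok ? 0 : -1;
}
```

Leaving PROT_EXEC out means data written into the page cannot be executed as code, which is one reason not to request exec permission for a data region.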


Insert at 0x40000
[Figure: mm_struct (process) pointing to vmas 0x1000-0x4000, 0x20000-0x21000, 0x100000-0x10f000]
1) Is anything already mapped at 0x40000-0x41000?
2) If not, create a new vma and insert it
3) Recall: pages will be allocated on demand

Scenario 2
• What if there is something already mapped there with read-only permission?
  – Case 1: Last page overlaps
  – Case 2: First page overlaps
  – Case 3: Our target is in the middle


Case 1: Insert at 0x40000
[Figure: mm_struct (process) pointing to vmas 0x1000-0x4000, 0x20000-0x41000, 0x100000-0x10f000]
1) Is anything already mapped at 0x40000-0x41000?
2) If at the end and different permissions:
   1) Truncate previous vma
   2) Insert new vma
3) If permissions are the same, one can replace pages and/or extend previous vma

Case 3: Insert at 0x40000
[Figure: mm_struct (process) pointing to vmas 0x1000-0x4000, 0x20000-0x50000, 0x100000-0x10f000]
1) Is anything already mapped at 0x40000-0x41000?
2) If in the middle and different permissions:
   1) Split previous vma
   2) Insert new vma
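The truncate and split cases come down to the same range arithmetic, sketched below. This assumes the new mapping has different permissions and simply takes over its range; the struct and the function name carve are hypothetical, and real kernel code must also update the list, the tree, and the page tables.

```c
#include <stdint.h>

/* Half-open address range [start, end). */
struct range {
    uintptr_t start, end;
};

/*
 * Remove the range `hole` (the new mapping) from the existing vma `old`.
 * Writes the surviving pieces of `old` to out[] and returns how many
 * there are:
 *   1 piece on the left  -> Case 1 (last page overlaps: truncate)
 *   1 piece on the right -> Case 2 (first page overlaps)
 *   2 pieces             -> Case 3 (target in the middle: split)
 *   0 pieces             -> the new mapping covers the old vma entirely
 * If the ranges do not overlap, `old` is returned unchanged.
 */
int carve(struct range old, struct range hole, struct range out[2]) {
    int n = 0;
    if (hole.end <= old.start || hole.start >= old.end) {
        out[n++] = old;   /* no overlap: nothing to do */
        return n;
    }
    if (old.start < hole.start)   /* part of old below the new mapping */
        out[n++] = (struct range){ old.start, hole.start };
    if (hole.end < old.end)       /* part of old above the new mapping */
        out[n++] = (struct range){ hole.end, old.end };
    return n;
}
```

For Case 3 on the slide, carving 0x40000-0x41000 out of 0x20000-0x50000 leaves 0x20000-0x40000 and 0x41000-0x50000, with the new vma inserted between them.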


Demand paging
• Creating a memory mapping (vma) doesn’t necessarily allocate physical memory or set up page table entries
  – What mechanism do you use to tell when a page is needed?
• It pays to be lazy!
  – A program may never touch the memory it maps.
    • Examples?
      – Program may not use all code in a
  – Save work compared to traversing up front
  – Hidden costs? Optimizations?
    • Page faults are expensive; heuristics could help performance

fork()
• Recall: this function creates and starts a copy of the process; identical except for the return value
• Example:
    pid_t pid = fork();
    if (pid == 0) {
        // child code
    } else if (pid > 0) {
        // parent code
    } else {
        // error
    }
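Returning to demand paging: the mechanism that tells the OS a page is needed is the page fault. The toy handler below is a user-space simulation under heavy assumptions: one region, a boolean array standing in for page table entries, and hypothetical names throughout. It only shows the decision a fault handler makes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define MAX_PAGES 16

enum fault_result { FAULT_SEGV, FAULT_ALLOCATED, FAULT_ALREADY_PRESENT };

/* One mapped region plus a toy "page table" for it. */
struct region {
    uintptr_t start, end;      /* [start, end), at most MAX_PAGES pages */
    bool present[MAX_PAGES];   /* present[i]: page i has physical memory */
    size_t frames_allocated;   /* how many pages were actually touched */
};

/* Decide what to do about a fault at addr. */
enum fault_result handle_fault(struct region *r, uintptr_t addr) {
    if (addr < r->start || addr >= r->end)
        return FAULT_SEGV;             /* no mapping covers this address */

    size_t idx = (addr - r->start) / PAGE_SIZE;
    if (r->present[idx])
        return FAULT_ALREADY_PRESENT;  /* nothing to allocate */

    r->present[idx] = true;            /* stand-in for: allocate a frame
                                          and install a page table entry */
    r->frames_allocated++;
    return FAULT_ALLOCATED;
}
```

Creating the region costs nothing per page; memory is only counted in frames_allocated once an address inside it is touched, which is the laziness the slide argues for.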


Copy-On-Write (COW)
• Naïve approach would march through address space and copy each page
  – Most processes immediately exec() a new binary without using any of these pages
  – Again, lazy is better!

How does COW work?
• Memory regions:
  – New copies of each vma are allocated for child during fork
  – As are page tables
• Pages in memory:
  – In page table (and in-memory representation), clear write bit, set COW bit
    • Is the COW bit hardware specified?
    • No, OS uses one of the available bits in the PTE
  – Make a new, writeable copy on a write fault
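COW itself is invisible to a program, but the semantics it must preserve are easy to observe: after fork(), a write in the child must not change what the parent sees. The sketch below checks exactly that; the function name is made up, and the comments about write faults describe what a COW kernel would do behind the scenes rather than anything the code can detect.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork, let the child overwrite a variable, and return the value the
 * parent sees afterwards. Returns -1 if fork or wait fails.
 */
int parent_value_after_child_write(void) {
    volatile int value = 3;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: under COW this write would fault, and the kernel
           would give the child its own copy of the page. */
        value = 42;
        _exit(value == 42 ? 0 : 1);
    }

    /* Parent: wait for the child, then read its own copy. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return value;
}
```

The parent still reads 3, whether the kernel copied the page eagerly at fork time or lazily on the child's write; COW just makes the common fork-then-exec case cheap.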


New Topic: Stacks

Idiosyncrasy 1: Stacks Grow Down
• In Linux/Unix, as you add frames to a stack, they actually decrease in virtual address order
• Example:
[Figure: Stack “bottom” at 0x13000; main() at 0x12600; foo() at 0x12300; bar() at 0x11900. The bar() frame exceeds the stack page, so the OS allocates a new page.]


Problem 1: Expansion
• Recall: OS is free to allocate any free page in the virtual address space if user doesn’t specify an address
• What if the OS allocates the page below the “top” of the stack?
  – You can’t grow the stack any further
  – Out of memory fault with plenty of memory spare
• OS must reserve stack portion of address space
  – Fortunate that memory areas are demand paged

Feed 2 Birds with 1 Scone
• Unix has been around longer than paging
  – Remember data segment abstraction?
  – Unix solution:
[Figure: a single Data Segment with the Heap growing from one end and the Stack growing from the other]
• Stack and heap meet in the middle
  – Out of memory when they meet


But now we have paging
• Unix and Linux still have a data segment abstraction
  – Even though they use flat data segmentation!
• sys_brk() adjusts the endpoint of the heap
  – Still used by many memory allocators today

Windows Comparison
• LPVOID VirtualAllocEx(__in HANDLE hProcess,
                        __in_opt LPVOID lpAddress,
                        __in SIZE_T dwSize,
                        __in DWORD flAllocationType,
                        __in DWORD flProtect);
• Library function applications program to
  – Provided by ntdll.dll – the rough equivalent of Unix libc
  – Implemented with an undocumented
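On the Unix side, the heap endpoint that sys_brk() adjusts can be observed from user space through the libc wrapper sbrk(). The sketch below is an illustration only: the function name brk_growth is made up, it assumes a single-threaded program where nothing else moves the break in between, and some platforms deprecate or restrict sbrk().

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <unistd.h>

/*
 * Move the program break up by n bytes, report how far it moved,
 * then move it back. Returns -1 if the break could not be changed.
 */
long brk_growth(long n) {
    char *before = sbrk(0);          /* sbrk(0) just reads the break */
    if (before == (char *)-1)
        return -1;

    if (sbrk((intptr_t)n) == (void *)-1)
        return -1;                   /* the kernel refused to grow the heap */

    char *after = sbrk(0);
    long moved = (long)(after - before);

    sbrk((intptr_t)-n);              /* give the memory back */
    return moved;
}
```

An allocator built this way asks for a larger heap in one step and then hands out pieces of it, instead of making a system call per allocation.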


Windows Comparison
• LPVOID VirtualAllocEx(__in HANDLE hProcess,
                        __in_opt LPVOID lpAddress,
                        __in SIZE_T dwSize,
                        __in DWORD flAllocationType,
                        __in DWORD flProtect);
• Programming environment differences:
  – Parameters annotated (__out, __in_opt, etc), compiler checks
  – Name encodes type, by convention
  – dwSize must be page-aligned (just like mmap)

Windows Comparison
• Different capabilities
  – hProcess doesn’t have to be you! Pros/Cons?
  – flAllocationType – can be reserved or committed
    • And other flags


Reserved memory
• An explicit abstraction for cases where you want to prevent the OS from mapping anything to an address region
• To use the region, it must be remapped in the committed state
• Why?
  – My speculation: Gives the OS more information for advanced heuristics than demand paging

Part 1 Summary
• Understand what a vma is, how it is manipulated in kernel for calls like mmap
• Demand paging, COW, and other optimizations
• brk and the data segment
• Windows VirtualAllocEx() vs. Unix mmap()


Part 2: Program Binaries
• How are address spaces represented in a binary file?
• How are processes loaded?

Linux: ELF
• Executable and Linkable Format
• Standard on most Unix systems
  – And used in JOS
  – You will implement part of the loader in lab 3
• 2 headers:
  – Program header: 0+ segments (memory layout)
  – Section header: 0+ sections (linking information)


Helpful tools
• readelf – Linux tool that prints part of the ELF headers
• objdump – Linux tool that dumps portions of a binary
  – Includes a disassembler; reads debugging symbols if present

Key ELF Sections
• .text – Where read/execute code goes
  – Can be mapped without write permission
• .data – Programmer initialized read/write data
  – Ex: a global int that starts at 3 goes here
• .bss – Uninitialized data (initially zero by convention)
• Many other sections


How ELF Loading Works
• execve("foo", …)
• Kernel parses the file enough to identify whether it is a supported format
  – Kernel loads the text, data, and bss sections
• ELF header also gives first instruction to execute
  – Kernel transfers control to this application instruction

Static vs. Dynamic Linking
• Static Linking:
  – Application binary is self-contained
• Dynamic Linking:
  – Application needs code and/or variables from an external library
• How does dynamic linking work?
  – Each binary includes a “jump table” for external references
  – Jump table is filled in at run time by the loader
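For the ELF loading steps above, "parsing the file enough" can be as small as checking the magic bytes and reading the entry point. The sketch below handles only 64-bit little-endian ELF files and reads raw bytes rather than using a header struct; the function name elf64_entry is hypothetical, while the offsets follow the ELF64 header layout (16-byte e_ident, then e_type, e_machine, e_version, then e_entry at byte 24).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * If buf starts with a 64-bit little-endian ELF header, store the
 * entry point (address of the first instruction) in *entry and
 * return 1. Otherwise return 0.
 */
int elf64_entry(const unsigned char *buf, size_t len, uint64_t *entry) {
    if (len < 64)                        /* an ELF64 header is 64 bytes */
        return 0;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return 0;                        /* wrong magic: not an ELF file */
    if (buf[4] != 2)                     /* EI_CLASS: 2 means 64-bit */
        return 0;
    if (buf[5] != 1)                     /* EI_DATA: 1 means little endian */
        return 0;

    /* e_entry: 8 bytes at offset 24, least significant byte first. */
    uint64_t e = 0;
    for (int i = 7; i >= 0; i--)
        e = (e << 8) | buf[24 + i];
    *entry = e;
    return 1;
}
```

A real loader goes on to walk the program headers and map each segment; this check is only the "is it a supported format, and where does execution start" part.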


Jump table example
• Suppose I want to call foo() in another library
• Compiler allocates an entry in the jump table for foo
  – Say it is index 3, and an entry is 8 bytes
• Compiler generates local code like this (AT&T syntax):
  – mov 24(%rbx), %rax   // rbx points to the jump table; 3 * 8 = 24
  – call *%rax
• Loader initializes the jump tables at runtime

Dynamic Linking (Overview)
• Rather than loading the application, load the loader (ld.so), give the loader the actual program as an argument
• Kernel transfers control to loader (in )
• Loader:
  – 1) Walks the program’s ELF headers to identify needed libraries
  – 2) Issues mmap() calls to map in said libraries
  – 3) Fixes the jump tables in each binary
  – 4) Calls main()
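The jump table idea can be mimicked in plain C with an array of function pointers. Everything in this sketch is an illustration: lib_foo stands in for a function in another library, library_exports for that library's symbol table, and loader_fixup for the loader's fix-up step. Real dynamic linkers work on ELF relocation entries rather than a C array.

```c
#include <stddef.h>
#include <string.h>

typedef int (*fn_t)(int);

/* Pretend this function lives in a separate shared library. */
int lib_foo(int x) { return x + 1; }

/* Pretend symbol table exported by that library. */
struct symbol {
    const char *name;
    fn_t addr;
};
static const struct symbol library_exports[] = { { "foo", lib_foo } };

#define TABLE_SLOTS 4
#define FOO_SLOT 3                 /* index 3, as in the slide */
fn_t jump_table[TABLE_SLOTS];      /* all NULL until the "loader" runs */

/* Loader step: look the name up and patch the slot. 0 on success. */
int loader_fixup(size_t slot, const char *name) {
    size_t count = sizeof library_exports / sizeof library_exports[0];
    if (slot >= TABLE_SLOTS)
        return -1;
    for (size_t i = 0; i < count; i++) {
        if (strcmp(library_exports[i].name, name) == 0) {
            jump_table[slot] = library_exports[i].addr;
            return 0;
        }
    }
    return -1;                     /* unresolved symbol */
}

/* Call site: load the table entry, then call through it. */
int call_foo(int x) {
    return jump_table[FOO_SLOT](x);
}
```

The caller never needs to know where foo ended up in memory; only the table entry changes, which is why the same binary works wherever the library is mapped.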


Recap
• Understand basics of program loading
• OS does preliminary executable parsing, maps in program and maybe dynamic linker
• Linker does needed fixup for the program to work

Summary
• We’ve seen a lot of details on how programs are represented:
  – In the kernel when running
  – On disk in an executable file
  – And how they are bootstrapped in practice
• Will help with lab 3
