Lecture 8: Memory Management: the Kernel
Operating Systems
Lecture 8: Memory Management
Physical memory                        Abstraction: virtual memory
No protection                          Each program isolated from all others and from the OS
Limited size                           Illusion of infinite memory
Sharing visible to programs            Transparent -- can't tell if memory is shared
Easy to share data between programs    Ability to share code, data
Hardware support for protection
How is protection implemented? Hardware support: address translation, dual-mode operation.
Address translation
Address space: literally, all the addresses a program can touch. All the state that a program can affect or be affected by.
Restrict what a program can do by restricting what it can touch! Hardware translates every memory reference from virtual addresses to physical addresses; software sets up and manages the mapping in the translation box.
[Figure: Address Translation in Modern Architectures. The CPU issues a virtual address. In user mode it passes through the translation box and becomes a physical address; in kernel mode the address goes to physical memory untranslated.]
Two views of memory:
  view from the CPU -- what the program sees, virtual memory
  view from memory -- physical memory
Translation box converts between the two views.
Translation helps implement protection because there is no way for a program to even talk about another program's addresses, and no way for it to touch operating system code or data.
Translation can be implemented in any number of ways -- typically, by some form of table lookup (we'll discuss various options for implementing the translation box later). There is a separate table for each user address space.
An application cannot modify its own translation table. If it could, it could get access to all of physical memory. This has to be restricted somehow. Dual-mode operation enables control:
  when in the OS, can do anything (kernel mode)
  when in a user program, restricted to only touching that program's memory (user mode)
Hardware requires the CPU to be in kernel mode to modify address translation tables.
  OS runs in kernel mode (untranslated)
  User programs run in user mode (translated)
Want to isolate each address space so its behavior can't do any harm, except to itself.
How do the kernel and user programs interact?
Kernel -> user: To run a user program, create a process/thread to:
  allocate and initialize address space control block
  read program off disk and store in memory
  allocate and initialize translation table (point to program memory)
  run program (or to return to user level after calling the kernel):
    set machine registers
    set hardware pointer to translation table
    set processor status word (user vs. kernel)
    jump to start of program
User -> kernel: How does the user program get back into the kernel?
Voluntarily user->kernel: System call -- a special instruction to jump to a specific operating system handler. Just like doing a procedure call into the operating system kernel -- the program asks the OS kernel to please do something on the program's behalf.
Can the user program call any routine in the OS? No. Just the specific ones the OS says are ok. Always start running the handler at the same place; otherwise, problems!
Involuntarily user->kernel: Hardware interrupt, also program exception such as bus error, segmentation fault, page fault.
On system call, interrupt, or exception, hardware atomically:
  sets processor status to kernel
  changes execution stack to kernel
  saves current program counter
  jumps to handler in kernel
The handler then saves the previous state of any registers it uses.
How does the system call pass arguments?
  a. Use registers
  b. Write into user memory, kernel copies into its memory
     Except: user addresses are translated and kernel addresses are untranslated, so the kernel cannot use a user address as-is.
PROTECTION
Base and Bounds
Base and bounds: Each program is loaded into a contiguous region of physical memory, but with protection between programs. Used, for example, in the Cray-1.
[Figure: Hardware Implementation of Base and Bounds Translation. The CPU's logical address is compared with the BOUND register. If it is smaller, BASE is added to form the physical address sent to memory; otherwise the hardware traps with an addressing error.]
Program has the illusion it is running on its own dedicated machine, with memory starting at 0 and going up to size = bounds. Like linker-loader, the program gets a contiguous region of memory. But unlike linker-loader, there is protection: the program can only touch locations in physical memory between base and base + bounds.
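The compare-and-add step the hardware performs can be sketched in a few lines of Python. This is only an illustration: `translate_base_bounds` and `AddressingError` are hypothetical names, the exception stands in for the hardware trap, and the bound value used in the example is assumed.

```python
class AddressingError(Exception):
    """Stands in for the hardware trap on an out-of-bounds reference."""


def translate_base_bounds(virtual_addr, base, bound):
    """Map a virtual address to a physical one, base-and-bounds style."""
    # Comparator: the virtual address must lie in [0, bound).
    if not 0 <= virtual_addr < bound:
        raise AddressingError(f"address {virtual_addr:#x} outside bound {bound:#x}")
    # Adder: relocate by the base register.
    return base + virtual_addr


# Base = 4000 as in the notes; the bound of 0x1000 is an assumed value.
print(translate_base_bounds(0x10, base=4000, bound=0x1000))  # 4016
```

Moving the program only requires copying its memory and calling the function with a different `base`; the virtual addresses the program uses do not change.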
[Figure: Virtual and Physical Memory Views in Base and Bounds System. The logical address space (code, data, stack) runs from 0 to bound; with base = 4000 it occupies physical addresses 4000 to 4000 + bound.]
Provides level of indirection: OS can move bits around behind the program's back, for instance, if program needs to grow beyond its bounds, or if need to coalesce fragments of memory. Stop program, copy bits, change base and bounds registers, restart.
Only the OS gets to change the base and bounds! Clearly, user program can't, or else lose protection.
Hardware cost: 2 registers, an adder, and a comparator. Plus, it slows down the hardware, because the add and compare take time on every memory reference.
Base and bounds is simple and fast, but it has the following disadvantages:
1. Hard to share between programs. For example, suppose two copies of "vi" are running: we want to share the code, and only data and stack need to be different. We can't do this with base and bounds!
2. Hard to grow address space. We want stack and heap to grow into each other, but instead have to allocate for maximum future needs.
3. Needs complex memory allocation such as first fit, best fit, or buddy system. In the worst case, large chunks of memory have to be shuffled to fit a new program.
Solution to 1 & 2: segmentation. Solution to 1 & 3: paging. Solution to 1 & 2 & 3: segmentation plus paging!
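8.2.3.2 Segmentation
A segment is a region of logically contiguous memory. The idea is to generalize base and bounds by allowing a table of base & bound pairs.
[Figure: Segmentation translation. The virtual address splits into a segment # and an offset. The segment # selects a base & bound pair from the segment table. If the offset is greater than the bound, the hardware raises an error; otherwise the offset is added to the base to form the physical address.]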
For example, what does it look like with this segment table, in virtual memory and physical memory? Assume a 2-bit segment ID and a 12-bit segment offset.
Virtual segment #    Physical segment start    Segment size
Code                 0x4000                    0x700
Data                 0                         0x500
-                    0                         0
Stack                0x2000                    0x1000
The virtual address space has gaps in it. Each segment gets mapped to contiguous locations in physical memory, but there may be gaps between segments. A correct program will never address the gaps; if it does, trap to kernel. Minor exception: stack and heap can grow.
Segmentation is efficient for sparse address spaces. It is easy to share whole segments (for example, the code segment). Protection is also easy: only a protection mode has to be added to each segment table entry. For example, the code segment would be read-only (only execution and loads are allowed), while data and stack segments would be read-write (stores allowed).
But segmentation still needs complex memory allocation such as first fit, best fit, etc., and re-shuffling to coalesce free fragments if no single free space is big enough for a new segment.
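To make the segment table example concrete, here is a minimal Python sketch of the lookup. It assumes the table rows are numbered 0 to 3 in the order listed (code, data, unused, stack), which the notes do not state explicitly; `translate_segmented` and `SegmentationFault` are hypothetical names.

```python
SEG_BITS, OFFSET_BITS = 2, 12  # 2-bit segment ID, 12-bit offset, as in the example

# (physical start, size), indexed by segment number.
# Assumption: rows are numbered 0..3 in the order shown in the table.
SEGMENT_TABLE = [
    (0x4000, 0x700),   # 0: code
    (0x0000, 0x500),   # 1: data
    (0x0000, 0x000),   # 2: unused
    (0x2000, 0x1000),  # 3: stack
]


class SegmentationFault(Exception):
    """Stands in for the trap to the kernel."""


def translate_segmented(virtual_addr):
    """Split the address into segment # and offset, check the bound, add the base."""
    segment = virtual_addr >> OFFSET_BITS
    offset = virtual_addr & ((1 << OFFSET_BITS) - 1)
    if not 0 <= segment < len(SEGMENT_TABLE):
        raise SegmentationFault(f"no segment {segment}")
    base, size = SEGMENT_TABLE[segment]
    if offset >= size:
        raise SegmentationFault(f"offset {offset:#x} outside segment {segment}")
    return base + offset


print(hex(translate_segmented(0x0240)))  # code segment, offset 0x240 -> 0x4240
```

An address such as 0x0700 falls in the gap just past the end of the code segment, so the sketch raises `SegmentationFault`, mirroring the trap to the kernel.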
How do we make memory allocation simple and easy?
8.2.3.3 Paging
Allocate physical memory in terms of fixed-size chunks of memory, or pages.
Simpler, because it allows use of a bitmap. What's a bitmap?
  001111100000001100
Each bit represents one page of physical memory -- 1 means allocated, 0 means unallocated. Lots simpler than base & bounds or segmentation.
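A bitmap allocator is short enough to sketch directly. The Python below is an illustration only: the class name `PageBitmap` and its first-free search policy are assumptions, not something the notes prescribe.

```python
class PageBitmap:
    """One flag per physical page: True = allocated, False = unallocated."""

    def __init__(self, bits):
        self.bits = [c == "1" for c in bits]

    def allocate(self):
        """Return the number of the first free page and mark it allocated."""
        for page, used in enumerate(self.bits):
            if not used:
                self.bits[page] = True
                return page
        return None  # no free page left

    def free(self, page):
        self.bits[page] = False


bitmap = PageBitmap("001111100000001100")  # the bitmap from the notes
print(bitmap.allocate(), bitmap.allocate(), bitmap.allocate())  # 0 1 7
```

Because all pages are the same size, any free page will do; there is no first-fit or best-fit decision to make.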
Operating system controls mapping: any page of virtual memory can go anywhere in physical memory.
Each address space has its own page table, in physical memory. Hardware needs two special registers -- pointer to physical location of page table, and page table size. Example: suppose page size is 4 bytes.
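With 4-byte pages, the translation can be sketched as follows. The page table contents `[4, 3, 1]` are made-up values for illustration, and `translate_paged` and `PageTableError` are hypothetical names; the list length plays the role of the page table size register.

```python
PAGE_SIZE = 4  # bytes, as in the example


class PageTableError(Exception):
    """Stands in for the error raised when the page # exceeds the page table size."""


def translate_paged(virtual_addr, page_table):
    """page_table[i] holds the physical page # for virtual page i."""
    virtual_page, offset = divmod(virtual_addr, PAGE_SIZE)
    # Page table size check.
    if not 0 <= virtual_page < len(page_table):
        raise PageTableError(f"virtual page {virtual_page} not mapped")
    # The offset is carried over unchanged; only the page # is replaced.
    return page_table[virtual_page] * PAGE_SIZE + offset


page_table = [4, 3, 1]  # hypothetical mapping
print(translate_paged(5, page_table))  # virtual page 1, offset 1 -> 3*4 + 1 = 13
```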
[Figure: Page table translation. The virtual address splits into a virtual page # and an offset. The page # is checked against the page table size (error if it is larger) and indexes the page table found through the page table pointer. The resulting physical page # is combined with the unchanged offset to form the physical address.]
Questions:
1. What if page size is very small? For example, if page size is 512 bytes, lots of space is taken up with page table entries.
2. What if page size is really big? Why not have an infinite page size? Would waste unused space inside of page. Example of internal fragmentation. With segmentation, need to re-shuffle segments to avoid external fragmentation. Paging suffers from internal fragmentation.
3. What if address space is sparse? For example: on UNIX, code starts at 0, stack starts at 2^31 - 1. With 1KB pages, 2 million page table entries -- because have to have a table that maps the entire virtual address space.
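The "2 million entries" figure is just the size of the address space divided by the page size. A quick check in Python (the helper name `page_table_entries` is made up for this sketch):

```python
def page_table_entries(address_space_bytes, page_size_bytes):
    """Entries needed by a flat page table that maps the whole address space."""
    return address_space_bytes // page_size_bytes


print(page_table_entries(2**31, 1024))  # 2097152, about 2 million
print(page_table_entries(2**31, 512))   # 4194304: halving the page size doubles the table
```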
Paging makes memory allocation simple. Sharing is easy, but page tables are big if the address space is sparse.
Is there a solution that allows simple memory allocation, easy to share memory, and is efficient for sparse address spaces?
Combining paging and segmentation?
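8.2.3.4 Paged Segmentation (Multi-level translation)
Multi-level translation: use a tree of tables. The lowest level is a page table, so that physical memory can be allocated using a bitmap. Higher levels are typically segmented. For example, 2-level translation:
[Figure: Two-level translation. The virtual address splits into a virtual segment #, a virtual page #, and an offset. The segment # selects a segment table entry holding a page table pointer and a page table size. The page # is checked against that size (error if it is larger) and indexes the page table to get a physical page #, which is combined with the offset to form the physical address.]
Just like recursion -- could have any number of levels. Most architectures today do some flavor of this.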
Questions:
1. Where are segment tables/page tables stored? Segment tables are usually in special CPU registers, because they are small. Page tables are usually in main memory.
2. How do we share memory? Can share an entire segment, or a single page.
Multi-level translation only needs to allocate as many page table entries as we need. In other words, sparse address spaces are easy. Memory allocation is easy. Sharing can be done at segment or page level.
But it has some disadvantages as well:
  A pointer is needed per page (typically 4KB - 16KB pages today).
  Page tables need to be contiguous.
  Two lookups are needed per memory reference.
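A two-level lookup can be sketched as below. The bit widths (2-bit segment #, 4-bit page #, 10-bit offset) and the table contents are assumptions chosen for the example, and `translate_two_level` and `TranslationError` are hypothetical names. Each segment table entry is either `None` (unused segment) or a page table whose length acts as the page table size.

```python
SEG_BITS, PAGE_BITS, OFFSET_BITS = 2, 4, 10  # assumed widths, for illustration


class TranslationError(Exception):
    """Stands in for the trap on an unmapped segment or page."""


def translate_two_level(virtual_addr, segment_table):
    offset = virtual_addr & ((1 << OFFSET_BITS) - 1)
    page = (virtual_addr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = virtual_addr >> (OFFSET_BITS + PAGE_BITS)
    # First lookup: the segment table gives the page table for this segment.
    if not 0 <= segment < len(segment_table) or segment_table[segment] is None:
        raise TranslationError(f"segment {segment} not mapped")
    page_table = segment_table[segment]
    # Second lookup: the page table gives the physical page #.
    if page >= len(page_table):
        raise TranslationError(f"page {page} beyond page table size")
    return (page_table[page] << OFFSET_BITS) | offset


# Sparse address space: only segments 0 and 3 have page tables at all.
segment_table = [[7, 2], None, None, [5]]
print(hex(translate_two_level(0x0412, segment_table)))  # seg 0, page 1 -> 0x812
```

Only three page table entries exist in this example, even though the virtual address space spans four segments; the unused segments cost one segment table entry each.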
TLB: Translation Lookaside Buffer (a kind of page table cache)
Generic Issues in Caching
  Cache hit: item is in the cache
  Cache miss: item is not in the cache, have to do full operation
Effective access time = P(hit) * cost of hit + P(miss) * cost of miss
1. How do you find whether item is in the cache (whether there is a cache hit)?
2. If it is not in cache (cache miss), how do you choose what to replace from cache to make room?
3. Consistency -- how do you keep cache copy consistent with real version?
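The effective access time formula is easy to play with in code. The numbers below (99% hit rate, 10 ns hit, 210 ns miss) are made up for illustration and are not from the lecture.

```python
def effective_access_time(p_hit, hit_cost, miss_cost):
    """P(hit) * cost of hit + P(miss) * cost of miss, with P(miss) = 1 - P(hit)."""
    return p_hit * hit_cost + (1.0 - p_hit) * miss_cost


# Hypothetical costs in nanoseconds.
print(effective_access_time(0.99, 10, 210))  # roughly 12 ns
```

Even a 1% miss rate adds about 20% to the average here, which is why the hit rate matters so much when the miss cost is large.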
Use caching at each level, to provide the illusion of a terabyte, with register access times. Works because programs aren't random.
Exploit locality: computers behave in the future like they have in the past.
  Temporal locality: will reference same locations as accessed in the recent past
  Spatial locality: will reference locations near those accessed in the recent past
Caching applied to address translation
Often reference same page repeatedly, why go through entire translation each time?
Translation Buffer, Translation Lookaside Buffer: hardware table of frequently used translations, to avoid having to go through the page table lookup in the common case. Typically on chip, so access time of 5-10 ns, instead of several hundred ns for main memory.
How do we tell if the needed translation is in the TLB?
1. Search table in sequential order
2. Direct mapped: restrict each virtual page to use a specific slot in the TLB
Consistency between TLB and page tables
What happens on context switch? Have to invalidate entire TLB contents. When new program starts running, will bring in new translations. Alternatively, include process id tag in TLB comparator. Have to keep TLB consistent with whatever the full translation would give.
What if translation tables change? For example, to move page from memory to disk, or vice versa. Have to invalidate TLB entry.
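The consistency rules above can be illustrated with a toy software model of a TLB. This is a sketch under simplifying assumptions: a real TLB is a small fixed-size hardware table, while this one is an unbounded dictionary, and the names `TLB` and `translate_page` are hypothetical. Entries are tagged with a process id, the alternative to flushing on every context switch.

```python
class TLB:
    """Toy model: (process id, virtual page #) -> physical page #."""

    def __init__(self):
        self.entries = {}

    def lookup(self, pid, virtual_page):
        return self.entries.get((pid, virtual_page))  # None on a miss

    def fill(self, pid, virtual_page, physical_page):
        self.entries[(pid, virtual_page)] = physical_page

    def invalidate(self, pid, virtual_page):
        """Needed whenever the OS changes this page's translation."""
        self.entries.pop((pid, virtual_page), None)

    def flush(self):
        """Invalidate everything, e.g. on context switch without pid tags."""
        self.entries.clear()


def translate_page(tlb, pid, virtual_page, page_table):
    physical_page = tlb.lookup(pid, virtual_page)
    if physical_page is None:                    # TLB miss: do the full lookup
        physical_page = page_table[virtual_page]
        tlb.fill(pid, virtual_page, physical_page)
    return physical_page
```

If the OS changes `page_table` without calling `invalidate`, `translate_page` keeps returning the stale physical page, which is exactly the inconsistency the notes warn about.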
Relationship between TLB and hardware memory caches
Can put a cache of memory values anywhere in this process. If between translation box and memory, called a "physically addressed cache". Could also put a cache between CPU and translation box: "virtually addressed cache".
Virtual memory is a kind of caching: we're going to talk about using the contents of main memory as a cache for disk.
Page Replacement Algorithms:
  FIFO: First-In-First-Out (Belady's Anomaly)
  LRU: Least Recently Used (implemented with counters, stacks, etc.)
  NRU: Not Recently Used
  Optimal
Clock algorithm: arrange physical pages in a circle, with a clock hand.
1. Hardware keeps a use bit per physical page frame.
2. Hardware sets the use bit on each reference. If the use bit isn't set, the page has not been referenced in a long time.
3. On page fault: advance the clock hand (not real time) and check the use bit:
     1 -> clear it, go on
     0 -> replace page
Will it always find a page or loop infinitely? Even if all use bits are set, it will eventually loop around, clearing all use bits -> FIFO.
What if the hand is moving slowly? Not many page faults and/or find page quickly.
What if the hand is moving quickly? Lots of page faults and/or lots of reference bits set.
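The clock algorithm's victim search fits in a few lines. In this Python sketch the class name `Clock` and its methods are hypothetical; `reference` plays the role of the hardware setting the use bit, and `pick_victim` is what the OS would run on a page fault.

```python
class Clock:
    def __init__(self, num_frames):
        self.use = [False] * num_frames  # one use bit per physical page frame
        self.hand = 0

    def reference(self, frame):
        """Models the hardware setting the use bit on a reference."""
        self.use[frame] = True

    def pick_victim(self):
        """Advance the hand until a frame with a clear use bit is found."""
        while True:
            frame = self.hand
            self.hand = (self.hand + 1) % len(self.use)
            if self.use[frame]:
                self.use[frame] = False  # 1 -> clear, go on
            else:
                return frame             # 0 -> replace this page


clock = Clock(4)
for frame in range(4):
    clock.reference(frame)
print(clock.pick_victim())  # 0: all bits were set, so one full sweep clears them
```

The example shows why the loop always terminates: in the worst case the hand clears every use bit in one sweep and then replaces the frame it started from.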
Nth chance algorithm: don't throw a page out until the hand has swept by N times.
OS keeps a counter per page -- # of sweeps.
On page fault, OS checks the use bit:
  1 => clear use bit and also clear counter, go on
  0 => increment counter; if < N, go on, else replace page
How do we pick N?
Why pick large N? Better approximation to LRU.
Why pick small N? More efficient; otherwise might have to look a long way to find a free page.
Dirty pages have to be written back to disk when replaced. It takes extra overhead to replace a dirty page, so give dirty pages an extra chance before replacing?
Common approach:
  clean pages -- use N = 1
  dirty pages -- use N = 2 (and write back to disk when N = 1)
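Here is a minimal Python sketch of the Nth chance sweep with the clean/dirty thresholds above. The names `Frame` and `nth_chance_victim` are hypothetical, and the sketch does not model the write-back itself; it only decides which frame to replace.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    use: bool = False    # set by hardware on reference
    dirty: bool = False  # set by hardware on modification
    sweeps: int = 0      # OS counter: sweeps seen with the use bit clear


def nth_chance_victim(frames, hand, n_clean=1, n_dirty=2):
    """Return (victim index, new hand position)."""
    while True:
        index = hand
        frame = frames[index]
        hand = (hand + 1) % len(frames)
        if frame.use:
            frame.use = False   # 1 => clear use bit and counter, go on
            frame.sweeps = 0
            continue
        frame.sweeps += 1       # 0 => increment counter
        limit = n_dirty if frame.dirty else n_clean
        if frame.sweeps >= limit:
            return index, hand  # counter reached N: replace this page
        # Otherwise go on. A real OS could start the write-back of a dirty page here.


frames = [Frame(use=True), Frame(dirty=True), Frame()]
print(nth_chance_victim(frames, hand=0))  # (2, 0): the clean, unused frame goes first
```

With `n_clean=1` and no dirty pages, the function behaves exactly like the plain clock algorithm.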
To summarize, many machines maintain four bits per page table entry:
  use: set when page is referenced, cleared by clock algorithm
  modified: set when page is modified, cleared when page is written to disk
  valid: ok for program to reference this page
  read-only: ok for program to read page, but not to modify it (for example, for catching modifications to code pages)
Inverted Page Table
Page tables map virtual page # -> physical page #. Do we need the reverse, physical page # -> virtual page #? Yes. The clock algorithm runs through page frames. What if it ran through page tables instead?
  (i) many more entries
  (ii) what if there is sharing?
Thrashing
Thrashing: memory overcommitted, pages tossed out while still needed.
Example: one program touches 50 pages (each equally likely), but there are only 40 physical page frames.
  If have enough pages: 200 ns/ref
  If have too few pages, assume every 5th page reference is a page fault:
    4 refs x 200 ns
    1 page fault x 10 ms for disk I/O
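Working the example through: averaging the four fast references and the one page fault over five references gives the cost per reference while thrashing. The Python below is one reading of the numbers above; the function name `average_reference_ns` is made up for this sketch.

```python
HIT_NS = 200             # cost of a reference that finds its page in memory
FAULT_NS = 10_000_000    # 10 ms of disk I/O for a page fault


def average_reference_ns(refs_per_fault):
    """Average cost when one in every `refs_per_fault` references page-faults."""
    hits = refs_per_fault - 1
    return (hits * HIT_NS + FAULT_NS) / refs_per_fault


average = average_reference_ns(5)
print(average)           # 2000160.0 ns, i.e. about 2 ms per reference
print(average / HIT_NS)  # roughly a 10,000x slowdown
```

The disk time dominates completely: the four fast references contribute 800 ns out of roughly 10 ms.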
Denning's Working Set
Informally, the collection of pages a process is using right now. Formally, the set of pages a job has referenced in the last T seconds.
How do we pick T?
  1 page fault = 10 msec
  10 msec = 2 million instructions
  So T needs to be a lot bigger than 1 million instructions.
How do you figure out what the working set is?
  (a) Modify the clock algorithm so that it sweeps at fixed intervals. Keep idle time per page -- how many seconds since last reference.
  (b) With a second chance list -- how many seconds since the page got put on the 2nd chance list.
Now that you know how many pages each program needs, what to do?
Global replacement (UNIX) -- all pages in one pool. More flexible -- if my process needs a lot, and you need a little, I can grab pages from you.