
CSE 490/590
Address Translation and Protection

Steve Ko
Computer Sciences and Engineering
University at Buffalo

Last time…
• Prefetching
  – Speculate future instruction & data accesses and fetch them into caches
• Hardware techniques
  – Prefetch buffer
  – Prefetch-on-miss
  – One Block Lookahead
  – Strided
• Software techniques
  – Prefetch instruction
  – Loop interchange
  – Loop fusion
  – Tiling


Memory Management

• From early absolute addressing schemes to modern systems with support
  for virtual machine monitors
• Can separate into orthogonal functions:
  – Translation (mapping of virtual address to physical address)
  – Protection (permission to access a word in memory)
  – Virtual memory (transparent extension of memory space using slower
    disk storage)
• But most modern systems provide support for all of the above functions
  with a single page-based system (see the sketch below)
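To make the "single page-based system" point concrete, here is a minimal
C sketch of a page-table entry that carries all three functions at once.
The bit layout is an illustrative assumption, not any particular
machine's format.

    #include <stdint.h>
    #include <stdio.h>

    /* One page-table entry serving all three functions. The field
     * widths are assumptions chosen for illustration. */
    typedef struct {
        uint32_t frame    : 20;  /* translation: physical frame number */
        uint32_t readable : 1;   /* protection: permission bits */
        uint32_t writable : 1;
        uint32_t resident : 1;   /* virtual memory: 0 => page is on disk */
    } pte_t;

    int main(void) {
        pte_t pte = { .frame = 0x123, .readable = 1,
                      .writable = 0, .resident = 1 };
        if (!pte.resident)
            printf("page fault: fetch the page from disk\n");
        else if (!pte.writable)
            printf("frame 0x%x is read-only\n", pte.frame);
        return 0;
    }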

Absolute Addresses

EDSAC, early 50s
• Only one program ran at a time, with unrestricted access to the entire
  machine (RAM + I/O devices)
• Addresses in a program depended upon where the program was to be
  loaded in memory
• Problems?


Dynamic Address Translation

Motivation
In the early machines, I/O operations were slow and each word
transferred involved the CPU. Higher throughput was possible if the CPU
and the I/O of 2 or more programs were overlapped.
How? ⇒ multiprogramming

Location-independent programs
Programming and storage management ease
⇒ need for a base register

Protection
Independent programs should not affect each other inadvertently
⇒ need for a bound register

[Diagram: prog1 and prog2 loaded at different locations in physical
memory]

Simple Base and Bound Translation

[Diagram: the effective address of a Load X is added to the Base
Register to form the physical address of the current segment in main
memory; the Bound Register (segment length) is compared against the
effective address to detect a bounds violation]

Base and bound registers are visible/accessible only when the processor
is running in supervisor mode. (A sketch of the check appears below.)
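A minimal C sketch of the base-and-bound check from the diagram. The
struct, register widths, and trap behavior are illustrative assumptions,
not a specific machine's interface.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Base and bound registers, as set up by the supervisor. */
    typedef struct {
        uint32_t base;   /* physical address where the segment starts */
        uint32_t bound;  /* segment length */
    } seg_regs_t;

    /* Translate an effective (program) address to a physical address,
     * trapping on a bounds violation. */
    uint32_t translate(seg_regs_t r, uint32_t eff_addr) {
        if (eff_addr >= r.bound) {       /* bounds-violation comparator */
            fprintf(stderr, "bounds violation at 0x%x\n", eff_addr);
            exit(1);                     /* stand-in for a protection trap */
        }
        return r.base + eff_addr;        /* adder: relocate by the base */
    }

    int main(void) {
        seg_regs_t r = { .base = 0x8000, .bound = 0x1000 };
        printf("0x%x\n", translate(r, 0x10));  /* prints 0x8010 */
        return 0;
    }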


Memory Fragmentation

[Diagram: OS space plus user partitions (user 1: 16K, user 2: 24K,
user 3: 32K) with free regions; users 4 (16K) & 5 (24K) arrive, then
users 2 & 5 leave, leaving scattered free space]

As users come and go, the storage is "fragmented". Therefore, at some
stage programs have to be moved around to compact the storage.

Paged Memory Systems

• A processor-generated address can be split into:
  page number | offset
• A page table contains the physical address of the base of each page:
  [Diagram: pages 0-3 of User-1's address space are mapped through the
  page table of User-1 to non-contiguous frames in physical memory]

Page tables make it possible to store the pages of a program
non-contiguously. (A sketch of the lookup appears below.)
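A minimal C sketch of the page-number/offset split and page-table
lookup described above, assuming an illustrative 4 KB page size and a
tiny four-entry table.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12                 /* 4 KB pages (an assumption) */
    #define PAGE_SIZE (1u << PAGE_BITS)
    #define NUM_PAGES 4

    /* page_table[vpn] holds the physical base address of that page;
     * the frame numbers here are arbitrary examples. */
    uint32_t page_table[NUM_PAGES] = { 0x3000, 0x0000, 0x6000, 0x2000 };

    uint32_t translate(uint32_t vaddr) {
        uint32_t vpn    = vaddr >> PAGE_BITS;       /* page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* offset in page */
        return page_table[vpn] | offset;            /* base + offset */
    }

    int main(void) {
        /* vpn 2 -> frame base 0x6000 -> prints 0x6010 */
        printf("0x%x\n", translate(0x2010));
        return 0;
    }

A real system would also check the page number against the table size
and a valid bit; the sketch omits both for brevity.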


Private Address Space per User

[Diagram: each of users 1, 2, and 3 has its own page table mapping its
virtual address VA1 into physical memory, which also holds OS pages and
free pages]

• Each user has a page table
• Page table contains an entry for each user page

Where Should Page Tables Reside?

• Space required by the page tables (PT) is proportional to the address
  space, number of users, ...
  ⇒ Space requirement is large
  ⇒ Too expensive to keep in registers
• Idea: Keep PTs in the main memory
  – needs one reference to retrieve the page base address and another
    to access the data word
  ⇒ doubles the number of memory references! (see the sketch below)
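A toy C sketch that makes the doubled reference count visible: with the
page table itself in main memory, one user-level load costs one memory
reference for the page-table entry and a second for the data. The
memory size and page-table base address are arbitrary assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12
    static uint32_t memory[1 << 16];   /* toy word-addressed main memory */
    static uint32_t pt_base = 0x100;   /* where this user's PT lives */
    static int mem_refs = 0;

    static uint32_t mem_read(uint32_t paddr) {
        mem_refs++;                    /* count every trip to memory */
        return memory[paddr];
    }

    /* A single user-level load now costs two memory references. */
    uint32_t load(uint32_t vaddr) {
        uint32_t vpn  = vaddr >> PAGE_BITS;
        uint32_t base = mem_read(pt_base + vpn);   /* ref 1: fetch PTE */
        return mem_read(base + (vaddr & ((1u << PAGE_BITS) - 1)));
                                                   /* ref 2: fetch data */
    }

    int main(void) {
        memory[pt_base + 1] = 0x4000;  /* map virtual page 1 -> 0x4000 */
        memory[0x4000 + 8]  = 42;
        printf("value=%u, memory refs=%d\n", load(0x1008), mem_refs);
        return 0;                      /* prints: value=42, memory refs=2 */
    }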


Page Tables in Physical Memory

[Diagram: the page tables of User 1 and User 2 themselves reside in
physical memory, alongside the frames that hold each user's virtual
address VA1]

CSE 490/590 Administrivia

• Midterm on Friday, 3/4
• Project 1 deadline: Friday, 3/11
• Project 2 list will be up soon
• Guest lectures possibly this month
• Quiz will be distributed Monday


A Problem in the Early Sixties

• There were many applications whose data could not fit in the main
  memory, e.g., payroll
  – Paged memory systems reduced fragmentation but still required the
    whole program to be resident in the main memory
• Programmers moved the data back and forth from the secondary store by
  overlaying it repeatedly on the primary store
  – Tricky programming!

[Diagram: Ferranti Mercury, 1956 — a 40K-bit central store backed by a
640K-bit drum]

Not just an ancient black art: e.g., the IBM Cell microprocessor used
in the Playstation-3 has an explicitly managed local store!

Manual Overlays

• Assume an instruction can address all the storage on the drum
• Method 1: programmer keeps track of addresses in the main memory and
  initiates an I/O transfer when required
  – Difficult, error-prone!
• Method 2: automatic initiation of I/O transfers by software address
  translation
  – Brooker's interpretive coding, 1960
  – Inefficient!


Demand Paging in Atlas (1962)

"A page from secondary storage is brought into the primary storage
whenever it is (implicitly) demanded by the processor."
                                                      — Tom Kilburn

Primary memory is used as a cache for secondary memory.

Primary: 32 pages, 512 words/page
Secondary (drum): 32 x 6 pages
User sees 32 x 6 x 512 words of storage

Hardware Organization of Atlas

[Diagram: the effective address goes through an initial address decode;
the memory system comprises 16 ROM pages (0.4-1 µsec, system code, not
swapped), 2 subsidiary pages (1.4 µsec, system data, not swapped),
32 pages of primary store (1.4 µsec), 192 drum pages on 4 drums, and
8 tape decks (88 µsec/word); words are 48 bits, pages 512 words]

One Page Address Register (PAR) per page frame.
Compare the effective page address against all 32 PARs:
  match    ⇒ normal access
  no match ⇒ page fault: save the state of the partially executed
             instruction
(A sketch of the PAR lookup follows.)
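A minimal C sketch of the PAR lookup: one register per page frame, all
32 compared against the effective page address. The hardware does the
comparison in parallel; the loop here is just a software stand-in, and
the valid flags are an added simplification.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_FRAMES 32
    #define PAGE_WORDS 512

    static uint32_t par[NUM_FRAMES];  /* virtual page held by each frame */
    static int par_valid[NUM_FRAMES];

    /* Returns the matching frame, or -1 to signal a page fault. */
    int par_lookup(uint32_t eff_addr) {
        uint32_t vpage = eff_addr / PAGE_WORDS;
        for (int f = 0; f < NUM_FRAMES; f++)  /* parallel in hardware */
            if (par_valid[f] && par[f] == vpage)
                return f;                     /* normal access */
        return -1;   /* page fault: save state, fetch page from drum */
    }

    int main(void) {
        par[3] = 7; par_valid[3] = 1;  /* frame 3 holds virtual page 7 */
        printf("%d\n", par_lookup(7 * PAGE_WORDS + 10)); /* prints 3 */
        printf("%d\n", par_lookup(9 * PAGE_WORDS));      /* prints -1 */
        return 0;
    }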


Atlas Demand Paging Scheme

• On a page fault:
  – Input transfer into a free page is initiated
  – The Page Address Register (PAR) is updated
  – If no free page is left, a page is selected to be replaced (based
    on usage)
  – The replaced page is written on the drum
    » to minimize the drum latency effect, the first empty page on the
      drum was selected
  – The page table is updated to point to the new location of the page
    on the drum
(A sketch of this fault sequence appears after the comparison below.)

Caching vs. Demand Paging

Caching:       CPU — cache — primary memory
Demand paging: CPU — primary memory — secondary memory

  Caching                         Demand paging
  cache entry                     page frame
  cache block (~32 bytes)         page (~4K bytes)
  cache miss rate (1% to 20%)     page miss rate (<0.001%)
  cache hit (~1 cycle)            page hit (~100 cycles)
  cache miss (~100 cycles)        page miss (~5M cycles)
  a miss is handled in hardware   a miss is handled mostly in software
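A self-contained C sketch of the Atlas fault sequence listed above. The
data structures and the usage-based replacement policy are simplified
stand-ins, not the actual Atlas supervisor; in particular, the victim is
written back to a fixed drum page rather than the first empty one.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_FRAMES 32
    #define PAGE_WORDS 512

    static int32_t  par[NUM_FRAMES];   /* page in each frame, -1 = free */
    static uint32_t use_count[NUM_FRAMES];  /* crude usage information */
    static uint32_t frames[NUM_FRAMES][PAGE_WORDS];
    static uint32_t drum[192][PAGE_WORDS];  /* 192 drum pages, as on Atlas */

    int handle_page_fault(uint32_t vpage) {
        int f, victim = 0;
        for (f = 0; f < NUM_FRAMES; f++)   /* look for a free frame */
            if (par[f] < 0) break;
        if (f == NUM_FRAMES) {             /* none free: replace by usage */
            for (int i = 1; i < NUM_FRAMES; i++)
                if (use_count[i] < use_count[victim]) victim = i;
            f = victim;                    /* write the victim back out */
            memcpy(drum[par[f]], frames[f], sizeof frames[f]);
        }
        memcpy(frames[f], drum[vpage], sizeof frames[f]); /* input transfer */
        par[f] = (int32_t)vpage;           /* update the PAR */
        use_count[f] = 0;
        return f;
    }

    int main(void) {
        memset(par, -1, sizeof par);       /* all frames start free */
        drum[7][0] = 123;
        int f = handle_page_fault(7);
        printf("page 7 -> frame %d, word0=%u\n", f, frames[f][0]);
        return 0;                          /* prints: ... frame 0, word0=123 */
    }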


Acknowledgements

• These slides contain material developed and copyrighted by:
  – Krste Asanovic (MIT/UCB)
  – David Patterson (UCB)
• And also by:
  – Arvind (MIT)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)

• MIT material derived from course 6.823
• UCB material derived from course CS252

