CS152 Computer Architecture and Engineering, Lecture 21: Buses and I/O
Lec21.1: CS152 Computer Architecture and Engineering
Lecture 21: Buses and I/O
November 10, 1999
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
CS152 / Kubiatowicz, 11/10/99, ©UCB Fall 1999

Lec21.2: Recap: Levels of the Memory Hierarchy

  Level         Capacity    Access Time  Cost                 Staged by       Xfer Unit
  Registers     100s Bytes  <10s ns                           prog./compiler  Instr. Operands, 1-8 bytes
  Cache         K Bytes     10-100 ns    $.01-.001/bit        cache cntl      Blocks, 8-128 bytes
  Main Memory   M Bytes     100ns-1us    $.01-.001            OS              Pages, 512-4K bytes
  Disk          G Bytes     ms           10^-3 - 10^-4 cents  user/operator   Files, Mbytes
  Tape          infinite    sec-min      10^-6 cents

  Upper levels are smaller and faster; lower levels are larger and cheaper.

Lec21.3: Recap: What is virtual memory?
° Virtual memory => treat memory as a cache for the disk
° Terminology: blocks in this cache are called "Pages"
° Typical size of a page: 1K - 8K
° Page table maps virtual page numbers to physical frames
° Figure: the virtual address splits into "V page no." and an offset (10 bits in the figure). The Page Table Base Reg plus the virtual page number index into the page table, which is itself located in physical memory; each entry holds a Valid bit, Access Rights, and a physical frame number (PA). "P page no." concatenated with the unchanged offset forms the Physical Address.

Lec21.4: Recap: Three Advantages of Virtual Memory
° Translation:
• Program can be given a consistent view of memory, even though physical memory is scrambled
• Makes multithreading reasonable (now used a lot!)
• Only the most important part of the program ("Working Set") must be in physical memory
• Contiguous structures (like stacks) use only as much physical memory as necessary yet can still grow later
° Protection:
• Different threads (or processes) protected from each other
• Different pages can be given special behavior (Read Only, Invisible to user programs, etc.)
• Kernel data protected from User programs
• Very important for protection from malicious programs => Far more "viruses" under Microsoft Windows
° Sharing:
• Can map the same physical page to multiple users ("Shared memory")

Lec21.5: Recap: Making address translation practical: TLB
° Translation Look-aside Buffer (TLB) is a cache of recent translations
° Speeds up translation process "most of the time"
° TLB is typically a fully-associative lookup-table
° Figure: the virtual address [page | off] is looked up in the TLB, which maps the page number to a frame number; the physical address is [frame | off], with the offset passed through unchanged.

Lec21.6: Recap: TLB organization: include protection

  Virtual Address  Physical Address  Dirty  Ref  Valid  Access  ASID
  0xFA00           0x0003            Y      N    Y      R/W     34
  0x0040           0x0010            N      Y    Y      R       0
  0x0041           0x0011            N      Y    Y      R       0

° TLB usually organized as fully-associative cache
• Lookup is by Virtual Address
• Returns Physical Address + other info
° Entry bits: Dirty => Page modified (Y/N)? Ref => Page touched (Y/N)? Valid => TLB entry valid (Y/N)? Access => Read? Write? ASID => Which User?

Lec21.7: Recap: MIPS R3000 pipelining of TLB
° MIPS R3000 Pipeline: Inst Fetch | Dcd/Reg | ALU/E.A | Memory | Write Reg
• Instruction fetch goes through the TLB then the I-Cache; the E.A. computed by the ALU goes through the TLB then the D-Cache
° TLB: 64 entry, on-chip, fully associative, software TLB fault handler
° Virtual Address Space: ASID (6 bits) | V. Page Number (20 bits) | Offset (12 bits)
• 0xx: User segment (caching based on PT/TLB entry)
• 100: Kernel physical space, cached
• 101: Kernel physical space, uncached
• 11x: Kernel virtual space
° Allows context switching among 64 user processes without TLB flush

Lec21.8: Reducing Translation Time I: Overlapped Access
° Virtual Address (for 4K pages): V page no. (20 bits) | offset (12 bits)
° TLB lookup returns V, Access Rights, and PA; "P page no." plus the unchanged offset forms the Physical Address
° Machines with TLBs overlap TLB lookup with cache access
• Works because lower bits of result (offset) available early

Lec21.9: Overlapped TLB & Cache Access
° Figure: the 20-bit page # goes to the TLB (associative lookup) while bits of the 12-bit displacement index a 4K direct-mapped cache (1K lines of 4 bytes: 10-bit index + 2-bit byte select). Since index + byte select = 12 bits = the page displacement, the cache lookup needs no translated bits; the frame number (FN) from the TLB is then compared with the cache tag for Hit/Miss.
° With this technique, size of cache can be up to same size as pages
⇒ What if we want a larger cache???

Lec21.10: Problems With Overlapped TLB Access
° Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation
° If we do this in parallel, we have to be careful, however:
• Example: suppose everything is the same except that the cache is increased to 8K bytes instead of 4K. The index now needs 11 bits plus the 2-bit byte select, so one bit of the virtual page number is changed by VA translation but is needed for cache lookup.
° Solutions:
⇒ Go to 8K byte page sizes;
⇒ Go to a 2-way set associative cache (1K x 2-way: the 10-bit index still fits in the offset); or
⇒ SW guarantee VA[13]=PA[13]

Lec21.11: Reducing Translation Time II: Virtually Addressed Cache
° Figure: the CPU indexes the cache directly with the VA; only on a miss does the translation unit convert VA to PA for main memory.
° Only require address translation on cache miss!
• Very fast as a result (as fast as cache lookup)
• No restrictions on cache organization
° Synonym problem: two different virtual addresses map to the same physical address ⇒ two cache entries holding data for the same physical address!
° Solutions:
• Provide associative lookup on physical tags during cache miss to enforce a single copy in the cache (potentially expensive)
• Make the operating system enforce one copy per cache set by selecting virtual⇒physical mappings carefully. This only works for direct-mapped caches.
° Virtually addressed caches currently out of favor because of synonym complexities

Lec21.12: Survey
° R4000
• 32 bit virtual, 36 bit physical
• variable page size (4KB to 16 MB)
• 48 entries mapping page pairs (128 bit)
° MPC601 (32 bit implementation of 64 bit PowerPC arch)
• 52 bit virtual, 32 bit physical, 16 segment registers
• 4KB page, 256MB segment
• 4 entry instruction TLB
• 256 entry, 2-way TLB (and variable sized block xlate)
• overlapped lookup into 8-way 32KB L1 cache
• hardware table search through hashed page tables
° Alpha 21064
• arch is 64 bit virtual, implementation subset: 43, 47, 51, 55 bit
• 8, 16, 32, or 64KB pages (3 level page table)
• 12 entry ITLB, 32 entry DTLB
• 43 bit virtual, 28 bit physical octword address

Lec21.13: Alpha VM Mapping
° "64-bit" address divided into 3 segments
• seg0 (bit 63 = 0): user code/heap
• seg1 (bit 63 = 1, bit 62 = 1): user stack
• kseg (bit 63 = 1, bit 62 = 0): kernel segment for OS
° 3-level page table, each level one page
• Alpha only uses 43 unique bits of VA
• (future min page size up to 64KB => 55 bits of VA)
° PTE bits: valid, kernel & user read & write enable (no reference, use, or dirty bit)

Lec21.14: Administrivia
° Important: Lab 7. Design for Test
• You should be testing from the very start of your design
• Consider adding special monitor modules at various points in the design => I have asked you to label trace output from these modules with the current clock cycle #
• The time to understand how components of your design should work is while you are designing!
° Question: Oral reports on 12/6?
• Proposal: 10 - 12 am and 2 - 4 pm
° Pending schedule:
• Sunday 11/14: Review session, 7:00 in 306 Soda
• Monday 11/15: Guest lecture by Bob Broderson
• Tuesday 11/16: Lab 7 breakdowns and Web description
• Wednesday 11/17: Midterm I
• Monday 11/29: no class? Possibly
• Monday 12/1: Last class (wrap up, evaluations, etc.)
• Monday 12/6: final project reports due after oral report
• Friday 12/10: grades should be posted

Lec21.15: Administrivia II
° Major organizational options:
• 2-way superscalar (18 points)
• 2-way multithreading (20 points)
• 2-way multiprocessor (18 points)
• out-of-order execution (22 points)
• deep pipelined (12 points)
° Test programs will include multiprocessor versions
° Both multiprocessor and multithreaded must implement a synchronizing "Test and Set" instruction:
• Normal load instruction, with special address range:
- Addresses from 0xFFFFFFF0 to 0xFFFFFFFF
- Only need to implement 16 synchronizing locations
• Reads and returns the old value of the memory location at the specified address, while setting the value to one (stall memory stage for one extra cycle)
• For multiprocessor, this instruction must make sure that all updates to this address are suspended during the operation
• For multithreaded, switch to the other thread if the value is already non-zero (like a cache miss)

Lec21.16: Computers in the News: Sony Playstation 2000
° (as reported in Microprocessor Report, Vol 13, No. 5)
• Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
• Graphics Synthesizer: 2.4 billion pixels per second
• Claim: Toy Story realism brought to games!

Lec21.17: Playstation 2000 Continued
° Emotion Engine:
• Superscalar MIPS core
• Vector Coprocessor Pipelines
• RAMBUS DRAM interface
° Sample Vector Unit:
• 2-wide VLIW
• Includes Microcode Memory
• High-level instructions like matrix-multiply

Lec21.18: What is a bus?
° A Bus Is:
• a shared communication link
• a single set of wires used to connect multiple subsystems (Processor/Control/Datapath, Memory, Input, Output)
° A Bus is also a fundamental tool for composing large, complex systems
• systematic means of abstraction

Buses
° Figure: Processor and Memory connected to several I/O Devices over a shared bus

Advantages of Buses
° Versatility:
• New devices can
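The page-table walk recapped in the "What is virtual memory?" slides can be sketched in C. This is a minimal illustration, not the course's reference design: it assumes 4 KB pages (a 12-bit offset, as in the overlapped-access slide) and a toy 16-entry single-level table, where real hardware uses a far wider virtual page number or a multi-level walk.

```c
#include <stdint.h>

/* Hypothetical single-level page table walk: 4 KB pages, 16 entries. */
#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define NUM_PAGES  16

typedef struct {
    int      valid;     /* V bit: mapping present?              */
    int      writable;  /* stand-in for the Access Rights field */
    uint32_t frame;     /* physical frame number (the PA field) */
} pte_t;

/* Returns 1 and fills *pa on success; 0 means page fault (trap to OS). */
int translate(const pte_t table[], uint32_t va, uint32_t *pa)
{
    uint32_t vpn    = va >> PAGE_SHIFT;   /* "V page no." field       */
    uint32_t offset = va & PAGE_MASK;     /* passes through unchanged */

    if (vpn >= NUM_PAGES || !table[vpn].valid)
        return 0;
    *pa = (table[vpn].frame << PAGE_SHIFT) | offset;
    return 1;
}
```

With virtual page 2 mapped to frame 5, address 0x2ABC translates to 0x5ABC: the low 12 bits are untouched, which is exactly why the offset can index a small cache in parallel with the TLB.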
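The "TLB organization" slide's entry format (virtual page, physical frame, Dirty/Ref/Valid/Access/ASID) can likewise be modeled. The sequential loop below stands in for the parallel match of a fully associative CAM, and treating a protection violation as a miss is a simplifying assumption; field widths and names are mine.

```c
#include <stdint.h>

/* Fully associative TLB with the fields of the TLB-organization slide. */
enum { ACC_R = 1, ACC_W = 2 };

typedef struct {
    uint32_t vpage;   /* tag: virtual page number    */
    uint32_t pframe;  /* data: physical frame number */
    int dirty;        /* page modified?              */
    int ref;          /* page touched?               */
    int valid;        /* entry valid?                */
    int access;       /* ACC_R | ACC_W rights        */
    int asid;         /* which user owns the mapping */
} tlb_entry_t;

/* 1 = hit (fills *pframe); 0 = miss or fault, so walk the page table. */
int tlb_lookup(tlb_entry_t tlb[], int n, int asid, uint32_t vpage,
               int is_write, uint32_t *pframe)
{
    for (int i = 0; i < n; i++) {
        tlb_entry_t *e = &tlb[i];
        if (!e->valid || e->vpage != vpage || e->asid != asid)
            continue;
        if (is_write && !(e->access & ACC_W))
            return 0;              /* write to a read-only page */
        e->ref = 1;                /* Ref bit: page touched     */
        if (is_write)
            e->dirty = 1;          /* Dirty bit: page modified  */
        *pframe = e->pframe;
        return 1;
    }
    return 0;
}
```

Loaded with the slide's three example rows, a write by ASID 34 to page 0xFA00 hits and returns frame 0x0003, while a write by ASID 0 to read-only page 0x0040 does not.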
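The constraint behind the overlapped TLB and cache access slides reduces to one inequality: every bit used to index the cache must lie inside the page offset, i.e. cache size / associativity <= page size. A tiny checker (the function name is mine) makes the slides' examples concrete.

```c
/* Overlapped TLB/cache access is legal only if no translated bit is
 * needed to index the cache: the index spans log2(cache_size/assoc)
 * bits, which must fit within log2(page_size) offset bits. */
int overlap_safe(unsigned cache_bytes, unsigned assoc, unsigned page_bytes)
{
    return cache_bytes / assoc <= page_bytes;
}
```

This reproduces the slide's fixes for the 8 KB cache: larger pages or more associativity; the third fix, a software guarantee that the extra index bit is identical in VA and PA, is outside what this size check can express.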
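The Alpha VM mapping slide's three-way address-space split is decided entirely by the top two bits of the virtual address. A sketch, with the segment names taken from the slide and the function itself purely illustrative:

```c
#include <stdint.h>

/* Classify an Alpha virtual address by its top two bits. */
const char *alpha_segment(uint64_t va)
{
    if (!(va >> 63))
        return "seg0";             /* bit 63 = 0: user code/heap       */
    if ((va >> 62) & 1)
        return "seg1";             /* bits 63,62 = 1,1: user stack     */
    return "kseg";                 /* bits 63,62 = 1,0: kernel segment */
}
```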
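The synchronizing "Test and Set" load specified in Administrivia II can be prototyped at simulator level. This sketch models only the single-thread semantics (return the old value, leave the location set to one, 16 locations at 0xFFFFFFF0 to 0xFFFFFFFF); the multiprocessor requirement that other updates to the address be suspended during the operation is assumed, not implemented.

```c
#include <stdint.h>

/* Simulator-level sketch of the lab's synchronizing load. */
#define TAS_BASE  0xFFFFFFF0u
#define TAS_COUNT 16u

static uint32_t tas_mem[TAS_COUNT];  /* the 16 synchronizing locations */

/* Returns the old value and leaves the location set to one. */
uint32_t test_and_set(uint32_t addr)
{
    uint32_t idx = addr - TAS_BASE;  /* caller must use the special range */
    uint32_t old = tas_mem[idx];
    tas_mem[idx] = 1;                /* done during the extra memory-stage cycle */
    return old;
}
```

A thread acquires lock 0 when test_and_set(0xFFFFFFF0) returns zero; a non-zero return means the lock is already held, which is the point where the multithreaded design switches to the other thread.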