
CS 61C: Great Ideas in Computer Architecture
Multiple Instruction Issue / Virtual Memory Introduction
Guest Lecturer/TA: Andrew Luo

Great Idea #4: Parallelism
• Parallel Requests
  – Assigned to computer, e.g. search "Garcia" (Warehouse Scale Computer)
• Parallel Threads
  – Assigned to core, e.g. lookup, ads
• Parallel Instructions
  – > 1 instruction @ one time, e.g. 5 pipelined instructions
• Parallel Data
  – > 1 data item @ one time, e.g. add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3)
• Hardware descriptions
  – All gates functioning in parallel at same time
Leverage parallelism & achieve high performance
[Figure: software/hardware stack from Warehouse Scale Computer and Smart Phone down through Computer, Cores, Memory, Input/Output, Instruction Unit(s), Functional Unit(s), Cache Memory, and Logic Gates]


Review of Last Lecture (1/2)
• A hazard is a situation that prevents the next instruction from executing in the next clock cycle
• Structural hazard
  – A required resource is needed by multiple instructions in different stages
• Data hazard
  – Data dependency between instructions
  – A later instruction requires the result of an earlier instruction
• Control hazard
  – The flow of execution depends on the previous instruction (e.g. jumps, branches)

Review of Last Lecture (2/2)
• Hazards hurt the performance of pipelined processors
• Stalling can be applied to any hazard, but it hurts performance… better solutions?
  – Structural hazards: have separate structures for each stage
  – Data hazards: forwarding
  – Control hazards: branch prediction (mitigates), branch delay slot


Agenda
• Multiple Issue
• Administrivia
• Virtual Memory Introduction

Multiple Issue
• Modern processors can issue and execute multiple instructions per clock cycle
• CPI < 1 (superscalar), so we can use Instructions Per Cycle (IPC) instead
  – e.g. a 4 GHz 4-way multiple-issue processor can execute 16 billion instructions per second with peak CPI = 0.25 and peak IPC = 4
  – But dependencies and structural hazards reduce this in practice
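As a quick worked check of the numbers above (arithmetic added here, not on the original slide):

\[
\text{peak IPS} = \text{clock rate} \times \text{issue width} = (4\times10^{9}\ \tfrac{\text{cycles}}{\text{s}}) \times (4\ \tfrac{\text{instructions}}{\text{cycle}}) = 16\times10^{9}\ \tfrac{\text{instructions}}{\text{s}}, \qquad
\text{peak CPI} = \frac{1}{\text{peak IPC}} = \frac{1}{4} = 0.25
\]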



Multiple Issue
• Static multiple issue
  – Compiler reorders independent/commutative instructions to be issued together
  – Compiler detects and avoids hazards
• Dynamic multiple issue
  – CPU examines pipeline and chooses instructions to reorder/issue
  – CPU can resolve hazards at runtime

Superscalar Laundry: Parallel per stage
[Figure: laundry analogy, 6 PM to 2 AM in 30-minute stages; task order A (light clothing), B (dark clothing), C (very dirty clothing), D (light clothing), E (dark clothing), F (very dirty clothing). More resources, HW to match mix of parallel tasks?]


Pipeline Depth and Issue Width

Processor           Year  Clock     Pipeline Stages  Issue Width  Cores  Power
i486                1989  25 MHz    5                1            1      5 W
Pentium             1993  66 MHz    5                2            1      10 W
Pentium Pro         1997  200 MHz   10               3            1      29 W
P4 Willamette       2001  2000 MHz  22               3            1      75 W
P4 Prescott         2004  3600 MHz  31               3            1      103 W
Core 2 Conroe       2006  2930 MHz  12-14            4*           2      75 W
Core 2 Penryn       2008  2930 MHz  12-14            4*           4      95 W
Core i7 Westmere    2010  3460 MHz  14               4*           6      130 W
Xeon Sandy Bridge   2012  3100 MHz  14-19            4*           8      150 W
Xeon Ivy Bridge     2014  2800 MHz  14-19            4*           15     155 W

[Figure: "Processors over Time" — log-scale plot (1 to 10000) of power, pipeline stages, issue width, and core count, 1989–2010]


Static Multiple Issue
• Compiler reorders independent/commutative instructions to be issued together (an "issue packet")
  – Group of instructions that can be issued on a single cycle
  – Determined by structural resources required
  – Specifies multiple concurrent operations

Scheduling Static Multiple Issue
• Compiler must remove some/all hazards
  – Reorder instructions into issue packets
  – No dependencies within a packet
  – Possibly some dependencies between packets (varies between ISAs; compiler must know!)
  – Pad with nops if necessary
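A rough C illustration (added here, not from the slides) of what a static-issue compiler looks for: independent statements that can share an issue packet versus a dependent result that must wait.

```c
/* Illustrative only: a static-issue compiler groups independent work
 * into the same issue packet and defers dependent work.              */
int packet_example(int x, int y, int p, int q) {
    int a = x + y;   /* independent of b: could share an issue packet */
    int b = p - q;   /* independent of a                              */
    int c = a * b;   /* depends on a and b: must go in a later packet,*/
                     /* padded with nops if nothing else can fill it  */
    return c;
}
```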



Dynamic Multiple Issue
• Used in "superscalar" processors
• CPU decides whether to issue 0, 1, 2, … instructions each cycle
  – Goal is to avoid structural and data hazards
• Avoids need for compiler scheduling
  – Though it may still help
  – Code semantics ensured by the CPU

Dynamic Pipeline Scheduling
• Allow the CPU to execute instructions out of order to avoid stalls
  – But commit result to registers in order
• Example:
    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    subu $s4, $s4, $t3
    slti $t5, $s4, 20
  – Can start subu while addu is waiting for lw
• Especially useful on cache misses; can execute many instructions while waiting!
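The same dependence structure rendered in C (a sketch added here; the variable names are hypothetical and simply mirror the registers in the MIPS example above):

```c
/* C sketch of the MIPS example above; names mirror the registers and
 * are purely illustrative.                                             */
int example(const int *s2, int t2, int t3, int s4) {
    int t0 = s2[5];      /* lw $t0, 20($s2): load, may miss in the cache  */
    int t1 = t0 + t2;    /* addu: needs t0, must wait for the load        */
    s4 = s4 - t3;        /* subu: independent, can start while lw pends   */
    int t5 = (s4 < 20);  /* slti: depends only on the subu result         */
    return t1 + t5;      /* keep both results live                        */
}
```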


Why Do Dynamic Scheduling?
• Why not just let the compiler schedule code?
• Not all stalls are predictable
  – e.g. cache misses
• Can't always schedule around branches
  – Branch outcome is dynamically determined (e.g. by I/O)
• Different implementations of an ISA have different latencies and hazards
  – Forward compatibility and optimizations

Speculation
• "Guess" what to do with an instruction
  – Start operation as soon as possible
  – Check whether guess was right and roll back if necessary
• Examples:
  – Speculate on branch outcome (branch prediction): roll back if path taken is different
  – Speculate on load: load into an internal register before instruction to minimize time waiting for memory
• Can be done in hardware or by compiler
• Common to static and dynamic multiple issue


Not a Simple Linear Pipeline
3 major units operating in parallel:
• Instruction fetch and issue unit
  – Issues instructions in program order
• Many parallel functional (execution) units
  – Each unit has an input buffer called a Reservation Station
  – Holds operands and records the operation
  – Can execute instructions out-of-order (OOO)
• Commit unit
  – Saves results from functional units in Reorder Buffers
  – Stores results once branch resolved so OK to execute
  – Commits results in program order

Out-of-Order Execution (1/2)
Can also unroll loops in hardware:
1) Fetch instructions in program order (≤ 4/clock)
2) Predict branches as taken/untaken
3) To avoid hazards on registers, rename registers using a set of internal registers (≈ 80 registers)
4) Collection of renamed instructions might execute in a window (≈ 60 instructions)



Out-of-Order Execution (2/2)
5) Execute instructions with ready operands in 1 of multiple functional units (ALUs, FPUs, Ld/St)
6) Buffer results of executed instructions in reorder buffer until predicted branches are resolved
7) If predicted branch correctly, commit results in program order
8) If predicted branch incorrectly, discard all dependent results and start with correct PC

Dynamically Scheduled CPU
[Figure: block diagram of a dynamically scheduled CPU — branch prediction and register renaming preserve dependencies; instructions wait in reservation stations until all operands are available; execution results are also sent to any waiting reservation stations (think: forwarding!); the reorder buffer holds register and memory writes and can supply operands for issued instructions]

Out-of-Order Intel
• All use O-O-O since 2001

Microprocessor     Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-Order/Speculation  Cores  Power
i486               1989  25 MHz      5                1            No                        1      5 W
Pentium            1993  66 MHz      5                2            No                        1      10 W
Pentium Pro        1997  200 MHz     10               3            Yes                       1      29 W
P4 Willamette      2001  2000 MHz    22               3            Yes                       1      75 W
P4 Prescott        2004  3600 MHz    31               3            Yes                       1      103 W
Core 2 Conroe      2006  2930 MHz    12-14            4*           Yes                       2      75 W
Core 2 Penryn      2008  2930 MHz    12-14            4*           Yes                       4      95 W
Core i7 Westmere   2010  3460 MHz    14               4*           Yes                       6      130 W
Xeon Sandy Bridge  2012  3100 MHz    14-19            4*           Yes                       8      150 W
Xeon Ivy Bridge    2014  2800 MHz    14-19            4*           Yes                       15     155 W

Intel Nehalem
[Figure: Intel Nehalem processor]


Intel Nehalem Pipeline Flow
[Figure: Nehalem pipeline flow diagram]

Does Multiple Issue Work?
• Yes, but not as much as we'd like
• Programs have real dependencies that limit ILP
• Some dependencies are hard to eliminate
  – e.g. pointer aliasing (the C restrict keyword helps)
• Some parallelism is hard to expose
  – Limited window size during instruction issue
• Memory delays and limited bandwidth
  – Hard to keep pipelines full
• Speculation can help if done well
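A small C sketch (added here, not from the slides) of the pointer-aliasing point: without restrict the compiler must assume the two pointers may overlap, which limits how much independent work it can expose; restrict promises they do not.

```c
/* Without restrict, the compiler must assume dst and src might alias,
 * so each store could change a later load: little ILP is exposed.     */
void scale(float *dst, const float *src, float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* With restrict, the programmer promises dst and src do not overlap,
 * so loads and stores from different iterations are independent and
 * can be reordered, unrolled, or issued in parallel.                   */
void scale_restrict(float *restrict dst, const float *restrict src,
                    float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```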



Agenda
• Multiple Issue
• Administrivia
• Virtual Memory Introduction

Administrivia
• HW5 due tonight
• Project 2 (Performance Optimization) due Sunday
• No lab today
  – TAs will be in lab to check off make-up labs
  – Highly encouraged to make up labs today if you're behind – treated as Tuesday checkoff for lateness
• Project 3 (Pipelined in Logisim) released Friday/Saturday


Agenda
• Multiple Issue
• Administrivia
• Virtual Memory Introduction

Memory Hierarchy
[Figure: memory hierarchy from upper level (faster) to lower level (larger) — registers (instr. operands), L1 cache and L2 cache (blocks; covered earlier: caches), memory (pages; next up: virtual memory), disk (files), tape]

Memory Hierarchy Requirements
• Principle of Locality
  – Allows caches to offer (close to) the speed of cache memory with the size of DRAM memory
  – Can we use this at the next level to give the speed of DRAM memory with the size of disk memory?
• What other things do we need from our memory system?
• Allow multiple processes to simultaneously occupy memory and provide protection
  – Don't let programs read from or write to each other's memories
• Give each program the illusion that it has its own private address space
  – Suppose a program has base address 0x00400000; then different processes each think their code resides at the same address
  – Each program must have a different view of memory



Virtual Memory
• Next level in the memory hierarchy
• Provides illusion of very large main memory
  – Working set of "pages" residing in main memory (subset of all pages residing on disk)
• Main goal: avoid reaching all the way back to disk as much as possible
• Additional goals:
  – Let OS share memory among many programs and protect them from each other
  – Each process thinks it has all the memory to itself

Virtual to Physical Address Translation
[Figure: a program operates in its virtual address space; a Virtual Address (VA) from an inst. fetch, load, or store goes through a HW mapping to a Physical Address (PA) in physical memory (including caches)]
• Each program operates in its own virtual address space and thinks it's the only program running
• Each is protected from the other
• OS can decide where each goes in memory
• Hardware gives virtual → physical mapping


VM Analogy (1/2)
• Trying to find a book in the UCB library system
• Book title is like a virtual address (VA)
  – What you want/are requesting
• Book call number is like a physical address (PA)
  – Where it is actually located
• Card catalogue is like a page table (PT)
  – Maps from book title to call number
  – Does not contain the actual data you want
  – The catalogue itself takes up space in the library

VM Analogy (2/2)
• Indication of current location within the library system is like the valid bit
  – Valid if in current library (main memory) vs. invalid if in another branch (disk)
  – Found on the card in the card catalogue
• Availability/terms of use are like access rights
  – What you are allowed to do with the book (ability to check out, duration, etc.)
  – Also found on the card in the card catalogue


Mapping VM to PM
• Divide memory into equal-sized chunks ("pages", usually 4-8 KiB)
• Any chunk of virtual memory can be assigned to any chunk of physical memory
[Figure: a program's virtual memory (code, static, heap, stack, starting at 0) mapped into 64 MB of physical memory, also starting at 0]

Paging Organization
• Here assume page size is 4 KiB
• Page is both the unit of mapping and the unit of transfer between disk and physical memory
[Figure: address translation (MAP) between physical and virtual pages —
  Physical memory: 0x0000 page 0, 0x1000 page 1, …, 0x7000 page 7 (each 4 KiB);
  Virtual memory: 0x00000 page 0, 0x01000 page 1, 0x02000 page 2, …, 0x1F000 page 31 (each 4 KiB)]
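A quick check of the sizes implied by the small example in the Paging Organization figure (arithmetic added here, using the 4 KiB page size):

\[
8\ \text{physical pages} \times 4\,\text{KiB} = 32\,\text{KiB} \quad (\text{last page starts at } 7 \times 4096 = \texttt{0x7000}),
\]
\[
32\ \text{virtual pages} \times 4\,\text{KiB} = 128\,\text{KiB} \quad (\text{last page starts at } 31 \times 4096 = \texttt{0x1F000}).
\]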


Virtual Memory Mapping Function
• How large is main memory? Disk?
  – Don't know! Designed to be interchangeable components
  – Need a system that works regardless of sizes
• Use a lookup table (page table) to deal with arbitrary mapping
  – Index lookup table by # of pages in VM (not all entries will be used/valid)
  – Size of PM will affect size of stored translation

Address Mapping
• Pages are aligned in memory
  – Border address of each page has the same lowest bits
• Page size (P bytes) is the same in VM and PM, so denote the lowest PO = log2(P) bits as the page offset
• Use remaining upper address bits in mapping
  – Tells you which page you want (similar to Tag)
[Figure: Virtual Address = Virtual Page # | Page Offset; Physical Address = Physical Page # | Page Offset. The page offsets are the same size; the page numbers are not necessarily the same size.]
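A minimal C sketch of the split above, assuming 4 KiB pages (so PO = log2(4096) = 12 offset bits) and 32-bit addresses; the names are illustrative, not from the slides.

```c
#include <stdint.h>

#define PAGE_SIZE        4096u              /* P = 4 KiB                  */
#define PAGE_OFFSET_BITS 12u                /* log2(4096)                 */
#define OFFSET_MASK      (PAGE_SIZE - 1u)   /* low 12 bits                */

/* Split a 32-bit virtual address into virtual page number and offset.   */
static inline uint32_t vpn(uint32_t va)    { return va >> PAGE_OFFSET_BITS; }
static inline uint32_t offset(uint32_t va) { return va & OFFSET_MASK; }

/* Rebuild a physical address from a physical page number and the
 * (unchanged) page offset.                                               */
static inline uint32_t make_pa(uint32_t ppn, uint32_t off) {
    return (ppn << PAGE_OFFSET_BITS) | off;
}
```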

Address Mapping: Page Table
• Page Table functionality:
  – Incoming request is a Virtual Address (VA); want a Physical Address (PA)
  – Physical Offset = Virtual Offset (page-aligned), so just swap the Virtual Page Number (VPN) for the Physical Page Number (PPN)
• Implementation?
  – Use VPN as index into the PT
  – Store PPN and management bits (Valid, Access Rights)
  – Does NOT store actual data (the data sits in PM)

Page Table Layout
[Figure: Virtual Address = VPN | offset; each page table entry holds V (valid), AR (access rights), and PPN fields (an entry marked X is invalid).
1) Index into PT using VPN
2) Check Valid and Access Rights bits
3) Combine PPN and offset
4) Use the resulting Physical Address to access memory]
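A minimal C sketch of steps 1–4 (added here, under the same 12-bit-offset assumption as before); the pte_t layout and the page_fault stub are hypothetical, and error handling is reduced to a single trap.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u
#define OFFSET_MASK ((1u << PAGE_OFFSET_BITS) - 1u)

typedef struct {
    unsigned valid : 1;     /* V: page is in physical memory            */
    unsigned read  : 1;     /* AR: access rights bits                   */
    unsigned write : 1;
    unsigned exec  : 1;
    unsigned ppn   : 20;    /* physical page number                     */
} pte_t;

extern void page_fault(uint32_t vpn);       /* OS handler (stub)        */

/* Translate a virtual address using a single-level page table.         */
uint32_t translate(const pte_t *page_table, uint32_t va, bool is_write) {
    uint32_t vpn = va >> PAGE_OFFSET_BITS;           /* 1) index by VPN  */
    pte_t pte = page_table[vpn];
    if (!pte.valid || (is_write && !pte.write))      /* 2) check V, AR   */
        page_fault(vpn);                             /*    trap to OS    */
    return ((uint32_t)pte.ppn << PAGE_OFFSET_BITS)   /* 3) combine PPN   */
           | (va & OFFSET_MASK);                     /*    and offset    */
}                                                    /* 4) PA to memory  */
```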


Page Table Entry Format
• Contains either the PPN or an indication that the page is not in main memory
• Valid = valid page table entry
  – 1 → virtual page is in physical memory
  – 0 → OS needs to fetch page from disk
• Access Rights checked on every access to see if allowed (provides protection)
  – Read Only: can read, but not write, the page
  – Read/Write: read or write data on the page
  – Executable: can fetch instructions from the page

Page Tables
• A page table (PT) contains the mapping of virtual addresses to physical addresses
• Where should the PT be located?
  – Physical memory, so faster to access and can be shared by multiple processors
• The OS maintains the PTs
  – Each process has its own page table
  – "State" of a process is PC, all registers, and PT
  – OS stores the address of the PT of the current process in the Page Table Base Register



Paging/Virtual Memory for Multiple Processes
[Figure: User A and User B each have their own virtual memory (code, static, heap, stack, starting at 0), mapped via Page Table A and Page Table B into a shared 64 MB physical memory]

Caches vs. Virtual Memory
Caches                                   Virtual Memory
Block                                    Page
Cache miss                               Page fault
Block size: 32-64 B                      Page size: 4 KiB-8 KiB
Placement: Direct Mapped,                Fully Associative (almost always)
  N-way Set Associative
Replacement: LRU or Random               LRU
Write Thru or Back                       Write Back


Technology Break

Notes on Page Table
• OS must reserve "Swap Space" on disk for each process
• To grow a process, ask the Operating System
  – If there are unused pages, the OS uses them first
  – If not, the OS swaps some old pages to disk (Least Recently Used to pick pages to swap)
• Each process has its own Page Table
• Will add details, but the Page Table is the essence of Virtual Memory


Why would a process need to "grow"?
• A program's address space contains 4 regions:
  – stack: local variables, grows downward (from ~0xFFFFFFFF)
  – heap: space requested for pointers via malloc(); resizes dynamically, grows upward
  – static data: variables declared outside main, does not grow or shrink
  – code: loaded when program starts, does not change (near ~0x0)
• For now, the OS somehow prevents accesses between stack and heap (gray hash lines in the figure)

Virtual Memory and Caches
• Physical memory is slow, so we cache data
• Why not do the same with the page table?
• The Translation Lookaside Buffer (TLB) is the equivalent of a cache for the page table
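A tiny C example (illustrative, added here) of a heap "grow" request: if the heap has no free pages left, malloc asks the OS for more, and the OS maps new pages into the process's address space.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Request 1 MiB: if the heap has no free pages left, the C library
     * asks the OS to grow it, and the OS maps new physical pages into
     * this process's virtual address space.                            */
    int *buf = malloc(1 << 20);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    buf[0] = 42;             /* first touch may trigger a page fault    */
    printf("%d\n", buf[0]);
    free(buf);
    return 0;
}
```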


Virtual to Physical Address Translation
1. Check TLB
2. Check page tables (in main memory)
3. Page fault – OS handles
4. OS: Check disk
5. OS: Swap on disk / Segmentation fault

Translation Look-Aside Buffers (TLBs)
• TLBs usually small, typically 128 - 256 entries
• Like any other cache, the TLB can be direct mapped, set associative, or fully associative
[Figure: processor sends VA to the TLB lookup; on a hit, translation yields the PA, which goes to the cache; on a TLB miss, the translation hardware consults the page table; on a cache miss, main memory supplies the data]
• On TLB miss, get the page table entry from main memory
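A rough C sketch of steps 1–2 (TLB hit vs. miss), added here under the same hypothetical structures as the earlier translate() example; a real TLB is hardware, so this is only a software model of its behavior.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES      128u               /* "usually 128 - 256 entries" */
#define PAGE_OFFSET_BITS 12u

typedef struct { bool valid; uint32_t vpn, ppn; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];        /* direct mapped for simplicity */

extern uint32_t walk_page_table(uint32_t vpn);   /* step 2 (may page-fault) */

uint32_t tlb_translate(uint32_t va) {
    uint32_t vpn = va >> PAGE_OFFSET_BITS;
    uint32_t off = va & ((1u << PAGE_OFFSET_BITS) - 1u);
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];      /* 1) check TLB          */
    if (!e->valid || e->vpn != vpn) {              /* miss: get PTE from    */
        e->ppn = walk_page_table(vpn);             /*    main memory and    */
        e->vpn = vpn;                              /*    refill the entry   */
        e->valid = true;
    }
    return (e->ppn << PAGE_OFFSET_BITS) | off;     /* hit (or refilled): PA */
}
```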


Context Switching and VM
• Context switching now requires that both the TLB and caches be flushed
• In reality, TLB entries have a context tag

Peer Instruction
1) Locality is important yet different for cache and virtual memory (VM): temporal locality for caches but spatial locality for VM
2) VM helps both with security and cost

blue) FF   green) FT   purple) TF   yellow) TT


Peer Instruction Answer
1) Locality is important yet different for cache and virtual memory (VM): temporal locality for caches but spatial locality for VM — FALSE
   – No. Both matter for VM and cache
2) VM helps both with security and cost — TRUE
   – Yes. Protection and a bit smaller memory

blue) FF   green) FT   purple) TF   yellow) TT

Summary
• More aggressive performance options:
  – Longer pipelines
  – Superscalar (multiple issue)
  – Out-of-order execution
  – Speculation
• Virtual memory bridges memory and disk
  – Provides illusion of independent address spaces to processes and protects them from each other
  – VA to PA using the Page Table
  – TLB

