18-447 Lecture 21: Parallelism – ILP to Multicores Parallel

CMU 18‐447 S’10 L21‐1 © 2010 J. C. Hoe J. F. Martínez 18‐447 Lecture 21: Parallelism –ILP to Multicores James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Lab 4 due this week Optional reading assignments below. Handouts: The Microarchitecture of Superscalar Processors, Smith and Sohi, Proceedings of IEEE, 12/1995. (on Blackboard) The MIPS R10000 Superscalar Microprocessor, Yeager, IEEE Micro, 4/1996. (on Blackboard) Design Challenges of Technology Scaling, Shekhar Borkar, IEEE Micro, 1999. (on Blackboard) CMU 18‐447 S’10 L21‐2 © 2010 J. C. Hoe Parallel Processing 101 J. F. Martínez Assume you have N units of work and each unit of work takes 1 unit‐time on 1 processing element (PE) ‐ with 1 PE, it will take N unit‐time to complete the N units of work ‐ with p PEs, how long does it take to complete the same N units of work? “Ideally”, speedup is “linear” with p runtime speedup runtime speedup= sequential runtimeparalle l N/p 1 p= 1 2 3 4 5 . p= 1 2 3 4 5 . CMU 18‐447 S’10 L21‐3 © 2010 J. C. Hoe It may be linear, but . J. F. Martínez S 4 3 2 1 32 64 p How could this be? CMU 18‐447 S’10 L21‐4 © 2010 J. C. Hoe Parallelization Overhead J. F. Martínez The cheapest algorithm may not be the most scalable, s.t., runtimeparalle l@p=1 = Kruntimesequential and K>1 and speedup = p/K K is known facetiously as the “parallel slowdown” Communications between PEs are not free ‐ a PE may spend extra instructions/time in the act of sending or receiving data ‐ a PE may spend extra time waiting for data to arrive from another PEa function of latency and bandwidth ‐ a PE may spend extra time waiting for another PE to get to a particular point of the computation (a.k.a. synchronization) CMU 18‐447 S’10 L21‐5 © 2010 J. C. Hoe It could be worse . J. F. Martínez S limited scalability 4 3 2 1 1 24 8 p May never get high speedup regardless of the number of PEs CMU 18‐447 S’10 L21‐6 © 2010 J. C. Hoe Parallelism Defined J. F. Martínez T1 (call it “Work”): ‐ time to complete work with 1 PE x = a + b; y = b * 2 T (ll(call it “Cr itica l Pth”)Path”): z =(x‐y) * (x+y) ‐ time to complete work given infinite PEs ‐ T lower‐bounded by dataflow a b dependencies + *2 Average Parallelism: x y Pavg = T1 / T For a system wihith p PEs - + Tp max{ T1/p, T } Sp min{ p, T1/T } * When Pavg>>p Tp T1/p and Sp p An app has to have parallelism to get speedup from parallel PEs!! CMU 18‐447 S’10 L21‐7 © 2010 J. C. Hoe Amdahl’s Law on Speedup J. F. Martínez A program is rarely completely parallelizable. Let’s say a fraction f is parallelizable by a factor of p and the rest are not timesequential (1 ‐ f) f timeparallel (1 ‐ f) f/p timeparallel = timesequential∙( (1‐f) + f/p ) Sparallel = 1 / ( (1‐f) + f/p ) If f is small p doesn’t matter. An architect also cannot ignore the sequential performance of the (1‐f) portion CMU 18‐447 S’10 L21‐8 © 2010 J. C. Hoe J. F. Martínez ILP: the parallelism you already know CMU 18‐447 S’10 L21‐9 © 2010 J. C. Hoe Going after IPC J. F. Martínez Scalar Pipeline with Forwarding Operation Latency = 1 Peak IPC = 1 Instruction‐Level Parallelism = 1 IF ID EX MEM WB ream IF ID EX MEM WB t s IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB instruction cyc 0 1 2 3 4 5 6 7 8 9 10 CMU 18‐447 S’10 L21‐10 © 2010 J. C. Hoe Superscalar Machines J. F. Martínez Superscalar (Pipelined) Execution OL = 1 baseline cycles Peak IPC = N per baseline cycle ILP = N IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB stream IF ID EX MEM WB IF ID EX MEM WB ction IF ID EX MEM WB u IF ID EX MEM WB IF ID EX MEM WB instr IF ID EX MEM WB cyc 0 1 2 3 4 5 6 7 8 9 10 Achieving full performance requires finding N “independent” instructions per cycle CMU 18‐447 S’10 L21‐11 © 2010 J. C. Hoe ILP: Instruction‐Level Parallelism J. F. Martínez ILP is a measure of the amount of inter‐dependencies between instructions Average ILP =T1 / T = no. instruction / no. cyc required code1: ILP = 1 i.e. must execute serially code2: ILP = 3 iei.e. can execute at the same time code1: r1 r2 + 1 code2: r1 r2 + 1 r3 r1 / 17 r3 r9 / 17 r4 r0 - r3 r4 r0 - r10 CMU 18‐447 S’10 L21‐12 © 2010 J. C. Hoe Removing False Dependencies J. F. Martínez Anti and output dependencies are false dependencies r3 r1 op r2 r5 r3 op r4 r3 r6 op r7 The dependence is on the register name rather than data Given infinite number of registers, anti and output dependencies can always be eliminated CMU 18‐447 S’10 L21‐13 © 2010 J. C. Hoe Register Renaming: Example J. F. Martínez Original r1 r2 / r3 r4 r1 * r5 r1 r3 + r6 r3 r1 ‐ r5 Renamed r1 r2 / r3 ILP=1 r4 r1 * r5 r8 r3 + r6 r9 r8 ‐ r5 ILP=2 CMU 18‐447 S’10 L21‐14 © 2010 J. C. Hoe Hardware Register Renaming J. F. Martínez Rename rename Register ISA name Rename t56 e.g. r12 Table File (t0 ... t63) maintain bindings from ISA reg. names to rename registers When issuing an instruction that updates ‘rd’: ‐ allocate an unused rename register tx ‐ recording binding from ‘rd’ to tx When to remove a binding? When to de‐allocate a rename register? r1 r2 / r3 r4 r1 * r5 r1 r3 + r6 CMU 18‐447 S’10 L21‐15 © 2010 J. C. Hoe Out‐of‐Order Execution J. F. Martínez Renaming eliminates WAW and WAR In a RAW dependent instruction pair, the reader must wait for the result from the writer How to get more ILP? r1 r2 + 1 r3 r1 / 17 ILP=1 r4 r0 ‐ r3 r11 r12 + 1 ILP=2 r13 r19 / 17 r14 r0 ‐ r20 CMU 18‐447 S’10 L21‐16 © 2010 J. C. Hoe Dataflow Execution Ordering J. F. Martínez Maintain a window of many pending instructions (a.k.a. Issue Buffer) Dispatch instructions out‐of‐order ‐ find instructions whose operands are available ‐ give preference to older instructions ‐ A completing instruction may enable other pending instructions (RAW) Need to remember how to put things back in order (Reor der Bff)Buffer) CMU 18‐447 S’10 L21‐17 © 2010 J. C. Hoe Instruction Reorder Buffer J. F. Martínez At today’s clock frequency, on a memory load ‐ a cache hit (best case) takes 4~7 cyc ‐ a L1 cache miss takes a few 10s of cycles ‐ an off‐chip cache miss takes a few 100s of cycles ROB is a program‐order instruction bookkeeping structure ‐ instructions must enter and leave in program order ‐ holds 10s to 100s of “in‐flight” instructions in various stages of execution ‐ re‐sorts all instructions on exit to appear to complete in program order ‐ supports precise exception for any in‐flight instruction CMU 18‐447 S’10 L21‐18 © 2010 J. C. Hoe Speculative Execution J. F. Martínez Multiple levels of branch predictions are needed to fetch 100’s instructions beyond the commit point Instructions after a predicted branch are speculative must undo their effects in case of misprediction Maintain separate copies of ‐ In‐order State: a check‐point state up to just before the first speculated instruction ‐ Speculative State: include all state changes after check‐ pp,oint, ppyossibly multiple predicted branches Commit ‐ admit known‐to‐be good speculative state changes into the in‐order state Rewind ‐ discard all, or part of, the speculative state CMU 18‐447 S’10 L21‐19 © 2010 J. C. Hoe MIPS R10000 J. F. Martínez pre‐decoded I‐cache 4xinst decode map table (16R4W) map table 16‐entry 16‐entry 8x4 entries int. Q FP. Q Active List (R.S.) (R.S.) (ROB) 64‐entry 64‐entry Int GPR FPR 7R3W 5R3W ALU1 ALU2 LD/ST ALU1 ALU2 Read [Yeager 1996, IEEE Micro] if you are really interested CMU 18‐447 S’10 L21‐20 © 2010 J. C. Hoe State of the Art J. F. Martínez AMD Intel Intel IBM IBM Fijitsu SUN Opteron Xeon Itanium P5 P6 SPARC 7 T2 8360SE X7460 9050 cores/threads 4x1 6x1 2x2 2x2 2x2 4x2 8x8 Clock (GHz) 2.5 2.67 1.60 2.2 5 2.52 1.8 Issue Rate 3 (x86) 4 (rop) 6 5 7 4 2 Pipeline depth 12/17 14 8 15 13 15 8/12 Out‐of‐order 72(rop) 96(rop) inorder 200 limited 64 inorder on‐chip$ (MB) 2+2 9+16 1+12 1.92 8 6 4 Trans (106) 463 1900 1720 276 790 600 503 Power (W) 105 130 104 100 >100 135 95 SPECint 2006 14.4/170 22/274 14.5/1534 10.5/197 15.8/1837 10.5/2088 ‐‐/142 per‐core/total SPECfp 2006 18.5/156 22/142 17.3/1671 12.9/229 20.1/1822 25.0/1861 ‐‐/111 per‐core/total Microprocessor Report, Oct 2008 CMU 18‐447 S’10 L21‐21 © 2010 J.

18-447 Lecture 21: Parallelism – ILP to Multicores Parallel

Slicing (Draft)

Assessing Gains from Parallel Computation on a Supercomputer

Instruction Level Parallelism Example

Minimizing Startup Costs for Performance-Critical Threading

Performance of Physics-Driven Procedural Animation of Character Locomotion for Bipedal and Quadrupedal Gait

Easy Dataflow Programming in Clusters with UPC++ Depspawn

Econstor Wirtschaft Leibniz Information Centre Make Your Publications Visible

Extracting Parallelism from Legacy Sequential Code Using Transactional Memory

Optimization Techniques for Efficient HTA Programs

Introduction to MPI

Performance Loss Between Concept and Keyboard

FDTD) Algorithms on a Selection of High Performance Multiprocessor Computing Systems