CMU 18-447 S'10, Lecture 21: Parallelism - ILP to Multicores
© 2010 J. C. Hoe, J. F. Martínez

James C. Hoe, Dept of ECE, CMU, April 7, 2010

Announcements: Lab 4 is due this week. Optional reading assignments are listed below.

Handouts (optional reading, on Blackboard):
- The Microarchitecture of Superscalar Processors, Smith and Sohi, Proceedings of the IEEE, 12/1995
- The MIPS R10000 Superscalar Microprocessor, Yeager, IEEE Micro, 4/1996
- Design Challenges of Technology Scaling, Shekhar Borkar, IEEE Micro, 1999

L21-2: Parallel Processing 101

- Assume you have N units of work and each unit takes 1 unit-time on 1 processing element (PE)
  - with 1 PE, it takes N unit-time to complete the N units of work
  - with p PEs, how long does it take to complete the same N units of work?
- "Ideally", the speedup is linear in p:
  speedup = runtime_sequential / runtime_parallel
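A minimal sketch of these two quantities, assuming perfectly divisible work with no parallelization overhead (the function names are illustrative, not from the lecture):

```c
#include <stdio.h>

/* Ideal parallel runtime for N independent unit-time work items on p PEs:
   every PE gets an equal share, so runtime falls as N/p. */
static double ideal_runtime(double n_work, int p) {
    return n_work / p;
}

/* Speedup as defined on the slide: sequential runtime over parallel runtime. */
static double speedup(double runtime_seq, double runtime_par) {
    return runtime_seq / runtime_par;
}

int main(void) {
    const double N = 1024.0;                  /* units of work */
    for (int p = 1; p <= 8; p *= 2) {
        double t_par = ideal_runtime(N, p);
        printf("p=%d  runtime=%.1f  speedup=%.1f\n", p, t_par, speedup(N, t_par));
    }
    return 0;
}
```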

[Figure: ideal runtime falls as N/p and ideal speedup grows linearly, plotted against p = 1, 2, 3, 4, 5, ...]

L21-3: It may be linear, but . . .

[Figure: measured speedup S vs. number of PEs p; S grows linearly but reaches only about 4 at p = 64]

How could this be?

L21-4: Parallelization Overhead

- The cheapest algorithm may not be the most scalable, such that

  runtime_parallel@p=1 = K * runtime_sequential, with K > 1, so speedup = p / K
  K is known facetiously as the "parallel slowdown"
- Communications between PEs are not free
  - a PE may spend extra instructions/time in the act of sending or receiving data
  - a PE may spend extra time waiting for data to arrive from another PE (a function of latency and bandwidth)
  - a PE may spend extra time waiting for another PE to get to a particular point of the computation (a.k.a. synchronization)

L21-5: It could be worse . . .

[Figure: speedup S vs. p = 1, 2, 4, 8; the speedup saturates at a limited value]

May never get high speedup regardless of the number of PEs

L21-6: Parallelism Defined

- T1 (call it "Work"): time to complete the work with 1 PE
- T∞ (call it "Critical Path"): time to complete the work given infinite PEs
  - T∞ is lower-bounded by dataflow dependencies
- Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
  [Figure: dataflow graph; a and b feed the + and *2 nodes producing x and y, which feed the - and + nodes, whose results feed the final *]
- Average Parallelism: Pavg = T1 / T∞
- For a system with p PEs:
  Tp >= max{ T1/p, T∞ }
  Sp <= min{ p, T1/T∞ }
- When Pavg >> p: Tp ≈ T1/p and Sp ≈ p
An app has to have parallelism to get speedup from parallel PEs!!

L21-7: Amdahl's Law on Speedup

- A program is rarely completely parallelizable. Let's say a fraction f is parallelizable by a factor of p and the rest is not:
  [Figure: execution-time bars; the sequential bar has segments (1 - f) and f, the parallel bar has segments (1 - f) and f/p]
  time_parallel = time_sequential * ( (1 - f) + f/p )
  S_parallel = 1 / ( (1 - f) + f/p )
If f is small, p doesn't matter. An architect also cannot ignore the sequential performance of the (1 - f) portion.
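A small numeric sketch of the two bounds above, under the simple models stated on these slides (the function names and example numbers are my own, not from the lecture):

```c
#include <stdio.h>

/* Lower bound on parallel runtime from the work/critical-path argument:
   p PEs can do at most p units of work per unit time, and no schedule
   can beat the dataflow critical path. */
static double runtime_bound(double t1, double t_inf, int p) {
    double t_work = t1 / p;
    return (t_work > t_inf) ? t_work : t_inf;
}

/* Amdahl's Law: only a fraction f of the program speeds up by p. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* Work/span example: T1 = 100, T_inf = 10, so Pavg = 10 */
    for (int p = 1; p <= 64; p *= 4)
        printf("p=%2d  Tp >= %5.1f\n", p, runtime_bound(100.0, 10.0, p));

    /* Amdahl example: f = 0.9 caps the speedup below 10x no matter how large p gets */
    for (int p = 1; p <= 64; p *= 4)
        printf("p=%2d  speedup <= %4.2f\n", p, amdahl_speedup(0.9, p));
    return 0;
}
```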

L21-8: ILP: the parallelism you already know

L21-9: Going after IPC

Scalar Pipeline with Forwarding
Operation Latency = 1, Peak IPC = 1, Instruction-Level Parallelism = 1

[Figure: pipeline diagram; an instruction stream flows through IF ID EX MEM WB, one new instruction entering each cycle over cycles 0-10]

L21-10: Superscalar Machines

Superscalar (Pipelined) Execution
Operation Latency = 1 baseline cycle, Peak IPC = N per baseline cycle, ILP = N

[Figure: pipeline diagram; N instructions from the instruction stream enter IF ID EX MEM WB together each cycle over cycles 0-10]

Achieving full performance requires finding N "independent" instructions per cycle

L21-11: ILP: Instruction-Level Parallelism

- ILP is a measure of the degree of inter-dependence between instructions

- Average ILP = T1 / T∞ = no. of instructions / no. of cycles required
  - code1: ILP = 1, i.e., the instructions must execute serially
  - code2: ILP = 3, i.e., the instructions can all execute at the same time

code1: r1 ← r2 + 1       code2: r1 ← r2 + 1
       r3 ← r1 / 17             r3 ← r9 / 17
       r4 ← r0 - r3             r4 ← r0 - r10

L21-12: Removing False Dependencies

- Anti- and output dependencies are false dependencies

r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7

- The dependence is on the register name rather than the data

- Given an infinite number of registers, anti- and output dependencies can always be eliminated

L21-13: Register Renaming: Example

Original (ILP = 1):       Renamed (ILP = 2):
r1 ← r2 / r3              r1 ← r2 / r3
r4 ← r1 * r5              r4 ← r1 * r5
r1 ← r3 + r6              r8 ← r3 + r6
r3 ← r1 - r5              r9 ← r8 - r5

L21-14: Hardware Register Renaming

[Figure: an ISA register name (e.g., r12) indexes the Rename Table, which returns a rename register name (e.g., t56) in the Rename Register File (t0 ... t63)]

- Maintain bindings from ISA register names to rename registers
- When issuing an instruction that updates 'rd' (sketched below):
  - allocate an unused rename register tx
  - record the binding from 'rd' to tx
- When to remove a binding? When to de-allocate a rename register?
  r1 ← r2 / r3
  r4 ← r1 * r5
  r1 ← r3 + r6
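A minimal sketch of the rename-table bookkeeping described above, assuming a simple free list of rename registers (the structure and names are illustrative, not from any real design):

```c
#include <stdio.h>

#define NUM_ISA_REGS    32
#define NUM_RENAME_REGS 64

static int rename_table[NUM_ISA_REGS];   /* ISA reg -> rename reg binding */
static int reg_free[NUM_RENAME_REGS];    /* 1 if the rename register is unused */

/* Rename one instruction "rd <- rs1 op rs2": read the sources' current
   bindings, then allocate a fresh rename register for the destination.
   Note: nothing here frees old bindings; that is the de-allocation
   question raised on the slide. */
static void rename(int rd, int rs1, int rs2) {
    int t1 = rename_table[rs1];
    int t2 = rename_table[rs2];
    for (int t = 0; t < NUM_RENAME_REGS; t++) {
        if (reg_free[t]) {                 /* allocate an unused rename register */
            reg_free[t] = 0;
            rename_table[rd] = t;          /* record the new binding for rd */
            printf("t%d <- t%d op t%d\n", t, t1, t2);
            return;
        }
    }
    /* no free rename register: a real machine would stall issue here */
}

int main(void) {
    for (int r = 0; r < NUM_ISA_REGS; r++) rename_table[r] = r;   /* identity bindings */
    for (int t = 0; t < NUM_RENAME_REGS; t++) reg_free[t] = (t >= NUM_ISA_REGS);

    rename(1, 2, 3);   /* r1 <- r2 / r3 */
    rename(4, 1, 5);   /* r4 <- r1 * r5  (reads the new r1 binding) */
    rename(1, 3, 6);   /* r1 <- r3 + r6  (the WAW on r1 disappears) */
    return 0;
}
```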

L21-15: Out-of-Order Execution

- Renaming eliminates WAW and WAR dependencies
- In a RAW-dependent instruction pair, the reader must wait for the result from the writer
- How to get more ILP?

ILP = 1:              ILP = 2:
r1 ← r2 + 1           r11 ← r12 + 1
r3 ← r1 / 17          r13 ← r19 / 17
r4 ← r0 - r3          r14 ← r0 - r20

L21-16: Dataflow Execution Ordering

- Maintain a window of many pending instructions (a.k.a. the Issue Buffer)
- Dispatch instructions out-of-order (a minimal selection sketch follows below)
  - find instructions whose operands are available
  - give preference to older instructions
  - a completing instruction may enable other pending instructions (RAW)
- Need to remember how to put things back in order (Reorder Buffer)
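A minimal sketch of the oldest-first select step described above, assuming a simple array-based issue window (field and function names are illustrative, not from any real design):

```c
#include <stdbool.h>
#include <stdio.h>

#define WINDOW_SIZE 32

struct window_entry {
    bool valid;          /* slot holds a pending instruction */
    bool src1_ready;     /* operand availability, set when producers complete */
    bool src2_ready;
    unsigned age;        /* lower age = older in program order */
};

/* Pick the oldest instruction whose operands are both available.
   Returns the window index, or -1 if nothing is ready this cycle. */
static int select_ready(struct window_entry w[WINDOW_SIZE]) {
    int pick = -1;
    for (int i = 0; i < WINDOW_SIZE; i++) {
        if (w[i].valid && w[i].src1_ready && w[i].src2_ready) {
            if (pick < 0 || w[i].age < w[pick].age)
                pick = i;                 /* prefer the older ready instruction */
        }
    }
    return pick;
}

int main(void) {
    struct window_entry w[WINDOW_SIZE] = {0};
    w[3] = (struct window_entry){ .valid = true, .src1_ready = true, .src2_ready = true, .age = 7 };
    w[9] = (struct window_entry){ .valid = true, .src1_ready = true, .src2_ready = true, .age = 2 };
    printf("selected slot %d (the older ready instruction)\n", select_ready(w));   /* prints 9 */
    return 0;
}
```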

L21-17: Instruction Reorder Buffer

- At today's clock frequencies, on a memory load
  - a cache hit (best case) takes 4~7 cycles
  - an L1 cache miss takes a few 10s of cycles
  - an off-chip cache miss takes a few 100s of cycles

- The ROB is a program-order instruction bookkeeping structure (sketched below)
  - instructions must enter and leave in program order
  - holds 10s to 100s of "in-flight" instructions in various stages of execution
  - re-sorts instructions on exit so they appear to complete in program order
  - supports precise exceptions for any in-flight instruction
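A minimal sketch of the program-order enter/leave discipline above, modeled as a circular buffer (a simplification; the entry contents and names are illustrative only):

```c
#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 128

struct rob_entry {
    bool valid;          /* entry is occupied by an in-flight instruction */
    bool completed;      /* execution finished, but not yet retired */
};

static struct rob_entry rob[ROB_SIZE];
static int rob_head;     /* oldest in-flight instruction (next to retire) */
static int rob_tail;     /* next free slot (allocation in program order) */

/* Allocate at the tail, in program order; returns -1 if the ROB is full. */
static int rob_allocate(void) {
    if (rob[rob_tail].valid) return -1;
    int idx = rob_tail;
    rob[idx].valid = true;
    rob[idx].completed = false;
    rob_tail = (rob_tail + 1) % ROB_SIZE;
    return idx;
}

/* Retire only from the head, so instructions leave in program order
   even if they completed out of order. */
static bool rob_retire(void) {
    if (!rob[rob_head].valid || !rob[rob_head].completed) return false;
    rob[rob_head].valid = false;
    rob_head = (rob_head + 1) % ROB_SIZE;
    return true;
}

int main(void) {
    int a = rob_allocate();
    int b = rob_allocate();
    rob[b].completed = true;                             /* younger instruction finishes first */
    printf("retire young first? %d\n", rob_retire());    /* 0: the head (a) is not done yet */
    rob[a].completed = true;
    printf("retire oldest?      %d\n", rob_retire());    /* 1: retirement stays in program order */
    return 0;
}
```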

L21-18: Speculative Execution

- Multiple levels of branch prediction are needed to fetch 100s of instructions beyond the commit point
- Instructions after a predicted branch are speculative: their effects must be undone in case of misprediction
- Maintain separate copies of
  - In-order State: a checkpoint of the state up to just before the first speculated instruction
  - Speculative State: includes all state changes after the checkpoint, possibly spanning multiple predicted branches
- Commit: admit known-to-be-good speculative state changes into the in-order state
- Rewind: discard all, or part of, the speculative state

L21-19: MIPS R10000

[Figure: MIPS R10000 block diagram; a pre-decoded I-cache feeds a 4-instruction decode stage; integer and FP map tables (16R4W) feed a 16-entry integer queue and a 16-entry FP queue (reservation stations) plus an 8x4-entry Active List (the ROB); a 64-entry integer GPR file (7R3W) and a 64-entry FPR file (5R3W) feed two integer ALUs, a LD/ST unit, and two FP ALUs]

Read [Yeager 1996, IEEE Micro] if you are really interested

L21-20: State of the Art

                    AMD        Intel      Intel      IBM       IBM        Fujitsu    SUN
                    Opteron    Xeon       Itanium    P5        P6         SPARC 7    T2
                    8360SE     X7460      9050
cores/threads       4x1        6x1        2x2        2x2       2x2        4x2        8x8
Clock (GHz)         2.5        2.67       1.60       2.2       5          2.52       1.8
Issue rate          3 (x86)    4 (rop)    6          5         7          4          2
Pipeline depth      12/17      14         8          15        13         15         8/12
Out-of-order        72 (rop)   96 (rop)   in-order   200       limited    64         in-order
On-chip $ (MB)      2+2        9+16       1+12       1.92      8          6          4
Trans (10^6)        463        1900       1720       276       790        600        503
Power (W)           105        130        104        100       >100       135        95
SPECint 2006        14.4/170   22/274     14.5/1534  10.5/197  15.8/1837  10.5/2088  --/142
 (per-core/total)
SPECfp 2006         18.5/156   22/142     17.3/1671  12.9/229  20.1/1822  25.0/1861  --/111
 (per-core/total)

(Microprocessor Report, Oct 2008)

L21-21: Moving beyond ILP

L21-22: Moore's Law

- The number of transistors that can be economically integrated shall double every 24 months

[http://www.intel.com/research/silicon/mooreslaw.htm]

L21-23: Transistor Scaling

[Figure: gate length vs. time; http://www.intel.com/museum/online/circuits.htm]
- commercial products are at 40~45 nm today
- the distance between silicon atoms is ~500 pm

L21-24: Basic Scaling Theory

- Planned scaling occurs in discrete "nodes", each ~0.7x of the previous in linear dimension
  e.g., 90nm, 65nm, 45nm, 32nm, 22nm, 15nm, "The End"
- Taking the same design and reducing the linear dimensions by 0.7x (a.k.a. a "gate shrink") ideally leads to (see the numeric sketch below)
  - die area = 0.5x
  - delay = 0.7x, frequency = 1.43x
  - capacitance = 0.7x
  - Vdd = 0.7x (if constant field) or Vdd = 1x (if constant voltage)
  - power = CV^2f = 0.5x (if constant field)
  - BUT power = 1x (if constant voltage)
- Taking the same area, then
  - transistor count = 2x
  - power = 1x (constant field), power = 2x (constant voltage)
[refer to the Shekhar Borkar article for the more complete story]
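A small numeric sketch of the ideal constant-field numbers above; the 0.7x shrink factor is the only input, and everything else follows from area ~ s^2, delay ~ s, and P = C V^2 f:

```c
#include <stdio.h>

int main(void) {
    const double s = 0.7;                 /* linear shrink per node */

    double area  = s * s;                 /* die area scales with the square of the dimension */
    double delay = s;                     /* gate delay scales with the dimension */
    double freq  = 1.0 / s;               /* ~1.43x */
    double cap   = s;                     /* capacitance per gate */

    /* constant-field scaling: Vdd shrinks with the dimension */
    double p_field = cap * (s * s) * freq;        /* C * V^2 * f ~ 0.49x */
    /* constant-voltage scaling: Vdd stays fixed */
    double p_volt  = cap * (1.0 * 1.0) * freq;    /* ~ 1.0x */

    printf("area %.2fx  delay %.2fx  freq %.2fx\n", area, delay, freq);
    printf("power: constant field %.2fx, constant voltage %.2fx\n", p_field, p_volt);
    return 0;
}
```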

L21-25: The Other Moore's Law

This part is for later

This is not all scaling-driven: without the extensive instruction reordering and speculation in modern pipelines, we would not be able to run the CPU at multiple GHz against 50 ns DRAMs

L21-26: Moore's Law → Performance

- According to scaling theory, we should get
  - @ constant complexity: 1x transistors at 1.43x frequency → 1.43x performance at 0.5x power
  - @ max complexity: 2x transistors at 1.43x frequency → 2.8x performance at constant power
- Instead, we have been getting (for high-performance CPUs)
  - ~2x transistors
  - ~2x frequency (note: faster than scaling alone would give us)
  - all together, about ~2x performance at ~2x power

L21-27: Performance (In)efficiency

- To hit the "expected" performance targets on single microprocessors
  - we had been pushing frequency harder by deepening pipelines
  - we used the 2x transistors to build more complicated microarchitectures so the fast/deep pipelines don't stall (i.e., caches, BP, superscalar, out-of-order)
- The consequence of performance inefficiency:

[Figure: power trend from Shekhar Borkar, IEEE Micro, July 1999; the Intel P4 "Tejas" at 150 W sits against the limit of economical PC cooling [ITRS]. Guess what happened next.]

L21-28: The Other Moore's Law
[Figure only]

L21-29: Turning the Corner

- Much less energy and power to do the same work slowly
- Suppose at full frequency

  - Runtime = Work / c_perf
  - Energy = (c_switch + c_leakage / c_perf) * Work
  - Power = c_switch * c_perf + c_leakage
    where c_switch is a constant for energy per unit of work done, c_perf is a constant for work per unit time, and c_leakage is the leakage power
- When frequency is reduced by a fraction s_freq, the supply voltage can also be reduced by a fraction s_voltage (numeric sketch below)
  - Runtime' = Work / (c_perf * s_freq)
  - Energy' = (c_switch * s_voltage^2 + c_leakage * s_voltage^~2.5 / (c_perf * s_freq)) * Work
  - Power' = c_switch * s_voltage^2 * c_perf * s_freq + c_leakage * s_voltage^~2.5
Disclaimer: this is good enough for 447. Don't try to pass this off in any of the circuits courses.
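A small numeric sketch of the voltage-frequency scaling model above, under the slide's assumption that s_voltage can track s_freq (the constants are made-up illustrative values, not measurements):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* illustrative constants, not measured values */
    const double work = 1.0, c_switch = 1.0, c_leakage = 0.2, c_perf = 1.0;
    const double s_freq = 0.5, s_voltage = 0.5;   /* half frequency, half voltage */

    double runtime = work / (c_perf * s_freq);
    double energy  = (c_switch * pow(s_voltage, 2.0) +
                      c_leakage * pow(s_voltage, 2.5) / (c_perf * s_freq)) * work;
    double power   = c_switch * pow(s_voltage, 2.0) * c_perf * s_freq +
                     c_leakage * pow(s_voltage, 2.5);

    /* at full speed (s_freq = s_voltage = 1) energy would be c_switch + c_leakage/c_perf */
    printf("runtime' = %.2f  energy' = %.3f  power' = %.3f\n", runtime, energy, power);
    return 0;
}
```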

L21-30: Voltage-Frequency Scaling

[Figure: voltage-frequency operating points of an Intel P4 660 with "Enhanced SpeedStep"]

What if you did the same work on 2 cores at half speed?

L21-31: Parallelization vs. Energy/Power

- Ideal parallelization by N cores
  - Runtime_speedup-by-N = W / (c_perf * N)
  - Energy_speedup-by-N = (c_switch + c_leakage / c_perf) * W
  - Power_speedup-by-N = N * (c_switch * c_perf + c_leakage)
- Alternatively, we can trade the N-fold speedup for power and energy reduction by setting s_freq = 1/N, if s_voltage ≈ s_freq (see the sketch after the disclaimer)
  - Runtime_iso-perf = W / c_perf
  - Energy_iso-perf = (c_switch / N^2 + c_leakage / (c_perf * N^~1.5)) * W
  - Power_iso-perf = c_switch * c_perf / N^2 + c_leakage / N^~1.5

Disclaimer: this is good enough for 447. Don’t try to pass this off in any of the circuits courses
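A small numeric sketch of the iso-performance trade above, assuming s_voltage tracks s_freq = 1/N (the constants are again illustrative only):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    const double W = 1.0, c_switch = 1.0, c_leakage = 0.2, c_perf = 1.0;

    for (int n = 1; n <= 8; n *= 2) {
        /* N cores, each slowed to 1/N of full frequency: same runtime as 1 fast core */
        double runtime = W / c_perf;
        double energy  = (c_switch / (n * n) +
                          c_leakage / (c_perf * pow(n, 1.5))) * W;
        double power   = c_switch * c_perf / (n * n) + c_leakage / pow(n, 1.5);
        printf("N=%d  runtime=%.2f  energy=%.3f  power=%.3f\n", n, runtime, energy, power);
    }
    return 0;
}
```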

L21-32: On to Multicores and Manycores

[Figure: a multicore chip; several cores, each with a private cache ($), share a fat interconnect, a big L2, and a bigger L3]

We are here because we need to compute faster while using less energy per operation . . . and the boss says to stay on Moore's Law

L21-33: What is to come?

- We (actually Intel et al.) know how to pack more cores on a die to stay on Moore's Law in terms of "aggregate" or "throughput" performance
  - life is good if your N units of work (on slide L21-2) are N independent programs → just run them
  - what if your N units of work are N operations of the same program? → rewrite it as a parallel program
  - what if your N units of work are N sequentially dependent operations of the same program? → ??

- Will we (the users) run out of parallelism in our workloads to make use of the "many-many" cores?