
Revisiting the Future ☺

Erik Hagersten, Uppsala University, Sweden

DARK2 in a nutshell

1. Memory Systems (caches, VM, DRAM, microbenchmarks, …)
2. Multiprocessors (TLP, coherence, interconnects, scalability, clusters, …)
3. CPUs (pipelines, ILP, scheduling, superscalars, VLIWs, embedded, …)
4. Future: (physical limitations, TLP+ILP in the CPU, …)

DARK 2009

Dept of Information Technology | www.it.uu.se | ©Erik Hagersten | user.it.uu.se/~eh

How do we get good performance?

Creating and exploring:
1) Locality
   a) Spatial locality
   b) Temporal locality
   c) Geographical locality
2) Parallelism
   a) Instruction level (ILP)
   b) Thread level (TLP)
   c) Memory level (MLP)
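The locality notions above can be made concrete with a toy traversal example (my own sketch, not from the lecture; the function names are invented). In a row-major layout, walking along rows touches consecutive addresses (spatial locality), while column order strides across the whole array:

```python
# Toy illustration of spatial locality (hypothetical example, not from the slides).
# Row-major traversal touches consecutive elements; column-major traversal
# strides by a whole row, so each access tends to land on a new cache line.
N = 1000
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def row_major_sum(m):
    total = 0
    for row in m:              # consecutive elements: good spatial locality
        for x in row:
            total += x
    return total

def col_major_sum(m):
    total = 0
    for j in range(N):         # stride-N accesses: poor spatial locality
        for i in range(N):
            total += m[i][j]
    return total

assert row_major_sum(matrix) == col_major_sum(matrix)
```

In C or Fortran with flat arrays the two orders can differ severalfold in runtime; Python's boxed lists only approximate the access pattern, so treat this as an illustration of the idea, not a benchmark.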

Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

Doubling (or Halving) times

Dynamic RAM memory (bits per dollar): 1.5 years
Average price: 1.6 years
Cost per transistor cycle: 1.1 years
Total bits shipped: 1.1 years
Performance in MIPS: 1.8 years
Transistors (in Intel microprocessors): 2.0 years
Microprocessor clock speed: 2.7 years

Old Trend 1: Deeper pipelines, exploring ILP (instruction-level parallelism)

[Pipeline diagram: instruction fetch (I), register read (R), PC and register file; data dependences limit how deep the pipeline can usefully get. Memory: 1 GB, ~150 cycles away.]

Old Trend 2: Wider pipelines, exploring more ILP

More pipelines + deeper pipelines: the issue logic needs to find more independent instructions. [Diagram: multi-issue pipeline fed by Thread 1; memory: 1 GB, ~150 cycles away.]

Old Trend 3: Deeper cache hierarchies, exploring access locality

[Diagram: four pipelines (I R B M M W) fed by Thread 1's issue logic, backed by a cache hierarchy: 2 kB at 2 cycles, 64 kB at 10 cycles, 2 MB at 30 cycles, and 1 GB of memory at ~150 cycles.]

Are we hitting the wall now? Pop quiz: Can the transistors be made even smaller and faster?

[Performance over time, log scale: "business as usual" rides Moore's Law from 1 to 100; a possible path to 1000 exists, but requires a paradigm shift.]

Pop answer: Well, uhm… the transistors can be made smaller and faster, but there are other problems.

Microprocessors today: whatever it takes to run one program fast.

Exploring ILP (instruction-level parallelism):
Faster clocks
Deep pipelines
Superscalar pipelines
Branch prediction
Out-of-order execution
Trace caches
Speculation
Predicated execution
Advanced Load Address Table
Return Address Stack
…

Bad News #1: We have explored most of the ILP (instruction-level parallelism).

Bad News #2: Looong wire delay → slow CPUs (a future-technology problem)

[Chart: latency for a 5 mm trace (RC model) vs. cycle time, 2000-2015. Cycle time falls from 830 ps (1.2 GHz) through 390 ps and 170 ps down to 70 ps (14 GHz), so a cross-chip wire costs more and more cycles.]

Quantitative data and trends according to V. Agarwal et al., ISCA 2000, based on the 1999 SIA (Semiconductor Industry Association) prediction.

Bad News #2 (cont.): How much of the chip area can you reach in one cycle?

[Chart: span, the fraction of chip reachable in one cycle, falls from ~100% in 2000 to 0.4-1.4% by 2015 as the feature size shrinks from .25 to .035 micron (SIA).]

Bad News #3: Memory latency/bandwidth is the bottleneck… Slooow memory

[Diagram: CPU with L1, L2 and L3/memory controller computing A = B + C.]

A = B + C:
Read B: 0.3-100 ns
Read C: 0.3-100 ns
Add B & C: 0.3 ns
Write A: 0.3-100 ns

Bad News #4: Power is the limit

Power consumption is the bottleneck:
Cooling servers is hard
Battery lifetime for mobile computers
Energy is money

Dynamic power is proportional to:

P_dyn ~ Capacitance * Frequency * Voltage^2
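A back-of-the-envelope sketch of what this formula implies (my numbers; it idealizes by letting voltage scale with frequency, which real chips only approximate):

```python
# Hypothetical DVFS arithmetic using P_dyn ~ C * f * V^2.
def p_dyn(cap, freq, volt):
    return cap * freq * volt ** 2

base = p_dyn(1.0, 1.0, 1.0)       # one core at full frequency and voltage

# Halving f usually permits a lower V; assume V halves too (idealized):
half = p_dyn(1.0, 0.5, 0.5)
print(half / base)                # 0.125: the "cube law" of voltage/frequency scaling

# Two half-frequency cores: twice the capacitance, same peak op rate:
two_half = p_dyn(2.0, 0.5, 0.5)
print(two_half / base)            # 0.25: similar throughput potential at a quarter of the power
```

This is why the answer to the power wall later in the deck is "lower the frequency, lower the voltage" and add cores.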

Now What?

#1: Running out of ILP

#2: Wire delay is starting to hurt

#3: Memory is the bottleneck

#4: Power is the limit

Solving all the problems: exploring thread parallelism

#1: Running out of ILP → feed one CPU with instructions from many threads

#2: Wire delay is starting to hurt → multiple small CPUs with private L1$

#3: Memory is the bottleneck → memory accesses from many threads (MLP)

#4: Power is the limit → lower the frequency, lower the voltage

Bad News #1: Not enough ILP, 1(2) → feed one CPU with instructions from many threads. Slooow memory.

[One CPU with L1 and issue logic, fed by a single thread's instruction stream:]
LD X, R2
ST R1, R2, R3
LD X, R2
ST R1, R2, R3
…

Bad News #1: Not enough ILP, 2(2) → feed one CPU with instructions from many threads

[The same CPU, now fed by two interleaved instruction streams; the threads' instructions are independent of each other:]
Thread A: LD X, R2 / ST R1, R2, R3 / …
Thread B: LD X, R2 / ST R1, R2, R3 / …

SMT: Simultaneous Multithreading. "Combine TLP & ILP to find independent instructions."

[Diagram: N threads, each with its own PC and registers, share the issue logic, the four pipelines (I R B M M W), the cache hierarchy and memory.]

Thread-interleaved

Historical examples: Denelcor HEP, Tera Computer [B. Smith], 1984
Each thread executes every n:th cycle in a round-robin fashion
-- Poor single-thread performance
-- Expensive (due to early adoption)

Intel's Hyperthreading (2002?)
-- Poor implementation
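The round-robin scheme can be sketched in a few lines (my own toy model, not from the slides; the instruction strings are placeholders):

```python
# Toy model of thread-interleaved execution: with n threads, each thread
# issues an instruction every n:th cycle, round-robin.
def interleave(threads, cycles):
    """threads: list of instruction lists; returns a (cycle, tid, instr) trace."""
    trace = []
    pcs = [0] * len(threads)
    for cycle in range(cycles):
        tid = cycle % len(threads)              # round-robin selection
        if pcs[tid] < len(threads[tid]):
            trace.append((cycle, tid, threads[tid][pcs[tid]]))
            pcs[tid] += 1
    return trace

t0 = ["LD X,R2", "ST R1,R2,R3"]
t1 = ["LD Y,R4", "ST R5,R4,R6"]
trace = interleave([t0, t1], 4)
# With two threads, each thread gets every 2nd cycle, which is exactly why
# single-thread performance is poor: a lone thread still only issues every n:th cycle.
print(trace)
```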

Bad News #2: wire delay → multiple small CPUs with private L1$

[Diagram: four CPUs, each with a private L1 cache, sharing L2/L3 cache levels, some of them external.]

CMP: Chip Multiprocessor (more TLP & geographical locality)

[Diagram: N CPUs, each with its own PC, registers, issue logic and pipelines (I R B M M W) plus private caches, sharing the lower cache levels and memory.]

Bad News #3: memory latency/bandwidth → memory accesses from many threads (MLP). Sloooow memory.

[Diagram: Thread 1 and Thread 2 each compute A = B + C; their memory accesses overlap in time.]

TLP → MLP: Thread-Level Parallelism creates Memory-Level Parallelism.
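A small sketch of the MLP effect (my own illustration, not from the slides; `time.sleep` stands in for a cache miss, scaled up from nanoseconds to something observable):

```python
# Overlapping "memory accesses" from several threads hides latency (MLP).
import time
from concurrent.futures import ThreadPoolExecutor

MISS_LATENCY = 0.05  # seconds; a stand-in for a ~100 ns DRAM access

def load(addr):
    time.sleep(MISS_LATENCY)   # the simulated memory access
    return addr * 2

addrs = range(8)

start = time.perf_counter()
serial = [load(a) for a in addrs]                  # one thread: misses serialize
t_serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:    # eight threads: misses overlap
    parallel = list(pool.map(load, addrs))
t_parallel = time.perf_counter() - start

assert serial == parallel
print(f"serial {t_serial:.2f}s, overlapped {t_parallel:.2f}s")
```

The same work is done either way; only the number of outstanding "misses" differs, which is the whole point of attacking the memory wall with threads.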

Bad News #4: Power consumption → lower the frequency, lower the voltage

P_dyn = C * f * V^2 ≈ area * freq * voltage^2

One CPU at freq = f vs. two CPUs at freq = f/2 (the lower frequency allows a lower voltage V' < V):
P_dyn(C, f, V) = C f V^2   vs.   P_dyn(2C, f/2, V') = 2C (f/2) V'^2 = C f V'^2

One CPU at freq = f vs. one CPU at freq = f/2:
P_dyn(C, f, V) = C f V^2   vs.   P_dyn(C, f/2, V') = C (f/2) V'^2

Throughput < 2x

Example: Frequency Scaling

                          freq   speed   pow
20% higher freq.:         1.20   1.13    1.51
20% lower freq.:          0.80   0.87    0.51
20% lower freq., 2 cores: 0.80   0.87    0.51  (per core)

Multicore CPUs:

Ageia PhysX, a multi-core physics processing unit.
Ambric Am2045, a 336-core Massively Parallel Processor Array (MPPA).
AMD Athlon 64, Athlon 64 FX and Athlon 64 X2 family, dual-core desktop processors; Opteron, dual- and quad-core server/workstation processors; Phenom, triple- and quad-core desktop processors; Sempron X2, dual-core entry-level processors; Turion 64 X2, dual-core laptop processors; Radeon and FireStream multi-core GPU/GPGPU (10 cores, 16 5-issue-wide superscalar stream processors per core).
ARM MPCore, a fully synthesizable multicore container for ARM9 and ARM11 processor cores, intended for high-performance embedded and entertainment applications.
Azul Systems Vega 2, a 48-core processor.
Broadcom SiByte SB1250, SB1255 and SB1455.
Cradle Technologies CT3400 and CT3600, both multi-core DSPs.
Cavium Networks Octeon, a 16-core MIPS MPU.
HP PA-8800 and PA-8900, dual-core PA-RISC processors.
IBM POWER4, the world's first dual-core processor, released in 2001; POWER5, a dual-core processor, released in 2004; POWER6, a dual-core processor, released in 2007; PowerPC 970MP, a dual-core processor, used in the Apple Power Mac G5; Xenon, a triple-core, SMT-capable PowerPC microprocessor used in the Microsoft Xbox 360 game console.
IBM, Sony, and Toshiba Cell processor, a nine-core processor with one general-purpose PowerPC core and eight specialized SPUs (Synergistic Processing Units) optimized for vector operations, used in the Sony PlayStation 3.
Infineon Danube, a dual-core, MIPS-based home gateway processor.
Intel Celeron Dual-Core, the first dual-core processor for the budget/entry-level market; Core Duo, a dual-core processor; Core 2 Duo, a dual-core processor; Core 2 Quad, a quad-core processor; Core i7, a quad-core processor, the successor of the Core 2 Duo and the Core 2 Quad; Itanium 2, a dual-core processor; Pentium D, a dual-core processor; Teraflops Research Chip (Polaris), a 3.16 GHz, 80-core processor prototype, which the company says will be released within the next five years; Xeon dual-, quad- and hexa-core processors.
IntellaSys seaForth24, a 24-core processor.
Nvidia GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core); GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core); Tesla multi-core GPGPU (8 cores, 16 scalar stream processors per core).
Parallax Propeller P8X32, an eight-core microcontroller.
picoChip PC200 series, 200-300 cores per device, for DSP & wireless.
Rapport Kilocore KC256, a 257-core microcontroller with a PowerPC core and 256 8-bit "processing elements".
Raza Microelectronics XLR, an eight-core MIPS MPU.
Sun Microsystems UltraSPARC IV and UltraSPARC IV+, dual-core processors; UltraSPARC T1, an eight-core, 32-thread processor.

[source: Wikipedia]

High-performance single-core CPUs:

Many shapes and forms...

[Diagrams: AMD (cores with private caches and on-chip memory interfaces), Intel x86 (cores with private caches behind a front-side bus and external memory interface), IBM Cell (one general-purpose core plus eight small cores with local memories). Each may run more than one thread...]

Darling, I shrunk the computer

Sequential execution (one program): Mainframes ≈ Super Minis ≈ Microprocessor (Mem + CPU).

Paradigm shift: the Chip Multiprocessor (CMP), a multiprocessor on a chip! Parallel execution (TLP).

How to create threads?

Sequential program + parallelizing compiler: HARD!! (rocket science), and not ideal for multicores.
Parallel program: written as Thread1, Thread2, Thread3, …
Several programs (Prog1, Prog2, …) + a multitasking OS.
Modeling.

Throughput Computing: "multitasking", virtualization, concurrency, …
Capability Computing: weather simulation, computer games, …

Design Examples: CMPs

Erik Hagersten, Uppsala University, Sweden

Intel Core2 Quad, 45 nm

[Intel Core2 Quad: South Bridge (I/O) and North Bridge (DRAM) behind the front-side bus (FSB). The module holds two chips; each chip has a 6 MB L2$ shared by two CPUs, and each CPU has 64 kB D$ and 64 kB I$.]

Intel: Dunnington, 45 nm

[Intel Dunnington: South Bridge (I/O); North Bridge with two DDR2 DRAM channels on the front-side bus (FSB); an 8+ MB L3 behind a crossbar (X-bar); three 2 MB L2$, each shared by two CPUs; each of the six CPUs has 64 kB D$ and 64 kB I$.]

Intel Dunnington, 45 nm [die photo]

AMD Barcelona, 65 nm

[AMD Barcelona: HyperTransport and DDR2 interfaces; a 2 MB L3 behind a crossbar (X-bar); four 512 kB L2$; four CPUs, each with 64 kB D$ and 64 kB I$.]

AMD MC System Architecture

Coherent "multi-socket", i.e., NUMA. [Diagram: several sockets linked together, with I/O attached to two of them.]

AMD Shanghai, 45 nm

[AMD Shanghai: HyperTransport and DDR2 DRAM interfaces; an 8 MB L3 behind a crossbar (X-bar); four 512 kB L2$; four CPUs, each with 64 kB D$ and 64 kB I$.]

AMD Istanbul, 6 cores [die photo]

Intel: Nehalem, Core i7, 45 nm, Q1 2009 (4 cores)

[Nehalem: QuickPath Interconnect; 3x DDR3 DRAM channels; an 8 MB L3 behind a crossbar (X-bar); four 256 kB L2$; four CPUs, each with 64 kB D$, 64 kB I$ and 2 threads. Up to 4 cores x 2 threads.]

Nehalem "Core i7", 45 nm [die photo]

AMD Magny-Cours, 2010 (??)

[Magny-Cours: an MCM (multi-chip module) of two ~Istanbul dice: twelve CPUs (C), each with L1 (¢) and L2 ($); two L3s (€); four DDR3 channels and four HyperTransport (HT) links.]

Intel: Core i7, 2010 (??)

[Eight-core Core i7: QuickPath Interconnect; 4x DDR3; a 24 MB L3 behind a crossbar (X-bar); 256 kB L2$ per core; eight CPUs, each with 64 kB D$, 64 kB I$ and 2 threads. 8 cores x 2 threads.]

Sun Niagara, 2005!!

[Niagara: four DDR2 memory controllers = 25 GB/s (!); a shared, banked L2 behind a 134 GB/s crossbar (Xbar); eight CPUs, each with write-through L1I and L1D and four interleaved threads.]

Now: Victoria Falls: 16 cores with 16 threads each

Niagara 2

[Niagara 2: the same layout: four DDR2 controllers = 25 GB/s; banked L2 behind a 134 GB/s crossbar; eight CPUs (8 threads each), now each with its own FP unit.]

Now: Niagara 3, 16 cores, …

Niagara Chip [die photo: Sun Microsystems]

TILERA Architecture

64 cores connected in a mesh
Local L1 + L2 caches
Shared distributed L3 cache
Linux + ANSI C
New libraries
New IDE
Stream computing
...

IBM Cell Processor

[IBM Cell: one general-purpose core with L1 (¢) and L2 ($) plus eight small cores (C) with local memories (m), sharing a memory interface (Mem I/F).]

So-called accelerators

Sits on the I/O bus (!!)

GP graphics processors, aka GPGPU? [e.g. NVIDIA, AMD/ATI]

Specialized accelerators? [e.g., FPGA/Mitrionics, ASIC/ClearSpeed]

Specialized languages for the above? [CUDA, Ct, RapidMind, OpenCL, …]

So-called accelerators

Sits on the I/O bus (!!)

My view: not very general purpose yet
• Fits well for a few VERY IMPORTANT app domains!
• Limited applicability?
• Programmer productivity?
• Application lifetime?
• Better architectures around the corner: Fermi by NVIDIA, Larrabee by Intel
• SW standards emerging, e.g. OpenCL

Intel Larrabee, 2010 (??)

32 simple in-order cores
4 threads/core
16-wide SIMD (single precision) / 8-wide (double)

Rumors on the net; not based on actual facts.

Multicore CPUs: (the same Wikipedia-sourced list as above)

Design Issues for Multicores

Erik Hagersten, Uppsala University, Sweden

CMP bottlenecks / points of optimization

Performance per memory byte? Performance per bandwidth? Performance per $? …

How large a fraction of a CMP system's cost is the CPU chip? Should execution (MIPS/FLOPS) be viewed as a scarce resource?

DRAM issues

"Rock will have more than 1000 memory chips per Rock chip" [M. Tremblay, Sun Fellow, at ICS 2006] Memory will dominate cost?

Fewer “open pages” accessed due to interleaving of several threads

Far memory: long latency & low per-pin BW

Pushing dense memory technology/packaging?


Bandwidth issues

#pins is a scarce resource! → every pin should run at maximum speed

External memory controllers?
Off-chip cache?
Is there room for multi-CMP? Is this maybe a case for multi-CMP?

Shared Bottlenecks

[Diagram: eight CPUs with private L1 caches around a shared L2 cache and shared off-chip bandwidth: the shared resources.]

Thread Interaction

[Diagram: eight CPUs with private L1 caches around a shared L2 cache.]

• Coherence traffic
• Communication
• Utilization
• Load imbalance
• Synchronization
• ...

Example: Thread Interaction (True Communication)

More cache sharing → more misses? [Diagram: a datum A bounces between L2 caches as CPUs behind different caches write A := 1, A := 2, A := 3. Communication?]

Multicore Challenges: Non-uniform Communication

[Diagram: eight CPUs with private L1s; L2 caches shared pair-wise under a shared L3. Communication cost depends on which caches the threads share: L2 sharing (here: pair-wise).]

Multicore Challenges (Multisocket): Non-uniform Communication

[Diagram: two sockets, each with eight CPUs, pair-wise shared L2s and a per-socket L3; L2 sharing (here: pair-wise), with an extra cost for crossing sockets.]

Multicore Challenges (Multisocket): Non-uniform Memory

[Diagram: the same two sockets, each with its own local memory: memory latency depends on which socket the data lives in.]

Example: Poor Throughput Scaling! Example: 470.lbm

[Chart: 470.lbm throughput stays near 1.0 as the number of cores used goes from 1 to 4.]

Throughput (as defined by SPEC): the amount of work performed per time unit when several instances of the application are executed simultaneously.
Our TP study: compare the TP improvement when you go from 1 core to 4 cores.
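The metric can be written down directly (my formulation of the SPEC-rate idea; the timing numbers below are invented for illustration):

```python
# SPEC-style throughput scaling: run n identical instances simultaneously;
# throughput = n / elapsed time, normalized to the single-instance case.
def throughput_scaling(t1, tn, n):
    """t1: time for 1 instance running alone; tn: elapsed time for n instances."""
    return (n / tn) / (1 / t1)

# 470.lbm-like behavior: 4 instances fight over memory bandwidth, so the
# elapsed time roughly quadruples and scaling stays near 1:
print(throughput_scaling(t1=100, tn=390, n=4))   # ~1.03: poor scaling

# A cache-friendly app: 4 instances barely slow each other down:
print(throughput_scaling(t1=100, tn=110, n=4))   # ~3.6: good scaling
```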

Throughput Scaling, More Apps

SPEC CPU2006 FP throughput improvements on 4 cores

[Bar chart: per-benchmark throughput improvement going from 1 to 4 cores, ranging from about 1.0 to 4.5 with an average around 2.0; benchmarks include 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf and 482.sphinx3.]

Intel X5365 3 GHz, 1333 MHz FSB, 8 MB L2. (Based on data from the SPEC web.)

Nerd Curve: 470.lbm

[Chart: cache miss rate vs. cache size (excluding HW prefetch effects), with utilization, i.e., the fraction of fetched cache data actually used, on the right-hand scale, plus the possible miss rate if the utilization problem were fixed. Running four threads: 5.0% miss rate; running one thread: 3.5%. Four threads do less work per memory byte moved.]

Better Memory Usage! Example: 470.lbm modified to promote better cache utilization

[Chart: the modified 470.lbm's throughput scales from 1 toward 3 as the cores used go from 1 to 4, while the original code stays flat.]

Nerd Curve (again)

[Chart: miss rate vs. cache size, running four threads: the original application misses at 5.0%, the optimized application at 2.5%: twice the amount of work per memory byte moved.]
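One common way to double the work per byte moved, in the spirit of this fix (an illustrative sketch, not the actual lbm modification; the sizes are assumptions), is to store the hot field contiguously instead of inside wide records:

```python
# Array-of-structs vs. struct-of-arrays: cache utilization arithmetic.
LINE = 64          # bytes per cache line (assumed)
REC = 64           # bytes per record, say 8 fields of 8 bytes (assumed)
FIELD = 8          # the inner loop reads only one 8-byte field per record

def bytes_moved_aos(n):
    # Each record fills a line; the whole line is fetched for one useful field.
    return n * LINE            # utilization = FIELD / LINE = 12.5%

def bytes_moved_soa(n):
    # The needed field is stored contiguously: fetched lines are fully used.
    return n * FIELD           # utilization = 100%

n = 1_000_000
print(bytes_moved_aos(n) / bytes_moved_soa(n))   # 8.0x less memory traffic with SoA
```

For a bandwidth-bound code like 470.lbm, cutting the bytes moved translates almost directly into throughput scaling, which is what the modified version's curve shows.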

More transistors → More Threads

Warning:
# transistors grows exponentially → # threads can grow exponentially
Can memory BW keep up?


BW in the Future?

#Cores ~ #Transistors

[Chart: computation vs. bandwidth, #T * T_freq / #P * P_freq, growing from about 1 in 2007 to about 6 by 2015.]

Source: International Technology Roadmap for Semiconductors (ITRS)
See also: Karlsson et al., "Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution", IPDPS, March 2007.
See also the IDC presentation from ISC, June 2007.

Capacity or Capability Computing?

Capacity (≈ several sequential jobs) or capability (≈ one parallel job)?

Issues: memory requirement? Sharing in cache? Memory bandwidth requirement?

Memory: the major cost of a CMP system! How do we utilize it best? Once the working set is in memory, work like crazy!

Capability computing suits CMPs the best (in general).

Fat or narrow cores?

Fat: fewer cores, but… wide issue? Out-of-order (O-O-O)?

Narrow: more cores, but… narrow issue? In-order? Have you ever heard of Amdahl? SMT, run-ahead, execute-ahead… to cure the shortcomings?

Read: "Maximizing CMP Throughput with Mediocre Cores", Davis, Laudon and Olukotun, PACT 2006.
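The Amdahl point above, spelled out (the standard formula, not taken from the slides; the parallel fractions are invented):

```python
# Amdahl's law: with parallel fraction p and n cores,
# speedup = 1 / ((1 - p) + p / n).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Many narrow cores only pay off if p is high:
print(amdahl_speedup(p=0.9, n=32))   # ~7.8: 32 mediocre cores, capped by the serial 10%
print(amdahl_speedup(p=0.5, n=32))   # ~1.9: a single fat core could beat this
```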


Cores vs. caches

Depends on your target applications…

Niagara's answer: go for cores. In-order 5-stage pipeline; 8 cores with 4 SMT threads each = 32 threads; 3 MB shared L2 cache (96 kB/thread); SMT to hide memory latency; memory bandwidth: 25 GB/s. Will this approach scale with technology?

Others' answer: go for cache. 2-4 cores for now.

Cache Interference in a Shared Cache

Cache sharing strategies:
1. Fight it out!
2. Fair share: 50% of the cache each
3. Maximize throughput: who will benefit the most?
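Strategy 3 can be sketched as a tiny search over split points (my own sketch; the miss-rate curves are invented to mimic the shapes of applications like A and B on this slide):

```python
# "Maximize throughput" cache partitioning: give each way of the shared
# cache to whichever split minimizes the combined miss rate.
def best_partition(miss_a, miss_b, total):
    """miss_x[k]: miss rate of app x given k cache units. Returns (k_a, k_b)."""
    k_a = min(range(total + 1), key=lambda k: miss_a[k] + miss_b[total - k])
    return k_a, total - k_a

# App A (ammp/art-like): needs a large share before its miss rate drops.
miss_a = [0.30, 0.30, 0.29, 0.29, 0.28, 0.10, 0.05, 0.04, 0.03]
# App B (vpr_place/vortex-like): benefits from the very first cache units.
miss_b = [0.30, 0.10, 0.05, 0.04, 0.03, 0.03, 0.03, 0.03, 0.03]

print(best_partition(miss_a, miss_b, 8))   # (6, 2): give B a little, A a lot
```

A fair 50/50 split (strategy 2) would land at (4, 4) and leave A stuck at a 0.28 miss rate, which is exactly why the throughput-maximizing split differs from the fair one.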

[Chart: single-threaded cache miss-rate profiles vs. total cache size. Application A (examples: ammp, art, …) only benefits from a large cache share; application B (examples: vpr_place, vortex, …) benefits from the first small share. Strategies 1-3 pick different split points.]

Read: "STATSHARE: A Statistical Model for Managing Cache Share via Decay", Pavlos Petoumenos et al., MoBS workshop at ISCA 2006.
"Predicting the inter-thread cache contention on a CMP", Chandra et al., HPCA 2005.

Hiding Memory Latency

Out-of-order execution (O-O-O)
HW prefetching
SMT
Run-ahead / execute-ahead


Trends (my guess!)

[Six trend sketches over time, from "now": Threads/Chip, Transistors/Thread, Memory/Chip, Cache/Thread, Bandwidth/Thread, Thread Communication Cost (temporal).]

Questions for the Future

What applications?
How to get parallelism and data locality?
Will funky languages see a renaissance?
Will automatic parallelization return?
Are we buying compute power, memory capacity, or memory bandwidth?
Will the CPU market diverge into desktop/capacity CPUs again?
How to debug?
…

A non-question: will it happen?