
Multicore: Why is it happening now? or, How Is Moore's Law Doing?

Erik Hagersten, Uppsala Universitet

PDC Summer School 2010 | Institutionen för informationsteknologi | www.it.uu.se | ©Erik Hagersten | user.it.uu.se/~eh

Are we hitting the wall now?

Pop: "Moore's Law": can the transistors be made even smaller and faster?

[Figure: performance (log scale, 1 to 1000) versus year, starting at "Now". Two paths lead onward: "Business as usual…" and "Possible path, but requires a paradigm shift".]

Well, uhm… The transistors can be made smaller and faster, but there are other problems.

Darling, I shrunk the computer

Mainframes

Super Minis:

Microprocessor:

Multicore: Many CPUs on a chip!

Multi-core CPUs:

• PhysX, a multi-core physics processing unit.
• Ambric Am2045, a 336-core Massively Parallel Processor Array (MPPA)
• AMD
  - Athlon 64, Athlon 64 FX and Athlon 64 X2 family, dual-core desktop processors.
  - Opteron, dual- and quad-core server/workstation processors.
  - Phenom, triple- and quad-core desktop processors.
  - Sempron X2, dual-core entry level processors.
  - Turion 64 X2, dual-core laptop processors.
  - Radeon and FireStream multi-core GPU/GPGPU (10 cores, 16 5-issue wide superscalar stream processors per core)
• ARM MPCore is a fully synthesizable multicore container for ARM9 and ARM11 processor cores, intended for high-performance embedded and entertainment applications.
• Azul Systems Vega 2, a 48-core processor.
• Broadcom SiByte SB1250, SB1255 and SB1455.
• Cradle Technologies CT3400 and CT3600, both multi-core DSPs.
• Cavium Networks Octeon, a 16-core MIPS MPU.
• HP PA-8800 and PA-8900, dual core PA-RISC processors.
• IBM
  - POWER4, the world's first dual-core processor, released in 2001.
  - POWER5, a dual-core processor, released in 2004.
  - POWER6, a dual-core processor, released in 2007.
  - PowerPC 970MP, a dual-core processor, used in the Apple Power Mac G5.
  - Xenon, a triple-core, SMT-capable, PowerPC used in the Microsoft Xbox 360 game console.
• IBM, Sony, and Toshiba Cell processor, a nine-core processor with one general purpose PowerPC core and eight specialized SPUs (Synergistic Processing Units) optimized for vector operations, used in the Sony PlayStation 3.
• Infineon Danube, a dual-core, MIPS-based, home gateway processor.
• Intel
  - Celeron Dual Core, the first dual-core processor for the budget/entry-level market.
  - Core Duo, a dual-core processor.
  - Core 2 Duo, a dual-core processor.
  - Core 2 Quad, a quad-core processor.
  - Core i7, a quad-core processor, the successor of the Core 2 Duo and the Core 2 Quad.
  - Itanium 2, a dual-core processor.
  - Pentium D, a dual-core processor.
  - Teraflops Research Chip (Polaris), a 3.16 GHz, 80-core processor prototype, which the company says will be released within the next five years.
  - Xeon dual-, quad- and hexa-core processors.
• IntellaSys seaForth24, a 24-core processor.
• Nvidia
  - GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core)
  - GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core)
  - Tesla multi-core GPGPU (8 cores, 16 scalar stream processors per core)
• Parallax Propeller P8X32, an eight-core microcontroller.
• picoChip PC200 series, 200-300 cores per device for DSP & wireless
• Rapport Kilocore KC256, a 257-core microcontroller with a PowerPC core and 256 8-bit "processing elements".
• Raza Microelectronics XLR, an eight-core MIPS MPU
• Sun Microsystems
  - UltraSPARC IV and UltraSPARC IV+, dual-core processors.
  - UltraSPARC T1, an eight-core, 32-thread processor.
  - UltraSPARC T2, an eight-core, 64-concurrent-thread processor.
• Texas Instruments TMS320C80 MVP, a five-core multimedia video processor.
• Tilera TILE64, a 64-core processor
• XMOS Software Defined Silicon quad-core XS1-G4

[source: Wikipedia]

High-performance single-core CPUs:

Many shapes and forms...

[Figure: block diagrams of three multicore chips, labelled AMD, Intel Core2 and IBM Cell.]

Outline

• Why multicore now?
• Technology reasons
• Performance bottlenecks in MCs
• Commercial offerings
• Reflection for the future

Everybody is doing it! But why now?

[Figure: Perf (log scale) versus time, with a single-core curve and a multi-core curve; 2007 is marked on the time axis.]

1. Not enough ILP to get payoff from using more transistors
2. Signal propagation delay » transistor delay
3. Power consumption: $P_{dyn} \sim C \cdot f \cdot V^2$

Today: Whatever it takes to run one program fast.

Exploring ILP (instruction-level parallelism):
• Faster clocks → Deep pipelines
• Superscalar Pipelines
• Branch Prediction
• Prefetching
• Out-of-Order Execution
• Trace Cache
• Speculation
• Predicate Execution
• Advanced Load Address Table
• Return Address Stack
• …

Bad News #1: We have already explored most ILP (instruction-level parallelism)

#2: Future Technology

Bad News #2: Looong wire delay → slow CPUs

[Figure: cycle time and latency for a 5 mm trace (RC model) versus year, 2000 to 2015. Labelled values: cycle time 830 ps (1.2 GHz) and 70 ps (14 GHz); trace latency 170 ps and 390 ps.]

Quantitative data and trends according to V. Agarwal et al., ISCA 2000. Based on SIA (Semiconductor Industry Association) prediction, 1999.

#2: How much of the chip area can you reach in one cycle?

Bad News #2: Looong wire delay → slow CPUs

[Figure: span, i.e. the fraction of the chip reachable in one cycle (1.0 = 100%), versus year (2000 to 2015) and feature size (.25, .18, .13, .10, .070, .050, .035 micron), based on SIA data. The last labelled value is 0.4 - 1.4 %.]

Bad News #2: wire delay → Multiple small CPUs with private L1$

[Figure: four CPUs, each with its own L1 cache, an L2 cache, and an external L3 with its controller (External L3 Ctrl).]

Bad News #3: Power consumption → Lower frequency → lower voltage

$P_{dyn} = C \cdot f \cdot V^2 \approx \text{area} \cdot \text{freq} \cdot \text{voltage}^2$

One CPU at freq = f vs. two CPUs at freq = f/2 each (lower frequency allows a lower voltage $V' < V$):

$P_{dyn}(C, f, V) = C f V^2$ vs. $P_{dyn}(2C, f/2, V') = C f V'^2 < C f V^2$

One CPU at freq = f vs. the same capacitance C at freq = f/2:

$P_{dyn}(C, f, V) = C f V^2$ vs. $P_{dyn}(C, f/2, V') = \tfrac{1}{2} C f V'^2 < \tfrac{1}{2} C f V^2$

Example: Freq. Scaling

[Figure: three panels of bars with relative values for freq, speed and pow.]

• 20% higher freq.: freq 1.20, speed 1.13, pow 1.51
• 20% lower freq.: freq 0.80, speed 0.87, pow 0.51
• 20% lower freq., two cores: each core at freq 0.80, speed 0.87, pow 0.51
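The relation between frequency, voltage and power in this example can be sketched numerically. The following is only an illustration under one loud assumption that the slides do not state: supply voltage scales linearly with frequency, so per-core power goes as the cube of the frequency scale. The function names are hypothetical.

```python
def dynamic_power(capacitance: float, freq: float, voltage: float) -> float:
    """Dynamic power, Pdyn = C * f * V^2 (arbitrary units)."""
    return capacitance * freq * voltage ** 2


def relative_power(freq_scale: float, cores: int = 1) -> float:
    """Power relative to one core running at nominal frequency.

    Assumption (not from the slides): voltage scales linearly with
    frequency, so each core's power scales as freq_scale ** 3.
    """
    per_core = dynamic_power(1.0, freq_scale, freq_scale)
    return cores * per_core


if __name__ == "__main__":
    for label, scale, cores in [
        ("one core, nominal freq.", 1.0, 1),
        ("one core, 20% lower freq.", 0.8, 1),
        ("two cores, 20% lower freq.", 0.8, 2),
    ]:
        print(f"{label}: relative power {relative_power(scale, cores):.2f}")
```

Under this assumption a core at 0.8 of the nominal frequency draws about 0.51 of the nominal power, in line with the 0.51 quoted above, and two such cores together draw roughly the power of one full-speed core. The cubic model does not reproduce the 1.51 quoted for 20% higher frequency, so it should be read as a rough illustration only.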

Darling, I shrunk the computer

Sequential execution (≈ one program):

Mainframes

Super Minis:

Microprocessor:

Paradigm Shift

Parallel Apps (TLP):

Chip Multiprocessor (CMP): A multiprocessor on a chip!

How to create threads?

[Figure: two ways to get Thread1, Thread2, Thread3. In one, Prog1 and Prog2 are scheduled onto the threads by the OS. In the other, a sequential program (Seq. Prog., "Not ideal for multi cores") is turned into threads via a parallel program, a parallelizing compiler or modeling; these steps are marked "Rocket Science" and "HARD!!".]

Throughput Computing
• "Multitasking"
• Virtualization
• Concurrency
• …

Capability Computing
• Weather simulation
• Computer games
• …

Shared Bottlenecks

[Figure: eight CPUs, each with a private L1, sharing one L2 cache and the memory bandwidth (the "Shared Resources").]

• Cache sharing effects
• Cache utilization
• Loop order
• Data allocation
• Data usage…
• Data reuse
• Tiling (aka Blocking)
• Fusion…
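To make "Tiling (aka Blocking)" from the list above concrete, here is a minimal sketch of the loop restructuring on a matrix multiply. It is an illustration, not code from the lecture: the function names and the tile size are hypothetical, and pure Python will not show the cache benefit; only the loop structure carries over to a compiled language.

```python
def matmul_naive(a, b, n):
    """Plain i-j-k triple loop over n x n matrices stored as lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=4):
    """Same computation, visiting the matrices tile by tile (blocking).

    Each tile x tile block is reused while it is still likely to be in
    the cache, instead of streaming whole rows and columns through it.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]  # reused across the inner j loop
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

In practice the tile size would be chosen so that the working set of one block fits in the cache share available to a thread, which shrinks when several threads share the cache.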

Example: Poor Throughput Scaling!

Example: 470.LBM, "Lattice Boltzmann Method", to simulate incompressible fluids in 3D

[Figure: throughput (y-axis, 0 to 3,5) versus number of cores used (1 to 4); a value of 1.0 is marked.]

Throughput (as defined by SPEC): amount of work performed per time unit when several instances of the application are executed simultaneously.

Our TP study: compare the TP improvement when you go from 1 core to 4 cores.
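The throughput definition above can be written down as a small calculation. This is a sketch with hypothetical function names; the timings in the usage note are invented for illustration and are not measurements from the lecture.

```python
def throughput(instances: int, elapsed_seconds: float) -> float:
    """Work per time unit: completed instances divided by wall-clock time."""
    return instances / elapsed_seconds


def throughput_improvement(t_one_core: float, t_n_cores: float, n: int) -> float:
    """TP improvement going from 1 instance on 1 core to n instances on n cores.

    t_one_core: run time of a single instance running alone.
    t_n_cores:  run time when n instances run simultaneously, one per core.
    """
    return throughput(n, t_n_cores) / throughput(1, t_one_core)
```

For example, if an instance takes 100 s alone but 200 s when four instances run together, `throughput_improvement(100, 200, 4)` gives an improvement of only 2 on four cores; perfect scaling would give 4.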

Throughput Scaling, More Apps

SPEC CPU 2006 FP throughput improvements on 4 cores

[Figure: bar chart, y-axis 0,00 to 4,50 with 2.0 marked, one bar per benchmark: 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf, 482.sphinx3.]

Intel X5365 3GHz "Core2", 1333 MHz FSB, 8MB L2. (Based on data from the SPEC web)

Nerd Curve: 470.LBM

[Figure: miss rate versus cache size. Curves: miss rate (excluding HW prefetch effects); utilization, i.e., fraction of cache data used (scale to the right); possible miss rate if the utilization problem was fixed. Marked miss rates: 3,5% when running one thread, 5,0% when running four threads.]

→ Less amount of work per memory byte moved @ four threads

Nerd Curve (again)

[Figure: miss rate versus cache size. Curves: miss rate (excluding HW prefetch effects); utilization, i.e., fraction of cache data used (scale to the right); possible miss rate if the utilization problem was fixed. Marked miss rates when running four threads: 5,0% for the orig application, 2,5% for the optimized application.]

→ Twice the amount of work per memory byte moved
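The step from "miss rate halved" to "twice the amount of work per memory byte moved" can be checked with a line of arithmetic. The sketch below assumes that every miss moves exactly one cache line and ignores write-backs; the 64-byte line size and the function names are assumptions, not values from the slides.

```python
LINE_BYTES = 64  # assumed cache-line size; not stated on the slides


def bytes_moved_per_access(miss_rate: float, line_bytes: int = LINE_BYTES) -> float:
    """Average bytes fetched from memory per memory access."""
    return miss_rate * line_bytes


def work_per_byte_gain(miss_rate_before: float, miss_rate_after: float) -> float:
    """How much more work is done per byte moved after the optimization."""
    return bytes_moved_per_access(miss_rate_before) / bytes_moved_per_access(miss_rate_after)
```

With the miss rates from the curve, `work_per_byte_gain(0.05, 0.025)` is 2, independent of the assumed line size.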

→ Better Memory Usage!

Example: 470.LBM, modified to promote better cache utilization

[Figure: throughput (y-axis, 0 to 3,5) versus # cores used (1 to 4), with one curve labelled "Original code".]

BW in the Future?

#Cores ~ #Transistors

[Figure: "Computation vs Bandwidth": (#T * T_freq) / (#P * P_freq), y-axis 0 to 6, for the years 2007 to 2015. Inset: several CPUs sharing one Mem.]

Source: International Technology Roadmap for Semiconductors (ITRS)

See also: Karlsson et al., Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution. IPDPS, March 2007.

HPCWire.com this morning: Up Against the Memory Wall: "Never mind the cores. Just hand over the cache"

HPCWire December 07: More Than 16 Cores May Well Be Pointless [by Sandia Labs]

Throughput Computing:

• Does not exploit the shared caches well (cache capacity/thread: ~100-500 kB)
• Requires N times more memory
• If limited by memory capacity or memory bandwidth → go parallel

[Figure: Prog1 and Prog2 scheduled by the OS onto Thread1, Thread2, Thread3.]
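The difference between throughput computing and "going parallel" can be sketched as two ways of using the same workers. This is an illustration only: `simulate` is a hypothetical stand-in for an application, and because of CPython's global interpreter lock the threads here show the structure, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(cells):
    """Stand-in for one program's work: sum of squares over its data."""
    return sum(x * x for x in cells)


def throughput_style(datasets, workers=4):
    """Run independent instances, one per dataset (N data sets live in memory)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, datasets))


def parallel_style(cells, workers=4):
    """Split ONE dataset across the workers and combine the partial results."""
    chunk = max(1, (len(cells) + workers - 1) // workers)
    parts = [cells[i:i + chunk] for i in range(0, len(cells), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(simulate, parts))
```

In the throughput style every instance brings its own data set, which is where the "N times more memory" comes from; in the parallel style the workers share a single data set, so memory capacity and the shared cache are used once rather than N times.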

Thread Interaction

[Figure: eight CPUs, each with a private L1, sharing one L2 cache.]

• Coherence traffic
• Communication utilization
• Load imbalance
• Synchronization
• ...

Multicore Challenges: Non-uniformity Communication

[Figure: eight CPUs, each with a private L1; pairs of CPUs share an L2, and the four L2s share one L3.]

L2 sharing (here: pair-wise)

Multicore Challenges (Multisocket): Non-uniformity Communication

[Figure: two sockets, each with eight CPUs with private L1s, four pair-wise shared L2s and one L3.]

L2 sharing (here: pair-wise)

Multicore Challenges (Multisocket): Non-uniformity Memory

[Figure: two sockets, each with eight CPUs with private L1s, four pair-wise shared L2s and one L3.]

Example: Thread Interaction (True Communication)

Cache sharing → more misses?

Communication?

[Figure: eight CPUs with private L1s and pair-wise shared L2s. Copies of A sit in several L2s while different CPUs execute A := 1, A := 2 and A := 3.]
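The communication in this example can be counted with a toy write-invalidate model. This is a simplified sketch, not a real coherence protocol: it tracks only which CPU wrote each cache line last, and the assignment of the three writes to CPUs 0, 1 and 2 in the usage note is an assumption.

```python
def count_line_transfers(accesses, line_of):
    """Count coherence transfers under a simple write-invalidate model.

    accesses: sequence of (cpu, variable) writes, in global order.
    line_of:  maps each variable to the cache line that holds it.
    A write causes a transfer whenever another CPU wrote that line last.
    """
    owner = {}      # cache line -> CPU that holds it in modified state
    transfers = 0
    for cpu, var in accesses:
        line = line_of[var]
        if line in owner and owner[line] != cpu:
            transfers += 1  # line must be fetched from the previous writer
        owner[line] = cpu
    return transfers
```

For the writes A := 1, A := 2, A := 3 issued by three different CPUs, `count_line_transfers([(0, "A"), (1, "A"), (2, "A")], {"A": 0})` counts two transfers: every write after the first has to pull the line from another cache.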

Example: Thread Interaction (False sharing)

Communication?

[Figure: eight CPUs with private L1s and pair-wise shared L2s. Copies of the cache line holding both A and B (AB) sit in several L2s while different CPUs execute A := 1, …, B := 1, …, A := 2.]

Ex: Code change → Performance difference on a Quad: 5x
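One common kind of code change against false sharing is to pad the data so that A and B no longer share a cache line. The sketch below only inspects the memory layout with `ctypes`; it is not the code change behind the 5x figure, and the 64-byte line size, the struct names and the assumption that the struct starts on a line boundary are all illustrative.

```python
import ctypes

LINE_BYTES = 64  # assumed cache-line size; not stated on the slides


class Unpadded(ctypes.Structure):
    """A and B sit next to each other, like the 'AB' line on the slide."""
    _fields_ = [("a", ctypes.c_int64), ("b", ctypes.c_int64)]


class Padded(ctypes.Structure):
    """Padding after A pushes B onto the next cache line."""
    _fields_ = [
        ("a", ctypes.c_int64),
        ("_pad", ctypes.c_char * (LINE_BYTES - ctypes.sizeof(ctypes.c_int64))),
        ("b", ctypes.c_int64),
    ]


def share_a_line(struct_type, line_bytes=LINE_BYTES):
    """True if fields a and b fall in the same line of a line-aligned struct."""
    line_a = struct_type.a.offset // line_bytes
    line_b = struct_type.b.offset // line_bytes
    return line_a == line_b
```

With the unpadded layout, a write to A by one CPU invalidates the line that another CPU needs for B even though the threads never touch the same variable; with the padded layout the two writes hit different lines and no longer interfere.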

Commercial snapshot

(I may have misquoted some details; get architecture details from vendors)

Erik Hagersten, Uppsala Universitet

Intel Core2 Quad

[Figure: block diagram.]

• South Bridge: I/O
• North Bridge: DRAM
• Front-side Bus (FSB)
• One module with two dies, each die with one L2$ of 6MB
• 4 x CPU (two per die), each with D$ 64kB and I$ 64kB

AMD Shanghai

[Figure: block diagram.]

• Hyper Transport; DDR-2 DRAM
• L3 8MB
• X-bar
• 4 x L2$ 512kB
• 4 x CPU, each with D$ 64kB and I$ 64kB

AMD MC System Architecture

[Figure: system architecture diagram with two I/O connections.]

Intel: Dunnington

[Figure: block diagram.]

• South Bridge: I/O
• North Bridge: 2 x DDR2 DRAM
• Front-side Bus (FSB)
• L3 8+ MB
• X-bar
• 3 x L2$ 2MB
• 6 x CPU, each with D$ 64kB and I$ 64kB

Intel Dunnington

Intel: Nehalem, Core i7, Q1 2009 (4 cores)

[Figure: block diagram.]

• QuickPath Interconnect; 3 x DDR-3 DRAM
• L3 8MB
• X-bar
• 4 x L2$ 256kB
• 4 x CPU with 2 threads each; each CPU has D$ 64kB and I$ 64kB

Up to 4 cores x 2 threads

Nehalem "Core i7"

AMD Istanbul, 6 cores

AMD Magny-Cours

[Figure: a multi-chip module (MCM) built from two ~Istanbul dies, 12 cores (C) in total, each with its own caches; 4 x DDR3 and 4 x HT links.]

Intel: "Nehalem-Ex" (i7)

[Figure: block diagram.]

• QuickPath Interconnect; 4 x DDR-3
• L3 24MB
• X-bar
• One L2$ of 256kB per CPU
• CPUs with 2 threads each; each CPU has D$ 64kB and I$ 64kB

8 cores x 2 threads

How is the silicon used (i7-Ex)?

[Figure: die photo; labels: 3 Mbyte, Tag, Data array.]

Source: JSSC Jan 2010, Rusu et al.

How is the silicon used?

[Figure: die photo of cores and L3 cache; labels: Cache L1 + L2, Ex, Transl. $, Branch $, Re-order, I-decode, L3 Cache, Tag, Data array.]

Source: JSSC Jan 2010, Rusu et al.

Handling silicon defects?

Source: JSSC Jan 2010, Rusu et al.

Some other multicores

(I may have misquoted some details; get architecture details from vendors)

Erik Hagersten, Uppsala Universitet

Sun Niagara, 2005!!

[Figure: block diagram.]

• 4 x DDR-2 = 25GB/s (!)
• 4 x memory ctrl
• Shared L2, in four banks
• Xbar = 134 GB/s
• 8 x CPU, each with L1I and L1D (wt) and four interleaved threads

Now: Victoria Falls: 16 cores with 16 threads each

Niagara Chip

[Figure: die photo, Sun Microsystems.]

TILERA Architecture

• 64 cores connected in a mesh
• Local L1 + L2 caches
• Shared distributed L3 cache
• Linux + ANSI C
• New Libraries
• New IDE
• Stream computing
• ...

IBM Cell Processor

[Figure: IBM Cell block diagram: eight cores (C), each with a local memory (m), one further core (C) with caches, and an interface (I/F) to memory (Mem).]

So-called accelerators

• Sits on the IO bus (!!)
• GP Graphics processors, aka GPGPU [e.g. NVIDIA, AMD/ATI]
• Specialized accelerators [e.g., FPGA/Mitrionics, ASIC/ClearSpeed]
• Specialized languages for the above [CUDA, Ct, Rapid Mind, Open-CL, …]

So-called accelerators

My view: Not very general purpose yet!
• Fits well for a few VERY IMPORTANT app domains!
• Limited applicability?
• Programmer productivity?
• Application life time?
• New generation devices will be more useful…

Fermi from nVIDIA: a huge step in the right direction

• 512 "processors" (P)
• 16 P/StreamProcessor (SP)
• Full DP-FP IEEE support
• 64kB L1 cache /SP
• 768kB global shared cache
• Atomic instructions
• ECC correction
• Debugging support
• ...

David Black-Schaffer to give you the full story

Scaling the x86: Manycore GP computing

Larrabee (now called MIC) from Intel, 2011??

• "more than 50 cores"
• SIMD instructions (16-way??)

Wrapping up about multicores

Erik Hagersten, Uppsala Universitet

Looks and Smells Like an SMP?

[Figure: side by side, an SMP system (Memory, Interconnect, CPUs 1…32, each with L1 and L2) and a multicore system (Memory, L2, cores 1…8, each with an L1 and several threads (T)).]

Well, how about:
• Cost of parallelism?
• Cache capacity per thread?
• Memory bandwidth per thread?
• Cost of thread communication?
• …
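Two of these questions, cache capacity per thread and memory bandwidth per thread, can be put into numbers with the figures quoted in the commercial snapshot. The sketch below assumes that the resource is split evenly over all hardware threads, which real workloads will not do exactly; the function and variable names are hypothetical.

```python
def per_thread(total: float, cores: int, threads_per_core: int = 1) -> float:
    """Equal share of a chip-wide resource for each hardware thread."""
    return total / (cores * threads_per_core)


# Figures as quoted on the slides (the author notes they may be misquoted).
# Nehalem, Core i7: 8 MB L3, 4 cores x 2 threads -> MB of L3 per thread.
nehalem_l3_mb_per_thread = per_thread(8.0, cores=4, threads_per_core=2)

# Sun Niagara: 25 GB/s, 8 CPUs x 4 interleaved threads -> GB/s per thread.
niagara_gbs_per_thread = per_thread(25.0, cores=8, threads_per_core=4)

if __name__ == "__main__":
    print(f"Nehalem L3 per thread: {nehalem_l3_mb_per_thread:.2f} MB")
    print(f"Niagara bandwidth per thread: {niagara_gbs_per_thread:.2f} GB/s")
```

The same two-line calculation applied to an SMP with one thread per CPU and a private cache per CPU gives much larger per-thread shares, which is one way to see why a multicore only looks like an SMP from a distance.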

What matters for multicore performance?

• Are we buying…
  - CPU frequency?
  - Number of cores?
  - MIPS and FLOPS?
  - Memory bandwidth?
  - Cache capacity?
  - Memory capacity?
  - Performance/Watt?

MC Questions for the Future

• How to get parallelism?
• How to get performance/data locality?
• How to debug?
• A case for new funky languages?
• A case for automatic parallelization?
• Are we buying:
  - compute power,
  - memory capacity, or
  - memory bandwidth?
• Will 128 cores be mainstream in 5 years?
• Will the CPU market diverge into desktop/capacity/capability/special-purpose CPUs again?
• A non-question: will it happen?