<<

Multicore: Why is it happening now? eller Hur Mår Moore’s Lag?

Erik Hagersten Uppsala Universitet Darling, I shrunk the computer

Mainframes

Super Minis:

Microprocessor: Mem

Multicore: Many CPUs on a Mem chip! AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 2 © Erik Hagersten| user.it.uu.se/~eh Multi-core CPUs:  PhysX, a multi-core physics processing unit.  Ambric Am2045, a 336-core Massively Parallel Array (MPPA)  AMD  Athlon 64, Athlon 64 FX and Athlon 64 X2 family, dual-core desktop processors.  Opteron, dual- and quad-core server/workstation processors.  Phenom, triple- and quad-core desktop processors.  Sempron X2, dual-core entry level processors.  Turion 64 X2, dual-core laptop processors.  and FireStream multi-core GPU/GPGPU (10 cores, 16 5-issue wide superscalar stream processors per core)  ARM MPCore is a fully synthesizable multicore container for ARM9 and ARM11 processor cores, intended for high-performance embedded and entertainment applications.  Azul Systems Vega 2, a 48-core processor.  Broadcom SiByte SB1250, SB1255 and SB1455.  Cradle Technologies CT3400 and CT3600, both multi-core DSPs.  Cavium Networks Octeon, a 16-core MIPS MPU.  HP PA-8800 and PA-8900, dual core PA-RISC processors.  IBM  POWER4, the world's first dual-core processor, released in 2001.  POWER5, a dual-core processor, released in 2004.  POWER6, a dual-core processor, released in 2007.  PowerPC 970MP, a dual-core processor, used in the Apple Power Mac G5.  Xenon, a triple-core, SMT-capable, PowerPC used in the Microsoft Xbox 360 game console.  IBM, Sony, and Toshiba processor, a nine-core processor with one general purpose PowerPC core and eight specialized SPUs (Synergystic Processing Unit) optimized for vector operations used in the Sony PlayStation 3.  Infineon Danube, a dual-core, MIPS-based, home gateway processor.  . Celeron Dual Core, the first dual-core processor for the budget/entry-level market.  Core Duo, a dual-core processor.  Core 2 Duo, a dual-core processor.  Core 2 Quad, a quad-core processor.  Core i7, a quad-core processor, the successor of the Core 2 Duo and the Core 2 Quad.  2, a dual-core processor.  Pentium D, a dual-core processor.  Teraflops Research Chip (Polaris), an 3.16 GHz, 80-core processor prototype, which the company says will be released within the next five years[6].  Xeon dual-, quad- and hexa-core processors.  IntellaSys seaForth24, a 24-core processor.   GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core)  GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core)  Tesla multi-core GPGPU (8 cores, 16 scalar stream processors per core)  Parallax Propeller P8X32, an eight-core .  picoChip PC200 series 200-300 cores per device for DSP & wireless  Rapport Kilocore KC256, a 257-core microcontroller with a PowerPC core and 256 8-bit "processing elements".  Raza Microelectronics XLR, an eight-core MIPS MPU  Sun Microsystems  UltraSPARC IV and UltraSPARC IV+, dual-core processors.  UltraSPARC T1, an eight-core, 32- processor.  UltraSPARC T2, an eight-core, 64-concurrent-thread processor. AVDARK  Texas Instruments TMS320C80 MVP, a five-core multimedia video processor. [source: Wikipedia]  Tilera TILE64, a 64-core processor 2012  XMOS Software Defined Silicon quad-core XS1-G4 Institutionen för informationsteknologi | www.it.uu.se Multicores 3 © Erik Hagersten| user.it.uu.se/~eh Outline

 Why multicore now?  Performance bottlenecks in MCs  Commercial offerings  Reflection for the future

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 4 © Erik Hagersten| user.it.uu.se/~eh Everybody is doing it! But, why now?

Perf Multi core [log] Single core

time 2007

1. Not enough ILP to get payoff from using more 2. Signal propagation delay » delay 2 3. Power consumption Pdyn ~ • f • V

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 5 © Erik Hagersten| user.it.uu.se/~eh Example: Freq. Scaling

2 2 Pdyn = C * f * V ≈ area * freq * voltage

speed 0.87 1.0 pow 0.51 freq speed pow freq speed freq speed 1.20 1.13 1.51 0.80 0.87 pow 0.80 0.87 pow 0.51 0.51

20% higher freq. 20% lower freq. 20% lower freq. Two cores

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 6 © Erik Hagersten| user.it.uu.se/~eh Darling, I shrunk the computer Sequential execution (

Mainframes ≈

Super Minis: one program)

Microprocessor: Mem Paradigm Shift Parallel Chip Multiprocessor (CMP): Mem Apps (TLP) A multiprocessor on a chip! AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 7 © Erik Hagersten| user.it.uu.se/~eh Shared Bottlenecks (the MCs on these slides are generalized)

CPU CPU CPU CPU

L1 L1 L1 L1 L2

L1 L1 L1 L1 Bandwidth CPU CPU CPU CPU

• Cache sharing effects • Cache utilization • Loop order • Data allocation • Data usage… • Data reuse Shared AVDARK • Tiling (aka Blocking) Resources • Fusion… 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 8 © Erik Hagersten| user.it.uu.se/~eh Example: Poor Throughput Scaling!

Example: 470.LBM "Lattice Boltzmann Method" to simulate incompressible fluids in 3D

3,5 3 2,5 2 1,5 1.0 Throughput 1 0,5 0 1234

Number of Cores Used Throughput (as defined by SPEC): Amount of work performed per time unit when several instances of the application is executed simultaneously. Our TP study: compare TP improvement when you go from 1 core to 4 cores

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 9 © Erik Hagersten| user.it.uu.se/~eh Throughput Scaling, More Apps

SPEC CPU 20006 FP Throughput improvements on 4 cores

4,50

4,00

3,50

3,00

2.02,50 2,00

1,50

1,00

0,50 f o x3 D t wr in T lbm 0,00 d 0, 81, ph d ,ton 4 s s 3 m plex FD 47 lc p o 65 m ADM ,na 4 82, es mi 0,s ems 4 us 447,dealII 5 av romac ,leslie 444 4 453,povray454,calculix amess 33, ze 7 w 4 g actus b g 43 459,G 6, 34, 35, c 0, 4 4 6, 41 41 43

Intel X5365 3GHz “Core2”, 1333 MHz FSB, 8MB L2. (Based on data from the SPEC web) AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 10 © Erik Hagersten| user.it.uu.se/~eh Nerd Curve: 470.LBM

Miss rate (excluding HW prefetch effects) Utilization, i.e., fraction cache data used (scale to the right) cache Possible miss rate if utilization problem was fixed miss rate

5,0% 3,5%

cache size Running Running  Less amount of work four threads one thread per memory moved AVDARK 2012 @ four threads Institutionen för informationsteknologi | www.it.uu.se Multicores 11 © Erik Hagersten| user.it.uu.se/~eh Nerd Curve (again)

Miss rate (excluding HW prefetch effects) Utilization, i.e., fraction cache data used (scale to the right) cache Possible miss rate if utilization problem was fixed miss rate

orig application 5,0%

2,5% optimized application

cache size Running  Twice the amount of work four threads AVDARK per memory byte moved 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 12 © Erik Hagersten| user.it.uu.se/~eh  Better Memory Usage! Example: 470.LBM Modified to promote better cache utilization

3,5

3

2,5

2

1,5

Througput 1

0,5

0 1234 # Cores Used

Original code

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 13 © Erik Hagersten| user.it.uu.se/~eh 13 BW in the Future?

#Cores ~ #Transistors Computation vs Bandwidth

CPU CPU 6 5 #T * T _f r eq / #P * P_f r eq CPU CPU ... 4 3

2 Mem 1 0 2007 2008 2009 2010 2011 2012 2013 2014 2015

Year HPCWire.com thisSource: morning: Internatronal Technology Roadmap for Semiconductors (ITRS) Up Against the Memory Wall See also: Karlsson et al. Conserving Memory Bandwidth in Chip Multiprocessors ”Nevermindwith Runahead the cores.Execution Just. IPDPS hand March 2007.over the cache”

HPCWire December 07: More Than 16 Cores May Well Be Pointless AVDARK [by Sandia Labs] 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 14 © Erik Hagersten| user.it.uu.se/~eh Commercial snapshot eller “Kärnornas krig”

(I may have miss-quoted some details, get architecture details from vendors) Erik Hagersten Uppsala Universitet Intel Core2 Quad, 2006

South Bridge I/O

North Bridge DRAM

Front-side (FSB) Module Another 1 Die 2 Module... L2$ L2$ 6MB 6MB

D$ I$ D$ I$ D$ I$ D$ I$ 64kB 64kB 64kB 64kB 64kB 64kB 64kB 64kB CPU CPU CPU CPU

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 16 © Erik Hagersten| user.it.uu.se/~eh AMD Shanghai, 2007

Hyper Transport DDR-2, DRAM

L3 8MB

X-bar

L2$ L2$ L2$ L2$ 512kB 512kB 512kB 512kB

D$ I$ D$ I$ D$ I$ D$ I$ 64kB 64kB 64kB 64kB 64kB 64kB 64kB 64kB CPU CPU CPU CPU

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 17 © Erik Hagersten| user.it.uu.se/~eh AMD MC System Architecture

I/O I/O

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 18 © Erik Hagersten| user.it.uu.se/~eh Intel: Nehalem, Core i7 Q1 2009 (4 cores) QuickPath Interconnect 3x DDR-3 DRAM

L3 8MB

X-bar

L2$ L2$ L2$ L2$ 256kB 256kB 256kB 256B

D$ I$ D$ I$ D$ I$ D$ I$ 64kB 64kB 64kB 64kB 64kB 64kB 64kB 64kB CPU, 2 thr CPU, 2 thr. CPU, 2 thr. CPU, 2 thr.

Up to 4 cores x 2 threads AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 19 © Erik Hagersten| user.it.uu.se/~eh Nehalem ”Core i7”

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 20 © Erik Hagersten| user.it.uu.se/~eh Intel: ”Nehalem-Ex” (i7)

QuickPath Interconnect 4 x DDR-3

L3 24MB

X-bar

L2$ L2$ L2$ L2$ L2$ 256kB 256kB 256kB 256B 256kB

D$ I$ D$ I$ D$ I$ D$ I$ ... D$ I$ 64kB 64kB 64kB 64kB 64kB 64kB 64kB 64kB 64kB 64kB CPU, 2 thr CPU, 2 thr. CPU, 2 thr. CPU, 2 thr. CPU

8 cores x 2 threads AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 21 © Erik Hagersten| user.it.uu.se/~eh How is the silicon used (i7-Ex)?

3 Mbyte

Tag

Data array AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores Source:22 JSSC© Erik Hagersten Jan 2010,| user.it.uu.se/~eh Rusu et. al How is the silicon used?

Cache L1 + L2 Ex Transl. $ L3 Branch $ Ord

Cache L1 + L2 Ex TLB $ Re-order Branch $ L3 Cache Tag

Re-order I-decode Data array AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores Source:23 JSSC© Erik Hagersten Jan 2010,| user.it.uu.se/~eh Rusu et. al More recent: Sandy Bridge, 32 nm & Ivy Bridge, 22nm

 Integrated graphics!  Ring bus for coherent LLC  Up to 12 cores (24 threads)  ~3GHz-ish

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 24 © Erik Hagersten| user.it.uu.se/~eh General Purpose: Is there a demand for more cores?

 Intel released their quad core at Supercomputing 2006.  Six years later we are only(!!) at 12 cores and 24 threads.  Core complexity and caches are favoured over number of cores

Yes! In certain market segments and special applications need cores...

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 25 © Erik Hagersten| user.it.uu.se/~eh Hot Chips Conference 2012

” 16-core SPARC T5 CMT Processor with glueless 1-hop scaling to 8-sockets”

” Knights Corner, Intel’s first Many Integrated Core (MIC) Architecture Product”

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 26 © Erik Hagersten| user.it.uu.se/~eh Sun Niagara T1, 2005 Targeting transactions and DB

4 x DDR-2 = 25GB/s

Memory Memory Memory Memory ctrl ctrl ctrl ctrl

Shared L2 L2 L2 L2 L2

Xbar =…8 134 ... GB/s L1I L1D L1I L1D (wt) (wt)

Core … 8 … Core

In-order 5 stage pipeline 1.2 GHz AVDARK Four interleaved threads 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 27 © Erik Hagersten| user.it.uu.se/~eh 2012: SPARC Niagra T5

 16 cores  8 threads per core (128 threads/chip)  2x superscalar out-of-order @3.6GHz  8MB L3 cache (64kB/thread)  Glueless 8 multisockets (1-hop)  Coherent global NUMA memory  Lots of HW accelerations

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 28 © Erik Hagersten| user.it.uu.se/~eh Oracle marketing slide: Speeds and feeds, 8 sockets

DDR3-1066 PCIe G3 PCIe G3 1+TB/sec DIMMs DIMMs T5 T5 PCIe G3 PCIe G3 Coherency Bisection

Bandwidth 840 DIMMs T5 T5 DIMMs GB/sec

DIMMs DIMMs T5 T5 PCI Gen3 - Bandwidth PCIe G3 T PCIe G3 T5 T5 256 GB/sec DIMMs DIMMs

AVDARK PCIe G3 PCIe G3 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 29 © Erik Hagersten| user.it.uu.se/~eh Oracle marketing slide: Special purpose HW

IBM Power7, Intel Westmere/ On-Chip Accelerators SPARC T5 HP Sandybridge Asymmetric /Public Key RSA, DH, DSA, none RSA, ECC Encryption ECC

Symmetric Key / AES, DES, 3DES, none AES Bulk Encryption Camellia, Kasumi CRC32c, MD5, Message Digest / Sha-1, SHA-224, none none Hash Functions SHA-256, SHA-384, SHA-512 Random Number Supported none Supported Generation

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 30 © Erik Hagersten| user.it.uu.se/~eh T5 Processor Overview

SerDes MI/O SerDes • 16 S3 cores @ 3.6GHz SPARC SPARC Pwr SPARC SPARC Core Core PCIe Gen3 Core Core • 8MB shared L3 Cache

SPARC SPARC SPARC SPARC • 8 DDR3 BL8 Schedulers Core Core L3 L3 L3 L3 Core Core providing 80 GB/s BW SerDes SerDes MCU MCU Cross Bar MCU MCU • 8-way 1-hop glue-less SPARC SPARC L3 L3 L3 L3 SPARC SPARC Core Core Core Core scalability • Integrated 2x8 PCIe Gen 3 SPARC SPARC SPARC SPARC Core Core Core Core Coherence • Advanced with DVFS SerDes

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 31 © Erik Hagersten| user.it.uu.se/~eh TILERA Architecture Targeting Embedded

64 cores connected in a mesh Local L1 + L2 caches Shared distributed L3 cache Linux + ANSI C New Libraries New IDE Stream computing ...

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 32 © Erik Hagersten| user.it.uu.se/~eh GPUs and Accelerators: Lots of hype in HPC! Fermi from nVIDIA -- a huge step in the right direction

 512 ”cores” (P)  16 P/StreamProcessor (SP)  Full DP-FP IEEE support  64kB L1 cache /SP  768kB global shared cache  Atomic instructions  ECC correction  Debugging support  ... • CUDA language. Requires ”heroic programming” [M.Bull, EPCC] • How much is an nVIDIA core worth vs. an core? • nVIDIA used to say 100x performance [over some old x86] • At SC2011, nVIDIA’s banner said ”2x peformance guaranteed” • With what productivity?

++ New interestring programming approach announced: OpenACC AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 33 © Erik Hagersten| user.it.uu.se/~eh [Thanks. David Black-Schaffer] Processors in HPC

Where are the GPUs?

64bit x86 (AMD Opteron) Double- precision GPUs [Thanks. David Black-Schaffer] GPUs in the Top500

Number of Cores Number of Chips GPU GPU Cell Other Cell

Other 5% of “cores” 3% of Chips Power

Power Intel x86

Intel x86 AMD AMD x86 x86

Notes: Top500 June 2012. Number of Chips is based on an approximation from dividing the number of cores by the number of cores per chip from various reasonably reliable sources. MIC = Multiple Integrated Cores Knights Corner , first commercial MIC implementation

KNC Card KNC Card GDDR5 GDDR5 Channel … Channel Intel® Xeon® PC e x16 Channel Processor PCIe x16 GDDR5 … KN> 50 Cores KN Channel Linux OS GDDR5 System Memory

GDDR5 GDDR5 Channel … Channel

>= 8GB GDDR5 memory

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 36 © Erik Hagersten| user.it.uu.se/~eh Intel’s MIC HPC architecture

4 threads 4 threads

≥8 GB >25MB L2 cache GDDR

4 threads 4 threads ??= Unknown. Check the Web  Knights Corner (Intel ), 22nm for updates  ”>50 cores (??)” X86 instructions (”enhanced”), 4 Threads  Coherent caches. Runs Linux. ”Just recompile and run”  > 1TFLOPS (DP). Available late 2012 (??)

AVDARK  2% of the transistors devoted to x86 instructions 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 37 © Erik Hagersten| user.it.uu.se/~eh The MIC core ”tile”

 2x superscalar in-order  4 Threads SMT/core  512b SIMD instr. (16x SP)  Scatter/gather operations  32k L1 I and D cache  512kB coherent L2  16 HW prefetch streams

512KB  2x512bit ringbus  ... 2x512b 2x512b

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 38 © Erik Hagersten| user.it.uu.se/~eh MIC Tool Stack

 It is ”just an x86 Linux box”  Use your favorit language  Intel compiler, tuning tool etc.  3rd party tools (for example mine:-)  A huge alpha program since a year back  Already on the Top500 list (#150)  Low power/FLOP  Many strategic sales already

AVDARK  Question: can we afford 200+ threads/chip 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 39 © Erik Hagersten| user.it.uu.se/~eh Multicores: A different beast

Dr Dobb’s 2005: The Free Lunch Is Over A Fundamental Turn Toward Concurrency in Software By Herb Sutter

“The biggest change in software development since the object-oriented revolution is knocking at the door, and its name is Concurrency…

Dr Dobb’s 2012: Welcome to the Jungle By Herb Sutter

“Mainstream computers […] are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done.”. AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 40 © Erik Hagersten| user.it.uu.se/~eh Trends 1 (2) Varies with application areas:  Number of cores,  Cache capacity,  Bandwidth,  Threads/core,  Cache/thread varies.  Important SW functions moved to HW

 HW diversity is larger than ever  Intel has never released so many different chips per year

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 41 © Erik Hagersten| user.it.uu.se/~eh Trends 2

 Active power management (DVFS)  Non-uniformity  Heterogeneity  tiny/fat cores in the same system  different ISA  specialized HW  Dark silicon is around the corner (!!)  Cannot have all transistors turned on  Reconfigurable architectures

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 42 © Erik Hagersten| user.it.uu.se/~eh Wrapping up about multicores

Erik Hagersten Uppsala Universitet What matters for multicore performance?

 Are we buying…  CPU frequency?  Number of cores?  MIPS and FLOPS?  Memory bandwidth?  Cache capacity?  Memory capacity?  Performance/Watt?

AVDARK 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 44 © Erik Hagersten| user.it.uu.se/~eh MC Questions for the Future

 How to get parallelism?  How to get performance/data locality?  How to debug?  A case for new funky languages?  A case for automatic parallelization?  Are we buying:  compute power,  memory capacity, or  memory bandwidth?  Will 1000 cores be mainstream in 5 years?  Will the CPU market diverge into desktop/capacity/capability/special-purpose CPUs again?

AVDARK  A non-question: will it happen? 2012 Institutionen för informationsteknologi | www.it.uu.se Multicores 45 © Erik Hagersten| user.it.uu.se/~eh AVDARK mid-term eval

1. My talking speed (1=too fast ..3=good...5=too slow) 2. The teaching speed (1.fast ... 5.slow) 3. Starting with a repeat? (1.good ... 5.waste of time) 4. Content rel. to your backgroud (1.hard .... 5.easy) 5. Labs rel. to your backgroud (1.hard .... 5.easy) 6. Amount of work per Lab (1.too much ... 5.too little) 7. Amount of work per Handin (1.too much..5.too little) 8. Give me more feedback in writing!

AVDARK 2012

Dept of Information Technology| www.it.uu.se 46 © Erik Hagersten| user.it.uu.se/~eh