The Future is not what it used to be...

Erik Hagersten

Then... ENIAC 1946 ("5 kHz")
18 000 vacuum tubes, plugboard-programmed

AVDARK

2012 Dept of Information Technology | www.it.uu.se Erik Hagersten | user.it.uu.se/~eh

Then (in Sweden)

 BARK (~1950): 8 000 relays, 80 km of cables

 BESK (~1953): 2 400 vacuum tubes, "20 kHz" (world record)

"Recently": APZ 212, 1983

Ericsson’s Supercomputer (“5 MHz”)

APZ 212 marketing brochure quotes:

 "Very compact"
 6 times the performance
 1/6th the size
 1/5th the power consumption
 "A breakthrough in science"
 "Why more CPU power?"
 "All the power needed for future development"
 "…800,000 BHCA, should that ever be needed"
 "SPC computer science at its most elegance"
 "Using 64 kbit memory chips"
 "1500 W power consumption"

65 years of "improvements"

 Speed
 Size
 Price
 Price/performance
 Reliability
 Predictability
 Energy
 Safety
 Usability…

"Moore's Law" (popular version): performance doubles every 18-24 months

[Figure: Performance (log scale, 1-1000) vs. Year; the single-core curve gives way to multicore around 2006]

Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/


Exponential growth: doubling/halving times (according to Kurzweil)

 Dynamic RAM memory (bits per dollar): 1.5 years
 Average price: 1.6 years
 Microprocessor cost per transistor cycle: 1.1 years
 Total bits shipped: 1.1 years
 Processor performance in MIPS: 1.8 years
 In Intel microprocessors: 2.0 years

[Figure: log scale (1-1000) vs. time]


Linear scale, 1940-2017 (2x performance every 18 months)

Doubling every 18 months since 1940

[Figure: Performance (linear scale, 0 to ~4x10^15) vs. Year (1940-2010); on a linear scale the exponential looks flat until it shoots up at the far right]

Exponential growth

Example: doubling every 2 years. How long does it take to improve 1000x?

Example: doubling every 18 months. How long does it take to improve 1000x?

[Figure: log scale (1, 10, 100, 1000) vs. time, and the same curve on a linear scale]
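The arithmetic behind both examples: a 1000x improvement needs log2(1000), roughly 10, doublings, so the answer is just the doubling period times about ten. A minimal sketch (the function name is my own):

```python
import math

def years_for_improvement(factor, doubling_period_years):
    """Years needed to reach a given improvement factor
    when performance doubles every `doubling_period_years`."""
    return math.log2(factor) * doubling_period_years

# 1000x improvement is about log2(1000) ~ 9.97 doublings:
print(years_for_improvement(1000, 2.0))  # ~20 years when doubling every 2 years
print(years_for_improvement(1000, 1.5))  # ~15 years when doubling every 18 months
```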

Looking Forward

Three rules of common wisdom:
 Do not bet against exponential trends
 Do not bet against exponential trends
 Do not bet against exponential trends

But is it possible to continue "Moore's Law"?
- Are there show-stoppers?
- Can we utilize an exponential growth of #cores?

Not everything scales as fast!

Example: 470.LBM "Lattice Boltzmann Method" to simulate incompressible fluids in 3D

[Figure: Throughput (0-3.5, baseline 1.0 at one core) vs. Number of Cores Used (1-4)]

Throughput (as defined by SPEC): the amount of work performed per unit time when several instances of the application are executed simultaneously.

Our TP study: compare the TP improvement when going from 1 core to 4 cores.
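The SPEC throughput definition translates directly into code. The run times below are hypothetical, chosen only to illustrate how shared memory bandwidth can turn a hoped-for 4x into much less:

```python
def throughput(instances_completed, elapsed_seconds):
    """SPEC-style throughput: instances of the application completed
    per unit time when several copies run simultaneously."""
    return instances_completed / elapsed_seconds

# Hypothetical run times (not measured): one copy alone finishes in 100 s;
# four copies sharing the memory system need 250 s to all finish.
tp_1 = throughput(1, 100.0)
tp_4 = throughput(4, 250.0)
print(tp_4 / tp_1)  # 1.6x throughput from 4 cores, far from the ideal 4x
```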

Nerd Curve: 470.LBM

[Figure: miss rate vs. cache size, running one thread (3.5%) and running four threads (5.0%). Also shown: utilization, i.e., the fraction of fetched cache data actually used (scale to the right), and the possible miss rate if the utilization problem was fixed. Miss rates exclude HW prefetch effects.]

 Less work per memory byte moved when running four threads

Remember: It is getting worse! Computation vs. Bandwidth (#Cores ~ #Transistors)

[Figure: four CPUs sharing pins to DRAM. The ratio (#T x T_freq) / (#P x P_freq) grows steeply from 2007 to 2015 (y-axis 0-6) while #Pins stays nearly flat]

Source: International Technology Roadmap for Semiconductors (ITRS). From Karlsson and Hagersten, "Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution", IPDPS, March 2007 [graph updated with more recent data].

HPCwire, Feb 2011 [cites Linley Gwennap and Justin Rattner]: "Without Silicon Photonics, Moore's Law Won't Matter"

HPCwire, Feb 2011: "Growing Data Deluge Prompts Processor Redesign"
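The widening compute/bandwidth gap can be sketched with assumed doubling periods (the 2-year and 6-year figures below are stand-in assumptions for illustration, not ITRS data):

```python
def compute_to_bandwidth_ratio(years, compute_doubling=2.0, pin_doubling=6.0):
    """Ratio of on-chip compute growth (#transistors * frequency) to
    off-chip bandwidth growth (#pins * pin frequency), given assumed
    doubling periods for each trend."""
    return 2 ** (years / compute_doubling) / 2 ** (years / pin_doubling)

print(compute_to_bandwidth_ratio(12))  # compute outgrows bandwidth 16x in 12 years
```

Whatever the exact periods, the point stands: if compute doubles faster than pins, the ratio itself grows exponentially, so each byte moved must serve ever more computation.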

Case study: Limited by bandwidth

Nerd Curve (again)

[Figure: miss rate vs. cache size for the original application (5.0%) and the optimized application (2.5%). Also shown: utilization, i.e., the fraction of fetched cache data actually used (scale to the right), and the possible miss rate if the utilization problem was fixed. Miss rates exclude HW prefetch effects.]

 Twice the amount of work per memory byte moved (running four threads)
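The "possible miss rate if the utilization problem was fixed" curve can be approximated to first order: if only a fraction of each fetched cache line is actually used, perfect data packing would cut the number of lines fetched, and hence the miss rate, by roughly that fraction. This is my own simplification; the 50% utilization below is an illustrative assumption consistent with "twice the work per memory byte moved":

```python
def potential_miss_rate(measured_miss_rate, utilization):
    """First-order estimate: with perfect data packing, the miss rate
    could drop to roughly measured_miss_rate * utilization.
    (A simplification; real gains depend on the access pattern.)"""
    return measured_miss_rate * utilization

# Illustrative numbers: a 5.0% miss rate at 50% cache-line utilization
# suggests roughly 2.5% is attainable by fixing the data layout.
print(potential_miss_rate(0.050, 0.5))  # 0.025
```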

 Better memory usage! Example: 470.LBM modified to promote better cache utilization

[Figure: Throughput (0-3.5) vs. # Cores Used (1-4), modified code vs. original code]

Example 2: A Scalable Parallel Application

[Figure: Performance (0-4) vs. # Cores (1-4), scaling nearly linearly. App: Cigar]

Looks like a perfectly scalable application! Are we done?

Example 2: The Same Application Optimized

[Figure: Performance (0-30) vs. #Cores (1-4), Original vs. Optimized; the optimized version reaches 7.3x. App: Cigar]

Looks like a perfectly scalable application! Are we done?
 Optimization: duplicate one data structure

Implementation Trends: predicting the future is hard. Prediction from PARA, Bergen, 2000: the "Chip Multiprocessor", aka multicore.

[Figure: Chip Multiprocessor (CMP): four CPUs, each with a private L1 cache ($1) running t threads, sharing an L2$ and an external Mem I/F. Simple fast CPUs -- many open questions]

Multi-CMPs [from PARA, Bergen, 2000]

[Figure: c chips, each with its own Mem, connected by an Interconnect]

Explicit parallelism: #chips x #threads/chip
• Global shared memory
• Global/local communication cost > 10
• Gotta explore small caches
• Gotta explore locality!
• OS scalability?
• Application scalability?

Why Multicores Now? -- How is "Moore's Law" doing?

[Figure: Perf (log) vs. time; the single-core curve flattens around 2007 and the multicore curve continues upward]

1. Not enough ILP/MLP to get payoff from using more transistors
2. Signal propagation delay >> transistor delay
3. Power consumption: P_dyn ~ C * f * V^2
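Point 3 is the core of the multicore argument: since P_dyn ~ C * f * V^2 and a lower frequency permits a lower supply voltage, two slow cores can match one fast core's nominal throughput at a fraction of the power. A small sketch with made-up illustrative numbers (no datasheet values):

```python
def dynamic_power(c_eff, freq, voltage):
    """Dynamic power: P_dyn ~ C * f * V^2 (arbitrary consistent units)."""
    return c_eff * freq * voltage ** 2

# One core at 2 GHz / 1.2 V vs. two cores at 1 GHz / 0.9 V
# (illustrative voltages -- the lower frequency permits the lower voltage):
p_one_fast = dynamic_power(1.0, 2.0, 1.2)      # 2.88
p_two_slow = 2 * dynamic_power(1.0, 1.0, 0.9)  # 1.62
print(p_two_slow / p_one_fast)  # same nominal throughput at ~56% of the power
```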

Darling, I Shrunk the Computer

Sequential execution (one program): Mainframes ≈ Super Minis ≈ Microprocessor, each with its own Mem.

Chip Multiprocessor (CMP): a multiprocessor on a chip! Paradigm shift: need TLP to make one chip run fast.

HPC in the Rear Mirror...

Timeline 1980-2010, pros (*) and cons (†) of successive HPC platforms:
 Vector: * Nifty † Not general, Expensive
 Parallel: * Naive view † Hard to use, No standards
 Killer Micro SMPs: * Scalability, UNIX, Commercial computing † High cost, Bad scaling
 Beowulf x86 Linux Clusters: * COTS cost † COTS perf, management
 MC Clusters: * COTS cost convergence, Forced by technology † ????
 MC + Accelerators: * Promise of performance † ????

Parallelism can be used to hide memory latency

 Intel "Hyper-Threading"
 T1 Niagara, MIC, … (4 threads per core)
 GPUs
 ...

 Is this a good idea?
 It cannot hide the need for bandwidth!

Parallelism is a Hard Currency

[Figure: Speedup vs. Parallelism]

Remember Amdahl’s Law?
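Amdahl's Law makes the point concrete: speedup = 1 / ((1 - p) + p/n), so the serial fraction caps the speedup no matter how many cores you spend. A quick sketch:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even a 95%-parallel program tops out at 20x:
print(amdahl_speedup(0.95, 4))     # ~3.48
print(amdahl_speedup(0.95, 1000))  # ~19.6 -- the serial 5% dominates
```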

Do you have 1000 threads to spare? SIMD rears its ugly head again

 512 ”cores” (C)  16 C/StreamProcessor (SP)  SP is SIMD-ish (sort off)  Full DP-FP IEEE support  64kB L1 cache /SP

L2  768kB global shared cache (less than the sum of L1:s) L1  Atomic instructions  ECC correction for DRAM This is SIMD-ish (aka Vector)  Debugging support  Synch within SP efficient  Giant chip/high power  ...

SIMD and CPUs?

[Figure: NVIDIA Fermi attached over the I/O bus, next to coherent memory]

NVIDIA Fermi:
• Special language...
• Topology matters...
• User-managed memory

Reminds me of:
* Scalability
† Hard to use, No standards

Common research papers: "How to get 100X speedup". Debunkings of those results are starting to appear [ISCA 2010, IBM Journal 201…]

SIMD and CPUs?

Vector instructions

Intel's Knights Ferry [MIC] (topology like Sandy Bridge; coherent memory)

[Pic from Michael Wulf, PGI]

Other efforts:
• AMD Fusion (x86 + GPU)
• ARM + NVIDIA collaboration (Project Denver)

Trends for 2016

 No major revolution of the multicore magnitude
 Challenge: will the number of cores double every 2 years?
 Moving towards MIMD+SIMD "fusion"
 Architecture complexity grows
 Bumpy memory/communication costs
 Heterogeneous architectures (e.g., ARM big.LITTLE)
 Memory bandwidth is the bottleneck
 Energy is a first-class citizen
 Users are getting less computer-savvy (and ideally should not have to be)

Implications

 One size will not "fit all"
 SIMD parallelism will be more prominent, but the jury is still out on how this will be done
 More heterogeneous architectures (size, mem, ISA)
 More parallelism needed, but... memory/power will become the bottleneck anyhow
 Diversity: different applications will need different "heterogeneous configurations"
 Even harder to use resources efficiently

HiPEAC Roadmap -- High Performance Embedded Architecture and Compilers
