www.bsc.es

CPU alternatives for future high-performance systems

Nikola Puzović
Barcelona Supercomputing Center

Motivation

[Figure: progression Custom → Commodity → ???; labels: Pure Vector CPUs, Server CPUs, HPC Accelerators]

Nikola Puzovic - CPU Alternatives for Future HPC Systems

Outline

A little bit of history
– From vector CPUs to commodity components

Killer mobile processors
– Overview of current trends for mobile CPUs

Our experiences
– Low-power prototypes in BSC

Disclaimer: All references to unavailable products are speculative, taken from web sources. There is no commitment from ARM, Samsung, Intel, or others implied.

In the beginning ... there were only supercomputers

Built to order
– Very few of them

Special purpose hardware
– Very expensive

Control Data

Cray-1
– 1975, 160 MFLOPS
  • 80 units, 5-8 M$

Cray X-MP
– 1982, 800 MFLOPS

Cray-2
– 1985, 1.9 GFLOPS

Cray Y-MP
– 1988, 2.6 GFLOPS

... Fortran + vectorizing compilers

Then, commodity took over special purpose

ASCI Red, Sandia
– 1997, 1 TFLOPS (Linpack), 9298 processors at 200 MHz, 1.2 Tbytes, 850 kWatts
– Intel Pentium Pro
  • Upgraded to Pentium II Xeon, 1999, 3.1 TFLOPS

ASCI White, Lawrence Livermore Lab.
– 2001, 7.3 TFLOPS, 8192 proc. RS6000 at 375 MHz, 6 Terabytes, (3+3) MWatts
– IBM Power 3

Message-Passing Programming Models

“Killer microprocessors”

[Figure: MFLOPS (log scale, 10 to 10,000) vs. year (1974 to 1999); vector CPUs: Cray-1, Cray-C90, NEC SX4, SX5; microprocessors: Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200]

Microprocessors killed the vector supercomputers
– They were not faster ...
– ... but they were significantly cheaper and greener

10 microprocessors approx. 1 vector CPU
– SIMD vs. MIMD programming paradigms

Nikola Puzovic - CPU Alternatives for Future HPC Systems 6 Finally, commodity hardware + commodity software

MareNostrum – Nov 2004, #4 Top500 • 20 Tflops, Linpack – IBM PowerPC 970 FX • Blade enclosure – Myrinet + 1 GbE network – SuSe Linux

Nikola Puzovic - CPU Alternatives for Future HPC Systems 7 2008 – 1 PFLOPS – IBM RoadRunner

Los Alamos National Laboratory (USA)

Hybrid architecture – 1 x AMD dual-core Master blade – 2 x PowerXCell 8i Worker blade

Hybrid MPI + Task off-load model

296 racks – 6.480 Opteron processors – 12.960 Cell processors • 128-bit SIMD Infiniband interconnect – 288-port switches

2.35 MWatt (425 MFLOPS / W)

Nikola Puzovic - CPU Alternatives for Future HPC Systems 8 2009 - Cray Jaguar (1.8 PFLOPS)

Oak Ridge National Laboratory (USA)

Multi-core architecture – Hybrid MPI + OpenMP programming

230 racks 224.256 AMD Opteron processors – 6 cores / chip Cray Seastar2+ interconnect – 3D-mesh using AMD Hypertransport

7 MWatt (257 MFLOPS / W)

Nikola Puzovic - CPU Alternatives for Future HPC Systems 9 2012 – Cray (17.6 PFLOPS)

DOE/SC/Oak Ridge National Laboratory – Jaguar GPU upgrade

200 racks 224.256 Cray XK7 nodes – 16-core AMD Opteron – Nvidia Testa K20X GPU

8.2 Mwatts (2.142 MFLOPS/W)
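The energy-efficiency figures quoted for the three petascale systems follow from dividing sustained performance by power draw. A minimal sketch of that arithmetic, using only the rounded numbers on the slides (the system labels and the helper name `mflops_per_watt` are illustrative, not from the original):

```python
# Sustained performance (FLOPS) and power draw (W), rounded as on the slides.
SYSTEMS = {
    "IBM RoadRunner, LANL (2008)": (1.0e15, 2.35e6),
    "Cray Jaguar, ORNL (2009)": (1.8e15, 7.0e6),
    "Cray GPU upgrade, ORNL (2012)": (17.6e15, 8.2e6),
}


def mflops_per_watt(flops, watts):
    """Energy efficiency in MFLOPS per Watt."""
    return flops / watts / 1e6


for name, (flops, watts) in SYSTEMS.items():
    print(f"{name}: {mflops_per_watt(flops, watts):.0f} MFLOPS/W")
```

Because the inputs are rounded, the results can differ from the quoted values by a few MFLOPS/W; the 2012 figure in particular comes out slightly above the quoted one.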

Nikola Puzovic - CPU Alternatives for Future HPC Systems 10 Outline A little bit of history – From vector CPUs to commodity components

Killer-mobile processors – Overview of current trends for mobile CPUs

Our experiences – Low-power prototypes in BSC

Nikola Puzovic - CPU Alternatives for Future HPC Systems 11 The next step in the commodity chain

HPC

Servers

Desktop Total cores in Nov„12 Top500 – 14.9M Cores Tablets sold 2012 Mobile – > 100M Tablets Smartphones sold 2012 – > 712M Phones

Nikola Puzovic - CPU Alternatives for Future HPC Systems 12 Current trends in mobile CPUs We want to see how mobile SoCs behave with HPC apps – Current test systems have limited memory – Use a set of HPC-specific micro-kernels

Micro-kernels – Stress different architectural features – Cover a wide range of HPC application domains – Reduce porting effort to new architectures

Single core and multi-core evaluations – Goal is to see how mobile CPUs change through generations… – …and to compare them to modern HPC cores

Nikola Puzovic - CPU Alternatives for Future HPC Systems 13 ARM Cortex-A9 Smartphone CPU

OoO superscalar – Issue width of 4

VFP for 64-bit Floating Point – DP: 1 FMA each 2 cycles

The first ARM CPU truly usable for testing HPC workloads

Nikola Puzovic - CPU Alternatives for Future HPC Systems 14 NVIDIA Tegra2

Dual-core Cortex-A9 @ 1GHz – VFP for 64-bit Floating Point • 2 GFLOPS (1 FMA / 2 cycles)

Low-power Nvidia GPU – OpenGL only, CUDA not supported

Several (not useful for HPC) accelerators – Video encoder-decoder – Audio processor SECO Q7 board – Image processor

2 GFLOPS ~ 0.5 Watt

Nikola Puzovic - CPU Alternatives for Future HPC Systems 15 NVIDIA Tegra3

Quad-core Cortex-A9 @ 1.3GHz – VFP for 64-bit Floating Point • 5.2 GFLOPS (1 FMA / 2 cycles) – NEON for 32-bit floating Point SIMD

Low-power Nvidia GPU – 3x faster than Tegra2 • For graphics only SECO Q7 board – CUDA not supported

Nikola Puzovic - CPU Alternatives for Future HPC Systems 16 ARM Cortex-A15 Next generation of Cortex – Improved uArch – Improved performance – Virtualization support – Improved multiprocessing capabilities

Floating point performance – DP: 1 FMA per cycle

Nikola Puzovic - CPU Alternatives for Future HPC Systems 17 Samsung 5 Dual

Dual-core ARM Cortex-A15 @ (up to 1.7 GHz) – VFP for 64-bit Floating Point • 6.8 GFLOPS (1 FMA / cycle) – NEON for 32-bit floating Point SIMD Quad-core ARM Mali T604 – Compute capable • OpenCL 1.1 • 68 GFLOPS (SP) Shared memory between CPU and GPU
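The peak double-precision numbers quoted for the three SoCs all come from the same product: cores × clock × FMAs per cycle × 2 floating-point operations per FMA. A small sketch of that calculation, using the figures from the slides (the function name `peak_dp_gflops` is only illustrative):

```python
def peak_dp_gflops(cores, clock_ghz, fma_per_cycle):
    """Peak double-precision GFLOPS; one FMA counts as 2 floating-point ops."""
    return cores * clock_ghz * fma_per_cycle * 2


# (cores, clock in GHz, DP FMAs issued per cycle per core), as on the slides.
SOCS = {
    "Tegra2 (Cortex-A9)": (2, 1.0, 0.5),          # 1 FMA every 2 cycles
    "Tegra3 (Cortex-A9)": (4, 1.3, 0.5),          # 1 FMA every 2 cycles
    "Exynos 5 Dual (Cortex-A15)": (2, 1.7, 1.0),  # 1 FMA per cycle
}

for name, (cores, ghz, fma) in SOCS.items():
    print(f"{name}: {peak_dp_gflops(cores, ghz, fma):.1f} GFLOPS")
```

This reproduces the 2, 5.2 and 6.8 GFLOPS quoted above. These are theoretical peaks of the CPU cores only; the Mali GPU figure is single precision and is not included.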

Nikola Puzovic - CPU Alternatives for Future HPC Systems 18 MicroKernels

Benchmark Properties Vector Operation (vecop) Common operation in regular codes Dense Matrix-Matrix Multiplication (dmmm) Common operation: measures data reuse and compute performance 3D stencil (3dstc) Strided memory accesses (7-point 3D stencil) 2D Convolution (2dcon) Spatial locality Fast Fourier Transform (fft) Peak floating-point, variable-stride accesses Reduction (red) Varying levels of parallelism (Scalar sum) Histogram (hist) Histogram with local privatisation, requires reduction stage Merge Sort (msort) Barrier operations N-Body (nbody) Irregular memory accesses Atomic Monte-Carlo Dynamics (amcd) Embarrassingly parallel: peak compute performance Sparse Vector-Matrix Multiplication (spwm) Load imbalance
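To make the "strided memory accesses" property of the 3D stencil kernel concrete, here is a rough sketch of one sweep of a 7-point stencil. It is not the benchmark code used in the study: the coefficients, the nested-list layout and the name `stencil_7pt` are assumptions chosen for illustration.

```python
def stencil_7pt(grid, c_center=0.4, c_neigh=0.1):
    """One sweep of a 7-point 3D stencil; boundary cells are copied unchanged.

    `grid` is indexed as grid[z][y][x]. The coefficients are illustrative.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                # In a flat row-major array the three neighbour pairs sit at
                # strides of 1, nx and nx*ny elements respectively.
                neighbours = (
                    grid[z][y][x - 1] + grid[z][y][x + 1]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z - 1][y][x] + grid[z + 1][y][x]
                )
                out[z][y][x] = c_center * grid[z][y][x] + c_neigh * neighbours
    return out


# A 3x3x3 grid with a single 1.0 in the centre: only the centre is updated.
n = 3
grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
grid[1][1][1] = 1.0
print(stencil_7pt(grid)[1][1][1])
```

The two neighbours along z are a full plane apart in memory, which is why this kernel stresses the memory system more than a unit-stride vector operation does.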

Nikola Puzovic - CPU Alternatives for Future HPC Systems 19 Performance – Double precision FP (single core)

Performance normalized to Tegra2 board (single core) – Benefits due to increased frequency and improved micro-architecture – Better memory technology gives additional performance benefits

Nikola Puzovic - CPU Alternatives for Future HPC Systems 20 Performance – Double precision FP (multi-core)

Performance normalized to Tegra2 board (OpenMP) – Threads used: 2 in Tegra2, 4 in Tegra3, 2 in Exynos – New generation of ARM cores shows benefits despite smaller number of cores being used

Nikola Puzovic - CPU Alternatives for Future HPC Systems 21 Energy Efficiency – Double precision FP (multi-core)

Energy efficiency in GOps/Watt – Gains proportional to improvements in execution time – Increased frequency and complex architecture do add power… – …but “overhead” is too large for this to have an effect

Nikola Puzovic - CPU Alternatives for Future HPC Systems 22 Results – single core summary Double Precision Single Precision 5 5

4 4

3 3

2 2

Speedup Speedup 1 1

0 0 Tegra 2 Tegra 3 Exynos 5 Dual Tegra 2 Tegra 3 Exynos 5 Dual

Results as expected – T3 faster than T2 – frequency increase – Exynos faster than T3 – uArch improvements

Nikola Puzovic - CPU Alternatives for Future HPC Systems 23 ARMv8 architecture Cortex-A57 64-bit processor – Improved performance in all workloads • Up to 3x announced at same power budget – Interoperability with ARM Mali GPUs

Double Floating Point performance – DP support in the Neon instruction set • 128-bit words – Should double performance wrt Cortex-A15

big.LITTLE processing – Cortex-A53 – Low-performance, low-power companion on A57 – Can pair A57s with A53s

Nikola Puzovic - CPU Alternatives for Future HPC Systems 24 What if…

Double Precision (single core) 6

5

4

3

Speedup 2

1

0 Tegra 2 Tegra 3 Exynos 5 Dual ARMv8 ARMv8 could improve DFP performance – We keep the same frequency for our projection (pessimistic) – Increase the DP capability 2x (optimistic?) – Just a speculation, but could happen…

Nikola Puzovic - CPU Alternatives for Future HPC Systems 25 What about current HPC CPUs?

Double Precision (single core) 16 14 12

10 4.7 x 2.3 x 8

Speedup 6 4 2 0 Tegra 2 Tegra 3 Exynos 5 ARM v8 Sandy Dual Bridge Comparison with single core of Intel SandyBridge-EP E5-2670 @ 2.6 GHz

Current “mobile champion” still far away – Next generation mobile CPUs could bring significant improvements – Also, back in time, SX5 was significantly faster than Pentium II…
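The two factors shown on the chart (4.7x and 2.3x) are consistent with the ARMv8 projection described earlier in the deck, under one assumption that the extracted text does not confirm: that 4.7x is the single-core gap between Sandy Bridge and Exynos 5 Dual, and 2.3x the gap that remains against the projected ARMv8 core. A short worked check of that reading (the constant and function names are illustrative):

```python
# Assumed reading of the chart: one Sandy Bridge core is 4.7x faster than
# one Exynos 5 Dual core on the double-precision micro-kernels.
GAP_SANDY_BRIDGE_VS_EXYNOS5 = 4.7

# Projection stated in the deck: same frequency, 2x DP capability per core.
ARMV8_DP_GAIN = 2.0


def projected_gap(current_gap, dp_gain):
    """Gap that remains once the slower core speeds up by `dp_gain`."""
    return current_gap / dp_gain


print(f"{projected_gap(GAP_SANDY_BRIDGE_VS_EXYNOS5, ARMV8_DP_GAIN):.2f}x")
```

Halving a 4.7x gap leaves about 2.35x, which is in line with the 2.3x annotation.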

Nikola Puzovic - CPU Alternatives for Future HPC Systems 26 The Killer Mobile processorsTM

1.000.000

Alpha 100.000

Intel AMD 10.000

Nvidia Tegra MFLOPS Samsung Exynos 1.000 4-core ARMv8 1.5 GHz

100 1990 1995 2000 2005 2010 2015

History may be about to repeat itself … – are not faster … – … but they are significantly cheaper and greener

Nikola Puzovic - CPU Alternatives for Future HPC Systems 27 Then and now

Then: Vector vs Commodity Now: Commodity vs Mobile

Today‟s situation looks very familiar – “Mobile vs. Server” similar to “Server vs. Vector” – Significantly lower cost of mobile CPUs (thousands vs hundreds of $) – Same programming model, larger scale • Will need more parallelism (probably less than one order of magnitude) Off course, this does not prove anything – Mobile CPUs will become a viable alternative, but there‟s no guarantee that they will make it to mainstream HPC systems

Nikola Puzovic - CPU Alternatives for Future HPC Systems 28 Outline A little bit of history – From vector CPUs to commodity components

Killer-mobile processors – Overview of current trends for mobile CPUs

Our experiences – Low-power prototypes in BSC

Nikola Puzovic - CPU Alternatives for Future HPC Systems 29 BSC ARM-based prototype roadmap

Pedraforca:

ARM + GPU

Tibidabo: Integrated

ARM multicore ARM + GPU GFLOPS / W /GFLOPS

2011 2012 2013 2014 Prototypes are critical to accelerate software development – System software stack + applications

Nikola Puzovic - CPU Alternatives for Future HPC Systems 30 Tibidabo: The first ARM multicore cluster

Q7 Tegra 2 2 Racks 2 x Cortex-A9 @ 1GHz 32 blade containers 2 GFLOPS 256 nodes 5 Watts (?) 512 cores 0.4 GFLOPS / W 9x 48-port 1GbE switch

512 GFLOPS Q7 carrier board 2 x Cortex-A9 3.4 Kwatt 2 GFLOPS 0.15 GFLOPS / W 1 GbE + 100 MbE 7 Watts 0.3 GFLOPS / W

1U Rackable blade 8 nodes 16 GFLOPS 65 Watts 0.25 GFLOPS / W

Proof of concept – It is possible to deploy a cluster of smartphone processors Enable software stack development
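The Tibidabo figures show how energy efficiency drops at each level of packaging, from the SoC alone to the full two-rack system. A brief sketch that recomputes GFLOPS / W from the peak performance and power quoted on the slide (the level labels and the helper `gflops_per_watt` are illustrative):

```python
# (level, peak GFLOPS, power in Watts), as quoted on the slide.
LEVELS = [
    ("Q7 Tegra 2 module", 2.0, 5.0),   # 5 W is marked "(?)" on the slide
    ("Q7 carrier board (node)", 2.0, 7.0),
    ("1U blade (8 nodes)", 16.0, 65.0),
    ("2 racks (256 nodes)", 512.0, 3400.0),
]


def gflops_per_watt(gflops, watts):
    """Peak energy efficiency in GFLOPS per Watt."""
    return gflops / watts


for level, gflops, watts in LEVELS:
    print(f"{level}: {gflops_per_watt(gflops, watts):.2f} GFLOPS/W")
```

The values round to the 0.4, 0.3, 0.25 and 0.15 GFLOPS / W on the slide. Each level adds power (network interfaces, power conversion, switches) without adding floating-point throughput.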

Nikola Puzovic - CPU Alternatives for Future HPC Systems 31 Tibidabo: scalability and energy efficiency HPC applications scale out of the box on tibidabo – Strong scaling depends on the size of input set HPL – good weak scaling – 120 MFLOPS/Watt

Specfem3D – Improvements over cluster in energy efficiency (up to 3x)

D. Goddeke et. al. “Energy-efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster”, Journal of Computational Physics

Nikola Puzovic - CPU Alternatives for Future HPC Systems 32 Pedraforca: ARM+GPU cluster Stage One – Test cluster of CARMA kits • Tegra3 SoC • Quadro 1000M – 1 GbE interconnect

Stage Two – ARM multicore SoC (NVIDIA) – NVIDIA GPU

In progress…

Nikola Puzovic - CPU Alternatives for Future HPC Systems 33 Mont-Blanc project goals To develop an European Exascale approach Based on embedded power-efficient technology

Objetives – Develop a first prototype system, limited by available technology – Design a Next Generation system, to overcome the limitations – Develop a set of Exascale applications targeting the new system

Nikola Puzovic - CPU Alternatives for Future HPC Systems 34 Mont-Blanc prototype Exynos 5 Dual – Integrated CPU + GPU – Dual core Cortex-A15 + ARM Mali T604 GPU

Integrated GPU has many advantages – Shared memory with CPU • Even cache coherent! – No power wasted on PCIe bus – No power wasted on GDDR5 memory – Higher energy efficiency + lower cost

Nikola Puzovic - CPU Alternatives for Future HPC Systems 35 High density packaging architecture

Standard BullX blade enclosure Multiple compute nodes per blade – Additional level of interconnect, on-blade network

Deployment expected later this year

Nikola Puzovic - CPU Alternatives for Future HPC Systems 36

Are we building BlueGene again?

Yes ... – Exploit Pollack's Rule in presence of abundant parallelism • Many small cores vs. Single fast core

... and No – Heterogeneous computing • On-chip GPU – Commodity vs. Special purpose • Higher volume • Many vendors • Lower cost – Lots of room for improvement • No SIMD / vectors yet ... – Build on Europe's embedded strengths

Nikola Puzovic - CPU Alternatives for Future HPC Systems 37 Conclusions Vector vs Commodity Commodity vs Mobile

Killer mobile processors – Not yet there, but getting very close We will see a with mobile SoCs soon – Mont-Blanc prototype @ BSC – Question is if it will become mainstream

www.montblanc-project.eu MontBlancEU @MontBlanc_EU
