www.bsc.es
CPU alternatives for future high-performance systems
Nikola Puzović Barcelona Supercomputing Center Motivation
Custom Commodity ???
Pure Vector CPUs Server CPUs HPC Accelerators
Nikola Puzovic - CPU Alternatives for Future HPC Systems 2 Outline A little bit of history – From vector CPUs to commodity components
Killer mobile processors – Overview of current trends for mobile CPUs
Our experiences – Low-power prototypes in BSC
Disclaimer: All references to unavailable products are speculative, taken from web sources. There is no commitment from ARM, Samsung, Intel, or others implied.
Nikola Puzovic - CPU Alternatives for Future HPC Systems 3 In the beginning ... there were only supercomputers
Built to order – Very few of them Special purpose hardware – Very expensive
Control Data Cray-1 – 1975, 160 MFLOPS • 80 units, 5-8 M$ Cray X-MP – 1982, 800 MFLOPS Cray-2 – 1985, 1.9 GFLOPS Cray Y-MP – 1988, 2.6 GFLOPS
...Fortran+ Vectorizing Compilers
Nikola Puzovic - CPU Alternatives for Future HPC Systems 4 Then, commodity took over special purpose
ASCI Red, Sandia ASCI White, Lawrence Livermore Lab. – 1997, 1 Tflops (Linpack), 9298 – 2001, 7.3 TFLOPS, 8192 proc. processors at 200 Mhz, 1.2 RS6000 at 375 Mhz, 6 Terabytes, Tbytes, 850 kWatts (3+3) MWatts – Intel Pentium Pro – IBM Power 3 • Upgraded to Pentium II Xeon, 1999, 3.1 Tflops
Message-Passing Programming Models
Nikola Puzovic - CPU Alternatives for Future HPC Systems 5 “Killer microprocessors”
10.000
Cray-1, Cray-C90
1000 NEC SX4, SX5 Alpha AV4, EV5
MFLOPS Intel Pentium 100 IBM P2SC HP PA8200
10 1974 1979 1984 1989 1994 1999
Microprocessors killed the Vector supercomputers – They were not faster ... – ... but they were significantly cheaper and greener 10 microprocessors approx. 1 Vector CPU – SIMD vs. MIMD programming paradigms
Nikola Puzovic - CPU Alternatives for Future HPC Systems 6 Finally, commodity hardware + commodity software
MareNostrum – Nov 2004, #4 Top500 • 20 Tflops, Linpack – IBM PowerPC 970 FX • Blade enclosure – Myrinet + 1 GbE network – SuSe Linux
Nikola Puzovic - CPU Alternatives for Future HPC Systems 7 2008 – 1 PFLOPS – IBM RoadRunner
Los Alamos National Laboratory (USA)
Hybrid architecture – 1 x AMD dual-core Master blade – 2 x PowerXCell 8i Worker blade
Hybrid MPI + Task off-load model
296 racks – 6.480 Opteron processors – 12.960 Cell processors • 128-bit SIMD Infiniband interconnect – 288-port switches
2.35 MWatt (425 MFLOPS / W)
Nikola Puzovic - CPU Alternatives for Future HPC Systems 8 2009 - Cray Jaguar (1.8 PFLOPS)
Oak Ridge National Laboratory (USA)
Multi-core architecture – Hybrid MPI + OpenMP programming
230 racks 224.256 AMD Opteron processors – 6 cores / chip Cray Seastar2+ interconnect – 3D-mesh using AMD Hypertransport
7 MWatt (257 MFLOPS / W)
Nikola Puzovic - CPU Alternatives for Future HPC Systems 9 2012 – Cray Titan (17.6 PFLOPS)
DOE/SC/Oak Ridge National Laboratory – Jaguar GPU upgrade
200 racks 224.256 Cray XK7 nodes – 16-core AMD Opteron – Nvidia Testa K20X GPU
8.2 Mwatts (2.142 MFLOPS/W)
Nikola Puzovic - CPU Alternatives for Future HPC Systems 10 Outline A little bit of history – From vector CPUs to commodity components
Killer-mobile processors – Overview of current trends for mobile CPUs
Our experiences – Low-power prototypes in BSC
Nikola Puzovic - CPU Alternatives for Future HPC Systems 11 The next step in the commodity chain
HPC
Servers
Desktop Total cores in Nov„12 Top500 – 14.9M Cores Tablets sold 2012 Mobile – > 100M Tablets Smartphones sold 2012 – > 712M Phones
Nikola Puzovic - CPU Alternatives for Future HPC Systems 12 Current trends in mobile CPUs We want to see how mobile SoCs behave with HPC apps – Current test systems have limited memory – Use a set of HPC-specific micro-kernels
Micro-kernels – Stress different architectural features – Cover a wide range of HPC application domains – Reduce porting effort to new architectures
Single core and multi-core evaluations – Goal is to see how mobile CPUs change through generations… – …and to compare them to modern HPC cores
Nikola Puzovic - CPU Alternatives for Future HPC Systems 13 ARM Cortex-A9 Smartphone CPU
OoO superscalar processor – Issue width of 4
VFP for 64-bit Floating Point – DP: 1 FMA each 2 cycles
The first ARM CPU truly usable for testing HPC workloads
Nikola Puzovic - CPU Alternatives for Future HPC Systems 14 NVIDIA Tegra2
Dual-core Cortex-A9 @ 1GHz – VFP for 64-bit Floating Point • 2 GFLOPS (1 FMA / 2 cycles)
Low-power Nvidia GPU – OpenGL only, CUDA not supported
Several (not useful for HPC) accelerators – Video encoder-decoder – Audio processor SECO Q7 board – Image processor
2 GFLOPS ~ 0.5 Watt
Nikola Puzovic - CPU Alternatives for Future HPC Systems 15 NVIDIA Tegra3
Quad-core Cortex-A9 @ 1.3GHz – VFP for 64-bit Floating Point • 5.2 GFLOPS (1 FMA / 2 cycles) – NEON for 32-bit floating Point SIMD
Low-power Nvidia GPU – 3x faster than Tegra2 • For graphics only SECO Q7 board – CUDA not supported
Nikola Puzovic - CPU Alternatives for Future HPC Systems 16 ARM Cortex-A15 Next generation of Cortex – Improved uArch – Improved performance – Virtualization support – Improved multiprocessing capabilities
Floating point performance – DP: 1 FMA per cycle
Nikola Puzovic - CPU Alternatives for Future HPC Systems 17 Samsung Exynos 5 Dual
Dual-core ARM Cortex-A15 @ (up to 1.7 GHz) – VFP for 64-bit Floating Point • 6.8 GFLOPS (1 FMA / cycle) – NEON for 32-bit floating Point SIMD Quad-core ARM Mali T604 – Compute capable • OpenCL 1.1 • 68 GFLOPS (SP) Shared memory between CPU and GPU
Nikola Puzovic - CPU Alternatives for Future HPC Systems 18 MicroKernels
Benchmark Properties Vector Operation (vecop) Common operation in regular codes Dense Matrix-Matrix Multiplication (dmmm) Common operation: measures data reuse and compute performance 3D stencil (3dstc) Strided memory accesses (7-point 3D stencil) 2D Convolution (2dcon) Spatial locality Fast Fourier Transform (fft) Peak floating-point, variable-stride accesses Reduction (red) Varying levels of parallelism (Scalar sum) Histogram (hist) Histogram with local privatisation, requires reduction stage Merge Sort (msort) Barrier operations N-Body (nbody) Irregular memory accesses Atomic Monte-Carlo Dynamics (amcd) Embarrassingly parallel: peak compute performance Sparse Vector-Matrix Multiplication (spwm) Load imbalance
Nikola Puzovic - CPU Alternatives for Future HPC Systems 19 Performance – Double precision FP (single core)
Performance normalized to Tegra2 board (single core) – Benefits due to increased frequency and improved micro-architecture – Better memory technology gives additional performance benefits
Nikola Puzovic - CPU Alternatives for Future HPC Systems 20 Performance – Double precision FP (multi-core)
Performance normalized to Tegra2 board (OpenMP) – Threads used: 2 in Tegra2, 4 in Tegra3, 2 in Exynos – New generation of ARM cores shows benefits despite smaller number of cores being used
Nikola Puzovic - CPU Alternatives for Future HPC Systems 21 Energy Efficiency – Double precision FP (multi-core)
Energy efficiency in GOps/Watt – Gains proportional to improvements in execution time – Increased frequency and complex architecture do add power… – …but “overhead” is too large for this to have an effect
Nikola Puzovic - CPU Alternatives for Future HPC Systems 22 Results – single core summary Double Precision Single Precision 5 5
4 4
3 3
2 2
Speedup Speedup 1 1
0 0 Tegra 2 Tegra 3 Exynos 5 Dual Tegra 2 Tegra 3 Exynos 5 Dual
Results as expected – T3 faster than T2 – frequency increase – Exynos faster than T3 – uArch improvements
Nikola Puzovic - CPU Alternatives for Future HPC Systems 23 ARMv8 architecture Cortex-A57 64-bit processor – Improved performance in all workloads • Up to 3x announced at same power budget – Interoperability with ARM Mali GPUs
Double Floating Point performance – DP support in the Neon instruction set • 128-bit words – Should double performance wrt Cortex-A15
big.LITTLE processing – Cortex-A53 – Low-performance, low-power companion on A57 – Can pair A57s with A53s
Nikola Puzovic - CPU Alternatives for Future HPC Systems 24 What if…
Double Precision (single core) 6
5
4
3
Speedup 2
1
0 Tegra 2 Tegra 3 Exynos 5 Dual ARMv8 ARMv8 could improve DFP performance – We keep the same frequency for our projection (pessimistic) – Increase the DP capability 2x (optimistic?) – Just a speculation, but could happen…
Nikola Puzovic - CPU Alternatives for Future HPC Systems 25 What about current HPC CPUs?
Double Precision (single core) 16 14 12
10 4.7 x 2.3 x 8
Speedup 6 4 2 0 Tegra 2 Tegra 3 Exynos 5 ARM v8 Sandy Dual Bridge Comparison with single core of Intel SandyBridge-EP E5-2670 @ 2.6 GHz
Current “mobile champion” still far away – Next generation mobile CPUs could bring significant improvements – Also, back in time, SX5 was significantly faster than Pentium II…
Nikola Puzovic - CPU Alternatives for Future HPC Systems 26 The Killer Mobile processorsTM
1.000.000
Alpha 100.000
Intel AMD 10.000
Nvidia Tegra MFLOPS Samsung Exynos 1.000 4-core ARMv8 1.5 GHz
100 1990 1995 2000 2005 2010 2015
History may be about to repeat itself … – Mobile processor are not faster … – … but they are significantly cheaper and greener
Nikola Puzovic - CPU Alternatives for Future HPC Systems 27 Then and now
Then: Vector vs Commodity Now: Commodity vs Mobile
Today‟s situation looks very familiar – “Mobile vs. Server” similar to “Server vs. Vector” – Significantly lower cost of mobile CPUs (thousands vs hundreds of $) – Same programming model, larger scale • Will need more parallelism (probably less than one order of magnitude) Off course, this does not prove anything – Mobile CPUs will become a viable alternative, but there‟s no guarantee that they will make it to mainstream HPC systems
Nikola Puzovic - CPU Alternatives for Future HPC Systems 28 Outline A little bit of history – From vector CPUs to commodity components
Killer-mobile processors – Overview of current trends for mobile CPUs
Our experiences – Low-power prototypes in BSC
Nikola Puzovic - CPU Alternatives for Future HPC Systems 29 BSC ARM-based prototype roadmap
Pedraforca:
ARM + GPU
Tibidabo: Integrated
ARM multicore ARM + GPU GFLOPS / W /GFLOPS
2011 2012 2013 2014 Prototypes are critical to accelerate software development – System software stack + applications
Nikola Puzovic - CPU Alternatives for Future HPC Systems 30 Tibidabo: The first ARM multicore cluster
Q7 Tegra 2 2 Racks 2 x Cortex-A9 @ 1GHz 32 blade containers 2 GFLOPS 256 nodes 5 Watts (?) 512 cores 0.4 GFLOPS / W 9x 48-port 1GbE switch
512 GFLOPS Q7 carrier board 2 x Cortex-A9 3.4 Kwatt 2 GFLOPS 0.15 GFLOPS / W 1 GbE + 100 MbE 7 Watts 0.3 GFLOPS / W
1U Rackable blade 8 nodes 16 GFLOPS 65 Watts 0.25 GFLOPS / W
Proof of concept – It is possible to deploy a cluster of smartphone processors Enable software stack development
Nikola Puzovic - CPU Alternatives for Future HPC Systems 31 Tibidabo: scalability and energy efficiency HPC applications scale out of the box on tibidabo – Strong scaling depends on the size of input set HPL – good weak scaling – 120 MFLOPS/Watt
Specfem3D – Improvements over x86 cluster in energy efficiency (up to 3x)
D. Goddeke et. al. “Energy-efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster”, Journal of Computational Physics
Nikola Puzovic - CPU Alternatives for Future HPC Systems 32 Pedraforca: ARM+GPU cluster Stage One – Test cluster of CARMA kits • Tegra3 SoC • Quadro 1000M – 1 GbE interconnect
Stage Two – ARM multicore SoC (NVIDIA) – NVIDIA GPU
In progress…
Nikola Puzovic - CPU Alternatives for Future HPC Systems 33 Mont-Blanc project goals To develop an European Exascale approach Based on embedded power-efficient technology
Objetives – Develop a first prototype system, limited by available technology – Design a Next Generation system, to overcome the limitations – Develop a set of Exascale applications targeting the new system
Nikola Puzovic - CPU Alternatives for Future HPC Systems 34 Mont-Blanc prototype Exynos 5 Dual – Integrated CPU + GPU – Dual core Cortex-A15 + ARM Mali T604 GPU
Integrated GPU has many advantages – Shared memory with CPU • Even cache coherent! – No power wasted on PCIe bus – No power wasted on GDDR5 memory – Higher energy efficiency + lower cost
Nikola Puzovic - CPU Alternatives for Future HPC Systems 35 High density packaging architecture
Standard BullX blade enclosure Multiple compute nodes per blade – Additional level of interconnect, on-blade network
Deployment expected later this year
Nikola Puzovic - CPU Alternatives for Future HPC Systems 36
Are we building BlueGene again?
Yes ... – Exploit Pollack's Rule in presence of abundant parallelism • Many small cores vs. Single fast core
... and No – Heterogeneous computing • On-chip GPU – Commodity vs. Special purpose • Higher volume • Many vendors • Lower cost – Lots of room for improvement • No SIMD / vectors yet ... – Build on Europe's embedded strengths
Nikola Puzovic - CPU Alternatives for Future HPC Systems 37 Conclusions Vector vs Commodity Commodity vs Mobile
Killer mobile processors – Not yet there, but getting very close We will see a supercomputer with mobile SoCs soon – Mont-Blanc prototype @ BSC – Question is if it will become mainstream
www.montblanc-project.eu MontBlancEU @MontBlanc_EU
Nikola Puzovic - CPU Alternatives for Future HPC Systems 38