ARM: Power-efficient Compute for HPC

CERN OCP HPC Conference Darren Cepulis [email protected]

1 CONFIDENTIAL

Over 12B ARM Based Chips Shipped in 2014

2013 2014 Growth

5.1bn 5.4bn 300m Mobile

2.9bn 4.1bn 1.2b Embedded

1.8bn 1.9bn 100m Enterprise

Home 0.6bn 0.6bn -

ARM CPU Core Unit Shipments

2 CONFIDENTIAL

ARM Architecture: Licensing Overview

1000+ 1B overall smartphones 350+ sold in 2013 customers licenses

50B 121 new 200+ chips licenses sold Cortex-M licenses, shipped in 2013 over 160 companies

10B 620 licenses ARM based sold in past 50 ARM chips shipped in five years v8-A licenses 2013

Hundreds of optimized system-on-chip solutions

3 CONFIDENTIAL 3 Extensible Architecture for Heterogeneous Multi-core Solutions Up to 24 I/O Up to 4 Virtualized Interrupts Heterogeneous processors – CPU, GPU, DSP and coherent cores per accelerators interfaces for cluster GIC-500 accelerators and I/O Cortex CPU Cortex CPU 10-40 or CHI or CHI DSPDSP GbE PCIe PCIe DSP SATA master master Cortex®-A72 Cortex-A72 Cortex-A72 Cortex-A72 DPI DPI Crypto USB ACE Up to 12 Cortex CPU Cortex CPU AHB NIC-400 coherent or CHI or CHI master master clusters Cortex-A53 Cortex-A53 Cortex-A53 Cortex-A53 I/O Virtualisation CoreLink MMU-500

CoreLink™ CCN-512 Cache Coherent Network

1-32MB L3 cache Snoop Filter

Memory Memory Memory Memory Network Interconnect Network Interconnect Controller Controller Controller Controller Integrated NIC-400 NIC-400 L3 cache DMC-520 DMC-520 DMC-520 DMC-520 x72 x72 x72 x72 Flash SRAM GPIO PCIe DDR4-3200 DDR4-3200 DDR4-3200 DDR4-3200 Up to Quad channel DDR3/4 x72 Peripheral address space

4 CONFIDENTIAL

ARM in Datacenter/HPC Compute

. ARM-based SOC’s enable compelling solutions: Compelling Performance/Watt/$/RU . Optimized performance/watt, TCO performance/RU . Workload optimized balance of CPU, memory, cache, and IO . Application specific HW accelerators . Heterogeneous compute (DPS’s, FPGA’s, Choice Flexibility and Workload GPU’s, etc). Optimization and Innovation . Comprehensive SW Ecosystem . Built to existing datacenter standards . ARM platforms use standard motherboard form-factors or chassis/rack form-factors Standards . Standard interconnects (PCIe, RapidIO, Platform Management based and Deployment etc.) Ecosystem . Standard platform firmware abstractions for deployment and management

5 CONFIDENTIAL

ARM: All the Pieces for HPC

Engagements Partner 64-bit SOC’s • Global: NA, EU, Asia • Applied Micro X-Gene 1 and 2 • Exascale Projects, Nat’l Labs, and • AMD Seattle Universities • Cavium ThunderX • Broadcom Vulcan Ecosystem • HiSilicon • HPC Workloads, Math Libraries, • Several Other Confidentials… • OpenCL, OpenMP, OpenMPI, etc.

8 – 48 Cores + Accelerators Technologies • Performance Cores, Interconnects, GPU, DSP, Vector/Floating Point : Workloads • Partners and partner Accelerators Scale-out, HPC, Networking, Cloud, • Performance/Watt and Storage

6 CONFIDENTIAL

ARM HPC Software Ecosystem Overview

Compilers Analysis Tools Math Libraries Parallelism Parallel File Cluster Management and / Libraries Systems Test SW

X86 Ecosystem ICC, Pathscale, PGI, Parallel Studio MKL (BLAS), TBB, OpenMP, Lustre, Panasas HP CMU, Compare => NAG, GCC, Allinea, Rogue Wave BLIS, libATLAS, SequenceL, OpenSFS, IBM Platform LSF, Proprietary Cluster Studio XE, Eigen BLAS, DistrParallelism: HDFS, Ceph, Adaptive Computing (Moab) IBM Toolkit, Vtune ACML, FFTW OpenMPI/PGAS IBM GPFS Altair PBS Works Supercomputer Internal or proprietary Vendor or Customer Internal or Vendor or Customer Vendor SW Stack Vendor or Vendors & Labs Compilers SW Stack proprietary SW Stack Customer SW Stack

Enterprise GCC, LLVM Allinea DDT Pathscale BLAS ARM OpenMP Lustre Client/Server IBM LSF supported, Volume HPC Pathscale EKOPath, Rogue Wave NAG Numerical GCC Libgomp, IBM GPFS HP CMU supported, (Commercial SW) NAG ARM DS-5 + PTP Libtbb Panasas Altair PBS and Moab OpenMPI HPC Labs, Universities GCC & Clang, Gfortran TAU, BLIS, FFTW GCC, OpenMP OpenSFS (Lustre) SLURM (Open Source SW) LLVM Perf, libATLAS, LLVM run-time Lustre Client/Server OpenLava (LSF fork) HPCToolkit EigenBLAS PGAS HDFS, CEPH, Gluster PTP OpenBLAS

7 CONFIDENTIAL

Maximizing Throughput Density: per mm2, per Watt

1.2 20 Workload ARM Solution Benefits: 105W rd 1 . Less than 1/3 the power for equivalent <30W performance 105W <30W . Allows more specialized computing or 0.8 significantly greater thread density in the same power budget 0.6 Comparison for equivalent number of threads

. Platforms used: . Xeon-E5 2660 v3 10C20T platform (measured) . Xeon-E5 2650 v3 10C20T platform (measured) 0.4 . Gcc v4.9 with –o3 flag 2.3 GHz 2.7 GHz 2.6 GHz 2.5 GHz . Estimated result on example 20C ARM Cortex platforms Relative performance (Spec2K6 rate) (Spec2K6 performance Relative with CCN-508, 28MB total L2+L3 cache 0.2 . per-core measurements on RTL with relevant memory system . Gcc compiler v4.9 with –o3 flag . Scaled to 20T based on modelled and empirical results 0 . Power estimated in 16nm based on ARM internal Xeon-E5 2650 V3 Cortex-A57 Cortex-A72 Xeon-E5 2660 V3 implementations for entire CPU+ interconnect

(10 cores 20 threads) (20 cores 20 threads) (20 cores 20 threads) (10 cores 20 threads)

8 CONFIDENTIAL

Cortex-A72: Ideal for dense compute environments

Single Cortex-A72 core 2 Cortex-A72 MP4 + 2MB L23 ~1.15mm2 ~8mm2

Core

A quad core Cortex-A72 Cortex-A72 is <20 % size Single Broadwell CPU + 256K1 L2 with 8x L2 cache RAM is ~8mm2 the same size

1Source: Estimated from die-shot image provided by Intel at IDF 2014. 2/3Source: ARM trial implementations on TSMC 16FF+, using ARM Artisan libraries 9 CONFIDENTIAL

Cortex-A72: More performance within constrained envelopes

Single-thread Multi-thread Memory 1.8

1.6 Core-M 2 GHz (14FF)

1.4 Cortex-A72 2.5 GHz est (16nm)

1.2 4W <1W 1

0.8

0.6

0.4

0.2

0 Geekbench ST SPECint SPECfp Geekbench MT SPECintRate (4T) STREAM Add STREAM Copy STREAM Scale STREAM Triad (4T)

. Intel workloads measured on Dell Venue Pro II. SPEC benchmarks measured using gcc compiler v4.9 with –o3 flag. Upside (scaled based on . Cortex-A72 measured on RTL with realistic memory system with the same compiler settings Xeon scaling) if core isn’t thermally throttled . Multi-threaded workloads use 2C4T Core-M CPU and estimated on 4C Cortex-A72 configuration w/2MB L2 cache.

10 CONFIDENTIAL

ARM Partner SoC solutions c.2015 More ARM 64-bit solutions on the horizon…

Source: Broadcom Presentation at IDC HPC USER FORUM APRIL 7, 2014 ARMv8-A Infrastructure Ecosystem Building Momentum

Example End Users

Key Applications Middleware

Operating System, Virtualization & Firmware ACPI

OEMs and ODMs

ARM SoC

13 CONFIDENTIAL

Empowering Enterprise Software Developers

Developer Cloud Platforms Resources

. Multiple options for software developers on ARM.

14 CONFIDENTIAL 14 PayPal – Real-time Data Analysis

Application logs . Leverages HPC and Networking/Telco data- Data center metrics ARM plane Technology Server machine data General Purpose Java, Python, . 55W per cartridge (measured) => Metadata OpenCL, Open 11.2GFlops/watt Social media data. MPI . Top of Green500.ORG SuperComputer list is 4.4GFlops/W . SOC Specific Advantages 3M events/sec, 25Tb/hour . Right Sized General Purpose Compute TI DSP . TI DSP cores for high performance signal Signal processing processing performance performance . Low latency response times . An integrated, high-performance I/O fabric

HP ProLiant m800 ARM Server 15 CONFIDENTIAL

Questions

16 CONFIDENTIAL