Mali-T880 Launch Slides

ARM: Power-efficient Compute for HPC CERN OCP HPC Conference Darren Cepulis [email protected] 1 CONFIDENTIAL Over 12B ARM Based Chips Shipped in 2014 2013 2014 Growth 5.1bn 5.4bn 300m Mobile 2.9bn 4.1bn 1.2b Embedded 1.8bn 1.9bn 100m Enterprise Home 0.6bn 0.6bn - ARM CPU Core Unit Shipments 2 CONFIDENTIAL ARM Architecture: Licensing Overview 1000+ 1B overall smartphones 350+ sold in 2013 customers licenses 50B 121 new 200+ chips licenses sold Cortex-M licenses, shipped in 2013 over 160 companies 10B 620 licenses ARM based sold in past 50 ARM chips shipped in five years v8-A licenses 2013 Hundreds of optimized system-on-chip solutions 3 CONFIDENTIAL 3 Extensible Architecture for Heterogeneous Multi-core Solutions Up to 24 I/O Up to 4 Virtualized Interrupts Heterogeneous processors – CPU, GPU, DSP and coherent cores per accelerators interfaces for cluster GIC-500 accelerators and I/O Cortex CPU Cortex CPU 10-40 or CHI or CHI DSPDSP GbE PCIe PCIe DSP SATA master master Cortex®-A72 Cortex-A72 Cortex-A72 Cortex-A72 DPI DPI Crypto USB ACE Up to 12 Cortex CPU Cortex CPU AHB NIC-400 coherent or CHI or CHI master master clusters Cortex-A53 Cortex-A53 Cortex-A53 Cortex-A53 I/O Virtualisation CoreLink MMU-500 CoreLink™ CCN-512 Cache Coherent Network 1-32MB L3 cache Snoop Filter Memory Memory Memory Memory Network Interconnect Network Interconnect Controller Controller Controller Controller Integrated NIC-400 NIC-400 L3 cache DMC-520 DMC-520 DMC-520 DMC-520 x72 x72 x72 x72 Flash SRAM GPIO PCIe DDR4-3200 DDR4-3200 DDR4-3200 DDR4-3200 Up to Quad channel DDR3/4 x72 Peripheral address space 4 CONFIDENTIAL ARM in Datacenter/HPC Compute . ARM-based SOC’s enable compelling solutions: Compelling Performance/Watt/$/RU . Optimized performance/watt, TCO performance/RU . Workload optimized balance of CPU, memory, cache, and IO . Application specific HW accelerators . Heterogeneous compute (DPS’s, FPGA’s, Choice Flexibility and Workload GPU’s, etc). Optimization and Innovation . Comprehensive SW Ecosystem . Built to existing datacenter standards . ARM platforms use standard motherboard form-factors or chassis/rack form-factors Standards . Standard interconnects (PCIe, RapidIO, Platform Management based and Deployment etc.) Ecosystem . Standard platform firmware abstractions for deployment and management 5 CONFIDENTIAL ARM: All the Pieces for HPC Engagements Partner 64-bit SOC’s • Global: NA, EU, Asia • Applied Micro X-Gene 1 and 2 • Exascale Projects, Nat’l Labs, and • AMD Seattle Universities • Cavium ThunderX • Broadcom Vulcan Ecosystem • HiSilicon • HPC Workloads, Math Libraries, • Several Other Confidentials… Compilers • OpenCL, OpenMP, OpenMPI, etc. 8 – 48 Cores + Accelerators Technologies • Performance Cores, Interconnects, GPU, DSP, Vector/Floating Point : Workloads • Partners and partner Accelerators Scale-out, HPC, Networking, Cloud, • Performance/Watt and Storage 6 CONFIDENTIAL ARM HPC Software Ecosystem Overview Compilers Analysis Tools Math Libraries Parallelism Parallel File Cluster Management and C/Fortran Libraries Systems Test SW X86 Ecosystem ICC, Pathscale, PGI, Parallel Studio MKL (BLAS), TBB, OpenMP, Lustre, Panasas HP CMU, Compare => NAG, GCC, Allinea, Rogue Wave BLIS, libATLAS, SequenceL, OpenSFS, IBM Platform LSF, Proprietary Cluster Studio XE, Eigen BLAS, DistrParallelism: HDFS, Ceph, Adaptive Computing (Moab) IBM Toolkit, Vtune ACML, FFTW OpenMPI/PGAS IBM GPFS Altair PBS Works Supercomputer Internal or proprietary Vendor or Customer Internal or Vendor or Customer Vendor SW Stack Vendor or Vendors & Labs Compilers SW Stack proprietary SW Stack Customer SW Stack Enterprise GCC, LLVM Allinea DDT Pathscale BLAS ARM OpenMP Lustre Client/Server IBM LSF supported, Volume HPC Pathscale EKOPath, Rogue Wave NAG Numerical GCC Libgomp, IBM GPFS HP CMU supported, (Commercial SW) NAG ARM DS-5 + PTP Libtbb Panasas Altair PBS and Moab OpenMPI HPC Labs, Universities GCC & Clang, Gfortran TAU, BLIS, FFTW GCC, OpenMP OpenSFS (Lustre) SLURM (Open Source SW) LLVM Perf, libATLAS, LLVM run-time Lustre Client/Server OpenLava (LSF fork) HPCToolkit EigenBLAS PGAS HDFS, CEPH, Gluster PTP OpenBLAS 7 CONFIDENTIAL Maximizing Throughput Density: per mm2, per Watt 1.2 20 Thread Workload ARM Solution Benefits: 105W rd 1 . Less than 1/3 the power for equivalent <30W performance 105W <30W . Allows more specialized computing or 0.8 significantly greater thread density in the same power budget 0.6 Comparison for equivalent number of threads . Platforms used: . Xeon-E5 2660 v3 10C20T platform (measured) . Xeon-E5 2650 v3 10C20T platform (measured) 0.4 . Gcc compiler v4.9 with –o3 flag 2.3 GHz 2.7 GHz 2.6 GHz 2.5 GHz . Estimated result on example 20C ARM Cortex platforms Relative performance (Spec2K6 rate) (Spec2K6 performance Relative with CCN-508, 28MB total L2+L3 cache 0.2 . per-core measurements on RTL with relevant memory system . Gcc compiler v4.9 with –o3 flag . Scaled to 20T based on modelled and empirical results 0 . Power estimated in 16nm based on ARM internal Xeon-E5 2650 V3 Cortex-A57 Cortex-A72 Xeon-E5 2660 V3 implementations for entire CPU+ interconnect (10 cores 20 threads) (20 cores 20 threads) (20 cores 20 threads) (10 cores 20 threads) 8 CONFIDENTIAL Cortex-A72: Ideal for dense compute environments Single Cortex-A72 core 2 Cortex-A72 MP4 + 2MB L23 ~1.15mm2 ~8mm2 Core A quad core Cortex-A72 Cortex-A72 is <20 % size Single Broadwell CPU + 256K1 L2 with 8x L2 cache RAM is 2 the same size ~8mm 1Source: Estimated from die-shot image provided by Intel at IDF 2014. 2/3Source: ARM trial implementations on TSMC 16FF+, using ARM Artisan libraries 9 CONFIDENTIAL Cortex-A72: More performance within constrained envelopes Single-thread Multi-thread Memory 1.8 1.6 Core-M 2 GHz (14FF) 1.4 Cortex-A72 2.5 GHz est (16nm) 1.2 4W <1W 1 0.8 0.6 0.4 0.2 0 Geekbench ST SPECint SPECfp Geekbench MT SPECintRate (4T) STREAM Add STREAM Copy STREAM Scale STREAM Triad (4T) . Intel workloads measured on Dell Venue Pro II. SPEC benchmarks measured using gcc compiler v4.9 with –o3 flag. Upside (scaled based on . Cortex-A72 measured on RTL with realistic memory system with the same compiler settings Xeon scaling) if core isn’t thermally throttled . Multi-threaded workloads use 2C4T Core-M CPU and estimated on 4C Cortex-A72 configuration w/2MB L2 cache. 10 CONFIDENTIAL ARM Partner SoC solutions c.2015 More ARM 64-bit solutions on the horizon… Source: Broadcom Presentation at IDC HPC USER FORUM APRIL 7, 2014 ARMv8-A Infrastructure Ecosystem Building Momentum Example End Users Key Applications Middleware Operating System, Virtualization & Firmware ACPI OEMs and ODMs ARM SoC 13 CONFIDENTIAL Empowering Enterprise Software Developers Developer Cloud Platforms Resources . Multiple options for software developers on ARM. 14 CONFIDENTIAL 14 PayPal – Real-time Data Analysis Application logs . Leverages HPC and Networking/Telco data- Data center metrics ARM plane Technology Server machine data General Purpose Java, Python, . 55W per cartridge (measured) => Metadata OpenCL, Open 11.2GFlops/watt Social media data. MPI . Top of Green500.ORG SuperComputer list is 4.4GFlops/W . SOC Specific Advantages 3M events/sec, 25Tb/hour . Right Sized General Purpose Compute TI DSP . TI DSP cores for high performance signal Signal processing processing performance performance . Low latency response times . An integrated, high-performance I/O fabric HP ProLiant m800 ARM Server 15 CONFIDENTIAL Questions 16 CONFIDENTIAL .

Mali-T880 Launch Slides

Swri IR&D Program 2016

A Functional Approach to Finding Answer Sets

Bryant Nelson

Designing Interdisciplinary Approaches to Problem Solving Into Computer Languages

Comparative Programming Languages CM20253

Your Software Faster, Better, Sooner

Exploring the Current Landscape of Programming Paradigm and Models

Automatic Parallelization Tools: a Review Varsha

Automatic Concurrency in Sequencel

AUTOMATIC DISTRIBUTED PROGRAMMING USING SEQUENCEL by Bryant K

A Parallel Communication Architecture for The

Sequencel Optimization Results on IBM POWER8 with Linux