Introduchon to Arm for Network Stack Developers

Introducon to Arm for network stack developers Pavel Shamis/Pasha Principal Research Engineer Mvapich User Group 2017 © 2017 Arm Limited Columbus, OH Outline • Arm Overview • HPC SoLware Stack • Porng on Arm • Evaluaon 2 © 2017 Arm Limited Arm Overview © 2017 Arm Limited An introduc1on to Arm Arm is the world's leading semiconductor intellectual property supplier. We license to over 350 partners, are present in 95% of smart phones, 80% of digital cameras, 35% of all electronic devices, and a total of 60 billion Arm cores have been shipped since 1990. Our CPU business model: License technology to partners, who use it to create their own system-on-chip (SoC) products. We may license an instrucBon set architecture (ISA) such as “ARMv8-A”) or a specific implementaon, such as “Cortex-A72”. …and our IP extends beyond the CPU Partners who license an ISA can create their own implementaon, as long as it passes the compliance tests. 4 © 2017 Arm Limited A partnership business model A business model that shares success Business Development • Everyone in the value chain benefits Arm Licenses technology to Partner • Long term sustainability SemiCo Design once and reuse is fundamental IP Partner Licence fee • Spread the cost amongst many partners Provider • Technology reused across mulBple applicaons Partners develop • Creates market for ecosystem to target chips – Re-use is also fundamental to the ecosystem Royalty Upfront license fee OEM • Covers the development cost Customer Ongoing royalBes OEM sells • Typically based on a percentage of chip price consumer products • Vested interest in success of customers Approximately 1350 licenses More than 440 potenBal 14.8bn+ Arm-powered chips in Grows by ~120 every year royalty payers 2015 5 © 2017 Arm Limited Partnership 6 © 2017 Arm Limited Range of SoCs addressing infrastructure Highly Accelerated Massively MulBcore QorIQ® Layerscape 2080A One size does not fit all 7 © 2017 Arm Limited Serious Arm HPC deployments star1ng in 2017 Two big announcements about Arm in HPC in Europe: 8 © 2017 Arm Limited Japan slides from Fujitsu at ISC’16 9 © 2017 Arm Limited SoGware Stack © 2017 Arm Limited Parallelism to enable opmal HPC performance § OpenMP § We are adding enhancements to the LLVM OpenMP implementaon to get beler AArch64 performance § Arm is acBve member of the OpenMP Standards Commilee § Auto-vectorizaon § Arm acBvely works on vectorizaon in GCC and LLVM, and encourages work with vectorizaon support in the compiler community. § PathScale’s compiler has vectorizaon support built in 11 © 2017 Arm Limited Open source in the Arm HPC ecosystem Over the past 12 months many more packages and applicaons have been successfully ported to Arm HPC 12 © 2017 Arm Limited Linux / FreeBSD w/ AARCH64 support 12.04LTS & 14.04LTS ß Also 14.10 & 15.04 releases Debian 8 adds AARCH64 – April 2015 released Fedora 22 released – May 2015 Red Hat Enterprise Linux Server for Arm CentOS Linux 7 for AArch64 Fedora 23 released – nov 2015 7.2 BETA – Sept, 2015 GA – August 2015 OpenSUSE 13.2 – nov 2014 SUSE Launches Partner Program to Bring SUSE Linux Enterprise 12 to 64-bit Arm July 2015 @ ISC ß Engaged with FreeBSD foundaon / Semi-half & Cavium to get FreeBSD on ARMv8 FreeBSD Beta version demo’d by Semihalf – nov. 2015 13 © 2017 Arm Limited HPC filesystems Soware Status LUSTRE Ported HDFS Ported CEPH Ported BeeGFS Ported 14 © 2017 Arm Limited Workload and cluster managers Soware Status IBM LSF Ported HP CMU Ported SLURM Ported AdapBve CompuBng (Moab) Ported Altair PBS Works Ported OpenLava (LSF port) Ported 15 © 2017 Arm Limited Compilers © 2017 Arm Limited Open source and commercial compilers § GCC § NAG § C, C++, Fortran § Fortran § OpenMP 4.0 § OpenMP 3.1 § LLVM § C, C++, Fortran § Arm C/C++/Fortran § OpenMP 3.1, (4.0 coming soon) Compiler § Fortran coming Q1 2017 § LLVM based § Includes SVE 17 © 2017 Arm Limited Arm C/C++ Compiler Commercially supported C/C++/Fortran compiler for Linux user-space HPC applicaons LLVM-based § Arm-on-Arm compiler § For applicaon development (not bare-metal/embedded) Regularly pulls from upstream LLVM, adding: § SVE support in the assembler, disassembler, intrinsics and autovectorizer § Compiler Insights to support Arm Code Advisor OpenMP § Uses latest open source (now Arm-opBmized) LLVM OpenMP runBme § Changes pushed back to the community 18 © 2017 Arm Limited Arm C/C++/Fortran Compiler Linux user-space compiler tailored for HPC on Arm • Maintained and supported by Arm for a wide range of Arm-based SoCs running leading Linux distribuBons Commercially supported by Arm • Based on LLVM, the leading compiler framework Latest features go into the commercial releases first • Ahead of upstream LLVM by up to an year with latest performance improvement Latest features and patches performance opBmizaons • SVE support in the assembler, disassembler, intrinsics and autovectorizer OpenMP • Uses latest open source (now Arm-opBmized) LLVM OpenMP runme OpBmized OpenMP • Changes pushed back to the community 19 © 2017 Arm Limited Arm C/C++ Compiler – usage To compile C code: % armclang -O3 file.c –o file To compile C++ code: % armclang++ -O3 file.cpp –o file 20 © 2017 Arm Limited Common armclang opons Flag Descripon --help Describes the most common opBons supported by Arm C/C++ Compiler --vsn Displays version informaon and license details --version -O<level> Specifies the level of opBmizaon to use when compiling source files. The default is -O0 -c Performs the compilaon step, but does not perform the link step. Produces an ELF object .o file. Run armclang again, passing in the object files to link -o <file> Specifies the name of the output file -fopenmp Use OpenMP -S Outputs assembly code, rather than object code. Produces a text .s file containing annotated assembly code 21 © 2017 Arm Limited Experimental tools to support SVE With Arm HPC Compiler, InstrucBon Emulator and Code Advisor Compile Emulate Analyse Arm HPC Compiler InstrucBon Emulator Code Advisor C/C++/Fortran Runs userspace binaries Console or web-based output shows prioriBzed for future Arm advice in-line with original source code. SVE via auto-vectorizaon, architectures on today’s intrinsics and assembly. systems. Compiler Insight: Compiler Supported instrucBons places results of compile- run unmodified. Bme decisions and analysis in the resulBng binary. Unsupported instrucBons are trapped and emulated. 22 © 2017 Arm Limited Arm Instruc1on Emulator Run SVE binaries at near nave speed on exisBng Armv8-A hardware Trap-and-emulate of illegal userspace instrucBons Armv8-A Binary Armv8-A SVE Binary § For applicaon development (not bare-metal/embedded like Arm Fast Models) § navely supported instrucBons run at full speed § Unsupported instrucBons are faithfully emulated in soLware Arm InstrucBon Emulator Converts SVE Full integraon with Arm Code Advisor instrucBons to § nave Armv8-A Plugin allows Arm InstrucBon Emulator to provide hotspot instrucBons informaon and other metrics § Command-line integraon allows Arm Code Advisor Linux workflows to seamlessly integrate with Arm InstrucBon Emulator Armv8-A 23 © 2017 Arm Limited Arm Code Advisor Combines stac and dynamic informaon to produce acBonable insights Performance Advice § Compiler vectorizaon hints § Compilaon flags advice § Fortran subarray warnings OpenMP instrumentaon § Insights from compilaon and runBme § Compiler Insights are embedded into the applicaon binary by the Arm HPC Compilers § OMPT interface used to instrument OpenMP runme Extensible Architecture § Users can write plugins to add their own analysis informaon § Data accessible via command-line, web browser and REST API to support new user interfaces 24 © 2017 Arm Limited Libraries © 2017 Arm Limited – Now on Arm OpenHPC is a community effort to provide a common, Func1onal Areas Components include verified set of open source packages for HPC Base OS RHEL/CentOS 7.1, SLES 12 deployments Administrave Conman, Ganglia, Lmod, LosF, ORCM, nagios, pdsh, Tools prun Arm’s parBcipaon: Provisioning Warewulf Resource Mgmt. SLURM, Munge. Altair PBS Pro* • Silver member of OpenHPC I/O Services Lustre client (community version) • Arm is on the OpenHPC Technical Steering Commilee numerical/ScienBfic Boost, GSL, FFTW, MeBs, PETSc, Trilinos, Hypre, Libraries SuperLU, Mumps in order to drive Arm build support I/O Libraries HDF5 (pHDF5), netCDF (including C++ and Fortran interfaces), Adios Status: 1.3.1 release out now Compiler Families GnU (gcc, g++, gfortran) MPI Families OpenMPI, MVAPICH2 • All packages built on Armv8-A for CentOS and SUSE Development Tools Autotools (autoconf, automake, libtool), Valgrind,R, SciPy/NumPy • Arm-based machines are being used for building and Performance Tools PAPI, Intel IMB, mpiP, pdtoolkit TAU also in the OpenHPC build infrastructure 26 © 2017 Arm Limited Open source library AArch64 inbuilt tuning work OpenBLAS • ARMv8 kernels included BLIS • BLIS developers have close relaonship with Arm • BLIS supports various Arm processors by default (e.g. Arm Cortex-A53, Cortex-A57 CPUs) • Also currently conducBng Arm big.LITTLE development ATLAS • Work ongoing with Arm Research team • Cortex-A57/A53 patches went into ATLAS FFTW • Just works. nEOn opBons built into v3.3.5 27 © 2017 Arm Limited Arm Performance Libraries OpBmized BLAS, LAPACK and FFT Commercial 64-bit Armv8-A math libraries • Commonly used low-level math rouBnes - BLAS, LAPACK and FFT Performance on par with best-in-class math libraries • Validated with nAG’s test suite, a de-facto standard Best-in-class performance with commercial support • Tuned by Arm for Cortex-A72, Cortex-A57

Introduchon to Arm for Network Stack Developers

ARM HPC Ecosystem

0 BLIS: a Modern Alternative to the BLAS

18 Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance

0 BLIS: a Framework for Rapidly Instantiating BLAS Functionality

DD2358 – Introduction to HPC Linear Algebra Libraries & BLAS

Sun Microsystems: Sun Java Workstation W1100z

Cray Scientific Libraries Overview

The BLAS API of BLASFEO: Optimizing Performance for Small Matrices

Accelerating the “Motifs” in Machine Learning on Modern Processors

Sun Microsystems

Best Practice Guide - Generic X86 Vegard Eide, NTNU Nikos Anastopoulos, GRNET Henrik Nagel, NTNU 02-05-2013

Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices