Intel(R) Advanced Vector Extensions Programming Reference

Total Page:16

File Type:pdf, Size:1020Kb

Intel(R) Advanced Vector Extensions Programming Reference Intel® Advanced Vector Extensions Programming Reference 319433-011 JUNE 2011 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANT- ED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel may make changes to specifications and product descriptions at any time, without notice. Developers must not rely on the absence or characteristics of any features or instructions marked “re- served” or “undefined.” Improper use of reserved or undefined features or instructions may cause unpre- dictable behavior or failure in developer's software code when running on an Intel processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use. The Intel® 64 architecture processors may contain design defects or errors known as errata. Current char- acterized errata are available on request. Hyper-Threading Technology requires a computer system with an Intel® processor supporting Hyper- Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information, see http://www.in- tel.com/technology/hyperthread/index.htm; including details on which processors support HT Technology. Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and for some uses, certain platform software enabled for it. Functionality, perfor- mance or other benefits will vary depending on hardware and software configurations. Intel® Virtualization Technology-enabled BIOS and VMM applications are currently in development. 64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, oper- ating system, device drivers and applications enabled for Intel® 64 architecture. Processors will not operate (including 32-bit operation) without an Intel® 64 architecture-enabled BIOS. Performance will vary depend- ing on your hardware and software configurations. Consult with your system vendor for more information. Intel, Pentium, Intel Atom, Intel Xeon, Intel NetBurst, Intel Core, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo, Intel Core 2 Extreme, Intel Pentium D, Itanium, Intel SpeedStep, MMX, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained from: Intel Corporation P.O. Box 5937 Denver, CO 80217-9808 or call 1-800-548-4725 or visit Intel’s website at http://www.intel.com Copyright © 1997-2011 Intel Corporation ii Ref. # 319433-011 CONTENTS PAGE CHAPTER 1 INTEL® ADVANCED VECTOR EXTENSIONS 1.1 About This Document . 1-1 1.2 Overview . 1-1 1.3 Intel® Advanced Vector Extensions Architecture Overview . 1-2 1.3.1 256-Bit Wide SIMD Register Support . 1-2 1.3.2 Instruction Syntax Enhancements . 1-3 1.3.3 VEX Prefix Instruction Encoding Support . 1-4 1.4 Overview AVX2. 1-4 1.5 Functional Overview . 1-5 1.5.1 256-bit Floating-Point Arithmetic Processing Enhancements . 1-5 1.5.2 256-bit Non-Arithmetic Instruction Enhancements. 1-5 1.5.3 Arithmetic Primitives for 128-bit Vector and Scalar processing . 1-6 1.5.4 Non-Arithmetic Primitives for 128-bit Vector and Scalar Processing . 1-6 1.5.5 AVX2 and 256-bit Vector Integer Processing. 1-7 1.6 General Purpose Instruction Set Enhancements. 1-8 CHAPTER 2 APPLICATION PROGRAMMING MODEL 2.1 Detection of PCLMULQDQ and AES Instructions. 2-1 2.2 Detection of AVX and FMA Instructions . 2-1 2.2.1 Detection of FMA . 2-3 2.2.2 Detection of VEX-Encoded AES and VPCLMULQDQ. 2-4 2.2.3 Detection of AVX2 . 2-6 2.2.4 Detection VEX-encoded GPR Instructions . 2-7 2.3 Fused-Multiply-ADD (FMA) Numeric Behavior . 2-7 2.3.1 FMA Instruction Operand Order and Arithmetic Behavior . 2-11 2.4 Accessing YMM Registers. 2-12 2.5 Memory alignment . 2-13 2.6 SIMD floating-point ExCeptions . 2-15 2.7 Instruction Exception Specification. 2-15 2.7.1 Exceptions Type 1 (Aligned memory reference) . 2-21 2.7.2 Exceptions Type 2 (>=16 Byte Memory Reference, Unaligned) . 2-22 2.7.3 Exceptions Type 3 (<16 Byte memory argument). 2-23 2.7.4 Exceptions Type 4 (>=16 Byte mem arg no alignment, no floating-point exceptions) . 2-24 2.7.5 Exceptions Type 5 (<16 Byte mem arg and no FP exceptions). 2-25 2.7.6 Exceptions Type 6 (VEX-Encoded Instructions Without Legacy SSE Analogues). 2-26 2.7.7 Exceptions Type 7 (No FP exceptions, no memory arg). 2-27 2.7.8 Exceptions Type 8 (AVX and no memory argument) . 2-27 2.7.9 Exception Type 11 (VEX-only, mem arg no AC, floating-point exceptions) . 2-28 2.7.10 Exception Type 12 (VEX-only, VSIB mem arg, no AC, no floating-point exceptions). 2-29 2.7.11 Exception Conditions for VEX-Encoded GPR Instructions . 2-30 2.8 Programming Considerations with 128-bit SIMD Instructions . 2-31 2.8.1 Clearing Upper YMM State Between AVX and Legacy SSE Instructions . 2-32 i Ref. # 319433-011 2.8.2 Using AVX 128-bit Instructions Instead of Legacy SSE instructions . 2-33 2.8.3 Unaligned Memory Access and Buffer Size Management . 2-33 2.9 CPUID Instruction . 2-34 CPUID—CPU Identification . 2-34 CHAPTER 3 SYSTEM PROGRAMMING MODEL 3.1 YMM State, VEX Prefix and Supported Operating Modes . 3-1 3.2 YMM State Management . 3-2 3.2.1 Detection of YMM State Support. 3-2 3.2.2 Enabling of YMM State . 3-2 3.2.3 Enabling of SIMD Floating-Exception Support . 3-3 3.2.4 The Layout of XSAVE Area . 3-4 3.2.5 XSAVE/XRSTOR Interaction with YMM State and MXCSR. 3-5 3.2.6 Processor Extended State Save Optimization and XSAVEOPT . 3-7 3.2.6.1 XSAVEOPT Usage Guidelines . 3-8 3.3 Reset Behavior . 3-9 3.4 Emulation. 3-9 3.5 Writing AVX floating-point exception handlers. 3-9 CHAPTER 4 INSTRUCTION FORMAT 4.1 Instruction Formats . 4-1 4.1.1 VEX and the LOCK prefix . 4-2 4.1.2 VEX and the 66H, F2H, and F3H prefixes. 4-2 4.1.3 VEX and the REX prefix . 4-2 4.1.4 The VEX Prefix . 4-2 4.1.4.1 VEX Byte 0, bits[7:0]. 4-4 4.1.4.2 VEX Byte 1, bit [7] - ‘R’ . 4-4 4.1.4.3 3-byte VEX byte 1, bit[6] - ‘X’. ..
Recommended publications
  • 07 Vectorization for Intel C++ & Fortran Compiler .Pdf
    Vectorization for Intel® C++ & Fortran Compiler Presenter: Georg Zitzlsberger Date: 10-07-2015 1 Agenda • Introduction to SIMD for Intel® Architecture • Compiler & Vectorization • Validating Vectorization Success • Reasons for Vectorization Fails • Intel® Cilk™ Plus • Summary 2 Optimization Notice Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Vectorization • Single Instruction Multiple Data (SIMD): . Processing vector with a single operation . Provides data level parallelism (DLP) . Because of DLP more efficient than scalar processing • Vector: . Consists of more than one element . Elements are of same scalar data types (e.g. floats, integers, …) • Vector length (VL): Elements of the vector A B AAi i BBi i A B Ai i Bi i Scalar Vector Processing + Processing + C CCi i C Ci i VL 3 Optimization Notice Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Evolution of SIMD for Intel Processors Present & Future: Goal: Intel® MIC Architecture, 8x peak FLOPs (FMA) over 4 generations! Intel® AVX-512: • 512 bit Vectors • 2x FP/Load/FMA 4th Generation Intel® Core™ Processors Intel® AVX2 (256 bit): • 2x FMA peak Performance/Core • Gather Instructions 2nd Generation 3rd Generation Intel® Core™ Processors Intel® Core™ Processors Intel® AVX (256 bit): • Half-float support • 2x FP Throughput • Random Numbers • 2x Load Throughput Since 1999: Now & 2010 2012 2013 128 bit Vectors Future Time 4 Optimization Notice
    [Show full text]
  • Effective Virtual CPU Configuration with QEMU and Libvirt
    Effective Virtual CPU Configuration with QEMU and libvirt Kashyap Chamarthy <[email protected]> Open Source Summit Edinburgh, 2018 1 / 38 Timeline of recent CPU flaws, 2018 (a) Jan 03 • Spectre v1: Bounds Check Bypass Jan 03 • Spectre v2: Branch Target Injection Jan 03 • Meltdown: Rogue Data Cache Load May 21 • Spectre-NG: Speculative Store Bypass Jun 21 • TLBleed: Side-channel attack over shared TLBs 2 / 38 Timeline of recent CPU flaws, 2018 (b) Jun 29 • NetSpectre: Side-channel attack over local network Jul 10 • Spectre-NG: Bounds Check Bypass Store Aug 14 • L1TF: "L1 Terminal Fault" ... • ? 3 / 38 Related talks in the ‘References’ section Out of scope: Internals of various side-channel attacks How to exploit Meltdown & Spectre variants Details of performance implications What this talk is not about 4 / 38 Related talks in the ‘References’ section What this talk is not about Out of scope: Internals of various side-channel attacks How to exploit Meltdown & Spectre variants Details of performance implications 4 / 38 What this talk is not about Out of scope: Internals of various side-channel attacks How to exploit Meltdown & Spectre variants Details of performance implications Related talks in the ‘References’ section 4 / 38 OpenStack, et al. libguestfs Virt Driver (guestfish) libvirtd QMP QMP QEMU QEMU VM1 VM2 Custom Disk1 Disk2 Appliance ioctl() KVM-based virtualization components Linux with KVM 5 / 38 OpenStack, et al. libguestfs Virt Driver (guestfish) libvirtd QMP QMP Custom Appliance KVM-based virtualization components QEMU QEMU VM1 VM2 Disk1 Disk2 ioctl() Linux with KVM 5 / 38 OpenStack, et al. libguestfs Virt Driver (guestfish) Custom Appliance KVM-based virtualization components libvirtd QMP QMP QEMU QEMU VM1 VM2 Disk1 Disk2 ioctl() Linux with KVM 5 / 38 libguestfs (guestfish) Custom Appliance KVM-based virtualization components OpenStack, et al.
    [Show full text]
  • Analysis of SIMD Applicability to SHA Algorithms O
    1 Analysis of SIMD Applicability to SHA Algorithms O. Aciicmez Abstract— It is possible to increase the speed and throughput of The remainder of the paper is organized as follows: In an algorithm using parallelization techniques. Single-Instruction section 2 and 3, we introduce SIMD concept and the SIMD Multiple-Data (SIMD) is a parallel computation model, which has architecture of Intel including MMX technology and SSE already employed by most of the current processor families. In this paper we will analyze four SHA algorithms and determine extensions. Section 4 describes SHA algorithm and Section possible performance gains that can be achieved using SIMD 5 discusses the possible improvements on SHA performance parallelism. We will point out the appropriate parts of each that can be achieved by using SIMD instructions. algorithm, where SIMD instructions can be used. II. SIMD PARALLEL PROCESSING I. INTRODUCTION Single-instruction multiple-data execution model allows Today the security of a cryptographic mechanism is not the several data elements to be processed at the same time. The only concern of cryptographers. The heavy communication conventional scalar execution model, which is called single- traffic on contemporary very large network of interconnected instruction single-data (SISD) deals only with one pair of data devices demands a great bandwidth for security protocols, and at a time. The programs using SIMD instructions can run much hence increasing the importance of speed and throughput of a faster than their scalar counterparts. However SIMD enabled cryptographic mechanism. programs are harder to design and implement. A straightforward approach to improve cryptographic per- The most common use of SIMD instructions is to perform formance is to implement cryptographic algorithms in hard- parallel arithmetic or logical operations on multiple data ware.
    [Show full text]
  • New Instruction Set Extensions
    New Instruction Set Extensions Instruction Set Innovation in Intels Processor Code Named Haswell [email protected] Agenda • Introduction - Overview of ISA Extensions • Haswell New Instructions • New Instructions Overview • Intel® AVX2 (256-bit Integer Vectors) • Gather • FMA: Fused Multiply-Add • Bit Manipulation Instructions • TSX/HLE/RTM • Tools Support for New Instruction Set Extensions • Summary/References Copyright© 2012, Intel Corporation. All rights reserved. Partially Intel Confidential Information. 2 *Other brands and names are the property of their respective owners. Instruction Set Architecture (ISA) Extensions 199x MMX, CMOV, Multiple new instruction sets added to the initial 32bit instruction PAUSE, set of the Intel® 386 processor XCHG, … 1999 Intel® SSE 70 new instructions for 128-bit single-precision FP support 2001 Intel® SSE2 144 new instructions adding 128-bit integer and double-precision FP support 2004 Intel® SSE3 13 new 128-bit DSP-oriented math instructions and thread synchronization instructions 2006 Intel SSSE3 16 new 128-bit instructions including fixed-point multiply and horizontal instructions 2007 Intel® SSE4.1 47 new instructions improving media, imaging and 3D workloads 2008 Intel® SSE4.2 7 new instructions improving text processing and CRC 2010 Intel® AES-NI 7 new instructions to speedup AES 2011 Intel® AVX 256-bit FP support, non-destructive (3-operand) 2012 Ivy Bridge NI RNG, 16 Bit FP 2013 Haswell NI AVX2, TSX, FMA, Gather, Bit NI A long history of ISA Extensions ! Copyright© 2012, Intel Corporation. All rights reserved. Partially Intel Confidential Information. 3 *Other brands and names are the property of their respective owners. Instruction Set Architecture (ISA) Extensions • Why new instructions? • Higher absolute performance • More energy efficient performance • New application domains • Customer requests • Fill gaps left from earlier extensions • For a historical overview see http://en.wikipedia.org/wiki/X86_instruction_listings Copyright© 2012, Intel Corporation.
    [Show full text]
  • Elementary Functions: Towards Automatically Generated, Efficient
    Elementary functions : towards automatically generated, efficient, and vectorizable implementations Hugues De Lassus Saint-Genies To cite this version: Hugues De Lassus Saint-Genies. Elementary functions : towards automatically generated, efficient, and vectorizable implementations. Other [cs.OH]. Université de Perpignan, 2018. English. NNT : 2018PERP0010. tel-01841424 HAL Id: tel-01841424 https://tel.archives-ouvertes.fr/tel-01841424 Submitted on 17 Jul 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Délivré par l’Université de Perpignan Via Domitia Préparée au sein de l’école doctorale 305 – Énergie et Environnement Et de l’unité de recherche DALI – LIRMM – CNRS UMR 5506 Spécialité: Informatique Présentée par Hugues de Lassus Saint-Geniès [email protected] Elementary functions: towards automatically generated, efficient, and vectorizable implementations Version soumise aux rapporteurs. Jury composé de : M. Florent de Dinechin Pr. INSA Lyon Rapporteur Mme Fabienne Jézéquel MC, HDR UParis 2 Rapporteur M. Marc Daumas Pr. UPVD Examinateur M. Lionel Lacassagne Pr. UParis 6 Examinateur M. Daniel Menard Pr. INSA Rennes Examinateur M. Éric Petit Ph.D. Intel Examinateur M. David Defour MC, HDR UPVD Directeur M. Guillaume Revy MC UPVD Codirecteur À la mémoire de ma grand-mère Françoise Lapergue et de Jos Perrot, marin-pêcheur bigouden.
    [Show full text]
  • Operating Guide
    Operating Guide EPIA EN-Series Mini-ITX Mainboard January 18, 2012 Version 1.21 EPIA EN-Series Operating Guide Table of Contents Table of Contents ...................................................................................................................................................................................... i VIA EPIA EN-Series Overview.............................................................................................................................................................. 1 VIA EPIA EN-Series Layout .................................................................................................................................................................. 2 VIA EPIA EN-Series Specifications ...................................................................................................................................................... 3 VIA EPIA EN Processor SKUs .............................................................................................................................................................. 4 VIA CN700 Chipset Overview ............................................................................................................................................................... 5 VIA EPIA EN-Series I/O Back Panel Layout ...................................................................................................................................... 6 VIA EPIA EN-Series Layout Diagram & Mounting Holes ..............................................................................................................
    [Show full text]
  • Undocumented CPU Behavior: Analyzing Undocumented Opcodes on Intel X86-64 Catherine Easdon Why Investigate Undocumented Behavior? the “Golden Screwdriver” Approach
    Undocumented CPU Behavior: Analyzing Undocumented Opcodes on Intel x86-64 Catherine Easdon Why investigate undocumented behavior? The “golden screwdriver” approach ● Intel have confirmed they add undocumented features to general-release chips for key customers "As far as the etching goes, we have done different things for different customers, and we have put different things into the silicon, such as adding instructions or pins or signals for logic for them. The difference is that it goes into all of the silicon for that product. And so the way that you do it is somebody gives you a feature, and they say, 'Hey, can you get this into the product?' You can't do something that takes up a huge amount of die, but you can do an instruction, you can do a signal, you can do a number of things that are logic-related." ~ Jason Waxman, Intel Cloud Infrastructure Group (Source: http://www.theregister.co.uk/2013/05/20/intel_chip_customization/ ) Poor documentation ● Intel has a long history of withholding information from their manuals (Source: http://datasheets.chipdb.org/Intel/x86/Pentium/24143004.PDF) Poor documentation ● Intel has a long history of withholding information from their manuals (Source: https://stackoverflow.com/questions/14413839/what-are-the-exhaustion-characteristics-of-rdrand-on-ivy-bridge) Poor documentation ● Even when the manuals don’t withhold information, they are often misleading or inconsistent Section 22.15, Intel Developer Manual Vol. 3: Section 6.15 (#UD exception): Poor documentation leads to vulnerabilities ● In operating systems ○ POP SS/MOV SS (May 2018) ■ Developer confusion over #DB handling ■ Load + execute unsigned kernel code on Windows ■ Also affected: Linux, MacOS, FreeBSD..
    [Show full text]
  • Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3C: System Programming Guide, Part 3
    Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3C: System Programming Guide, Part 3 NOTE: The Intel® 64 and IA-32 Architectures Software Developer's Manual consists of seven volumes: Basic Architecture, Order Number 253665; Instruction Set Reference A-M, Order Number 253666; Instruction Set Reference N-Z, Order Number 253667; Instruction Set Reference, Order Number 326018; System Programming Guide, Part 1, Order Number 253668; System Programming Guide, Part 2, Order Number 253669; System Programming Guide, Part 3, Order Number 326019. Refer to all seven volumes when evaluating your design needs. Order Number: 326019-049US February 2014 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDI- TIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRAN- TY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
    [Show full text]
  • Hyper-Threading Performance with Intel Cpus for Linux SAP Deployment on Proliant Servers
    Hyper-Threading Performance with Intel CPUs for Linux SAP Deployment on ProLiant Servers Session #3798 Hein van den Heuvel Performance Engineer Hewlett-Packard © 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Topics • Hyper-Threading Intro • Implementation details Intel, IBM, Sun • Linux implementation • My own tests • SAP (SD) benchmark • Benchmark Results • Conclusions: (18% improvement for SAP 2-tier) Intel Hyper-Threading Overview “Hyper-Threading Technology is a form of simultaneous multithreading technology (SMT), where multiple threads of software applications can be run simultaneously on one processor. This is achieved by duplicating the architectural state on each processor, while sharing one set of processor execution resources. The architectural state tracks the flow of a program or thread, and the execution resources are the units on the processor that do the work: add, multiply, load, etc. “ http://www.intel.com/business/bss/products/hyperthreading/server/ht_server.pdf http://www.intel.com/technology/hyperthread/ Intel HT in a picture To-be-updated Hyper-Threading Versus Dual Core • HP (PA + ipf) opted for ‘dual core’ technology. − Each processor has full set of resources − Only limitation is shared ‘system’ connection. − Allows for dense (8p – 4u – 4640) − minimally constrained systems • Software licensing impact (Oracle!) • Hyper-Threading technology effectiveness will depend on application IBM P5 SMT Summary Enhanced Simultaneous Multi-Threading features To improve SMT performance for various workload mixes and provide robust quality of service, POWER5 provides two features: • Dynamic resource balancing – The objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system.
    [Show full text]
  • Andre Heidekrueger
    AMD Heterogenous Computing X86 in development AMD new CPU and Accelerator Designs Building blocks for Heterogenous computing with the GPU Accelerators and the Latest x86 Platform Innovations 1 | Hot Chips | August, 2010 Server Industry Trends China has Seismic The performance of the 265m datasets online gamers fastest supercomputer typically exceed a terabyte grew 500x in the last decade The top 8 systems Accelerator-based on the Green 500 list use accelerators 800 servers on the Green 500 images list are 3x as energy are uploaded to Facebook efficient as those without every second 2 | Hot Chips | August,accelerators 2010 2 Top500.org Performance Projections: Can the Current Trajectory Achieve Exascale? 1 EFlops Might get there on current trajectory, but… • Too late for major government programs leading to 2018 • System power in traditional x86 architecture would be unmanageable Source for chart: Top500.org; annotations by AMD 3 | Hot Chips | August, 2010 Three Eras of Processor Performance Multi-Core Heterogeneous Single-Core Systems Era Era Era Enabled by: Enabled by: Enabled by: Moore’s Law Moore’s Law Moore’s Law Voltage Scaling Desire for Throughput Abundant data parallelism Micro-Architecture 20 years of SMP arch Power efficient GPUs Constrained by: Constrained by: Temporarily constrained by: Power Power Programming models Complexity Parallel SW availability Communication overheads Scalability o o ? we are we are here o here we are Performance here thread Performance - Time Time Application Targeted Time Throughput Throughput Performance (# of Processors) (Data-parallel exploitation) Single 4 | Driving HPC Performance Efficiency Fusion Architecture The Benefits of Heterogeneous Computing x86 CPU owns GPU Optimized for the Software World Modern Workloads .
    [Show full text]
  • SIMD Extensions
    SIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
    [Show full text]
  • Hierarchical Roofline Analysis for Gpus: Accelerating Performance
    Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System Charlene Yang, Thorsten Kurth Samuel Williams National Energy Research Scientific Computing Center Computational Research Division Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Berkeley, CA 94720, USA Berkeley, CA 94720, USA fcjyang, [email protected] [email protected] Abstract—The Roofline performance model provides an Performance (GFLOP/s) is bound by: intuitive and insightful approach to identifying performance bottlenecks and guiding performance optimization. In prepa- Peak GFLOP/s GFLOP/s ≤ min (1) ration for the next-generation supercomputer Perlmutter at Peak GB/s × Arithmetic Intensity NERSC, this paper presents a methodology to construct a hi- erarchical Roofline on NVIDIA GPUs and extend it to support which produces the traditional Roofline formulation when reduced precision and Tensor Cores. The hierarchical Roofline incorporates L1, L2, device memory and system memory plotted on a log-log plot. bandwidths into one single figure, and it offers more profound Previously, the Roofline model was expanded to support insights into performance analysis than the traditional DRAM- the full memory hierarchy [2], [3] by adding additional band- only Roofline. We use our Roofline methodology to analyze width “ceilings”. Similarly, additional ceilings beneath the three proxy applications: GPP from BerkeleyGW, HPGMG Roofline can be added to represent performance bottlenecks from AMReX, and conv2d from TensorFlow. In so doing, we demonstrate the ability of our methodology to readily arising from lack of vectorization or the failure to exploit understand various aspects of performance and performance fused multiply-add (FMA) instructions. bottlenecks on NVIDIA GPUs and motivate code optimizations.
    [Show full text]