Evaluating MMX Technology Using DSP and Multimedia Applications

Ravi Bhargava, Lizy K. John, Brian L. Evans and Ramesh Radhakrishnan
Electrical and Computer Engineering Department
The University of Texas at Austin
{ravib, ljohn, bevans, radhakri}@ece.utexas.edu

Keywords: Digital Signal Processing, Machine Measurement, MMX, Performance Monitoring, Workload Characterization.

L. John is supported by the National Science Foundation under Grants CCR-9796098 (CAREER Award) and EIA-9807112, and a grant from the Texas Advanced Technology Program. B. Evans was supported by the US National Science Foundation CAREER Award under grant MIP-9702707, the US Defense Advanced Research Projects Agency under contract DAAB07-97-C-J007, Accelerix, and Rockwell.

Abstract

Many current general purpose processors are using extensions to the instruction set architecture to enhance the performance of digital signal processing (DSP) and multimedia applications. In this paper, we evaluate the X86 architecture's multimedia extension (MMX) instruction set on a set of benchmarks. Our benchmark suite includes kernels (filtering, fast Fourier transforms, and vector arithmetic) and applications (JPEG compression, Doppler radar processing, imaging, and G.722 speech encoding). Each benchmark has at least one non-MMX version in C and an MMX version that makes calls to an MMX assembly library. The versions differ in the implementation of filtering, vector arithmetic, and other relevant kernels. The observed speedup for the MMX versions of the suite ranges from less than 1.0 to 6.1. In addition to quantifying the speedup, we perform detailed instruction level profiling using Intel's VTune profiling tool. Using VTune, we profile static and dynamic instructions, microarchitecture operations, and data references to isolate the specific reasons for speedup or lack thereof. This analysis allows one to understand which aspects of native signal processing instruction sets are most useful, the current limitations, and how they can be utilized most efficiently.

1 Introduction

Demand for digital signal processing (DSP) and multimedia capabilities on a personal computer has been increasing to accommodate 3-D graphics, video conferencing, and other applications. The PC industry's attempt to satisfy these demands resulted in the first addition to the X86 instruction set architecture (ISA) in almost a decade. This extension, introduced in 1996, has been dubbed MMX for MultiMedia eXtension [1, 2] and can outperform lower-end DSP processors [3]. This technology adds new assembly instructions and data types to the existing ISA to exploit the data parallelism that is often available in DSP and multimedia applications. In this study, we investigate the performance of a suite of programs on Intel Pentium processors with MMX technology.

MMX and native signal processing (NSP) extensions to general purpose processors [4, 5] are single-instruction multiple-data (SIMD) architectures. One SIMD data type which contains several pieces of data is sent to a processing unit. By packing many pieces of data into one 64-bit MMX register, several calculations can take place simultaneously [6]. For example, image processing applications typically manipulate matrices of 8-bit data. Eight pieces of this data could be packed into an MMX register, arithmetic or logical operations could be performed on the pixels in parallel, and the results could be written to a register.
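To make the packing idea concrete, the short C sketch below performs eight 8-bit additions at once using the MMX intrinsics from <mmintrin.h>. This is only an illustrative analogue of the hand-written assembly libraries the paper benchmarks, and it assumes a compiler and target that expose the MMX intrinsics (e.g. gcc or clang on x86).

```c
/*
 * A minimal sketch, not the paper's code: eight 8-bit values packed into one
 * 64-bit MMX register and added with a single packed-add operation, written
 * with the MMX intrinsics from <mmintrin.h>.
 */
#include <mmintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint8_t a[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    uint8_t b[8] = { 1,  2,  3,  4,  5,  6,  7,  8};
    __m64 va, vb, vc;

    memcpy(&va, a, 8);            /* pack eight 8-bit pixels per 64-bit register */
    memcpy(&vb, b, 8);
    vc = _mm_add_pi8(va, vb);     /* eight byte-wide adds in one instruction */

    uint8_t c[8];
    memcpy(c, &vc, 8);
    _mm_empty();                  /* EMMS: clear the aliased FP/MMX state */

    for (int i = 0; i < 8; i++)
        printf("%u ", (unsigned)c[i]);   /* prints 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}
```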
To achieve this functionality, MMX technology adds 57 new assembly instructions to the X86 instruction set. These instructions can operate on any of the packed data types and on unsigned or signed data. Saturation and wrap-around arithmetic are also supported. Multiply-accumulate (MAC), a frequent operation in DSP applications, is also part of the ISA. MMX can multiply 8-bit and 16-bit fixed-point data, but not 32-bit data. Data widths of 8 and 16 bits are sufficient for speech, image, audio, and video processing applications as well as 3-D graphics.
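The following sketch illustrates two of these features, again through the <mmintrin.h> intrinsics rather than the paper's assembly libraries: the difference between wrap-around and saturating packed addition, and the PMADDWD-style multiply-accumulate on packed 16-bit data. Compiler and target support for the MMX intrinsics is assumed.

```c
/*
 * A minimal sketch, not the paper's code: wrap-around versus saturating
 * packed addition, and the PMADDWD-style multiply-accumulate on packed
 * 16-bit data, using the <mmintrin.h> MMX intrinsics.
 */
#include <mmintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Wrap-around vs. saturation on unsigned 8-bit data. */
    uint8_t x[8] = {250, 250, 250, 250, 250, 250, 250, 250};
    uint8_t y[8] = { 10,  10,  10,  10,  10,  10,  10,  10};
    __m64 vx, vy;
    memcpy(&vx, x, 8);
    memcpy(&vy, y, 8);
    __m64 wrap = _mm_add_pi8(vx, vy);    /* 250 + 10 wraps around to 4 */
    __m64 sat  = _mm_adds_pu8(vx, vy);   /* 250 + 10 saturates at 255  */

    /* PMADDWD: four 16x16-bit products summed pairwise into two 32-bit lanes,
       the building block of an MMX multiply-accumulate loop. */
    int16_t coef[4] = {1, 2, 3, 4};
    int16_t data[4] = {5, 6, 7, 8};
    __m64 vc, vd;
    memcpy(&vc, coef, 8);
    memcpy(&vd, data, 8);
    __m64 mac = _mm_madd_pi16(vc, vd);   /* lanes: 1*5 + 2*6 = 17, 3*7 + 4*8 = 53 */

    uint8_t w[8], s[8];
    int32_t m[2];
    memcpy(w, &wrap, 8);
    memcpy(s, &sat,  8);
    memcpy(m, &mac,  8);
    _mm_empty();                         /* clear the aliased FP/MMX state */

    printf("wrap=%u sat=%u mac=%d,%d\n",
           (unsigned)w[0], (unsigned)s[0], m[0], m[1]);
    return 0;
}
```

Saturation is what keeps, for instance, a bright pixel value from wrapping around to near zero, which is why MMX provides both arithmetic modes.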
MMX maintains full compatibility with existing X86 operating systems and applications. MMX registers and state are aliased onto the floating-point registers and state, so no new registers or states are introduced by MMX. Maintaining compatibility places limitations on MMX. The MMX registers are limited to the width of the floating-point registers (MMX uses 64 of the 80 available bits) and mixing of floating point and MMX code becomes costly.

In DSP applications, several algorithms surface more frequently than others [3, 4, 7, 8, 9, 10, 11, 12]. The most common DSP kernels are the finite impulse response (FIR) filter, infinite impulse response (IIR) filter, fast Fourier transform (FFT), least mean square (LMS) adaptive filters, and matrix-vector arithmetic. Applications commonly benchmarked are speech, audio, image and video compression systems.

Previous efforts have analyzed NSP on general purpose processors [5, 13]. However, efforts to compare applications with MMX instructions versus applications without MMX on the same X86 processor have been incomplete [8]. The results found in [8] are only anticipated results based on simulation. A benchmarking of several applications on the UltraSPARC processor [4] using the Visual Instruction Set (VIS) showed a performance speedup for some DSP applications over non-VIS versions. Applications with FIR filters showed the most improvement while IIR filters and FFTs exhibited little or no performance increase [4].

To study the effects of DSP and multimedia programs, we need code with and without MMX for these applications. It is important to note that there are no publicly available compilers that support MMX instructions. This means that the burden of incorporating MMX is placed solely on the developers of the application. Achieving the largest performance increase would involve tailoring MMX assembly code for each specific application or kernel and then in-lining this assembly code into the application source code. A less time-consuming method would be to write generic MMX libraries for common algorithms and kernels which can be accessed via function calls.

Intel provides a suite of optimized assembly libraries on their Web site [14] and with VTune [15]. Recent versions, which include the Signal Processing Library 4.0, Recognition Primitives Library 3.1, and Image Processing Library 2.0, support fixed-point functions that utilize MMX and floating-point functions. Not all DSP algorithms have corresponding MMX functions (e.g. the LMS algorithm). We developed some C code and acquired the remainder from several sources [10, 16, 17, 18]. We looked for code that is fast and efficient, yet somewhat modular and easy to interpret so that we could interface it with the Intel libraries.

Although MMX presents opportunity for performance increase, to our knowledge no independent evaluations of applications on an X86 processor with MMX corroborate and explain the performance increase. We quantify speedup for our benchmark suite of kernels and applications and offer insight into development of DSP and multimedia applications on the Pentium family of processors. We analyze variations in the execution time, dynamic code size, static code size, number of memory references, function calls, number of MMX instructions, and mix of MMX instructions. Speedup is achieved on some, but not all, DSP and multimedia applications. The effects of packing and unpacking do not undermine the advantages of MMX, and some applications do not require packing and unpacking of data because of properly aligned data. On applications such as JPEG image compression that ran slower with MMX, we found that the core kernels of the applications show speedup, but numerous calls to NSP assembly libraries as well as the formatting of data for these libraries prove to be significant overhead. The instruction mix of the applications allows us to provide more insight into these specific observations and elucidate other concerns. The most effective MMX applications require buffered data and high data parallelism.

In this paper, Section 2 describes the benchmark programs. Section 3 describes the experiment methodology, and Section 4 analyzes the results. Section 5 concludes the paper.

2 Benchmarks

For our study, we profile four DSP and multimedia kernels and four applications. Table 1 summarizes the implementations and general characteristics of the four kernels and four applications that comprise our benchmark suite. We provide more information about our benchmark source code at the following Web site: http://www.ece.utexas.edu/~ljohn/mmxdsp/. The rest of this section provides some details on the benchmarks.

2.1 Kernels

Finite Impulse Response (FIR) Filters allow certain frequency components of the input to pass unchanged to the output while blocking other components. FIR filters are moving average filters. Their response to an impulse dies away in a finite number of samples. The output y[n] is a weighted average of the most recent input samples.

Table 1: Summary of Benchmark Kernels and Applications (excerpt)
Kernels
  Fast Fourier Transform (fft):           4096 point, in-place FFT
  Finite Impulse Response Filter (fir):   Low-pass filter of length 35 (i.e. 35 coefficients and 35 entry history)
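A scalar C sketch of the FIR kernel described in Section 2.1 is given below. The 35-tap length follows Table 1, but the 16-bit Q15 fixed-point format, the function signature, and the output scaling are assumptions made for illustration; they are not taken from the benchmark's actual source code.

```c
/*
 * A scalar C sketch of the FIR kernel described above: each output sample is
 * a weighted sum of the NTAPS most recent inputs. The 35-tap length follows
 * Table 1; the Q15 fixed-point format, the signature, and the output scaling
 * are illustrative assumptions, not the benchmark's actual code.
 */
#include <stddef.h>
#include <stdint.h>

#define NTAPS 35

/* x must hold n + NTAPS - 1 samples; y[i] is computed from x[i] .. x[i + NTAPS - 1]. */
void fir(const int16_t *x, const int16_t h[NTAPS], int16_t *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t acc = 0;                       /* wide accumulator for the MACs */
        for (int k = 0; k < NTAPS; k++)
            acc += (int32_t)h[k] * x[i + k];   /* multiply-accumulate per tap */
        y[i] = (int16_t)(acc >> 15);           /* rescale, assuming Q15 coefficients */
    }
}
```

In the paper's benchmarks, the MMX versions replace this kind of filtering loop with calls to an MMX assembly library rather than inlining intrinsics or hand-written assembly into the application code.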