G4 Is First Powerpc with Altivec: 11/16/98
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
RISC-V Vector Extension Webinar II
RISC-V Vector Extension Webinar II August 3th, 2021 Thang Tran, Ph.D. Principal Engineer Webinar II - Agenda • Andes overview • Vector technology background – SIMD/vector concept – Vector processor basic • RISC-V V extension ISA – Basic – CSR • RISC-V V extension ISA – Memory operations – Compute instructions • Sample codes – Matrix multiplication – Loads with RVV versions 0.8 and 1.0 • AndesCore™ NX27V • Summary Copyright© 2020 Andes Technology Corp. 2 Terminology • ACE: Andes Custom Extension • ISA: Instruction Set Architecture • CSR: Control and Status Register • GOPS: Giga Operations Per Second • SEW: Element Width (8-64) • GFLOPS: Giga Floating-Point OPS • ELEN: Largest Element Width (32 or 64) • XRF: Integer register file • XLEN: Scalar register length in bits (64) • FRF: Floating-point register file • FLEN: FP register length in bits (16-64) • VRF: Vector register file • VLEN: Vector register length in bits (128-512) • SIMD: Single Instruction Multiple Data • LMUL: Register grouping multiple (1/8-8) • MMX: Multi Media Extension • EMUL: Effective LMUL • SSE: Streaming SIMD Extension • VLMAX/MVL: Vector Length Max • AVX: Advanced Vector Extension • AVL/VL: Application Vector Length • Configurable: parameters are fixed at built time, i.e. cache size • Extensible: added instructions to ISA includes custom instructions to be added by customer • Standard extension: the reserved codes in the ISA for special purposes, i.e. FP, DSP, … • Programmable: parameters can be dynamically changed in the program Copyright© 2020 Andes Technology Corp. 3 RISC-V V Extension ISA Basic Copyright© 2020 Andes Technology Corp. 4 Vector Register ISA • Vector-Register ISA Definition: − All vector operations are between vector registers (except for load and store). -
Vxworks Architecture Supplement, 6.2
VxWorks Architecture Supplement VxWorks® ARCHITECTURE SUPPLEMENT 6.2 Copyright © 2005 Wind River Systems, Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means without the prior written permission of Wind River Systems, Inc. Wind River, the Wind River logo, Tornado, and VxWorks are registered trademarks of Wind River Systems, Inc. Any third-party trademarks referenced are the property of their respective owners. For further information regarding Wind River trademarks, please see: http://www.windriver.com/company/terms/trademark.html This product may include software licensed to Wind River by third parties. Relevant notices (if any) are provided in your product installation at the following location: installDir/product_name/3rd_party_licensor_notice.pdf. Wind River may refer to third-party documentation by listing publications or providing links to third-party Web sites for informational purposes. Wind River accepts no responsibility for the information provided in such third-party documentation. Corporate Headquarters Wind River Systems, Inc. 500 Wind River Way Alameda, CA 94501-1153 U.S.A. toll free (U.S.): (800) 545-WIND telephone: (510) 748-4100 facsimile: (510) 749-2010 For additional contact information, please visit the Wind River URL: http://www.windriver.com For information on how to contact Customer Support, please visit the following URL: http://www.windriver.com/support VxWorks Architecture Supplement, 6.2 11 Oct 05 Part #: DOC-15660-ND-00 Contents 1 Introduction -
SIMD Extensions
SIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel. -
Optimizing Packed String Matching on AVX2 Platform
Optimizing Packed String Matching on AVX2 Platform M. Akif Aydo˘gmu¸s1,2 and M.O˘guzhan Külekci1 1 Informatics Institute, Istanbul Technical University, Istanbul, Turkey [email protected], [email protected] 2 TUBITAK UME, Gebze-Kocaeli, Turkey Abstract. Exact string matching, searching for all occurrences of given pattern P on a text T , is a fundamental issue in computer science with many applica- tions in natural language processing, speech processing, computational biology, information retrieval, intrusion detection systems, data compression, and etc. Speeding up the pattern matching operations benefiting from the SIMD par- allelism has received attention in the recent literature, where the empirical results on previous studies revealed that SIMD parallelism significantly helps, while the performance may even be expected to get automatically enhanced with the ever increasing size of the SIMD registers. In this paper, we provide variants of the previously proposed EPSM and SSEF algorithms, which are orig- inally implemented on Intel SSE4.2 (Streaming SIMD Extensions 4.2 version with 128-bit registers). We tune the new algorithms according to Intel AVX2 platform (Advanced Vector Extensions 2 with 256-bit registers) and analyze the gain in performance with respect to the increasing length of the SIMD registers. Profiling the new algorithms by using the Intel Vtune Amplifier for detecting performance bottlenecks led us to consider the cache friendliness and shared-memory access issues in the AVX2 platform. We applied cache op- timization techniques to overcome the problems particularly addressing the search algorithms based on filtering. Experimental comparison of the new solutions with the previously known-to- be-fast algorithms on small, medium, and large alphabet text files with diverse pattern lengths showed that the algorithms on AVX2 platform optimized cache obliviously outperforms the previous solutions. -
Optimizing Software Performance Using Vector Instructions Invited Talk at Speed-B Conference, October 19–21, 2016, Utrecht, the Netherlands
Agner Fog, Technical University of Denmark Optimizing software performance using vector instructions Invited talk at Speed-B conference, October 19–21, 2016, Utrecht, The Netherlands. Abstract Microprocessor factories have a problem obeying Moore's law because of physical limitations. The answer is increasing parallelism in the form of multiple CPU cores and vector instructions (Single Instruction Multiple Data - SIMD). This is a challenge to software developers who have to adapt to a moving target of new instruction set additions and increasing vector sizes. Most of the software industry is lagging several years behind the available hardware because of these problems. Other challenges are tasks that cannot easily be executed with vector instructions, such as sequential algorithms and lookup tables. The talk will discuss methods for overcoming these problems and utilize the continuously growing power of microprocessors on the market. A few problems relevant to cryptographic software will be covered, and the outlook for the future will be discussed. Find more on these topics at author website: www.agner.org/optimize Moore's law The clock frequency has stopped growing due to physical limitations. Instead, the number of CPU cores and the size of vector registers is growing. Hierarchy of bottlenecks Program installation Program load, JIT compile, DLL's System database Network access Speed → File input/output Graphical user interface RAM access, cache utilization Algorithm Dependency chains CPU pipeline and execution units Remove -
Multi-Platform Auto-Vectorization
H-0236 (H0512-002) November 30, 2005 Computer Science IBM Research Report Multi-Platform Auto-Vectorization Dorit Naishlos, Richard Henderson* IBM Research Division Haifa Research Laboratory Mt. Carmel 31905 Haifa, Israel *Red Hat Research Division Almaden - Austin - Beijing - Haifa - India - T. J. Watson - Tokyo - Zurich LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. I thas been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g ,. payment of royalties). Copies may be requested from IBM T. J. Watson Research Center , P. O. Box 218, Yorktown Heights, NY 10598 USA (email: [email protected]). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home . Multi-Platform Auto-Vectorization Dorit Naishlos Richard Henderson IBM Haifa Labs Red Hat [email protected] [email protected] Abstract. The recent proliferation of the Single Instruction Multiple Data (SIMD) model has lead to a wide variety of implementations. These have been incorporated into many platforms, from gaming machines and em- bedded DSPs to general purpose architectures. In this paper we present an automatic vectorizer as implemented in GCC - the most multi-targetable compiler available today. We discuss the considerations that are involved in developing a multi-platform vectorization technology, and demonstrate how our vectorization scheme is suited to a variety of SIMD architectures. -
A Bibliography of Publications in IEEE Micro
A Bibliography of Publications in IEEE Micro Nelson H. F. Beebe University of Utah Department of Mathematics, 110 LCB 155 S 1400 E RM 233 Salt Lake City, UT 84112-0090 USA Tel: +1 801 581 5254 FAX: +1 801 581 4148 E-mail: [email protected], [email protected], [email protected] (Internet) WWW URL: http://www.math.utah.edu/~beebe/ 16 September 2021 Version 2.108 Title word cross-reference -Core [MAT+18]. -Cubes [YW94]. -D [ASX19, BWMS19, DDG+19, Joh19c, PZB+19, ZSS+19]. -nm [ABG+16, KBN16, TKI+14]. #1 [Kah93i]. 0.18-Micron [HBd+99]. 0.9-micron + [Ano02d]. 000-fps [KII09]. 000-Processor $1 [Ano17-58, Ano17-59]. 12 [MAT 18]. 16 + + [ABG+16]. 2 [DTH+95]. 21=2 [Ste00a]. 28 [BSP 17]. 024-Core [JJK 11]. [KBN16]. 3 [ASX19, Alt14e, Ano96o, + AOYS95, BWMS19, CMAS11, DDG+19, 1 [Ano98s, BH15, Bre10, PFC 02a, Ste02a, + + Ste14a]. 1-GHz [Ano98s]. 1-terabits DFG 13, Joh19c, LXB07, LX10, MKT 13, + MAS+07, PMM15, PZB+19, SYW+14, [MIM 97]. 10 [Loc03]. 10-Gigabit SCSR93, VPV12, WLF+08, ZSS+19]. 60 [Gad07, HcF04]. 100 [TKI+14]. < [BMM15]. > [BMM15]. 2 [Kir84a, Pat84, PSW91, YSMH91, ZACM14]. [WHCK18]. 3 [KBW95]. II [BAH+05]. ∆ 100-Mops [PSW91]. 1000 [ES84]. 11- + [Lyl04]. 11/780 [Abr83]. 115 [JBF94]. [MKG 20]. k [Eng00j]. µ + [AT93, Dia95c, TS95]. N [YW94]. x 11FO4 [ASD 05]. 12 [And82a]. [DTB01, Dur96, SS05]. 12-DSP [Dur96]. 1284 [Dia94b]. 1284-1994 [Dia94b]. 13 * [CCD+82]. [KW02]. 1394 [SB00]. 1394-1955 [Dia96d]. 1 2 14 [WD03]. 15 [FD04]. 15-Billion-Dollar [KR19a]. -
Compiler-Based Data Prefetching and Streaming Non-Temporal Store Generation for the Intel R Xeon Phitm Coprocessor
Compiler-based Data Prefetching and Streaming Non-temporal Store Generation for the Intel R Xeon PhiTM Coprocessor Rakesh Krishnaiyer†, Emre K¨ult¨ursay†‡, Pankaj Chawla†, Serguei Preis†, Anatoly Zvezdin†, and Hideki Saito† † Intel Corporation ‡ The Pennsylvania State University TM Abstract—The Intel R Xeon Phi coprocessor has software used, and generating them. In this paper, we: TM prefetching instructions to hide memory latencies and spe- • Present how the Intel R Xeon Phi coprocessor soft- cial store instructions to save bandwidth on streaming non- ware prefetch and non-temporal streaming store instructions temporal store operations. In this work, we provide details on R compiler-based generation of these instructions and evaluate are generated by the Intel Composer XE 2013, TM their impact on the performance of the Intel R Xeon Phi • Evaluate the impact of these mechanisms on the overall coprocessor using a wide range of parallel applications with performance of the coprocessor using a variety of parallel different characteristics. Our results show that the Intel R applications with different characteristics. Composer XE 2013 compiler can make effective use of these Our experimental results demonstrate that (i) a large mechanisms to achieve significant performance improvements. number of applications benefit significantly from software prefetching instructions (on top of hardware prefetching) that I. INTRODUCTION are generated automatically by the compiler for the Intel R TM TM The Intel R Xeon Phi coprocessor based on Intel R Xeon Phi coprocessor, (ii) some benchmarks can further Many Integrated Core Architecture (Intel R MIC Architec- improve when compiler options that control prefetching ture) is a many-core processor with long vector (SIMD) units behavior are used (e.g., to enable indirect prefetching), targeted for highly parallel workloads in the High Perfor- and (iii) many applications benefit from compiler generated mance Computing (HPC) segment. -
Vybrid Controllers Technical Overview
TM June 2013 Freescale, the Freescale logo, AltiVec, C-5, CodeTEST, CodeWarrior, ColdFire, ColdFire+, C- Ware, the Energy Efficient Solutions logo, Kinetis, mobileGT, PEG, PowerQUICC, Processor Expert, QorIQ, Qorivva, SafeAssure, the SafeAssure logo, StarCore, Symphony and VortiQa are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. Airfast, BeeKit, BeeStack, CoreNet, Flexis, Layerscape, MagniV, MXC, Platform in a Package, QorIQ Qonverge, QUICC Engine, Ready Play, SMARTMOS, Tower, TurboLink, Vybrid and Xtrinsic are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © 2013 Freescale Semiconductor, Inc. • Overview of Vybrid Family • Vybrid Tower Board • Vybrid System Modules • QuadSPI Flash • Vybrid Clock System • Vybrid Power System • Vybrid Boot Operation • High Assurance Boot • Vybrid Trusted Execution • LinuxLink and MQX Embedded Software • DS-5 compiler TM Freescale, the Freescale logo, AltiVec, C-5, CodeTEST, CodeWarrior, ColdFire, ColdFire+, C-Ware, the Energy Efficient Solutions logo, Kinetis, mobileGT, PEG, PowerQUICC, Processor Expert, QorIQ, Qorivva, SafeAssure, the SafeAssure logo, StarCore, Symphony and VortiQa are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. 2 Airfast, BeeKit, BeeStack, CoreNet, Flexis, Layerscape, MagniV, MXC, Platform in a Package, QorIQ Qonverge, QUICC Engine, Ready Play, SMARTMOS, Tower, TurboLink, Vybrid and Xtrinsic are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © 2013 Freescale Semiconductor, Inc. TM Freescale, the Freescale logo, AltiVec, C-5, CodeTEST, CodeWarrior, ColdFire, ColdFire+, C- Ware, the Energy Efficient Solutions logo, Kinetis, mobileGT, PEG, PowerQUICC, Processor Expert, QorIQ, Qorivva, SafeAssure, the SafeAssure logo, StarCore, Symphony and VortiQa are trademarks of Freescale Semiconductor, Inc., Reg. -
POWER Block Course Assignment 3: SIMD with Altivec
POWER Block Course Assignment 3: SIMD with AltiVec Hasso Plattner Institute Agenda 1. SIMD & AltiVec 2. Programming AltiVec in C 3. Example programs 4. Assignment 3 5. Performance measurement 6. Useful resources Assignment 3 What is this assignment about? In this assignment you will: □ Get to know AltiVec □ Use AltiVec instructions to speed up computation □ Do performance comparison on Linux using perf Submission Deadline: 04.11.2016 by any issues: [email protected] SIMD & AltiVec Sven Köhler, 29.09.2016 Chart 3 SIMD & AltiVec Sven Köhler, SIMD & AltiVec 29.09.2016 1 Chart 4 1. SIMD & AltiVec AltiVec is SIMD on POWER SIMD ::= Single Instruction Multiple Data The same instruction is performed simultaneously on multiple data points (data-level parallelism). Many architectures provide SIMD instruction set extensions. Intel: MMX, SSE, AVX ARM: NEON SIMD & AltiVec Sven Köhler, POWER: AltiVec (VMX), VSX 29.09.2016 Chart 5 Multiprocessor: Flynn‘s Taxonomy 1. (1966)SIMD & AltiVec Flynn’s Taxonomy on Multiprocessors (1966) 20 ■ Classify multiprocessor architectures among instruction and data processing dimension Single Instruction, Single Data (SISD) Single Instruction, Multiple Data (SIMD) (C) Blaise Barney Barney (C) Blaise SIMD & AltiVec Sven Köhler, 29.09.2016 Single Data (MISD) Multiple Instruction, Multiple Instruction, Multiple Data (MIMD) ) Chart 6 ( Single Instruction, Single Data (SISD) 1. SIMD & AltiVec Scalar vs. SIMD How many instructions are needed to add four numbers from memory? scalar 4 element SIMD + A0 B0 = C0 A0 B0 C0 A + B = C 1 1 1 A1 B1 C1 + = A B C A2 + B2 = C2 2 2 2 A B C + 3 3 3 A3 B3 = C3 SIMD & AltiVec 4 additions 1 addition Sven Köhler, 29.09.2016 8 loads 2 loads Chart 7 4 stores 1 store 1. -
Cell Processor and Playstation 3
Cell Processor and Playstation 3 Guillem Borrell i Nogueras February 24, 2009 • Cell systems • Bad news • More bad news • Good news • Q&A • No SPU Double precision improvements expected from IBM IBM Blades • QS21 • Cell BE based. • 8 SPE • 460 Gflops Float • 20 GFLops Double • QS22 • PowerXCell 8i based • 8 SPE* • 460 GFlops Float • 200 GFlops Double • Some of them already installed in BSC IBM Blades • QS21 • Cell BE based. • 8 SPE • 460 Gflops Float • 20 GFLops Double • QS22 • PowerXCell 8i based • 8 SPE* • 460 GFlops Float • 200 GFlops Double • Some of them already installed in BSC • No SPU Double precision improvements expected from IBM • 256 MB RAM Playstation 3 • Cell BE based. • 6 SPE • 460 Gflops Float • 20 GFLops Double Playstation 3 • Cell BE based. • 6 SPE • 460 Gflops Float • 20 GFLops Double • 256 MB RAM • 1 TFlop on a chip IBM Power 7 • 8 PowerPC cores • 4 threads per core (32 Threads!) • ? SPE IBM Power 7 • 8 PowerPC cores • 4 threads per core (32 Threads!) • ? SPE • 1 TFlop on a chip Cell Broadband Engine PPE PowerPC Processor Element PPU PowerPC Processor Unit EIB Element Interconnect Bus SPE Synergistic Processor Element SPU Synergistic Processor Unit MFC Memory Flow Controller DMA Direct Memory Access SIMD Single Instruction Multiple Data • I have said nothing about the data ¿How does it compute? • PPU starts a program • PPU loads an SPU context on a thread • SPU acquires the thread • SPU executes context • SPU ends the task and returns control to PPU ¿How does it compute? • PPU starts a program • PPU loads an SPU context on a thread • SPU acquires the thread • SPU executes context • SPU ends the task and returns control to PPU • I have said nothing about the data Overview • LS is regiser based • No type distinction • Data should be aligned by hand • MFC is a DMA controller • Data moved with DMA primitives. -
Idisa+: a Portable Model for High Performance Simd Programming
IDISA+: A PORTABLE MODEL FOR HIGH PERFORMANCE SIMD PROGRAMMING by Hua Huang B.Eng., Beijing University of Posts and Telecommunications, 2009 a Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the School of Computing Science Faculty of Applied Science c Hua Huang 2011 SIMON FRASER UNIVERSITY Fall 2011 All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately. APPROVAL Name: Hua Huang Degree: Master of Science Title of Thesis: IDISA+: A Portable Model for High Performance SIMD Pro- gramming Examining Committee: Dr. Kay C. Wiese Associate Professor, Computing Science Simon Fraser University Chair Dr. Robert D. Cameron Professor, Computing Science Simon Fraser University Senior Supervisor Dr. Thomas C. Shermer Professor, Computing Science Simon Fraser University Supervisor Dr. Arrvindh Shriraman Assistant Professor, Computing Science Simon Fraser University SFU Examiner Date Approved: ii Declaration of Partial Copyright Licence The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.