Altivec™ Technology Programming Environments Manual

Total Page:16

File Type:pdf, Size:1020Kb

Altivec™ Technology Programming Environments Manual AltiVec™ Technology Programming Environments Manual ALTIVECPEM Rev. 3, 04/2006 How to Reach Us: Home Page: www.freescale.com email: [email protected] USA/Europe or Locations Not Listed: Freescale Semiconductor Technical Information Center, CH370 1300 N. Alma School Road Chandler, Arizona 85224 (800) 521-6274 480-768-2130 [email protected] Europe, Middle East, and Africa: Freescale Halbleiter Deutschland GmbH Technical Information Center Schatzbogen 7 81829 Muenchen, Germany +44 1296 380 456 (English) +46 8 52200080 (English) Information in this document is provided solely to enable system and software +49 89 92103 559 (German) implementers to use Freescale Semiconductor products. There are no express or +33 1 69 35 48 48 (French) implied copyright licenses granted hereunder to design or fabricate any integrated [email protected] circuits or integrated circuits based on the information in this document. Japan: Freescale Semiconductor reserves the right to make changes without further notice to Freescale Semiconductor Japan Ltd. Headquarters any products herein. Freescale Semiconductor makes no warranty, representation or ARCO Tower 15F guarantee regarding the suitability of its products for any particular purpose, nor does 1-8-1, Shimo-Meguro, Meguro-ku Freescale Semiconductor assume any liability arising out of the application or use of Tokyo 153-0064, Japan 0120 191014 any product or circuit, and specifically disclaims any and all liability, including without +81 3 5437 9125 limitation consequential or incidental damages. “Typical” parameters which may be [email protected] provided in Freescale Semiconductor data sheets and/or specifications can and do Asia/Pacific: vary in different applications and actual performance may vary over time. All operating Freescale Semiconductor Hong Kong Ltd. parameters, including “Typicals” must be validated for each customer application by Technical Information Center customer’s technical experts. Freescale Semiconductor does not convey any license 2 Dai King Street Tai Po Industrial Estate, under its patent rights nor the rights of others. Freescale Semiconductor products are Tai Po, N.T., Hong Kong not designed, intended, or authorized for use as components in systems intended for +800 2666 8080 surgical implant into the body, or other applications intended to support or sustain life, [email protected] or for any other application in which the failure of the Freescale Semiconductor product For Literature Requests Only: could create a situation where personal injury or death may occur. Should Buyer Freescale Semiconductor purchase or use Freescale Semiconductor products for any such unintended or Literature Distribution Center P.O. Box 5405 unauthorized application, Buyer shall indemnify and hold Freescale Semiconductor Denver, Colorado 80217 and its officers, employees, subsidiaries, affiliates, and distributors harmless against all (800) 441-2447 claims, costs, damages, and expenses, and reasonable attorney fees arising out of, 303-675-2140 directly or indirectly, any claim of personal injury or death associated with such Fax: 303-675-2150 LDCForFreescaleSemiconductor unintended or unauthorized use, even if such claim alleges that Freescale @hibbertgroup.com Semiconductor was negligent regarding the design or manufacture of the part. Freescale™ and the Freescale logo are trademarks of Freescale Semiconductor, Inc. The PowerPC name is a trademark of IBM Corp. and is used under license. All other product or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2001, 2003, 2006. All rights reserved. Document Number: ALTIVECPEM Rev. 3, 04/2006 Overview 1 AltiVec Register Set 2 Operand Conventions 3 Addressing Modes and Instruction Set Summary 4 Cache, Interrupts, and Memory Management 5 AltiVec Instructions 6 AltiVec Instruction Set Listings A Revision History B Glossary GLO Index IND 1 Overview 2 AltiVec Register Set 3 Operand Conventions 4 Addressing Modes and Instruction Set Summary 5 Cache, Interrupts, and Memory Management 6 AltiVec Instructions A AltiVec Instruction Set Listings B Revision History GLO Glossary IND Index Contents Paragraph Page Number Title Number About ThisContents Book Audience ............................................................................................................................xx Organization.......................................................................................................................xx Suggested Reading........................................................................................................... xxi General Information..................................................................................................... xxi Related Documentation................................................................................................ xxi Conventions .................................................................................................................... xxii Acronyms and Abbreviations ........................................................................................ xxiii Terminology Conventions................................................................................................xxv Chapter 1 Overview 1.1 Overview.......................................................................................................................... 1-1 1.2 AltiVec Technology Overview......................................................................................... 1-3 1.2.1 Levels of AltiVec ISA.................................................................................................. 1-4 1.2.2 Features Not Defined by AltiVec ISA ......................................................................... 1-5 1.3 AltiVec Architectural Model............................................................................................ 1-5 1.3.1 AltiVec Registers and Programming Model ................................................................ 1-5 1.3.2 Operand Conventions .................................................................................................. 1-6 1.3.2.1 Byte Ordering .......................................................................................................... 1-6 1.3.2.2 Floating-Point Conventions..................................................................................... 1-7 1.3.3 AltiVec Addressing Modes .......................................................................................... 1-7 1.3.4 AltiVec Instruction Set................................................................................................. 1-9 1.3.5 AltiVec Cache Model................................................................................................. 1-10 1.3.6 AltiVec Interrupt Model............................................................................................. 1-10 1.3.7 Memory Management Model .................................................................................... 1-10 Chapter 2 AltiVec Register Set 2.1 Overview of AltiVec and PowerPC Registers ................................................................. 2-1 2.2 AltiVec Register Set Overview ........................................................................................ 2-3 2.3 Registers Defined by AltiVec ISA................................................................................... 2-3 2.3.1 AltiVec Vector Register File (VRF)............................................................................. 2-4 2.3.2 Vector Status and Control Register (VSCR)................................................................ 2-4 2.3.3 Vector Save/Restore Register (VRSAVE) ................................................................... 2-6 AltiVec™ Technology Programming Enviroments Manual, Rev. 3 Freescale Semiconductor v Contents Paragraph Page Number Title Number 2.4 Additions to PowerPC UISA Registers ........................................................................... 2-7 2.4.1 PowerPC Condition Register....................................................................................... 2-7 2.5 Additions to PowerPC OEA Registers ............................................................................ 2-8 2.5.1 AltiVec Field Added in the PowerPC Machine State Register (MSR)........................ 2-8 2.5.2 Machine Status Save/Restore Registers (SRRs).......................................................... 2-9 2.5.2.1 Machine Status Save/Restore Register 0 (SRR0).................................................... 2-9 2.5.2.2 Machine Status Save/Restore Register 1 (SRR1).................................................. 2-10 Chapter 3 Operand Conventions 3.1 Data Organization in Memory ......................................................................................... 3-1 3.1.1 Aligned and Misaligned Accesses ............................................................................... 3-1 3.1.2 AltiVec Byte Ordering................................................................................................. 3-2 3.1.2.1 Big-Endian Byte Ordering....................................................................................... 3-2 3.1.2.2 Little-Endian Byte Ordering.................................................................................... 3-2 3.1.3 Quad Word Byte Ordering Example............................................................................ 3-3 3.1.4 Aligned Scalars in Little-Endian Mode ......................................................................
Recommended publications
  • Vxworks Architecture Supplement, 6.2
    VxWorks Architecture Supplement VxWorks® ARCHITECTURE SUPPLEMENT 6.2 Copyright © 2005 Wind River Systems, Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means without the prior written permission of Wind River Systems, Inc. Wind River, the Wind River logo, Tornado, and VxWorks are registered trademarks of Wind River Systems, Inc. Any third-party trademarks referenced are the property of their respective owners. For further information regarding Wind River trademarks, please see: http://www.windriver.com/company/terms/trademark.html This product may include software licensed to Wind River by third parties. Relevant notices (if any) are provided in your product installation at the following location: installDir/product_name/3rd_party_licensor_notice.pdf. Wind River may refer to third-party documentation by listing publications or providing links to third-party Web sites for informational purposes. Wind River accepts no responsibility for the information provided in such third-party documentation. Corporate Headquarters Wind River Systems, Inc. 500 Wind River Way Alameda, CA 94501-1153 U.S.A. toll free (U.S.): (800) 545-WIND telephone: (510) 748-4100 facsimile: (510) 749-2010 For additional contact information, please visit the Wind River URL: http://www.windriver.com For information on how to contact Customer Support, please visit the following URL: http://www.windriver.com/support VxWorks Architecture Supplement, 6.2 11 Oct 05 Part #: DOC-15660-ND-00 Contents 1 Introduction
    [Show full text]
  • Data Representation
    Chapter 4 – Data Representation The focus of this chapter is the representation of data in a digital computer. We begin with a review of several number systems (decimal, binary, octal, and hexadecimal) and a discussion of methods for conversion between the systems. The two most important methods are conversion from decimal to binary and binary to decimal. The conversions between binary and each of octal and hexadecimal are quite simple. Other conversions, such as hexadecimal to decimal, are often best done via binary. After discussion of conversion between bases, we discuss the methods used to store integers in a digital computer: one’s complement and two’s complement arithmetic. This includes a characterization of the range of integers that can be stored given the number of bits allocated to store an integer. The most common integer storage formats are 16, 32, and 64 bits. The next topic for this chapter is the storage of real (floating point) numbers. This discussion will mention the standard put forward by the Institute of Electrical and Electronic Engineers, the IEEE Standard 754 for floating point numbers, but will focus on the base–16 format used by IBM Mainframes. The chapter closes with a discussion of codes for storing characters, focusing on the EBCDIC system used on IBM mainframes. Number Systems There are four number systems of possible interest to the computer programmer: decimal, binary, octal, and hexadecimal. Each system is characterized by its base or radix, always given in decimal, and the set of permissible digits. Note that the hexadecimal numbering system calls for more than ten digits, so we use the first six letters of the alphabet.
    [Show full text]
  • Bits and Bytes and Words, Oh My!
    Bits and Bytes and Words, Oh My! Let’s get down to the real ni8y gri8y • CIS 211 In principle you already know ... Computer memory is binary (base 2) Everything: instrucEons, numbers, strings, ... memory is just one big array of binary numbers If I ask you what 01101011101010 represents, the only correct answer is “it depends” 0 = False = 0 volts; 1 = True = 5 volts (maybe) • CIS 211 CPU and Memory Main Memory CPU Address ALU Reg Reg Reg Values Reg Reg Reg Reg Reg Reg Other buses • CIS 211 CPU and Memory (simplified*) Main Memory CPU Address ALU Reg Reg Reg Values Reg Reg Reg Reg Reg Reg Other buses • CIS 211 (*with several useful lies) CPU and Memory CPU places a memory address on the address bus CPU may place a value on the data base and assert a “write” line (wire) or assert a “read” line and read a value from the data bus Main Memory CPU Address ALU Reg Reg Reg Values Reg Reg Reg Reg Reg Reg Other buses • CIS 211 A few terms: • A bit is a single binary digit • A byte is 8 binary digits • Most computer memory is “byte addressed”; a byte is the smallest addressable unit • What’s half a byte? (4 bits)? • A “word” is a sequence of bytes • Usually 4 bytes (32 bits) or 8 bytes (64 bits) depending on the computer (see next slide) 1 = True = 5 volts ; 0 = False = 0 volts (or 3.3 volts) • CIS 211 Typical byte-addressed memory 15 0 1 0 1 0 0 0 1 with 32-bit words 14 0 0 1 0 0 1 0 0 13 1 1 0 0 1 1 1 0 12 0 0 0 1 1 0 0 0 11 0 1 0 0 1 1 0 0 10 0 0 0 0 0 0 0 0 9 0 0 0 1 1 0 0 0 8 0 0 0 0 0 1 1 0 7 0 0 0 0 0 0 0 0 6 0 0 1 1 1 0 0 1 5 0 1 1 0 0 1 1
    [Show full text]
  • SIMD Extensions
    SIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
    [Show full text]
  • Endian: from the Ground up a Coordinated Approach
    WHITEPAPER Endian: From the Ground Up A Coordinated Approach By Kevin Johnston Senior Staff Engineer, Verilab July 2008 © 2008 Verilab, Inc. 7320 N MOPAC Expressway | Suite 203 | Austin, TX 78731-2309 | 512.372.8367 | www.verilab.com WHITEPAPER INTRODUCTION WHat DOES ENDIAN MEAN? Data in Imagine XYZ Corp finally receives first silicon for the main Endian relates the significance order of symbols to the computers chip for its new camera phone. All initial testing proceeds position order of symbols in any representation of any flawlessly until they try an image capture. The display is kind of data, if significance is position-dependent in that regularly completely garbled. representation. undergoes Of course there are many possible causes, and the debug Let’s take a specific type of data, and a specific form of dozens if not team analyzes code traces, packet traces, memory dumps. representation that possesses position-dependent signifi- There is no problem with the code. There is no problem cance: A digit sequence representing a numeric value, like hundreds of with data transport. The problem is eventually tracked “5896”. Each digit position has significance relative to all down to the data format. other digit positions. transformations The development team ran many, many pre-silicon simula- I’m using the word “digit” in the generalized sense of an between tions of the system to check datapath integrity, bandwidth, arbitrary radix, not necessarily decimal. Decimal and a few producer and error correction. The verification effort checked that all other specific radixes happen to be particularly useful for the data submitted at the camera port eventually emerged illustration simply due to their familiarity, but all of the consumer.
    [Show full text]
  • Binary Numbers 8'S Colum 8'S Colum 2'S Colum 1'S 4'S Colum 4'S N N N N
    Codes and number systems Introduction to Computer Yung-Yu Chuang with slides by Nisan & Schocken (www.nand2tetris.org) and Harris & Harris (DDCA) Coding • Assume that you want to communicate with your friend with a flashlight in a night, what will you do? light painting? What’s the problem? Solution #1 • A: 1 blink • B: 2 blibliknks • C: 3 blinks : • Z: 26 blinks Wha t’s the problem ? • How are you? = 131 blinks Solution #2: Morse code Hello Lookup • It is easy to translate into Morse code than reverse. Why? Lookup Lookup Useful for checking the correctness/ redddundency Braille Braille What’s common in these codes? • They are both binary codes. Binary representations • Electronic Implementation – Easy to store with bitblbistable elemen ts – Reliably transmitted on noisy and inaccurate wires 0 1 0 3.3V 2.8V 0.5V 0.0V Number systems <13> 1 Chapter = = Systems 2 1's column 10 1's column 10's column 2's column 100's column 4's column 1000's column 8's column 5374 1101 Number • numbers Decimal • Binary numbers Number Systems • Decimal numbers 1000's col 10's colum 1's column 100's colu u m n mn n 3 2 1 0 537410 = 5 ? 10 + 3 ? 10 + 7 ? 10 + 4 ? 10 five three seven four thousands hundreds tens ones • Binary numbers 8's colum 2's colum 1's colum 4's colum n n n n 3 2 1 0 11012 = 1 ? 2 + 1 ? 2 + 0 ? 2 + 1 ? 2 = 1310 one one no one eight four two one Chapter 1 <14> Binary numbers • Digits are 1 and 0 (a binary dig it is calle d a bit) 1 = true 0 = false • MSB –most significant bit • LSB –least significant bit MSB LSB • Bit numbering: 1 0 1 1 0
    [Show full text]
  • TCG ACPI Specification
    TCG ACPI Specification Family “1.2” and “2.0” Version 1.3 Revision 8 August 5, 2021 PUBLISHED Contact: [email protected] TCG PUBLISHED Copyright © TCG 2021 TCG TCG ACPI Specification Disclaimers, Notices, and License Terms THIS SPECIFICATION IS PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Without limitation, TCG disclaims all liability, including liability for infringement of any proprietary rights, relating to use of information in this specification and to the implementation of this specification, and TCG disclaims all liability for cost of procurement of substitute goods or services, lost profits, loss of use, loss of data or any incidental, consequential, direct, indirect, or special damages, whether under contract, tort, warranty or otherwise, arising in any way out of use or reliance upon this specification or any information herein. This document is copyrighted by Trusted Computing Group (TCG), and no license, express or implied, is granted herein other than as follows: You may not copy or reproduce the document or distribute it to others without written permission from TCG, except that you may freely do so for the purposes of (a) examining or implementing TCG specifications or (b) developing, testing, or promoting information technology standards and best practices, so long as you distribute the document with these disclaimers, notices, and license terms. Contact the Trusted Computing Group at www.trustedcomputinggroup.org for information on specification licensing through membership agreements. Any marks and brands contained herein are the property of their respective owners.
    [Show full text]
  • Numbering Systems
    12 Digital Principles Switching Theory R HAPTE 1 C NUMBERING SYSTEMS 1.0 INTRODUCTION Inside today’s computers, data is represented as 1’s and 0’s. These 1’s and 0’s might be stored magnetically on a disk, or as a state in a transistor, core, or vacuum tube. To perform useful operations on these 1’s and 0’s one have to organize them together into patterns that make up codes. Modern digital systems do not represent numeric values using the decimal system. Instead, they typically use a binary or two’s complement numbering system. To understand the digital system arithmetic, one must understand how digital systems represent numbers. This chapter discusses several important concepts including the binary, octal and hexadeci- mal numbering systems, binary data organization (bits, nibbles, bytes, words, and double words), signed and unsigned numbering systems. If one is already familiar with these terms he should at least skim this material. 1.1 A REVIEW OF THE DECIMAL SYSTEM People have been using the decimal (base 10) numbering system for so long that they probably take it for granted. When one see a number like “123”, he don’t think about the value 123; rather, he generate a mental image of how many items this value represents. In reality, however, the number 123 represents: 1*102 + 2*101 + 3*100 OR 100 + 20 + 3 Each digit appearing to the left of the decimal point represents a value between zero and nine times an increasing power of ten. Digits appearing to the right of the decimal point represent a value between zero and nine times an increasing negative power of ten.
    [Show full text]
  • An Empirical Analysis of the Ipv4 Number Market
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Illinois Digital Environment for Access to Learning and Scholarship Repository Buying numbers: An Empirical Analysis of the IPv4 Number Market Milton L. Mueller Brenden Kuerbis Syracuse University Syracuse University [email protected] and University of Toronto [email protected] Abstract The emergence of a trading market for previously allocated Internet number blocks is an important change in Internet governance. Almost all of the Internet’s 32-bit address space has been given out, and we have not migrated to a new internet protocol, IPv6, with a larger address space. IP addresses are therefore being commoditized, as organizations with surplus numbers sell address blocks to organizations that want more. Though controversial, we know little about this phenomenon. This paper quantifies the number of address blocks that have been traded as of July 2012 and analyzes the scant information that exists about the pricing of these resources, discovering the emergence of a billion dollar market. The paper then shows how this factual information relates to key policy debates, in particularly the role of needs assessment and property rights in IPv4 number blocks. Keywords: Internet governance, IP address markets, IPv4 scarcity, IPv6 migration, property rights Introduction One of the most important but least-studied aspects of Internet policy is the emergence of a trading market for previously allocated Internet number blocks. Without unique Internet protocol numbers for the networks and devices attached, the Internet simply doesn’t work. The original Internet Protocol standard, known as IPv4, specified a 32-bit numbering space, which provided slightly less than 4 billion unique numbers that could be used as addresses (Postel, 1981).
    [Show full text]
  • Multi-Platform Auto-Vectorization
    H-0236 (H0512-002) November 30, 2005 Computer Science IBM Research Report Multi-Platform Auto-Vectorization Dorit Naishlos, Richard Henderson* IBM Research Division Haifa Research Laboratory Mt. Carmel 31905 Haifa, Israel *Red Hat Research Division Almaden - Austin - Beijing - Haifa - India - T. J. Watson - Tokyo - Zurich LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. I thas been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g ,. payment of royalties). Copies may be requested from IBM T. J. Watson Research Center , P. O. Box 218, Yorktown Heights, NY 10598 USA (email: [email protected]). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home . Multi-Platform Auto-Vectorization Dorit Naishlos Richard Henderson IBM Haifa Labs Red Hat [email protected] [email protected] Abstract. The recent proliferation of the Single Instruction Multiple Data (SIMD) model has lead to a wide variety of implementations. These have been incorporated into many platforms, from gaming machines and em- bedded DSPs to general purpose architectures. In this paper we present an automatic vectorizer as implemented in GCC - the most multi-targetable compiler available today. We discuss the considerations that are involved in developing a multi-platform vectorization technology, and demonstrate how our vectorization scheme is suited to a variety of SIMD architectures.
    [Show full text]
  • A Bibliography of Publications in IEEE Micro
    A Bibliography of Publications in IEEE Micro Nelson H. F. Beebe University of Utah Department of Mathematics, 110 LCB 155 S 1400 E RM 233 Salt Lake City, UT 84112-0090 USA Tel: +1 801 581 5254 FAX: +1 801 581 4148 E-mail: [email protected], [email protected], [email protected] (Internet) WWW URL: http://www.math.utah.edu/~beebe/ 16 September 2021 Version 2.108 Title word cross-reference -Core [MAT+18]. -Cubes [YW94]. -D [ASX19, BWMS19, DDG+19, Joh19c, PZB+19, ZSS+19]. -nm [ABG+16, KBN16, TKI+14]. #1 [Kah93i]. 0.18-Micron [HBd+99]. 0.9-micron + [Ano02d]. 000-fps [KII09]. 000-Processor $1 [Ano17-58, Ano17-59]. 12 [MAT 18]. 16 + + [ABG+16]. 2 [DTH+95]. 21=2 [Ste00a]. 28 [BSP 17]. 024-Core [JJK 11]. [KBN16]. 3 [ASX19, Alt14e, Ano96o, + AOYS95, BWMS19, CMAS11, DDG+19, 1 [Ano98s, BH15, Bre10, PFC 02a, Ste02a, + + Ste14a]. 1-GHz [Ano98s]. 1-terabits DFG 13, Joh19c, LXB07, LX10, MKT 13, + MAS+07, PMM15, PZB+19, SYW+14, [MIM 97]. 10 [Loc03]. 10-Gigabit SCSR93, VPV12, WLF+08, ZSS+19]. 60 [Gad07, HcF04]. 100 [TKI+14]. < [BMM15]. > [BMM15]. 2 [Kir84a, Pat84, PSW91, YSMH91, ZACM14]. [WHCK18]. 3 [KBW95]. II [BAH+05]. ∆ 100-Mops [PSW91]. 1000 [ES84]. 11- + [Lyl04]. 11/780 [Abr83]. 115 [JBF94]. [MKG 20]. k [Eng00j]. µ + [AT93, Dia95c, TS95]. N [YW94]. x 11FO4 [ASD 05]. 12 [And82a]. [DTB01, Dur96, SS05]. 12-DSP [Dur96]. 1284 [Dia94b]. 1284-1994 [Dia94b]. 13 * [CCD+82]. [KW02]. 1394 [SB00]. 1394-1955 [Dia96d]. 1 2 14 [WD03]. 15 [FD04]. 15-Billion-Dollar [KR19a].
    [Show full text]
  • Compiler-Based Data Prefetching and Streaming Non-Temporal Store Generation for the Intel R Xeon Phitm Coprocessor
    Compiler-based Data Prefetching and Streaming Non-temporal Store Generation for the Intel R Xeon PhiTM Coprocessor Rakesh Krishnaiyer†, Emre K¨ult¨ursay†‡, Pankaj Chawla†, Serguei Preis†, Anatoly Zvezdin†, and Hideki Saito† † Intel Corporation ‡ The Pennsylvania State University TM Abstract—The Intel R Xeon Phi coprocessor has software used, and generating them. In this paper, we: TM prefetching instructions to hide memory latencies and spe- • Present how the Intel R Xeon Phi coprocessor soft- cial store instructions to save bandwidth on streaming non- ware prefetch and non-temporal streaming store instructions temporal store operations. In this work, we provide details on R compiler-based generation of these instructions and evaluate are generated by the Intel Composer XE 2013, TM their impact on the performance of the Intel R Xeon Phi • Evaluate the impact of these mechanisms on the overall coprocessor using a wide range of parallel applications with performance of the coprocessor using a variety of parallel different characteristics. Our results show that the Intel R applications with different characteristics. Composer XE 2013 compiler can make effective use of these Our experimental results demonstrate that (i) a large mechanisms to achieve significant performance improvements. number of applications benefit significantly from software prefetching instructions (on top of hardware prefetching) that I. INTRODUCTION are generated automatically by the compiler for the Intel R TM TM The Intel R Xeon Phi coprocessor based on Intel R Xeon Phi coprocessor, (ii) some benchmarks can further Many Integrated Core Architecture (Intel R MIC Architec- improve when compiler options that control prefetching ture) is a many-core processor with long vector (SIMD) units behavior are used (e.g., to enable indirect prefetching), targeted for highly parallel workloads in the High Perfor- and (iii) many applications benefit from compiler generated mance Computing (HPC) segment.
    [Show full text]