X86 Intrinsics Cheat Sheet
Jan Finis, [email protected]


Introduction

This cheat sheet (version 2.1f) displays most x86 intrinsics supported by Intel processors. The following intrinsics were omitted:
• obsolete or discontinued instruction sets like MMX and 3DNow!
• AVX-512, as it will not be available for some time and would blow up the space needed due to the vast amount of new instructions
• intrinsics not supported by Intel but only by AMD, like parts of the XOP instruction set (maybe they will be added in the future)
• intrinsics that are only useful for operating systems, like the _xsave intrinsic to save the CPU state
• the RdRand intrinsics, as it is unclear whether they provide real random numbers without enabling kleptographic loopholes

Each family of intrinsics is depicted by a box listing the required instruction set, the operation name, and the supported element types; the intrinsics are grouped as meaningfully as possible. Most information is taken from the Intel Intrinsics Guide (http://software.intel.com/en-us/articles/intel-intrinsics-guide). The signatures below use the sheet's shorthand: m = __m128, md = __m128d, mi = __m128i, i = int, ii = immediate int, f = float, and the _mm_ prefix of SSE/AVX intrinsic names is omitted.

Bit Operations

Boolean Logic
• Bool XOR: SSE2 xor (si128, ps[SSE], pd). mi xor_si128(mi a, mi b)
• Bool AND: SSE2 and (si128, ps[SSE], pd). mi and_si128(mi a, mi b)
• Bool NOT AND: SSE2 andnot (si128, ps[SSE], pd). Computes !a & b. mi andnot_si128(mi a, mi b)
• Bool OR: SSE2 or (si128, ps[SSE], pd). mi or_si128(mi a, mi b)

Bit Shifting & Rotation
• Arithmetic Shift Right: SSE2 sra[i] (epi16-64). Shifts elements right while shifting in sign bits. The i version takes an immediate as count; the version without i takes the count from a register. mi srai_epi32(mi a, ii32 i); mi sra_epi32(mi a, mi b)
• Logic Shift Left/Right: SSE2 sl/rl[i] (epi16-64, u16-64). Shifts elements left/right while shifting in zeros. The i version takes an immediate as count; the version without i takes the count from a register. mi slli_epi32(mi a, ii i); mi sll_epi32(mi a, mi b)
• Rotate Left/Right (≤64 bit): x86 _[l]rot[w]l/r. Rotates bits in a left/right by a number of bits specified in i. The l version is for 64-bit ints, the w version for 16-bit ints. u64 _lrotl(u64 a, i i)
• Variable Arithmetic Shift: AVX2 srav (epi32-64). Shifts each element right while shifting in sign bits; each element in a is shifted by an amount specified by the corresponding element in b. mi srav_epi32(mi a, mi b)
• Variable Logic Shift: AVX2 sl/rlv (epi32-64). Shifts elements left/right while shifting in zeros; each element in a is shifted by an amount specified by the corresponding element in b. mi sllv_epi32(mi a, mi b)

Selective Bit Moving (change the position of selected bits, zeroing out the remaining ones)
• Bit Scatter (Deposit, ≤64 bit): BMI2 _pdep (u32-64). Deposits contiguous low bits from a to dst at the bit locations specified by mask m; all other bits in dst are set to zero. u64 _pdep_u64(u64 a, u64 m)
• Bit Gather (Extract, ≤64 bit): BMI2 _pext (u32-64). Extracts bits from a at the bit locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero. u64 _pext_u64(u64 a, u64 m)
• Movemask: SSE2 movemask (epi8, pd, ps[SSE]). Creates a bitmask from the most significant bit of each element. i movemask_epi8(mi a)
• Extract Bits (≤64 bit): BMI1 _bextr (u32-64). Extracts l bits from a starting at bit s: (a >> s) & ((1 << l)-1). u64 _bextr_u64(u64 a, u32 s, u32 l)

Bit Scan
• Bit Scan Forward/Reverse: x86 _bit_scan_forward/reverse (i32). Returns the index of the lowest/highest bit set in the 32-bit int a; undefined if a is 0. i32 _bit_scan_forward(i32 a)

Bit Masking (set or reset specific ranges of 0 or 1 bits)
• Zero High Bits (≤64 bit): BMI2 _bzhi (u32-64). Zeros all bits in a higher than and including the bit at index i: dst := a; IF (i[7:0] < 64) dst[63:i] := 0. u64 _bzhi_u64(u64 a, u32 i)
• Reset Lowest 1-Bit (≤64 bit): BMI1 _blsr (u32-64). Returns a with the lowest 1-bit set to 0: (a-1) & a. u64 _blsr_u64(u64 a)
• Mask Up To Lowest 1-Bit (≤64 bit): BMI1 _blsmsk (u32-64). Returns an int that has all bits set up to and including the lowest 1-bit in a, or no bit set if a is 0: (a-1) ^ a. u64 _blsmsk_u64(u64 a)
• Find Lowest 1-Bit (≤64 bit): BMI1 _blsi (u32-64). Returns an int that has only the lowest 1-bit in a set, or no bit set if a is 0: (-a) & a. u64 _blsi_u64(u64 a)

Bit Counting
• Count 1-Bits (Popcount, ≤64 bit): POPCNT popcnt (u32-64). Counts the number of 1-bits in a. u32 popcnt_u64(u64 a)
• Count Leading Zeros (≤64 bit): LZCNT _lzcnt (u32-64). Counts the number of leading zeros in a. u64 _lzcnt_u64(u64 a)
• Count Trailing Zeros (≤64 bit): BMI1 _tzcnt (u32-64). Counts the number of trailing zeros in a. u64 _tzcnt_u64(u64 a)
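As a worked illustration of the bit-operation families above (not part of the original sheet), here is a minimal C sketch. The sample values and build flags are my own; it assumes a compiler and CPU with SSE2, BMI1, BMI2, and POPCNT support. Note that the sheet's shorthand names correspond to the full _mm_-prefixed names used in real code:

    /* build: gcc -O2 -mbmi -mbmi2 -mpopcnt bits.c (SSE2 is implied on x86-64) */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* SSE2 boolean logic and shifts on four packed 32-bit ints */
        __m128i a     = _mm_set_epi32(-8, 12, -3, 7);
        __m128i not_a = _mm_xor_si128(a, _mm_set1_epi32(-1)); /* NOT via XOR with all-ones */
        __m128i sar   = _mm_srai_epi32(a, 1);  /* arithmetic: shifts in sign bits */
        __m128i slr   = _mm_srli_epi32(a, 1);  /* logical: shifts in zeros */

        /* BMI1/BMI2/POPCNT scalar bit manipulation */
        uint64_t v = 0x2C; /* binary 101100 */
        printf("blsi   = %llx\n", (unsigned long long)_blsi_u64(v));       /* 0x4: lowest set bit isolated */
        printf("blsr   = %llx\n", (unsigned long long)_blsr_u64(v));       /* 0x28: lowest set bit cleared */
        printf("pext   = %llx\n", (unsigned long long)_pext_u64(v, 0x3C)); /* 0xB: bits 2..5 gathered */
        printf("popcnt = %lld\n", (long long)_mm_popcnt_u64(v));           /* 3 */
        (void)not_a; (void)sar; (void)slr;
        return 0;
    }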
Conversions

Packed Conversions (convert all elements in a packed SSE register)
• Convert Float 16 bit ↔ 32 bit: CVT16 cvtX_Y (ph ↔ ps). Converts between 4x 16-bit floats and 4x 32-bit floats (8 in 256-bit mode). For the 32-to-16-bit conversion, a rounding mode r must be specified. m cvtph_ps(mi a); mi cvtps_ph(m a, i r)
• Pack With Saturation: SSE2 pack[u]s (epi16, epi32). Packs ints from two input registers into one register, halving the bit width. Overflows are handled using saturation; the u version saturates to the unsigned integer type. mi packs_epi32(mi a, mi b)
• Sign Extend: SSE4.1 cvtX_Y (epi8-32). Sign extends each element from X to Y; Y must be longer than X. mi cvtepi8_epi32(mi a)
• Zero Extend: SSE4.1 cvtX_Y (epu8-32). Zero extends each element from X to Y; Y must be longer than X. mi cvtepu8_epi32(mi a)
• Conversion: SSE2 cvt[t]X_Y (epi32, ps/d). Converts packed elements from X to Y. If pd is used as source or target type, 2 elements are converted, otherwise 4. The t version is only available when converting to int and performs a truncation instead of rounding. md cvtepi32_pd(mi a)

Reinterpret Casts
• 128-bit Cast: SSE2 castX_Y (si128, ps/d). Reinterpret casts from X to Y; no operation is generated. md castsi128_pd(mi a)
• Cast 128 ↔ 256: AVX castX_Y (pd128↔pd256, ps128↔ps256, si128↔si256). Reinterpret casts from X to Y. If the cast is from 128 to 256 bit, the upper bits are undefined; no operation is generated. m128d castpd256_pd128(m256d a)
• 256-bit Cast: AVX castX_Y (ps, pd, si256). Reinterpret casts from X to Y; no operation is generated. m256 castpd_ps(m256d a)

Rounding (see also: conversion to int, which performs rounding implicitly)
• Round Up (Ceiling): SSE4.1 ceil (ps/d, ss/d). md ceil_pd(md a)
• Round Down (Floor): SSE4.1 floor (ps/d, ss/d). md floor_pd(md a)
• Round: SSE4.1 round (ps/d, ss/d). Rounds according to a rounding parameter r. md round_pd(md a, i r)

Single Element Conversion (convert a single element in the lower bytes of an SSE register)
• Single Conversion: SSE2 cvtX_Y (si32-64, ss/d → ss/d). Converts a single element from X to Y; the remaining bits of the result are copied from operand a. md cvtsi64_sd(md a, i64 b) computes double(b[63:0]) | (a[127:64] << 64)
• Single Float to Int Conversion: SSE2 cvt[t]X_Y (ss/d → si32-64). Converts a single element from X (float) to Y (int). The result is a normal int, not an SSE register! The t version performs a truncation instead of rounding. i cvtss_si32(m a)
• Single Int Conversion: SSE2 cvtX_Y (si32-128). Converts a single integer from X to Y; either of the types must be si128. If the new type is longer, the integer is zero extended. mi cvtsi32_si128(i a)
• Single SSE Float to Normal Float: SSE cvtX_Y (ss→f32, sd→f64). Converts a single float from X (SSE float type) to Y (normal float type). f cvtss_f32(m a)
• Old Float/Int Conversion: SSE cvt[t]_X2Y (pi↔ps, si↔ss). Converts X to Y. The p_ versions convert 2-packs from b and fill the result with the upper bits of operand a; the s_ versions do the same for one element in b. The t versions perform a truncation instead of rounding. m cvt_si2ss(m a, i b)
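A small C illustration of the cvt-versus-cvtt distinction and SSE4.1 rounding described above (not from the sheet; the sample values are my own, SSE4.1 assumed):

    /* build: gcc -O2 -msse4.1 convert.c */
    #include <smmintrin.h> /* SSE4.1, includes SSE2 */
    #include <stdio.h>

    int main(void) {
        __m128 f = _mm_set_ps(2.5f, -1.5f, 0.7f, -0.7f);

        __m128i r = _mm_cvtps_epi32(f);   /* cvt: rounds (to nearest even by default) */
        __m128i t = _mm_cvttps_epi32(f);  /* cvtt: truncates toward zero */

        __m128d d  = _mm_cvtepi32_pd(r);  /* pd as target: only 2 elements converted */
        __m128d fl = _mm_floor_pd(d);     /* SSE4.1 round down */

        int ri[4], ti[4];
        _mm_storeu_si128((__m128i *)ri, r);
        _mm_storeu_si128((__m128i *)ti, t);
        printf("-0.7 -> cvt %d, cvtt %d\n", ri[0], ti[0]); /* prints -1 vs 0 */
        (void)fl;
        return 0;
    }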
Arithmetics

Basic Arithmetics: Multiplication
• Carry-less Mul: CLMUL clmulepi64 (si128). Performs a carry-less multiplication.
• Mul: SSE2 mul (epi32[SSE4.1], epu32, ps/d, ss/d). Multiplies elements vertically.
• Mul Low: SSE2 mullo (epi16, epi32[SSE4.1]). Multiplies elements vertically and stores the low half of each product.
• Mul High: SSE2 mulhi (epi16, epu16). Multiplies elements vertically and stores the high half of each product.
• Mul High with Round & Scale: SSSE3 mulhrs (epi16). Treats the 16-bit elements as fixed-point numbers; multiplies, then rounds and scales the 32-bit intermediate result back to 16 bits.

Register I/O (loading data into an SSE register or storing data from an SSE register)
• Load: loads data from memory into an SSE register.
• Set Register (Set, Set Reversed, Insert, Replicate): inserts data into an SSE register without loading from memory.
• Store Fence: SSE sfence.
• Prefetch: SSE prefetch.
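To make the low/high split of Mul Low and Mul High concrete, a minimal C sketch (not part of the sheet; the sample values are my own):

    /* build: gcc -O2 mul.c (SSE2 is implied on x86-64) */
    #include <emmintrin.h> /* SSE2 */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Set: fill registers without a memory load */
        __m128i a = _mm_set1_epi16(20000);
        __m128i b = _mm_set1_epi16(3);

        __m128i lo = _mm_mullo_epi16(a, b); /* low 16 bits of each 32-bit product */
        __m128i hi = _mm_mulhi_epi16(a, b); /* high 16 bits of each 32-bit product */

        int16_t lo_out[8], hi_out[8];
        _mm_storeu_si128((__m128i *)lo_out, lo);
        _mm_storeu_si128((__m128i *)hi_out, hi);

        /* 20000 * 3 = 60000 = 0x0000EA60 does not fit in int16:
           lo alone wraps to -5536; combining hi and lo recovers it */
        printf("lo=%d hi=%d full=%d\n", lo_out[0], hi_out[0],
               (hi_out[0] << 16) | (uint16_t)lo_out[0]);
        return 0;
    }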
Recommended publications
  • Vectorization for Intel® C++ & Fortran Compiler (PDF)
    Vectorization for Intel® C++ & Fortran Compiler. Presenter: Georg Zitzlsberger. Date: 10-07-2015

    Agenda
    • Introduction to SIMD for Intel® Architecture
    • Compiler & Vectorization
    • Validating Vectorization Success
    • Reasons for Vectorization Fails
    • Intel® Cilk™ Plus
    • Summary

    Vectorization
    • Single Instruction Multiple Data (SIMD):
      ◦ Processing a vector with a single operation
      ◦ Provides data level parallelism (DLP)
      ◦ Because of DLP, more efficient than scalar processing
    • Vector:
      ◦ Consists of more than one element
      ◦ Elements are of the same scalar data type (e.g. floats, integers, …)
    • Vector length (VL): number of elements of the vector
    [Figure: scalar processing adds one pair Ai + Bi = Ci per operation, while vector processing adds all VL element pairs of A + B = C at once]

    Evolution of SIMD for Intel Processors
    • Since 1999: 128-bit vectors (SSE)
    • 2010, 2nd Generation Intel® Core™ Processors, Intel® AVX (256 bit): 2x FP throughput, 2x load throughput
    • 2012, 3rd Generation Intel® Core™ Processors: half-float support, random numbers
    • 2013, 4th Generation Intel® Core™ Processors, Intel® AVX2 (256 bit): 2x FMA peak performance/core, gather instructions
    • Present & future: Intel® MIC Architecture, Intel® AVX-512: 512-bit vectors, 2x FP/Load/FMA. Goal: 8x peak FLOPs (FMA) over 4 generations

    Optimization Notice: Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
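    In C, the scalar-versus-vector contrast on the slide looks roughly like this (a minimal sketch, not from the deck; SSE assumed, with n a multiple of 4 and 16-byte-aligned pointers for the aligned loads/stores):

        #include <xmmintrin.h> /* SSE */

        /* Scalar (SISD): one addition per iteration */
        void add_scalar(const float *a, const float *b, float *c, int n) {
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

        /* Vector (SIMD), VL = 4: four additions per iteration */
        void add_vector(const float *a, const float *b, float *c, int n) {
            for (int i = 0; i < n; i += 4) {
                __m128 va = _mm_load_ps(a + i);
                __m128 vb = _mm_load_ps(b + i);
                _mm_store_ps(c + i, _mm_add_ps(va, vb));
            }
        }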
  • Effective Virtual CPU Configuration with QEMU and Libvirt
    Effective Virtual CPU Configuration with QEMU and libvirt. Kashyap Chamarthy <[email protected]>. Open Source Summit Edinburgh, 2018

    Timeline of recent CPU flaws, 2018 (a)
    Jan 03 • Spectre v1: Bounds Check Bypass
    Jan 03 • Spectre v2: Branch Target Injection
    Jan 03 • Meltdown: Rogue Data Cache Load
    May 21 • Spectre-NG: Speculative Store Bypass
    Jun 21 • TLBleed: Side-channel attack over shared TLBs

    Timeline of recent CPU flaws, 2018 (b)
    Jun 29 • NetSpectre: Side-channel attack over local network
    Jul 10 • Spectre-NG: Bounds Check Bypass Store
    Aug 14 • L1TF: "L1 Terminal Fault"
    ... • ?

    What this talk is not about
    Out of scope:
    • Internals of various side-channel attacks
    • How to exploit Meltdown & Spectre variants
    • Details of performance implications
    Related talks in the 'References' section

    [Figure: KVM-based virtualization components. OpenStack et al. drive a virt driver, and libguestfs drives guestfish; both talk to libvirtd, which manages two QEMU VMs (VM1 with Disk1, VM2 with Disk2) and a custom appliance over QMP; QEMU reaches Linux with KVM via ioctl()]
  • Technical Report TR-2020-10
    Simulation-Based Engineering Lab, University of Wisconsin-Madison. Technical Report TR-2020-10. CryptoEmu: An Instruction Set Emulator for Computation Over Ciphers. Xiaoyang Gong and Dan Negrut. December 28, 2020

    Abstract: Fully homomorphic encryption (FHE) allows computations over encrypted data. This technique makes privacy-preserving cloud computing a reality. Users can send their encrypted sensitive data to a cloud server, get encrypted results returned, and decrypt them, without worrying about data breaches. This project report presents a homomorphic instruction set emulator, CryptoEmu, that enables fully homomorphic computation over encrypted data. The software-based instruction set emulator is built upon an open-source, state-of-the-art homomorphic encryption library that supports gate-level homomorphic evaluation. The instruction set architecture supports multiple instructions that belong to a subset of the ARMv8 instruction set architecture. The instruction set emulator utilizes parallel computing techniques to emulate every functional unit for minimum latency. This project report includes details on design considerations, instruction set emulator architecture, and datapath and control unit implementation. We evaluated and demonstrated the instruction set emulator's performance and scalability on a 48-core workstation. CryptoEmu showed a significant speedup in homomorphic computation performance when compared with HELib, a state-of-the-art homomorphic encryption library.

    Keywords: Fully Homomorphic Encryption, Parallel Computing, Homomorphic Instruction Set, Homomorphic Processor, Computer Architecture

    Contents: 1 Introduction; 2 Background; 3 TFHE Library; 4 CryptoEmu Architecture Overview (4.1 Data Processing, 4.2 Branch and Control Flow); 5 Data Processing Units (5.1 Load/Store Unit, 5.2 Adder …)
  • ADSP-21065L SHARC User's Manual; Chapter 2, Computation
    Computation Units

    The processor's computation units provide the numeric processing power for performing DSP algorithms, performing operations on both fixed-point and floating-point numbers. Each computation unit executes instructions in a single cycle. The processor contains three computation units:
    • An arithmetic/logic unit (ALU): performs a standard set of arithmetic and logic operations in both fixed-point and floating-point formats.
    • A multiplier: performs floating-point and fixed-point multiplication as well as fixed-point dual multiply/add or multiply/subtract operations.
    • A shifter: performs logical and arithmetic shifts, bit manipulation, field deposit and extraction operations on 32-bit operands and can derive exponents as well.

    [Figure 2-1: Computation units block diagram. The PM and DM data buses feed a 16 × 40-bit register file shared by the multiplier (result registers MR2, MR1, MR0), the shifter, and the ALU.]

    The computation units are architecturally arranged in parallel, as shown in Figure 2-1. The output from any computation unit can be input to any computation unit on the next cycle. The computation units store input operands and results locally in a ten-port register file. The Register File is accessible to the processor's program memory data (PMD) bus and its data memory data (DMD) bus. Both of these buses transfer data between the computation units and internal memory, external memory, or other parts of the processor. This chapter covers these topics:
    • Data formats
    • Register File data storage and transfers
  • Analysis of SIMD Applicability to SHA Algorithms (O. Aciicmez)
    Analysis of SIMD Applicability to SHA Algorithms. O. Aciicmez

    Abstract: It is possible to increase the speed and throughput of an algorithm using parallelization techniques. Single-Instruction Multiple-Data (SIMD) is a parallel computation model, which has already been employed by most of the current processor families. In this paper we will analyze four SHA algorithms and determine possible performance gains that can be achieved using SIMD parallelism. We will point out the appropriate parts of each algorithm, where SIMD instructions can be used.

    I. INTRODUCTION
    Today the security of a cryptographic mechanism is not the only concern of cryptographers. The heavy communication traffic on contemporary very large networks of interconnected devices demands a great bandwidth for security protocols, hence increasing the importance of speed and throughput of a cryptographic mechanism. A straightforward approach to improve cryptographic performance is to implement cryptographic algorithms in hardware. The remainder of the paper is organized as follows: In sections 2 and 3, we introduce the SIMD concept and the SIMD architecture of Intel, including MMX technology and SSE extensions. Section 4 describes the SHA algorithm and Section 5 discusses the possible improvements on SHA performance that can be achieved by using SIMD instructions.

    II. SIMD PARALLEL PROCESSING
    The single-instruction multiple-data execution model allows several data elements to be processed at the same time. The conventional scalar execution model, which is called single-instruction single-data (SISD), deals only with one pair of data at a time. Programs using SIMD instructions can run much faster than their scalar counterparts. However, SIMD-enabled programs are harder to design and implement. The most common use of SIMD instructions is to perform parallel arithmetic or logical operations on multiple data.
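    As a rough C illustration of such parallel arithmetic (my own sketch, not from the paper): SHA-style code mixes modular additions and rotations, and with SSE2 four 32-bit words can be processed per instruction. The function names here are hypothetical, and this is not a complete SHA round:

        #include <emmintrin.h> /* SSE2 */

        /* Rotate each 32-bit lane left by r; SSE2 has no packed rotate,
           so it is composed from two shifts and an OR */
        static __m128i rotl32(__m128i x, int r) {
            return _mm_or_si128(_mm_slli_epi32(x, r), _mm_srli_epi32(x, 32 - r));
        }

        /* Four SHA-style word updates at once: add round constants
           (modulo 2^32, since wraparound is implicit) and rotate each lane */
        __m128i quad_step(__m128i w, __m128i k) {
            return rotl32(_mm_add_epi32(w, k), 5);
        }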
  • Operating Guide
    Operating Guide. EPIA EN-Series Mini-ITX Mainboard. January 18, 2012, Version 1.21

    Table of Contents
    • VIA EPIA EN-Series Overview
    • VIA EPIA EN-Series Layout
    • VIA EPIA EN-Series Specifications
    • VIA EPIA EN Processor SKUs
    • VIA CN700 Chipset Overview
    • VIA EPIA EN-Series I/O Back Panel Layout
    • VIA EPIA EN-Series Layout Diagram & Mounting Holes
  • 2.5 Classification of Parallel Computers
    2.5 Classification of Parallel Computers

    2.5.1 Granularity
    In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.
    • fine level (same operations with different data)
      ◦ vector processors
      ◦ instruction level parallelism
      ◦ fine-grain parallelism:
        – Relatively small amounts of computational work are done between communication events
        – Low computation to communication ratio
        – Facilitates load balancing
        – Implies high communication overhead and less opportunity for performance enhancement
        – If granularity is too fine, it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.
    • operation level (different operations simultaneously)
    • problem level (independent subtasks)
      ◦ coarse-grain parallelism:
        – Relatively large amounts of computational work are done between communication/synchronization events
        – High computation to communication ratio
        – Implies more opportunity for performance increase
        – Harder to load balance efficiently

    2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1)
    With N elements in the pipeline and L clock cycles for one element, the calculation takes about L + N cycles; without pipelining it takes L ∗ N cycles. Example of good code for pipelining:

        do i = 1, k
            z(i) = x(i) + y(i)
        end do

    Vector processors provide fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but some constructs, e.g. recursion, are hard to optimise for vector processors. Example: Intel MMX – a simple vector processor.
  • Most Computer Instructions Can Be Classified Into Three Categories
    Introduction: Data Transfer and Manipulation

    Most computer instructions can be classified into three categories: 1) data transfer, 2) data manipulation, 3) program control instructions.
    » Data transfer instructions cause transfer of data from one location to another
    » Data manipulation instructions perform arithmetic, logic and shift operations.
    » Program control instructions provide decision-making capabilities and change the path taken by the program when executed in the computer.

    Data Transfer Instructions
    Typical data transfer instructions:
    LD » Load: transfer from memory to a processor register, usually an AC (memory read)
    ST » Store: transfer from a processor register into memory (memory write)
    MOV » Move: transfer from one register to another register
    XCH » Exchange: swap information between two registers or a register and a memory word
    IN/OUT » Input/Output: transfer data among processor registers and input/output devices
    PUSH/POP » Push/Pop: transfer data between processor registers and a memory stack

    Eight addressing modes for the LOAD instruction:

    MODE                 ASSEMBLY CONVENTION   REGISTER TRANSFER
    Direct Address       LD ADR                AC ← M[ADR]
    Indirect Address     LD @ADR               AC ← M[M[ADR]]
    Relative Address     LD $ADR               AC ← M[PC+ADR]
    Immediate Address    LD #NBR               AC ← NBR
    Index Address        LD ADR(X)             AC ← M[ADR+XR]
    Register             LD R1                 AC ← R1
    Register Indirect    LD (R1)               AC ← M[R1]
    Autoincrement        LD (R1)+              AC ← M[R1], R1 ← R1+1

    Data Manipulation Instructions
    1) Arithmetic, 2) Logical and bit manipulation, 3) Shift instructions
    Arithmetic instructions:
    NAME        MNEMONIC
    Increment   INC
    Decrement   DEC
    Add         ADD
    Subtract    SUB
    Multiply
  • Microcode Revision Guidance August 31, 2019 MCU Recommendations
    Microcode Revision Guidance, August 31, 2019. MCU Recommendations

    Section 1 – Planned microcode updates
    • Provides details on Intel microcode updates currently planned or available and corresponding to Intel-SA-00233, published June 18, 2019.
    • Changes from prior revision(s) will be highlighted in yellow.
    Section 2 – No planned microcode updates
    • Products for which Intel does not plan to release microcode updates. This includes products previously identified as such.

    LEGEND: Production Status:
    • Planned – Intel is planning on releasing an MCU at a future date.
    • Beta – Intel has released this production signed MCU under NDA for all customers to validate.
    • Production – Intel has completed all validation and is authorizing customers to use this MCU in a production environment.
  • Undocumented CPU Behavior: Analyzing Undocumented Opcodes on Intel x86-64 (Catherine Easdon)
    Undocumented CPU Behavior: Analyzing Undocumented Opcodes on Intel x86-64. Catherine Easdon

    Why investigate undocumented behavior? The "golden screwdriver" approach
    ● Intel have confirmed they add undocumented features to general-release chips for key customers
    "As far as the etching goes, we have done different things for different customers, and we have put different things into the silicon, such as adding instructions or pins or signals for logic for them. The difference is that it goes into all of the silicon for that product. And so the way that you do it is somebody gives you a feature, and they say, 'Hey, can you get this into the product?' You can't do something that takes up a huge amount of die, but you can do an instruction, you can do a signal, you can do a number of things that are logic-related." ~ Jason Waxman, Intel Cloud Infrastructure Group (Source: http://www.theregister.co.uk/2013/05/20/intel_chip_customization/)

    Poor documentation
    ● Intel has a long history of withholding information from their manuals (Sources: http://datasheets.chipdb.org/Intel/x86/Pentium/24143004.PDF; https://stackoverflow.com/questions/14413839/what-are-the-exhaustion-characteristics-of-rdrand-on-ivy-bridge)
    ● Even when the manuals don't withhold information, they are often misleading or inconsistent (compare Section 22.15, Intel Developer Manual Vol. 3, with Section 6.15, #UD exception)

    Poor documentation leads to vulnerabilities
    ● In operating systems
      ○ POP SS/MOV SS (May 2018)
        ■ Developer confusion over #DB handling
        ■ Load + execute unsigned kernel code on Windows
        ■ Also affected: Linux, MacOS, FreeBSD..
  • Asrock G41C-VS Motherboard, a Reliable Motherboard Produced Under Asrock’S Consistently Stringent Quality Control
    G41C-VS User Manual. Version 1.0. Published October 2009. Copyright©2009 ASRock INC. All rights reserved.

    Copyright Notice:
    No part of this manual may be reproduced, transcribed, transmitted, or translated in any language, in any form or by any means, except duplication of documentation by the purchaser for backup purpose, without written consent of ASRock Inc.
    Products and corporate names appearing in this manual may or may not be registered trademarks or copyrights of their respective companies, and are used only for identification or explanation and to the owners' benefit, without intent to infringe.

    Disclaimer:
    Specifications and information contained in this manual are furnished for informational use only and subject to change without notice, and should not be construed as a commitment by ASRock. ASRock assumes no responsibility for any errors or omissions that may appear in this manual.
    With respect to the contents of this manual, ASRock does not provide warranty of any kind, either expressed or implied, including but not limited to the implied warranties or conditions of merchantability or fitness for a particular purpose. In no event shall ASRock, its directors, officers, employees, or agents be liable for any indirect, special, incidental, or consequential damages (including damages for loss of profits, loss of business, loss of data, interruption of business and the like), even if ASRock has been advised of the possibility of such damages arising from any defect or error in the manual or product.
    This device complies with Part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation.
  • SIMD Extensions
    SIMD Extensions

    Contents: SIMD • MMX (instruction set) • 3DNow! • Streaming SIMD Extensions • SSE2 • SSE3 • SSSE3 • SSE4 • SSE5 • Advanced Vector Extensions • CVT16 instruction set • XOP instruction set

    SIMD

    Flynn's taxonomy:
                      Single instruction   Multiple instruction
      Single data     SISD                 MISD
      Multiple data   SIMD                 MIMD

    Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism.

    History
    The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.