Yet Another Survey on SIMD Instructions


Armando Faz Hernández ([email protected])
Computer Science Department, UNICAMP

Outline
• Introduction
• SIMD for multimedia
• ARM architecture
• Implementation aspects
• Auto vectorization
• Concluding remarks

Introduction

RISC Computers
Around 1980, RISC computers established a milestone in computer architecture design:
• they provided pipelined execution;
• they extracted parallelism among instructions;
• they introduced Instruction-Level Parallelism (ILP).
The performance gained by RISC processors was limited by the applications.

Flynn Taxonomy
Flynn categorized computers according to how data and instructions are processed: a program can be seen as a stream of instructions applied over a stream of data. This gives four classes:
• Single Instruction, Single Data (SISD)
• Single Instruction, Multiple Data (SIMD)
• Multiple Instruction, Single Data (MISD)
• Multiple Instruction, Multiple Data (MIMD)

First SIMD approach
From 1970 onwards, vector architectures appeared as the first implementation of SIMD processing: the Illiac IV from the University of Illinois, the CDC Star, and the ASC from Texas Instruments.
• They processed more than 64 elements per operation.
• They contained many replicated functional units.
• The Cray series of computers was a successful implementation.
The Illiac IV was able to compute 128 32-bit multiplications in 625 ns.

Survey content
1. A description of the SIMD instruction sets for multimedia applications, both on desktop processors (Intel and AMD) and on low-power architectures such as ARM.
2. How to exploit parallelism in a SIMD fashion.
3. Some tools that enable auto-vectorization of code.

SIMD for multimedia

What is multimedia?
Image and sound processing frequently involve performing the same instruction on a set of short data types, in particular 8-bit pixels and 16-bit audio samples. Adding a dedicated unit for these widths is expensive; a cheaper solution is to partition the carry chains of a 64-bit ALU.
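To make the idea concrete, here is a minimal C sketch of such partitioned arithmetic done in software, often called SWAR ("SIMD within a register"): eight 8-bit lanes packed into a 64-bit word are added at once, masking the high bit of every byte so that no carry can cross a lane boundary. The function name and test values are our own illustration, not material from the survey.

    #include <stdint.h>
    #include <stdio.h>

    /* Add eight packed 8-bit lanes held in one 64-bit word, with the carry
     * chain cut at every lane boundary (what a partitioned ALU does in
     * hardware). The high bit of each byte is masked off so the addition
     * cannot carry into the next lane; that bit is then restored via XOR. */
    static uint64_t add_packed_u8(uint64_t x, uint64_t y)
    {
        const uint64_t H = 0x8080808080808080ULL; /* high bit of each byte   */
        const uint64_t L = 0x7F7F7F7F7F7F7F7FULL; /* low 7 bits of each byte */
        return ((x & L) + (y & L)) ^ ((x ^ y) & H);
    }

    int main(void)
    {
        uint64_t a = 0x0102030405060708ULL;
        uint64_t b = 0x10FF10FF10FF10FFULL; /* 0xFF lanes wrap independently */
        printf("%016llx\n", (unsigned long long)add_packed_u8(a, b));
        return 0;
    }

Each byte wraps around on its own, which is exactly the behavior an ALU with partitioned carry chains provides for packed modular addition.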
Multimedia in 1995
[Figure: an IBM computer from 1995 next to a current (2010) computer.]

MultiMedia eXtensions
MMX was released in 1997; it introduced 57 integer instructions and a register set, MMX0-MMX7. MMX can process operations over 8-bit, 16-bit and 32-bit vectors.

3DNow!
MMX can work either in integer mode or in FPU mode, but not both at once, and switching between the modes incurs performance losses. In 1998, AMD released the 3DNow! technology, which uses the same register set as MMX but computes floating-point operations. 3DNow! never became popular enough to sustain further development, and in 2010 AMD decided to deprecate it.

Streaming SIMD Extensions
Intel identified the problems of MMX and decided to provide a new set of registers, known as XMM, together with a new instruction set called Streaming SIMD Extensions (SSE).
• XMM registers are 128 bits long.
• SSE contains 70 new instructions for floating-point arithmetic.
• SSE instructions can compute up to four 32-bit floating-point operations at once.
In 2000, AMD launched the x64 extension, which doubles the number of XMM registers (XMM0-XMM15).

SSE2
The second iteration of SSE, called SSE2, adds 140 new instructions for double-precision floating-point processing. This release was focused on 3D games and Computer-Aided Design applications. Although SSE2 operates over four elements, its performance was roughly the same as that of MMX, which operates on just two elements; the loss of performance was due to accesses to misaligned memory addresses.

SSE3
To solve the performance issues caused by misaligned data, SSE3 incorporates new instructions that load from unaligned memory addresses while minimizing timing penalties. Supplemental Streaming SIMD Extensions 3 (SSSE3) was released in 2006, adding new instructions such as multiply-and-add and vector alignment.
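As a concrete illustration of the SSE programming model and of the alignment issue just described, here is a small C sketch using compiler intrinsics (the function and array names are ours, and n is assumed to be a multiple of 4 to keep the sketch short). _mm_load_ps requires 16-byte-aligned addresses, while _mm_loadu_ps accepts any address at a potential speed penalty, the penalty the SSE2/SSE3 slides refer to.

    #include <xmmintrin.h> /* SSE intrinsics */

    /* Add two float arrays four elements at a time in 128-bit XMM registers. */
    void add_arrays(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i); /* unaligned load: always safe */
            __m128 vb = _mm_loadu_ps(b + i); /* _mm_load_ps would demand    */
            __m128 vc = _mm_add_ps(va, vb);  /* 16-byte-aligned pointers    */
            _mm_storeu_ps(out + i, vc);
        }
    }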
SSE4
SSE4 comprises 54 new instructions: 47 in SSE4.1 and 7 in SSE4.2, the latter dedicated to string processing. It also includes elaborate instructions to perform population counts and to compute the CRC-32 error-detecting code.

Advanced Vector Extensions
Intel decided to move computations to wider registers and introduced the Advanced Vector Extensions (AVX). This technology involves:
• sixteen 256-bit registers, called YMM0-YMM15;
• the ability to write three-operand code in assembly listings;
• the VEX encoding scheme, which enlarges the space of operation codes;
• support for the legacy SSEx instructions through the VEX prefix.

Advanced Vector Extensions 2
The second version of AVX, released in 2012, widens many integer operations to the 256-bit registers. AVX2 also adds gather operations, which load a register from non-contiguous memory locations.

SSE5, XOP, FMA4
There has been considerable confusion about the future direction of the instruction sets: both Intel and AMD have changed their proposals around SSE5. Bulldozer, AMD's latest micro-architecture, implements the XOP and FMA4 instruction sets and is also compatible with AVX. Piledriver and Haswell are the next micro-architectures from AMD and Intel, respectively; both will provide more multiply-and-add instructions, for floating-point as well as integer operations.

ARM architecture

ARM
ARM is a low-power processor widely deployed in devices such as routers, tablets and cell phones, and recently integrated into GPUs.
• ARM is a 32-bit architecture with a pipelined processor.
• Most instructions execute in just one clock cycle.
• All instructions are conditionally executed.
• Native ARM instructions have fixed-size encodings.

Thumb and Thumb-2
Fixed-size encodings give fast instruction decoding but large binary programs. Thumb, proposed in 1994, is a compact encoding scheme able to encode a subset of the ARM instructions in 16-bit opcodes; the shorter encodings left out conditional instruction execution. Thumb-2 solved this issue, allowing variable-size encodings while supporting conditional execution through the IT instruction.
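To see what conditional execution buys in practice, consider this small, made-up C function; a compiler targeting Thumb-2 can translate the conditional into a branch-free sequence guarded by an IT block, as sketched in the comment (one plausible encoding, not verified compiler output).

    /* Saturating subtraction: returns a - b, or 0 if b >= a.
     * A plausible branch-free Thumb-2 translation:
     *
     *     subs  r0, r0, r1   @ r0 = a - b, set flags
     *     it    lo           @ if a < b (unsigned lower) ...
     *     movlo r0, #0       @ ... conditionally clear the result
     *     bx    lr
     */
    unsigned sat_sub(unsigned a, unsigned b)
    {
        return (a > b) ? (a - b) : 0;
    }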