Lec20-Vector.pdf


Vector Processors
CS252 Graduate Computer Architecture, Lecture 20: Vector Processing => Multimedia
David E. Culler (many slides due to Christoforos E. Kozyrakis)

• Initially developed for super-computing applications, today important for multimedia
• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

    SCALAR (1 operation):   add r3, r1, r2      # r3 = r1 + r2
    VECTOR (N operations):  vadd.vv v3, v1, v2  # v3[i] = v1[i] + v2[i], over the vector length

Properties of Vector Processors
• A single vector instruction implies lots of work (a whole loop)
  – Fewer instruction fetches
• Each result is independent of previous results
  – Multiple operations can be executed in parallel
  – Simpler design, high clock rate
  – Compiler (or programmer) ensures no dependencies
• Reduces branches and branch problems in pipelines
• Vector instructions access memory with a known pattern
  – Effective prefetching
  – Amortizes memory latency over a large number of elements
  – Can exploit a high-bandwidth memory system
  – No (data) caches required!

Styles of Vector Architectures
• Memory-memory vector processors
  – All vector operations are memory to memory
• Vector-register processors
  – All vector operations are between vector registers (except vector load and store)
  – Vector equivalent of load-store architectures
  – Includes all vector machines since the late 1980s
  – We assume vector-register machines for the rest of the lecture

Historical Perspective
• Mid-60s: fear that performance stagnates
• SIMD processor arrays actively developed during the late 60s through mid 70s
  – bit-parallel machines for image processing: PEPE, STARAN, MPP
  – word-parallel machines for scientific computing: Illiac IV
• Cray develops fast scalar machines: CDC 6600, 7600
• CDC bets on vectors with the Star-100
• Amdahl argues against vectors

Cray-1 Breakthrough
• Fast, simple scalar processor
  – 80 MHz!
  – single-phase clocking, latches
• Exquisite electrical and mechanical design
• Semiconductor memory
• Vector register concept
  – vast simplification of the instruction set
  – reduced the necessary memory bandwidth
• Tight integration of vector and scalar units
• Piggy-backed off the 7600 STACKLIB
• Vectorizing compilers developed later
• Owned high-performance computing for a decade
  – what happened then? VLIW competition

Components of a Vector Processor
• Scalar CPU: registers, datapaths, instruction fetch logic
• Vector registers
  – Fixed-length memory bank holding a single vector
  – Typically 8-32 vector registers, each holding 1 to 8 Kbits
  – At least 2 read ports and 1 write port
  – For multimedia, can be viewed as an array of 64b, 32b, 16b, or 8b elements
• Vector functional units (FUs)
  – Fully pipelined, start a new operation every clock
  – Typically 2 to 8 FUs: integer and FP
  – Multiple datapaths (pipelines) per unit to process multiple elements per cycle
• Vector load-store units (LSUs)
  – Fully pipelined unit to load or store a vector
  – Multiple elements fetched/stored per cycle
  – May have multiple LSUs
• Cross-bar to connect FUs, LSUs, and registers

Cray-1 Block Diagram
• Simple 16-bit register-register instructions
• 32-bit instructions with immediates
• Natural combinations of scalar and vector operations
• Scalar bit-vectors match the vector length
• Gather/scatter memory references
• Conditional merge

Basic Vector Instructions

    Instr.   Operands    Operation                 Comment
    VADD.VV  V1,V2,V3    V1 = V2 + V3              vector + vector
    VADD.SV  V1,R0,V2    V1 = R0 + V2              scalar + vector
    VMUL.VV  V1,V2,V3    V1 = V2 x V3              vector x vector
    VMUL.SV  V1,R0,V2    V1 = R0 x V2              scalar x vector
    VLD      V1,R1       V1 = M[R1..R1+63]         load, stride = 1
    VLDS     V1,R1,R2    V1 = M[R1..R1+63*R2]      load, stride = R2
    VLDX     V1,R1,V2    V1 = M[R1+V2_i, i=0..63]  indexed ("gather")
    VST      V1,R1       M[R1..R1+63] = V1         store, stride = 1
    VSTS     V1,R1,R2    M[R1..R1+63*R2] = V1      store, stride = R2
    VSTX     V1,R1,V2    M[R1+V2_i, i=0..63] = V1  indexed ("scatter")

+ all the regular scalar instructions (RISC style)
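To make the table above concrete, here is a minimal C sketch of the element-wise semantics of three of these instructions. The 64-element length matches the table; the function names and the explicit loops are illustrative assumptions only; real hardware performs each loop as a single vector instruction:

    #include <stdint.h>

    enum { VLEN = 64 };  /* vector length, per the table above */

    /* VADD.VV V1,V2,V3 : vector + vector */
    void vadd_vv(int64_t v1[VLEN], const int64_t v2[VLEN], const int64_t v3[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            v1[i] = v2[i] + v3[i];
    }

    /* VLDS V1,R1,R2 : strided load from base r1 with stride r2 (in elements) */
    void vlds(int64_t v1[VLEN], const int64_t *mem, int64_t r1, int64_t r2) {
        for (int i = 0; i < VLEN; i++)
            v1[i] = mem[r1 + i * r2];
    }

    /* VLDX V1,R1,V2 : indexed load ("gather") using V2 as the index vector */
    void vldx(int64_t v1[VLEN], const int64_t *mem, int64_t r1, const int64_t v2[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            v1[i] = mem[r1 + v2[i]];
    }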
Vector Memory Operations
• Load/store operations move groups of data between registers and memory
• Three types of addressing
  – Unit stride (fastest)
  – Non-unit (constant) stride
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
    • compress/expand variants also exist
• Support for various combinations of data widths in memory
  – {.L, .W, .H, .B} x {64b, 32b, 16b, 8b}

Vector Code Example: Y[0:63] = Y[0:63] + a*X[0:63]

64-element SAXPY, scalar code:

          LD    R0,a
          ADDI  R4,Rx,#512
    loop: LD    R2,0(Rx)
          MULTD R2,R0,R2
          LD    R5,0(Ry)
          ADDD  R5,R2,R5
          SD    R5,0(Ry)
          ADDI  Rx,Rx,#8
          ADDI  Ry,Ry,#8
          SUB   R20,R4,Rx
          BNZ   R20,loop

64-element SAXPY, vector code:

    LD      R0,a       # load scalar a
    VLD     V1,Rx      # load vector X
    VMUL.SV V2,R0,V1   # vector multiply
    VLD     V3,Ry      # load vector Y
    VADD.VV V4,V2,V3   # vector add
    VST     V4,Ry      # store vector Y

Vector Length
• A vector register can hold some maximum number of elements for each data width (the maximum vector length, or MVL)
• What to do when the application vector length is not exactly MVL?
• A vector-length (VL) register controls the length of any vector operation, including vector loads and stores
  – E.g. vadd.vv with VL=10 is: for (I=0; I<10; I++) V1[I] = V2[I] + V3[I]
• VL can be anything from 0 to MVL
• How do you code an application where the vector length is not known until run-time?

Strip Mining
• Suppose the application vector length > MVL
• Strip mining
  – Generate a loop that handles MVL elements per iteration
  – A set of operations on MVL elements is translated to a single vector instruction
• Example: vector SAXPY of N elements (a runnable sketch follows below)
  – The first loop handles (N mod MVL) elements; the remaining loops handle MVL each

    VL = (N mod MVL);        // set VL = N mod MVL
    for (I=0; I<VL; I++)     // 1st loop is a single set of
      Y[I] = A*X[I] + Y[I];  //   vector instructions
    low = (N mod MVL);
    VL = MVL;                // set VL to MVL
    for (I=low; I<N; I++)    // 2nd loop requires N/MVL
      Y[I] = A*X[I] + Y[I];  //   sets of vector instructions
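As promised above, a minimal, runnable C sketch of the same strip-mining pattern, with the VL register modeled explicitly. The helper saxpy_vl stands in for one set of vector instructions executed with VL = vl; modeling it as a software loop is an assumption for illustration:

    #include <stddef.h>

    enum { MVL = 64 };  /* assumed maximum vector length */

    /* Stands in for one set of vector instructions executed with VL = vl. */
    static void saxpy_vl(size_t vl, double a, const double *x, double *y) {
        for (size_t i = 0; i < vl; i++)  /* hardware would do this as one vmul/vadd */
            y[i] = a * x[i] + y[i];
    }

    /* Strip-mined SAXPY: the first strip handles N mod MVL elements,
       every following strip handles exactly MVL. */
    void saxpy(size_t n, double a, const double *x, double *y) {
        size_t vl = n % MVL;                        /* set VL = N mod MVL */
        saxpy_vl(vl, a, x, y);                      /* first (short) strip */
        for (size_t low = vl; low < n; low += MVL)  /* N/MVL full strips */
            saxpy_vl(MVL, a, x + low, y + low);
    }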
Optimization 1: Chaining
• Suppose:

    vmul.vv V1,V2,V3
    vadd.vv V4,V1,V5   # RAW hazard

• Chaining
  – The vector register (V1) is treated not as a single entity but as a group of individual registers
  – Pipeline forwarding can then work on individual vector elements
• Flexible chaining: allow a vector instruction to chain to any other active vector operation => requires more read/write ports
• Unchained, the vadd cannot start until the vmul completes; chained, the two overlap element by element (the Cray X-MP introduced memory chaining)

Optimization 2: Multi-lane Implementation
(Figure: one lane's datapath: a partition of the vector register file feeding a pipelined functional unit, connected to/from the memory system.)
• Elements of the vector registers are interleaved across the lanes
• Each lane receives identical control
• Multiple element operations are executed per cycle
• Modular, scalable design
• No need for inter-lane communication for most vector instructions

Chaining & Multi-lane Example
(Figure: instruction issue vs. element operations over time for the sequence vld, vmul.vv, vadd.vv, addu.)
• With VL=16, 4 lanes, 2 FUs, 1 LSU, and chaining: 12 element operations per cycle
• Just one new instruction issued per cycle!

Optimization 3: Conditional Execution
• Suppose you want to vectorize this:

    for (I=0; I<N; I++)
      if (A[I] != B[I]) A[I] -= B[I];

• Solution: vector conditional execution
  – Add vector flag registers with single-bit elements
  – Use a vector compare to set a flag register
  – Use the flag register as a mask to control the vector subtract
• The subtraction is executed only for vector elements whose corresponding flag element is set
• Vector code (a C sketch of masked execution appears at the end of this section):

    vld  V1, Ra
    vld  V2, Rb
    vcmp.neq.vv F0, V1, V2   # vector compare
    vsub.vv V3, V1, V2, F0   # conditional vsub
    vst  V3, Ra

  – Cray machines use vector mask & merge instead

Two Ways to View Vectorization
• Inner-loop vectorization (the classic approach)
  – Think of the machine as, say, 32 vector registers, each with 16 elements
  – 1 instruction updates 32 elements of 1 vector register
  – Good for vectorizing single-dimension arrays or regular kernels (e.g. saxpy)
• Outer-loop vectorization (post-CM2)
  – Think of the machine as 16 "virtual processors" (VPs), each with 32 scalar registers (effectively a multithreaded processor)
  – 1 instruction updates 1 scalar register in each of 16 VPs
  – Good for irregular kernels or kernels with loop-carried dependences in the inner loop
• These are just two compiler perspectives
  – The hardware is the same for both

Vectorizing Matrix Mult

    // Matrix-matrix multiply:
    // sum a[i][t] * b[t][j] to get c[i][j]
    for (i=1; i<n; i++) {
      for (j=1; j<n; j++) {
        sum = 0;
        for (t=1; t<n; t++) {
          sum += a[i][t] * b[t][j];  // loop-carried dependence
        }
        c[i][j] = sum;
      }
    }

Parallelize Inner Product
(Figure: the inner product computed as a sum of partial products: the multiplies execute in parallel and the partial products are reduced by a tree of adds.)

Outer-loop Approach

    // Outer-loop matrix-matrix multiply:
    // sum a[i][t] * b[t][j] to get c[i][j]
    // 32 elements of the result calculated in parallel
    // with each iteration of the j-loop (c[i][j:j+31])
    for (i=1; i<n; i++) {
      for (j=1; j<n; j+=32) {  // loop being vectorized
        sum[0:31] = 0;
        for (t=1; t<n; t++) {
          ascalar = a[i][t];                     // scalar load
          bvector[0:31] = b[t][j:j+31];          // vector load
          prod[0:31] = bvector[0:31] * ascalar;  // vector multiply
          sum[0:31] += prod[0:31];               // vector add
        }
        c[i][j:j+31] = sum[0:31];                // vector store
      }
    }

Approaches to Mediaprocessing
• Multimedia processing sits between several implementation approaches: general-purpose processors with SIMD extensions, VLIW processors with SIMD extensions (aka mediaprocessors), vector processors, DSPs, and ASICs/FPGAs.
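As referenced under Optimization 3, here is a minimal C sketch of masked (conditional) execution, modeling a flag register as an array of single-bit values. The names, the 64-element length, and the masked-semantics choice (masked-off elements keep the old destination value, so storing V3 back over A leaves them unchanged) are illustrative assumptions:

    #include <stdint.h>

    enum { VLEN = 64 };  /* assumed vector length */

    /* vcmp.neq.vv F0,V1,V2 : set the flag bit where elements differ */
    void vcmp_neq_vv(uint8_t f0[VLEN], const int64_t v1[VLEN], const int64_t v2[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            f0[i] = (v1[i] != v2[i]);
    }

    /* vsub.vv V3,V1,V2,F0 : masked subtract; where the flag is clear,
       V3 keeps V1's (A's) old value */
    void vsub_vv_masked(int64_t v3[VLEN], const int64_t v1[VLEN],
                        const int64_t v2[VLEN], const uint8_t f0[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            v3[i] = f0[i] ? (v1[i] - v2[i]) : v1[i];
    }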