Data-Level Parallelism in Vector, SIMD, and GPU Architectures (Chapter 4)


Introduction: Focusing on SIMD
● SIMD architectures can exploit significant data-level parallelism for:
– matrix-oriented scientific computing
– media-oriented image and sound processors
● SIMD is more energy efficient than MIMD
– Only needs to fetch one instruction per data operation
– Makes SIMD attractive for personal mobile devices
● SIMD allows the programmer to continue to think sequentially
– "Backwards compatibility"
● Three major implementations
– Vector architectures
– SIMD extensions
– Graphics Processing Units (GPUs)

ILP vs. SIMD vs. MIMD
(Figure: a comparison of ILP, SIMD, and MIMD; only the three labels survive the extraction.)

SIMD or MIMD?
● Based on past history, the book predicts that for x86:
– We will get two additional cores per chip per year
● Linear growth
– We will get twice-as-wide SIMD instructions every 4 years
● Exponential growth
– Potential speedup from SIMD is predicted to be twice that from MIMD!

Vector Architecture Basics
● The underlying idea:
– The processor does everything on a set of numbers at a time
● Read sets of data elements into "vector registers"
● Operate on those vector registers
● Disperse the results back into memory in sets
● Registers are controlled by the compiler to...
– ...hide memory latency
● The price of latency is paid only once, for the first element of the vector
● No need for a sophisticated prefetcher
● Always load consecutive elements from memory (unless gather-scatter is used)
– ...leverage memory bandwidth

Example Architecture: VMIPS
● Loosely based on the Cray-1
● Vector registers
– Each register holds a 64-element vector, 64 bits per element
– The register file has 16 read ports and 8 write ports
● Vector functional units
– Fully pipelined: a new instruction every cycle
– Data and control hazards are detected
● Vector load-store unit
– Fully pipelined
– One word per clock cycle after an initial latency
● Scalar registers
– The 32 general-purpose registers of MIPS
– The 32 floating-point registers of MIPS
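The following slides vectorize the DAXPY kernel Y = a*X + Y on VMIPS. As a point of reference, here is a minimal scalar C sketch of that kernel (the function name daxpy and the fixed length of 64, one VMIPS vector's worth, are illustrative assumptions, not from the slides):

/* Scalar DAXPY over one 64-element strip: Y = a*X + Y.
   A vectorizing compiler maps this loop body onto the LV / MULVS.D /
   ADDVV.D / SV sequence shown on the next slide. */
void daxpy(double a, const double *x, double *y)
{
    for (int i = 0; i < 64; ++i)
        y[i] = a * x[i] + y[i];
}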
How Vector Processors Work: Y = a*X + Y Example
● Scalar MIPS version (about 600 instructions executed):
L.D F0, a
DADDIU R4, Rx, #512
Loop: L.D F2, 0(Rx)
MUL.D F2, F2, F0
L.D F4, 0(Ry)
ADD.D F4, F4, F2
S.D F4, 0(Ry)
DADDIU Rx, Rx, #8
DADDIU Ry, Ry, #8
DSUB R20, R4, Rx
BNEZ R20, Loop
● VMIPS version (6 instructions, no loop):
L.D F0, a ;load scalar a
LV V1, Rx ;load vector X
MULVS.D V2, V1, F0 ;vector-scalar multiply
LV V3, Ry ;load vector Y
ADDVV.D V4, V2, V3 ;add two vectors
SV Ry, V4 ;store the sum

Vector Execution Time
● Execution time depends on three factors:
– Length of the operand vectors
– Structural hazards
– Data dependencies
● In VMIPS:
– Functional units consume one element per clock cycle
– Execution time is approximately the vector length
● Convoy
– A set of vector instructions that could potentially execute together
– No structural hazards
– RAW hazards are avoided through chaining
● In a superscalar RISC this is called forwarding
– The number of convoys gives an estimate of execution time

Chaining and Chimes
● Chaining
– Like forwarding in a superscalar RISC
– Allows a vector operation to start as soon as the individual elements of its vector source operand become available
– Flexible chaining: chaining with any other instruction in execution, not just the one sharing a register
● Chime
– The unit of time to execute one convoy
– m convoys execute in m chimes
– For a vector length of n, m convoys require about m*n clock cycles
– Ignores some processor-specific overheads
● Better at estimating time for longer vectors
● The chime count is an underestimate

Calculation of Execution Time
L.D F0, a
LV V1, Rx
MULVS.D V2, V1, F0
LV V3, Ry
ADDVV.D V4, V2, V3
SV V4, Ry
● Convoys:
1. LV MULVS.D
2. LV ADDVV.D
3. SV
● 3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
● For 64-element vectors, requires 64 x 3 = 192 clock cycles

Vector Execution Challenges
● Start-up time
– The latency of the vector functional unit
– Assume the same as the Cray-1:
● Floating-point add → 6 cycles
● Floating-point multiply → 7 cycles
● Floating-point divide → 20 cycles
● Vector load → 12 cycles
● Needed improvements
– > 1 element per clock cycle (compare with CPI < 1)
– Non-64-wide vectors
– IF statements in vector code
– Memory system optimizations to support vector processors
– Multiple-dimensional matrices
– Sparse matrices
– Programming a vector computer

Vector Optimizations: Multiple Lanes
(Figure: a single pipeline computing A[1]+B[1] through A[9]+B[9] one element per cycle, versus four lanes — lane 0 handling elements 1, 5, 9, ..., lane 1 handling 2, 6, 10, ..., and so on — each lane with its own add and multiply pipelines; the elements processed together in one cycle, e.g. C[1]..C[4], form an "element group".)
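A rough cost model ties the last few slides together: the chime model charges one cycle per element per convoy, and with multiple lanes each convoy retires one element per lane per cycle. The C sketch below is a minimal model under those assumptions; the function names and the per-convoy start-up array are illustrative, not from the slides.

/* Chime model: m convoys over n elements on L lanes take about
   m * ceil(n / L) cycles, e.g. 3 * 64 = 192 cycles for DAXPY on 1 lane. */
int chime_cycles(int convoys, int vector_length, int lanes)
{
    int per_convoy = (vector_length + lanes - 1) / lanes;  /* ceil(n / lanes) */
    return convoys * per_convoy;
}

/* Charging each convoy's start-up (pipeline-fill) latency as well gives a
   less optimistic estimate; startup[] holds one latency per convoy. */
int cycles_with_startup(int convoys, int vector_length, int lanes,
                        const int *startup)
{
    int cycles = 0;
    for (int c = 0; c < convoys; ++c)
        cycles += startup[c] + (vector_length + lanes - 1) / lanes;
    return cycles;
}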
Arbitrary Length Loops with the Vector-Length Register
● for (i = 0; i < N; ++i) Y[i] = a * X[i] + Y[i]
● What if N < 64 (= the Maximum Vector Length)?
● Use a Vector-Length Register (VLR)
– Load N into the VLR
– Vector instructions (loads, stores, arithmetic, ...) use the VLR as the maximum number of iterations
– By default, VLR = 64
● Vector register utilization
– If VLR < 64 then the registers are not fully utilized
● Vector start-up time becomes a greater portion of the instruction's execution time
● Use Amdahl's law to calculate the effect of diminishing vector length

Arbitrary Length Loops with Strip Mining
● for (i = 0; i < N; ++i) Y[i] = a * X[i] + Y[i]
● What if N > 64?
● Use strip mining to split the loop into smaller chunks:
for (j = 0; j < (N/64)*64; j += 64) { // N - N%64
  for (i = 0; i < 64; ++i)
    Y[j+i] = a * X[j+i] + Y[j+i];
}
● Clean-up code for N not divisible by 64:
if (N % 64)
  for (i = 0; i < N%64; ++i) // use the VLR here
    Y[j+i] = a * X[j+i] + Y[j+i];

Handling Conditional Statements with Mask Registers
● for (i = 0; i < 64; ++i) if (X[i] != 0) X[i] = X[i] - Y[i];
● The values of X are not known until runtime
– The loop cannot be analyzed at compile time
● The conditional is an inherent part of the program logic
– Scientific codes are full of conditionals, so hardware support is a must
● A vector mask register is used to prohibit execution of vector instructions on some elements of the vector:
for (i = 0; i < 64; ++i)
  VectorMask[i] = (X[i] != 0);
for (i = 0; i < 64; ++i)
  X[i] = X[i] - Y[i] only if VectorMask[i] != 0

Example: Vector Mask Register
● for (i = 0; i < 64; ++i) if (X[i] != 0) X[i] = X[i] - Y[i];
LV V1, Rx
LV V2, Ry
L.D F0, #0
SNEVS.D V1, F0
SUBVV.D V1, V1, V2
SV V1, Rx
● Vector register utilization drops as the count of 1's in the mask register drops

Strided Access and Multidimensional Arrays
● Matrix-matrix multiply C = alpha * C + beta * A * B:
for (i = 0; i < 64; ++i)
  for (j = 0; j < 64; ++j) {
    C[i, j] = alpha * C[i, j]
    for (k = 0; k < 64; ++k)
      C[i, j] = C[i, j] + beta * A[i, k] * B[k, j]
  }
● C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] + A[0,2]*B[2,0] + ...
● A is accessed by rows and B is accessed by columns
● The two most common storage formats:
1. C: row-major (store the matrix by rows)
2. Fortran: column-major (store the matrix by columns)
● There is a need for non-unit stride when accessing memory
– Best handled by memories with multiple banks
– Stride: 64 elements * 8 bytes = 512 bytes

Strided Access and Memory Banks
● Special load instructions exist just for accessing strided data
– LVWS = load vector with stride (and SVWS = store vector with stride)
– The stride (the distance between elements) is placed in a GPR (general-purpose register), just like the vector's base address
● Multiple banks help with fast data access
– Multiple loads and/or stores can be active at the same time
– Critical metric: bank busy time (the minimum period of time between successive accesses to the same bank)
● For strided accesses, compute the LCM (least common multiple) of the stride and the number of banks to see whether a bank conflict occurs: the same bank is revisited every LCM(stride, #banks) / stride accesses, so a conflict (stall) occurs if
LCM(stride, #banks) / stride < bank busy time (in cycles)

Strided Access Patterns
(Figure: eight memory banks, numbered 0-7, accessed with stride 1, stride 9, and stride 8. With stride 1 and stride 9 successive accesses land in different banks, so each access takes 1 CPU cycle; with stride 8 every access hits the same bank, so each access takes max(1 CPU cycle, 1 bank busy cycle).)
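The bank-conflict rule above can be checked directly. Below is a minimal C sketch using the eight banks from the stride figure and an assumed bank busy time of 6 cycles; the function names and the busy-time value are illustrative, not from the slides.

#include <stdio.h>

/* Greatest common divisor, used to form LCM(stride, banks). */
static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

/* With stride s and B banks, a given bank is revisited every
   LCM(s, B) / s accesses (one access per cycle), so a stall occurs
   when that interval is shorter than the bank busy time. */
static int bank_conflict(long stride, long banks, long busy_cycles)
{
    long lcm = stride / gcd(stride, banks) * banks;
    return (lcm / stride) < busy_cycles;
}

int main(void)
{
    long strides[] = {1, 9, 8};
    for (int i = 0; i < 3; ++i)
        printf("stride %ld: %s\n", strides[i],
               bank_conflict(strides[i], 8, 6)
                   ? "bank conflict (accesses stall)"
                   : "no conflict (1 access per CPU cycle)");
    return 0;
}

With eight banks this reports a conflict only for stride 8, matching the figure: strides 1 and 9 revisit a given bank only every 8 accesses, which is longer than the assumed 6-cycle busy time.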
Example: Memory Banks for the Cray T90
● Design parameters
– 32 processors
– Each processor: 4 loads and 2 stores per cycle
– Processor cycle time: 2.167 ns
– Cycle time for the SRAMs: 15 ns
● How many memory banks are required to keep up with the processors?
– 32 processors * 6 references = 192 references per cycle
– SRAM busy time: 15 / 2.167 = 6.92 ≈ 7 CPU cycles
– Required banks: 192 * 7 = 1344 memory banks
– The original design had only 1024 memory banks.

Vector Processors and Compilers
● Vector machine vendors supplied good compilers to take advantage of the hardware
– High profit margins allowed investment in compilers
– Without well-vectorized code, the hardware cannot deliver high Gflop/s rates
● Vector compilers were...
– Good at recognizing vectorization opportunities
– Good at reporting vectorization failures and successes
– Good at taking hints from the programmer in the form of vendor-specific language extensions such as pragmas

SIMD Extensions for Multimedia Processing
● Modern SIMD extensions in x86 date back to early desktop processing of multimedia
● A typical case of Amdahl's law:
– Speed up the common case or the slowest component
● Image processing operates on 8-bit colors, 3 components, 1 alpha transparency
– Typical pixel length: 32 bits
– Typical operations: additions
● Audio processing: 8-bit or 16-bit samples (Compact Disc)
● SIMD extensions versus vector processing
– For SIMD, a fixed vector length is encoded in the instruction opcode
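To make the contrast with vector processing concrete, here is a minimal sketch of a DAXPY-style loop written in C with x86 SSE intrinsics (single precision, four elements per instruction). The function name saxpy_sse and the assumption that n is a multiple of 4 are illustrative; the four-element width is fixed by the opcode, whereas a vector machine would simply set its vector-length register.

#include <xmmintrin.h>  /* SSE: 128-bit registers holding four floats */

/* Y = a*X + Y, assuming n is a multiple of 4; a scalar tail loop would
   be needed otherwise, since the SIMD width cannot be shortened. */
void saxpy_sse(int n, float a, const float *x, float *y)
{
    __m128 va = _mm_set1_ps(a);                  /* broadcast the scalar a */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);         /* load 4 elements of X  */
        __m128 vy = _mm_loadu_ps(&y[i]);         /* load 4 elements of Y  */
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy); /* 4-wide multiply-add   */
        _mm_storeu_ps(&y[i], vy);                /* store 4 results       */
    }
}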