Vector Lane Threading
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Data-Level Parallelism
Fall 2015 :: CSE 610 – Parallel Computer Architectures Data-Level Parallelism Nima Honarmand Fall 2015 :: CSE 610 – Parallel Computer Architectures Overview • Data Parallelism vs. Control Parallelism – Data Parallelism: parallelism arises from executing essentially the same code on a large number of objects – Control Parallelism: parallelism arises from executing different threads of control concurrently • Hypothesis: applications that use massively parallel machines will mostly exploit data parallelism – Common in the Scientific Computing domain • DLP originally linked with SIMD machines; now SIMT is more common – SIMD: Single Instruction Multiple Data – SIMT: Single Instruction Multiple Threads Fall 2015 :: CSE 610 – Parallel Computer Architectures Overview • Many incarnations of DLP architectures over decades – Old vector processors • Cray processors: Cray-1, Cray-2, …, Cray X1 – SIMD extensions • Intel SSE and AVX units • Alpha Tarantula (didn’t see light of day ) – Old massively parallel computers • Connection Machines • MasPar machines – Modern GPUs • NVIDIA, AMD, Qualcomm, … • Focus of throughput rather than latency Vector Processors 4 SCALAR VECTOR (1 operation) (N operations) r1 r2 v1 v2 + + r3 v3 vector length add r3, r1, r2 vadd.vv v3, v1, v2 Scalar processors operate on single numbers (scalars) Vector processors operate on linear sequences of numbers (vectors) 6.888 Spring 2013 - Sanchez and Emer - L14 What’s in a Vector Processor? 5 A scalar processor (e.g. a MIPS processor) Scalar register file (32 registers) Scalar functional units (arithmetic, load/store, etc) A vector register file (a 2D register array) Each register is an array of elements E.g. 32 registers with 32 64-bit elements per register MVL = maximum vector length = max # of elements per register A set of vector functional units Integer, FP, load/store, etc Some times vector and scalar units are combined (share ALUs) 6.888 Spring 2013 - Sanchez and Emer - L14 Example of Simple Vector Processor 6 6.888 Spring 2013 - Sanchez and Emer - L14 Basic Vector ISA 7 Instr. -
2.5 Classification of Parallel Computers
52 // Architectures 2.5 Classification of Parallel Computers 2.5 Classification of Parallel Computers 2.5.1 Granularity In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events. • fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing 53 // Architectures 2.5 Classification of Parallel Computers – Implies high communication overhead and less opportunity for per- formance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation. • operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently 54 // Architectures 2.5 Classification of Parallel Computers 2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1) In N elements in pipeline and for 8 element L clock cycles =) for calculation it would take L + N cycles; without pipeline L ∗ N cycles Example of good code for pipelineing: §doi =1 ,k ¤ z ( i ) =x ( i ) +y ( i ) end do ¦ 55 // Architectures 2.5 Classification of Parallel Computers Vector processors, fast vector operations (operations on arrays). Previous example good also for vector processor (vector addition) , but, e.g. recursion – hard to optimise for vector processors Example: IntelMMX – simple vector processor. -
Data-Flow Prescheduling for Large Instruction Windows in Out-Of-Order Processors
Data-Flow Prescheduling for Large Instruction Windows in Out-of-Order Processors Pierre Michaud, Andr´e Seznec IRISA/INRIA Campus de Beaulieu, 35042 Rennes Cedex, France {pmichaud, seznec}@irisa.fr Abstract We introduce data-flow prescheduling. Instructions are sent to the issue buffer in a predicted data-flow order instead The performance of out-of-order processors increases of the sequential order, allowing a smaller issue buffer. The with the instruction window size. In conventional proces- rationale of this proposal is to avoid using entries in the is- sors, the effective instruction window cannot be larger than sue buffer for instructions which operands are known to be the issue buffer. Determining which instructions from the yet unavailable. issue buffer can be launched to the execution units is a time- In our proposal, this reordering of instructions is accom- critical operation which complexity increases with the issue plished through an array of schedule lines. Each schedule buffer size. We propose to relieve the issue stage by reorder- line corresponds to a different depth in the data-flow graph. ing instructions before they enter the issue buffer. This study The depth of each instruction in the data-flow graph is de- introduces the general principle of data-flow prescheduling. termined, and the instruction is inserted in the correspond- Then we describe a possible implementation. Our prelim- ing schedule line. Lines are consumed by the issue buffer inary results show that data-flow prescheduling makes it sequentially. possible to enlarge the effective instruction window while Section 2 briefly describes issue buffers and discusses re- keeping the issue buffer small. -
Vector-Thread Architecture and Implementation by Ronny Meir Krashinsky B.S
Vector-Thread Architecture And Implementation by Ronny Meir Krashinsky B.S. Electrical Engineering and Computer Science University of California at Berkeley, 1999 S.M. Electrical Engineering and Computer Science Massachusetts Institute of Technology, 2001 Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2007 c Massachusetts Institute of Technology 2007. All rights reserved. Author........................................................................... Department of Electrical Engineering and Computer Science May 25, 2007 Certified by . Krste Asanovic´ Associate Professor Thesis Supervisor Accepted by . Arthur C. Smith Chairman, Department Committee on Graduate Students 2 Vector-Thread Architecture And Implementation by Ronny Meir Krashinsky Submitted to the Department of Electrical Engineering and Computer Science on May 25, 2007, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science Abstract This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors. The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application paral- lelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow. -
Lecture 14: Vector Processors
Lecture 14: Vector Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a EE382A – Autumn 2009 Lecture 14- 1 Christos Kozyrakis Announcements • Readings for this lecture – H&P 4th edition, Appendix F – Required paper • HW3 available on online – Due on Wed 11/11th • Exam on Fri 11/13, 9am - noon, room 200-305 – All lectures + required papers – Closed books, 1 page of notes, calculator – Review session on Friday 11/6, 2-3pm, Gates Hall Room 498 EE382A – Autumn 2009 Lecture 14 - 2 Christos Kozyrakis Review: Multi-core Processors • Use Moore’s law to place more cores per chip – 2x cores/chip with each CMOS generation – Roughly same clock frequency – Known as multi-core chips or chip-multiprocessors (CMP) • Shared-memory multi-core – All cores access a unified physical address space – Implicit communication through loads and stores – Caches and OOO cores lead to coherence and consistency issues EE382A – Autumn 2009 Lecture 14 - 3 Christos Kozyrakis Review: Memory Consistency Problem P1 P2 /*Assume initial value of A and flag is 0*/ A = 1; while (flag == 0); /*spin idly*/ flag = 1; print A; • Intuitively, you expect to print A=1 – But can you think of a case where you will print A=0? – Even if cache coherence is available • Coherence talks about accesses to a single location • Consistency is about ordering for accesses to difference locations • Alternatively – Coherence determines what value is returned by a read – Consistency determines when a write value becomes visible EE382A – Autumn 2009 Lecture 14 - 4 Christos Kozyrakis Sequential Consistency (What the Programmers Often Assume) • Definition by L. -
Jetson TX2 • NVIDIA Jetson Xavier • GPU Programming • Algorithm Mapping: • Convolutions Parallel Algorithm Execution
GPU and multicore CPU architectures. Algorithm mapping Contributors: N. Tsapanos, I. Karakostas, I. Pitas Aristotle University of Thessaloniki, Greece Presenter: Prof. Ioannis Pitas Aristotle University of Thessaloniki [email protected] www.multidrone.eu Presentation version 1.3 GPU and multicore CPU architectures. Algorithm mapping • GPU and multicore CPU processing boards • Graphics cards • NVIDIA Jetson TX2 • NVIDIA Jetson Xavier • GPU programming • Algorithm mapping: • Convolutions Parallel algorithm execution • Graphics computing: • Highly parallelizable • Linear algebra parallelization: • Vector inner products: 푐 = 풙푇풚. • Matrix-vector multiplications 풚 = 푨풙. • Matrix multiplications: 푪 = 푨푩. Parallel algorithm execution • Convolution: 풚 = 푨풙 • CNN architectures, linear systems, signal filtering. • Correlation: 풚 = 푨풙 • template matching, tracking. • Signal transforms (DFT, DCT, Haar, etc): • Matrix vector product form: 푿 = 푾풙 • 2D transforms (matrix product form): 푿’ = 푾푿. Processing Units • Multicore (CPU): • MIMD. • Focused on latency. • Best single thread performance. • Manycore (GPU): • SIMD. • Focused on throughput. • Best for embarrassingly parallel tasks. Pascal microarchitecture https://devblogs.nvidia.com/inside-pascal/gp100_block_diagram-2/ Pascal microarchitecture https://devblogs.nvidia.com/inside-pascal/gp100_sm_diagram/ GeForce GTX 1080 • Microarchitecture: Pascal. • DRAM: 8 GB GDDR5X at 10000 MHz. • SMs: 20. • Memory bandwidth: 320 GB/s. • CUDA cores: 2560. • L2 Cache: 2048 KB. • Clock (base/boost): 1607/1733 MHz. • L1 Cache: 48 KB per SM. • GFLOPs: 8873. • Shared memory: 96 KB per SM. GPU and multicore CPU architectures. Algorithm mapping • GPU and multicore CPU processing boards • Graphics cards • NVIDIA Jetson TX2 • NVIDIA Jetson Xavier • GPU programming • Algorithm mapping: • Convolutions ARM Cortex-A57: High-End ARMv8 CPU • ARMv8 architecture • Architecture evolution that extends ARM’s applicability to all markets. • Full ARM 32-bit compatibility, streamlined 64-bit capability. -
Computer Architecture Out-Of-Order Execution
Computer Architecture Out-of-order Execution By Yoav Etsion With acknowledgement to Dan Tsafrir, Avi Mendelson, Lihu Rappoport, and Adi Yoaz 1 Computer Architecture 2013– Out-of-Order Execution The need for speed: Superscalar • Remember our goal: minimize CPU Time CPU Time = duration of clock cycle × CPI × IC • So far we have learned that in order to Minimize clock cycle ⇒ add more pipe stages Minimize CPI ⇒ utilize pipeline Minimize IC ⇒ change/improve the architecture • Why not make the pipeline deeper and deeper? Beyond some point, adding more pipe stages doesn’t help, because Control/data hazards increase, and become costlier • (Recall that in a pipelined CPU, CPI=1 only w/o hazards) • So what can we do next? Reduce the CPI by utilizing ILP (instruction level parallelism) We will need to duplicate HW for this purpose… 2 Computer Architecture 2013– Out-of-Order Execution A simple superscalar CPU • Duplicates the pipeline to accommodate ILP (IPC > 1) ILP=instruction-level parallelism • Note that duplicating HW in just one pipe stage doesn’t help e.g., when having 2 ALUs, the bottleneck moves to other stages IF ID EXE MEM WB • Conclusion: Getting IPC > 1 requires to fetch/decode/exe/retire >1 instruction per clock: IF ID EXE MEM WB 3 Computer Architecture 2013– Out-of-Order Execution Example: Pentium Processor • Pentium fetches & decodes 2 instructions per cycle • Before register file read, decide on pairing Can the two instructions be executed in parallel? (yes/no) u-pipe IF ID v-pipe • Pairing decision is based… On data -
CSC.T433 Advanced Computer Architecture, Department of Computer Science, TOKYO TECH 1 Datapath of Ooo Execution Processor
Fiscal Year 2020 Ver. 2021-01-25a Course number: CSC.T433 School of Computing, Graduate major in Computer Science Advanced Computer Architecture 10. Multi-Processor: Distributed Memory and Shared Memory Architecture www.arch.cs.titech.ac.jp/lecture/ACA/ Room No.W936 Kenji Kise, Department of Computer Science Mon 14:20-16:00, Thr 14:20-16:00 kise _at_ c.titech.ac.jp CSC.T433 Advanced Computer Architecture, Department of Computer Science, TOKYO TECH 1 Datapath of OoO execution processor Instruction flow Instruction cache Branch handler Instruction fetch Instruction decode Renaming Register file Dispatch Integer Floating-point Memory Memory dataflow RS Instruction window ALU ALU Branch FP ALU Adr gen. Adr gen. Store Reorder buffer (ROB) queue Data cache Register dataflow CSC.T433 Advanced Computer Architecture, Department of Computer Science, TOKYO TECH Reservation station (RS) 2 Growth in clock rate of microprocessors From CAQA 5th edition CSC.T433 Advanced Computer Architecture, Department of Computer Science, TOKYO TECH 3 From multi-core era to many-core era Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction, MICRO-36 CSC.T433 Advanced Computer Architecture, Department of Computer Science, TOKYO TECH 4 Aside: What is a window? • A window is a space in the wall of a building or in the side of a vehicle, which has glass in it so that light can come in and you can see out. (Collins) Instruction window 8 6 5 4 7 (a) Instruction window Instructions to be executed for an application Large instruction -
Computer Architecture: Parallel Processing Basics
Computer Architecture: Parallel Processing Basics Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/9/13 Today What is Parallel Processing? Why? Kinds of Parallel Processing Multiprocessing and Multithreading Measuring success Speedup Amdhal’s Law Bottlenecks to parallelism 2 Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Cloud Computing EC2 Tashi PDL'09 © 2007-9 Goldstein5 Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Cloud Computing EC2 Tashi Parallel PDL'09 © 2007-9 Goldstein6 Concurrent Systems Physical Geographical Cloud Parallel Geophysical +++ ++ --- --- location Relative +++ +++ + - location Faults ++++ +++ ++++ -- Number of +++ +++ + - Processors + Network varies varies fixed fixed structure Network --- --- + + connectivity 7 Concurrent System Challenge: Programming The old joke: How long does it take to write a parallel program? One Graduate Student Year 8 Parallel Programming Again?? Increased demand (multicore) Increased scale (cloud) Improved compute/communicate Change in Application focus Irregular Recursive data structures PDL'09 © 2007-9 Goldstein9 Why Parallel Computers? Parallelism: Doing multiple things at a time Things: instructions, -
COSC 6385 Computer Architecture - Multi-Processors (IV) Simultaneous Multi-Threading and Multi-Core Processors Edgar Gabriel Spring 2011
COSC 6385 Computer Architecture - Multi-Processors (IV) Simultaneous multi-threading and multi-core processors Edgar Gabriel Spring 2011 Edgar Gabriel Moore’s Law • Long-term trend on the number of transistor per integrated circuit • Number of transistors double every ~18 month Source: http://en.wikipedia.org/wki/Images:Moores_law.svg COSC 6385 – Computer Architecture Edgar Gabriel 1 What do we do with that many transistors? • Optimizing the execution of a single instruction stream through – Pipelining • Overlap the execution of multiple instructions • Example: all RISC architectures; Intel x86 underneath the hood – Out-of-order execution: • Allow instructions to overtake each other in accordance with code dependencies (RAW, WAW, WAR) • Example: all commercial processors (Intel, AMD, IBM, SUN) – Branch prediction and speculative execution: • Reduce the number of stall cycles due to unresolved branches • Example: (nearly) all commercial processors COSC 6385 – Computer Architecture Edgar Gabriel What do we do with that many transistors? (II) – Multi-issue processors: • Allow multiple instructions to start execution per clock cycle • Superscalar (Intel x86, AMD, …) vs. VLIW architectures – VLIW/EPIC architectures: • Allow compilers to indicate independent instructions per issue packet • Example: Intel Itanium series – Vector units: • Allow for the efficient expression and execution of vector operations • Example: SSE, SSE2, SSE3, SSE4 instructions COSC 6385 – Computer Architecture Edgar Gabriel 2 Limitations of optimizing a single instruction -
STRAIGHT: Realizing a Lightweight Large Instruction Window by Using Eventually Consistent Distributed Registers
2012 Third International Conference on Networking and Computing STRAIGHT: Realizing a Lightweight Large Instruction Window by using Eventually Consistent Distributed Registers Hidetsugu IRIE∗, Daisuke FUJIWARA∗, Kazuki MAJIMA∗, Tsutomu YOSHINAGA∗ ∗The University of Electro-Communications 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan E-mail: [email protected], [email protected], [email protected], [email protected] Abstract—As the number of cores as well as the network size programs. For scale-out applications, we assume the manycore in a processor chip increases, the performance of each core is processor structure, which consists of a number of STRAIGHT more critical for the improvement of the total chip performance. architecture cores (SAC) that are loosely connected each other. However, to improve the total chip performance, the performance per power or per unit area must be improved, making it difficult Being the first report on this novel processor architecture, in to adopt a conventional approach of superscalar extension. In this paper, we discuss the concept behind STRAIGHT, propose this paper, we explore a new core structure that is suitable for basic principles, and estimate the performance and budget manycore processors. We revisit prior studies of new instruction- expectation. The rest of the paper consists of following sec- level (ILP) and thread-level parallelism (TLP) architectures tions. Section II revisits studies of new architectures that were and propose our novel STRAIGHT processor architecture. By introducing the scheme of distributed key-value-store to the designed to improve the ILP/TLP performance of superscalar register file of clustered microarchitectures, STRAIGHT directly processors, and discusses the dilemma of both scalability executes the operation with large logical registers, which are approach and quick worker approach. -
Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Copyright © 2012, Elsevier Inc. All rights reserved. 1 Contents 1. SIMD architecture 2. Vector architectures optimizations: Multiple Lanes, Vector Length Registers, Vector Mask Registers, Memory Banks, Stride, Scatter-Gather, 3. Programming Vector Architectures 4. SIMD extensions for media apps 5. GPUs – Graphical Processing Units 6. Fermi architecture innovations 7. Examples of loop-level parallelism 8. Fallacies Copyright © 2012, Elsevier Inc. All rights reserved. 2 Classes of Computers Classes Flynn’s Taxonomy SISD - Single instruction stream, single data stream SIMD - Single instruction stream, multiple data streams New: SIMT – Single Instruction Multiple Threads (for GPUs) MISD - Multiple instruction streams, single data stream No commercial implementation MIMD - Multiple instruction streams, multiple data streams Tightly-coupled MIMD Loosely-coupled MIMD Copyright © 2012, Elsevier Inc. All rights reserved. 3 Introduction Advantages of SIMD architectures 1. Can exploit significant data-level parallelism for: 1. matrix-oriented scientific computing 2. media-oriented image and sound processors 2. More energy efficient than MIMD 1. Only needs to fetch one instruction per multiple data operations, rather than one instr. per data op. 2. Makes SIMD attractive for personal mobile devices 3. Allows programmers to continue thinking sequentially SIMD/MIMD comparison. Potential speedup for SIMD twice that from MIMID! x86 processors expect two additional cores per chip per year SIMD width to double every four years Copyright © 2012, Elsevier Inc. All rights reserved. 4 Introduction SIMD parallelism SIMD architectures A. Vector architectures B. SIMD extensions for mobile systems and multimedia applications C.