Computer Architecture


COSC 6339: Accelerators in Big Data
Edgar Gabriel, Fall 2018

Motivation

• Programming models such as MapReduce and Spark provide a high-level view of parallelism
  – not easy for all problems, e.g. recursive algorithms, many graph problems, etc.
• How to handle problems that do not have inherent high-level parallelism?
  – sequential processing: time to solution takes too long for large problems
  – exploit low-level parallelism: often groups of very few instructions
• Problem with instruction-level parallelism: the costs of exploiting the parallelism exceed the benefits if regular threads/processes/tasks are used

Historic context: SIMD Instructions

• Same operation executed for multiple data items
• Uses a fixed-length register and partitions the carry chain, allowing the same functional unit to be utilized for multiple operations
  – e.g. a 256-bit adder can be utilized for eight 32-bit add operations simultaneously
• All elements in a register have to be on the same memory page to avoid page faults within the instruction

Comparison of instructions

• Example: add operation on eight 32-bit integers, with and without a SIMD instruction

Scalar version:

    LOOP: LOAD  R2, 0(R4)      /* load x(i) */
          LOAD  R0, 0(R6)      /* load y(i) */
          ADD   R2, R2, R0     /* x(i)+y(i) */
          STORE R2, 0(R4)      /* store x(i) */
          ADD   R4, R4, #4     /* increment x */
          ADD   R6, R6, #4     /* increment y */
          BNEQ  R4, R20, LOOP

SIMD version:

    LOAD256  YMM1, 0(R4)       /* loads 256 bits of data */
    LOAD256  YMM2, 0(R6)       /* ditto */
    VADDSP   YMM1, YMM1, YMM2  /* AVX ADD operation */
    STORE256 YMM1, 0(R4)

• In the scalar version, 3 instructions are required just for managing the loop (i.e. they do not contribute to the actual solution of the problem)
• Branch instructions typically lead to processor stalls, since the processor has to wait for the outcome of the comparison before it can decide what the next instruction to execute is
• Note: these are not actual Intel assembly instructions and registers

SIMD Instructions

• MMX (Multi-Media Extension) – 1996
  – Existing 64-bit floating point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) – 1999
  – Successor to the MMX instructions
  – Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 – 2007
  – Added support for double-precision operations
• AVX (Advanced Vector Extensions) – 2010
  – 256-bit registers added
• AVX2 – 2013
  – 256-bit integer operations added
• AVX-512 – 2016
  – 512-bit registers added
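The scalar/SIMD comparison above can also be expressed in C with compiler intrinsics instead of assembly. The following sketch is not from the original slides; it assumes an x86 CPU with AVX2 support, uses the standard immintrin.h intrinsics, and assumes n is a multiple of 8:

    #include <immintrin.h>

    /* x(i) = x(i) + y(i) for n 32-bit integers, eight elements per iteration.
       Assumes n is a multiple of 8; a scalar tail loop would cover any remainder. */
    void vecadd_avx2(int *x, const int *y, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256i a = _mm256_loadu_si256((const __m256i *)&x[i]); /* load 256 bits of x */
            __m256i b = _mm256_loadu_si256((const __m256i *)&y[i]); /* load 256 bits of y */
            _mm256_storeu_si256((__m256i *)&x[i],
                                _mm256_add_epi32(a, b));            /* eight 32-bit adds at once */
        }
    }

One loop iteration now covers eight elements, so the loop-management and branch overhead discussed above is paid once per eight additions instead of once per addition.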
Graphics Processing Units (GPU)

• Hardware in graphics units is similar to SIMD units
  – Works well with data-level parallel problems
  – Scatter-gather transfers
  – Mask registers
  – Large register files
• Using NVIDIA GPUs as an example

Graphics Processing Units (II)

• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as CUDA threads
  – Programming model is "Single Instruction Multiple Threads" (SIMT)
• GPU hardware handles thread management, not applications or the OS

Example: Vector Addition

• Sequential code:

    int main ( int argc, char **argv )
    {
        int A[N], B[N], C[N];
        for ( int i=0; i<N; i++) {
            C[i] = A[i] + B[i];
        }
        return (0);
    }

Example: Vector Addition (II)

• CUDA: replace the loop by N threads, each executing one element of the vector add operation
• Question: how does each thread know which elements to execute?
  – threadIdx: each thread has an id which is unique within its thread block
    • of type dim3, a struct with fields x, y, and z
  – blockDim: total number of threads in the thread block
    • a thread block can be 1D, 2D, or 3D

Example: Vector Addition (III)

• Initial CUDA kernel:

    void vecadd ( int *d_A, int *d_B, int *d_C )
    {
        int i = threadIdx.x;   /* assuming a 1-D thread block -> only x-dimension used */
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

• This code is limited by the maximum number of threads in a thread block
  – There is an upper limit on the number of threads in one block
  – If the vector is longer, we have to create multiple thread blocks

How does the compiler know which code to compile for the CPU and which for the GPU?

• A specifier tells the compiler where a function will be executed -> the compiler can generate code for the corresponding processor
• Executed on the CPU, called from the CPU (default if not specified):

    __host__ void func(...);

• CUDA kernel executed on the GPU, called from the CPU:

    __global__ void func(...);

• Function executed on the GPU, called from the GPU:

    __device__ void func(...);

Example: Vector Addition (IV)

• So the CUDA kernel is in reality:

    __global__ void vecAdd ( int *d_A, int *d_B, int *d_C )
    {
        int i = threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

• Note:
  – d_A, d_B, and d_C are in global memory
  – int i is in the local memory of the thread

If you have multiple thread blocks

    __global__ void vecAdd ( int *d_A, int *d_B, int *d_C )
    {
        /* blockIdx.x: ID of the thread block this thread is part of
           blockDim.x: number of threads in a thread block          */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

Using more than one element per thread

    __global__ void vecAdd ( int *d_A, int *d_B, int *d_C )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for ( int j=i*NUMELEMENTS; j<(i+1)*NUMELEMENTS; j++)
            d_C[j] = d_A[j] + d_B[j];
        return;
    }
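A practical detail the kernels above leave implicit: if N is not an exact multiple of the block size, the grid has to be rounded up and the kernel needs a bounds check so the surplus threads of the last block do not write past the end of the vectors. A minimal sketch (the length parameter n and the guard are our additions, not part of the original slides):

    __global__ void vecAdd(const int *d_A, const int *d_B, int *d_C, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                   /* the last block may contain threads past the end */
            d_C[i] = d_A[i] + d_B[i];
    }

    void launch(const int *d_A, const int *d_B, int *d_C, int n)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;   /* ceiling division: round the grid up */
        vecAdd<<<blocks, threads>>>(d_A, d_B, d_C, n);
    }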
Nvidia GT200

• A GT200 is a multi-core chip with a two-level hierarchy
  – focuses on high throughput on data-parallel workloads
• 1st level of hierarchy: 10 Thread Processing Clusters (TPCs)
• 2nd level of hierarchy: each TPC has
  – 3 Streaming Multiprocessors (SMs) (an SM corresponds to one core in a conventional processor)
  – a texture pipeline (used for memory access)
• Global block scheduler:
  – issues thread blocks to SMs with available capacity
  – a simple round-robin algorithm, but taking resource availability (e.g. shared memory) into account

[Figure: GT200 block diagram. Image source: David Kanter, "Nvidia GT200: Inside a Parallel Processor", http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1]

Streaming multiprocessor (I)

• Instruction fetch, decode, and issue logic
• 8 32-bit ALU units (often referred to as Streaming Processors (SPs), or confusingly called 'cores' by Nvidia)
• 8 branch units: no branch prediction or speculation; branch delay: 4 cycles
• Can execute up to 8 thread blocks / 1024 threads concurrently
• Each SP has access to 2048 register file entries, each 32 bits wide
  – a double-precision number has to utilize two adjacent registers
  – the register file can be used by up to 128 threads concurrently

CUDA Memory Model

[Figure: CUDA memory model diagram.]

CUDA Memory Model (II)

• cudaError_t cudaMalloc(void** devPtr, size_t size)
  – Allocates size bytes of device (global) memory, pointed to by *devPtr
  – Returns cudaSuccess on no error
• cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  – dst = destination memory address
  – src = source memory address
  – count = bytes to copy
  – kind = type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)
• cudaError_t cudaFree(void* devPtr)
  – Frees memory allocated with cudaMalloc

Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo: http://www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf

Example: Vector Addition (V)

    int main ( int argc, char **argv )
    {
        float a[N], b[N], c[N];
        float *d_a, *d_b, *d_c;

        cudaMalloc( (void**)&d_a, N*sizeof(float));
        cudaMalloc( (void**)&d_b, N*sizeof(float));
        cudaMalloc( (void**)&d_c, N*sizeof(float));

        cudaMemcpy( d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy( d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

        dim3 threadsPerBlock(256);   // 1-D array of threads
        dim3 blocksPerGrid(N/256);   // 1-D grid
        vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

        /* destination comes first: copy the result from d_c back to host c */
        cudaMemcpy( c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }
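All of the runtime calls above return a cudaError_t, which the example ignores for brevity. A common pattern is to wrap every call in a checking macro; the sketch below illustrates that convention (the CHECK name is hypothetical, not from the slides):

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical convenience macro: abort with a readable message if a
       CUDA runtime call does not return cudaSuccess. */
    #define CHECK(call)                                                   \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    /* usage: CHECK(cudaMalloc((void**)&d_a, N*sizeof(float))); */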
Nvidia Tesla V100 GPU

• Most recent Nvidia GPU architecture (as of Fall 2018)
• Architecture: each V100 contains
  – 6 GPU Processing Clusters (GPCs), each with
    • 7 Texture Processing Clusters (TPCs)
    • 14 Streaming Multiprocessors (SMs)
  – Each SM has
    • 64 32-bit floating point cores
    • 64 32-bit integer cores
    • 32 64-bit floating point cores
    • 8 Tensor Cores
    • 4 texture units

Nvidia V100 Tensor Cores

• Specifically designed to support neural networks
• Designed to execute D = A×B + C for 4×4 matrices
  – Operate on FP16 input data with FP32 accumulation

[Figure: tensor core operation. Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf]

NVLink

• GPUs traditionally utilize a PCIe slot for moving data and instructions from the CPU to the GPU and between GPUs
  – PCIe 3.0 x8: 8 GB/s bandwidth
  – PCIe 3.0 x16: 16 GB/s bandwidth
  – motherboards are often restricted in the number of PCIe lanes they manage, i.e. using multiple PCIe cards will reduce the bandwidth available to each card
• NVLink: a high-speed connection between multiple GPUs
  – higher bandwidth per link (25 GB/s) than PCIe
  – V100 supports up to 6 NVLinks per GPU

NVLink2, multi-GPU, no CPU support

[Figure: NVLink2 GPU-to-GPU topology. Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf]

NVLink2 with CPU support

• only supported by IBM POWER9 CPUs at the moment

[Figure: NVLink2 CPU-to-GPU topology. Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf]

Other V100 enhancements

• In earlier GPUs
  – a group of threads (a warp) executed a single instruction
  – a single program counter was used in combination with an active mask that specified which threads of the warp were active at any given point in time
  – divergent paths (e.g. if-then-else statements) therefore serialize execution: the threads on one path run while the others are masked off, and vice versa
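As an illustration of such a divergent path, here is a small sketch of ours (not from the slides). The threads of each 32-thread warp split between the two branches, so pre-Volta hardware executes the if-side with the else-threads masked off, then the else-side, roughly halving throughput:

    __global__ void divergent(float *d_x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        /* Neighboring threads of the same warp take different paths here,
           so the two branches are executed one after the other. */
        if (i % 2 == 0)
            d_x[i] = d_x[i] * 2.0f;   /* even-numbered threads */
        else
            d_x[i] = d_x[i] + 1.0f;   /* odd-numbered threads */
    }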