Computing Systems & Performance
MSc Informatics Eng., 2011/12
A.J. Proença

Data Parallelism 2 (Cell BE, GPU, ...)
(most slides are borrowed; sources include © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign, and © 2012 Elsevier Inc.)

Beyond Vector/SIMD architectures

• Vector/SIMD-extended architectures are hybrid approaches
  – mix scalar + vector operation capabilities on a single device
  – highly pipelined approach to reduce memory access penalty
  – tightly-coupled access to shared memory: lower latency
• Evolution of Vector/SIMD-extended architectures
  – CPU cores with wider vectors and/or SIMD cores:
    • DSP VLIW cores with vector capabilities: Texas Instruments
    • PPC cores coupled with SIMD cores: Cell Broadband Engine
    • ARM64 cores coupled with SIMD cores: project Denver (NVidia)
    • future x86 hybrid cores: Intel, AMD ...
  – devices with no scalar processor: accelerator devices
    • penalty on disjoint physical memories
    • based on the GPU architecture: SIMD -> SIMT to hide memory latency
    • ISA-free architecture, code compiled to silicon: FPGA

Texas Instruments: Keystone DSP architecture
(figure: Keystone DSP architecture diagram)

Cell Broadband Engine (1)-(4)
(figures: Cell BE architecture diagrams)

NVidia: Project Denver
• Pick a successful SoC: Tegra 3
• Replace the 32-bit ARM Cortex-A9 cores with 64-bit ARM cores
• Add some Fermi SIMT cores into the same chip

What is an FPGA
• Field-Programmable Gate Arrays (FPGA): a fabric with 1000s of simple configurable logic cells with LUTs, on-chip SRAM, configurable routing and I/O cells

FPGA as a multiple configurable ISA
(figure: mapping alternative ISAs onto the FPGA fabric)

The GPU as a compute device: the G80
(figure: G80 block diagram)

The CUDA programming model
• Compute Unified Device Architecture
• CUDA is a recent programming model, designed for
  – manycore architectures
  – wide SIMD parallelism
  – scalability
• CUDA provides:
  – a thread abstraction to deal with SIMD
  – synchronization & data sharing between small groups of threads
• CUDA programs are written in C with extensions
• OpenCL is inspired by CUDA, but hw & sw vendor neutral
  – programming model essentially identical

CUDA Devices and Threads
• A compute device
  – is a coprocessor to the CPU or host
  – has its own DRAM (device memory)
  – runs many threads in parallel
  – is typically a GPU, but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads - SIMT (see the sketch below)
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • very little creation overhead, but a LARGE register bank is required
  – a GPU needs 1000s of threads for full efficiency
    • a multi-core CPU needs only a few

Graphical Processing Units: terminology (and in NVidia)
• Threads of SIMD instructions (warps)
  – each has its own PC (up to 48 per SIMD processor)
  – the thread scheduler uses a scoreboard to dispatch
  – no data dependencies between threads!
  – threads are organized into blocks & executed in groups of 32 threads (thread block)
• Blocks are organized into a grid
• The thread block scheduler schedules blocks to SIMD processors (Streaming Multiprocessors)
• Within each SIMD processor:
  – 32 SIMD lanes (thread processors)
  – wide and shallow compared to vector processors
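To make the kernel/grid/block/warp vocabulary concrete, here is a minimal sketch of the pattern these slides describe: serial host code launching a data-parallel kernel on a grid of thread blocks, one element per thread. The saxpy computation and all sizes are illustrative assumptions (not from the slides); the <<< nBlk, nTid >>> launch syntax is the CUDA C extension used on the SPMD slide below.

    #include <cuda_runtime.h>

    // Illustrative device kernel (assumed example, not from the slides):
    // each of the many lightweight GPU threads handles one element.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                          // guard: the grid may overshoot n
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;                       // pointers into device DRAM
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        // ... fill x and y with cudaMemcpy from host arrays ...

        int nTid = 256;                     // threads per block (8 warps of 32)
        int nBlk = (n + nTid - 1) / nTid;   // enough blocks to cover n
        saxpy<<<nBlk, nTid>>>(n, 2.0f, x, y);  // SPMD: same kernel, many threads
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }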
CUDA basic model: Single-Program Multiple-Data (SPMD)
• CUDA integrates CPU + GPU code in one application C program
  – serial C code executes on the CPU
  – parallel kernel C code executes on GPU thread blocks
• Execution alternates between host and device:

    CPU serial code
    Grid 0: GPU parallel kernel  KernelA<<< nBlk, nTid >>>(args);
    CPU serial code
    Grid 1: GPU parallel kernel  KernelB<<< nBlk, nTid >>>(args);

Programming Model: SPMD + SIMT/SIMD
• Hierarchy
  – Device => Grids
  – Grid => Blocks
  – Block => Warps
  – Warp => Threads
• A single kernel runs on multiple blocks (SPMD)
• Threads within a warp are executed in lock-step: single-instruction multiple-thread (SIMT)
• A single instruction is executed on multiple threads (SIMD)
  – the warp size defines the SIMD granularity (32 threads)
• Synchronization within a block, using shared memory

The Computational Grid: Block IDs and Thread IDs
• A kernel runs on a computational grid of thread blocks
  – threads share global memory
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• A thread block is a batch of threads that can cooperate by:
  – synchronizing their execution with a barrier
  – efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate

CUDA Thread Block
• The programmer declares the (Thread) Block:
  – block size: 1 to 512 concurrent threads
  – block shape: 1D, 2D, or 3D
  – block dimensions in threads
• All threads in a Block execute the same thread program
• Threads share data and synchronize while doing their share of the work
• Threads have thread id numbers within the Block
• The thread program uses the thread id to select work and address shared data:

    threadID: 0 1 2 3 4 5 6 7
    ...
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    ...

Parallel Memory Sharing
• Local memory: per-thread
  – private per thread
  – auto variables, register spill
• Shared memory: per-block
  – shared by threads of the same block
  – inter-thread communication (see the sketch below)
• Global memory: per-application
  – shared by all threads
  – inter-grid communication (grids are sequential in time)

CUDA Memory Model Overview
• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – read-only per-grid constant memory
  – read-only per-grid texture memory
• The host can R/W global, constant, and texture memories
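As a concrete instance of the block-level cooperation described above (barrier synchronization plus the low-latency per-block shared memory), here is a sketch of a block-wide sum reduction. The kernel name, the fixed 512-entry buffer, and the power-of-two block-size assumption are illustrative choices, not taken from the slides.

    // Sketch: the threads of one block cooperate through __shared__ memory,
    // meeting at a __syncthreads() barrier after every step.
    // Assumes blockDim.x is a power of two, at most 512 (the Block limit above).
    __global__ void block_sum(const float *input, float *block_results) {
        __shared__ float partial[512];          // per-block shared memory

        int tid = threadIdx.x;
        partial[tid] = input[blockIdx.x * blockDim.x + tid];
        __syncthreads();                        // barrier: all loads visible

        // Tree reduction within the block; threads of *different* blocks
        // cannot cooperate this way.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();                    // barrier after each halving
        }

        if (tid == 0)                           // thread 0 publishes the result
            block_results[blockIdx.x] = partial[0];
    }

A launch such as block_sum<<<nBlk, 256>>>(d_in, d_out) leaves one partial sum per block in global memory; a second kernel (or the host) must combine the nBlk partial sums, since no barrier spans blocks.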
Hardware Implementation: Memory Architecture
• Device memory (DRAM)
  – slow (2~300 cycles)
  – holds local, global, constant, and texture memory
• On-chip memory
  – fast (1 cycle)
  – registers, shared memory, constant/texture caches
(figure: a device with multiprocessors 1..N, each containing M processors, per-processor registers, an instruction unit, shared memory, and constant/texture caches, on top of off-chip device memory)

Graphical Processing Units: NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "private memory" (Local Memory)
  – contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has on-chip local memory (Shared Memory)
  – shared by the SIMD lanes / threads within a block
• The memory shared by all SIMD processors is GPU Memory (Global Memory)
• The host can read and write GPU memory (see the sketch at the end of this section)

Families in NVidia GPU
(figure: table of NVidia GPU families)

NVidia GPU structure & scalability
(figure: scaling the number of SIMD processors across GPU models)

The NVidia Fermi architecture
(figure: Fermi chip block diagram)

GT200 and Fermi SIMD processor
(figure: Fermi Multithreaded SIMD Processor / Streaming Multiprocessor architecture)

Fermi Architecture Innovations
• Each SIMD processor has
  – two SIMD thread schedulers and two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units

Fermi: Multithreading and Memory Hierarchy
(figure: Fermi multithreading and memory hierarchy)
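Tying the memory spaces together, here is a minimal sketch of the host-side view from the CUDA Memory Model slide above: the host allocates and R/Ws global memory and writes constant memory, while the kernel reads constant memory (read-only per grid) and R/Ws global memory. All names and sizes are illustrative assumptions.

    #include <cuda_runtime.h>

    __constant__ float coeffs[16];   // per-grid constant memory: read-only in kernels

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= coeffs[i % 16];   // read constant, R/W global
    }

    int main() {
        const int n = 1024;
        float h_coeffs[16] = {1.0f};     // host-side data (illustrative)
        float h_data[n] = {};

        float *d_data;                   // per-application global memory
        cudaMalloc(&d_data, n * sizeof(float));

        // The host writes constant and global memories:
        cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        scale<<<n / 256, 256>>>(d_data, n);

        // The host reads global memory back:
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        return 0;
    }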
