Data-Parallel Programming on Manycore Graphics Processors
Bryan Catanzaro
Universal Parallel Computing Research Center
University of California, Berkeley

Overview
!! Terminology: Multicore, Manycore, SIMD
!! The CUDA Programming model
!! Mapping CUDA to Nvidia GPUs
!! Experiences with CUDA

Multicore and Manycore
!! Multicore: yoke of oxen
"! Each core optimized for executing a single thread
!! Manycore: flock of chickens
"! Cores optimized for aggregate throughput, deemphasizing individual performance

Multicore & Manycore, cont.
                                 Core i7 960 (45nm)                GTX285 (55nm)
Processing Elements              4 cores, 4 way SIMD @3.2 GHz      30 cores, 8 way SIMD @1.5 GHz
Resident Strands/Threads (max)   4 cores, 2 threads, 4 way         30 cores, 32 SIMD vectors, 32 way
                                 SIMD: 32 strands                  SIMD: 30720 threads
SP GFLOP/s                       102                               1080
Memory Bandwidth                 25.6 GB/s                         159 GB/s
Register File                    -                                 1.875 MB
Local Store                      -                                 480 kB

What is a core?
!! Is a core an ALU?
"! ATI: We have 800 streaming processors!!
"! Actually, we have 5 way VLIW * 16 way SIMD * 10 “SIMD cores”
!! Is a core a SIMD vector unit?
"! Nvidia: We have 240 streaming processors!!
"! Actually, we have 8 way SIMD * 30 “multiprocessors”
"! To match ATI, they could count another factor of 2 for dual issue
!! In this lecture, we’re using “core” consistent with the CPU world
"! Superscalar, VLIW, SIMD, SMT, etc. are part of a core’s architecture, not the number of cores

SIMD
(Figure: SISD adds scalars, a + b = c; SIMD with width=2 adds vectors, (a1, a2) + (b1, b2) = (c1, c2))
!! Single Instruction Multiple Data architectures make use of data parallelism
!! SIMD can be area and power efficient
"! Amortize control overhead over SIMD width
!! Parallelism exposed to programmer & compiler

SIMD: Neglected Parallelism
!! It is difficult for a compiler to exploit SIMD
!! How do you deal with sparse data & branches?
"! Many languages (like C) are difficult to vectorize
"! Fortran is somewhat better
!! Most common solution:
"! Either forget about SIMD
"! Pray the autovectorizer likes you
"! Or instantiate intrinsics (assembly language)
"! Requires a new code version for every SIMD extension

A Brief History of x86 SIMD
(Figure: the x86 SIMD family tree: MMX (8 x 8 bit integer); 3dNow! and SSE (4 x 32 bit SP float); SSE2 (2 x 64 bit DP float); SSE3, SSSE3, SSE4.1, SSE4.2, SSE4.a, SSE5; AVX (8 x 32 bit SP float), AVX+FMA (3 operands); Larrabee (16 x 32 bit SP float))

What to do with SIMD?
(Figure: 4 way SIMD (SSE) vs. 16 way SIMD (LRB))
!! Neglecting SIMD in the future will be more expensive
"! AVX: 8 way SIMD, Larrabee: 16 way SIMD, Nvidia: 32 way SIMD, ATI: 64 way SIMD
!! This problem composes with thread level parallelism
!! We need a programming model which addresses both problems

The CUDA Programming Model
!! CUDA is a recent programming model, designed for
"! Manycore architectures
"! Wide SIMD parallelism
"! Scalability
!! CUDA provides:
"! A thread abstraction to deal with SIMD
"! Synchronization & data sharing between small groups of threads
!! CUDA programs are written in C + extensions
!! OpenCL uses a very similar programming model, but is HW & SW vendor neutral

Hierarchy of Concurrent Threads
!! Parallel kernels composed of many threads
"! all threads execute the same sequential program
(Figure: Thread t)
!! Threads are grouped into thread blocks
"! threads in the same block can cooperate
(Figure: Block b containing threads t0, t1, …, tN)
!! Threads/blocks have unique IDs

What is a CUDA Thread?

What is a CUDA Thread Block?

Synchronization
!! Threads within a block may synchronize with barriers (see the sketch below)

   … Step 1 …
   __syncthreads();
   … Step 2 …

!! Blocks coordinate via atomic memory operations
"! e.g., increment shared queue pointer with atomicInc()
!! Implicit barrier between dependent kernels

   vec_minus<<<nblocks, blksize>>>(a, b, c);
   vec_dot<<<nblocks, blksize>>>(c, c);
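To make the barrier above concrete, here is a minimal sketch (not from the lecture) of a kernel in which every thread of a block stages one element into a shared tile in step 1, waits at __syncthreads(), and then reads a neighbor's element in step 2. The kernel name shift_left, the array names, and the 128-thread block size are assumptions made for the example.

   // Illustrative sketch: barrier use with per-block shared memory.
   // shift_left, in, out, n, and BLOCK_SIZE are assumed names/sizes.
   #define BLOCK_SIZE 128

   __global__ void shift_left(float* in, float* out, int n)
   {
       __shared__ float tile[BLOCK_SIZE];      // per-block shared memory

       int i = blockIdx.x * blockDim.x + threadIdx.x;

       // ... Step 1: each thread writes one element of the shared tile ...
       tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

       __syncthreads();   // barrier: no thread continues until all writes above are done

       // ... Step 2: threads may now read elements written by their neighbors ...
       if (i < n)
           out[i] = (threadIdx.x + 1 < blockDim.x) ? tile[threadIdx.x + 1]
                                                   : tile[threadIdx.x];
   }

   // Possible launch: one thread per element, 128 threads per block
   // shift_left<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n);

Without the __syncthreads() call, a thread could read tile[threadIdx.x + 1] before its neighbor has written it; the barrier is what makes the step 1 / step 2 split safe.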
Blocks must be independent

Scalability
!! Manycore chips exist in a diverse set of configurations
(Figure: number of cores per chip, from the 8300GS and 9400M up to the 8800GTX and GTX285)
!! CUDA allows one binary to target all these chips
!! Thread blocks bring scalability!

Hello World: Vector Addition

   // Compute vector sum C = A + B
   // Each thread performs one pairwise addition
   __global__ void vecAdd(float* a, float* b, float* c)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       c[i] = a[i] + b[i];
   }

   int main()
   {
       // Run N/256 blocks of 256 threads each
       vecAdd<<<N/256, 256>>>(d_a, d_b, d_c);
   }

Flavors of parallelism
!! Thread parallelism
"! each thread is an independent thread of execution
!! Data parallelism
"! across threads in a block
"! across blocks in a kernel
!! Task parallelism
"! different blocks are independent
"! independent kernels

Memory model
(Figure: Thread with per-thread local memory; Block with per-block shared memory)

Memory model
(Figure: per-device global memory)

Memory model
(Figure: host memory and per-device memories, connected by cudaMemcpy())

Using per-block shared memory
(Figure: a block's shared memory)

CUDA: Minimal extensions to C/C++
!! Declaration specifiers to indicate where things live
   __global__ void KernelFunc(...);   // kernel callable from host
   __device__ void DeviceFunc(...);   // function callable on device
   __device__ int  GlobalVar;         // variable in device memory
   __shared__ int  SharedVar;         // in per-block shared memory
!! Extend function invocation syntax for parallel kernel launch
   KernelFunc<<<500, 128>>>(...);     // 500 blocks, 128 threads each
!! Special variables for thread identification in kernels
   dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;
!! Intrinsics that expose specific operations in kernel code
   __syncthreads();                   // barrier synchronization

CUDA: Features available on GPU

CUDA: Runtime support

Mapping CUDA to Nvidia GPUs
!! CUDA is designed to be functionally forgiving
"! First priority: make things work. Second: get performance.
!! However, to get good performance, one must understand how CUDA is mapped to Nvidia GPUs
!! Threads:
"! each thread is a SIMD vector lane
!! Warps:
"! A SIMD instruction acts on a “warp”
"! Warp width is 32 elements: LOGICAL SIMD width
!! Thread blocks:
"! Each thread block is scheduled onto a processor
"! Peak efficiency requires multiple thread blocks per processor

Mapping CUDA to a GPU, continued
!! The GPU is very deeply pipelined
"! Throughput machine, trying to hide memory latency
!! This means that performance depends on the number of thread blocks which can be allocated on a processor
!! Therefore, resource usage costs performance:
"! More registers => Fewer thread blocks
"! More shared memory usage => Fewer thread blocks
"! In general: More resources => less effective parallelism
!! It is often worth trying to reduce register count in order to get more thread blocks to fit on the chip (see the sketch below)
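As a rough illustration of this resource arithmetic, here is a host-side sketch (not from the lecture) that queries the device limits with cudaGetDeviceProperties() and estimates how many blocks of one configuration could be resident on a multiprocessor. The 128 threads per block, 20 registers per thread, and 4 KB of shared memory per block are assumed figures (in practice they come from your kernel and from nvcc --ptxas-options=-v), and the estimate assumes the reported per-block limits match the per-multiprocessor register and shared memory pools, ignoring allocation granularity and the hardware's caps on resident threads and blocks.

   #include <cstdio>
   #include <cuda_runtime.h>

   // Back-of-the-envelope estimate of resident blocks per multiprocessor.
   // Assumed inputs: threads per block, registers per thread (from ptxas -v),
   // and bytes of __shared__ memory per block.
   int main()
   {
       cudaDeviceProp prop;
       cudaGetDeviceProperties(&prop, 0);               // query device 0

       const int    threadsPerBlock = 128;              // assumed launch configuration
       const int    regsPerThread   = 20;               // assumed, from ptxas output
       const size_t smemPerBlock    = 4096;             // assumed __shared__ bytes per block

       // Fewer registers or less shared memory per block lets more blocks fit.
       int byRegs = prop.regsPerBlock / (regsPerThread * threadsPerBlock);
       int bySmem = (int)(prop.sharedMemPerBlock / smemPerBlock);
       int blocks = byRegs < bySmem ? byRegs : bySmem;

       printf("%s: ~%d resident blocks per multiprocessor "
              "(registers allow %d, shared memory allows %d)\n",
              prop.name, blocks, byRegs, bySmem);
       return 0;
   }

Shaving the assumed register count from 20 to 16 per thread is exactly the kind of change the last bullet points at: the same kernel then fits more resident blocks, giving the hardware more independent warps with which to hide memory latency.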
SIMD & Control Flow

Memory, Memory, Memory
!! A many core processor = a device for turning a compute bound problem into a memory bound problem
(Figure: page 3 of the NVIDIA CUDA Programming Guide Version 2.0, Chapter 1 "Introduction to CUDA", showing Figure 1-2, "The GPU Devotes More Transistors to Data Processing": the CPU panel has Control, Cache, a few ALUs, and DRAM; the GPU panel has DRAM and many ALUs)
!! Lots of processors, only one socket
!! Memory concerns dominate performance tuning

Memory is SIMD too
!! Virtually all processors have SIMD memory subsystems
(Figure: words 0 through 7 fetched as one unit, one cache line width)
!! This has two effects:
"! Sparse access wastes bandwidth: 2 words used, 8 words loaded: 1/4 effective bandwidth
"! Unaligned access wastes bandwidth: 4 words used, 8 words loaded: 1/2 effective bandwidth

Coalescing
!! Current GPUs don’t have cache lines as such, but they do have similar issues with alignment and sparsity
!! Nvidia GPUs have a “coalescer”, which examines memory requests dynamically and coalesces them into vectors
!! To use bandwidth effectively,