Data-Parallel Programming on Manycore Graphics Processors

Bryan Catanzaro

Universal Parallel Computing Research Center
University of California, Berkeley

Overview

- Terminology: Multicore, Manycore, SIMD
- The CUDA Programming model
- Mapping CUDA to Nvidia GPUs
- Experiences with CUDA

Multicore and Manycore

- Multicore: yoke of oxen
  - Each core optimized for executing a single thread
- Manycore: flock of chickens
  - Cores optimized for aggregate throughput, deemphasizing individual performance

Multicore & Manycore, cont.

| Specifications | Core i7 960 | GTX285 |
| Processing Elements | 4 cores, 4 way SIMD @3.2 GHz | 30 cores, 8 way SIMD @1.5 GHz |
| Resident Strands/Threads (max) | 4 cores, 2 threads, 4 width SIMD: 32 strands | 30 cores, 32 SIMD vectors, 32 way SIMD: 30720 threads |
| SP GFLOP/s | 102 | 1063 |
| Memory Bandwidth | 25.6 GB/s | 159 GB/s |
| Register File | - | 1.875 MB |
| Local Store | - | 480 kB |

[Die photos: Core i7 (45nm), GTX285 (55nm)]

What is a core?

- Is a core an ALU?
  - ATI: We have 800 streaming processors!
    - Actually, we have 5 way VLIW * 16 way SIMD * 10 "SIMD cores"
- Is a core a SIMD vector unit?
  - Nvidia: We have 240 streaming processors!
    - Actually, we have 8 way SIMD * 30 "multiprocessors"
    - To match ATI, they could count another factor of 2 for dual issue
- In this lecture, we're using "core" consistent with the CPU world
  - Superscalar, VLIW, SIMD, SMT, etc. are part of a core's architecture, not the number of cores

SIMD

[Figure: SISD adds one pair of operands (a + b = c); a SIMD unit of width 2 adds two pairs (a1 + b1 = c1, a2 + b2 = c2) with a single instruction]
- Single Instruction Multiple Data architectures make use of data parallelism
- SIMD can be area and power efficient
  - Amortize control overhead over SIMD width
- Parallelism exposed to programmer & compiler

SIMD: Neglected Parallelism

- It is difficult for a compiler to exploit SIMD
- How do you deal with sparse data & branches?
  - Many languages (like C) are difficult to vectorize
  - Fortran is somewhat better
- Most common solution:
  - Either forget about SIMD
    - Pray the autovectorizer likes you
  - Or instantiate intrinsics (assembly language)
    - Requires a new code version for every SIMD extension

A Brief History of SIMD

[Timeline of x86 SIMD extensions:]
- MMX: 8 x 8-bit integer
- 3dNow!: SP float subset
- SSE: 4 x 32-bit SP float
- SSE2: 2 x 64-bit DP float
- SSE3, SSSE3, SSE4.1, SSE4.2, SSE4.A, SSE5: further extensions
- AVX: 8 x 32-bit SP float; AVX+FMA: 3 operands
- Larrabee: 16 x 32-bit SP float

What to do with SIMD?

[Figure: 4 way SIMD (SSE) vs. 16 way SIMD (LRB)]
- Neglecting SIMD in the future will be more expensive
  - AVX: 8 way SIMD, Larrabee: 16 way SIMD, Nvidia: 32 way SIMD, ATI: 64 way SIMD
- This problem composes with thread level parallelism
- We need a programming model which addresses both problems

The CUDA Programming Model

- CUDA is a recent programming model, designed for
  - Manycore architectures
  - Wide SIMD parallelism
  - Scalability
- CUDA provides:
  - A thread abstraction to deal with SIMD
  - Synchronization & data sharing between small groups of threads
- CUDA programs are written in C + extensions
- OpenCL uses a very similar programming model, but is HW & SW vendor neutral

Hierarchy of Concurrent Threads

- Parallel kernels composed of many threads
  - all threads execute the same sequential program
- Threads are grouped into thread blocks
  - threads in the same block can cooperate
- Threads/blocks have unique IDs

What is a CUDA Thread?

What is a CUDA Thread Block?

Synchronization

- Threads within a block may synchronize with barriers

      … Step 1 …
      __syncthreads();
      … Step 2 …

- Blocks coordinate via atomic memory operations
  - e.g., increment shared queue pointer with atomicInc()
- Implicit barrier between dependent kernels

      vec_minus<<<nblocks, blksize>>>(a, b, c);
      vec_dot<<<nblocks, blksize>>>(c, c);

A sketch of a block-wide reduction built on these barriers follows.
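
As a concrete illustration of __syncthreads(), here is a hedged sketch (not from the original slides) of a block-wide sum: each block of 256 threads reduces its slice of the input into one partial result. The names blockReduce, in, and blockSum are illustrative.

    // Each block sums 256 consecutive floats into one partial sum.
    // Assumes blockDim.x == 256 (a power of two).
    __global__ void blockReduce(const float* in, float* blockSum) {
        __shared__ float buf[256];                 // per-block shared memory
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                           // wait until every load has landed

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];     // pairwise tree reduction
            __syncthreads();                       // barrier between reduction steps
        }
        if (tid == 0)
            blockSum[blockIdx.x] = buf[0];         // one result per thread block
    }

Launched as blockReduce<<<N/256, 256>>>(d_in, d_partial), it leaves one partial sum per block; a second, tiny kernel (or the host) can add those up.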

Blocks must be independent

Scalability

- Manycore chips exist in a diverse set of configurations

[Chart: number of cores for several Nvidia GPUs (8300GS, 9400M, 8800GTX, GTX285)]
- CUDA allows one binary to target all these chips
- Thread blocks bring scalability!

One common way this shows up in code is sketched below.
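
A hedged sketch (not from the original slides) of the grid-stride idiom: the kernel below is correct for any grid size, so the same code scales from a chip that runs a few blocks at a time to one that runs hundreds. scale and the pointer name are illustrative.

    // Each thread strides through the array, so any number of blocks works.
    __global__ void scale(float* x, float a, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)          // stride = total threads launched
            x[i] *= a;
    }

    // Launch with however many blocks suit the chip:
    // scale<<<numBlocks, 256>>>(d_x, 2.0f, n);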

Hello World: Vector Addition

    __global__ void vecAdd(float* a, float* b, float* c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main() {
        //Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_a, d_b, d_c);
    }

The host-side setup the slide leaves out is sketched below.
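
A minimal sketch of that setup, assuming N elements, host arrays h_a, h_b, h_c, and device pointers d_a, d_b, d_c (none of these appear in the original deck beyond the launch line):

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, N * sizeof(float));    // allocate device memory
    cudaMalloc((void**)&d_b, N * sizeof(float));
    cudaMalloc((void**)&d_c, N * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

    vecAdd<<<N/256, 256>>>(d_a, d_b, d_c);          // assumes N is a multiple of 256

    cudaMemcpy(h_c, d_c, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);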

Flavors of parallelism
- Thread parallelism
  - each thread is an independent thread of execution
- Data parallelism
  - across threads in a block
  - across blocks in a kernel
- Task parallelism
  - different blocks are independent
  - independent kernels

Memory model

[Figure: each Thread has its own per-thread Local Memory; each Block has per-block Shared Memory]

Memory model

[Figure: successive kernels communicate through per-device Global Memory]

Memory model

[Figure: Host Memory alongside Device 0 Memory and Device 1 Memory; data moves between them with explicit copies (cudaMemcpy)]

Using per-block shared memory

[Figure: a thread Block and its Shared memory]

CUDA: Minimal extensions to C/C++

- Declaration specifiers to indicate where things live

      __global__ void KernelFunc(...);  // kernel callable from host
      __device__ void DeviceFunc(...);  // function callable on device
      __device__ int  GlobalVar;        // variable in device memory
      __shared__ int  SharedVar;        // in per-block shared memory

- Extend function invocation syntax for parallel kernel launch

      KernelFunc<<<500, 128>>>(...);    // 500 blocks, 128 threads each

- Special variables for thread identification in kernels

      dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;

- Intrinsics that expose specific operations in kernel code

      __syncthreads();                  // barrier synchronization

These pieces combine as in the sketch below.
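
A hedged example (not from the original deck) of the launch syntax, dim3 variables, and index arithmetic together, for a 2D image; invert, d_img, width, and height are illustrative names:

    __global__ void invert(unsigned char* img, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)                     // guard the image edges
            img[y * width + x] = 255 - img[y * width + x];
    }

    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((width + 15) / 16, (height + 15) / 16);    // enough blocks to cover the image
    invert<<<grid, block>>>(d_img, width, height);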

CUDA: Features available on GPU

CUDA: Runtime support

Mapping CUDA to Nvidia GPUs
- CUDA is designed to be functionally forgiving
  - First priority: make things work. Second: get performance.
- However, to get good performance, one must understand how CUDA is mapped to Nvidia GPUs
- Threads:
  - each thread is a SIMD vector lane
- Warps:
  - A SIMD instruction acts on a "warp"
  - Warp width is 32 elements: LOGICAL SIMD width
- Thread blocks:
  - Each thread block is scheduled onto a processor
  - Peak efficiency requires multiple thread blocks per processor

One consequence of the warp being the real SIMD width is sketched below.
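
Because a warp issues one instruction for all 32 lanes, threads of the same warp that take different branches are executed one path after the other. A hedged illustration (not from the slides):

    // Even and odd lanes of each warp diverge, so the warp runs
    // both sides of the if/else serially, roughly doubling the work.
    __global__ void divergent(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }

Branching is fine when a whole warp takes the same path; it only costs when lanes within one warp disagree.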

Mapping CUDA to a GPU, continued
- The GPU is very deeply pipelined
  - Throughput machine, trying to hide memory latency
- This means that performance depends on the number of thread blocks which can be allocated on a processor
- Therefore, resource usage costs performance:
  - More registers => Fewer thread blocks
  - More shared memory usage => Fewer thread blocks
  - In general: More resources => less effective parallelism
- It is often worth trying to reduce register count in order to get more thread blocks to fit on the chip; one way to do so is sketched below.
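
Two hedged ways to cap register use (neither appears in the original deck): nvcc accepts a --maxrregcount limit at compile time, and later CUDA releases let a kernel declare launch bounds so the compiler targets a given occupancy. The kernel and its parameters here are illustrative:

    // __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
    // ask the compiler to keep register use low enough that at least
    // 4 blocks of 256 threads can be resident on one processor.
    __global__ void __launch_bounds__(256, 4)
    scaleKernel(float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

The trade-off is that squeezed registers may spill to local memory, so the right cap is found by experiment.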

SIMD & Control Flow

Memory, Memory, Memory
- A many-core processor = A device for turning a compute bound problem into a memory bound problem
[Figure 1-2 from the CUDA Programming Guide: compared with a CPU (large Control and Cache blocks, few ALUs), the GPU devotes more transistors to data processing (many ALUs); both connect to DRAM]
- Lots of processors, only one socket
- Memory concerns dominate performance tuning


Memory is SIMD too

- Virtually all processors have SIMD memory subsystems
[Figure: a cache line holding words 0 through 7 (the cache line width)]
- This has two effects:
- Sparse access wastes bandwidth
  - 2 words used, 8 words loaded: 1/4 effective bandwidth
- Unaligned access wastes bandwidth
  - 4 words used, 8 words loaded: 1/2 effective bandwidth

Coalescing

- Current GPUs don't have cache lines as such, but they do have similar issues with alignment and sparsity
- Nvidia GPUs have a "coalescer", which examines memory requests dynamically and coalesces them into vectors
- To use bandwidth effectively, when threads load, they should:
  - Present a set of unit strided loads (dense accesses)
  - Keep sets of loads aligned to vector boundaries

The two access patterns are contrasted in the sketch below.
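
A hedged sketch (not from the original slides) contrasting the patterns; both kernels move the same amount of data, but the strided one spreads each warp's loads across many memory segments:

    // Coalesced: consecutive threads read consecutive words.
    __global__ void copyCoalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Uncoalesced: consecutive threads read words stride apart,
    // so one warp's request touches many separate vectors.
    __global__ void copyStrided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * stride) % n];
    }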

Data Structure Padding
[Figure: a 2D array stored in row-major order, with padding at the end of each row]
- Multidimensional arrays are usually stored as monolithic vectors in memory
- Care should be taken to assure aligned memory accesses for the necessary access pattern

One way to get such padding is sketched below.
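
A hedged sketch using cudaMallocPitch, which pads each row to an aligned pitch; addOne, d_img, grid, and block are illustrative names:

    // The kernel steps between rows by the padded pitch (in bytes),
    // so every row starts on an aligned boundary.
    __global__ void addOne(float* img, size_t pitch, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            float* row = (float*)((char*)img + y * pitch);
            row[x] += 1.0f;
        }
    }

    // Host side: let the runtime choose the aligned pitch.
    float* d_img;
    size_t pitch;
    cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);
    addOne<<<grid, block>>>(d_img, pitch, width, height);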

Sparse Matrix Vector Multiply

[Figure: a sparse matrix whose nonzeros lie along a few diagonals]

- Problem: Sparse Matrix Vector Multiplication
- How should we parallelize the computation?
- How should we represent the matrix?
  - Can we take advantage of any structure in this matrix?

Diagonal representation

- Since this matrix has nonzeros only on diagonals, let's project the diagonals into vectors
- Sparse representation becomes dense
- Launch a thread per row
- Are we done?

- The straightforward diagonal projection is not aligned

Optimized Diagonal Representation

[Figure: the diagonals skewed into aligned columns, with padding]
- Skew the diagonals again
- This ensures that all memory loads from matrix are coalesced
- Don't forget padding!

A sketch of such a kernel follows.
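
This is essentially SpMV in diagonal (DIA) format; the sketch below follows the standard formulation (one thread per row, diagonal-major storage padded to a pitch) rather than the exact code from this talk, and all names are illustrative:

    // data[d * pitch + row] holds the element of diagonal d in this row
    // (its column is row + offsets[d]); pitch is padded so that consecutive
    // rows are adjacent in memory and a warp's loads are coalesced.
    __global__ void spmvDia(const float* data, const int* offsets,
                            int numRows, int numDiags, int pitch,
                            const float* x, float* y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= numRows) return;
        float dot = 0.0f;
        for (int d = 0; d < numDiags; d++) {
            int col = row + offsets[d];
            float val = data[d * pitch + row];     // unit-strided across the warp
            if (col >= 0 && col < numRows)
                dot += val * x[col];
        }
        y[row] = dot;
    }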

SoA, AoS
- Different data access patterns may also require transposing data structures


[Figure: Array of Structs vs. Structure of Arrays]
- The cost of a transpose on the data structure is often much less than the cost of uncoalesced memory accesses

The two layouts are contrasted in the sketch below.
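
A hedged sketch (not from the original slides); Particle, Particles, and the scale kernels are illustrative names:

    // Array of Structs: one thread's fields are adjacent, but neighboring
    // threads read addresses 12 bytes apart, so loads are not coalesced.
    struct Particle { float x, y, z; };
    __global__ void scaleAoS(Particle* p, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { p[i].x *= a; p[i].y *= a; p[i].z *= a; }
    }

    // Structure of Arrays: neighboring threads read consecutive floats
    // from each array, giving unit-strided, coalesced loads.
    struct Particles { float *x, *y, *z; };
    __global__ void scaleSoA(Particles p, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { p.x[i] *= a; p.y[i] *= a; p.z[i] *= a; }
    }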

Experiences with CUDA
- Image Contour Detection
- Support Vector Machines

Image Contours

- Contours are subjective - they depend on personal perspective
- Surprise: Humans agree (more or less)
- Malik's group has developed a "ground truth" benchmark

[Figure: example images with human-drawn contours and machine-generated contours (Image | Human Contours | Machine Contours)]

gPb Algorithm: Current Leader

- Global Probability of boundary

- Currently, the most accurate image contour detector
- A few minutes of computation per small image limits its applicability
  - Billions of images on the web
  - A 10,000-computer cluster would take 2 years to find their contours
  - How many new images would there be by then?
[Figure: precision-recall curves on the BSDS benchmark; gPb (F = 0.70) obtains the highest F-measure to date, ahead of UCM, Pb, Canny, and the other detectors compared]
- Maire, Arbelaez, Fowlkes, Malik, CVPR 2008

gPb Computation Outline

Time Breakdown

Runtime (seconds):

| Computation | Original (MATLAB/C++) | C + Pthreads | Damascene (GTX280) |
| Textons | 8.6 | 1.35 | 0.152 |
| Gradients | 53.8 | 12.92 | 0.84 |
| Intervening Contour | 6.3 | 1.21 | 0.03 |
| Eigensolver | 151.0 | 14.29 | 0.81 |
| Overall | 222 seconds | 29.79 seconds | 1.8 seconds |

[Chart: per-stage runtime (Textons, Gradients, Intervening Contour, Eigensolver, Other) for the Pthreads and GTX280 implementations]

Scalability Results

- Scalability examined on Nvidia GPUs from 2 to 30 cores
- Algorithm scales well
- Is memory bandwidth & architecture dependent
[Chart: parallel scalability, images per second vs. number of cores]

Image Size Scalability
- Scaling behavior with respect to image size is good
- Bimodal distribution due to eigensolver runtime
- Limited by memory size:
  - A 1.8 MP image requires 4 GB of memory
[Chart: runtime in seconds vs. image size in pixels]

Accuracy & Summary

- We achieve equivalent accuracy on the BSDS contour detection benchmark
[Chart: precision vs. recall for Damascene and gPb (CVPR 2008)]
- C + Pthreads port done by Yunsup Lee and Andrew Waterman

SVM Training: Quadratic Programming

The SVM training problem is the dual quadratic program

    F(α) = max_α  Σ_{i=1..l} α_i − ½ αᵀ Q α
    s.t.  0 ≤ α_i ≤ C,  ∀ i ∈ [1, l]
          yᵀ α = 0

    Q_ij = y_i y_j Φ(x_i, x_j)

for a kernel function Φ, for example:

    Linear:      Φ(x_i, x_j) = x_i · x_j
    Polynomial:  Φ(x_i, x_j; a, r, d) = (a x_i · x_j + r)^d
    Sigmoid:     Φ(x_i, x_j; a, r) = tanh(a x_i · x_j + r)
    Gaussian:    Φ(x_i, x_j; γ) = exp{−γ ||x_i − x_j||²}

SMO Algorithm

- The Sequential Minimal Optimization algorithm (Platt, 1999) is an iterative solution method for the SVM training problem
- At each iteration, it adjusts only 2 of the variables (chosen by heuristic)
  - The optimization step is then a trivial one dimensional problem:

[Figure: the feasible segment of the line y₁α₁ + y₂α₂ = k inside the box 0 ≤ α₁, α₂ ≤ C]

- Computing full kernel matrix Q not required
- Despite name, algorithm can be quite parallel
- Computation is dominated by KKT optimality condition updates

The standard form of the one dimensional step is given below.
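
For reference, the closed-form update from Platt's paper, stated here as an added note rather than something shown on the original slide (E_i is the current prediction error f(x_i) − y_i):

    η = Φ(x₁, x₁) + Φ(x₂, x₂) − 2 Φ(x₁, x₂)
    α₂' = α₂ + y₂ (E₁ − E₂) / η
    α₂' is then clipped to the feasible segment [L, H] of the box, and
    α₁' = α₁ + y₁ y₂ (α₂ − α₂')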

Training Results

[Chart: training time in seconds, LibSVM vs. GPU, for each dataset below]

| Name | # points | # dim |
| USPS | 7291 | 256 |
| Face | 6977 | 381 |
| Adult | 32561 | 123 |
| Web | 49749 | 300 |
| MNIST | 60000 | 784 |
| Forest | 561012 | 54 |

- LibSVM running on Intel Core 2 Duo, 2.66 GHz
- Our solver running on Nvidia GeForce 8800

SVM Classification
- To classify a point z, evaluate:

      ẑ = b + Σ_{i=1..l} y_i α_i Φ(x_i, z)

- For standard kernels, SVM Classification involves comparing all support vectors and all test vectors with a dot product
- We take advantage of the common situation when one has multiple data points to classify simultaneously
- We cast the dot products as a Matrix-Matrix multiplication, and then use Map Reduce to finish the classification

A simple per-point version is sketched below for contrast.
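
A hedged sketch of the naive evaluation (one thread per test point, Gaussian kernel, dense data); the cast-as-matrix-multiply approach above replaces the inner dot-product loops with a single matrix-matrix product. All names are illustrative:

    // out[p] = b + sum_i yalpha[i] * exp(-gamma * ||sv_i - test_p||^2)
    __global__ void classify(const float* sv,      // l support vectors, dim floats each
                             const float* yalpha,  // precomputed y_i * alpha_i
                             int l, int dim, float gamma, float b,
                             const float* test,    // m test points, dim floats each
                             float* out, int m) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;   // which test point
        if (p >= m) return;
        float acc = b;
        for (int i = 0; i < l; i++) {
            float dist2 = 0.0f;
            for (int d = 0; d < dim; d++) {
                float diff = sv[i * dim + d] - test[p * dim + d];
                dist2 += diff * diff;
            }
            acc += yalpha[i] * expf(-gamma * dist2);
        }
        out[p] = acc;                                    // sign(acc) gives the class
    }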

Classification Results

[Chart: classification time in seconds for LibSVM, a CPU optimized version, and the GPU optimized version on USPS, Adult, Faces, Web, and MNIST]
- CPU optimized version achieves 3-30x speedup
- Results identical to serial version

CUDA Summary

- CUDA is a programming model for manycore processors
- It abstracts SIMD, making it easy to use wide SIMD vectors
- It provides good performance on today's GPUs
- In the near future, CUDA-like approaches will map well to many processors & GPUs
- CUDA encourages SIMD friendly, highly scalable algorithm design and implementation