Highly Efficient and GPU-Friendly Implementation of BFS on Single
2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)
978-0-7381-3199-3/20/$31.00 ©2020 IEEE. DOI 10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00094

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing
Email: caohuawei@ict.ac.cn

Abstract: With the development of computer technology, the data generated in our daily life is growing rapidly. Data-intensive algorithms play an increasingly important part in high-performance computing. The Breadth-First Search (BFS) is a typical data-intensive algorithm, characterized by intensive irregular memory access, low computation intensity and strong data dependency. Although the Graphics Processing Unit (GPU) offers massive parallelism, it is not friendly to BFS due to the above characteristics. To utilize the power of GPU for BFS, efficient scheduling of massive threads and better utilization of the memory hierarchy are required. In this paper, we focus on a highly efficient implementation of BFS on the GPU platform. We propose three optimization techniques, including fine-grained parallelism, a GPU-oriented CSR structure and vertex quick search, to overcome performance bottlenecks. Fine-grained parallelism can improve the workload balance for the parallel implementation in the top-down stage. The GPU-oriented CSR structure employs a GPU-friendly data layout, which can improve the efficiency of memory access. Vertex quick search is proposed to reduce redundant graph computations on GPU. Finally, we conduct extensive experiments on a P100-based platform to verify the effectiveness of these techniques. We achieve 237.94 GTEPS for the Kronecker graph with 2^26 vertices and 2^30 edges. In terms of energy efficiency, our implementation ranks 1st place on the November 2019 Green Graph 500 list.

INTRODUCTION

With the development of the information society, data is continuously generated in our daily life. Graph analytics, a wave of big data analysis, has emerged as a new method to explore and use these data to facilitate people's lives. Many problems in reality can be abstracted and described with graphs. As one of the most important data structures, the graph is widely used in various fields, including protein interaction analysis, ground transportation, social science and machine learning.

The Breadth-First Search (BFS) is a typical graph algorithm and the core component of many high-level graph analyses, such as connected components, centrality and single-source shortest paths. BFS is characterized by intensive irregular memory access, low computation intensity and strong data dependency, which are quite different from compute-intensive workloads. The Graph 500 list was introduced to rank computer performance on data-intensive applications, and BFS is one of the key kernels of the Graph 500 benchmark.

Originally, GPUs were created to solve the problem of graphics rendering. Due to its massive parallelism, high bandwidth and low energy consumption, the GPU has become an attractive platform for high-performance computing. Compared to the CPU, the GPU has a different memory hierarchy, a different execution mode and more computation units. However, due to the irregular characteristics of graph traversal, it is difficult to achieve high performance on traditional multicore platforms, and especially on GPU. The problem is further aggravated for scale-free graphs, an essential class of real-world graphs that follow a power-law distribution. The topology of scale-free graphs restricts efficient BFS implementation on GPU and can cause severe workload imbalance. Memory divergence creates additional challenges in BFS processing: inconsecutive memory accesses within a warp can cause a large number of load and store transactions during the traversal.

To tackle such challenges and efficiently utilize the massive parallelism of GPU, many optimization methods have been put forward in recent years. Harish and Narayanan proposed a BFS implementation on GPU based on vertex-centric processing that identifies active vertices by scanning vertex status. Hong put forward the virtual warp to improve workload balance: the neighbor list of each active vertex is processed by a group of threads instead of one thread. Merrill proposed a linear parallelization of the BFS algorithm that maps the workload of a single vertex to a single thread, warp or block depending on its out-degree, and achieved high performance on GPU. Beamer proposed a direction-optimizing scheme that combines the traditional top-down approach with a novel bottom-up approach, which can dramatically reduce the number of redundant edge traversals. Liu implemented an efficient hybrid BFS algorithm with degree-based classification of vertices to deal with the workload imbalance problem. Sabet constructed a virtually transformed graph based on the CSR structure, which limits the workload of each vertex. Although the above techniques have improved the efficiency of BFS on GPU, further performance gains can be achieved with GPU-specific optimizations.

In this paper, we focus on improving BFS performance on the Nvidia GPU platform and use a Tesla P100 in our experiments. The detailed optimizations are presented in order to deal with workload imbalance, memory access divergence and redundant calculations on GPU. Specifically, we make the following contributions:

1. Due to graph topology and SIMT execution, there is severe workload imbalance on scale-free graphs. We develop a fine-grained parallelism method to improve the workload balance.
2. The original CSR data structure is not GPU-friendly, which causes a memory divergence problem. We develop a GPU-oriented CSR layout to improve the efficiency of memory access.
3. By leveraging a bitmap structure, we further propose a vertex quick-search method to find all unvisited vertices. It can greatly reduce the amount of redundant computation in the status check procedure.
4. We conduct extensive experiments on a P100 platform to verify the effectiveness of the proposed techniques. Our implementation achieves 237.94 GTEPS for the Kronecker graph with 2^26 vertices and 2^30 edges. It ranks 1st on the November 2019 Green Graph500 list.

BACKGROUND

BFS is a widely used graph algorithm and an important building block of many graph analysis algorithms. To improve BFS performance, there has been a lot of work on parallel implementations of the BFS algorithm. In this section, we present some preliminary concepts concerning GPU and some state-of-the-art optimizations for BFS.

GPU Concepts

Normally, one GPU contains dozens of Streaming Multiprocessors (SMs). For example, P100 consists of 56 SMs. Each SM contains 64 single-precision CUDA cores and 32 double-precision cores. With numerous processing units, a GPU can offer outstanding parallel computing power.

The execution model of the GPU is quite different from that of the CPU. The GPU schedules threads in units of warps (32 adjacent threads) and executes them in Single-Instruction Multiple-Threads (SIMT) fashion. The SIMT execution model is very efficient for regular computations [20].

The memory hierarchy of the GPU is also different from that of the CPU. P100 offers 16 GB of global memory and a 4096 KB L2 cache. Each SM contains a 256 KB register file and 64 KB of dedicated shared memory. The shared memory is a software-configurable cache in the SM. All the threads in the same Cooperative Thread Array (CTA) execute on the same SM and can communicate through shared memory.

CSR Format

In order to reduce the memory footprint of graph data, [...] first neighbor in adjacency list for each vertex. The difference between adjacent values in the row list is the degree of each vertex.

Figure 1: Illustration of CSR format

Top-down BFS

Algorithm 1: Top-down BFS
Input: undirected graph G=(V,E), level array LA, current frontier CF, next frontier NF, adjacency list A, source vertex s.
Output: level array LA, parent map PM.
 1: LA[v] ← inf, for all v ∈ V
 2: lvl ← 0
 3: LA[s] ← lvl
 4: PM[s] ← s
 5: CF ← {s}
 6: NF ← ∅
 7: while CF is not empty do
 8:     lvl++
 9:     for u ∈ CF in parallel do
10:         for w ∈ A[u] do
11:             if LA[w] == inf then
12:                 PM[w] ← u
13:                 LA[w] ← lvl
14:                 NF ← NF ∪ {w}
15:     swap CF with NF
16:     NF ← ∅

Traditional BFS is presented in a top-down manner. Given a graph G = (V, E) with vertex set V and edge set E, BFS traverses all reachable vertices starting at a source vertex. The result of the algorithm is the BFS search tree rooted at the source vertex.
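As a rough illustration of how the CSR row list and Algorithm 1 fit together, the following is a minimal sequential sketch in Python. It is not the GPU implementation described in this paper. The function names `build_csr` and `top_down_bfs`, the array names `row` and `col`, and the use of -1 as the parent of an unreached vertex are assumptions made for this example. The row list is assumed to hold |V|+1 offsets, so that `row[v]` is the position of the first neighbor of `v` and `row[v+1] - row[v]` is its degree.

```python
from math import inf


def build_csr(num_vertices, edges):
    """Build CSR arrays for an undirected graph given as (u, v) pairs.

    row[v] is the offset of the first neighbor of v in col, and
    row[v + 1] - row[v] is the degree of v.
    """
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Prefix sum of degrees gives the row offsets.
    row = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row[v + 1] = row[v] + degree[v]

    # Fill the column list; cursor[v] is the next free slot for vertex v.
    col = [0] * row[num_vertices]
    cursor = row[:-1].copy()
    for u, v in edges:
        col[cursor[u]] = v
        cursor[u] += 1
        col[cursor[v]] = u
        cursor[v] += 1
    return row, col


def top_down_bfs(row, col, source):
    """Level-synchronous top-down BFS following Algorithm 1 (sequential)."""
    num_vertices = len(row) - 1
    level = [inf] * num_vertices   # LA in Algorithm 1
    parent = [-1] * num_vertices   # PM; -1 marks "no parent" (assumption)
    lvl = 0
    level[source] = lvl
    parent[source] = source
    frontier = [source]            # CF

    while frontier:
        lvl += 1
        next_frontier = []         # NF
        for u in frontier:         # this loop is the parallel one on a GPU
            for w in col[row[u]:row[u + 1]]:
                if level[w] == inf:  # status check: w not visited yet
                    parent[w] = u
                    level[w] = lvl
                    next_frontier.append(w)
        frontier = next_frontier   # swap CF with NF
    return level, parent


if __name__ == "__main__":
    row, col = build_csr(6, [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])
    level, parent = top_down_bfs(row, col, 0)
    print(row, col)
    print(level, parent)
```

In this sketch the cost of expanding a frontier vertex is proportional to its degree. On a scale-free graph this means a few vertices dominate the work of a level, which is the workload imbalance that the parallel version has to address.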