Highly Efficient and GPU-Friendly Implementation of BFS on Single

2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

978-0-7381-3199-3/20/$31.00 ©2020 IEEE. DOI 10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00094

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing
Email: caohuawei@ict.ac.cn

Abstract

With the development of computer technology, the data generated in our daily life is growing rapidly. Data-intensive algorithms play an increasingly important part in high performance computing. Breadth-First Search (BFS) is a typical data-intensive algorithm, characterized by intensive irregular memory access, low computation intensity and strong data dependency. Although the Graphics Processing Unit (GPU) offers massive parallelism, BFS is not GPU-friendly because of these characteristics. To harness the power of the GPU for BFS, efficient scheduling of massive threads and better utilization of the memory hierarchy are required. In this paper, we focus on a highly efficient implementation of BFS on a single-GPU platform. We propose three optimization techniques, namely fine-grained parallelism, a GPU-oriented CSR structure and vertex quick-search, to overcome the performance bottlenecks in BFS. Fine-grained parallelism improves the workload balance of the parallel implementation in the top-down stage. The GPU-oriented CSR structure employs a GPU-friendly data layout that improves the efficiency of memory access. Vertex quick-search is proposed to reduce redundant graph computations on the GPU. Finally, we conduct extensive experiments on a P100-based platform to verify the effectiveness of these techniques. We achieve 237.94 GTEPS for the Kronecker graph with 2^26 vertices and 2^30 edges. In terms of energy efficiency, our implementation ranks 1st on the November 2019 Green Graph500 list.

INTRODUCTION

With the development of the information society, data is continuously generated in our daily life. Graph analytics, a wave of big data analysis, has emerged as a new method to explore and use these data to facilitate people's lives. Many real-world problems can be abstracted and described with graphs. As one of the most important data structures, the graph is widely used in various fields, including protein interaction analysis, ground transportation, social science and machine learning.

Breadth-First Search (BFS) is a typical graph algorithm and the core component of many high-level graph analyses, such as connected components, centrality and single-source shortest paths. BFS is characterized by intensive irregular memory access, low computation intensity and strong data dependency, which make it quite different from compute-intensive workloads. The Graph500 list was introduced to rank computer performance on data-intensive applications, and BFS is one of the key kernels of the Graph500 benchmark.

Originally, GPUs were created to solve the problem of graphics rendering. Owing to its massive parallelism, high bandwidth and low energy consumption, the GPU has become an attractive platform for high performance computing. Compared to the CPU, the GPU has a different memory hierarchy, a different execution mode and more computation units. However, because of the irregular characteristics of graph traversal, it is difficult to achieve high performance on traditional multi-core platforms, and especially on GPUs. The problem is further aggravated for scale-free graphs, an essential class of real-world graphs whose degrees follow a power-law distribution. The topology of scale-free graphs restricts efficient BFS implementation on the GPU and can cause severe workload imbalance. Memory divergence creates additional challenges in BFS processing: inconsecutive memory accesses within a warp can cause a large number of load and store transactions during the traversal.

To tackle such challenges and efficiently utilize the massive parallelism of the GPU, many optimization methods have been put forward in recent years. Harish and Narayanan proposed a BFS implementation on the GPU based on vertex-centric processing that identifies active vertices by scanning vertex status. Hong put forward the virtual warp to improve workload balance: the neighbor list of each active vertex is processed by a group of threads instead of one thread. Merrill proposed a linear parallelization of the BFS algorithm that maps the workload of a single vertex to a single thread, warp or block depending on its out-degree, and achieved high performance on the GPU. Beamer proposed a direction-optimizing scheme that combines the traditional top-down approach with a novel bottom-up approach, which can dramatically reduce the number of redundant edge traversals. Liu implemented an efficient hybrid BFS algorithm with degree-based classification of vertices to deal with the workload imbalance problem. Sabet constructed a virtually transformed graph based on the CSR structure, which limits the workload of each vertex. Although the above techniques have improved the efficiency of BFS on the GPU, further performance gains can be achieved with GPU-specific optimizations.

In this paper, we focus on improving BFS performance on the Nvidia GPU platform and use the Tesla P100 in our experiments. The detailed optimizations are presented in order to deal with workload imbalance, memory access divergence and redundant calculations on the GPU. Specifically, we make the following contributions:

1. Due to graph topology and SIMT execution, there exists severe workload imbalance on scale-free graphs. We develop a fine-grained parallelism method to improve the workload balance.
2. The original CSR data structure is not GPU-friendly, which causes a memory divergence problem. We develop a GPU-oriented CSR layout to improve the efficiency of memory access.
3. By leveraging a bitmap structure, we further propose a vertex quick-search method to find all unvisited vertices. It can greatly reduce the amount of redundant computation in the status check procedure.
4. We conduct extensive experiments on the P100 platform to verify the effectiveness of the proposed techniques. Our implementation achieves 237.94 GTEPS for the Kronecker graph with 2^26 vertices and 2^30 edges. It ranks 1st on the November 2019 Green Graph500 list.

BACKGROUND

BFS is a widely used graph algorithm and an important building block of many graph analysis algorithms. To improve BFS performance, there has been a lot of work on parallel implementations of the BFS algorithm. In this section, we present some preliminary concepts concerning the GPU and some state-of-the-art optimizations for BFS.

GPU Concepts

Normally, one GPU contains dozens of Streaming Multiprocessors (SMs). For example, the P100 consists of 56 SMs. Each SM contains 64 single-precision CUDA cores and 32 double-precision cores. With numerous processing units, the GPU can offer outstanding parallel computing power.

The execution model of the GPU is quite different from that of the CPU. The GPU schedules threads in units of warps (32 adjacent threads) and executes them in Single-Instruction Multiple-Threads (SIMT) fashion. The SIMT execution model is very efficient for regular computations [20].

The memory hierarchy of the GPU is also different from that of the CPU. The P100 offers 16 GB of global memory and 4096 KB of L2 cache. Each SM contains a 256 KB register file and 64 KB of dedicated shared memory. The shared memory is a software-configurable cache in the SM. All the threads in the same Cooperative Thread Array (CTA) execute on the same SM and can communicate through shared memory.

CSR Format

In order to reduce the memory footprint of graph data, [...] first neighbor in adjacency list for each vertex. The difference between adjacent values in the row list is the degree of each vertex.

Figure 1: Illustration of CSR format

Top-down BFS

Traditional BFS is presented in a top-down manner. Given a graph G = (V, E) with vertex set V and edge set E, BFS traverses all reachable vertices starting at a source vertex. The result of the algorithm is the BFS search tree rooted at the source vertex.

Algorithm 1: Top-down BFS algorithm
Input: undirected graph G = (V, E), level array LA, current frontier CF, next frontier NF, adjacency list A, source vertex s.
Output: level array LA, parent map PM.
    LA[v] ← inf, for all v ∈ V
    lvl ← 0
    LA[s] ← lvl
    PM[s] ← s
    CF ← {s}
    NF ← ∅
    while CF is not empty do
        lvl++
        for u ∈ CF in parallel do
            for w ∈ A[u] do
                if LA[w] == inf then
                    PM[w] ← u
                    LA[w] ← lvl
                    NF ← NF ∪ {w}
        swap CF with NF
        NF ← ∅
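To make the CSR row list and the top-down traversal concrete, the following is a minimal sequential sketch in Python. It is not the GPU implementation described in this paper: the function names `build_csr` and `top_down_bfs`, the variable names and the sample graph are all hypothetical, and the loop over the current frontier, which a GPU would execute in parallel, runs serially here. The sketch assumes the conventional CSR layout in which the row list stores, for each vertex, the offset of its first neighbor in a flat column list.

```python
from math import inf


def build_csr(num_vertices, edges):
    """Build CSR arrays for an undirected graph given as (u, v) pairs.

    row[v] is the offset of v's first neighbor in col, so
    row[v + 1] - row[v] is the degree of v.
    """
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Prefix sum over the degrees gives the row offsets.
    row = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row[v + 1] = row[v] + degree[v]

    # Fill the column list; cursor[v] is the next free slot for vertex v.
    col = [0] * row[num_vertices]
    cursor = row[:-1].copy()
    for u, v in edges:
        col[cursor[u]] = v
        cursor[u] += 1
        col[cursor[v]] = u
        cursor[v] += 1
    return row, col


def top_down_bfs(row, col, source):
    """Level-synchronous top-down BFS following Algorithm 1 (sequential)."""
    num_vertices = len(row) - 1
    level = [inf] * num_vertices      # LA in Algorithm 1
    parent = [None] * num_vertices    # PM in Algorithm 1
    lvl = 0
    level[source] = lvl
    parent[source] = source
    current = [source]                # CF in Algorithm 1
    while current:
        lvl += 1
        next_frontier = []            # NF in Algorithm 1
        for u in current:             # a GPU would process this loop in parallel
            for i in range(row[u], row[u + 1]):
                w = col[i]
                if level[w] == inf:   # w has not been visited yet
                    parent[w] = u
                    level[w] = lvl
                    next_frontier.append(w)
        current = next_frontier       # swap CF with NF
    return level, parent


# Hypothetical 6-vertex graph; vertex 5 is isolated and stays unreachable.
sample_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
row, col = build_csr(6, sample_edges)
degrees = [row[v + 1] - row[v] for v in range(6)]
level, parent = top_down_bfs(row, col, 0)
```

In this serial form, each vertex is appended to the next frontier exactly once. A parallel version has to handle several frontier vertices discovering the same neighbor at the same time, which is one source of the redundant status checks discussed above.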
