
2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

        

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing
Email: caohuawei@ict.ac.cn

978-0-7381-3199-3/20/$31.00 ©2020 IEEE    DOI 10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00094

Abstract—With the development of computer technology, the data generated in our daily life is growing rapidly. Data-intensive algorithms take a more and more important part in high performance computing. Breadth-First Search (BFS) is a typical data-intensive algorithm, characterized by intensive irregular memory access, low computation intensity and strong data dependency. Although the Graphics Processing Unit (GPU) offers massive parallelism, BFS is not GPU-friendly due to the above characteristics. To utilize the power of the GPU, efficient scheduling of massive threads and better utilization of the GPU memory hierarchy are required. In this paper, we focus on a highly efficient implementation of BFS on the GPU platform. We propose three optimization techniques, including fine-grained parallelism, a GPU-oriented CSR structure and vertex quick-search, to overcome performance bottlenecks in BFS. Fine-grained parallelism can improve the workload balance for parallel implementation in the top-down stage. The GPU-oriented structure employs a GPU-friendly data layout which can improve the efficiency of memory access. Vertex quick-search is proposed to reduce redundant graph computations on the GPU. Finally, we conduct extensive experiments on a GPU-based platform to verify the effectiveness of these techniques. We achieve 237.94 GTEPS for the Kronecker graph with 2^26 vertices and 2^30 edges. In terms of energy efficiency, our implementation ranks 1st place on the November 2019 Green Graph500 list.

I. INTRODUCTION

With the development of the information society, data is continuously generated in our daily life. Graph analytics, a wave of big data analysis, has emerged as a new method to explore and use these data to facilitate people's lives [?]. Many problems in reality can be abstracted and described with graphs. As one of the most important data structures, the graph is widely used in various fields including protein interaction analysis, ground transportation, social science and machine learning [?]. Breadth-First Search (BFS) is a typical graph algorithm and the core component of many high-level graph analyses such as connected components, centrality and single-source shortest paths [?]. BFS is characterized by intensive irregular memory access, low computation intensity and strong data dependency, which are quite different from compute-intensive workloads. The Graph500 list was introduced to rank computer performance towards data-intensive applications, and BFS is one of the key kernels of the Graph500 benchmark [?].

Originally, GPUs were created to solve the problem of graphics rendering. Due to its massive parallelism, high bandwidth and low energy consumption, the GPU has become an attractive platform for high performance computing. Compared to the CPU, the GPU has a different memory hierarchy, a different execution mode and more computation units. However, due to the irregular characteristics of graph traversal, it is difficult to achieve high performance on traditional multicore platforms, and especially on the GPU. The problem is further aggravated for scale-free graphs, an essential class of real-world graphs which follow a power-law distribution [?]. The topology of scale-free graphs restricts efficient BFS implementation on the GPU and can cause severe workload imbalance. Memory divergence creates additional challenges in BFS processing: due to inconsecutive memory accesses within a warp, it can cause a large amount of load and store transactions during the traversal.

To tackle such challenges and efficiently utilize the massive parallelism of the GPU, many optimization methods have been put forward in recent years. Harish and Narayanan proposed a BFS implementation on GPU based on vertex-centric processing that identifies active vertices by scanning vertex status [?]. Hong et al. put forward the virtual warp to improve workload balance [?]; the neighbor list of each active vertex is processed by a group of threads instead of one thread. Merrill et al. proposed a linear parallelization of the BFS algorithm that mapped the workload of a single vertex to a single thread, warp or block depending on its out-degree, and achieved high performance on GPU [10]. Beamer et al. proposed a direction-optimizing scheme that combined the traditional top-down approach with a novel bottom-up approach, which can dramatically reduce the number of redundant edge traversals [9]. Liu et al. implemented an efficient hybrid BFS algorithm with degree-based classification of vertices to deal with the workload imbalance problem [13]. Sabet et al. constructed a virtually transformed structure which limits the workload of each vertex [16]. Although the above techniques have improved the efficiency of BFS on GPU, further performance gains can be achieved with GPU-specific optimizations.

In this paper, we focus on improving BFS performance on the NVIDIA GPU platform and use a Tesla P100 in our experiments. The detailed optimizations are presented in order to deal with workload imbalance, memory access divergence and redundant calculations on the GPU.

Specifically, we make the following contributions:

1. Due to graph topology and SIMT execution, there exists severe workload imbalance on scale-free graphs. We develop a fine-grained parallelism method to improve the workload balance.

2. The original CSR data structure is not GPU-friendly, which causes a memory divergence problem. We develop a GPU-oriented CSR layout to improve the efficiency of memory access.

3. By leveraging the bitmap structure, we further propose a vertex quick-search method to find all unvisited vertices. It can highly reduce the amount of redundant computations in the status check procedure.

4. We conduct extensive experiments on the P100 platform to verify the effectiveness of the proposed techniques. Our implementation achieves 237.94 GTEPS for the Kronecker graph with 2^26 vertices and 2^30 edges. It ranks 1st on the November 2019 Green Graph500 list.

II. BACKGROUND

BFS is a widely used graph algorithm and an important building block of many graph analysis algorithms. To facilitate BFS performance, there has been a lot of work on parallel implementations of the BFS algorithm. In this section, we will present some preliminary concepts concerning the GPU and some state-of-the-art optimizations for BFS.

A. GPU Concepts

Normally, one GPU contains dozens of Streaming Multiprocessors (SMs). For example, the P100 consists of 56 SMs. Each SM contains 64 single-precision CUDA cores and 32 double-precision cores. With numerous processing units, the GPU can offer outstanding computing power.

The execution model of the GPU is quite different from the CPU. The GPU schedules threads in the form of warps (32 adjacent threads) and executes in Single-Instruction Multiple-Threads (SIMT) fashion. The SIMT execution model is very efficient for regular computations [20].

The memory hierarchy of the GPU is also different from the CPU. The P100 offers 16 GB global memory and 4096 KB L2 cache. Each SM contains a 256 KB register file and 64 KB dedicated shared memory. The shared memory is a configurable cache in the SM. All the threads in the same Cooperative Thread Array (CTA) can communicate through shared memory and execute in the same SM.

B. CSR Format

In order to reduce the memory footprint of graph data, a graph is usually stored in Compressed Sparse Row (CSR) format, which allows streaming access to the neighboring edges of each vertex. CSR format is the compressed row storage of the adjacency matrix. The basic idea of CSR is shown in Figure 1. The CSR structure includes two arrays: the row list and the adjacency list. The adjacency list stores all the neighboring vertices of each vertex, and its size is bounded by the number of edges of the graph. The row list stores the offset of the first neighbor in the adjacency list for each vertex. The difference between adjacent values in the row list is the degree of each vertex.

Figure 1: Illustration of CSR format
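The layout maps directly onto two flat arrays. The following minimal C++ sketch (our own illustration; the paper's code is not shown) makes the two arrays and the degree computation concrete:

    // Minimal CSR container matching Figure 1. row_list[v] holds the offset of
    // vertex v's first neighbor in adj_list; degrees fall out of adjacent offsets.
    #include <vector>

    struct CsrGraph {
        std::vector<int> row_list;   // size: num_vertices + 1
        std::vector<int> adj_list;   // size: num_edges (2x for undirected graphs)

        int degree(int v) const {
            // difference between adjacent row-list values is the degree
            return row_list[v + 1] - row_list[v];
        }
        const int* neighbors_begin(int v) const { return adj_list.data() + row_list[v]; }
        const int* neighbors_end(int v)   const { return adj_list.data() + row_list[v + 1]; }
    };

Iterating from neighbors_begin(v) to neighbors_end(v) is the streaming access to a vertex's edges mentioned above.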

C. Top-down BFS

Traditional BFS is presented in a top-down manner. Given a graph G = (V, E) with vertex set V and edge set E, BFS traverses all reachable vertices starting at a source vertex. The result of the algorithm is the BFS search tree rooted at the source vertex. The pseudocode of BFS is shown in Algorithm 1.

Algorithm 1: Top-down BFS
Input: undirected graph G=(V,E), level array LA, current frontier CF, next frontier NF, adjacency list A, source vertex s.
Output: level array LA, parent map PM.
1:  LA[v] ← inf, for all v ∈ V
2:  lvl ← 0
3:  LA[s] ← lvl
4:  PM[s] ← s
5:  CF ← {s}
6:  NF ← ∅
7:  while CF is not empty do
8:      lvl++
9:      for u ∈ CF in parallel do
10:         for w ∈ A[u] do
11:             if LA[w] == inf then
12:                 PM[w] ← u
13:                 LA[w] ← lvl
14:                 NF ← NF ∪ {w}
15:     swap CF with NF
16:     NF ← ∅

At the beginning, all data structures are initialized and a random source vertex s is generated. Then, the source vertex s is put into the current frontier. If the current frontier is not empty, all the neighbors of vertices in the current frontier will be traversed. The traversal procedure marks the status of the unvisited neighbors as visited, maps their parent vertices, and puts these vertices into the next frontier. After all the vertices in the current frontier are processed, the current frontier is cleared and swapped with the next frontier. Then a new iteration starts. As described above, traditional BFS is performed in a top-down manner and generates the BFS search tree at the end of the algorithm.

D. Direction-optimizing Approach

The top-down BFS approach performs well when the number of vertices in the current frontier is small. After several iterations, the number of vertices in the current frontier increases rapidly and many vertices in the adjacency list have already been visited. However, the top-down approach still traverses all the neighbors of vertices in the current frontier, leading to massive redundant edge traversals. On the GPU platform, due to massive launching threads and limited global memory, plenty of atomic operations are needed to obtain an accurate next frontier, causing expensive computational overhead.

To overcome the limitations of the top-down approach, an effective bottom-up approach was proposed by Beamer et al. [9]. The bottom-up approach works in the opposite way compared to the top-down approach. In each level, the top-down approach traverses neighbors of vertices in the current frontier, which were expanded in the previous level, while the bottom-up approach traverses neighbors of unvisited vertices. The visiting status of vertices can be identified by a status array, which avoids the overhead of atomic operations. In the bottom-up approach, the traversal procedure of one unvisited vertex stops once a neighbor is found in the current frontier. To stop the traversal procedure earlier, a degree-aware optimization was proposed that sorts the adjacency list of each vertex by degree in descending order [21]. In our algorithm, the level array records the distance to the source vertex and also acts as the status array. The pseudocode of the bottom-up approach is shown in Algorithm 2.

Algorithm 2: Bottom-up BFS
Input: undirected graph G=(V,E), level array LA, current frontier CF, next frontier NF, adjacency list A, source vertex s.
Output: level array LA, parent map PM.
1:  LA[v] ← inf, for all v ∈ V
2:  lvl ← 0
3:  LA[s] ← lvl
4:  PM[s] ← s
5:  CF ← {s}
6:  NF ← ∅
7:  while CF is not empty do
8:      lvl++
9:      for u with LA[u] == inf in parallel do
10:         for w ∈ A[u] do
11:             if LA[w] == lvl-1 then
12:                 PM[u] ← w
13:                 LA[u] ← lvl
14:                 NF ← NF ∪ {u}
15:                 break
16:     swap CF with NF
17:     NF ← ∅

However, the bottom-up approach also has its drawbacks. When there are few vertices in the current frontier, most vertices will not be expanded in this level, causing plenty of redundant edge checks.

As we can see, bottom-up BFS is advantageous when the size of the current frontier is large, while top-down BFS is efficient when the size is small. So the top-down and bottom-up approaches are complementary. In Beamer's direction-optimizing approach [9], BFS starts to run in the top-down approach and switches to the bottom-up approach when the size of the current frontier is large enough. In the last several iterations, BFS switches back to the top-down approach when the size of the current frontier becomes small. Under an appropriate switching policy, BFS performance improves a lot [9].

E. Bitmap Optimizing

To further reduce the cost of memory access and the cache miss rate, a bitmap based optimization was proposed by Agarwal et al. [22]. It uses one bit to represent the status of one vertex. The bit value "1" means the vertex is visited, while the value "0" means the vertex is unvisited. The bitmap technique is mainly used in the bottom-up stage and many optimizing techniques are put forward based on this data structure.

In the bottom-up stage, the whole status array needs to be scanned to identify unvisited vertices. With the bitmap technique, we can load the status of more vertices in one operation and reduce the number of memory accesses. To improve the data locality, Yasui et al. proposed a vertex sorting technique that sorts the vertex indices by degree in descending order [21]. By combining the bitmap and locality-friendly techniques, memory access overhead and the cache miss rate are highly reduced.
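As an illustration of the one-bit-per-vertex idea, the following CUDA device helpers (hypothetical names; not the paper's code) pack the status into 32-bit units so that a single load fetches the status of 32 vertices, following the convention above that "1" means visited:

    // Illustrative sketch: visited_bm is a bitmap in GPU global memory with
    // one bit per vertex, packed into 32-bit units.
    __device__ inline bool is_visited(const unsigned *visited_bm, int v) {
        unsigned unit = visited_bm[v >> 5];   // one load covers 32 vertices
        return (unit >> (v & 31)) & 1u;       // test this vertex's bit
    }

    __device__ inline void mark_visited(unsigned *visited_bm, int v) {
        atomicOr(&visited_bm[v >> 5], 1u << (v & 31));
    }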

III. CHALLENGES OF BFS PARALLELISM

Although the GPU has large potential in exploiting the parallelism of BFS implementations, it incurs severe workload imbalance, memory divergence and costly status checks. These problems raise the difficulty of implementing BFS efficiently on the GPU.

A. Workload Imbalance

Normally, a scale-free graph follows a power-law distribution, which brings a serious workload imbalance problem in the top-down phase. For parallel BFS implementations, the workload is usually divided based on the number of active vertices and distributed evenly to the available threads [9, 11, 21, 22]. Due to the power-law nature, the degree of active vertices varies significantly. Although the GPU provides numerous computing cores and threads, the running time is dominated by the vertices with heavy workload. What's worse, the workload imbalance phenomenon is magnified in the SIMT execution model. The threads in the same warp are executed in SIMT fashion: although some threads finish their tasks early, their computation units can't be assigned to other threads until all threads in the same warp have finished. Another problem is the mismatch between the number of threads and the size of the current frontier. When the size of the current frontier is small, many launched threads will be idle.

In recent years, some prior works have explored fine-grained workload partition methods on GPU [10, 13, 15, 16]. These works tried to partition the processed edges evenly to each thread, so that the workload of each thread is similar. However, fine-grained methods are only beneficial in the top-down phase and most of them introduce extra overhead for task partition. According to our experiments with Kronecker graphs, less than 5% of vertices are expanded in the top-down phase, so the overhead may dilute the advantage of fine-grained partition. In this paper, we present a fine-grained partition solution to deal with workload imbalance.

B. Memory Divergence

The GPU provides not only massive parallelism, but also high memory bandwidth. On NVIDIA GPUs, regular and sequential global memory accesses can reach high bandwidth utilization. However, scale-free graphs, like social and web networks, are highly irregularly distributed. According to Khorasani, the memory access efficiency of BFS is only between 12.8% and 15.8% [12]. The irregularity of graph algorithms is the major challenge to efficiently utilizing the memory bandwidth of the GPU. Irregular memory accesses in the same warp can't be coalesced (memory divergence), leading to a large amount of load and store transactions and high latency.

There are some works focusing on improving memory efficiency during the traversal procedure by utilizing memory coalescing [10, 14, 18]. The works [10, 14] aimed at the inefficient memory problem by employing a global next frontier in the top-down approach. Zhong et al. [18] proposed a column-major layout to access message buffers efficiently. Memory coalescing is always an effective way to improve memory efficiency on GPU. In this work, we focus on the improvement of the graph data layout by taking the benefit of memory coalescing. The original CSR format is not GPU-friendly in the bottom-up phase. To make BFS more GPU-aware, we propose an improved graph data structure to achieve high memory efficiency.

C. Costly Status Check

In the bottom-up phase, the status array is checked thoroughly in each iteration to find all unvisited vertices instead of using the current frontier. Under these circumstances, atomic operations on the current frontier are avoided, but many costly status checks are involved. After a few iterations, the ratio of unvisited vertices becomes relatively small and displays a sparse distribution. However, the bottom-up algorithm continues to scan the status of all visited and unvisited vertices. For large-scale graphs, this sequential scanning has a great impact on performance.

Table I shows the number of redundant checks for the graph with scale 26 in the bottom-up phase and their ratio among all status checks. Clearly, each iteration involves redundant checks. Especially in the last few iterations, the ratio of redundant checks is up to 99%. Therefore, costly status checks introduce redundant overhead and unnecessary memory accesses, leading to performance degradation.

Table I: The number of redundant checks

Iteration | Redundant checks | Ratio
3         | 27350            | 0.08%
4         | 16444674         | 50.13%
5         | 32642738         | 99.51%
6         | 32782985         | 99.93%

IV. ALGORITHM OPTIMIZATION

In this section, we illustrate the detailed optimizations for BFS implementation on GPU. On top of the techniques in Section II, we propose three techniques addressing the challenges of BFS parallelism. These GPU-specific optimizations are used to balance workload, improve memory bandwidth utilization, and reduce redundant calculations.

A. Fine-grained Parallelism

The architecture of the GPU is far different from the CPU. One GPU device has many SMs and one SM has many CUDA cores. As a result, the number of processing units is far beyond that of a CPU. Due to the massive number of registers in an SM and memory coalescing, the number of created threads is much larger than the number of CUDA cores. Therefore, the load imbalance problem among threads cannot be ignored. With the support of dynamic parallelism in CUDA, fine-grained parallelism is proposed to solve the load imbalance problem in the top-down phase. Dynamic parallelism is supported via an extension of the CUDA library that allows a CUDA kernel to create and synchronize new nested kernels.

Most previous work performed a coarse-grained workload distribution: each thread deals with one vertex in the current frontier and gathers all its assigned neighbors in the adjacency list. Fine-grained parallelism allows us to do out-degree based task partition. Our fine-grained parallelism is similar to that in [10], but our solution limits the number of child kernels to avoid launching too many kernels, and combines it with a GPU-friendly bottom-up approach. At the beginning, a master kernel is launched. The size of the master kernel is the same as the length of the current frontier, and each thread of the master kernel deals with one vertex in the current frontier. A child kernel may then be created in a thread of the master kernel; whether the child kernel is created depends on the vertex's out-degree, and the size of the child kernel differs with varied out-degrees. Load balance in the top-down approach can be achieved through the above fine-grained dynamic mechanism. In case too many threads would be created in master and child kernels, we limit the largest dimension of the master and child kernels. In other words, one thread in the master kernel may deal with more than one vertex, and one thread in a child kernel may deal with more than one neighboring vertex.

Figure 2: Fine-grained strategy with dynamic parallelism

Figure 2 shows the fine-grained strategy in the top-down procedure. The threads in the master kernel are created to process vertices in the current frontier. A child kernel is created when the out-degree of one vertex is large enough. By using the fine-grained parallelism technique, the workload imbalance of the top-down procedure is greatly alleviated.
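The sketch below is a minimal illustration of the master/child split just described, assuming CUDA dynamic parallelism (compiled with -rdc=true on sm_35 or newer). The kernel names, the threshold, the -1 unvisited marker and the frontier layout are our assumptions, not the paper's actual code; the grid-stride loops reflect the capped kernel dimensions mentioned above.

    #define CHILD_THRESHOLD 1024   // out-degree above which a child kernel is used

    __global__ void child_expand(const int *adj, int first, int degree, int *level,
                                 int lvl, int *next_frontier, int *next_size) {
        // grid-stride loop: the child kernel's dimension is bounded by the caller
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < degree;
             i += gridDim.x * blockDim.x) {
            int w = adj[first + i];
            if (atomicCAS(&level[w], -1, lvl) == -1) {   // first visit wins
                next_frontier[atomicAdd(next_size, 1)] = w;
            }
        }
    }

    __global__ void master_expand(const int *frontier, int n, const int *row,
                                  const int *adj, int *level, int lvl,
                                  int *next_frontier, int *next_size) {
        // one master thread per frontier vertex (grid-stride if n is very large)
        for (int t = blockIdx.x * blockDim.x + threadIdx.x; t < n;
             t += gridDim.x * blockDim.x) {
            int u = frontier[t];
            int first = row[u], degree = row[u + 1] - first;
            if (degree > CHILD_THRESHOLD) {
                // heavy vertex: delegate its whole neighbor list to a nested kernel
                int blocks = min((degree + 255) / 256, 64);  // cap child dimension
                child_expand<<<blocks, 256>>>(adj, first, degree, level, lvl,
                                              next_frontier, next_size);
            } else {
                for (int i = 0; i < degree; ++i) {
                    int w = adj[first + i];
                    if (atomicCAS(&level[w], -1, lvl) == -1)
                        next_frontier[atomicAdd(next_size, 1)] = w;
                }
            }
        }
    }

With this split, the cost of a high-degree vertex is spread over a whole nested grid instead of serializing one warp lane, which is exactly the imbalance the section targets.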

B. GPU-oriented CSR Structure

Firstly, we introduce the idea of bitmap adaptive CSR. Normally, the bitmap optimizing technique is used to reduce memory accesses in the bottom-up stage. Each thread deals with one unit of the bitmap (the size of a bitmap unit is usually 4 bytes) at a time. With the vertex sorting technique, the bitmap can offer good locality. If we use the bitmap optimizing technique on the GPU platform, the threads in the same warp face a memory divergence problem while accessing the index mapping lists, including the row list and some other auxiliary structures. If we remap the vertex indices in the bitmap, the locality of the bitmap optimizing technique will be broken. So, bitmap adaptive CSR is proposed to access the index mapping lists efficiently without breaking the benefit of the bitmap.

Figure 3: The data structure of bitmap adaptive CSR

[Equations (1) and (2): the translation from a vertex index to its location in the index mapping lists; the formulas are not recoverable from this copy.]
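Since Equations (1) and (2) are lost here, the following device-code fragment only contrasts the two access patterns the text describes; it is not the paper's actual mapping, and row_remap, warp_base and lane are illustrative names:

    // With plain CSR, lane i of a warp dereferences the row list at scattered
    // vertex ids: up to 32 separate memory transactions.
    __device__ int location_plain(const int *row_list, int vertex_id) {
        return row_list[vertex_id];
    }

    // With a bitmap-adaptive layout, the 32 entries a warp needs are assumed
    // to be stored side by side, so one coalesced transaction serves the warp.
    __device__ int location_adaptive(const int *row_remap, int warp_base, int lane) {
        return row_remap[warp_base + lane];
    }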


Warp-aligned Adjacency List. The warp-aligned adjacency list is used to improve global memory access efficiency in the edge traversal procedure. Under SIMT execution, all threads in the same warp execute the edge traversal procedure of different vertices at the same time, leading to a severe memory divergence problem under the original adjacency list. Coalesced memory access to the adjacency list requires that adjacent threads in the same warp address neighboring cells in global memory. Therefore, the warp-aligned adjacency list mixes the neighbors of the vertices processed in the same warp to follow the rule of memory coalescing.

Adjacent w_size vertices are processed by the same warp in the SIMT execution model, so all vertices are formed into groups in units of w_size. The warp-aligned adjacency list of each group is shown in Figure 4. The warp-aligned adjacency list is derived from the original adjacency list. As shown in the figure, the neighbors of vertex 0 are rearranged with the neighbors of the other vertices which will be processed in the same warp. The minus sign (-) means padding space due to the varied degrees of vertices. If we mixed all neighbors of each vertex, the padding cost would occupy a huge memory footprint: there is a trade-off between memory efficiency and memory footprint. Under the degree-aware optimization, the traversal procedure of the bottom-up approach stops within a small number of neighboring vertices. As a result, the warp-aligned adjacency list is constructed by choosing a certain number of edges. Considering memory efficiency and padding cost, the warp-aligned adjacency list only includes the first 30% of the edges of the original adjacency list. According to our experiments, the new structure brings no more than 7% memory overhead. By using this technique, the memory accesses are coalesced, thus highly improving the memory efficiency.

Figure 4: The data structure of warp-aligned adjacency list

GPU-aware Bottom-up Kernel. The bottom-up kernel with the GPU-oriented CSR structure is shown in Algorithm 3, which is different from typical bitmap-optimizing bottom-up approaches [17, 21, 23]. Unvisited vertices are identified by scanning the visit bitmap (BV). The current frontier and next frontier are replaced by the current frontier bitmap (BC) and the next frontier bitmap (BN). In the bottom-up approach, the traversal procedure of each unvisited vertex stops when one neighbor is found in the current frontier. In our algorithm, the traversal process is divided into two parts. Firstly, the warp-aligned adjacency list is used (lines 7-15). If one visited neighbor is found in the warp-aligned adjacency list, the traversal process stops immediately. Otherwise, the remaining edges in the adjacency list of this vertex still need to be traversed (lines 16-21). Due to the bitmap adaptive CSR technique, the location in the index mapping lists is given by the id_to_loc() expression.

Algorithm 3: Bottom-up with GPU-oriented CSR
Input: undirected graph G=(V,E), level array LA, adjacency list A, warp-aligned adjacency list WA, vertex visit bitmap BV, current frontier bitmap BC, next frontier bitmap BN, source vertex s.
Output: level array LA, parent map PM.
1:  for bit_unit in BV in parallel do
2:      for each bit in bit_unit do
3:          if bit == 1 then
4:              continue
5:          u ← get_vertex_id(bit)
6:          loc ← id_to_loc(u)
7:          start, edge_count ← align_edge_info(loc)
8:          i ← 0, find ← false
9:          while i < edge_count do
10:             w ← WA[start + i*w_size]
11:             if is_visited(w, BC) then
12:                 record u in LA & PM
13:                 find ← true
14:                 break
15:             i++
16:         if find == false then
17:             for w ∈ A[u] do
18:                 if is_visited(w, BC) then
19:                     record u in LA & PM
20:                     find ← true
21:                     break
22:         if find == true then
23:             change status of u in BV, BN

As can be seen in the bottom-up algorithm, accessing the index mapping lists and the adjacency list is the main overhead of memory access. In our bottom-up kernel, threads in a warp access consecutive addresses in the index mapping lists. The adjacency list is divided into two parts, the warp-aligned adjacency list and the remaining adjacency list. The memory access to the warp-aligned adjacency list is coalesced, but the access to the remainder is not.
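A host-side sketch of building the warp-aligned list of Figure 4, under stated assumptions: vertices are grouped in units of w_size, each vertex keeps only its first keep_ratio edges (the paper keeps about 30%, with neighbors already sorted by the degree-aware optimization), and -1 marks padding. Per-group start offsets and edge counts (what align_edge_info() returns in Algorithm 3) are omitted for brevity; all names are ours.

    #include <vector>
    #include <algorithm>

    std::vector<int> build_warp_aligned(const std::vector<int> &row,
                                        const std::vector<int> &adj,
                                        int num_vertices, int w_size,
                                        double keep_ratio) {
        std::vector<int> wa;
        for (int g = 0; g < num_vertices; g += w_size) {
            int group_end = std::min(g + w_size, num_vertices);
            int max_kept = 0;                 // longest truncated list in this group
            for (int v = g; v < group_end; ++v) {
                int deg = row[v + 1] - row[v];
                max_kept = std::max(max_kept, (int)(deg * keep_ratio + 0.5));
            }
            std::size_t start = wa.size();
            wa.resize(start + (std::size_t)max_kept * w_size, -1);  // -1 = padding
            for (int v = g; v < group_end; ++v) {
                int deg  = row[v + 1] - row[v];
                int kept = (int)(deg * keep_ratio + 0.5);
                for (int i = 0; i < kept; ++i)    // neighbor i of lane (v - g)
                    wa[start + (std::size_t)i * w_size + (v - g)] = adj[row[v] + i];
            }
        }
        return wa;
    }

With this layout, lane l of a warp reading WA[start + i*w_size + l] touches addresses adjacent to its neighboring lanes, which is precisely the coalescing rule Algorithm 3 exploits.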

According to our experiments, more than 95% of expanded vertices stop the traversal procedure within the warp-aligned adjacency list. Therefore, our bottom-up kernel is GPU-friendly and highly improves memory efficiency in the bottom-up phase.

C. Vertex Quick-search

By combining with the bottom-up algorithm, the redundant edge traversal can be highly reduced. In order to avoid atomic operations while using the current frontier, the bottom-up algorithm scans the status of each vertex in BV to identify unvisited vertices in each iteration. Due to the power-law distribution, the number of unvisited vertices decreases sharply and distributes sparsely after several iterations. If we continue to scan each vertex in BV, there will exist many redundant status checks, as shown in Table I. When processing large-scale graphs, sequential scanning of BV has a great impact on performance.

In order to reduce the costly status check, we propose a vertex quick-search method to effectively locate unvisited vertices in each iteration. Using the bitmap optimizing technique, each time one thread loads one bitmap unit from global memory to a register and processes the unvisited vertices in it. Normally, the width of one bitmap unit is 32 or 64. The procedure of this technique is shown in Algorithm 4. At the beginning, the bitmap unit i of BV is located in global memory and loaded into a register. The unvisited bitmap unit is generated from the loaded bitmap unit by a bitwise NOT operation. In unvis_bitunit, "1" represents unvisited status and "0" represents visited status. If all bits in unvis_bitunit are "0", the scanning process on this unit is skipped. Otherwise, the algorithm enters the vertex quick-search procedure. first_unvis contains only the least significant 1 bit of unvis_bitunit and is used to calculate the position of this bit. Then, the unvisited vertex represented by this bit can be identified. After one vertex search iteration, the least significant 1 bit in unvis_bitunit is turned into "0" before the next step.

Algorithm 4: Vertex quick-search on one bitmap unit
Input: one unvisited bitmap unit unvis_bitunit.
Output: the lowest unvisited vertex position pos, updated unvis_bitunit.
1: first_unvis = unvis_bitunit & (~unvis_bitunit + 1)
2: mask = first_unvis - 1
3: pos = __popc(mask)
4: unvis_bitunit = unvis_bitunit & ~first_unvis
5: return pos

All the unvisited vertices in this bitmap unit are located in this way. The execution flow on one bitmap unit is shown in Figure 5. The iteration of vertex quick-search proceeds from the least significant 1 bit in unvis_bitunit to the most significant 1 bit. The number of iterations is only related to the number of "1" bits in the bitmap unit instead of the bitmap unit width. The vertex quick-search technique changes the originally bitwise status scanning process into coarse-grained positioning; the time complexity of the scanning process is improved from the original O(k) to O(log(k)). In the last several iterations of the bottom-up phase, the costly status check is highly reduced. In addition, the processed bitmap unit is preloaded into a register instead of being accessed directly, which avoids the impact of the poor locality of BFS and greatly reduces the cache miss rate.

Figure 5: The iterations of vertex quick-search

Table II shows the effectiveness of the vertex quick-search technique for the graph with scale 26. The first column lists the iterations where the bottom-up policy is taken. The second and third columns show the total number of visited vertices and the number of vertices skipped by using vertex quick-search. The last column is the ratio of the skipped number to the total visited vertices. The table shows that the skip ratio increases with the visited vertices marked in the bitmap. Especially in the last bottom-up iteration, the status check of nearly 98% of vertices is skipped.

Table II: The number of skipped vertices using the vertex quick-search technique

Iteration | Visited number | Skipped number | Skip ratio
3         | 27350          | 32             | 0.12%
4         | 16444674       | 5588960        | 33.99%
5         | 32642738       | 28894304       | 88.52%
6         | 32782985       | 32133184       | 98.02%
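Algorithm 4 translates almost line for line into a CUDA device function; the wrapper name and the commented usage are ours:

    // Pops the lowest set ("unvisited") bit of a 32-bit unit and returns its
    // position; the loop count depends on the number of 1 bits, not the width.
    __device__ inline int pop_lowest_unvisited(unsigned &unvis_bitunit) {
        unsigned first_unvis = unvis_bitunit & (~unvis_bitunit + 1u); // lowest 1 bit
        int pos = __popc(first_unvis - 1u);  // count of bits below it = its index
        unvis_bitunit &= ~first_unvis;       // clear it for the next iteration
        return pos;
    }

    // Typical use on one bitmap unit preloaded into a register:
    //   unsigned unvis = ~BV[i];            // bitwise NOT: 1 = unvisited
    //   while (unvis != 0u) {
    //       int v = (i << 5) + pop_lowest_unvisited(unvis);
    //       ... process unvisited vertex v ...
    //   }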

D. Big Data Extension

Although the GPU offers high memory bandwidth, host CPU memory capacity is far beyond that of the GPU device. Compared with host memory, GPU memory is limited and precious. With the support of unified memory technology (UM) in CUDA, the GPU can process relatively large-scale graphs beyond GPU global memory by using host memory. When data is allocated in UM, the hardware and software of the CUDA system help to migrate memory pages while kernels are running.

However, if the whole graph is allocated in this way, BFS performance will be limited by the PCIe bandwidth. Similar to the idea of prefetching, we put some highly effective data in GPU memory, including the bitmaps, the warp-aligned adjacency list, etc. The bitmap representation of the graph highly reduces the memory cost and shows good locality based on vertex sorting. The warp-aligned adjacency list includes a small part of the edges of the original adjacency list, and most expanded vertices in the bottom-up phase stop the traversal procedure within it.
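A hedged sketch of this allocation split: the bulky full adjacency list goes into managed memory and may spill to the host, while the compact hot structures are plain device allocations that stay resident. Function and variable names are illustrative; the cudaMemAdvise hint is optional.

    #include <cstddef>
    #include <cuda_runtime.h>

    void allocate_graph_storage(std::size_t num_edges, std::size_t bitmap_bytes,
                                std::size_t wa_bytes) {
        int dev = 0;
        cudaSetDevice(dev);

        int *adj_um = nullptr;          // full adjacency list: may exceed GPU memory
        cudaMallocManaged((void **)&adj_um, num_edges * sizeof(int));

        unsigned *bitmaps = nullptr;    // visit/frontier bitmaps: small and hot
        cudaMalloc((void **)&bitmaps, bitmap_bytes);

        int *wa = nullptr;              // warp-aligned list: roughly 30% of edges
        cudaMalloc((void **)&wa, wa_bytes);

        // Prefer keeping the managed pages near the GPU until memory pressure
        // forces eviction to host memory over PCIe.
        cudaMemAdvise(adj_um, num_edges * sizeof(int),
                      cudaMemAdviseSetPreferredLocation, dev);
    }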

With these techniques, our implementation extends to process very large-scale graphs and achieves good performance with one GPU device. The size of the graph with scale 30 is nearly 192 GB, which is 12 times larger than the P100 global memory. We have launched our tests on Tesla P100 and V100, respectively. The P100 achieves 159.76 GTEPS for the graph with scale 29 and the V100 achieves 158.22 GTEPS for the graph with scale 30. The big data extension gives a way for large-scale graph processing on heterogeneous systems.

V. EXPERIMENTS

This work is implemented in C++ and CUDA. The code is compiled with NVIDIA nvcc 9.0 and GCC 4.8.5 with the -O3 optimization flag. All experiments and work comparisons are evaluated on the same platform, with a Xeon Gold 5118 CPU and an NVIDIA Tesla P100 GPU. The main dataset used in our evaluation are Kronecker graphs, which are generated using the Graph500 generator. The size of the graph is specified by the scale parameter; edgefactor is the average degree of the graph. The generator creates Kronecker graphs with an average edgefactor of 16 and coefficients A = 0.57, B = 0.19, C = 0.19. The generated graphs have 2^scale vertices and 2^scale × edgefactor edges, and follow a power-law degree distribution.

Performance of BFS is measured in Giga Traversed Edges Per Second (GTEPS), by taking the ratio of the number of edges in the graph over the traversal time. The traversal time of the algorithm starts when one source vertex is given and ends when the BFS search is completed, taking into account the time for writing the results to device memory or host memory. For each experiment, we launch 64 BFS searches with randomly selected source vertices and take the average as the performance metric.
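The metric as defined above can be computed as follows; the timing harness is a sketch, and pick_random_source() and bfs() are hypothetical entry points assumed to be provided by the surrounding implementation:

    #include <chrono>
    #include <cstddef>

    extern int pick_random_source();   // hypothetical helper
    extern void bfs(int source);       // must include writing results back

    double measure_gteps(std::size_t num_edges, int runs /* = 64 */) {
        double total_teps = 0.0;
        for (int r = 0; r < runs; ++r) {
            int src = pick_random_source();
            auto t0 = std::chrono::steady_clock::now();
            bfs(src);
            auto t1 = std::chrono::steady_clock::now();
            total_teps += (double)num_edges /
                          std::chrono::duration<double>(t1 - t0).count();
        }
        return total_teps / runs / 1e9;   // average over runs, in GTEPS
    }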

A. Performance Variation with Optimizations

In our experiment, we choose direction-optimizing BFS as the baseline (BL). Traditional optimizations (TDO) use the existing techniques mentioned in Section II, including the degree-aware, vertex sorting and bitmap lookup approaches. Fine-grained parallelism (FGP), the GPU-oriented CSR structure (GCS) and vertex quick-search (VQS) are the GPU-specific optimizations presented in this work. Figure 6 shows a comparison of the performance with the various optimizations for graphs with different scales, ranging from 2^21 to 2^26.

The TDO outperforms the BL by 1 to 1.6 times across different scales. We can see that these optimizations improve the performance of the BFS algorithm on GPU and achieve 13.93 GTEPS for the graph with scale 26. However, this performance is lower than the state-of-the-art work on a CPU-based platform [24]. To fully utilize the massive processing units of the GPU, the workload balance among GPU threads is very important.

FGP solves the problem of load imbalance in the top-down stage. The speedup ratio of FGP with respect to TDO ranges from 4.5 to 7.9, and it achieves 61.9 GTEPS for the graph with scale 26. After employing FGP, the workload balance in the top-down stage improves a lot.

GCS and VQS focus on the performance improvement in the bottom-up stage. GCS uses the memory coalescing technique to improve the efficiency of memory access. VQS further reduces the amount of redundant graph computations. After employing these two techniques, the performance of our algorithm generally increases with scale for Kronecker graphs. The highest performance achieves 237.94 GTEPS with graph scale 26. Among all single-node systems in the Graph500 rankings, our implementation on the P100 is leading the way in the small data category.

Thus far, we have discussed the BFS results with respect to performance. In terms of energy efficiency, the power consumption of the P100 and of the whole single-node system on a Kronecker graph with scale 26 is 52.56 watts and 130.04 watts, respectively. Specifically, our BFS shows an energy-efficient performance of 1830.31 MTEPS/W, and ranks 1st place on the November 2019 Green Graph500 list [25].

Figure 6: Comparison of the performance with various optimizations

B. Performance Variation with Scale and Edgefactor

We evaluate the scalability of our algorithm with the scale and edgefactor of the Kronecker graphs. Figure 7 shows the performance of graphs with edgefactor ranging from 16 to 64. Our algorithm achieves a peak performance at graph scale 26 for every edgefactor and is more efficient for graphs with a larger edgefactor. When the edgefactor is set to 64, the peak value is nearly 833.19 GTEPS, which is more than three times better than the graph with edgefactor 16.

Figure 7: Performance of graph with edgefactor ranging from 16 to 64

This speedup is afforded by the bottom-up approach, which reduces the edge traversal required for a dense graph (large edgefactor). With reference to the Graph500 benchmark, we choose 16 as the default edgefactor in the experiments below.

To show the scalability with graph scale, the performance for graphs with scale ranging from 21 to 30 is presented in Figure 8. When the scale is below 26, the performance increases with the scale, reaching the peak performance of 237.94 GTEPS at scale 26. With the big data extension technique, our algorithm can process graphs with scale larger than 26, which exceed the capacity of the P100 global memory. Due to insufficient global memory and low PCIe bandwidth, the performance drops as the scale increases; for the graph with scale 30, the performance is only 30.82 GTEPS. If the GPU has a larger memory space or NVLink technology is supported, the performance degradation will be mitigated, improving the efficiency of processing large graphs. In order to prove this point, we launched the algorithm on a Tesla V100, which is equipped with 32 GB global memory. There, the algorithm achieves 146.79 GTEPS at scale 30.

Figure 8: Performance of graph with different scales

C. Comparison with Other Implementations

In order to show the effectiveness of our optimizations, we compare our algorithm with several BFS implementations on the GPU platform, including Tigr [16], Enterprise [13] and Gunrock [17]. Our work is primarily designed for scale-free graphs with small diameter, not for graphs with high diameter like road networks. Besides the Kronecker graph, we also evaluate some other small graphs including Hollywood [26], Stackoverflow [27] and Flickr-large [28], as well as some large graphs like Twitter [29] and Friendster [30], which have more than 1.4 billion edges. The scale and average degree vary among these graph datasets.

The results are summarized in Table III. For Kron-24-16 (the Kronecker graph with scale 24 and edgefactor 16), our algorithm achieves 194.27 GTEPS, which is much higher than the other works. For the other scale-free graphs, Enterprise faces out-of-memory (OOM) problems when launched with Stackoverflow, Twitter and Friendster. For small graphs, our work performs 5.9, 2.1 and 12.93 times better than Tigr, Enterprise and Gunrock on average. For large graphs, our work performs 18.7 and 15.61 times better than Tigr and Gunrock, respectively. Apart from the Kronecker graph, our optimizations show their effectiveness on social and web networks, which follow a power-law distribution, and achieve better acceleration on large graphs.

Table III: Performance comparison with other implementations (GTEPS)

Dataset       | Tigr  | Enterprise | Gunrock | Our work
Kron-24-16    | 2.45  | 19.9       | 2.08    | 194.27
Hollywood     | 19.41 | 29.24      | 7.67    | 42.28
Flickr-large  | 4.18  | 17.83      | 2.06    | 49.35
Stackoverflow | 7.43  | OOM        | 2.96    | 27.7
Twitter       | 3.39  | OOM        | 3.91    | 74.49
Friendster    | 2.17  | OOM        | 3.29    | 33.41

D. Analysis of GPU Profiling

BFS is a data-intensive algorithm and its performance is significantly influenced by memory access efficiency. In this part, we systematically analyze our algorithm with the GPU profiling tools supplied by NVIDIA. Since the performance of BFS is mainly dominated by the bottom-up stage, we focus the analysis on the bottom-up stage. We profile our techniques on graphs with scale ranging from 20 to 25 in Figure 9, and choose TDO as the baseline.

As shown in Figures 9(a) and 9(b), our work improves global memory load efficiency by 72% on average and store efficiency by 40% on average, compared with TDO. This means our algorithm effectively improves memory access on the GPU. The global hit rate is improved by 127% on average by combining our optimizations, as shown in Figure 9(c). The downtrend of the hit rate with the increase of graph scale is mainly due to the limited L1 cache and shared memory. Under the condition of relatively good workload balance, the effective memory access leads to performance improvement.

Figure 9: The profiling results in the bottom-up stage

The IPC of our algorithm is improved by 50% on average.
