2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). ©2020 IEEE. DOI 10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00094

experiments. The detailed optimizations are presented in order to deal with workload imbalance, memory access divergence and redundant calculations on GPU. Specifically, we make the following contributions:

1. Due to graph topology and SIMT execution, severe workload imbalance arises on scale-free graphs. We develop a fine-grained parallelism method to improve the workload balance.
2. The original CSR data structure is not GPU-friendly, which causes memory divergence. We develop a GPU-oriented CSR layout to improve the efficiency of memory access.
3. By leveraging a bitmap structure, we further propose a vertex quick-search method to find all unvisited vertices. It greatly reduces the amount of redundant computation in the status check procedure.
4. We conduct extensive experiments on a P100 platform to verify the effectiveness of the proposed techniques. Our implementation achieves 237.94 GTEPS for the Kronecker graph with 2^26 vertices and 2^30 edges. It ranks 1st on the November 2019 Green Graph500 list.

BACKGROUND

BFS is a widely used graph algorithm and an important building block of many graph analysis algorithms. To improve BFS performance, there has been a lot of work on parallel implementations of the BFS algorithm. In this section, we present some preliminary concepts concerning GPUs and some state-of-the-art optimizations for BFS.

GPU Concepts

Normally, one GPU contains dozens of Streaming Multiprocessors (SMs). For example, the P100 consists of 56 SMs. Each SM contains 64 single-precision CUDA cores and 32 double-precision cores. With numerous processing units, a GPU can offer outstanding parallel computing power.

The execution model of a GPU is quite different from that of a CPU. A GPU schedules threads in the form of warps (32 adjacent threads) and executes them in Single-Instruction Multiple-Threads (SIMT) fashion. The SIMT execution model is very efficient for regular computations [20].

The memory hierarchy of a GPU is also different from that of a CPU. The P100 offers 16 GB of global memory and a 4096 KB L2 cache. Each SM contains a 256 KB register file and 64 KB of dedicated shared memory. The shared memory is a software-configurable cache in the SM. All the threads in the same Cooperative Thread Array (CTA) execute in the same SM and can communicate through shared memory.

CSR Format

In order to reduce the memory footprint of graph data, [...] first neighbor in the adjacency list for each vertex. The difference between adjacent values in the row list is the degree of each vertex.

Figure 1: Illustration of CSR format

Top-down BFS

Algorithm 1: Top-down BFS algorithm
Input: undirected graph G = (V, E), level array LA, current frontier CF, next frontier NF, adjacency list A, source vertex s.
Output: level array LA, parent map PM.
    LA[v] ← inf, for all v ∈ V
    lvl ← 0
    LA[s] ← lvl
    PM[s] ← s
    CF ← {s}
    NF ← ∅
    while CF is not empty do
        lvl++
        for u ∈ CF in parallel do
            for w ∈ A[u] do
                if LA[w] == inf then
                    PM[w] ← u
                    LA[w] ← lvl
                    NF ← NF ∪ {w}
        swap CF with NF
        NF ← ∅

Traditional BFS is presented in a top-down manner. Given a graph G = (V, E) with vertex set V and edge set E, BFS traverses all vertices reachable from a source vertex. The result of the algorithm is the BFS search tree rooted at the source vertex.
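To make the CSR row list and the level-synchronous traversal of Algorithm 1 concrete, the following is a minimal serial sketch in Python. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names `build_csr` and `top_down_bfs` and the array names `row` and `col` are hypothetical, and the "in parallel" loop over the current frontier is replaced by an ordinary sequential loop.

```python
import math


def build_csr(num_vertices, edges):
    """Build CSR arrays for an undirected graph given as (u, v) pairs.

    row[v] is the index in col of the first neighbor of v, so
    row[v + 1] - row[v] is the degree of v. (Names are illustrative.)
    """
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Prefix sum of degrees gives the start offset of each adjacency list.
    row = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row[v + 1] = row[v] + degree[v]
    col = [0] * row[num_vertices]
    cursor = row[:num_vertices]  # next free slot per vertex (a copy of row)
    for u, v in edges:
        col[cursor[u]] = v
        cursor[u] += 1
        col[cursor[v]] = u
        cursor[v] += 1
    return row, col


def top_down_bfs(row, col, s):
    """Serial stand-in for Algorithm 1; returns (LA, PM)."""
    n = len(row) - 1
    la = [math.inf] * n   # level array LA
    pm = [None] * n       # parent map PM
    lvl = 0
    la[s] = lvl
    pm[s] = s
    cf = [s]              # current frontier CF
    while cf:
        lvl += 1
        nf = []           # next frontier NF
        for u in cf:      # sequential here; parallel in Algorithm 1
            for i in range(row[u], row[u + 1]):
                w = col[i]
                if la[w] == math.inf:   # w has not been visited yet
                    pm[w] = u
                    la[w] = lvl
                    nf.append(w)
        cf = nf           # swap CF with NF, then start from an empty NF
    return la, pm
```

For instance, with six vertices and the edges (0,1), (0,2), (1,3), (2,3), (3,4), `build_csr` yields `row = [0, 2, 4, 6, 9, 10, 10]`, so the degree of vertex 3 is `row[4] - row[3] = 3`, and a traversal from vertex 0 leaves the isolated vertex 5 at level `inf`. In a real SIMT version, several threads may discover the same vertex `w` in the same level, so the status check and frontier insertion would need extra care that this serial sketch does not show.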