Large Graph on Multi-Gpus
Total Page:16
File Type:pdf, Size:1020Kb
Large Graph on multi-GPUs Enrico Mastrostefano1 Massimo Bernaschi2 Massimiliano Fatica3 1Sapienza Universita` di Roma 2Istituto per le Applicazioni del Calcolo, IAC-CNR, Rome, Italy 3NVIDIA Corporation May 11, 2012 Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 1 / 32 Outline 1 Large Graphs and Supercomputing 2 Graph500 on multi-GPUs 3 Breadth First Search on multi-GPUs 4 Sort-Unique Breadth First Search 5 Results Large Graphs Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms Large Graphs Most of graph algorithms have low arithmetic intensity and irregular memory access patterns Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms Large Graphs Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms Large Graphs Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 The core of the benchmark is a set of graph algorithms Large Graphs Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Large Graphs Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Kernel 1 Build the data structure that represent the graph: timed! Data structure cannot be modified by subsequent kernels Kernel 2 Perform a Breadth First Search (BFS) visit starting from a random vertex: timed! Output the parent array. graph500: specifications Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268; 435; 456 vertices and 8; 589; 934; 592 edges) Kernel 1 Kernel 2 Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 Kernel 2 Perform a Breadth First Search (BFS) visit starting from a random vertex: timed! Output the parent array. graph500: specifications Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268; 435; 456 vertices and 8; 589; 934; 592 edges) Kernel 1 Build the data structure that represent the graph: timed! Kernel 1 Data structure cannot be modified by subsequent kernels Kernel 2 Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 graph500: specifications Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268; 435; 456 vertices and 8; 589; 934; 592 edges) Kernel 1 Build the data structure that represent the graph: timed! Kernel 1 Data structure cannot be modified by subsequent kernels Kernel 2 Perform a Breadth First Search (BFS) visit starting from a Kernel 2 random vertex: timed! Output the parent array. Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 graph500: specifications Outputs Execution time of K1 Execution time of K2, TEPS: Traversed Edges Per Second in the BFS (actually the ranking is based only on TEPS) Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 5 / 32 Outline 1 Large Graphs and Supercomputing 2 Graph500 on multi-GPUs 3 Breadth First Search on multi-GPUs 4 Sort-Unique Breadth First Search 5 Results graph500 on multi-GPUs CUDA + MPI Generator GPU1 GPU2 GPUn Kernel 1 MPI Kernel 2 Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 7 / 32 Generate the edge list Edge list We have to generate: jV j = 2SCALE ; jMj = 16 ∗ 2SCALE Each task generates a subset of the edge list in the form: (U0; V0); (U1; V1); ::: Edges are assigned to processes via a simple rule: edge (Ui ; Vj ) 2 Pk if Ui mod #P == k Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 8 / 32 The core of the algorithm is a sort Build the data structure Data structure We transform the edge list in a Compressed Sparse Row (CSR) data structure CSR is simple and has minimal memory requirements Compressed Sparse Row (CSR) local u1 u2 u3 u4 un vertices offset 0 3 3 7 7 7 7 18 l M array adj. list Vj Vk Vl Vt 0 . 3 . 7 . l . M Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 9 / 32 Build the data structure Data structure We transform the edge list in a Compressed Sparse Row (CSR) data structure CSR is simple and has minimal memory requirements Compressed Sparse Row (CSR) local u1 u2 u3 u4 un vertices offset 0 3 3 7 7 7 7 18 l M array adj. list Vj Vk Vl Vt 0 . 3 . 7 . l . M The core of the algorithm is a sort Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 9 / 32 K1 results Weak scaling plot, Kernel 1 time (sec) Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 10 / 32 Outline 1 Large Graphs and Supercomputing 2 Graph500 on multi-GPUs 3 Breadth First Search on multi-GPUs 4 Sort-Unique Breadth First Search 5 Results 1 2 3 4 k-1 k QBFS U3 U7 U21 U16 Ui Uj 2 Adj V V list 7 3 I update the predecessor array local non-local I enqueue Vj V7 V3 V2 V6 MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue Output parent array and TEPS Straightforward BFS 1 2 3 4 k-1 k Queue based BFS QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread ti Each thread ti visits all the neighboring Vj of its vertex If Vj is local: visit it if Vj is not local: send to its owner receive vertices Vk from other processes Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k QBFS U3 U7 U21 U16 Ui Uj 2 Adj V V list 7 3 I update the predecessor array local non-local I enqueue Vj V7 V3 V2 V6 MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue Output parent array and TEPS Straightforward BFS 1 2 3 4 k-1 k Queue based BFS QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread ti Each thread ti visits all the neighboring Vj of its vertex If Vj is local: visit it if Vj is not local: send to its owner receive vertices Vk from other processes Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k QBFS U3 U7 U21 U16 Ui Uj 2 Adj V V list 7 3 I update the predecessor array local non-local I enqueue Vj V7 V3 V2 V6 MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue Output parent array and TEPS Straightforward BFS 1 2 3 4 k-1 k Queue based BFS QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread ti Each thread ti visits all the neighboring Vj of its vertex If Vj is local: visit it if Vj is not local: send to its owner receive vertices Vk from other processes Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k QBFS U3 U7 U21 U16 Ui Uj 2 Adj V V list 7 3 local non-local V7 V3 V2 V6 MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue Output parent array and TEPS Straightforward BFS 1 2 3 4 k-1 k Queue based BFS QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread ti Each thread ti visits all the neighboring Vj of its vertex If Vj is local: visit it I update the predecessor array I enqueue Vj if Vj is not local: send to its owner receive vertices Vk from other processes Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k QBFS U3 U7 U21 U16 Ui Uj 2 Adj V V list 7 3 local non-local V7 V3 V2 V6 MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue Output parent array and TEPS Straightforward BFS 1 2 3 4 k-1 k Queue based BFS QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread ti Each thread ti visits all the neighboring Vj of its vertex If Vj is local: visit it I update the predecessor array I enqueue Vj if Vj is not local: send to its owner receive vertices Vk from other processes Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k QBFS U3 U7 U21 U16 Ui Uj 2 Adj V V list 7 3 local non-local V7 V3 V2 V6 MPI Send/Recv Recv Enqueue Output parent array and TEPS Straightforward BFS 1 2 3 4 k-1 k Queue based BFS QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread ti Each thread ti visits all the neighboring Vj