Large Graph on multi-GPUs
Enrico Mastrostefano1 Massimo Bernaschi2 Massimiliano Fatica3
1Sapienza Universita` di Roma 2Istituto per le Applicazioni del Calcolo, IAC-CNR, Rome, Italy 3NVIDIA Corporation
May 11, 2012
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 1 / 32 Outline
1 Large Graphs and Supercomputing
2 Graph500 on multi-GPUs
3 Breadth First Search on multi-GPUs
4 Sort-Unique Breadth First Search
5 Results Large Graphs
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms
Large Graphs
Most of graph algorithms have low arithmetic intensity and irregular memory access patterns
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms
Large Graphs
Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms?
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms
Large Graphs
Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 The core of the benchmark is a set of graph algorithms
Large Graphs
Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org)
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Large Graphs
Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Kernel 1 Build the data structure that represent the graph: timed! Data structure cannot be modified by subsequent kernels
Kernel 2 Perform a Breadth First Search (BFS) visit starting from a random vertex: timed! Output the parent array.
graph500: specifications
Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268, 435, 456 vertices and 8, 589, 934, 592 edges)
Kernel 1
Kernel 2
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 Kernel 2 Perform a Breadth First Search (BFS) visit starting from a random vertex: timed! Output the parent array.
graph500: specifications
Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268, 435, 456 vertices and 8, 589, 934, 592 edges)
Kernel 1 Build the data structure that represent the graph: timed! Kernel 1 Data structure cannot be modified by subsequent kernels
Kernel 2
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 graph500: specifications
Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268, 435, 456 vertices and 8, 589, 934, 592 edges)
Kernel 1 Build the data structure that represent the graph: timed! Kernel 1 Data structure cannot be modified by subsequent kernels
Kernel 2 Perform a Breadth First Search (BFS) visit starting from a Kernel 2 random vertex: timed! Output the parent array.
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 graph500: specifications
Outputs Execution time of K1 Execution time of K2, TEPS: Traversed Edges Per Second in the BFS (actually the ranking is based only on TEPS)
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 5 / 32 Outline
1 Large Graphs and Supercomputing
2 Graph500 on multi-GPUs
3 Breadth First Search on multi-GPUs
4 Sort-Unique Breadth First Search
5 Results graph500 on multi-GPUs
CUDA + MPI
Generator GPU1 GPU2 GPUn
Kernel 1
MPI Kernel 2
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 7 / 32 Generate the edge list
Edge list We have to generate: |V | = 2SCALE ; |M| = 16 ∗ 2SCALE Each task generates a subset of the edge list in the form: (U0, V0), (U1, V1), ... Edges are assigned to processes via a simple rule: edge (Ui , Vj ) ∈ Pk if Ui mod #P == k
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 8 / 32 The core of the algorithm is a sort
Build the data structure
Data structure We transform the edge list in a Compressed Sparse Row (CSR) data structure CSR is simple and has minimal memory requirements
Compressed Sparse Row (CSR)
local u1 u2 u3 u4 un vertices
offset 0 3 3 7 7 7 7 18 l M array
adj. list Vj Vk Vl Vt
0 . . . . . 3 ...... 7 ...... l . . . M
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 9 / 32 Build the data structure
Data structure We transform the edge list in a Compressed Sparse Row (CSR) data structure CSR is simple and has minimal memory requirements
Compressed Sparse Row (CSR)
local u1 u2 u3 u4 un vertices
offset 0 3 3 7 7 7 7 18 l M array
adj. list Vj Vk Vl Vt
0 . . . . . 3 ...... 7 ...... l . . . M
The core of the algorithm is a sort
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 9 / 32 K1 results Weak scaling plot, Kernel 1 time (sec)
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 10 / 32 Outline
1 Large Graphs and Supercomputing
2 Graph500 on multi-GPUs
3 Breadth First Search on multi-GPUs
4 Sort-Unique Breadth First Search
5 Results 1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
2
Adj V V list 7 3
I update the predecessor array local non-local I enqueue Vj
V7 V3 V2 V6
MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue
Output parent array and TEPS
Straightforward BFS
1 2 3 4 k-1 k Queue based BFS
QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread ti
Each thread ti visits all the neighboring Vj of its vertex
If Vj is local: visit it
if Vj is not local: send to its owner
receive vertices Vk from other processes
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
2
Adj V V list 7 3
I update the predecessor array local non-local I enqueue Vj
V7 V3 V2 V6
MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue
Output parent array and TEPS
Straightforward BFS
1 2 3 4 k-1 k Queue based BFS
QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2
Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it
if Vj is not local: send to its owner
receive vertices Vk from other processes
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
2
Adj V V list 7 3
I update the predecessor array local non-local I enqueue Vj
V7 V3 V2 V6
MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue
Output parent array and TEPS
Straightforward BFS
1 2 3 4 k-1 k Queue based BFS
QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2
Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it
local
V7 V3 if Vj is not local: send to its owner
receive vertices Vk from other processes
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
2
Adj V V list 7 3
local non-local
V7 V3 V2 V6
MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue
Output parent array and TEPS
Straightforward BFS
1 2 3 4 k-1 k Queue based BFS
QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2
Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it
I update the predecessor array local I enqueue Vj
V7 V3 if Vj is not local: send to its owner
receive vertices Vk from other processes
Enqueue
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
2
Adj V V list 7 3
local non-local
V7 V3 V2 V6
MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue
Output parent array and TEPS
Straightforward BFS
1 2 3 4 k-1 k Queue based BFS
QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2
Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it
I update the predecessor array local non-local I enqueue Vj
V7 V3 V2 V6 if Vj is not local: send to its owner
receive vertices Vk from other processes
Enqueue
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 Output parent array and TEPS
Straightforward BFS
1 2 3 4 k-1 k Queue based BFS
QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2
Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it
I update the predecessor array local non-local I enqueue Vj
V7 V3 V2 V6 if Vj is not local: send to its owner receive vertices V from other processes k MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 Straightforward BFS
1 2 3 4 k-1 k Queue based BFS
QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2
Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it
I update the predecessor array local non-local I enqueue Vj
V7 V3 V2 V6 if Vj is not local: send to its owner receive vertices V from other processes k MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue
Output parent array and TEPS
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 Straightforward BFS: Results
0.14 0.7- Straight BFS 0.12 0.6 Straight BFS 0.10 0.5
0.08 0.4 (Billion) (Billion)
0.06 0.3
0.2 0.04
0.1 0.02
0.0
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 13 / 32 Algorithms rely on the use of Atomic Add
1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
Memory access patterns can 2 1 be irregular and threads that belong to the same warp may access non-contiguos regions
of memory 0 . . . .L2 0 ...... L1
Straightforward BFS: Issues
Gpu-related issues 1 2 3 4 k-1 k Threads workloads are unbalanced when threads visit QBFS U3 U7 U21 U16 Ui Uj different adjacency lists
2 1
0 . . . .L2 0 ...... L1
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 14 / 32 Algorithms rely on the use of Atomic Add
Straightforward BFS: Issues
Gpu-related issues 1 2 3 4 k-1 k Threads workloads are unbalanced when threads visit QBFS U3 U7 U21 U16 Ui Uj different adjacency lists Memory access patterns can 2 1 be irregular and threads that belong to the same warp may access non-contiguos regions
of memory 0 . . . .L2 0 ...... L1
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 14 / 32 Straightforward BFS: Issues
Gpu-related issues 1 2 3 4 k-1 k Threads workloads are unbalanced when threads visit QBFS U3 U7 U21 U16 Ui Uj different adjacency lists Memory access patterns can 2 1 be irregular and threads that belong to the same warp may access non-contiguos regions
of memory 0 . . . .L2 0 ...... L1
Algorithms rely on the use of Atomic Add
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 14 / 32 Processor 1 Next Level V1 V0 V7 V0 V3 V2 V9 V7 V1 V0 Processor 2 Frontier Processor k
neighbors of U3 neighbors of U7
Straightforward BFS: Issues
MPI-related issues We don’t process non-local vertices so the array to send contains multiple copies of the same vertices. Multiple copies of the same vertex are sent to the owner.
Next Level V1 V0 V7 V0 V3 V2 V9 V7 V1 V0 Frontier
neighbors of U3 neighbors of U7
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 15 / 32 Straightforward BFS: Issues
MPI-related issues We don’t process non-local vertices so the array to send contains multiple copies of the same vertices. Multiple copies of the same vertex are sent to the owner.
Processor 1 Next Level V1 V0 V7 V0 V3 V2 V9 V7 V1 V0 Processor 2 Frontier Processor k
neighbors of U3 neighbors of U7
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 15 / 32 Outline
1 Large Graphs and Supercomputing
2 Graph500 on multi-GPUs
3 Breadth First Search on multi-GPUs
4 Sort-Unique Breadth First Search
5 Results We use k threads to visit neighbors We want to use as many threads as the number of neighbors
Neighbors of U are not-contiguous in the Adjacency list array We want a contiguous array of neighbors.
We send/recv multiple copies We want to prune the array that we send
Beyond the straightforward BFS
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 17 / 32 Neighbors of U are not-contiguous in the Adjacency list array We want a contiguous array of neighbors.
We send/recv multiple copies We want to prune the array that we send
Beyond the straightforward BFS
We use k threads to visit neighbors We want to use as many threads as the number of neighbors
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 17 / 32 We send/recv multiple copies We want to prune the array that we send
Beyond the straightforward BFS
We use k threads to visit neighbors We want to use as many threads as the number of neighbors
Neighbors of U are not-contiguous in the Adjacency list array We want a contiguous array of neighbors.
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 17 / 32 Beyond the straightforward BFS
We use k threads to visit neighbors We want to use as many threads as the number of neighbors
Neighbors of U are not-contiguous in the Adjacency list array We want a contiguous array of neighbors.
We send/recv multiple copies We want to prune the array that we send
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 17 / 32 Beyond the straightforward BFS: Sort-Unique BFS overview
What we will do is: 1 Compute the total number of neighbors, say m 2 Start m threads, read the Adjacency list and build a contiguous array of neighbors (We call this array Next Level Frontier) 3 With m threads prune the Next Level Frontier 4 Exchange vertices with other processes and update the parent array
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 18 / 32 The last element of New Offset is: m = P d i∈QBFS i
1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
Qdeg d3 d7 d21 d16 di dj
prefix-sum
New off3 off7 off21 off16 offi offmj Off
Sort-Uniq BFS Recipe #1: compute the total number of neighbors
1 2 3 4 k-1 k Start k threads, each element of QBFS is assigned to one thread
QBFS U3 U7 U21 U16 Ui Uj
Build Qdeg, substituting each vertex with its degree
Perform a prefix-sum operation on Qdeg to build the New Offset array
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 19 / 32 The last element of New Offset is: m = P d i∈QBFS i
1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
Qdeg d3 d7 d21 d16 di dj
prefix-sum
New off3 off7 off21 off16 offi offmj Off
Sort-Uniq BFS Recipe #1: compute the total number of neighbors
1 2 3 4 k-1 k Start k threads, each element of QBFS is assigned to one thread
QBFS U3 U7 U21 U16 Ui Uj
Build Qdeg, substituting each Qdeg d3 d7 d21 d16 di dj vertex with its degree
Perform a prefix-sum operation on Qdeg to build the New Offset array
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 19 / 32 The last element of New Offset is: m = P d i∈QBFS i
1 2 3 4 k-1 k
QBFS U3 U7 U21 U16 Ui Uj
Qdeg d3 d7 d21 d16 di dj
prefix-sum
New off3 off7 off21 off16 offi offmj Off
Sort-Uniq BFS Recipe #1: compute the total number of neighbors
1 2 3 4 k-1 k Start k threads, each element of QBFS is assigned to one thread
QBFS U3 U7 U21 U16 Ui Uj
Build Qdeg, substituting each Qdeg d3 d7 d21 d16 di dj vertex with its degree
prefix-sum
Perform a prefix-sum operation New on Qdeg to build the New Offset off3 off7 off21 off16 offi offj array Off
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 19 / 32 Sort-Uniq BFS Recipe #1: compute the total number of neighbors
1 2 3 4 k-1 k Start k threads, each element of QBFS is assigned to one thread
QBFS U3 U7 U21 U16 Ui Uj
Build Qdeg, substituting each Qdeg d3 d7 d21 d16 di dj vertex with its degree
prefix-sum
Perform a prefix-sum operation New on Qdeg to build the New Offset off3 off7 off21 off16 offi offmj array Off
The last element of New Offset is: m = P d i∈QBFS i
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 19 / 32 1 m
Binary Search New off3 off7 off21 off16 offi offmj Off
{Id1, Id2, ..., Idm}
Id1 Idk Idm Adj list Vi Vk Vs 0...... M
Vi Vk
Next V V level i k frontier 0 ...... m
Sort-Uniq BFS
Recipe #2: build a contiguos array of neighbors
1 m Start m threads
Each thread performs a binary search on New Offset and finds its index
Each thread reads from the Adj list the element corresponding to the index
and write it in the Next Level Frontier.
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 20 / 32 1 m
Binary Search New off3 off7 off21 off16 offi offmj Off
{Id1, Id2, ..., Idm}
Id1 Idk Idm Adj list Vi Vk Vs 0...... M
Vi Vk
Next V V level i k frontier 0 ...... m
Sort-Uniq BFS
Recipe #2: build a contiguos array of neighbors
1 m Start m threads
Binary Search Each thread performs a binary New off3 off7 off21 off16 offi offmj search on New Offset and finds Off
its index {Id1, Id2, ..., Idm}
Each thread reads from the Adj list the element corresponding to the index
and write it in the Next Level Frontier.
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 20 / 32 1 m
Binary Search New off3 off7 off21 off16 offi offmj Off
{Id1, Id2, ..., Idm}
Id1 Idk Idm Adj list Vi Vk Vs 0...... M
Vi Vk
Next V V level i k frontier 0 ...... m
Sort-Uniq BFS
Recipe #2: build a contiguos array of neighbors
1 m Start m threads
Binary Search Each thread performs a binary New off3 off7 off21 off16 offi offmj search on New Offset and finds Off
its index {Id1, Id2, ..., Idm}
Id1 Idk Idm Each thread reads from the Adj Adj list list the element corresponding to Vi Vk Vs the index 0...... M
and write it in the Next Level Frontier.
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 20 / 32 Sort-Uniq BFS
Recipe #2: build a contiguos array of neighbors
1 m Start m threads
Binary Search Each thread performs a binary New off3 off7 off21 off16 offi offmj search on New Offset and finds Off
its index {Id1, Id2, ..., Idm}
Id1 Idk Idm Each thread reads from the Adj Adj list list the element corresponding to Vi Vk Vs the index 0...... M
Vi Vk
Next and write it in the Next Level V V level i k Frontier. frontier 0 ...... m
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 20 / 32 m Unique ratio n ∼ 20
1 m
Sort - Unique Compact
Vi Vp Vp Vk Vp Vp 0 ...... m
Compact V V array i k 0 ...... n
Sort-Uniq BFS Recipe #3: prune the Next Level Frontier
1 m Start m threads
Perform a sort-uniq operation on the Next Level Frontier (by using thrust library)
and compact it to n unique elements
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 21 / 32 m Unique ratio n ∼ 20
1 m
Sort - Unique Compact
Vi Vp Vp Vk Vp Vp 0 ...... m
Compact V V array i k 0 ...... n
Sort-Uniq BFS Recipe #3: prune the Next Level Frontier
1 m Start m threads
Sort - Unique Perform a sort-uniq operation on Compact
the Next Level Frontier (by Vi Vp Vp Vk Vp Vp using thrust library) 0 ...... m
and compact it to n unique elements
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 21 / 32 m Unique ratio n ∼ 20
Sort-Uniq BFS Recipe #3: prune the Next Level Frontier
1 m Start m threads
Sort - Unique Perform a sort-uniq operation on Compact
the Next Level Frontier (by Vi Vp Vp Vk Vp Vp using thrust library) 0 ...... m
Compact V V array i k and compact it to n unique 0 ...... n elements
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 21 / 32 Sort-Uniq BFS Recipe #3: prune the Next Level Frontier
1 m Start m threads
Sort - Unique Perform a sort-uniq operation on Compact
the Next Level Frontier (by Vi Vp Vp Vk Vp Vp using thrust library) 0 ...... m
Compact V V array i k and compact it to n unique 0 ...... n elements
m Unique ratio n ∼ 20
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 21 / 32 If QBFS == 0 quit.
1 n
Compact V V array i k 0 ...... n
Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n
Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s 0 ...... n
MPI - Exchange
Enqueue
Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array
Start n threads 1 n
Compact V V array i k Substitute vertices with tasks 0 ...... n
Sort by tasks (by using thrust library)
Exchange non-local edges
Update the array of predecessors and Enqueue
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 If QBFS == 0 quit.
1 n
Compact V V array i k 0 ...... n
Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n
Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s 0 ...... n
MPI - Exchange
Enqueue
Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array
Start n threads 1 n
Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust library)
Exchange non-local edges
Update the array of predecessors and Enqueue
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 If QBFS == 0 quit.
1 n
Compact V V array i k 0 ...... n
Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n
Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s 0 ...... n
MPI - Exchange
Enqueue
Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array
Start n threads 1 n
Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s library) 0 ...... n
Exchange non-local edges
Update the array of predecessors and Enqueue
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 If QBFS == 0 quit.
1 n
Compact V V array i k 0 ...... n
Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n
Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s 0 ...... n
MPI - Exchange
Enqueue
Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array
Start n threads 1 n
Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s library) 0 ...... n
Exchange non-local edges MPI - Exchange
Update the array of predecessors and Enqueue
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 If QBFS == 0 quit.
Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array
Start n threads 1 n
Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s library) 0 ...... n
Exchange non-local edges MPI - Exchange
Update the array of Enqueue predecessors and Enqueue
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array
Start n threads 1 n
Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s library) 0 ...... n
Exchange non-local edges MPI - Exchange
Update the array of Enqueue predecessors and Enqueue
If QBFS == 0 quit.
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 Outline
1 Large Graphs and Supercomputing
2 Graph500 on multi-GPUs
3 Breadth First Search on multi-GPUs
4 Sort-Unique Breadth First Search
5 Results Sort-Uniq BFS: Results
Straight BFS (Billion)
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 24 / 32 K2: balancing Cumulative running time, 16 processors
Computations and communications among processes are well balanced
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 25 / 32 Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners
Communications mpi allgather mpi send/recv mpi allreduce time (sec)
CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff
K2: cuda kernels times
Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners
Communications mpi allgather mpi send/recv mpi allreduce time (sec)
CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 26 / 32 Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners
Communications mpi allgather mpi send/recv mpi allreduce time (sec)
CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff
K2: cuda kernels times
Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners
Communications mpi allgather mpi send/recv mpi allreduce time (sec) time (sec)
CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 26 / 32 K2: cuda kernels times
Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners
Communications mpi allgather mpi send/recv mpi allreduce time (sec) time (sec)
CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 26 / 32 Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 27 / 32 Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 27 / 32 Conclusions and Outlook
To visit large graphs we need a distributed algorithm We are slower on a single GPU We relay on sorting to achieve better scaling If we can speed-up the sorting then we will speed up the BFS
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 28 / 32 D. Chakrabarti, D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. IN SDM, 2004. A. G. Duane Merrill, Michael Garland. High performance and scalable gpu graph traversal. Technical report, Nvidia, 2011. P. Harish and P. J. Narayanan. Accelerating large graph algorithms on the gpu using cuda, 2007. J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res., 11:985–1042, March 2010. R. Murphy and G. Chukkapalli. Graph 500 and data-intensive hpc. Technical report, Oracle.
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 28 / 32 P. of the 16th ACM symposium on Principles and practice of parallel programming, editors. Accelerating CUDA Graph Algorithms at Maximum Warp, 2011.
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 29 / 32 [5] [1, 4] [3, 6, 2]
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 29 / 32 Single GPU with the multi-GPUs code
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 30 / 32 Random Graph
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 31 / 32 Unique ratio example
2 0.28 2 0.47 2 0.18 3 0.16 3 0.15 3 0.21 4 0.77 4 0.50 4 0.97 5 1.00 5 1.00 5 1.00 6 1.00 6 1.00 6 1.00 7 1.00 Table: Unique ratio, proc 0 of 64, 3 run of BFS
Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 32 / 32