Large Graph on multi-GPUs

Enrico Mastrostefano1 Massimo Bernaschi2 Massimiliano Fatica3

1Sapienza Universita` di Roma 2Istituto per le Applicazioni del Calcolo, IAC-CNR, Rome, Italy 3NVIDIA Corporation

May 11, 2012

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 1 / 32 Outline

1 Large Graphs and Supercomputing

2 Graph500 on multi-GPUs

3 Breadth First Search on multi-GPUs

4 -Unique Breadth First Search

5 Results Large Graphs

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms

Large Graphs

Most of graph algorithms have low arithmetic intensity and irregular memory access patterns

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms

Large Graphs

Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms?

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms

Large Graphs

Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 The core of the benchmark is a set of graph algorithms

Large Graphs

Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org)

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Large Graphs

Most of graph algorithms have low arithmetic intensity and irregular memory access patterns How do modern architectures perform running such algorithms? Large-scale benchmark for data-intensive application “This is the first serious approach to complement the Top 500 with data-intensive applications ...” (from www.graph500.org) The core of the benchmark is a set of graph algorithms

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 3 / 32 Kernel 1 Build the data structure that represent the graph: timed! Data structure cannot be modified by subsequent kernels

Kernel 2 Perform a Breadth First Search (BFS) visit starting from a random vertex: timed! Output the parent array.

graph500: specifications

Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268, 435, 456 vertices and 8, 589, 934, 592 edges)

Kernel 1

Kernel 2

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 Kernel 2 Perform a Breadth First Search (BFS) visit starting from a random vertex: timed! Output the parent array.

graph500: specifications

Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268, 435, 456 vertices and 8, 589, 934, 592 edges)

Kernel 1 Build the data structure that represent the graph: timed! Kernel 1 Data structure cannot be modified by subsequent kernels

Kernel 2

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 graph500: specifications

Generator Generate the edge list with real-world properties (RMAT generator). minimum size: 228 vertices and 16 ∗ 228 edges Generator (268, 435, 456 vertices and 8, 589, 934, 592 edges)

Kernel 1 Build the data structure that represent the graph: timed! Kernel 1 Data structure cannot be modified by subsequent kernels

Kernel 2 Perform a Breadth First Search (BFS) visit starting from a Kernel 2 random vertex: timed! Output the parent array.

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 4 / 32 graph500: specifications

Outputs Execution of K1 Execution time of K2, TEPS: Traversed Edges Per Second in the BFS (actually the ranking is based only on TEPS)

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 5 / 32 Outline

1 Large Graphs and Supercomputing

2 Graph500 on multi-GPUs

3 Breadth First Search on multi-GPUs

4 Sort-Unique Breadth First Search

5 Results graph500 on multi-GPUs

CUDA + MPI

Generator GPU1 GPU2 GPUn

Kernel 1

MPI Kernel 2

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 7 / 32 Generate the edge list

Edge list We have to generate: |V | = 2SCALE ; |M| = 16 ∗ 2SCALE Each task generates a subset of the edge list in the form: (U0, V0), (U1, V1), ... Edges are assigned to processes via a simple rule: edge (Ui , Vj ) ∈ Pk if Ui mod #P == k

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 8 / 32 The core of the algorithm is a sort

Build the data structure

Data structure We transform the edge list in a Compressed Sparse Row (CSR) data structure CSR is simple and has minimal memory requirements

Compressed Sparse Row (CSR)

local u1 u2 u3 u4 un vertices

offset 0 3 3 7 7 7 7 18 l M array

adj. list Vj Vk Vl Vt

0 . . . . . 3 ...... 7 ...... l . . . M

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 9 / 32 Build the data structure

Data structure We transform the edge list in a Compressed Sparse Row (CSR) data structure CSR is simple and has minimal memory requirements

Compressed Sparse Row (CSR)

local u1 u2 u3 u4 un vertices

offset 0 3 3 7 7 7 7 18 l M array

adj. list Vj Vk Vl Vt

0 . . . . . 3 ...... 7 ...... l . . . M

The core of the algorithm is a sort

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 9 / 32 K1 results Weak scaling plot, Kernel 1 time (sec)

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 10 / 32 Outline

1 Large Graphs and Supercomputing

2 Graph500 on multi-GPUs

3 Breadth First Search on multi-GPUs

4 Sort-Unique Breadth First Search

5 Results 1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

2

Adj V V list 7 3

I update the predecessor array local non-local I enqueue Vj

V7 V3 V2 V6

MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue

Output parent array and TEPS

Straightforward BFS

1 2 3 4 k-1 k Queue based BFS

QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread ti

Each thread ti visits all the neighboring Vj of its vertex

If Vj is local: visit it

if Vj is not local: send to its owner

receive vertices Vk from other processes

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

2

Adj V V list 7 3

I update the predecessor array local non-local I enqueue Vj

V7 V3 V2 V6

MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue

Output parent array and TEPS

Straightforward BFS

1 2 3 4 k-1 k Queue based BFS

QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2

Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it

if Vj is not local: send to its owner

receive vertices Vk from other processes

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

2

Adj V V list 7 3

I update the predecessor array local non-local I enqueue Vj

V7 V3 V2 V6

MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue

Output parent array and TEPS

Straightforward BFS

1 2 3 4 k-1 k Queue based BFS

QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2

Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it

local

V7 V3 if Vj is not local: send to its owner

receive vertices Vk from other processes

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

2

Adj V V list 7 3

local non-local

V7 V3 V2 V6

MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue

Output parent array and TEPS

Straightforward BFS

1 2 3 4 k-1 k Queue based BFS

QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2

Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it

I update the predecessor array local I enqueue Vj

V7 V3 if Vj is not local: send to its owner

receive vertices Vk from other processes

Enqueue

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

2

Adj V V list 7 3

local non-local

V7 V3 V2 V6

MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue

Output parent array and TEPS

Straightforward BFS

1 2 3 4 k-1 k Queue based BFS

QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2

Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it

I update the predecessor array local non-local I enqueue Vj

V7 V3 V2 V6 if Vj is not local: send to its owner

receive vertices Vk from other processes

Enqueue

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 Output parent array and TEPS

Straightforward BFS

1 2 3 4 k-1 k Queue based BFS

QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2

Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it

I update the predecessor array local non-local I enqueue Vj

V7 V3 V2 V6 if Vj is not local: send to its owner receive vertices V from other processes k MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 Straightforward BFS

1 2 3 4 k-1 k Queue based BFS

QBFS U3 U7 U21 U16 Ui Uj Each vertex Ui of QBFS is assigned to one thread t i 2

Each thread ti visits all the neighboring Vj of its vertex Adj V V list 7 3 If Vj is local: visit it

I update the predecessor array local non-local I enqueue Vj

V7 V3 V2 V6 if Vj is not local: send to its owner receive vertices V from other processes k MPI Send/Recv I update the predecessor array I enqueue Vk Recv Enqueue

Output parent array and TEPS

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 12 / 32 Straightforward BFS: Results

0.14 0.7- Straight BFS 0.12 0.6 Straight BFS 0.10 0.5

0.08 0.4 (Billion) (Billion)

0.06 0.3

0.2 0.04

0.1 0.02

0.0

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 13 / 32 Algorithms rely on the use of Atomic Add

1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

Memory access patterns can 2 1 be irregular and threads that belong to the same warp may access non-contiguos regions

of memory 0 . . . .L2 0 ...... L1

Straightforward BFS: Issues

Gpu-related issues 1 2 3 4 k-1 k Threads workloads are unbalanced when threads visit QBFS U3 U7 U21 U16 Ui Uj different adjacency lists

2 1

0 . . . .L2 0 ...... L1

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 14 / 32 Algorithms rely on the use of Atomic Add

Straightforward BFS: Issues

Gpu-related issues 1 2 3 4 k-1 k Threads workloads are unbalanced when threads visit QBFS U3 U7 U21 U16 Ui Uj different adjacency lists Memory access patterns can 2 1 be irregular and threads that belong to the same warp may access non-contiguos regions

of memory 0 . . . .L2 0 ...... L1

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 14 / 32 Straightforward BFS: Issues

Gpu-related issues 1 2 3 4 k-1 k Threads workloads are unbalanced when threads visit QBFS U3 U7 U21 U16 Ui Uj different adjacency lists Memory access patterns can 2 1 be irregular and threads that belong to the same warp may access non-contiguos regions

of memory 0 . . . .L2 0 ...... L1

Algorithms rely on the use of Atomic Add

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 14 / 32 Processor 1 Next Level V1 V0 V7 V0 V3 V2 V9 V7 V1 V0 Processor 2 Frontier Processor k

neighbors of U3 neighbors of U7

Straightforward BFS: Issues

MPI-related issues We don’t process non-local vertices so the array to send contains multiple copies of the same vertices. Multiple copies of the same vertex are sent to the owner.

Next Level V1 V0 V7 V0 V3 V2 V9 V7 V1 V0 Frontier

neighbors of U3 neighbors of U7

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 15 / 32 Straightforward BFS: Issues

MPI-related issues We don’t process non-local vertices so the array to send contains multiple copies of the same vertices. Multiple copies of the same vertex are sent to the owner.

Processor 1 Next Level V1 V0 V7 V0 V3 V2 V9 V7 V1 V0 Processor 2 Frontier Processor k

neighbors of U3 neighbors of U7

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 15 / 32 Outline

1 Large Graphs and Supercomputing

2 Graph500 on multi-GPUs

3 Breadth First Search on multi-GPUs

4 Sort-Unique Breadth First Search

5 Results We use k threads to visit neighbors We want to use as many threads as the number of neighbors

Neighbors of U are not-contiguous in the Adjacency list array We want a contiguous array of neighbors.

We send/recv multiple copies We want to prune the array that we send

Beyond the straightforward BFS

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 17 / 32 Neighbors of U are not-contiguous in the Adjacency list array We want a contiguous array of neighbors.

We send/recv multiple copies We want to prune the array that we send

Beyond the straightforward BFS

We use k threads to visit neighbors We want to use as many threads as the number of neighbors

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 17 / 32 We send/recv multiple copies We want to prune the array that we send

Beyond the straightforward BFS

We use k threads to visit neighbors We want to use as many threads as the number of neighbors

Neighbors of U are not-contiguous in the Adjacency list array We want a contiguous array of neighbors.

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 17 / 32 Beyond the straightforward BFS

We use k threads to visit neighbors We want to use as many threads as the number of neighbors

Neighbors of U are not-contiguous in the Adjacency list array We want a contiguous array of neighbors.

We send/recv multiple copies We want to prune the array that we send

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 17 / 32 Beyond the straightforward BFS: Sort-Unique BFS overview

What we will do is: 1 Compute the total number of neighbors, say m 2 Start m threads, read the Adjacency list and build a contiguous array of neighbors (We call this array Next Level Frontier) 3 With m threads prune the Next Level Frontier 4 Exchange vertices with other processes and update the parent array

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 18 / 32 The last element of New Offset is: m = P d i∈QBFS i

1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

Qdeg d3 d7 d21 d16 di dj

prefix-sum

New off3 off7 off21 off16 offi offmj Off

Sort- BFS Recipe #1: compute the total number of neighbors

1 2 3 4 k-1 k Start k threads, each element of QBFS is assigned to one thread

QBFS U3 U7 U21 U16 Ui Uj

Build Qdeg, substituting each vertex with its degree

Perform a prefix-sum operation on Qdeg to build the New Offset array

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 19 / 32 The last element of New Offset is: m = P d i∈QBFS i

1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

Qdeg d3 d7 d21 d16 di dj

prefix-sum

New off3 off7 off21 off16 offi offmj Off

Sort-Uniq BFS Recipe #1: compute the total number of neighbors

1 2 3 4 k-1 k Start k threads, each element of QBFS is assigned to one thread

QBFS U3 U7 U21 U16 Ui Uj

Build Qdeg, substituting each Qdeg d3 d7 d21 d16 di dj vertex with its degree

Perform a prefix-sum operation on Qdeg to build the New Offset array

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 19 / 32 The last element of New Offset is: m = P d i∈QBFS i

1 2 3 4 k-1 k

QBFS U3 U7 U21 U16 Ui Uj

Qdeg d3 d7 d21 d16 di dj

prefix-sum

New off3 off7 off21 off16 offi offmj Off

Sort-Uniq BFS Recipe #1: compute the total number of neighbors

1 2 3 4 k-1 k Start k threads, each element of QBFS is assigned to one thread

QBFS U3 U7 U21 U16 Ui Uj

Build Qdeg, substituting each Qdeg d3 d7 d21 d16 di dj vertex with its degree

prefix-sum

Perform a prefix-sum operation New on Qdeg to build the New Offset off3 off7 off21 off16 offi offj array Off

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 19 / 32 Sort-Uniq BFS Recipe #1: compute the total number of neighbors

1 2 3 4 k-1 k Start k threads, each element of QBFS is assigned to one thread

QBFS U3 U7 U21 U16 Ui Uj

Build Qdeg, substituting each Qdeg d3 d7 d21 d16 di dj vertex with its degree

prefix-sum

Perform a prefix-sum operation New on Qdeg to build the New Offset off3 off7 off21 off16 offi offmj array Off

The last element of New Offset is: m = P d i∈QBFS i

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 19 / 32 1 m

Binary Search New off3 off7 off21 off16 offi offmj Off

{Id1, Id2, ..., Idm}

Id1 Idk Idm Adj list Vk Vs 0...... M

Vi Vk

Next V V level i k frontier 0 ...... m

Sort-Uniq BFS

Recipe #2: build a contiguos array of neighbors

1 m Start m threads

Each thread performs a binary search on New Offset and finds its index

Each thread reads from the Adj list the element corresponding to the index

and it in the Next Level Frontier.

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 20 / 32 1 m

Binary Search New off3 off7 off21 off16 offi offmj Off

{Id1, Id2, ..., Idm}

Id1 Idk Idm Adj list Vi Vk Vs 0...... M

Vi Vk

Next V V level i k frontier 0 ...... m

Sort-Uniq BFS

Recipe #2: build a contiguos array of neighbors

1 m Start m threads

Binary Search Each thread performs a binary New off3 off7 off21 off16 offi offmj search on New Offset and finds Off

its index {Id1, Id2, ..., Idm}

Each thread reads from the Adj list the element corresponding to the index

and write it in the Next Level Frontier.

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 20 / 32 1 m

Binary Search New off3 off7 off21 off16 offi offmj Off

{Id1, Id2, ..., Idm}

Id1 Idk Idm Adj list Vi Vk Vs 0...... M

Vi Vk

Next V V level i k frontier 0 ...... m

Sort-Uniq BFS

Recipe #2: build a contiguos array of neighbors

1 m Start m threads

Binary Search Each thread performs a binary New off3 off7 off21 off16 offi offmj search on New Offset and finds Off

its index {Id1, Id2, ..., Idm}

Id1 Idk Idm Each thread reads from the Adj Adj list list the element corresponding to Vi Vk Vs the index 0...... M

and write it in the Next Level Frontier.

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 20 / 32 Sort-Uniq BFS

Recipe #2: build a contiguos array of neighbors

1 m Start m threads

Binary Search Each thread performs a binary New off3 off7 off21 off16 offi offmj search on New Offset and finds Off

its index {Id1, Id2, ..., Idm}

Id1 Idk Idm Each thread reads from the Adj Adj list list the element corresponding to Vi Vk Vs the index 0...... M

Vi Vk

Next and write it in the Next Level V V level i k Frontier. frontier 0 ...... m

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 20 / 32 m Unique ratio n ∼ 20

1 m

Sort - Unique Compact

Vi Vp Vp Vk Vp Vp 0 ...... m

Compact V V array i k 0 ...... n

Sort-Uniq BFS Recipe #3: prune the Next Level Frontier

1 m Start m threads

Perform a sort-uniq operation on the Next Level Frontier (by using thrust library)

and compact it to n unique elements

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 21 / 32 m Unique ratio n ∼ 20

1 m

Sort - Unique Compact

Vi Vp Vp Vk Vp Vp 0 ...... m

Compact V V array i k 0 ...... n

Sort-Uniq BFS Recipe #3: prune the Next Level Frontier

1 m Start m threads

Sort - Unique Perform a sort-uniq operation on Compact

the Next Level Frontier (by Vi Vp Vp Vk Vp Vp using thrust library) 0 ...... m

and compact it to n unique elements

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 21 / 32 m Unique ratio n ∼ 20

Sort-Uniq BFS Recipe #3: prune the Next Level Frontier

1 m Start m threads

Sort - Unique Perform a sort-uniq operation on Compact

the Next Level Frontier (by Vi Vp Vp Vk Vp Vp using thrust library) 0 ...... m

Compact V V array i k and compact it to n unique 0 ...... n elements

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 21 / 32 Sort-Uniq BFS Recipe #3: prune the Next Level Frontier

1 m Start m threads

Sort - Unique Perform a sort-uniq operation on Compact

the Next Level Frontier (by Vi Vp Vp Vk Vp Vp using thrust library) 0 ...... m

Compact V V array i k and compact it to n unique 0 ...... n elements

m Unique ratio n ∼ 20

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 21 / 32 If QBFS == 0 quit.

1 n

Compact V V array i k 0 ...... n

Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n

Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s 0 ...... n

MPI - Exchange

Enqueue

Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array

Start n threads 1 n

Compact V V array i k Substitute vertices with tasks 0 ...... n

Sort by tasks (by using thrust library)

Exchange non-local edges

Update the array of predecessors and Enqueue

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 If QBFS == 0 quit.

1 n

Compact V V array i k 0 ...... n

Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n

Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s 0 ...... n

MPI - Exchange

Enqueue

Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array

Start n threads 1 n

Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust library)

Exchange non-local edges

Update the array of predecessors and Enqueue

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 If QBFS == 0 quit.

1 n

Compact V V array i k 0 ...... n

Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n

Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s 0 ...... n

MPI - Exchange

Enqueue

Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array

Start n threads 1 n

Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s library) 0 ...... n

Exchange non-local edges

Update the array of predecessors and Enqueue

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 If QBFS == 0 quit.

1 n

Compact V V array i k 0 ...... n

Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n

Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s 0 ...... n

MPI - Exchange

Enqueue

Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array

Start n threads 1 n

Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s library) 0 ...... n

Exchange non-local edges MPI - Exchange

Update the array of predecessors and Enqueue

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 If QBFS == 0 quit.

Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array

Start n threads 1 n

Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s library) 0 ...... n

Exchange non-local edges MPI - Exchange

Update the array of Enqueue predecessors and Enqueue

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 Sort-Uniq BFS: communication and enqueue Recipe #4: Exchange vertices and update the parent array

Start n threads 1 n

Compact V V array i k Substitute vertices with tasks 0 ...... n Procs P P P P P P P P array 0 1 0 2 1 2 1 s 0 ...... n Sort by tasks (by using thrust Procs P P P P P P P P sorted 0 0 1 1 1 2 2 s library) 0 ...... n

Exchange non-local edges MPI - Exchange

Update the array of Enqueue predecessors and Enqueue

If QBFS == 0 quit.

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 22 / 32 Outline

1 Large Graphs and Supercomputing

2 Graph500 on multi-GPUs

3 Breadth First Search on multi-GPUs

4 Sort-Unique Breadth First Search

5 Results Sort-Uniq BFS: Results

Straight BFS (Billion)

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 24 / 32 K2: balancing Cumulative running time, 16 processors

Computations and communications among processes are well balanced

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 25 / 32 Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners

Communications mpi allgather mpi send/recv mpi allreduce time (sec)

CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff

K2: cuda kernels times

Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners

Communications mpi allgather mpi send/recv mpi allreduce time (sec)

CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 26 / 32 Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners

Communications mpi allgather mpi send/recv mpi allreduce time (sec)

CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff

K2: cuda kernels times

Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners

Communications mpi allgather mpi send/recv mpi allreduce time (sec) time (sec)

CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 26 / 32 K2: cuda kernels times

Sum of running time over bfs levels, proc 0 of 64 Cuda Kernels binary search sort-unique enqueue-local enqueue-received sort owners

Communications mpi allgather mpi send/recv mpi allreduce time (sec) time (sec)

CudaCpy cuda-cpy sendbuff cuda-cpy recvbuff

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 26 / 32 Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 27 / 32 Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 27 / 32 Conclusions and Outlook

To visit large graphs we need a distributed algorithm We are slower on a single GPU We relay on sorting to achieve better scaling If we can speed-up the sorting then we will speed up the BFS

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 28 / 32 D. Chakrabarti, D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. IN SDM, 2004. A. G. Duane Merrill, Michael Garland. High performance and scalable gpu graph traversal. Technical report, Nvidia, 2011. P. Harish and P. J. Narayanan. Accelerating large graph algorithms on the gpu using cuda, 2007. J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res., 11:985–1042, March 2010. R. Murphy and G. Chukkapalli. Graph 500 and data-intensive hpc. Technical report, Oracle.

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 28 / 32 P. of the 16th ACM symposium on Principles and practice of parallel programming, editors. Accelerating CUDA Graph Algorithms Maximum Warp, 2011.

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 29 / 32 [5] [1, 4] [3, 6, 2]

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 29 / 32 Single GPU with the multi-GPUs code

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 30 / 32 Random Graph

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 31 / 32 Unique ratio example

2 0.28 2 0.47 2 0.18 3 0.16 3 0.15 3 0.21 4 0.77 4 0.50 4 0.97 5 1.00 5 1.00 5 1.00 6 1.00 6 1.00 6 1.00 7 1.00 Table: Unique ratio, proc 0 of 64, 3 run of BFS

Enrico Mastrostefano Large Graph on multi-GPUs May 11, 2012 32 / 32