Graph500: From Kepler to Pascal

Julien Loiseau, Michaël Krajecki, François Alin and Christophe Jaillet

GTC 2017

University of Reims Champagne-Ardenne (URCA)

Multidisciplinary University

- about 27 000 students
- 5 campuses: Reims, Troyes, Charleville-Mézières, Chaumont and Châlons-en-Champagne
- A wide initial undergraduate studies program
- Graduate studies and PhD program linked with research labs

Graph500: From Kepler to Pascal, J. Loiseau et al.

HPC issues: Power efficiency

Exascale architecture

- Computational power: Petaflop → ×1000 → Exaflop
- Moore's law is over
- Energy efficiency: 8 MW → ×1000 → 8 GW ??

HPC issues: HPC Architectures

- Accelerators: Xeon Phi, GPU, FPGA, ASIC
- Nodes: CPU(s) + Accelerator(s), connected with MPI
- Memory: CPU/GPU memory, SSD, HDD; in-core and out-of-core

HPC issues: ROMEO, Reims, France

ROMEO

- Reims, Champagne-Ardenne, France
- 130 nodes
  - 2 × CPU Intel Xeon E5-2650v2 2.6GHz (16GB RAM)
  - 2 × GPU NVIDIA K20Xm (6GB RAM)
- FatTree with InfiniBand
- 1 × DGX-1 node

HPC issues: ROMEO, Regional HPC Center

Its mission is to deliver, for both industrial and academic researchers:

- high performance computing resources
- secured storage spaces
- specific & scientific software
- advanced user support in exploiting these resources
- in-depth expertise in different engineering fields: HPC, applied mathematics, physics, biophysics and chemistry, ...

- Promote and disseminate HPC and simulation to companies / SMB
- Identify, experiment with and master breakthrough technologies
  - which give new opportunities for our users
  - from technology-watching to production
  - for all research domains

HPC issues: GPU Integration

[Timeline figure, 2008 to 2016; intermediate axis marks: 2012, 2015]

- 2008: 1 server, Tesla S1070 GPU, 960 cores/U
- 2010: cluster, 10% of nodes with GPU
- Fermi - M2090
- 2013: cluster, 100% of nodes with GPU, 260 K20x, TOP 500 & GREEN 500
- 2016: DGX1 Server, first server dedicated to Deep Learning, 8 GPU P100

HPC issues: Benchmarking HPC Architectures

How to compare the computing power of parallel architectures?

TOP500

- LINPACK
- Solving n equations with n unknowns
- "Regular"

Floating-point Operations Per Second, FLOPS

GRAPH500

- BFS
- Large randomly generated graphs
- "Irregular"

Traversed Edges Per Second, TEPS
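As a small illustration of the TEPS metric, the rate is the number of edges traversed by one BFS divided by the time that BFS took. The function name and the sample numbers below are made up for the example and do not come from the slides.

```python
def teps(traversed_edges: int, seconds: float) -> float:
    """Traversed Edges Per Second for a single BFS run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return traversed_edges / seconds


# Hypothetical run: 2**25 edges traversed in half a second.
example_gteps = teps(2**25, 0.5) / 1e9  # GTEPS = 1e9 TEPS
```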

GRAPH500: Protocol and ranking

Graph algorithms:

- Irregular memory access
- Irregular communications
- No heavy computation step

Data dense application

Steps

- Graph generation (SKG, RMAT)
- Randomly sample 64 unique root vertices
- Structure generation
- For each root vertex:
  - BFS
  - Validate BFS tree
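The steps above can be sketched end to end on a small graph. This is only an illustration under stated assumptions: edges are drawn uniformly at random as a stand-in for the SKG/RMAT generator, roots are restricted to vertices with at least one edge, the validation is a simplified set of checks, and the timing of each BFS is left out. All function names are hypothetical.

```python
import random
from collections import deque


def build_adjacency(num_vertices, edges):
    """Structure generation: undirected adjacency lists, self-loops dropped."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        if u != v:
            adj[u].append(v)
            adj[v].append(u)
    return adj


def bfs_parents(adj, root):
    """Plain sequential BFS; parent[v] == -1 means v was not reached."""
    parent = [-1] * len(adj)
    parent[root] = root
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)
    return parent


def levels_from_parents(parent, root):
    """Depth of every reached vertex, or None if the parent array has a cycle."""
    level = [-1] * len(parent)
    level[root] = 0
    for v in range(len(parent)):
        path, w = [], v
        while parent[w] != -1 and level[w] == -1:
            path.append(w)
            w = parent[w]
            if len(path) > len(parent):
                return None
        if parent[w] == -1:
            continue  # unreached vertex (or broken chain, caught later)
        for x in reversed(path):
            level[x] = level[parent[x]] + 1
    return level


def validate_bfs_tree(adj, root, parent):
    """Simplified validation of a BFS parent array."""
    if parent[root] != root:
        return False
    level = levels_from_parents(parent, root)
    if level is None:
        return False
    for u, neighbours in enumerate(adj):
        for v in neighbours:
            if (parent[u] == -1) != (parent[v] == -1):
                return False  # an edge leaves the reached component
            if parent[u] != -1 and abs(level[u] - level[v]) > 1:
                return False  # an edge spans more than one level
    for v, p in enumerate(parent):
        if p != -1 and v != root:
            if v not in adj[p] or level[v] != level[p] + 1:
                return False  # tree edge missing from the graph or wrong depth
    return True


def run_benchmark(scale=8, edge_factor=16, num_roots=64, seed=0):
    rng = random.Random(seed)
    n = 2 ** scale
    # Assumption: uniform random edges instead of the Kronecker generator.
    edges = [(rng.randrange(n), rng.randrange(n)) for _ in range(edge_factor * n)]
    adj = build_adjacency(n, edges)
    candidates = [v for v in range(n) if adj[v]]
    roots = rng.sample(candidates, min(num_roots, len(candidates)))
    return all(validate_bfs_tree(adj, r, bfs_parents(adj, r)) for r in roots)
```

Calling `run_benchmark()` generates a SCALE 8 graph, runs one BFS per sampled root and reports whether every tree passed the checks.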

GRAPH500: Protocol and ranking

Goal: Breadth First Search on random graph:

[Figure: BFS expanding from the source vertex through Level 1, Level 2 and Level 3]

GRAPH500: Problem Scale

Graph500 current list (Nov. 2016)

- Best CPU & GPU machines:

Rank  Name         Scale  GTEPS
(1)                40     38621.4
(2)   Sunway       40     23755.7
(3)                41     23751
...
(31)  TSUBAME 2.0  35     462.25
(39)  GSIC Center  35     317.09
(43)  HA-PACS      32     223.634

- BlueGene ⇒ 19 in top 30
- GPU ⇒ NVIDIA

Problem sizes

- 2^SCALE ⇒ vertices
- 2^(SCALE+4) ⇒ edges

Problem size  Scale  Memory (TB)
Toy           26     0.0172
Mini          29     0.1374
Small         32     1.0995
Medium        36     17.5922
Large         39     140.7375
Huge          42     1125.8999

- For graph generation
- Converted before use
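The memory column can be reproduced from the scale alone. The sketch below assumes 16 bytes per generated edge (two 64-bit vertex identifiers) and decimal terabytes; neither is stated on the slide, but together they match the listed figures.

```python
EDGE_FACTOR = 16      # 2**4 edges per vertex, from 2^(SCALE+4) edges
BYTES_PER_EDGE = 16   # assumption: two 64-bit vertex ids per edge


def problem_size(scale):
    """Return (vertices, edges, edge-list size in decimal TB) for a scale."""
    vertices = 2 ** scale
    edges = EDGE_FACTOR * vertices
    terabytes = edges * BYTES_PER_EDGE / 1e12
    return vertices, edges, terabytes


CLASSES = [("Toy", 26), ("Mini", 29), ("Small", 32),
           ("Medium", 36), ("Large", 39), ("Huge", 42)]

for name, scale in CLASSES:
    _, _, tb = problem_size(scale)
    print(f"{name:<7}{scale:>3}{tb:12.4f} TB")
```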

GRAPH500: Graph generation

[Figure: adjacency matrix (From nodes × To nodes) recursively divided into quadrants a, b, c, d]

Sparse Graph

- Kronecker
- Generation: a = 0.57, b = 0.19, c = 0.19, d = 0.05
- Edges: 16× more than vertices
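One way to read the quadrant probabilities is as an R-MAT style recursive choice: for each of the SCALE bits of the two endpoints, pick one of the four quadrants with probabilities a, b, c, d. The sketch below is a simplified illustration of that idea; it leaves out the noise and the vertex relabelling a full generator would apply, and its function names are hypothetical.

```python
import random

A, B, C, D = 0.57, 0.19, 0.19, 0.05  # quadrant probabilities from the slide


def rmat_edge(scale, rng):
    """Draw one edge (u, v) by choosing a quadrant for each of `scale` bits."""
    u = v = 0
    for _ in range(scale):
        r = rng.random()
        u <<= 1
        v <<= 1
        if r < A:
            pass              # quadrant a: both bits 0
        elif r < A + B:
            v |= 1            # quadrant b: column bit set
        elif r < A + B + C:
            u |= 1            # quadrant c: row bit set
        else:
            u |= 1            # quadrant d: both bits set
            v |= 1
    return u, v


def rmat_edges(scale, edge_factor=16, seed=0):
    """Edge list with edge_factor * 2**scale edges (duplicates are kept)."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng) for _ in range(edge_factor * 2 ** scale)]
```

Because a is much larger than d, most edges land among the low-numbered vertices, which is what produces the skewed degree distribution.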

GRAPH500: Data structure format

Structure format

→ Bitmap

- Natural representation
- 2^20 vertices = 128GB
- BG/Q version

→ CSR/CSC (Compressed Sparse Rows/Columns)

- Compressed format
- 2^20 vertices < 1GB
- BG/P and GPU version
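The two sizes quoted for 2^20 vertices follow from simple counting. The sketch below assumes one bit per matrix entry for the bitmap and, for CSR, 64-bit indices with every undirected edge stored in both directions; these storage choices are assumptions for the example, not details given on the slide.

```python
def bitmap_bytes(scale):
    """Dense adjacency bitmap: one bit per (row, column) pair."""
    n = 2 ** scale
    return n * n // 8


def csr_bytes(scale, edge_factor=16, index_bytes=8):
    """CSR arrays: n + 1 row pointers plus one column index per stored edge."""
    n = 2 ** scale
    stored_edges = 2 * edge_factor * n  # each undirected edge appears twice
    return (n + 1) * index_bytes + stored_edges * index_bytes


GIB = 2 ** 30
print(bitmap_bytes(20) / GIB, "GiB for the bitmap")
print(csr_bytes(20) / GIB, "GiB for CSR")
```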

Compressed Sparse Row

Sparse matrix:

     0  1  2  3  4
0    0  1  0  1  1
1    1  0  1  0  1
2    0  1  0  0  1
3    1  0  0  0  0
4    1  1  1  0  0

Row pointers:    0 3 6 8 9 12
Column indices:  1 3 4 0 2 4 1 4 0 0 1 2
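The CSR arrays of the 5 × 5 example can be rebuilt with a short conversion routine. This is a minimal sketch with hypothetical function names; it scans a dense matrix, which is only practical for small examples like this one.

```python
def dense_to_csr(matrix):
    """Return (row_ptr, col_idx) for a dense 0/1 matrix."""
    row_ptr = [0]
    col_idx = []
    for row in matrix:
        for j, value in enumerate(row):
            if value:
                col_idx.append(j)
        row_ptr.append(len(col_idx))  # end of this row in col_idx
    return row_ptr, col_idx


def neighbours(row_ptr, col_idx, u):
    """Column indices of the non-zero entries in row u."""
    return col_idx[row_ptr[u]:row_ptr[u + 1]]


MATRIX = [
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
ROW_PTR, COL_IDX = dense_to_csr(MATRIX)
```

Row u's neighbours are the slice of the column index array between two consecutive row pointers, so `neighbours(ROW_PTR, COL_IDX, 4)` reads entries 9 to 11.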

GRAPH500: Parallel algorithm

- Split the adjacency matrix into blocks

Generate output queue        Share input queue

A0,0  A0,1  A0,2  A0,3       A0,0  A1,0  A2,0  A3,0
A1,0  A1,1  A1,2  A1,3       A0,1  A1,1  A2,1  A3,1
A2,0  A2,1  A2,2  A2,3       A0,2  A1,2  A2,2  A3,2
A3,0  A3,1  A3,2  A3,3       A0,3  A1,3  A2,3  A3,3

Repeat until end

- l × l machines with l = 2^k (k ∈ N)
- Machine Mi,j ⇔ block Ai,j
  → Vertices "In" Ri
  → Vertices "Out" Rj
- Predecessors, 1D distribution: Mi,j gets 1/4 of the vertices in Ri
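The block-to-machine mapping can be illustrated with integer division. The sketch assumes that vertices are numbered contiguously, that l divides the vertex count, and that row block i covers range R_i while column block j covers range R_j; the function names are hypothetical.

```python
def owner(u, v, num_vertices, l):
    """Grid coordinates (i, j) of the machine holding matrix entry (u, v)."""
    block = num_vertices // l  # assumes l divides num_vertices
    return u // block, v // block


def partition_edges(edges, num_vertices, l):
    """Group an edge list by owning machine on an l x l grid."""
    blocks = {(i, j): [] for i in range(l) for j in range(l)}
    for u, v in edges:
        blocks[owner(u, v, num_vertices, l)].append((u, v))
    return blocks


def predecessor_range(i, j, num_vertices, l):
    """Part of R_i whose predecessors machine M(i, j) stores (1/l of R_i)."""
    block = num_vertices // l
    part = block // l          # assumes l also divides the block size
    start = i * block + j * part
    return range(start, start + part)
```

With 16 vertices on a 4 × 4 grid, edge (5, 12) belongs to machine (1, 3), and the four machines of row 1 each keep the predecessors of one vertex out of R_1 = {4, 5, 6, 7}.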

GRAPH500: Exploration

Search for children: CSR

- Top-down
- in_queue → out_queue

Search for parents: CSC

- Bottom-up
- out_queue & visited ← in_queue

[Figures: for each direction, the current frontier and the vertices not yet visited]
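The two directions can be written as interchangeable frontier steps. The sketch below is sequential and uses plain adjacency lists of an undirected graph, so the same lists serve as both the CSR (children) and CSC (parents) view; it illustrates the idea and is not the GPU implementation. Function names are hypothetical.

```python
def top_down_step(adj, frontier, parent):
    """Each frontier vertex claims its not yet visited children."""
    out_queue = []
    for u in frontier:
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                out_queue.append(v)
    return out_queue


def bottom_up_step(adj, frontier, parent):
    """Each not yet visited vertex looks for a parent in the frontier."""
    in_frontier = set(frontier)
    out_queue = []
    for v in range(len(adj)):
        if parent[v] != -1:
            continue
        for u in adj[v]:
            if u in in_frontier:
                parent[v] = u
                out_queue.append(v)
                break  # one parent is enough, stop scanning early
    return out_queue


def bfs(adj, root, step):
    """Run a BFS from root using the given step function at every level."""
    parent = [-1] * len(adj)
    parent[root] = root
    frontier = [root]
    while frontier:
        frontier = step(adj, frontier, parent)
    return parent


# Adjacency lists of the 5-vertex example used for the CSR format.
ADJ = [[1, 3, 4], [0, 2, 4], [1, 4], [0], [0, 1, 2]]
```

The early `break` is what makes bottom-up cheap when the frontier is large: most unvisited vertices find a parent after inspecting only a few neighbours.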

Iteration  Top-down    Bottom-up   Hybrid version
0          27          22 090 111  27
1          8 156       1 568 798   8 156
2          3 695 684   587 893     587 893
3          19 565 465  12 586      12 586
4          214 578     8 256       8 256
5          5 865       1 201       1 201
6          12          156         12
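In the table, the hybrid column equals the smaller of the two per-iteration counts. The sketch below checks this and totals each column; the slide does not state the unit of the counts, and a real implementation has to pick the direction with a heuristic before running the step rather than by comparing both costs afterwards.

```python
TOP_DOWN = [27, 8_156, 3_695_684, 19_565_465, 214_578, 5_865, 12]
BOTTOM_UP = [22_090_111, 1_568_798, 587_893, 12_586, 8_256, 1_201, 156]
HYBRID = [27, 8_156, 587_893, 12_586, 8_256, 1_201, 12]


def choose_directions(top_down, bottom_up):
    """Cheaper direction for each iteration (ties go to top-down)."""
    return ["top-down" if td <= bu else "bottom-up"
            for td, bu in zip(top_down, bottom_up)]


def hybrid_cost(top_down, bottom_up):
    """Per-iteration cost when the cheaper direction is always taken."""
    return [min(td, bu) for td, bu in zip(top_down, bottom_up)]


totals = {
    "top-down": sum(TOP_DOWN),
    "bottom-up": sum(BOTTOM_UP),
    "hybrid": sum(hybrid_cost(TOP_DOWN, BOTTOM_UP)),
}
```

On these numbers the search starts top-down, switches to bottom-up for iterations 2 to 5 and returns to top-down for the last, nearly empty frontier.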

Results and prospects: Performance analysis

CPU/GPU Comparison

- one CPU or GPU
- Different graph scales

[Figure: CPU/GPU Comparison. GTEPS (0 to 3) versus SCALE (16 to 21). Legend: GTX 970, GTX 780 Ti, K20X, CPU E5-2650v2, TX1, CPU GRAPH500]

Results and prospects: Scalability

[Figure, two panels: "weak scaling" and "strong scaling (SCALE=21)". GTEPS versus #GPU (1, 4, 16, 64) for CPU and GPU; the weak-scaling points are labelled with SCALE 21, 23, 25 and 27]

- ROMEO → 105th (Nov. 2016)

Results and prospects: P100 GPU

P100 GPU

- Pascal Architecture
- Several improvements
- Base component of DGX-1

Communication, NVLink

- DGX-1, Power 8
- 40 GB/s bidirectional ⇒ Advantage for graphs

Kepler vs Pascal

Product           K20X        Tesla P100
Arch              Kepler      Pascal
GPU               GK110       GP100
SMs               14          56          | More concurrent blocks
FP32/SM           192         64
FP32/GPU          2688        3584        | More concurrent threads
FP64/SM           64          32
FP64/GPU          896         1792
Base Clock        732 MHz     1328 MHz
FP32 GFLOPs       3950        10600
FP64 GFLOPs       1310        5300
Memory Interface  384b GDDR5  4096b HBM2
Memory Size       6 GB        16 GB       | 2.6× more mem.
L2 Cache Size     1536 KB     4096 KB
Register/SM       256 KB      256 KB      | same register size
Register/GPU      3584 KB     14336 KB
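The per-GPU rows of the comparison are the per-SM values multiplied by the SM count, which a few lines of arithmetic confirm. The dictionary keys below are hypothetical names chosen for the example; the values are those of the table.

```python
SPECS = {
    "K20X": {"sms": 14, "fp32_per_sm": 192, "fp64_per_sm": 64,
             "reg_kb_per_sm": 256, "mem_gb": 6},
    "P100": {"sms": 56, "fp32_per_sm": 64, "fp64_per_sm": 32,
             "reg_kb_per_sm": 256, "mem_gb": 16},
}


def per_gpu(spec):
    """Whole-GPU totals derived from the per-SM figures."""
    return {
        "fp32": spec["sms"] * spec["fp32_per_sm"],
        "fp64": spec["sms"] * spec["fp64_per_sm"],
        "reg_kb": spec["sms"] * spec["reg_kb_per_sm"],
    }


kepler = per_gpu(SPECS["K20X"])
pascal = per_gpu(SPECS["P100"])
sm_ratio = SPECS["P100"]["sms"] / SPECS["K20X"]["sms"]
mem_ratio = SPECS["P100"]["mem_gb"] / SPECS["K20X"]["mem_gb"]
```

The P100 has four times as many SMs with the same register file per SM, so the total register space grows fourfold while the FP32 core count grows only by a third.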

Kepler vs Pascal

ROMEO Supercomputer

- 105th of GRAPH500 (November 2016 list)

Benchmark for architectures and accelerators

- Machine MESCA: 12TB of RAM + 8 sockets (256 threads)
- FPGA: Intel partnership
- Xeon Phi: Knights Landing ?
- IBM OpenPower: new communications device (NVLINK)

Applications

- Social networks
- Management of electric networks
- Big data and deep learning
