Graph500: from Kepler to Pascal

Julien Loiseau, Michaël Krajecki, François Alin and Christophe Jaillet
GTC 2017

University of Reims Champagne-Ardenne (URCA)
Multidisciplinary university
- about 27,000 students
- 5 campuses: Reims, Troyes, Charleville-Mézières, Chaumont and Châlons-en-Champagne
- A wide initial undergraduate studies program
- Graduate studies and PhD program linked with research labs

HPC issues: Power efficiency
Exascale architecture
- Computational power: Petaflop → ×1000 → Exaflop
- Moore's law is over
- Energy efficiency: 8 MW → ×1000 → 8 GW?

HPC issues: HPC architectures
[Diagram: CPU(s) + accelerator(s) (Xeon Phi, GPU, FPGA, ASIC) connected through MPI; storage levels CPU/GPU memory, SSD and HDD, labelled in-core and out-of-core.]

HPC issues: ROMEO, Reims, France
ROMEO supercomputer
- Reims, Champagne-Ardenne, France
- 130 nodes
- 2 × CPU Intel Xeon E5-2650v2 2.6 GHz (16 GB RAM)
- 2 × GPU NVIDIA K20Xm (6 GB RAM)
- FatTree with InfiniBand
- 1 × DGX-1 node

HPC issues: ROMEO, Regional HPC Center
Its mission is to deliver, for both industrial and academic researchers:
- high performance computing resources
- secured storage spaces
- specific & scientific software
- advanced user support in exploiting these resources
- in-depth expertise in different engineering fields: HPC, applied mathematics, physics, biophysics and chemistry, ...
- Promote and disseminate HPC and simulation to companies / SMBs
- Identify, experiment with and master breakthrough technologies
  - which give new opportunities for our users
  - from technology watching to production
  - for all research domains
HPC issues: GPU integration
[Timeline figure, 2008 to 2016]
- 2008: first server, 1 server with Tesla S1070, 960 cores/U
- 2010: cluster, 10% of nodes with GPU, Fermi M2090
- 2013: cluster, 100% of nodes with GPU, 260 K20x, TOP 500 & GREEN 500
- 2016: DGX1 server dedicated to Deep Learning, 8 GPU P100

HPC issues: Benchmarking HPC architectures
How to compare the computing power of parallel architectures?
TOP500
- LINPACK
- Solving n equations with n unknowns
- "Regular": FLoating-point Operations Per Second, FLOPS
GRAPH500
- BFS
- Large randomly generated graphs
- "Irregular": Traversed Edges Per Second, TEPS

GRAPH500: Protocol and ranking
Graph algorithms:
- Irregular memory access
- Irregular communications
- No heavy computation step
Data-dense application
Steps
- Graph generation (SKG, RMAT)
- Randomly sample 64 unique root vertices
- Structure generation
- For each root vertex:
  - BFS
  - Validate BFS tree

GRAPH500: Protocol and ranking
Goal: Breadth First Search on a random graph
[Figure: BFS from the source vertex through levels 1, 2 and 3.]

GRAPH500: Problem scale
Graph500 current list (Nov. 2016)
- Best CPU & GPU machines:

  Rank  Name          Scale  GTEPS
  (1)   K computer    40     38621.4
  (2)   Sunway        40     23755.7
  (3)   Sequoia       41     23751
  ...   ...           ...    ...
  (31)  TSUBAME 2.0   35     462.25
  (39)  GSIC Center   35     317.09
  (43)  HA-PACS       32     223.634

- BlueGene → 19 in top 30
- GPU → NVIDIA

- 2^SCALE vertices
- 2^(SCALE+4) edges

  Problem size  Scale  Memory (TB)
  Toy           26     0.0172
  Mini          29     0.1374
  Small         32     1.0995
  Medium        36     17.5922
  Large         39     140.7375
  Huge          42     1125.8999

- For graph generation
- Converted before use

GRAPH500: Graph generation
[Figure: sparse graph adjacency matrix (from-nodes × to-nodes) split recursively into quadrants a, b, c, d.]
- Kronecker
- Generation: a = 0.57, b = 0.19, c = 0.19, d = 0.05
- Edges: 16× more than vertices
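The problem-scale arithmetic and the Kronecker generation step above can be sketched in a few lines of Python. This is only an illustration, not the reference generator: the function names, the 16 bytes per edge (two 64-bit vertex ids) and the mapping of quadrants a, b, c, d to top-left, top-right, bottom-left and bottom-right are assumptions. The 16-byte assumption happens to reproduce the memory column of the problem-size table. Vertex-label scrambling, which the reference generator performs, is omitted.

```python
import random

EDGE_FACTOR = 16      # from the slides: 16x more edges than vertices
BYTES_PER_EDGE = 16   # assumption: two 64-bit vertex ids per edge


def problem_size_tb(scale):
    """Edge-list size in TB (10**12 bytes) for a graph of 2**scale vertices."""
    edges = EDGE_FACTOR * 2 ** scale          # = 2**(scale + 4)
    return edges * BYTES_PER_EDGE / 1e12


def kronecker_edges(scale, a=0.57, b=0.19, c=0.19, seed=0):
    """Sample 2**(scale + 4) directed edges, R-MAT style.

    At each of the `scale` recursion levels one quadrant is chosen with
    probabilities a, b, c and d = 1 - a - b - c, which fixes one bit of
    the source id and one bit of the destination id.
    """
    rng = random.Random(seed)
    edges = []
    for _ in range(EDGE_FACTOR * 2 ** scale):
        src = dst = 0
        for _ in range(scale):
            src <<= 1
            dst <<= 1
            r = rng.random()
            if r < a:
                pass                  # quadrant a: both bits stay 0
            elif r < a + b:
                dst |= 1              # quadrant b (assumed top-right)
            elif r < a + b + c:
                src |= 1              # quadrant c (assumed bottom-left)
            else:
                src |= 1              # quadrant d: both bits set
                dst |= 1
        edges.append((src, dst))
    return edges


if __name__ == "__main__":
    for name, scale in [("Toy", 26), ("Mini", 29), ("Small", 32)]:
        print(name, scale, round(problem_size_tb(scale), 4))
    print(kronecker_edges(3)[:5])
```

Because quadrant a is far more likely than d, low-numbered vertices collect most of the edges, which is what produces the skewed degree distribution that makes the traversal irregular.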
GRAPH500: Data structure format
Structure format
→ Bitmap
- Natural representation
- 2^20 vertices = 128 GB
- BG/Q version
→ CSR/CSC (Compressed Sparse Rows/Columns)
- Compressed format
- 2^20 vertices < 1 GB
- BG/P and GPU version

Compressed Sparse Row

  Sparse matrix
       0  1  2  3  4
    0  0  1  0  1  1
    1  1  0  1  0  1
    2  0  1  0  0  1
    3  1  0  0  0  0
    4  1  1  1  0  0

  Row pointers:   0 3 6 8 9 12
  Column indices: 1 3 4 0 2 4 1 4 0 0 1 2

GRAPH500: Parallel algorithm
- Split the adjacency matrix into blocks

  Generate output queue      Share input queue
  A0,0 A0,1 A0,2 A0,3        A0,0 A1,0 A2,0 A3,0
  A1,0 A1,1 A1,2 A1,3        A0,1 A1,1 A2,1 A3,1
  A2,0 A2,1 A2,2 A2,3        A0,2 A1,2 A2,2 A3,2
  A3,0 A3,1 A3,2 A3,3        A0,3 A1,3 A2,3 A3,3

  Repeat until end

- l × l machines with l = 2^k (k ∈ N)
- Machine M(i,j), block A(i,j) → vertices "In" R(i) → vertices "Out" R(j)
- Predecessors, 1D distribution: M(i,j) gets 1/4 of the vertices in R(i)

GRAPH500: Exploration
Search for children: CSR
- Top-down
- in_queue → out_queue
Search for parents: CSC
- Bottom-up
- out_queue & visited, in_queue
[Figure labels: current frontier, not yet visited.]

  Iteration  Top-down    Bottom-up   Hybrid version
  0          27          22,090,111  27
  1          8,156       1,568,798   8,156
  2          3,695,684   587,893     587,893
  3          19,565,465  12,586      12,586
  4          214,578     8,256       8,256
  5          5,865       1,201       1,201
  6          12          156         12

Results and prospects: Performance analysis
CPU/GPU comparison
- one CPU or GPU
- Different graph scales
[Figure: CPU/GPU comparison, GTEPS (0 to 3) versus SCALE (16 to 21). Legend entries: GTX 970, GTX 780 Ti, K20X, CPU E5-2650v2, TX1, CPU GRAPH500.]

Results and prospects: Scalability
[Figure: two panels, weak scaling and strong scaling (SCALE=21), GTEPS versus #GPU (1, 4, 16, 64), with CPU and GPU curves; the weak-scaling panel is annotated 21, 23, 25, 27.]
- ROMEO → 105th (Nov. 2016)
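The CSR layout and the top-down / bottom-up exploration from the data structure and exploration slides can be illustrated with a small single-process Python sketch. It is not the authors' GPU implementation. The function names and the frontier-size threshold used to switch direction are assumptions made for the example; the slides only show that the hybrid version takes the cheaper direction at each iteration. The matrix is the 5 × 5 example from the Compressed Sparse Row slide.

```python
SLIDE_MATRIX = [
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]


def dense_to_csr(matrix):
    """Return (row_pointers, column_indices) for a dense 0/1 matrix."""
    row_ptr, col_idx = [0], []
    for row in matrix:
        col_idx.extend(j for j, value in enumerate(row) if value)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def neighbours(structure, v):
    row_ptr, col_idx = structure
    return col_idx[row_ptr[v]:row_ptr[v + 1]]


def top_down_step(csr, in_queue, parent):
    """Search for children: scan the out-edges of every frontier vertex."""
    out_queue = []
    for u in in_queue:
        for v in neighbours(csr, u):
            if parent[v] == -1:
                parent[v] = u
                out_queue.append(v)
    return out_queue


def bottom_up_step(csc, in_queue, parent):
    """Search for parents: each unvisited vertex looks for a frontier parent."""
    frontier = set(in_queue)
    out_queue = []
    for v in range(len(parent)):
        if parent[v] == -1:
            for u in neighbours(csc, v):
                if u in frontier:
                    parent[v] = u
                    out_queue.append(v)
                    break             # one parent is enough
    return out_queue


def hybrid_bfs(matrix, root, threshold=0.25):
    """BFS that goes bottom-up when the frontier is large (assumed heuristic)."""
    n = len(matrix)
    csr = dense_to_csr(matrix)
    # CSC of the matrix is the CSR of its transpose.
    csc = dense_to_csr([list(column) for column in zip(*matrix)])
    parent = [-1] * n
    parent[root] = root
    frontier = [root]
    while frontier:
        if len(frontier) > threshold * n:
            frontier = bottom_up_step(csc, frontier, parent)
        else:
            frontier = top_down_step(csr, frontier, parent)
    return parent


if __name__ == "__main__":
    print(dense_to_csr(SLIDE_MATRIX))
    print(hybrid_bfs(SLIDE_MATRIX, root=0))
```

The bottom-up step stops scanning a vertex as soon as one parent is found, which is why it wins once most remaining vertices have a neighbour in the frontier, as in iterations 2 to 5 of the table above.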
Results and prospects: P100 GPU
P100 GPU
- Pascal architecture
- Several improvements
- Base component of DGX-1
Communication, NVLink
- DGX-1, Power 8
- 40 GB/s bidirectional
→ Advantage for graph

Kepler vs Pascal

  Product           K20X        Tesla P100   Note
  Arch              Kepler      Pascal
  GPU               GK110       GP100
  SMs               14          56           More concurrent blocks
  FP32/SM           192         64
  FP32/GPU          2688        3584         More concurrent threads
  FP64/SM           64          32
  FP64/GPU          896         1792
  Base clock        732 MHz     1328 MHz
  FP32 GFLOPs       3950        10600
  FP64 GFLOPs       1310        5300
  Memory interface  384b GDDR5  4096b HBM2
  Memory size       6 GB        16 GB        2.6× more mem.
  L2 cache size     1536 KB     4096 KB
  Register/SM       256 KB      256 KB       Same register size
  Register/GPU      3584 KB     14336 KB

Kepler vs Pascal
ROMEO Supercomputer
- 105th of GRAPH500 (November 2016 list)
Benchmark for architectures and accelerators
- Machine MESCA: 12 TB of RAM + 8 sockets (256 threads)
- FPGA: Intel partnership
- Xeon Phi: Knights Landing?
- IBM OpenPower: new communications device (NVLINK)
Applications
- Social networks
- Management of electrical networks
- Big data and deep learning
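The per-GPU rows of the Kepler vs Pascal table follow from the per-SM rows multiplied by the number of SMs. The short Python check below makes that arithmetic explicit; the dictionary layout and the function name are illustrative assumptions, and the figures are the ones given in the table.

```python
# Per-SM figures copied from the Kepler vs Pascal table.
SPECS = {
    "K20X": {"sms": 14, "fp32_per_sm": 192, "fp64_per_sm": 64, "reg_kb_per_sm": 256},
    "P100": {"sms": 56, "fp32_per_sm": 64, "fp64_per_sm": 32, "reg_kb_per_sm": 256},
}


def per_gpu(spec):
    """Derive whole-GPU totals from the per-SM figures."""
    sms = spec["sms"]
    return {
        "fp32_cores": sms * spec["fp32_per_sm"],
        "fp64_cores": sms * spec["fp64_per_sm"],
        "register_kb": sms * spec["reg_kb_per_sm"],
    }


if __name__ == "__main__":
    for name, spec in SPECS.items():
        print(name, per_gpu(spec))
```

Pascal has fewer cores per SM but four times as many SMs, so the register file per core grows while the register file per SM stays at 256 KB.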
