Assessing Performance of High-Performance Computing Systems

Tarek El-Ghazawi

Department of Electrical and Computer Engineering, The George Washington University

Performance of MPPs

Theoretical Peak Performance vs. Actual
Example:
» AMD Opteron 6100 - 12 cores, clock speed 1.9 GHz, 4 floating-point operations per cycle
  – 12 * 4 * 1.9 G = 91.2 GFLOPS
» XE6m from the GWU HPC Lab - AMD Opteron 6100 (1.9 GHz, 12 cores per chip), 2 CPUs per node, 56 nodes
  – 91.2 GFLOPS * 2 * 56 = 10,214.4 GFLOPS ≈ 10.2 TFLOPS (measured Rmax (Linpack) = 7.9 TFLOPS)
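The same peak calculation as a minimal C sketch; all inputs are the slide's numbers, and only the variable names are invented here:

```c
/* Minimal sketch: theoretical peak FLOPS using the XE6m numbers from the example above. */
#include <stdio.h>

int main(void) {
    double clock_ghz       = 1.9;   /* clock speed in GHz             */
    int    flops_per_cycle = 4;     /* floating-point ops per cycle   */
    int    cores_per_cpu   = 12;    /* AMD Opteron 6100               */
    int    cpus_per_node   = 2;
    int    nodes           = 56;

    double peak_per_cpu = clock_ghz * flops_per_cycle * cores_per_cpu;  /* GFLOPS */
    double peak_system  = peak_per_cpu * cpus_per_node * nodes;         /* GFLOPS */

    printf("Peak per CPU: %.1f GFLOPS\n", peak_per_cpu);                /* 91.2 */
    printf("System peak:  %.1f GFLOPS (%.1f TFLOPS)\n",
           peak_system, peak_system / 1000.0);                          /* ~10214.4, ~10.2 */
    return 0;
}
```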

Performance of MPPs - Metrics

Execution Time - wall clock time

Throughput - amount of work per unit time
» MIPS and its problems
» MFLOPS and its problems
» Application-related measures - e.g., # of particle interactions/second, pixels/second, ...
Speedup - how many times the parallel execution is faster than the sequential execution
Efficiency - percentage of the computing resources utilized during an execution
All of the above are typically considered in the light of a given program

Performance of MPPs

Scalability - the ability to maintain performance gains when system and/or problem size increases
» strong scaling - how the processing time varies with the number of processors for a fixed total problem size
» weak scaling - how the processing time varies with the number of processors for a fixed problem size per processor
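A minimal sketch of how the total problem size is typically chosen in the two kinds of scaling experiments; the constants N_FIXED and N_PER_PROC are illustrative, not from the slides:

```c
/* Illustrative only: how problem size N is chosen in scaling experiments.
 * P = number of processes; N_FIXED and N_PER_PROC are made-up constants. */
#include <stdio.h>

#define N_FIXED     (1 << 24)   /* total size held constant (strong scaling)     */
#define N_PER_PROC  (1 << 20)   /* size per process held constant (weak scaling) */

int main(void) {
    for (int P = 1; P <= 64; P *= 2) {
        long strong_n = N_FIXED;               /* same total work, split P ways */
        long weak_n   = (long)N_PER_PROC * P;  /* total work grows with P       */
        printf("P=%2d  strong: N=%ld (N/P=%ld)   weak: N=%ld (N/P=%d)\n",
               P, strong_n, strong_n / P, weak_n, N_PER_PROC);
    }
    return 0;
}
```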

Performance Metrics

Speedup (strong scaling) - ratio of completion time on one processor to completion time on the parallel system:

$$S_p = \frac{\text{Uniprocessor Execution Time}}{\text{Parallel Execution Time}} = \frac{T_s}{T_p}$$

where $T_s$ is the time for the best serial algorithm and $T_p$ is the time for the parallel algorithm using p processors.

Scaled Speedup (weak scaling)
» Parallel processing gain over sequential processing, where the problem size scales up with the computing power
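A minimal OpenMP sketch of a strong-scaling speedup measurement, assuming a simple reduction loop as the workload; the workload, problem size, and thread counts are illustrative, and strictly $T_s$ should come from the best serial algorithm rather than the one-thread run used here:

```c
/* Illustrative strong-scaling measurement: time the same fixed-size workload on
 * 1 and on p OpenMP threads, then report S_p = T_s / T_p.
 * Compile with, e.g., gcc -O2 -fopenmp. */
#include <stdio.h>
#include <omp.h>

#define N 100000000L                     /* fixed problem size (illustrative) */

static double timed_sum(int nthreads) {
    static volatile double sink;         /* keeps the sum from being optimized away */
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum) num_threads(nthreads)
    for (long i = 1; i <= N; i++)
        sum += 1.0 / (double)i;          /* arbitrary floating-point work */
    double t1 = omp_get_wtime();
    sink = sum;
    return t1 - t0;
}

int main(void) {
    double ts = timed_sum(1);            /* T_s approximated by the 1-thread time */
    printf("T_s = %.3f s\n", ts);
    for (int p = 2; p <= 8; p *= 2) {
        double tp = timed_sum(p);        /* T_p: time on p threads */
        printf("p = %d  T_p = %.3f s  S_p = %.2f\n", p, tp, ts / tp);
    }
    return 0;
}
```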

Performance Metrics

[Figure: speedup $S_p$ plotted against the number of processors n]
» Linear speedup: $S_p = n$
» Super-linear speedup: $S_p > n$ - due to hardware (cache) or the algorithm (search)
» Sub-linear speedup: $S_p < n$

Causes of superlinear speedup:
– Improved utilization of memory - as more processors are used, cache is used instead of main memory and disk
– Less work in the parallel program
(X.H. Sun and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines," IEEE Trans. on Parallel and Distributed Systems, Nov. 1995)

Sources of parallel overheads:
– Interprocessor communication - constant or a function of n
– Load imbalance
– Synchronization
– Processing redundancy

Performance Metrics

Performance Issues
» overheads in making the sequential application parallel
  – redundancy - extra computation time
  – inter-processor communication
  – synchronization
  – load imbalance

[Figure: a sequential execution of time 100 compared with four 4-processor executions. Perfect parallelization (25 per processor) gives a speedup of 4.0; perfect load balancing but a parallel overhead of 10 per processor (35 each) gives 2.85; load imbalance (longest processor at 40) gives 2.5; load imbalance plus parallel overhead (longest processor at 50) gives 2.0. In each case the speedup is 100 divided by the longest per-processor time.]
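The figure's arithmetic in a small sketch: with load imbalance the parallel time is set by the most loaded processor, so S = T_seq / max_i(t_i). The first three time vectors follow the figure; the last is illustrative (only its maximum of 50 is from the figure):

```c
/* Speedup in the presence of load imbalance: S = T_seq / max_i(t_i). */
#include <stdio.h>

static double speedup(double t_seq, const double t[], int p) {
    double t_par = 0.0;
    for (int i = 0; i < p; i++)
        if (t[i] > t_par) t_par = t[i];      /* slowest processor dominates */
    return t_seq / t_par;
}

int main(void) {
    double t_seq = 100.0;
    double perfect[4]   = {25, 25, 25, 25};  /* perfect parallelization             */
    double overhead[4]  = {35, 35, 35, 35};  /* balanced, but +10 overhead each     */
    double imbalance[4] = {30, 20, 40, 10};  /* load imbalance                      */
    double both[4]      = {50, 30, 40, 20};  /* imbalance + overhead (illustrative) */

    printf("perfect:   S = %.2f\n", speedup(t_seq, perfect, 4));    /* 4.00                       */
    printf("overhead:  S = %.2f\n", speedup(t_seq, overhead, 4));   /* 2.86 (slide rounds to 2.85) */
    printf("imbalance: S = %.2f\n", speedup(t_seq, imbalance, 4));  /* 2.50                       */
    printf("both:      S = %.2f\n", speedup(t_seq, both, 4));       /* 2.00                       */
    return 0;
}
```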

Performance Metrics

Efficiency - achieved fraction of the total potential parallel processing gain

$$E = \frac{\text{Useful Parallel Time}}{\text{Overall Parallel Time}} = \frac{S_p}{n} \qquad (n = \text{number of processors})$$

Throughput - amount of work done per unit of time

Example - the problem size m needed to maintain an efficiency of 0.80 grows with the number of processors:
» on 4 procs, m = 64
» on 8 procs, m = 192
» on 16 procs, m = 512
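A minimal helper tying efficiency back to the measured times; the timings in main are placeholders, not measurements from these slides:

```c
/* Efficiency from measured times: E = S_p / n = T_s / (n * T_p). */
#include <stdio.h>

static double efficiency(double t_serial, double t_parallel, int n) {
    double speedup = t_serial / t_parallel;
    return speedup / n;
}

int main(void) {
    /* e.g., a code that takes 100 s serially and 16 s on 8 processors (placeholder numbers) */
    double e = efficiency(100.0, 16.0, 8);
    printf("S_p = %.2f, E = %.2f (%.0f%%)\n", 100.0 / 16.0, e, e * 100.0);  /* 6.25, 0.78, 78% */
    return 0;
}
```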

Performance Metrics - Amdahl's Law

Amdahl’s Law » The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application

» Let $\alpha$ = fraction of the program (algorithm) that is serial and cannot be parallelized.

» For instance:
  – Loop initialization
  – Reading/writing to a single disk
  – Procedure call overhead

With $n$ processors, the best parallel time is $T_p = \alpha T_s + (1-\alpha)\,T_s/n$, so

$$S_p = \frac{T_s}{\alpha T_s + (1-\alpha)\,T_s/n} = \frac{1}{\alpha + \frac{1-\alpha}{n}}, \qquad \lim_{n \to \infty} S_p = \frac{1}{\alpha}$$

($n$ = number of processors). Equivalently, if X% of the program is parallel and Y% is sequential, the speedup can never exceed 100/Y.

Performance Metrics - Example

Example: Consider Y = (a*b) + (c/d) + e, represented by the following dependence graph/schedule on a 2-processor system:

[Figure: dependence graph - cycle 1: a*b and c/d computed in parallel; cycle 2: the two results are added; cycle 3: e is added to produce Y]

T_sequential = 4
T_parallel = 3
Speedup = 4/3
Efficiency = (4/3)/2 = 0.66, or 66%
Throughput = 4/3 operations per cycle
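A small sketch of the Amdahl's Law bound above; the serial fractions and processor counts are illustrative:

```c
/* Amdahl's Law: S_p = 1 / (alpha + (1 - alpha)/n), and S_p -> 1/alpha as n grows. */
#include <stdio.h>

static double amdahl(double alpha, int n) {
    return 1.0 / (alpha + (1.0 - alpha) / (double)n);
}

int main(void) {
    double alphas[] = {0.01, 0.05, 0.10};   /* serial fraction (illustrative values) */
    for (int a = 0; a < 3; a++) {
        printf("alpha = %.2f  asymptotic limit = %.0f\n", alphas[a], 1.0 / alphas[a]);
        for (int n = 2; n <= 1024; n *= 4)
            printf("  n = %4d  S_p = %.2f\n", n, amdahl(alphas[a], n));
    }
    return 0;
}
```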

Performance of MPPs

Parallel Benchmarks

» Peak Performance reported by vendors! (TOP500 = Rpeak)
» Buyers need application-oriented assessment - benchmarks

– Numerical Aerodynamic Simulations (NAS) Parallel Benchmark (NPB) at NASA Ames
– Earth-and-Space-Sciences Parallel Benchmark (EPB) at NASA GSFC

– Linpack - TOP500 (Rmax): a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic
– HPCC
– Others

Performance of MPPs

The NAS Parallel Benchmarks (Numerical Aerodynamic Simulations)
Different versions/classes (Sample, A, B, C, and D)
8 scientific/engineering problems specified as computational tasks; the implementation is up to the vendor
» 5 kernels and 3 computational fluid dynamics (CFD) applications that do not favor any specific architecture
» Kernels (general numerical tasks) are
  – EP, an Embarrassingly Parallel Monte Carlo calculation (an illustrative sketch follows the CFD applications below)
     · runs on any number of processors, minimal communication
     · estimates the upper achievable limits for floating-point performance of parallel computers
  – MG, a multi-grid calculation
     · solves a 3D scalar Poisson equation
     · performs short- and long-range communication
  – CG, conjugate gradient - calculates the smallest eigenvalue of a matrix
     · irregular long-range communication
  – FT, a 3-D FFT - solves a 3D partial differential equation
     · long-range communication
  – IS, parallel integer sort (based on a bucket sort mechanism)
     · requires lots of small-exchange communication

Performance of MPPs

» Computational fluid dynamics (CFD) simulation applications
  – LU, solution of a block lower and upper triangular system
     · large number of small communications
  – SP, solution of multiple systems of scalar pentadiagonal equations
  – BT, solution of multiple systems of block tridiagonal equations
     · SP and BT - coarse-grain communication
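The illustrative EP-style sketch referenced above: an embarrassingly parallel Monte Carlo estimate whose only communication is the final reduction. This is not the actual NPB EP specification (which prescribes its own random-number generator and Gaussian-pair counting); the sample count and seeds are arbitrary:

```c
/* Toy "embarrassingly parallel" Monte Carlo kernel (illustrative, NOT NPB EP):
 * each thread generates independent random points and counts hits inside the
 * unit circle; the only shared step is the final reduction. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define SAMPLES 100000000L   /* illustrative sample count */

int main(void) {
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + (unsigned int)omp_get_thread_num(); /* per-thread seed */
        #pragma omp for
        for (long i = 0; i < SAMPLES; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;   /* rand_r is POSIX */
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;
        }
    }
    printf("pi ~= %.6f\n", 4.0 * (double)hits / (double)SAMPLES);
    return 0;
}
```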

Performance of MPPs

Class A workloads - Smaller Version

Benchmark   Size           Operations (x10^9)   MFLOPS (Y-MP/1)
EP          2^28           26.68                211
MG          256^3          3.905                176
CG          14,000         1.508                127
FT          256^2 x 128    5.631                196
IS          2^23 x 2^19    0.7812               68
LU          64^3           64.57                194
SP          64^3           102.0                216
BT          64^3           181.3                229

Cray Y-MP – Vector Machine

Performance of MPPs

Class B Workloads - A bigger version

Benchmark   Size           Operations (x10^9)   MFLOPS (Y-MP/1)
EP          2^30           100.9                -
MG          256^3          18.81                498
CG          75,000         54.89                447
FT          512 x 256^2    71.37                560
IS          2^25 x 2^21    3.150                244
LU          102^3          319.6                493
SP          102^3          447.1                627
BT          102^3          721.5                572

Cray Y-MP – Vector Machine

Performance of MPPs

» Sample Sustained Performance for Class B LU CFD Application

Computer System        Processors   Time     Ratio to C-90/1
Cray C-90 (vector)     1            684.5    1.00
                       16           51.6     12.6
Cray T3D (MPP)         32           517.9    1.25
                       512          38.69    16.8*
IBM SP-2 (cluster)     8            434.6    1.49
                       64           79.64    8.14
Intel Paragon (MPP)    64           675.0    0.96
                       256          254.0    2.55
SGI PCXL (cluster)     1            5699.    0.11
                       16           426.0    1.52
TMC CM-5E (MIMD)       32           595.0    1.09
                       128          318.0    2.04

Performance of MPPs

» Sample Sustained Performance Per Dollar for Class B BT Application

Computer System   Processors   Ratio to C-90/1   Nominal Cost ($M)   Performance per $M
Cray C-90         16           13.1              30.9                0.42
Cray T3D          256          17.7*             9.25                1.91
IBM SP-2          64           9.28              5.94                1.56
SGI PCXL          16           2.96              1.02                2.90*
TMC CM-5E         128          4.99              4.00                1.25

Note: in performance per unit cost, all the non-vector parallel systems outperform the C-90. HPCC works?

Performance of MPPs

LINPACK BENCHMARK

Linpack Benchmark report: "Performance of Various Computers Using Standard Linear Equations Software"
Implemented on top of another package called BLAS (Basic Linear Algebra Subprograms) and superseded by a package called LAPACK.
3 Benchmarks in the Linpack Report:
» Linpack Fortran n = 100 benchmark
  – Matrix of order 100
  – Ground rules:
     · Only compiler optimizations and no changes to the Fortran code
     · Solution must adhere to prescribed accuracy
» Linpack n = 1000 benchmark
  – Matrix of order 1000
  – Performance on two routines: DGEFA and DGESL
  – Ground rules:
     · Use any language to solve the linear equations
     · Solution must adhere to prescribed accuracy

Performance of MPPs

LINPACK BENCHMARK

» Linpack's Highly Parallel Computing benchmark (HPL)
  – Measures the best performance in solving a (random) dense linear system in double precision arithmetic
  – Ground rules:
     · The LU factorization and solver step can be replaced with a custom implementation
     · Solution must adhere to prescribed accuracy
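For scale, a single-node sketch of the kind of computation HPL times, assuming LAPACK's dgesv routine is available (link with, e.g., -llapack). The matrix order n = 2000 is illustrative and tiny compared with an HPL Nmax, and this is not the distributed HPL implementation, just the same solve-and-convert-to-GFLOP/s idea:

```c
/* Solve a random dense system A x = b in double precision and convert time to
 * GFLOP/s using the LINPACK operation count 2/3*n^3 + 2*n^2. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* LAPACK (Fortran interface): LU factorization + solve, column-major storage. */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void) {
    int n = 2000, nrhs = 1, info;                     /* n is illustrative */
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    int *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (long i = 0; i < (long)n * n; i++) a[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < n; i++) b[i] = rand() / (double)RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);     /* LU factor + solve */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double flops = 2.0 / 3.0 * (double)n * n * n + 2.0 * (double)n * n;
    printf("info=%d  time=%.3f s  %.2f GFLOP/s\n", info, secs, flops / secs / 1e9);

    free(a); free(b); free(ipiv);
    return 0;
}
```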

Performance of MPPs

HPL (High-Performance LINPACK Benchmark)

A library for solving linear algebra problems
Used to measure the performance of systems on the TOP500 list
Requirements: MPI and either BLAS or VSIPL
HPL (High Performance LINPACK) benchmark
» A special version of the LINPACK benchmark
» Solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers
» Able to scale the size of the problem and to optimize the software for better performance
Algorithm:
» recursive panel factorization
» multiple look-ahead depths
» bandwidth-reducing swapping
Output:

» Rmax: the performance in GFLOPS for the largest problem run on a machine.

» Nmax: the size of the largest problem run on a machine.

» N1/2: the size where half the Rmax execution rate is achieved.

» Rpeak: the theoretical peak performance GFLOPS for the machine.

Performance of MPPs - TOP 500 (Nov 2016)†
† https://www.top500.org/list/2016/11/

Rank 1: Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C 1.45 GHz, Sunway (NRCPC) - National Supercomputing Center in Wuxi, China - 10,649,600 cores - Rmax 93,014.6 TFlop/s - Rpeak 125,435.9 TFlop/s - 15,371 kW
Rank 2: Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200 GHz, TH Express-2, Intel Xeon Phi 31S1P (NUDT) - National Super Computer Center in Guangzhou, China - 3,120,000 cores - Rmax 33,862.7 TFlop/s - Rpeak 54,902.4 TFlop/s - 17,808 kW
Rank 3: Titan - Cray XK7, Opteron 6274 16C 2.200 GHz, Cray Gemini interconnect, NVIDIA K20x (Cray Inc.) - DOE/SC/Oak Ridge National Laboratory, United States - 560,640 cores - Rmax 17,590.0 TFlop/s - Rpeak 27,112.5 TFlop/s - 8,209 kW
Rank 4: Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom (IBM) - DOE/NNSA/LLNL, United States - 1,572,864 cores - Rmax 17,173.2 TFlop/s - Rpeak 20,132.7 TFlop/s - 7,890 kW
Rank 5: Cori - Cray XC40, Intel Xeon Phi 7250 68C 1.4 GHz, Aries interconnect (Cray Inc.) - DOE/SC/LBNL/NERSC, United States - 622,336 cores - Rmax 14,014.7 TFlop/s - Rpeak 27,880.7 TFlop/s - 3,939 kW
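Two derived figures of merit often read off this list are the HPL efficiency (Rmax/Rpeak) and energy efficiency (Rmax per watt). A small sketch using only the numbers given above:

```c
/* Derived metrics for the Nov 2016 top-5 systems listed above:
 * HPL efficiency = Rmax / Rpeak; (Rmax in TFlop/s) / (power in kW) = GFLOP/s per watt. */
#include <stdio.h>

struct sys { const char *name; double rmax_tf, rpeak_tf, power_kw; };

int main(void) {
    struct sys top5[] = {
        {"Sunway TaihuLight", 93014.6, 125435.9, 15371},
        {"Tianhe-2",          33862.7,  54902.4, 17808},
        {"Titan",             17590.0,  27112.5,  8209},
        {"Sequoia",           17173.2,  20132.7,  7890},
        {"Cori",              14014.7,  27880.7,  3939},
    };
    for (int i = 0; i < 5; i++) {
        double eff  = top5[i].rmax_tf / top5[i].rpeak_tf;   /* HPL efficiency */
        double gf_w = top5[i].rmax_tf / top5[i].power_kw;   /* GFLOP/s per W  */
        printf("%-18s  Rmax/Rpeak = %.0f%%   %.2f GFLOP/s per W\n",
               top5[i].name, 100.0 * eff, gf_w);
    }
    return 0;
}
```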

Processing vs. Memory Access - RandomAccess (RA)‡
‡ http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/

Random memory access results in a significant decrease in the overall performance of an application
RandomAccess (RA) is part of the HPC Challenge benchmark developed for the HPCS program
It outputs GUPS (Giga Updates Per Second) and is expected to achieve close to the "peak" capability of the memory system
GUPS is the number of memory locations randomly updated per second, divided by one billion
» A random address is generated and an update operation (read-modify-write) is performed on a table of 64-bit words
  – The modify operation can be add, and, or, or xor with a literal value
RandomAccess is run in three variants for the HPC Challenge test:
» Single process
» Embarrassingly parallel (all processes run identical work with no interaction)
» All processes collaborate and cooperate toward solving a single larger problem

Basic definition of the RA benchmark:
» $T[\,]$ = table of size $2^n$
» $\{A_i\}$ = stream of 64-bit integers of length $2^{n+2}$, generated by a primitive polynomial over GF(2) [the Galois Field of order 2]

For each $a_i$:
» Set $T[a_i\langle 63, 64-n\rangle] \leftarrow T[a_i\langle 63, 64-n\rangle] \oplus a_i$
where
» $\oplus$ denotes addition in GF(2), i.e., exclusive OR
» $a_i\langle j, k\rangle$ denotes the sequence of bits $j$ through $k$ within $a_i$
» the table size $2^n$ is the largest power of 2 that is less than or equal to half of main memory

The basic definition of RA is illustrated graphically on the next slide.

[Figure: graphical illustration of RA - a data-driven stream $\{A_i\}$ of $2^{n+2}$ 64-bit values; the highest n bits of each $a_i$ select an entry $T[k]$, $k = a_i\langle 63, 64-n\rangle$, in a table of $2^n$ 64-bit words occupying about half of memory, and the entry is updated by a bit-level exclusive OR, $T[k] \leftarrow T[k] \oplus a_i$.]

» The expected number of accesses per memory location is $E[\,\text{accesses to } T[k]\,] = 2^{n+2}/2^n = 4$
» The commutative and associative nature of $\oplus$ allows the updates to be processed in any order
» Acceptable error: 1%
» Look-ahead and storage: 1024 updates per "node"
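A minimal serial sketch of the update loop defined above; the table size and the xorshift address generator are stand-ins (the real benchmark prescribes a GF(2) pseudo-random stream, allows 1024-deep look-ahead, and is run in parallel):

```c
/* Serial sketch of the RandomAccess update loop: T[top n bits of a] ^= a,
 * repeated 2^(n+2) times, then GUPS = updates / seconds / 1e9.
 * N_LOG and the xorshift generator are illustrative, not the HPCC spec. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define N_LOG      26                  /* table of 2^26 64-bit words (512 MiB) */
#define TABLE_SIZE (1ULL << N_LOG)
#define UPDATES    (TABLE_SIZE << 2)   /* 2^(n+2) updates, ~4 per location */

static uint64_t next_addr(uint64_t x) {        /* xorshift64: stand-in generator */
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return x;
}

int main(void) {
    uint64_t *T = malloc(TABLE_SIZE * sizeof *T);
    for (uint64_t i = 0; i < TABLE_SIZE; i++) T[i] = i;

    uint64_t a = 0x123456789abcdef1ULL;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < UPDATES; i++) {
        a = next_addr(a);
        T[a >> (64 - N_LOG)] ^= a;     /* read-modify-write at a random location */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.4f GUPS\n", (double)UPDATES / secs / 1e9);
    free(T);
    return 0;
}
```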

"Twelve Ways to Fool the Masses†"

† Bailey, D. H. (1991). "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers."

1. Compare 32-bit results to others' 64-bit results
2. Present inner kernel results instead of the whole application
3. Use assembly coding and compare with others' Fortran or C codes
4. Scale the problem size with the number of processors, but don't tell
5. Quote performance results linearly projected to a full system
6. Compare with scalar, un-optimized, uniprocessor results on CRAYs
7. Compare with old code on an obsolete system
8. Use the parallel code as the baseline instead of the best sequential one
9. Quote performance in terms of utilization, speedups, or peak MFLOPS/$
10. Use inefficient algorithms to get high MFLOPS rates
11. Measure parallel times on a dedicated system, but uniprocessor times on a busy system
12. Show pretty pictures and videos, but don't talk about performance