Assessing Performance of High-Performance Computing Systems

Tarek El-Ghazawi

Department of Electrical and Computer Engineering, The George Washington University

Performance of MPPs

Theoretical Peak Performance vs. Actual
Example:
» AMD Opteron 6100 - 12 cores, clock speed 1.9 GHz, 4 floating-point operations per cycle
  – 12 * 4 * 1.9 G = 91.2 GFLOPS
» XE6m from the GWU HPC Lab - AMD Opteron 6100 (1.9 GHz, 12 cores per chip), 2 CPUs per node, 56 nodes
  – 91.2 GFLOPS * 2 * 56 = 10,214.4 GFLOPS ≈ 10.2 TFLOPS (measured Rmax (Linpack) = 7.9 TFLOPS)
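The same peak calculation as a minimal C sketch; all inputs are the slide's numbers, and only the variable names are invented here:

```c
/* Minimal sketch: theoretical peak FLOPS using the XE6m numbers from the example above. */
#include <stdio.h>

int main(void) {
    double clock_ghz       = 1.9;   /* clock speed in GHz             */
    int    flops_per_cycle = 4;     /* floating-point ops per cycle   */
    int    cores_per_cpu   = 12;    /* AMD Opteron 6100               */
    int    cpus_per_node   = 2;
    int    nodes           = 56;

    double peak_per_cpu = clock_ghz * flops_per_cycle * cores_per_cpu;  /* GFLOPS */
    double peak_system  = peak_per_cpu * cpus_per_node * nodes;         /* GFLOPS */

    printf("Peak per CPU: %.1f GFLOPS\n", peak_per_cpu);                /* 91.2 */
    printf("System peak:  %.1f GFLOPS (%.1f TFLOPS)\n",
           peak_system, peak_system / 1000.0);                          /* ~10214.4, ~10.2 */
    return 0;
}
```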

Performance of MPPs - Metrics

Execution Time - wall clock time

Throughput - amount of work per unit time
» MIPS and its problems
» MFLOPS and its problems
» Application-related measures - e.g., # of particle interactions/second, pixels/second, ...
Speedup - how many times the parallel execution is faster than the sequential execution
Efficiency - percentage of the computing resources utilized during an execution
All of the above are typically considered in the light of a given program

Performance of MPPs

Scalability - the ability to maintain performance gains when system and/or problem size increases
» strong scaling - how the processing time varies with the number of processors for a fixed total problem size
» weak scaling - how the processing time varies with the number of processors for a fixed problem size per processor
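A minimal sketch of how the total problem size is typically chosen in the two kinds of scaling experiments; the constants N_FIXED and N_PER_PROC are illustrative, not from the slides:

```c
/* Illustrative only: how problem size N is chosen in scaling experiments.
 * P = number of processes; N_FIXED and N_PER_PROC are made-up constants. */
#include <stdio.h>

#define N_FIXED     (1 << 24)   /* total size held constant (strong scaling)     */
#define N_PER_PROC  (1 << 20)   /* size per process held constant (weak scaling) */

int main(void) {
    for (int P = 1; P <= 64; P *= 2) {
        long strong_n = N_FIXED;               /* same total work, split P ways */
        long weak_n   = (long)N_PER_PROC * P;  /* total work grows with P       */
        printf("P=%2d  strong: N=%ld (N/P=%ld)   weak: N=%ld (N/P=%d)\n",
               P, strong_n, strong_n / P, weak_n, N_PER_PROC);
    }
    return 0;
}
```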

Performance Metrics

Speedup (strong scaling) - ratio of completion time on one processor to completion time on the parallel system:

$$S_p = \frac{\text{Uniprocessor Execution Time}}{\text{Parallel Execution Time}} = \frac{T_s}{T_p}$$

where $T_s$ is the time for the best serial algorithm and $T_p$ is the time for the parallel algorithm using p processors.

Scaled Speedup (weak scaling)
» Parallel processing gain over sequential processing, where the problem size scales up with the computing power
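A minimal OpenMP sketch of a strong-scaling speedup measurement, assuming a simple reduction loop as the workload; the workload, problem size, and thread counts are illustrative, and strictly $T_s$ should come from the best serial algorithm rather than the one-thread run used here:

```c
/* Illustrative strong-scaling measurement: time the same fixed-size workload on
 * 1 and on p OpenMP threads, then report S_p = T_s / T_p.
 * Compile with, e.g., gcc -O2 -fopenmp. */
#include <stdio.h>
#include <omp.h>

#define N 100000000L                     /* fixed problem size (illustrative) */

static double timed_sum(int nthreads) {
    static volatile double sink;         /* keeps the sum from being optimized away */
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum) num_threads(nthreads)
    for (long i = 1; i <= N; i++)
        sum += 1.0 / (double)i;          /* arbitrary floating-point work */
    double t1 = omp_get_wtime();
    sink = sum;
    return t1 - t0;
}

int main(void) {
    double ts = timed_sum(1);            /* T_s approximated by the 1-thread time */
    printf("T_s = %.3f s\n", ts);
    for (int p = 2; p <= 8; p *= 2) {
        double tp = timed_sum(p);        /* T_p: time on p threads */
        printf("p = %d  T_p = %.3f s  S_p = %.2f\n", p, tp, ts / tp);
    }
    return 0;
}
```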

Performance Metrics

[Figure: speedup $S_p$ plotted against the number of processors n]
» Linear speedup: $S_p = n$
» Super-linear speedup: $S_p > n$ - due to hardware (cache) or the algorithm (search)
» Sub-linear speedup: $S_p < n$

Causes of superlinear speedup:
– Improved utilization of memory - as more processors are used, cache is used instead of main memory and disk
– Less work in the parallel program
(X.H. Sun and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines," IEEE Trans. on Parallel and Distributed Systems, Nov. 1995)

Sources of parallel overheads:
– Interprocessor communication - constant or a function of n
– Load imbalance
– Synchronization
– Processing redundancy

Performance Metrics

Performance Issues
» overheads in making the sequential application parallel
  – redundancy - extra computation time
  – inter-processor communication
  – synchronization
  – load imbalance

[Figure: a sequential execution of time 100 compared with four 4-processor executions. Perfect parallelization (25 per processor) gives a speedup of 4.0; perfect load balancing but a parallel overhead of 10 per processor (35 each) gives 2.85; load imbalance (longest processor at 40) gives 2.5; load imbalance plus parallel overhead (longest processor at 50) gives 2.0. In each case the speedup is 100 divided by the longest per-processor time.]
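The figure's arithmetic in a small sketch: with load imbalance the parallel time is set by the most loaded processor, so S = T_seq / max_i(t_i). The first three time vectors follow the figure; the last is illustrative (only its maximum of 50 is from the figure):

```c
/* Speedup in the presence of load imbalance: S = T_seq / max_i(t_i). */
#include <stdio.h>

static double speedup(double t_seq, const double t[], int p) {
    double t_par = 0.0;
    for (int i = 0; i < p; i++)
        if (t[i] > t_par) t_par = t[i];      /* slowest processor dominates */
    return t_seq / t_par;
}

int main(void) {
    double t_seq = 100.0;
    double perfect[4]   = {25, 25, 25, 25};  /* perfect parallelization             */
    double overhead[4]  = {35, 35, 35, 35};  /* balanced, but +10 overhead each     */
    double imbalance[4] = {30, 20, 40, 10};  /* load imbalance                      */
    double both[4]      = {50, 30, 40, 20};  /* imbalance + overhead (illustrative) */

    printf("perfect:   S = %.2f\n", speedup(t_seq, perfect, 4));    /* 4.00                       */
    printf("overhead:  S = %.2f\n", speedup(t_seq, overhead, 4));   /* 2.86 (slide rounds to 2.85) */
    printf("imbalance: S = %.2f\n", speedup(t_seq, imbalance, 4));  /* 2.50                       */
    printf("both:      S = %.2f\n", speedup(t_seq, both, 4));       /* 2.00                       */
    return 0;
}
```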

Performance Metrics

Efficiency - achieved fraction of the total potential parallel processing gain

$$E = \frac{\text{Useful Parallel Time}}{\text{Overall Parallel Time}} = \frac{S_p}{n} \qquad (n = \text{number of processors})$$

Throughput - amount of work done per unit of time

Example - the problem size m needed to maintain an efficiency of 0.80 grows with the number of processors:
» on 4 procs, m = 64
» on 8 procs, m = 192
» on 16 procs, m = 512
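A minimal helper tying efficiency back to the measured times; the timings in main are placeholders, not measurements from these slides:

```c
/* Efficiency from measured times: E = S_p / n = T_s / (n * T_p). */
#include <stdio.h>

static double efficiency(double t_serial, double t_parallel, int n) {
    double speedup = t_serial / t_parallel;
    return speedup / n;
}

int main(void) {
    /* e.g., a code that takes 100 s serially and 16 s on 8 processors (placeholder numbers) */
    double e = efficiency(100.0, 16.0, 8);
    printf("S_p = %.2f, E = %.2f (%.0f%%)\n", 100.0 / 16.0, e, e * 100.0);  /* 6.25, 0.78, 78% */
    return 0;
}
```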

Performance Metrics - Amdahl's Law

Amdahl’s Law » The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application

» Let $\alpha$ = fraction of the program (algorithm) that is serial and cannot be parallelized.

» For instance:
  – Loop initialization
  – Reading/writing to a single disk
  – Procedure call overhead

With $n$ processors, the best parallel time is $T_p = \alpha T_s + (1-\alpha)\,T_s/n$, so

$$S_p = \frac{T_s}{\alpha T_s + (1-\alpha)\,T_s/n} = \frac{1}{\alpha + \frac{1-\alpha}{n}}, \qquad \lim_{n \to \infty} S_p = \frac{1}{\alpha}$$

($n$ = number of processors). Equivalently, if X% of the program is parallel and Y% is sequential, the speedup can never exceed 100/Y.

Performance Metrics - Example

Example: Consider Y = (a*b) + (c/d) + e, represented by the following dependence graph/schedule on a 2-processor system:

[Figure: dependence graph - cycle 1: a*b and c/d computed in parallel; cycle 2: the two results are added; cycle 3: e is added to produce Y]

T_sequential = 4
T_parallel = 3
Speedup = 4/3
Efficiency = (4/3)/2 = 0.66, or 66%
Throughput = 4/3 operations per cycle
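A small sketch of the Amdahl's Law bound above; the serial fractions and processor counts are illustrative:

```c
/* Amdahl's Law: S_p = 1 / (alpha + (1 - alpha)/n), and S_p -> 1/alpha as n grows. */
#include <stdio.h>

static double amdahl(double alpha, int n) {
    return 1.0 / (alpha + (1.0 - alpha) / (double)n);
}

int main(void) {
    double alphas[] = {0.01, 0.05, 0.10};   /* serial fraction (illustrative values) */
    for (int a = 0; a < 3; a++) {
        printf("alpha = %.2f  asymptotic limit = %.0f\n", alphas[a], 1.0 / alphas[a]);
        for (int n = 2; n <= 1024; n *= 4)
            printf("  n = %4d  S_p = %.2f\n", n, amdahl(alphas[a], n));
    }
    return 0;
}
```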

Performance of MPPs

Parallel Benchmarks

» Peak Performance reported by vendors! (TOP500 = Rpeak)
» Buyers need application-oriented assessment - benchmarks

– Numerical Aerodynamic Simulations (NAS) Parallel Benchmark (NPB) at NASA Ames
– Earth-and-Space-Sciences Parallel Benchmark (EPB) at NASA GSFC

– Linpack - TOP500 (Rmax): a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic
– HPCC
– Others

Performance of MPPs

The NAS Parallel Benchmarks (Numerical Aerodynamic Simulations)
Different versions/classes (Sample, A, B, C, and D)
8 scientific/engineering problems specified as computational tasks; the implementation is up to the vendor
» 5 kernels and 3 computational fluid dynamics (CFD) applications that do not favor any specific architecture
» Kernels (general numerical tasks) are
  – EP, an Embarrassingly Parallel Monte Carlo calculation (an illustrative sketch follows the CFD applications below)
     · runs on any number of processors, minimal communication
     · estimates the upper achievable limits for floating-point performance of parallel computers
  – MG, a multi-grid calculation
     · solves a 3D scalar Poisson equation
     · performs short- and long-range communication
  – CG, conjugate gradient - calculates the smallest eigenvalue of a matrix
     · irregular long-range communication
  – FT, a 3-D FFT - solves a 3D partial differential equation
     · long-range communication
  – IS, parallel integer sort (based on a bucket sort mechanism)
     · requires lots of small-exchange communication

Performance of MPPs

» Computational fluid dynamics (CFD) simulation applications
  – LU, solution of a block lower and upper triangular system
     · large number of small communications
  – SP, solution of multiple systems of scalar pentadiagonal equations
  – BT, solution of multiple systems of block tridiagonal equations
     · SP and BT - coarse-grain communication
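The illustrative EP-style sketch referenced above: an embarrassingly parallel Monte Carlo estimate whose only communication is the final reduction. This is not the actual NPB EP specification (which prescribes its own random-number generator and Gaussian-pair counting); the sample count and seeds are arbitrary:

```c
/* Toy "embarrassingly parallel" Monte Carlo kernel (illustrative, NOT NPB EP):
 * each thread generates independent random points and counts hits inside the
 * unit circle; the only shared step is the final reduction. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define SAMPLES 100000000L   /* illustrative sample count */

int main(void) {
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + (unsigned int)omp_get_thread_num(); /* per-thread seed */
        #pragma omp for
        for (long i = 0; i < SAMPLES; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;   /* rand_r is POSIX */
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;
        }
    }
    printf("pi ~= %.6f\n", 4.0 * (double)hits / (double)SAMPLES);
    return 0;
}
```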

Performance of MPPs

Class A workloads - Smaller Version

Benchmark   Size           Operations (x10^9)   MFLOPS (Y-MP/1)
EP          2^28           26.68                211
MG          256^3          3.905                176
CG          14,000         1.508                127
FT          256^2 x 128    5.631                196
IS          2^23 x 2^19    0.7812               68
LU          64^3           64.57                194
SP          64^3           102.0                216
BT          64^3           181.3                229

Cray Y-MP – Vector Machine

Performance of MPPs

Class B Workloads - A bigger version

Benchmark   Size           Operations (x10^9)   MFLOPS (Y-MP/1)
EP          2^30           100.9                -
MG          256^3          18.81                498
CG          75,000         54.89                447
FT          512 x 256^2    71.37                560
IS          2^25 x 2^21    3.150                244
LU          102^3          319.6                493
SP          102^3          447.1                627
BT          102^3          721.5                572

Cray Y-MP – Vector Machine

Performance of MPPs

» Sample Sustained Performance for Class B LU CFD Application

Computer System        Processors   Time     Ratio to C-90/1
Cray C-90 (vector)     1            684.5    1.00
                       16           51.6     12.6
Cray T3D (MPP)         32           517.9    1.25
                       512          38.69    16.8*
IBM SP-2 (cluster)     8            434.6    1.49
                       64           79.64    8.14
Intel Paragon (MPP)    64           675.0    0.96
                       256          254.0    2.55
SGI PCXL (cluster)     1            5699.    0.11
                       16           426.0    1.52
TMC CM-5E (MIMD)       32           595.0    1.09
                       128          318.0    2.04

Performance of MPPs

» Sample Sustained Performance Per Dollar for Class B BT Application

Computer System   Processors   Ratio to C-90/1   Nominal Cost ($M)   Performance per $M
Cray C-90         16           13.1              30.9                0.42
Cray T3D          256          17.7*             9.25                1.91
IBM SP-2          64           9.28              5.94                1.56
SGI PCXL          16           2.96              1.02                2.90*
TMC CM-5E         128          4.99              4.00                1.25

Note: in performance per unit cost, all the non-vector parallel systems outperform the C-90. HPCC works?

Performance of MPPs

LINPACK BENCHMARK

Linpack Benchmark report: "Performance of Various Computers Using Standard Linear Equations Software"
Implemented on top of another package called BLAS (Basic Linear Algebra Subprograms) and superseded by a package called LAPACK.
3 Benchmarks in the Linpack Report:
» Linpack Fortran n = 100 benchmark
  – Matrix of order 100
  – Ground rules:
     · Only compiler optimizations and no changes to the Fortran code
     · Solution must adhere to prescribed accuracy
» Linpack n = 1000 benchmark
  – Matrix of order 1000
  – Performance on two routines: DGEFA and DGESL
  – Ground rules:
     · Use any language to solve the linear equations
     · Solution must adhere to prescribed accuracy

Performance of MPPs

LINPACK BENCHMARK

» Linpack's Highly Parallel Computing benchmark (HPL)
  – Measures the best performance in solving a (random) dense linear system in double precision arithmetic
  – Ground rules:
     · The LU factorization and solver step can be replaced with a custom implementation
     · Solution must adhere to prescribed accuracy
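For scale, a single-node sketch of the kind of computation HPL times, assuming LAPACK's dgesv routine is available (link with, e.g., -llapack). The matrix order n = 2000 is illustrative and tiny compared with an HPL Nmax, and this is not the distributed HPL implementation, just the same solve-and-convert-to-GFLOP/s idea:

```c
/* Solve a random dense system A x = b in double precision and convert time to
 * GFLOP/s using the LINPACK operation count 2/3*n^3 + 2*n^2. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* LAPACK (Fortran interface): LU factorization + solve, column-major storage. */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void) {
    int n = 2000, nrhs = 1, info;                     /* n is illustrative */
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    int *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (long i = 0; i < (long)n * n; i++) a[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < n; i++) b[i] = rand() / (double)RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);     /* LU factor + solve */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double flops = 2.0 / 3.0 * (double)n * n * n + 2.0 * (double)n * n;
    printf("info=%d  time=%.3f s  %.2f GFLOP/s\n", info, secs, flops / secs / 1e9);

    free(a); free(b); free(ipiv);
    return 0;
}
```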

Performance of MPPs

HPL (High-Performance LINPACK Benchmark)

A library for solving linear algebra problems
Used to measure the performance of systems on the TOP500 list
Requirements: MPI and either BLAS or VSIPL
HPL (High Performance LINPACK) benchmark
» A special version of the LINPACK benchmark
» Solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers
» Able to scale the size of the problem and to optimize the software for better performance
Algorithm:
» recursive panel factorization
» multiple look-ahead depths
» bandwidth-reducing swapping
Output:

» Rmax: the performance in GFLOPS for the largest problem run on a machine.

» Nmax: the size of the largest problem run on a machine.

» N1/2: the size where half the Rmax execution rate is achieved.

» Rpeak: the theoretical peak performance GFLOPS for the machine.

Performance of MPPs - TOP 500 (Nov 2016)†
† https://www.top500.org/list/2016/11/

Rank 1: Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C 1.45 GHz, Sunway (NRCPC) - National Supercomputing Center in Wuxi, China - 10,649,600 cores - Rmax 93,014.6 TFlop/s - Rpeak 125,435.9 TFlop/s - 15,371 kW
Rank 2: Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200 GHz, TH Express-2, Intel Xeon Phi 31S1P (NUDT) - National Super Computer Center in Guangzhou, China - 3,120,000 cores - Rmax 33,862.7 TFlop/s - Rpeak 54,902.4 TFlop/s - 17,808 kW
Rank 3: Titan - Cray XK7, Opteron 6274 16C 2.200 GHz, Cray Gemini interconnect, NVIDIA K20x (Cray Inc.) - DOE/SC/Oak Ridge National Laboratory, United States - 560,640 cores - Rmax 17,590.0 TFlop/s - Rpeak 27,112.5 TFlop/s - 8,209 kW
Rank 4: Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom (IBM) - DOE/NNSA/LLNL, United States - 1,572,864 cores - Rmax 17,173.2 TFlop/s - Rpeak 20,132.7 TFlop/s - 7,890 kW
Rank 5: Cori - Cray XC40, Intel Xeon Phi 7250 68C 1.4 GHz, Aries interconnect (Cray Inc.) - DOE/SC/LBNL/NERSC, United States - 622,336 cores - Rmax 14,014.7 TFlop/s - Rpeak 27,880.7 TFlop/s - 3,939 kW
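Two derived figures of merit often read off this list are the HPL efficiency (Rmax/Rpeak) and energy efficiency (Rmax per watt). A small sketch using only the numbers given above:

```c
/* Derived metrics for the Nov 2016 top-5 systems listed above:
 * HPL efficiency = Rmax / Rpeak; (Rmax in TFlop/s) / (power in kW) = GFLOP/s per watt. */
#include <stdio.h>

struct sys { const char *name; double rmax_tf, rpeak_tf, power_kw; };

int main(void) {
    struct sys top5[] = {
        {"Sunway TaihuLight", 93014.6, 125435.9, 15371},
        {"Tianhe-2",          33862.7,  54902.4, 17808},
        {"Titan",             17590.0,  27112.5,  8209},
        {"Sequoia",           17173.2,  20132.7,  7890},
        {"Cori",              14014.7,  27880.7,  3939},
    };
    for (int i = 0; i < 5; i++) {
        double eff  = top5[i].rmax_tf / top5[i].rpeak_tf;   /* HPL efficiency */
        double gf_w = top5[i].rmax_tf / top5[i].power_kw;   /* GFLOP/s per W  */
        printf("%-18s  Rmax/Rpeak = %.0f%%   %.2f GFLOP/s per W\n",
               top5[i].name, 100.0 * eff, gf_w);
    }
    return 0;
}
```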

Processing vs. Memory Access - RandomAccess (RA)‡
‡ http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/

Random memory access results in a significant decrease in the overall performance of an application
RandomAccess (RA) is part of the HPC Challenge benchmark developed for the HPCS program
It outputs GUPS (Giga Updates Per Second) and is expected to achieve close to the "peak" capability of the memory system
GUPS is the number of memory locations randomly updated per second, divided by one billion
» A random address is generated and an update operation (read-modify-write) is performed on a table of 64-bit words
  – The modify operation can be add, and, or, or xor with a literal value
RandomAccess is run in three variants for the HPC Challenge test:
» Single process
» Embarrassingly parallel (all processes run identical work with no interaction)
» All processes collaborate and cooperate toward solving a single larger problem

Basic definition of the RA benchmark:
» $T[\,]$ = table of size $2^n$
» $\{A_i\}$ = stream of 64-bit integers of length $2^{n+2}$, generated by a primitive polynomial over GF(2) [the Galois Field of order 2]

For each $a_i$:
» Set $T[a_i\langle 63, 64-n\rangle] \leftarrow T[a_i\langle 63, 64-n\rangle] \oplus a_i$
where
» $\oplus$ denotes addition in GF(2), i.e., exclusive OR
» $a_i\langle j, k\rangle$ denotes the sequence of bits $j$ through $k$ within $a_i$
» the table size $2^n$ is the largest power of 2 that is less than or equal to half of main memory

The basic definition of RA is illustrated graphically on the next slide.

[Figure: graphical illustration of RA - a data-driven stream $\{A_i\}$ of $2^{n+2}$ 64-bit values; the highest n bits of each $a_i$ select an entry $T[k]$, $k = a_i\langle 63, 64-n\rangle$, in a table of $2^n$ 64-bit words occupying about half of memory, and the entry is updated by a bit-level exclusive OR, $T[k] \leftarrow T[k] \oplus a_i$.]

» The expected number of accesses per memory location is $E[\,\text{accesses to } T[k]\,] = 2^{n+2}/2^n = 4$
» The commutative and associative nature of $\oplus$ allows the updates to be processed in any order
» Acceptable error: 1%
» Look-ahead and storage: 1024 updates per "node"
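A minimal serial sketch of the update loop defined above; the table size and the xorshift address generator are stand-ins (the real benchmark prescribes a GF(2) pseudo-random stream, allows 1024-deep look-ahead, and is run in parallel):

```c
/* Serial sketch of the RandomAccess update loop: T[top n bits of a] ^= a,
 * repeated 2^(n+2) times, then GUPS = updates / seconds / 1e9.
 * N_LOG and the xorshift generator are illustrative, not the HPCC spec. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define N_LOG      26                  /* table of 2^26 64-bit words (512 MiB) */
#define TABLE_SIZE (1ULL << N_LOG)
#define UPDATES    (TABLE_SIZE << 2)   /* 2^(n+2) updates, ~4 per location */

static uint64_t next_addr(uint64_t x) {        /* xorshift64: stand-in generator */
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return x;
}

int main(void) {
    uint64_t *T = malloc(TABLE_SIZE * sizeof *T);
    for (uint64_t i = 0; i < TABLE_SIZE; i++) T[i] = i;

    uint64_t a = 0x123456789abcdef1ULL;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < UPDATES; i++) {
        a = next_addr(a);
        T[a >> (64 - N_LOG)] ^= a;     /* read-modify-write at a random location */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.4f GUPS\n", (double)UPDATES / secs / 1e9);
    free(T);
    return 0;
}
```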

"Twelve Ways to Fool the Masses†"

† Bailey, D. H. (1991). "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers."

1. Compare 32-bit results to others' 64-bit results
2. Present inner kernel results instead of the whole application
3. Use assembly coding and compare with others' Fortran or C codes
4. Scale the problem size with the number of processors, but don't tell
5. Quote performance results linearly projected to a full system
6. Compare with scalar, un-optimized, uniprocessor results on CRAYs
7. Compare with old code on an obsolete system
8. Use the parallel code as the baseline instead of the best sequential one
9. Quote performance in terms of utilization, speedups, or peak MFLOPS/$
10. Use inefficient algorithms to get high MFLOPS rates
11. Measure parallel times on a dedicated system, but uniprocessor times on a busy system
12. Show pretty pictures and videos, but don't talk about performance