Getting the Performance Out Of High Performance Computing

Jack Dongarra
Innovative Computing Laboratory, University of Tennessee
Computer Science and Mathematics Division, Oak Ridge National Laboratory
http://www.cs.utk.edu/~dongarra/


Getting the Performance into High Performance Computing


Technology Trends: Microprocessor Capacity

- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
- 2X transistors per chip every 1.5 years, called "Moore's Law"
- Microprocessors have become smaller, denser, and more powerful.
- Not just processors: storage, internet bandwidth, etc.

[Chart: peak performance vs. year, 1950-2010, tracking Moore's Law from 1 KFlop/s to 1 PFlop/s. Scalar machines (EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600) give way to super scalar and vector machines (Cray 1, Cray X-MP, Cray 2, TMC CM-2) around 1 GFlop/s, and then to parallel systems (TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, Earth Simulator) at the TFlop/s level and beyond.]

Next Generation: IBM Blue Gene/L and ASCI Purple

- Announced 11/19/02: one of two machines for LLNL
- 360 TFlop/s, 130,000 processors, Linux, FY 2005
- Plus ASCI Purple: IBM Power5-based, 12K processors, 100 TFlop/s

To Be Provocative… Citation in the Press, March 10th, 2008

[News clipping] DOE Supercomputers Sit Idle. WASHINGTON, Mar. 10, 2008: GAO reports that after almost 5 years of effort and several hundreds of M$'s spent at the DOE labs, the high performance computers recently purchased did not meet users' expectations and are sitting idle… Alan Laub, head of the DOE efforts, reports that the computer equ…

How could this happen?
- The complexity of programming these machines was underestimated
- Users were unprepared for the lack of reliability of the hardware and software
- Little effort was spent to carry out the medium- and long-term research activities needed to solve problems that were foreseen 5 years ago in the areas of applications, algorithms, middleware, programming models, and computer architectures, …

Software Technology & Performance

- Tendency to focus on the hardware
- Software is required to bridge an ever-widening gap
- The gap between potential and delivered performance is very steep
  - Performance only if the data and controls are set up just right
  - Otherwise, dramatic performance degradation: a very unstable situation
  - Will become more unstable as systems change and become more complex
- The challenge for applications, libraries, and tools is formidable at the TFlop/s level, even greater at PFlop/s; some might say insurmountable.

Linpack (100x100) Analysis: The Machine on My Desk 12 Years Ago and Today

- Compaq 386/SX20 with FPA: 0.16 Mflop/s
- Pentium 4, 2.8 GHz: 1317 Mflop/s
- 12 years gives a factor of ~8231: doubling in less than 12 months, for 12 years
- Moore's Law alone gives us a factor of 256. How do we get a factor > 8000?
  - Clock speed increase = 128x
  - External bus width & caching, 16 vs. 64 bits = 4x
  - Floating point, 4/8-bit multiply vs. 64 bits in 1 clock = 8x
  - Compiler technology = 2x
  - (128 x 4 x 8 x 2 = 8192, close to the observed 8231)
- A complex set of interactions between application, algorithms, programming language, compiler, machine instructions, and hardware
- Many layers of translation from the application to the hardware, changing with each generation
- However, the potential of that Pentium 4 is 5.6 Gflop/s and here we are getting 1.32 Gflop/s: still a factor of 4.25 off of peak

Where Does Much of Our Lost Performance Go? or, Why Should I Care About the Memory Hierarchy?

[Chart: Processor-DRAM memory gap (latency), 1980-2004. CPU performance ("Moore's Law") improves ~60% per year (2X every 1.5 years) while DRAM improves ~9% per year (2X every 10 years); the processor-memory performance gap grows about 50% per year.]

Optimizing Computation and Memory Use

- Computational optimizations
  - Theoretical peak: (# FPUs) x (flops/cycle) x (clock rate)
  - Pentium 4: (1 FPU) x (2 flops/cycle) x (2.8 GHz) = 5600 Mflop/s
- Operations like y = a x + y need 3 operands (24 bytes) for 2 flops; 5600 Mflop/s requires 8400 MWord/s of bandwidth from memory (see the sketch below)
- Memory optimization
  - Theoretical peak: (bus width) x (bus speed)
  - Pentium 4: (32 bits) x (533 MHz) = 2132 MB/s = 266 MWord/s
- Off by a factor of ~30 from what is required to drive the processor from memory at peak performance
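To make the operand traffic concrete, here is a minimal sketch (not from the slides; the function name is illustrative) of the y = a x + y kernel discussed above. Each iteration performs 2 flops but must move 3 eight-byte operands between memory and the processor, which is where the 8400 MWord/s requirement comes from.

```c
#include <stddef.h>

/* y = a*x + y: 2 flops per iteration, but 3 operands (24 bytes) of
 * memory traffic: load x[i], load y[i], store y[i].  Sustaining the
 * 5600 Mflop/s peak of the example Pentium 4 would therefore need
 * about 8400 MWord/s of bandwidth, roughly 30x what its bus delivers. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```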

Memory Hierarchy

- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.

[Diagram: the hierarchy runs from registers and on-chip cache (SRAM) on the processor datapath, through level 2 and 3 cache and main memory (DRAM), to distributed and remote cluster memory, secondary storage (disk), and tertiary storage (disk/tape). Access times grow from about 1 ns for registers, through 10s and 100s of ns for cache and memory, to 10s of ms for disk and 10s of seconds for tape; capacities grow from 100s of bytes through KB, MB, and GB to TB.]
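One standard way to exploit the locality the hierarchy rewards is loop blocking. Below is a minimal sketch (not from the slides) of a blocked matrix multiply; the block size BS is an illustrative placeholder that would normally be chosen so the working set fits in cache.

```c
enum { BS = 64 };  /* illustrative block size; in practice tuned to the cache */

/* C += A * B for n-by-n row-major matrices.  The blocked loops keep a
 * BS-by-BS working set of each operand in cache, so every element loaded
 * from DRAM is reused many times before it is evicted. */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```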

Tool To Help Understand What’s Going On In the Processor

- Complex system with many filters
- Need to identify bottlenecks
- Prioritize optimization
- Focus on important aspects


Tools for Performance Analysis, Modeling and Optimization

- PAPI (see the sketch below)
- DynInst
- SVPablo
- ROSE: a compiler framework for recognizing high-level abstractions and specifying transformations
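As a flavor of what a counter interface like PAPI provides, here is a minimal sketch assuming a PAPI installation on a platform that supports the PAPI_FP_OPS and PAPI_L1_DCM presets; error handling is abbreviated and the measured loop is a stand-in for a real kernel.

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    /* Initialize the library and build an event set with two presets:
     * floating-point operations and level-1 data-cache misses. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_FP_OPS);
    PAPI_add_event(eventset, PAPI_L1_DCM);

    PAPI_start(eventset);

    /* Stand-in for the kernel being studied. */
    volatile double s = 0.0;
    for (int i = 0; i < 1000000; i++)
        s += i * 0.5;

    PAPI_stop(eventset, counts);
    printf("FP ops: %lld  L1 D-cache misses: %lld\n", counts[0], counts[1]);
    return 0;
}
```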

Example: PERC on a Climate Model

- Interaction with the SciDAC climate development effort
- Profiling: identifying performance bottlenecks and prioritizing enhancements
- Evaluation of the code over a 9-month period
- Produced a 400% improvement via decreased overhead and increased scalability

Signatures: Key Factors in Applications and Systems that Affect Performance

- Application signatures
  - Characterization of the operations the application needs to perform
  - Description of the application's demands on resources
  - Algorithm signatures, operation counts, memory reference patterns, data dependencies, I/O characteristics
  - Software signatures: synchronization points, thread-level parallelism, instruction-level parallelism, ratio of memory references to floating-point operations
- Hardware signatures
  - Performance capabilities of the machine
  - Latencies and bandwidths of the memory hierarchy, local to a node and to remote nodes
  - Instruction issue rates, cache size, TLB size
- Execution signatures combine application and machine signatures to provide accurate performance models and to predict application behavior and performance

[Diagram: a parallel or distributed application feeds a performance monitor; live performance data drives observation and comparison against a model signature generated from the application signature, with feedback on the degree of similarity.]

Algorithms vs Applications

The table cross-references algorithm classes against nine application areas: Lattice Gauge (QCD), Quantum Chemistry, Weather Simulation, Computational Fluid Dynamics, Adjustment of Geodetic Networks, Inverse Problems, Structural Mechanics, Electronic Device Simulation, and Circuit Simulation.

- Sparse linear system solvers: marked for 6 of the 9 applications
- Linear least squares: 2
- Nonlinear algebraic system solvers: 4
- Sparse eigenvalue problems: 3
- FFT: 3
- Rapid elliptic problem solvers: 3
- Multigrid schemes: 2
- Stiff ODE solvers: 2
- Monte Carlo schemes: 2
- Integral transformations: 1

From: Supercomputing Tradeoffs and the Cedar System, E. Davidson, D. Kuck, D. Lawrie, and A. Sameh, in High-Speed Computing, Scientific Applications and Algorithm Design, ed. R. Wilhelmson, U. of Illinois Press, 1986.

Update to Sameh's Table?

Application Performance Matrix (http://www.krellinst.org/matrix/)

- Next step: look at
  - Application signatures
  - Algorithm choices
  - Software profile
  - Architecture (machine)
- Data mine to extract information
- Need signatures for A3S

Performance Tuning

- Motivation: the performance of many applications is dominated by a few kernels
- Conventional approach: hand-tuning by the user or vendor
  - Very time-consuming and tedious work
  - Even with intimate knowledge of the architecture and compiler, performance is hard to predict
  - A growing list of kernels to tune
  - Must be redone for every architecture and compiler
  - Compiler technology often lags the architecture
  - Not just a compiler problem:
    - The best algorithm may depend on the input, so some tuning must happen at run time
    - Not all algorithms are semantically or mathematically equivalent

Automatic Performance Tuning to Hide Complexity

- Approach: for each kernel
  1. Identify and generate a space of algorithms
  2. Search for the fastest one, by running them (a sketch of this search appears below)
- What is a space of algorithms? Depending on the kernel and input, it may vary in
  - instruction mix and order
  - memory access patterns
  - data structures
  - mathematical formulation
- When do we search?
  - Once per kernel and architecture
  - At compile time
  - At run time
  - All of the above
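Here is a minimal sketch (not ATLAS itself) of the generate-and-search idea: each candidate block size defines one algorithm variant, and the search simply times them all and keeps the fastest. Real tuners explore a much richer space (unrolling, instruction schedules, data layouts) and typically generate the candidate code rather than parameterizing a single routine.

```c
#include <stdio.h>
#include <time.h>

/* One point in the "space of algorithms": a blocked matrix multiply whose
 * block size b is the tuning parameter.  A real tuner would also vary
 * unrolling, instruction order, data layout, and so on. */
static void matmul_variant(int n, int b, const double *A, const double *B,
                           double *C)
{
    for (int ii = 0; ii < n; ii += b)
        for (int kk = 0; kk < n; kk += b)
            for (int i = ii; i < ii + b && i < n; i++)
                for (int k = kk; k < kk + b && k < n; k++)
                    for (int j = 0; j < n; j++)
                        C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* The "search" step: run each candidate and keep the fastest. */
int search_block_size(int n, const double *A, const double *B, double *C)
{
    const int candidates[] = {16, 32, 64, 128, 256};
    int best = candidates[0];
    double best_time = 1e30;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        clock_t t0 = clock();
        matmul_variant(n, candidates[i], A, B, C);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = candidates[i];
        }
    }
    printf("fastest block size: %d (%.3f s)\n", best, best_time);
    return best;
}
```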


Some Automatic Tuning Projects

- ATLAS (www.netlib.org/atlas) (Dongarra, Whaley): used in Matlab and many SciDAC and ASCI projects
- PHiPAC (www.icsi.berkeley.edu/~bilmes/phipac) (Bilmes, Asanovic, Vuduc, Demmel)
- Sparsity (www.cs.berkeley.edu/~yelick/sparsity) (Yelick, Im)
- Self-Adapting Linear Algebra Software (SALAS) (Dongarra, Eijkhout, Gropp, Keyes)
- FFTs and signal processing
  - FFTW (www.fftw.org): won the 1999 Wilkinson Prize for Numerical Software
  - SPIRAL (www.ece.cmu.edu/~spiral): extensions to other transforms, DSPs
  - UHFFT: extensions to higher dimensions, parallelism

[Chart: BLAS performance in Mflop/s (0 to 3500) comparing vendor BLAS, ATLAS BLAS, and reference F77 BLAS across architectures: DEC ev56-533, DEC ev6-500, AMD Athlon-600, HP 9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Intel P-III 933 MHz, Intel P-4 2.53 GHz w/SSE2, SGI R10000ip28-200, SGI R12000ip30-270, and Sun UltraSparc2-200.]
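From the application's point of view, getting the benefit of a tuned library like ATLAS is just a matter of calling the standard (C)BLAS interface and choosing which library to link against. The sketch below assumes a CBLAS header is available; the wrapper name and matrix shapes are illustrative.

```c
#include <cblas.h>

/* C = A*B using whatever BLAS the program is linked against: the
 * reference F77 BLAS, ATLAS, or a vendor library.  The application
 * code is identical; only the link line changes. */
void scaled_matmul(int m, int n, int k,
                   const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,   /* alpha, A, lda */
                     B, n,   /* B, ldb        */
                0.0, C, n);  /* beta, C, ldc  */
}
```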

Futures for High Performance Scientific Computing

- Numerical software will be adaptive, exploratory, and intelligent
- Determinism in numerical computing will be gone
  - After all, it's not reasonable to ask for exactness in numerical computations
  - Reproducibility, at a cost
- The importance of floating-point arithmetic will be undiminished
  - 16, 32, 64, 128 bits and beyond
- Reproducibility, fault tolerance, and auditability
- Adaptivity is key, so that applications can effectively use the resources

Citation in the Press, March 10th, 2008

[News clipping] DOE Supercomputers Live up to Expectations. WASHINGTON, Mar. 10, 2008: GAO reported today that after almost 5 years of effort and several hundreds of M$'s spent at DOE labs, the high performance computers recently purchased have exceeded users' expectations and are helping to solve some of our most challenging problems. Alan Laub, head of DOE's HPC efforts, reported today at the annual meeting of the SciDAC PIs…

How can this happen?
- Close interactions of the applications with the CS and Math ISIC groups
- Dramatic improvements in the adaptability of software to the execution environment
- Improved processor-memory bandwidth
- New large-scale system architectures and software
  - Aggressive fault management and reliability
- Exploration of some alternative architectures and languages
  - Application teams to help drive the design of new architectures

With Apologies to Gary Larson…

- SciDAC is helping
- Teams are developing the scientific computing software and hardware infrastructure needed to use terascale computers and beyond.
