US/Venezuela Workshop on High Performance Computing 2000 (WoHPC 2000), April 4-6, 2000, Puerto La Cruz, Venezuela

Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory, http://www.cs.utk.edu/~dongarra/

♦ Look at trends in HPC
  Ø Top500 statistics
♦ NetSolve
  Ø Example of grid middleware
♦ Performance on today's architecture
  Ø ATLAS effort
♦ Tools for performance evaluation
  Ø Performance API (PAPI)

- Started in 6/93 by Jack Dongarra, Hans W. Meuer and Erich Strohmaier

- Basis for analyzing the HPC market
- Quantify observations
- Detection of trends (market, architecture, technology)

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark
  Ø Ax = b, dense problem, TPP performance (see the sketch after this list)
- Updated twice a year
  Ø SC'xy in November
  Ø Meeting in Mannheim in June
- All data available from www.top500.org
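For reference, Rmax is the rate achieved solving a dense system Ax = b; below is a minimal sketch in C of how such a rate is computed, assuming the standard LINPACK operation count of 2/3*n^3 + 2*n^2 (the problem size and time are only illustrative):

    #include <stdio.h>

    /* LINPACK (TPP) rating: operations are counted as 2/3*n^3 + 2*n^2 for
       solving the dense system Ax = b, independent of the algorithm used. */
    static double linpack_gflops(double n, double seconds)
    {
        double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
        return flops / seconds / 1.0e9;
    }

    int main(void)
    {
        /* Illustrative values only: an order-20000 problem solved in 100 seconds. */
        printf("Rmax estimate: %.1f Gflop/s\n", linpack_gflops(20000.0, 100.0));
        return 0;
    }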

♦ Original classes of machines
  Ø Sequential
  Ø SMPs
  Ø MPPs
  Ø SIMDs
♦ Two new classes
  Ø Beowulf-class systems
  Ø Clustering of SMPs and DSMs
    Ø Requires additional terminology: "Constellation"
♦ A way for tracking trends
  Ø in performance
  Ø in market
  Ø in classes of HPC systems
  Ø in architecture
  Ø in technology

Constellation: Cluster of Clusters

♦ An ensemble of N nodes, each comprising p computing elements
♦ The p elements are tightly bound: shared memory (e.g. SMP, DSM)
♦ The N nodes are loosely coupled, i.e. distributed memory
♦ p is greater than N
♦ Distinction is which layer gives us the most power through parallelism

4 TF Blue Pacific SST:
  Ø 3 x 480 4-way SMP nodes
  Ø 3.9 TF peak performance
  Ø 2.6 TB memory
  Ø 2.5 Tb/s bisectional bandwidth
  Ø 62 TB disk
  Ø 6.4 GB/s delivered I/O bandwidth

[Chart: fastest computer over time (Gflop/s), 1991-1999: Cray Y-MP (8), Fujitsu VP-2600, NEC SX-3 (4), TMC CM-2 (2048), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), SGI ASCI Blue Mountain (5040), ASCI Blue Pacific SST (5808), Intel ASCI Red Xeon (9632).]

Top 10 of the Top500 list (November 1999):

RANK | MANUFACTURER | COMPUTER | RMAX [GF/s] | INSTALLATION SITE | COUNTRY | YEAR | AREA OF INSTALLATION | # PROC
1 | Intel | ASCI Red | 2379.6 | Sandia National Labs, Albuquerque | USA | 1999 | Research | 9632
2 | IBM | ASCI Blue-Pacific SST, IBM SP 604E | 2144 | Lawrence Livermore National Laboratory | USA | 1999 | Research | 5808
3 | SGI | ASCI Blue Mountain | 1608 | Los Alamos National Lab | USA | 1998 | Research | 6144
4 | SGI | T3E 1200 | 891.5 | Government | USA | 1998 | Classified | 1084
5 | Hitachi | SR8000 | 873.6 | University of Tokyo | Japan | 1999 | Academic | 128
6 | SGI | T3E 900 | 815.1 | Government | USA | 1997 | Classified | 1324
7 | SGI | Origin 2000 | 690.9 | Los Alamos National Lab/ACL | USA | 1999 | Research | 2048
8 | Cray/SGI | T3E 900 | 675.7 | Naval Oceanographic Office, Bay Saint Louis | USA | 1999 | Research Weather | 1084
9 | SGI | T3E 1200 | 671.2 | Deutscher Wetterdienst | Germany | 1999 | Research Weather | 812
10 | IBM | SP Power3 | 558.13 | UCSD/San Diego Supercomputer Center, IBM/Poughkeepsie | USA | 1999 | Research | 1024

[Chart: Top500 performance growth, June 1993 to November 1999 (log scale): Sum = 50.97 TF/s, N=1 = 2.38 TF/s (Intel ASCI Red, Sandia), N=500 = 33.1 GF/s. Systems at the N=1 line over time include Intel XP/S140 Sandia, Fujitsu 'NWT' NAL, Hitachi/Tsukuba CP-PACS/2048 and Intel ASCI Red Sandia; N=500 systems have included a Cray Y-MP M94/4, Cray Y-MP C94/364, SNI VP200EX, SGI Power Challenge, IBM 604e and Sun Ultra HPC systems at sites such as KFA Jülich, Uni Dresden, 'EPA' USA, Goodyear, Merrill Lynch, News International and Nabisco.]

Notes: the list now spans from 2.4 Tflop/s down to 33 Gflop/s (the first list spanned roughly 60 Gflop/s down to 400 Mflop/s), and the total has grown from about 1 Tflop/s to tens of Tflop/s; Schwab is at #12; about half of the list is replaced each year; more than 100 systems exceed 100 Gflop/s; growth is faster than Moore's law.

[Chart: projected performance development through 2010 (up to 1 PFlop/s), extrapolating the Sum, N=1, N=10 and N=500 trend lines, with ASCI and Earth Simulator milestones marked. Extrapolation suggests an entry level of 1 TFlop/s by 2005 and 1 PFlop/s by 2010.]

[Chart: number of Top500 systems by architecture class, June 1993 to November 1999: Single Processor, SMP, MPP, SIMD, Cluster, Constellation.]

[Chart: number of Top500 systems by manufacturer, June 1993 to November 1999: Cray, SGI, IBM, TMC, Sun, Intel, Convex/HP, "Japan Inc", others. Note: Tera -> Cray Inc ($15M).]

[Chart: number of Top500 systems by processor family, June 1993 to November 1999: Alpha, Intel, IBM, HP, MIPS, Sun, Inmos Transputer, Other.]

[Chart: number of Top500 systems by customer segment, June 1993 to November 1999: Research, Industry, Academic, Classified, Vendor.]

♦ Clustering of shared memory machines for scalability
  Ø Emergence of PC commodity systems
    Ø Pentium/Alpha based, Linux or NT driven
    Ø "Supercomputer performance at mail-order prices"
  Ø Beowulf-class systems (Linux + PC)
  Ø Distributed Shared Memory (clusters of processors connected)
    Ø Shared address space w/ deep memory hierarchy
♦ Efficiency of message passing and data parallel programming
  Ø Helped by standards efforts such as PVM, MPI, OpenMP and HPF
♦ Many of the machines used as single-user environments
♦ Pure COTS

Cluster systems on the Top500 list (November 1999):

RANK | MANUFACTURER | COMPUTER | RMAX [GF/s] | INSTALLATION SITE | COUNTRY | YEAR | AREA OF INSTALLATION | # PROC
33 | Sun | HPC 450 Cluster | 272.1 | Sun, Burlington | USA | 1999 | Vendor | 720
34 | Compaq | AlphaServer SC | 271.4 | Compaq Computer Corp., Littleton | USA | 1999 | Vendor | 512
… | … | … | … | … | … | … | … | …
44 | Self-made | Cplant Cluster | 232.6 | Sandia National Laboratories | USA | 1999 | Research | 580
… | … | … | … | … | … | … | … | …
169 | Self-made | Alphleet Cluster | 61.3 | Institute of Physical and Chemical Res. (RIKEN) | Japan | 1999 | Research | 140
… | … | … | … | … | … | … | … | …
265 | Self-made | Avalon Cluster | 48.6 | Los Alamos National Lab/CNLS | USA | 1998 | Research | 140
… | … | … | … | … | … | … | … | …
351 | Siemens | hpcLine Cluster | 41.45 | Universitaet Paderborn/PC2 | Germany | 1999 | Academic | 192
… | … | … | … | … | … | … | … | …
454 | Self-made | Parnass2 Cluster | 34.23 | University of Bonn/Applied Mathematics | Germany | 1999 | Academic | 128

♦ Allow networked resources to be integrated into the desktop.
♦ Not just hardware, but also make available software resources.
♦ Locate and "deliver" software or solutions to the user in a directly usable and "conventional" form.
♦ Part of the motivation: software maintenance

♦ Three basic scenarios:
  Ø Client, servers and agents anywhere on the Internet (3(10)-150(80-ws/mpp)-Mcell)
  Ø Client, servers and agents on an intranet
  Ø Client, server and agent on the same machine
♦ "Blue Collar" grid based computing
  Ø User can set things up, no "su" required
  Ø Doesn't require deep knowledge of network programming
♦ Focus on Matlab users

  Ø OO language, objects are matrices (pse, eg os)
  Ø One of the most popular desktop systems for numerical computing, 400K users

• No knowledge of networking involved
• Hide complexity of numerical software
• Computation location transparency
• Provides access to Virtual Libraries:

• Component grid-based framework
• Central management of library resources
• User not concerned with most up-to-date version
• Automatic tie to repository in project
• Provides synchronous or asynchronous calls (user-level parallelism)

Calling NetSolve from Matlab:
  >> define sparse matrix A
  >> define rhs
  >> [x, its] = netsolve('itmeth', 'petsc', A, rhs, 1.e-6)

and from Fortran:
  call NETSL('DGESV()', NSINFO, N, 1, A, MAX, IPIV, B, MAX, INFO)

Asynchronous calls also available.

Computational Server:
• Various software resources installed on various hardware resources
• Configurable and extensible:
  • Framework to easily add arbitrary software
  • Many numerical libraries being integrated by the NetSolve team
  • Much software being integrated by users

Agent:
• Gateway to the computational servers
• Performs load balancing among the resources

http://www.cs.utk.edu/netsolve

The NetSolve agent predicts the execution times and sorts the servers. Prediction for a server is based on:
• Its distance over the network
  - Latency and bandwidth
  - Statistical averaging
• Its performance (LINPACK benchmark)
• Its workload
  - Quick estimate from cached data; is the workload data out of date?
• The problem size and the algorithm complexity

University of Tennessee's Grid Prototype: Scalable Intracampus Research Grid (SInRG)
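To make the prediction model concrete, here is a rough sketch in C of the kind of estimate described above; the function name, the exact formula and the sample numbers are illustrative assumptions, not NetSolve's actual code.

    #include <stdio.h>

    /* Hypothetical sketch: estimated time = network transfer + computation,
       where computation is scaled by the server's benchmarked speed and its
       current workload. */
    static double predicted_time(double latency_s, double bandwidth_bytes_s,
                                 double data_bytes, double flops_required,
                                 double server_flops, double workload_fraction)
    {
        double network = latency_s + data_bytes / bandwidth_bytes_s;
        double compute = flops_required / (server_flops * (1.0 - workload_fraction));
        return network + compute;
    }

    int main(void)
    {
        /* e.g. shipping ~16 MB of data for a problem needing ~6.7e8 flops
           to a 100 Mflop/s server that is 25% busy */
        double t = predicted_time(0.01, 1.0e6, 16.0e6, 6.7e8, 1.0e8, 0.25);
        printf("predicted execution time: %.1f s\n", t);
        return 0;
    }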

Sequencing of requests: sequence(A, B, E)

  netsl_begin_sequence( );
  netsl("command1", A, B, C);
  netsl("command2", A, C, D);
  netsl("command3", D, E, F);
  netsl_end_sequence(C, D);

[Diagram: each call, command1(A, B) -> result C, command2(A, C) -> result D, command3(D, E) -> result F, shown as a separate client/server exchange, with C and D as intermediate outputs and A, B, E as inputs.]

[Diagram: a NetSolve client calls through a common Client-Proxy interface into interchangeable back ends: a NetSolve proxy (NetSolve agent, NetSolve servers, NetSolve services), a Globus proxy (GRAM, GASS, MDS), a Ninf proxy (Ninf services), and a Legion proxy (Legion services).]

♦ Kerberos used to maintain Access Control Lists and manage access to computational resources.
♦ NetSolve properly handles authorized and non-authorized components together in the same system.
♦ In use by DOD Modernization program at Army Research Lab

♦ Multiple requests to a single problem.

♦ Previous solution:
  – Many calls to netslnb( );  /* non-blocking */
♦ New solution:
  – Single call to netsl_farm( );

SCIRun torso defibrillator application (a bit in the future)

♦ x = netsolve('linear system', A, b)
  Ø NetSolve examines the user's data together with the resources available (in the grid sense) and makes decisions on the best time to solution dynamically.
  Ø Decision based on size of problem, hardware/software, network connection, etc.
  Ø Method and placement.

[Diagram: decision tree on matrix type (sparse or dense; direct or iterative; symmetric, general or rectangular) guiding the choice of solver.]
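Since this capability is described as being a bit in the future, the following is only a hypothetical sketch of the kind of decision logic involved; the thresholds and solver names are invented for illustration.

    #include <stdio.h>

    /* Hypothetical decision logic: pick a solver family from simple
       properties of the matrix.  Thresholds and names are illustrative. */
    static const char *choose_solver(int n, int nnz, int symmetric)
    {
        double density = (double)nnz / ((double)n * (double)n);

        if (density > 0.1)                  /* treat as dense */
            return symmetric ? "dense symmetric direct (LAPACK)"
                             : "dense general direct (LAPACK)";
        if (symmetric)                      /* sparse symmetric */
            return "iterative CG-type solver (e.g. PETSc)";
        return (n > 100000) ? "iterative GMRES-type solver (e.g. PETSc/Aztec)"
                            : "sparse direct solver (e.g. SuperLU)";
    }

    int main(void)
    {
        printf("%s\n", choose_solver(50000, 500000, 0));
        return 0;
    }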

Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?

[Chart: Processor-DRAM memory gap (latency), 1980-2000. CPU performance ("Moore's Law") improves ~60%/yr (2x/1.5 yr) while DRAM improves ~9%/yr (2x/10 yrs); the processor-memory performance gap grows ~50% per year.]

♦ By taking advantage of the principle of locality:
  Ø Present the user with as much memory as is available in the cheapest technology.
  Ø Provide access at the speed offered by the fastest technology.

[Diagram: memory hierarchy from on-chip registers and cache (SRAM), through level 2 and 3 cache, main memory (DRAM), distributed/cluster memory and remote memory, to secondary storage (disk) and tertiary storage (disk/tape). Speeds range from ~1 ns at the registers to tens of milliseconds or seconds at the bottom; sizes from hundreds of bytes to terabytes.]

♦ Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
♦ Hardware and software have a large design space w/ many parameters
  Ø Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
  Ø Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
♦ About a year ago there was no tuned BLAS for the Pentium under Linux.
♦ Need for quick/dynamic deployment of optimized routines.
♦ ATLAS - Automatically Tuned Linear Algebra Software; related efforts:
  Ø PhiPAC from Berkeley
  Ø FFTW from MIT (http://www.fftw.org)
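To illustrate just one of the tuning parameters listed above, here is a minimal sketch in C of a cache-blocked matrix multiply; the block size NB (64 here) is exactly the kind of machine-dependent constant that ATLAS determines empirically rather than by hand.

    #include <string.h>

    #define NB 64   /* block size: machine dependent; 64 is only an assumed value */

    /* C = A * B for n x n row-major matrices.  The blocked loop structure
       reuses each NB x NB tile of A and B while it is resident in cache. */
    static void matmul_blocked(int n, const double *A, const double *B, double *C)
    {
        memset(C, 0, (size_t)n * n * sizeof(double));
        for (int ii = 0; ii < n; ii += NB)
            for (int kk = 0; kk < n; kk += NB)
                for (int jj = 0; jj < n; jj += NB)
                    /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B */
                    for (int i = ii; i < ii + NB && i < n; i++)
                        for (int k = kk; k < kk + NB && k < n; k++) {
                            double a = A[(size_t)i * n + k];
                            for (int j = jj; j < jj + NB && j < n; j++)
                                C[(size_t)i * n + j] += a * B[(size_t)k * n + j];
                        }
    }

ATLAS searches over NB, unrolling depths and related parameters automatically, timing each candidate kernel on the target machine.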

♦ An adaptive software architecture
  Ø High performance
  Ø Portability
  Ø Elegance

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.

[Chart: DGEMM (n = 500) performance in MFLOPS for vendor BLAS, ATLAS BLAS and F77 BLAS across a range of architectures (AMD Athlon, DEC ev56/ev6, HP, IBM Power2/Power3, Intel Pentium Pro/II/III, SGI R10000/R12000, Sun UltraSparc).]

♦ Two phases:
  Ø Probes the system for system features
  Ø Does a parameter study
♦ On-chip multiply optimizes for:
  Ø TLB access
  Ø L1 cache reuse
  Ø FP unit usage
  Ø Memory fetch
  Ø Register reuse
  Ø Loop overhead minimization
♦ Code is iteratively generated & timed until the optimal case is found. We try:
  Ø Differing NBs
  Ø Breaking false dependencies
  Ø M, N and K loop unrolling
♦ Designed for RISC architectures
  Ø Super scalar
  Ø Need reasonable C compiler
♦ New model of HP programming where critical code is machine generated using parameter optimization.
♦ Takes ~20 minutes to run
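From the user's point of view the generated kernels sit behind the standard BLAS interfaces, so the tuned routines are a drop-in replacement; below is a minimal example calling DGEMM through the C BLAS interface that ATLAS provides (the exact header location and library names on the link line vary by installation and are assumptions here).

    #include <stdio.h>
    #include <cblas.h>   /* C interface to the BLAS, shipped with ATLAS */

    int main(void)
    {
        /* C = 1.0 * A * B + 0.0 * C for 2x2 row-major matrices */
        double A[] = { 1.0, 2.0,
                       3.0, 4.0 };
        double B[] = { 5.0, 6.0,
                       7.0, 8.0 };
        double C[] = { 0.0, 0.0,
                       0.0, 0.0 };

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,        /* M, N, K       */
                    1.0, A, 2,      /* alpha, A, lda */
                    B, 2,           /* B, ldb        */
                    0.0, C, 2);     /* beta, C, ldc  */

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  /* expect 19 22 / 43 50 */
        return 0;
    }

A typical link line would be something like "cc dgemm_test.c -lcblas -latlas" (library names as usually built by ATLAS; an assumption, not a guarantee).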

[Chart: Level 3 BLAS performance (MFLOPS) for vendor BLAS, ATLAS BLAS and reference BLAS on DGEMM, DSYMM, DSYRK, DSYR2K, DTRMM and DTRSM.]

Multi-Threaded DGEMM, Intel PIII 550 MHz

[Chart: Mflop/s vs. problem size for Intel BLAS with 1 processor, ATLAS with 1 processor, Intel BLAS with 2 processors, and ATLAS with 2 processors.]

♦ Software release, available today:
  Ø Level 1, 2, and 3 BLAS implementations
  Ø See: www.netlib.org/atlas/
♦ Near future:
  Ø Multi-threading
  Ø Optimize message passing system
  Ø Extend these ideas to Java directly
  Ø Sparse matrix-vector ops
♦ Futures:
  Ø Runtime adaptation
  Ø Sparsity analysis
  Ø Iterative code improvement
  Ø Specialization for user applications
  Ø Adaptive libraries

♦ Timing and performance evaluation has been an art
  Ø Resolution of the clock
  Ø Issues about cache effects
  Ø Different systems
♦ Situation about to change
  Ø Today's processors have internal counters


42 ♦ Performance Application Programming Interface ♦ The purpose of PAPI is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors ♦ Used by Tau (A. Malony) and SvPablo (D. Reed) 43 ♦ Application is instrumented with PAPI Ø One simple “call” ♦ Will be layered over the best existing vendor-specific APIs for these platforms ♦ Sections of code that are of interest are designated with specific colors Ø Using a call to mark_perfometer(‘color’) ♦ Application is started and a Java window containing the Perfometer application is also started 44 PerfometerPerfometer

Perfometer

Call Perfometer()

♦ Top500
  Ø Hans W. Meuer, Mannheim U
  Ø Erich Strohmaier, UTK
♦ NetSolve
  Ø Dorian Arnold, UTK
  Ø Susan Blackford, UTK
  Ø Henri Casanova, UCSD
  Ø Michelle Miller, UTK
  Ø Ganapathy Raman, UTK
  Ø Sathish Vadhiyar, UTK
♦ ATLAS
  Ø Clint Whaley, UTK
  Ø Antoine Petitet, UTK
♦ PAPI
  Ø Shirley Browne, UTK
  Ø Nathan Garner, UTK
  Ø Kevin London, UTK
  Ø Phil Mucci, UTK

For additional information see:
  www.top500.org
  www.netlib.org/atlas/
  www.netlib.org/netsolve/
  www.cs.utk.edu/~dongarra/

[Diagram: NetSolve architecture: a NetSolve client contacts the NetSolve agent, which dispatches work to computational resources; the servers draw problem implementations from a software repository.]

http://www.cs.utk.edu/netsolve/

♦ List of Top 100 Clusters
  Ø IEEE Task Force on Cluster Computing
  Ø Interested in assembling a list of the Top n Clusters
  Ø Based on current metric
  Ø Starting to put together software to facilitate running and collection of data.
♦ Sparse Benchmark
  Ø Look at the performance in terms of sparse matrix operations
  Ø Iterative solvers
  Ø Beginning to collect data

♦ Tool integration
  Ø Globus - Middleware infrastructure (ANL/ISI)
  Ø Condor - Workstation farm (U Wisconsin)
  Ø NWS - Network Weather Service (U Tennessee)
  Ø SCIRun - Computational steering (U Utah)
  Ø Ninf - NetSolve-like system (Tsukuba U)
♦ Library usage
  Ø LAPACK/ScaLAPACK - Parallel dense linear solvers
  Ø SuperLU/MA28 - Parallel sparse direct linear solvers (UCB/RAL)
  Ø PETSc/Aztec - Parallel iterative solvers (ANL/SNL)
  Ø Other areas as well (not just linear algebra)
♦ Applications
  Ø MCell - Microcellular physiology (UCSD/Salk)
  Ø IPARS - Reservoir simulator (UTexas, Austin)
  Ø Virtual Human - Pulmonary system model (ORNL)
  Ø RSICC - Radiation safety sw/simulation (ORNL)
  Ø LUCAS - Land usage modeling (U Tennessee)
  Ø ImageVision - Computer graphics and vision (Graz U)

♦ Iterative and direct solvers: PETSc, Aztec, SuperLU, MA28, …
♦ Support for compressed row/column sparse matrix storage -- significantly reduces network data transmission
♦ Sequential and parallel implementations available: SPOOLES, PETSc, SuperLU, MA28