
Performance Modeling the Earth Simulator and ASCI Q

Darren J. Kerbyson, Adolfy Hoisie, Harvey J. Wasserman
Performance and Architectures Laboratory (PAL)
Los Alamos National Laboratory
April 2003


Reported Earth Simulator results:

"26.58 Tflops on AFES ... 64.9% of peak (640 nodes)"

"14.9 Tflops on Impact-3D ... 45% of peak (512 nodes)"

"10.5 Tflops on PFES ... 44% of peak (376 nodes)"

Talk Overview

- Overview of the Earth Simulator
  - a (quick) view of the architecture of the Earth Simulator (and Q)
  - a look at its performance characteristics
- Application-centric performance models
  - the method of comparing performance is to use trusted models of applications that we are interested in, e.g. SAGE and Sweep3D
  - analytical / parameterized in system and application characteristics
- Models can be used to provide:
  - predicted performance prior to availability (of hardware or software)
  - insight into performance
  - performance comparison (which is better?)
- System performance comparison (Earth Simulator vs ASCI Q)

Earth Simulator: Overview

[Figure: Node0 ... Node639, each containing APs 0-7, memory, and an RCU, connected by a 640 x 640 single-stage crossbar]

- 640 nodes (vector processors), interconnected by a single-stage crossbar
  - copper interconnect (~3,000 km of wire)
- NEC markets a product, the SX-6, which is not the same as an Earth Simulator node: similar, but with a different memory subsystem

Earth Simulator Node

[Figure: node with APs 0-7 connected to 32 memory modules, an RCU to the crossbar, and an IOP to the LAN and disks]

Each node contains: 8 vector processors (APs), 16 GB of memory, a Remote Control Unit (RCU), and an I/O Processor (IOP).

- Each AP has a vector unit (8 sets of 6 different vector pipelines) and a super-scalar unit, on ~5,000 pins
- The AP operates mostly at 500 MHz
  - processor peak performance = 500 MHz x 8 (pipes) x 2 (floating-point) = 8,000 Mflop/s
  - memory bandwidth = 32 GB/s per processor (256 GB/s per node)
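These peak numbers imply a very different machine balance (memory bytes per flop) on the two systems. A minimal sketch of the arithmetic; the Alpha figures are taken from the comparison table below:

```python
# Peak-rate and machine-balance arithmetic from the slides.

def peak_gflops(clock_ghz: float, pipes: int, flops_per_pipe: int) -> float:
    """Peak Gflop/s of one processor: clock x pipes x flops per pipe."""
    return clock_ghz * pipes * flops_per_pipe

es_ap_peak = peak_gflops(0.5, 8, 2)   # Earth Simulator AP: 8 Gflop/s
alpha_peak = 2.5                      # Alpha EV68 at 1.25 GHz, 2 flops/cycle

# Balance = peak memory bandwidth (GB/s) per peak Gflop/s.
es_balance    = 32.0 / es_ap_peak     # 32 GB/s per AP -> 4 bytes/flop
alpha_balance = 2.0 / alpha_peak      # ~2 GB/s per PE -> 0.8 bytes/flop

print(es_ap_peak, es_balance)         # 8.0  4.0
print(alpha_peak, alpha_balance)      # 2.5  0.8
```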

ASCI Q

[Figure: HP ES45 node with CPUs 0-3, memory M0/M1, and 2 x 2 PCI adapters to the switch-based interconnect]

- 2048 nodes
- Fat-tree network (Quadrics)
- Each node is an HP ES45:
  - 4 x Alpha EV68 microprocessors at 1.25 GHz
  - memory hierarchy: 64 KB I+D L1 cache, 16 MB L2 cache, 16 GB of memory per node (typical)

                                      Earth Simulator          ASCI Q
                                      (similar to NEC SX-6)    (HP ES45)
Node architecture                     Vector SMP               Microprocessor SMP
System topology                       Crossbar (single-stage)  Fat-tree
Number of nodes                       640                      2048
Processors per node                   8                        4
Processors, system total              5120                     8192
Processor speed                       500 MHz                  1.25 GHz
Peak speed per processor              8 Gflops                 2.5 Gflops
Peak speed per node                   64 Gflops                10 Gflops
Memory per node                       16 GB                    16 GB (max 32 GB)
Memory per processor                  2 GB                     4 GB (max 8 GB)
Peak bandwidth, L1 cache              N/A                      20 GB/s
Peak bandwidth, L2 cache              N/A                      13 GB/s
Peak bandwidth, main memory (per PE)  32 GB/s                  2 GB/s
Inter-node MPI latency                8.6 µs                   5 µs
Inter-node MPI bandwidth              11.8 GB/s                300 MB/s

A quotation from a scientific explorer?

"It's life, Jim, but not as we know it"

"It's performance, Jim, but not as we expected it"

Need to have an expectation

- Complex machines and software: single processors, interactions within nodes, interactions between nodes (communication networks), I/O
- Large cost of development, deployment, and maintenance
- Need to know in advance what the performance will be

A model is key at every stage of the system life-cycle:

- Design: lots of system choices
- Procurement: what should we buy? (ASCI Purple)
- Implementation: some measurement possible (small scale)
- Installation: verification of ASCI Q performance
- Maintenance: updates of SW and/or HW

How many _ ?

(How many Alphas are equivalent to the Earth Simulator?)

Answer:
  Earth Simulator:   640 nodes x 8 PEs/node x 8 Gflops/PE   = 40 Tflops
  Alpha ES45 system:  ?  nodes x 4 PEs/node x 2.5 Gflops/PE = 40 Tflops  ->  ? = 4,096 nodes
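The same peak-equivalence arithmetic in a few lines, as a minimal sketch:

```python
# Peak-equivalence arithmetic from this slide.
es_peak_tflops = 640 * 8 * 8 / 1000.0       # nodes x PEs/node x Gflops/PE = 40.0

# Solve: n_nodes x 4 PEs/node x 2.5 Gflops/PE = 40 Tflops.
alpha_nodes = es_peak_tflops * 1000.0 / (4 * 2.5)
print(alpha_nodes)                           # 4096.0 ES45 nodes
```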

Another answer: it depends on what is going to run.

Performance Model for SAGE

- Identify and understand the key characteristics of the code

- Main data-structure decomposition: slab
  - communications scale as (number of PEs)^(2/3)
  - communication distance (in PEs) increases
  - effect of network topology

- Processing stages: gather data, computation, scatter data

[Figure: gather -> compute -> scatter over neighbor cells n-4 ... n+4]

Gather/scatter communication sizes:

  Direction     Size (cells)
  X, LO / HI    4
  Y, LO / HI    152
  Z, LO / HI    6,860

Scaling Characteristics

[Figure: slab decompositions for 2, 8, 64, and 256 PEs, with (i) boundary-surface size (cells) vs #PEs for the grid surface and the PE surface, and (ii) inter-PE communication distance vs #PEs]

The surface is split across PEs when P > √(E/8); the PE distance = ⌈(8P²/E)^(1/3)⌉ (for E = 13,500 cells/PE, this means P > 41).
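These two relations evaluate directly in code; a minimal sketch (variable names are mine):

```python
import math

def surface_is_split(P: int, cells_per_pe: float) -> bool:
    """True once the slab boundary surface is split across PEs: P > sqrt(E/8)."""
    return P > math.sqrt(cells_per_pe / 8.0)

def pe_distance(P: int, cells_per_pe: float) -> int:
    """Inter-PE communication distance: ceil((8 P^2 / E)^(1/3))."""
    return math.ceil((8.0 * P * P / cells_per_pe) ** (1.0 / 3.0))

E = 13500                        # cells per PE (weak scaling), from the slide
for P in (41, 42, 1024, 8192):
    print(P, surface_is_split(P, E), pe_distance(P, E))
# The split threshold sqrt(13500 / 8) ~ 41.1 matches the slide's "P > 41".
```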

Validation

- Validated on large-scale platforms:
  - ASCI Blue Mountain (SGI Origin 2000)
  - Cray T3E
  - ASCI Red (Intel)
  - ASCI White (IBM SP3)
  - Compaq AlphaServer SMP clusters
- The model is highly accurate (< 10% error)

[Figure: predicted vs measured SAGE cycle time vs #PEs on ASCI White, ASCI Blue Mountain, and a Compaq ES45 AlphaServer cluster]

Experience: Compaq Installation

- Model provided expected performance
- Installation performed in stages: late 2001, then early 2002 (upgraded PCI)

[Figure: SAGE cycle time vs #PEs on the Compaq ES45; model vs measurements from Sept 9, Oct 2, and Oct 24, 2001 (left) and Jan 4, Feb 2, and Apr 20, 2002 (right, after the PCI upgrade)]

-> The model was used to validate the measurements!
Other factors: the model accurately predicted when two rails improve the performance (P > 41).

More recently (Jan 2003)

[Figure: SAGE cycle time on QA & QB vs #PEs (to 4096); model vs measurements from Sep 21, 2002; Nov 25, 2002; Jan 25, 2003; and Jan 27, 2003]

- The performance of ASCI Q is now within ~10% of our expectation
- Without a model we would not have identified (and solved) the poor performance!

SWEEP3D Particle Transport: 2-D Pipeline Parallelism

[Figure: wavefronts n-3 ... n+3 propagating diagonally across the 2-D PE array]

- SN transport using the discrete-ordinates method
- Wavefronts of computation propagate across the grid
- The model for Sweep3D accurately captures the performance
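To illustrate why this pipeline matters at scale, here is a common first-order wavefront cost model; this is a sketch under textbook assumptions, not necessarily the exact formulation used in the PAL model:

```python
# First-order pipelined-wavefront model: a sketch, not PAL's exact model.
# A sweep of n_blocks work units over a Px x Py PE array needs the pipeline
# fill time plus one step per remaining block.

def sweep_steps(px: int, py: int, n_blocks: int) -> int:
    """Steps for one sweep direction on a px x py logical PE grid."""
    fill = px + py - 2            # steps before the far-corner PE starts
    return fill + n_blocks        # then one block completes per step

# Pipeline overhead grows with the PE array even though per-PE work is fixed:
for px, py in ((2, 2), (32, 32), (64, 128)):
    print(px * py, "PEs ->", sweep_steps(px, py, n_blocks=100), "steps")
```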

How many _ ?

- Basis for performance comparison:
  - use validated (trusted) performance models to predict performance
  - large problem size (weak scaling) to fill the available memory
    -> the problem size on Q is double that on the Earth Simulator
- Compare processing rates on:
  - an equal-processor-count basis (1 to 5120)
  - a %-of-system-used basis (10 to 100%)
- But one unknown remains: the performance of the codes on a single Earth Simulator processor
  -> consider a range of values (see the sketch after this list)
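The two comparison bases are simple to express; a minimal sketch in which rate_es and rate_q are placeholders for the per-PE processing rates (cell-cycles/s/PE) predicted by the application models:

```python
# Sketch of the two comparison bases used in the following slides.

ES_TOTAL_PES = 5120   # 640 nodes x 8 PEs
Q_TOTAL_PES  = 8192   # 2048 nodes x 4 PEs

def relative_equal_pes(rate_es, rate_q, pes: int) -> float:
    """Earth Simulator / ASCI Q at the same processor count."""
    return rate_es(pes) / rate_q(pes)

def relative_system_fraction(rate_es, rate_q, frac: float) -> float:
    """Earth Simulator / ASCI Q when each machine uses `frac` of itself.
    Total rate = per-PE rate x number of PEs in use."""
    p_es = int(frac * ES_TOTAL_PES)
    p_q  = int(frac * Q_TOTAL_PES)
    return (rate_es(p_es) * p_es) / (rate_q(p_q) * p_q)
```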

Model Parameters: SAGE

Parameter                                            Alpha ES45                Earth Simulator
P_SMP      Processors per node                       4                         8
C_L        Communication links per node              1                         1
E          Number of level-0 cells                   35,000                    17,500
f_GS_r     Real gather-scatters per cycle            377                       377
f_GS_i     Integer gather-scatters per cycle         22                        22
T_comp(E)  Sequential computation time per cell      68.6 µs                   42.9 µs (5% of peak)
                                                                               21.4 µs (10% of peak)
                                                                               14.3 µs (15% of peak)
                                                                               10.7 µs (20% of peak)
                                                                                8.6 µs (25% of peak)
                                                                                7.1 µs (30% of peak)
L_c(S,P)   Bi-directional MPI latency                6.10 µs (S < 64)          8 µs
                                                     6.44 µs (64 <= S <= 512)
                                                     13.8 µs (S > 512)
B_c(S,P)   Bi-directional MPI bandwidth              0 (S < 64)                10 GB/s
           (per direction)                           78 MB/s (64 <= S <= 512)
                                                     120 MB/s (S > 512)
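The piecewise communication parameters for ASCI Q transcribe directly into code; a minimal sketch (S is the message size in bytes; the constants come from the table above, and the latency + size/bandwidth composition is my simplification):

```python
# ASCI Q MPI model terms from the parameter table (S = message size, bytes).

def latency_q(S: int) -> float:
    """Bi-directional MPI communication latency, seconds."""
    if S < 64:
        return 6.10e-6
    if S <= 512:
        return 6.44e-6
    return 13.8e-6

def bandwidth_q(S: int) -> float:
    """Bi-directional MPI bandwidth per direction, bytes/s (0 => latency-bound)."""
    if S < 64:
        return 0.0
    if S <= 512:
        return 78e6
    return 120e6

def comm_time_q(S: int) -> float:
    """Simple latency + size/bandwidth cost for one message."""
    bw = bandwidth_q(S)
    return latency_q(S) + (S / bw if bw > 0 else 0.0)

print(comm_time_q(6860 * 8))   # e.g. a Z-direction gather of 6,860 8-byte cells
```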

SAGE: Processing Rate

[Figure: SAGE processing rate (cell-cycles/s/PE) vs #processors for ASCI Q and ASCI White; higher is better]

- Weak scaling
- ASCI Q is between 2.2 and 2.8 times faster than ASCI White
- Data from the model (validated to 4096 PEs on Q and 8192 PEs on White)

SAGE: Processing Rate (Earth Simulator)

[Figure: SAGE processing rate (cell-cycles/s/PE) on the Earth Simulator vs #processors, for assumed 5%-30% of single-processor peak; higher is better]

- The single-processor time is unknown; it is modeled using a range of values

SAGE: Earth Simulator vs Q (equal processor count)

[Figure: relative performance (Earth Simulator / ASCI Q) vs #processors, for 5%-30% of peak; higher = Earth Simulator faster]

- The ratio of processor peak performance is 3.2 (red line)
- In nearly all cases, Earth Simulator performance is better than the ratio of processor peak performance

SAGE: Earth Simulator vs Q (% of system utilized)

[Figure: relative performance (Earth Simulator / ASCI Q) vs % of system used, for 5%-30% of peak; higher = Earth Simulator faster]

- The ratio of system peak performance is 2 (red line)
- In nearly all cases, the Earth Simulator is better than the ratio of system peak performance

SAGE: Component Time Predictions

[Figure: predicted breakdown of cycle time (%) into compute/memory, bandwidth, latency, and collectives vs #processors, for ASCI Q (to 8192 PEs) and the Earth Simulator (to 5120 PEs)]

- The 10%-of-peak case is considered on the Earth Simulator
- Higher bandwidth component on the Alpha
- The latency cost on the Earth Simulator is more visible: computation is reduced (x3.2) and bandwidth increased (x40), while latency is about the same

How many Alpha nodes = the Earth Simulator for SAGE?

[Figure: number of Alpha ES45 nodes, and the equivalent Alpha Tflops (40, 67, 88, 103, 115, 125), needed to match the Earth Simulator on SAGE, for 5%-30% of single-processor peak achieved on the Earth Simulator]

- 1 ES45 node (1.25 GHz) = 10 Gflops, so 1,000 nodes = 10 Tflops
- Assuming 10% of peak on the Earth Simulator, 67 Tflops of EV68 Alphas are needed

Sweep3D: Processing Rate

[Figure: Sweep3D processing rate (cell-cycles/s/PE) vs #processors; left: ASCI Q, ASCI White, and MCR; right: the Earth Simulator at 5%-30% of peak; higher is better]

- Again, a range of values is used for the unknown Earth Simulator single-processor time

Sweep3D: Earth Simulator vs Q (equal processor count)

[Figure: relative performance (Earth Simulator / ASCI Q) vs #processors, for 5%-30% of peak; higher = Earth Simulator faster]

- The ratio of processor peak performance is 3.2 (red line)
- At large PE counts, Earth Simulator performance is worse than the ratio of processor peak performance

Sweep3D: Earth Simulator vs Q (% of system utilized)

[Figure: relative performance (Earth Simulator / ASCI Q) vs % of system used, for 5%-30% of peak; higher = Earth Simulator faster]

- The ratio of system peak performance is 2 (red line)
- When using more than 15% of the system, Earth Simulator performance is worse than the ratio of system peak performance

Sweep3D: Component Time Predictions

[Figure: predicted breakdown of time (%) into compute/memory, bandwidth, and latency, each split into pipeline and block components, vs #processors for ASCI Q and the Earth Simulator]

- The 5%-of-peak case is considered for the Earth Simulator
- Higher latency component than SAGE on both systems; it is significant
- The pipeline effect can be seen at higher PE counts

How many Alpha nodes = the Earth Simulator for Sweep3D?

[Figure: number of Alpha ES45 nodes, and the equivalent Alpha Tflops (12, 19, 24, 30, 37, 42), needed to match the Earth Simulator on Sweep3D, for 5%-30% of single-processor peak achieved on the Earth Simulator]

- 1 ES45 node (1.25 GHz) = 10 Gflops, so 1,000 nodes = 10 Tflops
- Assuming 5% of peak on the Earth Simulator, 12 Tflops of EV68 Alphas are needed

Hypothetical Workload: Assume 40% SAGE and 60% Sweep3D

                     SAGE % of single-processor peak
                     5%    10%   15%   20%   25%   30%
Sweep3D    5%        23    34    42    48    53    57
          10%        27    38    47    53    57    61
          15%        30    41    50    56    60    64
          20%        34    45    53    59    64    68
          25%        38    49    57    63    68    72
          30%        41    52    60    66    71    75

- The numbers in the table give the peak Tflops rating of an Alpha ES45 system that would achieve the same performance as the Earth Simulator (see the sketch below)
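These entries are consistent with a simple 40/60 weighted combination of the per-application equivalent-Tflops figures from the two preceding slides; a minimal sketch that reproduces the table under that assumption:

```python
# Equivalent Alpha Tflops from the two "How many Alpha nodes" slides,
# indexed by the assumed % of Earth Simulator single-processor peak.
peaks = (5, 10, 15, 20, 25, 30)
sage_tflops  = {5: 40, 10: 67, 15: 88, 20: 103, 25: 115, 30: 125}
sweep_tflops = {5: 12, 10: 19, 15: 24, 20: 30,  25: 37,  30: 42}

# Workload = 40% SAGE + 60% Sweep3D: weight the per-code equivalents.
for sweep_pct in peaks:
    row = [round(0.4 * sage_tflops[sage_pct] + 0.6 * sweep_tflops[sweep_pct])
           for sage_pct in peaks]
    print(f"Sweep3D {sweep_pct:2d}%:", row)
# First row (Sweep3D 5%): [23, 34, 42, 48, 53, 57], matching the table.
```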

- Currently, SAGE on the NEC SX-6 achieved 5% of peak on its first run (Sweep3D is expected to be less); this may improve over time

Summary

- Models have provided quantitative information on the long-debated efficiency of large, microprocessor-based systems for HPC (as opposed to smaller, more powerful, but special-purpose vector systems)

- Comparison of Earth Simulator and ASCI Q performance is heavily dependent on the workload
  - At present, an Alpha system of approx. 23 Tflops peak would achieve the same level of performance as the Earth Simulator on the workload considered here (60% Sweep3D and 40% SAGE)
- The results give a 'reference' comparison
  - The achievable performance on the NEC SX-6 (or an Earth Simulator node) may change over time; this analysis remains valid for the systems as they presently are
- Models have also been used for:
  - design studies (architecture and software, e.g. IBM PERCS in HPCS)
  - the ASCI Q installation (to validate observed performance)
  - the procurement of ASCI Purple
  - performance comparison of systems

www.c3.lanl.gov/par_arch