Ratio of Processor Peak-Performance Is

PAL CCS-3 The Earth Simulator vs. the rest of the Supercomputing World Ten rounds or a Knock-Out? Darren J. Kerbyson Adolfy Hoisie, Harvey J. Wasserman Performance and Architectures Laboratory (PAL) Los Alamos National Laboratory April 2003 PAL CCS-3 “26.58Tflops on AFES … 64.9% of peak (640nodes)” “14.9Tflops on Impact-3D …. 45% of peak (512nodes)” “10.5Tflops on PFES … 44% of peak (376nodes)” 40Tflops 20Tflops PAL Talk Overview CCS-3 z Overview of the Earth Simulator – A (quick) view of the architecture of the Earth Simulator (and Q) – A look at its performance characteristics z Application Centric Performance Models – Method of comparing performance is to use trusted models of applications that we are interested in, e.g. SAGE and Sweep3D. – Analytical / Parameterized in system & application characteristics z Models can be used to provide: – Predicted performance prior to availability (hardware or software) – Insight into performance – Performance Comparison (which is better?) z System Performance Comparison (Earth Simulator vs ASCI Q) PAL Earth Simulator: Overview CCS-3 ... .. RCU RCU . RCU AP AP AP Memory Memory Memory 0-7 0-7 0-7 .. Node0 Node1 Node639 640x640 ... crossbar z 640 Nodes (Vector Processors) z interconnected by a single stage cross-bar – Copper interconnect (~3,000Km wire) z NEC markets a product – SX-6 – Not the same as an Earth Simulator Node - similar but different memory sub-system PAL Earth Simulator Node Crossbar LAN network CCS-3Disks AP AP AP AP AP AP AP AP 0 1 2 3 4 5 6 7 RCU IOP Node contains: 8 vector processors (AP) 16GByte memory Remote Control Unit (RCU) I/O Processor (IOP) 6 0 1 2 3 4 5 31 M M M M M M M . M z Each AP has: – A vector unit (8 sets of 6 different vector pipelines), and a super-scalar unit – 5,000 pins – AP operates mostly at 500MHz: – Processor Peak Performance = 500 x 8 (pipes) x 2 (float-point) = 8,000 Mflop/s – Memory bandwidth = 32 GB/s per Processor (256GB/s per node) PAL ASCI Q CCS-3 z 2048 Nodes Node z Fat-tree network (Quadrics) CPU3 CPU PCI PCI z Each node is an HP ES45 2 x2 x2 CPU – 4 x Alpha EV68 micro-processors 1 – 1.25GHz CPU0 M0 Switch – Memory Hierarchy: L1 64KB I+D Based » 64KB I+D L1 Inter-connect L2 16MB M1 » 16MB L2 cache » 16 GB memory per node (typical) PAL Earth Simulator ASCI Q (based on NEC SX-6) (HP ES45) CCS-3 Node Architecture Vector SMP Microprocessor SMP System Topology Crossbar (single- Fat-tree stage) Number of nodes 640 2048 Processors - per node 8 4 - system total 5120 8192 Processor Speed 500 MHz 1.25 GHz Peak speed - per processor 8 Gflops 2.5 Gflops - per node 64 Gflops 10 Gflops Memory - per node 16 GB 16 GB (max 32 GB) - per processor 2 GB 4 GB (max 8 GB) Memory Bandwidth (peak) - L1 Cache N/A 20 GB/s - L2 Cache N/A 13 GB/s - Main memory (per PE) 32 GB/s 2 GB/s Inter-node MPI communication - Latency 8.6 µsec 5 µsec -Bandwidth 11.8 GB/s 300 MB/s PAL A quotation from a scientific explorer? CCS-3 “Its life Jim, but not as we know it” “Its performance Jim, but not as we expected it” PAL Need to have an expectation z Complex machines and software CCS-3 – single processors, interactions within nodes, interaction between nodes (communication networks), I/O z Large cost for development, deployment and maintenance z Need to know in advance what performance will be. Update of SW and/or HW Maintenance Verification of ASCI Q Performance Installation A Model Procurement What should we buy? is the Key (ASCI Purple) Implementation Some measurement possible (small scale) Design Lots of system choices PAL CCS-3 How many ≡ ? ( How many Alphas are equivalent to the Earth Simulator ? ) Answer: Earth Simulator = 640 x 8 x 8 = 40-Tflops Nodes PEs/node Gflops/PE Alpha ES45 System = ? x 4 x 2.5 = 40-Tflops -> ? = 4,096 nodes Another Answer: It depends on what are going to run PAL Performance Model for SAGE CCS-3 z Identify and understand the key characteristics of code z Main Data structure decomposition: Slab PE – communications scale as number of PEs (^2/3) 1 2 – communication distance (PEs) increases 3 4 – Effect of network topology z Processing Stages – Gather data, Computation, Scatter Data Gather n-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 Gather/Scatter Comms Direction Size (cells) X, HI 4 Computen-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 X, LO 4 Y, LO 152 Y, HI 152 Z, LO 6860 Scatter n-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 Z, HI 6860 PAL Scaling Characteristics 256PEs 7 64PEs 1 4 CCS-3 8PEs 3 8 2 5 2PEs 1 5 2 ... ... 9 12345678 4 6 12 3 3 2 4 7 Comparison of Boundary Surfaces Comparison of inter-PE communication distances 1.E+06 100 Neighbor Distance Grid Surface PE Surface 1.E+05 10 1.E+04 PE Distance Total Surface (cells) Surface Total i) Surface size ii) PE Distance 1.E+03 1 1 10 100 1000 10000 100000 1 10 100 1000 10000 100000 # PEs # PEs Surface split across PEs: P >√(E/8) PE distance = ⎡ (8P2/E)1/3 ⎤ (for cells/PE = 13,500 P > 41) PAL Validation z Validated on large-scale platforms: SAGE Performance (ASCI White) CCS-3 2 – ASCI Blue Mountain (SGI Origin 2000) 1.8 1.6 – CRAY T3E 1.4 – ASCI Red (intel) 1.2 1 – ASCI White (IBM SP3) 0.8 Cycle Time(s) 0.6 – Compaq Alphaserver SMP clusters 0.4 Prediction 0.2 Measurement 0 z Model is highly accurate (< 10% error) 1 10 100 1000 10000 # PEs SAGE Performance (ASCI Blue Mountain) SAGE Performance (Compaq ES45 AlphaServer) 6 0.6 5 0.5 4 0.4 3 0.3 Cycle Time (s) 2 Cycle Time (s) 0.2 1 Prediction 0.1 Prediction Measurement Measurement 0 0 100 1000 10000 1 10 100 # PEs # PEs PAL Experience: Compaq Installation z Model provided expected performance CCS-3 z Installation performed in stages Late 2001: Early 2002: upgraded PCI SAGE - Performance Data (Compaq ES45) SAGE - Performance Data (Compaq ES45) 1.6 ModelModelModel 1.6 Model ModelModel MeasuredMeasured (Sept(Sept 9th9th 01)01) 4-Jan4-Jan Measured (Oct 2nd 01) 2-Feb2-Feb 1.2 Measured (Oct 24th 01) 1.2 20-Apr 0.8 0.8 CycleCycleCycleCycle Time Time Time Time (s)(s)(s)(s) Cycle TimeCycle (s) CycleCycleCycle Time Time Time (s) (s) (s) 0.4 0.4 0 0 1 10 100 1000 1 10 100 1000 10000 # PEs # PEs -> Model used to validate measurements! Other factors: accurately predicted when 2-rails improves the performance (P>41) PAL More recently: (Jan 2003) CCS-3 SAGE Performance (QA & QB) 1.2 1 0.8 0.6 Cycle-timeCycle-timeCycle-timeCycle-timeCycle-time (s) (s) (s) (s) (s) 0.4 ModelModel Sep-21-02 0.2 Nov-25-02 Jan-25-03 Jan-27-03 0 0 512 1024 1536 2048 2560 3072 3584 4096 # PEs z Performance of QA and QB is now with ~10% of our expectation z Without a model we would not have identified (and solved) the poor performance! PAL SWEEP3D Particle Transport: 2-D Pipeline Parallelism CCS-3 n+3 n+2 n+1 n Ω n-1 PE n-2 n-3 z Sn transport using discrete ordinates method z wavefronts of computation propagate across grid z Model for Sweep3D accurately captures the performance PAL CCS-3 How many ≡ ? z Basis for Performance Comparison – Use validated performance models (trusted) to predict performance – Large problem size (Weak scaling) to fill available memory -> Problem size on Q is double that on the Earth Simulator z Compare processing rate: – Equal processor count basis (1 to 5120) – % of system used (10 to 100%) z But have an unknown: – performance of codes on single Earth Simulator processor -> Consider range of values PAL Model Parameters: SAGE CCS-3 Parameter lpha ES45A arth SimulatorE PSMP Processors per node 4 8 CL Communication Links per Node 1 1 E Number of level 0 cells 35,000 17,500 fGS_r of real and integer Frequency 377 377 fGS_I clegather-scatters per cy 22 22 T (E) Sequential Computation time per 68.6µs 42⎧ .s 9μ of (% 5 peak ) comp ⎪ cell 21 .⎪s 4μ (% of 10 peak) 14 .⎪s 3μ (% of 15 peak ) ⎨ 10 .⎪s 7μ (% of 20 peak) 8 . 6s⎪ μ (% of 25 peak ) ⎪ 7 .s⎩⎪ 1μ (% of 30 peak) Lc(S,P) Bi-directional MPI communication 6⎧ . 10μs S< 64 8µs ⎪ Latency 6 .⎨ 44μs≤ 64 S ≤512 ⎪ 13⎩ .μs 8 S> 512 Bc(S,P) Bi-directional MPI communication ⎧ 0 . 0 S < 64 10GB/s ⎪ )er direction (pthBandwid 78⎨ MB / s ≤ 64S ≤512 ⎪ ⎩120MB s / > 512 S PAL SAGE: Processing Rate CCS-3 16000 ASCI Q 14000 ASCI White 12000 10000 8000 Higher is better 6000 4000 Processing(Cell-Cycles/s/PE) Rate 2000 0 1 10 100 1000 10000 #Processors z Weak-scaling z ASCI Q between 2.2 and 2.8 times faster than White z Data from Model (validated to 4096 PEs on Q, 8192 PEs on White) PAL SAGE: Processing Rate 160000 CCS-3 30% 25% 140000 20% 15% Earth Simulator 120000 10% 5% 100000 80000 Higher is better 60000 40000 Processing Rate (Cell-Cycles/s/PE) 20000 0 1 10 100 1000 10000 #Processors z Single Processor time is unknown – modeled using range of values PAL SAGE: Earth Simulator vs Q (Equal processor count) CCS-3 12 11 10 9 8 7 Higher = Earth Simulator 6 Faster ASCI Q) ASCI 5 4 3 2 30% 25% 20% Relative Performance (Earth Simulator / / Simulator (Earth Performance Relative 1 15% 10% 5% 0 1 10 100 1000 10000 #Processors z Ratio of processor peak-performance is 3.2 (red line) z In nearly all cases: Earth Simulator performance is better than ratio of processor peak-performance PAL SAGE: Earth Simulator vs Q (% of system utilized) CCS-3 8 7 6 5 Higher = Earth Simulator 4 Faster ASCI Q) ASCI 3 2 1 30% 25% 20% Relative Performance (EarthRelative Simulator / 15% 10% 5% 0 0 102030405060708090100 System Size (%) z Ratio of system peak-performance is 2 (red line) z In nearly all cases: Earth Simulator better than ratio of system peak-performance PAL SAGE: Component Time Predictions CCS-3 100% 100% Compute/memory Compute/memory 90% 90% Bandwidth Bandwidth 80% Latency 80% Collectives Collectives Latency 70% 70% 60% 60% 50% 50% Time (%) Time Time (%) Time 40% 40% 30% 30% 20% 20% 10%

Ratio of Processor Peak-Performance Is

Performance Modeling the Earth Simulator and ASCI Q

Project Aurora 2017 Vector Inheritance

2020 Global High-Performance Computing Product Leadership Award

Hardware Technology of the Earth Simulator 1

Supercomputers – Prestige Objects Or Crucial Tools for Science and Industry?

Beyond Earth Simulator

Evaluation of Cache-Based Superscalar and Cacheless Vector Architectures for Scientific Computations

Japan's 10 Peta FLOPS Supercomputer Development

1. Introduction

Notes on the Earth Simulator

The TOP500 Project Looking Back Over 16 Years of Supercomputing Experience with Special Emphasis on the Industrial Segment

Porting the 3D Gyrokinetic Particle-In-Cell Code GTC to the NEC SX-6 Vector Architecture: Perspectives and Challenges