PAL

CCS-3 The Earth Simulator vs. the rest of the Supercomputing World

Ten rounds or a Knock-Out?

Darren J. Kerbyson

Adolfy Hoisie, Harvey J. Wasserman Performance and Architectures Laboratory (PAL) Los Alamos National Laboratory April 2003 PAL

CCS-3 “26.58Tflops on AFES … 64.9% of peak (640nodes)”

“14.9Tflops on Impact-3D …. 45% of peak (512nodes)”

“10.5Tflops on PFES … 44% of peak (376nodes)”

40Tflops

20Tflops PAL Talk Overview

CCS-3 z Overview of the Earth Simulator – A (quick) view of the architecture of the Earth Simulator (and Q) – A look at its performance characteristics

z Application Centric Performance Models – Method of comparing performance is to use trusted models of applications that we are interested in, e.g. SAGE and Sweep3D. – Analytical / Parameterized in system & application characteristics

z Models can be used to provide: – Predicted performance prior to availability (hardware or software) – Insight into performance – Performance Comparison (which is better?)

z System Performance Comparison (Earth Simulator vs ASCI Q) PAL Earth Simulator: Overview

CCS-3 ......

RCU RCU . . . RCU AP AP AP Memory Memory Memory 0-7 0-7 0-7 .. . Node0 Node1 Node639

640x640 ... crossbar

z 640 Nodes (Vector Processors) z interconnected by a single stage cross-bar – Copper interconnect (~3,000Km wire) z NEC markets a product – SX-6 – Not the same as an Earth Simulator Node - similar but different memory sub-system PAL Earth Simulator Node Crossbar LAN network CCS-3Disks

AP AP AP AP AP AP AP AP 0 1 2 3 4 5 6 7 RCU IOP Node contains: 8 vector processors (AP) 16GByte memory Remote Control Unit (RCU) I/O Processor (IOP) 6 0 1 2 3 4 5 31 M M M M M M M . . . M z Each AP has: – A vector unit (8 sets of 6 different vector pipelines), and a super-scalar unit – 5,000 pins – AP operates mostly at 500MHz: – Processor Peak Performance = 500 x 8 (pipes) x 2 (float-point) = 8,000 Mflop/s – Memory bandwidth = 32 GB/s per Processor (256GB/s per node) PAL ASCI Q

CCS-3

z 2048 Nodes Node

z Fat-tree network (Quadrics) CPU3 CPU PCI PCI z Each node is an HP ES45 2 x2 x2 CPU – 4 x Alpha EV68 micro-processors 1

– 1.25GHz CPU0 M0 Switch – Memory Hierarchy: L1 64KB I+D Based » 64KB I+D L1 Inter-connect L2 16MB M1 » 16MB L2 cache » 16 GB memory per node (typical) PAL Earth Simulator ASCI Q (based on NEC SX-6) (HP ES45) CCS-3 Node Architecture Vector SMP Microprocessor SMP System Topology Crossbar (single- Fat-tree stage) Number of nodes 640 2048

Processors - per node 8 4 - system total 5120 8192 Processor Speed 500 MHz 1.25 GHz Peak speed - per processor 8 Gflops 2.5 Gflops - per node 64 Gflops 10 Gflops Memory - per node 16 GB 16 GB (max 32 GB) - per processor 2 GB 4 GB (max 8 GB) Memory Bandwidth (peak) - L1 Cache N/A 20 GB/s - L2 Cache N/A 13 GB/s - Main memory (per PE) 32 GB/s 2 GB/s Inter-node MPI communication - Latency 8.6 µsec 5 µsec -Bandwidth 11.8 GB/s 300 MB/s PAL A quotation from a

scientific explorer? CCS-3

“Its life Jim, but not as we know it”

“Its performance Jim, but not as we expected it” PAL Need to have an expectation

z Complex machines and software CCS-3 – single processors, interactions within nodes, interaction between nodes (communication networks), I/O

z Large cost for development, deployment and maintenance

z Need to know in advance what performance will be.

Update of SW and/or HW Maintenance

Verification of ASCI Q Performance Installation

A Model Procurement What should we buy? is the Key (ASCI Purple) Implementation Some measurement possible (small scale)

Design Lots of system choices PAL

CCS-3 How many ≡ ?

( How many Alphas are equivalent to the Earth Simulator ? )

Answer: Earth Simulator = 640 x 8 x 8 = 40-Tflops Nodes PEs/node Gflops/PE Alpha ES45 System = ? x 4 x 2.5 = 40-Tflops -> ? = 4,096 nodes

Another Answer: It depends on what are going to run PAL Performance Model for SAGE

CCS-3 z Identify and understand the key characteristics of code

z Main Data structure decomposition: Slab PE – communications scale as number of PEs (^2/3) 1 2 – communication distance (PEs) increases 3 4 – Effect of network topology

z Processing Stages – Gather data, Computation, Scatter Data

Gather n-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 Gather/Scatter Comms Direction Size (cells) X, HI 4 Computen-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 X, LO 4 Y, LO 152 Y, HI 152 Z, LO 6860 Scatter n-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 Z, HI 6860 PAL Scaling Characteristics 256PEs 7 64PEs 1 4 CCS-3 8PEs 3 8 2 5 2PEs 1 5 2 ...... 9 12345678 4 6 12 3 3 2 4 7

Comparison of Boundary Surfaces Comparison of inter-PE communication distances 1.E+06 100 Neighbor Distance Grid Surface PE Surface

1.E+05

10

1.E+04 PE Distance Total Surface (cells) Surface Total i) Surface size ii) PE Distance 1.E+03 1 1 10 100 1000 10000 100000 1 10 100 1000 10000 100000 # PEs # PEs Surface split across PEs: P >√(E/8) PE distance = ⎡ (8P2/E)1/3 ⎤ (for cells/PE = 13,500 P > 41) PAL Validation

z Validated on large-scale platforms: SAGE Performance (ASCI White) CCS-3 2 – ASCI Blue Mountain (SGI Origin 2000) 1.8 1.6 – CRAY T3E 1.4 – ASCI Red (intel) 1.2 1 – ASCI White (IBM SP3) 0.8 Cycle Time(s) 0.6 – Compaq Alphaserver SMP clusters 0.4 Prediction 0.2 Measurement 0 z Model is highly accurate (< 10% error) 1 10 100 1000 10000 # PEs

SAGE Performance (ASCI Blue Mountain) SAGE Performance (Compaq ES45 AlphaServer) 6 0.6

5 0.5

4 0.4

3 0.3

Cycle Time (s) 2 Cycle Time (s) 0.2

1 Prediction 0.1 Prediction Measurement Measurement

0 0 100 1000 10000 1 10 100 # PEs # PEs PAL Experience: Compaq Installation

z Model provided expected performance CCS-3 z Installation performed in stages

Late 2001: Early 2002: upgraded PCI

SAGE - Performance Data (Compaq ES45) SAGE - Performance Data (Compaq ES45)

1.6 ModelModelModel 1.6 Model ModelModel MeasuredMeasured (Sept(Sept 9th9th 01)01) 4-Jan4-Jan Measured (Oct 2nd 01) 2-Feb2-Feb 1.2 Measured (Oct 24th 01) 1.2 20-Apr

0.8 0.8 CycleCycleCycleCycle Time Time Time Time (s)(s)(s)(s) Cycle TimeCycle (s) CycleCycleCycle Time Time Time (s) (s) (s)

0.4 0.4

0 0 1 10 100 1000 1 10 100 1000 10000 # PEs # PEs -> Model used to validate measurements! Other factors: accurately predicted when 2-rails improves the performance (P>41) PAL More recently: (Jan 2003)

CCS-3

SAGE Performance (QA & QB) 1.2

1

0.8

0.6

Cycle-timeCycle-timeCycle-timeCycle-timeCycle-time (s) (s) (s) (s) (s) 0.4 ModelModel Sep-21-02 0.2 Nov-25-02 Jan-25-03 Jan-27-03 0 0 512 1024 1536 2048 2560 3072 3584 4096 # PEs

z Performance of QA and QB is now with ~10% of our expectation z Without a model we would not have identified (and solved) the poor performance! PAL SWEEP3D Particle Transport: 2-D Pipeline Parallelism CCS-3 n+3 n+2 n+1 n Ω

n-1 PE

n-2

n-3

z Sn transport using discrete ordinates method z wavefronts of computation propagate across grid z Model for Sweep3D accurately captures the performance PAL

CCS-3 How many ≡ ?

z Basis for Performance Comparison – Use validated performance models (trusted) to predict performance – Large problem size (Weak scaling) to fill available memory -> Problem size on Q is double that on the Earth Simulator

z Compare processing rate: – Equal processor count basis (1 to 5120) – % of system used (10 to 100%)

z But have an unknown: – performance of codes on single Earth Simulator processor -> Consider range of values PAL Model Parameters: SAGE

CCS-3 Parameter Alpha ES45 Earth Simulator

PSMP Processors per node 4 8 CL Communication Links per Node 1 1 E Number of level 0 cells 35,000 17,500

fGS_r Frequency of real and integer 377 377 fGS_I gather-scatters per cycle 22 22 T (E) Sequential Computation time per 68.6µs ⎧ μ peakofs )5(%9.42 comp ⎪ cell ⎪ μ 10(%4.21 peakofs ) ⎪ μ peakofs )15(%3.14 ⎨ ⎪ μ 20(%7.10 peakofs ) ⎪ μ peakofs )25(%6.8 ⎪ ⎩⎪ μ 30(%1.7 peakofs )

Lc(S,P) Bi-directional MPI communication ⎧ 10.6 μ Ss < 64 8µs ⎪ Latency ⎨ μ 6444.6 Ss ≤≤ 512 ⎪ ⎩ 8.13 μ Ss > 512

Bc(S,P) Bi-directional MPI communication ⎧ 0.0 S < 64 10GB/s ⎪ Bandwidth (per direction) ⎨ sMB 64/78 S ≤≤ 512 ⎪ ⎩ /120 SsMB > 512 PAL SAGE: Processing Rate

CCS-3 16000 ASCI Q 14000 ASCI White

12000

10000

8000 Higher is better

6000

4000 Processing(Cell-Cycles/s/PE) Rate 2000

0 1 10 100 1000 10000 #Processors z Weak-scaling z ASCI Q between 2.2 and 2.8 times faster than White z Data from Model (validated to 4096 PEs on Q, 8192 PEs on White) PAL SAGE: Processing Rate

160000 CCS-3 30% 25% 140000 20% 15% Earth Simulator 120000 10% 5% 100000

80000 Higher is better

60000

40000

Processing Rate (Cell-Cycles/s/PE) 20000

0 1 10 100 1000 10000 #Processors

z Single Processor time is unknown – modeled using range of values PAL SAGE: Earth Simulator vs Q

(Equal processor count) CCS-3

12 11 10 9 8 7 Higher = Earth Simulator 6 Faster ASCI Q) ASCI 5 4 3 2 30% 25% 20% Relative Performance (Earth Simulator / / Simulator (Earth Performance Relative 1 15% 10% 5% 0 1 10 100 1000 10000 #Processors z Ratio of processor peak-performance is 3.2 (red line) z In nearly all cases: Earth Simulator performance is better than ratio of processor peak-performance PAL SAGE: Earth Simulator vs Q (% of system utilized) CCS-3 8

7

6

5 Higher = Earth Simulator 4 Faster ASCI Q) ASCI 3

2

1 30% 25% 20% Relative Performance (EarthRelative Simulator / 15% 10% 5% 0 0 102030405060708090100 System Size (%) z Ratio of system peak-performance is 2 (red line) z In nearly all cases: Earth Simulator better than ratio of system peak-performance PAL SAGE: Component Time Predictions

CCS-3

100% 100% Compute/memory Compute/memory 90% 90% Bandwidth Bandwidth 80% Latency 80% Collectives Collectives Latency 70% 70%

60% 60%

50% 50% Time (%) Time Time (%) Time 40% 40%

30% 30%

20% 20%

10% 10%

0% 0% 1 2 4 8 1 2 4 8 16 32 64 16 32 64 128 256 512 128 256 512 1024 2048 4096 5120 8192 1024 2048 4096 5120 # Processors # Processors

z 10% case considered on Earth Simulator z Higher bandwidth component on Alpha z Latency cost on Earth Simulator is more visible – reduced computation (x3.2), increased bandwidth (x40), latency ~same PAL How many Alpha nodes ≡ Earth Simulator for SAGE? CCS-3

14000 140 125 12000 115 120 103 10000 100 88

8000 80 67

6000 60 40 4000 40 Equivalent AlphaEquivalent Tflops

# Alpha Nodes (ES451.25GHz) 2000 20

0 0 5% 10% 15% 20% 25% 30% % of single-processor peak achieved on Earth Simulator

z 1 ES45 1.25GHz = 10GFlops, or 1,000 nodes = 10Tflops z If assume a 10% of peak on ES -> need 67Tflops of current Alphas PAL Sweep3D: Processing Rate

CCS-3

200000 1400000 ASCI Q 30% 180000 25% ASCI White 1200000 160000 20% MCR 15% 1000000 140000 10% 5% 120000 800000

100000 600000 80000

400000 60000 (cc/s/pe) Rate Processing Processing Rate (cc/s/pe) Rate Processing 40000 200000 20000 0 0 1 10 100 1000 10000 1 10 100 1000 10000 # Processors # Processors Higher is better

z Again, range of values for Earth Simulator unknown single processor time PAL Sweep3D: Earth Simulator vs Q

(Equal processor count) CCS-3 8 30% 7 25% 20% 6 15% 10% 5% 5 Ratio

4 Higher = Earth Simulator Faster 3

2

1 Relative Performance (Earth Simulator / ASCI Q) 0 1 10 100 1000 10000 # Processors

z Ratio of processor peak-performance is 3.2 (red line) z At large PE counts: Earth Simulator performance is worse than ratio of processor peak-performance PAL Sweep3D: Earth Simulator vs Q

(% of system utilized) CCS-3 5 30% 4.5 25% 20% 4 15% 10% 3.5 5% 3 Ratio

2.5 Higher = Earth Simulator

2 Faster

1.5

1

0.5 Relative Performance (EarthSimulator / ASCIQ) 0 020406080100 System Size (%)

z Ratio of system peak-performance is 2 (red line) z Using more than 15% of system: Earth Simulator performance is worse than ratio of system peak-performance PAL Sweep3D: Component Time Predictions

CCS-3

100% 100% Compute/memory (Pipeline) 90% 90% Compute/memory (Pipeline) Compute/memory (Block) Compute/memory (Block) 80% Bandwidth (Pipeline) 80% Bandwidth (Pipeline) Bandwidth (Block) 70% 70% Bandwidth (Block) Latency (Pipeline) Latency (Pipeline) 60% Latency (block) 60% Latency (block) 50% 50% Time (%) Time Time (%) Time 40% 40%

30% 30%

20% 20%

10% 10%

0% 0% 1 2 4 8 1 2 4 8 16 32 64 16 32 64 128 256 512 128 256 512 1024 2048 4096 5120 1024 2048 4096 8192 # Processors # Processors

z 5% case considered for the Earth Simulator z Higher Latency component than SAGE on both systems – Significant z Pipeline effect can be seen on higher PE counts PAL How many Alpha nodes ≡ Earth Simulator for Sweep3D? CCS-3

5000 42 45

37 40

4000 35 30 30 3000 24 25 19 20 2000 12 15 Equivalent AlphaEquivalent Tflops 1000 10 # Alpha Nodes(ES45 1.25GHz) 5

0 0 5% 10% 15% 20% 25% 30% % of single-processor peak achieved on Earth Simulator

z 1 ES45 1.25GHz = 10GFlops, or 1,000 nodes = 10Tflops z If assume a 5% of peak on ES -> need 12Tflops of current Alphas PAL Hypothetical Workload:

Assume 40% SAGE and 60% Sweep CCS-3

SAGE % of single-processor peak

5% 10% 15% 20% 25% 30% S 5% 23 34 42 48 53 57 w 10% 27 38 46 52 57 61 e e 15% 30 41 49 55 60 64 p 20% 33 44 52 58 62 67 3 25% 35 46 54 60 65 69 D 30% 38 48 57 63 68 72

z Numbers in table indicate a peak Tflop rated Alpha ES46 system that would achieve the same performance as the Earth Simulator

z Currently: SAGE on NEC SX-6 achieved 5% on first run (Sweep3D expected to be less). This may improve over time. PAL Summary

z Models have provided quantitative information on the long- CCS-3 debated efficiency of large, microprocessor-based systems for HPC (instead of smaller, more-powerful but special-purpose vector systems)

z Comparison of the Earth Simulator and ASCI Q performance is heavily dependent on the workload – At present, an Alpha system of approx. 23Tflops peak would achieve the same level of performance as the Earth Simulator on the workload considered here (60% Sweep3D, and 40% SAGE). z Results gives a ‘reference’ comparison – The achievable performance on the NEC SX-6 (or Earth Simulator node) may change over time. This analysis will stay valid for the systems as they presently are. z Models have also been used for: – Design studies (architecture and software, e.g IBM PERCS HPCS ) – During ASCI Q installation (to validate observed performance) – In the procurement of ASCI Purple – Performance comparison of systems www.c3.lanl.gov/par_arch