
Performance Modeling the Earth Simulator and ASCI Q

Darren J. Kerbyson, Adolfy Hoisie, Harvey J. Wasserman
Performance and Architectures Laboratory (PAL)
Los Alamos National Laboratory
April 2003


Reported Earth Simulator results:

"26.58 Tflops on AFES ... 64.9% of peak (640 nodes)"

"14.9 Tflops on Impact-3D ... 45% of peak (512 nodes)"

"10.5 Tflops on PFES ... 44% of peak (376 nodes)"

Talk Overview

- Overview of the Earth Simulator
  - a (quick) view of the architecture of the Earth Simulator (and Q)
  - a look at its performance characteristics
- Application-centric performance models
  - the method of comparing performance is to use trusted models of applications that we are interested in, e.g. SAGE and Sweep3D
  - analytical / parameterized in system and application characteristics
- Models can be used to provide:
  - predicted performance prior to availability (of hardware or software)
  - insight into performance
  - performance comparison (which is better?)
- System performance comparison (Earth Simulator vs ASCI Q)

Earth Simulator: Overview

[Figure: Node0 ... Node639, each containing APs 0-7, memory, and an RCU, connected by a 640 x 640 single-stage crossbar]

- 640 nodes (vector processors), interconnected by a single-stage crossbar
  - copper interconnect (~3,000 km of wire)
- NEC markets a product, the SX-6, which is not the same as an Earth Simulator node: similar, but with a different memory subsystem

Earth Simulator Node

[Figure: node with APs 0-7 connected to 32 memory modules, an RCU to the crossbar, and an IOP to the LAN and disks]

Each node contains: 8 vector processors (APs), 16 GB of memory, a Remote Control Unit (RCU), and an I/O Processor (IOP).

- Each AP has a vector unit (8 sets of 6 different vector pipelines) and a super-scalar unit, on ~5,000 pins
- The AP operates mostly at 500 MHz
  - processor peak performance = 500 MHz x 8 (pipes) x 2 (floating-point) = 8,000 Mflop/s
  - memory bandwidth = 32 GB/s per processor (256 GB/s per node)
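These peak numbers imply a very different machine balance (memory bytes per flop) on the two systems. A minimal sketch of the arithmetic; the Alpha figures are taken from the comparison table below:

```python
# Peak-rate and machine-balance arithmetic from the slides.

def peak_gflops(clock_ghz: float, pipes: int, flops_per_pipe: int) -> float:
    """Peak Gflop/s of one processor: clock x pipes x flops per pipe."""
    return clock_ghz * pipes * flops_per_pipe

es_ap_peak = peak_gflops(0.5, 8, 2)   # Earth Simulator AP: 8 Gflop/s
alpha_peak = 2.5                      # Alpha EV68 at 1.25 GHz, 2 flops/cycle

# Balance = peak memory bandwidth (GB/s) per peak Gflop/s.
es_balance    = 32.0 / es_ap_peak     # 32 GB/s per AP -> 4 bytes/flop
alpha_balance = 2.0 / alpha_peak      # ~2 GB/s per PE -> 0.8 bytes/flop

print(es_ap_peak, es_balance)         # 8.0  4.0
print(alpha_peak, alpha_balance)      # 2.5  0.8
```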

ASCI Q

[Figure: HP ES45 node with CPUs 0-3, memory M0/M1, and 2 x 2 PCI adapters to the switch-based interconnect]

- 2048 nodes
- Fat-tree network (Quadrics)
- Each node is an HP ES45:
  - 4 x Alpha EV68 microprocessors at 1.25 GHz
  - memory hierarchy: 64 KB I+D L1 cache, 16 MB L2 cache, 16 GB of memory per node (typical)

                                      Earth Simulator          ASCI Q
                                      (similar to NEC SX-6)    (HP ES45)
Node architecture                     Vector SMP               Microprocessor SMP
System topology                       Crossbar (single-stage)  Fat-tree
Number of nodes                       640                      2048
Processors per node                   8                        4
Processors, system total              5120                     8192
Processor speed                       500 MHz                  1.25 GHz
Peak speed per processor              8 Gflops                 2.5 Gflops
Peak speed per node                   64 Gflops                10 Gflops
Memory per node                       16 GB                    16 GB (max 32 GB)
Memory per processor                  2 GB                     4 GB (max 8 GB)
Peak bandwidth, L1 cache              N/A                      20 GB/s
Peak bandwidth, L2 cache              N/A                      13 GB/s
Peak bandwidth, main memory (per PE)  32 GB/s                  2 GB/s
Inter-node MPI latency                8.6 µs                   5 µs
Inter-node MPI bandwidth              11.8 GB/s                300 MB/s

A quotation from a scientific explorer?

"It's life, Jim, but not as we know it"

"It's performance, Jim, but not as we expected it"

Need to have an expectation

- Complex machines and software: single processors, interactions within nodes, interactions between nodes (communication networks), I/O
- Large cost of development, deployment, and maintenance
- Need to know in advance what the performance will be

A model is key at every stage of the system life-cycle:

- Design: lots of system choices
- Procurement: what should we buy? (ASCI Purple)
- Implementation: some measurement possible (small scale)
- Installation: verification of ASCI Q performance
- Maintenance: updates of SW and/or HW

How many _ ?

(How many Alphas are equivalent to the Earth Simulator?)

Answer:
  Earth Simulator:   640 nodes x 8 PEs/node x 8 Gflops/PE   = 40 Tflops
  Alpha ES45 system:  ?  nodes x 4 PEs/node x 2.5 Gflops/PE = 40 Tflops  ->  ? = 4,096 nodes
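The same peak-equivalence arithmetic in a few lines, as a minimal sketch:

```python
# Peak-equivalence arithmetic from this slide.
es_peak_tflops = 640 * 8 * 8 / 1000.0       # nodes x PEs/node x Gflops/PE = 40.0

# Solve: n_nodes x 4 PEs/node x 2.5 Gflops/PE = 40 Tflops.
alpha_nodes = es_peak_tflops * 1000.0 / (4 * 2.5)
print(alpha_nodes)                           # 4096.0 ES45 nodes
```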

Another answer: it depends on what is going to run.

Performance Model for SAGE

- Identify and understand the key characteristics of the code

- Main data-structure decomposition: slab
  - communications scale as (number of PEs)^(2/3)
  - communication distance (in PEs) increases
  - effect of network topology

- Processing stages: gather data, computation, scatter data

[Figure: gather -> compute -> scatter over neighbor cells n-4 ... n+4]

Gather/scatter communication sizes:

  Direction     Size (cells)
  X, LO / HI    4
  Y, LO / HI    152
  Z, LO / HI    6,860

Scaling Characteristics

[Figure: slab decompositions for 2, 8, 64, and 256 PEs, with (i) boundary-surface size (cells) vs #PEs for the grid surface and the PE surface, and (ii) inter-PE communication distance vs #PEs]

The surface is split across PEs when P > √(E/8); the PE distance = ⌈(8P²/E)^(1/3)⌉ (for E = 13,500 cells/PE, this means P > 41).
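These two relations evaluate directly in code; a minimal sketch (variable names are mine):

```python
import math

def surface_is_split(P: int, cells_per_pe: float) -> bool:
    """True once the slab boundary surface is split across PEs: P > sqrt(E/8)."""
    return P > math.sqrt(cells_per_pe / 8.0)

def pe_distance(P: int, cells_per_pe: float) -> int:
    """Inter-PE communication distance: ceil((8 P^2 / E)^(1/3))."""
    return math.ceil((8.0 * P * P / cells_per_pe) ** (1.0 / 3.0))

E = 13500                        # cells per PE (weak scaling), from the slide
for P in (41, 42, 1024, 8192):
    print(P, surface_is_split(P, E), pe_distance(P, E))
# The split threshold sqrt(13500 / 8) ~ 41.1 matches the slide's "P > 41".
```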

Validation

- Validated on large-scale platforms:
  - ASCI Blue Mountain (SGI Origin 2000)
  - Cray T3E
  - ASCI Red (Intel)
  - ASCI White (IBM SP3)
  - Compaq AlphaServer SMP clusters
- The model is highly accurate (< 10% error)

[Figure: predicted vs measured SAGE cycle time vs #PEs on ASCI White, ASCI Blue Mountain, and a Compaq ES45 AlphaServer cluster]

Experience: Compaq Installation

- Model provided expected performance
- Installation performed in stages: late 2001, then early 2002 (upgraded PCI)

[Figure: SAGE cycle time vs #PEs on the Compaq ES45; model vs measurements from Sept 9, Oct 2, and Oct 24, 2001 (left) and Jan 4, Feb 2, and Apr 20, 2002 (right, after the PCI upgrade)]

-> The model was used to validate the measurements!
Other factors: the model accurately predicted when two rails improve the performance (P > 41).

More recently (Jan 2003)

[Figure: SAGE cycle time on QA & QB vs #PEs (to 4096); model vs measurements from Sep 21, 2002; Nov 25, 2002; Jan 25, 2003; and Jan 27, 2003]

- The performance of ASCI Q is now within ~10% of our expectation
- Without a model we would not have identified (and solved) the poor performance!

SWEEP3D Particle Transport: 2-D Pipeline Parallelism

[Figure: wavefronts n-3 ... n+3 propagating diagonally across the 2-D PE array]

- SN transport using the discrete-ordinates method
- Wavefronts of computation propagate across the grid
- The model for Sweep3D accurately captures the performance
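To illustrate why this pipeline matters at scale, here is a common first-order wavefront cost model; this is a sketch under textbook assumptions, not necessarily the exact formulation used in the PAL model:

```python
# First-order pipelined-wavefront model: a sketch, not PAL's exact model.
# A sweep of n_blocks work units over a Px x Py PE array needs the pipeline
# fill time plus one step per remaining block.

def sweep_steps(px: int, py: int, n_blocks: int) -> int:
    """Steps for one sweep direction on a px x py logical PE grid."""
    fill = px + py - 2            # steps before the far-corner PE starts
    return fill + n_blocks        # then one block completes per step

# Pipeline overhead grows with the PE array even though per-PE work is fixed:
for px, py in ((2, 2), (32, 32), (64, 128)):
    print(px * py, "PEs ->", sweep_steps(px, py, n_blocks=100), "steps")
```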

How many _ ?

- Basis for performance comparison:
  - use validated (trusted) performance models to predict performance
  - large problem size (weak scaling) to fill the available memory
    -> the problem size on Q is double that on the Earth Simulator
- Compare processing rates on:
  - an equal-processor-count basis (1 to 5120)
  - a %-of-system-used basis (10 to 100%)
- But one unknown remains: the performance of the codes on a single Earth Simulator processor
  -> consider a range of values (see the sketch after this list)
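The two comparison bases are simple to express; a minimal sketch in which rate_es and rate_q are placeholders for the per-PE processing rates (cell-cycles/s/PE) predicted by the application models:

```python
# Sketch of the two comparison bases used in the following slides.

ES_TOTAL_PES = 5120   # 640 nodes x 8 PEs
Q_TOTAL_PES  = 8192   # 2048 nodes x 4 PEs

def relative_equal_pes(rate_es, rate_q, pes: int) -> float:
    """Earth Simulator / ASCI Q at the same processor count."""
    return rate_es(pes) / rate_q(pes)

def relative_system_fraction(rate_es, rate_q, frac: float) -> float:
    """Earth Simulator / ASCI Q when each machine uses `frac` of itself.
    Total rate = per-PE rate x number of PEs in use."""
    p_es = int(frac * ES_TOTAL_PES)
    p_q  = int(frac * Q_TOTAL_PES)
    return (rate_es(p_es) * p_es) / (rate_q(p_q) * p_q)
```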

Model Parameters: SAGE

Parameter                                            Alpha ES45                Earth Simulator
P_SMP      Processors per node                       4                         8
C_L        Communication links per node              1                         1
E          Number of level-0 cells                   35,000                    17,500
f_GS_r     Real gather-scatters per cycle            377                       377
f_GS_i     Integer gather-scatters per cycle         22                        22
T_comp(E)  Sequential computation time per cell      68.6 µs                   42.9 µs (5% of peak)
                                                                               21.4 µs (10% of peak)
                                                                               14.3 µs (15% of peak)
                                                                               10.7 µs (20% of peak)
                                                                                8.6 µs (25% of peak)
                                                                                7.1 µs (30% of peak)
L_c(S,P)   Bi-directional MPI latency                6.10 µs (S < 64)          8 µs
                                                     6.44 µs (64 <= S <= 512)
                                                     13.8 µs (S > 512)
B_c(S,P)   Bi-directional MPI bandwidth              0 (S < 64)                10 GB/s
           (per direction)                           78 MB/s (64 <= S <= 512)
                                                     120 MB/s (S > 512)
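The piecewise communication parameters for ASCI Q transcribe directly into code; a minimal sketch (S is the message size in bytes; the constants come from the table above, and the latency + size/bandwidth composition is my simplification):

```python
# ASCI Q MPI model terms from the parameter table (S = message size, bytes).

def latency_q(S: int) -> float:
    """Bi-directional MPI communication latency, seconds."""
    if S < 64:
        return 6.10e-6
    if S <= 512:
        return 6.44e-6
    return 13.8e-6

def bandwidth_q(S: int) -> float:
    """Bi-directional MPI bandwidth per direction, bytes/s (0 => latency-bound)."""
    if S < 64:
        return 0.0
    if S <= 512:
        return 78e6
    return 120e6

def comm_time_q(S: int) -> float:
    """Simple latency + size/bandwidth cost for one message."""
    bw = bandwidth_q(S)
    return latency_q(S) + (S / bw if bw > 0 else 0.0)

print(comm_time_q(6860 * 8))   # e.g. a Z-direction gather of 6,860 8-byte cells
```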

SAGE: Processing Rate

[Figure: SAGE processing rate (cell-cycles/s/PE) vs #processors for ASCI Q and ASCI White; higher is better]

- Weak scaling
- ASCI Q is between 2.2 and 2.8 times faster than ASCI White
- Data from the model (validated to 4096 PEs on Q and 8192 PEs on White)

SAGE: Processing Rate (Earth Simulator)

[Figure: SAGE processing rate (cell-cycles/s/PE) on the Earth Simulator vs #processors, for assumed 5%-30% of single-processor peak; higher is better]

- The single-processor time is unknown; it is modeled using a range of values

SAGE: Earth Simulator vs Q (equal processor count)

[Figure: relative performance (Earth Simulator / ASCI Q) vs #processors, for 5%-30% of peak; higher = Earth Simulator faster]

- The ratio of processor peak performance is 3.2 (red line)
- In nearly all cases, Earth Simulator performance is better than the ratio of processor peak performance

SAGE: Earth Simulator vs Q (% of system utilized)

[Figure: relative performance (Earth Simulator / ASCI Q) vs % of system used, for 5%-30% of peak; higher = Earth Simulator faster]

- The ratio of system peak performance is 2 (red line)
- In nearly all cases, the Earth Simulator is better than the ratio of system peak performance

SAGE: Component Time Predictions

[Figure: predicted breakdown of cycle time (%) into compute/memory, bandwidth, latency, and collectives vs #processors, for ASCI Q (to 8192 PEs) and the Earth Simulator (to 5120 PEs)]

- The 10%-of-peak case is considered on the Earth Simulator
- Higher bandwidth component on the Alpha
- The latency cost on the Earth Simulator is more visible: computation is reduced (x3.2) and bandwidth increased (x40), while latency is about the same

How many Alpha nodes = the Earth Simulator for SAGE?

[Figure: number of Alpha ES45 nodes, and the equivalent Alpha Tflops (40, 67, 88, 103, 115, 125), needed to match the Earth Simulator on SAGE, for 5%-30% of single-processor peak achieved on the Earth Simulator]

- 1 ES45 node (1.25 GHz) = 10 Gflops, so 1,000 nodes = 10 Tflops
- Assuming 10% of peak on the Earth Simulator, 67 Tflops of EV68 Alphas are needed

Sweep3D: Processing Rate

[Figure: Sweep3D processing rate (cell-cycles/s/PE) vs #processors; left: ASCI Q, ASCI White, and MCR; right: the Earth Simulator at 5%-30% of peak; higher is better]

- Again, a range of values is used for the unknown Earth Simulator single-processor time

Sweep3D: Earth Simulator vs Q (equal processor count)

[Figure: relative performance (Earth Simulator / ASCI Q) vs #processors, for 5%-30% of peak; higher = Earth Simulator faster]

- The ratio of processor peak performance is 3.2 (red line)
- At large PE counts, Earth Simulator performance is worse than the ratio of processor peak performance

Sweep3D: Earth Simulator vs Q (% of system utilized)

[Figure: relative performance (Earth Simulator / ASCI Q) vs % of system used, for 5%-30% of peak; higher = Earth Simulator faster]

- The ratio of system peak performance is 2 (red line)
- When using more than 15% of the system, Earth Simulator performance is worse than the ratio of system peak performance

Sweep3D: Component Time Predictions

[Figure: predicted breakdown of time (%) into compute/memory, bandwidth, and latency, each split into pipeline and block components, vs #processors for ASCI Q and the Earth Simulator]

- The 5%-of-peak case is considered for the Earth Simulator
- Higher latency component than SAGE on both systems; it is significant
- The pipeline effect can be seen at higher PE counts

How many Alpha nodes = the Earth Simulator for Sweep3D?

[Figure: number of Alpha ES45 nodes, and the equivalent Alpha Tflops (12, 19, 24, 30, 37, 42), needed to match the Earth Simulator on Sweep3D, for 5%-30% of single-processor peak achieved on the Earth Simulator]

- 1 ES45 node (1.25 GHz) = 10 Gflops, so 1,000 nodes = 10 Tflops
- Assuming 5% of peak on the Earth Simulator, 12 Tflops of EV68 Alphas are needed

Hypothetical Workload: Assume 40% SAGE and 60% Sweep3D

                     SAGE % of single-processor peak
                     5%    10%   15%   20%   25%   30%
Sweep3D    5%        23    34    42    48    53    57
          10%        27    38    47    53    57    61
          15%        30    41    50    56    60    64
          20%        34    45    53    59    64    68
          25%        38    49    57    63    68    72
          30%        41    52    60    66    71    75

- The numbers in the table give the peak Tflops rating of an Alpha ES45 system that would achieve the same performance as the Earth Simulator (see the sketch below)
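These entries are consistent with a simple 40/60 weighted combination of the per-application equivalent-Tflops figures from the two preceding slides; a minimal sketch that reproduces the table under that assumption:

```python
# Equivalent Alpha Tflops from the two "How many Alpha nodes" slides,
# indexed by the assumed % of Earth Simulator single-processor peak.
peaks = (5, 10, 15, 20, 25, 30)
sage_tflops  = {5: 40, 10: 67, 15: 88, 20: 103, 25: 115, 30: 125}
sweep_tflops = {5: 12, 10: 19, 15: 24, 20: 30,  25: 37,  30: 42}

# Workload = 40% SAGE + 60% Sweep3D: weight the per-code equivalents.
for sweep_pct in peaks:
    row = [round(0.4 * sage_tflops[sage_pct] + 0.6 * sweep_tflops[sweep_pct])
           for sage_pct in peaks]
    print(f"Sweep3D {sweep_pct:2d}%:", row)
# First row (Sweep3D 5%): [23, 34, 42, 48, 53, 57], matching the table.
```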

- Currently, SAGE on the NEC SX-6 achieved 5% of peak on its first run (Sweep3D is expected to be less); this may improve over time

Summary

- Models have provided quantitative information on the long-debated efficiency of large, microprocessor-based systems for HPC (as opposed to smaller, more powerful, but special-purpose vector systems)

- Comparison of Earth Simulator and ASCI Q performance is heavily dependent on the workload
  - At present, an Alpha system of approx. 23 Tflops peak would achieve the same level of performance as the Earth Simulator on the workload considered here (60% Sweep3D and 40% SAGE)
- The results give a 'reference' comparison
  - The achievable performance on the NEC SX-6 (or an Earth Simulator node) may change over time; this analysis remains valid for the systems as they presently are
- Models have also been used for:
  - design studies (architecture and software, e.g. IBM PERCS in HPCS)
  - the ASCI Q installation (to validate observed performance)
  - the procurement of ASCI Purple
  - performance comparison of systems

www.c3.lanl.gov/par_arch