PAL CCS-3
Performance Modeling the Earth Simulator and ASCI Q
Darren J. Kerbyson
Adolfy Hoisie, Harvey J. Wasserman
Performance and Architectures Laboratory (PAL)
Los Alamos National Laboratory
April 2003
“26.58 Tflops on AFES … 64.9% of peak (640 nodes)”
“14.9 Tflops on Impact-3D … 45% of peak (512 nodes)”
“10.5 Tflops on PFES … 44% of peak (376 nodes)”

[Figure: system peaks, 40 Tflops and 20 Tflops]
Talk Overview
• Overview of the Earth Simulator
  – A (quick) view of the architecture of the Earth Simulator (and Q)
  – A look at its performance characteristics
• Application-centric performance models
  – Performance is compared using trusted models of applications we are interested in, e.g. SAGE and Sweep3D
  – Analytical, parameterized in system & application characteristics
• Models can be used to provide:
  – Predicted performance prior to availability (of hardware or software)
  – Insight into performance
  – Performance comparison (which is better?)
• System performance comparison (Earth Simulator vs ASCI Q)
Earth Simulator: Overview

[Diagram: nodes 0-639, each with APs 0-7, memory, and an RCU, connected by a 640×640 crossbar]

• 640 nodes (vector processors), interconnected by a single-stage crossbar
  – Copper interconnect (~3,000 km of wire)
• NEC markets a product, the SX-6
  – Not the same as an Earth Simulator node: similar, but with a different memory subsystem
Earth Simulator Node

[Diagram: 8 APs (0-7) sharing 32 memory units, with an RCU to the crossbar and an IOP to the LAN and disks]

• Node contains: 8 vector processors (APs), 16 GB memory, a Remote Control Unit (RCU), and an I/O Processor (IOP)
• Each AP has:
  – A vector unit (8 sets of 6 different vector pipelines) and a super-scalar unit
  – ~5,000 pins
  – The AP operates mostly at 500 MHz
  – Processor peak performance = 500 MHz × 8 (pipes) × 2 (floating-point) = 8,000 Mflop/s
  – Memory bandwidth = 32 GB/s per processor (256 GB/s per node)

ASCI Q
• 2,048 nodes
• Fat-tree network (Quadrics)
• Each node is an HP ES45:
  – 4 × Alpha EV68 microprocessors at 1.25 GHz
  – Dual PCI rails to the switch-based interconnect
  – Memory hierarchy:
    » 64 KB I+D L1 cache
    » 16 MB L2 cache
    » 16 GB memory per node (typical)
                               Earth Simulator          ASCI Q
                               (similar to NEC SX-6)    (HP ES45)
Node architecture              Vector SMP               Microprocessor SMP
System topology                Crossbar (single-stage)  Fat-tree
Number of nodes                640                      2048
Processors
  - per node                   8                        4
  - system total               5120                     8192
Processor speed                500 MHz                  1.25 GHz
Peak speed
  - per processor              8 Gflops                 2.5 Gflops
  - per node                   64 Gflops                10 Gflops
Memory
  - per node                   16 GB                    16 GB (max 32 GB)
  - per processor              2 GB                     4 GB (max 8 GB)
Memory bandwidth (peak)
  - L1 cache                   N/A                      20 GB/s
  - L2 cache                   N/A                      13 GB/s
  - Main memory (per PE)       32 GB/s                  2 GB/s
Inter-node MPI communication
  - Latency                    8.6 µs                   5 µs
  - Bandwidth                  11.8 GB/s                300 MB/s

A quotation from a scientific explorer?
“It’s life, Jim, but not as we know it”
“It’s performance, Jim, but not as we expected it”
Need to have an expectation
• Complex machines and software: single processors, interactions within nodes, interactions between nodes (communication networks), I/O
• Large cost for development, deployment and maintenance
• Need to know in advance what the performance will be
A model is the key at every stage of the system lifecycle:
  – Design: lots of system choices
  – Implementation: some measurement possible (small scale)
  – Procurement: what should we buy? (ASCI Purple)
  – Installation: verification of ASCI Q performance
  – Maintenance: update of SW and/or HW
How many _ ?
(How many Alphas are equivalent to the Earth Simulator?)
Answer:
  Earth Simulator   = 640 nodes × 8 PEs/node × 8 Gflops/PE   = 40 Tflops
  Alpha ES45 system =   ? nodes × 4 PEs/node × 2.5 Gflops/PE = 40 Tflops
  -> ? = 4,096 nodes
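The peak-equivalence arithmetic above can be checked directly (node counts and per-PE peaks are those from the comparison table earlier):

```python
# Peak-equivalence arithmetic: how many Alpha ES45 nodes match the
# Earth Simulator's ~40 Tflops peak?
es_peak = 640 * 8 * 8.0       # nodes x PEs/node x Gflops/PE = 40,960 Gflops
alpha_node_peak = 4 * 2.5     # PEs/node x Gflops/PE = 10 Gflops per node
nodes_needed = es_peak / alpha_node_peak

print(es_peak / 1000)         # -> 40.96 (Tflops peak)
print(int(nodes_needed))      # -> 4096 (Alpha nodes at equal peak)
```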
Another answer: it depends on what is going to run.

Performance Model for SAGE
• Identify and understand the key characteristics of the code
• Main data structure decomposition: slab
  – Communications scale as (number of PEs)^(2/3)
  – Communication distance (in PEs) increases with scale
  – Effect of network topology
• Processing stages: gather data, computation, scatter data

Gather/scatter communication sizes per direction (example):
  Direction   Size (cells)
  X, HI       4
  X, LO       4
  Y, LO       152
  Y, HI       152
  Z, LO       6,860
  Z, HI       6,860
Scaling Characteristics

[Figures: (i) total boundary surface size (cells) vs #PEs, grid surface vs PE surface; (ii) inter-PE communication distance vs #PEs]

Surface is split across PEs when P > √(E/8)   (for E = 13,500 cells/PE: P > 41)
PE distance = ⌈(8P²/E)^(1/3)⌉
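Both relations can be checked numerically. A minimal sketch, taking E as the cells-per-PE figure from the slide (13,500); the function names are illustrative:

```python
import math

def surface_split(P, E=13500):
    """True once the slab surface is split across PEs: P > sqrt(E/8)."""
    return P > math.sqrt(E / 8)

def pe_distance(P, E=13500):
    """Inter-PE communication distance: ceil((8 P^2 / E)^(1/3))."""
    return math.ceil((8 * P**2 / E) ** (1.0 / 3.0))

print(math.sqrt(13500 / 8))   # ~41.1, so the surface splits for P > 41
print(pe_distance(41))        # -> 1: distance is still one PE at P = 41
print(pe_distance(1000))      # distance grows slowly with P
```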
Validation

• Validated on large-scale platforms:
  – ASCI Blue Mountain (SGI Origin 2000)
  – CRAY T3E
  – ASCI Red (Intel)
  – ASCI White (IBM SP3)
  – Compaq AlphaServer SMP clusters
• Model is highly accurate (< 10% error)

[Figures: predicted vs measured SAGE cycle time on ASCI White, ASCI Blue Mountain, and the Compaq ES45 AlphaServer]

Experience: Compaq Installation
• Model provided expected performance
• Installation performed in stages

[Figures: SAGE performance data (Compaq ES45), model vs measurements. Late 2001: Sep 9, Oct 2 and Oct 24, 2001. Early 2002 (upgraded PCI): Jan 4, Feb 2 and Apr 20, 2002]

-> Model used to validate measurements!
Other factors: accurately predicted when 2 rails improve the performance (P > 41)

More recently (Jan 2003)
[Figure: SAGE performance (QA & QB), model vs measurements from Sep 21 2002, Nov 25 2002, Jan 25 2003 and Jan 27 2003, out to 4,096 PEs]

• Performance of ASCI Q is now within ~10% of our expectation
• Without a model we would not have identified (and solved) the poor performance!
SWEEP3D Particle Transport: 2-D Pipeline Parallelism

[Diagram: wavefronts n-3 … n+3 sweeping diagonally across the 2-D PE grid]

• S_N transport using the discrete-ordinates method
• Wavefronts of computation propagate across the grid
• Model for Sweep3D accurately captures the performance
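The wavefront behaviour can be sketched generically. This is not the PAL Sweep3D model itself, just the standard pipeline-fill argument under an assumed uniform step time; the function names and the example numbers are illustrative:

```python
def sweep_steps(px, py, n_blocks):
    """Wavefront steps to finish one sweep on a px-by-py PE grid:
    pipeline fill (px + py - 2 steps for the wavefront to reach the
    far corner) plus one step per block of local work."""
    return (px + py - 2) + n_blocks

def sweep_time(px, py, n_blocks, t_step):
    """Total sweep time: steps x time per step (uniform steps assumed)."""
    return sweep_steps(px, py, n_blocks) * t_step

# Illustrative: a 32x32 PE grid with 100 blocks of local work per sweep
print(sweep_steps(32, 32, 100))   # -> 162 steps: 62 of fill, 100 of work
```

The fill term (px + py - 2) is what makes the "pipeline effect" grow with PE count: it is pure overhead that does not shrink as more processors are added.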
How many _ ?
• Basis for performance comparison
  – Use validated (trusted) performance models to predict performance
  – Large problem size (weak scaling) to fill the available memory
    -> Problem size on Q is double that on the Earth Simulator
• Compare processing rate on:
  – An equal processor count basis (1 to 5120)
  – % of system used (10 to 100%)
• But we have an unknown: the performance of the codes on a single Earth Simulator processor
  -> Consider a range of values

Model Parameters: SAGE
Parameter                                                Alpha ES45              Earth Simulator
P_SMP     Processors per node                            4                       8
C_L       Communication links per node                   1                       1
E         Number of level-0 cells                        35,000                  17,500
f_GS_r    Real gather-scatters per cycle                 377                     377
f_GS_i    Integer gather-scatters per cycle              22                      22
T_comp(E) Sequential computation time per cell           68.6 µs                 42.9 µs (5% of peak)
                                                                                 21.4 µs (10% of peak)
                                                                                 14.3 µs (15% of peak)
                                                                                 10.7 µs (20% of peak)
                                                                                 8.6 µs (25% of peak)
                                                                                 7.1 µs (30% of peak)
L_c(S,P)  Bi-directional MPI latency                     6.10 µs (S < 64)        8 µs
                                                         6.44 µs (64 ≤ S ≤ 512)
                                                         13.8 µs (S > 512)
B_c(S,P)  Bi-directional MPI bandwidth (per direction)   0.0 (S < 64)            10 GB/s
                                                         78 MB/s (64 ≤ S ≤ 512)
                                                         120 MB/s (S > 512)

(S = message size)
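The piecewise parameters above can be turned into a simple time-per-message sketch. Two assumptions are made here that the slide does not state: S is treated as a message size in bytes, and the flops-per-cell figure is inferred from the tabulated T_comp values (42.9 µs at 5% of an 8 Gflop/s peak implies roughly 17,150 flops per cell):

```python
def comm_time_alpha(S):
    """Bi-directional MPI time (seconds) for an S-byte message on the
    Alpha ES45 cluster, using the piecewise L_c and B_c from the table.
    A zero bandwidth entry is read as a latency-only regime."""
    if S < 64:
        return 6.10e-6                   # latency only
    elif S <= 512:
        return 6.44e-6 + S / 78e6
    else:
        return 13.8e-6 + S / 120e6

def t_comp_es(fraction_of_peak, flops_per_cell=17_150):
    """Per-cell compute time on an 8 Gflop/s Earth Simulator processor.
    flops_per_cell is inferred from the table, not a measured SAGE figure."""
    return flops_per_cell / (8e9 * fraction_of_peak)

print(round(comm_time_alpha(1024) * 1e6, 1))   # ~22.3 us for a 1 KB message
print(round(t_comp_es(0.05) * 1e6, 1))         # -> 42.9 us, matching the table
```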
SAGE: Processing Rate
[Figure: SAGE processing rate (cell-cycles/s/PE) vs #processors for ASCI Q and ASCI White; higher is better]

• Weak scaling
• ASCI Q is between 2.2 and 2.8 times faster than ASCI White
• Data from the model (validated to 4,096 PEs on Q, 8,192 PEs on White)

SAGE: Processing Rate
[Figure: SAGE processing rate (cell-cycles/s/PE) vs #processors for the Earth Simulator at 5-30% of single-processor peak; higher is better]

• Single-processor time is unknown: modeled using a range of values

SAGE: Earth Simulator vs Q (equal processor count)
[Figure: relative performance (Earth Simulator / ASCI Q) vs #processors for 5-30% of peak; higher = Earth Simulator faster]

• Ratio of processor peak performance is 3.2 (red line)
• In nearly all cases, Earth Simulator performance is better than the ratio of processor peak performance

SAGE: Earth Simulator vs Q (% of system utilized)
[Figure: relative performance (Earth Simulator / ASCI Q) vs % of system used for 5-30% of peak; higher = Earth Simulator faster]

• Ratio of system peak performance is 2 (red line)
• In nearly all cases, the Earth Simulator is better than the ratio of system peak performance

SAGE: Component Time Predictions
[Figures: predicted time breakdown (compute/memory, bandwidth, latency, collectives) vs #processors on ASCI Q and on the Earth Simulator]

• The 10% of peak case is considered for the Earth Simulator
• Higher bandwidth component on the Alpha
• Latency cost on the Earth Simulator is more visible: computation is reduced (×3.2) and bandwidth increased (×40) while latency stays about the same
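The last point can be illustrated with toy numbers (purely illustrative, not model output): dividing the compute term by 3.2 and the bandwidth term by 40 while holding latency fixed makes latency a much larger fraction of the total time.

```python
# Illustrative only: toy per-cycle component times (seconds) on the Alpha
t_compute, t_bandwidth, t_latency = 1.00, 0.40, 0.10

# Earth Simulator: compute ~3.2x faster, bandwidth ~40x higher, latency ~same
es_compute, es_bandwidth, es_latency = t_compute / 3.2, t_bandwidth / 40, t_latency

alpha_total = t_compute + t_bandwidth + t_latency
es_total = es_compute + es_bandwidth + es_latency

print(round(100 * t_latency / alpha_total))   # latency share on Alpha: ~7%
print(round(100 * es_latency / es_total))     # latency share on ES: ~24%
```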
How many Alpha nodes = the Earth Simulator for SAGE?

[Figure: equivalent Alpha ES45 node count (and peak Tflops) vs % of single-processor peak achieved on the Earth Simulator]

  % of ES peak:             5%    10%   15%   20%   25%   30%
  Equivalent Alpha Tflops:  40    67    88    103   115   125

• 1 ES45 at 1.25 GHz = 10 Gflops peak, i.e. 1,000 nodes = 10 Tflops
• Assuming 10% of peak on the ES -> need 67 Tflops of EV68 Alphas
Sweep3D: Processing Rate
[Figures: Sweep3D processing rate (cell-cycles/s/PE) vs #processors for ASCI Q, ASCI White and MCR, and for the Earth Simulator at 5-30% of peak; higher is better]

• Again, a range of values is used for the unknown Earth Simulator single-processor time

Sweep3D: Earth Simulator vs Q (equal processor count)
[Figure: relative performance (Earth Simulator / ASCI Q) vs #processors for 5-30% of peak; higher = Earth Simulator faster]

• Ratio of processor peak performance is 3.2 (red line)
• At large PE counts, Earth Simulator performance is worse than the ratio of processor peak performance

Sweep3D: Earth Simulator vs Q (% of system utilized)
[Figure: relative performance (Earth Simulator / ASCI Q) vs % of system used for 5-30% of peak; higher = Earth Simulator faster]

• Ratio of system peak performance is 2 (red line)
• Using more than 15% of the system, Earth Simulator performance is worse than the ratio of system peak performance

Sweep3D: Component Time Predictions
[Figures: predicted time breakdown (compute/memory, bandwidth and latency, each split into pipeline and block components) vs #processors on ASCI Q and the Earth Simulator]

• The 5% of peak case is considered for the Earth Simulator
• Higher latency component than SAGE on both systems (significant)
• The pipeline effect can be seen at higher PE counts
How many Alpha nodes = the Earth Simulator for Sweep3D?
[Figure: equivalent Alpha ES45 node count (and peak Tflops) vs % of single-processor peak achieved on the Earth Simulator]

  % of ES peak:             5%    10%   15%   20%   25%   30%
  Equivalent Alpha Tflops:  12    19    24    30    37    42

• 1 ES45 at 1.25 GHz = 10 Gflops peak, i.e. 1,000 nodes = 10 Tflops
• Assuming 5% of peak on the ES -> need 12 Tflops of EV68 Alphas
Hypothetical Workload: assume 40% SAGE and 60% Sweep3D

Equivalent peak Alpha ES45 Tflops:

                       SAGE % of single-processor peak
                       5%    10%   15%   20%   25%   30%
  Sweep3D   5%         23    34    42    48    53    57
           10%         27    38    47    53    57    61
           15%         30    41    50    56    60    64
           20%         34    45    53    59    64    68
           25%         38    49    57    63    68    72
           30%         41    52    60    66    71    75

• Numbers in the table indicate the peak Tflops rating of an Alpha ES45 system that would achieve the same performance as the Earth Simulator
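The table is a 40/60 weighted blend of the per-code equivalent-Tflops figures from the two earlier "How many Alpha nodes" slides; a sketch that reproduces it (the dictionary values are read from those slides):

```python
# Equivalent peak Alpha Tflops at 5%..30% of ES single-processor peak,
# taken from the earlier per-code slides
sage  = {5: 40, 10: 67, 15: 88, 20: 103, 25: 115, 30: 125}
sweep = {5: 12, 10: 19, 15: 24, 20: 30,  25: 37,  30: 42}

def blended_tflops(sage_pct, sweep_pct, w_sage=0.4, w_sweep=0.6):
    """Alpha Tflops matching the ES on a 40% SAGE / 60% Sweep3D workload."""
    return round(w_sage * sage[sage_pct] + w_sweep * sweep[sweep_pct])

print(blended_tflops(5, 5))     # -> 23, the 'current' estimate in the summary
print(blended_tflops(30, 30))   # -> 75, the most optimistic corner of the table
```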
• Currently: SAGE on the NEC SX-6 achieved 5% of peak on its first run (Sweep3D is expected to be less). This may improve over time.

Summary
• Models have provided quantitative information on the long-debated efficiency of large, microprocessor-based systems for HPC (instead of smaller, more powerful but special-purpose vector systems)
• Comparison of Earth Simulator and ASCI Q performance is heavily dependent on the workload
  – At present, an Alpha system of approx. 23 Tflops peak would achieve the same level of performance as the Earth Simulator on the workload considered here (60% Sweep3D and 40% SAGE)
• The results give a ‘reference’ comparison
  – The achievable performance on the NEC SX-6 (or an Earth Simulator node) may change over time; this analysis stays valid for the systems as they presently are
• Models have also been used for:
  – Design studies (architecture and software, e.g. IBM PERCS HPCS)
  – During the ASCI Q installation (to validate observed performance)
  – In the procurement of ASCI Purple
  – Performance comparison of systems

www.c3.lanl.gov/par_arch