Performance Modeling the Earth Simulator and ASCI Q
Total Page:16
File Type:pdf, Size:1020Kb
PAL CCS-3 Performance Modeling the Earth Simulator and ASCI Q Darren J. Kerbyson Adolfy Hoisie, Harvey J. Wasserman Performance and Architectures Laboratory (PAL) Los Alamos National Laboratory April 2003 Los Alamos PAL CCS-3 “26.58Tflops on AFES … 64.9% of peak (640nodes)” “14.9Tflops on Impact-3D …. 45% of peak (512nodes)” “10.5Tflops on PFES … 44% of peak (376nodes)” 40Tflops 20Tflops Los Alamos PAL Talk Overview CCS-3 G Overview of the Earth Simulator – A (quick) view of the architecture of the Earth Simulator (and Q) – A look at its performance characteristics G Application Centric Performance Models – Method of comparing performance is to use trusted models of applications that we are interested in, e.g. SAGE and Sweep3D. – Analytical / Parameterized in system & application characteristics G Models can be used to provide: – Predicted performance prior to availability (hardware or software) – Insight into performance – Performance Comparison (which is better?) G System Performance Comparison (Earth Simulator vs ASCI Q) Los Alamos PAL Earth Simulator: Overview CCS-3 ... ... RCU RCU . RCU AP AP AP Memory Memory Memory 0-7 0-7 0-7 ... Node0 Node1 Node639 640x640 ... crossbar G 640 Nodes (Vector Processors) G interconnected by a single stage cross-bar – Copper interconnect (~3,000Km wire) G NEC markets a product – SX-6 – Not the same as an Earth Simulator Node - similar but different memory sub-system Los Alamos PAL Earth Simulator Node CCS-3 Crossbar LAN network Disks AP AP AP AP AP AP AP AP 0 1 2 3 4 5 6 7 RCU IOP Node contains: 8 vector processors (AP) 16GByte memory Remote Control Unit (RCU) I/O Processor (IOP) 2 3 4 5 6 0 1 31 M M M M M M M . M G Each AP has: – A vector unit (8 sets of 6 different vector pipelines), and a super-scalar unit – 5,000 pins – AP operates mostly at 500MHz: – Processor Peak Performance = 500 x 8 (pipes) x 2 (float-point) = 8,000 Mflop/s – Memory bandwidth = 32 GB/s per Processor (256GB/s per node) Los Alamos PAL ASCI Q CCS-3 G 2048 Nodes Node G Fat-tree network (Quadrics) CPU3 CPU PCI PCI G Each node is an HP ES45 2 x2 x2 CPU – 4 x Alpha EV68 micro-processors 1 – 1.25GHz CPU 0 M0 Switch – Memory Hierarchy: L1 64KB I+D Based » 64KB I+D L1 Inter-connect L2 16MB M1 » 16MB L2 cache » 16 GB memory per node (typical) Los Alamos PAL CCS-3 Earth Simulator ASCI Q (similar to NEC SX-6) (HP ES45) Node Architecture Vector SMP Microprocessor SMP System Topology Crossbar (single- Fat-tree stage) Number of nodes 640 2048 Processors - per node 8 4 - system total 5120 8192 Processor Speed 500 MHz 1.25 GHz Peak speed - per processor 8 Gflops 2.5 Gflops - per node 64 Gflops 10 Gflops Memory - per node 16 GB 16 GB (max 32 GB) - per processor 2 GB 4 GB (max 8 GB) Memory Bandwidth (peak) - L1 Cache N/A 20 GB/s - L2 Cache N/A 13 GB/s - Main memory (per PE) 32 GB/s 2 GB/s Inter-node MPI communication - Latency 8.6 sec 5 sec - Bandwidth 11.8 GB/s 300 MB/s Los Alamos PAL A quotation from a CCS-3 scientific explorer? “Its life Jim, but not as we know it” “Its performance Jim, but not as we expected it” Los Alamos PAL Need to have an expectation CCS-3 G Complex machines and software – single processors, interactions within nodes, interaction between nodes (communication networks), I/O G Large cost for development, deployment and maintenance G Need to know in advance what performance will be. Update of SW and/or HW Maintenance Verification of ASCI Q Performance Installation A Model Procurement What should we buy? is the Key (ASCI Purple) Implementation Some measurement possible (small scale) Design Lots of system choices Los Alamos PAL CCS-3 How many _ ? ( How many Alphas are equivalent to the Earth Simulator ? ) Answer: Earth Simulator = 640 x 8 x 8 = 40-Tflops Nodes PEs/node Gflops/PE Alpha ES45 System = ? x 4 x 2.5 = 40-Tflops -> ? = 4,096 nodes Another Answer: It depends on what is going to run Los Alamos PAL Performance Model for SAGE CCS-3 G Identify and understand the key characteristics of code G Main Data structure decomposition: Slab PE – communications scale as number of PEs (^2/3) 1 2 – communication distance (PEs) increases 3 4 – Effect of network topology G Processing Stages – Gather data, Computation, Scatter Data n-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 Gather Gather/Scatter Comms Direction Size (cells) X, HI 4 Computen-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 X, LO 4 Y, LO 152 Y, HI 152 Z, LO 6860 Scatter n-4 n-3 n-2 n-1 n n+1 n+2 n+3 n+4 Z, HI 6860 Los Alamos PAL Scaling Characteristics CCS-3 256PEs 7 64PEs 1 4 8PEs 3 8 2 5 2PEs 1 5 2 ... ... 9 12345678 4 6 12 3 3 2 4 7 Comparison of Boundary Surfaces Comparison of inter-PE communication distances 1.E+06 100 Neighbor Distance Grid Surface PE Surface 1.E+05 10 1.E+04 PE Distance Total Surface (cells) i) Surface size ii) PE Distance 1.E+03 1 110100 1000 10000 100000 110100 1000 10000 100000 # PEs # PEs Surface split across PEs: P >÷(E/8) PE distance = ÈÈ (8P2/E)1/3 ˘˘ (for cells/PE = 13,500 P > 41) Los Alamos PAL Validation CCS-3 G Validated on large-scale platforms: SAGE Performance (ASCI White) 2 – ASCI Blue Mountain (SGI Origin 2000) 1.8 1.6 – CRAY T3E 1.4 – ASCI Red (intel) 1.2 1 – ASCI White (IBM SP3) 0.8 Cycle Time (s) 0.6 – Compaq Alphaserver SMP clusters 0.4 Prediction 0.2 Measurement G 0 Model is highly accurate (< 10% error) 110100 1000 10000 # PEs SAGE Performance (ASCI Blue Mountain) SAGE Performance (Compaq ES45 AlphaServer) 6 0.6 5 0.5 4 0.4 3 0.3 Cycle Time (s) Cycle Time (s) 2 0.2 1 Prediction 0.1 Prediction Measurement Measurement 0 0 100 1000 10000 110100 # PEs # PEs Los Alamos PAL Experience: Compaq Installation CCS-3 G Model provided expected performance G Installation performed in stages Late 2001: Early 2002: upgraded PCI SAGE - Performance Data (Compaq ES45) SAGE - Performance Data (Compaq ES45) 1.6 ModelModel 1.6 Model ModelModel MeasuredMeasured (Sept(Sept 9th9th 01)01) 4-Jan4-Jan Measured (Oct 2nd 01) 2-Feb2-Feb 1.2 Measured (Oct 24th 01) 1.2 20-Apr 0.8 0.8 Cycle Time (s) Cycle Time (s) Cycle Time (s) Cycle Time (s) Cycle Time (s) Cycle Time (s)Cycle Time (s)Cycle Time (s) 0.4 0.4 0 0 110100 1000 110100 1000 10000 # PEs # PEs -> Model used to validate measurements! Other factors: accurately predicted when 2-rails improves the performance (P>41) Los Alamos PAL More recently: (Jan 2003) CCS-3 SAGE Performance (QA & QB) 1.2 1 0.8 0.6 Cycle-time (s)Cycle-time (s)Cycle-time (s)Cycle-time (s)Cycle-time (s) 0.4 ModelModel Sep-21-02 0.2 Nov-25-02 Jan-25-03 Jan-27-03 0 0 512 1024 1536 2048 2560 3072 3584 4096 # PEs G Performance of ASCI Q is now with ~10% of our expectation G Without a model we would not have identified (and solved) the poor performance! Los Alamos PAL SWEEP3D Particle Transport: 2-D CCS-3 Pipeline Parallelism n+3 n+2 n+1 n WW n-1 PE n-2 n-3 G SN transport using discrete ordinates method G wavefronts of computation propagate across grid G Model for Sweep3D accurately captures the performance Los Alamos PAL CCS-3 How many _ ? G Basis for Performance Comparison – Use validated performance models (trusted) to predict performance – Large problem size (Weak scaling) to fill available memory -> Problem size on Q is double that on the Earth Simulator G Compare processing rate: – Equal processor count basis (1 to 5120) – % of system used (10 to 100%) G But have an unknown: – performance of codes on single Earth Simulator processor -> Consider range of values Los Alamos PAL Model Parameters: SAGE CCS-3 Parameter Alpha ES45 Earth Simulator PSMP Processors per node 4 8 CL Communication Links per Node 1 1 E Number of level 0 cells 35,000 17,500 fGS_r Frequency of real and integer 377 377 fGS_I gather-scatters per cycle 22 22 T (E) Sequential Computation time per 68.6 s Ï 42.9m s (%5 of peak) comp Ô cell Ô 21.4m s (%10 of peak) Ô 14.3m s (%15 of peak) Ì Ô 10.7m s (%20 of peak) Ô 8.6m s (%25 of peak) Ô ÓÔ 7.1m s (%30 of peak) L (S,P) Bi-directional MPI communication Ï 6.10m s S < 64 8 s c Ô Latency Ì 6.44m s 64 £ S £ 512 Ô Ó 13.8m s S > 512 B (S,P) Bi-directional MPI communication Ï 0.0 S < 64 10GB/s c Ô Bandwidth (per direction) Ì 78MB/ s 64 £ S £ 512 Ô Ó 120MB/ s S > 512 Los Alamos PAL SAGE: Processing Rate CCS-3 16000 ASCI Q 14000 ASCI White 12000 10000 8000 Higher is better 6000 4000 Processing Rate (Cell-Cycles/s/PE) 2000 0 110100 1000 10000 #Processors G Weak-scaling G ASCI Q between 2.2 and 2.8 times faster than ASCI White G Data from Model (validated to 4096 PEs on Q, 8192 PEs on White) Los Alamos PAL SAGE: Processing Rate CCS-3 160000 30% 25% 140000 20% 15% Earth Simulator 120000 10% 5% 100000 80000 Higher is better 60000 40000 Processing Rate (Cell-Cycles/s/PE) 20000 0 110100 1000 10000 #Processors G Single Processor time is unknown – modeled using range of values Los Alamos PAL SAGE: Earth Simulator vs Q CCS-3 (Equal processor count) 12 11 10 9 8 7 Higher = Earth Simulator 6 Faster ASCI Q) 5 4 3 2 30% 25% 20% Relative Performance (Earth Simulator / 1 15% 10% 5% 0 110100 1000 10000 #Processors G Ratio of processor peak-performance is 3.2 (red line) G In nearly all cases: Earth Simulator performance is better than ratio of processor peak-performance Los Alamos PAL SAGE: Earth Simulator vs Q CCS-3 (% of system utilized) 8 7 6 5 Higher = Earth Simulator 4 Faster ASCI Q) 3 2 1 30% 25% 20% Relative Performance (Earth Simulator / 15% 10% 5% 0 0102030405060708090100 System Size (%) G Ratio of system peak-performance is 2 (red line) G In nearly all cases: Earth Simulator better than ratio of system peak-performance Los Alamos CCS-3 PAL SAGE: Component Time Predictions 100% 100% Compute/memory Compute/memory 90% 90% Bandwidth Bandwidth 80% Latency