Computational Science and Engineering Department Daresbury Laboratory

ApplicationApplication PerformancePerformance onon High-EndHigh-End andand Commodity-classCommodity-class ComputersComputers

Martyn F. Guest CCLRC Daresbury Laboratory

[email protected]

Acknowledgements:

P. Sherwood, H.J. van Dam, W. Smith, I. Bush, A. Sunderland, DL

J. Nieplocha, E. Apra, PNNL http://www.cse.clrc.ac.uk/disco/hw-perf.shtml

Daresbury Laboratory1 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory OutlineOutline

! Computational Chemistry - Background ! Commodity-based and High-end Systems " Single-node performance & the Interconnect bottleneck " Prototype Commodity Systems; CS1 - CS13 " High-end Systems: IBM SP/p690, SGI Origin 3800, and Compaq Alpha Server SC " Performance Metrics ! Application performance " Molecular Simulation (DLPOLY, DLMULTI and CHARMM) " Electronic Structure (Global Arrays (GAs) ; Linear Algebra (PeIGS) ! NWChem and GAMESS-UK " Materials Simulation (CPMD) and Engineering (ANGUS, SBLI) ! Application performance analysis " VAMPIR and instrumenting the GA Tools ! Summary

Daresbury Laboratory2 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory CapabilityCapability ComputingComputing -- CostCost EffectivenessEffectiveness && DependenciesDependencies Specification Usage Cost CPU Memory I/O Units MPP/ASCI HPC 1280 CPUs, IBM SP community 10,000 2,000 2,000 200-300 (1 TB) Capability Commodity Systems (1.5 x N ??) SMP 16-processor SGI Department 100 15 20 20-30 Origin 3800, R14k (8GB RAM ) Capacity Commodity Systems (1.5 x N) PC Pentium-4 / 3GHz Desktop 11 1 1 (512 MByte, 30 GB) 3 year continual access to 32 CPUs: High-end (£0.5 / CPU hour) : £419,358 in-house Beowulf : £50,000

Daresbury Laboratory3 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

High-endHigh-end Commodity-basedCommodity-based SystemsSystems

Cluster Site Nodes/ Peak CPU Interconnect Processors (Gflop) 1. MCR LLNL, US 1152/2304 11174 P4/2400 QsNet 2. Corporate GX Tech. US 1792/3264 5091 Ultra2/450+ GigabitE. 3. Supermike Louisana US 512/1024 3686 P4/1800 Myrinet 2k 4. RHIC Brookhaven 1097/2194 3115 PIII/PII. FastEth. 5. Joplin Buffalo, US 300/600 2880 P4/2400 Myrinet 2k 6. hpc S. Calif. US 579/1164 2677 P4/1400 Myrinet 2k 7. Locus Locus, USA 960/1920 1920 PIII/1000 FastEth. 8. HELICS Heidelberg 256/512 1434 AMD MP/1400 Myrinet 9. Gideon Honk Kong 300/300 1200 P4/2000 FastEth. 10. Empire Mississippi 519/1038 1157 PIII/1000+ FastEth.

Daresbury Laboratory4 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory PNNL’sPNNL’s HPCS2HPCS2 -- 14001400 ProcessorProcessor HPHP SuperclusterSupercluster

! 8.3TF -based HP supercomputer to be installed in the MSCF (2003). " 1,400 Intel® McKinley and Madison CPUs " QSNet2/Elan4 interconnect from Quadrics ! Timetable: " Prototype system (2002): 32 dual-CPU nodes with prototype McKinley processors ! 256 peak GF , 256GB of total memory. ! HP zx1 chipset (four independent PCI-X busses, one at 1GB/sec) - 12.8GBs memory bandwidth: Quadrics QSNet1/Elan3 interconnect " Phase 1 System (2002) ! 128 Longs Peak nodes with 256 McKinley CPUs (1 TF, 572GB memory. ! 1.0-GHz McKinley processors (3MB of on-chip L2 cache (86 % peak DGEMM). " Final Phase 2 System (2003). ! Additional 566 nodes with Madison CPUs. ! 1400 IA-64 CPUs (1.8TB of memory), 53TB of global + 117 TB node storage). ! QSNet2/Elan4 interconnect from Quadrics (1GB/sec PCIX2 bus). Daresbury Laboratory5 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory High-EndHigh-End SystemsSystems EvaluatedEvaluated

! Cray T3E/1200E " 816 processor system at Manchester (CSAR service) " 600 Mz EV56 Alpha processor with 256 MB memory ! IBM SP/WH2-375 and SP/p690 " 32 CPU system at DL, 4-way Winterhawk2 SMP “thin nodes” with 2 GB memory, 375 MHz Power3-II processors with 8 MB L2 cache " IBM Regatta-H (32-way node, 1.3 GHZ power4 CPUs) at Montepelier " IBM SP/p690 (8-way LPAR’d nodes, 1.3 GHZ) at Daresbury (1280 CPUs, HPCx) ! Compaq AlphaServer SC " 4-way ES40/667 A21264A (APAC) and 833 MHz SMP nodes (2 GB RAM); " TCS1 system at PSC (comprising 750 4-way ES45 nodes - 3,000 EV68 CPUs - with 4 GB memory per node, 8MB L2 cache " Quadrics “fat tree” interconnect (5 usec latency, 250 MB/sec B/W) ! SGI Origin 3800 " SARA (1000 CPUs) - Numalink with R14k/500 & R12k/400 CPUs ! Cray Supercluster at Eagen " Linux Alpha Cluster (96 X API CS20s - dual 833 MHz EV67 CPUs, Myrinet)

Daresbury Laboratory6 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory TechnicalTechnical ProgressProgress inin 20032003 Hardware and Software Evaluation: " CPU ! PC systems - Intel Pentium 4 and Xeon systems (2.8 GHz), AMD Opteron 244 (1.8 GHz) ! Itanium2 (Intel Tiger 1, 1.2 and 1.5 GHz; HP systems, 900 MHz, 1 and 1.5 GHz) ! Alpha systems - ES45/1250 & EV7 1 GHz " Networks www.cse.clrc.ac.uk/Activity/DisCo ! Gigabit options, cards, switches, channel-bonding, …. ! SCI and Myrinet (P4/2400 and P4/2666 Clusters - Streamline and UK) " System Software ! message passing S/W (LAM MPI, LAM MPI-VIA (100 us to 60 us), MPICH, VMI, SCAMPI),libraries (ATLAS, NASA, MKL), compilers (Absoft, PGI, Intel’s ifc and efc, GNU/g77), GA tools (PNNL) ! resource management software (PBS, LSF etc.)

Daresbury Laboratory7 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory CommodityCommodity SystemsSystems ((CSxCSx)) PrototypePrototype // EvaluationEvaluation HardwareHardware Systems Location CPUs Configuration CS1 Daresbury 32 PentiumIII / 450 MHz + FE (EPSRC) CS2 Daresbury 64 24 X dual UP2000/EV67-667, QSNet Alpha/LINUX cluster, 8 X dual CS20/EV67-833 (“loki”) CS3 RAL 16 Athlon K7 850MHz + myrinet CS4 Sara 32 Athlon K7 1.2 GHz + FE CS6 CLiC 528 PentiumIII / 800 MHz; fast ethernet (Chemnitzer Cluster) CS7 Daresbury 64 AMD K7/1000 MP + SCALI/SCI (“ukcp”) CS8 NCSA 320 160 dual IBM /800 + Myrinet 2k (“titan”) CS9 Bristol 96 Pentium4 Xeon/2000 + Myrinet 2k (“dirac”)

Protoype Systems www.cse.clrc.ac.uk/Activity/DisCo CS0 Daresbury 10 10 CPUS, Pentium II/266 CS5 Daresbury 16 8 X dual Pentium III/933, SCALI

Daresbury Laboratory8 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

CommodityCommodity SystemsSystems ((CSxCSx)) II.II. EvaluationEvaluation HardwareHardware

Systems Location CPUs Configuration

CS10 Hull 64 Pentium4 Xeon/2666 + Myrinet 2k (“eagle”), Streamline/SCORE CS11 Workstations 32 Pentium4 Xeon/2666 + GbitEther, UK ScaMPI CS12 Essex 48 Pentium4 Xeon/2000 + GbitEther (“sstream1”), Streamline/SCORE CS13 White Rose, 256 Pentium4 Xeon/2200-2400 + Leeds Myrinet 2k, (“snowdon”), Streamline/SCORE

www.cse.clrc.ac.uk/Activity/DisCo

Daresbury Laboratory9 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

ApplicationsApplications PerformancePerformance OverviewOverview

•• Serial (SPEC, DL) and Communication Benchmarks

•• Applications Performance 1. Computational Chemistry: Molecular Simulation & Electronic Structure 2. Computational Materials Science 3. Atomic and Molecular Physics 4. Computational Engineering 5. Environmental Modelling

Capability and Capacity Computing Commodity vs. Proprietary Solutions

Daresbury Laboratory10 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory SPEC CPU 2000 - SPECfp2000 Values relative to IBM p-series 690/pwr4 1.3 GHz

AMD A7N8X XP3200+/2.2 GHz 78 AMD A4800 Opteron144/1.8GHz 96 Intel D875PBZ 3.2 GHz P4 100 HP RX2600 Itanium2/1500 167

HP RX2600 Itanium2/1000 113 HP AlphaServer ES45 68/1250 108

HP AlphaServer GS1280 7/1150 117 SUN FireV880 / 900 Cu 55 2119 : HP RX2600 Madison/1500 2100 : SGI Madison/1500 Sun Blade 2050 / 1050 Cu 65 2100 : SGI Altix Madison/1500 1699 : IBM power4 / 1.7GHz Sun Blade 2000 / 1200 MHz 88 Fujitsu PP650 GP64/1350 106

SGI Origin 3200/R14k-600 42 SGI Altix 3000 Itanium 2/1000 111

SGI Altix 3000 Madison/1500 166 HP 9000/C3750-875 53

IBM 690Turbo/pwr4 1.7 GHz 134 IBM 690Turbo/pwr4 1.3 GHz 100 1266 0 40 80 120 160

Daresbury Laboratory11 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory TheThe GAMESS-UKGAMESS-UK SerialSerial BenchmarkBenchmark Performance relative to the IBM p-series 690/pwr4 1.3 GHz

AMD MP2400+ / 2000 72 Pentium 4 Xeon / 2666 96 HP RX2600 Itanium2/1000 84 HP RX5670 Itanium2/1000 83 Intel Tiger Itanium2/1000 85 Intel Tiger2 Madison/1500 127 AMD Opteron 244/ 1800 98

Compaq Alpha ES45/1250 79

Compaq Marvel EV7 /1000 73 SUN Blade 2000 / 1056 Cu 54 SUN FireV880 / 900 Cu 45 SGI O2 R12k/270 11 SGI Origin3800/R14k-600 40 HP PA-9000/RP7410-875 77 IBM p-series 690/Pwr4 1.7 GHz 137 IBM p-series 690/Pwr4 1.3 GHz 100 3.2 minutes IBM p-series 630/Pwr4 1.0 GHz 78

0 20 40 60 80 100 120 140

Daresbury Laboratory12 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory STREAM:STREAM: MeasuredMeasured SustainableSustainable MemoryMemory BandwidthBandwidth inin HPCHPC (TRIAD)(TRIAD) AMD MP2400+ / 2000 21 Pentium 4 Xeon / 2666 73 http://www.cs.virginia.edu/stream HP RX2600 Itanium2/1000 93 SGI O2/R12k-270 : 2% HP RX5670 Itanium2/1000 89 SGI O2/R12k-270 : 2% HP RX5670 Madison/1500 (+) 100 NECNEC SX-5: SX-5: 1222% 1222% Intel Tiger Itanium2/1000 88 Intel Tiger2 Madison/1500 98 AMD Opteron 244/ 1800 52 Compaq Alpha ES45/1250 53 Compaq Marvel EV7 /1000 140 SUN Blade 2000 / 1056 Cu 25 SUN FireV880 / 900 Cu 26 SGI Origin3800/R14k-600 17 HP PA-9000/RP7410-875 20 IBM p-series 690/Pwr4 1.7 GHz 152 IBM p-series 690/Pwr4 1.3 GHz 100

IBM p-series 630/Pwr4 1.0 GHz 44

0 50 100 150

Daresbury Laboratory13 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory CommunicationsCommunications BenchmarkBenchmark PMBPMB andand EFF_BWEFF_BW

! PMB - Pallas MPI Benchmark ! EFF_BW Suite (V2.2) ! Spin off from PMB, the Point-to-Point (Mbytes/sec) EFF_BW benchmark (Pallas) calculates an "effective PingPong; PingPing; Sendrecv; bandwidth". Exchange ! A single integral number is calculated which includes the Collective Operations (Time - usec) - performance for small and as function of no.of CPUs large messages under participation of all available Allreduce; Reduce; Reduce_scatter; processors, Allgather; Allgatherv; Alltoall; Bcast, ! EFF_BW uses only the PMB Barrier Barrier PingPong Benchmark for measuring startup and Message Lengths: throughput. 0 to 4194304 Bytes

Daresbury Laboratory14 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory CollectiveCollective OperationsOperations (Time - usec) - as function of no.of CPUs Communications among groups of processors in the cluster e.g. • Distribute data to all nodes in the cluster (scatter, all-to-all) • Collect data from all nodes (gather) • Collect information from all nodes to determine an overall condition e.g. minima or maxima (reduce).

Alltoall

Allgather

Daresbury Laboratory15 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory CommunicationsCommunications BenchmarkBenchmark PMB: Pallas MPI Benchmark Suite (V2.2) Commodity-based Systems High-end Systems ! CS2 Alpha Linux Cluster dual UP2000/667 ! Cray T3E/1200E ! CS3 AMD Athlon/850 + Myrinet, MPICH ! SGI Origin3800/R14k-500 (Teras) ! CS6 PIII/800 + Fast ethernet, MPICH ! IBM SP/WH2-375 (“trailblazer” ! CS7 AMD K7/1000 MP + SCALI switch ! CS8 Itanium/800 + Myrinet 2k (NCSA) ! IBM SP/NH2-375 (single-plane colony) ! CS9 dual P4/2000 Xeon + Myrinet 2k (Compusys, Bristol U.) ! IBM Regatta H, HPC (intra node) ! CS10 Pentium 4/2666 + Myrinet 2k ! IBM SP/SP p690 (8-way LPAR, (SCORE, Streamline, Hull U.) HPCx) ! CS11 Pentium 4/2666 + GbitEther ! Cray SuperCluster (833MHz EV68 (Workstations UK, ScaMPI) dual CS20 nodes with myrinet) ! CS12 Pentium 4/2400 + GbitEther ! Compaq AlphaServer SC (Score, Streamline, Essex U.) ES45/1GHz -TCS1 ! CS13 Pentium 4/2200-2400 + Myrinet 2k (SCORE, Streamline, Leeds U.)

Daresbury Laboratory16 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory PingPongPingPong PerformancePerformance Bandwidth (Mbytes/s) PMB Benchmark (Pallas) 1000

100

Compaq AlphaServer SC ES45/1000 10 Cray T3E/1200E IBM SP/WH2 375-quad SGI Origin3800/R14k-500 IBM SP/NH2 375 16-way 1 IBM/SP p690 CS8 Titan Itanium/800 + Myrinet 2k CS9 Pentium 4/2000 + Myrinet 2k

Bandwidth (MByte/sec) Bandwidth CS10 Pentium 4/2666 + Myrinet 2k 0.1 CS11 Pentium 4/2666 + GbitEther CS12 Pentium 4/2400 + GbitEther

Message Length (Bytes) 0.01 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07

Daresbury Laboratory17 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory MPI_MPI_allreduceallreduce PerformancePerformance PMB Benchmark (Pallas) 1.E+06 CS2 QSNet Alpha Cluster/667 Compaq AlphaServer SC ES45/1000 Cray T3E/1200E IBM SP/WH2 375-quad 1616 CPUsCPUs 1.E+05 IBM SP/NH2 375 16-way IBM/SP p690 SGI Origin3800/R14k-500 CS8 Titan Itanium/800 + Myrinet 2k CS9 Pentium 4/2000 + Myrinet 2k 1.E+04 CS10 Pentium 4/2666 + Myrinet 2k CS11 Pentium 4/2666 + GbitEther CS12 Pentium 4/2400 + GbitEther

1.E+03 Measured Time (us) Time Measured

1.E+02

Message Length (Bytes) 1.E+01 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07

Daresbury Laboratory18 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory InterconnectInterconnect BenchmarkBenchmark -- EFF_BWEFF_BW SGI Origin 3800/R14k-500 530 AlphaServer SC ES45/1000 (1CPU) 964 AlphaServer SC ES45/1000 (4CPU) 507 IBM SP/NH2-375 (16 - 8+8) 362 QSNet IBM SP/p690 (16 - 16X1) 1057 IBM SP/p690 (16 - 8 + 8) 825 Cray T3E/1200E CS2 QSNet AlphaEV67 (1 CPU) 698 1217 CS2 QSNet AlphaEV67 (2 CPU) 456 QSNet CS10 P4/2666 Xeon - Myrinet 2k (1 CPU) 735 CS10 P4/2666 Xeon - Myrinet 2k (2 CPU) 600 Myrinet 2k CS8 Itanium/800 - Myrinet (1 CPU) 667 CS7 AMD K7/1000 MP - SCALI (1 CPU) 481 SCALI 16 CPUs CS12 P4/2400 Xeon - GbitEther (1 CPU) 557 CS12 P4/2400 Xeon - GbitEther (2 CPU) 413 CS11 P4/2666 Xeon - GbitEther (1 CPU) 249 Gbit Ethernet CS11 P4/2666 Xeon - GbitEther (2 CPU) 224 CS6 PIII/800 - MPICH 44.2 MBytes/sec CS6 PIII/800 - LAM 77.9 Fast Ethernet

0 500 1000

Daresbury Laboratory19 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory InterconnectInterconnect BenchmarkBenchmark -- LatencyLatency SGI Origin 3800/R14k-500 4.7 AlphaServer SC ES45/1000 (1 CPU) 4.8 AlphaServer SC ES45/1000 (4 CPU) 5.6 QSNet CS2 QSNet AlphaEV67 (1 CPU) 5.4 IBM SP/WH2-375 13.9 IBM SP/NH2-375 (16 - 8+8) 16.5 IBM SP/p690 (16 - 16X1) 21.7 usecusec Cray T3E/1200E 4.1 CS9 P4/2000 Xeon - Myrinet (1 CPU) 10.9 CS10 P4/2666 Xeon - Myrinet 2k (1 CPU) 8.2 Myrinet 2k CS8 Itanium/800 - Myrinet (1 CPU) 15.7 CS7 AMD K7/1000 MP - SCALI (1 CPU) 4.7 SCI / SCALI CS12 P4/2400 Xeon - GbitEther (1 CPU) 16.4 Gbit Ethernet CS11 P4/2666 Xeon - GbitEther (1 CPU) 91 CS4 AMD K7/1200 - LAM 67

CS6 PIII/800 - MPICH 97 CS6 PIII/800 - LAM 63 Fast Ethernet

0 20406080100120 Daresbury Laboratory20 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ApplicationApplication CodesCodes

Application-driven performance comparisons between Commodity-based systems and both current MPP and ASCI-style SMP-node platforms : ! Computational Chemistry " Molecular Simulation ! DL_POLY, DLMULTI - parallel MD codes with many applications ! CHARMM - macromolecular MD and energy minimisation " Molecular Electronic Structure ! GAMESS-UK and NWChem - Ab initio Electronic structure codes ! Materials ! CPMD - Car-Parrinello plane wave code ! Engineering " ANGUS, SBLI - regular-grid domain decomposition engineering codes

Performance Metric (% 32-node high-end system)

Daresbury Laboratory21 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory PerformancePerformance Metrics:Metrics: 1999-20011999-2001 Attempted to quantify delivered performance from the Commodity-based systems against current MPP (CSAR Cray T3E/1200E)and ASCI-style SMP- node platforms (e.g. SGI Origin 3800) i.e. Performance Metric (% 32-node Cray T3E) T (32-nodes Cray T3E/1200E) / T (32 CPUs ) CSx

[ T 32-node T3E / T 32-node CS1 Pentium III/450 + FE ]

T 32-node T3E / T 32-node CS6 Pentium III/800 + FE

T 32-node T3E / T 32-CPU CS2 Alpha Linux Cluster + Quadrix Performance Metrics: 2002 Performance Metric (% 32-node AlphaServer SC [PSC])

T (32-CPUs AlphaServer SC ES45/1000) / T (32 CPUs ) CSx

T 32-CPU AlphaServer ES45 / T 32-CPU CS9 Pentium 4 Xeon / 2000 + Myrinet 2k

Daresbury Laboratory22 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory BeowulfBeowulf ComparisonsComparisons withwith thethe T3ET3E && O3800/R14k-500O3800/R14k-500 CSx - Pentium III + FE CS2 - QSNet Alpha Linux Cluster % of 32-node Cray T3E/1200E % of 32-node Cray T3E and O3800/R14k-500 GAMESS-UK CS1 CS6 GAMESS-UK SCF 53-69% 96% SCF 256% 99% DFT 65-85% 130-178% DFT † 301-361% 99% DFT (Jfit) 44-77% 65-131% DFT (Jfit) 219-379% 89-100% DFT Gradient 90% 130% DFT Gradient † 289% 89% MP2 Gradient 44% 73% MP2 Gradient 228% 87% SCF Forces 80% 127% SCF Forces 154% 86% NWChem (DFT Jfit) 50-60% NWChem (DFT Jfit) † 150-288% 74-135% REALC 67% CRYSTAL 79-145% CRYSTAL † 349% DL_POLY DL_POLY Ewald-based 95-107% 151-184% Ewald-based † 363-470% 95% bond constraints 34-56% 69% bond constraints 143-260% 82% CHARMM 96% 172% CHARMM † 404% 78% CASTEP 33% 42% CASTEP 166% 78% CPMD 62%

ANGUS 60% 68% ANGUS 145% FLITE3D 104% FLITE3D † 480%

Daresbury Laboratory23 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

Commodity Comparisons with High-end Systems: Compaq AlphaServer SC ES45/1000 and the SGI O3800/R14k-500

CS9 - Pentium4/2000 Xeon + Myrinet 2k % of 32-CPU AlphaServer SC Origin 3800 GAMESS-UK SCF 95% 135% DFT 70-73% 117-161% DFT (Jfit) 59-70% 92-111% DFT Gradient 69% 114% MP2 Gradient 59% 78% SCF Forces 92% 148%

NWChem (DFT Jfit) 65-88% 94-227%

GAMESS / CHARMM 89% 139% DL_POLY Ewald-based 44-53% 71-84% bond constraints 41% 64% CHARMM 95% 103% CASTEP 47-67% 75-81% CPMD 30% 53%

ANGUS 35-69% 51-147%

Daresbury Laboratory24 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory PerformancePerformance Metrics:Metrics: 20032003

Attempt to quantify delivered performance from current Commodity- based systems against high-end SMP-node platforms: Compaq AlphaServer SC ES45/1000 (PSC) and IBM SP/p690 (HPCx) i.e. Performance Metric (% 32-node AlphaServer SC)

1. T (32-CPUs AlphaServer SC ES45/1000) / T (32 CPUs ) CSx

† T 32-CPU AlphaServer ES45 / T 32-CPU CS10 Pentium 4 Xeon / 2666 + Myrinet 2k

T 32-CPU AlphaServer / T 32-CPU CS12 Pentium 4 Xeon / 2400 + Gbit Ethernet

2. T (32-CPUs IBM SP/p690) / T (32 CPUs) CSx

† T 32-CPU IBM SP/p690 / T 32-CPU CS10 Pentium 4 Xeon / 2666 + Myrinet 2k

Daresbury Laboratory25 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

MolecularMolecular SimulationSimulation

MolecularMolecular DynamicsDynamics Codes:Codes: DL_POLY,DL_POLY, DLMULTIDLMULTI andand CHARMMCHARMM

Daresbury Laboratory26 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DL_POLY:DL_POLY: AA ParallelParallel MDMD PackagePackage

! Parallel molecular dynamics code by W. Smith and T.R. Forester ! Wide selection of Periodic Boundaries and Integration Algorithms ! Wide Range of Target systems " Atomic systems & mixtures (Ne, Ar, etc.), ionic melts & crystals (NaCl, KCl etc.) " Polarisable ionics (ZSM-5, MgO etc.)

" Molecular liquids, solids (CCl4, Bz etc.) and ionics (KNO3, NH4Cl, H2O etc.) " Synthetic polymers ([PhCHCH2]n etc.), Bio-polymers and macromolecules " Polymer electrolytes, Membranes, Aqueous solutions, Metals ! Migration from Replicated to Distributed data " Distribute atoms, forces across the nodes ! More memory efficient to address larger cases (105-107) A B " Shake and short-ranges forces require only neighbour communications ! Communications scale linearly with number of nodes C D " Coulombic energy remains global ! Strategy depends on problem & machine characteristics ! Adopt particle Mesh Ewald scheme Daresbury Laboratory27 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DL_POLYDL_POLY ParallelParallel BenchmarksBenchmarks (Cray(Cray T3E/1200)T3E/1200)

V2: Replicated Data 9. Model membrane/Valinomycin (MTS, 18,886) 7. Gramicidin in water (SHAKE, 12,390) 4. NaCl; Ewald, 27,000 ions 6. K/valinomycin in water (SHAKE, AMBER, 3,838) 5. NaK-disilicate glass; 8,640 atoms,MTS+ Ewald 1. Metallic Al (19,652 atoms, Sutton Chen) 8. MgO microcrystal: 5,416 atoms 3. Transferrin in Water (neutral groups + SHAKE, 27,593) 256 2. Peptide in water (neutral groups + SHAKE, 3993). 64 Speed-up Linear Speed-up Linear 192 48

128 32

64 16

0 0 0 64 128 192 256 0 16324864 Number of Nodes Number of Nodes Daresbury Laboratory28 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DL_POLYDL_POLY V2:V2: High-endHigh-end andand Commodity-Commodity- basedbased SystemsSystems

Bench 4. NaCl; 27,000 ions, Bench 5: NaK-disilicate glass; 8,640 Ewald , 75 time steps, Cutoff=24Å atoms, MTS + Ewald: 270 time steps

Measured Time (seconds) Measured Time (seconds) CS9 P4/2000 + Myrinet CS9 P4/2000 + Myrinet CS11 P4/2666 + Gbit Ether CS11 P4/2666 + Gbit Ether 191 200 CS12 P4/2400 + Gbit Ether 121 CS13 P4/2200 + Myrinet CS13 P4/2200 + Myrinet CS12 P4/2400 + Gbit Ether 177 114 CS10 P4/2666 + Myrinet 107 CS10 P4/2666 + Myrinet SGI Origin3800/R14k-500 SGI Origin3800/R14k-500 160 154 AlphaServer SC ES45/1000 100 AlphaServer SC ES45/1000 150 IBM SP/p690 87 IBM SP/p690

66%,77% 74 82%,95% 106 68 68 98 62 100 89 56 57 115 52 110 79 68 70 50 67 44 36 34 50 65 43 37 56 38

0 0 16 32 16 32 Number of CPUs Number of CPUs T3E128 =94 T3E128 = 75 Daresbury Laboratory29 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DL_POLYDL_POLY V2:V2: High-endHigh-end andand Commodity-Commodity- basedbased SystemsSystems I.I. Measured Time (seconds)

CS9 P4/2000 + Myrinet 191 CS2 QSNet Alpha Cluster/667 200 CS13 P4/2200 + Myrinet Bench 4. 172 CS10 P4/2666 + Myrinet SGI Origin3800/R14k-500 NaCl; 27,000 ions, 160 154 AlphaServer SC ES45/1000 Ewald , 75 time 150 IBM SP/p690 steps, Cutoff=24Å 66%,77% 98 100 89 110 79 68 70 T3E128 =94 61 80 52 43 50 37 39 56 25 23 34

0 16 32 64 Number of CPUs

Daresbury Laboratory30 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DL_POLYDL_POLY V2:V2: High-endHigh-end andand Commodity-basedCommodity-based SystemsSystems II.II. Measured Time (seconds) CS9 P4/2000 + Myrinet Bench 5: NaK- 121 CS2 QSNet Alpha Cluster/667 disilicate glass; 109 CS13 P4/2200 + Myrinet 8,640 atoms, MTS + 107 CS10 P4/2666 + Myrinet Ewald: 270 time 100 SGI Origin3800/R14k-500 87 AlphaServer SC ES45/1000 steps IBM SP/p690

68 82%,95% 62 56 60 57 T3E = 75 52 128 50 67 44 40 38 36 34 33

38 23 23 25

0 16 32 64 Number of CPUs

Daresbury Laboratory31 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

DL_POLYDL_POLY V2:V2: High-endHigh-end andand CommodityCommodity-- basedbased SystemsSystems III.III. Measured Time (seconds) Bench 7. 400 CS9 P4/2000 + Myrinet Gramicidin in CS11 P4/2666 + Gbit Ether CS13 P4/2200 + Myrinet water; CS12 P4/2400 + Gbit Ether rigid bonds and 306 CS10 P4/2666 + Myrinet SHAKE, 300 SGI Origin 3800/R14k-500 12,390 atoms, IBM SP/p690 260 500 time steps AlphaServer SC ES45/1000

217 245 190 61%,75% 200 182 170 140 128 128 121 116 T3E128 =166 139 89 100 78 104

0 16Number of CPUs 32

Daresbury Laboratory32 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DL_POLYDL_POLY V2:V2: ReplicatedReplicated DataData Performance Relative to the Cray Macromolecular Simulations T3E/1200E 14 SGI Origin 3800/R14k-500 Bench 7: Gramicidin in water; Compaq AlphaServer SC ES45/1000 rigid bonds and SHAKE, 12 IBM SP/p690 12,390 atoms, 500 time steps 10 Performance Relative to the Cray T3E/1200E 8 7 SGI Origin 3800/R14k-500 Compaq AlphaServer SC ES45/1000 6 6 IBM SP/p690 4 5

2 4 3 0 16 32 64 128 2 Number of CPUs 1

Bench 4. NaCl; 27,000 ions, 0 Ewald, 75 time steps, Cutoff=24Å 16 32 64 Ionic Simulations .. CHARMM, AMBER Number of CPUs

Daresbury Laboratory33 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

DLMULTIDLMULTI

! DL_MULTI - A DL_POLY package to simulate rigid molecules with multipoles

! Allows molecular dynamics simulation of rigid molecules with the electrostatics described by a distributed multipole.

! Work has been carried out on implementing constant pressure simulations, involving extensive modifications of the existing DL_POLY code as well as addition of new code.

! The program has been tested on a variety of parallel architectures. Future work will initially focus on using the package to investigate the thermodynamic stability of polymorphs of organic crystals e.g. 5-and 6-azauracil

Daresbury Laboratory34 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

DLMULTI:DLMULTI: High-endHigh-end andand Commodity-basedCommodity-based SystemsSystems Measured Time 500 CS12 P4/2400 + Gbit Ether (2CPUs/Node) (seconds) CS12 P4/2400 + Gbit Ether (1CPU/Node) CS10 P4/2666 + Myrinet (2CPUs/Node) CS10 P4/2666 + Myrinet (1CPU/Node) 400 368 SGI Origin 3800/R14k-500 IBM SP/p690 339 AlphaServer SC ES45/1000

300 Bench 2: Relaxed Supercell of 57,024 207 48%,59% 201 200 181 atoms (5-azauracil), 2 time steps. 120 106 99 100 58 62 38

0 8 163264 Number of CPUs Daresbury Laboratory35 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

CHARMMCHARMM

! CHARMM (Chemistry at HARvard Macromolecular Mechanics) is a general purpose molecular mechanics, molecular dynamics and vibrational analysis package for modelling and simulation of the structure and behaviour of macromolecular systems (proteins, nucleic acids, lipids etc.) ! Supports energy minimisation and MD approaches using a classical parameterised force field. ! J. Comp. Chem. 4 (1983) 187-217 ! Parallel Benchmark - MD Calculation of Carboxy Myoglobin (MbCO) with 3830 Water Molecules. ! QM/MM model for study of reacting species " incorporate the QM energy as part of the system into the force field " coupling between GAMESS-UK (QM) and CHARMM.

Daresbury Laboratory36 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ParallelParallel CHARMMCHARMM BenchmarkBenchmark Benchmark MD Calculation of Carboxy Myoglobin (MbCO) with 3830 Water Molecules: 14026 atoms, 1000 steps (1 ps), 12-14 A shift.

T3E = 665, T3E = 106 Measured Time (seconds) 16 128 150 CS2 QSNet Alpha Cluster/667 CS7 AMD K7/1000 MP + SCI CS9 P4/2000 + Myrinet 2k CS12 P4/2400 + Gbit Ether Linear 114 CS10 P4/2666 + Myrinet CS1 PIII/450 + FE/LAM 104 SGI Origin 3800/R14k-500 52 Cray T3E/1200E CS2 QSNet Alpha Cluster/667 100 AlphaServer SC ES45/1000 89 SGI Origin 3800/R14k-500 83 104%,139% 73 36 66 64 64 59 61 62 69 51 50 20

44 37 4 0 4203652 16 32 64 Number of CPUs Daresbury Laboratory37 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ParallelParallel CHARMMCHARMM Benchmark:Benchmark: LAMLAM MPIMPI vsvs.. MPICHMPICH Measured Time (seconds) Benchmark MD Calculation: 750 Carboxy Myoglobin CS6 PIII/800 + FE (MbCO) with 3830 Water Molecules: 600 LAM 6.3-b2 MPICH-1.2.1 440.1 450

300 198.9

150

0 481632 Number of CPUs Daresbury Laboratory38 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory MigrationMigration fromfrom ReplicatedReplicated toto DistributedDistributed datadata DL_POLY-3:DL_POLY-3: CoulombCoulomb EnergyEnergy EvaluationEvaluation

! Conventional routines (e.g. fftw) assume plane or column distributions

! A global transpose of the data is required to complete the 3D FFT and additional costs are Plane Block incurred re-organising the data from the natural block domain decomposition.

! An alternative FFT algorithm has been designed to reduce communication costs. " the 3D FFT are performed as a series of 1D FFTs, each involving communications only between blocks in a given column " More data is transferred, but in far fewer messages " Rather than all-to-all, the communications are column-wise only

Daresbury Laboratory39 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

MigrationMigration fromfrom ReplicatedReplicated toto DistributedDistributed datadata DL_POLY-3:DL_POLY-3: CoulombCoulomb EnergyEnergy PerformancePerformance

7 IBM SP/p690 Speed-up 6 AlphaServer SC ES45/1000 SGI Origin 3800/R14k-500 5 NaCl Simulation; 216,000 ions, 200 time steps, 4 Cutoff=12Å

3

2

1

0 32 64 128 256 Number of CPUs Daresbury Laboratory40 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DL_POLY3DL_POLY3 MacromolecularMacromolecular SimulationsSimulations

Gramicidin in water; rigid bonds + SHAKE: Speedup 792,960 ions, 50 time steps Measured Time (seconds)

Linear 654 256 SGI Origin 3800/R14k-500 600 SGI Origin 3800/R14k-500 AlphaServer SC ES45/1000 215.8 AlphaServer SC ES45/1000 224 IBM SP/p690 IBM SP/p690 192 174.1

400 370 149.5 349 160 273 128 145.0 186 189 200 96 76.7 140 98.0 114 109 97 68 77 64 59.1

0 32 32 64 128 256 32 64 96 128 160 192 224 256 Number of CPUs Number of CPUs

Daresbury Laboratory41 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DL_POLY3DL_POLY3 CommodityCommodity PerformancePerformance

NaCl Simulation; 216,000 ions Gramicidin in water; 792,960 ions

Measured Time (seconds) Measured Time (seconds)

1000 CS9 P4/2000 + Myrinet 2k 800 CS9 P4/2000 + Myrinet 2k 739 CS12 Pentium 4/2400 + Gbit Ethernet CS12 Pentium 4/2400 + Gbit Ethernet 675 CS10 Pentium 4/2666 + Myrinet 2k CS10 Pentium 4/2666 + Myrinet 2k IBM SP/p690 IBM SP/p690 750 600 649 521 603 84%,91% 86%,92% 408 500 381 455 400 349

324 301 234 197 189 250 200 145 128

0 0 16 32 64 16 32 64 Number of CPUs Number of CPUs

Daresbury Laboratory42 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

MolecularMolecular ElectronicElectronic StructureStructure

AbAb initioinitio ElectronicElectronic structurestructure Codes:Codes: NWChemNWChem andand GAMESS-UKGAMESS-UK

Daresbury Laboratory43 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DistributedDistributed DataData SCFSCF

MOAO Pµν Pictorial representation of the iterative SCF process in (i) a sequential process, guess dge mm and (ii) a distributed data parallel or bitals process: MOAO represents the Sequential molecular orbitals, P the density matrix Eigensol ver Fρρσσ and F the Fock or Hamiltonian matrix Integrals

V If Converged XC

V Coul V 1e MOAO Pµν

guess Sequential or bitals ga_dgemm

PeIGS Fρσ

If Converged Integrals VXC VCoul Distributed Data V1e

Daresbury Laboratory44 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory High-EndHigh-End ComputationalComputational Chemistry:Chemistry: TheThe NWChemNWChem SoftwareSoftware

! Capabilities (Direct, Semi-direct and conventional): " RHF, UHF, ROHF using up to 10,000 basis functions; analytic 1st and 2nd derivatives. " DFT with a wide variety of local and non-local XC potentials, using up to 10,000 basis functions; analytic 1st and 2nd derivatives. " CASSCF; analytic 1st and numerical 2nd derivatives. " Semi-direct and RI-based MP2 calculations for RHF and UHF wave functions using up to 3,000 basis functions; analytic 1st derivatives and numerical 2nd derivatives. " Coupled cluster, CCSD and CCSD(T) using up to 3,000 basis functions; numerical 1st and 2nd derivatives of the CC energy. " Classical molecular dynamics and free energy simulations with the forces obtainable from a variety of sources

Daresbury Laboratory45 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GlobalGlobal ArraysArrays

Physically distributed data • Shared-memory-like model – Fast local access – NUMA aware and easy to use – MIMD and data-parallel modes – Inter-operates with MPI, … • BLAS and linear al gebra interface • Ported to major parallel machines – IBM, Cray, SGI, clusters,... • Origi nated in an HPCC project • Used by 5 major chemistry codes, financial futures forecasting, astrophysics, computer graphics

Single, shared data structure Tools developed as part of the NWChem project at PNNL; R.J. Harrison, J. Nieplocha et al.

Daresbury Laboratory46 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory PeIGS 3.0 Parallel Performance

(Solution of real symmetric generalized Developed as part of the and standard eigensystem problems) NWChem project at PNNL; R.J. Harrison, J. Nieplocha et al.

Features (not always available elsewhere): • Inverse iteration using Dhillon-Fann-Parlett’s parallel algorithm (fastest uniprocessor performance and good parallel scaling) • Guaranteed orthonormal eigenvectors in the presence of large clusters of degenerate eigenvalues • Packed Storage • Smaller scratch space requirements • PDSYEV (Scalapack 1.5) and PDSYEVD (Scalapack 1.7)

Consider performance on Fock matrices generated in typical Hartree Fock SCF calculations (1152 and 3888 basis functions).

Daresbury Laboratory47 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ScalabilityScalability ofof NumericalNumerical AlgorithmsAlgorithms I.I.

SGISGI Origin Origin 3800/R12k-400 3800/R12k-400 (“green”) (“green”) Real symmetric eigenvalue problems 50 PeIGS 2.1 120 PeIGS 2.1 PeIGS 2.1 - Cray T3E/1200 PeIGS 3.0 PeIGS 3.0 PDSYEV (Scpk 1.5) 40 PDSYEV (Scpk 1.5) 100 PDSYEVD (Scpk 1.7) PDSYEVD (Scpk 1.7) BFG-Jacobi (DL)

80 30 Fock matrix Fock matrix (N = 3888)

Time (sec) Time (N = 1152) 60 20 40

10 20

0 0 2481632 16 32 64 128 256 512 Number of processors Number of processors

Daresbury Laboratory48 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ScalabilityScalability ofof NumericalNumerical AlgorithmsAlgorithms II.

IBMIBM SP/p690 SP/p690 and and SGI SGI Origin Origin O3800/R12k O3800/R12k

Time (secs.) N = 3,888 Real symmetric 100 PeIGS 3.0 O3K eigenvalue problems PeIGS 3.0 IBM SP/p690 PDSYEV O3K PDSYEV IBM SP/p690 80 PDSYEVD O3K Time (secs.) N = 9,000 PDSYEVD IBM SP/p690 800 739 60 IBM SP/p690 600 PDSYEV PDSYEVD 40 407 400

251 20 200 147 99 99 75 57 0 0 32 64 128 256 512 16 32 64 128 256 Number of processors Number of processors

Daresbury Laboratory49 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

ParallelParallel EigensolversEigensolvers Real symmetric eigenvalue problems

CS7 AMD K7/1000 MP + SCALI CS9 P4/2000 Xeon + Myrinet

Measured Time (seconds) Measured Time (seconds) 600 400 PDSYEV: 2 CPUs / node Scalapack (PDSYEV) 350 PDSYEV: 1 CPU / node 500 PDSYEVD: 2 CPUs / node PDSYEVD: 1 CPU / node 300 PDSYEV: 2 CPUs per Node PeIGS 2.1: 2 CPUs / node 400 PeIGS 2.1: 1 CPU / node PDSYEV: 1 CPU Per Node 250

300 200 N = 3888 N = 3888 150 200 100 100 50

0 0 4 8 16 32 64 8 163264 Number of processors Number of processors Daresbury Laboratory50 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory CaseCase StudiesStudies -- ZeoliteZeolite FragmentsFragments

• DFT Calculations with

Si8O7H18 347/832 Coulomb Fitting

Basis (Godbout et al.) DZVP - O, Si

Si8O25H18 617/1444 DZVP2 - H Fitting Basis: DGAUSS-A1 - O, Si DGAUSS-A2 - H

Si26O37H36 1199/2818 • NWChem & GAMESS-UK

Both codes use auxiliary fitting basis for coulomb energy, with 3 centre 2 electron integrals Si28O67H30 1687/3928 held in core.

Daresbury Laboratory51 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

DFTDFT CoulombCoulomb FitFit -- NWChemNWChem

Si O H 617/1444 Si8O7H18 347/832 8 25 18 Measured Time (seconds) Measured Time (seconds) CS6 PIII/800 + FE CS6 PIII/800 + FE 400 800 CS7 AMD K7/1000 + SCI CS7 AMD K7/1000 + SCI CS9 P4/2000 + Myrinet CS9 P4/2000 + Myrinet CS2 QSNet Alpha/667 350 CS2 QSNet Alpha/667 596 Cray T3E/1200E Cray T3E/1200E SGI O3800/R14k-500 300 SGI O3800/R14k-500 600 511 AlphaServer SC ES45/1000 AlphaServer SC ES45/1000 227 514 223 411 250 249 404 177 89% 200 163 76% 400 257 150 107

182 159 100 68 200 135 119 55 55 42 177 50 61 124 43 0 0 16 32 16 32 Number of CPUs Number of CPUs Daresbury Laboratory52 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

DFTDFT CoulombCoulomb FitFit -- NWChemNWChem

Si26O37H36 1199/2818 Si28O67H30 1687/3928

Measured Time (seconds) Measured Time (seconds) 2500 2388 CS7 AMD K7/1000 + SCI CS9 P4/2000 + Myrinet 2k 6000 CS7 AMD K7/1000 + SCI 2414 5507 CS2 QSNet Alpha Cluster/667 CS9 P4/2000 + Myrinet 2k SGI Origin 3800/R14k-500 CS2 QSNet Alpha Cluster 4682 SGI Origin 3800/R14k-500 2000 IBM SP/p690 IBM SP/p690 AlphaServer SC ES45/1000 AlphaServer SC ES45/1000 79% 4000 1500 1271 65% 1147 3008 3050 951 2424 1000 907 2053 2000 1617 1580 1418 517 502 490 500 404 390 2351 1182 880 303 1504 834 611

0 0 32 64 128 32 64 128 Number of CPUs Number of CPUs

Daresbury Laboratory53 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory Memory-drivenMemory-driven Approaches:Approaches: NWChemNWChem -- DFTDFT (LDA):(LDA): PerformancePerformance onon thethe IBMIBM SP/p690SP/p690

ZeoliteZeolite ZSM-5ZSM-5

• DZVP Basis (DZV_A2) and Dgauss A1_DFT Fitting basis:

AO basis: 3554 CD basis: 12713 • IBM SP/p690)

Wall time (13 SCF iterations): 64 CPUs = 9,184 seconds 128 CPUs= 3,966 seconds

MIPS R14k-500 CPUs (Teras) Wall time (13 SCF iterations): • 3-centre 2e-integrals = 1.00 X 10 12 64 CPUs = 5,242 seconds • Schwarz screening = 6.54 X 10 9 128 CPUs= 3,451 seconds • % 3c 2e-ints. In core = 100%

Daresbury Laboratory54 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory 2.22.2 GAMESS-UKGAMESS-UK

GAMESS-UK is the general purpose ab initio molecular electronic structure program for performing SCF-, MCSCF- and DFT-gradient calculations, together with a variety of techniques for post Hartree Fock calculations.

" The program is derived from the original GAMESS code, obtained from Michel Dupuis in 1981 (then at the NRCC), and has been extensively modified and enhanced over the past two decades. " This work has included contributions from numerous authors†, and has been conducted largely at the CCLRC Daresbury Laboratory, under the auspices of the UK's Collaborative Computational Project No. 1 (CCP1). Other major sources that have assisted in the on-going development and support of the program include various academic funding agencies in the Netherlands, and ICI plc. Additional information on the code may be found from links at: http://www.dl.ac.uk/CFS † M.F. Guest, J.H. van Lenthe, J. Kendrick, K. Schoffel & P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, H.J.J. van Dam, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders, A.J. Stone and D. Tozer.

Daresbury Laboratory55 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UKGAMESS-UK featuresfeatures 1.1. " Hartree Fock: ! Segmented/ GC + spherical harmonic basis sets ! SCF-Energies and Gradients: conventional, in-core, direct ! SCF-Frequencies: numerical and analytic 2nd derivatives ! Restricted, unrestricted open shell SCF and GVB. " Density Functional Theory ! Energies + gradients, conventional and direct including Dunlap fit ! B3LYP, BLYP, BP86, B97, HCTH, B97-1, FT97 & LDA functionals ! DFT-Frequencies: numerical and analytic 2nd derivatives " Electron Correlation: ! MP2 energies, gradients and frequencies, Multi-reference MP2, MP3 Energies ! MCSCF and CASSCF Energies, gradients and numerical 2nd derivatives ! MR-DCI Energies, properties and transition moments (semi-direct module) ! CCSD and CCSD(T) Energies ! RPA (direct) and MCLR excitation energies / oscillator strengths, RPA gradients ! Full-CI Energies ! Green's functions calculations of IPs. ! Valence bond (Turtle)

Daresbury Laboratory56 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UKGAMESS-UK featuresfeatures 2.2.

" Molecular Properties: ! Mulliken and Lowdin population analysis, Electrostatic Potential-Derived Charges ! Distributed Multipole Analysis, Morokuma Analysis, Multipole Moments ! Natural Bond Orbital (NBO) + Bader Analysis ! IR and Raman Intensities, Polarizabilities & Hyperpolarizabilities ! Solvation and Embedding Effects (DRF) ! Relativistic Effects (ZORA) " Pseudopotentials: ! Local and non-local ECPs. " Visualisation: tools include CCP1 GUI " Hybrid QM/MM (ChemShell + CHARMM QM/MM) " Semi-empirical : MNDO, AM1, and PM3 hamiltonians " Parallel Capabilities: ! MPP and SMP implementations (GA tools) ! SCF/DFT energies, gradients, frequencies ! MP2 energies and gradients ! Direct RPA

Daresbury Laboratory57 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

ParallelParallel ImplementationsImplementations ofof GAMESS-UKGAMESS-UK

! Extensive use of Global Array (GA) Tools andand ParallelParallel Linear Algebra from NWChem Project (EMSL) ! SCF and DFT energies and gradients " Replicated data, but … " GA Tools for caching of I/O for restart and checkpoint files " Storage of 2-centre 2-e integrals in DFT Jfit " Linear Algebra (via PeIGs, DIIS/MMOs, Inversion of 2c-2e matrix) ! SCF and DFT second derivatives " Distribution of and integrals via GAs ! MP2 gradients " Distribution of and integrals via GAs

Daresbury Laboratory58 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UKGAMESS-UK ✁✁SCFSCF PerformancePerformance IBM SP/p690, High-end and Commodity-based Systems

Elapsed Time (seconds) T3E128 = 436 CS6 PIII/800 + FE CS7 AMD K7/1000 MP + SCI CS2 QSNet Alpha Cluster/667 2040 CS9 P4 Xeon/2000 + Myrinet 2000 CS12 P4/2400 + Gbit Ether CS10 P4/2666 + Myrinet SGI Origin3800/R14k-500 Compaq AlphaServer SC ES45/1000 1500 IBM SP/p690 1276 1418 85%,90%

1000 896 925 878

656 597 618 581 502 529 522 387 436 407 500 369 337

0 16Number of CPUs 32 Impact of Serial Linear Algebra: TIBM-SP2(16) = 2656 [1289] Cyclosporin:(3-21GCyclosporin:(3-21G Basis, Basis, 1000 1000 GTOS) GTOS) TIBM-SP2(32) = 2184 [ 821]

Daresbury Laboratory59 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK.GAMESS-UK. DFTDFT B3LYPB3LYP PerformancePerformance IBM SP/p690, High-end and Commodity-based Systems

Elapsed Time (seconds) CS4 AMD K7/1200 + FE CS7 AMD K7/1000 MP + SCI CS2 QSNet Alpha Cluster/667 CS9 P4 Xeon/2000 + Myrinet CS12 P4/2400 + Gbit Ether CS10 P4/2666 + Myrinet SGI Origin 3800/R14k-500 AlphaServer SC ES45/1000 2208 IBM SP/p690

1859 2000 79%,85%

1170 1191 1018 Basis: 6-31G

675

0 16 32 Number of CPUs Cyclosporin, 1000 GTOs Daresbury Laboratory60 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK.GAMESS-UK. DFTDFT B3LYPB3LYP PerformancePerformance The IBM SP/p690 and High-endHigh-end Systems

Cyclosporin,Cyclosporin, 1855 1855 GTOs GTOs

Elapsed Time (seconds) Basis: 6-31G** Speed-up 4731 5000 SGI Origin 3800/R14k-500 128 Linear IBM SP/p690 SGI Origin 3800/R14k-500 AlphaServer ES45/1000 IBM SP/p690 3750 AlphaServer ES45/1000 2838 96 2614 2504 S = 8181 2500 1681 1867 1584 1281 1100 64 1250

0 32 32 64 128 32 64 96 128 Number of CPUs Number of CPUs Daresbury Laboratory61 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK.GAMESS-UK. DFTDFT HCTHHCTH PerformancePerformance The IBM SP/p690 and High-end Systems

Basis: DZVP2_A2 (Dgauss) Valinomycin,Valinomycin, 1620 1620 GTOs GTOs Elapsed Time (seconds) Speed-up 128 12000 11053 SGI Origin 3800/R14k-500 Linear IBM SP/p690 SGI Origin 3800/R14k-500 AlphaServer ES45/1000 IBM SP/p690 104S = 104 9000 AlphaServer ES45/1000 96 100 5557 5711 5846 92 6000

3109 3081 3388 64 3000 1940 1825

0 32 32 64 128 32 64 96 128 Number of CPUs Number of CPUs Daresbury Laboratory62 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory AuxilliaryAuxilliary BasisBasis CoulombCoulomb FitFit (I)(I)

The approach is based on the expansion of the charge density in an auxiliary basis of Gaussian functions

  ρ(r) = D pq ≈  D C pq  u = d u ∑ pq ∑∑ pq u  ∑ u pq upq  u

As suggested by Dunlap, a variational choice of the fitting coefficients C can be obtained as follows:

C pq = V−1b pq

Where V is the matrix of 2-centre 2-electron repulsion integrals in the charge density basis and b are the three centre electron repulsion integrals between the wavefunction basis set and the charge density basis.

Daresbury Laboratory63 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

AuxilliaryAuxilliary BasisBasis CoulombCoulomb FitFit (ii)(ii)

! The number of 3-centre integrals is significantly smaller than the 4-centre integrals used in the conventional coulomb evaluation, but for large molecules additional screening is required. ! We make use of the Schwarz inequality

()pq u ≤ ()pq pq ()u u

! Where p and q are AO basis functions and u are the fitting functions. Since screening is applied on a shell basis, the maximal integrals for each shell quartet are stored. ! Using this screening, and exploiting the aggregate memory of a parallel machine, it is possible to hold a significant fraction of the 3-centre integrals in core.

Daresbury Laboratory64 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

GAMESS-UK: DFT HCTH on Valinomycin. Impact of Coulomb Fitting: IBM SP/p690, Compaq AlphaServer SC/1000 and SGI Origin R14k/500

Basis:Basis: DZV_A2 DZV_A2 ( (Dgauss)Dgauss) A1_DFTA1_DFT Fit: Fit: 882/3012 882/3012

Measured Time (seconds) Measured Time (seconds)

2306 SGI Origin 3800/R14k-500 750 SGI Origin 3800/R14k-500 IBM SP/p690 IBM SP/p690 724 Compaq AlphaServer SC ES45/1000 2000 Compaq AlphaServer SC ES45/1000 516 477 1361 500 443 386 JJFIT 1301 1228 JJEXPLICIT 371 304 310 754 734 235 1000 705 250 489 415

0 0 32 64 128 32 64 128 Number of CPUs Number of CPUs Daresbury Laboratory65 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK:GAMESS-UK: DFTDFT HCTHHCTH onon ValinomycinValinomycin.. ImpactImpact ofof CoulombCoulomb FittingFitting J Basis: DZV_A2 (Dgauss) Measured Time (seconds) JEXPLICIT A1_DFT Fit: 882/3012

CS7 AMD K7/1000 MP + SCALI CS4 AMD K7/1200 + FE Measured Time (seconds) JFIT CS2 QSNet Alpha Cluster/667 CS9 P4 Xeon/2000 + Myrinet 4500 CS4 AMD K7/1200 + FE CS12 P4/2400 + Gbit Ether CS7 AMD K7/1000 MP + SCALI CS10 P4/2666 + Myrinet CS2 QSNet Alpha Cluster/667 IBM SP/p690 3000 2753 CS9 P4 Xeon/2000 + Myrinet AlphaServer SC ES45/1000 CS12 P4/2400 + Gbit Ether CS10 P4/2666 + Myrinet 3000 IBM SP/p690 2418 95%,103% AlphaServer SC ES45/1000

1542 2278 1301 1500 1500 1061 70%,75% 1259 933 634 477 0 16 32 0 Number of CPUs 16 32 Number of CPUs Daresbury Laboratory66 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK: DFT HCTH on Valinomycin. Impact of Coulomb Fitting J JEXPLICIT Basis: DZV_A2 (Dgauss) Basis:Basis: DZV_A2 DZV_A2 ( (Dgauss)Dgauss) J A1_DFTA1_DFT Fit: Fit: 882/3012 882/3012 FIT

TT3E/1200E (128) = 2139 TT3E/1200E (128) = 2139 Performance Ratio vs. Cray T3E/1200E Performance Ratio vs. Cray T3E/1200E AlphaServer SC ES45/1000 7 IBM SP/p690 AlphaServer SC ES45/1000 SGI Origin 3800/R14k-500 7 IBM SP/p690 SGI Origin 3800/R14k-500 6 6 5 5 4 4 3 3 2 2 1 1 0 0 16 32 64 128 16 32 64 128 TT3E/1200E (128) = 995 Number of CPUs Number of CPUs TT3E/1200E (128) = 995 Daresbury Laboratory67 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UK.GAMESS-UK. DFTDFT HCTHHCTH Performance:Performance: Impact of Coulomb FittingFitting Valinomycin,Valinomycin, 1620 1620 GTOs GTOs Basis:Basis: DZVP2_A2 DZVP2_A2 ( (Dgauss)Dgauss)

12000 11053 SGI Origin 3800/R14k-500 IBM SP/p690 IBM SP/p690 (Jfit) AlphaServer ES45/1000 9000 Elapsed Time 5557 (seconds) 5711 5846 6000

3109 3388 3081 3000 1940 1825

1678 1176 994 0 32 64 128 Number of CPUs Daresbury Laboratory68 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory Memory-drivenMemory-driven ApproachesApproaches -- SCFSCF andand DFTDFT

HF and DFT Energy and Gradient calculation for NiSiMe2CH2CH2C6F13: Basis Ahlrichs DZ (1485 GTOs) Elapsed Time (seconds) 5000 HF - Direct SCF 4206 4182 HF - In-core 4000 DFT - Direct-SCF DFT - In-core Integrals written directly 3000 to memory, rather than to 2476 2405 disk, and are not re- calculated 2000 1723 1692

1000 830 799 SGI Origin 3800/R14k-500

0 Number of CPUs 28 60 124 Daresbury Laboratory69 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DFTDFT BLYPBLYP Gradient:Gradient: High-end and Commodity-based Systems Geometry optimisation of polymerisation catalyst Cl(C3H5O).Pd[(P(CMe3)2)2.C6H4]

Basis 3-21G* (446 GTOs): 10 energy + gradient evaluations

Elapsed Time (seconds) 4423 CS7 AMD K7/1000 MP + SCI Speed-up CS2 QSNet Alpha Cluster/667 85%,85% 64 CS9 P4 Xeon/2000 + Myrinet Linear 4000 Cray T3E/1200E linear 3596 CS12 P4/2400 + Gbit. Ethernet IBM SP/p690 3190 CS10 P4/2666 + Myrinet 2k CS10 P4/2666 + Myrinet 2k 2810 SGI Origin 3800 R14k/500 SGI Origin 3800/R14k-500 3000 IBM SP/p690 48 CS2 QSNet Alpha Cluster/667 2278 AlphaServer SC ES45/1000 1934 2035 1825 1935 2000 1869 1575 1249 32 1167 1000

16 0 32 64 16 32 48 64 Number of CPUs Number of CPUs Daresbury Laboratory70 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory MP2MP2 GradientGradient AlgorithmsAlgorithms SerialSerial ParallelParallel !! PoorPoor I/O-to-compute I/O-to-compute ! Conventional ! Conventional performanceperformance of of MPPs MPPs " integrals written to disk " integrals written to disk "" directdirect approach approach "" readread back, back, transformed, transformed, !! CurrentCurrent MPPs MPPs have have large large writtenwritten out, out, resorted resorted etc. etc. globalglobal memories memories "" heavyheavy I/O I/O demands demands !! StoreStore subset subset of of MO MO integrals integrals "" reducereduce number number of of integral integral !! Direct/Semi-directDirect/Semi-direct (Frisch, (Frisch, recomputationsrecomputations Head-GordonHead-Gordon & & Pople, Pople, Hasse Hasse "" increaseincrease communication communication andand Ahlrichs) Ahlrichs) overheadoverhead "" replacereplace all/some all/some I/O I/O with with !! SubsetSubset includes includes VOVO, VOVO, VVOO, VVOO, batchedbatched integral integral VOOO,VOOO, recomputation recomputation "" VVVO-classVVVO-class too too large large to to store store "" computecompute VVVO-terms VVVO-terms in in separateseparate step step

Daresbury Laboratory71 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

128 Linear PerformancePerformance ofof MP2MP2 GradientGradient ModuleModule Cray T3E/1200E 112 AlphaServer SC ES45/1000 Cray T3E/1200, High-end and Commodity- SGI Origin3800/R14k-500 based Systems 96 80 Speed-up ! Mn(CO)5H - MP2 geometry optimisation 64

! BASIS: TZVP + f (217 GTOs 48

Elapsed Time (seconds) 32

CS6 PIII/800 + FE 16 7487 CS7 AMD K7/1000 MP + SCI CS2 QSNet Alpha Cluster/667 16 32 48 64 80 96 112 128 6713 CS9 P4 Xeon/2000 + Myrinet Cray T3E/1200E SGI Origin3800/R14k-500 6713 Cray T3E/1200E 6000 IBM SP/p690 SGI Origin3800/R14k-500 5233 AlphaServer SC ES45/1000 6000 IBM SP/p690 4847 AlphaServer SC ES45/1000 59%,78% 3790 3530 4500 3530 2722 3000 2496 79% 2496 1820 3000 1550 1346 1539 1923 1301 2521 1539 1012 1346 1158 1725 1500 1012 836 634 529 0 16 32 0 Number of CPUs 16 32 64 128 Daresbury Laboratory72 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

SCFSCF andand DFTDFT AnalyticAnalytic 2nd2nd DerivativesDerivatives

SCF " Distribution of and integrals via GAs Solve DFT equations " All DFT contributions Derivative Fock calculated on the fly Compute 1st derivatives of DFT matrices " Extensive use of screening " Evaluation in AO or MO basis CPHF " Evaluation in AO or MO basis CPHF available Solve systems of linear equations for 1st " CPHF arrays replicated but all derivatives of DFT wavefunction systems of equations can be 2nd Derivative solved separately to save Compute 2nd derivatives of DFT functionals memory " DFT terms in CPHF equations Hessian can be calculated at lower Sum partial derivatives accuracy

Daresbury Laboratory73 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory SCFSCF AnalyticAnalytic 2nd2nd DerivativesDerivatives PerformancePerformance IBM SP/p690, High-end and Commodity-based Systems

(C6H4(CF3))2: Basis 6-31G (196 GTO)

• Terms from MO 2e-integrals in GA storage (CPHF & pert. Fock matrices); Calculation dominated by CPHF: Gaussian98 - L1002 (CPU) - 32 nodes: 1181 secs; 64 nodes: 1058 secs. GAMESS-UK (total job time); 128 nodes: 499 secs. Elapsed Time (seconds) 3000 CS2 QSNet Alpha Cluster/667 3000 Cray T3E/1200E 2687 CS9 P4 Xeon/2000 + Myrinet 2687 Cray T3E/1200E SGI Origin 3800/R12k-400 IBM SP/WH2-375 SGI Origin 3800/R14k-500 SGI Origin 3800/R14k-500 IBM SP/p690 G98: 2271 IBM SP/p690 AlphaServer SC ES45/1000 2000 1842 2000 G98: 1706 92%,93%148% 1450 1490 1439 1439 G98: 1490 1073 933 860 803 1000 860 803 1000 621 949 501 499 371 543 213

0 0 16 32 16 32 64 128 CPUs CPUs Daresbury Laboratory74 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DFTDFT AnalyticAnalytic 2nd2nd DerivativesDerivatives PerformancePerformance IBM SP/p690, HP/Compaq SC ES45/1000 and SGI Origin 3800

(C6H4(CF3))2: Basis 6-31G (196 GTO)

CPUs SGI Origin3800/R14k-500 - B3LYP 1614 IBM SP/p690 - B3LYP IBM SP/p690 - HCTH 1500 AlphaServer ES45/1000 - B3LYP AlphaServer ES45/1000 - HCTH

989 1073 1000 743 569 Terms from MO 2e- integrals in GA storage 470 354 (CPHF & pert. Fock 500 307 matrices); Calculation dominated by CPHF: 0 32 64 128 Elapsed Time (seconds) Daresbury Laboratory75 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ModellingModelling ComplexComplex Systems:Systems: TheThe QM/MMQM/MM ApproachApproach

! Couple quantum mechanics and molecular mechanics approaches ! QM treatment of the active site " reacting centre " problem structures (e.g. complex transition metal centre) " excited state processes (e.g. spectroscopy) ! Classical MM treatment of environment " enzyme structure " zeolite framework " explicit and/or dielectric solvent models

Daresbury Laboratory76 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

QM/MMQM/MM ModellingModelling -- ChallengesChallenges

! Methodological validation " establish reliability of both QM and MM schemes " QM/MM coupling schemes introduce additional artefacts " consistency of QM and MM energy expressions ! Computational demands " macromolecular systems, with extended conformational space ! conformational search problems ! entropic contributions " QM component means an expensive energy and gradient evaluation ! Software Complexity " range of forcefield types " wide variation in QM and MM program design " close integration needed for performance (e.g. HPC), but weak coupling simplifies maintenance (e.g. incorporating new versions of QM and MM packages)

Daresbury Laboratory77 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory GAMESS-UKGAMESS-UK VersionVersion 6.36.3 QM/MMQM/MM InterfaceInterface withwith CHARMMCHARMM

! Implemented in collaboration with Bernie Brooks, Eric Billings, (NIH, Bethesda Maryland), ! Functionality: " Similar to existing ab-initio interfaces; CHARMM side follows coupling to GAMESS(US) (Milan Hodoscek) " Support for Gaussian delocalised point charges implemented in GAMESS-UK, based on 2- and 3- centre integral and derivative integral drivers from the DFT module. ! Availability: " CHARMM-capable code incorporated into GAMESS-UK Version 6.2. " CHARMM (implemented in c28 onwards) requires independent licencing from Martin Karplus. ! Ported to a wide variety of systems including MPPs " Origin (Green), Alphaserver SC (PSC), Beowulfs …..

Daresbury Laboratory78 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory QuantumQuantum SimulationSimulation inin IndustryIndustry (QUASI)(QUASI)

! Software Development " Address barriers to uptake of existing QM/MM methodology ! explore range of QM/MM coupling schemes ! dynamics, geometry optimisation for large systems ! maintain flexible approach, address enzymes, zeolites and metal oxide surfaces ! adopt modular scheme with interfaces to industry standard codes ! High Performance Computing " Scalable MPP implementation " QM/MM MD simulation based on semi-empirical ab-initio and DFT methods ! Demonstration Applications " Value of modelling technology and HPC to industrial problems " Beowulf EV6-based solution ! Exploitation " Disseminate results through workshop, newsletters etc.

Daresbury Laboratory79 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

QUASIQUASI PartnersPartners andand QM/MMQM/MM DevelopmentsDevelopments

! CLRC Daresbury Laboratory ! Conformational Complexity " P. Sherwood, M.F. Guest, A.H. de " Hybrid Delocalised Internal Vries, G. Schreckenbach Coordinates (Thiel/A. Turner/S. ! Royal Institution of Great Britain Billeter, Zürich, MPI Mülheim) " C.R.A Catlow, A. Sokol, S. French, " QM/MM dynamics for molecular S Bromley and extended systems ! University of Zurich / MPI Mulheim ! Developments to QM/MM Coupling " W. Thiel, A Turner, S. Billeter, F. Schemes Terstegen. " Gaussian Blur (Brooks, NIH) ! ICI Wilton (UK) " Solid State embedding using shell " J. Kendrick (CAPS), S. Rogers model and pseudopotentials (Synetix), J. Casci (Catalco) (Catlow/A. Sokol, Royal Institution) ! Norsk Hydro (Porsgrunn, Norway) ! New Interfaces " K. Schoeffel, O. Swang (SINTEF) " Gaussian, TURBOMOLE etc ! BASF (Ludwigshafen, Germany) ! Graphical Interface for industrial " A. Schaefer, C. Lennartz applications (Cerius2 SDK)

Daresbury Laboratory80 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

QM/MM Applications TriosephosphateTriosephosphate isomeraseisomerase (TIM) (TIM) C. Lennartz, A. Schäfer, F. Terstegen, W. Thiel, J. Phys. Chem. B, 2002, 106, 1758-1767. •• Central Central reaction reaction in in glycolysis,glycolysis, catalytic catalytic interconversioninterconversion of of DHAP to GAP T 128 (IBM SP/p690) = 134 secs DHAP to GAP Measured Time (seconds) •• Demonstration Demonstration case case CS9 P4/2000 + Myrinet 2k within QUASI 1600 1487 within QUASI SGI Origin3800/R14k-500 (Partners(Partners UZH, UZH, and and AlphaServer SC ES45/1000 BASF)BASF) 1200 IBM SP/p690

714 778 89% 800 687 • QM region 35 atoms (DFT BLYP) 1030 – include residues with possible 419 428 376 proton donor/acceptor roles 400 540 274 257 – GAMESS-UK, MNDO, TURBOMOLE 222 213 308 152 • MM region (4,180 atoms + 2 link) 196 – CHARMM force-field, implemented 0 in CHARMM, DL_POLY 8163264

Daresbury Laboratory81 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory Vampir 2.5

Visualization and Analysis of MPI Programs

Performance Analysis of GA- based Applications using Vampir

GAMESS-UK on High-end and Commodity class machines • extensions to handle GA applications

Daresbury Laboratory82 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

GAMESS-UKGAMESS-UK // SiSi8OO25HH18 :: 88 CPUs:CPUs: OneOne DFTDFT CycleCycle

Daresbury Laboratory83 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

GAMESS-UKGAMESS-UK // SiSi8OO25HH18 :: 88 CPUsCPUs Q†HQ (GAMULT2) and PEIGS

Daresbury Laboratory84 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

MaterialsMaterials SimulationSimulation CodesCodes Plane Wave DFT Codes: Local Gaussian Basis Set Codes: • CASTEP • CRYSTAL • VASP • CPMD

These codes have similar This code presents a different set of functionality, power and problems. problems when considering CASTEP is the flagship code of performance on HPCx (matrix UKCP and hence subsequent diagonalisation) discussions will focus on this.

SIESTA and CONQUEST: • O(n) scaling codes which will be extremely attractive to users. • Both are currently development rather than production codes.

Daresbury Laboratory85 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory MaterialsMaterials Simulation.Simulation. PlanePlane WaveWave MethodsMethods Direct minimisation of the total energy (avoiding diagonalisation) r+ r 2 < r (k G) Ecut r r r ψ k r = k −i(k +G).rr j (r ) C r e ∑r j,G G • Pseudopotentials must be used to keep the number of plane waves manageable • Large number of basis functions N~106 (especially for heavy atoms). The plane wave expansion means that the bulk of the computation comprises large 3D Fast Fourier Transforms (FFTs) between real and momentum space. • These are distributed across the processors in various ways. • The actual FFT routines are optimized for the cache size of the processor. Daresbury Laboratory86 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory CPMDCPMD 3.53.5 -- Car-ParrinelloCar-Parrinello MolecularMolecular DynamicsDynamics

CPMD Elapsed Time (secs.) ! Version 3.5.1: Hutter, Alavi, Deutsh, 400 CS9 P4/2000 Xeon + Myrinet (2CPU) Bernasconi, St. Goedecker, Marx, CS9 P4/2000 Xeon + Myrinet (1CPU) Tuckerman and Parrinello (1995-2001) SGI Origin 3800 / R14k-500 IBM Regatta-HPC ! DFT, plane-waves, pseudo-potentials IBM SP/Regatta-H (8 tasks) and FFT's 259 IBM SP/Regatta-H (4 tasks) AlphaServer SC ES45 / 1000 221 208 Benchmark Example: Liquid Water ! Physical Specifications: 200 30% 32 molecules, Simple cubic periodic box of 204 113 124 length 9.86 A, Temperature 300K 111 92 99 ! Electronic Structure; 86 63 BLYP functional, Trouillier Martins pseudopotential, Recriprocal space cutoff 70 75 Ry = 952 eV 0 16 32 ! CPMD is the base code for the current CCP1 flagship project CPUs Sprik and Vuilleumier (Cambridge)

Daresbury Laboratory87 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

CPMDCPMD 3.73.7 -- CC120 BenchmarkBenchmark

Benchmark Example: C120 Electronic Structure; BLYP functional, Kleinman pseudopotential, Wavefunction cutoff 35 Ry Elapsed no. of plane waves ~360K, Real space mesh 154X96X96 Time (secs.) Singlet Triplet 1600 3000 CS10 P4/2666 + Myrinet (2CPUs/node) CS10 P4/2666 + Myrinet (2 CPUs/node) CS10 P4/2666 + Myrinet (1CPU/node) 2654 2599 CS10 P4/2666 + Myrinet (1 CPU/node) CS12 P4/2400 + GbitE. (2 CPUs/node) CS12 P4/2400 + GbitE. (2CPUs/node) 1262 CS12 P4/2400 + GbitE. (1CPU/node) CS12 P4/2400 + GbitEther (1CPU/node) 1176 2400 1200 SGI Origin 3800 / R14k-500 SGI Origin 3800 / R14k-500 IBM SP/p690 IBM SP/p690 AlphaServer SC ES45 / 1000 AlphaServer SC ES45 / 1000 1800 54%,65% 1424 55%,61% 800 721 684 1267 1147 571 1200

373 697 400 600

0 0 16 32 16 32 CPUs CPUs

Daresbury Laboratory88 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

CPMDCPMD 3.73.7 -- CC120 BenchmarkBenchmark

Benchmark Example: C120 Electronic Structure; BLYP functional, Kleinman pseudopotential, Wavefunction cutoff 35 Ry Elapsed no. of plane waves ~360K, Real space mesh 154X96X96 Time (secs.) Singlet Triplet 1600 CS12 P4/2400 + GbitE. (2 CPUs/node) 3000 CS12 P4/2400 + GbitE. (2CPUs/node) 2654 CS10 P4/2666 + Myrinet (2CPUs/node) CS10 P4/2666 + Myrinet (2 CPUs/node) 1262 SGI Origin 3800 / R14k-500 SGI Origin 3800 / R14k-500 2400 1200 IBM SP/p690 IBM SP/p690 AlphaServer SC ES45 / 1000 AlphaServer SC ES45 / 1000 1800 54%,65% 800 721 1424 55%,61% 684 1147

500 1200 1022 373 400 697 235 600 447

0 0 16 32 64 16 32 64 CPUs CPUs

Daresbury Laboratory89 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

ComputationalComputational EngineeringEngineering

DirectDirect numericalnumerical simulationsimulation (DNS)(DNS) Codes:Codes: ANGUSANGUS andand SBLISBLI

Daresbury Laboratory90 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

ANGUS: Combustion modelling (regular grid) Cray T3E/1200, High-end and Commodity-based Systems

Conjugate Gradient + ILU Direct numerical simulations Measured Time (seconds) Grid Size - 1443 (DNS) of turbulent pre-mixed 100 Iterations combustion solving the CS11 P4/2666 + GbitEther augmented Navier-Stokes CS12 P4/2400 + GbitEther equations for fluid flow. CS9 P4/2000 Xeon+ Myrinet 2k 1449 1500 CS10 P4/2666 Xeon+ Myrinet 2k 1285 CS2 QSNet Alpha Cluster/667 Discretisation of equations is SGI Origin 3800/R14k-500 performed using standard 2nd IBM Regatta-HPC performed using standard 2nd IBM SP/p690 order central differences on a AlphaServer SC ES45/1000 879 3D-grid. 1000 827 1218 34%,36% 665 664 Pressure solver utilises either a 1187 531 conjugate gradient method with 647 modified incomplete LU pre- 500 330 357 conditioner or a multi-grid solver 454 628 228 (both make extensive use of Level 1 BLAS) or fast Fourier 0 transform. 16 32 CPUs Daresbury Laboratory91 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

ANGUS: Combustion modelling (regular grid) Memory Bandwidth Effects: Pentium Xeon and Alpha Linux Systems

Conjugate Gradient + ILU Grid Size - 1443

CS10 P4 Xeon/2666 + Myrinet 2k CS2 QSNet Alpha Cluster Number of CPUs Number of CPUs 321 628 751 32 32 Nodes 16 Nodes 16 Nodes 32 8 Nodes 4 Nodes 543 8 Nodes 16 1187 2 Nodes 4 Nodes 16 2 Nodes 8 1005 1982 8

4 4 1859 3858

0 1000 2000 3000 4000 0 2000 4000 6000 Measured Time (seconds) Daresbury Laboratory92 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ANGUS: Combustion modelling (regular grid) High-end and Commodity-based Systems

Measured Time (seconds) 10 Iterations 3 Grid Size - 2883 Conjugate Gradient + ILU

72%,73% 1500 1331 CS8 dual-Itanium/800 + Myrinet 1206 CS9 P4/2000 Xeon + Myrinet 2k CS10 P4/2666 Xeon + Myrinet 2k SGI Origin 3800/R14k-500 Direct numerical simulations AlphaServer SC ES45/1000 Direct numerical simulations (DNS) of turbulent pre-mixed 819 IBM SP/p690 1.3GHz combustion solving the 563 augmented Navier-Stokes equations for fluid flow. 544 548 equations for fluid flow. 772 399 360

204 224 129 81 333 0 CPUs 32 64 128 Daresbury Laboratory93 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ComputationComputation Engineering:Engineering: UKUK TurbulenceTurbulence ConsortiumConsortium

! Focus on compute-intensive methods (Direct Numerical Simulation, Large Eddy Simulation, etc) for the simulation of turbulent flows. ! Shock boundary layer interaction modelling is critical for accurate aerodynamic design but is still poorly understood. ! The following results are from a compressible turbulent channel flow benchmark using Southampton’s DNS code. " 3D compressible Navier-Stokes equations. " General grid transformation for complex geometries. " Multi-step Runge-Kutta time advancement (RK3/4). " High-order (4th or 6th) central difference scheme. " Reduced-order and stable boundary treatment. " Entropy splitting for Euler terms. " Shock capturing TVD scheme with artificial compression method. http://www.afm.ses.soton.ac.uk/

Daresbury Laboratory94 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory ParallelParallel ImplementationImplementation andand DNSDNS BenchmarkBenchmark ! 3D load-balanced grid partitioning. ! Standard Fortran 90 + MPI library. ! Variable width halos to account for higher-order differencing schemes. ! Compile-time selection of static or dynamic memory allocation. ! Loop optimisation for single node performance. ! Run-time choice of standard, asynchronous or send-receive MPI message passing modes: Benchmark

Cube size Number of Reynolds Minimum gridpoints number memory size

120 x 120 x 120 1.7 M 180 2 GB

240 x 240 x 240 13.8 M 360 16 GB

480 x 480 x 480 110 M 720 128 GB

Daresbury Laboratory95 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory

DNS:DNS: 3603603 benchmark.benchmark. HighHigh EndEnd SystemsSystems 40.0

30.0

20.0 IBM Regatta (ORNL) Performance Cray T3E/1200E IBM Regatta (HPCx) 10.0 (million iteration points/sec) iteration (million

Scaled from 128 CPUs

0.0 0 128 256 384 512 640 768 896 1024 Number of processors Daresbury Laboratory96 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DNS:DNS: 1201203 benchmark.benchmark. High-endHigh-end andand Commodity-basedCommodity-based SystemsSystems Measured Time (seconds)

855 CS12 P4/2400 Xeon + Gbit E.(2CPUs/node) CS12 P4/2400 Xeon + Gbit E. (1CPU/node) CS10 P4/2666 Xeon + Myrinet (2CPUs/node) 760 800 736 CS10 P4/2666 Xeon + Myrinet (1CPU/node) SGI Origin 3800/R14k-500 AlphaServer SC ES45/1000 (4CPUs/node) AlphaServer SC ES45/1000 (2CPUs/node) IBM SP/p690 1.3GHz (8 CPUs/node) Dramatic 600 IBM SP/p690 1.3GHz (4 CPUs/node) Dramatic 456 ImpactImpact of of 407 399 387 51%,53% MemoryMemory 400 342 BandwidthBandwidth

204 182 200

0 16 32 Number of processors

Daresbury Laboratory97 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory DNS:DNS: 1201203 benchmark.benchmark. High-endHigh-end andand Commodity-basedCommodity-based SystemsSystems Measured Time (seconds)

1600 1465 CS12 P4/2400 Xeon + Gbit E.(2CPUs/node) CS10 P4/2666 Xeon + Myrinet (2CPUs/node) SGI Origin 3800/R14k-500 1200 AlphaServer SC ES45/1000 (4CPUs/node) IBM SP/p690 1.3GHz (8 CPUs/node) Dramatic 892 Dramatic 847 855 Impact of 736 ImpactImpact ofof 800 MemoryMemory 51%,53% 456 BandwidthBandwidth 407 387 342 400 204 195 182 94 77

0 8163264 Number of processors Daresbury Laboratory98 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory Commodity Comparisons with High-end Systems: Compaq AlphaServer SC ES45/1000 and the IBM SP/p690

CS10 - Pentium4/266 Xeon + Myrinet 2k % of 32-CPU AlphaServer SC IBM SP/p690 GAMESS-UK SCF 91% 83% DFT 85-103% 85-108% DFT (Jfit) 75% 81% DFT Gradient 85% 82% † CS9 Pentium 4/2000 MP2 Gradient 59% † 75% † + Myrinet 2k SCF Forces 92% † 114%

GAMESS / CHARMM 89% † 139% DL_POLY- 2 (and - 3) Ionic Systems 77- 95% ( 91%) 66-89% (78%) Macromolecular 75% ( 92%) 86% (97% CHARMM 139% CASTEP 60-83% 44-54% CPMD 60-65% 75-79%

ANGUS 36-73% 56-70% SBLI 53% 47%

Daresbury Laboratory99 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory Commodity Comparisons with High-end Systems: The Compaq AlphaServer SC ES45/1000

% of 32 CPUs of Compaq AlphaServer SC ES45/1000 Application Cluster CS10 CS12 Pentium4/2666 Xeon Pentium4/2400 Xeon + Myrinet 2k + Gbit Ethernet GAMESS-UK SCF 91% 85% DFT 85-103% 78 - 95% DFT (Jfit) 75% 70% DFT Gradient 85% 85% MP2 Gradient 59% † SCF Forces 92% † DL_POLY- 2 (and - 3) Ionic Systems 77- 95% ( 91%) 66-82% ( 84%) Macromolecular 75% ( 92%) 61% ( 86%) CHARMM 139% 104% CASTEP 60-83% 53- 66% CPMD 60-65% 55%

ANGUS 36-73% 34-72% SBLI 53% 51%

Daresbury Laboratory100 14 August 2003 Computational Science and Engineering Department Daresbury Laboratory SummarySummary

! Commodity-based and High-end Systems " Commodity Systems under evaluation ; CS1 - CS13 " High-end systems from Cray, SGI, IBM and HP/Compaq " Serial CPU and Communications Performance; Evaluation Metrics ! Application performance " Molecular Simulation (replicated & distributed data) ! DL_POLY, DLMULTI and CHARMM " Electronic Structure and Materials ! NWChem, GAMESS-UK and CPMD ! VAMPIR and instrumenting the GA Tools " Computational Engineering (ANGUS and SBLI) ! Notable Performance of Intel Xeon clusters with Myrinet: " CS10 Xeon / 2666 cluster with Myrinet-2k delivers on average 82% of AlphaServer ES45/1000. " Extreme Cost effectiveness of the CS12 Xeon / 2400 Cluster with Gbit Ethernet (Streamline) - delivers 75% of the AlphaServer ES45/1000.

Daresbury Laboratory101 14 August 2003