
OpenMM library
"One ring to rule them all"
https://simtk.org/home/openmm

The OpenM(olecular)M(echanics) API
Molecular dynamics packages that OpenMM aims to serve include:
CHARMM, Abalone, Presto, ACEMD, NAMD, ADUN, Ascalaph, Gromacs, AMBER, COSMOS, HOOMD, Desmond, Culgi, ESPResSo, GROMOS, GULP, Hippo, LAMMPS, Kalypso MD, LPMD, MacroModel, MDynaMix, MOLDY, Materials Studio, MOSCITO, ProtoMol, RedMD, YASARA, ORAC, XMD

Goals of OpenMM
• Complete library

  – provide what is most needed
  – easy to learn and use
• Fast, general, extensible

  – optimize for speed
  – hide hardware specifics
  – support new hardware
  – add new force fields, integration methods, etc.
• Available APIs in C++, Fortran 95, Python
• Advantages

  – Object oriented
  – Modular
  – Extensible
• Disadvantages

  – Programs written in other languages need to invoke it through a layer of C++ code

Features

• Electrostatics: cut-off, Ewald, PME
• Implicit solvent
• Integrators: Verlet, Langevin, Brownian, custom
• Thermostat: Andersen
• Barostat: Monte Carlo
• Others: energy minimization, virtual sites (a hedged API sketch of some of these follows below)
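Most of these features surface either as Force objects added to the System or as options on the nonbonded force. The class names below (NonbondedForce, AndersenThermostat, MonteCarloBarostat) come from OpenMM's public C++ API, but exact constructor signatures can differ between versions, so treat this as a sketch rather than definitive usage:

    #include "OpenMM.h"
    using namespace OpenMM;

    void addStandardForces(System& system) {
        // Electrostatics: pick cut-off, Ewald or PME on the nonbonded force
        NonbondedForce* nb = new NonbondedForce();
        nb->setNonbondedMethod(NonbondedForce::PME);   // or CutoffPeriodic, Ewald, ...
        nb->setCutoffDistance(1.0);                    // nm
        system.addForce(nb);                           // the System takes ownership

        // Thermostat and barostat are just additional Force objects
        system.addForce(new AndersenThermostat(300.0, 1.0));   // K, collisions/ps
        system.addForce(new MonteCarloBarostat(1.0, 300.0));   // bar, K
    }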

The OpenMM Architecture

• Public Interface: OpenMM Public API
• Platform Independent Code: OpenMM Implementation Layer
• Platform Abstraction Layer: OpenMM Low Level API
• Computational Kernels: CUDA/OpenCL/Brook/MPI etc.

Public API Classes (1)
• System

  – A collection of interacting particles
  – Defines the mass of each particle (needed for integration)
  – Specifies distance constraints
  – Contains a list of Force objects that define the interactions
• OpenMMContext

  – Contains all state information for a particular simulation: positions, velocities, other parameters

Public API Classes (2)
• Force

  – Anything that affects the system's behavior: forces, barostats, thermostats, etc.
  – A Force may:
      Apply forces to particles
      Contribute to the potential energy
      Define adjustable parameters
      Modify positions, velocities, and parameters at the start of each time step
  – Existing Force subclasses: HarmonicBond, HarmonicAngle, PeriodicTorsion, RBTorsion, Nonbonded, GBSAOBC, AndersenThermostat, CMMotionRemover, ...

Public API Classes (3)
• Integrator

  – Advances the system through time
  – Only fixed-step-size integrators are supported
  – LeapFrog, Langevin, and Brownian integrators
• State

  – A snapshot of the state of the system
  – Immutable (used only for reporting)
  – Creating a State is the only way to access positions, velocities, energies, or forces

Example

• Create a System

    System system;
    for (int i = 0; i < numParticles; ++i)
        system.addParticle(mass[i]);
    for (int i = 0; i < numConstraints; ++i)
        system.addConstraint(constraint[i].atom1, constraint[i].atom2,
                             constraint[i].distance);

• Add Forces to it

    HarmonicBondForce* bonds = new HarmonicBondForce;
    for (int i = 0; i < numBonds; ++i)
        bonds->addBond(bond[i].atom1, bond[i].atom2, bond[i].length, bond[i].k);
    system.addForce(bonds);
    // ... add Forces for other force field terms.

Example (continued)

• Simulate it

    LangevinIntegrator integrator(297.0, 1.0, 0.002);   // temperature, friction, step size
    OpenMMContext context(system, integrator);
    context.setPositions(positions);
    context.setVelocities(velocities);
    integrator.step(500);   // take 500 steps

• Retrieve state information

    State state = context.getState(State::Positions | State::Velocities);
    for (int i = 0; i < numParticles; ++i)
        cout << state.getPositions()[i] << endl;
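The same State mechanism gives access to energies and forces. A short continuation of the example above (State::Energy, State::Forces and the getters are taken from the OpenMM API; shown as a sketch):

    State state2 = context.getState(State::Energy | State::Forces);
    double ePot = state2.getPotentialEnergy();              // kJ/mol
    const vector<Vec3>& forces = state2.getForces();        // kJ/mol/nm
    cout << "E_pot = " << ePot << ", F on atom 0 = " << forces[0] << endl;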

Some algorithms for stream computing in OpenMM

Algorithms for the calculation of long-range interactions in MD
• O(N²) (all-vs-all) algorithm
• O(N) algorithm
• Ewald summation, O(N^(3/2))
• Particle-Mesh Ewald (PME), O(N log N)

MD simulations on GPUs

• Algorithms need to be reconsidered from the ground up
• Fast on CPU != fast on GPU

• Slow on CPU != slow on GPU

Nonbonded interactions in molecular systems

• Divide into short-range and long-range interactions
• Calculate the short-range part analytically
• Use approximations for the long-range part (the usual Ewald split is sketched below)
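For the Coulomb part this split is usually done with the Ewald decomposition: the short-range piece decays as erfc(βr)/r and is summed directly inside the cut-off, while the smooth long-range remainder is handled in reciprocal space (see the Ewald and PME sections below). A minimal sketch of the short-range pair term (β, the charges and the cut-off are illustrative parameters; units and prefactors are omitted):

    #include <cmath>

    // Real-space (short-range) part of the Ewald-split Coulomb interaction of a
    // charge pair; everything beyond rCut is left to the reciprocal-space part.
    double ewaldRealSpaceEnergy(double qi, double qj, double r,
                                double beta, double rCut) {
        if (r >= rCut)
            return 0.0;                          // handled by the long-range part
        return qi * qj * std::erfc(beta * r) / r;
    }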

O(N²) (all-vs-all) algorithm on GPUs

An efficient implementation of the O(N²) algorithm would:
• minimize global memory access
• avoid thread synchronization
• take advantage of symmetry

Algorithm

• Group atoms into blocks of 32 atoms
• Interactions divide into 32×32 tiles

• Each tile is processed by a group of 32 threads
  – Load atom coordinates and parameters into shared memory
  – Each thread computes the interactions of one atom with 32 atoms
  – Use symmetry to skip half the tiles

Algorithm (cont.)

• Each thread loops over atoms in a different order
  – Avoids conflicts between threads
• No explicit synchronization needed
  – Threads in a warp are always synchronized

(Diagram: within a tile, each of the 32 threads sweeps the 32 atoms starting at a different offset; a CPU reference sketch follows.)
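A CPU-side reference sketch of one 32×32 tile (not the actual GPU kernel): the "threads" are emulated by the loop over t, and the staggered offset (t + k) mod 32 is what lets the 32 real threads update different j atoms at every step without synchronization. The 1/r³-weighted pair force is a placeholder, not the real force field:

    #include <cmath>
    #include <vector>

    struct Vec3d { double x, y, z; };

    void processTile(int iBlock, int jBlock,
                     const std::vector<Vec3d>& pos, std::vector<Vec3d>& force) {
        const int TILE = 32;
        for (int t = 0; t < TILE; ++t) {                  // one pass per "thread"
            int i = iBlock * TILE + t;
            for (int k = 0; k < TILE; ++k) {
                int j = jBlock * TILE + (t + k) % TILE;   // staggered sweep order
                double dx = pos[i].x - pos[j].x;
                double dy = pos[i].y - pos[j].y;
                double dz = pos[i].z - pos[j].z;
                double r2 = dx*dx + dy*dy + dz*dz;
                if (r2 == 0.0) continue;                  // skip self-interaction
                double f = 1.0 / (r2 * std::sqrt(r2));    // placeholder pair force
                force[i].x += f*dx; force[i].y += f*dy; force[i].z += f*dz;
                force[j].x -= f*dx; force[j].y -= f*dy; force[j].z -= f*dz;   // symmetry
            }
        }
    }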

O(N) algorithm on GPUs

Why this is hard

• Traditional O(N) methods (e.g. neighbor lists) are slow on GPUs
  – Out-of-order memory access

    for i = 1 to numNeighbors
        load coordinates and parameters for neighbor[i]    // slow!
        compute force
        store force for neighbor[i]                        // slow!

• The inner loop contains non-coalesced memory access!

General approach

• Start with the O(N²) algorithm
• Exclude tiles with no interactions
  – Like a neighbor list between blocks of 32 atoms

(Diagram: the atom-atom interaction matrix divided into 32×32 tiles; only the tiles that contain interacting pairs are kept.)

Identifying tiles with interactions

• Compute an axis-aligned bounding box for the 32 atoms in each block
• Calculate the distance between bounding boxes (sketched below)
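A sketch of the bounding-box test (periodic images are ignored for brevity; the data layout and cut-off are illustrative):

    #include <algorithm>
    #include <array>
    #include <cmath>
    #include <vector>

    struct Box { double lo[3], hi[3]; };

    // Axis-aligned bounding box of the 32 atoms starting at index 'first'.
    Box boundingBox(const std::vector<std::array<double,3>>& pos, int first) {
        Box b;
        for (int d = 0; d < 3; ++d) b.lo[d] = b.hi[d] = pos[first][d];
        for (int i = first; i < first + 32; ++i)
            for (int d = 0; d < 3; ++d) {
                b.lo[d] = std::min(b.lo[d], pos[i][d]);
                b.hi[d] = std::max(b.hi[d], pos[i][d]);
            }
        return b;
    }

    // Minimum distance between two boxes (zero if they overlap).
    double boxDistance(const Box& a, const Box& b) {
        double d2 = 0.0;
        for (int d = 0; d < 3; ++d) {
            double gap = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
            d2 += gap * gap;
        }
        return std::sqrt(d2);
    }

    // A tile is kept only if its two blocks can contain atoms within the cut-off.
    bool tileHasInteractions(const Box& a, const Box& b, double cutoff) {
        return boxDistance(a, b) < cutoff;
    }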

Solvent
• Solvent molecules must be ordered to be spatially coherent
  – Otherwise bounding boxes will be very large
• Arrange along a space-filling curve (one possible choice is sketched below)
• Reorder every ~100 time steps
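The slides do not say which space-filling curve is used; a Morton (Z-order) curve is one common choice and is easy to sketch. The per-molecule cell indices (cx, cy, cz) are assumed to have been computed from the positions beforehand:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Spread the lower 10 bits of x so that two zero bits separate each bit.
    static std::uint32_t spreadBits(std::uint32_t x) {
        x &= 0x3ff;
        x = (x | (x << 16)) & 0x030000ff;
        x = (x | (x << 8))  & 0x0300f00f;
        x = (x | (x << 4))  & 0x030c30c3;
        x = (x | (x << 2))  & 0x09249249;
        return x;
    }

    // Interleave three 10-bit cell indices into one 30-bit Morton code.
    static std::uint32_t mortonCode(std::uint32_t ix, std::uint32_t iy, std::uint32_t iz) {
        return spreadBits(ix) | (spreadBits(iy) << 1) | (spreadBits(iz) << 2);
    }

    // Sort molecule indices so that spatial neighbours become memory neighbours.
    void reorderByMorton(std::vector<int>& order,
                         const std::vector<std::uint32_t>& cx,
                         const std::vector<std::uint32_t>& cy,
                         const std::vector<std::uint32_t>& cz) {
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            return mortonCode(cx[a], cy[a], cz[a]) < mortonCode(cx[b], cy[b], cz[b]);
        });
    }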

Performance of this algorithm
• Much faster than O(N²) for large systems
• Performance scales linearly
But
• Computes many more interactions than really required
  – Computes all 1024 interactions in a tile, even if few (or none!) are within the cutoff

Finer-grained neighbor list

• For each tile with interactions:
  – Compute the distance of each atom in one block from the bounding box of the other block
  – Set a flag for each atom

Force computation for a tile

    // on each thread:
    for i = 1 to 32
        if (hasInteractions[i])
            compute interaction with atom i

• All 32 threads must loop over the atoms in the same order

  – Requires a reduction to sum the forces (see the sketch below)
  – For a few atoms, this is still much faster
  – For many atoms, better to just compute all interactions
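A serial sketch of the flagged tile: the flags come from the point-to-box distance test above; on the GPU every thread keeps its own partial forces for block J and a cross-thread reduction sums them, which this serial version collapses into one buffer. The pair force is again a placeholder:

    #include <array>
    #include <cmath>
    #include <vector>

    void processFlaggedTile(const std::vector<std::array<double,3>>& pos,
                            std::vector<std::array<double,3>>& force,
                            int iBlock, int jBlock,
                            const std::array<bool,32>& hasInteractions) {
        std::array<std::array<double,3>,32> jForce{};   // reduction buffer for block J
        for (int t = 0; t < 32; ++t) {                  // one pass per "thread"
            int i = iBlock * 32 + t;
            for (int k = 0; k < 32; ++k) {              // all threads: same order
                if (!hasInteractions[k]) continue;      // skip unflagged atoms
                int j = jBlock * 32 + k;
                double dx = pos[i][0] - pos[j][0];
                double dy = pos[i][1] - pos[j][1];
                double dz = pos[i][2] - pos[j][2];
                double r2 = dx*dx + dy*dy + dz*dz;
                if (r2 == 0.0) continue;
                double f = 1.0 / (r2 * std::sqrt(r2));  // placeholder pair force
                force[i][0] += f*dx;  force[i][1] += f*dy;  force[i][2] += f*dz;
                jForce[k][0] -= f*dx; jForce[k][1] -= f*dy; jForce[k][2] -= f*dz;
            }
        }
        for (int k = 0; k < 32; ++k)                    // "reduction" onto block J
            for (int d = 0; d < 3; ++d)
                force[jBlock*32 + k][d] += jForce[k][d];
    }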

Benchmarks

Tiling circles: tiling is difficult
(Figure: "serial computing" vs. "stream computing", i.e. the cut-off sphere vs. the same sphere covered by cubic tiles.)

Two issues:
• You need a lot of cubes to cover a sphere
• All interactions beyond the cut-off need to be zero
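A rough back-of-the-envelope estimate (not from the slides) of how many zeros a cubic tiling produces: even in the best case, processing the bounding cube of the cut-off sphere evaluates pairs in a volume of (2 r_c)³ while only the sphere contributes,

    \frac{V_{\mathrm{sphere}}}{V_{\mathrm{cube}}}
      = \frac{\tfrac{4}{3}\pi r_c^{3}}{(2 r_c)^{3}}
      = \frac{\pi}{6} \approx 0.52,

so roughly half of the evaluated pair interactions lie beyond the cut-off and must come out as zero; a coarse 32-atom tiling only makes this worse.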

The art of calculating zeros

(Bar chart, pair interactions per µs, 0 to 10000, for: GPU 3k atoms (rc = 1.5, buffered), GPU 48k atoms (rc = 1.5, buffered), CPU (rc = 1.5, no water force loops), CPU (rc = 1.5), CPU (rc = 0.9); the GPU bars include the "zeros", i.e. pairs beyond the cut-off.)

Ewald summation method
The naive algorithm is O(N²); with an optimal choice of parameters the Ewald sum scales as O(N^(3/2)):

    for each wave vector K
        CosSum = 0
        SinSum = 0
        for each atom i
            StructureFactor_Real(i) = charge(i) * cos(K.coord(i))
            StructureFactor_Imag(i) = charge(i) * sin(K.coord(i))
            CosSum += StructureFactor_Real(i)
            SinSum += StructureFactor_Imag(i)
        for each atom i
            force(i) += CosSum * StructureFactor_Imag(i) - SinSum * StructureFactor_Real(i)
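In the pseudocode, K.coord(i) stands for the dot product k·r_i, and the k-dependent prefactor (exp(-k²/4β²)/k² and volume factors) is left out. A serial C++ sketch of the same loop with those conventions made explicit (still without the prefactor):

    #include <array>
    #include <cmath>
    #include <vector>

    using Vec3d = std::array<double,3>;

    void ewaldReciprocal(const std::vector<Vec3d>& r, const std::vector<double>& q,
                         const std::vector<Vec3d>& kvecs, std::vector<Vec3d>& force) {
        size_t n = r.size();
        std::vector<double> sfRe(n), sfIm(n);
        for (const Vec3d& k : kvecs) {
            double cosSum = 0.0, sinSum = 0.0;
            for (size_t i = 0; i < n; ++i) {                  // structure factors
                double kr = k[0]*r[i][0] + k[1]*r[i][1] + k[2]*r[i][2];
                sfRe[i] = q[i] * std::cos(kr);
                sfIm[i] = q[i] * std::sin(kr);
                cosSum += sfRe[i];
                sinSum += sfIm[i];
            }
            for (size_t i = 0; i < n; ++i) {                  // force, directed along k
                double f = cosSum * sfIm[i] - sinSum * sfRe[i];
                for (int d = 0; d < 3; ++d)
                    force[i][d] += f * k[d];
            }
        }
    }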

Ewald O(N^(3/2)) algorithm on GPUs

    CalculateCosSinSums_kernel()
        // Each thread processes one of the wave vectors
        K = threadIdx.x + blockIdx.x * blockDim.x;
        // Compute the sum for that K
        float2 sum = make_float2(0.0f, 0.0f);
        for each atom i
            sum.x += charge(i) * cos(K.coord(i))
            sum.y += charge(i) * sin(K.coord(i))
        // Store the sum in global memory
        Sum(K) = sum;
        // Compute the contribution to the energy
        energy += sum * sum
        // Store the energy in global memory
        Energy(K) = energy

    CalculateEwaldForces_kernel()
        // Each thread processes one of the atoms
        atom = threadIdx.x + blockIdx.x * blockDim.x;
        // Load atom data
        force = Force(atom)
        coord = Coord(atom)
        // Loop over all wave vectors
        for each wave vector K
            // calculate again (!) the structure factor for this wave vector
            structureFactor.x = cos(K.coord(atom))
            structureFactor.y = sin(K.coord(atom))
            // Load the cosine/sine sum from global memory
            cosSinSum = cSim.pEwaldCosSinSum[K];
            // Compute the force contribution of this wave vector
            force += charge(atom) * (cosSinSum.x * structureFactor.y - cosSinSum.y * structureFactor.x)
        // Record the force on the atom
        Force(atom) = force

Particle-Mesh Ewald (PME) method

1) calculate scaled fractional coordinates

2) calculate the B-spline coefficients

3) "smear" the charges over the grid points using the spline coefficients (see the sketch below)

4) execute the forward FFT

5) calculate the reciprocal energy

6) execute the backward FFT

7) calculate the force gradient
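A hedged sketch of steps 1-3 for a single atom, using the cardinal B-spline recursion from the smooth-PME scheme of Essmann et al.; the grid layout, index convention and interpolation order (typically 4) are illustrative choices, and implementations differ in detail:

    #include <cmath>
    #include <vector>

    // B-spline weights of the given order for a fractional offset w in [0,1).
    void bsplineWeights(double w, int order, std::vector<double>& theta) {
        theta.assign(order, 0.0);
        theta[0] = 1.0 - w;
        theta[1] = w;
        for (int k = 3; k <= order; ++k) {
            double div = 1.0 / (k - 1);
            theta[k-1] = div * w * theta[k-2];
            for (int j = 1; j <= k - 2; ++j)
                theta[k-1-j] = div * ((w + j) * theta[k-2-j] + (k - j - w) * theta[k-1-j]);
            theta[0] = div * (1.0 - w) * theta[0];
        }
    }

    // Spread one charge onto a K[0] x K[1] x K[2] grid (orthorhombic box assumed).
    void spreadCharge(double q, const double frac[3],       // fractional coords in [0,1)
                      const int K[3], int order, std::vector<double>& grid) {
        std::vector<double> theta[3];
        int base[3];
        for (int d = 0; d < 3; ++d) {
            double u = frac[d] * K[d];                       // 1) scaled fractional coordinate
            base[d] = (int)std::floor(u);
            bsplineWeights(u - base[d], order, theta[d]);    // 2) B-spline coefficients
        }
        for (int tx = 0; tx < order; ++tx)                   // 3) smear over order^3 grid points
            for (int ty = 0; ty < order; ++ty)
                for (int tz = 0; tz < order; ++tz) {
                    int ix = (base[0] - order + 1 + tx + K[0]) % K[0];
                    int iy = (base[1] - order + 1 + ty + K[1]) % K[1];
                    int iz = (base[2] - order + 1 + tz + K[2]) % K[2];
                    grid[(ix * K[1] + iy) * K[2] + iz] +=
                        q * theta[0][tx] * theta[1][ty] * theta[2][tz];
                }
    }

Steps 4-7 then operate on the resulting grid (forward FFT, reciprocal-space energy, backward FFT, force interpolation with the same B-splines).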

Particle-Mesh Ewald method
• CPU: spreading of charges
• GPU: gathering of charges

  – sort the atoms before gathering

OpenMM GPU weak scaling

(Plot: water box on an NVIDIA Tesla C2050, µs/atom/step vs. number of atoms (3k, 6k, 12k, 24k, 48k, 96k), broken down into real space, force (F) reduction, charge (q) spread/gather, and 3D FFT; y-axis 0 to 0.6 µs/atom/step.)

Note: real-space cut-off 1.5 nm; on a CPU around 0.9 nm.

GPU vs. CPU comparison

NVIDIA Tesla C2050 vs. Intel Core i7 920 (2.66 GHz, 4 threads)

(Bar chart: µs/atom/step for GPU with 3k atoms, GPU with 48k atoms, CPU with rc = 1.5 and CPU with rc = 0.9, broken down into real space, force (F) reduction, charge (q) spread/gather, and 3D FFT; y-axis 0 to 1 µs/atom/step.)

real-space cost ∼ r_c³        FFT cost ∼ spacing⁻³
for constant accuracy: spacing = r_c / a
total cost: C_pp · r_c³ + C_FFT · a³ · r_c⁻³
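A one-line consequence of this cost model (not on the slides): the optimum balances the two terms,

    \frac{d}{dr_c}\left(C_{pp}\, r_c^{3} + C_{FFT}\, a^{3} r_c^{-3}\right) = 0
    \;\Rightarrow\;
    r_c^{6} = \frac{C_{FFT}}{C_{pp}}\, a^{3}
    \;\Rightarrow\;
    r_c = \left(\frac{C_{FFT}}{C_{pp}}\right)^{1/6} a^{1/2},

so the cheaper the pair interactions (small C_pp, as on a GPU), the larger the optimal cut-off, which is consistent with the 1.5 nm vs. 0.9 nm note above.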

Integrating Gromacs and OpenMM

Gromacs & OpenMM in practice

• GPUs supported in Gromacs 4.5: mdrun ... -device "OpenMM:Cuda"
• Same input files, same output files
• Only a subset of features is supported
• No shortcuts taken on the GPU: at least the same accuracy as on the CPU (< 1e-6)

Gromacs-OpenMM

• Run as:
    $ mdrun -device "OpenMM:platform=Cuda,memtest=off,deviceid=0,force-device=yes"
• Options:
  – platform: only CUDA for now, potentially OpenCL in the future
  – memtest: check the GPU memory for errors, default 15 s
  – deviceid: select the GPU card for the simulation (for multi-device systems)
  – force-device: run on unsupported but CUDA-capable low-end cards

GPU performance over x86 CPU

(Bar-chart values in ns/day, read off in the order PME, reaction-field, implicit solvent, all-vs-all. PME and reaction-field use BPTI, ~21k atoms; implicit and all-vs-all use villin, 600 atoms, implicit solvent.)

                      PME   Reaction-field   Implicit   All-vs-all
  Quad x86, 3.2 GHz    32         46            230         470
  Tesla C2050          29         66            810        1450

Fermi (C20) performance over C10

                      PME   Reaction-field   Implicit   All-vs-all
  Tesla C1060          13         35            460         790
  Tesla C2050          29         66            810        1450

Conclusions

• Only a subset of the features is supported in OpenMM, but it covers most users' needs
• Better integration is expected to come
• Useful for implicit solvent simulations