OpenMM library “One ring to rule them all” https://simtk.org/home/openmm CHARMM * Abalone Presto * ACEMD NAMD * ADUN * Ascalaph Gromacs AMBER * COSMOS ??? HOOMD * Desmond * Culgi * ESPResSo Tinker * GROMOS * GULP * Hippo LAMMPS * Kalypso MD * LPMD * MacroModel * MDynaMix * MOLDY * Materials Studio * MOSCITO * ProtoMol The OpenM(olecular)M(echanics) API * RedMD * YASARA * ORAC https://simtk.org/home/openmm * XMD Goals of OpenMM l Complete library
l provide what is most needed l easy to learn and use l Fast, general, extensible
l optimize for speed l hide hardware specifics l support new hardware l add new force fields, integration methods etc. l Available APIs in C++, Fortran95, Python l Advantages
l Object oriented l Modular l Extensible l Disadvantages
l Programs written in other languages need to invoke it through a layer of C++ code Features
• Electrostatics: cut-off, Ewald, PME • Implicit solvent • Integrators: Verlet, Langevin, Brownian, custom • Thermostat: Andersen • Barostat: MonteCarlo • Others: energy minimization, virtual sites
5 The OpenMM Architecture
Public Interface OpenMM Public API
Platform Independent Code OpenMM Implementation Layer
Platform Abstraction Layer OpenMM Low Level API
Computational Kernels CUDA/OpenCL/Brook/MPI etc. Public API Classes (1) l System
l A collection of interaction particles l Defines the mass of each particle (needed for integration) l Specifies distance restraints l Contains a list of Force objects that define the interactions l OpenMMContext
l Contains all state information for a particular simulation − Positions, velocities, other parmeters Public API Classes (2) l Force
l Anything that affects the system's behavior: forces, barostats, thermostats, etc. l A Force may: − Apply forces to particles − Contribute to the potential energy − Define adjustable parameters − Modify positions, velocities, and parameters at the start of each time step l Existing Force subclasses − HarmonicBond, HarmonicAngle, PeriodicTorsion, RBTorsion, Nonbonded, GBSAOBC, AndersenThermostat, CMMotionRemover... Public API Classes (3) l Integrator
l Advances the system through time l Only fixed step size integrators are supported l LeapFrog, Langevin, Brownian Integrators l State
l A snapshot of the state of the system l Immutable (used only for reporting) l Creating a state is the only way to access positions, velocities, energies or forces. Example
• Create a System System system ; for (int i = 0; i < numParticles; ++i) system.addParticle(mass[i]); for (int i = 0; i < numConstraints; ++i) system.addConstraint(constraint[i].atom1, constraint[i].atom2, constraint[i].distance);
• Add Forces to it HarmonicBondForce* bonds = new HarmonicBondForce; for (int i = 0; i < numBonds; ++i) bonds->addBond(bond[i].atom1, bond[i].atom2, bond[i].length, bond[i].k); system.addForce(bonds); // ... add Forces for other force !eld terms. Example (continued)
• Simulate it
LangevinIntegrator integrator(297.0, 1.0, 0.002); // Temperature, friction, // step size OpenMMContext context(system, integrator); context.setPositions(positions); context.setVelocities(velocities); integrator.step(500); // Take 500 steps
• Retrieve state information
State state = context.getState(State::Positions | State::Velocities); for (int i = 0; i < numParticles ; ++i) cout << state.getPositions()[i] << endl; Some algorithms for stream computing in OpenMM Algorithms for calculation of long-range interactions in MD
l O(N2) (all-vs-all) algorithm l O(N) algorithm l Ewald summation (O(N3/2)) l Particle-Mesh Ewald (PME) (O(N*logN) MD simulations on GPUs
l Algorithms need to be reconsidered from the ground up l Fast on CPU != Fast on GPU
Slow on CPU != Slow on GPU nonbonded interactions in molecular systems
l Divide into short-range and long-range interactions l Calculate short-range analytically l Use approximations for the long-range O(N2) (all-vs-all) algorithm on GPUs efficient implementation of O(N2) algorithm would...
l minimize global memory access l avoid thread synchronization l take advantage of symmetry algorithm
• Group atoms into blocks of 32 Atoms • Interactions divide into 32x32 0 32 64 96 ... Atoms 0 tiles
32 • Each tile is processed by a group of 32 threads 64 – Load atom coordinates and parameters into shared memory 96 – Each thread computes ... interactions of one atom with 32 atoms – Use symmetry to skip half the tiles algorithm (cont.)
• Each thread loops over atoms in a different order – Avoids con#icts between threads • No explicit synchronization needed – Threads in a warp are always synchronized
Atoms Threads
O(N) algorithm on GPUs why this is hard
• Traditional O(N) methods (e.g. neighbor lists) are slow on GPUs – Out of order memory access
for i = 1 to numNeighbors load coordinates and parameters for neighbor[i] slow! compute force store force for neighbor[i] slow!
• The inner loop contains non-coalesced memory access! general approach
2 • Start with the O(N ) algorithm • Exclude tiles with no interactions – Like a neighbor list between blocks of 32 atoms
Atoms 0 32 64 96 128 160 192 224 ... Atoms Atoms 0
32
64
96
128
160
192
224
... identifying tiles with interactions
• Compute an axis aligned bounding box for the 32 atoms in each block • Calculate the distance between bounding boxes solvent molecules
• Solvent molecules must be ordered to be spatially coherent – Otherwise bounding boxes will be very large • Arrange along a space !lling curve • Reorder every ~100 time steps performance of this algorithm
2 • Much faster than O(N ) for large systems • Performance scales linearly But • Computes many more interactions than really required Computes– all 1024 interactions in a tile, even if few (or none!) are within the cutoff Finer Grained Neighbor List
• For each tile with interactions: – Compute distance of each atom in one block from the bounding box of the other block – Set a #ag for each atom Force Computation for Tile
for i = 1 to 32 if (hasInteractions[i]) } on each thread compute interaction with atom i • All 32 threads must loop over atoms in the same order Atoms
– Requires a reduction to sum Threads
the forces – For a few atoms, this is still much faster – For many atoms, better to just compute all interactions benchmarks Tiling circlesTiling is difficult
serial computing stream computing
Two issues: • You need a lot of cubes to cover a sphere • All interactions beyond the cut-off need to be zero
Para2010 – p. 20/22 TheCalculating art of calculating zeros zeros!
GPU 3k, rc=1.5, buffered
GPU 48k, rc=1.5, buffered zeros
CPU, rc=1.5, no water force loops
CPU, rc=1.5
CPU, rc=0.9
0 2000 4000 6000 8000 10000 pair interactions per µs
Para2010 – p. 21/22 Ewald summation method Ewald summation method O(N2) algorithm Ewald summation method O(N3 / 2) algorithm
for each wave vector K
CosSum = 0 SinSum = 0
for each atom i StructureFactor_Real(i) = charge(i) * cos(K.coord(i)) StructureFactor_Imag(i) = charge(i) * sin(K.coord(i))
CosSum += StructureFactor_Real(i) SinSum += StructureFactor_Imag(i)
for each atom i force(i) = CosSum * StructureFactor_Imag(i) – SinSum * StructureFactor_Real(i) Ewald O(N3 / 2) algorithm on GPUs
CalculateEwaldForces_kernel()
// Each thread is processing // one of the atoms CalculateCosSinSums_kernel() atom = threadIdx.x + blockIdx.x * blockDim.x; // Each thread is processing // one of the wave vectors K // Load atom data force = Force(atom) K = threadIdx.x + blockIdx.x * blockDim.x; coord = Coord(atom)
// Compute the sum for that K // Loop over all wave vectors. // Compute the force contribution of this wave vector. #oat2 sum = make_#oat2(0.0f, 0.0f); for each wave vector K // Read atom data for each atom i // calculate again (!) the structure sum.x += charge(i) * cos(K.coord(i)) // factor for this wave vector sum.y += charge(i) * sin(K.coord(i)) structureFactor.x = cos(K.coord(i)) structureFactor.y = sin(K.coord(i)) // Store the sum in global memory Sum(K) = sum; // Load the Cosine/Sine Sum from memory cosSinSum = cSim.pEwaldCosSinSum[K]; // Compute the contribution to the energy. force += charge(i) * energy += sum * sum (cosSinSum.x*structureFactor.y- cosSinSum.y*StructureFactor.x) // Store energy in global memory Energy(K) = energy // Record the force on the atom. Force(atom) = force Particle-Mesh Ewald (PME) method Particle-Mesh Ewald method
1) calculate scaled fractionals
2) calculate B-spline coefficient
3) “smear” the charges over the grid points from spline-coefficient
4) execute forward FFT
5) calculate the reciprocal energy
6) execute backward FFT
7) calculate force gradient Particle-Mesh Ewald method
CPU GPU spreading of charges gathering of charges
l sort the atoms before gather ! OpenMM GPU weak scaling
Water box on a Nvidia Telsa2050
0.6
real space 0.4 F reduction q spread/gather 3D FFT s/atom/step 0.2 µ
0 3k 6k 12k 24k 48k 96k # atoms
Note: real space cut-off 1.5 nm, on a CPU around 0.9 nm
Para2010 – p. 18/22 GPU - CPU comparison
Nvidia Tesla 2050 vs Intel Core i7 920 (2.66 GHZ, 4 threads)
real space 1 F reduction q spread/gather 3D FFT
0.5 s/atom/step µ
0 GPU 3k GPU 48k CPU rc=1.5 CPU rc=0.9
3 real space cost ∼ rc − FFT cost ∼ spacing 3 for constant accuracy: spacing = rc/a 3 3 −3 total cost: Cpp rc + CFFT a rc
Para2010 – p. 19/22 Integrating Gromacs and OpenMM Gromacs & OpenMM in practice Gromacs & OpenMM in practice
• GPUs supported in Gromacs 4.5 mdrun ... -device “OpenMM:Cuda” Gromacs & OpenMM in practice
• GPUs supported in Gromacs 4.5 mdrun ... -device “OpenMM:Cuda” • Same input files, same output files Gromacs & OpenMM in practice
• GPUs supported in Gromacs 4.5 mdrun ... -device “OpenMM:Cuda” • Same input files, same output files • Only subset of features supported Gromacs & OpenMM in practice
• GPUs supported in Gromacs 4.5 mdrun ... -device “OpenMM:Cuda” • Same input files, same output files • Only subset of features supported • No shortcuts taken on the GPU: • At least same accuracy as on the CPU (<1e-6) Gromacs-OpenMM
• run as $ mdrun -device “OpenMM:platform=Cuda,memtest=off,deviceid=0,force- device=yes" • options • platform: only CUDA, potentially in future OpenCL too • memtest: check the GPU memory for error, default 15s • deviceid: select the GPU card for simulation (for multi-device systems) • force-device: run on unsupported but CUDA capable low-end cards GPU performance over x86 CPU PME Reaction-field Implicit All-vs-all GPU performance over x86 CPU PME Reaction-field Implicit All-vs-all ns/day 1500
1000 32 46 500 230 Quad x86 470 0 3.2 GHz GPU performance over x86 CPU PME Reaction-field Implicit All-vs-all ns/day 1500
1000 32 46 500 230 Quad x86 470 0 3.2 GHz
1500 1200 29 900 66 600 810 300 1450 0 C2050 Fermi (C20) performance over C10 PME Reaction-field BPTI (~21k atoms) Implicit All-vs-all Villin (600 atoms, implicit) Fermi (C20) performance over C10 PME Reaction-field BPTI (~21k atoms) Implicit All-vs-all Villin (600 atoms, implicit) ns/day 1500
1000 13 35 500 460 C1060 790 0 Fermi (C20) performance over C10 PME Reaction-field BPTI (~21k atoms) Implicit All-vs-all Villin (600 atoms, implicit) ns/day 1500
1000 13 35 500 460 C1060 790 0
1500 1200 29 900 66 600 810 300 1450 0 C2050 Conclusions
• Only a subset of the features are supported in OpenMM but they would cover most users needs • Better integration expected to come • Useful for implicit solvent simulations