“One Ring to Rule Them All”

OpenMM library
“One ring to rule them all”
https://simtk.org/home/openmm

The OpenM(olecular)M(echanics) API
[Figure labels: CHARMM, Presto, NAMD, Gromacs, AMBER, ???, HOOMD, Tinker, LAMMPS]
[Bulleted package list: Abalone, ACEMD, ADUN, Ascalaph, COSMOS, Desmond, Culgi, ESPResSo, GROMOS, GULP, Hippo, Kalypso MD, LPMD, MacroModel, MDynaMix, MOLDY, Materials Studio, MOSCITO, ProtoMol, RedMD, YASARA, ORAC, XMD]

Goals of OpenMM
• Complete library
  – provide what is most needed
  – easy to learn and use
• Fast, general, extensible
  – optimize for speed
  – hide hardware specifics
  – support new hardware
  – add new force fields, integration methods etc.

• Available APIs in C++, Fortran95, Python
• Advantages
  – Object oriented
  – Modular
  – Extensible
• Disadvantages
  – Programs written in other languages need to invoke it through a layer of C++ code

Features
• Electrostatics: cut-off, Ewald, PME
• Implicit solvent
• Integrators: Verlet, Langevin, Brownian, custom
• Thermostat: Andersen
• Barostat: Monte Carlo
• Others: energy minimization, virtual sites

The OpenMM Architecture
  Public Interface:            OpenMM Public API
  Platform Independent Code:   OpenMM Implementation Layer
  Platform Abstraction Layer:  OpenMM Low Level API
  Computational Kernels:       CUDA/OpenCL/Brook/MPI etc.

Public API Classes (1)
• System
  – A collection of interacting particles
  – Defines the mass of each particle (needed for integration)
  – Specifies distance constraints
  – Contains a list of Force objects that define the interactions
• OpenMMContext
  – Contains all state information for a particular simulation
    − Positions, velocities, other parameters

Public API Classes (2)
• Force
  – Anything that affects the system's behavior: forces, barostats, thermostats, etc.
  – A Force may:
    − Apply forces to particles
    − Contribute to the potential energy
    − Define adjustable parameters
    − Modify positions, velocities, and parameters at the start of each time step
  – Existing Force subclasses
    − HarmonicBond, HarmonicAngle, PeriodicTorsion, RBTorsion, Nonbonded, GBSAOBC, AndersenThermostat, CMMotionRemover...

Public API Classes (3)
• Integrator
  – Advances the system through time
  – Only fixed step size integrators are supported
  – LeapFrog, Langevin, Brownian Integrators
• State
  – A snapshot of the state of the system
  – Immutable (used only for reporting)
  – Creating a state is the only way to access positions, velocities, energies or forces.

Example
• Create a System
    System system;
    for (int i = 0; i < numParticles; ++i)
        system.addParticle(mass[i]);
    for (int i = 0; i < numConstraints; ++i)
        system.addConstraint(constraint[i].atom1, constraint[i].atom2,
                             constraint[i].distance);
• Add Forces to it
    HarmonicBondForce* bonds = new HarmonicBondForce;
    for (int i = 0; i < numBonds; ++i)
        bonds->addBond(bond[i].atom1, bond[i].atom2, bond[i].length, bond[i].k);
    system.addForce(bonds);
    // ... add Forces for other force field terms.
Example (continued)
• Simulate it
    LangevinIntegrator integrator(297.0, 1.0, 0.002); // Temperature, friction, step size
    OpenMMContext context(system, integrator);
    context.setPositions(positions);
    context.setVelocities(velocities);
    integrator.step(500); // Take 500 steps
• Retrieve state information
    State state = context.getState(State::Positions | State::Velocities);
    for (int i = 0; i < numParticles; ++i)
        cout << state.getPositions()[i] << endl;

Some algorithms for stream computing in OpenMM

Algorithms for calculation of long-range interactions in MD
• O(N^2) (all-vs-all) algorithm
• O(N) algorithm
• Ewald summation (O(N^(3/2)))
• Particle-Mesh Ewald (PME) (O(N*logN))

MD simulations on GPUs
• Algorithms need to be reconsidered from the ground up
• Fast on CPU != Fast on GPU
• Slow on CPU != Slow on GPU

nonbonded interactions in molecular systems
• Divide into short-range and long-range interactions
• Calculate short-range analytically
• Use approximations for the long-range

O(N^2) (all-vs-all) algorithm on GPUs

efficient implementation of O(N^2) algorithm would...
• minimize global memory access
• avoid thread synchronization
• take advantage of symmetry

algorithm
• Group atoms into blocks of 32
• Interactions divide into 32x32 tiles
• Each tile is processed by a group of 32 threads
  – Load atom coordinates and parameters into shared memory
  – Each thread computes interactions of one atom with 32 atoms
  – Use symmetry to skip half the tiles
[Figure: grid of tiles, both axes labelled "Atoms" (0, 32, 64, 96, ...)]

algorithm (cont.)
• Each thread loops over atoms in a different order
  – Avoids conflicts between threads
• No explicit synchronization needed
  – Threads in a warp are always synchronized
[Figure: axes labelled "Atoms" and "Threads"]

O(N) algorithm on GPUs

why this is hard
• Traditional O(N) methods (e.g. neighbor lists) are slow on GPUs
  – Out of order memory access
    for i = 1 to numNeighbors
        load coordinates and parameters for neighbor[i]    (slow!)
        compute force
        store force for neighbor[i]                        (slow!)
• The inner loop contains non-coalesced memory access!

general approach
• Start with the O(N^2) algorithm
• Exclude tiles with no interactions
  – Like a neighbor list between blocks of 32 atoms
[Figure: two tile grids, both axes labelled "Atoms" (0, 32, 64, 96, 128, 160, 192, 224, ...)]

identifying tiles with interactions
• Compute an axis aligned bounding box for the 32 atoms in each block
• Calculate the distance between bounding boxes

solvent molecules
• Solvent molecules must be ordered to be spatially coherent
  – Otherwise bounding boxes will be very large
• Arrange along a space filling curve
• Reorder every ~100 time steps

performance of this algorithm
• Much faster than O(N^2) for large systems
• Performance scales linearly
But
• Computes many more interactions than really required
  – Computes all 1024 interactions in a tile, even if few (or none!) are within the cutoff

Finer Grained Neighbor List
• For each tile with interactions:
  – Compute distance of each atom in one block from the bounding box of the other block
  – Set a flag for each atom

Force Computation for Tile
    for i = 1 to 32                              (on each thread)
        if (hasInteractions[i])
            compute interaction with atom i
• All 32 threads must loop over atoms in the same order
  – Requires a reduction to sum the forces
  – For a few atoms, this is still much faster
  – For many atoms, better to just compute all interactions
[Figure: axes labelled "Atoms" and "Threads"]

benchmarks

Tiling circles is difficult
[Figure: serial computing vs. stream computing]
Two issues:
• You need a lot of cubes to cover a sphere
• All interactions beyond the cut-off need to be zero
Para2010 – p. 20/22

The art of calculating zeros
[Figure: bar chart of pair interactions per µs (axis 0 to 10000), with a "zeros" label, for: GPU 3k, rc=1.5, buffered; GPU 48k, rc=1.5, buffered; CPU, rc=1.5, no water force loops; CPU, rc=1.5; CPU, rc=0.9]
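The bounding-box test from "identifying tiles with interactions" can be sketched on the CPU as follows. This is a minimal illustration, not OpenMM's implementation: the names `Vec3`, `Box`, `computeBlockBoxes` and `findInteractingTiles` are hypothetical, periodic boundary conditions are ignored, and the atoms are assumed to be already ordered so that each block of 32 is spatially compact.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper types for this sketch (not OpenMM classes).
struct Vec3 { double x, y, z; };
struct Box { Vec3 lo, hi; };  // axis-aligned bounding box: min and max corner

constexpr std::size_t kBlockSize = 32;  // atoms per block, as on the slides

// One bounding box per block of 32 consecutive atoms (last block may be short).
std::vector<Box> computeBlockBoxes(const std::vector<Vec3>& pos) {
    std::vector<Box> boxes;
    for (std::size_t start = 0; start < pos.size(); start += kBlockSize) {
        const std::size_t end = std::min(start + kBlockSize, pos.size());
        Box b{pos[start], pos[start]};
        for (std::size_t i = start + 1; i < end; ++i) {
            b.lo.x = std::min(b.lo.x, pos[i].x);  b.hi.x = std::max(b.hi.x, pos[i].x);
            b.lo.y = std::min(b.lo.y, pos[i].y);  b.hi.y = std::max(b.hi.y, pos[i].y);
            b.lo.z = std::min(b.lo.z, pos[i].z);  b.hi.z = std::max(b.hi.z, pos[i].z);
        }
        boxes.push_back(b);
    }
    return boxes;
}

// Separation of two intervals along one axis; 0 if they overlap.
double gap(double lo1, double hi1, double lo2, double hi2) {
    return std::max({0.0, lo1 - hi2, lo2 - hi1});
}

// Squared distance between two boxes (assumption: no periodic images).
double boxDistance2(const Box& a, const Box& b) {
    const double dx = gap(a.lo.x, a.hi.x, b.lo.x, b.hi.x);
    const double dy = gap(a.lo.y, a.hi.y, b.lo.y, b.hi.y);
    const double dz = gap(a.lo.z, a.hi.z, b.lo.z, b.hi.z);
    return dx * dx + dy * dy + dz * dz;
}

// List the tiles (block i, block j) whose boxes lie within the cutoff.
// Only j >= i is visited, which uses symmetry to skip half the tiles.
std::vector<std::pair<std::size_t, std::size_t>>
findInteractingTiles(const std::vector<Vec3>& pos, double cutoff) {
    const std::vector<Box> boxes = computeBlockBoxes(pos);
    std::vector<std::pair<std::size_t, std::size_t>> tiles;
    for (std::size_t i = 0; i < boxes.size(); ++i)
        for (std::size_t j = i; j < boxes.size(); ++j)
            if (boxDistance2(boxes[i], boxes[j]) <= cutoff * cutoff)
                tiles.emplace_back(i, j);
    return tiles;
}
```

Calling `findInteractingTiles(positions, cutoff)` returns the block pairs that still need their 32x32 interactions computed. The sketch tests every pair of blocks, which is itself quadratic in the number of blocks; it only illustrates the exclusion test, not how to reach overall linear scaling.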
Ewald summation method

Ewald summation method
O(N^2) algorithm

Ewald summation method
O(N^(3/2)) algorithm
    for each wave vector K
        CosSum = 0
        SinSum = 0
        for each atom i
            StructureFactor_Real(i) = charge(i) * cos(K.coord(i))
            StructureFactor_Imag(i) = charge(i) * sin(K.coord(i))
            CosSum += StructureFactor_Real(i)
            SinSum += StructureFactor_Imag(i)
        for each atom i
            // accumulate over wave vectors; K-dependent prefactors omitted
            force(i) += CosSum * StructureFactor_Imag(i) - SinSum * StructureFactor_Real(i)

Ewald O(N^(3/2)) algorithm on GPUs

CalculateCosSinSums_kernel()
    // Each thread is processing one of the wave vectors K
    K = threadIdx.x + blockIdx.x * blockDim.x;
    // Compute the sum for that K
    float2 sum = make_float2(0.0f, 0.0f);
    // Read atom data
    for each atom i
        sum.x += charge(i) * cos(K.coord(i))
        sum.y += charge(i) * sin(K.coord(i))
    // Store the sum in global memory
    Sum(K) = sum;
    // Compute the contribution to the energy.
    energy += sum * sum
    // Store energy in global memory
    Energy(K) = energy

CalculateEwaldForces_kernel()
    // Each thread is processing one of the atoms
    atom = threadIdx.x + blockIdx.x * blockDim.x;
    // Load atom data
    force = Force(atom)
    coord = Coord(atom)
    // Loop over all wave vectors.
    // Compute the force contribution of this wave vector.
    for each wave vector K
        // calculate again (!) the structure factor for this wave vector
        structureFactor.x = cos(K.coord)
        structureFactor.y = sin(K.coord)
        // Load the Cosine/Sine Sum from memory
        cosSinSum = cSim.pEwaldCosSinSum[K];
        force += charge(atom) * (cosSinSum.x*structureFactor.y - cosSinSum.y*structureFactor.x)
    // Record the force on the atom.
    Force(atom) = force

Particle-Mesh Ewald (PME) method

Particle-Mesh Ewald method
1) calculate scaled fractionals
2) calculate B-spline coefficients
3) “smear” the charges over the grid points using the spline coefficients
4) execute forward FFT
5) calculate the reciprocal energy
6) execute backward FFT
7) calculate force gradient

Particle-Mesh Ewald method
  CPU: spreading of charges
  GPU: gathering of charges
• sort the atoms before gather!

OpenMM GPU weak scaling
Water box on an Nvidia Tesla 2050
[Figure: cost in µs/atom/step (axis 0 to 0.6) vs. # atoms (3k, 6k, 12k, 24k, 48k, 96k), split into real space, F reduction, q spread/gather, 3D FFT]
Note: real space cut-off 1.5 nm, on a CPU around 0.9 nm

GPU - CPU comparison
Nvidia Tesla 2050 vs Intel Core i7 920 (2.66 GHz, 4 threads)
[Figure: cost in µs/atom/step (axis 0 to 1) for GPU 3k, GPU 48k, CPU rc=1.5, CPU rc=0.9, split into real space, F reduction, q spread/gather, 3D FFT]
  real space cost: $\sim r_c^{3}$
  FFT cost: $\sim \text{spacing}^{-3}$
  for constant accuracy: $\text{spacing} = r_c / a$
  total cost: $C_{pp}\, r_c^{3} + C_{FFT}\, a^{3}\, r_c^{-3}$

Integrating Gromacs and OpenMM

Gromacs & OpenMM in practice
• GPUs supported in Gromacs 4.5
    mdrun ... -device "OpenMM:Cuda"
• Same input files, same output files
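Returning to the O(N^(3/2)) Ewald pseudocode, the two loops per wave vector can be written as plain serial C++. This is a sketch under stated assumptions rather than the OpenMM kernel: `Vec3`, `WaveVector` and `reciprocalForces` are hypothetical names, and the K-dependent prefactor that the slide leaves out is represented by a caller-supplied `weight` whose value is not derived here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper types for this sketch (not OpenMM classes).
struct Vec3 { double x, y, z; };

struct WaveVector {
    Vec3 k;         // reciprocal-space vector K
    double weight;  // assumption: K-dependent prefactor supplied by the caller
};

inline double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// For each wave vector: first accumulate the cosine and sine sums over all
// atoms, then use them to add that wave vector's contribution to every atom.
std::vector<Vec3> reciprocalForces(const std::vector<Vec3>& coord,
                                   const std::vector<double>& charge,
                                   const std::vector<WaveVector>& waves) {
    const std::size_t n = coord.size();
    std::vector<Vec3> force(n, Vec3{0.0, 0.0, 0.0});
    std::vector<double> sfReal(n), sfImag(n);  // per-atom structure factor terms

    for (const WaveVector& w : waves) {
        double cosSum = 0.0, sinSum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double phase = dot(w.k, coord[i]);
            sfReal[i] = charge[i] * std::cos(phase);
            sfImag[i] = charge[i] * std::sin(phase);
            cosSum += sfReal[i];
            sinSum += sfImag[i];
        }
        for (std::size_t i = 0; i < n; ++i) {
            // Scalar term from the slide, directed along K and scaled by weight.
            const double term = w.weight * (cosSum * sfImag[i] - sinSum * sfReal[i]);
            force[i].x += term * w.k.x;
            force[i].y += term * w.k.y;
            force[i].z += term * w.k.z;
        }
    }
    return force;
}
```

Storing the per-atom structure factor terms between the two inner loops is what the serial version gains over the GPU kernels on the slide, which recompute the sine and cosine in the force kernel instead of reading them back from memory.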

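The cost model on the GPU - CPU comparison slide can be taken one step further by minimizing the total cost over the cut-off. The derivation below is a worked illustration that is not on the slides; it assumes the two cost coefficients $C_{pp}$ and $C_{FFT}$ and the accuracy parameter $a$ do not depend on $r_c$.

```latex
\begin{align*}
C(r_c) &= C_{pp}\, r_c^{3} + C_{FFT}\, a^{3}\, r_c^{-3} \\
\frac{dC}{dr_c} &= 3\, C_{pp}\, r_c^{2} - 3\, C_{FFT}\, a^{3}\, r_c^{-4} = 0
  \quad\Longrightarrow\quad r_c^{6} = \frac{C_{FFT}}{C_{pp}}\, a^{3} \\
r_c^{\ast} &= a^{1/2} \left( \frac{C_{FFT}}{C_{pp}} \right)^{1/6},
  \qquad
  C(r_c^{\ast}) = 2 \sqrt{C_{pp}\, C_{FFT}\, a^{3}}
\end{align*}
```

Under this model the optimal cut-off grows when pair interactions become cheap relative to the FFT. That is one possible reading of the note that the GPU runs use a 1.5 nm cut-off where a CPU would use around 0.9 nm.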