“One Ring to Rule Them All”
OpenMM library: "One ring to rule them all"
https://simtk.org/home/openmm

Existing molecular simulation packages:
CHARMM, Abalone, Presto, ACEMD, NAMD, ADUN, Ascalaph, Gromacs, AMBER, COSMOS, HOOMD, Desmond, Culgi, ESPResSo, Tinker, GROMOS, GULP, Hippo, LAMMPS, Kalypso MD, LPMD, MacroModel, MDynaMix, MOLDY, Materials Studio, MOSCITO, ProtoMol, RedMD, YASARA, ORAC, XMD, ...

The OpenM(olecular)M(echanics) API
https://simtk.org/home/openmm

Goals of OpenMM
• Complete library
  − provide what is most needed
  − easy to learn and use
• Fast, general, extensible
  − optimize for speed
  − hide hardware specifics
  − support new hardware
  − add new force fields, integration methods, etc.
• Available APIs in C++, Fortran95, Python
• Advantages
  − object oriented
  − modular
  − extensible
• Disadvantages
  − programs written in other languages need to invoke it through a layer of C++ code

Features
• Electrostatics: cut-off, Ewald, PME
• Implicit solvent
• Integrators: Verlet, Langevin, Brownian, custom
• Thermostat: Andersen
• Barostat: Monte Carlo
• Others: energy minimization, virtual sites

The OpenMM Architecture
• Public Interface:              OpenMM Public API
• Platform Independent Code:     OpenMM Implementation Layer
• Platform Abstraction Layer:    OpenMM Low Level API
• Computational Kernels:         CUDA/OpenCL/Brook/MPI etc.

Public API Classes (1)
• System
  − A collection of interacting particles
  − Defines the mass of each particle (needed for integration)
  − Specifies distance constraints
  − Contains a list of Force objects that define the interactions
• OpenMMContext
  − Contains all state information for a particular simulation: positions, velocities, other parameters

Public API Classes (2)
• Force
  − Anything that affects the system's behavior: forces, barostats, thermostats, etc.
  − A Force may:
    − apply forces to particles
    − contribute to the potential energy
    − define adjustable parameters
    − modify positions, velocities, and parameters at the start of each time step
  − Existing Force subclasses: HarmonicBond, HarmonicAngle, PeriodicTorsion, RBTorsion, Nonbonded, GBSAOBC, AndersenThermostat, CMMotionRemover, ...

Public API Classes (3)
• Integrator
  − Advances the system through time
  − Only fixed-step-size integrators are supported
  − LeapFrog, Langevin, Brownian integrators
• State
  − A snapshot of the state of the system
  − Immutable (used only for reporting)
  − Creating a State is the only way to access positions, velocities, energies or forces
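Before the deck's own example below, here is a minimal sketch of how the features and Force subclasses listed above might be combined into a System. It is not taken from the slides: class and method names follow the current OpenMM C++ API (where the slide's "Nonbonded" class is called NonbondedForce), and the box size, particle count, masses, charges and Lennard-Jones parameters are placeholders chosen only for illustration.

    // Sketch: building a System with PME electrostatics, an Andersen thermostat,
    // a Monte Carlo barostat and center-of-mass motion removal.
    #include "OpenMM.h"
    using namespace OpenMM;

    int main() {
        System system;
        // A 2 nm cubic periodic box (PME requires periodic boundaries).
        system.setDefaultPeriodicBoxVectors(Vec3(2, 0, 0), Vec3(0, 2, 0), Vec3(0, 0, 2));

        // Nonbonded interactions: PME electrostatics with a 0.9 nm real-space cutoff.
        NonbondedForce* nonbonded = new NonbondedForce();
        nonbonded->setNonbondedMethod(NonbondedForce::PME);
        nonbonded->setCutoffDistance(0.9);

        const int numParticles = 10;                  // placeholder system size
        for (int i = 0; i < numParticles; ++i) {
            system.addParticle(16.0);                 // mass in amu (placeholder)
            nonbonded->addParticle(-0.2, 0.3, 0.5);   // charge, sigma (nm), epsilon (kJ/mol)
        }
        system.addForce(nonbonded);                   // the System takes ownership of the pointer

        // "Forces" that are really thermostats/barostats, as described above.
        system.addForce(new AndersenThermostat(300.0, 1.0));  // temperature (K), collision frequency (1/ps)
        system.addForce(new MonteCarloBarostat(1.0, 300.0));  // pressure (bar), temperature (K)
        system.addForce(new CMMotionRemover());               // remove center-of-mass motion
        return 0;
    }

The design point this illustrates is the one made on the slides: thermostats, barostats and motion removers are added through exactly the same addForce() mechanism as physical interactions.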
Example
• Create a System

    System system;
    for (int i = 0; i < numParticles; ++i)
        system.addParticle(mass[i]);
    for (int i = 0; i < numConstraints; ++i)
        system.addConstraint(constraint[i].atom1, constraint[i].atom2,
                             constraint[i].distance);

• Add Forces to it

    HarmonicBondForce* bonds = new HarmonicBondForce;
    for (int i = 0; i < numBonds; ++i)
        bonds->addBond(bond[i].atom1, bond[i].atom2, bond[i].length, bond[i].k);
    system.addForce(bonds);
    // ... add Forces for other force field terms.

Example (continued)
• Simulate it

    LangevinIntegrator integrator(297.0, 1.0, 0.002);  // temperature, friction, step size
    OpenMMContext context(system, integrator);
    context.setPositions(positions);
    context.setVelocities(velocities);
    integrator.step(500);                              // take 500 steps

• Retrieve state information

    State state = context.getState(State::Positions | State::Velocities);
    for (int i = 0; i < numParticles; ++i)
        cout << state.getPositions()[i] << endl;

Some algorithms for stream computing in OpenMM

Algorithms for the calculation of long-range interactions in MD
• O(N^2) (all-vs-all) algorithm
• O(N) algorithm
• Ewald summation, O(N^(3/2))
• Particle-Mesh Ewald (PME), O(N log N)

MD simulations on GPUs
• Algorithms need to be reconsidered from the ground up
• Fast on CPU != fast on GPU; slow on CPU != slow on GPU

Nonbonded interactions in molecular systems
• Divide into short-range and long-range interactions
• Calculate the short-range part exactly
• Use approximations for the long-range part

O(N^2) (all-vs-all) algorithm on GPUs
An efficient implementation of the O(N^2) algorithm would...
• minimize global memory access
• avoid thread synchronization
• take advantage of symmetry

The algorithm
• Group atoms into blocks of 32
• The interactions divide into 32x32 tiles
• Each tile is processed by a group of 32 threads
  − load atom coordinates and parameters into shared memory
  − each thread computes the interactions of one atom with 32 atoms
  − use symmetry to skip half the tiles

The algorithm (cont.)
• Each thread loops over the atoms in a different order
  − avoids conflicts between threads
• No explicit synchronization needed
  − threads in a warp are always synchronized

O(N) algorithm on GPUs: why this is hard
• Traditional O(N) methods (e.g. neighbor lists) are slow on GPUs because of out-of-order memory access:

    for i = 1 to numNeighbors
        load coordinates and parameters for neighbor[i]   // slow!
        compute force
        store force for neighbor[i]                       // slow!

• The inner loop contains non-coalesced memory access!

General approach
• Start with the O(N^2) algorithm
• Exclude tiles with no interactions
  − like a neighbor list between blocks of 32 atoms

Identifying tiles with interactions
• Compute an axis-aligned bounding box for the 32 atoms in each block
• Calculate the distance between bounding boxes (a C++ sketch of this test is given below, after the tiling discussion)

Solvent molecules
• Solvent molecules must be ordered to be spatially coherent
  − otherwise the bounding boxes will be very large
• Arrange them along a space-filling curve
• Reorder every ~100 time steps

Performance of this algorithm
• Much faster than O(N^2) for large systems
• Performance scales linearly
But
• It computes many more interactions than really required
  − all 1024 interactions in a tile are computed, even if few (or none!) are within the cutoff

Finer-grained neighbor list
• For each tile with interactions:
  − compute the distance of each atom in one block from the bounding box of the other block
  − set a flag for each atom

Force computation for a tile

    for i = 1 to 32               // on each thread
        if (hasInteractions[i])
            compute interaction with atom i

• All 32 threads must loop over the atoms in the same order
  − requires a reduction to sum the forces
  − for a few flagged atoms, this is still much faster
  − for many flagged atoms, it is better to just compute all interactions

Benchmarks
[Figure: benchmark results]

Tiling circles is difficult (serial computing vs. stream computing)
Two issues:
• You need a lot of cubes to cover a sphere
• All interactions beyond the cut-off need to be zero
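To make the tile-exclusion step concrete, here is a small CPU-side sketch of the bounding-box test described under "Identifying tiles with interactions" above. This is not OpenMM code: the struct and function names and the example coordinates are invented, and periodic boundary conditions are ignored for brevity.

    // Sketch: keep a 32x32 tile only if the bounding boxes of its two atom
    // blocks come closer than the cutoff distance.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Vec3 { float x, y, z; };
    struct Box  { Vec3 lo, hi; };          // axis-aligned bounding box

    // Bounding box of one block of (up to) 32 atoms.
    Box boundingBox(const std::vector<Vec3>& atoms) {
        Box b = { atoms[0], atoms[0] };
        for (const Vec3& a : atoms) {
            b.lo.x = std::min(b.lo.x, a.x); b.hi.x = std::max(b.hi.x, a.x);
            b.lo.y = std::min(b.lo.y, a.y); b.hi.y = std::max(b.hi.y, a.y);
            b.lo.z = std::min(b.lo.z, a.z); b.hi.z = std::max(b.hi.z, a.z);
        }
        return b;
    }

    // Squared minimum distance between two boxes (0 if they overlap).
    float boxDistance2(const Box& a, const Box& b) {
        float dx = std::max({0.0f, a.lo.x - b.hi.x, b.lo.x - a.hi.x});
        float dy = std::max({0.0f, a.lo.y - b.hi.y, b.lo.y - a.hi.y});
        float dz = std::max({0.0f, a.lo.z - b.hi.z, b.lo.z - a.hi.z});
        return dx * dx + dy * dy + dz * dz;
    }

    bool tileHasInteractions(const Box& a, const Box& b, float cutoff) {
        return boxDistance2(a, b) <= cutoff * cutoff;
    }

    int main() {
        // Two hypothetical atom blocks (coordinates in nm).
        std::vector<Vec3> blockA = { {0.0f, 0.0f, 0.0f}, {0.3f, 0.2f, 0.1f} };
        std::vector<Vec3> blockB = { {2.0f, 0.0f, 0.0f}, {2.4f, 0.1f, 0.3f} };
        bool keep = tileHasInteractions(boundingBox(blockA), boundingBox(blockB), 1.5f);
        std::printf("tile kept: %s\n", keep ? "yes" : "no");
        return 0;
    }

This also shows why the solvent reordering slide matters: if a block's atoms are scattered, its bounding box grows and the test excludes almost nothing.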
The art of calculating zeros
[Figure: pair interactions computed per µs for GPU, 3k atoms, rc = 1.5 nm, buffered; GPU, 48k atoms, rc = 1.5 nm, buffered (including zeros); CPU, rc = 1.5 nm, without water force loops; CPU, rc = 1.5 nm; CPU, rc = 0.9 nm]

Ewald summation method
• Direct evaluation of the reciprocal-space sum is an O(N^2) algorithm
• Factoring out the structure factors gives an O(N^(3/2)) algorithm:

    for each wave vector K
        CosSum = 0
        SinSum = 0
        for each atom i
            StructureFactor_Real(i) = charge(i) * cos(K . coord(i))
            StructureFactor_Imag(i) = charge(i) * sin(K . coord(i))
            CosSum += StructureFactor_Real(i)
            SinSum += StructureFactor_Imag(i)
        for each atom i
            force(i) += CosSum * StructureFactor_Imag(i) - SinSum * StructureFactor_Real(i)

Ewald O(N^(3/2)) algorithm on GPUs
Two kernels (a plain C++ version of the same loops is sketched at the end of this document):

    CalculateCosSinSums_kernel()
        // Each thread processes one of the wave vectors.
        K = threadIdx.x + blockIdx.x * blockDim.x;
        // Compute the sum for that K.
        float2 sum = make_float2(0.0f, 0.0f);
        for each atom i
            // Read atom data.
            sum.x += charge(i) * cos(K . coord(i))
            sum.y += charge(i) * sin(K . coord(i))
        // Store the sum in global memory.
        Sum(K) = sum;
        // Compute the contribution to the energy.
        energy = sum.x * sum.x + sum.y * sum.y
        // Store the energy in global memory.
        Energy(K) = energy

    CalculateEwaldForces_kernel()
        // Each thread processes one of the atoms.
        atom = threadIdx.x + blockIdx.x * blockDim.x;
        // Load atom data.
        force = Force(atom)
        coord = Coord(atom)
        // Loop over all wave vectors.
        for each wave vector K
            // Compute the force contribution of this wave vector:
            // calculate again (!) the structure factor for this wave vector.
            structureFactor.x = cos(K . coord)
            structureFactor.y = sin(K . coord)
            // Load the cosine/sine sum from global memory.
            cosSinSum = cSim.pEwaldCosSinSum[K];
            force += charge(atom) *
                     (cosSinSum.x * structureFactor.y - cosSinSum.y * structureFactor.x)
        // Record the force on the atom.
        Force(atom) = force

Particle-Mesh Ewald (PME) method
1) calculate scaled fractional coordinates
2) calculate B-spline coefficients
3) "smear" the charges over the grid points using the spline coefficients
4) execute a forward FFT
5) calculate the reciprocal energy
6) execute a backward FFT
7) calculate the force gradient

Particle-Mesh Ewald method: CPU vs. GPU
• CPU: spreading of charges
• GPU: gathering of charges
  − sort the atoms before the gather!

OpenMM GPU weak scaling
Water boxes on an Nvidia Tesla 2050
[Figure: µs/atom/step, split into real space, force reduction, charge spread/gather and 3D FFT, for water boxes of 3k to 96k atoms]
Note: real-space cut-off 1.5 nm; on a CPU it is around 0.9 nm

GPU - CPU comparison
Nvidia Tesla 2050 vs. Intel Core i7 920 (2.66 GHz, 4 threads)
[Figure: µs/atom/step, split into real space, force reduction, charge spread/gather and 3D FFT, for GPU 3k, GPU 48k, CPU rc = 1.5 nm and CPU rc = 0.9 nm]
• real-space cost ~ rc^3
• FFT cost ~ spacing^-3
• for constant accuracy: spacing = rc / a
• total cost: Cpp * rc^3 + CFFT * a^3 * rc^-3

Integrating Gromacs and OpenMM

Gromacs & OpenMM in practice
• GPUs supported in Gromacs 4.5:

    mdrun ... -device "OpenMM:Cuda"

• Same input files, same output files
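For reference, here is a plain C++ translation of the O(N^(3/2)) reciprocal-space loops shown on the Ewald slides above. It is an illustrative toy rather than OpenMM's implementation: the wave vectors, charges and coordinates are made up, the force is kept as a scalar term exactly as in the slide pseudocode, and the usual per-wave-vector Ewald prefactors (Gaussian weight, 1/k^2, 4*pi/V) are omitted just as they are on the slide.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Vec3 { double x, y, z; };

    double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

    int main() {
        // Hypothetical input data.
        std::vector<Vec3>   coord  = { {0.0, 0.0, 0.0}, {0.3, 0.1, 0.0}, {0.1, 0.4, 0.2} };
        std::vector<double> charge = { -0.8, 0.4, 0.4 };
        std::vector<Vec3>   waveVectors = { {6.28, 0.0, 0.0}, {0.0, 6.28, 0.0}, {6.28, 6.28, 0.0} };

        const int numAtoms = static_cast<int>(coord.size());
        // Scalar stand-in for the per-atom force, as on the slide
        // (the true contribution is this term times the vector K).
        std::vector<double> force(numAtoms, 0.0);
        std::vector<double> sfReal(numAtoms), sfImag(numAtoms);

        for (const Vec3& K : waveVectors) {
            double cosSum = 0.0, sinSum = 0.0;
            // First pass: per-atom structure factors and their sums over all atoms.
            for (int i = 0; i < numAtoms; ++i) {
                sfReal[i] = charge[i] * std::cos(dot(K, coord[i]));
                sfImag[i] = charge[i] * std::sin(dot(K, coord[i]));
                cosSum += sfReal[i];
                sinSum += sfImag[i];
            }
            // Second pass: contribution of this wave vector to each atom.
            for (int i = 0; i < numAtoms; ++i)
                force[i] += cosSum * sfImag[i] - sinSum * sfReal[i];
        }

        for (int i = 0; i < numAtoms; ++i)
            std::printf("atom %d: accumulated reciprocal-space term = %f\n", i, force[i]);
        return 0;
    }

The two-pass structure mirrors the two GPU kernels: the first inner loop corresponds to CalculateCosSinSums_kernel, the second to CalculateEwaldForces_kernel.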