AMOEBA and HPC: Parallelising Polarisable Force Fields
Weronika Filinger (w.fi[email protected]), EPCC, The University of Edinburgh
Why we want to use polarisable force fields

The accuracy of biomolecular simulations can be seriously limited by the traditional treatment of electrostatics in empirically derived force fields. Non-polarisable models typically approximate electrostatic interactions using only fixed, usually atom-centred, charges, so the atomic charges are limited in their ability to adapt to changes in their local environment. Real physical systems, however, do adapt and undergo polarisation: for example, the orientation of atomic bonds will distort when placed in poorly conducting media such as water, changing the geometry, the energy and the way the system interacts. Polarisation is thus a necessary and important intermolecular interaction, required to accurately model chemical reactions in biological molecules and water-based systems, which are of great interest in fields such as chemistry, biochemistry and materials science.

Much effort has been put into developing mutually polarisable models, which enhance fixed-charge models by incorporating the polarisation of atoms and molecules into the dynamics of such systems. These polarisable models are more accurate but also more complex, and hence tend to be much more computationally demanding. There is therefore a strong requirement for the software packages that implement polarisable force fields to be capable of exploiting current and emerging HPC architectures, to make the modelling of ever more complex systems viable.

OpenMP Performance

The performance of dynamic on the Joint AMBER-CHARMM (JAC) benchmark [5], run on the UK ARCHER supercomputing service, is shown in Figure 1. Each ARCHER node has 24 cores, so the program can be run on up to 24 OpenMP threads. The JAC system consists of 2,489 atoms of dihydrofolate reductase (DHFR) in a bath of 7,023 water molecules, giving a total of 23,558 atoms. The simulation ran 100 MD steps in a canonical ensemble (NVT: constant number of atoms, volume and temperature) at a temperature of 298 K, with a time step of 1.0 femtoseconds. From the performance and scalability analysis we can note that:

• The application scales for up to 20 OpenMP threads, but a parallel efficiency above 50% is only observed for up to 12 threads;
• Slightly better performance and scalability are observed for larger systems, but memory usage restricts the maximum size of system that can be run;
• An attempt to improve the performance through OpenMP optimisations was not very successful; the small performance improvements obtained were compiler and architecture dependent;
• The difference between the observed and theoretical speedups suggests that, without increased parallel coverage, OpenMP optimisation will not improve the performance significantly.

Figure 1. Speedups for different numbers of OpenMP threads for the JAC benchmark on ARCHER. The ideal speedup is the speedup that would be obtained if the whole code were parallelised with maximal efficiency; the theoretical speedup is calculated using Amdahl's law [6] with a serial component (i.e. the fraction of the code that is executed serially) of 5%.

The next step is to create a hybrid MPI and OpenMP implementation, using OpenMP to exploit the shared memory within a node and MPI for the internode communications.

AMOEBA

AMOEBA [1] (Atomic Multipole Optimised Energies for Biomolecular Applications) is a polarisable force field that has been implemented and is used in codes such as TINKER [2] and AMBER [3]. AMOEBA replaces fixed partial charges with atomic multipoles, including an explicit dipole polarisation that allows atoms to respond to changes in their molecular environment.

[Image: Amoeba proteus, taken from http://www.microscopy-uk.org.uk/mag/imgsep01/amoebaproteus450.jpg]

AMOEBA has been shown to give more accurate structural and thermodynamic properties for a variety of simulated systems [4].
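The explicit dipole polarisation can be written in its standard mutual-induction form (a generic sketch of how such polarisable models work; the notation is ours, not taken from this poster): each polarisable site i acquires an induced dipole proportional to the total electric field it experiences,

```latex
\mu_i^{\mathrm{ind}} = \alpha_i \left( E_i^{\mathrm{perm}} + E_i^{\mathrm{ind}} \right)
```

where \alpha_i is the site polarisability, E_i^{\mathrm{perm}} is the field from the permanent multipoles, and E_i^{\mathrm{ind}} is the field from all the other induced dipoles. Because the induced dipoles appear on both sides, the equations must be solved self-consistently (typically by iteration) at every MD step, which is why evaluating the induced dipoles is so much more expensive than a fixed-charge evaluation.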
However, the time required to calculate the effect of the induced dipoles alone is 15-20 times larger than the cost of one time step for a fixed-charge model. If this computational time could be reduced, then the AMOEBA force field could, for example, model a much broader range of protein-ligand systems than is currently possible, or provide a means of refining the analysis of X-ray crystallography data to cope with large and challenging data sets, such as for ribosome crystals. The larger computational cost associated with the polarisation calculations needs to be mitigated by improving the parallelisation of codes such as TINKER.

MPI Parallelisation

An MPI parallelisation effort is under way. Due to the time constraints of the APES project and the complexity of the TINKER code, EPCC's efforts have been focused on a replicated data strategy; this should enable performance improvements within the time frame of the project. Micro-benchmarks of the most time-consuming subroutines parallelised with MPI, together with scaling profiles, indicate that using both MPI and OpenMP over the same regions will not give satisfactory scaling (Table 1): there is not enough computation to compensate for the MPI communication costs in most subroutines.

Table 1. Speedup and parallel efficiency for different numbers of MPI processes for the JAC benchmark.

MPI processes       | 1    | 2    | 4    | 8    | 16
Speedup             | 1    | 1.48 | 1.85 | 1.87 | 1.36
Parallel efficiency | 100% | 74%  | 46%  | 23%  | 9%
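The quantities in Table 1, and the theoretical speedup used for Figure 1, follow from two short calculations. A minimal sketch in Python (the function name is illustrative, not from the TINKER code; the values agree with Table 1 up to rounding):

```python
def amdahl_speedup(p, serial_fraction=0.05):
    """Amdahl's law: theoretical speedup on p threads/processes when a
    fraction of the code (here 5%, as quoted for Figure 1) runs serially."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Measured speedups for the JAC benchmark (Table 1).
measured_speedup = {1: 1.0, 2: 1.48, 4: 1.85, 8: 1.87, 16: 1.36}

for p, s in measured_speedup.items():
    # Parallel efficiency is the measured speedup divided by the process count.
    print(f"{p:2d} MPI processes: speedup {s:.2f}, efficiency {100 * s / p:.1f}%")

# With a 5% serial component, Amdahl's law caps the achievable speedup
# well below the thread count (about 7.7 on 12 threads).
print(f"Theoretical speedup on 12 threads: {amdahl_speedup(12):.1f}")
```

Note how quickly even a small serial fraction dominates as the process count grows, consistent with the efficiency collapse at 16 processes in Table 1 and the plateau around 20 OpenMP threads in Figure 1.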
Currently, work is being done to increase parallel coverage as much as possible while keeping communication and synchronisation between MPI processes to a minimum.

TINKER

TINKER is a software package for molecular mechanics and dynamics, with some special features for biopolymers, written by Jay Ponder (Washington University in St. Louis). TINKER consists of a collection of programs offering different functionalities and supports customised force field parameter sets for proteins, organic molecules, inorganic systems, etc., including the AMOEBA force field.

[Image: TINKER's logo, taken from TINKER's website [2]]

TINKER is written in Fortran and has been parallelised using OpenMP directives. Nearly 95% of the dynamic program included in the TINKER distribution, which can be used to generate a stochastic dynamics trajectory or to compute a molecular dynamics trajectory in a chosen ensemble, can be executed in parallel. However, the scalability of dynamic has been found to be limited, and further parallelisation efforts have therefore been undertaken by EPCC as part of the Advanced Potential Energy Surfaces (APES) project, an NSF-EPSRC funded US-UK collaboration.

[Image: DHFR in water, as used in the JAC benchmark, taken from http://www.gromacs.org/@api/deki/files/130/=dhfr_water.png]

Difficulties

During both stages of the parallelisation of TINKER, a number of issues affecting the performance and the effectiveness of the parallelisation have been observed:

• The fine-grained structure of the code, i.e. many small subroutines that are called many times, makes it hard to parallelise;
• Memory issues: large memory usage and indirect memory addressing produce bad memory locality and lead to memory thrashing, with a degradation in the overall performance;
• The large number of global variables makes it hard to identify dependencies between different variables and subroutines.

References

[1] J. W. Ponder et al., J. Phys. Chem. B, 2010, 114(8), 2549-2564.
[2] J. W. Ponder and F. M. Richards, J. Comput. Chem., 1987, 8, 1016-1024. Website: http://dasher.wustl.edu/tinker/
[3] D. A. Case et al. (2014), AMBER 14, University of California, San Francisco. Website: http://ambermd.org
[4] O. Demerdash, E-H. Yap and T. Head-Gordon, Annu. Rev. Phys. Chem., 2014, 65: 149-74.
[5] http://ambermd.org/amber10.bench1.html#jac, last visited on 9th April 2015.
[6] G. M. Amdahl, AFIPS Conf. Proc., 1967, 30: 483-485.