Identifying Conformational Isomers of Organic Molecules in Solution Via Unsupervised Clustering
Total Page:16
File Type:pdf, Size:1020Kb
Identifying conformational isomers of organic molecules in solution via unsupervised clustering Veselina Marinova,y,z Laurence Dodd,y Song-Jun Lee,y Geoffrey P. F. Wood,{ Ivan Marziano,x and Matteo Salvalaglio∗,y yThomas Young Centre and Department of Chemical Engineering, University College London, London WC1E 7JE, UK. zDepartment of Materials Science and Engineering, The University of Sheffield, Sheffield S1 3JD, UK {Pfizer Worldwide Research and Development, Groton Laboratories, Groton, Connecticut 06340, USA xPfizer Worldwide Research and Development, Sandwich, Kent CT13 9NJ, UK E-mail: [email protected] Abstract We present a systematic approach for the identification of statistically relevant conforma- tional macrostates of organic molecules from molecular dynamics trajectories. The approach applies to molecules characterised by an arbitrary number of torsional degrees of freedom and enables the transferability of the macrostates definition across different environments. We formulate a dissimilarity measure between molecular configurations that incorporates informa- tion on the characteristic energetic cost associated with transitions along all relevant torsional degrees of freedom. Such metric is employed to perform unsupervised clustering of molecu- lar configurations based on the fast search and find of density peaks algorithm. We apply this method to investigate the equilibrium conformational ensemble of Sildenafil, a conformationally complex pharmaceutical compound, in different environments including the crystal bulk, the gas phase and three different solvents (acetonitrile, 1-butanol, and toluene). We demonstrate that, while Sildenafil can adopt more than one hundred metastable conformational configura- tions, only 12 are significantly populated across all the environments investigated. Despite the complexity of the conformational space, we find that the most abundant conformers in solution are the closest to the conformers found in the most common Sildenafil crystal phase. 1 Introduction high-dimensional and impractical to read and interpret. A critical drawback of this approach Conformational isomerism in organic molecules is that, by reducing the dimensionality of the is an important characteristic which bears sig- descriptors’ space used to represent configu- nificance in a variety of problems. For exam- rations, degeneracy is introduced, and conse- ple, binding properties of proteins in protein- quently information lost. More generally, re- ligand complexes are controlled by their con- liable conformational descriptors are particu- formational configuration by affecting associa- larly important when implementing enhanced tion/dissociation rates, and by entropic contri- sampling techniques. Enhanced sampling tech- butions to the process.1,2 Understanding the de- niques are heavily reliant on the use of ap- tails and mechanisms of conformational changes propriate system descriptors. Particularly for that proteins undergo is an important part studying self-assembly processes, conformation- of modern drug discovery methodologies.3 For ally flexible systems currently present a major small organic molecules the ability to adopt dif- challenge,11 thereby driving the search for a sys- ferent conformational configurations can open tematic approach to their classification. the possibility for the formation of multiple Dividing the conformational space of large or- crystal forms known as conformational poly- ganic molecules is often done via partitional morphs4,5 - crystal structures of components clustering methods, meaning that the data is with the same chemical formula but different assigned into groups without any hierarchical molecular shape. This phenomenon is particu- structure, based on a chosen criterion.12 One larly important in the pharmaceutical industry of the best known partitioning algorithms is k- where the uncontrolled occurrence of an unde- means.13 The idea behind this algorithm is to sired polymorphic form can affect the stability, define a k-centroid for each cluster and mea- shelf-life or efficacy of the drug. In the field of sure the distance between a data point and each crystallisation, conformational rearrangements of the cluster centres. Many computational are not only relevant to polymorphism. In our works have achieved partitioning of molecular previous work6 on the study of ibuprofen con- configurational space through k-means cluster- formational isomerism at the crystal/solution ing based methodologies.14–18 Despite its ease of interface, we demonstrate how, even for rela- use, it has a few important limitations. Cluster tively small systems, conformational rearrange- centres can be difficult to define arpiori, as well ments, crystal growth and dissolution are inher- as, k-means can be very sensitive to outliers and ently coupled. Additionally state-to-state tran- noise.19,20 A partition-based algorithm which sitions of a molecule along its path of incor- has tackled the drawbacks of k-means clustering poration into the crystal from solution may be is Affinity Propagation (AP),21 where all data limited by conformational rearrangements. points are regarded as potential cluster centres. Computational studies of conformational re- The negative distance between data points is arrangement in small organic molecules often their affinity, so the bigger the sum of the affin- use internal torsional angles to describe the ity of one data point to other data points, the adopted molecular configuration.6–9 Torsional higher the probability of this data point being angles are a convenient way of describing re- a cluster centre is. AP has been implemented arrangements as they provide a fine-grained in the study of protein conformations,22 how- comprehensive picture of the internal molecu- ever it has also been regarded and complex and lar configuration space. To describe the con- costly approach.19,23 formation of larger molecules such as peptides As an alternative to distance-based algo- or aliphatic chains, however, resorting to de- rithms, density-based algorithms have been de- scriptors such as end-to-end distance or Root veloped.20,24–28 They work on the principle of Mean Square Deviation (RMSD)1,10 is a com- assigning densities to local points and are able mon choice, made necessary by the fact that to separate clusters based on high- and low- the torsional angle space for these systems is density regions. Such algorithms not only do 2 not require defining the number of cluster cen- extract quantitative information on conforma- tres apiori, but also allow the identification of tional states from enhanced sampling molecu- non-spherical clusters. Particularly suited to lar dynamics simulations performed in solution the analysis of molecular dynamics (MD) sim- under experimentally relevant conditions. Our ulations is the Fast Search and Find of Den- methodology also enables the breakdown of the sity Peaks (FSFDP) clustering algorithm, de- free energy of a conformational state into en- veloped by Rodriguez and Laio.26 By comput- thalpic and entropic contributions, providing a ing the distance between all pairs of data points, valuable insight into the effect of the solvent the algorithm identifies the points with highest on conformational isomerism. With this work density in their neighbourhood as the cluster we aim to propose a method for conformational centroids. analysis which provides a route towards achiev- Such tools, which allow the systematic clas- ing rapid and automated conformational clas- sification of molecular configurations regardless sification, enabling the comprehensive study of of the dimensionality of the space of descriptors conformational isomerism in solution for sys- necessary to completely capture every confor- tems for which it is currently impractical. mational change, have the potential to improve existing methodologies for studying the effects of conformational rearrangements during crys- Methods tal nucleation and growth.6 In this work, molecular dynamics (MD) sim- Here, we propose a methodology which en- ulations are used to study the conformational ables the study of conformational isomerism in isomerism of sildenafil in the crystal bulk, in a general way for systems with a large number the gas phase and in three solvents - acetoni- of torsional degrees of freedom. Our approach, trile, 1-butanol and toluene. MD is combined based on the application of the Fast Search and with well-tempered metadynamics (WTmetaD) Find of Density Peaks clustering algorithm,26 to enable the study of the conformational rear- allows to define a set of conformational states rangements of sildenafil in solution. The Fast that is common to multiple environments (i.e. Search and Find of Density Peaks (FSFDP) solvents), and enables a systematic assessment clustering method, developed by Rodriguez and of their impact on the conformational land- Laio,26 is used to identify sildenafil conformers scape. We demonstrate this approach by study- in the gas phase and generate a characteristic ing the conformational rearrangement of silde- fingerprint in torsional angle space for each of nafil, a commercially available active pharma- them. The metric used to define the similarity ceutical ingredient (API).7 between configurations includes information on Sildenafil is the main component of Via- the free energy cost associated to transitions in gra,7 which is known to have two polymor- every degree of freedom explicitly considered. phic forms.29,30 Sildenafil is a relatively large The fingerprints are used to post-process the molecule consisting of 63 atoms and