<<

Identifying conformational of organic in solution via unsupervised clustering

Veselina Marinova,†,‡ Laurence Dodd,† Song-Jun Lee,† Geoffrey P. F. Wood,¶ Ivan Marziano,§ and Matteo Salvalaglio∗,†

†Thomas Young Centre and Department of Chemical Engineering, University College London, London WC1E 7JE, UK. ‡Department of Materials Science and Engineering, The University of Sheffield, Sheffield S1 3JD, UK ¶Pfizer Worldwide Research and Development, Groton Laboratories, Groton, Connecticut 06340, USA §Pfizer Worldwide Research and Development, Sandwich, Kent CT13 9NJ, UK

E-mail: [email protected]

Abstract We present a systematic approach for the identification of statistically relevant conforma- tional macrostates of organic molecules from molecular dynamics trajectories. The approach applies to molecules characterised by an arbitrary number of torsional degrees of freedom and enables the transferability of the macrostates definition across different environments. We formulate a dissimilarity measure between molecular configurations that incorporates informa- tion on the characteristic energetic cost associated with transitions along all relevant torsional degrees of freedom. Such metric is employed to perform unsupervised clustering of molecu- lar configurations based on the fast search and find of density peaks algorithm. We apply this method to investigate the equilibrium conformational ensemble of Sildenafil, a conformationally complex pharmaceutical compound, in different environments including the crystal bulk, the gas phase and three different solvents (acetonitrile, 1-butanol, and toluene). We demonstrate that, while Sildenafil can adopt more than one hundred metastable conformational configura- tions, only 12 are significantly populated across all the environments investigated. Despite the complexity of the conformational space, we find that the most abundant conformers in solution are the closest to the conformers found in the most common Sildenafil crystal phase.

1 Introduction high-dimensional and impractical to read and interpret. A critical drawback of this approach Conformational isomerism in organic molecules is that, by reducing the dimensionality of the is an important characteristic which bears sig- descriptors’ space used to represent configu- nificance in a variety of problems. For exam- rations, degeneracy is introduced, and conse- ple, binding properties of in - quently information lost. More generally, re- ligand complexes are controlled by their con- liable conformational descriptors are particu- formational configuration by affecting associa- larly important when implementing enhanced tion/dissociation rates, and by entropic contri- sampling techniques. Enhanced sampling tech- butions to the process.1,2 Understanding the de- niques are heavily reliant on the use of ap- tails and mechanisms of conformational changes propriate system descriptors. Particularly for that proteins undergo is an important part studying self-assembly processes, conformation- of modern drug discovery methodologies.3 For ally flexible systems currently present a major small organic molecules the ability to adopt dif- challenge,11 thereby driving the search for a sys- ferent conformational configurations can open tematic approach to their classification. the possibility for the formation of multiple Dividing the conformational space of large or- crystal forms known as conformational poly- ganic molecules is often done via partitional morphs4,5 - crystal structures of components clustering methods, meaning that the data is with the same chemical formula but different assigned into groups without any hierarchical molecular shape. This phenomenon is particu- structure, based on a chosen criterion.12 One larly important in the pharmaceutical industry of the best known partitioning algorithms is k- where the uncontrolled occurrence of an unde- means.13 The idea behind this algorithm is to sired polymorphic form can affect the stability, define a k-centroid for each cluster and mea- shelf-life or efficacy of the drug. In the field of sure the distance between a data point and each crystallisation, conformational rearrangements of the cluster centres. Many computational are not only relevant to polymorphism. In our works have achieved partitioning of molecular previous work6 on the study of ibuprofen con- configurational space through k-means cluster- formational isomerism at the crystal/solution ing based methodologies.14–18 Despite its ease of interface, we demonstrate how, even for rela- use, it has a few important limitations. Cluster tively small systems, conformational rearrange- centres can be difficult to define arpiori, as well ments, crystal growth and dissolution are inher- as, k-means can be very sensitive to outliers and ently coupled. Additionally state-to-state tran- noise.19,20 A partition-based algorithm which sitions of a along its path of incor- has tackled the drawbacks of k-means clustering poration into the crystal from solution may be is Affinity Propagation (AP),21 where all data limited by conformational rearrangements. points are regarded as potential cluster centres. Computational studies of conformational re- The negative distance between data points is arrangement in small organic molecules often their affinity, so the bigger the sum of the affin- use internal torsional angles to describe the ity of one data point to other data points, the adopted molecular configuration.6–9 Torsional higher the probability of this data point being angles are a convenient way of describing re- a cluster centre is. AP has been implemented arrangements as they provide a fine-grained in the study of protein conformations,22 how- comprehensive picture of the internal molecu- ever it has also been regarded and complex and lar configuration space. To describe the con- costly approach.19,23 formation of larger molecules such as peptides As an alternative to distance-based algo- or aliphatic chains, however, resorting to de- rithms, density-based algorithms have been de- scriptors such as end-to-end distance or Root veloped.20,24–28 They work on the principle of Mean Square Deviation (RMSD)1,10 is a com- assigning densities to local points and are able mon choice, made necessary by the fact that to separate clusters based on high- and low- the torsional angle space for these systems is density regions. Such algorithms not only do

2 not require defining the number of cluster cen- extract quantitative information on conforma- tres apiori, but also allow the identification of tional states from enhanced sampling molecu- non-spherical clusters. Particularly suited to lar dynamics simulations performed in solution the analysis of molecular dynamics (MD) sim- under experimentally relevant conditions. Our ulations is the Fast Search and Find of Den- methodology also enables the breakdown of the sity Peaks (FSFDP) clustering algorithm, de- free energy of a conformational state into en- veloped by Rodriguez and Laio.26 By comput- thalpic and entropic contributions, providing a ing the distance between all pairs of data points, valuable insight into the effect of the solvent the algorithm identifies the points with highest on conformational isomerism. With this work density in their neighbourhood as the cluster we aim to propose a method for conformational centroids. analysis which provides a route towards achiev- Such tools, which allow the systematic clas- ing rapid and automated conformational clas- sification of molecular configurations regardless sification, enabling the comprehensive study of of the dimensionality of the space of descriptors conformational isomerism in solution for sys- necessary to completely capture every confor- tems for which it is currently impractical. mational change, have the potential to improve existing methodologies for studying the effects of conformational rearrangements during crys- Methods tal nucleation and growth.6 In this work, molecular dynamics (MD) sim- Here, we propose a methodology which en- ulations are used to study the conformational ables the study of conformational isomerism in isomerism of sildenafil in the crystal bulk, in a general way for systems with a large number the gas phase and in three solvents - acetoni- of torsional degrees of freedom. Our approach, trile, 1-butanol and toluene. MD is combined based on the application of the Fast Search and with well-tempered metadynamics (WTmetaD) Find of Density Peaks clustering algorithm,26 to enable the study of the conformational rear- allows to define a set of conformational states rangements of sildenafil in solution. The Fast that is common to multiple environments (i.e. Search and Find of Density Peaks (FSFDP) solvents), and enables a systematic assessment clustering method, developed by Rodriguez and of their impact on the conformational land- Laio,26 is used to identify sildenafil conformers scape. We demonstrate this approach by study- in the gas phase and generate a characteristic ing the conformational rearrangement of silde- fingerprint in torsional angle space for each of nafil, a commercially available active pharma- them. The metric used to define the similarity ceutical ingredient (API).7 between configurations includes information on Sildenafil is the main component of Via- the free energy cost associated to transitions in gra,7 which is known to have two polymor- every degree of freedom explicitly considered. phic forms.29,30 Sildenafil is a relatively large The fingerprints are used to post-process the molecule consisting of 63 and a number biased trajectories in solution and assign a con- of ring structures. The two forms of sildenafil formational macrostate for each trajectory step. (denoted form 1 and form 2) are morphotrop- Through the implementation of a reweighting ically related to one another as a noncrystal- procedure, the equilibrium probability of con- lographic rearrangement can transform one to formers, as well as enthalpic and entropic con- the other,30 with form I being the thermody- tributions to the free energy of each conformer namically stable form. Both forms have two for each of the solvents considered is obtained. molecules in the asymmetric unit adopting dif- In this study, the molecular rearrangement of ferent conformations.29,30 the drug is described in torsional angle space With the use of a data clustering approach by considering all internal dihedral angles as we demonstrate how characteristic conforma- shown in Figure 1. By including all torsional tional configurations can be identified a priori degrees of freedom of the molecule, this study for a molecule in the gas phase in order to then

3 adopts a systematic and transferable strategy barostat.39 More detail on the applied pres- for tackling conformational rearrangement. sure control and the recovered system density in each case can be found in Section A of the Supplementary Material.

Simulations of the Crystal Bulk Super- cells of size 3.5 × 3.5 × 5.0 nm and 7.0 × 3.5 × 2.3 nm, representing crystal forms I and II re- spectively, were set up containing 96 molecules each. Crystal structure .cif files29 at ambi- ent temperature and pressure were obtained from the CSD under deposition codes QEG- TUT and QEGTUT02. In both cases unbiased MD simulations were performed for 10 ns in the isothermal-isobaric ensemble, implemented by applying an anisotropic pressure control.

Figure 1: Sildenafil structure, where hydrogen Simulation of Sildenafil in the Gas Phase atoms have been excluded for simplicity. The A 450 ns unbiased simulation of a sildenafil six internal torsional angles, labelled τA, τB, τC , molecule in a box of 2.0 × 2.0 × 2.0 nm in τD, τE and τF , are marked on the structure. All vacuum was carried out in the canonical ensem- images of molecular structures shown have been 31 ble. A free energy profile of each torsional angle generated with VMD. as denoted in Figure 1 was obtained. All one- dimensional free energy profiles are reported in Section B of the Supporting Information. Molecular Dynamics Setup Simulations of Sildenafil in Solution Sim- Molecular dynamics simulations of form I and ulations in solution were set up by solvating form II crystal polymorphs of sildenafil, a silde- a single sildenafil molecule with each of the nafil molecule in the gas phase and a silde- three solvents used in this study - acetoni- nafil molecule in three different solvents were trile, 1-butanol and toluene - by using the performed using the Generalised Amber Force insert-molecules utility in Gromacs in a box Field (GAFF).32 For all systems considered in of approximate size of 4 × 4 × 4 nm. MD sim- this work GAFF is able to reproduce proper- ulations were performed in combination with ties consistent with experimental data (Section well-tempered metadynamics. All simulations A of the Supplementary Material). MD sim- were performed in the isothermal-isobaric en- ulations were performed with Gromacs 5.1.433 semble (NPT) at pressure of 1 bar and temper- with an explicit representation of the solvent. ature of 300 K. Force field parameters for solvent molecules were obtained from the Virtual sol- vent database.34,35 A standard cut-off distance Well-Tempered Metadynamics of 1.0 nm for the non-bonded interactions was Setup chosen, along with including long-range inter- molecular interactions using the particle-mesh Metadynamics was implemented in order to en- Ewald (PME) approach.36 A time step of 2 hance fluctuations in the internal rearrange- fs was used. Temperature and pressure con- ment of sildenafil in solution for computational trol have been implemented through the use efficiency. The bias was applied as a function of of the Bussi-Donadio-Parrinello thermostat,37 the τA torsional angle as shown in Figure 1 in Berendsen barostat38 and Parrinello-Rahman blue. The choice of the CV was made based on

4 10 ns exploratory MD simulations, which re- sity of each point i, evaluated by considering the vealed that overcoming the barrier associated number of neighbours within a distance cut-off with the rotation of τA in an efficient way re- dc according to: quires enhanced sampling in all three solvents. X The biasing protocol was applied in the form of ρi = χ(dij − dc) (2) Gaussian functions with a width of 0.3 rad and j height of 2.5 kBT at a rate of every 500 simula- where χ(x) = 1 if x < 0 and χ(x) = 0 other- tion steps with a bias factor of 15 K. The use wise. of WTmetaD was implemented through plumed 2.4.40 All the data and input files, required δi = min dij (3) to reproduce the results reported in this pa- j:pj >pi per, are available on plumed-NEST,41 the pub- Cluster centres are characterised by having lic repository of the PLUMED consortium as the highest density within a cluster of points plumID:20.032. and a large distance from points with a higher density as shown in Eq. 3. Each cluster has an Clustering Procedure associated core set of data points and a halo, evaluated based on the given cut-off. The core Studying the conformational isomerism of sys- set are the points which belong exclusively to tems with several internal degrees of freedom a cluster and can be found within the selected in a systematic and transferable way requires distance cut-off from the cluster centre, while the need of grouping molecular configurations the halo is considered as the noise around the and identifying relevant conformational states. cluster core, while still affiliated with the given Achieving this provides the opportunity of re- cluster. This grouping algorithm is particularly ducing the dimensionality of the problem and powerful as the clustering procedure is such enables the analysis of biased molecular dynam- that the number of clusters arises intuitively, ics trajectories in order to obtain useful kinetic which makes it particularly suited to identi- and thermodynamic information. fying characteristic conformational configura- To achieve this partitioning of configurational tions, described by a highly multidimensional space in an unsupervised data-driven manner, set of CVs. a recently developed data clustering method is Here, we consider each frame of an MD tra- applied. In this work, the Fast Search and jectory of a sildenafil molecule in the gas phase Find of Density Peaks (FSFDP) algorithm, as a single point in the 6-dimensional torsional 26 developed by Rodriguez and Laio, is imple- angle space and so the distance dij between ev- mented. The algorithm can be used to group ery two frames i and j, is calculated according molecular configurations into clusters of struc- to Eq.4: tures based on their similarity by calculating v a distance matrix between configurations. The uNCV uX 2 distance matrix refers to a matrix containing dij = t (dnwn) (4) the distance between any two molecular config- n=1 urations i and j as a function of the chosen sys- where NCV refers to the total number of sys- tem descriptors e.g. internal torsional angles, tem descriptors, which in this case is 6 torsional with no limit on the dimensionality, as shown angles, while w stands for a weight applied to in Eq. 1. each dimension as described below. v In the above equation dn is refers to the abso- uNCV uX lute difference in the values of any of the given d = t d2 (1) ij n torsional angles for points i and j. For example, n=1 d1 is calculated as: Once the distance matrix has been calculated, the algorithm operates by calculating the den- d1 = |τAi − τAj| (5)

5 Should the value of dn be greater than π, peri- ing the lowest rotational barrier for each tor- odicity is accounted for by subtracting the value sional angle according to the calculated one- from 2pi as follows: dimensional free energy profiles (see Section B of the SI) along each dimension of the CV space. dn > π : dn = 2π − dn (6) The barriers are then normalised with respect to the lowest one as shown in Table 1. An ex- Clustering structures in a meaningful way ample for the case of τ is shown in Figure 2. requires a distance matrix calculation which A is able to distinguish between conformational transitions and conformational adjustments. These terms refer to the nature of the confor- mational rearrangement within the molecule. A conformational transition describes the conver- sion of one stable conformational state into an- other, usually associated with overcoming a free energy barrier higher than kBT. An adjustment is, on the other hand, a term used to describe a minor rearrangement which is not associated with a new conformer, but rather a relaxation of the structure from one configuration to an- other, both of which occupy the same free en- ergy minimum in collective variable space.5 In Figure 2: Free energy profile of torsional angle order to resolve these two cases when calcu- τA obtained from a vacuum simulation, along lating the distance matrix, a slight modifica- with Newman projections of the angle in the tion to the algorithm is applied by including a free energy minima. The lowest free energy bar- weight associated with the free energy barrier rier of rotation was extracted for the clustering of rotation, w, of each torsional angle in order weights. to scale the distance d. In such a way, rota- tions which lead to new conformational config- We note that in the case of larger, more com- urations through rare, activated transition, will plex solutes with a higher number of confor- have a higher impact on the distance compared mational degrees of freedom one may need to to those associated with an adjustment or a fast analyse large datasets, and in such cases the conversion. calculation of the distance matrix may become an efficiency bottleneck for the process. Re- Table 1: Free energy barrier of rotation to each cent work27 addresses this problem pushing the torsional angle obtained from simulations in limit of applicability of FSFDP to the million- vacuum, along with the corresponding rescal- configurations range. ing used as a weight in calculating the distance matrix. Conformational Classification Angle Barrier Height Weight Clustering of the molecular configurations was [kBT ] performed using the FSFDP algorithm, with a τA 6 5 distance cut-off dc tuned to obtain an average τB 1.5 1.3 number of neighbours for every datapoint equal τC 2 1.7 to 2% of the entire dataset. Based on the dis- τD 1.2 1 tance matrix calculation and the chosen cut-off, τE 2.5 2.1 the sildenafil configurations sampled in the gas τF 4 3.3 phase are grouped into clusters, each consisting of a cluster centre structure, a core set of config- The weight factor, w, is obtained by record- urations and a halo. All configurations assigned

6 to a cluster were used to generate a structural to be accounted for in a procedure referred to fingerprint for the identified conformer. The as reweighting. term fingerprint refers to the probability distri- In this work, the total metadynamics bias po- bution of each torsional angle as illustrated in tential applied as a function of τA and recov- Figure 1. All fingerprints are provided in Sec- ered at the end of the simulation of a silde- total tion D of the Supporting Information. nafil molecule in solution, V (τA), is used Conformational classification of each frame of in the reweighing scheme.42 The trajectory is the trajectories of sildenafil in solution was car- post-processed so that each time frame, with a ried out in order to represent conformational corresponding value for each torsional angle, as change of the molecule in a one-dimensional well a conformational cluster number, will also space and hence enable further analysis of the have associated with the total bias de- enthalpic and entropic contributions to confor- posited in that particular point in the CV space mational isomerism in solution. To achieve of τA. For simplicity, let us refer to this value total that, an algorithm which compares the instan- as Vi (τA) where i stands for the trajectory taneous value of the torsional angles, defined to frame number. The weight Wi applied to each describe the molecular configuration, to each of frame when reconstructing the unbiased proba- the given fingerprints for every trajectory frame bility distribution of conformational isomers in was set up. For every instantaneous torsional solution is a Boltzmann weight associated with total angle value where the corresponding probabil- a rescaled value of Vi (τA) according to: ity density in the fingerprint is nonzero a value β(V total−max(V total)) of 1 is assigned. The total number of variables Wi = e i i (7) used for the classification is 6 and therefore a In such a way, the weight associated with score of 6 means that the molecular configu- points in CV space where the maximum total ration in the given frame matches a fingerprint external bias was deposited will have a value of and therefore is assigned the corresponding con- 1, as it corresponds to the lowest point in the former number. A score lower than 6 indicates free energy profile in τ and all other frames at least one mismatch between the given con- A will have a correspondingly lower weight. This figuration and the fingerprint and is therefore reweighting scheme was implemented to recon- assigned a value of 0 signifying that it remains struct the population of sildenafil conformers unclassified. in different solvents, as well as obtaining a two- dimensional probability density function of the Conformational Equilibrium Prob- conformer number and its associated potential ability Distribution energy, used in the free energy decomposition discussed in the next chapter. A characteristic fingerprint in torsional angle space for each dominating sildenafil conformer in the gas phase was generated. Choosing the Enthalpy and Entropy Contribu- gas phase as a reference is inspired by the work tions to Free Energy Differences of Cruz-Cabeza and Bernstein,5 who use the between clusters same conditions to define reference conforma- tional states. To calculate the conformational Free energy differences between conformational population of sildenafil in solution, each frame macrostates i and j of sildenafil in solution of the biased trajectory was assigned a charac- were computed from their equilibrium proba- Pj bility Pi and Pj as ∆Gi,j = −kBT ln and teristic conformer following the procedure dis- Pi cussed in the previous section. A discrete prob- decomposed into their enthalpic and entropic ability distribution in one-dimensional space contributions following the procedure outlined 43,44 can then be straightforwardly calculated, with by Gimondi et. al. For instance, the dif- the caveat that the bias potential deposited ference in free energy between clusters i and j, throughout the duration of the simulation needs ∆Gi,j can be expressed as ∆Hi,j−T ∆Si,j, where

7 ∆Hi,j and ∆Si,j are the enthalpy and entropy solvent contribution to the potential energy the solvent solvent differences between states i and j. Hence the term hEP ii = hEP ij = const, and entropic contribution can obtained by differ- thus ∆Ui,j reduces to: ence as T ∆Si,j = ∆Gi,j − ∆Hi,j, once the term solute solute  ∆Hi,j is known. Since conformational transi- ∆Ui,j = hEP ij − hEP ii + solute−solvent solute−solvent  tions of sildenafil are not associated to a change + hEP ij − hEP ii (10) in the ensemble average of the system’s volume, ∆Hi,j = ∆Ui,j +P ∆Vi,j ' ∆Ui,j, where ∆Ui,j is which is not affected by the large fluctuations solvent the difference in internal energy between confor- of hEP ii that would mask the contribu- mational macrostates i and j. Moreover, since tion of conformational transitions to ∆Ui,j, and states i and j are conformational isomers sam- hamper the convergence of the enthalpy and pled at the same temperature, the internal en- entropy contribution to free energy differences. ergy difference reduces to the difference in the Despite implementing this strategy the conver- ensemble average of the potential energy be- gence of the enthalpic and entropic contribu- tween states i and j as: tions require a substantial sampling of the con- figuration space of the explicitly solvated sys- ∆Ui,j = hEP ij − hEP ii (8) tem. Here we achieve sufficient sampling by running WTmetaD simulations for 0.5 to 0.6 where hEP ii, the ensemble average of the poten- µs. In the case of larger, more complex solutes tial energy of configurations classified in cluster with a higher number of conformational degrees i, is computed as: of freedom we anticipate the need for replica Z exchange methods to exhaustively sample the 0 hEP ii = EP (r)p(EP (r)|r ∈ i)dEP (9) configuration space and successfully apply this decomposition approach. where p(EP (r)|r ∈ i) is the potential energy probability density, conditional to the system being classified in cluster i. The probability Results density p(E (r)|r ∈ i) is computed as discussed P The following sections summarise the analysis in Gimondi et al.,43 using the reweighting strat- carried out on sildenafil conformers in the gas egy described in the previous paragraph. In phase, as well as on the conformational rear- models in which explicit solvents are employed, rangements of sildenafil occurring in the crystal the hE i is typically dominated by the con- P i bulk and in solution. The results are organised tribution of the solvent molecules, and it is as- as follows. First, the conformational freedom sociated with large fluctuations that affect the of sildenafil in the crystal bulk is reviewed by convergence and accuracy of the ∆U estimate. i,j considering the torsional angle distribution ob- In order to improve the statistical accuracy in tained from MD simulations of forms I and II. the ∆U , we follow the procedure outlined i,j Next, the results obtained from the clustering by Kollias et al.44 and decompose the ensem- algorithm are reported, along with drawing a ble average of the potential energy in three comparison between structures in the gas phase components, namely hE isolute, hE isolvent P i P i and those in the solid. The last section reports and hE isolute−solvent. The hE isolute, and P i P i on the equilibrium distribution of sildenafil con- hE isolvent contributions account respectively P i formers in different solvents, obtained with the for potential energy terms associated with inter- aid of WTmetaD, along with a breakdown of actions between atoms that belong exclusively their corresponding free energy into enthalpy to the solute and to the solvent species. The solute−solvent and entropy contributions. solute-solvent term hEP ii accounts instead for non-bonded interactions between solute and solvent atoms. Since conformational changes in the solute do not affect the solvent-

8 Figure 3: a) Probability density plots of each torsional angle for crystal conformers 1 and 2 obtained from a 10 ns MD simulation of form I sildenafil. b) Image of crystal form I conformers 1 and 2, obtained from the .cif, generated with VMD, where hydrogen atoms are removed for simplicity.

Conformational Rearrangements amount of flexibility is accessible to the rest. in the Crystal Bulk A probability distribution of each torsional an- gle for crystal conformers 1 and 2 in form I is The conformational rearrangement of sildenafil shown in Figure 3a. In the figure, a narrow and was first investigated for the case of a molecule mono-modal distribution corresponds to each of in the crystal bulk in each of the two poly- the torsional angles τA, τC and τF , shown re- morphic forms. As mentioned, each poly- spectively in blue, green and brown. A mild morph contains two conformational isomers of degree of conformational adjustment is associ- the molecule. A probability distribution of each ated with torsional angles τB (in orange) and τD torsional angle representative of the conforma- (in red) which are both distributed around ± π. tional state adopted by sildenafil in the solid These torsional angles are associated with adja- was obtained, which enables to gain insight into cent on the phenyl ring, suggesting the degree of conformational freedom available that the rearrangement is possibly related to re- in each crystal form. leaving steric hindrance. The highest degree of The results reveal that while several of the rotational freedom is observed in the case of dihedral angles of sildenafil are completely re- torsional angle τE, shown in purple, represent- strained by the crystal packing, a surprising

9 ing the rotation of the propanyl of rangement is counterintuitively lower in form the pyrazole ring (see Figure 1), displaying a II compared to form I, despite the presence multi-modal distribution. The flexibility of τE of larger structural voids in the former. The reveals a moderate degree of conformational re- experimental publication does not report mea- arrangement available to the conformer, despite surements of the degree of disorder in form II, the restrictive environment traditionally associ- however the authors speculate that the confor- ated with a crystal. This observation has been mational rearrangement of the propanyl groups addressed by Barbas et al.29 who have recorded will be dominated by a drive to keep the cavi- the presence of a dynamic disorder in the propyl ties in the structure empty, which could explain groups at room temperature in form I, which the conformational behaviour observed in the disappears at temperatures lower than 100K. MD simulation of form II. The authors justify this observation with the fact that these functional groups do not estab- Structure Clustering in the Gas lish strong intermolecular interactions with the surrounding atoms within the crystal, resulting Phase in a moderate degree of flexibility in the chain. This section reports on the results obtained Analysing the torsional angle distributions of from implementing a data clustering algorithm each of the two conformers found within the in order to group structures from an MD trajec- crystal structure of form I reveals that they are tory of sildenafil in the gas phase and identify conformational isomers, where a rotation along characteristic conformational configurations. A τA and τE can convert conformer 1 into con- trajectory of a molecule in vacuum was chosen former 2, as shown in Figure 3b. According to for this purpose as in the absence of solvent ef- the results obtained from the MD simulation fects the internal rearrangement of the molecule of form I, the internal rearrangement of crystal is unhindered and thorough sampling of all pos- conformer 2 displays an identical behaviour as sible molecular configurations can be achieved to that of conformer 1. efficiently. This allows to consider the full con- Similarly, the conformational freedom of formational space of sildenafil in the clustering sildenafil in form II was analysed through an procedure. Such approach has the potential to unbiased MD simulation. A comparison of the provide a more meaningful and robust method probability distribution of each torsional an- of identifying representative conformational iso- gle of sildenafil between form I and form II mers over methods which rely on generating reveals a similar conformational configuration conformers thorough random search and local in the two structures, with the only difference minima strategies. By considering the free en- being a marginally lower flexibility of τE in ergy profile of each torsional angle in the gas form II compared to that observed in form I. phase, as discussed in Section B of the SI, all A detailed comparison between the conform- possible combinations of structural local min- ers in each form is provided in Section C of ima of the molecule in torsional angle space is the Supporting Information. An experimental estimated to be 144. However, in reality, each comparison between the two crystal forms is local minimum comprises of an ensemble of con- provided by Barbas et al.30 who make a sim- figurations, meaning that 144 structures is a ilar observation to the one reported here, and rather conservative estimate, and in practice, stress that any differences between conformers there is a swarm of possible stable molecular in the two crystals can be classified as confor- configurations. For that reason, failing to ex- mational readjustments of the same gas phase plore the collective variable space thoroughly conformer, validating the conclusions made on encounters the risk of missing out important the basis of our MD simulations. As mentioned, structures due to the sheer number of avail- the probability distribution of torsional angle able configurations, even for systems of mod- τE, associated with the rearrangement of the erate flexibility. pronanyl group, shows that the degree of rear- A typical output of the clustering algorithm

10 Figure 4: Decision graph of the structure clustering performed on the trajectory in the gas phase. The circled structures represent the cluster centres. The figure below represents the distance matrix between distributions of conformers 1 and 2 in the crystal and the identified from the algorithm cluster centres. is a decision graph, displaying all data points as bours) and distance from the nearest point of a function of their density ρ (number of neigh- higher density δ (See Eq. 2 and 3). When ap-

11 Figure 5: a) Probability density of each torsional angle for crystal conformer 1 (dashed lines) and cluster centre 8 (thick line). The numerical distance between distribution peaks is provided in the table below. b) Comparison of crystal conformer 1 structure and the representative structure for C8 cluster group. plied, the clustering algorithm determined 12 tained by taking the most probable value for cluster centres as shown in the decision graph each angle and calculating the distance between in Figure 4. Each cluster centre corresponds points in the 6D space of all dihedrals. to the most representative conformer structure The distance matrix reveals that conformer for those belonging to a cluster (core and halo). C8 is the most similar structure to crystal con- A visual representation of the molecular shape former 1, while C2 is the most similar to con- corresponding to each cluster centre is displayed former 2. The fingerprints corresponding to C8 around the decision graph. The conformers and conformer 1 structures in torsional angle identified at this stage are labelled C1 to C12, space are visually compared in Figure 5a. The each of which has an associated characteristic two plots show that the conformational rear- fingerprint in torsional angle space as discussed rangement of C8 into crystal conformer 1 would in Section D of the Methods. All fingerprints involve readjustment in torsional angles τA, τB can be found in Section D of the SI. and τD, which display a comparatively broader Before proceeding further into the analysis of distribution in vacuum. The overlap in the dis- the conformational population of sildenafil in tributions demonstrates that the exact crystal solution, it is useful to compare the structures conformer is found in the gas phase, and it ac- of the cluster centres identified in vacuum and counts for only 3% of all configurations. Fig- those of the crystal conformers. To this aim, a ure 5b shows a comparison between the cluster distance matrix comparing crystal conformer 1 centre structure of C8 and crystal conformer and 2 found in form I and each of the 12 cluster 1 as taken from the CCDC database, prior to centre structures is generated as shown in Fig- performing MD. The overlap demonstrates the ure 4. In the plot, the colour scale corresponds readjustment in τA and τD necessary to convert to the distance in torsional angle space, where C8 into conformer 1. Torsional angle τE differs white signifies the lowest distance, i.e the most significantly according to the figure, however, similar structures. The distance matrix was ob- as discussed, it has a moderate flexibility in the

12 solid and it relaxes to a configuration closer to to 70% of the structures found in solution re- that of C8 during MD as shown in the proba- semble configurations close to those observed bility distribution in Figure 5a in purple. in the crystal, inferring that at the crystal- The second most similar to the crystal group solution interface 30% of structures will have of structures are conformers C10 and C12 which to encounter a more significant conformational relate to respectively conformers C2 and C8 via rearrangement to adopt a crystal-like configura- a rotation of torsional angle τE. tion and promote crystal nucleation or growth.

Conformational Isomers of Silde- Enthalpy and Entropy Contributions to nafil in Solution Conformational Stability In order to gain further insight into the con- Equilibrium Probability formational isomerism of sildenafil in solution, MD simulations in combination with WT- the free energy ∆G of each state is calculated, metaD were used to investigate the confor- along with a breakdown into potential energy mational rearrangement of sildenafil in three and entropy. In Figure 7 the relative free en- different solvents - acetonitrile, 1-butanol and ergy of each state with respect to structure C2 toluene. The biased MD trajectory of each sol- for each solvent can be found in blue. As ex- vent case was analysed using the characteris- pected, the free energy difference between the tic fingerprints associated with structure clus- four structures dominating the probability den- ters C1 to C12, following the procedure out- sity plot - C2, C8, C10 and C12 is less than lined in the Methods. The probability of each 1 kJ/mol for all solvents. The difference in conformer cluster was generated by account- free energy between the latter group and the ing for the deposited bias potential through the rest of the structures varies between 2.5 and reweighting procedure discussed earlier. 5.5 kJ/mol, with the exception of C6 for the The obtained probability for all solvent cases case of acetonitrile. Despite the minor vari- is shown in Figure 6. The results show that, ations in ∆G between different solvent cases 95% of configurations in solution are accounted however, the potential energy and entropy re- for, indicating that the proposed procedure of veal more significant differences induced by the identifying conformational structures via unsu- solvent. The relative potential energy difference pervised clustering is a fast and reliable way (with respect to C2) rarely exceeds 5 kJ/mol of determining conformational configurations in in the case of acetonitrile, which also trans- solution for systems of moderate to high de- lates into minor entropic contributions in most gree of conformational complexity. Examining states. The exception is once again state C6 for the distribution, small but significant variations which the free energy is dominated by configu- in the probability in different solvents are ob- rational entropy. A significantly different obser- served. These findings correlate with what was vations can be made for the cases of 1-butanol observed in in our previous work for the case and toluene, where the relative potential energy of ibuprofen.6 Structure types C2 and C8 are of states is much larger, varying between 5 and found to account for 15 to 20% of the struc- 13 kJ/mol. Equally, entropy contributions in tures in solution each. As discussed, these two these two solvents are much more substantial groups represent the conformers with a struc- than for the case of acetonitrile, indicating that ture closest to the crystal-like configuration of solute - solvent interactions are much more dy- sildenafil. Furthermore, C10 and C12, which namical. This is particularly prominent for the relate to C2 and C8 via a rotation along the case of 1-butanol and it is possibly related to fast-converting torsional angle τE, each account the inherent flexibility of its aliphatic chain. for further a 15% of conformational isomers in solution. Therefore, given the high flexibility of τE even within the crystal structures, overall 60

13 Figure 6: Probability of each conformer configuration C1-C12 along with crystal conformer 1 and crystal conformer 2 for all three solvents.

Conclusions tropy contributions. This analysis leads to new insights into the role of solvent in the definition In this paper we develop a systematic approach of the conformational landscape of an organic to partition the configuration space of flexi- molecule. It is found that, while the relative ble molecules with an arbitrary number of ro- free energy variation between states in differ- tatable bonds into conformational macrostates. ent solvents is limited, solvents cases 1-butanol The approach is based on the development of and toluene cause an increase in the entropic a distance metric between configurations that contribution to the conformational free energy. incorporates qualitative information on the en- Combining this approach with existing strate- ergetic cost associated to transitions along each gies for studying effects of conformational rear- degree of freedom, and the subsequent appli- rangements on processes can prove invaluable cation of unsupervised clustering. We apply in understanding the effect of conformational this approach to investigate the conformational isomerism in the process of crystal nucleation, landscape of sildenafil in the crystal bulk, in growth and dissolution for systems of any size the gas phase and in solution. A key aspect of and level of conformational complexity. the methodology introduced in this work is that the cluster centres are identified only once, for a reference state in the gas phase. These cluster Conflict of interest centres configurations then used to classify con- There are no conflicts to declare. figurations sampled in solution. This approach provides a self-consistent identification scheme Acknowledgements for clusters in the condensed phase. Using this methodology, 95% of structures in three differ- The authors acknowledge financial support by ent solvents are unambiguously assigned to a Pfizer. We are grateful to the UK Materials cluster, demonstrating the effectiveness of the and Hub which is par- proposed classification procedure. We demon- tially funded by EPSRC (EP/P020194/1) and strate that this classification strategy can be the UCL High Performance Computing Facili- coupled with reweighting strategies to compute ties and associated support services for compu- the free energy of conformational states and to tational resources. further decompose it into its enthalpy and en-

14 Figure 7: Relative free energy, potential energy and entropy of each cluster configuration with respect to C2 for the case of a) acetonitrile, b) 1-butanol and c) toluene.

References Wade, R. C.; Frech, M. Protein conforma- tional flexibility modulates kinetics and ther- (1) Amaral, M.; Kokh, D. B.; Bomke, J.; modynamics of drug binding. Nat. Comm. Wegener, A.; Buchstaller, H. P.; Eggen- 2017, 8, 1–14. weiler, H. M.; Matias, P.; Sirrenberg, C.;

15 (2) Ertekin, A.; Massi, F. eMagRes; John Wiley (13) Hartigan, J. A.; Wong, M. A. Algorithm AS Sons, Ltd: Chichester, UK, 2014; Vol. 3; pp 136: A K-Means Clustering Algorithm. Appl. 255–266. Stat. 1979, 28, 100.

(3) Saladino, G.; Gervasio, F. L. New Insights (14) Malmstrom, R. D.; Lee, C. T.; Van in Protein Kinase Conformational Dynamics. Wart, A. T.; Amaro, R. E. Application of Curr. Top. Med. Chem. 2012, 12, 1889–1895. molecular-dynamics based markov state mod- els to functional proteins. J. Chem. Theory (4) Bernstein, J.; Hagler, A. T. Conforma- Comput 2014, 10, 2648–2657. tional Polymorphism. The Influence of Crys- tal Structure on Molecular Conformation. J (15) Thayer, K. M.; Lakhani, B.; Beveridge, D. L. Am Chem Soc 1978, 100, 673–681. Molecular Dynamics-Markov State Model of Protein Ligand Binding and Allostery in (5) Cruz-Cabeza, A. J.; Bernstein, J. Conforma- CRIB-PDZ: Conformational Selection and In- tional polymorphism. 2014. duced Fit. J. Phys. Chem. B 2017, 121, 5509– (6) Marinova, V.; Wood, G. P.; Marziano, I.; Sal- 5514. valaglio, M. Dynamics and Thermodynamics of Ibuprofen Conformational Isomerism at the (16) Voelz, V. A.; Bowman, G. R.; Beauchamp, K.; Crystal/Solution Interface. J. Chem. Theory Pande, V. S. Molecular simulation of ab ini- Comput 2018, 14, 6484–6494. tio protein folding for a millisecond folder NTL9(1-39). J. Am. Chem. Soc. 2010, 132, (7) Goldstein, I.; Burnett, A. L.; Rosen, R. C.; 1526–1528. Park, P. W.; Stecher, V. J. The Serendipi- tous Story of Sildenafil: An Unexpected Oral (17) Wang, Y.; Makowski, L. Fine structure of Therapy for Erectile Dysfunction. 2019. conformational ensembles in adenylate kinase. Proteins 2018, 86, 332–343. (8) Weise, C. F.; Weisshaar, J. C. Conformational analysis of alanine dipeptide from dipolar cou- (18) Sun, L. W.; Li, H.; Zhang, X. Q.; Gao, H. B.; plings in a water-based liquid crystal. J. Phys. Luo, M. B. Identifying Conformation States Chem. B 2003, 107, 3265–3277. of Polymer through Unsupervised Machine Learning. Chinese J Polym Sci 2020, 38, (9) Lucaioli, P.; Nauha, E.; Gimondi, I.; 1403–1408. Price, L. S.; Guo, R.; Iuzzolino, L.; Singh, I.; Salvalaglio, M.; Price, S. L.; Blagden, N. (19) Xu, D.; Tian, Y. A Comprehensive Survey of Serendipitous isolation of a disappearing con- Clustering Algorithms. Ann. Data. Sci 2015, formational polymorph of succinic acid chal- 2, 165–193. lenges computational polymorph prediction. CrystEngComm 2018, 20, 3971–3977. (20) Sittel, F.; Stock, G. Robust Density-Based Clustering To Identify Metastable Conforma- (10) Granata, D.; Camilloni, C.; Vendruscolo, M.; tional States of Proteins. J Chem Theory Laio, A. Characterization of the free-energy Comput 2016, 12, 2426–2435. landscapes of proteins by NMR-guided meta- dynamics. Proc. Nat. Acad. Sci. 2013, 110, (21) Frey, B. J.; Dueck, D. Clustering by Passing 6817–6822. Messages Between Data Points.

(11) Elts, E.; Greiner, M.; Briesen, H. In Silico (22) North, B.; Lehmann, A.; Dunbrack, R. L. A Prediction of Growth and Dissolution Rates new clustering of antibody CDR loop confor- for Organic Molecular Crystals: A Multiscale mations. J. Mol. Biol. 2011, 406, 228–256. Approach. Crystals 2017, 7, 288. (23) Vlasblom, J.; Wodak, S. J. Markov clustering (12) Saxena, A.; Prasad, M.; Gupta, A.; Bhar- versus affinity propagation for the partition- ill, N.; Patel, O. P.; Tiwari, A.; Er, M. J.; ing of protein interaction graphs. BMC Bioin- Ding, W.; Lin, C. T. A review of clustering formatics 2009, 10 . techniques and developments. Neurocomput- ing 2017, 267, 664–681.

16 (24) Cheng, Y. Mean Shift, Mode Seeking, and Density, Enthalpy of Vaporization, Heat Ca- Clustering. IEEE Trans. Pattern Anal. Mach. pacities, Surface Tension, Isothermal Com- Intell. 1995, 17, 790–799. pressibility, Volumetric Expansion Coeffi- cient, and Dielectric Constant. J. Chem. The- (25) Fukunaga, K.; Hostetler, L. D. The Esti- ory Comput 2012, 8, 61–74. mation of the Gradient of a Density Func- tion, with Applications in Pattern Recogni- (35) van der Spoel, D.; van Maaren, P. J.; Cale- tion. IEEE Trans Inf Theory 1975, 21, 32–40. man, C. GROMACS molecule and liquid database. Bioinformatics 2012, 28, 752–753. (26) Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science (36) Darden, T.; York, D.; Pedersen, L. Particle 2014, 344, 1492–1496. mesh Ewald: An N log( N) method for Ewald sums in large systems. J. Chem. Phys. 1993, (27) Liu, S.; Zhu, L.; Sheong, F. K.; Wang, W.; 98, 10089–10092. Huang, X. Adaptive partitioning by local density-peaks: An efficient density-based (37) Bussi, G.; Donadio, D.; Parrinello, M. Canon- clustering algorithm for analyzing molecu- ical sampling through velocity rescaling. J. lar dynamics trajectories. J. Comput. Chem. Chem. Phys. 2007, 126, 014101. 2017, 38, 152–160. (38) Berendsen, H. J. C.; Postma, J. P. M.; van (28) Wang, G.; Bu, C.; Luo, Y. Modified FDP clus- Gunsteren, W. F.; DiNola, A.; Haak, J. R. ter algorithm and its application in protein Molecular dynamics with coupling to an ex- conformation clustering analysis. Digit Signal ternal bath. J. Chem. Phys. 1984, 81, 3684– Process 2019, 92, 97–108. 3690.

(29) Barbas, R.; Font-Bardia, M.; Prohens, R. (39) Parrinello, M.; Rahman, A. Polymorphic Polymorphism of Sildenafil: A New transitions in single crystals: A new molec- Metastable Desolvate. Cryst. Growth Des. ular dynamics method. J. App. Phys. 1981, 2018, 18, 3740–3746. 52, 7182–7190.

(30) Barbas, R.; Font-Bardia, M.; Paradkar, A.; (40) Tribello, G. A.; Bonomi, M.; Branduardi, D.; Hunter, C. A.; Prohens, R. Combined Vir- Camilloni, C.; Bussi, G. PLUMED 2: New tual/Experimental Multicomponent Solid feathers for an old bird. Comput. Phys. Com- Forms Screening of Sildenafil: New Salts, mun. 2014, 185, 604–613. Cocrystals, and Hybrid Salt-Cocrystals. Cryst. Growth Des. 2018, 18, 7618–7627. (41) Bonomi, M. et al. Promoting transparency and reproducibility in enhanced molecular (31) Humphrey, W.; Dalke, A.; Schulten, K. VMD: simulations. 2019; https://nomad-coe.eu. Visual molecular dynamics. J. Mol. Graph. 1996, 14, 33–38. (42) Branduardi, D.; Bussi, G.; Parrinello, M. Metadynamics with adaptive gaussians. J. (32) Wang, J.; Wolf, R. M.; Caldwell, J. W.; Koll- Chem. Theory Comput 2012, 8, 2247–2254. man, P. A.; Case, D. A. Development and testing of a general amber force field. J. Com- (43) Gimondi, I.; Tribello, G. A.; Salvalaglio, M. put. Chem. 2004, 25, 1157–1174. Building maps in collective variable space. J. Chem. Phys. 2018, 149 . (33) Van Der Spoel, D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. E.; Berendsen, H. (44) Kollias, L.; Cantu, D. C.; Glezakou, V.-A.; J. C. GROMACS: Fast, flexible, and free. J. Rousseau, R.; Salvalaglio, M. On the Role Comput. Chem. 2005, 26, 1701–1718. of Enthalpic and Entropic Contributions to the Conformational Free Energy Landscape of (34) Caleman, C.; van Maaren, P. J.; Hong, M.; MIL-101 (Cr) Secondary Building Units. Adv. Hub, J. S.; Costa, L. T.; van der Spoel, D. Theory Simul. 2020, 3, 2000092. Force Field Benchmark of Organic Liquids:

17