Tutorial IV: Sampling Conformational Spaces Manuscript for Exercise Problems

Hands-on workshop and Humboldt-Kolleg: Density-Functional Theory and Beyond - Basic Principles and Modern Insights Isfahan, Iran, May 2 – May 13, 2016 Energy coordinates Tutorial IV: Sampling conformational spaces Manuscript for Exercise Problems Prepared by Adriana Supady and Carsten Baldauf Isfahan University of Technology Isfahan, May 6, 2016 Introduction This tutorial aims to familiarize you with the basic concepts of searches in molecular structure space. The practice session consists of three parts: Part A: Genetic algorithm search Problem I: Starting the first run Problem II: Analysis of the results Part B: Pool of structures Problem III: Looking for duplicates Problem IV: Descriptive coordinates Problem V: Looking for duplicates - internal coordinates Part C: And beyond Problem VI: Parameters of the GA search Problem VII: Alternative search techniques Please start with Problem I and launch your genetic algorithm (GA) based search for alanine dipeptide. While waiting for the results, please proceed with Part B. Once your GA run is completed, you can work on Problem II. Part C is optional. In the directory $HandsOn/tutorial_4/, you can find all the files necessary for this tutorial. Please copy the contents of the skel/ folder into your own working directory. Dedicated folders have been prepared in the skel/ directory for each problem. Practical notes • There are more than 100 chemical file formats and some examples are presented in the Appendix in Figure9. Openbabel 1 is a program that can read and convert files of different formats. E.g. if you want to convert a geometry.in file to a XYZ-file, use: obabel -i fhiaims geometry.in -o xyz -O geometry.xyz where -i <format> specifies the format of the input -o <format> specifies the format of the output -O <filename> specifies the name of the output file • You can use Jmol for visualizing 3D structures. Jmol recognizes a number of different chemical formats. You can measure bond lengths, bond angles and dihedral angles with Jmol: just double- click on the first atom and click the remaining ones. • The atom ordering in a file matters. Most of the scripts/programs rely on the fact that the atoms are ordered in a consistent matter, especially when comparing structures! • The database Berlin ab-initio amino acid DB 2 provides structural data of 20 amino acids, bare and in cation-complexes [1]. You can use the data stored there for benchmarking! 1 https://github.com/openbabel/openbabel 2 http://aminoaciddb.rz-berlin.mpg.de/ 2 Investigating molecular structures Representing a chemical compound Small organic molecules are often highly flexible and may adapt different 3D structures that differ in properties and energy. For methods like protein-ligand docking or catalyst design it is important to know the low-energy conformers of the molecule that build its molecular ensemble. To this end, the molecular conformational space needs to be sampled efficiently so that all relevant low-energy conformers are found. Figure1 depicts popular chemical representations of alanine dipeptide. The chemical formula stores only the composition of the compound. The simplified molecular-input line-entry system (SMILES) [2] string is a convenient representation as it allows for encoding the connectivity, the bond order and the stereochemical information in a one-line notation. It should be noted, that a number of valid SMILES codes can be constructed for the same compound. The great advantage of the SMILES strings is the fact that they are intuitive and can be easily read and written. 1D Chemical formula C6H12N2O2 SMILES CC(=O)N[C@H](C(=O)NC)C InChI 1S/C6H12N2O2/c1-4(6(10)7-3)8-5(2)9/h4H,1-3H3,(H,7,10)(H,8,9)/t4-/m0/s1 version, chemical atom connections hydrogen atoms stereochemical standard formula layer 2D 3D H CH3 O bond bond angle N Cα H3C N CH3 dihedral O H angle Figure 1: Alternative chemical representations on the example of alanine dipeptide. SMILES codes can be used to generate a schematic, 2D representation of a molecule. Finally, the last missing piece of information, namely the spatial arrangement of atoms, is revealed in a 3D representation of a molecule. Two types of coordinates are commonly employed to represent a molecular 3D structure: Cartesian and internal coordinates. In Cartesian coordinates, each atom is represented as a point in 3D space. Cartesian coordinates are universal, intuitive and always relative to the origin of the coordinate system. The simplest internal coordinates are based on the ’Z-matrix coordinates’ i.e. include bond lengths, bond angles as well as dihedral angles (torsions) (Figure1). The main advantage of the internal coordinates is that they are orientation- and location-invariant, i.e. they remain unchanged upon translation and rotation in 3D space. In contrast, the dihedral angles are in most cases the only relevant degrees of freedom (DOF). Bond lengths and bond angles have usually only one minimum, i.e. the energy will increase rapidly if these parameters adopt non-optimal values. On the contrary, there is no single minimum for the value of a dihedral angle and in most cases, diverse values can be adopted. The adopted values depend on the neighboring atoms/functional groups and on the steric interactions within the conformation. For the purpose of global structure search, only single, non-ring bonds between non-terminal atoms are considered as fully rotatable bonds after excluding bonds that are attached to methyl groups that carry three identical substituents. Further, the cis/trans nomenclature can be utilized to describe the relative orientation of functional groups within a molecule. In cases in which the functional groups are oriented in the same direction we refer to it as cis, whereas, when the groups are oriented in opposite directions, we refer to it as trans. The full representation of a 3D structure is the list of its Cartesian coordinates. An alternative way to store 3D structures is to use a reduced representation that contains the SMILES and DOFs with the corresponding values. The difference between these two alternative representations is illustrated in Figure2. The substantial advantage of the reduced representation is the fact that for a specified chemical compound, the only stored data are simultaneously the DOFs for the optimization. This is 3 C 7.77629 -0.99492 -2.45999 C 6.70435 -0.55761 -1.49837 full O 6.43080 0.61938 -1.29985 reduced N 6.05315 -1.59489 -0.87545 C 4.97379 -1.36011 0.08672 H 4.45919 -0.42867 -0.17831 C 5.54586 -1.18095 1.51158 SMILES: CC(=O)N[C@H](C(=O)NC)C 1 O 5.22975 -1.89502 2.46012 N 6.40900 -0.11162 1.63863 and C 7.00728 0.22717 2.90258 C 3.97100 -2.50525 0.02836 Significant degrees 2 H 8.45209 -0.15731 -2.65424 H 7.31692 -1.31259 -3.39960 of freedom: rotable bond 1 (C-N-C-C): 178 º H 8.36070 -1.81736 -2.03868 4 H 6.49171 -2.50679 -0.87046 rotable bond 2 (N-C-C-N): 61 º 3 H 6.52118 0.51324 0.84106 H 6.22683 0.34653 3.65919 rotable bond 3 (C-C-N-C): -88 º H 7.56699 1.15828 2.79096 H 7.68144 -0.57908 3.20522 rotable bond 4 (C-N-C-C): -180 º H 3.10880 -2.29845 0.67116 H 4.41840 -3.44833 0.36172 H 3.60690 -2.65261 -0.99416 Figure 2: The comparison of a full and reduced representation of 3D structure of the alanine dipeptide. The full representation contains all atomic Cartesian coordinates. The reduced representation consists of the SMILES string and dictionary of the rotatable bonds together with the corresponding values. extremely convenient, especially for larger systems. Nevertheless, one should keep in mind that the reduced representation stores no information about the bond lengths and bond angles assuming no substantial changes of these coordinates. Geometrical similarity of structures The quantification of the molecular similarity is a common problem that needs to be solved, e.g. in order to remove duplicates from a pool of 3D structures. The most popular approach to quantify the similarity is the root-mean-square deviation RMSD, calculated for two sets of Cartesian coordinates. Root-mean-square deviation (RMSD) Given two 3D geometries of a compound with N atoms, the formula for the RMSD is defined as follows: v u N u 1 X RMSD = t d2 (1) N i i=1 where di is the distance between the corresponding atoms. Although fast to calculate, the RMSD value describes the similarity of two molecular conformations only after the best superposition of the geometries is identified. The most popular algorithm for finding the best alignment of two sets of coordinates is the Kabsch algorithm [3]. After translating the centroids of the two sets of coordinates to the center of the coordinate system, the Kabsch algorithm computes the optimal rotation matrix that minimizes the RMSD. Often only heavy atoms are considered in RMSD calculations. There are multiple advantages of using the Cartesian RMSD, e.g. it is a well-recognized metric, it is easy to calculate and reproduce and is available as a basic functionality in most of the modeling packages. Torsional RMSD (tRMSD) Instead of using Cartesian coordinates, the values of the significant torsional degrees of freedom, i.e. rotatable bonds, can be used. Analogically to the Cartesian RMSD, given two 3D geometries with m rotatable bonds, the formula for tRMSD reads: v u m u 1 X tRMSD = t θ2 (2) m i i=1 where θi is the angular difference between values of the corresponding dihedral angles.

Load more