Evolutionary Models of Amino Acid Substitutions Based on The

bioRxiv preprint doi: https://doi.org/10.1101/468173; this version posted July 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Evolutionary Models of Amino Acid Substitutions Based on the Tertiary Structure of their Neighborhoods

Elias Primetis^1,2, Spyridon Chavlis^2, and Pavlos Pavlidis^3

1 University of Crete, Department of Biology
2 Foundation for Research and Technology, Hellas, Institute of Molecular Biology and Biotechnology
3 Foundation for Research and Technology, Hellas, Institute of Computer Science

Abstract

Intra-protein residue vicinities depend on the amino acids involved. Energetically favorable vicinities (or interactions) have been preserved during evolution, while unfavorable vicinities have been eliminated. We describe, statistically, the interactions between amino acids using resolved protein structures. Based on the frequency of amino acid interactions, we have devised an amino acid substitution model that implements the following idea: amino acids that have similar neighbors in the protein tertiary structure can replace each other, while substitution is more difficult between amino acids that prefer different spatial neighbors. Using known tertiary structures for α-helical membrane (HM) proteins, we build evolutionary substitution matrices. We constructed maximum likelihood phylogenies using our amino acid substitution matrices and compared them to widely-used methods. Our results suggest that amino acid substitutions are associated with the spatial neighborhoods of amino acid residues, thereby providing insights into the amino acid substitution process.

1 Introduction

Structure and functionality of a protein are closely related (Worth et al., 2009).
Its amino acid sequence, as well as cofactors, ligands and other parts of the same or other proteins, form a complex network of interactions that is the basis of the unique physicochemical properties of each protein related to its function (Worth et al., 2009). During the last decade, a multitude of tertiary protein structures have been determined by using techniques such as crystallography, NMR, electron microscopy and hybrid methods (Egli, 2010). Consequently, the number of available tertiary structures in the RCSB-PDB database (Bernstein et al., 1977) is rapidly increasing.

Computational methods facilitate the structural, functional and evolutionary characterization of proteins (Nath Jha et al., 2011). Protein function is tightly linked to its tertiary structure, and consequently to the vicinities, or interactions, that amino acid residues have formed. For example, globular and membrane proteins are characterized by totally different physicochemical environments. Membrane proteins show a limited interaction with water molecules, and they are able to interact with the lipid bilayer. Thus, their trans-membrane region adopts a single type of secondary structure: either a helix or a beta-sheet. Largely, the secondary structure is defined by interactions between amino acid residues that are not neighbors at the sequence level (Nath Jha et al., 2011).

Recently, we developed PrInS (Protein Interaction Statistics; Pavlidis et al., unpublished), an open-source software to score proteins based on the frequency of their residue interactions. It is freely available from http://pop-gen.eu/wordpress/software/prins-protein-residues-interaction-statistics. PrInS uses protein structures stored in the Protein Data Bank (PDB) to construct a statistical model of intra-protein amino acid residue interactions for a certain class of proteins (e.g. membrane proteins). PrInS scores every amino acid a proportionally to the number of 'unexpected residues' that interact with it. The term 'unexpected residues' means residues that are rarely found in the vicinity of a when all protein structures of a given dataset are considered. PrInS can thus pinpoint residues characterized by a large number of 'less frequent' interactions. Such residues may represent functional areas of the proteins or targets of natural selection, under the following assumption: even though a large number of unlikely interactions characterizes these amino acids, nature has preserved them. Here, we use PrInS to describe statistically the intra-protein amino acid interactions and consequently to construct an amino acid substitution matrix endowed with the principle that amino acids with similar (residue) neighborhoods can substitute each other during evolution.

Protein evolution comprises two major principles. The first principle suggests that protein structure is more conserved than the sequence (Siltberg-Liberles et al., 2011). The second principle suggests that the physicochemical properties of amino acids constrain the structure, the function and the evolution of proteins (Hatton and Warr, 2015). Protein evolutionary rate is strongly correlated with fractional residue burial (Siltberg-Liberles et al., 2011). This is because the core of a protein is mostly formed by buried residues, which often play a crucial role in the stability of the folded structure (Franzosa and Xia, 2008).
The three-dimensional structure of the protein determines its evolutionary rate, since most mutations in the core of a protein tend to destabilize the protein.

Within protein families, backbone changes are infrequent, which preserves the folding properties over relatively long evolutionary distances, while substitutions are often found at the side chains. For example, for proteins with binding function, the binding interface is under functional constraint and may evolve the slowest, with differences in rate between affinity-determining and specificity-determining residues (Siltberg-Liberles et al., 2011).

In addition, the secondary structural elements of a protein evolve at different rates. Beta sheets evolve more slowly than helical regions, and random coils evolve the fastest (Siltberg-Liberles et al., 2011). Secondary structure changes may eventually occur due to varying helix/sheet propensity. Some of these changes in secondary structural composition may be evolutionarily neutral, whereas some structural transitions may involve negative or positive selection. In the latter case, a new mutationally accessible fold may enable the development of a new favorable function that was not possible within the previous fold (Siltberg-Liberles et al., 2011).

In the context of folding, the thermodynamic stability of proteins with a stable, unique tertiary structure is important. Thermodynamic stability is maintained throughout evolution despite the destabilizing effect of non-synonymous mutations, which are often removed from populations as a result of negative selection.
The protein structure is important because it acts as a scaffold for properly orienting functional residues, such as a binding interface and a catalytic residue. As a result, the selective pressure for particular sequences (and not structures) over longer evolutionary periods is decreased, generating a neutral network of sequences interconnected via mutational changes (Siltberg-Liberles et al., 2011).

Choi and Kim (2006), based on the most common structural ancestor (CSA), showed that not all present-day proteins evolved from one single set of proteins in the last common ancestral organism, but that new common ancestral proteins were born at different evolutionary times. These proteins are not traceable to one or two ancestral proteins; instead, they follow the rules of the "multiple birth model" for the evolution of protein sequence families (Choi and Kim, 2006).

In this study, we used the scoring matrices obtained from the PrInS algorithm to examine the evolution of proteins, focusing on the evolution of α-helical membrane proteins. The main hypothesis is that protein evolution is related to the three-dimensional neighborhoods of amino acids. Specifically, amino acids that have similar amino acid neighborhoods can substitute each other during evolution.

2 Methods

2.1 Dataset retrieval and name conversion

In eukaryotes, α-helical proteins exist mostly in the plasma membrane or sometimes in the outer cell membrane. In prokaryotes, they are present in the inner membrane. We used 82 α-helical membrane proteins, which were previously scrutinized by Nath Jha et al. (2011) to describe the statistical properties of amino acid interactions within α-helical membrane proteins.
First, for each of the proteins in the dataset, we downloaded multiple sequence alignments of homologous protein sequences from the UCSC Genome Browser (Kent et al., 2002) for 16 primate and 3 non-primate mammalian species. Second, three-dimensional structures were downloaded from the Protein Data Bank (PDB) database.
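
The main hypothesis stated at the end of the Introduction (amino acids with similar spatial neighborhoods can substitute each other) can be illustrated with a minimal sketch. This is not the PrInS implementation: the distance cutoff, the minimum sequence separation, the use of cosine similarity, the function names and the toy coordinates are all assumptions made here for illustration only.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical parameters (assumptions, not values from the paper).
CUTOFF = 6.5      # residues closer than this (in angstroms) count as "in vicinity"
MIN_SEQ_SEP = 2   # ignore pairs that are adjacent along the sequence


def neighbor_profiles(residues, cutoff=CUTOFF, min_sep=MIN_SEQ_SEP):
    """Count how often each amino acid type has each other type as a spatial neighbor.

    residues: list of (amino_acid, (x, y, z)) tuples in sequence order.
    Returns a nested dict: counts[a][b] = number of a-b contacts.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for (i, (aa_i, xyz_i)), (j, (aa_j, xyz_j)) in combinations(enumerate(residues), 2):
        if j - i < min_sep:
            continue  # sequence neighbors are not counted as spatial contacts
        if math.dist(xyz_i, xyz_j) <= cutoff:
            counts[aa_i][aa_j] += 1
            counts[aa_j][aa_i] += 1
    return counts


def profile_similarity(counts, a, b, alphabet):
    """Cosine similarity between the neighbor-count vectors of amino acids a and b."""
    va = [counts[a][c] for c in alphabet]
    vb = [counts[b][c] for c in alphabet]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


# Toy "structure" (invented coordinates): L and I each contact F, K contacts D,
# and the G residues are far-away spacers with no contacts.
toy_residues = [
    ("L", (0.0, 0.0, 0.0)),
    ("G", (50.0, 0.0, 0.0)),
    ("F", (0.0, 5.0, 0.0)),
    ("G", (50.0, 50.0, 0.0)),
    ("I", (0.0, 10.0, 0.0)),
    ("G", (100.0, 0.0, 0.0)),
    ("K", (200.0, 0.0, 0.0)),
    ("G", (100.0, 100.0, 0.0)),
    ("D", (200.0, 5.0, 0.0)),
]

counts = neighbor_profiles(toy_residues)
alphabet = sorted({aa for aa, _ in toy_residues})
sim_L_I = profile_similarity(counts, "L", "I", alphabet)
sim_L_K = profile_similarity(counts, "L", "K", alphabet)
print("L vs I:", sim_L_I)
print("L vs K:", sim_L_K)
```

In this toy input, L and I both have only F as a neighbor, so their profiles are identical and their similarity is 1.0, whereas L and K share no neighbor type and score 0.0. Under the hypothesis, a pair like L and I would be expected to substitute each other more readily than a pair like L and K. A real analysis would aggregate contacts over many resolved structures rather than a single toy chain.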
