Reconstruction of Ancestral Enzymes
Total Page:16
File Type:pdf, Size:1020Kb
Perspectives in Science (2016) 9, 17—23 View metadata, citation and similar papers at core.ac.uk brought to you by CORE Available online at www.sciencedirect.com provided by Elsevier - Publisher Connector ScienceDirect jo urnal homepage: www.elsevier.com/pisc ଝ Reconstruction of ancestral enzymes ∗ Rainer Merkl, Reinhard Sterner University of Regensburg, Institute of Biophysics and Physical Biochemistry, 93040 Regensburg, Germany Received 1 December 2015; accepted 2 February 2016 Available online 26 August 2016 KEYWORDS Summary The amino acid sequences of primordial enzymes from extinct organisms can be determined by an in silico approach termed ancestral sequence reconstruction (ASR). In the first Ancestral sequence reconstruction; step of an ASR, a multiple sequence alignment (MSA) comprising extant homologous enzymes is being composed. On the basis of this MSA and a stochastic model of sequence evolution, a phy- Enzyme evolution; logenetic tree is calculated by means of a maximum likelihood approach. Finally, the sequences Phylogenetic tree; Promiscuous of the ancestral proteins at all internal nodes including the root of the tree are deduced. We enzymes; present several examples of ASR and the subsequent experimental characterization of enzymes Thermostability as old as four billion years. The results show that most ancestral enzymes were highly ther- mostable and catalytically active. Moreover, they adopted three-dimensional structures similar to those of extant enzymes. These findings suggest that sophisticated enzymes were invented at a very early stage of biological evolution. © 2016 Beilstein-lnstitut. Published by Elsevier GmbH. This is an open access article under the CC BY license. (http://creativecommons.org/licenses/by/4.0/). Introduction is only marginal (Jaenicke, 1991). We are interested in how enzyme activity and stability developed and changed in the course of evolution. For this purpose, it would be desirable Modern enzymes from contemporary organisms are sophis- to characterize ancient enzymes from primordial organisms. ticated biocatalysts transforming their substrates into The lack of macromolecular fossils, however, seems to block products with high efficiency and specificity. However, as the access to this interesting information. Luckily, there is a catalytic activity usually requires a certain degree of con- circumstantial way out of this dilemma, which is the charac- formational flexibility, the stability of most modern enzymes terization of ‘‘extinct’’ proteins after their ‘‘resurrection’’ via ancestral sequence reconstruction (ASR). ASR is an in silico approach allowing to deduce the ଝ This is an open-access article distributed under the terms of the sequences of ancient proteins from the sequences of homol- Creative Commons Attribution License, which permits unrestricted ogous extant proteins (Liberles, 2007). The idea of ASR is use, distribution, and reproduction in any medium, provided the more than 50 years old when Pauling and Zuckerkandl pos- original author and source are credited. This article is part of a tulated that modern proteins contain enough information special issue entitled Proceedings of the Beilstein ESCEC Symposium to derive the sequences of common ancestors (Pauling and 2015 with copyright © 2016 Beilstein-lnstitut. Published by Elsevier Zuckerkandl, 1963). However, the concept of ASR could only GmbH. All rights reserved. ∗ Corresponding author. be realized after Fitch had implemented the first phyloge- E-mail address: [email protected] (R. Sterner). netic algorithm termed PAUP (phylogenetic analysis using http://dx.doi.org/10.1016/j.pisc.2016.08.002 2213-0209/© 2016 Beilstein-lnstitut. Published by Elsevier GmbH. This is an open access article under the CC BY license. (http://creativecommons.org/licenses/by/4.0/). 18 R. Merkl, R. Sterner Figure 1 Protocol for ASR. Each ASR requires four steps to deduce ancestral sequences from a set of extant homologs. (A) A set of sequences is retrieved from a database. (B) These sequences are compiled to a multiple sequence alignment, which allows for the identification of mutations observed in the sequences. (C) A phylogeny is determined; the extant sequences constitute the leaves. (D) Based on this phylogenetic tree and the MSA, ancestral sequences are deduced for all internal nodes of the tree. parsimony) (Fitch, 1971). Although phylogenetic models Computing a phylogenetic tree by means of maximum and algorithms have considerably improved since then, the likelihood workflow of ASR has remained essentially the same (Fig. 1). The prerequisite for the computation of a phylogenetic tree A set of homologous sequences is put together in a multiple is a stochastic model that describes the probability for DNA sequence alignment (MSA), which is the basis for the subse- or protein sequences to acquire mutations within a certain quent calculations: first, a phylogenetic tree is constructed time interval ti. For this purpose, a probability to acquire whose outermost nodes (the leaves) are represented by the any mutation within ti is combined with a substitution extant sequences. After the calculation of this tree, the pre- model. The latter explains in detail with which probability a cursor sequences corresponding to the internal nodes are nucleotide or amino acid residue is replaced by another one. being calculated. These calculations are commonly based on Instead of using fixed mutation rates, it is meanwhile state a maximum likelihood approach and a phylogenetic model of the art to sample these probabilities from a continuous which allows for the sampling of mutational frequencies distribution, which provides every site with a specific rate in a position-specific residue manner (Merkl and Sterner, (Susko et al., 2003). 2016). Based on such a model, the likelihood of a tree can be In this short review, we will first describe state-of- computed. Likelihood is the probability for observing the the art in silico methods that have been developed for data (i.e., sequences) given (i) the parameters of the chosen ASR. We will then provide examples on how ASR has been evolutionary model and (ii) the topology of the tree under used to resurrect ancient enzymes from the Precambrian study. Commonly, mutations at different sites are considered era, among them translation elongation factors (Gaucher as independent events. Thus, this likelihood of a complete et al., 2008), thioredoxins (Ingles-Prieto et al., 2013; Perez- sequence is the product of all site-specific values. To explain Jimenez et al., 2011), 3-ispropylmalate dehydrogenases the principle, it is therefore sufficient to consider one site (Hobbs et al., 2015; Hobbs et al., 2012), nucleotide kinases S(j) of a sequence S and to compute the likelihood for the (Akanuma et al., 2013), -lactamases (Risso et al., 2013; nucleotides at S(j) at each node of the tree. If all time Zou et al., 2015), imidazole glycerol phosphate synthase intervals ti and all nucleotides ei are known for all nodes . (Reisinger et al., 2014), and ribonuclease H1 (Hart et al., i = 1, , 8, the likelihood of the tree shown in Fig. 2 is: 2014). We will conclude by summarizing the most impor- L tree = p t p t p t p t p t tant insights that have been gained from these studies with ( ) e1e8 ( 1) e2e8 ( 2) e3e7 ( 3) e4e6 ( 4) e5e6 ( 5) respect to our ability to ‘‘replay the molecular tape of life’’ p t p t . e6e7 ( 6) e7e8 ( 7) (1) (Gaucher, 2007). In silico methods for ASR However, the states (nucleotides) of the internal nodes are not known and therefore it is necessary to sum over all A detailed introduction into stochastic concepts and phylo- possible states (nucleotides at internal nodes) which results genetic models that is needed to understand modern ASR in methods is beyond the scope of this review and can be found elsewhere; see (Merkl and Sterner, 2016) and refer- L tree = p t p t p t p t ( ) e1e8 ( 1) e2e8 ( 2) e3e7 ( 3) e4e6 ( 4) ences therein. Here, we present a short summary of the e8 e7 e6 algorithms required to deduce phylogenetic trees and ances- p t p t p t . tral sequences. e5e6 ( 5) e6e7 ( 6) e7e8 ( 7)) (2) Reconstruction of ancestral enzymes 19 states (the content of the internal nodes labeled 6, 7, and 8). 3 For each site of these three internal nodes there are 4 com- 3 binations of nucleotides or 20 combinations of amino acids ei. It is the aim of an ML approach to identify for these three nodes a triplet with the largest value p(|data), which is in a Bayesian approach the triplet that maximizes p(data|) · p() . (3) p(data) Since p(data) is identical for all candidate triplets, it is sufficient to maximize p(data|) · p(). More specifically, for this tree and the given three nodes the triplet is found by solving Figure 2 An example of a phylogeny. Leaves representing extant sequences are labeled 1—5; internal nodes representing max [pe e (t1)pe e (t2) e ,e ,e 1 8 2 8 reconstructed ancestral sequences are labeled 6—8. The values 8 7 6 t1 to t7 represent the length of the vertices; i.e., time intervals; p t p t p t p t p t . e3e7 ( 3) e4e6 ( 4) e5e6 ( 5) e6e7 ( 6) e7e8 ( 7)] (4) example according to (Pupko et al., 2000). If we know all transition probabilities pij, the likelihood The solution computed for Eq. (4) is the maximum over all of a tree can be computed quite efficiently by means of possible triplets. For larger trees, it is necessary to maximize an iterative approach. The missing length of the edges can over all internal nodes; see (Pupko et al., 2000) for details. be determined by means of expectation maximization; for details see (Felsenstein, 1981). However, to find the tree with the maximal likelihood, the topology — which was taken as given so far — has to be optimized as well, which requires Typical software protocols for ASR the creation and the assessment of alternative topologies in The basic principles introduced in the previous paragraph order to find the optimal one.