Ancestral Protein Reconstruction: Techniques and Applications
Biol. Chem. 2016; 397(1): 1–21

Review

Rainer Merkl* and Reinhard Sterner*

DOI 10.1515/hsz-2015-0158
Received April 10, 2015; accepted July 30, 2015; previously published online August 7, 2015

*Corresponding authors: Rainer Merkl and Reinhard Sterner, Institute of Biophysics and Physical Biochemistry, University of Regensburg, Universitätsstrasse 31, D-93053 Regensburg, Germany, e-mail: [email protected] (Rainer Merkl), [email protected] (Reinhard Sterner). http://orcid.org/0000-0002-3521-2957 (Rainer Merkl)

Abstract: Ancestral sequence reconstruction (ASR) is the calculation of ancient protein sequences on the basis of extant ones. It is most powerful in combination with the experimental characterization of the corresponding proteins. Such analyses allow for the study of problems that are otherwise intractable. For example, ASR has been used to characterize ancestral enzymes dating back to the Paleoarchean era and to deduce properties of the corresponding habitats. In addition, the historical approach underlying ASR enables the identification of amino acid residues key to protein function, which is often not possible by only comparing extant proteins. Along these lines, residues responsible for the spectroscopic properties of protein pigments were identified as well as residues determining the binding specificity of steroid receptors. Further applications are studies related to the longevity of mutations, the contribution of gene duplications to enzyme functionalization, and the evolution of protein complexes. For these applications of ASR, we discuss recent examples; moreover, we introduce the basic principles of the underlying algorithms and present state-of-the-art protocols.

Keywords: ancestral sequence reconstruction; phylogenetic analysis; protein evolution; vertical analysis of protein function.

Introduction

Starting from ancestral precursors, gene duplication and diversification events have yielded families of homologous proteins with highly variable amino acid sequences. Multiple sequence alignments of such proteins allow for the identification of conserved key amino acids, for example active site residues, that are characteristic of the entire protein family (Brown and Babbitt, 2014). However, such an analysis will rarely uncover the set of residues that are responsible for the functional diversity observed in large protein families (Gerlt and Babbitt, 2009). The reason for this is that many neutral as well as epistatic mutations may have accumulated during the evolution of the proteins under comparison. Neutral mutations produce sequence noise that impedes the identification of the crucial mutations leading to altered functions. Epistatic mutations, i.e. mutations that have different consequences depending on the genetic background, can be divided into permissive and restrictive mutations. Permissive mutations, for example stabilizing ones, are often the prerequisite for a change in function when the causative key mutation is destabilizing. In contrast, restrictive mutations will prevent a key mutation from becoming effective, for example by introducing steric clashes. Thus, the exchange of putative key residues by site-directed mutagenesis in the framework of a functional analysis can lead to non-functional proteins when permissive mutations are missing or restrictive mutations are present in the alternative background. As a consequence, the residues that historically led to a new function often cannot be identified by comparing extant sequences (Harms and Thornton, 2010).

From an evolutionary point of view, extant homologs are the leaves of a phylogenetic tree and represent variations observed for one specific point in time. Therefore, a comparison of extant sequences was termed the “horizontal approach” (Figure 1). It is easy to accept that a “vertical approach”, which additionally takes into account the evolutionary history of the proteins under study, is a more straightforward strategy to identify crucial but subtle amino acid differences (Harms and Thornton, 2010). Instead of exclusively comparing the leaves, such an approach includes the internal nodes of the tree and thus considers the chronology of mutations (Figure 1). Moreover, the comparison of the more similar sequences that are related to adjacent nodes reduces the number of neutral mutations and could help to identify epistatic mutations. However, internal nodes represent extinct proteins, whose properties cannot easily be determined, due to the lack of macromolecular fossils. Fortunately, novel computational techniques allow us to reconstruct the sequences of such proteins and to travel back in time (Thornton, 2004; Hanson-Smith et al., 2010). The outcome of these in silico approaches, termed ancestral sequence reconstruction (ASR), is a value in itself. Furthermore, in combination with modern gene synthesis technology, these proteins can be produced in recombinant form and characterized by means of all biochemical and biophysical methods at hand.

Figure 1: An example of a phylogeny.
Leaves representing extant sequences are labeled 1–5, internal nodes representing reconstructed ancestral sequences are labeled 6–8; 0 represents the root. The exclusive comparison of the leaf sequences is termed a “horizontal” approach; a “vertical” approach additionally takes into account the sequences of the internal nodes. The values v1–v8 represent the lengths of the branches; example according to Felsenstein (1981).

Sorting out neutral and epistatic mutations is an important but by no means the only application of ASR. Driven to extremes, the most ancient sequences that can be reconstructed are related to the era of the last universal common ancestor (LUCA), which preceded the diversification of life and existed in the Paleoarchean era, i.e. at least 3.8 billion years (Gyr) and presumably 4.5 Gyr ago (Nisbet and Sleep, 2001). Thus, due to the enormous number of known sequences, ASR makes it possible to follow changes in properties like substrate specificity and to reproduce the advent of novel or more specialized functions over this long evolutionary time span. Moreover, certain features of reconstructed macromolecules like stability are correlated with important characteristics of the corresponding habitats. Thus, ASR can implicitly reproduce the adaptation of extinct life to climatic, ecological and physiological alterations (see, for example, Boussau et al., 2008).

ASR was also utilized to characterize the promiscuity of ancestral enzymes (Perez-Jimenez et al., 2011) and to determine the longevity of mutations (Risso et al., 2015). Moreover, the contribution of gene duplications to the evolution of modern enzymes (Voordeckers et al., 2012) and the sophistication of enzyme complexes were studied by means of ASR (Bridgham et al., 2006; Perica et al., 2014).

In the following sections, we will first review in silico techniques that have been developed for ASR. Then we will discuss how ASR was used to address the applications mentioned above.

Ancestral sequence reconstruction: history, theoretical background, current protocols

In the following paragraphs, we will survey pioneering experiments, give an introduction to the theory of phylogenetic algorithms, and present state-of-the-art protocols that have been used for ASR.

Pioneering methods of ASR

The idea of reconstructing ancestral amino acid sequences based on a comparison of extant sequences was put forward by E. Zuckerkandl and L. Pauling in 1963 (Pauling and Zuckerkandl, 1963). However, the technology needed for ASR is borrowed from phylogenetic analyses and the first algorithm was not developed until 1971 (Fitch, 1971). Fitch used the principles of maximum parsimony (MP), which is a non-parametric statistical method, to deduce a phylogenetic tree for a given set of extant sequences. Parsimony aims at constructing a tree that minimizes the number of mutations needed to explain the observed data. Thus, the optimality criterion (Swofford et al., 1996) is the total tree length len(tree), which is given by

\[
\mathrm{len}(\mathrm{tree}) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \, \mathrm{diff}(x_{k'j}, x_{k''j}) \tag{1}
\]

B is the number of branches, N the number of sites (nucleotide or amino acid positions), k′ and k″ are the two nodes connected by branch k, and $x_{k'j}$, $x_{k''j}$ are the corresponding nucleotides or amino acids observed in the leaves or inferred for internal nodes. The function diff(y, z) specifies the cost of a mutation from y to z and $w_j$ assigns a weight to each site. This concept was attractive, because the tree provided the minimal number of mutations required to explain the variations observed in the given extant sequences. The corresponding implementation, named PAUP (Swofford, 1984), has proved popular and the algorithm proposed by Sankoff (Sankoff, 1975) further improved this principle by adding costs to mutations.

This parsimonious principle has been the basis for pioneering work on ribonucleases (RNases) that hydrolyze single- and double-stranded RNA (Stackhouse et al., 1990). The reconstruction

short description of evolutionary models and trees, which summarizes two recent publications (Liò and Bishop, 2008; Whelan, 2008); for a more detailed presentation, see Liberles (2007). The reader who is familiar with or is not interested in the theoretical background can skip the next six paragraphs.

Assessing mutations by means of Markov models