"Introduction to Inferring Evolutionary Relationships". In: Current
UNIT 6.1

Introduction to Inferring Evolutionary Relationships

Much of bioinformatics is essentially comparative biology: it draws inferences from comparisons between entities such as motifs, sequences, genomes, and molecular structures. Given that living organisms and their constituent components have an evolutionary history, phylogeny properly lies at the heart of comparative biology (Harvey and Pagel, 1991). Increasingly, researchers in bioinformatics are realizing that phylogeny-based comparisons can yield important insights that can be missed using other techniques (Eisen, 1998; Rehmsmeier and Vingron, 2001). For example, a knowledge of phylogeny is vital in determining whether the sequences in a set are orthologous or paralogous (Page and Charleston, 1997; Yuan et al., 1998; Storm and Sonnhammer, 2001; Zmasek and Eddy, 2001), which in turn has implications for predicting the function of a novel sequence (Eisen, 1998). Indeed, it is increasingly common for protein family databases to include gene phylogenies in addition to alignments. Examples include HOVERGEN (http://pbil.univ-lyon1.fr/databases/hovergen.html; Duret et al., 1994), SYSTERS (http://systers.molgen.mpg.de; Krause et al., 2000), and COPSE (http://copse.molgen.mpg.de). Phylogenetic analysis at the level of whole genomes poses new analytical challenges (Kim and Salisbury, 2001; Korbel et al., 2002; Wang et al., 2002), but promises new insights into genomic evolution.

In some cases, the role of phylogeny may be neither obvious nor explicit, so that many people may well have built phylogenetic trees without necessarily realizing it. The popular multiple sequence alignment program Clustal (UNIT 2.3) builds a phylogeny (the “guide tree”) every time it aligns sequences. Whereas Clustal constructs an initial tree to prioritize the order in which sequences are aligned, other methods infer the alignment and phylogeny simultaneously (Hein, 1990; Phillips et al., 2000).
Phylogenies also have a role to play in structural biology, especially in methods for predicting the secondary structure of RNA (Gulko and Haussler, 1996; Knudsen and Hein, 1999; Akmaev et al., 2000) and protein sequences (Li et al., 1998). Tree comparison techniques similar to those used to study host-parasite associations (Page and Charleston, 1998) have been used to tackle the problem of mapping cell-bound receptors onto the ligands to which they bind (Bafna et al., 2000).

Contributed by Roderic D.M. Page. Current Protocols in Bioinformatics (2003) 6.1.1-6.1.13. Copyright © 2003 by John Wiley & Sons, Inc.

INTRODUCTION TO TREES

This section introduces some basic phylogenetic terms and concepts. For a fuller discussion see recent textbooks such as Page and Holmes (1998) and Hall (2001).

Tree Terminology

A tree is a mathematical structure used to represent the evolutionary history of a group of sequences or organisms. The actual pattern of historical relationships is the phylogeny, or evolutionary tree, which we try to estimate. A tree consists of nodes connected by branches (also called edges). Terminal nodes (also called leaves, or terminal taxa) represent sequences or organisms for which we have data. Internal nodes represent hypothetical ancestors; the ancestor of all the sequences that comprise the tree is the root of the tree (Fig. 6.1.1). The nodes and branches of a tree may have various kinds of information associated with them. Some methods of phylogeny reconstruction (e.g., parsimony) endeavor to reconstruct the characters of each hypothetical ancestor; most methods also estimate the amount of evolution that takes place between each node on the tree, which can be represented as edge lengths (also called branch lengths). Nodes may also be labeled with various measures of support or confidence, such as bootstrap values or posterior probabilities (see below).

Figure 6.1.1 A phylogeny showing some basic tree terms.

Trees can be depicted as a diagram such as Figure 6.1.1 or in a shorthand form consisting of nested parentheses, such as (((human, mouse), fugu), Drosophila). UNIT 6.2 provides examples of different styles of tree drawing, and a detailed description of the nested parentheses notation.

Rooted and Unrooted Trees

Trees can be either rooted or unrooted. A rooted tree has a node identified as the root from which all other nodes ultimately descend; hence a rooted tree has direction. This direction corresponds to evolutionary time; the closer a node is to the root of the tree, the older it is in time. Rooted trees allow us to define ancestor-descendant relationships between nodes: given a pair of nodes connected by a branch, the node closer to the root is the ancestor of the node further away from the root (the descendant).

Figure 6.1.2 Unrooted and rooted trees for human, mouse, fugu and Drosophila. The rooted tree (bottom) is obtained by rooting the unrooted tree along the edge leading to Drosophila.

Figure 6.1.3 The five rooted trees that can be derived from an unrooted tree for four sequences. Each of the rooted trees 1 to 5 corresponds to placing the root on the corresponding numbered edge of the unrooted tree.

Unrooted trees lack a root, and hence do not specify evolutionary relationships in quite the same way, and they do not allow us to talk of ancestors and descendants. Furthermore, sequences that may be adjacent on an unrooted tree need not be evolutionarily closely related. For example, given the unrooted tree in Figure 6.1.2, the fugu and Drosophila sequences are neighbors on the tree, yet the fugu is a fish and is more closely related to humans and mice than to the fruit fly.
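The correspondence between a single unrooted tree and several rooted trees can be made concrete with a short Python sketch. It is illustrative only: the adjacency-map representation, the internal node labels "u" and "v", and the function names are assumptions introduced here, not part of any published program. The sketch places a root on each edge of the four-taxon unrooted tree in turn and writes each resulting rooted tree in nested parentheses.

```python
# Unrooted tree for four taxa, stored as an adjacency map.
# "u" and "v" are hypothetical labels for the two internal nodes.
UNROOTED = {
    "u": ["human", "mouse", "v"],
    "v": ["u", "fugu", "Drosophila"],
    "human": ["u"],
    "mouse": ["u"],
    "fugu": ["v"],
    "Drosophila": ["v"],
}


def subtree(tree, node, parent):
    """Nested-parentheses string for the part of the tree reached
    from `node` when arriving from `parent`."""
    children = [n for n in tree[node] if n != parent]
    if not children:  # terminal node (leaf)
        return node
    # Sort so that the same subtree is always written the same way.
    parts = sorted(subtree(tree, child, node) for child in children)
    return "(" + ",".join(parts) + ")"


def edges(tree):
    """List each undirected edge exactly once."""
    seen = set()
    for a, neighbours in tree.items():
        for b in neighbours:
            if (b, a) not in seen:
                seen.add((a, b))
    return sorted(seen)


def rootings(tree):
    """Place the root on every edge in turn; return the rooted trees."""
    result = []
    for a, b in edges(tree):
        sides = sorted([subtree(tree, a, b), subtree(tree, b, a)])
        result.append("(" + ",".join(sides) + ")")
    return result


if __name__ == "__main__":
    for rooted in rootings(UNROOTED):
        print(rooted)
```

An unrooted binary tree with n leaves has 2n - 3 edges, so four sequences give five possible root positions. Rooting on the edge leading to Drosophila yields (((human,mouse),fugu),Drosophila), the tree written in nested parentheses earlier in this unit.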
Fugu groups with human and mouse, rather than with its neighbor Drosophila, because the root of the tree lies on the branch leading to Drosophila. Had we placed the root elsewhere, say on the branch leading to the mouse, then the fugu and Drosophila sequences would indeed be closely related. In fact, we could place the root on any one of the five edges in the unrooted tree; hence, this unrooted tree corresponds to a set of five rooted trees (Fig. 6.1.3).

The distinction between rooted and unrooted trees is important because most methods for reconstructing phylogenies reconstruct unrooted trees, and hence cannot distinguish among the five trees shown in Figure 6.1.3 on the basis of the data alone. In order to root an unrooted tree (i.e., decide which of the five trees is the actual evolutionary tree), we need some other source of information.

Rooting Trees

Unrooted trees can be converted into rooted trees using a number of methods (Fig. 6.1.4). The most common is the use of an “outgroup.” An outgroup is one or more sequences thought to lie outside the sequences of interest (the “ingroup”). For example, given the three vertebrate sequences from human, mouse, and fugu, we could use the invertebrate Drosophila to root the tree (Fig. 6.1.4A). If we are unsure of the identity of the outgroup, then other methods must be used. Midpoint rooting (Fig. 6.1.4B) places the root halfway along the longest path in the tree. This makes an implicit assumption that the rate of evolution among the sequences has been broadly constant over time (a “molecular clock”). If the sequences come from a gene family that has undergone a gene duplication resulting in paralogous copies (see below), then the root can be placed between those copies. In Figure 6.1.4C, the root is placed between the α and β copies in humans and fugu. In large gene families where multiple gene duplications have taken place, the gene tree can be rooted at the position requiring the fewest gene duplications (Iwabe et al., 1989).

Figure 6.1.4 Three different methods of rooting an unrooted tree. (A) Outgroup rooting places the root between the outgroup (in this example Drosophila) and the ingroup (fugu, mouse, human). (B) Midpoint rooting places the root at the midpoint of the longest path in the tree. (C) Gene duplication rooting places the root between paralogous gene copies.

Gene Trees and Species Trees

There are a number of processes by which gene phylogeny can depart from species phylogeny. One of the most important is gene duplication, which can result in a species containing a number of distinct but related sequences. In the example shown in Figure 6.1.5, fugu, mice, and humans each have two copies of the same gene, α and β. A phylogeny for all six genes allows us to correctly recover the organismal phylogeny ((human, mouse), fugu) from either the α or β genes. However, if we are unfortunate enough to sequence only genes 1, 3, and 5, and are unaware that they are part of a larger gene tree, we would infer that the organismal tree was ((human, fugu), mouse). This is because gene 3 from humans is more closely related to gene 1 from fugu than is gene 5 from mouse, even though humans and mice are more closely related to each other than either is to the fugu.

Two homologous genes are orthologous if their most recent common ancestor did not undergo a gene duplication; otherwise they are termed paralogous (Fitch, 1970).
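The definition of orthology and paralogy can be illustrated with a minimal Python sketch. The gene tree below is an assumption patterned on the α/β example: a single duplication at the root followed by the same speciation events in each copy. The gene names, the tuple representation, and the "duplication"/"speciation" labels are hypothetical and introduced only for illustration. In practice, deciding which internal nodes are duplications is itself an inference problem that this sketch does not address.

```python
# Hypothetical gene tree: each internal node is (event, left, right),
# where event is "duplication" or "speciation"; leaves are gene names.
GENE_TREE = (
    "duplication",
    ("speciation", ("speciation", "human-alpha", "mouse-alpha"), "fugu-alpha"),
    ("speciation", ("speciation", "human-beta", "mouse-beta"), "fugu-beta"),
)


def leaf_names(node):
    """Set of gene names found at or below `node`."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaf_names(left) | leaf_names(right)


def mrca_event(node, gene_a, gene_b):
    """Event label at the most recent common ancestor of two genes.

    Assumes both genes are distinct leaves below `node`.
    """
    event, left, right = node
    for child in (left, right):
        # Descend while a single child subtree still holds both genes.
        if not isinstance(child, str) and {gene_a, gene_b} <= leaf_names(child):
            return mrca_event(child, gene_a, gene_b)
    return event


def relationship(tree, gene_a, gene_b):
    """Classify two genes as 'orthologous' or 'paralogous'."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaf_names(tree):
        raise ValueError("need two distinct genes that are both in the tree")
    if mrca_event(tree, gene_a, gene_b) == "duplication":
        return "paralogous"
    return "orthologous"


if __name__ == "__main__":
    print(relationship(GENE_TREE, "human-alpha", "mouse-alpha"))
    print(relationship(GENE_TREE, "human-alpha", "mouse-beta"))
```

Under these assumptions, any two α genes (or any two β genes) are orthologous, whereas any α gene paired with any β gene is paralogous, including pairs drawn from different species. Sampling a mixture of α and β copies is what produces a misleading organismal tree.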