Phylogenetics Topic 1: an Overview
Introduction

“The affinities of all beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. The green budding twigs may represent existing species; and those produced during former years may represent the long succession of extinct species...and this connection of the former and present buds by ramifying branches may well represent the classification of all extinct and living species in groups subordinate to groups.”
Charles Darwin, in Chapter IV of On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life.

A fundamental concept of the theory of evolution, independently developed by Charles Robert Darwin and Alfred Russel Wallace and first presented jointly in 1858, is that species share a common origin and have subsequently diverged through time. Interestingly, both men came to use the simile of a great tree to illustrate this notion of descent with modification, and ever since, biologists have used tree-like diagrams to describe the pattern and timing of the events that gave rise to the earth’s biodiversity. The branching pattern of the tree represents the splitting of biological lineages, and the lengths of the branches can be used to signify the age of those events. Today, biologists call these tree-like diagrams phylogenies.

[Figure: Unrooted tree diagram drawn in the margin of one of Charles Darwin’s notebooks]

[Figure: Phylogenetic tree used in The Origin of Species]

Darwin was not thinking only about classification based on phylogenies. He also used them to visualize the process of divergence within species and the splitting of populations into separate species. Darwin used this figure to illustrate divergence of variants within species: over time, successively more variation accumulates, and eventually some of this variation forms the basis for new species.
The biological discipline dedicated to reconstructing organismal phylogenies is called phylogenetics. Parallel advances in a number of fields led to a tremendous growth in phylogenetics over the last 40 years. First, beginning in the 1960s, sophisticated techniques were developed and refined for reconstructing phylogenies from the actual features, or characters, of organisms. Second, phylogenetics grew beyond its traditional application to the classification of living organisms. Recognition that phylogenies can provide an evolutionary framework for studying a wide variety of problems led to their application in almost every other subdiscipline of biology. Third, rapid increases in computing power meant that programs implementing phylogeny reconstruction algorithms could accommodate very large amounts of data. Lastly, the revolution in molecular biotechnology opened up a vast new source of characters for phylogenetic analysis.

Before discussing the wide-ranging applications of phylogenies, it is necessary to define some essential terminology. An imaginary species phylogeny is presented in figure 1a as a guide. The lines of the phylogeny, called branches, represent species, and the bifurcation points, called nodes, represent speciation events. The tips of the terminal branches are present-day species, and each node represents a species that is the common ancestor of all its descendants, or daughter species. For example, in figure 1a the species at node B is the most recent common ancestor of present-day species 1, 2, and 3, and is not an ancestor of species 4 or 5. Furthermore, the group composed of ancestor B and all its descendants (species 1, 2, 3, and A) is called a clade, or a monophyletic group. Smaller clades are composed of A and all its descendants, and of D and all its descendants.

It must be noted that phylogenetics is not restricted to species.
Phylogenetic methods can be used to depict kinship of individuals within a local group or population, relationships among populations or subspecies, relationships among taxonomic lineages above the species level (e.g., supraspecific categories such as genera and families), relationships among genes within populations, or relationships among different genes within a gene family.

[Figure 1]

The phylogeny in figure 1a is rooted at node C, allowing us to infer which ancestral species gave rise to which present-day species. Without a root, a phylogeny looks very different: compare figure 1a with figure 1b, which differ only in the placement of a root. The importance of placing a root on a phylogeny should now be clear; without a root, biologists cannot distinguish between what is ANCESTRAL and what is DERIVED (descendant). We will return to the concept of a root in topic 3 [methods]. Rooted phylogenies allow biologists to distinguish similar characteristics due to common descent (HOMOLOGY) from similar characteristics due to convergence from different ancestors (ANALOGY) (see figure 2). However, most methods of phylogenetic inference produce unrooted trees, so the location of the root must also be inferred.

Rooted phylogenies also allow biologists to infer CHARACTER POLARITY: the evolutionary relationship between two or more states of a given character. Say we have a character with two states, “a” and “b”. If mapping them onto a rooted phylogeny shows that “b” preceded “a” in evolutionary history, then “a” is the derived state and “b” is the primitive state.

[Figure 2]

In the preceding examples (figures 1a and 1b), branch lengths were not intended to convey any information. The phylogeny in figure 1c illustrates how branch lengths can show how much change has occurred along a branch. In the case of molecular characters, if the rate of evolution is constant over time (the so-called molecular clock), the branches will show the relative divergence times of the lineages.
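The link between a molecular clock and branch lengths can be made concrete with a small sketch. If the rate of evolution is constant and all tips are sampled at the present day, every tip lies at the same total distance from the root. The Python sketch below checks that property on a hypothetical five-species tree. The topology (A joining species 1 and 2, B joining A and species 3, D joining species 4 and 5, root C), the branch lengths, and the names `CLOCK_TREE`, `tips`, `root_to_tip`, and `is_clock_like` are all assumptions made for illustration; none of them is taken from figure 1.

```python
# Hypothetical rooted tree stored as child -> (parent, branch length).
# Topology and branch lengths are invented for illustration only.
CLOCK_TREE = {
    "B": ("C", 1.0), "D": ("C", 1.0),        # root C splits into B and D
    "A": ("B", 2.0), "sp3": ("B", 3.0),      # B splits into A and species 3
    "sp1": ("A", 1.0), "sp2": ("A", 1.0),    # A splits into species 1 and 2
    "sp4": ("D", 3.0), "sp5": ("D", 3.0),    # D splits into species 4 and 5
}


def tips(tree):
    """Return the terminal nodes: nodes that are never anyone's parent."""
    parents = {parent for parent, _ in tree.values()}
    return sorted(node for node in tree if node not in parents)


def root_to_tip(tree, tip):
    """Sum the branch lengths on the path from a tip back to the root."""
    total = 0.0
    node = tip
    while node in tree:  # the root has no entry, so the walk stops there
        parent, length = tree[node]
        total += length
        node = parent
    return total


def is_clock_like(tree, tol=1e-9):
    """True if every tip is the same distance from the root (within tol)."""
    distances = [root_to_tip(tree, tip) for tip in tips(tree)]
    return max(distances) - min(distances) <= tol


if __name__ == "__main__":
    for tip in tips(CLOCK_TREE):
        print(tip, root_to_tip(CLOCK_TREE, tip))
    print("clock-like:", is_clock_like(CLOCK_TREE))
```

Lengthening a single terminal branch, for instance `dict(CLOCK_TREE, sp1=("A", 2.5))`, makes `is_clock_like` return `False`, because that tip is then further from the root than the others.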
For example, figure 1c indicates that the divergence of species 1 and 2 was much more recent than the divergence of species 4 and 5. Moreover, if the divergence dates of some points in the phylogeny are known from the fossil record (calibration points), and the characters are evolving in a clock-like fashion, the phylogeny can be used to estimate the dates of divergences absent from the fossil record. Below is an example of a real dataset (COII and cyt b gene sequences of selected mammals) where the branch lengths have been estimated once by assuming clock-like molecular evolution and again without such an assumption.

[Figure: Two rooted phylogenies of the same mammals (Felis, Canis, Ursus, Bos, Hippopotamus, Physeter, Balaenoptera, Rhinoceros, Equus), each with a scale bar of 0.1. Panel 1: branch lengths estimated under the assumption of the molecular clock. Tips are contemporary; the distance from root to each tip is the same. Panel 2: branch lengths estimated without assumption of the molecular clock. Tips are NOT contemporary; the distance from root to each tip is NOT the same.]

The phylogenetic comparative method

Evolutionary biologists use the comparative method to discover common evolutionary patterns, and to understand the causes of those patterns. The key to this approach is discovering correlated patterns of evolution between different characters of organisms, or between characters of organisms and aspects of the environment that they inhabit. Most comparative studies attempt to address the adaptive significance of biological variation, although many patterns ultimately require non-adaptive explanations. Since Darwin’s time, the comparative method has remained one of the most important analytical tools of evolutionary biologists.
However, comparative biology has recently undergone a major transformation. The realization that the characteristics of species could be correlated because of shared ancestry, taken alongside the major developments in the field of phylogenetics, meant that evolutionary biologists had to examine comparative trends together with phylogenetic relatedness.

What is the problem? Standard statistical methods for assessing a correlation treat the data drawn from different species as independent. Because species are hierarchically related by the phylogeny, they cannot be treated as if drawn independently from the same distribution.

Let’s consider a hypothetical example. Consider a phenotype (say, the size of a primate’s big toe; Y) and an ecological variable (say, the frequency of things that a big toe can be stubbed into; X). Suppose you have gone to great trouble to collect measurements of big-toe size and the “stubbiness” of the habitat, and you are interested in the significance of the regression of Y on X. So you plot your data and find what appears to be a significant correlation.

[Figure: Hypothetical dataset for phenotype (Y) and ecological variable (X)]

Now suppose that at some point early in evolutionary history two species diverged in toe size and colonized two different habitats. At that point in time there are only two data points. Two points always lie on a straight line, but the correlation cannot be significant; there are, after all, only two points, and the regression has zero degrees of freedom.

[Figure: Two-point dataset from early in evolutionary history (Y against X)]

Now suppose some evolutionary time has passed and each of these two species has given rise to 100 descendant species. By this accident of history, all the descendants in one clade will have a larger toe and tend to be in one habitat type, and the descendants of the other species will have a smaller toe and tend to be in the other habitat type.
If our sample of data came from these two clades, we would have effectively sampled only two species.

[Figure: Phylogeny of two groups of close relatives, the “big-toe clade” and the “little-toe clade”, showing the recent diversifications within each clade and the old divergence of “big-toed” and “little-toed” primates]

If we code our data to indicate the clade of origin (below), we see that the correlation is an illusion generated by two clusters with different mean values.

[Figure: Hypothetical dataset (Y against X) with points coloured according to clade of origin: “little-toed” clade and “big-toed” clade]

One way to analyze these data is to use a method called FELSENSTEIN’S INDEPENDENT CONTRASTS.
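The two-cluster illusion described in the toe-size example can be checked numerically. The Python sketch below is not an implementation of independent contrasts; it only compares the ordinary Pearson correlation computed on the pooled data with the correlation computed inside each clade. All numbers are invented toy values: each hypothetical clade has nine species placed on a 3 x 3 grid around its clade mean, so that X and Y are exactly uncorrelated within a clade. The function names `pearson_r`, `make_clade`, and `correlation` are likewise assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def make_clade(center_x, center_y):
    """Nine hypothetical species on a 3 x 3 grid around the clade mean.

    Every X offset is paired with every Y offset, so X and Y are exactly
    uncorrelated within the clade.
    """
    offsets = (-1.0, 0.0, 1.0)
    return [(center_x + dx, center_y + dy) for dx in offsets for dy in offsets]


def correlation(points):
    """Pearson r for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return pearson_r(xs, ys)


# Toy clades: small toes in a low-X habitat, big toes in a high-X habitat.
little_toe = make_clade(2.0, 2.0)
big_toe = make_clade(8.0, 8.0)

pooled_r = correlation(little_toe + big_toe)
within_little_r = correlation(little_toe)
within_big_r = correlation(big_toe)

if __name__ == "__main__":
    print(f"pooled r        = {pooled_r:.3f}")
    print(f"little-toe only = {within_little_r:.3f}")
    print(f"big-toe only    = {within_big_r:.3f}")
```

With these toy values the pooled correlation is 27/29 (about 0.93), while the correlation within each clade is exactly zero. The apparent relationship comes entirely from the difference between the two clade means, which reflects a single old divergence rather than eighteen independent observations.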