Math/C SC 5610 Computational Biology Lecture 10 and 11: Phylogenetics
Stephen Billups
University of Colorado at Denver
Math/C SC 5610 Computational Biology – p.1/29

Announcements
- Project Guidelines and Ideas are posted (proposal due March 8).
- CCB Seminar, this Friday (Feb. 18)
  Speaker: Kevin Cohen
  Title: Two and a half approaches to natural language processing in Computational Biology
  Time: 11-12 (followed by lunch)
  Place: Media Center, AU008

Outline
- Finish Intro to Optimization
- Baldi-Chauvin Algorithm
- Phylogenetics

Equality Constrained Optimization
$$\min_{x \in X} f(x) \quad \text{subject to} \quad h(x) = 0$$
Define the Lagrangian:
$$L(x, \lambda) = f(x) - \lambda h(x),$$
where $\lambda \in \mathbb{R}^m$ and $\lambda h(x)$ is read as $\sum_{i=1}^{m} \lambda_i h_i(x)$.
Optimality Conditions: If $x^*$ is a solution, then there exists $\bar{\lambda} \in \mathbb{R}^m$ such that
$$\nabla_x L(x^*, \bar{\lambda}) = \nabla f(x^*) - \bar{\lambda} \nabla h(x^*) = 0.$$

Geometric Intuition
The equation
$$\nabla f(x^*) - \bar{\lambda} \nabla h(x^*) = 0$$
says that $\nabla f(x^*)$ is a linear combination of $\{\nabla h_1(x^*), \nabla h_2(x^*), \ldots, \nabla h_m(x^*)\}$, which says that $\nabla f(x^*)$ is orthogonal to the tangent plane of the constraints.
[Figure: the constraint curve g(x) = 0 with the vectors grad g(x) and grad f(x).]

Back to Training HMMs
Now that we understand a little about optimization, we can look at the Baldi-Chauvin Algorithm for training HMMs.

Baldi-Chauvin Algorithm
Main Ideas:
- Applies gradient descent to minimize the negative log-likelihood $E = -\log L(M)$ as a function of the model parameters.
- Requires constraints on the probabilities:
  $$\sum_{i=1}^{n} \pi_i = 1, \qquad \sum_{j=1}^{n} a_{i,j} = 1, \qquad \sum_{k=1}^{m} b_{i,k} = 1.$$
  This is accomplished essentially by variable elimination.
- Does not use a linesearch. Instead, the approach is to update as follows:
  $$x_{k+1} = x_k - C \nabla f(x_k),$$
  where $C$ is a constant. (Not guaranteed to converge to anything!)

Baldi-Chauvin (cont.)
- Employs a change of variables that ensures that transition and emission probabilities never go to zero:
  $$a_{i,j} = \frac{e^{\lambda \omega_{i,j}}}{\sum_k e^{\lambda \omega_{i,k}}}, \qquad b_{i,c} = \frac{e^{\lambda \nu_{i,c}}}{\sum_k e^{\lambda \nu_{i,k}}}.$$
- Unlike the Baum-Welch method, this method can be run on-line.

Phylogeny
Phylogenetic tree: a graphical representation of the evolutionary history of related objects, called taxa (e.g. genes, organisms, languages).
- Leaves are the current species.
- Internal nodes are inferred ancestors.
- Usually a binary tree.

Example: The Tree of Life
[Figure: the Tree of Life.]

Caveats
- Trees can only approximate evolutionary history.
  - Lateral gene transfer
  - Hybridization
- Phylogenetic trees of a single gene or protein taken from a group of species often differ from the phylogenetic trees of the species.
- Care is needed in inferring phylogenetic relationships between species.

Phylogenetic Inference Problem
Given
- A set of species (genes or organisms) with a common ancestor.
- Inheritable characteristics of the species.
Determine a phylogenetic tree that best fits the data.

Why do it?
- Resolve evolutionary history.
- Helpful in constructing vaccines: ensure that vaccines address diverse strains of the disease (e.g. influenza).
- Epidemiology: reconstruct paths of infection (e.g. HIV).

Molecular Phylogenetics
- Before sequence data was available, taxonomists relied on phenotype to compare organisms.
- Now, by comparing sequences, phylogenies can be reconstructed based on genotype.
- Advantages of genotypic comparisons:
  - Phenotypic similarities do not always reflect evolution (convergent evolution). In contrast, the corresponding genotypes will be very different unless there is homology.
  - Phenotypic characteristics can be difficult to measure. Not so with genotype, which is clearly defined by sequence.
  - For very distant organisms, it is difficult to determine meaningful phenotypic characteristics for comparison (how do you compare bacteria, jellyfish, and humans?). In contrast, there are many homologous molecules essential to all living things, so genotypic comparisons are sensible even for very distant species.

Gene vs. Species Trees
- Gene Tree: A phylogenetic tree representing the evolutionary history of a single gene.
- Species Tree: A phylogenetic tree representing the evolutionary history of species.
- A gene tree (constructed from a set of species) can be different from the species tree.
[Figure: labels Species 1 and Species 2.]
- Species trees can be constructed by analyzing multiple genes.

A Little Graph Theory
- A directed graph $G = (V, E)$ consists of a set $V$ of nodes (or vertices), and a set $E \subset V \times V$ of directed edges. $(i, j) \in E$ means that there is a directed edge from node $i$ to node $j$.
- A graph is undirected if $(i, j) \in E \iff (j, i) \in E$.
- A graph is connected if any two distinct nodes $i, j \in V$ are connected by a directed path $(v_0, v_1, \ldots, v_m)$, where $v_0 = i$, $v_m = j$, and $(v_k, v_{k+1}) \in E$ for $k = 0, \ldots, m - 1$.
- A directed graph is acyclic if it does not contain a cycle $(v_0, v_1, \ldots, v_m)$, where $v_0 = v_m$ and $(v_k, v_{k+1}) \in E$ for $k < m$. (In the undirected case, traversing an edge and immediately returning along it does not count as a cycle.)
- A tree is an undirected, connected, acyclic graph.

Trees
- Question: If a tree has $n$ nodes, how many edges does it have? Answer: $n - 1$.
- A rooted tree has a distinguished node $r$, called the root.
- The parent of node $y$ in a rooted tree is the node $x$ which lies immediately before $y$ on the path from the root $r$ to $y$. Node $y$ is the child of $x$.
- A leaf node of a rooted tree is a node with no children.
- The depth of a tree is one less than the maximal number of nodes on a path from the root to a leaf.
- A rooted tree is binary if every node has at most two children.
- Phylogenetic tree: A phylogenetic tree on $n$ taxa is a tree whose leaves are the $n$ taxa.

Rooted vs. Unrooted Trees
- Rooted trees indicate direction of evolution.
- Unrooted trees say nothing about the direction of evolution.
- Many algorithms find unrooted trees, because it is easier.
- Rooted trees can be created from unrooted trees using an outgroup.

More complexity: $\Omega$ and $\Theta$ Notation
Given functions $f : \mathbb{Z}^+ \to \mathbb{R}$ and $K : \mathbb{Z}^+ \to \mathbb{R}$ (i.e., $f$ and $K$ map non-negative integers to real values), $f(n) = \Omega(K(n))$ if there exists a constant $c > 0$ and an integer $N$ such that
$$f(n) \ge c K(n) \quad \text{for all } n \ge N.$$
Compare this to big-O notation: the definition for big-O had $f(n) \le c K(n)$. Big-O gives an upper bound on the growth of $f$; $\Omega$ gives a lower bound.
We say that $f(n) = \Theta(K(n))$ if $f$ is both $O(K(n))$ and $\Omega(K(n))$.

Counting the Number of Trees
Given $n$ taxa, there are
$$\frac{(2n - 3)!}{2^{n-2}(n - 2)!} = \Omega\left(\left(\frac{2n}{3}\right)^{n-1}\right)$$
rooted, binary phylogenetic trees (up to isomorphism), and
$$\frac{(2n - 5)!}{2^{n-3}(n - 3)!} = \Omega\left(\left(\frac{2n}{3}\right)^{n-2}\right)$$
unrooted, binary phylogenetic trees (up to isomorphism).
Look at Table 4.1 to see how fast these numbers grow!
Exhaustive search is therefore hopeless, and finding the best-fitting tree is NP-hard!

Sketch of Proof
By induction...
- Let $t(n)$ be the number of rooted binary trees with $n$ leaves.
- For $n = 2$, $t(n) = 1$.
- For each tree with $n$ leaves, a tree with $n + 1$ leaves can be constructed by attaching a new leaf node either to
  1. a new internal node, created in the middle of an edge of the tree, or
  2. a new root node, created above the original root node.
- A rooted binary tree in which every internal node has exactly two children has $2n - 1$ nodes and hence $2n - 2$ edges when it has $n$ leaves. Together with the new-root option, there are thus $2n - 1$ places to add the new leaf node, so $t(n + 1) = t(n)(2n - 1)$ for $n > 1$.
- So, $t(n) = 1 \cdot 3 \cdot 5 \cdots (2n - 3)$. Multiplying and dividing by $2 \cdot 4 \cdots (2n - 4) = 2^{n-2}(n - 2)!$ yields the desired formula:
  $$t(n) = \frac{(2n - 3)!}{2^{n-2}(n - 2)!}.$$

Tree Inference: Another Optimization Problem
- Define a scoring mechanism to evaluate how well a tree matches the data.
- Choose the tree with the best score. NP-hard.

Scoring Methods
- Distance-based methods: Based on a measure of overall, pairwise differences between two sequences.
  - Clustering Methods (e.g. UPGMA)
  - Neighbor Joining
- Character-based methods: Based on a well-defined feature that can exist in a limited number of different states.
  - Maximum Parsimony
  - Compatibility
  - Maximum Likelihood

Distance Based Clustering Methods: Overview
- Requires a distance matrix $D$ (defining distances between each pair of elements).
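
As a rough illustration of what such a distance matrix might look like, the sketch below builds $D$ from a handful of aligned sequences. The slides do not say which distance measure is used, so this assumes the simplest one: the fraction of aligned positions at which two sequences differ. The function names `p_distance` and `distance_matrix` and the toy sequences are made up for this example.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ.

    Assumption: the sequences are already aligned and have equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def distance_matrix(seqs: list[str]) -> list[list[float]]:
    """Symmetric matrix D with D[i][j] = p_distance(seqs[i], seqs[j])."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = p_distance(seqs[i], seqs[j])
    return D


# Toy data, invented purely for illustration (not from the lecture).
toy_sequences = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGAACGA"]
D = distance_matrix(toy_sequences)

for row in D:
    print("  ".join(f"{d:.3f}" for d in row))
```

The matrix is symmetric with a zero diagonal, so only the entries above the diagonal need to be computed. In practice a corrected distance that accounts for multiple substitutions at the same site would often be preferred over this raw mismatch fraction.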