Math/C SC 5610 Computational Biology: Lectures 10 and 11

Stephen Billups

University of Colorado at Denver

Announcements

- Project guidelines and ideas are posted (proposal due March 8).
- CCB Seminar, this Friday (Feb. 18)
  - Speaker: Kevin Cohen
  - Title: Two and a half approaches to natural language processing in Computational Biology
  - Time: 11-12 (followed by lunch)
  - Place: Media Center, AU008

Outline

- Finish Intro to Optimization
- Baldi-Chauvin Algorithm
- Phylogenetics

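Equality Constrained Optimization

$$\min_{x \in X} f(x) \quad \text{subject to} \quad h(x) = 0$$

Define the Lagrangian:

$$L(x, \lambda) = f(x) - \lambda^T h(x), \quad \text{where } \lambda \in \mathbb{R}^m.$$

Optimality conditions: if $x^*$ is a solution (and the constraint gradients $\nabla h_i(x^*)$ are linearly independent), then there exists $\lambda^* \in \mathbb{R}^m$ such that

$$\nabla_x L(x^*, \lambda^*) = \nabla f(x^*) - \sum_{i=1}^{m} \lambda_i^* \nabla h_i(x^*) = 0.$$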

Geometric Intuition

The equation

$$\nabla f(x^*) - \sum_{i=1}^{m} \lambda_i^* \nabla h_i(x^*) = 0$$

says that $\nabla f(x^*)$ is a linear combination of $\{\nabla h_1(x^*), \nabla h_2(x^*), \ldots, \nabla h_m(x^*)\}$, which in turn says that $\nabla f(x^*)$ is orthogonal to the tangent plane of the constraints at $x^*$.

[Figure: constraint curve g(x) = 0 with the gradient vectors grad g(x) and grad f(x).]
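
The stationarity condition can be checked numerically on a toy problem. The sketch below is not from the lecture: it assumes the hypothetical objective f(x) = x1 + x2 and the single constraint h(x) = x1^2 + x2^2 - 2 = 0, picks the multiplier that best fits grad f = lambda * grad h, and reports how much of grad f is left over.

```python
import math

def grad_f(x):
    # Assumed toy objective f(x) = x1 + x2
    return (1.0, 1.0)

def grad_h(x):
    # Assumed toy constraint h(x) = x1^2 + x2^2 - 2
    return (2.0 * x[0], 2.0 * x[1])

def multiplier_and_residual(x):
    """Best-fit multiplier lambda and the norm of grad f(x) - lambda * grad h(x)."""
    gf, gh = grad_f(x), grad_h(x)
    lam = (gf[0] * gh[0] + gf[1] * gh[1]) / (gh[0] ** 2 + gh[1] ** 2)
    residual = math.hypot(gf[0] - lam * gh[0], gf[1] - lam * gh[1])
    return lam, residual

x_star = (-1.0, -1.0)             # constrained minimizer of the toy problem
x_other = (math.sqrt(2.0), 0.0)   # feasible point that is not a solution

lam_star, res_star = multiplier_and_residual(x_star)
lam_other, res_other = multiplier_and_residual(x_other)
print(lam_star, res_star)
print(lam_other, res_other)
```

At the minimizer the residual vanishes (grad f is a multiple of grad h); at the other feasible point a component of grad f along the constraint remains, so moving along the circle can still decrease f.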

Back to Training HMMs

Now that we understand a little about optimization, we can look at the Baldi-Chauvin algorithm for training HMMs.

Baldi-Chauvin Algorithm

Main ideas:

- Applies gradient descent to minimize the negative log-likelihood $E = -\log L(M)$ as a function of the model parameters.
- Requires constraints on the probabilities:

  $$\sum_{i=1}^{n} \pi_i = 1, \qquad \sum_{j=1}^{n} a_{i,j} = 1, \qquad \sum_{k=1}^{m} b_{i,k} = 1.$$

  This is accomplished essentially by variable elimination.
- Does not use a line search. Instead, the update is

  $$x_{k+1} = x_k - C \nabla f(x_k),$$

  where $C$ is a constant. (Not guaranteed to converge to anything!)
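
To see why a constant step carries no convergence guarantee, here is a minimal sketch on an assumed toy objective f(x) = x^2 (not the HMM likelihood). The function name `fixed_step_descent` and the step values are illustrative choices only.

```python
def fixed_step_descent(grad, x0, C, steps):
    """Iterate x_{k+1} = x_k - C * grad(x_k) with a constant step C (no line search)."""
    x = x0
    for _ in range(steps):
        x = x - C * grad(x)
    return x

def grad(x):
    # Gradient of the assumed toy objective f(x) = x^2, minimized at x = 0.
    return 2.0 * x

small = fixed_step_descent(grad, x0=1.0, C=0.1, steps=50)   # each step multiplies x by 0.8
large = fixed_step_descent(grad, x0=1.0, C=1.1, steps=50)   # each step multiplies x by -1.2
print(small, large)
```

With C = 0.1 the iterates shrink toward the minimizer; with C = 1.1 they oscillate in sign and grow in magnitude, so the same update rule diverges.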

Baldi-Chauvin (cont.)

- Employs a change of variables that ensures that transition and emission probabilities never go to zero:

  $$a_{i,j} = \frac{e^{\omega_{i,j}}}{\sum_k e^{\omega_{i,k}}}, \qquad b_{i,c} = \frac{e^{v_{i,c}}}{\sum_k e^{v_{i,k}}},$$

  where $\omega_{i,j}$ and $v_{i,c}$ are unconstrained real parameters.
- Unlike the Baum-Welch method, this method can be run on-line.
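
The change of variables is a normalized exponential applied to each row of parameters. A minimal sketch, assuming a hypothetical row of three weights for one state (the values are made up for illustration):

```python
import math

def softmax(weights):
    """Map unconstrained real weights to positive probabilities that sum to 1."""
    shift = max(weights)                       # subtracting the max avoids overflow
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

omega_row = [2.0, -1.0, 0.5]    # hypothetical weights for the transitions out of one state
a_row = softmax(omega_row)
print(a_row, sum(a_row))
```

Whatever real values a gradient step assigns to the weights, the resulting row is strictly positive and sums to 1, so the probability constraints hold automatically.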

Phylogeny

- Phylogenetic tree: a graphical representation of the evolutionary history of related objects, called taxa (e.g. genes, organisms, languages).
- Leaves are the current species.
- Internal nodes are inferred ancestors.
- Usually a binary tree.

Example: The Tree of Life

Caveats

- Trees can only approximate evolutionary history:
  - Lateral gene transfer
  - Hybridization
- Phylogenetic trees built from a single gene taken from a group of species often differ from the phylogenetic tree of the species.
- Care is needed in inferring phylogenetic relationships between species.

Phylogenetic Inference Problem

Given:

- A set of species (genes or organisms) with a common ancestor.
- Inheritable characteristics of the species.

Determine a phylogenetic tree that best fits the data.

Why do it?

- Resolve evolutionary history.
- Helpful in constructing vaccines: ensure that vaccines address diverse strains of the disease (e.g. influenza).
- Epidemiology: reconstruct paths of infection (e.g. HIV).

Before sequence data was available, taxonomists relied on phenotype to compare organisms. Now, by comparing sequences, phylogenies can be reconstructed based on genotype.

Advantages of genotypic comparisons:

- Phenotypic similarities do not always reflect evolution (convergent evolution). In contrast, the corresponding genotypes will be very different unless there is homology.
- Phenotypic characteristics can be difficult to measure. Not so with genotype, which is clearly defined by sequence.
- For very distant organisms, it is difficult to determine meaningful phenotypic characteristics for comparison (how do you compare bacteria, jellyfish, and humans?). In contrast, there are many homologous molecules essential to all living things, so genotypic comparisons are sensible even for very distant species.

Gene vs. Species Trees

- Gene tree: a phylogenetic tree representing the evolutionary history of a single gene.
- Species tree: a phylogenetic tree representing the evolutionary history of species.
- A gene tree (constructed from a set of species) can differ from the species tree.

[Figure: gene tree compared with species tree; labels Species 1 and Species 2.]

- Species trees can be constructed by analyzing multiple genes.

A Little Graph Theory

- A directed graph $G = (V, E)$ consists of a set $V$ of nodes (or vertices) and a set $E \subseteq V \times V$ of directed edges. $(i, j) \in E$ means that there is a directed edge from node $i$ to node $j$.
- A graph is undirected if $(i, j) \in E \iff (j, i) \in E$.
- A graph is connected if any two distinct nodes $i, j \in V$ are connected by a directed path $(v_0, v_1, \ldots, v_m)$, where $v_0 = i$, $v_m = j$, and $(v_k, v_{k+1}) \in E$ for $k = 0, \ldots, m - 1$.
- A directed graph is acyclic if it does not contain a cycle $(v_0, v_1, \ldots, v_m)$, where $v_0 = v_m$ and $(v_k, v_{k+1}) \in E$ for $k < m$.
- A tree is an undirected, connected, acyclic graph (for an undirected graph, going back and forth along a single edge does not count as a cycle).

Trees

Question: If a tree has $n$ nodes, how many edges does it have? Answer: $n - 1$.

- A rooted tree has a distinguished node $r$, called the root.
- The parent of node $y$ in a rooted tree is the node $x$ that lies immediately before $y$ on the path from the root $r$ to $y$. Node $y$ is a child of $x$.
- A leaf node of a rooted tree is a node with no children.
- The depth of a tree is one less than the maximal number of nodes on a path from the root to a leaf.
- A rooted tree is binary if every node has at most two children.
- Phylogenetic tree: a phylogenetic tree on $n$ taxa is a tree whose leaves are the $n$ taxa.
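
These definitions map directly onto a small data structure. The sketch below stores a made-up rooted tree on three taxa (A, B, C) as a child-to-parent map; the node names and the representation are illustrative assumptions, not part of the lecture.

```python
# Hypothetical rooted tree on taxa A, B, C, stored as child -> parent.
parent = {"A": "x", "B": "x", "x": "r", "C": "r"}
root = "r"

nodes = set(parent) | {root}
children = {v: [] for v in nodes}
for child, par in parent.items():
    children[par].append(child)

# A leaf is a node with no children.
leaves = sorted(v for v in nodes if not children[v])

def depth(node):
    """Number of edges on the longest downward path from node to a leaf."""
    if not children[node]:
        return 0
    return 1 + max(depth(c) for c in children[node])

num_edges = len(parent)    # every node except the root has exactly one parent edge
is_binary = all(len(c) <= 2 for c in children.values())
print(len(nodes), num_edges, leaves, depth(root), is_binary)
```

The tree has 5 nodes and 4 edges, matching n - 1, and its longest root-to-leaf path (r, x, A) has three nodes, giving depth 2.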

Rooted vs. Unrooted Trees

- Rooted trees indicate the direction of evolution.
- Unrooted trees say nothing about the direction of evolution.
- Many algorithms find unrooted trees, because it is easier.
- Rooted trees can be created from unrooted trees using an outgroup.

More complexity: $\Omega$ and $\Theta$ Notation

Given functions $f : \mathbb{Z}^+ \to \mathbb{R}$ and $K : \mathbb{Z}^+ \to \mathbb{R}$ (i.e., $f$ and $K$ map non-negative integers to real values), $f(n) = \Omega(K(n))$ if there exist a constant $c > 0$ and an integer $N$ such that

$$f(n) \ge c\,K(n) \quad \text{for all } n \ge N.$$

Compare this to big-O notation: the definition for big-O had $f(n) \le c\,K(n)$. Big-O gives an upper bound on the growth of $f$; $\Omega$ gives a lower bound.

We say that $f(n) = \Theta(K(n))$ if $f$ is both $O(K(n))$ and $\Omega(K(n))$.

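Counting the Number of Trees

Given $n$ taxa, there are

$$\frac{(2n-3)!}{2^{n-2}\,(n-2)!} = \Omega\!\left(\left(\frac{2n}{3}\right)^{n-1}\right)$$

rooted, binary phylogenetic trees (up to isomorphism), and

$$\frac{(2n-5)!}{2^{n-3}\,(n-3)!} = \Omega\!\left(\left(\frac{2n}{3}\right)^{n-2}\right)$$

unrooted, binary phylogenetic trees (up to isomorphism).

Look at Table 4.1 to see how fast these numbers grow!

So exhaustive search for the best-fitting tree is infeasible for all but small $n$; in fact, finding it is NP-hard under common scoring criteria such as maximum parsimony.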
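
To get a feel for the growth, the two counting formulas can be evaluated directly. A minimal sketch (the function names are illustrative; the formulas are the ones stated on this slide, valid for n >= 2 and n >= 3 respectively):

```python
from math import factorial

def rooted_binary_trees(n):
    """(2n-3)! / (2^(n-2) (n-2)!), for n >= 2 taxa."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_binary_trees(n):
    """(2n-5)! / (2^(n-3) (n-3)!), for n >= 3 taxa."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 10, 20):
    print(n, rooted_binary_trees(n), unrooted_binary_trees(n))
```

Already at n = 10 there are 34,459,425 rooted trees, which is why scoring every candidate tree is not practical.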

Sketch of Proof

By induction. Let $t(n)$ be the number of rooted binary trees with $n$ leaves.

- For $n = 2$, $t(n) = 1$.
- For each tree with $n$ leaves, a tree with $n + 1$ leaves can be constructed by attaching a new leaf node either to
  1. a new internal node, created in the middle of an edge of the tree, or
  2. a new root node, created above the original root node.
- A rooted binary tree with $n$ leaves has $2n - 2$ edges, so there are $(2n - 2) + 1 = 2n - 1$ places to add the new leaf node, and $t(n + 1) = t(n)\,(2n - 1)$ for $n > 1$.
- So $t(n) = 1 \cdot 3 \cdot 5 \cdots (2n - 3)$.
- Multiplying and dividing by the even factors $2 \cdot 4 \cdots (2n - 4) = 2^{n-2}\,(n-2)!$ yields the desired formula.
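
The construction in the proof can be run as written for small n. In the sketch below a tree is an assumed nested-frozenset representation (a leaf is an integer label, an internal node is the unordered pair of its two subtrees), and `insertions` attaches the new leaf above every node of an existing tree, which covers both cases in the proof.

```python
def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` above one node of `tree`."""
    # Attach above the current node: a new root if this is the whole tree,
    # otherwise a new internal node in the middle of the edge above this subtree.
    yield frozenset({tree, leaf})
    if isinstance(tree, frozenset):
        left, right = tuple(tree)
        for new_left in insertions(left, leaf):
            yield frozenset({new_left, right})
        for new_right in insertions(right, leaf):
            yield frozenset({left, new_right})

def all_rooted_trees(n):
    """All rooted binary trees on leaves 1..n (n >= 2), as nested frozensets."""
    trees = {frozenset({1, 2})}
    for leaf in range(3, n + 1):
        trees = {bigger for tree in trees for bigger in insertions(tree, leaf)}
    return trees

counts = [len(all_rooted_trees(n)) for n in range(2, 7)]
print(counts)
```

Each tree with n leaves yields 2n - 1 distinct larger trees, so the enumerated counts follow the recurrence t(n + 1) = t(n)(2n - 1).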

Tree Inference: Another Optimization Problem

- Define a scoring mechanism to evaluate how well a tree matches the data.
- Choose the tree with the best score.
- NP-hard.

Scoring Methods

- Distance-based methods: based on a measure of overall, pairwise differences between two sequences.
  - Clustering methods (e.g. UPGMA)
- Character-based methods: based on a well-defined feature that can exist in a limited number of different states.
  - Maximum parsimony
  - Compatibility
  - Maximum likelihood

Distance Based Clustering Methods

Overview:

- Requires a distance matrix $D$ (defining distances between each pair of elements).
- Repeatedly group together the closest elements.
- Different algorithms differ in how they treat distances between groups:
  - UPGMA (unweighted pair group method with arithmetic mean).
  - WPGMA (weighted pair group method with arithmetic mean).
- Naming varies between references: many attach the name UPGMA to the update weighted by cluster sizes and WPGMA to the simple average, the reverse of the usage on the two algorithm slides that follow.

UPGMA

1. Initialize $C$ to the $n$ singleton clusters $\{1\}, \ldots, \{n\}$.
2. Initialize $\mathrm{dist}(c, d)$ on $C$ by defining

   $$\mathrm{dist}(\{i\}, \{j\}) = D(i, j).$$

3. Repeat $n - 1$ times:
   (a) Determine a pair $c, d$ of clusters in $C$ such that $\mathrm{dist}(c, d)$ is minimal; define $d_{\min} = \mathrm{dist}(c, d)$.
   (b) Define the new cluster $e = c \cup d$; update $C = (C \setminus \{c, d\}) \cup \{e\}$.
   (c) Define a node with label $e$ and daughters $c, d$, where $e$ has distance $d_{\min}/2$ to its leaves.
   (d) Define, for all $f \in C$ with $f \ne e$,

       $$\mathrm{dist}(e, f) = \mathrm{dist}(f, e) = \frac{\mathrm{dist}(c, f) + \mathrm{dist}(d, f)}{2} \quad \text{(avg. of previous distances)}.$$
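
Steps 1 to 3 above can be sketched in a few lines. This is an illustrative implementation of the procedure exactly as stated on this slide (simple-average update), not an optimized one; the function name `pair_group_cluster` and the 4-taxon matrix are made up for the example.

```python
from itertools import combinations

def pair_group_cluster(D):
    """Cluster taxa 0..n-1 using the simple-average update from this slide.

    D is a symmetric distance matrix (list of lists). Returns the merges in
    order as (cluster_c, cluster_d, height), where height = d_min / 2.
    """
    n = len(D)
    clusters = [frozenset([i]) for i in range(n)]              # step 1
    dist = {frozenset([clusters[i], clusters[j]]): D[i][j]      # step 2
            for i, j in combinations(range(n), 2)}
    merges = []
    for _ in range(n - 1):                                      # step 3
        # (a) closest pair of current clusters
        c, d = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        d_min = dist[frozenset([c, d])]
        # (b) replace c and d by their union e
        e = c | d
        clusters = [f for f in clusters if f not in (c, d)]
        # (d) distance from e to each remaining cluster: mean of the old two
        for f in clusters:
            dist[frozenset([e, f])] = (dist[frozenset([c, f])] +
                                       dist[frozenset([d, f])]) / 2
        clusters.append(e)
        # (c) the new node e sits at height d_min / 2 above its leaves
        merges.append((set(c), set(d), d_min / 2))
    return merges

# Hypothetical distances for four taxa.
D = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 4],
     [6, 6, 4, 0]]
merges = pair_group_cluster(D)
print(merges)
```

On this matrix the procedure joins taxa 0 and 1 first, then 2 and 3, then the two pairs, with node heights increasing from one merge to the next.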

Does UPGMA Make Sense?

Need to show that when we join two nodes, the height of the resulting node is at least the height of either of the original nodes.

Proposition: Let $e$ be a node generated by joining clusters $c$ and $d$ in step 3 above. Let $C$ be the set of all clusters immediately before joining $c$ and $d$ to form $e$. Then

$$\forall t \in C : \mathrm{height}(e) \ge \mathrm{height}(t).$$

WPGMA

1. Initialize $C$ to the $n$ singleton clusters $\{1\}, \ldots, \{n\}$.
2. Initialize $\mathrm{dist}(c, d)$ on $C$ by defining

   $$\mathrm{dist}(\{i\}, \{j\}) = D(i, j).$$

3. Repeat $n - 1$ times:
   (a) Determine a pair $c, d$ of clusters in $C$ such that $\mathrm{dist}(c, d)$ is minimal; define $d_{\min} = \mathrm{dist}(c, d)$.
   (b) Define the new cluster $e = c \cup d$; update $C = (C \setminus \{c, d\}) \cup \{e\}$.
   (c) Define a node with label $e$ and daughters $c, d$, where $e$ has distance $d_{\min}/2$ to its leaves.
   (d) Define, for all $f \in C$ with $f \ne e$,

       $$\mathrm{dist}(e, f) = \mathrm{dist}(f, e) = \frac{|c|\,\mathrm{dist}(c, f) + |d|\,\mathrm{dist}(d, f)}{|c| + |d|} \quad \text{(weighted avg.)}.$$
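
The two procedures differ only in step (d). A small sketch with made-up numbers shows when the two update rules disagree; the function names are illustrative, and the docstrings follow the naming used on these slides.

```python
def simple_average(dist_cf, dist_df):
    """Step (d) on the UPGMA slide: plain mean of the two previous distances."""
    return (dist_cf + dist_df) / 2

def size_weighted_average(dist_cf, dist_df, size_c, size_d):
    """Step (d) on the WPGMA slide: mean weighted by the cluster sizes |c| and |d|."""
    return (size_c * dist_cf + size_d * dist_df) / (size_c + size_d)

# Hypothetical case: dist(c, f) = 4, dist(d, f) = 8.
unequal_simple = simple_average(4.0, 8.0)
unequal_weighted = size_weighted_average(4.0, 8.0, 3, 1)   # |c| = 3, |d| = 1
equal_weighted = size_weighted_average(4.0, 8.0, 2, 2)     # |c| = |d| = 2
print(unequal_simple, unequal_weighted, equal_weighted)
```

When the merged clusters have equal size the two rules coincide; when c is larger than d, the size-weighted rule pulls the new distance toward dist(c, f).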

Ultrametric Trees

Key feature: the distance to the root is the same for every leaf node (distance corresponds to time since divergence).

- Given a tree $T$ with positive edge weights $w(i, j)$, if the value $d_{i,j}$ of the distance function between leaves $i$ and $j$ is the sum of the edge weights along the path connecting $i$ and $j$, then $d$ is an additive metric.
- If, in addition, the path length from the root $r$ to every leaf is identical, then $d$ is called an ultrametric.
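
Whether a given distance matrix is an ultrametric can be tested without building a tree. The sketch below uses the three-point condition, a standard equivalent characterization that is not stated on the slide: in every triple of taxa, the two largest of the three pairwise distances must be equal. The matrices are made-up examples.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """Three-point condition: in every triple, the two largest distances are equal."""
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical ultrametric distances on four taxa.
D_ultra = [[0, 2, 6, 6],
           [2, 0, 6, 6],
           [6, 6, 0, 4],
           [6, 6, 4, 0]]

# Additive but not ultrametric: a star tree with branch lengths 1, 2 and 6.
D_additive = [[0, 3, 7],
              [3, 0, 8],
              [7, 8, 0]]

print(is_ultrametric(D_ultra), is_ultrametric(D_additive))
```

The second matrix comes from a tree whose leaves sit at different distances from any candidate root, so the test fails even though the distances are path lengths in a tree.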

Ultrametric Trees and UPGMA

- If the distances between taxa are an ultrametric (for some tree), then UPGMA always correctly constructs the original topology.
- If the distances are not an ultrametric (but are still additive), then UPGMA can yield a tree with incorrect topology and incorrect branch lengths.
- One method of fixing this problem is the Farris Transformed Distance Method.