Phylogenetics and Bioinformatics for Evolution: Maximum Likelihood
© N. Salamin, Sept 2007

Lecture outline

1 Maximum likelihood in phylogenetics
   Definition
   Maximum likelihood and models
   Likelihood of a tree
   Computational complexity
2 Statistical properties
   Maximum parsimony
   Maximum likelihood
   Experimental design
3 Hypothesis testing
   Tree support
   Tests of topology
   Tests of models

Description

Given a hypothesis $H$ and some data $D$, the likelihood of $H$ is

\[ L(H) = \mathrm{Prob}(D \mid H) = \mathrm{Prob}(D_1 \mid H)\,\mathrm{Prob}(D_2 \mid H) \cdots \mathrm{Prob}(D_n \mid H) \]

if $D$ can be split into $n$ independent parts.
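The factorisation over independent parts can be sketched in a few lines of Python. This is only an illustration: the hypothesis is taken to be "the head probability of a coin is p", the data set `tosses` is made up, and the function names are hypothetical. The log version is included because products of many small probabilities are usually handled as sums of logarithms.

```python
import math

def likelihood(tosses, p):
    """Product of Prob(D_i | H) over independent tosses; H is 'head probability = p'."""
    value = 1.0
    for toss in tosses:
        value *= p if toss == "H" else 1.0 - p
    return value

def log_likelihood(tosses, p):
    """Same quantity on the log scale: the product becomes a sum."""
    return sum(math.log(p if toss == "H" else 1.0 - p) for toss in tosses)

# Hypothetical data: 5 heads and 6 tails in an arbitrary order.
tosses = "HTTHTHTTHHT"

for p in (0.3, 5 / 11, 0.5):
    print(p, likelihood(tosses, p), log_likelihood(tosses, p))
```

Each call evaluates the probability of the same data under a different hypothesis, so the values can be compared with each other but do not sum to one over p.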
Note that $L(H)$ is not the probability of the hypothesis, but the probability of the data, given the hypothesis.

Maximum likelihood properties (Fisher, 1922)
• consistency: the estimate converges to the correct value of the parameter as the amount of data grows
• efficiency: the estimate has (asymptotically) the smallest possible variance around the true parameter value

Toss a coin

Let us say we toss a coin 11 times and obtain 5 heads and 6 tails. All tosses are independent and all have the same unknown head probability $p$. What is the probability of this data?

\[ L(p) = \mathrm{Prob}(D \mid p) = p^5 (1 - p)^6 \]

The maximum likelihood estimate of $p$ can be found by equating the derivative of $L(p)$ with respect to $p$ to zero and solving:

\[ \frac{dL(p)}{dp} = 5p^4(1 - p)^6 - 6p^5(1 - p)^5 = p^4(1 - p)^5\,\bigl[5(1 - p) - 6p\bigr] = 0 \]

For $0 < p < 1$ the bracket must vanish, so $5 - 11p = 0$, which yields $\hat{p} = 5/11 = 0.454545$.

Likelihood and models

Maximum likelihood relies on explicit probabilistic models of evolution.

But the process of evolution is so complex and multifaceted that basic models involve assumptions built upon assumptions.
This reliance is often seen as a weakness of the likelihood framework, but
• the need to make explicit assumptions is a strength
• it enables both inferences about evolutionary history and assessments of the accuracy of the assumptions made
• this has led to a better understanding of evolution

“The purpose of models is not to fit the data, but to sharpen the questions” (S. Karlin)

Basic settings for models of evolution

In order to discuss models of DNA evolution, we need to make some basic assumptions. Some can be relaxed in more complex models.

• the DNA sequences are alignable
• substitutions through time follow a Poisson distribution
• the sites of a DNA sequence evolve independently
• all the sites have the same rate of substitution

With these assumptions, we can easily model substitutions with a Markov chain.

Markov chain

Let the state (one of A, C, G or T) of the chain be $X(t)$ at time $t$.
The Markov chain is characterized by its generator matrix $Q = \{q_{ij}\}$, where $q_{ij}$ ($i \neq j$) is the instantaneous rate of change from $i$ to $j$. That is, to first order when $\Delta t \to 0$,

\[ \Pr\{X(t + \Delta t) = j \mid X(t) = i\} = q_{ij}\,\Delta t \]

The diagonal elements $q_{ii}$ are specified by the requirement that each row of $Q$ sums to zero, that is

\[ q_{ii} = -\sum_{j \neq i} q_{ij} \]

Thus $-q_{ii}$ is the substitution rate of state $i$, i.e. the rate at which the Markov chain leaves $i$.

Transition-probability matrix

The $Q$ matrix fully determines the dynamics of the Markov chain.

It specifies, in particular, the transition-probability matrix over any time $t > 0$, $P(t) = \{p_{ij}(t)\}$, where

\[ p_{ij}(t) = \Pr\{X(t) = j \mid X(0) = i\} \]

We can further show that

\[ \lim_{h \to 0^+} \frac{P(h) - I}{h} = Q \]

Thus (by the Chapman-Kolmogorov relation)

\begin{align*}
P(t + \Delta t) &= P(t)\,P(\Delta t) \\
P(t + \Delta t) - P(t) &= P(t)\,\bigl(P(\Delta t) - I\bigr) \\
\frac{P(t + \Delta t) - P(t)}{\Delta t} &= P(t)\,\frac{P(\Delta t) - I}{\Delta t} \\
P'(t) &= P(t)\,Q
\end{align*}

where the last line follows by letting $\Delta t \to 0$. As $P(0) = I$, we finally get

\[ P(t) = e^{Qt} \]
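The relation $P(t) = e^{Qt}$ can be checked numerically. The sketch below is an illustration only: it uses a hypothetical generator in which every off-diagonal rate equals 1/3, and it approximates the matrix exponential with a truncated Taylor series combined with scaling and squaring, written in plain Python. The helper names and the choices of `squarings` and `terms` are assumptions, not part of the lecture.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t): Taylor series on Q t / 2^s, then square s times."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        # term becomes A^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical generator: all off-diagonal rates are 1/3, so each row sums to zero
# and every state is left at rate 1.
rate = 1.0 / 3.0
Q = [[-1.0 if i == j else rate for j in range(4)] for i in range(4)]

P_short = transition_matrix(Q, 0.1)
P_long = transition_matrix(Q, 0.3)

for row in P_long:
    print([round(x, 4) for x in row], "row sum:", round(sum(row), 6))
```

Two properties are worth checking on the result: every row of $P(t)$ sums to one, and multiplying $P(0.1)$ by $P(0.2)$ reproduces $P(0.3)$, as the Chapman-Kolmogorov relation requires.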
Rate of mutation

As $Q$ and $t$ occur only in the form of a product, it is conventional to scale $Q$ so that the average rate is 1.

In phylogenetics, branch length is therefore measured in expected substitutions per site.

A long branch can therefore be due to either
• a long evolutionary time
• a rapid rate of substitution
• a combination of both

Kimura 2-parameter, 1980

\[ Q = \begin{pmatrix}
- & \beta/4 & \alpha/4 & \beta/4 \\
\beta/4 & - & \beta/4 & \alpha/4 \\
\alpha/4 & \beta/4 & - & \beta/4 \\
\beta/4 & \alpha/4 & \beta/4 & -
\end{pmatrix} \]

where
• $\alpha$ is the transition rate
• $\beta$ is the transversion rate

Rows and columns are ordered A, C, G, T, and each diagonal entry (shown as $-$) is minus the sum of the other entries in its row.

It simplifies to Jukes-Cantor, 1969 if $\alpha = \beta$.

Tamura-Nei, 1993

\[ Q = \begin{pmatrix}
- & \beta\pi_C & \alpha_R\pi_G/\pi_R + \beta\pi_G & \beta\pi_T \\
\beta\pi_A & - & \beta\pi_G & \alpha_Y\pi_T/\pi_Y + \beta\pi_T \\
\alpha_R\pi_A/\pi_R + \beta\pi_A & \beta\pi_C & - & \beta\pi_T \\
\beta\pi_A & \alpha_Y\pi_C/\pi_Y + \beta\pi_C & \beta\pi_G & -
\end{pmatrix} \]

where
• $\alpha_R$ is the purine transition rate, $\pi_R$ is the frequency of purines (A+G)
• $\alpha_Y$ is the pyrimidine transition rate, $\pi_Y$ is the frequency of pyrimidines (C+T)

It simplifies to
• Hasegawa, Kishino, Yano, 1985 if $\alpha_R/\alpha_Y = \pi_R/\pi_Y$
• Felsenstein, 1984 if $\alpha_R = \alpha_Y$
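As a rough sketch, the Tamura-Nei generator can be assembled entry by entry and then rescaled so that the average rate is 1. The parameter values below are made up, the function names are hypothetical, and "average rate" is read here as the frequency-weighted rate $-\sum_i \pi_i q_{ii}$, which is an assumption about the convention intended on the slide.

```python
def tamura_nei_q(alpha_r, alpha_y, beta, freqs):
    """Tamura-Nei generator; states and freqs are ordered A, C, G, T."""
    pa, pc, pg, pt = freqs
    pr, py = pa + pg, pc + pt  # purine and pyrimidine frequencies
    q = [
        [0.0, beta * pc, alpha_r * pg / pr + beta * pg, beta * pt],
        [beta * pa, 0.0, beta * pg, alpha_y * pt / py + beta * pt],
        [alpha_r * pa / pr + beta * pa, beta * pc, 0.0, beta * pt],
        [beta * pa, alpha_y * pc / py + beta * pc, beta * pg, 0.0],
    ]
    for i in range(4):
        # The diagonal is still 0 here, so sum(q[i]) is the sum of the off-diagonal rates.
        q[i][i] = -sum(q[i])
    return q

def scale_to_unit_rate(q, freqs):
    """Divide Q by the frequency-weighted mean rate, -sum_i pi_i q_ii."""
    mean_rate = -sum(f * q[i][i] for i, f in enumerate(freqs))
    return [[x / mean_rate for x in row] for row in q]

# Hypothetical parameter values, for illustration only.
freqs = [0.3, 0.2, 0.2, 0.3]  # pi_A, pi_C, pi_G, pi_T
Q_raw = tamura_nei_q(alpha_r=2.0, alpha_y=1.0, beta=0.5, freqs=freqs)
Q = scale_to_unit_rate(Q_raw, freqs)

for row in Q:
    print([round(x, 4) for x in row])
```

After scaling, a branch of length $t$ in $e^{Qt}$ corresponds to $t$ expected substitutions per site, which is why branch lengths confound time and rate.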
General time-reversible

\[ Q = \begin{pmatrix}
- & \alpha\pi_C & \beta\pi_G & \gamma\pi_T \\
\alpha\pi_A & - & \delta\pi_G & \epsilon\pi_T \\
\beta\pi_A & \delta\pi_C & - & \eta\pi_T \\
\gamma\pi_A & \epsilon\pi_C & \eta\pi_G & -
\end{pmatrix} \]

where
• $\alpha, \ldots, \eta$ are the rates of change from one nucleotide to another
• $\pi_i$ are the frequencies of the nucleotides

Variation of substitution rates

Up to now, we assumed that every site had the same rate of substitution. How can we change that?

The idea is to use a probability distribution to model changes in rates of substitution among sites, e.g. the Gamma distribution
• the mean of the distribution is $\alpha\beta$, and its variance is $\alpha\beta^2$
• set the mean rate of substitution to 1, so assume $\beta = 1/\alpha$