
© N. Salamin Sept 2007

Phylogenetics and bioinformatics for

Maximum Likelihood

September, 2007

Lecture outline

1 Maximum likelihood in phylogenetics
    Definition
    Maximum likelihood and models
    Likelihood of a tree
    Computational complexity

2 Statistical properties
    Maximum parsimony
    Maximum likelihood
    Experimental design

3 Hypothesis testing
    Tree support
    Tests of topology
    Tests of models

Description

Given a hypothesis H and some data D, the likelihood of H is

$$L(H) = \mathrm{Prob}(D \mid H) = \mathrm{Prob}(D_1 \mid H)\,\mathrm{Prob}(D_2 \mid H) \cdots \mathrm{Prob}(D_n \mid H)$$

if D can be split into n independent parts.

Note that L(H) is not the probability of the hypothesis, but the probability of the data, given the hypothesis.

Maximum likelihood properties (Fisher, 1922)
• consistency: converges to the correct value of the parameter
• efficiency: has the smallest possible variance around the true parameter value

Toss a coin

Let's say we toss a coin 11 times and obtain 5 heads and 6 tails. All tosses are independent and all have the same unknown head probability p. What is the probability of this data?

$$L(p) = \mathrm{Prob}(D \mid p) = p^5 (1 - p)^6$$

The maximum likelihood estimate of p is found by setting the derivative of L(p) with respect to p to zero and solving:

$$\frac{dL(p)}{dp} = 5p^4(1 - p)^6 - 6p^5(1 - p)^5 = 0$$

which yields $\hat{p} = 5/11 = 0.454545$.
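The same estimate can be checked numerically. The sketch below is an illustration added to the slide content: it simply evaluates L(p) on a fine grid and keeps the best value. The function names and the grid size are arbitrary choices.

```python
def likelihood(p, heads=5, tails=6):
    """Likelihood of one particular sequence of tosses with these counts."""
    return p ** heads * (1 - p) ** tails


def grid_mle(heads=5, tails=6, steps=100000):
    """Return the grid point in (0, 1) with the highest likelihood."""
    best_p, best_l = 0.0, -1.0
    for k in range(1, steps):
        p = k / steps
        value = likelihood(p, heads, tails)
        if value > best_l:
            best_p, best_l = p, value
    return best_p


p_hat = grid_mle()
print(p_hat, 5 / 11)  # the two values should agree to about four decimals
```

A grid search is wasteful for one parameter, but it makes the point that the maximum of L(p) can be located without any calculus.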

Likelihood and models

Maximum likelihood relies on explicit probabilistic models of evolution.

But the process of evolution is so complex and multifaceted that basic models involve assumptions built upon assumptions.

This reliance is often seen as a weakness of the likelihood framework, but
• the need to make explicit assumptions is a strength
• it enables both inferences about evolutionary history and assessments of the accuracy of the assumptions made
• this has led to a better understanding of evolution

“The purpose of models is not to fit the data, but to sharpen the questions” (S. Karlin)

Basic settings for models of evolution

In order to discuss models of DNA evolution, we need to make some basic assumptions. Some can be relaxed in more complex models.
• the DNA sequences are alignable
• substitutions through time follow a Poisson distribution
• the sites of a DNA sequence evolve independently
• all the sites have the same rate of substitution

With these assumptions, we can easily model substitutions with a Markov chain.

Markov chain

Let the state (one of A, C, G or T) of the chain be X(t) at time t.

The Markov chain is characterized by its generator matrix $Q = \{q_{ij}\}$, where $q_{ij}$ is the instantaneous rate of change from i to j when $\Delta t \to 0$, that is, for $i \neq j$,

$$\Pr\{X(t + \Delta t) = j \mid X(t) = i\} = q_{ij}\,\Delta t$$

The diagonal elements $q_{ii}$ are specified by the requirement that each row of Q sums to zero, that is

$$q_{ii} = -\sum_{j \neq i} q_{ij}$$

Thus $-q_{ii}$ is the substitution rate of state i, i.e. the rate at which the Markov chain leaves i.

Transition-probability matrix

The Q matrix fully determines the dynamics of the Markov chain.

It specifies, in particular, the transition-probability matrix over any time t > 0, $P(t) = \{p_{ij}(t)\}$, where

$$p_{ij}(t) = \Pr\{X(t) = j \mid X(0) = i\}$$

We can further show that

$$\lim_{h \to 0^+} \frac{P(h) - I}{h} = Q$$

Transition-probability matrix

Thus (by the Chapman-Kolmogorov relation)

$$
\begin{aligned}
P(t + \Delta t) &= P(t)\,P(\Delta t) \\
P(t + \Delta t) - P(t) &= P(t)\,\bigl(P(\Delta t) - I\bigr) \\
\frac{P(t + \Delta t) - P(t)}{\Delta t} &= P(t)\,\frac{P(\Delta t) - I}{\Delta t} \\
P'(t) &= P(t)\,Q
\end{aligned}
$$

As P(0) = I, we finally get

$$P(t) = e^{Qt}$$

Rate of

As Q and t occur only in the form of a product, it is conventional to scale Q so that the average rate is 1.

In phylogenetics, branch length is therefore measured in expected substitutions per site.

A long branch can therefore be due to
• long evolutionary time
• a rapid rate of substitution
• a combination of both
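To make the relation between Q, t and P(t) concrete, here is a minimal sketch (not part of the original slides) that exponentiates a Jukes-Cantor rate matrix scaled to an average rate of 1 and compares the result with the well-known closed-form Jukes-Cantor probabilities. The matrix exponential is a plain scaling-and-squaring Taylor series written for illustration only; the helper names are assumptions, and a numerical library routine would normally be used instead.

```python
import math


def jc69_rate_matrix():
    """Jukes-Cantor Q, scaled so that each state is left at rate 1."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=12):
    """exp(Q t): Taylor series on Q t / 2**squarings, then repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jc69_closed_form(t):
    """Closed-form Jukes-Cantor P(t) for t in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


p_numeric = mat_exp(jc69_rate_matrix(), 0.5)
p_exact = jc69_closed_form(0.5)
print(p_numeric[0][0], p_exact[0][0])
```

Because Q is scaled to rate 1, the value t = 0.5 here is a branch length of 0.5 expected substitutions per site, whatever mix of time and rate produced it.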

Kimura 2-parameters, 1980

$$
Q = \begin{pmatrix}
- & \beta/4 & \alpha/4 & \beta/4 \\
\beta/4 & - & \beta/4 & \alpha/4 \\
\alpha/4 & \beta/4 & - & \beta/4 \\
\beta/4 & \alpha/4 & \beta/4 & -
\end{pmatrix}
$$

with rows and columns in the order A, C, G, T; each diagonal entry (shown as −) is minus the sum of the other entries in its row, where
• α is the transition rate
• β is the transversion rate

It simplifies to Jukes-Cantor, 1969 if α = β.

Tamura-Nei, 1993

$$
Q = \begin{pmatrix}
- & \beta\pi_C & \alpha_R \pi_G/\pi_R + \beta\pi_G & \beta\pi_T \\
\beta\pi_A & - & \beta\pi_G & \alpha_Y \pi_T/\pi_Y + \beta\pi_T \\
\alpha_R \pi_A/\pi_R + \beta\pi_A & \beta\pi_C & - & \beta\pi_T \\
\beta\pi_A & \alpha_Y \pi_C/\pi_Y + \beta\pi_C & \beta\pi_G & -
\end{pmatrix}
$$

where
• $\alpha_R$ is the purine transition rate, $\pi_R$ is the frequency of purines (A+G)
• $\alpha_Y$ is the pyrimidine transition rate, $\pi_Y$ is the frequency of pyrimidines (C+T)

It simplifies to

• Hasegawa, Kishino, Yano, 1985 if $\alpha_R/\alpha_Y = \pi_R/\pi_Y$

• Felsenstein, 1984 if $\alpha_R = \alpha_Y$

General-time reversible

$$
Q = \begin{pmatrix}
- & \alpha\pi_C & \beta\pi_G & \gamma\pi_T \\
\alpha\pi_A & - & \delta\pi_G & \epsilon\pi_T \\
\beta\pi_A & \delta\pi_C & - & \eta\pi_T \\
\gamma\pi_A & \epsilon\pi_C & \eta\pi_G & -
\end{pmatrix}
$$

where
• $\alpha, \ldots, \eta$ are rates of change from one nucleotide to another
• $\pi_i$ are the frequencies of the nucleotides

Variation of substitution rates

Up to now, we assumed that every site had the same rate of substitution. How can we change that?

The idea is to use a probability distribution to model changes in rates of substitution among sites, e.g. the Gamma distribution
• the mean of the distribution is αβ, and the variance αβ²
• set the mean rate of substitution to 1, so assume β = 1/α
• the α parameter allows us to change the characteristics of the distribution

Felsenstein, 2004
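The effect of fixing β = 1/α can be checked by simulation. The following sketch is an added illustration: it draws site rates with Python's standard `random.gammavariate` (shape α, scale β) and reports their empirical mean and variance, which should be close to 1 and 1/α. The function names, seed and sample size are arbitrary.

```python
import random


def sample_site_rates(alpha, n_sites, seed=1):
    """Draw one relative rate per site from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def mean_and_variance(values):
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


for alpha in (0.5, 2.0):
    mean, variance = mean_and_variance(sample_site_rates(alpha, 200000))
    # Expect a mean near 1 and a variance near 1 / alpha.
    print(alpha, round(mean, 2), round(variance, 2))
```

Small α gives a strongly skewed distribution (many nearly invariant sites and a few fast ones); large α makes all sites evolve at nearly the same rate.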

Discrete characters

Note that it is easy to extend the same approach to other discrete character datasets simply by modifying the size of the Q matrix, and thus the number of parameters.

codon or aa
• 61 or 21 states
• computational problems due to the number of parameters involved
• for codons, solutions have been found to reduce the size, based on the HKY85 model

morphology
• usually 2 states

Continuous characters

Continuous characters are modelled as evolving according to a Brownian motion model.
• consider a particle moving along an x-axis in small steps that are independent from each other
• with mean displacement of zero and constant variance s²
• after n steps, the net displacement is the sum of the individual steps and its variance is the variance of the sum, ns²
• given a divergence time t between two species
• let σ² be the variance expected to accumulate per unit of time in a continuous character
• the variance after an interval of time t is σ²t
• the mean is assumed to stay the same over time t

The Ornstein-Uhlenbeck model generalizes the Brownian motion model by allowing traits subject to multiple types of selection processes to be modelled.
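The Brownian motion property that the variance grows as σ²t can be illustrated with a short simulation. This is an added sketch with arbitrary parameter values and hypothetical function names; each replicate sums many small independent Gaussian steps, as in the particle description above.

```python
import random


def simulate_displacement(sigma2, t, n_steps, rng):
    """Net displacement after time t, built from n_steps independent steps."""
    step_sd = (sigma2 * t / n_steps) ** 0.5  # each step has variance sigma2 * t / n_steps
    return sum(rng.gauss(0.0, step_sd) for _ in range(n_steps))


def displacement_stats(sigma2=0.5, t=4.0, n_steps=20, n_replicates=20000, seed=7):
    """Empirical mean and variance of the net displacement over many replicates."""
    rng = random.Random(seed)
    xs = [simulate_displacement(sigma2, t, n_steps, rng) for _ in range(n_replicates)]
    mean = sum(xs) / len(xs)
    variance = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
    return mean, variance


mean, variance = displacement_stats()
# Expect a mean near 0 and a variance near sigma2 * t = 2.0.
print(round(mean, 2), round(variance, 2))
```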

Likelihood of a tree

Suppose we have a data set D of DNA sequences with m sites.

We are given a topology T with branch lengths and a model of evolution, Q, that allows us to compute $P_{ij}(t)$.

Assumptions made to compute the likelihood $L(T, Q) = \mathrm{Prob}(D \mid T, Q)$
• evolution in different sites is independent
• evolution in different lineages is independent

Independence of sites

Felsenstein, 2004

The assumption of independence of sites along a sequence allows us to decompose the likelihood into a product of likelihoods for each site

$$L(T, Q) = \mathrm{Prob}(D \mid T, Q) = \prod_{i=1}^{m} \mathrm{Prob}(D_i \mid T, Q)$$

The likelihood for this tree for site i is

$$\mathrm{Prob}(D_i \mid T, Q) = \sum_{x} \sum_{y} \sum_{z} \sum_{w} \mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T, Q)$$

where each sum runs over the four bases A, C, G, T.

Independence of lineages

Felsenstein, 2004

With the independence of lineages assumption, we can decompose each term on the right-hand side of the equation a bit further:

$$
\begin{aligned}
\mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T, Q) = {} & \mathrm{Prob}(x)\,P_{xy}(t_6)\,P_{yA}(t_1)\,P_{yC}(t_2) \\
& \times P_{xz}(t_8)\,P_{zC}(t_3) \\
& \times P_{zw}(t_7)\,P_{wC}(t_4)\,P_{wG}(t_5)
\end{aligned}
$$

The site likelihood is the sum of this product over x, y, z and w, each running over A, C, G, T.

Number of terms

It is reasonable to take Prob(x) to be the equilibrium probability of base x under the model selected.

On a tree with n species, there are n − 1 interior nodes, and each can have one of 4 states; we therefore have too many terms:
• n = 10, we have $4^9$ terms: 262,144
• n = 20, we have $4^{19}$ terms: 274,877,906,944
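The growth in the number of terms is easy to reproduce; the short snippet below (an added illustration, with a hypothetical helper name) recomputes the two figures quoted above.

```python
def n_terms(n_species, n_states=4):
    """Number of interior-state combinations summed over for a single site."""
    return n_states ** (n_species - 1)


for n in (10, 20):
    print(n, f"{n_terms(n):,}")
# 10 262,144
# 20 274,877,906,944
```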

Pruning algorithm

• Goal: render the likelihood computation practicable using “dynamic programming”
• Idea: move summation signs as far right as possible and enclose them in parentheses where possible

$$
\begin{aligned}
\sum_{x} \sum_{y} \sum_{z} \sum_{w} \mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T) = {} & \sum_{x} \mathrm{Prob}(x) \left( \sum_{y} P_{xy}(t_6)\,P_{yA}(t_1)\,P_{yC}(t_2) \right) \\
& \times \left( \sum_{z} P_{xz}(t_8)\,P_{zC}(t_3) \times \left( \sum_{w} P_{zw}(t_7)\,P_{wC}(t_4)\,P_{wG}(t_5) \right) \right)
\end{aligned}
$$

The pattern of parentheses and the terms for the tips correspond exactly to the structure of the tree.

Conditional likelihoods

Flow of information goes down the tree, which makes use of the conditional likelihood of a subtree, e.g.

$$P_{yA}(t_1)\,P_{yC}(t_2)$$

The conditional likelihood is the probability of everything seen at or above that node, given that the node has base y.

So at each node k, with descendant nodes l and m, we can compute for site i

$$L_k^{(i)}(s) = \left( \sum_{x} P_{sx}(t_l)\,L_l^{(i)}(x) \right) \left( \sum_{y} P_{sy}(t_m)\,L_m^{(i)}(y) \right)$$

The likelihood at the root node 0 is then

$$L^{(i)} = \sum_{x} \pi_x\,L_0^{(i)}(x)$$

Similar logic as in the Sankoff algorithm for parsimony.
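The recursion can be sketched in a few lines. The code below is an illustration added here, not the lecture's own implementation: it assumes a Jukes-Cantor model (so that $P_{ij}(t)$ has a closed form and $\pi_x = 1/4$), encodes the five-tip example tree as nested tuples, and uses made-up branch lengths. It compares the pruning result with the brute-force sum over all $4^4$ interior-state combinations.

```python
import itertools
import math

BASES = "ACGT"


def jc69_prob(i, j, t):
    """Jukes-Cantor P_ij(t), with t in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def example_tree(t):
    """Tips A, C, C, C, G; an interior node is ((child, length), (child, length))."""
    y = (("A", t[1]), ("C", t[2]))
    w = (("C", t[4]), ("G", t[5]))
    z = (("C", t[3]), (w, t[7]))
    return ((y, t[6]), (z, t[8]))


def conditional_likelihoods(node):
    """Return {s: L_k(s)} for the subtree rooted at this node."""
    if isinstance(node, str):  # a tip: likelihood 1 for the observed base only
        return {s: 1.0 if s == node else 0.0 for s in BASES}
    (left, t_left), (right, t_right) = node
    l_left = conditional_likelihoods(left)
    l_right = conditional_likelihoods(right)
    return {
        s: sum(jc69_prob(s, x, t_left) * l_left[x] for x in BASES)
        * sum(jc69_prob(s, y, t_right) * l_right[y] for y in BASES)
        for s in BASES
    }


def site_likelihood_pruning(tree):
    root = conditional_likelihoods(tree)
    return sum(0.25 * root[x] for x in BASES)


def site_likelihood_brute_force(t):
    """Direct sum over all states of the interior nodes x, y, z, w."""
    total = 0.0
    for x, y, z, w in itertools.product(BASES, repeat=4):
        total += (0.25
                  * jc69_prob(x, y, t[6]) * jc69_prob(y, "A", t[1]) * jc69_prob(y, "C", t[2])
                  * jc69_prob(x, z, t[8]) * jc69_prob(z, "C", t[3])
                  * jc69_prob(z, w, t[7]) * jc69_prob(w, "C", t[4]) * jc69_prob(w, "G", t[5]))
    return total


# Hypothetical branch lengths t1..t8, chosen only for the example.
lengths = {1: 0.1, 2: 0.2, 3: 0.15, 4: 0.05, 5: 0.3, 6: 0.1, 7: 0.25, 8: 0.2}
print(site_likelihood_pruning(example_tree(lengths)), site_likelihood_brute_force(lengths))
```

The two numbers agree, but the pruning version visits each node once, so its cost grows linearly with the number of species rather than exponentially.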

Unrootedness

The trees inferred so far appeared to be rooted trees, but if the model of evolution is reversible, the trees are unrooted.

Looking at the region near the root

$$L(T) = \sum_{x} \sum_{y} \sum_{z} \mathrm{Prob}(x)\,P_{xy}(t_6)\,P_{xz}(t_8)$$

But reversibility of the substitution process guarantees that

$$\mathrm{Prob}(x)\,P_{xy}(t_6) = \mathrm{Prob}(y)\,P_{yx}(t_6)$$

Substituting that, we get

$$L(T) = \sum_{x} \sum_{y} \sum_{z} \mathrm{Prob}(y)\,P_{yx}(t_6)\,P_{xz}(t_8)$$

Since $\sum_x P_{yx}(t_6)\,P_{xz}(t_8) = P_{yz}(t_6 + t_8)$, the likelihood depends only on the total length $t_6 + t_8$, so the root can be placed anywhere along that branch.

Estimating parameters

Up to now we assumed that both branch lengths and model parameters were fixed at a given value.

Of course, we don’t know these values so we use numerical optimisation:
• to get these values, start from a random guess
• make small changes and compare the improvement in likelihood
• do so until the likelihood does not change any more
• every time the topology is changed, re-estimate these parameters

Drawback: very computationally intensive, but there are solutions to that.
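The procedure can be sketched on the smallest possible problem: estimating the single branch length between two sequences under Jukes-Cantor. This is an added illustration with invented counts (90 identical sites, 10 differing sites) and hypothetical function names. It uses a fixed rather than random starting guess so that the run is reproducible, and it checks the result against the standard closed-form Jukes-Cantor distance.

```python
import math


def jc69_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of distance d for two aligned sequences under Jukes-Cantor."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # a given identical pair of bases
    p_diff = 0.25 * (0.25 - 0.25 * e)  # a given differing pair of bases
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)


def hill_climb(log_lik, start, step=0.1, tol=1e-7, lower=1e-9):
    """Make small changes, keep improvements, halve the step when none helps."""
    x, best = start, log_lik(start)
    while step > tol:
        improved = False
        for candidate in (x + step, x - step):
            if candidate < lower:
                continue
            value = log_lik(candidate)
            if value > best:
                x, best, improved = candidate, value, True
                break
        if not improved:
            step /= 2.0
    return x, best


n_same, n_diff = 90, 10  # hypothetical data
d_hat, lnl = hill_climb(lambda d: jc69_log_likelihood(d, n_same, n_diff), start=1.0)
d_closed = -0.75 * math.log(1.0 - (4.0 / 3.0) * n_diff / (n_same + n_diff))
print(d_hat, d_closed)
```

On a real tree the same loop has to be run over every branch length and model parameter, and repeated for each candidate topology, which is where the computational cost comes from.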

End of part 1


Statistical properties of parsimony

As the amount of data approaches infinity, an estimator is
• consistent if it converges to the true value of the parameter with probability 1
• inconsistent if it converges to something else

One way to test the consistency of parsimony is to check data patterns. For four taxa:
• $4^4 = 256$ possible site patterns, from AAAA to TTTT
• 3 unrooted tree topologies ⇒ work out for each how many changes are necessary for patterns to evolve
• patterns xxyy, xyxy, or xyyx are the only ones to affect the length of the tree in parsimony
• commonly called phylogenetically informative characters
• patterns such as xxyz do not affect a parsimony method, as they require 2 changes on any tree

Data patterns


Felsenstein, 2004

Observed numbers of patterns

The length of a tree is obtained by counting how many times we have xxyy, xyxy, and xyyx, whatever nucleotides are represented by x and y.

On the first tree, the number of changes is

$$n_{xxyy} + 2n_{xyxy} + 2n_{xyyx} = 2(n_{xxyy} + n_{xyxy} + n_{xyyx}) - n_{xxyy}$$

Similarly, for the other two trees

$$2n_{xxyy} + n_{xyxy} + 2n_{xyyx} = 2(n_{xxyy} + n_{xyxy} + n_{xyyx}) - n_{xyxy}$$

$$2n_{xxyy} + 2n_{xyxy} + n_{xyyx} = 2(n_{xxyy} + n_{xyxy} + n_{xyyx}) - n_{xyyx}$$

Since the first term is the same in all three, we can ignore it: the tree chosen by parsimony is the one minimizing the rightmost part of the equations, i.e. the tree whose supporting pattern is the most frequent.

Pattern probabilities

Felsenstein, 2004

p and q represent the net probability of change along the branch, and we assume independent evolutionary processes in different lineages.

If we assign the character state 0 to both internal nodes:

$$P_{1100} = \tfrac{1}{2}\,(1 - p)(1 - q)(1 - q)\,p\,q$$

Summing over all possible states at both internal nodes:

$$P_{1100} = \tfrac{1}{2}\left[(1 - p)(1 - q)^2 pq + (1 - p)^2 (1 - q)^2 q + p^2 q^3 + pq(1 - p)(1 - q)^2\right]$$
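These pattern probabilities can be reproduced by enumerating the states of the two internal nodes. The sketch below is an added illustration; the figure is not available here, so it assumes the tree ((A,B),(C,D)) with change probability p on the branches leading to A and C and q on the branches leading to B and D and on the interior branch, which is the assignment consistent with the formula above. With these assumptions, pattern 1100 (xxyy) supports the true tree and pattern 1010 (xyxy) supports the tree grouping the two long branches.

```python
import itertools


def branch(change_prob, start, end):
    """Probability of the observed end state, given the start state (two-state model)."""
    return change_prob if start != end else 1.0 - change_prob


def pattern_probability(pattern, p, q):
    """Probability of a 0/1 pattern at tips (A, B, C, D) on the tree ((A,B),(C,D))."""
    a, b, c, d = pattern
    total = 0.0
    for u, v in itertools.product((0, 1), repeat=2):  # states of the two internal nodes
        total += (0.5  # prior probability of the state at the first internal node
                  * branch(p, u, a) * branch(q, u, b)
                  * branch(q, u, v)
                  * branch(p, v, c) * branch(q, v, d))
    return total


def closed_form_p1100(p, q):
    return 0.5 * ((1 - p) * (1 - q) ** 2 * p * q + (1 - p) ** 2 * (1 - q) ** 2 * q
                  + p ** 2 * q ** 3 + p * q * (1 - p) * (1 - q) ** 2)


for p, q in ((0.1, 0.05), (0.4, 0.05)):
    true_tree = pattern_probability((1, 1, 0, 0), p, q)
    wrong_tree = pattern_probability((1, 0, 1, 0), p, q)
    print(p, q, true_tree > wrong_tree)
```

With short branches the pattern supporting the true tree is the more probable one; with long branches to A and C the misleading pattern becomes more probable, and adding more sites only reinforces the wrong answer.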

Long branch attraction

• the probability of parallel changes along both long branches is p²
• the probability of a single change in the interior branch is q
• if p² becomes greater than q ⇒ long branch attraction

Felsenstein, 2004

But if the tree is short enough, even large ratios of the length of the long to the short branches do not cause inconsistency.

Generalisation

The proof of inconsistency of parsimony has been generalized to DNA data and to larger trees.

However, parsimony can be “rescued” if long branches in the tree are broken by adding more taxa:

Graybeal, 1995

Problem: we don’t know in advance which branch to break, but a good sampling should take care of that

Consistency and overparameterisation

Maximum likelihood can be proved to be consistent (see Felsenstein, 2004, p. 271)
• true if we use the correct model of evolution
• but what happens if the model is not correct? No guarantee can be given
• less problematic if what we want to infer is just the topology

Beware of overparameterisation problems
• adding more and more parameters to the model will result in a better fit to the data
• but this will lead to inconsistency
• a few parameters are more important than others, in particular the gamma distribution

Phylogenetic information

“Given the recognition of phylogenetic inference as being inherently statistical in nature, it is surprising that so little attention has been paid to experimental design”
Goldman, 1998

$$I_\theta = -\frac{\partial^2 \ln(L)}{\partial \theta_i\,\partial \theta_j}$$

where $I_\theta$ is proportional to the precision of an estimator of θ.

For experimental design, Fisher or expected information is useful as its inverse gives asymptotic lower bounds for the variances of estimators of the $\theta_i$.
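As a worked illustration (an addition, using the coin-toss data from earlier in the lecture rather than a phylogenetic parameter), the information about p can be computed directly from the log-likelihood:

$$\ln L(p) = 5 \ln p + 6 \ln(1 - p), \qquad I_p = -\frac{d^2 \ln L}{dp^2} = \frac{5}{p^2} + \frac{6}{(1 - p)^2}$$

At $\hat{p} = 5/11$ this gives $I_{\hat{p}} = 1331/30 \approx 44.4$, so the asymptotic variance bound is $1/I_{\hat{p}} = 30/1331 \approx 0.0225$, which is the familiar $\hat{p}(1 - \hat{p})/n$ with n = 11. More tosses increase the information and tighten the bound, which is the sense in which information guides experimental design.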

What can we do with that?

[Figures: best rate of evolution (Townsend, 2007); where to add a new sequence (Goldman, 1998)]

End of part 2


Bootstrap on

Allows us to infer the variability of parameters in models that are too complex for easy calculation of their variance. This is the case for topologies!

Procedure
• sample whole columns of data with replacement
• recreate n pseudomatrices with the same number of species and sites as the original one
• build n phylogenetic trees from these n pseudoreplicates
• weigh each tree in replicate i by the number of trees obtained in that replicate

Felsenstein, 2004
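The resampling step of this procedure can be sketched as follows. This is an added illustration: the alignment is a toy example, the function names are hypothetical, and the tree-building step is deliberately left out, since any tree inference method can be applied to each pseudoreplicate.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample whole columns (sites) with replacement, keeping species intact."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Return n_replicates pseudomatrices of the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment, invented for the example.
alignment = {
    "sp1": "ACGTACGTAC",
    "sp2": "ACGTTCGTAC",
    "sp3": "ACCTACGAAC",
    "sp4": "GCCTACGAAC",
}
for replicate in bootstrap_replicates(alignment, 3):
    print(replicate["sp1"], replicate["sp4"])
```

Sampling column indices once per replicate and applying them to every species is what keeps each site pattern intact.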

Summarizing results

The end result of a bootstrap is a cloud of trees.

What is the best way to summarize this given that trees have discrete topologies and continuous branch lengths?

We could make a histogram of the length of a particular branch
• it will give a lower limit on the branch length
• then check whether 0 is in the 95% interval; if it is not, we would assert the existence of the branch

A simpler solution is to count how many times a particular branch appears in the list of trees estimated by bootstrap.

A majority-rule consensus tree containing the branches appearing in more than 50% of them can then be built.

Simple example


Felsenstein, 2004

Jackknife on phylogenetic tree

Resample a portion of characters without replacement.

Bootstrap:
• weights $m_j$ on data have a multinomial distribution, with n trials and equal probability for all j characters
• mean weight per character is 1, with variance 1 − 1/n

Jackknife
• delete fraction f of the characters (weights 0 or 1)
• mean weight per character is 1 − f, with variance f(1 − f)
• when f = 1/2, same coefficient of variation as bootstrap when n → ∞

Some authors argue that jackknife with f = 1/e is better, but it has a smaller variance than bootstrap.

Multiple tests

We don’t (or rarely) know in advance which group interests us.

If we look for the most supported group on the tree and report its p-value, we have a “multiple-tests” problem

• if there is no significant evidence for the existence of any groups on a tree
• 5% of branches are expected to be above 0.95
• so one out of every 20 branches of a tree would be significant

The p-value therefore cannot be interpreted as a statistical test.

Independence of characters

The independence assumption of resampling techniques may not be met.

Imagine that pairs of characters are identical
• we should draw once for each identical pair, because there are only n/2 independent characters
• if we draw n times, we will be sampling too often
• the variation between bootstrap samples will be too small
• the trees generated will be too similar

Result: corroborating evidence for groups on the tree will appear higher than it really is.

Problem: no easy way to know how much correlation there is between characters.

Invariant characters

What is the impact of invariant characters on the bootstrap percentages?

How often will a single varying character appear in a bootstrap replicate?
• if there are N characters in total, it will be chosen with probability 1/N at each draw
• it will be omitted 1 − 1/N of the time
• the probability that it will be omitted entirely is $(1 - 1/N)^N$
• increase the number of invariant characters, and therefore N
• the probability of being omitted tends towards $e^{-1} = 0.36788$ when N → ∞
• 90% of this value is reached with N = 6
• 99% with N = 50
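The convergence of the omission probability can be tabulated directly. The snippet below is an added illustration (the function name is hypothetical); it prints $(1 - 1/N)^N$ and its ratio to the limit $e^{-1}$ for a few values of N.

```python
import math


def omission_probability(n_chars):
    """Probability that one given character is absent from a bootstrap replicate."""
    return (1.0 - 1.0 / n_chars) ** n_chars


limit = math.exp(-1.0)
for n in (2, 6, 50, 1000):
    prob = omission_probability(n)
    print(n, round(prob, 5), round(prob / limit, 4))
```

The ratio passes 0.90 at N = 6 and is at about 0.99 by N = 50, so beyond a few dozen characters adding further invariant sites barely changes how often the single informative character is dropped.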

Biases in bootstrap

Estimated p-value is conservative:

Hillis and Bull (1993)

One source of conservatism
• with bootstrap, we make statements about branch lengths µ
• then we reduce that to statements about tree topology, i.e. µ > 0 or µ < 0
• generalisation of the Hillis and Bull (1993) results is not clear

Parametric bootstrap

Use computer simulations to create pseudorandom data sets

Advantages and disadvantages
• can sample from the desired distribution, even with small data sets
• flexible hypothesis testing framework
• close reliance on the correctness of the model of evolution

Felsenstein, 2004

Are two topologies different?

LRT or AIC are not possible on topologies because they are discrete entities, and not parameters.

We therefore have to rely on paired-sites tests
• expected lnL is the average lnL per site as the number of sites grows without limit
• if sites are independent and two trees have equal expected lnL, then the differences in lnL at each site are drawn independently with expectation zero
• statistical test of whether the mean of these differences is zero
• valid for likelihood and parsimony

Possible tests

Different forms of test are possible

sites: for each site, score which tree is better and use a binomial distribution to test whether the proportion of sites favoring one tree is significantly different from 0.5
z: assumes the differences at each site are normally distributed and estimates the variance of the differences of the scores
Wilcoxon: replaces the absolute values of the differences by their ranks, then re-applies the signs; the sum of values for one tree is used as the statistic
KH: uses bootstrap sampling to infer the distribution of the sum of differences of scores and checks whether 0 lies in the tails of the distribution
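Three of the tests listed above can be sketched in a few lines. This is a minimal illustration, assuming the input `diffs` is a list of per-site differences in lnL (tree I minus tree II). The function names are hypothetical. The KH-style function follows the wording above (resample sites, then see whether 0 lies in the tails); published versions of the test may centre the resampled sums instead.

```python
import math
import random

def sign_test(diffs):
    """'sites' test: two-sided exact binomial test that each tree is
    favored at half of the sites (ties are ignored)."""
    wins = sum(d > 0 for d in diffs)
    n = sum(d != 0 for d in diffs)
    k = max(wins, n - wins)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def z_test(diffs):
    """z test: the sum of differences divided by its estimated standard
    deviation, compared with a standard normal distribution.
    Assumes the differences are not all identical (non-zero variance)."""
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    var_site = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    z = total / math.sqrt(n * var_site)
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

def kh_bootstrap(diffs, replicates=2000, seed=0):
    """KH-style test: bootstrap the sites, sum the differences in each
    replicate, and report twice the smaller tail proportion around 0."""
    rng = random.Random(seed)
    n = len(diffs)
    sums = [sum(rng.choices(diffs, k=n)) for _ in range(replicates)]
    lower = sum(s <= 0 for s in sums) / replicates
    upper = sum(s >= 0 for s in sums) / replicates
    return min(1.0, 2 * min(lower, upper))
```

All three take the same input, which makes it easy to see how much the conclusion depends on the distributional assumption each test makes.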

Example

232-site mtDNA for 7 mammals

[Figure: Felsenstein, 2004]

Example

Results

sites: 160 sites for tree I, 72 for tree II ⇒ p-value = $3.279 \times 10^{-9}$ against 116:116 H0
z: sum of lnL differences = 3.18, $\sigma^2 = 0.04$, so variance of sum of differences is 11.31; z statistic is 3.18/3.36 = 0.94 with p-value of 0.34
KH: 10,000 bootstrap samples of sites; 8,326 favored tree II, which yields a one-tailed probability of 0.16, and a 0.33 two-tailed probability
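The arithmetic behind these results can be retraced with the standard library. This is a rough sketch: the variable names are illustrative, and the two-sided exact binomial convention used here is an assumption, so the sites p-value may not match the quoted figure digit for digit.

```python
import math

# sites test: 160 of 232 sites favor tree I, null expectation 116:116
n_sites, wins_tree1 = 232, 160
tail = sum(math.comb(n_sites, k) for k in range(wins_tree1, n_sites + 1)) / 2 ** n_sites
p_sites = 2 * tail  # two-sided exact binomial (assumed convention)

# z test: sum of differences and variance of that sum, as quoted above
sum_diff, var_sum = 3.18, 11.31
z = sum_diff / math.sqrt(var_sum)
p_z = math.erfc(z / math.sqrt(2))  # two-sided normal tail

# KH test: 8,326 of 10,000 bootstrap replicates favored tree II
one_tailed = 1 - 8326 / 10000
two_tailed = 2 * one_tailed

print(f"sites: p = {p_sites:.3e}")
print(f"z:     z = {z:.2f}, p = {p_z:.2f}")
print(f"KH:    one-tailed = {one_tailed:.4f}, two-tailed = {two_tailed:.4f}")
```

The sites test rejects strongly while the z and KH tests do not, because the sites test ignores the size of the per-site differences.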

Multiple tests, again

If we want to test more than two trees
• compare each tree to the best tree
• accept all trees that cannot be rejected by the KH test
• multiple tests setting, but no reduction of the nominal rejection level is possible
• need to correct for all the different ways the data can vary, ways that support different trees

When two trees are compared, but one of them is the actual best tree, we should do a one-tailed test.

SH test

Resampling technique that approximately corrects for testing multiple trees

1 make R bootstrap samples of the N sites
2 for each tree, normalize the resampled lnL so they have the same expectation
3 for the jth bootstrap, calculate $\tilde{S}_{ij}$ for the ith tree: how far the normalized value is below the maximum across all trees for that replicate
4 for each tree i, the tail probability is the proportion of bootstrap replicates in which $\tilde{S}_{ij}$ is less than the actual difference between the ML and the lnL of that tree

The resampling builds a “least-favorable” case in which the trees show some patterns of covariation of sites, as in the actual data, but do not differ in overall lnL.

One limitation: it assumes that all proposed trees are possibly equal in likelihood.

Uncertainty assessment

Likelihood does not only allow us to make point estimates of the topology and branch lengths, it also gives information about the uncertainty of our estimates.

It is possible to use the likelihood curve to test hypotheses and to make interval estimates.

Asymptotically (i.e. when the number of data points tends towards $\infty$), the ML estimate $\hat{\theta}$ is normally distributed around its true value $\theta_0$.

Likelihood ratio test

The variance $\sigma$ around $\hat{\theta}$ can also be found using the curvature of the likelihood surface, then

$$\frac{\hat{\theta} - \theta_0}{\sqrt{\sigma}} \sim \mathcal{N}(0, 1)$$

Considering the lnL curve as locally quadratic

$$\ln L(\theta_0) = \ln L(\hat{\theta}) - \frac{1}{2}\,\frac{(\theta_0 - \hat{\theta})^2}{\sigma}$$

Rearranging, and using the fact that the square of a standard normal variable is $\chi^2_1$ distributed, leads to

$$2\left[\ln L(\hat{\theta}) - \ln L(\theta_0)\right] \sim \chi^2_1$$

For p parameters, twice the difference in log-likelihood is $\chi^2_p$ distributed.
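The test statistic and its tail probability can be computed without external libraries. The sketch below assumes integer degrees of freedom. The names `chi2_sf` and `likelihood_ratio_test` are illustrative, and the chi-square tail is built from the standard recurrence for the regularized upper incomplete gamma function rather than taken from a statistics package.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper tail P(X > x) of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2:  # odd df: start from the df = 1 tail
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:       # even df: start from the df = 2 tail
        a, q = 1.0, math.exp(-half)
    # Q(a + 1, h) = Q(a, h) + h^a e^{-h} / Gamma(a + 1), raising df by 2 each step
    while a < df / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

def likelihood_ratio_test(lnl_general: float, lnl_restricted: float, df: int):
    """Return (statistic, p-value) for two nested hypotheses."""
    stat = 2.0 * (lnl_general - lnl_restricted)
    return stat, chi2_sf(stat, df)

# hypothetical log-likelihoods, one constrained parameter
stat, p = likelihood_ratio_test(-1000.0, -1002.0, df=1)
print(f"2 * delta lnL = {stat:.2f}, p = {p:.4f}")
```

Here `df` is the number of parameters constrained under the null hypothesis.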

Nested models

Assumptions of the likelihood ratio test
• the null hypothesis should be in the interior of the space that contains the alternative hypotheses
• if q parameters have been constrained, they must be able to vary in both directions
• if L0 restricts one parameter to the end of its range, the distribution of twice the log likelihood ratio has half its mass at 0 and the other half in the usual $\chi^2$ distribution
• in that case, halve the tail probability obtained with the usual $\chi^2$
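The boundary case in the last two bullets can be sketched for a single constrained parameter. The function name is hypothetical, and only the one-parameter case is assumed, where the $\chi^2_1$ tail has a closed form through the complementary error function.

```python
import math

def boundary_lrt_pvalue(lnl_general: float, lnl_restricted: float) -> float:
    """p-value when the null fixes one parameter at the end of its range:
    half of the null distribution sits at 0, the other half is chi-square(1)."""
    stat = max(0.0, 2.0 * (lnl_general - lnl_restricted))
    if stat == 0.0:
        return 1.0
    chi2_1_tail = math.erfc(math.sqrt(stat / 2.0))  # P(chi-square(1) > stat)
    return 0.5 * chi2_1_tail

# hypothetical log-likelihoods
print(boundary_lrt_pvalue(-1000.0, -1001.5))
```

A typical use is testing whether a parameter that cannot be negative, such as a proportion of invariant sites, is equal to zero.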

Valid only asymptotically
• the estimates should be close enough to the true values
• we can therefore approximate the likelihood curve as being shaped like a normal distribution
• true with very large amounts of data

Akaike Information Criterion

A more general model will always have a likelihood at least as high as that of the restricted models nested within it.

So, choosing the model with the highest likelihood will lead to one that is unnecessarily complex.

We should therefore find a compromise between goodness of fit and complexity of the model.

AIC for hypothesis i with $p_i$ parameters:

$$\mathrm{AIC}_i = -2 \ln L_i + 2 p_i$$

The hypothesis with the lowest AIC is preferred.
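As a small illustration of the formula, the sketch below ranks three models by AIC. The log-likelihood values are made up, and the parameter counts are assumed to cover only the substitution-model parameters.

```python
def aic(lnl: float, n_params: int) -> float:
    """Akaike Information Criterion: -2 lnL + 2 p."""
    return -2.0 * lnl + 2.0 * n_params

# hypothetical (lnL, number of free parameters) for three models
candidates = {
    "JC69": (-2150.0, 0),
    "F84": (-2100.0, 4),
    "GTR": (-2098.5, 8),
}
scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:5s} AIC = {score:.1f}")
print("preferred:", best)
```

In this made-up case GTR has the highest likelihood, but its extra parameters do not improve the fit enough to offset the penalty.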

Models hierarchy

All the different models seen so far are special cases of the GTR+Γ+I model
• setting Γ and I to 0 leads to GTR
• setting transversion rates to β and transition rates to α leads to F84
• setting all nucleotide frequencies to 1/4 leads to K2P
• setting β = α leads to JC69

Examples

Possible comparisons

• GTR+Γ vs GTR
  • $2 \times [\ln L_{\mathrm{GTR}+\Gamma} - \ln L_{\mathrm{GTR}}]$
  • difference in df = 1

• GTR+Γ vs F84+Γ
  • $2 \times [\ln L_{\mathrm{GTR}+\Gamma} - \ln L_{\mathrm{F84}+\Gamma}]$
  • difference in df = 4

• F84+Γ vs JC69
  • $2 \times [\ln L_{\mathrm{F84}+\Gamma} - \ln L_{\mathrm{JC69}}]$
  • difference in df = 6

Testing the

For the same topology, if the tree likelihood is estimated by enforcing a molecular clock, the number of branch lengths to estimate is reduced.

The distance between an ancestor and its descendants is the same under a molecular clock, so we only have to estimate the $s - 1$ node ages instead of $2s - 3$ branch lengths.

The likelihood ratio test is thus:
• $2 \times [\ln L_{\text{no clock}} - \ln L_{\text{clock}}]$
• difference in df = $s - 2$
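The degrees of freedom follow directly from the two parameter counts given above, with $s$ the number of species:

```latex
\mathrm{df}
  = \underbrace{(2s - 3)}_{\text{branch lengths, no clock}}
  - \underbrace{(s - 1)}_{\text{node ages, clock}}
  = s - 2
```

For instance, with $s = 7$ species the unconstrained tree has 11 branch lengths and the clock tree has 6 node ages, so the test has 5 degrees of freedom.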