<<

Phylogenetic Inference: Building Trees

Philippe Lemey and Marc A. Suchard

Rega Institute Department of and Immunology K.U. Leuven, Belgium, and Departments of Biomathematics and Human Genetics David Geffen School of Medicine at UCLA Department of Biostatistics UCLA School of Public Health

SISMID – p.1

Intra- Viral

Retroviruses (and HBV) exist as a quasi-species within infected patients: • Shared substitutions may be insufficient to resolve intra-host phylogenies Improve resolution using joint model: 1195 env sequences from 9 HIV+ • Indel rates ≥ substitution rates patients [taken from Rambaut et al. • Opportunity to detect (2004)] intra-host recombination

SISMID – p.2 Reconstructing the Tree of : Are Humans Just Big Slime Molds?

A Thermotoga Eubacteria Uncertain Certain Chloroflexus Thermotoga ILVVAA−−−−−−−−−−TD−−−GPMPQTREHVLLARQVEVPYMIVFINKTDM−−VD−DPELIDL Thermus Chl orofl ex ILVVSA−−−−−−−−−−PD−−−GPMPQTREHILLARQVQVPAIVVFLNKVDM−−MD−DPELLEL Escheria Escheria ILVVAA−−−−−−−−−−TD−−−GPMPQTREHILLGRQVGVPYIIVFLNKCDM−−VD−DEELLEL Streptomyces Pyrococcus VLVVAA−−−−−−−−−−TD−−−GVMPQTKEHAFLARTLGIKHIIVAINKMDM−−VNYNQKRFEE Halobacterium VLVVAA−−−−−−−−−−DD−−−GVAPQTREHVFLSRTLGIDELIVAVNKMDV−−VDYDESKYNE Pyrococcus Euryarchaeota Methanococus VLVVNV−−−−−−−−−−DDAKSGIQPQTREHVFLIRTLGVRQLAVAVNKMDT−−VNFSEADYNE Nanoarchaeum Aeropyrum ILVVSARKGEFEAGMSTE−−−G−−−QTREHLLLARTMGIEQIIVAVNKMDAPDVNYDQKRYEF Archaeoglobus Sul f ol obus ILVVSAKKGEYEAGMSAE−−−G−−−QTREHIILSKTMGINQVIVAINKMDLADTPYDEKRFKE Ferroplasma Gi ardia ILVVAAGQGEFEAGISKD−−−G−−−QTREHATLANTLGIKTMIICVNKMDDGQVKYSKERYDE Methanosarcina Euglena VLVIDSTTGGFEAGISKD−−−G−−−QTREHALLAYTLGVKQMIVATNKFDDKTVKYSQARYEE Halobacterium Dictyostelium VLVIASPTGEFEAGIAKN−−−G−−−QTREHALLAYTLGVKQMIVAINKMDEKSTNYSQARYDE Methanothermobacter Homo VLIVAAGVGEFEAGISKN−−−G−−−QTREHALLAYTLGVKQLIVGVNKMDSTEPPYSQKRYEE Methanococcus Thermotoga GIDKDEVERGQVLA−−−−−APGSIKPHKRFKAQIYVLKKEEGGRHTPFTKGYKPQFYIRTADV Aeropyrum Eocytes Chl orofl ex GIERTDVERGQVLC−−−−−APGSIKPHKKFEAQVYVLKKEEGGRHTPFFSGYRPQFYIRTTDV Sulfolobus Escheria GIKREEIERGQVLA−−−−−KPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDV Branches Pyrobaculum Pyrococcus GVSKNDIKRGDVAGHTTN−PPTVVRTKDTFKAQIIVLNH−−−−−PTAITVGYSPVLHAHTAQV Halobacterium GIGKDDIRRGDVCGPADD−PPSVA−−−DTFQAQVVVMQH−−−−−PSVITAGYTPVFHAHTAQV Support Giardia Methanococus GVGKKDIKRGDVLGHTTN−PPTVA−−−TDFTAQIVVLQH−−−−−PSVLTDGYTPVFHTHTAQI .50 − .74 Entamoeba Aeropyrum GVSKSDIKRGDVAGHLDK−PPTVA−−−EEFEARIFVIWH−−−−−PSAITVGYTPVIHVHTASV .75 − .99 Saccharum Sul f ol obus GVEKKDVKRGDVAGSVQN−PPTVA−−−DEFTAQVIVIWH−−−−−PTAVGVGYTPVLHVHTASI GLAVKDLKKGYVVGDVTNDPPVGC−−−KSFTAQVIVMNH−−−−−PKKIQPGYTPVIDCHTAHI > .99 Euglena Gi ardia Trypanasoma Euglena NVSVKDIRRGYVASNAKNDPAKEA−−−ADFTAQVIILNH−−−−−PGQIGNGYAPVLDCHTCHI Scale Dictyostelium Dictyostelium NVSVKEIKRGMVAGDSKNDPPQET−−−EKFVAQVIVLNH−−−−−PGQIHAGYSPVLDCHTAHI Saccharomyces Homo NVSVKDVRRGNVAGDSKNDPPMEA−−−AGFTAQVIILNH−−−−−PGQISAGYAPVLDCHTAHI .10 Homo BDC

MAP Tree Partial Au Plot

• Contentious issue among paleobiologists: Do Archaea (Euryarchaeota/Eocytes) form one or two domains? Weekly World News calls humans slime molds. SISMID – p.3

The Chicken or the (Small) : Which Came First?

Issue:Bird Reptiles are markedly log Genome Size smaller than those from other

Birds . Question: Mammals Did small genomes 0.2 0.4 0.6 0.8 1.0 1.2 1.4 precede Evolutionary History and Genome Sizes of flight or Reptiles, , Birds and Mammals co-evolve? SISMID – p.4 Maximum Parsimony (MP)

Most often used =" “best”, not even statistically consistent, but fast, fast, fast ...ifyouknowthetree

Key:Findtreewithminimal#of“suspected”substitutions (internal states are not observed, 0/1 model process) • Counting minimum # of substitutions is easy • Enumerating (searching through) all possible trees is hard

Human TTCCTGGAAT 7 {A,T} Chimp TACCTGGAAT Mouse A ACCT−−T AT 6 A Fly AGATGATA CT 5 {A,G} X Order Traversal

Site:1 2 3 4 5 6 7 8 9 10 Order Traversal X − Along Molecular Sequence − Pre Post Sites are independent 1234 AGAT SISMID – p.5

Maximum Parsimony (MP)

Alittlehistory: • Anthony Edwards/Luca Cavalli-Sforza (1963,1964) – Both students of R.A Fisher – Introduced both parsimony and likelihood methods (for continuous quantities, e.g. frequencies) in one paper • Camin and Sokal (1965) provide first program for molecular sequences • Fitch and Margoliash (1967) provide efficient algorithm

7 {A,T}

6 A

5 {A,G} X Order Traversal Order Traversal X − − Pre Post 1234 AGAT

SISMID – p.6 Maximum Parsimony Algorithm

procedure Fitch and Margoliash (1967) Algorithm cost C ← 0 {Initialization} pointer k ← 2N − 1 {at the root node} To obtain the set Rk of possible states at node k {Recursion} if k is leaf then Rk ← observed character for k else Compute Ri,Rj for daughters i, j of k if Ri ∩ Rj $=0then Rk ← Ri ∩ Rj 7 {A,T} else 6 A Rk ← Ri ∪ Rj C ← C +1 5 {A,G} X

end if Order Traversal Order Traversal X − end if − Pre Post minimum cost is C {Termination} 1234 AGAT

SISMID – p.7

Searching for the MP Tree 7 Branches

5 Branches

Complexity: ABCD 5 Branches • Find MP score is ACB ABC 3 Branches NP-complete A B ABDC

• Find MP tree is NP-hard BCA ABCD

5 Branches Recall that # of N-taxon rooted trees is 3 × 5 × ···× 2N − 3

Attack exponential-order space Branch-and-Bound:

• Monotonic order: min PS2 ≤ min PS3 ≤ ...

• Bound if min PSk > best n-taxon PS found so far.

SISMID – p.8 Neighbor-Joining (Saitou and Nei, 1987)

Computational algorithm: alignment → single tree 2 3 4 Join (closest) Pairwise distances neighbors A 1 1012 15 AA = 0 1 A 2 30 20 12* AC = 1 C 3 25 2 T AT = 1 Site (data)

Recompute distances (using 12*,3,4)

• Advantages:veryfast,greatfor1000sofsequences • Disadvantages:nosite-to-siteratevariation,nonatural ways to compare trees/measure data support

SISMID – p.9

Neighbor-Joining

Caveat:Pairs with min are i, j dij A 1 B not necessarily nearest neighbors. 11 4 4 E.g., dAB =3

Computational: O(N 3)

SISMID – p.10 Likelihood-based Methods (Felsenstein, 1973)

Statistical technique: assumes an unknown tree and a stochastic model for character change along the tree

Continuous−Time Markov Chain C A A A A A G

state A C T C A T T Site (data) time T Unknown tree/internal node states

• Advantages:site-to-siterate/treevariationiseasy,can formulate probability statements • Disadvantages:must“search”tree-space→ slow

Foundation of Bayesian SISMID – p.11

Traditional Phylogenetic Reconstruction

Reconstruction Example Statistical Model

Human TTCCTGGAAT Assume:Homologoussitesareiid Chimp TACCTGGAAT and site patterns (e.g. dotted box) Mouse A ACCT−−T AT Fly AGATGATA CT XY ...Z ∼ Multinomial(pXY ...Z) Site: 1 2 3 4 5 6 7 8 9 10 Along Molecular Sequence where pXY ...Z is determined by an Human Mouse unknown tree τ,branchlengths t ∈ T and continuous-time Chimp Fly Markov chain model (for residue • Substitution:singleresidue substitution) given by infinitesimal replaces another rate matrix Q t • Insertion/deletion:residues P(X → Y in time t)=e Q are inserted or deleted SISMID – p.12 CTMC(Q)=! ∼ Normal(µ, σ2) of Phylogenetics

Continuous in elapsed time t,discreteinstarting/ending A state! G

Memory-less process in which state C the probability that state b T replaces state a during (t, t + time s)=sqab + o(s) • Infinitesimal generator matrix Q has off-diagonal entries qab and row sums =0

Think:ExponentialwaitingtimewithrateRa = b qab until chain leaves a.Thenthenewstateb is independently " chosen with probabilities qab/Ra

SISMID – p.13

From Infinitesimal to Finite Time

Let pab(t)=the finite-time probability of the chain moving from state a at time 0 to state b at time t,thenmatrix P (t)={pab(t)} satisfies d P (t)=P (t)Q where P (0) = I dt with solution 1 ∞ 1 P (t)=etQ = I + tQ + (tQ)2 + ···= (tQ)k 2 k! !k=0 as d etQ = QetQ = etQQ for t real dt

SISMID – p.14 Example: Two-State Model

Consider purines (R) ↔ pyrimidines α R Y (Y). Kolmogorov forward equation: β

pRY(t + s)=pRR(t)αs + pRY(t)(1 − βs)+o(s) yielding d p (t)=αp (t) − βp (t) dt RY RR RY

−αα tQ Q = Solutions of P (t)=e # β −β $ have the form with eigenvalues 0 and c + de−(α+β)t −(α + β)

SISMID – p.15

Standard CTMCs for Phylogenetics

• Jukes and Cantor (JC69), 1 ππ πa = , κ1 = κ2 =1 A G 4 κβ1 1 A G • Kimura (K80), πa = 4 , κ1 = κ2 • Hasegawa, Kishino and Yano β ββ β (HKY85), κ1 = κ2 (most common) κβ2 • Tamura and Nei (TN93), right CT ππC T • General Time Reversible (GTR) Note identifiability concern in etQ.Commonsolutionistofix 1d.f.suchthat

qaaπa = −1 a ! Scaling: t =1⇒ 1 expected substitution per site SISMID – p.16 Explicit Parameterization of TN93

Nucleotides mutate according to a Markovian process

Pr(X → Y in time t)=etQNuc where QNuc is a 4x4 infinitesimal rate matrix and t is a branch length.

ππA G κβ1 A G 1 0 − κ πG πC πT 1 κ1π − π π B A C T C β ββ β QNuc = β×B C , where B π π − κ2π C B A G T C κβ B C CT2 @ πA πG κ2πC − A ππC T κ1, κ2 are transition:transversion rate ratios and π is the stationary distribution of {A,G,C,T}. β controls the overall rate and can vary from site-to-site. SISMID – p.17

Site-to-Site Rate Variation

Variation occurs quite naturally and is also an important inference • short range: codon phase (slow-slow-fast) • long range: enzymatic active sites, folding, immunological pressures/selection

Assume:infinitesimalratesforsitek are rk × t × qab.Variouspriorson rk with E(rk)=1.ImplicitlyBayesian • Yang (1994) – discretized Gamma distribution SISMID – p.18 General Time Reversible CTMC

Let

Q = RDπ where R is symmetric and Dπ is a diagonal matrix composed of the stationary distribution π.

• Detailed balance ⇔ πaqab = πbqba.Balance+ irreducibility ⇔ reversible • Note Q is similar to R,asD1/2QD−1/2 = R • Hence, Q must have real eigenvalues and real eigenvectors The properties speed up computation of the finite-time transition matrix P (t)=etQ

SISMID – p.19

Calculating the Probability of a Single Site Pattern Yi

Given the tree and unobserved internal node states, the probability is the product of the finite time probabilities over allbranches:

A G tt1 2 X Y t 5 t 4 t 3 A T

L(Yi) ∝ pAAGT = Pr(Y → A,t1)Pr(X → G,t2) ∗ X Y ! ! Pr(X → T,t3)Pr(Y → A,t4)Pr(X → Y,t5)πX (1)

• Number of sumants grow rapidly in N → sum-product/peeling algorithm to distribute sums across the product

SISMID – p.20 Pruning Algorithm Felsenstein (1981)

Let P (Lk|a) =likelihoodofleavesbelownodek given k is in state a.Then,recursively compute P (Lk|a) given P (Li|b) and P (Lj |c) for daughters i, j of k:

Set pointer k ← 2N − 1 {the root, initialization} Node k Compute P (Lk|a) ∀ a as follows: {recursion}

if k is a leaf node then a a − b

if a is observed then −

P(Li|b) c P (Lk|a)=1 else P (L |a)=0 k P(L |c) end if j else Compute P (Li|a) and P (Lj |a) ∀a for daughters i, j of k {post-order traversal} P (Lk|a)= Pr(a → b, ti)P (Li|b) × Pr(a → c, tj )P (Lj |c) Pb Pc end if L(Yi) ← P (L2n−1|a)πa {termination} Pa

SISMID – p.21

ML Tree or MAP Tree?

Reporting uncertainty on tree estimates:

• The Bootstrap – Most common

– Assumes evolutionary events are reproducible. “If I went back out to the field and recollected exchangeable data . . . ”

• Bayesian inference – Returns the probability of a tree given the observed data and model

– Requires MCMC (e.g., MrBayes or BEAST)

– Advantages ∗ Does not rely on asymptotics (hypothesis testing) ∗ Naturally incorporates uncertainty in all parameters (including discrete quantities: trees, site-classifications, etc.) ∗ Arguably faster algorithms

– Disadvantages ∗ Must specify (justifiable) prior distributions SISMID – p.22