Taming the Beast

Molecular Evolution Models
Taming the Beast Workshop

David Rasmussen & Carsten Magnus

June 27, 2016

Outline

- Models of sequence evolution:
  - rate matrices
  - Markov chain model
- Variable rates amongst different sites: "+Γ"
- Codons and data partitions
- Implementation in BEAST2

Levels of evolution

- Genotype (sequence level), for example two RNA sequences that differ at a single position:

      ACUGAACGUGACUACUG
      ACUGAACGUAACUACUG

- Phenotype, e.g. the antigenic level: antibody binding to HIV.
- Codon: three nucleotides encode one amino acid.
- One nucleotide change can already change the phenotype.
- Alphabets: 4 nucleotides (DNA: TCAG, RNA: UCAG) and 20 amino acids.

When comparing two nucleotide sequences we have to keep in mind that they are the result of mutation during replication (genotypic level) and selection (phenotypic level).
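To make the link between the two levels concrete, the following minimal sketch translates the two example sequences and reports where they differ. It assumes that the reading frame starts at the first nucleotide, which the slide does not state, and it uses a deliberately partial codon table containing only the codons that occur in these two sequences. The names `CODON_TABLE` and `translate` are illustrative.

```python
# Partial codon table (standard genetic code, RNA alphabet): only the codons
# that occur in the two example sequences are listed.
CODON_TABLE = {
    "ACU": "Thr", "GAA": "Glu", "CGU": "Arg",
    "GAC": "Asp", "AAC": "Asn", "UAC": "Tyr",
}


def translate(rna):
    """Translate complete codons, reading from the first nucleotide (assumed frame)."""
    usable = len(rna) - len(rna) % 3  # ignore an incomplete trailing codon
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3)]


seq_a = "ACUGAACGUGACUACUG"
seq_b = "ACUGAACGUAACUACUG"

# 1-based positions at which the two aligned sequences differ (genotype level).
diff_sites = [i + 1 for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]

# Amino acid sequences (one step towards the phenotype level).
protein_a = translate(seq_a)
protein_b = translate(seq_b)

print(diff_sites)
print(protein_a)
print(protein_b)
```

Under the assumed reading frame, the single nucleotide difference falls in the fourth codon and changes the encoded amino acid, which is the situation the slide describes.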

Sequence alignment

A sequence alignment is a way of arranging sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Example:

    ATTACGAC
    TCTACGAC

To find an alignment:

- Concept of positional homology: nucleotides (or amino acids) show positional homology if they exist at equivalent positions in the respective sequences.
- Programs for alignment: MUSCLE and CLUSTAL, which can be called from e.g. AliView, MegAlign, ...

A BEAST analysis starts with aligned sequences! File formats: .fas, .fasta, .nexus
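Because BEAST expects aligned input, a quick sanity check before loading a file can be useful. The sketch below is one possible check, not part of BEAST or BEAUti: it reads FASTA-formatted text and verifies that all sequences have the same length. The function names and the taxon names `taxon_a` and `taxon_b` are made up for illustration; the two sequences are the ones from the slide.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header line
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def is_aligned(records):
    """True if all sequences have the same length (gap characters included)."""
    return len({len(seq) for seq in records.values()}) <= 1


# Hypothetical two-taxon file built from the sequences on the slide.
example = io.StringIO(">taxon_a\nATTACGAC\n>taxon_b\nTCTACGAC\n")
alignment = read_fasta(example)
aligned = is_aligned(alignment)
print(alignment, aligned)
```

Equal length is only a necessary condition. Whether the columns really are positionally homologous is decided by the alignment program, not by this check.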

Models for nucleotide substitutions

The fundamental problem

[Figure: an ancestral sequence and the sequences observed in three taxa]

    A C T T G A T G   ancestral sequence
    A C T A G C T G   taxon 1
    A G T T G C T G   taxon 2
    A C T T G A T G   taxon 3

The observed differences can arise in several ways:

- single substitution: C > G (site 2, taxon 2)
- multiple substitutions: T > C followed by C > A (site 4, taxon 1)
- convergent substitution: A > C in both taxon 1 and taxon 2 (site 6)

Problem of phylogenetics: We observe sequences but not their evolutionary history. Thus we have to take all possible evolutionary trajectories into account.

The sequence evolution model appears in the posterior:

$$
P(\,\cdot \mid \text{alignment}) = \frac{P(\text{alignment} \mid \cdot)\; P(\cdot \mid \cdot)\; P(\cdot)\; P(\cdot)\; P(\cdot)}{P(\text{alignment})}
$$

Here the alignment (ACAC..., TCAC..., ACAG...) is the data. The remaining arguments, marked by dots, were drawn as pictograms on the slide.

A model for nucleotide substitutions

State space of each nucleotide position: S = {T, C, A, G}

Example: assume the process is at state T. It moves to C at rate a, to A at rate b and to G at rate c, so it leaves T at the total rate a + b + c. The diagonal entry of the rate matrix is therefore -(a+b+c).

Substitution rate matrix (rows: from, columns: to, both in the order T, C, A, G):

$$
\begin{pmatrix}
-(a+b+c) & a & b & c \\
d & -(d+e+f) & e & f \\
g & h & -(g+h+i) & i \\
j & k & l & -(j+k+l)
\end{pmatrix}
$$
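The rule that each diagonal entry is minus the sum of the other entries in its row can be written down directly. The following sketch builds a rate matrix from off-diagonal rates and fills in the diagonal. The helper `rate_matrix` and the numerical values for a, b and c are made up for illustration; all other rates are left at zero to keep the example short.

```python
STATES = "TCAG"  # row and column order used on the slides


def rate_matrix(off_diagonal):
    """Build a 4x4 rate matrix from {(from, to): rate}; diagonal = -(row sum)."""
    q = [[0.0] * 4 for _ in range(4)]
    for (src, dst), rate in off_diagonal.items():
        q[STATES.index(src)][STATES.index(dst)] = rate
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


# Illustrative (made-up) values for the three rates out of T: a, b and c.
rates = {("T", "C"): 0.3, ("T", "A"): 0.1, ("T", "G"): 0.2}
Q = rate_matrix(rates)

# Every row of a rate matrix sums to zero.
row_sums = [sum(row) for row in Q]
print(Q[0], row_sums)
```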

Site models in BEAST2

The easiest substitution model: JC69

JC69:

- named after T. H. Jukes and C. R. Cantor: Evolution of protein molecules, 1969 [Jukes and Cantor, 1969]
- all substitutions have the same rate, λ

Substitution rates (rows: from, columns: to, both in the order T, C, A, G):

$$
\begin{pmatrix}
\cdot & \lambda & \lambda & \lambda \\
\lambda & \cdot & \lambda & \lambda \\
\lambda & \lambda & \cdot & \lambda \\
\lambda & \lambda & \lambda & \cdot
\end{pmatrix}
$$

Accounting for transition/transversion: K80

K80:

- named after M. Kimura: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences, 1980 [Kimura, 1980]
- transitions happen at rate α, transversions at rate β

T and C have one ring, A and G have two rings. A transition is a substitution within one of these groups (T ↔ C or A ↔ G); a transversion is a substitution between the groups.

Substitution rates (rows: from, columns: to, both in the order T, C, A, G):

$$
\begin{pmatrix}
\cdot & \alpha & \beta & \beta \\
\alpha & \cdot & \beta & \beta \\
\beta & \beta & \cdot & \alpha \\
\beta & \beta & \alpha & \cdot
\end{pmatrix}
$$
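The distinction between the two kinds of substitution is easy to encode. The sketch below classifies a substitution and looks up the corresponding K80 rate. The function names are illustrative, and the values of α and β in the example are made up.

```python
PURINES = {"A", "G"}       # two rings
PYRIMIDINES = {"C", "T"}   # one ring


def substitution_type(src, dst):
    """Return 'transition' (within a group) or 'transversion' (between groups)."""
    if src == dst:
        raise ValueError("not a substitution: both nucleotides are identical")
    same_group = {src, dst} <= PURINES or {src, dst} <= PYRIMIDINES
    return "transition" if same_group else "transversion"


def k80_rate(src, dst, alpha, beta):
    """K80 rate: alpha for transitions, beta for transversions."""
    return alpha if substitution_type(src, dst) == "transition" else beta


# Count both kinds among the 12 possible substitutions.
pairs = [(x, y) for x in "TCAG" for y in "TCAG" if x != y]
n_transitions = sum(substitution_type(x, y) == "transition" for x, y in pairs)
n_transversions = len(pairs) - n_transitions
print(n_transitions, n_transversions)
print(k80_rate("T", "C", alpha=2.0, beta=1.0))  # made-up rates
```

Of the 12 possible substitutions, 4 are transitions and 8 are transversions, which matches the pattern of α and β entries in the K80 matrix.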

Accounting for transition/transversion: HKY

HKY:

- named after [Hasegawa et al., 1984, Hasegawa et al., 1985]
- accounting for transitions (rate α) and transversions (rate β)
- after a long period of evolution, equilibrium frequencies are reached

Pyrimidines (one ring): T, C. Purines (two rings): A, G.

Substitution rates (rows: from, columns: to, both in the order T, C, A, G):

$$
\begin{pmatrix}
\cdot & \alpha\pi_C & \beta\pi_A & \beta\pi_G \\
\alpha\pi_T & \cdot & \beta\pi_A & \beta\pi_G \\
\beta\pi_T & \beta\pi_C & \cdot & \alpha\pi_G \\
\beta\pi_T & \beta\pi_C & \alpha\pi_A & \cdot
\end{pmatrix}
=
\begin{pmatrix}
\cdot & \alpha & \beta & \beta \\
\alpha & \cdot & \beta & \beta \\
\beta & \beta & \cdot & \alpha \\
\beta & \beta & \alpha & \cdot
\end{pmatrix}
\cdot
\begin{pmatrix}
\pi_T & 0 & 0 & 0 \\
0 & \pi_C & 0 & 0 \\
0 & 0 & \pi_A & 0 \\
0 & 0 & 0 & \pi_G
\end{pmatrix}
$$

Accounting for transition/transversion: TN93

TN93:

- named after [Tamura and Nei, 1993]
- accounting for different transition rates between T and C (rate α1) as well as A and G (rate α2)
- after a long period of evolution, equilibrium frequencies are reached

Substitution rates (rows: from, columns: to, both in the order T, C, A, G):

$$
\begin{pmatrix}
\cdot & \alpha_1\pi_C & \beta\pi_A & \beta\pi_G \\
\alpha_1\pi_T & \cdot & \beta\pi_A & \beta\pi_G \\
\beta\pi_T & \beta\pi_C & \cdot & \alpha_2\pi_G \\
\beta\pi_T & \beta\pi_C & \alpha_2\pi_A & \cdot
\end{pmatrix}
$$

A more general substitution model: GTR

GTR (REV):

- generalised time-reversible model
- based on three papers: [Tavaré, 1986, Yang, 1994, Zharkikh, 1994]

Substitution rates (rows: from, columns: to, both in the order T, C, A, G):

$$
\begin{pmatrix}
\cdot & a\pi_C & b\pi_A & c\pi_G \\
a\pi_T & \cdot & d\pi_A & e\pi_G \\
b\pi_T & d\pi_C & \cdot & f\pi_G \\
c\pi_T & e\pi_C & f\pi_A & \cdot
\end{pmatrix}
$$

Pros: quite flexible; time-reversible.
Cons: not completely general.
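As a hedged illustration of how these models nest, the sketch below builds a GTR rate matrix from six exchangeability parameters and four equilibrium frequencies, then checks time-reversibility through the detailed-balance condition π_i q_ij = π_j q_ji. HKY is obtained by setting a = f = α and b = c = d = e = β. The function names, the frequencies and the rate values are made up for illustration; this is not BEAST2 code.

```python
STATES = "TCAG"


def gtr_rate_matrix(exch, freqs):
    """GTR matrix: q_ij = exchangeability(i, j) * pi_j for i != j.

    exch maps unordered pairs such as "TC" to a rate parameter,
    freqs maps each state to its equilibrium frequency.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(STATES):
        for j, y in enumerate(STATES):
            if i != j:
                pair = x + y if x + y in exch else y + x
                q[i][j] = exch[pair] * freqs[y]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this is -(row sum)
    return q


def is_time_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for all i, j."""
    pi = [freqs[s] for s in STATES]
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
               for i in range(4) for j in range(4))


# Made-up equilibrium frequencies and HKY-style rates (alpha = 2, beta = 1).
freqs = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}
alpha, beta = 2.0, 1.0
hky_as_gtr = {"TC": alpha, "TA": beta, "TG": beta,
              "CA": beta, "CG": beta, "AG": alpha}
Q_hky = gtr_rate_matrix(hky_as_gtr, freqs)
reversible = is_time_reversible(Q_hky, freqs)

# Doubling a single rate in one direction only breaks reversibility.
Q_broken = [row[:] for row in Q_hky]
Q_broken[0][1] *= 2.0
still_reversible = is_time_reversible(Q_broken, freqs)
print(reversible, still_reversible)
```

The last step mimics what an unrestricted model permits: as soon as the rate from T to C is changed independently of the rate from C to T, detailed balance no longer holds.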

The most general substitution model (implemented in BEAST2 but not in BEAUti)

UNREST:

- unrestricted model first described in [Yang, 1994]
- each substitution has a (different) rate

Substitution rates (rows: from, columns: to, both in the order T, C, A, G):

$$
\begin{pmatrix}
\cdot & a & b & c \\
d & \cdot & e & f \\
g & h & \cdot & i \\
j & k & l & \cdot
\end{pmatrix}
$$

Pros: most general case; all other models are special cases of UNREST.
Cons: mathematically very complicated and not handy to use; not time-reversible.

Substitution models in BEAUti

    model     parameters   description
    JC69      1            all substitutions have the same rate
    K80       2            accounts for transitions and transversions; not in BEAUti
    HKY       2+3*         distinction between transitions and transversions, including equilibrium frequencies
    TN93      3+3*         different rates for transitions
    GTR       6+3*         general, but still time-reversible
    UNREST    12           most general, not time-reversible; not in BEAUti

* The three free equilibrium frequencies (the four frequencies sum to 1) can be empirically estimated from the alignment or inferred alongside the substitution rates.

The fundamental problem - again

[Figure: the ancestral sequence A C T T G A T G and the observed sequences of taxon 1 (A C T A G C T G), taxon 2 (A G T T G C T G) and taxon 3 (A C T T G A T G)]

Problem of phylogenetics: We observe sequences but not their evolutionary history. Thus we have to take all possible evolutionary trajectories into account.

So far we have determined rates of nucleotide substitutions. But we need probabilities.

Nucleotide substitutions as Markov chains (MC)

Definition of a Markov chain (see also [Ross, 1996]), applied to nucleotide substitutions:

- stochastic process, i.e. a series of random experiments through time
- lives on a state space (here T, C, A, G) and jumps to the different states, e.g. with probabilities p_TC and p_CT
- memorylessness: the probability of jumping to a state only depends on the current state

[Figure: the four states T, C, A, G at successive time points, with transition probabilities p_TA, p_TC and p_CC between them]

Why Markov chains are a great model for nucleotide substitutions

- memorylessness: a nucleotide substitution happens independently of the substitution history at this site
- the substitution rate matrix defines the transition probabilities
- applying linear algebra, we can calculate the transition probability matrix according to:

  $$
  P(t) = e^{Qt} = U \,\mathrm{diag}\!\left(e^{\lambda_1 t}, e^{\lambda_2 t}, e^{\lambda_3 t}, e^{\lambda_4 t}\right) U^{-1}
  $$

  where λ_1, ..., λ_4 are the eigenvalues of Q (not to be confused with the JC69 rate λ) and the columns of U are the corresponding eigenvectors
- the transition probabilities take into account every possible substitution path (Chapman-Kolmogorov theorem)
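The step from rates to probabilities can be tried out numerically. The sketch below approximates e^{Qt} in plain Python with a truncated power series combined with scaling and squaring. This is a swapped-in technique chosen to avoid external libraries; it is not the eigendecomposition shown above. It then checks the Chapman-Kolmogorov property P(s + t) = P(s) P(t) for a JC69 matrix. The rate 0.015 per day is the value used in the JC69 example of these slides; the helper names and the choices of `squarings` and `terms` are assumptions.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(Q t): power series of Q t / 2**squarings, then squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = mat_mul(term, m)
        term = [[x / k for x in row] for row in term]   # now m**k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


LAMBDA = 0.015  # substitutions per site per day
Q = [[-3 * LAMBDA if i == j else LAMBDA for j in range(4)] for i in range(4)]

P20, P30, P50 = mat_exp(Q, 20.0), mat_exp(Q, 30.0), mat_exp(Q, 50.0)
product = mat_mul(P20, P30)  # Chapman-Kolmogorov: should equal P50

# Known JC69 closed form for the diagonal entries, used as a cross-check.
closed_form = 0.25 + 0.75 * math.exp(-4 * LAMBDA * 50.0)
max_gap = max(abs(product[i][j] - P50[i][j]) for i in range(4) for j in range(4))
print(max_gap, P50[0][0], closed_form)
```

For production work a tested routine from a numerical library would be the safer choice. The point here is only that P(t) follows from Q and that multiplying transition matrices sums over all intermediate states.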

Example of transition probabilities: JC69

Substitution rates:

$$
Q = \begin{pmatrix}
-3\lambda & \lambda & \lambda & \lambda \\
\lambda & -3\lambda & \lambda & \lambda \\
\lambda & \lambda & -3\lambda & \lambda \\
\lambda & \lambda & \lambda & -3\lambda
\end{pmatrix},
\qquad P(t) = e^{Qt}
$$

Transition probability matrix:

$$
P(t) = \begin{pmatrix}
p_0(t) & p_1(t) & p_1(t) & p_1(t) \\
p_1(t) & p_0(t) & p_1(t) & p_1(t) \\
p_1(t) & p_1(t) & p_0(t) & p_1(t) \\
p_1(t) & p_1(t) & p_1(t) & p_0(t)
\end{pmatrix}
$$

$$
\text{with } p_0(t) = \frac{1}{4} + \frac{3}{4} e^{-4\lambda t}
\quad\text{and}\quad
p_1(t) = \frac{1}{4} - \frac{1}{4} e^{-4\lambda t}
$$

[Figure: transition probabilities p0(t) and p1(t) against time in days (0 to 100), for λ = 0.015 substitutions per site per day]
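The two closed-form expressions are simple enough to tabulate directly. The following sketch evaluates p0(t) and p1(t) for λ = 0.015 substitutions per site per day on a grid of 0 to 100 days, the range shown in the figure. The function name and the 20-day grid spacing are arbitrary choices.

```python
import math


def jc69_probabilities(rate, t):
    """Return (p0, p1): probability of the same / of one specific other nucleotide."""
    decay = math.exp(-4.0 * rate * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


LAMBDA = 0.015  # substitutions per site per day

# (time in days, p0, p1) on a coarse grid.
curve = [(t, *jc69_probabilities(LAMBDA, t)) for t in range(0, 101, 20)]
for t, p0, p1 in curve:
    print(f"t = {t:3d} days   p0 = {p0:.3f}   p1 = {p1:.3f}")
```

At every time point p0 + 3 p1 = 1, because each row of P(t) is a probability distribution over the four nucleotides.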

JC69: Stationary distribution

Suppose we have a sequence that evolves with rate λ = 2.2/3 × 10^-9 substitutions per site per year. We follow the evolution of 4 different sites with T at site 1, C at site 2, A at site 3 and G at site 4 at time point 0. How likely is it that, after time t has passed, there is a T, C, A or G at the four different positions? To answer this question, we follow the time evolution of the transition probability matrix P(t).

Entries of P(t) over time (p0: diagonal entries, p1: off-diagonal entries):

    time / years    p0      p1
    0               1       0
    4.5 × 10^8      0.46    0.18
    9 × 10^8        0.31    0.23
    1.8 × 10^9      0.25    0.25

- When t → ∞, the stationary distribution is reached.
- Any long sequence (e.g. TTTTTT...) at time 0 will be composed of equal amounts of T, C, A, G after time t → ∞.

JC69: Time transformation

The times we look at, e.g. in species evolution, are very often very large. Thus, instead of real time, we display an evolutionary time scale in terms of sequence distances. As one substitution happens at rate 3λ in JC69 (keep in mind that in other models the expected time to substitution is different!), we expect one substitution to happen after time 1/(3λ). This is due to exponentially distributed waiting times for an event happening at a certain rate. This means that we expect one substitution after 1/(2.2 × 10^-9) years ≈ 4.5 × 10^8 years in our example.

In JC69: t = d/(3λ), i.e. d = t × 3λ.

Trick from physics, compare units: [t] = years, and [d/(3λ)] = # substitutions / (# substitutions/year) = years.

    time / years       0    4.5 × 10^8    9 × 10^8    1.8 × 10^9
    d = time × (3λ)    0    1             2           4

4.5 × 10^8 years is the expected time to 1 substitution.
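The numbers on these two slides can be recomputed from the JC69 closed form. The sketch below uses the rate λ = 2.2/3 × 10^-9 substitutions per site per year from the text, converts each time point to the distance d = 3λt, and evaluates p0 and p1. The function names are illustrative.

```python
import math

RATE = 2.2 / 3 * 1e-9  # lambda, substitutions per site per year


def jc69_row(rate, t):
    """Return (p0, p1) of the JC69 transition probability matrix at time t."""
    decay = math.exp(-4.0 * rate * t)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def distance(rate, t):
    """Expected number of substitutions per site under JC69: d = 3 * lambda * t."""
    return 3.0 * rate * t


expected_wait = 1.0 / (3.0 * RATE)  # expected time to one substitution, in years

times = (0.0, 4.5e8, 9e8, 1.8e9)
table = [(t, distance(RATE, t), *jc69_row(RATE, t)) for t in times]

print(f"expected time to one substitution: {expected_wait:.3g} years")
for t, d, p0, p1 in table:
    print(f"t = {t:.2g} years   d = {d:.2f}   p0 = {p0:.3f}   p1 = {p1:.3f}")
```

The computed probabilities agree with the rounded values in the table to within roughly 0.01, and by d = 4 both are already close to the stationary value 0.25.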

Variable substitution rates across sites

Variable rates

- so far: all sites in the sequence evolve at the same rate
- but: substitution rates might differ over the genome
  - mutation rates might differ over sites
  - selective pressure might be different on the phenotypic level

We extend the existing models by replacing the constant rates by Γ-distributed random variables (notation: JC69+Γ, HKY+Γ, ...).

Example: JC69+Γ

λ ↦ λR

We replace the substitution rate λ by λR, where R is a Γ-distributed random variable with shape parameter α and mean 1.

[Figure: density g(r) of the Γ distribution for r between 0 and 3, with α = 0.2, 1, 2 and 20]

In BEAUti: change the Gamma Category Count to allow for rate variation. 4 to 6 categories normally work well.
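To see what replacing λ by λR does to a transition probability, the sketch below averages the JC69 diagonal entry p0 over Γ-distributed site rates in two ways: by sampling R with the standard library, and with the closed form that follows from the moment-generating function of the Γ distribution, E[e^{-4λRt}] = (1 + 4λt/α)^{-α} for shape α and mean 1. The option name Gamma Category Count suggests that BEAUti works with a discretised distribution; this sketch uses continuous sampling instead and does not reproduce that discretisation. The function names, the seed and the values λ = 0.015, t = 40, α = 2 are illustrative choices.

```python
import math
import random


def gamma_rate(alpha, rng):
    """Draw a relative rate R: shape alpha, scale 1/alpha, hence mean 1."""
    return rng.gammavariate(alpha, 1.0 / alpha)


def jc69_gamma_p_same(rate, t, alpha):
    """Closed-form average of p0 over R ~ Gamma(shape alpha, mean 1)."""
    return 0.25 + 0.75 * (1.0 + 4.0 * rate * t / alpha) ** (-alpha)


def jc69_gamma_p_same_mc(rate, t, alpha, n, rng):
    """Monte Carlo average of p0 over n sampled site rates."""
    total = 0.0
    for _ in range(n):
        r = gamma_rate(alpha, rng)
        total += 0.25 + 0.75 * math.exp(-4.0 * rate * r * t)
    return total / n


rng = random.Random(1)  # fixed seed so that the run is repeatable
exact = jc69_gamma_p_same(0.015, 40.0, 2.0)
approx = jc69_gamma_p_same_mc(0.015, 40.0, 2.0, 50_000, rng)
no_variation = 0.25 + 0.75 * math.exp(-4.0 * 0.015 * 40.0)
print(exact, approx, no_variation)
```

With rate variation the averaged p0 is larger than without it, because slowly evolving sites retain their starting nucleotide for longer than the fast sites lose it.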

Codons and data partitions

The codon sun

A codon consists of three nucleotides, translating to one of the 20 amino acids:

[Figure: the codon sun, from [Sanger, 2015]]

    Amino acid                      Three-letter    One-letter   Molecular
                                    abbreviation    symbol       weight
    Alanine                         Ala             A            89 Da
    Arginine                        Arg             R            174 Da
    Asparagine                      Asn             N            132 Da
    Aspartic acid                   Asp             D            133 Da
    Asparagine or aspartic acid     Asx             B            133 Da
    Cysteine                        Cys             C            121 Da
    Glutamine                       Gln             Q            146 Da
    Glutamic acid                   Glu             E            147 Da
    Glutamine or glutamic acid      Glx             Z            147 Da
    Glycine                         Gly             G            75 Da
    Histidine                       His             H            155 Da
    Isoleucine                      Ile             I            131 Da
    Leucine                         Leu             L            131 Da
    Lysine                          Lys             K            146 Da
    Methionine                      Met             M            149 Da
    Phenylalanine                   Phe             F            165 Da
    Proline                         Pro             P            115 Da
    Serine                          Ser             S            105 Da
    Threonine                       Thr             T            119 Da
    Tryptophan                      Trp             W            204 Da
    Tyrosine                        Tyr             Y            181 Da
    Valine                          Val             V            117 Da

Table from [Promega, 2015].

Example: Codon CTA

Overview of substitution rates to the same codon CTA (Leu) from the codons that differ from it at one position; the thickness of the arrows represents different rates.

[Figure: CTA (Leu) in the centre, surrounded by CTT (Leu), CTC (Leu), CTG (Leu), TTA (Leu), GTA (Val), ATA (Ile), CCA (Pro), CGA (Arg) and CAA (Gln)]

- synonymous substitutions: the amino acid (AA) does not change
- nonsynonymous substitutions: the AA does change
- bigger arrows: transition
- smaller arrows: transversion

Adapted from [Yang, 2014].

Varying substitution rates amongst the codon positions

[Bofkin and Goldman, 2007] have shown that in protein-encoding regions

- second codon positions evolve more slowly than first codon positions
- third codon positions evolve faster than first codon positions

⇒ Different codon positions can have different evolutionary rates. BEAST2 allows for estimating these rates separately.

Example file: BEAST2.4.x/examples/nexus/primate-mtDNA.nex
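Estimating rates per codon position presupposes that the alignment is split by codon position. The sketch below shows one way to do such a split in plain Python. It assumes that the coding region starts in frame at the first alignment column; the toy alignment, the taxon names and the function names are made up for illustration and do not come from the example file.

```python
def split_codon_positions(sequence):
    """Return the 1st, 2nd and 3rd codon positions of an in-frame sequence."""
    return sequence[0::3], sequence[1::3], sequence[2::3]


def partition_alignment(alignment):
    """Split {name: sequence} into three alignments, one per codon position."""
    parts = ({}, {}, {})
    for name, seq in alignment.items():
        for part, sub in zip(parts, split_codon_positions(seq)):
            part[name] = sub
    return parts


# Made-up toy alignment of three codons per taxon.
toy = {"taxon_1": "CTAGTACCA", "taxon_2": "CTGGTACCC"}
first, second, third = partition_alignment(toy)
print(first)
print(second)
print(third)
```

In this toy example the two sequences differ only at third codon positions, so only the third partition contains variable sites.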

Including the choice of substitution rate model into your BEAST analysis

Rate models in BEAST2

- BEAST2 allows for including different site models in your analysis (Site Model tab in BEAUti)
- Which site model is the best for your data?

⇒ package bModelTest: Bayesian site model selection for nucleotide data
⇒ package SubstBMA: modelling across-site variation in the nucleotide

References

- Bofkin, L. and Goldman, N. (2007). Variation in Evolutionary Processes at Different Codon Positions. Molecular Biology and Evolution, 24(2):513–521.
- Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA. Journal of Molecular Evolution, 22(2):160–174.
- Hasegawa, M., Yano, T., and Kishino, H. (1984). A New Molecular Clock of Mitochondrial DNA and the Evolution of Hominoids. Proceedings of the Japan Academy Series B-Physical and Biological Sciences, 60(4):95–98.
- Jukes, T. and Cantor, C. (1969). Evolution of protein molecules. Mammalian Protein Metabolism, pages 21–123.
- Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111–120.
- Promega (2015). The amino acids: https://www.promega.com/ /media/files/resources/technical references/amino acid abbreviations and molecular weights.pdf.
- Ross, S. M. (1996). Stochastic Processes. Second edition. Wiley.
- Sanger (2015). The codon sun: ftp://ftp.sanger.ac.uk/pub/yourgenome/downloads/activities/kras-cancer-mutation/krascodonwheel.pdf.
- Tamura, K. and Nei, M. (1993). Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution, 10(3):512–526.
- Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In Some mathematical questions in biology: DNA sequence analysis (New York, 1984), pages 57–86. Amer. Math. Soc., Providence, RI.
- Yang, Z. (1994). Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution, 39(1):105–111.
- Yang, Z. (2014). Molecular Evolution: A Statistical Approach. Oxford University Press.
- Zharkikh, A. (1994). Estimation of evolutionary distances between nucleotide sequences. Journal of Molecular Evolution, 39(3):315–329.