Assume an F84 Substitution Model with Nucleotide Frequ

Exercise Sheet 5
Computational Phylogenetics
Prof. D. Metzler

Exercise 1: Assume an F84 substitution model with nucleotide frequencies $(\pi_A, \pi_C, \pi_G, \pi_T) = (0.2, 0.3, 0.3, 0.2)$, a rate $\lambda = 0.1$ of “crosses” and a rate $\mu = 0.2$ of “bullets” (see lecture).

(a) Assume that the nucleotide at some site is A at time $t$. Calculate the probabilities that the nucleotide is A, C or G at time $t + 0.2$.

(b) Assume the nucleotide distribution in a genomic region is $(0.1, 0.2, 0.3, 0.4)$ at time $t$, but from this time on the genomic region evolves according to the model above. Calculate the expectation values for the nucleotide distributions at time points $t + 0.2$ and $t + 2$.

Exercise 2: Calculate the rate matrix for the nucleotide substitution process for which the substitution matrix for time $t$ is

$$
S(t) = \begin{pmatrix}
P_{A \to A}(t) & \dfrac{1 - e^{-t/10}}{10} & \dfrac{21 + 9e^{-t/10} - 30e^{-t/5}}{70} & \dfrac{1 - e^{-t/10}}{5} \\[2ex]
\dfrac{2 - 2e^{-t/10}}{5} & P_{C \to C}(t) & \dfrac{3 - 3e^{-t/10}}{10} & \dfrac{3 + 7e^{-t/10} - 10e^{-t/5}}{15} \\[2ex]
\dfrac{14 + 6e^{-t/10} - 20e^{-t/5}}{35} & \dfrac{1 - e^{-t/10}}{10} & P_{G \to G}(t) & \dfrac{1 - e^{-t/10}}{5} \\[2ex]
\dfrac{2 - 2e^{-t/10}}{5} & \dfrac{3 + 7e^{-t/10} - 10e^{-t/5}}{30} & \dfrac{3 - 3e^{-t/10}}{10} & P_{T \to T}(t)
\end{pmatrix},
$$

where the diagonal entries $P_{A \to A}(t)$, $P_{C \to C}(t)$, $P_{G \to G}(t)$ and $P_{T \to T}(t)$ are the values for which each row of $S(t)$ sums to 1.

Exercise 3:

$$
M = \begin{pmatrix} 0.5 & 0.5 \\ 0.2 & 0.8 \end{pmatrix}
$$

(a) Calculate with paper and pencil the eigenvalues and the left and the right eigenvectors of the matrix $M$.

(b) To what distribution will a Markov process converge if $M$ is its transition matrix?

(c) Calculate $M^{100}$ with paper, pencil and pocket calculator (or R, but only for calculations that are also possible with a cheap pocket calculator).

Exercise 4: Calculate the transition matrix for $t = 0.4$ for the nucleotide-substitution model with rate matrix

$$
Q = \begin{pmatrix}
-0.10 & 0.03 & 0.040 & 0.030 \\
0.02 & -0.13 & 0.020 & 0.090 \\
0.04 & 0.03 & -0.085 & 0.015 \\
0.02 & 0.09 & 0.010 & -0.120
\end{pmatrix}.
$$

(a) Calculate this in R using the command expm of the Matrix package.
(b) Compare this to the solution that you get when using the approximation $(I + (t/n) \cdot Q)^n$ of the matrix exponential, where $I$ is the identity matrix and the matrix power is calculated in R by iterating the usual matrix multiplication (R operation: %*%). That is, avoid calculating eigenvectors and eigenvalues.

(c) Compare both solutions to what you get when you calculate the matrix exponential with the eigenvectors calculated by the R command eigen. You can also use R matrix operations and functions like t, diag and solve, but no functions from special packages for computing matrix exponentials.

Exercise 5: Let nucleotide frequencies $(\pi_A, \pi_C, \pi_G, \pi_T)$ and mutation rates $\alpha$ and $\beta$ of the HKY model be given. Under which conditions can you find rates $\lambda$ and $\mu$ of the Felsenstein 84 model such that the transition matrices of the two models are the same, and what are the appropriate values for $\lambda$ and $\mu$? Discuss also the opposite direction.

The F84 rate matrix is:

$$
\begin{pmatrix}
-\lambda(1 - \pi_A) - \dfrac{\mu\pi_G}{\pi_A + \pi_G} & \lambda\pi_C & \lambda\pi_G + \dfrac{\mu\pi_G}{\pi_A + \pi_G} & \lambda\pi_T \\[2ex]
\lambda\pi_A & -\lambda(1 - \pi_C) - \dfrac{\mu\pi_T}{\pi_C + \pi_T} & \lambda\pi_G & \lambda\pi_T + \dfrac{\mu\pi_T}{\pi_C + \pi_T} \\[2ex]
\lambda\pi_A + \dfrac{\mu\pi_A}{\pi_A + \pi_G} & \lambda\pi_C & -\lambda(1 - \pi_G) - \dfrac{\mu\pi_A}{\pi_A + \pi_G} & \lambda\pi_T \\[2ex]
\lambda\pi_A & \lambda\pi_C + \dfrac{\mu\pi_C}{\pi_C + \pi_T} & \lambda\pi_G & -\lambda(1 - \pi_T) - \dfrac{\mu\pi_C}{\pi_C + \pi_T}
\end{pmatrix}
$$

You can insert positive values for the F84 parameters.

Exercise 6: Based on the fossil record for lizards it is claimed that the most recent common ancestor (MRCA) of Sphenodon punctatus and Cnemidophorus tigris is at least 228.0 Mya old, the MRCA of Cnemidophorus tigris and Rhineura floridana is at least 113 Mya old, the MRCA of Rhineura floridana and Gallotia galloti is at least 64.2 Mya old, and the MRCA of Timon lepidus and Dalmatolacerta oxycephala is at least 5.3 Mya old.

(a) Use this information to estimate the phylogeny of lizards with time calibration.

(b) How does this result depend on the choice of model for the relaxed clock?

(c) How does the result change if you use only two or three of the given calibration points?

(d) How well are the given calibration points supported by fossils documented in the following references?
Science (1990) 249:1020-1023
J Vertebr Paleontol (2006) 26:795-800
J Vertebr Paleontol (2002) 22:286-298
J Paleontol (1985) 59:1481-1485

Exercise 7: Find publications about at least two different genera where fossils have been used to time-calibrate phylogenetic trees. Find out in detail which traits of the fossils were used for their taxonomic classification (you may have to trace this back in other publications) and how this was used in the phylogeny analysis.

Exercise 8:

(a) Implement Felsenstein’s pruning algorithm in R as a function that reads a tree in newick format, a sequence alignment and parameters of the F84 model and returns the log-likelihood of the tree. You can use the R script pruning.R as a starting point.

(b) Discuss how your implementation can be made more efficient.

Exercise 9: Compute the stationary distributions of the pfold rate matrix and the PAM substitution rate matrix. Are these evolutionary dynamics (almost) reversible? (You can use R or a similar program to solve this exercise. In R the function eigen may be helpful.)

Exercise 10: In the lecture we discussed four different methods to compute a substitution matrix $S(t)$ from a rate matrix $R$. Investigate for the PAM rate matrix for amino acids, for the pfold rate matrix for RNA stem basepairs and for the following F84 rate matrix how efficient, how accurate and how numerically stable these methods are. The PAM rate matrix and the pfold rate matrix will be available from the website of the lecture. (Note that the conventions about matrix notation are not always strictly followed in Bioinformatics. Sometimes transposed matrices are given, i.e. the roles of rows and columns are interchanged. You should always check for this with any matrix that is given to you!)
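As an illustration of the approximation named in Exercise 4(b), the following minimal sketch computes $(I + (t/n) \cdot Q)^n$ by repeated matrix multiplication, without eigenvalues or eigenvectors. The exercise itself asks for R; this version uses plain Python only to show the arithmetic. The function names `mat_mul` and `approx_expm` and the choice $n = 1000$ are assumptions made for this sketch, not part of the sheet.

```python
# Sketch: approximate exp(t*Q) by (I + (t/n)*Q)^n using only repeated
# matrix multiplication. Function names and n = 1000 are illustrative choices.

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]

def approx_expm(q, t, n):
    """Return (I + (t/n) * q)^n, computed by n successive multiplications."""
    size = len(q)
    # One small step: identity plus (t/n) times the rate matrix.
    step = [[(1.0 if i == j else 0.0) + (t / n) * q[i][j] for j in range(size)]
            for i in range(size)]
    # Start from the identity and multiply by the step matrix n times.
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, step)
    return result

# Rate matrix Q from Exercise 4 (each row sums to zero).
Q = [
    [-0.10, 0.03, 0.040, 0.030],
    [0.02, -0.13, 0.020, 0.090],
    [0.04, 0.03, -0.085, 0.015],
    [0.02, 0.09, 0.010, -0.120],
]

P = approx_expm(Q, 0.4, 1000)
for row in P:
    print(" ".join(f"{x:.6f}" for x in row))
```

Because each row of $Q$ sums to zero, each row of the step matrix sums to one, so every row of the result also sums to one (up to rounding). Increasing `n` brings the result closer to the matrix exponential, at the cost of more multiplications.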
