Zoogeografia (Zoogeography)
Dr. Valerio Ketmaier

Gene trees

DNA sequences can be used to infer phylogenies.

DNA sequence alignment

Unaligned sequences:
Human      GCGCTCGGGTCTAGCCTCT
Chimp      GCGATTCGGGTTAGCCTCT
Gorilla    GCGTCGCGTCTAGTCTCT
Orangutan  GGCTTTGGTCCAGCGCT

Aligned sequences:
Human      GCGCT-CGGGTCTAGCCTCT
Chimp      GCGATTCGGGT-TAGCCTCT
Gorilla    GCG--TCGCGTCTAGTCTCT
Orangutan  -GGCTTTGG-TCCAG-CGCT

Only aligned sequences can be analysed.

Informative sites

Informative sites are the sites at which the mutant allele is present at least twice in the dataset.

Identical sequences: unresolved tree
1 …GCGCTTCGGCTCCTGCGTGCTTAG…
2 …GCGCTTCGGCTCCTGCGTGCTTAG…
3 …GCGCTTCGGCTCCTGCGTGCTTAG…
4 …GCGCTTCGGCTCCTGCGTGCTTAG…

Each mutation present in only one sequence: unresolved tree
1 …GCGCTTTGGCTCCTGCGTGCTTAG…
2 …GCGCTTCGGCTCTTGCGTGCTTAG…
3 …GCGCTTCGGCTCCTGCGTACTTAG…
4 …GCGCTTCCGCTCCTGCGTGCTTAG…

Mutations shared by two sequences (informative sites): resolved tree
1 …GCGCTTTGGCTCCTGAGTGCTTAG…
2 …GCGCTTCGGCTCTTGAGTGCTTAG…
3 …GCGCGTCGGCTCCTGCGTACTTAG…
4 …GCGCGTCCGCTCCTGCGTGCTTAG…

Phylogeny reconstruction methods

GCGCTTCGGCTCCT
GCGATTCGGCTTCT
GCGCTTCGCCTTCT
GGGATTTGGCCCCG

• Neighbor-joining (NJ)
• Maximum parsimony (MP)
• Maximum likelihood (ML)
• Bayesian

Distance-based phylogeny reconstruction (UPGMA, least squares, neighbour-joining)

Sequence distances:
        Seq1  Seq2  Seq3  Seq4
Seq1     -
Seq2     5     -
Seq3     1     4     -
Seq4    21    22    20     -

[Figure: tree of Seq1, Seq3, Seq2 and Seq4, with Seq4 as the outgroup.]

"Brackets notation": (((Seq1,Seq3),Seq2),Seq4);

How to calculate distances? (saturation problem)

Saturation

[Figure: the number of differences plotted against time, expected versus observed, for Human compared with Chimp, Red colobus monkey and Mouse. The difference between the expected and the observed curve is due to saturation.]

Cumulative number of substitutions in one evolving sequence:
…GCGCTTCGGC…   0
…GCGTTTCCGC…   2
…GCGTATTCGC…   4
…GCGCATTCGC…   5
…ACCCATACGC…   8
…TCCCATACTC…  10
…TCCCACACTC…  11
…TTGCACACTC…  13
…TTGCGCACTT…  15

Differences between the first and the last sequence: observed 8, actual 15. Some sites changed once, others changed more than once.

Multiple mutations at the same site result in underestimation of evolutionary distances.

How to calculate distances?

Substitution probability matrix:
From\to   A    C    G    T
A         Paa  Pac  Pag  Pat
C         Pca  Pcc  Pcg  Pct
G         Pga  Pgc  Pgg  Pgt
T         Pta  Ptc  Ptg  Ptt

Jukes & Cantor 1969:
From\to   A    C    G    T
A         -    α    α    α
C         α    -    α    α
G         α    α    -    α
T         α    α    α    -

Kimura 1980 (α for transitions, β for transversions):
From\to   A    C    G    T
A         -    β    α    β
C         β    -    β    α
G         α    β    -    β
T         β    α    β    -

Model-based corrections of observed distances:
• Jukes & Cantor 1969 (JC): d = -3/4 ln(1 - 4p/3) (p: proportion of nucleotides that differ between two sequences)
• Kimura 1980 (K2P): d = ln[1/(1 - 2s - v)]/2 + ln[1/(1 - 2v)]/4 (s and v: proportions of transitions and transversions)
• Felsenstein 1981 (F81)
• Hasegawa, Kishino & Yano 1985 (HKY85)
• & many others…

Maximum Parsimony Method

The method predicts the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences.

Step 0: Input: multiple sequence alignment.
Step 1: For each aligned position, identify the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes.
Step 1.5: Continue the analysis for every position in the sequence alignment.
Step 2: Sequence variations at each site in the alignment are placed at the tips of the trees. Identify the tree (or trees) that produce the smallest number of changes overall for all sequence positions.

Because all possible trees are examined, the method is guaranteed to find the best (most parsimonious) tree, but it is best suited to sequences that are quite similar and to a small number of sequences.
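The parsimony steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own implementation: it scores the three possible unrooted topologies for four sequences using Fitch's algorithm, a standard way to count the minimum number of changes on a fixed tree, which the slides do not name. The four sequences are the "resolved tree" example from the informative-sites slide. The function names and the nested-tuple encoding of trees are assumptions made for this sketch.

```python
# Assumed input: the four "resolved tree" sequences from the informative-sites slide.
ALIGNMENT = {
    "Seq1": "GCGCTTTGGCTCCTGAGTGCTTAG",
    "Seq2": "GCGCTTCGGCTCTTGAGTGCTTAG",
    "Seq3": "GCGCGTCGGCTCCTGCGTACTTAG",
    "Seq4": "GCGCGTCCGCTCCTGCGTGCTTAG",
}

# The three possible unrooted topologies for four taxa, written as nested
# tuples (hypothetical encoding; the root is placed on the internal branch).
TOPOLOGIES = [
    (("Seq1", "Seq2"), ("Seq3", "Seq4")),
    (("Seq1", "Seq3"), ("Seq2", "Seq4")),
    (("Seq1", "Seq4"), ("Seq2", "Seq3")),
]


def fitch_site(tree, column):
    """Return (possible ancestral states, number of changes) for one column."""
    if isinstance(tree, str):          # a tip: its state is observed
        return {column[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_site(left, column)
    right_states, right_changes = fitch_site(right, column)
    shared = left_states & right_states
    if shared:                         # children agree: no extra change
        return shared, left_changes + right_changes
    # children disagree: one extra change on this part of the tree
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all aligned positions."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


scores = {tree: parsimony_score(tree, ALIGNMENT) for tree in TOPOLOGIES}
best_tree = min(scores, key=scores.get)

for tree, score in scores.items():
    print(tree, score)
print("most parsimonious:", best_tree)
```

With these sequences, ((Seq1,Seq2),(Seq3,Seq4)) needs 6 changes and each of the other two topologies needs 8. The difference comes from the two informative sites; each site where only one sequence differs costs one change on every topology, so it cannot discriminate between trees.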
Maximum likelihood approach

The method uses probability calculations to find the tree that best accounts for the variation in a set of sequences.

It is similar to the maximum parsimony method in that the analysis is performed on each column of a multiple sequence alignment. All trees are considered.

Because the rate of appearance of new mutations is very small, the more mutations are needed to fit a tree to the data, the less likely that tree is.

Start with an evolutionary model of sequence change that provides estimates of the rates of substitution of one base for another (transitions and transversions). Rows and columns are in the order A, C, G, T; each entry is the rate of change from the row base to the column base:

$$
Q = \begin{pmatrix}
-u(a\pi_C + b\pi_G + c\pi_T) & ua\pi_C & ub\pi_G & uc\pi_T \\
ug\pi_A & -u(g\pi_A + d\pi_G + e\pi_T) & ud\pi_G & ue\pi_T \\
uh\pi_A & uj\pi_C & -u(h\pi_A + j\pi_C + f\pi_T) & uf\pi_T \\
ui\pi_A & uk\pi_C & ul\pi_G & -u(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}
$$

8. Lecture WS 2003/04 Bioinformatics III

Maximum likelihood approach

Step 1: Align the set of sequences.
Step 2: Examine the substitutions in each column for their fit to a set of trees that describe possible phylogenetic relationships among the sequences.

Each tree has a certain likelihood based on the series of mutations that are required to give the sequence data. The probability of each tree is the product of the mutation rates in each branch of the tree, where the mutation rate in a branch is the rate of substitution in that branch times the branch length:

$$
P(\text{tree}_i) = \prod_{\text{branch}_1(i)}^{\text{branch}_n(i)} \text{mutation rate}
= \prod_{\text{branch}_1(i)}^{\text{branch}_n(i)} \left(\text{rate of substitution in branch}(i) \times \text{length of branch}(i)\right)
$$

Advantage of the maximum likelihood approach: it allows trees to be evaluated when mutation rates vary among lineages, and it can be used for more diverse sequences.

Disadvantage: computationally intensive.
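To make the likelihood idea concrete, the sketch below reduces the problem to its smallest case: a single branch connecting two aligned sequences, under the Jukes & Cantor model, in which all substitutions are equally likely. It is an illustration under those assumptions, not the general tree search described above. The function names, the grid search and the example counts (100 sites, 10 differences) are hypothetical. The branch length that maximizes the likelihood should coincide with the JC correction d = -3/4 ln(1 - 4p/3).

```python
import math


def jc_log_likelihood(n_sites, n_diff, d):
    """Log-likelihood of two aligned sequences separated by branch length d
    (expected substitutions per site) under the Jukes & Cantor model."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    return (n_sites * math.log(0.25)                # equal base frequencies
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))


def ml_distance(n_sites, n_diff, step=1e-4, n_steps=30000):
    """Crude grid search for the branch length with the highest likelihood."""
    best_d, best_ll = None, -math.inf
    for i in range(1, n_steps + 1):
        d = i * step
        ll = jc_log_likelihood(n_sites, n_diff, d)
        if ll > best_ll:
            best_d, best_ll = d, ll
    return best_d


def jc_distance(p):
    """Closed-form JC correction from the distance slide."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


n_sites, n_diff = 100, 10          # hypothetical counts
d_hat = ml_distance(n_sites, n_diff)
print("ML branch length:", round(d_hat, 4))
print("JC formula:      ", round(jc_distance(n_diff / n_sites), 4))
```

A real maximum likelihood analysis multiplies such per-site probabilities over all branches of each candidate tree and optimizes the branch lengths jointly, which is why the method is computationally demanding.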
Infer relationships among three species (A, B, C) with an outgroup: three possible trees (topologies).

[Figure: the prior probability distribution over the three topologies, combined with the model and the data (observations), gives the posterior probability distribution.]

Molecular divergence is clock-like

DNA divergence between the species (Human, Orangutan):
D ~ 3% (DNA divergence)
T ~ 10 MY (time of divergence)
m = D / 2T

[Figure: the rate of evolution of haemoglobin (from Kimura 1983). Amino acid divergence (0 to 1) plotted against time (0 to 500 Myr); shows pairwise divergence of the protein sequence of haemoglobin for 13 pairs of species.]

The rate of molecular evolution at the protein level seems to be too constant to be explained by natural selection.

Living fossils (100s of MY old): molecular evolution continues

Examples: Platypus, Echidna, Latimeria, Port Jackson shark, Ginkgo biloba, Cycas circinalis.

Amino acid divergence of two haemoglobin genes (α & β) within the same species: Human 147, Shark 150. Rates of protein evolution in the human and shark lineages are approximately equal.

Molecular clock

[Figure: tree of seq1, seq2 and seq3 with an outgroup; two sequences, GCGCATCGTGCCTGGCTTGT and GCGGTTCGGGTCTAGCCTCT, separated by divergence D after time T.]

D: divergence
r: rate of divergence (not always known)

D = rT1 + rT2 = 2rT (both lineages have evolved for the same time since they split, T1 = T2 = T)
Thus, T = D / 2r

Assuming a molecular clock (r is constant over time), it is possible to estimate the time of divergence, T (e.g. Human / chimp divergence).

[Figure: molecular clock in Hawaiian honeycreeper (Bromham and Penny 2003 Nat Rev Genet).]

Molecular clock calibration (estimation of divergence rate, r)

[Figure: tree of seq1, seq2 and seq3 with divergences D1 and D2; one split occurred T1 years ago, the time of the other split, T2, is unknown.]

If the time of divergence (T1) is known, it is possible to estimate the rate of divergence (r):
r = D1 / 2T1

Assuming r is the same throughout the tree, we can estimate T2:
T2 = D2 / 2r

Dating events with the molecular clock

[Figure: combined sequence of hemoglobins alpha and beta, cytochrome c, and fibrinopeptide A. Given the number of differences, the molecular clock estimates the divergence. Graur and Li (2000).]

Find a slow-down in apes and monkeys and a speed-up in horse-monkey (Human/horse).

Molecular clock rates (from slow to fast)
• Plant mitochondrial DNA (~5 × 10^-10)
• Chloroplast DNA (~10^-9)
• Amino acid substitutions in proteins (varies)
• Nuclear DNA (~10^-8)
• Animal mitochondrial DNA (~5 × 10^-8)

Variation in molecular clock rate

1. Generation time: A shorter generation time will accelerate the clock because it shortens the time to fix new mutations.
2. Mutation rate: Species-characteristic differences in polymerases or other biological properties that affect the fidelity of DNA replication, and hence the incidence of mutations.
3. Gene function: Changes in the function of a protein as evolutionary time proceeds. This might particularly be expected in the case of gene duplication.
4. Natural selection: Organisms are continually adapting to their physical and biotic environments, which change endlessly in patterns that are unpredictable and differently significant to different species.

(Pereira and Backer 2006 MBE)

Violation of molecular clock

[Figure: tree of Mouse, Opossum and Human with different rates r1 and r2 on different branches.]

The clock is not always the same in different species. E.g. species with shorter generation times have a faster molecular clock ("generation time effect").

Molecular clock rate: body mass effect in animals

Cytochrome b sequences of tube-nosed seabirds. The larger the body, the longer the generation time and the slower the molecular clock. The authors claim that taxa with larger body mass have a slower molecular clock (Nunn and Stanley 1998 MBE).

How reliable is the tree?
The bootstrap tests the robustness of the topology of the tree.

[Figure: tree of Seq1, Seq2, Seq3 and Seq4 with internal nodes A and B.]

How reliable are nodes A & B?

Original dataset:
1 …GCTTTGGCCTGAGTGCAG…
2 …GCTTCGGCTTGAGTGCAG…
3 …GCGTCGGCCTGCGTACAG…
4 …GCGTCCGCCTGCGTGCAG…

Repeat 1000 times:
• Choose a random position
• Take the column of nucleotides
• Add it to the new dataset
• Repeat until the length of the new dataset = the length of the old dataset
• Make a tree from the new dataset

New dataset:
1 …G G A GCTTGGTCGTTGCGA…
2 …G G A GTTCGGTTGTTGCGA…
3 …G A C GCTCGGTCGGTACGA…
4 …G G C GCTCGCTCGGTGCGA…

[Figure: tree built from the new dataset, with tips Seq1, Seq2, Seq3 and Seq4.]

See how often the nodes A & B are present in the bootstrap replicates.

Levantina
• Museum material
• Two mitochondrial genes (COI; 16S)

Table 1. Taxa included in the study and their geographic origin. For each individual we detail the presence/absence of the umbilicus in the shell, the voucher number in the collections of the Zoologisches Museum Hamburg (ZMH) and the Museum für Naturkunde Berlin (ZMB), the composite COI/16S haplotype identifier number and the corresponding GenBank Accession numbers (COI and 16S separately).

Taxon Umbilicus