Phylogenetic Inference

Christian M. Zmasek, PhD
https://sites.google.com/site/cmzmasek/home
GABRIEL Network/J. Craig Venter Institute
APPLICATIONS OF GENOMICS & BIOINFORMATICS TO INFECTIOUS DISEASES
2017-12-07

Overview
• General concepts, common misconceptions
• Tree of Life (Eukaryotic, Bacterial, Viral)
• Homologs, gene duplications, orthologs, …
• Methods
• Unix command line refresher
• Multiple Sequence Alignment (MAFFT)
• Pairwise distance calculation (Phylip)
• Distance based methods (Neighbor Joining, FastME)
• Maximum Likelihood methods (RAxML, PhyML)
• Bayesian methods (MrBayes, BEAST)
• Visualization: Archaeopteryx
• Selection Analysis ("dN/dS", Datamonkey ~ HyPhy)
• Gene Duplication Inference (GSDI algorithm)
• Selected papers are available for download (as a zipped archive): https://goo.gl/o2NPDj

Why perform phylogenetic inference?
• To infer the evolutionary relationships amongst different species/classes/sub-classes/strains/… of organisms
• To infer the evolutionary relationships amongst molecular sequences (genes, proteins)
• To infer the functions of genes/proteins
• Paper: "Eisen_1998_Phylogenomics"
• To use the resulting tree as the basis for additional analyses

Theoretical Background
• A phylogeny is the evolutionary history of a species or a group of species
• More recently, the term has also been applied to the evolutionary history of individual DNA or protein sequences
• The evolutionary history of organisms or sequences can be illustrated using a tree-like diagram: a phylogenetic tree

Figure: A phylogenetic tree proposed in 1866 by Häckel

Many Misconceptions about Phylogenetic Trees!
• Example of a misconception: "order of the external nodes provides information about their relatedness"
• The order of external nodes is meaningless!
• Paper: "Ryan_2008_Understanding_evolutionary_trees"

Types of Trees (Displays)
• Rooted vs. unrooted
• Cladogram vs.
Phylogram

Eukaryotic Tree of Life
• Still not resolved
• Two major groups (probably):
• Unikonta (single, or no, flagellum)
• Bikonta (two flagella)
• No monophyletic group of "protists"
• Papers:
• "Cavalier-Smith_2015_Multiple-origins"
• "Zmasek_20111_Strong_functional_patterns"
• "Roger_2009_Revisiting_the_root_of_the_eukaryote_tree"
• "Baldauf_2003_The_Deep_Roots_of_Eukaryotes"

Bacterial Tree of Life
Based on a concatenated set of 16 ribosomal protein sequences
Paper: "Hug_2016_A_new_view_of_the_tree_of_life"

Viruses
• No universal "tree of life" for viruses
• Instead, "superfamilies" of (probably unrelated) viruses:
• Double-stranded RNA Viruses (monophyly uncertain)
• Single-stranded Negative Sense RNA Viruses (monophyly uncertain)
• Single-stranded Positive Sense RNA Viruses (monophyly uncertain)
• Single-stranded DNA Viruses (non-monophyletic)
• Double-stranded DNA Viruses (non-monophyletic)
• DNA-RNA Reverse Transcribing Viruses (monophyly uncertain)
• Papers:
• "Castro-Nallar_2012_The_evolution_of_HIV"
• "Forterre_2013_The_major_role_of_viruses_in_cellular_evolution"
• "Koonin_2013_A_virocentric_perspective_on_the_evolution_of_life"
• "Krupovic_2013_Networks_of_evolutionary_interactions"

A special case? Nucleocytoplasmic large DNA viruses
• Nucleocytoplasmic large DNA virus (NCLDV) superfamily
• Diverse group of viruses that infect a wide range of eukaryotic hosts (e.g. vertebrates, insects, single-celled organisms)
• Huge range in genome size (between 100 kb and 1.2 Mb)
• Examples:
• Mimiviridae
• Marseilleviridae
• Phycodnaviridae
• Poxviridae
• Papers:
• "Krupovic_2013_Networks_of_evolutionary_interactions"
• "Nasir_2012_Giant_viruses_coexisted"

Nucleocytoplasmic large DNA viruses (NCLDV)
Figure: Bayesian Inference (BI) tree based on conserved regions of DNA polymerase B
Paper: "Fischer_2010_Giant_virus_with_a_remarkable_complement"

Gene Trees/Species Trees
• Initially, phylogenetic trees were built based on the morphology of organisms.
• Around 1960, molecular sequences were recognized as containing phylogenetic information and hence as valuable for tree building
• A tree built from sequence data is called a gene tree, since it represents the evolutionary history of genes
• A tree illustrating the evolutionary history of organisms is called a species tree

Figure: A gene tree which is also a species tree

Figure: A gene tree of orthologs and paralogs based on Bcl-2 family protein sequences

The number of all possible tree topologies …
… quickly becomes larger than the number of all hydrogen atoms in the universe
• The number of different tree topologies increases rapidly with the number of external nodes. The number of topologies $T_p$ for unrooted, completely binary trees with $N$ external nodes is:

$$T_p = \frac{(2N-5)!}{2^{N-3}\,(N-3)!}$$

• $T_p(N=5) = 15$
• $T_p(N=10) \approx 2 \times 10^{6}$
• $T_p(N=20) \approx 2 \times 10^{20}$
• $T_p(N=100) \approx 1 \times 10^{182}$

Homologs
• Homologs are defined as sequences which share a common ancestor (Fitch, 1966)
• This definition becomes unclear when mosaic proteins, which are composed of structural units originating from different genes, are considered
• Phylogenetic trees make sense only if they are constructed from homologous sequences (whole genes/proteins, or domains)

Figure: Globin family, an example of homologous proteins

Orthologs, Paralogs, Xenologs
• Homologous sequences can be divided into orthologs, paralogs and xenologs:
• Orthologs: diverged by a speciation event (their last common ancestor on a phylogenetic tree corresponds to a speciation event)
• Paralogs: diverged by a duplication event (their last common ancestor corresponds to a duplication)
• Xenologs: related to each other by horizontal gene transfer (via retroviruses, for example)

Figure: Orthologs, Paralogs example

Caveat emptor: Orthology vs. Function
• Orthologous sequences tend to have more similar “functions” than paralogs
• Yet: orthologs are mathematically defined, whereas there is no definition of sequence “function” (i.e.
it is a subjective term)

Gene Duplication: Significance
• New genes evolve if mutations accumulate while selective constraints are relaxed by gene duplication
• First recognized by Haldane (“… it [mutation pressure] will favour polyploids, and particularly allopolyploids, which possess several pairs of sets of genes, so that one gene may be altered without disadvantage…”)

Gene Trees Vs. Species Trees: How Gene Duplications Can Be Detected
[Figure: species tree S and gene trees G1 and G2, each with the external nodes Wheat, Rat, and Human]

Rooting
• Almost all methods and algorithms produce unrooted or randomly rooted trees!
• Rooting by:
• Midpoint-rooting (minimizing overall tree height)
• Known "outgroup"
• Minimizing gene duplications
• …

Methods
Multiple sequence alignment of homologous sequences
Pairwise distance calculation
Algorithmic Methods Based on Pairwise Distances:
• Neighbor Joining
Optimality Criteria Based on Pairwise Distances:
• Fitch-Margoliash
• Minimal Evolution
Optimality Criteria Based on Character Data:
• Maximum Parsimony
• Maximum Likelihood
Bayesian Methods (MCMC)
[Diagram labels: “More accurate”, Fast (in general)]

Pairwise Distance Calculation
The simplest way to measure the distance between two amino acid sequences is their fractional dissimilarity $p$, where $n_d$ is the number of aligned sequence positions containing non-identical amino acids and $n_s$ is the number of aligned sequence positions containing identical amino acids:

$$p = \frac{n_d}{n_d + n_s}$$

• Unfortunately, this is unrealistic because it does not take into account:
• superimposed changes: multiple mutations at the same sequence location
• different chemical properties of amino acids: for example, changing leucine into isoleucine is more likely and should be weighted less than changing leucine into proline
• A more realistic approach for estimating evolutionary distances is to apply maximum likelihood to empirical amino acid
replacement models, such as PAM transition probability matrices.
• The likelihood $L_H$ of a hypothesis $H$ (an evolutionary distance, for example) given some data $D$ (an alignment, for example) is the probability of $D$ given $H$: $L_H = P(D \mid H)$

Algorithmic Methods Based on Pairwise Distances
• UPGMA
• Neighbor Joining

UPGMA vs …
• UPGMA stands for unweighted pair group method using arithmetic averages
• It is a clustering method
• The algorithm produces rooted trees under the assumption of a molecular clock
• Do not use!

… Neighbor Joining
• As opposed to UPGMA, neighbor joining (NJ) is not misled by the absence of a molecular clock
• NJ produces phylogenetic trees (not cluster diagrams)

Optimality Criteria Based on Pairwise Distances
• Fitch-Margoliash
• Minimal evolution (ME)

Fitch-Margoliash
An optimal tree is selected by minimizing the disagreement $E$ between the tree and the pairwise distances estimated from a multiple alignment.

Minimal Evolution
Branch lengths are fitted to a tree according to an unweighted least-squares criterion, but the optimality criterion used to evaluate and compare trees is the sum of all branch lengths, which is minimized.

Optimality Criteria Based on Character Data
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)

Maximum Parsimony
• Evaluate a given topology
• Example:
• Sequence1: TGC
• Sequence2: TAC
• Sequence3: AGG
• Sequence4: AAG

Maximum Likelihood
• Probabilistic methods can be used to assign a likelihood to a given tree and therefore allow the selection of the tree that is most likely given the observed sequences.
• Probability for one residue $a$ to change to $b$ in time $t$ along a branch of a tree: $P(b \mid a, t)$
• Its actual calculation depends on which model of sequence evolution is used.
• Poisson process:
• $P(b \mid a, t) = \frac{1}{20} + \frac{19}{20} e^{-ut}$ for

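The maximum likelihood view of pairwise distances, $L_H = P(D \mid H)$, can be illustrated with a small Python sketch that uses the Poisson process as the replacement model. This is only a minimal illustration under stated assumptions, not the procedure used by Phylip or any other tool. The probability for unchanged residues is the one given above; the probability for differing residues, $\frac{1}{20} - \frac{1}{20} e^{-ut}$, is an assumption added here so that the probabilities over all 20 residues sum to 1. The function names, the grid search, and the toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Fractional dissimilarity p = n_d / (n_d + n_s) over gap-free columns."""
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    n_d = sum(1 for a, b in columns if a != b)
    return n_d / len(columns)


def poisson_prob(a, b, ut, states=20):
    """P(b|a,t) under an equal-rates Poisson model with `states` residues."""
    decay = math.exp(-ut)
    if a == b:
        return 1 / states + (states - 1) / states * decay
    # Assumption (not in the slides): complement for b != a, chosen so that
    # the probabilities over all possible b sum to 1.
    return 1 / states - 1 / states * decay


def log_likelihood(seq_a, seq_b, ut):
    """log P(D|H): D is the pairwise alignment, H is the distance parameter ut."""
    return sum(
        math.log(poisson_prob(a, b, ut))
        for a, b in zip(seq_a, seq_b)
        if a != "-" and b != "-"
    )


def ml_distance(seq_a, seq_b, step=0.001, max_ut=5.0):
    """Simple grid search for the ut that maximizes the likelihood."""
    grid = [i * step for i in range(1, int(max_ut / step) + 1)]
    return max(grid, key=lambda ut: log_likelihood(seq_a, seq_b, ut))


# Hypothetical toy alignment: 3 differing positions in 10 aligned columns.
SEQ_A = "MKVLAAGIVG"
SEQ_B = "MKILSAGLVG"

p = p_distance(SEQ_A, SEQ_B)
ut_hat = ml_distance(SEQ_A, SEQ_B)
print(f"p = {p:.3f}, ML estimate of ut = {ut_hat:.3f}")
```

Under this particular model the maximum also has a closed form, $ut = -\ln\left(1 - \frac{20}{19} p\right)$, so the grid search should agree with it up to the step size. With empirical models such as PAM no such shortcut is available, which is why a numerical search is sketched here.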