Introduction to Phylogenetic/Phylogenomic Concepts and Methods

Bioe 190, Fall 2016

Where we are in the syllabus
• Challenge: not enough time to cover everything (except at a superficial level)
• 2 lectures on phylogenetic trees & orthology (11/3 and 11/8)
• 2 lectures on PPI and functional associations (11/10 and 11/17)
• In preparing for the next quiz, you can focus on the material covered in the slides and use the readings for background

Recommended reading
• Felsenstein, J. 2003. “Inferring Phylogenies”. Sinauer Associates, Sunderland, Massachusetts. (The classic reference in the field of phylogenetics.)
• Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges”, Bioinformatics 2004 (describes sources of annotation error and phylogenomic inference of function)
• Online sources:
  – http://www.evolution-textbook.org/content/free/contents/ch27.html
  – http://evolution.berkeley.edu/evolibrary/home.php
  – https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf (Mona Singh)

Papers on ortholog identification and protein-protein interaction

A brief review of aspects of HW 5 (phylogenomic analyses of K+ channels)

Monophyly, monophyletic and other useful terms
Taken from https://www.mun.ca/biology/scarr/Taxon_types.htm

Taxonomic relationships can be represented by a tree
Monophyletic clades (Katelyn Greene solution):
• Rodents (mouse and rat)
• Mammals

Functional classification using phylogenetic trees (Katelyn Greene solution)
How would you classify KCNA1_ONCMY based exclusively on its phylogenetic placement (assuming this tree is accurate)? Is one of the subfamilies “closest”?
Clues:
• Locate the MRCA (most recent common ancestor) of KCNA1_ONCMY and the KCNA3 clade. Repeat for the KCNA1 and KCNA2 clades. Is one MRCA closer?
• Evaluate the tree distance (sum over the branch lengths) between KCNA1_ONCMY and other sequences in the tree (or between KCNA1_ONCMY and the subtree root node for each functional subfamily).
  Is one tree distance smaller?

Monophyly and congruence
KCNA1 subfamily is polyphyletic (Katelyn Greene solution)

"Nothing in Biology Makes Sense Except in the Light of Evolution"
Theodosius Dobzhansky (1973)
• Species phylogeny (Tree of Life)
• Protein superfamily phylogeny

Is Phylogenetic Analysis “Truth”?
• No. It is just a prediction, and like any other prediction method, it is prone to errors of different types.
• It is important to be aware of the possible sources of error in a phylogenetic tree reconstruction.
• However, phylogenetic reconstruction is one of the best tools available for understanding both how species evolve and how gene families evolve novel functions and structures.
• As Winston Churchill said…

Winston Churchill
http://www.winstonchurchill.org/

Uses of phylogenetic trees
• Traditional (virtually synonymous) uses:
  – Reconstructing species phylogenies
  – Input: MSA of a single gene family (e.g., ribosomal RNA, Cytochrome C)
• Data science, ‘omics era:
  – Ortholog identification
  – Ancestral sequence reconstruction
    • Synthesized for experimental investigation (e.g., to explore the ancestral function, folding pathways/stability)
  – Phylogenomic function prediction
  – Exploring the evolution of a functional and structural domain
  – Functional site prediction
  – Cellular localization and membrane topology prediction
  – Modelling the evolution of metabolic and signalling networks and pathways
  – A million other uses (see other lectures)

Evaluating tree reconstruction methods
http://www.evolution-textbook.org/content/free/contents/ch27.html#ch27-5-1

Assumptions of phylogenetic reconstruction
• The input MSA has no errors
• Taxon sampling is sufficient
  – Density of sampling, not simply the number of sequences
  – Problematic when sequences are deliberately culled (e.g., by restricting data to specific genomes, or to those that are in the manually curated SwissProt database, or to make an alignment non-redundant at some level of sequence identity)
  – Affected by biases and limitations of genome sequencing projects
• Sufficient site data
  – Proofs of statistical consistency (that the predicted tree will converge on the true tree) depend on the number of sites approaching infinity
  – This is always a problem with real molecular data for a single protein family (although less of a problem for phylogenomic methods of species phylogeny inference)
• Key assumption: the sequences in your dataset evolve under the model you plan to use to reconstruct their evolutionary tree

Interpreting tree topologies
• Many phylogenetic trees are not meant to be interpreted as rooted
• Terminal nodes (leaves) represent contemporary taxa (organisms, genes, proteins, etc.)
• Internal nodes represent inferred ancestors (or distributions over possible ancestors), not species or sequences found in sequence databases
  – In multi-gene families, these internal nodes may also represent duplication events and domain architecture changes
• Edge lengths give a measure of the evolutionary distance

Estimating a phylogenetic tree
Each step requires many (potentially difficult) decisions
Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges”, Bioinformatics 2004

What data to use?
http://www.evolution-textbook.org/content/free/contents/ch27.html#ch27-5-1

Alignment masking
• Standard protocol: delete “noisy” columns, e.g.,
  – Remove columns with many gaps
  – Remove columns with low BLOSUM62 scores
  – The aim is to remove ambiguous columns (perhaps with errors in the MSA) and positions that do not represent clearly homologous positions defining the family as a whole
• Alternative: consensus approach (Ward Wheeler)
  – Construct many MSAs for the same set of sequences (varying alignment parameters), concatenate the alignments, and estimate trees
  – The expectation is that uncertain regions of the alignment will cancel each other out, whereas signal from regions with consensus support will be magnified
• Questions to test comprehension:
  – What is the impact of masking when estimating a multi-gene family tree?
  – Do variable positions contain information you’d need for the tree topology?
  – What happens if masking deletes positions that are important for functional subgroups but are not conserved across the family?
  – Given the impact of the number of sites included in an analysis on phylogenetic tree construction accuracy (shown by simulation studies), what fraction of sites can be masked without problem?

Example MSA masking
http://www.evolution-textbook.org/content/free/contents/ch27/ch27-f25.html

Revised tree of life including horizontal transfer
[Figure labels: horizontal transfer, endosymbiosis]

Basic terms
[Figure labels: node, a clade, root?; leaves: human, mouse, fruit fly]
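Returning to the standard masking protocol described under "Alignment masking", the gap-based rule can be sketched as below. This is a minimal illustration, not a published tool: the function name `mask_gappy_columns`, the 50% gap threshold, and the toy alignment are all assumptions made for the example. Scoring columns with BLOSUM62 is not shown.

```python
def mask_gappy_columns(msa, max_gap_fraction=0.5, gap_chars="-."):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    msa is a list of equal-length aligned sequences (strings).
    Returns (masked_msa, kept_column_indices).
    The 0.5 default is an arbitrary illustrative threshold.
    """
    if not msa:
        return [], []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    kept = []
    for col in range(length):
        # Count gap characters in this column across all sequences.
        gaps = sum(seq[col] in gap_chars for seq in msa)
        if gaps / len(msa) <= max_gap_fraction:
            kept.append(col)

    # Rebuild each sequence from the retained columns only.
    masked = ["".join(seq[c] for c in kept) for seq in msa]
    return masked, kept


# Toy alignment (hypothetical sequences): column 2 is a gap in 3 of 4 rows.
toy_msa = ["MK-LV", "MKALV", "M--LV", "MK-LI"]
masked_msa, kept_columns = mask_gappy_columns(toy_msa, max_gap_fraction=0.5)
print(kept_columns)
print(masked_msa)
```

Note that the variable last column (V/I) survives this filter but could be removed by a stricter conservation filter, which is exactly the concern raised in the comprehension questions: masking reduces the number of sites available for tree estimation.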
Taxa (singular: taxon): the terminal nodes (leaves)
From Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, edited by Baxevanis & Ouellette

Trees are a special type of graph
• Graphs have nodes (vertices) and edges (branches)
• Edges can be directed or undirected
• Nodes can be internal or terminal
  – Terminal nodes in a phylogenetic tree are called “leaves” (or “taxa”)
  – The term “taxon” refers to (groups of) species, but is commonly used to describe genes in multi-gene families, even when the same species may be found in multiple copies in the tree
• Trees are a special subtype of graph (acyclic connected graphs)
• The valency (or degree) of a node equals the number of edges incident to it
• A tree for which every internal node (except for the root) has degree 3 (one ancestor and two children) is called a “bifurcating” or “binary” tree
• Trees for which internal nodes can have >2 children are called “multifurcating trees”
• The diameter of a tree is the length of the longest path between two leaves (summing edge lengths, not simply counting edges)
• Most phylogenetic trees are unrooted, and special methods must be used to infer the root

Understanding Evolution
http://evolution.berkeley.edu/evolibrary/article/phylogenetics_02

Know the lingo

Mapping phylogenetic terminology from the standard (taxa = species, clades are taxonomic) to protein superfamilies
In a protein superfamily, a taxon is a gene (or protein) from an organism; both the organism and the function may be unknown, and the tree may include many paralogous groups with diverse functions.
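The graph vocabulary from "Trees are a special type of graph" (degree, leaves, bifurcating, diameter) and the tree distance used in the HW 5 clues (sum over branch lengths) can be made concrete with a small sketch. The tree below is hypothetical: the leaf names echo the "Basic terms" figure, while `worm`, the internal node names `n1` and `n2`, and all edge lengths are invented for illustration.

```python
from itertools import combinations


def build_tree(edges):
    """Build an undirected adjacency map {node: {neighbor: edge_length}}."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def degree(adj, node):
    """Degree (valency): number of edges incident to the node."""
    return len(adj[node])


def leaves(adj):
    """Terminal nodes are the nodes of degree 1."""
    return sorted(n for n in adj if degree(adj, n) == 1)


def is_bifurcating(adj):
    """Unrooted binary tree: every internal node has degree 3."""
    return all(degree(adj, n) == 3 for n in adj if degree(adj, n) > 1)


def distances_from(adj, start):
    """Tree distance (sum of branch lengths) from start to every node."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for nbr, length in adj[node].items():
            if nbr not in dist:  # a tree is acyclic, so each node is reached once
                dist[nbr] = dist[node] + length
                stack.append(nbr)
    return dist


def diameter(adj):
    """Longest leaf-to-leaf path, measured in edge length, with its endpoints."""
    best = (0.0, None, None)
    for a, b in combinations(leaves(adj), 2):
        d = distances_from(adj, a)[b]
        if d > best[0]:
            best = (d, a, b)
    return best


# Hypothetical unrooted tree with made-up branch lengths.
tree = build_tree([
    ("human", "n1", 0.25),
    ("mouse", "n1", 0.25),
    ("n1", "n2", 0.5),
    ("fly", "n2", 1.0),
    ("worm", "n2", 1.5),
])

print(leaves(tree))
print(is_bifurcating(tree))
print(distances_from(tree, "human")["fly"])
print(diameter(tree))
```

The all-pairs loop in `diameter` is quadratic in the number of leaves, which is fine for a teaching example; it was chosen for readability rather than speed.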
Monophyly, monophyletic and other useful terms
Taken from https://www.mun.ca/biology/scarr/Taxon_types.htm

[Two slides from Mona Singh, Princeton: https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf]

Rooting methods
• Two most common approaches:
  – Outgroup rooting (most common): use an outgroup sequence (or a set of sequences for the outgroup clade). This is standard in species tree estimation, where the selection of an outgroup is easy.
  – Midpoint rooting: root the tree at the midpoint of the longest path between the most distant pair of leaves (assumes a molecular clock)
• Other methods (much less common):
  – Minimize mutations at sites: place the root where the number of mutations required from the ancestor to the current sequences is minimal
  – Minimize duplication/loss events (for multi-gene families)

[Two slides from Mona Singh, Princeton: https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf]

Finding an “optimal” tree topology is … unlikely!
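The midpoint rooting rule listed under "Rooting methods" can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the method of any particular package: the tree, node names, and branch lengths are hypothetical, and the function names are invented for the example. It finds the most distant pair of leaves with two sweeps over the tree (valid for trees with positive edge lengths) and then walks along that path to the halfway point.

```python
def build_tree(edges):
    """Build an undirected adjacency map {node: {neighbor: edge_length}}."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def farthest_path(adj, start):
    """Return (distance, path) for the node farthest from start."""
    best = (0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[0]:
            best = (dist, path)
        for nbr, length in adj[node].items():
            if nbr != parent:  # do not walk back along the edge we came from
                stack.append((nbr, node, dist + length, path + [nbr]))
    return best


def midpoint_root(adj):
    """Locate the midpoint of the longest leaf-to-leaf path.

    Returns (u, v, offset): the root lies on edge u-v, offset away from u.
    Returns None for a tree with no edges.
    """
    any_node = next(iter(adj))
    # Sweep 1: the farthest node from any start is one end of the longest path.
    _, first = farthest_path(adj, any_node)
    # Sweep 2: the farthest node from that end is the other end.
    total, path = farthest_path(adj, first[-1])

    half = total / 2.0
    walked = 0.0
    for u, v in zip(path, path[1:]):
        length = adj[u][v]
        if walked + length >= half:
            return u, v, half - walked
        walked += length
    return None


# Hypothetical unrooted tree with made-up branch lengths.
tree = build_tree([
    ("human", "n1", 0.25),
    ("mouse", "n1", 0.25),
    ("n1", "n2", 0.5),
    ("fly", "n2", 1.0),
    ("worm", "n2", 1.5),
])

root_edge = midpoint_root(tree)
print(root_edge)
```

In this toy tree the longest path runs between `fly` and `worm` (length 2.5), so the root is placed on the `worm`-`n2` edge, 1.25 from `worm`. Because the method places the root by distance alone, lineages that evolve at very different rates can pull the midpoint away from the true root, which is why the slide notes the molecular clock assumption.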
