
Introduction to phylogenetic/phylogenomic concepts and methods

Bioe 190 Fall 2016

Where we are in the syllabus

Challenge: not enough time to cover everything (except at a superficial level)

• 2 lectures on phylogenetic trees & orthology (11/3 and 11/8)
• 2 lectures on PPI and functional associations (11/10 and 11/17)

In preparing for the next quiz: you can focus on the material covered in the slides and use the readings for background

Recommended reading

• Felsenstein, J. 2003. “Inferring Phylogenies”. Sinauer Associates, Sunderland, Massachusetts. (The classic reference in the field of phylogenetics.)
• Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges” 2004 (describes sources of annotation error and phylogenomic inference of function)
• Online sources:
  – http://www.evolution-textbook.org/content/free/contents/ch27.html
  – http://evolution.berkeley.edu/evolibrary/home.php
  – https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf (Mona Singh)

Papers on ortholog identification and protein-protein interaction

A brief review of aspects of HW 5 (phylogenomic analyses of K+ channels)

Monophyly, monophyletic and other useful terms

Taken from https://www.mun.ca/biology/scarr/Taxon_types.htm

Taxonomic relationships can be represented by a tree

Monophyletic:
• Rodents (mouse and rat)

Katelyn Greene solution

Functional classification using phylogenetic trees

Katelyn Greene solution

How would you classify KCNA1_ONCMY based exclusively on its phylogenetic placement (assuming this tree is accurate)?

Is one of the subfamilies “closest”? Clues:
• Locate the MRCA (most recent common ancestor) of KCNA1_ONCMY and the KCNA3 clade. Repeat for the KCNA1 and KCNA2 clades. Is one MRCA closer?
• Evaluate the tree distance (sum over the branch lengths) between KCNA1_ONCMY and other sequences in the tree (or between KCNA1_ONCMY and the subtree root node for each functional subfamily). Is one tree distance smaller?

Monophyly and congruence

KCNA1 subfamily is polyphyletic
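The MRCA, tree-distance, and monophyly checks used on these slides can be sketched on a toy rooted tree. This is a minimal illustration, not the homework solution: the tree, the node names (`A1`, `B1`, `n1`, ...), and the branch lengths are all hypothetical, and the tree is assumed to be rooted and accurate.

```python
# Toy rooted tree: child -> (parent, branch length).
# All names and lengths are hypothetical, chosen so that group A is NOT monophyletic.
PARENT = {
    "A1": ("n1", 0.1), "A2": ("n1", 0.2),
    "B1": ("n3", 0.1), "B2": ("n3", 0.1),
    "n3": ("n2", 0.3), "A3": ("n2", 0.4),
    "n1": ("root", 0.5), "n2": ("root", 0.2),
}
# Leaves are the nodes that never appear as a parent.
LEAVES = set(PARENT) - {parent for parent, _ in PARENT.values()}


def path_to_root(node):
    """Nodes visited when walking from `node` up to the root (inclusive)."""
    path = [node]
    while node in PARENT:
        node = PARENT[node][0]
        path.append(node)
    return path


def mrca(nodes):
    """Most recent common ancestor: first node on one path shared by all paths."""
    paths = [path_to_root(n) for n in nodes]
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    return next(n for n in paths[0] if n in shared)


def dist_to_ancestor(node, ancestor):
    """Sum of branch lengths from `node` up to `ancestor`."""
    total = 0.0
    while node != ancestor:
        parent, length = PARENT[node]
        total += length
        node = parent
    return total


def tree_distance(a, b):
    """Path length between two nodes, going through their MRCA."""
    anc = mrca([a, b])
    return dist_to_ancestor(a, anc) + dist_to_ancestor(b, anc)


def is_monophyletic(group):
    """A group is monophyletic if its MRCA has no other leaf descendants."""
    anc = mrca(group)
    under = {leaf for leaf in LEAVES if anc in path_to_root(leaf)}
    return under == set(group)


print(is_monophyletic(["B1", "B2"]))        # group B forms a clade
print(is_monophyletic(["A1", "A2", "A3"]))  # group A does not
print(tree_distance("A3", "B1"), tree_distance("A3", "A1"))
```

In this toy tree, `A3` is closer (by tree distance) to the B clade than to the other A sequences, which is the kind of evidence the clues above ask you to look for.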

Katelyn Greene solution

"Nothing in Biology Makes Sense Except in the Light of Evolution"

Theodosius Dobzhansky (1973)

Species phylogeny (Tree of Life)

Protein superfamily phylogeny

Is Phylogenetic Analysis “Truth”?

• No – it’s just a prediction, and like any other prediction method, it is prone to errors of different types.
• It’s important that you be aware of sources of possible errors in a reconstruction.
• However, phylogenetic reconstruction is one of the best tools available for understanding both how species evolve and how gene families evolve novel functions and structures.
• As Winston Churchill said…

Winston Churchill

http://www.winstonchurchill.org/

Uses of phylogenetic trees

• Traditional (virtually synonymous) uses:
  – Reconstructing species phylogenies
  – Input: MSA of a single gene (e.g., ribosomal RNA, Cytochrome C)
• Data Science – ‘omics era
  – Ortholog identification
  – Ancestral sequence reconstruction
    • Synthesized for experimental investigation (e.g., to explore the ancestral function, folding pathways/stability)
  – Phylogenomic function prediction
  – Exploring the evolution of a functional and structural
  – Functional site prediction
  – Cellular localization and membrane topology prediction
  – Modelling the evolution of metabolic and signalling networks and pathways
  – A million other uses (see other lectures)

Evaluating tree reconstruction methods

http://www.evolution-textbook.org/content/free/contents/ch27.html#ch27-5-1

Assumptions of phylogenetic reconstruction

• The input MSA has no errors
• Taxon sampling is sufficient
  – Density of sampling, not simply the number of sequences
  – Problematic when sequences are deliberately culled (e.g., by restricting data to specific genomes, or to those that are in the manually curated SwissProt database, or to make an alignment non-redundant at some level of sequence identity)
  – Affected by biases and limitations of genome sequencing projects
• Sufficient site data
  – Proofs of statistical consistency (that the predicted tree will converge on the true tree) depend on the number of sites approaching infinity
  – This is always a problem with real molecular data for a single gene (although less of a problem for phylogenomic methods of species phylogeny inference)
• Key assumption: the sequences in your dataset evolve under the model you plan to use to reconstruct their evolutionary tree

Interpreting tree topologies

• Many phylogenetic trees are not meant to be interpreted as rooted
• Terminal nodes (leaves) represent contemporary taxa (species, genes, proteins, etc.)
• Internal nodes represent inferred ancestors (or distributions over possible ancestors) – not of species/sequences in sequence databases
  – In multi-gene families, these internal nodes may also represent duplication events and domain architecture changes
• Edge lengths give a measure of the evolutionary distance

Estimating a phylogenetic tree

Each step requires many (potentially difficult) decisions

Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges” Bioinformatics 2004

What data to use?

http://www.evolution-textbook.org/content/free/contents/ch27.html#ch27-5-1

Alignment masking

Standard protocol: Delete “noisy” columns, e.g.,
§ Remove columns with many gaps
§ Remove columns with low BLOSUM62 scores
§ Aim is to remove ambiguous columns (perhaps with errors in the MSA) and positions that do not represent clearly homologous positions defining the family as a whole

Alternative: Consensus approach (Ward Wheeler)
§ Construct many MSAs for the same set of sequences (varying alignment parameters), concatenate alignments, and estimate trees
§ Expectation is that uncertain regions of alignment will cancel each other out, whereas signal from regions with consensus support will be magnified

Questions to test comprehension:
§ What is the impact of masking when estimating a multi-gene family tree?
§ Do variable positions contain information you’d need for the tree topology?
§ What if masking deletes positions that are important for functional subgroups but are not conserved across the family?
§ Given the impact of the number of sites included in an analysis on phylogenetic tree construction accuracy (shown by simulation studies), what fraction of sites can be masked without problem?
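The first step of the standard protocol (removing columns with many gaps) can be sketched as follows. This is an illustration only: the function name `mask_gappy_columns`, the 50% gap threshold, and the toy alignment are assumptions, not values from the lecture, and real masking tools apply additional criteria such as column scores.

```python
def mask_gappy_columns(msa, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds the threshold.

    msa: list of equal-length aligned sequences, with '-' as the gap character.
    Returns the masked alignment and the indices of the columns that were kept.
    """
    n_seqs = len(msa)
    keep = [
        col
        for col in range(len(msa[0]))
        if sum(seq[col] == "-" for seq in msa) / n_seqs <= max_gap_fraction
    ]
    masked = ["".join(seq[col] for col in keep) for seq in msa]
    return masked, keep


# Toy alignment (hypothetical): column 2 is 75% gaps and is removed.
msa = ["MK-LV", "MR-LV", "MKALI", "M--LV"]
masked, kept = mask_gappy_columns(msa)
print(kept)
print(masked)
```

Note that the removed column is exactly the kind of position that may be informative for a subgroup (here, the one sequence carrying a residue), which is the concern raised in the comprehension questions.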

Example MSA masking

http://www.evolution-textbook.org/content/free/contents/ch27/ch27-f25.html

Revised including horizontal transfer

Horizontal transfer endosymbiosis

Basic terms

[Figure: example tree with leaves human, mouse, and fruit fly, labelled with the terms node, clade, root, taxa (singular taxon), and terminal nodes (leaves)]

From Bioinformatics, A practical guide to the analysis of genes and proteins, Edited by Baxevanis & Ouellette

Trees are a special type of graph

• Graphs have nodes (vertices) and edges (branches)
• Edges can be directed or undirected
• Nodes can be internal or terminal
  – Terminal nodes in a phylogenetic tree are called “leaves” (or “taxa”)
  – The term “taxon” refers to (groups of) species, but is commonly used to describe genes in multi-gene families, even when the same species may be found in multiple copies in the tree
• Trees are a special subtype of graph (acyclic connected graphs)
• The valency (or degree) of a node equals the number of edges incident to it
• A tree for which every internal node (except for the root) has degree 3 (one ancestor and two children) is called a “bifurcating” or “binary” tree
• Trees for which internal nodes can have >2 children are called “multifurcating trees”
• The diameter of a tree is equal to the longest path between two leaves (including edge lengths, not simply number of edges)
• Most phylogenetic trees are unrooted, and special methods must be used to infer the root

Understanding Evolution

http://evolution.berkeley.edu/evolibrary/article/phylogenetics_02

Know the lingo

http://evolution.berkeley.edu/evolibrary/article/phylogenetics_02

Mapping phylogenetic terminology from the standard (taxa=species, clades are taxonomic) to protein superfamilies

In a protein superfamily, a taxon is a gene (or protein) from an organism; both the organism and the function may be unknown, and the tree may include many paralogous groups with diverse functions.

http://evolution.berkeley.edu/evolibrary/article/phylogenetics_02

Monophyly, monophyletic and other useful terms

Taken from https://www.mun.ca/biology/scarr/Taxon_types.htm

Mona Singh, Princeton, https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf

Rooting methods

• Two most common approaches:
  • Outgroup rooting (most common): Use an outgroup sequence (or set of sequences for the outgroup clade). Standard in species tree estimation, where the selection of an outgroup is easy.
  • Midpoint rooting: Root the tree by finding the midpoint on the longest span between the most distant pair of leaves (assumes a molecular clock)
• Other methods (much less common):
  • Minimize mutations at sites: Root the tree by locating the root that requires the minimal number of mutations from the ancestor to current sequences
  • Minimize duplication/loss events (for multi-gene families)
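Midpoint rooting can be sketched as follows: find the longest leaf-to-leaf path (the tree diameter) and place the root halfway along it. This is a minimal illustration on a hypothetical unrooted tree with made-up edge lengths; the function names are assumptions, and the code only reports where the root would go rather than re-rooting a tree object.

```python
from collections import defaultdict

# Hypothetical unrooted tree: leaves A-D, internal nodes u and v.
EDGES = [("A", "u", 1.0), ("B", "u", 3.0), ("u", "v", 2.0),
         ("v", "C", 1.0), ("v", "D", 6.0)]
tree = defaultdict(dict)
for a, b, length in EDGES:
    tree[a][b] = length
    tree[b][a] = length


def farthest(tree, start):
    """Return (distance, path) for the node farthest from `start`."""
    best = (0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, prev, dist, path = stack.pop()
        if dist > best[0]:
            best = (dist, path)
        for nbr, length in tree[node].items():
            if nbr != prev:  # never walk back along the edge we came from
                stack.append((nbr, node, dist + length, path + [nbr]))
    return best


def midpoint_root(tree):
    """Locate the midpoint of the longest leaf-to-leaf path.

    Returns ((u, v), offset, diameter): the root lies on edge (u, v),
    `offset` away from u.
    """
    # Two passes: the farthest node from any start is one end of the
    # longest path in a tree; the farthest node from that end is the other.
    _, path = farthest(tree, next(iter(tree)))
    diameter, path = farthest(tree, path[-1])
    half, walked = diameter / 2, 0.0
    for u, v in zip(path, path[1:]):
        if walked + tree[u][v] >= half:
            return (u, v), half - walked, diameter
        walked += tree[u][v]


edge, offset, diameter = midpoint_root(tree)
print(edge, offset, diameter)
```

If one lineage evolves much faster than the rest (as leaf `D` does here), the midpoint is pulled onto its long branch, which is why the method depends on roughly clock-like rates.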

Mona Singh, Princeton, https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf

Finding an “optimal” tree topology is … unlikely!

• Number of (unrooted) binary trees on n leaves is (2n-5)!!
• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia

#leaves    #trees
4          3
5          15
6          105
7          945
8          10395
9          135135
10         2027025
20         2.2 x 10^20
100        4.5 x 10^190
1000       2.7 x 10^2900
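The double-factorial count (2n-5)!! can be evaluated directly. The sketch below (the function name is an assumption) reproduces the table rows for 4 through 10 leaves, and can also be used to check the order of magnitude of the approximate large-n entries.

```python
def num_unrooted_binary_trees(n_leaves):
    """Number of unrooted binary tree topologies on n labelled leaves: (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


for n in range(4, 11):
    print(n, num_unrooted_binary_trees(n))

# Python integers are arbitrary precision, so large n is exact;
# printing the number of digits is enough to see the growth.
print(len(str(num_unrooted_binary_trees(1000))))
```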

From T. Warnow

Phylogenetic reconstruction methods can be divided into two main classes

• Distance-based
  – Input is a matrix of distances between species
  – Distances can be computed in many ways (typically based on MSA)
  – Fundamental characteristic: once the distance matrix is derived, the MSA data is set aside
  • E.g., Neighbor-Joining, UPGMA, etc.
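As a sketch of how a distance matrix might be derived from an MSA, the example below uses the simplest possible measure, the p-distance (fraction of differing positions, skipping columns where either sequence has a gap). The choice of p-distance, the function names, and the toy alignment are assumptions for illustration; in practice, model-corrected distances are normally used.

```python
def p_distance(a, b):
    """Fraction of differing sites among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(msa):
    """All pairwise p-distances for an alignment given as {name: aligned sequence}."""
    names = list(msa)
    return {(i, j): p_distance(msa[i], msa[j]) for i in names for j in names}


# Hypothetical toy alignment.
msa = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AC-TTG"}
dist = distance_matrix(msa)
for (i, j), d in sorted(dist.items()):
    if i < j:
        print(i, j, round(d, 3))
```

Once `dist` has been computed, a distance-based method works from the matrix alone; the individual alignment columns are no longer consulted.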

• Character-based
  – Starting point is always an MSA
  – Examine each character (e.g., residue) separately
  – Retain character information at every stage in the tree estimation
  – Note that a “distance” between clades can be computed at each stage in the tree construction while still being a character-based method
  • E.g., Maximum Likelihood, Maximum Parsimony, MrBayes

KS: recall that branch lengths are proportional to the distance (#mutations) between the “sequences” represented at the vertices at either endpoint, so MP methods will seek to produce the tree whose sum of branch lengths is smallest.

Mona Singh, Princeton, https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf

Molecular Clock

• UPGMA implicitly assumes that all distances are clock-like

[Figure: “True” tree vs. inferred tree on taxa 1-4]
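Clock-like distances are ultrametric: for every triple of taxa, the two largest of the three pairwise distances are equal. A quick check of this three-point condition is sketched below. The two distance matrices are hypothetical; the second comes from the same topology with one fast-evolving lineage (taxon 2), which breaks the clock assumption that UPGMA relies on.

```python
from itertools import combinations


def is_ultrametric(dist, taxa, tol=1e-9):
    """Three-point condition: in every triple, the two largest distances are equal."""
    for a, b, c in combinations(taxa, 3):
        d = sorted([dist[frozenset((a, b))],
                    dist[frozenset((a, c))],
                    dist[frozenset((b, c))]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True


def make_dist(entries):
    """Build a symmetric lookup keyed by unordered taxon pairs."""
    return {frozenset(pair): value for pair, value in entries.items()}


taxa = [1, 2, 3, 4]
# Hypothetical clock-like distances for the tree ((1,2),(3,4)).
clock = make_dist({(1, 2): 2, (3, 4): 4, (1, 3): 6, (1, 4): 6, (2, 3): 6, (2, 4): 6})
# Same topology, but the branch leading to taxon 2 is longer (faster evolution).
no_clock = make_dist({(1, 2): 5, (3, 4): 4, (1, 3): 6, (1, 4): 6, (2, 3): 9, (2, 4): 9})

print(is_ultrametric(clock, taxa))
print(is_ultrametric(no_clock, taxa))
```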

Question to test understanding: Can you assume a molecular clock if data include paralogous groups? What if they’re restricted to orthologs?

Mona Singh, Princeton, https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf

Maximum Parsimony

• Input: Set S of n aligned sequences of length k
• Output:
  – A phylogenetic tree T leaf-labeled by sequences in S
  – additional sequences of length k labeling the internal nodes of T

such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized.

E(T) is the set of edges in the tree T.
H(i, j) is the Hamming distance (minimum #substitutions) between the sequences labeling the endpoints of edge (i, j).

T. Warnow

Hamming Distance
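The Hamming distance, and the parsimony objective built from it, can be sketched in a few lines. The tree, sequences, and internal-node labels below are hypothetical. Note that the code only scores a tree whose internal nodes are already labelled; it does not search over labelings or topologies, which is the hard part of maximum parsimony.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def parsimony_length(edges, labels):
    """Sum of Hamming distances over all edges of a fully labelled tree."""
    return sum(hamming(labels[i], labels[j]) for i, j in edges)


# Hypothetical tree: leaves s1-s4, internal nodes u and v.
edges = [("s1", "u"), ("s2", "u"), ("u", "v"), ("s3", "v"), ("s4", "v")]
leaves = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA", "s4": "TCGA"}

# Two candidate labelings of the internal nodes.
labeling_1 = {**leaves, "u": "ACGA", "v": "TCGA"}
labeling_2 = {**leaves, "u": "ACGT", "v": "TCGA"}

print(parsimony_length(edges, labeling_1))
print(parsimony_length(edges, labeling_2))
```

The labeling with the smaller total is the more parsimonious explanation of the same leaf sequences on this topology.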

https://en.wikipedia.org/wiki/Hamming_distance

The fundamental principle behind maximum parsimony

Occam’s Razor

Entia non sunt multiplicanda praeter necessitatem.

William of Occam (1300-1349)

The best tree is the one that requires the fewest substitutions.

Mona Singh, Princeton, https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf

KS: Sites are assumed to be i.i.d. This is not biologically realistic.

Mona Singh, Princeton, https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf

Maximum Likelihood

• Input: Set S of n aligned sequences of length k, and a specified parametric model
• Output:
  – A phylogenetic tree T leaf-labeled by sequences in S
  – With additional model parameters (e.g. edge “lengths”)
such that Pr[S|(T, parameters)] is maximized.

(Recall that ML methods seek to identify a model that maximizes the probability of the data.)

From T. Warnow

Maximum Likelihood (in English)

• Require a model of evolution
• Each substitution has an associated likelihood given a branch of a certain length
• A function is derived to represent the likelihood of the data given the tree, branch lengths and additional parameters
• Find the tree that maximizes the likelihood
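The idea of maximizing a likelihood over a branch length can be sketched for the smallest possible case: two DNA sequences joined by a single branch. The Jukes-Cantor (JC69) model used here is an assumption chosen for simplicity (it is not named in the slides), as are the function names and the example counts. For this case the maximum has a known closed form, d = -(3/4) ln(1 - 4p/3), which the grid search should approximately recover.

```python
import math


def jc69_log_likelihood(n_sites, n_diffs, d):
    """Log-likelihood of two aligned sequences under JC69 at branch length d.

    d is the expected number of substitutions per site; sites are treated as i.i.d.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site is unchanged
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # Each site also contributes the equilibrium frequency 1/4 of its first base.
    return ((n_sites - n_diffs) * math.log(0.25 * p_same)
            + n_diffs * math.log(0.25 * p_diff))


def ml_distance_grid(n_sites, n_diffs, step=0.001, max_d=3.0):
    """Branch length on a grid that maximizes the JC69 log-likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(n_sites, n_diffs, d))


def jc69_closed_form(n_sites, n_diffs):
    """Analytical maximum-likelihood distance under JC69."""
    p = n_diffs / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical data: 100 aligned sites, 20 of which differ.
print(ml_distance_grid(100, 20))
print(jc69_closed_form(100, 20))
```

For a full tree, the same principle applies, but the likelihood must be summed over unobserved ancestral states and maximized jointly over the topology and all branch lengths.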

From T. Warnow

Mona Singh, Princeton, https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf

KS Note: sampling with replacement means that some columns may show up multiple times, and others not at all

KS Notes:
• Tree produced is a consensus (over all the replicates)
• In a rooted tree, the bootstrap support for an edge is identical to the bootstrap support for a “clade”
• In an unrooted tree, bootstrap support is displayed on the edges

Bootstrap Analysis

• Columns from the input alignment are resampled with replacement to create many bootstrap replicate data sets
  • Each resampled MSA will have the same number of columns: some columns may be sampled multiple times, and others not at all
• A phylogenetic tree is constructed for each bootstrap replicate data set (e.g. with parsimony, distance, ML etc.)
• A majority-rule consensus tree is estimated (e.g., using the PHYLIP consense program) and interior nodes (in the case of a rooted tree) or branches (in the case of an unrooted tree) are labelled with the fraction of trees containing that edge/subtree
  • Ignoring the topology/branching within that subtree
• Bootstrap analysis does not actually tell you the expected accuracy of that subtree, just the support for that subtree in the alignment

Bootstrap vs Jackknife

http://www.evolution-textbook.org/content/free/contents/ch27/ch27-f21.html
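Column resampling for the bootstrap (with replacement, same number of columns) can be sketched as follows; a jackknife variant, which samples without replacement and keeps fewer columns, is included for comparison. The function names, the toy alignment, and the 50% jackknife fraction are assumptions for illustration. Tree building and the consensus step are not shown.

```python
import random


def bootstrap_columns(msa, rng):
    """Resample alignment columns WITH replacement; same width as the input."""
    n_cols = len(msa[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in cols) for seq in msa], cols


def jackknife_columns(msa, rng, keep_fraction=0.5):
    """Resample alignment columns WITHOUT replacement; fewer columns than the input."""
    n_cols = len(msa[0])
    cols = sorted(rng.sample(range(n_cols), int(n_cols * keep_fraction)))
    return ["".join(seq[c] for c in cols) for seq in msa], cols


# Hypothetical toy alignment.
msa = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTTCGAAC"]
rng = random.Random(0)  # fixed seed so a run is repeatable

replicate, cols = bootstrap_columns(msa, rng)
print(cols)       # repeated indices are columns drawn more than once
print(replicate)

subsample, kept = jackknife_columns(msa, rng)
print(kept)       # every index appears at most once
print(subsample)
```

In a full analysis, a tree would be estimated from each replicate and the support for a branch reported as the fraction of replicate trees containing it.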

FIGURE 27.21. Bootstrapping. (A) Bootstrapped alignment #1. The columns in green and light blue are not part of the new alignment and the columns in yellow and red are present twice. Bootstrapped alignment #2. The columns in green and light blue are present but the two columns in dark blue are not and the column in pink is represented three times. (B) Jackknifing. Alignments are resampled without replacement such that the new alignments have fewer columns than the original.

KS note: if the similarity is “not recognizable”, how were the sequences selected? If the dataset is highly diverse, it’s better to use several different phylogenetic methods and derive a majority rule consensus – subtrees with low support should be treated with skepticism.

Mona Singh, Princeton, https://www.cs.princeton.edu/~mona/Lecture/phylogeny-slides.pdf

KS: best to use multiple methods and derive a consensus tree

Evaluating the accuracy of phylogenetic tree reconstruction algorithms

• Tree methods are typically validated using simulation studies
  – Given a “True” tree, generate data (ungapped multiple sequence alignments)
  – Attempt to reconstruct the tree from the MSA
  – Compare the reconstructed tree to the true tree; compute error

Quantifying Error

[Figure: true tree and inferred tree, with one edge labelled FN and one edge labelled FP]

FN: false negative (missing edge)
FP: false positive (incorrect edge)

Courtesy of T. Warnow

Benchmarking phylogenetic tree algorithms: Analogy to active site prediction

“True tree” = TT
Inferred tree = IT

Edge in both TT and IT: TP
Edge in IT but not TT: FP
Edge in TT but not IT: FN

Every edge defines a bipartition on the leaves in the tree. For example:

The highlighted edge in the True Tree (labelled FN) defines two non-overlapping subsets: {S1, S2, S3}, {S4, S5}.

There is no edge in the Inferred tree that produces the same division (which is why it’s labelled as a False Negative).

The highlighted edge in the Inferred Tree (labelled FP) defines two non-overlapping subsets: {S1, S2, S4}, {S3, S5}.

There is no edge in the True Tree that produces the same partition (which is why it’s labelled as a False Positive).
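The bipartition comparison described above can be sketched as follows. The two tree shapes are assumptions: they were chosen so that the true tree contains the split {S1, S2, S3} | {S4, S5} and the inferred tree contains {S1, S2, S4} | {S3, S5}, matching the example, but the actual trees on the slide may differ. Trees are written as nested tuples, with a three-way split at the top level standing in for an unrooted tree.

```python
def leaf_set(tree):
    """All leaf names below a node (a leaf is a string, an internal node a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial leaf bipartitions, one per internal edge of the tree."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = all_leaves - side
        # Skip the top-level node and edges leading to a single leaf.
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


# Hypothetical tree shapes consistent with the example bipartitions.
true_tree = (("S1", "S2"), "S3", ("S4", "S5"))
inferred_tree = (("S1", "S2"), "S4", ("S3", "S5"))

tt, it = bipartitions(true_tree), bipartitions(inferred_tree)
tp = tt & it   # edges in both trees
fp = it - tt   # edges in the inferred tree only
fn = tt - it   # edges in the true tree only
print(len(tp), len(fp), len(fn))
```

The sum of the FP and FN counts is the quantity usually reported as the Robinson-Foulds distance between the two trees.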

Benchmarking phylogenetic tree algorithms: Analogy to active site prediction


Catalytic site prediction
• Gold standard: CSA (Catalytic Site Atlas)
• Prediction methods: Evolutionary Trace, INTREPID, Discern, etc.; these produce sets of predicted catalytic residues
• TP: residues in both the CSA and prediction
• FP: residues predicted as catalytic that are not in the CSA
• FN: residues in the CSA that are not predicted by the method

Phylogenetic tree estimation
• Gold standard: The “True tree”
• Prediction methods: NJ, MP, ML, UPGMA (etc.); these produce inferred trees (ITs)
• TP: edges in both the True Tree and Inferred Trees
• FP: edge in the Inferred Tree that is not in the True Tree
• FN: edge in the True Tree that is not in the Inferred Tree

Errors in trees

• May be in the branching order (topology)
  – Example coarse branching order:
    • Relative branching order between taxonomic groups (e.g., primates, rodents)
    • Relative branching order between clades representing different genes (in a multi-gene tree including duplication events/paralogous groups)
  – Example fine branching order:
    • Relative branching order within Hominidae (human, chimps, bonobos, gorilla, orangutan)
• May be in branch lengths
  – In general, less of a problem for functional inferences in protein superfamily reconstruction

Major sources of error

• Sparse “taxon sampling”
  – Historically refers to reconstructing phylogenies for single genes (restricted to orthologs in different species)
  – In protein superfamily reconstruction, including paralogous groups, simply refers to the selection of proteins (multiple genes and multiple species)
• Lineage-specific rate variation
  – Historically refers to species that are evolving more rapidly than others
  – In protein superfamily reconstruction, refers to genes (a group of orthologs) that are evolving rapidly (perhaps due to neo-functionalization)
• Site-specific rate variation
  – Less common in single gene trees (orthologs in different species)
  – Very common in protein superfamilies due to diversification of function following gene duplication
• Sequence fragments (or gene model errors)
  – Very common in protein sequence databases
• Insufficient site data (e.g., short MSA)
  – Very common for trees based on single domains (esp. if <100 aa)
  – Gene matrix approaches (see Delsuc et al) address this problem
• Few informative sites (e.g., using DNA for closely related taxa)

Why would a gene tree not agree with a reference species tree?

§ Insufficient site data (#informative sites << #aa)
§ Sequence fragments (how gaps are handled)
§ Use of amino acid instead of nucleic acid sequences (or vice versa, depending on the taxonomic distance)
§ Slow-evolving genes informative for distantly related species may not be informative for closely related species
§ Errors in the alignment (mostly between distantly related groups)
§ Sparse taxon sampling
  § Consider the related task of estimating amino acid distributions for profiles or HMMs given limited sampling from the family
§ Functional pressures
§ Incomplete lineage sorting

Brief Bioinform (2011) 12 (5): 413-422. doi: 10.1093/bib/bbr036

Example 1: Gene tree and species tree agree (MSA/tree based on global MDA clustering)

See figure caption on final slide for this example

Example conflicts between gene tree and assumed species tree

Example 2: Gene tree and species tree disagree (MSA/tree based on ERG4 domain only)

Note that sequences cluster correctly into subfamilies whose MDAs agree (even though the amino acids outside the ERG4 domain were not included)

See figure caption on final slide for this example