
Creating Supertrees from Phylogeny

Christiaan van Woudenberg

What is phylogeny?

• The study of the relationships of things that evolved from a common ancestor.
• First studies in 1866 by Ernst Haeckel.
• Goal: describe the relationship between the species we see (and have seen) over 3 billion years of evolution.

Before it’s too late

• Estimates are between 7 and 80 million extant species (13.8 million is the accepted answer).
  - 1/4 may be beetles in the Amazon.
  - 1/4 more may be nematodes at the bottom of the ocean.
• To date, scientists have described 1.75 million species.
• 1/2 of species alive today may be extinct in 100 years.
• A job for you database folks: there are no standards for how species/gene data are stored or indexed across databases.

Source: American Association for the Advancement of Science

Phylogenetic trees

• Describe evolutionary relationships.
• May be rooted or unrooted.
• Branch lengths (if they exist) correspond to evolutionary distance; such a tree is an additive tree.
• If all path lengths to the root are equal, the tree is ultrametric (molecular clock).

[Figure: example trees (additive tree, ultrametric tree), from Evolutionary Genomics Course, Itai Yanai]
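The ultrametric condition can be checked mechanically. The following is a minimal sketch, assuming a rooted tree stored as nested Python lists of `(subtree, branch_length)` pairs with leaf names as strings; this representation and the function names are illustrative assumptions, not part of the original slides.

```python
def leaf_depths(tree, depth=0.0):
    """Return {leaf name: summed branch length from the root}."""
    if isinstance(tree, str):          # a leaf is just its name
        return {tree: depth}
    depths = {}
    for child, length in tree:         # an internal node lists (child, length) pairs
        depths.update(leaf_depths(child, depth + length))
    return depths


def is_ultrametric(tree, tol=1e-9):
    """True if every root-to-leaf path has the same total length."""
    values = list(leaf_depths(tree).values())
    return max(values) - min(values) <= tol


# Hypothetical example trees: the first obeys a molecular clock, the second does not.
clock_tree = [([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)]
additive_tree = [([("A", 1.0), ("B", 3.0)], 2.0), ("C", 1.0)]

print(is_ultrametric(clock_tree))
print(is_ultrametric(additive_tree))
```

Both example trees are additive (leaf-to-leaf distances are sums of branch lengths), but only the first also satisfies the equal-depth condition.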

Some definitions

• Homolog - genes descended from a common ancestral DNA sequence.
• Ortholog - homologous genes in different species that evolved from a common ancestral gene by speciation; they usually retain function.
• Paralog - homologous genes related by gene duplication within a genome.
• Homoplasy - convergent evolution, where similar characteristics are not homologous.
• Polytomy - a node with more than 2 descendant branches (degree > 3).
• Clade (monophyletic group) - a taxon in which all descendants from a common ancestor are identified.

• Visit http://evowiki.org/ for an intro to Evolutionary Biology

Question

• When analyzing genes, is it best to use:
  - Homologs
  - Orthologs
  - Paralogs

Constructing phylogenies

Species versus genes

• Data at the species level are much more complex than DNA, RNA or protein sequence data.
  - Large data sets are difficult to compute.
  - Horizontal gene transfer is an issue, especially in bacteria.
  - Genes do not map 1:1 to species.
• Solution:
  - Create a tree of genes.
  - Label the species along the way.
  - Species become node segments, not leaf nodes.

[Figure: gene tree and species tree with leaves a, b, c, d]

DNA/RNA sequence issues

• Best results obtained when:
  - Gene(s) are resistant to horizontal gene transfer.
  - Genes appear in all species of interest.
  - Sequences are large and stable enough to be considered non-random.
• RNA and protein sequences are more stable:
  - Mutation from TCT to TCC still codes for serine; such mutations in DNA disappear from the protein sequence.
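The TCT to TCC example can be illustrated with a small translation sketch. It assumes a deliberately partial codon table holding only the few standard-genetic-code codons used here; the table, sequences and function name are illustrative assumptions.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODONS = {
    "ATG": "M",                                       # methionine
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",   # serine
    "CCT": "P",                                       # proline
    "GAA": "E",                                       # glutamate
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


original = "ATGTCTGAA"
synonymous = "ATGTCCGAA"      # TCT -> TCC, third position changed
nonsynonymous = "ATGCCTGAA"   # TCT -> CCT, first position changed

print(translate(original), translate(synonymous), translate(nonsynonymous))
```

The synonymous change is visible at the DNA level but leaves the protein sequence unchanged, while the first-position change swaps serine for proline.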

Approaches to creating phylogenies

General Methods:
• Exploit common groupings among the set of input trees.
• Optimize fit of a set of candidate trees according to some objective function.

Algorithms:
• Distance analysis
  - Minimize branch lengths
  - UPGMA, Neighbor Joining
• Maximum likelihood
  - Maximize P(data|model)
• Maximum parsimony
  - Minimize evolutionary changes.
• Quartet methods
• Bayesian Markov chain Monte Carlo (MCMC)

Supertrees

• Supertree methods generally take a collection of trees as input and return a supertree, if one exists.
• A collection of phylogenetic trees is compatible if there exists a supertree that contains them all.
• The problem of determining whether a compatible supertree exists is NP-hard for unrooted input trees.

Why create supertrees?

• Allows comparative analysis of disparate sets of data and experiments:
  - Nuclear, mitochondrial DNA/RNA sequences
  - Metabolic pathway characterization
  - Gene microarray expression profiles
  - Behavior patterns
  - And more ...
• We don’t have the computational methods or power to analyze large data sets with existing phylogeny methods.

Creating supertrees

• Gather source trees, data:
  - Literature searches
  - DNA/RNA/protein sequence comparison
  - Gene microarray experiments
• Construct input trees with any of the construction methods.
• Create consensus tree using a supertree algorithm.
• Validate using bootstrapping, etc.
• Augment with fossil record, other evidence.

Creating Supertrees

Example: Vascular Plants

• Wojciechowski (1993) analyzed nuclear ribosomal DNA internal transcribed spacer sequences from various species of Astragalus (vascular plants).
• Used a software package called PAUP* to create a supertree with a maximum parsimony method.
• Results stored in TreeBASE.
• Trees visualized using a Java applet called ATV (A Tree Viewer).

Discussion

• What problems do you foresee with creating supertrees?
• How do you resolve conflicts in source trees?
• How do you deal with missing data?
• How do you deal with data reliability?
• Does this make sense biologically?

Desirable Properties

1. The method should be invariant to the order of the input trees.
2. Relabelling of leaf nodes on input tree(s) results in corresponding relabelling of the supertree.
3. Common structures shared by all input trees should appear in the supertree.
4. Any leaf that occurs in at least one input tree occurs in the supertree.
5. A polynomial time algorithm exists to compute the supertree.

Measuring Goodness

• What constitutes a “good” tree?
  - Minimum number of substitutions?
  - Maximum likelihood?
  - Answer: depends on the method!
• Use bootstrapping to evaluate branching points:
  - Resample the columns (with replacement), re-perform the analysis.
  - If the same structure appears n times out of N simulations, values of 0.9 ≤ n/N ≤ 1.0 are deemed statistically significant.
• Heuristic evaluation of monophyletic groups (a group of organisms all descended from a common ancestor).
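The bootstrap procedure can be sketched as follows. This is a minimal illustration, assuming the alignment is a dict of equal-length strings and that the tree-building step is supplied by the caller as a function returning a set of groupings; `closest_pair` is a toy stand-in for a real phylogeny method, and all names here are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, infer_groups, replicates=100, seed=0):
    """Return {group: n/N}, the fraction of replicates in which each group appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(infer_groups(resample_columns(alignment, rng)))
    return {group: n / replicates for group, n in counts.items()}


def closest_pair(alignment):
    """Toy stand-in for a tree builder: group the two most similar taxa."""
    def dist(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    best = min(combinations(sorted(alignment), 2), key=lambda pair: dist(*pair))
    return {frozenset(best)}


toy = {"A": "ACGTAC", "B": "ACGTAC", "C": "TGCATG"}
support = bootstrap_support(toy, closest_pair, replicates=50)
print(support)
```

Groups whose support n/N falls in the 0.9 to 1.0 range would be the ones treated as significant under the rule above.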

The problem of poor source data

• What to do with conflicting input data?
• What to do with under-represented data?
  - i.e. taxa that appear in only one tree.
• If a source tree is constructed from “poor data”, do we exclude it or downweight it?
• No, because:
  - We expect “good data” to dominate, i.e. the proper relationship is reported more often.
  - Inferences from “poor data” are not necessarily incorrect.
  - Downweighting “poor data” trees has little effect on the consensus supertree.

The problem of duplicated data

• The same experimental data may be used as an input to a number of input trees.
  - Experiments on the same gene.
  - Input trees do not reveal sources of duplication.
• Results in:
  - An over-representation of some data.
  - A violation of the assumption of independence.
• The problem is generally over-reported; its effects are minimized by proper weighting (often by bootstrap measures).

Some algorithms

• Supermatrix
• Strict Consensus Method (Gordon 1986)
• Matrix Representation using Parsimony (Baum 1992; Ragan 1992)
• Quartet methods
• Bayesian MCMC

Supermatrix methods

• An example of total evidence analysis: using all available data as input.
• The matrix representation of the data used to construct each input tree is amalgamated into a single input matrix.
  - A subjective method of merging sequence / expression data.
  - Often a case of sequence concatenation.
  - Assumes a constant evolution rate across genes.
• Normal tree construction methods such as maximum parsimony are used.
  - Sensitive to large input sets, which may dominate results.
  - Less useful for larger data sets due to large solution space, missing data.
  - Explore with sampling/replicates.

Strict consensus method

• Requires compatible trees.
• Finds a consensus supertree (very fast).
• If such a tree exists, it is returned; otherwise no tree is returned.
• Less likely to work for input trees with many overlapping taxa.
• Basic method depends on the order of the input trees.
• The problem of finding a supertree is in P if all input trees are rooted, NP-hard otherwise.

Strict consensus method

[Figure: two input trees with leaves A, B, C, D, E, F, G and A, C, H, D, I, G, J, and the resulting supertree]

• Consider two input trees.
• Identify common leaf nodes.
• Construct backbone tree.
• Examine each branch, add unique taxa.

Consensus efficiency

• Given a set S of source trees, a set C of possible consensus trees, and a set T of possible binary trees for L(S), we calculate the consensus efficiency:

$$\mathrm{CE} = \frac{\log|T| - \log|C|}{\log|T| - \log|S|}$$

• Maximize CE by minimizing |C|.
• Dependent on the quality of the source trees S.
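As a small worked illustration of the formula, the sketch below computes CE from the three set sizes. The function name and the example sizes are assumptions chosen only to show the two extremes of the measure.

```python
import math


def consensus_efficiency(n_T, n_C, n_S):
    """CE = (log|T| - log|C|) / (log|T| - log|S|) for positive set sizes."""
    denominator = math.log(n_T) - math.log(n_S)
    if denominator == 0:
        raise ValueError("|T| and |S| must differ for CE to be defined")
    return (math.log(n_T) - math.log(n_C)) / denominator


# Hypothetical sizes: 105 possible binary trees, 3 source trees.
print(consensus_efficiency(105, 105, 3))  # no candidate trees ruled out
print(consensus_efficiency(105, 15, 3))   # some candidates ruled out
print(consensus_efficiency(105, 3, 3))    # |C| shrunk to |S|
```

With these inputs CE runs from 0 when |C| = |T| up to 1 when |C| = |S|, which is why shrinking |C| raises the score.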

Matrix representation w/parsimony

• MRP is the most popular method for constructing supertrees.
• Generate an input matrix based on input tree topology, and analyze with maximum parsimony to determine the consensus tree.
• Bias towards:
  - Input trees with a large number of leaf nodes; small input trees are swamped.
  - “Popular opinion” - consistent results across input trees are favored over novel, more correct results in one or few trees.

Generate Input Matrix

• Evaluate all edges e_j for all input trees against all leaf nodes in L(S).
• Make entries as follows:
  - M_ij = 0 if leaf node i does not participate in edge j.
  - M_ij = 1 if leaf node i participates in edge j.
  - M_ij = ? if there is insufficient evidence (leaf node i does not appear in the tree that contains edge j).
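The matrix encoding can be sketched in a few lines. This is a minimal illustration, assuming rooted input trees written as nested tuples with string leaf names, and treating each internal node below the root as one edge (one matrix column); the representation and function names are assumptions made for the example.

```python
def leaves(tree):
    """Set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets of every internal node below the root, in preorder."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(trees):
    """Return {leaf: row string} with one 0/1/? column per internal edge."""
    all_leaves = sorted(set().union(*(leaves(t) for t in trees)))
    rows = {leaf: [] for leaf in all_leaves}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for leaf in all_leaves:
                if leaf not in present:
                    rows[leaf].append("?")   # leaf absent from this input tree
                elif leaf in clade:
                    rows[leaf].append("1")   # leaf lies below this edge
                else:
                    rows[leaf].append("0")   # leaf is in the tree but not below this edge
    return {leaf: "".join(cells) for leaf, cells in rows.items()}


# Two hypothetical input trees with overlapping leaf sets.
matrix = mrp_matrix([(("A", "B"), "C"), (("B", "D"), "C")])
for leaf, row in matrix.items():
    print(leaf, row)
```

The resulting rows would then be handed to a maximum parsimony program as an ordinary character matrix.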

MRP Example

Quartet methods

• Work well on input trees with significant overlap (50% or more).
• Consider the $\binom{n}{4}$ possible quartets for a leaf set with $|L(T)| = n$.
• A quartet (ab|cd) agrees with T if a, b, c, d are all leaf nodes of T and the path ab shares no vertices with the path cd.
• We may reconstruct T using the set q(T) of quartets that agree with T.
• Return a single unrooted tree, but not deterministically.
  - Helpful to do multiple runs!
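The agreement test can be made concrete with a short sketch that enumerates q(T) for a small unrooted tree. It assumes the tree is given as a list of undirected edges and that leaves are the vertices of degree 1; the helper names and the five-leaf example are illustrative assumptions.

```python
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def path(adj, start, goal):
    """Set of vertices on the unique tree path from start to goal."""
    stack = [(start, [start])]
    seen = {start}
    while stack:
        node, trail = stack.pop()
        if node == goal:
            return set(trail)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, trail + [nxt]))
    raise ValueError("vertices are not connected")


def quartets(adj):
    """All quartets ((a, b), (c, d)) whose two paths share no vertex."""
    leaf_nodes = sorted(v for v, nbrs in adj.items() if len(nbrs) == 1)
    found = set()
    for a, b, c, d in combinations(leaf_nodes, 4):
        for left, right in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
            if path(adj, *left).isdisjoint(path(adj, *right)):
                found.add((left, right))
    return found


# Hypothetical unrooted tree ((A,B),C,(D,E)) with internal vertices x, y, z.
tree = adjacency([("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"),
                  ("y", "z"), ("D", "z"), ("E", "z")])
for q in sorted(quartets(tree)):
    print(q)
```

For a fully resolved tree each of the n-choose-4 leaf subsets contributes exactly one agreeing quartet; at a polytomy some subsets contribute none.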

Quartet algorithm strategies

• Minimize $w(T) = \sum_{q \in q(T)} w(q)$
• Problem is NP-complete as an instance of the Maximum Quartet Consistency problem.
• When constrained by the degree d of the supertree, can be solved in polynomial time, with $O(n^4 k + n^2 d\,k^{d-1})$, down to $O(n^5)$ for d=3 or d=4 (Bryant and Steel 2001).
• Shows promise, but is slower than MRP and the results are no better.

Quartet puzzling

• Find the maximum likelihood for each of the three possible arrangements for 4 leaves.
• Randomize the order of the leaf nodes.
• Using the first quartet as a seed:
  - Use a scoring function to determine to which branch the next node should be added.
  - Repeat until the full tree of n leaves is constructed.
• Repeat many times to explore the landscape of optimal trees.
• Construct a majority rule consensus tree from the intermediate trees.

Quartet Branch Insertion

• Source quartets are AB|CD, AE|BC, AE|BD, AC|DE, and BD|CE.
• Penalty neighbors are in bold.

Quartet and tree weights

• Consider the quartet set $Q = \bigcup_i q(T_i)$ for the set of input trees $S = \{T_i\}$.
• Let f(q) be the frequency of the quartet q ∈ Q.
• Then the weight w(q) of q is:

$$w(q) = \begin{cases} -\log f(q) & \text{if } f(q) > 0 \\ r_\infty & \text{if } f(q) = 0 \end{cases}$$

where $r_\infty > \max\{ -\log f(q) : f(q) > 0 \}$

• This is one example; many other weighting schemes exist.
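The weighting scheme can be sketched as below. Two assumptions go beyond the slide: f(q) is taken to be the fraction of input trees whose quartet set contains q, and r∞ is set to the largest finite weight plus 1, which is one arbitrary choice satisfying the inequality. Quartets are written as plain strings for brevity.

```python
import math
from collections import Counter


def quartet_weights(quartet_sets, candidates):
    """Weight each candidate quartet by -log f(q), or r_inf when f(q) = 0.

    quartet_sets: one set of quartets per input tree (must be non-empty).
    candidates:   the quartets to be weighted.
    """
    counts = Counter(q for qs in quartet_sets for q in qs)
    n_trees = len(quartet_sets)
    finite = {q: -math.log(c / n_trees) for q, c in counts.items()}
    r_inf = max(finite.values()) + 1.0   # assumption: any value above the maximum works
    return {q: finite.get(q, r_inf) for q in candidates}


# Hypothetical quartet sets from two input trees.
weights = quartet_weights(
    [{"AB|CD", "AC|DE"}, {"AB|CD"}],
    ["AB|CD", "AC|DE", "AD|BC"],
)
print(weights)
```

Quartets found in every input tree get weight 0, rarer quartets get larger weights, and quartets found in no input tree get the largest weight of all.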

Maximum parsimony

• Occam’s Razor: the phylogenetic tree with the fewest changes is the most likely.
• Returns only branch structure, no lengths.
• Number of possible trees for n leaves:
  - $N_{unrooted} = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
  - $N_{rooted} = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
  - For n=10, Nu = 2.03E6, Nr = 3.45E7
• Decrease run time by:
  - Only using informative characters: positions where at least 2 different characters each appear in at least 2 leaves.
  - Combining with branch-and-bound strategies.
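The two counting formulas are easy to evaluate exactly with integer arithmetic. The sketch below is a direct transcription of them; the function names are assumptions for illustration.

```python
from math import factorial


def n_unrooted(n):
    """Number of unrooted binary trees on n labelled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def n_rooted(n):
    """Number of rooted binary trees on n labelled leaves (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (4, 5, 6):
    print(n, n_unrooted(n), n_rooted(n))
```

For four leaves this gives 3 unrooted and 15 rooted trees, and each additional leaf multiplies the count by a factor that itself grows with n, which is why exhaustive search quickly becomes infeasible.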

Maximum parsimony

• Given the Hamming distance: $H(x,y) = |\{ i : x_i \neq y_i \}|$
• Minimize the sum (parsimony length):

$$\sum_{(a,b) \in E} H(a,b)$$

• Good candidates are minimum spanning trees, shown to be within twice the length of the most parsimonious tree.
  - Kruskal’s algorithm:
    • Find and mark the lowest weight edge in the graph.
    • Find and mark the lowest weight edge that is unmarked and does not form a cycle.
    • Repeat until every vertex is reached.
    • Marked edges form the minimum spanning tree.
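Kruskal’s steps can be sketched on a complete graph whose edge weights are Hamming distances between sequences. This is a minimal illustration using a union-find structure for the cycle check; the sequence names, the toy data and the function names are assumptions.

```python
from itertools import combinations


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))


def kruskal_mst(seqs):
    """Return MST edges (a, b, weight) over all pairwise Hamming distances."""
    parent = {name: name for name in seqs}

    def find(v):
        # Follow parent links to the component root, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted((hamming(seqs[a], seqs[b]), a, b)
                   for a, b in combinations(sorted(seqs), 2))
    marked = []
    for weight, a, b in edges:          # lowest weight edges first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:            # edge joins two components, so no cycle
            parent[root_a] = root_b
            marked.append((a, b, weight))
    return marked


toy_seqs = {"s1": "AAAA", "s2": "AAAT", "s3": "AATT", "s4": "TTTT"}
mst = kruskal_mst(toy_seqs)
print(mst, sum(w for _, _, w in mst))
```

The summed weight of the marked edges is the parsimony length of the spanning tree, which serves as a starting candidate rather than the most parsimonious tree itself.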

Minimum spanning tree • Kruskal’s algorithm applied to graph of distances between 128 US cities.

http://www.cs.sunysb.edu/~skiena/combinatorica/animations/mst.html

Edge contraction techniques

[Figure: edge contraction example]

Subproblem Approach

• Problem space grows superexponentially with the number of species; the “divide and conquer” approach is more sensible (Felsenstein 1978).

Parallel Bayesian MC3

• Metropolis-coupled Markov chain Monte Carlo method (Feng et al. 2003).
• Related to maximum likelihood methods, better for analysis of long sequences.
• For n=100, 1.7E182 possible unrooted, 3.4E184 possible rooted trees!
• Even in parallel, takes days or weeks to compute a solution.

Parallel Bayesian MC3

• Uses an equal likelihood of each tree as an assumed uninformative prior.
• Three main tasks:
  - Calculate the posterior probability of a specific model.
  - Estimate the posterior probability distribution of the models.
  - Find a maximum posterior probability tree.
• Exploits structures in the simulation to save computation at the expense of memory usage.
• Strategy is to split the problem across the grid by subsequence and chain-runs.

Subproblem grid assignment

Sequence 1..2000

              1..500   501..1000   1001..1500   1501..2000
chain {1,2}   P11      P12         P13          P14
chain {3,4}   P21      P22         P23          P24
chain {5,6}   P31      P32         P33          P34
chain {7,8}   P41      P42         P43          P44
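The grid layout amounts to a simple index calculation. The sketch below assumes the sizes shown in the grid (8 chains, 2000 sequence positions, a 4 x 4 processor grid) as defaults and requires that chains and positions divide evenly across the grid; the function name and return format are hypothetical.

```python
def processor_for(chain, site, n_chains=8, n_sites=2000, rows=4, cols=4):
    """Return the (row, column) of the processor P<row><column> that handles
    a given chain (1-based) at a given sequence position (1-based)."""
    if n_chains % rows or n_sites % cols:
        raise ValueError("chains and sites must divide evenly across the grid")
    if not (1 <= chain <= n_chains and 1 <= site <= n_sites):
        raise ValueError("chain or site out of range")
    row = (chain - 1) // (n_chains // rows) + 1   # chains per row: 2
    col = (site - 1) // (n_sites // cols) + 1     # sites per column: 500
    return row, col


for chain, site in [(1, 1), (4, 501), (8, 2000)]:
    row, col = processor_for(chain, site)
    print(f"chain {chain}, site {site} -> P{row}{col}")
```

Each processor therefore evaluates a 500-position slice of the sequence for two of the eight chains.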
