Chapter 13: Building phylogenetic trees

Nothing in biology makes sense except in the light of evolution. __Theodosius Dobzhansky (1900-1975).

Outline

• Phylogenetic trees
• Matching multiple sequence alignment
• ClustalW/JalView and Neighbour Joining Tree
• Phylip and Bootstrapping

Background knowledge

• Taxonomy classifies life into groups
• Example: taxonomic classification of man
– Superkingdom: Eukaryota
– Kingdom: Metazoa
– Phylum: Chordata
– Class: Mammalia
– Order: Primates
– Family: Hominidae
– Genus: Homo
– Species: sapiens

Are there correct trees?

• Conditions
– data are clean
– outgroups are correctly specified
– appropriate algorithms are chosen
– no assumptions are violated, etc.
• Question
– Can the true, correct tree be found and proven to be scientifically valid?
• Unfortunately, it is impossible to ever conclusively state what the "true" tree is for a group of sequences (or a group of organisms); taxonomy is constantly under revision as new data are gathered.

But beware

• Directory: http://evolution.genetics.washington.edu/phylip/software.html

Species

• Any group of closely related organisms that can produce fertile offspring.
• Two organisms are more closely "related" as they approach the level of species, that is, they have more genes in common.

Fossils

• Fossils are relics or impressions of organisms from the past, mineralized in rock.
• Fossils provide a documented history of how the diversity of life has changed, and how organisms are related.

Fossils reliability

• Fossils are reliable only when properly dated.
• Geological dating.
• Radiometric dating – a method used to determine the age of rocks and fossils by studying the rate of decay of radioactive isotopes.

Time unit

• Half-life – the number of years it takes for 50% of the original sample to decay (14C has a half-life of about 5730 years).
• Half-life is NOT affected by temperature, pressure, or other environmental variables.

Principle

• While alive: the organism assimilates both 12C and 14C.
• After it dies: assimilation stops and the amount of 14C declines relative to 12C.
• Once found: the ratio of 14C to 12C can be measured to give the number of half-life reductions.
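The dating principle above can be written as a short calculation. This is a minimal sketch, assuming the measured quantity is the fraction of the original 14C still present; the function name and the half-life value passed in the usage line are illustrative assumptions, not part of the slides.

```python
import math

def radiocarbon_age(fraction_remaining, half_life_years):
    """Estimate age from the fraction of the original 14C that remains.

    Each half-life halves the remaining fraction, so the number of
    half-life reductions is log2(1 / fraction_remaining).
    """
    if not 0 < fraction_remaining <= 1:
        raise ValueError("fraction_remaining must be in (0, 1]")
    half_lives = math.log2(1 / fraction_remaining)
    return half_lives * half_life_years

# A sample with one quarter of its 14C left has gone through two half-lives.
# The half-life is an assumed parameter; use whichever value your source quotes.
print(radiocarbon_age(0.25, half_life_years=5730))
```

Passing the half-life as a parameter keeps the sketch independent of which published value is used.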

All species are related by descent

Phylogeny

• Phylogeny is the evolutionary history of a species or group of related species.
• Means of classification of organisms.
• Explains the present diversity of living creatures.
• Represented as genealogic tree – pedigree – tree of life. Usually binary tree with contemporary organisms at the leaves.
• Phylogenetic tree: presenting evolutionary relationship.

Phylogenetic analysis

• Phylogeny applied to gene data.
• Gene data are more objective than morphology data.
• Reconstruction of evolutionary history of a group of organisms or sequences.
• Goal of phylogenetic inference is to reconstruct the order of splitting events (and perhaps the distances between them).

Phylogenetic problem

• Input:
– A set of contemporary species (S) whose evolutionary relationship is to be reconstructed.
– A set of inheritable characteristics (C) that describe each species. Characteristics can be quantitative – continuous (size), or qualitative – discrete (gene sequence).
• Output:
– Tree (branch lengths) which best fits the data.
• Assumptions:
– Common ancestor, homologous characteristics.

Why phylogenetic analysis?

• Constructing vaccines
– Want to assure that vaccine is constructed to address diverse strains of the disease ((bird) influenza).
• Epidemiology
– Reconstruct paths of infection, either for an individual or for population (HIV).

Systematics

• Study of biological diversity in an evolutionary context.
• Includes taxonomy - the science of naming and classifying the diversity of organisms.

Phylogenetic systematics

• Deals with identifying and understanding the relationships among the many different kinds of life on Earth, both living (extant) and dead (extinct).
• Phenetics and cladistics.

Phylogenetic terms

• Monophyletic
– group of DNA sequences from a single common ancestral sequence.
• Clade
– a group of all monophyletic DNA sequences and their ancestor included in the analysis.
• Parsimony
– choosing tree by the shortest evolutionary pathway, pathway with the smallest number of nucleotide changes to go from the ancestral sequence at the root of the tree to all of the present-day sequences that have been compared.

Phenetics

• Numerical taxonomy, involves the use of various measures of overall similarity for the ranking of species.
• Data converted to numerical values without any character "weighting" and compared to produce phenograms (taxonomic clusters).

Cladistic groups

• Monophyletic
– all species share a common ancestor, and all species derived from that common ancestor are included. This is the only form of grouping accepted as valid by cladists.
• Paraphyletic
– all species share an immediate common ancestor, but NOT all species derived from that common ancestor are included.
• Polyphyletic
– species that do NOT share an immediate common ancestor are lumped together, while excluding other members that would link them.

Cladistics

• Members of a group share a common evolutionary history, emphasis on recognizing only monophyletic groups, a group plus all of its descendants, or clades.
• Sharing the set of unique features (apomorphies) within a related group.
• Species arise from bifurcation.

[Figure: example trees with root A, internal nodes B, C, F, I (and L) and leaves D, E, G, H, J, K; panels: Monophyletic, Polyphyletic, Paraphyletic]

Question

[Figure: tree with root A, internal nodes B, C, F, I and leaves D, E, G, H, J, K]

Homologs

• Orthologs - produced by speciation; similar function, genes derived from a common ancestor that diverged because of divergence of the organism.
• Paralogs - produced by gene duplication; different function, genes derived from a common ancestral gene that duplicated within an organism and then diverged.
• Xenologs – produced by the horizontal transfer of a gene between two organisms; function tends to be similar.

Orthology and paralogy

Evolution

• Drivers of evolution: mutation, selection, drift (indiscriminate parents selection, founder effect, and bottleneck effect).

"Natural Selection" is the principle by which each slight variation, if useful, is preserved. __Charles Darwin

Evolution theories

• Neodarwinism (selectionism) – survival of the fittest.
• Neutralism (Kimura) – most mutations are lethal or neutral; neutral mutations spread in the population by chance.
• We use mostly neutral mutations for phylogenetic tree computations because they accumulate smoothly and follow a molecular clock (ticking at a different pace for every gene).

Phylogenetic tree

Phylogenetic tree parts

• Node - taxonomic unit. Either an existing species or an ancestor.
• Root - the common ancestor of all taxa.
• Clade - a group of two or more taxa or DNA sequences that includes both their common ancestor and all of their descendants.
• Topology - the branching patterns of the tree.
• OTU – Operational Taxonomic Unit.

Branch

• Relationship between the taxa in terms of descent and ancestry.
• Branch
– scaled: length represents the number of changes (in terms of passage of time)
– unscaled: branch length is NOT proportional to the number of changes that have occurred.

Tree

• Rooted: with a node (root) representing a common ancestor, from which a unique path leads to any other node.
• Unrooted: without identifying a common ancestor or evolutionary path.
• The oldest split (the tree root) is the hardest to reconstruct.
– ignore the problem, and produce unrooted trees
– use an "outgroup" - more distant relation than most distant pair in the phylogeny
• Outgroups are problematic
– When close, too hard to be sure an outgroup really is one
– When far away, too hard to identify homologous characteristics.

Unrooted tree

Rooted tree

Unrooted to rooted

• Include outgroup to make a correct root.

Newick tree format

• (B,(A,C,E),D);
• ((Human,Gorilla),(Mouse,Rat));
• ((Human:0.1,Gorilla:0.1):0.4,(Mouse:0.2,Rat:0.2):0.3);
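To make the Newick notation concrete, here is a minimal sketch of a parser for the simple subset used in the examples above (nested parentheses, names, optional branch lengths after a colon). The function names and the dictionary layout are assumptions made for illustration; quoted labels and comments, which the full format allows, are not handled.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {'name': str, 'length': float or None, 'children': list}.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text at position %d" % pos)
    return root


def leaf_names(node):
    """List leaf names in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((Human:0.1,Gorilla:0.1):0.4,(Mouse:0.2,Rat:0.2):0.3);")
print(leaf_names(tree))
print(tree["children"][0]["length"])  # branch above the (Human, Gorilla) clade
```

The first example, (B,(A,C,E),D);, parses the same way; its inner node simply has three children and no branch lengths.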

Gene tree vs. species tree

• NOT identical because internal nodes are NOT always equivalent.
• Gene tree internal node - the divergence of an ancestral gene into two genes with different DNA sequences, usually resulting from a mutation.
• Species tree internal node - speciation event, whereby the population of the ancestral species splits into two groups that are no longer able to interbreed.
• GeneTree http://taxonomy.zoology.gla.ac.uk/rod/genetree/genetree.html

Biological questions with phylogenetic trees

• I sequenced rRNA of an unknown bacterium. What is the closest relative of my organism?
• I found a gene. Is it orthologous to another well-characterized gene?
• My gene seems strange within my organism. Does it descend from a horizontal transfer?

Multiple sequence alignment (MSA)

• A multiple sequence alignment is good for specifying a phylogeny.
• Each position (column in an MSA) is a "character".
• Can calculate "distances" or "set of changes" required, based on differences in aligned columns and indels.
• These distances or change sets can be used to score alternative trees.

Preparation of data

• GIGO: garbage in – garbage out
• Highly accurate multiple sequence alignment with properly chosen sequences.
• Assumption of divergent evolution is right – convergent evolution on the molecular level is the exception.

…preparation of data

• If you select paralogs for your multiple sequence alignment then your gene tree concerns gene family only.
• If you select orthologs then your gene tree resembles species tree.
• If you know the species tree then you can ask if your gene is orthologous.

…preparation of data

• Use DNA multiple sequence alignment only if >70% identical.
• If your DNAs <70% identical then translate them to proteins.
• If your DNAs <70% identical but very similar on protein level then
– thread your DNA back to your protein alignment using Transalign http://life.anu.edu.au/molecular/software/transalign/
– after threading, if most synonymous mutation sites are different - several neutral mutations on one position (sites saturated) then use protein level,
– if synonymous mutation sites are NOT saturated then use DNA level.
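The 70% identity rule of thumb can be checked with a few lines of code. This is a minimal sketch, assuming two already aligned sequences of equal length and assuming that identity is counted over columns where neither sequence has a gap; the slides do not say how gaps should be treated, and both function names are hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of aligned, gap-free columns in which both sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def alignment_level(seq_a, seq_b, threshold=70.0):
    """Suggest 'DNA' above the identity threshold, otherwise 'protein'."""
    return "DNA" if percent_identity(seq_a, seq_b) > threshold else "protein"


print(percent_identity("ATCG", "ATCC"))   # 3 of 4 columns agree
print(alignment_level("ATCG", "ATCC"))
print(alignment_level("ATCG", "TTCC"))
```

The saturation check on synonymous sites described in the last two sub-points is not covered by this sketch.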

Distinguishing ortho- and paralogs

• Reciprocal BLAST best match:
– choose sequence A from genome A,
– run BLAST against complete genome B,
– sequence B is your top BLAST hit,
– run BLAST of sequence B against complete genome A,
– if sequence A is the top hit then A and B are orthologous.

Question

• Can two homologous genes from two different species be paralogous?
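The reciprocal best match test reduces to a lookup in both directions once the BLAST searches are done. The sketch below assumes the top hits have already been extracted into two dictionaries (query gene to best hit); the gene names are made up for illustration and no BLAST call is shown.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (gene_a, gene_b) pairs that are each other's top hit.

    best_a_to_b maps each gene of genome A to its top hit in genome B;
    best_b_to_a maps each gene of genome B to its top hit in genome A.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical top hits, standing in for parsed BLAST output.
a_to_b = {"geneA1": "geneB1", "geneA2": "geneB1", "geneA3": "geneB7"}
b_to_a = {"geneB1": "geneA1", "geneB7": "geneA9"}

print(reciprocal_best_hits(a_to_b, b_to_a))
```

Here only geneA1 and geneB1 pass: geneA2 also hits geneB1, but geneB1 points back to geneA1, which is the pattern expected when geneA2 is a paralog.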

Collections of orthologous genes

• COG - Clusters of Orthologous Groups by Eugene Koonin
• Hovergen - Homologous Vertebrate Genes Database
• Hobacgen - Homologous Bacterial Genes Database
• Systers - large-scale protein clustering based on sequence similarity

…preparation of data

• NO fragments – they produce bad multiple alignment
• NO xenologs (unless you study them)
• NO recombinant sequences – protein from proteins, have two ancestors, common in viruses
• NO large complex families with many domains and repeats (e.g. ABC transporter)
• NO large datasets
• Add an alignable outgroup for rooting the tree.
• All sequences you use are of the same age – the same number of neutral mutations (unless you use fossil DNA).

Multiple sequence alignment

• Distance matrix
– table with the distances separating each pair of sequences in your dataset,
– made by running multiple sequence alignment and extracting info for every pair of sequences separately.
• ClustalW http://www.ebi.ac.uk/clustalw/,
• Dialign http://bibiserv.techfak.uni-bielefeld.de/dialign/,
• Tcoffee http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi.
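A distance matrix of the kind described above can be built directly from an alignment. This is a minimal sketch that uses the proportion of differing sites (often called the p-distance) as the distance; that choice, the gap handling, and the function names are assumptions, since real tools usually apply a substitution model on top.

```python
from itertools import combinations

def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of gap-free aligned columns in which the sequences differ."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Build a symmetric dict-of-dicts distance matrix from {name: sequence}."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = p_distance(alignment[a], alignment[b])
    return matrix


# Toy alignment; each sequence is used as its own name for readability.
alignment = {s: s for s in ["ATCG", "TTCG", "ATCC", "TCCG"]}
matrix = distance_matrix(alignment)
for name in alignment:
    print(name, [matrix[name][other] for other in alignment])
```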

Editing multiple sequence alignment

• Remove gaps
– use complete-deletion parameter in ClustalW
– remove the whole gap-rich region (loop)
• Remove extremities: N- and C- termini are poorly conserved
• Keep informative blocks (25 aa with a few conserved positions)
• Use JalView (or Alt-mouse in MS Word to remove columns).
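The gap removal step can be sketched as dropping every alignment column that contains a gap in any sequence, which is the idea behind complete deletion. The function name and the toy alignment are assumptions for illustration; this is not how ClustalW or JalView implement it.

```python
def remove_gap_columns(alignment, gap="-"):
    """Drop every column that has a gap in at least one sequence."""
    names = list(alignment)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns if gap not in column]
    return {
        name: "".join(column[i] for column in kept)
        for i, name in enumerate(names)
    }


toy = {"s1": "AT-CG", "s2": "ATGCG", "s3": "A-GCC"}
print(remove_gap_columns(toy))
```

Columns 2 and 3 each contain a gap, so only columns 1, 4 and 5 survive.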

Molecular clock

• If mutation rates are constant then sequence distance = evolutionary time distance.
• In reality, it fits only over relatively short intervals and in certain regions.
• Why not generally?
– Selective pressure (mutations that disrupt function in coding regions will occur less often than neutral mutations)
– Punctuated equilibrium (causative factors in mutation rates like UV light exposure or earthquakes vary over evolutionary time)
– Random changes in mutation rate.

Molecular clock ticking

If the same phylogeny is obtained with the molecular clock assumption as without, we can indirectly assume that the assumption has been satisfied.

Tree methods

• Parsimony (MP)
• Distance (UPGMA, NJ, …)
• Likelihood (ML)
• Artificial intelligence

Sequence distances

a = a_1 a_2 a_3 ... a_100
b = b_1 b_2 b_3 ... b_100

• Hamming distance - no. of different positions
• Euclidean distance: $\sqrt{\sum_{i=1}^{100} (a_i - b_i)^2}$
• City block distance: $\sum_{i=1}^{100} |a_i - b_i|$

Sequence distances

a = 101010101
b = 001110100

• Hamming distance = 3
• Euclidean distance = $\sqrt{1+0+0+1+0+0+0+0+1} = \sqrt{3} \approx 1.73$
• City block distance = 1+0+0+1+0+0+0+0+1 = 3

Decision
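The three sequence distances can be checked on the binary example a = 101010101, b = 001110100. This is a minimal sketch with assumed function names; it treats each character as a digit. With the usual square root the Euclidean distance of the example is √3 ≈ 1.73, while the sum of squared differences is 3, equal to the Hamming and city block values because every difference is 0 or 1.

```python
import math

def hamming(a, b):
    """Number of positions at which the two strings differ."""
    return sum(x != y for x, y in zip(a, b))

def city_block(a, b):
    """Sum of absolute differences between digits."""
    return sum(abs(int(x) - int(y)) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared differences between digits."""
    return math.sqrt(sum((int(x) - int(y)) ** 2 for x, y in zip(a, b)))

a = "101010101"
b = "001110100"
print(hamming(a, b))      # 3
print(city_block(a, b))   # 3
print(euclidean(a, b))
```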

Maximum parsimony

• Maximum parsimony minimizes the sum of the Hamming distances among connected nodes, the fewest mutations (and shallowest tree).
• Evaluates all possible trees based on the number of evolutionary changes that are needed to explain the observed data.
• Parsimony is the most popular method for reconstructing ancestral relationships because it allows the use of all known evolutionary information in building a tree (in contrast, distance methods compress all of the differences between pairs of sequences into a single number).
• No assumption of a molecular clock; produces an unrooted tree.
• Practical, if
– less than 20 species (exponential growth of number of trees examined)
– well conserved family.

MSA column
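Counting "the number of evolutionary changes needed to explain the observed data" for one candidate tree can be sketched with Fitch's small-parsimony algorithm. Fitch's algorithm is a standard technique named here as an assumption; the slides do not say which counting procedure is used. Trees are written as nested tuples of leaf names, and the four sequences are the ones used in this chapter's parsimony example.

```python
def fitch_column(tree, column):
    """Return (possible states at this node, minimum changes below it).

    tree is a leaf name (str) or a pair (left, right);
    column maps each leaf name to its character in one alignment column.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_column(left, column)
    right_states, right_cost = fitch_column(right, column)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on one of the two branches.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_column(tree, {name: seq[i] for name, seq in sequences.items()})[1]
        for i in range(length)
    )


seqs = {s: s for s in ["ATCG", "TTCG", "ATCC", "TCCG"]}
grouping_1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
grouping_2 = (("ATCG", "TTCG"), ("ATCC", "TCCG"))
print(parsimony_score(grouping_1, seqs))  # 3
print(parsimony_score(grouping_2, seqs))  # 4
```

Under this count the grouping (ATCG, ATCC) versus (TTCG, TCCG) needs fewer changes than the alternative pairing, so it would be preferred. A full search would repeat this scoring over every candidate topology, which is why the method only stays practical for small sets of sequences.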

Parsimony example

• Consider four sequences: ATCG, TTCG, ATCC, and TCCG.
• Imagine a tree that branches at the first position, dividing ATCx from TyCG.
• Then each branch splits, grouping ATCG and ATCC on one branch, TTCG and TCCG on the other branch.
• Total of 3 nodes on the tree (Tree 1).

Compare two trees

• Tree 2 first divides ATCC on its own branch, then splits off ATCG, and finally divides TTCG from TCCG.
• Trees 1 and 2 both have three nodes, but when all of the distances back to the root (no. of nodes crossed) are summed, the total is equal to 8 for Tree 1 and 9 for Tree 2.

[Figure: Tree 1 and Tree 2]

Decision

Distance

• Basic idea:
– Given an n x n distance matrix D between n contemporary organisms, find a tree with positive branch weights such that the sum of weights along the path between two organisms is equal to the distance between them.
• Some distance matrices (D) have such a tree, some don't.
– Ones that have a tree are called "additive".
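Whether a distance matrix is additive can be tested with the four-point condition: for every four taxa, of the three ways to split them into two pairs, the two largest pair sums must be equal. The condition is a standard result brought in here as an assumption, not something stated on the slide; the sketch also assumes a symmetric matrix with a zero diagonal, given as a list of lists. The example distances are made up from a small tree with branch lengths A:1, B:2, C:4, D:5 and an internal branch of 3.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # The two largest sums must coincide for a tree to fit exactly.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Taxa order: A, B, C, D. Path lengths read off the example tree.
tree_like = [
    [0, 3, 8, 9],
    [3, 0, 9, 10],
    [8, 9, 0, 9],
    [9, 10, 9, 0],
]
# Same matrix with d(B, D) changed, so no tree fits it exactly.
not_tree_like = [
    [0, 3, 8, 9],
    [3, 0, 9, 12],
    [8, 9, 0, 9],
    [9, 12, 9, 0],
]
print(is_additive(tree_like))      # True
print(is_additive(not_tree_like))  # False
```

Real data are rarely exactly additive, which is why the tree-building methods look for the best approximate fit.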

Four parameters for distance methods

• Multiple alignment – choose method and sequences.
• Distance measure – choose substitution matrix.
• Tree reconstruction algorithm – UPGMA (average), NJ (minimum branch length), Fitch (minimum square), Kitsch (Fitch + molecular clock).
• Type of tree – cladogram without distances or phenogram with distances, (un)rooted.

Distance matrix

UPGMA

• Unweighted pair-group method using arithmetic averages clustering
• Assumption of evolutionary clock.
• Method
– Let initial organisms be leaves of the tree.
– Iteratively add a parent to the closest pair of nodes.
– Set the distance from the new internal node to every other node to the mean of the distances from its leaves.
• Produces rooted tree.

UPGMA clustering of entry data

• Table cells correspond to the number of different amino acids in the molecule of cytochrome c:
– 1 difference between man and monkey
– 19 differences between man and turtle.
• Matrix will be reduced step by step.

First cycle

• At each cycle, the smallest entry (corresponding to the closest relatives, shown in bold) is located, and the entries intersecting at that cell are joined (we make a branch from two leaves).
• Cells of one colour on the left table are combined (averaged) to form the new entries on the right table.
• The length of the branch to this junction is 1/2 the value of the original entry (the smallest entry).

• Example: In this table, smallest entry at the beginning is 1 (between B = man and F = monkey); B and F are joined with branch lengths of ½*1=0.5. The new table value for intersection of C and joined BF is average of value for BC and FC ((31+32)/2=31.5).
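The cycle described above can be sketched in code. This is a minimal UPGMA sketch under stated assumptions: distances are stored in a dictionary keyed by unordered pairs, averages are weighted by cluster size, and only the three values quoted in the example (B-F = 1, B-C = 31, F-C = 32) are used, since the full table is a figure. The function name and the joined-label scheme (concatenating names) are illustrative choices.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster with UPGMA; return (newick_string, list_of_merges).

    dist maps frozenset({a, b}) to the distance between a and b.
    Each merge is recorded as (a, b, distance_joined, node_height).
    """
    clusters = {name: (name, 1, 0.0) for name in names}  # newick, size, height
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Locate the smallest entry among the current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dab = d[frozenset((a, b))]
        height = dab / 2  # each branch to the junction is half the entry
        (na, sa, ha), (nb, sb, hb) = clusters.pop(a), clusters.pop(b)
        new = a + b
        for c in clusters:
            # New entry: average over all leaves of the joined cluster.
            d[frozenset((new, c))] = (
                sa * d[frozenset((a, c))] + sb * d[frozenset((b, c))]
            ) / (sa + sb)
        newick = "(%s:%s,%s:%s)" % (na, height - ha, nb, height - hb)
        clusters[new] = (newick, sa + sb, height)
        merges.append((a, b, dab, height))
    ((newick, _, _),) = clusters.values()
    return newick + ";", merges


dist = {
    frozenset(("B", "F")): 1,   # man vs monkey
    frozenset(("B", "C")): 31,
    frozenset(("F", "C")): 32,
}
newick, merges = upgma(["B", "F", "C"], dist)
print(merges[0])  # B and F joined at height 0.5
print(merges[1])  # C joins BF at the averaged distance 31.5
print(newick)
```

The second merge reproduces the averaged table value (31 + 32) / 2 = 31.5 from the example.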

Next 2 cycles Final cycles

Neighbour Joining (NJ)

• Agglomerative clustering method.
• Neighbour Joining minimizes the sum of branch lengths for the resulting tree.
• Method does NOT assume a molecular clock – it is good when some sequences evolve faster than others.
• Provides unrooted tree (can be rooted later).
• The method is the most popular way to build trees from distance measurements.
• Neighbour Joining methods generally produce just one tree, which can help to validate a tree built with the parsimony or maximum likelihood method.

Fitch

• Fitch-Margoliash minimizes the sum of squares between distances on the tree and in the matrix (the mean least square method).
• Unrooted.
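One Neighbour Joining step can be sketched as follows. The pair to join is the one minimizing Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from i to all other taxa; this standard criterion is brought in as an assumption because the slides do not spell it out. The five-taxon distances are illustrative values, not data from this chapter.

```python
def nj_pick_pair(names, d):
    """Choose the pair to join in one NJ step and its two branch lengths.

    d is a dict of dicts with d[x][y] the distance between x and y.
    Returns (a, b, q_value, branch_to_a, branch_to_b).
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")
    totals = {a: sum(d[a][b] for b in names if b != a) for a in names}
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            q = (n - 2) * d[a][b] - totals[a] - totals[b]
            if best is None or q < best[0]:
                best = (q, a, b)
    q, a, b = best
    branch_a = d[a][b] / 2 + (totals[a] - totals[b]) / (2 * (n - 2))
    branch_b = d[a][b] - branch_a
    return a, b, q, branch_a, branch_b


names = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {x: {} for x in names}
for (x, y), value in pairs.items():
    d[x][y] = d[y][x] = value

print(nj_pick_pair(names, d))
```

In this example NJ joins a and b (Q = -50) even though d and e have the smallest raw distance (3, Q = -48), and the two new branches have unequal lengths (2 and 3). That is how the method accommodates sequences that evolve at different rates.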

Kitsch

• Fitch with a molecular clock • Rooted


From prior belief to posterior belief

Maximum likelihood

• Consider Bayes theorem: $P(H \mid \text{data}) = \dfrac{P(H)\,P(\text{data} \mid H)}{P(\text{data})}$
• P(data) is constant for any model. If priors over models are uniform, the model that gives the highest likelihood for the data is the best.
• We have a parameterized statistical model of evolution and a set of observations.
• We could find the most likely set of parameters.
• Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.
• MrBayes http://morphbank.ebc.uu.se/mrbayes/,
• TreePuzzle http://www.tree-puzzle.de/
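The step from Bayes theorem to maximum likelihood can be illustrated numerically. This is a minimal sketch with made-up likelihood values for three hypothetical tree topologies; it only shows that with uniform priors the hypothesis with the highest likelihood also has the highest posterior.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes theorem: P(H | data) = P(H) * P(data | H) / P(data)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(data), the same for every hypothesis
    return {h: joint[h] / evidence for h in joint}


# Hypothetical likelihoods P(data | tree) for three candidate topologies.
likelihoods = {"tree_1": 0.02, "tree_2": 0.01, "tree_3": 0.01}
uniform = {h: 1 / 3 for h in likelihoods}

post = posteriors(uniform, likelihoods)
best_by_posterior = max(post, key=post.get)
best_by_likelihood = max(likelihoods, key=likelihoods.get)
print(post)
print(best_by_posterior == best_by_likelihood)  # True
```

With non-uniform priors the two rankings can differ, which is the practical difference between a Bayesian and a maximum likelihood analysis.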

Priors

• Branch length parameters
• Relative substitution rate parameters
– transition/transversion ratio
• Rate heterogeneity parameters
– proportion of invariable sites
• Nucleotide frequency parameters
• Tree topologies

[Figure: Prior probability, Likelihood, Posterior probability]

Artificial intelligence

• SOTA Self Organizing Tree Algorithm
• ftp://ftp.cnb.uam.es/cnb/sota/

Decision simplified

• Divergent sequences: NJ and ML
• Technically feasible: NJ and MP

Famous trees

• 16S ribosomal trees
– Generated Archaeal kingdom hypothesis
– Exist in all organisms, highly conserved
– Suitable for very broad phylogenies
– Sequenced in tens of thousands of organisms
– Not appropriate for close relatives.
• Whole genome tree
– Characteristics over entire genomes (presence/absence of homologous genes, chromosomal rearrangements)
– Good for distant taxa
– Mostly practical for microbes, since relatively few completely sequenced genomes of metazoa yet.

iTOL

• http://itol.embl.de/index.shtml

ClustalW

• Any tree with ".dnd" extension is NOT a phylogenetic tree (it is a guide tree for multiple sequence alignment).
• Steps
– retrieve a list of entries with your accession numbers
– ClustalW http://www.ebi.ac.uk/clustalw/index.html

ClustalW output

• Program log
• Scores table
• Multiple alignment
• Guide tree

Computing phylogenetic tree

Your phylogenetic tree

Phylip

• Phylip: PHYLogeny Inference Package
• http://evolution.genetics.washington.edu/phylip/phylip.html
• Allows bootstrapping

Bootstrapping

• Assessing the quality of your tree
• Checking of robustness - whether every part of the alignment equally supports your tree
• Steps of bootstrapping:
– randomly generate bootstrap alignments with missing and duplicated columns (100 times)
– calculate distance matrix for each
– build a tree for each
– compare trees, make a consensus tree with robust branches.

Initial alignment

Column  1 2 3 4 5 6 7 8 9
Seq_1   A B C D E F G H I
Seq_2   A A B B C B A C A
Seq_3   C C A C B A C A B

Bootstrap alignment 1

Column  1 1 8 1 2 5 1 8 2
Seq_1   A A H A B E A H B
Seq_2   A A C A A C A C A
Seq_3   C C A C C B C A C

Bootstrap alignment 2

Column  1 4 5 6 6 3 4 1 7
Seq_1   A D E F F C D A G
Seq_2   A B C B B B B A A
Seq_3   C C B A A A C C C
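The column resampling behind the bootstrap alignments can be sketched as follows. Columns are drawn with replacement, so some are duplicated and some are missing. The function names and the seeded random generator are assumptions for illustration; Phylip performs this step with its own program.

```python
import random

def resample(alignment, columns):
    """Build a new alignment from the given 1-based column numbers."""
    return {
        name: "".join(seq[c - 1] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_alignments(alignment, replicates=100, seed=None):
    """Yield (columns, alignment) pairs with columns drawn with replacement."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    for _ in range(replicates):
        columns = [rng.randint(1, length) for _ in range(length)]
        yield columns, resample(alignment, columns)


initial = {
    "Seq_1": "ABCDEFGHI",
    "Seq_2": "AABBCBACA",
    "Seq_3": "CCACBACAB",
}

# Reproduce "Bootstrap alignment 1" from its listed column numbers.
print(resample(initial, [1, 1, 8, 1, 2, 5, 1, 8, 2]))

# Draw three random replicates (the slides suggest 100).
for columns, replicate in bootstrap_alignments(initial, replicates=3, seed=0):
    print(columns, replicate["Seq_1"])
```

Each replicate would then go through the same distance matrix and tree-building steps, and branches that recur in most replicate trees are the robust ones kept in the consensus tree.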

Displaying phylogenetic tree

• Phylodendron http://www.es.embnet.org/Doc/phylodendron/treeprint-form.html
• Newick format to graphical output
• iTOL http://itol.embl.de/index.shtml

Making phylogenetic tree online

• ClustalW http://www.ebi.ac.uk/clustalw/
• GeneBee http://www.genebee.msu.ru/clustal/basic.html
• JalView http://www.es.embnet.org/cgi-bin/clustalw.cgi
• Phylip http://www.sacs.ucsf.edu/Resources/phylip/
• PAUP http://paup.csit.fsu.edu/about.html - Phylogenetic Analysis Using Parsimony (100$)

When a tree is not the best model?

• For individuals within a species
– The genetic material of an individual doesn't derive from a single earlier existing individual.
– Animals and plants that multiply by sexual reproduction receive half their genetic material from each of two parents, so a tree like this is inappropriate.
• For closely related species
– Individuals do occasionally mate between closely related species, and their progeny survive to contribute to the gene pool of one or both of the parent species.
– Solution: either treat closely related species as one larger variable species or do not consider closely related species.
• Hybrid species
– new tetraploid plant species can arise from two related diploid species.
• Distant interaction (horizontal transfer)
– A bacterium can ingest the genetic material of a bacterium of another species and incorporate part of it into its own genetic material.
– Viruses can inadvertently transport genetic material from one species to another.

Alternatives of phylogenetic trees

• Networks

mtDNA network

http://www.fluxus-technology.com/sharenet.htm

Thank you for your attention!