Phylogenetic Inference

Christian M. Zmasek, PhD [email protected] https://sites.google.com/site/cmzmasek/home

GABRIEL Network/J. Craig Venter Institute APPLICATIONS OF GENOMICS & TO INFECTIOUS DISEASES

2017-12-07

Overview

• General concepts, common misconceptions
• Tree of Life (Eukaryotic, Bacterial, Viral)
• Homologs, gene duplications, orthologs, …
• Methods
• Unix command line refresher
• Multiple Sequence Alignment (MAFFT)
• Pairwise distance calculation (Phylip)
• Distance based methods (Neighbor Joining, FastME)
• Maximum Likelihood methods (RAxML, PhyML)
• Bayesian methods (MrBayes, BEAST)
• Visualization: Archaeopteryx
• Selection Analysis ("dN/dS", Datamonkey ~ HyPhy)
• Gene Duplication Inference (GSDI algorithm)

• Select papers are available here for download (in zipped archive): https://goo.gl/o2NPDj

Why perform phylogenetic inference?

• To infer the evolutionary relationships amongst different species/classes/sub-classes/strains/… of organisms
• To infer the evolutionary relationships amongst molecular sequences (genes, proteins)
• To infer the functions of genes/proteins
  • Paper: "Eisen_1998_Phylogenomics"
• To use the resulting tree as a basis for additional analyses

Theoretical Background

• A phylogeny is the evolutionary history of a species or a group of species
• "Lately", the term is also being applied to the evolutionary history of individual DNA or protein sequences
• The evolutionary history of organisms or sequences can be illustrated using a tree-like diagram: a phylogenetic tree

Figure: A phylogenetic tree proposed in 1866 by Häckel

Many Misconceptions about Phylogenetic Trees!

• Example of a misconception: "the order of the external nodes provides information about their relatedness"
• The order of external nodes is meaningless!
• Paper: "Ryan_2008_Understanding_evolutionary_trees"

Types of Trees (Displays)

• Rooted vs. unrooted
• Cladogram vs. Phylogram

Eukaryotic Tree of Life

• Still not resolved
• Two major groups (probably):
  • Unikonta (single, or no, flagellum)
  • Bikonta (two flagella)
• No monophyletic group of "protists"
• Papers:
  • "Cavalier-Smith_2015_Multiple-origins"
  • "Zmasek_20111_Strong_functional_patterns"
  • "Roger_2009_Revisiting_the_root_of_the_eukaryote_tree"
  • "Baldauf_2003_The_Deep_Roots_of_Eukaryotes"

Bacterial Tree of Life

Based on a concatenated set of 16 ribosomal protein sequences

Paper: "Hug_2016_A_new_view_of_the_tree_of_life"

Viruses

• No universal "tree of life" for viruses
• Instead, "superfamilies" of (probably unrelated) viruses:
  • Double-stranded RNA Viruses (monophyly uncertain)
  • Single-stranded Negative Sense RNA Viruses (monophyly uncertain)
  • Single-stranded Positive Sense RNA Viruses (monophyly uncertain)
  • Single-stranded DNA Viruses (non-monophyletic)
  • Double-stranded DNA Viruses (non-monophyletic)
  • DNA-RNA Reverse Transcribing Viruses (monophyly uncertain)
• Papers:
  • "Castro-Nallar_2012_The_evolution_of_HIV"
  • "Forterre_2013_The_major_role_of_viruses_in_cellular_evolution"
  • "Koonin_2013_A_virocentric_perspective_on_the_evolution_of_life"
  • "Krupovic_2013_Networks_of_evolutionary_interactions"

A special case? Nucleo cytoplasmic large DNA viruses

• Nucleo cytoplasmic large DNA virus (NCLDV) superfamily
• Diverse group of viruses that infect a wide range of eukaryotic hosts (e.g. vertebrates, insects, single-celled organisms)
• Huge range in genome size (between 100 kb and 1.2 Mb)
• Examples:
  • Mimiviridae
  • Marseilleviridae
  • Phycodnaviridae
  • Poxviridae
• Papers:
  • "Krupovic_2013_Networks_of_evolutionary_interactions"
  • "Nasir_2012_Giant_viruses_coexisted"

Nucleo cytoplasmic large DNA viruses (NCLDV)

Bayesian Inference (BI) tree based on conserved regions of DNA polymerase B

Paper: "Fischer_2010_Giant_virus_with_a_remarkable_complement"

Gene Trees/Species Trees

• Initially, phylogenetic trees were built based on the morphology of organisms.
• Around 1960, molecular sequences were recognized as containing phylogenetic information and hence as valuable for tree building
• A tree built from sequence data is called a gene tree, since it represents the evolutionary history of genes
• A tree illustrating the evolutionary history of organisms is called a species tree

Figure: A gene tree which is also a species tree

Figure: A gene tree of orthologs and paralogs based on Bcl-2 family protein sequences

The Number of all possible tree topologies… … gets quickly larger than the number of all H-Atoms in the Universe

• The number of different tree topologies increases rapidly with the number of external nodes. The number of topologies for unrooted, completely binary trees (Tp) with N external nodes is:

$$T_p = \frac{(2N-5)!}{2^{N-3}\,(N-3)!}$$

Tp(N=5) = 15
Tp(N=10) ≈ 2×10^6
Tp(N=20) ≈ 2×10^20
Tp(N=100) ≈ 1×10^182

Homologs
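As a quick numerical check of the topology-count formula Tp for unrooted, completely binary trees, the following minimal Python sketch evaluates it with exact integer arithmetic. The function name `num_unrooted_topologies` is made up for this illustration.

```python
from math import factorial


def num_unrooted_topologies(n: int) -> int:
    """Tp = (2N-5)! / (2^(N-3) * (N-3)!) for N >= 3 external nodes."""
    if n < 3:
        raise ValueError("need at least 3 external nodes")
    # The division is exact, so integer division keeps full precision.
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (5, 10, 20, 100):
    # Scientific notation makes the growth easy to compare with the slide.
    print(f"N={n:>3}: {num_unrooted_topologies(n):.2e}")
```

The exact values for N=5 and N=10 are 15 and 2,027,025; the larger cases are printed only to show the order of magnitude.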

• Homologs are defined as sequences which share a common ancestor (Fitch, 1966)

• This definition becomes unclear if mosaic proteins, which are composed of structural units originating from different genes, are considered

• Phylogenetic trees make sense only if constructed from homologous sequences (whole genes/proteins, or domains)

Figure: Globin Family: an example of homologous proteins

Orthologs, Paralogs, Xenologs

• Homologous sequences can be divided into orthologs, paralogs and xenologs:

• Orthologs: diverged by a speciation event (their last common ancestor on a phylogenetic tree corresponds to a speciation event)

• Paralogs: diverged by a duplication event (their last common ancestor corresponds to a duplication)

• Xenologs: related to each other by horizontal gene transfer (via retroviruses, for example)

Orthologs, Paralogs example

Caveat emptor: Orthology vs. Function

• Orthologous sequences tend to have more similar “functions” than paralogs

• Yet: orthologs are mathematically defined, whereas there is no definition of sequence “function” (i.e. it is a subjective term)

Gene Duplication – Significance

• New genes evolve if mutations accumulate while selective constraints are relaxed by gene duplication

• First recognized by Haldane (“… it [mutation pressure] will favour polyploids, and particularly allopolyploids, which possess several pairs of sets of genes, so that one gene may be altered without disadvantage…”)

Gene Trees Vs. Species Trees – How Gene Duplications Can Be Detected

[Figure: gene trees G1 and G2 compared with species tree S; external nodes are Human, Rat and Wheat]

Rooting

• Almost all methods and algorithms produce unrooted or randomly rooted trees!!
• Rooting by:
  • Midpoint-rooting (minimizing overall tree height)
  • Known "outgroup"
  • Minimizing gene duplications
  • …

Methods

Multiple sequence alignment of homologous sequences
  • Optimality Criteria Based on Character Data:
    • Maximum Parsimony
    • Maximum Likelihood
  • Bayesian Methods (MCMC)
  • Pairwise distance calculation
    • Algorithmic Methods Based on Pairwise Distances:
      • Neighbor Joining
    • Optimality Criteria Based on Pairwise Distances:
      • Fitch-Margoliash
      • Minimal Evolution

[Diagram axis: “More accurate” … Fast (in general)]

Pairwise Distance Calculation

The simplest method to measure the distance between two amino acid sequences is their fractional dissimilarity p, where n_d is the number of aligned sequence positions containing non-identical amino acids and n_s is the number of aligned sequence positions containing identical amino acids:

$$p = \frac{n_d}{n_d + n_s}$$
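The fractional dissimilarity can be computed directly from two aligned sequences. The sketch below is a minimal illustration; skipping alignment positions that contain a gap character is an assumption made here, not something the slide specifies.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fractional dissimilarity p = n_d / (n_d + n_s) of two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (same length)")
    n_d = 0  # positions with non-identical residues
    n_s = 0  # positions with identical residues
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped positions are ignored
        if a == b:
            n_s += 1
        else:
            n_d += 1
    if n_d + n_s == 0:
        raise ValueError("no comparable positions")
    return n_d / (n_d + n_s)


# One mismatch in six aligned positions.
print(p_distance("ARNDCQ", "VRNDCQ"))
```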

Pairwise Distance Calculation

• Unfortunately, this is unrealistic; it does not take into account:
  • superimposed changes: multiple mutations at the same sequence location
  • different chemical properties of amino acids: for example, changing leucine into isoleucine is more likely and should be weighted less than changing leucine into proline

Pairwise Distance Calculation

• A more realistic approach for estimating evolutionary distances is to apply maximum likelihood to empirical amino acid replacement models, such as PAM transition probability matrices.

• The likelihood L_H of a hypothesis H (an evolutionary distance, for example) given some data D (an alignment, for example) is the probability of D given H: $L_H = P(D \mid H)$

Algorithmic Methods Based on Pairwise Distances

• UPGMA
• Neighbor Joining

UPGMA vs …

• UPGMA stands for unweighted pair group method using arithmetic averages

• This is clustering

• This algorithm produces rooted trees under the assumption of a molecular clock.

• Do not use!!

… Neighbor Joining
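To make the contrast with UPGMA concrete, here is a small Python sketch. The distance matrix is hypothetical: it is additive on the unrooted tree ((A,B),(C,D)) with external branch lengths A=1, B=4, C=1, D=4 and an internal branch of 1, so B and D evolve faster and there is no molecular clock. Picking the closest pair (the first step of clustering such as UPGMA) joins A with C, whereas the standard neighbor-joining criterion Q(i,j) = (n-2)·d(i,j) - r(i) - r(j) picks a true pair of neighbors.

```python
from itertools import combinations

# Hypothetical additive distances (see lead-in for the generating tree).
DIST = {
    ("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
    ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5,
}
TAXA = ["A", "B", "C", "D"]


def d(x, y):
    """Look up the symmetric distance between taxa x and y."""
    if x == y:
        return 0
    return DIST.get((x, y), DIST.get((y, x)))


def closest_pair(taxa):
    """Pair with the smallest raw distance (what clustering joins first)."""
    return min(combinations(taxa, 2), key=lambda p: d(*p))


def nj_pair(taxa):
    """Pair minimizing the neighbor-joining criterion Q."""
    n = len(taxa)
    r = {x: sum(d(x, y) for y in taxa) for x in taxa}  # row sums
    return min(
        combinations(taxa, 2),
        key=lambda p: (n - 2) * d(*p) - r[p[0]] - r[p[1]],
    )


print("closest pair:", closest_pair(TAXA))
print("NJ pair:     ", nj_pair(TAXA))
```

Only the first joining step is shown; a full implementation would update the distance matrix and repeat until the tree is resolved.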

• As opposed to UPGMA, neighbor joining (NJ) is not misled by the absence of a molecular clock

• NJ produces phylogenetic trees (not cluster diagrams)

Optimality Criteria Based on Pairwise Distances

• Fitch-Margoliash
• Minimal evolution (ME)

Fitch-Margoliash

An optimal tree is selected by minimizing the disagreement E between the tree and the estimated pairwise distances (estimated from a multiple alignment):

Minimal Evolution

Branch lengths are fitted to a tree according to an unweighted least squares criterion, but the optimality criterion used to evaluate and compare trees is to minimize the sum of all branch lengths.

Optimality Criteria Based on Character Data

• Maximum Parsimony (MP)
• Maximum Likelihood (ML)

Maximum Parsimony
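A minimal Python sketch of how a given topology can be evaluated under maximum parsimony, using the four example sequences from this slide (TGC, TAC, AGG, AAG). It uses Fitch's counting algorithm, which is an assumption here since the slide does not name a specific algorithm. Trees are written as nested tuples of sequence numbers; the root position is arbitrary and does not affect the score.

```python
# The four example sequences, keyed by sequence number.
SEQS = {1: "TGC", 2: "TAC", 3: "AGG", 4: "AAG"}


def fitch_site(tree, site):
    """Return (possible ancestral states, number of changes) for one column."""
    if not isinstance(tree, tuple):  # a leaf: its observed residue, no changes
        return {SEQS[tree][site]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:  # children agree on at least one state: no extra change
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree):
    """Total number of changes over all alignment columns."""
    n_sites = len(next(iter(SEQS.values())))
    return sum(fitch_site(tree, s)[1] for s in range(n_sites))


# The three possible unrooted topologies for four sequences.
for topology in [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]:
    print(topology, parsimony_score(topology))
```

The topology grouping sequences 1+2 and 3+4 needs 4 changes, the other two need 5 and 6, so it is the most parsimonious of the three.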

• Evaluate a given topology

• Example:
  • Sequence1: TGC
  • Sequence2: TAC
  • Sequence3: AGG
  • Sequence4: AAG

Maximum Likelihood
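The Poisson-process substitution probabilities P(b|a,t) listed on this slide can be checked numerically. The sketch below assumes 20 amino acid states and that the a ≠ b term carries a minus sign (1/20 - 1/20·e^(-ut)), which is what makes the probabilities of all 20 outcomes sum to 1. The rate `u` and the time points are arbitrary illustration values.

```python
from math import exp

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def poisson_prob(a: str, b: str, t: float, u: float = 1.0) -> float:
    """P(b|a,t) under a Poisson model with equal rates for all changes."""
    n = len(AMINO_ACIDS)
    decay = exp(-u * t)
    if a == b:
        return 1 / n + (n - 1) / n * decay
    return 1 / n - 1 / n * decay


for t in (0.0, 0.5, 5.0):
    row = [poisson_prob("L", b, t) for b in AMINO_ACIDS]
    # Probability of staying L, of changing to I, and the row total.
    print(t, round(poisson_prob("L", "L", t), 4),
          round(poisson_prob("L", "I", t), 4), round(sum(row), 10))
```

At t = 0 no change is possible; for large t every residue approaches probability 1/20, i.e. the sequences become saturated.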

• Probabilistic methods can be used to assign a likelihood to a given tree and therefore allow the selection of the tree which is most likely given the observed sequences.
• Probability for one residue a to change to b in time t along a branch of a tree: P(b|a,t)
• Its actual calculation depends on which model of sequence evolution is used.
• Poisson process (u is the substitution rate):
  • $P(b|a,t) = \frac{1}{20} + \frac{19}{20} e^{-ut}$ for a = b
  • $P(b|a,t) = \frac{1}{20} - \frac{1}{20} e^{-ut}$ for a ≠ b

Bayesian Methods

• Example: MrBayes
• Use a Markov Chain Monte Carlo (MCMC) approach to sample over tree space

Bootstrap Resampling

• To assess the reliability of trees

• Resampling with replacement (see example on next slide)

• What is “good enough”? >60%?, >90%?

• A tree without support values is of little value, so always calculate support values!!

Bootstrap resampling: example
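Resampling alignment columns with replacement can be sketched in a few lines of Python. The helper names are made up for this illustration; the first call replays the column draw 2,2,6,5,5,1 used for bootstrap resample 1 in this example, the second draws a random replicate.

```python
import random


def resample_columns(alignment, columns):
    """Build a new alignment from 1-based column numbers (repeats allowed)."""
    return {
        name: "".join(seq[c - 1] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    columns = rng.choices(range(1, length + 1), k=length)
    return columns, resample_columns(alignment, columns)


msa = {"Sequence 1": "ARNDCQ", "Sequence 2": "VRNDCQ"}

# Fixed draw: gives RRQCCA / RRQCCV.
print(resample_columns(msa, [2, 2, 6, 5, 5, 1]))

# Random draw with an arbitrary seed, so the run is repeatable.
print(bootstrap_replicate(msa, random.Random(21)))
```

The same columns are applied to every sequence, so each replicate remains a valid alignment of the original length.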

Original sequence alignment:
Sequence 1: ARNDCQ
Sequence 2: VRNDCQ
Columns:    123456

Bootstrap resample 1:
Sequence 1: RRQCCA
Sequence 2: RRQCCV
Columns:    226551

Bootstrap resample 2:
Sequence 1: AQCDCQ
Sequence 2: VQCDCQ
Columns:    165456

Methods

Multiple sequence alignment of homologous sequences
  • Optimality Criteria Based on Character Data:
    • Maximum Parsimony
    • Maximum Likelihood
  • Bayesian Methods (MCMC)
  • Pairwise distance calculation
    • Algorithmic Methods Based on Pairwise Distances:
      • Neighbor Joining
    • Optimality Criteria Based on Pairwise Distances:
      • Fitch-Margoliash
      • Minimal Evolution

[Diagram axis: “More accurate” … Fast (in general)]

Obtain Example Sequences

• DNA Polymerases (B, delta) from:
  • Nucleo Cytoplasmic Large DNA Virus (NCLDV) superfamily
    • Ascoviridae (3) ("AV_")
    • Iridoviridae (6) ("IV_")
    • Marseilleviridae (3) ("MaV_")
    • Mimiviridae (3) ("MiV_")
    • Phycodnaviridae (4) ("PhV_")
    • Poxviridae (4) ("PoV_")
  • Archaea (2) ("A_")
  • Eukaryotes (5) ("E_")
    • Unikonta (3)
    • Bikonta (2)
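Once "dna_pol.fasta" has been downloaded, the per-group counts listed above can be checked with a short Python sketch. It assumes that each FASTA label starts with its group prefix followed by an underscore (e.g. ">AV_..."), which is inferred from the prefixes quoted above rather than confirmed. The demo records are invented placeholders, not entries from the real file.

```python
from collections import Counter


def count_by_prefix(fasta_text: str) -> Counter:
    """Count FASTA records per label prefix (text up to the first '_')."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            label = line[1:].strip()
            if "_" in label:
                prefix = label.split("_", 1)[0] + "_"
            else:
                prefix = "(no prefix)"
            counts[prefix] += 1
    return counts


# Hypothetical miniature FASTA text, only to show the call.
demo = """>AV_example1
MKV
>AV_example2
MKL
>E_example3
MRV
"""
print(count_by_prefix(demo))
```

For the real file, the call would be `count_by_prefix(open("dna_pol.fasta").read())`; the counts over all prefixes should add up to the number of sequences in the file.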

• Download file "dna_pol.fasta" from https://goo.gl/NhRzHG
• Download file "map.txt" from https://goo.gl/1Q7QS8
• Open Terminal (console)
• In console, type grep -c '>' dna_pol.fasta to count the sequences in "dna_pol.fasta"

Unix Refresher

• To show files in current directory: ls
• To show files in current dir with more information: ls -la
• To show current path: pwd
• To return to home directory: cd
• To change current directory to "x": cd x
• To change current directory to one level up: cd ..
• To rename/move a file (e.g. "x" to "y"): mv x y
• To copy a file (e.g. "x" to "y"): cp x y
• To delete a file (e.g. "x"): rm x
• To create a directory (e.g. "q"): mkdir q
• To delete an empty directory (e.g. "q"): rmdir q
• To monitor running programs: top
• To end a running program: ctrl+c
• To look at a file (e.g. "a.fasta"): more a.fasta
• To edit a file (e.g. "a.fasta"): gedit a.fasta

Multiple Sequence Alignment: MAFFT

• MAFFT website: https://mafft.cbrc.jp/alignment/software/
• Type mafft -h to get help
• Infer alignment with mafft --maxiterate 1000 --localpair dna_pol.fasta > dna_pol_mafft.fasta

Inspect MSA with Jalview

• Version on Biolinux is old and seems to have some issues, thus…
• Download the "Linux" version (without Java VM) from http://www.jalview.org/Web_Installers/install.htm
• After downloading, open a console and cd to the directory where you downloaded the installer
• Type: sh ./install-jalview.bin
• Launch Jalview by typing ./runJalview in your home directory
• Open file with File | Input Alignment | From File

Multiple Sequence Alignment: ClustalOmega

• ClustalOmega website: http://www.clustal.org/omega/
• Type clustalo --help to get help
• Infer MSA with clustalo -i dna_pol.fasta > dna_pol_clustalo.fasta
• Inspect MSA with Jalview
• Compare with MAFFT MSA

MSA Format Conversion

• Use Jalview (newest version) to convert MSA format from Fasta to PHYLIP ("save as" in "dna_pol_mafft" window)
  • "dna_pol_mafft.fasta" -> "dna_pol_mafft.phy"
• WARNING: If labels are 10 characters in length, Jalview 2.10.3 produces PHYLIP formatted files with no space between label and sequence; these files cannot be read by FastME, PhyML and RAxML
• Online Format Conversion:
  • http://sing.ei.uvigo.es/ALTER/
  • http://phylogeny.lirmm.fr/phylo_cgi/data_converter.cgi

Multiple Sequence Alignment: Challenge 1: More Programs

• Learn about, and play around with, other MSA inference programs:
  • clustalw (the "grandfather")
  • clustalx
  • probcons (generally accurate, but slow)
  • muscle (fast)
  • kalign
  • hmmalign from HMMER package (http://hmmer.org/)
  • t-coffee (t_coffee)

Multiple Sequence Alignment: Challenge 2: MSA Formats

• Learn about different formats for MSAs
  • Fasta
  • Nexus
    • see: http://wiki.christophchamp.com/index.php?title=NEXUS_file_format
    • http://mrbayes.sourceforge.net/Help/format.html
  • Phylip
    • interleaved vs. sequential
    • see: http://evolution.genetics.washington.edu/phylip/doc/sequence.html
  • Others?

Tree Inference: Phylip

• Website: http://evolution.genetics.washington.edu/phylip.html

• Download:
  1. Go to "Get me PHYLIP"
  2. Download file "gzip'ed tar archive of C sources and documentation"
• Extract files:
  1. Open console
  2. Go ("cd") to the directory where "phylip-3.696.tar.gz" is
  3. Type tar -zxvf phylip-3.696.tar.gz
• Compilation:
  1. Go to "src" directory ("cd src")
  2. Type make -f Makefile.unx install
• For convenience:
  1. In file .zshrc (in your home dir) add (by way of example) using gedit:
     • "alias seqboot='/home/path/to/phylip/exe/seqboot'"
     • "alias neighbor='/home/path/to/phylip/exe/neighbor'"
     • Do this for seqboot, protdist, neighbor, fitch, consense
  2. Type source .zshrc

Goal: Calculate a Neighbor Joining (NJ) tree using Phylip

• Input: Homologous sequences
• Output: Tree topology with bootstrap support values
• Steps:
  1. MSA calculation (done, with MAFFT)
  2. Bootstrap resampling of the MSA: seqboot
     1 MSA -> 100 MSAs
  3. Pairwise distance calculation for each MSA: protdist
     100 MSAs -> 100 pairwise distance matrices
  4. Tree calculation: neighbor
     100 pairwise distance matrices -> 100 phylogenetic trees with branch lengths
  5. Consensus tree calculation: consense
     100 phylogenetic trees with branch lengths -> 1 tree with support values

Phylip: Bootstrap Resampling: seqboot

1. Type seqboot
2. Give the input MSA name (in Phylip format), e.g. "dna_pol_mafft.phy"
3. Nothing needs to be changed, so type "Y"
4. Give random number seed, e.g. "21"
• Output will be written to "outfile"
• Rename (move) "outfile" to "msa_bootstrapped.phy": mv outfile msa_bootstrapped.phy

Phylip: Pairwise Distance Calculation: protdist

1. Type protdist
2. Give the input MSA name (in Phylip format), e.g. "msa_bootstrapped.phy"
3. Type "M" to change to 100 datasets, followed by "D", followed by "100"
4. Type "Y" to start pairwise distance calculation (should take about 10 min; you could do "Multiple Sequence Alignment: Challenges" 1 & 2 in the meantime)
• Output will be written to "outfile"
• Rename (move) "outfile" to "dna_pol.pwd": mv outfile dna_pol.pwd

Phylip: Neighbor Joining Tree Calculation: neighbor

• Here, we use an algorithmic method based on pairwise distances: the neighbor joining algorithm (NJ)
1. Type neighbor
2. Give the input pairwise distances file name, e.g. "dna_pol.pwd"
3. Type "M" to change to 100 datasets, followed by "100", followed by "21"
4. Type "Y" to start neighbor joining algorithm
• Trees will be written to "outtree"
• Change "outtree" to "dna_pol_NJ_trees.nh": mv outtree dna_pol_NJ_trees.nh
• Delete outfile: rm outfile

Phylip: Consensus Tree Calculation: consense

1. Type consense
2. Give input trees file name, e.g. "dna_pol_NJ_trees.nh"
3. Type "Y" to start
• Tree will be written to "outtree"; rename it to "dna_pol_NH_tree.nh"
• Delete outfile

Phylogenetic Tree Visualization and Analysis: Archaeopteryx

• Website: https://sites.google.com/site/cmzmasek/home/software/archaeopteryx
• Download "forester.jar" (under "Download" -> "Most current version" -> "Download")
• In file .zshrc (in your home dir) add (by way of example) using gedit: alias aptx='java -cp /path/to/forester.jar org.forester.archaeopteryx.Archaeopteryx'
• Type source .zshrc
• Now you can launch Archaeopteryx by typing "aptx" in the console
• Type aptx dna_pol_NH_tree.nh
• What are the drawbacks of this tree?

Phylip: Challenge: Fitch-Margoliash

• Instead of using an algorithmic method based on pairwise distances, use an optimality-criterion method based on pairwise distances: the Fitch-Margoliash method
• An optimal tree is selected by minimizing the disagreement E between the tree and the estimated pairwise distances (estimated from a multiple alignment)

Phylip: Challenge: Fitch-Margoliash

1. Type fitch
2. Give input pairwise distances file name, e.g. "dna_pol.pwd"
3. Type "G" to turn on Global rearrangements
4. Type "J" to turn on Randomize input order, followed by "21", followed by 2 (for real analysis, use 10 or more…)
5. Type "M" to change to 100 datasets, followed by "100", followed by "21"
6. Type "Y" to start (this should take about 15 min)
• Trees will be written to "outtree"
• Change "outtree" to "dna_pol_FM_trees.nh"
• Use consense to calculate a consensus tree for "dna_pol_FM_trees.nh"; rename the result to "dna_pol_FM_tree.nh"
• Visualize the result using Archaeopteryx

Phylip: 2nd Challenge: Minimal Evolution

• Learn about the difference between the Fitch-Margoliash method and the Minimal Evolution (ME) method
• Use Phylip to analyze "dna_pol.pwd" with the ME method; name the consensus tree "dna_pol_ME_tree.nh"
• Visualize the result using Archaeopteryx

Minimal Evolution: FastME

• Website: http://www.atgc-montpellier.fr/fastme/
• Command line: fastme -i dna_pol_mafft.phy -m I -p W -n B -b 100 -z 9 -o dna_pol_mafft_FastME.nh
• "dna_pol_mafft.phy" is the input MSA in Phylip format, download here: https://goo.gl/Nu2oM5

• Options:
  • -i input data file: an MSA or one or more distance matrices
  • -m method: algorithm used to compute a tree from a distance matrix. It can be iterative taxon addition (optimizing the BalME criterion for 'TaxAdd_BalME' or the OLSME criterion for 'TaxAdd_OLSME'), Neighbor Joining (NJ), the unweighted version of NJ (UNJ) or an improved version of NJ based on a simple model of sequence data (BioNJ): I is for BioNJ
  • -p model: p-distance, F81-like, LG, WAG, JTT, Dayhoff, DCMut, CpREV, MtREV, RtREV, HIVb, HIVw and FLU: W is for WAG
  • -n use nearest-neighbor interchange (NNI) to explore the topology space: B is for NNI_BalME
  • -b: bootstrap replicates
  • -z: random seed
  • -o: output name
• Visualize the result (e.g. "dna_pol_mafft_FastME.nh") with Archaeopteryx. Are there any problems? If so, how can you fix them?

Phylip: 3rd "Challenge"

• Redo all the Phylip analyses done so far, but do not bootstrap re-sample:
  • Instead of "seqboot-protdist-(neighbor|FM|ME)-consense",
  • do "protdist-(neighbor|FM|ME)"

Maximum Likelihood: RAxML

Example of command line for doing rapid bootstrapping with RAxML (version 8.2.9):

raxml -f a -m PROTGAMMAWAGX -N 100 -p 27 -x 26 -s dna_pol_mafft.phy -n dna_pol_mafft_WAG

"dna_pol_mafft.phy" is the input MSA in Phylip format, download here: https://goo.gl/Nu2oM5

Options used:
• -f a: -f option selects the algorithm, here 'a' is used which is for "rapid bootstrap analysis and search for best-scoring ML tree in one program run"
• -m PROTGAMMAWAGX: -m option selects the substitution model:
  • PROT: amino acids
  • GAMMA: gamma model of rate heterogeneity (alpha parameter will be estimated)
  • WAG: WAG model (with optimization of substitution rate)
  • X: ML estimate of base frequencies
• -N 100: number of bootstraps
• -p 27: random number seed for the parsimony inferences
• -x 26: random number seed for rapid bootstrapping
• -s: input alignment in phylip format
• -n: base name for output files

Maximum Likelihood: RAxML: Automatic model selection

Automatic protein model selection by ML (version 8.2.9):

raxml -f a --auto-prot=ml -m PROTGAMMAAUTO -N 100 -p 27 -x 26 -s dna_pol_mafft.phy -n dna_pol_mafft_AUTO

Options used:
• -f a: -f option selects the algorithm, here 'a' is used which is for "rapid bootstrap analysis and search for best-scoring ML tree in one program run"
• --auto-prot=ml: automatic protein model selection by ML
• -m PROTGAMMAAUTO: -m option selects the substitution model:
  • PROT: amino acids
  • GAMMA: gamma model of rate heterogeneity (alpha parameter will be estimated)
  • AUTO: automatic protein model selection
• -N 100: number of bootstraps
• -p 27: random number seed for the parsimony inferences
• -x 26: random number seed for rapid bootstrapping
• -s: input alignment in phylip format
• -n: base name for output files

Maximum Likelihood: RAxML: Nucleotides

For nucleotide sequences (version 8.2.9):

raxml -f a -m GTRGAMMAIX -N 100 -p 27 -x 26 -s input_aln.phy -n output_GTRGAMMAIX

"input_aln.phy" is the input MSA in Phylip format

Options used:
• -f a: -f option selects the algorithm, here 'a' is used which is for "rapid bootstrap analysis and search for best-scoring ML tree in one program run"
• -m GTRGAMMAIX: -m option selects the substitution model:
  • GTRGAMMA: GTR + optimization of substitution rates + GAMMA model of rate heterogeneity (alpha will be estimated)
  • I: estimate of proportion of invariable sites
  • X: ML estimate of base frequencies
• -N 100: number of bootstraps
• -p 27: random number seed for the parsimony inferences
• -x 26: random number seed for rapid bootstrapping
• -s: input alignment in phylip format
• -n: base name for output files

Maximum Likelihood: PhyML

• Website: http://www.atgc-montpellier.fr/phyml/
• Play around with the interactive menu (similar to Phylip) by launching the PhyML program with phyml
• Download input file here: https://goo.gl/Nu2oM5
• Challenge: Study and use the command line options for PhyML
  • phyml -i dna_pol_mafft.phy -d aa -b 100 -m WAG -f e -s BEST
  • This will take a long time to finish…
  • Without bootstrap resampling -> much faster, but … : phyml -i dna_pol_mafft.phy -d aa -m WAG -f e -s BEST
• Visualize the resulting "_phyml_tree.txt" file using Archaeopteryx. What is the problem? How can you fix it?

Bayesian Methods: MrBayes

• Website: http://mrbayes.sourceforge.net/
• MrBayes requires the input alignment in Nexus format:
  • convert "dna_pol_mafft.fasta" to Nexus format using online tools, OR download it from here: https://goo.gl/wkyhov

1. Launch MrBayes by typing mrbayes
2. Load "dna_pol_mafft.nex" into MrBayes:
   > execute dna_pol_mafft.nex
3. Use mixed amino acid model:
   > prset aamodelpr=mixed
   (example for nucleotides: lset nst=6 rates=invgamma)
4. Start Markov Chain Monte Carlo run with 10,000 steps:
   > mcmc ngen=10000 samplefreq=100 printfreq=100 diagnfreq=1000
   For real studies, use more steps; the "Average standard deviation of split frequencies" should reach 0.01 or less. End the run by typing "no" when prompted
5. Summarize results and calculate consensus tree:
   > sump
   > sumt
   > quit

• Use Archaeopteryx to visualize the tree in the ".con.tre" file.

Bayesian Methods: BEAST

• Bayesian Evolutionary Analysis Sampling Trees
• Oriented towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models
• Two independent projects:
  • BEAST v1.X: http://beast.community/
  • BEAST2: http://www.beast2.org/
• Tutorials: https://www.beast2.org/tutorials/
• Basic workflow for tree inference:
  1. Prepare MSA and add instructions: BEAUti
  2. Run Markov Chain Monte Carlo: BEAST
  3. Summarize output and generate consensus tree: TreeAnnotator

Bayesian Methods: BEAST - BEAUti

• BEAST takes as input an XML file containing the MSA, other data, and instructions about the analysis. This file is generated by another program called BEAUti.
• Launch BEAUti by typing beauti
• Import the "dna_pol_mafft.nex" MSA into BEAUti by selecting "File" | "Import Alignment" from the menu
• Double click "dna_pol_mafft" in the "File" column to see the alignment
• In "Tip Dates" tab, keep "Use tip dates" unchecked
• In "Site Model" tab, check the "Proportion Invariant" estimate box and change "Subst Model" to "WAG"
• In "MCMC" tab, change "Chain Length" to 50,000 (for real studies, this is much too low!!)
• Under "File" | "Save As", save XML as "dna_pol_mafft_beauti.xml"
• Exit

Bayesian Methods: BEAST – BEAST MCMC

• Launch BEAST by typing beast
• Select "dna_pol_mafft_beauti.xml"
• BEAST will start the MCMC run
• Trees will be written to "dna_pol_mafft.trees"

Bayesian Methods: BEAST – TreeAnnotator

• Summarize tree with TreeAnnotator
• Launch TreeAnnotator by typing treeannotator
• Set "Burnin percentage" to 25
• Set "Input Tree File" to "dna_pol_mafft.trees"
• Output tree will be in Nexus format; use Archaeopteryx to visualize it
• Furthermore, use densitree to visualize all/most trees in "dna_pol_mafft.trees"
• Much, much more! Start here: https://www.beast2.org/tutorials/

SELECTION ANALYSIS

dN/dS
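The dN/dS ratio defined on this slide is plain arithmetic once substitutions and sites have been counted. The sketch below is a deliberately simplified illustration: it takes the four counts as given and applies no correction for multiple substitutions at the same site, and the numbers in the example call are invented.

```python
def dn_ds(nonsyn_subs: float, nonsyn_sites: float,
          syn_subs: float, syn_sites: float) -> float:
    """Ratio of nonsynonymous to synonymous substitutions, each per site."""
    dn = nonsyn_subs / nonsyn_sites  # nonsynonymous substitutions per nonsynonymous site
    ds = syn_subs / syn_sites        # synonymous substitutions per synonymous site
    if ds == 0:
        raise ZeroDivisionError("dS is zero, so the ratio is undefined")
    return dn / ds


# Hypothetical counts: 6 changes over 300 nonsynonymous sites,
# 10 changes over 100 synonymous sites.
omega = dn_ds(6, 300, 10, 100)
print(f"dN/dS = {omega:.2f}")
```

A value well below 1, as in this invented example, would be read as negative (purifying) selection.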

• The dN/dS ratio is calculated as the ratio of the number of nonsynonymous substitutions per nonsynonymous site, in a given period of time, to the number of synonymous substitutions per synonymous site, in the same period
• If dN/dS is
  • =1: no selection occurred, the sequence evolves neutrally
  • <<1: negative selection, i.e. changes in this genetic sequence are actively selected against
  • >>1: positive selection

Selection Analysis: Datamonkey Server (HyPhy)

• Website: http://www.datamonkey.org/
• A simple example for positive/diversifying selection and negative/purifying selection on Influenza hemagglutinin:
  1. Go to http://www.datamonkey.org/help
  2. Download "Influenza A H5N1 hemagglutinin" (saved as "Flu.fasta")
  3. Go to http://www.datamonkey.org/analysistree
  4. Select "Selection"
  5. Select "Sites"
  6. Select "Pervasive"
  7. Select "Large"
  8. Select "Counting" (for real studies the Bayesian approach is likely more useful, but takes more time)
  9. You should see: Datamonkey recommends that you use... SLAC
     SLAC (Single-Likelihood Ancestor Counting) uses a combination of maximum-likelihood (ML) and counting approaches to infer nonsynonymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny.
  10. Click on "SLAC"
  11. Choose "Flu.fasta" file
  12. Leave Genetic code at "Universal"
  13. Click on arrow, analysis starts (should take about 5 min)

GENE DUPLICATION INFERENCE

Gene Duplication Inference
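Before using the tool, it may help to see the basic idea behind inferring duplications by comparing a gene tree with a species tree. The Python sketch below is not the GSDI implementation; it is a minimal version of the classic mapping approach for binary trees, under the assumption that every gene-tree leaf is labelled only with its species name. Each gene-tree node is mapped to the smallest species-tree clade containing all its species, and a node is called a duplication if it maps to the same clade as one of its children. The trees are hypothetical.

```python
def leaves(tree):
    """Set of species names below a node (trees are nested tuples of strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def species_clades(species_tree):
    """All clades (as leaf sets) of the species tree, including single species."""
    clades = [leaves(species_tree)]
    if not isinstance(species_tree, str):
        for child in species_tree:
            clades.extend(species_clades(child))
    return clades


def map_to_species(gene_node, clades):
    """Smallest species-tree clade containing all species below gene_node."""
    wanted = leaves(gene_node)
    return min((c for c in clades if wanted <= c), key=len)


def count_duplications(gene_tree, clades):
    """Number of internal gene-tree nodes inferred to be duplications."""
    if isinstance(gene_tree, str):
        return 0
    here = map_to_species(gene_tree, clades)
    is_dup = any(map_to_species(child, clades) == here for child in gene_tree)
    return int(is_dup) + sum(count_duplications(c, clades) for c in gene_tree)


species = (("Human", "Rat"), "Wheat")
clades = species_clades(species)

# Two Human/Rat gene pairs: the node joining them is a duplication.
gene = ((("Human", "Rat"), ("Human", "Rat")), "Wheat")
print(count_duplications(gene, clades))
```

A gene tree with the same shape as the species tree yields zero duplications; all other internal nodes are then interpreted as speciations.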

• GSDI: Generalized Speciation Duplication Inference (command line), website: https://sites.google.com/site/cmzmasek/home/software/forester/gsdi
• Using GSDI from within Archaeopteryx:
  1. Obtain example gene and species trees:
     • Gene tree: Dihydroxyacetone kinase: https://goo.gl/qzR3bq
     • Species tree ("tree of life"): https://goo.gl/nWXsSu
  2. Load gene tree into Archaeopteryx... Unfortunately, support values are interpreted as internal node names, and no taxonomic information is present…
  3. Therefore, go to "Options" and check (under "Newick/NHX/Nexus Read"):
     • "Internal Node Names are Confidence Values" and
     • "Extract Taxonomy Codes/Ids from Pfam-style Node Names"
  4. Load gene tree again
  5. Go to "Analysis" and "Load Species Tree…"
  6. Finally, execute GSDI algorithm (with re-rooting), under "Analysis"

Summary

• General concepts, common misconceptions
• Tree of Life (Eukaryotic, Bacterial, Viral)
• Homologs, gene duplications, orthologs, …
• Methods
• Unix command line refresher
• Multiple Sequence Alignment (MAFFT)
• Pairwise distance calculation (Phylip)
• Distance based methods (Neighbor Joining, FastME)
• Maximum Likelihood methods (RAxML, PhyML)
• Bayesian methods (MrBayes, BEAST)
• Visualization: Archaeopteryx
• Selection Analysis ("dN/dS", Datamonkey ~ HyPhy)
• Gene Duplication Inference (GSDI algorithm)

• Select papers are available here for download (in zipped archive): https://goo.gl/o2NPDj

Grad student finally gets phylogenetic tree he agrees with

Moscow, Idaho – Grad student Herb Bumblebee finally got the phylogenetic tree he likes and is now all ready to write the paper, he confirmed for The Allium on Friday.

“It took me seventeen different computer programs, 84 substitution models and 4,000 CPU years of computing, but finally I got the tree I hoped I would get when I started the study”.

“My thesis advisor is so happy because it confirms his long-held belief in how the phylogenetic tree would look”.

“For weeks I was getting this other tree, using parsimony, likelihood, bayesian analysis, neighbor-joining, everything. However, I finally got to correct the alignment, which was clearly being constructed incorrectly by , MAFFT, Proalign and MUSCLE. In the end, if you want it done right, you gotta do it yourself”.