TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees
Uyen Mai
University of California San Diego

Error-prone gene trees
• Sequence data may include various sources of error
• Erroneous sequences often appear as long branches in the inferred phylogenies
[Figure from Gatesy et al. (2014)]

Long branches are suspicious
[Figure: a gene tree from the Mammalian dataset (Song et al., PNAS, 2012); scale bar 0.2]
[Figure: a gene tree from the Mammalian dataset (Song et al., PNAS, 2012); scale bar 0.2]
[Figure: a gene tree from the 1KP Plants dataset (Wickett et al., PNAS, 2014); scale bar 0.6]
• What can lead to long branches?
  • contamination
  • mistaken orthology
  • misalignment

Semi-simulated data: randomly select 10 sequences and mutate them
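The semi-simulation step can be sketched in Python as below. This is a minimal illustration, not the script used for the talk: the function names, the substitution rule (each chosen base is replaced by a different random base), the handling of gaps, and the fixed seed are all assumptions.

```python
import random

BASES = "ACGT"

def mutate_sequence(seq, fraction, rng):
    """Replace `fraction` of the A/C/G/T positions in `seq` with a different base."""
    chars = list(seq)
    # Only real bases are candidates; gaps and ambiguity codes are left alone (assumption).
    positions = [i for i, base in enumerate(chars) if base in BASES]
    n_mutations = round(fraction * len(positions))
    for i in rng.sample(positions, n_mutations):
        chars[i] = rng.choice([b for b in BASES if b != chars[i]])
    return "".join(chars)

def mutate_alignment(alignment, n_taxa=10, fraction=0.10, seed=0):
    """Mutate `n_taxa` randomly chosen sequences; return (new alignment, chosen names)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(alignment), min(n_taxa, len(alignment)))
    mutated = dict(alignment)
    for name in chosen:
        mutated[name] = mutate_sequence(alignment[name], fraction, rng)
    return mutated, set(chosen)

# Hypothetical toy alignment: 20 identical sequences of length 100.
alignment = {f"taxon{i}": "ACGT" * 25 for i in range(20)}
mutated, chosen = mutate_alignment(alignment, n_taxa=10, fraction=0.10)
print(sorted(chosen))
```

Keeping the set of chosen names makes it possible to score a filtering method afterwards, because the mutated sequences are the known errors.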
[Figure: two panels labeled "Mutate 10% of the DNA bases" and "Mutate 5% of the DNA bases"]

Rooted filtering
• Filter out taxa that are exceptionally distant from the root
• Some studies have reported that such a method improved the phylogeny
• Relies on the root position

Diameter-based filtering
For unrooted trees?
Diameter: the longest path between any two species
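The diameter of an unrooted tree can be computed with two traversals: start anywhere, walk to the farthest node, then walk from that node to the node farthest from it. The sketch below assumes the tree is stored as a nested dictionary of branch lengths; that representation, the function names, and the toy tree are illustrative assumptions, not TreeShrink's internals.

```python
def farthest_from(tree, start):
    """Return (node, distance) for the node farthest from `start` in a weighted tree."""
    best_node, best_dist = start, 0.0
    stack = [(start, None, 0.0)]  # (node, node we came from, distance from start)
    while stack:
        node, parent, dist = stack.pop()
        if dist > best_dist:
            best_node, best_dist = node, dist
        for neighbor, length in tree[node].items():
            if neighbor != parent:
                stack.append((neighbor, node, dist + length))
    return best_node, best_dist

def diameter(tree):
    """Return (end1, end2, length) of the longest path in the tree."""
    start = next(iter(tree))
    end1, _ = farthest_from(tree, start)
    end2, length = farthest_from(tree, end1)
    return end1, end2, length

# Hypothetical unrooted tree: leaves A, B, C, D and internal nodes X, Y.
edges = [("A", "X", 0.25), ("B", "X", 0.5), ("X", "Y", 0.5),
         ("C", "Y", 0.25), ("D", "Y", 2.0)]
tree = {}
for a, b, length in edges:
    tree.setdefault(a, {})[b] = length
    tree.setdefault(b, {})[a] = length

print(diameter(tree))  # the longest path runs between B and D, length 3.0
```

In this toy tree the long branch leading to D dominates the diameter, which is the situation diameter-based filtering looks for.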
[Figure: a gene tree from the 1KP plant dataset (Wickett et al., PNAS, 2014); scale bar 0.2]
Tree diameter is reduced by half!
What to remove?
Let $\nu_i = \dfrac{\text{the diameter after } i-1 \text{ removals}}{\text{the diameter after } i \text{ removals}}$
[Plot: ratio $\nu_i$ (y-axis) against removal $i$ (x-axis), shown beside a tree with scale bar 0.2]
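As a worked example of the ratio, the sketch below computes $\nu_i$ from a list of diameters. It assumes the diameters after 0, 1, 2, ... removals are already known; how the removals are chosen is not modeled here, and the function name and numbers are hypothetical.

```python
def shrink_ratios(diameters):
    """Given diameters[i] = diameter after i removals, return [nu_1, ..., nu_k].

    nu_i = (diameter after i-1 removals) / (diameter after i removals).
    """
    return [before / after for before, after in zip(diameters, diameters[1:])]

# Hypothetical diameters: the first removal halves the diameter, later ones barely matter.
diameters = [3.0, 1.5, 1.25, 1.0]
ratios = shrink_ratios(diameters)
for i, nu in enumerate(ratios, start=1):
    print(f"removal {i}: nu = {nu:.2f}")

# The removal with the largest ratio is the most suspicious one.
most_suspicious = max(range(1, len(ratios) + 1), key=lambda i: ratios[i - 1])
print("largest ratio at removal", most_suspicious)
```

A ratio close to 1 means the removal hardly changed the diameter; a ratio of 2 corresponds to the diameter being cut in half.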
[Figure 3 of Mai and Mirarab (page 8 of 31); the plots in panel (a) show ratio (y-axis) against removal (x-axis)]
Figure 3 (a) Patterns of $\nu_i$ as a function of $i$. Four unfiltered gene trees from a Plant dataset [23] are shown (top). For each tree, we also show $\nu_i$ for $1 \le i \le k = \min(n/4, 5\sqrt{n})$ (bottom). (b) An example tree from the Plant dataset with the removing sets and species signatures. The removing sets are shown with the corresponding $\nu$ values. The max $\nu$ values associated with the species signatures are marked in red.

TreeShrink
• Automatically compute the “diameter-shrinking” plot and identify outliers
• Combine computational algorithm and statistics
• Processed a tree of 200,000 leaves in 30 minutes
• Software tool: https://github.com/uym2/TreeShrink/
Mai, Uyen, and Siavash Mirarab. “TreeShrink: Fast and Accurate Detection of Outlier Long Branches in Collections of Phylogenetic Trees.” BMC Genomics 19, no. S5 (2018): 272. doi:10.1186/s12864-018-4620-2.

[Figure: three gene trees on the mammalian taxa; scale bars 0.2]
Legend: True positive, False negative, False positive

             10% flipped   5% flipped
Precision    100%          92%
Recall       99%           53%
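Precision and recall follow from the three counts in the legend. The sketch below uses the standard definitions; the function name, the example taxa, and the choice to return 1.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall(removed, erroneous):
    """Score a filtering run against the known set of erroneous sequences."""
    removed, erroneous = set(removed), set(erroneous)
    true_pos = len(removed & erroneous)   # erroneous and removed
    false_pos = len(removed - erroneous)  # removed but not erroneous
    false_neg = len(erroneous - removed)  # erroneous but kept
    # Convention (assumption): an empty denominator counts as a perfect score.
    precision = true_pos / (true_pos + false_pos) if removed else 1.0
    recall = true_pos / (true_pos + false_neg) if erroneous else 1.0
    return precision, recall

# Hypothetical run: four mutated sequences, four removals, three of them correct.
erroneous = {"Cat", "Dog", "Cow", "Pig"}
removed = {"Cat", "Dog", "Cow", "Horse"}
print(precision_recall(removed, erroneous))  # (0.75, 0.75)
```

High precision with low recall, as in the 5% column, means that what was removed was mostly erroneous but many erroneous sequences were kept.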
• Complication: outgroups look like outliers!
• Simply discard outgroups?
• Outgroups are sometimes erroneous, so sometimes they need to be removed and sometimes not
[Figure: a tree with correct outgroup placement on a long branch; scale bar 0.09]
[Figure: a tree with incorrect outgroup placement on a long branch]
• If given a collection of gene trees, TreeShrink can learn the impact of each species on the diameter
• per-gene test: applied to each gene tree independently
• per-species test: applied to each species in a collection of gene trees independently
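The two tests above differ only in what a value is compared against: the other species in the same gene tree, or the same species in the other gene trees. The sketch below illustrates that difference with a deliberately crude rule, flagging a value when it exceeds a multiple of the median of its comparison group. That rule, the `factor` parameter, the function names, and the toy numbers are assumptions; it is a stand-in for, not a description of, the statistical test TreeShrink applies.

```python
from collections import defaultdict
from statistics import median

def _flag(groups, factor):
    """Flag (gene, species) pairs whose value exceeds factor * median of their group."""
    flagged = set()
    for members in groups.values():
        cutoff = factor * median(members.values())
        flagged |= {key for key, value in members.items() if value > cutoff}
    return flagged

def per_gene_outliers(signatures, factor=2.0):
    """Compare each species with the other species in the same gene tree."""
    groups = {gene: {(gene, sp): v for sp, v in by_species.items()}
              for gene, by_species in signatures.items()}
    return _flag(groups, factor)

def per_species_outliers(signatures, factor=2.0):
    """Compare each species with itself across all gene trees."""
    groups = defaultdict(dict)
    for gene, by_species in signatures.items():
        for sp, v in by_species.items():
            groups[sp][(gene, sp)] = v
    return _flag(groups, factor)

# Hypothetical per-gene, per-species impact values (larger = bigger effect on the diameter).
signatures = {
    "gene1": {"Human": 1.0, "Mouse": 1.1, "Cat": 1.0, "Dog": 1.0, "Chicken": 3.0},
    "gene2": {"Human": 1.0, "Mouse": 1.0, "Cat": 4.0, "Dog": 1.0, "Chicken": 3.2},
    "gene3": {"Human": 1.1, "Mouse": 1.0, "Cat": 1.0, "Dog": 1.0, "Chicken": 2.9},
}
print(sorted(per_gene_outliers(signatures)))
print(sorted(per_species_outliers(signatures)))
```

In this toy input, Chicken has a high value in every gene, so the per-gene rule flags it everywhere while the per-species rule does not flag it at all. Cat is flagged by both rules, but only in gene2, where its value is unusual.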
[Figure: gene tree labeled "Outgroup Removed by TreeShrink"; scale bar 0.05]
[Figure: gene tree labeled "Outgroup Removed by TreeShrink"; scale bar 0.08]
[Figure: percent removed (y-axis, 0 to 30) under the per-gene and per-species tests, for All_Taxa and Outgroups; one panel per dataset: Cannon, Frogs, Insects, Mammals, Plants, Rouse]
• A software tool to automatically filter outlier long branches in phylogenomics data
• Incorporate computational algorithm and statistics
• Handle outgroups and fast-evolving species
• Freely available: https://github.com/uym2/TreeShrink/

Thank you!