
TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees

Uyen Mai
University of San Diego

Error-prone gene trees

• Sequence data may include various sources of error
• Erroneous sequences often appear as long branches in the inferred phylogenies

From Gatesy et al. (2014)

Long branches are suspicious

[Figure: a gene tree from the Mammalian dataset (Song et al., PNAS, 2012); scale bar 0.2]

[Figure: a second gene tree plot from the Mammalian dataset (Song et al., PNAS, 2012); scale bar 0.2]

[Figure: a gene tree from the 1kp Plants dataset (Wickett et al., PNAS, 2014); scale bar 0.6]

• What can lead to long branches?
    • contamination
    • mistaken orthology
    • misalignment

Semi-simulated data: randomly select 10 sequences and mutate them
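A minimal Python sketch of the semi-simulation step named in the slide title: pick 10 sequences at random and mutate a fixed fraction of their bases (10% or 5% in this talk). The function names, the dictionary-based alignment, and the uniform choice of replacement base are assumptions made for illustration; the slides do not say how the mutations were actually drawn.

```python
import random


def mutate(seq, fraction, rng):
    """Replace `fraction` of the positions in `seq` with a different base.

    Assumes an ungapped DNA string over the alphabet ACGT.
    """
    bases = "ACGT"
    chars = list(seq)
    n_sites = round(fraction * len(chars))
    # Sample distinct positions so exactly n_sites bases change.
    for pos in rng.sample(range(len(chars)), n_sites):
        chars[pos] = rng.choice([b for b in bases if b != chars[pos]])
    return "".join(chars)


def corrupt_alignment(alignment, n_seqs=10, fraction=0.10, seed=0):
    """Mutate `n_seqs` randomly chosen sequences of a {name: sequence} dict.

    Returns the modified copy and the list of names that were mutated.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(alignment), min(n_seqs, len(alignment)))
    corrupted = dict(alignment)
    for name in chosen:
        corrupted[name] = mutate(alignment[name], fraction, rng)
    return corrupted, chosen


if __name__ == "__main__":
    toy = {f"taxon{i}": "ACGT" * 25 for i in range(37)}
    corrupted, chosen = corrupt_alignment(toy, n_seqs=10, fraction=0.05)
    print(len(chosen), "sequences mutated")
```

Keeping the list of mutated names makes it possible to score a filtering method afterwards, since the erroneous sequences are known.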

[Figure panels: mutate 10% of the DNA bases; mutate 5% of the DNA bases]

Rooted filtering

• Filter out taxa that are exceptionally distant from the root
• Studies have reported that this kind of filtering improved the phylogeny

• Relies on the root position

Diameter-based filtering

For unrooted trees?

Diameter: the longest path between any two species
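As an illustration of this definition, the sketch below computes the diameter of an unrooted tree given as a list of weighted edges. The edge-list representation and the function names are assumptions for this example. It uses the standard two-sweep search, which is valid for trees with non-negative branch lengths; it is not claimed to be how TreeShrink computes the diameter internally.

```python
def farthest(adj, start):
    """Return (node, distance) for the node farthest from `start`."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                stack.append(neighbor)
    far = max(dist, key=dist.get)
    return far, dist[far]


def tree_diameter(edges):
    """Longest path in a tree given as (node_u, node_v, branch_length) edges.

    Two sweeps: the node farthest from an arbitrary start is one end of a
    longest path, and the node farthest from that end is the other end.
    """
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))
    start = next(iter(adj))
    end_a, _ = farthest(adj, start)
    end_b, length = farthest(adj, end_a)
    return end_a, end_b, length


if __name__ == "__main__":
    toy_edges = [("A", "x", 0.5), ("B", "x", 0.25), ("x", "y", 0.25),
                 ("C", "y", 1.0), ("D", "y", 0.125)]
    print(tree_diameter(toy_edges))
```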

[Figure: a gene tree from the 1KP plant dataset (Wickett et al., PNAS, 2014); scale bar 0.2]

Tree diameter is reduced by half!

What to remove?

Let $\nu_i = \dfrac{\text{the diameter after } i-1 \text{ removals}}{\text{the diameter after } i \text{ removals}}$
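The ratio defined above can be traced on a toy tree with the sketch below. It is a simplified greedy stand-in, assuming positive branch lengths: at each step it removes whichever end of the current diameter path shrinks the diameter more, and records the ratio of the old diameter to the new one. The function names and the edge-list input are hypothetical, the default removal budget follows the k = min(n/4, 5√n) given in the Figure 3 caption, and the greedy choice should not be read as the authors' exact algorithm.

```python
import math
from itertools import combinations


def leaf_distances(edges):
    """Path lengths between all pairs of leaves of a weighted tree."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))
    leaves = [node for node, nbrs in adj.items() if len(nbrs) == 1]
    dist = {}
    for src in leaves:
        seen = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, length in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + length
                    stack.append(nxt)
        for dst in leaves:
            dist[src, dst] = seen[dst]
    return leaves, dist


def diameter(leaves, dist):
    """Return (length, leaf_a, leaf_b) for the farthest pair in `leaves`."""
    return max((dist[a, b], a, b) for a, b in combinations(sorted(leaves), 2))


def shrink_ratios(edges, k=None):
    """Greedily remove leaves and return (removed_leaves, ratios)."""
    leaves, dist = leaf_distances(edges)
    n = len(leaves)
    if k is None:
        k = int(min(n / 4, 5 * math.sqrt(n)))  # budget from the Figure 3 caption
    remaining = set(leaves)
    current, a, b = diameter(remaining, dist)
    removed, ratios = [], []
    for _ in range(k):
        if len(remaining) < 3:
            break  # a diameter needs at least two leaves after the removal
        # Only removing an end of the diameter path can shrink the diameter.
        options = []
        for leaf in (a, b):
            new_d, p, q = diameter(remaining - {leaf}, dist)
            options.append((new_d, leaf, p, q))
        new_d, leaf, p, q = min(options)
        ratios.append(current / new_d)  # diameter before / diameter after
        removed.append(leaf)
        remaining.discard(leaf)
        current, a, b = new_d, p, q
    return removed, ratios


if __name__ == "__main__":
    toy_edges = [("A", "x", 0.125), ("B", "x", 0.125), ("x", "r", 0.125),
                 ("C", "y", 0.125), ("D", "y", 0.125), ("y", "r", 0.125),
                 ("E", "z", 0.125), ("F", "z", 0.125), ("z", "r", 0.125),
                 ("G", "w", 0.125), ("OUT", "w", 2.0), ("w", "r", 0.125)]
    # The leaf on the long branch ("OUT") is removed first, with a large ratio.
    print(shrink_ratios(toy_edges))
```

A ratio close to 1 means the removal barely changed the diameter, while a large ratio marks a removal that shrank the tree substantially.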

[Figure: plot of the ratio $\nu_i$ against the removal index $i$ (axes: ratio, removal), shown alongside the tree; scale bar 0.2]



Figure 3 (a) Patterns of $\nu_i$ as a function of $i$. Four unfiltered gene trees from a Plant dataset [23] are shown (top). For each tree, we also show $\nu_i$ for $1 \le i \le k = \min(n/4, 5\sqrt{n})$ (bottom). (b) An example tree from the Plant dataset with the removing sets and species signatures. The removing sets are shown with the corresponding $\nu$ values. The max $\nu$ values associated with the species signatures are marked in red.

TreeShrink

• Automatically compute the “diameter-shrinking” plot and identify outliers

• Combines a computational algorithm with statistics

• Processed a tree of 200,000 leaves in 30 minutes

• Software tool: https://github.com/uym2/TreeShrink/

Mai, Uyen, and Siavash Mirarab. “TreeShrink: Fast and Accurate Detection of Outlier Long Branches in Collections of Phylogenetic Trees.” BMC Genomics 19, no. S5 (2018): 272. doi:10.1186/s12864-018-4620-2.

[Figures: three gene trees over the same set of mammalian taxa, each with scale bar 0.2]

[Figure legend: true positive, false negative, false positive]

• 10% flipped: 100% precision, 99% recall
• 5% flipped: 92% precision, 53% recall
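For reference, precision and recall follow the usual definitions in terms of the true positives, false positives, and false negatives named in the legend. The short sketch below computes them from two sets of sequence names. The function name, the set-based inputs, and the choice to return 0.0 for an empty denominator are assumptions of this example, and the toy names are invented.

```python
def precision_recall(removed, erroneous):
    """Score a filtering result against the known erroneous sequences.

    removed   -- set of names the filter removed
    erroneous -- set of names that were actually mutated
    """
    true_pos = len(removed & erroneous)    # removed and truly erroneous
    false_pos = len(removed - erroneous)   # removed but actually fine
    false_neg = len(erroneous - removed)   # erroneous but kept
    # Returning 0.0 on an empty denominator is a convention chosen here.
    precision = true_pos / (true_pos + false_pos) if removed else 0.0
    recall = true_pos / (true_pos + false_neg) if erroneous else 0.0
    return precision, recall


if __name__ == "__main__":
    removed = {"seq1", "seq2", "seq3"}
    erroneous = {"seq1", "seq2", "seq4", "seq5"}
    print(precision_recall(removed, erroneous))
```

High precision with lower recall, as in the 5% case, means that what was removed was almost always erroneous, but many erroneous sequences were left in.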


• Complication: outgroups look like outliers!
• Simply discard outgroups?
• Outgroups are sometimes erroneous, so sometimes they need to be removed and sometimes not

[Figure: a tree with correct outgroup placement on a long branch; scale bar 0.09]

[Figure: a tree with incorrect outgroup placement on a long branch]

• If given a collection of gene trees, TreeShrink can learn the impact of each species on the diameter

• per-gene test: applied to each gene tree independently

• per-species test: applied to each species in a collection of gene trees independently
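The difference between the two tests can be illustrated with a toy sketch. It assumes each species has a numeric score in each gene tree (for instance a signature such as the max ν value mentioned in the Figure 3 caption) and flags scores that exceed a multiple of the median. The median-based rule, the `factor` parameter, the function names, and the scores are all hypothetical stand-ins; the slides do not describe the statistical test TreeShrink actually applies.

```python
from statistics import median


def flag_outliers(values, factor=2.0):
    """Toy rule (an assumption): keys whose value exceeds factor x the median."""
    cutoff = factor * median(values.values())
    return {key for key, value in values.items() if value > cutoff}


def per_gene_test(signatures, factor=2.0):
    """Compare species against each other within each gene tree separately."""
    return {(gene, species)
            for gene, by_species in signatures.items()
            for species in flag_outliers(by_species, factor)}


def per_species_test(signatures, factor=2.0):
    """Compare each species against its own scores across all gene trees."""
    by_species = {}
    for gene, scores in signatures.items():
        for species, value in scores.items():
            by_species.setdefault(species, {})[gene] = value
    return {(gene, species)
            for species, by_gene in by_species.items()
            for gene in flag_outliers(by_gene, factor)}


if __name__ == "__main__":
    # Invented scores: Chicken plays the role of an outgroup on a long branch.
    signatures = {
        "gene1": {"Chicken": 3.0, "Human": 1.0, "Mouse": 1.1, "Cow": 1.0},
        "gene2": {"Chicken": 3.2, "Human": 1.0, "Mouse": 1.0, "Cow": 1.2},
        "gene3": {"Chicken": 9.0, "Human": 1.1, "Mouse": 1.0, "Cow": 1.0},
    }
    print(sorted(per_gene_test(signatures)))
    print(sorted(per_species_test(signatures)))
```

In this toy data the per-gene rule flags the outgroup in every gene, while the per-species rule flags it only in the gene where its score is unusual for that species, which is the behavior the outgroup discussion calls for.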

[Figure: tree labeled "Outgroup Removed by TreeShrink"; scale bar 0.05]

[Figure: tree labeled "Outgroup Removed by TreeShrink"; scale bar 0.08]

[Figure: percent removed under the per-gene and per-species tests, one panel per dataset (Cannon, Frogs, Insects, Mammals, Plants, Rouse), with series All_Taxa and Outgroups]

• A software tool to automatically filter outlier long branches in phylogenomics data

• Incorporates a computational algorithm and statistics
• Handles outgroups and fast-evolving species

• Freely available: https://github.com/uym2/TreeShrink/

Thank you!
