1/22/2018

Dissecting evolution and disease using comparative vertebrate genomics – power from 200 mammals

Kerstin Lindblad-Toh
Broad Institute of MIT and Harvard
Uppsala University

Knowledge from three major genome projects

• ~5% of mammalian genomes are functional
  ~1.5% protein coding
  >3.5% non-coding conserved
• ≥ 20,000 mammalian genes, some lineage-specific expansion
• The most highly conserved non-coding elements sit around developmental genes
• ~20% of non-coding elements are lineage-specific innovations in placental mammals
• Transposable elements give rise to novel functional elements

Large-scale evolution across vertebrates

SOX2 Lowe et al Science, 2011

Label conserved non-coding elements as: Eutherian Therian Amniote

Mikkelsen et al Nature, 2007

Conservation predicts function

• At least 6-10% of the genome is conserved
• If something has been conserved for 100 million years, it is probably doing something
• 90% of GWAS peaks are outside coding regions

[Figure: genome fractions labelled coding (1.5%), conserved non-coding (~6.5%) and "mystery"]

29 mammals project

• Sanger 2x (upgraded to 7x)
• 3.6 million constraint elements, encompassing 4.2% of the human genome
• Broad, WashU, Baylor

[Figure: phylogenetic tree; numbers = substitutions per 100 base pairs]

Lindblad-Toh et al., Nature 2011

New genes and exons

~3,900 new exons:
• ~1,400 alternative exons in 850 genes
• ~1,000 new candidate new genes with multiple exons
• antisense genes

Why were they not found before?
• Too short
• Only in certain organisms

Not just genes

4.2% of human genome detected under purifying selection:
• 20% exonic
• 30% intronic
• 40% intergenic

Half novel elements

Lin

3 categories of promoter constraint

Novel constraint elements & limitations

• 12bp resolution
• 10% False Discovery Rate

Synonymous constraint elements

Both CSE regions contain enhancers driving the expression of HOXA2 in hindbrain

The 200(+) mammals project

• New genome assemblies for 137 species
• Conservation across 200+ mammals (~75 extant + 137 new)
• 1 bp resolution
• False Positive Rate = 6 x 10^-7 (~1800 in genome)

[Photos: Mandrill, Dugong, Arctic Fox, Screaming Hairy Armadillo]
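The "~1800 in genome" figure quoted for the 200(+) mammals project is consistent with multiplying the per-base false positive rate by the number of bases scored. A minimal sketch of that arithmetic, assuming a genome of roughly 3 x 10^9 bp (the genome size and the function name are assumptions for illustration, not taken from the slides):

```python
def expected_false_positives(rate_per_base: float, genome_size_bp: float) -> float:
    """Expected number of false-positive bases when every base is scored once."""
    return rate_per_base * genome_size_bp


# Rate from the slide; genome size is an ASSUMED round figure (~3 Gb).
FALSE_POSITIVE_RATE = 6e-7
ASSUMED_GENOME_SIZE_BP = 3.0e9

expected = expected_false_positives(FALSE_POSITIVE_RATE, ASSUMED_GENOME_SIZE_BP)
print(round(expected))  # 1800
```

Under a different assumed genome size the expected count scales linearly.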

The placental mammalian tree “today” (52)

[Figure: phylogenetic tree with clades Glires, Primates, Laurasiatheria, Afrotheria and Xenarthra]

The 200 mammals tree

[Figure: phylogenetic tree with clades Glires, Primates, Laurasiatheria, Afrotheria and Xenarthra]

Source: UCSC

Species selection

1. Branch length (aiming for ~1 per family)
2. Expert consultations (Ollie Ryder, Bill Murphy, Emma Teeling, Jim Patton, and many others)
3. Interesting research models, specific traits
4. Sample availability

Current status

Goal: NCBI submission of all genomes by early 2018

• Sample collection: >150 collected
• LC and sequencing: 138 sequenced
• Assembly: 132 assembled
• + ~75 extant genomes

Analysis / Annotations / Alignment

• Laurasiatheria pilot alignment (26 species)
• “Backbone” alignment in progress using extant genomes

DISCOVAR de novo genomes

• 1 µg of standard quality DNA
• Single Illumina library (~450bp insert)
• 1 lane HiSeq 2500 with 2x250 bp reads

~$5000 LC + sequencing per genome assembly

Uppsala and Broad

Assembly stats summary (N50s)

Technical or biological?

[Figure: N50 plot. Legend: Scaffold – Arachne/ALLPATHS; Scaffold – DISCOVAR. Labelled assemblies: Southern Three-Banded Armadillo (13kb), Red-shanked Douc (10kb), Linnaeus's Two Toed Sloth (5kb), Screaming Hairy Armadillo (5kb), Hoary Rat (4kb)]
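For reference, the N50 of an assembly is the length L such that contigs (or scaffolds) of length L or longer together contain at least half of the assembled bases. A minimal sketch of the calculation, using made-up lengths (the function name and example values are illustrative, not project data):

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # assembly is the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths in kb: total 300, half 150.
# 100 + 80 = 180 >= 150, so the N50 is 80.
print(n50([100, 80, 60, 40, 20]))  # 80
```

Because N50 depends only on the length distribution, a low value can reflect either a fragmented assembly or a genuinely hard-to-assemble genome, which is the "technical or biological?" question raised on the slide.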

BUSCO

BUSCO assesses genome assembly with single-copy orthologs.

[Figure panels: Primates, Laurasiatheria]

Simão et al., Bioinformatics 2015

DISCOVAR captures gene content

Dog (Discovar & canFam3.1)

• Lifted over dog annotation to both rhino assemblies
• Almost identical total gene sequence (38.9Mb vs 38.8Mb)
• Very similar coverage of the 14K 1:1 genes (94.8% vs. 93.8%)

But – chromosome evolution not possible

Upgrading DISCOVAR assemblies - Dovetail Chicago/HiC

Dovetail HiRise2 (first 20 genomes)

• 1 species per order (Harris Lewin @ UC Davis)

Clade            Common name                       Status                      Dovetail N50 (Mb)
Afrotheria       Rock hyrax                        Done                        9
Afrotheria       Lesser hedgehog tenrec            Done                        60
Afrotheria       Cape elephant shrew               No sample
Afrotheria       Aardvark                          No sample
Euarchonta       Colugo                            Done                        10
Euarchonta       Large tree shrew                  In progress
Laurasiatheria   Siberian musk deer                Done                        33
Laurasiatheria   Pronghorn                         Done                        24
Laurasiatheria   Chacoan peccary                   Done                        37
Laurasiatheria   Narwhal                           Done                        28
Laurasiatheria   Hippopotamus                      Done                        5
Laurasiatheria   Eastern black rhinoceros          Done                        18
Laurasiatheria   North Indian muntjac              Done                        32
Laurasiatheria   Gemsbok                           Done                        47
Laurasiatheria   Masai giraffe                     Done                        57
Laurasiatheria   Solenodon                         Done                        43
Laurasiatheria   Greater mouse deer                Done                        19
Laurasiatheria   Tree pangolin                     Done                        10
Xenarthra        Southern three-banded armadillo   Libraries made; failed QC
Xenarthra        Giant anteater                    Libraries made; failed QC

Reference free alignment with Cactus

• whole genome multiple alignment
• end result: a reconstruction of the evolution of the genome along the input species tree – reconstructed ancestors' “genomes” will be available
• annotate genes with Comparative Annotation Toolkit (CAT)

Joel Armstrong, Benedict Paten

Flexibility in alignments

• What other genomes are available?
• Potential to replace with newer assemblies in alignment

Pilot alignment - 26 Laurasiatheria

Name                            Species                      Clade            Contig N50 (kb)   Scaffold N50 (kb*)
Cow                             Bos taurus                   ARTIODACTYLA     97                6.4 Mb
Pig                             Sus scrofa                   ARTIODACTYLA     69                576
Siberian Reindeer               Rangifer tarandus            ARTIODACTYLA     92                106
Hippopotamus                    Hippopotamus amphibius       ARTIODACTYLA     85                99
Pronghorn                       Antilocapra americana        ARTIODACTYLA     75                91
Nilgiri Tahr                    Hemitragus hylocrius         ARTIODACTYLA     70                90
Peninsular Bighorn Sheep        Ovis canadensis              ARTIODACTYLA     63                79
Hunter's Hartebeest             Beatragus hunteri            ARTIODACTYLA     60                73
Ferret                          Mustela putorius             CARNIVORA        45                9.3 Mb
Dog                             Canis lupus                  CARNIVORA        267               45.8 Mb
Cat (Domestic)                  Felis catus                  CARNIVORA        45                18.1 Mb
South African Banded Mongoose   Mungos mungo                 CARNIVORA        189               247
Meerkat                         Suricata suricatta           CARNIVORA        157               197
Fossa                           Cryptoprocta ferox           CARNIVORA        138               187
Dwarf Mongoose                  Helogale parvula             CARNIVORA        117               185
California Sea Lion             Zalophus californianus       CARNIVORA        98                143
Giant Otter                     Pteronura brasiliensis       CARNIVORA        100               131
Arctic Fox                      Vulpes lagopus               CARNIVORA        89                124
Asian Palm Civet                Paradoxurus hermaphroditus   CARNIVORA        68                77
Striped Hyena                   Hyaena hyaena                CARNIVORA        54                69
Northern Elephant Seal          Mirounga angustirostris      CARNIVORA        55                67
Narwhal                         Monodon monoceros            CETACEA          79                99
Horse                           Equus caballus               PERISSODACTYLA   112               46.7 Mb
Malayan Tapir                   Tapirus indicus              PERISSODACTYLA   234               320
South American Tapir            Tapirus terrestris           PERISSODACTYLA   184               207
Black Rhinoceros                Diceros bicornis             PERISSODACTYLA   114               152

Good coverage of reindeer by cow, hippo and meerkat

[Figure: alignment of part of the ITIH2 gene (exon 7-15)]

Cactus with “backbone” alignment

• 33 high-quality genomes “backbone” – mostly extant genomes
• Major clades: primate, , carnivore
• Attach smaller clade alignments to backbone

Analysis plan

• Conservation: both SiPhy + PhastCons
  • Placental mammalian conservation
  • Conserved regulatory motifs, non-coding RNAs
  • CTCF sites – genome compartmentalization
  • Synonymous Constraint Elements (codons)
• Regions under positive selection (HARs, PARs, CARs)
• Convergent evolution
• Couple to phenotypes and disease

Helps finding candidate disease causal variants

• 5Mb targeted sequencing of 4 Doberman OCD cases and 4 controls
• No coding changes
• All 4 cases had mutations in a single regulatory element
• 3 DP cases + 1 DP case

Tang et al., Genome Biol 2014

Connecting genotype to phenotype

• Overlap with known mutations / functional annotations
• Non-coding element turnover
• Convergent evolution
  • venom
  • hibernation
  • loss of vitamin C metabolism
  • marine
  • echolocation

Mole vole: lost its Y chromosome
Hispaniolan solenodon: venomous
Narwhal: the dentist's friend

Convergent evolution EPAS1 - High altitude adaptation

Sheep: 200 genes enriched for functions related to angiogenesis, energy production and erythropoiesis. Wei, C. et al Sci. Rep. 2016

Hu Y et al., PNAS 2017
Wang, G.-D. et al., Genome Biol. Evol. 2014

Tool for conservation genetics

• Genomic resources crucial for conservation genetics
  – 3 Northern white rhinos left on earth
  – Southern white rhinos as surrogates
  – White rhino + black rhino genome assemblies

Critical point: variation found among stored samples that can be used for reproduction

Oliver Ryder, San Diego Zoo

How to get 150 mammals ... THANK YOU!

• Eric Baitchman
• Robert Baker
• Erika Barthlemess
• Matthew Breen
• Kevin Campbell
• Nicholas Casewell
• Leona Chemnick
• Kimberly Cooper
• Liliana Davalos
• Frederic Delsuc
• Dan Distel
• Christopher Emerling
• Vadim Gladyshev
• Jeffrey Good
• Steve Goodman
• Kris Helgen
• Allyson Hindle
• Hopi Hoekstra
• Pavel Hulva
• William Israelsen
• Danielle Lee
• Harris Lewin
• Matt MacManes
• Phil Morin
• Bill Murphy
• Alice Mouton
• Michael Nachman
• Rob Ogden
• Bret Pasch
• Klaus Peter-Koepfli
• Sébastien Puechmaille
• David Ray
• Kelly Robertson
• Stephen Rossiter
• Manuel Ruedi
• Karen Sears
• Ashley Seifert
• Mark Springer
• Emma Teeling
• Anne Yoder

200 Mammals Collaboration

• Vertebrate Genomics @ Broad – Elinor Karlsson & Bruce Birren, Kerstin Lindblad-Toh
• Uppsala University – Kerstin Lindblad-Toh
• San Diego Zoo – Oliver Ryder
• University of California, Santa Cruz – Benedict Paten & Joel Armstrong
• Stanford – Gill Bejerano
• University of California, Davis – Harris Lewin
• Earlham Institute (UK) – Wilfried Haerty (formerly TGAC)
• Institut de Biologia Evolutiva (Spain) – Tomas Marques-Bonet
• Karolinska Institutet (Sweden) – Jussi Taipale
• UMass Medical School – Manuel Garber

Connecting to phenotype

• Activity pattern (nocturnal, etc)
• Brain size
• Chemosensing
• Demography (longevity, litter size, …)
• Diet
• Habitat (aquatic, altitude, heat, cold, …)
• Immunity
• Reproduction
• Skeletal (size, digits, vertebrae, …)
• Skin (hair, spines, sweat, claws, …)
• Vision
• what else?

Backbone improves ancestral assemblies

[Figure panels: Contiguity, Size]

New program: Zoonomics

How do you map (Mendelian) diseases in species with few genomic resources and limited funds?

Step 1. Low cost de novo assembly
Step 2. RNA-seq (gene map)
Step 3. GWAS with genotyping by sequencing

Eric Baitchman

Pilot Zoonomics projects

• Dilated cardiomyopathy: captive meerkats
• Fibrosing cardiomyopathy: captive gorillas
• Amyloidosis: black-footed ferrets
• Fungal disease susceptibility: timber rattlesnakes

200 Mammals Collaboration

• UMMS – Elinor Karlsson & Manuel Garber
• Broad Institute – Elinor Karlsson, Bruce Birren, Kerstin Lindblad-Toh
• Uppsala University – Kerstin Lindblad-Toh
• University of California, Santa Cruz – Benedict Paten and Joel Armstrong
• San Diego Zoo – Oliver Ryder
• Stanford – Gill Bejerano
• University of California, Davis – Harris Lewin
• Texas A & M – Bill Murphy
• University of California, Riverside – Mark Springer
• Earlham Institute (UK) – Federica DiPalma and Wilfried Haerty
• Institut de Biologia Evolutiva (Spain) – Tomas Marques-Bonet
• Karolinska Institutet (Sweden) – Jussi Taipale

Upgrading DISCOVAR assemblies

• Have data from Dovetail for two assemblies
• +$10-15K/assembly to upgrade
• Caveat: need high molecular weight DNA

Assembly stats summary (N50s)

Technical or biological?

[Figure: chart of Contig N50 and Scaffold N50 for Primates, Rodents and Laurasiatheria; y-axis from 0 to 500,000]

Important regions are similar across species

[Figure: multi-species alignment of human, armadillo, elephant, rabbit, tenrec, cat and shrew]

Conserved region functional?

Understanding the human/Eutherian genome

• Coding regions have reasonable annotation
• 80% of GWAS peaks are outside coding regions

[Figure: genome fractions labelled coding (1.5%) and "mystery"]

90% GWAS signals fall outside coding regions

[Figure: categories Mammalian conservation, Coding, Prim, Glir, Laur, Afr, Xen]

Finucane et al. (2015)

Assembly statistics for first 104 assemblies completed

Assembly stats summary (N50s)

Technical or biological?

[Figure: N50 plot. Labelled assemblies: Aye-aye, Long-tongued fruit bat, Common gundi (312kb, 351kb, 247kb); Southern Three-Banded Armadillo (13kb), Red-shanked Douc (10kb), Linnaeus's Two Toed Sloth (5kb), Screaming Hairy Armadillo (5kb), Hoary (4kb)]
