Global Genomic Resources for Biodiversity Research

Jonathan Coddington, Global Genome Initiative, Smithsonian Institution

Global Genome Initiative

Before: Hard-to-find, ambiguous tissues ambiguously owned by individual PIs
After: Discoverable, genomic-quality samples in institutional biorepositories (best practices & int. treaties)

Before: “Boutique” sequencing of a few genomes
After: Affordable, coordinated sequencing of a thoughtful synopsis of all of Life

Before: Phenotypic, expert-limited IDs
After: Approximate “mesoscale” IDs of most anywhere

Before: Dispersed environmental biology
After: Precise, scalable, cheap tools: evolution, conservation, ecology, biotech

Four Ways to Be a Genomic Sample

Warm Living (best): Wildlands (!!), parks (!), zoos, botanical gardens, aquaria
Cold Living (good): cultures, seed banks, biobanks, biorepositories
Warm Dead (ok…): Museums, herbaria, other collections
Cold Dead (less good): biobanks, biorepositories, research freezers, etc.

LIFE ON ICE! Just one Genome!

Eukaryotes

Global Genome diversity = Σ (branch lengths)?

Discovery of “Families”

[Figure: Discovery of Major Lineages (Families). y-axis: Percent 2020 Total (0 to 1.00); x-axis: Earliest Description (1750 to 2025); curves: Angiosperms, Chordata, Animalia, Fungi.]

Big Data for 9,911 families

Source   Total Records    Fraction of Families   Families Missing   Max
GBIF       955,459,561    0.93                     710              51,162,821
BHL          3,603,706    0.79                   2,064                  31,017
NCBI       219,978,814    0.77                   2,288              17,150,286
OTOL*        1,903,704    0.76                   2,407                  39,969
BOLD         6,373,906    0.57                   4,282                 424,231
EOL             99,886    0.38                   6,132                   2,707
GGBN         1,632,440    0.36                   6,334                 165,650
Total    1,187,419,577                             196

Only 196 families absent from all 7 DBs

High Priority for IBOL2: 4,282 families?
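The coverage figures for the seven databases appear to follow from the missing-family counts: a database that lacks 710 of the 9,911 families covers (9,911 - 710) / 9,911 ≈ 0.93 of them. A minimal Python sketch of that arithmetic, plus a set-based version of the "absent from all databases" count, is given here. The function names and the toy family sets are illustrative assumptions, not part of the original analysis.

```python
# Families missing from each database, as reported on the slide.
TOTAL_FAMILIES = 9_911
MISSING = {
    "GBIF": 710, "BHL": 2_064, "NCBI": 2_288, "OTOL": 2_407,
    "BOLD": 4_282, "EOL": 6_132, "GGBN": 6_334,
}


def coverage(missing_count, total=TOTAL_FAMILIES):
    """Fraction of families with at least one record in a database."""
    return (total - missing_count) / total


def absent_from_all(all_families, present_by_db):
    """Families that appear in none of the databases (hypothetical helper)."""
    covered = set().union(*present_by_db.values())
    return set(all_families) - covered


# Coverage fractions rounded to two decimals, one per database.
fractions = {db: round(coverage(n), 2) for db, n in MISSING.items()}

# Toy example with made-up presence data for three family names.
toy_families = {"Hominidae", "Rosaceae", "Solemyidae"}
toy_presence = {"db_a": {"Hominidae"}, "db_b": {"Hominidae", "Rosaceae"}}
toy_gap = absent_from_all(toy_families, toy_presence)

if __name__ == "__main__":
    for db, fraction in fractions.items():
        print(db, fraction)
    print(sorted(toy_gap))
```

With real data, `present_by_db` would hold the family names matched in each database, and the size of the returned set would correspond to the 196 families reported as absent from all seven.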

*OTOL = Open Tree of Life

[Figure: Venn diagram, All Phyla. Sets and totals: Catalogue of Life (101); Global Catalogue of Microorganisms, Cultures (57); Global Genome Biodiversity Network, Tissues/DNA, and NCBI GenBank, DNA Barcodes (the two remaining totals are 68 and 37). Region counts: 16, 5, 20, 3, 10, 4, 33, 10. Results exclude names from GCM that did not match to CoL. Mismatch rate for GCM was 2%. Last updated 03 October 2018.]

Animalia: Cycliophora, Dicyemida, Entoprocta, Gastrotricha, Gnathostomulida, Micrognathozoa, Myxozoa, Nematomorpha, Onychophora, Orthonectida, , : Acavomonidia, Picozoa, Radiozoa, Plantae: Anthocerotophyta, : Calcitarcha, Choanozoa, Metamonada, Microsporidia

[Figure: Venn diagram, All Families. Sets and totals: Catalogue of Life (9,858); Global Catalogue of Microorganisms, Cultures (1,221); Global Genome Biodiversity Network, Tissues/DNA, and NCBI GenBank, DNA Barcodes (the two remaining totals are 3,457 and 3,248). Region counts: 4,942, 790, 2,118, 787, 186, 363, 157, 515. Results exclude names that did not match to CoL. Mismatch rates: GGBN 7%, GenBank 8%, GCM 7%. Last updated 03 October 2018.]

[Figure: Venn diagram, All Genera. Sets and totals: Catalogue of Life (160,938); Global Catalogue of Microorganisms, Cultures (6,402); Global Genome Biodiversity Network, Tissues/DNA, and NCBI GenBank, DNA Barcodes (the two remaining totals are 17,783 and 24,375). Region counts: 126,141, 4,943, 11,730, 11,722, 766, 344, 579, 4,713. Results exclude names that did not match to CoL. Mismatch rates: GGBN 6%, GenBank 10%, GCM 7%. Last updated 03 October 2018.]

GGBN Solves a Problem: data model for tissues, DNAs, etc.

GBIF: specimens & vouchers
GGBN, the Missing Link: tissues, DNAs, RNAs, physical genetic resources
NCBI, BOLD: sequences

83 members, 30 countries, 2M samples, 3,899 families, 20K genera, 45K species

GGBN DarwinCore Extension (156 fields, 9 mandatory)
http://terms.tdwg.org/wiki/GGBN_Data_Standard

GGBN Progress as of May 2018

[Figure: bar chart of Names in the Catalogue of Life (y-axis, log scale, 1 to 10,000,000) by Taxonomic Rank (x-axis: Kingdoms, Phyla, Classes, Orders, Families, Genera, Species), split into "In GGBN" and "Not in GGBN". Legible data labels: 1,885,450; 156,056; 9,446; 1,365; 313; 96.]

Eukaryotic Genome Quality (n=3311)

[Figure: histogram of # Genomes (y-axis, 0 to 500) by Genome Quality code (x-axis, 1.2.2 to 4.6.6), with series including Fungi, Protists, and Other. Legend: the first digit is the genome level (1=contig, 2=scaffold, 3=chromosome, 4=complete), the second is log (contig N50), and the third is log (scaffold N50).]
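The x-axis codes in the genome quality chart (for example 2.3.5) can be read as genome level, log of contig N50, and log of scaffold N50. A rough Python sketch of how such a code might be derived from sequence lengths follows. The function names are hypothetical, and taking the integer part of log10 is an assumption, since the slide does not say how the logarithm is rounded.

```python
import math

# Genome levels as listed in the chart legend.
LEVELS = {"contig": 1, "scaffold": 2, "chromosome": 3, "complete": 4}


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


def quality_code(level, contig_lengths, scaffold_lengths):
    """Build a level.log(contig N50).log(scaffold N50) code (assumed floor of log10)."""
    contig_part = int(math.log10(n50(contig_lengths)))
    scaffold_part = int(math.log10(n50(scaffold_lengths)))
    return f"{LEVELS[level]}.{contig_part}.{scaffold_part}"


# Made-up lengths in base pairs, for illustration only.
example_code = quality_code(
    "scaffold",
    contig_lengths=[5_000, 3_000, 2_000, 1_000],
    scaffold_lengths=[400_000, 300_000, 100_000],
)

if __name__ == "__main__":
    print(example_code)
```

In this toy case the contig N50 is 3,000 bp and the scaffold N50 is 400,000 bp, so the assembly would fall in the 2.3.5 bin.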

Genome quality chart after Lewin et al. 2018, Earth BioGenome Project: Sequencing life for the future of life. www.pnas.org/cgi/doi/10.1073/pnas.1720115115

Solemya velum genomic DNA

[Figure: gel images. Panels: 0.7X Agarose Gel; FEMTO Pulse (min 10 fg); PFGE, 1X Agarose Gel (200 ng); Femto Pulse with 165 kb ladder. Annotation: Pipette shearing?]

Conclusions and Thanks!

• Big Data: OK!
• Biodiversity Discovery: OK!
• Preserving the genome = Σ branch lengths
• Gap Analysis (known unknowns):
  – Especially useful to set priorities
  – Quantitative metrics possible
  – Enviro management, conservation
  – Build community, bridges to biodiversity genomics
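One way to make "preserving the genome = Σ branch lengths" quantitative is to sum the branch lengths of the part of a tree that connects the sampled taxa to the root. A minimal Python sketch under that assumption follows. The tree, its branch lengths, and the function name are invented for illustration.

```python
def preserved_branch_length(parents, sampled):
    """Sum branch lengths on the paths from each sampled node up to the root.

    parents maps node -> (parent, branch_length); the root has no entry.
    Each branch is counted once, even when several sampled tips share it.
    """
    seen = set()
    total = 0.0
    for tip in sampled:
        node = tip
        while node in parents and node not in seen:
            seen.add(node)
            parent, length = parents[node]
            total += length
            node = parent
    return total


# Toy tree (hypothetical): root -> A -> {a1, a2}, root -> b.
TOY_TREE = {
    "A": ("root", 1.0),
    "a1": ("A", 0.5),
    "a2": ("A", 0.5),
    "b": ("root", 2.0),
}

whole_tree = preserved_branch_length(TOY_TREE, ["a1", "a2", "b"])
one_clade = preserved_branch_length(TOY_TREE, ["a1", "a2"])

if __name__ == "__main__":
    print("fraction preserved by sampling clade A only:", one_clade / whole_tree)
```

Sampling both tips of clade A adds little beyond sampling one of them, while adding the long branch to b adds a lot, which is the intuition behind using branch lengths to set sampling priorities.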
