Global Genomic Resources for Biodiversity Research
Jonathan Coddington, Global Genome Initiative, Smithsonian Institution
Global Genome Initiative

Before: Hard-to-find, ambiguous quality tissues ambiguously owned by individual PIs
After: Discoverable, genomic samples in institutional biorepositories (best practices & int. treaties)

Before: “Boutique” sequencing of a few genomes
After: Affordable, coordinated, sequencing of a thoughtful synopsis of all of Life

Before: Phenotypic, expert-limited taxonomy
After: Approximate “mesoscale” IDs of most organisms anywhere

Before: Dispersed environmental biology
After: Precise, scalable, cheap tools: evolution, conservation, ecology, biotech

Four Ways to Be a Genomic Sample
Warm Living (best): Wildlands (!!), parks (!), zoos, botanical gardens, aquaria
Cold Living (good): Cell cultures, seed banks, biobanks, biorepositories
Warm Dead (ok…): Museums, herbaria, other collections
Cold Dead (less good): biobanks, biorepositories, research freezers, etc.
LIFE ON ICE! Just one Genome!
Eukaryotes
Global Genome diversity = Σ (branch lengths)?

Discovery of “Families”
Figure: Discovery of Major Lineages (Families). Y axis: percent of 2020 total (0 to 1.00). X axis: earliest description (1750 to 2025). Series: Angiosperms, Chordata, Animalia, Foraminifera, Fungi.
Big Data for 9,911 families

Source   Total Records    Fraction of Families   Families Missing   Max
GBIF     955,459,561      0.93                   710                51,162,821
BHL      3,603,706        0.79                   2,064              31,017
NCBI     219,978,814      0.77                   2,288              17,150,286
OTOL*    1,903,704        0.76                   2,407              39,969
BOLD     6,373,906        0.57                   4,282              424,231
EOL      99,886           0.38                   6,132              2,707
GGBN     1,632,440        0.36                   6,334              165,650
Total    1,187,419,577                           196
Only 196 families absent from all 7 DBs
High Priority for IBOL2: 4,282 families?
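As a rough check on the table, the fraction of families covered by each source can be recomputed from the "missing" counts. This is only a sketch: the total (9,911) and the missing counts come from the slide, while the function name `coverage` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Total number of families, from the slide title.
TOTAL_FAMILIES = 9911

# Families missing from each source, copied from the table on this slide.
MISSING = {
    "GBIF": 710,
    "BHL": 2064,
    "NCBI": 2288,
    "OTOL": 2407,
    "BOLD": 4282,
    "EOL": 6132,
    "GGBN": 6334,
}


def coverage(missing, total=TOTAL_FAMILIES):
    """Fraction of families that have at least one record in a source."""
    return (total - missing) / total


# Print sources from best to worst coverage.
for source, missing in sorted(MISSING.items(), key=lambda kv: kv[1]):
    print(f"{source:5s} {coverage(missing):.2f} {missing:>6,d}")
```

Sorting by the missing count ranks the sources by gap size, which is one simple way to turn a gap analysis into a priority list.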
*OTOL = Open Tree of Life

All Phyla: Catalogue of Life (101)
Sets: Global Genome Biodiversity Network tissues/DNA (68); NCBI GenBank DNA barcodes (37); Global Catalogue of Microorganisms cultures (57)
Venn regions: CoL only 16; GGBN only 5; GGBN and GenBank 20; GenBank only 3; all three 10; GGBN and GCM 33; GenBank and GCM 4; GCM only 10
Results exclude names from GCM that did not match to CoL. Mismatch rate for GCM was 2%. Last updated 03 October 2018.
Animalia: Cycliophora, Dicyemida, Entoprocta, Gastrotricha, Gnathostomulida, Micrognathozoa, Myxozoa, Nematomorpha, Onychophora, Orthonectida, Placozoa
Chromista: Acavomonidia, Picozoa, Radiozoa
Plantae: Anthocerotophyta
Protozoa: Calcitarcha, Choanozoa, Metamonada, Microsporidia

All Families: Catalogue of Life (9,858)
Sets: Global Genome Biodiversity Network tissues/DNA (3,457); NCBI GenBank DNA barcodes (3,248); Global Catalogue of Microorganisms cultures (1,221)
Venn regions: CoL only 4,942; GGBN only 790; GGBN and GenBank 2,118; GenBank only 787; all three 186; GGBN and GCM 363; GenBank and GCM 157; GCM only 515
Results exclude names that did not match to CoL. Mismatch rates: GGBN 7%, GenBank 8%, GCM 7%. Last updated 03 October 2018.

All Genera: Catalogue of Life (160,938)
Sets: Global Genome Biodiversity Network tissues/DNA (17,783); NCBI GenBank DNA barcodes (24,375); Global Catalogue of Microorganisms cultures (6,402)
Venn regions: CoL only 126,141; GGBN only 4,943; GGBN and GenBank 11,730; GenBank only 11,722; all three 344; GGBN and GCM 766; GenBank and GCM 579; GCM only 4,713
Results exclude names that did not match to CoL. Mismatch rates: GGBN 6%, GenBank 10%, GCM 7%. Last updated 03 October 2018.

GGBN Solves a Problem: data model for tissues, DNAs, etc.
GBIF: specimens & vouchers
GGBN (The Missing Link): Tissues, DNAs, RNAs, physical genetic resources
NCBI, BOLD: sequences
83 members, 30 countries, 2M samples, 3,899 families, 20K genera, 45K species
GGBN DarwinCore Extension (156 fields, 9 mandatory)
http://terms.tdwg.org/wiki/GGBN_Data_Standard

GGBN Progress as of May 2018
Figure: Names in the Catalogue of Life by taxonomic rank, split into "In GGBN" and "Not in GGBN", on a log scale (1 to 10,000,000). Labeled values: Kingdoms 8; Phyla 96; Classes 313; Orders 1,365; Families 9,446; Genera 156,056; Species 1,885,450.

Eukaryotic Genome Quality (n=3311)
Figure: Number of genomes (y axis, 0 to 500) by genome quality code (x axis, 1.2.2 to 4.6.6). Series: Animals, Plants, Fungi, Protists, Other.
X axis key: genome level (1=contig, 2=scaffold, 3=chromosome, 4=complete); log (contig n50); log (scaffold n50).
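One possible reading of the three-part quality code on the x axis is sketched below. It rests on assumptions the slide does not state: that the parts are ordered level, contig N50, scaffold N50; that the logarithm is base 10; that the log is rounded down; and that N50 values are in base pairs. The names `LEVELS` and `quality_code` are hypothetical.

```python
import math

# Genome level codes from the x axis key on the slide.
LEVELS = {"contig": 1, "scaffold": 2, "chromosome": 3, "complete": 4}


def quality_code(level, contig_n50, scaffold_n50):
    """Build a 'level.contig.scaffold' code.

    Assumption: each N50 (in base pairs) is reduced to floor(log10(N50)).
    """
    contig_part = math.floor(math.log10(contig_n50))
    scaffold_part = math.floor(math.log10(scaffold_n50))
    return f"{LEVELS[level]}.{contig_part}.{scaffold_part}"


# A scaffold-level assembly with 25 kb contig N50 and 300 kb scaffold N50.
print(quality_code("scaffold", 25_000, 300_000))  # 2.4.5
```

Under these assumptions a code such as 3.5.7 would describe a chromosome-level assembly with contig N50 in the hundreds of kilobases and scaffold N50 in the tens of megabases.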
After Lewin et al. 2018, Earth BioGenome Project: Sequencing life for the future of life. www.pnas.org/cgi/doi/10.1073/pnas.1720115115

Solemya velum genomic DNA
Figure: gel images. Panels: 0.7X Agarose Gel; FEMTO Pulse (min 10fg); PFGE, 1X Agarose Gel (200ng); Femto Pulse, 165 kb ladder. Annotation: Pipette shearing?

Conclusions and Thanks!
• Big Data: OK!
• Biodiversity Discovery: OK!
• Preserving the genome = Σ branch lengths
Gap Analysis (known unknowns):
– Especially useful to set priorities
– Quantitative metrics possible
– Enviro management, conservation
– Build community, bridges to biodiversity genomics
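The "Σ branch lengths" idea can be illustrated with a toy calculation: sum the branches that connect a set of preserved samples to the root, and compare that with the whole tree. This is a minimal sketch, not the method used in the talk. The tree, its branch lengths, and the function name `spanned_branch_length` are all hypothetical.

```python
# Toy rooted tree: child -> (parent, branch length).
# All node names and lengths are made up for illustration.
TREE = {
    "A": ("n1", 2.0),
    "B": ("n1", 3.0),
    "n1": ("root", 1.0),
    "C": ("root", 4.0),
}


def spanned_branch_length(tree, tips):
    """Sum the lengths of all branches on the paths from the tips to the root.

    Each branch is counted once, even when several tips share it.
    """
    seen = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk toward the root, stopping at a branch already counted.
        while node in tree and node not in seen:
            parent, length = tree[node]
            seen.add(node)
            total += length
            node = parent
    return total


whole_tree = spanned_branch_length(TREE, ["A", "B", "C"])
preserved = spanned_branch_length(TREE, ["A", "C"])
print(whole_tree, preserved, preserved / whole_tree)  # 10.0 7.0 0.7
```

In this toy case, preserving samples of A and C retains 70% of the total branch length, which is the kind of quantitative metric a gap analysis could report.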