Introduction to DNA metabarcoding
Pierre Taberlet
Laboratoire d'Ecologie Alpine, CNRS UMR 5553 Université Grenoble Alpes, Grenoble, France
Porto, 1-5 May 2017 Introduction to DNA metabarcoding
• Definitions
• Technical context
• Which marker for DNA metabarcoding?
• The importance of bioinformatics
• Key studies
  – For diet analysis
  – For current biodiversity surveys
  – For reconstructing past ecosystems
• The future

Environmental DNA
• First reference in 1987
• Microbiology: from 2000
• Plants and animals: from 2003
• Environmental DNA: DNA that can be extracted from environmental samples (such as soil, water, or air), without first isolating any target organisms
• Complex mixture of genomic DNA from many different organisms, possibly degraded
• Contains intracellular and extracellular DNA

[Figure: Overview of the emergence of eDNA studies]

Taxonomic identification from environmental DNA: terminology
Suggested terminology

[Figure: terminology grid, from Taberlet et al. (2012) Molecular Ecology, 21, 1789-1793. Horizontal axis, complexity of the DNA extract: single specimen; multiple specimens (bulk sample); environmental sample (air, water, soil, feces). Vertical axis, identification level and type of markers: species level with standardized barcodes; genus, family, or order level with other markers. Terms placed in the grid: DNA barcoding, DNA metabarcoding, eDNA metabarcoding, some of them qualified "(sensu lato)".]

[Figure: "metabarcoding" in Web of Science, 28 April 2017]

DNA metabarcoding
The metabarcoding approach: bioinformatics, field, bench, bioinformatics
• In silico analysis: design and test the most efficient metabarcodes for the target group
• Sampling in the field to obtain a DNA extract representative of the local biodiversity
• DNA amplification and sequencing
• Sequence analysis and taxa identification
  – OBITools (metabarcoding.org/obitools)
  – Problem of amplification/sequencing errors

DNA metabarcoding
• Sampling in the field (soil, water, feces, etc.)
• DNA extraction
• DNA amplification with barcode primers
• Sequencing of the PCR products on next generation sequencers
• Identification of taxa using a reference database (or identification of MOTUs)

DNA metabarcoding is not DNA barcoding
• Same objective: to identify taxa, but ...
• Different methodology: metabarcoding relies on high throughput systems for high throughput taxon identification
• Not the same constraints when working with environmental DNA -> different markers might be used

[Figure: The main steps of an eDNA study, showing the three possible approaches: single-species identification, metabarcoding, and metagenomics]

DNA sequencing
• 2005: Capillary electrophoresis
  – 500-1000 bp per sequencing reaction
  – 12 x 96 reactions per day (≈ 1 Mb per day)
• 2016: Next generation sequencers
  – Roche 454: ≈ 0.8 Gb per day
  – HiSeq 4000: ≈ 400 Gb per day
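The per-day figures above can be checked with a few lines of arithmetic. This is a minimal sketch that only reuses the numbers on the slide; the variable names are illustrative, and the capillary throughput is rounded to 1 Mb per day as on the slide.

```python
# Rough check of the daily sequencing throughput figures quoted on the slide.

REACTIONS_PER_DAY_2005 = 12 * 96       # 12 runs of 96 capillary reactions per day
BP_PER_REACTION_2005 = (500, 1000)     # read length range, in bp

# Capillary output per day (low and high estimate), in bp
capillary_bp_per_day = tuple(REACTIONS_PER_DAY_2005 * bp for bp in BP_PER_REACTION_2005)

HISEQ_4000_BP_PER_DAY = 400e9          # about 400 Gb per day
CAPILLARY_ROUNDED_BP_PER_DAY = 1e6     # about 1 Mb per day (rounded, as on the slide)

fold_increase = HISEQ_4000_BP_PER_DAY / CAPILLARY_ROUNDED_BP_PER_DAY

print(capillary_bp_per_day)            # (576000, 1152000)
print(f"{fold_increase:,.0f}-fold")    # 400,000-fold
```

The low and high capillary estimates bracket the rounded value of 1 Mb per day, which is why the ratio to the HiSeq 4000 comes out at about 400,000.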
= 400,000-fold increase in sequencing capacity in about 10 years

Traditional versus next generation sequencing

[Figure: workflow compared for traditional sequencing and next generation sequencing. Steps: sampling and DNA extraction; DNA amplification; sequencing; bioinformatics; results (example reads: ACGCTA, ACGTTA, ACGTTA, ACGTTG, ACATTA)]

454 GS FLX™
• Company: Roche Diagnostics®
• Website: www.454.com
• Fragment length: 700-800 bases
• Number of reads per run: 1 × 10^6
• Total output per run: 0.7-0.8 Gb
• Time per run: 23 hours

Ion Torrent
• Company: Life Technologies
• Website: www.iontorrent.com
• Fragment length: 100, 200, 400 bases
• Number of reads per run: 0.1, 1, 8 × 10^6
• Time per run: 2 hours

HiSeq 4000
• Company: Illumina®
• Website: www.illumina.com
• Fragment length: 150 bases (2x150 paired-ends)
• Number of reads per run: 8.6-10 billion
• Total output per run: 1.3-1.5 Tb
• Time per run: 3.5 days

An idea of the HiSeq 4000 production per run
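The figures on this slide follow from simple arithmetic. Below is a minimal sketch that takes the slide's own inputs (10 billion reads of 150 bp, 6 lines per read, 55 lines per page) and assumes A4 sheets of 80 g/m² paper, which the slide does not state.

```python
# Inputs taken from the slides
reads_per_run = 10_000_000_000    # 10 billion reads
read_length_bp = 150
lines_per_read = 6
lines_per_page = 55

# Assumptions NOT stated on the slides: A4 sheets of 80 g/m2 paper
A4_LENGTH_M = 0.297
A4_WIDTH_M = 0.210
PAPER_G_PER_M2 = 80

total_bp = reads_per_run * read_length_bp
pages = round(reads_per_run * lines_per_read / lines_per_page)
length_km = pages * A4_LENGTH_M / 1000                      # sheets laid end to end
mass_tonnes = pages * A4_LENGTH_M * A4_WIDTH_M * PAPER_G_PER_M2 / 1e6

print(total_bp / 1e12)     # 1.5 (Tb), the upper end of the quoted output per run
print(pages)               # 1090909091
print(round(length_km))    # 324000
print(mass_tonnes > 5000)  # True
```

Under these assumptions the page count and the end-to-end length match the slide, and the paper mass comes out a little above 5,000 tonnes.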
• 10 billion reads of 150 bp
• 6 lines per read
• 55 lines per page (font 11)
• 1 090 909 091 pages
• 324 000 km long
• 122.4 km high
• more than 5,000 tons of paper

MiSeq
• Company: Illumina®
• Website: www.illumina.com
• Fragment length: 300 bases (2x300 paired-ends)
• Number of reads per run: 2 × 25 × 10^6
• Total output per run: 14 Gb
• Time per run: 27 hours

MiniSeq
• Company: Illumina®
• Website: www.illumina.com
• Fragment length: 150 bases (2x150 paired-ends)
• Number of reads per run: 2 × 25 × 10^6
• Total output per run: 7.5 Gb
• Time per run: 24 hours

MinION
• Company: Oxford Nanopore Technologies Ltd
• Website: www.nanoporetech.com
• Fragment length: ultra long reads (up to 300 kb)
• Total output: 6 Gb per day
Standard barcodes: COI, rbcL, matK
• Advantages
  – Standard reference libraries can be used
  – High taxonomic resolution
• Drawbacks
  – Primers for standard barcodes are designed on protein-coding genes, so their target sites cannot be highly conserved (the third nucleotide of each codon is variable)
  – Too long for use with degraded environmental DNA

New barcodes for analyzing environmental DNA
• Very short marker (usually less than 100 bp)
• Highly conserved primers to equally amplify the different target sequences
• Problem of the taxonomic resolution when using very short barcodes

The mirage of standard "minibarcodes"
• Hajibabaei M, Smith MA, Janzen DH, Rodriguez JJ, Whitfield JB, Hebert PDN (2006) A minimalist barcode can identify a specimen whose DNA is degraded. Molecular Ecology Notes, 6, 959-964.
• Meusnier I, Singer GAC, Landry JF, Hickey DA, Hebert PDN, Hajibabaei M (2008) A universal DNA mini-barcode for biodiversity analysis. BMC Genomics, 9, 214.
• Hajibabaei M, Shokralla S, Zhou X, Singer GAC, Baird DJ (2011) Environmental barcoding: a next-generation sequencing approach for biomonitoring applications using river benthos. PLoS ONE, 6, e17497.
• Hajibabaei M, Spall JL, Shokralla S, van Konynenburg S (2012) Assessing biodiversity of a freshwater benthic macroinvertebrate community through non-destructive environmental barcoding of DNA from preservative ethanol. BMC Ecology, 12, 28.

The mirage of standard "minibarcodes"

[Figure: COI, Metazoa, 7 errors; axes: forward errors (0-8) versus reverse errors (0-8). Meusnier et al. (2008) BMC Genomics, 9, 214.]
The mirage of standard "minibarcodes"

[Figure, two panels; axes: forward errors versus reverse errors. Left: COI, Metazoa, 7 errors (axes 0-8), Meusnier et al. (2008) BMC Genomics, 9, 214. Right: 18S, Eukaryota, 5 errors (axes 0-5), Hardy et al. (2010) Molecular Ecology, 19, 197-212.]
The mirage of standard "minibarcodes"

[Figure, two panels; axes: forward errors versus reverse errors. Left: COI, Arthropoda, 7 errors (axes 0-8), Zeale et al. (2011) Molecular Ecology Resources, 11, 236-244. Right: 16S, Insecta, 5 errors (axes 0-5), unpublished.]
The chloroplast trnL(UAA) intron
The g/h primers target highly conserved regions
[Figure: cpDNA trnL (UAA) intron, Spermatophyta, g/h primers, 3 errors; axes: forward errors versus reverse errors (0.0-3.0)]

The ideal metabarcode
• The primer targets must be perfectly conserved (no mismatch at all)
• Must be short, but highly informative
• Must amplify all the target taxonomic group, but not the other groups
• The reference database must be comprehensive
• Unfortunately, such an ideal marker does not exist

[Figure: target taxonomic group; axes: forward errors versus reverse errors (0.0-3.0)]

Bioinformatic tools for designing new markers for DNA metabarcoding
http://metabarcoding.org/obitools (Riaz et al. 2011, NAR)

Look for conserved regions that flank variable regions
many whole genome sequences

ecoPCR: fully integrated taxonomy
ecoPCR -d ebpln96 -l50 -L150 -k -e3 GGGCAATCCTGAGCCAA CCATTGAGTCTCTGCACCTATC
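Conceptually, a command like this performs an in silico PCR: it looks for sites that match both primers within a maximum number of mismatches, then keeps the amplicons whose length falls in a given range. The sketch below illustrates that idea only; it is not ecoPCR's implementation. The function names are hypothetical, the length is counted including primers (an assumption), and the toy template is built from the two primers of the command and a 32 bp barcode.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def primer_sites(template, primer, max_errors):
    """Yield (position, errors) for every site matching the primer within max_errors."""
    k = len(primer)
    for i in range(len(template) - k + 1):
        errors = mismatches(template[i:i + k], primer)
        if errors <= max_errors:
            yield i, errors

def in_silico_pcr(template, forward, reverse, max_errors=3, min_len=50, max_len=150):
    """Return (barcode, forward_errors, reverse_errors) for each amplicon found on
    the direct strand. Amplicon length is counted INCLUDING primers (assumption)."""
    hits = []
    reverse_site = revcomp(reverse)   # how the reverse primer appears on the direct strand
    for f_pos, f_err in primer_sites(template, forward, max_errors):
        barcode_start = f_pos + len(forward)
        for r_pos, r_err in primer_sites(template, reverse_site, max_errors):
            amplicon_len = r_pos + len(reverse) - f_pos
            if r_pos >= barcode_start and min_len <= amplicon_len <= max_len:
                hits.append((template[barcode_start:r_pos], f_err, r_err))
    return hits

# Toy template: forward primer + barcode + reverse-complemented reverse primer
forward = "GGGCAATCCTGAGCCAA"
reverse = "CCATTGAGTCTCTGCACCTATC"
barcode = "CTCCTGCTTTCCAAAAATAAGCATAAAAAAGG"
template = forward + barcode + revcomp(reverse)
print(in_silico_pcr(template, forward, reverse))
```

On this toy template the 32 bp barcode plus the two primers (17 and 22 bp) gives a 71 bp amplicon, inside the 50-150 bp window used in the command.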
#
# ecoPCR version 0.1
# direct strand oligo1 : GGGCAATCCTGAGCCAA ; oligo2c : GATAGGTGCAGAGACTCAATGG
# reverse strand oligo2 : CCATTGAGTCTCTGCACCTATC ; oligo1c : TTGGCTCAGGATTGCCC
# max error count by oligonucleotide : 3
# database : arctic_01_02_2008
# amplifiat length between [5,300] bp
# output in kingdom mode
#
0240g | 495 | 10000726 | subspecies | 282718 | Achillea alpina | 13328 | Achillea | 4210 | Asteraceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATCGAGTCTCTGCACCTATC | 1 | 90 | ATCACGTTTTCCGAAAACAAACAAAGGTTCAGAAAGCGAAAAGAAAAAAAA |
1043o | 496 | 10000724 | subspecies | 282718 | Achillea alpina | 13328 | Achillea | 4210 | Asteraceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATCGAGTCTCTGCACCTATC | 1 | 90 | ATCACGTTTTCCGAAAACAAACAAAGGTTCAGAAAGCGAAAAGAAAAAAAA |
0239g | 495 | 10000001 | subspecies | 13329 | Achillea millefolium | 13328 | Achillea | 4210 | Asteraceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATCGAGTCTCTGCACCTATC | 1 | 90 | ATCACGTTTTCCGAAAACAAACAAAGGTTCAGAAAGCGAAAAGAAAAAAAA |
1042o | 496 | 10000725 | subspecies | 13329 | Achillea millefolium | 13328 | Achillea | 4210 | Asteraceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATCGAGTCTCTGCACCTATC | 1 | 90 | ATCACGTTTTCCGAAAACAAACAAAGGTTAAGAAAGCGAAAAGAAAAAAAA |
0722g | 567 | 10000639 | subspecies | 10000110 | Aconitum delphinifolium | 49188 | Aconitum | 3440 | Ranunculaceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATTGAGTCTCTGCACCTATC | 0 | 101 | ATCCTGTTTTTATAAAACAAATCAAAATCAAATAAAGGGTTCAGAAAGCAAGAATAAAAAAG |
1474o | 567 | 10000639 | subspecies | 10000110 | Aconitum delphinifolium | 49188 | Aconitum | 3440 | Ranunculaceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATTGAGTCTCTGCACCTATC | 0 | 101 | ATCCTGTTTTTATAAAACAAATCAAAATCAAATAAAGGGTTCAGAAAGCAAGAATAAAAAAG |
0818g | 567 | 10000002 | subspecies | 112589 | Aconitum lycoctonum | 49188 | Aconitum | 3440 | Ranunculaceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATCGAGTCTCTGCACCTATC | 1 | 101 | ATCCTGTTTTTAGAAAACAAATCAAAATCAAATAAAGGGTTCAGAAAGCAAGAATAAAAAAG |
1649o | 567 | 10000002 | subspecies | 112589 | Aconitum lycoctonum | 49188 | Aconitum | 3440 | Ranunculaceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATCGAGTCTCTGCACCTATC | 1 | 101 | ATCCTGTTTTTAGAAAACAAATCAAAATCAAATAAAGGGTTCAGAAAGCAAGAATAAAAAAG |
0282g | 464 | 10000111 | species | 10000111 | Aconogonon alaskanum | 106214 | Aconogonon | 3615 | Polygonaceae | 33090 | Viridiplantae | D | GGGCAATCCTGAGCCAA | 0 | CCATTGAGTCTCTGCACCTATC | 0 | 71 | CTCCTGCTTTCCAAAAATAAGCATAAAAAAGG |

Bioinformatic tools for analyzing the output of next generation sequencers
http://metabarcoding.org/obitools

Example of raw output (HiSeq 2000): fastq format

Main steps of a standard analysis
• illuminapairedend: assembling forward and reverse reads
• ngsfilter: identifying tags and primers
• obiuniq: dereplicating all barcode sequences, keeping the information on the samples
• obigrep: different filtering steps
• ecotag: taxonomic assignment
• obitab: production of a text file that can be opened in Excel or R
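To make the dereplication and filtering steps more concrete, here is a minimal Python sketch of the underlying idea: identical sequences are collapsed into one record that keeps a read count per sample, and rare sequences can then be filtered out. It illustrates the concept only and does not reproduce the OBITools programs; the function names, sample names, and toy sequences are hypothetical.

```python
from collections import Counter, defaultdict

def dereplicate(reads):
    """Collapse identical sequences, keeping the number of reads per sample.

    reads: iterable of (sample_id, sequence) pairs.
    Returns {sequence: Counter({sample_id: n_reads})}.
    """
    table = defaultdict(Counter)
    for sample, seq in reads:
        table[seq.lower()][sample] += 1
    return table

def filter_min_count(table, min_count):
    """Keep only sequences whose total read count is at least min_count."""
    return {seq: per_sample for seq, per_sample in table.items()
            if sum(per_sample.values()) >= min_count}

# Toy data: (sample, sequence) pairs, as they might look after tags and primers
# have been identified and removed
reads = [
    ("sample_1", "ACGTTA"),
    ("sample_1", "ACGTTA"),
    ("sample_2", "ACGTTA"),
    ("sample_2", "ACGCTA"),
]

unique = dereplicate(reads)
for seq, per_sample in unique.items():
    print(seq, sum(per_sample.values()), dict(per_sample))

print(list(filter_min_count(unique, min_count=2)))   # ['acgtta']
```

Keeping the per-sample counts at this stage is what later allows a sequence-by-sample table to be produced after taxonomic assignment.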
Valentini et al. (2009) Molecular Ecology Resources, 9, 51-60.
Above ground botanical surveys versus DNA metabarcoding
[Figure: above ground analysis versus DNA-based soil analysis. Taxa labelled: Avenella flexuosa, Bistorta vivipara, Poa sp., Salix sp., Taraxacum sp., Anthoxanthum nipponicum, Carex sp., Alchemilla sp., Viola biflora, Festuca sp., Equisetum sp., Deschampsia sp., Rumex sp., Calamagrostis sp. © Kari Anne Bråthen]

How long does a DNA molecule persist in soil?
© Serge Aubert/SAJF

Results
rank scientific_name count sequence
family Apiaceae 210517 atcctattttccaaaaacaaacaaaggcccagaaggtgaaaaaag
subfamily Pooideae 151799 atccgtgttttgagaaaacaagggggttctcgaactagaatacaaaggaaaag
order Asterales 137806 atcacgttttccgaaaacaaacaaaggttcagaaagcgaaaatcaaaaag
tribe Poeae 132486 atccgtgttttgagaaaacaaggaggttctcgaactagaatacaaaggaaaag
family Asteraceae 122047 atcacgttttccgaaaacaaacaaaggttcagaaagcgaaaataaaaaag
genus Leontodon 109979 atcacgttttccgaaaacaaacgaaggttcagaaagcgaaaataaaaaag
family Apiaceae 83485 atcctattttccaaaaacaaacaaaggcctagaaggtgaaaaaag
species Lathyrus pratensis 70936 atccttctttccgaaaacaaacaaataaaagttcagaaactgaaaatcaaaaaag
tribe Hedysareae 69369 atcctgaaacaaataaaagttcagaaagtgaaaataaaaaaag
subfamily Asteroideae 62708 atcacgttttccgaaaacaaacaaaggttcagaaagcgaaaagaaaaaaaa
species Vicia cracca 42598 atccttaagttaaaatcaaaaaag
… … … …
species Secale cereale 7476 (0.3%) atccgtgttttgagaaaacaaggggttctcgaactagaatacaaaggaaaag
species Solanum tuberosum 2185 (0.09%) atcctgttttctgaaaacaaacaaaggttcagaaaaaaag
species Hordeum vulgare 605 (0.02%) atccgtgttttgagaagggattctcgaactagaatacaaaggaaaag

How long does a DNA molecule persist in soil?
© Serge Aubert/SAJF © Sébastien De Danieli

Earthworms from soil DNA: results

Number of sequence reads per plot
Species | Barcode | Chartreuse plot 1 | Chartreuse plot 2 | Grenoble plot 1 | Grenoble plot 2
Aporrectodea icterica | catcttaatgaagactaaaacttcactaaa | 836954 | 649677 | 834031 | 1359355
Aporrectodea longa | tattttaacaaaaacccaaaaattttcaataaa | 2 | 6 | 244463 | 271829
Aporrectodea sp | cattttaataaaaattataaattttactaaa | 0 | 0 | 236024 | 236678
Octolasion cyaneum | cattttaatagaagcttactattctaataaa | 468462 | 3823 | 0 | 2
Lumbricus terrestris | aatttaaataaatataaaaaatttactaaa | 0 | 0 | 174286 | 143682
Octolasion tyrtaeum | cattttaatagaaaaataatatcctaataaa | 306476 | 0 | 0 | 2
Lumbricus castaneus | aatttaaataaatataaaaaaatttactaaa | 0 | 0 | 56 | 131001
Aporrectodea longa | tattttaacaaaacccaaaaattttcaataaa | 2469 | 105312 | 159 | 145
Allolobophora chlorotica | cattttaataaagatataaactttactaaa | 0 | 0 | 51953 | 43196
Aporrectodea caliginosa | tattttaataaaaaaatataaatttttaataa | 0 | 23005 | 0 | 0
Bienert R, de Danieli S, Miquel C, Coissac E, Poillot C, Brun JJ, Taberlet P (2012) Tracking earthworm communities from soil DNA. Molecular Ecology, 21, 2017-2030.
Experiment in tropical environment: H20 plot of the Nouragues Field Station (French Guiana)

Sampling strategy (sampling on a grid)

[Figure: sampling grid with a 100 m scale; Plot H20 of the Nouragues Field Station (CNRS, French Guiana)]
Metabarcoding markers
• Eukaryotes: 18S (short fragment: ~110 bp)
• Archaea: 16S (~100 bp)
• Bacteria: 16S (~250 bp)
• Plants
  – P6 loop of the chloroplast trnL intron (10-100 bp)
  – ITS1 (nuclear ribosomal DNA: ~250 bp)
• Fungi: ITS1 (100-200 bp)
• Termites: mtDNA 12S (~70 bp)

H20 plot: plant results (trnL intron)
Taxonomic identification: Viridiplantae, Moraceae, Bagassa guianensis
Sequence: atccggttttctgaaaacaaaacaaacaagggttcagaaagcgataataaaaaag
Best homology with the reference database: 1.00

Ilex sp. (Viridiplantae, Aquifoliaceae) (3 markers)
[Figure: log10 % reads; panels: trnL P6 loop, ITS1, 18S rRNA gene]

Cyrilliotermes angulariceps (Metazoa, Termitidae) (mitochondrial 12S rRNA gene)
[Figure: log10 % reads]

18S (eukaryotes): results
Kingdom | nb of MOTUs | nb of reads | % reads
Metazoa | 622 | 1265596 | 46.88
Fungi | 338 | 1104933 | 40.93
Viridiplantae | 99 | 169019 | 6.26
None | 279 | 160336 | 5.94

18S (eukaryotes): results
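The % reads column of the kingdom-level 18S table is each read count divided by the total number of reads across the four rows. A minimal sketch that recomputes it from the read counts on the slide (variable names are illustrative):

```python
# Read counts per kingdom, copied from the 18S kingdom-level table
reads_by_kingdom = {
    "Metazoa": 1265596,
    "Fungi": 1104933,
    "Viridiplantae": 169019,
    "None": 160336,
}

total_reads = sum(reads_by_kingdom.values())
percent_reads = {kingdom: round(100 * n / total_reads, 2)
                 for kingdom, n in reads_by_kingdom.items()}

print(total_reads)     # 2699884
print(percent_reads)   # {'Metazoa': 46.88, 'Fungi': 40.93, 'Viridiplantae': 6.26, 'None': 5.94}
```

The recomputed percentages agree with the table to two decimals.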
Phylum | nb of MOTUs | nb of reads | % reads
Annelida | 315 | 743877 | 27.56
None | 622 | 546992 | 20.27
Basidiomycota | 338 | 493122 | 18.27
Arthropoda | 279 | 356162 | 13.20
Ascomycota | 197 | 228651 | 8.47
Streptophyta | 98 | 168843 | 6.26
Platyhelminthes | 45 | 52102 | 1.93
Nematoda | 42 | 29312 | 1.09
Chordata | 7 | 26614 | 0.99
Blastocladiomycota | 5 | 25159 | 0.93
Glomeromycota | 49 | 18650 | 0.69
Rotifera | 14 | 4971 | 0.18
Gastrotricha | 6 | 3228 | 0.12
Chytridiomycota | 6 | 1207 | 0.04

18S (eukaryotes): results
[Figure: Arthropoda]

Biodiversity assessment of Batrachia and Teleostei using water samples
• Tested on 39 water bodies for Batrachia
• Tested on 22 very diverse water bodies for Teleostei (including freshwater and marine environments)
[Figure: Batrachia, eDNA versus traditional approach; axis: detectability (0.0 to 1.0). Anura: Alytes obstetricans, Bufo bufo, Bufo calamita, Discoglossus pictus, Hyla meridionalis, Pelobates cultripes, Pelodytes punctatus, Pelophylax sp., Rana dalmatina, Rana temporaria. Caudata: Lissotriton helveticus, Salamandra salamandra, Triturus marmoratus. Totals shown for all species: 90 and 151.]

[Figure: Teleostei, eDNA, total, and traditional approach; axis: 0 to 25. Sites by category: control (1-4), pond (5-8), ditch (9-12), lake (13), stream (14-17), river (18-21).]
Sediments from Anterne Lake
• Analysis of 47 slices of a 20 m lake core representing the past 10000 years
• DNA markers
  – P6 loop of the trnL intron (chloroplast DNA) for plants
  – short 16S fragment (mitochondrial DNA) for mammals
• 2 DNA extractions per slice
• 4 PCRs per extraction
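The replication scheme above fixes the number of PCRs to run. A minimal sketch of that count follows. It assumes that the 4 PCRs per extraction apply to each marker separately, which the slide does not state, and it ignores extraction and PCR controls.

```python
slices = 47
extractions_per_slice = 2
pcrs_per_extraction = 4
markers = 2   # trnL P6 loop (plants) and short 16S fragment (mammals)

# Assumption: 4 PCRs per extraction for EACH marker; controls not counted
pcrs_per_marker = slices * extractions_per_slice * pcrs_per_extraction
total_pcrs = pcrs_per_marker * markers

print(pcrs_per_marker)   # 376
print(total_pcrs)        # 752
```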
Reconstructing past plant communities from permafrost samples (European project "EcoChange")

Permafrost sampling (1)

Experimental protocol
• Sampling of 242 permafrost samples from 21 localities (+ 8 megafauna coprolites or gut content)
• DNA extractions
• DNA amplifications (trnL P6 loop [5 replicates], + short ITS1 fragment for Poaceae, Asteraceae, and Cyperaceae)
• DNA sequencing on Illumina GA IIx sequencing platform
• Sequence analysis using the OBITools (http://metabarcoding.org/obitools)

Reconstructing past plant communities from permafrost samples
[Figure, panel a: proportion (0.0 to 0.8) of forbs, graminoids, dwarf shrubs, and trees and shrubs; values along the axis: 7, 26, 11, 62, 31, 28, 27, 6, 6, 17, 12]