The “MetaCopepod” project: Designing an integrated DNA metabarcoding and image analysis approach to study and monitor biodiversity of zooplanktonic .

PanagiotisPanagiotis KasapidisKasapidis

HellenicHellenic CentreCentre forfor MarineMarine ResearchResearch (HCMR),(HCMR), InstituteInstitute ofof MarineMarine Biology,Biology, BiotechnologyBiotechnology andand AquacultureAquaculture P.O.Box 2214, 71003 Heraklion Crete, Greece

1 The “MetaCopepod” project

Aim: to develop a novel methodology, based on the combination of DNA metabarcoding and image analysis, to assess and monitor the diversity of marine zooplanktonic copepods (and cladocera), in the Mediterranean and the Black Sea, in a high-throughput, cost- effective, accurate and quantitative way.

Coordinator: Dr. Panagiotis Kasapidis, Hellenic Centre for Marine Research (HCMR), GREECE

Study area: Mediterranean and the Black Sea

Duration: Feb. 2014 – Oct. 2015

Budget: 180,000 euros

Funding: European Social Fund (ESF) and National Funds through the National Strategic Reference Framework (NSRF) 2007-2013, Operational Programme "Education and Life-Long Learning", Action "ARISTEIA II", Greek Ministry of Education and Religious Affairs, General Secretary of Research and Technology.

2 Studying diversity: limitations of traditional approaches

● Quite laborious (sorting, identification under stereoscope) → bottleneck in sample processing.

● Requires local taxonomic expertise

● Difficult to identify immature stages

● Misidentifications

● Cryptic

3 Image analysis

+ Pros

 High throughput

 Quantitative results (abundance, size spectra, biomass)

- Cons

 Low resolution → can recognize taxa similar to the ability of a trained taxonomists to identify a taxon under stereoscope at a glance

 “train” image analysis software separately for different types of zooplankton communities (not once for all).

4 DNA metabarcoding (NGS analysis) for studying marine biodiversity

+ Pros:

 Faster processing of the samples

 Potentially higher accuracy in species' identification (even for difficult taxa, immature stages, cryptic species) bulk sample  No need of expert but requires a well curated and complete reference genetic database

- Cons:

 Results are semiquantitative due to the PCR NGS  Biases in species' relative abundance mostly due to PCR amplification bias A A A A T G G G G G T T T A T

The “MetaCopepod” project aims to combine the Pros of T T T T T C C C C C T both methods in order to increase accuracy in assessing A A T A zooplankton diversity. 5 Workflow of the “MetaCopepod” project

WP7: Dissemination

WP1 sampling Sampling and imaging

morphological identification of pseudo-samples bulk samples image acquisition of bulk species samples

image acquisition of training of image WP2 individuals image analysis analysis software Image analysis

DNA extraction WP3 WP4 DNA extraction

in silico primer design DNA amplification with PCR amplification for standard COI barcode and testing DNA barcoding­ and selected barcode selected barcode primers DNA Metabarcoding genetic reference DB

Sanger sequencing High throughput next generation DNA sequencing

online management system phylogenetic analysis local reference database raw sequencing data

WP5 Bioinformatic Construction of Bioinformatic analysis DNA based species comparison with image pipeline bioinformatic pipeline identification analysis data

WP6: Demonstration of methodology (Copepod biomonitoring) 6 Sampling

Napoli Saronikos

Annaba Cretan Sea

Partner Network and sampling stations. Monthly sampling ( ), non-regular sampling ( )

7 Samples

 108 zooplankton samples collected and preserved in 95% EtOH

 80 zooplankton samples were scanned for image analysis and then used for DNA metabarcoding

 97 copepod species were morphologically identified ( >350 specimen collections from different zooplankton samples)

For standardizing the methodology:

 Five of the zooplankton samples were taxonomically identified by taxonomist

 Six pseudosamples (mock samples) created by mixing taxonomically identified specimens at known numbers (6-16 species, 1-3 individuals per species)

8 Image analysis (Epson scanner similar to Zooscan and ZooImage software) For “training” the image analysis software we used:

 scanned images of taxonomically identified taxa (“gold” standards)

 taxonomically identified taxa from scanned zooplankton images (“silver” standards)

The image analysis software was evaluated by:

 self-evaluation

 using pseudosamples and taxonomically identified samples

9 Self evaluation of the Image Analysis software

Copepods

Cladocera

Bad focus or multiple

Other zooplankton

Other (artifacts)

Mostly correct identification, but confusion in certain taxa (e.g. Clausocalanus and Ctenocalanus)

10 Evaluation of the Image Analysis software with taxonomically identified samples

SG1-0214_1 SG1-0214_2 TAX IA TAX IA s l a u d i v i d n i

f SG1-0214 o

r e b

m TAX IA u N

● Three samples from Saronikos Gulf analyzed both by a taxonomist and by IA

● IA software performs well both for taxa recognition and abundance estimation 11 Larger Categories for Image categories Analysis Taxa in each category Calanus helgolandicus Calanus_helgolandicus Candacia Candacia­Paracandacia Paracandacia simplex Big Euchirella Euchirella Nannocalanus minor Nannocalanus minor Neocalanus Neocalanus Pleuromamma Pleuromamma Acartia Paracartia grani Centropagidae Centropages Medium Calanoida Categories identified with accuracy by Image analysis Temora Aetidus Lucicutia (standardized for Saronikos Gulf) Clausocalanus Paracalanus Ctenocalanus vanus Small Calanoida Small Calanoida Phaenna spinifera Scolecithricella & Scolecithrix Isias clavipes Mecynocera clausi Oithona Oithona Euterpina acutifrons Euterpina acutifrons Harpacticoida Clytemnestra Clytemnestra other Harpacticoida n.d. Micro & Macrosettela Micro & Macrosettela Coryceaus Coryceaidae Farranula rostrata Triconia Ctenopoda Penilia_avirostris Penilia avirostris Evadne Pseudevadne Onychopoda Onychopoda Podon Pleopis polyphemoides 12 Critical factors for DNA metabarcoding

 Primers

 designed on conserved regions across target taxa→ reduce amplification bias  amplify a variable region → high resolution, ideally at species' level  amplify a short region→ easier PCR amplification even for degraded samples

 Genetic reference database – well curated (no errors or misidentifications) – as complete as possible

13 Primer design and reference database

Primers were designed (ecoPrimer software), which amplify a region of ~150 bp of the mitochondrial 16S rRNA gene.

Reference database (GenBank largely incomplete for 16S rRNA) Out of the 97 taxa taxonomically identified

 93 lacked 16S sequences and 36 lacked COI sequences in GenBank

We performed DNA barcoding for both COI and 16S genes  50 new additions of 16S barcode (55 species sequenced in total)  8 new additions for standard COI barcode (48 species sequenced in total)

Sapphirina opalina 14 Evaluation of the 16S barcode to discriminate species

CopF barcode (~120 bp) CopR-Un

Alignment of 16S copepod sequences from NCBI and from MetaCopepod (produced with universal primers)

15 Median-Joining network for the 16S rRNA barcode for copepod species 16 DNA metabarcoding of zooplankton samples

 Unsorted zooplankton samples, taxonomically identified samples and pseudosamples → DNA extraction, PCR and sequencing on a MiSeq Illumina platform.

 Raw sequencing data analyzed using a bioinformatic pipeline constructed for the project.

 Sequences assigned to taxa using the reference database

17 Pseudosamples: comparison between genetic data, % abundance (%N) and % biomass

0.9

e l e 0.8 p c %N

n 0.7 m a a 0.6 %Biomass d s

n 0.5 o %Genetic u d 0.4 b u

a 0.3 e

s

e 0.2 p v

i 0.1 t e a h

l 0 t 1 5 2 2 3 2 4 1 3 1 2 4 6 1 2 3 6 1 2 3 5 2 6 2 3 2 3 1 2 1 2 5 6 2 1 4 2 2 5 4 2 4 6 1 3 4 6 2 5 6 4 5

e r S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S S n i P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P ------i i i i i i i i i r r s s s s s s s a a s a s s a a s s s s a x s s s s s a a s s s a s s s s s s s a a s i e n i i i i i o o o i i i s s s i t t r r t t o o r l l l l l l l r u u u e u u u u u u u u u n u u u n s u u u u u i i i n d d i u u u u l n b l n r r r i e t t t a a a a n n t t e e a c c c n c m o c c c r d d o r i i n v v v t r r r r f f f c c c a a i i i p i i i i o i i u e e a a a d a a a a a a a s i i r r r r i t t t t j l l l l l n t o a o a s l l l e n f a a i a v p d d i p p p v c c c e c c c i m d a o r m a a s s s s c i i u r r i i i r c c c n v m m y y i m c c m n e y r r r r t l l y n n i y i s i l p y y u i t t p p p t o o o o a g g d d d m g t o t d m g t t u u u u s s r r r r a a a s u a a a y s u g s r a a n s v r r s s s

f f f i l l s s p o o o o i i i i p t c u u s s s

t c n n s s t t t a t t t u o e e a e s s s s e u e u a a o o b u d r a r t t t r r r s a a a n n u u u t t a s s s l l a s i a a o u u u s e e n n a a d n e r r r g g l n n l l l b o a a a a a o o m a t a a n n n n l l u u u u l l u c u n n a e a c c i i s g g g l l a a e o o o a i a a a c l c c c l l c c c e s e a a a e e n n n l l e e n n a i n s a a g a a s a a n n l m u a l l n n c r r l l c a s s s a g n u n i a a a a a a m m A A A A h h a i i m d a u n a a n p p p u c c a a a a l l l d o l r r a l a a c O O c d n o p e c c h h i e e r r e A n p o n o o a a a a o o c c c a a a r a s s s a t c c d o l a m a a m d T T T r r r o o c c a a a o l a n n e e e r a a a t t t c c c e P v u u u n c p o o l n a r v o r r r o s s u u t F F n n a a c a a m t o o o C n n n r e n n a o o e e a e c a a a a l u u u l E P t n c E E a a c c a n e s s s u d e e a a s P c o C N N l l a a a l P P P C E n n a e u u u l l e u u C N N C C C o O r m l a a a H a a a e a C s C C O O l l l a o l P C C C r s u C C C P C u a P l e l C P

 Pseudosamples contained 6-16 species of 1-3 individuals each.

 All species detected by genetic analysis (even as low as 0.12% of the total sample biomass), except for the Corycaeidae [Farranula rostrata (max 6.5% biomass) and typicus (2.5% biomass)].

 For few species (e.g. Calanus helgolandicus and Euterpina acutifrons), large discrepancies between genetic and actual data. Systematic? Due to technical handling?→ more checks necessary.

18 Taxonomically identified samples: Comparison between actual counts (blue line) and genetic data (yellow line) 0.35

0.3 sample SG1-0414 Tax-SG1-0414-C 0.25 Tax-SG1-0414-B 0.2 SG1-0414-G

0.15

0.1

0.05

0

0.6

0.5 sample SG1-0214 Tax-SG1-0214-C 0.4 Tax-SG1-0214-B 0.3 Tax-SG1-0214-G 0.2

0.1

0 i i i . . . r . . . . ) . . .

i s s a x s e v i s s v 0 3 s a v s p v 1 3 1 a a t 1 a s 2 4 8 s s s

r i i v v v v v v n i e s s s s o s s p s M s s s M p r r s s s t M a a M p p . l l s u s a e u u p n n h i i p i e u u i u s u 1 p p p p n e p p p r u a M a o r i u u u l u j j u j j u a i i u p g u u u u v u a o n u u u t p e p e t p t U a j U u j j j U j U j e n n s o c n s u p s i i p s s s s t c s e i s s s o n t n t e f f c l p i U i d r c r n a m c c r n c s i i n i n r s s s a p f n l a o n i s g e i t c s a a a i s d s a m i n S S a S c m i S f i i t l a i s l i a a i l a v l a s e a s e n r o a o u p e o m l a a a a i n e a o s s - - - S r - r l i l d m n p t u d o i u r i e n a t d a m a t a r s d t a a r p - n c u p d u u i l c a i c c i c m n a d u a a d t b v u e y d d d i r i r a g c m g i l e

i l y a A E i g i n o o S e i u n i i i i p A p d p m p s o u u r t u n a e t s n n s s i o I d r s n t g d A i k n d s o v r I s i o o a e a i c c g a u m y p a v g c a / i o e s b i i o s v o o o s o s e u n m n s c r - u U A p d i i n a N a a l l a i l e t r a e n e a i c a e T t u i

n l b d d n E r l l t l v u l i n n a a h a o o n n n n p a p l n n s a a e u t u a a t t o 1 n o l i l s r c a D s s N l i e h a o n e

n c a O b a s a h e a a . n e o e c f i U m d n a o h a o g p a a a a e I n a t d n A e a n e a l l c r p a t t l a a l g c s u l l l l a t t l e a t c n r b n p i i l e a s l a m a i l c p i a A a u i i

c u o a c H c l C s g a m n a l u C a o t a L l a E C V n n c c e a a a i u m l d s l a g c a o p n o b s u a o s e n d n c I d e s o s L s c r s s i o o u a s a C c T i r e d n E r l a o d g a I o y n l c a u c n a a m C e a u a u t i a u A c d A n c u C C C C + I N t n b h a e u m u i h a u i u n p C e i c A p a o l a u l s n a a e e c o n p u c y c a L v a o n C t r P P C y a a r d a i n E e S C c l n c m s p n a o m o l l O O l v d e v o U t s u a p C d C l o c c a h o a u h v C r l a T e e P n h E a a u i a C i a a l a n a a r o l r o t Y t A l u t c p u l n L r C E o a m t n l L e C t r j i e a O P L P u H c a r c a n c a C n a u a R o o n a C P a c a S e e t u c o n l l r c u q o C S M c t a e c o S a c P c o l n ( E e e e d O A a l a l L e o m l l a n o o C N U l o a C t e u l r a l c a C H o s C o e P a e s r a e a A a t C o C c C u l C e u s C L H C P C v C a S e r - C P l M C a a r P p a s P - u o n s a l u a a l c a C r

a 19 P MDS for taxonomically identified samples: Comparison between genetic data (-G) and actual counts (-C)

replicates genetic data

actual data (by taxonomist)

The observed shift between the two data sets is mainly due to the inability to match exactly some taxa categories (e.g. Calocalanus juv. and Calocalanus sp. identified by the taxonomist to which of the 7 Calocalanus species identified by DNA metabarcoding correspond to ?) 20 Conclusions from standardization of DNA metabarcoding

High repeatability (sub-samples of the same sample give very similar results)

Generally good concordance between morphological identification and NGS analysis but, Corycaeidae almost absent in DNA metabarcoding analysis (both in pseudo and taxonomically identified samples), although some species present in the reference DB and are individually amplified in PCR with the 16S primers. redesign primers? problems in DNA extraction?

Some taxa are systematically over or under-represented in DNA metabarcoding analysis → image analysis can greatly assist reducing the bias

21 Relative abundance (species level) of Sapphirina sp1 Sapphirina opalina Sapphirina metallina Oncaea sp9 copepoda & cladocera in monitoring stations Oncaea sp8 Oncaea sp7 Oncaea sp6 Oncaea sp5 Oncaea sp4 of the Mediterranean (genetic data Oncaea sp3 ) Oncaea sp2 Oncaea sp1 Oncaea scottodicarloi Oncaea mediterranea Oncaea media Agetus limbatus Euterpina acutifrons Oithona tenuis Oithona sp8 Oithona sp7 Oithona sp6 Oithona sp5 Oithona sp4 Oithona sp3 Oithona sp2 Oithona sp1 Oithona similis sp3 Oithona similis sp2 Oithona similis sp1 Oithona similis Oithona plumifera Temora stylifera Paracalanus sp4 Paracalanus sp3 Paracalanus sp2 Paracalanus sp1 Paracalanus quasimodo Paracalanus parvus Paracalanus nanus Paracalanus indicus Paracalanus denudatus Other Pleuromamma sp1 Pleuromamma gracilis Pleuromamma borealis Pleuromamma abdominalis Mecynocera sp2 Mecynocera clausi Lucicutia flavicornis Euchaeta acuta Other Pseudocalanus sp2 Pseudocalanus sp1 Clausocalanus sp3 Clausocalanus sp2 Clausocalanus sp1 Clausocalanus pergens Clausocalanus paululus Clausocalanus mastigophorus Clausocalanus lividus Clausocalanus jobei Clausocalanus furcatus Clausocalanus arcuicornis Centropages violaceus sp1 Centropages violaceus Centropages sp1 Centropages ponticus Centropages kroyeri Centropages chierchiae Candacia sp1 Candacia simplex Candacia bispinosa Candacia armata Calocalanus styliremis Calocalanus sp3 Calocalanus sp2 Calocalanus sp1 Calocalanus pavoninus Calocalanus pavo sp2 Calocalanus pavo Calocalanus neptunus Neocalanus gracilis Nannocalanus sp2 Nannocalanus sp1 Nannocalanus minor Type II Nannocalanus minor Type I Mesocalanus tenuicornis Mesocalanus sp1 Ctenocalanus vanus 4 4 4 4 4 4 4 5 4 4 4 4 4 5 5 4 4 4 4 5 5 4 4 4 4 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 4 4

2 3 4 4 Calanus helgolandicus 4 x 4 4 4 4 4 4 5 4 4 4 4 4 4 4 4 4 4 x 2 2 2 2 2 2 p F F F 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 _ _ 1 _ 1 1 a 1 1 1 1 1 1 1 1 1 1 1 1 1 1 _ _ 1 1 1 1 1 1 1 a x Haloptilus longicornis e p p t p t 2 3 7 8 9 0 1 2 1 6 0 1 1 3 7 8 0 1 3 4 8 0 7 8 1 2 5 6 4 5 4 7 8 4 6 4 1 6 4 7 1 1 3 2 9 0 5 1 2 1 5 6 7 9 0 1 2 r 3 5 5 4 4 5 5 4 4 5 5 5 5 a - - -

o Euchirella sp1 o o t 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 4 4 ------4 ------C

C Euchirella rostrata C 6 6 6 3 4 c - c c 1 4 1 - 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 1 1 1 1 1 1 1 - 3 3 a a a a a a a a a a 1 a a b a b a b a b 0 0 0 0 0 Acartia sp2 x 1 2 2 1 x 1 5 B - - - 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 - - 1 1 2 G 2 G G G G G G G G G G P P P P P P P N N N N N N N N N N N N N N N N N N N N N N N N N - a 0 a 4 0

P R R Acartia sp1 1 2 3 S S S S S S S S S S 1 S a a S S S S S S S S S S S - P P P P R R R R - 4 A A A A A A A A A A A A A A A A A A A A A A A A A N N N N N N N T - T 0 - - 1 1 Acartia negligens N C C C C C C C C C C C H H 1 N N N - a 1 a N N N N H H H H 4 4 S S 1 1 G 4 Acartia discaudata A A A 1 1 1 C C S 0 S G S Acartia clausii 4 - 5 S C C 0 1

0 Penilia avirostris - -

G Pseudevadne tergestina 1 a S 1

G Podon intermedius

S Pleopis polyphemoides D

R Other Evadne spinifera Evadne nordmanni Other Annaba-Algeria (1,2,3) Napoli Saronikos Cretan Sea Daphnia magna 22 MDS plot based on genetic data for all samples Replicate tows from two neighboring station in the Gulf of Heraklion

Cretan Sea

23 Similarity dendrogram (Bray Curtis) of zooplankton samples (genetic data)

Cretan Sea & Rhodes Autumn-winter from Spring-summer from Spring-summer Napoli, Saronikos, Annaba Napoli, Saronikos from Annaba 24 Integration between image analysis and DNA metabarcoding analysis (under way)

A script is developed to combine automatically the output of image analysis and genetic analysis.

DNA metabarcoding Genetic Final combined IA categories IA % categories data % (%) Calanus helgolandicus 20 Calanus helgolandicus 40 20 Subtotal 40 20 Acartia clausii 4 14 Acartiidae 35 Acartia negligens 6 21 Subtotal 10 35 Transformed to Clausocalanus arcuicornis 10 15 Clausocalanus furcatus 2 3 abundances and Small Calanoida 30 Clausocalanus jobei 5 7.5 biomass. Ctenocalanus vanus 2 3 Mecynocera clausi 1 1.5 Subtotal 20 30 Oithona similis 8 4 Oithona 15 Oithona nana 12 6 Oithona sp. 10 5 Subtotal 30 15 Total 100 100 100

25 What next

 Results are currently reanalyzed and the methodology is optimized for higher integration between image analysis and NGS analysis

 The 16S primers designed perform quite well and have high resolution (also amplify quite well the other components of zooplankton but not evaluated yet for taxa other than copepods and cladocera). Redisign to amplify Corycaeidae.

Genetic reference DB needs further improvement (few important species still missing, taxonomic issues for certain taxa)

Image analysis software should be “trained” for the oligotrophic regions of Eastern Mediterranean and other regions (e.g Algeria)

26 Partners of the "MetaCopepod" project

Genetics Image analysis

Panagiotis Kasapidis Costas Frangoulis

Jon Kristoffersen Stratos Batziakas

Gianpaolo Zampicinini Taxonomy Valentina Tsartsianidou Ioanna Siokou Vasso Terzoglou Nontas Christou

Biodiversity informatics external collaborators

Sarah Faulwetter Maria Grazia Mazzocchi (SZN, Naples) Maria-Luz Fernandez de Puelles (IEO, Spain)

Bioinformatics Meriem Khelifi Touhami (Annaba Univ., Algeria)

Jacques Lagnel Yesim Ak Örek (METU, Turkey)

Tereza Manousaki Rana Abu Alhaija (The Cyprus Institute, Cyprus)

Alexandra Gubanova (IBSS, Ukraine)

27 Acknowledgments

Many thanks to people who contributed with samples:

● Prof. Alenka Malej, National Institute of Biology, Ljubljana, Slovenia

● Dr. Soultana Zervoudaki, HCMR, Greece

● Dr. Maria Corsini-Foka, HCMR, Rhodes, Greece

28 The project's webpage: http://metacopepod.hcmr.gr/

For more info you may contact Dr. P. Kasapidis

Thank you for your attention!

29