FROM SPECIMENS TO THE TREE-OF-LIFE: TACKLING TROPICAL DIVERSITY

DARREN YEO B.Sc. (Hons), NUS

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF BIOLOGICAL SCIENCES, NATIONAL UNIVERSITY OF SINGAPORE

AND

DEPARTMENT OF LIFE SCIENCES, IMPERIAL COLLEGE LONDON

2018

Supervisors:
Professor Rudolf Meier, Main Supervisor
Professor Alfried P. Vogler, Co-Supervisor

Examiners:
Assistant Professor Huang Danwei
Dr. Thomas Bell
Professor Dalton De Souza Amorim

Declaration

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis.

This thesis has also not been submitted for any degree in any university previously.

______

Darren Yeo

03 August 2018

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Acknowledgements

I am deeply grateful towards the following people, without whom this thesis would not have been possible:

Prof. Rudolf Meier, who has had the central role in shaping my growth as a researcher, student and teacher. Thank you for always being supportive, conscientious and patient with me throughout my PhD studies. I am truly thankful to have a supervisor both passionate and well-versed in this field, who is able to spark and nurture my interest for entomology and molecular biology. I have come a long way since first joining the lab as a fresh-faced undergraduate and I am grateful for this journey spent under your supervision.

Prof. Alfried Vogler, for providing me with the opportunity to work with you and your team at the London Natural History Museum. When I first entered the main hall of the NHM and looked upon the incredible skeleton of Dippy the Diplodocus, I never imagined that I would have the chance to study and exchange ideas with the brilliant researchers within. Thank you for always being so approachable and full of great ideas, and for your time and support even when I was abroad. I have learnt so much from you and I will always look back at my time at the NHM with great fondness.

Dr. Amrita Srivathsan, for being an amazing mentor and friend. Thank you for being patient with my silly bioinformatics questions and for helping me find my interest in coding. Your sage advice and calm presence have helped me through many difficult times in my PhD and I am incredibly thankful.

Jayanthi Puniamoorthy, for being my mentor and very first interaction in the lab. Your tutelage and careful instruction have stuck with me throughout my experiences with molecular lab work, and the fact that my results are not filled with contaminants is testament to your effective mentorship.


Robin Ngiam from the National Parks Board for first getting me interested in odonates and being always willing to share his expertise. The Ministry of Education and National Parks Board for funding the various projects here.

The members and alumni of the NUS Evolution lab: Dr. Sujatha Kutty and Dr. Kathy Su, for taking the time off their busy schedules to listen to my troubles and give amazing advice. Dr. Ang Yuchen, Foo Maosheng and Siti Maimon, for entertaining my constant pestering regarding specimens, imaging and sampling information. Dr. Wendy Wang and Theodore Lee for helping me manage students, interns and the general molecular lab insanity. Gowri Rajaratnam, Rebecca Loh, Mindy Tuan and Bilgenur Baloğlu for the great company and much-needed games/durian nights. Arina Adom and Phua Junwei for your invaluable help with data management, sequencing organization and molecular lab work. Lee Wan Ting, Chen Shihui and Yuen Huei Khee for all the help with molecular lab work and your patience and diligence even when we had to stay late to finish up. Lab alumni Shiyang, Denise, Andie, Gerald, Ivy, Shu Min, and Youguang for your friendship and guidance. The hardworking students and interns who have contributed to the various NGS barcoding projects: Jake, Terence, Keneth, Kaiqing, Jonathan, Quý, Sandra and many others whom I am not able to fully list here.

My fellow PhD students and dear friends, Gowri, Jerome, Ian and Bee Yan for lending a listening ear and a helping hand whenever I need them.

The members and alumni of the Vogler lab: Dr. Alex Crampton Platt for patiently mentoring me when I was a clueless undergraduate and new to molecular work. Dr. Peter Foster, for his limitless patience with my requests for programme installations and server space. Dr. Thomas Creedy for being such an approachable and brilliant scientist and bioinformatician. Dr. Belen Arias for all the much-needed coffee breaks and conversations, as well as Ginez for having us over.

Angelina Ceballos and Borja for your warm companionship and flying halfway across the world to attend my wedding. Hannah Norman, Alejandro Lopez, Mizanur Rahman, Beulah Garner, Tang Pu, Nie Rui E and Ge Deyan, from whom I have learnt so much and who made me feel welcome. Dr. Benjamin Linard, Dr. Carmelo Andújar and Dr. Paula Arribas for graciously taking the time to guide me and answer my queries despite your hectic schedules.

My dear friends at the London NHM: Vassia, Nathan, Susy, Katia, Marco, Sandy, Wui Shen and Carlos, thank you for making me feel at home away from home. I am truly glad and honoured to have met you. Sophie, Arseni and Xuewei, who have enriched my life and kept me sane while I was studying abroad.

NUS, for funding my studies through the President’s Graduate Fellowship and providing other financial and administrative support.

Mom, Dad and Debbie, for being my pillars of support, for always encouraging me to pursue my dreams and passions and for making me who I am today.

My wife, Sabrina, for your love and support, for constantly motivating me to strive for self-improvement and to be a better person. I can only hope I have done the same for you.


Table of Contents

Summary

List of Tables

List of Figures

CHAPTER 1 General Introduction

1.1. Species discovery with NGS barcodes

1.2. Life-history stage association with NGS barcodes

1.3. How are these species related to each other: Tree of Life

1.4. Phylogenetics via genome skimming and multiplexed tagged amplicon sequencing

1.5. Exploring hybrid enrichment for mitochondrial genome skimming

CHAPTER 2 It doesn’t have to be full-length: some mini-barcodes perform as well as full-length cox1 barcodes

2.1. Abstract

2.2. Introduction

2.3. Materials & Methods

2.3.1. Survey of mini-barcodes

2.3.2. Species delimitation and testing

2.3.3. Performance assessment

2.4. Results

2.4.1. Congruence with morphology

2.4.2. Decisiveness

2.4.3. Variable sites

2.4.4. Pairwise distances

2.5. Discussion

2.5.1. Should full-length barcodes be preferred?

2.5.2. Are longer barcodes more decisive?

2.5.3. Which end of the barcode performs better?

2.5.4. Conclusion

2.6. Supplementary Information

CHAPTER 3 Scalable species delimitation with molecular markers: comparing the performance of distance- and tree-based methods

3.1. Abstract

3.2. Introduction

3.3. Materials & Methods

3.3.1. Specimen sampling, barcoding and OTU representative selection

3.3.2. Data for phylogeny reconstruction: I. Mitogenomes via MMG

3.3.3. Data for phylogeny reconstruction: II. Multiplexed tagged amplicon sequencing

3.3.4. Tree reconstruction

3.3.5. Species delimitation and comparison

3.3.6. Comparison of species delimitation results

3.3.7. Congruence with morphology

3.4. Results

3.4.1. Barcoding and character sampling

3.4.2. Morphological verification

3.4.3. Multiplexed tagged amplicon sequencing

3.4.4. Mitogenome skimming with single and multiple baits

3.4.5. Phylogenetics

3.4.6. Comparison between Tree-based and Distance-based species delimitation

3.4.7. Species delimitation: congruence with morphology

3.5. Discussion

3.5.1. Do tree-based species delimitation methods yield better results?

3.5.2. Improving MMG with multiple baits

3.5.3. Toward an efficient and scaling species-discovery pipeline

3.5.4. Concluding statement

CHAPTER 4 Should hybrid enrichment be used for mitochondrial metagenomics?

4.1. Abstract

4.2. Introduction

4.3. Materials & Methods

4.3.1. Sampling, extraction and barcoding

4.3.2. MMG and hybrid enrichment library preparation and sequencing

4.3.3. Post-sequencing processing and bioinformatics

4.4. Results

4.4.1. Illumina barcoding and species delimitation

4.4.2. Mitochondrial read counts

4.4.3. Mitochondrial genome assembly

4.4.4. Mitochondrial read coverage

4.5. Discussion

4.5.1. Should hybrid enrichment be used for mitochondrial genome skimming?

4.5.2. What is causing the uneven enrichment?

CHAPTER 5 Towards holomorphology in entomology: rapid and cost-effective larval-adult matching using NGS barcodes

5.1. Abstract

5.2. Introduction

5.3. Materials & Methods

5.3.1. Sampling and identification

5.3.2. DNA extraction, amplification and sequencing

5.3.3. Read processing and barcode determination

5.3.4. mOTU estimation for adult-larva association

5.3.5. Imaging and databasing

5.3.6. Review of adult-larval association literature

5.4. Results

5.4.1. Sequencing and initial processing

5.4.2. Clustering and life history stage matching

5.4.3. mOTU stability

5.4.4. Literature survey

5.5. Discussion

5.5.1. Expediting adult-larval associations with NGS barcoding

5.5.2. Tackling the problem of rarity and elusive larvae

5.5.3. Barcode reliability and potential sources of error

5.5.4. The value of both morphology and molecules

5.5.5. Description, illustration, or both?

5.5.6. Barcode databases

5.5.7. Concluding remarks

CHAPTER 6 Discovery of a rich, distinct, and imperilled marine fauna in tropical mangrove forests

6.1. Abstract

6.2. Introduction

6.3. Materials & Methods

6.3.1. Sample collection and processing

6.3.2. NGS barcoding and putative species sorting

6.3.3. Diversity analysis

6.4. Results

6.4.1. Species delimitation based on barcodes

6.4.2. Alpha-diversity across

6.4.3. Beta-diversity across habitats

6.4.4. Beta-diversity across mangrove sites

6.4.5. Dolichopodidae beta-diversity across Southeast Asia

6.5. Discussion

6.5.1. Discovery of a largely overlooked insect community in mangroves

6.5.2. NGS barcodes as a new technique for large-scale study of insect communities

6.5.3. Concluding remarks

6.6. Supplementary Information

CHAPTER 7 Conclusion

CHAPTER 8 References

Summary

The question of how many species there are on earth is not easily answered. This is due in part to how much is still unknown and undiscovered with regard to certain hyper-diverse groups such as insects (estimated 5 – 10 million species), which hampers efforts to derive reasonable global species estimates. Indeed, most of the insect fauna (mostly localized in the tropics) remains undescribed and hence unidentifiable with any currently available tools. With so much still undescribed, experts estimate that it would take taxonomists another 200 years to complete the description process. This is time we do not have, given the current catastrophic extinction rates.

The main bottleneck in the species discovery process is the species pre-sorting and identification stage. This is due to the massive number of specimens and species in tropical insect samples, as well as the shortage of taxonomists needed to tackle such hyper-diversity. Fortunately, recent advances in next-generation sequencing (NGS) technologies can help develop the affordable and scalable tools required to address this problem. This thesis hence aims to develop accessible workflows that enable the transition from specimen to Tree-of-Life.

In this thesis, I use NGS barcodes for a novel “reverse workflow” approach to taxonomy, where barcodes for all specimens are used for species pre-sorting, followed by validation by expert taxonomists, instead of barcodes being used as validation of morphospecies. This is only possible because each NGS barcode costs less than USD 0.40. However, in order for the barcode to be sequenced on a high-throughput Illumina platform, the barcode has to be shorter than 500 bp. Hence in Chapter 2, I test the efficacy of such mini-barcodes for species delimitation and find that mini-barcodes often perform as well as the full-length 657-bp barcode. This justifies the use of the 313-bp mini-barcode employed in the subsequent chapters. I then use NGS barcoding techniques to associate insect larval and adult life history stages for Singapore’s Odonata (Chapter 5) and discover a largely overlooked insect fauna in the mangroves that is both species-rich and highly distinct from communities in other habitats (Chapter 6).
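The barcode-based pre-sorting step can be illustrated with a minimal sketch. It assumes aligned barcodes of equal length, computes uncorrected p-distances, and places any two sequences within a distance threshold in the same cluster (single linkage, which is how objective clustering is usually described). The function names, specimen labels and toy sequences are hypothetical, and the code is not the software used in this thesis.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected p-distance: share of differing sites among the sites
    where both aligned sequences carry an unambiguous base."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def cluster_barcodes(seqs, threshold=0.03):
    """Single-linkage clustering: two specimens share a cluster if they are
    connected by a chain of pairs with p-distance <= threshold."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Toy 20-bp "barcodes" (hypothetical); real mini-barcodes are about 313 bp.
toy = {
    "adult_A": "ACGTACGTACGTACGTACGT",
    "larva_A": "ACGTACGTACGTACGTACGA",  # 1 difference from adult_A
    "adult_B": "TTTTACGTACGTACGTACGT",  # 3 differences from adult_A
}
print(cluster_barcodes(toy, threshold=0.10))
```

In this toy case the larva clusters with the adult it differs from at a single site, which is the logic behind matching life-history stages by barcode.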

NGS also affords us the opportunity to place these newly-discovered species on the tree of life. Given the large number of species, this process has to be scalable as well. Here, I apply and test three different techniques for obtaining mitochondrial genomic data for constructing species-rich trees: (1) mitochondrial metagenomics (MMG), (2) multiplexed tagged amplicon sequencing and (3) hybrid enrichment. In Chapter 3, I use MMG and multiplexed tagged amplicon sequencing to obtain mitochondrial genome and 28S rDNA markers for constructing a tree with 459 terminals. This tree was used to test if tree-based species delimitation was more effective than the more scalable distance-based methods. I find that tree-based species delimitation with data-rich trees offers only marginal improvements over distance-based methods on mini-barcodes, and hence the optimal approach might be to perform initial species delimitation with the latter technique and reserve the former for regions of conflict. I also evaluate hybrid enrichment for tree-building in Chapter 4 and find that it does not perform as well as MMG for broad and diverse taxon sets.
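Comparisons between delimitation methods in later chapters are reported as match ratios. The sketch below assumes the commonly used definition, 2 × N_match / (N_1 + N_2), where N_match is the number of clusters with identical membership in both delimitations and N_1 and N_2 are the cluster counts of each. This definition is an assumption here, since it is not stated in this section, and the function names and specimen labels are hypothetical.

```python
def clusters_of(assignment):
    """Turn a specimen -> cluster-label mapping into a set of clusters,
    each represented as a frozenset of specimen names."""
    groups = {}
    for specimen, label in assignment.items():
        groups.setdefault(label, set()).add(specimen)
    return {frozenset(g) for g in groups.values()}


def match_ratio(assignment_1, assignment_2):
    """Assumed definition: 2 * N_match / (N_1 + N_2)."""
    c1 = clusters_of(assignment_1)
    c2 = clusters_of(assignment_2)
    n_match = len(c1 & c2)  # clusters with exactly the same members
    return 2 * n_match / (len(c1) + len(c2))


# Hypothetical example: a distance-based result versus morphospecies.
distance_based = {"s1": "m1", "s2": "m1", "s3": "m2", "s4": "m3"}
morphology = {"s1": "sp1", "s2": "sp1", "s3": "sp2", "s4": "sp2"}
print(match_ratio(distance_based, morphology))
```

A ratio of 1 indicates that both methods recover exactly the same clusters, while lower values indicate splitting or lumping by one method relative to the other.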

Overall, this thesis explores and develops the means for expediting the processes of species discovery, identification and placement on the Tree-of-Life through cost-effective NGS-based techniques. NGS barcoding helps to pre-sort specimens into putative species, which facilitates downstream natural history and ecological research. These species can also be affordably placed on the Tree-of-Life with MMG and multiplexed tagged amplicon sequencing. It is indeed an exciting time for taxonomy and systematics, where the goal of a global species inventory might be within our reach.


List of Tables

CHAPTER 2:

Table 1: Barcoding datasets used in this study

Table S1: Primer sequences of the mini-barcodes assessed in this study

CHAPTER 3:

Table 1: Mitochondrial gene and 28S rDNA primer sequences used in this study

Table 2: Number of 0.6% mOTUs and specimens of the various taxa with sufficient data for phylogenetics

Table 3: Pairwise match ratios of the mOTU comparisons between species delimitation techniques and parameters are shown on the bottom-left diagonal, while the top-right diagonal indicates the percentage of specimens that need to be removed in order to achieve perfect congruence. A heatmap effect was applied to both metrics, with the highest values in green and lowest in white for the match ratios, while the highest values are in red and lowest in white for the percentages of specimens

Table 4: Pairwise Robinson-Foulds distances (bottom-left) and branch-length scores (top-right) between the trees used in this study

Table 5: Match ratios and number of specimens to remove in order to achieve perfect congruence for the various molecular species delimitation methods against the mycetophilid morphospecies clusters

CHAPTER 4:

Table 1: Read counts of both pure genome skimming and hybrid enriched libraries before and after filtering of mitochondrial-like reads

Table 2: Counts and proportions of reads mapped to all three references for both libraries 3 and 5

Table 3: Cost analysis of both hybrid-enrichment (HE) and pure genome skimming (GS) approaches across different sequencing depth requirements and sequencing platforms with lower and higher throughputs, as well as with the number of libraries required in this study

CHAPTER 5:

Table 1: Sequencing success rates and number of reads obtained

Table 2: mOTU delimitation via objective clustering and ABGD approaches

Table 3: Pairwise match ratios of clusters delimited by objective clustering and ABGD

CHAPTER 6:

Table 1: Total number of barcodes and mOTUs in this study as delimited by objective clustering and ABGD, with different p-distances and equivalent -id parameters for the different habitat types: mangroves (M), tropical forest (TF), freshwater swamp forest (FS) and disturbed secondary forest (DSF)

Table 2: Number of species found exclusively in the various habitat types and shared across multiple habitats with varying levels of rarity

Table 3: Global and pairwise p-value and R-statistic outputs from ANOSIM analyses of the full dataset split by habitat type. The p-values are displayed in the bottom-left of the pairwise matrix while the R-statistics are displayed at the top-right

Table 4: Number of species found exclusively in the various mangrove sites and shared across multiple sites

Table 5: Global and pairwise p-value and R-statistic outputs from ANOSIM analyses of the mangrove sites (PU: Pulau Ubin, SB: Sungei Buloh, SMN: Pulau Semakau new grove, SMO: Pulau Semakau old grove). The p-values are displayed in the bottom-left of the pairwise matrix while the R-statistics are displayed at the top-right

Table S1: Collection periods, locations and number of trapping sites in this study. The habitat types included are as follows: M = mangroves, FS = freshwater swamp, DSF = disturbed secondary forest, TF = tropical rainforest

Table S3: Global and pairwise p-value and R-statistic outputs from ANOSIM analyses of modified datasets split by habitat type (TF: tropical forest, DSF: disturbed secondary forest, FS: freshwater swamp, M: mangroves) with singletons and doubletons removed, as well as species with fewer than 5 and 10 specimens. The p-values are displayed in the bottom-left of the pairwise matrices while the R-statistics are displayed at the top-right

Table S5: Global and pairwise p-value and R-statistic outputs from ANOSIM analyses of the mangrove sites (PU: Pulau Ubin, SB: Sungei Buloh, SMN: Pulau Semakau new grove, SMO: Pulau Semakau old grove) with singletons and doubletons removed, as well as species with fewer than 5 and 10 specimens. The p-values are displayed in the bottom-left of the pairwise matrices while the R-statistics are displayed at the top-right

List of Figures

CHAPTER 2:

Figure 1: Map of the mini-barcode positions and lengths within the 657-bp Folmer barcode region, with the target taxa they were originally designed for

Figure 2: Match ratios for 20 datasets when clustered with objective clustering at 2 – 4% p-distance thresholds

Figure 3: Heatmap of match ratios and number of specimens that require removal for perfect congruence for objective clustering, with red representing low and green high congruence values. The highest values for each dataset and p-distance threshold are indicated below. The total number of datasets with highest match ratios along with the range of match ratios and specimen numbers for different barcodes are also indicated at the bottom.

Figure 4: Pairwise post-hoc Tukey tests of objective clustering mOTU (2, 3 & 4% p-distance thresholds) congruence with morphology in terms of match ratios for both species (left) and specimen-level assessments (right) for all 10 barcodes tested. P-values are indicated at the bottom left of each matrix and an asterisk (*) is used to indicate a significant difference (P < 0.05). Where differences are significant, the side at which the asterisk is placed indicates the higher match ratio for that mini-barcode length, as well as fewer specimens that need to be removed for perfect congruence

Figure 5: Match ratios for 20 datasets when clustered with ABGD for priors P=0.001, 0.01 & 0.04

Figure 6: Heatmap of match ratios and number of specimens that require removal for perfect congruence for ABGD, with red representing low and green high congruence values. The highest values for each dataset and p-distance threshold are indicated below. The total number of datasets with highest match ratios along with the range of match ratios and specimen numbers for different barcodes are also indicated at the bottom.

Figure 7: Pairwise post-hoc Tukey tests of ABGD mOTU (P = 0.001, 0.01 & 0.04) congruence with morphology in terms of match ratios for both species (left) and specimen-level assessments (right) for all 10 barcodes tested. P-values are indicated at the bottom left of each matrix and an asterisk (*) is used to indicate a significant difference (P < 0.05). Where differences are significant, the side at which the asterisk is placed indicates the higher match ratio for that mini-barcode length, as well as fewer specimens that need to be removed for perfect congruence

Figure 8: Match ratio comparisons between ABGD and objective clustering for each mini-barcode set. The barcode lengths and Mann-Whitney U-test p-values are indicated above each boxplot

Figure 9: Heatmap of dispersion values for objective clustering (left) and ABGD (right), with red representing higher values and green lower values. The lowest values per dataset are indicated with an asterisk (*). The total number of datasets with the lowest dispersion values are indicated at the bottom

Figure 10: Pairwise post-hoc Tukey tests of dispersion (variance/mean) for both objective clustering (left) and ABGD (right) for all 10 barcodes tested. P-values are indicated at the bottom left of each matrix and an asterisk (*) is used to indicate a significant difference (P < 0.05). Where differences are significant, the side at which the asterisk is placed indicates the lower dispersion (i.e. greater decisiveness) for that mini-barcode length

Figure 11: Proportions of variable sites (%) for each barcode for both nucleotides and amino acids

Figure 12: The difference in p-distances between the 657-bp barcode and the various mini-barcodes

CHAPTER 3:

Figure 1: Depictions of the aligned character matrices of the four datasets/treatments used in this study. The longer bars represent the full character set while the shorter bars represent just the cox1 region.

Figure 2: Distribution of successfully sequenced mitochondrial bait (A) and 28S fragment (B) amplicons

Figure 3: The number of characters (bp) in each contig sorted from highest to lowest from the actual multiple-bait dataset and the longest and shortest possible single-bait dataset. The area in blue indicates the difference in the number of characters, which would have been lost had only a single bait been used

Figure 4: Family-level maximum likelihood tree of mOTUs in this study built with the full mitochondrial genome and 28S rDNA dataset, with bootstrap support values indicated at the basal nodes. The number of 0.6% mOTUs for the target taxa is displayed below the family name. Paraphyletic clades are indicated with red text, while monophyletic ones are in black. The green branches indicate congruence with the Wiegmann et al. (2011) Diptera phylogeny, while the red branches indicate conflict and the blue branches indicate a lack of information. The asterisks (*) beside the family names indicate taxa that were missorted but still retained in this study

Figure 5: Bootstrap support values sorted from highest to lowest for the trees constructed from the full dataset, the barcodes-only dataset, as well as the mixed mitochondrial genome, 28S rDNA and barcode dataset without a constraint tree

Figure 6: Maximum likelihood phylogeny of the Mycetophilidae extracted from the tree constructed with the full character set, with the 0.6% mOTUs as terminals. Bootstrap supports are indicated at each node and 3% p-distance cox1 clusters with more than one terminal are highlighted in yellow. Within these clusters, branches congruent with morphospecies identification are coloured green, while lumping caused by the clustering are indicated by the red branches. The blue branches indicate where the morphospecies were split by the clustering process

Figure 7: Proposed pipeline for the species delimitation of large highly diverse samples using both distance-based and tree-based species delimitation techniques

CHAPTER 4:

Figure 1: Ranked supercontig length distribution of mitochondrial supercontigs, sorted from longest to shortest, from hybrid enriched (HE) and genome skimming (GS) datasets, as well as a combined supercontig assembly of contigs from both datasets (HE + GS)

Figure 2: Ranked baited supercontig length distribution of mitochondrial supercontigs, sorted from longest to shortest, from hybrid enriched (HE) and genome skimming (GS) datasets, as well as a combined supercontig assembly of contigs from both datasets (HE + GS)

Figure 3a: Mapping tracks of library 3 mitochondrial reads from both genome skimming and hybrid enrichment datasets mapped to 3 references (top to bottom): 1) the longest scaffold in the library, 2) consensus sequence of mitochondrial genomes used for probe design and 3) consensus sequence of RefSeq Coleoptera mitochondrial genomes

Figure 3b: Mapping tracks of library 5 mitochondrial reads from both genome skimming and hybrid enrichment datasets mapped to 3 references (top to bottom): 1) the longest scaffold in the library, 2) consensus sequence of mitochondrial genomes used for probe design and 3) consensus sequence of RefSeq Coleoptera mitochondrial genomes

Figure 4: Mapping track generated when the probe sequences were mapped to the RefSeq Coleoptera consensus sequence at different levels of match similarity from 70 – 95%. Minimum and maximum read counts are indicated on the y-axis

CHAPTER 5:

Figure 1: Rarity and probability of adult-larval matching of a species. (a) Current study; (b) Literature (numbers=number of species); (c) Larval abundance and across rarity classes in study

Figure 2: Adult-larva association methods in descriptive odonate literature (2000–2016)

Figure 3: Species entry in digital reference collection for larva of Mortonagrion arthuri: dorsal (A), lateral (B) and ventral (C) views; enlarged images of dorsal (D) and ventral (E) views of the labium and caudal lamellae (F). Scale = 1mm. All images can be enlarged in browser

CHAPTER 6:

Figure 1: Map of Singapore’s mangroves (Yang et al., 2013) showing the mangrove sites sampled in this study: Pulau Ubin (P. Ubin), Sungei Buloh Wetland Reserve (SBWR) and Pulau Semakau (P. Semakau)

Figure 2: Map of trapping sites in this study, with the various habitat types in different colours (green: mangroves, blue: tropical forest, red: freshwater swamp forest, purple: disturbed secondary forest)

Figure 3: Number of objective clustering mOTUs across p-distances 0 – 5%, with the 2 – 4% thresholds commonly used for species-level delimitation shown with a solid line

Figure 4: Comparison of species diversity across habitats (3% p-distance mOTUs). Mangrove (M) sites are represented by Pulau Ubin (PU), Sungei Buloh (SB), Pulau Semakau old grove (SMO), Pulau Semakau new grove (SMN). The tropical forest site is represented by maturing secondary forest (TF-MS), old secondary forest (TF-OS) and primary forest (TF-P), while the other habitat types are freshwater swamp (FS) and disturbed secondary forest (DSF). In plot (A), curves were plotted for the mangrove sites as a single habitat type, but plotted as separate sites in (B). Plot (C) also plots the forest types in the tropical forest site separately. The full lines represent rarefactions, the dotted lines represent extrapolations and the point between the lines represents the actual observed values. The black vertical dotted lines indicate the point of rarefaction at which species richness comparisons were made

Figure 5: NMDS plots of Bray-Curtis (top) and Chao (bottom) distances of the trapping sites in this study, indicating the distinctness of the four habitat types (mangroves: green, tropical forest: blue, freshwater swamp: red, disturbed secondary forest: purple)

Figure 6: NMDS plots of Bray-Curtis (top) and Chao (bottom) distances of the mangrove traps sampled across 2 years in this study, grouped by site (Pulau Ubin [PU], Sungei Buloh [SB], Pulau Semakau old fragment [SMO] and new fragment [SMN])

Figure S2: Rarefaction curves of all species (2 & 4% p-distance) in the study split by habitat and site for the mangroves and forest type for the tropical forest. Mangrove (M) sites are represented by Pulau Ubin (PU), Sungei Buloh (SB), Pulau Semakau old grove (SMO), Pulau Semakau new grove (SMN). The tropical forest site is represented by maturing secondary forest (TF-MS), old secondary forest (TF-OS) and primary forest (TF-P), while the other habitat types are freshwater swamp (FS) and disturbed secondary forest (DSF). In plots (A), curves were plotted for the mangrove sites as a single habitat type, but plotted as separate sites in (B). Plots (C) also plot the forest types in the tropical forest site separately. The full lines represent rarefactions, the dotted lines represent extrapolations and the point between the lines represents the actual observed values. The black vertical dotted lines indicate the point of rarefaction at which species richness comparisons were made

Figure S4: NMDS plots of Bray-Curtis and Chao distances of the trapping sites in this study, grouped by habitat (mangroves [M]: green, tropical forest [TF]: blue, freshwater swamp [FS]: red, disturbed secondary forest [DSF]: purple) with various degrees of rare species removal

Chapter 1

General Introduction

“No one knows the diversity in the world, not even to the nearest order of magnitude. … We don’t know for sure how many species there are, where they can be found or how fast they’re disappearing. It’s like having astronomy without knowing where the stars are.” – E. O. Wilson

One of the principal goals of taxonomy and systematics is to catalogue life and uncover how many species there are on our planet. This has been a continuous pursuit in biology for more than 250 years. However, as Robert May discussed in 2010, if visiting aliens were to inquire how many species exist on earth, we would embarrassingly have no certain answer. Our best estimates range from 5 – 10 million eukaryotes, but could go as low as 3 million or as high as 100 million (May, 2010). This is troubling as the species units in question are fundamental to our understanding of biodiversity and evolution, and the urgency of our quest for answers is heightened by the growing rate of extinction through anthropogenic means (Thomas et al., 2004).

Several attempts have been made to estimate global metazoan species diversity, although none have been particularly conclusive or persuasive. These include the much-cited estimates made by Erwin (1982), who proposed that there are up to 100 million species of insects. This is now often considered an overestimate, with more recent global estimates ranging from 2 million (Costello et al., 2011) to 8.7 million (Mora et al., 2011). However, biology is nowhere near reaching a consensus (Caley et al., 2014). The only agreement is that tropical arthropods dominate the global total for multicellular organisms (May, 1990; Hamilton et al., 2010), although similarly poorly studied and highly diverse clades such as nematodes are expected to also contribute substantially to the total. However, it is mainly tropical arthropods that have been used as the starting point for estimating global species richness (Erwin, 1982; Stork, 1993); i.e., it is the great uncertainty regarding the number of tropical arthropod species (Scheffers et al., 2012) that is the reason for the lack of a definitive answer.

Of course, other factors contribute to the uncertainty, one being disagreement over species concepts (Wheeler & Meier, 2000; De Queiroz, 2007). Species are mutable and hypothetical entities and hence a global estimate of species numbers will naturally fluctuate based on the species definition applied. However, a discussion on the nature of species necessitates a philosophical discussion beyond the scope of this thesis. Rather, I focus primarily on generating the much-needed evidence required to support species hypotheses.

In this thesis I will focus on improving our ability to discover species in the tropics, with particular attention to insects, which present a major hurdle for obtaining a reliable estimate. The problem can be dissected into several key challenges: 1) hyper-diversity, 2) small body size of particularly diverse clades, 3) high species richness in poorly sampled tropical habitats, 4) comparative lack of support for non-vector arthropod research and 5) the slow pace with which traditional taxonomy can deal with high diversity. In this thesis, I attempt to develop techniques that address some of these obstacles. I outline how specimen-based species discovery can be accelerated at sufficiently low cost such that the techniques are scalable. I also document how the same techniques can be used for large-scale life-history matching in arthropods. Lastly, I outline how the phylogenetic relationships of the newly discovered species can be estimated using scalable techniques.

Tropical arthropods are impressively species-rich, but the greatest contributors to this species richness are insects (Basset et al., 2012), most of which belong to a few large orders: Coleoptera, Hymenoptera, Diptera, Hemiptera and Lepidoptera. The large number of species impedes comprehensive study because a substantial number of experts are required to fully understand any particular clade. Ebach et al. (2011) claim that a single taxonomist can be an expert on at most 5000 – 10,000 species in their lifetime. Hence, if there are 5 million species, tackling this diversity would require the lifetimes of 500 – 1000 taxonomists. This is, however, an optimistic estimate, as taxonomic expertise in reality varies widely between individuals and taxa. Of course, it does not help that the vast majority of species are not even described and hence all existing identification tools are likely to fail when attempting to identify these species.

Currently, Costello et al. (2013) estimate that the average rate of species description ranges from 18,000 to 20,000 a year. With a global estimate of 5 million arthropods (Ødegaard, 2000) and around 4 million still undescribed, it would take at least another 200 years to complete the description process. While this thesis utilizes primarily insect specimens for its experiments, the methods developed and concepts examined are applicable to other arthropod taxa that share the same properties of being hyper-diverse and poorly studied.
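The arithmetic behind these two estimates can be retraced in a few lines. The sketch below only uses the figures quoted above; the variable names are illustrative and no new data are introduced.

```python
# Figures quoted in the text; nothing here is new data.
total_species = 5_000_000                 # global species total assumed in the text
species_per_taxonomist = (5_000, 10_000)  # Ebach et al. (2011): lifetime capacity
undescribed_species = 4_000_000           # species still awaiting description
descriptions_per_year = (18_000, 20_000)  # Costello et al. (2013)

# Taxonomist lifetimes needed: fewest when each expert covers 10,000 species.
lifetimes_needed = (
    total_species / species_per_taxonomist[1],
    total_species / species_per_taxonomist[0],
)

# Years to describe the backlog: shortest at the fastest description rate.
years_needed = (
    undescribed_species / descriptions_per_year[1],
    undescribed_species / descriptions_per_year[0],
)

print(f"Taxonomist lifetimes: {lifetimes_needed[0]:.0f} to {lifetimes_needed[1]:.0f}")
print(f"Years to clear backlog: {years_needed[0]:.0f} to {years_needed[1]:.0f}")
```

At the quoted description rates the backlog takes between 200 and roughly 222 years, so 200 years corresponds to the faster rate.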

Taxa that comprise large numbers of species with small individuals are a particularly serious challenge. For one, body size is generally correlated with range size (Gaston, 1994). Small species tend to have small ranges and therefore require meticulous, fine-scale sampling, since they are otherwise often overlooked and under-sampled. The relationship between body size and abundance, on the other hand, is more complex and multi-factorial. Reliable information on the effects of body size on the probability of discovery is unfortunately lacking, but there is no doubt that small body size interferes with taxonomic study. This is because the relatively small body sizes of most arthropods make it hard to use morphological characters, especially in cases which require time-consuming preparation and examination techniques such as dissection, slide mounting, clearing of soft tissue and microscopy (eg. Beckett & Lewis, 1982). However, just as advances in sequencing and genotyping methods led to more effective and prolific bacterial species characterization (Achtman & Wagner, 2008), there are likely many more arthropod species to be discovered once the appropriate tools are applied.

Another reason why global species diversity is so poorly known is that much of the unknown diversity is in the tropics. This is partially due to the well-documented latitudinal gradient in species richness, which exists even though the mechanisms behind it might not be fully understood (Stevens, 1989; Lambers et al., 2002). Another reason is that most taxonomists work in temperate regions and temperate fauna has consequently received more attention. Several species-rich tropical habitats, such as rainforest canopies and small islands, also tend to be particularly inaccessible. Combined with high beta diversity, this leads to significant under-sampling.

Unfortunately, arthropods are also less charismatic than larger, more conspicuous vertebrates. Aside from a few exceptions (eg. butterflies, ) as well as those taxa with economic importance, arthropods have received less attention from both the scientific community and the general public (Troudet et al., 2017). This has negatively impacted the amount of resources that are devoted to the study and conservation of these groups (Cardoso et al., 2011).

Of course, our poor knowledge of arthropod diversity is particularly tragic given that the current rate of species discovery and description is unable to keep pace with the rate of extinction. While there have been multiple large-scale collecting expeditions in the past few decades (Siemen et al., 1996; Basset et al., 2012), with ample material preserved in museums and universities, the main bottleneck lies in the species sorting, identification and description process. E. O. Wilson estimates that it would take 25,000 taxonomist lifetimes to describe all of Earth’s species (Wilson & Peter, 1988). Most experts agree that there are currently still not enough taxonomists and that the field needs to describe species at a faster rate than it currently does (Bacher, 2012; Wheeler, 2014), although some argue that the number of taxonomists is not dwindling (Joppa et al., 2011). Nevertheless, the process of species description is time-consuming. Fortunately, much conservation and management work can be accomplished with species units obtained through species delimitation without description, which is comparatively more straightforward. By focusing on improving the latter process, one can accelerate species discovery and promote the use of arthropod data in conservation biology. At the same time, specimen-based species discovery approaches allow for the formal description of species at a later date.

This thesis therefore focuses on developing methods that ameliorate much of the difficulty in dealing with high arthropod diversity in the tropics. Here, I develop techniques that were designed to be as affordable and accessible as possible to allow more researchers to start species discovery across habitats. With that in mind, I develop a workflow that goes from specimen to the Tree of Life. I start by testing whether scalable short barcodes are useful for species discovery and evaluating whether tree-based species delimitation techniques outperform distance-based methods based on short barcodes. I then use this barcoding technique to associate insect adult and larval life history stages, as well as discover a largely overlooked insect fauna in the mangroves. Finally, I test three different techniques for obtaining mitochondrial genomic data for constructing species-rich trees.

1.1. Species discovery with NGS barcodes

The advent of next-generation sequencing (NGS) technologies has greatly reduced the cost per sequence. This creates enormous potential for DNA-based taxonomy (Vogler & Monaghan, 2007), which has traditionally relied on more time-consuming and expensive techniques such as Sanger sequencing. As such, molecular information was mostly employed as a complementary tool to more conventional morphological techniques. Sorting of specimen samples would first be done via morphological characters and subsequently verified with DNA due to the high cost of DNA barcodes. Through NGS-based pipelines, I employ a “reverse workflow” in this thesis (Wang et al., 2018), where DNA barcodes (short standardized genes/gene fragments with sufficient variability to distinguish most species [Hebert et al., 2003]) are obtained for each specimen via NGS barcoding (Meier et al., 2016). The specimens are then pre-sorted into putative species via these barcodes, and these putative species are validated by experienced taxonomists. This relegates the tedious pre-sorting process to a mechanical exercise, thereby allowing taxonomists to focus on more pertinent and biologically interesting tasks like species description and the testing of species boundaries.

NGS barcoding (Meier et al., 2016) greatly reduces the cost of DNA barcoding per specimen compared to a Sanger barcode, which can cost anywhere from $5 – 20 USD (Cameron et al., 2006; CCDB: http://ccdb.ca/pricing). Combined with other cost-saving techniques such as Direct-PCR (Wong et al., 2014) and tagged primers, each NGS barcode can be obtained for less than $0.40 USD. A large number of specimens (>15,000) can thus be pooled in a single Illumina MiSeq run, as the oligonucleotide tag combinations on the primers allow the reads to be associated with their specimen of origin during read processing. The price falls further with the use of higher-throughput platforms like the HiSeq 2500 or 4000 and is likely to continue dropping as more powerful sequencing platforms emerge. Instead of using DNA barcodes as a complementary verification tool for morphological sorting and identification, every single specimen can thus be barcoded non-destructively at low cost. NGS barcodes can therefore function as an affordable, fast and effective pre-sorting tool for species discovery. While a single Illumina NGS run might be too expensive for most research groups in the most biodiverse regions, the capacity of these runs is large enough such that resources between groups can be pooled in order to attain the per-specimen sequencing costs stated earlier. Additionally, companies like NovogeneAIT and BGI offer sequencing services where the customer pays by the gigabase rather than per run.
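The way tag combinations link reads back to specimens can be illustrated with a minimal demultiplexing sketch. It assumes fixed-length tags at the start of each read and exact tag matches; the tag sequences, specimen names and the function `demultiplex` are hypothetical and do not reproduce the pipeline used in this thesis.

```python
from collections import defaultdict


def demultiplex(read_pairs, tag_to_specimen, tag_len):
    """Group read pairs by the (forward tag, reverse tag) found at their 5' ends.

    Assumes fixed-length tags and exact matches (an illustrative simplification).
    """
    by_specimen = defaultdict(list)
    for fwd, rev in read_pairs:
        key = (fwd[:tag_len], rev[:tag_len])
        specimen = tag_to_specimen.get(key)
        if specimen is None:
            continue  # unknown tag combination: discard the read pair
        # Store the reads with their tags trimmed off.
        by_specimen[specimen].append((fwd[tag_len:], rev[tag_len:]))
    return dict(by_specimen)


# Hypothetical 8-bp tags: one forward tag combined with two reverse tags.
tag_to_specimen = {
    ("ACGTACGT", "TTGACCAA"): "specimen_001",
    ("ACGTACGT", "GGCATTAC"): "specimen_002",
}
read_pairs = [
    ("ACGTACGT" + "GGTCAACAAATC", "TTGACCAA" + "TAAACTTCAGGG"),
    ("ACGTACGT" + "GGTCAACAAATC", "GGCATTAC" + "TAAACTTCAGGG"),
    ("NNNNNNNN" + "GGTCAACAAATC", "TTGACCAA" + "TAAACTTCAGGG"),  # unreadable tag
]
demultiplexed = demultiplex(read_pairs, tag_to_specimen, tag_len=8)
```

Because specimens are identified by the combination of two tags, a modest set of tagged primers suffices to label thousands of specimens in one run.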

In order for barcodes to be sequenced on an Illumina platform (300-bp paired-end), the amplicon has to be shorter than 500-bp. Hence, the full-length 658-bp barcode cannot easily be sequenced as a single amplicon. While this might become possible with future advances, mini-barcodes are in any case often easier to amplify and have been designed for degraded samples like gut/faecal or environmental DNA. There are, however, concerns that a shorter barcode would not contain sufficient information for properly delimiting species (Sultana et al., 2018). Chapter 2 of this thesis demonstrates the efficacy of mini-barcodes and explores how short mini-barcodes can be while still functioning as effective species discovery tools. I further document that the position of the mini-barcode greatly affects the ability to use cox1 for species delimitation. After ensuring that mini-barcodes perform as well as a full-length 658-bp barcode, I scaled up the NGS barcoding process in Chapter 6 to sequence 46,000 insect specimens in a study of mangroves and other tropical habitats in Singapore.

1.2. Life-history stage association with NGS barcodes

In Chapter 5, I address another challenge to species discovery for most arthropods: the morphological disparity of different life history stages and sexes. Through the use of NGS barcodes, I associate the larvae and adults of more than a thousand specimens with a pipeline that requires little time (ca. two weeks of molecular work). Such pipelines are needed because the juvenile or larval forms of (holometabolous) insects are often very distinct from the adult. The larvae might also inhabit a different environment and have a different diet. However, in many cases, only the adult is well-studied and characterized, while the morphology and habitat of larvae are unknown. Neglecting larvae and only focusing on adults is problematic as many insects spend most of their life as larvae, during which they accrue the most biomass. The most ecologically relevant life history stage thereby becomes the most poorly understood. Also, larvae tend to be less vagile than adults and are thus more sensitive to environmental disturbances; i.e. they require more attention from conservation biologists. Lastly, ignoring larval specimens results in an incomplete understanding of the environment that is being studied. The main reason why larvae tend to be ignored is the lack of obvious ways to associate them with adults. For one, they often have a different set of morphological characters from the adult. They generally lack (fully developed) genitalia, which are important for species identification and description. Larvae also tend to have multiple morphs or instars, which complicates description and identification.

Traditionally, larvae are associated with adults via rearing (van Gossum et al., 2003). However, this is time-consuming and sometimes impossible because the environmental triggers for metamorphosis are absent from the laboratory set-up. As such, rearing is sometimes complemented with molecular methods that associate larvae with adults via DNA when rearing fails. However, Chapter 5 documents that with the use of cost-effective NGS barcodes, the adult-larval association process can be performed entirely via DNA barcoding, bypassing the time-consuming and unreliable process of rearing. This is ideal for habitats where larval diversity is poorly sampled and characterized. Here, broad sampling of adults and larvae followed by mass NGS barcoding can quickly yield a large number of associations. Additionally, having a barcode for each specimen allows the use of global barcode databases like NCBI GenBank and BOLDSystems to find potential matches.
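A barcode-based association of larvae with adults can be sketched as a nearest-match search under a distance threshold. The sketch assumes aligned barcodes of equal length and uses an illustrative 3% p-distance threshold; the threshold, the function names and the toy sequences are assumptions for illustration, not the settings used in Chapter 5.

```python
def p_distance(seq_a, seq_b):
    """Uncorrected pairwise distance between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def associate_larvae(larvae, adults, threshold=0.03):
    """Map each larva to its closest adult barcode, or None if none is close enough."""
    associations = {}
    for larva_id, larva_seq in larvae.items():
        best_id, best_dist = None, None
        for adult_id, adult_seq in adults.items():
            dist = p_distance(larva_seq, adult_seq)
            if best_dist is None or dist < best_dist:
                best_id, best_dist = adult_id, dist
        matched = best_dist is not None and best_dist <= threshold
        associations[larva_id] = best_id if matched else None
    return associations


# Toy 20-bp "barcodes" (real barcodes are far longer).
adults = {
    "adult_sp1": "ACGTACGTACGTACGTACGT",
    "adult_sp2": "TGCATGCATGCATGCATGCA",
}
larvae = {
    "larva_a": "ACGTACGTACGTACGTACGT",  # identical to adult_sp1
    "larva_b": "TGCATGCATGCATGCATGCA",  # identical to adult_sp2
    "larva_c": "AAAAAAAAAAAAAAAAAAAA",  # no close adult barcode
}
associations = associate_larvae(larvae, adults)
```

Larvae without a match among the sampled adults remain unassigned and are the natural candidates for a search against public barcode databases.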

1.3. How are these species related to each other: Tree of Life

One of the major challenges of modern biology is reconstructing the tree of life. Currently, much attention is paid to higher-level relationships, but eventually all species should be placed given the importance of phylogenetic relationships for understanding evolutionary processes, species distributions, and species sensitivities to extinction. New techniques are thus needed for reconstructing species-rich and well-supported trees. However, obtaining sufficient character information for such trees is not easy and often prohibitively expensive.

In Chapters 3 and 4, I explore NGS-based techniques for accurately placing many species on the tree of life. This aids in building species-rich trees for tropical arthropods. These techniques again have to be scalable, given the large number of species that need to be added to the tree. Increasing taxon sampling also has the added effect of improving overall phylogenetic accuracy (Zwickl & Hillis, 2002) and breaking up spurious long branches (Hendy & Penny, 1989). Phylogenetic accuracy can also be improved through multi-locus character sampling, which has been shown to help reconstruct more robust evolutionary relationships by reducing stochastic error or character sampling bias (Delsuc et al., 2002; Phillips et al., 2004). Studies on the effects of character and taxon sampling therefore suggest that one should sample multiple genes and many species in order to obtain reliable phylogenetic trees.

1.4. Phylogenetics via genome skimming and multiplexed tagged amplicon sequencing

Mitochondrial markers generally evolve very fast and tend to provide many variable sites. However, they do not perform well for deeper nodes because they have a higher likelihood of reaching substitution saturation due to homoplasy (Zardoya & Meyer, 1996). In addition, translocated mitochondrial genes in the nuclear genomes (NUMTs) can yield spurious signals (Zhang & Hewitt, 1996; Bensasson et al., 2001). Nuclear markers, while better for resolving relationships at the basal nodes (Townsend et al., 2008), are potentially difficult to work with due to heterozygosity, low substitution rate and copy number, as well as a potentially high frequency of paralogs. Integrating mitochondrial and nuclear markers would hence allow each to compensate for the other’s weaknesses (Rubinoff & Holland, 2005).

Mitochondrial genomes are a potential source of useful phylogenetic characters (Simon et al., 1994). In this thesis I test three approaches for obtaining mitochondrial genomes for many species: (1) multiplexed tagged amplicons, (2) genome skimming and (3) genome skimming after hybrid enrichment. In Chapter 3, I show that genome skimming, with multiple baits derived from multiplexed tagged amplicon sequencing, is an effective approach. I evaluate genome skimming with hybrid enrichment in Chapter 4 but find that it produces highly fragmented contigs, and that even genome skimming with a single bait is preferable.

Mitochondrial metagenomics (MMG) is a pipeline for cost-effective recovery of mitochondrial genomes for many species (Crampton-Platt et al., 2016) via genome skimming. This entails the recovery of naturally enriched markers (eg. mitochondrial, plastid) through shotgun sequencing (Straub et al., 2012) of pooled genomic DNA. Mitochondrial reads are identified, assembled into contigs and assigned to species through amplicons of known species identity (“baits”). This process is easily scalable as no enrichment step is necessary and genomic DNA of multiple species can be pooled and sequenced in a single Illumina lane, as long as closely related species are assigned to different libraries. Although the cox1 NGS barcode can function as a single bait, I show in Chapter 3 that multiple baits help with increasing data recovery as well as detecting chimeric assemblies. I also demonstrate that multiple baits can be sequenced efficiently using multiplexed tagged amplicon sequencing.
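The logic of assigning contigs to species through baits, and of using several baits to flag chimeras, can be sketched as follows. This is a deliberately simplified illustration that treats a bait as matching when it occurs as an exact substring of a contig; an actual pipeline would rely on a sequence similarity search, and the function name, markers and sequences are all hypothetical.

```python
def assign_contigs(contigs, baits):
    """Assign contigs to species via baits of known identity.

    contigs: {contig_id: sequence}
    baits:   list of (species, marker, sequence) tuples
    A bait "hits" a contig if it occurs as an exact substring (a simplification).
    """
    assignments = {}
    for contig_id, contig_seq in contigs.items():
        hits = {species for species, _marker, bait_seq in baits
                if bait_seq in contig_seq}
        if not hits:
            status = "unassigned"
        elif len(hits) == 1:
            status = "assigned"
        else:
            # Baits from different species on one contig suggest a chimera.
            status = "possible chimera"
        assignments[contig_id] = (status, sorted(hits))
    return assignments


# Hypothetical baits: two markers for sp1, one for sp2.
baits = [
    ("sp1", "cox1", "ACGTTGCA"),
    ("sp1", "cytb", "GGGTTTAA"),
    ("sp2", "cox1", "CCATGGAT"),
]
contigs = {
    "contig_1": "TT" + "ACGTTGCA" + "A" + "GGGTTTAA" + "CC",   # both sp1 baits
    "contig_2": "AA" + "ACGTTGCA" + "TT" + "CCATGGAT" + "GG",  # sp1 and sp2 baits
    "contig_3": "TTTTTTTTTTTT",                                # no bait
}
assignments = assign_contigs(contigs, baits)
```

With a single bait per species, `contig_2` could not have been recognised as suspicious, which is the motivation for sequencing multiple baits.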

Nuclear markers, however, are not as easily sampled through genome skimming. They are generally more conserved, thereby requiring more libraries to ensure that the markers in the same pool are sufficiently distinct to be assigned to the correct species using bioinformatics tools. This is why I again use multiplexed tagged amplicon sequencing in Chapter 3 to obtain full 28S rDNA data. The amplicons are designed to tile the full region with overlapping ends, which help with downstream assembly and lower the risk of chimeric assembly. Both MMG and multiplexed tagged amplicon sequencing are used in Chapter 3 to sequence both mitochondrial genome and 28S rDNA markers (~20,000-bp) for reconstructing a tree with 459 terminals. The pipeline is designed to be as scalable as possible.

1.5. Exploring hybrid enrichment for mitochondrial genome skimming

Chapter 4 evaluates another technique for obtaining mitochondrial genomes at a lower cost through the use of hybrid enrichment. This technique involves the use of probes that target desired regions of the genome and increase their representation in the sequencing pool. In Chapter 4, it is applied to a pool of 3683 species of tropical Coleoptera to increase the proportion of mitochondrial reads. The results are compared to conventional MMG. Hybrid enrichment has been used throughout molecular biology for various purposes such as exon capture (Hodges et al., 2007) and enrichment of degraded DNA from museum specimens (Bi et al., 2013). In phylogenomics, the probes are typically designed to target ultra-conserved elements in the nuclear genome, with most of the informative characters located in the sequences flanking these regions (Lemmon et al., 2012; Brandley et al., 2015), with the exception of exon-capture, which utilizes homologs (Bragg et al., 2016). This technique also has the potential to improve the MMG pipeline through the design of probes that target and enrich mitochondrial DNA. This is worth investigating as a genome skimming pool usually contains only 1 – 5% mitochondrial reads, while the remaining >95% of the data remains unused.

Hybrid capture for organelle DNA enrichment has mainly been used for human mitochondrial genomes (Maricic et al., 2010; Templeton et al., 2013), angiosperm plastid genomes (Stull et al., 2013) and degraded material from museum specimens (Mason et al., 2011). Attempts to apply hybrid enrichment for mitochondrial DNA on a broad range of taxa are currently rare and exist as proofs of concept (Liu et al., 2016). Hence, in Chapter 4, we apply this technique at a large scale on a hyper-diverse clade of tropical Coleoptera to examine whether hybrid enrichment or increased sequencing depth is the more scalable approach to improving the MMG pipeline.

Chapter 2

It doesn’t have to be full-length: some mini-barcodes perform as well as full-length cox1 barcodes

2.1. Abstract

The DNA barcode standards for metazoans were defined >10 years ago based mostly on practical considerations (e.g., availability of universal primers, sequence length matching ABI sequencing). However, these standards are now interfering with the use of cost-effective short-read sequencing, which is extensively used for metabarcoding of environmental samples and particularly suitable for obtaining barcodes of specimens with low DNA template quality (e.g., museum specimens). We here test the performance of mini-barcodes, which have the advantage of amplifying more readily, working better for samples with degraded template DNA (ancient DNA, gut/faecal content, eDNA), and being sufficiently short for cost-effective multiplexing on Illumina platforms. But do they provide as much species-level information as full-length barcodes? This is assessed using two criteria, viz. (1) congruence with morphology and (2) stability of molecular operational taxonomic units (mOTUs) across different clustering methods and thresholds. The test is based on 20 published datasets covering 29,288 specimens for 5,500 species, for which all specimens were first sorted/identified based on morphology before being barcoded. Such datasets provide independent data for testing whether mOTUs based on barcodes of different length (94 – 657-bp) and position within the Folmer region of cox1 show significantly different levels of congruence with morphology. We find that short mini-barcodes (<150-bp) in the 5’ end of the Folmer region perform significantly worse than the full-length barcode. In contrast, mini-barcodes >150-bp in length that are situated at the 3’ end of the Folmer region show no significant performance differences to full-length barcodes. We also test whether mOTUs based on mini-barcodes are less stable when using different clustering methods (ABGD and objective clustering) and thresholds (p-distances and priors). We again find no significant differences as long as the mini-barcodes exceed 150-bp in length and are predominantly located in the second half of the barcode region. Overall, the results are good news for projects that use metabarcoding or require barcodes for specimens with low DNA quality.

2.2. Introduction

DNA barcoding and the standardization of barcoding markers (Hebert et al., 2003) have led to the development of large barcode libraries which have proven invaluable for species recognition, species identification and the study of species interactions. The standardized barcode for animals is a 657-bp region of cox1 that has been called the Folmer region after the team that designed universal metazoan primers (Folmer et al., 1994). This region is currently the predominant target for barcoding, especially for barcodes that are intended to be submitted to the Barcode of Life Data systems (BOLD), which requires a minimum length of 500-bp with less than 1% Ns for formal barcode status (Ratnasingham & Hebert, 2007).
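The two sequence criteria mentioned here (minimum length and proportion of ambiguous bases) are simple to check programmatically. The sketch below encodes only these two criteria as stated in the text; the function name is hypothetical and any further requirements for formal barcode status are not modelled.

```python
def meets_sequence_criteria(sequence, min_length=500, max_n_fraction=0.01):
    """Check the two criteria named in the text: >=500 bp and <1% Ns.

    Alignment gaps ("-") are ignored when measuring length (an assumption).
    """
    seq = sequence.upper().replace("-", "")
    if len(seq) < min_length:
        return False
    return seq.count("N") / len(seq) < max_n_fraction


full_length = "ACGT" * 150                 # 600 bp, no ambiguous bases
too_short = "ACGT" * 100                   # 400 bp
few_ns = "ACGT" * 150 + "N" * 5            # 5 / 605 = 0.83% Ns
many_ns = "ACGT" * 150 + "N" * 10          # 10 / 610 = 1.64% Ns

results = {
    "full_length": meets_sequence_criteria(full_length),
    "too_short": meets_sequence_criteria(too_short),
    "few_ns": meets_sequence_criteria(few_ns),
    "many_ns": meets_sequence_criteria(many_ns),
}
```

Under these criteria, none of the mini-barcodes discussed in this chapter would qualify on length alone, whatever their information content.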

However, there are some notable drawbacks to obtaining full-length barcodes. Firstly, for some taxa, the target region is difficult to amplify with universal primers (Hoareau & Boissin, 2010; Geller et al., 2013). Secondly, the full-length barcode is difficult to amplify when the DNA of the starting material is degraded. This is the rule rather than the exception for scientists working with environmental samples or old museum specimens (Hajibabaei & McKenna, 2012). Lastly, the 657-bp barcode is too long to be sequenced as a single amplicon on short-read next-generation sequencing platforms (e.g. Illumina). Relief is in sight in the form of Oxford Nanopore MinION (Srivathsan et al., 2018) and PacBio Sequel (Hebert et al., 2018) sequencing, but these methods still require the amplification of long amplicons and are less cost-effective than Illumina sequencing (Wang et al., 2018).

As a response to these shortcomings, mini-barcodes have been developed, which utilize primers that amplify shorter subsets of the original barcode region. They are easier to amplify, more suitable for work on degraded starting material, and can be sequenced with Illumina sequencers. Primers for mini-barcodes have been designed for large clades in the Metazoa (Hajibabaei et al., 2006; Meusnier et al., 2008; Hebert et al., 2013; Little, 2014) or for specific taxa such as fruit flies, catfish and sharks (Fan et al., 2009; Bhattacharjee & Ghosh, 2014; Fields et al., 2015). Mini-barcodes are also widely used whenever template DNA is degraded, a large number of specimens have to be analysed, or a sample contains DNA from multiple taxa (“metabarcoding”). Common applications of mini-barcodes are the sequencing of museum specimens (Zuccon et al., 2012; Hebert et al., 2013), processed food products (Armani et al., 2015; Shokralla et al., 2015b), water/soil/faecal environmental DNA (eDNA) (Epp et al., 2012; Lim et al., 2016; Srivathsan et al., 2015) and biodiversity surveys including thousands of specimens (Meier et al., 2016; Wang et al., 2018; Yeo et al., 2018).

Given their ubiquitous use, it is surprising that the efficacy of mini-barcodes for species identification and delimitation has not been sufficiently tested. For example, the largest published study includes only 6695 barcodes obtained from GenBank (Meusnier et al., 2008), while most other studies focus on one or two taxa and include considerably fewer specimens (<5000: e.g. Hajibabaei et al., 2006; Yu & You, 2010). These studies yielded conflicting results with regard to the suitability of mini-barcodes for species identification and/or delimitation. Hajibabaei et al. (2006) find high congruence with the full-length barcode when species are delimited based on mini-barcodes and Meusnier et al. (2008) find similar BLAST identification rates for mini-barcodes and full-length barcodes in their in silico tests. However, Yu & You (2010) concede that mini-barcodes may have poorer accuracy despite having close structural concordance with the full-length barcode, and Sultana et al. (2018) state that the ability to identify species is compromised when the barcodes are too short (<150-bp). Even fewer studies test the amplification success rates for different mini-barcodes, but Arif et al. (2011) find that the universal mini-barcode primers designed by Meusnier et al. (2008) had poor amplification rates for their taxa (desert vertebrates). In summary, the existing tests of mini-barcodes are fairly small-scale and pertain to mini-barcodes of different length and position. Furthermore, the performance of mini-barcodes is usually assessed relative to results obtained with full-length barcodes. It is thus often implicitly assumed that molecular operational taxonomic units (mOTUs) derived from full-length barcodes are correct and all conflict with mOTUs based on mini-barcodes is due to the inferior performance of the latter.

In this study, we focus on testing the performance of mini-barcodes for species delimitation. We compare mOTUs obtained for different-length barcodes and clustering methods. However, instead of assuming that the results based on full-length barcodes are correct, we use morphology as an external source of data that allows for testing whether mOTUs obtained based on full-length barcodes are indeed more likely to be congruent with morphology. We evaluate congruence at both the species and specimen level. For the latter, we determine the number of specimens that have to be removed from a dataset in order to obtain perfect congruence between the two sources of data. We consider barcode lengths to perform particularly well if they yield high congruence and/or require the removal of few specimens. In addition, we also evaluate barcodes of different lengths with regard to decisiveness; i.e., mOTU stability when different clustering thresholds and algorithms are used. Stability is desirable because finding appropriate clustering parameters is difficult; i.e., it is desirable to use barcodes that are largely insensitive across a wide set of parameters. However, decisiveness is here treated as a secondary criterion because it can be misleading when mini-barcodes lack sufficient variability to distinguish closely related species (see Discussion).
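The two congruence measures can be made concrete with a small sketch. The exact procedure is not specified in this passage, so the code below is one possible reading: species-level congruence counts morphospecies whose specimens form exactly one mOTU, and the specimen-level measure is approximated with a greedy pairing that gives an upper bound on the number of specimens to remove. Function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def species_level_congruence(motu_of, morpho_of):
    """Count morphospecies whose specimen set is identical to one mOTU."""
    motu_sets, morpho_sets = defaultdict(set), defaultdict(set)
    for specimen in morpho_of:
        motu_sets[motu_of[specimen]].add(specimen)
        morpho_sets[morpho_of[specimen]].add(specimen)
    motu_groups = {frozenset(s) for s in motu_sets.values()}
    return sum(frozenset(s) in motu_groups for s in morpho_sets.values())


def specimens_to_remove_greedy(motu_of, morpho_of):
    """Greedy upper bound on specimens to remove for perfect congruence.

    Pairs each mOTU with at most one morphospecies (largest overlaps first)
    and keeps only specimens falling inside a chosen pair.
    """
    overlap = Counter((motu_of[s], morpho_of[s]) for s in morpho_of)
    used_motus, used_morphos, kept = set(), set(), 0
    for (motu, morpho), n in sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0])):
        if motu not in used_motus and morpho not in used_morphos:
            used_motus.add(motu)
            used_morphos.add(morpho)
            kept += n
    return len(morpho_of) - kept


# Toy example: specimen s3 is lumped with morphospecies B by the barcode.
morpho_of = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "C"}
motu_of = {"s1": "m1", "s2": "m1", "s3": "m2", "s4": "m2", "s5": "m2", "s6": "m3"}

congruent_species = species_level_congruence(motu_of, morpho_of)
to_remove = specimens_to_remove_greedy(motu_of, morpho_of)
```

In this toy case only morphospecies C is congruent at the species level, yet removing the single specimen s3 would make all three congruent, which shows why the two measures are reported side by side.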

In order to test barcodes of different lengths using congruence and decisiveness, one needs datasets for which all specimens have full-length barcodes and morphological grouping information. We here use 20 such datasets, which comprise almost 30,000 specimens belonging to 5500 morphological species. We only retained those specimens for which the full-length DNA barcode was available. We then assessed the performance of nine mini-barcodes (length: 94 – 407-bp), which were chosen because they have been used repeatedly in the literature and thus have tested primers. We use both objective clustering and Automatic Barcode Gap Discovery (ABGD; Puillandre et al., 2012) algorithms for species delimitation. The former utilizes an a priori distance threshold to group sequences into clusters, while the latter groups sequences into clusters based on an initial prior and recursively uses incremental priors to find stable partitions. We here avoid tree-based species delimitation methods as these methods are heavily dependent on the accuracy of the input tree and it is generally difficult to infer a well-resolved tree from very short single-locus markers with few variable sites (Tang et al., 2014; Luo et al., 2018). Note that our analysis does not imply that morphology is more suitable for species delimitation than barcodes. Instead, we test whether shortening barcodes influences congruence with morphology; i.e. morphology is treated as a constant while testing whether barcode length and/or position influences the number of morpho-species that are recovered. Given that morphology is a generally accepted criterion for species delimitation, significantly poorer congruence between barcode and morphology-derived species can be interpreted as evidence that a mini-barcode performs more poorly than the full-length barcode.
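Threshold-based clustering of the kind described for objective clustering can be sketched as follows. The sketch assumes single-linkage behaviour (a sequence joins a cluster if it is within the threshold of at least one member) and uncorrected p-distances on aligned sequences; it is an illustration with hypothetical names and toy data, not the software used in this study, and ABGD is not reproduced.

```python
def p_distance(seq_a, seq_b):
    """Uncorrected pairwise distance between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def threshold_clustering(seqs, threshold):
    """Single-linkage clustering: link any two sequences within the threshold."""
    ids = list(seqs)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for index, a in enumerate(ids):
        for b in ids[index + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Toy 20-bp sequences: s1-s2 and s2-s3 differ by 5%, s1-s3 by 10%.
seqs = {
    "s1": "A" * 20,
    "s2": "A" * 19 + "C",
    "s3": "A" * 18 + "CC",
    "s4": "G" * 20,
}
clusters_5pct = threshold_clustering(seqs, threshold=0.05)
clusters_4pct = threshold_clustering(seqs, threshold=0.04)
```

At a 5% threshold s1 and s3 end up in the same cluster through s2 even though they differ by 10%, which illustrates why cluster composition can shift with the chosen threshold and why stability across thresholds is worth measuring.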


2.3. Materials & Methods

2.3.1. Survey of mini-barcodes

We identified nine mini-barcodes within the cox1 Folmer region (Fig. 1 & Table S1) that have been repeatedly used in the literature published after 2003 and are used for a broad range of taxa.

Figure 1: Map of the mini-barcode positions and lengths within the 657-bp Folmer barcode region, with the target taxa they were originally designed for

We then surveyed the barcoding literature in order to identify publications that cited the original Hebert et al. (2003) barcoding paper and met the following criteria: 1) the barcoded specimens were pre-sorted/identified based on morphology and 2) the dataset had at least 500 specimens with cox1 barcodes >656-bp. This 500-specimen cut-off is useful for identifying studies with a sufficiently large sample size because 1) the effects of occasional identification mistakes are minimized and meaningful comparisons can be obtained, 2) we are able to aggregate a dataset much larger than those of previous attempts at mini-barcode assessment and 3) these studies more closely resemble the datasets encountered by the target audience of the thesis: researchers working with highly abundant, species-rich taxa. We identified 20 such datasets (Table 1) with >500 barcodes after removal of imprecisely sorted material (e.g. lacking a species-level identification) and short sequences <657-bp (while the full-length barcode is technically 658-bp long, a 1-bp concession was made to prevent discarding too many sequences). The data were downloaded from BOLDSystems or NCBI GenBank.


Table 1. Barcoding datasets used in this study

| Dataset | Original Barcode Count | Barcodes Without Species ID | Barcodes <657-bp | Filtered Barcode Count | No. of Morphospecies | Reference |
|---|---|---|---|---|---|---|
| Canadian echinoderms | 999 | 122 | 244 | 633 | 104 | Layton et al., 2016 |
| Ecuador Geometridae | 3998 | 3296 | 76 | 626 | 239 | Brehm et al., 2016 |
| Tanytarsus Chironomidae | 2796 | 464 | 1793 | 539 | 101 | Lin et al., 2015 |
| European marine fish | 3970 | 0 | 574 | 3396 | 267 | Oliveira et al., 2016 |
| German Ephemeroptera, Plecoptera & Trichoptera | 2613 | 68 | 466 | 2079 | 340 | Morinière et al., 2017 |
| Pakistani Lepidoptera | 4503 | 1973 | 324 | 2206 | 479 | Ashfaq et al., 2017 |
| North American birds | 2116 | 0 | 814 | 1302 | 472 | Kerr et al., 2007 |
| Northwest Pacific molluscs | 2757 | 8 | 1322 | 1427 | 388 | Li et al., 2016 |
| French Guianan earthworms | 651 | 43 | 96 | 512 | 41 | Decaëns et al., 2016 |
| German Aranea & Opiliones | 3538 | 1 | 206 | 3331 | 584 | Astrin et al., 2016 |
| Great Barrier Reef fish | 983 | 17 | 283 | 683 | 258 | Steinke et al., 2017 |
| North American Pyraustinae | 1589 | 49 | 554 | 986 | 96 | Yang et al., 2016 |
| North Sea molluscs | 579 | 1 | 49 | 529 | 108 | Barco et al., 2016 |
| South America butterflies | 2027 | 15 | 96 | 1916 | 406 | Lavinia et al., 2017 |
| South China Sea fish | 1353 | 14 | 0 | 1339 | 264 | Hou et al., 2018 |
| Congo fish | 821 | 69 | 101 | 651 | 163 | Decru et al., 2016 |
| Ecuador Chrysomelidae | 674 | 0 | 0 | 674 | 252 | Thormann et al., 2016 |
| North European Tachinidae | 884 | 34 | 124 | 726 | 327 | Pohjoismäki et al., 2016 |
| Iberia butterflies | 5278 | 16 | 123 | 5139 | 299 | Dincă et al., 2015 |
| Amazonian | 1128 | 135 | 399 | 594 | 312 | Lamarre et al., 2016 |
| TOTAL | 43257 | 6325 | 7644 | 29288 | 5500 | |

The primers for the various mini-barcodes were aligned to the homologous regions of each dataset with MAFFT v7 (--addfragments option; Katoh & Standley, 2013) in order to identify the precise position of the mini-barcodes within the full-length barcode. The mini-barcode subset of each full-length barcode was then identified from this alignment.
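The extraction step can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function names are hypothetical, the primer rows are assumed to be already aligned to the barcodes (gaps everywhere except at the primer site, with the reverse primer in the same orientation as the barcodes), and the mini-barcode is assumed to be the region between the two primers, excluding the primers themselves.

```python
def non_gap_span(aligned_row):
    """Return (first, last + 1) column indices of the non-gap part of an aligned row."""
    cols = [i for i, c in enumerate(aligned_row) if c not in "-."]
    if not cols:
        raise ValueError("aligned row contains only gaps")
    return cols[0], cols[-1] + 1


def mini_barcode_columns(fwd_primer_row, rev_primer_row):
    """Columns between the end of the forward primer and the start of the reverse primer."""
    _, fwd_end = non_gap_span(fwd_primer_row)
    rev_start, _ = non_gap_span(rev_primer_row)
    if rev_start <= fwd_end:
        raise ValueError("primers overlap or are in the wrong order")
    return fwd_end, rev_start


def extract_mini_barcodes(aligned_barcodes, fwd_primer_row, rev_primer_row):
    """Slice the mini-barcode region out of every aligned full-length barcode."""
    start, end = mini_barcode_columns(fwd_primer_row, rev_primer_row)
    return {name: seq[start:end] for name, seq in aligned_barcodes.items()}


# Toy example with a 16-column alignment (hypothetical sequences).
toy_alignment = {"s1": "ACGTTTGGGCCCTTAA"}
toy_fwd = "ACG" + "-" * 13
toy_rev = "-" * 12 + "TTAA"
toy_mini = extract_mini_barcodes(toy_alignment, toy_fwd, toy_rev)
```

Because all barcodes share the same alignment columns, a single pair of column indices per primer set is sufficient to subset a whole dataset.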

2.3.2. Species delimitation and testing

The mini-barcode datasets, as well as the full-length barcode datasets, were clustered using objective clustering (Meier et al., 2006; script unpublished), but only mOTUs clustered at 2 – 4% uncorrected p-distance thresholds were used given that these are the typical distance thresholds used for species delimitation in the literature (Ratnasingham & Hebert, 2013). The same datasets were also clustered with ABGD (Puillandre et al., 2012) using the default range of priors and uncorrected p-distances, but the minimum slope parameter (-X) was reduced in a stepwise manner (1.5, 1.0, 0.5, 0.1) if the algorithm could not find a partition. We then considered the ABGD clusters at priors P=0.001, P=0.01 and P=0.04 in this study. The prior (P) refers to the maximum intraspecific divergence and functions similarly to a p-distance threshold at the first iteration, before the partition is refined by recursive application of the ABGD algorithm.
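As the clustering script is unpublished, the following Python sketch only illustrates the general idea of threshold-based clustering and should not be read as the implementation used here. It assumes that objective clustering links any two sequences whose uncorrected p-distance is at or below the threshold, so that clusters are the connected groups of such links; the handling of ambiguous bases (ignored) and of the boundary case (distance equal to the threshold counts as a link) are likewise assumptions.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected p-distance over sites where both sequences have A, C, G or T."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def threshold_clusters(seqs, threshold):
    """Group sequences connected by at least one distance <= threshold."""
    parent = {name: name for name in seqs}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(list(seqs), 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return [frozenset(members) for members in clusters.values()]


# Toy example: a, b and c form a chain of 2% steps; d is unrelated.
toy_seqs = {
    "a": "A" * 50,
    "b": "A" * 49 + "C",
    "c": "A" * 48 + "CC",
    "d": "C" * 50,
}
clusters_3pct = threshold_clusters(toy_seqs, 0.03)
clusters_1pct = threshold_clusters(toy_seqs, 0.01)
```

In the toy example, a and c differ by 4% but are still placed in the same cluster at a 3% threshold because both are within 2% of b. Under this linking rule, the largest distance within a cluster can therefore exceed the threshold.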

2.3.3. Performance assessment

We assess the performance of the mini-barcodes by using morphospecies units as an external arbiter that allows us to test whether units delimited based on full-length or mini-barcodes are more likely to be congruent with morphology. Congruence was quantified at the species level using match ratios (Ahrens et al., 2016) between molecular and morphological groups. The match ratio is defined as 2 * Nmatch / (N1 + N2), where N1 and N2 are the numbers of clusters obtained by the two delimitation methods/thresholds being compared and Nmatch is the number of clusters that are identical in both. Incongruence between morphology and mOTUs is caused by specimens that are assigned to the “incorrect” mOTUs as judged by morphological information. If all mOTUs are represented by the same number of specimens, assessing conflict between sequences and morphology at the species or specimen level yields identical results. However, specimen abundances are rarely equal across species and hence conflict/congruence should also be quantified at the specimen level. This is because ultimately, biologists aim to accurately classify or identify specimens. If congruence is assessed at the species level, both common and rare species are treated as equals, even though human error in sorting common species results in a larger number of specimens being deemed as misidentified. One way to assess congruence between morphology and barcodes at the specimen level is to count the number of specimens that have to be removed in order to generate perfect congruence between morphology and barcodes.
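The match ratio defined above can be computed directly from two partitions of the same specimens. The following Python sketch is an illustration only; the function name and the representation of clusters as sets of specimen identifiers are assumptions.

```python
def match_ratio(clusters_1, clusters_2):
    """2 * Nmatch / (N1 + N2) for two partitions given as collections of specimen sets."""
    set_1 = {frozenset(c) for c in clusters_1}
    set_2 = {frozenset(c) for c in clusters_2}
    if not set_1 and not set_2:
        raise ValueError("both partitions are empty")
    n_match = len(set_1 & set_2)  # clusters with identical specimen content
    return 2 * n_match / (len(set_1) + len(set_2))


# Toy example: three morphospecies versus two mOTUs (hypothetical specimens).
toy_morphospecies = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
toy_motus = [{"s1", "s2"}, {"s3", "s4", "s5"}]
toy_ratio = match_ratio(toy_morphospecies, toy_motus)  # 2 * 1 / (3 + 2) = 0.4
```

Only one cluster is identical in both partitions of the toy example, so the match ratio is 0.4; two identical partitions yield a match ratio of 1.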

In order to test whether barcode length is a significant predictor of congruence, MANOVA tests were carried out in R (R Core Team, 2017) with dataset and mini-barcode as explanatory variables. Since most of the variance in this study was generated by the dataset (P < 0.05 in MANOVA tests), linear mixed effects models were used (R package lme4: Bates, 2010) with match ratio (species-level congruence) or number of specimens to be removed for obtaining perfect congruence (specimen-level congruence) as the response variables. The type of mini-barcode was the explanatory variable (categorical) and the variable “dataset” was treated as a random effect. The emmeans R package (Lenth, 2018) was then used to perform pairwise post-hoc Tukey tests between mini- and full-length barcodes. Given that interpreting significance results of the mixed effects model is complicated, with approximations being necessary in order to obtain these values (see Luke, 2017), and that the significance of the factors themselves was of less importance than the pairwise Tukey tests, only the latter are reported here. To compare the differences in performance between objective clustering and ABGD for the various mini-barcodes, pairwise Mann-Whitney U-tests were carried out in R. The non-parametric Mann-Whitney U-test was used as the distributions were non-normal, as evidenced by significant Shapiro-Wilk test results.

An arguably weaker but nevertheless important criterion for assessing the quality of mini-barcodes is decisiveness. We consider congruence with morphology to be a more important criterion than decisiveness because it involves an external source of evidence that allows for the assessment of barcodes of very different lengths. Decisiveness is nevertheless a useful additional criterion for selecting barcodes of an appropriate length because mOTU stability is a desirable property as long as it does not interfere with congruence. We therefore tested whether mOTU numbers were stable across a range of p-distance thresholds (2 – 4%) and ABGD priors (P = 0.001 – 0.1, unless all sequences were lumped into a single cluster). In order to quantify decisiveness, we determined the dispersion (variance/mean) of the mOTU numbers for each dataset, before using the same statistical tests as for congruence to test for significant differences between barcodes of different lengths. The only difference was that dispersion was used as the response variable.
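As an illustration of the dispersion measure, the following Python sketch computes variance/mean for the mOTU numbers obtained with one barcode across several thresholds. The text does not state whether the sample or the population variance was used; the sample variance is assumed here, and the mOTU counts are invented for the example.

```python
from statistics import mean, variance


def dispersion(motu_counts):
    """Variance-to-mean ratio of mOTU numbers across thresholds or priors.

    Assumes the sample variance (n - 1 denominator); needs at least two counts.
    """
    average = mean(motu_counts)
    if average == 0:
        raise ValueError("mean mOTU count is zero")
    return variance(motu_counts) / average


# Hypothetical mOTU counts at 2%, 3% and 4% thresholds for two barcodes.
stable_counts = [100, 100, 100]
labile_counts = [110, 100, 90]
stable_dispersion = dispersion(stable_counts)
labile_dispersion = dispersion(labile_counts)
```

A barcode whose mOTU number does not change across thresholds has a dispersion of zero; larger values indicate lower decisiveness.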

Nucleotide diversity was also considered as a possible explanation for mini-barcode performance, especially when fixed distance thresholds are used. To explore levels of nucleotide diversity across the barcode, the proportions of variable nucleotide and amino acid sites were also obtained via MEGA6 (Tamura et al., 2013).
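The proportion of variable sites was obtained with MEGA6; the Python sketch below only illustrates what is being measured for nucleotide alignments. It treats a column as variable if it contains more than one unambiguous base and ignores gaps and ambiguity codes, which is an assumption and may differ from how MEGA6 handles such characters.

```python
def percent_variable_sites(aligned_seqs, alphabet="ACGT"):
    """Percentage of alignment columns with more than one state from the alphabet."""
    if not aligned_seqs:
        raise ValueError("no sequences given")
    length = len(aligned_seqs[0])
    if length == 0 or any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    variable = 0
    for column in zip(*aligned_seqs):
        states = {c for c in column if c in alphabet}
        if len(states) > 1:
            variable += 1
    return 100.0 * variable / length


# Toy alignment: columns 3 and 4 are variable, columns 1 and 2 are not.
toy_variable = percent_variable_sites(["ACGT", "ACGA", "ACCT"])
```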

2.4. Results

2.4.1. Congruence with morphology

Regardless of which clustering algorithm is used, the 657-bp barcode performs significantly better than mini-barcodes <150-bp. However, there are no significant differences once the barcode length exceeds this length threshold. In addition, both clustering algorithms yield mOTUs with very similar congruence levels to morphology.


Further evidence for the poor congruence of very short and good congruence of longer barcodes comes from heatmaps of match ratios, in which the colour gradient is applied within each dataset and threshold and the best match ratio is marked with an asterisk. For objective clustering, we find no significant differences between the full-length and mini-barcodes for all mini-barcodes >150-bp (Fig. 2 & 4). Contrary to expectation, congruence at the species level is maximized for a range of different-length barcodes instead of only the full-length barcode (407-bp and 657-bp for 2%, 407-bp for 3%, 313-bp for 4%) (Fig. 3). Only mini-barcodes <150-bp and one mini-barcode of 307-bp length perform noticeably worse.

For mOTUs obtained with ABGD, the results are again similar in that mini-barcodes >150-bp have no significant differences in match ratios compared to the full-length barcode (Fig. 5 & 7). The exception is the analysis performed at P=0.04 where the full-length barcode maximizes congruence, but does not significantly out-perform the 295, 313, or 407-bp mini-barcodes. Additionally, with such a high prior, many of the shorter mini-barcodes collapse all barcodes into a single mOTU (Fig. 5). On average, objective clustering yields higher match ratios (2%: 0.79, 3%: 0.80, 4%: 0.76) than ABGD (P=0.001: 0.75, P=0.01: 0.76, P=0.04: 0.48), which indicates an overall better performance of objective clustering. Note that the full-length barcode only outperforms the mini-barcodes for the prior (P=0.04) that yields the overall lowest match ratios (Fig. 6).

Congruence can also be determined at the specimen level by determining the number of specimens that have to be removed in order to obtain a perfect match ratio of 1. Based on this criterion, mini-barcodes >150-bp perform significantly better, although the 94-bp barcode performs better than the 130 and 145-bp mini-barcodes (Fig. 3 & 4). For ABGD, two priors (P=0.001 and P=0.01) again out-perform the third prior in that they require the removal of the smallest number of specimens in order to generate perfect congruence (Fig. 6 & 7). Note that on average, objective clustering required the removal of fewer specimens (2%: 2322, 3%: 3611, 4%: 5108) than ABGD (P=0.001: 4377, P=0.01: 5491, P=0.04: 17894).

Figure 2: Match ratios for 20 datasets when clustered with objective clustering at 2 – 4% p-distance thresholds


Figure 3: Heatmap of match ratios (congruence at the species level) and number of specimens that require removal for perfect congruence (congruence at the specimen level) for objective clustering at 2%, 3% and 4% p-distance thresholds, shown for each of the 20 datasets and each barcode length (94, 130, 145, 164, 189, 295, 307, 313, 407 and 657-bp), with red representing low and green high congruence values. The highest values for each dataset and p-distance threshold are indicated. The total number of datasets with highest match ratios along with the range of match ratios and specimen numbers for different barcodes are also indicated at the bottom.


Figure 4: Pairwise post-hoc Tukey tests of objective clustering mOTU (2, 3 & 4% p-distance thresholds) congruence with morphology in terms of match ratios for both species (left) and specimen-level assessments (right) for all 10 barcodes tested. P-values are indicated at the bottom left of each matrix and an asterisk (*) is used to indicate a significant difference (P < 0.05). Where differences are significant, the side at which the asterisk is placed indicates the higher match ratio for that mini-barcode length, as well as fewer specimens that need to be removed for perfect congruence


Figure 5: Match ratios for 20 datasets when clustered with ABGD for priors P=0.001, 0.01 & 0.04


Figure 6: Heatmap of match ratios (congruence at the species level) and number of specimens that require removal for perfect congruence (congruence at the specimen level) for ABGD at priors P = 0.001, 0.01 and 0.04, shown for each of the 20 datasets and each barcode length (94, 130, 145, 164, 189, 295, 307, 313, 407 and 657-bp), with red representing low and green high congruence values. The highest values for each dataset and prior are indicated. The total number of datasets with highest match ratios along with the range of match ratios and specimen numbers for different barcodes are also indicated at the bottom.


Figure 7: Pairwise post-hoc Tukey tests of ABGD mOTU (P = 0.001, 0.01 & 0.04) congruence with morphology in terms of match ratios for both species (left) and specimen-level assessments (right) for all 10 barcodes tested. P-values are indicated at the bottom left of each matrix and an asterisk (*) is used to indicate a significant difference (P < 0.05). Where differences are significant, the side at which the asterisk is placed indicates the higher match ratio for that mini-barcode length, as well as fewer specimens that need to be removed for perfect congruence


Figure 8: Match ratio comparisons between ABGD and objective clustering for each mini-barcode set. The barcode lengths and Mann-Whitney U-test p-values are indicated above each boxplot

2.4.2. Decisiveness

We find that longer mini-barcodes perform better in terms of decisiveness for objective clustering, but short mini-barcodes are more decisive when ABGD is used. Closer inspection of the ABGD clusters, however, reveals that this stability is due to consistent lumping; i.e., the stable mOTUs are in conflict with morphology.

When objective clustering is used, mini-barcodes >150-bp generally have low dispersion values and hence high decisiveness, in contrast to the 130 and 145-bp mini-barcodes (Fig. 10, left). Based on the heatmap (Fig. 9, left), the 313-bp mini-barcode out-performs all other barcodes for objective clustering, but the performance of the 295, 407 and 657-bp barcodes is overall very similar. For ABGD, the post-hoc analysis finds no significant differences in dispersion values (Fig. 10, right), but on the heatmap (Fig. 9, right), the shorter mini-barcodes seem to be performing much better. However, this result is deceptive because shorter barcodes have lower congruence with morphology and hence it is likely that incorrect mOTUs are delimited.

Figure 9: Heatmap of dispersion values for objective clustering (left) and ABGD (right), with red representing higher values and green lower values. The lowest values per dataset are indicated with an asterisk (*). The total number of datasets with the lowest dispersion values is indicated at the bottom

Figure 10: Pairwise post-hoc Tukey tests of dispersion (variance/mean) for both objective clustering (left) and ABGD (right) for all 10 barcodes tested. P-values are indicated at the bottom left of each matrix and an asterisk (*) is used to indicate a significant difference (P < 0.05). Where differences are significant, the side at which the asterisk is placed indicates the lower dispersion (i.e. greater decisiveness) for that mini-barcode length

2.4.3. Variable sites

We find that the 130 and 145-bp mini-barcodes have significantly higher proportions of variable sites for both nucleotide and amino-acid sequences (ANOVA test P = 0.005 for nucleotides, P = 0.003 for amino acids, with significant Tukey test results for the 130 and 145-bp mini-barcodes) (Fig. 11). The proportion of variable sites before and after translation is similar, although the variance for the amino acids is larger, which is presumably due to the smaller number of characters (codons vs single nucleotides).

Figure 11: Proportions of variable sites (%) for each barcode for both nucleotides and amino acids

2.4.4. Pairwise distances

Mini-barcodes can be thought of as a means to estimate the pairwise distance values that would be obtained with the full-length barcode. If the estimator were unbiased, the mean difference between the p-distances of specimen pairs based on the full-length barcode and on the mini-barcode should be zero. We find that longer mini-barcodes (295, 313 & 407-bp) do indeed have a mean that is close to zero, which indicates that they are good estimators of distances obtained from full-length barcodes. Not surprisingly, the differences between full-length and mini-barcode distances for the same specimen pairs decrease as the barcode lengths increase (Fig. 12). However, the poorly performing 307-bp mini-barcode is a noticeable exception. While the difference is not significant, the larger variance is a notable departure from the trend.
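The comparison of pairwise distances can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis script: the difference is taken as full-length minus mini-barcode distance, the mini-barcode is given as a column range of the aligned full-length barcode, and ambiguous sites are ignored.

```python
from itertools import combinations
from statistics import mean


def p_distance(a, b):
    """Uncorrected p-distance over sites where both sequences have A, C, G or T."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_differences(full_seqs, start, end):
    """Full-length minus mini-barcode p-distance for every specimen pair."""
    differences = []
    for a, b in combinations(sorted(full_seqs), 2):
        d_full = p_distance(full_seqs[a], full_seqs[b])
        d_mini = p_distance(full_seqs[a][start:end], full_seqs[b][start:end])
        differences.append(d_full - d_mini)
    return differences


# Toy example: 20-bp "full-length" barcodes, mini-barcode = columns 10 to 19.
toy_full = {
    "x": "A" * 20,
    "y": "A" * 18 + "CC",
    "z": "CC" + "A" * 18,
}
toy_differences = distance_differences(toy_full, 10, 20)
toy_mean_difference = mean(toy_differences)
```

In the toy example, the mini-barcode overestimates one distance and underestimates another, so the mean difference is approximately zero although the individual differences are not. The spread of the differences is therefore informative in addition to the mean.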


Figure 12: The difference in p-distances between the 657-bp barcode and the various mini-barcodes

2.5. Discussion

2.5.1. Should full-length barcodes be preferred?

This study tests the widespread assumption that full-length barcodes should automatically be preferred over mini-barcodes when it comes to species delimitation and species identification (Burns et al., 2007; Min & Hickey, 2007). Such a preference would be unfortunate because mini-barcodes have higher amplification success rates, are more suitable for degraded starting material, and can be conveniently sequenced on Illumina high-throughput sequencing platforms. Fortunately, our findings indicate that cox1 mini-barcodes with a length >150-bp that are located at the 3’ end of the Folmer region do not have significant performance differences when compared to the full-length 657-bp fragment. This conclusion is supported by two criteria, viz. congruence with morphology and decisiveness (unless ABGD is applied to very short barcodes: Fig. 5 – 7).


The main factor affecting congruence with morphology in this study is the dataset: the best match ratios obtained for the individual datasets range from 0.65 to 0.99 for objective clustering and 0.58 to 1.00 for ABGD. Indeed, compared to the effect of dataset, the choice to use either full-length or mini-barcode is largely secondary. This dataset-related variance is presumably caused by the taxon under study, the quality of the morphological work, or a combination of both. Given that this taxon/investigator effect is the most important reason for the greatly varying degree of congruence between morphology and DNA barcodes, it certainly deserves more study. In some cases, it is likely that there is a strong taxon effect. For example, it is known that cox1 barcodes do not perform well for anthozoans (Huang et al., 2008), so that taxon would be a substantial factor for such groups. Similarly, taxa with large numbers of cryptic species would be expected to yield low match ratios. However, more controlled studies would be desirable. For example, it would help if one could have multiple experts sort the same sample and then determine the variance caused by sorting (see Krell, 2004). Alternatively, one could use appropriate nuclear markers to determine whether there is a stronger taxon or sorting effect.

Given that we were aware of the strong dataset-specific effect, we treated it as a confounding factor in our analyses and instead focused on the additional effect caused by barcode length. We find that barcode length is not an important factor once the mini-barcode is >150-bp long. This conclusion is robust in that we analysed 20 different datasets and used two different clustering algorithms utilizing pairwise distances (objective clustering, ABGD). Overall, the species delimitation algorithm matters little unless ABGD is used to analyse extremely short mini-barcodes (<150-bp) or very high priors are used (e.g. P = 0.04). Under these circumstances, ABGD often collapses the barcodes for many species into one cluster (Fig. 6). Overall, we obtain slightly poorer match ratios on average for ABGD compared to objective clustering when priors 0.001 and 0.01 are used, and much poorer results when a prior of 0.04 is selected. For these reasons, we recommend the use of objective clustering and, for ABGD, lower priors (P ≤ 0.01; Fig. 5 – 7) because ABGD tends to over-lump species at higher priors. Overall, the selection of the best priors and/or clustering thresholds remains very difficult for studies of largely unknown faunas that lack morphological identifications and we recommend the use of multiple methods and thresholds in order to distinguish robust mOTUs from labile mOTUs that are heavily dependent on threshold or prior choice.

2.5.2. Are longer barcodes more decisive?

Barcode decisiveness involves the likelihood of converging on a set of results robust to different algorithms and parameters (i.e. distance thresholds or ABGD priors). Our main finding is that mini-barcodes longer than 150-bp again perform similarly well in terms of decisiveness. This is particularly apparent for objective clustering where mini-barcodes >150-bp have no significant performance differences to full-length barcodes and are significantly more decisive (Fig. 9 & 10, left) than the shorter mini-barcodes. The results are somewhat more confusing for ABGD because very short mini-barcodes (<150-bp) can be more decisive, but they yield mOTUs that are very likely to be incorrect because of drastic over-lumping. ABGD is essentially decisive at the expense of congruence. The largely invariant number of clusters under ABGD for very short barcodes is likely caused by the difficulty of identifying further partitions in p-distances based on very short sequences. We suspect that there is an initial underestimate of the number of clusters based on the chosen prior which is exacerbated by problems with finding further partitions in the greater spread of p-distances (Fig. 12). We would thus discourage the use of ABGD for very short mini-barcodes (94 – 145-bp). Objective clustering is more robust in this respect.

2.5.3. Which end of the barcode performs better?

We find that in general, mini-barcodes at the 3’ end of the Folmer region seem to out-perform others with regard to congruence with morphology and decisiveness. This is suggested by the performance of two mini-barcodes that perform better or worse than expected based on their lengths. The shortest mini-barcode (94-bp) occasionally out-performs the 130 and 145-bp mini-barcodes, while the 307-bp mini-barcode, which is in the “wrong” part of the barcode region, tends to under-perform compared to the shorter 295-bp fragment. Overall, the Folmer region of cox1 shows signs of unequal genetic variability which could generate a positional effect. The 130, 145 and 307-bp mini-barcodes are closer to the 5’ end of the Folmer region (Fig. 1), which has higher variability than the regions covered by the other mini-barcodes for most datasets tested (Fig. 11). This variability does matter as a higher variability reduces the ability to predict successful delimitation. This factor, however, deserves more scrutiny as it is dependent on the taxonomic breadth of the dataset and its evolutionary history. This positional effect is consistent with findings by Shokralla et al. (2015b), where the mini-barcodes at the 5’ end have poorer species resolution for fish species. However, this requires more rigorous testing by situating shorter mini-barcodes at the 3’ end of the Folmer region and running similar analyses.

2.5.4. Conclusion

Mini-barcodes have tremendous utility for species delimitation and facilitate work on degraded starting material and cost-saving NGS barcoding. It is thus encouraging that such mini-barcodes perform well across a large range of metazoan taxa as long as a few guidelines are followed. We recommend the use of mini-barcodes >150-bp at the 3’ end because they perform as well as or even better than the full-length barcode in terms of both congruence with morphology and decisiveness for both objective clustering and ABGD algorithms. This also justifies the use of the 313-bp mini-barcode in the subsequent chapters. In terms of species delimitation algorithm, there is no appreciable difference in performance between objective clustering and ABGD as long as the mini-barcodes are >150-bp. However, for mini-barcodes <150-bp, users should avoid using ABGD as it increases the chances of obtaining mOTUs that lump species. Our study also reveals that more research has to be carried out on taxon or sample effects because they contribute most of the overall variance.


2.6. Supplementary Information

Table S1: Primer sequences of the mini-barcodes assessed in this study

| Barcode Length (bp) | Primer Sequence (5’ → 3’) | Target Taxon | Reference |
|---|---|---|---|
| 94 | F: ATTAGGAGCWCCWGATATRGC; R: GGAGGRTAAACWGTTCAWCC | Lepidoptera | Hebert et al., 2013 |
| 130 | F: TCCACTAATCACAARGATATTGGTAC; R: GAAAATCATAATGAAGGCATGAGC | Eukaryotes | Meusnier et al., 2008 |
| 145 | F: ATTCAACCAATCATAAAGATATTGG; R: ATAATTTTTTTTATAGTTATACC | Braconidae | Hajibabaei et al., 2006 |
| 164 | F: CATGCWTTTGTAATAATTTTYTTTATAG; R: GGAGGRTAAACWGTTCAWCC | Lepidoptera | Hebert et al., 2013 |
| 189 | F: ATTRRWRATGATCAARTWTATAAT; R: GTTCAWCCWGTWCCWGCYCCATTTTC | Lepidoptera | Hebert et al., 2013 |
| 295 | F: TGTAAAACGACGGCCAGTGCWTTCCCMCGWATAAATAATATAAG; R: CAGGAAACAGCTATGACGTAATWGCWCCWGCTARWACWGG | Lepidoptera | Hebert et al., 2013 |
| 307 | F: ATTCAACCAATCATAAAGATATTGG; R: GTTCAWCCWGTWCCWGCYCCATTTTC | Lepidoptera | Hebert et al., 2013 |
| 313 | F: GGWACWGGWTGAACWGTWTAYCCYCC; R: TANACYTCNGGRTGNCCRAARAAYCA | Metazoa | Leray et al., 2013 |
| 407 | F: GCTTTCCCACGAATAAATAATA; R: TAAACTTCTGGATGTCCAAAAAATCA | Braconidae | Hajibabaei et al., 2006 |

Chapter 3

Scalable species delimitation with molecular markers: comparing the performance of distance- and tree-based methods

3.1. Abstract

Species discovery in hyper-diverse taxa such as tropical arthropods requires scalable molecular methods that are suitable for the discovery of millions of species. Distance-based methods that utilize barcodes scale easily, while tree-based species delimitation techniques (GMYC, mPTP, bPTP) require well-supported trees that have to be reconstructed based on multiple markers; i.e., tree-based species delimitation may be less suitable for high-throughput species discovery in hyper-diverse invertebrate clades. Here, we first test to what extent different species-delimitation methods yield different results. Afterwards, we test whether tree- or distance-based methods yield higher congruence with morphology. Our dataset comprises 8081 specimens belonging to six Diptera clades of different ages (Culicidae, Dolichopodidae, Mycetophiliformia, Stratiomyidae, Syrphidae, Tabanidae). All specimens were barcoded with a mini-barcode (“NGS barcode”) in order to isolate all haplotypes that differ by at least 0.6% (721 haplotypes differing by ≥2 bp) for use in this study. The relationships between these haplotypes were then reconstructed based on mitogenomes obtained via genome skimming as well as complete 28S rDNA sequences obtained via tagged amplicon sequencing. We then estimate mOTUs using distance-based (objective clustering at 2-4% p-distance; ABGD: P=0.001 & 0.01) and tree-based methods (mPTP, bPTP). We place 304 – 406 putative species on well-supported trees and find that the majority of these species are stable across all species delimitation techniques (median: 0.86, range: 0.73 – 0.97 match ratios). We then test which molecular species delimitation technique maximizes congruence with morphology for 817 specimens belonging to 106 morphospecies of fungus gnats (Mycetophilidae). We again find that most mOTUs are congruent with the morphospecies clusters (median: 0.87, range: 0.77 – 0.94 match ratios), with the best tree-based method having a match ratio of 0.94 vs. 0.89 for distance-based methods (=3 additional congruent species). Based on our results, we propose that one should first use mini-barcodes for identifying highly stable mOTUs. Subsequently, haplotypes belonging to unstable mOTUs can be targeted for tree-based species delimitation.

3.2. Introduction

Species delimitation with molecular markers is now widely used due to the ease with which DNA sequences can be obtained via new cost-effective sequencing technologies (Blaxter, 2004; Vogler & Monaghan, 2007; Fontaneto et al., 2015; Meier et al., 2016; Wang et al., 2018). The popularity of DNA-based taxonomy follows the rise of DNA barcoding (Hebert et al., 2003), which led to the accumulation of standardized sequence data in globally accessible databases like NCBI GenBank (Benson et al., 2017) and BOLD (Ratnasingham & Hebert, 2007). Overall, there is no doubt that molecular data are of increasing importance for taxonomy, but systematists continue to argue over their relative importance. Some consider them to be sufficient for species delimitation (Tautz et al., 2003; Blaxter, 2004) while others argue that they should only be used for pre-sorting specimens into putative species (Wang et al., 2018); i.e., the sequence data should complement more traditional morphological work (Tan et al., 2010). Such an integrative approach is attractive because it is common to find species that are “cryptic” if only one type of data is evaluated (Bickford et al., 2007). However, molecular data are very likely to increase in relative importance for most hyper-diverse taxa, given that many lack readily accessible morphological characters that are reliable for species delimitation, while high-quality morphological data (e.g., genitalia) are time-consuming to obtain for thousands of specimens (Seifert, 2009; Le Gall & Saunders, 2010).

It is thus not surprising that many species delimitation methods have been proposed for clustering specimens into putative species based on DNA sequences. These methods can be broadly grouped into two classes: distance-based and tree-based. The former are based on pairwise distances and require the use of thresholds that separate intra- from interspecific divergence. Such thresholds can be obtained via algorithms (e.g., BIN: Ratnasingham & Hebert, 2013; ABGD: Puillandre et al., 2012; UCLUST: Edgar, 2010; jMOTU: Jones et al., 2011) or specified by the user based on empirical data (e.g., objective clustering: Meier et al., 2006). For distance-based species delimitation, estimates of robustness are then usually obtained by varying the thresholds/priors. In contrast, tree-based species delimitation techniques group specimens into species based on their relationships. These methods can be coalescence-based, like the general mixed Yule coalescent (GMYC) model (Pons et al., 2006) and Bayesian phylogenetics and phylogeography (BPP; Yang, 2015), or based on the number of substitutions on a branch, like the Poisson tree process (PTP; Zhang et al., 2013a).

Tree-based species delimitation is often assumed to be more reliable than distance-based methods because the molecular operational taxonomic units (mOTUs) derived from the latter are perceived as highly sensitive to the a priori distance thresholds selected by the user (Rubinoff et al., 2006). It is thus not surprising that tree-based methods are becoming more popular, which is also partially due to the fact that new sequencing technologies have improved access to the kind of multi-locus data that are needed for obtaining reliable trees (Hudson & Coyne, 2002). Many of the new tools require multi-locus data (Yang & Rannala, 2010; Fujisawa et al., 2016) since they assume disparate genealogical histories for the different loci. Although multi-locus species delimitation is often seen as superior to single-locus delimitation (Dellicour & Flot, 2018), its data requirement makes such methods less scalable, which is a major concern given that millions of species remain undiscovered. In terms of scalability, single-locus, distance-based methods would still be the most suitable for large-scale species discovery, followed by single-locus and then multi-locus tree-based methods. It is therefore timely to determine to what extent the different techniques yield different results and to use an external criterion for testing whether the less scalable methods need to be used for hyper-diverse taxa.

Single-locus tree-based species delimitation is mostly accomplished using GMYC and PTP. The former requires ultrametric trees and assumes that the distinction between speciation and coalescence processes will be reflected in the branch lengths (Pons et al., 2006). GMYC can be implemented with single or multiple threshold options. In contrast, PTP is able to process non-ultrametric trees and groups specimens with a similar criterion, but instead utilizes the number of substitutions as a function of branch length (Zhang et al., 2013a). PTP was originally implemented with a maximum likelihood search algorithm (mPTP) but recent algorithms apply Bayesian statistics and yield Bayesian posterior probabilities (bPTP).

Tree-based species delimitation requires accurate and well-resolved trees and it is generally accepted that multi-locus data are better able to differentiate between speciation and coalescent processes (Dupuis et al., 2012). However, generating well-resolved trees requires both good character and taxon sampling (Mayden et al., 2008), which is costly when the number of haplotypes is in the thousands or even millions. Although sequencing cost is declining, obtaining multi-locus data for all haplotypes of species-rich clades is a non-trivial challenge. Therefore, the aim of this chapter is to examine to what extent cost-effective and scalable distance-based techniques yield similar results to tree-based species delimitation based on well-resolved trees. In order to conduct this assessment, a sizeable dataset is required where both distance- and tree-based methods can be compared for the same specimens. In addition, the availability of morphological data is useful because it can be used to evaluate whether distance- or tree-based species maximize congruence with external data in the case of conflict.

In this study, we use NGS barcoding (Meier et al., 2016) to sequence a short 313-bp cox1 barcode for >8000 Diptera specimens from a Malaise trap survey of different tropical habitats in Singapore. The specimens belonged to six Diptera clades (Culicidae, Dolichopodidae, Mycetophiliformia, Stratiomyidae, Syrphidae and Tabanidae). We initially identify cox1 “haplotypes” that differ by at least 0.6% uncorrected p-distance (≥2-bp divergent). Mitochondrial genomes were sequenced for each of these haplotypes with a modified version of the pipeline developed by Crampton-Platt et al. (2016), known as mitochondrial metagenomics (MMG), which utilizes principles of genome skimming (Straub et al., 2012) to obtain mitochondrial contigs from Illumina sequencing pools and assigns them to species with short amplicon “baits” sequenced from pre-identified specimens. Mitochondrial contigs obtained via genome skimming are initially anonymous until they are baited with an amplicon of known species identity. If the assembled genome is complete or the contig overlaps with the bait region, a single bait can be used to identify a contig. However, in the case of split contigs, which are common, multiple baits are needed to identify all fragments. Long contigs can also benefit from multiple baits because these allow for the identification of chimeric assemblies. As such, we here use multiplexed, tagged amplicon sequencing to obtain a large number of mitochondrial baits and full 28S rDNA sequences.

Based on these data, we compare the mOTUs obtained with distance-based (objective clustering and ABGD) and tree-based species delimitation methods (bPTP and mPTP). For a subset of 817 Mycetophilidae specimens that were sorted into 106 morphospecies, we establish which species delimitation technique maximizes congruence. In this study, we also test if using multiple baits for the MMG process significantly increases the recovery of mitochondrial contigs. Lastly, we propose an efficient species discovery pipeline that combines distance- and tree-based species delimitation.

3.3. Materials & Methods

3.3.1. Specimen sampling, barcoding and OTU representative selection

Specimens were sampled during a 2-year-long Malaise-trap sampling campaign in Singapore’s mangroves and freshwater swamp forest, as well as a similar 6-month-long sampling in disturbed secondary forest habitat near the National University of Singapore (more details in Chapter 6). The trap samples were preserved in 70% ethanol and pre-sorted to various clades. A total of 8081 specimens from six Diptera clades (Culicidae, Dolichopodidae, Mycetophiliformia, Stratiomyidae, Syrphidae and Tabanidae) were selected for this study.

The specimens were barcoded with the NGS barcoding pipeline (Meier et al., 2016) and the 313-bp barcodes were aligned in MAFFT v7 (Katoh & Standley, 2013) using default parameters plus the “--adjustdirection” option. A total of 1617 cox1 haplotypes were obtained from the 8081 specimens. For reasons of scale and cost, we then estimated trees based on haplotypes that differ by at least 2-bp (0.6% uncorrected p-distance threshold). Given that this is still well below the species threshold (2-4%), this approach still represents the intraspecific variability while greatly reducing the need for sequence data (721 instead of 1617 haplotypes).
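The relationship between the 2-bp and 0.6% figures can be checked with a minimal sketch of the uncorrected p-distance. The function name and the handling of gaps and `N` are assumptions made for illustration; this is not the script used in the study.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance between two aligned sequences of equal length.

    Columns where either sequence has a gap ('-') or an 'N' are skipped
    (an assumption of this sketch, not a rule stated in the text).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return differences / compared


# Two differences across a 313-bp barcode correspond to roughly 0.6%.
reference = "A" * 313
variant = "C" * 2 + "A" * 311
print(f"{p_distance(reference, variant):.4f}")  # 0.0064
```

Two substitutions in 313 bp give 2/313, i.e. about 0.64%, which is the value rounded to 0.6% in the text.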

Genomic DNA (gDNA) was extracted from each specimen representing every 0.6% haplotype using the QIAGEN DNeasy Blood & Tissue Kit according to the manufacturer’s instructions, with the exception of Dolichopodidae, which were extracted using Lucigen QuickExtract. The extractions were performed on whole specimens, but a small hole was pierced in the thorax for digestion such that the specimen was not destructively sampled and remained available for subsequent morphological examination. The vouchers were stored in 70% ethanol and the mycetophilid specimens were pre-sorted and subsequently checked for morphospecies congruence by taxonomist Professor Dalton Amorim according to Wang et al. (2018).

3.3.2. Data for phylogeny reconstruction: I. Mitogenomes via MMG

A total of 721 gDNA extracts from the 0.6% mOTU representatives were pooled in equimolar fashion. However, each library only contained DNA for mOTUs that differed by >5% for cox1 based on the 313-bp mini-barcode. This was done by clustering the cox1 barcodes at 5% p-distance; mOTUs that fell within the same cluster were spread out across different libraries. This ensured that the pooled mOTUs were sufficiently distantly related that chimeric assemblies were unlikely to generate spurious data. Seven libraries were required, and we targeted a total of 360 million reads for the 721 mOTUs (500,000 reads per mOTU). The pools were mixed with samples from other projects and sequenced across several HiSeq 2500 250-bp paired-end lanes, but collectively would have required at least three lanes of HiSeq 2500 sequencing (~150 million reads per lane). Library preparation used a mixture of TruSeq DNA and NEBNext Ultra II library preparation kits.
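The pooling rule and the read budget described above can be sketched as follows. The function `assign_libraries`, its round-robin strategy and the toy input are illustrative assumptions; the text does not specify how mOTUs were distributed beyond the constraint that members of one 5% cluster must not share a library.

```python
import math
from collections import defaultdict


def assign_libraries(cluster_of, n_libraries):
    """Assign each mOTU to a library so that no two mOTUs from the same
    5% cluster share a library. `cluster_of` maps mOTU name -> cluster id."""
    members = defaultdict(list)
    for motu, cluster in cluster_of.items():
        members[cluster].append(motu)

    assignment = {}
    next_library = 0  # rotating start keeps library sizes roughly even
    for cluster in sorted(members):
        motus = sorted(members[cluster])
        if len(motus) > n_libraries:
            raise ValueError(
                f"cluster {cluster} needs more than {n_libraries} libraries"
            )
        for offset, motu in enumerate(motus):
            assignment[motu] = (next_library + offset) % n_libraries
        next_library = (next_library + len(motus)) % n_libraries
    return assignment


# Read budget from the text: 721 mOTUs at 500,000 reads each,
# with roughly 150 million reads per lane.
total_reads = 721 * 500_000
lanes_needed = math.ceil(total_reads / 150_000_000)
print(total_reads, lanes_needed)  # 360500000 3
```

Under this scheme the minimum number of libraries equals the size of the largest 5% cluster, which would be consistent with the seven libraries reported if the largest cluster held seven 0.6% mOTUs; that cluster size is not stated in the text.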

The data were processed following a pipeline described by Crampton-Platt et al. (2015). The raw reads were first processed via adapter trimming with Trimmomatic v0.36 (Bolger et al., 2014) and a BLAST search was conducted against a database consisting of all full Diptera mitochondrial genomes present in NCBI GenBank as of 20th March 2018. Using the BLAST hits, the matching reads were extracted such that this dataset would only consist of ‘mitochondrial gene-like’ reads. These reads were then used in three different assemblers in order to obtain mitochondrial genome contigs: 1) SPAdes [--meta, -k 21,33,55,77,99,127] (Bankevich et al., 2012), 2) Ray [k 61, -minimumseedlength 100, -minimumcontiglength 1000] (Boisvert et al., 2010) and 3) IDBA-UD [--mink 60, --maxk 250] (Peng et al., 2012). For IDBA-UD, quality control was first performed with PRINSEQ [-min_len 150 -min_qual_mean 25 -ns_max_n 0 -trim_qual_right 20] (Schmieder & Edwards, 2011) before the reads were used for assembly, since the FASTA input required by IDBA-UD does not include Phred scores. The contigs then underwent an additional BLAST check against the same reference database and were filtered by length, with contigs <1000-bp being discarded. The filtered contigs were then assembled into supercontigs in Geneious R7 [1000-bp minimum overlap, 1% maximum mismatches per read] (Kearse et al., 2012).

Mitochondrial gene baits were obtained via multiplexed tagged amplicon sequencing (next section) and then mapped to the supercontigs in Geneious [100-bp minimum overlap, 0% maximum mismatches per read] (Kearse et al., 2012) in order to assign the contigs to the original mOTUs. The baited supercontigs were also screened for inconsistencies like chimeric assemblies (identified when baits from different mOTUs mapped onto the same supercontig); such contigs were fixed or discarded. The supercontigs were then annotated in Geneious using annotated reference mitochondrial genomes from NCBI GenBank, and the coding and rDNA regions (cox1, cox2, cox3, cytb, nd1, nd2, nd3, nd4, nd4l, nd5, nd6, atp6, atp8, 16S & 12S) were extracted and aligned in MAFFT v7 (Katoh & Standley, 2013).
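The chimera screen described above amounts to checking whether all baits that map to one supercontig come from the same mOTU. A minimal sketch, assuming the mapping results are available as (supercontig, mOTU) pairs; the function name and the toy identifiers are hypothetical, and the actual screening was done in Geneious.

```python
from collections import defaultdict


def classify_supercontigs(bait_hits):
    """`bait_hits` is an iterable of (supercontig, mOTU) pairs, one per bait
    that mapped. Returns (assigned, chimeric): supercontigs whose baits all
    agree on one mOTU, and supercontigs hit by baits from several mOTUs."""
    motus_per_contig = defaultdict(set)
    for contig, motu in bait_hits:
        motus_per_contig[contig].add(motu)

    assigned = {}
    chimeric = {}
    for contig, motus in motus_per_contig.items():
        if len(motus) == 1:
            assigned[contig] = next(iter(motus))
        else:
            chimeric[contig] = sorted(motus)
    return assigned, chimeric


hits = [
    ("contig_A", "mOTU_001"),  # e.g. a cox1 bait
    ("contig_A", "mOTU_001"),  # a second bait from the same mOTU: consistent
    ("contig_B", "mOTU_002"),
    ("contig_B", "mOTU_117"),  # conflicting bait: flag for inspection
]
assigned, chimeric = classify_supercontigs(hits)
print(assigned)
print(chimeric)
```

A supercontig with a single bait can never be flagged by this check, which is one reason multiple baits per mOTU are useful.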

3.3.3. Data for phylogeny reconstruction: II. Multiplexed tagged amplicon sequencing

Fourteen primer pairs were designed to target short 200 – 400-bp fragments of all mt coding genes except ATP8 so as to obtain baits for the MMG pipeline, while another fifteen primer pairs were designed to span the full 28S rDNA gene (Table 1), such that the amplicons tiled the full gene with short overlapping ends to facilitate downstream assembly. The primers were tagged with 12 different 5’-end 6-bp oligonucleotides to generate 144 possible forward and reverse tag combinations, so as to allow bioinformatic demultiplexing and assignment of each sequence to its sample of origin. The 28S rDNA primers were split into two sets, where every alternate adjacent amplicon along the sequence-of-interest was placed in a different set, thereby reducing the likelihood of long-range PCR products.
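The tag arithmetic can be illustrated with a short sketch. The tag names are placeholders (the 6-bp tag sequences are not listed in the text) and the assignment of samples to pools is an assumption made for illustration.

```python
import math
from itertools import product

# Placeholder tag names; the actual 6-bp tag sequences are not given in the text.
forward_tags = [f"F{i:02d}" for i in range(1, 13)]
reverse_tags = [f"R{i:02d}" for i in range(1, 13)]

tag_pairs = list(product(forward_tags, reverse_tags))
n_samples = 721
libraries_needed = math.ceil(n_samples / len(tag_pairs))

# Each sample gets a unique (library, forward tag, reverse tag) triple.
sample_key = {
    sample: (sample // len(tag_pairs), *tag_pairs[sample % len(tag_pairs)])
    for sample in range(n_samples)
}
print(len(tag_pairs), libraries_needed)  # 144 6
```

With 12 forward and 12 reverse tags there are 144 combinations, so 721 extracts need at least six separately indexed pools, which matches the six libraries reported for the amplicon sequencing.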


Table 1: Mitochondrial gene and 28S rDNA primer sequences used in this study

| Locus | Gene / Fragment | Direction | Sequence (5’ – 3’) |
| --- | --- | --- | --- |
| Mitogenome | nad2 | F | AAGCTAHTRGGYTCATACCCYA |
| | | R | GGRAAYCAAAARTGRAANGGDGC |
| | nad2 | F | TAACHTGACAAAAADTWGCHCC |
| | | R | ATTTDGGWAWAAATCCWARRAATGG |
| | cox1 | F | GGWACWGGWTGAACWGTWTAYCCYCC |
| | | R | TAWACYTCWGGRTGNCCRAARAAYCA |
| | cox2 | F | CAATGATAYTGAAGYTAYGAATA |
| | | R | TACTTGCTTTCAGTCATCTAATG |
| | atp6 | F | TGAAAATGATAAYWAAYYTATT |
| | | R | CCTAAWARWKTTTTAAATTCTTT |
| | cox3 | F | CAATGATGACGDGATRTHTCWCG |
| | | R | GCTTCDRTRTATTCATADSCTTG |
| | nad3 | F | TDGCHACWGGHTTYCAYGG |
| | | R | GGRTCAAAHCCRCATTCAAAHG |
| | nad5 | F | ACTARDGTWGAAGAATGNACHAAHGC |
| | | R | ATTWTTAGHCCWAAYTTARTBAG |
| | nad4 | F | GWDGGVGGDGCWGCYATATT |
| | | R | GTWGARGCHCCDGTWTCDGGDTC |
| | nad4l/nad6 | F | AACAKTGGTCTTGTAAAYCAA |
| | | R | AATWTTTCATTWGANGCHADDG |
| | cytb | F | TGATGAAAYTTYGGNTCWYTA |
| | | R | CTCARAADGATATTTGWCCTCA |
| | nad1 | F | GGVCCYTTHCGAATYTGRAYATA |
| | | R | CTTACATGATYTGAGTTCARACC |
| | LrRNA | F | ACGCTGTTATCCCTAARGTAACTT |
| | | R | AGCWWAATMATTWGTYTTTTAATT |
| | SrRNA | F | TACTWTGTTACGACTTATCT |
| | | R | TTADTCTDTTYAGAGGAAYYTG |
| 28S rDNA | P01 | F | AGTAGYKGCGARCGAAA |
| | | R | CTTTTCAACTTTCCCTCACGGTACTTGTT |
| | P02 | F | CTAAATATNACCAYGAGWCCGATAG |
| | | R | CCTTGGTCCGTGTTTCAAGACGG |
| | P03 | F | TTYRGGAYACCTTYDGGAC |
| | | R | GGTTTCCCCTGACTTCDACCTGATCA |
| | P04 | F | GACCCGAAAGATGGTGAACTAT |
| | | R | CCCATTTAAAGTTTGAGAATAGG |
| | P05 | F | CTGGTAAAGCGAATGATTAGAGGC |
| | | R | ACCCTTCTCCACTTCAG |
| | P06 | F | ATCCGCTAAGGAGTGTGTAACAACTCACC |
| | | R | ACCCTTCTCCACTTCAG |
| | P07 | F | GGTGGTAGTAGCAAATAATCG |
| | | R | TACTYAAYAGGYTMCGGAAT |
| | P08 | F | CGAAAGGGAATACGGTTCCA |
| | | R | CCTTATCCCGAAGTTACG |
| | P09 | F | ATGTAGGTAAGGGAAGTC |
| | | R | CAGAGCACTGGGCAGAAATCAC |
| | P10 | F | TGACTGTCTAATTAAAACAAAGCAT |
| | | R | CAGATTAGAGTCAAGCTCAACAGG |
| | P11 | F | AAGGTAGCCAAATGCCTCATC |
| | | R | TAAGTRAGGMARCRRTAAGAGT |
| | P12 | F | GAGGTGYAGHATARGTGGGAG |
| | | R | GATGTACCGCCCCAGTCAAACTCC |
| | P13 | F | CAMDTCAWGGACANTGYCAGGT |
| | | R | ACGCTTGGCBGCCACAAGCCAGTTA |
| | P14 | F | CAAGAGGTGTCAGAAAAGTTACCAC |
| | | R | CGGATACGACCTTAGAGGCG |
| | P15 | F | GTGGTATGAHGCTACGTCCGTTGGAT |
| | | R | CAGCTGCTCTACCACTHACRRCAC |

The primers in each set were mixed in equal proportions and PCR was carried out using the aforementioned mOTU-specific gDNA extracts and the QIAGEN Multiplex PCR Kit. Reaction mixes of 10 µL were prepared, using 5 µL of QIAGEN Multiplex master mix, 1 µL each of forward and reverse primers, 1 µL of bovine serum albumin (1 mg/mL) and 2 µL of gDNA extract. The PCRs were performed with an initial denaturation at 95 ⁰C for 15 min, followed by an initial cycle of denaturation at 94 ⁰C for 1 min, annealing at 57 ⁰C for 90 s, and extension at 72 ⁰C for 45 s. Following that, a step-up protocol was employed, with another 5 cycles of annealing at 42 ⁰C and 35 cycles of annealing at 45 ⁰C, and a final extension step at 68 ⁰C for 15 min. The PCRs were performed in 96-well plates with a negative control for each plate.

The PCR products were purified with Bioline SureClean and pooled into six equimolar libraries, given that 721 mOTUs had to be sequenced and only 144 possible primer tag combinations were available. The purified products were then sequenced paired-end on one lane of the Illumina HiSeq 2500 2x250-bp platform after library preparation with the TruSeq DNA Nano kit. Using the reagent costs presented in Meier et al. (2016), each amplicon costs ca. 0.50 USD given that we only used a quarter of the recommended volume of QIAGEN Multiplex PCR Kit master mix (10 µL instead of 40 µL). The total cost per amplicon was thus 0.96 USD after sequencing on an Illumina MiSeq platform or 0.78 USD on a HiSeq 2500.

Post-sequencing data processing for the amplicons was performed as described in Meier et al. (2016). It involved paired-end assembly, demultiplexing and quality control. A Basic Local Alignment Search Tool (BLAST) search against NCBI GenBank’s nucleotide database was used to remove contaminant sequences, which were defined as BLAST hits to non-target taxa at >95% identity. A genome-mapping based approach was used for further verification, where unique amplicons were mapped onto published mitochondrial genomes and 28S rDNA reference sequences for species in the target Diptera clades (GI: 309056, 120944038, 120944077, NC_008756.1, 347394, 425869744, 425869783, 425869795, 425869828, DQ496190) with CLC Genomics Workbench v.8.0.3 (mismatch, insertion and deletion costs of 2, length fraction of 0.7 and similarity fraction of 0.8). When good reference sequences were unavailable, sequences from the next closest relative were used instead. The amplicons that mapped successfully were then iteratively used as reference sequences for the demultiplexed reads using the same parameters. The consensus sequences were extracted, and the mapped amplicon was discarded if more than two ambiguities were present in the consensus. Amplicons with coverage < 1% compared to the highest coverage for this gene fragment were excluded from the dataset. These processes were performed with python scripts written by Dr. Amrita Srivathsan.
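The two consensus filters described above can be sketched as a single predicate. This is an illustration only and not the script used in the study: the function name, the argument names, and the choice to count any IUPAC ambiguity code (including N) as an ambiguity are assumptions.

```python
AMBIGUITY_CODES = set("RYSWKMBDHVN")


def passes_filters(consensus, coverage, max_coverage_for_fragment,
                   max_ambiguities=2, min_coverage_fraction=0.01):
    """Apply the two filters described in the text to one mapped amplicon:
    discard if the consensus has more than two ambiguous bases, or if its
    coverage is below 1% of the best-covered amplicon for that fragment."""
    ambiguities = sum(base in AMBIGUITY_CODES for base in consensus.upper())
    if ambiguities > max_ambiguities:
        return False
    if coverage < min_coverage_fraction * max_coverage_for_fragment:
        return False
    return True


print(passes_filters("ACGTRY", coverage=50, max_coverage_for_fragment=1000))
```

The thresholds are exposed as keyword arguments so that the effect of stricter or looser filtering can be explored without editing the function.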

The 28S rDNA amplicons that passed the filters were aligned to a published reference sequence with the --addfragments function in MAFFT v.7 (Katoh & Standley, 2013). Using another in-house python script, sequences with overlapping regions were merged if the overlapping regions were identical. The sequences were then examined for conflicts in the overlapping region in AliView (Larsson, 2014). If there was such conflict, both sequences were discarded. Note that the filtered mitochondrial amplicons did not require assembly because they were used as species baits.
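The merging rule for overlapping 28S fragments can be sketched as follows, assuming each fragment is a row of the same alignment with gaps wherever it has no data. The in-house script is not reproduced here; `merge_aligned` is a hypothetical stand-in, and it treats every gap as missing data, so genuine indels inside a fragment would need separate handling.

```python
def merge_aligned(frag_a, frag_b, gap="-"):
    """Merge two fragments that sit in the same alignment (equal length,
    gaps where a fragment has no data). Returns the merged row, or None if
    the fragments disagree at any column where both have a base."""
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must come from the same alignment")
    merged = []
    for a, b in zip(frag_a, frag_b):
        if a == gap:
            merged.append(b)
        elif b == gap or a == b:
            merged.append(a)
        else:
            return None  # conflict in the overlap: discard both
    return "".join(merged)


print(merge_aligned("ACGTAC----", "----ACGGTT"))  # ACGTACGGTT
print(merge_aligned("ACGTAC----", "----TCGGTT"))  # None
```

Returning `None` on conflict mirrors the rule in the text that both sequences are discarded when the overlapping regions disagree.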


3.3.4. Tree reconstruction

Based on Wiegmann et al. (2011), Trichocera bimacula (GI: 356488987, 202070904) was selected as outgroup for the tree reconstruction for the several hundred mOTUs belonging to the six dipteran clades in this study. The GenBank data were aligned to the genes for the mOTUs before concatenating the 15 aligned mitochondrial protein-encoding genes and 28S rDNA in SequenceMatrix v1.8 (Vaidya et al., 2011). mOTUs with <3 of the 16 character sets were discarded and data for 459 0.6% mOTUs were retained. Maximum likelihood trees were then obtained in RAxML v8 (Stamatakis, 2014) on CIPRES using the GTRCAT model and rapid bootstrapping (-f a). The best trees were retained for species delimitation. The same tree-reconstruction techniques were also used to obtain a tree for a reduced dataset consisting of representatives of 3% cox1 p-distance mOTUs with all available mitochondrial genome and 28S characters. Overall, four different trees were obtained that differed with regard to dataset and/or tree reconstruction technique (Fig. 1):

Tree 1 – Haplotype tree: This tree is based on all available mitochondrial genome and 28S rDNA data for the haplotypes that differ by ≥2 bp for the 313-bp cox1 mini-barcode.

Tree 2 – Mini-barcode tree: This tree is based on 313-bp of cox1 for the haplotypes mentioned above.

Tree 3 – Unconstrained mixed tree: We reconstructed this tree based on a supermatrix consisting of all available mitochondrial genome and 28S rDNA data for species-level mOTUs, while having only 313-bp mini-barcodes for the remaining 0.6% mOTU haplotypes. This matrix had 43% missing data.

Tree 4 – Constrained mixed tree: This tree is based on the same supermatrix as Tree 3, but the haplotype tree topology was used as a constraint during the tree search.

Figure 1: Depictions of the aligned character matrices of the four datasets/treatments used in this study. The longer bars represent the full character set while the shorter bars represent just the cox1 region.

Trees 2 and 3 were built using the >2bp difference mOTUs. These form the terminals for all trees 1-4. The 3% cox1 p-distance representatives are used to create a guide tree for the placement of the shorter barcode fragments.

All trees were visualized in FigTree v1.4.2 (Rambaut, 2007) and several specimens were identified that initially seemed misplaced. However, after checking the vouchers, they were found not to belong to the target clades (i.e., they had been mis-sorted by the para-taxonomists). These specimens were identified to species level and retained because they were overall correctly placed according to the tree in Wiegmann et al. (2011).

3.3.5. Species delimitation and comparison

Distance-based species delimitation was performed on the aligned 313-bp barcode dataset with objective clustering (Meier et al., 2006) using a python script (unpublished; provided by Dr. Amrita Srivathsan) at 2, 3, and 4% uncorrected p-distance thresholds, as well as with ABGD (Puillandre et al., 2012) at P=0.01 and P=0.001 priors. The distance thresholds that were selected for objective clustering were based on what is commonly used in the literature (Hebert et al., 2003), while the two reported ABGD priors represent the results from a range of priors that were tested.
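As an illustration of threshold-based clustering, the sketch below groups sequences by single linkage: two sequences end up in the same cluster if they are connected by a chain of pairwise p-distances at or below the threshold. It is a simplified stand-in and not the unpublished script used in the study; the function names, the inclusive threshold and the toy sequences are assumptions, and the all-pairs loop would not scale to thousands of barcodes.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned, gap-free sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def objective_clusters(seqs, threshold):
    """Single-linkage clustering: two sequences share a cluster if a chain of
    pairwise distances, each at or below `threshold`, connects them."""
    names = list(seqs)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if p_distance(seqs[first], seqs[second]) <= threshold:
                parent[find(first)] = find(second)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


toy = {
    "a": "A" * 100,
    "b": "CC" + "A" * 98,    # 2% from a
    "c": "CCCC" + "A" * 96,  # 4% from a, 2% from b
    "d": "G" * 20 + "A" * 80,
}
for threshold in (0.01, 0.03):
    print(threshold, objective_clusters(toy, threshold))
```

In the toy data, `a` and `c` differ by 4% but still cluster together at a 3% threshold because both are within 2% of `b`. This chaining behaviour is why the maximum distance within a cluster can exceed the threshold.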

Tree-based species delimitation was performed with bPTP and mPTP (Zhang et al., 2013a) on the Exelixis Lab web server (URL: http://species.h-its.org/ptp), with a rooted tree and outgroup taxon specified. This was done using trees from the four different datasets mentioned above. While GMYC has been shown to be highly effective in many cases (Esselstyn et al., 2012; Talavera et al., 2013), our data were not suitable for this coalescent-based species delimitation technique due to the lack of comprehensive population-level sampling of full character sets across the entire dataset. In addition, we did not sample all haplotypes for the full mitochondrial genomes and 28S rDNA characters. Applying GMYC to these data with artificially skewed effective population sizes would overestimate the number of singletons, for which GMYC has been shown to under-perform (Esselstyn et al., 2012; Fujisawa & Barraclough, 2013; Dellicour & Flot, 2015; Ahrens et al., 2016). Our study thus used bPTP and mPTP for tree-based species delimitation, which have been shown to be less sensitive to effective population sizes (Luo et al., 2018).

3.3.6. Comparison of species delimitation results

The results obtained with each species delimitation method were compared pairwise using match ratios. The match ratio (Ahrens et al., 2016) is the congruence metric used here and is defined as 2 * Nmatch / (N1 + N2), where Nmatch is the number of clusters that are identical across the two mOTU delimitation methods/thresholds being compared, and N1 and N2 are the numbers of mOTUs obtained with each of them. In addition, the tree topologies were compared using pairwise Robinson-Foulds distances and branch-length scores obtained with the ape package (Paradis et al., 2004) in R (R Development Core Team).
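The match ratio can be computed directly from two sets of clusters. A minimal sketch, in which the function name and the toy specimen labels are illustrative:

```python
def match_ratio(partition_1, partition_2):
    """Match ratio = 2 * N_match / (N1 + N2), where N_match counts clusters
    with identical membership in both partitions."""
    clusters_1 = {frozenset(c) for c in partition_1}
    clusters_2 = {frozenset(c) for c in partition_2}
    n_match = len(clusters_1 & clusters_2)
    return 2 * n_match / (len(clusters_1) + len(clusters_2))


clustering_2pct = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
clustering_ptp = [{"s1", "s2"}, {"s3", "s4", "s5"}]
print(match_ratio(clustering_2pct, clustering_ptp))  # 0.4
```

Here only the cluster {s1, s2} is shared, so the ratio is 2 * 1 / (3 + 2) = 0.4. Two identical partitions give 1.0, and partitions with no clusters in common give 0.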


3.3.7. Congruence with morphology

All mycetophilid specimens were pre-sorted via NGS barcoding and the accuracy of the mOTUs was assessed based on morphology by a taxonomic specialist in Mycetophilidae (Dr. Dalton Amorim). The morphospecies memberships were then compared to the mOTU memberships based on different species delimitation methods.

3.4. Results

3.4.1. Barcoding and character sampling

A total of 8081 Diptera specimens were barcoded successfully and yielded 1617 haplotypes, of which 721 differ by at least 2-bp (0.6% uncorrected p-distance). These represent 538 – 579 mOTUs (2-4% p-distance) based on objective clustering. Of the 0.6% mOTUs extracted with DNeasy, 414 of 501 (82.1%) passed the filtering criterion of a minimum of three character sets, while only 45 of 204 (22.1%) of those extracted with QuickExtract passed this threshold. Overall, this allowed us to place 459 of the 0.6% haplotypes on the haplotype tree. Note that the data are strong for most taxa and only weak for Dolichopodidae due to the comparatively poorer quality of the DNA obtained with QuickExtract. This highlights the importance of obtaining high-quality DNA for the effectiveness of the pipeline.

3.4.2. Morphological verification

The target taxa, which belonged to Culicidae, Dolichopodidae, Stratiomyidae, Syrphidae, Tabanidae and Mycetophiliformia, yielded 430 mOTUs (0.6%) representing 4815 specimens. Of these, 817 specimens belonged to 106 morphospecies of Mycetophilidae and were subsequently used for assessing congruence between morphology and DNA sequences in this study. In addition, 29 mOTUs (represented by 51 specimens) belonged to non-target taxa based on morphological re-examination (Table 2).

Table 2: Number of 0.6% mOTUs and specimens of the various taxa with sufficient data for phylogenetics

| Taxon | No. of 0.6% mOTUs | No. of specimens |
| --- | --- | --- |
| Target taxa | | |
| Culicidae | 96 | 1719 |
| Dolichopodidae | 45 | 670 |
| Stratiomyidae | 38 | 522 |
| Syrphidae | 25 | 258 |
| Tabanidae | 28 | 541 |
| Mycetophiliformia | 198 | 1105 |
| Non-target taxa | | |
| Ptychopteridae | 1 | 1 |
| Limoniidae | 9 | 25 |
| Psychodidae | 4 | 6 |
| Chironomidae | 4 | 4 |
| Ceratopogonidae | 6 | 9 |
| Xylomidae | 2 | 3 |
| Therevidae | 1 | 1 |
| Phoridae | 1 | 1 |
| Pipunculidae | 1 | 1 |

3.4.3. Multiplexed tagged amplicon sequencing

Overall, 5677 of the 10,094 mitochondrial bait amplicons yielded bait sequences after quality filtering, covering 689 of the 721 mOTUs (0.6%) in the study. The amplification success varied across genes, with the cox1 and cox2 regions being the most successful, while success was lowest for the nd4 and nd4l/nd6 regions (Fig. 2A). For the 28S rDNA fragments, 7986 of the 10,815 amplicons were successfully sequenced, yielding assemblies for 672 of the 721 0.6% mOTUs in the study. The amplification rates varied across the 28S sequence and were noticeably poorer at the 5’ end (Fig. 2B).

Fig. 2: Distribution of successfully sequenced mitochondrial bait (A) and 28S fragment (B) amplicons. The y-axes show the number of amplicons; the x-axis of (B) lists the 28S fragments P01 to P15.

3.4.4. Mitogenome skimming with single and multiple baits

The multiple bait dataset yielded the largest amount of data (Fig. 3), but we also examined the benefits of having multiple baits by tallying the contigs that would have been discarded because they could not have been identified with a single bait. The size of the single-bait datasets differed depending on which gene was used as a bait and how the contigs were split. However, by using the longest and shortest possible contig, we can provide lower- and upper-bound estimates for the loss that might be expected if single baits had been used. The multiple bait dataset contains 6,288,176-bp of data and the use of a single bait would have reduced this by 632,768 – 1,185,231-bp (longest possible: 5,655,408-bp, shortest possible: 5,102,945-bp). In effect, 10 – 19% of usable data would have been lost had only a single bait been utilized (Fig. 3). Additionally, the multiple baits helped to identify 31 supercontigs that were chimeric (3.87%), which could be resolved by either splitting or discarding them.
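The loss estimates follow directly from the three dataset sizes reported above. The short check below uses only those figures; the variable names are chosen for illustration.

```python
multiple_baits_bp = 6_288_176
single_bait_longest_bp = 5_655_408
single_bait_shortest_bp = 5_102_945

# Lower bound: the single bait always happens to sit on the longest contig.
loss_min = multiple_baits_bp - single_bait_longest_bp
# Upper bound: the single bait always sits on the shortest contig.
loss_max = multiple_baits_bp - single_bait_shortest_bp

print(loss_min, f"{loss_min / multiple_baits_bp:.1%}")  # 632768 10.1%
print(loss_max, f"{loss_max / multiple_baits_bp:.1%}")  # 1185231 18.8%
```

These percentages correspond to the 10 – 19% range quoted in the text after rounding.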

Figure 3: The number of characters (bp) in each contig sorted from highest to lowest from the actual multiple-bait dataset and the longest and shortest possible single-bait dataset. The area in blue indicates the difference in the number of characters, which would have been lost had only a single bait been used. The y-axis shows the number of characters (bp); the three series are Multiple Baits, Single Bait - Longest and Single Bait - Shortest.

3.4.5. Phylogenetics

On the tree built from the full set of mitochondrial genomes and 28S rDNA, 61 of the 67 species-level (3% mOTUs) nodes are well-supported (>80 bootstrap support). Only one of the 3% mOTUs is not monophyletic. The family-level relationships (Fig. 4) are largely congruent with Wiegmann et al. (2011) as long as only the well-supported relationships in that study are considered. The main area of conflict is the paraphyly of Ceratopogonidae, which is likely due to poor sampling. Another area of conflict is the placement of Bibionomorpha as sister-group to Culicomorpha and Psychodomorpha instead of the Brachycera, although the latter placement is poorly supported in Wiegmann et al. (2011). Similarly, on our tree the Ceratopogonidae are sister-group to Culicidae instead of Chironomidae. However, the position of the Ceratopogonidae is difficult to resolve and is also only weakly supported in a recent phylogenomic study using transcriptome data (Kutty et al., 2018). Lastly, within the Mycetophiliformia, Lygistorrhinidae is nested within the Keroplatidae instead of being sister to the Cecidomyiidae, while the Keroplatidae is sister-group to Mycetophilidae in Wiegmann et al. (2011). However, with regard to the placement of the Lygistorrhinidae, we recover the same relationship as Ševčík et al. (2016).

In contrast to the well-supported haplotype tree, the tree constructed based on only 313-bp for 459 mOTUs is poorly supported. When these haplotypes that only have NGS barcode data are added to the multi-gene dataset, we again obtain a well-supported tree with support values that are similar to the ones on the haplotype tree. This result is obtained despite the large amount of missing data (43% of the character matrix). The Robinson-Foulds distances and branch-length scores between the trees used in this study (Table 4) are a measure of dissimilarity in topology and overall branch lengths, respectively. The results indicate that the tree obtained from NGS barcodes alone is highly divergent in both topology (RF distances of 608 – 614) and branch lengths (branch-length scores of 1.26 – 1.29) from the trees built with much better character sampling. However, the unconstrained and constrained mixed trees are fairly similar to the tree constructed from the full dataset.
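To make the topological measure concrete, the following is a minimal sketch of the standard Robinson-Foulds distance for unrooted trees, i.e. the number of non-trivial bipartitions found in one tree but not the other. It is not the software used to produce Table 4; the nested-tuple tree representation, the function names and the five-taxon toy trees are assumptions made purely for illustration.

```python
def leaves(tree):
    """Return the set of terminal names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits implied by the internal branches.

    Each split is stored as the side that does NOT contain an arbitrary
    anchor taxon, so that rooting does not affect the result.
    """
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if anchor in clade else clade
        # Skip trivial splits (single terminals and the whole taxon set).
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must share the same set of terminals")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical five-taxon trees that differ in the placement of B and C.
tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
print(rf_distance(tree_1, tree_2))
```

Identical topologies give a distance of zero, while each conflicting internal branch contributes to the count; the branch-length score additionally weights splits by their branch lengths and is not covered by this sketch.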

Figure 4: Family-level maximum likelihood tree of mOTUs in this study built with the full mitochondrial genome and 28S rDNA dataset, with bootstrap support values indicated at the basal nodes. The number of 0.6% mOTUs for the target taxa is displayed below the family name. Paraphyletic clades are indicated with red text, while monophyletic ones are in black. The green branches indicate congruence with the Wiegmann et al. (2011) Diptera phylogeny, while the red branches indicate conflict and the blue branches indicate a lack of information. The asterisks (*) beside the family names indicate taxa that were missorted but still retained in this study

[Figure 5 plot not reproduced. y-axis: Bootstrap values (0 to 100); x-axis: 0% to 100%; series: Full mitochondrial genome + 28S rDNA, Barcodes only, Mixed dataset without constraint.]

Figure 5: Bootstrap support values sorted from highest to lowest for the trees constructed from the full dataset, the barcodes-only dataset, as well as the mixed mitochondrial genome, 28S rDNA and barcode dataset without a constraint tree

3.4.6. Comparison between Tree-based and Distance-based species delimitation

Pairwise match ratios were calculated between each species delimitation technique and parameter for the various datasets with all taxa included (Table 3, bottom-left). Overall, most species units are congruent across all delimitation methods based on pairwise congruence comparisons (0.73 – 0.97 match ratios). The lower bound estimate of congruence is skewed by the poor performance of ABGD under some priors. Otherwise, all match ratios would have been at least 0.78. Note that low priors were used in this study because high priors tended to lump all barcodes into a single cluster (see also Chapter 2 for poor ABGD performance on mini-barcodes with high priors).

Surprisingly, the species units obtained from bPTP and mPTP applied on the barcode-only tree do not differ greatly from those derived from the full dataset (0.90 – 0.95 match ratios). In general, the species units are largely consistent between the two iterations of PTP for each dataset (>0.90 match ratios), with the notable exception of the unconstrained mixed tree (0.86 match ratio). The specimen-level analysis (Table 3, top-right) confirms high levels of congruence between methods (2.0 – 24.6% removal required), with the maximum falling to 20.9% if ABGD is disregarded. In several cases, species- and specimen-based congruence yielded fairly different results, with the latter requiring the removal of a smaller number of specimens to obtain perfect congruence. This implies that rare haplotypes were causing the conflict between methods.
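For readers unfamiliar with the metric, a minimal sketch of a pairwise match ratio is given below. It assumes the definition commonly used in the species delimitation literature, 2 × N_match / (N_1 + N_2), where N_match is the number of clusters with identical specimen composition in both partitions; that definition is not restated in this section, and the function name and toy specimen labels are hypothetical.

```python
def match_ratio(partition_a, partition_b):
    """Match ratio between two partitions of the same specimens.

    Assumed definition: 2 * N_match / (N_a + N_b), where N_match counts
    clusters that contain exactly the same specimens in both partitions.
    """
    clusters_a = {frozenset(cluster) for cluster in partition_a}
    clusters_b = {frozenset(cluster) for cluster in partition_b}
    n_match = len(clusters_a & clusters_b)
    return 2 * n_match / (len(clusters_a) + len(clusters_b))


# Hypothetical example: one method splits what the other lumps.
splitter = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
lumper = [{"s1", "s2"}, {"s3", "s4", "s5"}]
print(match_ratio(splitter, lumper))
```

Under this definition a single lumped or split cluster lowers the ratio even when only one specimen is misplaced, which is consistent with the observation above that specimen-level congruence can be higher than species-level congruence.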

Table 3: Pairwise match ratios of the mOTU comparisons between species delimitation techniques and parameters are shown on the bottom-left diagonal, while the top-right diagonal indicates the percentage of specimens that need to be removed in order to achieve perfect congruence. A heatmap effect was applied to both metrics, with the highest values in green and lowest in white for the match ratios, while the highest values are in red and lowest in white for the percentages of specimens

Rows and columns follow the same order. D1 = All data (Dataset 1); D2 = NGS barcodes (Dataset 2); D3 = Unconstrained mixed tree (Dataset 3); D4 = Constrained mixed tree (Dataset 4); OC = objective clustering (threshold); ABGD (prior). A dash marks the diagonal.

               D2     D2     D2     D2     D2     D2     D2     D1     D1     D3     D3     D4     D4
               OC     OC     OC     ABGD   ABGD   bPTP   mPTP   bPTP   mPTP   bPTP   mPTP   bPTP   mPTP
               2%     3%     4%     0.001  0.01
D2 OC 2%       -      3.3%   5.9%   24.6%  19.8%  12.8%  10.2%  9.8%   8.3%   19.6%  19.3%  20.9%  12.4%
D2 OC 3%       0.95   -      2.6%   21.5%  16.5%  10.0%  7.4%   7.0%   5.4%   16.7%  16.1%  17.8%  9.1%
D2 OC 4%       0.91   0.96   -      19.6%  14.6%  8.7%   5.7%   5.2%   4.1%   14.8%  14.1%  16.3%  7.4%
D2 ABGD 0.001  0.73   0.78   0.80   -      5.0%   16.5%  17.2%  17.6%  18.7%  6.5%   6.7%   6.5%   15.0%
D2 ABGD 0.01   0.76   0.81   0.84   0.97   -      17.2%  15.7%  14.8%  15.0%  8.5%   8.7%   8.7%   12.8%
D2 bPTP        0.85   0.89   0.91   0.81   0.80   -      3.5%   6.3%   7.8%   10.9%  10.2%  10.2%  7.4%
D2 mPTP        0.88   0.92   0.95   0.81   0.82   0.95   -      3.7%   4.8%   10.2%  9.6%   10.0%  6.3%
D1 bPTP        0.88   0.92   0.95   0.80   0.81   0.92   0.95   -      2.0%   12.2%  11.5%  13.7%  3.9%
D1 mPTP        0.89   0.93   0.95   0.79   0.81   0.90   0.94   0.97   -      13.3%  12.6%  14.8%  5.4%
D3 bPTP        0.78   0.82   0.85   0.86   0.85   0.86   0.86   0.86   0.86   -      4.1%   5.7%   10.0%
D3 mPTP        0.78   0.83   0.86   0.87   0.86   0.86   0.86   0.87   0.86   0.92   -      5.4%   9.8%
D4 bPTP        0.83   0.82   0.84   0.87   0.86   0.87   0.86   0.85   0.84   0.91   0.91   -      11.7%
D4 mPTP        0.84   0.89   0.91   0.85   0.85   0.90   0.91   0.93   0.92   0.90   0.90   0.89   -

Table 4: Pairwise Robinson-Foulds distances (bottom-left) and branch-length scores (top-right) between the trees used in this study

                                       Dataset 2  Dataset 1  Dataset 3  Dataset 4
Barcodes only (Dataset 2)              -          1.29       1.26       1.26
All data (Dataset 1)                   614        -          0.24       0.23
Unconstrained mixed tree (Dataset 3)   608        48         -          0.06
Constrained mixed tree (Dataset 4)     608        40         10         -

3.4.7. Species delimitation: congruence with morphology

When comparing the mOTUs to the morphospecies, we find generally high congruence between molecular and morphological sorting (0.77 – 0.94 match ratios), which indicates that most units are again stable across delimitation techniques (Table 5). The highest congruence is achieved when using mPTP on the constrained mixed tree (0.94 match ratio), although the full dataset produces the highest congruence when both bPTP and mPTP are considered (0.92 match ratio for each). This is also apparent when inspecting the mycetophilid tree constructed with the full character set (Fig. 6): most of the nodes are in agreement between morphospecies and mOTU sorting. The ABGD algorithm with the barcodes-only dataset results in the lowest congruence (0.77 match ratio).

Using the less scalable tree-based instead of distance-based species delimitation techniques results in small gains of congruence when compared to objective clustering (2 – 6 additional morphospecies are congruent with mOTUs). The gains are more substantial if ABGD is included (2 – 24 additional morphospecies are congruent). The constrained mixed tree has the highest congruence when mPTP is used (0.94) but has a poorer match ratio (0.84) when bPTP is applied, which makes it somewhat inconsistent. Without the constraint tree, however, the congruence is poorer (0.84 and 0.87 match ratios), which highlights the need for the constraint tree to stabilize the backbone topology. The number of specimens that require removal in order to achieve perfect congruence is broadly consistent with the match ratios, and only 2 – 47 more specimens out of 817 require removal when tree-based species delimitation methods are used instead of distance-based ones.

Table 5: Match ratios and number of specimens to remove in order to achieve perfect congruence for the various molecular species delimitation methods against the mycetophilid morphospecies clusters

Dataset                            Method                     Total     Clusters  Clusters  Correct   Match     Specimens
                                                              clusters  lumped    split     clusters  ratio     to remove
NGS barcodes (Tree 2)              Objective clustering 2%    119       0         25        94        0.84      51
NGS barcodes (Tree 2)              Objective clustering 3%    116       0         19        97        0.87      41
NGS barcodes (Tree 2)              Objective clustering 4%    114       0         16        98        0.89      37
NGS barcodes (Tree 2)              ABGD P=0.001               91        9         6         76        0.77      79
NGS barcodes (Tree 2)              ABGD P=0.01                91        9         6         76        0.77      79
NGS barcodes (Tree 2)              bPTP                       115       0         17        98        0.89      34
NGS barcodes (Tree 2)              mPTP                       115       0         17        98        0.89      37
All data (Tree 1)                  bPTP                       112       0         12        100       0.92      32
All data (Tree 1)                  mPTP                       112       0         12        100       0.92      32
Unconstrained mixed tree (Tree 3)  bPTP                       104       5         8         91        0.87      38
Unconstrained mixed tree (Tree 3)  mPTP                       106       5         12        89        0.84      66
Constrained mixed tree (Tree 4)    bPTP                       100       5         8         87        0.84      46
Constrained mixed tree (Tree 4)    mPTP                       108       1         6         101       0.94      22

Figure 6: Maximum likelihood phylogeny of the Mycetophilidae extracted from the tree constructed with the full character set, with the 0.6% mOTUs as terminals. Bootstrap supports are indicated at each node and 3% p-distance cox1 clusters with more than one terminal are highlighted in yellow. Within these clusters, branches congruent with morphospecies identification are coloured green, while lumping caused by the clustering are indicated by the red branches. The blue branches indicate where the morphospecies were split by the clustering process

3.5. Discussion

3.5.1. Do tree-based species delimitation methods yield better results?

Tackling the tropical arthropod fauna eventually requires delimiting millions of species by processing hundreds of millions of specimens. This means that there is a premium on finding scalable methods, which is why we here evaluate a range of species delimitation methods that require very different amounts of data and computation time. Techniques that require a lot of data would be justified if they were to dramatically increase accuracy. However, we find that the overall placement of >73% of the mOTUs and specimens in our dataset is consistent (>78% if ABGD is not included). This means that the additional sequencing and species delimitation efforts that are required by the use of tree-based methods on well-resolved trees can only improve species delimitation for ca. 20% of the species (Table 3). This is surprising given that the tree constructed from 313-bp per haplotype was compared with a tree with a mean of 14,179-bp per haplotype.

Additional insights can be gained by comparing the mOTUs obtained with molecular data with the morphospecies for 817 specimens of Mycetophilidae. We find that tree-based methods based on complete mitochondrial genome and 28S rDNA do indeed increase congruence with morphology (Table 5) for up to 23% of the species, although this is greatly reduced to 6% if ABGD is not included. Overall, however, species delimitation with barcodes does yield more inconsistency across delimitation methods and parameters, especially when ABGD is included. We here confirm Luo et al.’s (2018) result that ABGD tends to under-split species. When the results obtained with ABGD are ignored, the mOTUs become far more consistent (Table 3; 0.78 – 0.97 match ratios), which reduces the number of species that require deeper sequencing.

While these observations might change if more intraspecific sampling is done across larger geographical scales, it seems that most species can be effectively pre-sorted into stable units using mini-barcodes. It appears to matter little whether the latter are used to cluster specimens via pairwise distances or poorly-supported trees. As such, the most cost-effective approach to large-scale species delimitation would start with obtaining short mini-barcodes, which can be used to identify stable mOTUs. Afterwards, the unstable mOTUs can be subjected to deeper sequencing before applying tree-based species delimitation methods. The second step may be optional and need only be applied to those taxa where the additional level of accuracy is required in order to address a particular biological question. Based on our data set, the number of specimens that would require such an in-depth treatment may be fairly small. Excluding ABGD, ≤13% of the haplotypes in our study have inconsistent placements in different mOTUs (Table 3).

3.5.2. Improving MMG with multiple baits

The MMG pipeline (Crampton-Platt et al., 2016) is suitable for retrieving naturally enriched DNA sequences like mitochondrial, chloroplast and ribosomal DNA via genome skimming, with minimal pre-sequencing processing that might bias read or taxon representation. Given ample sequencing depth and even sampling of initial gDNA amounts, a sufficient number of the reads can be retrieved to allow for the reconstruction of mitochondrial genomes. However, the downstream bioinformatics present two challenges: 1) avoiding chimeric assemblies and 2) assigning contigs to species.

Problems with the former arise when reads with conserved or overlapping regions are incorporated into the same contig during the assembly process although the reads belong to different species. This problem can be mitigated by ensuring that closely related species are not pooled in the same library, which highlights the important role the NGS barcodes play for pre-sorting. They yield the genetic distances that can be used to assign closely related mOTUs to different libraries. Additionally, a shorter k-mer length can be used for the assembly, which reduces the likelihood that the k-mer contains an error (Chikhi & Medvedev, 2013). However, this comes at the cost of more fragmented scaffolds. The optimal k-mer length is hence difficult to determine. Given these problems, chimeric contigs may still be assembled despite all precautions. Detection is straightforward if enough baits are available for the same species. Multiple baits are also very useful for identifying mitochondrial genomes that are fragmented into multiple contigs.

We here introduce a multiplexed tagged amplicon sequencing pipeline as a cost-effective way to generate multiple baits in a single PCR. By including multiple primers in a single multiplexed reaction, with the primers uniquely tagged such that they can be pooled in a single Illumina run, few PCR reactions are required, thereby reducing manpower and reagent cost, and lowering the risk of human error. Note that the consumable and sequencing costs are largely negligible (0.78 USD per amplicon on a HiSeq 2500). In this study, we designed 14 tagged primers targeting mitochondrial coding regions that were spaced across the mitochondrial genome and distant enough from each other to reduce the likelihood of sequencing long-range products. The distribution of contig lengths is likely dependent on various factors such as quality of starting material and sequencing depth, which makes it difficult to recommend a fixed number of amplicons to safely bait most contigs. However, it is more effective to design baits that span the target region at even intervals in order to maximize the ability to identify contigs.

With multiple baits, chimeric assemblies can easily be detected because baits from different species align to the same contig (3.87% of all supercontigs in this study). Such problematic contigs can then be resolved. Similarly, with more baits spread across the mitochondrial genome, it is more likely that short contigs may still be assigned species identities and used for downstream analyses. In our study, using multiple baits allowed us to retain the 10 – 19% of sequence data that would have been lost had only a single bait been used (Fig. 3). The coverage assigned to each mOTU (~500,000 reads) was generous enough for recovering a large proportion of the baited contigs (~70%). If the assemblies were less complete due to poorer coverage, we would likely recover less data with just one bait. As such, incorporating multiplexed tagged amplicon sequencing into the MMG process makes it more robust to low coverage and improves data retrieval.
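The detection logic described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in this study: the input is assumed to be a list of (contig, bait gene, mOTU) matches already obtained by aligning baits to contigs, and the contig names, gene names and mOTU labels are hypothetical.

```python
from collections import defaultdict


def classify_contigs(bait_hits):
    """Separate contigs baited by one mOTU from putatively chimeric ones.

    bait_hits: iterable of (contig_id, bait_gene, motu_id) tuples, one per
    bait that aligned to a contig (assumed to be computed upstream).
    Returns (assigned, chimeric), where `assigned` maps contig -> mOTU and
    `chimeric` maps contig -> sorted list of conflicting mOTUs.
    """
    motus_per_contig = defaultdict(set)
    for contig_id, _bait_gene, motu_id in bait_hits:
        motus_per_contig[contig_id].add(motu_id)

    assigned, chimeric = {}, {}
    for contig_id, motus in motus_per_contig.items():
        if len(motus) == 1:
            assigned[contig_id] = next(iter(motus))
        else:
            # Baits from different mOTUs on one contig suggest a chimera.
            chimeric[contig_id] = sorted(motus)
    return assigned, chimeric


# Hypothetical bait matches for three contigs.
example_hits = [
    ("contig1", "cox1", "mOTU_A"),
    ("contig1", "nad5", "mOTU_A"),
    ("contig2", "cox1", "mOTU_B"),
    ("contig2", "cytb", "mOTU_C"),
    ("contig3", "nad1", "mOTU_B"),
]
assigned, chimeric = classify_contigs(example_hits)
print(assigned)
print(chimeric)
```

In this toy example, contig3 carries no cox1 bait and would have been discarded under a single-bait design, while contig2 is flagged because its two baits belong to different mOTUs.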

Finally, multiplexed tagged amplicon sequencing is ideal for recovering nuclear or ribosomal markers that are otherwise too conserved to obtain via MMG. A good example is the 28S rDNA marker. Most nuclear markers, being more conserved, would require sequencing pools to be split across a large number of libraries in order to avoid chimeric assemblies of closely related species. Multiplexed tagged amplicon sequencing, however, overcomes both these limitations by enriching the proportion of these markers via PCR amplification, while allowing the amplified reads to be easily demultiplexed and assigned to the correct species. As such, only a single library is required so long as there are sufficient primer tag combinations available to differentiate species. As a precaution, however, we recommend that adjacent regions be amplified in separate multiplexed reactions in order to reduce the likelihood of long-range amplification.

3.5.3. Toward an efficient and scaling species-discovery pipeline

Combining multiplexed tagged amplicon sequencing with MMG allows for affordable sampling of many characters across a large number of mOTUs. However, for the purpose of more accurate species delimitation, it might be possible to further reduce cost by selecting species-representative haplotypes for deeper sequencing while relying on existing barcodes for delimiting the remaining mOTUs. This results in a supermatrix comprising a mixture of taxa with fully sampled characters and taxa that only have short barcodes. The best-sampled mOTUs would then be used to structure the backbone topology of the tree, while the remaining mOTUs would affix to the terminals. This greatly reduces the sequencing load, as long as the barcoded mOTUs correctly place with closely related mOTUs.

The concept is similar to phylogenetic placement methods, such as EPA (Berger & Stamatakis, 2011) and pplacer (Matsen et al., 2010), which query each placement sequence against a global alignment of references accompanied by a reference tree. Various alignment scoring algorithms are then used to determine the best node on the tree that the query sequence can be inserted into. These methods, however, are more akin to closed-reference OTU-picking (Bik et al., 2012), where the accuracy of taxonomic assignment is dependent on the taxonomic completeness of the reference database or tree (Meyer & Paulay, 2005). Given how poorly sampled tropical arthropod diversity is currently, it is unlikely that an accurate and comprehensive reference tree is available for such phylogenetic placement algorithms. We believe that de novo OTU-picking methods (i.e. objective clustering, ABGD, PTP, GMYC) that focus on species delimitation into mOTUs are more likely to yield accurate putative species units. These units can then be formally described later.

With this proposed method of mixed character sampling, trees for tree-based species delimitation can be constructed. We find that although the underlying supermatrices have a large amount of missing data, the tree is remarkably well-supported, with bootstrap values very similar to those of the tree constructed with the full dataset (Fig. 5). However, missing data could cause incorrect estimation of branch length and topology and consequently affect species delimitation. In our study, we find that constructing a tree using such supermatrices indeed leads to a small decline of congruence with morphospecies; even the tree based on only barcodes performs better (Table 5). While this is likely due to the large amount of missing data in the character matrix (43% missing data), most of the delimited mOTUs remain unchanged (Table 3), which is in agreement with the literature about how taxa with extensive missing data can often be placed accurately (Wiens & Morrill, 2011).

When the backbone topology of the tree is constrained, the mixed dataset performs better and we observe a marked increase in congruence with morphology (Table 5). This tree also has similar branch lengths to the tree constructed with the full dataset (Table 4). However, the high congruence only applies to the application of mPTP to this tree, while bPTP produces much poorer congruence. As such, while promising, the results are not consistent and it would be difficult to select a preferred method in the absence of morphological data. While our results tend to favour mPTP, additional testing is required to assess the performance of bPTP for more datasets.

After taking our findings into consideration, we suggest the following final pipeline, which should be subjected to further testing with additional datasets (Fig. 7): Firstly, barcode each individual using NGS barcoding. Secondly, use a variety of methods (objective clustering, ABGD, mPTP and bPTP were used here) and parameters to identify those mOTUs that are consistent across all species delimitation techniques. These are very likely to be accurate species units. Most mOTUs are likely to fall into this category (~80%). Thirdly, one haplotype for each consistent mOTU and all haplotypes for the inconsistent mOTUs should be subjected to multiplexed tagged amplicon sequencing and MMG in order to sample sufficient characters for reconstructing a reliable tree for species delimitation.
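The second and third steps can be sketched in a few lines. The sketch below assumes that each delimitation method has already produced a partition of the same haplotypes; the method labels, haplotype names and the rule for picking a representative (alphabetically first) are hypothetical choices made only for illustration.

```python
def stable_clusters(partitions):
    """Clusters with identical membership under every delimitation method.

    partitions: dict mapping a method label to a list of clusters, each
    cluster being an iterable of haplotype identifiers.
    """
    cluster_sets = [
        {frozenset(cluster) for cluster in clusters}
        for clusters in partitions.values()
    ]
    return set.intersection(*cluster_sets)


def select_for_deeper_sequencing(partitions):
    """One representative per stable mOTU plus every unstable haplotype."""
    stable = stable_clusters(partitions)
    stable_haplotypes = set().union(*stable)
    all_haplotypes = {
        haplotype
        for clusters in partitions.values()
        for cluster in clusters
        for haplotype in cluster
    }
    # Arbitrary but reproducible choice of representative.
    representatives = {min(cluster) for cluster in stable}
    unstable = all_haplotypes - stable_haplotypes
    return sorted(representatives | unstable)


# Hypothetical partitions of five haplotypes under three methods.
example_partitions = {
    "objective_3pct": [["h1", "h2"], ["h3"], ["h4", "h5"]],
    "abgd": [["h1", "h2"], ["h3", "h4", "h5"]],
    "mptp": [["h1", "h2"], ["h3"], ["h4", "h5"]],
}
print(select_for_deeper_sequencing(example_partitions))
```

Here only the cluster containing h1 and h2 is recovered by all three methods, so a single representative of it is selected together with the three haplotypes whose placement is inconsistent.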

Figure 7: Proposed pipeline for the species delimitation of large highly diverse samples using both distance-based and tree-based species delimitation techniques

3.5.4. Concluding statement

In this chapter, we have shown that while barcoding provides sufficient information for reliable species delimitation for most mOTUs, the remaining inconsistent 10 – 20% can be resolved with further character sampling and tree-based species delimitation. These additional characters can be obtained affordably through a combination of MMG and multiplexed tagged amplicon sequencing techniques. Lastly, a mixed dataset with greater character sampling for interspecific representatives and barcodes for intraspecific representatives can generate accurate mOTUs without conducting deeper sequencing for every haplotype. With this pipeline, systematists are better able to tackle the enormous species diversity inherent in tropical arthropods at a manageable cost.

Chapter 4

Should hybrid enrichment be used for mitochondrial metagenomics?

4.1. Abstract

Mitochondrial metagenomics is a fairly straightforward and elegant method for sequencing large numbers of mitochondrial genomes across a broad range of species. The basic pipeline is predicated on the genome skimming of naturally enriched mitochondrial DNA in a given genomic DNA extract. However, one drawback of the technique is that a large proportion of reads are not mitochondrial and thus typically unused. Hybrid enrichment, a technique commonly used in exon-capture, might provide a solution to this problem by greatly increasing the representation of mitochondrial DNA in the sequencing pool. Here, we test the efficacy of this approach by sequencing the mitochondrial genomes of 3683 species of Coleoptera from Borneo and the Neotropics with both hybrid enrichment and genome skimming pipelines.

These species were initially barcoded (420-bp cox1 mini-barcode) via a nested PCR amplification approach on an Illumina HiSeq 2500 platform, in order to obtain “baits” to assign the mitochondrial contigs to species. We found the enrichment process to be highly efficient, with an average of 76% of the reads per library being mitochondrial, versus an average of 6% in the genome-skimming pools. However, the coverage across the mitochondrial genome for the hybrid-enriched pools was highly uneven, resulting in mostly short, fragmented contigs (<5000-bp) post-assembly. This poses a problem for both super-contig assembly and species assignment based on a single bait. For these reasons, the original mitogenomic pipeline based on unenriched DNA yielded significantly more informative data than the hybrid-enriched pools. This suggests that it is more advantageous to devote resources to increased sequencing depth instead of hybrid-enrichment when obtaining data for a broad range of largely unknown species.

4.2. Introduction

In Chapter 3, we utilized the mitochondrial metagenomics (MMG; Crampton-Platt et al., 2016) pipeline to successfully sequence the mitochondrial genomes of approximately 460 species of Diptera. Based on the seven libraries used in this run, an average of 8.36% of reads were of mitochondrial origin. As such, despite mitochondrial DNA being naturally enriched in the genomic DNA (gDNA) sample, roughly 90% of the data remained unused. In order to improve the recovery of mitochondrial genomes, one could either increase the overall sequencing depth while retaining the natural level of mitochondrial DNA representation in the pool or improve the efficiency of mitochondrial retrieval per sequencing run via hybrid enrichment.

The first option is increasingly viable because of the decreasing cost of sequencing, especially with the development of newer platforms with higher throughput. The Illumina HiSeq 4000 can generate up to 5 billion reads per run, while the latest NovaSeq 6000 platform produces up to four times that amount. This means that the sequencing cost per mitochondrial genome is likely to continue decreasing as long as higher throughput platforms are being used and sufficient species are being pooled into a single run. While retaining the ~8-10% mitochondrial representation in each pool, the sequencing cost would scale linearly with any improvement of sequencing throughput.

The second option involves improving the representation of mitochondrial reads in the sequencing pool. This can be achieved through either organelle or nucleotide enrichment. Organelle enrichment involves the isolation of mitochondria through various methods like cell fractionation and differential centrifugation (Sims & Anderson, 2008), with enrichment levels as high as 20% reported. This method is generally inexpensive, but it is time-consuming and potentially inefficient, with the enrichment procedure in some cases yielding only 0.53% mitochondrial DNA of the entire read pool (Zhou et al., 2013). Nucleotide enrichment, however, directly increases the proportion of target reads in the sequenced pool, either through amplification or hybrid capture (Mamanova et al., 2010). PCR amplification has been frequently used for select regions in the genome; indeed, most of my thesis uses this technique in the form of NGS barcoding and multiplexed tagged amplicon sequencing. Anchored hybrid enrichment, however, utilizes a set of probes which hybridize to the regions of interest (or those flanking them), thereby selectively enriching the sample when non-hybridized DNA fragments are removed from the pool. This can be achieved either through probes that are immobilized on an array or in-solution probes. Overall, there seems to be no appreciable difference in performance, although the latter is more easily scalable (Mamanova et al., 2010). This technique has been mainly used for exon capture (Hodges et al., 2007), targeting ultra-conserved elements (UCEs) for phylogenomics (Faircloth et al., 2012; Lemmon et al., 2012), and enriching degraded DNA from museum specimens (Bi et al., 2013), but rarely for enriching mitochondrial reads (Liu et al., 2016). In Liu et al.’s study, this technique was validated using 49 known mitochondrial genomes, but it has not yet been used for large-scale phylogenomics (examined further in the Discussion).

In this chapter, we test the potential of hybrid enrichment for mitochondrial capture across a large number and broad range of tropical Coleoptera species (3683 species). The same species were also sequenced via genome skimming in order to obtain comparison material, although much greater sequencing depth was required. This study design allows us to compare the sequencing and post-sequencing processing outputs of the MMG pipeline with and without hybrid capture. Note that the species in the sample were barcoded using a nested PCR amplification method where primers include an overhang adapter sequence to facilitate high-throughput sequencing on an Illumina platform (Arribas et al., 2016). These barcodes were vital both in ensuring that closely related species were not pooled in the same library and in assigning downstream assembled contigs to species. We find that while hybrid enrichment for mitochondrial DNA greatly increased the number of mitochondrial reads, targeting too broad a taxon set can result in patchy and uneven coverage across the mitochondrial genome, thereby resulting in short, fragmented contigs.

4.3. Materials & Methods

4.3.1. Sampling, extraction and barcoding

Coleoptera specimens were collected via Malaise traps, flight intercept traps, Winkler traps, pitfall traps and beating sheets in four sites located in French Guiana, Panama (Santa Fe and Cerro Hoya) and Borneo (Poring) as part of the Biodiversity Initiative of the London Natural History Museum. The specimens were sorted into morphospecies by para-taxonomists and one specimen for each morphospecies was selected for gDNA extraction. These representatives were also imaged and identified to the lowest possible taxonomic level. gDNA was extracted non-destructively from the specimens using the QIAGEN Biosprint 96 following the manufacturer’s instructions, after which the specimens were deposited in the London Natural History Museum.

A total of 7346 morphospecies were barcoded through the amplification of a 420-bp cox1 fragment via a dual-indexing system, in which each primer (Forward: 5’-CCNGAYATRGCNTTYCCNCG-3’ [modified from Shokralla et al., 2015a], Reverse: 5’-TANACYTCNGGRTGNCCRAARAAYCA-3’ [Yu et al., 2012]) was designed to have a unique 6-bp 5’ tag and an overhang (Forward: 5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3’, Reverse: 5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-3’) complementary to the Illumina adapter. This facilitates a nested PCR approach to Illumina library preparation and utilizes the combination of both primer tags and Illumina indexes to generate unique identifiers for downstream demultiplexing. In this manner, 49 tagged primer pairs and 96 Illumina indexes can generate 4704 unique identifiers without any risk of tag mismatch due to chimera formation (Schnell et al., 2015). The amplification was performed in 20 µl reactions containing 13.4 µl molecular grade water, 2.0 µl BIOTAQ reaction buffer, 1.4 µl MgCl2 (50 mM), 0.4 µl dNTPs (10 mM), 0.3 µl each of forward and reverse primers (10 µM), 0.2 µl BIOTAQ Taq polymerase and 2.0 µl of DNA extract. The cycling protocol consisted of an initial denaturation at 94°C for 4 min, 39 cycles of denaturation at 94°C for 30 s, annealing at 48°C for 30 s and extension at 72°C for 1.5 min, and a final extension at 72°C for 10 min.
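The number of identifiers follows from combining every tagged primer pair with every Illumina index. A minimal sketch of this bookkeeping is shown below; the label formats ("tag01", "index01") and the function name are hypothetical placeholders, not the names used in the actual demultiplexing.

```python
from itertools import product


def build_identifiers(n_tagged_primer_pairs, n_illumina_indexes):
    """Enumerate all (primer tag pair, Illumina index) combinations."""
    tag_pairs = [f"tag{i:02d}" for i in range(1, n_tagged_primer_pairs + 1)]
    indexes = [f"index{i:02d}" for i in range(1, n_illumina_indexes + 1)]
    return list(product(tag_pairs, indexes))


identifiers = build_identifiers(49, 96)
print(len(identifiers))
print(identifiers[0], identifiers[-1])
```

Each sample is then assigned one such combination, so that reads can be attributed to a sample only when both the primer tags and the Illumina index agree.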

The amplicons were then pooled in an equimolar fashion and sent to Earlham

Institute for sequencing across two Illumina Hiseq 2500 lanes (300-bp paired end), along with samples from other projects. PCR clean-up and library preparation with

Truseq DNA kits were also performed at the institute’s sequencing facility. The sequencing output was processed via a suite of scripts collectively termed NAP (NGS

Amplicon Pipeline) that performs demultiplexing, paired-end read merging, quality filtering and barcode selection based on the dominant read in each demultiplexed pool

(Creedy, unpublished [URL: https://github.com/tjcreedy/NAPtime/wiki]). A BLAST search against the NCBI GenBank nucleotide database was then conducted on the filtered barcodes to identify and discard contaminants. After the BLAST filter, a total of 6717 barcodes remained, which were then aligned with MAFFT v7 (Katoh &

Standley, 2013) with the “--adjustdirection” function, a gap opening penalty of 5.0, and other default parameters. The aligned barcodes were then grouped into putative species via objective clustering of uncorrected p-distances at a threshold of 3% using a custom clustering script (Srivathsan, unpublished). This resulted in 3683 molecular operational taxonomic units (mOTUs) that were then used in downstream phylogenetic analyses.
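The grouping principle can be illustrated with a minimal sketch. It is not the custom clustering script used in this study: it assumes that the sequences are already aligned, that gaps and Ns are ignored when computing uncorrected p-distances, and that any two sequences at or below the threshold are linked into the same cluster (single linkage). The function names and toy sequences are hypothetical.

```python
def p_distance(a, b):
    """Uncorrected p-distance over aligned sites where both sequences have a base."""
    compared = mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue  # skip gaps and undetermined bases
        compared += 1
        mismatches += x != y
    return mismatches / compared if compared else 1.0


def cluster_by_threshold(seqs, threshold=0.03):
    """Link every pair of sequences within the threshold (single linkage)."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy alignment of 100 sites: A-B and B-C differ by 2%, A-C by 4%, D is unrelated.
toy = {
    "A": "A" * 100,
    "B": "A" * 98 + "CC",
    "C": "A" * 96 + "CCCC",
    "D": "G" * 100,
}
clusters = cluster_by_threshold(toy, threshold=0.03)
print(clusters)
```

In the toy example, A and C end up in the same cluster although they differ by 4%, because both are within 3% of B; this chaining is characteristic of single-linkage grouping.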


4.3.2. MMG and hybrid enrichment library preparation and sequencing

The DNA for the 3683 mOTUs was pooled into 10 different libraries, such that each mOTU had at least a 5% p-distance divergence from all other mOTUs in the same pool. This reduces the likelihood of chimeric assemblies that can arise from closely related species in the same sequencing pool. The gDNA extracts of the 3683 mOTUs were quantified on a Nanodrop and the concentrations were used to determine the volumes to be pooled for each sample. The pooling process was automated with the Perkin Elmer JANUS liquid handling workstation in order to reduce the likelihood of human error. The gDNA from these 10 pools was then used for both the MMG and hybrid enrichment pipelines.
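One way to satisfy such a divergence constraint is a greedy first-fit assignment, sketched below. This is an illustration only and not necessarily the procedure used for the actual pooling; the function name, the toy distances and the first-fit strategy are assumptions.

```python
def assign_to_pools(names, distance, n_pools=10, min_divergence=0.05):
    """Place each mOTU in the first pool where it is >= min_divergence from all members."""
    pools = [[] for _ in range(n_pools)]
    unplaced = []  # mOTUs that fit in none of the pools
    for name in names:
        for pool in pools:
            if all(distance(name, other) >= min_divergence for other in pool):
                pool.append(name)
                break
        else:
            unplaced.append(name)
    return pools, unplaced


# Toy pairwise p-distances between four hypothetical mOTUs.
toy_distances = {
    frozenset(("a", "b")): 0.02, frozenset(("a", "c")): 0.10,
    frozenset(("a", "d")): 0.12, frozenset(("b", "c")): 0.09,
    frozenset(("b", "d")): 0.03, frozenset(("c", "d")): 0.20,
}


def toy_distance(x, y):
    return toy_distances[frozenset((x, y))]


pools, unplaced = assign_to_pools(["a", "b", "c", "d"], toy_distance, n_pools=2)
print(pools, unplaced)
```

Here b is kept apart from a and d because it is less than 5% divergent from both. Any mOTU that cannot be placed is reported rather than forced into a pool.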

For the MMG pipeline, the pools were directly submitted to the Beijing

Genomics Institute (BGI) for library preparation and sequenced across 10 lanes on multiple BGISEQ platforms for a total of ~4 billion reads (150-bp paired end). In the hybrid-enrichment pipeline, a set of 21,691 120-bp probes evenly distributed across the entire mitochondrial genome for 164 Coleoptera species across the tree-of-life

(obtained from NCBI GenBank as well as in-house sequencing) were designed by Dr.

Benjamin Linard and Dr. Carmelo Andújar (unpublished). This probe set was ordered as part of the Agilent SureSelect XT Target Enrichment System (Tier 2: probe region size = 0.5 - 2.999 Mbp; up to 57.5K probes), which also includes a library preparation kit for Illumina platforms. The hybrid capture was performed on the 10 pools concurrently with the library preparation according to the manufacturer’s instructions with the following modifications of the protocol: 1) shearing performed with the

Covaris M220 optimized for an 800-bp target peak, 2) dual-SPRI size selection performed at 0.5X – 0.7X beads to product ratio to obtain a 680-bp peak, 3) all other

SPRI clean-up steps performed at 0.9X ratios and 4) in the final amplification step, the annealing time was increased to 1 min, extension time increased to 1.5 min and the number of cycles was increased to 15. The long insert size was selected to potentially


increase the efficiency of mitochondrial recovery (Crampton-Platt et al., 2016). The libraries were then submitted to the Earlham Institute sequencing facility and sequenced on a single Hiseq 2500 Rapid lane, 250-bp paired end, shared with samples from other projects.

4.3.3. Post-sequencing processing and bioinformatics

The processing of the reads from both MMG and hybrid-enrichment pipelines followed the same basic pipeline, unless the MMG datasets were too large and alternative strategies had to be pursued. Adapter trimming was first performed using

Trimmomatic v0.36 (Bolger et al., 2014), with the following parameters: seed mismatches = 2, palindrome clip threshold = 30, simple clip threshold = 10. A BLAST search was then performed on the trimmed reads against a local database of full-length

Coleoptera mitochondrial genomes obtained from NCBI GenBank for retaining

“mitochondrial-like” DNA reads, with task=dc-megablast instead of blastn for the

MMG dataset due to the large file size.

The filtered reads were then assembled into contigs using three different assemblers, namely IDBA-UD (Peng et al., 2012), Ray (Boisvert et al., 2012) and

SPAdes (Bankevich et al., 2012). Before IDBA-UD assembly, PRINSEQ (Schmieder

& Edwards, 2011) was used for quality filtering as IDBA-UD does not have this built-in function. The IDBA-UD assembly parameters used were --mink 60 and --maxk 250 for the hybrid enrichment dataset and --maxk 150 for the MMG dataset, with other parameters as default. The Ray assembly parameters were -k 61, -minimumseedlength

100, and -minimumcontiglength 1000, with other parameters as default. The SPAdes assembly parameters were --meta and -k 21,33,55,77,99,127 for the hybrid enrichment dataset and -k 21,33,55,77 for the MMG dataset. For certain libraries in the MMG dataset where SPAdes ran out of memory, the CLC Genomics Workbench v8.0.3

(https://www.qiagenbioinformatics.com) de novo assembler was used instead.


The contigs obtained from all three assemblers were subject to another BLAST search against the same Coleoptera mitochondrial genome reference set in order to screen for and discard misassembled contigs and pseudogenes. In addition, a length filter was used to discard contigs <1000-bp long. The resultant contigs were assembled into supercontigs in Geneious R7 (Kearse et al., 2012) with 1000-bp minimum overlap,

1% maximum gaps, and 1% maximum mismatches per read; all other parameters were kept as default. A combined dataset of supercontigs was also generated by combining contigs from both hybrid enrichment and genome skimming runs in the same supercontig assembly. The barcodes obtained earlier were mapped to the contigs from their corresponding libraries in Geneious at 0% maximum mismatches per read and

100% minimum overlap identity to assign the contigs to species.
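The baiting step was carried out in Geneious; the following sketch only illustrates the underlying idea with an exact substring search on both strands, combined with the 1000-bp length filter. The function names and toy sequences are hypothetical, and ambiguity codes and alignment gaps are not handled.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def bait_contigs(contigs, barcodes, min_length=1000):
    """Map contig name -> barcode name for contigs that contain a barcode exactly."""
    assignments = {}
    for contig_name, contig in contigs.items():
        if len(contig) < min_length:
            continue  # length filter: discard short contigs
        contig = contig.upper()
        for barcode_name, barcode in barcodes.items():
            barcode = barcode.upper()
            # Zero mismatches: the barcode must occur verbatim on either strand.
            if barcode in contig or reverse_complement(barcode) in contig:
                assignments[contig_name] = barcode_name
                break
    return assignments


# Toy data; a small min_length is used only to keep the example short.
toy_barcodes = {"mOTU_1": "ACGTTGCA", "mOTU_2": "GGGCCCAT"}
toy_contigs = {
    "c1": "T" * 10 + "ACGTTGCA" + "T" * 10,   # forward-strand match to mOTU_1
    "c2": "A" * 10 + "ATGGGCCC" + "A" * 10,   # reverse-complement match to mOTU_2
    "c3": "ACGTTGCA",                          # too short, removed by the filter
    "c4": "C" * 30,                            # no barcode, remains unbaited
}
assignments = bait_contigs(toy_contigs, toy_barcodes, min_length=20)
print(assignments)
```

Contigs without a barcode match, such as c4, stay unassigned, which mirrors the situation of fragmented contigs that do not span the cox1 barcode region.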

To generate coverage profiles of the mitochondrial reads, the sequences obtained after the first mitochondrial BLAST filter were mapped to three different references: 1) the longest scaffold obtained from the library the mapping reads were from, 2) a 50% majority rule threshold consensus sequence generated from the 164 mitochondrial genomes used for the probe design, and 3) a 50% majority rule threshold consensus sequence generated from 154 full Coleoptera mitochondrial genomes downloaded from RefSeq (Pruitt et al., 2006). As the file sizes were too large to map the combined reads from all libraries, the reads from each library were mapped separately. The probes used were also mapped to the RefSeq reference at different similarity fractions. The mapping was performed in the CLC Genomics Workbench with 0.7 length and 0.8 similarity fractions.

4.4. Results

4.4.1. Illumina barcoding and species delimitation

The barcode amplicons were pooled across two Hiseq 2500 lanes, yielding

102,362,556 reads from one and 57,762,446 from the other, thereby amounting to a


total of 160,125,002 reads. Of the 7346 morphospecies processed and included in the run, a total of 6717 barcodes passed all filters, which translates to a relatively high success rate of 91.4%. After objective clustering at a 3% p-distance threshold, 3683 mOTUs were obtained, which suggests a substantial degree of error in the morphospecies sorting process. This was, however, expected, given the large sample size, the involvement of multiple para-taxonomists, and the high species richness of poorly understood Coleoptera communities. Representatives from these

3683 mOTUs were then used for deeper sequencing via the MMG pipeline with and without hybrid enrichment.

4.4.2. Mitochondrial read counts

Both pure genome skimming and hybrid enriched genome pools contained the

DNA of 3683 mOTUs across 10 libraries. The former was sequenced across multiple

BGISEQ lanes and generated a total of 4,107,938,311 150-bp reads, while the latter was sequenced on a fraction of a single Illumina Hiseq 2500 lane and generated a total of 118,366,631 300-bp reads (Table 1). After adapter-trimming and BLAST filtering for mitochondrial-like reads, a total of 234,464,908 reads were obtained from the genome skimming pool, a 5.7% proportion of mitochondrial-like reads. The hybrid enrichment pool, however, gave a total of 86,270,331 reads after filtering, a 72.9% proportion of mitochondrial-like reads; the hybrid enrichment was thus very successful, with an almost 13X increase in the proportion of mitochondrial reads. In other terms, despite starting with approximately 35X the number of raw reads, the genome skimming pool retained only around 3X the number of mitochondrial reads of the hybrid enriched pool after filtering. Note, however, that one of the genome skimming libraries (Library 1) largely failed due to uneven library pooling.
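The proportions and ratios quoted above follow directly from the read totals in Table 1, as the short calculation below shows. The variable names are illustrative.

```python
# Read totals from Table 1 (GS = genome skimming, HE = hybrid enrichment).
raw_reads = {"GS": 4_107_938_311, "HE": 118_366_631}
mito_reads = {"GS": 234_464_908, "HE": 86_270_331}

# Percentage of mitochondrial-like reads in each pool.
pct_mito = {k: 100 * mito_reads[k] / raw_reads[k] for k in raw_reads}

# Fold increase in the mitochondrial proportion achieved by enrichment.
fold_enrichment = pct_mito["HE"] / pct_mito["GS"]

# How many times more raw and mitochondrial reads the GS pool contains.
raw_ratio = raw_reads["GS"] / raw_reads["HE"]
mito_ratio = mito_reads["GS"] / mito_reads["HE"]

print(f"GS: {pct_mito['GS']:.1f}% mito, HE: {pct_mito['HE']:.1f}% mito")
print(f"enrichment: {fold_enrichment:.1f}X, raw reads: {raw_ratio:.1f}X, "
      f"mito reads: {mito_ratio:.1f}X")
```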


Table 1: Read counts of both pure genome skimming and hybrid enriched libraries before and after filtering of mitochondrial-like reads

          Hybrid Enrichment                            Genome Skimming
Library   Raw reads      Mito-like reads   % Mito      Raw reads        Mito-like reads   % Mito
1         9,102,083      6,342,969         69.69%      22,611,541       5,018,932         22.20%
2         10,559,922     7,441,773         70.47%      589,923,427      46,252,265        7.84%
3         11,694,629     8,331,814         71.24%      613,824,564      16,801,433        2.74%
4         9,525,103      7,135,632         74.91%      538,701,874      12,467,779        2.31%
5         11,107,063     8,326,921         74.97%      378,370,780      16,119,883        4.26%
6         11,302,088     8,513,003         75.32%      350,341,528      27,502,898        7.85%
7         15,117,836     11,953,783        79.07%      534,780,790      19,349,146        3.62%
8         13,975,349     9,913,565         70.94%      426,510,268      57,495,875        13.48%
9         11,566,625     8,408,580         72.70%      299,944,322      16,735,546        5.58%
10        14,415,933     9,902,291         68.69%      352,929,217      16,721,151        4.74%

Total     118,366,631    86,270,331        72.88%      4,107,938,311    234,464,908       5.71%


4.4.3. Mitochondrial genome assembly

The filtered reads were then assembled into contigs via three different assemblers and re-assembled again into supercontigs. The number of these supercontigs and their lengths were tallied. A total of 9455 supercontigs were obtained from the hybrid-enrichment libraries, while only 3637 supercontigs were derived from the genome-skimming libraries (Fig. 1). However, most of the hybrid enriched supercontigs were short and fragmented, with only 29 supercontigs >15,000-bp long

(0.3%). In comparison, 1462 of the 3637 (40.2%) genome skimming supercontigs were >15,000-bp – a much higher proportion of complete assemblies.

[Figure 1 plot: y-axis, supercontig length (0 to 18,000); x-axis, ranked supercontigs (1 to 9717); series: HE, GS, HE + GS]

Figure 1: Ranked supercontig length distribution of mitochondrial supercontigs, sorted from longest to shortest, from hybrid enriched (HE) and genome skimming (GS) datasets, as well as a combined supercontig assembly of contigs from both datasets (HE + GS)

Only 2932 of the 9455 hybrid enrichment supercontigs could be baited by the cox1 barcode; i.e., it is likely that most of the unbaited supercontigs belong to the other segments of the mitochondrial genome scaffold. The large number of fragmented contigs rendered it difficult to use the single “bait” strategy for assigning supercontigs to species. The hybrid enrichment dataset only yielded 18/2932 (0.6%) baited


supercontigs (supercontigs that could be assigned to species) that were >15,000-bp long, as opposed to 1463/2638 (55.5%) for the baited supercontigs obtained via unenriched genome skimming (Fig. 2).

[Figure 2 plot: y-axis, supercontig length (0 to 18,000); x-axis, ranked supercontigs (1 to 3997); series as labelled in the legend: HE, BGI, HE + BGI]

Figure 2: Ranked baited supercontig length distribution of mitochondrial supercontigs, sorted from longest to shortest, from hybrid enriched (HE) and genome skimming (GS) datasets, as well as a combined supercontig assembly of contigs from both datasets (HE + GS)

When the contigs from both hybrid enrichment and genome skimming datasets were combined and used in the supercontig assembly, the number of complete supercontigs increased, but not substantially compared to what had already been obtained via unenriched genome skimming. A total of 1609 supercontigs with lengths >15,000-bp were obtained from the combined dataset, just 142 more than the genome-skimming dataset (Fig. 1); i.e., the hybrid enrichment supercontigs only improved the genome-skimming dataset by 9.7%. This is also the case when considering only baited supercontigs, with 1500 supercontigs with >15,000-bp from the combined dataset representing only 37 additional supercontigs (2.5%) compared to the genome-skimming dataset (Fig. 2).


4.4.4. Mitochondrial read coverage

In order to explore why so many short, fragmented contigs were obtained based

on the hybrid enrichment dataset, the mitochondrial reads were mapped against three

different reference genomes. The reads from each library were mapped separately and

yielded similar profiles so that only the profiles from two randomly chosen libraries (3

& 5) are depicted here (Fig. 3a & b). It is immediately apparent that the hybrid

enrichment reads generate much higher coverage in certain mt genome regions, namely

those including the cox1 and 16S lrRNA regions. The mapping profiles are much less even than those obtained by conventional genome skimming, where small sections of the genome have low coverage but no particular gene region dominates the others. The read mapping profiles are consistent across all libraries, so they are unlikely to be artefacts of library construction. The proportions of mitochondrial-like reads mapped (Table 2) are consistent across all libraries but are much lower for the genome skimming dataset than for the hybrid enrichment dataset, despite the rather lenient mapping criteria.


Table 2: Counts and proportions of reads mapped to all three references for both libraries 3 and 5

Library 3           Genome Skimming                                          Hybrid Enrichment
Reference           Reads mapped   Mito-like total reads   % reads mapped    Reads mapped   Mito-like total reads   % reads mapped
Contig Reference    1,313,011      33,602,866              3.91%             3,262,773      16,663,628              19.58%
Probe Reference     1,842,513      33,602,866              5.48%             4,525,179      16,663,628              27.16%
RefSeq Reference    1,635,259      33,602,866              4.87%             4,085,525      16,663,628              24.52%

Library 5           Genome Skimming                                          Hybrid Enrichment
Reference           Reads mapped   Mito-like total reads   % reads mapped    Reads mapped   Mito-like total reads   % reads mapped
Contig Reference    548,141        32,239,766              1.70%             3,316,775      16,653,842              19.92%
Probe Reference     654,783        32,239,766              2.03%             4,694,305      16,653,842              28.19%
RefSeq Reference    590,969        32,239,766              1.83%             4,295,486      16,653,842              25.79%


Figure 3a: Mapping tracks of library 3 mitochondrial reads from both genome skimming and hybrid enrichment datasets mapped to 3 references (top to bottom): 1) the longest scaffold in the library, 2) consensus sequence of mitochondrial genomes used for probe design and 3) consensus sequence of RefSeq Coleoptera mitochondrial genomes


Figure 3b: Mapping tracks of library 5 mitochondrial reads from both genome skimming and hybrid enrichment datasets mapped to 3 references (top to bottom): 1) the longest scaffold in the library, 2) consensus sequence of mitochondrial genomes used for probe design and 3) consensus sequence of RefSeq Coleoptera mitochondrial genomes

We then mapped the probe sequences to the same RefSeq Coleoptera consensus mitochondrial genomes at different match similarity thresholds in order to examine if the read distributions are driven by probe conservation. This would affect the binding efficiency and could explain the sequencing results (Fig. 4). At the highest threshold, only probes at the 16S region were mapped, indicating a high degree of similarity to the reference. As the similarity threshold decreases, the next highest peak appears at the cox1 region, followed by the nd5 region and subsequently the other gene regions. These regions similarly have high coverage in the hybrid enrichment read


mapping as well. This suggests that probe design was an important factor in generating the uneven read profile in the hybrid enrichment sequencing pools.

Figure 4: Mapping track generated when the probe sequences were mapped to the RefSeq Coleoptera consensus sequence at different levels of match similarity from 70 – 95%. Minimum and maximum read counts are indicated on the y-axis

4.5. Discussion

4.5.1. Should hybrid enrichment be used for mitochondrial genome skimming?

This chapter explores if hybrid capture techniques have the potential to improve the existing MMG pipeline for large scale mitochondrial genome sequencing by increasing the proportion of mitochondrial reads in the sequencing pool. We find that although hybrid capture is highly effective in increasing the mitochondrial representation in the sample, the probe set used generates large biases in coverage across the mitochondrial genome which leads to uneven and patchy coverage, and poorly assembled genomes. If hybrid capture for mitochondrial enrichment were to be pursued in the future, the probes would have to be redesigned to have an approximately equal chance of binding to all regions of the mitochondrial genome. However,


designing such probes will be difficult because of a dearth of reference sequences for some genes and taxa, as well as considerable differences in genetic variability across the mitochondrial genome.

Hybrid capture might be worth the cost and effort if it were to provide substantial benefits to the existing pipeline in terms of cost reduction. This means that the cost and efficiency of hybrid enrichment and conventional genome skimming should be compared. We therefore conducted a cost analysis (Table 3). The cost estimates for the BGISEQ-500 are based on figures provided by BGI, the cost of the SureSelect XT probes (Tier 2) was obtained from Kumar et al. (2016), and the cost of a HiSeq 4000 lane was obtained from the Stanford Medicine Genome

Sequencing Service Centre (http://med.stanford.edu/gssc/rates.html). The cost analysis reveals that when using a low-throughput NGS platform, it would be advantageous to invest in hybrid enrichment, especially if greater sequencing depth is required.

However, since the investment in probe design and purchase is a flat cost, hybrid enrichment becomes less appealing when high-throughput platforms such as Hiseq

4000 are used. Note, however, that this comparison is unrealistic because genome skimming typically requires multiple libraries in order to prevent the pooling of closely related species. As each library requires its own hybrid capture reaction, the cost of hybrid enrichment increases at a faster rate than that of MMG alone, which only requires unenriched libraries.

Note that this cost comparison is skewed in favour of hybrid enrichment as it does not account for the highly uneven coverage across the mitochondrial genome and the resulting poorer assemblies and data loss. As a means to salvage more data, multiple baits could be designed and generated, but this would require additional time and expense. Moreover, the hybrid enrichment approach requires more manpower given its more complex lab procedures. Given that sequencing costs


are likely to continue falling, it seems likely that MMG is the better approach. However, this is under the assumption that a diverse array of taxa is analysed.


Table 3: Cost analysis of both hybrid-enrichment (HE) and pure genome skimming (GS) approaches across different sequencing depth requirements and sequencing platforms with lower and higher throughputs, as well as with the number of libraries required in this study

                                 10 Gb mito required                     50 Gb mito required                     100 Gb mito required
                                 HE                  GS                  HE                  GS                  HE                  GS
Probe efficiency (%)             72.9                5.7                 72.9                5.7                 72.9                5.7
                                 Gb       Cost (USD) Gb       Cost (USD) Gb       Cost (USD) Gb       Cost (USD) Gb       Cost (USD) Gb       Cost (USD)

Sequencing platform: BGISEQ-500  120      1000       120      1000       120      1000       120      1000       120      1000       120      1000
Sequencing                       13.72    114.31     175.44   1461.99    68.59    571.56     877.19   7309.94    137.17   1143.12    1754.39  14619.88
Probes                                    562.5                                   562.5                                   562.5
Total                                     676.81              1461.99             1134.06             7309.94             1705.62             14619.88

Sequencing platform: HiSeq-4000  1300     2900       1300     2900       1300     2900       1300     2900       1300     2900       1300     2900
Sequencing                       13.72    30.60      175.44   391.36     68.59    153.00     877.19   1956.82    137.17   306.00     1754.39  3913.63
Probes                                    562.5                                   562.5                                   562.5
Total                                     593.10              391.36              715.50              1956.82             868.50              3913.63

Sequencing platform: HiSeq-4000  1300     2900       1300     2900       1300     2900       1300     2900       1300     2900       1300     2900
Sequencing                       13.72    30.60      175.44   391.36     68.59    153.00     877.19   1956.82    137.17   306.00     1754.39  3913.63
Probes (10 libraries)                     5625                                    5625                                    5625
Total                                     5655.60             391.36              5778.00             1956.82             5931.00             3913.63
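The entries in Table 3 can be reproduced with a simple cost model, sketched below under the assumption that sequencing cost scales linearly with the output required (i.e., fractions of a lane can be purchased) and that the probe cost is incurred once per library. The function and parameter names are illustrative; the lane outputs, lane costs, probe cost and mitochondrial read proportions are those listed in the table.

```python
def cost_estimate(mito_gb, mito_fraction, lane_gb, lane_cost,
                  probe_cost=0.0, n_libraries=1):
    """Return (Gb to sequence, sequencing cost, total cost); costs in USD."""
    sequencing_gb = mito_gb / mito_fraction          # raw output needed for the target
    sequencing_cost = sequencing_gb * lane_cost / lane_gb
    total_cost = sequencing_cost + probe_cost * n_libraries
    return sequencing_gb, sequencing_cost, total_cost


PLATFORMS = {"BGISEQ-500": (120, 1000), "HiSeq-4000": (1300, 2900)}  # (Gb, USD) per lane
MITO_FRACTION = {"HE": 0.729, "GS": 0.057}
PROBE_COST = 562.5  # USD per capture reaction, HE only

for platform, (lane_gb, lane_cost) in PLATFORMS.items():
    for mito_gb in (10, 50, 100):
        for approach, fraction in MITO_FRACTION.items():
            probes = PROBE_COST if approach == "HE" else 0.0
            gb, seq_cost, total = cost_estimate(
                mito_gb, fraction, lane_gb, lane_cost, probes)
            print(f"{platform:<11} {mito_gb:>3} Gb {approach}: "
                  f"{gb:8.2f} Gb, sequencing {seq_cost:9.2f}, total {total:9.2f} USD")
```

Passing `n_libraries=10` for the hybrid enrichment rows gives the totals in the last block of the table, where the probe cost dominates regardless of the sequencing depth required.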


4.5.2. What is causing the uneven enrichment?

By mapping the raw reads onto a consensus sequence of Coleoptera

mitochondrial genomes, we find that a large proportion of the reads in the hybrid

enrichment set were over-represented in the cox1 region, and, to a lesser degree, the 5’ end of the 16S region. A third minor peak occurs in the vicinity of nd5 (Fig. 3a & b).

A mapping of the probe sequences at different similarity thresholds suggests that the aforementioned regions are more conserved, so that the designed probes are more likely to pull DNA from those regions (Fig. 4). Note that at the time of probe design, the probe density was held equal across the complete mitochondrial genome. Overall, there is also a good correlation between the known level of conservation of the genes and probe success rates; cox1 is one of the most conserved coding genes and the ribosomal RNA regions tend to have highly conserved sections (Simon et al., 1994).

However, what is surprising is the poor representation of the 12S srRNA region, which contains sections that are as conserved as 16S. This suggests that there are other factors that affect the evenness of enrichment. There is a possibility that varying GC-content across the mitochondrial genome could have contributed to the enrichment unevenness

(Benjamini & Speed, 2012), but that would require further investigation.

While the probe mapping strongly suggested that different evolutionary rates of the targeted regions were one of the main causes of the uneven read mapping profile, the read mapping distribution may have been inflated by inherent biases in the mapping process. The read mapping profiles of the conventional genome skimming data (Fig.

3a & b) also have the highest peaks in the same three regions (cox1, nd5, 16S), despite the overall more even profile. This was likely due to the same issue of these regions being more highly conserved and hence a more favourable target for the read mapping.

As such, while the enrichment profile was obviously skewed towards these conserved regions, some of the underperformance of the hybrid enrichment reads may be due to bioinformatics challenges.


It is likely that the probe set was ill-equipped to tackle a taxon set as large and diverse as the tropical Coleoptera specimens involved in this study. The probes were designed to target Coleoptera species across the entire clade, based mostly on mitochondrial genome references from NCBI GenBank. However, tropical Coleoptera are poorly represented in global databases. It is therefore likely that the more variable regions in the mitochondrial genome of these reference species are too distant from the targets for the design of efficient probes. Additionally, given that only 164 species could be used for designing the probes, it is likely that a single set of around 21,000 probes is insufficient to cover a taxonomic group as large and broad as the

Coleoptera, with an estimated 1.5 million species (Stork et al., 2015). However, while taxon bias is likely an important factor for probe design and assessment, its precise impact on the results obtained in my study cannot be evaluated due to the large number of species in the sequencing pool. A follow-up study with individual libraries for each species may be informative. However, through the comparisons performed here, we believe that gene bias is an important factor that needs attention.

Liu et al. (2016) recently published a proof-of-concept hybrid capture study using 49 species (mostly insects but including a broad range of other metazoan taxa) with known mitochondrial genome sequences enriched with a probe set designed from insect mitochondrial genomes obtained via the 1KITE project. The study reports a high success rate of mitochondrial genome contig retrieval, as well as no significant correlation between similarity with the probe set and enrichment factor. However, it is likely that the small number of taxa translated into higher sequencing depth per species.

This may compensate for variable sequencing depth. In addition, the mitochondrial genomes in Liu et al.’s study (2016) were known so that all contigs could be baited. It is also interesting to note that the taxa with fragmented contigs in their study were generally taxa that were distant from those used in the probe design (e.g., Arachnida,

Chordata, Echinodermata), with notable exceptions for the Hymenoptera and


Embioptera, which are known to have novel gene rearrangements in their mitochondrial genomes (Dowton & Austin, 1999; Kômoto et al., 2012). Finally, the gene regions in their read mapping profiles with higher p-distances tended to have lower coverage although the correlation between coverage and p-distances was not significant and the read mapping profiles were more even than those obtained in our study.

Based on the case studies, one can devise several ways to improve the hybrid enrichment process to target large numbers of highly diverse species. The most straightforward approach might be to increase sequencing depth, such that lower coverage mitochondrial regions are still retrieved. However, as shown earlier, it might be more cost effective to increase sequencing depth without involving hybrid capture by using higher-throughput platforms. Another way to salvage a pool of fragmented contigs would be to sequence more baits that span the target region via the multiplexed tagged amplicon sequencing technique highlighted in Chapter 3, which would help assign the fragmented contigs to species. Probe design could be narrowed to a smaller taxon range but would remain problematic if most taxa have not been sampled. In addition, this strategy would increase cost if multiple sets have to be designed. Finally, one could increase probe density at variable regions and reduce the probe density at more conserved regions, but this requires a good understanding of mitochondrial gene variability, which can vary across taxa (Simon et al., 1994).

In conclusion, hybrid capture has tremendous potential for plastid and mitochondrial regions, but its efficacy relies heavily on the suitability of the probe design for the target taxon set. Additionally, the technique requires non-trivial amounts of additional manpower and investment in probes compared to the more straightforward and more cost-effective, conventional genome skimming. Hybrid enrichment may therefore be more suitable for tackling narrower taxon groups with known target regions for probe design, although this would be likely to increase the


need for additional libraries. Investing in higher throughput systems for genome skimming may hence be a more effective and less tedious way of obtaining complete mitochondrial genome sequences for hyper-diverse and largely unknown taxa like tropical Coleoptera.


Chapter 5

Towards holomorphology in entomology: rapid and cost-effective larval-adult matching using NGS barcodes

5.1. Abstract

In many taxa the morphology of females and immatures is comparatively poorly known because species descriptions and identification tools have a male bias.

The root causes are problems that are associated with matching life history stages and genders belonging to the same species. Such matching is time-consuming when conventional methods are used (e.g., rearing, associations) and expensive when the stages are matched with DNA barcodes. Unfortunately, the lack of associations is not a trivial problem because it renders a large part of the phenome of insects unexplored although larvae and females are useful sources of characters for descriptive and phylogenetic purposes. In addition, many collectors intentionally avoid females and immature stages, which skews survey results, interferes with collecting life history information, and makes it less likely that rare species are discovered. These problems even exist for well-studied taxa like Odonata where obtaining adult-larval matches relies largely on rearing. Here we demonstrate how the matching problem can be addressed with cost-effective tagged amplicon sequencing of a 313 bp segment of cox1 with NGS (“NGS barcoding”). We illustrate the value of the approach based on

Singapore’s odonate fauna which is of a similar size as the European fauna (Singapore:

122 extant species; Europe: 138 recorded species). We match the larvae and adults of

59 species by first creating a barcode database for 338 identified adult specimens representing 83 species. We then sequence 1178 larvae from a wide range of sources.

We successfully barcode 1123 specimens, which leads to adult-larvae matches for 59 species based on our own barcodes (55) and online barcode databases (4). With these

additions, 84 of the 131 species recorded in Singapore now have been associated to a species name. Most common species are now matched (83%), and good progress has been made for vulnerable/near threatened (55%), endangered (53%), and critically endangered species (38%). We used non-destructive DNA extraction methods in order to be able to use high-resolution imaging of matched larvae for establishing a publicly available digital reference collection for odonates within “Biodiversity of Singapore”.

We suggest that the methods described here are suitable for many additional insect taxa because NGS barcoding allows for fast and low-cost matching of well-studied life history stages with neglected semaphoronts (eggs, larvae, females). We estimate that the specimen-specific amplicons in this study (ca. 1500 specimens) can now be obtained within 8 working days and that the laboratory and sequencing cost are ca.

USD 600 (<0.40 USD/specimen).

5.2. Introduction

One of the major challenges in entomology is obtaining a complete and comprehensive inventory of all insect species and their life history stages. This task is particularly overwhelming in the tropics (Basset et al., 2012) with its high diversity, shorter history of study, and insufficient resources for taxonomic research. While the species diversity challenge is frequently discussed in the literature, another challenge receives much less attention; i.e., the complex phenomes of most insect species. For the vast majority of species, the immatures are not miniature adults. Instead they are morphologically so disparate that the morphological distinction between life history stages of the same species is larger than the morphological disparity between the same stages of closely related species. Yet, many species descriptions focus on adult males because they tend to have species-specific differences in their sclerotized genitalia

(Eberhard, 1985). This leaves the morphological diversity of females and immatures underexplored, although they can be morphologically very diverse (Meier, 1995;


Puniamoorthy et al., 2010) and can be a rich source of characters for phylogenetic analyses (Beutel, 1993; Meier & Lim, 2009; Beutel et al., 2010; Aspöck et al., 2012).

The male bias in entomology has led to a lack of morphological identification tools for immatures and females, which in turn has affected the efficacy of arthropod surveys. In such surveys, female and immature specimens are often not even collected and/or they are ignored/discarded during sorting. This is unfortunate because life history theory predicts that the abundance of immatures will be higher than the abundance of males and females combined (Pianka, 1970); i.e., only identifying male specimens will significantly weaken the conclusions that can be drawn from survey data. It will also exacerbate the rarity problem in arthropod biodiversity surveys. A large number of species are so rare that they are only known or described based on a single specimen or collecting event (Novotný & Basset, 2000; Lim et al., 2011). Yet, a substantial proportion of specimens obtained in surveys is not processed; i.e., the commonness-of-rarity problem is partially caused by not assessing diversity based on all life history stages.

Including females and immatures in surveys is also important for our knowledge of insect natural history and for conserving species. In many species, females and larvae play a larger role in biomass turnover than males. For example, females tend to have higher caloric requirements since they have to build larger gametes. Similarly, immatures have to build the biomass of the species and are therefore often ecologically more important than adults. At the same time, they tend to be more sedentary and long-lived than adult stages. This often means that larvae are more likely to be affected by anthropogenic threats such as habitat loss and pollution (Kraus & Secor, 2005). Yet, entomologists cannot identify the immatures of many species, which becomes an impediment to conservation because the habitat preferences of the immature stages remain unknown. The lack of identification tools is a particularly significant problem in freshwater biomonitoring, which is heavily dependent on information gathered from immature insects. The immatures are frequently only identified to taxa above the species rank, which often leads to coarse habitat assessments (Gaston, 2000).

The lack of adult-larva associations in many insect groups is partially due to the fact that matching life history stages is time-consuming. Traditionally, associations have been obtained by rearing larvae to adults and then identifying the adults to species (van Gossum et al., 2003). However, this approach has drawbacks. Firstly, not all species are amenable to rearing in captivity because they have narrow and unknown dietary and environmental requirements. Secondly, even when rearing is technically feasible, the life-cycles of some species are so long that rearing is time-consuming. Lastly, the process of rearing partially destroys the larval morphology unless living larvae can be imaged in sufficient detail for morphological descriptions and/or the exuviae preserve sufficient information. These problems can sometimes be overcome by rearing only some individuals belonging to the same clutch to adulthood. However, association via rearing rarely starts with eggs and is more likely to involve multiple larvae collected at the same time and site.

More recently, DNA barcoding (Hebert et al., 2003) has been used to obtain species-level adult-larva associations. This technique has been widely applied across a variety of insect groups, such as Coleoptera (Miller et al., 2005; Ahrens et al., 2007; Curiel & Morrone, 2012), Diptera (Trivinho-Strixino et al., 2012; Pramual & Wongpakam, 2014) and EPT taxa (Zhou et al., 2007; Gattolliat & Monaghan, 2010; Ruiter et al., 2013; Avelino-Capistrano et al., 2014). Barcodes have also been used for associating anisomorphic developmental stages of other invertebrates (e.g., trematodes; Jousson et al., 1999; Blasco-Costa et al., 2016) and vertebrates (Thomas et al., 2005; Victor, 2007). Indeed, DNA barcoding overcomes many of the problems encountered during rearing. It is faster and does not require prior knowledge of the species’ life history requirements (e.g., feeding preferences). Furthermore, non-destructive DNA extraction allows for morphological study of the specimens that have been barcoded. Lastly, while specialist expertise is useful for the sorting or identification of larval or adult forms, it is not strictly required as long as all specimens are barcoded (Wang et al., 2018).

However, DNA barcodes also have drawbacks. They can be misleading when closely related species have interspecific distances that are lower than expected (0–2%) or when species have deep intraspecific splits (Meier et al., 2006). In addition, obtaining barcodes with Sanger sequencing is so expensive and time-consuming that it is often not feasible for all available specimens in specimen- and species-rich taxa. This means that the existing material must first be pre-sorted into putative species based on morphological characters. This is time-consuming and can be very difficult if different instars are in the sample. We believe that these problems can be overcome through the use of “NGS barcodes”. Here, next-generation sequencing (NGS) of tagged amplicons is used for obtaining a short barcode (313 bp) at low cost (USD 0.40 per specimen: Meier et al., 2016). This low cost means that large numbers of immatures and adults can be sequenced without any morphological pre-sorting (Wang et al., 2018). Indeed, the molecular pre-sorting becomes a technical exercise, which means that entomologists can focus on working through sequence-based specimen clusters that in most cases represent species. Lastly, the comparatively short NGS barcode fragment (313 bp) amplifies well even for fairly degraded material that has been collected, for example, for the purpose of biomonitoring and preserved in suboptimal conditions. The low cost of NGS barcoding also means that it becomes feasible to carry out large-scale biodiversity surveys that deliberately target and/or include larvae and females. Such targeting would be desirable because it has the potential to yield much natural history information.

In this study, we document the potential of NGS barcoding for life-history-stage matching by sequencing >1100 specimens of Singaporean odonates. Singapore’s odonates exhibit some of the typical problems of studying insect phenomes in the tropics. The adults are better known than the larvae and covered by species-level identification guides (Orr, 2005; Tang et al., 2010) and species inventories (Ngiam & Cheong, 2016). This work has revealed that the fauna remains surprisingly rich although Singapore has lost most of its original forest cover. As of 2016, 131 species of odonates have been recorded in Singapore, with 122 species being considered extant. This means that the fauna is of similar size to the entire European fauna (138 species: Kalkman et al., 2010), albeit being compressed onto an area of 700 km². Note that Singapore’s fauna is also very respectable when compared to the fauna of other tropical islands; e.g., Borneo is 1000 times larger but only has twice the number of species (275 spp.; Orr & Hämäläinen, 2003), and Madagascar has approximately 175 species while being around 800 times larger (Dijkstra & Clausnitzer, 2004). Unfortunately, the Singaporean fauna is also typical in that much less is known about the larval morphology, ecology, or distribution than about the adult stages. This is a common problem that is well exemplified by the 500 odonate species in India, of which fewer than 100 have known larval forms (Prasad & Varshney, 1995; Andrew et al., 2008).

New larval forms for odonate species living in Singapore are regularly described, but prior to this study the descriptions were mostly based on reared material of targeted species or on exuviae (Lok & Orr, 2009; Ngiam, 2011; Ngiam et al., 2011; Orr & Ngiam, 2011; Ngiam & Leong, 2012; Ngiam & Dow, 2013). This approach has led to the larval identification and description of 71 of the 131 species of odonates in Singapore. However, it took 70 years to reach this stage (first description was in Lieftinck, 1932) and only the larvae of 6 species were described based on material from Singapore. All the remaining descriptions are based on larvae obtained from surrounding countries (Table S2 in Yeo et al., 2018).

Here, we document how NGS barcoding can not only accelerate the life history stage matching process, but also radically alter the approach to larval identification. Instead of employing coarse parataxonomist sorting with subsequent DNA barcoding of few specimens, our approach uses NGS barcoding to directly associate all available specimens for a species. Larval specimens can thereby be barcoded and eventually sorted and associated en masse without any morphospecies pre-sorting. In this study, most specimens were obtained serendipitously over many years via miscellaneous surveys of freshwater bodies across Singapore. The larvae from Singapore were stored in different laboratories and collections so that they were of widely varying preservation quality. Note that such haphazardly collected larval material is available for many taxa across the natural history museums of the world. We here match the Singapore specimens to an adult sample that was obtained for the purpose of this study. After matching, the larval stages were imaged and made available online in the form of a digital reference collection (Ang et al., 2013a), while the specimens are preserved in the Lee Kong Chian Natural History Museum. A useful by-product of the matching campaign was an odonate DNA barcode database for Singapore that has already been used for matching eDNA signatures obtained from water (Lim et al., 2016). Finally, while both male and female odonates are well characterized in Singapore despite exhibiting sexual dimorphism in many species, the methods used here are also applicable for male-female association for taxa where only males have been formally described.

5.3. Materials & Methods

5.3.1. Sampling and identification

Most specimens were collected between 2011 and 2014. Some specimens were directly killed and preserved in 100% ethanol, while other specimens had unfortunately been first anaesthetized with carbonated water (the acidity likely affected DNA quality) before they were preserved in 70% methylated ethanol. Adults were obtained via targeted sweep-netting with the aim of maximizing species coverage. Adult specimens for some species were also serendipitously obtained from Malaise traps (Singapore). Some larvae were collected with kick-nets, sieves and coconut brush samplers (Loke et al., 2010). Many of the samples were collected for biomonitoring purposes by the Tropical Marine Science Institute (TMSI). The freshwater bodies sampled include streams (natural and canalized), reservoirs, ponds and swamps across the country. Specimens were individualized and preserved in 70–100% ethanol. Only adult specimens were identified (Meier, 2017) using identification guides that include detailed drawings and photographs (Orr, 2005; Tang et al., 2010). Most adults were identified by the first authors, but identifications for some specimens were verified by R. W. J. Ngiam. Additional confirmation came from BLAST and Barcode of Life Data Systems (BOLD) hits. The specimens are deposited in the Lee Kong Chian Natural History Museum (National University of Singapore).

5.3.2. DNA extraction, amplification and sequencing

Regardless of extraction method, a leg or piece of tissue (0.5–1 cm) was removed from the specimen. Early on in the project, DNA was extracted using either a phenol chloroform extraction protocol (Kutty et al., 2007) or QIAGEN’s DNeasy Blood & Tissue kit following the manufacturer’s instructions. Later in the study, we used directPCR for larval specimens (Wong et al., 2014). In directPCR, a small amount of tissue is used directly as DNA template in the PCR reaction.

We used the following cox1 primers from Leray et al. (2013) for obtaining amplicons (mlCO1intF: 5′-GGWACWGGWTGAACWGTWTAYCCYCC-3′ and jgHCO2198: 5′-TAIACYTCIGGRTGICCRAARAAYCA-3′), which amplify a 313-bp region of the barcoding portion of the gene. The primers were labelled with a 9-bp tag at the 5′ ends. The tags were generated with a barcode generator script (Henry et al., 2014) to differ by at least 3 bp. The tagged primers were also screened for secondary structure formation. Each specimen was assigned a unique combination of primer tags during amplification to allow recovered sequences to be assigned to specific specimens after sequencing. PCR amplification of both gDNA and dissected tissue was performed in a 25 µl reaction with 2.5 µl of 10X BioReady rTaq buffer (with 15 mM MgCl2), 2.0 µl of 2.5 mM dNTP mixture, 1.0 µl of 1.0 mM BSA solution, 0.2 µl of BioReady rTaq and 2.0 µl of 10 µM forward and reverse primer, under the following conditions: initial denaturation at 95 °C for 3 min, 40 cycles of denaturation at 94 °C for 30 s, annealing at 50 °C for 1 min and extension at 72 °C for 1 min, with a final extension at 72 °C for 5 min.
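The tag design constraints described above (tags differing by at least 3 bp, and a unique forward/reverse tag combination per specimen) can be illustrated with a minimal Python sketch. It is not the barcode generator script of Henry et al. (2014); the function names and the three example tags are hypothetical and serve only to show the checks involved.

```python
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(tags):
    """Smallest Hamming distance among all pairs of tags."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))


def assign_tag_pairs(specimen_ids, fwd_tags, rev_tags):
    """Give each specimen its own (forward tag, reverse tag) combination."""
    pairs = list(product(fwd_tags, rev_tags))
    if len(specimen_ids) > len(pairs):
        raise ValueError("not enough tag combinations for all specimens")
    return dict(zip(specimen_ids, pairs))


# Hypothetical 9-bp tags, invented for illustration only.
example_tags = ["ACGTACGTA", "TGCATGCAT", "GATCGATCG"]
assert min_pairwise_distance(example_tags) >= 3  # the design constraint from the text
```

With n forward and m reverse tags, n x m specimens can be multiplexed, which is why a modest tag set suffices for a full sequencing run. A minimum distance of 3 means that a single sequencing error in a tag cannot turn it into another valid tag.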

Successful amplification was verified for a few products for each 96-well PCR plate via gel electrophoresis in order to ensure there was no plate-wide amplification failure. Afterwards, 2.0 µl of PCR product was obtained from each sample and pooled. The pooled sample was purified using Bioline SureClean Plus, following the manufacturer’s protocol. The final pool was quantified via a Qubit 2.0 Fluorometer and submitted to AITbiotech (Singapore) along with accompanying samples for library preparation and sequencing on an Illumina MiSeq platform with a 2 x 300-bp paired-end kit. The specimens were sequenced in this manner across four different MiSeq runs alongside other NGS-based projects.

5.3.3. Read processing and barcode determination

Raw paired-end reads were first merged with PEAR (Zhang et al., 2013b) using the default parameters. A custom Python script was then used to demultiplex the data, count the number of reads per sample, identify identical reads, merge and count them, and compare the number of reads in the largest identity set with that in the second largest identity set (Meier et al., 2016). The barcoding of a specimen was only considered successful if the total amplicon coverage for the sample was >50 times, the number of reads in the largest set of identical reads was >10 times, and the read coverage in the largest identical set was >5 times the coverage of the second largest set (Meier et al., 2016). If these criteria were met, the sequence of the largest read set was considered the barcode for the specimen. Lastly, we used BLAST against NCBI’s nucleotide database for removing any non-odonate sequences.
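The three acceptance criteria can be made concrete with a short Python sketch. This is an illustration of the stated thresholds only, not the custom script of Meier et al. (2016); the function and constant names are hypothetical, and the strict inequalities follow the wording of the paragraph above.

```python
from collections import Counter

MIN_TOTAL = 50     # total reads for a specimen must exceed this
MIN_DOMINANT = 10  # reads in the largest identical-read set must exceed this
MIN_RATIO = 5      # largest set must exceed this multiple of the second largest


def call_barcode(reads):
    """Return the dominant read as the barcode, or None if any filter fails.

    `reads` holds the merged, demultiplexed reads of one specimen.
    """
    counts = Counter(reads)
    if sum(counts.values()) <= MIN_TOTAL:
        return None  # fails the total-coverage filter
    ranked = counts.most_common(2)
    top_seq, top_n = ranked[0]
    second_n = ranked[1][1] if len(ranked) > 1 else 0
    if top_n <= MIN_DOMINANT:
        return None  # dominant read set too small
    if top_n <= MIN_RATIO * second_n:
        return None  # second read set too close to the dominant one
    return top_seq
```

The ratio filter is what guards against contamination and co-amplified signals: a specimen with two read sets of similar size is rejected even if its total coverage is high.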

5.3.4. mOTU estimation for adult-larva association

The cox1 barcodes were aligned using MAFFT v.7 (http://mafft.cbrc.jp/alignment/software/) under default parameters to confirm open reading frames. Larval matching was achieved by estimating molecular Operational Taxonomic Units (mOTUs) via objective clustering of uncorrected p-distances using SpeciesIdentifier (TaxonDNA 1.6.2; Meier et al., 2006). We also used the Automatic Barcoding Gap Discovery (ABGD) algorithm (Puillandre et al., 2012) to test how robust the mOTU assignments were when a different species-delimitation technique was applied. For objective clustering, we applied a range of distance thresholds (2–4%) that are commonly used in the barcoding literature (Ratnasingham & Hebert, 2013). For ABGD, we used a range of priors (P=0.002783–P=0.007743). We then used the threshold/prior that maximized congruence with the adult identification given that species limits in odonates are overall well understood based on morphology. We considered those mOTU clusters that contained sequences from both adults and larvae as having yielded a life history stage association. Congruence between the different delimitation methods was assessed by pairwise match ratios (Ahrens et al., 2016), which are defined as: 2 * Nmatch / (N1 + N2), where Nmatch is the number of clusters identical across both mOTU delimitation methods/thresholds, and N1 and N2 are the numbers of clusters obtained with each of the two. The rarity classes of Singapore’s odonate species were taken from Ngiam and Cheong (2016).
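The match ratio can be computed directly from two partitions of the same specimens. The sketch below is a hedged illustration in Python; the function name and specimen labels are hypothetical, and it assumes that N1 and N2 denote the number of clusters in each partition and that two clusters "match" only when they contain exactly the same specimens.

```python
def match_ratio(partition_a, partition_b):
    """Return 2 * N_match / (N1 + N2) for two clusterings of the same specimens.

    Each partition is an iterable of clusters; each cluster is an
    iterable of specimen identifiers.
    """
    set_a = {frozenset(cluster) for cluster in partition_a}
    set_b = {frozenset(cluster) for cluster in partition_b}
    n_match = len(set_a & set_b)  # clusters with identical membership
    return 2 * n_match / (len(set_a) + len(set_b))


# Toy example: the second partition lumps two clusters of the first.
strict = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
relaxed = [{"s1", "s2"}, {"s3", "s4", "s5"}]
ratio = match_ratio(strict, relaxed)  # 2 * 1 / (3 + 2) = 0.4
```

A ratio of 1.0 means that both methods recover exactly the same clusters; every split or lump lowers the ratio because it removes clusters from the shared set while keeping them in the denominator.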

5.3.5. Imaging and databasing

In order to facilitate the widespread use of the newly obtained associations, we imaged the larvae with distinct morphological features and created a digital reference collection with high-quality images (Ang et al., 2013a). For all matched clusters, a representative specimen was selected for imaging based on morphological integrity. The selected specimens were then imaged with a Visionary Digital BK Lab System to obtain habitus photographs for the dorsal, ventral, and lateral side in addition to detailed images for labium and caudal lamellae (Zygoptera only). The images were stacked with Helicon Focus, edited in Adobe Photoshop CS6 and uploaded online to the Biodiversity of Singapore digital reference collection (https://singapore.biodiversity.online), with each species being represented by high-resolution specimen images and links to relevant publications. The DNA barcodes have been submitted to NCBI GenBank.

5.3.6. Review of adult-larval association literature

In order to review which techniques have been used for life history stage matching in the odonate literature, we conducted a literature search for odonate larval descriptions between 2000 and 2016, using the following search terms on ISI Web of Science: (odonat$ OR zygopter$ OR anisopter$ OR anisozygopt$) AND (larv$ OR nymph& OR immature&). We determined the number of larval species described, country of larval origin, adult-larva association method, and sex/life history stage of specimens described. We included publications describing larvae of species that were already described based on adult specimens as well as larval descriptions that were part of new species descriptions.

5.4. Results

5.4.1. Sequencing and initial processing

A total of 1516 specimens were processed (338 adults and 1178 larvae). In the initial stage, we used formal DNA extraction via Qiagen DNeasy blood and tissue kits and phenol chloroform protocols to obtain DNA for 826 specimens. After developing the more time- and cost-effective direct PCR methods, we used direct PCR for the remaining 690 specimens. We initially generated DNA barcodes with Sanger sequencing, but the amplification success rates for cox1 were fairly low and we thus switched to NGS barcoding via tagged amplicon sequencing. All specimens were sequenced using 14,079,372 paired reads (Table 1). This is equivalent to around 1.5 MiSeq runs (300 bp paired-end) or roughly 10% of a HiSeq 2500 (250 bp paired-end) lane.

Table 1. Sequencing success rates and number of reads obtained

Run      No. of     Total read   No. of samples   No. of samples   No. of samples   No. of samples   No. of samples   Sequencing
number   samples    count        failed           failed           failed           failed           passed           success
         included                1st filter*      2nd filter†      3rd filter‡      BLAST filter     all filters      rate
1        826        13742608     12               5                32               4                773              93.58%
2        558        150344       238              2                12               35               271              48.57%
3        88         132724       17               14               5                0                52               59.09%
4        44         53696        10               0                7                0                27               61.36%
Total    1516       14079372     277              21               56               39               1123             74.08%

*: amplicons with read counts <50 are discarded.
†: amplicons with a dominant, unique read count < 10 counts are discarded.
‡: amplicons where a read count ratio of the second-most dominant read to the most dominant read exceeds 0.2 are discarded.

After applying quality control filters, sequences for 393 specimens were excluded, leaving 1123 barcodes for subsequent analyses (sequences are available on NCBI GenBank: MG884602 – MG885724). Of the successfully barcoded specimens, 319 were adults and 804 were larvae. The overall sequencing recovery rate was unusually low (74.08%), largely because many specimens had been anaesthetized with acidic, carbonated water before being stored in 70% methylated ethanol at room temperature for variable amounts of time. The success rates for those samples varied from 48.6% to 61.4% (Table 1), while the success rate for amplicons obtained from formally extracted gDNA was 93.6%.

5.4.2. Clustering and life history stage matching

The number of mOTUs is overall stable across 2–4% p-distance thresholds and yields 95–99 mOTUs (Table 2). Of these mOTUs, 83 match known species based on the presence of pre-identified adult specimens. Note that specimens for one nominal species (Neurothemis fluctuans) are consistently found in two mOTUs, which diverge by a 6% p-distance thereby suggesting the presence of a cryptic species (Bickford et al., 2007). Our study covers 68% of Singapore’s extant species and we obtained adult-larva matches for 46% (55 species). Twenty-seven of the remaining mOTUs contained adults only, while another 12–14 contained only larvae. BLAST searches of the larval cox1 barcodes against NCBI GenBank and BOLD were used in an attempt to identify these latter larvae to species. This was successful for four species (Tetracanthagyna plagiata, Diplacodes trivialis, , Ceriagrion chaoi), which brings the total number of matched species in this study to 59. Overall, the study adds life-history-stage matches for 13 species to the existing list of described larvae of species occurring in Singapore (Table S2 in Yeo et al., 2018).

Table 2. mOTU delimitation via objective clustering and ABGD approaches

                                            Objective clustering (p-distance)   ABGD (prior)
                                            2%       3%       4%                P=0.002783   P=0.004642   P=0.007743
Total no. of clusters                       99       96       95                102          97           95
No. of adult clusters split                 3        1        1                 6            1            1
No. of adult clusters lumped                0        0        0                 0            0            0
No. of clusters with adult-larva matches    55       56       56                55           55           56
No. of adult-only clusters                  30       27       27                33           28           27
No. of larva-only clusters                  14       13       12                14           14           12

Overall, BLAST searches were able to provide species matches with reasonable certainty (>97% match similarity) for 50 mOTUs; i.e., the publicly available barcode databases remain fairly incomplete even for such charismatic and well-studied taxa as odonates. However, GenBank and BOLD yield identifications for a large proportion of all specimens because common species tend to be overrepresented in the databases (of 1123 successfully barcoded specimens; GenBank: 827; BOLD: 848 at 97.5–100.0% similarity). The remaining 46 mOTUs without GenBank/BOLD matches contribute only 296 specimens in our study.

The checklist of Singapore’s odonates by Ngiam and Cheong (2016) includes an assessment of rarity for each species in Singapore. Most of the species frequently encountered have been matched (Common: 74.5%; Uncommon: 45.2%) and barcoded (Common: 94.5%; Uncommon: 87.1%). “Rare” and “Very Rare” species are under-represented in this study (Fig. 1a), but we nevertheless matched 7 of the 40 (17.5%). The likelihood of the species being matched or barcoded correlates well with the abundance in Singapore, but the number of larvae collected across rarity classes and species is not significantly different (Kruskal-Wallis: P=0.62) (Fig. 1a). In contrast, the proportions of species with larvae described prior to this study (mostly based on non-molecular methods) do not match the abundance of the species in Singapore, which is presumably due to the fact that most species matches were obtained elsewhere (Fig. 1b).

Figure 1. Rarity and probability of adult-larval matching of a species. (a) Current study; (b) Literature (numbers=number of species); (c) Larval abundance across rarity classes in this study

5.4.3. mOTU stability

In order to determine which distance threshold maximizes congruence with adult morphology, we tested different threshold and prior parameters (objective clustering of uncorrected p-distances and ABGD). The clustering thresholds and prior values used were 2–4% and P = 0.002783–0.007743 respectively, which correspond to what is typically employed in the literature (Ratnasingham & Hebert, 2013). This resulted in around 95–102 mOTUs (Table 2), depending on the choice of clustering method and parameter. Since the adults can be reliably identified to species, we here used the clustering method and parameters that maximized congruence with adult morphology (3% and 4% for objective clustering; P=0.007743 for ABGD).
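To make the effect of the threshold tangible, the following Python sketch clusters aligned sequences by uncorrected p-distance using single linkage, which is the general idea behind objective clustering. It is a simplified stand-in and not the SpeciesIdentifier implementation: the function names are hypothetical, gaps and Ns are simply skipped, and treating a distance equal to the threshold as a link is an assumption.

```python
def p_distance(a, b):
    """Uncorrected p-distance over aligned sites, skipping gaps and Ns."""
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def objective_clusters(seqs, threshold):
    """Single-linkage clusters: link sequences whose p-distance <= threshold.

    `seqs` maps specimen names to aligned sequences of equal length.
    """
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


# Toy alignment (100 sites): s1-s2 and s2-s3 differ by 2%, s1-s3 by 4%.
toy = {
    "s1": "A" * 100,
    "s2": "C" * 2 + "A" * 98,
    "s3": "C" * 4 + "A" * 96,
    "s4": "G" * 20 + "A" * 80,
}
```

In the toy alignment, s1 and s3 end up in the same cluster at a 3% threshold even though they differ by 4%, because both are linked through s2. This chaining is a property of single linkage, so the largest distance within a cluster can exceed the threshold.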

The pairwise match ratios (≥0.92) across the different clustering algorithms and parameters (Table 3) indicate high stability; i.e., most species are genetically distant enough to be identifiable across the delimitation methods and thresholds. Only one species represented by adults is consistently split into two clusters.

Table 3. Pairwise match ratios of clusters delimited by objective clustering and ABGD

                                     Objective clustering      ABGD
                                     2%      3%      4%        P=0.003   P=0.005   P=0.008
Objective clustering   2%            1.00
                       3%            0.96    1.00
                       4%            0.95    0.98    1.00
ABGD                   P=0.002783    0.97    0.95    0.92      1.00
                       P=0.004642    0.98    0.98    0.97      0.95      1.00
                       P=0.007743    0.95    0.99    1.00      0.92      0.97      1.00

5.4.4. Literature survey

Our literature survey revealed larval descriptions for 229 species that were published between 2000 and 2016 (Table S3 in Yeo et al., 2018). Of these, only 11 were based on DNA barcodes (Fig. 2) while the vast majority were obtained via rearing (164 species) or identifying tenerals next to their exuviae. Overall, there seems to be a slight increase of adult-larva associations via DNA barcodes over time. Note that some adult-larva associations are still made via supposition (i.e., identifying larvae based on the putative presence of adults which is sometimes determined based on checklists).

Figure 2. Adult-larva association methods in descriptive odonate literature (2000–2016)

5.5. Discussion

5.5.1. Expediting adult-larval associations with NGS barcoding

Singapore is a small island with a surface area of 719 km². The country nevertheless used to be the home of 131 species of Odonata (Ngiam & Cheong, 2016) of which 122 are still considered extant. This is a very large number of species given that all of Europe only has 138 species (Kalkman et al., 2010); i.e., the odonate fauna of Singapore poses a significant enough challenge to be typical for other life history matching studies. Here, we barcoded 1123 adult and larval specimens that belonged to 95–96 mOTUs, of which 83 were identifiable to species due to the presence of pre-identified adults in the cluster. To our knowledge, this is currently the largest adult-larva association study in the literature. A very large proportion of the specimens in our study can be associated (1012 of 1123: 90.1%) and yield morphological and habitat information for 59 species. Most of these associations were obtained during our study (55) and only 4 were established via BLAST to NCBI GenBank. Our study adds a total of 13 species to the list of species with associated larvae. This was accomplished via occasional collecting spread across a two-year period. In particular, we targeted habitats that were suspected to have high diversity. In comparison to collecting, the molecular work was not time-consuming. Based on the techniques applied to our latest batch of specimens, we could now obtain amplicons for all 1123 specimens in ca. 8 days of molecular work (Wang et al., 2018). In contrast, matching via rearing has only resulted in 6 new larval descriptions between 2009 and 2013; i.e., NGS barcoding has the potential to quickly yield life history associations for a large number of species, especially if a large amount of material is already in collections and an adult barcode database can be obtained quickly. These associations can easily lead to new descriptions and identification via morphology in the future.

Immature associations via DNA sequences have been available since the late 1990s (DeSalle & Birstein, 1996; Miller et al., 1997), but most studies are small-scale and only associate the life history stages for a few species at a time. Such a small-scale application of DNA barcoding is particularly common in odonates where the first such study was only published in 2006 (Etscher et al. and Fleck et al.). Indeed, even recent publications involving odonate larval matching via DNA barcoding only involved 1–2 species and are based on <50 specimens. Other insect orders have seen larger-scale studies, but none have been at the scale reported here. The largest study that we found in the literature included 250 larvae and adult beetles (Ahrens et al., 2007). We believe that it is now time to become more ambitious. Large-scale immature sampling in arthropod surveys is now an option because NGS barcodes can be used to match life history stages.

The small number of DNA association studies and small number of specimens is most likely due to the substantial per-specimen cost of Sanger sequencing. Sanger sequencing is expensive (anywhere from 7–34 USD, from http://ccdb.ca/pricing), but the cost of sequencing can be dramatically cut by using NGS approaches (Meier et al., 2016; Hebert et al., 2017). Here, we sequenced 1123 specimens using 14,079,372 reads of 300 bp paired-end Illumina MiSeq sequencing (Table 1). The average coverage of each specimen was 12,537x, which is far in excess of what is needed. It would have been safe to reduce the average number of reads per specimen to 1/10th (1254 reads per specimen). This implies that >11,000 specimens can be barcoded in a single Illumina MiSeq lane. The cost of the lane would be ca. USD 2,000 (see: http://medicine.yale.edu/keck/ycga/services/illuminaprices.aspx), thereby reducing the sequencing cost of each specimen to USD <0.20. Indeed, the main cost and challenge is not sequencing, but obtaining specimens and generating a sufficiently large number of amplicons with PCR to fill a lane or flowcell. However, the latter problem can be overcome because the same sequencing run can be shared by multiple projects and obtaining amplicons with direct PCR is cost-effective (USD 0.16). The process is also sufficiently simple that even inexperienced personnel can process 200 specimens per day (Wang et al., 2018). This means that wet-lab cost per specimen is ca. USD 0.40 and that a study like the one whose results we are here reporting cost only ca. USD 600 (1516 specimens x 0.40). We estimate that all specimens in our study could be processed within 8 working days.
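The figures in this paragraph can be retraced with a few lines of Python. The sketch only reproduces the arithmetic from numbers stated in the text and in Table 1; the variable names are hypothetical, and it assumes that a single lane yields about as many reads as were used in this study.

```python
total_reads = 14_079_372  # paired reads used in this study (Table 1)
barcoded = 1123           # specimens that passed all filters
processed = 1516          # specimens processed in total

mean_coverage = total_reads / barcoded  # about 12,537 reads per barcode
reduced_coverage = mean_coverage / 10   # about 1,254 reads per barcode

# Assumption: one lane yields roughly `total_reads` reads.
specimens_per_lane = total_reads / reduced_coverage  # about 11,230 specimens

lane_cost_usd = 2000  # approximate lane cost quoted in the text
sequencing_cost_per_specimen = lane_cost_usd / specimens_per_lane  # below USD 0.20

wet_lab_cost_usd = 0.40                    # per specimen, as stated in the text
study_cost = processed * wet_lab_cost_usd  # about USD 606
```

The estimate of >11,000 specimens per lane thus follows from dividing the read count by the reduced coverage, and the per-specimen sequencing cost scales inversely with the number of specimens pooled into the run.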

5.5.2. Tackling the problem of rarity and elusive larvae

As discussed earlier, insect larvae are absent or underrepresented in many species descriptions. The most likely reasons are the difficulty of finding larvae and associating them with adults. As documented here, the latter problem can be addressed but the larvae of some species will remain elusive because they have very specific habitat requirements (e.g., phytotelma: Ngiam & Leong, 2012). As one would expect, the likelihood of being matched and barcoded in this study is overall positively correlated with the species’ abundance/rarity (Fig. 1a) although it appears from our study that “rare” species are more likely to have been associated than “uncommon” ones (Fig. 1b). This anomaly is likely due to the fact that most of the published larval descriptions were prepared based on material collected outside of Singapore, while the ranking into “rare” or “uncommon” is based on the abundance of the species in the country. Note that the larval abundances in this study also do not correspond closely to the rarity classes. This is likely due to the non-standardized method of larva collection. Entomologists tend to oversample habitats that are likely to yield larvae for rare or unsampled species.

We suggest that future insect surveys may want to deliberately include techniques that target larvae. Such larval screens should be a priority if a habitat has a large number of species with unknown larval stages. Many odonate larval association studies use targeted sampling of larvae in promising habitats in order to find the larval stages for rare species. However, the high mobility of the adults, often related to territorial behaviour that may force individuals to forage away from their optimal breeding zones (Wolf & Waltz, 1984), can separate larval and adult habitats. This is why we believe that such targeted collecting should be complemented with more large-scale sampling of potential larval habitats that are otherwise neglected. This is fast and affordable once NGS barcoding is used for barcoding the specimens. Currently, most adult-larval associations for odonates are still obtained via rearing (Fig. 2). This is complemented by studying the morphology of pharates (Carvalho, 2000; Pérez-Gutiérrez & Montes-Fontalvo, 2011) and obtaining exuviae collected after the teneral adult has emerged (Xu, 2012; Del Palacio & Muzon, 2014; Domenico et al., 2016). These are chance events that should, of course, be utilized whenever available, but they are probably too infrequent to be a significant method for obtaining adult-immature associations. Some authors have also identified larvae via supposition; i.e., based on the presence of adults in a particular habitat (Tennessen, 2010; Müller et al., 2012; Novelo-Gutierrez et al., 2014). However, this practice can lead to incorrect results.

Larval matching via NGS barcodes is particularly promising for habitats with a large number of species without larval associations, but it should also be considered for well-studied habitats, especially if there are existing underutilized collections of larvae and the specimens are relevant for freshwater biomonitoring. Identifying larval specimens to species can yield significant benefits when using the taxa as bioindicators (Lenat & Resh, 2001). Fortunately, NGS barcoding is also suitable for many of the specimens in museum collections, given that it only involves a 313-bp piece of cox1 that is more likely to amplify for specimens that were not preserved for molecular purposes and may have been stored for a long time under problematic conditions.

Lastly, the standardization of the barcoding gene for animals allows the creation of large databases and promotes information sharing. Locally rare species might be common abroad, and comparisons of local barcodes against global databases can yield new matches, as was the case for four species in our study.

5.5.3. Barcode reliability and potential sources of error

In our study, we find that standard clustering thresholds applied to barcodes yield mOTUs that are overall consistent with species identified based on morphology (Table 2). In addition, the mOTUs are overall stable across delimitation algorithms and thresholds (Table 3). This goes a long way towards alleviating the theoretically justified concerns that barcodes can contain misleading signal. After all, genetic distances between barcodes do not always correspond to species boundaries (Meier et al., 2006) because cox1 is not a speciation gene (Kwong et al., 2012a). Based on theoretical considerations, one should therefore always scrutinize results obtained with barcodes for two potential sources of error: low interspecific variability leading to lumping and high intraspecific variability leading to splitting. The former is not observed here, but such cases have been reported in the literature. For example, in some odonate species (e.g., Orchithemis pulcherrima) different morphotypes have been revealed to be cases of polymorphism or sexual dimorphism based on molecular data. In other cases, high genetic variability within morphological clusters can be indicative of the presence of cryptic species (Damm et al., 2010). In our study, there may be two such cases: Amphicnemis gracilis had unusually high cox1 pairwise distances of >4% between individuals collected at the same site. However, the largest genetic distances were observed within Neurothemis fluctuans (>6%). Unfortunately, our current sample size is not large enough to clarify the species boundaries in either case.

Despite these cases, there is overall good congruence between NGS barcodes and morphotypes. Congruence between morphology and DNA sequences is often discussed at the species level. However, it is arguably equally important to consider how many specimens can be confidently placed into species-level units that are stable across clustering methods. In our study, we find that almost all specimens (1099 of 1123; 97.9%) can be placed into such stable mOTUs (mOTUs that are invariant to algorithms and thresholds). This means that only the assignment of a very small number of specimens is uncertain because they pertain to species where the barcodes are not decisive.
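The notion of a stable mOTU can be illustrated with a minimal sketch. It assumes that each clustering run is available as a mapping from specimen to cluster label, and it treats a cluster as stable only if exactly the same set of specimens is grouped together in every run. The function names and the toy data are hypothetical and are not taken from the pipeline used in this study.

```python
from collections import defaultdict


def clusters_of(partition):
    """Turn {specimen: cluster_label} into a set of frozensets of specimens."""
    groups = defaultdict(set)
    for specimen, label in partition.items():
        groups[label].add(specimen)
    return {frozenset(members) for members in groups.values()}


def stable_specimens(partitions):
    """Return the specimens whose cluster is identical in every partition."""
    cluster_sets = [clusters_of(p) for p in partitions]
    stable_clusters = set.intersection(*cluster_sets)
    return set().union(*stable_clusters)


# Hypothetical toy data: three clustering runs over the same five specimens.
p2 = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 4}
p3 = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3}
p4 = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}

stable = stable_specimens([p2, p3, p4])
print(sorted(stable), f"{len(stable) / len(p2):.1%}")
```

In the toy data only the cluster containing specimens a and b survives all three runs. The proportion reported in the text corresponds to the same ratio computed over the full dataset (1099 / 1123).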

5.5.4. The value of both morphology and molecules

Odonate larvae are typically described only based on the final instar; presumably this is partly due to the fact that morphological characters are most clearly visible at this stage and exuviae of final instars are often used for morphological descriptions. Since DNA barcoding can be applied to all instars, it also allows for the association of early-instar larvae. This might pose problems when using the morphological literature, which focuses on the morphology of late-instar larvae, but it is arguably time to also study the morphology of early instars. Because NGS barcodes can identify all instars, they are also more useful for identifying the larval habitats of species, given that early instars are more common than late instars. If a late instar is needed for descriptive purposes, follow-up work can concentrate on finding a late-instar larva in a habitat where an early instar had been identified based on barcoding.

The pipeline described here also paves the way for a more extensive use of larvae in phylogenetics. Larval morphology is a potentially rich yet frequently underutilized source of characters (Aspöck, 2012; Beutel, 1993; Beutel et al., 2010; Meier & Lim, 2009). Phylogenomics is a powerful tool, but it is not the “magic bullet” that can resolve all problems with confidence (Bordenstein et al., 2008; Kutty et al., 2018). In the era of genomics, morphology still has an important role to play in phylogenetics (Giribet, 2015). Congruence between morphology and genes increases the confidence in phylogenetic results, while incongruence can lead to novel discoveries of body plan evolution. There are several cases in the recent literature where larval morphology provided important sets of characters that were used to resolve problematic relationships (Heikkilä et al., 2014; Badano et al., 2017; Michat et al., 2017). However, more larvae that can be identified to species are needed in order to incorporate larval characters in phylogenetic analyses. A systematic effort to associate larval and adult forms via NGS barcoding can overcome this bottleneck.

5.5.5. Description, illustration, or both?

One of the aims of the project was to create a lasting and publicly available resource for researchers or naturalists interested in the odonates of Singapore. This should cover the adult and larval stages. For this reason, we prepared a digital reference collection (Ang et al., 2013a) by documenting larvae with high-resolution images and making the images available online as part of “The Biodiversity of Singapore” initiative (https://singapore.biodiversity.online). For diagnostic purposes, not only the habitus but also the labium and caudal lamellae are imaged (Fig. 3). All images can be enlarged to reveal small details such as the number of setae on the specimens. This resource is intended for odonate specialists, but it is also very useful for biologists involved in freshwater assessment via macroinvertebrates. Lastly, it communicates to taxonomic experts that material is available for the formal description of new larvae or larval instars (specimens are deposited in the Lee Kong Chian Natural History Museum). The digital reference collection now has entries for 110 species of Odonata, with 59 having associated larvae. It is expected to grow as new specimens are collected, sequenced and imaged. Due to the low technical complexity of NGS barcoding, regular updates can be obtained via undergraduate student projects every few years.

Figure 3. Species entry in digital reference collection for larva of Mortonagrion arthuri: dorsal (A), lateral (B) and ventral (C) views; enlarged images of dorsal (D) and ventral (E) views of the labium and caudal lamellae (F). Scale = 1mm. All images can be enlarged in browser

The gold standard in taxonomy is providing high-quality descriptions that are supported by high-quality images illustrating all characters that are known to be important for species identification (Ang et al., 2013b). At the other end of the scale are the short descriptions that are common in the old literature (see Lamb, 1924 for an odonate example), which are either not supported by illustrations or supported only by poorly executed or even incorrect drawings. We have here opted for the middle path; i.e., high-resolution images of the larvae (Fig. 3) which illustrate the features that are currently used for odonate larval descriptions and identification keys. This includes photographs of the habitus, but also of the modified labium, anal pyramid, lateral and dorsal abdominal spines (Anisoptera), and caudal lamellae (Zygoptera). We considered providing more comprehensive descriptions, but this would be premature given that far fewer than half of all known odonate species in Southeast Asia have known larvae; i.e., it would be very difficult to identify those characters that will eventually be revealed to be species-specific. With the current level of knowledge, we would thus argue that a digital reference collection consisting of high-quality images of potentially diagnostic features is preferable.

5.5.6. Barcode databases

In this study, four larval clusters could initially not be identified to species because they lacked an adult match. This illustrates why it is important to have comprehensive and publicly available barcode databases. This study generated a large number of such (mini) barcodes. Approximately half of the 95 species barcoded here (46 spp.) lacked barcodes in NCBI GenBank or BOLD. Barcode databases are extremely valuable once they are sufficiently comprehensive, but they tend to be very incomplete for most invertebrate groups (Kwong et al., 2012b), so that continued collecting and barcoding of adults and immatures remains necessary. Comprehensive and well-curated barcode databases are not only needed for specimen-based barcoding, but they are also important for metabarcoding studies that yield sequence information from gut content, faecal matter, soil, or water (Taberlet et al., 2012; Srivathsan et al., 2015; Lim et al., 2016). For instance, biomonitoring of freshwater macroinvertebrates occasionally utilizes a “soup” approach to species identification. Homogenised samples are studied with metabarcoding (Yu et al., 2012), but the metabarcoding data are much more meaningful when many sequence signatures can be matched to species.

Routine or exploratory biomonitoring via metabarcoding of eDNA can also yield new larval matches if the metabarcodes are identified as odonates not present in the current database, which indicates a potential new larval habitat for targeted sampling.

5.5.7. Concluding remarks

We hope that our matching study will inspire more work on larval-adult matching in entomology. Once we have the means to identify larvae to species, find breeding sites, and reconstruct habitat preferences, species-specific conservation projects can be initiated. In addition, NGS barcoding can significantly aid in the study of insect phenomes. Indeed, we know next to nothing about most of the species described (Greene, 2005). This is problematic because observing and detailing natural processes can help to generate novel and idiographic hypotheses, as well as generate interest and support in biodiversity research (Willson & Armesto, 2006; Cotterill & Foissner, 2010). Odonates in Singapore have fortunately received more attention than other invertebrates, with updated species checklists (Ngiam & Cheong, 2016), updates of new records and existing populations (Ngoi & Ngiam, 2011; Ngiam et al., 2011), larval descriptions (Ngiam & Leong, 2012; Ngiam & Dow, 2013), as well as excellent notes on species life histories (Ngiam, 2009; Orr et al., 2010). However, only very few insect taxa have received so much attention.

We here describe a pipeline for the large-scale identification of larvae of an ecologically important and charismatic insect taxon via association by NGS barcodes. While DNA sequences have been used for associating different life history stages before, it is only through the use of NGS technologies that the cost can be reduced sufficiently to justify sequencing all specimens. This approach (“reverse workflow” in Wang et al., 2018) renders morphological pre-sorting of specimen samples unnecessary. Instead, biologists can focus on confirming specimen clusters that were formed with NGS barcodes and then studying the morphology of the associated life history stages.

The method described here is particularly relevant for the study of insects with morphologically very different life stages from poorly explored environments and habitats. It is also suitable for taxa with high specimen abundance and species diversity.

Finally, we hope that the method will eventually allow for the inclusion of more immature data in phylogenetics, habitat surveys, and assessments.


Chapter 6

Discovery of a rich, distinct, and imperilled marine insect fauna in tropical mangrove forests

6.1. Abstract

Marine habitats such as mangroves are often thought to have poor insect species richness and diversity. We here use a newly developed low-cost and fast NGS barcoding pipeline for testing this conjecture. We compare species diversity, ranges, and abundances for >3200 Diptera and Hymenoptera species belonging to clades representing different ecological guilds, based on specimens collected over >2 years across several distinct habitats: mangroves, rainforest, swamp forests and disturbed tropical forests in Singapore (46,196 specimens, 2597 Malaise trap samples representing 485 trapping months). The data reveal a largely overlooked mangrove insect fauna that is unexpectedly rich (two-thirds of the diversity of a neighbouring rainforest) and distinct (ANOSIM P = 0.001). A beta-diversity comparison of mangrove faunas over small geographical scales further reveals high species turnover. This implies that this globally imperilled ecosystem may harbour a very significant and largely overlooked proportion of the global insect species diversity that unexpectedly thrives in mangroves despite high salinity and comparatively low plant diversity. The study exemplifies how new techniques can be used to greatly accelerate the pace at which insect species are discovered and whole communities characterized at a time of worldwide insect declines.

6.2. Introduction

Mangrove forests are important tropical and subtropical habitats that provide a wide variety of valuable ecosystem services (Alongi, 2008; Spalding et al., 2014; Zavalloni et al., 2014), while supporting and preserving unique communities of plants and animals (Nagelkerken et al., 2008; Yates et al., 2014). Unfortunately, these habitats are severely threatened by anthropogenic development (Duke et al., 2007), with an estimated global area loss of 1–2% annually (Valiela et al., 2001); i.e., levels that exceed those of other threatened tropical habitats such as rainforests and coral reefs. Additionally, mangroves are expected to be severely affected by climate change (Gilman et al., 2008), with some simulations predicting a loss of 46–59% of all global coastal wetlands due to rising sea levels (Spencer et al., 2016). These alarming prospects render it particularly important to improve our understanding of mangroves (Friess & Webb, 2014), since further wetland loss can be mitigated, but only through careful management (Schuerch et al., 2018). This includes comprehensive surveying of the invertebrate biodiversity, which contributes many essential ecosystem services.

The past decade has seen much interest in the conservation and restoration of mangroves (Balke & Friess, 2016). This research has focused on commercially or ecologically important taxa that are readily characterized, viz. vascular plants and vertebrates (Fisher et al., 2011). As is often the case, our knowledge of the invertebrate fauna lags far behind and their species diversity and abundances remain poorly understood (Nagelkerken et al., 2008), despite the fact that invertebrates are vital to mangrove health and survival (Weisser & Siemann, 2004) by providing ecosystem services such as pollination, decomposition, and predation. Indeed, the insect fauna of mangroves has received even less attention than that of rainforests, for which several comprehensive site-specific surveys exist (e.g., Basset et al., 2012). Existing studies of mangrove insects tend to focus on specific taxa (Batista-da-Silva, 2014; Chowdhury, 2014; Rohde et al., 2014), only identify specimens to higher taxonomic levels (Hazra et al., 2005; Adeniyi & Adeyinka, 2013; García-Gómez et al., 2014) or lack a comparison of the fauna with adjacent habitats; i.e., the species richness and distinctness of the fauna are unknown.


Due to these shortcomings, the few published studies yielded conflicting conclusions with regard to the importance of mangroves for insect diversity (Balakrishnan et al., 2011; Adeniyi & Adeyinka, 2013; D’Cunha & Nair, 2013), which are elaborated on later in the Discussion. These shortcomings are here addressed by a species-level study based on 46,196 specimens collected over >2 years (485 trapping months; 2597 Malaise samples) obtained from a range of tropical habitats: (1) mangroves, (2) rainforests, (3) swamp forests, (4) disturbed tropical forests. All are in close proximity to each other (<40 km), with physical barriers not substantially impeding movement, such that community structure is determined by habitat type rather than by a lack of migration, and the distinctness of the faunas can be assessed. All samples were collected in Singapore (Fig. 1), which retains ca. 10% (6.59 km²; ca. 1.66 km² are new/regenerated mangroves) of the original mangrove coverage (ca. 75 km² in 1819: Yee et al., 2010). The close proximity of all sampled sites and similar deforestation levels across different habitat types (>90%; Brook et al., 2003) allow us to compare the species richness and beta-diversity of different habitats across a small geographical scale. In addition, we here also test in one of the mangrove sites whether destroyed mangroves can be rehabilitated (Lai et al., 2015) by comparing adjacent old-growth and regenerated mangroves.


Figure 1: Map of Singapore’s mangroves (Yang et al., 2013) showing the mangrove sites sampled in this study: Pulau Ubin (P. Ubin), Sungei Buloh Wetland Reserve (SBWR) and Pulau Semakau (P. Semakau)

We estimate that over the course of the 2-year study, >2 million specimens were collected and sorted to order/family level by para-taxonomists. From these, we selected thirteen broad taxonomic groups of insects representing an array of trophic guilds for species-level research. Species-level research on such taxa is rarely undertaken (Scherber et al., 2014) because of the logistic and financial challenges that are associated with large-scale comparative work impeded by high species diversity and specimen abundance (Novotny & Miller, 2014). Morphological sorting of specimens to species is slow and difficult because an estimated 75–90% of insect species remain undescribed (Stork et al., 2008; Hamilton et al., 2010) and even more lack suitable identification tools. Furthermore, morphospecies sorting by para-taxonomists is fraught with unpredictable error (Krell, 2004), specialists are usually unavailable, and sorting with DNA barcodes is too expensive when they are obtained with Sanger sequencing (>$10/specimen). We here use a recently developed solution: species-level sorting with NGS barcodes, with subsequent morphological validation for some key taxa (“reverse workflow”: Wang et al., 2018).

Our results show that mangrove forests are much more species-rich than expected, being only slightly less diverse than adjacent tropical primary/secondary forest and freshwater swamp forest sites, while being more species-rich than disturbed secondary forests. More significantly, we find high beta-diversity between habitats and between mangrove fragments. For a subset of the taxa (Dolichopodidae), we also document high beta-diversity across Southeast Asian mangroves (Brunei, Thailand).

6.3. Materials & Methods

6.3.1. Sample collection and processing

All specimens were collected with Malaise traps (Fig. 2), and the traps within each site were no more than 200 m apart from each other. The mangrove sites were located at Pulau Ubin (PU), Sungei Buloh (SB) and Pulau Semakau (SM), the latter with both old (SMO) and newly regenerated (SMN) fragments. PU and SM are offshore sites whereas SB is on Singapore’s mainland. The freshwater swamp forest site was Nee Soon (NS), which is Singapore’s largest remaining freshwater swamp remnant and known for high species richness and endemism (Ng & Lim, 1992; Turner et al., 1996). Bukit Timah Nature Reserve (BT) was selected as the representative tropical rainforest site and the traps there sampled different forest types: primary forest, old secondary forest, and maturing secondary forest. The last set of Malaise traps sampled “disturbed secondary forest” along a disturbance gradient on the campus of the National University of Singapore (NUS). The geographically closest sites were BT and NS (<5 km), while the largest distance between sites was ca. 35 km between SMN and PU.


Figure 2: Map of trapping sites in this study, with the various habitat types in different colours (green: mangroves, blue: tropical forest, red: freshwater swamp forest, purple: disturbed secondary forest)

All Malaise trap samples were collected between 2012 and 2017. Further information can be found in Table S1. The samples were collected weekly and preserved in molecular-grade ethanol. Subsequently, the specimens were sorted to order/family level by para-taxonomists, and specimens from thirteen insect taxa from different ecological guilds were extracted for barcoding. They comprise phytophages/pollinators such as bees, non-parasitic wasps (Apoidea, Scolioidea, Vespoidea), hoverflies (Syrphidae) and fruitflies (Tephritidae), predatory robberflies (Asilidae) and long-legged flies (Dolichopodidae), fungivores like fungus gnats (Mycetophilidae), detritivores like soldierflies (Stratiomyidae), haematophages such as mosquitoes (Culicidae) and horseflies (Tabanidae), as well as parasitoids such as ichneumonid wasps (Ichneumonidae). In addition, we included high-diversity taxa that feed on a variety of resources (scuttleflies: Phoridae, ants: Formicidae; brachyceran flies). The main Diptera families highlighted here were selected for a number of reasons: 1) abundance in the sample, 2) ease of recognition in the initial para-taxonomist pre-sorting and 3) relevance to downstream studies. Comparison material for Dolichopodidae was also collected from mangrove forests in Thailand and Brunei via Malaise trapping from 2012 to 2017.

6.3.2. NGS barcoding and putative species sorting

A 313-bp fragment of the cytochrome oxidase I gene (cox1) was sequenced for each specimen via the protocol described in Meier et al. (2016). Direct-PCR (Wong et al., 2014) was conducted for specimens collected earlier, using the primer pair mlCO1intF: 5’-GGWACWGGWTGAACWGTWTAYCCYCC-3’ (Leray et al., 2013) and jgHCO2198: 5’-TAIACYTCIGGRTGICCRAARAAYCA-3’ (Geller et al., 2013), with 1-2 legs from the specimen as template. For specimens collected later, the whole specimen was immersed in Lucigen QuickExtract solution and gDNA extraction was conducted non-destructively. The QuickExtract solution was then used as a PCR template with the aforementioned reagents and protocol. The primers used were labelled with 9-bp long barcodes, which were generated using a Barcode Generator script (http://comailab.genomecenter.ucdavis.edu/index.php/Barcode_generator) to have a difference of at least three base pairs between them. Every specimen in each sequencing library was assigned a unique combination of forward and reverse primer labels, which allowed sequences to be sorted to their specimen of origin after sequencing. A negative control was prepared and sequenced for each 96-well PCR plate to account for external contaminants. Amplification success rates for each plate were approximately determined via gel electrophoresis for eight random wells.
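The tagging scheme described above can be sketched as follows. This is a minimal illustration and not the Barcode Generator script cited in the text: it assumes a simple greedy search for tags that differ at three or more positions, and it assumes that each read carries the forward tag at its start and the reverse tag at its end in the same orientation (real paired-end data would additionally require reverse-complementing and read merging). All function names and the toy read are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))


def pick_tags(n_tags, length=9, min_dist=3):
    """Greedily pick n_tags tags that differ pairwise at >= min_dist positions."""
    chosen = []
    for candidate in product("ACGT", repeat=length):
        tag = "".join(candidate)
        if all(hamming(tag, t) >= min_dist for t in chosen):
            chosen.append(tag)
            if len(chosen) == n_tags:
                return chosen
    raise ValueError("not enough tags satisfy the distance constraint")


def assign_combinations(specimens, fwd_tags, rev_tags):
    """Map each (forward tag, reverse tag) pair to exactly one specimen."""
    combos = list(product(fwd_tags, rev_tags))
    if len(specimens) > len(combos):
        raise ValueError("more specimens than tag combinations")
    return dict(zip(combos, specimens))


def demultiplex(read, lookup, length=9):
    """Return the specimen whose tag pair flanks the read, or None."""
    return lookup.get((read[:length], read[-length:]))


tags = pick_tags(4)
lookup = assign_combinations(["spec1", "spec2", "spec3"], tags[:2], tags[2:])
read = tags[0] + "ACGTACGTACGT" + tags[3]  # toy amplicon, not a real cox1 read
print(demultiplex(read, lookup))
```

Because any two tags differ at three or more positions, a single sequencing error in a tag cannot turn it into another valid tag, which is presumably the motivation for the minimum distance.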

The amplicons were pooled at equal volumes within each plate and later pooled across plates through an approximation of equimolarity as estimated by the presence and intensity of bands on the gels. The pooled samples were cleaned with Bioline SureClean Plus and/or via gel cuts before outsourcing library preparation to AITbiotech using TruSeq Nano DNA Library Preparation Kits (Illumina). Paired-end sequencing was performed on an Illumina MiSeq or HiSeq 2500 2x300-bp platform over multiple runs, thereby allowing troubleshooting for specimens which failed to sequence. The raw reads were processed with the bioinformatics pipeline and quality-control filters described in Meier et al. (2016). A BLAST search against GenBank’s nucleotide database was also conducted to identify and discard contaminants and other spurious sequences.

To obtain putative species units, the cox1 barcodes were clustered over a range of uncorrected p-distance thresholds (2–4%) typically used for species delimitation in the literature (Ratnasingham & Hebert, 2013). The clustering was performed using a Python script (Srivathsan, unpublished). USEARCH (Edgar, 2010) was also applied to the same barcodes at -id 0.96, 0.97 and 0.98, as an alternative species delimitation algorithm to objective clustering. The use of Automatic Barcoding Gap Discovery (ABGD; Puillandre et al., 2012) was attempted but failed to complete due to the large size of the dataset.
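The clustering step can be sketched under two stated assumptions: the barcodes are already aligned and of equal length, and objective clustering is approximated by single-linkage grouping, in which a specimen joins a cluster if it lies within the threshold of at least one member. The code below is an illustrative re-implementation with hypothetical names and toy sequences; it is not the unpublished script used in this study.

```python
def p_distance(a, b):
    """Uncorrected p-distance over aligned sites where both bases are A/C/G/T."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def cluster(seqs, threshold):
    """Single-linkage clustering of {name: sequence} using union-find."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


def mutate(seq, positions):
    """Substitute the bases at the given positions (toy helper)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[c] if i in positions else c for i, c in enumerate(seq))


base = "ACGT" * 25                                 # 100-bp toy barcode
seqs = {"s1": base, "s2": mutate(base, {0, 1})}    # s2 is 2% from s1
seqs["s3"] = mutate(seqs["s2"], {2, 3})            # 2% from s2, 4% from s1
seqs["s4"] = mutate(base, set(range(50, 60)))      # 10% from s1

for threshold in (0.01, 0.03):
    print(threshold, cluster(seqs, threshold))
```

At the 3% threshold the toy specimens s1 and s3 end up in the same cluster although they are 4% apart, because both are within 2% of s2. This chaining behaviour is a property of single linkage and is one reason why results are compared across several thresholds.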

6.3.3. Diversity analysis

To assess the species alpha-diversity of the mangrove, tropical forest, freshwater swamp and disturbed forest habitats, the communities from each mangrove site were compared to those of the other habitat types. To account for unequal sampling completeness, the samples from each site were rarefied with the iNEXT (Chao et al., 2014) R package (R Development Core Team), with 1000 bootstrap replicates. Site comparisons were carried out by comparing curves of species diversity against number of individuals post-rarefaction. In order to determine the distinctness of the communities across habitats, non-metric multidimensional scaling (NMDS) plots were generated for each habitat type and mangrove site comparison with the vegan (Oksanen et al., 2017) R package using both Bray-Curtis and Chao dissimilarity indices, the former being the classical index for abundance-type data and the latter being suitable for samples likely to have rare species (Chao et al., 2005). Additional analysis of similarities (ANOSIM) tests were performed in PRIMER (Clarke & Gorley, 2006) with the default 999 permutations to obtain p-values and R-statistics for both the global dataset and pairwise comparisons between habitat types. The p-values indicate the presence of significant differences, while the R-statistic allows assessment of the degree of similarity, with values closer to 1 indicating greater distinctness. Robustness of the results was tested by creating additional datasets from which singleton, doubleton and rare species (<5 and <10 individuals) were removed. The pruned datasets were subjected to the same analyses as the full dataset. A subset of the mangrove data that constitutes a consistent 2-year long sampling regime was also used to determine if there were significant differences between the mangrove sites in Singapore.
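To make the logic of the ANOSIM test concrete, the following sketch computes Bray-Curtis dissimilarities between traps, ranks them, and derives the R-statistic together with a permutation p-value. It is a simplified stand-alone illustration with hypothetical toy abundances; the analyses in this study were run in PRIMER and vegan, whose implementations may differ in details such as tie handling.

```python
import random
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / (sum(u) + sum(v))


def rank(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(ranks, pairs, groups):
    """R = (mean between-group rank - mean within-group rank) / (M / 2)."""
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_diff = sum(between) / len(between) - sum(within) / len(within)
    return mean_diff / (len(pairs) / 2)


def anosim(samples, groups, permutations=999, seed=0):
    """Return (R, p) with p estimated by shuffling the group labels."""
    pairs = list(combinations(range(len(samples)), 2))
    ranks = rank([bray_curtis(samples[i], samples[j]) for i, j in pairs])
    observed = anosim_r(ranks, pairs, groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(ranks, pairs, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical toy data: four traps per habitat, no species shared between habitats.
traps = [[10, 5, 0, 0], [8, 7, 0, 0], [12, 3, 0, 0], [9, 6, 0, 0],
         [0, 0, 4, 9], [0, 0, 6, 7], [0, 0, 5, 8], [0, 0, 3, 10]]
habitat = ["M"] * 4 + ["TF"] * 4
r_stat, p_value = anosim(traps, habitat)
print(f"R = {r_stat:.3f}, p = {p_value:.3f}")
```

In the toy data every between-habitat dissimilarity exceeds every within-habitat dissimilarity, so R reaches its maximum of 1. With 999 permutations the smallest attainable p-value is 0.001.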

6.4. Results

6.4.1. Species delimitation based on barcodes

Objective clustering (Meier et al., 2006) was used to group the 313-bp cox1 barcodes representing 46,196 specimens at uncorrected p-distance thresholds from 2–4% (Table 1). This yielded 3041–3237 molecular operational taxonomic units (mOTUs), which have a high probability of corresponding to species, especially given that most of the species numbers (94%) are stable across thresholds, while the ~260 mOTUs of Dolichopodidae were checked for congruence (Fig. 3; Wang et al., 2018). An alternative species delimitation algorithm, USEARCH, yielded highly congruent species richness estimates (3093–3330 mOTUs with similar -id parameters 0.96–0.98). Given that the number of mOTUs derived from both delimitation methods and ranges of distance thresholds is largely invariant, we used the mOTUs generated via objective clustering at the 3% p-distance threshold for the main analysis, while also reporting the results for 2% and 4% clusters for some analyses.


Table 1: Total number of barcodes and mOTUs in this study as delimited by objective clustering and USEARCH, with different p-distances and equivalent -id parameters for the different habitat types: mangroves (M), tropical forest (TF), freshwater swamp forest (FS) and disturbed secondary forest (DSF)

                     No. of mOTUs from          No. of mOTUs from
                     Objective Clustering       USEARCH
Habitat   No. of     2%      3%      4%         id=0.98   id=0.97   id=0.96
          Barcodes
M         30824      1578    1533    1494       1620      1553      1519
TF        4990       1007    978     969        1026      986       972
FS        4521       786     765     744        796       775       757
DSF       5866       577     564     559        589       570       562
Total     46196      3237    3129    3041       3330      3175      3093
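The stability of the mOTU numbers can be checked with simple arithmetic on the “Total” row of Table 1. The sketch below adopts one plausible reading of the 94% figure quoted in the text, namely the smallest mOTU count (4% threshold) expressed as a fraction of the largest (2% threshold); this interpretation and the helper name are assumptions.

```python
# "Total" row of Table 1 (number of mOTUs per clustering setting).
totals = {
    "objective clustering": {"2%": 3237, "3%": 3129, "4%": 3041},
    "USEARCH": {"id=0.98": 3330, "id=0.97": 3175, "id=0.96": 3093},
}


def stability(counts):
    """Smallest mOTU count as a fraction of the largest across settings."""
    return min(counts.values()) / max(counts.values())


for method, counts in totals.items():
    print(f"{method}: {stability(counts):.1%} of mOTU numbers retained")

# Relative difference between the two methods at the matching 3% / id=0.97 setting.
oc = totals["objective clustering"]["3%"]
us = totals["USEARCH"]["id=0.97"]
print(f"USEARCH vs objective clustering at 3%: {(us - oc) / oc:+.1%}")
```

Under this reading, the objective clustering ratio rounds to the 94% given in the text, and the two algorithms differ by less than 2% at the matching setting.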

[Figure 3 plot: No. of mOTUs (y-axis, 0–8000) against p-distance threshold (x-axis, 0–5)]

Figure 3: Number of objective clustering mOTUs across p-distances 0 – 5%, with the 2 – 4% thresholds commonly used for species-level delimitation shown with a solid line

6.4.2. Alpha-diversity across habitats

Based on observed species numbers (Table 1), mangroves appear to have the highest species richness, but this is before correcting for the larger sample size. We thus used the iNEXT (Hsieh et al., 2016) R package (R Development Core Team) to plot the rarefied species richness of the mangrove sites (PU, SB, SMN, SMO), tropical forest, freshwater swamp and disturbed secondary forest habitats. The comparisons were performed both with all three mangrove sites grouped as a single habitat type, as well as with the mangroves as separate sites and the tropical forest habitat as separate forest types. Our results show that mangrove forests are much more species-rich than expected, being only slightly less diverse than adjacent tropical primary/secondary forest and freshwater swamp forest sites, while being more species-rich than disturbed secondary forests. The rarefied combined species richness of all mangrove sites is approximately two thirds of that of the tropical forest site and roughly five sixths of that of the freshwater swamp site (Fig. 4A). When the mangrove sites were treated as independent samples (Fig. 4B), PU and SB exhibit species richness similar to that of the tropical rainforest (TF) and freshwater swamp (FS) sites and have a higher species richness than the urban/disturbed secondary forest site (DSF). Pulau Semakau’s new grove (SMN) has the poorest observed species richness, but the diversity of the old grove (SMO) and the disturbed secondary forest is not significantly different given that the confidence intervals overlap (see Fig. S2 for similar results obtained with 2% and 4% p-distance clusters).
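The principle behind comparing rarefied species richness can be illustrated with the classical individual-based rarefaction formula, which gives the expected number of species in a random subsample of n individuals. The sketch below uses hypothetical abundance vectors; it covers only this classical formula and not the extrapolation or bootstrapped confidence intervals that iNEXT provides.

```python
from math import comb


def rarefied_richness(abundances, n):
    """Expected species count in a random subsample of n individuals."""
    total = sum(abundances)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the number of individuals")
    # A species with abundance a is missed with probability C(total - a, n) / C(total, n).
    return sum(1 - comb(total - a, n) / comb(total, n) for a in abundances)


# Hypothetical per-species abundances for a heavily and a lightly sampled site.
large_sample = [120, 60, 30, 15, 8, 4, 2, 1, 1, 1]
small_sample = [10, 8, 6, 5, 4, 3, 2, 2]

n = min(sum(large_sample), sum(small_sample))  # compare at the smaller sample size
print(f"observed: {len(large_sample)} vs {len(small_sample)} species")
print(f"rarefied to n = {n}: "
      f"{rarefied_richness(large_sample, n):.2f} vs "
      f"{rarefied_richness(small_sample, n):.2f} species")
```

In the toy data the larger sample has more observed species, but the comparison at equal numbers of individuals favours the smaller, more even sample. This is why observed richness alone cannot be used to rank sites that were sampled unequally.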


Figure 4: Comparison of species diversity across habitats (3% p-distance mOTUs). Mangrove (M) sites are represented by Pulau Ubin (PU), Sungei Buloh (SB), Pulau Semakau old grove (SMO), Pulau Semakau new grove (SMN). The tropical forest site is represented by maturing secondary forest (TF-MS), old secondary forest (TF-OS) and primary forest (TF-P), while the other habitat types are freshwater swamp (FS) and disturbed secondary forest (DSF). In plot (A), the mangrove sites are plotted as a single habitat type, whereas in (B) they are plotted as separate sites. Plot (C) also plots the forest types in the tropical forest site separately. The solid lines represent rarefactions, the dotted lines represent extrapolations, and the points between them represent the actual observed values. The black vertical dotted lines indicate the point of rarefaction at which species richness comparisons were made.


6.4.3. Beta-diversity across habitats

Beta-diversity was quantified using Chao (Chao et al., 2005) and Bray-Curtis distances between the insect communities of the different habitats. The results were visualized with NMDS ordination (Fig. 5 & S4) before testing for statistical significance using pairwise ANOSIM analyses based on Bray-Curtis distances (Table 3 & S3). Based on the NMDS plots and ANOSIM tests, we find that the mangrove (M) communities are compositionally distinct from those of the other habitats. When all species are included and the trapping sites are used as data points, the communities from different habitats are clearly separated on the NMDS plots, with the mangrove, tropical forest, freshwater swamp and disturbed secondary forest traps clustering tightly within habitat type and apart from each other. The plots are robust to the removal of rare species (Fig. S4) and consistent across both distance metrics, and the NMDS stress values are all within the acceptable range (<0.2).
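For readers unfamiliar with the Bray-Curtis index, the following minimal sketch computes the dissimilarity between two traps from raw specimen counts. The function name and the dictionary-based input format are assumptions made for illustration only.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two abundance dictionaries.

    sample_a, sample_b: mapping of species (or mOTU) name -> specimen count.
    Returns 0 for identical samples and 1 for samples sharing no species.
    """
    species = set(sample_a) | set(sample_b)
    # Abundance shared by both samples: the smaller count for each species.
    shared = sum(min(sample_a.get(s, 0), sample_b.get(s, 0)) for s in species)
    total = sum(sample_a.values()) + sum(sample_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * shared / total
```

Because the index uses abundances, two traps sharing the same species can still be dissimilar if those species are common in one trap and rare in the other.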

This compositional dissimilarity between habitats was confirmed by ANOSIM tests (Table 3 & S3), which find significant differences between communities in both global (P < 0.05, R = 0.862 – 0.882) and pairwise habitat comparisons (P < 0.05, R = 0.717 – 1.000). We also find that the tropical rainforest (TF) and freshwater swamp (FS) communities are more similar to each other than the rest, as seen from their clustering distance in the NMDS plots and their lower R-statistic (0.717 – 0.726) in the pairwise ANOSIM tests. Note that these two habitats are also in closer proximity to each other than to the other habitat types. The ANOSIM outputs are again robust to the removal of rare species, and the p-values are significant even according to re-defined statistical criteria for unexpected or new results (Benjamin et al., 2018).
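The ANOSIM R-statistic reported here can be sketched as follows, assuming the standard definition R = (mean between-group rank - mean within-group rank) / (n(n - 1)/4), with significance assessed by permuting group labels. The function names and the square-matrix input are hypothetical choices for this illustration, not the implementation used in this study.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    """dist: symmetric square matrix of dissimilarities; groups: label per sample."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    if not within or not between:
        raise ValueError("need both within-group and between-group pairs")
    mean_between = sum(between) / len(between)
    mean_within = sum(within) / len(within)
    return (mean_between - mean_within) / (n * (n - 1) / 4)


def anosim_permutation_test(dist, groups, permutations=999, seed=0):
    """Return (observed R, permutation p-value) by shuffling group labels."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    labels = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(labels)
        if anosim_r(dist, labels) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

One general property of such permutation tests is worth keeping in mind when reading pairwise results: with very few traps per group, the number of distinct relabellings is small, which limits how small the p-value can become even when R is close to 1.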

Of the approximately 3200 species, >2500 are found in only a single habitat type (Table 2), while only 19 are found in all habitat types. A further 485 species are found in two habitat types and 46 are found in three. This indicates that the communities consist of species largely specialized to their respective habitats, a pattern that is insensitive to the removal of rare species (Table S3 & Fig. S4).

Table 2: Number of species found exclusively in the various habitat types and shared across multiple habitats with varying levels of rarity

Category                                    Full dataset  No singletons  No doubletons  No species <5 specimens  No species <10 specimens
Species in mangroves only                   1200          662            488            337                      211
Species in tropical forest only             591           234            153            90                       32
Species in freshwater swamp only            406           171            111            58                       27
Species in disturbed secondary forest only  342           102            61             37                       19
Species in two habitats                     485           485            417            300                      190
Species in three habitats                   46            46             46             41                       32
Species in all habitats                     19            19             19             19                       15

Table 3: Global and pairwise p-value and R-statistic outputs from ANOSIM analyses of the full dataset split by habitat type. The p-values are displayed in the bottom-left of the pairwise matrix while the R-statistics are displayed at the top-right

Global P: 0.001
Global R: 0.882

Habitat  TF     DSF    FS     M
TF       -      1.000  0.726  0.954
DSF      0.001  -      1.000  0.868
FS       0.001  0.029  -      0.950
M        0.001  0.001  0.001  -


Figure 5: NMDS plots of Bray-Curtis (top) and Chao (bottom) distances of the trapping sites in this study, indicating the distinctness of the four habitat types (mangroves: green, tropical forest: blue, freshwater swamp: red, disturbed secondary forest: purple)


6.4.4. Beta-diversity across mangrove sites

Beta-diversity index distances for the mangrove sites Pulau Ubin (PU), Sungei Buloh (SB), and Pulau Semakau (old [SMO] and new groves [SMN]) were calculated using only the specimens obtained during the 2-year sampling period (Fig. 6 & S6). The mangrove sites are clearly separated, which indicates spatial clustering. The adjacent old-growth and regenerated mangroves of Pulau Semakau also house largely distinct communities, with the two sites sharing only 244 out of the 483 species in the new grove and the 571 species in the old grove. Overall, these results are robust even when the rare species are removed. The only exception is the Bray-Curtis distance plot, where PU and SB overlap when all species with <5 specimens are removed (Fig. S6). Note that the stress values for all NMDS plots were within acceptable levels (<0.2).

The global ANOSIM analyses also indicate significantly greater differences between communities than within them (Table 5 & S5; P < 0.05, R = 0.605 – 0.626). The pairwise comparisons, however, mainly show SMO and SMN to be highly distinct from PU (P < 0.05). The other comparisons lack significant p-values, although the R-statistic values are largely positive (R = 0.214 – 1.000). PU and SB are the sites that show the most compositional similarity (R = 0.214 – 0.250), which is also observed in the NMDS plots.

The insect communities of the different mangrove sites are fairly distinct, with 925 of the 1389 mangrove species from the 2-year sampling period found in only one of the sampled sites (Table 4). Only 68 species are found in all sites, while 90 species are found in three and 306 are found in two (Table 4).
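The counts above can be derived from specimen records by tallying, for each species, the number of sites in which it occurs. The sketch below shows one way to do this and checks that the full-dataset figures from Table 4 add up to the totals quoted in the text. The function name, the record format and the toy records are assumptions for illustration.

```python
from collections import Counter


def occupancy_counts(records):
    """records: iterable of (species, site) pairs, one per specimen.

    Returns a Counter mapping number of sites occupied -> number of species.
    """
    sites_by_species = {}
    for species, site in records:
        sites_by_species.setdefault(species, set()).add(site)
    return Counter(len(sites) for sites in sites_by_species.values())


# Full-dataset column of Table 4 (values copied from the table).
TABLE4_FULL = {
    "PU only": 276, "SB only": 228, "SMN only": 191, "SMO only": 230,
    "two sites": 306, "three sites": 90, "all sites": 68,
}

single_site = sum(v for k, v in TABLE4_FULL.items() if k.endswith("only"))
all_species = sum(TABLE4_FULL.values())
print(single_site, all_species)  # 925 1389
```

The same tally applied after removing singletons, doubletons or other rare species would give the remaining columns of the table.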


Table 4: Number of species found exclusively in the various mangrove sites and shared across multiple sites

Category                Full dataset  No singletons  No doubletons  No species <5 specimens  No species <10 specimens
Species in PU only      276           105            59             33                       10
Species in SB only      228           78             36             12                       4
Species in SMN only     191           57             36             22                       10
Species in SMO only     230           80             41             16                       4
Species in two sites    306           306            233            143                      78
Species in three sites  90            90             90             76                       51
Species in all sites    68            68             68             67                       65

Table 5: Global and pairwise p-value and R-statistic outputs from ANOSIM analyses of the mangrove sites (PU: Pulau Ubin, SB: Sungei Buloh, SMN: Pulau Semakau new grove, SMO: Pulau Semakau old grove). The p-values are displayed in the bottom-left of the pairwise matrix while the R-statistics are displayed at the top-right

Global P: 0.002
Global R: 0.614

Site  PU     SB     SMN    SMO
PU    -      0.250  0.852  0.574
SB    0.200  -      1.000  1.000
SMN   0.029  0.100  -      0.556
SMO   0.057  0.100  0.100  -


Figure 6: NMDS plots of Bray-Curtis (top) and Chao (bottom) distances of the mangrove traps sampled across 2 years in this study, grouped by site (Pulau Ubin [PU], Sungei Buloh [SB], Pulau Semakau old fragment [SMO] and new fragment [SMN])


6.4.5. Dolichopodidae beta-diversity across Southeast Asia

In order to test for high beta-diversity across SE Asia, we obtained data for a key taxon in our analysis, Dolichopodidae. Malaise-trap sampling of mangroves in Thailand and Brunei yielded 3486 dolichopodid specimens. Of the 278 species in the combined sample, 227 are unique to a single country, 42 are shared between two countries and only 9 are shared across all countries. However, the sites in Brunei (9 sites, 2711 specimens) and Singapore (33 sites, 13307 specimens) had larger sample sizes, while those in Thailand (13 sites, 775 specimens) had fewer specimens per site. Further sampling is hence required to make more accurate assessments.
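The difference in sampling intensity can be made explicit with a short calculation using only the site and specimen counts quoted above. The variable and function names are hypothetical.

```python
# (number of sites, number of specimens) per country, as quoted in the text.
SAMPLING = {
    "Brunei": (9, 2711),
    "Singapore": (33, 13307),
    "Thailand": (13, 775),
}


def specimens_per_site(sampling):
    """Mean number of specimens per trapping site for each country."""
    return {country: specimens / sites
            for country, (sites, specimens) in sampling.items()}


per_site = specimens_per_site(SAMPLING)
for country, value in per_site.items():
    print(f"{country}: {value:.1f} specimens per site")

# Species tally from the text: single-country + two-country + all-country.
assert 227 + 42 + 9 == 278
```

On these figures, the Thai sites yielded roughly 60 specimens each, compared with roughly 300 per site in Brunei and 400 per site in Singapore, which is why the shared-species counts should be read as provisional.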

6.5. Discussion

6.5.1. Discovery of a largely overlooked insect community in mangroves

Mangroves are typically assumed to have poor insect diversity due to high salinity and comparatively low plant diversity (Nagelkerken et al., 2008; Adeniyi & Adeyinka, 2013). However, in our study, we find that the mangroves harbour a large proportion of the insect diversity. Indeed, after adjusting for sampling, Singapore’s premier rainforest reserve and largest swamp forest remnant each have only ca. 30% more insect species than its mangroves. Given that tropical rainforests are often thought of as the most species-rich habitats (Erwin, 1982; Novotny et al., 2006), it is surprising that mangrove forests, often thought to be species-poor for insects, have only around a third less diversity.

Finding such high alpha-diversity is noteworthy, but it is even more remarkable given that these species-rich mangrove insect communities are very distinct from those found in a wide range of other habitats. This means that mangrove forests host largely overlooked, species-rich assemblages that are distinct from those of the tropical rainforest, swamp forest and disturbed secondary forest sites, as seen from the low species overlap and high community differentiation. These results are robust when rare species are removed, when different distance metrics are used, and when the patterns are tested for significance with ANOSIM (see also NMDS plots). Note also that this compositional distinctness is not driven by geographical distance, given that the mangrove sites are in close proximity to the other habitats (Fig. 2: <35 km).

There are few cross-habitat studies involving mangrove insect communities in the literature. Adeniyi & Adeyinka (2013) similarly find mangroves to be only slightly less diverse than rainforest habitats, while D’Cunha & Nair’s (2013) study on ants finds their mangrove site to be extremely species-poor. The former, while much smaller in scale (<200 specimens) than this study, includes a wide range of taxa, including Lepidoptera, Orthoptera, Dictyoptera and Coleoptera, while the latter has a tight taxon focus and confirms that the few species found were also unique to the habitat. While our study has a much larger sample size and hence greater statistical power, it might be worth subsequently examining the individual taxa for varying species richness across habitats.

The high beta-diversity is likely driven by the distinct environment of mangrove forests, which are characterized by high salinity, large temperature ranges and regular tidal inundation, among other factors. These environmental conditions are likely to require specialized physiological and behavioural adaptations that should encourage the emergence of an evolutionarily distinct fauna. The unexpected finding here is that this specialized fauna is also species-rich, given that species richness in insects is often linked to high plant diversity (Novotny et al., 2006) and high salinity is not considered to be conducive to insect diversity. Finding such high insect diversity in the mangroves suggests that it will be important to study phylogenetic diversity (Faith, 1992) and community structure (Webb et al., 2002). At this point, it is unclear whether the high species diversity is due to radiation within mangroves or repeated invasions from other tropical habitats.


A strength and potential weakness of our study is that all habitats were sampled in Singapore. This tightly controls for geographic effects, but one could surmise that the patterns of diversity have been skewed through urbanization. This would be a major concern if the different habitats had experienced different levels of deforestation and the sizes of the remaining patches were very dissimilar. However, this is not the case.

It is estimated that Singapore had 72,000 ha of rainforest in 1819 (Corlett, 1992; Brook et al., 2003). Today, only 100 ha of primary forest, 1600 ha of secondary forest, 600 ha of mangrove forest and 87 ha of freshwater swamp forest survive; i.e., 90% of rainforest (Corlett, 1991), 99.5% of mangroves (Ng & Low, 1994) and 93% of freshwater swamp forest (Davison et al., 2018) habitat has been lost. Note also that the sampled patches were larger for rainforest and swamp forest than for the individual mangroves (swamp forest: 5 km², tropical rainforest: 1.64 km², mangrove forest fragments: 0.904 km² [PU], 1.168 km² [SB], 0.174 km² [SM]; Yee et al., 2010). Yet, two of these mangrove fragments (PU & SB; Fig. 4B) have alpha-diversity similar to that found in the larger rainforest and swamp forest patches. The comparatively high species diversity of mangroves relative to Bukit Timah Nature Reserve and Nee Soon swamp forest is especially striking given that these are well-protected areas with known high diversity. Bukit Timah Nature Reserve is considered of high conservation value and has been protected for at least 53 years (Corlett, 1988). The reserve contains a very small Centre for Tropical Forest Science (CTFS) plot (2 ha) that nevertheless supports 312 species of trees. Similarly, Nee Soon freshwater swamp forest is known for its high species richness (e.g., home to 1150 species of vascular plants: Wong et al., 2013). It is also a habitat that retains the last Singaporean populations of many species (Ng & Lim, 1992). Furthermore, we sampled different forest types (primary, old secondary and maturing secondary forests) in Bukit Timah, which is likely to have increased the species richness through greater habitat heterogeneity. Conversely, mangrove forests are arguably less heterogeneous. It is hence remarkable that the mangrove insect fauna is nevertheless both distinct and not substantially poorer in species than what is observed for any one of these habitats.

Malaise trap sampling, which should be of similar efficiency in low canopy forests, is one way to conduct standardized sampling in tropical habitats. It is, however, important to follow this up in the future with other trapping techniques that sample a wider range of arthropod fauna, such as flight intercept traps, pitfall traps, Winkler sifting and light trapping. It is possible that the lack of direct canopy sampling led to under-sampling of the tropical forest sites. However, the number of canopy specialists is often not as high as one might expect (Gaston, 1991) and the mature mangroves sampled (PU, SB and SMO) have canopy heights similar to those of some of the secondary forest sites in this study (pers. obs.). Despite these potential concerns, our results suggest that the assumption that mangroves cannot support substantial insect diversity (Macnae, 1968) might be false and deserves further scrutiny.

In order to fully appreciate the insect diversity of mangroves, it is important to also study beta-diversity at different scales. We find that the three mangrove sites surveyed in Singapore have fairly distinct communities (Tables 4 & 5, Fig. 6). This suggests that there is high beta-diversity between the mangrove sites and hence that the total mangrove insect diversity is potentially very large. It also implies that these mangrove fragments host sufficiently distinct communities to warrant protection. The data for Singapore’s sites are suggestive, but it is important to understand whether beta-diversity is similarly high across Southeast Asia. We were able to test this here for 3486 specimens of Dolichopodidae from Thailand and Borneo. Note that, due to extensive sampling, Singapore’s mangrove Dolichopodidae are so well studied that the vast majority of species are already known (178 spp.). We nevertheless find an additional 63 species just by travelling 1250 km to Brunei. Similarly, we obtain another 37 species just 980 km away in Thailand. However, more comprehensive sampling of these areas and other regional forests is necessary to properly quantify and characterize these communities.

Fortunately, mangrove forests can be rehabilitated through replanting, but we find that such a rehabilitated mangrove in Pulau Semakau (SMN) has slightly poorer species richness than the old-growth grove (SMO) (483 vs 571 spp.). Additionally, the species composition of these two sites is quite distinct (Table 5 & Fig. 6), with only 244 of 810 species shared between the two sites. Hence, despite reforestation, the adjacent sites retain largely differentiated communities. This may be due to the fact that the reforested mangrove was initially a monoculture of Rhizophora, and it is known that plant diversity during the restoration process affects the subsequent faunal community (Miller & Hobbs, 2007).
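The overlap between the two Pulau Semakau groves can be expressed with standard incidence-based similarity indices. The sketch below derives the pooled species count and the Jaccard and Sørensen similarities from the three figures given in the text (483, 571 and 244). The function name is hypothetical, and these indices are offered as an illustration rather than as analyses performed in this study.

```python
def jaccard_sorensen(richness_a, richness_b, shared):
    """Return (Jaccard, Sorensen) similarity from two richness values
    and the number of species shared between the two sites."""
    if shared > min(richness_a, richness_b):
        raise ValueError("shared species cannot exceed either site's richness")
    union = richness_a + richness_b - shared
    jaccard = shared / union
    sorensen = 2 * shared / (richness_a + richness_b)
    return jaccard, sorensen


# SMN: 483 species, SMO: 571 species, 244 shared.
union = 483 + 571 - 244
jaccard, sorensen = jaccard_sorensen(483, 571, 244)
print(union, round(jaccard, 3), round(sorensen, 3))  # 810 0.301 0.463
```

In other words, roughly 30% of the pooled species list is common to both groves, which is consistent with the description of largely differentiated communities.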

6.5.2. NGS barcodes as a new technique for large-scale study of insect communities

Global insect declines are an ongoing and troubling trend that has recently sparked great concern in the scientific community (Hallmann et al., 2017) and the public (Mikanowski, 2017; Stager, 2018). Obtaining relevant data has been very difficult because quantifying insect diversity is slow and expensive. The study of insect diversity is notoriously challenging due to the need to study a large number of species based on an even greater number of specimens (Novotny & Miller, 2014). This has created a “taxonomic impediment” (Samways, 1993), which is here overcome via NGS barcoding. It provides the means for conducting large-scale species discovery and habitat assessment studies for invertebrate taxa (Meier et al., 2016; Wang et al., 2018) by combining the cost advantages of NGS with the species-level sorting prowess of DNA barcodes. In addition to the low consumable cost, the manpower cost can also be kept low because the molecular work can be learned within a few hours. Using the estimates in Wang et al. (2018), the data used in this study can now be obtained within 3 months by four full-time interns.


We here used NGS barcoding instead of metabarcoding of bulk DNA obtained from extracting entire homogenized samples (Yu et al., 2012) because the number of samples that have to be covered for sufficient spatial and temporal sampling is very large (2597 in this study). Including such a large number of samples via the metabarcoding of bulk DNA, along with a sufficient number of PCR replicates (Ficetola et al., 2015; Pawluczyk et al., 2015), would have been prohibitively expensive. In addition, we would have lost the specimen-barcode association that allows for morphological verification, identification, and description of putative species. However, the two approaches are complementary in that NGS barcoding can be used to first characterize a community. Re-surveys and routine biomonitoring can then be carried out using metabarcoding of homogenized material or storage ethanol (Shokralla et al., 2010; Hajibabaei et al., 2012; Ji et al., 2013; Kocher et al., 2017).

6.5.3. Concluding remarks

We here document that mangrove insect communities are not only rich, but also discrete when compared to insect communities found in other tropical habitats. Note that we are not claiming that mangroves will challenge rainforests for species richness, given that the overall size of the rainforest insect fauna is likely to be much larger since rainforests occupy a greater area and mangrove forests lack elevational gradients. Nevertheless, the discovery of such an unexpectedly rich community once again documents how little we know about insects in the tropics. Fortunately, advances in sequencing technology now provide the techniques for large-scale detection and quantification of tropical and temperate insect diversity at the species level across many habitats. As sequencing costs continue to decline, we believe that many additional and rather distinct insect communities will be discovered in hitherto poorly sampled environments. With regard to mangroves, it will be important to study beta-diversity at the regional and global scale. It will be similarly important to understand the phylogenetic composition of this fauna, as it is conceivable that many distinct lineages of insects are only found in mangroves.


6.6. Supplementary Information

Table S1: Collection periods, locations and number of trapping sites in this study. The habitat types included are as follows: M = mangroves, FS = freshwater swamp, DSF = disturbed secondary forest, TF = tropical rainforest Sampling period Location No. of traps M: Pulau Ubin 3 M: Pulau Semakau old grove 3 April 2012 – March 2014 M: Pulau Semakau new grove 3 M: Sungei Buloh 2 FS: Nee Soon 2 November 2014 – May 2015 FS: Nee Soon 2 April 2015 – September DSF: NUS 4 2015 M: Pulau Ubin 10 March 2016 – August 2016 M: Sungei Buloh 10 TF: Primary forest 3 TF: Old secondary forest 3 August 2016 – October 2017 TF: Maturing secondary 3 forest


Figure S2: Rarefaction curves of all species (2 & 4% p-distance) in the study split by habitat and site for the mangroves and forest type for the tropical forest. Mangrove (M) sites are represented by Pulau Ubin (PU), Sungei Buloh (SB), Pulau Semakau old grove (SMO) and Pulau Semakau new grove (SMN). The tropical forest site is represented by maturing secondary forest (TF-MS), old secondary forest (TF-OS) and primary forest (TF-P), while the other habitat types are freshwater swamp (FS) and disturbed secondary forest (DSF). In plots (A), the mangrove sites are plotted as a single habitat type, whereas in (B) they are plotted as separate sites. Plots (C) additionally plot the forest types of the tropical forest site separately. Solid lines represent rarefactions, dotted lines represent extrapolations, and the point between them marks the observed value. The black vertical dotted lines indicate the point of rarefaction at which species richness comparisons were made.


Table S3: Global and pairwise p-value and R-statistic outputs from ANOSIM analyses of modified datasets split by habitat type (TF: tropical forest, DSF: disturbed secondary forest, FS: freshwater swamp, M: mangroves) with singletons and doubletons removed, as well as species with less than 5 and 10 specimens. The p-values are displayed in the bottom-left of the pairwise matrices while the R-statistics are displayed at the top-right


Figure S4: NMDS plots of Bray-Curtis and Chao distances of the trapping sites in this study, grouped by habitat (mangroves [M]: green, tropical forest [TF]: blue, freshwater swamp [FS]: red, disturbed secondary forest [DSF]: purple) with various degrees of rare species removal


Table S5: Global and pairwise p-value and R-statistic outputs from ANOSIM analyses of the mangrove sites (PU: Pulau Ubin, SB: Sungei Buloh, SMN: Pulau Semakau new grove, SMO: Pulau Semakau old grove) with singletons and doubletons removed, as well as species with less than 5 and 10 specimens. The p-values are displayed in the bottom-left of the pairwise matrices while the R-statistics are displayed at the top-right


Figure S6: NMDS plots of Bray-Curtis and Chao distances of the mangrove traps sampled across 2 years in this study, grouped by site (Pulau Ubin [PU], Sungei Buloh [SB], Pulau Semakau old fragment [SMO] and new fragment [SMN]) with various degrees of rare species removal.


Chapter 7

Conclusion

The main aim of this thesis is to explore and ultimately derive the means for realistically approaching the age-old question of how many species there are on Earth. Many of the fundamental problems interfering with progress towards a reasonable estimate are not rooted in the philosophical underpinnings of what constitutes a species or in disagreements over species concepts. Instead, they stem from the logistical challenge of obtaining and sorting hundreds of millions of specimens into millions of species. This is one of the challenges subsumed under the “taxonomic impediment”, defined as “the gaps of knowledge in our taxonomic system (including knowledge gaps associated with genetic systems), the shortage of trained taxonomists and curators, and the impact these deficiencies have on our ability to manage and use our biological diversity” in the Darwin Declaration at the 1998 Convention on Biological Diversity. These knowledge gaps are unevenly distributed. Most concern hyper-diverse, uncharismatic taxa that have very large numbers of species in the tropics. Given that most species are arthropods and that tropical diversity is particularly poorly known, tackling tropical arthropod diversity will be key to understanding global species diversity.

These issues are not new and great strides have been made towards addressing the taxonomic impediment. Strategies range from increasing financial and manpower support for taxonomic research (de Carvalho et al., 2005; Joppa et al., 2011) to the development of molecular tools (Hebert et al., 2003; Blaxter, 2004) and informatics platforms (Godfray, 2007). However, time is running out due to the current unprecedented rate of extinction resulting from anthropogenic activity. Species are being lost faster than we can discover and catalogue them, especially because much of the habitat loss is in areas with the highest species diversity (Brooks et al., 2002). Taxonomy and systematics have to develop faster methods, even if this means some compromise in accuracy.

With greater accessibility to Next Generation Sequencing (NGS) technologies, taxonomists now have a powerful tool in their arsenal for species discovery and identification. NGS-based methods and techniques are developed, tested, and discussed in this thesis. We are attempting to develop fast species discovery tools that are cost-effective and thus accessible to as many scientists as possible, thereby democratizing access to these techniques. The aim is not to relegate the role of the taxonomist to that of a laboratory technician, but to turn the most tedious aspect of species discovery, pre-sorting, into a technical exercise. This frees up taxonomists’ time for species descriptions, evaluating species boundaries, and testing evolutionary hypotheses. Aside from species discovery, placing these species on the tree-of-life is another task made easier by NGS. In this thesis, I explore the use of genome skimming and multiplexed tagged amplicon sequencing. Lastly, the molecular tools provide a line of evidence complementary to morphology and generate the data needed to better understand life history traits such as niche differentiation between larvae and adults of the same species.

Chapter 2 of this thesis first examines the utility of mini-barcodes. Their performance is compared to that of the full-length cox1 barcode for metazoan species delimitation. Mini-barcodes can be sequenced on Illumina, while full-length barcodes require Sanger, MinION, or PacBio sequencing. Additionally, mini-barcodes are more likely to amplify if template DNA is degraded, thereby facilitating molecular work on museum specimens, gut/faecal content and environmental DNA. I used data for 30,000 barcoded specimens with morphospecies designations to test nine commonly used mini-barcodes. I found that there were no significant differences in congruence with morphology and decisiveness as long as the mini-barcodes are >150-bp and positioned in the 3’ end of the Folmer region. This yields guidelines for mini-barcode primer design that may interest researchers developing mini-barcodes, and justifies the use of the 313-bp and 420-bp mini-barcodes in the remaining chapters of the thesis. The positional effect, however, remains worth further investigation and should give researchers indications on which regions to select or avoid.

After evaluating the use of mini-barcodes, I next evaluate different species delimitation methods in Chapter 3 (namely tree-based species delimitation). I also propose a pipeline for cost-effectively generating large numbers of mitochondrial and nuclear characters for placing species on the tree-of-life. This pipeline involves mitochondrial metagenomics (MMG; Crampton-Platt et al., 2016) and multiplexed tagged amplicon sequencing. I find that most species units (70 – 80%) are stable even when only mini-barcodes are used for delimitation. The use of tree-based species delimitation methods applied to accurate, well-resolved trees can increase congruence with morphospecies for an additional 10 – 15% of the species. I furthermore obtain full mitochondrial genomes and 28S rDNA characters (~21,000-bp) for around 460 MOTUs at low cost and obtain a phylogenetic tree that is consistent with the trees that have been published for Diptera. Most of these species are likely new to science and would go a long way towards a comprehensive dipteran tree-of-life. Furthermore, with this pipeline in place, as well as new innovations in sequencing technology, the pace at which species can be added to the tree can be easily scaled up. These trees are not likely to be well-supported at every node but would highlight particularly problematic relationships that require further character sampling through either transcriptomics or UCE hybrid capture. In the meantime, these species-rich trees are likely sufficient for conducting phylogenetic diversity or community-level analyses.

Chapter 4 builds on the MMG pipeline used in the previous chapter for mitochondrial genome skimming and explores whether it can be improved via hybrid capture techniques that can greatly enrich the sequencing pool with desired mitochondrial DNA. Although I find the hybrid enrichment process to be very efficient in enriching the mitochondrial proportion of the sequencing pool, I also find that certain regions of the mitochondrial genome are preferentially enriched at the expense of other regions. This results in patchy coverage across the mitochondrial genome and consequently short and fragmented contigs after assembly. After considering the additional cost of hybrid enrichment (manpower and consumables) and the difficulties related to designing probes for a broad, diverse and largely unknown fauna (e.g., tropical Coleoptera), I conclude that it may be more cost-effective to use genome skimming at a greater sequencing depth on high-throughput NGS platforms. As the per-sequence cost continues to decline steadily and high-throughput platforms become more accessible (e.g., by allowing users to pay by gigabase rather than for the whole run), this option seems like the clear alternative.

In Chapter 5, I use NGS barcoding to barcode around 1200 adult and larval odonate specimens. The barcodes were then used to associate the adults and larvae of 59 species. Efficient species discovery requires the ability to work with all natural history specimens. A large majority of specimens collected are eggs and larvae, while half are females. Relying only on adult males leads to under-sampling and an incomplete understanding of the system studied. With each barcode costing <USD 0.40, NGS barcoding can be used to quickly perform adult-larva associations for a large number of specimens, thereby providing a faster and more reliable alternative to rearing. The matched larvae were then photographed with a high-resolution imaging system, with emphasis on taxonomically relevant characters, and the images were made available online for public access. With this endeavour, Singapore’s odonate larvae can be more easily identified through the digital reference collection, and once more species have been matched, a detailed identification key for the country may be developed. Furthermore, we now have better knowledge of the larval habitats of Singapore’s odonates, especially for certain rarer species that might have conservation value. The NGS barcoding approach to life history stage association is particularly effective for novel habitats, where it maximises the number of matches, while subsequent efforts to fill in the gaps can be more targeted. This technique is also particularly useful for matching females and polymorphic forms.

Lastly, NGS barcoding was used again in Chapter 6 to explore and discover a largely neglected and unknown fauna: mangrove insects. Due to assumptions that environments with low plant diversity and high salinity are not conducive to insects, mangroves are speculated to have poor insect diversity and are rarely sampled. By barcoding more than 46,000 specimens from Singapore’s mangroves, freshwater swamp forest, tropical secondary forests and disturbed forests, I obtain distributional and abundance data for ca. 3200 species. I demonstrate that mangroves are species-rich. Additionally, I find that mangrove insect communities are highly distinct, which highlights the need for conserving and protecting this fast-disappearing tropical habitat (Valiela et al., 2001). While the limited regional data processed so far are promising, further exploration of regional material will allow us to understand how well this trend holds at different spatial scales. Finally, from a broader perspective, NGS barcoding can help facilitate the study of similarly poorly studied and highly diverse taxa like mites and nematodes.

All in all, the techniques used and evaluated in these chapters demonstrate incredible potential for addressing insect diversity problems in the 21st century. Species discovery can be fast and cost-effective, with the data collected also useful for answering ecological questions and resolving the natural history of species. The barcode databases generated here are already facilitating further work such as metabarcoding studies on diet and species detection based on eDNA (e.g. Lim et al., 2016). The trees that we reconstruct here for species-delimitation purposes contribute to the goal of having a Tree-of-Life for all species. With the advent of NGS technologies providing the means of greatly expediting species discovery and identification processes, it is indeed an exciting time for taxonomy and systematics. Hopefully, this translates to greater public and scientific interest in these fields, which have been considered a “marginal science”. Although we have no definitive answer as to how many species there are on Earth, the new techniques can narrow the margin by obtaining much more data (Stork et al., 2015) in a shorter period of time, and I believe that the answer is within our grasp.


Chapter 8

References

Achtman, M., & Wagner, M. (2008). Microbial diversity and the genetic nature of microbial species. Nature Reviews Microbiology, 6(6), 431.

Adeniyi, S., & Adeyinka, O. J. (2013). Diversity and abundance of arthropods and tree species as influenced by different forest vegetation types in Ondo state, Nigeria. International Journal of Ecosystem, 3(3), 19-23.

Ahrens, D., Monaghan, M. T. & Vogler, A. P. (2007) DNA-based taxonomy for associating adults and larvae in multi-species assemblages of chafers (Coleoptera: Scarabaeidae). Molecular Phylogenetics and Evolution, 44, 436-449.

Ahrens, D., Fujisawa, T., Krammer, H. J., Eberle, J., Fabrizi, S., & Vogler, A. P. (2016). Rarity and incomplete sampling in DNA-based species delimitation. Systematic Biology, 65(3), 478-494.

Alongi, D. M. (2008). Mangrove forests: resilience, protection from tsunamis, and responses to global climate change. Estuarine, Coastal and Shelf Science, 76(1), 1-13.

Andrew, R. J., Subramanian, K. A. & Tiple, A. D. (2008) A handbook on common odonates of Central India. Hislop College.

Ang, Y., Puniamoorthy, J., Pont, A. C., Bartak, M., Blanckenhorn, W. U., Eberhard, W. G., ... & Meier, R. (2013a) A plea for digital reference collections and other science‐based digitization initiatives in taxonomy: Sepsidnet as exemplar. Systematic Entomology, 38, 637-644.


Ang, Y., Wong, L. J., & Meier, R. (2013b). Using seemingly unnecessary illustrations to improve the diagnostic usefulness of descriptions in taxonomy–a case study on Perochaeta orientalis (Diptera, Sepsidae). ZooKeys, 355, 9-27.

Arif, I. A., Khan, H. A., Al Sadoon, M., & Shobrak, M. (2011). Limited efficiency of universal mini-barcode primers for DNA amplification from desert reptiles, birds and mammals. Genetics and Molecular Research, 10(4), 3559-64.

Armani, A., Guardone, L., Castigliego, L., D'Amico, P., Messina, A., Malandra, R., ... & Guidi, A. (2015). DNA and Mini-DNA barcoding for the identification of Porgies species (family Sparidae) of commercial interest on the international market. Food Control, 50, 589-596.

Arribas, P., Andujar, C., Hopkins, K., Shepherd, M., & Vogler, A. P. (2016). Metabarcoding and mitochondrial metagenomics of endogean arthropods to unveil the mesofauna of the soil. Methods in Ecology and Evolution, 7(9), 1071-1081.

Ashfaq, M., Akhtar, S., Rafi, M. A., Mansoor, S., & Hebert, P. D. (2017). Mapping global biodiversity connections with DNA barcodes: Lepidoptera of Pakistan. PloS One, 12(3), e0174749.

Aspöck, U., Haring, E. & Aspöck, H. (2012) The phylogeny of the Neuropterida: long lasting and current controversies and challenges (Insecta: Endopterygota). Arthropod Systematics & Phylogeny, 70, 119-129.

Astrin, J. J., Höfer, H., Spelda, J., Holstein, J., Bayer, S., Hendrich, L., ... & Monje, J. C. (2016). Towards a DNA barcode reference database for spiders and harvestmen of Germany. PloS One, 11(9), e0162624.

Avelino-Capistrano, F., Nessimian, J. L., Santos-Mallet, J. R. & Takiya, D. M. (2014) DNA-based Identification and Descriptions of Immatures of Kempnyia Klapálek (Insecta: Plecoptera) from Macaé River Basin, Rio de Janeiro State, Brazil. Freshwater Science, 33, 325-337.

Bacher, S. (2012). Still not enough taxonomists: reply to Joppa et al. Trends in Ecology & Evolution, 27(2), 65-66.

Badano, D., Aspöck, U., Aspöck, H. & Cerretti, P. (2017) Phylogeny of Myrmeleontiformia based on larval morphology (Neuropterida: Neuroptera). Systematic Entomology, 42, 94-117.

Balakrishnan, S., Srinivasan, M., & Elumalai, K. (2011). A survey on diversity in Parangipettai coast, Southeast coast of Tamilnadu, India. Journal of Entomology, 8(3), 259-266.

Balke, T., & Friess, D. A. (2016). Geomorphic knowledge for mangrove restoration: a pan‐tropical categorization. Earth Surface Processes and Landforms, 41(2), 231-239.

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... & Pyshkin, A. V. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5), 455-477.

Barco, A., Raupach, M. J., Laakmann, S., Neumann, H., & Knebelsberger, T. (2016). Identification of North Sea molluscs with DNA barcoding. Molecular Ecology Resources, 16(1), 288-297.

Basset, Y., Cizek, L., Cuénoud, P., Didham, R. K., Guilhaumon, F., Missa, O., ... & Tishechkin, A. K. (2012). Arthropod diversity in a tropical forest. Science, 338(6113), 1481-1484.

Bates, D. M. (2010). lme4: Mixed-effects modeling with R.


Batista-da-Silva, J. A. (2014). Effect of lunar phases, tides, and wind speed on the abundance of Diptera Calliphoridae in a mangrove swamp. Neotropical Entomology, 43(1), 48-52.

Beckett, D. C., & Lewis, P. A. (1982). An efficient procedure for slide mounting of larval chironomids. Transactions of the American Microscopical Society, 96-99.

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., ... & Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6.

Benjamini, Y., & Speed, T. P. (2012). Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic acids research, 40(10), e72-e72.

Bensasson, D., Zhang, D. X., Hartl, D. L., & Hewitt, G. M. (2001). Mitochondrial pseudogenes: evolution's misplaced witnesses. Trends in Ecology & Evolution, 16(6), 314-321.

Benson, D. A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Ostell, J., Pruitt, K. D., & Sayers, E. W. (2017). GenBank. Nucleic Acids Research.

Berger, S. A., & Stamatakis, A. (2011). Aligning short reads to reference alignments and trees. Bioinformatics, 27(15), 2068-2075.

Beutel, R.G., (1993). Phylogenetic analysis of (Coleoptera) based on characters of the larval head. Systematic Entomology, 18, 127-147.

Beutel, R.G., Friedrich, F. and Aspöck, U., (2010). The larval head of Nevrorthidae and the phylogeny of Neuroptera (Insecta). Zoological Journal of the Linnean Society, 158, 533-562.

Bhattacharjee, M. J., & Ghosh, S. K. (2014). Design of Mini‐barcode for Catfishes for assessment of archival biodiversity. Molecular Ecology Resources, 14(3), 469-477.


Bi, K., Linderoth, T., Vanderpool, D., Good, J. M., Nielsen, R., & Moritz, C. (2013). Unlocking the vault: next‐generation museum population genomics. Molecular Ecology, 22(24), 6018-6032.

Bickford, D., Lohman, D. J., Sodhi, N. S., Ng, P. K., Meier, R., Winker, K., ... & Das, I. (2007). Cryptic species as a window on diversity and conservation. Trends in Ecology & Evolution, 22(3), 148-155.

Bik, H. M., Porazinska, D. L., Creer, S., Caporaso, J. G., Knight, R., & Thomas, W. K. (2012). Sequencing our way towards understanding global eukaryotic biodiversity. Trends in Ecology & Evolution, 27(4), 233-243.

Blasco-Costa, I., Poulin, R. & Presswell, B. (2016) Species of Apatemon Szidat, 1928 and Australapatemon Sudarikov, 1959 (Trematoda: Strigeidae) from New Zealand: linking and characterising life cycle stages with morphology and molecules. Parasitology Research, 115, 271-289.

Blaxter, M. L. (2004). The promise of a DNA taxonomy. Philosophical Transactions of the Royal Society B: Biological Sciences, 359(1444), 669-679.

Boisvert, S., Laviolette, F., & Corbeil, J. (2010). Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology, 17(11), 1519-1533.

Boisvert, S., Raymond, F., Godzaridis, É., Laviolette, F., & Corbeil, J. (2012). Ray: scalable de novo metagenome assembly and profiling. Genome Biology, 13(12), R122.

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.

Bordenstein, S. R., Paraskevopoulos, C., Dunning Hotopp, J. C., Sapountzis, P., Lo, N., Bandi, C., ... & Bourtzis, K. (2008) Parasitism and mutualism in Wolbachia: what the phylogenomic trees can and cannot say. Molecular Biology and Evolution, 26, 231-241.


Bragg, J. G., Potter, S., Bi, K., & Moritz, C. (2016). Exon capture phylogenomics: efficacy across scales of divergence. Molecular Ecology Resources, 16(5), 1059-1068.

Brandley, M. C., Bragg, J. G., Singhal, S., Chapple, D. G., Jennings, C. K., Lemmon, A. R., ... & Moritz, C. (2015). Evaluating the performance of anchored hybrid enrichment at the tips of the tree of life: a phylogenetic analysis of Australian Eugongylus group scincid lizards. BMC Evolutionary Biology, 15(1), 62.

Brehm, G., Hebert, P. D., Colwell, R. K., Adams, M. O., Bodner, F., Friedemann, K., ... & Fiedler, K. (2016). Turning up the heat on a hotspot: DNA barcodes reveal 80% more species of geometrid moths along an Andean elevational gradient. PLoS One, 11(3), e0150327.

Brook, B. W., Sodhi, N. S., & Ng, P. K. (2003). Catastrophic extinctions follow deforestation in Singapore. Nature, 424(6947), 420.

Brooks, T. M., Mittermeier, R. A., Mittermeier, C. G., Da Fonseca, G. A., Rylands, A. B., Konstant, W. R., ... & Hilton‐Taylor, C. (2002). Habitat loss and extinction in the hotspots of biodiversity. Conservation biology, 16(4), 909-923.

Burns, J. M., Janzen, D. H., Hajibabaei, M., Hallwachs, W., & Hebert, P. D. (2007). DNA barcodes of closely related (but morphologically and ecologically distinct) species of butterflies (Hesperiidae) can differ by only one to three nucleotides. Journal of the Lepidopterists Society, 61(3), 138-153.

Caley, M. J., Fisher, R., & Mengersen, K. (2014). Global species richness estimates have not converged. Trends in Ecology & Evolution, 29(4), 187-188.

Cameron, S., Rubinoff, D., & Will, K. (2006). Who will actually use DNA barcoding and what will it cost? Systematic Biology, 55(5), 844-847.


Cardoso, P., Erwin, T. L., Borges, P. A., & New, T. R. (2011). The seven impediments in invertebrate conservation and how to overcome them. Biological Conservation, 144(11), 2647-2655.

Carvalho, A. L. (2000) Descriptions of the last instar larva and some structures in the pharate male adult of Praeviogomphus proprius Belle, 1995, with notes on the occurrence and taxonomic status of the species (Anisoptera: , Octogomphinae). Odonatologica, 29, 239-246.

de Carvalho, M. R., Bockmann, F. A., Amorim, D. S., de Vivo, M., de Toledo-Piza, M., Menezes, N. A., ... & McEachran, J. D. (2005). Revisiting the taxonomic impediment. Science, 307(5708), 353-353.

Chao, A., Chazdon, R. L., Colwell, R. K., & Shen, T. J. (2005). A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology letters, 8(2), 148-159.

Chao, A., Gotelli, N. J., Hsieh, T. C., Sander, E. L., Ma, K. H., Colwell, R. K., & Ellison, A. M. (2014). Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecological Monographs, 84(1), 45-67.

Chikhi, R., & Medvedev, P. (2013). Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1), 31-37.

Chowdhury, S. (2014). Butterflies of Sundarban Biosphere Reserve, West Bengal, eastern India: a preliminary survey of their taxonomic diversity, ecology and their conservation. Journal of Threatened Taxa, 6(8), 6082-6092.

Clarke, K. R., & Gorley, R. N. (2006). Primer. Primer-E, Plymouth.

Corlett, R. T. (1988). Bukit Timah: the history and significance of a small rain-forest reserve. Environmental Conservation, 15(1), 37-44.


Corlett, R. T. (1991). Plant succession on degraded land in Singapore. Journal of Tropical Forest Science, 151-161.

Corlett, R. T. (1992). The ecological transformation of Singapore, 1819-1990. Journal of Biogeography, 411-420.

Costello, M. J., Wilson, S., & Houlding, B. (2011). Predicting total global species richness using rates of species description and estimates of taxonomic effort. Systematic Biology, 61(5), 871-883.

Costello, M. J., May, R. M., & Stork, N. E. (2013). Can we name Earth's species before they go extinct? Science, 339(6118), 413-416.

Cotterill, F. P. & Foissner, W. (2010) A pervasive denigration of natural history misconstrues how biodiversity inventories and taxonomy underpin scientific knowledge. Biodiversity and Conservation, 19, 291.

Crampton-Platt, A., Timmermans, M. J., Gimmel, M. L., Kutty, S. N., Cockerill, T. D., Vun Khen, C., & Vogler, A. P. (2015). Soup to tree: the phylogeny of beetles inferred by mitochondrial metagenomics of a Bornean rainforest sample. Molecular Biology and Evolution, 32(9), 2302-2316.

Crampton-Platt, A., Douglas, W. Y., Zhou, X., & Vogler, A. P. (2016). Mitochondrial metagenomics: letting the genes out of the bottle. GigaScience, 5(1), 15.

Curiel, J. & Morrone, J. J. (2012) Association of larvae and adults of Mexican species of Macrelmis (Coleoptera: ): a preliminary analysis using DNA sequences. Zootaxa, 3361, 56-62.

D’Cunha, P., & Nair, V. M. G. (2013). Diversity and distribution of ant fauna in Hejamadi Kodi Sandspit, Udupi District, Karnataka, India. Halteres, 4, 33-47.


Damm, S., Schierwater, B. & Hadrys, H. (2010) An integrative approach to species discovery in odonates: from character‐based DNA barcoding to ecology. Molecular Ecology, 19, 3881-3893.

Davison, G. W. H., Cai, Y., Li, T. J., & Lim, W. H. (2018). Integrated research, conservation and management of Nee Soon freshwater swamp forest, Singapore: hydrology and biodiversity. Gardens’ Bulletin Singapore, 70(Suppl 1), 1-7.

De Queiroz, K. (2007). Species concepts and species delimitation. Systematic Biology, 56(6), 879-886.

Decaëns, T., Porco, D., James, S. W., Brown, G. G., Chassany, V., Dubs, F., ... & Roy, V. (2016). DNA barcoding reveals diversity patterns of earthworm communities in remote tropical forests of French Guiana. Soil Biology and Biochemistry, 92, 171-183.

Decru, E., Moelants, T., De Gelas, K., Vreven, E., Verheyen, E., & Snoeks, J. (2016). Taxonomic challenges in freshwater fishes: a mismatch between morphology and DNA barcoding in fish of the north‐eastern part of the Congo basin. Molecular Ecology Resources, 16(1), 342-352.

Del Palacio, A. & Muzon, J. (2014) Description of the final instar larva of Limnetron antarcticum Förster and notes on its female (Anisoptera: ). Zootaxa, 3884, 89-94.

Dellicour, S., & Flot, J. F. (2015). Delimiting species-poor data sets using single molecular markers: a study of barcode gaps, haplowebs and GMYC. Systematic Biology, 64(6), 900-908.

Dellicour, S., & Flot, J. F. (2018). The hitchhiker's guide to single‐locus species delimitation. Molecular Ecology Resources, 0, 1-13.

Delsuc, F., Scally, M., Madsen, O., Stanhope, M. J., De Jong, W. W., Catzeflis, F. M., ... & Douzery, E. J. (2002). Molecular phylogeny of living xenarthrans and the impact of character and taxon sampling on the placental tree rooting. Molecular Biology and Evolution, 19(10), 1656-1671.

DeSalle, R. & Birstein, V. J. (1996) PCR identification of black caviar. Nature, 381, 197.

Dijkstra, K. D. B. & Clausnitzer, V. (2004) Critical species of Odonata in Madagascar. International Journal of Odonatology, 7, 219-228.

Dincă, V., Montagud, S., Talavera, G., Hernández-Roldán, J., Munguira, M. L., García-Barros, E., ... & Vila, R. (2015). DNA barcode reference library for Iberian butterflies enables a continental-scale preview of potential cryptic diversity. Scientific Reports, 5, 12395.

Domenico, M. D., Dijkstra, K. D. & Carchini, G. (2016) Redescription of the larva of Gynacantha cylindrata Karsch (Insecta: Odonata: Aeshnidae). Zootaxa, 4078, 78-83.

Dowton, M., & Austin, A. D. (1999). Evolutionary dynamics of a mitochondrial rearrangement "hot spot" in the Hymenoptera. Molecular Biology and Evolution, 16(2), 298-309.

Duke, N. C., Meynecke, J. O., Dittmann, S., Ellison, A. M., Anger, K., Berger, U., ... & Koedam, N. (2007). A world without mangroves? Science, 317(5834), 41-42.

Dupuis, J. R., Roe, A. D., & Sperling, F. A. (2012). Multi‐locus species delimitation in closely related animals and fungi: one marker is not enough. Molecular Ecology, 21(18), 4422-4436.

Ebach, M. C., Valdecasas, A. G., & Wheeler, Q. D. (2011). Impediments to taxonomy and users of taxonomy: accessibility and impact evaluation. Cladistics, 27(5), 550-557.

Eberhard, W. G. (1985) Sexual Selection and Animal Genitalia (Vol. 244). Cambridge, MA: Harvard University Press.


Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19), 2460-2461.

Epp, L. S., Boessenkool, S., Bellemain, E. P., Haile, J., Esposito, A., Riaz, T., ... & Stenøien, H. K. (2012). New environmental metabarcodes for analysing soil DNA: potential for studying past and present ecosystems. Molecular Ecology, 21(8), 1821-1833.

Erwin, T. L. (1982). Tropical forests: their richness in Coleoptera and other arthropod species. The Coleopterists Bulletin, 36(1), 74-75.

Esselstyn, J. A., Evans, B. J., Sedlock, J. L., Khan, F. A. A., & Heaney, L. R. (2012). Single-locus species delimitation: a test of the mixed Yule–coalescent model, with an empirical application to Philippine round- bats. Proceedings of the Royal Society of London B: Biological Sciences, rspb20120705.

Etscher, V., Miller, M. A. & Burmeister, E. G. (2006) The larva of Polythore spaeteri Burmeister & Börzsöny, with comparison to other polythorid larvae and molecular species assignment (Zygoptera: Polythoridae). Odonatologica, 35, 127-142.

Faircloth, B. C., McCormack, J. E., Crawford, N. G., Harvey, M. G., Brumfield, R. T., & Glenn, T. C. (2012). Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Systematic Biology, 61(5), 717-726.

Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1), 1-10.

Fan, J. A., Gu, H., Chen, S., Mo, B., Wen, Y., He, W., ... & Zeng, X. (2009). Species identification of 36 kinds of fruit flies based on minimalist-barcode. Chinese Journal of Applied & Environmental Biology, 2, 215-219.

Ficetola, G. F., Pansu, J., Bonin, A., Coissac, E., Giguet‐Covex, C., De Barba, M., ... & Rayé, G. (2015). Replication levels, false presences and the estimation of the presence/absence from eDNA metabarcoding data. Molecular Ecology Resources, 15(3), 543-556.

Fields, A. T., Abercrombie, D. L., Eng, R., Feldheim, K., & Chapman, D. D. (2015). A novel mini-DNA barcoding assay to identify processed fins from internationally protected shark species. PloS One, 10(2), e0114844.

Fisher, R., Knowlton, N., Brainard, R. E., & Caley, M. J. (2011). Differences among major taxa in the extent of ecological knowledge across four major ecosystems. PLoS One, 6(11), e26556.

Fleck, G., Brenk, M. & Misof, B. (2006) DNA Taxonomy and the identification of immature insect stages: the true larva of argo (Hagen 1869) (Odonata: Anisoptera: ). Annales de la Société Entomologique de France, 42, 91-98.

Folmer, O., Black, M., Hoeh, W., Lutz, R. & Vrijenhoek, R. (1994). DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Molecular Marine Biology and Biotechnology, 3(5), 294-299.

Fontaneto, D., Flot, J. F., & Tang, C. Q. (2015). Guidelines for DNA taxonomy, with a focus on the meiofauna. Marine Biodiversity, 45(3), 433-451.

Friess, D. A., & Webb, E. L. (2014). Variability in mangrove change estimates and implications for the assessment of ecosystem service provision. Global Ecology and Biogeography, 23(7), 715-725.

Fujisawa, T., & Barraclough, T. G. (2013). Delimiting species using single-locus data and the Generalized Mixed Yule Coalescent approach: a revised method and evaluation on simulated data sets. Systematic Biology, 62(5), 707-724.

Fujisawa, T., Aswad, A., & Barraclough, T. G. (2016). A rapid and scalable method for multilocus species delimitation using Bayesian model comparison and rooted triplets. Systematic Biology, 65(5), 759-771.


García-Gómez, A., Castano-Meneses, G., Vázquez-González, M. M., & Palacios-Vargas, J. G. (2014). Mesofaunal arthropod diversity in shrub mangrove litter of Cozumel Island, Quintana Roo, México. Applied Soil Ecology, 83, 44-50.

Gaston, K. J. (1991). The magnitude of global insect species richness. Conservation Biology, 5(3), 283-296.

Gaston, K. J. (1994). Causes of rarity. In Rarity (pp. 114-134). Springer, Dordrecht.

Gaston, K. J. (2000) Biodiversity: higher taxon richness. Progress in Physical Geography, 24, 117-127.

Gattolliat, J. L. & Monaghan, M. T. (2010) DNA-based association of adults and larvae in Baetidae (Ephemeroptera) with the description of a new genus Adnoptilum in Madagascar. Journal of the North American Benthological Society, 29, 1042-1057.

Geller, J., Meyer, C., Parker, M., & Hawk, H. (2013). Redesign of PCR primers for mitochondrial cytochrome c oxidase subunit I for marine invertebrates and application in all‐taxa biotic surveys. Molecular Ecology Resources, 13(5), 851-861.

Gilman, E. L., Ellison, J., Duke, N. C., & Field, C. (2008). Threats to mangroves from climate change and adaptation options: a review. Aquatic Botany, 89(2), 237-250.

Giribet, G. (2015) Morphology should not be forgotten in the era of genomics–a phylogenetic perspective. Zoologischer Anzeiger-A Journal of Comparative Zoology, 256, 96-103.

Godfray Jr, H. C. J. (2007). Linnaeus in the information age. Nature, 446(7133), 259.

van Gossum, H., Sánchez, R. & Rivera, A. C. (2003) Observations on rearing under laboratory conditions. Animal Biology, 53, 37-45.

Greene, H. W. (2005) Organisms in nature as a central focus for biology. Trends in Ecology & Evolution, 20, 23-27.


Hajibabaei, M., Smith, M. A., Janzen, D. H., Rodriguez, J. J., Whitfield, J. B., & Hebert, P. D. (2006). A minimalist barcode can identify a specimen whose DNA is degraded. Molecular Ecology Notes, 6(4), 959-964.

Hajibabaei, M., & McKenna, C. (2012). DNA mini-barcodes. In DNA barcodes (pp. 339-353). Humana Press, Totowa, NJ.

Hajibabaei, M., Spall, J. L., Shokralla, S., & van Konynenburg, S. (2012). Assessing biodiversity of a freshwater benthic macroinvertebrate community through non-destructive environmental barcoding of DNA from preservative ethanol. BMC Ecology, 12(1), 28.

Hallmann, C. A., Sorg, M., Jongejans, E., Siepel, H., Hofland, N., Schwan, H., ... & Goulson, D. (2017). More than 75 percent decline over 27 years in total flying insect biomass in protected areas. PLoS One, 12(10), e0185809.

Hamilton, A. J., Basset, Y., Benke, K. K., Grimbacher, P. S., Miller, S. E., Novotný, V., ... & Yen, J. D. (2010). Quantifying uncertainty in estimation of tropical arthropod species richness. The American Naturalist, 176(1), 90-95.

Hazra, A. K., Dey, M. K., & Mandal, G. P. (2005). Diversity and distribution of arthropod fauna in relation to mangrove vegetation on a newly emerged island on the river Hooghly, West Bengal. Records of the Zoological Survey of India, 104, 99-102.

Hebert, P. D., Cywinska, A., & Ball, S. L. (2003). Biological identifications through DNA barcodes. Proceedings of the Royal Society of London B: Biological Sciences, 270(1512), 313-321.

Hebert, P. D., & Gregory, T. R. (2005). The promise of DNA barcoding for taxonomy. Systematic Biology, 54(5), 852-859.


Hebert, P. D., Zakharov, E. V., Prosser, S. W., Sones, J. E., McKeown, J. T., Mantle, B., & La Salle, J. (2013). A DNA ‘Barcode Blitz’: Rapid digitization and sequencing of a natural history collection. PLoS One, 8(7), e68535.

Hebert, P. D., Braukmann, T. W., Prosser, S. W., Ratnasingham, S., Ivanova, N. V., Janzen, D. H., ... & Zakharov, E. V. (2017) A Sequel to Sanger: Amplicon Sequencing That Scales. bioRxiv, 191619.

Hebert, P. D., Braukmann, T. W., Prosser, S. W., Ratnasingham, S., Ivanova, N. V., Janzen, D. H., ... & Zakharov, E. V. (2018). A Sequel to Sanger: amplicon sequencing that scales. BMC genomics, 19(1), 219.

Heikkilä, M., Mutanen, M., Kekkonen, M. & Kaila, L. (2014) Morphology reinforces proposed molecular phylogenetic affinities: a revised classification for Gelechioidea (Lepidoptera). Cladistics, 30, 563-589.

Hendy, M. D., & Penny, D. (1989). A framework for the quantitative study of evolutionary trees. Systematic Zoology, 38(4), 297-309.

Henry, I. M., Nagalakshmi, U., Lieberman, M. C., Ngo, K. J., Krasileva, K. V., Vasquez-Gross, H., ... & Comai, L. (2014) Efficient genome-wide detection and cataloging of EMS-induced mutations using exome capture and next-generation sequencing. The Plant Cell, 26, 1382-1397.

Hoareau, T. B., & Boissin, E. (2010). Design of phylum‐specific hybrid primers for DNA barcoding: addressing the need for efficient COI amplification in the Echinodermata. Molecular Ecology Resources, 10(6), 960-967.

Hodges, E., Xuan, Z., Balija, V., Kramer, M., Molla, M. N., Smith, S. W., ... & McCombie, W. R. (2007). Genome-wide in situ exon capture for selective resequencing. Nature Genetics, 39(12), 1522.


Hou, G., Chen, W. T., Lu, H. S., Cheng, F., & Xie, S. G. (2018). Developing a DNA barcode library for perciform fishes in the South China Sea: Species identification, accuracy and cryptic diversity. Molecular Ecology Resources, 18(1), 137-146.

Hsieh, T. C., Ma, K. H., & Chao, A. (2016). iNEXT: an R package for rarefaction and extrapolation of species diversity (Hill numbers). Methods in Ecology and Evolution, 7(12), 1451-1456.

Huang, D., Meier, R., Todd, P. A., & Chou, L. M. (2008). Slow mitochondrial COI sequence evolution at the base of the metazoan tree and its implications for DNA barcoding. Journal of Molecular Evolution, 66(2), 167-174.

Hudson, R. R., & Coyne, J. A. (2002). Mathematical consequences of the genealogical species concept. Evolution, 56(8), 1557-1565.

Ji, Y., Ashton, L., Pedley, S. M., Edwards, D. P., Tang, Y., Nakamura, A., ... & Larsen, T. H. (2013). Reliable, verifiable and efficient monitoring of biodiversity via metabarcoding. Ecology Letters, 16(10), 1245-1257.

Jones, M., Ghoorah, A., & Blaxter, M. (2011). jMOTU and taxonerator: turning DNA barcode sequences into annotated operational taxonomic units. PLoS One, 6(4), e19259.

Joppa, L. N., Roberts, D. L., & Pimm, S. L. (2011). The population ecology and social behaviour of taxonomists. Trends in Ecology & Evolution, 26(11), 551-553.

Jousson, O., Bartoli, P. & Pawlowski, J. (1999) Molecular identification of developmental stages in Opecoelidae (Digenea). International Journal for Parasitology, 29, 1853-1858.

Kalkman, V. J., Boudot, J., Bernard, R., Conze, K., De Knijf, G., Dyatlova, E., Ferreira, S., Jović, M., Ott, J., Riservato, E. & Sahlén, G. (2010) European red list of dragonflies. Luxembourg: Publications Office of the European Union.


Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4), 772-780.

Kearse, M., Moir, R., Wilson, A., Stones-Havas, S., Cheung, M., Sturrock, S., ... & Thierer, T. (2012). Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics, 28(12), 1647-1649.

Kerr, K. C., Stoeckle, M. Y., Dove, C. J., Weigt, L. A., Francis, C. M., & Hebert, P. D. (2007). Comprehensive DNA barcode coverage of North American birds. Molecular Ecology Resources, 7(4), 535-543.

Kocher, A., Gantier, J. C., Gaborit, P., Zinger, L., Holota, H., Valiere, S., ... & Murienne, J. (2017). Vector soup: high‐throughput identification of Neotropical phlebotomine sand flies using metabarcoding. Molecular Ecology Resources, 17(2), 172-182.

Kômoto, N., Yukuhiro, K., & Tomita, S. (2012). Novel gene rearrangements in the mitochondrial genome of a webspinner, Aposthonia japonica (Insecta: Embioptera). Genome, 55(3), 222-233.

Kraus, R. T. & Secor, D. H. (2005) Application of the nursery-role hypothesis to an estuarine fish. Marine Ecology Progress Series, 291, 301-305.

Krell, F. T. (2004). Parataxonomy vs. taxonomy in biodiversity studies–pitfalls and applicability of ‘morphospecies’ sorting. Biodiversity & Conservation, 13(4), 795-812.

Kumar, N., Lin, M., Zhao, X., Ott, S., Santana-Cruz, I., Daugherty, S., ... & Hotopp, J. C. D. (2016). Efficient enrichment of bacterial mRNA from host-bacteria total RNA samples. Scientific Reports, 6, 34850.


Kutty, S. N., Bernasconi, M. V., Šifner, F. & Meier, R. (2007) Sensitivity analysis, molecular systematics and natural history evolution of Scathophagidae (Diptera: Cyclorrhapha: Calyptratae). Cladistics, 23, 64-83.

Kutty, S. N., Wong, W. H., Meusemann, K., Meier, R., & Cranston, P. S. (2018). A phylogenomic analysis of Culicomorpha (Diptera) resolves the relationships among the eight constituent families. Systematic Entomology, 43, 434-446.

Kwong, S., Srivathsan, A., Vaidya, G. & Meier, R. (2012a) Is the COI barcoding gene involved in speciation through intergenomic conflict? Molecular Phylogenetics and Evolution, 62, 1009-1012.

Kwong, S., Srivathsan, A. & Meier, R. (2012b) An update on DNA barcoding: low species coverage and numerous unidentified sequences. Cladistics, 28, 639-644.

Lai, S., Loke, L. H., Hilton, M. J., Bouma, T. J., & Todd, P. A. (2015). The effects of urbanisation on coastal habitats and the potential for ecological engineering: A Singapore case study. Ocean & Coastal Management, 103, 78-85.

Lamarre, G., Decaëns, T., Rougerie, R., Barbut, J., Dewaard, J. R., Hebert, P. D., ... & Bonifacio Martins, M. (2016). An integrative taxonomy approach unveils unknown and threatened species in Amazonian rainforest fragments. Insect Conservation and Diversity, 9(5), 475-479.

Lamb, L. (1924). A tabular account of the differences between the earlier instars of Pantala flavescens (Odonata: Libellulidae). Transactions of the American Entomological Society, 50, 289-312.

Lambers, J. H. R., Clark, J. S., & Beckage, B. (2002). Density-dependent mortality and the latitudinal gradient in species diversity. Nature, 417(6890), 732.

Larsson, A. (2014). AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics, 30(22), 3276-3278.


Lavinia, P. D., Bustos, E. O. N., Kopuchian, C., Lijtmaer, D. A., García, N. C., Hebert, P. D., & Tubaro, P. L. (2017). Barcoding the butterflies of southern South America: Species delimitation efficacy, cryptic diversity and geographic patterns of divergence. PloS One, 12(10), e0186845.

Layton, K. K., Corstorphine, E. A., & Hebert, P. D. (2016). Exploring Canadian echinoderm diversity through DNA Barcodes. PloS One, 11(11), e0166118.

Le Gall, L., & Saunders, G. W. (2010). DNA barcoding is a powerful tool to uncover

algal diversity: a case study of the Phyllophoraceae (Gigartinales, Rhodophyta) in the

Canadian flora. Journal of Phycology, 46(2), 374-389.

Lemmon, A. R., Emme, S. A., & Lemmon, E. M. (2012). Anchored hybrid enrichment

for massively high-throughput phylogenomics. Systematic Biology, 61(5), 727-744.

Lenat, D. R. & Resh, V. H. (2001) Taxonomy and stream ecology—the benefits of

genus- and species-level identifications. Journal of the North American Benthological

Society, 20, 287-298.

Lenth, R. (2018). Emmeans: Estimated marginal means, aka least-squares means. R

Package Version, 1(2).

Leray, M., Yang, J. Y., Meyer, C. P., Mills, S. C., Agudelo, N., Ranwez, V., ... &

Machida, R. J. (2013). A new versatile primer set targeting a short fragment of the

mitochondrial COI region for metabarcoding metazoan diversity: application for

characterizing coral reef fish gut contents. Frontiers in Zoology, 10(1), 34.

Li, Q., Kong, L., Yu, H., Zheng, X., Yu, R., Dai, L., ... & Feng, Y. (2016). DNA

barcoding reveal patterns of species diversity among northwestern Pacific

molluscs. Scientific Reports, 6, 33367.

Lieftinck, M. A. (1932) Notes on the larvae of two interesting Gomphidae (Odon.) from the Malay Peninsula. The Bulletin of the Raffles Museum, 7, 102-115.

Lim, G. S., Balke, M. & Meier, R. (2011) Determining species boundaries in a world full of rarity: singletons, species delimitation methods. Systematic Biology, 61, 165-169.

Lim, N. K., Tay, Y. C., Srivathsan, A., Tan, J. W., Kwik, J. T., Baloğlu, B., ... & Yeo,

D. C. (2016). Next-generation freshwater bioassessment: eDNA metabarcoding with a conserved metazoan primer reveals species-rich and reservoir-specific communities. Royal Society Open Science, 3(11), 160635.

Lin, X., Stur, E., & Ekrem, T. (2015). Exploring genetic divergence in a species-rich insect genus using 2790 DNA barcodes. PloS One, 10(9), e0138993.

Little, D. P. (2014). A DNA mini‐barcode for land plants. Molecular Ecology

Resources, 14(3), 437-446.

Liu, S., Wang, X., Xie, L., Tan, M., Li, Z., Su, X., ... & Niehuis, O. (2016).

Mitochondrial capture enriches mito‐DNA 100 fold, enabling PCR‐free mitogenomics biodiversity analysis. Molecular Ecology Resources, 16(2), 470-479.

Lok, A. F. S. L. & Orr, A. G. (2009) The biology of Euphaea impar Selys (Odonata:

Euphaeidae) in Singapore. Nature in Singapore, 2, 135-140.

Loke, L. H., Clews, E., Low, E. W., Belle, C. C., Todd, P. A., Eikaas, H. S. & Ng, P.

K. (2010) Methods for sampling benthic macroinvertebrates in tropical lentic systems.

Aquatic Biology, 10, 119-130.

Luke, S. G. (2017). Evaluating significance in linear mixed-effects models in R.

Behavior Research Methods, 49(4), 1494-1502.

Luo, A., Ling, C., Ho, S. Y., & Zhu, C. (2018). Comparison of Methods for Molecular

Species Delimitation across a Range of Speciation Scenarios. Systematic Biology, syy011.

Mamanova, L., Coffey, A. J., Scott, C. E., Kozarewa, I., Turner, E. H., Kumar, A., ...

& Turner, D. J. (2010). Target-enrichment strategies for next-generation

sequencing. Nature Methods, 7(2), 111.

Matsen, F. A., Kodner, R. B., & Armbrust, E. V. (2010). pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11(1), 538.

May, R. M. (1990). How many species? Phil. Trans. R. Soc. Lond. B, 330(1257), 293-304.

May, R. M. (2010). Tropical arthropod species, more or less? Science, 329(5987), 41-42.

Mayden, R. L., Tang, K. L., Wood, R. M., Chen, W. J., Agnew, M. K., Conway, K.

W., ... & Li, J. (2008). Inferring the Tree of Life of the order Cypriniformes, the earth's

most diverse clade of freshwater fishes: Implications of varied taxon and character

sampling. Journal of Systematics and Evolution, 46(3), 424-438.

Macnae, W. (1968). Fauna and flora of mangrove swamps. Adv. Mar. Biol. 6, 73–270.

Meier, R. (1995) Cladistic analysis of the Sepsidae (Cyclorrhapha: Diptera) based on a

comparative scanning electron microscopic study of larvae. Systematic Entomology,

20, 99-128.

Meier, R., Shiyang, K., Vaidya, G., & Ng, P. K. (2006). DNA barcoding and taxonomy

in Diptera: a tale of high intraspecific variability and low identification

success. Systematic Biology, 55(5), 715-728.

Meier, R. & Lim, G. S. (2009) Conflict, convergent evolution, and the relative importance of immature and adult characters in endopterygote phylogenetics. Annual

Review of Entomology, 54, 85-104.

Meier, R., Wong, W., Srivathsan, A., & Foo, M. (2016). $1 DNA barcodes for

reconstructing complex phenomes and finding rare species in specimen‐rich

samples. Cladistics, 32(1), 100-110.

Meier, R. (2017) Citation of taxonomic publications: the why, when, what and what not. Systematic Entomology, 42, 301-304.

Meusnier, I., Singer, G. A., Landry, J. F., Hickey, D. A., Hebert, P. D., & Hajibabaei,

M. (2008). A universal DNA mini-barcode for biodiversity analysis. BMC

Genomics, 9(1), 214.

Michat, M. C., Alarie, Y. & Miller, K. B. (2017) Higher‐level phylogeny of diving beetles (Coleoptera: Dytiscidae) based on larval characters. Systematic Entomology, 42, 734-767.

Mikanowski, J. (2017). ‘A different dimension of loss’: inside the great insect die-off. The Guardian. Available from: https://www.theguardian.com/environment/2017/dec/14/a-different-dimension-of-loss-great-insect-die-off-sixth-extinction [Accessed 27 July 2018].

Miller, L. J., Graham, G. C., Allsopp, P. G. & Yeates, D. K. (1997) Use of DNA

sequencing for species identification and phylogeography: canegrub examples

(Scarabaeidae: Melolonthini). Soil Invertebrates, 31-34.

Miller, K. B., Alarie, Y., Wolfe, G. W. & Whiting, M. F. (2005) Association of insect

life stages using DNA sequences: the larvae of Philodytes umbrinus Motschulsky

(Coleoptera: Dytiscidae). Systematic Entomology, 30, 499-509.

Müller, O., Taron, U., Jansen, A. & Schneider, T. (2012) Description of the larva of

Boyeria cretensis Peters and comparison with B. irene (Fonscolombe)(Anisoptera:

Aeshnidae). Odonatologica, 41, 47.

Min, X. J., & Hickey, D. A. (2007). Assessing the effect of varying sequence length on

DNA barcoding of fungi. Molecular Ecology Resources, 7(3), 365-373.

Mora, C., Tittensor, D. P., Adl, S., Simpson, A. G., & Worm, B. (2011). How many species are there on Earth and in the ocean? PLoS Biology, 9(8), e1001127.

Morinière, J., Hendrich, L., Balke, M., Beermann, A. J., König, T., Hess, M., ... &

Hausmann, A. (2017). A DNA barcode library for Germany's mayflies, stoneflies and caddisflies (Ephemeroptera, Plecoptera and Trichoptera). Molecular Ecology

Resources, 27, 1755-0998.

Nagelkerken, I., Blaber, S. J. M., Bouillon, S., Green, P., Haywood, M., Kirton, L. G., ...

& Somerfield, P. J. (2008). The habitat function of mangroves for terrestrial and marine fauna: a review. Aquatic Botany, 89(2), 155-185.

Ng, P. K., & Lim, K. K. (1992). The conservation status of the Nee Soon freshwater swamp forest of Singapore. Aquatic Conservation: Marine and Freshwater

Ecosystems, 2(3), 255-266.

Ng, P. K. L., & Low, J. K. Y. (1994). Status of mangroves in Singapore: Conservation beyond the year 2000. In Proceedings of the Third ASEAN–Australia Symposium on Living Coastal Resources. Chulalongkorn University, Bangkok, Thailand (pp. 229-232).

Ngiam, R. W. J. (2009) The biology and distribution of Pseudagrion rubriceps rubriceps Selys, 1876 (Odonata: Zygoptera: Coenagrionidae) in Singapore. Nature in

Singapore, 2, 209-214.

Ngiam, R. W. J. (2011) Heliogomphus cf. retroflexus Ris, 1912, (Odonata: Anisoptera:

Gomphidae), a possible new record for Singapore. Nature in Singapore, 3, 221-225.

Ngiam, R. W. J., Sun, S. W. & Sek, J. Y. (2011) An update on Heliogomphus cf.

retroflexus Ris, 1912, with notes on Microgomphus chelifer Selys, 1858 in Singapore

(Odonata: Anisoptera: Gomphidae). Nature in Singapore, 4, 95-99.

Ngiam, R. W. J. & Leong, T. M. (2012) Larva of the phytotelm-breeding damselfly,

Pericnemis stictica Selys from forests in Singapore (Odonata: Zygoptera:

Coenagrionidae). Nature in Singapore, 5, 103-115.

Ngiam, R. W. J. & Dow, R. A. (2013) The larva of Leptogomphus risi Laidlaw from

Singapore with a comparison to Leptogomphus williamsoni Laidlaw from Sarawak and

congeners (Odonata: Anisoptera: Gomphidae). Nature in Singapore, 6, 307-312.

Ngiam, R. W. J. & Cheong, L. F. (2016) The dragonflies of Singapore: An updated checklist and revision of the national conservation statuses. Nature in Singapore, 9,

149-163.

Ngoi, P. S., Tan, J. & Ngiam, R. W. J. (2011) New record of a dragonfly, Zyxomma obtusum Albarda, 1881 in Singapore (Odonata: Anisoptera: Libellulidae). Nature in

Singapore, 4, 241-244.

Novelo-Gutierrez, R., Sites, R. W. & Vitheepradit, A. (2014) New province record of

Rhinagrion for Thailand and description of the larva of R. mima (Odonata: Zygoptera:

Philosinidae). Zootaxa, 3852, 562-568.

Novotný, V. & Basset, Y. (2000) Rare species in communities of tropical insect herbivores: pondering the mystery of singletons. Oikos, 89, 564-572.

Novotny, V., Drozd, P., Miller, S. E., Kulfan, M., Janda, M., Basset, Y., & Weiblen,

G. D. (2006). Why are there so many species of herbivorous insects in tropical

rainforests? Science, 313(5790), 1115-1118.

Novotny, V., & Miller, S. E. (2014). Mapping and understanding the diversity of

insects in the tropics: past achievements and future directions. Austral

Entomology, 53(3), 259-267.

Ødegaard, F. (2000). How many species of arthropods? Erwin's estimate

revised. Biological Journal of the Linnean Society, 71(4), 583-597.

Oksanen, J., Blanchet, F. G., Kindt, R., Legendre, P., O’Hara, R. B., Simpson, G. L., ...

& Wagner, H. (2017). vegan: Community Ecology Package. R package version 2.4-3.

URL: https://CRAN.R-project.org/package=vegan

Oliveira, L. M., Knebelsberger, T., Landi, M., Soares, P., Raupach, M. J., & Costa, F.

O. (2016). Assembling and auditing a comprehensive DNA barcode reference library

for European marine fishes. Journal of Fish Biology, 89(6), 2741-2754.

Orr, A. G. (2005). Dragonflies of Peninsular Malaysia and Singapore. Natural History

Publications (Borneo), Sabah.

Orr, A. G. & Hämäläinen, M. (2003) Guide to the dragonflies of Borneo. Natural

History Publications (Borneo), Sabah.

Orr, A. G., Ngiam, R. W. & Leong, T. M. (2010) The larva of Tetracanthagyna plagiata,

with notes on its biology and comparisons with congeneric species (Odonata:

Aeshnidae). International Journal of Odonatology, 13, 153-166.

Orr, A. G. & Ngiam, R. W. (2011) A description of the larva of Heliaeschna uninervulata Martin (Odonata: Aeshnidae) from Singapore, with notes on its relationships. International Journal of Odonatology, 14, 163-169.

Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and

evolution in R language. Bioinformatics, 20(2), 289-290.

Pawluczyk, M., Weiss, J., Links, M. G., Aranguren, M. E., Wilkinson, M. D., & Egea-Cortines, M. (2015). Quantitative evaluation of bias in PCR amplification and next-generation sequencing derived from metabarcoding samples. Analytical and Bioanalytical Chemistry, 407(7), 1841-1848.

Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2012). IDBA-UD: a de novo

assembler for single-cell and metagenomic sequencing data with highly uneven

depth. Bioinformatics, 28(11), 1420-1428.

Pérez-Gutiérrez, L. A. & Montes-Fontalvo, J. M. (2011) Description of the last stadium

larvae of Argia medullaris Hagen in Selys and A. variegata Förster (Odonata:

Coenagrionidae). International Journal of Odonatology, 14, 217-222.

Phillips, M. J., Delsuc, F., & Penny, D. (2004). Genome-scale phylogeny and the

detection of systematic biases. Molecular Biology and Evolution, 21(7), 1455-1458.

Pianka, E. R. (1970). On r- and K-selection. The American Naturalist, 104, 592-597.

Pohjoismäki, J. L., Kahanpää, J., & Mutanen, M. (2016). DNA barcodes for the

northern European tachinid flies (Diptera: Tachinidae). PloS One, 11(11), e0164933.

Pons, J., Barraclough, T. G., Gomez-Zurita, J., Cardoso, A., Duran, D. P., Hazell, S., ...

& Vogler, A. P. (2006). Sequence-based species delimitation for the DNA taxonomy

of undescribed insects. Systematic Biology, 55(4), 595-609.

Pramual, P. & Wongpakam, K. (2014) Association of black fly (Diptera: Simuliidae)

life stages using DNA barcode. Journal of Asia-Pacific Entomology, 17, 549-554.

Prasad, M. & Varshney, R. K. (1995) A check-list of the Odonata of India including

data on larval studies. Oriental Insects, 29, 385-428.

Pruitt, K. D., Tatusova, T., & Maglott, D. R. (2006). NCBI reference sequences

(RefSeq): a curated non-redundant sequence database of genomes, transcripts and

proteins. Nucleic Acids Research, 35(suppl_1), D61-D65.

Puillandre, N., Lambert, A., Brouillet, S., & Achaz, G. (2012). ABGD, Automatic

Barcode Gap Discovery for primary species delimitation. Molecular Ecology, 21(8),

1864-1877.

Puniamoorthy, N., Kotrba, M., & Meier, R. (2010) Unlocking the "Black box": internal female genitalia in Sepsidae (Diptera) evolve fast and are species-specific. BMC

Evolutionary Biology, 10, 275.

R Core Team (2017). R: A language and environment for statistical computing. R

Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Rambaut, A. (2007). FigTree, a graphical viewer of phylogenetic trees. See http://tree.bio.ed.ac.uk/software/figtree.

Ratnasingham, S., & Hebert, P. D. (2007). BOLD: The Barcode of Life Data System

(http://www.barcodinglife.org). Molecular Ecology Resources, 7(3), 355-364.

Ratnasingham, S., & Hebert, P. D. (2013). A DNA-based registry for all animal species: the Barcode Index Number (BIN) system. PloS One, 8(7), e66213.

Rohde, C., Silva, D., Oliveira, G. F., Monteiro, L. S., Montes, M. A., & Garcia, A. C.

L. (2014). Richness and abundance of the Cardini group of Drosophila (Diptera,

Drosophilidae) in the Caatinga and Atlantic Forest biomes in northeastern Brazil. Anais da Academia Brasileira de Ciências, 86(4), 1711-1718.

Rubinoff, D., & Holland, B. S. (2005). Between two extremes: mitochondrial DNA is neither the panacea nor the nemesis of phylogenetic and taxonomic inference. Systematic Biology, 54(6), 952-961.

Rubinoff, D., Cameron, S., & Will, K. (2006). A genomic perspective on the shortcomings of mitochondrial DNA for “barcoding” identification. Journal of

Heredity, 97(6), 581-594.

Ruiter, D. E., Boyle, E. E. & Zhou, X. (2013) DNA barcoding facilitates associations and diagnoses for Trichoptera larvae of the Churchill (Manitoba, Canada) area. BMC Ecology, 13, 5.

Samways, M. J. (1993). Insects in biodiversity conservation: some perspectives and directives. Biodiversity & Conservation, 2(3), 258-282.

Scheffers, B. R., Joppa, L. N., Pimm, S. L., & Laurance, W. F. (2012). What we know and don’t know about Earth's missing biodiversity. Trends in Ecology &

Evolution, 27(9), 501-510.

Scherber, C., Vockenhuber, E. A., Stark, A., Meyer, H., & Tscharntke, T. (2014).

Effects of tree and herb biodiversity on Diptera, a hyperdiverse insect order. Oecologia, 174(4), 1387-1400.

Schmieder, R., & Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6), 863-864.

Schnell, I. B., Bohmann, K., & Gilbert, M. T. P. (2015). Tag jumps illuminated – reducing sequence‐to‐sample misidentifications in metabarcoding studies. Molecular

Ecology Resources, 15(6), 1289-1303.

Schuerch, M., Spencer, T., Temmerman, S., Kirwan, M. L., Wolff, C., Lincke, D., ...

& Hinkel, J. (2018). Future response of global coastal wetlands to sea-level rise. Nature, 561(7722), 231.

Seifert, K. A. (2009). Progress towards DNA barcoding of fungi. Molecular Ecology

Resources, 9(1), 83-89.

Ševčík, J., Kaspřák, D., Mantič, M., Fitzgerald, S., Ševčíková, T., Tóthová, A., &

Jaschhof, M. (2016). Molecular phylogeny of the megadiverse insect infraorder

Bibionomorpha sensu lato (Diptera). PeerJ, 4, e2563.

Shokralla, S., Porter, T. M., Gibson, J. F., Dobosz, R., Janzen, D. H., Hallwachs, W., ...

& Hajibabaei, M. (2015a). Massively parallel multiplex DNA sequencing for specimen identification using an Illumina MiSeq platform. Scientific Reports, 5, 9687.

Shokralla, S., Hellberg, R. S., Handy, S. M., King, I., & Hajibabaei, M. (2015b). A

DNA mini-barcoding system for authentication of processed fish products. Scientific

Reports, 5, 15894.

Simon, C., Frati, F., Beckenbach, A., Crespi, B., Liu, H., & Flook, P. (1994). Evolution, weighting, and phylogenetic utility of mitochondrial gene sequences and a compilation of conserved polymerase chain reaction primers. Annals of the Entomological Society of America, 87(6), 651-701.

Sims, N. R., & Anderson, M. F. (2008). Isolation of mitochondria from rat brain using

Percoll density gradient centrifugation. Nature Protocols, 3(7), 1228.

Spalding, M. D., Ruffo, S., Lacambra, C., Meliane, I., Hale, L. Z., Shepard, C. C., &

Beck, M. W. (2014). The role of ecosystems in coastal protection: Adapting to climate change and coastal hazards. Ocean & Coastal Management, 90, 50-57.

Spencer, T., Schuerch, M., Nicholls, R. J., Hinkel, J., Lincke, D., Vafeidis, A. T., ... &

Brown, S. (2016). Global coastal wetland change under sea-level rise and related stresses: The DIVA Wetland Change Model. Global and Planetary Change, 139, 15-30.

Srivathsan, A., Sha, J., Vogler, A. P., & Meier, R. (2015). Comparing the effectiveness of metagenomics and metabarcoding for diet analysis of a leaf‐feeding monkey

(Pygathrix nemaeus). Molecular Ecology Resources, 15(2), 250-261.

Srivathsan, A., Baloğlu, B., Wang, W., Tan, W. X., Bertrand, D., Ng, A. H., ... & Meier,

R. (2018). A MinION™‐based pipeline for fast and cost‐effective DNA barcoding. Molecular Ecology Resources, 0, 1-15.

Stager, C. (2018). The Silence of the Bugs. The New York Times. Available from: https://www.nytimes.com/2018/05/26/opinion/sunday/insects-bugs-naturalists-scientists.html [Accessed 27 July 2018].

Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9), 1312-1313.

Steinke, D., deWaard, J. R., Gomon, M. F., Johnson, J. W., Larson, H. K., Lucanus,

O., ... & Ward, R. D. (2017). DNA barcoding the fishes of Lizard Island (Great Barrier

Reef). Biodiversity Data Journal, (5).

Stevens, G. C. (1989). The latitudinal gradient in geographical range: how so many species coexist in the tropics. The American Naturalist, 133(2), 240-256.

Stork, N. E. (1993). How many species are there? Biodiversity & Conservation, 2(3),

215-232.

Stork, N. E., Grimbacher, P. S., Storey, R., Oberprieler, R. G., Reid, C., & Slipinski, S.

(2008). What determines whether a species of insect is described? Evidence from a study of tropical forest beetles. Insect Conservation and Diversity, 1(2), 114-119.

Stork, N. E., McBroom, J., Gely, C., & Hamilton, A. J. (2015). New approaches narrow global species estimates for beetles, insects, and terrestrial arthropods. Proceedings of the National Academy of Sciences, 112(24), 7519-7523.

Straub, S. C., Parks, M., Weitemier, K., Fishbein, M., Cronn, R. C., & Liston, A. (2012).

Navigating the tip of the genomic iceberg: Next‐generation sequencing for plant systematics. American Journal of Botany, 99(2), 349-364.

Sultana, S., Ali, M. E., Hossain, M. M., Naquiah, N., & Zaidul, I. S. M. (2018).

Universal mini COI barcode for the identification of fish species in processed products. Food Research International, 105, 19-28.

Taberlet, P., Coissac, E., Hajibabaei, M. & Rieseberg, L. H. (2012) Environmental

DNA. Molecular Ecology, 21, 1789-1793.

Talavera, G., Dincă, V., & Vila, R. (2013). Factors affecting species delimitations with the GMYC model: insights from a butterfly survey. Methods in Ecology and

Evolution, 4(12), 1101-1110.

Tamura, K., Stecher, G., Peterson, D., Filipski, A., & Kumar, S. (2013). MEGA6: molecular evolutionary genetics analysis version 6.0. Molecular Biology and

Evolution, 30(12), 2725-2729.

Tan, D. S., Ang, Y., Lim, G. S., Ismail, M. R. B., & Meier, R. (2010). From ‘cryptic species’ to integrative taxonomy: an iterative process involving DNA sequences, morphology, and behaviour leads to the resurrection of Sepsis pyrrhosoma (Sepsidae:

Diptera). Zoologica Scripta, 39(1), 51-61.

Tang, H. B., Wang, L. K. & Hämäläinen, M. (2010) A Photographic Guide to the

Dragonflies of Singapore. Published and distributed by Raffles Museum of

Biodiversity Research, Department of Biological Sciences, National University of

Singapore, Singapore.

Tang, C. Q., Humphreys, A. M., Fontaneto, D., & Barraclough, T. G. (2014). Effects of phylogenetic reconstruction method on the robustness of species delimitation using single‐locus data. Methods in Ecology and Evolution, 5(10), 1086-1094.

Tautz, D., Arctander, P., Minelli, A., Thomas, R. H., & Vogler, A. P. (2003). A plea for DNA taxonomy. Trends in Ecology & Evolution, 18(2), 70-74.

Templeton, J. E., Brotherton, P. M., Llamas, B., Soubrier, J., Haak, W., Cooper, A., &

Austin, J. J. (2013). DNA capture and next-generation sequencing can recover whole

mitochondrial genomes from highly degraded samples for human

identification. Investigative Genetics, 4(1), 26.

Tennessen, K. J. (2010). The madicolous larva of Heteropodagrion sanguinipes Selys

(Odonata: Megapodagrionidae). Zootaxa, 2531, 29-38.

Thomas, J. A., Telfer, M. G., Roy, D. B., Preston, C. D., Greenwood, J. J. D., Asher,

J., ... & Lawton, J. H. (2004). Comparative losses of British butterflies, birds, and plants

and the global extinction crisis. Science, 303(5665), 1879-1881.

Thomas, M., Raharivololoniaina, L., Glaw, F., Vences, M. & Vieites, D. R. (2005)

Montane tadpoles in Madagascar: molecular identification and description of the larval stages of Mantidactylus elegans, Mantidactylus madecassus, and Boophis laurenti from the Andringitra Massif. Copeia, 2005, 174-183.

Thormann, B., Ahrens, D., Armijos, D. M., Peters, M. K., Wagner, T., & Wägele, J.

W. (2016). Exploring the leaf beetle fauna (Coleoptera: Chrysomelidae) of an

Ecuadorian mountain forest using DNA barcoding. PloS One, 11(2), e0148268.

Townsend, T. M., Alegre, R. E., Kelley, S. T., Wiens, J. J., & Reeder, T. W. (2008).

Rapid development of multiple nuclear loci for phylogenetic analysis using genomic

resources: an example from squamate reptiles. Molecular Phylogenetics and

Evolution, 47(1), 129-142.

Trivinho-Strixino, S., Pepinelli, M., Siqueira, T. & de Oliveira Roque, F. (2012) DNA

barcoding of Podonomus (Chironomidae, Podonominae) enables stage association of a

named species and reveals hidden diversity in Brazilian inselbergs. Annales de

Limnologie-International Journal of Limnology, 48, 411-423.

Troudet, J., Grandcolas, P., Blin, A., Vignes-Lebbe, R., & Legendre, F. (2017).

Taxonomic bias in biodiversity data and societal preferences. Scientific Reports, 7(1),

9132.

Turner, I. M., Boo, C. M., Wong, Y. K., & Chew, P. T. (1996). Freshwater swamp

forest in Singapore, with particular reference to that found around the Nee Soon firing

ranges. Gardens’ Bulletin Singapore, 48(1-2), 129-157.

Vaidya, G., Lohman, D. J., & Meier, R. (2011). SequenceMatrix: concatenation

software for the fast assembly of multi‐gene datasets with character set and codon

information. Cladistics, 27(2), 171-180.

Valiela, I., Bowen, J. L., & York, J. K. (2001). Mangrove Forests: One of the World's

Threatened Major Tropical Environments. Bioscience, 51(10), 807-815.

Victor, B. C. (2007) Coryphopterus kuna, a new goby (Perciformes: Gobiidae:

Gobiinae) from the western Caribbean, with the identification of the late larval stage

and an estimate of the pelagic larval duration. Zootaxa, 1526, 51-61.

Vogler, A. P., & Monaghan, M. T. (2007). Recent advances in DNA

taxonomy. Journal of Zoological Systematics and Evolutionary Research, 45(1), 1-10.

Wang, W. Y., Srivathsan, A., Foo, M., Yamane, S. K., & Meier, R. (2018). Sorting

specimen‐rich invertebrate samples with cost‐effective NGS barcodes: Validating a

reverse workflow for specimen processing. Molecular Ecology Resources, 18(3), 490-501.

Webb, C. O., Ackerly, D. D., McPeek, M. A., & Donoghue, M. J. (2002). Phylogenies

and community ecology. Annual Review of Ecology and Systematics, 33(1), 475-505.

Weisser, W. W., & Siemann, E. (2004). The various effects of insects on ecosystem

functioning. In Insects and ecosystem function (pp. 3-24). Springer, Berlin, Heidelberg.

Wheeler, Q., & Meier, R. (Eds.). (2000). Species concepts and phylogenetic theory: a debate. Columbia University Press.

Wheeler, Q. (2014). Are reports of the death of taxonomy an exaggeration? New

Phytologist, 201(2), 370-371.

Wiegmann, B. M., Trautwein, M. D., Winkler, I. S., Barr, N. B., Kim, J. W., Lambkin,

C., ... & Wheeler, B. M. (2011). Episodic radiations in the fly tree of life. Proceedings of the National Academy of Sciences, 108(14), 5690-5695.

Wiens, J. J., & Morrill, M. C. (2011). Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Systematic Biology, 60(5),

719-731.

Wilson, E. O. & Peter, F. M. eds. (1988). Biodiversity. Washington, D.C., USA:

National Academy.

Willson, M. F. & Armesto, J. J. (2006) Is natural history really dead? Toward the rebirth of natural history. Revista Chilena de Historia Natural, 79, 279-283.

Wolf, L. L. & Waltz, E. C. (1984) Dominions and site-fixed aggressive behavior in breeding male Leucorrhinia intacta (Odonata: Libellulidae). Behavioral Ecology and

Sociobiology, 14, 107-115.

Wong, H. F., Tan, S. Y., Koh, C. Y., Siow, H. J. M., Li, T., Heyzer, A., ... & Tan, H.

T. W. (2013). Checklist of the plant species of Nee Soon swamp forest, Singapore:

Bryophytes to Angiosperms. National Parks Board and Raffles Museum of

Biodiversity Research, National University of Singapore, Singapore.

Wong, W. H., Tay, Y. C., Puniamoorthy, J., Balke, M., Cranston, P. S., & Meier, R.

(2014). ‘Direct PCR’ optimization yields a rapid, cost‐effective, nondestructive and efficient method for obtaining DNA barcodes without DNA extraction. Molecular

Ecology Resources, 14(6), 1271-1280.

Xu, Q. H. (2012) A description of the final stadium larva of Leptogomphus elegans

Lieftinck, with a discussion of taxonomic characters of the larvae of the genus

Leptogomphus Selys (Odonata: Gomphidae). International Journal of Odonatology,

15, 25-29.

Yang, Z., & Rannala, B. (2010). Bayesian species delimitation using multilocus sequence data. Proceedings of the National Academy of Sciences, 107(20), 9264-9269.

Yang, S. F., Lim, R. L. F., Sheue, C. R. & Yong, J. W. H. (2013). The current botanical status of mangrove forests in Singapore. Proceedings of Nature Society, Singapore’s

Conference on ‘Nature Conservation for a Sustainable Singapore’, 16th October 2011.

99–120.

Yang, Z. (2015). The BPP program for species tree estimation and species delimitation. Current Zoology, 61(5), 854-865.

Yang, Z., Landry, J. F., & Hebert, P. D. (2016). A DNA Barcode Library for North

American Pyraustinae (Lepidoptera: Pyraloidea: Crambidae). PloS One, 11(10), e0161449.

Yates, K. K., Rogers, C. S., Herlan, J. J., Brooks, G. R., Smiley, N. A., & Larson, R.

A. (2014). Diverse coral communities in mangrove habitats suggest a novel refuge from climate change. Biogeosciences, 11(16), 4321-4337.

Yee, A. T. K., Ang, W. F., Teo, S., Liew, S. C., & Tan, H. T. W. (2010). The present extent of mangrove forests in Singapore. Nature in Singapore, 3, 139-145.

Yeo, D., Puniamoorthy, J., Ngiam, R. W. J., & Meier, R. (2018). Towards holomorphology in entomology: rapid and cost‐effective adult–larva matching using

NGS barcodes. Systematic Entomology. doi:10.1111/syen.12296.

Yu, H. J., & You, Z. H. (2010). Comparison of DNA truncated barcodes and full-barcodes for species identification. In Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence (pp. 108-114). Springer, Berlin, Heidelberg.

Yu, D. W., Ji, Y., Emerson, B. C., Wang, X., Ye, C., Yang, C. & Ding, Z. (2012)

Biodiversity soup: metabarcoding of arthropods for rapid biodiversity assessment and

biomonitoring. Methods in Ecology and Evolution, 3, 613-623.

Zardoya, R., & Meyer, A. (1996). Phylogenetic performance of mitochondrial protein-coding genes in resolving relationships among vertebrates. Molecular Biology and Evolution, 13(7), 933-942.

Zavalloni, M., Groeneveld, R. A., & van Zwieten, P. A. (2014). The role of spatial

information in the preservation of the shrimp nursery function of mangroves: A

spatially explicit bio-economic model for the assessment of land use trade-offs. Journal of Environmental Management, 143, 17-25.

Zhang, D. X., & Hewitt, G. M. (1996). Nuclear integrations: challenges for

mitochondrial DNA markers. Trends in Ecology & Evolution, 11(6), 247-251.

Zhang, J., Kapli, P., Pavlidis, P., & Stamatakis, A. (2013a). A general species

delimitation method with applications to phylogenetic

placements. Bioinformatics, 29(22), 2869-2876.

Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. (2013b) PEAR: a fast and accurate

Illumina Paired-End reAd mergeR. Bioinformatics, 30, 614-620.

Zhou, X., Kjer, K. M. & Morse, J. C. (2007) Associating larvae and adults of Chinese

Hydropsychidae caddisflies (Insecta: Trichoptera) using DNA sequences. Journal of

the North American Benthological Society, 26, 719-742.

Zhou, X., Li, Y., Liu, S., Yang, Q., Su, X., Zhou, L., ... & Huang, Q. (2013). Ultra-deep sequencing enables high-fidelity recovery of biodiversity for bulk arthropod samples without PCR amplification. Gigascience, 2(1), 4.

Zuccon, D., Brisset, J., Corbari, L., Puillandre, N., Utge, J., & Samadi, S. (2012). An optimised protocol for barcoding museum collections of decapod crustaceans: a case-study for a 10–40-years-old collection. Invertebrate Systematics, 26(6), 592-600.

Zwickl, D. J., & Hillis, D. M. (2002). Increased taxon sampling greatly reduces phylogenetic error. Systematic Biology, 51(4), 588-598.
