An Exploration of Sufficient Sampling Effort to Describe Intraspecific DNA Barcode Haplotype Diversity: Examples from the Ray-Finned Fishes (Chordata: Actinopterygii)
DNA Barcodes 2015; Volume 3: 66–73
Research Article, Open Access
DOI 10.1515/dna-2015-0008
Received February 26, 2015; accepted June 9, 2015

Jarrett D. Phillips, Rodger A. Gwiazdowski, Daniel Ashlock, Robert Hanner*

*Corresponding author: Robert Hanner, Centre for Biodiversity Genomics, Department of Integrative Biology, University of Guelph, Ontario, N1G 2W1 Canada
Jarrett D. Phillips, Centre for Biodiversity Genomics, Department of Integrative Biology, University of Guelph, Ontario, N1G 2W1 Canada
Rodger A. Gwiazdowski, Biodiversity Institute of Ontario, University of Guelph, Guelph, Ontario, N1G 2W1 Canada

Abstract: Estimating appropriate sample sizes to measure species abundance and richness is a fundamental problem for most biodiversity research. In this study, we explore a method to measure sampling sufficiency based on haplotype diversity in the ray-finned fishes (Animalia: Chordata: Actinopterygii). To do this, we use linear regression and hypothesis testing methods on haplotype accumulation curves from DNA barcodes for 18 species of fishes, in the statistics platform R. We use a simple mathematical model to estimate sampling sufficiency from a sample-number based prediction of intraspecific haplotype diversity, given an assumption of equal haplotype frequencies. Our model finds that haplotype diversity for most of the 18 fish species remains largely unsampled, and this appears to be a result of small sample sizes. Lastly, we discuss how our overly simple model may be a useful starting point to develop future estimators for intraspecific sampling sufficiency in studies using DNA barcodes.

Keywords: Chao1 abundance estimator; DNA barcoding; haplotype accumulation curve; method of moments

1 Introduction

Most biodiversity research requires an estimate of adequate sample sizes to achieve a study’s objective [1]. Sample sizes that are sufficient to address research questions often depend on sampling methodologies and the organism being considered [2]. Adequate sample sizes involving molecular genetic measurements are directly related to a species’ genetic variation. A common measurement of intraspecific variation is mitochondrial DNA (mtDNA) haplotype diversity, which is largely affected by underlying evolutionary biological processes such as gene flow and random genetic drift. As such, sample sizes sufficient to observe within-species mtDNA haplotype variation will vary widely across taxa.

Haplotype diversity represents the prevalence of haplotypes at the population level and is analogous to the concept of heterozygosity at the locus level, except that the former pertains only to haploid data. A simple measure of haplotype diversity was first provided by [3] and is calculated as

$$h = \frac{N}{N-1}\left(1 - \sum_i p_i^2\right)$$

where $N$ is the sample size and $p_i$ represents the frequency of each haplotype in a given sample. Estimates of $h$ (which range from 0-1) are greatly affected by sampling intensity, particularly undersampling, which has been observed especially for mtDNA markers [4]. Another widely used metric of haplotype variation is the absolute number of (unique) species haplotypes (used here throughout) which is comparable in magnitude to actual specimen sample sizes.

A standardized tool for genetic biodiversity assessment is DNA barcoding [5], because this method uses easily obtainable mtDNA diversity from the 5’ cytochrome c oxidase subunit I (COI) gene to identify species.
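As an illustrative sketch only (written in Python, not taken from the authors' R workflow), the two metrics described above, the haplotype diversity statistic h and the absolute number of unique haplotypes, could be computed from a list of aligned sequences as follows. The function name `haplotype_diversity` and the toy sequences are hypothetical. The sketch assumes that identical strings represent the same haplotype, which ignores ambiguous bases and missing data.

```python
from collections import Counter


def haplotype_diversity(sequences):
    """Return (number of unique haplotypes, h) for one species.

    Assumption: sequences are aligned, equal-length strings, and two
    specimens share a haplotype only if their strings are identical.
    """
    n = len(sequences)
    if n < 2:
        raise ValueError("h is undefined for fewer than two sequences")
    counts = Counter(sequences)  # haplotype -> number of specimens
    sum_p_sq = sum((c / n) ** 2 for c in counts.values())
    # h = N / (N - 1) * (1 - sum of squared haplotype frequencies)
    h = n / (n - 1) * (1.0 - sum_p_sq)
    return len(counts), h


# Hypothetical toy data (not from BOLD): 6 specimens, 3 haplotypes
toy = ["ACGT", "ACGT", "ACGT", "ACGA", "ACGA", "TCGA"]
n_hap, h = haplotype_diversity(toy)
print(f"{n_hap} haplotypes, h = {h:.3f}")  # 3 haplotypes, h = 0.733
```

In this toy sample the haplotype frequencies are 1/2, 1/3 and 1/6, so h = (6/5)(1 - 7/18) = 11/15. A sample in which every specimen carries a different haplotype gives h = 1, and a sample with a single haplotype gives h = 0.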
Daniel Ashlock, Department of Mathematics and Statistics, University of Guelph, Guelph, Ontario, N1G 2W1 Canada
© 2015 Jarrett D. Phillips et al. licensee De Gruyter Open. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.

However, methods to describe a sample set required to observe a full range of DNA barcode haplotypes within a species have not been well developed. A general consensus for adequate sample sizes for DNA barcode studies appears to be ~5-10 specimens per species [6]; however, this range is highly variable within the Barcode of Life Data Systems (BOLD) [7], owing to both the relative difficulty and cost of sample collection and mtDNA sequencing [4]. As such, previous studies incorporating DNA barcodes across various taxonomic groups have resulted in a wide range of intraspecific sampling effort: very few specimens in the case of rare species, to upwards of 500 individuals for some species of insects within BOLD.

Here, we share a brief exploration in estimating sampling sufficiency by observing intraspecific haplotype diversity in the ray-finned fishes (Animalia: Chordata: Actinopterygii), a group that is among the largest of all vertebrates, and also has a large number of DNA barcodes available within BOLD. In the present study, we define sampling sufficiency to be the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained. We recognize that estimating a sample size necessary to observe the range of mtDNA haplotype diversity within a species involves at least three measures: sample number, genetic dispersion and geographic dispersion [8]. Because geographic dispersion is multidimensional and because spatiotemporal metadata (e.g. GPS coordinates) are lacking for many fish species within BOLD, we focus only on exploring the dynamics of estimating intraspecific sample sufficiency based on sample number and genetic dispersion (as predicted haplotype diversity). To do this, we use haplotype accumulation curves calibrated by a simple variant of the statistical method of moments, which is a method of parameter estimation based on the law of large numbers [9]. Such a method provides a useful stopping criterion for specimen sampling above which no new haplotypes are likely to be observed.

Haplotype accumulation curves provide a graphical way to assess the extent of haplotype sampling similar to the use of rarefaction curves to assess species richness [10]. Such curves depict the extent of saturation as a function of the number of specimens sampled and the number of haplotypes accumulated. Those species whose curves show rapid saturation indicate that much of the intraspecific haplotype diversity may have been sampled. Species curves showing little to no indication of asymptotic behavior suggest further sampling is necessary to document the extent of standing genetic variation present.

The issue of sampling intensity is rarely raised in relation to barcode studies, which often focus on maximizing the number of species sampled rather than exhaustively sampling any one species [6,11]. Thus, there are few prior studies exploring haplotype accumulation curves in relation to sample size estimation using DNA barcode data (e.g., fungi: [2]; butterflies: [6]; aphids: [12]). Of potential relevance to estimating sampling sufficiency for fishes is an analysis of mtDNA haplotype variation in Lake trout (Salvelinus namaycush) stocked into Lake Ontario [13]. Here, [13] found that a minimum of n ≈ 60 individuals needed to be randomly sampled in order to observe with β = 95% confidence any one individual having a particular haplotype present at a frequency of at least P = 5% according to the equation

$$n = \frac{\ln(1-\beta)}{\ln(1-P)}.$$

Estimating sample sizes necessary for describing the genetic diversity of a species is also dependent on underlying biological processes, population structure as well as lineage history. Therefore estimates based on rigorous statistical considerations alone may not be adequate.

In this paper, we develop our ideas as an R-based workflow that uses DNA barcodes of actinopterygians, identified to species and retrieved from BOLD, to estimate intraspecific sample sizes that should adequately represent haplotype variation within a species.

2 Methods

2.1 Species retrieval from BOLD

All publicly accessible sequences from Actinopterygii were first retrieved from BOLD on May 30, 2014 using the keyword ‘Actinopterygii’. Records were then searched manually for all species represented by at least 60 specimens, chosen as an a priori minimum inspired by [13]. This minimum sample size criterion was used in all subsequent steps of our pipeline to ensure quality control and integrity of selected species. A total of 12,210 specimens covering 107 species (mean: 115 specimens/species) from 16 orders, 46 families and 75 genera were found. All but three species had formal taxonomic names; the remaining were interim.

2.2 Sequence cleaning and processing

DNA barcode sequences were directly read from BOLD into R using the package ‘SPIDER’ (SPecies IDentity and Evolution in R; [14]) using the functions search.BOLD(), to find specimens, and read.BOLD() which downloads sequences found by search.BOLD(). Sequences were written to FASTA files using the function write.dna() from the R package ‘APE’ (Analysis of Phylogenetics and Evolution; [15]). FASTA files were then read into MEGA6

and GBGC. Next, sequences were aligned using MUSCLE (MUltiple Sequence Comparison by Log Expectation; [19]) with default parameters for all species and then trimmed to 652 bp. The presence of ambiguous bases was handled using the functions checkDNA() in SPIDER and base.freq() in APE. The function checkDNA()