Heredity (2012) 109, 50–56 © 2012 Macmillan Publishers Limited All rights reserved 0018-067X/12 www.nature.com/hdy

ORIGINAL ARTICLE
On the comparison of population-level estimates of haplotype and nucleotide diversity: a case study using the gene cox1 in animals

WP Goodall-Copestake, GA Tarling and EJ Murphy

Estimates of genetic diversity represent a valuable resource for biodiversity assessments and are increasingly used to guide conservation and management programs. The most commonly reported estimates of DNA sequence diversity in animal populations are haplotype diversity (h) and nucleotide diversity (p) for the mitochondrial gene cytochrome c oxidase subunit I (cox1). However, several issues relevant to the comparison of h and p within and between studies remain to be assessed. We used population-level cox1 data from peer-reviewed publications to quantify the extent to which data sets can be re-assembled, to provide a standardized summary of h and p estimates, to explore the relationship between these metrics and to assess their sensitivity to under-sampling. Only 19 out of 42 selected publications had archived data that could be unambiguously re-assembled; this comprised 127 population-level data sets (n ≥ 15) from 23 animal species. Estimates of h and p were calculated using a 456-base region of cox1 that was common to all the data sets (median h = 0.70130, median p = 0.00356). Non-linear regression methods and Bayesian information criterion analysis revealed that the most parsimonious model describing the relationship between the estimates of h and p was p = 0.0081h². Deviations from this model can be used to detect outliers due to biological processes or methodological issues. Subsampling analyses indicated that samples of n > 5 were sufficient to discriminate extremes of high from low population-level cox1 diversity, but samples of n ≥ 25 are recommended for greater accuracy.
Heredity (2012) 109, 50–56; doi:10.1038/hdy.2012.12; published online 21 March 2012

Keywords: COI; conservation; data archiving; mitochondrial DNA; sample size

INTRODUCTION
Intra-specific surveys of genetic diversity are increasingly being used to assist in the conservation and management of biodiversity. They involve the comparison of samples obtained from different locations or from a time-series at the same location, generally referred to as ‘population-level’ samples in literature. DNA sequence variation in these population-level surveys is often quantified using data from haploid loci, with the mitochondrial gene cytochrome c oxidase subunit I (cox1, COI) being the most frequently sequenced in animals. The popularity of cox1 stems from the availability of several sets of conserved PCR primers (for example, Folmer et al., 1994) and the dual purpose of cox1 as a molecular marker for both intra-specific variation and species identification (Bucklin et al., 2011).

Several metrics are used for assessing diversity in population-level cox1 sequencing surveys. These include the absolute number of haplotypes, the number of unique or ‘private’ haplotypes, the probability that two randomly chosen haplotypes are different (haplotype diversity, h; Nei, 1987) and the average number of nucleotide differences per site between two randomly chosen DNA sequences (nucleotide diversity, p; Nei and Li, 1979). Other reported metrics, such as Tajima’s D (Tajima, 1989) and Fu’s Fs (Fu, 1997), are typically used for assessing demography and/or selection rather than diversity.

By far, the most commonly reported diversity metrics in population-level cox1-sequencing studies are h and p. These metrics are useful for biodiversity assessment because they can be influenced by a variety of factors such as the proportion of asexual to sexual reproduction, the size and age of populations, the degree of connectivity between populations, the extent of introgression from related species, the underlying mutation rate and the impact of selection (for example, Boyer et al., 2007; Cárdenas et al., 2009; Haig et al., 2010; see also Bazin et al., 2006; Wares, 2010). In addition to these biological factors, population-level estimates of cox1 h and p are also affected by methodological issues. The best known of these is under-sampling. A low number of sampled individuals can artificially inflate or deflate h and p estimates (Nei and Li, 1979; Nei, 1987). Other powerful sources of bias come from the inadvertent inclusion of cryptic species (for example, Knowlton, 1986) in ‘single species’ samples and through the PCR co-amplification of nuclear-mitochondrial sequences (numts; Lopez et al., 1994) together with the target mitochondrial locus cox1. These unwanted ‘DNA contaminants’ have the potential to increase h and substantially increase p estimates. It is also possible that DNA sequencing errors may have an impact on the estimation of population-level h and p. This may be negligible in many instances but could be a problem for samples with low molecular diversities owing to a lower signal-to-noise ratio (Clark and Whittam, 1992).
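The definitions of h and p given in this Introduction can be illustrated with a short calculation. The Python sketch below assumes aligned, ungapped sequences of equal length. It uses the sample-size-corrected form of haplotype diversity, h = n/(n − 1) × (1 − Σx_i²), where x_i is the frequency of haplotype i among n sequences, and the mean number of pairwise differences per site for p. The function names are illustrative only, and the way DNASP treats gaps or missing data is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs):
    """Haplotype diversity h = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    freqs = [count / n for count in Counter(seqs).values()]
    return n / (n - 1) * (1.0 - sum(f * f for f in freqs))


def nucleotide_diversity(seqs):
    """Mean number of nucleotide differences per site over all pairs of sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    # Count mismatched sites for every unordered pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_diffs / (n_pairs * length)


if __name__ == "__main__":
    # Toy sample: three haplotypes among four 4-base sequences (assumed data).
    sample = ["ACGT", "ACGT", "ACGA", "TCGA"]
    print(haplotype_diversity(sample), nucleotide_diversity(sample))
```

Because both functions divide by quantities that depend on n, very small samples give coarse, highly variable estimates, which is the under-sampling problem discussed in the text.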

Biological Sciences Division, British Antarctic Survey, Natural Environment Research Council, Cambridge, UK
Correspondence: Dr WP Goodall-Copestake, Biological Sciences Division, British Antarctic Survey, Natural Environment Research Council, High Cross, Madingley Road, Cambridge CB3 0ET, UK. E-mail: [email protected]
Received 21 October 2010; revised 20 December 2011; accepted 22 December 2011; published online 21 March 2012
Population-level cox1 diversity WP Goodall-Copestake et al 51

Even though population-level estimates of cox1 h and p are widely used, there are a number of fundamental issues associated with the comparison of these metrics that have yet to be addressed. One of these is data set re-assembly. DNA sequence data set re-assembly is required in order to make accurate comparisons against published results and to reduce bias in meta-analyses. Re-assembled DNA sequence data sets provide these benefits because they can be edited to include only homologous positions and analyzed using exactly the same methods. Data set re-assembly depends on the use of archived sequence data; however, not all of the sequence data used in published studies is archived. Some authors lodge complete population-level samples of DNA sequences in archives (for example, GenBank; www.ncbi.nlm.nih.gov/genbank/), others submit only those sequences that differ (unique haplotypes) to archives and provide frequency/population identifier tables in the corresponding publications, whereas other authors do not archive any of the sequences generated for their publications at all. There is currently no information available on how many of the cox1 population-level data sets used in publications can be re-assembled from archived sequence accessions.

It is difficult for researchers to readily assess the wider significance of population-level cox1 h and p estimates because there is no published summary of this data. Specific comparisons between h or p from different studies are rare and context-dependent, involving species sharing taxonomic/ecological characteristics (for example, Kim et al., 2009) or population-level samples with similar levels of diversity (for example, Goodall-Copestake et al., 2010). Furthermore, similar values of h or p may be referred to as low in one publication but high in another. Quantitative and qualitative comparisons of population-level cox1 h and p would benefit from a summary of published diversity estimates that is taxonomically broad and standardized to a homologous region of the cox1 gene.

Two other issues that warrant attention are the haplotype–nucleotide diversity relationship and the impact of under-sampling on these metrics. Bird et al. (2007) described a positive relationship between cox1-based population-level estimates of h and p in a study on three species of Hawaiian limpet (Cellana spp.). A greater understanding of the nature of this relationship may provide a useful aid to the interpretation of pairs of h and p estimates. However, the haplotype–nucleotide diversity relationship has not been assessed quantitatively or over a taxonomically broader range of samples. Likewise, little attention has been paid to the impact of sample size on the estimation of population-level h and p. Low sample numbers (for example, n ≤ 5) are common in intra-specific sequencing studies due to difficulties associated with sample collection and the high cost of first generation (Sanger) DNA sequencing. Although it is generally appreciated that low sample numbers can lead to bias, it is not clear how strong this bias may be for population-level estimates of cox1 h and p.

In this study, we begin to address these fundamental issues associated with the comparative analysis of population-level cox1 diversity data. Specifically, we aim to quantify how many data sets from cox1 population-level studies can be re-assembled and to use re-assembled data sets to provide a working series of standardized h and p estimates. We also aim to model the relationship between population-level estimates of cox1 h and p and to assess the sensitivity of these metrics to low sample sizes.

MATERIALS AND METHODS
Science literature databases were queried using the terms cytochrome c oxidase subunit I, cox1 and COI. The results of this search were manually examined for publications reporting estimates of h and p for samples of n ≥ 15 individuals from a single species per locality that were derived by analyzing ≥450 bases from the 5′ region of the cox1 gene. The sample size of n ≥ 15 individuals was chosen as a compromise between the benefit of having a large comparative data set and the cost of bias introduced from small sample sizes. Criteria for species and sampling-site selection could not be refined because species and sampling-site boundaries differed between studies depending on the opinions of the corresponding authors. Only studies using overlapping fragments of cox1 DNA sequence were considered to ensure that subsequent comparative analysis could be performed using exactly the same (homologous) nucleotide sites within the gene. To maximize species coverage and fragment overlap, we focused on the 5′ end of cox1 as this is the most commonly sequenced region of the gene and defined a minimum sequence length cutoff of 450 bases.

Population-level cox1 sequence data sets from studies that matched our selective criteria were manually re-assembled using GenBank accessions and associated haplotype frequency/population identifier data within the corresponding publications. Estimates of h and p were calculated from each data set using the software DNASP v.5.10.01 (Librado and Rozas, 2009) and the results compared against those reported in the publications. Data sets that generated values of h and p which differed from the published results were excluded from subsequent analysis if these discrepancies could not be explained by the rounding of numbers or obvious (correctable) formatting errors. New diversity estimates for each of the remaining data sets were then calculated using DNASP from a homologous 456-base region of cox1 that was common to all of the sequences. This DNA region corresponds to positions 136–591 within the full-length sequence of many invertebrate cox1 genes, for example the bumblebee (Bombus ignitus) cox1 gene in the mitochondrial genome accession NC_010967.

The relationship between h and p was investigated using the newly derived estimates obtained from the re-assembled 456-base data sets. Different potential models evaluating the relationship between the two metrics were compared based on the functional forms of h and p using least-squares non-linear regression methods. Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were used to examine relative model fit (Burnham and Anderson, 2002) using the software Mathematica v.8 (Wolfram Research, Long Hanborough, UK). Deviations from the fitted model were transformed as the square root of the absolute residuals. The 75th and 95th deviation percentiles were then calculated from the transformed values.

To assess the effect of sample size on population-level estimates of cox1 h and p we used re-assembled data sets for the bumblebee Bombus ardens (Kim et al., 2009) and Antarctic krill Euphausia superba (Goodall-Copestake et al., 2010). These species were selected to represent low- and high-diversity scenarios, respectively. Data sets of n = 50 individuals for each species were assembled by randomly selecting cox1 sequences from three location-specific samples of B. ardens (Hadandong, Jeongseon, Ulleungdo) among which there was no evidence for genetic structure (Kim et al., 2009) and from a single swarm-specific net sample (number 2) of E. superba (Goodall-Copestake et al., 2010). Sample order within both of the n = 50 data sets was randomized 100 times and subsamples of 2–49 individuals were selected from each randomization. DNASP was used to calculate h and p from the 456-base region of cox1 (defined above) for the starting sample of n = 50 and for each of the randomized subsamples.

RESULTS
An extensive search of over 500 publications identified 42 studies that matched our selection criteria (estimates of both h and p, samples of n ≥ 15 individuals per location, ≥450 bases sequenced from the 5′ end of cox1). Only 19 publications provided sufficient information to allow full re-assembly of population-level data sets (Figure 1). The main impediments to data set re-assembly were a lack of either archived DNA accessions (11 publications) or a lack of frequency data for archived unique haplotype lists (14 publications). Another factor was that some of the re-assembled data sets produced results that differed from those in the corresponding publications, even after taking into account discrepancies due to the rounding of numbers. Differences in six data sets could be corrected for, because they concerned obvious formatting errors within the publications (for


Candidate data sets for re-assembly (361, 42)

Data set re-assembly not possible due to:
Absent or incomplete haplotype frequency data (148, 14): 41%
Absent or incomplete DNA sequence archives (70, 11): 19%

Data set re-assembled from:
DNA archives proportional to haplotype frequency (65, 7): 18%
DNA haplotype archives and published frequency data (78, 13): 22%

Re-assembled data sets generate results that:
Differ from published results due to ambiguous reasons (16, 6): 11%
Differ from published results due to formatting errors (6, 3): 4%
Agree with published results (121, 19): 85%

Unambiguously re-assembled data sets (127, 19)

Figure 1 Functional breakdown of the data acquisition procedure. The upper pie chart represents the number of candidate population-level data sets found in publications that matched our search criteria and the lower pie chart the number of data sets re-assembled. Numbers in brackets after the data set descriptions follow: number of population-level data sets, number of corresponding publications.
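For archives that hold only unique haplotypes, re-assembly amounts to expanding each haplotype by its published frequency in each population. A minimal Python sketch of that step follows, together with a helper that trims an ungapped sequence to the 456-base region (positions 136–591 relative to accession NC_010967). The function names, the input layout and the assumption of indel-free sequences with known start coordinates are hypothetical; the re-assembly reported in this study was carried out manually.

```python
def reassemble(haplotypes, frequencies):
    """Expand unique haplotypes into one sequence list per population.

    haplotypes:  {haplotype_id: sequence}            (archived unique haplotypes)
    frequencies: {population: {haplotype_id: count}} (published frequency table)
    """
    datasets = {}
    for population, counts in frequencies.items():
        seqs = []
        for hap_id, count in counts.items():
            if hap_id not in haplotypes:
                # Mirrors the "absent or incomplete archive" failure in Figure 1.
                raise KeyError(f"no archived sequence for haplotype {hap_id!r}")
            seqs.extend([haplotypes[hap_id]] * count)
        datasets[population] = seqs
    return datasets


def trim_to_region(seq, seq_start, region_start=136, region_end=591):
    """Cut an ungapped sequence to reference positions region_start..region_end.

    Positions are 1-based and inclusive; seq_start is the reference position
    of the first base of seq (assumed known, as in the Region column of Table 1).
    """
    seq_end = seq_start + len(seq) - 1
    if seq_start > region_start or seq_end < region_end:
        raise ValueError("sequence does not span the target region")
    return seq[region_start - seq_start : region_end - seq_start + 1]


if __name__ == "__main__":
    # Toy inputs (assumed data): two haplotypes shared across two populations.
    haps = {"H1": "AAAA", "H2": "AAAT"}
    freqs = {"popA": {"H1": 3, "H2": 1}, "popB": {"H2": 2}}
    print(reassemble(haps, freqs))
```

Raising an error for a haplotype that has no archived sequence, rather than skipping it, keeps an incomplete data set from silently yielding biased estimates of h and p.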

Table 1 Summary of the population-level samples analysed and corresponding diversities calculated from a homologous 456-base region of cox1

Group | Species | n1 | n2 | Region | h Range | Mean h | Median h | p Range | Mean p | Median p | Publication
Arachnida | Rakaia denticulata | 1 | 17 | 72–856 | 0.82353 | NA | NA | 0.01993 | NA | NA | Boyer et al. (2007)
Chordata | Bradypus torquatus | 2 | 33 | 43–673 | 0.00000–0.00000 | 0.00000 | 0.00000 | 0.00000–0.00000 | 0.00000 | 0.00000 | Lara-Ruiz et al. (2008)
Chordata | Microcosmus squamiger | 11 | 237 | 28–601 | 0.51383–0.87446 | 0.65726 | 0.64069 | 0.00243–0.00825 | 0.00474 | 0.00432 | Rius et al. (2008)
Crustacea | Chorismus antarcticus | 5 | 166 | 43–699 | 0.55749–0.63636 | 0.58419 | 0.57500 | 0.00145–0.00210 | 0.00175 | 0.00169 | Raupach et al. (2010)
Crustacea | Euphausia superba | 9 | 504 | 103–1275 | 0.94026–0.98506 | 0.96205 | 0.96039 | 0.00971–0.01151 | 0.01053 | 0.01044 | Goodall-Copestake et al. (2010)
Crustacea | Halocaridina rubra | 3 | 70 | 52–681 | 0.88333–0.92641 | 0.90768 | 0.91331 | 0.00395–0.00476 | 0.00447 | 0.00470 | Santos (2006)
Crustacea | Hemimysis margalefi | 2 | 36 | 42–699 | 0.31373–0.74510 | 0.52942 | 0.52942 | 0.00073–0.00292 | 0.00183 | 0.00183 | Lejeusne and Chevaldonné (2006)
Crustacea | Melicertus kerathurus | 5 | 93 | 100–593 | 0.42500–0.85000 | 0.65304 | 0.71429 | 0.00099–0.00395 | 0.00266 | 0.00272 | Pellerito et al. (2009)
Crustacea | Nematocarcinus lanceopes | 3 | 127 | 43–699 | 0.87326–0.89524 | 0.88791 | 0.89524 | 0.00619–0.00752 | 0.00708 | 0.00752 | Raupach et al. (2010)
Crustacea | Palinurus elephas | 11 | 227 | 134–632 | 0.36957–0.84211 | 0.57950 | 0.56209 | 0.00087–0.00329 | 0.00173 | 0.00159 | Palero et al. (2008)
Crustacea | Phycomenes zostericola | 10 | 205 | 123–712 | 0.00000–0.96078 | 0.77998 | 0.86640 | 0.00000–0.00981 | 0.00665 | 0.00672 | Haig et al. (2010)
Hexapoda | Bombus ardens | 3 | 54 | 42–699 | 0.00000–0.20468 | 0.10156 | 0.10000 | 0.00000–0.00046 | 0.00023 | 0.00022 | Kim et al. (2009)
Hexapoda | Papilio xuthus | 1 | 18 | 42–699 | 0.66013 | NA | NA | 0.00175 | NA | NA | Jeong et al. (2009)
Hexapoda | Pieris rapae | 2 | 42 | 42–699 | 0.12500–0.73846 | 0.43173 | 0.43173 | 0.00055–0.00366 | 0.00211 | 0.00211 | Jeong et al. (2009)
Mollusca | Anomalocardia brasiliana | 4 | 123 | 82–661 | 0.44598–0.83678 | 0.65866 | 0.67595 | 0.00191–0.00744 | 0.00462 | 0.00457 | Arruda et al. (2009)
Mollusca | Cellana exarata | 6 | 139 | 88–699 | 0.00000–0.95789 | 0.66356 | 0.86410 | 0.00000–0.00445 | 0.00299 | 0.00398 | Bird et al. (2007)
Mollusca | Cellana sandwicensis | 5 | 101 | 88–699 | 0.89610–0.95789 | 0.92705 | 0.92105 | 0.00382–0.00831 | 0.00541 | 0.00489 | Bird et al. (2007)
Mollusca | Cellana talcosa | 4 | 87 | 88–699 | 0.44211–0.96842 | 0.79796 | 0.89065 | 0.00107–0.00647 | 0.00426 | 0.00476 | Bird et al. (2007)
Mollusca | Cerithidea djadjariensis | 5 | 100 | 61–1077 | 0.00000–0.52105 | 0.26631 | 0.35263 | 0.00000–0.00114 | 0.00064 | 0.00083 | Kojima et al. (2008)
Mollusca | Concholepas concholepas | 14 | 337 | 42–699 | 0.68571–0.89667 | 0.81890 | 0.82676 | 0.00276–0.00532 | 0.00397 | 0.00399 | Cárdenas et al. (2009)
Mollusca | Mya arenaria | 10 | 192 | 42–699 | 0.00000–0.47619 | 0.25587 | 0.26805 | 0.00000–0.00117 | 0.00062 | 0.00063 | Strasser and Barber (2009)
Mollusca | Patella rustica | 5 | 75 | 58–707 | 0.56190–0.72381 | 0.65143 | 0.63810 | 0.00163–0.00217 | 0.00182 | 0.00167 | Ribeiro et al. (2010)
Porifera | Chondrilla nucula | 6 | 93 | 76–659 | 0.00000–0.63810 | 0.35034 | 0.38852 | 0.00000–0.00727 | 0.00313 | 0.00221 | Duran and Rützler (2006)
NA | All samples | 127 | 3076 | NA | 0.00000–0.98506 | 0.63388 | 0.70130 | 0.00000–0.01993 | 0.00388 | 0.00356 | NA

Abbreviations: h, haplotype diversity; n1, number of unambiguously reconstructed data sets; n2, total number of individuals; NA, not applicable; p, nucleotide diversity; Region, original DNA region sequenced relative to cox1 in accession NC_010967. Note: Mya arenaria contained a three base insertion not found in the other species.
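Paired values from Table 1 can be checked against the fitted model p = 0.0081h² reported in this article. The Python sketch below computes the predicted nucleotide diversity, the square-root-transformed deviation D, and a rough category based on the reported 75th and 95th percentile offsets (±0.0021 and ±0.0036). It assumes that those offsets can be compared directly with the absolute, untransformed residual; the function names and category labels are illustrative only.

```python
import math

MODEL_COEFFICIENT = 0.0081  # fitted model p = 0.0081 * h**2
BOUNDARY_75 = 0.0021        # reported 75th percentile offset (assumed to be in units of p)
BOUNDARY_95 = 0.0036        # reported 95th percentile offset (assumed to be in units of p)


def predicted_p(h_obs):
    """Nucleotide diversity predicted from an observed haplotype diversity."""
    return MODEL_COEFFICIENT * h_obs ** 2


def deviation(h_obs, p_obs):
    """Square root of the absolute residual between observed and predicted p."""
    return math.sqrt(abs(p_obs - predicted_p(h_obs)))


def classify(h_obs, p_obs):
    """Place a sample relative to the 75th and 95th percentile boundaries."""
    residual = abs(p_obs - predicted_p(h_obs))
    if residual > BOUNDARY_95:
        return "beyond 95th percentile"
    if residual > BOUNDARY_75:
        return "between 75th and 95th percentiles"
    return "within 75th percentile"


if __name__ == "__main__":
    # Single-sample species from Table 1, so the tabulated h and p are exact pairs.
    for name, h, p in [("Rakaia denticulata", 0.82353, 0.01993),
                       ("Papilio xuthus", 0.66013, 0.00175)]:
        print(name, round(deviation(h, p), 4), classify(h, p))
```

Only species with a single population-level sample are used in the example, because the ranges, means and medians tabulated for the other species do not identify which h belongs with which p.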

example, the misplacement of a decimal place within a results table); however, 16 of the differences between re-assembled data sets and published results could not be resolved (Figure 1).

Data from a total of 127 population-level samples (locations) and 23 species were unambiguously re-assembled, with representatives of Crustacea and Mollusca dominating (Table 1). The number of


[Figure 2: haplotype diversity (0–1.00) and nucleotide diversity (0–0.020) plotted by species]
Figure 2 Variance in estimates of h and p among population-level samples within species (n = 23) calculated from a homologous 456-base region of the cox1 gene. Median values for the total sample set (n = 127) are shown as dashed vertical lines. Species order follows the taxonomic grouping in Table 1.

population-level samples per species ranged from 1 to 14 (mean = 6, median = 5) and the number of individuals per sample ranged from the minimum cutoff of 15 to 91 (mean = 24, median = 20). DNA sequence lengths ranged from 494 to 1173 bases (mean = 666, median = 650). No stop codons were inferred in amino acid translations. Reducing the original variable-length cox1 sequence data sets to a homologous 456-base region did not cause any of the population-level samples with h > 0.00000, p > 0.00000 to become h = 0.00000, p = 0.00000. Estimates of h and p derived from the 456-base sequences differed from estimates from the variable-length sequences on average by a factor of 0.08808 and a factor of 0.16373, respectively.

Haplotype diversity for the 456-base data sets ranged from 0.00000 to 0.98506 with a mean h = 0.63388 that was slightly smaller than the median (h = 0.70130). Nucleotide diversity ranged from 0.00000 to 0.01993 with a mean p = 0.00388 that was slightly larger than the median (p = 0.00356). Values for both of these cox1 diversity metrics encompassed a near-continuous range with the exception of one extreme p outlier corresponding to a single sample of the mite harvestman Rakaia denticulata (Table 1, Figures 2 and 3). There was a considerable amount of variation in both h and p estimates among population-level samples within species relative to the variance of h and p within the total data set (Table 1, Figure 2). Population-level samples of the shrimp Phycomenes zostericola (n = 10) exhibited the greatest within species variation, with estimates of h covering 98% of the total data set range and p covering 49% of the total range. Only a few species had population-level samples with similar levels of diversity, for example E. superba for which the variance in both h and p of population-level samples (n = 9) was <10% of the total data set range.

The relationship between the population-level estimates of h and p was positive and non-linear (Figure 3). Model selection using AIC criteria did not reveal strong statistical differences between a simple square p = bh² model, an exponential model p = ae^(bh) and more complex forms involving more parameters and terms (p = a + bh²; p = a + gh + bh²). BIC criteria identified the simple square model (p = bh²) as the most parsimonious. The fit of this was p = 0.0081h², adjusted R² = 0.823, P ≪ 0.0001. Deviation boundaries from the fitted model, calculated using the transformed residuals, were 0.0081h² ± 0.0021 for the 75th percentile boundaries and 0.0081h² ± 0.0036 for the 95th percentile boundaries. The largest deviations from the fitted model were due to samples with a higher than expected p for a given h (Figure 3; see also Table 1, Figure 2). The outlier sample of R. denticulata occurred well beyond the upper 95th percentile, whereas some samples of the chicken-liver sponge Chondrilla nucula, solitary ascidian Microcosmus squamiger and E. superba exhibited less extreme deviations just beyond the 95th boundary. One sample from the species P. zostericola occurred on the 95th percentile boundary.

[Figure 3: nucleotide diversity (0–0.020) plotted against haplotype diversity (0–1.00)]
Figure 3 Nucleotide–haplotype diversity relationship based on population-level estimates (n = 127) derived from a homologous 456-base region of cox1. Fitted model p = 0.0081h² shown as a thick dark line within two thin lines that represent 95% confidence intervals. The 75th and 95th percentiles of the square root transformed residual data are shown as dashed and dotted lines, respectively. Population-level estimates with the largest deviations are numbered to identify the species: 1, C. nucula; 2, E. superba; 3, M. squamiger; 4, P. zostericola; 5, R. denticulata.

Sample sizes of n ≤ 5 individuals produced overlapping estimates of h and p between the ‘low-diversity’ B. ardens and the ‘high-diversity’ E. superba population-level data sets (Figure 4). Samples of n = 23 individuals from B. ardens and n = 15 individuals from E. superba were required before the variance between subsamples was reduced to <0.25 of the total subsampling variation found in these species. For h, variance was larger among subsamples of B. ardens than among subsamples of E. superba. In contrast, for p the variance among subsamples was larger for E. superba than for B. ardens.

[Figure 4: haplotype diversity (0–1.000) and nucleotide diversity (0–0.020) plotted against sample number (0–50)]
Figure 4 Sample size–cox1 diversity estimate relationships for h and p under low-diversity (Bombus ardens; black cross series) and high-diversity (Euphausia superba; gray plus series) scenarios. Each series of estimates from n = 2 to n = 49 comprises 100 random subsamples taken from the total sample size of 50. High–low bars indicate s.d. output from DNASP for n = 50.

DISCUSSION
Data archiving and data set re-assembly
More than half of the population-level cox1 sequence data sets from peer-reviewed publications matching our selection criteria could not be re-assembled (Figure 1). Such a high proportion of unavailable raw data impedes what can be achieved through re-analysis, which is particularly important for commonly sequenced genes like cox1, and highlights the importance of current efforts by the research community (data generators, data managers and publishers) to promote data archiving (Whitlock et al., 2010). Of those data sets that were re-assembled, 15% generated results that differed from results presented in the corresponding publications (Figure 1). This level of incompatibility is troubling. It implies that the research community also needs to promote measures to verify the credibility of published diversity estimates, archived data and references to this data.

The reduction in total sample size from 361 to 127 during data acquisition (Figure 1) begs the question of whether it is worthwhile re-assembling DNA sequence data sets for comparative analysis. To investigate whether this procedure improved our analysis, we re-modeled the relationship between h and p using our initial estimates of these metrics that were derived from the original variable-length cox1 sequences (n = 127). Although the model and fit were similar (p = 0.0074h², adjusted R² = 0.818, P ≪ 0.0001), the position of notable samples from the species C. nucula and M. squamiger relative to the deviation percentile boundaries changed. These outliers occurred above the upper 95th percentile in analysis using the homologous 456-base sequences (Figure 3), but they occurred below the re-derived upper 95th percentile when using the original variable-length cox1 sequences (data not shown). This suggests that data set re-assembly and the derivation of new diversity estimates using a homologous region of DNA had increased our ability to detect subtle outliers. We recommend this approach for the comparative analysis of population-level cox1 h and p to remove the impact of variable sequence length-based error. This method also has the benefit of reducing error that may arise through the direct analysis of inaccurately published diversity estimates.

A pitfall of the manual data set re-assembly adopted herein is that it is laborious. This process could potentially be automated by adapting the methods used in meta-analyses by Bazin et al. (2006) or Wares (2010) in which population-level cox1 sequences were collated into species-level data sets. Algorithms would need to be developed to account for archived lists of unique haplotypes and to group DNA sequences according to population identifiers. It is crucial to deal with unique haplotype lists because a failure to do so would generate flawed data sets that yield biased estimates of h and p. For this type of automated approach to work, it may also be necessary to apply complementary changes to data archives and future archiving procedures.

Haplotype and nucleotide diversity variation
Intra-specific variation for estimates of h and p was generally large compared with the variance of the total data set (Table 1, Figure 2). This finding reflects the dominant role of ecological and population-level processes in determining levels of diversity (for example, P. zostericola; Haig et al., 2010) in addition to underlying factors such as mutation rate (Boyer et al., 2007). However, a substantial component of the variance may also be due to error from under-sampling


(see below), the inadvertent inclusion of cryptic species or possibly contaminating numt DNA/sequencing errors. The large variance among population-level samples described here implies that, even when intra-specific variance is low (for example, E. superba; Figure 2), it is important to sample species distributions thoroughly and carefully before drawing conclusions about the level of h and p for species as a whole.

In order to improve the clarity and consistency with which population-level estimates of h and p are reported in the future, we advocate comparing them against the values in Table 1. In addition to quantitative comparisons against specific values, median values derived from the complete data set can be used as a cutoff for qualitative descriptions of low and high diversity. For example, in the study by Raupach et al. (2010) on benthic shrimp species, estimates of h and p for samples of Chorismus antarcticus fall below the median value and thus could be categorized as low diversity, whereas samples of Nematocarcinus lanceopes fall above the median and thus could be termed high diversity (Table 1, Figure 2).

The relationship between h and p
The positive relationship we described between population-level estimates of cox1 h and p (Figure 3) has been noted previously by Bird et al. (2007). However, to our knowledge, this is the first time that a model describing the relationship between the two metrics has been reported. The BIC criteria identified the simple square model (p = bh²) as the most parsimonious. This relationship emerges because p is based on the pairwise comparison of nucleotide differences within distinct haplotypes and is dominated by the square term (Nei and Li, 1979; Nei, 1987). We propose that the deviation of individual samples from this relationship can be used to identify outliers that might not be evident through the assessment of h or p alone. The extent of deviation from the model can be assessed by comparison with the 75th and 95th percentile deviation boundaries (Figure 3). To assess the deviation of new diversity estimates, h and p should be derived from the 456-base region of cox1 defined in the Materials and methods. A predicted nucleotide diversity (p_p) can be calculated from the observed haplotype diversity (h_o) using the relationship p_p = 0.0081h_o². A value for comparison against the percentile curves (75 and 95%) for the transformed residuals can then be calculated from the observed nucleotide diversity (p_o) using D = √|p_o − p_p|. The greater the deviation of a sample from the model’s expectations, the greater the need for additional investigation, as this may reveal notable biological processes or the impact of methodological error.

The extreme deviation of the R. denticulata sample well beyond the upper 95th percentile in Figure 3 could reflect a range of processes according to Boyer et al. (2007), including secondary contact between divergent haplotypes, high mutation rates and the inadvertent sampling of cryptic species. The more subtle deviations found in samples of C. nucula and M. squamiger may represent secondary contact between divergent lineages (Duran and Rützler, 2006; Rius et al., 2008), whereas that shown by E. superba probably reflects the exceptional abundance and mixing of swarms of this species (Goodall-Copestake et al., 2010). An explanation for the slight deviation of the P. zostericola sample is not obvious on the basis of the corresponding publication (Haig et al., 2010). It may represent a biological process or possibly sampling error given the number of individuals involved (n = 15) and the potential for error associated with a sample of this size (see Figure 4).

The effect of sample size on haplotype and nucleotide diversities
Subsampling analysis indicated that samples of n ≤ 5 individuals, which are not uncommon in published studies, could generate highly variable estimates that hinder accurate h and p comparisons (Figure 4). Higher sample sizes (n > 5) were shown to be sufficient to discriminate low-diversity populations such as in B. ardens (Kim et al., 2009) from high-diversity populations like those in E. superba (Goodall-Copestake et al., 2010). The impact of under-sampling on estimates of h and p was context-dependent. Estimates of h were more sensitive to under-sampling bias if the focal population harbored limited cox1 variation (B. ardens), because a large number of individuals must be sampled from the population in order to register the presence of low-frequency alleles. By contrast, p was more biased by low sample numbers if the focal population was highly variable at cox1 (E. superba) as divergent alleles in such populations have a large impact on estimates derived from small sample sizes.

Results from the B. ardens and E. superba analyses showed that, with a sample size of n ≥ 23 individuals, the total subsampling variance was reduced by at least a factor of four. At this arbitrary level, multiple ‘broad’ categories of h and p can be distinguished, although clearly much larger sample sizes are required to achieve a greater resolution (for example, to recover non-zero values of h in B. ardens). As a rule-of-thumb we suggest using sample sizes of n ≥ 25 individuals in future comparisons of population-level cox1 diversity. This number is achievable for many species and represents a practical compromise between the current expense of data generation and the obvious benefit of high n numbers.

Significance beyond cox1
The present study highlights a number of issues concerning the archiving and comparative analysis of population-level cox1 sequence data, on the basis of which we make a number of recommendations. In addition, we propose a general method to assess paired haplotype–nucleotide diversity estimates. This is not only relevant to cox1 data but also applicable to other sources of haploid (chloroplast, mitochondrial, sex-chromosome) DNA sequence data used to investigate intra-specific diversity in both animals and plants. In particular, we recommend that a similar approach is taken for analyses of mitochondrial control region (D-loop) data, which is widely used for population-level studies on mammals.

DATA ARCHIVING
There were no data to deposit.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

ACKNOWLEDGEMENTS
We thank the scientists responsible for generating the DNA sequence accessions used in this study, authors responding to queries about their data, Sílvia Pérez-Espona for fruitful discussions and three anonymous referees for their comments on the manuscript. We carried out this work as part of the Ecosystems group within British Antarctic Survey Polar Science for Planet Earth Programme funded by the Natural Environment Research Council.

Arruda CCB, Beasley CR, Vallinoto M, Marques-Silva NS, Tagliaro CH (2009). Significant genetic differentiation among populations of Anomalocardia brasiliana (Gmelin, 1791): A bivalve with planktonic larval dispersion. Genet Molec Biol 32: 423–430.
Bazin E, Glémin S, Galtier N (2006). Population size does not influence mitochondrial genetic diversity in animals. Science 312: 570–572.
Bucklin A, Steinke D, Blanco-Bercial L (2011). DNA Barcoding of Marine Metazoa. Annu Rev Mar Sci 3: 471–508.
Burnham KP, Anderson DR (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer: New York.


Bird CE, Holland BS, Bowen BW, Toonen RJ (2007). Contrasting phylogeography in three endemic Hawaiian limpets (Cellana spp.) with similar life histories. Mol Ecol 16: 3173–3186.
Boyer SL, Baker JM, Giribet G (2007). Deep genetic divergences in Aoraki denticulata (Arachnida, Opiliones, Cyphophthalmi): a widespread 'mite harvestman' defies DNA taxonomy. Mol Ecol 16: 4999–5016.
Cárdenas L, Castilla JC, Viard F (2009). A phylogeographical analysis across three biogeographical provinces of the south-eastern Pacific: the case of the marine gastropod Concholepas concholepas. J Biogeogr 36: 969–981.
Clark AG, Whittam TS (1992). Sequencing errors and molecular evolutionary analysis. Mol Biol Evol 9: 744–752.
Duran S, Rützler K (2006). Ecological speciation in a Caribbean marine sponge. Mol Phylogenet Evol 40: 292–297.
Folmer O, Black M, Hoeh W, Lutz R, Vrijenhoek R (1994). DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol Mar Biol Biotechnol 3: 294–297.
Fu YX (1997). Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147: 915–925.
Goodall-Copestake WP, Pérez-Espona S, Clark MS, Murphy EJ, Seear PJ, Tarling GA (2010). Swarms of diversity at the gene cox1 in Antarctic krill. Heredity 104: 513–518.
Haig JA, Connolly RM, Hughes JM (2010). Little shrimp left on the shelf: the roles that sea-level change, ocean currents and continental shelf width play in the genetic connectivity of a seagrass-associated species. J Biogeogr 37: 1570–1583.
Jeong HC, Kim JA, Im HH, Jeong HU, Hong MY, Lee JE et al. (2009). Mitochondrial DNA Sequence Variation of the Swallowtail Butterfly, Papilio xuthus, and the Cabbage Butterfly, Pieris rapae. Biochem Genet 47: 165–178.
Kim MJ, Yoon HJ, Im HH, Jeong HU, Kim MI, Kim SR et al. (2009). Mitochondrial DNA sequence variation of the bumblebee, Bombus ardens (Hymenoptera: Apidae). J Asia Pacific Entomol 12: 133–139.
Knowlton N (1986). Cryptic and sibling species among the decapod Crustacea. J Crustac Biol 6: 356–363.
Kojima S, Ozeki S, Iijima A, Okoshi K, Suzuki T, Hayashi I et al. (2008). Genetic characteristics of three recently discovered populations of the tideland snail Cerithidea djadjariensis (Martin) (Mollusca, Gastropoda) from the Pacific coast of the eastern Japan. Plankton Benthos Res 3: 96–100.
Lara-Ruiz P, Chiarello AG, Santos FR (2008). Extreme population divergence and conservation implications for the rare endangered Atlantic Forest sloth, Bradypus torquatus (Pilosa: Bradypodidae). Biol Cons 141: 1332–1342.
Lejeusne C, Chevaldonné P (2006). Brooding crustaceans in a highly fragmented habitat: the genetic structure of Mediterranean marine cave-dwelling mysid populations. Mol Ecol 15: 4123–4140.
Librado P, Rozas J (2009). DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics 25: 1451–1452.
Lopez JV, Yuhki N, Masuda R, Modi W, O'Brien SJ (1994). Numt, a recent transfer and tandem amplification of mitochondrial DNA to the nuclear genome of the domestic cat. J Mol Evol 39: 174–190.
Nei M (1987). Molecular Evolutionary Genetics. Columbia University Press: New York.
Nei M, Li WH (1979). Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci USA 76: 5269–5273.
Palero F, Abelló P, Macpherson E, Gristina M, Pascual M (2008). Phylogeography of the European spiny lobster (Palinurus elephas): influence of current oceanographical features and historical processes. Mol Phylogenet Evol 48: 708–717.
Pellerito R, Arculeo M, Bonhomme F (2009). Recent expansion of Northeast Atlantic and Mediterranean populations of Melicertus (Penaeus) kerathurus (Crustacea: Decapoda). Fish Sci 75: 1089–1095.
Raupach MJ, Thatje S, Dambach J, Rehm P, Misof B, Leese F (2010). Genetic homogeneity and circum-Antarctic distribution of two benthic shrimp species of the Southern Ocean, Chorismus antarcticus and Nematocarcinus lanceopes. Mar Biol 157: 1783–1797.
Ribeiro PA, Branco M, Hawkins SJ, Santos AM (2010). Recent changes in the distribution of a marine gastropod, Patella rustica, across the Iberian Atlantic coast did not result in diminished genetic diversity or increased connectivity. J Biogeogr 37: 1782–1796.
Rius M, Pascual M, Turon X (2008). Phylogeography of the widespread marine invader Microcosmus squamiger (Ascidiacea) reveals high genetic diversity of introduced populations and non-independent colonizations. Diversity Distrib 14: 818–828.
Santos SR (2006). Patterns of genetic connectivity among anchialine habitats: a case study of the endemic Hawaiian shrimp Halocaridina rubra on the island of Hawaii. Mol Ecol 15: 2699–2718.
Strasser CA, Barber PH (2009). Limited genetic variation and structure in softshell clams (Mya arenaria) across their native and introduced range. Conserv Genet 10: 803–814.
Tajima F (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595.
Wares JP (2010). Natural distributions of mitochondrial sequence diversity support new null hypotheses. Evolution 64: 1136–1142.
Whitlock MC, McPeek MA, Rausher MD, Rieseberg L, Moore AJ (2010). Data archiving. Am Nat 175: 145–146.
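
Worked sketch of the deviation check. The procedure given under the heading on the relationship between haplotype and nucleotide diversity (predict nucleotide diversity from the observed haplotype diversity with the 0.0081h² model, then take the square root of the absolute residual) can be expressed in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `predicted_pi` and `deviation` are hypothetical, and the 75th and 95th percentile boundaries are not reproduced because they are given only as curves in Figure 3. The input estimates are assumed to come from the 456-base cox1 region defined in the Materials and methods.

```python
import math

# Coefficient of the reported model: predicted pi = 0.0081 * h^2.
MODEL_COEFFICIENT = 0.0081


def predicted_pi(h_obs: float) -> float:
    """Predicted nucleotide diversity for an observed haplotype diversity."""
    if not 0.0 <= h_obs <= 1.0:
        raise ValueError("haplotype diversity is expected to lie in [0, 1]")
    return MODEL_COEFFICIENT * h_obs ** 2


def deviation(h_obs: float, pi_obs: float) -> float:
    """Transformed residual D = sqrt(|pi_obs - pi_pred|).

    The result is meant to be compared by eye against the percentile
    curves in Figure 3; no threshold values are assumed here.
    """
    return math.sqrt(abs(pi_obs - predicted_pi(h_obs)))


if __name__ == "__main__":
    # Illustrative input only: the median h and median pi quoted in the
    # abstract, treated here as if they were one paired sample.
    h_o, pi_o = 0.70130, 0.00356
    print(f"predicted pi = {predicted_pi(h_o):.6f}")
    print(f"D            = {deviation(h_o, pi_o):.4f}")
```

A sample that sits exactly on the model gives D = 0, and larger values indicate samples that may warrant the additional investigation the text recommends.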
