Molecular Ecology Resources (2012) doi: 10.1111/j.1755-0998.2012.03121.x

De novo characterization of the Timema cristinae transcriptome facilitates marker discovery and inference of genetic divergence

AARON A. COMEAULT,* MATHEW SOMMERS,* TANJA SCHWANDER,† C. ALEX BUERKLE,‡ TIMOTHY E. FARKAS,* PATRIK NOSIL* and THOMAS L. PARCHMAN‡

*Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, CO 80303, USA, †Center for Ecology and Evolutionary Studies, University of Groningen, 9700CC Groningen, The Netherlands, ‡Department of Botany, University of Wyoming, Laramie, WY 82071, USA

Abstract

Adaptation to different ecological environments can promote speciation. Although numerous examples of such 'ecological speciation' now exist, the genomic basis of the process, and the role of gene flow in it, remain less understood. This is, at least in part, because systems that are well characterized in terms of their ecology often lack genomic resources. In this study, we characterize the transcriptome of Timema cristinae stick insects, a system that has been researched intensively in terms of ecological speciation, but for which genomic resources have not been previously developed. Specifically, we obtained >1 million 454 sequencing reads that assembled into 84 937 contigs representing approximately 18 282 unique genes and tens of thousands of potential molecular markers. Second, as an illustration of their utility, we used these genomic resources to assess multilocus genetic divergence within both an ecotype pair and a species pair of Timema stick insects. The results suggest variable levels of genetic divergence and gene flow among taxon pairs and genes and illustrate a first step towards future genomic work in Timema.

Keywords: gene flow, isolation with migration, next-generation sequencing, speciation, transcriptome

Received 3 November 2011; revision received 6 January 2012; accepted 13 January 2012

Introduction

The causes of speciation are a central and long-standing topic in evolutionary biology (Darwin 1859; Dobzhansky 1937, 1940; Mayr 1947, 1963) and have received renewed interest over the last two decades (Coyne & Orr 2004). One hypothesis that has received particular attention is that adaptation to different ecological environments, via divergent natural selection, can drive the evolution of reproductive isolation. This hypothesis of 'ecological speciation' (for reviews see Schluter 2001; Rundle & Nosil 2005; Schluter 2009; Nosil 2012) has now seen widespread support from theoretical work (Kirkpatrick 2001; Kirkpatrick & Ravigné 2002; Gavrilets 2004), laboratory evolution experiments (reviewed in Rice & Hostert 1993), comparative studies (Funk et al. 2006) and detailed case studies in nature (Funk 1998; Rundle et al. 2000; Jiggins et al. 2001; Rundle & Nosil 2005 for review; Langerhans et al. 2007; Nosil 2007).

The systems used to study ecological speciation tend to be ecologically well characterized, but lack genomic resources (some notable exceptions aside, such as threespine stickleback; Peichel et al. 2001; Colosimo et al. 2005; Chan et al. 2010). Thus, although we now have convincing evidence for ecological speciation in nature, and the number of systems with genomic resources is growing, the genetic basis of ecological speciation remains relatively poorly understood (reviewed in Schluter & Conte 2009). Here, we take steps towards filling this gap by developing genomic resources for Timema cristinae walking stick insects. The genus Timema comprises approximately 20 species of herbivorous walking stick insects found throughout California (Vickery 1993; Sandoval et al. 1998; Crespi & Sandoval 2000). Numerous studies of Timema have focused on the role of ecology, such as adaptation to different host plant species, in processes such as adaptive radiation and speciation (Crespi & Sandoval 2000; Nosil 2007; Supporting information for more detail on species included in this study). Timema cristinae, in particular, has been studied intensively in terms of adaptive divergence and ecological speciation (see Nosil 2007 for review), but to date has largely lacked genomic resources (160 sequences currently at NCBI).

Correspondence: Aaron A. Comeault, Fax: (303) 492 8699; E-mail: [email protected]

© 2012 Blackwell Publishing Ltd

The study presented here has two main objectives. First, we characterize the transcriptome of T. cristinae using sequences obtained from 454 pyrosequencing of a normalized cDNA library (Margulies et al. 2005; Ellegren 2008; Holt & Jones 2008; Hudson 2008; Vera et al. 2008; Wheat 2010). The de novo assembly of transcriptome sequences is facilitated by greater coverage depth for the much smaller number of nucleotides in the transcriptome than in the whole genome (Bouck & Vision 2007; Emrich et al. 2007). Indeed, recent studies have demonstrated highly successful de novo assemblies of 454 EST data for organisms with few or no prior genomic resources (e.g. Novaes et al. 2008; Vera et al. 2008; Hahn et al. 2009; Kristiansson et al. 2009; Meyer et al. 2009; Schwarz et al. 2009; Parchman et al. 2010; Fraser et al. 2011; Garg et al. 2011; Martin & Wang 2011). Transcriptome sequencing projects also have the advantage of focusing on the protein-coding fraction of the genome where much of the functional variation might be expected to reside (Bouck & Vision 2007). The generation of such large-scale sequence data will enable the application of genome-wide analyses to taxa that are of ecological interest, but that previously have lacked genomic resources (Wheat 2010).

Second, we utilize the T. cristinae transcriptome assembly to develop primers for nuclear loci that can be amplified, by PCR, in different species of Timema. We then gather Sanger sequence data from these loci to characterize the nature of genetic divergence and gene flow within an ecotype pair and a species pair of Timema stick insects. Such multilocus data sets can provide powerful insight into speciation, for example, by providing information on the extent to which genetic divergence occurred in the presence of substantial gene flow, as well as information on the effective population size and the time since population divergence was initiated (Wakeley & Hey 1997; Hey 2006; Wakeley 2008).

The analyses presented here add to a growing body of literature that takes advantage of closely related taxon pairs, which differ in levels of reproductive isolation, to make inferences on how the process of speciation unfolds (Coyne & Orr 1989; Funk et al. 2006; Berner et al. 2009; Nosil et al. 2009; Stelkens et al. 2010; Merrill et al. 2011; Rosenblum & Harmon 2011). The overall results demonstrate the feasibility of rapidly and inexpensively developing novel genomic resources and show how such resources can readily be utilized for population-level insights in evolution.

Methods

Transcriptome characterization

cDNA library construction. A normalized cDNA library was prepared from RNA extracted from ten adult specimens of Timema cristinae. The specimens were collected from ten different localities spanning the species' range in the vicinity of Santa Barbara, California. Five specimens were from the Adenostoma host (population abbreviations following past work, HVA, OUTA, OGA, BTA, OCA) and five specimens from the Ceanothus host (population abbreviations, PC, R12C, PRC, MC, PEC) (Nosil 2007). Procedures were as follows. To minimize sequencing of gut microbes, live insects had their intestinal tracts removed. Specimens were then immediately placed in a tared 2-mL tube containing 300 µL RLT Buffer (Qiagen RNeasy Mini Kit), 1% 2BME and a 5-mm steel bead. Each tissue sample, once placed in a tube, was subjected to disruption and homogenization using the TissueLyserII at 20 Hz for 6 min. Total RNA was isolated using the Qiagen RNeasy protocol. RNA was quantified using NanoDrop spectrophotometry, and RNA integrity was verified through analysis using BioRad Experion RNA Standard Sensitivity Chip (BioRad, Inc.). For cDNA synthesis, 1 µg of total RNA was used as per Evrogen's Mint Universal cDNA Synthesis Kit Protocol II. Approximately 750 ng of double-stranded cDNA was used for normalization as per the specifications of the Evrogen Trimmer Direct cDNA Normalization Kit. Following normalization, gene-specific primers were designed for amplification in real-time PCR assays of Actin (a highly expressed gene) and HSP70 (a gene with low expression). These assays were used as an indicator of normalization. Initially, PCR assays were optimized using RT reagents, conditions and 5 ng of normalized cDNA with reaction products run on a 2% agarose gel. Real-time PCR assays were carried out on Roche 480 Light cycler using Roche LightCycler 480 Sybr Green I Master Mix. Differences in the Ct values of HSP70 relative to Actin for both control cDNA (hybridized but not normalized) and normalized cDNA were calculated. Results indicate a change as a result of normalization of 5.2 cycles between Actin and HSP70. The normalized cDNA collection was then sequenced on a preliminary quarter-plate run and a final full-plate run on a 454 GS XLR Titanium platform at the Roy Carver Center for Genomics, University of Iowa.

454 primer sequences and all sequences resulting from cDNA synthesis reagents were trimmed from reads prior to assembly. In addition, long poly(A) regions were screened from reads prior to assembly. As for past studies involving de novo transcriptome assembly (Weber et al. 2007; Vera et al. 2008; Parchman et al. 2010; Wheat 2010), we used Seqman Ngen (DNAstar, Inc.) to assemble reads into contigs. The assembly was run with a minimum match size of 19 nucleotides, match percentage of 90%, mismatch penalty of 18 and gap penalty of 30. Further information on this assembly is available upon request. The resulting contigs and remaining singletons
were then combined into a single set for the following analyses, except where noted otherwise.

We determined annotations for the 454 EST sequence collection using local BLASTx (Altschul et al. 1997) to align the consensus sequences from the assembled contigs and the singleton sequences to the UNIREF50 15.4 (Suzek et al. 2007) annotated protein database using an E value threshold of 10⁻¹⁰. UNIREF50 is a large protein database representing nonredundant protein clusters from the UniProt and UniParc exhaustive annotated protein databases. BLAST results were parsed and passed through a custom Perl pipeline that summarized information and produced tab-delimited tables containing accession numbers, gene name, taxonomic ID, query length, ortholog sequence length, sequence alignment, E value and bit score for each protein matching to the 454 EST sequences. To determine the number of unique genes represented in the Timema 454 EST collection, these files were then filtered for redundancy in protein accessions. The taxonomic identities of BLAST hit accessions were hierarchically organized utilizing custom Perl scripts to examine the taxonomic distribution of EST annotations.

Molecular marker detection and characterization. We used custom Perl scripts to locate di-, tri- and tetranucleotide SSRs (simple sequence repeats) in the 454 EST sequences with a minimum of four contiguous repeating units. SSRs that represent good candidates for PCR amplification were identified and primers for those regions were constructed using the program BatchPrimer3 (see Supporting information) (You et al. 2008). We determined which SSRs occurred in coding sequences of genes by extracting the aligned portions of sequences having BLAST matches to annotated protein-coding orthologs from UNIREF50 and then using the same algorithm as above to detect SSRs in both the aligned (i.e. protein coding) and remaining portions of these contigs. We also identified SNPs in contigs with high coverage depths using the SNP reporter feature in Seqman Pro (DNASTAR, Inc., see Supporting information for details). We validated SNPs in a subset of annotated contigs using Sanger sequencing (see Supporting information and population genetic study below).

Population genetic study

Experimental design and sampling. To explore patterns of genetic divergence between populations at two differing stages of speciation (i.e. host ecotype vs. species pair), we conducted a population genetic study of divergence between two ecotypes of T. cristinae as well as between the species Timema californicum and Timema poppensis. To examine the effects of geographical arrangement on patterns of genetic divergence, a focal population of the Ceanothus ecotype of T. cristinae was compared to one adjacent (i.e. parapatric) and one distant (i.e. allopatric) population of the Adenostoma ecotype. Likewise, the species T. californicum was compared to one parapatric and one allopatric population of T. poppensis.

We collected 23 to 26 specimens from each of the six populations included in this study between March and June 2010. We sampled individuals from parapatric ecotypes of T. cristinae from one Ceanothus and one adjacent Adenostoma population (N34 29.309 W119 47.180 and N34 29.305 W119 47.191, respectively) along with an allopatric Adenostoma population collected from N34 30.464, W119 47.694. We sampled individuals from parapatric populations of T. poppensis and T. californicum along Summit Road, California (N37 133.43 W122 05.271) and individuals from an allopatric population of T. poppensis along Tin Barn Road, California (N38 37.100 W123 17.322). Once collected, all samples were placed in 100% ethanol and stored at −20 °C.

Molecular methods. DNA was extracted from three to six legs of individual Timema using Qiagen DNeasy Blood and Tissue Kits (Qiagen, Valencia, CA, USA). PCR primers for nuclear loci were developed based on the transcriptome sequences described above. Briefly, we developed approximately 14 000 primer pairs along contigs contained in the characterized transcriptome (Appendix S2). We then haphazardly chose and screened a subset of 48 of these primer pairs across all three species included in this study. Upon initial screening, over 50% of these primer pairs resulted in consistent amplification in T. cristinae and a subset of four primer pairs produced amplified product in all three species. We selected these four primer pairs for use in this study. Finally, we used UNIREF50, SWISSPROT BLAST and BLASTN databases and query engines to determine annotations of the sequenced amplicons of these four nuclear loci.

The four nuclear loci we developed from the assembled transcriptome (Tc_nuc235, Tc_nuc9172, Tc_nuc10854 and Tc_nuc11144) and one mitochondrial locus (CO1) were amplified for 23–26 individuals from our six study populations using primers and reaction conditions described in the Supporting information. We were unable to obtain genetic sequence data for CO1 in the allopatric population of T. poppensis. Chromatograms obtained from these individuals had multiple overlapping peaks, potentially due to mutations that influence the specificity of CO1 primers or copies of this locus present in the nuclear genome. This problem was also observed for 13 individuals from the parapatric T. poppensis population. None of the CO1 sequences collected from individuals of T. cristinae and T. californicum indicated the presence of nuclear copies. For example, PCR products produced a single band in check gels,
multiple fluorescence peaks were not observed in any of the sequences, and edited sequences lacked indels and stop codons and aligned with high identity to CO1 sequences from GenBank. All ambiguous CO1 sequences from T. poppensis were excluded from further analyses. The remaining high-quality sequences were aligned and edited using GENEIOUS 5.3.4 (Drummond et al. 2011). For nuclear loci, positions were scored as heterozygous when double fluorescence peaks were unambiguously observed at a given nucleotide site (i.e. a 40%/60% to 50%/50% overlap in two bases). The gametic phase of individuals at nuclear loci was resolved using PHASE 2.1.1, which uses Bayesian methods for haplotype reconstruction from genetic sequence data (Stephens et al. 2001; Stephens & Donnelly 2003).

Population structure. We used ARLEQUIN 3.5.1.2 (Excoffier & Lischer 2010) to calculate the following population genetic summary statistics: numbers of haplotypes (h), genetic diversity (expected heterozygosity for a diploid individual, Nei 1987), per cent of polymorphic sites (S) and Tajima's D (Tajima 1983). We also used Arlequin to generate estimates of FST between each population for each individual locus. Arlequin uses an AMOVA framework and produces estimates of FST based on a distance matrix computed on the number of mutations between different haplotypes for each locus (Weir & Cockerham 1984; Excoffier et al. 1992; Weir 1996).

Estimating gene flow. We used the isolation with migration model as implemented in the program IMa (Hey & Nielsen 2004; Hey & Nielsen 2007) to estimate levels of gene flow at individual loci that showed evidence for divergence between populations. We inferred genetic divergence from FST estimates that differed significantly from zero (Table 4). For those loci that lacked pairwise FST estimates significantly greater than zero, gene flow estimation in IMa was not possible (i.e. led to a lack of convergence of Markov chains and flat posterior probability distributions). For such undifferentiated loci, we thus infer gene flow as likely being high, thereby constraining differentiation at these loci.

IMa uses coalescent-based Bayesian Markov chain Monte Carlo (MCMC) simulations under a model of isolation with migration to estimate six demographic parameters: migration rate between the two populations in question (m1, m2), population size of the two extant populations and their most recent common ancestor (θ1, θ2, θA), and the time the two extant populations diverged from their most recent common ancestor (t), all scaled by the mutation rate per gene (u). Assumptions of the IMa model include selective neutrality of loci, free recombination among loci, no recombination within loci and a simple population history of divergence from a common ancestor lacking gene flow from other sources. Two of the loci had values of Tajima's D that are consistent with selection, but only in a single population for each locus (see Table 3). Thus, consistent evidence for departures from neutrality was not detected (see Discussion for further consideration of this issue). Intralocus recombination was tested for using the program IMgc (Woerner et al. 2007). One locus (Tc_nuc235) showed evidence for recombination, and the longest nonrecombining region of this locus (343 bp) was extracted for all IMa analyses. While divergence from a single common ancestor is highly likely for both the ecotype and species pair included in this study, it is difficult to rule out gene flow from other sources. Unsampled populations exchanging migrants with the populations included in this study may therefore influence our estimates of gene flow (Strasburg & Rieseberg 2010), and interpretation of gene flow estimates should be made with caution.

Simulations in IMa were run for a minimum of 20 million generations following an initial burn-in period of one million generations. Metropolis coupling was carried out using five Markov chains, a two-step heating scheme with heating parameters of 0.05 and 2.0, and 10 swap attempts between chains per generation. A minimum of three independent IMa runs were carried out for each locus to ensure convergence of parameter estimates. A total of 100 000 trees were saved from each run. To statistically test the fit of models with different migration histories between T. californicum and T. poppensis to our data, we used 30 000 trees saved from IMa runs carried out on the complete nuclear data set to conduct likelihood ratio tests (LRTs) as implemented in the program IMa (Hey & Nielsen 2007).

Results

Transcriptome characterization

Assembly and annotation. Sequencing of the normalized cDNA template on the 454 GS XLR Titanium platform produced 1 344 463 sequences averaging 388 bases in length (Fig. S1, Supporting information). After removing long poly(A) regions and primer sequences, 1 309 855 sequences averaging 369 bp in length (median: 423 bp) remained for assembly; 923 423 of these reads assembled into 84 937 contigs, with 386 432 unassembled reads remaining as singletons. The average read length of the singleton sequences (292 bp) was substantially shorter than that of sequences that assembled into contigs (381 bp), indicating read length influenced assembly. The average contig length was 607 bases (min = 19, max = 4428), with an average of 11 reads (min = 2, max = 3675) assembled per contig (Fig. 1a). The mean coverage depth per nucleotide position in the assembled contigs was 5.7


[Fig. 1, panels (a) and (b). Axes: (a) number of reads/contig vs. contig length (bp); (b) average contig coverage depth vs. 454 contig length/Uniref ortholog length.]

Fig. 1 Characteristics of the transcriptomic data and analyses. (A) Contig length as a function of the number of sequences assembled into each contig. The marginal histograms depict the frequency distributions of the number of sequences assembled into contigs and the frequency distribution of contig length. (B) Comparison of Timema cristinae contigs to orthologous UNIREF50-coding sequences. Shown is the ratio of T. cristinae contig length to UniRef ortholog length as a function of contig coverage depth. The dotted line corresponds to a ratio of one, above which 454 contigs are as long or longer than the BLAST-matched UNIREF50 orthologs. The contour lines correspond to the density of points in the plot.

(min = 1, max = 574), indicating substantial coverage depth. As would be expected, contig length increased as a function of coverage depth and the number of reads assembled into each contig (Fig. 1a), and the assembly produced a substantial number of long and deeply covered contigs. For example, 7335 contigs had greater than 10× average coverage depth per nucleotide position, and 8136 contigs were >1000 bp in length.

BLAST annotation of contig consensus sequences and singleton sequences to a large number of unique genes indicates that this 454 EST collection probably represents a substantial portion of the genes in T. cristinae; 37 952 (45%) of the contig consensus sequences had BLAST hits to annotated proteins in UNIREF50 (Table 1). A substantial portion of the singleton sequences also had BLAST matches to the annotated protein databases, indicating that these unassembled sequences still provide valuable sequence information and improve overall transcriptome coverage breadth. In many cases, multiple sequences (both contigs and singletons) had BLAST matches to the same protein. After correcting for redundancy, the combined set of contig consensus sequences and singleton sequences had matches to 18 282 unique proteins in UNIREF50 (Table 1).

As expected, the vast majority of BLAST hits were to known arthropod proteins (Table 2). A relatively large percentage of the significant BLAST hits were also to other animal proteins, which is likely due to high representation of these taxa in UNIREF50 rather than contamination of our RNA with other animal RNA. BLAST matches to fungal, viral or bacterial accessions were rare, indicating some, but very little, presence of xenobiotic RNA in our samples. Many of the 454 contig sequences were sufficiently long to cover full or nearly full gene transcripts (Fig. 1b).

Table 1 Numbers and percentages of 454 ESTs in the assembled contigs, singletons and the combined sequence set with matches to known proteins in BLASTx searches of the UNIREF50 annotated protein database

| | Contigs (84 937) | Singletons (386 432) | Combined set (471 369) |
|---|---|---|---|
| Matches to database | 37 952 (45%) | 150 501 (39%) | 188 452 (40%) |
| Matches to unique proteins | 12 808 | 12 762 | 18 282 |

Table 2 Number and percentages of unique best BLASTx matches of 454 EST contigs, singletons and the combined sequence set to UNIREF50 grouped by taxonomic category

| Taxonomic category | Contigs (12 808) | Singletons (12 762) | Combined set (18 282) |
|---|---|---|---|
| Arthropod | 7940 (62.0%) | 7401 (57.9%) | 10 704 (58.5%) |
| Other animal | 3410 (26.6%) | 3508 (27.5%) | 5301 (28.9%) |
| Fungi | 203 (1.6%) | 342 (2.7%) | 387 (2.1%) |
| Plant | 122 (1.0%) | 100 (0.7%) | 248 (1.4%) |
| Protozoa | 181 (1.4%) | 208 (1.7%) | 192 (1.1%) |
| Bacteria | 69 (0.5%) | 117 (1.0%) | 175 (1.0%) |
| Virus | 13 (0.1%) | 26 (0.2%) | 28 (0.1%) |
| Other | 869 (6.7%) | 1060 (8.3%) | 1247 (6.8%) |
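The redundancy correction behind Table 1 can be illustrated with a small Python sketch. It is not the custom Perl pipeline used in the study; the function names, the tuple layout of the parsed BLAST rows and the toy accessions are all hypothetical, and keeping only the lowest E value hit per query is an assumption. Only the E value threshold of 10⁻¹⁰ is taken from the Methods.

```python
# Illustrative sketch only: names and row layout are hypothetical.
E_VALUE_THRESHOLD = 1e-10  # threshold stated in the Methods


def best_hits(rows, max_evalue=E_VALUE_THRESHOLD):
    """Keep the lowest-E-value hit per query; rows are (query, accession, evalue)."""
    best = {}
    for query, accession, evalue in rows:
        if evalue > max_evalue:
            continue  # hit does not pass the significance threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (accession, evalue)
    return best


def unique_proteins(best):
    """Collapse queries that hit the same protein accession."""
    return {accession for accession, _ in best.values()}


# Toy data: two contigs and one singleton all match protein P1.
contig_rows = [("contig1", "P1", 1e-50), ("contig1", "P2", 1e-20),
               ("contig2", "P1", 1e-30), ("contig3", "P9", 1e-5)]
singleton_rows = [("read7", "P1", 1e-12), ("read8", "P3", 1e-40)]

contig_best = best_hits(contig_rows)
singleton_best = best_hits(singleton_rows)
combined_unique = unique_proteins(contig_best) | unique_proteins(singleton_best)

# Percentage of sequences with a database match, as reported in Table 1.
pct_contigs = round(100 * 37952 / 84937)
```

Because contigs and singletons often hit the same protein, the combined count of unique proteins is smaller than the sum of the two separate counts, which is the pattern seen in Table 1 (12 808 and 12 762 versus 18 282).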

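The SSR search described in the Methods (di-, tri- and tetranucleotide motifs with a minimum of four contiguous repeating units) can be sketched in Python with a back-referencing regular expression. This is a minimal illustration and not the authors' Perl script: the function names are hypothetical, and the exclusion of motifs that are themselves repeats of a shorter unit (for example AA or ACAC) is an assumption about how such cases would be handled.

```python
import re

MIN_REPEATS = 4  # minimum number of contiguous repeat units (from the Methods)


def _is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA', 'ACAC')."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, n_repeats) tuples for di-, tri- and tetranucleotide SSRs."""
    seq = seq.upper()
    hits = []
    for k in (2, 3, 4):
        # A k-base motif followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if _is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


example = "TTG" + "AC" * 4 + "CC" + "ATG" * 4 + "TT"
ssrs = find_ssrs(example)
```

A production script would also need to handle ambiguous bases and overlapping candidate motifs, which this sketch ignores.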

Molecular marker characterization. Both SSRs and SNPs were abundant in the 454 EST sequences, representing a valuable source of molecular markers. More than 50 000 di-, tri- and tetranucleotide repeats were discovered in contigs and unassembled singletons (for details see Table S1, Supporting information); 4492 SSRs occurred in contigs with BLAST matches to UNIREF50. Of these, 4067 occurred in coding regions as judged by the part of contigs aligning to protein-coding-region sequences, with the remainder occurring in UTRs. Despite the large number of SSRs observed in coding regions, coding regions covered the majority of these contigs, and the density of SSRs per sequence length was lower in coding (0.0012 SSRs per bp) than in noncoding regions (0.0032 SSRs per bp). We designed high-quality PCR primers for 15 278 of the SSRs detected in the T. cristinae ESTs, representing a substantial resource for molecular and ecological studies in T. cristinae and closely related species. Detailed information on SSR loci including primer sequences, repeat motif and repeat length is given in Appendix S1.

SNPs were highly abundant in the 454 EST contigs, presenting another valuable resource for future genetic work in Timema. Across all contigs with >8× coverage, there were 47 694 high-quality SNPs that were present at a minimum frequency of 25% and a maximum frequency of 75%. SNPs were also abundant in contigs with deeper coverage. For example, the rate of SNP occurrence was 0.41 SNPs per 100 bases in contigs with >50× coverage. Using the same stringent criteria as for SSR primer development, we were able to develop primers that targeted 14 002 regions approximately 500 bp in size across the assembled transcriptome to be amplified by PCR (Appendix S2).

Population genetic study

Locus characterization. Results of BLAST queries showed variable identity of the four nuclear loci developed here to known genes or proteins. Locus Tc_nuc235 (GenBank: JQ337965–JQ338264) has BLAST query matches to a probable muscarinic acetylcholine receptor gar-1 protein at positions 172–432, however did not have matches to known protein-coding regions. Approximately the first half of locus Tc_nuc9172 (GenBank: JQ338265–JQ338550) represents a coding region with similarity to a glycolipid transfer protein domain-containing protein. Locus Tc_nuc10854 (GenBank: JQ338551–JQ338844) has high identity to a putative EH-domain-containing protein identified from mRNA isolated from Pediculus humanus corporis. Finally, locus Tc_nuc11144 (GenBank: JQ338845–JQ339126) has variable identity to a number of uncharacterized gene products identified from Drosophila melanogaster transcripts.

Within-population genetic diversity. Within-population summary statistics reveal moderate genetic variation within each population and also some differences in the level of variation among loci (Table 3). For example, the percentage of polymorphic sites was highest in mitochondrial locus CO1 and lowest in nuclear loci Tc_nuc10854 and Tc_nuc11144, while the number of haplotypes for each locus and population was higher in nuclear loci (e.g. Tc_nuc235 and Tc_nuc9172) than the mitochondrial locus.

Population structure. Levels of population differentiation (i.e. FST estimates) were highly variable among population pairs and loci (Table 4). A lack of differentiation was observed at all four nuclear loci for both parapatric and allopatric populations of T. cristinae, while mitochondrial differentiation was strong and statistically significant, but only between allopatric populations (FST = 0.34). Divergence between Timema californicum and Timema poppensis was significant and large at all four nuclear loci and at both geographical scales (FST estimates ranging from 0.72 to 0.93; Table 4). Mitochondrial differentiation between parapatric populations of T. californicum and T. poppensis was less than nuclear differentiation (FST = 0.331, Table 4).

Estimating gene flow. The only locus showing evidence for differentiation at the ecotype level of comparison was mitochondrial CO1, and this only occurred between the allopatric populations (Table 4). Results from IMa analyses of CO1 are consistent with a low level of gene flow, despite moderate levels of divergence at this locus, between allopatric populations of T. cristinae (Table 5; Fig. 2). Estimates of migration rates between T. californicum and T. poppensis obtained from IMa analyses varied across loci (Table 5; Fig. 3). Highest posterior probability (HPP) estimates of migration rates at the two nuclear loci Tc_nuc235 and Tc_nuc10854 are consistent with little or no migration in either direction regardless of geographical arrangement (Table 5). In contrast, there was evidence for nonzero migration from T. poppensis to T. californicum, albeit at a low rate, at the two other nuclear loci (Table 5; Fig. 3). Results from the mitochondrial data set were consistent with migration between T. poppensis and T. californicum in both directions between parapatric populations (Table 5; as noted above, sequences that would allow an allopatric comparison for this locus could not be obtained). Likelihood ratio tests of nested models using the complete nuclear data set rejected models in which migration rates between T. poppensis and T. californicum were equal to zero under both geographical scenarios (LRT, all P < 0.05; Table 6). Moreover, a model where rates of migration were equal


Table 3 Within-population descriptive statistics including number of alleles sampled (n), number of haplotypes observed (h), genetic diversity (expected heterozygosity in a diploid individual), per cent polymorphic sites (%S) and Tajima's D

| Locus | Population | n | h | Genetic diversity | %S | Tajima's D |
|---|---|---|---|---|---|---|
| Nuclear | | | | | | |
| Tc_nuc235 | Ceanothus | 50 | 26 | 0.93 ± 0.02 | 4.71 | −1.24 |
| | Aden. (para.) | 52 | 24 | 0.94 ± 0.02 | 4.24 | −0.94 |
| | Aden. (allo.) | 46 | 17 | 0.93 ± 0.02 | 5.41 | −1.28 |
| | Timema californicum | 50 | 5 | 0.65 ± 0.04 | 0.94 | 1.79 |
| | T. popp (para.) | 54 | 2 | 0.11 ± 0.06 | 0.47 | −0.9 |
| | T. popp (allo.) | 48 | 1 | 0.00 ± 0.00 | 0.00 | 0.00 |
| Tc_nuc9172 | Ceanothus | 44 | 8 | 0.63 ± 0.08 | 2.61 | −1.48 |
| | Aden. (para.) | 50 | 9 | 0.67 ± 0.07 | 2.85 | −1.14 |
| | Aden. (allo.) | 44 | 8 | 0.76 ± 0.05 | 2.38 | −1.1 |
| | T. californicum | 46 | 4 | 0.24 ± 0.08 | 0.95 | −1.45 |
| | T. popp (para.) | 54 | 2 | 0.07 ± 0.05 | 0.24 | −0.88 |
| | T. popp (allo.) | 48 | 1 | 0.00 ± 0.00 | 0.00 | 0.00 |
| Tc_nuc10854 | Ceanothus | 48 | 3 | 0.20 ± 0.07 | 0.53 | −1.00 |
| | Aden. (para.) | 52 | 3 | 0.21 ± 0.07 | 0.53 | −0.89 |
| | Aden. (allo.) | 44 | 3 | 0.28 ± 0.08 | 0.53 | −0.66 |
| | T. californicum | 50 | 2 | 0.18 ± 0.07 | 0.26 | −0.24 |
| | T. popp (para.) | 52 | 2 | 0.27 ± 0.07 | 0.26 | −0.27 |
| | T. popp (allo.) | 48 | 2 | 0.51 ± 0.02 | 0.53 | 2.27 |
| Tc_nuc11144 | Ceanothus | 48 | 1 | 0.00 ± 0.00 | 0.00 | 0.00 |
| | Aden. (para.) | 52 | 2 | 0.04 ± 0.04 | 0.22 | −1.1 |
| | Aden. (allo.) | 42 | 2 | 0.09 ± 0.06 | 0.22 | −0.85 |
| | T. californicum | 48 | 2 | 0.04 ± 0.04 | 0.88 | −1.87 |
| | T. popp (para.) | 48 | 3 | 0.45 ± 0.06 | 0.44 | 1.64 |
| | T. popp (allo.) | 44 | 1 | 0.00 ± 0.00 | 0.00 | 0.00 |
| Mitochondrial | | | | | | |
| CO1 | Ceanothus | 25 | 7 | 0.81 ± 0.05 | 8.006536 | 1.44 |
| | Aden. (para.) | 26 | 7 | 0.85 ± 0.03 | 8.006536 | 2.10 |
| | Aden. (allo.) | 21 | 5 | 0.64 ± 0.10 | 7.51634 | 0.51 |
| | T. californicum | 25 | 8 | 0.71 ± 0.09 | 1.960784 | −0.06 |
| | T. popp (para.) | 15 | 4 | 0.72 ± 0.08 | 1.470588 | 1.54 |

Ceanothus = Timema cristinae Ceanothus population, Aden. = T. cristinae Adenostoma population, T. popp = Timema poppensis. Geographical location relative to focal population is given in parentheses (para. = parapatric and allo. = allopatric).
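Two of the statistics in Table 3 follow from standard formulas and can be sketched in Python. The study computed them with ARLEQUIN; the code below is only a textbook-style illustration with hypothetical function names, using Nei's unbiased gene diversity and the usual Tajima's D constants. Returning 0.0 for monomorphic samples (or samples too small for the variance term to be meaningful) is an assumption made for this sketch.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def haplotype_diversity(haplotypes):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        return 0.0
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1.0 - sum(p * p for p in freqs))


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (no gap or missing-data handling)."""
    n = len(seqs)
    n_sites = len(seqs[0])
    # S: number of segregating (polymorphic) sites
    S = sum(1 for i in range(n_sites) if len({s[i] for s in seqs}) > 1)
    if n < 4 or S == 0:
        return 0.0  # assumption: report 0.0 when D is undefined or unstable
    pairs = list(combinations(seqs, 2))
    # pi: mean number of pairwise differences
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    return (pi - S / a1) / sqrt(variance) if variance > 0 else 0.0


# Excess of rare variants gives a negative D; intermediate-frequency variants a positive D.
d_rare = tajimas_d(["AAAA", "AAAA", "AAAA", "AAAA", "TTTA"])
d_balanced = tajimas_d(["AAA", "AAA", "TTT", "TTT"])
```

The per cent of polymorphic sites (%S) in Table 3 would correspond to 100 × S divided by the alignment length in this notation.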

Table 4 Pairwise FST between Timema cristinae ecotypes and between the species pair Timema poppensis and Timema californicum for four nuclear loci and one mitochondrial locus

                        Locus
Level     Geography     Tc_nuc235   Tc_nuc9172   Tc_nuc10854   Tc_nuc11144   CO1
Ecotype   Parapatric    0.016       −0.008       −0.004        −0.002        0.03
Ecotype   Allopatric    −0.013      0.026        0.004         0.029         0.343
Species   Parapatric    0.924       0.904        0.884         0.896         0.331
Species   Allopatric    0.929       0.917        0.717         0.982         N/A

Significant FST values (P < 0.05) are shown in bold. The level of comparison (ecotype or species) is indicated along with the geographical arrangement of the populations being compared. Ecotype comparisons are between populations of T. cristinae, and species-level comparisons are between the species T. californicum and T. poppensis.

in both directions was not rejected, suggestive of bidirectional nuclear gene flow.

Discussion

Recent advances in both sequencing technologies and data processing now facilitate the rapid development of genomic resources for nonmodel organisms. Here, we characterize the transcriptome of Timema cristinae, a species that has been the focus of extensive study but has lacked genomic resources. We then use this resource to develop markers for assessing patterns of multilocus genetic divergence within and between species of Timema. Our results yield several empirical insights into


Table 5 Summary of estimates of locus-specific migration rates. Results are averages of three runs using the program IMa

Level of     Geographical             m1                          m2
comparison   arrangement    Locus     HPP     90Low   90High      HPP     90Low   90High
Ecotype      Allopatric     CO1       0.890   0.030   47.25       0.070   0.030   14.590
Species      Parapatric     235       0.005   0.005   0.602       0.005   0.005   0.795
Species      Parapatric     9172      0.505   0.018   2.965       0.005   0.005   2.008
Species      Parapatric     10854     0.005   0.005   4.712       0.005   0.005   4.855
Species      Parapatric     11144     0.578   0.018   4.115       0.005   0.005   1.448
Species      Parapatric     CO1       0.875   0.005   7.582       0.335   0.005   6.742
Species      Allopatric     235       0.005   0.005   0.568       0.045   0.005   1.348
Species      Allopatric     9172      0.578   0.005   6.052       0.005   0.005   7.728
Species      Allopatric     10854     0.005   0.005   4.208       0.005   0.005   3.298
Species      Allopatric     11144     0.338   0.005   2.478       0.005   0.005   6.078

m1 = migration rate into population 1; m2 = migration rate into population 2; HPP = highest posterior probability; 90Low = lower bound of 90% of the posterior probability distribution of the parameter estimate; 90High = upper bound of 90% of the posterior probability distribution of the parameter estimate.
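The summaries reported in Table 5 can be illustrated with a short sketch. It assumes that each run yields a binned posterior (one probability per migration-rate bin), that runs are averaged bin by bin, that HPP is the rate of the bin with the highest averaged probability, and that the 90% bounds form a central interval read from the cumulative distribution. These are assumptions made for illustration; IMa's own interval definition may differ, and the function names and numbers below are hypothetical.

```python
def average_runs(runs):
    """Average several binned posteriors bin by bin (all runs share the same bins)."""
    if not runs or any(len(r) != len(runs[0]) for r in runs):
        raise ValueError("runs must be non-empty and of equal length")
    return [sum(values) / len(runs) for values in zip(*runs)]


def summarize_posterior(rates, probs, mass=0.90):
    """Return (peak_rate, low, high) for a binned posterior.

    peak_rate: rate of the bin with the highest posterior probability.
    low, high: bounds of the central interval containing `mass` of the
    probability, read from the cumulative distribution.
    """
    if not rates or len(rates) != len(probs):
        raise ValueError("rates and probs must be non-empty and of equal length")
    total = sum(probs)
    norm = [p / total for p in probs]  # normalize so the bins sum to one
    peak_rate = rates[max(range(len(norm)), key=norm.__getitem__)]
    tail = (1.0 - mass) / 2.0
    cumulative = 0.0
    low = high = None
    for rate, p in zip(rates, norm):
        cumulative += p
        if low is None and cumulative >= tail:
            low = rate
        if high is None and cumulative >= 1.0 - tail:
            high = rate
    if high is None:  # guard against rounding in the final bin
        high = rates[-1]
    return peak_rate, low, high


# Hypothetical bins and three hypothetical runs (unnormalized bin weights).
rates = [0.5, 1.5, 2.5, 3.5, 4.5]
runs = [[4, 10, 50, 30, 6], [3, 11, 51, 29, 6], [5, 9, 49, 31, 6]]
print(summarize_posterior(rates, average_runs(runs)))
```

Wide intervals relative to the peak, as for several loci in Table 5, indicate that the data constrain the migration rate only weakly.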

genetic divergence in Timema, such as evidence suggestive of low-level gene flow between allopatric populations of T. cristinae and the first estimates of rates of gene flow between species of Timema. We discuss below first aspects of the transcriptome sequencing project and then of the population genetic study.

[Fig. 2 plot: posterior probability (y-axis) against migration rate (x-axis) for CO1; curves: Adenostoma to Ceanothus and Ceanothus to Adenostoma.]

Fig. 2 Posterior probability distribution of migration rate estimates between allopatric Timema cristinae ecotypes for the mitochondrial locus CO1. Posterior probability distributions are the means of three independent IMa analyses.

Transcriptomal variation and a genomic resource for Timema

The more than 1.3 million 454 sequencing reads we obtained provided the basis for a thorough assembly and characterization of the T. cristinae transcriptome. The presence of xenobiotics in transcriptome sequencing projects needs to be carefully considered, for example, to prevent inaccurate estimates of expression rates; however, these 'contaminants' can also provide important insight into the biology of the organism being studied (e.g. Hale et al. 2009, 2010). The small percentage of BLAST annotations to nonmetazoan taxa we observed indicates a lack of major contamination in our assembly. Large numbers of contigs and singleton sequences had BLAST annotations to unique genes, and the deep coverage of many contigs allowed the detection and characterization of large numbers of SSRs and SNPs. This resource should facilitate and support future work on genomic divergence in a species that has facilitated insight into the process of ecological speciation.

Similar to previous 454 transcriptome sequencing studies, a large percentage of the 454 reads (70%) were assembled into contigs (Novaes et al. 2008; Meyer et al. 2009; Parchman et al. 2010). As would be expected from this number of sequences, mean coverage depth was reasonably high (5.7×). BLAST analyses indicated the presence of more than 18 000 unique transcripts. Owing to the presence of alternatively spliced variants (Modrek & Lee 2002), this is probably an overestimate of the underlying number of genes. Nonetheless, given Drosophila melanogaster is reported to have approximately 13 000 genes (Adams et al. 2000) and Bombyx mori to have 18 000 genes (Xia et al. 2004), the transcriptome sequence collection produced here probably represents a significant portion of the genes in T. cristinae. Large numbers of both the contig consensus sequences and the unassembled singleton sequences were annotated to known proteins in BLAST analyses, providing a wealth of information on the identity and reading frame of the annotated 454 EST sequences. Moreover, a substantial number of transcripts were covered in full length (Fig. 1).


[Fig. 3 plot: posterior probability (y-axis) against migration rate (x-axis); panels for parapatric and for allopatric T. californicum and T. poppensis, one per data set (nuc_complete, Tc_nuc235, Tc_nuc9172, Tc_nuc10854, Tc_nuc11144, CO1; one allopatric panel is marked NA); curves: T. californicum to T. poppensis and T. poppensis to T. californicum.]

Fig. 3 Posterior probability distribution of migration rate estimates between the species pair Timema poppensis and Timema californicum. Posterior probability distributions are the means of three independent IMa analyses.

The large 454 EST collection presented here thus provides considerable information on the protein-coding fraction of the genome of a taxon where speciation has been studied extensively and for which no previous genome-level resources existed. This lack of genomic resources has hindered insight into the genomic architecture of adaptation and speciation. For example, the availability of transcriptome sequences makes it possible to begin targeted enrichment and resequencing of thousands of coding regions across many individuals, as exemplified in Heliconius butterflies (Nadeau et al. 2011). These and numerous other applications will be facilitated by the development of the genomic resource reported here.


Gene flow and genetic divergence in Timema speciation

We used the transcriptome assembly for T. cristinae to rapidly develop PCR primers for four nuclear loci that amplified both in T. cristinae and in the distantly related species pair Timema poppensis and Timema californicum. Using these markers and a single mitochondrial locus, we found pronounced differences in genetic divergence according to the point in the speciation process observed (ecotype vs. species), the locus considered and the geographical arrangement of populations.

Between ecotypes of T. cristinae, nuclear differentiation was completely lacking. All comparisons of nuclear differentiation between ecotypes of T. cristinae, at both geographical scales, resulted in FST estimates near zero (Table 4). These findings are in line with previous studies using hundreds of AFLP markers, where only a small proportion of markers showed elevated genetic differentiation between ecotype pairs (Nosil et al. 2008). Collectively, these results suggest that most of the genome remains relatively undifferentiated between ecotypes of T. cristinae, with only the few regions affected by divergent selection becoming differentiated between populations (Via 2001; Wu 2001; Nosil et al. 2009).

Inference of the evolutionary or demographic history within or between populations can be influenced by the type and number of loci employed in a given study (e.g. Barker et al. 1997; Bachtrog et al. 2006). The nuclear loci we developed here were not randomly sampled from the genome of T. cristinae, but instead were most likely to amplify in distantly related species (i.e. T. californicum and T. poppensis). It is therefore quite possible that these loci are located in highly conserved genes or genomic regions, possibly subject to stabilizing selection, and could therefore represent an underestimate of neutral genetic divergence between T. cristinae populations. Interestingly, at the species level of comparison, these nuclear loci indicate greater divergence, or at least more complete sorting of molecular variation, than loci used in previous studies (see discussion of T. poppensis and T. californicum below). Future studies of genetic divergence using sequence-based scans of genome-wide variation will greatly increase our understanding of how genomic divergence occurs between ecotypes of T. cristinae and how levels of divergence may vary across the genome.

The mitochondrial locus (CO1) was the only locus that showed significant divergence between populations of T. cristinae, and this was only between allopatric populations. This finding is consistent with a process whereby CO1 differentiates between allopatric populations because of little or no gene flow between them, as supported by our IMa analysis (Fig. 2), but remains undifferentiated in parapatric populations because of high gene flow or recent common ancestry. Although we examined only two population pairs here, a larger CO1 data set examining numerous population pairs has reported consistently weaker differentiation between parapatric vs. allopatric populations (Nosil et al. 2003).

As expected, patterns of genetic divergence between the species pair T. poppensis and T. californicum differed dramatically from those observed between ecotypes of T. cristinae. Between T. poppensis and T. californicum, all nuclear loci showed significant and large estimates of FST regardless of geographical context (Table 4). In parapatric populations, the mitochondrial locus showed weaker divergence than all four of the nuclear loci. This finding is in line with previous multilocus studies showing a lack of monophyly at CO1 between numerous species pairs of other (Funk & Omland 2003 for review) and highlights the importance of using multilocus data sets in systematic and evolutionary studies. Mitochondrial genomes may generally be more likely to cross species boundaries than regions of the nuclear genome, as supported by our IMa analyses (Fig. 3 and Tables 4 and 5) and as reported in past studies (e.g. Bachtrog et al. 2006; Gompert et al. 2006). This trend deserves future study and may be caused by mechanisms such as barriers to gene flow contained in the nuclear genome that are not manifest in the mitochondrial genome (Wu 2001).

Table 6 Results of likelihood ratio tests for gene flow carried out on the complete nuclear data set between the species Timema californicum and Timema poppensis. Likelihood ratio tests were carried out on nested models with differing migration histories using the software IMa (Hey & Nielsen 2007). Likelihood ratio test statistics (−2LLR) were calculated as −2 ln[(probability of model of interest)/(probability of full model)]. The full model contains five parameters (θ1, θ2, θa, m1, m2). Nested models that test the likelihood of different migration histories are given (note: all models still include the three 'θ' parameters)

Model          Log likelihood   2LLR     d.f.   P
Parapatric T. californicum vs. T. poppensis
Full model     1.7779           –        –      –
m1 = m2        0.7112           2.1334   1      0.144
m1 = 0         −2.3617          8.2793   1      0.004
m2 = 0         1.7764           0.003    1      0.956
m1 = m2 = 0    −2.3603          8.2764   2      0.016
Allopatric T. californicum vs. T. poppensis
Full model     2.5796           –        –      –
m1 = m2        1.6897           1.7798   1      0.182
m1 = 0         −1.0287          7.2165   1      0.007
m2 = 0         2.5263           0.1065   1      0.744
m1 = m2 = 0    −1.0270          7.2132   2      0.0271

m1 = migration from T. poppensis to T. californicum, m2 = migration from T. californicum to T. poppensis. Models represented in bold are those not rejected following likelihood ratio tests.
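The likelihood ratio tests of Table 6 can be sketched in a few lines. The sketch assumes that the test statistic is twice the difference between the log likelihoods of the full and the nested model and that P-values are taken from a chi-square distribution with the listed degrees of freedom (the values in Table 6 appear consistent with this, but the exact reference distribution used by IMa is an assumption here). The function names are hypothetical; only the log likelihoods are taken from the table.

```python
import math


def lrt_statistic(loglik_full, loglik_nested):
    """Likelihood ratio statistic: 2 * (lnL of full model - lnL of nested model)."""
    return 2.0 * (loglik_full - loglik_nested)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, closed forms for df = 1 or 2."""
    if x < 0:
        raise ValueError("test statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only covers 1 or 2 degrees of freedom")


# Parapatric comparison, log likelihoods copied from Table 6.
full = 1.7779
nested = {"m1 = m2": (0.7112, 1), "m1 = 0": (-2.3617, 1), "m1 = m2 = 0": (-2.3603, 2)}
for model, (loglik, df) in nested.items():
    stat = lrt_statistic(full, loglik)
    print(model, round(stat, 4), round(chi2_sf(stat, df), 3))
```

Because the log likelihoods in the table are rounded, recomputed statistics can differ from the tabulated values in the last decimal place.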


Despite consistently high levels of divergence, evidence suggestive of a low level of gene flow at nuclear loci was observed between T. poppensis and T. californicum (LRT, Table 6). Evidence suggestive of a low level of migration was observed in two of the four nuclear loci in our data set (Tc_nuc9172 and Tc_nuc11144, Fig. 3). It is possible that the two nuclear loci lacking a signature of migration reside in regions of the genome that were unable to introgress between species during speciation; however, tests of this hypothesis require a more 'genomic' view and will benefit from data sets that explore patterns of divergence at tens of thousands of loci (or more) collected from across the genome. With the recent advances that have been made in sequencing technologies and bioinformatics tools, these types of genomic studies are becoming feasible (e.g. see Gompert et al. 2010; Hohenlohe et al. 2010; Nadeau et al. 2011) and are currently underway in Timema.

A major question awaiting further work is the evolutionary timing of gene flow between populations or species. It is possible that gene flow between T. californicum and T. poppensis was the result of ancient hybridization that occurred during the early stages of speciation, a very low level of recent hybridization (e.g. as a result of secondary contact) or a combination of these processes; however, current methods do not allow for estimates of timing of gene flow (Sousa et al. 2011; Strasburg & Rieseberg 2011).

Conclusions

Population genetic analyses are valuable for quantifying genetic divergence occurring between populations at different stages of speciation (e.g. Mallet et al. 2007; Berner et al. 2009; Peccoud et al. 2009; Gompert et al. 2010; Hohenlohe et al. 2010; Rosenblum & Harmon 2011). Here, we used transcriptome sequencing to develop genomic resources for Timema stick insects that allowed us to explore multilocus genetic divergence between populations at different stages of speciation. Analyses of these data revealed variable genetic divergence according to the point in the speciation process examined, as well as variation in divergence history among gene regions. Future studies taking advantage of the assembled transcriptome reported here and a genome-wide view of divergence will greatly increase our understanding of the processes of local adaptation and speciation.

Acknowledgements

We thank DNAstar for technical guidance and expertise in use of their Seqman NGen and Lasergene software packages. This work was funded by an Innovation Seed Grant from CU Boulder to PN. AAC was supported by a graduate scholarship from the Natural Sciences and Engineering Research Council of Canada. We thank the handling editor Andrew DeWoody and two anonymous reviewers for helpful comments on this manuscript.

References

Adams MD, Celniker SE, Holt RA et al. (2000) The genome sequence of Drosophila melanogaster. Science, 287, 2185–2195.
Altschul SF, Madden TL, Schaffer AA et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389–3402.
Bachtrog D, Kevin T, Clark A, Andolfatto P (2006) Extensive introgression of mitochondrial DNA relative to nuclear genes in the Drosophila yakuba species group. Evolution, 60, 292–302.
Barker JSF, Moore SS, Hetzel DJS, Evans D, Tan SG, Byrne K (1997) Genetic diversity of Asian water buffalo (Bubalus bubalis): microsatellite variation and a comparison with protein-coding loci. Animal Genetics, 28, 103–115.
Berner D, Grandchamp A-C, Hendry AP (2009) Variable progress toward ecological speciation in parapatry: stickleback across eight lake-stream transitions. Evolution, 63, 1740–1753.
Bouck A, Vision T (2007) The molecular ecologist's guide to expressed sequence tags. Molecular Ecology, 16, 907–924.
Chan YF, Marks ME, Jones FC et al. (2010) Adaptive evolution of pelvic reduction in sticklebacks by recurrent deletion of a Pitx1 enhancer. Science, 327, 302–305.
Colosimo PF, Hosemann KE, Balabhadra S et al. (2005) Widespread parallel evolution in sticklebacks by repeated fixation of ectodysplasin alleles. Science, 307, 1928–1933.
Coyne JA, Orr HA (1989) Patterns of speciation in Drosophila. Evolution, 43, 362–381.
Coyne JA, Orr HA (2004) Speciation. Sinauer Associates, Inc., Sunderland, MA.
Crespi BJ, Sandoval CP (2000) Phylogenetic evidence for the evolution of ecological specialization in Timema walking-sticks. Journal of Evolutionary Biology, 13, 249–262.
Darwin C (1859) On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. John Murray, London, UK.
Dobzhansky T (1937) Genetics and the Origin of Species, 1st edn. Columbia University Press, New York, NY.
Dobzhansky T (1940) Speciation as a stage in evolutionary divergence. American Naturalist, 74, 312–321.
Drummond A, Ashton B, Buxton S et al. (2011) Geneious v5.4, Available from http://www.geneious.com.
Ellegren H (2008) Sequencing goes 454 and takes large-scale genomics into the wild. Molecular Ecology, 17, 1629–1631.
Emrich SJ, Barbazuk WB, Li L, Schnable PS (2007) Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Research, 17, 69–73.
Excoffier L, Lischer HEL (2010) Arlequin suite version 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10, 564–567.
Excoffier L, Smouse P, Quattro J (1992) Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics, 131, 479–491.
Fraser BA, Weadick CJ, Janowitz I, Rodd H, Hughes KA (2011) Sequencing and characterization of the guppy (Poecilia reticulata) transcriptome. BMC Genomics, 12, 202.
Funk DJ (1998) Isolating a role for natural selection in speciation: host adaptation and sexual isolation in Neochlamisus bebbianae leaf beetles. Evolution, 52, 1744–1759.
Funk DJ, Omland KE (2003) Species-level paraphyly and polyphyly: frequency, causes, and consequences, with insights from animal mitochondrial DNA. Annual Review of Ecology, Evolution, and Systematics, 34, 397–423.


Funk DJ, Nosil P, Etges WJ (2006) Ecological divergence exhibits consistently positive associations with reproductive isolation across disparate taxa. Proceedings of the National Academy of Sciences of the United States of America, 103, 3209–3213.
Garg R, Patel RK, Tyagi AK, Jain M (2011) De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification. DNA Research, 18, 53–63.
Gavrilets S (2004) Fitness Landscapes and the Origin of Species. Princeton University Press, Princeton, NJ.
Gompert Z, Nice CC, Fordyce JA, Forister ML, Shapiro AM (2006) Identifying units for conservation using molecular systematics: the cautionary tale of the Karner blue butterfly. Molecular Ecology, 15, 1759–1768.
Gompert Z, Forister ML, Fordyce JA, Nice CC, Williamson RJ, Buerkle CA (2010) Bayesian analysis of molecular variance in pyrosequences quantifies population genetic structure across the genome of Lycaeides butterflies. Molecular Ecology, 19, 2455–2473.
Hahn DA, Ragland GJ, Shoemaker DD, Denlinger DL (2009) Gene discovery using massively parallel pyrosequencing to develop ESTs for the flesh fly Sarcophaga crassipalpis. BMC Genomics, 10, 234.
Hale MC, McCormick CR, Jackson JR, DeWoody JA (2009) Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): the relative merits of normalization and rarefaction in gene discovery. BMC Genomics, 10, 203.
Hale MC, Jackson JR, DeWoody JA (2010) Discovery and evaluation of candidate sex-determining genes and xenobiotics in the gonads of lake sturgeon (Acipenser fulvescens). Genetica, 138, 745–756.
Hey J (2006) Recent advances in assessing gene flow between diverging populations and species. Current Opinion in Genetics & Development, 16, 592–596.
Hey J, Nielsen R (2004) Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics, 167, 747–760.
Hey J, Nielsen R (2007) Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proceedings of the National Academy of Sciences of the United States of America, 104, 2785–2790.
Hohenlohe PA, Bassham S, Etter PD, Stiffler N, Johnson EA, Cresko WA (2010) Population genomics of parallel adaptation in threespine stickleback using sequenced RAD Tags. PLoS Genetics, 6, e1000862, doi: 10.1371/journal.pgen.1000862.
Holt RA, Jones SJM (2008) The new paradigm of flow cell sequencing. Genome Research, 18, 839–846.
Hudson ME (2008) Sequencing breakthroughs for genomic ecology and evolutionary biology. Molecular Ecology Resources, 8, 3–17.
Jiggins CD, Naisbit RE, Coe RL, Mallet J (2001) Reproductive isolation caused by colour pattern mimicry. Nature, 411, 302–305.
Kirkpatrick M (2001) Reinforcement during ecological speciation. Proceedings of the Royal Society of London Series B-Biological Sciences, 268, 1259–1263.
Kirkpatrick M, Ravigné V (2002) Speciation by natural and sexual selection: models and experiments. American Naturalist, 159, S22–S35.
Kristiansson E, Asker N, Forlin L, Larsson DGJ (2009) Characterization of the Zoarces viviparus liver transcriptome using massively parallel pyrosequencing. BMC Genomics, 10, 345.
Langerhans RB, Gifford ME, Joseph EO (2007) Ecological speciation in Gambusia fishes. Evolution, 61, 2056–2074.
Mallet J, Beltran M, Neukirchen W, Linares M (2007) Natural hybridization in heliconiine butterflies: the species boundary as a continuum. BMC Evolutionary Biology, 7, 28, doi: 10.1186/1471-2148-7-28.
Margulies M, Egholm M, Altman WE et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437, 376–380.
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nature Reviews Genetics, 12, 671–682.
Mayr E (1947) Ecological factors in speciation. Evolution, 1, 263–288.
Mayr E (1963) Animal Species and Evolution. Harvard University Press, Harvard, MA.
Merrill RM, Gompert Z, Dembeck LM, Kronforst MR, McMillan WO, Jiggins CD (2011) Mate preference across the speciation continuum in a clade of mimetic butterflies. Evolution, 65, 1489–1500.
Meyer E, Aglyamova GV, Wang S et al. (2009) Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics, 10, 219.
Modrek B, Lee C (2002) A genomic view of alternative splicing. Nature Genetics, 30, 13–19.
Nadeau JN, Whibley A, Jones RT et al. (2012) Genomic islands of divergence in hybridizing Heliconius butterflies identified by large-scale targeted sequencing. Philosophical Transactions of the Royal Society, B, 367, 343–353.
Nei M (1987) Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.
Nosil P (2007) Divergent host plant adaptation and reproductive isolation between ecotypes of Timema cristinae walking sticks. American Naturalist, 169, 151–162.
Nosil P (2012) Ecological Speciation. Oxford University Press, Oxford.
Nosil P, Crespi BJ, Sandoval CP (2003) Reproductive isolation driven by the combined effects of ecological adaptation and reinforcement. Proceedings of the Royal Society of London Series B-Biological Sciences, 270, 1911–1918.
Nosil P, Egan SP, Funk DJ (2008) Heterogeneous genomic differentiation between walking-stick ecotypes: “Isolation by adaptation” and multiple roles for divergent selection. Evolution, 62, 316–336.
Nosil P, Funk DJ, Ortiz-Barrientos D (2009) Divergent selection and heterogeneous genomic divergence. Molecular Ecology, 18, 375–402.
Novaes E, Drost DR, Farmerie WG et al. (2008) High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics, 9, 312.
Parchman TL, Geist KS, Grahnen JA, Benkman CW, Buerkle CA (2010) Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics, 11, 180.
Peccoud J, Ollivier A, Plantegenest M, Simon J-C (2009) A continuum of genetic divergence from sympatric host races to species in the pea aphid complex. Proceedings of the National Academy of Sciences of the United States of America, 106, 7495–7500.
Peichel CL, Nereng KS, Ohgi KA et al. (2001) The genetic architecture of divergence between threespine stickleback species. Nature, 414, 901–905.
Rice WR, Hostert EE (1993) Laboratory experiments on speciation – what have we learned in 40 years? Evolution, 47, 1637–1653.
Rosenblum EB, Harmon LJ (2011) “Same same but different”: replicated ecological speciation at White Sands. Evolution, 65, 946–960.
Rundle HD, Nosil P (2005) Ecological speciation. Ecology Letters, 8, 336–352.
Rundle HD, Nagel L, Boughman JW, Schluter D (2000) Natural selection and parallel speciation in sympatric sticklebacks. Science, 287, 306–308.
Sandoval C, Carmean DA, Crespi BJ (1998) Molecular phylogenetics of sexual and parthenogenetic Timema walking-sticks. Proceedings of the Royal Society, B Biological Sciences, 265, 589–595.
Schluter D (2001) Ecology and the origin of species. Trends in Ecology & Evolution, 16, 372–380.
Schluter D (2009) Evidence for ecological speciation and its alternative. Science, 323, 737–741.
Schluter D, Conte GL (2009) Genetics and ecological speciation. Proceedings of the National Academy of Sciences of the United States of America, 106, 9955–9962.
Schwarz D, Robertson HM, Feder JL et al. (2009) Sympatric ecological speciation meets pyrosequencing: sampling the transcriptome of the apple maggot Rhagoletis pomonella. BMC Genomics, 10, 633.
Sousa VC, Grelaud A, Hey J (2011) On the nonidentifiability of migration time estimates in isolation with migration models. Molecular Ecology, 20, 3956–3962.
Stelkens RB, Young KA, Seehausen O (2010) The accumulation of reproductive incompatibilities in African cichlid fish. Evolution, 64, 617–633.


Stephens M, Donnelly P (2003) A comparison of bayesian methods for haplotype reconstruction. American Journal of Human Genetics, 73, 1162–1169.
Stephens M, Smith NJ, Donnelly P (2001) A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68, 978–989.
Strasburg JL, Rieseberg LH (2010) How robust are “isolation with migration” analyses to violations of the IM model? A simulation study. Molecular Biology and Evolution, 27, 297–310.
Strasburg JL, Rieseberg LH (2011) Interpreting the estimated timing of migration events between hybridizing species. Molecular Ecology, 20, 2353–2366.
Suzek BE, Huang HZ, McGarvey P, Mazumder R, Wu CH (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23, 1282–1288.
Tajima F (1983) Evolutionary relationship of DNA sequences in finite populations. Genetics, 105, 437–460.
Vera JC, Wheat CW, Fescemyer HW et al. (2008) Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Molecular Ecology, 17, 1636–1647.
Via S (2001) Sympatric speciation in animals: the ugly duckling grows up. Trends in Ecology & Evolution, 16, 381–390.
Vickery VR (1993) Revision of Timema Scudder (Phasmatoptera, Timematodea) including 3 new species. Canadian Entomologist, 125, 657–692.
Wakeley J (2008) Coalescent Theory: An Introduction. Roberts and Company Publishers, Greenwood Village, CO.
Wakeley J, Hey J (1997) Estimating ancestral population parameters. Genetics, 145, 847–855.
Weber APM, Weber KL, Carr K, Wilkerson C, Ohlrogge JB (2007) Sampling the arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiology, 144, 32–42.
Weir BS (1996) Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Sinauer Assoc., Inc., Sunderland, MA, USA.
Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution, 38, 1358–1370.
Wheat CW (2010) Rapidly developing functional genomics in ecological model systems via 454 transcriptome sequencing. Genetica, 138, 433–451.
Woerner AE, Cox MP, Hammer MF (2007) Recombination-Filtered genomic datasets by information maximization. Bioinformatics, 23, 1851–1853.
Wu C-I (2001) The genic view of the process of speciation. Journal of Evolutionary Biology, 14, 851–865.
Xia Q, Zhou Z, Lu C et al. (2004) A draft sequence for the genome of the domesticated silkworm (Bombyx mori). Science, 306, 1937–1940.
You FM, Huo NX, Gu YQ et al. (2008) BatchPrimer3: a high throughput web application for PCR and sequencing primer design. BMC Bioinformatics, 9, 253.

Data Accessibility

Assembled EST contigs and singletons: DRYAD repository: doi:10.5061/dryad.70mv450r

Sanger-sequenced nuclear DNA and mtDNA sequences: GenBank accessions: JQ337965–JQ339237. See Appendix S3 for metadata associated with Sanger-sequenced nuclear DNA and mtDNA sequences.

Supporting information

Additional supporting information may be found in the online version of this article.

Table S1 Number and percentages of unique best BLASTx matches of 454 EST contigs, singletons, and the combined sequence set to UniRef50 grouped by taxonomic category.

Table S2 Numbers of di-, tri-, and tetranucleotide simple sequence repeats (SSRs) occurring in contigs and singletons.

Table S3 Locus length and primers sequences.

Fig. S1 The distribution of read lengths resulting from pyrosequencing of a normalized cDNA template on 1.25 plate runs on the 454 GS XLR Titanium platform.

Appendix S1 Primer sequences, annealing temperatures, product size, repeat motif, number of repeats and expected product lengths for 15 278 SSRs detected in T. cristinae pyrosequenced ESTs.

Appendix S2 Primer sequences, annealing temperatures, product size and expected product lengths for 14 002 regions of 500 bp with suitable priming sites detected in T. cristinae pyrosequenced EST contigs.

Appendix S3 Metadata associated with Sanger-sequenced nuclear DNA and mtDNA.

Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

CAB, PN, and TLP designed research; AAC, MS, TS, CAB, TEF, PN, and TLP performed research; AAC, MS, and TLP analyzed data; and AAC, MS, TS, CAB, TEF, PN, and TLP wrote the manuscript.
