The Pennsylvania State University

Eberly College of Science

Biology Department

FUNCTIONAL TRANSCRIPTOMICS:

ECOLOGICALLY IMPORTANT GENETIC VARIATION IN NON-MODEL ORGANISMS

A Dissertation in

Biology

by

Juan Cristobal Vera

© 2012 Juan Cristobal Vera

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

May 2012

The thesis of Juan Cristobal Vera was reviewed and approved* by the following:

James Marden
Professor of Biology
Dissertation Advisor

Stephen Wade Schaeffer
Associate Professor of Biology
Chair of Committee

Webb Miller
Professor of Biology and Computer Science and Engineering

Iliana B. Baums
Assistant Professor of Biology

Christina Grozinger
Associate Professor of Entomology

Charles R. Fisher
Professor of Biology
Assistant Department Head for Graduate Education

*Signatures are on file in the Graduate School

ABSTRACT

With the advent of second-generation, high-throughput sequencing, genomic molecular resources once limited to the study of molecular model organisms have become widely available. The ability to sequence a transcriptome (i.e. the transcribed portion of the genome) without reference to a sequenced genome is a particularly useful, efficient, and cost-saving method for applying genome-wide analyses to ecologically interesting organisms with little or no existing genetic data. Here I present an introduction to the tools and methods I developed specifically for de novo transcriptomics, as well as the results of an analysis of the ecologically well-studied Glanville fritillary butterfly (Melitaea cinxia) and the two coral species Pocillopora damicornis and Acropora millepora. The analyses cover sample preparation, assembly of reads into contiguous consensus sequences, annotation, and verification of the quality and completeness of the resulting de novo transcriptome. Additional tools and methods were developed for running simulations and other comparative quality analyses, as well as for SNP discovery in a de novo transcriptome. The de novo M. cinxia transcriptome was scanned for SNPs, which were used in a polymorphism survey to detect candidate genes with the potential to be undergoing long-term balancing selection. One of the top candidates was the juvenile hormone acid methyltransferase gene (jhamt), which encodes the final metabolic enzyme in the Juvenile Hormone synthesis pathway. After partial molecular characterization, jhamt was found to consist of at least two paralogs in M. cinxia, both of which contain extensive allelic polymorphism. In addition, these methods were verified in a transcriptome-wide environmental association study of adaptive genetic variation related to environmental variation in natural populations of the two coral species, where several candidate genes potentially involved in environmental stress response were identified. In conclusion, this work develops a set of novel tools and methods useful for initiating genome-wide genetic and evolutionary analysis in ecologically interesting organisms that lack the resources previously available only for molecular model organisms, and presents the results of two such studies as examples for future researchers.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
PREFACE
ACKNOWLEDGEMENTS

Chapter One. A brief history of transcriptomics
    References

Chapter Two. Rapid transcriptome characterization for a non-model organism using 454 pyrosequencing
    Introduction
    Materials and Methods
    Results
        Sequencing and assembly
        Quality and performance of the 454 assembly
        Transcriptome coverage breadth
        Functional annotation
        SNP discovery
        Alternative splicing effects on assembly
        Performance of microarray probes
        Metatranscriptomics and detection of an intracellular parasite
    Discussion
    References

Chapter Three. PipeMeta: a Toolset for de novo Transcriptome Assembly with an Additional New Method for Transcriptome Quality Analysis
    Introduction
    Materials and Methods
        M. cinxia 454 Reads
        Additional Assembly and Annotation
        PipeMeta Pipeline
        Modified in silico Predicted Gene Set
        Simulated Transcriptome Reads
        Additional Simulated Reads
        TranscriptSimulator
        Simulated Assembly
        Comparison of Assemblies: Metrics
        Comparison of Assemblies: by Reference
    Results
        Assembly of M. cinxia Reads
        Assembly of Simulated Lepidoptera Reads
        Transcript Abundance for M. cinxia and Simulated Reads
        Comparison of Assemblies: Using Metrics to Assess Assembly Quality
        Evaluation of a Real Assembly by Using a Reference Transcriptome and Simulation
    Conclusions
    References

Chapter Four. Scanning the Glanville fritillary Transcriptome for SNPs and Partial Characterization of a Polymorphic Life History Gene
    Introduction
    Methods
        SNPHunter
        M. cinxia 454 reads
        Genome-wide scan of polymorphism in a population of M. cinxia
        Partial Characterization of JHaMT
    Results
        M. cinxia 454 read sequencing, assembly, annotation
        SNPs in M. cinxia
        Genome-wide scan of polymorphism in a population of M. cinxia
        Partial Characterization of JHaMT in M. cinxia
    Conclusions
    References

Chapter Five. Environmentally Associated Genetic Variation in Two Coral Species through de novo Transcriptomics
    Introduction
    Methods
        P. damicornis and A. millepora polymorphism in 454 reads and association with environmental variables
        Targeted SNP detection in P. damicornis and A. millepora
        Validation of environmental SNP associations in independent samples
        Statistical Analysis of P. damicornis and A. millepora
    Results
        P. damicornis 454 read sequencing, assembly, and annotation
        Allele frequencies and environmental correlations in P. damicornis and A. millepora
    Conclusions
    References

LIST OF FIGURES

Figure 2-1. Gel image of two cDNA collections (A is larval/pupal material; B is adult) made from poly(A+) mRNA, before and after cDNA normalization. Size markers are shown in the lane at the left; normalized cDNA lanes are labeled N-cDNA. Note the high relative abundance of about 10 transcripts (distinct bands) in the cDNA lanes and complete absence thereof after normalization.

Figure 2-2. Characteristics of assembled M. cinxia 454 contigs, and blastx alignments against Bombyx mori. A,B; Length and coverage of contigs. Variable bar width in plot A is an artifact of the logarithmic horizontal axis; there is no information content in the area of these bars. Note the logarithmic vertical axes; the majority of contigs were small and had low coverage depth. C,D; Frequency distributions of percent identity and deduced amino acid alignment length for all blast hits (bitscore > 45) of M. cinxia contigs and singletons (N = 13,653) to 6,289 Bombyx mori predicted proteins.

Figure 2-3. Examples of blast alignments of M. cinxia assembled Sanger ESTs against M. cinxia 454 contigs. A: An alignment that is representative of the average alignment length and percent identity (see Table 4). Note the variable presence of what appears to be an alternatively spliced microexon (single codon). B: The longest Sanger vs. 454 alignment.

Figure 2-4. Upper; Schematic showing how local clusters were often not joined together, primarily because the sequence reads needed to join clusters into one full-length, or nearly full-length, contig were lacking. Naturally occurring variation in the form of polymorphism and alternative splicing, and bias toward increased coverage of transcript end regions (Bainbridge et al 2006; see Figure 4a), can hamper assembly, even where there is high coverage and a low rate of sequencing errors. Lower; Within clusters of multiple reads, polymorphic sites, or SNPs, are readily identified. The right side of this alignment shows an example of how sequencing error is ignored in the consensus sequence (underlined sequence above the alignment).

Figure 2-5. A: Example of sampling biases (Bainbridge et al 2006) present in 454 sequencing. Sequence ends, particularly the 5’ end, tend to be over-represented in 454 ESTs, with under-representation of the immediate flanking region. This pattern is discernible in the most heavily covered contigs in which there was enough sampling of the end-flanking region to join the end to the remainder of the contig. B: Relatively even coverage of the remainder of the same contig shown in A. C: Bimodal distribution of coverage depth as a function of contig length. The upward sloping pattern is expected (i.e. contigs become longer when there is more sequence coverage, up to the point of full length assembly). The circled portion of the plot contains contigs in which there was high coverage but little or no length extension. Few of these generated blast hits and thus appear to be over-sampled end regions (presumably primarily UTR) that lacked sufficient flanking coverage to allow them to assemble with the remainder of ESTs for that gene.

Figure 2-6. Comparison of M. cinxia 454 contig length to the coding region length of B. mori orthologous genes. Orthologs were identified by blastx (bitscore > 45). A: Ratio of 454 contig to B. mori ortholog length in relation to coverage depth (average number of ESTs per nucleotide site within a contig). Ratios greater than unity are caused primarily by the presence in 454 contigs of untranslated regions that extend beyond the coding region. Solid curve shows sigmoid fit for all B. mori genes; dashed curve shows fit for B. mori genes 800 – 1800 bp in length; line shows fit for those greater than 1800 bp in length. Where average coverage depth was ≥10, nearly all 454 contigs were as long or longer than the coding region of their putative B. mori ortholog, suggesting full length assembly. B: Ratio of B. mori ortholog length to the summed non-overlapping alignment length of M. cinxia sequences that had a best blast hit (bitscore > 45) to that B. mori gene (i.e. all putative fragments of individual genes).

Figure 2-7. Venn diagram showing distribution of similarity search results. Numbers are the sum of unique Bombyx mori proteins, unique Drosophila melanogaster or Uniprot proteins, and unique Butterflybase (Lepidoptera species excluding B. mori) or Heliconius erato EST clusters that had best blast hits (> 45 bitscore) to a 454 unigene. Priority hierarchy for determining uniqueness of proteins or EST clusters in overlap regions (i.e. cases where multiple 454 unigenes aligned to a unigene shared between the respective databases) was B. mori protein > Uniprot protein or D. melanogaster protein > EST cluster name. Sequences with best blast hits to non-metazoan proteins are excluded from these counts.

Figure 2-8. A-C; Frequency distributions of log10-transformed median spot intensity (corrected for local background and with a constant added to make all values positive for log transformation) of microarray probes. N = 602 negative control probes; N = 125,655 probes for contigs and singletons that had significant blast hits; N = 113,862 unannotated contigs. This initial test array had low signal intensity overall (note lack of saturated spots [intensity values of 4.8]) and thus yields conservative estimates of the number of quality probes. The greater number of very low signals in B, C compared to A is due to the extreme difference in sample size. D: Distribution of log10-transformed median spot intensity for the lesser (upper plot) and greater (lower plot) performing probe orientation for all probes of unannotated genes for which one probe had an intensity of at least 2.5 (i.e. a strong signal in at least one probe orientation). Signals of the lesser performing probes cluster at just above the intensities obtained from negative control probes, with an average 10-fold difference among these differently oriented probe pairs.

Figure 2-9. Alternative splicing near the 3’ end of the previously characterized M. cinxia muscle gene troponin-i. A) Conjoining of ESTs (red, green, and yellow boxes) representing partial nucleotide sequences of three alternatively spliced mutually exclusive versions of exon six in 454 assembly alignment. Note the degenerate nucleotide string in the consensus sequence (blue box). B) Amino acid alignment of the same three alternatively spliced exons (red, green, and yellow exons) with Bombyx mori sequence for comparison (bottom yellow). The top alternative exon, A1a (purple), was recovered full length in a separate 454 contig. There was one additional, much shorter 454 contig partially spanning the alternatively spliced region representing most of TnI_A3a (green). Reassembly of all of the troponin-i 454 ESTs using more stringent assembly parameters produced four contigs spanning this region with full (A1a and A4a) and partial (A2a and A3a) coverage of the four versions of exon six. Full characterizations of three of the alternative exons (A2a, A3a, and A4a) had been achieved in previous experiments (J. Vera; unpublished) using standard PCR, RACE, and cloning methods and were needed to unravel the complexity of the effect of alternative splicing on these 454 contigs.

Figure 2-10. Probes from 454 sequence generally produce microarray spot intensities comparable to those of probes designed from Sanger sequence of the same gene (r² = 0.57; N = 892). Diagonal is the line of identity. There was no systematic tendency for either type of probe to yield better performance.

Figure 2-11. Examples of 44K-feature Agilent arrays with probes made from 454 sequence data. A,B; Comparison of raw hybridization intensity from the two channels of microarrays hybridized with indirectly labeled aRNA from whole abdomens of 2-day old female M. cinxia butterflies. Note the substantial increase in transcript level of microsporidia genes (larger open circles) in butterfly 104 compared to butterfly 168 (B), and the apparent absence of microsporidia in the two other individuals (A). C: Repeatability of signal intensity (mean of three replicates per feature) across two microarrays hybridized with labeled abdominal aRNA from a single individual.

Figure 3-1. Comparison of contig and reference clustered or actual transcript copy number distributions for real and simulated M. cinxia reads. Panels display number of occurrences of copy number by contig and B. mori reference clustered read copy number or actual read copy number (for panel d), weighted by summed underlying read nucleotide counts. All data points were log transformed to show the linear relationship common to a power law distribution (y = ax^α + ε). Colors represent underlying (i.e. assembled) read nucleotide counts. a. Seqman pro assembled, B. mori reference clustered real M. cinxia reads binned by copy number (α = 1.97, xmin = 1; p-value = 0.11, Kolmogorov-Smirnov [KS] test for power law goodness of fit [Clauset, Shalizi et al. 2009]). b. Mira assembled, B. mori reference clustered real M. cinxia reads binned by copy number (α = 1.95, xmin = 2; p-value = 0.10, KS test). c. Optimally assembled simulated M. cinxia reads clustered by contig and binned by copy number (α = 2.74, xmin = 82; p-value = 0.92, KS test). d. Simulated M. cinxia reads clustered by transcript (i.e. actual transcript copy numbers) and binned by copy number (α = 2.75, xmin = 93; p-value = 0.88, KS test).

Figure 3-2. Comparison of the transcript copy number distributions of nine sets of simulated reads with varying reference transcript pool sizes and sequencing efforts. Each panel displays the number of occurrences of a transcript copy number by the actual transcript copy number. Simulated reads were generated from (from top to bottom [a.x.-c.x.]: ~14.1K, ~28.3K, and ~56.6K sequences) reference transcript sequences (i.e. modified B. mori predicted reference transcript pool), for varying amounts of sequencing effort (from right to left [x.a.-x.c.]: ~50K, ~100K, and ~1M simulated reads). Transcript copy numbers were derived from a power law distribution during simulated sequencing of the reference transcripts (α = ~2.7). This comparison illustrates how sampling (i.e. sequencing) from a finite set of reference transcripts causes oversampling of the reference sequences at increasing sequencing efforts, shifting the resulting read distribution shape towards log normal as lower copy numbers become less common. As the reference transcript pool used for sequencing gets larger, more sequencing effort is required to shift the distribution towards a log normal shape. This suggests a quantifiable relationship between reference transcript pool abundance (i.e. complexity), assembly quality, and sequencing effort. a.a. Simulated read set3A. a.b. Simulated read set3B. a.c. Simulated read set3C. b.a. Simulated read set3D. b.b. Simulated read set3E. b.c. Simulated read set3F. c.a. Simulated read set3G. c.b. Simulated read set3H. c.c. Simulated read set3I.

Figure 3-3. Comparison of read copy number distributions for three of the simulated read sets from Figure 3-2 (Set3C, Set3F, and Set3I) after optimal assembly. Panels display the number of occurrences of a copy number by the contig clustered read copy number (i.e. similar to what is observable in real data), weighted by assembled read nucleotide counts. This comparison illustrates the progression towards the more typical power law distribution shape (i.e. lower xmin value and less like a log normal distribution) observed in the real M. cinxia read copy numbers (see Table 1) as the reference transcript pool being sampled increases in size. In this case, the progression is due mostly to increased fragmentation of transcript sequences at lower coverages (i.e. lower values on the x-axis). All data points were log transformed to show the linear relationship common to a power law distribution (y = ax^α + ε). Colors represent underlying (i.e. assembled) read nucleotide counts. a. Set3C consists of ~1M simulated reads (avg. read length ± SD = 250±75 bp) derived from ~14.1K reference transcripts. b. Set3F consists of ~1M simulated reads derived from ~28.3K reference transcripts. c. Set3I consists of ~1M simulated reads derived from ~56.6K reference transcripts.

Figure 3-4. Comparison of three basic assembly quality metrics (N contigs, average contig length, and top 10% average contig length) for five simulated data sets and the M. cinxia reads, using up to four assembly methods (Optimal [simulated reads only], Seqman pro, Mira, and CLC Cell). a. Number of contigs assembled increases with complexity of the simulated data set and begins to resemble the M. cinxia assembly (tReal) in terms of both absolute size and relative proportions between assembly methods with the addition of simulated polymorphism in SimSet2D. b. Average contig length of assembled reads decreases with complexity of the simulated data set and begins to resemble the M. cinxia assembly with the addition of simulated polymorphism in SimSet2D. c. Average contig length of top 10% of the assembly displays almost identical behavior to average contig length, with the exception of slightly increased relative performance by CLC in the simulated data sets.

Figure 3-5. Comparison of two basic quality metrics (percentage of read nucleotides assembled and average contig coverage depth) for five simulated data sets and M. cinxia reads, using up to four assembly methods (Optimal [simulated reads only], Seqman pro, Mira, and CLC Cell). a. Percent of nucleotides assembled does not respond to increasing simulated sequence complexity, except for a slight drop in nucleotides assembled by CLC. The metric does display a small relative response to transcript abundance distribution differences (i.e. changes between SimSet1A and SimSet2A), which perhaps explains the proportional differences observed in the real data. Large scale polymorphisms (lacking in the simulated data) are likely responsible for the overall drop in the real data. b. Average contig coverage depth shows little relationship to increasing complexity of the simulated data sets or to assembly method in general. However, it does drop abruptly relative to the M. cinxia assemblies, perhaps due to additional biological complexity not captured by the simulated data (e.g. original size of transcript pool and large scale polymorphism).

Figure 3-6. Comparison of two assembly quality metrics (N50 length & Nc50 length, see methods for definitions) for five simulated data sets and the real M. cinxia reads, using up to four assembly methods (Optimal [simulated reads only], Seqman pro, Mira, and CLC Cell). a. N50 length decreases in general with increasing simulated data complexity and drops abruptly for the real data set, but only the Mira assemblies show a relative trend to increasing complexity. b. The proposed new metric, Nc50, varies consistently between assemblies until SimSet2D (simulated data set with random polymorphism added), at which point it begins to proportionally resemble more closely the real data set. Seqman pro real data set assembly has an unusually large Nc50 length relative to Mira and CLC assemblies, indicating a larger concentration of the sequencing effort (i.e. read nucleotides) was assembled into longer contigs.

Figure 3-7. Comparison of rate of assembly error in the five simulated read sets for two types of error, contig assignment error rate and read assembly failure rate. a. Mira had the lowest contig assignment error rate (i.e. fraction of simulated reads that were assigned to the wrong contig) and CLC had the highest for all complexity levels. There was a relatively large increase in contig assignment error for all sets starting with SimSet2C, the set derived from unmodified B. mori reference transcripts (i.e. containing an extra ~400 transcripts consisting of gene duplications and alternative splice forms, as well as pseudo-genes and other gene prediction errors). b. Mira also generally had the lowest read assembly failure rate at higher levels of simulated read complexity, with the notable exception of SimSet1A (SimSets 1A and 2A are the lowest complexity sets [see methods], and are replicates of each other except for their distribution of read copy numbers). The relatively high error rates of Seqman pro and Mira for SimSet1A are most likely attributable to that set's specific distribution of read copy numbers across specific DNA sequences.

Figure 3-8. Comparison of conservative estimates of percent B. mori amino acid coverage by three sets of M. cinxia unigene sequences at four alignment identity stringencies (i.e. Blast alignment minimum percent identities). Amino acid coverage was calculated from summed top hit Blast alignments (including multiple non-redundant alignments for each top hit) of M. cinxia unigenes to B. mori modified predicted proteins (see methods). a. Percent of B. mori amino acid coverage by M. cinxia Seqman pro, Mira, and CLC Cell assembled unigenes at minimum percent identities of 90%, 70%, 50% and 30%. Mira unigenes consistently have better alignment to the B. mori reference at all alignment stringencies. b. Percent of redundant B. mori amino acid coverage by M. cinxia Seqman pro, Mira, and CLC Cell assembled unigenes at minimum percent identities of 90%, 70%, 50% and 30% (i.e. percent of B. mori amino acids with two or more M. cinxia unigenes with top alignments to that position). Mira unigenes have a consistently higher rate of reference coverage redundancy, indicating either a higher fractionation of reads during assembly or a higher level of stringency for incorporation of reads into assembly (i.e. excludes more reads as singletons).

Figure 3-9. Comparison of conservative estimates of percent B. mori amino acid coverage by twenty sets of simulated unigene sequences (i.e. five simulated read sets assembled by four methods). Amino acid coverage was calculated from summed top hit Blast alignments (including multiple non-redundant alignments for each top hit) of simulated unigenes to B. mori modified predicted proteins (see methods). a. Percent of B. mori amino acid coverage by simulated Optimum, Seqman pro, Mira, and CLC Cell assembled unigenes (minimum percent identity of 90%) for five sets of simulated reads with increasing complexity. Mira unigenes consistently approach closer to optimum assembly coverage of the reference. Unlike for real M. cinxia reads, with these simulated sets CLC assembled unigenes slightly outperform Seqman pro unigenes for this metric. b. Percent of redundant B. mori amino acid coverage by simulated Optimum, Seqman pro, Mira, and CLC Cell assembled unigenes (minimum percent identity of 90%) (i.e. percent of B. mori amino acids with two or more simulated unigenes with top alignments to that position). As with real M. cinxia reads, Mira assembled reads have a consistently higher rate of reference coverage redundancy, most notably with SimSet2D (i.e. simulated reads from unmodified B. mori transcript with both error and polymorphism). c. Percent of B. mori amino acid coverage by simulated Optimum, Seqman pro, Mira, and CLC Cell assembled unigenes (minimum percent identity of 90%) for five sets of simulated reads with increasing complexity, normalized versus percent coverage of optimally assembled unigenes.

Figure 3-10. Number of occurrences of summed polymorphism cluster lengths in M. cinxia contigs versus the binned polymorphism cluster lengths. All data points were log transformed to display the linear (if slightly curved) relationship necessary for power law distributed variables. Contig sequences from both a) Seqman Pro and b) Mira were modified to display all underlying polymorphism (using SNPHunter, part of the PipeMeta package) in the form of degenerate nucleotides (DNs) and then scanned for clusters with two or fewer internal nucleotide spacing. DN cluster lengths were grouped by contig and then binned by cluster length before analysis. Parameters alpha and xmin were calculated through the maximum-likelihood method as described in Clauset, 2007 (α = 3.5, Seqman xmin = 9, Mira xmin = 27).

Figure 4-1. Alignment of four partial M. cinxia putative JHaMT paralogous amino acid sequences at the N terminal region, shown over B. mori JHaMT protein, B. mori JHaMT predicted protein, Helicoverpa armigera JHaMT protein, and Spodoptera litura JHaMT protein. Groups were created through realignment and clustering of M. cinxia putative JHaMT assembly unigenes and re-sequencing of individual butterflies (see methods). M. cinxia JHaMT Group1a represents an allele containing a 25 amino acid deletion.

Figure 4-2. Average contig coverage depths versus contig lengths in the M. cinxia assembly v2.0. Colors represent relative SNP abundance, which ranges from zero to 89 in the whole assembly. a. Average contig coverage depths adjusted to exclude filtered assembly read nucleotides versus contig lengths after filtering of suspect assembly regions (i.e. excluded due to lack of annotation, homopolymer run, or very high density polymorphism) for 11,535 annotated contigs. Distribution is volcano shaped, with a peak at roughly 300 bp. b. Average contig coverage depths for v1.0 Aland read-containing contig sections versus contig lengths for v1.0 Aland read-containing contig sections for 1,203 annotated, metazoan contigs from a portion of the assembly with coverage depth greater than 5X and contig length greater than 75 bp. Color distribution is based on SNP distributions from panel ‘a’. Spatial distribution is volcano shaped, with a peak at roughly 250 bp. c. Average raw contig coverage depths versus raw contig lengths for all 58,124 M. cinxia v2.0 contigs. Black colored data points represent contigs with zero SNPs. Distribution is volcano shaped, with a peak at roughly 1,000 bp.

Figure 4-3. Comparison of SNP distribution in raw and filtered, partitioned M. cinxia assembly v2.0. a. Occurrences of number of SNPs per contig versus number of SNPs per contig in the raw (i.e. no filtering or partitioning) assembly. b. Occurrences of number of SNPs per v1.0 Aland contig sections versus number of SNPs per v1.0 Aland contig sections in the filtered and partitioned assembly (see methods).

Figure 4-4. Leverage plots of residual effects of the adjusted theta model (i.e. log (theta+1/length)) of polymorphism versus two pervasive biases in transcriptome assembly data (variable length and depth of coverage). a. Model residuals by the log of average coverage depths of Aland v1.0 read-containing contig sections in the v2.0 M. cinxia partitioned assembly. This plot shows that the model successfully accounts for variable coverage depth effects in the portions of the assembly at contig average coverage depths greater than 5x and contig lengths greater than 75 bp (n = 1203, R² = 0.046, P-value = 0.43). b. Model residuals by the log of contig lengths of Aland v1.0 read-containing contig sections in the v2.0 M. cinxia partitioned assembly. In this plot variable contig section length effects in the partitioned assembly remain significant, but account for less than 5% of the variation in the model values (n = 1203, R² = 0.046, P-value < 0.0001).

Figure 4-5. Leverage plots of residual effects of the adjusted theta model (i.e. log (theta+1/length)_NS) of non-synonymous polymorphism versus two pervasive biases in transcriptome assembly data (variable length and depth of coverage). a. Model residuals by the log of average coverage depths of Aland v1.0 read-containing contig sections in the v2.0 M. cinxia partitioned assembly. This plot shows that the model successfully accounts for variable coverage depth effects in the portions of the assembly at contig average coverage depths greater than 5x and contig lengths greater than 100 bp (n = 1004, R² = 0.15, P-value = 0.44). b. Model residuals by the log of contig lengths of Aland v1.0 read-containing contig sections in the v2.0 M. cinxia partitioned assembly. In this plot variable contig section length effects in the partitioned assembly remain significant, but account for only 14% of the variation in the model values (n = 1004, R² = 0.14, P-value < 0.0001).

Figure 4-6. Two candidate models for unbiased estimates for Theta (θ = 4Nu; Long, 2006) in pooled sample, multi-individual de novo transcriptome assemblies. The correction is made by dividing by the harmonic value (i.e. sum of the reciprocals) of the average sequence coverage depth in order to devalue polymorphism contributions from extremely deep regions. At depths below twice the number of individuals in the pool (>160x for M. cinxia), this corrects for the possibility of sampling a specific allele from a single individual more than once in order to remove bias. a. Basic unbiased theta estimator. b. Log transformed theta estimator divided by the sequence length. Adding one to the number of segregating sites (i.e. SNPs) allows contigs without SNPs to count towards the estimate. Dividing by the length of each sequence section attempts to account for contig length variability in the transcriptome.

Figure 4-7. Juvenile Hormone (JH) metabolic pathway in B. mori, as shown in Zhan, 2011. Juvenile hormone acid methyltransferase is the final step in the conversion of Acetyl-CoA into JH in Lepidopterans...... 122

Figure 4-8. Polymorphic region of jhamt cluster 4 alignment. Alignment is near the 5’ start site of the transcript. Several SNPs are outlined by mismatch colored boxes starting at position 81...... 123

Figure 5-1. Map of sampling locations for P. damicornis (type A) (blue), P. damicornis (type B) (yellow) and A. millepora (red)...... 143

Figure 5-2. Map showing allele frequency patterns of three SNP loci (β-gamma crystallin right, Galaxin middle and Ubiquitin left) in A. millepora with a significant correlation to temperature categories, plotted on spatial data for average sea surface temperatures between the years 2000 and 2006. Map courtesy of the e-Atlas (http://e-atlas.org.au)...... 144

Figure 5-3. Map showing allele frequency patterns of two SNP loci (Thioredoxin right, Ligand of numb X2 left) in A. millepora with a significant correlation to turbidity, plotted on spatial data for secchi depths collected between 1992 and 2006. Map courtesy of the e-Atlas (http://e-atlas.org.au)...... 145

LIST OF TABLES

Table 2-1. Settings used in Seqman Pro 7.1 to assemble 454 ESTs...... 34

Table 2-2. Number of reads and nucleotides produced by two 454 sequencing runs...... 35

Table 2-3. Sanger EST statistics...... 36

Table 2-4. Summary statistics for blast of assembled M. cinxia Sanger sequences against M. cinxia 454 contigs and singletons. All blast results refer to hits with bitscores greater than or equal to 45. Percentage length coverage refers to the total extent of Sanger sequences covered by aligned 454 sequences. Percent identity is reduced by polymorphism and reduced accuracy of Sanger sequencing at end regions; it is not a valid measure of the 454 error rate. Alignment lengths refer to nucleotides (not deduced amino acids as in Fig 3 and Table 3)...... 37

Table 2-5. Summary statistics for blast analyses of M. cinxia 454 contigs and singletons against predicted B. mori proteins, Uniprot proteins, and translated clustered ESTs in ButterflyBase. A minimum bitscore of 45 was used as the cutoff threshold for all of these analyses. Sequences that had best hits in Uniprot that were non-metazoans (N = 716) are excluded. Values in parentheses are medians...... 38

Table 2-6. GO assignments using only Arthropod blastx alignments...... 39

Table 2-7. Taxonomic group of the blast subject species for 454 contigs and singletons (N = 699) that had significant (bitscore >45) best blast alignments to apparent xenobiotic genes...... 40

Table 2-8. Transcriptome space and coverage comparison among relevant taxa. Expected coverage refers to the average coverage depth predicted for a genome of that size from a 454 sample of our size, or in the case of Medicago truncatula, to the average coverage depth obtained in that study (Cheung et al. 2006)...... 41

Table 3-1. M. cinxia data set statistics. All sets (real and simulated) consist of two populations of reads combined for assembly. Version one consists of the original 454 GS20 reads (reported in Vera, et al., 2008) or the simulated equivalents. Version two consists of the additional unpublished 454 FLX reads or simulated equivalents. Modified sets were created from B. mori in silico predicted reference genes modified by removal of similar (i.e. >90% identity) but shorter (or equal length) sequences, while unmodified sets retain closely related gene family members and possibly some other complicating factors (e.g. pseudo-genes and alternative splice forms)...... 80

Table 3-2. Assembly statistics for real M. cinxia data set. Standard assembly descriptive statistics and quality metrics of assembly results for three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the underlying sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing (unlike the N50 metric) as opposed to genome sequencing...... 81

Table 3-3. Comparison of three sets of real M. cinxia assembly Unigenes to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins by M. cinxia at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Raw nucleotide polymorphism rates were estimated using SNPHunter script (see methods). Coverages, overlaps, and fragmentation estimates were calculated for multiple minimum percent identities (for alignments > 90%, 70%, 50%, and 30%) at a minimum bitscore threshold of 40...... 82

Table 3-4. Assembly statistics for simulated M. cinxia data set 1A derived from modified B. mori reference set without errors or polymorphism. Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process...... 83

Table 3-5. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 1A assembled Unigenes (see Table 3-4) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene...... 84

Table 3-6. Assembly statistics for simulated M. cinxia data set 2A, derived from modified B. mori reference set without errors or polymorphism. Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process...... 85

Table 3-7. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 2A assembled Unigenes (see Table 3-6) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene...... 86

Table 3-8. Assembly statistics for simulated M. cinxia data set 2B, derived from modified B. mori reference set without errors or polymorphism. Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process...... 87

Table 3-9. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 2B assembled Unigenes (see Table 3-8) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene...... 88

Table 3-10. Assembly statistics for simulated M. cinxia data set 2C, derived from modified B. mori reference set without errors or polymorphism. Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process...... 89

Table 3-11. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 2C assembled Unigenes (see Table 3-10) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene...... 90

Table 3-12. Assembly statistics for simulated M. cinxia data set 2D, derived from modified B. mori reference set without errors or polymorphism. Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process...... 91

Table 3-13. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 2D assembled Unigenes (see Table 3-12) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene...... 92

Table 4-1. Putative juvenile hormone acid methyltransferase M. cinxia unigenes in assembly v.2.0 (* alignment to unigenes was weak, i.e. < 45 bitscore). Evalues and bitscores refer to alignments between M. cinxia and Uniprot database. Unigene versus B. mori and B. mori versus Uniprot Blast alignments are all in agreement...... 110

Table 4-2. Top 25 M. cinxia assembly version 2.0 contigs sorted by polymorphism rate. Basic metric values used in the sorting method (see methods) are shown, including the values for the sorting model (i.e. Log(Theta+1/bp))...... 111

Table 4-3. Top 25 M. cinxia assembly version 2.0 contigs sorted by polymorphism rate. Juvenile hormone methyltransferase is sixth in overall polymorphism level once length, depth, and population sample biases inherent to a transcriptome have been minimized. All annotations were derived from Blast alignment of B. mori predicted proteins vs. Uniprot DB except for entries marked with **, which are M. cinxia contigs vs. Uniprot...... 112

Table 4-4. Top 25 M. cinxia assembly version 2.0 contigs sorted by non-synonymous (NS) polymorphism rate. Basic metric values used in the sorting method (see methods) are shown, including the values for the sorting model (i.e. Log(Theta+1/bp))...... 113

Table 4-5. Top 25 M. cinxia assembly version 2.0 contigs sorted by non-synonymous (NS) polymorphism rate. Juvenile hormone methyltransferase is ninth in NS polymorphism level once length, depth, and population sample biases inherent to a transcriptome have been minimized. All annotations were derived from Blast alignment of B. mori predicted proteins vs. Uniprot DB except for entries marked with **, which are M. cinxia contigs vs. Uniprot...... 114

Table 4-6. SNPHunter assembly and filtering metrics for M. cinxia v2.0 assembly...... 115

Table 5-1. Summary of 454 sequence data for P. damicornis (type B). Contig = consensus sequence consisting of combined original sequences that showed significant sequence overlap...... 137

Table 5-2. List of SNPs and contig information for the eight SNP loci that were included in the final analysis in P. damicornis (type B). SNP name constructed in the following manner: (Contig number_SNP nucleotides_position of SNP in trimmed contig). SNP nucleotides are ordered with the more common nucleotide first, followed by the rare allele. Genes marked with * were also amplified for type A...... 138

Table 5-3. Selected genes of interest in Acropora millepora, SNP codon position (all targeted SNPs were non-synonymous mutations), and their putative role in the stress response. SNP name constructed in the following manner: (Contig number_SNP nucleotides_position of SNP in trimmed contig)...... 139

Table 5-4. List of sampling locations, listed from north to south, reef types, secchi and thermal categories (each divided into 9 and 7 categories respectively, thermal 1-7 coldest to warmest, secchi 1-9 lowest to highest secchi depth) and number of samples of each species...... 140

Table 5-5. Summary of statistics for the SNP loci that show a significant correlation between allele frequency distribution and environmental category in P. damicornis (type A or B). Coeff LR = regression coefficients of the logistic regressions, p-values derived from the Wald tests and considered significant when < 0.05. * Indicates p-values that are no longer significant after corrections are made for multiple tests (0.05/3 = 0.016)...... 141

Table 5-6. Summary of statistics for the SNP loci that show a significant correlation between allele frequency distribution and environmental category in A. millepora. Coeff LR = regression coefficients of the logistic regression, p-values derived from the Wald tests and considered significant when < 0.05...... 142

PREFACE

Chapter Two of this dissertation consists entirely of a multi-author published work, of which I was a primary co-author. My contributions to the study consisted of writing the manuscript, performing most of the bioinformatics and computational work, and contributing to some of the molecular work. Chapter Five also contains work that is part of a multi-author study (in review), of which I am a co-author. My contributions to this study consisted of performing all of the bioinformatic and computational analysis of the coral transcriptome, the discovery and preliminary analysis of the resulting SNP data, and secondary contributions to the writing of the manuscript. Chapters Three and Four consist primarily of my own original, unpublished work.

ACKNOWLEDGEMENTS

All Chapter Two work was supported by NSF grants IBN-0412651 and IOS-0950416, as well as Center of Excellence support from the Academy of Finland, and is part of a multi-author published work. Chapter Five coral work was supported by the Reef and Rainforest Research Center, the Great Barrier Reef Foundation, and the Australian Institute of Marine Science, and is also part of a multi-author work currently in review. I would like to thank my advisor for excellent advice and patient support, and my committee members for their critical contributions to this dissertation. I would also like to thank my fellow authors and collaborators, without whom most of this work would not have been possible.


Chapter One

A brief history of transcriptomics

Transcriptomics, the study of the set of all transcribed genes (i.e. mRNA) in the cell, has recently begun to be recognized as an important field in the study of a diverse set of biological systems. Sometimes considered a subset of or a complement to the field of genomics, transcriptomics has nevertheless proven to have applications separate from or outside of the typical genomic study, especially when considering the large number of questions being investigated outside the purview of the relatively few molecular model organisms. As will be further explored in this dissertation, this is in large part due to rapid advances over the past few years in the ability to sequence DNA. However, transcriptomics actually began emerging as a recognizable field with the advent of sequencing technologies that originated in the late 1970s and early 1980s and matured over the course of the 1990s, although increases in computational power and the rising sophistication of software played a large part as well. Indeed, it was changes in several technologies that precipitated the rise of the genomic era.

The structure of DNA was famously described by Watson and Crick in 1953 (Watson and Crick 1953). Methods for determining DNA sequences began to emerge in the years that followed; however, they were extremely labor intensive, time consuming, and expensive (e.g. sequencing of the bacteriophage MS2 coat protein gene; Min Jou, Haegeman et al. 1972) and generally limited to a few dozen nucleotides. It wasn’t until 1977, with the invention of the chain termination sequencing method (Sanger, Nicklen et al. 1977) along with the sequencing of the first genome (bacteriophage phi X174; Sanger, Air et al. 1977), that sequencing more than a few nucleotides at a time became feasible. The method (often referred to as the Sanger method) involved extension of a DNA molecule from a single-stranded DNA template using modified and labeled nucleotides (i.e. adenine, guanine, thymine, and cytosine). The template was added to four separate reactions along with a DNA primer, DNA polymerase, all four normal deoxynucleotide triphosphates (dNTPs), and a single modified and labeled nucleotide (a dideoxynucleotide triphosphate, or ddNTP) that served to halt the extension of the growing DNA chain when incorporated by the polymerase. This resulted in DNA fragments of varying lengths, which could then be separated by gel electrophoresis (on a denaturing polyacrylamide-urea gel) and visualized by autoradiography or ultraviolet light, depending on the nature of the ddNTP label used. The method slowly improved over the next several years; particularly noteworthy improvements included the addition of fluorescently labeled primers (Smith, Sanders et al. 1986) and fluorescently labeled ddNTPs (Starke, Yan et al. 1994), which allowed the DNA extension to occur in a single reaction and paved the way for automated sequencing.

Of equal importance to the evolution of transcriptomics were advances in computational power and, perhaps more importantly, the growing sophistication of related software. When the first genome (bacteriophage phi X174) was sequenced in 1977, it was only 5,268 nucleotides long and consisted of only 11 genes. Because the Sanger sequencing method produced fragments between 200 and 1,000 nucleotides long, the genome was small enough to reconstruct easily from a relatively small number of fragments, or reads. Basically, random pieces of the genome were isolated with restriction enzymes and used as primers in a round of sequencing. The resulting reads were then used to reconstruct (i.e. assemble) large portions of the genome through their overlapping regions, followed by custom primer design and additional sequencing to fill in the remaining gaps. This method of reconstructing a larger sequence from several smaller reads, first used by Sanger, came to be known as “shotgun” sequencing, and at this stage there was little difference between what we now think of as genomics and transcriptomics. As larger and more complex transcriptomes were attempted, however, it quickly became apparent that the assembly process would become untenable without new methods and computational assistance. Some of the first such programs began to emerge in parallel with the growing automation of the Sanger method (e.g. Staden 1979; Kelly 1989), and the logistics behind the two types of sequencing began to diverge.
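The idea of reconstructing a longer sequence from reads through their overlapping regions can be illustrated with a minimal sketch. The code below is a toy greedy overlap-merge written for illustration only; it is not the procedure used by Sanger, nor the algorithm of any program cited above. The function names (overlap, greedy_assemble), the min_len parameter, and the example sequence are all hypothetical, and the sketch assumes error-free reads that all come from the same strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Toy assembler: repeatedly merge the pair with the longest overlap.

    Assumes error-free, same-strand reads (an illustrative simplification).
    """
    # Drop exact duplicates, then reads fully contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            # No remaining overlaps: what is left are the final contigs.
            return contigs
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical 25-nucleotide sequence cut into four overlapping reads.
    target = "ATGGCGTACGTTAGCCTAGGATCCA"
    reads = [target[0:10], target[6:16], target[12:22], target[17:25]]
    print(greedy_assemble(reads))
```

Even this toy version compares every pair of sequences on every pass, which hints at why assembly by hand or by naive comparison became untenable as projects grew, and why repeated or error-containing sequences require more careful methods than a simple greedy merge.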
When the Human Genome Project (http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml) was proposed jointly in 1990 by the U.S. Department of Energy and the National Institutes of Health, it was assumed that it would take approximately 15 years and roughly three billion U.S. dollars, but thanks in large part to advances in computational power, shotgun assembly methods, and software, a draft of the genome was completed in 2000 (Lander, Linton et al. 2001) and a complete version was announced in 2003 (IHGSC 2004), two years ahead of schedule. This unexpected advancement was mostly due to competition from Celera Corporation, headed by Craig Venter, who began working on the human genome in parallel to the government. Although at the time Venter’s genome proposal was controversial (it was a commercial, rather than public, project), Celera developed and used an improved whole genome shotgun method (called double-barreled shotgun sequencing) with pairwise end sequencing (Roach, Boysen et al. 1995) to skip initial mapping phases and avoid extensive BAC (bacterial artificial chromosome) cloning, which allowed them to finish the project in a fraction of the time and at a fraction of the cost (Venter, Adams et al. 2001).

Transcriptomics development was also advancing during this period. In fact, prior to the human genome sequencing proposal, a proposal was put forward by the government to initiate the first human transcriptome sequencing project, since it was believed at the time that it would be impossible to sequence and assemble such a large genome (roughly 3 billion nucleotides). Although the Human Genome Project went on to be completed successfully, along with the genomes of several other highly studied molecular “model” organisms (e.g. Drosophila melanogaster: Myers, Sutton et al. 2000), many research labs decided that limiting their efforts to the set of genes was sufficient for their needs or even preferable given the cost and labor requirements. While having a sequenced genome provided an enormously powerful tool for a variety of research goals in , medicine, conservation, and evolutionary studies, many of these benefits were available through sequencing of just the genes, and a few (e.g. expression analysis) were only readily available with a sequenced transcriptome.

Over the next few years, the number of transcriptome projects grew rapidly. Software for transcriptome sequencing, assembly, and annotation advanced steadily along with automation in sequencing and computational power, including online databases and resources (e.g. Genbank, Uniprot, and Ensembl). Genome sequence assemblers began appearing in the 1980s, developed first by Celera (Myers, Sutton et al. 2000) and followed later by several others (e.g. SEQAID: Peltola, Soderlund et al. 1984; ARACHNE: Batzoglou, Jaffe et al. 2002; PHRAP: de la Bastide and McCombie 2007; CAP3: Huang and Madan 1999), evolving from the simpler alignment algorithms used with viral and bacterial genomes. The first dedicated transcriptome assemblers also emerged during this time (e.g. miraEST: Chevreux, Pfisterer et al. 2004). Starting in the early 2000s, sequencing automation advanced to the point where entire pipelines for assembling, annotating, and storing transcriptome projects began to appear (e.g. FunnyBase: Paschall, Oleksiak et al. 2004). However, although transcriptome sequencing costs had come down substantially thanks to growing automation, the entire process was still out of reach for the majority of smaller research labs because of its still relatively large cost, labor, and time requirements.

Between 2000 and 2006, the Sanger method reached its pinnacle as a fully developed sequencing technology (reviewed in Nagaraj, Gasser et al. 2007). Although advancements continued to trickle into the literature, especially in software development (e.g. Paschall, Oleksiak et al. 2004), automated sequencing levels had basically plateaued. It was also within this time frame that key advances began to pave the way for the next generation of sequencing technology. Pyrosequencing first made an appearance as early as 1996 (Ronaghi, Karamohamed et al. 1996), and detected DNA polymerase activity with an enzymatic luminometric assay for inorganic pyrophosphate. However, it wasn’t until 2005 (Margulies, Egholm et al. 2005) that the process was successfully parallelized and refined sufficiently to become a viable commercial sequencing technology and the first successful applications began to be published. It was in this climate, at the cusp of a dawning revolution in sequencing capability, that the study (Vera, Wheat et al. 2008) in Chapter Two of this dissertation was initiated.

References

Batzoglou, S., D. B. Jaffe, et al. (2002). "ARACHNE: a whole-genome shotgun assembler." Genome Res 12(1): 177-89.

Chevreux, B., T. Pfisterer, et al. (2004). "Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs." Genome Research 14(6): 1147-1159.

de la Bastide, M. and W. R. McCombie (2007). "Assembling genomic DNA sequences with PHRAP." Curr Protoc Bioinformatics Chapter 11: Unit11 4.

Huang, X. and A. Madan (1999). "CAP3: A DNA sequence assembly program." Genome Res 9(9): 868-77.

IHGSC (2004). "Finishing the euchromatic sequence of the human genome." Nature 431(7011): 931-45.

Kelly, M. J. (1989). "Computers: the best friends a human genome ever had." Genome 31(2): 1027-33.

Lander, E. S., L. M. Linton, et al. (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921.

Margulies, M., M. Egholm, et al. (2005). "Genome sequencing in microfabricated high-density picolitre reactors." Nature 437(7057): 376-380.

Min Jou, W., G. Haegeman, et al. (1972). "Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein." Nature 237(5350): 82-8.

Myers, E. W., G. G. Sutton, et al. (2000). "A whole-genome assembly of Drosophila." Science 287(5461): 2196-204.

Nagaraj, S. H., R. B. Gasser, et al. (2007). "A hitchhiker's guide to expressed sequence tag (EST) analysis." Briefings in Bioinformatics 8(1): 6-21.

Paschall, J., M. Oleksiak, et al. (2004). "FunnyBase: a systems level functional annotation of Fundulus ESTs for the analysis of gene expression." BMC Genomics 5(1): 96.

Peltola, H., H. Soderlund, et al. (1984). "SEQAID: a DNA sequence assembling program based on a mathematical model." Nucleic Acids Res 12(1 Pt 1): 307-21.

Roach, J. C., C. Boysen, et al. (1995). "Pairwise end sequencing: a unified approach to genomic mapping and sequencing." Genomics 26(2): 345-53.

Ronaghi, M., S. Karamohamed, et al. (1996). "Real-time DNA sequencing using detection of pyrophosphate release." Anal Biochem 242(1): 84-9.

Sanger, F., G. M. Air, et al. (1977). "Nucleotide sequence of bacteriophage phi X174 DNA." Nature 265(5596): 687-95.

Sanger, F., S. Nicklen, et al. (1977). "DNA sequencing with chain-terminating inhibitors." Proc Natl Acad Sci U S A 74(12): 5463-7.

Smith, L. M., J. Z. Sanders, et al. (1986). "Fluorescence detection in automated DNA sequence analysis." Nature 321(6071): 674-9.

Staden, R. (1979). "A strategy of DNA sequencing employing computer programs." Nucleic Acids Res 6(7): 2601-10.

Starke, H. R., J. Y. Yan, et al. (1994). "Internal fluorescence labeling with fluorescent deoxynucleotides in two-label peak-height encoded DNA sequencing by capillary electrophoresis." Nucleic Acids Research 22(19): 3997-4001.

Venter, J. C., M. D. Adams, et al. (2001). "The sequence of the human genome." Science 291(5507): 1304-51.

Vera, J. C., C. W. Wheat, et al. (2008). "Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing." Molecular Ecology 17(7): 1636.

Watson, J. D. and F. H. C. Crick (1953). "THE STRUCTURE OF DNA." Cold Spring Harbor Symposia on Quantitative Biology 18: 123-131.


Chapter Two

Rapid transcriptome characterization for a non-model organism using 454 pyrosequencing

Chapter 2 is a published work: Vera, J. C., C. W. Wheat, et al. (2008). "Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing." Molecular Ecology 17(7): 1636.

Introduction

Whole genome or transcriptome sequencing enables functional genomic studies based on global gene expression patterns, SNP (single nucleotide polymorphism) surveys, QTL studies, genomic scans of diversity, and so forth (Rudd, 2003; Bouck and Vision, 2007; Nagaraj, Gasser and Ranganathan, 2007). However, most species studied for ecological or physiological traits do not have genomic or Expressed Sequence Tag (EST) data available. For non-model species, the method of obtaining transcriptome data has been cDNA library construction, repeated rounds of normalization/subtraction and Sanger EST sequencing, and often cDNA microarray construction (Whitfield et al., 2002; Mita et al., 2003; Rudd, 2003; Paschall et al., 2004; Papanicolaou et al., 2005; Beldade et al., 2006). Those proven methods can potentially be improved upon by next-generation approaches (Hudson, 2007) that reduce costs, labor, and errors associated with clone mishandling, and that recover missing or rare transcripts or those that are unstable when cloned (Weber et al., 2007). Here we demonstrate the utility of recently developed sequencing technology for rapid and accurate transcriptome characterization. Our subject species, the Glanville fritillary butterfly (Melitaea cinxia), has been examined extensively from an ecological and population biology perspective (Ehrlich and Hanski, 2004) and is polymorphic for key traits that affect fitness in ways that are dependent on habitat patch size, connectivity, and population history (Hanski et al., 2004; Haag et al., 2005; Hanski et al., 2006; Hanski and Saccheri, 2006; Saastamoinen, 2007a, b). As is typical for species studied primarily in their free-living state, there are no genomic resources available for mechanistic dissection of these ecologically interesting traits. We report here how 454 pyrosequencing and assembly of short sequence reads can be used to generate functional genomic tools for this and potentially any species.

Massively parallel 454 pyrosequencing has become a feasible method for increasing sequencing depth and coverage while reducing time, labor, and cost (Margulies et al., 2005; Moore et al., 2006; Wicker et al., 2006; Huse et al., 2007; Weber et al., 2007). Typical output from a 4.5 hr run of the original GS20 sequencer is 20-30 megabases, comprising roughly 300,000 sequence reads averaging ca. 100 base pairs (bps). Sequencing error levels are low (< 1%), arising primarily from homopolymer runs (Margulies et al., 2005; Moore et al., 2006; Huse et al., 2007), but these errors tend to be resolved in cases where there is sufficient coverage depth to allow assembly of overlapping reads. For this reason, 454 sequences can be more accurate than traditional Sanger sequences (Moore et al., 2006; Wicker et al., 2006; Goldberg et al., 2006). However, the short sequence reads make assembly of overlapping sequences problematic (Trombetti et al., 2007). The challenge is greatest when genomic data are not available to aid assembly, when there is polymorphism, and when the data are cDNA sequences that contain variation created by alternative splicing and vary widely in transcript abundance, making coverage uneven. Previous studies have used 454 pyrosequencing and assembly with PCR amplicons, BACs, genomic, mitochondrial, and plastid DNA (e.g. Bainbridge et al., 2006; Moore et al., 2006; Goldberg et al., 2006; Poinar et al., 2006; Wicker et al., 2006). At the time of this study (2007), published reports of 454 pyrosequencing of transcriptomes were restricted to model species where genomic or extensive Sanger EST data provided an assembly reference (Cheung et al., 2006; Bainbridge et al., 2006; Weber et al., 2007; Emrich et al., 2007). Two studies that used genome or Sanger EST sequences for mapping and annotation of 454 ESTs (Cheung et al., 2006; Weber et al., 2007) were not able to also accomplish de novo assembly of their 454 ESTs (e.g.
only two contigs out of 184,599 reached 500 bps; Cheung et al., 2006). A transcriptome study using 454 ESTs from a wasp accomplished annotation by alignment with the honeybee genome, and used sequence data to design primers for qRT-PCR, but did not report an assembly (Toth et al., 2007). Here, we test the suitability of 454 data for de novo assembly and annotation of expressed genes of a eukaryote, and show the utility of these data for designing oligonucleotide microarrays. The result establishes a workflow from RNA to gene assembly and large-scale functional genomics tools.

Materials and Methods

RNA was isolated from larvae, pupae, and adults of M. cinxia (ca. 80 individuals from 8 families) collected from old and new populations across the main Åland island located in the Baltic between Finland and Sweden. As appropriate for SNP discovery, this sample is likely to be representative of the genetic diversity of the overall Åland metapopulation. RNA was extracted from whole bodies or body segments. We combined equivalent amounts of RNA into two pools; A) larvae plus pupae, and B) adult, which together contained 16.1 mg of total RNA. cDNA was synthesized from the two RNA pools and normalized (Zhu et al., 2001; Shagin et al., 2002; Zhulidov et al., 2005; Evrogen; http://www.evrogen.com, Moscow, Russia) in order to prevent over-representation of the most common transcripts (Figure 1). Approximately 1 µg of double-stranded cDNA from each of two normalized cDNA pools was sequenced on a Roche GS20 sequencer using methods previously described (Margulies et al., 2005; Poinar et al., 2006). Initial quality filtering of the 454 ESTs was performed at the machine level before base calling. The 454 EST sequences were screened for primer sequence using a custom Perl script (SmartScreener). Screened ESTs and their quality scores were then assembled using commercially available software (Seqman Pro 7.1). Prior to assembly, quality filtering was performed using default quality parameters for non-trace sequences. Additional assembly parameters were those identified as optimal by the manufacturer for large 454 data sets (Table 1). The assembler utilized a combined quality weighting (shallow coverage) and simple majority (deep coverage) for consensus base calling after assembly alignment. Files containing our sequences and their quality scores are available from the NCBI Short Read Archive, accession SRA000207. In order to compare 454 sequences against Sanger sequences, we generated plasmid cDNA libraries. We used 5 µg of larval mRNA for cDNA synthesis, essentially following a previously described protocol (Paschall et al., 2004). The cDNAs were unidirectionally cloned into a vector and sequenced from the 5’-end.
Sanger ESTs were assembled using Seqman Pro 7.1 with the default parameters for Sanger data and quality trimming set to high stringency. All Sanger ESTs were screened for vector sequence. All 454 and Sanger sequences (contigs and singletons) were aligned (NCBI blastx) vs. three expanded sub-volumes of the Uniprot 9.2 annotated protein database (http://www.expasy.uniprot.org/database/download.shtml). Uniprot blast results were passed through a custom pipeline to create annotated, tab-delimited tables, which included available information on taxonomy, key words, gene function, tissue specificity, and GO terms. Additional alignments were performed as follows. The in silico predicted B. mori protein set (Wang et al., 2005; http://silkworm.genomics.org.cn/), D. melanogaster unigenes (Flybase: http://flybase.bio.indiana.edu/), and H. erato clustered ESTs (Papanicolaou et al., 2005) were used to estimate coverage of the M. cinxia transcriptome (NCBI blastx), whereas the Butterflybase v2.91 (Papanicolaou et al., 2005) combined EST set was used primarily as a catch-all for otherwise uncharacterized sequences (NCBI tblastx). Finally, both the Sanger contigs and the 454 contigs were aligned (NCBI blastn, no filters) against the combined 454 contig/singleton dataset in order to analyze the quality of the assembly. All sequences, blast results, and annotation tables were stored in a normalized MySQL database. The sense strand of contigs and singletons (both 454 and Sanger) that generated acceptable blast hits (bitscore > 45) to either B. mori predicted proteins or to protein databases, along with both strands for those that hit Butterflybase ESTs (for which strand orientation is uncertain), were submitted to Agilent’s eArray web tool (http://earray.chem.agilent.com/earray/) to generate oligonucleotide (60mer) microarray probe sequences. Six different probes were generated for each submitted sequence. In addition, one probe was generated for the two complementary strand sequences of each contig that was not identified in blast searches; this allowed strand orientation of the unannotated contigs to be inferred from the difference in relative hybridization intensity of the two complementary probes. All probes (N=207,149) were printed along with Agilent positive and negative controls on a 244K-feature Agilent microarray. Fluorescent-labeled cRNA (Cy3) was synthesized from an even mixture of the RNA pools used originally for the normalization and 454 sequencing; these were hybridized to the array using standard Agilent reagent kits and protocols. Three replicates of the best-performing probe from 13,780 annotated contigs and singletons were printed on 44K-feature arrays, along with a small number of probes for Sanger ESTs and sense-strand inferred probes for otherwise unannotated contigs. Two arrays were hybridized with fluorescent-labeled aRNA from two individual butterfly abdomens.
Repeatability was assessed by comparing mean intensity of the replicated probes for one labeled aRNA sample across two arrays.
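The annotation step described above (retain blast hits with bitscore > 45 and keep the best hit per query) can be illustrated with a minimal sketch. It is not the custom pipeline used in this study, whose code is not shown here. It assumes the standard 12-column tab-delimited BLAST output, in which the query ID, subject ID, and bitscore are columns 1, 2, and 12; the function name and the example records are hypothetical.

```python
import csv
import io

BITSCORE_MIN = 45.0  # threshold stated in the text (bitscore > 45)


def best_hits(tabular_text, min_bitscore=BITSCORE_MIN):
    """Return {query_id: (subject_id, bitscore)} with the top-scoring hit per query.

    Assumes 12-column tab-delimited BLAST output (an assumption of this sketch).
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if bitscore <= min_bitscore:
            continue  # below the acceptance threshold
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical records: two hits for contig1, one weak hit for contig2.
example = "\n".join([
    "contig1\tprotA\t80.0\t60\t12\t0\t1\t180\t5\t64\t1e-20\t95.1",
    "contig1\tprotB\t60.0\t50\t20\t0\t1\t150\t9\t58\t1e-08\t52.0",
    "contig2\tprotC\t40.0\t30\t18\t0\t1\t90\t3\t32\t0.5\t30.4",
])
hits = best_hits(example)
print(hits)
```

In this example contig1 is annotated with its higher-scoring hit, and contig2 is left unannotated because its only hit falls below the threshold.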

Results

Sequencing and assembly

A total of 608,053 ESTs averaging 110 bps (see Table 2) were obtained from two 454 sequencing runs. Of these, 518,079 exceeded our minimal quality standards (SMART primer filtering; threshold of 50 bp windows with mean quality score >12) and entered the assembly. Eighty-eight percent (458,136) of the high quality ESTs formed 48,354 contigs of average length 197 bps (median = 134; s.d. = 207), with an average depth of coverage of 2.97 sequences per nucleotide position (Figure 2). The longest 10% (n = 4,835; 348 to 2,849 bps in length) had an average depth of coverage of 6.5. Thus, even though the average contig was short, we obtained thousands of long and deeply covered contigs. The remaining 59,943 high quality ESTs were retained as singletons (coverage depth = 1), for a total of 108,297 unigenes. An additional 3,888 sequence reads were obtained from M. cinxia cDNA libraries using traditional Sanger sequencing techniques. Forty-four percent (1,711) of the vector-screened and quality trimmed Sanger ESTs assembled into 364 contigs with average length of 574 bps and average coverage depth of 3.0 (see Table 3). There were 1,297 Sanger ESTs that remained singletons. These data were used to test the quality of the 454 sequencing and assembly.
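The windowed quality threshold and the read tallies reported above can be sketched as follows. The text states only the threshold (50 bp windows, mean quality score > 12), so the exact rule is an assumption of this sketch: a read is accepted when every 50 bp window has a mean quality above 12, and a read shorter than one window is judged on its overall mean. The function name is hypothetical and this is not the filter implemented by the assembly software.

```python
def passes_window_filter(quals, window=50, min_mean=12.0):
    """True if every `window`-bp window of quality scores has mean > min_mean.

    Assumed rule (not specified in the text): all windows must pass, and reads
    shorter than one window are judged on their overall mean.
    """
    if not quals:
        return False
    if len(quals) <= window:
        return sum(quals) / len(quals) > min_mean
    window_sum = sum(quals[:window])
    if window_sum / window <= min_mean:
        return False
    for i in range(window, len(quals)):
        # Slide the window one base to the right.
        window_sum += quals[i] - quals[i - window]
        if window_sum / window <= min_mean:
            return False
    return True


good_read = [25] * 110                 # uniformly high quality
bad_tail_read = [25] * 60 + [5] * 50   # low-quality 3' end

# Tallies reported in the text.
high_quality, assembled, contigs = 518_079, 458_136, 48_354
singletons = high_quality - assembled      # reads left unassembled
unigenes = contigs + singletons            # contigs plus singletons
fraction_assembled = assembled / high_quality

print(passes_window_filter(good_read), passes_window_filter(bad_tail_read))
print(singletons, unigenes, round(fraction_assembled, 2))
```

The tallies reproduce the figures in the text: 59,943 singletons, 108,297 unigenes, and 88% of high quality ESTs assembled into contigs.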

Quality and performance of the 454 assembly

Ideally, accurate and effective assembly of the 454 sequence reads would create assembled sequences with at most a few short alignments to other 454 sequences (e.g. conserved motifs). We tested this with a blastn alignment of all 48K contigs against themselves and the singletons. Eight percent of the contigs (3897) had blast hits (bitscore > 45) with 100% identity with other contigs (N= 1749) and singletons (N=2660), but in no case did these alignments extend over the entire length of either the blast subject or query. These perfect match alignments averaged 35 and 36 nucleotides in length for contig and singleton hits, respectively, averaging 18% (median = 16%; maximum = 82%) of the length of the query contig. Nearly all of these contigs (3,892) had blastx hits (bitscore > 45) against B. mori proteins (Wang et al., 2005), yet only 220 (6%) of those blast-paired contigs had best blastx hits to the same B. mori protein, and only 69 (2%) of those pairs also had best blastx hits to the same protein in Uniprot. Thus, these gene regions that produced 100% identity alignments were primarily conserved motifs in different genes, or in some cases variants of single genes created by alternative splicing; our assembly appropriately partitioned these into different contigs. The 454 assembly was further assessed by comparison to Sanger ESTs derived from an M. cinxia cDNA library. Of 813 quality-trimmed Sanger contigs and singletons filtered for length and annotation (>200nt and with blast hits to metazoan genes), 749 (92%) had strong blast hits to 454 sequences (Table 4). On average, 74% of the length of these Sanger sequences was aligned to at least 1x depth by 454 sequence (median = 90%). Nucleotide alignments of Sanger vs. 454 sequences were 97% identical for all alignments and 99% for each Sanger sequence’s best blast alignment (by bitscore; Figure 3). The average number of gaps for alignments involving 454 contigs was 0.63 (median = 0; Table 4) over an average length of 200 nucleotides. This gap rate of about three per thousand aligned bases was less than the 7 gaps per thousand aligned bases for 454 singletons, which because of 1x coverage contain a higher frequency of homopolymer run errors. These are overestimates of the true 454 base and gap error rates, as they include base mismatches caused by polymorphism, gaps created by alternative splicing, and alignment with less accurate end regions of Sanger sequences. A number of factors affect de novo assembly of genes from 454 sequence data (Figures 4 and 5), especially coverage depth. The effect of coverage depth is apparent in a blastx analysis of 454 contigs to Bombyx mori predicted proteins (Wang et al., 2005; the only fully sequenced lepidopteran genome). The ratio of the length of individual contigs to the length of their B. mori ortholog coding region increased with the depth of coverage of the contig and reached an asymptote near unity (i.e. likely full length gene assemblies) at average coverage > 10, for all but the longest genes (Figure 6a). From the distribution of average coverage depth for these contigs (median = 2.78), we infer that a four-fold increase in sequencing effort of our cDNA pool would be required to achieve an average coverage depth of 11 for about half of the contigs, at which point ca. 50% of the genes would likely assemble to full length. Greater coverage depth in the 454 assembly increased the total percentage length alignment by all contigs across their respective B. mori ortholog, in a manner that depended also on B. mori gene length (Figure 6b). Total coverage was best in 573 cases where B. mori proteins had at least 90% of their coding region aligned to one or more M. cinxia contig.
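The two numerical inferences in this section, the contig gap rate and the projected effect of additional sequencing, can be checked with a short worked calculation. The projection assumes that coverage depth scales linearly with sequencing effort, which is a simplification introduced here for illustration.

```python
# Gap rate for alignments involving 454 contigs (values from the text).
mean_gaps_per_alignment = 0.63
mean_alignment_length = 200  # nucleotides
gaps_per_kb = mean_gaps_per_alignment / mean_alignment_length * 1000

# Projected median coverage after more sequencing, assuming depth scales
# linearly with sequencing effort (an assumption of this sketch).
median_depth = 2.78
fold_increase = 4
projected_median_depth = median_depth * fold_increase

print(f"gap rate: {gaps_per_kb:.2f} per thousand aligned bases")
print(f"projected median depth: {projected_median_depth:.2f}x")
```

The gap rate is about three per thousand aligned bases, and a four-fold increase in effort would bring the median contig to roughly 11x coverage, just past the point (> 10x) at which contig length approached that of the B. mori ortholog.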

Transcriptome coverage breadth

Among the 108,297 contigs and singletons, 19,383 (18%) had hits exceeding our threshold (bitscore > 45, e-value generally <1e-5) in at least one of three blast searches: blastx against predicted B. mori proteins (Wang et al., 2005); tblastx against clustered lepidopteran ESTs (Butterflybase; Papanicolaou et al., 2005), or blastx against Uniprot proteins (Boeckmann et al., 2003; Figure 7a). Excluding sequences with best blast hits to non-metazoa, 71% (13,653) were similar to 6,289 unique B. mori proteins, with an average best-hit amino acid identity of 73% (s.d. = 16, Figure 2c,d; Table 5). In addition, sequences with no B. mori match had best blast hits (bitscore > 45) to 1748 metazoan Uniprot proteins with unique gene descriptions, and an additional 1274 unique EST clusters in Butterflybase, indicating the presence of about 9.3k (6289 + 1748 + 1274) M. cinxia unigenes. Comparisons against Drosophila melanogaster (Diptera) proteins and Heliconius erato (Lepidoptera) clustered ESTs produced similar results (Figure 7b). This minimal estimate of 9.3K annotated genes is nearly two thirds of the number of genes in D. melanogaster (13,379; Adams et al., 2000), about half of the number of genes in B. mori (predicted genes = 18,510; Xia et al., 2004), and exceeds what has been found by all of the Sanger sequencing of Lepidoptera ESTs other than B. mori. The two lepidoptera (H. erato, Spodoptera frugiperda; Butterflybase) with the most EST samples contain approximately 6-7K putative protein-coding genes. A blastx search using butterfly ESTs publicly available (Butterflybase; N = 28,664; Bicyclus anynana, H. erato, H. melpomene, Papilio dardanus, P. xuthus) against the B. mori predicted proteins produced 15,920 hits (bitscore > 45) to 5,085 B. mori proteins (24% less than our M. cinxia 454 assembly). The percentage of M. cinxia 454 unigenes that were annotated (18%) is similar to that observed in other large assembled EST datasets of non-mammalian species (Paschall et al., 2004; Mita et al., 2003), suggesting that the unannotated contigs and singletons could represent a substantial fraction of the M. cinxia-specific transcriptome (as previously observed in butterflies; Beldade et al., 2006). Using an Agilent microarray (60mer probes) to determine if these unannotated sequences are expressed mRNAs, we found that the distribution of signal intensities was quite similar among probes for annotated and unannotated genes (Figure 8). Probe signal intensity for 6,156 unannotated contigs exceeded the 99.5th percentile of negative control probes, and this number is likely to have been much higher on a more effectively labeled array, as this one had no saturated probes and therefore was not effective for detecting rarer transcripts. There was a clear effect of probe orientation (average 10-fold difference) when one probe orientation produced a strong signal (Figure 8d).
This further validates transcript presence of non-annotated contigs because only probes complementary to valid mRNA sequences are likely to produce detectable signals on an Agilent oligonucleotide (60-mer) microarray (Hughes et al., 2001).
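The detection criterion used above, probe signal exceeding the 99.5th percentile of the negative control probes, can be sketched as follows. The percentile is computed by linear interpolation between ranked values, which is one common convention and an assumption here, since the text does not say how the percentile was obtained. All probe names and intensities are hypothetical.

```python
def percentile(values, pct):
    """Percentile (pct in 0-100) of a non-empty list, by linear interpolation."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def detected_probes(probe_signal, negative_controls, pct=99.5):
    """Names of probes whose signal exceeds the chosen negative-control percentile."""
    cutoff = percentile(negative_controls, pct)
    return {name for name, signal in probe_signal.items() if signal > cutoff}


# Hypothetical intensities: 200 negative controls and three test probes.
negative_controls = list(range(1, 201))
probes = {"contig_a": 500.0, "contig_b": 150.0, "contig_c": 199.5}

cutoff = percentile(negative_controls, 99.5)
called = detected_probes(probes, negative_controls)
print(cutoff, sorted(called))
```

A threshold based on a high percentile of the negative controls, rather than their mean, keeps the expected false-positive rate per probe near 0.5% regardless of the shape of the background distribution.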

Functional annotation

Further characterization of 454 ESTs was done with respect to the curated protein database of Uniprot/Swiss Prot (Boeckmann et al., 2003), and functionally annotated genes in D. melanogaster (Flybase: http://flybase.bio.indiana.edu/). Using these databases, each 454 contig or singleton was assigned a gene description and a Gene Ontology (GO) classification based on the “best hit” blastx search (bitscore > 45, e-value generally <1e-5), using the “inferred from sequence similarity” (ISS) level of evidence (Ashburner et al., 2000). Biological processes make up the GO designation of the majority of Uniprot proteins (3860 counts, 64%), followed by molecular function (1944, 32%) and cellular components (243, 4%). Distribution of gene annotations was similar for D. melanogaster (biological processes = 2560 counts, 54%; molecular function = 2086, 44%; cellular components = 127, 3%; Table 6).

SNP discovery

The assembled 454 contigs provide a potentially rich data source for discovery of common SNPs. However, this potential was predicted to involve some technical challenges due to the complexity of assembling a transcriptome without a genome reference. In particular, alternative splicing and excessive polymorphism were thought to have the potential to complicate assembly of deeply covered contigs and discovery of true allelic variation (as opposed to non-allelic variation due to misassembly). Because of these concerns, initial SNP searches were performed in a limited set of specially selected and filtered contig regions. Visual inspection of our ten longest Uniprot-annotated contigs, using SNP identification criteria of at least 2x occurrence of the minority allele, at sites covered by at least 5 ESTs (mean coverage depth = 10.3) and away from homopolymer runs, revealed 6.7 SNPs per thousand bp (N = 232), which matches closely the SNP density within gene coding regions in mosquitoes (Wondji et al., 2007) but is less than that found in M. cinxia genomic DNA (Orsini et al., 2007). To detect SNPs of known codon position over a larger scale, we used custom scripts and database searching to find occurrences of single polymorphic sites within blast annotated regions of contigs where there were no gaps, and where the minority allele was present in at least 25% of ESTs at sites with at least 6x coverage. There were 751 of these SNPs in 355 contigs (59,747 total sites within aligned regions; 12.6 SNPs per thousand bp), comprising 602 third position sites (likely to be synonymous), 79 first, and 70 second position sites. Thus, we identified 149 sites that are good candidates for amino acid polymorphisms within conserved regions of annotated genes.
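The larger-scale SNP criteria described above (no gaps, at least 6x coverage, minority allele in at least 25% of ESTs) can be expressed as a per-site test. This is a minimal sketch and not the custom scripts used in this study. It assumes each site is given as the string of bases observed in the reads covering it, with '-' marking a gap, and it requires exactly two alleles, which is one reading of "single polymorphic sites". The function name is hypothetical.

```python
from collections import Counter


def call_snp(column, min_depth=6, min_minor_fraction=0.25):
    """Return (major, minor, minor_fraction) if the site meets the criteria, else None.

    `column` holds the bases observed at one aligned site, '-' marking a gap.
    Requiring exactly two alleles is an assumption of this sketch.
    """
    if "-" in column:
        return None  # gapped sites are excluded
    bases = [b for b in column.upper() if b in "ACGT"]
    if len(bases) < min_depth:
        return None  # insufficient coverage
    counts = Counter(bases).most_common()
    if len(counts) != 2:
        return None  # monomorphic, or more than two alleles
    (major, _), (minor, n_minor) = counts
    minor_fraction = n_minor / len(bases)
    if minor_fraction < min_minor_fraction:
        return None  # minority allele too rare
    return major, minor, minor_fraction


# Counts reported in the text.
snps, aligned_sites = 751, 59_747
snps_per_kb = snps / aligned_sites * 1000
third, first, second = 602, 79, 70
candidate_amino_acid_sites = first + second

print(call_snp("AAAAGG"), call_snp("AAAAAG"), call_snp("AAAGG"))
print(round(snps_per_kb, 1), candidate_amino_acid_sites)
```

The reported counts are internally consistent: 751 SNPs over 59,747 sites gives 12.6 per thousand bp, the three codon positions sum to 751, and the first and second positions together give the 149 candidate amino acid polymorphisms.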

Alternative splicing effects on assembly

Alternative splicing (AS) increases protein variation and complexity in eukaryotes (Marden, 2006; Yasukochi et al., 2006; Auboeuf et al., 2007; Kim et al., 2007; Pajares et al., 2007) and poses difficulties for de novo assembly of short gene sequences. The only published study of AS in 454 sequences relied on an extensive Sanger EST database (human) to assess known and novel splicing (Bainbridge et al., 2006). We assessed the effect of AS on our assembly by characterizing two AS genes (troponin-t and troponin-i) in M. cinxia using standard PCR, RACE, and cloning methods (J. Vera & M. Frilander; unpublished). Both genes had contigs in the 454 assembly with deep coverage that incorporated ESTs with small alternative or missing exons (or pieces thereof; Figure 9), resulting in a short string of degenerate nucleotides in the consensus sequence. When ESTs contained longer alternatively spliced regions, a separate, shorter contig was formed encompassing primarily the alternatively spliced exon(s). In other cases, the contig was split at the splice site, creating two or more non-overlapping sequences. Thus, although AS did not render assembly impossible or unusable, these characterized AS regions assembled in a complex way that required clone sequences, genomic data from a related model species, and human processing to decipher.

Performance of microarray probes

To a 244K-feature microarray, we hybridized a 50:50 mix of labeled RNA from the same pools that were normalized and used for 454 sequencing. On this array were 6 different probes per blast-annotated contig and singleton, along with two complementary probes (i.e. from the obtained sequence and its complement) for non-annotated contigs. Many thousands of these probes produced hybridization signal intensities above the range of negative control probes (Figure 8). Probes derived from 454 sequences performed as well as those from Sanger sequences (Figure 10). We printed three replicates of the best-performing 454 probe, assayed above, for each of 13,780 annotated contigs and singletons (~9K putative unique loci; Figure 7) on Agilent 4 x 44K format slides, and hybridized these with amplified mRNA (aRNA) from the abdomen of individual butterflies. Mean spot intensity for a single butterfly across replicate arrays (Figure 11a inset) revealed excellent repeatability (r² = 0.98). These results further validate sequence quality and assembly, as well as the probe design method.
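The repeatability statistic reported above can be illustrated with a minimal sketch that computes the squared Pearson correlation between mean probe intensities on two replicate arrays. The text does not state how r² was calculated or whether intensities were transformed first, so the plain Pearson form is an assumption here, and the intensity values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy * sxy / (sxx * syy)


def mean_intensity(replicates):
    """Mean of the replicate spots printed for one probe."""
    return sum(replicates) / len(replicates)


# Hypothetical data: three replicate spots per probe on each of two arrays.
array_1 = [[100, 110, 90], [400, 420, 380], [50, 55, 45], [900, 880, 920]]
array_2 = [[105, 95, 100], [410, 390, 400], [60, 50, 55], [870, 910, 890]]

means_1 = [mean_intensity(spots) for spots in array_1]
means_2 = [mean_intensity(spots) for spots in array_2]
print(round(r_squared(means_1, means_2), 3))
```

Averaging the replicate spots before correlating, as in the text, reduces spot-level noise, so the resulting r² reflects agreement between arrays rather than between individual spots.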


Metatranscriptomics and detection of an intracellular parasite

The cDNA for the 454 sequences came from whole insects whose surface and gut contained symbionts, parasites, and pathogens, some of which appeared in our annotated sequences. Blastx analysis revealed 618 sequences with best hits (bitscore > 45) to non-metazoa (Table 7). Individual non-metazoan taxa had relatively few hits, except for microsporidia (N = 452 hits). Selection of polyadenylated mRNAs during the formation of the cDNA pool for sequencing probably explains the relatively small number of bacterial sequences. These microbial sequences provide valuable data regarding xenobiotics of the host species. For example, this is the first indication that M. cinxia butterflies are infected with microsporidia, an intracellular parasite that affects insect population dynamics (Kohler and Wiley 1992; Kohler and Hoiland, 2001). Microarray analysis revealed marked differences among individuals in the transcript levels of microsporidia genes (Figure 11), and we can now explore unanticipated issues such as the relationship between expression of specific host and parasite genes. This highlights the discovery and hypothesis-expanding aspects of a broad and deep sequencing approach, which in this case has the potential to provide insight regarding physiological mechanisms that may have important ecological consequences.

Discussion

We used samples from whole bodies to characterize the transcriptome of a non-model species using 454 pyrosequencing and de novo assembly, and used those data to discover large numbers of SNPs and to construct a richly featured microarray. Although a precise estimate of transcriptome coverage is unattainable without full genomic sequence, we appear to have recovered over half, and potentially significantly more, of the M. cinxia transcriptome. Xenobiotic sequences were also detected and those data will be useful for developing studies on the interaction between butterflies and their microsporidia parasites in natural populations. The assembled sequence data and singletons provided a rich source of material for construction of an oligonucleotide microarray that showed both high repeatability and ability to detect biological differences among individuals.

These successes need to be considered alongside limitations of the approach in comparison to established Sanger EST library sequencing methods. Had we not generated libraries from which we collected Sanger sequences for comparison purposes, we would not have clones. Without clones, one cannot readily obtain full length sequence data for a given gene of interest, but must instead use the fragment data as the basis for a RACE protocol. Second, 454 sequences are located fairly evenly across the cDNA of a given gene (Weber et al., 2007), rather than being from one terminal end. Sequences that are distributed away from untranslated end regions are beneficial for obtaining blast alignments, but this scattered coverage also results in multiple fragments per gene, requiring further downstream analysis to assess their relationships. Finally, because 454 sequences are derived from both strands, the resulting sequence data have an unknown directional orientation. Directionality can be inferred from blast annotation, or as we did for unannotated genes, from the relative signal intensity of complementary microarray probes. Investigators choosing between approaches based solely on unoriented cDNA fragments (Hudson, 2007) versus clones should carefully consider the potential limitations imposed by these factors. A challenge for any EST project is obtaining sufficient coverage of less abundant transcripts. We used normalized cDNA to reduce over-sampling of abundant transcripts, but there remained substantial variation in average coverage depth. The average contig was fairly short (~200 bps), and even though 88% of the sequence reads assembled into contigs, there remained ca. 60,000 singletons. Thus, even from a normalized cDNA pool, coverage of rarer transcripts was often thin. Singletons provided the only sequence that aligned to 1527 of the 6289 unique B.
mori orthologs identified with blast analysis, and because they had no redundant coverage, they contained a higher error and gap rate than the contigs. A recent upgrade of the 454 sequencer now provides reads averaging ca. 230 bp and more reads per run; this suggests that coverage and assembly performance will increase, perhaps dramatically, above what we have described here. Lack of annotation for many of the genes discovered will remain a problem when working with non-model species, regardless of methodological approach or assembly quality. Previous attempts at de novo transcriptome assembly from 454 sequences used different methods and achieved significantly lower average coverage (0.23 X coverage vs. our estimated 2.3 X, Table 8; Cheung et al., 2006). High coverage is critical for joining of overlapping ESTs into larger contigs, but even genes with low coverage can provide valuable data if full-length gene characterization is not the primary goal. Our annotated singletons (~80 to 100 bp after quality trimming) provided sufficient sequence length to design microarray probes, most of which produced detectable hybridization signals.

A key feature of 454 sequencing is that, compared to a similar dollar expenditure on Sanger sequencing, it yields redundant coverage for more genes, thereby allowing much more extensive SNP discovery. Until now, 454 transcriptome-wide SNP detection (as opposed to ultradeep coverage of single genes or amplicons) has been accomplished only by comparison with genome data (Bainbridge et al., 2006; Barbazuk et al., 2007). Our initial scan of gap-free blast-aligned regions in long and deeply covered contigs revealed a rich collection of SNPs, including 149 first and second codon position polymorphisms that are likely to change the amino acid sequence. We are presently constructing an algorithm that accomplishes assembly-wide discovery of high quality SNPs using specified criteria (depth, quality score, avoidance of homopolymer runs, frequency, inferred codon position within annotated contigs) in de novo assemblies of 454 ESTs; we will report on this in a subsequent publication. The ability to accomplish transcriptome-wide SNP discovery using 454 sequence data is an important development because gene-associated SNPs are difficult and expensive to obtain by other methods but provide valuable data and markers for population biology, functional genomics, and conservation (Rosenberg and Nordborg, 2002; Balding, 2006; Kohn et al., 2006), including non-model species in which SNP proximity and chromosomal location may be inferred by assuming synteny with known genomic sequences of related model taxa (Yasukochi et al., 2006; Joron et al., 2006). The Glanville fritillary butterfly has been studied extensively from an ecology and population biology perspective (Hanski, 1999; Ehrlich and Hanski, 2004). Like most free-living species, it has lacked genetic and genomic resources necessary for mechanistic study.
Here we have demonstrated that it is possible to use 454 pyrosequencing to rapidly characterize a draft transcriptome and use it to create a microarray for large-scale functional genomics. These tools will allow us to determine how variation in gene sequences and gene expression are associated with key ecological features of this species, including flight ability, dispersal, fecundity, and the impact of metapopulation parameters on these traits. The methods described here can be used for potentially any species, thereby narrowing the gap between approaches based on model organisms with rich genetic resources versus species that are most tractable for ecological and evolutionary studies (i.e. the Krogh principle; Krebs, 1975; Wayne and Staves, 1996).

References

Adams, M.D. et al. (2000) The genome sequence of Drosophila melanogaster. Science 287, 2185-2195.
Ashburner, M. et al. (2000) Gene ontology: Tool for the unification of biology. Nature Genetics 25, 25-29.
Auboeuf, D., Batsche, E., Dutertre, M., Muchardt, C. & O'Malley, B.W. (2007) Coregulators: transducing signal from transcription to alternative splicing. Trends in Endocrinology and Metabolism 18, 122-129.
Bainbridge, M.N. et al. (2006) Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics 7, 246.
Balding, D.J. (2006) A tutorial on statistical methods for population association studies. Nature Reviews Genetics 7, 781-791.
Barbazuk, W.B., Emrich, S.J., Chen, H.D., Li, L. & Schnable, P.S. (2007) SNP discovery via 454 transcriptome sequencing. The Plant Journal.
Beldade, P., Rudd, S., Gruber, J.D. & Long, A.D. (2006) A wing expressed sequence tag resource for Bicyclus anynana butterflies, an evo-devo model. BMC Genomics 7, 130.
Boeckmann, B. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 31, 365-370.
Bouck, A. & Vision, T. (2007) The molecular ecologist's guide to expressed sequence tags. Molecular Ecology 16, 907-924.
Casillas, S., Petit, N. & Barbadilla, A. (2005) DPDB: a database for the storage, representation and analysis of polymorphism in the Drosophila genus. Bioinformatics 21 Suppl 2, ii26-30.
Cheung, F. et al. (2006) Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology. BMC Genomics 7, 272.
Ehrlich, P.R. & Hanski, I. (2004) On the Wings of Checkerspots: A Model System For Population Biology. (Oxford University Press, Oxford).
Emrich, S.J., Barbazuk, W.B., Li, L. & Schnable, P.S. (2007) Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Research 17, 69-73.
Goldberg, S.M.D. et al. (2006) A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proceedings of the National Academy of Sciences of the United States of America 103, 11240-11245.

Haag, C.R., Saastamoinen, M., Marden, J.H. & Hanski, I. (2005) A candidate locus for variation in dispersal rate in a butterfly metapopulation. Proceedings of the Royal Society Biological Sciences Series B 272, 2449-2456.
Hanski, I. (1999) Habitat connectivity, habitat continuity, and metapopulations in dynamic landscapes. Oikos 87, 209-219.
Hanski, I., Eralahti, C., Kankare, M., Ovaskainen, O. & Siren, H. (2004) Variation in migration propensity among individuals maintained by landscape structure. Ecology Letters 7, 958-966.
Hanski, I. & Saccheri, I. (2006) Molecular-level variation affects population growth in a butterfly metapopulation. PLoS Biol. 4:e129.
Hanski, I., Saastamoinen, M. & Ovaskainen, O. (2006) Dispersal-related life-history trade-offs in a butterfly metapopulation. Journal of Animal Ecology 75, 91-100.
Hudson, M.E. (2007) Sequencing breakthroughs for genomic ecology and evolutionary biology. Molecular Ecology Notes, in press.
Hughes, T.R. et al. (2001) Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nature Biotechnology 19, 342-347.
Huse, S.M., Huber, J.A., Morrison, H.G., Sogin, M.L. & Welch, D.M. (2007) Accuracy and quality of massively-parallel DNA pyrosequencing. Genome Biology 8: R143.
Joron, M. et al. (2006) A conserved supergene locus controls colour pattern diversity in Heliconius butterflies. PLoS Biology 4, 1831-1840.
Kersey, P., Hermjakob, H. & Apweiler, R. (2000) VARSPLIC: alternatively-spliced protein sequences derived from SWISS-PROT and TrEMBL. Bioinformatics 16:1048-9.
Kim, E., Magen, A. & Ast, G. (2007) Different levels of alternative splicing among eukaryotes. Nucleic Acids Research 35, 125-131.
Kohler, S.L. & Hoiland, W.K. (2001) Population regulation in an aquatic insect: The role of disease. Ecology 82, 2294-2305.
Kohler, S.L. & Wiley, M.J. (1992) Parasite-induced collapse of populations of a dominant grazer in Michigan streams. Oikos 65, 443-449.
Kohn, M.H., Murphy, W.J., Ostrander, E.A. & Wayne, R.K. (2006) Genomics and conservation genetics. Trends in Ecology and Evolution 21, 629-637.
Krebs, H.A. (1975) The August Krogh principle: for many problems there is an animal on which it can be most conveniently studied. Journal of Experimental Zoology 194, 221-226.

Marden, J.H. (2006) Quantitative and evolutionary biology of alternative splicing: how changing the mix of alternative transcripts affects phenotypic plasticity and reaction norms. Heredity 2006 Sep 27; [Epub ahead of print]
Margulies, M. et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380.
Mita, K. et al. (2003) The construction of an EST database for Bombyx mori and its application. Proceedings of the National Academy of Sciences of the United States of America 100, 14121-14126.
Moore, M.J. et al. (2006) Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant Biology 6, 17.
Nagaraj, S.H., Gasser, R.B. & Ranganathan, S. (2007) A hitchhiker's guide to expressed sequence tag (EST) analysis. Briefings in Bioinformatics 8, 6-21.
Orsini, L., Pajunen, M., Hanski, I. & Savilahti, H. (2007) SNP discovery by mismatch-targeting of Mu transposition. Nucleic Acids Research 35(6): e44.
Pajares, M.J. et al. (2007) Alternative splicing: an emerging topic in molecular and clinical oncology. Lancet Oncology 8, 349-357.
Papanicolaou, A., Joron, M., McMillan, W.O., Blaxter, M.L. & Jiggins, C.D. (2005) Genomic tools and cDNA derived markers for butterflies. Molecular Ecology 14, 2883-2897.
Paschall, J.E. et al. (2004) FunnyBase: a systems level functional annotation of Fundulus ESTs for the analysis of gene expression. BMC Genomics 5, 96.
Poinar, H.N., Schwarz, C., Qi, J., Shapiro, B., Macphee, R.D., Buigues, B., Tikhonov, A., Huson, D.H., Tomsho, L.P., Auch, A., Rampp, M., Miller, W. & Schuster, S.C. (2006) Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science 311:392-4.
Rosenberg, N.A. & Nordborg, M. (2002) Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3, 380-390.
Rudd, S. (2003) Expressed sequence tags: Alternative or complement to whole genome sequences? Trends in Plant Science 8, 321-329.
Saastamoinen, M. (2007a) Life-history, genotypic, and environmental correlates of clutch size in the Glanville fritillary butterfly. Ecological Entomology 32, 235-242.
Saastamoinen, M. (2007b) Heritability of dispersal rate and other life history traits in the Glanville fritillary butterfly. Heredity. Sep 5; [Epub ahead of print]
Shagin, D.A. et al. (2002) A novel method for SNP detection using a new duplex-specific nuclease from crab hepatopancreas. Genome Research 12, 1935-1942.

Toth, A.L., Varala, K., Newman, T.C., Miguez, F.E., Hutchison, S.K., Willoughby, D.A., Simons, J.F., Egholm, M., Hunt, J.H., Hudson, M.E. & Robinson, G.E. (2007) Wasp Gene Expression Supports an Evolutionary Link Between Maternal Behavior and Eusociality. Science, Sep 27; [Epub ahead of print]
Trombetti, G.A., Bonnal, R.J.P., Rizzi, E., De Bellis, G. & Milanesi, L. (2007) Data handling strategies for high throughput pyrosequencers. BMC Bioinformatics 8, S22.
Wang, J., Xia, Q., He, X., Dai, M., et al. (2005) SilkDB: a knowledgebase for silkworm biology and genomics. Nucleic Acids Research 33, D399-D402.
Wayne, R. & Staves, M.P. (1996) The August Krogh principle applies to plants. Bioscience 46, 365-369.
Weber, A.P.M., Weber, K.L., Carr, K., Wilkerson, C. & Ohlrogge, J.B. (2007) Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiology 144, 32-42.
Whitfield, C.W. et al. (2002) Annotated expressed sequence tags and cDNA microarrays for studies of brain and behavior in the honey bee. Genome Research 12, 555-566.
Wicker, T. et al. (2006) 454 sequencing put to the test using the complex genome of barley. BMC Genomics 7, 275.
Wondji, C.S., Hemingway, J. & Ranson, H. (2007) Identification and analysis of single nucleotide polymorphisms (SNPs) in the mosquito Anopheles funestus, malaria vector. BMC Genomics 8, 5.
Xia, Q.Y. et al. (2004) A draft sequence for the genome of the domesticated silkworm (Bombyx mori). Science 306, 1937-1940.
Yasukochi, Y., Ashakumary, L.A., Baba, K., Yoshido, A. & Sahara, K. (2006) A second-generation integrated map of the silkworm reveals synteny and conserved gene order between lepidopteran insects. Genetics 173, 1319-1328.
Zhu, Y.Y., Machleder, E.M., Chenchik, A., Li, R. & Siebert, P.D. (2001) Reverse transcriptase template switching: A SMART™ approach for full-length cDNA library construction. Biotechniques 30, 892-897.
Zhulidov, P.A. et al. (2005) A method for the preparation of normalized cDNA libraries enriched with full-length sequences. Russian Journal of Bioorganic Chemistry 31, 170-177.
Zhulidov, P.A. et al. (2004) Simple cDNA normalization using kamchatka crab duplex-specific nuclease. Nucleic Acids Research 32, e37.

Figure 2-1. Gel image of two cDNA collections (A is larval/pupal material; B is adult) made from poly(A+) mRNA, before and after cDNA normalization. Size markers are shown in the lane at the left; normalized cDNA lanes are labeled N-cDNA. Note the high relative abundance of about 10 transcripts (distinct bands) in the cDNA lanes and complete absence thereof after normalization.

Figure 2-2. Characteristics of assembled M. cinxia 454 contigs, and blastx alignments against Bombyx mori. A,B; Length and coverage of contigs. Variable bar width in plot A is an artifact of the logarithmic horizontal axis; there is no information content in the area of these bars. Note the logarithmic vertical axes; the majority of contigs were small and had low coverage depth. C,D; Frequency distributions of percent identity and deduced amino acid alignment length for all blast hits (bitscore > 45) of M. cinxia contigs and singletons (N= 13,653) to 6,289 Bombyx mori predicted proteins.

Figure 2-3. Examples of blast alignments of M. cinxia assembled Sanger ESTs against M. cinxia 454 contigs. A: An alignment that is representative of the average alignment length and percent identity (see Table 4). Note the variable presence of what appears to be an alternatively spliced microexon (single codon). B: The longest Sanger vs. 454 alignment.

Figure 2-4. Upper; Schematic showing how local clusters were often not joined together, primarily because the critical sequence reads needed to join clusters into one full length, or nearly full length, contig were lacking. Naturally-occurring variation in the form of polymorphism and alternative splicing, and bias toward increased coverage of transcript end regions (Bainbridge et al 2006; see Figure 4a), can hamper assembly even where there is high coverage and a low rate of sequencing errors. Lower; Within clusters of multiple reads, polymorphic sites, or SNPs, are readily identified. The right side of this alignment shows an example of how sequencing error is ignored in the consensus sequence (underlined sequence above the alignment).

Figure 2-5. A: Example of sampling biases (Bainbridge et al 2006) present in 454 sequencing. Sequence ends, particularly the 5’ end, tend to be over-represented in 454 ESTs, with under-representation of the immediate flanking region. This pattern is discernible in the most heavily covered contigs in which there was enough sampling of the end-flanking region to join the end to the remainder of the contig. B: Relatively even coverage of the remainder of the same contig shown in A. C: Bimodal distribution of coverage depth as a function of contig length. The upward sloping pattern is expected (i.e. contigs become longer when there is more sequence coverage, up to the point of full length assembly). The circled portion of the plot contains contigs in which there was high coverage but little or no length extension. Few of these generated blast hits and thus appear to be over-sampled end regions (presumably primarily UTR) that lacked sufficient flanking coverage to allow them to assemble with the remainder of ESTs for that gene.

Figure 2-6. Comparison of M. cinxia 454 contig length to the coding region length of B. mori orthologous genes. Orthologs were identified by blastx (bitscore > 45). A: Ratio of 454 contig to B. mori ortholog length in relation to coverage depth (average number of ESTs per nucleotide site within a contig). Ratios greater than unity are caused primarily by the presence in 454 contigs of untranslated regions that extend beyond the coding region. Solid curve shows sigmoid fit for all B. mori genes; dashed curve shows fit for B. mori genes 800 – 1800 bp in length; line shows fit for those greater than 1800 bp length. Where average coverage depth was ≥10, nearly all 454 contigs were as long or longer than the coding region of their putative B. mori ortholog, suggesting full length assembly. B: Ratio of B. mori ortholog length to the summed non-overlapping alignment length of M. cinxia sequences that had a best blast hit (bitscore > 45) to that B. mori gene (i.e. all putative fragments of individual genes).

Figure 2-7. Venn diagram showing distribution of similarity search results. Numbers are the sum of unique Bombyx mori proteins, unique Drosophila melanogaster or Uniprot proteins, and unique Butterflybase (lepidoptera species excluding B. mori) or Heliconius erato EST clusters that had best blast hits (> 45 bitscore) to a 454 unigene. Priority hierarchy for determining uniqueness of proteins or EST clusters in overlap regions (i.e. cases where multiple 454 unigenes aligned to a unigene shared between the respective databases) was B. mori protein > Uniprot protein or D. melanogaster protein > EST cluster name. Sequences with best blast hits to non-metazoan proteins are excluded from these counts.

Figure 2-8. A-C; Frequency distributions of log10-transformed median spot intensity (corrected for local background and with a constant added to make all values positive for log transformation) of microarray probes. N=602 negative control probes; N=125,655 probes for contigs and singletons that had significant blast hits; N = 113,862 unannotated contigs. This initial test array had low signal intensity overall (note lack of saturated spots [intensity values of 4.8]) and thus yields conservative estimates of the number of quality probes. The greater number of very low signals in B, C compared to A is due to the extreme difference in sample size. D. Distribution of log10-transformed median spot intensity for the lesser (upper plot) and greater (lower plot) performing probe orientation for all probes of unannotated genes for which one probe had an intensity of at least 2.5 (i.e. a strong signal in at least one probe orientation). Signals of the lesser performing probes cluster at just above the intensities obtained from negative control probes, with an average 10-fold difference among these differently oriented probe pairs.

Figure 2-9. Alternative splicing near the 3’ end of the previously characterized M. cinxia muscle gene troponin-i. A) Conjoining of ESTs (red, green, and yellow boxes) representing partial nucleotide sequences of three alternatively spliced mutually exclusive versions of exon six in 454 assembly alignment. Note the degenerate nucleotide string in the consensus sequence (blue box). B) Amino acid alignment of the same three alternatively spliced exons (red, green, and yellow exons) with Bombyx mori sequence for comparison (bottom yellow). The top alternative exon, A1a (purple), was recovered full length in a separate 454 contig. There was one additional, much shorter 454 contig partially spanning the alternatively spliced region representing most of TnI_A3a (green). Reassembly of all of the troponin-i 454 ESTs using more stringent assembly parameters produced four contigs spanning this region with full (A1a and A4a) and partial (A2a and A3a) coverage of the four versions of exon six. Full characterizations of three of the alternative exons (A2a, A3a, and A4a) had been achieved in previous experiments (J. Vera; unpublished) using standard PCR, RACE, and cloning methods and were needed to unravel the complexity of the effect of alternative splicing on these 454 contigs.

Figure 2-10. Probes from 454 sequence generally produce microarray spot intensity comparable to that of probes designed from Sanger sequence of the same gene (r² = 0.57; N = 892). Diagonal is the line of identity. There was no systematic tendency for either type of probe to yield better performance.

Figure 2-11. Examples of 44K-feature Agilent arrays with probes made from 454 sequence data. A,B; Comparison of raw hybridization intensity from the two channels of microarrays hybridized with indirectly labeled aRNA from whole abdomens of 2-day old female M. cinxia butterflies. Note the substantial increase in transcript level of microsporidia genes (larger open circles) in butterfly 104 compared to butterfly 168 (B), and the apparent absence of microsporidia in the two other individuals (A). C: Repeatability of signal intensity (mean of three replicates per feature) across two microarrays hybridized with labeled abdominal aRNA from a single individual.

Table 2-1. Settings used in Seqman Pro 7.1 to assemble 454 ESTs.

quality screening and trimming = moderate
match size = 21
minimum match % = 80
match spacing = 5
min seq length = 40
gap penalty = 0
gap length penalty = 0.7

Table 2-2. Number of reads and nucleotides produced by two 454 sequencing runs.

                                          Run A      Run B      Titration Run
N ESTs                                    373,752    269,944    Not recorded
N ESTs after initial quality filtering    350,329    245,612    12,112
Mean EST length                           110        111        108

Total nucleotide count (all runs combined)
N nucleotides after initial quality filtering                          67,191,253
N nucleotides after SMART primer filtering and preassembly trimming    53,702,862 (79.9%)
N nucleotides assembled as contigs                                     40,586,542 (60.4%)
N nucleotides remaining as singletons                                  6,176,027 (9.2%)
N nucleotides discarded by assembler                                   6,940,293 (10.4%)

Table 2-3. Sanger EST statistics.

                                            N       Avg EST  Total      N        Avg Contig  Avg Contig      N           Avg Singleton
Stage                                       ESTs    Length   BPs        Contigs  Length      Coverage Depth  Singletons  Length
Initial                                     3,888   628      2,432,154  n/a      n/a         n/a             n/a         n/a
After Pre-Assembly Screening and Trimming   3,741   342      1,280,035  n/a      n/a         n/a             n/a         n/a
After Assembly                              3,008   387      1,164,906  364      574         3.0             1297        386
After Post-Assembly Screening               3,008   387      1,164,906  364      569         3.0             1297        383

Table 2-4. Summary statistics for blast of assembled M. cinxia Sanger sequences against M. cinxia 454 contigs and singletons. All blast results refer to hits with bitscores greater than or equal to 45. Percentage length coverage refers to the total extent of Sanger sequences covered by aligned 454 sequences. Percent identity is reduced by polymorphism and reduced accuracy of Sanger sequencing at end regions; it is not a valid measure of the 454 error rate. Alignment lengths refer to nucleotides (not deduced amino acids as in Fig 3 and Table 3).

Number of assembled and blast filtered Sanger sequences: 813
Number of Sanger sequences with at least one blast hit against 454 sequences: 749
% Sanger sequences with a blast hit against 454 sequence: 92%
Number of 454 contigs with blast hits against 813 Sanger sequences: 1296
Number of 454 singletons with blast hits against 813 Sanger sequences: 785

Average percent length coverage of Sanger sequences by aligned 454 sequences: 74%
Median percent length coverage of Sanger sequences by aligned 454 sequences: 90%
Median number of 454 contigs aligned per Sanger unigene: 2
Median number of 454 singletons aligned per Sanger putative unigene: 1

Mean percent identity of Sanger vs. all 454 contig blast hit alignments: 97%
Mean percent identity of Sanger vs. best 454 blast hit alignment: 99%
Mean percent identity of Sanger vs. 454 singleton blast hit alignments: 97%
Mean alignment length (nt) for Sanger vs. 454 contigs: 200
Mean alignment length (nt) for Sanger vs. 454 singletons: 49

Mean number of gaps within Sanger vs. all 454 contig blast hit alignments: 0.63
Median number of gaps within Sanger vs. all 454 contig blast hit alignments: 0
Mean number of gaps within Sanger vs. all 454 singleton blast hit alignments: 0.35
Median number of gaps within Sanger vs. 454 singleton blast hit alignments: 0

Table 2-5. Summary statistics for blast analyses of M. cinxia 454 contigs and singletons against predicted B. mori proteins, Uniprot proteins, and translated clustered ESTs in Butterflybase. A minimum bitscore of 45 was used as the cutoff threshold for all of these analyses. Sequences that had best hits in Uniprot that were non-metazoans (N=716) are excluded. Values in parentheses are medians.

                                                    B. mori     Uniprot Proteins  Butterflybase
Contigs
  N significant hits                                9237        8253              7741
  Mean best hit amino acid alignment length         73 (51)     84 (58)           63 (46)
  Mean % identity                                   74 (76)     66 (65)           75 (78)
  Mean N gaps                                       0.44 (0)    0.62 (0)          0 (0)
  Mean bitscore                                     107 (76)    109 (75)          119 (82)

Singletons
  N significant hits                                4416        2787              2920
  Mean best hit amino acid alignment length         32 (32)     32 (32)           30 (31)
  Mean % identity                                   81 (82)     77 (76)           80 (82)
  Mean N gaps                                       0.08 (0)    0.07 (0)          0 (0)
  Mean bitscore                                     57 (56)     56 (54)           61 (58)

Contigs and Singletons combined
  N unique gene descriptions in blast hit subjects  6289        6451              6324

Table 2-6. Gene ontology (GO) assignments for M. cinxia 454 ESTs using only Arthropod blastx alignments.

Definitions                                                     Counts  Fractions
biological_process                                                3860     20.50%
metabolism                                                        2744     14.57%
molecular_function                                                1944     10.32%
binding                                                            936      4.97%
protein metabolism                                                 931      4.94%
catalytic activity                                                 747      3.97%
transport                                                          659      3.50%
nucleobase, nucleoside, nucleotide and nucleic acid metabolism     632      3.36%
biosynthesis                                                       430      2.28%
generation of precursor metabolites and energy                     423      2.25%
electron transport                                                 371      1.97%
transferase activity                                               346      1.84%
protein binding                                                    302      1.60%
DNA metabolism                                                     273      1.45%
protein biosynthesis                                               247      1.31%
cellular_component                                                 243      1.29%
hydrolase activity                                                 213      1.13%
cell                                                               212      1.13%
carbohydrate metabolism                                            192      1.02%
protein modification                                               191      1.01%
cell organization and biogenesis                                   181      0.96%
cell communication                                                 170      0.90%
transcription                                                      164      0.87%
signal transduction                                                160      0.85%
ion transport                                                      149      0.79%
lipid metabolism                                                   145      0.77%
catabolism                                                         120      0.64%
nucleic acid binding                                               116      0.62%
nucleotide binding                                                 114      0.61%
signal transducer activity                                          99      0.53%
receptor activity                                                   97      0.52%
response to stress                                                  88      0.47%
amino acid and derivative metabolism                                83      0.44%
development                                                         81      0.43%
peptidase activity                                                  77      0.41%
kinase activity                                                     74      0.39%
organelle organization and biogenesis                               68      0.36%
intracellular                                                       60      0.32%
enzyme regulator activity                                           56      0.30%
structural molecule activity                                        54      0.29%
transporter activity                                                52      0.28%
protein transport                                                   50      0.27%
cell differentiation                                                44      0.23%
cell cycle                                                          40      0.21%
calcium ion binding                                                 37      0.20%
response to endogenous stimulus                                     35      0.19%
cytoplasm                                                           34      0.18%
response to biotic stimulus                                         34      0.18%
RNA binding                                                         30      0.16%
mitochondrion                                                       28      0.15%
cytoskeleton organization and biogenesis                            27      0.14%
morphogenesis                                                       26      0.14%
cytoskeletal protein binding                                        23      0.12%
reproduction                                                        22      0.12%
carbohydrate binding                                                21      0.11%
translation regulator activity                                      19      0.10%
DNA binding                                                         19      0.10%
translation factor activity, nucleic acid binding                   19      0.10%
actin binding                                                       18      0.10%
death                                                               17      0.09%
antioxidant activity                                                16      0.08%
nuclease activity                                                   15      0.08%
cell death                                                          15      0.08%
extracellular region                                                13      0.07%
nucleus                                                             13      0.07%
regulation of gene expression, epigenetic                           12      0.06%
protein kinase activity                                             11      0.06%
plasma membrane                                                     11      0.06%
secondary metabolism                                                10      0.05%
cell-cell signaling                                                 10      0.05%
lipid binding                                                        9      0.05%
ion channel activity                                                 6      0.03%
embryonic development                                                6      0.03%
motor activity                                                       4      0.02%
behavior                                                             4      0.02%
mitochondrion organization and biogenesis                            4      0.02%
response to abiotic stimulus                                         4      0.02%
peroxisome                                                           4      0.02%
cell homeostasis                                                     3      0.02%
receptor binding                                                     3      0.02%
cell proliferation                                                   3      0.02%
response to external stimulus                                        2      0.01%
cytoplasm organization and biogenesis                                1      0.01%
nucleoplasm                                                          1      0.01%
transcription regulator activity                                     1      0.01%
cell recognition                                                     1      0.01%
chromosome                                                           1      0.01%
viral life cycle                                                     1      0.01%
endosome                                                             1      0.01%
growth                                                               1      0.01%

Table 2-7. Taxonomic group of the blast subject species for 454 contigs and singletons (N = 699) that had significant (bitscore > 45) best blast alignments to apparent xenobiotic genes.

                                  Number of 454 contigs and       Mean % deduced         Mean amino acid
Taxonomic group                   singletons with best blast hit  amino acid identity    alignment length
Ds DNA viruses                    37                              63                     49
Ss RNA positive-strand viruses    8                               44                     90
Archaea                           10                              68                     32
Bacteria
  Firmicutes                      26                              68                     38
  Proteobacteria                  38                              68                     40
  other bacteria                  16                              63                     39
Protozoa                          20                              44                     89
Fungi
  Ascomycota                      5                               65                     40
  Basidiomycota                   6                               63                     66
  Microsporidia                   452                             81                     35
Plants                            81                              69                     54

Table 2-8. Transcriptome space and coverage comparison among relevant taxa. Expected coverage refers to the average coverage depth predicted for that size genome from our size 454 sample, or in the case of Medicago truncatula, to the average coverage depth obtained in that study (Cheung et al. 2006).

                                             Average protein-coding               Expected
Species                        Total bases   gene length (nt)        N unigenes   coverage depth
Anopheles gambiae              20,934,908    1,537                   13,621       2.99
Drosophila melanogaster        44,876,964    2,193                   20,468       1.40
Apis mellifera                 18,224,922    1,648                   11,062       3.44

Medicago truncatula            80,000,000    2,000                   40,000       0.29
M. truncatula 454 assembly     23,000,000                                         0.23

Melitaea cinxia theoretical    27,000,000    1,500                   18,000       2.22
M. cinxia 454 assembly         62,637,010                                         2.97
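As a worked check on the expected coverage column of Table 2-8, the short Python sketch below assumes that expected depth is simply the number of bases in the M. cinxia 454 sample (62,637,010) divided by the total transcriptome bases of each comparison species. That formula is an assumption inferred from the caption, and the sketch covers only the three insect rows; the function and variable names are illustrative.

```python
# Bases in the M. cinxia 454 assembly row of Table 2-8.
SAMPLE_BASES = 62_637_010

# Total transcriptome bases for the three insect comparison species (Table 2-8).
TRANSCRIPTOME_BASES = {
    "Anopheles gambiae": 20_934_908,
    "Drosophila melanogaster": 44_876_964,
    "Apis mellifera": 18_224_922,
}


def expected_coverage(sample_bases: int, transcriptome_bases: int) -> float:
    """Average depth if the sampled bases were spread evenly over the transcriptome."""
    return sample_bases / transcriptome_bases


coverage = {
    species: round(expected_coverage(SAMPLE_BASES, size), 2)
    for species, size in TRANSCRIPTOME_BASES.items()
}
for species, depth in coverage.items():
    print(f"{species}: {depth:.2f}")
```

Under this assumption the three ratios round to 2.99, 1.40, and 3.44, matching the tabulated values for those species.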

Chapter Three

PipeMeta: a Toolset for de novo Transcriptome Assembly with an Additional New Method for Transcriptome Quality Analysis

Introduction

Next-generation sequencing is transforming the way we do biology, allowing powerful functional genomic techniques and tools to be applied to new and interesting questions. Of particular value for smaller labs (particularly those working on non-model species) are de novo transcriptome studies (e.g. Vera, Wheat et al. 2008), which allow broad molecular insights without the greater expense and time required for genome sequencing and assembly. However, with sequencing technology currently in a state of rapid flux, investing precious experimental resources remains a daunting prospect, due to the steep bioinformatic learning curve combined with the risk of rapid technology turnover. Here I present a custom toolset (PipeMeta) for use in a de novo transcriptome assembly pipeline and for analysis of assembly quality using a novel method of realistic transcriptome simulation that captures important aspects of transcriptome complexity.

There are several potential paths for transcriptome analysis. Methods diverge with the two currently dominant sequencing technologies, 454 pyrosequencing and Illumina sequencing (Margulies, Egholm et al. 2005; reviewed in Li, Chen et al. 2011, Metzker 2010, Harismendy, Ng et al. 2009), and the two most common assembly algorithm classes, standard alignment-based (i.e. overlap/layout/consensus [OLC]; reviewed in Miller, Koren et al. 2010; e.g. Gregory, 2011) and de Bruijn graph-based ([DBG]; reviewed in Miller, Koren et al. 2010; Sutton, 2007; Pevzner, Tang et al. 2001; Paszkiewicz and Studholme 2010; Zerbino and Birney 2008). The highest quality de novo assemblies are known to combine both 454 and Illumina sequence, where many of the well-documented systematic error profiles of the individual technologies cancel out during hybrid assembly (reviewed in Harismendy, Ng et al. 2009; genome example in Rodrigue, Malmstrom et al. 2009; transcriptome example in Garg, Patel et al. 2011). However, most labs still use a single technology due to financial, logistical, or computational constraints (e.g. Hou, Bao et al. 2011; Gregory, Darby et al. 2011; Chen, Yang et al. 2010). There is not yet consensus on the feasibility of very short reads for de novo transcriptome assembly, as there are contradictory results on quality in the literature, although the tide seems to be turning in favor of Illumina-style sequencing technology (Gibbons, Janson et al. 2009, Surget-Groba and Montoya-Burgos 2010), partly due to recent advances in assembly of short reads (e.g. Trinity; Grabherr, Haas et al. 2011), as well as improvements in read length (from 35 bp to 70-100 bp) and the inclusion of paired-end information in Illumina sequencing.

Recently, DBG algorithms, always popular for genome assembly and extremely large data sets (e.g. Illumina short reads) due to their speed and less demanding memory needs, have started to be used for de novo transcriptome assembly of next-generation reads (e.g. CLC, Velvet Oasis, and Trinity; reviewed in Lin, Li et al. 2011; e.g. Garg, Patel et al. 2011; Grabherr, Haas et al. 2011). These assemblers have advantages and disadvantages in comparison to more traditional OLC assemblers (e.g. Seqman Ngen and Mira). During de novo transcriptome assembly, DBG algorithms have the potential to reconstruct whole alternative transcripts (i.e. multiple correct, full length gene models per gene). However, this potential comes at the cost of increased complexity, an increased chance of certain error types (Miller, Koren et al. 2010), and loss of haplotype expression specificity in certain cases (i.e. mapping reads back onto the contigs for expression profiling and polymorphism analysis becomes problematic; e.g. Meyer, Aglyamova et al. 2009), although some progress has been made on mitigating these problems (Surget-Groba and Montoya-Burgos 2010; reviewed in Martin and Wang 2011). By necessity, DBG analysis pipelines include at least two stages: de novo assembly, followed by mapping reads back onto the resulting contigs as references in order to perform expression analysis and detect polymorphism.
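To make the DBG idea concrete, here is a minimal Python sketch of how reads are decomposed into k-mers and linked through their (k-1)-mer overlaps. It is a toy illustration only, not the implementation used by any of the assemblers named above; the reads, the value of k, and the function names are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, where alternative paths diverge."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two toy reads that share a prefix region and then differ at one base.
reads = ["ATGGCGTGCA", "GGCGTTCAAT"]
graph = build_de_bruijn_graph(reads, k=4)
print(branching_nodes(graph))  # ['CGT']
```

The single branching node is where the two reads diverge. In a real transcriptome such branches arise from polymorphism, alternative splicing, or sequencing error, which is why a DBG assembler can in principle follow each path to a separate transcript, and also why telling the three sources apart adds complexity.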
While this method holds some advantages (particularly for extremely large datasets and genomes), such as the potential to map reads onto fully reconstructed haplotypes, it also increases the complexity of downstream assembly quality analysis (e.g. reads do not necessarily map back to the reference contigs they helped assemble in the first place). In part, some of these difficulties can be offset by using paired-end data (Vezzi, Narzisi et al. 2012), although some issues remain and paired-end reads can themselves contribute additional complexities to analysis of the results (for all assembly method types), not least of which is the lack of a simple method for visualizing the results; in any case, paired-end reads are rarely used for de novo transcriptomics with certain technologies such as 454 FLX.

On the other hand, the traditional OLC assembly method also comes with both strengths and weaknesses. For instance, OLC assembly algorithms tend to produce a single large, full-length contig incorporating the most common alternative transcript varieties along with reads from constitutive mRNA regions, sometimes accompanied by the formation of several shorter contigs containing the divergent portions of rarer haplotypes or splice variants (Vera et al., 2008). However, recent hybrid or advanced OLC algorithms have emerged that also claim the DBG assemblers' potential to reconstruct whole transcripts (e.g. the most recent versions of Newbler and Mira). Thus, the primary benefits of OLC assemblers are that they come in many forms (they have been around longer than DBG assemblers) and use a variety of strategies for overcoming common issues within the de novo transcriptome assembly process (e.g. Newbler; reviewed in Miller, Koren et al. 2010), while still retaining most or all of the strengths typical of OLC assemblers, such as intuitive, simple output and downstream analysis (including annotation, expression, and diversity analysis).

The use of simulated data is potentially a powerful tool for analysis of de novo transcriptome assembly quality. Although presumably extensively used by assembly algorithm developers and in various types of genomics and evolutionary studies (Zhang, Chen et al. 2011; reviewed in Carvajal-Rodriguez 2008), simulation is fairly under-utilized in transcriptomics beyond its use as a tool for predicting gene coverage per sequencing effort. By generating artificial reads in a controlled and repeatable fashion, it becomes feasible to compare assembly and annotation quality under multiple conditions and methods, such as hybrid sequencing technologies and multiple-individual pooled samples. Such assemblies can then generate precise error and quality metrics due to the ability to track the origin of the simulated reads.
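The idea of repeatable simulated reads whose origin can be tracked can be sketched in a few lines of Python. This is an illustration under simple assumptions (error-free reads drawn uniformly from each transcript, origin stored in the read identifier), not the PipeMeta simulator; the function names, the identifier format, and the toy transcripts and contigs are hypothetical.

```python
import random


def simulate_reads(transcripts, n_reads, read_length, seed=0):
    """Draw error-free reads uniformly; each read id records transcript and start."""
    rng = random.Random(seed)  # fixed seed makes the simulation repeatable
    names = sorted(transcripts)
    reads = []
    for i in range(n_reads):
        name = rng.choice(names)
        seq = transcripts[name]
        start = rng.randrange(len(seq) - read_length + 1)
        reads.append((f"read{i}|{name}|{start}", seq[start:start + read_length]))
    return reads


def origin_of(read_id):
    """Recover (transcript name, start position) from a simulated read id."""
    _, name, start = read_id.split("|")
    return name, int(start)


def chimeric_fraction(contig_to_read_ids):
    """Fraction of contigs built from reads of more than one source transcript."""
    chimeric = sum(
        1 for ids in contig_to_read_ids.values()
        if len({origin_of(read_id)[0] for read_id in ids}) > 1
    )
    return chimeric / len(contig_to_read_ids)


transcripts = {"geneA": "ACGT" * 10, "geneB": "TTGACC" * 8}  # toy sequences
reads = simulate_reads(transcripts, n_reads=20, read_length=12, seed=1)

# Hypothetical assembler output: contig name -> ids of the reads it contains.
contigs = {
    "contig1": ["read0|geneA|3", "read1|geneA|9"],
    "contig2": ["read2|geneA|0", "read3|geneB|5"],
}
print(chimeric_fraction(contigs))  # 0.5
```

Because every read carries its source, any assembly of the simulated data can be scored directly, for example by counting contigs that merge reads from different genes, without needing a reference genome.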
Algorithms for generating and analyzing basic simulated read assemblies for genomes are extensive, as are more complex simulators intended for evolutionary genetics studies (reviewed in Carvajal-Rodriguez 2010), ranging from basic in-house scripts to more advanced metagenomic tools such as MetaSim (Richter, Ott et al. 2008). Transcriptome simulation is somewhat less straightforward than genome simulation due to heterogeneous gene expression and sometimes extreme variability between tissues, treatments, and species. However, some methods have been developed based on sample data from Sanger sequencing (Susko and Roger 2004; Wang, Lindsay et al. 2005), and some of those have recently been adapted for next-generation technology (Wall, Leebens-Mack et al. 2009). Another approach to transcriptome simulation involves modeling the process of transcription itself, and several general models for the distribution of cellular transcripts have been proposed in the literature (e.g. Subkhankulova, Gilchrist et al. 2008). These models typically use some form of power-law model related to the Pareto distribution to create a small number of very highly expressed genes while the majority have low expression (Kuznetsov, Knott et al. 2002; Ueda, Hayashi et al. 2004; J. C. Nacher 2006), a pattern thought to arise primarily from stochasticity in the decay rate of RNA (Paulsson 2004). Indeed, several other biological processes are suspected to have Pareto-like distributions (e.g. allometry), and there is evidence that this includes rates of polymorphism distributed throughout transcriptomes (Castle 2011; see below).

Thus, for the development of potential new quality analysis methods and to ease the use and improve the effectiveness of established methods (e.g. Vera et al., 2008), a simple transcription-simulating tool was developed that can also simulate, to a limited extent, sequencing error for 454 and Illumina sequencing technologies and simple sequence polymorphism (i.e. SNPs and indels) using various basic models.

It remains to be seen whether a single technology will again become the gold standard in sequencing (like Sanger sequencing in the past), but in the meantime methods continue to evolve independently along diverging lines for specific projects, with little consensus so far achieved. While some progress has been made on development of methods for determining assembly quality (e.g. Meyer, 2009; Meyer, Aglyamova et al. 2009; Surget-Groba and Montoya-Burgos 2010; reviewed in Martin and Wang 2011), there remains room for improvement in general and in particular for de novo transcriptomes. Here I present a simple yet effective method for evaluation and simulation of OLC and DBG assembled de novo transcriptomes using 454 and limited amounts of Illumina data, along with an expanded and streamlined set of de novo transcriptome analysis tools developed specifically for this purpose.
The study presents multiple real transcriptome analyses generated with the expanded toolset and compares them to idealized results from several simulated transcriptome data sets of escalating complexity (also generated using the new toolset), testing the prediction that OLC-based methods provide higher quality de novo transcriptome assemblies for long-read high-throughput sequencing technologies such as 454, and providing recommendations based on the results of the comparisons.

Materials and Methods

M. cinxia 454 Reads

Sampling of butterflies from the Aland islands, cDNA preparation, sequencing, quality filtering, assembly, and annotation of the original M. cinxia 454 reads were previously reported in detail (Vera et al., 2008). Briefly, cDNA was synthesized from normalized pools of RNA from multiple butterfly individuals (>80) and then sequenced using GS20 454 technology. After sequencing, the reads were passed through the PipeMeta annotation pipeline to screen for primer and adapter sequence and for quality trimming, then assembled using the commercial Lasergene Seqman Pro assembler from DNAStar (http://www.dnastar.com/) and annotated by Blast alignment to the Uniprot (http://www.uniprot.org/), GO (http://www.geneontology.org/), and KEGG (http://www.genome.jp/kegg/) databases, as well as by Blast alignment to predicted proteins from a closely related reference species, the silkworm moth Bombyx mori (all accomplished through PipeMeta). Additional M. cinxia transcriptome reads were subsequently acquired via a new run using the 454 FLX platform. These new reads were combined with the original GS20 reads in a new assembly (assembly version 2.0) after quality trimming steps using Seqman Pro, and annotated using updated annotation databases and methods (Wheat, Fescemyer et al. 2011). The trimmed reads, assembled unigenes, and all annotations for both reported Seqman Pro assemblies are available on the cinxiabase website (http://cinxiabase.vmhost.psu.edu/).

Additional Assembly and Annotation

All reads were originally assembled with the commercial software Seqman Pro of the Lasergene package from DNAStar. For this study, additional assemblies were run using Mira, an open-source OLC assembler that uses an iterative assembly process with a learning algorithm to correct for common assembly errors (http://chevreux.org/projects_mira.html), and the CLC assembly cell package (http://www.clcbio.com/index.php?id=1331), a commercial DBG assembly tool set that combines a de Bruijn graph de novo assembly program with a reference mapping program to generate full assembly alignments. For all transcriptome assemblies, an effort was made to set equivalent assembly parameters where possible, for example the minimum number of ESTs per contig (2x), minimum contig length (50bp), and minimum percent identity to the contig consensus (85%). Recommended default settings were used for all other parameters.

PipeMeta Pipeline

PipeMeta consists of a series of perl scripts to perform primer filtering, quality trimming, sequence editing, annotation, storage, and SNP analysis of next-generation transcriptome sequences, all of which can be optionally run from a desktop through a simple menu-based perl script (hence, PipeMeta). I developed PipeMeta and its related tools to streamline the transcriptome pipeline process from a series of independent command-line perl scripts to a more user-friendly, organized, and time-saving format. The PipeMeta package, which includes all pipeline scripts, documentation, a tutorial, and other related pipeline files, can be found at http://www.personal.psu.edu/jcv128/software.html. All other required third party tools, resources, and software, including MySQL, Bioedit, ActivePerl and related modules (optional), and NCBI Blast are open-source or freeware and are available for download online (see PipeMeta documentation for links and additional information about PipeMeta).

Modified in silico Predicted Gene Set

The silk moth Bombyx mori diverged from M. cinxia approximately 100 million years ago (Pringle, Baxter et al. 2007; Grabherr, Haas et al. 2011), making it the closest related insect species with a (mostly) complete sequenced genome (Duan, Li et al. 2009; Wang, Xia et al. 2005; Xia, 2004), and thus the best choice for use as a limited reference for the M. cinxia transcriptome (Vera, Wheat et al. 2008). In silico prediction algorithms are not perfect, and gene sets often contain numerous incorrectly predicted sequences. In addition, the B. mori predicted gene set lacks UTR sequence, making the predicted genes shorter than the actual transcribed mRNA they are meant to predict. However, the annotation success rates for predicted sets are often relatively high (e.g. ~35% B. mori unique annotations), likely due mostly to their longer lengths (as opposed to assembled contigs). In order to more accurately assess the number of unique transcripts and amino acids covered by sequenced reads, as well as for the purpose of improved power in the annotation process (Wheat, Fescemyer et al. 2011), the B. mori predicted genes were modified by removing all shorter sequences that aligned to larger sequences with more than 90% identity across at least 80% of the alignment. This process also provided a basic, low-complexity reference gene template for creation of simulated reads (i.e. no recent duplications, polymorphism, alternative splicing, or UTR). Official predicted gene sequences for B. mori were downloaded from the genome project webpage (Silkworm Genome Database at http://silkworm.genomics.org.cn/).
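The redundancy filter described above can be sketched as follows. This is a minimal Python illustration, not the Perl code used in this study. It assumes that pairwise alignments are already available as simple tuples, and it interprets "at least 80% of the alignment" as an alignment spanning at least 80% of the shorter sequence, which is an assumption. The function name and input layout are hypothetical.

```python
def filter_redundant(lengths, hits, min_identity=90.0, min_fraction=0.8):
    """Return sorted IDs kept after removing shorter near-duplicate sequences.

    lengths: dict mapping sequence ID -> sequence length
    hits:    iterable of (id_a, id_b, percent_identity, aligned_length)
             (hypothetical layout; any pairwise aligner output could be
             converted to it)
    """
    removed = set()
    for id_a, id_b, identity, aligned_len in hits:
        if id_a == id_b:
            continue  # ignore self-alignments
        # Order the pair so that the shorter sequence comes first
        # (ties are broken by ID so the result is deterministic).
        shorter, _longer = sorted((id_a, id_b), key=lambda s: (lengths[s], s))
        # Assumed reading of the threshold: the alignment must span at
        # least min_fraction of the shorter sequence.
        long_enough = aligned_len >= min_fraction * lengths[shorter]
        if identity > min_identity and long_enough:
            removed.add(shorter)
    return sorted(set(lengths) - removed)


if __name__ == "__main__":
    example_lengths = {"g1": 1000, "g2": 400, "g3": 500}
    example_hits = [("g2", "g1", 95.0, 380), ("g3", "g1", 95.0, 200)]
    print(filter_redundant(example_lengths, example_hits))
```

In the example, g2 is dropped because 380 of its 400 positions align to g1 at 95% identity, while g3 is kept because its alignment covers less than 80% of its length.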

Simulated Transcriptome Reads

Five sets of simulated transcriptome reads were generated from B. mori predicted transcripts by using real M. cinxia read metrics (see Table 1) as input parameters in TranscriptSimulator. The purpose of these simulated read sets was to provide replicable simulated assemblies of varying complexity for comparison to the real M. cinxia transcriptome assemblies. From the simplest set to the most complex, the simulated sets were: Set1A and Set2A (generated from the modified B. mori reference without error or polymorphism), Set2B (generated from the modified B. mori reference with error but no polymorphism), Set2C (generated from the unmodified B. mori reference with error but no polymorphism), and Set2D (generated from the unmodified B. mori reference with both error and polymorphism). Sets 1A and 2A can be considered replicates, as they share all parameters except that each reference transcript was assigned a copy number from a power-law distribution independently (i.e. the reference transcripts for 1A and 2A have different expression patterns). All set 2 simulated reads (SimSet2A-D) have very similar although not identical read copy number distributions, due to sharing the same random seed for the power-law model algorithm. The simulated reads were based on either the modified (SimSet 1A, 2A, and 2B) or unmodified (SimSet 2C and 2D) Bombyx mori in silico predicted official gene set derived by algorithm from the B. mori sequenced genome, with unmodified genes containing more complexity due to redundant transcript sequence from gene duplication, alternative splicing, and gene prediction error. Although from one point of view algorithmic gene prediction error could be taken as a serious complication to the use of predicted genes as simulation templates (and as references for that matter), from another point of view such error only serves to add an additional layer of complexity to a simulation experiment by mimicking large scale biological polymorphism (e.g. large indels or intron retention) and sequencing error (e.g. junk reads).

For the first step in generating the simulated reads, quality scores were simulated for the B. mori predicted gene sequences from training sets of taxonomically related sequencing data (i.e. all publicly available GS20 or FLX 454 reads from butterfly/moth from NCBI SRA at http://www.ncbi.nlm.nih.gov/sra) using the custom perl script QualSim, developed for this study in order to preserve any strong or systematic quality score patterns relative to DNA sequence. Sections of B. mori transcripts with highly variable quality scores or no quality scores were modeled by sampling from a normal distribution based on other, nearby transcript regions. The B. mori transcript set and the resulting simulated quality scores were then used to simulate the transcription process followed by simulated 454 sequencing, with parameters mimicking the basic characteristics (i.e. read length, coverage depth, average quality, expression patterns) of the M. cinxia reads using the perl script TranscriptSimulator (see below, also table 1). The pooled transcript population characteristics of the M. cinxia read sets were included in each simulated read set through consecutive runs of TranscriptSimulator using population-specific parameters for each run (see table 1), which TranscriptSimulator supports by preserving gene expression and polymorphism patterns between runs.
For read sets with simulated sequencing error, commonly reported average error rates for FLX 454 sequencing were used (insertion: 0.0025, deletion: 0.001, substitution: 0.0013; Niu, Fu et al. 2010) within a simple model for 454 sequencing error (i.e. increasingly higher rates of insertion error at homopolymer runs). For read sets with simulated polymorphism, a power-law model was used with the exponent parameter chosen according to real M. cinxia data distributions (α = 3.5, xmin = 9; see figure 8) in order to randomly assign SNPs and indels to predicted genes in the B. mori reference set. Transcript-wide SNP rates were also estimated from the data, and indel rates were derived from those according to commonly reported ratios of SNPs to indels (SNPs: 0.004, indels: 0.0005; Castle 2011; Lunter 2007; see table 2 and figure 10).
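The simple 454 error model described above (per-base insertion, deletion, and substitution rates, with insertion error rising at homopolymer runs) might be sketched as follows. This Python illustration is not the TranscriptSimulator code. It assumes that the insertion probability scales linearly with the current homopolymer run length and that substitutions pick uniformly among the other three bases. The function name is hypothetical and the default rates are those quoted in the text.

```python
import random


def simulate_454_errors(seq, ins_rate=0.0025, del_rate=0.001,
                        sub_rate=0.0013, rng=None):
    """Return a copy of seq with simulated 454-style sequencing errors."""
    rng = rng or random.Random()
    out = []
    run_len = 0   # length of the current homopolymer run
    prev = None
    for base in seq:
        run_len = run_len + 1 if base == prev else 1
        prev = base
        r = rng.random()
        if r < del_rate:
            continue  # deletion: the base is dropped from the read
        if r < del_rate + sub_rate:
            # substitution: replace with one of the other bases
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
        # Assumed homopolymer model: the chance of an extra copy of the
        # base grows linearly with the run length seen so far.
        if rng.random() < min(1.0, ins_rate * run_len):
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    read = "ACGTTTTTTTTGCA" * 20
    print(simulate_454_errors(read, rng=random.Random(1)))
```

Passing an explicitly seeded `random.Random` instance keeps the simulation repeatable, which matters when read sets are meant to be replicated.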

Additional Simulated Reads

An additional nine sets (designated Set3A-I) of simulated transcriptome reads were generated using TranscriptSimulator from modified B. mori predicted transcripts for the purpose of illustrating the relationship between the reference transcript pool size (i.e. complexity), assembly quality, and sequencing effort (figures 1-3). All nine sets were created without error or polymorphism using the same basic runtime parameters (α = 1.7, avg. length±SD = 250±75 bp). Sets 3A-3C were generated using the original modified B. mori predicted references (14,154 transcripts), with increasing minimum simulated sequencing efforts of ~50K, ~100K, and ~1M reads, respectively. Sets 3D-3F maintained the same levels of increasing sequencing effort, but were generated from a doubled B. mori reference source. Each original predicted B. mori transcript was copied and randomly altered using a custom perl script for a total of 28,308 reference transcripts. The last three simulated sets (3G-3I) were generated from a quadrupled B. mori source (56,616 transcripts).

TranscriptSimulator

The perl script TranscriptSimulator uses a power-law model (i.e. y = ax^α + ε) with an adjustable exponent scaling parameter (α) to simulate gene expression of a transcriptome (J. C. Nacher 2006). Although other models have been suggested as more representative of smaller samples (e.g. Zhu, He et al. 2008), the power-law model does a more than adequate job of simulating cellular transcript distribution at all scales (see Figures 1-2). TranscriptSimulator also uses simple models to simulate the sequencing process and sequencing error for 454 reads (i.e. increased insertion error at homopolymer runs), although other platform error models can easily be incorporated with the relevant error profiles (e.g. for Illumina, see Bravo and Irizarry 2010). In addition, TranscriptSimulator is capable of simulating simple polymorphism in the form of randomly distributed SNPs and Indels using an additional power-law (or optionally a uniform distribution) model, configurable at runtime. TranscriptSimulator and QualSim, along with documentation, are available online at http://www.personal.psu.edu/jcv128/software.html.
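As a rough illustration of assigning transcript copy numbers from a power-law model, the following Python sketch draws values by inverse transform sampling from a continuous power law with density proportional to x^(-alpha) for x >= xmin, then rounds down to whole copies. This is an assumption about one reasonable way to do it, not a description of how TranscriptSimulator is implemented. The function name is hypothetical, and the alpha value in the usage lines is illustrative only.

```python
import random


def sample_copy_numbers(n_transcripts, alpha, xmin=1, seed=None):
    """Draw one integer copy number per transcript from a power law.

    alpha is taken here to be the exponent of the probability density
    (an assumed convention); it must be greater than 1.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must be greater than 1")
    rng = random.Random(seed)
    exponent = -1.0 / (alpha - 1.0)
    # Inverse transform: x = xmin * (1 - u)^(-1 / (alpha - 1)), u in [0, 1)
    return [int(xmin * (1.0 - rng.random()) ** exponent)
            for _ in range(n_transcripts)]


if __name__ == "__main__":
    counts = sample_copy_numbers(14154, alpha=2.7, seed=42)
    counts.sort()
    print("median copy number:", counts[len(counts) // 2])
    print("maximum copy number:", counts[-1])
```

Reusing the same seed reproduces the same expression pattern, while a different seed gives an independent replicate with the same parameters.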

Simulated Assembly

Simulated assembly involves creation of contigs under idealized conditions for comparison to a real assembly using actual de novo read assembly algorithms (e.g. Seqman Pro), and is possible due to origin information preserved in the simulated read title lines. Simulated assembly is a two stage process of full or ‘perfect’ alignment of reads under the original reference genes (e.g. B. mori predicted transcripts), followed by fragmentation into simulated consensus contigs. Perfect preliminary alignments and optimal assemblies of simulated reads were created using custom perl scripts (SimAssemblyStage1 and SimAssemblyStage2) developed for this study, and are available for download at http://www.personal.psu.edu/jcv128/SoftwareC.html. The first script utilizes the reference and error/polymorphism information in the simulated read titles to sort and recreate a perfect alignment/mapping of reads under their corresponding references. The second script breaks the supercontig alignments down into an optimal assembly of contigs, with a user defined minimum read overlap required to constitute a contiguous sequence. In all cases for this study the minimum overlap parameter was one nucleotide. Additional simple perl scripts are available for extracting sequences and modifying the assembly.
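The second stage described above, breaking a perfect alignment under one reference into contigs wherever neighboring reads fail to overlap by the required minimum, can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the SimAssemblyStage2 script: reads are represented only by half-open (start, end) positions on their reference, and the function name is hypothetical.

```python
def split_into_contigs(read_spans, min_overlap=1):
    """Group reads aligned under one reference into contigs.

    read_spans: iterable of (start, end) positions, end exclusive.
    Returns a list of (contig_start, contig_end, n_reads) tuples.
    """
    contigs = []
    for start, end in sorted(read_spans):
        # A read joins the current contig only if it overlaps the
        # contig's current end by at least min_overlap nucleotides.
        if contigs and contigs[-1][1] - start >= min_overlap:
            current = contigs[-1]
            current[1] = max(current[1], end)
            current[2] += 1
        else:
            contigs.append([start, end, 1])
    return [tuple(c) for c in contigs]


if __name__ == "__main__":
    spans = [(0, 100), (50, 150), (150, 250), (300, 400)]
    print(split_into_contigs(spans, min_overlap=1))
```

With a minimum overlap of one nucleotide, the read starting at position 150 merely abuts the first contig and therefore starts a new one.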

Comparison of Assemblies: Metrics

Assemblies can be tested for quality and completeness using primarily two general approaches, comparative and annotative. The annotative approach generally attempts to judge completeness of the transcriptome by alignment of the unigenes (generally by Blast similarity search) to various databases containing well annotated orthologous DNA or protein sequences. Often as well, a set of highly conserved, single copy genes is searched against the transcriptome in question as a representative sample in order to judge completeness, the rationale being that the percentage of these genes covered reflects the overall gene coverage attained by the sequencing effort (Waterhouse, Zdobnov et al. 2011). A comparative approach to quality analysis, on the other hand, consists at its most basic level of a series of quality metrics (e.g. average contig length) that can be compared between assemblies to judge how well various aspects of the assembly process performed. At less basic levels, such as when using a set of reference transcripts (see below), comparative approaches begin to blend with annotative methods. Since no single metric can capture the multiple levels of error involved in the complex assembly process of a transcriptome, it is generally a good idea to compare as many metrics as feasible.

In genome sequencing, contig length and the N50 length are usually considered the top quality metrics (see Vezzi, Narzisi et al. 2012 for a dissenting view). The reasoning behind this is that the general goal of any assembly is to have fewer and longer contigs (up to full chromosome length, ideally). The N50 length is simply the contig length at which, while summing nucleotides in a sorted list of all the contig lengths, 50% of all the contig nucleotides have been summed. For genome assemblies this metric makes some intuitive sense: the longer the N50 length, the more the reads have come together to be assembled. One potential problem with this metric is that it does not account at all for incorrectly assembled reads. In a transcriptome assembly, the N50 metric makes far less sense as a quality metric (although it is often still reported), due primarily to variable transcript abundance and coverage depth, but also due to the fact that transcripts are often only a few thousand nucleotides long, even at full length. The Nc50 length metric proposed here is a modified version of the N50 length metric that takes into account the underlying assembled reads in a transcriptome assembly. Basically, both contig lengths and summed read nucleotide lengths are sorted, and the contig length where 50% of the underlying read nucleotides are accounted for is reported. Thus, the new Nc50 metric attempts to account for the variability of transcript abundance, although of course it cannot increase continuously due to the transcript length limit.
When combined with a comparison to simulated data, this metric gains greater context through comparison to the optimal assembly, which is by definition the best possible assembly and thus serves as a landmark when attempting to gauge the quality implications of a given Nc50 length. In transcriptome assembly, some additional basic quality metrics not commonly used in genome assemblies become valuable. The longest contig length, for instance, is a simple but useful signal of potential issues when attempting to assemble full transcript lengths. Others include average coverage depth per nucleotide, percent of read nucleotides assembled, average number of reads per contig, and top 10% mean contig length and coverage depth. All of these metrics allow a glimpse into the underlying quality of a given assembly, especially when used in comparison with other assemblies or simulations.
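The N50 and Nc50 length metrics discussed above can be computed along the following lines. This Python sketch is not the AssemblyAnalyzer code. For Nc50 it assumes one reading of the definition: contigs are sorted from longest to shortest, and the nucleotides of their underlying reads are accumulated until half of all assembled read nucleotides are accounted for. Function names and the input layout are hypothetical.

```python
def n50(contig_lengths):
    """Contig length at which half of all contig nucleotides are summed."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def nc50(contigs):
    """Contig length at which half of all read nucleotides are summed.

    contigs: list of (contig_length, read_nucleotides) pairs, where
    read_nucleotides is the summed length of the reads in that contig.
    """
    total = sum(read_nt for _, read_nt in contigs)
    running = 0
    for length, read_nt in sorted(contigs, reverse=True):
        running += read_nt
        if running * 2 >= total:
            return length
    return 0  # empty assembly


if __name__ == "__main__":
    assembly = [(400, 500), (300, 600), (200, 4000), (100, 900)]
    print("N50:", n50([length for length, _ in assembly]))
    print("Nc50:", nc50(assembly))
```

In the example the N50 length is 300 but the Nc50 length is 200, because most read nucleotides sit in a short, deeply covered contig.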

When comparing real assemblies with simulated assemblies, it becomes possible to gain a better understanding of what these quality metrics mean by placing the real data in context, as well as to gain an overall feel for the shape of the assembly, thanks to the known characteristics inherent to the simulated reads (see figure 1 for an example). Some quality metrics inherent to simulated assemblies include error rates for reads being misassembled, reads not being assembled when they should be, and reads being assembled when they should not be. These metrics are calculated from true origin information embedded in the simulated read titles. Of course, simulations are by no means perfect, and getting a simulation to behave in ways that are comparable to the real data can be challenging at best (especially for a complex set of reads such as when sequencing from a multi-individual transcript pool). However, the ability to generate multiple sets of simulated data with similar parameters and characteristics to a real data set can still help to illuminate otherwise cryptic trends in assembly behavior, even when not perfect. All quality metrics were calculated using a custom perl script (AssemblyAnalyzer) developed for this study to parse the corresponding read set’s annotated assembly file (created through PipeMeta). Mean and actual average unigene lengths, depths, and other metrics differ in their method of calculation. For example, actual average coverage depths were calculated using AssemblyAnalyzer directly on assembly files, summing coverage depths for each contig nucleotide and dividing by the summed contig length, while mean coverage depths were calculated by hand from a table of individual contig average coverage depths.
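The difference between the actual average and the mean coverage depth described above amounts to a length-weighted versus an unweighted average. A small Python sketch, with a hypothetical function name and input layout, makes the arithmetic concrete. It assumes each contig is summarized by its length and its own average depth.

```python
def coverage_depths(contigs):
    """Return (actual_average_depth, mean_depth) for an assembly.

    contigs: list of (contig_length, contig_average_depth) pairs.
    """
    total_length = sum(length for length, _ in contigs)
    # Actual average: per-nucleotide depths summed over all contigs,
    # divided by the summed contig length (a length-weighted mean).
    actual = sum(length * depth for length, depth in contigs) / total_length
    # Mean: simple unweighted mean of the per-contig average depths.
    mean = sum(depth for _, depth in contigs) / len(contigs)
    return actual, mean


if __name__ == "__main__":
    actual, mean = coverage_depths([(1000, 10.0), (100, 2.0)])
    print(round(actual, 2), round(mean, 2))
```

For one long, deep contig and one short, shallow contig, the actual average depth is about 9.27 while the mean depth is 6.0, so the two figures should not be compared directly.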
While analyzing a large number of quality metrics is a necessary component of estimating the quality of any given assembly, the amount of data involved can quickly become somewhat overwhelming due to the fact that even simple metrics can be calculated in multiple fashions. Luckily, one of the benefits of comparisons between multiple simulated and real assembly metrics is that common trends and patterns in the data are often redundant between similar metrics, and some metrics don’t contribute meaningful information at all for a given assembly. This allows us to calculate a large number of quality metrics through automated means while quickly determining the relevant trends in the data (see results).

Comparison of Assemblies: by Reference

Another set of valuable quality metrics involves the use of a reference set for comparison to the transcriptome unigenes. This allows for generation of metrics on transcriptome completeness and levels of fragmentation and redundancy, among others. Of course, in de novo assembly, there is by definition no sequenced genome to use as a reference. However, often there is a relatively closely related species with a sequenced genome for which a set of predicted transcripts or proteins is available. For M. cinxia, this is the silkmoth B. mori. For the comparative analysis of quality reference metrics, all M. cinxia assembly unigenes (both real and simulated) were Blast aligned to a modified set of translated predicted nucleotides, the B. mori official gene set. The alignments were parsed using a custom perl script (BlastStats) developed for this study to derive a conservative, minimum estimate of transcriptome amino acid length coverage. All unique (i.e. non-overlapping) alignments for the top hit to B. mori (cumulative bitscore > 45, including lesser or gapped alignments by the top hit) were combined in order to estimate percentage of B. mori amino acid coverage by M. cinxia or simulated sequence. This is an improvement on the similar method developed in Vera, 2008 (and used in many additional transcriptome studies), where the mere occurrence of a significant M. cinxia Blast hit to a B. mori protein was deemed sufficient evidence for full amino acid coverage or alignment of that protein by the M. cinxia unigene. While in many cases the original method is adequate, a systematic estimate of total amino acid coverage is more conservative and likely more accurate if only significant ranges of the Blast alignments are used in the calculation. Percentage of redundant B. mori amino acid coverage by unique M. cinxia unigene lesser hits, average number of unique Blast hits of M. cinxia or simulated unigenes to B. mori sequences, and percentage of redundant M. cinxia or simulated unigene self-alignment were also calculated by the custom script. Percentage of redundant B. mori amino acid coverage is the percent of B. mori amino acid positions aligned with more than one M. cinxia or simulation top Blast hit unigene (tables 3, 5, 7, 9, 11, and 13).
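The amino acid coverage and redundant coverage percentages described above can be illustrated with a simple per-position count. This Python sketch is not the BlastStats script. It assumes that each unigene contributes a single alignment range on the reference protein, given as 1-based inclusive coordinates as in Blast output, and that Blast parsing and bitscore filtering have already been done. The function name is hypothetical.

```python
def coverage_stats(protein_length, alignments):
    """Return (percent_covered, percent_redundant) for one protein.

    alignments: list of (start, end) ranges on the reference protein,
    1-based and inclusive, one range per aligned unigene (assumed).
    """
    depth = [0] * protein_length
    for start, end in alignments:
        # Clip each range to the protein, then count every position.
        for pos in range(max(start, 1), min(end, protein_length) + 1):
            depth[pos - 1] += 1
    covered = sum(1 for d in depth if d >= 1)
    redundant = sum(1 for d in depth if d >= 2)  # more than one unigene
    return (100.0 * covered / protein_length,
            100.0 * redundant / protein_length)


if __name__ == "__main__":
    print(coverage_stats(100, [(1, 40), (31, 60)]))
```

Two unigenes aligned to positions 1-40 and 31-60 of a 100 amino acid protein give 60% coverage, of which positions 31-40 (10%) are covered redundantly.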

Results

Assembly of M. cinxia Reads

The results of sequencing, quality filtering, assembly (i.e. assembly version 1.0), and annotation of the original GS20 454 M. cinxia reads were fully reported in a previous paper (Vera, Wheat et al. 2008). Additional M. cinxia reads were later sequenced using the FLX 454 platform from three separate populations of butterfly (from Aland island, France, and China) and combined with the original GS20 reads to create a new and improved assembly (i.e. assembly version 2.0), the results of which were used in a microarray study (Wheat, Fescemyer et al. 2011), but never reported in detail. After quality filtering, there were a total of 946,285 combined reads, with a mean length of 149±71 bp (see table 1). The combined reads were originally assembled with Lasergene Seqman pro software and then extensively annotated. In order to compare assembly performance and quality for this study, additional assemblies were run using Mira and CLC Cell assembler; the first is an open source OLC assembler with an iterative learning algorithm and the second a commercial DBG assembler. Seqman pro, Mira, and CLC Cell assemblers produced 58,124, 72,302, and 83,400 contigs with average lengths of 396bp, 355bp, and 282bp and average coverage depths of 4.9x, 4.5x, and 4.1x, respectively. Additional basic quality metrics calculated include percentage of nucleotides assembled, average number of reads per contig, top 10% mean contig length and coverage depth, longest contig length, N50 length, and Nc50 length (see table 2).

Assembly of Simulated Lepidoptera Reads

Five sets of simulated Lepidoptera reads of increasing complexity (i.e. sequencing error, gene duplication and alternative splicing, and polymorphism), generated from predicted B. mori transcripts using M. cinxia read metric parameters (see methods), were assembled for comparison to the real M. cinxia assemblies. Parameters for total number of reads, distribution of reads, and average read length were chosen to closely approximate the real M. cinxia read set, taking into account that they originated from two separate 454 platforms (see table 1).

The simulated reads were assembled by optimal assembly as well as by the Seqman pro, Mira, and CLC Cell assemblers. In addition to the aforementioned basic quality metrics generated for the real M. cinxia assemblies, simulated assemblies allow precise assembly error rates to be calculated, including the exact read to contig assignment error rate, read failure to assemble rate, and the read assembled in error rate. Thus, for each simulated read set, a large number of basic and precise assembly quality metrics were generated. As described in detail in methods, both Set1A and Set2A reads were created using the same parameters (except transcript copy number distributions), and as such were predicted to have mostly identical metrics (tables 4 & 6; figures 4-6). This prediction mostly held, with the exception of read assembly failure rate (figure 6b) and redundant B. mori amino acid coverage (figure 8b). Simulated set 2B through 2D read assembly quality metrics (tables 8, 9, & 10) for the most part followed expected patterns, implying lower quality assembly with increasing complexity, with generally few exceptions (see comparison of assembly metrics section below).

Transcript Abundance for M. cinxia and Simulated Reads

One of the potential problems in using simulated sequences to evaluate transcriptome assembly quality arises from the unrealistic nature of the predicted transcripts used to generate them. In a typical sequencing run, the cDNA being sequenced would be expected to contain a much higher number of transcripts than the base number of gene sequences for that organism (due to alternative splicing, polymorphism, etc.), especially if multiple individuals were used to produce the cDNA sample. To evaluate the effects of this discrepancy, several additional simulated read sets were created (see methods). Cellular transcript abundance is thought to follow a power law distribution, with insect transcription having a scaling parameter (k = α - 1) of roughly 1.7 (J. C. Nacher 2006), and so all simulated read sets created by TranscriptSimulator used those parameters (SimSets1A, SimSets2A-D, and SimSets3A-I; see methods). Variables from a power-law distribution are scale invariant and thus follow a straight line (allowing for some curve representing stochastic effects) when log transformed and plotted against the number of variable occurrences.

To illustrate this relationship in real M. cinxia transcripts, the reads were first assembled by Seqman Pro and Mira. The M. cinxia reads were then clustered by their assembled contigs (for both assemblies) and then, in turn, by their corresponding B. mori orthologs (determined through Blast alignment), and then finally plotted as mentioned above in order to determine the distribution of reads in the assembly (figure 1a & 1b). The distribution was reasonably straight and highly characteristic of a power law distribution for both assemblies, even though the variables used did not represent the true M. cinxia pooled transcript abundance due to incomplete coverage of all transcripts, non-optimal assembly, and other sequencing and assembly errors. In fact, a slight curve as seen in the figures was likely typical of sampling from a finite transcript pool (further explained below). The transcript abundances for SimSet2D were also plotted after log transformation (figure 1d) and show the expected power law distribution, with log normal components found in oversampling of a relatively small reference transcript pool (see figure 2). Figure 1c shows the same simulated reads clustered by contig after optimal assembly, with typical characteristics of a power law distribution restored (α = 2.74, xmin = 82, p-value = 0.92; see figure 3 for further explanation). Alpha values for real and simulated copy number distributions were determined through a maximum-likelihood method described in Clauset, 2009 rather than from the regression equation, which can be inaccurate due to high rates of variation at higher values of the x-axis. A statistical test was performed (Kolmogorov-Smirnov goodness-of-fit test; Clauset, Shalizi et al. 2009) to confirm real data power law distributions, as well as the transcript distribution of simulated data set SimSet2D.

Additional simulated read sets (SimSets 3A-I) with simple characteristics (see methods) were created to examine the effects of sampling (i.e. sequencing) from a limited cellular transcript pool (figures 2 & 3). Three cellular reference transcript pools were created from B. mori predicted transcripts (see methods) and each used to generate three sets of simulated reads with increasing sequencing effort (50K, 100K, 1M reads). The transcript abundances for all nine resulting simulated sets were then log transformed, binned by copy number, and plotted to display their distributions (figure 2aa – cc). It quickly becomes apparent that the 14K set of B. mori predicted transcripts was of inadequate size to represent the true M. cinxia transcript pool, as oversampling of the smallest simulated pool begins to occur at a sequencing effort of only 50K reads (figure 2aa) and increases in severity with increasing sequencing effort to the point that the distribution begins to resemble a log normal shape (figure 2ac). At a 56.6K sized transcript pool, the normal curve shape was gone from the 50K sequencing effort read set (figure 2ca) and much more tenuous in the 1M sequencing effort read set (figure 2cc).
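The maximum-likelihood estimate of the power-law exponent mentioned above has a closed form in Clauset, Shalizi et al. (2009): alpha = 1 + n / sum(ln(x_i / xmin)) over the n values at or above xmin, with xmin replaced by xmin - 0.5 as an approximation for discrete data such as read counts. The Python sketch below implements only that estimator. It assumes xmin has already been chosen and does not perform the xmin search or the Kolmogorov-Smirnov goodness-of-fit test. The function name is hypothetical.

```python
import math


def estimate_alpha(values, xmin, discrete=True):
    """Maximum-likelihood estimate of a power-law exponent.

    Only values >= xmin are used. For discrete data the standard
    approximation (xmin - 0.5 in the denominator) is applied.
    """
    tail = [x for x in values if x >= xmin]
    shift = 0.5 if discrete else 0.0
    log_sum = sum(math.log(x / (xmin - shift)) for x in tail)
    if not tail or log_sum <= 0.0:
        raise ValueError("not enough spread above xmin to estimate alpha")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    reads_per_contig = [82, 90, 101, 130, 250, 400, 95, 88, 1200, 160]
    print(round(estimate_alpha(reads_per_contig, xmin=82), 2))
```

Unlike a regression line fitted to the log-log plot, this estimate is not dominated by the noisy, sparsely populated bins at high copy number.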

To explore the effects of assembly on these simulated data, the three largest read sets (SimSets 3C, 3F, and 3I) from figure 2 were optimally assembled into contigs and the underlying read numbers were plotted (i.e. log transformed occurrences vs number of reads per contig bins) (figure 3). SimSet3C, derived from the 14K transcript pool, shows a pronounced dip towards a log normal shape before recovering to log linear (figure 3a). This is likely due to fragmentation of low copy number transcripts (with perhaps some moderate copy number, but very long, transcripts) into multiple contigs during assembly. This dip in the distribution is markedly less pronounced in figure 3c (SimSet3I, derived from the 56.6K transcript pool), and in fact begins to more closely resemble the real M. cinxia read copy number distribution from figures 1a-1b. These simulations show the behavior of the simulation process, and indicate that one of the biggest gaps between simulated and real complexity likely lies in the greater transcript diversity of real data. Alternative splice forms, additional gene duplications, and additional large scale polymorphisms seem the most likely candidates for this missing complexity (i.e. not simple SNPs, indels, and errors, which are approximated at sufficient rates in the simulated reads), particularly for this multi-individual, pooled-sample M. cinxia read set. However, comparison of the simulations with real data will still enable many aspects of the assembly quality to be revealed.

Comparison of Assemblies: Using Metrics to Assess Assembly Quality

A comparison of the basic metrics between the three real M. cinxia assemblies indicates that although all three assemblers had somewhat similar performance overall, consistent trends were apparent (see figures 4-8). One of the most basic assembly metrics, number of contigs generated, serves as an interesting example of how simulated data can shed light on assembly dynamics and features of real data. In general, the number of contigs increased as expected along with the complexity of the simulated data set (figure 4a). The two replicated sets of reads, SimSet1A and SimSet2A, had virtually identical metrics for all assemblies, indicating limited influence of transcript copy number distributions on the number of contigs (although this admittedly rests on a sample size of only two). Addition of normal rates of sequencing error to the simulated reads had little effect on the metric (SimSet2A to SimSet2B), as did a limited amount of additional large-scale complexity (SimSet2B to SimSet2C; both sets had simulated

sequencing error but were generated from modified and unmodified reference transcripts, respectively). In fact, Seqman Pro, Mira, and CLC assemblers performed fairly close to the optimal assembly until the introduction of simple polymorphism with SimSet2D, at which point the number of contigs increased for all three assemblers and began to take on similar relative proportions to the real assemblies. Fewer contigs is generally a desirable outcome when assembling a set of reads, implying that transcript coverage was sufficient and the assembly algorithm was successful in pulling together many reads that belonged together into full length transcript sequences. This tendency is balanced against the benefits of avoiding over-assembly of reads from separate transcripts (e.g. alternatively spliced transcripts) into chimeric contigs. To some degree then, the acceptable number of contigs generated can range either slightly above or slightly below the actual, optimal number of contigs for a given read set depending on the goals of the investigators. All things being equal, however, a substantially higher number of contigs relative to other assemblies is a negative for assembly quality, implying increased and unnecessary fractionation of the transcripts in most cases. CLC, the DBG assembler in the group, consistently generated more contigs for all datasets. Although perhaps a consequence of its intended design to separate out alternative transcripts, this behavior can also be due to a simple failure to fully assemble the read set. Such behavior could perhaps be expected with an approach (i.e. de Bruijn graphs) intended for (and an algorithm optimized for) dealing with smaller reads (<100 bp). In any case, such a result raises cautionary flags and requires further observation.
Average contig length of assembled reads, another basic assembly quality metric, tended to decrease with complexity of the simulated data set (figure 4b), as did average contig length of the top 10% of the assembly (figure 4c). However, both metrics followed the same basic patterns as the number of contigs metric, only in reverse, and both also began to resemble the real data set with the addition of simulated polymorphism. These comparisons show that Seqman pro had the edge in terms of the ability to join reads into long contigs, although we have yet to consider other critical quality metrics. Two other commonly used basic quality metrics are percent of read nucleotides assembled and average depth of contig coverage by read nucleotides (see methods). Unlike number of contigs and average contig length, nucleotides assembled and average coverage depth did not respond strongly to increasing simulated sequence complexity, except for a small gradual drop in nucleotides assembled by CLC and a moderate drop in coverage depth by Mira

and CLC in the SimSet2D simulated reads (figure 5). The nucleotides assembled metric (figure 5a) did display a small relative response to transcript abundance distribution differences (i.e. changes between SimSet1A and SimSet2A), which perhaps has some bearing on the proportional differences observed between the real data and simulated sets 2A-D, and is thus likely to be an important influence on this metric. Large-scale polymorphisms lacking in the simulated data are likely responsible for the overall drop in nucleotides assembled seen in the real data. Average contig coverage depth also showed little relationship to increasing complexity of the simulated data sets or even to assembly method in general. However, it did drop substantially between the simulated and M. cinxia assemblies, most likely also due to large-scale polymorphism not found in the simulated reads. Although CLC clearly incorporated the largest percentage of reads in the real data, it had a lower relative average coverage depth and average contig length and a higher number of contigs, indicating a higher level of fractionation of the reads during assembly. Thus, judging by these metrics so far, Seqman pro appears to retain the lead over Mira and CLC in assembly quality. The N50 metric is considered a poor quality metric for transcriptome assemblies, and is even disputed as a useful metric in genome assemblies (see Vezzi, Narzisi et al. 2012). However, here it was considered for the sake of completeness and as a comparison to the proposed replacement transcriptome quality metric, Nc50. Overall, N50 length decreased only slightly with increasing simulated data complexity and dropped abruptly for the real data set, but only the Mira assemblies showed a change in metric proportions with other assemblers in response to the increasing complexity (figure 6a).
The largest drop in the Mira N50 metric occurred with the addition of simple polymorphism (SimSet2C to SimSet2D), signaling a relative decrease in the lengths of contigs at the higher quality end of the simulated assembly. The proposed new metric, Nc50, showed no progressive trends with complexity and varied consistently between assemblies until SimSet2D, at which point it began to proportionally resemble more closely the real data set (figure 6b). The Seqman pro real data set assembly had an unusually large Nc50 length relative to the Mira and CLC assemblies, indicating that a larger concentration of the sequencing effort (i.e. read nucleotides) was assembled into longer contigs towards the high quality end of the assembly (i.e. the contigs in a sorted list of contig lengths with longer lengths and deeper read coverage). In most cases a higher Nc50 is a positive sign of quality, with the condition that it not be higher than the optimal Nc50 value (which would be a strong indicator of excessive over-assembly of reads into chimeric contigs). Although, of course, the real data set optimal assembly metrics cannot be known, the simulated assemblies

provide a rough background by which to judge the significance of the quality metrics we do know. Given that all the simulated optimal assemblies had Nc50 values between 700 bp and 800 bp, and that all other assembler Nc50 values for the simulated reads tended to stay within that range (even with the addition of simple polymorphism, which has been shown to have strong effects on several quality metrics, see above), the real read Seqman pro Nc50 length of close to 1000 bp was somewhat alarming. In comparison, the real read Mira Nc50 length was just below 800 bp and the CLC Nc50 length was actually lower than the equivalent simulated values at ~650 bp. One important caveat to any conclusions here is that, although the real read set optimal assembly metrics are unknown, the M. cinxia optimal Nc50 length could very well be somewhat higher than the simulated Nc50 lengths. This is due to the fact that the M. cinxia set originates from a substantially more complex pool of transcripts than the simulated reads (see section Transcript Abundance for M. cinxia and Simulated Reads) in the form of large scale polymorphism. That being said, the optimal metric did not get larger in the transition between SimSet2B and SimSet2C, where a limited addition of large scale polymorphism and gene duplication occurred, and it was encouraging that SimSet2D was responsive to the Nc50 length quality metric (among others) in that it suggested the optimal real M. cinxia Nc50 length was not likely to be far removed from the simulated values. Assuming that the M. cinxia optimal Nc50 length value is similar to, or just slightly higher than, that of the simulated reads, the predictions reached are that the real read Seqman pro assembly contained a higher level of misassembled or chimeric contigs than either Mira or CLC, and that the Mira assembly Nc50 value was closest to optimal. These conclusions will be reinforced in the following analysis of assembly error types.
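For readers who want to reproduce the simpler length-based metrics compared above, the following is a minimal sketch that computes number of contigs, average contig length, average length of the top 10% of contigs, and the conventional N50 from a list of contig lengths. It makes two assumptions that are not taken from the methods: "top 10%" is interpreted as the longest tenth of contigs, and N50 is the standard definition (the length of the contig at which the running total of sorted lengths first reaches half of all assembled bases). The proposed Nc50 metric is defined in the methods and is deliberately not reproduced here.

```python
def basic_metrics(contig_lengths):
    """Return simple length-based assembly metrics (illustrative sketch)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")

    lengths = sorted(contig_lengths, reverse=True)  # longest first
    n_contigs = len(lengths)
    total_bases = sum(lengths)

    # Assumption: "top 10%" means the longest tenth of contigs (at least one).
    top_n = max(1, n_contigs // 10)
    top10_mean = sum(lengths[:top_n]) / top_n

    # Conventional N50: walk down the sorted lengths until half of all
    # assembled bases are accounted for.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total_bases:
            n50 = length
            break

    return {
        "n_contigs": n_contigs,
        "mean_length": total_bases / n_contigs,
        "top10_mean_length": top10_mean,
        "n50": n50,
    }
```

Because all four values depend only on contig lengths, none of them can distinguish a correctly long contig from a chimeric one, which is why the text weighs them against error rates and reference coverage.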
The final set of basic quality metrics considered in depth here were the exact assembly error rates, by their nature only calculable for the simulated read assemblies. For this study, the two critical rates were assembly contig assignment error and read assembly failure (see methods; see figure 7). Mira had the lowest contig assignment error rate (i.e. fraction of simulated reads that were assigned to the wrong contig) and CLC had the highest rate for all simulated assemblies (figure 7a). There was a relatively large increase in contig assignment error (Seqman Pro: ~3x; Mira: ~2x; CLC: ~7x) for all assemblies starting with SimSet2C, the set derived from unmodified B. mori reference transcripts with a limited amount of additional gene duplications, alternative splicing, and other large scale polymorphism, suggesting the primary effect of transcript pool complexity is best observed through this metric. As for the failure-to-assemble metric, both Seqman pro and Mira had relatively high error rates for SimSet1A compared to SimSet2A, which are essentially replicate sets except for independent transcript

copy number distributions (figure 7b). Although somewhat puzzling, this is likely the result of different copy numbers of problematic, low-complexity DNA sequence with low read coverage (i.e. stochastic differences in the replicate simulations). When other quality metrics (primarily assembly assignment error, but also contig number, average contig length, and Nc50) are taken into account, this variation in assembly quality is low (i.e. these other metrics are unresponsive to this transcript copy number variable). In fact, here it is more of a concern that CLC fails to exclude a proportionate number of reads, given its performance in other quality metrics. Mira also generally had the lowest read assembly failure rate at higher levels of simulated read complexity, implying that the assembly algorithm is very capable of including a large amount of the available reads when possible. Although these error metrics are possible only for simulated data, they show that for read sets with typical characteristics for GS20 and FLX 454 sequencing and several increasing levels of realistic complexity (within limits discussed above), the Mira assembler appears to strike the best balance between incorporating reads into contigs and avoiding over-assembly and errors. That being said, it is useful to tie these observations together with a comparison through a common reference.
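The two error rates discussed above can be sketched for simulated reads whose source transcript is known. The exact rules used in this study are given in the methods; the version below is only an assumed approximation in which the "correct" transcript for a contig is taken to be the one contributing the most reads, every other read in that contig counts as misassigned, and a read with no contig counts as an assembly failure. The function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict


def assembly_error_rates(reads):
    """Estimate error rates for a simulated assembly (illustrative sketch).

    reads: iterable of (true_transcript_id, contig_id) pairs, where
    contig_id is None for a read left out of the assembly.
    Returns (contig_assignment_error_rate, read_assembly_failure_rate),
    both as fractions of all simulated reads.
    """
    reads = list(reads)
    if not reads:
        raise ValueError("at least one read is required")

    sources_by_contig = defaultdict(Counter)
    unassembled = 0
    for transcript_id, contig_id in reads:
        if contig_id is None:
            unassembled += 1
        else:
            sources_by_contig[contig_id][transcript_id] += 1

    # Assumption: the majority source transcript defines the correct contig
    # identity; reads from any other transcript are counted as misassigned.
    misassigned = 0
    for source_counts in sources_by_contig.values():
        majority = source_counts.most_common(1)[0][1]
        misassigned += sum(source_counts.values()) - majority

    total = len(reads)
    return misassigned / total, unassembled / total
```

Note that a majority rule of this kind does not penalize fractionation (one transcript split over several contigs); that behavior is captured separately by contig counts and redundant reference coverage.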

Evaluation of a Real Assembly Using a Reference Transcriptome and Simulation

To gain a better understanding of de novo transcriptome assembly completeness and quality, it is typical to compare the assembled unigenes to a set of reference transcripts from a closely related species that has a sequenced genome (see methods). For the Glanville fritillary butterfly M. cinxia, we used the genome-predicted transcripts from the silk moth B. mori for this purpose. We have presented such a comparison in previous studies (Vera et al., 2008; Wheat et al., 2011), but here incorporate additional improvements to the method, while also using metrics from simulated assemblies.

After Blast comparison of the assembled M. cinxia unigenes to the modified B. mori predicted transcripts, a perl script was run at four separate stringencies (percent alignment identities of 90%, 70%, 50%, and 30% and minimum cumulative bitscore of 45) to combine coverages of B. mori amino acids by top alignment M. cinxia unigenes (see methods for details) in order to calculate a conservative estimate of the percent of all B. mori transcripts covered by

M. cinxia unigenes. Also calculated (by similar methods) was the percent of B. mori amino acids with redundant coverage (i.e. more than one unique top hit M. cinxia unigene per amino acid). The results show that real-read Mira unigenes consistently had better alignment to the B. mori reference transcripts at all alignment stringencies, whereas CLC had the lowest coverage for all but the highest stringency (figure 8a). The relative proportion of amino acid coverage did not vary noticeably between stringencies, implying that genetic divergence had negligible influence on this metric, which suggests that it may be generally useful for judging assembly quality. These results reinforce what was observed during the comparison of basic quality metrics, with Mira-assembled contigs appearing to be of higher quality overall, albeit with a higher level of fractionation for realistically complex data. The simulated data sets were also compared to B. mori reference transcripts (at 90% identity stringency) and generally displayed the same pattern of coverage as the real M. cinxia set (figure 9a). Mira-assembled simulated unigenes consistently came closer to the optimum assembly coverage of the reference. However, unlike with real M. cinxia unigenes, here CLC unigenes covered the reference transcripts better than Seqman pro (also see figure 9c), most likely due to fewer misassembled contigs. Some of the success of the Mira assembler becomes clearer when percentage of redundant B. mori amino acid coverage is taken into account. The results of the comparison for the M. cinxia assemblies show that Mira unigenes had roughly double the redundant coverage of the other assemblers at all stringencies (figure 8b), implying a higher rate of contig fractionation. On the other hand, the CLC assembly contained more contigs than Mira and Seqman pro, and yet performed worse in most objective quality metrics.
In the simulated assembly comparison (figure 9b), Mira had less redundant coverage at lower levels of simulated complexity than Seqman pro or CLC, approaching closely to the optimal assembly level, and then gained in redundant coverage as complexity rose in two stages (first with the addition of sequencing error and then a much larger jump with the addition of simple polymorphism). The other assemblers did not show much of a response for this metric to increasing complexity in the simulated reads.
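The coverage and redundant-coverage percentages used in this comparison can be sketched as follows. This is not the Perl script used in the study: it assumes the Blast alignments have already been filtered by identity and cumulative bitscore, and that each retained alignment is supplied as a reference ID, a unigene ID, and 1-based inclusive amino acid coordinates on the reference. The function name and input layout are hypothetical.

```python
def reference_coverage(ref_lengths, alignments):
    """Percent of reference amino acids covered, and covered redundantly.

    ref_lengths: dict mapping reference transcript ID -> length in amino acids.
    alignments: iterable of (reference_id, unigene_id, start, end), with
                1-based inclusive coordinates on the reference protein.
    Returns (percent_covered, percent_redundant), where redundant means
    covered by more than one distinct unigene.
    """
    total = sum(ref_lengths.values())
    if total == 0:
        raise ValueError("reference set contains no amino acids")

    # One set of covering unigene IDs per reference position.
    covering = {ref: [set() for _ in range(length)]
                for ref, length in ref_lengths.items()}

    for ref, unigene, start, end in alignments:
        for pos in range(start - 1, end):
            covering[ref][pos].add(unigene)

    covered = sum(1 for sets in covering.values() for s in sets if s)
    redundant = sum(1 for sets in covering.values() for s in sets if len(s) > 1)
    return 100.0 * covered / total, 100.0 * redundant / total
```

Tracking unigene IDs per position, rather than a simple depth count, keeps several alignments from the same unigene from being mistaken for redundant coverage.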

Conclusions

De novo transcriptomics presents a unique challenge, both to assembly algorithms and to the researcher attempting to assess the results of the project. Without a sequenced genome to provide a reference for next-gen transcriptome reads, which already face complexity issues due to short read lengths, the many confounding effects of transcript copy number variation, alternative splicing, gene paralogs, and sequence polymorphism present challenges and reduce the quality of de novo transcriptomes. Despite these complexities, many researchers have successfully gleaned useful and interesting results using de novo transcriptome techniques, and indeed some progress has already been made in overcoming the issues that remain (e.g. Vera, Wheat et al. 2008; Meyer, Aglyamova et al. 2009; Polato, Vera et al. 2011), at least on a project-by-project basis and with considerable manual effort. On the other hand, it has also been shown that assembly performance and quality between individual transcriptomes vary considerably depending on the nature and quality of the data (reviewed in Garg, Patel et al. 2011), and the results are not yet easily comparable between projects. Therefore, there remains a need for a straightforward yet thorough method of analyzing and comparing assembly quality and completeness. Here I presented a set of tools and methods for preparing, annotating, evaluating, and comparing de novo transcriptome assemblies for their relative quality and completeness, along with an analysis of remaining limitations in these methods. An evaluation of a real de novo transcriptome assembly (the Glanville fritillary butterfly, M. cinxia, assembly version 2.0) was made, along with analyses of an increasingly realistic series of simulated reads.
In summary, PipeMeta was shown to be a useful toolset for quickly and easily generating annotated de novo transcriptome assemblies, while TranscriptSimulator proved to be invaluable in quickly generating reasonably realistic sets of simulated reads to help analyze real data. In addition, it was found that although the Seqman pro M. cinxia assembly had fewer and longer contigs, deeper average coverage, and a longer Nc50 length than the other assemblies (figures 4-7), it included a lower percentage of the total read nucleotides and, more importantly, when compared with results from the simulated assembly quality metrics (in particular a larger than expected Nc50 length and decreased performance in terms of greater levels of misassembly and read assembly failure), it became apparent that the Mira assembly had the highest quality contigs. On the other hand, CLC Cell assembler resulted in the lowest overall quality contigs,

with the highest error rates in the simulated assemblies and the lowest percentage of coverage of the B. mori reference transcripts by the CLC M. cinxia unigenes. The downside to the increased quality of the Mira assembly was a higher rate of fractionation of the reads into separate contigs, which was displayed as a larger number of contigs and higher levels of redundant B. mori amino acid coverage for both real and simulated sets (figures 8 & 9). On the other hand, Mira was designed to purposefully sift apart similar but independent transcripts in an attempt to reproduce the transcripts in their original states (http://chevreux.org/projects_mira.html), and Mira had the lowest failure to assemble error rates in the more complex simulated read sets. Thus, much of the fractionation observed in the real Mira M. cinxia assembly is likely intentional and correctly represents the underlying complexity of the original transcript pool, a position supported by its higher coverage of B. mori amino acids and lower levels of read assignment error in the simulated assemblies. Indeed, newer versions of the Mira assembler now offer some interesting, advanced options and parameters for dealing with pooled sample transcriptome reads in order to reassemble, in a three stage process, contigs fractionated due to their high rates of simple polymorphism (as opposed to large scale polymorphisms such as large indels and alternative splice forms, which should be kept in separate contigs). Other new methods for improving de novo assembly quality have been proposed in the literature as well, such as including the predicted reference transcripts in the assembly process (Surget-Groba and Montoya-Burgos 2010) or in an additional pipeline stage after the initial assembly (Meyer, Aglyamova et al. 2009). In genome assembly, several recent studies have investigated the effectiveness of various metrics to assess quality of an assembly.
In the most recent of these (Vezzi, Narzisi et al. 2012), the authors propose novel statistical methods based on multivariate techniques such as principal and independent component analysis to assess the feature space of an assembly in relation to its quality metrics to determine their usefulness. Although these methods are not directly applicable to transcriptome quality assessment due to very different assumptions and underlying assembly characteristics, this may change in the near future as statistical methods are adapted. In addition to basic quality metric assessment, further work remains to be done in improving the realism of simulated transcriptome reads at higher read complexity levels. Much of this work lies in improving the reference source used by the simulator in order to increase levels of transcript pool complexity in the end result and avoid the issues with over-sampling at higher sequencing efforts (see above). One potential solution is to simulate this transcript complexity directly by adding randomly spliced alternative splice forms, recently duplicated

copies with simulated alterations, and random large indels into the transcript pool. Another avenue involves using assembled publicly available ESTs from the reference species to supplement or even modify the predicted sequences. In a future study, these statistical quality metric assessment methods, improved simulation capabilities, and new assembly methods could be examined in a new quality analysis, perhaps with the inclusion of additional real read sets from different species and two or more simulated read set replicates to explore the effects of transcript copy number distribution on assembly quality and annotation in a more thorough manner. For now, the availability of these new tools and methods for de novo transcriptome quality assessment will give researchers considering a transcriptome sequencing project the means both to estimate the levels of sequencing effort required for a given project and to further investigate and understand the results once the assembly is complete. This is particularly true for complex or pooled sample studies, where the ability of TranscriptSimulator to reproduce different strains of reads with variable characteristics, while maintaining a similar or identical distribution of reads and simple polymorphism, can make a large difference in the assessment of quality.

References

Bravo, H. C. and R. A. Irizarry (2010). "Model-Based Quality Assessment and Base-Calling for Second-Generation Sequencing Data." Biometrics 66(3): 665.
Carvajal-Rodriguez, A. (2008). "Simulation of genomes: A review." Current Genomics 9(3): 155-159.
Carvajal-Rodriguez, A. (2010). "Simulation of Genes and Genomes Forward in Time." Current Genomics 11(1): 58-61.
Castle, J. C. (2011). "SNPs Occur in Regions with Less Genomic Sequence Conservation." PLoS ONE 6(6): e20660.
Chen, S. A., P. C. Yang, et al. (2010). "De Novo Analysis of Transcriptome Dynamics in the Migratory Locust during the Development of Phase Traits." PLoS ONE 5(12).
Clauset, A., C. R. Shalizi, et al. (2009). "Power-law distributions in empirical data." SIAM Review 51: 661-703.
Duan, J., R. Li, et al. (2009). "SilkDB v2.0: a platform for silkworm (Bombyx mori) genome biology." Nucleic Acids Research 38(suppl 1): D453-D456.
Garg, R., R. K. Patel, et al. (2011). "Gene Discovery and Tissue-Specific Transcriptome Analysis in Chickpea with Massively Parallel Pyrosequencing and Web Resource Development." Plant Physiology 156(4): 1661-1678.
Gibbons, J. G., E. M. Janson, et al. (2009). "Benchmarking Next-Generation Transcriptome Sequencing for Functional and Evolutionary Genomics." Molecular Biology and Evolution 26(12): 2731-2744.
Grabherr, M. G., B. J. Haas, et al. (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nat Biotech 29(7): 644.
Gregory, R., A. C. Darby, et al. (2011). "A De Novo Expression Profiling of Anopheles funestus, Malaria Vector in Africa, Using 454 Pyrosequencing." PLoS ONE 6(2): e17418.
Hanski, I. (1999). Metapopulation Ecology. Oxford University Press, New York.
Harismendy, O., P. Ng, et al. (2009). "Evaluation of next generation sequencing platforms for population targeted sequencing studies." Genome Biology 10(3): R32.
Hou, R., Z. M. Bao, et al. (2011). "Transcriptome Sequencing and De Novo Analysis for Yesso Scallop (Patinopecten yessoensis) Using 454 GS FLX." PLoS ONE 6(6).
Kuznetsov, V. A., G. D. Knott, et al. (2002). "General statistics of stochastic process of gene expression in eukaryotic cells." Genetics 161(3): 1321-1332.
Li, Z., Y. Chen, et al. (2011). "Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph." Briefings in Functional Genomics.
Lin, Y., J. Li, et al. (2011). "Comparative studies of de novo assembly tools for next-generation sequencing technologies." Bioinformatics 27(15): 2031-2037.
Lunter, G. (2007). "Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes." Bioinformatics 23(13): i289-i296.
Margulies, M., M. Egholm, et al. (2005). "Genome sequencing in microfabricated high-density picolitre reactors." Nature 437(7057): 376-380.
Martin, J. A. and Z. Wang (2011). "Next-generation transcriptome assembly." Nat Rev Genet 12(10): 671.
Metzker, M. L. (2010). "Sequencing technologies - the next generation." Nat Rev Genet 11(1): 31.
Meyer, E., G. V. Aglyamova, et al. (2009). "Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx." BMC Genomics 10.
Miller, J. R., S. Koren, et al. (2010). "Assembly algorithms for next-generation sequencing data." Genomics 95(6): 315.
Nacher, J. C. and T. Akutsu (2006). "Sensitivity of the Power-Law Exponent in Gene Expression Distribution to mRNA Decay Rate." Physics Letters A 360: 174-178.
Niu, B., L. Fu, et al. (2010). "Artificial and natural duplicates in pyrosequencing reads of metagenomic data." BMC Bioinformatics 11(1): 187.
Paszkiewicz, K. and D. J. Studholme (2010). "De novo assembly of short sequence reads." Briefings in Bioinformatics 11(5): 457-472.
Paulsson, J. (2004). "Summing up the noise in gene networks." Nature 427(6973): 415.
Pevzner, P. A., H. Tang, et al. (2001). "An Eulerian path approach to DNA fragment assembly." Proceedings of the National Academy of Sciences 98(17): 9748-9753.
Polato, N. R., J. C. Vera, et al. (2011). "Gene Discovery in the Threatened Elkhorn Coral: 454 Sequencing of the Acropora palmata Transcriptome." PLoS ONE 6(12): e28634.
Pringle, E. G., S. W. Baxter, et al. (2007). "Synteny and chromosome evolution in the lepidoptera: Evidence from mapping in Heliconius melpomene." Genetics 177(1): 417-426.
Richter, D. C., F. Ott, et al. (2008). "MetaSim-A Sequencing Simulator for Genomics and Metagenomics." PLoS ONE 3(10).
Rodrigue, S., R. R. Malmstrom, et al. (2009). "Whole Genome Amplification and De novo Assembly of Single Bacterial Cells." PLoS ONE 4(9).
Subkhankulova, T., M. Gilchrist, et al. (2008). "Modelling and measuring single cell RNA expression levels find considerable transcriptional differences among phenotypically identical cells." BMC Genomics 9(1): 268.
Surget-Groba, Y. and J. I. Montoya-Burgos (2010). "Optimization of de novo transcriptome assembly from next-generation sequencing data." Genome Research 20(10): 1432-1440.
Susko, E. and A. J. Roger (2004). "Estimating and comparing the rates of gene discovery and expressed sequence tag (EST) frequencies in EST surveys." Bioinformatics 20(14): 2279-2287.
Sutton, G. and I. Dew (2007). "Shotgun Fragment Assembly." In: Rigoutsos I, Stephanopoulos G, editors. Systems Biology: Genomics. Oxford University Press, New York, pp. 79–117.
Ueda, H. R., S. Hayashi, et al. (2004). "Universality and flexibility in gene expression from bacteria to human." Proceedings of the National Academy of Sciences of the United States of America 101(11): 3765-3769.
Vera, J. C., C. W. Wheat, et al. (2008). "Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing." Molecular Ecology 17(7): 1636.
Vezzi, F., G. Narzisi, et al. (2012). "Feature-by-Feature - Evaluating De Novo Sequence Assembly." PLoS ONE 7(2): e31002.
Wall, P. K., J. Leebens-Mack, et al. (2009). "Comparison of next generation sequencing technologies for transcriptome characterization." BMC Genomics 10(1): 347.
Wang, J.-P., B. Lindsay, et al. (2005). "Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries." BMC Bioinformatics 6(1): 300.
Wang, J., Q. Xia, et al. (2005). "SilkDB: a knowledgebase for silkworm biology and genomics." Nucleic Acids Research 33(suppl 1): D399-D402.
Waterhouse, R. M., E. M. Zdobnov, et al. (2011). "OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011." Nucleic Acids Research 39: D283-D288.
Wheat, C. W., H. W. Fescemyer, et al. (2011). "Functional genomics of life history variation in a butterfly metapopulation." Molecular Ecology 20(9): 1813.
Zerbino, D. R. and E. Birney (2008). "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs." Genome Research 18(5): 821-829.
Zhang, W., J. Chen, et al. (2011). "A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies." PLoS ONE 6(3): e17915.
Zhu, J., F. He, et al. (2008). "Modeling Transcriptome Based on Transcript-Sampling Data." PLoS ONE 3(2): e1659.

Figure 3-1. Comparison of contig and reference clustered or actual transcript copy number distributions for real and simulated M. cinxia reads. Panels display the number of occurrences of a copy number by contig and B. mori reference clustered read copy number, or by actual read copy number (panel d), weighted by summed underlying read nucleotide counts. All data points were log transformed to show the linear relationship common to a power law distribution (y = ax^α + ε). Colors represent underlying (i.e. assembled) read nucleotide counts. a. Seqman pro assembled, B. mori reference clustered real M. cinxia reads binned by copy number (α = 1.97, xmin = 1; p-value = 0.11, Kolmogorov-Smirnov [KS] test for power law goodness of fit; Clauset, Shalizi et al. 2009). b. Mira assembled, B. mori reference clustered real M. cinxia reads binned by copy number (α = 1.95, xmin = 2; p-value = 0.10, KS test). c. Optimally assembled simulated M. cinxia reads clustered by contig and binned by copy number (α = 2.74, xmin = 82; p-value = 0.92, KS test). d. Simulated M. cinxia reads clustered by transcript (i.e. actual transcript copy numbers) and binned by copy number (α = 2.75, xmin = 93; p-value = 0.88, KS test).
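The α values and KS statistics reported in this caption follow the approach of Clauset, Shalizi et al. (2009). The sketch below shows only the core of that approach under simplifying assumptions: it uses the continuous maximum-likelihood estimator α = 1 + n / Σ ln(x_i / xmin) for a given xmin and computes the KS distance between the empirical and fitted tail distributions. It does not select xmin, does not apply the discrete-data correction, and does not compute the bootstrap p-value, so it is not the exact procedure behind the reported numbers.

```python
import math


def fit_alpha(values, xmin):
    """Continuous power law MLE for the exponent, using values >= xmin."""
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    return 1.0 + len(tail) / log_sum


def ks_distance(values, xmin, alpha):
    """Largest gap between the empirical tail CDF and the fitted power law CDF."""
    tail = sorted(x for x in values if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        # Fitted CDF of a continuous power law with lower bound xmin.
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare against the empirical CDF just before and just after x.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst
```

In the full method, xmin is chosen as the value that minimizes this KS distance, and the p-value is obtained by refitting many synthetic data sets drawn from the fitted model.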

Figure 3-2. Comparison of the transcript copy number distributions of nine sets of simulated reads with varying reference transcript pool sizes and sequencing efforts. Each panel displays the number of occurrences of a transcript copy number by the actual transcript copy number. Simulated reads were generated from reference transcript pools (i.e. modified B. mori predicted reference transcripts) of increasing size (from top to bottom [a.x.-c.x.]: ~14.1K, ~28.3K, and ~56.6K sequences), for varying amounts of sequencing effort (from left to right [x.a.-x.c.]: ~50K, ~100K, and ~1M simulated reads). Transcript copy numbers were derived from a power law distribution during simulated sequencing of the reference transcripts (α = ~2.7). This comparison illustrates how sampling (i.e. sequencing) from a finite set of reference transcripts causes oversampling of the reference sequences at increasing sequencing efforts, shifting the resulting read distribution shape towards log normal as lower copy numbers become less common. As the reference transcript pool used for sequencing gets larger, more sequencing effort is required to shift the distribution towards a log normal shape. This suggests a quantifiable relationship between reference transcript pool abundance (i.e. complexity), assembly quality, and sequencing effort. a.a. Simulated read set 3A. a.b. Simulated read set 3B. a.c. Simulated read set 3C. b.a. Simulated read set 3D. b.b. Simulated read set 3E. b.c. Simulated read set 3F. c.a. Simulated read set 3G. c.b. Simulated read set 3H. c.c. Simulated read set 3I.

Figure 3-3. Comparison of read copy number distributions for three of the simulated read sets from Figure 3-2 (Set3C, Set3F, and Set3I) after optimal assembly. Panels display the number of occurrences of a copy number by the contig clustered read copy number (i.e. similar to what is observable in real data), weighted by assembled read nucleotide counts. This comparison illustrates the progression towards the more typical power law distribution shape (i.e. lower xmin value and less like a log normal distribution) observed in the real M. cinxia read copy numbers (see Table 1) as the reference transcript pool being sampled increases in size. In this case the progression is due mostly to increased fragmentation of transcript sequences at lower coverages (i.e. lower values on the x-axis). All data points were log transformed to show the linear relationship common to a power law distribution (y = ax^α + ε). Colors represent underlying (i.e. assembled) read nucleotide counts. a. Set3C consists of ~1M simulated reads (avg. read length ± SD = 250±75 bp) derived from ~14.1K reference transcripts. b. Set3F consists of ~1M simulated reads derived from ~28.3K reference transcripts. c. Set3I consists of ~1M simulated reads derived from ~56.6K reference transcripts.

Figure 3-4. Comparison of three basic assembly quality metrics (N contigs, average contig length, and top 10% average contig length) for five simulated data sets and the M. cinxia reads, using up to four assembly methods (Optimal [simulated reads only], Seqman pro, Mira, and CLC Cell). a. The number of contigs assembled increases with the complexity of the simulated data set and, with the addition of simulated polymorphism in SimSet2D, begins to resemble the M. cinxia assembly (tReal) in terms of both absolute size and relative proportions between assembly methods. b. Average contig length of assembled reads decreases with the complexity of the simulated data set and begins to resemble the M. cinxia assembly with the addition of simulated polymorphism in SimSet2D. c. Average contig length of the top 10% of the assembly displays almost identical behavior to average contig length, with the exception of slightly increased relative performance by CLC in the simulated data sets.

Figure 3-5. Comparison of two basic quality metrics (percentage of read nucleotides assembled and average contig coverage depth) for five simulated data sets and M. cinxia reads, using up to four assembly methods (Optimal [simulated reads only], Seqman pro, Mira, and CLC Cell). a. Percent of nucleotides assembled does not respond to increasing simulated sequence complexity, except for a slight drop in nucleotides assembled by CLC. The metric does display a small relative response to transcript abundance distribution differences (i.e. changes between SimSet1A and SimSet2A), which perhaps explains the proportional differences observed in the real data. Large scale polymorphisms (lacking in the simulated data) are likely responsible for the overall drop in the real data. b. Average contig coverage depth shows little relationship to increasing complexity of the simulated data sets or to assembly method in general. However, it drops abruptly in the M. cinxia assemblies, perhaps due to additional biological complexity not captured by the simulated data (e.g. original size of transcript pool and large scale polymorphism).

Figure 3-6. Comparison of two assembly quality metrics (N50 length and Nc50 length, see methods for definitions) for five simulated data sets and the real M. cinxia reads, using up to four assembly methods (Optimal [simulated reads only], Seqman pro, Mira, and CLC Cell). a. N50 length decreases in general with increasing simulated data complexity and drops abruptly for the real data set, but only the Mira assemblies show a relative trend to increasing complexity. b. The proposed new metric, Nc50, varies consistently between assemblies until SimSet2D (simulated data set with random polymorphism added), at which point it begins to resemble the real data set more closely in its proportions. The Seqman pro assembly of the real data set has an unusually large Nc50 length relative to the Mira and CLC assemblies, indicating that a larger share of the sequencing effort (i.e. read nucleotides) was assembled into longer contigs.

Figure 3-7. Comparison of the rate of assembly error in the five simulated read sets for two types of error, contig assignment error rate and read assembly failure rate. a. Mira had the lowest contig assignment error rate (i.e. fraction of simulated reads that were assigned to the wrong contig) and CLC had the highest for all complexity levels. There was a relatively large increase in contig assignment error for all sets starting with SimSet2C, the set derived from unmodified B. mori reference transcripts (i.e. containing an extra ~400 transcripts consisting of gene duplications and alternative splice forms, as well as pseudo-genes and other gene prediction errors). b. Mira also generally had the lowest read assembly failure rate at higher levels of simulated read complexity, with the notable exception of SimSet1A (SimSets 1A and 2A are the lowest complexity sets [see methods], and are replicates of each other except for their distribution of read copy numbers). The (relatively) high Seqman pro and Mira error rates for SimSet1A are most likely explained by its particular distribution of read copy numbers across specific DNA sequences.
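
Because simulated reads retain their transcript of origin, both error rates can be tallied directly from the assembly output. The sketch below shows one way to do so and rests on assumptions that may differ from the definitions given in the methods: a contig is taken to "belong" to the transcript that contributes most of its reads, any other read in that contig counts as a contig assignment error, and every read left outside a contig counts as an assembly failure. The function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict

def assembly_error_rates(reads):
    """Return (contig assignment error rate, read assembly failure rate).

    reads: list of (true_transcript, contig_id) pairs, with contig_id set
    to None for reads that were left unassembled.
    """
    by_contig = defaultdict(list)
    unassembled = 0
    for transcript, contig in reads:
        if contig is None:
            unassembled += 1
        else:
            by_contig[contig].append(transcript)

    misassigned = 0
    for origins in by_contig.values():
        # Assumption: the majority transcript defines the contig's identity.
        majority_count = Counter(origins).most_common(1)[0][1]
        misassigned += len(origins) - majority_count

    n = len(reads)
    return misassigned / n, unassembled / n

example = [("t1", "c1"), ("t1", "c1"), ("t2", "c1"),
           ("t2", "c2"), ("t3", None)]
print(assembly_error_rates(example))
```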

Figure 3-8. Comparison of conservative estimates of percent B. mori amino acid coverage by three sets of M. cinxia unigene sequences at four alignment identity stringencies (i.e. Blast alignment minimum percent identities). Amino acid coverage was calculated from summed top hit Blast alignments (including multiple non-redundant alignments for each top hit) of M. cinxia unigenes to B. mori modified predicted proteins (see methods). a. Percent of B. mori amino acid coverage by M. cinxia Seqman pro, Mira, and CLC Cell assembled unigenes at minimum percent identities of 90%, 70%, 50% and 30%. Mira unigenes consistently have better alignment to the B. mori reference at all alignment stringencies. b. Percent of redundant B. mori amino acid coverage by M. cinxia Seqman pro, Mira, and CLC Cell assembled unigenes at minimum percent identities of 90%, 70%, 50% and 30% (i.e. percent of B. mori amino acids with two or more M. cinxia unigenes with top alignments to that position). Mira unigenes have a consistently higher rate of reference coverage redundancy, indicating either a higher fractionation of reads during assembly or a higher level of stringency for incorporation of reads into the assembly (i.e. more reads are excluded as singletons).
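
The two coverage figures plotted here can be derived from alignment coordinates as sketched below. This is an illustrative reconstruction rather than the script used for the analysis: it assumes the top-hit filtering and the identity and bitscore thresholds have already been applied, and that each alignment is supplied as 1-based inclusive coordinates on the reference protein. The function name and the input layout are hypothetical.

```python
from collections import defaultdict

def reference_coverage(protein_lengths, alignments):
    """Return (percent covered, percent covered by >= 2 unique unigenes).

    protein_lengths: dict of protein_id -> length in amino acids.
    alignments: iterable of (unigene_id, protein_id, start, end) tuples.
    """
    hits = defaultdict(set)  # (protein, position) -> unigene ids seen there
    for unigene, protein, start, end in alignments:
        for pos in range(start, end + 1):
            hits[(protein, pos)].add(unigene)

    total = sum(protein_lengths.values())
    covered = len(hits)
    redundant = sum(1 for ids in hits.values() if len(ids) >= 2)
    return 100.0 * covered / total, 100.0 * redundant / total

lengths = {"p1": 100, "p2": 100}
toy_alignments = [
    ("u1", "p1", 1, 50),
    ("u2", "p1", 41, 60),
    ("u1", "p1", 45, 50),  # same unigene again: adds no redundancy
]
print(reference_coverage(lengths, toy_alignments))
```

Storing unigene identifiers per position, rather than a simple depth count, keeps multiple alignments from the same unigene from being counted as redundant coverage.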

Figure 3-9. Comparison of conservative estimates of percent B. mori amino acid coverage by twenty sets of simulated unigene sequences (i.e. five simulated read sets assembled by four methods). Amino acid coverage was calculated from summed top hit Blast alignments (including multiple non-redundant alignments for each top hit) of simulated unigenes to B. mori modified predicted proteins (see methods). a. Percent of B. mori amino acid coverage by simulated Optimum, Seqman pro, Mira, and CLC Cell assembled unigenes (minimum percent identity of 90%) for five sets of simulated reads with increasing complexity. Mira unigenes consistently approach optimum assembly coverage of the reference most closely. Unlike for real M. cinxia reads, with these simulated sets CLC assembled unigenes slightly outperform Seqman pro unigenes for this metric. b. Percent of redundant B. mori amino acid coverage by simulated Optimum, Seqman pro, Mira, and CLC Cell assembled unigenes (minimum percent identity of 90%) (i.e. percent of B. mori amino acids with two or more simulated unigenes with top alignments to that position). As with real M. cinxia reads, Mira assembled reads have a consistently higher rate of reference coverage redundancy, most notably with SimSet2D (i.e. simulated reads from unmodified B. mori transcripts with both error and polymorphism). c. Percent of B. mori amino acid coverage by simulated Optimum, Seqman pro, Mira, and CLC Cell assembled unigenes (minimum percent identity of 90%) for five sets of simulated reads with increasing complexity, normalized versus percent coverage of optimally assembled unigenes.

Figure 3-10. Number of occurrences of summed polymorphism cluster lengths in M. cinxia contigs versus the binned polymorphism cluster lengths. All data points were log transformed to display the linear (if slightly curved) relationship expected of power law distributed variables. Contig sequences from both a) Seqman Pro and b) Mira were modified to display all underlying polymorphism (using SNPhunter, part of the PipMeta package) in the form of degenerate nucleotides (DNs) and then scanned for clusters of DNs with an internal spacing of two or fewer nucleotides. DN cluster lengths were grouped by contig and then binned by cluster length before analysis. Parameters alpha and xmin were calculated through the maximum-likelihood method as described in Clauset, 2007 (α = 3.5, Seqman xmin = 9, Mira xmin = 27).
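
The cluster scan described in this caption can be sketched as follows. This is not the SNPhunter implementation; it is a minimal illustration that assumes a DN is any IUPAC ambiguity code (including N), that two DNs belong to the same cluster when at most two non-degenerate nucleotides separate them, and that a cluster is reported as the 0-based inclusive positions of its first and last DN. The function name and example sequence are hypothetical.

```python
DEGENERATE = set("RYSWKMBDHVN")  # IUPAC ambiguity codes (assumed DN alphabet)

def dn_clusters(seq, max_gap=2):
    """Return (start, end) index pairs of degenerate-nucleotide clusters."""
    clusters = []
    start = last = None
    for i, base in enumerate(seq.upper()):
        if base not in DEGENERATE:
            continue
        if start is not None and i - last - 1 <= max_gap:
            last = i  # close enough: extend the current cluster
        else:
            if start is not None:
                clusters.append((start, last))
            start = last = i  # open a new cluster
    if start is not None:
        clusters.append((start, last))
    return clusters

# R (index 2) and Y (index 5) are two bases apart and form one cluster;
# W (index 10) is four bases past Y and starts a cluster of its own.
print(dn_clusters("ACRGTYAAAAWACGT"))
```

One possible cluster length is end - start + 1; whether the lengths in the figure count the whole span or only the DNs within it is not stated in the caption.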

Table 3-1. M. cinxia data set statistics. All sets (real and simulated) consist of two populations of reads combined for assembly. Version one consists of the original 454 GS20 reads (reported in Vera, et al., 2008) or the simulated equivalents. Version two consists of the additional unpublished 454 FLX reads or simulated equivalents. Modified sets were created from B. mori in silico predicted reference genes modified by removal of similar (i.e. >90% identity) but shorter (or equal length) sequences, while unmodified sets retain closely related gene family members and possibly some other complicating factors (e.g. pseudo-genes and alternative splice forms).

| Statistic | Simulated M. cinxia Set 1A Reads: Modified, No Errors | Simulated M. cinxia Set 2A Reads: Modified, No Errors | Simulated M. cinxia Set 2B Reads: Modified, With Errors | Simulated M. cinxia Set 2C Reads: Unmodified, With Errors | Simulated M. cinxia Set 2D Reads: Unmodified, With Errors & Simple Polymorphism | Real M. cinxia Read Set |
|---|---|---|---|---|---|---|
| N reads | 955163 | 937151 | 933051 | 960340 | 1010959 | 946285 |
| N version one reads | 580015 | 565880 | 564233 | 580607 | 583911 | 575380 |
| N version two reads | 375148 | 371271 | 368818 | 379733 | 427048 | 370905 |
| Mean length | 147.7 | 147 | 147.1 | 147 | 150.3 | 148.6 |
| Version one mean length | 101 | 101 | 101.2 | 101.2 | 101.1 | 100.7 |
| Version two mean length | 220 | 217 | 217.4 | 217 | 217.7 | 222.9 |
| Length SD | 69.5 | 68.1 | 68.2 | 68 | 69.2 | 71.3 |
| Version one length SD | 19 | 19 | 19 | 19 | 18.9 | 19.5 |
| Version two length SD | 56 | 55 | 55.1 | 55 | 55.3 | 57.5 |
| N read nucleotides | 1.41E+08 | 1.38E+08 | 1.37E+08 | 1.41E+08 | 1.52E+08 | 1.41E+08 |
| Version one N nucleotides | 58606677 | 57168523 | 57082821 | 58741251 | 59055465 | 57952498 |
| Version two N nucleotides | 82484049 | 80581799 | 80171347 | 82393526 | 92841471 | 82658427 |

Table 3-2. Assembly statistics for real M. cinxia data set. Standard assembly descriptive statistics and quality metrics of assembly results for three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the underlying sequencing effort occurs in a list of contigs sorted by length. Unlike the N50 metric, this metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing.

| Metric | Seqman Pro | Mira | CLC |
|---|---|---|---|
| N Contigs | 58124 | 72302 | 83400 |
| Mean Contig Depth | 3.2 | 3.2 | 3.2 |
| Median Contig Depth | 2 | 2 | 2 |
| Actual Average Contig Depth | 4.9 | 4.5 | 4.1 |
| Mean Contig Length | 396 | 355 | 282 |
| Mean Contig Length SD | 362 | 305 | 270 |
| Mean Number of ESTs per Contig | 14 | 10.8 | 10 |
| Median Number of ESTs per Contig | 3 | 3 | 3 |
| N Assembled Nucleotides (Contigs) | 1.12E+08 | 1.15E+08 | 1.22E+08 |
| Percent of Nucleotides Assembled | 79.8 | 82.1 | 86.8 |
| Maximum length Contig | 6260 | 6901 | 3984 |
| N50 Contig Nucleotide Weighted Length | 481 | 421 | 387 |
| N50 Contig Nucleotide Weighted N | 12563 | 16876 | 15892 |
| Nc50 Mean EST Coverage Weighted Contig Length | 983 | 772 | 658 |
| Nc50 Mean EST Coverage Weighted Contig N | 1166 | 1267 | 1250 |
| top10% Mean Contig Length | 1288 | 1094 | 927 |
| top10% Mean Contig Depth | 12 | 10 | 13.5 |
| N Singletons | 91411 | 158055 | 113340 |
| Mean Singleton Length | 145 | 146 | 163 |
| Mean Singleton Length SD | 83 | 71 | 77 |
| N Unassembled Nucleotides (Singletons) | 13264259 | 23031438 | 18441268 |
| Percent of Nucleotides Un-assembled | 9.4 | 16.4 | 13.1 |
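
The difference between N50 and the proposed Nc50 can be made concrete with a short calculation. The sketch below is one reading of the definition in the caption of Table 3-2, not the script used to produce the table: contigs are sorted longest first, N50 accumulates contig length, and Nc50 accumulates the read nucleotides assembled into each contig (i.e. length weighted by coverage depth). The function names and the five-contig example are hypothetical.

```python
def weighted_50(contigs, weight):
    """Return (contig length, contig count) at which the running sum of
    weight(contig), over contigs sorted longest first, reaches 50%."""
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    half = sum(weight(c) for c in ordered) / 2.0
    running = 0.0
    for n, contig in enumerate(ordered, start=1):
        running += weight(contig)
        if running >= half:
            return contig[0], n
    raise ValueError("no contigs given")

def n50(contigs):
    """Weight each contig by its own length."""
    return weighted_50(contigs, lambda c: c[0])

def nc50(contigs):
    """Weight each contig by the read nucleotides assembled into it."""
    return weighted_50(contigs, lambda c: c[1])

# Each contig is (consensus length, summed read nucleotides).
toy_contigs = [(1000, 20000), (800, 2000), (500, 1000), (300, 600), (200, 400)]
print(n50(toy_contigs), nc50(toy_contigs))
```

In the toy data most reads fall in the longest contig, so the Nc50 length exceeds the N50 length and is reached after fewer contigs, the same direction of difference seen between the N50 and Nc50 rows of the table.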

Table 3-3. Comparison of three sets of real M. cinxia assembly Unigenes to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins by M. cinxia at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Raw nucleotide polymorphism rates were estimated using SNPHunter script (see methods). Coverages, overlaps, and fragmentation estimates were calculated for multiple minimum percent identities (for alignments > 90%, 70%, 50%, and 30%) at a minimum bitscore threshold of 40.

| Metric | Seqman Pro | Mira | CLC |
|---|---|---|---|
| Percent Length Coverage of modified B. mori amino acids by Blast (>90%) | 4.5 | 5.1 | 4.6 |
| Percent Length Coverage >= 2 of B. mori amino acids (>90%) | 0.27 | 1.1 | 0.36 |
| Percent Unigene Nucleotide Overlap Estimate (>90%) | 12.4 | 41.8 | 18.6 |
| Percent Contig Nucleotide Overlap Estimate (>90%) | 9.2 | 41.4 | 14.9 |
| Average Fragmentation Estimate (unique query hits per subject, >90%) | 1.6 | 2.1 | 1.8 |
| Average Fragmentation Estimate SD (>90%) | 1.1 | 1.8 | 1.6 |
| Percent Length Coverage of modified B. mori amino acids by Blast (>70%) | 16.8 | 17.8 | 14.2 |
| Percent Length Coverage >= 2 of B. mori amino acids by Blast (>70%) | 1.3 | 4.1 | 1 |
| Percent Unigene Nucleotide Overlap Estimate (>70%) | 15.4 | 44.9 | 14.8 |
| Average Fragmentation Estimate (unique query hits per subject, >70%) | 2.1 | 2.9 | 1.8 |
| Average Fragmentation Estimate SD (>70%) | 1.6 | 2.5 | 1.3 |
| Percent Length Coverage of modified B. mori amino acids by Blast (>50%) | 26.5 | 27.5 | 22.9 |
| Percent Length Coverage >= 2 of B. mori amino acids by Blast (>50%) | 3.2 | 7.5 | 2.5 |
| Percent Unigene Nucleotide Overlap Estimate (>50%) | 26.2 | 51.9 | 23.4 |
| Average Fragmentation Estimate (unique query hits per subject, >50%) | 2.4 | 3.4 | 2.1 |
| Average Fragmentation Estimate SD (>50%) | 2 | 3.2 | 1.7 |
| Percent Length Coverage of modified B. mori amino acids by Blast (>30%) | 31 | 31.7 | 27.2 |
| Percent Length Coverage >= 2 of B. mori amino acids by Blast (>30%) | 5.2 | 9.8 | 4.3 |
| Percent Unigene Nucleotide Overlap Estimate (>30%) | 39.9 | 59.7 | 35.7 |
| Average Fragmentation Estimate (unique query hits per subject, >30%) | 2.6 | 3.7 | 2.3 |
| Average Fragmentation Estimate SD (>30%) | 2.6 | 3.7 | 2.2 |
| Percent Raw Contig Nucleotide Polymorphism | 1.50 | 0.67 | n/a |
| Percent Raw Contig Nucleotide Complex Polymorphism (DN>2) | 0.91 | 0.31 | n/a |
| Percent Raw Contig Nucleotide High Density Complex Polymorphism (DN>10) | 0.29 | 0.056 | n/a |
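
The fragmentation estimate in this table (unique query hits per subject) amounts to counting distinct unigenes per reference protein. The sketch below illustrates that count under stated assumptions: only reference proteins with at least one top hit enter the average, and the SD is the sample standard deviation; neither choice is confirmed by the caption. The function name and the toy hit list are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def fragmentation_estimate(top_hits):
    """Return (mean, SD) of unique unigenes per reference protein.

    top_hits: iterable of (unigene_id, protein_id) top Blast hits that
    already pass the identity and bitscore thresholds.
    """
    per_protein = defaultdict(set)
    for unigene, protein in top_hits:
        per_protein[protein].add(unigene)
    counts = [len(ids) for ids in per_protein.values()]
    return mean(counts), stdev(counts)

toy_hits = [("u1", "p1"), ("u2", "p1"), ("u3", "p1"),
            ("u4", "p2"), ("u4", "p2")]  # u4 aligns to p2 in two pieces
frag_mean, frag_sd = fragmentation_estimate(toy_hits)
print(frag_mean, frag_sd)
```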

Table 3-4. Assembly statistics for simulated M. cinxia data set 1A, derived from modified B. mori reference set without errors or polymorphism. Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process.

| Metric | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| N Contigs | 14127 | 20785 | 22783 | 22268 | 23831 |
| Mean Contig Depth | 16.1 | 11.7 | 10.3 | 10.6 | 10.6 |
| Median Contig Depth | 7.9 | 3.9 | 3.5 | 3.7 | 3.6 |
| Actual Contig Depth | 8 | 10.2 | 10 | 9.6 | 10.3 |
| Mean Contig Length | 1244 | 665 | 578 | 611 | 563 |
| Mean Contig Length SD | 1510 | 540 | 467 | 492 | 483 |
| Mean Number of ESTs per Contig | 67.6 | 45.7 | 40.4 | 41 | 40 |
| Median Number of ESTs per Contig | 61 | 17 | 13 | 14 | 12 |
| N Assembled Nucleotides (Contigs) | 1.41E+08 | 1.4E+08 | 1.32E+08 | 1.35E+08 | 1.38E+08 |
| Percent of Nucleotides Assembled | ~100 | 99.5 | 93.5 | 95.9 | 97.8 |
| Maximum length Contig | 56289 | 7525 | 7524 | 7525 | 7217 |
| top10% Mean Contig Length | 4181 | 1902 | 1650.6 | 1739 | 1679 |
| top10% Mean Contig Depth | 75.9 | 61.5 | 55.6 | 54.4 | 56.6 |
| N50 Contig Nucleotide Weighted Length | 2839 | 939 | 809 | 857 | 824 |
| N50 Contig Nucleotide Weighted N | 1716 | 4634 | 5114 | 4995 | 5067 |
| Nc50 Mean EST Coverage Weighted Contig Length | 837 | 742 | 714 | 748 | 714 |
| Nc50 EST Coverage Weighted Contig N | 1344 | 1636 | 1453 | 1490 | 1502 |
| EST to Contig Assignment Error Rate | 0 | 0 | 0.009 | 0.0048 | 0.012 |
| N Singletons | 25 | 5565 | 14157 | 41642 | 13179 |
| Mean Singleton Length | 1737 | 125 | 114 | 138 | 143 |
| Mean Singleton Length SD | 2852 | 59 | 55 | 64 | 68 |
| N Unassembled Nucleotides (Singletons) | 2630 | 695933 | 1610735 | 5743386 | 1884821 |
| Percent of Nucleotides Unassembled | ~0 | 0.49 | 1.1 | 4.1 | 1.3 |
| EST Assembly Failure Rate | 0 | 0 | 0.032 | 0.038 | 0.0081 |
| ESTs Assembled in Error Rate | 0 | 0 | 0.00011 | 0.000081 | 0.00014 |

Table 3-5. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 1A assembled Unigenes (see Table 3-4) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful metric for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene.

| Metric | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| Percent Length Coverage of Perfect Alignment (>90%) | 97.1 | 80 | 77.4 | 79.5 | 78 |
| Percent Length Coverage >= 2 of Perfect Alignment (>90%) | 0.07 | 3.7 | 1.3 | 0.39 | 2.2 |
| Actual Percent Length Coverage of Perfect Alignment | 100 | 82.4 | n/a | n/a | n/a |
| Actual Percent Length Coverage >= 2 of Perfect Alignment | 0 | 0 | n/a | n/a | n/a |
| Reference Annotation Assignment Error Rate | 0 | 0 | 0.015 | 0.01 | 0.022 |

Table 3-6. Assembly statistics for simulated M. cinxia data set 2A, derived from modified B. mori reference set without errors or polymorphism. Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process.

| Statistic | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| N Contigs | 14106 | 20713 | 22967 | 22301 | 23517 |
| Mean Contig Depth | 15.4 | 11.2 | 10 | 10.5 | 10.4 |
| Median Contig Depth | 7.8 | 3.9 | 3.4 | 3.7 | 3.8 |
| Actual Contig Depth | 7.8 | 9.5 | 9.8 | 10 | n/a |
| Mean Contig Length | 1244 | 666 | 575 | 610 | 562 |
| Mean Contig Length SD | 1511 | 542 | 467 | 494 | 485 |
| N50 Contig Nucleotide Weighted Length | 1719 | 895 | 810 | 861 | 824 |
| N50 Contig Nucleotide Weighted N | 2840 | 4997 | 5132 | 4985 | 5057 |
| Nc50 Mean EST Coverage Weighted Contig Length | 873 | 780 | 735 | 768 | 739 |
| Nc50 Mean EST Coverage Weighted Contig N | 1335 | 1617 | 1425 | 1475 | 1476 |
| top10% Contig Length | 4183 | 1909 | 1642 | 1744 | 1688 |
| top10% Contig Depth | 69 | 57 | 52 | 54 | 53 |
| Mean Number of ESTs per Contig | 66.2 | 45 | 40 | 41.4 | 40 |
| Median Number of ESTs per Contig | 61 | 16 | 13 | 14 | 13 |
| N Assembled Nucleotides (Contigs) | 1.38E+08 | 1.37E+08 | 1.29E+08 | 1.36E+08 | 1.35E+08 |
| Percent of Nucleotides Assembled | 99.9 | 99.5 | 93.8 | 98.7 | 97.9 |
| Maximum length Contig | 56289 | 6690 | 6689 | 6690 | 6690 |
| EST to Contig Assignment Error Rate | 0 | 0 | 0.0092 | 0.0049 | 0.014 |
| N Singletons | 47 | 5520 | 14132 | 14364 | 12435 |
| Mean Singleton Length | 102 | 124 | 114 | 129 | 139 |
| Mean Singleton Length SD | 18 | 56 | 54 | 59 | 66 |
| N Unassembled Nucleotides (Singletons) | 4792 | 684928 | 1606158 | 1856380 | 1734427 |
| Percent of Nucleotides Unassembled | 0.1 | 0.5 | 1.2 | 1.3 | 1.3 |
| EST Assembly Failure Rate | 0 | 0 | 0.022 | 0.0095 | 0.0075 |
| ESTs Assembled in Error Rate | 0 | 0 | 0.000093 | 0.000063 | 0.00015 |

Table 3-7. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 2A assembled Unigenes (see Table 3-6) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene.

| Statistic | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| Percent Length Coverage of Perfect Alignment | 96.9 | 79.9 | 77.4 | 79.4 | 77.7 |
| Percent Length Coverage >= 2 of Perfect Alignment | 0.07 | 0.016 | 1.4 | 0.36 | 1.4 |
| Average Fragmentation Estimate (unique query hits per subject) | 0 | 1.8 | 2.5 | 2.5 | 2.4 |
| Average Fragmentation Estimate SD | 0 | 2.3 | 6.8 | 52 | 3.7 |
| Percent Unigene Nucleotide Overlap Estimate | 0 | 0.43 | 5.6 | 6.6 | 6.3 |
| Actual Percent Length Coverage of Perfect Alignment | 100 | 82.2 | n/a | n/a | n/a |
| Actual Percent Length Coverage >= 2 of Perfect Alignment | 0 | 0 | n/a | n/a | n/a |
| Reference Annotation Assignment Error Rate | 0 | 0 | 0.017 | 0.013 | 0.05 |

Table 3-8. Assembly statistics for simulated M. cinxia data set 2B, derived from modified B. mori reference set with errors but without polymorphism (see Table 3-1). Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process.

| Statistic | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| N Contigs | 14106 | 20757 | 23439 | 24132 | 24231 |
| Mean Contig Depth | 15.4 | 11.1 | 9.8 | 9.8 | 9.8 |
| Median Contig Depth | 7.8 | 3.9 | 3.3 | 3.4 | 3.6 |
| Actual Contig Depth | 7.8 | 9.9 | 9.8 | 9.9 | 9.8 |
| Mean Contig Length | 1267 | 678 | 573 | 572 | 553 |
| Mean Contig Length SD | 1512 | 552 | 466 | 461 | 467 |
| N50 Contig Nucleotide Weighted Length | 2835 | 938 | 785 | 783 | 792 |
| N50 Contig Nucleotide Weighted N | 1716 | 4623 | 5251 | 5450 | 5259 |
| Nc50 Mean EST Coverage Weighted Contig Length | 873 | 779 | 728 | 746 | 733 |
| Nc50 Mean EST Coverage Weighted Contig N | 1332 | 1611 | 1400 | 1396 | 1447 |
| top10% Mean Contig Length | 4198 | 1941 | 1636 | 1633 | 1625 |
| top10% Mean Contig Depth | 69 | 57 | 51 | 51 | 50 |
| Mean Number of ESTs per Contig | 66 | 45 | 39 | 38 | 38 |
| Median Number of ESTs per Contig | 61 | 16 | 12 | 12 | 12 |
| N Assembled Nucleotides (Contigs) | 1.37E+08 | 1.37E+08 | 1.29E+08 | 1.35E+08 | 1.32E+08 |
| Percent of Nucleotides Assembled | ~100 | 99.5 | 93.7 | 98.2 | 96.6 |
| Maximum length Contig | 56289 | 6710 | 6386 | 6389 | 6390 |
| EST to Contig Assignment Error Rate | 0 | 0 | 0.0086 | 0.0049 | 0.011 |
| N Singletons | 47 | 5671 | 14647 | 17260 | 22766 |
| Mean Singleton Length | 1550 | 124 | 114 | 132 | 140 |
| Mean Singleton Length SD | 2268 | 54 | 54 | 64 | 62 |
| N Unassembled Nucleotides (Singletons) | 4742 | 701673 | 1668577 | 2270541 | 3180117 |
| Percent of Nucleotides Unassembled | ~0 | 0.51 | 1.2 | 1.7 | 2.3 |
| EST Assembly Failure Rate | 0 | 0 | 0.022 | 0.012 | 0.019 |
| ESTs Assembled in Error Rate | 0 | 0 | 0.000098 | 0.00006 | 0.00015 |

Table 3-9. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 2B assembled Unigenes (see Table 3-8) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene.

| Statistic | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| Percent Length Coverage of Perfect Alignment by Blast | 96.9 | 79.5 | 74.1 | 75.7 | 74.3 |
| Percent Length Coverage >= 2 of Perfect Alignment by Blast | 0.071 | 0.14 | 1.3 | 1.8 | 1.5 |
| Actual Percent Length Coverage of Perfect Alignment | 100 | 82.4 | n/a | n/a | n/a |
| Actual Percent Length Coverage >= 2 of Perfect Alignment | 0 | 0 | n/a | n/a | n/a |
| Blast Reference Assignment Error Rate | 0 | 0 | 0.02 | 0.015 | 0.055 |

Table 3-10. Assembly statistics for simulated M. cinxia data set 2C, derived from unmodified B. mori reference set with errors but without polymorphism (see Table 3-1). Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process.

| Metric | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| N Contigs | 14589 | 21057 | 23481 | 24578 | 25432 |
| Mean Contig Depth | 15.7 | 11.6 | 10 | 10 | 10 |
| Median Contig Depth | 7.8 | 4.2 | 3.4 | 3.5 | 3.5 |
| Actual Contig Depth | 7.9 | 10 | 9.9 | 9.9 | 9.9 |
| Mean Contig Length | 1246 | 682 | 575 | 574 | 563 |
| Mean Contig Length SD | 1498 | 553 | 467 | 466 | 457 |
| Mean Number of ESTs per Contig | 66 | 45 | 40 | 38 | 37 |
| Median Number of ESTs per Contig | 61 | 18 | 12 | 12 | 12 |
| top10% Mean Contig Length | 4155 | 1966 | 1647 | 1651 | 1651 |
| top10% Mean Contig Depth | 40 | 59 | 53 | 52 | 51 |
| N Assembled Nucleotides (Contigs) | 1.41E+08 | 1.4E+08 | 1.31E+08 | 1.38E+08 | 1.35E+08 |
| Percent of Nucleotides Assembled | ~100 | 99.5 | 92.8 | 97.9 | 95.5 |
| Maximum length Contig | 56289 | 7341 | 5253 | 5803 | 5733 |
| N50 Contig Nucleotide Weighted Length | 1696 | 944 | 792 | 791 | 801 |
| N50 Contig Nucleotide Weighted N | 2909 | 4696 | 5233 | 5493 | 5521 |
| Nc50 Mean EST Coverage Weighted Contig Length | 870 | 766 | 735 | 744 | 735 |
| Nc50 Mean EST Coverage Weighted Contig N | 1336 | 1642 | 1394 | 1416 | 1424 |
| EST to Contig Assignment Error Rate | 0 | 0 | 0.03 | 0.011 | 0.074 |
| N Singletons | 32 | 5586 | 15417 | 20914 | 22159 |
| Mean Singleton Length | 102 | 124 | 112 | 130 | 143 |
| Mean Singleton Length SD | 15 | 55 | 54 | 62 | 63 |
| N Unassembled Nucleotides (Singletons) | 3249 | 690541 | 1733394 | 2714488 | 3104964 |
| Percent of Nucleotides Unassembled | ~0 | 0.49 | 1.2 | 1.9 | 2.2 |
| EST Assembly Failure Rate | 0 | 0 | 0.025 | 0.012 | 0.025 |
| ESTs Assembled in Error Rate | 0 | 0 | 0.00014 | 0.000058 | 0.00017 |

Table 3-11. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 2C assembled Unigenes (see Table 3-10) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene.

| Metric | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| Percent Length Coverage of Perfect Alignment by Blast | 97.7 | 80.3 | 73.8 | 76.3 | 74.8 |
| Percent Length Coverage >= 2 of Perfect Alignment by Blast | 0.072 | 0.15 | 1.5 | 2.1 | 1.7 |
| Average Fragmentation Estimate (unique query hits per subject) | 0 | 1.8 | 2.5 | 3.1 | 2.8 |
| Average Fragmentation Estimate SD | 0 | 2.2 | 5.1 | 47.7 | 7.1 |
| Percent Unigene Nucleotide Overlap Estimate | 0 | 3.7 | 5.8 | 29 | 7.4 |
| Actual Percent Length Coverage of Perfect Alignment | 100 | 84.2 | n/a | n/a | n/a |
| Actual Percent Length Coverage >= 2 of Perfect Alignment | 0 | 0 | n/a | n/a | n/a |
| Blast Reference Assignment Error Rate | 0 | 0 | 0.044 | 0.041 | 0.21 |

Table 3-12. Assembly statistics for simulated M. cinxia data set 2D, derived from unmodified B. mori reference set with errors and simple polymorphism (see Table 3-1). Standard assembly descriptive statistics for assembly results from perfect alignment, optimal assembly, and three assemblers using recommended parameters (with the exception of parameters for minimum contig length [50 bp] and minimum consensus identity [85%]). Nc50 is the read coverage depth weighted contig length (or number) where 50% of the sequencing effort occurs in a list of contigs sorted by length. This metric attempts to account for the variable nature of transcript expression levels in transcriptome sequencing as opposed to genome sequencing. Metrics in italics are unique to simulated reads, which retain true transcript origin information in the titles that allow them to be tracked through the assembly process.

| Metric | Perfect Alignment | Optimal Assembly | Seqman Pro | Mira | CLC |
|---|---|---|---|---|---|
| N Contigs | 14623 | 23041 | 26374 | 36776 | 38743 |
| Mean Contig Depth | 16.7 | 11.3 | 9.7 | 7.5 | 7.6 |
| Median Contig Depth | 8.6 | 4.2 | 3.4 | 3.5 | 3.6 |
| Actual Contig Depth | 8.5 | 9.9 | 9.9 | 8.4 | 8.5 |
| Mean Contig Length | 1247 | 670.2 | 550 | 482.2 | 356.1 |
| Mean Contig Length SD | 1497 | 541.4 | 455.2 | 404.3 | 302 |
| Mean Number of ESTs per Contig | 69 | 43.4 | 37.2 | 26.8 | 32.8 |
| Median Number of ESTs per Contig | 45 | 26 | 14 | 9 | 10 |
| top10% Mean Contig Length | 4153 | 1881.8 | 1584.7 | 1447.1 | 1399 |
| top10% Mean Contig Depth | 77.5 | 58.9 | 52.9 | 35.4 | 43.3 |
| N Assembled Nucleotides (Contigs) | 1.52E+08 | 1.5E+08 | 1.41E+08 | 1.48E+08 | 143542603 |
| Percent of Nucleotides Assembled | 100 | 99 | 92.9 | 97.8 | 94.5 |
| Maximum length Contig | 56289 | 7916 | 5305 | 5796 | 5113 |
| N50 Contig Nucleotide Weighted Length | 1698 | 909 | 752 | 629 | 755 |
| N50 Contig Nucleotide Weighted N | 2917 | 5328 | 5951 | 8276 | 5667 |
| Nc50 Mean EST Coverage Weighted Contig Length | 876 | 764 | 752 | 700 | 686 |
| Nc50 Mean EST Coverage Weighted Contig N | 2014 | 1617 | 1419 | 1484 | 1506 |
| EST to Contig Assignment Error Rate | 0 | 0 | 0.028 | 0.011 | 0.075 |
| N Singletons | 0 | 4521 | 30272 | 23532 | 22141 |
| Mean Singleton Length | n/a | 113.7 | 70 | 136 | 142 |
| Mean Singleton Length SD | n/a | 47.1 | 55.6 | 68 | 70 |
| N Unassembled Nucleotides (Singletons) | 0 | 514189 | 2117944 | 3199873 | 303790 |
| Percent of Nucleotides Unassembled | 0 | 1 | 1.4 | 2.1 | 2.0 |
| EST Assembly Failure Rate | 0 | 0 | 0.026 | 0.019 | 0.025 |
| ESTs Assembled in Error Rate | 0 | 0 | 0.00096 | 0.000064 | 0.00021 |


Table 3-13. Comparison of perfect alignment, optimal assembly, and three sets of simulated M. cinxia set 2D assembled Unigenes (see Table 12) to B. mori reference genes. Unigenes were first Blastx aligned to B. mori predicted proteins. Top alignment positions (including multiple gapped positions) were then combined to calculate percentage of amino acids covered on the reference proteins (i.e. the perfect alignment) by M. cinxia reads at both one and greater than one coverage depth by unique unigenes. Fragmentation estimates refer to average number of unique unigenes per reference protein. Nucleotide overlap refers to percentage of self-alignment of M. cinxia unigenes. Coverages, overlaps, and fragmentation estimates were calculated with a minimum percent identity of 90 and minimum bitscore threshold of 40. Additional metrics unique to simulated reads (in italics) include exact calculation of B. mori amino acids covered (useful for optimal assembly only) and Blast annotation assignment error rate, which is the fraction of unigenes with an incorrect top Blast alignment to a B. mori reference gene.

Metric                                                          Perfect    Optimal    Seqman     Mira       CLC
                                                                Alignment  Assembly   Pro
Percent Length Coverage of Perfect Alignment by Blast           96.9       84.2       77.5       80.3       78.1
Percent Length Coverage >= 2 of Perfect Alignment by Blast      0.024      0.19       1.5        10.1       1.8
Average Fragmentation Estimate (unique query hits per subject)  0          1.8        2.7        4          2.9
Average Fragmentation Estimate SD                               0          2.5        5.9        27         6.3
Percent Unigene Nucleotide Overlap Estimate                     0          0.49       5.9        33.2       7.9
Actual Percent Length Coverage of Perfect Alignment             100        87.6       n/a        n/a        n/a
Actual Percent Length Coverage >= 2 of Perfect Alignment        0          0          n/a        n/a        n/a
Blast Reference Assignment Error Rate                           0          0          0.028      0.027      0.11

Chapter Four
Scanning the Glanville fritillary transcriptome for SNPs and partial characterization of a polymorphic life history gene

Introduction

One of the primary benefits of next-generation, high-throughput sequencing is the ability to detect genetic variation between individuals on a genomic scale. In particular, detection and analysis of single nucleotide polymorphisms (SNPs) is a powerful and efficient means of performing many kinds of genetic and evolutionary studies of populations and closely related species (reviewed in Agarwal, Shrivastava et al. 2008; Holderegger, Herrmann et al. 2008; Oleksyk, Smith et al. 2010; Seeb, Carvalho et al. 2011). Specific applications include examining the genetic basis of disease (e.g. Edwards, Scott et al. 2010) and environmentally adaptive genetic variation in genome-wide association studies (Hedrick, Ginevan et al. 1976; Manel, Joost et al. 2010), speciation and divergence in closely related species or populations (Storz 2005; Beaumont and Balding 2004; Prunier, Laroche et al. 2011; Renaut, Nolte et al. 2010; Morin, Breuil et al. 2004), and nucleotide diversity studies (e.g. Hohenlohe, Bassham et al. 2010; Yu, Jensen-Seaman et al. 2003; Sloan, Keller et al. 2011), which allow for the characterization of population genetic structure and the detection of genes under selection. In the past, genome-wide SNP studies have been limited to model species with plentiful genomic resources, usually including at least a partially sequenced reference genome to use as a mapping reference for re-sequencing projects (e.g. Blavet, Charif et al. 2011; Parchman, Geist et al. 2010; Guo, Zheng et al. 2010). With the advent of high-throughput sequencing technology and improved assembly methods, however, it has become feasible, and in some cases preferable, to perform extensive SNP detection at the transcriptome-wide level even without a reference genome (reviewed in Seeb, Carvalho et al. 2011; e.g. Vera, Wheat et al. 2008), thus opening up the study of genetic variation in a large number of interesting non-model organisms.
Typical transcript polymorphism comes in several forms, including large-scale rearrangements, large and small insertions or deletions (indels), simple sequence repeats (SSR), and SNPs. In general, SNPs are easier to work with than other forms of variation due to slower rates of mutation (particularly within transcript open reading frames) and high levels of abundance throughout the genome (Varela and Amos 2010), and are fast becoming the preferred medium for several types of evolutionary genetic studies. Although SNP detection and analysis using transcriptome sequencing omits polymorphism found in intergenic and intronic regions of the genome, much variation typically exists at the transcript level (Castle, 2011), and, depending on the goals of the study, transcript polymorphism is often more interesting or useful than the non-transcript variety from a practical or even biological perspective. Generally, this is because transcript polymorphism is easier to characterize in terms of molecular effects (if any) on the transcript or protein and can often be directly associated with the relevant environmental, physiological, or morphological traits under consideration (reviewed in Storz and Wheat 2010), particularly when the gene(s) in question are well annotated and characterized. Another advantage of searching for SNPs using transcriptome sequencing is the ability to pool samples from multiple individuals for efficient and relatively low cost population studies (Futschik and Schloetterer 2010). There is ample evidence that pooled samples are a powerful and effective method for marker detection and genetic analysis in genome re-sequencing projects (reviewed in Kim, Li et al. 2010; e.g. Blavet, Charif et al. 2011; Guo, Zheng et al. 2010), and non-model organism labs have already begun adapting these methods to de novo transcriptomes (reviewed in Seeb, Carvalho et al. 2011; e.g. Meyer, Aglyamova et al. 2009). A further advantage of using a pooled de novo transcriptome approach is the opportunity to investigate balancing selection in ecologically interesting species.
The Glanville fritillary, Melitaea cinxia, is an ecologically well-studied butterfly that exists in a classic metapopulation on the Aland Islands off the coast of Finland, where differences in dispersal ability between butterflies play an important role in the ongoing colonization-extinction dynamics and in the maintenance of genetic diversity in the metapopulation as a whole (Hanski 1999; Hanski and Ovaskainen 2000; Nieminen, Siljander et al. 2004; Ovaskainen and Hanski 2004). In previous studies of the alfalfa butterfly (Colias eurytheme), the metabolic gene phosphoglucose isomerase (Pgi) was shown to have adaptive genetic polymorphism maintained by diversifying selection on flight ability in variable weather conditions (reviewed in Watt 2003). This variation in enzymatic activity was found to be caused by non-synonymous changes to the PGI amino acid sequence that produced temperature-dependent modulation of ATP supply to the flight muscles (Wheat, Watt et al. 2006). Variation in PGI alleles was also found in M. cinxia and was associated with flight capacity, female fecundity, and survivorship (Haag, Saastamoinen et al. 2005; Saastamoinen 2007; Klemme and Hanski 2009; Niitepõld, Smith et al. 2009; Orsini, Wheat et al. 2009; Saastamoinen, Ikonen et al. 2009), and the PGI alleles associated with increased flight ability and fecundity were found at higher frequencies in newly colonized demes (new populations) than in older, established demes (old populations). Additionally, genetic analysis of DNA sequence polymorphism at the PGI gene showed strong evidence of long-term balancing selection, with PGI alleles that date from before the divergence of a clade of five other Melitaea species (Wheat, Haag et al. 2010). Although a bottom-up approach (i.e. gene by gene) has long been the hallmark of population genetics and has the advantage of more in-depth experimental work and concentration of resources, a genome-wide approach has the benefit of quickly yielding interesting and sometimes surprising results. A recent investigation of transcriptome-wide expression patterns in the M. cinxia assembly v2.0 (Vera, Wheat et al. 2008; Wheat, Fescemyer et al. 2011) examined the effects of repeated local extinction and re-establishment of demes (i.e. old and new populations), across a heterogeneous landscape, on the maintenance and patterning of genetic variation with large fitness effects. One piece of evidence confirming the hypothesis that this variation is associated with metapopulation dynamics came in the form of allelic variation in Succinate dehydrogenase d (a subunit of the SDH enzyme, a TCA cycle and electron transport enzyme, containing three alleles corresponding to the presence or absence of two indels in the 3’ UTR) that associated significantly with population age. Another result was the finding that angiotensin converting enzyme (ACE) expression level varied with population age (higher in new populations) in female butterfly abdomens. ACE is known to regulate oviposition in Lepidoptera (Vercruysse, Gelman et al.
2004), and variation in Ace expression was mirrored by differences in egg production and in expression of the egg protein gene vitellogenin (Vg). The allelic variation controlling Ace expression was not revealed in that study, but one candidate is the juvenile hormone (JH) pathway, as JH regulates Ace and other aspects of oogenesis in nymphalid butterflies (Ramaswamy, Shu et al. 1997), and JH titers in 0-2 day old females were significantly higher in new population butterflies (Wheat, Fescemyer et al. 2011). The finding of allelic variation (i.e. Pgi, Sdhd) assorting with metapopulation dynamics in M. cinxia provided increased incentive to further examine variation within the assembly. The effects of overdominance selection (e.g. PGI enzymatic performance and flight ability; Wheat, Haag et al. 2010) and balancing selection in general caused by metapopulation dynamics are complementary forces in the maintenance of ancestral genetic diversity in populations (Bell 2009), potentially resulting in increased levels (i.e. significantly above neutral expectations) of polymorphism for selected genes. Theta (i.e. heterozygosity; see figure 13; Watterson 1975) is an estimate of polymorphism in a population when underlying assumptions (discussed below) for the population(s) being considered are met. Most statistics built on theta estimators (Tajima’s D, D*, etc.; Tajima 1989; Fu and Li 1993) are based on coalescent theory and can be used along with other metrics (e.g. π; Nei, 1979) to describe polymorphism within and between populations and to detect the traces of directional and balancing selection when combined with a proper test for significance (see Oleksyk, Smith et al. 2010 for a recent review). A basic theta estimate calculation is thus well suited to scanning sequences (i.e. contigs) assembled from multiple individuals for regions with unusually large amounts of underlying polymorphism. However, although de novo high-throughput transcriptome sequencing of pooled samples holds great promise as an efficient means of detecting polymorphic outliers, in practice this method is complicated by several biases inherent to both pooled samples (Futschik and Schloetterer 2010) and transcriptome data (e.g. Sloan, Keller et al. 2011). In the case of our M. cinxia data, equimolar amounts of mRNA from several age classes, each representing material from many metapopulation demes, were pooled from more than eighty individuals and then normalized before sequencing in the original assembly (i.e. version 1.0; Vera, Wheat et al. 2008), thus limiting the degree and impact of pooled sample bias (Kim, Li et al. 2010; Futschik and Schloetterer 2010). Transcript copy number bias, on the other hand, was a serious concern, along with the effects of scanning for polymorphism density within variable length assembly alignments. To account for this bias, several modified versions of the theta estimator with a correction for sampling bias (Long, 2006) were modeled against the M. cinxia assembly for varying values of population-specific (i.e.
Aland v1.0 reads) contig coverage length, contig coverage depth, and number of segregating sites (i.e. SNPs). To accomplish this, a SNP detection tool capable of providing population or strain specific metrics for pooled sample de novo transcriptome assemblies was needed. SNP detection tools for use with sequenced genomes have advanced greatly in recent years and are well represented in the literature (reviewed in Duran, Edwards et al. 2009; Zhang, Chen et al. 2011). However, SNP detection within a de novo transcriptome presents unique problems, such as potential false-positive polymorphisms arising from misassembly due to similar alternative splice forms or closely related paralogous genes. Avoiding excessive false positive SNP calls requires the ability to recognize and avoid suspect regions of the assembly. This goal is complicated even further in population studies that sequence multi-individual pooled samples. Although SNP detection tools exist for statistically differentiating allelic polymorphism from paralogous variation within assembly alignments, they are designed to be used with either un-pooled samples (QualitySNP: Tang, Vosman et al. 2006) or with a sequenced genome available to use as a reference (PoPoolation: Kofler, Orozco-terWengel et al. 2011; FreeBayes: http://bioinformatics.bc.edu/marthlab/FreeBayes). Therefore, I developed the SNP detection tool SNPHunter, designed specifically to find high quality SNPs within annotated de novo transcriptome assemblies of large pooled population samples. It uses a series of filters to screen for and avoid regions of intense polymorphism likely caused by misassembly (due to alternative splicing, paralogs, or assembly error), while providing detailed population-specific information useful for association or nucleotide diversity studies. Here I introduce SNPHunter and show its utility for ecological studies of non-model organisms. Next, I present the results of a survey of transcriptome polymorphism in Glanville fritillary butterflies from the Aland Islands (M. cinxia assembly v1.0). Using the results of the survey, I also identify a promising candidate gene, juvenile hormone acid methyltransferase (jhamt), for the investigation of the effects of balancing selection due to metapopulation dynamics in M. cinxia. Finally, I present a preliminary characterization of the polymorphism present in M. cinxia jhamt and make suggestions for future studies.
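For reference, the basic, uncorrected form of Watterson's estimator can be sketched as follows. This is the textbook per-site estimator only, not the bias-corrected model developed in this chapter; the function and parameter names are illustrative, and treating the number of sampled sequences as known is an assumption that pooled transcriptome data generally does not satisfy.

```python
def harmonic_number(n_sequences: int) -> float:
    """Return a_n = sum_{i=1}^{n-1} 1/i, the denominator of Watterson's estimator."""
    if n_sequences < 2:
        raise ValueError("need at least two sampled sequences")
    return sum(1.0 / i for i in range(1, n_sequences))


def watterson_theta(segregating_sites: int, n_sequences: int, length_bp: int) -> float:
    """Per-site Watterson theta: S / (a_n * L).

    segregating_sites: number of SNPs (S) in the aligned region
    n_sequences:       number of sampled sequences (assumed known here)
    length_bp:         length (L) of the aligned region in base pairs
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    return segregating_sites / (harmonic_number(n_sequences) * length_bp)
```

In a pooled assembly the effective number of sampled chromosomes at a site is not known exactly, which is one reason the corrected models discussed in the Methods are needed.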

Methods

SNPHunter

SNPHunter is a tool written in Perl designed to parse annotated de novo transcriptome assemblies of 454 reads from multi-individual pooled samples for high quality SNPs. Although several high quality SNP-detecting programs have been published with various useful features (e.g. QualitySNP; FreeBayes), none of them currently combine the ability to screen for error and non-SNP polymorphism specific to de novo 454 (and other high-throughput platform) assemblies with the ability to report population-specific SNP information. SNPHunter uses a series of customizable filters to screen out assembly alignment regions suspected of high levels of sequencing error or non-SNP polymorphism such as alternative splicing or gene duplication. Base call error is screened through a combination of setting a minimum quality score threshold and reporting the combined error probability and individual allele error probabilities at all candidate SNP sites. The combined error probability is the probability of at least one base call at the SNP site being an error, modified by the probability of a proportionally sized region of the local assembly alignment being misassembled, while an individual allele error probability is simply the probability of all the nucleotides for an allele being miscalls. The local alignment error probability is the average alignment error probability over a consensus region whose length is determined by the depth, up to a maximum depth of 10x (e.g. the typical minimum SNP site depth of 4x would result in averaging a length of thirteen alignment quality scores surrounding the SNP site, while a depth of 10x would result in one alignment quality score being used and a depth of 11x would result in none being used). This scaling of the alignment quality score sample size reflects the fact that de novo assembler alignment quality scores typically only reflect the presence or absence of likely sequencing errors (e.g. homopolymer runs in 454 reads) or low complexity or repeat regions in the transcript sequence, although this varies between assemblers (e.g. Newbler 2.5 versus Mira). At deeper coverages (generally > 10x) these types of alignment difficulties (usually far less common, lengthy, or problematic in transcriptomes than in genomes in the first place) are often resolved in de novo transcriptome assemblies using the longer reads of the 454 FLX platform. Once calculated, the average alignment error probability (if any) is compared to the probability of at least one miscall at the SNP site based on the read nucleotide quality scores alone and, if greater, is reported in place of that value.
This combined error probability thus represents the chance of at least one error at the SNP site using two distinct methods at relatively low coverages, reporting the method denoting the highest probability of an error. Chimeric contigs, usually caused by misassembly of alternative splice forms or gene paralogs, typically have less of an effect on alignment quality scores than base call errors and low complexity regions and thus require additional filtering steps when present in the alignment. SNPHunter accomplishes this through two additional filters: pattern recognition of areas with very dense clusters of underlying polymorphism, and trimming of polymorphic nucleotides from the ends of assembled reads. These filters are adjustable at runtime to account for various levels of expected polymorphism, such as when scanning pooled versus un-pooled transcriptome assemblies (see SNPHunter documentation for details). In reality, of course, regions of combined alternative splicing or closely related paralogs are sometimes indistinguishable from dense regions of true allelic variation. Although methods have been developed for other SNP finding tools (Tang, Vosman et al. 2006) purporting to statistically distinguish between true allelic variation and paralogous misassembly, these methods are not infallible (e.g. amounts of polymorphism vary substantially between species and are found in a power-law-like distribution in the genome [and possibly the transcriptome, see figure 6]) and are designed with un-pooled sequencing projects in mind, where the number of possible alleles per gene is known and limited. This method also does nothing to distinguish between high rates of gene duplication and alternative splicing. Other methods (e.g. FreeBayes; PoPoolation: Kofler, Orozco-terWengel et al. 2011) will handle pooled transcript assemblies but are either designed for genome scans or require mapping to a reference genome to determine the expected number of paralogs. In any case, regions of likely misassembly are commonly associated with very dense regions of polymorphism in de novo assemblies, and even in cases where such regions represent real SNPs, they do not provide particularly useful genetic markers for most ecological association studies, as these require more isolated SNPs capable of being amplified for genotyping (difficult at best in highly polymorphic gene regions). In either case, filtering out the suspect regions entirely or reporting the SNPs with warning labels (in which case the relevant contigs can be further examined or re-assembled as needed) is often the best option. Nucleotide diversity studies also require differentiation between types of polymorphism in order to accurately estimate levels of genetic diversity (i.e. rates could be inflated if the estimate includes large amounts of non-SNP polymorphism). For such studies, the ability to differentiate between allelic and paralogous polymorphism would be more useful. However, diversity studies require sampling from a large number of individuals to get an accurate measure of diversity within the given population (Futschik and Schloetterer 2010, e.g. Hohenlohe, Bassham et al.
2010), and re-sequencing a large number of individuals independently is costly and still out of reach of most ecological studies, even those few with genomic resources. In either case (i.e. re-sequencing or de novo transcriptome), pooled sampling still offers the best design for most labs, with the condition that its limitations are observed and the polymorphism filtering stringency is set carefully to avoid the most likely (and by extension detectable) regions of non-SNP polymorphism. A final note is that filtering assembly regions of unusually high, and thus suspect, SNP density has the added benefit of allowing for a more conservative estimate of nucleotide diversity.
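The combined error probability described earlier in this section can be illustrated with a short sketch. This is not SNPHunter's Perl source; it is a minimal Python reading of the description, and several points are assumptions: quality scores are taken to be Phred-scaled, the window rule `21 - 2 * depth` is simply the linear rule consistent with the three examples given in the text (4x gives 13 scores, 10x gives 1, 11x gives none), and alignment quality scores are averaged as probabilities rather than as raw scores. All function names are hypothetical.

```python
from math import prod


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability (assumed scale)."""
    return 10.0 ** (-q / 10.0)


def site_miscall_prob(read_quals):
    """Probability that at least one base call at the SNP site is a miscall."""
    return 1.0 - prod(1.0 - phred_to_prob(q) for q in read_quals)


def allele_error_prob(allele_quals):
    """Probability that every nucleotide supporting one allele is a miscall."""
    return prod(phred_to_prob(q) for q in allele_quals)


def window_length(depth: int) -> int:
    """Number of consensus alignment scores to average (assumed linear rule)."""
    return max(0, 21 - 2 * depth)


def combined_error_prob(read_quals, consensus_quals, site_index: int) -> float:
    """Report the larger of the site miscall and local alignment error probabilities."""
    site_p = site_miscall_prob(read_quals)
    width = window_length(len(read_quals))
    if width == 0:
        return site_p  # deep coverage: alignment scores are not consulted
    half = width // 2  # width is always odd, so the window is centered on the site
    lo = max(0, site_index - half)
    hi = min(len(consensus_quals), site_index + half + 1)
    window = consensus_quals[lo:hi]
    if not window:
        return site_p
    align_p = sum(phred_to_prob(q) for q in window) / len(window)
    return max(site_p, align_p)
```

Taking the maximum of the two probabilities mirrors the statement that the method denoting the highest probability of an error is the one reported.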

M. cinxia 454 reads

Sampling, cDNA preparation and normalization, sequencing, assembly, and annotation of all M. cinxia 454 GS20 and FLX reads were originally reported in previous studies (Vera, Wheat et al. 2008; Wheat, Fescemyer et al. 2011) and are further described in detail in chapter two of this dissertation. The cDNA sequenced using the 454 FLX instrument came from only a few individuals from Aland, France, and China. Those reads were assembled together with the original Aland 1.0 reads to obtain the M. cinxia 2.0 assembly. Here I use the 2.0 assembly contigs to provide the best available gene models, but do not use the 454 FLX reads to analyze polymorphisms, as the goal here is to examine SNPs specific to Aland in a sample representative of allele frequencies in the metapopulation.

Genome-wide scan of polymorphism in a population of M. cinxia

In order to determine rates of relative polymorphism in the original v1.0 GS20 Aland island reads (within the improved framework of the v2.0 assembly), while limiting the chances for false SNP calls due to sequencing errors (e.g. homopolymer runs), alternative splicing, and paralogs, the M. cinxia v2.0 assembly was scanned by SNPHunter using all standard filters (see SNPHunter section), plus the annotation filter to limit results to annotated regions of the assembly (i.e. regions with strong Blast alignment to a reference ortholog, either B. mori or Uniprot protein). After the initial scan, the resulting SNPs were additionally filtered to remove all unannotated, non-metazoan sequences, as well as sequences without underlying coverage by v1.0 Aland island GS20 reads. Although a minimum of two minority allele nucleotides was required for a SNP call to take place, at very high coverage depths the base miscall error rate can become high enough for false positive SNP calls to become a problem. In order to compensate for the increasing probability of miscalls at higher SNP site coverage depths (> ~20x) and to attain more conservative SNP density estimates in general, the number of SNPs per contig was adjusted as follows. The expected number of miscall errors for each SNP site was calculated based on the underlying assembled nucleotide quality scores (maxExpE = 10^[-(avgSNPQ - 2 * SD)] * SNPdepth, where avgSNPQ = the average SNP site quality score, SD = the standard deviation of avgSNPQ, and SNPdepth = the coverage depth at the SNP site), and this number was subtracted from the number of minority allele nucleotides constituting the SNP. If this subtraction reduced the number of minority alleles to zero, then the SNP was eliminated from the total number of SNPs for that contig for the purpose of estimating relative SNP density (Tables 8 and 10). After filtering, the data set was partitioned by removal of all contigs with lengths less than 75 bp or average coverage depths less than 5x. This particular partitioning of the data was guided partly by the peaks observed in the distribution of contig coverage depths by contig lengths (figure 5). The central tendency of contig length is 417 bp in the raw assembly and 251 bp in the filtered assembly, while for the reference B. mori predicted genes the tendency is 871 bp (i.e. an estimate of the expected, fully assembled average length in M. cinxia). However, the regression of contig average depth on contig length has a peak near the center of the distribution, which suggests that the average behavior during assembly is for contigs to increase in length proportionally to depth until full transcript lengths are reached, after which only additional depth is added. This peak is at roughly 1,000 bp in the raw, unfiltered assembly (figure 5c), similar to the expected central tendency of B. mori (871 bp with only ORF regions), and occurs at roughly 300 bp in the filtered assembly set (figure 5a). The peak occurs at roughly 250 bp for the filtered set of Aland v1.0 population specific contig depths and lengths (figure 5b). Based on the distribution in figure 5a, the best length cutoff would likely be between 150 and 250 bp in the filtered assembly. However, it was deemed that such a cutoff would exclude too many sequences, and corrected theta was modeled on lower cutoff values first. Another purpose of the length partition was to exclude most of the large number of short but deeply covered sequences found in M.
cinxia v2.0 assembly (figure 5c). This “pile-up” of very short but deeply covered sequences was mostly under 75-100 bp long, suggesting the minimum contig cutoff length of 75 bp. The minimum average coverage depth of 5x was chosen as a balance between surveying as many assembly sequences as possible and maintaining quality of alignment while minimizing assembly error. Initially, several models for estimating sequence polymorphism based on theta with correction for bias were applied to the assembly data. I found that the Log (theta+1\length) model corrected the depth bias and minimized the length bias, both when surveying all SNPs and when surveying only non-synonymous SNPs, better than other models within the filtered and partitioned assembly, while allowing the most contig sequence to be scanned. The model values were calculated based on population-specific output for v1.0 Aland island GS20 read-containing portions of the v2.0 assembly generated by SNPHunter (i.e. number of Aland v1.0 specific SNPs and NS SNPs, sequence lengths, and sequence coverage depths) (Tables 8 and 10). After calculation of the modified theta values, the list of sequences was sorted to find contigs with the highest polymorphism rates adjusted for bias.
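The miscall adjustment and the length/depth partition described in this section can be sketched as below. This is an illustrative Python reading, not the code actually used. Two points are assumptions and are flagged in the code: the quality scores are treated as Phred-scaled (hence a division by 10 in the exponent, which the formula as printed in the text does not show; set `phred_scaled=False` to follow the printed formula literally), and a SNP is dropped when the adjusted minority count is zero or less. All names are hypothetical.

```python
def max_expected_miscalls(avg_q: float, sd_q: float, depth: int,
                          phred_scaled: bool = True) -> float:
    """maxExpE: expected miscalls at a SNP site, using a pessimistic quality score.

    The quality score is lowered by two standard deviations before conversion.
    ASSUMPTION: Phred scaling (exponent divided by 10) unless phred_scaled=False.
    """
    pessimistic_q = avg_q - 2.0 * sd_q
    exponent = -pessimistic_q / 10.0 if phred_scaled else -pessimistic_q
    return (10.0 ** exponent) * depth


def adjusted_minor_count(minor_count: int, avg_q: float, sd_q: float, depth: int) -> float:
    """Minority allele nucleotides remaining after subtracting expected miscalls."""
    return max(0.0, minor_count - max_expected_miscalls(avg_q, sd_q, depth))


def keep_snp(minor_count: int, avg_q: float, sd_q: float, depth: int) -> bool:
    """ASSUMPTION: a SNP is eliminated when the adjusted count reaches zero."""
    return adjusted_minor_count(minor_count, avg_q, sd_q, depth) > 0.0


def passes_partition(length_bp: int, mean_depth: float,
                     min_length: int = 75, min_depth: float = 5.0) -> bool:
    """Keep contigs that meet both the 75 bp and 5x minimums given in the text."""
    return length_bp >= min_length and mean_depth >= min_depth
```

Under these assumptions, a site with two minority reads survives at moderate depth and good quality but is discarded once depth and error rate make roughly two miscalls expected by chance.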

Partial Characterization of JHaMT

Jhamt contig polymorphism was characterized partly by computational methods and partly by molecular methods. To assist in characterizing the structure, jhamt reads were re-assembled and the resulting contigs were clustered into similar groups, from which degenerate primers were designed by hand. PCR was performed on two individual butterflies using both thorax and abdomen tissue. Preparation of the tissues was as reported previously (Vera et al., 2008). Both genomic DNA and purified cDNA were amplified from both individuals using GoTaq Hot Start reagents. After amplification, samples were cloned using the pGEM-T Easy Vector TA cloning kit from Promega (http://www.promega.com/), with standard methods for colony screening. After colony selection, vectors were amplified using standard M13 vector primers and then Sanger sequenced at the PSU Genomics Core Facility.

Results

M. cinxia 454 read sequencing, assembly, annotation

Results of sequencing, assembly, and annotation of all M. cinxia 454 reads are reported in previous papers (Vera, Wheat et al. 2008; Wheat, Fescemyer et al. 2011) and further detailed in chapter two of this dissertation.

SNPs in M. cinxia

After the initial scan of the M. cinxia v2.0 assembly using standard SNPHunter filters (Table 12), the initial number of SNPs found was 99,434, for a SNP rate of 4.99 SNPs/Kbp (Table 13). The number of SNPs found in the assembly dropped to 75,864 in the portion of the assembly with at least 5x average coverage depth and to 41,471 SNPs at 10x average coverage, but the SNP rates increased to 13.24 SNPs/Kbp and 17.7 SNPs/Kbp, respectively. After the second set of filters was applied for the SNP density survey, 35,001 putative SNPs were found distributed among the 58,124 contigs (6.93 SNPs/Kbp). For the final, partitioned assembly, 12,370 putative SNPs remained in 1,203 contigs (18.16 SNPs/Kbp), demonstrating a substantial increase of the SNP rate in the portion of the assembly used for SNP density analysis. This is to be expected given the exclusion of low coverage, short contigs without enough sample coverage to display polymorphism. However, the overall distribution of the SNPs among the contigs (figure 6a) remains similar after the partitioning, although somewhat more biased towards higher SNP densities (figure 6b).
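The SNP rates quoted above are simple densities, and the relationship between SNP count, rate, and scanned sequence length can be sketched as follows. The helper names are hypothetical, and any scanned length back-calculated this way is only implied by the reported counts and rates; it is not a value reported in this chapter.

```python
def snps_per_kbp(n_snps: int, scanned_bp: float) -> float:
    """SNP density expressed as SNPs per 1,000 bp of scanned sequence."""
    if scanned_bp <= 0:
        raise ValueError("scanned_bp must be positive")
    return 1000.0 * n_snps / scanned_bp


def implied_scanned_kbp(n_snps: int, rate_per_kbp: float) -> float:
    """Scanned length (in Kbp) implied by a SNP count and a reported SNP rate."""
    if rate_per_kbp <= 0:
        raise ValueError("rate_per_kbp must be positive")
    return n_snps / rate_per_kbp


# Example: length implied by the final partitioned set (12,370 SNPs at 18.16 SNPs/Kbp).
partitioned_kbp = implied_scanned_kbp(12370, 18.16)
```

Comparing implied lengths across filtering stages makes it clear that the rising SNP rate reflects a shrinking, better-covered denominator rather than newly discovered SNPs.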

Genome-wide scan of polymorphism in a population of M. cinxia

After theta estimates (figure 9b) were calculated for contig sections containing Aland population (assembly v1.0) reads in the filtered and partitioned M. cinxia v2.0 assembly (see Methods) and ranked by highest theta estimate value, several annotated contigs were noted as having unusually high levels of polymorphism (Tables 8 and 9). Of these, juvenile hormone acid methyltransferase (jhamt) (Contig_4275; see table 7) stood out as particularly interesting. Previous work (Wheat, Fescemyer et al. 2011) showed an elevated titer of juvenile hormone in 0-3 day old new population female butterflies that was associated with increased egg production. Because JHAMT is the last enzymatic step in the metabolic pathway for production of JH (figure 10), it stands out as a good candidate for a gene undergoing balancing selection on life history phenotypes due to metapopulation dynamics. The unbiased theta estimates were also calculated using only non-synonymous (NS) SNPs with the same methods, which resulted in a similar table of M. cinxia genes, with the exception that jhamt dropped slightly in rank from 6th to 10th in terms of relative levels of polymorphism (Tables 10 and 11). It was also interesting to note that the juvenile hormone binding protein gene (jhbp) was ranked 22nd in overall polymorphism and 9th for non-synonymous SNPs (slightly better than jhamt). JHBP plays a key role in transporting JH in the hemolymph and protecting it from random enzymatic degradation, and is in turn also regulated by JHIII (Niewiadomska-Cimicka, 2011).

Partial Characterization of Juvenile Hormone acid Methyl Transferase in M. cinxia

To further characterize the jhamt polymorphism, the M. cinxia assembly version 2.0 was searched for additional putative jhamt sequences. Nine contigs and four singletons were identified as having strong top Blast alignments (NCBI Blastx; bitscore > 45) to JHaMT orthologs in the Uniprot protein database (Table 7). Six predicted B. mori reference proteins (see chapter one) were also found to have strong top Blast alignments to putative JHaMT orthologs in Uniprot. Out of the M. cinxia unigenes, all but two had strong reciprocal top Blast alignment to two of the B. mori predicted proteins (i.e. BGIBMGA010391 and BGIBMGA010563), which suggests the presence of at least two M. cinxia jhamt paralogs. All associated M. cinxia jhamt reads were then re-assembled at higher stringency (minimum identity = 90%), resulting in 18 new contigs and nine new singletons, of which eight spanned a portion of the jhamt open reading frame. In an attempt to confirm the number of M. cinxia jhamt paralogs (as opposed to alternative splice-forms or allelic variation), the unigenes were first clustered and aligned into four major groups based on alignment to available orthologous transcripts (B. mori mRNA and predicted protein, Helicoverpa armigera mRNA, and Spodoptera litura mRNA sequences). Next, a region of open reading frame near the five prime end of the jhamt molecule was chosen for additional sub-cloning and DNA amplification from two individual butterflies. Separate samples were taken from both thorax and abdominal tissues from both individuals, and both purified cDNA and genomic DNA were prepared for use as template for all samples (see Methods). The region of interest was chosen because it spanned a large indel (in group cluster one) and had small regions of relatively low polymorphism (see figure 1) suitable for primer design.
After alignment and comparison of the resulting 37 clones, the genomic and transcript sequences corresponded exactly in size, suggesting no splicing boundaries were present in the region (i.e. the primers fell within a single exon). For one of the butterflies (sample A), strong evidence for at least four haplotypes of jhamt was observed, with two or more clones aligning within jhamt group clusters two, three, and four, plus a single internal polymorphism (A/G) evenly divided between the four clones representing group three. In addition, there was a single sample ‘A’ clone that aligned to group one. Because these results show more than two group haplotypes within a single individual, jhamt groups are unlikely to represent alternative alleles from a single locus. Rather, this strongly suggests the presence of at least two jhamt paralogs in M. cinxia, with moderately good evidence for there being at least three. However, because the M. cinxia assembly unigenes naturally group into four broad categories of sequences that have good alignment to jhamt, it seems probable that there are four jhamt paralogs. On the other hand, two of the jhamt group clusters contained the great majority of the 454 and cloned, Sanger-sequenced reads, and within these two clusters there was a large amount of remaining polymorphism (see figure 11).
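The paralog inference above rests on simple counting: a diploid individual can carry at most two haplotypes per locus, so k distinct haplotypes recovered from one individual require at least ceil(k/2) loci. A minimal sketch of that arithmetic follows; the function name and the haplotype labels are illustrative shorthand for the sample A clones described above, not identifiers from the study.

```python
import math

def min_loci(haplotypes, ploidy=2):
    """Smallest number of loci able to hold the distinct haplotypes of one individual."""
    return math.ceil(len(set(haplotypes)) / ploidy)

# Haplotypes supported by two or more clones in sample A:
# groups two and four, plus the two A/G variants within group three.
well_supported = ["group2", "group3_A", "group3_G", "group4"]

# Adding the single clone that aligned to group one.
with_single_clone = well_supported + ["group1"]

print(min_loci(well_supported))     # 2
print(min_loci(with_single_clone))  # 3
```

The two results correspond to the "at least two" and "at least three" paralog statements in the text; the weaker support for the second figure reflects its dependence on a single clone.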

Conclusions

In this chapter, I have taken the next step forward in the use of functional de novo transcriptomics to identify polymorphism in gene coding regions of ecologically interesting organisms lacking costly genomic resources. This process first involved developing a set of tools (e.g. SNPHunter, PipeMeta) for preparing and scanning an entire de novo transcriptome assembly (i.e. set of all expressed transcripts) for a common type of genetic polymorphism (i.e. SNPs). This is in stark contrast to the recent past, when discoveries of a genetic and evolutionary nature were limited to a gene-by-gene approach. The second objective of this work was to demonstrate the effectiveness of the tool set by applying it to an important ecological system, the maintenance of adaptive variation by balancing selection driven by metapopulation dynamics. In a SNP-density study, the M. cinxia transcriptome was scanned to find genes with unusually high levels of polymorphism that may underlie phenotypes that are targets of balancing selection. After eliminating or minimizing various forms of bias common to measurements of heterozygosity and nucleotide diversity in a pooled sample de novo transcriptome assembly, a modified theta estimation was calculated for a filtered and partitioned set of contigs and used to rank the data by levels of relative polymorphism. Interestingly, the metabolic enzyme gene juvenile hormone acid methyltransferase, whose product catalyzes the final enzymatic step in the JH synthesis pathway, was found near the top of the lists for both overall polymorphism and for non-synonymous SNPs. This gene was examined further because it is a candidate for allelic variation controlling polymorphism in physiological and life history traits in M. cinxia (Wheat et al., 2011), including the higher JH (a regulator of egg production) titers in female butterflies from newly established populations. To determine the nature of the jhamt polymorphism (i.e. whether it be due to allelic or non-allelic variation such as paralogous sequence or alternative splicing being incorrectly assembled), several computational and molecular methods were used, including amplification and cloning of cDNA and genomic DNA from multiple individuals. Unfortunately, the high levels of polymorphism made it a challenge to disentangle the polymorphism entirely (e.g. primer design was difficult, and a palindromic region exposed by an indel was resistant to sequencing), although there was strong evidence for a minimum of two and as many as four jhamt paralogs in Aland Island M. cinxia. On the other hand, two of the four jhamt sequence group clusters contained a great deal of potential allelic variation (the other groups had far fewer sequences). In a follow-up study, this polymorphism could be further untangled using additional molecular methods, including RACE and additional cloning to gain larger tracts of monomorphic sequence on which to base further genotyping or PCR amplification. In conclusion, the day may soon come when automated genome sequencing will be a common component of everyday life at the majority of molecular biology labs, much like the ubiquitous PCR machine today. Sophisticated and widely adopted commercial software will undoubtedly accompany such advances in instrument availability. Until then, investigators requiring a relatively quick and affordable method for gaining genomic insights into ecological and evolutionary questions can use the de novo transcriptome sequencing, assembly, quality analysis, and SNP scanning tools and methods presented here to begin the exciting process of finding answers.

References

Agarwal, M., N. Shrivastava, et al. (2008). "Advances in molecular marker techniques and their applications in plant sciences." Plant Cell Reports 27(4): 617.
Beaumont, M. A. and D. J. Balding (2004). "Identifying adaptive genetic divergence among populations from genome scans." Molecular Ecology 13(4): 969-980.
Bell, G. (2009). "Fluctuating selection: the perpetual renewal of adaptation in variable environments." Philosophical Transactions of the Royal Society B: Biological Sciences 365(1537): 87-97.
Blavet, N., D. Charif, et al. (2011). "Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database." BMC Genomics 12(1): 376.
Castle, J. C. (2011). "SNPs Occur in Regions with Less Genomic Sequence Conservation." PLoS ONE 6(6): e20660.
Duran, C., D. Edwards, et al. (2009). Molecular Marker Discovery and Genetic Map Visualisation. Bioinformatics: Tools And Applications. New York, Springer: 165-189.
Edwards, T. L., W. K. Scott, et al. (2010). "Genome-Wide Association Study Confirms SNPs in SNCA and the MAPT Region as Common Risk Factors for Parkinson Disease." Annals Of Human Genetics 74: 97-109.
Fu, Y. X. and W. H. Li (1993). "Statistical Tests of Neutrality of Mutations." Genetics 133(3): 693-709.
Futschik, A. and C. Schloetterer (2010). "The Next Generation of Molecular Markers From Massively Parallel Sequencing of Pooled DNA Samples." Genetics 186(1): 207-218.
Guo, S., Y. Zheng, et al. (2010). "Transcriptome sequencing and comparative analysis of cucumber flowers with different sex types." BMC Genomics 11(1): 384.
Haag, C. R., M. Saastamoinen, et al. (2005). "A candidate locus for variation in dispersal rate in a butterfly metapopulation." Proceedings of the Royal Society B: Biological Sciences 272(1580): 2449-2456.
Hanski, I. (1999). "Habitat connectivity, habitat continuity, and metapopulations in dynamic landscapes." OIKOS 87: 209-219.
Hanski, I. and O. Ovaskainen (2000). "The metapopulation capacity of a fragmented landscape." Nature 404(6779): 755.
Hedrick, P. W., M. E. Ginevan, et al. (1976). "Genetic-Polymorphism In Heterogeneous Environments." Annual Review Of Ecology And Systematics 7: 1-32.
Hohenlohe, P. A., S. Bassham, et al. (2010). "Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags." PLoS Genet 6(2): e1000862.
Holderegger, R., D. Herrmann, et al. (2008). "Land ahead: using genome scans to identify molecular markers of adaptive relevance." Plant Ecology & Diversity 1(2): 273-283.
Kim, S. Y., Y. Li, et al. (2010). "Design of association studies with pooled or un-pooled next-generation sequencing data." Genetic Epidemiology 34(5): 479.
Klemme, I. and I. Hanski (2009). "Heritability of and strong single gene (Pgi) effects on life-history traits in the Glanville fritillary butterfly." Journal of Evolutionary Biology 22(9): 1944.
Kofler, R., P. Orozco-terWengel, et al. (2011). "PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals." PLoS ONE 6(1).
Manel, S., S. Joost, et al. (2010). "Perspectives on the use of landscape genetics to detect genetic adaptive variation in the field." Molecular Ecology 19(17): 3760-3772.
Meyer, E., G. V. Aglyamova, et al. (2009). "Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx." BMC Genomics 10.
Morin, C., C. Breuil, et al. (2004). "Genetic variability and structure of Canadian populations of the sapstain fungus Ceratocystis resinifera." Phytopathology 94(12): 1323-1330.
Nieminen, M., M. Siljander, et al. (2004). On the wings of checkerspots: a model system for population biology. Oxford, Oxford University Press.
Niitepõld, K., A. D. Smith, et al. (2009). "Flight metabolic rate and Pgi genotype influence butterfly dispersal rate in the field." Ecology 90(8): 2223.
Oleksyk, T. K., M. W. Smith, et al. (2010). "Genome-wide scans for footprints of natural selection." Philosophical Transactions of the Royal Society B: Biological Sciences 365(1537): 185-205.
Orsini, L., C. W. Wheat, et al. (2009). "Fitness differences associated with Pgi SNP genotypes in the Glanville fritillary butterfly (Melitaea cinxia)." Journal of Evolutionary Biology 22(2): 367.
Ovaskainen, O. and I. Hanski (2004). "From individual behavior to metapopulation dynamics: unifying the patchy population and classic metapopulation models." Am. Nat. 164: 364-377.
Parchman, T., K. Geist, et al. (2010). "Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery." BMC Genomics 11(1): 180.
Prunier, J., J. Laroche, et al. (2011). "Scanning the genome for gene SNPs related to climate adaptation and estimating selection at the molecular level in boreal black spruce." Molecular Ecology 20(8): 1702-1716.
Ramaswamy, S. B., S. Q. Shu, et al. (1997). "Dynamics of juvenile hormone-mediated gonadotropism in the lepidoptera." Archives Of Insect Biochemistry And Physiology 35(4): 539-558.
Renaut, S., A. W. Nolte, et al. (2010). "Mining transcriptome sequences towards identifying adaptive single nucleotide polymorphisms in lake whitefish species pairs (Coregonus spp. Salmonidae)." Molecular Ecology 19: 115-131.
Saastamoinen, M. (2007). "Life-history, genotypic, and environmental correlates of clutch size in the Glanville fritillary butterfly." Ecological Entomology 32(2): 235.
Saastamoinen, M., S. Ikonen, et al. (2009). "Significant effects of Pgi genotype and body reserves on lifespan in the Glanville fritillary butterfly." Proceedings of the Royal Society B: Biological Sciences 276(1660): 1313-1322.
Seeb, J. E., G. Carvalho, et al. (2011). "Single-nucleotide polymorphism (SNP) discovery and applications of SNP genotyping in nonmodel organisms." Molecular Ecology Resources 11: 1.
Sloan, D. B., S. R. Keller, et al. (2011). "De novo transcriptome assembly and polymorphism detection in the flowering plant Silene vulgaris (Caryophyllaceae)." Molecular Ecology Resources: no.
Storz, J. F. (2005). "Using genome scans of DNA polymorphism to infer adaptive population divergence." Molecular Ecology 14(3): 671-688.
Storz, J. F. and C. W. Wheat (2010). "Integrating Evolutionary and Functional Approaches to Infer Adaptation at Specific Loci." Evolution 64(9): 2489.
Tajima, F. (1989). "Statistical Method for Testing the Neutral Mutation Hypothesis by DNA Polymorphism." Genetics 123(3): 585-595.
Tang, J. F., B. Vosman, et al. (2006). "QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species." BMC Bioinformatics 7.
Varela, M. A. and W. Amos (2010). "Heterogeneous distribution of SNPs in the human genome: Microsatellites as predictors of nucleotide diversity and divergence." Genomics 95(3): 151.
Vera, J. C., C. W. Wheat, et al. (2008). "Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing." Molecular Ecology 17(7): 1636.
Vercruysse, L., D. Gelman, et al. (2004). "The angiotensin converting enzyme inhibitor captopril reduces oviposition and ecdysteroid levels in Lepidoptera." Archives Of Insect Biochemistry And Physiology 57(3): 123-132.
Watt, W. B. (2003). Ecology and evolution taking flight: butterflies as model systems. Chicago, Univ. Chicago Press.
Watterson, G. A. (1975). "Number Of Segregating Sites In Genetic Models Without Recombination." Theoretical Population Biology 7(2): 256-276.
Wheat, C. W., H. W. Fescemyer, et al. (2011). "Functional genomics of life history variation in a butterfly metapopulation." Molecular Ecology 20(9): 1813.
Wheat, C. W., C. R. Haag, et al. (2010). "Nucleotide Polymorphism at a Gene (Pgi) under Balancing Selection in a Butterfly Metapopulation." Molecular Biology and Evolution 27(2): 267-281.
Wheat, C. W., W. B. Watt, et al. (2006). "From DNA to Fitness Differences: Sequences and Structures of Adaptive Variants of Colias Phosphoglucose Isomerase (PGI)." Molecular Biology and Evolution 23(3): 499-512.
Yu, N., M. I. Jensen-Seaman, et al. (2003). "Low nucleotide diversity in chimpanzees and bonobos." Genetics 164(4): 1511-1518.
Zhang, W., J. Chen, et al. (2011). "A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies." PLoS ONE 6(3): e17915.


Table 4-1. Putative juvenile hormone acid methyltransferase M. cinxia unigenes in assembly v.2.0 (* alignment to unigenes was weak, i.e. < 45 bitscore). Evalues and bitscores refer to alignments between M. cinxia and Uniprot database. Unigene versus B. mori and B. mori versus Uniprot Blast alignments are all in agreement.

M. cinxia assembly v2.0 unigene | Modified B. mori predicted protein | E-value | Bitscore | Uniprot description | Species | FlyBase gene ID
Contig_17361 | BGIBMGA010391 | 2.00E-35 | 151 | Juvenile hormone acid methyltransferase | Helicoverpa armigera (Cotton bollworm) (Heliothis armigera) | CG17330
Contig_20616 | BGIBMGA010391 | 5.00E-10 | 69.3 | Juvenile hormone acid methyltransferase | Bombyx mori (Silk moth) | CG17330
Contig_3784 | BGIBMGA010391 | 6.00E-09 | 58.9 | Juvenile hormone acid methyltransferase | Helicoverpa armigera (Cotton bollworm) (Heliothis armigera) | CG17330
Contig_41234 | BGIBMGA010563 | 1.00E-17 | 92.4 | Juvenile hormone acid methyltransferase | Helicoverpa armigera (Cotton bollworm) (Heliothis armigera) | CG17330
Contig_4275 | BGIBMGA010391 | 6.00E-80 | 302 | Juvenile hormone acid methyltransferase | Helicoverpa armigera (Cotton bollworm) (Heliothis armigera) | CG17330
Contig_52365 | BGIBMGA010391 | 3.00E-40 | 169 | Juvenile hormone acid methyltransferase | Bombyx mori (Silk moth) | CG17330
Contig_52822 | BGIBMGA010391* | 1.00E-10 | 68.9 | Juvenile hormone acid methyltransferase | Bombyx mori (Silk moth) | n/a
Contig_53103 | BGIBMGA010391 | 8.00E-37 | 158 | Juvenile hormone acid methyltransferase | Helicoverpa armigera (Cotton bollworm) (Heliothis armigera) | CG17330
Contig_5858 | BGIBMGA010391 | 8.00E-10 | 66.6 | Juvenile hormone acid methyltransferase | Helicoverpa armigera (Cotton bollworm) (Heliothis armigera) | CG17330
FCPIGOB01A7UWM_A | BGIBMGA010391 | 5.00E-11 | 70.5 | Juvenile hormone acid methyltransferase | Helicoverpa armigera (Cotton bollworm) (Heliothis armigera) | CG17330
FCPIGOB01E36KO_A | BGIBMGA010391* | 5E-07 | 46.2 | Juvenile hormone acid methyltransferase | Samia cynthia ricini (Indian eri silkmoth) | n/a
FCPIGOB02G88M8_F | BGIBMGA010391 | 4.00E-14 | 80.9 | Juvenile hormone acid methyltransferase | Helicoverpa armigera (Cotton bollworm) (Heliothis armigera) | CG17330
FCPIGOB02IFTW2_F | BGIBMGA010391 | 0.0006 | 47 | Juvenile hormone acid methyltransferase | Bombyx mori (Silk moth) | CG17330


Table 4-2. Top 25 M. cinxia assembly version 2.0 contigs sorted by polymorphism rate. Basic metric values used in the sorting method (see methods) are shown, including the values for the sorting model (i.e. Log(Theta+1/bp)).

M. cinxia Contig | Blast Alignment Length | Longest ORF Length | N SNPs | N v1.0 Aland SNPs | Adjusted Unfiltered Contig Length | Adjusted v1.0 Aland Contig Length | Adjusted v1.0 Aland Contig Average Depth | Log(Theta+1/bp)
Contig_983 | 111 | 198 | 6 | 6 | 87 | 83 | 5.72 | -1.3927388
Contig_15194 | 138 | 153 | 7 | 7 | 123 | 94 | 6 | -1.4286072
Contig_46149 | 147 | 216 | 13 | 11 | 135 | 135 | 8.53 | -1.4649311
Contig_57888 | 399 | 504 | 42 | 35 | 355 | 355 | 13.88 | -1.4857371
Contig_12813 | 357 | 405 | 29 | 29 | 325 | 321 | 14.46 | -1.5318292
Contig_4275 | 819 | 822 | 89 | 63 | 726 | 722 | 12.67 | -1.5323465
Contig_55638 | 381 | 627 | 29 | 23 | 379 | 343 | 7.18 | -1.544249
Contig_12653 | 918 | 978 | 72 | 61 | 860 | 859 | 9.91 | -1.5758281
Contig_2983 | 240 | 126 | 11 | 5 | 225 | 100 | 6.3 | -1.5804181
Contig_56018 | 663 | 726 | 57 | 44 | 635 | 623 | 10.93 | -1.5929036
Contig_7728 | 237 | 552 | 19 | 14 | 219 | 218 | 11.11 | -1.6290799
Contig_54293 | 132 | 69 | 7 | 7 | 132 | 132 | 8.31 | -1.6312625
Contig_15831 | 102 | 144 | 4 | 4 | 88 | 88 | 7.98 | -1.6346788
Contig_5426 | 189 | 243 | 12 | 11 | 181 | 181 | 12.64 | -1.6584866
Contig_6788 | 1023 | 1047 | 80 | 72 | 950 | 950 | 21.53 | -1.6704305
Contig_55279 | 126 | 294 | 11 | 5 | 104 | 104 | 9.6 | -1.6731087
Contig_35873 | 327 | 183 | 14 | 14 | 287 | 279 | 8.75 | -1.6832915
Contig_15056 | 330 | 336 | 17 | 16 | 292 | 291 | 10.47 | -1.6850721
Contig_57328 | 798 | 813 | 40 | 33 | 742 | 736 | 6.52 | -1.6939682
Contig_5283 | 282 | 366 | 14 | 8 | 262 | 185 | 7.68 | -1.7020953
Contig_23359 | 192 | 195 | 9 | 5 | 192 | 124 | 7.87 | -1.7044365
Contig_1046 | 540 | 642 | 26 | 22 | 435 | 434 | 9.53 | -1.7099885
Contig_31251 | 942 | 1017 | 58 | 55 | 836 | 835 | 18 | -1.7100004
Contig_17222 | 579 | 702 | 30 | 26 | 567 | 566 | 7.08 | -1.7106188
Contig_4039 | 288 | 327 | 15 | 14 | 224 | 224 | 18.53 | -1.7106587


Table 4-3. Top 25 M. cinxia assembly version 2.0 contigs sorted by polymorphism rate. Juvenile hormone acid methyltransferase is sixth in overall polymorphism level once length, depth, and population sample biases inherent to a transcriptome have been minimized. All annotations were derived from Blast alignment of B. mori predicted proteins vs. Uniprot DB except for entries marked with **, which are M. cinxia contigs vs. Uniprot.

M. cinxia Contig ID | Flybase ID | B. mori Ortholog | Best Description | Species | Bit Score | Percent Identity
Contig_983 | CG32791 | BGIBMGA006527 | AGAP008655-PA; Flags: Fragment | Anopheles gambiae | 154 | 69.52
Contig_15194 | CG32791 | BGIBMGA006527 | AGAP008655-PA; Flags: Fragment | Anopheles gambiae | 154 | 69.52
Contig_46149 | CG32791 | BGIBMGA006527 | AGAP008655-PA; Flags: Fragment | Anopheles gambiae | 154 | 69.52
Contig_57888 | CG8938 | BGIBMGA009106 | Glutathione S-transferase 2 | Bombyx mori | 436 | 99.51
Contig_12813 | n/a | BGIBMGA014203 | Putative uncharacterized protein | Picea sitchensis | 165 | 34.91
Contig_4275 | CG17330 | BGIBMGA010391 | Juvenile hormone acid methyltransferase | Bombyx mori | 514 | 100
Contig_55638 | CG6519 | BGIBMGA009715 | Chorion class CB protein PC404 | Antheraea polyphemus | 100 | 70.27
Contig_12653 | CG5887 | BGIBMGA009556 | Delta-11 desaturase | Spodoptera littoralis | 503 | 69.38
Contig_2983 | CG5946 | n/a | CG5946, isoform D; EC=1.6.2.2 ** | Drosophila simulans | 79.3 | n/a
Contig_56018 | CG30491 | BGIBMGA013364 | 20-hydroxysteroid dehydrogenase, putative | Ixodes scapularis | 277 | 54.44
Contig_7728 | n/a | BGIBMGA007039 | BCP inhibitor; Flags: Precursor | Bombyx mori | 191 | 100
Contig_54293 | CG32791 | BGIBMGA006527 | AGAP008655-PA; Flags: Fragment | Anopheles gambiae | 154 | 69.52
Contig_15831 | CG34085 | n/a | NADH-ubiquinone oxidoreductase chain 4 ** | Chaetosoma scaritides | 45.4 | n/a
Contig_5426 | n/a | n/a | Diapausin ** | Spodoptera litura | 103 | n/a
Contig_6788 | n/a | BGIBMGA014203 | Putative uncharacterized protein | Picea sitchensis | 165 | 34.91
Contig_55279 | CG2555 | BGIBMGA000324 | Putative cuticle protein; Cuticle protein | Bombyx mori | 199 | 100
Contig_35873 | CG8938 | BGIBMGA009106 | Glutathione S-transferase 2 | Bombyx mori | 436 | 99.51
Contig_15056 | CG5056 | BGIBMGA002356 | AGAP006255-PA | Anopheles gambiae | 68.9 | 90.63
Contig_57328 | CG17374 | BGIBMGA013047 | P270 | Bombyx mori | 2,982 | 99.6
Contig_5283 | n/a | BGIBMGA007039 | BCP inhibitor; Flags: Precursor | Bombyx mori | 191 | 100
Contig_23359 | n/a | BGIBMGA014501 | n/a | n/a | n/a | n/a
Contig_1046 | CG10264 | BGIBMGA011553 | Juvenile hormone binding protein-like protein | Manduca sexta | 246 | 61.03
Contig_31251 | CG9747 | BGIBMGA010614 | GK13143 | Drosophila willistoni | 376 | 56.74
Contig_17222 | CG31039 | BGIBMGA012788 | 30kP protease A (43k peptide) | Bombyx mori | 568 | 100
Contig_4039 | CG9762 | BGIBMGA008959 | Lethal(3)neo18 | Bombyx mori | 401 | 99.47


Table 4-4. Top 25 M. cinxia assembly version 2.0 contigs sorted by non-synonymous (NS) polymorphism rate. Basic metric values used in the sorting method (see methods) are shown, including the values for the sorting model (i.e. Log(Theta+1/bp)).

M. cinxia Contig | Blast Alignment Length | Longest ORF Length | N SNPs | N v1.0 Aland NS SNPs | Adjusted Unfiltered Contig Length | Adjusted v1.0 Aland Contig Length | Adjusted v1.0 Aland Contig Average Depth | Log(Theta+1/bp)
Contig_15194 | 138 | 153 | 7 | 5 | 123 | 94 | 6 | -1.5535459
Contig_983 | 111 | 198 | 6 | 4 | 87 | 83 | 5.72 | -1.5786774
Contig_46149 | 147 | 216 | 13 | 7 | 135 | 135 | 8.53 | -1.6614704
Contig_54293 | 132 | 69 | 7 | 6 | 132 | 132 | 8.31 | -1.6892545
Contig_12813 | 357 | 405 | 29 | 17 | 325 | 321 | 14.46 | -1.7536779
Contig_57888 | 399 | 504 | 42 | 17 | 355 | 355 | 13.88 | -1.7974012
Contig_12653 | 918 | 978 | 72 | 37 | 860 | 859 | 9.91 | -1.8058376
Contig_23359 | 192 | 195 | 9 | 4 | 192 | 124 | 7.87 | -1.8082303
Contig_15056 | 330 | 336 | 17 | 11 | 292 | 291 | 10.47 | -1.8363398
Contig_17222 | 579 | 702 | 30 | 18 | 567 | 566 | 7.08 | -1.8632289
Contig_35873 | 327 | 183 | 14 | 9 | 287 | 279 | 8.75 | -1.8798308
Contig_2983 | 240 | 126 | 11 | 2 | 225 | 100 | 6.3 | -1.8814481
Contig_1046 | 540 | 642 | 26 | 15 | 435 | 434 | 9.53 | -1.8849978
Contig_4275 | 819 | 822 | 89 | 28 | 726 | 722 | 12.67 | -1.8879505
Contig_33118 | 264 | 321 | 14 | 6 | 247 | 230 | 6.9 | -1.9057959
Contig_11525 | 747 | 762 | 32 | 18 | 710 | 647 | 7.36 | -1.9213168
Contig_22954 | 255 | 231 | 6 | 5 | 224 | 223 | 5.6 | -1.9287229
Contig_56018 | 663 | 726 | 57 | 20 | 635 | 623 | 10.93 | -1.9389834
Contig_57328 | 798 | 813 | 40 | 19 | 742 | 736 | 6.52 | -1.9550139
Contig_48753 | 873 | 1539 | 49 | 20 | 861 | 840 | 6.01 | -1.9606293
Contig_56862 | 129 | 162 | 3 | 3 | 124 | 123 | 13.41 | -1.9796564
Contig_5283 | 282 | 366 | 14 | 4 | 262 | 185 | 7.68 | -1.9819803
Contig_6788 | 1023 | 1047 | 80 | 35 | 950 | 950 | 21.53 | -1.9831614
Contig_5558 | 504 | 627 | 21 | 13 | 443 | 443 | 14.52 | -2.0123678
Contig_56973 | 459 | 528 | 25 | 12 | 401 | 399 | 17.16 | -2.0160399


Table 4-5. Top 25 M. cinxia assembly version 2.0 contigs sorted by non-synonymous (NS) polymorphism rate. Juvenile hormone acid methyltransferase is ninth in NS polymorphism level once length, depth, and population sample biases inherent to a transcriptome have been minimized. All annotations were derived from Blast alignment of B. mori predicted proteins vs. Uniprot DB except for entries marked with **, which are M. cinxia contigs vs. Uniprot.

M. cinxia Contig ID | Flybase ID | B. mori Ortholog | Best Description | Species* | Bit Score | Percent Identity
Contig_15194 | CG32791 | BGIBMGA006527 | AGAP008655-PA | Anopheles gambiae | 154 | 69.52
Contig_983 | CG32791 | BGIBMGA006527 | AGAP008655-PA | Anopheles gambiae | 154 | 69.52
Contig_46149 | CG32791 | BGIBMGA006527 | AGAP008655-PA | Anopheles gambiae | 154 | 69.52
Contig_54293 | CG32791 | BGIBMGA006527 | AGAP008655-PA | Anopheles gambiae | 154 | 69.52
Contig_12813 | n/a | BGIBMGA014203 | AGAP008655-PA | Picea sitchensis | 165 | 34.91
Contig_57888 | CG8938 | BGIBMGA009106 | Glutathione S-transferase 2 | Bombyx mori | 436 | 99.51
Contig_12653 | CG5887 | BGIBMGA009556 | Delta-11 desaturase | Spodoptera littoralis | 503 | 69.38
Contig_23359 | n/a | BGIBMGA014501 | n/a | n/a | n/a | n/a
Contig_1046 | CG10264 | BGIBMGA011553 | Juvenile hormone binding protein-like protein | Manduca sexta | 246 | 61.03
Contig_4275 | CG17330 | BGIBMGA010391 | Juvenile hormone acid methyltransferase | Bombyx mori | 514 | 100
Contig_15056 | CG5056 | BGIBMGA002356 | AGAP006255-PA | Anopheles gambiae | 68.9 | 90.63
Contig_17222 | CG31039 | BGIBMGA012788 | 30kP protease A (43k peptide) | Bombyx mori | 568 | 100
Contig_35873 | CG8938 | BGIBMGA009106 | Glutathione S-transferase 2 | Bombyx mori | 436 | 99.51
Contig_2983 | CG5946 | n/a | CG5946, isoform D; EC=1.6.2.2 ** | Drosophila melanogaster | 79.3 | n/a
Contig_33118 | n/a | n/a | ORF ** | Drosophila melanogaster | 76.6 | n/a
Contig_11525 | CG3355 | BGIBMGA003566 | ALKALIPHILIC serine protease P-IIC=TRYPSIN-like protease | Bombyx mori | 448 | 96.12
Contig_22954 | CG3227 | BGIBMGA008968 | GK24494 | Drosophila willistoni | 97.4 | 48.21
Contig_56018 | CG30491 | BGIBMGA013364 | 20-hydroxysteroid dehydrogenase, putative | Ixodes scapularis | 277 | 54.44
Contig_57328 | CG17374 | BGIBMGA013047 | P270 | Bombyx mori | 2982 | 99.6
Contig_48753 | | BGIBMGA004155 | Aldehyde dehydrogenase family protein | Rhodobacterales bacterium HTCC2654 | 53.5 | n/a
Contig_56862 | CG32791 | BGIBMGA006527 | AGAP008655-PA; Flags: Fragment ** | Anopheles gambiae | 154 | 69.52
Contig_5283 | | BGIBMGA007039 | BCP inhibitor; Flags: Precursor | Bombyx mori | 191 | 100
Contig_6788 | | BGIBMGA014203 | Putative uncharacterized protein | Picea sitchensis | 165 | 34.91
Contig_5558 | | BGIBMGA012804 | Gag protein | Drosophila virilis | 138 | 33.93
Contig_56973 | CG5946 | BGIBMGA001152 | GF25295 | Drosophila ananassae | 346 | 68.4


Table 4-6. SNPHunter assembly and filtering metrics for M. cinxia v2.0 assembly.

- Assembly EST count: 798,702
- Contig count: 58,124
- Total Read/EST nucleotides: 111,767,279
  - Percentage A: 34
  - Percentage T: 31
  - Percentage G: 18.2
  - Percentage C: 16.8
- Total Contig nucleotides: 22,643,173
- Estimated raw polymorphic nucleotides: 136,504
- Number of Contigs with annotation: 16,952

Assembly Filtering Results

- Homopolymer Run Nucleotides detected: 2,465,017
- High Density Polymorphic Nucleotides detected: 25,188
- Nucleotides outside Blast range detected: 17,246,639
- Total Contig nucleotides removed by filtering: 17,584,347
- Total Contig Nucleotides After Filtering: 5,042,091
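The three per-filter counts above sum to 19,736,844, which is more than the 17,584,347 nucleotides reported as removed. One plausible reading, offered here as an assumption rather than a documented SNPHunter behavior, is that a nucleotide flagged by several filters is counted once in the total. The sketch below shows that bookkeeping on toy data; the function name, the filter keys, and the positions are hypothetical.

```python
def filtering_summary(length, masks):
    """Summarize filtering when a position may be flagged by several filters.

    length: number of nucleotides in the contig (or whole assembly).
    masks:  dict mapping a filter name to the set of 0-based positions it flags.
    Returns (per-filter counts, positions removed at least once, positions kept).
    """
    per_filter = {name: len(positions) for name, positions in masks.items()}
    removed = set().union(*masks.values())  # union: each position counted once
    return per_filter, len(removed), length - len(removed)

# Toy 12-nucleotide contig; positions 3 and 6 are flagged by two filters each.
toy_masks = {
    "homopolymer": {3, 4, 5, 6},
    "high_density_snp": {6, 7},
    "outside_blast": {0, 1, 2, 3, 10, 11},
}
per_filter, removed, kept = filtering_summary(12, toy_masks)
print(per_filter, removed, kept)
```

In the toy case the per-filter counts add up to 12 while only 10 positions are actually removed, the same direction of discrepancy seen in the table.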


Figure 4-1. Alignment of four partial M. cinxia putative JHaMT paralogous amino acid sequences at the N terminal region, shown over B. mori JHaMT protein, B. mori JHaMT predicted protein, Helicoverpa armigera JHaMT protein, and Spodoptera litura JHaMT protein. Groups were created through realignment and clustering of M. cinxia putative JHaMT assembly unigenes and re-sequencing of individual butterflies (see methods). M. cinxia JHaMT Group1a represents an allele containing a 25 amino acid deletion.


Figure 4-2. Average contig coverage depths versus contig lengths in the M. cinxia assembly v2.0. Colors represent relative SNP abundance, which ranges from zero to 89 in the whole assembly. a. Average contig coverage depths adjusted to exclude filtered assembly read nucleotides versus contig lengths after filtering of suspect assembly regions (i.e. excluded due to lack of annotation, homopolymer run, or very high density polymorphism) for 11,535 annotated contigs. Distribution is volcano shaped, with a peak at roughly 300 bp. b. Average contig coverage depths for v1.0 Aland read-containing contig sections versus contig lengths for v1.0 Aland read-containing contig sections for 1,203 annotated, metazoan contigs from a portion of the assembly with coverage depth greater than 5X and contig length greater than 75 bp. Color distribution is based on SNP distributions from panel ‘a’. Spatial distribution is volcano shaped, with a peak at roughly 250 bp. c. Average raw contig coverage depths versus raw contig lengths for all 58,124 M. cinxia v2.0 contigs. Black colored data points represent contigs with zero SNPs. Distribution is volcano shaped, with a peak at roughly 1,000 bp.


Figure 4-3. Comparison of SNP distribution in raw and filtered, partitioned M. cinxia assembly v2.0. a. Occurrences of number of SNPs per contig versus number of SNPs per contig in the raw (i.e. no filtering or partitioning) assembly. b. Occurrences of number of SNPs per v1.0 Aland contig sections versus number of SNPs per v1.0 Aland contig sections in the filtered and partitioned assembly (see methods).


Figure 4-4. Leverage plots of residual effects of the adjusted theta model (i.e. log (theta+1/length)) of polymorphism versus two pervasive biases in transcriptome assembly data (variable length and depth of coverage). a. Model residuals by the log of average coverage depths of Aland v1.0 read-containing contig sections in the v2.0 M. cinxia partitioned assembly. This plot shows that the model successfully accounts for variable coverage depth effects in the portions of the assembly at contig average coverage depths greater than 5x and contig lengths greater than 75 bp (n = 1203, R² = 0.046, P-value = 0.43). b. Model residuals by the log of contig lengths of Aland v1.0 read-containing contig sections in the v2.0 M. cinxia partitioned assembly. In this plot variable contig section length effects in the partitioned assembly remain significant, but account for less than 5% of the variation in the model values (n = 1203, R² = 0.046, P-value < 0.0001).


Figure 4-5. Leverage plots of residual effects of the adjusted theta model (i.e. log (theta+1/length)_NS) of non-synonymous polymorphism versus two pervasive biases in transcriptome assembly data (variable length and depth of coverage). a. Model residuals by the log of average coverage depths of Aland v1.0 read-containing contig sections in the v2.0 M. cinxia partitioned assembly. This plot shows that the model successfully accounts for variable coverage depth effects in the portions of the assembly at contig average coverage depths greater than 5x and contig lengths greater than 100 bp (n = 1004, R² = 0.15, P-value = 0.44). b. Model residuals by the log of contig lengths of Aland v1.0 read-containing contig sections in the v2.0 M. cinxia partitioned assembly. In this plot variable contig section length effects in the partitioned assembly remain significant, but account for only 14% of the variation in the model values (n = 1004, R² = 0.14, P-value < 0.0001).


Figure 4-6. Two candidate models for unbiased estimates for Theta (Ɵ = 4nu; Long, 2006) in pooled sample, multi-individual de novo transcriptome assemblies. The correction is made by dividing by the harmonic value (i.e. sum of the reciprocals) of the average sequence coverage depth in order to devalue polymorphism contributions from extremely deep regions. At depths below twice the number of individuals in the pool (>160x for M. cinxia), this corrects for the possibility of sampling a specific allele from a single individual more than once in order to remove bias. a. Basic unbiased theta estimator. b. Log transformed theta estimator divided by the sequence length. Adding one to the number of segregating sites (i.e. SNPs) allows contigs without SNPs to count towards the estimate. Dividing by the length of each sequence section attempts to account for contig length variability in the transcriptome.
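The two estimators in this caption can be written out as a short sketch. It assumes the usual Watterson form, in which the number of segregating sites is divided by the sum of reciprocals from 1 to n - 1, with the average coverage depth standing in for the sample size n. The function names are hypothetical, and converting the non-integer average depth to an integer by truncation is an assumption; for the three Table 4-2 rows used in the example, that choice reproduces the tabulated Log(Theta+1/bp) values.

```python
import math

def harmonic_sum(n):
    """Sum of reciprocals 1/1 + 1/2 + ... + 1/(n - 1)."""
    return sum(1.0 / i for i in range(1, n))

def theta_estimate(segregating_sites, average_depth):
    """Basic depth-corrected theta: SNP count divided by the harmonic sum."""
    n = int(average_depth)  # assumption: truncate the average depth
    if n < 2:
        raise ValueError("average depth must be at least 2")
    return segregating_sites / harmonic_sum(n)

def log_theta_per_bp(segregating_sites, average_depth, length_bp):
    """Log10 of the estimator with one added to the SNP count, per base pair."""
    theta_plus_one = theta_estimate(segregating_sites + 1, average_depth)
    return math.log10(theta_plus_one / length_bp)

# (v1.0 Aland SNPs, average depth, v1.0 Aland contig length) from Table 4-2
rows = {
    "Contig_983": (6, 5.72, 83),
    "Contig_15194": (7, 6, 94),
    "Contig_4275": (63, 12.67, 722),
}
for name, (snps, depth, length) in rows.items():
    print(name, round(log_theta_per_bp(snps, depth, length), 7))
```

Because the harmonic sum grows only logarithmically with depth, very deep contig sections gain little extra weight from the additional SNPs they tend to reveal, which is the devaluing effect the caption describes.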


Figure 4-7. Juvenile Hormone (JH) metabolic pathway in B. mori, as shown in Zhan, 2011. Juvenile hormone acid methyltransferase is the final step in the conversion of Acetyl-CoA into JH in Lepidopterans.


Figure 4-8. Polymorphic region of jhamt cluster 4 alignment. Alignment is near the 5’ start site of the transcript. Several SNPs are outlined by mismatch colored boxes starting at position 81.

Chapter Five Environmentally Associated Genetic Variation in Two Coral Species through de novo Transcriptomics

Introduction

With the advent of high-throughput sequencing, the ability to discover SNPs in interesting non-model organisms has become a powerful tool for novel genetic and evolutionary studies once limited to systems with sequenced genomes. In chapters one through three of this dissertation, tools and methods were developed demonstrating the advantages of using pooled-transcript de novo transcriptomics for SNP discovery, culminating in a transcriptome-wide polymorphism survey for candidate genes potentially undergoing balancing selection. Another application of SNP discovery using pooled transcript samples is the ability to associate population level genetic variation with environmental variables that have ecologically significant effects on fitness (Hedrick, Ginevan et al. 1976; Manel, Joost et al. 2010). One particularly relevant application to conservation biology is the study of the effects of predicted climate change on reef ecosystems around the world due to changing temperatures, rainfall, and ocean pH (IPCC, 2007; Baker, Glynn et al. 2008; Diaz-Pulido, McCook et al. 2009; Hoegh-Guldberg, Mumby et al. 2007; Hughes, Baird et al. 2003; Maynard, Anthony et al. 2008). A major concern is the amount of adaptive genetic variation present in affected corals related to fitness (Holderegger, Kamm et al. 2006), particularly at loci known to be correlated with environmental stress tolerance. A survey of such variation would allow a better understanding of regions of reef vulnerability and guide conservation and restoration efforts, including “assisted dispersal” wherein alleles associated with particular environmental conditions are transported and purposely established in distant areas where climate change is predicted to favor those alleles.
Much work has been done in coral conservation, especially on the well-studied stress response of coral bleaching, which is the disassociation of the endosymbiotic dinoflagellate algae (genus Symbiodinium) from its host coral tissue (Baker, Glynn et al. 2008; Berkelmans, De'ath et al. 2004; Oliver, Berkelmans et al. 2009). Variation in bleaching resistance is known to exist in nature and occurs due to many factors, including factors associated with the Symbiodinium-host interaction. However, stress tolerance in the coral Acropora millepora has been shown to be associated at least in part with ambient water temperature in three thermally distinct populations (Berkelmans and van Oppen 2006), most likely through genetic differentiation at allozyme loci that have been correlated with latitudinal sea water temperatures (Smith-Keune and van Oppen 2006). This suggests that past adaptation to varying local water temperatures occurred. In the coral Pocillopora damicornis, on the other hand, laboratory studies demonstrated latitudinal variation in bleaching sensitivity that was not correlated with the type of Symbiodinium harbored (Ulstrup, Berkelmans et al. 2006). Over the past several years, an increasing number of studies have investigated stress response genes in coral through transcriptome sequencing (Sanger sequencing EST projects, e.g. Csaszar, Seneca et al. 2009; Seneca, Foret et al. 2010), allowing the identification of important transcripts in the coral stress response for targeted searches of adaptive genetic variation (Van Oppen and Gates 2006). Here I present the results of a transcriptome-wide scan of polymorphism associated with environmental variables in two species of coral, P. damicornis and A. millepora. Both are ecologically important species on the Great Barrier Reef (GBR) that display variable stress responses potentially associated with local environmental variables, and both had limited or non-existent genomic resources at the time of the study (2011). Most of my contribution as a secondary author to this study was in the bioinformatic analysis of the SNP data using my newly developed tools and techniques. Thus, although original and valuable biological results are reported here, this chapter also serves to demonstrate and validate the feasibility of these new methods for environmental association studies in ecologically important non-model organisms.

Methods

P. damicornis and A. millepora polymorphism in 454 reads and association with environmental variables

Sampling, cDNA preparation and normalization, and sequencing of P. damicornis and A. millepora 454 FLX Titanium reads were reported in previous studies (Lundgren, Vera et al. In Review; Meyer, Aglyamova et al. 2009, see Table 1) and were carried out primarily by the lead authors of those studies. Read preparation, assembly, and annotation of P. damicornis were performed by Vera, J. C., as were additional SNP discovery and preliminary analysis for both corals. Two strategies were used to search for SNPs associated with environmental gradients in these corals. For A. millepora, the publicly available sequences and assembly (Meyer, Aglyamova et al. 2009) were scanned for SNPs using SNPHunter (Chapter 3; http://www.personal.psu.edu/jcv128/SoftwareC.html). The P. damicornis assembly was created using methods from this dissertation (i.e. through PipeMeta; Chapter 2) and is thus described briefly here. Replicate P. damicornis samples from each of 20 colonies were collected from Miall Island in the southern Great Barrier Reef (GBR; Keppel Islands), Pelorus Island in the central GBR (Palm Islands), and two sites in the far northern GBR, Wallace Islet and Wilke Reef (Figure 1), spanning a large gradient of temperature and turbidity (Figures 2 and 3). These populations presumably contain functionally important polymorphism adapted to variation in the two main stressors on the GBR, namely elevated sea surface temperatures and declining water quality (Anthony, Connolly et al. 2007; Hoegh-Guldberg, Mumby et al. 2007; Wooldridge, 2009). Recent studies have suggested that P. damicornis consists of several mtDNA lineages (putative species) (Concepcion, Crepeau et al. 2008; Souter, 2010; Pinzon and LaJeunesse, 2011), and on the GBR two distinct lineages were tentatively identified (Lundgren, Vera et al. In Review). The colonies collected from the Wallace Islet and Miall Island reefs were predominantly P. damicornis (type A), while the Wilke and Pelorus Island reef samples contained more colonies of P. damicornis (type B). As a result, the collections were divided into two populations for each of the two putative species. The two pooled population samples (Wilke and Pelorus Island reefs) from P. damicornis (type B) contained the majority of the successful sequencing effort and so were used for all further SNP discovery after cleaning and assembly using the PipeMeta pipeline and the SeqMan NGen 2.1.0 de novo assembler (DNAStar, http://www.dnastar.com/).

Targeted SNP detection in P. damicornis and A. millepora

The P. damicornis (type B) de novo assembly was searched for SNPs using SNPHunter with standard filter parameters. Only SNPs in contigs with a strong BLAST alignment to the Uniprot database or to N. vectensis predicted proteins were considered. About 24,000 initial candidate SNP locations were found. SNPs were further narrowed down with the following filters. All non-metazoan or functionally undefined contigs were discarded to avoid Symbiodinium sequences. All SNPs had to be present at a frequency of at least 10% in one of the two populations to further improve stringency (i.e. to further avoid false positives due to sequencing error and to avoid rare alleles unlikely to be related to ecological variation). After these filters, SNPs were sorted for allele frequencies that differed by at least 20 percentage points between populations (e.g. if an allele occurred at 40% in one population, in the other population it had to occur at 20% or less, or at 60% or more). Based on these criteria, 80 final SNP candidates distributed among 67 contigs were found. A custom Perl script (part of PipeMeta) was used for finding the SNP sequence ranges, and the primers were designed using Primer3 (http://frodo.wi.mit.edu/primer3/). Primers for all 80 candidate SNPs were created from the 75-150 bp sequence ranges surrounding the SNPs. PCR amplification was performed using DNA extracts from three colony samples of P. damicornis (type B), and the product was visualized on a 1% TAE-agarose gel. Any primer pair that produced products longer than predicted (and hence possibly including intron sequences), yielded inconsistent amplification success, or displayed any other sign of poor quality was discarded. A final selection of eight SNPs was made for use on the larger, GBR-wide sample set of P. damicornis (type B) (Table 2). Five of these were also successfully amplified in P. damicornis (type A). Because of prior knowledge and sequence data for a range of stress response genes in A. millepora (Meyer, Aglyamova et al. 2009; Meyer, Wang et al. 2010; Souter, 2010), the already published transcriptome data were downloaded and mined for SNPs using PipeMeta as above, and nine genes were selected (Table 3). One non-synonymous SNP was chosen to represent each gene in the GBR-wide sample set. Bioinformatic analysis of SNPs was performed by Vera, J. C. and molecular SNP work was performed by Lundgren, P.
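The two frequency filters described above can be illustrated with a minimal Python sketch. This is not the PipeMeta or SNPHunter code: the class name, field names, and example SNP names are hypothetical, and the sketch assumes that the pooled allele frequencies for the two populations have already been estimated as proportions between 0 and 1.

```python
from dataclasses import dataclass

@dataclass
class SnpCandidate:
    name: str          # hypothetical identifier, for illustration only
    freq_pop1: float   # frequency of one allele in pooled population 1 (0..1)
    freq_pop2: float   # frequency of the same allele in pooled population 2 (0..1)

def passes_filters(snp, min_freq=0.10, min_diff=0.20, tol=1e-9):
    """Apply the two thresholds from the text: the allele must reach at least
    10% in one population, and its frequency must differ by at least
    20 percentage points between the populations."""
    present = max(snp.freq_pop1, snp.freq_pop2) >= min_freq - tol
    # The tolerance guards against floating-point error at the exact
    # boundary (e.g. 0.40 versus 0.60).
    differs = abs(snp.freq_pop1 - snp.freq_pop2) >= min_diff - tol
    return present and differs

# Hypothetical candidates, not data from this study.
candidates = [
    SnpCandidate("example_1", 0.40, 0.60),  # differs by exactly 20 points
    SnpCandidate("example_2", 0.40, 0.25),  # differs by only 15 points
    SnpCandidate("example_3", 0.05, 0.08),  # never reaches 10%
]
kept = [c.name for c in candidates if passes_filters(c)]
```

Only the first hypothetical candidate passes both thresholds; the metazoan-annotation filter described in the text would be applied separately, before this step.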

Validation of environmental SNP associations in independent samples

To validate the presence of the SNPs in their populations, the primers designed above were used on independent samples of coral. To test the hypothesis that these candidate SNPs are truly associated with environmental variables rather than being chance results from the very large number of potential associations in the transcriptome, nine populations of P. damicornis (type B) with sample sizes greater than 15 colonies and ten populations of P. damicornis (type A) with sample sizes greater than 20 colonies were examined (Figure 1, Table 4). Five P. damicornis (type A) and eight P. damicornis (type B) SNP loci were tested against two environmental variables (turbidity and temperature) across these 17 populations using qPCR High Resolution Melt (HRM) curve analysis (QIAcube, QIAGEN). The PCR step was set up in 10 µL reactions incorporating Accumelt HRM supermix (Quanta Biosciences). Each sample was set up in triplicate with a primer concentration of between 200 nM and 500 nM, and 1 µL of 1/50 diluted gDNA. No-template (negative) and known-genotype (positive) controls were included in all runs. PCR and HRM were carried out under the conditions recommended by the instrument manufacturer. PCR commenced with an initial 5 minute hold at 95°C followed by 40 cycles of a 10 second, 95°C denaturation and a 30 second, 55°C annealing and extension. A. millepora genotyping was conducted using Sequenom MassArray on an Autoflex Spectrometer using iPLEX GOLD chemistry. Molecular SNP work was performed by Lundgren, P.

Statistical Analysis of P. damicornis and A. millepora

Collection of environmental data and statistical analyses were performed by Manel, S. Environmental variables were extrapolated for temperature and secchi depth using the e-Atlas tool. Temperature data (Rayner, Parker et al. 2003) can be found at http://e-atlas.org.au/content/sea-surface-temperature-hadisst-11 (Figure 2). Secchi depth data were provided by scientists at the Australian Institute of Marine Science and Queensland Department of Primary Industries and Fisheries (http://e-atlas.org.au/content/secchi-disk-depth-measure-water-clarity; Figure 3). Seven temperature and nine secchi depth categories were used, with seven as the highest temperature and nine as the deepest secchi depth. All temperature zones across the range of the GBR were represented by the sample collections. Due to the photosynthetic properties of the coral-Symbiodinium symbiosis, corals are rare in highly turbid water, so coral collections were limited to secchi categories four to nine (i.e. 6-7.8 m to 21.7-28 m). Correlations between environmental variables and SNP allele frequencies were tested using a logistic regression with a logit link and a binomial error distribution to relate the probability of allele one at each locus to the two environmental variables (McCullagh, 1998). For each SNP locus, the count of the first allele follows a binomial distribution described by two classical parameters: the total number of trials (i.e. N = 2 x n) and the probability of success of the first allele, p = na/(2 x n), where n is the number of individuals and na is the number of copies of the first allele. The probability of the second allele is given by 1-p. The likelihood ratio and Wald tests were used to determine the significance of the models (Joost, Bonin et al. 2007). To test for the proportion of random correlations, 159 alleles from 11 putatively neutral microsatellite loci were tested using the same statistics. This test was conducted on the already published A. millepora microsatellite data set from the same coral samples (van Oppen et al., 2011).
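As an illustration of the kind of test described above, the following Python sketch fits a binomial logistic regression of first-allele counts on a single environmental category and reports Wald and likelihood ratio p-values. It is a minimal sketch under stated assumptions, not the analysis code used in the study: it treats one environmental variable at a time, assumes diploid individuals (so each population contributes N = 2 x n allele copies), fits the model by Newton-Raphson, and uses hypothetical function names and hypothetical counts.

```python
import math

def _prob(b0, b1, x):
    # Inverse logit link: probability of the first allele at environment x.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def _loglik(b0, b1, env, counts, trials):
    # Binomial log-likelihood without the constant binomial coefficient,
    # which cancels in the likelihood ratio.
    return sum(k * math.log(_prob(b0, b1, x)) + (n - k) * math.log(1.0 - _prob(b0, b1, x))
               for x, k, n in zip(env, counts, trials))

def _score_and_info(b0, b1, env, counts, trials):
    # Gradient (g0, g1) and Fisher information entries (h00, h01, h11).
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, k, n in zip(env, counts, trials):
        p = _prob(b0, b1, x)
        w = n * p * (1.0 - p)
        r = k - n * p
        g0 += r
        g1 += r * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11

def allele_environment_test(env, allele_counts, individuals, max_iter=100):
    """env: environmental category per population; allele_counts: copies of the
    first allele per population; individuals: diploid individuals per population."""
    trials = [2 * n for n in individuals]          # N = 2 x n allele copies
    b0 = b1 = 0.0
    for _ in range(max_iter):                      # Newton-Raphson updates
        g0, g1, h00, h01, h11 = _score_and_info(b0, b1, env, allele_counts, trials)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    _, _, h00, h01, h11 = _score_and_info(b0, b1, env, allele_counts, trials)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    wald_p = math.erfc(abs(b1 / se_b1) / math.sqrt(2.0))   # two-sided normal
    p_bar = sum(allele_counts) / sum(trials)               # intercept-only model
    ll_null = _loglik(math.log(p_bar / (1.0 - p_bar)), 0.0, env, allele_counts, trials)
    lr_stat = max(0.0, 2.0 * (_loglik(b0, b1, env, allele_counts, trials) - ll_null))
    lr_p = math.erfc(math.sqrt(lr_stat / 2.0))             # chi-square, 1 df
    return {"slope": b1, "wald_p": wald_p, "lr_p": lr_p}

# Hypothetical data: seven temperature categories, 20 colonies per population.
result = allele_environment_test([1, 2, 3, 4, 5, 6, 7],
                                 [30, 28, 24, 20, 17, 12, 9],
                                 [20] * 7)
```

In this sketch the sign of the slope indicates whether the first allele becomes more or less frequent as the environmental category increases. The sketch assumes that the pooled allele frequency lies strictly between 0 and 1 and that the environmental variable takes at least two distinct values.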

Results

P. damicornis 454 read sequencing, assembly, and annotation

After preliminary machine quality filtering, 947,623 sequences were obtained. Additional primer screening and quality filtering through PipeMeta removed 27.4% of all sequenced nucleotides. The remaining 653,814 reads were used in the assembly. Roughly 67% of the trimmed, filtered reads (64% of the sequenced nucleotides) assembled into 87,520 contigs with an average length of 584 (±296) bp, with the 244,541 remaining singletons averaging 304 (±85) bp in length (Table 1). To further assess the quality and coverage achieved by the assembly, the unigenes were BLAST-aligned to two reference gene sets from related species with sequenced genomes (i.e. Nematostella vectensis and Acropora digitifera modified in silico predicted proteins). Out of these, 10,021 (23.4%) and 11,148 (52.4%) were unique hits to N. vectensis and A. digitifera, respectively. Using a conservative approach to estimate transcriptome coverage length (see chapter two of this dissertation), approximately 16.0% of N. vectensis total amino acids and 25.3% of A. digitifera total amino acids were covered (i.e. had strong alignment) by P. damicornis (type B) unigene sequence, with median coverage rates of 35.6% and 38.7% per predicted gene with a strong alignment for N. vectensis and A. digitifera proteins, respectively. Less than 6.8% (N. vectensis) or 10.2% (A. digitifera) of the above reference gene amino acid coverage was from more than one P. damicornis (type B) unigene (i.e. duplicate or overlapping amino acid coverage by multiple unigenes). A BLAST alignment versus the Uniprot protein database produced significant hits (bitscore > 45) for 54,094 or 61.8% of the contigs, out of which 44.5% had top alignments to proteins expressed in metazoans, 15.2% aligned to plant, 8.4% aligned to bacteria, and 0.3% aligned to virus (Table 1). All comparisons and annotations were run using PipeMeta tools.
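The coverage quantities reported above (the fraction of reference amino acids covered, and the share of that coverage contributed by more than one unigene) can be illustrated with a short Python sketch. The actual procedure is the one described in chapter two and implemented in PipeMeta; the function below is only a generic depth-counting illustration, and its name, its input format (1-based inclusive alignment intervals on a reference protein), and the example intervals are assumptions.

```python
def coverage_summary(protein_length, intervals):
    """Return (fraction of residues covered by at least one unigene,
    fraction of the covered residues that are covered by two or more).

    intervals: list of (start, end) alignment ranges on the reference
    protein, 1-based and inclusive, one per aligned unigene."""
    depth = [0] * protein_length
    for start, end in intervals:
        # Clip each interval to the protein and count per-residue depth.
        for pos in range(max(start, 1), min(end, protein_length) + 1):
            depth[pos - 1] += 1
    covered = sum(1 for d in depth if d >= 1)
    multi = sum(1 for d in depth if d >= 2)
    return covered / protein_length, (multi / covered if covered else 0.0)

# Hypothetical 100-residue reference protein hit by two overlapping unigenes.
fraction_covered, fraction_duplicate = coverage_summary(100, [(1, 30), (21, 50)])
```

For this hypothetical example, residues 1-50 are covered (half of the protein) and residues 21-30 are covered twice (one fifth of the covered residues). Counting each residue once regardless of depth is what keeps such an estimate from being inflated by overlapping unigenes.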

SNPs in P. damicornis and A. millepora

Of the eight SNPs that were selected and amplified from the transcriptome of P. damicornis (type B), only one was a transversion (24_CA_2390) (Table 2). This SNP occurred at Night and Tydeman reefs, with the rare allele (A) occurring at a frequency of 18.1% and 7.6%, respectively. In the original transcriptome data, this allele occurred at Wilke Island reef, which is located in the same geographic region as Night and Tydeman (Figure 1). The remaining seven SNPs were all transitions (C/T or G/A). The resolution of the single-step PCR to HRM method used in this study did not allow discrimination of A to T or C to G mutations. Hence, all the SNPs analyzed by this method caused a change in the number of hydrogen bonds in the DNA duplex, which produced a large enough change in the melting temperature to allow accurate scoring. The final data set for P. damicornis (type B) is based on the allele frequencies at eight SNP loci. In P. damicornis (type A), five loci were amplified using the primers developed for P. damicornis (type B) (Table 2). Three loci were excluded from trials in P. damicornis (type A) due to ambiguous scoring patterns of the HRM curves in this species. In P. damicornis, the SNPs were found in genes of known and unknown function and included both synonymous and non-synonymous mutations. The genes of known function included Elongation factor 1-α, β-hexosaminidase, Carbonic anhydrase, and the 40S ribosomal protein S3 (Table 2). For A. millepora, the Sequenom MassArray produced sequence data, hence allowing the inclusion of the C to G mutation that was evident in the Ligand of Numb X2 locus. One of the nine chosen loci (Coatomer) was found to be monomorphic and was excluded from further analyses. The remaining eight SNPs were successfully genotyped in the majority of the colony samples (Table 3).
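The distinction drawn above between transitions, transversions, and substitutions that change the number of hydrogen bonds can be made concrete with a small Python sketch. It assumes SNP names in the underscore-separated form given in the captions of Tables 5-2 and 5-3 (contig number, the common nucleotide followed by the rare one, and position); the function names are hypothetical and are not part of PipeMeta or SNPHunter.

```python
PURINES = {"A", "G"}
STRONG = {"G", "C"}   # bases that pair with three hydrogen bonds

def parse_snp_name(name):
    """Split a name such as '24_CA_2390' into contig, common allele,
    rare allele, and position in the trimmed contig."""
    contig, alleles, position = name.split("_")
    return contig, alleles[0], alleles[1], int(position)

def classify(common, rare):
    """Return ('transition' or 'transversion', changes_h_bonds)."""
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
    same_class = (common in PURINES) == (rare in PURINES)
    kind = "transition" if same_class else "transversion"
    # A/T <-> G/C swaps change the hydrogen-bond count; A <-> T and
    # C <-> G swaps do not, which is why HRM could not score them here.
    changes_h_bonds = (common in STRONG) != (rare in STRONG)
    return kind, changes_h_bonds

# SNP names taken from Tables 5-2 and 5-3.
for snp in ("24_CA_2390", "171_CT_243", "53464_CG_2463"):
    _, common, rare, _ = parse_snp_name(snp)
    print(snp, classify(common, rare))
```

Under this classification, 24_CA_2390 is a transversion that still changes the hydrogen-bond count, whereas the C/G substitution in 53464_CG_2463 does not, which is consistent with that locus being scorable only by the Sequenom assay.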

Allele frequencies and environmental correlations in P. damicornis and A. millepora

Six of the 26 correlation tests (see methods) showed a significant correlation with environmental variables in P. damicornis (Table 5). These correlations were found between temperature and SNP allele frequency in β-hexosaminidase in P. damicornis (type B) and for secchi depth in P. damicornis (type A). Temperature and SNP allele frequency were also significantly correlated in Elongation factor 1-α for both type A and B, as well as in a gene of unknown function (Contig_24(847)) in P. damicornis (type B) (Table 5). In A. millepora, allele frequencies in five of the loci showed a significant correlation with environmental gradients. β-gamma crystallin, Galaxin, and Ubiquitin were correlated with temperature, and Thioredoxin and Ligand of Numb X2 were found to correlate with secchi depth (Table 6).

Conclusions

Adaptive molecular variation relating to coral stress response is of great interest to ocean scientists and conservationists due to the effects of global climate change on reef biology. In this environmental association study (Lundgren et al., in review), two coral transcriptomes (P. damicornis and A. millepora) were assembled and scanned for high quality SNPs using methods developed here. The resulting pool of SNPs was then searched for genes known to be involved in the coral stress response, and the best candidates for each gene of interest (after verification by amplification, see methods) were then used to genotype independent samples of coral in an environmental association of the SNPs with two important stress variables, water temperature and turbidity. Several SNPs within the stress response genes were found to be significantly correlated with environmental gradients (four, two, and three SNPs for A. millepora, P. damicornis (type A) and P. damicornis (type B), respectively), including several genes thought to be involved in calcification (β-gamma crystallin and Galaxin), cellular stress response to damage (Ubiquitin), the oxidative stress response (Thioredoxin), cell apoptosis (Elongation factor 1-α), and the immune response to stress (β-hexosaminidase). β-gamma crystallin is thought to bind calcium in bacteria and cephalochordates (Kappe, Purkiss et al. 2010) and has been correlated with variable larval response to water temperature (Meyer, Aglyamova et al. 2009). Another environmentally associated gene perhaps involved in calcification is galaxin, which forms a soluble protein within the coral exoskeleton (Fukuda, Ooki et al. 2003). The calcification rate in corals has been shown to be affected by water temperature, reinforcing the link between water temperature and genetic variation in these genes. Ubiquitin, a component of the ubiquitin cell damage pathway, targets proteins for degradation (Vucic, Dixit et al. 2011), and is a measure of cellular stress in corals (Downs, Fauth et al. 2005). Ligand of Numb X2 is a close homolog of Ligand of Numb X, an important ligase for Ubiquitin (Mirza, Raschperger et al. 2005), while Thioredoxin is involved in the sequestration of reactive oxygen species produced during oxidative stress (Weis, 2008) and is involved with differential stress response in the presence of pollution (Morgan, Edge et al. 2005). Elongation factor 1-α has been shown to be associated with cell apoptosis initiated by chromosomal damage (Furukawa, Jinks et al. 2001). β-hexosaminidase, on the other hand, is involved in several cellular processes, but one of its stress-related roles may be in initiating an immune response to mycobacterial infection (Koo, Ohol et al. 2008). Although environmental associations with neutral genetic variation have been shown through simulations (Bierne, Welch et al. 2011) to occur through non-adaptive means such as endogenous genetic barriers along environmental gradients (i.e. the coupling effect), the SNP associations for this study were targeted specifically on the basis of a priori knowledge of their possible effect on the stress response. Using putatively neutral microsatellite data from a previously published study (van Oppen et al., 2011), the false positive rate due to chance alone was estimated at ~10%. Most of the SNPs were also non-synonymous, and thus potentially have protein-modifying effects at the physiological level. In addition, laboratory aquarium experiments are currently being performed to further verify the links between the SNP loci and coral phenotypes, as well as a simulated SNP study to estimate the rates of false positives at neutral sites. Nevertheless, it is acknowledged that the inability to fully differentiate environmental variables from historical or demographic effects currently remains a potential shortcoming of this study.
In conclusion, here I have reported the successful application of new tools and methods in an important environmental association study in two coral species without the benefit of extensive genomic resources, with the result that several candidate genes potentially involved in stress responses to changing environmental variables were identified for future conservation efforts.


References

IPCC (2007). Climate change 2007: Impacts, adaptation and vulnerability. Cambridge, Cambridge University Press.

Anthony, K. R. N., S. R. Connolly, et al. (2007). "Bleaching, energetics, and coral mortality risk: Effects of temperature, light, and sediment regime." Limnology And Oceanography 52(2): 716-726.

Baker, A. C., P. W. Glynn, et al. (2008). "Climate change and coral reef bleaching: An ecological assessment of long-term impacts, recovery trends and future outlook." Estuarine Coastal And Shelf Science 80(4): 435-471.

Berkelmans, R., G. De'ath, et al. (2004). "A comparison of the 1998 and 2002 coral bleaching events on the Great Barrier Reef: spatial correlation, patterns, and predictions." Coral Reefs 23(1): 74-83.

Berkelmans, R. and M. J. H. van Oppen (2006). "The role of zooxanthellae in the thermal tolerance of corals: a 'nugget of hope' for coral reefs in an era of climate change." Proceedings Of The Royal Society B-Biological Sciences 273(1599): 2305-2312.

Bierne, N., J. Welch, et al. (2011). "The coupling hypothesis: why genome scans may fail to map local adaptation genes." Molecular Ecology 20(10): 2044.

Concepcion, G. T., M. W. Crepeau, et al. (2008). "An alternative to ITS, a hypervariable, single-copy nuclear intron in corals, and its use in detecting cryptic species within the octocoral genus Carijoa." Coral Reefs 27(2): 323-336.

Csaszar, N. B. M., F. O. Seneca, et al. (2009). "Variation in antioxidant gene expression in the scleractinian coral Acropora millepora under laboratory thermal stress." Marine Ecology-Progress Series 392: 93-102.

Diaz-Pulido, G., L. J. McCook, et al. (2009). "Doom and Boom on a Resilient Reef: Climate Change, Algal Overgrowth and Coral Recovery." Plos One 4(4).

Downs, C. A., J. E. Fauth, et al. (2005). "Cellular diagnostics and coral health: Declining coral health in the Florida Keys." Marine Pollution Bulletin 51(5-7): 558-569.

Fukuda, I., S. Ooki, et al. (2003). "Molecular cloning of a cDNA encoding a soluble protein in the coral exoskeleton." Biochemical And Biophysical Research Communications 304(1): 11-17.

Furukawa, R., T. M. Jinks, et al. (2001). "Elongation factor 1 beta is an actin-binding protein." Biochimica Et Biophysica Acta-General Subjects 1527(3): 130-140.

Hedrick, P. W., M. E. Ginevan, et al. (1976). "Genetic-Polymorphism In Heterogeneous Environments." Annual Review Of Ecology And Systematics 7: 1-32.

Hoegh-Guldberg, O., P. J. Mumby, et al. (2007). "Coral reefs under rapid climate change and ocean acidification." Science 318(5857): 1737-1742.

Holderegger, R., U. Kamm, et al. (2006). "Adaptive vs. neutral genetic diversity: implications for landscape genetics." Landscape Ecology 21(6): 797-807.

Hughes, T. P., A. H. Baird, et al. (2003). "Climate change, human impacts, and the resilience of coral reefs." Science 301(5635): 929-933.

Joost, S., A. Bonin, et al. (2007). "A spatial analysis method (SAM) to detect candidate loci for selection: towards a landscape genomics approach to adaptation." Molecular Ecology 16(18): 3955.

Kappe, G., A. G. Purkiss, et al. (2010). "Explosive Expansion of beta gamma-Crystallin Genes in the Ancestral Vertebrate (vol 71, 219, 2010)." Journal Of Molecular Evolution 71(4): 317-318.

Koo, I. C., Y. M. Ohol, et al. (2008). "Role for lysosomal enzyme beta-hexosaminidase in the control of mycobacteria infection." Proceedings Of The National Academy Of Sciences Of The United States Of America 105(2): 710-715.

Lundgren, P., J. C. Vera, et al. (In Review). "Genotype – environment correlations in corals from the Great Barrier Reef."

Manel, S., S. Joost, et al. (2010). "Perspectives on the use of landscape genetics to detect genetic adaptive variation in the field." Molecular Ecology 19(17): 3760-3772.

Maynard, J. A., K. R. N. Anthony, et al. (2008). "Major bleaching events can lead to increased thermal tolerance in corals." Marine Biology 155(2): 173-182.

Meyer, E., G. V. Aglyamova, et al. (2009). "Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx." Bmc Genomics 10.

Meyer, E., S. Wang, et al. (2010). "Quantitative genetics and genomics of reef-building coral Acropora millepora." Integrative And Comparative Biology 50: E118-E118.

Mirza, M., E. Raschperger, et al. (2005). "The cell surface protein coxsackie- and adenovirus receptor (CAR) directly associates with the ligand-of-numb protein-X2 (LNX2)." Experimental Cell Research 309(1): 110-120.

Morgan, M. B., S. E. Edge, et al. (2005). "Profiling differential gene expression of corals along a transect of waters adjacent to the Bermuda municipal dump." Marine Pollution Bulletin 51(5-7): 524-533.

Oliver, J., R. Berkelmans, et al. (2009). Coral Bleaching: Patterns, processes, causes and consequences. Berlin, Springer-Verlag.

Pinzon, J. H. and T. C. LaJeunesse (2011). "Species delimitation of common reef corals in the genus Pocillopora using nucleotide sequence phylogenies, population genetics and symbiosis ecology." Molecular Ecology 20: 311-325.

Rayner, N. A., D. E. Parker, et al. (2003). "Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century." Journal Of Geophysical Research-Atmospheres 108(D14).

Seneca, F. O., S. Foret, et al. (2010). "Patterns of Gene Expression in a Scleractinian Coral Undergoing Natural Bleaching." Marine Biotechnology 12(5): 594-604.

Smith-Keune, C. and M. van Oppen (2006). "Genetic structure of a reef-building coral from thermally distinct environments on the Great Barrier Reef." Coral Reefs 25(3): 493-502.

Souter, P. (2010). "Hidden genetic diversity in a key model species of coral." Marine Biology 157(4): 875-885.

Ulstrup, K. E., R. Berkelmans, et al. (2006). "Variation in bleaching sensitivity of two coral species across a latitudinal gradient on the Great Barrier Reef: the role of zooxanthellae." Marine Ecology-Progress Series 314: 135-148.

Van Oppen, M. J. H. and R. D. Gates (2006). "Conservation genetics and the resilience of reef-building corals." Molecular Ecology 15(13): 3863-3883.

Van Oppen, M. J. H., L. Peplow, S. Kinninmonth and R. Berkelmans (2011). "Historical and contemporary factors shape the population genetic structure of the broadcast spawning coral, Acropora millepora, on the Great Barrier Reef." Molecular Ecology doi: 10.1111/j.1365-294X.2011.05328.x.

Vucic, D., V. M. Dixit, et al. (2011). "Ubiquitylation in apoptosis: a post-translational modification at the edge of life and death." Nature Reviews Molecular Cell Biology 12(7): 439-452.

Weis, V. M. (2008). "Cellular mechanisms of Cnidarian bleaching: stress causes the collapse of symbiosis." Journal Of Experimental Biology 211(19): 3059-3066.

Wooldridge, S. A. (2009). "Water quality and coral bleaching thresholds: Formalising the linkage for the inshore reefs of the Great Barrier Reef, Australia." Marine Pollution Bulletin 58(5): 745-751.


Table 5-1. Summary of 454 sequence data for P. damicornis (type B). Contig = consensus sequence consisting of combined original sequences that showed significant sequence overlap.

Number of nucleotides 277 428 160

Number of sequences 947 623

Average (trimmed) sequence length 312.5 (±85 sd)

Number of assembled contigs 87 520

Average length of contigs 584 (±296)

Proportion significant BLASTs 20% – percent of annotated contigs

% host gene (of total/of annotated data) 13.6% / 68%


Table 5-2. List of SNPs and contig information for the eight SNP loci that were included in the final analysis in P. damicornis (type B). SNP name constructed in the following manner: (Contig number_SNP nucleotides_position of SNP in trimmed contig). SNP nucleotides are ordered with the more common nucleotide first, followed by the rare allele. Genes marked with * were also amplified for type A.

SNP name      Gene                                      Codon pos   Primer sequence (5’)         Primer sequence (3’)
171_CT_243    Beta-hexosaminidase *                     3           GAAATTTCAAACTGACCAACTTGT     TCGACGTCTTCCATTGTCAT
1841_GA_723   Elongation factor 1-alpha (1841) *        3           AATGGCCAGACCCGTGAG           CAGCAACAATAAGCTGCTTCAC
2631_GA_1228  Elongation factor 1-alpha (2631) *        1           ACCGCTGTCAAGGATTCAA          AATCGACTCTCGGTGTCA
24_CT_847     Putative uncharacterised protein 1.1 *    Unknown     ACAAGATGTGTGTAGATGAATTTGC    ACAACCAGCATATTGGGTGCAT
24_CA_2390    Putative uncharacterised protein 1.2      Unknown     CTGAGGAAGTTCCACCAACAG        AACTGAGATAATCCTgcagaaaa
269_GA_818    Putative uncharacterised protein 2        Unknown     TCAAGATTTACTGGTAATTGG        GCATTTTGGACAAAGCCAGA
1361-TC-428   Carbonic anhydrase                        3           gggcgctAAACTCAGTGG           TCCAAAATGAAAATGGAATTGTT
5662_AG_426   Predicted protein *                       3           AAAAGGTGGCCACAAGAGG          CCAGTCCACCAATCAGTTT


Table 5-3. Selected genes of interest in Acropora millepora, SNP codon position (all targeted SNPs were non-synonymous mutations), and their putative role in the stress response. SNP name constructed in the following manner: (Contig number_SNP nucleotides_position of SNP in trimmed contig).

Gene                      SNP name        SNP codon pos   Putative role in stress response
Arginine kinase           50281_AG_478    1               Detoxification of hyper-oxide
β-gamma crystallin        35180_TG_523    2               Immunity response
Coatomer                  60613_TG_230    2               Intracellular protein transport
Complement component C3   26792_AG_293    1               Immunity response
Galaxin                   20421_TC_239    1               Calcification and growth
Hsp60                     52394_AG_280    2               Generic heat shock protein
Mn Superoxide dismutase   16774_AC_791    1               Defence against ROS
Thioredoxin               16728_TC_211    1               Defence against ROS
Ubiquitin like protein    14848_TC_418    2               Defence against ROS
Ligand of numb X2         53464_CG_2463   1               Possible ligase to bind ubiquitin


Table 5-4. List of sampling locations, listed from north to south, reef types, secchi and thermal categories (each divided into 9 and 7 categories respectively, thermal 1-7 coldest to warmest, secchi 1-9 lowest to highest secchi depth) and number of samples of each species.

No. samples are listed in the column order P. damicornis (type B), P. damicornis (type A), A. millepora.

Reef / Island name   Location   Type        Secchi depth category   Thermal category   No. samples
Wallace Isl          FN         Inshore     6                       7                  15 20
Five Rf              FN         Outer       8                       6                  15 20
Night Isl            FN         Inshore     5                       6                  15 20 20
Wilkie Rf            FN         Inshore     4                       6                  15 20
Tydeman Rf           FN         Outer       6                       6/5                15
Lizard Isl           North      Mid shelf   6                       5                  20
Emily Rf             North      Mid shelf   7                       5                  18
Sudbury 1            North      Mid shelf   8                       5/4                20
Sudbury 2            North      Mid shelf   8                       5/4                28
Trunk Rf             Central    Mid shelf   8                       4                  20
Rib Rf               Central    Mid shelf   8                       4                  15
Dip Rf               Central    Outer       9                       4                  20
Pelorus Isl          Central    Inshore     6                       4                  15 20 20
Wheeler Rf           Central    Mid shelf   9                       4/3                15 20
Darley Rf            Central    Mid shelf   6                       3                  20
Holbourne Isl        Central    Inshore     4                       3                  20
Ross Rf              Central    Mid shelf   8                       3                  20
Boulton Rf           Central    Mid shelf   8                       2                  37
Goble Rf             South      Mid shelf   7                       2                  20
Calder Isl           South      Mid shelf   5                       2                  20
20_344               South      Mid shelf   7                       2                  18
21_121               South      Mid shelf   8                       2                  19
High peak Isl        South      Inshore     6                       1                  20
North Keppel Isl     South      Inshore     5                       1                  28
Miall Isl            South      Inshore     5                       1                  20
Outer Rf             South      Inshore     6                       1                  20
Child Rf             South      Inshore     6                       1                  20


Table 5-5. Summary of statistics for the SNP loci that show a significant correlation between allele frequency distribution and environmental category in P. damicornis (type A or B). Coeff LR = regression coefficients of the logistic regressions; p-values were derived from the Wald tests and considered significant when < 0.05. * Indicates p-values that are no longer significant after corrections are made for multiple tests (0.05/3 ≈ 0.017).

                                                  P. damicornis (type B)    P. damicornis (type A)
Gene                       Environmental variable   Coeff LR   p-value        Coeff LR   p-value
β-hexosaminidase           Temp                     -0.250     0.00025        -0.121     0.3958
β-hexosaminidase           Secchi                   0.114      0.213          -0.190     0.0374*
Elongation factor 1-α 2631 Temp                     0.232      0.0025         -0.376     0.0347*
Elongation factor 1-α 2631 Secchi                   0.126      0.24569        -0.472     0.00033
Contig_24 (847)            Temp                     0.057      0.399          0.085      0.538
Contig_24 (847)            Secchi                   0.594      5.60e-10       0.539      0.0513


Table 5-6. Summary of statistics for the SNP loci that show a significant correlation between allele frequency distribution and environmental category in A. millepora. Coeff LR = regression coefficients of the logistic regression, p-values derived from the Wald tests and considered significant when < 0.05.

Gene                   Environmental variable   Coeff LR   p-value

Ligand of Numb X2      Secchi                    0.150     0.009
Thioredoxin            Secchi                    0.301     0.003
β-gamma crystallin     Temp                     -0.322     0.001
Galaxin                Temp                     -0.412     0.000
Ubiquitin              Temp                      0.126     0.015
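For readers unfamiliar with the Wald test named in the caption, the following minimal Python sketch shows how a two-sided p-value is obtained from a logistic regression coefficient and its standard error. The standard errors are not reported in the table, so the value used in the usage example is hypothetical and chosen purely for illustration; the function name `wald_p_value` is likewise an assumption, and this is not the software used to produce the table.

```python
import math


def wald_p_value(coef, std_err):
    """Two-sided Wald test p-value for H0: coefficient = 0.

    The Wald statistic z = coef / std_err is compared against the standard
    normal distribution: p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    if std_err <= 0:
        raise ValueError("standard error must be positive")
    z = coef / std_err
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Coefficient taken from Table 5-6 (Thioredoxin, Secchi); the standard
    # error below is HYPOTHETICAL because the table does not report one.
    coef = 0.301
    hypothetical_se = 0.10
    print(f"z = {coef / hypothetical_se:.2f}, p = {wald_p_value(coef, hypothetical_se):.4f}")
```

A larger coefficient relative to its standard error gives a larger |z| and therefore a smaller p-value, which is why coefficients of similar magnitude in the table can carry quite different p-values.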


Figure 5-1. Map of sampling locations for P. damicornis (type A) (blue), P. damicornis (type B) (yellow) and A. millepora (red).


Figure 5-2. Map showing allele frequency patterns of three SNP loci (β-gamma crystallin right, Galaxin middle and Ubiquitin left) in A. millepora with a significant correlation to temperature categories, plotted on spatial data for average sea surface temperatures between 2000 and 2006. Map courtesy of the e-Atlas (http://e-atlas.org.au).


Figure 5-3. Map showing allele frequency patterns of two SNP loci (Thioredoxin right, Ligand of Numb X2 left) in A. millepora with a significant correlation to turbidity, plotted on spatial data for Secchi depths collected between 1992 and 2006. Map courtesy of the e-Atlas (http://e-atlas.org.au).

VITA

Name: Juan Cristobal Vera

Education:
B.S. Biology, Philosophy Minor, 2002. George Mason University, Fairfax, VA
Ph.D. in Biology, 2004-2012. Penn State University, State College, PA

Awards:
Alfred P. Sloan Scholarship, 2007-2010
Homer F. Braddock Fellowship, 2004-2007
Bunton-Waller Graduate Award, 2004-2007

Publications:
Lundgren, P., J. C. Vera, et al. (In Review). "Genotype – environment correlations in corals from the Great Barrier Reef."
Wheat, C. W., H. W. Fescemyer, et al. (2011). "Functional genomics of life history variation in a butterfly metapopulation." Molecular Ecology 20(9): 1813.
Polato, N. R., J. C. Vera, et al. (2011). "Gene Discovery in the Threatened Elkhorn Coral: 454 Sequencing of the Acropora palmata Transcriptome." PLoS ONE 6(12): e28634.
Marden, J. H., H. W. Fescemyer, et al. (2008). "Weight and nutrition affect pre-mRNA splicing of a muscle gene associated with performance, energetics and life history." J Exp Biol 211(Pt 23): 3653-60.
Vera, J. C., C. W. Wheat, et al. (2008). "Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing." Molecular Ecology 17(7): 1636.

Dissertation:

Vera, J. C. (2012). “Functional Transcriptomics: Ecologically Important Genetic Variation in Non-Model Organisms”. Ph.D Dissertation, Pennsylvania State University, USA. Supervisor: Dr. James H. Marden.

Brief Bio: Juan Cristobal Vera was born in Santiago, Chile on March 23rd, 1972, the son of Luis Vera and Cecelia Domeyko. After completing a B.S. in Biology with honors and a minor in Philosophy at George Mason University in 2002, he entered the biology graduate program at Penn State University in the fall of 2004. He did rotations until the spring of 2005, after which he joined the lab of Professor James Marden to study organismal genetic variation using molecular and computational techniques.