Before speciation: adaptive evolution of diapause regulation in common wood-white butterflies ( Leptidea sinapis )

Luis Leal

Degree project in bioinformatics, 2017 Examensarbete i bioinformatik 45 hp till masterexamen, 2017 Education Centre and Department of Ecology and Genetics, Uppsala University Supervisor: Niclas Backström

Abstract | In this project, we have developed and implemented a bioinformatics pipeline for the assembly, annotation and analysis of the transcriptome of Leptidea sinapis (common wood-white butterfly). The main aim was to identify and pathways associated with diapause, a key life-history trait whose precise regulation varies along the species' range as a result of clinal adaptation, as well as to evaluate whether genomic regions associated to diapause regulation are among L. sinapis' genomic areas of differentiation.

De novo transcriptome assembly was performed using Trinity. We used DESeq2 to carry out differential expression (DGE) analysis and draw lists of genes whose expression varies significantly between individuals subject to light treatments that induce either diapause or direct development. Differentially expressed genes were subsequently enriched for (GO) terms using topGO in order to assess their distribution among functional categories. Our results reveal that regulation of monoamine neurotransmitters is markedly different across light treatments and suggest sustained high dopamine levels both during and after the photosensitive period could have triggered metabolic cascades leading to diapause induction in wood whites reared under short-day light conditions. Two genes, Ddc (aromatic-L-amino-acid decarboxylase) and st (scarlet) are identified as the key monoamine regulators during larval stage instar-III. In later larval stages and during pupation, once developmental pathways have been set irreversibly either towards arrested or direct development, asymmetric dopamine levels across treatments are maintained by a set of five genes ‒ PPO2 (prophenoloxidase 2), Manf (mesencephalic astrocyte- derived neurotrophic factor), Vmat (vesicular monoamine transporter), Hn (henna), and CG32447.

Finally, we use these results as a basis for the re-evaluation of genome scans between three L. sinapis ecotypes (Swedish, Spanish, and Kazakh) with diverging diapausal behavior. Two out of the nine genes showing a statistically significant number of fixed differences between populations, ST1C4 (sulfotransferase 1C4) and spri (sprint), are monoamine regulators. These two genes could be behind differences in diapause regulation observed between the three ecotypes and constitute the core of coalescing genomic islands of differentiation.

i ii Before speciation: adaptive evolution of diapause regulation in common wood-white butterflies

Popular Science Summary

Luis Leal

Insects respond to local environmental conditions by adopting specific life- history strategies. In temperate latitudes, many go into diapause (hibernation) during the cold season, a period during which developmental processes come to a standstill. For the wood white, a butterfly present across a large part of western Eurasia, breeding and diapausal rhythms vary with latitude as a result of local adaptation. In this project, we aim at understanding the regulatory mechanisms underpinning diapause in wood white butterflies, and evaluate whether genes involved in diapause regulation are located in genomic areas currently under divergence among different wood white populations.

Using gene expression data, we present evidence that diapause is likely to be induced by dopamine and other neurotransmitters, chemical molecules abundant in larvae reared under low light conditions, and whose presence leads to a progressive decrease in metabolic activities and an increase in fat storage. We identified several genes that act as neurotransmitter modulators and whose expression profile varies with light treatment. We identified two further genes, also neuromodulators, whose nucleotide content has diverged between different wood white populations, and that could therefore be responsible for the differences observed in diapause regulation along the species's range.

Degree project in bioinformatics, 2017 Examensarbete i bioinformatik 45 hp till masterexamen, 2017 Biology Education Centre and Department of Ecology and Genetics Supervisor: Niclas Backström

iii iv Table of Contents ​ ​ ​ ​

Abbreviations 1

1. Background 3 ​ ​ 1.1 Models of speciation 3 ​ ​ ​ ​ ​ ​ 1.2 Leptidea sinapis as a model organism 5 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ The key traits: diapause and direct development 6 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Experimental design 8 ​ ​ 1.3 Current project 11 ​ ​ ​ ​ 2. Project goals 11 ​ ​ ​ ​ 3. Methods 12 ​ ​ 3.1 Quality control 12 ​ ​ ​ ​ Initial quality assessment 12 ​ ​ ​ ​ Filter libraries 13 ​ ​ 3.2 From transcriptome assembly to functional annotation 15 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Transcriptome assembly 15 ​ ​ Mapping and transcript quantification 16 ​ ​ ​ ​ ​ ​ Functional annotation 16 ​ ​ Gene count consolidation 17 ​ ​ ​ ​ 3.3 Expression profiling and gene enrichment analysis 18 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Differential gene expression analysis 18 ​ ​ ​ ​ ​ ​ Enrichment analysis of gene ontology terms 19 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 4. Results 20 ​ ​ 4.1 Transcriptome assembly and transcript quantification 20 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 4.2 Functional annotation 27 ​ ​ ​ ​ 4.3 Sample correlation and sample clustering 28 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Correlation analysis 28 ​ ​ Clustering analysis 30 ​ ​ 4.4 Differential gene expression analysis 36 ​ ​ ​ ​ ​ ​ ​ ​ 4.5 Gene ontology term enrichment 41 ​ ​ ​ ​ ​ ​ ​ ​ 4.6 Circadian clock genes 49 ​ ​ ​ ​ ​ ​ 5. Discussion & Conclusions 51 ​ ​ ​ ​ ​ ​ Diapause regulation 51 ​ ​ Islands of speciation 55 ​ ​ ​ ​ Afterword 57

Acknowledgments 58

References 59

Supplementary information 71 ​ ​ 0

Abbreviations baseMean average of expression values across all samples ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ bp ​ ​ DGE differential gene expression ​ ​ ​ ​ DOWN down-regulated (gene) ​ ​ FDR false discovery rate ​ ​ ​ ​ GO gene ontology ​ ​ log2FoldChange log2 of ratio of gene expression values across treatments ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ PCA principal component analysis ​ ​ ​ ​ RNA-seq RNA sequencing ​ ​ p-adj FDR-adjusted p-value ​ ​ UP up-regulated (gene) ​ ​

1

2 1. Background ​ ​ 1.1 Models of speciation ​ ​ ​ ​ ​ ​ A core question directly or indirectly underpinning current research in the field ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of evolutionary biology concerns the forces and mechanisms shaping the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ emergence of new species. Speciation has been a problematic concept since its ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ groundbreaking introduction 150 years ago and Darwin himself struggle to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ articulate how observed patterns of biodiversity could be explained solely on the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ basis of natural selection and local adaptation (Darwin, 1859, Ch. 6). A major step ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ forward was taken during the first half of the twentieth century when population ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genetics helped to reconcile Darwinian principles with Mendelian genetics ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Fisher, 1930; Haldane, 1932; Wright, 1931; Dobzhansky, 1937). The modern ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ synthesis established the theoretical groundwork upon which the dynamics of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ speciation was to be explored during the following decades by showing how ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ changes in allele composition in natural populations were dependent on factors ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ such as mutations, selection, migration, genetic drift, and effective population ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ size.

Using the modern synthesis as a stepping stone, several models have been put ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ together in an attempt to capture the evolutionary forces behind population ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentiation and the emergence of new species. In sexually reproducing ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ species, the most straightforward model of speciation involves strict ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reproductive isolation as a consequence of effective geographic separation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (allopatric speciation) (Mayr, 1963). A complementary model has explored ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ speciation in settings where the two diverging populations share the same ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ geographic area (sympatric speciation) (Maynard Smith, 1966; White, 1978). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ During allopatric speciation, gene flow ceases to take place between the two ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ populations, which then diverge due to the presence of novel mutations, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ exposure to different selective forces, and stochastic fixation of ancestral ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ polymorphisms through genetic drift. When the two radiating populations live in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sympatry, gene flow remains possible and speciation requires the introduction of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ extra mechanisms – such as a niche-inducing heterogeneous environment and/or ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ assortative mating – responsible for abating recombination as the population ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ segregates (Maynard Smith, 1966; White, 1978; Coyne and Orr, 2004; Shapiro et ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ al., 2016). ​ ​ ​

3 The models briefly discussed above provide the conceptual framework under ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ which much of the research and discussions in the field have proceeded up to our ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ days. Yet, in spite of all their sophistication, prediction capabilities, and sound ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ mathematical formulation, until recently proper assessment of the different ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ speciation models was hampered by the lack of data that could somehow ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ elucidate how population differentiation actually takes place at the molecular ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ level. With the advent of high-throughput sequencing technologies, this situation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ has now changed and a major research effort is currently being made to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ investigate the genetic basis of speciation by addressing issues such as: the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ extent and frequency of speciation events involving gene flow, the role played by ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genome architecture (e.g., chromosomal recombination sites) in generating ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reproductive isolation, the role played by sex , the importance of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ background selection and selective sweeps, how phenotypic plasticity may ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ contribute to adaptive radiation, the impact of horizontal gene transfer, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ hybridization and introgression, how to identify telltale signs of speciation as it ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ takes place, and also how speciation may be nudged along as a consequence of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ local adaptation and life history traits (for detailed discussion, see (Wolf and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Ellegren, 2017; Schneider and Meyer, 2017; Shapiro et al., 2016)). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

An idea that has gained some traction in recent years suggests that different ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ parts of the genome may diverge at different rates (often called 'islands of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentiation' or 'islands of speciation'), and that speciation may sometimes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (often?) precede complete genomic impermeability (Wu and Ting, 2004). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ However, and in spite of all the wealth of data and insights accumulated to date, a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ debate has flared up on the meaning and significance of many of the results ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ obtained so far. As it turns out, interpretation of observed genomic landscapes is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ far from a simple art, mostly because similar genomic signatures may be ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ produced by highly contrasting scenarios, some of which having little to do with ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ population divergence (Wolf and Ellegren, 2017; for extended discussion, see ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ recent theme issue in Journal of Evolutionary Biology1). An outcome of this ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reevaluation is the realization that true headway in the nascent field of speciation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genomics will be attained only by employing more comprehensive strategies ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ aimed at disentangling the multiple factors at play (Ravinet et al., 2017). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

1 Journal of Evolutionary Biology, Volume 30, Issue 8, August 2017. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 4 1.2 Leptidea sinapis as a model organism ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ In order to further our understanding of the key forces underlying population ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentiation and speciation, the Backström lab is currently carrying out a wide ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ range of experiments in common wood-white butterflies (Leptidea sinapis; ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Lepidoptera: Pieridae). Butterflies have become reference model organisms in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ speciation genomics, namely through landmark studies based on the monarch ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Danaus plexippus) and the common postman butterfly (Heliconius melpomene) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Zhan et al., 2011; Zhan and Reppert, 2013; Jiggins et al., 2001). ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

Leptidea butterflies have in recent years become a popular go-to system for the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ study of local adaptation. They are distributed over a large geographic range, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ from Scandinavia to Southern Europe and all the way to Central Asia, covering a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ wide variety of habitats, leading members of this species to exhibit differences in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ life history characteristics, pupae morphology, and host plant preference (Freese ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and Fiedler, 2002; Friberg et al., 2008a; Friberg and Wiklund, 2007; Lorkovic, ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 1993). Recently, Leptidea butterflies have also become a model organism for the ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ study of cryptic speciation, as it was revealed that this system includes three ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ species (L. sinapis, L. juvernica, and L. reali) that are virtually identical in external ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ appearance and are distinguishable mostly through their sexual morphology and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ pheromone signaling (Dinca et al., 2011). Moreover, each one of these three ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ species, which can share ranges and thus live in partial sympatry, is further ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ subdivided into multiple ecotypes, each covering a specific geographic area. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

The ecotypes associated to the same Leptidea species distinguish themselves not ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ only through differences in habitat choice but also through steep karyotype ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ variation across their latitudinal cline. While L. sinapis populations in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ south-western part of the distribution range (e.g. Spain) have 106 chromosomes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (2n), number decreases in a clinal fashion across the range and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ eastern and northern populations have 56-58, an intra-species karyotype range ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ seldom seen in other extant diploid or plant species (Lukhtanov et al., ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ 2011; Dinca et al., unpublished data). In spite of these genomic and behavioral ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differences, individuals belonging to different ecotypes will reproduce when ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ brought together, although offspring fitness tends to be reduced for F2 and later ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ generations (Wiklund et al., unpublished). This has raised the prospect that, ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ although still part of the same species, the different ecotypes may be evolving ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ towards speciation as a result of local adaptation, and that signs of this process ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ may be already visible at the genomic level. Genes underpinning traits diverging ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

5 among the different ecotypes would be prime candidates in the search for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ incipient islands of differentiation spearheading eventual speciation events. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

The key traits: diapause and direct development ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Insects respond to local environmental conditions by adopting specific ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ life-history strategies. Season length is one such key factor and many insects in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ temperate regions go into diapause, or hibernation, over the cold season, a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ period during which metabolic activities are kept to a minimum and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental processes come to a standstill (Ko tál and Shimada, 2001). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ š ​ ​ ​ ​ ​ ​ Diapause initiation and termination depends on photoperiod ‒ the number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ daylight hours ‒ and to a certain extent also on temperature and diet (Sauders, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 1976; Beck, 1980). The onset of diapause can only take place during a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ species-specific stage in the life of an (photosensitive period), a period ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ during which it must be exposed for several days to diapause-inducing light and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ temperature conditions before arrested development is irreversibly initiated. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

For species covering a wide latitudinal cline, diapause inception and duration can ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ vary widely along the species's range. In the northern tip of its range, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ overwintering wood-whites remain in diapause until the late spring, have one ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ single brood and the offspring go straight into diapause while in their pupal stage ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ still during the summer (they are thus univoltine)(Friberg et al., 2008b). In ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ Southern Europe, wood-whites are multivoltine (have several generations per ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ season) (Tolman, 1997) and stay off diapause for a much longer period (early ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ spring to late fall). Southern individuals emerge as imagos (adults) during the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ early spring and give rise to offspring that, unlike their northern counterparts, do ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ not enter diapause during their pupal stage. Instead, they go into direct ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ development, their pupal stage lasting only long enough for metamorphosis to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ take place. They emerge as imagos about 7-10 days later, mate and give rise to a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ new brood. This life-cycle is repeated up to four or five times before diapause is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ finally triggered sometime during the fall. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Northern and southern L. sinapis populations have a different critical day length, ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ with northern specimens requiring a longer photoperiod in order to come out of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause (Wiklund, private communication). Lab experiments have shown that, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in order for northern wood whites to go into direct development, they need to be ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ exposed to long photoperiods (>18 hours) for an extended period of time ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Friberg et al., 2008b). Daylight hours are longer in the south than in Northern ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Europe during the late part of the summer and early fall. Yet, although day length ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

6 quickly dwindles in the north as the summer progresses, southern individuals ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reared in the north do not adjust their life-cycle accordingly, instead continuing ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to produce several broods before the onset of diapause is finally triggered very ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ late in the fall (Friberg, personal communication). These results raise the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ hypothesis that, as a consequence of local adaptation, regulatory mechanisms ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ underpinning diapause and direct development in L. sinapis may have started to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diverge among the different ecotypes. ​ ​ ​ ​ ​ ​ ​ ​

Insect diapause has long been the subject of scientific research (Sauders, 1976; ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Beck, 1980). That the trait is under genetic control has long been confirmed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Ko tál and Shimada, 2001) but the exact mechanisms and pathways involved š ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ remain unknown. Several studies carried out on insect model organisms suggest ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ that circadian clock genes play an important role in diapause regulation (Goto ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and Denlinger, 2002; Tauber et al., 2007; Yamada and Yamamoto, 2011; Bajgar et ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ al., 2013; Meuti and Denlinger, 2013; Meuti et al., 2015; Pegoraro et al., 2016; ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ Pruisscher, 2016). The circadian clock is a biochemical oscillator that regulates ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ daily metabolic processes in response to external cues (e.g. light), and it has long ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ been suggested that the photoperiodic switch mechanism is at its core based on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ circadian functions (Bünning, 1936; Saunders, 2012). It is also possible that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ circadian clock genes play dual roles, regulating daily rhythms and diapause ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ using distinct pathways, or that their role is dependent on where in the body they ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ are expressed (Bajgar et al., 2013). Be as it may, no conclusive evidence has been ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ gathered so far on how this happens, or even that circadian clock genes are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ indeed the key masters behind diapause regulation (Bradshaw and Holzapfel, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 2010; Ko tál, 2011). ​ ​ š ​ ​

Further compounding this problem, different insect species experience diapause ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ at different developmental stages, leading us to believe diapause has evolved ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ independently multiple times. For instance, in the wasp Nasonia vitripennis, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ diapause occurs during the larval stage but the actual decision on developmental ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ trajectories is made by adult females prior to egg deposition (Pegoraro et al., ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ 2016). Additionally, there is no single universal clock mechanism. While the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ circadian clock is usually underpinned by four core genes ‒ clk (clock), cyc ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ (cycle), per (period), and tim (timeless) ‒ the mechanisms used to modulate their ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expression can vary. Among insects, Drosophilids, Lepidoptera, and Nasonia ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ modulate expression of core circadian genes using, respectively, the cry1, both ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ cry1 and cry2, and the cry2 cryptochrome genes (Saunders, 2012; Meuti and ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Denlinger, 2013). ​ ​

7

It has been noted that circadian rhythms influence myriads of cellular ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ biochemical processes and thus it is not altogether surprising to find that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause too falls under their sway. As such, it would seem more likely that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ circadian genes act as indirect modulators and not as the main component of the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ photoperiodic timer (Bradshaw and Holzapfel, 2010). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

An alternative to the circadian model of diapause regulation, the hourglass timer ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ hypothesis, asserts that photoperiodism is underpinned by a set of biochemical ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reactions taking place during nighttime, with diapause inducing conditions being ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ triggered if the whole cascade is terminated before dawn (Lees, 1953; Lees, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 1973; Veerman, 2001; Ko tál, 2011). Highlighting the independence between ​ ​ ​ ​ ​ ​ š ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ photoperiodic and circadian clocks, the photoreceptor associated with the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ former is hypothesized to be an opsin, and not the cryptochrome used in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ circadian system (Veerman, 2001). The validity of this model rests on the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ identification of the biomolecules involved in the nocturnal cascade, as well as in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ showing how concentration levels of some of the end products could lead to the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ onset of diapause. Interestingly, studies carried out on the large-white butterfly ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Pieris brassicae), the common silkworm (Bombyx mori), and the Chinese oak ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ silkmoth (Antheraea pernyi) suggest that dopamine ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (3,4-dihydroxyphenethylamine), an amine neurotransmitter whose ​ ​ ​ ​ ​ ​ ​ ​ concentration varies according to light conditions, is highly implicated in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ processes leading to, maintaining, and terminating diapause (Isabel et al., 2001; ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ Noguchi and Hayakawa, 2001; Wang et al., 2015a). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

As concerning the wood-white butterfly, there is currently no information ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ available on the genetic pathways lying behind the decision on whether a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ specimen is to go into diapause or opt instead for direct development. As ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause regulation varies along the species' range, identification of the genes at ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ its core may help us understand how local adaptation evolves and how that in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ turn might lead to lineage differentiation. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Experimental design ​ ​ To assess which genetic pathways are involved in the switch between direct ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ development and diapause, we raised L. sinapis (swedish ecotype) under ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ different light treatments and collected samples at each developmental stage in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ order to investigate differences in gene expression levels across treatments. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Mated L. sinapis females were collected in central Sweden (Uppsala region) in the ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

8 spring 2016 and kept in captivity in a common garden. This population is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ generally univoltine but a smaller second generation can sometimes be observed. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ In the experimental setup implemented, eggs collected from three unrelated ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ females were divided into two groups and reared under either short-day (12 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ hours daylight + 12 hours dark) or long-day (22 hours daylight + 2 hours dark) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ light conditions (Fig. 1). Lights-off switch time was synchronized between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments. Temperature was kept constant at 25 ℃ in both light treatments. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Larvae were fed ad libitum on Lotus corniculatus, their natural host plant. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Developing individuals were harvested during larval stages instar-III and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-V, and during pupation. Adult samples were also collected for the long-day ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ light treatment. Based on data from other species in the Pieridae family, it is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ believed the decision on whether to go into diapause or direct development is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ taken during instar-IV (Friberg et al., 2011). Apart from instar-III, six samples (3 ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ males and 3 females, one from each family) were collected for each ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental stage and light treatment. Only three samples were collected for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-III (unknown sex). As a control, ten extra individuals were raised under ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Figure 1 | Illustration of the experimental setup. Offspring from three unrelated females were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ divided into two treatment groups and raised under short-day and long-day light conditions. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Samples used to assess gene expression differentials across treatments were collected from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ individuals harvested during instar-III, instar-V and pupal stages, and also as imagos when ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ raised under long-day light conditions. Controls were kept for each treatment group to verify ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ that the long-day treatment resulted in direct development and short-day treatment resulted in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause induction [L. sinapis images adapted from plate by Karl Eckstein (Wikimedia ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Commons)].

9 each light treatment. All ten individuals raised under long-day light conditions ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ went into direct development and emerged as adults after a brief period (7-10 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ days) spent as pupae. All ten individuals reared under short-day light conditions ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ went into diapause upon arriving to the pupal stage (and were still in diapause at ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the time this work was initiated). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

RNA extracted from the whole body of each specimen sampled was sent for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sequencing at NGI Stockholm. Library preparation was done using the dUTP ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ protocol (Illumina TruSeq Stranded mRNA with Poly-A selection) and sequencing ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ took place on an Illumina HiSeq 2500 using three lanes (Lanes 4, 5 and 6). The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ concentration of RNA varied between samples. All low-concentration samples ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (12) were sequenced using Lane 4. Each of the remaining 24 samples was ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ divided in two and sequenced in both Lane 5 and Lane 6 (two libraries were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ prepared for each of these samples). The raw sequencing data was in the form of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Table 1 | General information about the L. sinapis genome, RNA samples, library ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ preparation, and sequencing technology. ​ ​ ​ ​ ​ ​ ​

10 Illumina paired-end short reads 126 bp long. Based on an assumption of 15k ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes and an average gene length of 1k bp, the average raw coverage ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ across samples was estimated to be around 334x. General information about the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ L. sinapis genome, RNA samples, library preparation, and sequencing technology ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ is shown in Table 1. ​ ​ ​ ​ ​ ​ ​ ​

1.3 Current project ​ ​ ​ ​ In this project, we are to develop and implement a bioinformatics pipeline for the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ assembly, annotation and analysis of the transcriptome of L. sinapis with the aim ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ of identifying key genes and pathways associated with diapause, a development ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ regulatory mechanisms believed to be under local adaptation. The ultimate goal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ is to use RNA-seq short read data to delineate a map of the gene network ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ regulating this key life-history characteristic, in the process identifying genomic ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ regions likely to feature among L. sinapis' current or emerging genomic islands of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentiation.

2. Project goals ​ ​ ​ ​ The main goals of this project are to: ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ i) develop and implement a bioinformatics protocol required to assemble both ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the developmental stage specific and global transcriptomes of the common ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ wood-white butterfly (L. sinapis) Swedish ecotype. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ii) carry out gene expression analysis in order to determine lists of genes for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ which there is a statistically significant expression differential (between stages ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and/or between treatment groups); analysis of gene expression data will focus ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ on two distinct, complementary dimensions: differences between light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments, and ontogeny. ​ ​ ​ ​ iii) investigate functional categories for genes whose expression differs between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment groups and identify key genes and pathways regulating diapause. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

11 3. Methods ​ ​ In this section, we provide a detailed description of the different pipelines used at ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ each stage of the project. ​ ​ ​ ​ ​ ​ ​ ​

In the first part of the project, a standard bioinformatics pipeline was ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ implemented to process raw reads and filter out unwanted sequences (quality ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ control).

Next, processed reads were assembled into a single transcriptome. This was ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ followed by mapping of the reads associated to each sample against the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ assembled transcriptome and quantification of gene expression levels. Functional ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ annotation of the transcriptome was done via homology analysis against a set of ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ selected reference genomes. ​ ​ ​ ​

In the third and last part of the project, we started by computing differential ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ levels of gene expression – across light treatments and/or developmental stages ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ – and evaluating whether observed expression differentials were statistically ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ significant. We then used gene ontology analysis to identify the functional ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ categories (i.e., likely biological functions) associated to the genes of interest. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ information thus collected was finally employed to delineate the likely pathways ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ regulating diapause and direct development in L. sinapis. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​

For name and location of all scripts employed, see Supplementary information, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ Table S1. ​ ​

3.1 Quality control ​ ​ ​ ​ Initial quality assessment ​ ​ ​ ​ The quality of the raw reads associated to each of the libraries was assessed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ using FastQC, v. 0.11.5 (Babraham Bioinformatics, 2016). Initial quality ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ assessment was performed separately for each replicate with the aim of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ identifying potential issues including the presence of adapter sequences, artifacts ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ induced by the sequencing technology employed, read-ends with low Phred ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ values, and gross disparities between replicates. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

12 Filter libraries ​ ​ In order to address the problems identified in the initial quality assessment and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ improve the overall quality of the libraries, the latter went through a sequence of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ five filtering steps. ​ ​ ​ ​

Initial filtering of raw reads was performed separately for each library using ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ TrimGalore, v. 0.4.1 (Babraham Bioinformatics, 2017a). For each read, the first 12 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ nucleotides from the 5' end and the first 3 nucleotides from the 3' end were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ trimmed in order to remove stretches with {A,C,G,T} nucleotide content deviating ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ from the global average, an artifact known to be induced by the Illumina ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sequencing platform used (Hansen et al., 2010). Illumina reads are also prone to ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ quality deterioration at the end of the read and for this reason terminal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ nucleotides with Phred score < 30 were trimmed. Because adapters located at ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the ends of the cDNA segment of interest are often also sequenced, in full or just ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ partially, we also looked for and filtered out adapter sequences (Illumina ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ universal adapter and Illumina index adapter). Finally, we also filtered out ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sequences shorter than 30 bp (after trimming), as very short reads have a high ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ probability of mapping to multiple locations in the genome and thereby cause ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ biases in gene expression and variant calling analyses. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

The second filtering step involved the masking (N) of low quality nucleotides ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Phred quality score < 20), located anywhere in the read, using Fastq Masker ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ from the FASTX, v. 0.0.14 toolbox (FASTX, 2010). Such low quality nucleotides are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ deemed to be uninformative. ​ ​ ​ ​ ​ ​

In the third filtering step, we used prinseq, v. 0.20.4 (Schmieder and Edwards, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 2011) to trim Poly-A tails (as 3' polyadenylation is a post-transcriptional ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ modification) and filter out homopolymers (low complexity sequences). The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ latter include read segments with a complexity score > 7 according to the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ BLAST/DUST algorithm (Morgulis et al., 2006). This threshold filters out any ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ read containing a T{≥19}, A{≥19}, C{≥19}, or G{≥19} sequence, as well as longer ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ dinucleotide and trinucleotide repeats. Reads containing homopolymers are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ deemed uninformative as they can map to multiple genome regions during ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ assembly and, moreover, they slowdown the assembly process. prinseq was also ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ used to trim masked nucleotides located at the read ends (uninformative), filter ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ out sequences with more than 10% of masked nucleotides (low average quality), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and filter out sequences shorter than 30 nucleotides. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

13 In the fourth filtering step, we used condetri, v. 2.3 (Smeds and Künstner, 2011) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to select only reads with Phred quality score > 30 in more than 80% of the read ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (high quality sequences). ​ ​ ​ ​

In the fifth and last filtering step, we made use of FastQ Screen, v. 0.9.2 (Babraham ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ Bioinformatics, 2017b) and bowtie2, v. 2.2.9 (Langmead and Salzberg, 2012) to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ identify and screen out reads of ribosomal origin2, bacterial and human ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ contaminants, and reads associated to known L. sinapis repeat sequences. FastQ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ Screen screens for contaminants using reference libraries provided by the user. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ We used the SILVA/insecta database (Quast et al., 2013) to screen for 16S/18S ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ small subunit rRNA and 23S/28S large subunit rRNA (downloaded March 20, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 2017). The Rfam database (Nawrocki et al., 2015) was used to screen for 5S/5.8S ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ rRNA (downloaded March 20, 2017). An in-house L. sinapis repeat library created ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ by Alexander Suh and Venkat Talla3 (Talla et al., 2017) was used to filter out ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reads associated to L. sinapis repeat sequences. Screening for human ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ contaminants was carried out using a library based on the Homo sapiens high ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ coverage assembly GRCh38 available at Ensembl4. Screening for Wolbachia ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ bacterial contaminants was performed on the basis of all sixteen strands ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ available at Ensembl-bacteria5. ​ ​ ​ ​​

Because libraries used during bacterial contamination screening must be ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ prepared by the user and often include only the usual suspects (eg, Wolbachia ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ symbionts when sequencing specimens), some of the samples were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ also scanned using Kraken, v. 0.10.5-beta, an ultra-fast metagenomics screening ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ package that includes a library of more than 2,000 bacterial genomes and viral ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ signatures (Wood and Salzberg, 2014). ​ ​ ​ ​ ​ ​ ​ ​

2 rRNA constitutes up to 80% of the total RNA present in a cell and although steps are taken to remove all ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ rRNA sequences during library preparation, a small amount often evades such attempts and gets ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sequenced together with the cDNA segments of interest. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 3 L. sinapis repeat library available at ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ /proj/b2014034/nobackup/private/alexs/rmodel1.0.7/11leps/11lep_rm1.0_hex.lib 4 Human cDNA sequences downloaded on February 5, 2017 from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz . ​ ​ 5 Wolbachia sequences downloaded on February 5, 2017 from http://bacteria.ensembl.org. Strands ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ included: Wolbachia, Wolbachia endosymbiont of Cimex lectularius, Wolbachia endosymbiont of Culex ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ quinquefasciatus JHB, Wolbachia endosymbiont of Culex quinquefasciatus Pel, Wolbachia endosymbiont of ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ Drosophila ananassae, Wolbachia endosymbiont of Drosophila melanogaster, Wolbachia endosymbiont of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ Drosophila simulans, Wolbachia endosymbiont of Drosophila simulans wHa, Wolbachia endosymbiont of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Drosophila simulans wNo, Wolbachia endosymbiont of Glossina morsitans morsitans, Wolbachia ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ endosymbiont of Muscidifurax uniraptor, Wolbachia endosymbiont of Onchocerca ochengi, Wolbachia ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ endosymbiont strain TRS of Brugia malayi, Wolbachia pipientis wAlbB, Wolbachia pipientis wMelPop, and ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ Wolbachia sp. wRi. ​ ​ ​ ​ 14 As already mentioned, sequencing of some of the replicates took place in two ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ distinct lanes. Upon quality control, libraries associated to the same biological ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ replicate were consolidated into a single library (number of libraries reduced ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ from 60 to 36, one for each sample). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

3.2 From transcriptome assembly to functional annotation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Transcriptome assembly ​ ​ De novo transcriptome assembly was performed using the Trinity, v. 2.3.2 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ RNA-seq assembler (Grabherr et al., 2011; Haas et al., 2013). Trinity was run ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ using default parameters, apart from the minimum contig length which was set at ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 126 (corresponding to Illumina’s read length). Multiple transcriptomes were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ initially produced with the aim of evaluating transcriptome robustness. De novo ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ assembly was done initially sample by sample (replicate by replicate), then by ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ experiment, by stage, by lane (the three lanes used during Illumina sequencing), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ by family, by sex, and using all samples. All transcriptomes were characterized in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ terms of number of contigs, average GC content, and contig ExN50. The latter is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ similar to the standard N50 contig statistic (see, for example, Miller et al., 2010) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ but is computed using only the top most highly expressed transcripts. ExN50 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ values were computed using the protocol described in the Trinity website (Haas, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ 2016). Finally, and on the basis of this analysis, as well as on the percentage of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reads on each library that map to each assembled transcriptome (see §’Mapping ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and transcript quantification’, infra), we selected the set of samples that produced ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the highest quality transcriptomes and used them to assemble an optimized, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ consensus transcriptome. Of the initial 36 libraries, 25 were used to produce the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ optimized transcriptome (sample list provided in Supplementary information, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ Table S2). ​ ​

As a rule, Trinity produces transcriptomes with a number of contigs that far ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ exceeds the estimated number of genes/gene-variants expected to be present in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the genome. Although a hard threshold could have been used to trim the number ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of contigs (e.g., removal of all contigs smaller than 200 bp), we opted for not ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ performing contig filtering at this stage. Instead, de facto trimming was carried ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ out first after transcripts were annotated (§’Gene count consolidation’, infra), and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ then during downstream analysis using well established protocols (e.g., as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ implemented by DESeq2, as described below), usually on the basis of transcript ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ coverage.

15 Mapping and transcript quantification ​ ​ ​ ​ ​ ​ Mapping of reads to the transcriptome and quantification of transcript ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ abundance wase done using KALLISTO, v. 0.43.0, a novel ultra-fast pseudo aligner ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ 6 (Bray et al., 2016), using 100 bootstrap samples. ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

As part of the transcriptome quality assessment described above ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (§’Transcriptome assembly’), we also checked the percentage of reads that map ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ back to each of the assembled transcriptomes, and compared these values with ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ those obtained when the transcriptome is assembled using the L. sinapis genome ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ as a reference7. Alignment and quantification on the basis of a reference genome ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ was done using either the Tophat v. 2.0.12/Cufflinks v. 2.2.1 protocol (Trapnell et ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ al., 2009; Trapnell et al., 2010; Trapnell et al., 2012), the HISAT v. 2.0.5/StringTie ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ v. 1.2.0 protocol (Kim et al., 2015; Pertea et al., 2015; Pertea et al., 2016), or the ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ STAR v. 2.5.1b/StringTie protocol (Dobin et al., 2013). Default parameters were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ used in all cases. ​ ​ ​ ​ ​ ​

Functional annotation ​ ​ Functional annotation of the transcriptome was conducted using a modified ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ version of the Trinotate, v. 3.0.2 suite (Haas, 2017a). Transcripts were processed ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ using TransDecoder, v. 3.0.1 and blasted against the Pfam database8 (as ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ implemented by Trinotate). Additionally, (raw) transcripts were also blasted ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ directly against the following reference organisms: Bombyx mori (Silk moth), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ Danaus plexippus (Monarch butterfly), Papilio machaon (Old World swallowtail ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ butterfly), and Drosophila melanogaster9. The three lepidopteran are the closest ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

6 Traditional mapping methods (so-called true aligners, as those implemented by the HISAT2/StringTie ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ protocol) are based on aligning each read to the reference genome/transcriptome; pseudo-aligners such ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ as KALLISTO and SALMON (Patro et al., 2017) use bootstrapping to determine the segment in the ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reference genome/transcriptome most likely to have produced the read. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 7 We used an in-house L. sinapis reference genome assembled by Venkat Talla. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 8 Downloaded on July 5th, 2017 from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz ​ 9 All proteomes downloaded from UniProt on July 5, 2017. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ B. mori: ​ ​ ​ ftp://ftp..org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eu ​ karyota/UP000005204_7091.fasta.gz; ​ D. plexippus: ​ ​ ​ ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eu ​ karyota/UP000007151_13037.fasta.gz; ​ P. machaon: ​ ​ ​ ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eu ​ karyota/UP000053240_76193.fasta.gz; ​ D. melanogaster: ​ ​ ​ ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eu ​ karyota/UP000000803_7227.fasta.gz 16 species to L. sinapis among UniProtKB’s reference proteomes, while D. ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ melanogaster is the key model organism within the Insecta class. Gene ontology ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (GO) terms associated to each blast hit were subsequently retrieved from a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ locally kept copy of the UniProtKB database using a homemade script (link ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ provided in Supplementary information, Table S1). The local UniProtKB database ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ is based on the invertebrates UniProtKB database and includes both the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Swiss-Prot and TrEMBL segments of the database10. We decided to include ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ TrEMBL because to date most lepidopteran genes have only been electronically ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ annotated, thus not making the grade to be included in the much smaller, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ manually-annotated Swiss-Prot11. ​ ​​

Gene count consolidation ​ ​ ​ ​ The annotation file described above clusters all transcripts associated to the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ same gene (as identified by Trinity) into one unique entry. Trinity genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to the same drosophilid and/or lepidopteran gene were subsequently ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ consolidated into a single entry (as were the associated GO terms). Concurrently, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ for each library, counts associated to each transcript (transcript abundance) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ were consolidated into counts-per-gene using tximport, v. 1.4.0 (Soneson et al., ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ 2015), a R/Bioconductor package (R, v. 3.4.1; Bioconductor, v. 3.5). No ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ ​ normalization (by library size or transcript length) was carried out at this stage ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ as normalization will be performed later on during DGE analysis. When Trinity ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes were consolidated into a single entry based on the annotation file (as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ described above), counts-per-gene were adjusted accordingly (all counts ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to the same gene were added together). Finally, we filtered out genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ without annotation detected in just one sample (deemed to be noise). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

For association between proteome ID and species, see: ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/RE ​ ADME 10 Swiss-Prot and TrEMBL databases downloaded on July 3, 2017. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Swiss-Prot: ​ ​ ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uni ​ prot_sprot_invertebrates.dat.gz TrEMBL : ​ ​ ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uni ​ prot_trembl_invertebrates.dat.gz 11 In the standard Trinotate protocol, transcripts are blasted only against the Swiss-Prot database, as the ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ latter is restricted to manually annotated genes. However, transcripts are blasted against all species ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ included in Swiss-Prot, which is dominated by model organisms (human, mouse, etc) and includes very ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ few lepidopteran genes. As a consequence, most hits are associated to very distant species and the GO ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ terms thus obtained are of questionable validity when applied to Lepidoptera. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 17 3.3 Expression profiling and gene enrichment analysis ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Differential gene expression analysis ​ ​ ​ ​ ​ ​ DGE analysis was carried out using DESeq2, v. 1.16.1 (Love et al., 2014), a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ well-established R/Bioconductor package compatible with the ​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ KALLISTO/tximport suites. DESeq2 was implemented following the instructions ​ ​​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ provided by the authors (Love et al., 2016). Counts per gene were rounded to the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ closest integer and low coverage genes with baseMean (count average across all ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ samples) < 1 were removed prior to the DGE analysis proper. Batch effects were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ removed using sva, v. 3.24.4 , an R/Bioconductor package (Leek et al., 2012). Once ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ a dataset and variable of interest are declared (e.g., compare instar-III replicates ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ across light treatments), sva detects hidden technical, biological, and ​ ​ ​ ​ ​ ​​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ environmental batch effects, which are modeled through the use of surrogate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ variables that can then be used to adjust expression values prior to downstream ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analysis. In our case, the idea was to use sva to detect expected sources of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ variation, such as family and sex, as well as additional artifacts present in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ data unbeknown to us. As such, the number of surrogate variables was not ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ explicitly declared; instead we let sva estimate the appropriate number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ variables, which varied between two and four depending on the datasets being ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ compared.

DGE analysis of sva-adjusted expression values was finally carried out using the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ DESeq function, a step during which counts per gene are also normalized by ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ library size. The DESeq2 protocol carries out significance testing for individual ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes using the Wald test, and uses the Benjamini-Hochberg adjustment ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Benjamini and Hochberg, 1995) to account for multiple hypothesis testing and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ control the false discovery rate (FDR) (alpha parameter was set to 0.05). DESeq2 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ generates four values for each gene: baseMean, log2FoldChange (ratio of gene ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ expression values across treatments, in log2 scale), p-value (Wald test), and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ FDR-adjusted p-value (p-adj). Only genes with p-adj<0.05, baseMean>10, and ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ |log2FoldChange|>1 were considered to be differentially expressed (the last ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ two conditions minimize the risk of observing false positives created by the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ presence of transcriptional noise). ​ ​ ​ ​ ​ ​

18 Enrichment analysis of gene ontology terms ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Enrichment analysis of GO terms was performed using the R/Bioconductor ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ package topGO, v. 2.28.0, following the instructions provided by the authors ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Alexa and Rahnenfuhrer, 2016). The list of genes of interest was defined as the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ set of differentially expressed genes identified previously using DESeq2. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ‘gene universe’ includes all annotated genes with baseMean > 6, as well as the list ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of genes of interest. topGO checks if there are biological functions (GO terms) ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ overrepresented among the genes of interest (vis-a-vis the full gene universe), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ performing test statistics to evaluate whether eventual differences are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ significant. The simplest approach involves the direct application of Fisher’s ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ exact test but this procedure usually yields an unseemly large number of false ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ positives (Alexa et al., 2006). topGo consequently makes available several ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ alternatives aimed at tackling this issue, such as the elim, weight and parent-child ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​​ ​ ​​ algorithms. These algorithms take into consideration the relationship between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ GO terms prior to running Fisher’s exact test, allowing for an improved, less ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ error-prone performance. We took a conservative approach and selected the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Fisher-elim algorithm since it provides the lowest rate of false positives, albeit at ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the risk of missing some true positives (type II errors) (Alexa et al., 2006). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

Notice that the p-values provided by any of the Fisher-based test statistics are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ per GO term and therefore have not been explicitly corrected for multiple testing. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Multiple testing correction applied in the context of GO enrichment analysis ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ remains a contested issue. For starters, such corrections tend to produce highly ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ conservative results (few or no significant GO terms) (Alexa and Rahnenfuhrer, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 2016). More problematically, algorithms such as Fisher-elim and Fisher-weight ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ estimate p-values for a particular GO term by taking into consideration ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ neighbouring terms (GO terms are linked through a graph-like structure), a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ scenario that clearly violates the independence condition underpinning classic ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ false discovery rate and family-wise error rate tests (such as the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Benjamini-Hochberg and Bonferroni corrections, respectively). In fact, it is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ claimed by its authors that results stemming from the Fisher-elim and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Fisher-weight tests are already corrected or are otherwise not affected by ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ multiple testing (Alexa and Rahnenfuhrer, 2016). For this reason, and also ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ because the current study is exploratory in nature, we decided to carry on with ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the analysis without adjusting test statistics beyond what is provided by the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Fisher-elim test. ​ ​

19 4. Results ​ ​ 4.1 Transcriptome assembly and transcript quantification ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ On arrival from sequencing, libraries contained an average of 19.9M paired-end ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reads, yielding a 334x average coverage per replicate. All libraries were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ subjected to quality control with the aim of removing Illumina artifacts, Illumina ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ adapter sequences, Poly-A tails, homopolymers, rRNA and other contaminants, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reads mapping to known repeat sequences, and mask and/or remove low quality ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ nucleotides, as described in the methods section (§3.1, supra). Upon quality ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ control, the average mean quality score associated to each nucleotide along the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ read length, Q, was 34 or above for all libraries12. The average number of masked ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ nucleotides was generally below 3% for any nucleotide position along the read ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ length and never exceeded 5%. ​ ​ ​ ​ ​ ​ ​ ​

On average, over 90% of the trimmed reads in each library mapped to the L. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ sinapis reference genome. Ribosomal contamination, as measured using FastQ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ Screen (Babraham Bioinformatics, 2017b), accounted for between 1.5 and 11% ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of each library. Reads mapping to known L. sinapis repeat sequences did not ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ exceed 1.5% (interspersed repeats make up almost 50% of the L. sinapis genome ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ assembly (Talla et al., 2017)). Reads mapping to repeat and rRNA sequences ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ were filtered out upon detection. No significant bacterial or human ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ contamination was detected. Libraries associated to one of the lanes (Lane 4) had ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ a higher proportion of sequences tagged as over-represented (up to 15% in one ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ case), even after being screened for repeat and Illumina adapter sequences, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ribosomal and bacterial/human contaminants. Samples sequenced on this lane ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ had a relatively low initial concentration (see §’Experimental design’, supra) and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ over-amplification during polymerase chain reaction (PCR) might have led to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ some sequences having become markedly more abundant. Over-representation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of some sequences can nonetheless be expected in RNA samples and they may ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reflect treatments that lead to specific genes being highly expressed. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

All samples passed quality control apart from one of the replicates collected from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ an adult female under long-day light treatment. The library associated to this ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ replicate contained no mRNA sequences that could be mapped to the L. sinapis ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ genome and was consequently discarded. One other library (also a long-day adult ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ female) had a size about four times bigger than the average library size. This ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sample was included for downstream analysis but was excluded from the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

12 The probability that a base call is incorrect, p, is given by Q = -10log10(p). For Q ≤ 34, p ≤ 0.04%. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ 20

Table 2 | Average library specs before and after quality control. Average number of reads, ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ average read length, and estimated coverage for each library before quality control (RAW), and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ after libraries were processed using TrimGalore, prinseq, condetri, and FastQ Screen. Values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ shown are per replicate. Coverage estimated as follows: coverage = ((# reads per sample) * 2 * ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 126) / ((# of genes)*(average gene length)). Number of genes and average gene length assumed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to be 15k and 1 kb, respectively. One sample with particularly high library size (60M reads after ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ quality control) was excluded from this table but included for downstream analysis. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

summary specs shown in Table 2 so as not to inflate average values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ representative of the remaining libraries. ​ ​ ​ ​ ​ ​ ​ ​

After quality control, we used Trinity (Grabherr et al., 2011; Haas et al., 2013) to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ carry out de novo assembly of transcriptomes for each experimental condition, ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and also to produce a high quality, complete L. sinapis transcriptome based on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ the best libraries for each experimental condition13. In order to evaluate the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ quality and robustness of the transcriptomes being produced, and also to select ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the libraries to be used when assembling the complete transcriptome, we ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ produced several intermediate transcriptomes based on stage, lane, family, sex, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and treatment. All in all, 122 different transcriptomes were assembled (see Table ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 3).

Different metrics were employed to appraise each transcriptome. As expected, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the total number of transcripts and predicted genes in de novo assemblies ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ increases as we increase the number of libraries used to produce the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ transcriptome, easily reaching values over 100k (see Table 4). These values are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ well above those observed in biological samples, and are due to a large extent to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

13 Complete in this context means only that it includes all transcripts produced - and captured - in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ course of this experiment. It is possible, and highly likely, that under a different treatment many genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ give rise to isoforms not detected here. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 21 Table 3 | List of de novo transcriptomes assembled using Trinity (abridged). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​

Table 4 | Basic statistics associated to each assembled transcriptome, as estimated by Trinity. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​

22 the presence of high numbers of lowly expressed transcripts, which are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ incompletely assembled, leading to transcript fragmentation14. Actual ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ transcriptome trimming will be carried out during downstream analysis, as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ described in the methods section. ​ ​ ​ ​ ​ ​ ​ ​

Average GC content per transcriptome varied between 39 and 41.3%, depending ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ on the assembly (Table 4). These numbers are slightly above the value measured ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in the L. sinapis genome (37%), shown in Table 1. Also on Table 4, we can see the ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ contig N50 and the average contig length. These statistics, widely used in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ characterization of genome draft assemblies, are considered unreliable when ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ applied to newly assembled transcriptomes since they are easily swayed by an ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ inflated number of transcripts. They are shown here only for historical reasons. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

A more reliable way to assess the quality of a newly assembled transcriptome is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to use the ExN50 statistic. This metric is similar to N50 but is computed by ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ including only the top most highly expressed transcripts (Haas, 2016). We used ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the KALLISTO pseudo aligner (Bray et al., 2016) to carry out transcript ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ quantification. Counts per transcript were not normalized at this stage ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ (normalization will be carried out during DGE analysis). Computed ExN50 values, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ shown in Fig. 2, are higher and steadier than those observed for N50. This was ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expected as ExN50 is in fact calculated by maximizing N50 (Haas, 2016). ExN50 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ deteriorates somehow when only a limited number of libraries is used to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ assemble the transcriptome, as is the case of the transcriptomes associated to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ each of the twelve treatments (shown in dark blue). The transcriptome built ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ solely on the basis of Lane 4 libraries also has a poorer ExN50, namely when ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ compared with transcriptomes based on lanes 5 and 6. As mentioned before, lane ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 4 libraries were built from low-concentration, high PCR samples. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

An added bonus of computing ExN50 is that the procedure employed is based on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the exclusion of lowly expressed transcripts, allowing us to more accurately ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ estimate the number of transcripts actually present in a sample (Haas, 2016; ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Haas, 2017b). Estimated total number of transcripts based on this procedure are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ shown in Fig. 3 for each of the transcriptomes assembled. This time around, the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of expressed transcripts is within the range one would expect to find in a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ real-world biological sample. This data also shows that transcriptomes based on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ individuals treatments or development stages have a smaller number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

14 Additional reasons behind a high number of transcripts include, inter alia, sequencing errors, the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ placement of different alleles in different transcripts, and the fragmentation of alternative splicing ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ variants. 23 transcripts. This was to be expected, as different genes are expressed under ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ different conditions. Once we start aggregating libraries by family, lane or sex, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the total number of transcripts detected shots up, stabilizing around 20-25k ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (21,189 transcripts when using the ‘Best libraries’ transcriptome). Again, results ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ for the transcriptome based solely on Lane 4 libraries suggest this transcriptome ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to be of subpar quality. The transcriptome built using all samples predicts a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ higher number of transcripts than that built using only the best samples ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to each experiment. It is likely that some of the lower quality libraries ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ used to assemble the ‘All samples’ transcriptome triggered an increase in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ transcript fragmentation and/or included reads that could not be properly ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ assembled together or matched to reads present in the other, higher quality ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ libraries.

Figure 2 | Contig ExN50 statistic computed for different assembled transcriptomes. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Transcriptomes assembled by including only libraries associated to each of the twelve ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments (dark blue), developmental stage (cyan), family (orange), sequencing lane (green), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sex (purple), all samples (salmon), and best samples from each experiment (dark red). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

24

Figure 3 | Estimated total number of transcripts. Number of transcripts calculated by ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ maximizing the contig ExN50. Transcriptomes assembled by including only libraries associated ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to each of the twelve treatments (dark blue), developmental stage (cyan), family (orange), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sequencing lane (green), sex (purple), all samples (salmon), and best samples from each ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ experiment (dark red). ​ ​ ​ ​

We also evaluated the quality of the assemblies by checking what percentage of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reads from each of the libraries used to built a transcriptome actually aligns back ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to it15. We used KALLISTO to carry out this operation. On average, 85 to 96% of all ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reads in each of the libraries align to the transcriptomes assembled by Trinity ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ and there was no obvious systematic difference between treatments, sexes or ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ families (Fig. 4). ​ ​ ​ ​

As a way of double-checking these results and further test the robustness of the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ method here employed to build de novo transcriptomes, we carried out a set of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ tests on single-library transcriptomes, first assembled with the Trinity/KALLISTO ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ combo as described above, and then with the help of the L. sinapis reference ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ genome. In particular, we wanted to check whether opting for the de novo ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ approach had not led us to forfeit transcripts that otherwise would have been ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ detected had we used a reference genome during the assembly. We thus used the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

15 High quality transcriptomes are made of a set of transcripts created on the basis of most reads present ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in the libraries used to assemble them. When we use said transcriptome as a template, most reads from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the original libraries are expected to align to it. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 25 same set of libraries to build a whole new set of transcriptomes but this time ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ around instead of deploying Trinity we made use of the the L. sinapis reference ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ genome and protocols based on either Tophat/Cufflinks (Trapnell et al., 2009), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ HISAT/StringTie (Kim et al., 2015; Pertea et al., 2015), or STAR/StringTie (Dobin ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ et al., 2013). As before, we then checked the fraction of reads in each library that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ aligns back to the newly created transcriptomes. The results, shown in Fig. 5, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ indicate that about 90% of the reads map to the transcriptomes independently of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the method used for the assembly. Finally, we used STAR/StringTie to map the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ‘Best libraries’ dataset against the L. sinapis reference genome and estimate the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ total number of transcripts. StringTie created a transcriptome containing 36,684 ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ transcripts, of which 21,205 have a 10x coverage or better. These values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ corroborate those obtained using Trinity/KALLISTO (see Fig. 3) and suggest that, ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ for this particular system, opting for the de novo approach brings no major ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ discernible drawback and may actually be a better choice than available ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ alternatives grounded on the use of a partially annotated reference genome (as is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the case with most non-model organisms). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Figure 4 | Average percentage of reads mapping back to each transcriptome. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Transcriptomes assembled by including only libraries associated to each of the twelve ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments (dark blue), developmental stage (cyan), family (orange), sex (purple), all samples ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (salmon), and best samples from each experiment (dark red). Apart from the ‘All samples’ and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ‘Best libraries’ transcriptomes, only the libraries used to produce a transcriptome were used ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ when mapping back reads to said transcriptome. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

26

Figure 5 | Percentage of reads mapping back to transcriptomes assembled using different ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ protocols. Single-library transcriptomes either assembled de novo using Trinity or by relying on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ the L. sinapis reference genome. Assembly and mapping performed de novo using ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ Trinity/KALLISTO (dark-orange), or on the basis of the L. sinapis reference genome using ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ TopHat/Cufflinks (cyan), HISAT2/StringTie (light green), or STAR/StringTie (dark green). ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​

The results presented so far suggest that the Trinity-based protocol here ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ employed is sound and can be used to assemble high quality transcriptomes. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ above analysis also served as a basis upon which we selected the best libraries ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ from each experiment to build the final, optimized transcriptome (selected ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ samples are listed in Supplementary information, Table S2). This is the ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ transcriptome referred as ‘Best libraries (all experiments)’ in Figs. 2-4, and that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ will be used during all subsequent downstream analyses. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

4.2 Functional annotation ​ ​ ​ ​ We used a modified version of the Trinotate protocol (Haas, 2017a) to annotate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the transcriptome. The standard Trinotate protocol relies exclusively on the ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Swiss-Prot database both for blasting (which is done against all species present in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the database) and gleaning GO terms. In the approach here employed, transcripts ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ were blasted only against a selected group of proteomes (Bombyx mori, Danaus ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ plexippus, Papilio machaon, and Drosophila melanogaster), and GO terms were ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ collected from the complete UniprotKB database (including both Swiss-Prot and ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ TrEMBL). This means that blast hits include only lepidopterans and Drosophila, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​

27 the most well annotated insect species, and that the biological functions thus ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ inferred have a higher probability of still being valid when applied to L. sinapis. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​

The final annotation file lists all genes associated to the same blast hit under the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ same entry. Likewise, counts per transcript were first consolidated into counts ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ per gene using tximport (Soneson et al., 2015) and then further consolidated ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ based on the annotation file. After removal of orphan entries in the annotation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ file supported by counts in just one single sample, the transcriptome was ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ trimmed from 256,436 to 127,138 putative gene entries. Of these, 20,210 are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ annotated and 16,435 have associated GO terms. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

4.3 Sample correlation and sample clustering ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Correlation analysis ​ ​ Ideally, all RNA samples associated to a particular treatment should yield the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ same genetic information concerning which genes are expressed, and by how ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ much they are expressed. In the real world, RNA-seq is known to be a rather ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ noisy technique and samples collected under the same exact conditions often ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ offer conflicting pictures on the identity of the genes at play (biological noise). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Collecting multiple biological replicates for each treatment is the standard way of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ dealing with this problem. These replicates are not going to be exactly the same ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expression-wise, but one hopes to be able to reach some consensus on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expression levels, if not for all, at least for most genes. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

In the current project, three replicates were obtained for each treatment and we ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ are interested in assessing to what extent their expression profiles are alike. To ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ this end, we took the 5,000 top most expressed genes for each treatment and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ computed the Pearson correlation coefficient between each pair of replicates ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ based on their expression profiles (expression values were first normalized by ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ library size and then log-10 transformed). For each set of three replicates, three ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ inter-replicate correlation coefficients were thus computed, which were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ subsequently averaged so as to characterize each set of replicates using one ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ single correlation value. The results, shown in Fig. 6, reveal a rather uneven ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ landscape. In some cases, replicates linked to a specific treatment have indeed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ very similar profiles (e.g., long-day instar-V females) while in other cases the ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ replicates turn out to be rather different. Long-day pupae, both male and female, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ fall into this last category. ​ ​ ​ ​ ​ ​ ​ ​

28

Figure 6 | Correlation between replicates associated to the same treatment. Average of ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Pearson correlation coefficients between expression profiles of biological replicates associated ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to each treatment, computed on the basis of the 5,000 top most expressed genes. Long-day adult ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ female treatment includes only two replicates; all other treatments include three replicates. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Averaged correlation coefficients significant to p<0.001. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Three major reasons lie behind inter-replicate divergence. As mentioned above, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ there will always be some degree of biological variation between samples ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ collected under the same conditions. Another source of variation are errors and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ biases induced by the sequencing platform and during data processing (technical ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ noise). For instance, ‘Lane 4’ libraries, flagged above as having subpar quality, are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ not uniformly represented among treatments. Finally, one must keep in mind the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ presence of factors other than those explicitly declared for each treatment. In the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ present study, all samples belonged to three L. sinapis familes, by which we mean ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ that all specimens descend from three distinct mother lines (dubbed family 3, 10, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and 11). In our experimental design, we hoped the three replicates associated to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ each treatment, even if not absolutely identical, would nevertheless be more ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ alike between them than to replicates associated to a different treatment. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ However, this is not necessarily the case. As shown in Fig. 7, there is a high ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ degree of correlation between the expression profiles of replicates originating ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ from the same family, even when these are sampled from different ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental stages. A case at point are the long-day pupal replicates. When ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ aggregated by treatment, these replicates, which originate from distinct families, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ show only loosely correlated expression profiles (Fig. 6). When we take all ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 29 long-day pupal replicates, both male and female, and measure how they correlate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ when we group them by family, correlation values increase by 35% for at least ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ two of the families (family 3 and 10), as shown in Fig. 7. These results strongly ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ suggest that expression values in the L. sinapis samples by us collected are likely ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to be modulated by several factors, including those associated to family traits ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (maternal effects). As we will see in the next section, sex and developmental ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ stage are also major determinants of how samples group and distinguish ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ themselves from each other, while light treatment emerges as a second-order ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ effect observable only after sex and family effects (here treated as batch effects) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ are removed from the expression data. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Figure 7 | Correlation between replicates associated to the same family. Average of Pearson ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ correlation coefficients between expression profiles of biological replicates associated to same ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ biological family. Light-orange: instar-III and V replicates (male and female); Dark-orange: ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ pupal replicates (male and female). Correlation coefficients were computed on the basis of the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 5,000 top most expressed genes and are significant to p<0.001. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Clustering analysis ​ ​ Exploratory analysis of gene expression data associated to each treatment was ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ further advanced using principal component analysis (PCA). As with correlation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analysis, the aim was to get an insight on the main factors clustering the samples. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ We were particularly interested in comparing samples across different ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental stages and light treatments, as this will be the focus of the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differential expression analysis to be carried further downstream. PCA was ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 30 carried out using MATLAB’s pca function (MATLAB, 2016) and was based on the ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ top 5,000 genes with the highest expression variance across treatments. ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Expression values were normalized by library size, log10 transformed, and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ zero-centered16 prior to PCA analysis. ​ ​ ​ ​ ​ ​ ​ ​

Results from PCA of samples subjected to short-day light treatment are shown in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Fig. 8. The analysis was carried out first for instar-III and V samples, and then for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-V and pupal samples. In the latter case, samples clearly aggregate by ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental stage (Fig. 8, bottom). For pupal samples, female replicates also ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ tend to aggregate together, suggesting that expression profiles reflect both ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental stage and sex characteristics. PCA of larval samples offers a more ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ nuanced picture (Fig. 8, top). Instar-III and instar-V samples cannot easily be told ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ apart and some, but not all, replicates aggregate instead by family. For larval ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ samples, expression values seem to be determined by multiple factors, including ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental stage, family, sex, and light treatment. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Analysis of samples subjected to long-day light treatment yielded rather similar ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ results (Fig. 9). Instar-V, pupal and adult samples give rise to individual clusters ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ that stand apart from each other (Fig. 9, middle and bottom image). Adult ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ samples are further differentiate by sex (Fig. 9, bottom). Larval samples once ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ again fail to reach complete segregation, with PCA revealing expression values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ are being shaped by both developmental stage and family traits (Fig. 9, top). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

We carried out a separate analysis of instar-V samples to check whether sex or ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ family determines how they cluster. Results from this analysis reveal clustering ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of instar-V samples subjected to short-day treatment is shaped by both sex and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ family factors, while for long-day treatment family is by far the most important ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ factor determining how samples cluster (Fig. 10). Larval specimens under ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ long-day treatment are expected to have made the decision to go into direct ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ development by this stage. Our results suggest that, upon activation, pathways ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to direct development involve genes whose expression diverges ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ meaningfully between different L. sinapis families. Sexual traits, on the other ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ hand, seem to become more dominant during pupal and adult stages (Figs. 8 and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 9).

16 For each gene, the mean expression value was subtracted from the expression values associated to the ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ different treatments. After this operation, down-regulated (upregulated) genes have negative (positive) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expression values and the average expression value across treatments is zero (for each gene). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 31

Figure 8 | Principal component analysis of samples subjected to short-day light treatment. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Top) Instar-III and instar-V samples; (Bottom) Instar-V and pupal samples. Lines are guides to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the eye. Analysis based on top 5,000 genes with the highest expression variance across ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments. Expression values were normalized by library size, log10 transformed, and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ zero-centered.

32

Figure 9 | Principal component analysis of samples subjected to long-day light treatment. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Top) Instar-III and instar-V samples; (Middle) Instar-V and pupal samples; (Bottom) Pupal and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ adult samples. Lines are guides to the eye. Analysis based on top 5,000 genes with the highest ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ expression variance across treatments. Expression values were normalized by library size, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ log10 transformed, and zero-centered. ​ ​ ​ ​ ​ ​ 33

Figure 10 | Principal component analysis of instar-V samples. Samples subjected to (Top) ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ short-day light treatment; (Bottom) long-day light treatment. Analysis based on top 5,000 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes with the highest expression variance. Expression values were normalized by library size, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ log10 transformed, and zero-centered. Lines are guides to the eye. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Per last, we performed PCA of all samples associated to the same developmental ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ stage, independently of light treatment. The results, shown in Fig. 11, suggest ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ that for this particular scenario, no clear factor prevails over the others, with ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ family, sex, and light-treatment all vying for dominance. There are major ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ implications coming out from these results, the first and foremost being that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ disentanglement of all four factors (developmental stage, family, sex, and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ light-treatment ) is highly recommended prior to carrying out DGE analysis ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ aimed at identifying the key genes involved in the regulation of diapause and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ direct development. ​ ​

34

Figure 11 | Principal component analysis of samples subjected to different light ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments. (Top) Instar-III samples; (Middle) Instar-V samples; (Bottom) Pupal samples. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Analysis based on top 5,000 genes with the highest expression variance across treatments. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Expression values were normalized by library size, log10 transformed, and zero-centered [L. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ sinapis images adapted from plate by Karl Eckstein (Wikimedia Commons)]. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

35 4.4 Differential gene expression analysis ​ ​ ​ ​ ​ ​ ​ ​ Identification of genes differentially expressed across treatments was performed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ using DESeq2 (Love et al., 2014). We started by investigating the exact same ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expression dataset previously analyzed through PCA and correlation analysis. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ That is to say, before removing batch effects from the data. A gene was deemed to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ be differentially expressed only if the FDR-adjusted p-value, p-adj, was less than ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ 0.05, and the count average across all samples, baseMean, and the absolute fold ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ change, |log2FoldChange|, was higher than 10 and 1, respectively. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

The results stemming from the DGE analysis are summarized in Fig. 12. Looking ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ first at changes observed across developmental stages, we observed that the total ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of differentially expressed genes is at its highest when we compare ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-V and pupal samples (over 3,000 genes). This was to be expected as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ insects undergo major physiological changes between these two stages. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ transition between pupal and adult stage is also marked by large numbers of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes being either up-regulated or down-regulated (2,038). Larval development ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ between instar-III and V affects the expression of a smaller set of genes. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Bringing our attention now to the differences between light treatments, DGE ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analysis reveals that only a small number of genes (16) seem to be differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed between instar III specimens associated to the diapausing and direct ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ development paths. These values are only slightly higher if we look at differences ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ among instar-V (41) or pupal (32) samples. We expected differences between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ light treatments to be more subtle than those between developmental stages so ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ these numbers are not altogether surprising. However, the gene expression ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ dataset used has not been corrected for batch effects so it is likely that the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analysis as carried out so far has been compromised by biases stemming from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ family and sex effects, among other. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

In order to identify and remove batch effects, we used the sva package (Leek et ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ al., 2012). For each set of samples being compared, we let sva detect the number ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ and nature of surrogate variables present in the expression data. sva receives no ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​​ ​ ​ information concerning the sex or family membership of the samples so we ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ checked whether any of the surrogate variables detected by sva correlated with ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ either family or sex. This was indeed the case. For example, one of the surrogate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ variables detected by sva during the analysis of both sets of instar-III samples ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

36

Figure 12 | Total number of genes differentially expressed across treatments. Expression ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ data not corrected for batch effects. Only genes with p-adj<0.05, baseMean>10, and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ |log2FoldChange|>1 were considered to be differentially expressed [L. sinapis images adapted ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ from plate by Karl Eckstein (Wikimedia Commons)]. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Figure 13 | Total number of genes differentially expressed across treatments after removal ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of batch effects. Surrogate variables detected and corrected for using sva. Only genes with ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ p-adj<0.05, baseMean>10, and |log2FoldChange|>1 were considered to be differentially ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed [L. sinapis images adapted from plate by Karl Eckstein (Wikimedia Commons)]. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

37 2 was highly correlated with family membership (F-test adjusted R ​ = 0.80 with ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ p-value<0.05). Likewise, when assessing pupal and adult samples subjected to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ long-day light treatment, sva identified a surrogate variable highly correlated ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 2 with sex (R ​ = 0.88 with p-value<1e-4). The fact that sva was able to spot biases ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ previously identified during PCA and correlation analysis gave us confidence on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the ability of this package to detect and correct for the presence of batch effects17. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

The total number of differentially expressed genes between treatments, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ computed after removal of batch effects, is shown in Fig. 13. Although we detect ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the same trends observed previously (e.g., comparisons across developmental ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ stages are associated with higher numbers of differentially expressed genes), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ differences across light treatments now show a much higher degree of contrast. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Whereas before only 16 genes had been detected as being differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed between instar-III samples subjected to different light treatments, DGE ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analysis carried out after correcting for batch effects now reveals the presence of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 102 differentially expressed genes. The number of differentially expressed genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ among instar-V and among pupal samples has also increased from 41 and 32 to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 67 and 168, respectively. The effect of sva on the expression profile of one of the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes now flagged as being differentially expressed, st ( scarlet), is shown ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ in Fig. 14. The total number of genes differentially expressed across treatments ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ that are annotated and have associated GO terms is shown in Table 5. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Among the genes differentially expressed across light treatment between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-III samples (102), only four (4) genes18 were still part of the pool of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentially expressed genes across instar-V samples (67), as shown in Fig. 15. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Similarly, there is very little overlap (4 genes19) between the set of genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentially expressed across instar-V samples and those differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

17 An alternative way of dealing with batch effects during DGE analysis is to explicitly model any such ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ biases (in the present case, this would take the form ‘design = ~ sex + family + daylength’). However, we ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ have no information concerning the sex of instar-III specimens, and family effects are not uniform across ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ members of the same nominal family. Under these conditions, using sva seems to be a better option as sex ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and family effects are adjusted individually for each sample. Furthermore, sva can detect surrogate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ variables we are not aware of, or that we would have had difficulty in quantifying (e.g. biases induced ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ during sequencing). ​ ​ 18 The four genes are: CG31548, a putative oxidoreductase gene (DOWN-DOWN); KGM_07406 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ (immune-related Hdd1 protein) (DOWN-DOWN); CG9701, a gene involved in carbohydrate metabolic ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ processes (UP-UP); and Ddc, an amine regulator (DOWN-DOWN). (DOWN≡ down-regulated; ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ UP≡up-regulated.) 19 All four genes lack annotation (they are identified only by their Trinity id code). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ 38

Figure 14 | Effect of sva on st gene expression data. Expression profiles and p-adj value for st ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ gene (Protein scarlet), measured during instar-III under different light treatments, before and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ after batch effects were removed using sva. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ Table 5 | Total number of differentially expressed genes, annotated genes and genes with ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated GO terms, across treatments. Expression values corrected for batch effects prior ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to DGE analysis using sva. Only genes with p-adj<0.05, baseMean>10, and |log2FoldChange|>1 ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ were considered to be differentially expressed. (SD≡short-day; LD≡long-day) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

39

Figure 15 | Venn diagram for total number of differentially expressed genes across light ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments. Surrogate variables detected and corrected for using sva. Only genes with ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ p-adj<0.05, baseMean>10, and |log2FoldChange|>1 were considered to be differentially ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed.

Figure 16 | Venn diagrams for total number of differentially expressed genes across ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental stages. (Left) Number of differentially expressed genes between instar-III and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ V, for short-day and long-day light treatments; (Right) Number of differentially expressed genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ between instar-V and pupal stages, for short-day and long-day light treatments. Surrogate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ variables detected and corrected for using sva. Only genes with p-adj<0.05, baseMean>10, and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ |log2FoldChange|>1 were considered to be differentially expressed [L. sinapis images adapted ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ from plate by Karl Eckstein (Wikimedia Commons)]. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

40 expressed across pupal samples. As expected, there is a higher degree of overlap ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (59 genes) when we first take the list of genes differentially expressed between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-III and V under short-day treatment (184) and compare with the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analogous list obtained for specimens under long-day treatment (324), as shown ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in Fig. 16. A high degree of overlap is also observed when we compare the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-V to pupa gene lists associated to different light treatments (2118 common ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes).

The results from DGE analysis suggest that (i) different pathways are active in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ specimens under different light treatments, even before a final decision is made ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ during instar-IV on whether to go into diapause (an assertion supported by the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ high number of differentially expressed genes, 102, between instar-III samples ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ under different light treatment); (ii) once a decision has been made (during ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-IV), a new set of pathways is activated depending on whether diapause or ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ direct development has been selected (as indicated by the large number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentially expressed genes, 67, between instar-V samples and by the fact that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ only 4 genes are differentially expressed across light treatments both during ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-III and V (either UP-UP or DOWN-DOWN). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

4.5 Gene ontology term enrichment ​ ​ ​ ​ ​ ​ ​ ​ We used topGO (Alexa and Rahnenfuhrer, 2016) to assess how differentially ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes are distributed among different biological processes. topGO ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ takes as its input the list of genes flagged by DESeq2 as being differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ expressed and performs enrichment analysis for GO terms using algorithms that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ take into account the graph-like structure of the GO database. Statistical ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ significance of GO terms thus detected was evaluated using Fisher’s exact test. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ We selected the Fisher-elim algorithm as it minimizes the number of false ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ positives (Alexa et al., 2006). Gene enrichment analysis was performed ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ separately for all eight sets of differentially expressed genes depicted in Fig. 13. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

The top-most significant GO categories, for the set of differentially expressed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes across light treatments for instar-III samples, is shown in Fig. 17 (full list ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ provided in Supplementary information, Table S3). These GO terms reflect ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differences between specimens, exposed either to short- or long-day light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment, before a decision is made on whether to go into diapause or direct ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ development. Among the terms shown, there are several associated to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental processes, such as those regulating chitin and cuticle biosynthetic ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

41 processes. We had observed that larvae raised under short-day light treatment ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developed at a lower rate (size and weight-wise) than those reared under ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ long-day conditions. Some of the GO categories here flagged as over-represented ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ are therefore likely to reflect differences in expression for genes involved in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ regulation of metabolic and catabolic processes that control the way larval L. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ sinapis grow and develop. Associated to the GO category ‘cellular response to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ nutrient’ (highlighted in green in Fig. 17), is a gene, sug (sugarbabe protein), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ known to repress genes involved in dietary fat breakdown (fat catabolism) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Zinke et al., 2002). This gene was up-regulated in instar-III specimens under ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ short-day treatment. Among the genes associated to the GO term ‘carbohydrate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Figure 17 | Top-35 most significant GO categories, for the set of differentially expressed ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes across light treatments for instar-III samples. GO enrichment analysis carried out ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ using topGO’s Fisher-elim algorithm. Symbol size is proportional to the number of significantly ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes assigned to a GO category (significant genes). The smallest symbol size ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ corresponds to one gene. Symbol color reflects the ratio between the number of significant genes ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to a particular GO term and the total number of annotated genes (significant or not) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to that same GO category (enrichment score). Terms highlighted in blue are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to processes involving monoamine neurotransmitters. Highlighted in green is one of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the GO terms linked to fat catabolism [L. sinapis image adapted from plate by Karl Eckstein ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Wikimedia Commons)]. ​ ​ ​ 42 metabolic processes’, Amy-p (alpha-amylase A), a gene implicated in the ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ breakdown of starch (Powell and Andjelković, 1983), is downregulated for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ short-day instar-III samples. Desat1, which codes for an enzyme, desaturase 1, ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ involved in lipid metabolic processes favoring fat storage (Parisi et al., 2013), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ was up-regulated in instar-III larvae subject to short-day light conditions, and so ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ were P49010 (Beta-GlcNAcase), Cht2 (chitinase 2), and Cht10 (chitinase 10), all ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ genes coding for chitinases implicated in chitin catabolism and nutrient recycling ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ during ecdysis (molting) (Merzendorfer and Zimoch, 2003). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

A second set of GO terms, highlighted in blue in Fig. 17, relate to processes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ involving neurotransmitters and neuromodulators. Particularly well represented ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ are categories linked to monoamine metabolic processes, such as those involving ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the synthesis of dopamine and serotonin (5-hydroxytryptamine, 5-HT). When we ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ checked which genes are responsible for enrichment of these GO terms, two ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ names came out: Ddc (aromatic-L-amino-acid decarboxylase) and st (scarlet). ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Both are up-regulated in instar-III specimens under short-day treatment. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Next, we checked which GO terms were enriched for the set of differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes across light treatments for instar-V samples. The results, shown ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in Fig. 18, again highlight multiple metabolic and biosynthetic processes that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reflect the different rates of development, such as molting, for individuals reared ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ under different light treatment (full list provided in Supplementary information, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ Table S4). However, by this stage specimens have already opted on whether to go ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ into diapause or direct development, and we would expect to see already some ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ downstream consequences stemming from this decision. For instance, activation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of direct development pathways is expected to accelerate sexual maturation. One ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of the GO categories listed in Fig. 18 is ‘pole cell migration’ (4th entry from the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ top). This category is underpinned by zfh1 (Zn finger homeodomain 1), a gene ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ known to be involved in gonadogenesis in Drosophila (Howard, 1998). zfh1 is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ up-regulated for samples under long-day light treatment. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Dopamine and serotonin-related GO terms are also among those flagged as being ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ statistically significant (entries highlighted in blue in Fig. 18). Again, Ddc reveals ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ itself to be one of the key genes being asymmetrically expressed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

43

Figure 18 | Top-35 most significant GO categories, for the set of differentially expressed ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes across light treatments for instar-V samples. GO enrichment analysis carried out using ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ topGO’s Fisher-elim algorithm. Symbol size is proportional to the number of significantly ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes assigned to a GO category (significant genes). The smallest symbol size ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ corresponds to one gene. Symbol color reflects the ratio between the number of significant genes ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to a particular GO term and the total number of annotated genes (significant or not) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to that same GO category (enrichment score). Terms highlighted in blue are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to processes involving monoamine neurotransmitters [L. sinapis image adapted from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ plate by Karl Eckstein (Wikimedia Commons)]. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ under different light treatments (up-regulated under short-day light treatment, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ as observed before for instar-III samples). Another amine-regulator at play is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ PPO2 (Prophenoloxidase 2), an enzyme that acts on dopamine substrates (Asada ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and Sezaki, 1999), and which is up-regulated for instar-V samples under long-day ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ light treatment. ​ ​

In the last juxtaposition between light treatments, we looked at GO terms ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ reflecting gene expression differentials between pupal samples. As expected, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ multiple GO categories bring out differences between diapausing and direct ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

44

Figure 19 | Top-35 most significant GO categories, for the set of differentially expressed ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes across light treatments for pupal samples. GO enrichment analysis carried out using ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ topGO’s Fisher-elim algorithm. Symbol size is proportional to the number of significantly ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes assigned to a GO category (significant genes). The smallest symbol size ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ corresponds to one gene. Symbol color reflects the ratio between the number of significant genes ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to a particular GO term and the total number of annotated genes (significant or not) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to that same GO category (enrichment score). Terms highlighted in blue are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to processes involving monoamine neurotransmitters [L. sinapis image adapted from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ plate by Karl Eckstein (Wikimedia Commons)]. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ development individuals, such GO terms linked to cytoskeleton development and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ organization (top most significant GO hits shown in Fig. 19; full list provided in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Supplementary information, Table S5). Terms associated to biological processes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ involving monoamine neurotransmitters also populate this list, including terms ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to dopamine, histamine (2-(1H-imidazol-5-yl)ethanamine), and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ monoamine transport (entries highlighted in blue in Fig. 19). The genes driving ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differences in amine and neuronal pathways between light treatments, however, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differ from those called upon during the larval stage. It includes Vmat (vesicular ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ monoamine transporter), a gene involved in dopamine and histamine storage ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and transport in Drosophila (Sang et al., 2007), and CG32447, a gene implicated in ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the G-protein coupled glutamate receptor signaling pathway (originally reported ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

45 as CG7155 in (Brody and Cravchik, 2000)). Both genes are up-regulated in pupal ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ samples subjected to long-day light treatment. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

On the face of it, the GO terms and associated genes mentioned above represent ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the universe of biological functions and pathways differentiating L. sinapis ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ individuals following either the arrested development or direct development ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ paths. They cover all three direct comparisons between light treatments, one for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ each developmental stage (adult specimens could not be compared across light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments since we lack data for post-diapausing individuals). The remaining ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ five sets of differentially expressed genes, and associated GO enrichment ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analysis, all deal with changes between developmental stages in organisms ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ subjected to either the short-day or the long-day light treatment (see Fig. 13 for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ clarification). Theoretically, and apart from the differences highlighted above, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ both sets of L. sinapis individuals should have followed a similar developmental ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ path. This idealized description, however, does not hold in the real world. DGE ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analysis of some genes is often compromised due to lack of statistical signal. This ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ is particularly true for lowly expressed genes. It is thus vital to check which GO ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ terms are being enriched along the two developmental paths as they may not ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ match, even when the differences between light treatments pinpointed above are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ taken into consideration. ​ ​ ​ ​

We carried out GO term enrichment analysis for each of the five sets of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentially expressed genes associated to changes between developmental ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ stages (results can be found in Supplementary information, Tables S6 to S10). We ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ then scanned for GO terms linked to processes involved in monoamine or ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ hormonal regulation, and retrieved the associated list of significant genes. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ results, shown in Fig. 20, reveal the presence of several genes not detected in our ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ previous analysis. For instance, a closer look at the instar-III to instar-V ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ transition indicates the presence of a gene, CYP15C1 (farnesoate epoxidase) that ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ is up-regulated in instar-V samples under short-day light treatment. Other genes, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ such as Spn42Da (Serpin 4), are similarly regulated in both light treatments and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ are thus unlikely to be directly involved in the pathways associated to diapause ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ or direct-development. There are also several instances where a gene may not ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ show up under a particularly light treatment because of lack of statistical signal. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ For instance, AstC (Allatostatin C) is down-regulated between the instar-V and ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ pupal stages for individuals under long-day light treatment. This gene is not ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ present in the equivalent list for the short-day light treatment. However, a closer ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ look

46

Figure 20 | Differentially expressed genes involved in monoamine and hormonal regulation. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Genes in blue are down-regulated; genes in red are up-regulated (see arrow direction). See text ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ for genes highlighted in green. GO enrichment analysis carried out using topGO (Fisher-elim ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ algorithm; p-value < 0.05). List of significant genes used in enrichment analysis selected using ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ DESeq2 (p-adj < 0.05). Genes marked with ‘*’ are differentially expressed (p-adj<0.05) but their ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ amine- and/or hormonal-related topGO categories have p-value>0.05. [L. sinapis images adapted ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ from plate by Karl Eckstein (Wikimedia Commons)]. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ at the data indicates that it too is downregulated for short-day light treatment, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ although with a slightly higher p-value (DESeq2’s p-adj = 0.11), thus not making ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the cut to be included in the enrichment analysis as a significant gene. In short, an ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ educated guess must be made on a case-by-case basis as to whether a particular ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ gene is likely to be involved in the regulation of processes leading either to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ dormancy or direct development. This was done by taking each of the genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ listed in Fig. 20 and checking normalized count across all treatments, selecting ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ only genes for which there is clear evidence of divergent behavior between light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments.

Besides the five genes flagged through direct comparison across light treatments ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Ddc, st, PPO2, Vmat, and CG32447), only the monoamine/ hormonal-related ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes highlighted in green in Fig. 20 were deemed to behave differently ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ depending on light treatment: CYP15C1, Manf (mesencephalic astrocyte-derived ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ neurotrophic factor), and Hn (henna). Average gene expression values across ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatments for Ddc, st, and CYP15C1 genes are shown in Fig. 21 (boxplots for ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ each individual gene can be found in Supplementary information, Figs. S1 and S2). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

47

Figure 21 | Average expression levels across treatments for genes Ddc, st, and CYP15C1. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​​ ​ Counts are normalized by library size and sva-adjusted. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Ddc and st are upregulated in instar-III samples under short-day light treatment ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ vis-a-vis those under long-day light treatment. By instar-V, the gap in expression ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ levels (across light treatments) has been reduced substantially, although even ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ this marginal difference remains statistically significant for Ddc. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​

Average expression levels for CYP15C1 increase across developmental stages for ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ individuals under short-day light treatment, while the opposite is observed in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ individuals in the direct development path. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

During instar-V, specimens under long-day light treatment show up-regulated ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expression levels for PPO2 (Supplementary information, Fig. S3) and reduced ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expression for Ddc (Fig. 21). Three genes – Vmat, CG32447, and Manf – show ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ reduced expression levels during the pupal stage in diapausing individuals, while ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ a fourth gene, Hn, is upregulated (Fig. 22; boxplots in Supplementary information, ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ Figs. S4 and S5). ​ ​ ​ ​ ​ ​

48

Figure 22 | Average expression levels across treatments for genes Vmat, CG32447, Manf ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​​ and Hn. Counts are normalized by library size and sva-adjusted. ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

4.6 Circadian clock genes ​ ​ ​ ​ ​ ​ We looked into the expression levels of genes associated to the circadian clock, as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ several studies suggest they may be implicated in diapause regulation. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ results, shown in Fig. 23, reveal that all six genes are weakly expressed (compare ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ with count levels for genes shown in Figs. 21 and 22). Lowly expressed genes are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ particularly difficult to investigate using DGE analysis and even genes for which ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ there is a hint they may be differently regulated across light treatments may fail ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to attain statistical significance. For instance, for instar-V samples, tim (timeless) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ is differentially expressed at the row level (p-value = 0.002), but the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ FDR-adjusted p-value is p-adj = 0.15. ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​

There are, however, two genes whose expression shows a statistically significant ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ change across developmental transitions. cyc (cycle) is down-regulated between ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ instar-V and pupal stages in individuals under short-day light treatment (p-adj = ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 0.013), while per (period) is upregulated (p-adj = 0.03). (Boxplots for cyc, per, ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ and tim can be found in Supplementary information, Figs. S6 and S7.) There is ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ consequently some evidence that circadian clock genes are involved in diapause ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ regulation, namely after a decision is made on whether to go into arrested or ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ direct development. ​ ​

49

Figure 23 | Average expression levels across treatments for circadian clock genes. Counts ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ are normalized by library size and sva-adjusted. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

50 5. Discussion & Conclusions ​ ​ ​ ​ ​ ​ Recent studies aiming at investigating molecular mechanisms underpinning ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ insect diapause have tended to be framed by models of diapause regulation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ where insect photoperiodic calendar and circadian rhythms are intimately ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ related (Goto and Denlinger, 2002; Tauber et al., 2007; Yamada, and Yamamoto, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 2011; Bajgar et al., 2013; Meuti and Denlinger, 2013; Meuti et al., 2015; Pegoraro ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ et al., 2016; Pruisscher, 2016). While there is evidence that disruption of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ circadian rhythms via mutant or RNAi silencing of circadian clock genes tends to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ affect the ability of insects to regulate diapause (Tauber et al., 2007; Stehlík et al., ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ 2008; Han and Denlinger, 2009; Meuti et al., 2015), to date no one has been able ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to map the pathways involved and show how the two systems are linked at the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ molecular level. The main alternative to the circadian model of diapause ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ regulation, the hourglass model, relies instead on biomolecules that, when ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ present at specific concentrations, would trigger a cascade leading to the onset of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause. In the present study, we find strong evidence suggesting amine ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ neurotransmitters are being modulated differently across light treatments, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ indirectly providing support to the hourglass model. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Diapause regulation ​ ​ Both the st and Ddc genes have been shown be involved in the regulation and ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ recycling of monoamine neurotransmitters in . In Drosophila, the st ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​​ gene is believed to take an active role in the neuroglia (neuron-supporting cells), ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ where it helps in histamine recycling (Stuart et al., 2007). Histamine is a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ monoamine neurotransmitter that is released by photoreceptors in arthropods, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ with histamine levels modulated according to light intensity. st affects not only ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ histamine levels but its expression has also been shown to increase the amounts ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of dopamine and serotonin in the brain of fruit (Borycz et al., 2008). In the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ present study, st was up-regulated in instar-III specimens under short-day ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment, suggesting that average dopamine and serotonin concentrations were ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ possibly higher in larvae exposed to extended darkness. Similarly to st, the Ddc ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​​ gene has also been shown to promote the synthesis of dopamine and serotonin ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ in the Drosophila brain (Livingstone and Tempel, 1983). Like st, it is also ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ up-regulated in short-day instar-III samples. ​ ​ ​ ​ ​ ​ ​ ​

Monoamine neurotransmitters play several important regulatory roles. In ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Drosophila, neural dopamine is required for entrainment of daily activity ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ rhythms under low ambient light levels (Hirsh et al., 2010). Studies of larval ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

51 costata, a drosophilid , suggest that serotonin and dopamine levels ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ increase by up to 20% at the onset of each period of darkness (Ko tál et al., ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ š ​ ​​ ​ ​ ​ 2000), while in larval Bombyx mori the amount of melatonin ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (N-acetyl-5-methoxytryptamine) in the head increases 5-fold during nighttime ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Itoh et al., 1995). Of particular relevance to the present study, dopamine levels ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ have been shown to diverge between diapausing and non-diapausing individuals. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ In the Pieris brassicae butterfly, the amount of dopamine by the end of the ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ photosensitive period in larvae under short-day light treatment was twice as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ high as that observed in specimens subjected to long-day treatment, mostly ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ because dopamine metabolism (breakdown of dopamine into catabolites) had all ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ but ceased to take place; no serotonin was detected in long-day larval samples ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Isabel et al., 2001). Dopamine levels were also higher in diapausing individuals ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ during the pupal stage, and stayed high up until the breaking of diapause. In the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Bombyx mori silkworm, dopamine concentrations are consistently higher in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause-type larvae and pupae than in those set for direct development ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Noguchi and Hayakawa, 2001). In the Antheraea pernyi silkmoth, both dopamine ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ and melatonin have been shown to be involved in diapause termination (Wang ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ et al., 2015a; Wang et al., 2015b). ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

High dopamine concentration and a sharp reduction in the rate of dopamine ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ metabolism have been proposed as telltale signs announcing the onset of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause (Houk and Beck, 1977; Isabel et al., 2001). Low concentration of two ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ dopamine catabolites, epinephrine (adrenaline) and norepinephrine, are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated with lower heart rates, reduced glucose release, and a general ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ slowdown of metabolic processes. In P. brassicae, high serotonin levels are ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ believed to lead to the progressive inhibition of cerebral metabolism and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ endocrinological activity (hormonal secretion) (Isabel et al., 2001). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

Our results show that several metabolic processes are modulated asymmetrically ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ between light treatments during instar-III. The sug gene, upregulated in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ pre-diapausing individuals under short-day light treatment, has been shown to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ inhibit fat catabolism and promote sugar-to-fat conversion (fatty acid synthesis) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Zinke et al., 2002). Amy-p, a gene involved in carbohydrate metabolism (Powell ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and Andjelković,1983), is down-regulated in short-day instar-III samples. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Desat1 gene, up-regulated in larvae subject to short-day light conditions, codes ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ for an enzyme that promotes lipid metabolic processes favoring fat storage ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Parisi et al., 2013). Finally, our data also suggests that expression of CYP15C1 is ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ reduced in larvae subjected to short-day light treatment. CYP15C1 has been ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​

52 shown to be required for the production of the juvenile hormone (JH) in Bombyx ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ mori (Daimon et al., 2012). Juvenile hormones, in turn, are known to play a ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ critical role in the regulation of larval molting during late instars, in signaling the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ onset of metamorphosis, and in the regulation of reproductive maturation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Riddiford, 2012; Jindra et al., 2013). In insects undergoing diapause as adults, fat ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ synthesis and accumulation occur only if juvenile hormone levels are kept low ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Sim and Denlinger, 2008; Liu et al., 2016). Taken together, these results suggest ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ that metabolic activities in instar-III larvae under short-day treatment are being ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ redirected towards fatty acid synthesis and sequestration while glucose usage is ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ being curtailed. ​ ​

During instar-V, the set of genes at play is by far and large distinct from those ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ flagged during instar-III. By this stage, developmental choices have become ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ irreversible, with long-day larvae set on the road to direct development, while in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ caterpillars under low-light treatment specific pathways have been activated in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ preparation for pupal diapause. Monoamine regulation at this stage continues to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differ between treatments, although modulated by a slightly different group of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes. st gene expression is now identical for both light treatments. Ddc is still ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ upregulated in specimens under short day treatment, although the gap in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expression levels in not as pronounced as during instar-III. PPO2, a gene weakly ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ expressed during instar-III under both light treatments, is now up-regulated in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ direct development larvae. PPO2 is a multifunctional enzyme used by arthropods ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in melanin synthesis, a process during which dopamine is used as a precursor, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ with melanin subsequently being used during sclerotization (cuticular ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ hardening) (Sugumaran, 2002). PPO2 was upregulated in larvae under long-day ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ light treatment, suggesting a higher metabolic rate in direct development ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ organisms, as well as a heightened consumption (breakdown) of dopamine. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Expression levels for another gene involved in positive dopamine regulation, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Manf, dropped dramatically between instars III and V in samples under long-day ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment. In short, all three genes directly or indirectly involved in monoamine ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ regulation (Ddc, PPO2 and Manf) support higher dopamine levels in instar-V ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ larvae kept under reduced light conditions. Signaling a widening of the rift ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ between developmental paths, expression levels for a gene involved in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ development of the gonads, zfh1, are higher in specimens under long-day light ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment.

During pupation, a new set of monoamine regulators again takes over. Ddc and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ PPO2 are now only weakly expressed, while differences in expression across light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

53 treatments for Manf are at this stage minimal. Instead, expression differentials ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ are observed for Vmat, CG32447, and Hn. Vmat is a monoamine transporter ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ known to decrease overall dopamine levels when overexpressed (Sang et al., ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ 2007). Vmat was up-regulated in pupae under long-day light treatment. CG32447 ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ is part of the metabotropic G-protein coupled glutamate receptor family and acts ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ as a dopamine modulator (Brody and Cravchik, 2000). It is down-regulated in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ specimens reared under low-light conditions. Hn is an Drosophila enzyme ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ homologous to the human PAH (Phenylalanine-4-hydroxylase) and is believed to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ be part of the metabolic pathway involved in the synthesis of catecholamine ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ neurotransmitters such as dopamine (Craig et al., 1988). Hn is weakly expressed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ in pupal samples subjected to long-day light treatment. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

These results suggest that although the final decision on whether to initiate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause is taken at the end of instar-IV, the process leading to this decision ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ starts way earlier, at the beginning of the photosensitive period, during instar-III. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ In other words, the switch to diapause is the culmination of a process that builds ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ up slowly during the larval period in specimens under adverse, low light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ conditions. For individuals under long-day light treatment, these ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause-inducing conditions never take a hold and the (default) direct ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ development pathways remain activated. Light to darkness transactions trigger ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ an increase in dopamine and other monoamines neurotransmitters. Under ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ extreme long-day conditions (22h light), monoamine levels flatten out after each ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ short night (Itoh et al., 1995). Here we hypothesize that exposure to long nights ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ produces altogether different results, with dopamine levels observed during the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ short period of daylight never fully returning to those observed in the previous ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ day. Sustained high dopamine levels instigate metabolic changes in larvae reared ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ under short day conditions that favor fatty acid storage and curtail glucose usage. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ The progressive slowdown of general metabolic activities reaches its end point ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ by the end of the photosensitive period, which can be seen either as a decision ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ point (one or more biochemical reactions mark the switch to diapause proper) or ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the end of a window of opportunity (until then, there was always a chance to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ resume direct development, had light and temperature conditions changed in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ right direction; see (Friberg et al., 2011)). One way or the other, there is no ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ turning back and wood whites under short- and long-day light conditions will ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ from now on follow different development paths grounded on different pathways ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Friberg et al., 2011). From our results, Ddc and st emerge as the key genes ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ whose expression promotes a sustained high level of dopamine during the whole ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ photosensitive period. Expression of monoamine modulators active during the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

54 last instar and pupation, such as PPO2, Manf, Vmat, and CG32447, are to a certain ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ ​ ​ extent subservient to the choice in developmental pathways made earlier. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Islands of speciation ​ ​ ​ ​ According to the analysis carried out above based on the identification and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ functional analysis of differentially expressed genes across light treatments , ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diapause regulation is underpinned by a series of monoamine modulators. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Diapause is a diverging life-history trait among L. sinapis ecotypes. Wood white ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ populations in central and northern Sweden are univoltine and by default ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ virtually all members of this population go into diapause during their pupal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ stage. As seen here, they can easily be induced into direct development, but only ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ by rearing them under light conditions not found in the wild in this geographic ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ area. Southern European populations are multivoltine and individuals need to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ evaluate environmental conditions carefully so as to select an optimal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ developmental path. Yet, southern wood whites brought into northern latitudes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ fail to fully adjust to their new environment, falling instead into behavioral ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ patterns better suited to their region of origin (Friberg, personal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ communication). This raises the hypothesis that diapause regulation has ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ diverged between different wood white ecotypes, and that these differences ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ might be observable a the genomic level. If the conclusions we drew from our ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ analysis are sound, then direct confrontation between the genomes of individuals ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ originating from disparate latitudes along the species range should reveal the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ presence of diverging chromosomal regions, and among these some should ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ include genes that directly or indirectly modulate monoamine neurotransmitters. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

In a recent study carried out by members of the Backström lab, whole-genome ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ resequencing data from three L. sinapis ecotypes (Swedish, Spanish, and Kazakh) ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ was compared with the aim of identifying shared, private and fixed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ polymorphisms between populations (Söderberg et al., unpublished). Analysis of ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the top 0.01% of windows with the largest number of fixed differences between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ populations, identified nine candidate genes for traits under local adaptation. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

One of the genes identified as diverging between Swedish and Spanish ecotypes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ was ST1C4 (sulfotransferase 1C4). The homologous gene in humans (SULT1C4), ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ is part of a family of enzymes involved in the metabolism of catecholamines, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ steroids, and phenols, including dopamine (Allali-Hassani et al., 2007). ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ Interestingly, in humans ST1C4 has no specific affinity for dopamine (a different ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sulfotransferase, SULT1A3, aptly called ‘human dopamine sulfotransferase’ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

55 shows the highest affinity). In humans, ST1C4 shows stronger affinity for phenols ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ and steroids, and it can interact as well with epinephrine and isoprenaline. It ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ must be however noted that the presence/absence of cofactors is known to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ modulate sulfotransferase affinity towards potential substrates (Allali-Hassani et ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ al., 2007). In our study, ST1C4 is only lightly expressed, independently of light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment and developmental stage. It remains to be seen whether the Spanish ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ mutant is expressed in higher quantities under the same conditions, as well as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ determine the specific role played by this enzyme in Lepidoptera. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Another gene flagged during the genome scan analysis was spri (sprint), which ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ showed a statistically significant number of fixed differences between the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Swedish and Kazakh populations. spri is a Rab5 protein activator (Carney et al., ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ 2006). The latter, in turn, is implicated in positive modulation of exocytosis of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ synaptic vesicles (Wucherpfennig et al., 2003). In short, spri is an indirect ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ neurotransmitter regulator. In the experiments described in this report, spri ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ expression was virtually identical across light treatments. ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

Whether ST1C4 and spri are indeed involved in diapause regulation is still to be ​ ​​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ seen. Knockout experiments using CRISPR interference or RNAi coupled with ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ cross-ecotype diapause experiments similar to the one described in this report ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ could help clarify this point. Before that, however, it would certainly pay off to ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ run detailed genome scans across pairs of different L. sinapis ecotypes and check ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ if there are other genomic regions coding for monoamine regulators showing ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ marked (or even moderate) fixed differences across populations. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

56 Afterword

A few caveats: ​ ​ ​ ​ i) This study was based on RNA collected from the whole body of larvae, pupae ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and imagos. If some of the key genes and pathways directly or indirectly involved ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in diapause regulation are confined to a specific organ (eg, brain or gut; see ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (Bajgar et al., 2013; Linn et al., 1995)) the associated transcriptional signal could ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ have been diluted, or even lost. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ii) Circadian clock genes were absent from the discussion on diapause regulatory ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ mechanisms here presented. As shown in the results chapter, circadian clock ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes are usually weakly expressed and, as befitting a group of genes responsible ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ for regulating circadian cellular activities, their expression levels vary ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ throughout the day following a 24 hour cycle. This makes detection and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ interpretation of expression differentials rather problematic. In most studies, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ synchronization between light treatments is done by enforcing simultaneous ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ lights on times. This ensures expression of circadian clock genes is in phase at the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ beginning of each cycle. This was not the case here. As mentioned already, in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ experimental setup here employed treatments were synchronized at lights off. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ This approach has the advantage of synchronizing the daily spikes on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ monoamine levels, which are observed at or immediately after the transition ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ from light to darkness. ​ ​ ​ ​ ​ ​ iii) It was our intention to produce transcriptomes also based on the L. sinapis ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ reference genome and carry out a parallel DGE and GO analysis based on these ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ alternative transcriptomes. At the time of this writing, genome annotation was ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ still ongoing ‒ this part of the project will be pursued at a later data, once the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ annotation file is ready. ​ ​ ​ ​ ​ ​

Luis Leal ​ ​ August 2017 ​ ​

57 Acknowledgments

I am extremely grateful to Niclas Backström for the opportunity to work at his ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ lab and for acting as my supervisor during my degree project. Thank you for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ sharing your knowledge and enthusiasm for this fascinating area of research, for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ providing the perfect mix of guidance and freedom, for your advice and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ encouragement during our weekly meetings, and for the supportive and peaceful ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ work environment. ​ ​

I am also very grateful to Siv Andersson and Lena Henriksson for their many ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ comments and suggestions concerning the planning and structure of this project ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and on the contents of this report. Many thanks also to Caesar Al Jewari, who ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ gracefully accepted to be my opponent during the degree project's final ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ presentation.

Among the many people who contributed to this project, I would like to single ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ out Thomas Källman for the many productive discussions we had and for his ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ invaluable advice on the design of the pipelines here employed. Many thanks also ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to Venkat Talla, Axel Söderberg, Ludovic Dutoit, Linnéa Smeds, Douglas Scofield, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Willian Silva, Lucile Soler, Magne Friberg, and Christer Wiklund for their helpful ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ suggestions and advice. ​ ​ ​ ​

I would like to acknowledge support from Science for Life Laboratory, the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ National Genomics Infrastructure, NGI, and Uppmax for providing assistance in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ massive parallel sequencing and computational infrastructure. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

58 References Alexa, Adrian, Jörg Rahnenführer, and . (2006). “Improved ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Scoring of Functional Groups from Gene Expression Data by Decorrelating GO ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Graph Structure.” Bioinformatics 22 (13): 1600–1607. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1093/bioinformatics/btl140.

Alexa A., and J. Rahnenfuhrer. (2016). “topGO: Enrichment Analysis for Gene ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Ontology”. R package version 2.28.0. ​ ​ ​ ​ ​ ​ ​ ​ https://www.bioconductor.org/packages/release/bioc/html/topGO.html

Allali-Hassani, Abdellah, Patricia W. Pan, Ludmila Dombrovski, Rafael ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Najmanovich, Wolfram Tempel, Aiping Dong, Peter Loppnau, Fernando Martin, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Janet Thonton, Aled M Edwards, Alexey Bochkarev, Alexander N Plotnikov, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Masoud Vedadi, and Cheryl H Arrowsmith (2007). “Structural and Chemical ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Profiling of the Human Cytosolic Sulfotransferases.” PLOS Biology 5 (5): e97. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1371/journal.pbio.0050097.

Asada, Nobuhiko, and Hiroshi Sezaki. (1999). “Properties of Phenoloxidases ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Generated from Prophenoloxidase with 2-Propanol and the Natural Activator in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Drosophila Melanogaster.” Biochemical Genetics 37 (5–6): 149–58. ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ doi:10.1023/A:1018730404305.

Babraham Bioinformatics (2016) ​ ​ ​ ​ http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Babraham Bioinformatics (2017a) ​ ​ ​ ​ http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

Babraham Bioinformatics (2017b) ​ ​ ​ ​ http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/

Bajgar, Adam, Marek Jindra, and David Dolezel. (2013). “Autonomous Regulation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of the Insect Gut by Circadian Genes Acting Downstream of Juvenile Hormone ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Signaling.” Proceedings of the National Academy of Sciences 110 (11): 4416–21. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ doi:10.1073/pnas.1217060110.

Beck, S. D. (1980). Insect Photoperiodism, 2nd ed. (New York: Academic Press). ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Benjamini, Yoav, and Yosef Hochberg. (1995). “Controlling the False Discovery ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ Statistical Society. Series B (Methodological) 57 (1): 289–300. ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ doi:10.2307/2346101.

59 Borycz, J., J. A. Borycz, A. Kubów, V. Lloyd, and I. A. Meinertzhagen. (2008). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “Drosophila ABC Transporter Mutants White, Brown and Scarlet Have Altered ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Contents and Distribution of Biogenic Amines in the Brain.” Journal of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Experimental Biology 211 (21): 3454–66. doi:10.1242/jeb.021162. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Bradshaw, William E., and Christina M. Holzapfel. (2010). “What Season Is It ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Anyway? Circadian Tracking vs. Photoperiodic Anticipation in Insects.” Journal of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Biological Rhythms 25 (3): 155–65. doi:10.1177/0748730410365656. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Bray, Nicolas L., Harold Pimentel, Páll Melsted, and . (2016). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “Near-Optimal Probabilistic RNA-Seq Quantification.” Nature Biotechnology 34 ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ (5): 525–27. doi:10.1038/nbt.3519. ​ ​ ​ ​

Brody, Thomas, and Anibal Cravchik. (2000). “Drosophila MelanogasterG ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Protein–Coupled Receptors.” The Journal of Cell Biology 150 (2): F83–88. ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1083/jcb.150.2.F83.

Bünning, E. (1936). “Die endogene Tagesrhythmik als Grundlage der ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Photoperiodischen Reaktion.” Berichte der deutschen botanischen Geschellschaft ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ 53: 590-607. ​ ​

Carney, Darren S., Brian A. Davies, and Bruce F. Horazdovsky. (2006). “Vps9 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Domain-Containing : Activators of Rab5 GTPases from Yeast to Neurons.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Trends in Cell Biology 16 (1): 27–35. doi:10.1016/j.tcb.2005.11.001. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Coyne, J.A., and H.A. Orr. (2004). Speciation. (Sunderland, MA: Sinauer ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ Associates).

Craig SP, Buckle VJ, Lamouroux A, Mallet J, Craig IW. (1988). “Localization of the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ human dopamine beta hydroxylase (DBH) gene to chromosome 9q34.” Cytogenet ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ Genome Res 48: 48-50 ​ ​ ​ ​ ​ ​

Daimon, Takaaki, Toshinori Kozaki, Ryusuke Niwa, Isao Kobayashi, Kenjiro ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Furuta, Toshiki Namiki, Keiro Uchino, Yutaka Banno, Susumu Katsuma, Toshiki ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Tamura, Kazuei Mita, Hideki Sezutsu, Masayoshi Nakayama, Kyo Itoyama, Toru ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Shimada, and Tetsuro Shinoda. (2012). “Precocious Metamorphosis in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Juvenile Hormone–Deficient Mutant of the Silkworm, Bombyx Mori.” PLOS ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ Genetics 8 (3): e1002486. doi:10.1371/journal.pgen.1002486. ​ ​ ​ ​ ​ ​ ​ ​

Darwin, C. (1859). On the Origin of Species by Means of Natural Selection or the ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Preservation of Favoured Races in the Struggle for Life. (London: John Murray). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

60 Dincă, Vlad, Vladimir A. Lukhtanov, Gerard Talavera, and Roger Vila. (2011). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “Unexpected Layers of Cryptic Diversity in Wood White Leptidea Butterflies.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Nature Communications 2 (May): 324. doi:10.1038/ncomms1329. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. (2013). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ doi:10.1093/bioinformatics/bts635.

Dobzhansky, T. (1937). Genetics and the Origin of Species. Columbia University ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Biological Series. (New York: Columbia University Press). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

FASTX (2010) http://hannonlab.cshl.edu/fastx_toolkit/index.html ​ ​ ​ ​​

Fisher, R. A. (1930). The Genetical Theory of Natural Selection. (Oxford: The ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Clarendon Press). ​ ​

Freese, A., and K. Fiedler. (2002). "Experimental evidence for specific distinctness ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of the two wood white butterfly taxa, Leptidea sinapis and L. reali (Pieridae)." ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ Nota lepidopterologica 25, no. 1: 39-60. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Friberg, Magne, and Christer Wiklund. (2007). “Generation-Dependent Female ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Choice: Behavioral Polyphenism in a Bivoltine Butterfly.” Behavioral Ecology 18 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ (4): 758–63. doi:10.1093/beheco/arm037. ​ ​ ​ ​

Friberg, Magne, Martin Olofsson, David Berger, Bengt Karlsson, and Christer ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Wiklund. (2008a). “Habitat Choice Precedes Host Plant Choice – Niche Separation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in a Species Pair of a Generalist and a Specialist Butterfly.” Oikos 117 (9): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ 1337–44. doi:10.1111/j.0030-1299.2008.16740.x. ​ ​

Friberg, Magne, Martin Bergman, Jaakko Kullberg, Niklas Wahlberg, and Christer ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Wiklund. (2008b). “Niche Separation in Space and Time between Two Sympatric ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Sister Species—a Case of Ecological Pleiotropy.” Evolutionary Ecology 22 (1): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ 1–18. doi:10.1007/s10682-007-9155-y. ​ ​

Friberg, Magne, Inger M. Aalberg Haugen, Josefin Dahlerus, Karl Gotthard, and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Christer Wiklund. (2011). “Asymmetric Life-History Decision-Making in Butterfly ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Larvae.” Oecologia 165 (2): 301–10. doi:10.1007/s00442-010-1804-0. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​

Goto, Shin G, and David L Denlinger. (2002). “Short-Day and Long-Day ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Expression Patterns of Genes Involved in the Flesh Fly Clock Mechanism: Period, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Timeless, Cycle and Cryptochrome.” Journal of Insect Physiology 48 (8): 803–16. ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1016/S0022-1910(02)00108-7.

61 Grabherr, Manfred G., Brian J. Haas, Moran Yassour, Joshua Z. Levin, Dawn A. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Federica di Palma, Bruce W. Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Friedman, and . (2011). “Full-Length Transcriptome Assembly from ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ RNA-Seq Data without a Reference Genome.” Nature Biotechnology 29 (7): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ 644–52. doi:10.1038/nbt.1883. ​ ​

Haas, Brian J., Alexie Papanicolaou, Moran Yassour, Manfred Grabherr, Philip D. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Blood, Joshua Bowden, Matthew Brian Couger, David Eccles, Bo Li, Matthias ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Lieber, Matthew D. MacManes, Michael Ott, Joshua Orvis, Nathalie Pochet, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Francesco Strozzi, Nathan Weeks, Rick Westerman, Thomas William, Colin N. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Dewey, Robert Henschel, Richard D. LeDuc, Nir Friedman, and Aviv Regev. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (2013). “De Novo Transcript Sequence Reconstruction from RNA-Seq Using the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Trinity Platform for Reference Generation and Analysis.” Nature Protocols 8 (8): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ 1494–1512. doi:10.1038/nprot.2013.084. ​ ​

Haas, Brian (2016), "Transcriptome Contig Nx and ExN50 stats”, available at ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome%20Contig %20Nx%20and%20ExN50%20stats (last accessed on August 7, 2017). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Haas, Brian (2017a), “Trinotate: Transcriptome Functional Annotation and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Analysis”, https://trinotate.github.io/ ​ ​

Haas, Brian (2017b), “Trinity Transcript Quantification”, available at ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantif ication (last accessed on August 8, 2017) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Haldane, J.B.S. (1932). The Causes of Evolution. (London; New York: Longmans, ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Green & Co.). ​ ​ ​ ​

Han, Bing, and David L. Denlinger. (2009). “Length Variation in a Specific Region ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of the Period Gene Correlates with Differences in Pupal Diapause Incidence in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Flesh Fly, Sarcophaga Bullata.” Journal of Insect Physiology, Special Issue on Insect ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Clocks, 55 (5): 415–18. doi:10.1016/j.jinsphys.2009.01.005. ​ ​ ​ ​ ​ ​ ​ ​

Hansen, Kasper D., Steven E. Brenner, and Sandrine Dudoit. (2010). “Biases in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Illumina Transcriptome Sequencing Caused by Random Hexamer Priming.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Nucleic Acids Research 38 (12): e131–e131. doi:10.1093/nar/gkq224. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

62 Hirsh, Jay, Thomas Riemensperger, Hélène Coulom, Magali Iché, Jamie Coupar, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and Serge Birman. (2010). “Roles of Dopamine in Circadian Rhythmicity and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Extreme Light Sensitivity of Circadian Entrainment.” Current Biology 20 (3): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ 209–14. doi:10.1016/j.cub.2009.11.037. ​ ​

Houk, Edward J., and Stanley D. Beck. (1977). “Distribution of Putative ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Neurotransmitters in the Brain of the European Corn Borer, Ostrinia Nubilalis.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Journal of Insect Physiology 23 (9): 1209–17. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1016/0022-1910(77)90155-X.

Howard, Ken. (1998). “Organogenesis: Drosophila Goes Gonadal.” Current Biology ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ 8 (12): R415–17. doi:10.1016/S0960-9822(98)70266-0. ​ ​ ​ ​ ​ ​

Isabel, G, L Gourdoux, and R Moreau. (2001). “Changes of Biogenic Amine Levels ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ in Haemolymph during Diapausing and Non-Diapausing Status in Pieris Brassicae ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ L.” Comparative Biochemistry and Physiology Part A: Molecular & Integrative ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Physiology 128 (1): 117–27. doi:10.1016/S1095-6433(00)00284-1. ​ ​ ​ ​ ​ ​ ​ ​

Itoh, Masanori T., Atsuhiko Hattori, Tsuyoshi Nomura, Yawara Sumi, and Takuro ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Suzuki. (1995). “Melatonin and Arylalkylamine N-Acetyltransferase Activity in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the Silkworm, Bombyx Mori.” Molecular and Cellular Endocrinology 115 (1): ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 59–64. doi:10.1016/0303-7207(95)03670-3. ​ ​

Jiggins, Chris D., Russell E. Naisbit, Rebecca L. Coe, and James Mallet. (2001). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “Reproductive Isolation Caused by Colour Pattern Mimicry.” Nature 411 (6835): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ 302–5. doi:10.1038/35077075. ​ ​

Jindra, Marek, Subba R. Palli, Lynn M. Riddiford. (2013). “The Juvenile Hormone ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Signaling Pathway in Insect Development.” Annual Review of Entomology 58 (1): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 181–204

Kim, Daehwan, Ben Langmead, and Steven L. Salzberg. (2015). “HISAT: A Fast ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Spliced Aligner with Low Memory Requirements.” Nature Methods 12 (4): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ 357–60. doi:10.1038/nmeth.3317. ​ ​

Ko tál, Vladimı́r, Hirofumi Noguchi, Kimio Shimada, and Yoichi Hayakawa. š (2000). “Circadian Component Influences the Photoperiodic Induction of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Diapause in a Drosophilid Fly, Chymomyza Costata.” Journal of Insect Physiology ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ 46 (6): 887–96. doi:10.1016/S0022-1910(99)00195-X. ​ ​ ​ ​ ​ ​

Ko tál, Vladimı́r, and Kimio Shimada. (2001). “Malfunction of Circadian Clock in š the Non-Photoperiodic-Diapause Mutants of the Drosophilid Fly, Chymomyza ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Costata.” Journal of Insect Physiology 47 (11): 1269–74. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1016/S0022-1910(01)00113-5.

63 Ko tál, Vladimír. (2011). “Insect Photoperiodic Calendar and Circadian Clock: š ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Independence, Cooperation, or Unity?” Journal of Insect Physiology, Dormancy ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and Developmental Arrest in Invertebrates, 57 (5): 538–56. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1016/j.jinsphys.2010.10.006.

Langmead, Ben, and Steven L. Salzberg. (2012). “Fast Gapped-Read Alignment ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ with Bowtie 2.” Nature Methods 9 (4): 357–59. doi:10.1038/nmeth.1923. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Leek, Jeffrey T., W. Evan Johnson, Hilary S. Parker, Andrew E. Jaffe, and John D. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Storey. (2012). “The Sva Package for Removing Batch Effects and Other ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Unwanted Variation in High-Throughput Experiments.” Bioinformatics 28 (6): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ 882–83. doi:10.1093/bioinformatics/bts034. ​ ​

Lees, A. D. (1953). “The Significance of the Light and Dark Phases in the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Photoperiodic Control of Diapause in Metatetranychus Ulmi Koch.” Annals of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Applied Biology 40 (3): 487–97. doi:10.1111/j.1744-7348.1953.tb02388.x. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Lees, A. D. (1973). “Photoperiodic Time Measurement in the Aphid Megoura ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Viciae.” Journal of Insect Physiology 19 (12): 2279–2316. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1016/0022-1910(73)90237-0.

Linn, C. E., K. R. Poole, W. L. Roelofs, and W.-Q. Wu. (1995). “Circadian Changes in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Melatonin in the Nervous System and Hemolymph of the Cabbage Looper Moth, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Trichoplusia Ni.” Journal of Comparative Physiology A 176 (6): 761–71. ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1007/BF00192624.

Liu, Wen, Yi Li, Li Zhu, Fen Zhu, Chao-Liang Lei, and Xiao-Ping Wang. (2016). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “Juvenile Hormone Facilitates the Antagonism between Adult Reproduction and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Diapause through the Methoprene-Tolerant Gene in the Female Colaphellus ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Bowringi.” Insect Biochemistry and Molecular Biology 74 (July): 50–60. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ doi:10.1016/j.ibmb.2016.05.004.

Livingstone, Margaret S., and Bruce L. Tempel. (1983). “Genetic Dissection of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Monoamine Neurotransmitter Synthesis in Drosophila.” Nature 303 (5912): ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ 67–70. doi:10.1038/303067a0. ​ ​

Lorkovic, Z. (1993). "Leptidea reali Reissinger, 1989 (= lorkovicii Real 1988), a ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ new European species (Lepid., Pieridae)." Natura Croatica 2, no. 1: 1-26. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Love, Michael I., Wolfgang Huber, and Simon Anders. (2014). “Moderated ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Genome Biology 15: 550. doi:10.1186/s13059-014-0550-8. ​ ​ ​ ​ ​ ​ ​ ​

64 Love, Michael I., Simon Anders, Vladislav Kim, and Wolfgang Huber. (2016). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “RNA-Seq Workflow: Gene-Level Exploratory Analysis and Differential ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Expression.” F1000Research 4 (November): 1070. ​ ​ ​ ​ ​ ​ ​ ​ doi:10.12688/f1000research.7035.1.

Lukhtanov, Vladimir A., Vlad Dincă, Gerard Talavera, and Roger Vila. (2011). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “Unprecedented Within-Species Chromosome Number Cline in the Wood White ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Butterfly Leptidea Sinapis and Its Significance for Karyotype Evolution and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Speciation.” BMC Evolutionary Biology 11 (1): 109. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1186/1471-2148-11-109.

MATLAB (2016). MATLAB 2016b, The MathWorks, Inc., Natick, Massachusetts, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ United States. ​ ​

Maynard Smith, J. (1966). “Sympatric Speciation.” The American Naturalist 100 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ (916): 637–50. ​ ​

Mayr, E. (1963). Animal species and evolution. (Cambridge, MA: Harvard ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ University Press). ​ ​

Merzendorfer, Hans, and Lars Zimoch. (2003). “Chitin Metabolism in Insects: ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Structure, Function and Regulation of Chitin Synthases and Chitinases.” Journal of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Experimental Biology 206 (24): 4393–4412. doi:10.1242/jeb.00709. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Meuti, Megan E., and David L. Denlinger. (2013). “Evolutionary Links Between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Circadian Clocks and Photoperiodic Diapause in Insects.” Integrative and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Comparative Biology 53 (1): 131–43. doi:10.1093/icb/ict023. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Meuti, Megan E., Mary Stone, Tomoko Ikeno, and David L. Denlinger. (2015). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “Functional Circadian Clock Genes Are Essential for the Overwintering Diapause ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of the Northern House Mosquito, Culex Pipiens.” Journal of Experimental Biology ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 218 (3): 412–22. doi:10.1242/jeb.113233. ​ ​ ​ ​ ​ ​

Miller, Jason R., Sergey Koren, and Granger Sutton. (2010). “Assembly Algorithms ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ for Next-Generation Sequencing Data.” Genomics 95 (6): 315–27. ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ doi:10.1016/j.ygeno.2010.03.001.

Morgulis, Aleksandr, E. Michael Gertz, Alejandro A. Schäffer, and Richa Agarwala. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (2006). “A Fast and Symmetric DUST Implementation to Mask Low-Complexity ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ DNA Sequences.” Journal of Computational Biology 13 (5): 1028–40. ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1089/cmb.2006.13.1028.

65 Nawrocki, Eric P., Sarah W. Burge, , Jennifer Daub, Ruth Y. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Eberhardt, Sean R. Eddy, Evan W. Floden, Paul P. Gardner, Thomas A. Jones, John ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Tate, and Robert D. Finn. (2015). “Rfam 12.0: Updates to the RNA Families ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Database.” Nucleic Acids Research 43 (D1): D130–37. doi:10.1093/nar/gku1063. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Noguchi, Hirofumi, and Yoichi Hayakawa. (2001). “Dopamine Is a Key Factor for ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the Induction of Egg Diapause of the Silkworm, Bombyx Mori.” European Journal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ of Biochemistry 268 (3): 774–80. doi:10.1046/j.1432-1327.2001.01933.x. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Parisi, Federica, Sara Riccardo, Sheri Zola, Carlina Lora, Daniela Grifoni, Lewis M. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Brown, and Paola Bellosta. (2013). “DMyc Expression in the Fat Body Affects ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ DILP2 Release and Increases the Expression of the Fat Desaturase Desat1 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Resulting in Organismal Growth.” Developmental Biology 379 (1): 64–75. ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1016/j.ydbio.2013.04.008.

Patro, Rob, Geet Duggal, Michael I. Love, Rafael A. Irizarry, and Carl Kingsford. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (2017). “Salmon Provides Fast and Bias-Aware Quantification of Transcript ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Expression.” Nature Methods 14 (4): 417–19. doi:10.1038/nmeth.4197. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Pegoraro, Mirko, Akanksha Bafna, Nathaniel J. Davies, David M. Shuker, and Eran ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Tauber. (2016). “DNA Methylation Changes Induced by Long and Short ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Photoperiods in Nasonia.” Genome Research 26 (2): 203–10. ​ ​ ​ ​ ​ ​​ ​ ​ ​​ ​ ​ ​ ​ ​ doi:10.1101/gr.196204.115.

Pertea, Mihaela, Geo M. Pertea, Corina M. Antonescu, Tsung-Cheng Chang, Joshua ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ T. Mendell, and Steven L. Salzberg. (2015). “StringTie Enables Improved ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Reconstruction of a Transcriptome from RNA-Seq Reads.” Nature Biotechnology ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ 33 (3): 290–95. doi:10.1038/nbt.3122. ​ ​ ​ ​ ​ ​

Pertea, Mihaela, Daehwan Kim, Geo M. Pertea, Jeffrey T. Leek, and Steven L. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Salzberg. (2016). “Transcript-Level Expression Analysis of RNA-Seq Experiments ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ with HISAT, StringTie and Ballgown.” Nature Protocols 11 (9): 1650–67. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1038/nprot.2016.095.

Powell, Jeffrey R., and Marko Andjelković. (1983). “Population Genetics of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Drosophila Amylase. Iv. Selection in Laboratory Populations Maintained on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Different Carbohydrates.” Genetics 103 (4): 675–89. ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​

Pruisscher, Peter. (2016). “Genetic Signature of Diapause Adaptation in Two ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Butterflies”. Licentiate thesis (Stockholm University). ​ ​ ​ ​ ​ ​ ​ ​

66 Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Yarza, Jörg Peplies, and Frank Oliver Glöckner. (2013). “The SILVA Ribosomal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ RNA Gene Database Project: Improved Data Processing and Web-Based Tools.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Nucleic Acids Research 41 (D1): D590–96. doi:10.1093/nar/gks1219. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Ravinet, M., R. Faria, R. K. Butlin, J. Galindo, N. Bierne, M. Rafajlović, M. a. F. Noor, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ B. Mehlig, and A. M. Westram. (2017). “Interpreting the Genomic Landscape of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Speciation: A Road Map for Finding Barriers to Gene Flow.” Journal of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Evolutionary Biology 30 (8): 1450–77. doi:10.1111/jeb.13047. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Riddiford, Lynn M. (2012). “How Does Juvenile Hormone Control Insect ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Metamorphosis and Reproduction?” General and Comparative Endocrinology 179 ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ (3): 477–84. doi:10.1016/j.ygcen.2012.06.001. ​ ​ ​ ​

Sang, Tzu-Kang, Hui-Yun Chang, George M. Lawless, Anuradha Ratnaparkhi, Lisa ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Mee, Larry C. Ackerson, Nigel T. Maidment, David E. Krantz, and George R. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Jackson. (2007). “A Drosophila Model of Mutant Human Parkin-Induced Toxicity ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Demonstrates Selective Loss of Dopaminergic Neurons and Dependence on ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Cellular Dopamine.” Journal of Neuroscience 27 (5): 981–92. ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1523/JNEUROSCI.4810-06.2007.

Sauders, D. S. (1976). Insect Clocks (Oxford: Pergamon Press) ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​

Saunders, David S. (2012). “Insect Photoperiodism: Seeing the Light.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Physiological Entomology 37 (3): 207–18. ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1111/j.1365-3032.2012.00837.x.

Schmieder, Robert, and Robert Edwards. (2011). “Quality Control and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Preprocessing of Metagenomic Datasets.” Bioinformatics 27 (6): 863–64. ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​

Schneider, Ralf F., and Axel Meyer. (2017). “How Plasticity, Genetic Assimilation ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and Cryptic Genetic Variation May Contribute to Adaptive Radiations.” Molecular ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ Ecology 26 (1): 330–50. doi:10.1111/mec.13880. ​ ​ ​ ​ ​ ​ ​ ​

Shapiro, B.J., J.B. Leducq, and J. Mallet. (2016). “What Is Speciation?” PLOS ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ Genetics 12 (3): e1005860. doi:10.1371/journal.pgen.1005860. ​ ​ ​ ​ ​ ​ ​ ​

Sim, Cheolho, and David L. Denlinger. (2008). “Insulin Signaling and FOXO ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Regulate the Overwintering Diapause of the Mosquito Culex Pipiens.” Proceedings ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ of the National Academy of Sciences 105 (18): 6777–81. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1073/pnas.0802067105.

67 Smeds, Linnéa, and Axel Künstner. (2011). “ConDeTri - A Content Dependent ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Read Trimmer for Illumina Data.” PLOS ONE 6 (10): e26314. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​

Soneson, Charlotte, Michael I. Love, and Mark D. Robinson. 2015. “Differential ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Analyses for RNA-Seq: Transcript-Level Estimates Improve Gene-Level ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Inferences.” F1000Research 4 (December): 1521. ​ ​​ ​ ​ ​ ​ ​ ​ doi:10.12688/f1000research.7563.1.

Stehlík, J., R. Závodská, K. Shimada, I. auman, and V. Ko tál. (2008). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​Š ​ ​ ​ ​ ​ ​ š ​ ​ “Photoperiodic Induction of Diapause Requires Regulated of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Timeless in the Larval Brain of Chymomyza Costata.” Journal of Biological ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ Rhythms 23 (2): 129–39. doi:10.1177/0748730407313364. ​ ​ ​ ​ ​ ​ ​ ​

Stuart, Ann E., J. Borycz, and Ian A. Meinertzhagen. (2007). “The Dynamics of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Signaling at the Histaminergic Photoreceptor Synapse of Arthropods.” Progress in ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Neurobiology 82 (4): 202–27. doi:10.1016/j.pneurobio.2007.03.006. ​ ​ ​ ​ ​ ​ ​ ​

Sugumaran, Manickam. (2002). “Comparative Biochemistry of Eumelanogenesis ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and the Protective Roles of Phenoloxidase and Melanin in Insects.” Pigment Cell ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Research 15 (1): 2–9. doi:10.1034/j.1600-0749.2002.00056.x. ​ ​ ​ ​ ​ ​ ​ ​

Talla, V., Suh, A., Kalsoom, F., Dincă, V., Vila, R., Friberg, M., Wiklund, C., ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Backström, N. (2017). “Rapid increase in genome size as a consequence of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ transposable element hyperactivity in wood-white (Leptidea) butterflies”. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Genome Biology and Evolution (in press) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Tauber, Eran, Mauro Zordan, Federica Sandrelli, Mirko Pegoraro, Nicolò ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Osterwalder, Carlo Breda, Andrea Daga, Alessandro Selmin, Karen Monger, Clara ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Benna, Ezio Rosato, Charalambos P. Kyriacou, and Rodolfo Costa. (2007). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “Natural Selection Favors a Newly Derived Timeless Allele in Drosophila ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Melanogaster.” Science 316 (5833): 1895–98. doi:10.1126/science.1138412. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​

Tolman, T. (1997). Butterflies of Britain and Europe (London: Harper Collins). ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Trapnell, Cole, Lior Pachter, and Steven L. Salzberg. (2009). “TopHat: Discovering ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Splice Junctions with RNA-Seq.” Bioinformatics 25 (9): 1105–11. ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​ ​ ​ doi:10.1093/bioinformatics/btp120.

Trapnell, Cole, Brian A. Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Marijke J. van Baren, Steven L. Salzberg, Barbara J. Wold, and Lior Pachter. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (2010). “Transcript Assembly and Quantification by RNA-Seq Reveals ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Unannotated Transcripts and Isoform Switching during Cell Differentiation.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Nature Biotechnology 28 (5): 511–15. ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1038/nbt.1621.

68 Trapnell, Cole, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Kelley, Harold Pimentel, Steven L. Salzberg, John L. Rinn, and Lior Pachter. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (2012). “Differential Gene and Transcript Expression Analysis of RNA-Seq ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Experiments with TopHat and Cufflinks.” Nature Protocols 7 (3): 562–78. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1038/nprot.2012.016.

Veerman, Alfred. (2001). “Photoperiodic Time Measurement in Insects and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Mites: A Critical Evaluation of the Oscillator-Clock Hypothesis.” Journal of Insect ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ Physiology 47 (10): 1097–1109. doi:10.1016/S0022-1910(01)00106-8. ​ ​ ​ ​ ​ ​ ​ ​

Wang, Qiushi, Ikumi Hanatani, Makio Takeda, Katsutaka Oishi, and Katsuhiko ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Sakamoto. (2015a). “D2-like Dopamine Receptors Mediate Regulation of Pupal ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Diapause in Chinese Oak Silkmoth Antheraea Pernyi.” Entomological Science 18 ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ (2): 193–98. doi:10.1111/ens.12099. ​ ​ ​ ​

Wang, Qiushi, Yuichi Egi, Makio Takeda, Katsutaka Oishi, and Katsuhiko ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Sakamoto. (2015b). “Melatonin Pathway Transmits Information to Terminate ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Pupal Diapause in the Chinese Oak Silkmoth Antheraea Pernyi and through ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Reciprocated Inhibition of Dopamine Pathway Functions as a Photoperiodic ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Counter.” Entomological Science 18 (1): 74–84. doi:10.1111/ens.12083. ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

White, M.J.D. (1978). Modes of speciation. (San Francisco: Freeman). ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Wolf, Jochen B. W., and Hans Ellegren. (2017). “Making Sense of Genomic Islands ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ of Differentiation in Light of Speciation.” Nature Reviews Genetics 18 (2): 87–100. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1038/nrg.2016.133.

Wood, Derrick E., and Steven L. Salzberg. (2014). “Kraken: Ultrafast Metagenomic ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Sequence Classification Using Exact Alignments.” Genome Biology 15 (3): R46. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ doi:10.1186/gb-2014-15-3-r46.

Wright, S. (1931). “Evolution in Mendelian Populations”. Genetics, 16(2), 97–159. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​

Wu, Chung-I., and Chau-Ti Ting. (2004). “Genes and Speciation.” Nature Reviews ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ Genetics 5 (2): 114–22. doi:10.1038/nrg1269. ​ ​ ​ ​ ​ ​ ​ ​

Wucherpfennig, Tanja, Michaela Wilsch-Bräuninger, and Marcos ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ González-Gaitán. (2003). “Role of Drosophila Rab5 during Endosomal Trafficking ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ at the Synapse and Evoked Neurotransmitter Release.” The Journal of Cell Biology ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ 161 (3): 609–24. doi:10.1083/jcb.200211087. ​ ​ ​ ​ ​ ​

69 Yamada, Hirokazu, and Masa-Toshi Yamamoto. (2011). “Association between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Circadian Clock Genes and Diapause Incidence in Drosophila Triauraria.” PLOS ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ONE 6 (12): e27493. doi:10.1371/journal.pone.0027493. ​ ​ ​ ​ ​ ​ ​ ​

Zhan, Shuai, Christine Merlin, Jeffrey L. Boore, and Steven M. Reppert. (2011). ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ “The Monarch Butterfly Genome Yields Insights into Long-Distance Migration.” ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Cell 147 (5): 1171–85. doi:10.1016/j.cell.2011.09.052. ​ ​ ​ ​ ​ ​ ​ ​

Zhan, Shuai, and Steven M. Reppert. (2013). “MonarchBase: The Monarch ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Butterfly Genome Database.” Nucleic Acids Research 41 (D1): D758–63. ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ doi:10.1093/nar/gks1057.

Zinke, Ingo, Christina S. Schütz, Jörg D. Katzenberger, Matthias Bauer, and ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ Michael J. Pankratz. (2002). “Nutrient Control of Gene Expression in Drosophila: ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ Microarray Analysis of Starvation and Sugar dependent Response.” The EMBO ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ - ​ ​ ​ ​​ ​ ​ Journal 21 (22): 6162–73. doi:10.1093/emboj/cdf600. ​ ​ ​ ​ ​ ​ ​ ​

70 Supplementary information ​ ​

Table S1 | List of scripts used in this project. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

71 Table S1 (cont.) | List of scripts used in this project. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

72 Table S1 (cont.) | List of scripts used in this project. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

73 Table S2 | List of samples used to prepare the final, optimized transcriptome. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ All files located at ​ ​ ​ ​ ​ ​ /proj/b2014034/nobackup/Luis/RNAseq_Lsinapis/1z_QC_concatenated_files.

74 Table S3 | List of significant GO categories, for the set of differentially expressed genes across ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ light treatments for instar-III samples. The number of differentially expressed genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (significant) and the total number of annotated genes associated to a GO category are shown, as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ well as the number of expected genes. p-values computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

75 Table S3 (cont.) | List of significant GO categories, for the set of differentially expressed genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ across light treatments for instar-III samples. The number of differentially expressed genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ (significant) and the total number of annotated genes associated to a GO category are shown, as ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ well as the number of expected genes. p-values computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

76 Table S4 | List of significant GO categories, for the set of differentially expressed genes across ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ light treatments for instar-V samples. The number of differentially expressed genes (significant) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and the total number of annotated genes associated to a GO category are shown, as well as the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of expected genes. p-values computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

77 Table S5 | List of significant GO categories, for the set of differentially expressed genes across ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ light treatments for pupal samples. The number of differentially expressed genes (significant) ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ and the total number of annotated genes associated to a GO category are shown, as well as the ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of expected genes. p-values computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

78 Table S6 | List of significant GO categories, for the set of differentially expressed genes between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-III and instar-V developmental stages, for the short-day light treatment. The number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentially expressed genes (significant) and the total number of annotated genes associated ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to a GO category are shown, as well as the number of expected genes. p-values computed using ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the Fisher-elim algorithm. ​ ​ ​ ​

79 Table S6 (cont.) | List of significant GO categories, for the set of differentially expressed genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ between instar-III and instar-V developmental stages, for the short-day light treatment. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of differentially expressed genes (significant) and the total number of annotated genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to a GO category are shown, as well as the number of expected genes. p-values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​

80 Table S7 | List of top-100 most significant GO categories, for the set of differentially expressed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes between instar-V and pupal developmental stages, for the short-day light treatment. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of differentially expressed genes (significant) and the total number of annotated genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to a GO category are shown, as well as the number of expected genes. p-values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​

81 Table S7 (cont.) | List of top-100 most significant GO categories, for the set of differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes between instar-V and pupal developmental stages, for the short-day light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment. The number of differentially expressed genes (significant) and the total number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ annotated genes associated to a GO category are shown, as well as the number of expected ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes. p-values computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

82 Table S8 | List of significant GO categories, for the set of differentially expressed genes between ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ instar-III and instar-V developmental stages, for the long-day light treatment. The number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ differentially expressed genes (significant) and the total number of annotated genes associated ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ to a GO category are shown, as well as the number of expected genes. p-values computed using ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ the Fisher-elim algorithm. ​ ​ ​ ​

83 Table S8 (cont.) | List of significant GO categories, for the set of differentially expressed genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ between instar-III and instar-V developmental stages, for the long-day light treatment. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of differentially expressed genes (significant) and the total number of annotated genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to a GO category are shown, as well as the number of expected genes. p-values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​

84 Table S9 | List of top-150 most significant GO categories, for the set of differentially expressed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes between instar-V and pupal developmental stages, for the long-day light treatment. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of differentially expressed genes (significant) and the total number of annotated genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to a GO category are shown, as well as the number of expected genes. p-values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​

85 Table S9 (cont.) | List of top-150 most significant GO categories, for the set of differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes between instar-V and pupal developmental stages, for the long-day light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment. The number of differentially expressed genes (significant) and the total number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ annotated genes associated to a GO category are shown, as well as the number of expected ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes. p-values computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

86 Table S9 (cont.) | List of top-150 most significant GO categories, for the set of differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes between instar-V and pupal developmental stages, for the long-day light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment. The number of differentially expressed genes (significant) and the total number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ annotated genes associated to a GO category are shown, as well as the number of expected ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes. p-values computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

87 Table S10 | List of top-100 most significant GO categories, for the set of differentially expressed ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes between pupal and adult developmental stages, for the long-day light treatment. The ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ number of differentially expressed genes (significant) and the total number of annotated genes ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ associated to a GO category are shown, as well as the number of expected genes. p-values ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​

88 Table S10 (cont.) | List of top-100 most significant GO categories, for the set of differentially ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ expressed genes between pupal and adult developmental stages, for the long-day light ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ treatment. The number of differentially expressed genes (significant) and the total number of ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ annotated genes associated to a GO category are shown, as well as the number of expected ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ genes. p-values computed using the Fisher-elim algorithm. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

89

Figure S1 | Boxplot for normalized and sva-adjusted counts for Ddc and st genes. Red line ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ indicates the median value; blue diamond symbol indicates the mean value. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

90

Figure S2 | Boxplot for normalized and sva-adjusted counts for CYP15C1 gene. Red line ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ indicates the median value; blue diamond symbol indicates the mean value. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

91

Figure S3 | Boxplot for normalized and sva-adjusted counts for PPO2 and zfh1 genes. Red line ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​​ ​ ​​ ​ ​ ​ ​ ​ ​ indicates the median value; blue diamond symbol indicates the mean value. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

92

Figure S4 | Boxplot for normalized and sva-adjusted counts for Vmat and CG32447 genes. Red ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​​ ​ ​ line indicates the median value; blue diamond symbol indicates the mean value. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

93

Figure S5 | Boxplot for normalized and sva-adjusted counts for Hn and Manf genes. Red line ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​​ ​ ​​ ​ ​ ​ ​ indicates the median value; blue diamond symbol indicates the mean value. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

94

Figure S6 | Boxplot for normalized and sva-adjusted counts for the cyc and per circadian clock ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​​ ​ ​ ​​ ​ ​​ ​ ​ genes. Red line indicates the median value; blue diamond symbol indicates the mean value. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

95

Figure S7 | Boxplot for normalized and sva-adjusted counts for tim, a circadian clock gene. Red ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ line indicates the median value; blue diamond symbol indicates the mean value. ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

96