The genomic basis of climate and host adaptation

RAHUL VIVEK RANE

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

MARCH 2017 School of Biosciences, Faculty of Science The University of Melbourne

(Produced on archival quality paper)

Page | i

Abstract

Many species are currently threatened by the direct and indirect effects of anthropogenically driven climate change. The elevation of global temperatures and increase in variability in both temperature and precipitation pose a risk to biodiversity as species are pushed close to their thermal safety margins. Current predictions suggest a dramatic loss of species diversity and the contraction of geographical ranges of many species. Many ectothermic that cannot regulate their body temperature are likely to be threatened, particularly ecologically- restricted herbivorous insects that depend for on plants for food and that are often in phenological synchrony with their plant hosts.

However, adaptive shifts in these species in response to host loss and climatic extremes may counter the effects of climate change to some extent. This highlights the importance of studying species-specific adaptation mechanisms including host interactions. This dissertation contributes to this overall aim by studying the genomic basis of climatic and host adaptation. I use melanogaster as a model system at the intraspecific level, and Drosophila species from the repleta group as a model system for the comparative level. In assessing the genomic basis of host responses, I consider a much broader range of taxa.

This dissertation begins with a study on the use of chromosome level sequencing of D. melanogaster populations from two ends of a thermal cline. I present genomic evidence for the role of the inversion 3R Payne in capturing alleles favourable to local climatic conditions in the non-inverted form, and therefore driving adaptation to climate change.

The study further elucidates the impact of climatically important chromosomal inversions in driving higher linkage disequilibrium on the non-inverted form - potentially benefiting both karyotypes.

Page | ii

In the second chapter, I develop a new pipeline, Orthonome, and tools for multi-species comparisons for prediction of orthologues and inparalogues with the highest accuracy and recall. Using Orthonome, I was able to identify a much greater level of conservation across Drosophilid lineages than earlier thought, amounting to nearly 33% better resolution than industry-accepted methods.

I then use Orthonome in the third chapter to compare the genomes of 58 insect species – most of which are known to be agricultural pests. Testing across eight gene families, I present evidence for genomic patterns in only four gene families (P450s, CCEs, GSTs and ABCs) as being associated with polyphagy or particular host ranges. While three of them have been reported before, I find that ABC transporters present much stronger evidence than reflected in earlier studies, with feeding behaviour as well as host tissue displaying an effect on gene gain in more voracious pest species.

Finally, in the last study, I use novel genomic data and evidence from the repleta group of drosophilids to carry out phylogenetically constrained analyses of genes potentially associated with host and thermal stress adaptation. My aim here is to find mutually exclusive evolutionary pathways to neofunctionalisation between stress tolerant cactophilic specialists and less tolerant generalists in the group. I also find a different adaptive response in the cactophilic species compared to the generalist species; these species show little lineage specific gene gain, suggesting an exception to current standing theories on neofunctionalisation for adaptation.

I further discuss the applicability the species level and order level analyses for an overall detailed and systematic approach to identify the genomic basis of climatic and host adaptation.

Page | iii

Declaration

This is to certify that: 1. The thesis comprises only my original work towards the PhD except where

indicated in the Preface,

2. Due acknowledgement has been made in the text to all other material used,

3. The thesis is fewer than 100 000 words in length, exclusive of Tables, maps,

bibliographies and appendices OR the thesis is [number of words] as approved

by the Research Higher Degrees Committee.

SIGNATURE (Rahul Vivek Rane)

Page | iv

Preface

For the work carried out in Chapter 2, Ary A. Hoffmann (AAH), Siu Fai Lee (Ronald)

(SFL) and Rahul Vivek Rane (RVR) designed research; RVR and Lea Rako (LR) performed research; RVR, SFL and AAH contributed new reagents/analytical tools;

RVR and MK analysed data; and RVR, SFL and AAH wrote the paper. At least 60% of the work was undertaken by RVR and this study was published in the journal Molecular

Ecology 24(10):2423-32. doi: 10.1111/mec.13161.

In Chapter 3, RVR and SFL conceptualised the study. RVR designed the pipeline and wrote the scripts. Web interface conceptualised by RVR and TN and implemented/maintained by TN. John Oakeshott (JO), AAH and SFL helped interpret the biological significance of results. RVR, JO, AAH and SFL wrote the manuscript. At least 75% of the work was undertaken by RVR and the manuscript is currently under review with the journal BMC Genomics.

For Chapter 4, RVR, JO and Tom Walsh (TW) conceptualised the study. Stephen

Pearce (SP), Lars Jermiin (LJ), Karl Gordon (KG), SFL and AAH helped design the experiments and analyses. AAH, JO and SFL helped RVR with analyses and writing while TW, LJ and SP helped conceptualise the visualisations. A small part of the study was published in the journal Current Opinion in Insect Science 13:70-6. doi:

10.1016/j.cois.2015.12.001. The paper has been included in Appendix 1.0 for reference.

Approximately 65% of the work was undertaken by RVR and follow up work and preparation for publication will be undertaken during the examination of the thesis.

AAH, JO, SFL, CS and Goujie Zhang (GZ) provided materials for the work in Chapter

5 as a part of a larger project. RVR and SP carried out the analyses and Fang Li (FL),

Chris Coppin (CC), Jennifer Shirriffs (JS) assisted in various parts of the experiment

Page | v while JO, AAH, SP, SFL and MS assisted in further analyses, interpretation and writing. Approximately 60% of the work was undertaken by RVR and the manuscript will be submitted for publication during the examination of the thesis.

Page | vi

Acknowledgements

There are so many people I would like to thank for guiding me and helping me through these last few years on an intellectual, practical and emotional level.

To my supervisors Ronald Lee, Ary Hoffmann, John Oakeshott and Michele Shiffer, my heartfelt gratitude goes to you for believing in me and giving me the opportunity to explore the research world, for guiding me in my pursuit of science and for your attention and patience throughout. To my committee chair Phil Batterham for his invaluable support and guidance within and beyond research. Also a special thanks to

Nancy Endersby-Harshman and Vanessa White for their astute advice and constant support throughout the past few years. I would be lost in this journey without you all!

I would also like to thank my collaborators and co-workers (past and present) Stephen

Pearce, Alexandre Fournier-Level, Heng-Lin Yeap, Charles Robin, Lars Jermiin, Jen

Shiriffs, Lea Rako, Karl Gordon, Tom Walsh, Martin Kapun and the postdocs in the

Hoffmann lab for their advice and support and for introducing me to a wealth of information and avenues for my research. The support you all have provided me has greatly enriched my experience as well as knowledge in immeasurable ways.

A huge thank you to my fellow lab members, especially Rebecca, Heng, Elf, Alex,

Perran, Peter and Xuan as well as the members of the Robin lab Amol, Llew, Sue,

Caitlyn and Bec who have shared my journey with me while being the source of many a laughs and intellectually engaging discussions. It would have been a lonely journey without you all!

A giant thanks to my family (the Rane and the Srinivasan clans) for always having believed in me and unconditionally supporting me through my life’s journey leading up to this moment. This journey would not have happened without your support.

Page | vii

Most of all, the biggest thanks to my wife Neerja who has been my steadfast companion through these years with me, patiently supporting me through the thick and thin, celebrating my successes and learning from the non-successes with me and being the perineal source of motivation through this journey. This journey would not have been half as fun without you and I have reached this point thanks to you!

Page | viii

Table of Contents ABSTRACT ...... II

DECLARATION ...... IV

PREFACE ...... V

ACKNOWLEDGEMENTS ...... VII

CHAPTER 1 ...... 1

INTRODUCTION ...... 1

1.1 Adaptation in response to climate change ...... 3

1.1.1 Adaptation by phenotypic plasticity ...... 4

1.1.2 Genetically based adaptation ...... 5

1.1.3 Origin and fixation of adaptive alleles ...... 8

1.2 Evolution of plant host adaptation in herbivorous insects ...... 14

1.3 The genomic basis of climate and host adaptation ...... 15

1.4 Thesis synopsis: ...... 18

CHAPTER 2 ...... 21

GENOMIC EVIDENCE FOR ROLE OF INVERSION 3RP OF DROSOPHILA MELANOGASTER IN

FACILITATING CLIMATE CHANGE ADAPTATION ...... 21

Abstract ...... 22

2.1 Introduction ...... 23

2.2 Materials and Methods ...... 26

2.3 Results and Discussion ...... 31

Acknowledgments ...... 38

Availability of data ...... 38

CHAPTER 3 ...... 45

Page | ix

ORTHONOME – A NEW PIPELINE FOR PREDICTING HIGH QUALITY ORTHOLOGUE SETS

APPLICABLE TO COMPLETE AND DRAFT GENOMES...... 45

Abstract ...... 46

3.1 Introduction ...... 46

3.2 Implementation ...... 48

3.3 Results and Discussion ...... 51

3.3.1 Benchmarking against current pipelines ...... 51

3.3.2 Detailed comparison of Orthonome with OrthoDB using the 12 genome

data set ...... 52

3.3.3 Detailed comparison of Orthonome with OrthoDB using the 20 genome

data set ...... 56

3.4 Conclusion ...... 57

Acknowledgements ...... 58

CHAPTER 4 ...... 63

LINEAGE-SPECIFIC EXPANSIONS IN DETOXIFICATION-RELATED ENZYME AND

TRANSPORTER GENE FAMILIES ARE ASSOCIATED WITH HOST USE TRAITS IN

HERBIVOROUS INSECTS ...... 63

Abstract: ...... 63

4.1 Introduction ...... 64

4.2 Materials and Methods ...... 66

4.3 Results ...... 67

4.3.1 Cross-order analyses ...... 68

4.3.2 Order-specific analyses ...... 71

4.4 Discussion ...... 75

Acknowledgements ...... 78

Page | x

CHAPTER 5 ...... 85

COMPARATIVE GENOMICS AND TRANSCRIPTOMICS OF DROSOPHILA SPECIES IN THE

REPLETA GROUP: WHAT TYPE OF CHANGES DRIVE LINEAGE SPECIFIC RESPONSES? ...... 85

Abstract ...... 85

5.1 Introduction ...... 86

5.2 Materials and methods ...... 89

5.3 Results and Discussion ...... 100

5.3.1 Genome assemblies: ...... 100

5.3.2 Genome annotations: ...... 101

5.3.3 Phylogenomics of the Drosophila genus:...... 102

5.3.4 Comparative genomic analysis and lineage specific gene expansions: ... 103

5.3.5 Genes under positive selection in the repleta group: ...... 109

5.3.6 Transcriptomic analysis of thermal tolerance in D. buzzatii and D. hydei

...... 113

5.3.7 Overlap between lineage specific gene expansion, positive selection and

transcriptional response ...... 120

5.4 Conclusion ...... 127

CHAPTER 6 ...... 129

CONCLUSION AND FUTURE DIRECTIONS ...... 129

REFERENCES ...... 140

APPENDIX 1 (PAPER) ...... 159

APPENDIX 2 (CHAPTER 2) ...... 160

APPENDIX 3 (CHAPTER 3) ...... 163

SUPPLEMENTARY METHODS ...... 163 Page | xi

SUPPLEMENTARY NOTES ...... 166

APPENDIX 4 (CHAPTER 4) ...... 173

APPENDIX 5 (CHAPTER 5) ...... 181

Page | xii

List of figures Figure 1.1: Changes in allele frequency have been observed in the cases of genes such as AdhS, and inversion In(3R)Payne...... 6

Figure 1.2: A simplified model for genetic adaptation ...... 11

Figure 2.1. Mean differentiation and LD plotted using 200 kb windows slid every 50 kb.

...... 40

Figure 2.2. Pattern of linkage disequilibrium across chromosome arm 3R for Innisfail populations ...... 42

Figure 2.3. Distribution of most differentiated outlier SNV’s across chromosome arm

3R in windows of 1 Mb...... 43

Figure 3.1. Process flow diagram for the Orthonome pipeline ...... 59

Figure 3.2: Species tree discordance test measuring the congruence between trees based on predicted orthologue sets and reference species trees ...... 60

Figure 3.3. Comparison between Orthonome, MSOAR and OrthoDB for orthologue identification across syntenically supported regions and fast evolving gene families. .. 61

Figure 4.1: Distribution of normalised gene numbers in each order for each of the nine gene families for the host use types ...... 80

Figure 4.2: Distribution of normalised total gene counts (green) and duplications

(orange) in each of the eight gene families for diet and host use combinations in every order...... 82

Fig. 5.1: Phylogenetic relationships of the five repleta group species to earlier sequenced Drosophila species based on 1,802 orthogroups shared by all species...... 102

Figure 5.2: Overlap between gene duplications in each repleta group species after annotating each duplication with the phylogenetically most similar orthogroup...... 107

Page | xiii

Figure 5.3: Overlap between orthogroups belonging to genes under positive selection in each repleta group species ...... 111

Figure 5.4: Gene ontology clusters enriched in the differentially expressed genes of D. hydei and D. buzzatii after being split into upregulated (A) or downregulated (B) groups in response to stress...... 115

Figure ...... 116

Figure 5.6: Species-specific expression of genes falling into SSOM clusters 2-7...... 119

Figure 5.7: The overlap between lineage specific orthogroups, genes under positive selection in cactophilic and generalist species, heat/desiccation candidate genes and differentially expressed genes in D. hydei and D. buzzatii...... 122

Figure 5.8: Functional Sets enriched in each analysis for the repleta group lineage, generalist and specialist species overall, as well as the constituent species...... 126

Figure S3.1: Correlation between genome assembly and annotation quality and orthologue identification...... 169

Figure S4.1: A comparison between numbers of genes in each order for each of the eight gene families analysed in this study. The total size of every stacked bar-plot indicates the total number of genes while the part of the bar coloured only orange indicates orthologues and only blue indicates duplications...... 174

Figure S4.3: A comparison between numbers of genes in each order for a fixed array of lineages for four gene families (ABCs, CCEs, GSTs, and P450s). The total size of every stacked bar-plot indicates the total number of genes while the part of the bar coloured only orange indicates orthologues and only blue indicates inparalogues...... 178

Figure S4.4: Distribution of total gene counts (green) and duplications (orange) in detoxification associated lineages in each of the four gene families for diet and phagy

Page | xiv combinations in every order. The graph summarises raw counts (A) and normalised counts (B) for the same...... 180

Figure S5.1: Scheme for heat stress and recovery assay...... 181

Figure S5.2: Gene ontology cluster enrichment for 971 repleta group specific orthogroups after annotation using gene ontology Sets...... 181

Figure S5.3: Gene ontology enrichments for all species specific duplications after semantic similarity based clustering ...... 181

Fig S5.4: Principal component analysis of expression variation amongst the top 2000 most variable genes...... 182

Fig S5.5: SuperSOM analysis of orthologue expression ...... 183

Fig S5.6: Species specific expression of SSOM cluster 1...... 184

Page | xv

List of tables

Table 2.1. FST values for population comparisons. Regions spanned by In(3R)P are always significantly higher than other regions...... 44

Table 3.1: Numbers of orthologues, orthogroups, inparalogues and gene births identified by Orthonome, OrthoDB, MultiMSOAR2, OMA groups and reciprocal best hit (RBH) using the same input data ...... 54

Table 4.1: Results from multivariate and univariate ANOVAs testing the impact of host use and order on the number of genes overall and in the different families ...... 79

Table 5.1: Summary of Orthonome analysis for D. melanogaster and five repleta group species...... 104

Table 5.2: Contingency tables testing for enrichment of gene expansion events in lineage specific super orthogroups ...... 105

Table 5.3: Association between mean number of inparalogues in generalist and cactophilic lineages and the number of SoG’s accounting for these inparalogues ...... 108

Table 5.4: Contingency tables for estimation of enrichment of orthogroups under positive selection in orthogroups undergoing duplications ...... 112

Table 5.5: Overlap of genes found to be under positive (+ve) selection, differential expression (DE) and identified as candidate genes responsible for heat or desiccation tolerance in earlier studies...... 123

Table S2.1. Information of primers used to verify homozygosity of the isochromosomal lines...... 160

Table S2.2.List of lines used for preparation of the RADSeq library...... 161

Table S2.3. Number of SNV’s per population comparison, excluding loci with FST = 0.

...... 162

Table S2.4. Outlier SNV distribution and density across 3R...... 162

Page | xvi

Table S2.5: Percent of SNV's mapped to genic regions ...... 162

Table S2.6. Outlier genes found to be significantly differentiating between the populations and karyotypes...... 162

Table S2.7. Gene ontologies found to be significantly enriched between the populations and karyotypes by analysis...... 162

Table S3. 1: Genome, orthologue, duplication statistics and data sources for all 20

Drosophila species analysed in the current study. These include percentage of genes identified as orthologues by Orthonome and OrthoDB as well as genome statistics for each genome based on BUSCO analyses...... 170

Table S3.2: Total number of 1:1 (n=all) orthologroups identified by the Orthonome and

OrthoDB pipelines in the 12 and 20 genome sets...... 170

Table S3.3: Comparison between OrthoDB and Orthonome for number of orthogroups corresponding to eleven rapidly evolving gene families in Drosophila species ...... 171

Table S3.4: Median and mean gain of orthologue pairs in Orthonome compared to

MSOAR2 for the 20 Drosophila species ...... 172

Table S4.1: Data origin, diet, pesticide resistance cases and genome information for species analysed in the current study ...... 173

Table S4.2: Correlation between total and normalised gene counts for each gene family.

...... 173

Table S4.3 Summary table for order specific tests of gene family expansions due to diet and host range ...... 173

Table S5.1: Scaffold and contig length based statistics as well as results from BUSCO genome assessment using the dataset for the five assembled repleta group genomes and

D. melanogaster...... 185

Page | xvii

Table S5.2: Repeat content analysis for the six species analysed in this study characterising transposable elements as well as tandem repeats...... 186

Table S5.3: Genome annotation statistics for the five assembled repleta group genomes and D. melanogaster ...... 187

Table S5.4: Number of pairwise orthologues between D. melanogaster and the five repleta species and the published (older) annotation of D. buzzatii. A higher number indicates greater evolutionary conservation of genes...... 188

Table S5.5: Sets of functional terms describing hierarchical grouping of gene ontology terms and the number of genes in Drosophila melanogaster within each set...... 189

Table S5.6: Subsets of functional terms after splitting each Set and the number of genes in Drosophila melanogaster within each subset. Each set consists of terms with semantic similarities >0.7 but stricter clustering cut-off’s...... 190

Page | xviii

CHAPTER 1

Introduction

Adaptation to anthropogenically driven ecological and climate change is quickly becoming a key factor in deciding species’ survivability in their natural habitats (Bale &

Hayward, 2010; Bale et al., 2002; McKechnie & Wolf, 2010; Moir et al., 2014). The

Intergovernmental Panel on Climate Change (IPCC) in 2013 reported an upward trend in global climatic indicators in the past century. Global average temperatures are reported to have risen by 0.78ᵒC since 2003 and we are witnessing increased frequency of heat waves in Asia, Europe and Australia. Precipitation patterns have changed globally along with the highest concentration of greenhouse gases in 800,000 years. These patterns are predicted to intensify by 2100, with a further 2ᵒC rise in global average temperature, and an increase in heat waves and contrasting fluctuations in precipitation and seasonality

(Barros et al., 2015).

Climate change therefore presents a great risk of biodiversity loss due to species being unable to geographically, physiologically or genetically adjust to increased abiotic stresses and man-made barriers to adaptation mechanisms (Franks & Hoffmann, 2012).

This is likely to impact ectothermic species most severely since they are unable to regulate their own body temperature and hence are directly exposed to changing climatic elements.

Studies have also suggested that terrestrial insects may experience a reduction of ‘thermal safety margins’, reducing the habitable habitats and further affecting a species’ fitness

(Paaijmans et al., 2013). Some species affected by climate change may possess limited capacity to tolerate or adapt to these changes as seen in rainforest Drosophila (Hoffmann et al., 2003b). Furthermore, if species are already near their physiological abiotic stress tolerance limits (e.g. lizards (Sinervo et al., 2010)), they may become extinct due to climatic changes (Hoffmann & Sgrò, 2011). Page | 1

Reduction of viable habitats and environmental conditions could amount to drastic loss of plant, and bacterial species leading to the “sixth mass extinction” (Moir et al.,

2014). Large-scale die-offs, heat and habitat destruction related mortality in birds, flying foxes and plants have been reported (Boyles et al., 2011) with a rising frequency in the past decades (McKechnie & Wolf, 2010). Such loss of multiple species will lead to disruption of entire communities or ecosystems and decouple natural interactions between species (Barros et al., 2015).

One such ecologically critical example of community interactions is that between insects and their hosts. Herbivorous insects account for nearly a quarter of all metazoan species

(Simon et al., 2015) and utilise their plant hosts for feeding, mating sites, oviposition and as a habitat throughout their life cycle. Due to this intimate relationship with host plants, it has therefore been hypothesized that species diversity in insects is driven by herbivorous specialist insects that depend on a fixed range of host or hosts as opposed to generalist species with a broad host range (Simon et al., 2015). Examples of host driven speciation/specialisation include evolutionary specialisation to Morinda fruit in

Drosophila sechellia compared to its congeneric generalist relative D. simulans (Dworkin

& Jones, 2009; Shiao et al., 2015) and similar host-driven specialisation event also occurs in island communities of D. yakuba (Yassin et al., 2016). Non-herbivorous species also show the evolution of specialization such as Cotesia sesamiae, a biological control agent of cereal stem borers that has now gained specialist behaviour (Kaiser et al., 2015). The host-adapted insect species evolve phenological synchrony with their hosts, e.g. co- ordinating their life-cycle with the flowering or nutritional life-cycle of the plants (Singer

& Parmesan, 2010). While phenological synchrony may not necessarily be observed in generalist insect species, it is especially important for host specialists since they have a strong, often obligate interaction with their hosts.

Page | 2

Current models estimate that nearly 57% of common plant species, most of which are insect hosts in some capacity, are expected to lose more than half their current range by

2080 (Warren et al., 2013). While it is difficult to estimate the impact of climate change on plants with already smaller geographic ranges (Moir et al., 2014), they are estimated to be even more threatened by climate change (Thomas, 2010; Thomas, 2011). This dramatic loss of plant biodiversity and range will affect plant-adapted insects. It is therefore likely that the estimated biotic impact of climate change on ecosystems may be more pervasive than earlier thought (Thomas, 2011). Since interacting insects and plant species become co-threatened due to disruption of processes essential to the survival of one or both species (Bocedi et al., 2013; Moir et al., 2014; Singer & Parmesan, 2010), this may eventually lead to co-extinction of insects and hosts in multiple communities

(Thomas, 2010). Furthermore changing climatic pressures on generalist species may aid in expanding their host and geographical ranges due to new resources becoming available within their temperature tolerance ranges, especially towards the poles, increasing the risk of invasions by these species and subsequent ecosystem disruption (Chown et al., 2015).

These considerations suggest that any study of insect adaptation under climate change requires an assessment not only of insect species-specific adaptation to increasing climatic stress but also to their responses to plant hosts. The remainder of this chapter focuses on the genomic basis of climatic and host adaptation, including modes of adaptation to both these factors in insects (or ), and where possible it highlights the role that genomics can play in understanding adaptive processes. I will then proceed to formulate a rationale for genomic studies combining both these approaches.

1.1 Adaptation in response to climate change

When exposed to abiotic stresses due to climate change, species have one of two options; adapt to the climatic changes locally (Ayala et al., 2013; Bergland et al., 2014; Pool et

Page | 3 al., 2012) or migrate to a more favorable location (Chen et al., 2011), failing which the species may face extinction (Boyles et al., 2011; Franks & Hoffmann, 2012; Savolainen et al., 2013). Understanding the likelihood of adaptive processes in a species could be the first step to gaining a perspective on its mode of adaptation in response to climate change.

1.1.1 Adaptation by phenotypic plasticity

Radical changes in local climatic and host conditions (i.e., those that lead to a severe reduction in fitness) may compel species to adapt rapidly. This can be facilitated by rapid physiological changes to counter the environmental stress condition. One such short-term response mechanism known to help species adapt rapidly to novel stresses is termed acclimation or acclimatization, which represents an example of phenotypic plasticity.

Phenotypic plasticity involves no immediate genetic changes since it manifests phenotypic variation from the same genotypes. Plastic responses can however eventually lead to evolutionary genetic changes and have long-lasting effects. For example Ragland and Kingsolver (2007) found that pitcher plant mosquito have evolved a plasticity in diapause induction in response to longer warm periods. However, further studies found that there was a subsequent rapid adaptive shift across multiple populations of pitcher plant mosquito (Bradshaw & Holzapfel, 2008).

Under climate change models, one of the major abiotic stress factors that insects will have to adapt to is increasing temperature (Barros et al., 2015). Recent data have demonstrated that plastic changes may contribute more to adaptation than genetic changes in imparting thermal resistance (Ayrinhac et al., 2004; Bradshaw & Holzapfel, 2006; Hoffmann &

Sgrò, 2011; Hoffmann et al., 2005). These underlying plastic changes could also be a result of changes in patterns of gene expression. Using comparative genomics, Levine et al. (2011) found plasticity in gene expression response at varying temperatures based on geographical origin (tropical (warm) vs temperate (cooler) environments) in populations

Page | 4 of D. melanogaster. Similarly, stronger plasticity in heat-adapted populations as opposed to heat-sensitive populations was also observed in Daphnia pulex (Yampolsky et al.,

2014). Both studies observed downregulation of metabolism indicating it may be a generic compensatory response to stress in invertebrates. Though canalization (stronger plasticity in more heat-adapted populations) may lead to similar patterns, phenotypic plasticity in gene expression could be associated with or induced by environmental stresses (Dayan et al., 2015). It is therefore likely that more stress-adapted populations of a species have a higher likelihood of responding favorably to climate change. As a result, regulatory differences and hardening (plasticity induced due to exposure to sub-lethal stress) effects could be the primary effectors of phenotypic plasticity in response to heat and will thus be an important components of an evolutionary adaptation to climate change in many species (Sgrò et al., 2015).

1.1.2 Genetically based adaptation

Along with plastic response to abiotic stress, genetically based adaptation to local climates offer a greater chance of long-term adaptation to climatic stress. Local adaptation consists of genetic changes in a population occurring as a consequence of fitness differences among genotypes, independent of the effect of migration (Savolainen et al.,

2013). As sometimes observed with species-specific adaptations, local adaptation is often a result of spatially varying selection. Recent studies have indicated that genetically based adaptation to climate change is occurring in traits such as flowering/reproductive timing

(Li et al., 2010), thermal response (Karell et al., 2011), body size (Kennington et al.,

2007) and physiological stress resistance (Franks, 2011).

Multiple studies in D. melanogaster have demonstrated a strong association of few genes with physiological stress tolerance such as hsr-omega with heat tolerance (Anderson et al., 2003; Anderson et al., 2005; Collinge et al., 2008; Johnson et al., 2009). There is also

Page | 5 historical evidence of shifts in allele frequency of genes (e.g. AdhS; Fig. 1.1 A) and chromosomal inversions associated with climatic adaptation (In(2L)t - (Kennington &

Hoffmann, 2013), and In(3R)Payne - (Anderson et al., 2005), Fig. 1.1 B) in D. melanogaster populations over the past 20 years. In another dipteran Anopheles gambiae, the large effect loci for heat/aridity stress tolerance are also associated with an inversion

(2La). This is likely since chromosomal inversions bind genes selected in the same direction or co-adapted to one another to form a single recombination inhibiting genetic element (Kirkpatrick & Barton, 2006b). Additionally, climatically (thermal) associated changes in allele frequency have been observed in black spruce (Picea mariana) (Prunier et al., 2011), three-spine sticklebacks (Barrett et al., 2011), Arabidopsis thaliana and many more species. This leads to a prediction that in populations placed along an environmental gradient forming a cline, genetic elements of large effects are likely to contribute to local adaptation under monogenic inheritance models (Savolainen et al.,

2013). This could be facilitated by chromosomal inversions and genes of large effect such as regulatory genes (Olson-Manning et al., 2012).

Figure 1.1: Changes in allele frequency have been observed in the cases of genes such as

AdhS, and inversion In(3R)Payne. Adapted from Umina et al. (2005) and Anderson et al. (2005)

Page | 6

A

B

Along with large-effect genes, the huge number of small-effect genes assisting in adaptation to complex stresses have also been identified in recent studies using increasingly available high-throughput technologies. Genomic techniques have now been used to identify large numbers of genes contributing or responding to climate driven local adaptation which could even be controlled by multiple genes. For example along clines in D. melanogaster both population allele frequency from population genomics (Turner et al., 2008) and gene expression levels using transcriptome sequencing (Levine et al.,

2011) have been used to identify climatically linked genetic changes. Population genomic studies of a thermal cline in A. gambiae have also led to identification of genes within inversions that could be co-adapted to permit local adaptation in hotter climates (Cheng et al., 2012). Since clines are also linked to multiple morphological traits in their respective species (Cheng et al., 2012; Rako et al., 2007; Rocca et al., 2009), they could

Page | 7 also potentially allow us to identify genes associated with complex morphological adaptation.

Genomic studies have also helped identify adaptations that are independent of clines. For e.g. using RAD-Seq (Baird et al., 2008) based population genomic survey of multiple populations, the parallel evolution of armour plating in freshwater three-spine sticklebacks was linked to genetic polymorphisms that have reached fixation and are driving local adaptation to freshwater habitats (Hohenlohe et al., 2012). However this raises an important question: How do these adaptive alleles emerge in populations and subsequently go to fixation?

1.1.3 Origin and fixation of adaptive alleles

Current models of adaptive evolution consider whether adaptation occurs on account of de novo mutations (DM) (Orr, 2005) introduced into populations under selective pressure

(Radwan & Babik, 2012) or standing variation (SV) already present in populations.

Under neutral evolution, a small percentage of de novo mutations are likely to drift to intermediate frequencies until eventual fixation beyond a threshold allele frequency which is decided by the effective population size (Kimura, 1984). However under climatic selection driven by climate change there can be a population bottleneck, resulting in an increased chance of de novo mutations undergoing ‘hard’ selective sweeps whereby only one haplotype attains fixation. This can be due to epistatic interactions between an allele favoured under climate selection and its neighbours, or if the genomic area is held in strong linkage disequilibrium. Though the fixation of de novo mutations is often viewed as an infrequent event, studies in Acetylcholine esterase (Ace) gene in D. melanogaster have suggested that the diapause driven boom-bust (burst of population growth at the end

Page | 8 of seasonal diapause or dormancy) cycles in insects may help to drive novel mutations to fixation faster than expected and could involve ‘soft’ sweeps (Karasov et al., 2010) where multiple adaptive alleles sweep through the population at the same time, often bound to a neutral allele which has become beneficial to adaptation due to environmental change.

SV’s on the other hand are already present in a population, consisting of neutral loci (or loci under stabilising selection) at an intermediate frequency, which can quickly rise to fixation under selection due to ‘soft’ sweeps. For instance, standing variation in ectodysplasin alleles have been found to recurringly reach fixation by soft sweeps and drive adaptation to freshwater habitats in three-spine sticklebacks (Colosimo et al., 2005).

However soft sweeps occurring on a genetic background with high degree of standing variation beyond the selected locus can make it difficult to track fixation of adaptive alleles (Pritchard et al., 2010). SV’s may therefore help multiple haplotypes ‘hitchhike’ to higher frequencies as the favoured allele reaches fixation (Olson-Manning et al., 2012;

Smith & Haigh, 1974). This again has been reported in the case of three-spine sticklebacks that have undergone parallel evolution of reduced plate armour in freshwater populations, often relying on similar existing variation at the ectodysplasin locus in natural populations (Colosimo et al., 2005; Jones et al., 2012).

However, at the same time it is likely that several selectively advantageous mutations or even ancestral variations may be maintained in populations by balancing selection for a much longer time than neutral evolution would predict (Takahata, 1990). For example, balancing selection is thought to conserve the complementary sex determination (csd) gene across multiple hymenopteran species (Hedrick, 2013). Cho et al. (2006) demonstrated the existence of similar patterns of high polymorphism and balancing selection across the csd locus in multiple bee species that has been maintained by a mutation-selection balance across all species. It is therefore likely that under each case,

Page | 9 the locus may be older than the species itself and is maintained through the evolutionary history of the order to allow for fixation where necessary and beneficial to a particular species in the order (Hedrick, 2013).

Additional sources of adaptive alleles is gene duplications or introgression from closely related species via hybridisation. These sources of adaptive alleles could lead to additive effects on selected traits by increasing gene copies of already favoured alleles from different species or introducing an adapted locus, though this may not be a general expectation. For example, the inversion 2La is known to have introgressed into A. gambiae from Anopheles arabiensis (Sharakhov et al., 2006) to impart high tolerance to heat and desiccation and also enable local adaptation along a thermal cline in African populations (Gray et al., 2009; Rocca et al., 2009). Similarly, a locally adapted mutation in the Mc1r pigmentation gene is known to have been transferred from wild dogs to wolves via ancient hybridisation in northern Americas (Anderson et al., 2009). However it is also likely that introgression by hybridisation may enable the transfer of mal-adapted alleles, therefore threatening entire populations if such alleles reach fixation (Hindrikson et al., 2012).

1.1.3.1 Effect size and adaptive potential of genes

The fitness effect of a beneficial mutation with respect to the phenotype affected plays a critical role in the fixation of favoured alleles. The infinitesimal or polygenic model assumes that multiple genes of small effect control variation in a trait (Lynch & Walsh,

1998). The theory of major effect alleles states that major alleles, though rarer than small effect alleles, are more likely to control trait variation since it is easier for selective sweeps to lead one adaptive locus to fixation as opposed to a combination of multiple loci (Olson-

Manning et al., 2012; Orr, 1998; Savolainen et al., 2013). Current literature also supports the idea that variable phenotypic traits are controlled by a complex genetic architecture

Page | 10 consisting of major and minor effect alleles along with polygenic determinants of trait variation (Cassone et al., 2011; Kirkpatrick & Barton, 2006b; Rako et al., 2006b; Rocca et al., 2009).

The classification of an allele as having major or minor effect is defined by its position and function in a biochemical or regulatory network (Fig. 1.2). Therefore, a gene with more upstream and downstream associations in a pathway is likely to have a greater effect on adaptation and associated pleiotropic traits – as a major effect allele (Fig. 1.2, gene A).

Similarly, genes at ‘bottleneck’ steps where mal-adaptation has a direct effect on phenotype due to terminal positions closest to phenotypic expression, despite fewer or no pleiotropic effects will also form major effect alleles (Fig. 1.2, genes D and E). On the contrary genes may have high pleiotropic effects but little or no effect on the phenotype under selection (Fig. 1.2, genes B and C) and will therefore be genes of minor effect.

Under this model, genes at central regulatory positions in the pathway (gene A) are likely to be under strong purifying selection, while terminal major effect genes displaying greater variation (Olson-Manning et al., 2012). However, the central genes may also help in adaptation if they are under strong positive selection. An example of this is seen with insulin and target of rapamycin (TOR) pathways where central genes are under positive selection with comparatively higher purifying selection acting on upstream genes in

Caenorhabditis spp. (Jovelin & Phillips, 2011), vertebrates and Drosophila spp.

(Alvarez-Ponce et al., 2009; Alvarez-Ponce et al., 2011; Luisi et al., 2012). Depending upon the phenotype under selection, similar genes or pathways may assist in adaptation to environmental stress across populations or species.

Figure 1.2: A simplified model for genetic adaptation. A gene’s location in a network defines the effect size of a variation in the gene. In this diagram adapted from Olson-

Manning et al. (2012) we can visualise nodes in the network with larger circles having

Page | 11 greater effect on phenotype. In this circumstance changes in regulatory genes (A) may have pleiotropic effects due to central position in the network. In comparison, secondary regulatory elements (B & C) are likely to have a smaller effect size due to less linked branches, but they could have pleiotropic effects on other phenotypes (B). In contrast, variations in proteins at ‘bottleneck’ positions (D & E) could have large effect sizes and affect only the phenotype under selection but demonstrate limited (E) or no pleiotropic effects (D). However downstream to bottleneck genes, genes of small effect could undergo high purifying selection and be present in large numbers (F1 & F2). It should however be noted that this model does not account for complex interactions in regulatory networks.

Page | 12

Under the model described above, similar genes are likely to help multiple species adapt to when any one phenotype is under selection. For example, loss of trichomes in multiple

Drosophila species has been linked to independent accumulation of evolutionary changes in a critical regulatory gene – ovo in each species. Similarly morphological variation in pigmentation across multiple species has been linked to similar evolutionary changes in the regulatory gene Mc1r (Nadeau & Jiggins, 2010). Similar case of ‘convergent evolution’ or ‘parallel evolution’ has also been noted with the evolution of reduced plate armour in isolated freshwater populations of three-spine stickleback being driven by a signalling pathway regulator ectodysplasin in all cases (Colosimo et al., 2005; Jones et al., 2012). In other words, an environmental challenge results in the same solution in multiple independently evolving populations. Ultimately, even though the genetic changes may not always be the same at the sequence or gene level, the networks affected during parallel evolution are of key importance and targets for query.

The structure of chromosomes may be modified by selection on account of climatic stress, often behaving as a single genetic element (Radwan & Babik, 2012). Recent work in D. melanogaster (Fabian et al., 2012; Kolaczkowski et al., 2011) and A. gambiae (Cheng et al., 2012) has demonstrated clustering of differentiation between populations that spans across large regions of the genome. Such clusters of co-adapted or co-differentiated genes have also been reported to impart greater adaptive capacity if bound by inversions (Gray et al., 2009; Rako et al., 2006b; Rocca et al., 2009). However, in the A. gambiae and D. melanogaster studies, these clusters were more frequent in populations under with higher average temperatures. It is therefore possible that the ability of a species to adapt to different climatic conditions may be affected by local adaptation events already under play, leading to a scenario whereby certain populations will adapt more rapidly to climate and host changes than others.

Page | 13

I hypothesize that the mechanisms leading micro-evolutionary adaptive changes also plays a large part in macro-evolutionary adaptations to climatic or host specialisation in a species or an isolated population.

1.2 Evolution of plant host adaptation in herbivorous insects

When studying the adaptation of insects to climatic stress, it is important to also understand their evolution of host adaptation due to the intimate relationships between insects and hosts. Adaptation to novel hosts followed by reproductive isolation due to strong insect-host interaction is also thought to be a driving force for speciation in herbivorous insects (Mullen & Shaw, 2014). This speciation process is reinforced by sub- optimal interactions over time with other possible hosts for feeding, mating and oviposition (Matzkin, 2014; Simon et al., 2015) due to specialisation or neofunctionalisation. The overwhelming contribution of insects to biodiversity offers a useful system to understand how dietary specialisation drives speciation. Conversely, knowledge of the genetic mechanisms and genomic signatures that make an insect species a successful generalist may also prove to be invaluable in predicting the adaptive capacity of insect species as well as pest management strategies (Chown et al., 2015).

Comparative genomic analyses have led to identification of gene families contributing to plant-specialisation and resulting speciation in multiple species such as drosophilids

(Drosophila 12 Genomes et al., 2007; Shiao et al., 2015; Yassin et al., 2016). In addition, studies in coleopterans (Shi et al., 2012; Tribolium Genome Sequencing et al., 2008), lepidopterans (You et al., 2013b), hemipterans (Mizell et al., 2008) and hymenopterans

(Oakeshott et al., 2010) have enriched our knowledge of gene family composition and evolution and their association with the insects’ respective host ranges. For example

Gschloessl et al. (2013) found extensive divergence between sibling Ostrinia species for genes implicated in development, immunity and sensory function. Whole transcriptome

Page | 14 comparisons have also indicated that genes regulating chemosensory and detoxification mechanisms may have led to the polyphagous behaviour of Helicoverpa armigera

(Lepidoptera) compared to its oligophagous sister species H. assulta (Li et al., 2013).

The most common recurring evolutionary phenomenon reported in literature pertains to gene expansion in polyphagous herbivores compared to their specialised counterparts.

Gene expansions have been reported in association with increasing polyphagy in detoxification, sensory and protease gene families in Nasonia vitripennis (Hymenoptera)

(Oakeshott et al., 2010), pea aphid (Hemiptera) (Rispe et al., 2008; Smadja et al., 2009), and the lepidopterans - Plutella xylostella (You et al., 2013b), Spodoptera frugiperda

(Giraudo et al., 2015) and H. armigera (Celorio-Mancera et al., 2012). Expansions in detoxification gene repertoire could subsequently enhance the substrate specificity of detoxification genes in these species (Berenbaum et al., 1992; Berenbaum & Zangerl,

2008). This is also in concordance with earlier research hypothesising that a larger complement of detoxification genes could contribute to increased host range (Claudianos et al., 2006).

Similar patterns have been observed when studying gene evolution within herbivorous species such as pea and peach aphids (Bass et al., 2013; Smadja et al., 2009) and the polyphagous spider mite Tetranychus urticae (Dermauw et al., 2013b). It is therefore possible that within and between species adaptation to plant-hosts could be driven by detoxification and chemosensory related genes.

1.3 The genomic basis of climate and host adaptation

Adaptation to climate change in insect species is likely to be a combinatorial effect of host and abiotic stress adaptation. Approaches combining both these factors could therefore be used to assess local and specific adaptation. With upcoming genomic

Page | 15 technologies and the wealth of historical research to draw upon, it is now possible to study adaptation to climate change in greater detail. This provides an opportunity to test the independent inferences of host and climatic adaptation studies by designing detailed large-scale genomic studies that possess sensitive gene family and orthologue classification as well as a sufficient breadth of host and species order diversity (Andrew et al., 2013).

Field assays in the UK brown Argus butterfly have shown remarkable host range- expansions and the evolution of host-specialisation in newly settled populations (Buckley et al., 2012). The population genomic studies in the same species have also demonstrated that the loci displaying differentiation in these newly expanded host-specific populations are also associated with poleward range expansion, suggesting that populations at terminal ends of a species distribution are likely to be exposed to novel stresses, and may adapt by modifying their habitat use based on both abiotic and biotic factors (Bridle et al., 2014).

Genomic studies comparing larvae of the large pine weevil collected on two host-plants

(pine and spruce) have further supported the idea of climatically driven host-adaptation.

Manel et al. (2009) found that when comparing two populations adapted to different hosts, 75% of the genetically differentiated loci were associated with host-adaptation was actually driven by environmental variables. The remainder of the differentiation was correlated with plant host-use and insect-plant interaction. This species presents one of the clearest indications of the importance of climatic adaptation acting in correlation with or as a driving factor for host-adaptation, though more detailed genomic studies would be necessary to examine the complete repertoire of environment associated genotypes.

Page | 16

The butterfly and flying weevil examples provide insight into local adaptation to host driven by climate. In addition a signal for generation of a reproductive barrier associated with climatic adaptation emerges from research done in the stick insect - Timema cristinae. Nosil et al. (2012) reported extremely high degree of divergence and the reduced reproduction between two host-defined ecotypes. Along with morphological difference on account of historical host adaptation, the authors find multiple factors including geographical distance, gene flow and climate as responsible for maintaining the currently observed ecological sub-species. These examples demonstrate the adaptation to climatic stress as being driven by novel host adaptation, or alternatively demonstrating the priority of host adaptation in an insects evolutionary response to climate change.

The current literature presents multiple examples where the interactions between species is being affected or even driven by climate change. While aforementioned examples mainly exemplify specialisation in response to local adaptation, additional responses have been reported such as host-breadth expansion and feeding pattern changes in the willow psyllid in response to increased temperatures in Greenland (Hodkinson, 1997). It is likely that climate change may disrupt current plant host-insect interactions or help create new relationships enhancing the survivability of the species, which can be identified and studied in genetic studies over the course of these ecological changes. Furthermore, genomic studies accounting for observations across multiple species/lineages are likely to provide the power to delineate random effects from effects driven by climate and host adaptation concurrently and independently. With the progress in genomic techniques and their analyses, it is now becoming increasingly possible to design studies that will allow the genomic basis of climatic and host adaptation and their interaction to be evaluated.

Page | 17

1.4 Thesis synopsis:

This dissertation aims to identify the genomic signatures of adaptation to climate change and hosts using single species, multi-species comparative and functional genomic studies.

The central question is: “What genes or gene functions contribute most to a species ability to adapt to climate change and host-range?”

Chapter 2 investigates the genomic signatures of climatic adaptation driven by an inversion In(3R)Payne in D. melanogaster using a population genomics approach. The third chromosome of individuals from a northern population polymorphic for the inversion and a southern population with extremely low frequency of the inversion were sequenced using a RAD-Seq approach. Comparisons between similar numbers of northern inverted, northern non-inverted and southern non-inverted individuals helped test theories on the evolution of chromosomal inversions and their role in climatic adaptation. Furthermore the study demonstrated the ability of chromosomal inversions to cluster co-adapted alleles both within the inversion and the northern non-inverted chromosomes with high linkage disequilibrium (LD) while also driving inter-population differentiation.

Given the distinct patterns of climatic adaptation identified within D. melanogaster, I moved next to identify genomic patterns of host adaptation in insect species. However as suggested by earlier studies, this is possible only when lineage specific evolution sensitive orthologue classifications are possible to contrast macro-evolutionary and parallel evolution patterns. I therefore developed a high quality and high throughput orthologue identification bioinformatic pipeline (Orthonome) and associated toolkits as detailed in

Chapter 3. The Orthonome pipeline outperformed widely accepted orthologue classification pipelines in terms of accuracy, sensitivity and speed by a substantial margin.

This improvement was even more apparent when studying draft genomes. Furthermore

Page | 18

Orthonome provides insight into lineage specific evolutionary patterns of gene conservation and expansion while also assisting in gene family annotation across dozens of species.

In Chapter 4, I investigated the evolution of eight gene families implicated in host adaptation and polyphagy in herbivorous insects in 58 insect species spanning six species orders. The high quality orthologue and gene expansion and gene family prediction generated by Orthonome enabled a detailed analysis of the interaction between order, phagy and diet across the species examined. The key result was the confirmation of a high degree of association between detoxification and transporter gene family expansions and polyphagy in herbivorous insect species. The detoxification specific gene family subclasses also displayed similar trends. Furthermore I found an association between lineage specific gene expansion (duplication) and phagy for some of the families.

Finally, I tackled the combined effect of host and climatic adaptation using five highly stress tolerant repleta group of Drosophila species in Chapter 5. The aim of this chapter was to compare the generalist species to the highly stress tolerant specialists in the group and identify gene families and biological processes that may have led to such adaptation.

This involved assembling three of the genomes and annotating four and subsequently comparing the repleta group of species with prior sequenced Drosophila genomes

(Drosophila 12 Genomes et al., 2007). After identifying genes under positive selection in the repleta lineage, the study proceeded to identify genes differentially and synonymously expressed in the generalist D. hydei and heat-adapted specialist D. buzzatii after exposure to high heat stress for one hour. We found that as expected from earlier studies on this group, the downregulation of metabolic function and upregulation of cell repair and oxidative stress response were the hallmarks of the transcriptional response in this group, though the key difference was the higher magnitude and faster activation of the same

Page | 19 genes in the more heat tolerant D. buzzatii. Furthermore we found that the dietary host adaptation in this group of species was also revolving around metabolism of cyclic nitrogenous compounds – however once again we found that the faster evolution of cell health associated gene functions such as proteolytic activity could be associated with the tolerance to highly toxic rotting cactus diet observed in three of the five species.

Page | 20

CHAPTER 2

Genomic evidence for role of inversion 3RP of Drosophila melanogaster in facilitating climate change adaptation

Authors: Rahul V. Rane1,*, Lea Rako1, Martin Kapun2, Siu F. Lee1 and Ary A. Hoffmann1

Author Affiliation:

1School of Biosciences, Bio21 Institute, University of Melbourne, 30 Flemington Road, Parkville, 3010 Victoria, Australia

2University of Lausanne, Department of Ecology and Evolution, UNIL Sorge, Le Biophore, 1015 Lausanne, Switzerland

Corresponding Author:

Rahul V. Rane

School of Biosciences, Bio21 Institute, Room 264, Level 2, 30 Flemington Road, Parkville, 3010 Australia

E-mail: [email protected]

Tel: +61-3-83442314

Keywords: Inversion, Evolution, Climatic adaptation, Genomics

Running Title: Role of inversion 3RP in climatic adaptation

Page | 21

Abstract

Chromosomal inversion polymorphisms are common in animals and plants and recent models suggest that alternative arrangements spread by capturing different combinations of alleles acting additively or epistatically to favour local adaptation. It is also thought that inversions typically maintain favoured combinations for a long time by suppressing recombination between alternative chromosomal arrangements. Here, we consider patterns of linkage disequilibrium and genetic divergence in an old inversion polymorphism in Drosophila melanogaster (In(3R)Payne) known to be associated with climate change adaptation and a recent invasion event into Australia. We extracted, karyotyped and sequenced whole chromosomes from two Australian populations so that changes in the arrangement of the alleles between geographically separated tropical and temperate areas could be compared. Chromosome-wide linkage disequilibrium (LD) analysis revealed strong LD within the region spanned by In(3R)Payne. This genomic region also showed strong differentiation between the tropical and the temperate populations but no differentiation between different karyotypes from the same population, after controlling for chromosomal arrangement. Patterns of differentiation across the chromosome arm and in gene ontologies were enhanced by the presence of the inversion. These data support the notion that inversions are strongly selected by bringing together combinations of genes but it is still not clear if such combinations act additively or epistatically. Our data suggest that climatic adaptation through inversions can be dynamic, reflecting changes in the relative abundance of different forms of an inversion and ongoing evolution of allelic content within an inversion.

Page | 22

2.1 Introduction

Ever since their discovery in Drosophila by Alfred H. Sturtevant, the study of chromosomal inversions has shaped the development of many contemporary evolutionary theories and concepts (Krimbas & Powell, 1992). In particular, the inhibitory effect of inversions on recombination offers a mechanism to preserve combinations of favoured alleles even though the genes might be distantly located from one another along a chromosome. This ability to preserve favoured combinations has raised the possibility that inversions are important in major adaptive shifts requiring changes at multiple loci

(Lowry & Willis, 2010). Recent data suggest the involvement of inversion polymorphisms in adaptive responses to new habitats (Ayala et al., 2011), changing climate conditions (Anderson et al., 2005; Cheng et al., 2012; Kennington & Hoffmann,

2013; Umina et al., 2005) and mimicry (Joron et al., 2011b). Inversion polymorphisms have also been linked to a large number of traits including body size (Rako et al., 2006a), stress resistance (Rego et al., 2010; Weeks et al., 2002), life history changes (Lowry &

Willis, 2010), morphological patterns (Joron et al., 2011b) and reproductive incompatibility (Ayala et al., 2013; Micali & Smith, 2006).

Two models have been put forth to explain the maintenance and spread of inversions in populations. In the first (mostly verbal) model proposed by Dobzhansky (1970), sets of epistatically interacting coadapted alleles are thought to be captured and kept together within inversion polymorphisms by reduced recombination rates, and inversion heterozygotes are favoured due to overdominance. The coadapted alleles within inverted or non-inverted chromosome arrangements impart higher fitness within a population. As a result, alleles within inverted or non-inverted chromosome arrangements from the same population have a high fitness, whereas alleles from within the same arrangement but originating from a different population are not necessarily coadapted (Charlesworth &

Page | 23

Charlesworth, 1979; Kirkpatrick & Barton, 2006a). This leads to the prediction that the allelic content within chromosomal arrangements can evolve to be different across populations, consistent with some empirical evidence (Schaeffer et al., 2003). However, a high degree of sequence conservation in some inversions is expected to persist particularly around breakpoints where gene flux is almost completely suppressed

(Kennington et al., 2006; Schaeffer & Anderson, 2005). Moreover, the allelic content of inversions within populations can also be affected by other processes including admixture

(Bergland et al., 2014).

Following observations by Charlesworth and Charlesworth (1979), Kirkpatrick and

Barton (2006a) proposed an extension to Dobzhansky’s hypothesis suggesting that inversions can spread when they capture a favoured combination of locally adapted alleles that do not necessarily interact epistatically. However, the favoured combination might be broken down through double crossover within the inverted region or gene conversion.

Yet, despite this process occurring, selection may nevertheless be sufficiently strong to increase the frequency of the inversion within a population particularly if recombination rates are low. This model has been invoked to explain the spread of an inversion in

Anopheles mosquitoes (White et al., 2007).

In Drosophila melanogaster, there are a number of cosmopolitan inversion polymorphisms in populations around the world that have been linked to multiple ecologically important traits (Hoffmann & Rieseberg, 2008; Krimbas & Powell, 1992).

Molecular analyses of breakpoint regions (Hasson & Eanes, 1996; Matzkin et al., 2005) suggest that these inversions are in the order of one to a few hundred thousand years old

(Andolfatto et al., 1999; Corbett-Detig & Hartl, 2012). Furthermore, there is empirical evidence for recombination assisted gene-flux within inversions, leading to homogenisation of genetic variation among karyotypes (e.g., Kennington et al., 2006).

Page | 24

Alleles under selection are located within the inverted region and may facilitate adaptation to different climates and other selective factors. Selection has been particularly well documented in the case of In(3R)Payne in D. melanogaster (Krimbas & Powell,

1992; Rako et al., 2006a), with the frequency of the inverted form displaying a parallel latitudinal cline in multiple continents due to higher frequencies close to the equator and a decreasing frequency towards higher latitudes (Knibb, 1982). There is also an established correlation between the body size cline in Australia and the inversion (Weeks et al., 2002). In(3R)P responds to climate change and has experienced an upward shift in frequency in the Australian cline in the past few decades (Anderson et al., 2005).

Furthermore, genomic comparisons indicate that a substantial proportion of clinally varying polymorphisms occur within the region covered by In(3R)P (Fabian et al., 2012;

Kolaczkowski et al., 2011; Reinhardt et al., 2014).Yet, despite the direct implication of

In(3R)P in climate change adaptation, there is little information on patterns of genomic variation across this inversion. Such a chromosome-wide data set is needed to understand how combinations of alleles might be selected and also to identify the possible impact of processes like admixture in invasive populations (Bergland et al., 2014).

In this study we investigate genomic variation in In(3R)P along the east coast of Australia.

Drosophila melanogaster is thought to have arrived in Australia just over 100 years ago

(Hoffmann & Weeks, 2007) and patterns of microsatellite variation indicate that this species invaded from a northern location (Kennington et al., 2003). Recently-collected genomic data points to an invasion by having different geographic origins (Bergland et al., 2014). This provides an opportunity to consider the development of patterns of variation across the inversion over a relatively short time period due to ongoing selection and admixture. This can be achieved by comparing differentiation within a northern tropical location with differentiation across the geographic cline.

Page | 25

2.2 Materials and Methods

2.2.1 Population samples, In(3R)Payne screening and sequencing

Population samples from east coast of Australia were collected between February and

April in 2011 from Queensland (Innisfail; 17°36'S) and Victoria (Yering Station;

37°40'S). Wild stock was bred in the lab for one generation, and third chromosomes from the wild were extracted using the double balancer stocks (FlyBase IDs: FBst0000007 and

FBst0003703 (Bloomington BDSC Stock Center ID: 7 and 3701); obtained from Dr Phil

Daborn).

Thirty seven isochromosomal lines from Queensland and 33 from Victoria were genotyped for presence of inversion In(3R)P and other polymorphic molecular markers to ensure lines were homozygous for the third chromosome. DNA extraction and PCR procedures were carried out as described elsewhere (Hoffmann et al., 2012) for markers on chromosome arms 3R and 3L (Table S2.1). All primers used are listed in Table S2.1.

Characterization of In(3R)P was carried out by amplifying a 570 bp region near its proximal break point using primers In(3R)P-SNP_outer_1 and In(3R)P-SNP_outer_2

(Table S2.1) following Anderson et al. (2005) using the following PCR conditions: 2 μL

DNA template (approximately 0.5% of the whole DNA), 1 μL (10 μM) of forward and reverse primer, 4 μL of MyTaq Red Mix (Bioline) and 2 μL of DEPC-treated water in a reaction volume of 10 μL. All the PCR reactions were carried out in an Eppendorf gradient Mastercycler™. PCR amplicons were sequenced to determine the In(3R)P genotype (i.e., inverted or standard). To verify the validity of the SNP assay, we prepared isochromosomal lines from two Queensland, one Victorian and one South Australian population, all collected in 2014 (gifted by Dr. Alexandre Fournier-Level) and found the

SNP to be in complete LD with the inversion 3RP.

Page | 26

To obtain as many possible combinations of variants within karyotypes, amplicon sequences were grouped into haplotypes using COLLAPSE v1.2 (Posada, 2004). Out of

19 unique haplotypes, 19 individuals from Innisfail bearing In(3R)P (Ii) and 18 samples from Innisfail (Is) and Yering Station (Ys) bearing the standard conformation were selected for RAD-seq library preparation. Lines selected are listed in Table S2.2.

2.2.2 Library preparation for RADseq

DNA extraction for sequencing library preparation was carried out using the High Pure

PCR Template Preparation Kit (Roche) with 11 mg flies (5-6 flies per line) from each chromosome line listed in Table S2.2. Flies were soaked in cell lysis buffer with 200 mM

KOH and 1 μL 5% SDS for 10 minutes, homogenized with a pestle and genomic DNA extracted following manufacturer's instructions barring a two-step elution with the buffer at 72C. All extractions were then incubated at 37C with 1 μL RNaseA for 1 hour and precipitated with ethanol.

A double-restriction RADseq (ddRAD) library was prepared separately for each sample by modifying the protocol in Peterson et al.(2012). Restriction enzymes NlaIII and MluCI

(New England BioLabs, Ipswich, MA, USA) were used to digest genomic DNA and a variable length barcoding strategy was used at both the 3’ and 5’ ends (Table S2.2). DNA quality, integrity and fragment distribution were measured using Qubit Fluorometer (Life

Technologies) and the High Sensitivity DNA Chips on the 2100 BioAnalyzer system

(Agilent systems). Furthermore, 10-fold excess of adaptor was used in each adaptor- template ligation reaction.

Size selection of template DNA was achieved using the AmpureXP™ paramagnetic beads (Beckman Coulter) in a two-step clean-up procedure. Each library was washed at a DNA:bead ratio of 0.5 and the supernatant was transferred to a new tube and washed

Page | 27 with an additional 0.5x bead solution. This procedure yielded a final library size distribution from 150 bp-500 bp. Individual libraries were then combined (i.e., multiplex library) and sequenced on one lane using the Illumina HiSeq2000 platform (Illumina) to obtain 100 bp paired-end sequence reads at the Ramaciotti Centre, NSW, Australia.

Across all three populations we obtained 353 million reads with an average phred score of 37.

2.2.3 Sequence alignment and data processing

Raw sequence data was demultiplexed based on individual barcodes using fastq-multx tool from ea-utils package (http://code.google.com/p/ea-utils) to obtain 55 individual specific FASTQ files. Low quality sequences were removed using Fastx_trimmer from the Fastx-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit) with a threshold of -q20 and

6-7 bp on 3’ and 5’ end to eliminate the internal adapter sequences. Reads were then aligned to the Drosophila melanogaster reference genome sequence r5.47 (Adams et al.,

2000) using BWA v0.7.5a-r405 (Li & Durbin, 2009) allowing for 2% base-call error and indel size of 12 bp. Paired aligned sequences were merged and converted to BAM alignment format using SAMTOOLS v.0.1.19 (Li et al., 2009). All libraries after demultiplexing gave an average of 3.8% and 4.5% genome sequence coverage across 3R and 3L respectively.

Variant loci were identified using GATK-UnifiedGenotyper (McKenna et al., 2010) allowing for a minor allele frequency of 0.05 and p-value < 0.001 across a population.

Loci were further filtered down to those with a minimum sequencing depth of 10 using

VCFTools (Danecek et al., 2011). Sites with more than 25% individuals in a population or population pair missing sequence data were excluded from further analyses.

Additionally variants falling within the low recombination regions of the genome near

Page | 28 the telomeres and centromere on chromosome 3 were excluded from downstream analyses (see Kolaczkowski et al., 2011). Across all three population pairs, 499,834 single nucleotide variants (SNVs) were detected from which 70,882 passed all quality, missing data and sequence depth filters with average minimum coverage of ~60x per variant SNV per population-pair. The missing data was then haplotype imputed and phased using BEAGLE v4 (Browning & Browning, 2009) based on genotype likelihood calculated by GATK UnifiedGenotyper. This correction along with the individual level data we obtained using RADSeq should greatly reduce any genealogical bias low- representation sequencing should encounter (Arnold et al., 2013; Gautier et al., 2013) along with reduction in risk of excessive false negatives as reported in earlier studies

(Arnold et al., 2013). Inversion co-ordinates were converted using FlyBase Cytosearch tool to obtain proximal and distal breakpoint co-ordinates (12,401,910 – 20,580,518) for

In(3R)Payne (Marygold et al., 2013).

2.2.4 Population genetic parameters and outlier analyses

Outlier analyses for candidate loci were carried out using individual SNV’s instead of 1 kb windows used in earlier studies (Cheng et al., 2012; Kolaczkowski et al., 2011). This is because distance between RADSeq markers often exceeds 1kb. Weighted Weir and

Cockerham’s FST’s were calculated using VCFTools (Danecek et al., 2011) using aforementioned imputed SNV dataset. Chromosome-wide divergence was examined by separating pairwise analyses on the basis of presence of inversion (Innisfail-inverted &

Innisfail-standard: IiIs), geography (Innisfail-standard & Yering station-standard: IsYs) and geography with inversion (Innisfail-inverted & Yering Station standard: IiYs).

We then extract 5000 randomly selected SNV’s shared by the two Innisfail populations and ensured they had a minimum allele frequency > 0.05 across both populations

Page | 29 combined. We then estimated pairwise LD (Lewontin’s D’) and plotted LD heatmaps using Haploview v4.2 (Barrett et al., 2005). The D’ values with respect to In(3R)P were obtained by using a SNV associated with the inversion as a reference point in both

Innisfail (Queensland) populations. Correlation between LD and FST was measured by extracting per SNV FST values from the IiYs comparison. These values were then compared to LD with for all loci to obtain correlation coefficients for loci within the inversion and the collinear genome.

In order to identify candidate genes we adopted an outlier strategy similar to that described in Fabian et al. (2012) and Kolaczkowski et al. (2011). We first computed pairwise FST for every polymorphic SNV and defined outlier SNV’s as those falling in the top 5% tail of the FST distribution. These candidate SNV’s were then annotated with

SNPEff v3.2 (Cingolani et al., 2012) using D. melanogaster genome annotations v5.47

(FlyBase) by including a region 1kb up- and downstream of all annotated genes.

2.2.5 Gene ontology (GO) analysis

Lists of outlier loci assembled above were analyzed for enrichment of biological functions by classifying the genes into gene ontologies (Ashburner et al., 2000). Outlier loci were tested for enrichment of GO terms using Gowinda (Kofler & Schlotterer, 2012) using the

SNP mode on account of the low percentage of the genome captured by RADSeq which uses a permutation based approach. Every annotated gene once again included a region

1kb up- and downstream. We excluded GO categories with < 3 genes and allowed for

FDR < 0.1. The analysis was run using SNV’s as input to allow equal representation of all polymorphic loci.

Page | 30

2.3 Results and Discussion

In(3R)P is a cosmopolitan inversion found in Drosophila melanogaster populations and presents a cline along the eastern coast of Australia. The inversion is present at ~63% frequency in northern tropical populations and as low as 5% in southern temperate populations (Rako et al., 2006a). To examine the impacts of In(3R)P on genetic differentiation and linkage disequilibrium (LD), we sequenced 55 wild-type third chromosomes from Innisfail in Queensland (19 bearing In(3R)P, 18 standard) and Yering

Station in Victoria (18 standard) using multiplexed double-digest RADseq (Peterson et al., 2012). The left and right arms of the third chromosome were analysed separately and data from 3L was used as a null model for genome divergence and coverage variance.

Genetic differentiation between populations was examined by pair-wise analyses based on the presence of In(3R)P (Innisfail-inverted & Innisfail-standard: IiIs), geography

(Innisfail-standard & Yering Station-standard: IsYs) and geography with the inversion

(Innisfail-inverted & Yering Station standard: IiYs). IsYs and IiYs had more single- nucleotide variants (SNVs) than IiIs (Table S2.3) across 3R while 3L had little difference in number of SNV’s for IiYs compared to IsYs, confirming greater accumulation of rare variants on 3R. We also observed that In(3R)P had a slight effect on mean differentiation across 3L (FST: IiYs = 0.10 vs IsYs = 0.086). Furthermore, FSTs across the region spanned by In(3R)P were significantly higher than the collinear chromosomal segments on 3R for all three population comparisons (Table 2.1). The difference between inversion and collinear segments was highest for IiYs (FST = 0.226 vs. 0.112, Wilcoxon’s rank sum test: P < 0.001) followed by IsYs and IiIs (IsYs: FST = 0.161 vs. 0.117; IiIs: FST = 0.030 vs. 0.016, both P < 0.001). The elevated FST across the inversion region in IsYs suggests that, even in the standard conformation, the region contains a cluster of genetic variation that responds to the geographic cline. Nevertheless, the highest level of divergence is

Page | 31 clearly between the inverted and standard arrangements from the different locations as reflected by the IiYs comparison.

In general, genetic differentiation reduced from the breakpoints toward the center of the inversion (Fig. 2.1A). In addition to the significant differentiation that exists within

Innisfail between the inverted and the standard chromosomes (Table 2.1), we also found that not all windows in IiYs have consistently high FST values compared to corresponding genomic windows in IsYs (Fig. 2.1A). These observations suggest that there is a reduction of recombination across In(3R)P between alternate karyotypes, and that such a reduction is not uniform. For the IiYs comparison, elevated differentiation was detected ~1Mb beyond the In(3R)P breakpoints (Fig. 2.1A). These patterns of differentiation are in agreement with the general observations for inversion 2La and 2Rb in Anopheles gambiae as reported by Cheng et al. (2012). Interestingly we observe a similar trend for the IsYs comparison. It is likely that the standard conformation in Innisfail also harbours a very high density of adaptive alleles. This is despite the differences between the Innisfail standard and inverted in the IiIs comparison. Therefore it may be possible that the standard karyotype also conserves genes or variations that act in concert with the inverted state to impart fitness benefit. Additionally, the inversion is strongly associated with the region spanning from 10 to 17 Mb on 3L (Fig. 2.1B). This effect is strictly limited to the

IiYs comparison and we do not see a similar trend in the IsYs comparison. Therefore, the inversion appears to interact with loci both inside and outside its physical boundaries.

Since In(3R)P has a unique origin and recombination (and/or gene conversion) is very rare but not suppressed in heterokaryotypes, loci bound by In(3R)P should display a high level of LD. Theory also predicts that LD should decrease toward the center of the inversion owing to more double crossover events away from the breakpoints (Andolfatto et al., 2001; Navarro & Barton, 2003; Navarro et al., 1997) although in large inversions

Page | 32 like In(3R)P tight LD is only expected very close to the breakpoints of the inversion

(Andolfatto et al. 2001). However earlier studies have shown that loci within an inversion can bear signs of selection and be in high LD (calculated using the D’ statistic) with the inversion as seen in Drosophila pseudoobscura and D. melanogaster (Kennington &

Hoffmann, 2013; Kennington et al., 2006; Pegueroles et al., 2010; Schaeffer et al., 2003).

We measured mean D’ across 3R with respect to In(3R)P in the Innisfail lines (D’ = 0.47).

After excluding regions with low recombination (3L: 1 - 447,386 & 18,392,988 – end of chromosome arm and 3R: 1 - 7,940,899 & 27,237,549 – end of chromosome arm), the mean D’ across the inversion was significantly reduced compared to the rest of the collinear genome (D’ = 0.45 vs. 0.47, Wilcoxon’s rank sum test, P < 0.05). Sliding window analyses showed that most windows for mean D’ near the breakpoints and a few windows within the inversion were in high LD with the inversion (Fig. 2.1C). The window of loci spanning from 16.2 – 16.4 Mb located within the inversion displayed the highest LD (D’ = 0.56) with the inversion. Furthermore this same region displays close to zero differentiation when comparing Innisfail – inverted lines and Innisfail – standard lines, therefore suggesting that this region may be under selective pressure and may contain adaptive loci that are highly associated with increased fitness. However the distal breakpoint appeared to be undergoing increased recombination/gene conversion and loss of structure compared to the proximal breakpoint. This pattern is qualitatively similar to earlier reported patterns of LD across In(3R)P (Kennington et al., 2006) in Australian populations.

The D’ results also indicated that the inversion substantially influenced recombination/gene conversion rates outside the inversion. A 200 kb window preceding the proximal breakpoint (11.4 – 11.6 Mb) (mean D’ = 0.503, Wilcoxon’s rank sum test:

P < 0.005) had greater D’ than the rest of the chromosome arm and the region spanning

Page | 33 from 21.5 – 21.7 Mb beyond the distal breakpoint had a higher D’ (mean D’ = 0.540,

Wilcoxon’s rank sum test: P < 0.05) (Fig. 2.1C). We found no correlation between LD for loci within the inversion and FST values for the same loci between Innisfail-inverted and Yering Station-standard lines (Fig. S2.1). These data might reflect the fact that the loci locally adapted at opposite ends of the cline differ; variation at loci associated with

In(3R)P in the north could be neutral or selected in the opposite direction in the south and vice versa.

To investigate the LD patterns within the same chromosomal rearrangement, we plotted heatmaps for pair-wise LD calculations (Lewontin’s D’) across 3R for inverted and standard arrangements from the Innisfail (Queensland) population separately. We observed expansive LD blocks in the region spanned by the inversion in the subpopulation of inverted chromosomes (Fig. 2.2A), as well as in the subpopulation with standard chromosomes (Fig. 2.2B). This indicates that there may be reduced recombination in this region among inversion homozygotes as well as among standard homozygotes.

Furthermore the region spanned by the inversion appears to maintain high LD among clusters of loci separated from each other, which is further enriched in the standard conformation. Since inversions could be maintained by heterozygous advantage

(Dobzhansky, 1970) it is possible that in populations where In(3R)P exists in greater than

60% allele frequency (Anderson et al., 2005), the standard conformation could hold alleles strictly associated with and complimentary to the inversion on account of recombination and selection on such loci. Furthermore since the frequency of standards is so low in northern Australia, it is likely that limited number of haplotypes of the standard chromosome work towards preserving variation co-adapted with the inversion

3R Payne. This could also contribute to increased differentiation in the region spanning the inversion across the cline (c.f. Kennington et al., 2006). The most differentiated

Page | 34 outlier single nucleotide variations (SNV’s) in population comparisons offer possible targets of positive selection (Akey et al., 2002; Kolaczkowski et al., 2011). To characterize these outliers, we selected SNV’s across 3R and 3L, excluding regions of low recombination near the telomeres and the centromere. We found that the top 5% of outlier loci across 3R had significantly higher SNV density across the inversion region

(Fig. 2.3, Table S2.4). The IiYs comparison also demonstrated an accumulation of highly differentiated loci near the breakpoints, gradually reducing toward the centre of In(3R)P, whereas in IsYs candidate density increased toward the centre of the inversion and for

IiIs it was highest to the right of the inversion. We also observed enrichment of outlier

SNV’s in the coding region (non-synonymous) and the 5’ UTR (see Table S2.5).

Several genes identified using outlier SNV’s displayed clinal variation, a large number of which were developmental or stimulus-response related genes such as Hsrω, Ubx, Ser tld and dmrt93B and Irc on 3R and CyclinT, Cyp4d20 and Hsp23 and Hsp27 on 3L (See

Table S2.6). We also observed Insulin-like receptor (InR) and several P450 genes were strongly differentiated between the subpopulations with alternate karyotype in Innisfail.

Gene ontology (GO) analysis on outlier SNV’s revealed functionally distinct gene categories under spatially varying selection that are influenced by the presence of

In(3R)P. Within Innisfail, In(3R)P mainly enriched metabolic processes and enzyme linked receptor protein signalling pathways and limb morphogenesis (Table S2.7). When comparing SNV’s within the standard arrangements from both ends of the cline, GO terms related to ‘regulation of muscle organ development’ ‘sugar transmembrane transporter activity’ and ‘response to temperature stimulus’ were enriched (Table S2.7).

However, alternate karyotypes from opposite ends of the cline displayed an enrichment of gene functions related to membrane receptors and ion transport activity as well as

‘gamete generation’ in for male and female gametes (Table S2.7). Taken together with

Page | 35 the differentiation patterns, these data suggest that entire genomic regions and gene families may be associated with rapid adaptive shifts in response to latitudinal selective pressures.

How can these patterns of differentiation across the third chromosome be explained?

Several factors may contribute to the increased differentiation seen between the inverted arrangement in the north and standard arrangement in the south, along with the enrichment of GO categories, when compared to the lower level of differentiation in the standard arrangement between these areas. One possibility is that new combinations of alleles favoured by climate-related selection have been captured within the inverted region following the recent invasion of Australia by D. melanogaster. This species has only been present in Australia for just over 100 years (Hoffmann & Weeks, 2007), equating to at most 2000-3000 generations, necessitating a rapid evolution of genome content within the same arrangement. Nevertheless there is clearly ongoing selection on inversions and markers within inversions, as highlighted by continued changes in frequencies along the east coast of Australia (Umina et al., 2005). Under the Kirkpatrick and Barton (2006) model, an inversion is expected to go to fixation in a population once favoured alleles are captured. The situation in eastern Australia is clearly more complicated because the inversion is only at a high frequency at one end of the cline. The standard arrangement continues to be favoured at high latitudes and ongoing gene flow as well as geographically/ temporally varying selection will maintain inversion polymorphism even if the genomic content of the arrangements have evolved to produce local adaptation.

Another possibility is that multiple forms of inverted and/or standard chromosomes invaded Australia, and these have changed in frequency as the distribution of D. melanogaster expanded (Bergland et al., 2014). This process would also result in

Page | 36 geographically-related genomic changes across populations, but without a signature of a recent selective sweep. In addition, other factors like direct selection on inversions through effects on gene function and expression could influence the frequency (but not genome content) of In(3R)P (see Hoffmann and Rieseberg (2008)). Both direct selection on the inversion and clinal assortment of multiple forms of inverted or standard arrangements may help to explain the parallel patterns of In(3R)P across continents. It should be possible to test these possibilities further through genomic analyses of additional populations using a higher density of genetic markers. With additional data, it may be possible to investigate the potential of selection in the face of recombination to generate rapid differences in genomic content using simulations like those involving the sequential Markovian coalescence recently applied to inversions (Peischl et al., 2013). It should also then be possible to look for evidence of recent selective sweeps. By undertaking genomic analysis of the In(3R)P region from populations across different continents with consistent sets of SNP markers, it should also be possible to partition the effects of admixture more clearly from ongoing selection.

Regardless of whether the patterns we have detected arise from the rapid and recent evolution of genome content, geographic assortment of different arrangements or other factors, our data support the hypothesis that In(3R)P is associated with the capture of locally adapted alleles acting additively or epistatically. Furthermore the enrichment of different gene categories in each comparison points to biological processes and phenotypic traits affected by the inversion and differentiated along the cline. Established inversions can maintain differentiation when compared to non-inverted chromosomes and retain structure despite recombination over time particularly around breakpoints

(Andolfatto et al., 2001). However our findings and other recent research suggest that inversions might also maintain adaptive sets of genes away from the breakpoints,

Page | 37 enhancing differentiation between inverted and non-inverted chromosomal arrangements.

The results highlight that geographic adaptation can be affected by the way genetic variation under selection is organized and captured along chromosomes as well as by allelic changes across the genome from standing or mutational variation.

Acknowledgments

We thank Gordana Rašić for help with the library construction and barcoding strategy and Robert Good and Heng Lin Yeap for technical assistance with computing. We also thank Dr. Alexandre Fournier-Level for gifting us with a 2014 Drosophila melanogaster collection from Australia. This work was supported by Discovery and Fellowship grants from the Australian Research Council to AAH, the University of Melbourne Early Career

Researcher grant to SFL, and the Science Industry Endowment Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Availability of data

The raw reads used for analysis are available in the NCBI Sequence Read Archive (SRA) under project no. PRJNA221876 with accession numbers SRR1239546 – SRR1239600.

The raw alignment files, raw and imputed polymorphism data in VCF format, FST calculations and script/code for all above are available from Dryad: doi:10.5061/dryad.5q0m8.

Footnotes

Author contributions: A.A.H., S.F.L. and R.V.R. designed research; R.V.R. and L.R. performed research; R.V.R., S.F.L. and A.A.H. contributed new reagents/analytical tools;

R.V.R. and M.K. analyzed data; and R.V.R., S.F.L. and A.A.H. wrote the paper.

Page | 38

Competing financial interests

The authors declare no competing financial interests.

Page | 39

Figure Legends

Figure 2.1. Mean differentiation and LD plotted using 200 kb windows slid every 50 kb.

(A) Significantly elevated mean FST is detected across In(3R)P whose location is indicated by the black line above the curves Furthermore this aligns well with the three markers (red) indicated as most associated with In(3R)Payne by Kennington et al. (2006).

However (B) FST across 3L shows very little difference when comparing IiYs and IsYs.

(C) Level of LD across 3R with respect to In(3R)P (location is indicated by the black line above the curves) combined across both Innisfail populations DP stands for Dprime

(measure of LD). All x-axes exclude the low recombination regions near the centromere and telomeres (see Materials and Methods).

Page | 40

A

B

C

Page | 41

Figure 2.2. Pattern of linkage disequilibrium across chromosome arm 3R for Innisfail populations. LD patterns based on 18 wild chromosomes of standard conformation (3R_Is

- top) and 19 wild chromosomes carrying In(3R)P (3R_Ii - below). In(3R)P is marked by the grey region between the chromosome tracks (3RP region). The colour (red) intensity indicates the strength of LD in D’ of a pair-wise comparison between two sites.

All x-axes exclude the low recombination regions near the centromere and telomeres (see

Materials and Methods).

Page | 42

Figure 2.3. Distribution of most differentiated outlier SNV’s across chromosome arm 3R in windows of 1 Mb. (A) SNV distribution for the three population comparisons. (B)

Difference between number of SNV’s in IiYs and IsYs for every window. Extensive difference between the two population comparisons is clustered near the breakpoints and the 16-17 Mb windows. Region spanned by In(3R)P is denoted by a black line. All X- axes exclude the low recombination regions near the centromere and telomeres (see

Materials and Methods).

A B

Page | 43

Tables

Table 2.1. FST values for population comparisons. Regions spanned by In(3R)P are always significantly higher than other regions.

Population 3L 3R Region spanned 3R: region outside 3R: top 5% comparison by In(3R)P In(3R)P threshold

IiIs 0.048 0.052 0.054 0.051 0.174

IiYs 0.085 0.106 0.130 0.093 0.363

IsYs 0.085 0.095 0.102 0.090 0.290

Page | 44

CHAPTER 3

Orthonome – a new pipeline for predicting high quality orthologue sets applicable to complete and draft genomes.

Rahul V. Rane1, 2, John G. Oakeshott2, Thu Nguyen1, Ary A. Hoffmann1 & Siu F. Lee2, 3

1School of Biosciences, Bio21 Institute, The University of Melbourne, Australia

2CSIRO, Australian Capital Territory, Australia.

3Department of Biological Sciences, Macquarie University

Correspondence: Rane, Rahul V.

2CSIRO, Australian Capital Territory, Australia

Phone: +61-3-9035 5939

E-mail: [email protected]

Email addresses:

1. JO: [email protected]

2. TN: [email protected]

3. AH: [email protected]

4. SFL: [email protected]

Key words: orthologue, inparalogue, gene duplication, gene birth.

Page | 45

Abstract

Orthonome is an orthologue prediction pipeline that provides a superior combination of orthologue capture rates and accuracy when tested alongside previously published pipelines on complete and draft drosophilid genomes. The pipeline compares sequence domains and then forms sequence-similar clusters before using phylogenetic comparisons to identify inparalogues. It then corrects sequence similarity metrics for fragment and gene length bias using a novel scoring metric capturing relationships between full length as well as fragmented genes. The remaining genes are then brought together for the identification of orthologues within a phylogenetic framework. The orthologue predictions are further calibrated along with inparalogues and gene births, using synteny, to identify novel orthologous relationships. We use 12 high quality Drosophila genomes to show that, compared to other orthologue prediction pipelines, Orthonome provides orthogroups with minimal error but high recall. Furthermore, Orthonome is resilient to suboptimal assembly/annotation quality, with the inclusion of draft genomes from eight additional Drosophila species still providing >6500 1:1 orthologues across all twenty species while retaining a better combination of accuracy and recall than other pipelines. Orthonome is implemented as a searchable database and query tool along with multiple-sequence alignment browsers for all sets of orthologues. The underlying documentation and database are accessible at http://www.orthonome.com.

3.1 Introduction

Distinguishing orthologous and paralogous relationships among gene complements of whole genomes in different eukaryote species is crucial to understanding patterns of functional conservation and change through the course of evolution (Gabaldon & Koonin,

Page | 46

2013). Analyses of these patterns could become far more powerful with the rapid growth in the number of species for which whole genome sequences are available.

Several methods have been developed over the past two decades for orthologue prediction, including tree-based methods (Ensembl Compara (Vilella et al., 2009),

PANTHER (Mi et al., 2013) and PhylomeDB (Huerta-Cepas et al., 2008)), graph-based methods (i.e., based on pairwise comparisons; Best Reciprocal Hits (Tatusov et al., 1997),

Reciprocal Smallest Distance (RSD), EggNOG (Jensen et al., 2008), Hieranoid

(Schreiber & Sonnhammer, 2013), InParanoid (O'Brien et al., 2005), OMA (Altenhoff et al., 2015) and OrthoInspector (Linard et al., 2011)) and meta-methods combining phylogenetic information derived from different databases (e.g. MetaPhors (Pryszcz et al., 2011)). Most of these methods were developed prior to the widespread availability of whole genome data so they are not well suited to exploit the contextual information (e.g. positional information or splice variation) provided by such data (Dewey, 2011).

Additionally, the strictly graph-based methods involve finding genes in pairs of genomes that share the strongest sequence similarity or have the least genetic distance. However, phenomena such as violations of molecular clock behaviour or rapid gene duplications can compromise their performance (Dalquen et al., 2013; Gabaldon & Koonin, 2013).

Tree-based methods on the other hand, typically improve the quality of orthologue inference in the presence of such phenomena (Dalquen et al., 2013) and further improvements can be made if synteny information is also used (e.g. BBH-LS (Zhang &

Leong, 2012) or the recently developed MSOAR (Fu et al., 2007; Shi et al., 2011; Shi et al., 2010)). The MSOAR algorithms effectively utilise genomic context in a genome evolution framework to predict orthologues without relying on just one of either strict gene homology or conserved gene order (Dewey, 2011; Shi et al., 2011; Shi et al., 2010).

However, MSOAR cannot be used to analyse draft genomes because it requires

Page | 47 chromosome-level assignment of annotated gene models. It also uses approximate measures of genetic similarity rather than the more accurate Smith-Waterman alignments.

The Orthonome orthologue prediction pipeline presented here overcomes the issues highlighted above. It uses whole genome data but is tolerant of fragmented draft genomes.

The pipeline also has a toolkit of scripts to enable orthologue-based whole genome phylogenetic analyses. Orthonome was developed to provide high quality orthologue predictions while maintaining high recall compared to other pipelines. The accuracy and sensitivity of the pipeline are validated using tests developed by Altenhoff et al. (2016) and published data for twenty Drosophila genomes, twelve of which have been extensively curated.

3.2 Implementation

Orthonome is a multimodular pipeline and toolkit for orthologue prediction and phylogenetic and gene evolution analysis. The workflow includes five steps as described in Fig. 3.1, including input processing, adaptive identification of inparalogues, corrections of Smith-Waterman alignment for gene length and fragmentation bias, and classification of genes as orthologues or paralogues based on their evolutionary and genomic contextual relationships. After the homology comparisons and inparalogue

(species-specific duplication (Sonnhammer & Koonin, 2002)) identification following

MSOAR, the Smith-Waterman scores are normalised to the length of the aligned segments (S’ scores), and the gap penalty is reduced to accommodate large gaps between coding regions, allowing more efficient recovery of orthologues that have fragmented gene models (Supplementary note 1). S’ scores are calculated for gene pair Ga1 and Ga2 using a Smith Waterman alignment and BLOSUM50 substitution matrix as:

푆푚𝑖푡ℎ푊푎푡푒푟푚푎푛퐵퐿50[퐺 푣푠 퐺 ] 푆′ = ( 푎1 푎2 ) ∗ 푅푎푡𝑖표 푝푎𝑖푟푤𝑖푠푒 𝑖푑푒푛푡𝑖푡푦 ∗ 100 푃푎𝑖푟푤𝑖푠푒 푎푙𝑖𝑔푛푚푒푛푡 푙푒푛𝑔푡ℎ (푏푝)

Page | 48

Orthonome then partitions gene sets into four orthogonal classes, namely orthologues that are present in all species (hereafter termed 1:1 (n=all) orthogroups), those present in some but not all species (1:1 (1

(SOGs), which are akin to functional gene families.

The pipeline consists of UNIX bash and c++ scripts assisted by toolkits written in perl and python. Furthermore the pipeline is written in a map-reduce protocol leveraging the parallelisation enabled by a GNU parallel utility (Tange, 2011). The source code is available at https://bitbucket.org/rahulvrane/orthonome while the database website is hosted at www.orthonome.com.

The genome and annotation data for the implementation of Orthonome herein uses the 12

FlyBase Drosophila species (Drosophila ananassae, D. erecta, D. grimshawi, D. melanogaster, D. mojavensis, D. persimilis, D. pseudoobscura, D. sechellia, D. simulans,

D. virilis, D. willistoni, D. yakuba; accessible at ftp://ftp.flybase.net/genomes/)

(Marygold et al., 2013) and the additional eight draft genomes in modENCODE (D. biarmipes, D. bipectinata, D. elegans, D. eugracilis, D. ficusphila, D. kikkawai, D. rhopaloa, D. takahashii; https://www.hgsc.bcm.edu/arthropods/drosophila-modencode- project) (Table S3.1). A total of 308,667 genes were analysed using this approach across the 20 species (Table S3.1). We did not use the 66 metazoan species set in Altenhoff et al. (2016) because Orthonome depends upon whole genome information, including both genome sequence and annotation, which is unavailable for most of the 66 species.

Comparisons between Orthonome and other pipelines were carried out as follows:

Page | 49

1. Comparing Orthonome and five other orthologue prediction pipelines using the 12

and 20 Drosophila genome sets.

The six orthologue prediction pipelines (Orthonome, OrthoDB (Kriventseva et al., 2015),

Reciprocal bi-directional hit, MetaPhors (Pryszcz et al., 2011), OMA (two modes: groups and GETHOGS) (Altenhoff et al., 2015) and MultiMSOAR2 (Shi et al., 2011)) were compared according to methods which Altenhoff et al. (2016) developed to (i) evaluate phylogenetic discordance between species trees and gene trees constructed using orthologue sets and (ii) measure number of orthologue clusters with all species – representing a recall of conserved orthologues. We did not use the gene tree or ontology- based method from Altenhoff et al. (2016) because only one of the 20 species has the manually annotated gene ontologies and gene trees required for the tests.

2. Comparison between Orthonome and the published data on the 12 Drosophila

genomes obtained from FlyBase.

The output statistics from Orthonome were compared with those generated through the

OrthoDB pipeline from FlyBase (OrthoDB v7)

(https://flybase.org/static_pages/downloads/FB2015_04/genes/gene_orthologs_fb_2015

_04.tsv.gz using the script Orthonome_discordance_data.py (see Supplementary

Material). All orthogroups unique to Orthonome were tested for gene ontology enrichment using the goatools pipeline (https://github.com/tanghaibao/goatools) and enrichment was carried out with reference to the D. melanogaster genome.

3. Comparison between Orthonome, OrthoDB and MSOAR2 for all 20 Drosophila

genomes.

Page | 50

We also compared the output of Orthonome to that of OrthoDB and MSOAR2 for all 20 species above using the default parameters defined by the authors of those pipelines;

Kriventseva et al. (2015) and Shi et al. (2011).

3.3 Results and Discussion

3.3.1 Benchmarking against current pipelines

To assess its performance we applied Orthonome to datasets of 12 and 20 Drosophila species (see Implementation) and compared our results to those generated with OrthoDB

(Waterhouse et al., 2011), Reciprocal bi-directional best hit (Tatusov et al., 1997),

MetaPhors (Pryszcz et al., 2011), OMA (groups and GETHOGS) (Altenhoff et al., 2015) and MultiMSOAR2 (Shi et al., 2011; Shi et al., 2010) using the two measures described in Altenhoff et al. (2016). Firstly, we calculated the average tree error (Robinson-Foulds distance between the reference species phylogeny and the tree obtained from orthologous clusters with all tested species (see (Altenhoff et al., 2016) for each method using the 12 and 20 genome sets. This test assumes that a true orthologous group with all species represented (1:1 (n=all) orthogroups in the case of Orthonome) should give rise to the same phylogenetic tree as the overall species tree. Therefore a lower average tree error represents better predictive capacity of the pipeline. For the second of the measures we calculated the recall statistic, namely the number of complete orthologous group sets recovered by the pipeline, where a higher number indicates better resolution of orthologous relationships.

Application of six pipelines to the 12 high quality FlyBase genomes showed Orthonome had the lowest error rate and second highest recall (9,513 1:1 (n=all) orthogroups, Fig. 3.2).

By comparison, MultiMSOAR2, which had the highest recall (9,595 1:1 (n=all) orthogroups), also had the highest error. Furthermore OMA (groups) which had the second lowest error rate had nearly four-fold lower recall compared to Orthonome.

Page | 51

Additionally, OrthoDB, which is the most widely accepted source of orthologues for the

Drosophila species (Attrill et al., 2015), also had lower quality (3rd best) and recall (4th best) compared to Orthonome. Therefore, for the 12 genomes we found that Orthonome had the best combination of tree error and recall compared to the six other pipelines tested in the current study (Fig. 3.2).

Addition of the eight lower quality modENCODE genomes (see Implementation) to the data set yielded a broadly similar pattern of differences among the pipelines. The recall rate for Orthonome was now rated 3rd best, but it still had the best combination of recall and accuracy. The lower number of 1:1 (n=all) orthogroups recovered by all pipelines with the 20 genome data set is most likely due to the lower quality annotations (chimeric and missing genes) in the eight modENCODE genomes (Tekaia, 2016).

3.3.2 Detailed comparison of Orthonome with OrthoDB using the 12 genome data set

The analysis of the twelve well curated Drosophila genomes in FlyBase with Orthonome identified orthologous relationships for an average of ~1.8% more genes in each species than did OrthoDB (13,516 vs 13,270), including resolving cryptic relationships between multiple genes (see e.g. Fig. 3.3A/B). Orthonome also found ~44% more 1:1 (n=all) orthogroups (9,538 vs 6,621) than did OrthoDB (Table 3.1, S3.1 (Waterhouse et al.,

2011)) which involved resolving an additional 2,917 1:1 (n=all) orthogroups from among orthogroups that OrthoDB had left as 1:many or many:many sets (Fig. 3.3 and Table

S3.2). Furthermore, 94%, or 6,228, of the 1:1 (n=all) orthogroups predicted by OrthoDB were also predicted by Orthonome.

The greatly enhanced capability of Orthonome to resolve 1:many and many:many into

1:1 relationships was evident in an analysis of 11 gene families associated with stress tolerances and prone to rapid amplification and loss in insects (Rane et al., 2016).

Page | 52

Orthonome could resolve 45-133% more of these genes into 1:1 (n=all) orthogroups than

OrthoDB (Table S3.3). For example, cytochrome P450 monooxygenases (P450s) were classified into 51 1:1 (n=all) orthogroups by Orthonome compared to 22 by OrthoDB (Table

S3.3). An example of the improved orthologue assignment by Orthonome in a particular

P450 gene cluster is illustrated in Fig. 3.3C.

Page | 53

Table 3.1: Numbers of orthologues, orthogroups, inparalogues and gene births identified by Orthonome, OrthoDB, MultiMSOAR2, OMA groups and reciprocal best hit (RBH) using the same input data. Inparalogue predictions were carried out only in Orthonome and MultiMSOAR2 while gene births are those genes for which no significant sequence similarities could be found by the respective pipelines. NA denotes the lack of inparalogue identification by OrthoDB and values that could not be calculated for MSOAR2 (since it has a different scoring method than

Orthonome).

Multi Comparison Measures Orthonome OrthoDB OMA RBH MSOAR2

Average number of genes per species 15,446

Average number of orthologues per species 13,517 13,270 13,310 12,934 13,545 Twelve Average number of inparalogues per species 1,519 NA 1,543 NA NA FlyBase Average number of gene births per species 400 2,805 593 NA NA genomes Number of 1:1 (n=all) orthogroups 9,538 6,621 9,595 5,380 8,643

Number of 1:1 (1

Twenty Average number of genes per species 15055

genomes Average number of orthologues per species 13,272 13,091 12,964 12,565 13,504

(Twelve Average number of inparalogues per species 1,305 NA 1,422 NA NA

FlyBase + Average number of gene births per species 468 2,343 670 NA NA

Page | 54

eight Number of 1:1 (n=all) orthogroups 6,541 3,912 7,491 2,555 5,757

modENCODE) Number of 1:1 (1

Page | 55

3.3.3 Detailed comparison of Orthonome with OrthoDB using the 20 genome data set

Compared to the twelve FlyBase genomes above, the eight draft modENCODE genomes had significantly shorter post-assembly contig and scaffold N50s (averages of 493Kb cf

1.9Mb and 1.1Mb cf 12.8Mb respectively; Table S3.1) and more missing members of the universal insect gene set identified by BUSCO (0.06% vs 1.65% respectively).

Application of Orthonome to this second data set again achieved a small (0.5%) improvement in orthologue recovery per species compared to OrthoDB. Compared to the

12 genomes analysis above, both pipelines displayed a 1.8% and 1.3% reduction in the number of orthologues recovered by OrthoDB and Orthonome per species, which we can attribute to missing gene models in additional draft genomes (Fig. S3.1b). However, the number of 1:1 (n=all) orthogroups per genome obtained from the combined data sets using

Orthonome was actually 67% higher than the number obtained with OrthoDB, bearing out the value of Orthonome even for poorer quality genome assemblies.

The ability of Orthonome to identify orthologues and in particular to resolve 1:many and many:many orthogroups into 1:1 (n=all) orthogroups with greater accuracy (Fig. 3.2 and

Fig. 3.3c) also means that genes in each genome that do not fall into the latter class can be cleanly partitioned into three other classes, namely 1:1 (1

The numbers of genes assigned to the orthogroup, inparalogue and de novo gene birth categories will be less prone to error in the 12 genome analysis than the 20 genome

Page | 56

analyses because the 12 genome analysis gives the best estimate of the number of 1:1

(n=all) orthogroups, as explained above. As such, the finding of an average of 1519 inparalogues and 400 de novo gene births per genome in the 12 genome analysis (Table

3.1) represents the best estimates yet available for these two crucial sources of genetic novelty in the genus. Nevertheless we observed a negative relationship between the number of 1:1 (n=all) orthogroups per genome and the quality of genome assembly and annotation (see above, Supplementary Note 2). As a result, although they are the most up-to-date available at the time of writing, the estimates of gene birth and inparalogues are still very likely to be overestimates of the true values for these quantities because of the residual errors in the underlying assemblies and annotations.

3.4 Conclusion

Whereas previous pipelines are limited to classifying most genes in a genome into 1:many or many:many groups of orthologues, especially in draft genomes, Orthonome has separated the majority into 1:1 (n=all) orthogroups (representing evolutionary conservation) or inparalogues and gene births (representing the genetic novelty from which new functions most likely evolve). By comparing multiple pipelines we also find that

Orthonome provides the best combination of accuracy and recall of orthologue assignments. As the quality of genome assemblies and annotations continues to improve, the availability of Orthonome and any subsequent improvements on it should provide increasingly accurate estimates of the various sources of genomic variation fundamental to evolutionary change.

Declarations

Page | 57

Acknowledgements

We thank Karl Gordon and Alexandre Fournier-Level for valuable discussion and suggestions. We also thank the NeCTAR cloud research facility for allocating significant computational resources for the continuance of the Orthonome database.

Funding

This work was funded by the Australian Science and Industry Endowment Fund (SIEF) and a NeCTAR computation grant to RVR. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Authors’ contributions

RVR and SFL conceptualised the study. RVR designed the pipeline and wrote the scripts.

Web interface conceptualised by RVR and TN and implemented/maintained by TN. JO,

AH and SFL helped interpret the biological significance of results. RVR, JO, AH and

SFL wrote the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval

Not applicable.

Availability of data and materials

Page | 58

The source-code and raw data for the pipeline will be made available at https://bitbucket.org/rahulvrane/orthonome and the generated prediction database for

Drosophila orthologues is available as a searchable website, www.orthonome.com.

Figures

Figure 3.1. Process flow diagram for the Orthonome pipeline. The five-step Orthonome pipeline combines the power of heuristic and greedy algorithms to improve the accuracy and recall from draft both well annotated and draft genomes.

Page | 59

Figure 3.2: Species tree discordance test measuring the congruence between trees based on predicted orthologue sets and reference species trees. The circular markers represent the results for the 20 species data set while the square markers represent those for the 12 high quality FlyBase genomes. The yellow circular markers and grey square markers represent pipelines that produce orthogonal orthologue sets with only one gene per species in a cluster. Orthonome provides a superior combination of low average tree error and high recall with both data sets.

Page | 60

Figure 3.3. Comparison between Orthonome, MSOAR and OrthoDB for orthologue identification across syntenically supported regions and fast evolving gene families. (A)

A comparison of orthologue capture success in a 40 Kb syntenic region between D. sechellia and D. melanogaster. Orthologous relationships are indicated by vertical lines.

Black lines = orthologous pairs supported by OrthoDB, MSOAR and Orthonome; blue lines = orthologous pairs detected only by MSOAR and Orthonome; red line =

Page | 61

orthologous pair recovered only by Orthonome. (B) The newly recovered orthologous pair FBgn0004554 and FBgn0170274 in Panel A is supported by high (>90%) amino acid identity. (C) Orthonome is able to split OrthoDB orthogroup EOG7KHP47 (dotted orange clade) consisting of P450 monooxygenases into three independent orthogroups (solid blue clade). The three D. melanogaster genes are highlighted in green and the six additional genes that were allocated to the orthogroups by Orthonome only are marked in red.

OrthoDB was unable to identify orthology to these six genes.

Page | 62

CHAPTER 4

Lineage-specific expansions in detoxification-related enzyme and transporter gene families are associated with host use traits in herbivorous insects

Rahul V. Ranea,b, Tom K. Walsha, Lars S. Jermiina, Stephen L. Pearcea, Karl H. J.

Gordona, Stephen Richardsc, Ary A. Hoffmannb, Siu Fai Leeb and John G.

Oakeshotta a CSIRO, Clunies Ross St, (GPO Box 1700), Acton, ACT, 2601, Australia b Bio21 Institute, School of BioSciences, University of Melbourne, 30 Flemington Road,

Parkville, 3010, Australia c Human Genome Sequencing Center, Department of Molecular and Human Biology,

Baylor College of Medicine, One Baylor Plaza, Houston, TX 77006, USA

Abstract:

It has been hypothesised that the host range of herbivorous insects is associated with the size of gene families involved in xenobiotic detoxification. Here we have tested this hypothesis on 58 insect species whose genomes have been sequenced. Analyses conducted both across and within insect orders found that polyphagous species generally do have more cytochrome P450, carboxylcholinesterase and ABC transporter genes than restricted specialists. Moreover the lineages within three of these families found to contribute most to these associations are ones which contain members that empirical evidence has already associated with detoxification functions. On the other hand we found little evidence that host range was associated with the sizes of two other gene families, the glutathione s-transferases and UDP-glycosyltransferases, which have also been associated with detoxification functions in some previous work. The sizes of three other gene families, namely odorant binding proteins, the major facilitator superfamily and heat

Page | 63

shock proteins, which are associated with environmental response but not specifically detoxification, were also found not to be associated with host range.

4.1 Introduction

Plants produce a wide variety of defence chemicals that protect them from insect attack and, conversely, insects have a range of detoxification pathways which determine the plants they can use as hosts (Despres et al., 2007; Strong et al., 1984). It has therefore been proposed that generalist insect species with broad host ranges would have more versatile detoxification pathways than would close relatives with narrower host ranges

(Berenbaum et al., 1992; Berenbaum & Zangerl, 2008). More specifically it has been hypothesised that the broader host range species would have larger gene families encoding key components of the pathways (Claudianos et al., 2006; Dermauw et al.,

2013a; Oakeshott et al., 2010). The families that have attracted particular interest are catabolic enzymes such as cytochrome P450 monooxygenases (P450s) and carboxyl/choline esterases (CCEs) which can degrade the compounds into less toxic and/or more water soluble products, and other enzymes such as glutathione S-transferase

(GSTs) and UDP-glycosyltransferase (UGTs) which can conjugate the toxins or their catabolic products with endogenous molecules that expedite their subsequent secretion

(Ahn et al., 2012; Berenbaum et al., 1992; Berenbaum et al., 1996; Daborn et al., 2002;

Feyereisen, 2011; Shou-Min, 2012). Increasingly now it is also recognised that genes encoding transporter proteins (de la Paz Celorio-Mancera et al., 2013; Dermauw et al.,

2013a; Govind et al., 2010), in particular the ATP-binding cassette (ABC) transporters

(Dermauw et al., 2013a; You et al., 2013a) and possibly also the Major Facilitator

Superfamily (MFS) of proteins, could also be important parts of the pathway (Hayashi et al., 2002; Kretschmer et al., 2009; Saidijam et al., 2006; Tenreiro et al., 2002).

Page | 64

There is a growing body of evidence that the first five of these gene families are involved directly in insects’ metabolism of plant defence compounds, plus an already large body of evidence linking them to the detoxification of various synthetic chemical insecticides

(Buss & Callaghan, 2008). Genome data supporting the hypothesis of larger gene complements in broader host range insects have been obtained for the P450, CCE, GST,

UGT and ABC transporter families at least, but the data generally only involve a small number of species and these species are often not closely related [8, 9, 33-37]. Moreover these studies have generally compared the total numbers of genes in the respective families rather than the numbers in the specific lineages within them that have most often been empirically linked to detoxification functions.

The current study is a more thorough test of the hypothesis which compares a total of 58 species across five insect orders/superorders for which good quality genome data are available. The Orthonome pipeline of (Rane et al., in review) is used to extract the key gene families and specific lineages within them from each species and to identify the inparalogues (i.e. species-specific gene duplications) within those families and lineages.

The size of the data set also enables us to separately assess the associations of these numbers with host range as defined by the number of host taxa attacked and host tissue type (i.e. seeds, leaves, flowers fruits etc). As well as the six genes families above, it also considers gene families involved in host sensing, in particular odorant binding proteins

(OBPs), some of which are known to deliver plant volatiles to sensory receptors (Hungate et al., 2013; Matsuo et al., 2007; Swarup et al., 2014), and general stress response proteins such as the heat shock proteins HSP70 and HSP90 which also play a role in some insects’ resistance to toxic compounds (Esther et al., 2015; Feng et al., 2010; Skerl & Gregorc,

2010; Sun et al., 2014). We are not proposing that the OBPs and HSPs will show the same associations with host range and host tissue type as the families directly involved in

Page | 65

detoxification above. Nevertheless, given their functions in host finding and general stress response, some change in their composition might be expected across the range of host uses compared.

4.2 Materials and Methods

Our dataset encompasses a total of 58 wholly or partially herbivorous species covering seven insect orders. They included 25 Diptera, 7 Lepidoptera, 5 Coleoptera, 13

Hymenoptera, 6 Hemiptera and 1 Thysanoptera (for a total of 7 in the superorder

Condylognatha), plus 1 Isoptera. Details of their host usage are given in Table S4.1, together with the sources of the data.

Complete genome sequences and annotations were obtained for the 58 species from NCBI

(www.ncbi.nlm.nih.gov) and several other public data sources. Details of these data, including key assembly and annotation quality metrics calculated with the BUSCO pipeline (Simao et al., 2015), are given in Table S4.1. Prior to any analysis, we filtered the gene models of every species and selected only the longest isoforms for each gene for further analysis. We also utilised potential pseudogenes to allow identification of historic evolutionary events as well as mis-assemblies pertaining to certain gene families.

In order to identify the members of each of the eight gene families in each genome, we first collected high quality manual annotations for them in the two best annotated species in the study, i.e. Drosophila melanogaster and Bombyx mori. The Orthonome pipeline

(www.orthonome.com) was then used to identify orthologues, inparalogues, orthogroups

(1:1 orthologues containing >=2 species) and super orthogroups (SOGs; which cluster orthogroups and inparalogues by sequence domain similarity) within each species and gene family.

Page | 66

For four of the gene families (P450s, CCEs, GSTs and ABC transporters) for which there is considerable empirical evidence associating detoxifying functions to specific lineages within the family we also specifically analysed those lineages most commonly associated with those functions. These lineages were the CYP6 P450s, the dietary and detoxification clades of CCEs, delta and epsilon GSTs and the ABC transporter sub-classes B and C.

All analyses of variance on total gene counts and inparalogue numbers within each family or lineage within family were carried out with SPSS (IBM Statistics 22). Given significant differences between the genome-wide gene contents of the various species (which varied from about 9,823 protein coding genes in Ceratosolen solmsi to 33,019 genes in

Homalodisca vitripennis), we also analysed the gene counts and inparalogue numbers for each of the families as a proportion of the total numbers of genes in the rest of the genome

(i.e. the genome-wide numbers minus the numbers in the eight families).

4.3 Results

Total gene counts in each of the eight gene families varied considerably across the 58 species, from 2-3 fold for the MFS and HSP families up to about 10 fold for the OBPs across total, orthologue and inparalogue counts (Figs 4.1, S4.1 & S4.2). There were also significant correlations among the counts for most of the eight families, with most pairwise correlations in the range 0.4 to 0.7 and only the OBPs generally uncorrelated with the other seven families (Table S4.2). Importantly also the numbers of genes in each of the eight families were in many cases correlated to some degree with the total numbers of genes in the rest of the genome (Table S4.2) – as detailed in the Materials and Methods, we therefore focussed much of the subsequent analysis on the counts in each family normalised against the number of genes in the rest of the respective genomes (Fig. 4.2).

Page | 67

The pairwise correlations among the normalised numbers in the different families were lower than with the un-normalised numbers above, but still generally significant and many still in the 0.3-0.6 range (Table S4.2).

We then carried out two sets of analyses probing the relationships between normalised gene numbers in the eight families with host use phenotypes. The first set compared all

58 species but, because the specific tissue types attacked often differed among the insect orders, used a relatively coarse classification of host use phenotype. The second set of analyses compared the species in each order/superorder, allowing us to separately consider two host use phenotypes, namely host species range and host tissue type, the limitation in this case being that there were too few species in some orders for meaningful analyses of each.

4.3.1 Cross-order analyses

A set of multi- and univariate analyses of variance was used to test the overall effects of insect order and three classes of host use on the normalised gene numbers in the eight gene families (Table 4.1). For the purposes of these analyses the six hemipterans and one thysanopteran in the superorder Condylognatha were considered to one ‘order’. The three host use classes were: ‘specialists’, which are essentially confined to a single tissue type from a single genus of host; ‘generalists’, which use a single tissue type but from multiple genera of hosts; and ‘omnivores’, which utilise multiple tissue types of hosts from multiple genera (including insects in the three species in question here). The one exception to this classification system involved the nectar-feeding Hymenoptera, which were classified as specialists although they feed on hosts from many genera because of the paucity of plant defence chemicals in nectar (Adler & Irwin, 2005; González-Teuber

& Heil, 2009).

Page | 68

The multivariate analysis showed significant main effects of insect order and host use class but non-significant interactions between them. The univariate analyses for each gene family then revealed a significant main effect of insect order for all families except the P450s, plus a significant main effect of host use class for the P450s, CCEs and ABC transporters, but not for the other five families. The order by host use interaction term was just significant (P<0.05) for HSPs. For the seven families showing a significant effect of order, that effect accounted for 2 - 24% of the variance among species. For the three families showing a significant effect of host use class, that effect accounted for 24,

23 and 11% of the variance among species for the P450s, CCEs and ABC transporters respectively.

The effects of insect order largely reflected the smaller sizes of all the families except the

P450s in the Hymenoptera and to a lesser extent Diptera than in the Lepidoptera,

Coleoptera and Condylognatha; family sizes in the hymenopterans were about 45% smaller than in the latter three orders before normalising for genome size, and about 25% smaller after normalising (Figs 4.1-3). For all three gene families which showed a significant effect of host use type, we found that the specialists had about 21% fewer

P450s, CCEs and ABC transporters than the generalists before normalising, and about

12.5% after. There were relatively few omnivores in the sample but on average they had

36% more genes in these families than the generalists from the same order before normalising, but there was effectively no difference after normalising. The magnitude of the differences among the three host use species classes were relatively similar across the three gene families.

It is clear from these analyses that many of the previous studies comparing a small number of species with different host usages from different orders could have been conflating the effects of the two variables. However our analysis also clearly supports the association

Page | 69

proposed in these earlier studies between detoxification gene family size and host use patterns. Thus species with broader host use niches do have greater numbers of genes in three of the five major detoxification gene families, specifically the P450s, CCEs and

ABC transporters, at least among the 58 species sampled here.

Notably, a repeat of the above analyses for the specific lineages of four of the five families which have been most commonly implicated in detoxification functions in empirical studies (i.e. the CYP6 P450s, the dietary and detoxification clades of CCEs, delta and epsilon GSTs and the ABC transporter sub-classes B and C) found very similar patterns to those above on all genes in the families (Table 4.1, Fig. 4.2). In particular significant differences among host use classes were again found for the P450s, CCEs and ABC transporters, but not for the GSTs, and the differences for the first three were of similar size and direction to the above. However, analyses excluding these specific lineages found significant residual differences among the P450s and to a lesser extent CCEs (Table

4.1, Fig. S4.3). Thus most of the association with host use classes in overall gene numbers in these three families is attributable to the specific lineages most often empirically associated with detoxification functions but there were also some contributions from other lineages for the CCEs and in particular the P450s. Interestingly the association was not significant for the Delta and Epsilon GSTs but it became so for the other GST lineages. We have no explanation for this latter result.

We also repeated the overall analyses on the number of inparalogues rather than the total number of genes in each family. In so much as inparalogues are defined as lineage- specific duplicates, this analysis excludes all genes which are present in more than one species, whether or not they are present in all the others. It is therefore more likely to reflect gene gain than gene loss events. Given that the 58 species vary greatly in their phylogenetic distances from their nearest relatives in the sample (e.g. the 22 drosophilids

Page | 70

and two bactrocerans will all have a congeneric closest relative whereas many of the species sampled in the other orders have no close relative in the sample), we are wary not to over-interpret these analyses. We simply note that essentially the same pattern of results were obtained, the only difference being that the main effect of host use class became significant (at the P<0.05 level) for the GSTs (Table 4.1). These findings suggest that much of the host use associations found in the earlier analyses reflect gene gain rather than gene loss events.

4.3.2 Order-specific analyses

Given the diversity of host tissue types used by the species across the different orders, we also carried out a series of order-specific analyses with the aim of comparing species with different host species ranges against a background of a smaller and more tightly defined host tissue type usage patterns. For these purposes we considered host species range as either restricted, being confined to species from a single plant genus, or polyphagous, meaning using hosts from different plant genera. A total of ten different host tissue types were recognised – fresh and rotting fruit, fresh and rotting green tissue, wood, seed, nectar

(often including some pollen), sap (generally phloem), fungi and omnivorous (which, as above, meant using various plant and animal tissues). However the number of host tissue types represented in each insect order (again combining the hemipterans and thysanopteran into the Condylognatha) ranged from five in the Hymenoptera down to just one in the Lepidoptera.

Our strategy for the order-specific analyses was first to apply one-way ANOVAs to the categories of species in each order generated by the combination of their host range and tissue type traits above and, where this generated significant between category variance, then to carry out specific a posteriori contrasts. As with the cross-order analyses, we analysed the normalised total gene numbers and inparalogue numbers for each of the eight

Page | 71

families and the total gene numbers for the specific lineages of P450s, CCEs, GSTs and

ABC transporters most often associated with detoxification in the past. Additionally, given that the power of the order-specific analyses was limited by the small numbers of species within the orders, in this case we also analysed the sums of the normalised numbers across the five detoxification families.

Differences in the Diptera. The 25 Diptera in the study included 22 drosophilids and three tephritids (2 Bactrocera and the Mediterranean fruit fly Ceratitis capitata) (Table S4.1).

Seventeen of the drosophilids are rotting-fruit feeders, two from the repleta group are rotting-cacti feeders and one each feed on flowers, bark, pandanus palm and various fresh fruits. Both the Bactrocera and C. capitata are also fresh fruit feeders. The only dipterans in the study which we classified as restricted were the pandanus palm feeding D. erecta and D. sechellia, which eats rotting Morinda citrifolia fruit. The only significant difference we found in the (normalised) total numbers of genes across the six feeding habit categories of species represented in the sample was for the CCEs, where the polyphagous fresh fruit feeders had somewhat greater numbers than the other feeding habit categories. No significant differences were found among the categories for the four specific lineages of the P450s, CCEs, GSTs and ABC transporters that have previously been associated with detoxification (Table S4.3). The paucity of differences found may have been due in part to the narrow taxonomic range of the species sampled and the great preponderance of generalist rotting fruit feeders amongst them. Consistent with the narrow taxonomic range of the species sampled, there were relatively few inparalogues among the Diptera as compared to the other orders sampled.

Differences in the Lepidoptera. All seven lepidopterans in the current study were green tissue (mostly leaf) feeders, ranging from the host specialists Bombyx mori, Melitea cinxia, Heliconius melpomene and Danaus plexipus to the polyphagous pests Chilo

Page | 72

suppressalis, Manduca sexta and Plutella xylostella. The only family showing a significant difference in the normalised total numbers of genes between the species with these two feeding habits was for the GSTs, which were more numerous in the three polyphagous species than the four with more restricted feeding habits (Table S4.3).

Notably however the same pattern of differences (though not significant) was also seen in the P450s, CCEs and ABC transporters and the sum of the normalised numbers for the five detoxification families (Table S4.3). Differences in normalised inparalogue numbers and in the four specific lineages of the P450s, CCEs, GSTs and ABC transporters were not statistically significant for any of the families individually but showed the same trends as the total numbers (Figs 4.2, 4.3 & S4.4).

Differences in the Coleoptera. There were only five coleopterans in the study, three wood borers plus one tissue-eating and one seed-eating species. Two of the wood borers had restricted host ranges, the other three species being classified as polyphagous. The paucity of species and their spread across four feeding habit categories severely limited the power of the analyses and none of the gene families showed significant variation among species with the four feeding habits (Table S4.3). Nevertheless it is noted that the highly polyphagous pest Leptinotarsa decemlineata (Colorado potato beetle) had the highest numbers of detoxification and transporter genes in comparison both to the other coleopterans and to the great majority of the species in the other orders (Fig. 4.1). For example it had more (141) CCEs than any of the other species, and the second highest number of ABCs (166), but relatively low numbers of the OBPs and HSPs.

Differences in the Hymenoptera. The 13 hymenopterans available to analyse included five nectar-feeders, three omnivores, two fungus-feeders and one each of leaf-tissue, fruit and seed feeding species. The nectar feeders and the seed eater were considered restricted, and the rest polyphagous, yielding six feeding habit categories. Whilst overall the

Page | 73

Hymenoptera had fewer genes in seven of the eight families, even on normalised counts, than the other orders (the exception being the MFSs), they were also the most variable across species, with six of the families, namely the P450s, GSTs, UGTs, MFSs, OBPs and HSPs, showing significant differences among feeding habit categories (Table S4.3).

In the case of the P450s, GSTs and UGTs this was because the omnivores had relatively high gene numbers and the restricted nectar feeders and the seed eater relatively low numbers. As with the other orders above, the same trends were evident in the other two major detoxification families (i.e. the CCEs and ABC transporters) even though not statistically significant individually. By contrast the other three families (MFS, OBPs and

HSPs) showed very different patterns, being relatively numerous in the nectar feeders and to a lesser extent the seed eater, and, for the OBPs in particular, depauperate in the omnivores (Fig. 4.2, Table S4.3). The differences in the sums of the normalised numbers for the five detoxification gene families were also statistically significant (Fig. 4.2; Table

S4.3).

The trends in the numbers of inparalogues and the four specific lineages of detoxification families largely reflected the patterns for total gene counts, with some differences in the specific families reaching statistical significance (the CCEs now significant and the OBPs and HPSs not for the inparalogues and the P450s now significant for the specific detoxification lineages; Figs 4.2, 4. 3 & S4.4, Table S4.3).

Differences in the Condylognatha. The condylognathans analysed consisted of six hemipterans (one leaf tissue, one seed and four phloem/sap feeders) and the thysanopteran leaf tissue feeder Frankliniella occidentalis. Three of the phloem/sap sucking species were classified as restricted and one as polyphagous, the two leaf feeders as polyphagous and the seed feeder as restricted, giving a total of four feeding habit categories. There were significant differences in normalised numbers among the four categories for the

Page | 74

P450s, CCEs and ABC transporters, with the polyphagous species generally yielding higher values than the restricted host range species (Figs 4.2, 4.3 & S4.4, Table S4.3).

This trend was evident but less pronounced for the GSTs and to a lesser extent UGTs.

The difference among the categories were also significant for the sums of the normalised values across the five detoxification families (Figs 4.2, 4.3 & S4.4, Table S4.3). Notably the other three families (MFS, OBP and HSP) did not show this pattern (Figs 4.2, 4.3 &

S4.4, Table S4.3). The trends for the five detoxification families were also evident in the inparalogue data, although generally only statistically significant when the values for the five families were summed, and in the specific lineages, where it was statistically significant for P450s and CCEs (Figs 4.2, 4.3 & S4.4, Table S4.3).

4.4 Discussion

Our cross-order analysis clearly showed that some of the previous studies apparently supporting the hypothesis associating detoxification gene family size with polyphagy could have been heavily influenced by the small number of species compared, their location in different insect orders, and some overarching differences in overall gene content among the species. Nevertheless, after controlling for those other effects, our analysis still supported the hypothesis for three of the five major detoxification gene families, namely the P450s, CCEs and ABC transporters. However the data for the other two major detoxification gene families, the GSTs and UGTs, did not yield significant associations. Another family of transporters, the MFSs, which some studies suggest may have a role in detoxification, did not show the association either. Nor did the OBPs, which are involved in host finding, or the HSPs, which are involved in general stress responses.

Analyses confined to the specific lineages of the P450s, CCEs, GSTs and ABC transporters which have been most heavily implicated in detoxification functions in previous empirical studies found the association to be statistically significant for the

Page | 75

P450s, CCEs and ABC transporters, but not the GSTs. However analyses excluding these lineages also found some residual associations for the CCEs and in particular P450s.

There is indeed evidence for some detoxification functions for these other lineages. For example, aphid CCEs involved in metabolic resistance to organophosphate insecticides derive from lineages outside those formally considered to be dietary and detoxification clades (Devonshire et al., 1998; Oakeshott et al., 2005) while CYP4, CYP9 and CYP12

P450s have been implicated in organophosphate and pyrethroid resistance in addition to the CYP6s in certain Diptera and Lepidoptera (Faucon et al., 2015; Joußen et al., 2012;

Oakeshott et al., 2013).

Analyses confined to genes identified as inparalogues suggested that the associations observed in the various cross-order analyses above were largely due to gene gain rather than gene loss events. However the very uneven taxonomic distribution of the species sampled in the Diptera in particular meant that inparalogues were effectively defined differently across orders, and we are therefore cautious not to over-interpret these analyses.

All the cross-order analyses thus provided strong evidence for an association with polyphagy for the P450s, CCEs and ABC transporters. However the variety of tissue types on which the species from different orders feed had meant that a relatively coarse classification of host usage had to be used in those analyses in order to reduce the number of host use categories to a workable number. A complementary set of intra-order analyses was therefore carried out in which the effects of host species range could be considered against a background of a smaller range of more tightly defined host tissue types.

Three of the order-specific analyses yielded strong support for the hypothesised association. For the Lepidoptera the family-wide association was significant for the ABC

Page | 76

transporter family individually and for the sum of this and the P450, CCE, GST and UGT families collectively. For the Hymenoptera the association was significant for the P450,

GST and UGT families individually and for these plus the CCEs and ABC transporters collectively. For the Condylognatha the effect was significant for the P450s, CCEs and

ABC transporters individually and for these plus the GSTs and UGTs collectively. The lineage-specific analyses for the P450s, CCEs, GSTs and ABC transporters and the inparalogue analyses for all five of these families showed similar trends and were by and large significant where the corresponding family-wide associations had been.

The power of the equivalent analyses for the Diptera and Coleoptera was more heavily constrained by sampling issues, albeit different ones. In the case of the Diptera there was a relatively large number of species in the sample but the great majority (22 of 25) were drosophilids and most of these had essentially the same host use phenotype. In the case of the Coleoptera the few species available to compare (5) were scattered across very different host use phenotypes. Even so, the dipteran analysis showed the CCEs to be significantly more numerous among the polyphagous fresh fruit feeders than the species with other host use phenotypes, while the highly polyphagous pest L. decemlineata stood out in the coleopteran analysis for its high numbers of both CCEs and ABC transporters.

Overall the intra-order analyses support the evidence from the cross-order analyses that the sizes of the P450, CCE, and ABC transporter families are related to host range. The effects for the P450s and CCEs in particular can be quite large; the CYP6 P450 and dietary/detoxification CCE lineages are 10.5% larger in polyphagous vs restricted species in some orders, for an average of about 26 more genes across these two lineages alone

(Fig. 4.1). On the other hand the GSTs and UGTs provided little evidence for an association, the only analyses where these families showed a significant association being the Hymenoptera. While it is clear that some members of these families do carry out

Page | 77

detoxification functions (Sau et al., 2010; Shou-Min, 2012), our data suggest that changes in gene number are seldom selected for in these families in response to changes in host use. It may be that these families instead respond to selection in such circumstances with changes in gene regulation. Interestingly there are several cases in the literature where quantitative changes in GST activity or gene expression have been associated with insecticide resistance (Lumjuan et al., 2005; Shi et al., 2012; Xu et al., 2015) but no cases that we know of where they have been associated with gene duplication. By contrast there are several cases in aphids, mosquitoes and blowflies where CCE duplications have been causally linked to insecticide resistance (Field et al., 1988; Lenormand et al., 1998;

Newcomb et al., 2005)..

Given the associations we have found for the P450s, CCEs and ABC transporters, and also the validation they provide for the methodology we have used, the priority now is to expand substantially the number of genomes analysed so there is the sensitivity to interrogate the associations with both host species range and host tissue type more deeply.

Given the rapidly expanding number of insect genome sequences becoming available, such analyses should become possible in the near future.

Acknowledgements

We wish to thank Dr Alexandre Fournier-Level for useful discussion on gene family evolution and Dr Philip Larkin of CSIRO for useful discussion of plant defence compounds. In addition, we wish to acknowledge the teams behind the genomes sequenced as part of the i5k project. SR was supported by NHGRI grant U54 HG003273,

Richard Gibbs PI, funding the Baylor College of Medicine Human Genome Sequencing

Center. Other support was provided by the Australian Science and Industry Endowment

Fund.

Page | 78

Table 4.1: Results from multivariate and univariate ANOVAs testing the impact of host use and order on the number of genes overall and in the different families (p-value significance thresholds 0.05, 0.01 and 0.001 denoted by *, ** and ***)

Order Host use Order * Host use Error Partial Partial Partial Gene Comparison Eta Eta Eta Mean Family / df F-ratio df F-ratio df F-ratio df gene set (effect (effect (effect square statistic size) size) size) Wilks' 40 8.77*** NA 8 2.95** NA 32 1.25 NA NA NA Lambda 5 2.23 0.203 1 13.95** 0.24 4 0.33 0.03 44 2.376E-06 P450s

5 13.79*** 0.611 1 13.56** 0.23 4 0.35 0.03 44 5.86E-07 CCEs 5 3.17* 0.265 1 3.96 0.08 4 0.46 0.04 44 2.854E-07 GSTs 5 6.69*** 0.432 1 0.27 0.01 4 0.59 0.05 44 4.94E-07 UGTs

therespective families

5 4.66** 0.347 1 5.29* 0.11 4 1.95 0.15 44 1.553E-06 ABCs 5 23.76*** 0.73 1 2.4 0.05 4 1.37 0.11 44 2.896E-06 MFS’ 5 29.77*** 0.772 1 0.33 0.01 4 1.55 0.12 44 2.384E-07 OBPs 5 6.47*** 0.424 1 1.65 0.04 4 2.92* 0.21 44 1.493E-07 HSPs

Total numbergenes inof Wilks' 40 3.8*** NA 8 3.19** NA 32 1.23 NA NA NA Lambda 5 12.52*** 0.5 1 13.31** 0.19 4 0.92 0.17 44 8.62E-07 P450s

5 10.12*** 0.49 1 9.34** 0.11 4 0.81 0.08 44 4.367E-07 CCEs 5 6.44*** 0.38 1 5.81* 0.29 4 0.43 0.19 44 3.984E-08 GSTs 5 10.32*** 0.66 1 0.01 0.01 4 0.17 0.05 44 2.277E-07 UGTs 5 5.46** 0.48 1 7.7** 0.15 4 1.59 0.08 44 9.338E-07 ABCs 5 12.76*** 0.41 1 1.68 0.07 4 0.27 0.05 44 1.659E-06 MFS’ 5 4.98** 0.22 1 3.6 0.11 4 1.93 0.24 44 4.719E-08 OBPs 5 5.45** 0.39 1 2.99 0.08 4 1.86 0.13 44 7.673E-08 HSPs

Numberof duplicatedgenes in respective the families Wilks' 20 6.99*** NA 4 3.36** NA 16 1.09 NA NA NA Lambda P450s 5 1.69 0.16 1 6.39* 0.13 4 2.74* 0.2 44 8.452E-07

(Cyp6) CCEs 5 17.36*** 0.66 1 12.01** 0.21 4 2.11 0.16 44 3.173E-07 (dietary / detox)

familysubclass GSTs (delta, 5 8.18*** 0.48 1 0.23 0.01 4 0.49 0.04 44 1.796E-07 epsilon) 5 3.22* 0.27 1 7.95** 0.15 4 1.19 0.1 44 6.335E-07 ABCs (B, C)

Total numbergenes inof the gene Wilks' 20 7.02*** NA 4 4.06** NA 16 3.49*** NA NA NA

Lambda

detox

- 5 7.57*** 0.46 1 15.58*** 0.26 4 2.22 0.16 44 7.563E-07 P450s

5 2.62* 0.23 1 4.52* 0.09 4 1.11 0.09 44 1.662E-07 CCEs

5 3.56** 0.29 1 10.60** 0.19 4 3.03* 0.21 44 6.952E-08 GSTs

genes ingenes gene family 5 11.71*** 0.57 1 1.11 0.02 4 5.66** 0.34 44 3.454E-07 ABCs

Numberofremaining non

Page | 79

Multivariate analysis of variance (Manova) (Wilk’s Lambda) was used for comparing gene numbers and gene duplications across all nine families, with subclasses considered separately where these could be defined. Univariate Anovas were then undertaken on each family. Order and phagy were treated as fixed factors. Gene families: ABC/ABC- like transporters (ABC), carboxyl/cholinesterase (CCE), glutathione S-transferase

(GST), Heat shock proteins (HSP), Major facilitator superfamily of transporters (MFS), nicotinic acetyl-choline receptors (nACHR), Odorant binding proteins (OBP), cytochrome P450 (P450), UDP-glucuronosyltransferase (UGT).

Figures:

Figure 4.1: Distribution of normalised gene numbers in each order for each of the nine gene families for the host use types; restricted and polyphagous herbivores and polyphagous omnivores, each represented by a different colour.

Page | 80

Page | 81

Figure 4.2: Distribution of normalised total gene counts (green) and duplications (orange) in each of the eight gene families for diet and host use combinations in every order.

Page | 82

Gene families: ABC/ABC-like transporters (ABC), carboxyl/cholinesterase (CCE), glutathione S-transferase (GST), Heat shock proteins (HSP),

Major facilitator superfamily of transporters (MFS), Odorant binding proteins (OBP), cytochrome P450 (P450), UDP-glucuronosyltransferase

(UGT).

Page | 83

Figure 4.3: Distribution of total gene counts (green) and duplications (orange) in each of the nine gene families for diet and phagy combinations in every order.

Gene families: ABC/ABC-like transporters (ABC), carboxyl/cholinesterase (CCE), glutathione S-transferase (GST), Heat shock proteins (HSP), Major facilitator superfamily of transporters (MFS), Odorant binding proteins (OBP), cytochrome P450 (P450), UDP-glucuronosyltransferase (UGT).

Page | 84

CHAPTER 5

Comparative genomics and transcriptomics of Drosophila species in the repleta group: what type of changes drive lineage specific responses?

Rahul V. Ranea,b, Stephen L. Pearcea, Fang Lic, Chris Coppina, Michele Schifferb ,

Jennifer Shirriffsb, Carla Sgrod, Phillipa Griffinb, Goujie Zhangc,e, Siu Fai Leeb , Ary A.

Hoffmannb and John G. Oakeshotta a CSIRO, Clunies Ross St, (GPO Box 1700), Acton, ACT, 2601, Australia b Bio21 Institute, School of BioSciences, University of Melbourne, 30 Flemington Road,

Parkville, 3010, Australia c China National GeneBank, BGI-Shenzhen d School of Biological Sciences, Monash University, Melbourne, Vic., Australia e Centre for Social Evolution, Department of Biology, Universitetsparken 15, University of Copenhagen

Abstract

The repleta group of drosophilids includes cactophilic specialist species with high tolerances to climatic stresses and dietary generalists with variable levels of tolerance to those stresses. However, the genetic basis of the host and climatic adaptation in the group is yet to be elucidated. Here we analyse the genomes of five repleta species, three of them newly sequenced (Drosophila aldrichi, Drosophila hydei and Drosophila repleta) and two previously published (Drosophila mojavensis and Drosophila buzzatii, the latter reannotated). Using orthologue and inparalogue predictions across the five repleta group and eleven other Drosophila species, we observed significant gene gain in the repleta

85

group ancestral branch, driven by generalist species – while the cactophilic species had the most genes under positive selection. As expected from earlier studies, we also found reproductive genes to be downregulated in the cactophilic species – with the upregulation of metabolic genes involved in metabolic biological processes as well as catalytic and hydrolase activity. The abundance of inparalogues in generalists and positively selected genes in cactophiles with the enrichment of similar sets of functions suggested alternate genetic pathway leading to neofunctionalisation. Furthermore, we found a large degree of overlap between biological functions enriched in earlier studies on desiccation stress as well as host tolerance as in our evolutionary and heat stress studies – indicating a singular evolutionary direction in the ancestral repleta lineage – further diverging in the ecological groups.

5.1 Introduction

Identification of the genomic changes associated with adaptation to specific components of the environment requires genomic comparisons of multiple closely related species that have diverged for the phenotypes in question where genetic changes due to drift or other adaptations irrelevant to those phenotypes are minimal (Somero, 2010). To date, comparisons of this sort among species of insects have identified genomic changes associated with patterns of wing coloration in butterflies (Cong et al., 2016; Joron et al.,

2011a) , phenological changes in moths (Derks et al., 2015) and, using a subset of the data from the Drosophila 12 genomes project (Drosophila 12 Genomes et al., 2007), adaptation to an otherwise toxic host in Drosophila sechellia and Drosophila yakuba

(Dworkin & Jones, 2009; Shiao et al., 2015; Yassin et al., 2016). So far, such analyses have not considered adaptive responses to climate stress, even though the adaptability of poikilotherms such as insects to rising temperatures will be crucial to their ongoing survival under anthropogenically driven climate change (Franks & Hoffmann, 2012).

86

Drosophila is an ideal genus for such an analysis because its species occupy a range of climatic regions and have diverged in their response to environmental extremes

(Hoffmann et al., 2003a; Powell, 1997) that limit their distributions (Kellermann et al.,

2012a; Kellermann et al., 2012b; Overgaard et al., 2014). Phylogenetic analysis of stress responses in Drosophila points to phylogenetic conservation of thermal and desiccation responses suggesting that changes within lineages predispose flies to particular ecological habitats (Kellermann et al., 2012a; Kellermann et al., 2012b). One particularly promising group in this respect is the repleta group, which originated ~30 million years ago in the

Americas (Guillen et al., 2015). Many of the repleta, such as the mulleri subgroup species

Drosophila mojavensis, Drosophila buzzatii and Drosophila aldrichi, are desert-adapted and display extremely high heat, cold and desiccation tolerance (Kellermann et al., 2012a;

Kellermann et al., 2012b) but others, such as Drosophila hydei and Drosophila repleta, do not live in the desert and are much less tolerant of these three stresses (Kellermann et al., 2012a; Kellermann et al., 2012b; Kellermann et al., 2009). Notably also, while all the repleta are saprophagous (feed on rotting tissue) they also vary widely in their host and dietary preferences (Barker et al., 2013; Goñi et al., 2012; O’Grady & Markow, 2012); the desert species are dietary specialists that feed and breed on necrotic cactus tissue, whereas D. hydei and D. repleta are dietary generalists which will eat a wide range of rotting fruits and vegetables as well cacti, plus chicken or cow faeces (Fogleman &

Danielson, 2001; Goñi et al., 2012; O’Grady & Markow, 2012; Soto et al., 2014).

Some comparative genomic studies have been conducted on two repleta group species, the cactophilic D. mojavensis and D. buzzatii, which show similarly high tolerances to heat, cold and desiccation stress (Guillen et al., 2015; Matzkin, 2014; Matzkin et al.,

2011) but the former being much more restricted both geographically and in the range of cacti it will use than the latter (Oliveira et al., 2012). Comparisons between these two

87

species and with two other drosophilids outside the repleta group (D. virilis and D. grimshawi) showed expansions of gene families involved in proteolysis, sensory perception and gene regulation in the cactophilic species. The same study also found positive selection underpinning rapid evolution in genes in these species involved in gene regulation and the catabolism of some of the heterocyclic toxins found in the cacti

(Guillen et al., 2015). Transcriptomic comparisons of four D. mojavensis populations living on different hosts have also highlighted transcriptional changes in key metabolic and sensory pathways which might contribute to desiccation and/or host adaptation

(Matzkin et al., 2011; Matzkin & Markow, 2009; Rajpurohit et al., 2013). A constraint in these studies however has been the lack of comparable genomic or transcriptomic data for generalist species which are less climatically tolerant within the repleta group.

The present study presents the genomes of three additional repleta group species, one additional highly stress tolerant cactophile, D. aldrichi, and two less tolerant dietary generalists, D. repleta and D. hydei. We also reassemble and re-annotate the D. buzzatii genome. Comparative analyses among these four species plus D. mojavensis and between the five repleta species and previously published genomes from other Drosophila groups are then used to identify genetic factors contributing to heat shock and cactus vs generalist dietary adaptations. These analyses are founded on a robust genome-wide phylogeny for all 24 Drosophila species for which genomes are now available (Rane et al., in review).

Orthologue and duplication predictions are then used to identify lineage-specific gene expansions and bursts of positive selection in the repleta group. We also compare transcriptomes across a time course of heat shock response for D. hydei and D. buzzatii, and test whether gene sets showing divergent transcriptomic responses to heat between these species are related to those showing genomic divergence.

88

5.2 Materials and methods

Fly strains:

The D. hydei and D. repleta strains used for genome sequencing were collected from

Townsville, QLD, Australia and Wandin, VIC, Australia, respectively and then inbred in the laboratory for 17 generations of single pair full-sibling mating to reduce their heterozygosity (expected inbreeding co-efficient > 0.7 (Rumball et al., 1994)). The D. aldrichi strain sequenced was originally caught in Mexico in 2002 and was likely inbred to some degree while being maintained in the University of California San Diego

Drosophila Species Stock Center (stock number: 15081-1251.13). It was further inbred for two generations of sibling mating from a single pair of flies prior to sequencing.

Individuals for the mixed life stage transcriptome sequencing that we used to augment the annotations were obtained from the inbred D. hydei and D. repleta lines above and from a wild-caught mass bred line of D. aldrichi from Inglewood, QLD, Australia. RNA was prepared from a mixture of six life stages (embryos, first, second and third instar larvae, pupae and adults, combined in approximately equal weight amounts) for each species.

Recently collected strains of D. hydei and D. buzzatii were used in the thermal stress transcriptomics experiment. The offspring of 50 field females were pooled to establish a mass bred population for each species. The D. hydei females were collected from Pascoe

Vale, Melbourne, VIC and the D. buzzatii from Inglewood, QLD, Australia. The mass bred strains were maintained at 25 °C for one generation prior to the experiment.

Genome sequencing and assembly:

Sequencing:

89

Adult females from the inbred D. repleta, D. hydei and D. aldrichi strains were harvested and DNA extracted from heads to minimise contamination with DNA from their gut flora.

Six libraries were then prepared from the DNA from each species for sequencing on an

Illumina HiSeq2000 platform; three paired-end libraries spanning 250bp, 500bp and

800bp using 150bp or 100bp paired-end sequencing runs, plus three long insert mate-pair libraries with insert sizes of 2kbp, 5kbp and 10kbp using 49bp paired-end runs. All library preparation and sequencing were carried out at Beijing Genomics Institute (BGI),

Shenzhen. At least 35Gb of raw sequence data were obtained per species, yielding a final read coverage of ~210x of the estimated genome size.

Assembly:

The quality of the raw sequence data for each species was initially assessed using FastQC

(Andrews, 2010), plus K-mer counting with Jellyfish (Marcais & Kingsford, 2011) to identify novel contaminant sequences. Trimmomatic (Bolger et al., 2014) was then used to remove low quality reads and trim other reads to remove bases with Phred scores < 9 and contaminant and/or adaptor sequences attached to reads. Only paired reads where both sequences retaining >80% of the presumptive original size were retained for the genome assembly and the rest were marked as singletons for genome correction (see below). Approximately 5% and 10% of the paired-end (short insert) and mate-pair (long insert) raw data respectively were put aside during these quality control steps. The trimmed reads were then aligned to the published D. mojavensis genome (which BUSCO and assembly N50 stats indicated was better assembled than the published D. buzzatii genome; see Table 5.1 below) to re-estimate the insert size distributions for each library.

We generally observed a 10% reduction in insert sizes compared to the original size selection; this reduction was then incorporated into the assembler parameters below.

90

The trimmed reads for D. hydei and D. repleta were then corrected for sequence polymorphism using Quorum (Marçais et al., 2015), but BFC (Li, 2015) was used to assist in stringent heterozygosity-sensitive read correction for D. aldrichi (because higher levels of heterozygosity were expected for this species). Corrected reads were assembled with the MaSuRCA assembler (Zimin et al., 2013), which is based on but extends the Celera

CABOG assembler (Adams et al., 2000). K-mer values of 105bp and 37bp for paired-end and mate-pair data respectively were used as input parameters into MaSuRCA.

MaSuRCA first assembles super-reads by extending reads using best match or sequence overlap and then assembles contigs using CABOG. We modified this protocol to utilise only paired-end data for super-read assembly, but then used all the data for contig assembly. All contigs > 230bp (the observed insert size for our smallest short-insert library) were then joined into scaffolds using the super-reads and Illumina sequencing data as input for the SSPACE 2 scaffolder (Boetzer et al., 2011). We also allowed

SSPACE to extend ends of scaffolds using unused raw sequence data where possible.

Once again we restricted the minimum scaffold size retained to 230bp.

Local reassembly and gap filling:

The scaffolded D. hydei, D. repleta and D. aldrichi genomes were further improved by local realignment and gap-filling with multiple iterations of Pilon (Walker et al., 2014).

For this purpose, we used corrected reads, MaSuRCA super-reads and singletons that had been removed during raw sequence processing as input data. SNAP aligner (Zaharia et al., 2011) was used for accelerated alignment of these sequences to the scaffolded genome with only the best alignment of every sequence utilised as input for correction. For every genome, multiple iterations of Pilon were carried out until improvements affected less than 5% of base pairs in the previous iteration. Thereafter, all remaining reads were de novo assembled into new scaffolds. The final genome assemblies were then benchmarked

91

for quality assessment using the BUSCO pipeline (Simao et al., 2015) following the authors’ instructions.

Repeat sequence analysis:

Tandem repeats and transposable elements (TEs) in D. hydei, D. repleta, D. buzzatii and

D. aldrichi were identified following the strategy of Zhang et al. (2012). The programs

TRF (Benson, 1999) and RepeatMasker (Tarailo‐Graovac & Chen, 2009) (version 3.2.7) were used to characterise tandem repeats; RepeatMasker was used with the Repbase

(Jurka et al., 2005) database of known repeat sequences in D. melanogaster to identify repeats shared with the latter species; and previously undescribed repeats were identified using LTR_FINDER (Xu & Wang, 2007) (version 1.0.3), PILER (Edgar & Myers, 2005) and RepeatScout (Price et al., 2005) (version 1.05). Repeat proteins were also identified using RepeatProteinMask (version 3.2.2) as implemented in RepeatMasker. All the repeat sequences in each species identified by the different methods were combined into a final repeat library and categorized in a hierarchical way, as per Zhang et al. (2012).

Genome annotation:

Transcriptome sequencing for gene prediction:

We made RNA-Seq transcriptomes from mixed life stage RNA preparations of male and female D. repleta, D. aldrichi and D. hydei on the Illumina HiSeq 2000 platform to assist in gene annotations. The RNAs were prepared using the Zymo Direct-zol™ RNA extraction kits (Zymo Research, Irvine, USA). A standard RNA-Seq library was prepared for D. hydei, while strand-specific libraries were prepared for D. repleta and D. aldrichi according to Borodina et al. (2011). All three libraries were prepared by BGI following standard Illumina protocols. We obtained 4Gb of RNA-Seq data for D. hydei and 3Gb of

RNA-Seq data for D. repleta and D. aldrichi. These data were then filtered to remove

92

reads which had more than 10% of bases unknown, more than 40 bases with Phred scores less than 7, or >20% of the sequence comprising adaptor sequence. This left a high quality read set of ~2 Gb for D. hydei and D. repleta and ~3 Gb for D. aldrichi.

Annotation of gene models:

We created a novel genome annotation pipeline to identify, evaluate and collate protein coding and putative non-coding genes through use of the RNA-Seq data and homologue- guided and ab initio gene predictions. For each species, we first utilised the repeat annotations above to soft-mask the genome. For the homologue-guided predictions, we then assembled a dataset of 2,315 D. melanogaster genes (248 from CEGMA (Parra et al., 2007) and a further 2067 from OrthoDB7 (Waterhouse et al., 2011)) that have only

1:1 orthologues across at least 90% of the genomes so far catalogued. Splice- aware annotations of their orthologues in the newly assembled genomes were then constructed using the CEGMA (Parra et al., 2007) and Exonerate (Slater & Birney, 2005) pipelines and the published D. melanogaster and D. mojavensis proteomes (Slater &

Birney, 2005).

The splice junctions identified above using the 2,315 D. melanogaster genes were used to guide the first pass of a splice-aware two-pass alignment of the cleaned RNA-Seq data from D. hydei, D. repleta and D. aldrichi to their respective genomes with

GMAP/GSNAP (v 2014-08-20) (Wu & Nacu, 2010) and STAR aligner (v 2.4.0) (Dobin et al., 2013). The second pass repeated the process using updated splice junction databases from the first pass. The resultant alignments were then used to identify contiguous transcript sequences with StringTie (Pertea et al., 2015) and genome-guided

Trinity (Haas et al., 2013). We also identified transcripts using de novo assembly of RNA-

Seq data with Trinity (Haas et al., 2013). The de novo and genome-guided transcriptomes

93

were then collated using PASA2 (Haas et al., 2003) to create a comprehensive transcript database for each species and to identify complete RNA-Seq guided gene models with

Transdecoder (https://transdecoder.github.io/).

The Transdecoder gene models that had >80% sequence length and identity match to D. melanogaster genes (including the 2,315 genes above) were combined to prepare a golden training set (GTrainSet) for ab initio prediction of genes. The GTrainSet splice junctions were then augmented with the self-training Genemark-ET gene predictor (Lomsadze et al., 2014), after which the updated GTrainSet and gene, UTR and exon length parameters were used to train and run ab initio predictors in Glimmer (Majoros et al., 2004),

Augustus (v 3.0.2) (Stanke et al., 2006) and SNAP (Korf, 2004). The gene, UTR and exon length parameters were recalibrated for the GTrainSet genes after every ab initio prediction run. All de novo predictions from the four predictors were then screened to remove small genes (translated peptide < 35 amino acids).

All ten sets of gene models created above - two RNA-Seq based (PASA2/Transdecoder and StringTie), five ab initio (CEGMA, Genemark-ET, Glimmer, SNAP and Augustus), two homology-guided (D. melanogaster and D. mojavensis) and the GTrainSet - were finally passed to EvidenceModeler (Haas et al., 2008) to create a comprehensive whole genome annotation. Greatest weight was given to RNA-Seq (weight of 15) followed by

GTrainSet (12) and homology-based (11) evidence and least to the ab initio predictions

(2-4). The gene set collated by EvidenceModeler was then updated to annotate alternative splice events and 3’ and 5’ UTRs, and to refine gene boundaries using RNA-evidence in

PASA2. This set was then processed to add nested genes using the homology matches in

D. melanogaster and D. mojavensis, creating the official gene set (OGS).

94

The above pipeline was also used to re-annotate the published D. buzzatii genome, which we found was lacking many otherwise conserved gene models in the first run of our orthology analysis (see below). In the absence of a mixed life stage transcriptome for this species, we used our own stress transcriptome data (detailed below) for this re-annotation.

Comparative genomics:

Classification of orthologues:

The Orthonome pipeline (http://www.orthonome.com, (Rane et al., in review; Rane et al., 2016)) was applied to sets of 1:1 orthologues and orthogroups (groups of 1:1 orthologues across multiple species containing no inparalogues) extracted from the OGSs for the five repleta species and D. melanogaster. The pipeline also identified gene birth events and inparalogues and clustered the orthogroups into Super Orthogroups (SoGs,

(Rane et al., in review)) based on shared protein domains. Only the longest isoform for each gene in each genome was used in this analysis.

Three Orthonome analyses were carried out, the first for calculating the phylogeny and the remaining two for comparative genomics. These were (i) one using the twelve

FlyBase genomes (McQuilton et al., 2012),eight modENCODE genomes

(https://www.hgsc.bcm.edu/arthropods/drosophila-modencode-project) and our newly annotated D. aldrichi, D. buzzatii, D. hydei and D. repleta genomes, (ii) one using this set minus the eight lesser quality modENCODE genomes, and (iii) one just using the five repleta species and D. melanogaster.

Species phylogeny:

The orthologues obtained from analysis (i) above led to the identification of 4,935 orthogroups. The orthogroups were then filtered to keep only those that satisfied the

95

following criteria: (1) relatively slowly evolving, with no genes presenting a Ka/Ks > 1

(see below for calculation method), (2) >200 amino acids in length, and (3) all genes within a 1.5 median absolute deviation from the median length of the orthogroup. The nucleotide coding sequences from the 1,802 orthogroups retained were used to construct a species phylogeny using FastTree2 (Price et al., 2010) with 1,000 internal boot-strap replicates (Price et al., 2010). Divergence time was estimated with MEGA7 (Rel-Time)

(Kumar et al., 2016) based on calibration obtained according to Obbard et al. (2012).

Functional and gene group annotations:

We first augmented the functional annotations for D. melanogaster genes obtained from

FlyBase using the latest release of InterProScan 5 (January 2016) (Jones et al., 2014).

These updated annotations were transferred to D. aldrichi, D. buzzatii, D. hydei and D. repleta based on orthologous relationships estimated using Orthonome (see above).

Functional annotation of the remaining inparalogous genes was carried out by identifying the phylogenetically closest orthologue or inparalogue from D. melanogaster while de novo annotation was carried out for orphan genes using InterProScan 5. Gene interaction networks were also transferred from D. melanogaster using orthology (Murali et al.,

2011). FlyBase Gene Group annotations were transferred to the repleta group species in a similar fashion.

Functional enrichment analyses:

Functional enrichment of ontologies was carried out using a semantic similarity and hierarchy-based clustering of gene ontologies identified as enriched by Fisher’s exact test

(P<0.05, corrected by FDR). We first clustered the 47,676 gene ontology terms (GO,

Gene ontology consortium) into 43 mutually exclusive GO term sets using Louvain clustering method (Blondel et al., 2008) by allocating semantic similarity as edge weights

96

between connected hierarchical GO terms. These sets were further partitioned into 432 subsets. For the purpose of this study, we used 23 sets (Table S5.5) of gene ontologies representing biological processes and their subsequent subsets (Table S5.6). For each comparison, gene lists were first enriched based on frequency with Goatools

(https://github.com/tanghaibao/goatools). These significantly enriched ontology terms were then compressed using REVIGO (Supek et al., 2011) and grouped into sets according to aforementioned partitions.

Evolutionary rate analyses:

Branch-site model (BSM) tests (Valle et al., 2014) were carried out to identify specific lineages in which orthologues identified by Orthonome may have been under positive selection. This analysis was carried out using all the FlyBase genomes and the four repleta group genomes annotated in the current study. We assumed fixed branch lengths based on the phylogeny constructed above and calculations were carried out using FastCodeML

(Valle et al., 2014), with a maximum χ2 P value of 0.05. Each branch in the repleta lineage was analysed and FDR corrected P-value cut-off of < 0.05 was used.

Comparative stress transcriptomics:

Thermal stress tolerance transcriptomes for D. hydei and D. buzzatii:

Mass-bred populations initiated from recently-collected flies were used for these experiments to minimise issues with changes in gene expression during laboratory adaptation. We also conducted pilot studies on these mass bred populations to identify suitable testing temperatures. This was done by exposing ten replicates of ten adult females each to a range of high temperatures for 60 minutes and then returned them to

25°C, recording their survival through the stress and for the following 24 hrs, so that sub- lethal temperatures could be identified. D. hydei was tested from 35°C to 39°C and D.

97

buzzatii was tested from 37°C to 42°C. The stress assays were eventually conducted at

37°C for D. hydei and 39.5°C for D. buzzatii, the difference reflecting their variation in thermal tolerance (Kellermann et al., 2012b).

For the stress assays, three replicates of ten adult females each were harvested at each time point indicated in Fig. S5.1 and snap-frozen in liquid nitrogen. One time-point was immediately frozen prior to exposure, one half way through the exposure, one immediately after the exposure and five at defined intervals over the next 24 hrs. The snap-frozen material was stored at -80°C until RNA extraction (pooling all ten flies in each replicate) using the Zymo Direct-zol™ RNA extraction kits (Zymo Research, Irvine,

USA). RNA-Seq libraries were prepared and sequenced with the Illumina HiSeq 2000 platform as above. In order to avoid batch effects, all 48 libraries (8 time points and 3 replicates for each species) were run on the same instrument in the same flow cell.

The sequence reads for each species were filtered for low quality and adapter sequences were trimmed with Trimmomatic (Bolger et al., 2014) as above. The cleaned reads were then aligned to their respective genomes using a two-pass strategy in STAR aligner (v

2.4.2a) (Dobin et al., 2013). We also used the GeneCounts mode in STAR (parameters: -

-outSAMattributes NH HI AS nM NM MD XS --outFilterScoreMinOverLread 0.33 -- outFilterMatchNminOverLread 0.33) to output gene counts similar to those generated by the featureCounts package (Liao et al., 2014). These outputs were combined to create an expression value matrix for each gene across all eight time points as input for subsequent analyses.

Read counts were normalised across samples using the trimmed mean of M-values method (Robinson & Oshlack, 2010) and converted to log2 counts per million values

(log2cpm) with associated quality weights using the voom-limma pipeline (Law et al.,

98

2014). Batch effects due to replicate samples were corrected with ComBat (Johnson et al., 2007) against an intercept only model (Fig. S5.4). Gene expression was then modelled in limma (Law et al., 2014) with a simple factorial model. Genes with a significant difference in expression relative to the control time point (false discovery rate adjusted P value less than 0.05 and a log2 fold-change in expression greater than 1) in at least two of the subsequent time points were considered to be heat responsive.

To identify common patterns of expression among differentially expressed genes, the expression of each gene across the different time points was standardised to have a mean of 0 and standard deviation of 1. These standardised expression values were clustered into eight groups using fuzzy c-means clustering (Bezdek et al., 1984) with a membership exponent value of 1.2. This clustering approach assigns each gene a membership value for each cluster (from 0 to 1, summing to 1 for each gene) indicating how well the expression of the gene corresponds to the mean expression of the cluster. The clusters obtained for each species were then synchronised with a second fuzzy c-means clustering which used the earlier predictions as seeds for higher quality classifications, producing the number of clusters with the greatest fuzzy partitioning coefficient. The hard clustering assignment of each gene (cluster with the highest membership value) was used to group genes for plotting to visualise the cluster expression patterns. The core genes of each cluster (membership value greater than [0.5]) were used for functional enrichment analysis with GoaTools (https://github.com/tanghaibao/goatools).

Finally, to compare the expression of Orthonome-defined D. hydei and D. buzzatii orthologues, we constructed a 20 x 20 super self-organising map (SOM) (Kohonen, 1998) from the 3071 1:1 orthologue pairs in the gene coexpression network. Hierarchical clustering of the vectors from this SOM was used to define eight patterns of orthologue expression.

99

Identification of candidate genes for heat and desiccation:

Candidate genes for heat and desiccation were identified from earlier studies catalogued on www.adeer.pearg.com. Only genes with functional evidence from at least one genome-wide association study or functional study were maintained for analyses in the current study.

5.3 Results and Discussion

5.3.1 Genome assemblies:

The highly inbred D. hydei and D. repleta lines had better assembly statistics than the more heterozygous D. aldrichi line. This is apparent from the smaller scaffold L50s for

D. hydei and D. repleta than that for D. aldrichi (20 and 14 cf 53 respectively; Table

S5.1). The longest scaffolds in the former two were also longer than in D. aldrichi (~14 and ~10 cf ~4Mb respectively).

The generalist feeders D. hydei and D. repleta yielded assembled genome sizes of

~165Mb, which is very close to previous estimates using DAPI staining (177+/-22 and

167+/-13Mb respectively (Bosco et al., 2007)). No DAPI estimate has been published for the cactophilic D. aldrichi but our assembled genome size (191Mb) for this species was larger than the two generalists but similar to that of the cactophilic D. mojavensis

(194Mb; FlyBase.org), which itself was corroborated by a DAPI estimate (183+/-3Mb

(Bosco et al., 2007)). Notably the published assembly of the other cactophilic species in our analysis, D. buzzatii, was only estimated at 161Mb, most likely due to significant underestimation of both repeat and gene content during genome assembly ((Rius et al.,

2016) and see below).

The repeat contents of the various genomes were highly variable across the five assemblies. At the extremes, D. buzzatii had only 8.4% total repeat content while D.

100

aldrichi had 24.5% (Table S5.2), and the range was wide even between the two generalist species – 12.0% in D. hydei and 23.9% in D. repleta (Table S5.2). The D. buzzatii estimate however may in part reflect the lower assembly quality for this species (Table

S5.2).

5.3.2 Genome annotations:

We identified 15,838, 14,790 and 16,070 genes respectively in D. hydei, D. repleta and

D. aldrichi, most of which have annotated UTRs. An average of 1,500 genes in each species were annotated as having alternate splice events. Our gene numbers for these three species are higher than the 14,680 and 13,919 published for D. mojavensis and D. melanogaster respectively, and higher again than the 13,158 published for D. buzzatii.

We therefore applied our pipeline to D. buzzatii, resulting in the identification of 1,374 additional genes and ~2,600 alternate splice events (Table S5.3), and bringing total gene numbers up to 14,532, at the lower end of the range found for the other repleta group species. Scans for members of the conserved CEGMA and BUSCO gene sets both suggested that our new D. buzzatii annotation was still missing about 10% of genes, compared with less than 3% for the other species (data not shown). However, accounting for these missing data would still leave D. buzzatii with similar gene numbers to D. hydei, suggesting no consistent difference in gene number between the cactophilic and non- cactophilic species.

We also completed a functional annotation for 80-90% of the genes recovered in each species using Interproscan (v5.16-55) (Jones et al., 2014), as well as an improved Gene

Ontology (GO) annotation for these genes (based on orthology with or phylogenetic proximity to D. melanogaster genes) identified as species-specific duplications

(inparalogues) using the Orthonome pipeline ((Rane et al., in review) and see below).

101

This allowed us to place 87% - 91% of all genes in various GO categories and KEGG

(Kanehisa, 2013) or PANTHER (Mi et al., 2013) biochemical pathways.

5.3.3 Phylogenomics of the Drosophila genus:

FastTree2 (Price et al., 2010) was used to obtain a species phylogeny from the nucleotide coding sequences of 1,802 orthologues present in all 20 FlyBase and modENCODE genomes plus the four additional repleta group genomes assembled here. The well resolved tree obtained (100% bootstrap support for all branches in Fig. 5.1) confirmed earlier phylogenies for the repleta group (Durando et al., 2000). Specifically it partitioned the cactophilic species into the mojavensis (D. aldrichi and D. mojavensis) and buzzatii clades, with D. hydei the sister group for all three of those species and D. repleta the sister for all four of the other species.

We also found that our genome-based tree obtained better resolution for two sections of the tree for the Sophophoran subgenus than had earlier studies (Seetharam & Stuart,

2013). Specifically, the phylogeny indicates D. elegans and D. rhopaloa to be a pair of sister species and D. ficusphila to be the sister clade to all sophophorans in our sample except the montium and ananassae subgroups (as opposed to just D. elegans and D. rhopaloa (Seetharam & Stuart, 2013)). In addition, our phylogeny showed the clade containing the takahashii and suzuki subgroups to be the sister clade to that containing the eugracilis and melanogaster subgroups with more confidence than previously possible.

Fig. 5.1: Phylogenetic relationships of the five repleta group species to earlier sequenced

Drosophila species based on 1,802 orthogroups shared by all species. All bootstrap values for nodes from 1000 iterations were equal to 1 and the ancestral repleta group branch is indicated with a hash.

102

5.3.4 Comparative genomic analysis and lineage specific gene expansions:

Application of Orthonome to the 12 FlyBase species plus the four additional repleta species identified orthologues for 11,780 - 13,883 genes in each of the 16 species, with an average of 13,109 1:1 orthologues identified between any two species (Table 5.1,

Table S5.4). These orthologues were distributed across 15,907 orthogroups (sets of orthologues found in at least two species), of which 7,359 had 1:1 orthologues across all

16 species. As might be expected, some of the differences in orthologue number were associated with differences in genome and annotation quality. For example, D. buzzatii, which we showed above had more missing genes than the other species, even after our re-annotation, exhibited relatively low numbers of orthologues. Conversely, there were fewer inparalogues in the better-curated genomes (D. melanogaster and D. mojavensis).

Some other differences probably reflected the respective phylogenetic distances; for

103

example, there were fewer orthologues in the more distant and diverged Drosophila subgenus compared to the subgenus Sophophora. However, some of the patterns in the data were not explicable by either assembly/annotation quality or phylogenetic relationships. In particular, despite their phylogenetic proximity, D. repleta and D. hydei shared fewer than expected orthologues with each other (11,252) when compared to similarly related Sophophora species (c.f. Table S5.4). D. repleta and D. hydei share only marginally more than either species shared with the relatively distantly related D. melanogaster (10,952 and 10,776 respectively).

Table 5.1: Summary of Orthonome analysis for D. melanogaster and five repleta group species. Average numbers of one-to-one orthologues are calculated based on pairwise orthologue predictions whereas total genes with orthologues and species-specific duplications (inparalogues) are evaluated using the phylogeny-sensitive Orthonome analysis of all genomes.

Species Number of Mean Total genes Species specific duplications genes number of with All Repleta one-to-one orthologues orthogroups group orthologues specific orthogroups D. aldrichi 16,070 11294 12469 2141 100 D. mojavensis 14,680 11669 13147 948 69 D. buzzatii (improvement 14,532 10745 11780 2161 173 compared to original (1,374) (485) annotation) D. hydei 15,838 11096 12523 2625 514 D. repleta 14,790 11222 12588 1902 224 D. melanogaster 13,919 10894 12992 688 NA

Some 971 of the 15,907 orthogroups were specific to the repleta group and 265 of these

971 were allocated to the ancestral branch of the group. Functional enrichment analyses indicated that these latter 265 were enriched (FDR adjusted FET P-value < 0.05) for three out of 23 possible Revigo-clustered sets of GO terms (Fig. S5.2 and see Materials and

104

Methods), namely sets 9 (energy derivation/oxidation), 11 (miscellaneous biological processes) and 40 (nucleotide sugars metabolism).Totals of 512 and 194 of the repleta- specific orthogroups were specific to the generalist and cactophilic lineages respectively

(i.e. only occurred in both generalist species or any two of the three cactophiles respectively). Neither the 512 nor the 194 were significantly enriched for any of the GO sets.

In addition to orthogroups, the Orthonome analysis also identified between 948 (D. mojavensis) and 2,625 (D. hydei) species-specific duplications (i.e. inparalogues) in the five repleta species (Table 5.1). Notably, those in the two generalists were highly correlated with the orthogroups arising in the ancestral lineage, whereas the corresponding correlation for the cactophiles was not significant (Table 5.2). Thus, the super-orthogroups (SoG’s: orthogroups clustered into sequence-similar families) arising in the ancestral repleta lineage often continued to generate new genes in the lineages that remained generalist feeders, whereas the new genes arising in the species with the derived cactophilic feeding habit were more likely to arise in different orthogroups. Interestingly the new genes in the cactophilics but not the generalists tended to be more dispersed across orthogroups (as indicated by the association between the numbers of inparalogues and super-orthogroups in Table 5.3) than might be expected. This further supports the notion that, in the process of adapting to the new cactophilic life style, the cactophilic species may have been exploiting genetic diversity through duplication of genes involved in a wide range of functions.

Table 5.2: Contingency tables testing for enrichment of gene expansion events in lineage specific super orthogroups. Expected values for each cell are indicated in parentheses.

105

Orthonome

classification Cactophilic specific super orthogroups

Yes No Total

Orthologue 433 (419.34) 62074 (62087.65) 62507

Duplication 52 (65.65) 9735 (9721.34) 9787

Total 485 71809 72294

χ2 = 3.07, df = 1, P-value = 0.080

Orthonome

classification Generalist specific super orthogroups

Yes No Total

Orthologue 9114 (10440.31) 53393 (52066.68) 62507

Duplication 2961 (1634.68) 6826 (8152.31) 9787

Total 12075 60219 72294

χ2 = 1493.04, df = 1, P-value < 0.000001***

106

Figure 5.2: Overlap between gene duplications in each repleta group species after annotating each duplication with the phylogenetically most similar orthogroup. Each species name also includes in parentheses: (D) indicating duplications and [C] indicating number of thermal tolerance candidate genes found within the orthogroups.

107

There was considerable overlap among the orthogroups generating inparalogues in the three cactophilic species, as was the case between the two generalist species, with significantly less overlap between species in the two different ecological groupings (Fig.

5.2). Thus, 50 orthogroups generated inparalogues in all three cactophiles but neither of the generalists, and 45 did so in both generalists but none of the cactophiles. Furthermore, for orthogroups generating inparalogues in two or more species, at least half of these orthogroups in each species also did so in another species in the same ecological grouping.

There was thus considerable commonality in the orthogroups generating duplications in the different species in each ecological grouping.

Table 5.3: Association between mean number of inparalogues in generalist and cactophilic lineages and the number of SoG’s accounting for these inparalogues.

Lineage Mean number of Number of SoG’s with

inparalogues inparalogues

Cactophilic 1750 (2073.78) 2442 (2118.22)

Generalist 2263.5 (1939.22) 1657 (1980.78)

χ2 = 206.39, df = 1, P-value < 0.00000001

Functional enrichment outlined in Fig. 5.8 below revealed several sets of GO terms that were significantly enriched in various combinations of the five species (Fig. S5.3). These are discussed in detail in relation to Fig. 5.8 below but set 40 (Nucleotide sugars metabolism) is of particular relevance here as the only one that consistently separated the cactophilic and generalists groupings of species. This set, which had also been enriched among the orthogroups in the ancestral repleta lineage above, was enriched in both the generalist species but none of the cactophilic species.

108

Nearly half (236) of the list of 558 candidate climate stress genes we had complied from previous publications on the genetics of heat or desiccation stress resistance in Drosophila

(very largely D. melanogaster; see Materials and Methods and Table S5.7) were found to be included among the orthogroups generating duplicates in one or more of the five species (Fig. 5.3). This represents a significant enrichment of these candidates among the orthogroups giving rise to duplicates (χ2 = 19.57, df = 1, P-value = 0.000010). However there was no obvious difference in the frequencies of the candidate genes between the two species groupings, even though the cactophilics were expected to have been strongly selected for increased levels of tolerance to higher temperatures and drier environments in at least some parts of their life cycle. This suggests that while the candidates are genes that show relatively higher rates of duplication generally, they are not specifically duplicating in response to extreme climatic conditions, but might also be responding to other selective factors like hosts.

5.3.5 Genes under positive selection in the repleta group:

Branch-site model (BSM) testing of 7,359 orthogroups with 1:1 orthologues in each of the 12 FlyBase species and the four additional repleta species found just 10 orthogroups displaying positive selection at the ancestral repleta branch but 56 did so in the ancestral cactophilic branch. Furthermore, 69 - 80 did so at the terminal D. hydei (76), D. repleta

(69), D. aldrichi (68) and D. buzzatii (80) branches, with the greatest number in the terminal D. mojavensis (121) branch. Thus D. mojavensis, which was the most specialist of the species and had the fewest lineage-specific duplications, also had the most lineage- specific instances of positive selection. While its specialisation to columnar cacti

(Matzkin, 2012) apparently required relatively few new genes generated by duplications, it did require directional selection on a relatively large number of existing genes.

109

Despite the greater number of positively selected genes in D. mojavensis, there was considerable overlap among the positively selected genes in the three cactophilic species, as was the case between the two generalist species, with significantly less overlap between species in the two different ecological groupings (Fig. 5.3). Most striking are the large numbers of genes that were positively selected in both members of three species pairs, namely D. mojavensis with each of D. aldrichi and D. buzzatii and D. repleta with

D. hydei. There was thus considerable commonality in the positively selected genes in the different species in each ecological grouping, as had been observed in the orthogroups generating inparalogues. A remarkable contrast between the two analyses however was the abundance of unique genes generating inparalogues (no overlaps observed) in each species whereas nearly all the genes under positive selection had overlapping occurrences between at least two of the five species (especially, as noted, between species from the same ecological grouping).

110

Figure 5.3: Overlap between orthogroups belonging to genes under positive selection in each repleta group species. Each species name also includes in parentheses: (+) indicating duplications and [C] indicating number of thermal tolerance candidate genes found within the orthogroups.

111

Notwithstanding that both the analyses of duplicates and positively selected genes show considerably more commonality between species with an ecological grouping than across the two groupings, there was a very limited overlap between the orthogroups generating inparalogues in any species and those that contained positively selected genes (Table 5.4).

One caveat here is that the BSM analysis required the availability of all 16 species in the orthogroup for analysis, so this comparison excluded the orthogroups that were repleta- specific. It appears that inparalogue formation and directional selection do not show the association that might be expected from many current theories linking duplication and neofunctionalisation in adaptive evolution (Teshima & Innan, 2008). However, it is not clear if the genes that have been considered in this analysis are representative of the lineage specific groups.

Table 5.4: Contingency tables for estimation of enrichment of orthogroups under positive selection in orthogroups undergoing duplications. Expected values for each cell are indicated in parentheses.

Does orthogroup Cactophilics: orthogroup under positive selection

produce inparalogues

Yes No

Yes 18 (12.74) 3267 (3272.25) 3285

No 38 (43.25) 11115 (11109.74) 11153

Total 56 14382 14438

Chisq = 2.31, df = 1, P-value = 0.128

Generalists: orthogroup under positive selection

Yes No

Yes 123 (129.82) 2063 (2056.17) 2186

112

No 679 (672.17) 10639 (10645.82) 11318

Total 802 12702 13504

Chisq = 0.39, df = 1, P-value = 0.532

The only significantly enriched sets of GO terms among the positively selected genes in any of the branches were set 3: Metabolite and ion binding (specifically GO:0043035 : chromatin insulator sequence binding) and set 10: Receptor regulator activity

(GO:0001094: TFIID-class transcription factor binding), both in the terminal D. mojavensis branch. However, the smaller number of positively selected genes in many of the other branches limited the power to detect significant enrichments in those branches.

The only gene in our list of candidate climate stress genes which we found among the positively selected genes in any of the branches was CG7130 (in D. mojavensis) which is a Heat shock protein 40/DNAJ co-chaperone. Interestingly this gene was also duplicated in D. buzzatii. Apart from this one gene, the lack of candidates among the positively selected genes contrasts markedly with the enrichment of these candidate genes across all five species in the inparalogue analysis above. We also found respectively 45 and 4 candidate climate stress genes in the generalist and cactophilic orthogroups under positive selection.

5.3.6 Transcriptomic analysis of thermal tolerance in D. buzzatii and D. hydei

There were 9,109 and 9,461 genes from D. hydei and D. buzzatii respectively (6,956 of which are pairwise orthologues) with a total of more than 50 RNA-Seq reads in our transcriptomic analyses of heat stress response in these two species. Of these, 1,031 and

993 respectively were differentially expressed (DE) (FDR adjusted p-value < 0.05, log2 fold change in expression > 1 in at least two time points) relative to their pre-exposure

113

level of expression. The great majority of the DE genes showed changed expression in the initial time points during the heat stress, with more showing up- than downregulation in each species in this time period (Figs 5.4, 5.5 and S5.4). However the ratio of up- to downregulated genes was much higher in D. buzzatii than D. hydei (917/76 cf 737:295 respectively). Analysis of the GO terms enriched in the genes up- and downregulated in response to thermal stress of each species also bore out the differences between the species in their transcriptional responses to the same (Fig. 5.4). We found the upregulated genes in both D. hydei and D. buzzatii were enriched for Sets 0 (Response to stimulus and stress), 3 (Metabolite and ion binding) and 11 (Miscellaneous biological processes).

Additionally upregulated genes in D. hydei were also enriched for Sets 5 (Cell fate determination), 6 (Development) and 10 (Receptor regulator activity), while D. buzzatii also displayed upregulation of genes enriched in 19 (Catalytic activity), 29

(Monooxygenase activity), 37 (Hydrolase activity) and 40 (Nucleotide sugars metabolism). Concurrently, the downregulated DE genes in D. hydei were enriched for

Sets 3 (Metabolite and ion binding) and 40 (Nucleotide sugars metabolism) while downregulated D. buzzatii genes were enriched for Sets 4 (Cell component organization or biogenesis) and 5 (Cell fate determination).

114

Figure 5.4: Gene ontology clusters enriched in the differentially expressed genes of D. hydei and D. buzzatii after being split into upregulated (A) or downregulated (B) groups in response to stress. Numbers at the end of the bars denote the number of enriched genes related to the function in the test set versus the background set.

115

There was a wide range of transcriptional responses among the DE genes in both species in the post-treatment recovery period, so we used fuzzy c-means clustering of their expression profiles across all eight time points to organize them into eight clusters, representing eight major patterns of response (Fig. 5.5). Amongst the initially upregulated genes, one cluster showed a steady return to pre-exposure levels through the course of the recovery period, three clusters continued to rise for part of it but then steadied or fell away, one continued to rise throughout the recovery period, and one showed a bimodal response, declining slightly early in recovery but then rising again. One cluster of initially downregulated genes remained relatively steady through the first part of the recovery phase before returning to pre-exposure levels, while the other declined further through the first part of the recovery phase but then also largely retuned to pre-exposure levels.

Both species had genes in each cluster but the numbers of genes in most patterns varied considerably between the two species. D. hydei genes were far more numerous than D. buzzatii genes in both the cluster of initially downregulated genes but the ratios of the two species’ genes in the six clusters of initially upregulated genes was very variable. In particular D. buzzatii genes clearly outnumbered D. hydei genes in three patterns showing relatively large recovery phase change, namely those involving a full return to pre- exposure levels, ongoing upregulation and the bimodal response. Conversely D. hydei genes were over-represented in two patterns showing little change or some rise early in the recovery phase but then some drop away.

Figure 5.5: Clustering of genes that were differentially expressed in D. hydei and D. buzzatii and subsequent gene ontology enrichment (FDR < 0.05). Significantly enriched terms are indicated in the cluster plots with shaded area indicating timepoints under heat stress conditions 0 mins to 59 mins with delta being the midpoint of the stress exposure.

The expression values are scaled and qualitative in nature.

116

117

As was the case with the exposure period results, an analysis of GO term enrichment among the clusters bore out the evidence from the raw numbers of DE genes for the very different transcriptional responses of the two species to the heat shock (Fig. 5.5). Genes corresponding to Sets 3 (Metabolite and ion binding) and 40 (Nucleotide sugars metabolism) were enriched in a cluster representing initial downregulation followed by late recovery (Expression cluster 8) in D. hydei, but in one displaying protracted upregulation and late recovery in D. buzzatii (Expression cluster 5). Furthermore, genes representing Set 11 (Miscellaneous biological processes) were enriched in multiple expression clusters displaying initial or protracted upregulation in D. hydei (Clusters 1, 3 and 6) but not in D. buzzatii. Contrastingly, enrichment of genes involved in Sets 13

(Primary metabolism), 19 (Catalytic activity) and 37 (Hydrolase activity) was observed in expression clusters displaying early response to stress but a variable recovery response

(Clusters 3, 4 and 6) with only genes involved in primary metabolism returning to pre- stress levels. The only major commonality in the expression cluster-specific enrichments for the two species involved genes contributing to Sets 4 (Cell component organization or biogenesis), 5 (Cell fate determination) and 6 (Development) in both species in the expression cluster 1 showing initial upregulation during exposure but a return to basal levels at 24 hours.

Given the species differences in transcriptional responses evident in the above analyses, it was of interest to directly compare the transcriptional behaviours of orthologous pairs of genes from the two species. While 825 of the 1,031 DE genes in D. hydei and 704 of the 993 DE genes in D. buzzatii had orthologues in the other species, only 123 of the pairs of orthologues showed DE in both species. The distribution of these 123 across the expression clusters above also differed between the species (data not shown and contingency test p-value < 0.05). To compare the expression of orthologues between the

118

two species more comprehensively, we constructed a Super Self-Organising Map

(SSOM) (Wehrens & Buydens, 2007) from 5778 pairwise orthologues with an average log2cpm expression greater than 1 in each species (i.e. a less stringent inclusion criterion than that used for DE above; Fig. S5.4). These expression values were mean-centered separately for each species prior to SSOM construction so that only the changes in expression for each species were considered.

Hierarchical clustering of the codebook vectors of each node in the SSOM identified seven major patterns of orthologue expression across the two species (Figs 5.6, S5.5 and

S5.6). One of these (SSOM cluster 1) comprised 17 genes whose expression did not change through the course of the experiment in either species (Fig. S5.6). However the other six SSOM clusters all showed changes over time and, for SSOM clusters 2, 4, 6 and

7 in particular, between the species as well (Fig. 5.6. In particular we note SSOM cluster

2 which showed a much stronger upregulatory peak in mid-phase recovery in D. hydei was enriched for Set 0 (Subset 0_1: response to stress). Concurrently, SSOM cluster 4, which contained the most genes (51) and showed a much stronger upregulatory peak in mid-phase recovery in D. buzzatii, was enriched for Subset 0_4 (detoxification) and Set

6 (development) genes, including Frost (which has been implicated in thermal tolerance in D. melanogaster (Colinet et al., 2010)). SSOM clusters 5, 6 and 7, which were all enriched for Set 5 (Cell fate determination) and 6 (Development) genes all showed time- bound lower expression in D. hydei than D. buzzatii, although they differed in the details of the difference.

Figure 5.6: Species-specific expression of genes falling into SSOM clusters 2-7. SSOM’s were constructed using 129 orthologous genes from D. hydei and D. buzzatii that were differentially expressed under stress in both species.

119

5.3.7 Overlap between lineage specific gene expansion, positive selection and transcriptional response

The great majority of the DE genes in D. hydei and/or D. buzzatii had not been implicated in any of our previous analyses (Fig. 5.7). This likely reflects the very large number of genes involved in some way or another with these responses, and the fact that many were probably only involved in indirect ways rather than being causally connected to the tolerance phenotype.

Nevertheless, the following two observations are noteworthy. First, the greatest overlap of differentially expressed genes in both D. hydei and D. buzzatii was with genes under positive selection in the generalist species (Fig. 5.7). A total of 71 and 74 of the 1,031 and

993 differentially expressed genes in D. hydei and D. buzzatii respectively were also under positive selection in both the generalist species (Fisher exact test: p-value < 0.05)

120

but not the three cactophilic species (Fig. 5.7). These genes were enriched for developmental functions (set 6) in both the species, but were also enriched in D. hydei for regulatory functions in set 0 (Response to stimulus and stress). The D. buzzatii genes were also enriched for carbohydrate and lipid metabolism, both a part of set 13 (Primary metabolism), as well as sensory perception of taste (set 0: response to chemical).

Second, a substantial number of DE genes in either or both of D. hydei and D. buzzatii were also among our list of candidate climatic stress response genes. In total, 57 and 59 of the DE genes in the two species respectively were also identified as climatic stress candidate genes. Among the heat tolerance candidates, we found genes such as Hsp23,

Hsp70Ab, Hsp83 and Hsp26 (all implicated in cold, heat and hypoxia response), CG7381

(a midgut expressed trypsin inhibitor), GstD5 and phu (both detoxification genes), and

CG10383, Mal-A3 and Myo31DF (all having hydrolase activity and longevity associations). We also found similar overlaps in the desiccation tolerance candidate genes, such as fatty acyl-CoA reductase 1 homologue (CG12268: Far1) in D. buzzatii

(which controls glycerophospholipid biosynthesis (Honsho et al., 2010)), Myo31DF (see above), and regulatory genes incuding CSN1b and nub (detailed in Table 5.5). When partitioning based on orthologues between the two species using SSOM clusters, five and

14 of the genes from clusters 2-7 displayed positive selection in cactophilic species and generalists respectively. While the cactophilic genes were mostly developmental and reproductive genes, D. hydei displayed positive selection for heat-shock proteins (Hsp23), developmental genes (palisade, nubbin, Mur2B) and cellular stress tolerance genes

(Ugt86Dj).

121

Figure 5.7: The overlap between lineage specific orthogroups, genes under positive selection in cactophilic and generalist species, heat/desiccation candidate genes and differentially expressed genes in D. hydei and D. buzzatii.

122

Table 5.5: Overlap of genes found to be under positive (+ve) selection, differential expression (DE) and identified as candidate genes responsible for heat or desiccation tolerance in earlier studies.

Categories overlapped Count Gene names

D. hydei (DE), D. buzzatii (DE), Cactophilics 2 CG7248, Ugt86Dj

(+ve selected)

D. hydei (DE), D. buzzatii (DE), Generalists (+ve 13 CG13114, CG10000, CG6839,

selected) LKRSDH, CG15571, nAChRbeta3,

l(2)efl, psd, Gr66a, CG5986, Vha100-4,

CG7702, CG33281

Desiccation candidates, D. buzzatii (DE), 2 CG12268, CSN1b

Cactophilics (+ve selected)

Desiccation candidates, D. buzzatii (DE), 3 CG10960, Gr97a, nub

Generalists (+ve selected)

Desiccation candidates, D. hydei (DE), D. 2 FBgn0051176, Dhc93AB

buzzatii (DE)

Desiccation candidates, D. hydei (DE), D. 1 CG7381

buzzatii (DE), Generalists (+ve selected)

Desiccation candidates, Heat candidates, D. 1 Myo31DF

buzzatii (DE)

Heat candidates, D. buzzatii (DE), Generalists 3 phu, Mal-A3, iotaTry

(+ve selected)

Heat candidates, D. hydei (DE), D. buzzatii (DE) 5 Hsp70Ab, Hsp83, GstD5, CG10383,

Hsp26

Heat candidates, D. hydei (DE), D. buzzatii (DE), 1 Hsp23

Generalists (+ve selected)

It was also informative to scan for overlap between the results from the different analyses at the level of the Revigo-clustered GO sets instead of primarily at the level of the genes

123

as above (Fig. 5.8). We found that 23 of the total of 25 GO sets annotated as biological processes were implicated in one or more of the analyses, and most of those 23 were involved in more than one analyses. As expected from the gene-level analyses (Fig. 5.3), the GO-set analyses indicated a strong association between the orthogroups initiated in the ancestral repleta lineage and inparalogues arising subsequently, particularly among the generalists. Furthermore, as before, there was little association between the positively selected gene sets and either the ancestral orthogroups or the subsequent inparalogues.

Most of the functions represented among the DE genes were also unrelated to those from the other analyses, which again is consistent with the gene-level comparisons. In terms of discriminating between the cactophiles and generalists, the most notable GO-set was set

40 (Nucleotide sugars metabolism). This set was enriched in the ancestral repleta lineage and among the duplications in the generalists but not cactophiles and was downregulated in the stress response experiment in the generalists but upregulated in that experiment in the cactophiles. It is not immediately obvious why a set associated with nucleotide sugar metabolism should distinguish the two groups of species in so many ways.

A preliminary analysis of the specific subsets of GO-sets underlying some of the most interesting similarities and differences in the above set analysis was then carried out to interrogate the functions involved in more detail (Table S5.6). Some of the highlights are as follows. D. hydei was enriched for subset 0_1 (response to heat) in the transcriptional response to heat while D. buzzatii was enriched for subset 0_0 (response to cold) – indicating a possible overlap in function between the two stresses when invoking a stress response in the two repleta group species. Of these two functions, only ‘response to heat’ was found to be generating inparalogues in D. mojavensis, though all five species had duplications enriched for cellular responses to either hypoxia, cellular and DNA damage and nitrogen compounds, the former two of which are more consistent with an oxidative

124

stress or host shift response (Matzkin, 2012; Zhao & Haddad, 2011). Recent host-shift experiments in D. mojavensis (Matzkin, 2012) also highlighted the role of energy derivation and oxidation genes (Set 9) and metabolic biological processes (Set 11) along with chitin metabolism related genes (Set 40), all of which were found to be enriched in the repleta group specific orthogroups. Similarly, the downregulation of metabolic/developmental genes and upregulation of hydrolase (peptidase) and catalytic activity as well as a majority of gene functions enriched in species specific duplications were also reported in earlier studies of desiccation stress on D. mojavensis (Matzkin &

Markow, 2009).

125

Figure 5.8: Functional Sets enriched in each analysis for the repleta group lineage, generalist and specialist species overall, as well as the constituent species.

Only enriched Sets are annotated in the cells. Lineage specific orthogroup analyses denoted by (O) and positive selection analyses () were based on orthogroups whereas duplication analyses (D) used all remaining genes. All the annotated genes in the respective species were used for the stress dependent up/downregulation

() analyses.

126

5.4 Conclusion

With rising global temperatures, the ability of a species to adapt to abiotic stresses is important for future fitness. The repleta group of species demonstrate an ability to adapt to dietary as well as desiccation/thermal stress by regulation of metabolism (Matzkin &

Markow, 2009). Here we report a high quality genome assembly and annotation of multiple species from the repleta group which provide us with unprecedented depths of insights into the genomic basis of evolution among the Drosophila species. Using orthologue and inparalogue prediction sensitive to evolutionary pressures, we found the two generalist repleta species in our study to have retained the ancestral host use phenotype and climatic niche – given the continued accumulation of duplications in the orthogroups formed in the ancestral lineage. The most heat tolerant cactophilic species however acquired duplications in different orthologues, and underwent positive selection in still other orthogroups. This was exemplified by the most specialist of cactophiles in feeding and geographical range (D. mojavensis) accumulating least duplications but with twice as many fast evolving genes as the other species. On at least the coding level – we found little evidence that the genes that were duplicated in any species were also undergoing positive selection. This observation is somewhat at odds with current theories on evolutionary adaptation driven by duplication and neofunctionalisation.

Comparing to earlier stress tolerance studies in Drosophila, we found that the genes undergoing duplications in both the cactophiles and generalists contained a disproportionately high number of genes linked to climatic stress responses. This might suggest that some level of duplication has been associated with climate stress response in both groups of species. However, the low representation of candidate climatic stress response genes undergoing positive selection in individual species could indicate an

127

alternate adaptation in the cactophiles. Perhaps many of the positively selected genes in the cactophiles reflect the adaptation to the new diet, rather than the additional climate stress tolerance.

We also found considerable differences in the transcriptional response to heat stress of the cactophile D. buzzatii and generalist D. hydei spanning from immediate during-stress to late post-stress recovery phases. While the less tolerant generalist species had a larger quantitative response to stress, the more heat tolerant species showed greater evidence of changes that could be interpreted as adaptive on account of greater number of genes displaying complete recovery, 24 hours after stress. A hierarchical analysis of the biological processes found to be enriched among the genes involved in duplication, positive selection and transcriptional response showed relatively little overlap between the functions discriminating the two groups of species. Together with the other finding above this would suggest that the (variably) greater heat stress of the cactophiles may be achieved by regulatory rather than structural gene evolution. One notable exception to this may be genes involved nucleotide sugar metabolism that were enriched in the ancestral orthogroups and the generalist duplications and showed opposite transcriptional responses to heat stress, but more work is needed to elucidate the basis for this.

128

CHAPTER 6

Conclusion and future directions

The work presented in this dissertation attempts to discern the genomic signatures and factors associated with adaptation to climatic variation as well as plant-hosts in ectothermic insects. In Chapter 2, I was able to demonstrate the role of chromosomal inversions in binding together alleles favourable to local climatic conditions, and therefore driving adaptation to climate change in the widespread Drosophila melanogaster. By sequencing single chromosomes where the inversion is located, I also identified alleles independent of the inversion that were differentially adapted to the ends of the eastern Australian latitudinal cline. Compared to earlier estimates of differentiation between the northern and southern populations (Kolaczkowski et al., 2011), using genome sequencing I found that the inversion 3R Payne increases the magnitude of differentiation by ~26% when compared to collinear (non-inverted) arrangements from the same population, even when sampling for the same single nucleotide polymorphisms

(SNPs). This observation supports earlier theories hypothesising the role of chromosomal inversions, in capturing: 1) epistatically acting, co-adapted loci; or 2) alleles that do not act epistatically but that are favoured in the same direction to drive adaptive responses in a species (Dobzhansky, 1970; Kirkpatrick & Barton, 2006b). In concordance with these ideas, the inversion also displayed increased linkage disequilibrium (LD: the propensity of alleles to be co-inherited) between loci within the inversion. Surprisingly however, I found that the collinear arrangement in the northern population showed even greater LD than the inverted arrangement, indicating that the inversion may also conserve co-adapted alleles in the collinear genome due to a heterozygous benefit (Pegueroles et al., 2010), since In(3R)Payne was never observed to reach fixation along the east Australian cline

129

(Anderson et al., 2005; Umina et al., 2005). However the universality of this observation remains to be validated.

The role of In(3R)Payne and other cosmopolitan inversions in clinal adaptation has since been studied further in an attempt to find parallel patterns of the In(3R)Payne cline in

North America to those in Australia (Kapun et al., 2016a; Kapun et al., 2016b). It has also been suggested that the clinality of adaptive alleles and inversions may be maintained by admixture (addition of new alleles) from adapted ancestral populations in Europe and

Africa (Bergland et al., 2016). Kapun et al. (2016a) however found that while most inversions display allele frequencies indistinguishable from that expected under admixture (which supports Bergland et al. (2016)’s conclusion), In(3R)Payne displayed significant deviation from the expected pattern (Kapun et al., 2016a). This suggests that

In(3R)Payne is indeed segregating along the cline and is likely responding to climatic factors like temperature and precipitation, and at least partially driving clinal adaptation in D. melanogaster. Furthermore this inversion also captured the vast majority (79%) of clinally associated SNP’s on the chromosome arm 3R (Kapun et al., 2016a). Kapun et al.

(2016b) also found that, as expected from earlier studies (Rako et al., 2006b),

In(3R)Payne was highly associated with body size traits, but not stress resistance or life history traits. The studies indicate that while In(3R)Payne is associated with adaptive shifts in response to latitudinally-associated factors like temperature, there are potentially additional factors contributing to the maintenance of the inversion in warmer climates

(Kapun et al., 2014). The results from Chapter 2 and recent studies of local adaptation driven by In(3R)Payne confirm the significance of capturing alleles associated with climatic adaptation - even if they are not located in close proximity. More detailed analyses are however required to identify genes co-adapted or acting in the same direction within the inversions.

130

Detailed study of inversions could be carried out using the same principle used in Chapter

2 i.e. sequencing one chromosome at a time, but with higher throughput. For example, genetic signatures of adaptation could be identified by independent analysis of haplotypes of every chromosome from multiple individuals across populations along the cline. While this approach using balancer lines is limited to D. melanogaster, it can be replicated bioinformatically by crossing wild populations to a sequenced line and subtracting the known genetic background subsequently. By reducing noise due to heterozygosity, whole genome sequencing will allow detailed identification of genes that are tightly associated with clinal and climatic adaptation. Sequencing of individuals (as opposed to short- representation sequencing like Poolseq and RADSeq), followed by whole genome LD and analyses of genetic differentiation, will help narrow down the genetic elements

(genes, indels and SNPs) contributing to climatic adaptation. Such analyses of haplotypes will also help provide answers to admixture events since admixture primarily works at a haplotype level due to sexual reproduction.

In Chapter 3 I developed a highly scalable and sensitive orthologue and duplication prediction bioinformatic pipeline, Orthonome, which also considers phylogenetic and syntenic relationships of genes across species. Along with increased speed, the pipeline also outperformed current pipelines with its ability to provide more sensitive orthologue predictions. The performance of the pipeline was further improved when using low quality draft genomes by implementing novel scoring algorithms which account for truncated annotations as is frequent in draft assemblies. Due to the greater recovery of orthologous relationships and stronger resolution of genes into 1:1 orthologue relationships, Orthonome also proved useful in predicting mis-annotations in key developmental gene families. However, I found that orthologue retrieval is affected by

131

annotation quality more than assembly quality, and therefore future work to identify and recover missed annotations de novo is essential.

Since Orthonome separates gene duplication and birth events from orthologues, and clusters them by domain similarity, the pipeline is suitable for gene family and subfamily classifications as well as identification of lineage specific expansion events across multiple species. The pipeline will also help in identifying micro-evolutionary changes when comparing closely related species or diverged populations which have been assembled independently. For example, since Orthonome is sensitive to syntenic changes, the order and orientation of orthologues on the genome may help detect inversion and translocation events. The pipeline could also be used in detailed comparisons between highly selected lines such as the Dark-fly line of Drosophila melanogaster (Izutsu et al.,

2012), to identify gene deaths and duplications over the 1,400 or more generations of selective breeding. Additionally when compared to OrthoDB, the greater number of 1:1 orthologues predicted across all species (44% in good quality genomes and 62% in draft genomes) will allow for more robust evolutionary (dN/dS) and phylogenetic studies since we find greater conservation between genomes than earlier orthologue prediction pipelines. However, the data suggests that the presence of closely related species greatly improved the duplication classification quality by Orthonome. With a web platform now available for Orthonome, I expect the quality of orthologue prediction of macro- evolutionary events to improve with time as better quality genomes are incorporated in the database and annotation quality improves with novel algorithms. In addition to this, further development will include the implementation of robust and high-throughput machine learning methods to predict protein domain families that are called Super

Orthologue Groups (SOGs). This will not only reduce time taken to predict orthologues,

132

but also allow more computational time to implement exhaustive phylogenetic methods for prediction of gene duplications, and prediction of pseudogenes and inversions.

In Chapter 4, the orthologue and gene family predictions produced using Orthonome were utilised to find evidence for the hypothesis that detoxification-related gene family sizes have a role in the host range expansion of herbivorous insect species. I classified eight gene families in 58 species and found four gene families (P450s, CCEs, GSTs and ABCs) which were associated with increased host range or polyphagy. Although another gene family often implicated in host adaptation, the UGTs, was not significantly associated with polyphagy. While P450s, CCEs and GSTs have been implicated in host adaptation multiple times (Ahn et al., 2012; Berenbaum et al., 1992; Berenbaum et al., 1996; Daborn et al., 2002; Feyereisen, 2011; Shou-Min, 2012), the role of ABC transporters in host adaptation has only been recently noted (Dermauw et al., 2013a; You et al., 2013a).

Additionally, I found the gene duplication and gene family lineages sizes in P450s, CCEs and ABCs to be associated with polyphagy. On the contrary, there was limited evidence for the association between polyphagy and numbers of delta and epsilon GSTs, despite their role in detoxification, while the remaining GST lineages displayed an association with polyphagy. It is likely that since a majority of detoxification related gene discoveries are made through pesticide resistance assays (Li et al., 2007), the lineages of GSTs responsible for detoxification and host adaptation have been underestimated since the remaining lineages also showed significant association with polyphagy. Regardless of gene families however, I found that the most polyphagous insects in the analysis such as the coleopteran pest Leptinotarsa decemlineata, the highly invasive beetle Anoplophora glabripennis, the hemipteran Homalodisca vitripennis, and the lepidopteran pest Plutella xylostella had the highest number of detoxification genes. Therefore, while the moderately polyphagous or restricted species may present weaker signal supporting the

133

hypothesis, I found that the greater the host range of a species, the more likely it is to possess a generous repertoire of detoxification genes.

I also found that every order of insects utilises a different combination of the five major gene families to expand host range. Additionally the estimations of gene duplications also tend to offer similar evidence as total counts, but could be subject to overestimation on account of reduced phylogenetic variety of species analysed in some orders (Bass & Field,

2011). Concomitantly, some orders did not provide clear patterns since statistical tests could not be carried out due to smaller numbers of species analysed or a lower number of host and host range combinations. These tests need to be expanded as more species are sequenced. Even though I focused on only eight gene families in Chapter 4, I was able to confirm the feasibility of large-scale multi-species comparison studies to ascertain the role of gene expansion in polyphagy.

The patterns emerging for host adaptation showed a significant association of gene gain with increased host range. It is therefore likely that rapid gain in gene families is essential for polyphagy, even though only one particular gene might be involved, as seen in a

Myzus persicae subpopulation (Bass et al., 2013) where a gain of 100 more copies of the

CYP6CY3 gene enabled adaptation to a highly toxic host, tobacco. The genes involved may differ across species, but this requires further testing. The study could be expanded to greater numbers of gene families, and gene duplications should be considered more specifically instead of only the contraction and expansion of gene families. This should lead to a greater understanding of host adaptation in species as well as providing the possibility that genomics could be eventually used to predict the likelihood of host adaptation. Based on the acquisition of specific genes, it may be possible to predict host adaptations when associating host plant toxins with particular detoxification genes or combinations thereof. The knowledge of fixed genetic factors associated with insect-host

134

adaptations may also be exploited to stimulate rapid gene expansions in specialist species to enable host adaptation in novel environments, or host contraction in generalist species.

Technologies such as CRISPR enabled native ‘gene drive’ have already been implemented in mosquitoes to selectively drive gene frequencies in natural populations to fixation with limited introduction of CRISPR modified individuals (Gantz et al., 2015), providing a possible pathway of modifying insect populations.

In Chapter 5, I applied the techniques developed in prior chapters to study the genomic signature of host and climatic adaptation in the repleta group of drosophilids. In this chapter, I also developed an orthologue-supported novel genome annotation pipeline to overcome the issues of missing annotations identified in Chapter 3. I found that the new annotation pipeline was highly effective in improving the annotation and subsequent orthologue retrieval in earlier published draft genome by >10%. The results also suggested a role for host adaptation related gene families in climatic adaptation. I found that the nitrogen metabolism and oxidation stress tolerance related biological processes evolved rapidly in the repleta group, compared to ~19 other drosophilids, and these seemed important to host adaptation as well as the response to thermal stress which was verified using stress transcriptome experiments. The enrichment of genes, while in qualitative agreement with earlier observations for cactophilic drosophilids (Guillen et al., 2015), also suggested that they evolve faster in highly heat-tolerant cactophilic repleta species compared to moderately tolerant generalists. The fact that all species in this group consume toxic food that leads to oxidative stress (Ahmad, 1992; Enyiukwu et al., 2014;

Fogleman & Danielson, 2001; Markow et al., 2000; Matzkin, 2014) could indicate why all members of this group showed such enrichment.

Under heat stress, the cactophilic species displayed a more rapid induction of gene transcription, and a robust recovery since most of the genes returned to original pre-stress

135

transcriptional levels. The same oxidative stress tolerance related genes that showed induction and recovery, also displayed rapid gene expansions in the repleta group, especially in the cactophilic species. This suggests that adaptation to oxidative stress is a driving factor in the evolution of the repleta group of species. The transcriptomic and evolutionary evidence however leaves unanswered questions which can be validated by more detailed functional studies involving knock-down of the critical genes in these biological processes. Such functional studies would help clarify any potential connection between climate adaptation and host range. Moreover these observations lead to some suggestions about possible pathways of adaptive changes. Could host adaptation have led to climatic adaptation? Or alternatively, could an adaptation to oxidative stress in an ancestral species have led to both climatic and host adaptation in the repleta group?

My observations from Chapter 4 and 5 suggest that species may adapt to stressful conditions or new hosts by expanding their repertoire of associated genes. Evolutionary signatures have previously been observed, examples include the evolution of cold tolerance in Antarctic notothenioid fishes by means of 118 serial gene duplications (Chen et al., 2008), and the evolution of pesticide resistance in the mosquito Culex pipiens

(Lenormand et al., 1998). These evolutionary events are often accompanied by transcriptomic changes, in the same way we observed different expression and recovery patterns among D. buzzatii and D. hydei. However a limitation of the current work is that we have only considered one drosophilid lineage. Ideally, the same patterns need to be tested rigorously in a phylogenetic framework, which should be possible given the repeated evolution of phenotypic tolerance to stress in dozens of drosophilid species

(Kellermann et al., 2012a; Kellermann et al., 2012b). An example of another group displaying a very diverse set of thermal, cold, and desiccation stress tolerance responses is the Zaprionus group. This group in some ways is even richer than the repleta group

136

when it comes to diversity of stress tolerance phenotypes, and is sufficiently phylogenetically removed from it to provide an independent set of contrasts for comparison.

Replication across lineages is particularly important because the results from Chapter 5 suggest that multiple evolutionary events are likely between closely related species, making it difficult to obtain a clear genetic signature of adaptation to novel hosts or climatic stress. This is exemplified by the observation that different sub-groups of cactophilic species (mojavensis vs mulleri cactophilic drosophilids) have slightly different evolutionary signatures of gene enrichment and positive selection, despite qualitatively similar phenotypic adaptation to heat, desiccation and cactophilic diet

(Chapter 5, Guillen et al., 2015). Such signatures may become even less clear if phylogenetic distance increases, unless selected candidate genes are already known. Such a case is that of the Antarctic notothenioid fishes, which were also shown to have undergone rapid duplication of the antifreeze protein coding genes (Chen et al., 1997).

One way that such species specific adaptations can be separated from group specific events is to combine large comparative genomics studies with candidate genes identified using population resequencing and functional studies. For example we can identify the genes associated with adaptation to a particular stress, using species specific transcriptomic studies such as Matzkin and Markow (2013) and Matzkin (2012) in the case of dietary adaptation, and the results in Chapter 5 in the case of heat stress. In Chapter

5 I was able to identify 123 genes with a synonymous response to heat stress across two species of the repleta group. Genes associated with local adaptation to stresses can be identified using genome wide association studies (GWAS) and verified for changes in allele frequency using population resequencing. A similar approach involving GWAS and population genomics has been used to study adaptation to the toxic fruit Miranda

137

citrafolia in the specialist D. sechellia (Dworkin & Jones, 2009; Hungate et al., 2013;

Shiao et al., 2015), and now in the generalist D. yakuba (Yassin et al., 2016), leading to the identification of genes responsible for parallel evolution of host adaptation. A combination of the above approaches will be required to delineate local effects from evolutionary changes and therefore identify genes critical to adaptation to climatic stresses. Knowledge of candidate genes and the networks they form may allow researchers to eventually predict the adaptability of species to climatic stresses such as heat stress. Hypotheses can be further tested by conducting knock-down and gene insertion experiments using CRISPR across groups of species displaying phenotypic diversity, such as the Zaprionus and repleta group, for heat, cold and desiccation and the montium group of drosophilids for desiccation. While the power to predict potential species niches may reduce with increased complexity in phenotypic expression of a trait, genomics may nonetheless play a role in understanding or creating initial estimates of a species capacity and mode of adaptation in light of climate change.

In conclusion, the genomic signature of host and climatic adaptation may revolve around genes involved in particular biological processes in species groups, but causal links will require further comparative and functional studies. The data provided in this dissertation illustrate how genomic and associated transcriptomic analyses can help identify potential candidate families and processes underlying ecological divergence. Further evolutionary and functional studies are required to support putative families involved. At the same time, genes contributing to local adaptation can be identified through single chromosome sequencing but patterns may be difficult to interpret without an understanding of structural variation and its impacts on recombination and diversity in candidate genes.

With the increasing threat of climate change to insects and the plant hosts they interact

138

with, a knowledge of genetic factors contributing to both climate and hosts may help to better predict likely species responses.

139

References

Adams, M. D. & Celniker, S. E. & Holt, R. A. & Evans, C. A. & Gocayne, J. D. & Amanatides, P. G., et al. (2000). The genome sequence of Drosophila melanogaster. Science 287, 2185-95. Adler, L. S.Irwin, R. E. (2005). Ecological costs and benefits of defenses in nectar. Ecology 86, 2968-2978. Ahmad, S. (1992). Biochemical defence of pro-oxidant plant allelochemicals by herbivorous insects. Biochemical Systematics and Ecology 20, 269-296. Ahn, S. J., Vogel, H. &Heckel, D. G. (2012). Comparative analysis of the UDP- glycosyltransferase multigene family in insects. Insect Biochem Mol Biol 42, 133-47. Akey, J. M., Zhang, G., Zhang, K., Jin, L. &Shriver, M. D. (2002). Interrogating a high- density SNP map for signatures of natural selection. Genome Res 12, 1805-14. Altenhoff, A. M., Boeckmann, B., Capella-Gutierrez, S., Dalquen, D. A., DeLuca, T., Forslund, K., et al. (2016). Standardized benchmarking in the quest for orthologs. Nat Methods 13, 425-30. Altenhoff, A. M., Skunca, N., Glover, N., Train, C. M., Sueki, A., Pilizota, I., et al. (2015). The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 43, D240-9. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. &Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-10. Alvarez-Ponce, D., Aguade, M. &Rozas, J. (2009). Network-level molecular evolutionary analysis of the insulin/TOR signal transduction pathway across 12 Drosophila genomes. Genome Res 19, 234-42. Alvarez-Ponce, D., Aguade, M. &Rozas, J. (2011). Comparative genomics of the vertebrate insulin/TOR signal transduction pathway: a network-level analysis of selective pressures. Genome Biol Evol 3, 87-101. Anderson, A. R., Collinge, J. E., Hoffmann, A. A., Kellett, M. &McKechnie, S. W. (2003). Thermal tolerance trade-offs associated with the right arm of chromosome 3 and marked by the hsr-omega gene in Drosophila melanogaster. Heredity (Edinb) 90, 195-202. Anderson, A. R., Hoffmann, A. A., McKechnie, S. W., Umina, P. A. &Weeks, A. R. (2005). The latitudinal cline in the In(3R)Payne inversion polymorphism has shifted in the last 20 years in Australian Drosophila melanogaster populations. Mol Ecol 14, 851-8. Anderson, T. M., vonHoldt, B. M., Candille, S. I., Musiani, M., Greco, C., Stahler, D. R., et al. (2009). Molecular and evolutionary history of melanism in North American gray wolves. Science 323, 1339-43. Andolfatto, P., Depaulis, F. &Navarro, A. (2001). Inversion polymorphisms and nucleotide variability in Drosophila. Genet Res 77, 1-8. Andolfatto, P., Wall, J. D. &Kreitman, M. (1999). Unusual haplotype structure at the proximal breakpoint of In(2L)t in a natural population of Drosophila melanogaster. Genetics 153, 1297-1311. Andrew, N. R., Hill, S. J., Binns, M., Bahar, M. H., Ridley, E. V., Jung, M.-P., et al. (2013). Assessing insect responses to climate change: What are we testing for? Where should we be heading? PeerJ 1, e11. Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. Reference Source.

140

Arnold, B., Corbett-Detig, R. B., Hartl, D. &Bomblies, K. (2013). RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Mol Ecol 22, 3179-90. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9. Attrill, H., Falls, K., Goodman, J. L., Millburn, G. H., Antonazzo, G., Rey, A. J., et al. (2015). FlyBase: establishing a Gene Group resource for Drosophila melanogaster. Nucleic Acids Research, gkv1046. Ayala, D., Fontaine, M. C., Cohuet, A., Fontenille, D., Vitalis, R. &Simard, F. (2011). Chromosomal inversions, natural selection and adaptation in the malaria vector Anopheles funestus. Mol Biol Evol 28, 745-58. Ayala, D., Guerrero, R. F. &Kirkpatrick, M. (2013). Reproductive isolation and local adaptation quantified for a chromosome inversion in a malaria mosquito. Evolution 67, 946-58. Ayrinhac, A., Debat, V., Gibert, P., Kister, A. G., Legout, H., Moreteau, B., et al. (2004). Cold adaptation in geographical populations of Drosophila melanogaster: phenotypic plasticity is more important than genetic variability. Functional Ecology 18, 700-706. Baird, N. A., Etter, P. D., Atwood, T. S., Currey, M. C., Shiver, A. L., Lewis, Z. A., et al. (2008). Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3, e3376. Bale, J. S.Hayward, S. A. (2010). Insect overwintering in a changing climate. J Exp Biol 213, 980-94. Bale, J. S., Masters, G. J., Hodkinson, I. D., Awmack, C., Bezemer, T. M., Brown, V. K., et al. (2002). Herbivory in global climate change research: direct effects of rising temperature on insect herbivores. Global Change Biology 8, 1-16. Barker, J. S. F., Starmer, W. T. &MacIntyre, R. J. (2013). Ecological and evolutionary genetics of Drosophila. Springer Science & Business Media. Barrett, J. C., Fry, B., Maller, J. &Daly, M. J. (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21, 263-265. Barrett, R. D., Paccard, A., Healy, T. M., Bergek, S., Schulte, P. M., Schluter, D., et al. (2011). Rapid evolution of cold tolerance in stickleback. Proc Biol Sci 278, 233- 8. Barros, V., Field, C., Dokke, D., Mastrandrea, M., Mach, K., Bilir, T., et al. (2015). Climate change 2014: impacts, adaptation, and vulnerability. Part B: regional aspects. Contribution of Working Group II to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Bass, C.Field, L. M. (2011). Gene amplification and insecticide resistance. Pest Manag Sci 67, 886-90. Bass, C., Zimmer, C. T., Riveron, J. M., Wilding, C. S., Wondji, C. S., Kaussmann, M., et al. (2013). Gene amplification and microsatellite polymorphism underlie a recent insect host shift. Proc Natl Acad Sci U S A 110, 19460-5. Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573-80. Berenbaum, M. R., Cohen, M. B. &Schuler, M. A. (1992). Cytochrome-P450 monooxygenase genes in oligophagous Lepidoptera. Berenbaum, M. R., Favret, C. &Schuler, M. A. (1996). On defining ''key innovations'' in an adaptive radiation: Cytochrome P450s and papilionidae. American Naturalist 148, S139-S155.

141

Berenbaum, M. R.Zangerl, A. R. (2008). Facing the future of plant-insect interaction research: le retour à la “raison d'être”. Plant Physiology 146, 804-811. Bergland, A. O., Tobler, R., Gonzalez, J., Schmidt, P. &Petrov, D. (2014). Secondary contact and local adaptation contribute to genome-wide patterns of clinal variation in Drosophila melanogaster. biorxiv. Bergland, A. O., Tobler, R., Gonzalez, J., Schmidt, P. &Petrov, D. (2016). Secondary contact and local adaptation contribute to genome-wide patterns of clinal variation in Drosophila melanogaster. Mol Ecol 25, 1157-74. Bezdek, J. C., Ehrlich, R. &Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 191-203. Blondel, V. D., Guillaume, J., Lambiotte, R. &Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment 2008, P10008. Bocedi, G., Atkins, K. E., Liao, J., Henry, R. C., Travis, J. M. &Hellmann, J. J. (2013). Effects of local adaptation and interspecific competition on species' responses to climate change. Ann N Y Acad Sci 1297, 83-97. Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. &Pirovano, W. (2011). Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578-9. Bolger, A. M., Lohse, M. &Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-20. Borodina, T., Adjaye, J. &Sultan, M. (2011). A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol 500, 79-98. Bosco, G., Campbell, P., Leiva-Neto, J. T. &Markow, T. A. (2007). Analysis of drosophila species genome size and satellite DNA content reveals significant differences among strains as well as between species. Genetics 177, 1277-1290. Boyles, J. G., Seebacher, F., Smit, B. &McKechnie, A. E. (2011). Adaptive thermoregulation in endotherms may alter responses to climate change. Integr Comp Biol 51, 676-90. Bradshaw, W.Holzapfel, C. (2008). Genetic response to rapid climate change: It's seasonal timing that matters. Mol. Ecol. 17, 157. Bradshaw, W. E.Holzapfel, C. M. (2006). Climate change. Evolutionary response to rapid climate change. Science 312, 1477-8. Bridle, J. R., Buckley, J., Bodsworth, E. J. &Thomas, C. D. (2014). Evolution on the move: specialization on widespread resources associated with rapid range expansion in response to climate change. Proc Biol Sci 281, 20131800. Browning, B. L.Browning, S. R. (2009). A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84, 210-23. Buckley, J., Butlin, R. K. &Bridle, J. R. (2012). Evidence for evolutionary change associated with the recent range expansion of the British butterfly, Aricia agestis, in response to climate change. Mol Ecol 21, 267-80. Buss, D. S.Callaghan, A. (2008). Interaction of pesticides with p-glycoprotein and other ABC proteins: A survey of the possible importance to insecticide, herbicide and fungicide resistance. Pesticide Biochemistry and Physiology 90, 141-153. Cassone, B. J., Molloy, M. J., Cheng, C., Tan, J. C., Hahn, M. W. &Besansky, N. J. (2011). Divergent transcriptional response to thermal stress by Anopheles gambiae larvae carrying alternative arrangements of inversion 2La. Mol Ecol 20, 2567-80. Celorio-Mancera, M. d. l. P., Heckel, D. G. &Vogel, H. (2012). Transcriptional analysis of physiological pathways in a generalist herbivore: responses to different host

142

plants and plant structures by the cotton bollworm, Helicoverpa armigera. Entomologia Experimentalis et Applicata 144, 123-133. Charlesworth, D.Charlesworth, B. (1979). Selection on recombination in clines. Genetics 91, 581. Chen, I. C., Hill, J. K., Ohlemuller, R., Roy, D. B. &Thomas, C. D. (2011). Rapid range shifts of species associated with high levels of climate warming. Science 333, 1024-6. Chen, L., DeVries, A. L. &Cheng, C. H. (1997). Evolution of antifreeze glycoprotein gene from a trypsinogen gene in Antarctic notothenioid fish. Proc Natl Acad Sci U S A 94, 3811-6. Chen, Z., Cheng, C. H., Zhang, J., Cao, L., Chen, L., Zhou, L., et al. (2008). Transcriptomic and genomic evolution under constant cold in Antarctic notothenioid fish. Proc Natl Acad Sci U S A 105, 12944-9. Cheng, C., White, B. J., Kamdem, C., Mockaitis, K., Costantini, C., Hahn, M. W., et al. (2012). Ecological genomics of Anopheles gambiae along a latitudinal cline: a population-resequencing approach. Genetics 190, 1417-32. Cho, S., Huang, Z. Y., Green, D. R., Smith, D. R. &Zhang, J. (2006). Evolution of the complementary sex-determination gene of honey bees: balancing selection and trans-species polymorphisms. Genome research 16, 1366-1375. Chown, S. L., Hodgins, K. A., Griffin, P. C., Oakeshott, J. G., Byrne, M. &Hoffmann, A. A. (2015). Biological invasions, climate change and genomics. Evol Appl 8, 23-46. Cingolani, P., Platts, A., Wang le, L., Coon, M., Nguyen, T., Wang, L., et al. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80-92. Claudianos, C., Ranson, H., Johnson, R. M., Biswas, S., Schuler, M. A., Berenbaum, M. R., et al. (2006). A deficit of detoxification enzymes: pesticide sensitivity and environmental response in the honeybee. Insect Mol Biol 15, 615-36. Colinet, H., Lee, S. F. &Hoffmann, A. (2010). Functional characterization of the Frost gene in Drosophila melanogaster: importance for recovery from chill coma. PLoS One 5, e10925. Collinge, J. E., Anderson, A. R., Weeks, A. R., Johnson, T. K. &McKechnie, S. W. (2008). Latitudinal and cold-tolerance variation associate with DNA repeat- number variation in the hsr-omega RNA gene of Drosophila melanogaster. Heredity (Edinb) 101, 260-70. Colosimo, P. F., Hosemann, K. E., Balabhadra, S., Villarreal, G., Jr., Dickson, M., Grimwood, J., et al. (2005). Widespread parallel evolution in sticklebacks by repeated fixation of Ectodysplasin alleles. Science 307, 1928-33. Cong, Q., Shen, J., Borek, D., Robbins, R. K., Otwinowski, Z. &Grishin, N. V. (2016). Complete genomes of Hairstreak butterflies, their speciation, and nucleo- mitochondrial incongruence. Scientific reports 6. Corbett-Detig, R. B.Hartl, D. L. (2012). Population genomics of inversion polymorphisms in Drosophila melanogaster. PLoS Genet 8, e1003056. Daborn, P. J., Yen, J. L., Bogwitz, M. R., Le Goff, G., Feil, E., Jeffers, S., et al. (2002). A single p450 allele associated with insecticide resistance in Drosophila. Science 297, 2253-6. Dalquen, D. A., Altenhoff, A. M., Gonnet, G. H. &Dessimoz, C. (2013). The impact of gene duplication, insertion, deletion, lateral gene transfer and sequencing error on orthology inference: a simulation study. PLoS One 8, e56925.

143

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., et al. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156-8. Dayan, D. I., Crawford, D. L. &Oleksiak, M. F. (2015). Phenotypic plasticity in gene expression contributes to divergence of locally adapted populations of Fundulus heteroclitus. Mol Ecol 24, 3345-59. de la Paz Celorio-Mancera, M., Wheat, C. W., Vogel, H., Soderlind, L., Janz, N. &Nylin, S. (2013). Mechanisms of macroevolution: polyphagous plasticity in butterfly larvae revealed by RNA-Seq. Mol Ecol 22, 4884-95. Derks, M. F., Smit, S., Salis, L., Schijlen, E., Bossers, A., Mateman, C., et al. (2015). The genome of winter moth (Operophtera brumata) provides a genomic perspective on sexual dimorphism and phenology. Genome Biology and Evolution 7, 2321-2332. Dermauw, W., Osborne, E. J., Clark, R. M., Grbic, M., Tirry, L. &Van Leeuwen, T. (2013a). A burst of ABC genes in the genome of the polyphagous spider mite Tetranychus urticae. BMC Genomics 14, 317. Dermauw, W., Wybouw, N., Rombauts, S., Menten, B., Vontas, J., Grbic, M., et al. (2013b). A link between host plant adaptation and pesticide resistance in the polyphagous spider mite Tetranychus urticae. Proc Natl Acad Sci U S A 110, E113-22. Despres, L., David, J. P. &Gallet, C. (2007). The evolutionary ecology of insect resistance to plant chemicals. Trends Ecol Evol 22, 298-307. Devonshire, A. L., Field, L. M., Foster, S. P., Moores, G. D., Williamson, M. S. &Blackman, R. L. (1998). The evolution of insecticide resistance in the peach- potato aphid, Myzus persicae. Philosophical Transactions of the Royal Society B-Biological Sciences 353, 1677-1684. Dewey, C. N. (2011). Positional orthology: putting genomic evolutionary relationships into context. Brief Bioinform 12, 401-12. Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21. Dobzhansky, T. G. (1970). Genetics of the evolutionary process. Columbia Univ Pr. Drosophila 12 Genomes, C. & Clark, A. G. & Eisen, M. B. & Smith, D. R. & Bergman, C. M. & Oliver, B., et al. (2007). Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203-18. Durando, C. M., Baker, R. H., Etges, W. J., Heed, W. B., Wasserman, M. &DeSalle, R. (2000). Phylogenetic analysis of the repleta species group of the genus Drosophila using multiple sources of characters. Molecular Phylogenetics and Evolution 16, 296-307. Dworkin, I.Jones, C. D. (2009). Genetic changes accompanying the evolution of host specialization in Drosophila sechellia. Genetics 181, 721-36. Edgar, R. C.Myers, E. W. (2005). PILER: identification and classification of genomic repeats. Bioinformatics 21 Suppl 1, i152-8. Enyiukwu, D., Awurum, A., Ononuju, C. &Nwaneri, J. (2014). Significance of characterization of secondary metabolites from extracts of higher plants in plant disease management. Int J Adv Agric Res 2, 8-28. Esther, E., Smit, S., Beukes, M., Apostolides, Z., Pirk, C. W. &Nicolson, S. W. (2015). Detoxification mechanisms of honey bees (Apis mellifera) resulting in tolerance of dietary nicotine. Scientific reports 5. Fabian, D. K., Kapun, M., Nolte, V., Kofler, R., Schmidt, P. S., Schlotterer, C., et al. (2012). Genome-wide patterns of latitudinal differentiation among populations of Drosophila melanogaster from North America. Mol Ecol 21, 4748-69.

144

Faucon, F., Dusfour, I., Gaude, T., Navratil, V., Boyer, F., Chandre, F., et al. (2015). Identifying genomic changes associated with insecticide resistance in the dengue mosquito Aedes aegypti by deep targeted sequencing. Genome Res 25, 1347-59. Feng, H., Wang, L., Liu, Y., He, L., Li, M., Lu, W., et al. (2010). Molecular characterization and expression of a heat shock protein gene (HSP90) from the carmine spider mite, Tetranychus cinnabarinus (Boisduval). J Insect Sci 10, 112. Feyereisen, R. (2011). Arthropod CYPomes illustrate the tempo and mode in P450 evolution. Biochim Biophys Acta 1814, 19-28. Field, L. M., Devonshire, A. L. &Forde, B. G. (1988). Molecular evidence that insecticide resistance in peach potato aphids (Myzus persicae Sulz) results from amplification of an esterase gene. Biochemical Journal 251, 309-312. Fogleman, J. C.Danielson, P. B. (2001). Chemical interactions in the cactus- microorganism-Drosophila model system of the Sonoran Desert. American Zoologist 41, 877-889. Franks, S. J. (2011). Plasticity and evolution in drought avoidance and escape in the annual plant Brassica rapa. New Phytol 190, 249-57. Franks, S. J.Hoffmann, A. A. (2012). Genetics of climate change adaptation. Annu Rev Genet 46, 185-208. Fu, Z., Chen, X., Vacic, V., Nan, P., Zhong, Y. &Jiang, T. (2007). MSOAR: a high- throughput ortholog assignment system based on genome rearrangement. J Comput Biol 14, 1160-75. Gabaldon, T.Koonin, E. V. (2013). Functional and evolutionary implications of gene orthology. Nat Rev Genet 14, 360-6. Gantz, V. M., Jasinskiene, N., Tatarenkova, O., Fazekas, A., Macias, V. M., Bier, E., et al. (2015). Highly efficient Cas9-mediated gene drive for population modification of the malaria vector mosquito Anopheles stephensi. Proc Natl Acad Sci U S A 112, E6736-43. Gautier, M., Foucaud, J., Gharbi, K., Cezard, T., Galan, M., Loiseau, A., et al. (2013). Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Mol Ecol 22, 3766-79. Giraudo, M., Hilliou, F., Fricaux, T., Audant, P., Feyereisen, R. &Le Goff, G. (2015). Cytochrome P450s from the fall armyworm (Spodoptera frugiperda): responses to plant allelochemicals and pesticides. Insect Mol Biol 24, 115-28. Goñi, B., Remedios, M., González-Vainer, P., Martínez, M. &Vilela, C. R. (2012). Species of Drosophila (Diptera: ) attracted to dung and carrion baited pitfall traps in the Uruguayan Eastern Serranías. Zoologia (Curitiba) 29, 308-317. González-Teuber, M.Heil, M. (2009). Nectar chemistry is tailored for both attraction of mutualists and protection from exploiters. Plant Signaling & Behavior 4, 809- 813. Govind, G., Mittapalli, O., Griebel, T., Allmann, S., Bocker, S. &Baldwin, I. T. (2010). Unbiased transcriptional comparisons of generalist and specialist herbivores feeding on progressively defenseless Nicotiana attenuata plants. PLoS One 5, e8735. Gray, E. M., Rocca, K. A. C., Costantini, C. &Besansky, N. J. (2009). Inversion 2La is associated with enhanced desiccation resistance in Anopheles gambiae. Malaria Journal 8, 215-215.

145

Gschloessl, B., Beyne, E., Audiot, P., Bourguet, D. &Streiff, R. (2013). De novo transcriptomic resources for two sibling species of moths: Ostrinia nubilalis and O. scapulalis. BMC Res Notes 6, 73. Guillen, Y., Rius, N., Delprat, A., Williford, A., Muyas, F., Puig, M., et al. (2015). Genomics of ecological adaptation in cactophilic Drosophila. Genome Biol Evol 7, 349-66. Haas, B. J., Delcher, A. L., Mount, S. M., Wortman, J. R., Smith, R. K., Hannick, L. I., et al. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31, 5654-5666. Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., et al. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc 8, 1494-512. Haas, B. J., Salzberg, S. L., Zhu, W., Pertea, M., Allen, J. E., Orvis, J., et al. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biology 9, R7. Hasson, E.Eanes, W. F. (1996). Contrasting histories of three gene regions associated with In(3L)Payne of Drosophila melanogaster. Genetics 144, 1565-75. Hatje, K., Keller, O., Hammesfahr, B., Pillmann, H., Waack, S. &Kollmar, M. (2011). Cross-species protein sequence and gene structure prediction with fine-tuned Webscipio 2.0 and Scipio. BMC Res Notes 4, 265. Hayashi, K., Schoonbeek, H. J. &De Waard, M. A. (2002). Bcmfs1, a novel major facilitator superfamily transporter from Botrytis cinerea, provides tolerance towards the natural toxic compounds camptothecin and cercosporin and towards fungicides. Appl Environ Microbiol 68, 4996-5004. Hedrick, P. W. (2013). Adaptive introgression in animals: examples and comparison to new mutation and standing variation as sources of adaptive variation. Mol Ecol 22, 4606-4618. Hindrikson, M., Männil, P., Ozolins, J., Krzywinski, A. &Saarma, U. (2012). Bucking the trend in wolf-dog hybridization: first evidence from Europe of hybridization between female dogs and male wolves. PLoS One 7, e46465. Hodkinson, I. (1997). Progressive restriction of host plant exploitation along a climatic gradient: the willow psyllid Cacopsylla groenlandica in Greenland. Ecological Entomology 22, 47-54. Hoffmann, A., Sorensen, J. &Loeschcke, V. (2003a). Adaptation of Drosophila to temperature extremes: bringing together quantitative and molecular approaches. J. Therm. Biol. 28, 175. Hoffmann, A. A., Blacket, M. J., McKechnie, S. W., Rako, L., Schiffer, M., Rane, R. V., et al. (2012). A proline repeat polymorphism of the Frost gene of Drosophila melanogaster showing clinal variation but not associated with cold resistance. Insect Mol Biol 21, 437-45. Hoffmann, A. A., Hallas, R. J., Dean, J. A. &Schiffer, M. (2003b). Low potential for climatic stress adaptation in a rainforest Drosophila species. Science 301, 100-2. Hoffmann, A. A.Rieseberg, L. H. (2008). Revisiting the Impact of Inversions in Evolution: From Population Genetic Markers to Drivers of Adaptive Shifts and Speciation? Annu Rev Ecol Evol Syst 39, 21-42. Hoffmann, A. A.Sgrò, C. M. (2011). Climate change and evolutionary adaptation. Nature 470, 479-85. Hoffmann, A. A., Shirriffs, J. &Scott, M. (2005). Relative importance of plastic vs genetic factors in adaptive differentiation: geographical variation for stress

146

resistance in Drosophila melanogaster from eastern Australia. Functional Ecology 19, 222-227. Hoffmann, A. A.Weeks, A. R. (2007). Climatic selection on genes and traits after a 100 year-old invasion: a critical look at the temperate-tropical clines in Drosophila melanogaster from eastern Australia. Genetica 129, 133-47. Hohenlohe, P. A., Bassham, S., Currey, M. &Cresko, W. A. (2012). Extensive linkage disequilibrium and parallel adaptive divergence across threespine stickleback genomes. Philos Trans R Soc Lond B Biol Sci 367, 395-408. Honsho, M., Asaoku, S. &Fujiki, Y. (2010). Posttranslational regulation of fatty acyl- CoA reductase 1, Far1, controls ether glycerophospholipid synthesis. J Biol Chem 285, 8537-42. http://nectar.org.au/, N. R. C. NeCTAR Research Cloud. Huerta-Cepas, J., Bueno, A., Dopazo, J. &Gabaldon, T. (2008). PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res 36, D491- 6. Hungate, E. A., Earley, E. J., Boussy, I. A., Turissini, D. A., Ting, C. T., Moran, J. R., et al. (2013). A locus in Drosophila sechellia affecting tolerance of a host plant toxin. Genetics 195, 1063-75. Indrischek, H., Wieseke, N., Stadler, P. F. &Prohaska, S. J. (2016). The paralog-to- contig assignment problem: high quality gene models from fragmented assemblies. Algorithms Mol Biol 11, 1. Izutsu, M., Zhou, J., Sugiyama, Y., Nishimura, O., Aizu, T., Toyoda, A., et al. (2012). Genome features of "Dark-fly", a Drosophila line reared long-term in a dark environment. PLoS One 7, e33288. Jensen, L. J., Julien, P., Kuhn, M., von Mering, C., Muller, J., Doerks, T., et al. (2008). eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res 36, D250-4. Johnson, T. K., Carrington, L. B., Hallas, R. J. &McKechnie, S. W. (2009). Protein synthesis rates in Drosophila associate with levels of the hsr-omega nuclear transcript. Cell Stress Chaperones 14, 569-77. Johnson, W. E., Li, C. &Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118-127. Jones, F. C., Grabherr, M. G., Chan, Y. F., Russell, P., Mauceli, E., Johnson, J., et al. (2012). The genomic basis of adaptive evolution in threespine sticklebacks. Nature 484, 55-61. Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236-40. Joron, M., Frezal, L., Jones, R. T., Chamberlain, N. L., Lee, S. F., Haag, C. R., et al. (2011a). Chromosomal rearrangements maintain a polymorphic supergene controlling butterfly mimicry. Nature 477, 203-206. Joron, M., Frezal, L., Jones, R. T., Chamberlain, N. L., Lee, S. F., Haag, C. R., et al. (2011b). Chromosomal rearrangements maintain a polymorphic supergene controlling butterfly mimicry. Nature 477, 203-6. Joußen, N., Agnolet, S., Lorenz, S., Schöne, S. E., Ellinger, R., Schneider, B., et al. (2012). Resistance of Australian Helicoverpa armigera to fenvalerate is due to the chimeric P450 enzyme CYP337B3. Proceedings of the National Academy of Sciences 109, 15206-15211.

147

Jovelin, R.Phillips, P. C. (2011). Expression level drives the pattern of selective constraints along the insulin/Tor signal transduction pathway in Caenorhabditis. Genome Biol Evol 3, 715-22. Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O. &Walichiewicz, J. (2005). Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110, 462-7. Kaiser, L., Le Ru, B. P., Kaoula, F., Paillusson, C., Capdevielle-Dulac, C., Obonyo, J. O., et al. (2015). Ongoing ecological speciation in Cotesia sesamiae, a biological control agent of cereal stem borers. Evol Appl 8, 807-20. Kanehisa, M. (2013). Molecular network analysis of diseases and drugs in KEGG. Data Mining for Systems Biology, pp. 263-275. Springer. Kapun, M., Fabian, D. K., Goudet, J. &Flatt, T. (2016a). Genomic evidence for adaptive inversion clines in Drosophila melanogaster. Mol Biol Evol 33, 1317-36. Kapun, M., Schmidt, C., Durmaz, E., Schmidt, P. S. &Flatt, T. (2016b). Parallel effects of the inversion In(3R)Payne on body size across the North American and Australian clines in Drosophila melanogaster. J Evol Biol 29, 1059-72. Kapun, M., van Schalkwyk, H., McAllister, B., Flatt, T. &Schlotterer, C. (2014). Inference of chromosomal inversion dynamics from Pool-Seq data in natural and laboratory populations of Drosophila melanogaster. Mol Ecol 23, 1813-27. Karasov, T., Messer, P. &Petrov, D. (2010). Evidence that adaptation in Drosophila is not limited by mutation at single sites. PLoS Genet 6, e1000924. Karell, P., Ahola, K., Karstinen, T., Valkama, J. &Brommer, J. E. (2011). Climate change drives microevolution in a wild bird. Nat Commun 2, 208. Katoh, K., Kuma, K., Toh, H. &Miyata, T. (2005). MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research 33, 511 - 518. Kellermann, V., Loeschcke, V., Hoffmann, A. A., Kristensen, T. N., Flojgaard, C., David, J. R., et al. (2012a). Phylogenetic constraints in key functional traits behind species' climate niches: patterns of desiccation and cold resistance across 95 Drosophila species. Evolution 66, 3377-89. Kellermann, V., Overgaard, J., Hoffmann, A. A., Flojgaard, C., Svenning, J. C. &Loeschcke, V. (2012b). Upper thermal limits of Drosophila are linked to species distributions and strongly constrained phylogenetically. Proc Natl Acad Sci U S A 109, 16228-33. Kellermann, V., van Heerwaarden, B., Sgrò, C. M. &Hoffmann, A. A. (2009). Fundamental evolutionary limits in ecological traits drive Drosophila species distributions. Science 325, 1244-6. Kennington, W. J., Gockel, J. &Partridge, L. (2003). Testing for asymmetrical gene flow in a Drosophila melanogaster body-size cline. Genetics 165, 667-73. Kennington, W. J.Hoffmann, A. A. (2013). Patterns of genetic variation across inversions: geographic variation in the In(2L)t inversion in populations of Drosophila melanogaster from eastern Australia. BMC Evol Biol 13, 100. Kennington, W. J., Hoffmann, A. A. &Partridge, L. (2007). Mapping regions within cosmopolitan inversion In(3R)Payne associated with natural variation in body size in Drosophila melanogaster. Genetics 177, 549-56. Kennington, W. J., Partridge, L. &Hoffmann, A. A. (2006). Patterns of diversity and linkage disequilibrium within the cosmopolitan inversion In(3R)Payne in Drosophila melanogaster are indicative of coadaptation. Genetics 172, 1655-63. Khan, M. A., Elias, I., Sjolund, E., Nylander, K., Guimera, R. V., Schobesberger, R., et al. (2013). Fastphylo: fast tools for phylogenetics. BMC Bioinformatics 14, 334.

148

Kimura, M. (1984). The neutral theory of molecular evolution. Cambridge University Press. Kirkpatrick, M.Barton, N. (2006a). Chromosome Inversions, Local Adaptation and Speciation. Genetics 173, 419-434. Kirkpatrick, M.Barton, N. (2006b). Chromosome inversions, local adaptation and speciation. Genetics 173, 419-34. Knibb, W. (1982). Chromosome inversion polymorphisms in Drosophila melanogaster II. Geographic clines and climatic associations in Australasia, North America and Asia. Genetica 58, 213 - 221. Kofler, R.Schlotterer, C. (2012). Gowinda: unbiased analysis of gene set enrichment for genome-wide association studies. Bioinformatics 28, 2084-5. Kohonen, T. (1998). The self-organizing map. Neurocomputing 21, 1-6. Kolaczkowski, B., Kern, A. D., Holloway, A. K. &Begun, D. J. (2011). Genomic Differentiation Between Temperate and Tropical Australian Populations of Drosophila melanogaster. Genetics 187, 245-260. Korf, I. (2004). Gene finding in novel genomes. BMC Bioinformatics 5, 59. Kretschmer, M., Leroch, M., Mosbach, A., Walker, A. S., Fillinger, S., Mernke, D., et al. (2009). Fungicide-driven evolution and molecular basis of multidrug resistance in field populations of the grey mould fungus Botrytis cinerea. PLoS Pathog 5, e1000696. Krimbas, C. B.Powell, J. R. (1992). Drosophila inversion polymorphism. CRC Press. Kriventseva, E. V., Tegenfeldt, F., Petty, T. J., Waterhouse, R. M., Simao, F. A., Pozdnyakov, I. A., et al. (2015). OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software. Nucleic Acids Res 43, D250-6. Kumar, S., Stecher, G. &Tamura, K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Mol Biol Evol, msw054. Law, C. W., Chen, Y., Shi, W. &Smyth, G. K. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15, R29. Lee, S. F., Chen, Y., Varan, A. K., Wee, C. W., Rako, L., Axford, J. K., et al. (2011). Molecular basis of adaptive shift in body size in Drosophila melanogaster: Functional and sequence analyses of the Dca gene. Mol Biol Evol 28, 2393- 2402. Lenormand, T., Guillemaud, T., Bourguet, D. &Raymond, M. (1998). Appearance and sweep of a gene duplication: Adaptive response and potential for new functions in the mosquito Culex pipiens. Evolution 52, 1705-1712. Levine, M. T., Eckert, M. L. &Begun, D. J. (2011). Whole-genome expression plasticity across tropical and temperate Drosophila melanogaster populations from Eastern Australia. Mol Biol Evol 28, 249-56. Li, H. (2015). BFC: correcting Illumina sequencing errors. Bioinformatics 31, 2885-7. Li, H.Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-60. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-9. Li, H., Zhang, H., Guan, R. &Miao, X. (2013). Identification of differential expression genes associated with host selection and adaptation between two sibling insect species by transcriptional profile analysis. BMC Genomics 14, 582. Li, X., Schuler, M. A. &Berenbaum, M. R. (2007). Molecular mechanisms of metabolic resistance to synthetic and natural xenobiotics. Annual Review of Entomology, pp. 231-253.

149

Li, Y., Huang, Y., Bergelson, J., Nordborg, M. &Borevitz, J. O. (2010). Association mapping of local climate-sensitive quantitative trait loci in Arabidopsis thaliana. Proc Natl Acad Sci U S A 107, 21199-204. Liao, Y., Smyth, G. K. &Shi, W. (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923-30. Linard, B., Thompson, J. D., Poch, O. &Lecompte, O. (2011). OrthoInspector: comprehensive orthology analysis and visual exploration. BMC Bioinformatics 12, 11. Lomsadze, A., Burns, P. D. &Borodovsky, M. (2014). Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Research 42, e119-e119. Lowry, D. B.Willis, J. H. (2010). A widespread chromosomal inversion polymorphism contributes to a major life-history transition, local adaptation, and reproductive isolation. PLoS Biol 8, 1-14. Luisi, P., Alvarez-Ponce, D., Dall'Olio, G. M., Sikora, M., Bertranpetit, J. &Laayouni, H. (2012). Network-level and population genetics analysis of the insulin/TOR signal transduction pathway across human populations. Mol Biol Evol 29, 1379- 92. Lumjuan, N., McCarroll, L., Prapanthadara, L. A., Hemingway, J. &Ranson, H. (2005). Elevated activity of an Epsilon class glutathione transferase confers DDT resistance in the dengue vector, Aedes aegypti. Insect Biochem Mol Biol 35, 861- 71. Lynch, M.Walsh, B. (1998). Genetics and analysis of quantitative traits. Majoros, W. H., Pertea, M. &Salzberg, S. L. (2004). TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878-9. Manel, S., Conord, C. &Despres, L. (2009). Genome scan to assess the respective role of host-plant and environmental constraints on the adaptation of a widespread insect. BMC Evol Biol 9, 288. Marcais, G.Kingsford, C. (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764-70. Marçais, G., Yorke, J. A. &Zimin, A. (2015). QuorUM: an error corrector for Illumina reads. PLoS One 10, e0130821. Markow, T. A., Anwar, S. &Pfeiler, E. (2000). Stable isotope ratios of carbon and nitrogen in natural populations of Drosophila species and their hosts. Functional Ecology 14, 261-266. Marygold, S. J., Leyland, P. C., Seal, R. L., Goodman, J. L., Thurmond, J., Strelets, V. B., et al. (2013). FlyBase: improvements to the bibliography. Nucleic Acids Res 41, D751-7. Matsuo, T., Sugaya, S., Yasukawa, J., Aigaki, T. &Fuyama, Y. (2007). Odorant-binding proteins OBP57d and OBP57e affect taste perception and host-plant preference in Drosophila sechellia. PLoS Biol 5, e118. Matzkin, L. M. (2012). Population transcriptomics of cactus host shifts in Drosophila mojavensis. Mol Ecol 21, 2428-39. Matzkin, L. M. (2014). Ecological genomics of host shifts in Drosophila mojavensis. Ecological Genomics, pp. 233-247. Springer. Matzkin, L. M., Johnson, S., Paight, C., Bozinovic, G. &Markow, T. A. (2011). Dietary protein and sugar differentially affect development and metabolic pools in ecologically diverse Drosophila. J Nutr 141, 1127-33.

150

Matzkin, L. M.Markow, T. A. (2009). Transcriptional regulation of metabolism associated with the increased desiccation resistance of the cactophilic Drosophila mojavensis. Genetics 182, 1279-88. Matzkin, L. M.Markow, T. A. (2013). Transcriptional differentiation across the four subspecies of Drosophila mojavensis. Speciation: Natural Processes, Genetics and Biodiversity. Matzkin, L. M., Merritt, T. J., Zhu, C. T. &Eanes, W. F. (2005). The structure and population genetics of the breakpoints associated with the cosmopolitan chromosomal inversion In(3R)Payne in Drosophila melanogaster. Genetics 170, 1143-52. McKechnie, A. E.Wolf, B. O. (2010). Climate change increases the likelihood of catastrophic avian mortality events during extreme heat waves. Biol Lett 6, 253- 6. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-303. McQuilton, P., St Pierre, S. E., Thurmond, J. &FlyBase, C. (2012). FlyBase 101--the basics of navigating FlyBase. Nucleic Acids Res 40, D706-14. Mi, H., Muruganujan, A. &Thomas, P. D. (2013). PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res 41, D377-86. Micali, C. O.Smith, M. L. (2006). A nonself recognition gene complex in Neurospora crassa. Genetics 173, 1991-2004. Mizell, R. F., 3rd, Tipping, C., Andersen, P. C., Brodbeck, B. V., Hunter, W. B. &Northfield, T. (2008). Behavioral model for Homalodisca vitripennis (Hemiptera: Cicadellidae): optimization of host plant utilization and management implications. Environ Entomol 37, 1049-62. Moir, M. L., Hughes, L., Vesk, P. A. &Leng, M. C. (2014). Which host‐dependent insects are most prone to coextinction under changed climates? Ecology and evolution 4, 1295-1312. Mullen, S. P.Shaw, K. L. (2014). Insect speciation rules: unifying concepts in speciation research. Annu Rev Entomol 59, 339-61. Murali, T., Pacifico, S., Yu, J., Guest, S., Roberts, G. G. &Finley, R. L. (2011). DroID 2011: a comprehensive, integrated resource for protein, transcription factor, RNA and gene interactions for Drosophila. Nucleic Acids Research 39, D736- D743. Nadeau, N. J.Jiggins, C. D. (2010). A golden age for evolutionary genetics? Genomic studies of adaptation in natural populations. Trends Genet 26, 484-92. Navarro, A.Barton, N. H. (2003). Chromosomal speciation and molecular divergence-- accelerated evolution in rearranged chromosomes. Science 300, 321-4. Navarro, A., Betran, E., Barbadilla, A. &Ruiz, A. (1997). Recombination and gene flux caused by gene conversion and crossing over in inversion heterokaryotypes. Genetics 146, 695-709. Newcomb, R. D., Gleeson, D. M., Yong, C. G., Russell, R. J. &Oakeshott, J. G. (2005). Multiple mutations and gene duplications conferring organophosphorus insecticide resistance have been selected at the Rop-1 locus of the sheep blowfly, Lucilia cuprina. Journal of Molecular Evolution 60, 207-220. Nosil, P., Gompert, Z., Farkas, T. E., Comeault, A. A., Feder, J. L., Buerkle, C. A., et al. (2012). Genomic consequences of multiple speciation processes in a stick insect. Proc Biol Sci 279, 5058-65.

151

O'Brien, K. P., Remm, M. &Sonnhammer, E. L. (2005). Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33, D476-80. O’Grady, P. M.Markow, T. A. (2012). Rapid morphological, ecological and behavioral evolution in Drosophila: comparisons between the cactophilic repleta species group and the endemic Hawaiian Drosophila. Rapidly evolving genes & genetic systems. New York, NY: Oxford University Press. Oakeshott, J. G., Devonshire, A. L., Claudianos, C., Sutherland, T. D., Horne, I., Campbell, P. M., et al. (2005). Comparing the organophosphorus and carbamate insecticide resistance mutations in cholin- and carboxyl-esterases. Chemico- Biological Interactions 157, 269-275. Oakeshott, J. G., Farnsworth, C. A., East, P. D., Scott, C., Han, Y., Wu, Y., et al. (2013). How many genetic options for evolving insecticide resistance in heliothine and spodopteran pests? Pest Manag Sci 69, 889-896. Oakeshott, J. G., Johnson, R. M., Berenbaum, M. R., Ranson, H., Cristino, A. S. &Claudianos, C. (2010). Metabolic enzymes associated with xenobiotic and chemosensory responses in Nasonia vitripennis. Insect Mol Biol 19 Suppl 1, 147-63. Obbard, D. J., Maclennan, J., Kim, K. W., Rambaut, A., O'Grady, P. M. &Jiggins, F. M. (2012). Estimating divergence dates and substitution rates in the Drosophila phylogeny. Mol Biol Evol 29, 3459-73. Oliveira, D., Almeida, F., O'Grady, P., Armella, M., DeSalle, R. &Etges, W. (2012). Monophyly, divergence times, and evolution of host plant use inferred from a revised phylogeny of the Drosophila repleta species group. Molecular Phylogenetics and Evolution 64, 533-544. Olson-Manning, C. F., Wagner, M. R. &Mitchell-Olds, T. (2012). Adaptive evolution: evaluating empirical support for theoretical predictions. Nat Rev Genet 13, 867- 77. Orr, H. A. (1998). The population genetics of adaptation: the distribution of factors fixed during adaptive evolution. Evolution, 935-949. Orr, H. A. (2005). The genetic theory of adaptation: a brief history. Nat Rev Genet 6, 119-27. Overgaard, J., Kearney, M. R. &Hoffmann, A. A. (2014). Sensitivity to thermal extremes in Australian Drosophila implies similar impacts of climate change on the distribution of widespread and tropical species. Global Change Biology 20, 1738-1750. Paaijmans, K. P., Heinig, R. L., Seliga, R. A., Blanford, J. I., Blanford, S., Murdock, C. C., et al. (2013). Temperature variation makes ectotherms more sensitive to climate change. Glob Chang Biol 19, 2373-80. Parra, G., Bradnam, K. &Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061-7. Pegueroles, C., Ordonez, V., Mestres, F. &Pascual, M. (2010). Recombination and selection in the maintenance of the adaptive value of inversions. J Evol Biol 23, 2709-17. Peischl, S., Koch, E., Guerrero, R. F. &Kirkpatrick, M. (2013). A sequential coalescent algorithm for chromosomal inversions. Heredity 111, 200-209. Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T. C., Mendell, J. T. &Salzberg, S. L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290-5.

152

Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S. &Hoekstra, H. E. (2012). Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One 7, e37135. Pool, J. E., Corbett-Detig, R. B., Sugino, R. P., Stevens, K. A., Cardeno, C. M., Crepeau, M. W., et al. (2012). Population Genomics of sub-saharan Drosophila melanogaster: African diversity and non-African admixture. PLoS Genet 8, e1003080. Posada, D. (2004). Collapse: describing haplotypes from sequence alignments. University of Vigo, Vigo, Spain. Powell, J. R. (1997). Progress and prospects in evolutionary biology: the Drosophila model. Oxford University Press, USA. Price, A. L., Jones, N. C. &Pevzner, P. A. (2005). De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1, i351-8. Price, M. N., Dehal, P. S. &Arkin, A. P. (2010). FastTree 2--approximately maximum- likelihood trees for large alignments. PLoS One 5, e9490. Pritchard, J. K., Pickrell, J. K. &Coop, G. (2010). The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr Biol 20, R208-15. Prunier, J., Laroche, J., Beaulieu, J. &Bosquet, J. (2011). Scanning the genome for gene SNPs related to climate adaptation and estimating selection at the molecular level in boreal black spruce. Mol. Ecol. 20, 1702. Pryszcz, L. P., Huerta-Cepas, J. &Gabaldon, T. (2011). MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency- based confidence score. Nucleic Acids Res 39, e32. Radwan, J.Babik, W. (2012). The genomics of adaptation. Proc Biol Sci 279, 5024-8. Ragland, G. J.Kingsolver, J. G. (2007). Influence of seasonal timing on thermal ecology and thermal reaction norm evolution in Wyeomyia smithii. J Evol Biol 20, 2144- 53. Rajpurohit, S., Oliveira, C. C., Etges, W. J. &Gibbs, A. G. (2013). Functional genomic and phenotypic responses to desiccation in natural populations of a desert drosophilid. Mol Ecol 22, 2698-2715. Rako, L., Anderson, A. R., Sgro, C. M., Stocker, A. J. &Hoffmann, A. A. (2006a). The association between inversion In(3R)Payne and clinally varying traits in Drosophila melanogaster. Genetica 128, 373-384. Rako, L., Anderson, A. R., Sgrò, C. M., Stocker, A. J. &Hoffmann, A. A. (2006b). The association between inversion In(3R)Payne and clinally varying traits in Drosophila melanogaster. Genetica 128, 373. Rako, L., Blacket, M. J., McKechnie, S. W. &Hoffmann, A. A. (2007). Candidate genes and thermal phenotypes: identifying ecologically important genetic variation for thermotolerance in the Australian Drosophila melanogaster cline. Mol Ecol 16, 2948-57. Rane, R. V., Thu, N., Oakeshott, J. G., Hoffmann, A. &Lee, S. (in review). The Orthonome orthologue classification pipeline reveals substantially higher levels of conservation of gene sets across Drosophila genomes than previously estimated. Rane, R. V., Walsh, T. K., Pearce, S. L., Jermiin, L. S., Gordon, K. H., Richards, S., et al. (2016). Are feeding preferences and insecticide resistance associated with the size of detoxifying enzyme families in insect herbivores? Curr Opin Insect Sci 13, 70-6. Rego, C., Balanya, J., Fragata, I., Matos, M., Rezende, E. L. &Santos, M. (2010). Clinal patterns of chromosomal inversion polymorphisms in Drosophila subobscura

153

are partly associated with thermal preferences and heat stress resistance. Evolution 64, 385-97. Reinhardt, J. A., Kolaczkowski, B., Jones, C. D., Begun, D. J. &Kern, A. D. (2014). Parallel geographic variation in Drosophila melanogaster. Genetics 197, 361-73. Rispe, C., Kutsukake, M., Doublet, V., Hudaverdian, S., Legeai, F., Simon, J. C., et al. (2008). Large gene family expansion and variable selective pressures for cathepsin B in aphids. Mol Biol Evol 25, 5-17. Rius, N., Guillén, Y., Delprat, A., Kapusta, A., Feschotte, C. &Ruiz, A. (2016). Exploration of the Drosophila buzzatii transposable element content suggests underestimation of repeats in Drosophila genomes. BMC Genomics 17, 344. Robinson, M. D.Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11, R25. Rocca, K. A., Gray, E. M., Costantini, C. &Besansky, N. J. (2009). 2La chromosomal inversion enhances thermal tolerance of Anopheles gambiae larvae. Malar J 8, 147. Rumball, W., Franklin, I. R., Frankham, R. &Sheldon, B. L. (1994). Decline in heterozygosity under full-sib and double first-cousin inbreeding in Drosophila melanogaster. Genetics 136, 1039-49. Saidijam, M., Benedetti, G., Ren, Q., Xu, Z., Hoyle, C. J., Palmer, S. L., et al. (2006). Microbial drug efflux proteins of the major facilitator superfamily. Curr Drug Targets 7, 793-811. Sau, A., Tregno, F. P., Valentino, F., Federici, G. &Caccuri, A. M. (2010). Glutathione transferases and development of new principles to overcome drug resistance. Archives of Biochemistry and Biophysics 500, 116-122. Savolainen, O., Lascoux, M. &Merila, J. (2013). Ecological genomics of local adaptation. Nat Rev Genet 14, 807-20. Schaeffer, S. W.Anderson, W. W. (2005). Mechanisms of genetic exchange within the chromosomal inversions of Drosophila pseudoobscura. Genetics 171, 1729- 1739. Schaeffer, S. W., Goetting-Minesky, M. P., Kovacevic, M., Peoples, J. R., Graybill, J. L., Miller, J. M., et al. (2003). Evolutionary genomics of inversions in Drosophila pseudoobscura: evidence for epistasis. Proc Natl Acad Sci U S A 100, 8319-24. Schreiber, F.Sonnhammer, E. L. (2013). Hieranoid: hierarchical orthology inference. J Mol Biol 425, 2072-81. Seetharam, A. S.Stuart, G. W. (2013). Whole genome phylogeny for 21 Drosophila species using predicted 2b-RAD fragments. PeerJ 1, e226. Sgrò, C. M., Terblanche, J. S. &Hoffmann, A. A. (2015). What can plasticity contribute to insect responses to climate change? Annu Rev Entomol. Sharakhov, I. V., White, B. J., Sharakhova, M. V., Kayondo, J., Lobo, N. F., Santolamazza, F., et al. (2006). Breakpoint structure reveals the unique origin of an interspecific chromosomal inversion (2La) in the Anopheles gambiae complex. Proc Natl Acad Sci U S A 103, 6258-62. Shi, G., Peng, M. C. &Jiang, T. (2011). MultiMSOAR 2.0: an accurate tool to identify ortholog groups among multiple genomes. PLoS One 6, e20892. Shi, G., Zhang, L. &Jiang, T. (2010). MSOAR 2.0: Incorporating tandem duplications into ortholog assignment based on genome rearrangement. BMC Bioinformatics 11, 10.

154

Shi, H., Pei, L., Gu, S., Zhu, S., Wang, Y., Zhang, Y., et al. (2012). Glutathione S- transferase (GST) genes in the red flour beetle, Tribolium castaneum, and comparative analysis with five additional insects. Genomics 100, 327-35. Shiao, M. S., Chang, J. M., Fan, W. L., Lu, M. Y., Notredame, C., Fang, S., et al. (2015). Expression divergence of chemosensory genes between Drosophila sechellia and its sibling species and its implications for host shift. Genome Biol Evol 7, 2843-58. Shou-Min, F. (2012). Insect glutathione S-transferase: a review of comparative genomic studies and response to xenobiotics. Bulletin of Insectology 65, 265-271. Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. &Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210-2. Simon, J. C., d'Alencon, E., Guy, E., Jacquin-Joly, E., Jaquiery, J., Nouhaud, P., et al. (2015). Genomics of adaptation to host-plants in herbivorous insects. Brief Funct Genomics 14, 413-23. Sinervo, B., Mendez-de-la-Cruz, F., Miles, D. B., Heulin, B., Bastiaans, E., Villagran- Santa Cruz, M., et al. (2010). Erosion of lizard diversity by climate change and altered thermal niches. Science 328, 894-9. Singer, M. C.Parmesan, C. (2010). Phenological asynchrony between herbivorous insects and their hosts: signal of climate change or pre-existing adaptive strategy? Philosophical Transactions of the Royal Society of London B: Biological Sciences 365, 3161-3176. Skerl, M. I. S.Gregorc, A. (2010). Heat shock proteins and cell death in situ localisation in hypopharyngeal glands of honeybee (Apis mellifera carnica) workers after imidacloprid or coumaphos treatment. Apidologie 41, 73-86. Slater, G. S.Birney, E. (2005). Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31. Smadja, C., Shi, P., Butlin, R. K. &Robertson, H. M. (2009). Large gene family expansions and adaptive evolution for odorant and gustatory receptors in the pea aphid, Acyrthosiphon pisum. Mol Biol Evol 26, 2073-86. Smith, J. M.Haigh, J. (1974). The hitch-hiking effect of a favourable gene. Genet Res 23, 23-35. Somero, G. (2010). The physiology of climate change: how potentials for acclimatization and genetic adaptation will determine ‘winners’ and ‘losers’. The Journal of experimental biology 213, 912-920. Sonnhammer, E. L.Koonin, E. V. (2002). Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 18, 619-20. Soto, I. M., Carreira, V. P., Corio, C., Padro, J., Soto, E. M. &Hasson, E. (2014). Differences in tolerance to host cactus alkaloids in Drosophila koepferae and D. buzzatii. PLoS One 9, e88370. Stanke, M., Tzvetkova, A. &Morgenstern, B. (2006). AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biology 7, 1-8. Strong, D. R., Lawton, J. H. &Southwood, S. R. (1984). Insects on plants. Community patterns and mechanisms. Blackwell Scientific Publicatons. Sun, Y., Sheng, Y., Bai, L., Zhang, Y., Xiao, Y., Xiao, L., et al. (2014). Characterizing heat shock protein 90 gene of Apolygus lucorum (Meyer-Dur) and its expression in response to different temperature and pesticide stresses. Cell Stress Chaperones 19, 725-39.

155

Supek, F., Bošnjak, M., Škunca, N. &Šmuc, T. (2011). REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One 6, e21800. Suyama, M., Torrents, D. &Bork, P. (2006). PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Research 34, W609-W612. Swarup, S., Morozova, T. V., Sridhar, S., Nokes, M. &Anholt, R. R. (2014). Modulation of feeding behavior by odorant-binding proteins in Drosophila melanogaster. Chem Senses 39, 125-32. Takahata, N. (1990). A simple genealogical structure of strongly balanced allelic lines and trans-species evolution of polymorphism. Proc Natl Acad Sci U S A 87, 2419-23. Tange, O. (2011). Gnu parallel-the command-line power tool. The USENIX Magazine 36, 42-47. Tarailo‐Graovac, M.Chen, N. (2009). Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics, 4.10. 1- 4.10. 14. Tatusov, R. L., Koonin, E. V. &Lipman, D. J. (1997). A genomic perspective on protein families. Science 278, 631-7. Tekaia, F. (2016). Inferring orthologs: open questions and perspectives. Genomics Insights 9, 17-28. Tenreiro, S., Nunes, P. c. A., Viegas, C. A., Neves, M. S., Teixeira, M. C., Cabral, M. G., et al. (2002). AQR1 gene (ORF YNL065w) encodes a plasma membrane transporter of the major facilitator superfamily that confers resistance to short- chain monocarboxylic acids and quinidine in Saccharomyces cerevisiae. Biochemical and biophysical research communications 292, 741-748. Teshima, K. M.Innan, H. (2008). Neofunctionalization of duplicated genes under the pressure of gene conversion. Genetics 178, 1385-98. Thomas, C. (2010). Climate, climate change and range boundaries. Divers. Distrib. 16, 488. Thomas, C. D. (2011). Translocation of species, climate change, and the end of trying to recreate past ecological communities. Trends Ecol Evol 26, 216-21. Tribolium Genome Sequencing, C. & Richards, S. & Gibbs, R. A. & Weinstock, G. M. & Brown, S. J. & Denell, R., et al. (2008). The genome of the model beetle and pest Tribolium castaneum. Nature 452, 949-55. Turner, T. L., Levine, M. T., Eckert, M. L. &Begun, D. J. (2008). Genomic analysis of adaptive differentiation in Drosophila melanogaster. Genetics 179, 455-73. Umina, P. A., Weeks, A. R., Kearney, M. R., McKechnie, S. W. &Hoffmann, A. A. (2005). A rapid shift in a classic clinal pattern in Drosophila reflecting climate change. Science 308, 691-3. Valle, M., Schabauer, H., Pacher, C., Stockinger, H., Stamatakis, A., Robinson- Rechavi, M., et al. (2014). Optimization strategies for fast detection of positive selection on phylogenetic trees. Bioinformatics 30, 1129-1137. Van Dongen, S. M. (2001). Graph clustering by flow simulation. Utrecht University. Vilella, A. J., Severin, J., Ureta-Vidal, A., Heng, L., Durbin, R. &Birney, E. (2009). EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 19, 327-35. Walker, B. J., Abeel, T., Shea, T., Priest, M., Abouelliel, A., Sakthikumar, S., et al. (2014). Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963.

156

Warren, R., VanDerWal, J., Price, J., Welbergen, J., Atkinson, I., Ramirez-Villegas, J., et al. (2013). Quantifying the benefit of early climate change mitigation in avoiding biodiversity loss. Nature Climate Change 3, 678-682. Waterhouse, R. M., Zdobnov, E. M., Tegenfeldt, F., Li, J. &Kriventseva, E. V. (2011). OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011. Nucleic Acids Res 39, D283-8. Weeks, A. R., McKechnie, S. W. &Hoffmann, A. A. (2002). Dissecting adaptive clinal variation: markers, inversions and size/stress associations in Drosophila melanogaster from a central field population. Ecology Letters 5, 756-763. Wehrens, R.Buydens, L. M. (2007). Self-and super-organizing maps in R: the Kohonen package. J Stat Softw 21, 1-19. White, B., Hahn, M., Pombi, M., Cassone, B., Lobo, N., Simard, F., et al. (2007). Localization of candidate gene regions maintaining a common polymorphic inversion (2La) in Anopheles gambiae. PLoS Genet 3, 2404 - 2414. Wu, T. D.Nacu, S. (2010). Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873-81. Xu, Z.Wang, H. (2007). LTR_FINDER: an efficient tool for the prediction of full- length LTR retrotransposons. Nucleic Acids Res 35, W265-8. Xu, Z. B., Zou, X. P., Zhang, N., Feng, Q. L. &Zheng, S. C. (2015). Detoxification of insecticides, allechemicals and heavy metals by glutathione S-transferase SlGSTE1 in the gut of Spodoptera litura. Insect Sci 22, 503-11. Yampolsky, L. Y., Zeng, E., Lopez, J., Williams, P. J., Dick, K. B., Colbourne, J. K., et al. (2014). Functional genomics of acclimation and adaptation in response to thermal stress in Daphnia. BMC Genomics 15, 859. Yassin, A., Debat, V., Bastide, H., Gidaszewski, N., David, J. R. &Pool, J. E. (2016). Recurrent specialization on a toxic fruit in an island Drosophila population. Proc Natl Acad Sci U S A 113, 4771-6. You, M., Yue, Z., He, W., Yang, X., Yang, G., Xie, M., et al. (2013a). A heterozygous moth genome provides insights into herbivory and detoxification. Nature Genetics 45, 220-225. You, M., Yue, Z., He, W., Yang, X., Yang, G., Xie, M., et al. (2013b). A heterozygous moth genome provides insights into herbivory and detoxification. Nat Genet 45, 220-5. Zaharia, M., Bolosky, W. J., Curtis, K., Fox, A., Patterson, D., Shenker, S., et al. (2011). Faster and more accurate sequence alignment with SNAP. arXiv preprint arXiv:1111.5572. Zhang, G. Y., Liu, X., Quan, Z. W., Cheng, S. F., Xu, X., Pan, S. K., et al. (2012). Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential. Nature Biotechnology 30, 549-+. Zhang, M.Leong, H. W. (2012). BBH-LS: an algorithm for computing positional homologs using sequence and gene context similarity. BMC Systems Biology 6. Zhao, H. W.Haddad, G. G. (2011). Review: Hypoxic and oxidative stress resistance in Drosophila melanogaster. Placenta 32 Suppl 2, S104-8. Zimin, A. V., Marcais, G., Puiu, D., Roberts, M., Salzberg, S. L. &Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics 29, 2669-77.

157

158

Appendix 1 (Paper)

Attached as additional PDF document

159

Appendix 2 (Chapter 2)

Table S2.1. Information of primers used to verify homozygosity of the isochromosomal lines.

Chromosome Dm-name CG name Forward sequence (5’ to 3’) Reverse sequence (5’ to 3’) gDNA Extension Reference arm size time 3L CG4769-PA CG4769 GCAAGGGCGGAGAGGATTACATCTTCTCGC CAGGAAAGTGGCCACGTCCTTGGCCAGCTG 315 41s 3L CG7638-PA CG7638 AGGAGATGGTGACTTTAATATTATTGAGTC GGTGAAGTAGCTCCACCAGGTGGCAAAGCC 187 24s 3L CG4169-PA CG4169 ATGGCCTGTAACGCTAGCAAGACGTCGCTC GCGGTGGCTACCACCAGTTTGTTCTCCAGC 281 35s 3R Dca CG7390 CCGTCTTCAAGGTYAATCCA TGGGTGGTTGGAATTTTGAT 122 15s (Lee et al., 2011) 3R Akt1 CG4006 GACTGGTGGGGCACTGGC AAACGGAAGACGACCACAGAT 120 15s 3R Pi3K92E CG4141 GGATTTGGACTACCTACGGGAAACT CGTGCTTTTTCCTCCGTGTAG 120 15s 3R InR CG18402 GGACTCCTGTCCGAAAGGCTATT GCACTCGTTCGCCGTYAC 120 15s 3R Tsc1 CG6147 GCAGATGCTACAAGARAGCAGCA TTCTCGTCCAGRCTRTGCCTCAGA 120 15s 3R Crc-PA CG9429 ATGATGTGGTGCAAAACAGTGATAGTGTTG CCGGGGGTGAGGACGAATTTTCCGAACTCC 286 36s 3R CG3731-PB CG3731 CGGTGGCGCCAACAACGCCTCCAACCTGGC GCCTCGGTAACCATGGTGCACAGACGCATC 336 42s 3R CG4538-PB CG4538 AATCAGCCGCCAGTCCCATGGACAACGACC CGCGGCTCGGTTAGGATGCGCACCAGCATG 300 38s 3R CG17370- CG17370 GCCCAGCCGGCGCTGCTGTATCTAGTGCCC TCAGACTTCCAGTTGTTTTGATGGTTGTTG 203 27s PC 3R CG17565 AGAAGGTGTCCACCACCACATCC ATTTCTCCACCGAACTCTCCGTCTT 116 15s CAGCTCCCAATGTGCCTCAC GCCAGGGGCCATTACTCCTAC 3R CG13630 117 15s 3R In(3R)P-SNP TTTGCCGCAAATTATTGTGAG (Anderson _outer_1 45s et al., 2005) 3R In(3R)P-SNP ATCGCGTGCAGGTTGGC (Anderson _outer_2 et al., 2005)

160

Table S2.2.List of lines used for preparation of the RADSeq library.

No. Innisfail Innisfail Yering P1 barcode (inverted) (standard) Station (standard) 1 In1 In4 YS1 GCTGA

2 In2 In18 YS7 GTAGT 3 In6 In19 YS11 AACCA

4 In9 In20 YS15 ACGCT 5 In12 In21 YS17 CTTGG

6 In14 In25 YS21 GGTTG 7 In24 In31a YS22 TAGTA

8 In28 In32 YS36 TTACC 9 In31b In33 YS51 ACAGTG

10 In37 In35 YS57 TATAAT 11 In39 In40 YS59 AGTCAA 12 In49 In42 YS65 TAGTTT 13 In54 In44 YS69 GAGTGG 14 In65 In45 YS75 GCCAAT 15 In66b In46 YS104 CCAACA 16 In72 In58 YS126 CTGTAA 17 In73 In67 YS129 ATGTCAC 18 In75 In77b YS131 GTGAAAC 19 In77 -- -- AGTTCCA P2 ACTG CCGAG GACGAC Barcode Illumina ATCACG CGATGT TTAGGC Barcode

Column 5 denotes the P1 adaptor barcode for indexing individual lines within a population while the last row lists the P2 barcodes for indexing each of the three populations. This design allowed multiplexing of 19x3 samples with 22 barcodes.

161

Table S2.3. Number of SNV’s per population comparison, excluding loci with FST = 0.

Population comparison 3L 3R IiIs 5304 8249 IiYs 9924 12746 IsYs 9351 12214

Table S2.4. Outlier SNV distribution and density across 3R.

IiYs IsYs IiIs Within In(3R)P region No. of SNV’s 745 564 494 SNV’s / Mb 82 62.6 54.8 Outside In(3R)P region No. of SNV’s 140 321 390 SNV’s / Mb 11.6 26.75 32.5

Table S2.5: Percent of SNV's mapped to genic regions. Assuming 3L as a null hypothesis, there is considerable enrichment of outlier SNV's localizing to 5' UTR and coding regions after normalizing for original SNV distribution, inversion karyotpye. Furthermore outliers in the coding region present enrichment for non-synonymous coding changes.

Attached as an additional excel sheet

Table S2.6. Outlier genes found to be significantly differentiating between the populations and karyotypes. Attached as an additional excel sheet

Table S2.7. Gene ontologies found to be significantly enriched between the populations and karyotypes by analysis. Attached as an additional excel sheet

162

Appendix 3 (Chapter 3)

Supplementary methods

The implementation of Orthonome consists of five major steps:

Step 1: Processing input files

Orthonome uses whole genome annotations in GFF3 format and genomes in FASTA format to predict orthology and syntenic relationships between genes across species.

The processing script Gff2gene_zeroidx.py tolerates annotations with gene models split across more than one contig/scaffold, making it usable even with splice alignment- based gene prediction from fragmented draft assemblies, such as those produced by

Scipio (Hatje et al., 2011) and ExonMatchSolver-pipeline (Indrischek et al., 2016).

Orthonome preferentially chooses the longest isoform in cases where alternately spliced variants are predicted.

Sequence similarity matrices for genes in each species pair are then calculated using

BLASTP (Altschul et al., 1990) and Smith-Waterman alignments for each within and between species combination.

Step 2: Pair-wise species comparisons and inparalogue identification

Following MSOAR (Shi et al., 2010), for each pair of species, the corresponding

BLASTP similarity matrices are processed to create clusters of genes with similar domains using MCL clustering (Van Dongen, 2001). Both syntenic and non-syntenic inparalogues are then identified using neighbour-joining trees generated using DNA distance matrices (Khan et al., 2013). These are generated from codon alignments constructed for each gene cluster using MAFFT (L-INS-i) (Katoh et al., 2005) and

PAL2NAL (v14) (Suyama et al., 2006). The genes identified as ancestral based on the phylogeny are retained for further analysis in Step 3. 163

Step 3: Score adjustment for remaining genes and pairwise orthologue prediction

For each gene retained above, we estimate the five most similar genes from each species pair, based on fragment bias-corrected pairwise scores (S’) but also with BLASTP bit scores of at least 30, e-values <= 10e-7 and Smith-Waterman alignments spanning

>50% of gene lengths. S’ scores are calculated for gene pair Ga1 and Ga2 using a Smith

Waterman alignment with BLOSUM50 substitution matrix as:

푆푚𝑖푡ℎ푊푎푡푒푟푚푎푛퐵퐿50[퐺 푣푠 퐺 ] 푆′ = ( 푎1 푎2 ) ∗ 푅푎푡𝑖표 푝푎𝑖푟푤𝑖푠푒 𝑖푑푒푛푡𝑖푡푦 ∗ 100 푃푎𝑖푟푤𝑖푠푒 푎푙𝑖𝑔푛푚푒푛푡 푙푒푛𝑔푡ℎ (푏푝)

This modified Smith-Waterman scoring approach allows Orthonome to utilise both complete and incomplete gene models in relatively fragmented genomes, since scores are normalised by the length of the aligned segment of the gene, enabling the pipeline to identify relationships even between fully and incompletely annotated genes as long as the requirements for synteny are fulfilled.

Step 4: Orthologue prediction and re-calibration to include positional orthologues

Orthonome uses the modified Smith-Waterman scores of the genes and their corresponding coordinates as inputs for the MSOAR program to identify the orthologous relationships between genes of each species pair. Due to the more sensitive identification of orthogonal relationships using the S’ score, phylogenetically close neighbours including positional inparalogues are re-analysed to re-compute orthologue calls after an initial MSOAR run. This allows Orthonome to extend the MSOAR analysis and recover genome orthologues incorrectly identified as inparalogues, based on S’ scores and syntenic relationships.

The pairwise orthologues obtained in the earlier step are then used to create clusters of genes using the MCL algorithm. FastTree2 is then applied to orthogonal clusters, using

164

only one gene from each species, in order to construct a consensus phylogenetic tree based on concatenated codon alignments.

The gene families constructed using an all by all matrix of sequence similarity are then reconciled into orthologous clusters by MultiMSOAR2.0 (Shi et al., 2011). Tree reconciliation and formation of orthogonal sets of orthologues further utilises the S’ score to quantify sequence similarity between genes. These orthogonal sets of orthologues are termed orthogroups (OG).

Step 5: Orthogroup clustering and evolutionary classification

All outputs from Steps 3 and 4 are collated to form clusters of orthogroups based on domain similarity. These clusters, termed Super Orthogroups (SoGs), consist of orthologues (satisfying at least two of the bi-directional sequence similarity, phylogenetic and syntenic evidences), inparalogues and gene births. SoGs are clustered based on the similarity of domains in each orthogroup.

The pipeline produces tabulated, database-tool-friendly formats which are utilised to build the web interface. All final collations of data and SOGs created by Orthonome are then updated into an online database, currently implemented to include only the 20

Drosophila species analysed in the current study.

Computational requirements

The pipeline has been tested on Linux servers with four cores and 64 GB memory hosted on the NeCTAR research cloud (http://nectar.org.au/) and requires peak memory usage of 8 GB. Using this infrastructure, the novel map-reduce implementation in

Orthonome reduces the time required by the BLASTP search, Smith-Waterman alignment scoring and orthology assessment by 5-10 fold, albeit with phylogenetically more distant species still taking longer to compare than more closely related species. 165

Such programming models often involve implementing calculations and computational tasks in a parallel distributed manner. The map-reduce approach to pipeline creation allows optimal usage of computing resources over a computing grid or local computers and is therefore highly scalable from the 20 genomes tested here to hundreds of genomes. An optimized pre-processing step for generation of sequence similarity scores as aforementioned and a 64-bit recompilation of the core algorithms permit completion of pairwise comparative analyses up to three times faster than the original MSOAR2 implementation.

Web interface

A proof of concept web interface is currently available at www.orthonome.com. The web interface allows users to search the database using FlyBase gene names (Marygold et al., 2013) to view the entire orthogroup and super orthogroup (inparalogues and orthogroups are then clustered by domain) to which the gene belongs. The output can be directly imported into other data manipulation software. Along with orthologues, the search result also contains information regarding genes tagged as duplications/inparalogues in every super orthogroup.

Supplementary notes

Supplementary note 1: Correlation between genome quality and orthologue retrieval

To evaluate the relationship between genome quality and orthologue retrieval by

Orthonome, we calculated N50 statistics for all twelve FlyBase and eight modENCODE

Drosophila genomes (Table S3.1). When comparing the N50 statistic to the average number of 1:1 orthologues for every species, we found the draft genomes clustering towards a lower intersect of the two axes as opposed to the twelve higher quality 166

genomes (Fig. S3.1a). We found a similar trend when comparing the number of annotated gene models in a species with the average number of 1:1 orthologues for every species (Fig. S3.1b). The genome and annotation statistics were both positively correlated with the efficiency of orthologue retrieval in Orthonome (r = 0.58, P = 0.0075

& r = 0.67, P = 0.0012 respectively). We also found a significant negative correlation between genome N50 statistic and the number of gene births (r = -0.47, P = 0.034).

Supplementary note 2: Performance of orthologue identification pipelines compared to Orthonome

To ascertain the gain in orthologue identification capacity due to the additional features of the Orthonome pipeline, we utilised the peptide sequences used in Orthonome to identify orthologue clusters with OrthoDB, and the same input data for MSOAR2 as for

Orthonome. For the twelve high quality FlyBase genomes we found that Orthonome was able to identify 9,538 1:1 (n=all) orthogroups while OrthoDB and MultiMSOAR were able to identify 6,621 and 9,595 such orthogroups respectively (Table S3.2).

To assess the effect of the lower quality of the draft genomes on orthologue recovery, we compared the analyses of all twenty genomes using Orthonome and OrthoDB. We found that the addition of the eight draft genomes significantly reduced the number of

1:1 (n=all) orthogroups in both pipelines. We noted a 41% reduction in 1:1 (n=all) orthogroups in OrthoDB (3,912 down from 6,621 for the twelve species analysis).

Orthonome on the other hand produced 6,541 1:1 (n=all) orthogroups, which is 31% down from the 9,538 orthogroups identified for the twelve high quality genomes (Table S3.2) but still 67% more than those predicted by OrthoDB. Therefore Orthonome appears more tolerant towards genome assembly and annotation errors that may have been introduced by draft genomes which have not been extensively curated like the twelve

FlyBase genomes.

167

When comparing the number of one-to-one orthologues identified in every pairwise comparison among species, we found an average increase of 2% (average of 195) in orthologue counts in Orthonome compared to MSOAR2. Since we observed a similar increase in orthologue count when comparing only the twelve high quality genomes or the eight draft genomes, we find that the usage of the S’ scores compared to BLAST scores significantly increases the capacity of Orthonome to identify orthologous relationships.

We also noted an increase in orthologue counts by Orthonome in pairwise comparisons involving D. melanogaster despite it already being the best quality genome in our analysis (Table S3.4). Since the recovery of orthologues in a pairwise comparison could be negatively affected by fragmented genes, leading to lower BLAST scores, even comparisons between a high quality genome and a lower quality genome will result in a lower number of orthologues. The S’ score calculates sequence similarity between genes by tolerating fragmentation of gene models, making it possible for Orthonome to capture a greater number of orthologues supported by phylogenetic and synteny measures.

One example of orthologue pairs only identified by Orthonome and missed by

MSOAR2 as well as OrthoDB is the D. melanogaster gene FBgn0004554 and its D. sechellia counterpart. As shown in Fig. 3.3A/B in the main text we found that most genes around FBgn0004554 had orthologues in D. melanogaster identified by all pipelines (black lines), barring two genes that are very small and present no sequence similarity to any other genes. FBgn0004554 on the other hand has an extremely low

BLAST score with its syntenic counterpart in D. sechellia FBgn0170274 and aligned only the first 24 amino acids. However the S’ scoring system in Orthonome enabled it to correctly identify the two genes as orthologous to one another by calculating a higher

168

score that takes into account the entire sequence length. This gene pair also satisfies the requirement of synteny and is therefore identified as a true orthologue.

Supplementary Figures

Figure S3.1: Correlation between genome assembly and annotation quality and orthologue identification. (a) N50 statistic vs. average number of 1:1 orthologues for each species. (b) Number of annotated gene models in a species with the average number of 1:1 orthologues for every species.

a b

169

Supplementary Tables

Table S3. 1: Genome, orthologue, duplication statistics and data sources for all 20

Drosophila species analysed in the current study. These include percentage of genes identified as orthologues by Orthonome and OrthoDB as well as genome statistics for each genome based on BUSCO analyses.

Attached as excel sheet

Table S3.2: Total number of 1:1 (n=all) orthologroups identified by the Orthonome and

OrthoDB pipelines in the 12 and 20 genome sets

Pipeline 12 Genome analysis 20 Genome analysis % drop due to draft genomes Orthonome 9538 6541 31 OrthoDB 6621 3912 41 Improved 44% 67% performance of Orthonome

170

Table S3.3: Comparison between OrthoDB and Orthonome for number of orthogroups corresponding to eleven rapidly evolving gene families in

Drosophila species

Gene Family ABCs ACHRs CCEs GRs GSTs HSPs MFS OBPs ORs P450s UGT

Total genes in Drosophila melanogaster 56 12 23 60 34 9 66 52 61 87 17

Orthonome: Number of unique orthogroups in 55 12 23 57 33 9 65 51 58 77 16

OrthoDB: Number of unique orthogroups in 53 12 22 57 29 8 62 50 55 67 16

Orthonome: Number of orthogroups with conserved 37 8 16 33 17 7 45 30 33 51 9 1-to-1 sets including all 12 species OrthoDB: Number of orthogroups with conserved 24 5 11 17 9 3 28 20 13 22 5 1-to-1 sets including all 12 species Orthonome: Improvement of orthogroup 3.77 0.00 4.55 0.00 13.79 12.50 4.84 2.00 5.45 14.93 0.00 identification compared to OrthoDB (%) Orthonome: Improvement of conserved 1-to-1 54.17 60.00 45.45 94.12 88.89 133.33 60.71 50.00 153.85 131.82 80.00 orthogroup identification compared to OrthoDB (%)

**Families: ABC/ABC-like transporters (ABCs), acetyl-choline receptors (ACHRs), carboxyl/cholinesterases (CCEs), gustatory receptors (GRs), glutathione S-transferases (GSTs), Heatshock proteins (HSPs), Major facilitator superfamily of transporters (MFS), Odorant binding proteins (OBPs), odorant receptors (ORs), cytochrome P450s (P450s) and UDP-glucuronosyltransferase (UGTs).

171

Table S3.4: Median and mean gain of orthologue pairs in Orthonome compared to

MSOAR2 for the 20 Drosophila species

Species Source Median Mean Contig N50 BUSCO completion Drosophila ananassae FlyBase 172 186.47 99399 98% Drosophila erecta FlyBase 142 170.26 455916 99% Drosophila grimshawi FlyBase 180 192.21 92106 99% Drosophila melanogaster FlyBase 113 143.53 21485538 98% Drosophila mojavensis FlyBase 202 218.63 124510 98% Drosophila persimilis FlyBase 219 232.63 20311 93% Drosophila pseudoobscura FlyBase 607 588.16 203957 93% Drosophila sechellia FlyBase 118 168.53 42955 96% Drosophila simulans FlyBase 166 201.21 15927 84% Drosophila virilis FlyBase 182 210.53 123628 99% Drosophila willistoni FlyBase 169 197.26 183168 99% Drosophila yakuba FlyBase 112 22.63 255016 98% Drosophila biarmipes modENCODE 182 196.05 149400 96% Drosophila bipectinata modENCODE 201 84.32 467944 97% Drosophila elegans modENCODE 188 200.26 937343 96% Drosophila eugracilis modENCODE 162 173.05 658447 95% Drosophila ficusphila modENCODE 161 183.21 829588 95% Drosophila kikkawai modENCODE 159 168.32 575074 96% Drosophila takahashii modENCODE 175 183.42 298321 94% Drosophila rhopaloa modENCODE 172 184.79 35496 93%

172

Appendix 4 (Chapter 4)

Table S4.1: Data origin, diet, pesticide resistance cases and genome information for species analysed in the current study

Attached as an additional excel sheet Table S4.2: Correlation between total and normalised gene counts for each gene family.

Gene counts were normalised by total remaining gene content in every species after subtracting the sum of eight gene families. Cells in gray are non-significant correlations

Correlations for total count P450s CCEs GSTs UGTs ABCs MFS OBPs HSPs P450s 1 CCEs .631** 1 GSTs .593** .358** 1 UGTs .667** .724** .568** 1 ABCs .533** .605** .455** .638** 1 MFS .478** .817** .386** .663** .460** 1 OBPs .161 .001 .607** .040 .018 .030 1 HSPs .304* .551** .540** .471** .422** .603** .329* 1 Correlations for normalised count P450s CCEs GSTs UGTs ABCs MFS OBPs HSPs P450s 1 CCEs .517** 1 GSTs .639** 0.139 1 UGTs .578** .615** .386** 1 ABCs .426** .280* 0.174 .309* 1 MFS .355** .697** 0.133 .453** -0.006 1 OBPs .334* -0.119 .633** 0.025 -0.061 -0.081 1 HSPs .437** .444** .404** .349** 0.209 .535** 0.264 1 ** Correlation is significant at the 0.01 level (2- tailed). * Correlation is significant at the 0.05 level (2- tailed).

Table S4.3 Summary table for order specific tests of gene family expansions due to diet and host range

Attached as an additional excel sheet

173

Figure S4.1: A comparison between numbers of genes in each order for each of the eight gene families analysed in this study. The total size of every stacked bar-plot indicates the total number of genes while the part of the bar coloured only orange indicates orthologues and only blue indicates duplications.

174

Gene families: ABC/ABC-like transporters (ABC), carboxyl/cholinesterase (CCE), glutathione S-transferase (GST), Heatshock proteins (HSP), Major facilitator superfamily of transporters (MFS), Odorant binding proteins (OBP), cytochrome P450 (P450), UDP-glucuronosyltransferase (UGT).

175

Figure S4.2: A comparison between normalised numbers of genes in each order for each of the eight gene families analysed in this study. The total size of every stacked bar-plot indicates the total number of genes while the part of the bar coloured only orange indicates orthologues and only blue indicates duplications.

176

Gene families: ABC/ABC-like transporters (ABC), carboxyl/cholinesterase (CCE), glutathione S-transferase (GST), Heatshock proteins (HSP), Major facilitator superfamily of transporters (MFS), Odorant binding proteins (OBP), cytochrome P450 (P450), UDP-glucuronosyltransferase (UGT).

177

Figure S4.3: A comparison between numbers of genes in each order for a fixed array of lineages for four gene families (ABCs, CCEs, GSTs, and

P450s). The total size of every stacked bar-plot indicates the total number of genes while the part of the bar coloured only orange indicates orthologues and only blue indicates inparalogues.

178

Gene families: ABC/ABC-like transporters (ABC) type B and C, carboxyl/cholinesterase (CCE) (dietary/detox), glutathione S-transferase (GST) (delta, epsilon) and cytochrome P450 (P450) Cyp6.

179

Figure S4.4: Distribution of total gene counts (green) and duplications (orange) in detoxification associated lineages in each of the four gene families for diet and phagy combinations in every order. The graph summarises raw counts (A) and normalised counts (B) for the same.

Gene families: ABC/ABC-like transporters (ABC) type B and C, carboxyl/cholinesterase

180

Appendix 5 (Chapter 5)

Figure S5.1: Scheme for heat stress and recovery assay. The zero time point was used as a control with two time points during the stress phase and five time points in the recovery phase, including one at 24 hours to assess long-term effects of the stress on recovery. The red line indicates during-stress sampling points for D. buzzatii and green indicates the same for D. hydei. All blue dots indicate time points sampled for both species.

Figure S5.2: Gene ontology cluster enrichment for 971 repleta group specific orthogroups after annotation using gene ontology Sets.

Figure S5.3: Gene ontology enrichments for all species specific duplications after semantic similarity based clustering. FDR adjusted FET P-value < 0.05 with minimum

GO term similarity of 0.3. Numbers at the end of the bars denote the number of enriched genes related to the function in the test set versus the background set.

181

Fig S5.4: Principal component analysis of expression variation amongst the top 2000 most variable genes. Clustering of timepoints across both species indicate independent gene sets active on during-stress response as well as recovery.

182

Fig S5.5: SuperSOM analysis of orthologue expression. A) Correlation of mean expression levels of orthologues B) Assignment of orthologues to nodes in the

SuperSOM C/D) Codebook vectors for the D. hydei/buzzatii components of the

SuperSOM. Nodes are coloured based on hierearchical clustering of their codebook vectors.

183

Fig S5.6: Species specific expression of SSOM cluster 1. The orthologous genes in

SSOM cluster 1 displayed no difference in expression profile between D. hydei and D. buzzatii while also being differentially expressed compared to control in stress or recovery.

1

● 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● n ● ● ● ● ● ● ● ● ●

o

i

s 0

s ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● e ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● r ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● p ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● x ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● E ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

−2 ● ● ● ● ● ●

● −3 ● 0 0.3 0.55 1 2 3 4 24 Timepoint

Species D. hydei D. buzzatii

184

Table S5.1: Scaffold and contig length based statistics as well as results from BUSCO genome assessment using the dataset for the five assembled repleta group genomes and

D. melanogaster.

Assembly parameter D. aldrichi D. mojavensis D. buzzatii D. hydei D. repleta D.

melanogaster

Number of scaffolds 2,620 6,841 826 8,115 6,611 15

Total size of scaffolds 190,651,39 193,826,310 161,490,85 165,796,18 164,461,34 168,736,537

9 1 1 5

N50 scaffold length 1,029,703 24,764,193 1,380,941 2,316,993 3,476,135 23,011,544

L50 scaffold count 53 4 30 20 14 4

Gaps in genome 4.48 7.03 9.27 2.22 2.09 3.77

(%nucleotides)

Average number of contigs 3.40 1.70 13.00 1.20 1.40 2506.50

per scaffold

Longest contig (post 1,024,094 1,453,876 448,794 4,096,336 1,201,580 27,905,053

assembly)

N50 contig length (post 83,768 124,510 28,092 468,470 207,004 19,435,691

assembly)

BUSCO stats (N = 2,675 C:95%[D:1 C:97%[D:4.4 C:86%[D:4. C:96%[D:7. C:96%[D:5. C:98%[D:5.0

genes)* 0%] 0%] 0%] 9%] 9%] 0%]

F:3.0%,M:1 F:2.20%;M:0. F:2.7%,M:1 F:2.9%,M:0 F:2.8%,M:0 F:1.40%;M:0.

.5% 50% 0% .2% .6% 40%

*BUSCO statistics consist of [C]omplete gene models, [D]uplications, [F]ragmented and [M]issing genes.

185

Table S5.2: Repeat content analysis for the six species analysed in this study characterising transposable elements as well as tandem repeats.

Type (as % D. aldrichi D. mojavensis D. buzzatii D. hydei D. repleta D. melanogaster

of genome)

DNA 11.56 3.4 4.79 5.03 9.73 1.49

LINE 4.55 2.45 1.57 2.25 4.52 4.77

SINE 0.01 0 0 0.02 0.03 0

LTR 7.58 6.45 1.47 4.25 8.48 11.73

Other 0.01 0 0 0.01 0.01 0

Unknown 0.87 2.88 0.6 0.42 0.84 0.9

DNA

repeats

Total 24.58 15.18 8.43 11.97 23.60 18.89

*DNA – DNA repeats, LINE - long interspersed nuclear elements, SINE – short interspersed nuclear elements, LTR – long terminal repeats

186

Table S5.3: Genome annotation statistics for the five assembled repleta group genomes and D. melanogaster. Statistics to evaluate gene, transcript and exon length as well as alternate splice events were calculated for every genome.( *BUSCO statistics consist of [C]omplete gene models, [D]uplications, [F]ragmented and [M]issing genes).

Statistic D. aldrichi D. mojavensis D. buzzatii D. buzzatii D. hydei D. repleta D. melanogaster

(old)

Mean transcript size(UTR, CDS) 2,155 2,387 2,002 1,636 2,190 2,068 2,880

Mean gene locus size(first to last exon) 5,416 4,429 5,986 3,428 4,982 5,016 5,741

Number of genes 16,070 14,680 14,532 13,158 15,838 14,790 13,919

Number of predicted transcripts 20,925 20,110 17,166 13,158 18,480 19,990 30,452

Mean exon size 471 439 453 397 517 478 539

Number of distinct exons 69,409 67,288 60,185 49,832 67,695 61,941 77,694

Mean number of distinct exons per gene 4.30 4.30 4.10 4.10 4.30 4.20 5.58

Mean number of transcripts per gene 1.30 1.40 1.20 1.00 1.20 1.40 2.19

Genes with functional annotation 13,216 12,231 11,944 10,496 14,485 12,594 12,236

BUSCO stats (N = 2,675 genes)* C:95%[D:13%] C:98%[D:9.60%] C:88%[D:12%] C:94%[D:12%] C:94%[D:9.5%] C:98%[D:9.10%]

F:2.2%,M:2.5% F:0.80%,M:0.10% F::1.0%,M:10% F:2.6%,M:2.6% F:3.1%,M:2.5% F:0.70%,M:0.30%

187

Table S5.4: Number of pairwise orthologues between D. melanogaster and the five repleta species and the published (older) annotation of D. buzzatii. A higher number indicates greater evolutionary conservation of genes.

Species D. aldrichi D. mojavensis D. buzzatii D. hydei D. repleta D. melanogaster D. buzzatii (Old

annotation)

D. aldrichi NA 11880 10757 11473 11334 11028 10232

D. mojavensis NA 11585 11539 11945 11400 10480

D. buzzatii NA 10442 10631 10314 NA

D. hydei NA 11252 10776 10412

D. repleta NA 10952 10405

D. melanogaster NA 9775

188

Table S5.5: Sets of functional terms describing hierarchical grouping of gene ontology terms and the number of genes in Drosophila melanogaster within each set. Each set consists of terms with semantic similarities >0.7 with Subsets described in table S6)

GO Set GO Set description Number of D.

number melanogaster

genes

0 Response to stimulus and stress 401

1 Protein localisation 915

2 Metabolite transport 896

3 Metabolite and ion binding 4909

4 Cell component organization or biogenesis 2917

5 Cell fate determination 31

6 Development 96

8 Protein modification 1800

9 Energy derivation / oxidation 174

10 Receptor regulator activity 2094

11 Miscellaneous biological processes 5177

12 Gene regulation 494

13 Primary metabolism 3379

14 Amino acid metabolic process 207

18 Nucleic acid processes 1529

19 Catalytic activity 3140

21 Homeostasis 1065

22 Signalling pathways 541

24 Cell cycle 70

29 Monooxygenase activity 568

37 Hydrolase activity 353

39 Muscle function 124

40 Nucleotide sugars metabolism 144

189

Table S5.6: Subsets of functional terms after splitting each Set and the number of genes in Drosophila melanogaster within each subset. Each set consists of terms with semantic similarities >0.7 but stricter clustering cut-off’s.

Attached as an additional excel sheet

190

Minerva Access is the Institutional Repository of The University of Melbourne

Author/s: Rane, Rahul Vivek

Title: The genomic basis of climate and host adaptation

Date: 2017

Persistent Link: http://hdl.handle.net/11343/177356

File Description: Complete thesis: The genomic basis of climate and host adaptation

Terms and Conditions: Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.