Comparative genomics and host adaptation in Hawaiian Picture-wing

A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI’I AT HILO IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

IN

TROPICAL CONSERVATION BIOLOGY AND ENVIRONMENTAL SCIENCE

MAY 2016

By

Ellie E. Armstrong

Thesis Committee:

Donald Price, Elizabeth Stacy, Rosemary Gillespie, and Stefan Prost ACKNOWLEDGMENTS

I would like to thank numerous people for their help and support, without which this research project would not have been possible. Foremost, I would like to thank my advisor Dr. Donald Price for his belief in my ability to accomplish an ambitious amount during my time at UH Hilo and for his help with all things concerning picture-wing Drosophila . Thank you to my committee members, Elizabeth Stacy and Rosemary Gillespie for providing invaluable feedback both in terms of writing and ecological theory, and to committee member Stefan Prost, who gave complete guidance of the bioinformatics and genome analyses. Thanks to Rasmus Nielsen, Tyler Linderoth, and Russ Corbett-Detig for help on selection analyses and advising. Thanks to collaborators Durrell Kapan, Pawel Michalak, and Ken Kaneshiro for providing samples and data and whose time and efforts made working with the picture-wings even more rewarding. Thanks to Cerise Chen and Anna Sellas who carefully extracted the and treated my specimens as if they were their own. Thank you to Karl Magnacca, Pat Bily, and Mark Wright, whose knowledge and help accessing the forest reserves as well as identifying species was more than appreciated. Thank you to Jun Ying Lim, Chris Yakym, Kylle Roy, Tom Fezza, Julien Petillon, Curtis Ewing, Britton Cole, Henrik Krehenwinkel, Jacqueline Haggarty, Stephanie Gayle, Patrick O’Grady, and Susan Kennedy for help in the field and for being supportive of this work. I would also like to send extra thanks to Noah Whiteman, Andy Gloss, and Benjamin Goldman-Huertas for providing me with data and being thoughtful and dedicated collaborators. Additionally, I would like to thank Nagarjun Vijay for help using Circos.

i ABSTRACT

For lineages, host-switching and differential host adaptations are commonly suggested as important ecological drivers behind diversification and co-evolution. The Hawaiian picture-wing Drosophila are an ideal lineage for investigations of the genomics of speciation because of their close relationship to D. melanogaster , their well-known phylogeny, thoroughly documented ecology, and incredible diversity. I investigated the genomic landscape changes and genome evolution in relation to host-plant adaptation in four species of Hawaiian picture- wing Drosophila . Three of these species, Drosophila sproati , Drosophila murphyi , and Drosophila ochracea were sequenced for this study using de-novo whole-genome sequencing and assembly. All of these species are endemic to Island and considered specialist species and oviposit on the trigynum (D. sproati and D. murphyi ) and Freycinetia arborea (D. ochracea ). The last species, D. grimshawi from the Maui Nui complex, was originally sequenced by the Drosophila 12 Genomes Consortia and is considered a generalist species. I also used the outgroup, flava , a continental species of the Scaptomyza genera, which originated in Hawaii. In the picture-wing lineage, there were 264 genes found to be under positive selection, notably those with chemosensory gene functions, plant detoxification functions, and inter-male aggression functions. Using all paralogs and orthologs of chemosensory genes, three genes in the odorant receptor and odorant binding protein gene families were detected to be under positive selection. It was also shown that specialists lost more chemosensory genes over time compared to the generalist species. Using 1:1 orthologs across species, I detected 12 genes under positive selection in individual picture-wing species that had a variety of functions in protein signaling, cell metabolism and division, and plant detoxification. These results highlight the genomic changes associated with host plant adaptation and sexual selection in the . In addition, the results show that repeat content is strongly correlated with genome size across the Drosophila . has a repeat content 3x higher than the other picture-wings, but this does not correspond to increases in repeat diversity. There was no notable codon bias between picture-wing species, but there was codon bias between D. grimshawi and S. flava. Picture-wings tended to favor a single codon over other codons, whereas S. flava has evenly represented codon choice for any given amino acid. Genomic synteny across picture-wing species and D. melanogaster was highly conserved at the chromosome level.

ii TABLE OF CONTENTS

Acknowledgements...... i Abstract...... ii List of Tables...... iv List of Figures...... v

Introduction...... 1 Ecological speciation and adaptive radiations...... 1 Host-plant use in Drosophila...... 2 Hawaiian Drosophila ...... 4 Hypotheses & Objectives...... 9 Materials & Methods...... 10 Results...... 18 Data quality assessment...... 18 Genome assembly...... 18 Repeats and genomic traits...... 19 Genome synteny analysis...... 20 Gene annotation and orthology...... 20 Positive selection analysis...... 21 Chemosensory gene analysis...... 22 Discussion...... 24 Conclusion...... 32

Tables...... 33 Figures...... 55 Literature Cited...... 80

iii List of Tables

Table 1: Species collection information for genome sequencing.

Table 2: DNA extraction and sequencing information, including library type and facility.

Table 3: Scaffold N50 from various assemblers and various kmer sizes before running GapCloser for de-novo assemblies.

Table 4: Assembly statistics for best assembly for D. sproati , D. murphyi , and D. ochracea .

Table 5: CEGMA results from de-novo assembled genomes.

Table 6: Genome characteristics from all species including GC content, assembly size, and repeat content.

Table 7: Categorical repeat content of genomes.

Table 8: Results from annotation of Hawaiian picture-wing Drosophila and S. flava including transcript size (exon size), intron size, and composition.

Table 9: Codon bias of 16 codons in picture-wing Drosophila and .

Table 10: Genes under positive selection identified by PAML in picture-wing species using branch-site model.

Table 11: Function of genes under selection in all picture-wings using M7/M8 sites model analyzed by GO enrichment analysis.

Table 12: Glutamate receptor gene copy numbers found in Hawaiian picture-wing Drosophila .

Table 13: Odorant binding protein gene copy numbers found in Hawaiian picture-wing Drosophila .

Table 14: Odorant receptor gene copy numbers found in Hawaiian picture-wing Drosophila .

Table 15: Gustatory receptor gene copy numbers found in Hawaiian picture-wing Drosophila .

Table 16: Genes under positive selection in i) branch leading to all species and ii) branch leading to picture-wing species using branch-site model in PAML.

iv List of Figures

Figure 1: A) Basic phylogenetic structure of genus Drosophila . B) Phylogenetic structure of picture-wing inferred from PHYLIP using S. flava as outgroup.

Figure 2: Feature response curve of assemblies generated for D. sproati

Figure 3: Feature response curve of assemblies generated for D. murphyi.

Figure 4: Feature response curve of assemblies generated for D. ochracea

Figure 5: Comparison of average C-value size and total assembly size reported from abyss- fac in the program AbySS.

Figure 6: Correlation between genome size and % repeats in genome across Drosophila phylogeny and including S. flava .

Figure 7: Correlation between genome size and estimated repeat content in genome across Drosophila phylogeny and S. flava .

Figure 8: Distance matrices of codon bias across species and phylogenetic relationships inferred from codon bias.

Figure 9.1-9.3: Relative codon bias across species for picture-wings and S. flava .

Figure 10. Synteny visualization of D. sproati genome compared with D. melanogaster genome using Circos.

Figure 11: Synteny visualization of D. murphyi genome compared with D. melanogaster genome using Circos.

Figure 12: Synteny visualization of D. ochracea genome compared with D. melanogaster genome using Circos.

Figure 13: Synteny visualization of D. grimshawi genome compared with D. melanogaster genome using Circos.

Figure 14: Abundance of the classes of repetitive elements across picture-wings, S. flava and D. melanogaster .

Figure 15: Abundance of the families of repetitive elements across picture-wings, S. flava and D. melanogaster .

Figure 16: Frequency of shared 1:1 orthologous genes between Hawaiian picture-wing Drosophila and S. flava .

v Figure 17: Distribution of p-value statistics (p<0.08) from branch-site model selection tests for each individual Hawaiian picture-wing Drosophila .

Figure 18: LRT p-value statistic for D. grimshawi using branch-site model.

Figure 19: Distribution of p-value statistic (p<0.08) from M7/M8 sites model selection tests for all Hawaiian picture-wing Drosophila .

Figure 20: Distribution of p-value statistic (p<0.08) from M7/M8 sites model selection tests for all picture-wing Drosophila and Scaptomyza in chemosensory genes.

Figure 21: Distribution of p-value statistic (p<0.08) from M7/M8 sites model selection tests for all picture-wing Drosophila in chemosensory genes.

Figure 22: Phylogeny displaying genes under selection in each species. Gene IDs correspond to D. grimshawi uniprot gene IDs.

Figure 23: Relative gain/loss of chemosensory genes across picture-wing species as compared to S. flava .

vi INTRODUCTION

Ecological speciation and adaptive radiations

Divergent natural selection on traits that vary between environments is thought to contribute to the establishment of reproductive barriers, thus leading to the formation of new species. This process, known as ecological speciation, has recently been implicated as a primary driver of diversification (Rundle & Nosil 2005; Schluter 2009). This method of diversification may explain high diversity in adaptive radiations, where biogeographic history and vicariance do not fully explain contemporary patterns, however the evidence for this is limited. For arthropod lineages, host-switching and differential host-adaptations are commonly considered ecological drivers promoting diversification and co-evolution, and recent studies have shown strong evidence for ecological speciation in several lineages of phytophagous (Matsubayashi et al. 2011; Nosil 2007). Specifically, host-plant switching has been proposed as a key factor in the speciation of lineages, especially those in which there is specialization across diverse varieties of plants (Ehrlich & Raven 1964; Jaenike 1990). This pattern has also been noted for adaptive radiations across island chains (Roderick & Gillespie 1998). However, diversification on islands can be complex, with geography, sexual selection, adaptation, and hybridization all playing a role in speciation. In order to address the complex evolutionary history of a Hawaiian lineage, I use whole-genome data to investigate genes and genomic architecture important to the diversification of picture-wing Drosophila . Recent advances in sequencing technologies and bioinformatics have led to more in- depth studies on how barriers to gene flow arise throughout entire genomes and how these barriers can lead to speciation (Feder et al. 2012; Feder et al. 2013). However, many of these studies lack appropriate integration of the ecological setting of an organism with genomic findings. The Hawaiian picture-wing Drosophila are an ideal lineage to investigate the genomics of speciation because of their close relationship to D. melanogaster , their well-known phylogeny, thoroughly documented ecology, and incredible diversity (Carson & Kaneshiro 1976; Roderick & Gillespie 1998; Wagner & Funk 1995). In addition, Drosophila genomes have been proposed as a model for comparative genomics (Crosby et al. 2007; Clark et al. 2007) because of their small size, short generation time, and the ability to raise lab populations for experimental purposes. Genome-wide analyses of closely related species have shown to be important in

1 understanding barriers to speciation, but overall there have been very few studies to integrate an ecological component (Poelstra et al. 2014; Soria-Carrasco et al. 2014). An extensive analysis of genomic sequences across an adaptive radiation has the potential to greatly advance our understanding of diversification mechanisms.

Host-plant use in Drosophila

The evolution of host-specialization or preference requires an array of adaptive traits to evolve before an can successfully breed or feed on any particular host. An adult individual must be able to detect and locate a plant of its choosing, while avoiding plants that may contain harmful compounds. Thus, much of the previous work focusing on host-plant adaptations have focused on the interface between the and its environment (i.e. microbe associations and the chemical compositions of plants) and the specific genes that underlie that preference and allow the individual to find the preferred host.

Ecology of Drosophila host use

Drosophila depend primarily on yeasts as a food source and some yeast associations have been shown to affect organismal fitness (Sang 1956; Chippindale et al.1993). This suggests that the specific fungal composition of a plant may be an important component of host preference. A recent study that documented the yeast diversity across the Drosophila phylogeny found that host substrate influences the yeast composition of flies (Chandler et al. 2012). Studies on the cactophilic mainland Drosophila and its four host races have revealed that the bacteria and yeast communities associated with each race’s host plant have significant differences (Starmer & Fogleman 1986; Fellows & Heed 1972). Utilization of these different hosts can be lethal to other species who lack specialization (Starmer & Fogleman 1986; Fellows & Heed 1972) The relationship between Drosophila and yeast has also shown to be specific. As yeasts are immobile, Drosophila have been shown to regulate and choose which yeast they vector to a host (Stamps et al. 2012). This has long been proposed in the cactophilic Drosophila , which were originally suggested to selectively change the yeast densities and demonstrate larval yeast preference (Fogleman et al. 1981; Ganter 1988). This specific vectoring and composition of yeast on suitable hosts further elicits the idea that microbial communities play an important role in host plant specificity in Drosophila.

2

Genomics of plant host use

Chemosensory gene families have largely been the focus of the genomics of host plant use. More specifically work has focused on i) odorant-binding proteins ( OBPs ), ii) odorant receptors ( Or ), iii) glutamate receptors (GluR ), and iv) gustatory receptors ( Gr ). OBPs bind odorants or pheromones in the environment and deliver the compounds to Ors . GluR s are less well known, but are mostly categorized as mediating the detection of small molecules at the synapses (Benton et al. 2009). Gr s bind environmental tastants that lead to specific behaviors depending on the type of neuron that is bound. Gustatory receptor neurons can be found in various locations in Drosophila flies, such as in the taste sensilla in the legs, and are often used to detect sugar, bitter compounds or salt concentrations (Hoskins et al. 2015). Both Or s and Gr s are expected to assist Drosophilids in host-choice and the composition and utilization of genes in these families change in response to different plant hosts. Furthermore, it has been shown that chemosensory detection of yeast volatiles is crucial for identification of a suitable host for oviposition (Becher et al. 2012). Extensive work on continental Drosophila species has supported the notion that chemosensory genes are especially important in host plant relationships and may be involved in differential host evolution within . For example, in the D. mojavensis group, candidate genes relating to host tolerance and preference have been established. Studies have shown that the fermentation process affects the concentration of alcohols and volatiles within the cactus host of D. mojavensis , which is correlated with variations at the Adh (alcohol dehydrogenase) locus (Matzkin et al. 2006). Differences in the Adh locus are thought to correspond to how much alcohol each species can tolerate. Analysis of data from the Drosophila 12 genomes project showed that specialist species lose genes from the Or and Gr families significantly faster than generalists (Clark et al. 2007). Specifically, the specialist D. secheillia from the melanogaster subgroup has lost Or and Gr genes 10 times more rapidly than its sister taxon, the generalist D. simulans . The authors hypothesize that this is an expected outcome of specialization; a specialist should require fewer genes to detect a smaller number of preferable host substrates. In order to further investigate these patterns, additional genomes of D. erecta and D. yakuba, a specialist and generalist, respectively, also from the melanogaster subgroup, were analyzed. Comparisons among these

3 genomes showed that genes from the Or and Gr families were lost at a rate five times faster than generalists (McBride & Arguello 2007). However, more extensive analysis showed that there are some differences between how these genes are gained and lost across the clade. For instance, although these loci show little codon bias, the Gr locus has a higher relative rate of nucleotide replacement within the group, indicating that these genes may be under more positive selection than Or genes, or alternatively experience weaker purifying selection. Critiques of these two studies have noted the difficulty in linking genomic patterns to the biology and ecology of the flies, as patterns such as these can be explained by genetic drift and/or confounded by time and demography (Gardiner et al. 2008; Whiteman & Pierce 2008). Furthermore, Gardiner et al. (2008) showed that the proportion of pseudogenized genes did not differ among three specialist and 9 generalist species in their sampling, but differed significantly between the two island endemics and the other species. These higher pseudogenization rates might be explained by higher genetic drift due to bottlenecks and small population sizes in island endemics (Gardiner et al. 2008). Although Hawaiian Drosophila have been extensively studied ecologically, they have not been specifically examined for patterns or differences across loci potentially associated with host switching. The group has diversified extensively in regards to host-plant use across the Hawaiian islands, utilizing 34 of the 87 families of native flowering plants, as well as ferns, sap fluxes, and even spider eggs (Magnacca et al. 2008). This suggests that the wide range of plant families used by the Hawaiian Drosophila require distinct adaptations that may promote speciation and diversification.

Hawaiian Drosophila

Background

Hawaiian Drosophila likely colonized the islands through a single founder event between 25-35 million years ago, which gave rise to approximately 1000 species contained within 2 genera: the Hawaiian Drosophila and Scaptomyza (Kaneshiro & Boake 1987; O'Grady & DeSalle 2008; O’Grady et al. 2011). Scaptomyza then went on to re-colonize the continental United States (Lapoint et al. 2013). The picture-wing species group within the Hawaiian Drosophila arose approximately 5 million years ago, close to the origination of the oldest high island of . The group contains three clades: adiastola , planitibia , and grimshawi and is

4 comprised of approximately 120 morphologically diverse species (Carson 1982). The flies are characteristically larger than other Drosophila species and are descriptively named for the diverse array of patterns across their wings.

Ecology of Hawaiian picture-wing Drosophila

Hawaiian picture-wing Drosophila , like most other Drosophila species, are saprophytic and feed primarily on yeast from decaying plant material (Markow & O’Grady 2008). However, unlike other Drosophila, the extent to which the yeasts consumed by adults are associated with the oviposition site is largely unknown. In order to court females, males of picture-wing Drosophila form leks, but the site of the lek is not necessarily related to the host plant where the female will eventually lay her eggs (Spieth 1974). The nutrients that the larvae consume are then strongly tied to the particular host plant of the group. In most specialist Drosophila species, adults and larvae are tied to the same host through both the consumption of the rotting material and the use of that site for oviposition. It has been suggested that a female’s association with a plant in the environment may be subject to the productivity of that plant, or a more abundant and frequently suitable plant would attract more specialists (Levins 1968). Hawaiian picture-wing Drosophila are known for their charismatic mating displays and sexual dimorphism, which suggests that sexual selection may play a role in the diversification of the group (Spieth 1974). More recently, Kaneshiro (Kaneshiro, 1976, 1980) has proposed models pertaining to the Hawaiian picture-wing Drosophila to explain diversification in regards to sexual selection. His models state that when a species colonizes a new habitat (island), it is no longer favorable to be very sexually selective, as this would decrease an individual’s chance of mating success. This decreased sexual preference promotes speciation and divergence within the new habitat, as the new group eventually develops new mating preferences. The pattern of asymmetrical mating choice, with older species having stronger mating preference than younger species (for example, from Maui compared to Hawaii Island), has been demonstrated for several species, specifically within the planitibia sub-group of the picture-wings (Kaneshiro 1976). Criticisms of the founder principle theories described above, which were first employed by Mayr (Mayr 1970) have been noted, but Templeton (Templeton 1979) argued that a better explanation has yet to be presented. He argues that although founder events may not always lead to rapid speciation, that there are circumstances based on population structure, genome

5 characteristics, and even ecological traits that may favor adaptive radiation following a founder event (Templeton 1979). Questions remain on what these specific characteristics might be and in what situations the founding event is the driver of speciation versus an event that will eventually cause genetic differences between populations due to genetic drift.

Previous study on the Hawaiian picture-wings

The Hawaiian Drosophila have been subject of numerous genetic studies, beginning with extensive karyotype analyses across the clade (Carson 1983). In the picture-wings, Drosophila grimshawi was used as a standard for comparison. Karyotype analysis within the clade, as well as across the Hawaiian Drosophila revealed a basic pattern of 6 chromosomes, but with prominent inversion differences between species, and in some cases populations (Carson 1992). After the initial karyotype sequencing of 103 species, within population polymorphisms focused heavily on the planitibia sub-group members, D. silvestris , D. heteroneura , and D. planitibia (Sene & Carson 1977; Craddock & Johnson 1979; DeSalle & Giddings 1986).

Taxa in this study

Drosophila sproati and D. murphyi are within the grimshawi sub-group of the picture- wing clade and endemic to Hawaii Island. Both nuclear and mitochondrial loci analysis infer a divergence time of a little over 1 million years and indicate colonization of the Big Island by two separate introductions from Maui (Magnacca & Price 2015). The species are morphologically similar, with D. murphyi being slightly smaller and darker than its sister D. sproati (Magnacca & Price 2012). Populations are dispersed across the wet forests of the island in areas located between 1066m (3500ft) and 1828m (6000ft), with D. murphyi tending to occur at slightly higher elevations than D. sproati . These two species are reliant on the native eudicot (), or Ōlapa , for oviposition and breeding. Distributions of the flies are highly dependent on the availability of this host species in the forest. Microsatellite analysis across twelve populations of D. sproati and D. murphyi show that some hybridization may occur where populations overlap, however it does appear that the two species are genetically distinct across the island (Wright, T. master’s thesis, UH Hilo 2012). Drosophila ochracea relies on the monocot vine Freycinetia arborea (Pandanaceae), also known as ʻieʻie, a vastly different host from that of D. sproati and D. murphyi . Drosophila

6 ochracea is also within the grimshawi subgroup and is predicted to have colonized Hawaii Island from Maui. It is more closely related to several other species that utilize this host plant and are dispersed across the other high islands. Drosophila ochracea also occurs in sympatry with some populations of D. sproati and D. murphyi . It is slightly smaller than D. sproati , and has similar wing markings to both D. sproati and D. murphyi . Drosophila grimshawi , a species previously sequenced by the Drosophila 12 Genomes Consortium, is also in the grimshawi sub-group (Clark et al. 2007). Drosophila grimshawi is a generalist species, using a variety of host species for oviposition. Uniquely, D. grimshawi is not a single-island endemic and can be found on the Maui Nui complex (Ohta 1978). Based on recent molecular distinctions, the Kauai and populations were renamed D. craddockae and are regarded as specialists on Wikstroemia (Kaneshiro & Kambysellis 1999).

Host-switching as a driver of diversification in Hawaiian picture-wing Drosophila

The Hawaiian Islands are an excellent location in which to determine whether there are consistent patterns of genomic change between species. The geological chronology of the Hawaiian Islands, and their well-known biota (Price & Clague 2002; Roderick & Gillespie 1998), make them an exemplary location to integrate studies of ecology and evolution. Previous studies have investigated the context of speciation among several Hawaiian lineages, for example, sexual selection through song modification in Laupala crickets (Mendelson & Shaw 2005), divergence of ecomorph and web-building characteristics in Hawaiian Tetragnatha spiders (Gillespie 2004), as well as ecomorph differentiation, habitat specialization, and geographical isolation in the infamous silversword alliance (Baldwin 1997). Additionally, the Hawaiian picture-wing Drosophila have been extensively studied in terms of their diversification via host-plant associations and sexual selection (Kaneshiro & Boake 1987). The diversity of hosts utilized by the Hawaiian picture-wings suggests that host-plant diversification has been an important driver of speciation across the clade, presenting an excellent opportunity to study potential ecological speciation ( Magnacca et al. 2008; O’Grady et al. 2011; Magnacca & Price 2015). Host-plant specialization has been implicated as an important factor in speciation for many herbivorous insects (Roderick & Percy 2008), but it is unclear how or why certain arthropod lineages speciate through host-switching, while others diversify without changing their

7 host. Several studies have noted these issues in examining diversification in the context of host- switching, due to the confounding relationship positive selection can have with past demography (Whiteman & Pierce 2008). For example, because of the long divergence times between many Drosophila species (upwards of 25my in some cases), adaptation may be confounded by genetic drift. Furthermore, rapidly changing landscapes, such as those caused by lava flows, have been shown to be a primary driver of diversification in allopatry of on islands (Gillespie & Roderick 2002). Numerous phylogenetic and genetic studies have looked for evidence of sexual selection and host-preference evolution in Hawaiian picture-wing Drosophila (Boake 2005; Heed 1971 Kambysellis et al. 1995; Magnacca & Price 2015). However, the relative roles of sexual selection and host preference in the speciation of the group have not been specifically examined. The existing base of ecological and evolutionary information make the picture-wings well suited for further understanding the genomic basis of host plant use (Carson 1982; Coyne & Orr 2004). This study provides the first look at genome evolution within the picture-wing clade in the broader context of the Drosophila genus with emphasis on possible evolution in genes corresponding to host-adaptation.

8 HYPOTHESES & OBJECTIVES

1) Investigate the genomic landscape (i.e. genome size, repeat content, intron and exon size) in the available Hawaiian picture-wing Drosophila genomes.

Hypotheses: I predict that the genomic landscape of the picture-wings will be highly conserved. Due to bottlenecks from island colonization, I hypothesize that the genomic landscape of the Hawaiian picture-wings will be more similar to Scaptomyza than other continental Drosophila .

2) Investigate genes under positive selection in the available Hawaiian picture-wing Drosophila genomes.

Hypothesis: I predict some genes that are potentially important for host-plant recognition and sexual selection will show signs of positive selection.

3) Investigate gene repertoires of chemosensory gene families, such as olfactory and gustatory receptors, and pseudogenization in those genes.

Hypothesis: I hypothesize that I will find changes in chemosensory gene repertoires between species of the Hawaiian picture-wing Drosophila . Specifically, I predict that specialist species will have a greater rate of loss than generalist species. Pseudogenization and gene loss are predicted to be components of plant-host specialization, as species with fewer hosts are expected to need smaller gene repertoires.

9 MATERIALS & METHODS

Specimen Collection

Hawaiian picture-wing Drosophila from Hawaii Island were collected with traditional sponge baits. Fermented banana paste and fermented mushroom spray were applied to the bottom of the baits, and baits were hung 1-2 meters from the ground. Drosophila murphyi was collected in May 2015 in Hilo Forest Reserve (Laupahoehoe) at an elevation of ~1600m (N19 °54’57.88’’, W-155 °18’2.49’’). Drosophila ochracea and D. sproati were collected in March off Stainback Highway at elevation of ~1066m near Tom’s trail (N19 °34’11.45 ’’ , W155 °12’44.54 ’’ ). Specimens were collected in glass vials and transported to University of Hawaii, Hilo for identification and preserved in 90% EtOH. The specimens were stored in -20 oC and then shipped to the University of California, Berkeley for extraction. All specimens were collected under permit number FHM16-393. Collection information can be found in Table 1. All collection sites were located in wet montane forests dominated by `ōhi`a (Metrosideros polymorpha ) canopy and hapu’u fern ( Cibotium spp. ) dominated understory.

Laboratory Methods

Genome Sequencing

Genomic DNA was obtained from the specimens using a modified extraction protocol with a Quiagen DNAeasy kit. Male specimens were used in order to sequence both sex chromosomes. Specimens were removed from ethanol and allowed to air dry on a sterile surface until they were sufficiently dried. They were placed into individual 1.5ml tubes and flash frozen at -80C for five minutes. After freezing each specimen was manually ground by mortar and pestle and 200ul buffer ATL and 20ul Proteinase K were added to each tube (or until the specimens were fully submerged) followed by the addition of 4ul of RNase A. The tubes were sealed and allowed to sit for 2 minutes then gently overturned to mix the solutions. They were then incubated at 56C overnight (12 hours). The specimens were then removed from the incubator and 400ul 95% ethanol and Buffer AL were added to the tubes. The tubes were gently inverted to promote mixing, placed in a 2ul collection tube and centrifuged at 8000rpm for 1 minute. Flow through was discarded. Subsequently, 500ul of buffer AW1 was added and tubes were spun again at 8000rpm for 1

10 minute. Flow through was again discarded. Then, 500ul of buffer AW2 was added to the collection tube and tubes were spun at 14,000rpm for 3 minutes and flow through discarded. The previous step was repeated one additional time. Subsequently, the specimen tubes were placed in a new spin column and 150ul of Buffer AE was added, incubated for 2-3 minutes and centrifuged at 8000rpm for 1 minute which comprised the first DNA elution, this step was repeated with 75ul of Buffer AE and this comprised the second and final DNA elution. Using this protocol, we obtained enough DNA to use single specimens for sequencing (Table 2). Genomic DNA extractions were sent to the UC Davis Genome center for library preparation. Gel-free mate-paired libraries and paired-end 180bp overlap libraries (20bp overlap) were selected in order to minimize the required input DNA amount, as well as to maximize ease of assembly. All libraries were sequenced on a single lane of the Illumina Hiseq 3000 platform using the 100bp paired-end setting.

Bioinformatic Analyses

Raw-Data Processing

Several tools have been developed for reference-free genome analysis in order to predict data quality, as well as genome characteristics. I used the open source tools, FastQC v 0.11.3 (Andrews 2010) and preqc (Simpson 2013) to check the quality of the data and evaluate genome characteristics (Simpson 2014; Babraham Bioinformatics). I used FastQC to assess base quality (Phred scores) along the reads, average Phred scores per read, kmer overrepresentation (among others this can indicate the presence of adapter remnants or sequencing bias) and the average GC content of the read data (this can be used to asses sequencing bias or indicate genome-wide GC content). In the next step, I obtained statistics that provide descriptions of the data quality and required assembly effort using the program preqc (Simpson 2013). To assess how difficult an assembly will be, preqc creates a de-bruijn graph representation of the genome and estimates (1) rates of branching due to the presence of heterozygous sites (Bradnam et al. 2013), (2) repeat content and (3) sequencing error. It further estimates the assembled genome size, expected average coverage of the assembly, the expected GC bias and the kmer size that is expected to maximize the contig N50 length (see below for a description of the N50 statistic). Subsequently, I processed the raw reads to improve the quality of the data using the program AdapterRemoval

11 (Lindgreen 2012) which removes adapters, adapter remnants and low quality bases. To correct sequencing errors in the reads, I used SGA correct (Simpson & Durbin 2012). SGA correct uses a kmer frequency distribution to identify kmers that are rare by plotting the distribution of kmers with the length of 31 bases, and identifies the left end of the distribution that shows a high frequency of rare kmers. These kmers are considered erroneous and are then corrected using un- erroneous reads.

De-novo Assembly

After preprocessing the raw-read data, I assembled draft genome sequences using a variety of assembly tools, such as ABySS (Simpson et al. 2009), SoapDenovo2 (Luo et al. 2012), Platanus (Kajitani et al. 2014), Allpaths-LG (Gnerre et al. 2011) and SGA (Simpson & Durbin, 2012). In contrast to ABySS, SoapDenovo2, Platanus, and Allpaths-LG, which are de-bruijn graph based assemblers, SGA (an Overlap-Layout-Consensus assembler) uses overlap graphs to assemble the genome sequence. In a first investigation, I ran the assemblies using different kmer sizes (k= 51, 55, 61, 65 and 75; based on the preqc kmer distribution) for ABySS, Platanus and SoapDenovo2. After assembly, I used GapCloser to improve the assembly quality (Luo et al. 2012). This tool fills gaps in the scaffolds by remapping the sequence reads to the gap regions in the genome assembly using the areas around gaps as seeds. To assess which genome assembler performed best, I used general assembly statistics as well as CEGMA (Core Eukaryotic Genes Mapping Approach, Parra et al. 2007). I used CEGMA, in order to test which kmer size and which assembler (assembly strategy) produced the highest quality assembly. CEGMA uses BLAST to search the assembly for DNA regions that match a set of 248 highly conserved single copy genes that should occur in most eukaryotes. It reports the percentage of these core genes found and their copy numbers, and thereby gives a measure of the assembly quality. Assemblies were also assessed based on standard statistics such as the contig and scaffold N50, the number of contigs and the assembled genome length. N50 is calculated by ordering all contigs/scaffolds by length from longest to shortest. The lengths of each contig/scaffold are then summed, starting with the longest ones, until the running sum equals 50% of the assembled genome size or more (sum of all contigs and scaffolds in the assembly). The N50 is the length of the last added contig which contributes to 50% of the genome size. I also used the program FRC or feature-response-curve (Vezzi et al. 2012), which

12 characterizes the coverage of the sequence assembler as a function of its discrimination threshold (number of features or errors). This allows the assessment of which assembly has the least number of errors. In order to run FRC, I mapped the PE and MP libraries back onto the respective genome assembly using BWA (Li & Durbin 2009) and processed the mapping file (convert to binary format, sort reads and index the mapping file) using samtools (Li et al. 2009). I ran FRC and CEGMA only on the genome with the highest N50 (in the case of multiple kmer runs) produced by each assembler.

Repeat Masking & Genome Annotation

Using the best assembly, genes and repeats in the genome sequence were annotated. Repeats were annotated to (1) prepare the assembly for gene annotation and (2) to detect repeats for evolutionary analyses (Bradnam et al. 2013). Genes were annotated to be able to assess positive selection in gene regions. I used a combination of homology and de-novo repeat finding to annotate repeats. First, I identified repeats in the genome using the de-novo repeat finder RepeatModeler (Smit et al. 2014). In the next step, I merged the de-novo generated repeat library with the Repbase (Jurka et al. 2005) library of D. melanogaster and ran RepeatMasker (Smit et al. 2004) which identifies repeats in the genome using homology to the given repeat library, to acquire a final set of repeats. I soft-masked simple repeats (converted to lower case) for subsequent gene annotation, which improves the annotation quality by improving mapping performance in the homology-based annotation step (see below). To annotate genes in the genome, a combination of ab-initio gene prediction and evidence-based gene (homology-based) guided annotation was used. To do so, the genome annotation pipeline Maker2 (Holt & Yandell 2011) was applied, which is able to perform all the above mentioned genome annotation strategies. Previously published protein evidence from the closely related D. melanogaster (Ensemble annotated proteins release 84) was used to perform homology-based annotation. In addition to the homology-based annotation, Maker2 also used the two de-novo gene finder tools, SNAP (Korf 2004) and Augustus (Stanke & Waack 2003) for ab- initio gene prediction.

Repeat Analyses & Genomic Traits

13 In order to assess genomic trait variation between Hawaiian Drosophila and other members of the clade, I accessed C-value estimates from the publicly available Genome Size Database (http://www.genomesize.com). I then obtained genome sequences for the 29 published Drosophila genomes from NCBI and ran RepeatMasker (using the D. melanogaster RepBase library), which provided summary statistics for the total percentage of repeats in the genomes. In addition to these output statistics, I estimated the upper percentage boundary of repeats in the genome assemblies by calculating the number of N’s in the assembly and adding it to the number of basepairs masked by RepeatMasker. This is based on the knowledge that most gap regions in the genome assembly will correspond to repeat regions, which are known to be very difficult to resolve in de-novo assembly. To calculate basic statistics such as the total assembly size and the N50 on the 29 genomes, I used abyss-fac, a tool within the genome assembler ABySS and an in-house script, because abyss-fac ignores contigs smaller than 500bp. To assess the relationship between repeat content and genome size, I ran a linear regression for both repeat content (basepairs masked by RepeatMasker alone) and genome size, as well as estimated repeat content (basepairs masked by RepeatMasker + N’s in genome assembly) and genome size. Additionally, I carried out a more detailed investigation of repeat content in Hawaiian picture-wing Drosophila , Scaptomyza flava , and Drosophila melanogaster using de- novo and homology-based repeat annotation (see above) to produce summary statistics on the composition of repeat types (i.e. LINEs, SINEs, DNA, smRNA), as well as which classes and families they represent. I used the program ANACONDA2 to assess codon bias between the genomes (Moura et al. 2005). ANACONDA2 provides software to investigate codon and amino acid bias using fasta files. I compared each picture-wing to one another, as well as D. grimshawi and S. flava. ANACONDA2 filters sequences by looking for start codons (i.e. open reading frames). I chose to filter transcripts less than 300 amino acids in length, as that was slightly below the average exon size found in each species. (Moura, 2007 #659;Moura, 2005 #701)

Orthologous and Paralogous Gene Identification

To perform analyses such as phylogenetic tree reconstruction and inference of selection, I first identified 1:1 orthologous genes between S. flava and the four Hawaiian picture-wing Drosophila species. I used the de-novo annotations, as well as those available on NCBI for D.

14 grimshawi and S. flava (provided by Noah Whiteman). The genus Scaptomyza arose on the Hawaiian Islands approximately 25 mya and went on to recolonize the mainland (Lapoint et al. 2013; O'Grady & DeSalle 2008). S. flava was chosen because it is the least diverged from the Hawaiian Drosophila , and thus provides a good outgroup for detecting selection within the clade. A gene is orthologous between species if it can be traced back to a common ancestor. On the contrary, paralogous genes are usually created by gene duplication along a branch in the phylogeny (Koonin 2005). Orthologous and paralogous genes were inferred using the program PoFF (Lechner et al. 2014). This program uses a combination of reciprocal blast (homology information) and synteny (regional information) to distinguish between orthologous and paralogous genes. After inferring orthologous genes, I plotted the frequency of orthologous genes across species included in my analysis, to choose a cutoff for genes missing in different species. I then aligned genes with no more than 25% of missing species using Prank (Löytynoja & Goldman 2008), a highly accurate gene alignment method (Blackburne & Whelan 2012; Fletcher & Yang 2010). In order to preserve the reading frame for downstream analyses, such as inference of positive selection, I ran Prank in codon mode.

Intron Extraction

In addition to exons, I also extracted introns to prepare for the phylogenetic and selection analyses using custom and in-house scripts (see https://github.com/irusri/Extract-intron-from- gff3). In order to extract them, I first found their location based on the coordinates of the exons provided by the annotation pipeline, specifically the gff file output from MAKER2. I then extracted the intron lines from the annotation files based on these coordinates. Introns were then grouped if they were from the same gene and concatenated.

Phylogenetic Tree Reconstruction

To investigate phylogenetic relationships between the Hawaiian picture-wings and S. flava , I used RAxML (Stamatakis 2014), a maximum likelihood-based phylogenetic tree reconstruction tool to generate gene . I set the bootstrapping parameter to use 100 bootstraps in order to sufficiently test for support of the inferred tree topology. The individual gene trees were then converted to a species tree using Astral (Mirarab et al. 2014). Astral uses a binning

15 approach to combine similar gene trees (by creating an incompatibility graph between gene trees) and then chooses the most likely species tree under the multi-species coalescent model.

Inference of Positive selection using dn/ds ratios and Likelihood Ratio Tests

I inferred signatures of positive selection based on the reconstructed phylogeny, using PAML (Yang 2007) and likelihood-ratio tests (Yang et al 2005; see also Jeffares et al. 2015). Inferences of dn/ds ratios are highly influenced by alignment errors. To minimize these errors, I first ran Prank using the codon model, which has been shown to produce very reliable alignments (Blackburne & Whelan 2012; Fletcher & Yang 2010) and I then filtered the alignments for regions of low-quality using Aliscore (Kück et al. 2010). Aliscore performs a Monte Carlo resampling within a sliding window to identify regions of low alignment quality. I removed low- quality regions from the alignment using Alicut (Kück 2009). In order to preserve the reading frame, I ran Aliscore on protein alignments generated from the codon fasta files using Biopython and then used the output files to correct the codon fasta files. Two different models were used to infer dn/ds ratios. The sites model infers genes that are under selection in all branches of a phylogeny. The branch-site model infers genes that are under selection in only one or a few branches within the tree. I inferred non-synonymous to synonymous substitution ratios using the branch-site model in CODEML, part of the Paml package for each of the four picture-wing Drosophila species. Genes found under selection using the branch-site model were blasted using the publicly available UNIPROT blast tool against both the UNIPROT and SwissProt databases. In order to infer genes under selection in the Hawaiian picture-wing Drosophila as a group , I used the M7/M8 sites model test in PAML. After using the M7/M8 sites model, I corrected for multiple-testing using bonferroni correction. Multiple-testing corrections are not recommended for the branch-site test (Rasmus Nielsen, personal communication; see Discussion). In the next step, I investigated whether certain Gene Ontology (GO; http://geneontology.org) categories are enriched in the genes under selection. To do that, I first blasted all genes under selection (foreground in the GO analysis) and all genes that did not show selection (background) against the RefSeq database of D. melanogaster . I then extracted the RefSeq IDs for the best hits (an e-value stronger than 10 -6) and used GOrilla ( http://cbl-

16 gorilla.cs.technion.ac.il ) and DAVID ( http://david.ncifcrf.gov ) to infer GO enrichment and gene names.

Chemosensory gene analysis

I obtained a list of well-annotated and manually curated chemosensory genes from S. flava (provided by Noah Whiteman), which included i) odorant-binding proteins, ii) glutamate receptors, iii) gustatory receptors, and iv) odorant receptors. I then used the program PoFF to find orthologous and paralogous genes in the picture-wing Drosophila species. These were not restricted to 1:1 orthologous genes, since gene duplication can lead to an increase in copy number. After finding orthologous and paralogous genes, I performed a selection analysis using PAML. I first ran the M7/M8 sites model to investigate which chemoreceptor genes are under selection in all the species (picture-wings and S. flava ) and just in the Hawaiian picture-wing Drosophila . These were corrected for multiple comparison with the bonferonni method. I further used the branch-site test to investigate whether any chemoreceptor genes are under selection in each picture-wing species. I further tested for convergent selection between D. murphyi and D. sproati since they share the same host using the branch-site test (by selecting both branches as foreground).

Genome Synteny Analysis

To investigate genome synteny between D. melanogaster and each of the picture-wing Drosophila , I first created pairwise whole-genome alignments using Satsuma (Grabherr et al. 2010). Drosophila melanogaster was used as a comparison because its genome assembly is in scaffolds that largely correspond to chromosomes, which allows for in depth synteny inferences. Synteny was then visualized using Circos plots, a tool created for visualizing genomic data (Krzywinski et al. 2009). Scaffolds smaller than 50kb in length where not used in the circos plot reconstruction. These included all Y chromosome scaffolds of D. melanogaster .

17 RESULTS

Data Quality Assessment

I obtained an average sequence coverage of ~80x per library for D. murphyi, D. sproati, and D. ochracea . FastQC results for all three species showed a GC content of 37%, 36%, and 36%, respectively, which is consistent with other picture-wing Drosophila genomes, (Kang 2016). All sequences showed average Phred scores over 36, indicating that the sequences are of high quality and have an average error rate of less than 0.003%. Some sequences showed bias in the kmer distribution at the start of the DNA molecules, indicating that adapters should be removed before further analysis. Preqc showed similar results in terms of the quality level of sequences, but also provided predictions of potential difficulties in genome assembly. All assemblies show a high number of variant branches and repeat branches, indicating that heterozygosity and repeat content may be high in the genomes. Furthermore, preqc provided an a priori suggestion of which kmer (here, 61 bp kmer for all three species) will likely result in the most complete assembly, which is a useful starting point for several of the assemblers tested (ABySS; Simpson et al. 2009, Platanus, and SOAPdenovo2; Luo et al. 2012).

Genome Assembly

The use of AdapterRemoval and SGA correct resulted in the removal of 10%, 11% and 10% of the total reads for D. murphyi , D. sproati and D. ochracea , respectively. Post-assembly statistics consistently revealed that Allpaths-LG produced the highest quality assembly, as both the N50 score and CEGMA results were highest for these assemblies from all species. CEGMA results (see Table 5) indicate a good quality of the draft assemblies, with over 97% of the 248 eukaryotic proteins found complete within each assembly. ABySS yielded the second best assemblies, but showed significantly lower N50 values than Allpaths-LG (Table 3). Both SOAPdenovo2 and Platanus performed poorly, with genome statistics and CEGMA scores considerably smaller than those of Allpaths-LG and ABySS. Running GapCloser improved the genome assembly quality in all cases (Table 5). The program FRC also revealed that the Allpaths-LG assembly consistently performed best out of the five assemblers (Figures 2-4).

18 Contrary to the N50 and CEGMA statistics, SGA outperformed every assembler but Allpaths- LG in the FRC analysis, but had N50s consistently lower than ABySS.

Repeats & Genomic Traits

Within the picture-wings, assembled genome size is more conserved among the closely related species, D. sproati, D. murphyi, and D. ochracea (~150Mb; Table 6), but is considerably larger for D. grimshawi (186Mb; Table 6). GC content did not vary strongly across the Drosophila genus (36%-45%), but appears to be lowest and conserved within the Hawaiian picture-wings. Drosophila sproati, D. murphyi, and D. ochracea have a considerably lower repeat content (between 10-12.5%), than D. grimshawi (30% ; Table 6). D. grimshawi shows a more similar repeat content to S. flava (32%), and D. grimshawi and S. flava are among the highest repeat contents of the currently sequenced Drosophila (Table 6). Repeat content (calculated as the number of basepairs masked by RepeatMasker) showed significant correlation with genome size (R 2=0.2255, p=0.0054; Figure 6), but this relationship increased with the addition of N’s in the assembly (here, estimated repeat content; Table 6). We found a stronger relationship between estimated repeat content and genome size (R 2=0.537, p<0.0001; Figure 7). C-value estimates across the Drosophila genus were almost always higher than assembly size, suggesting that genome assemblies might be incomplete most likely due to repetitive regions (Figure 5). De-novo and homology-based repeat annotation also revealed variation between the types of repeats across the four picture-wing species, S. flava, and D. melanogaster . Consistent with the elevated total repeats seen in D. grimshawi and S. flava, these species contain considerably more long-interspersed nuclear elements (LINE’s; Table 7). Additionally, S. flava shows consistently higher repeat content across short-interspersed nuclear elements (SINE’s), long terminal repeats (LTR’s), DNA repeats, and smRNA repeats. Drosophila melanogaster also shows an elevated number of LTR’s compared with picture-wing species, but contains a smaller amount than S. flava. Repeat classes across the species were relatively conserved within the three picture-wing species D. murphyi , D. sproati , and D. ochracea (Figure 14). However, D. grimshawi had a higher repeat content, which was composed of increased simple repeats, LINE’s and DNA repeats. Scaptomyza flava also had a large increase in the number of DNA, LINE, LTR, and RC

19 repeats, relative to the picture wings, but had a smaller number of simple repeats. Species differed more in repeat family diversity than repeat class diversity across species. (Figure 15). Scaptomyza flava had by far the most diverse families of elements and the picture-wings showed similar compositions of families across species. Additionally, D. melanogaster had elevated familial diversity compared to picture-wing species, however it was not as diverse as S. flava . Both D. melanogaster and S. flava had greater numbers of Gypsy elements, and S. flava had more Helitron and CR1 elements compared to both the picture-wings and D. melanogaster . Codon usage across picture-wing species was relatively consistent (Figure 8). The picture-wings tend to favor C or G in the third base position when available (Table 9). Scaptomyza flava had codons that were more evenly distributed across amino acids, whereas the picture-wings strongly favored a single codon over others. This resulted in many codon bias differences between D. grimshawi and S. flava . This indicates that S. flava uses codons differently than the picture-wing clade. Phylogenetic relationships inferred by differential codon use predicted that D. ochracea and D. murphyi were more closely related than the sister species D. murphyi and D. sproati .

Genome Synteny Analysis

Genome synteny between picture-wings and D. melanogaster appears to be relatively conserved (Figure 10-13). The different sequencing strategies reveal notable differences in the genome assemblies, which is mostly due to the fragmentary nature of the Illumina-based assemblies (scaffolds <50kb were removed in the circos plots). In D. murphyi a very low number of contigs map to chromosome 2L and there are also very few contigs that map to chromosome 4 across all picture-wing species. This is most likely due to the highly repetitive nature of chromosome 4 of D. melanogaster (Sun et al. 2000). Additionally, the extremely reduced chromosome six of picture-wing Drosophila corresponds to chromosome 4 of D. melanogaster , further explaining the mapping inconsistencies. I did not find a single occurrence where one scaffold in the Hawaiian picture-wings aligned to two or more scaffolds in D. melanogaster .

Gene Annotation & Orthology

Genes (including both introns and exons) annotated using MAKER2 composed approximately 25, 24, and 25 percent of the de-novo genomes of D. murphyi , D. sproati , and D.

20 ochracea (Table 8). This was comparable to the annotations in both D. grimshawi and S. flava which composed approximately 21% and 23% of the genomes, respectively. The total number of genes annotated in each species was between 11,980 and 13,209 and was relatively consistent across species, despite differences in sequencing and annotation pipelines. Transcript size (here, exon size) was approximately 1500bp for the sequenced picture-wings, but was slightly larger in D. grimshawi (~1600bp) and S. flava (~1800bp). Intron size followed the same pattern as transcript size and was approximately 2000bp in the three picture-wings, but slightly larger at ~2200bp in D. grimshawi and D. flava . Using one to one orthologs, I was able to build a phylogeny for the four picture-wings and S. flava (Figure 1B). Bootstrap values for all branches had a support of 100. The relationships between species were as predicted from previous phylogenetic work on the picture wings using several nuclear and mitochondrial loci (Magnacca & Price 2015). There were approximately 5,000 1:1 orthologs shared between the picture-wing species and S. flava (Figure 16).

Positive Selection Analysis

Using the branch-site model, I was able to detect genes under positive selection in each of the four picture-wing species (LRT distribution Figure 17; Table 10). Multiple testing correction for the branch-site test was not used, because multiple testing corrections assume a uniform distribution of the p-values in the analysis, but the distributions here are strongly skewed towards one (see Figure 18 for example). A total of five genes were found to be under positive selection only in D. grimshawi (GH20935 , GH11139 , GH13431 , GH23102 , GH24199 ), whose functions correspond to protein transport and signaling, oxidation, and cell division. Specifically, GH20935 is a cytochrome P450, which are known to be involved in plant compound detoxification. A total of two genes were found under positive selection only in D. sproati and D. murphyi , of which one was shared between the two (GH2606 and GH24440 ; GH20606 and GH22237 ). These genes also appear to be related to cell division and protein transport or signaling from the external membrane. Three genes were found to be under positive selection only in D. ochracea all of which appear to be involved in mRNA processing and signaling (GH20868 , GH12213 , and GH17505 ).

21 Using the M7/M8 model, I was able to detect genes that are under positive selection in all picture-wing species (LRT distribution Figure 19). A total of 464 genes were found to be under positive selection before using a bonferroni correction for multiple comparisons. After correction, a total of 264 genes were found to be under positive selection. The function of the genes under selection varied greatly (Table 11), but those having to do with chemosensory genes, P450 plant detoxification, and aggression behaviors are highly relevant to the picture-wing system (Table 11). The gene-ontology (GO)-enrichment analysis also found significant enrichment categories in chemosensory gene function, epithelial cell fate, aggressive behavior, and inter-male aggressive behavior. However, the GO enrichment categories were not significant after correction for multiple comparisons.

Chemosensory Gene Analysis

Chemosensory gene gain and loss did not follow a consistent pattern across picture-wing species (Table 12). Glutamate receptors showed very little change across species, except in D. murphyi and D. sproati , which lost five receptors each, two of which were a shared loss. Drosophila ochracea and D. grimshawi lost two genes each and one of the genes lost by D. ochracea was also lost by D. murphyi and D. sproati. Odorant-binding proteins also did not show a discrete pattern of gain and loss (Table 13). Drosophila grimshawi and S. flava had identical sets of genes that code for odorant-binding proteins and showed no gains or losses. D. sproati had duplications of two genes, and a loss of three, while D. murphyi and D. ochracea lost three and two respectively. One of these losses was shared across the three species, and one was shared solely between D. sproati and D. murphyi . Odorant receptor protein genes showed the largest number of duplications across the picture-wing species (Table 14). A single gene in D. grimshawi tripled in copy number, two genes in D. ochracea doubled in copy number, and a single gene in D. murphyi doubled its copy number. Scaptomyza flava had a single gene duplication. Drosophila sproati did not gain any copies, but instead incurred three gene losses, one of which overlapped with a duplication in D. ochracea . There was a single shared loss between D. grimshawi , D. ochracea , and D. murphyi . Gustatory receptor genes also saw more losses than gains across the clade (Table 15). Drosophila sproati had five gene losses, of which three overlapped with the losses seen in D. ochracea . Drosophila murphyi also had three gene losses, one of which overlapped with a loss

22 from D. sproati . Drosophila grimshawi and S. flava both had a single duplication that was not shared with each other, or any losses in the other species. Using the sites-model, two chemosensory genes were found to be under selection in all species, but were no longer significant after bonferonni correction (LRT distribution Figure 20; Table 16). Additionally, three genes were found to be under selection after correction in only picture-wing species (Table 16). None of the chemoreceptor genes showed signs for positive selection using the branch-site test in each individual species. Further, I did not detect any pseudogenization (e.g. presence of premature stop codons or insertions or deletions that destroy the reading frame) in the orthologous chemosensory genes within the Hawaiian picture-wing species. This could also partly be due to issues with the annotation of highly pseudogenizised genes (as they might lack gene identifying features or are too divergent to be detected in the homology-based annotation step).

23 DISCUSSION

Genomic landscape

Genomes of eukaryotes vary tremendously in overall size and amount of coding region, however these characteristics are generally conserved among closely related species (Grime & Mowforth 1982). Within the picture-wing group, (assembled) genome size appears to be highly conserved among the three species sequenced here ( D. sproati, D. murphyi, and D. ochracea ), but is slightly larger in D. grimshawi (Table 4). This is likely due to the differing sequencing strategies employed by this project (high-throughput short-read Illumina sequencing) and the original 12 Drosophila genomes project (Sanger sequencing). Sanger sequencing is a more efficient method for capturing regions that are highly repetitive because of its higher read length, which could account for the differences in size between D. grimshawi and the other picture- wings. This notion is also supported by the increased reported repeat content for D. grimshawi . Eukaryotic genomes have been shown to contain approximately 10k-20k genes, despite large changes in genome size across taxa (Walbot & Petrov 2001). The genomes in this study were consistent with this prediction (averaging approximately 12,400 genes) and the number of annotated genes did not vary widely between species. The sequencing type (high-throughput short-read vs. Sanger sequencing) did not seem to affect the gene annotation, confirming that high-throughput methods are efficient in areas with low repeat content, such as gene regions. The average transcript size (single exons) and the average intron size were also consistent across the picture-wing species. The assemblies in this study do not provide a refined analysis of synteny differences across species because the picture-wings sequenced here could not be assembled at the chromosome level. Drosophila melanogaster has a slightly different chromosomal arrangement than the Hawaiian picture-wings (XY234 vs X23456) (Lindsley & Zimm 2012). The alignment results indicate that genomic synteny is conserved between the picture-wing species and D. melanogaster and genes have largely remained on the same chromosomes (no scaffolds in the picture-wings aligned to two or more scaffolds in D. melanogaster ). Interestingly, picture-wing Drosophila are known for their paracentric inversions, both between species and within populations of species (Carson 1983). Interchromosomal rearrangements, or translocation events, do not appear to be a significant factor in picture-wing Drosophila . In general it has been shown

24 that polytene chromosome mapping is a good predictor for phylogenetic relationships, as confirmed by the comparison of gene trees (O'Grady et al. 2001). However, the single genes sequenced in the previous studies do not provide insight into patterns of rearrangement that may be important for the establishment of species barriers. Improving the scaffolding of these genomes to further investigate regions that undergo inversion that may potentially contribute to speciation would be an avenue for further work. Codon bias, or a difference in frequency of one or more synonymous codons in coding DNA, exists among organisms, genomic regions, and genes (Sharp & Matassi 1994). Selection is generally calculated as a derivative of the relative rate of nonsynonymous to synonymous codon substitutions, equating synonymous substitutions as effectively neutral. However, the existence of codon bias suggests that synonymous substitutions are also driven by selection or mutation making it an important component of genome evolution. In the Drosophila genus, previous work has supported the case that codon bias is largely driven by selection (Powell & Moriyama 1997). There were no significant changes in codon bias across the picture-wing genomes, and in most cases, codon usage appears to be conserved (Figure 8). Drosophila murphyi was more similar in codon use to D. ochracea than its sister species D. sproati (Figure 8). However, there was a significant change in codon bias between picture-wings and S. flava . Additionally, the data suggest that S. flava tends to evenly distribute codon choice, whereas the picture-wing Drosophila tend to favor one codon over the other(s) (Figure 9.1,9.2, 9.3). Some genes in Drosophila are known to show some degree of codon bias, such as those in the Adh family, and those that are highly expressed (Powell & Moriyama 1997). It would be relevant in this case to carry out a more detailed investigation into genes that are known to be highly expressed and compare them to mid and low range expressed genes.

Repeat Content

Repeat content recorded across the picture-wing species is conserved within the three species D. sproati , D. murphyi , and D. ochracea , which are predicted to be less than 1.5 million years diverged. Drosophila grimshawi has a considerably higher repeat content. The higher estimates of repeats in D. grimshawi may be influenced by the quality of the genome, which has considerably larger scaffolds than the Drosophila sequenced in this study. However, even after including Ns in the assembly, D. grimshawi remains a strong outlier for the picture-wing group.

25 Drosophila grimshawi is more distantly related to D. sproati , D. murphyi , and D. ochracea and is approximately 2.5 million years diverged from the group which could explain the genome size and repeat differences between species (Magnacca & Price 2015). Across the Drosophila genus, I have shown that repeat content is strongly correlated with genome size (R2=0.2255 for repeat content and R2=0.5378 for estimated repeat content, see Figure 6-7). Repeat content has previously been attributed with changes in genome size, showing that when repeat content increases, genome size also increases (Eichler & Sankoff 2003). Interestingly, previous research has only found a correlation between repeat content and genome size in a few taxa and varies between invertebrates, mammals, fishes, and non-eukaryotes (Thomas et al. 2003). Studies on the effect repeat content has on genome size have been conducted previously in picture-wing Drosophila , in relation to their relative karyotype (Hunt et al. 1984). Drosophila grimshawi , the standard used for the karyotype analysis, possesses the karyotype 5R1D (5 rods, 1 dot), which is also found in D. sproati , D. murphyi , and D. ochracea . Studies on how heterochromatin varies across the chromosomes, suggests that there is not a large variation in heterochromatin among these species; however the genome assemblies indicate that D. grimshawi has a larger genome size than the other three species (Clayton 1988). Karyotype analysis of D. ochracea revealed that one pair of rods are nearly double in length because of increased amounts of heterochromatin (Clayton 1988). This indicates that genome size estimates for D. ochracea should be larger compared to the other picture-wings, but assembly size estimates do not indicate an increase in size (which is not surprising since heterochromatin regions are known to be highly underrepresented in the read data of short-read sequencing technologies). The data here show D. ochracea to have the lowest repeat content, but a similar estimated genome size to that of D. sproati and D. murphyi. Overall, however we do capture the correlation of repetitive DNA with relative genome size, which other studies including the Hawaiian picture-wings have shown (Craddock et al. 2016). Whether changes in repeat content are a by-product of speciation events, or are dynamic for species throughout time is unknown. Repeat classes across D. sproati , D. murphyi , and D. ochracea were highly conserved. Drosophila grimshawi had the highest number of repeats overall, which was attributed to increases in Copia elements (LTR), Gypsy elements (LTR), Jockey elements (LINE), Maverick elements (DNA), and TcMar-Tc1 elements (DNA) (Jurka et al. 2005; Figure 14). Long terminal repeat elements (LTRs) are retrotranscribing mobile elements and common in eukaryotic

26 genomes. LTRs are often the largest elements because their repetitive element will often repeat hundreds of times. Long interspersed nuclear elements (LINEs) are also retrotranscribed and are known to be widespread in eukaryotic genomes (Jurka et al. 2005). Jockey elements have been shown to be the most abundant LINE element in D. melanogaster , being specifically abundant in the euchromatin (Kaminker et al. 2002). The greater number of total elements in both D. grimshawi and S. flava , suggests that the colonizing Drosophila of the Hawaiian Islands had a high repeat content, that has been subsequently lost in the picture-wings to different degrees. It will be necessary to sequence a Hawaiian Scaptomyza to test whether repeat content was better maintained throughout the evolutionary history of this genus, or lost in the Hawaiian species and regained after the colonization of continental areas. Repetitive elements in the genome have been suggested as a major source of genomic variation for eukaryotes. This is not only because of their own variability, but because these elements promote events that facilitate both small and large-scale changes. Exon shuffling, a molecular mechanism that rearranges and duplicates exons, is more frequent in regions with repetitive sequences and TEs have been implicated in the evolution of eukaryotes (Patthy 1999; Taft et al. 2007; Kidwell & Lisch 2001; Zhou et al. 2009; Chalopin et al. 2015). Recent studies have shown an increase in TE activity in response to short and long-term stress, though the exact mechanisms are still unknown (for an overview see Capy et al. 2000; Casacuberta and González 2013). The distribution of the repetitive element families should be mapped to see which are close to coding sequences in order to infer which elements may promote gene diversification, as has been done in previous studies (Feschotte 2008). Additionally, the Drosophila genus, specifically the picture-wing adaptive radiation, provides an ideal system to investigate the correlation of diversity and phenotype with TE families and attributes.

Genes Under Selection

Selection tests revealed several genes under positive selection, both in the individual picture-wings and within the group as a whole. Gene annotation and genomic resources are a clear limitation in understanding the function of the genes under selection, as some of the genes found to be under selection had unknown functions or were not well-annotated. In the individual picture-wings, genes under selection primarily had to do with protein transport and cell function (Table 10: Figure 22). In D. grimshawi , a cytochrome P450 gene was found to be under

27 selection. Cytochrome P450’s are known to function in insecticide resistance and plant toxin detoxification (Scott 1999). As D. grimshawi is considered a generalist species, while D. sproati , D. murphyi, and D. ochracea are limited in their oviposition sites, a P450 gene under selection in D. grimshawi would contribute to its ability to tolerate a variety of host plants. This concept has been suggested in other systems of polyphagous insects, including swallowtail butterflies (Cohen et al. 1992) the corn earworm (Li et al. 2002), the cotton bollworm (Mao et al. 2007), and desert Drosophila (Frank & Fogleman 1992). In the picture-wing group, the sites model revealed 264 genes to be under selection, representing a variety of functions. Genes of interest were identified as chemosensory genes, P450 genes, and those having an aggression or inter-male aggression function. Specifically, the recurrence of the P450 complex under selection in both the individual picture-wings and the group at large should be investigated in future studies. As P450s are involved with detoxification of plant compounds, they could have a significant evolutionary role in the host preference of the Hawaiian Drosophila . A recent study of P450s in Lepidoptera showed that there are different complexes of genes between generalist and specialist insects, indicating that some P450s may be important in host tolerance (Li et al. 2002). Further corroborating this suggestion, studies showed that the structure of the proteins in generalists have channels that are less structurally contained, and are predicted to allow for the detoxification of more diverse allelochemicals (Li et al. 2002). A recent study by Guillen et al. (2015) on the comparative genomics of the Drosophila subgenus used similar analyses to investigate the ecological adaptations and genomic changes of D. buzzatii, D. mojavensis , and the cactophilic clade. They found that genes under selection were enriched in heterocycle catabolic processes, which have adaptive links to the cacti that they breed in. Similarly, the species investigated here had genes under selection that may be related to the adaptation to different host plants. However, more chemical work needs to be done on the Hawaiian host-plants before we can directly relate these genes to adaptive functions. Using the branch-site test on individual picture-wings, no chemosensory genes were found to be under selection. However, there were several limitations to the method I used to investigate selection. First, the small number of genomes used limits the power to detect selection by this method. Second, the outgroup S. flava may not be appropriate for fine selection tests because it is significantly diverged from the picture-wing species. It would be pertinent to

28 investigate chemosensory genes with a more closely related outgroup from the Hawaiian Islands, in addition to using genomic sequences from more picture-wing species. I was able to detect several chemosensory genes under selection using the M7/M8 sites model, however after bonferroni correction, only three remained under selection in the group containing all species (picture-wings and S. flava ). All genes under selection were in the Or or OBP gene families, and none were found to be under selection in the Gr or GluR gene families. Scaptomyza flava may be limited in the number of chemosensory genes it shares with picture- wings because it is a herbivorous insect that burrows into the upper layer of the plant epidermis to lay its eggs (Lapoint et al. 2013). As a result, it has lost most of its genes associated with detection of yeast volatiles from plants, genes that are still likely in use by the picture-wing Drosophila (Goldman-Huertas et al. 2015). However, as a first look at the evolution of chemosensory genes, it does reveal that there is selection in these gene groups, which may be significant for host plant preference. Only Obp99c was found to be under selection in both the chemosensory gene analysis and the all-genes analysis. No other genes were found to overlap for these tests. Gene copy number of the various chemosensory genes appears to change frequently in the picture-wing clade (Table 12-15: Figure 23). Interestingly, chemoreceptor genes such as those in the Or gene family are thought to diversify primarily via gene duplication and interchromosomal transposition, two mechanisms correlated with TE activity, so it is not surprising that many show copy number differences across species (Conceição & Aguadé 2008). Several studies have suggested that specialists will lose genes faster than generalists, as a smaller suite of genes is needed to adapt to single plant species (McBride 2007; McBride & Arguello 2007; Whiteman & Pierce 2008). The three specialist picture-wings D. sproati , D. murphyi , and D. ochracea lost a total of 16, 9, and 8 genes across the chemosensory genes investigated here, whereas D. grimshawi incurred a loss of only three genes. However, duplication of chemosensory genes does not reveal a pattern, with the addition of three genes in the generalist D. grimshawi and the addition of two, five, and two in D. sproati , D. murphyi , and D. ochracea , respectively. The gain of chemosensory genes is often overlooked in papers investigating host- plant adaptation, but is clearly an important component of the evolution of chemosensory systems. The net values of these gains and losses total 0 in D. grimshawi , -14 in D. sproati , -4 in D. murphyi, and -6 in D. ochracea. Overall, the data here do show a net loss of chemoreceptor

29 genes in specialist species, but should be more thoroughly investigated by the inclusion of additional generalist species. A pattern noted by Gardiner et al. (Gardiner et al. 2008) was that single island endemics had a higher rate of pseudogenization, most likely due to the small population sizes and bottlenecks caused by founding events. This study addresses this confounding factor by using primarily single island endemics, with the exception of D. grimshawi, which occurs on the Maui Nui complex. As there was no elevated pseudogenization in the single-island endemics, it is likely that founder events do not significantly affect the rates of pseudogenization in chemosensory genes. Several genes that were found to have copy number changes are predicted to be important in host plant detection for Drosophila . The genes Gr21a and Gr63a, for example, have been found to be important for CO 2 chemosensation (Jones et al. 2007). Both D. sproati and D. murphyi lost their copies of Gr21a. Drosophila melanogaster mutants that lost Gr21a were found to be functional for chemosensation of CO2, whereas mutants that lost Gr63a were unable to detect it. The loci were further determined to function together in a concentration-dependent manner. It is unclear whether Gr21a was lost due to its redundancy for the detection of CO 2, or if it was selected against for ecological reasons in D. sproati and D. murphyi because it was no longer needed. In odorant-binding proteins, Obp99c was found to be under selection. Previous study has found that Obp99c, Obp99a and Obp99d may be redundant for the identification of benzaldehyde in Drosophila (Wang et al. 2007). There was only a single increase in copy number for this Obp99a in D. sproati . Additionally, D. sproati and D. murphyi also lost their copy of Obp19d, a gene that has altered expression in males and females post-copulation, but its function is unknown (Zhou et al. 2009). However, the sexually dimorphic expression of chemosensory genes presents an additionally important function of genes in this category. Chemosensory genes include genes that bind odorants and pheromones, but they are broadly classified (Touhara & Vosshall 2009). Given that sexual selection has been proposed as a significant driver of speciation in picture-wing Drosophila future work should take care to distinguish the chemoreceptors important for host plant detection and those important for mate recognition. Using S. flava as a baseline to investigate chemosensory gene evolution may be problematic for several reasons. As previously stated, saprophagy is the ancestral state for the

30 Scaptomyza lineage, but herbivory has evolved in S. flava (Lapoint et al. 2013). It has been shown that the detection of yeast volatiles, such as those from rotting plant materials, are no longer necessary to S. flava and its genome has subsequently lost chemosensory genes related to that function and S. flava has also lost the ability to detect yeast volatiles (Goldman-Huertas et al. 2015). For future analyses, it will be important to expand the comparative dataset to include genes ecologically relevant to the detection of yeasts by using comparative groups that still depend on the detection of yeast volatiles.

31 CONCLUSION

How diversification proceeds at a rapid pace, such as in adaptive radiations, remains an elusive evolutionary question. A potential hypothesis for the rapid diversification of insect lineages is through differential host plant use, or the diversification across a variety of host species. Here, I present the first look at whole genome data for species utilizing different host plants and spanning generalists and specialists in an adaptive radiation of Hawaiian picture-wing Drosophila . I found that genomes between these species are similar in both synteny and codon use, but the single generalist species, D. grimshawi , had increased repetitive content, increased familial diversity of repetitive content, and an overall larger genome. Across species, I found genes that have a variety of functions under selection, notably those that relate to chemosensory function, plant detoxification and inter-male aggression. The inclusion of all chemosensory genes in the odorant-binding protein, gustatory receptor, glutamate receptor, and olfactory receptor families showed few genes under selection. However, I was able to corroborate the hypothesis that specialist species lose genes faster than generalist species by comparing the net gain and loss of receptors of picture-wing species . This work does not elucidate the function or suggest any specific genes to be involved in yeast volatile detection or the detection of suitable plant material in picture-wing Drosophila. However, it does provide footing for future studies to investigate the differentiation of these gene complexes that appear to be important for diversification across more species and populations.

32 TABLES & FIGURES

Table 1: Species collection information for genome sequencing of D. murphyi , D. sproati , and D. ochracea .

Species Sex Site/NCBI Ascension # Forest Reserve GPS Location Elevation Collector °N/°W D. murphyi male Laupahoehoe, Hawaii Laupahoehoe Natural N19 °54’57.88’’, 5,500 ft E. Armstrong Island Area Reserve W-155 °18’2.49’’

D. sproati male Stainback Highway, Upper Waiaikea N19 °34’11.45 °, 3,500 ft E. Armstrong Hawaii Island Forest Reserve W155 °12’44.54 °

D. ochracea male Stainback Highway, Upper Waiaikea N19 °34’11.45 °, 3,500 ft E. Armstrong Hawaii Island Forest Reserve W155 °12’44.54 °

Table 2: DNA extraction and sequencing information, including library type and facility.

Species DNA Extracted (ng) Sequencing Facility Libraries D. murphyi PE 180, non size-selected 1,909.5 UC Davis Genome Center MP D. sproati PE 180, non size-selected 1,624.5 UC Davis Genome Center MP D. ochracea PE 180, non size-selected 1,402.4 UC Davis Genome Center MP

33

Table 3: Scaffold N50 from various assemblers (in kb) and various kmer sizes before running GapCloser for de-novo assemblies. Scores in bold indicate the highest quality assembly.

Kmer D. murphyi D. sproati D. ochracea Abyss 55 19,623 22,206 31,002 61 20,398 21,124 25,411 65 18,680 18,804 19,497 75 7,918 7,826 5,277 SOAPdenovo2 51 726 879 1,264 55 817 934 1,318 61 853 949 1,440 65 854 928 1,354 75 737 966 1,232 Platanus N/A 1,404 1,730 1,491 AllpathsLG N/A 272,000 364,000 473,000 SGA N/A 2,409 4,455 1,842

Table 4: Assembly statistics for the respective best assembly for D. sproati , D. murphyi , and D. ochracea .

Species Assembly # Average contig Contig N50 # Average scaffold Scaffold N50 size (Mb) contigs size (bp) (bp) scaffolds size (bp) (bp) D. sproati 154 7,376 20,861 60,901 3,434 47,652 364,401 D. murphyi 156 6,248 25,067 67,534 2,239 73,670 275,462 D. ochracea 155 4,961 31,209 113,448 1,659 97,983 476,715 D. grimshawi 186 24,020 7,747 92,106 17,440 11,495 8,399,593 S. flava 203 14,776 13,769 59,835 6,619 32,654 111,640

34 Table 5: CEGMA results from de-novo assemblies. GC= Gap Closer. Complete proteins are those that are unfragmented across a scaffold or contig. Partial proteins are those that are found, but may exist on more than one scaffold or contig. Best CEGMA scores in bold.

Number Number % Complete Proteins % Partial Proteins D. murphyi Allpaths, GC 97.98 243 99.19 246

Allpaths 96.77 240 98.79 245

Abyss (k61), GC 95.16 236 98.79 245

Abyss (k61) 94.76 235 98.79 245

Platanus 71.37 177 94.35 234

D. ochracea Allpaths, GC 97.58 242 99.60 247

Abyss (k55) 95.97 238 98.79 245

D. sproati Allpaths, GC 97.18 241 98.79 245

Abyss (k55) 95.16 236 98.39 244

35 Table 6: Genome characteristics from all species. Statistics were generated using abyss-fac, RepeatMasker, and publicly available NCBI statistics for the most recent assembly. Repeat content is the raw percentage reported by RepeatMasker, while estimated repeat content is this value in addition to the total Ns in the assembly. Species Assembly Size (Mb) N50 (bp) GC Content (%) Repeats (%) Est. Repeats (%) Scaptomyza flava 203 118,040 38.7 32.3 38.2 Drosophila murphyi 156 271,386 37 11.8 16.8 Drosophila sproati 154 350,566 36 12.5 18.4 Drosophila ochracea 155 469,458 36 9.99 14.7 Drosophila grimshawi 186 9,830,507 39 30.3 37.5 Drosophila mojavensis 180 26,610,000 40.2 23.3 30.3 Drosophila albomicans 215 25,561 38.4 8.4 23.3 Drosophila americana 160 21,109 40.6 14.3 14.7 Drosophila virilis 189 11,720,000 40 25.8 33.9 Drosophila willistoni 223 4,500,039 37.1 28.9 33.98 Drosophila pseudoobscura 148 12,370,000 45.1 17.7 20.1 Drosophila persimilis 175 1,850,815 44.9 27.8 34.6 Drosophila miranda 132 27,940,000 44.9 10.8 13.8 Drosophila bipectinata 166 663,975 41.6 25.7 26.2 Drosophila ananassae 213 4,635,703 42 32.5 39.9 Drosophila kikkawai 163 910,714 41.4 18.5 18.98 Drosophila ficusphila 151 1,049,484 41.9 15.2 16.1 Drosophila rhopaloa 193 46,154 40.1 25.2 26.9 Drosophila elegans 170 1,714,124 40.3 19.8 20.1 Drosophila takahashii 181 390,631 40.0 20.5 21.1 Drosophila suzukii 201 424,319 40.7 20.9 34.1 Drosophila biarmipes 168 3,384,765 41.8 22.6 23.0 Drosophila eugracilis 156 976,885 40.9 20.9 21.3 Drosophila erecta 145 22,480,000 42.3 17.4 22.4 Drosophila yakuba 162 21,540,000 42.4 24.9 26.7 Drosophila melanogaster 142 250,00,000 41.9 22.7 23.5 Drosophila sechellia 157 2,884,279 42.1 24.2 29.8 Drosophila simulans 123 235,30,000 43.0 11.6 11.9 Drosophila busckii 112 20,013 39.9 9.3 9.9

36

Table 7: Categorical repeat content of genomes. Results were generated using RepeatModeler and RepeatMaker.

Species LINE SINE LTR DNA Unclassified SmRNA Others Total (%) D. murphyi 7060 0 4815 2504 26557 45 272213 11.76 D. sproati 7082 1 5963 2590 42465 55 254618 12.45 D. ochracea 7037 0 5717 2690 10239 48 264274 9.99 D. grimshawi 10789 0 8532 5654 52781 29 275592 30.34 S. flava 24398 2352 17866 24201 88114 844 197010 32.29 D. melanogaster 7099 0 12007 5763 10949 206 94705 22.65

Table 8: Results from annotation of Hawaiian picture-wing Drosophila and S. flava . Average transcript size is equivalent to the average of the total size of the exons in each gene for each respective species.

Species Total genes annotated Average transcript size (bp) Average intron size (bp) % of genome D. murphyi 12369 1505 2079 25.2 D. sproati 11980 1494 2111 23.8 D. ochracea 12369 1548 2085 24.9 D. grimshawi 12095 1601 2217 20.9 S. flava 13209 1815 2222 22.6

37 Table 9: Codon bias of 16 codons in picture-wing Drosophila and Scaptomyza flava.

Codon Bias Lysine Arginine Phenylalanine Isoleucine Glutamic Acid Species AAA AAG AGA AGG CGA CGC CGG CGU UUC UUU AUA AUC AUU GAA GAG D. murphyi 38.00 62.00 7.39 5.76 13.41 37.45 10.77 25.23 50.24 49.76 24.22 35.41 40.37 34.01 65.99

D. sproati 38.37 61.63 6.56 5.69 13.45 37.90 10.94 25.46 49.76 50.24 24.18 35.23 40.59 33.46 66.54

D. ochracea 38.08 61.92 7.44 5.78 13.26 37.36 10.80 25.36 50.25 49.75 24.09 35.38 40.53 34.29 65.71

D. grimshawi 36.70 63.30 6.36 5.78 13.30 38.32 10.94 25.29 50.79 49.21 24.11 36.01 39.88 32.63 67.37

S. flava 45.70 54.30 9.43 6.61 14.75 31.72 7.99 29.50 38.77 61.23 29.40 27.39 43.22 43.28 56.72

Codon Bias Tyrosine Serine Cysteine Alanine Species UAC UAU AGC AGU UCA UCC UCG UCU UGC UGU GCA GCC GCG GCU D. murphyi 41.93 58.07 26.92 14.67 11.61 18.01 21.39 7.41 62.91 37.09 22.61 36.64 21.09 19.66

D. sproati 41.42 58.58 27.07 14.60 11.69 17.91 21.66 7.07 63.52 36.48 22.73 36.66 21.26 19.35

D. ochracea 41.72 58.28 26.82 14.71 11.69 17.96 21.43 7.39 62.97 37.03 22.74 36.50 21.01 19.74

D. grimshawi 41.97 58.03 27.54 14.68 11.23 18.05 21.82 6.68 63.86 36.14 22.67 37.02 21.24 19.06

S. flava 37.85 62.15 25.30 16.70 15.90 14.94 17.10 10.07 57.21 42.79 28.82 29.56 16.29 25.33

38

Glycine Leucine Valine Species GGA GGC GGG GGU CUA CUC CUG CUU UUA UUG GUA GUC GUG GUU D. murphyi 16.19 51.23 6.33 26.25 8.55 16.39 39.43 7.24 6.25 22.14 9.46 25.47 44.43 20.64

D. sproati 16.20 51.86 6.54 25.41 8.44 16.48 39.73 7.28 6.17 21.90 9.48 25.45 44.74 20.34

D. ochracea 16.02 51.24 6.36 26.38 8.64 16.36 39.25 7.25 6.43 22.06 9.61 25.40 44.23 20.76

D. grimshawi 16.00 52.78 6.18 25.04 8.36 16.77 40.68 6.93 5.45 21.81 9.06 25.71 45.46 19.76

S. flava 20.44 44.10 5.54 29.92 11.85 11.93 28.29 13.21 9.78 24.93 14.47 20.83 36.81 27.89

39 Table 10: Genes under positive selection identified by PAML in Hawaiian picture-wing Drosophila species using branch-site model. Genes were identified using UNIPROT by blasting nucleotide sequences.

p-value % UNIPROT Species Molecular Function Biological Function Domain type identity Gene ID Cysteine-type peptidase D. ochracea 0.00270 82.3 GH20868 Signal peptide, chain signal activity SCP Domain, 0.00356 88.3 GH12213 ------Signal processing signal Nop Domain 0.01106 99.8 GH17505 ------

D. sproati 0.00097 83.6 GH20606 ------Mab-21 Domain Actin-filament 0.00061 100 GH24440 ------Fascin family organization

D. murphyi 0.00319 86.5 GH20606 ------Mab-21 Domain

Transmembrane, ER membrane protein transmembrane 0.00375 100 GH22237 ------complex helix

Heme binding, iron ion Cytochrome P450 D. grimshawi 0.00096 100 GH20935 binding, monooxygenase, ------family oxidoreductase >0.0001 99.8 GH11139 ------Repeat, signal Intracellular protein Importin N- 0.03968 100 GH13431 ------transport terminal Microtubule based process, 0.00975 100 GH23102 Dynactin complex Coiled coil mitotic nuclear division Transmembrane, 0.03973 100 GH24199 Component of membrane ------transmembrane helix

40 Table 11: Function of genes under selection in all picture-wings using M7/M8 sites model analyzed by GO enrichment analysis. Bolded genes are genes of interest for picture-wing Drosophila . Genes with * detected as significant by GO enrichment analysis before multiple-testing correction.

ID Gene Name Species GO enrichment category NP_610948 Acid labile subunit Drosophila melanogaster NP_611296 adipose Drosophila melanogaster NP_477395 alpha-coatomer protein Drosophila melanogaster NP_524268 alpha-Esterase-2 Drosophila melanogaster NP_572329 Anaphase Promoting Complex 7 Drosophila melanogaster NP_524689 Antigen 5-related 2 Drosophila melanogaster NP_477394 Arginyl-tRNA--protein transferase 1 Drosophila melanogaster NP_523745 Attacin-A Drosophila melanogaster NP_001163216 bancal Drosophila melanogaster NP_523415 Beta Adaptin Drosophila melanogaster NP_650554 Cadherin-89D Drosophila melanogaster NP_524874 Calcineurin B2 Drosophila melanogaster NP_523710* Calmodulin* Drosophila melanogaster* Chemosensory* NP_726938 cap binding protein 80 Drosophila melanogaster NP_996297 Cardioacceleratory peptide receptor Drosophila melanogaster NP_511140* Casein kinase Ialpha* Drosophila melanogaster* Inter-Male Aggression* NP_524551 Caspase Drosophila melanogaster NP_524962 Chitinase 4 Drosophila melanogaster NP_648432 Chromosomal serine/threonine-protein Drosophila melanogaster kinase JIL-1 NP_650597 chromosome alignment defect 1 Drosophila melanogaster NP_723044 Collagen type IV Drosophila melanogaster NP_787957 croquemort Drosophila melanogaster NP_610661 Cuticular protein 47Eg Drosophila melanogaster NP_651303 Cyclin B3 Drosophila melanogaster NP_523628 Cytochrome P450-6a2 Drosophila melanogaster P450

41 NP_523778 Dicer-2 Drosophila melanogaster NP_523787 Diptericin B Drosophila melanogaster NP_001027183 Dmel_CG10096 Drosophila melanogaster NP_610961 Dmel_CG10104 Drosophila melanogaster NP_648648 Dmel_CG10140 Drosophila melanogaster NP_650518 Dmel_CG10264 Drosophila melanogaster NP_609929 Dmel_CG10650 Drosophila melanogaster NP_572456 Dmel_CG10761 Drosophila melanogaster NP_650299 Dmel_CG10841 Drosophila melanogaster NP_650894 Dmel_CG10877 Drosophila melanogaster NP_650236 Dmel_CG10909 Drosophila melanogaster NP_730761 Dmel_CG11109 Drosophila melanogaster NP_649427 Dmel_CG11115 Drosophila melanogaster NP_001162889 Dmel_CG11149 Drosophila melanogaster NP_649848 Dmel_CG11966 Drosophila melanogaster NP_649852 Dmel_CG11971 Drosophila melanogaster NP_649853 Dmel_CG11975 Drosophila melanogaster NP_647736 Dmel_CG1246 Drosophila melanogaster NP_610983 Dmel_CG12858 Drosophila melanogaster NP_649725 Dmel_CG1287 Drosophila melanogaster NP_610558 Dmel_CG12917 Drosophila melanogaster NP_788360 Dmel_CG12970 Drosophila melanogaster NP_648857 Dmel_CG13067 Drosophila melanogaster NP_609235 Dmel_CG13095 Drosophila melanogaster NP_609306 Dmel_CG13116 Drosophila melanogaster NP_611666 Dmel_CG13506 Drosophila melanogaster NP_611700 Dmel_CG13510 Drosophila melanogaster NP_611763 Dmel_CG13545 Drosophila melanogaster NP_723308 Dmel_CG13796 Drosophila melanogaster NP_612054 Dmel_CG13894 Drosophila melanogaster

42 NP_609025 Dmel_CG13982 Drosophila melanogaster NP_608351 Dmel_CG14234 Drosophila melanogaster NP_650735 Dmel_CG14303 Drosophila melanogaster NP_649025 Dmel_CG14353 Drosophila melanogaster NP_569881 Dmel_CG14629 Drosophila melanogaster NP_649462 Dmel_CG14646 Drosophila melanogaster NP_650106 Dmel_CG14717 Drosophila melanogaster NP_569950 Dmel_CG14803 Drosophila melanogaster NP_569946 Dmel_CG14812 Drosophila melanogaster NP_609934 Dmel_CG15173 Drosophila melanogaster NP_608696 Dmel_CG15395 Drosophila melanogaster NP_611104 Dmel_CG15708 Drosophila melanogaster NP_572269 Dmel_CG15765 Drosophila melanogaster NP_609550 Dmel_CG16996 Drosophila melanogaster NP_648817 Dmel_CG17032 Drosophila melanogaster NP_610964 Dmel_CG17386 Drosophila melanogaster NP_651078 Dmel_CG17625 Drosophila melanogaster NP_788894 Dmel_CG17754 Drosophila melanogaster NP_732384 Dmel_CG17836 Drosophila melanogaster NP_726604 Dmel_CG1838 Drosophila melanogaster NP_650138 Dmel_CG18547 Drosophila melanogaster NP_572446 Dmel_CG2120 Drosophila melanogaster NP_611635 Dmel_CG2921 Drosophila melanogaster NP_726885 Dmel_CG2947; Dmel_CG32789 Drosophila melanogaster NP_724626 Dmel_CG30379 Drosophila melanogaster NP_650328 Dmel_CG3061 Drosophila melanogaster NP_732654 Dmel_CG31198 Drosophila melanogaster NP_731899 Dmel_CG31321 Drosophila melanogaster NP_723620 Dmel_CG31870 Drosophila melanogaster NP_728092 Dmel_CG32556 Drosophila melanogaster

43 NP_727781* Dmel_CG32600* Drosophila melanogaster* Chemosensory* NP_572836 Dmel_CG32649 Drosophila melanogaster NP_727428 Dmel_CG32687 Drosophila melanogaster NP_788084 Dmel_CG3278 Drosophila melanogaster NP_787961 Dmel_CG33128 Drosophila melanogaster NP_996130 Dmel_CG33285 Drosophila melanogaster NP_001027132 Dmel_CG33689 Drosophila melanogaster NP_001036664 Dmel_CG34112 Drosophila melanogaster NP_001096960 Dmel_CG34322 Drosophila melanogaster NP_001097297 Dmel_CG34439 Drosophila melanogaster NP_572308 Dmel_CG3455 Drosophila melanogaster NP_001096871 Dmel_CG3526 Drosophila melanogaster NP_726833 Dmel_CG3588 Drosophila melanogaster NP_650408 Dmel_CG3837 Drosophila melanogaster NP_650167 Dmel_CG3916 Drosophila melanogaster NP_726347 Dmel_CG4019 Drosophila melanogaster NP_001104019 Dmel_CG41520 Drosophila melanogaster NP_569975 Dmel_CG4194 Drosophila melanogaster NP_569976 Dmel_CG4199 Drosophila melanogaster NP_728769 Dmel_CG42304 Drosophila melanogaster NP_650889 Dmel_CG4288 Drosophila melanogaster NP_650038 Dmel_CG4683 Drosophila melanogaster NP_611675 Dmel_CG4752 Drosophila melanogaster NP_650609 Dmel_CG5285 Drosophila melanogaster NP_611352 Dmel_CG5345 Drosophila melanogaster NP_609374 Dmel_CG5390 Drosophila melanogaster NP_648733 Dmel_CG5392 Drosophila melanogaster NP_001137744 Dmel_CG5431 Drosophila melanogaster NP_651603 Dmel_CG5527 Drosophila melanogaster NP_572993 Dmel_CG5548 Drosophila melanogaster

44 NP_723554 Dmel_CG5603 Drosophila melanogaster NP_599147 Dmel_CG5671 Drosophila melanogaster NP_648526 Dmel_CG5883 Drosophila melanogaster NP_652731 Dmel_CG5887 Drosophila melanogaster NP_651220 Dmel_CG6173 Drosophila melanogaster NP_650445 Dmel_CG6276 Drosophila melanogaster NP_610905 Dmel_CG6337 Drosophila melanogaster NP_573318 Dmel_CG6696 Drosophila melanogaster NP_648348 Dmel_CG6757 Drosophila melanogaster NP_611185 Dmel_CG6967 Drosophila melanogaster NP_649041 Dmel_CG7271 Drosophila melanogaster NP_573334 Dmel_CG7288 Drosophila melanogaster NP_649305 Dmel_CG7324 Drosophila melanogaster NP_648034 Dmel_CG7376 Drosophila melanogaster NP_725052 Dmel_CG7777 Drosophila melanogaster NP_610193 Dmel_CG7863 Drosophila melanogaster NP_573364 Dmel_CG7874 Drosophila melanogaster NP_650971 Dmel_CG7922 Drosophila melanogaster NP_610470 Dmel_CG8029 Drosophila melanogaster NP_650247 Dmel_CG8063 Drosophila melanogaster NP_995782 Dmel_CG8083 Drosophila melanogaster NP_729547 Dmel_CG8177 Drosophila melanogaster NP_610763 Dmel_CG8493 Drosophila melanogaster NP_610781 Dmel_CG8525 Drosophila melanogaster NP_610402 Dmel_CG8586 Drosophila melanogaster NP_001163567 Dmel_CG8861 Drosophila melanogaster NP_573086 Dmel_CG9056 Drosophila melanogaster NP_996492 Dmel_CG9125 Drosophila melanogaster NP_610053 Dmel_CG9318 Drosophila melanogaster NP_572746 Dmel_CG9360 Drosophila melanogaster

45 NP_649938 Dmel_CG9475 Drosophila melanogaster NP_001137798 Dmel_CG9491 Drosophila melanogaster NP_651806 Dmel_CG9698 Drosophila melanogaster NP_611500 Dmel_CG9954 Drosophila melanogaster NP_651625 Dmel_CG9988 Drosophila melanogaster NP_001096987 DNA-directed RNA polymerase Drosophila melanogaster NP_001027443 DNA-directed RNA polymerase subunit Drosophila melanogaster NP_525061 dunce Drosophila melanogaster NP_647890 dusky-like Drosophila melanogaster NP_573184 E3 ubiquitin-protein ligase UBR1 Drosophila melanogaster NP_573098 Eukaryotic translation initiation factor 5 Drosophila melanogaster NP_523876* extra macrochaetae* Drosophila melanogaster* Inter-Male aggression* NP_001162940 Fatty acid (long chain) transport protein Drosophila melanogaster NP_477506* G protein salpha 60A* Drosophila melanogaster* Chemosensory* NP_477340 glass bottom boat Drosophila melanogaster NP_611172 Glyceraldehyde 3-phosphate Drosophila melanogaster dehydrogenase NP_523986 Gram-negative bacteria binding protein 3 Drosophila melanogaster NP_725040* Gustatory receptor 10a* Drosophila melanogaster* Chemosensory* NP_725040* Gustatory receptor 47b* Drosophila melanogaster* Chemosensory* NP_523808* Gustatory receptor 58b* Drosophila melanogaster* Chemosensory* NP_611427 hippo Drosophila melanogaster NP_477354 Hsp70/Hsp90 organizing protein homolog Drosophila melanogaster NP_609665 icarus Drosophila melanogaster NP_524436 Insulin-like receptor Drosophila melanogaster NP_650052 J domain-containing protein CG6693 Drosophila melanogaster NP_573382 kekkon5 Drosophila melanogaster NP_524786 kohtalo Drosophila melanogaster NP_572947* lethal (Bradnam et al.) G0007* Drosophila melanogaster* Inter-Male aggression* NP_572253 lethal (Bradnam et al.) G0060 Drosophila melanogaster

46 NP_647983 Lin-28 homolog Drosophila melanogaster NP_609418 Lipase 4 Drosophila melanogaster NP_729140 logjam Drosophila melanogaster NP_524778 lola like Drosophila melanogaster NP_524425* Malvolio* Drosophila melanogaster* Chemosensory* NP_728104 minibrain Drosophila melanogaster NP_511049 Myosin light chain cytoplasmic Drosophila melanogaster NP_572402 Negative elongation factor B Drosophila melanogaster NP_731311* Neutralized* Drosophila melanogaster* Inter-Male aggression* NP_788047 nimrod C2 Drosophila melanogaster NP_523620 O-glycosyltransferase Drosophila melanogaster NP_573350* Odorant-binding protein 18a* Drosophila melanogaster* Chemosensory* NP_649802* Odorant-binding protein 85a* Drosophila melanogaster* Chemosensory* NP_651711* Odorant-binding protein 99c* Drosophila melanogaster* Chemosensory* NP_649139 Ornithine aminotransferase precursor Drosophila melanogaster NP_649634 Osiris 13 Drosophila melanogaster NP_649639 Osiris 18 Drosophila melanogaster NP_610537 Peptidyl-alpha-hydroxyglycine alpha- Drosophila melanogaster amidating lyase 1 NP_523418 Peritrophin A Drosophila melanogaster NP_524666 Phospholipase A2 activator protein Drosophila melanogaster NP_610380 Probable cytochrome P450 4ad1 Drosophila melanogaster P450 NP_608457 Probable cytochrome P450 6t1 Drosophila melanogaster P450 NP_610206 Probable cytochrome P450 6w1 Drosophila melanogaster P450 NP_608834 Probable elongator complex protein 3 Drosophila melanogaster NP_569845 Probable methylmalonate-semialdehyde Drosophila melanogaster dehydrogenase [acylating], mitochondrial NP_651204 Probable N6-adenosine-methyltransferase Drosophila melanogaster MT-A70-like protein NP_649459 Probable sodium/potassium/calcium Drosophila melanogaster exchanger CG1090

47 NP_729172 Probable tubulin beta chain CG32396 Drosophila melanogaster NP_524115 Proteasome 26kD subunit Drosophila melanogaster NP_724614 Proteasome alpha6 subunit Drosophila melanogaster NP_649529 Proteasome subunit beta type-4 Drosophila melanogaster NP_647792 Protein translation factor SUI1 homolog Drosophila melanogaster NP_788502 Protein tyrosine phosphatase 69D Drosophila melanogaster NP_536738 retinal degeneration C Drosophila melanogaster NP_524398 Rhodopsin 2 Drosophila melanogaster NP_788038 rhomboid-6 Drosophila melanogaster NP_722715 Rieske iron-sulfur protein Drosophila melanogaster NP_476706 RNA polymerase II 140kD subunit Drosophila melanogaster NP_995758 Serine protease inhibitor 4 Drosophila melanogaster NP_524954 Serine protease inhibitor 5 Drosophila melanogaster NP_995614 SLY-1 homologous Drosophila melanogaster NP_476741 spindle E Drosophila melanogaster NP_731826 squid Drosophila melanogaster NP_524682 stunted Drosophila melanogaster NP_727224* swiss cheese* Drosophila melanogaster* Chemosensory* NP_524304* Tachykinin-like receptor at 86C* Drosophila melanogaster* Inter-Male aggression* NP_731308 tango Drosophila melanogaster NP_524124 target of Poxn Drosophila melanogaster NP_996160 TBP-associated factor 1 Drosophila melanogaster NP_523515 Tetraspanin 29Fa Drosophila melanogaster NP_523552 Tetraspanin 33B Drosophila melanogaster NP_524524 Tetraspanin 97E Drosophila melanogaster NP_477033 Tiggrin Drosophila melanogaster NP_524305 Transcription factor TFIIFbeta Drosophila melanogaster NP_651398 Transcriptional regulator ATRX homolog Drosophila melanogaster NP_001097249 Transitional endoplasmic reticulum ATPase Drosophila melanogaster TER94

48 NP_732012 Tropomyosin 2 Drosophila melanogaster NP_524826 Tryptophanyl-tRNA synthetase Drosophila melanogaster NP_536748 Ultrabithorax Drosophila melanogaster NP_001014739 upheld Drosophila melanogaster NP_476801 Vacuolar H[+] ATPase 16kD subunit Drosophila melanogaster NP_569952 Vacuolar protein sorting-associated protein Drosophila melanogaster 26 NP_649326 WD repeat-containing protein 26 homolog Drosophila melanogaster

49 Table 12: Glutamate receptor gene copy numbers found in Hawaiian picture-wing Drosophila . Receptors were found using curated list from S. flava and the program PoFF. Copy numbers not equivalent to 1 in bold.

Gene ID D. grimshawi D. ochracea D. murphyi D. sproati S. flava GH17276/CG3822 1 1 1 1 1 GH18377/CG5621 1 1 1 1 1 GH13546CG9935 1 1 1 1 1 GH10703/Clumsy 1 1 1 1 1 GH15168/GluRIA 1 1 1 1 1 GH16102/GluRIID 1 1 1 1 1 GH11834Ir2Ia-2 1 1 1 1 1 GH13667/Ir25a 1 1 1 1 1 GH10978/Ir40a 1 1 2 1 1 GH20908/Ir60a 1 1 1 0 1 GH15754Ir68b-2 1 1 1 1 1 GH14456/Ir75c 1 1 1 1 1 GH16353/Ir76b 1 1 1 1 1 GH12125/Ir7b 1 1 1 1 1 GH12587/Ir7c 1 1 1 1 1 GH18306/Ir84a 1 1 1 0 1 GH12703/Ir8a 1 1 1 1 1 GH17280/Ir93a 0 1 1 0 1 GH-Nmdar1 1 1 1 1 1 GH17573/Nmdar2 1 1 1 1 1 GH12224/Ir10a 1 0 1 1 1 GH23804/Ir75d 1 1 1 1 1 GH23957/CG11155 1 1 0 1 1 GH11254/GluRIIC 1 1 0 1 2 GH16113/GluRIIE 1 1 0 1 1 GH16551/Ir67b-3 0 1 0 0 1 GH22899/Ir48d 1 0 0 0 1

50 Table 13: Odorant binding protein gene copy numbers found in Hawaiian picture-wing Drosophila . Receptors were found using curated list from S. flava and the program PoFF. Copy numbers not equivalent to 1 in bold.

Gene ID D. grimshawi D. ochracea D. murphyi D. sproati S. flava GH12519/10CSP3 1 1 1 1 1 GH20166/CG30172 1 1 1 1 1 GH21706/CSP2 1 1 1 1 1 GH17928/Obp19a 1 1 1 1 1 GH17929/Obp19b 1 1 1 1 1 GH19983/Obp44a 1 1 1 2 1 GH22122/Obp50a 1 1 1 1 1 GH22119/Obp50e 1 1 1 1 1 GH23015/Obp56a 1 1 1 1 1 GH22760/Obp56d 1 1 1 1 1 GH23020/Obp56h 1 1 1 1 1 GH16324/Obp73a 1 1 1 1 1 GH14293/Obp83a 1 1 1 1 1 GH14291/Obp83abL1 1 1 1 1 1 GH14286/Obp83ef 1 1 1 1 1 GH18907/Obp99a 1 1 1 2 1 GH13991/Obp99b 1 1 1 1 1 GH18683/Obp99c 1 1 1 1 1 GH14317/Obp99d 1 1 1 1 1 GH22042/Phk-3CSP1 1 0 1 1 1 GH17930/Obp19d 1 1 0 0 1 GH17591/Obp19c 1 1 1 1 1 GH10975/Obp28a 1 1 1 0 1 GH19916/Obp47b 1 1 0 1 1 GH14285/Obp83g 1 0 0 0 1

51 Table 14: Odorant receptor gene copy numbers found in Hawaiian picture-wing Drosophila . Gene ID corresponds to UNIPROT gene ID. Receptors were found using curated list from S. flava and the program PoFF. Copy numbers not equivalent to 1 in bold.

Gene ID (DGRI) D. grimshawi D. ochracea D. murphyi D. sproati S. flava GH22378/Or13a-2 1 1 1 1 1 GH10234/Or24a-1 1 1 2 1 2 GH13264/Or30a-1 1 1 1 1 1 GH21457/Or42a-1 1 1 1 0 1 GH19930/Or45b 1 2 1 0 1 GH21770/Or46aB 3 1 1 1 1 GH21764/Or49b 1 1 1 1 1 GH22759/Or56a 1 1 1 1 1 GH16554/Or67c 1 2 1 1 1 GH19067/Or82a-2 1 1 1 1 1 GH22929/Or85e 1 1 1 1 1 GH17860/Or92a 1 1 1 1 1 GH16278/Or98a-2 1 1 1 0 1 GH22610/Orco 1 1 1 1 1 GH13813/Or33c 1 1 1 1 1 GH12539/Or10a 1 1 1 1 1 GH11074/Or67a-1 0 0 0 1 1

52 Table 15: Gustatory receptor gene copy numbers found in Hawaiian picture-wing Drosophila . Receptors were found using curated list from S. flava and the program PoFF. Copy numbers not equivalent to 1 in bold.

Gene ID D. grimshawi D. ochracea D. murphyi D. sproati S. flava GH24883/Gr02a 1 1 1 1 1 GH12540/Gr10a 1 1 1 1 1 GH10560/Gr33a 1 1 1 1 2 GH20741/Gr43a 1 1 1 1 1 GH16865/Gr61a-1 1 1 1 1 1 GH15044/Gr63a-2 1 1 1 1 1 GH14696/Gr64b 1 0 1 0 1 GH14697/Gr64c 1 0 1 0 1 GH14698/Gr64e 1 0 1 0 1 GH15380/Gr66a 1 1 1 1 1 GH17518/Gr89a 2 1 1 1 1 GH14561/Gr64f 1 1 0 1 1 GH13143/Gr28a 1 1 0 1 1 GH11240/Gr21a 1 1 0 0 1 GH10309/Gr32a 1 1 1 0 1

53 Table 16: Genes under positive selection in i) in all species and ii) in only picture-wing species using the M7/M8 site model in PAML. P-values were corrected using bonferroni method. Genes under selection in bold.

Gene ID p-value Corrected p-value All species GH22929/Or85e 0.013197 0.065985 GH22759/Or56a 0.026482 0.132410 Picture-wings GH19067/Or82a-2 0.032434 0.129736 GH22929/Or85e 0.00089288 0.00357152 GH22759/Or56a 0.012753 0.051012 GH14293/Obp83a 0.0234 0.0936000 GH18683/Obp99c 0.0000 0.000000 GH19983/Obp44a 0.043578 0.174312 GH23015/Obp56a 0.00573480 0.0229392

54

A Scaptomyza Hawaiian picture-wings Mojavensis group

Melanogaster group Ananasse group

Pseudoobscura group B 5.1 my ScaptomyzaSFLA fla a 3.7 my 100 Drosophila DGRI grimshawi 0.8-1.9 my

100 Drosophila DOCHochracea 0.4 -0 my 100 Drosophila DSPR sproa 100

Drosophila DMUR v murphyi 0.5 Figure 1: A) Basic phylogenetic structure of genus Drosophila . B) Phylogenetic structure of picture-wing clade inferred from PHYLIP using S. flava as an outgroup. Vales correspond to bootstrap values. Colors correspond to collection localities of each species. Scaptomyza flava not pictured. Photos from Karl Magnacca.

55 D. Sproati cumulative genome size cumulative(%)

Abyss Platanus SGA SOAPdenovo

0 20 40 60 80 100 Allpaths

0e+00 5e+05 1e+06

cumulative features

Figure 2: Feature response curve plot of assemblies generated for D. sproati. The curve characterizes the sensitivity or coverage of the sequence assembler as a function of the number of features (cumulative features) across the genome (cumulative genome size). Features can be characterized as repeat content, mate-pair orientation and separations, depth of alignment, and polymorphisms.

56 D. Murphyi cumulative genome size cumulative(%)

Abyss Platanus SGA SOAPdenovo

0 20 40 60 80 100 Allpaths

0 200000 400000 600000 800000 1000000 1200000

cumulative features

Figure 3: FRC plot of assemblies generated for D. murphyi. The curve characterizes the sensitivity or coverage of the sequence assembler as a function of the number of features (cumulative features) across the genome (cumulative genome size). Features can be characterized as repeat content, mate-pair orientation and separations, depth of alignment, and polymorphisms.

57 D. Ochracea cumulative genome size cumulative(%)

Abyss Platanus SGA SOAPdenovo

0 20 40 60 80 100 Allpaths

0 200000 400000 600000 800000 1000000 1200000

cumulative features

Figure 4: FRC plot of assemblies generated for D. ochracea. The curve characterizes the sensitivity or coverage of the sequence assembler as a function of the number of features (cumulative features) across the genome (cumulative genome size). Features can be characterized as repeat content, mate-pair orientation and separations, depth of alignment, and polymorphisms.

58 C-value vs Genome Assembly Size

C-value Assembly Genome Size 150 200 250 300 350 400

0 5 10 15 20 25 30

Species

Figure 5: Comparison of average C-value size (genomesize.com) and total assembly size reported from abyss-fac in the program AbySS.

59 Genome size vs Repeat content

Scaptomyza Hawaiian picture-wings Other Drosophila % Repeats %

Adjusted R-squared: 0.2255 p-value: 0.005399 10 15 20 25 30

120 140 160 180 200 220 240

Genome Size (Mb)

Figure 6: Correlation between genome size and % repeats (as measured by RepeatMasker) in genome across Drosophila phylogeny and including Scaptomyza flava . The overall model fit was R2=0.2255, with a p-value <0.05.

60 Genome size vs Estimated repeat content

Scaptomyza Hawaiian picture-wings Other Drosophila % Repeats

Adjusted R-squared: 0.5378

10 15 20 25 30 35 40 p-value: 1.185e-06

120 140 160 180 200 220 240

Genome Size (Mb)

Figure 7: Correlation between genome size and estimated repeat content (measured as repeat content from RepeatMasker plus N content) in genome across Drosophila phylogeny and including Scaptomyza flava . The overall model fit was R2=0.5378, with a p-value <0.00001.

61

Figure 8: Distance matrices of codon bias across species and phylogenetic relationships inferred from codon bias similarity. Output produced by ANACONDA2 software.

62 Lysine Asparagine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 AAA AAC AAU AAG

Threonine Arginine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 ACA AGA ACC ACU CGA ACG CGC CGU AGG CGG

Serine Isoleucine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 AUA UCA AUC AUU AGC AGU UCC UCU UCG

Figure 9.1: Relative codon bias across species for picture-wings and S. flava . Results were produced using ANACONDA2 software.

63 Methionine Phenylalanine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 UUC UUU

Tyrosine Cysteine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 UAC UAU UGC UGU

Tryptophan Leucine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 CUA UUA CUC CUU CUG UUG

Figure 9.2: Relative codon bias across species for picture-wings and S. flava . Results were produced using ANACONDA2 software.

64 Proline Histidine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 CCA CAC CAU CCC CCU CCG

Glutamine Valine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 CAA CAG GUA GUC GUU GUG

Alanine Glycine

200000 200000 150000 150000 100000 100000 50000 50000 0 0 GCA GCC GCU GGA GCG GGC GGU GGG

Asparagine Glutamic Acid

200000 200000

150000 150000

100000 100000

50000 50000

0 0 GAA GAC GAU GAG

Stop

50000 40000 30000 20000 10000 0 UAA UAG UGA

Figure 9.3: Relative codon bias across species for picture-wings and S. flava . Results were produced using ANACONDA2 software.

65

chrX 0

0

0 0 5 0 0 1 0 5 0 1

0 0 2 0

0 0

0 5 0 ch 0 r 0 1 2L

5 0 1 0

0 20

0 0

0

5

0 c 10 0 h r 2 0 R 15

0

0 20

0

0 25

0 0

0 5 0

0 10 0 L

3

r 0 15 h

c 0

2 0 0

0 25 0

0 0 0

0 5 0

0 10 0

0 15 R 0 r3 2 h 0 0 c 0 2 0 5

0 3 0 0 0 0

0 0

5 4 0 r

0 h 0 c 0

0 0

0 0 0 M 0

0 0 U

Figure 10. Synteny visualization of D. sproati (left) genome compared with D. melanogaster genome using Circos.

66 c

0 h 0 r

0 0 X 0 5 0 0 1

0 5 0 1

0 0 0 2

0

0 0

0

0 5

0 c 0 h 0 1 r2 0 L

0 15

0

0 20 0

0 0 0

0 5 0

0 0 1 c 0 h r 2 0 15 R

0

0 20 0

0 25

0 0 0

0 5 0

0 10 0 L 3 0 r 15 h 0 c

0 20 0

0 25 0

0 0 0

0 5 0

0 1 0 0

0 1 5 0

0 2 0 R 0 r3 2 0 5 h 0 3 c

0 0 0 0

0 UM chr4 Figure 11: Synteny visualization of D. murphyi (left) genome compared with D. melanogaster genome using Circos.

67 chrX

0

0

0 0 0 5 0 1 0 5 0 1 0 0 2 0 0 0 c 5 0 hr 0 0 2 1 L 0 5 1 0 0 2 0

0 0 0 5 0 c 10 h r 0 2 5 R 0 1

0 20 0 0 25 0 0 0 5 0 0 10 0 c

h

0 15 r 3

0 L 0 20 0 25 0

0 0 0

0 5

0 10 0 L 0 15 3 r 0 h 0 20 c 0 0 25 0 3 0 0

0 0 4 0 r 0 h 0 c 0 0 M 0 0 U

0 0

0 0 0 0

0 0

0 0

0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0

0

0 0

Figure 12: Synteny visualization of D. ochracea (left) genome compared with D. melanogaster genome using Circos.

68 c 1 hrX 0 0

5 5 0

1 0 1 5 0 1 0 2 5

0 0

5 c 0 h 0 r 1 2 1 L 5 5 1 1 0 20

5

0 0

5

5 10 c h r2 0 15 R 0

5 20

0 25

0

15 5

10 10

L 5 3 r

15 h

c 0 20

25 20

15 0

10 5

5 10

0 1 5 R 3 0 2 r 0 h c

2 5 0

3 5 0

0 0 0

5

0

0 0 5 hr4 0 c 1 UM

Figure 13: Synteny visualization of D. grimshawi (left) genome compared with D. melanogaster genome using Circos.

69 Classes of repetative elements across species

DNA 350000 LINE Low Complexity LTR 300000 RC Simple Repeat Unknown 250000

200000

150000

100000

50000

0 S.flava D.sproati D.murphyi D.ochracea D.grimshawi D.melanogaster

Figure 14: Abundance of the classes of repetitive elements across picture-wings, S. flava and D. melanogaster . Values were obtained from RepeatMasker. Not included in graph are repeat classes ARTEFACT, RNA, rRNA, Satellite, and SINE elements due to low relative abundance.

70 Families of repetative elements across species

120000 CMC-Transib Copia CR1 100000 Gypsy hAT-Ac hAT-hobo hAT-Pegasus 80000 hAT-Tip100 Helitron I Jockey 60000 LOA Maverick P 40000 Pao PIF-Harbinger R1 TcMar-Tc1 20000 TcMar-Tigger CMC-EnSpm hAT-Blackjack L2 0 Ngaro tRNA S.flava D.sproati D.murphyi D.ochracea D.grimshawi D.melanogaster

Figure 15: Abundance of the families of repetitive elements across picture-wings, S. flava and D. melanogaster . Values were obtained from RepeatMasker. Not included in graph are repeat families which had values less than 500 across all species.

71 Frequency of shared genes across species Number of genes Number of 0 1000 3000 5000

2 3 4 5

# species sharing genes

Figure 16: Frequency of shared 1:1 orthologous genes between Hawaiian picture-wing Drosophila and S. flava . Orthologous genes were found using the program PoFF.

72 D. grimshawi branch LRT p-value <0.8 D. ochracea LRT p-value <0.8 Frequency Frequency 0 2 4 6 8 0 2 4 6 8

0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8

p-value p-value

D. murphyi branch LRT p-value <0.8 D. sproati branch LRT p-value <0.8 Frequency Frequency 0 2 4 6 8 0 2 4 6 8

0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8

p-value p-value

Figure 17: Distribution of p-value statistics (p<0.08) from branch-site model selection tests for each individual Hawaiian picture-wing Drosophila . LRT values obtained from program codeml.

73 D. grimshawi branch LRT p-value <0.8 Frequency 0 2000 4000 6000 8000

0.0 0.2 0.4 0.6 0.8 1.0

p-value

Figure 18: LRT p-value statistic for D. grimshawi using branch-site model. The p-values are highly skewed towards 1, displaying that multiple testing corrections are not necessary.

74 Site model for Hawaiian picture-wing branch LRT p-value <0.8 Frequency 0 100 200 300 400 500

0.0 0.2 0.4 0.6 0.8

p-value

Figure 19: Distribution of p-value statistic (p<0.08) from M7/M8 sites model selection tests for all Hawaiian picture-wing Drosophila . LRT values obtained from program codeml.

75 Site model of chemosensory genes in all species: LRT p-value <0.8 Frequency 0 2 4 6 8 10

0.0 0.2 0.4 0.6 0.8

p-value

Figure 20: Distribution of p-value statistic (p<0.08) from M7/M8 sites model selection tests for all picture-wing Drosophila and Scaptomyza in chemosensory genes. LRT values obtained from program codeml.

76 Site model of chemosensory genes in picture-wings: LRT p-value <0.8 Frequency 0 2 4 6 8 10

0.0 0.2 0.4 0.6 0.8

p-value

Figure 21: Distribution of p-value statistic (p<0.08) from M7/M8 sites model selection tests for all picture-wing Drosophila in chemosensory genes. LRT values obtained from program codeml.

77 Scaptomyza*flaSFLA a*

GH20935/GH11139/ GH13431/GH23102/

GH24199* Drosophila* DGRIgrimshawi*

GH20868* GH12213*

GH17505* Drosophila* DOCHochracea*

GH20606* GH24440* Drosophila* DSPR sproa6*

GH20606* GH22237*Drosophila* DMUR v murphyi*

0.5

Figure 22: Phylogeny displaying genes under selection in each species. Gene IDs correspond to D. grimshawi uniprot gene IDs. Corresponding gene function information can be found in table 10.

78 Scaptomyza*flaSFLA a*

Drosophila* DGRIgrimshawi*

Drosophila* DOCHochracea*

Drosophila* DSPR sproa6*

Drosophila* DMURmurphyi*v

0.5

Figure 23: Relative gain/loss of chemosensory genes across picture-wing species as compared to S. flava . Each color represents a different set of chemosensory gene categories.

79 References

Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data. URL http://www.bioinformatics.babraham.ac.uk/projects/fastqc . Baldwin, B. G. (1997). Adaptive radiation of the Hawaiian silversword alliance: congruence and conflict of phylogenetic evidence from molecular and non-molecular investigations. Molecular evolution and adaptive radiation, 103 (128), 1840-1843. Becher, P. G., Flick, G., Rozpędowska, E., Schmidt, A., Hagman, A., Lebreton, S., . . . Witzgall, P. (2012). Yeast, not fruit volatiles mediate Drosophila melanogaster attraction, oviposition and development. Functional Ecology, 26 (4), 822-828. Benton, R., Vannice, K. S., Gomez-Diaz, C., & Vosshall, L. B. (2009). Variant ionotropic glutamate receptors as chemosensory receptors in Drosophila. Cell, 136 (1), 149-162. Blackburne, B. P., & Whelan, S. (2012). Class of multiple sequence alignment algorithm affects genomic analysis. Molecular biology and evolution , mss256. Boake, C. R. (2005). Sexual selection and speciation in hawaiian Drosophila. Behav Genet, 35 (3), 297-303. doi:10.1007/s10519-005-3221-4 Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, İ., . . . Chikhi, R. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. arXiv preprint arXiv:1301.5406 . Capy, P., Gasperi, G., Biémont, C., & Bazin, C. (2000). Stress and transposable elements: co ‐evolution or useful parasites? Heredity, 85 (2), 101-106. Carson, H. L. (1982). Evolution of Drosophila on the newer Hawaiian volcanoes. Heredity, 48 (1), 3-25. Carson, H. L. (1983). Chromosomal sequences and interisland colonizations in Hawaiian Drosophila. Genetics, 103 (3), 465-482. Carson, H. L. (1992). Inversions in Hawaiian Drosophila : CB Krimbas & JR Powell. CRC Press, Boca Raton. Carson, H. L., & Kaneshiro, K. Y. (1976). Drosophila of Hawaii: systematics and ecological genetics. Annual Review of Ecology and Systematics , 311-345. Casacuberta, E., & González, J. (2013). The impact of transposable elements in environmental adaptation. Molecular Ecology, 22 (6), 1503-1517. Chalopin, D., Naville, M., Plard, F., Galiana, D., & Volff, J.-N. (2015). Comparative analysis of transposable elements highlights mobilome diversity and evolution in vertebrates. Genome biology and evolution, 7 (2), 567-580. Chandler, J. A., Eisen, J. A., & Kopp, A. (2012). Yeast communities of diverse Drosophila species: comparison of two symbiont groups in the same hosts. Applied and environmental microbiology, 78 (20), 7327-7336. Chippindale, A. K., Leroi, A. M., Kim, S. B., & Rose, M. R. (1993). Phenotypic plasticity and selection in Drosophila life ‐history evolution. I. Nutrition and the cost of reproduction. Journal of Evolutionary Biology, 6 (2), 171-193. Clark, A. G., Eisen, M. B., Smith, D. R., Bergman, C. M., Oliver, B., Markow, T. A., . . . Iyer, V. N. (2007). Evolution of genes and genomes on the Drosophila phylogeny. Nature, 450 (7167), 203-218. Clayton, F. E. (1988). The role of heterochromatin in karyotype variation among Hawaiian picture-winged Drosophila. Cohen, M. B., Schuler, M. A., & Berenbaum, M. R. (1992). A host-inducible cytochrome P-450 from a host-specific caterpillar: molecular cloning and evolution. Proceedings of the National Academy of Sciences, 89(22), 10920-10924. Conceição, I., & Aguadé, M. (2008). High Incidence of Interchromosomal Transpositions in the Evolutionary History of a Subset of Or Genes in Drosophila. Journal of molecular evolution, 66 (4), 325-332. doi:10.1007/s00239-008-9071-y Coyne, J. A., & Orr, H. A. (2004). Speciation (Vol. 37): Sinauer Associates Sunderland, MA. Craddock, E. M., Gall, J. G., & Jonas, M. (2016). Hawaiian Drosophila genomes: size variation and evolutionary expansions. Genetica , 1-18. Craddock, E. M., & Johnson, W. E. (1979). Genetic variation in Hawaiian Drosophila. V. Chromosomal and allozymic diversity in and its homosequential species. Evolution , 137-155. Crosby, M. A., Goodman, J. L., Strelets, V. B., Zhang, P., Gelbart, W. M., & Consortium, F. (2007). FlyBase: genomes by the dozen. Nucleic acids research, 35 (suppl 1), D486-D491.

80 DeSalle, R., & Giddings, L. V. (1986). Discordance of nuclear and mitochondrial DNA phylogenies in Hawaiian Drosophila. Proceedings of the National Academy of Sciences, 83(18), 6902-6906. Ehrlich, P. R., & Raven, P. H. (1964). Butterflies and plants: a study in coevolution. Evolution , 586-608. Eichler, E. E., & Sankoff, D. (2003). Structural dynamics of eukaryotic chromosome evolution. Science, 301 (5634), 793-797. Feder, J. L., Egan, S. P., & Nosil, P. (2012). The genomics of speciation-with-gene-flow. Trends in Genetics, 28 (7), 342-350. Feder, J. L., Flaxman, S. M., Egan, S. P., Comeault, A. A., & Nosil, P. (2013). Geographic mode of speciation and genomic divergence. Annual Review of Ecology, Evolution, and Systematics, 44 , 73-97. Fellows, D. P., & Heed, W. B. (1972). Factors affecting host plant selection in desert-adapted cactiphilic Drosophila. Ecology , 850-858. Feschotte, C. (2008). Transposable elements and the evolution of regulatory networks. Nature Reviews Genetics, 9(5), 397-405. Fletcher, W., & Yang, Z. (2010). The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection. Molecular biology and evolution, 27 (10), 2257-2267. Fogleman, J. C., Starmer, W. T., & Heed, W. B. (1981). Larval selectivity for yeast species by Drosophila mojavensis in natural substrates. Proceedings of the National Academy of Sciences, 78(7), 4435-4439. Frank, M. R., & Fogleman, J. C. (1992). Involvement of cytochrome P450 in host-plant utilization by Sonoran Desert Drosophila. Proceedings of the National Academy of Sciences, 89(24), 11998-12002. Ganter, P. F. (1988). The vectoring of cactophilic yeasts by Drosophila. Oecologia, 75 (3), 400-404. Gardiner, A., Barker, D., Butlin, R. K., Jordan, W. C., & Ritchie, M. G. (2008). Drosophila chemoreceptor gene evolution: selection, specialization and genome size. Molecular Ecology, 17 (7), 1648-1657. Gillespie, R. (2004). Community assembly through adaptive radiation in Hawaiian spiders. Science, 303 (5656), 356- 359. Gillespie, R. G., & Roderick, G. K. (2002). Arthropods on islands: colonization, speciation, and conservation. Annual review of entomology, 47 (1), 595-632. Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., . . . Sykes, S. (2011). High- quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108 (4), 1513-1518. Goldman-Huertas, B., Mitchell, R. F., Lapoint, R. T., Faucher, C. P., Hildebrand, J. G., & Whiteman, N. K. (2015). Evolution of herbivory in linked to loss of behaviors, antennal responses, odorant receptors, and ancestral diet. Proceedings of the National Academy of Sciences, 112 (10), 3026-3031. Grabherr, M. G., Russell, P., Meyer, M., Mauceli, E., Alföldi, J., Di Palma, F., & Lindblad-Toh, K. (2010). Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics, 26 (9), 1145- 1151. Grime, J., & Mowforth, M. (1982). Variation in genome size—an ecological interpretation. Guillén, Y., Rius, N., Delprat, A., Williford, A., Muyas, F., Puig, M., . . . Negre, B. (2015). Genomics of ecological adaptation in cactophilic Drosophila. Genome biology and evolution, 7 (1), 349-366. Heed, W. B. (1971). Host plant specificity and speciation in Hawaiian Drosophila. Taxon , 115-121. Holt, C., & Yandell, M. (2011). MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC bioinformatics, 12 (1), 491. Hoskins, R. A., Carlson, J. W., Wan, K. H., Park, S., Mendez, I., Galle, S. E., . . . Svirskas, R. (2015). The Release 6 reference sequence of the Drosophila melanogaster genome. Genome research, 25 (3), 445-458. Hunt, J. A., Bishop, J. G., & Carson, H. L. (1984). Chromosomal mapping of a middle-repetitive DNA sequence in a cluster of five species of Hawaiian Drosophila. Proceedings of the National Academy of Sciences, 81(22), 7146-7150. Jaenike, J. (1990). Host specialization in phytophagous insects. Annual Review of Ecology and Systematics , 243- 273. Jeffares, D. C., Tomiczek, B., Sojo, V., & dos Reis, M. (2015). A Beginners Guide to Estimating the Non- synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome Parasite Genomics Protocols (pp. 65-90): Springer. Jones, W. D., Cayirlioglu, P., Kadow, I. G., & Vosshall, L. B. (2007). Two chemosensory receptors together mediate carbon dioxide detection in Drosophila. Nature, 445 (7123), 86-90. Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O., & Walichiewicz, J. (2005). Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and genome research, 110 (1-4), 462-467.

81 Kajitani, R., Toshimoto, K., Noguchi, H., Toyoda, A., Ogura, Y., Okuno, M., . . . Maruyama, H. (2014). Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome research , gr. 170720.170113. Kambysellis, M. P., Ho, K.-F., Craddock, E. M., Piano, F., Parisi, M., & Cohen, J. (1995). Pattern of ecological shifts in the diversification of Hawaiian Drosophila inferred from a molecular phylogeny. Current Biology, 5(10), 1129-1139. Kaminker, J. S., Bergman, C. M., Kronmiller, B., Carlson, J., Svirskas, R., Patel, S., . . . Rubin, G. M. (2002). The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol, 3(12), RESEARCH0084. Kaneshiro, K. Y. (1976). Ethological isolation and phylogeny in the planitibia subgroup of Hawaiian Drosophila. Evolution , 740-745. Kaneshiro, K. Y. (1980). Sexual isolation, speciation and the direction of evolution. Evolution , 437-444. Kaneshiro, K. Y., & Boake, C. R. (1987). Sexual selection and speciation: issues raised by Hawaiian Drosophila. Trends in Ecology & Evolution, 2 (7), 207-212. Kaneshiro, K. Y., & Kambysellis, M. P. (1999). Description of a New Allopatric Sibling Species of Hawaiian Picture-Winged Drosophila. Pacific science, 58 (2). Kang, L., Settlage, R., McMahon, W., Michalak, K., Tae, H., Garner, H., Stacy, E., Price, D. K., Michalak, P. (2016). Genomic Signatures of Speciation in Sympatric and Alopatric Hawaiian Picture-winged Drosophila. Genome biology and evolution . Kidwell, M. G., & Lisch, D. R. (2001). Perspective: transposable elements, parasitic DNA, and genome evolution. Evolution, 55 (1), 1-24. Koonin, E. V. (2005). Orthologs, paralogs, and evolutionary genomics 1. Annu. Rev. Genet., 39 , 309-338. Korf, I. (2004). Gene finding in novel genomes. BMC bioinformatics, 5 (1), 1. Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., . . . Marra, M. A. (2009). Circos: an information aesthetic for comparative genomics. Genome research, 19 (9), 1639-1645. Kück, P. (2009). ALICUT: a Perlscript which cuts ALISCORE identified RSS. Department of Bioinformatics, Zoologisches Forschungsmuseum A. Koenig (ZFMK), Bonn, Germany, version, 2 . Kück, P., Meusemann, K., Dambach, J., Thormann, B., von Reumont, B. M., Wägele, J. W., & Misof, B. (2010). Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Frontiers in zoology, 7 (1), 1. Lapoint, R. T., O’Grady, P. M., & Whiteman, N. K. (2013). Diversification and dispersal of the Hawaiian Drosophilidae: The evolution of Scaptomyza. Molecular Phylogenetics and Evolution, 69 (1), 95-108. Lechner, M., Hernandez-Rosales, M., Doerr, D., Wieseke, N., Thévenin, A., Stoye, J., . . . Stadler, P. F. (2014). Orthology detection combining clustering and synteny for very large datasets. Levins, R. (1968). Evolution in changing environments: some theoretical explorations : Princeton University Press. Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25 (14), 1754-1760. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25 (16), 2078-2079. Li, X., Berenbaum, M. R., & Schuler, M. A. (2002). Plant allelochemicals differentially regulate Helicoverpa zea cytochrome P450 genes. Insect molecular biology, 11 (4), 343-351. Lindgreen, S. (2012). AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC research notes, 5(1), 337. Lindsley, D. L., & Zimm, G. G. (2012). The genome of Drosophila melanogaster : Academic Press. Löytynoja, A., & Goldman, N. (2008). Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science, 320 (5883), 1632-1635. Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., . . . Liu, Y. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1 (1), 18. Magnacca, K. N., Foote, D., & O’Grady, P. M. (2008). A review of the endemic Hawaiian Drosophilidae and their host plants. Zootaxa, 1728 , 1-58. Magnacca, K. N., & Price, D. K. (2012). New species of Hawaiian picture wing Drosophila(Diptera: Drosophilidae), with a key to species. Zootaxa, 3188 , 1-30. Magnacca, K. N., & Price, D. K. (2015). Rapid adaptive radiation and host plant conservation in the Hawaiian picture wing Drosophila (Diptera: Drosophilidae). Molecular Phylogenetics and Evolution, 92 , 226-242.

82 Mao, Y.-B., Cai, W.-J., Wang, J.-W., Hong, G.-J., Tao, X.-Y., Wang, L.-J., . . . Chen, X.-Y. (2007). Silencing a cotton bollworm P450 monooxygenase gene by plant-mediated RNAi impairs larval tolerance of gossypol. Nature biotechnology, 25 (11), 1307-1313. Markow, T., & O’grady, P. (2008). Reproductive ecology of Drosophila. Functional Ecology, 22 (5), 747-759. Matsubayashi, K., Kahono, S., & Katakura, H. (2011). Divergent host plant specialization as the critical driving force in speciation between populations of a phytophagous ladybird beetle. Journal of Evolutionary Biology, 24 (7), 1421-1432. Matzkin, L. M., Watts, T. D., Bitler, B. G., Machado, C. A., & Markow, T. A. (2006). Functional genomics of cactus host shifts in Drosophila mojavensis. Molecular Ecology, 15 (14), 4635-4643. Mayr, E. (1970). Populations, species, and evolution: an abridgment of animal species and evolution : Harvard University Press. McBride, C. S. (2007). Rapid evolution of smell and taste receptor genes during host specialization in Drosophila sechellia. Proc Natl Acad Sci U S A, 104 (12), 4996-5001. doi:10.1073/pnas.0608424104 McBride, C. S., & Arguello, J. R. (2007). Five Drosophila genomes reveal nonneutral evolution and the signature of host specialization in the chemoreceptor superfamily. Genetics, 177 (3), 1395-1416. Mendelson, T. C., & Shaw, K. L. (2005). Sexual behaviour: rapid speciation in an arthropod. Nature, 433 (7024), 375-376. Mirarab, S., Bayzid, M. S., Boussau, B., & Warnow, T. (2014). Statistical binning enables an accurate coalescent- based estimation of the avian tree. Science, 346 (6215), 1250463. Nosil, P. (2007). Divergent host plant adaptation and reproductive isolation between ecotypes of Timema cristinae walking sticks. The American Naturalist, 169 (2), 151-162. O'Grady, P., & DeSalle, R. (2008). Out of Hawaii: the origin and biogeography of the genus Scaptomyza (Diptera: Drosophilidae). Biology Letters, 4 (2), 195-199. O'grady, P. M., Baker, R. H., Durando, C. M., Etges, W. J., & DeSalle, R. (2001). Polytene chromosomes as indicators of phylogeny in several species groups of Drosophila. BMC evolutionary biology, 1 (1), 6. O’Grady, P. M., Lapoint, R. T., Bonacum, J., Lasola, J., Owen, E., Wu, Y., & DeSalle, R. (2011). Phylogenetic and ecological relationships of the Hawaiian Drosophila inferred by mitochondrial DNA analysis. Molecular Phylogenetics and Evolution, 58 (2), 244-256. Ohta, A. T. (1978). Ethological isolation and phylogeny in the grimshawi species complex of Hawaiian Drosophila. Evolution , 485-492. Parra, G., Bradnam, K., & Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23 (9), 1061-1067. Patthy, L. (1999). Genome evolution and the evolution of exon-shuffling—a review. Gene, 238 (1), 103-114. Poelstra, J. W., Vijay, N., Bossu, C. M., Lantz, H., Ryll, B., Müller, I., . . . Grabherr, M. G. (2014). The genomic landscape underlying phenotypic integrity in the face of gene flow in crows. Science, 344 (6190), 1410- 1414. Powell, J. R., & Moriyama, E. N. (1997). Evolution of codon usage bias in Drosophila. Proceedings of the National Academy of Sciences, 94 (15), 7784-7790. Price, J. P., & Clague, D. A. (2002). How old is the Hawaiian biota? Geology and phylogeny suggest recent divergence. Proceedings of the Royal Society of London B: Biological Sciences, 269 (1508), 2429-2435. Roderick, G., & Gillespie, R. (1998). Speciation and phylogeography of Hawaiian terrestrial arthropods. Molecular Ecology, 7 (4), 519-531. Roderick, G., & Percy, D. (2008). Host plant use, diversification, and coevolution: insights from remote oceanic islands. Specialization, Speciation, and Radiation. Evolutionary Biology of Herbivorous Insects , 151-161. Rundle, H. D., & Nosil, P. (2005). Ecological speciation. Ecology Letters, 8 (3), 336-352. Sang, J. H. (1956). The quantitative nutritional requirements of Drosophila melanogaster. Journal of Experimental Biology, 33 (1), 45-72. Schluter, D. (2009). Evidence for ecological speciation and its alternative. Science, 323 (5915), 737-741. Scott, J. G. (1999). Cytochromes P450 and insecticide resistance. Insect biochemistry and molecular biology, 29 (9), 757-777. Sene, F. M., & Carson, H. L. (1977). Genetic variation in Hawaiian Drosophila. IV. Allozymic similarity between D. silvestris and D. heteroneura from the island of Hawaii. Genetics, 86 (1), 187-198. Sharp, P. M., & Matassi, G. (1994). Codon usage and genome evolution. Current opinion in genetics & development, 4 (6), 851-860. Simpson, J. T. (2013). Exploring Genome Characteristics and Sequence Quality Without a Reference. arXiv preprint arXiv:1307.8026 .

83 Simpson, J. T., & Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome research, 22 (3), 549-556. Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., & Birol, İ. (2009). ABySS: a parallel assembler for short read sequence data. Genome research, 19 (6), 1117-1123. Smit, A., Hubley, R., & Green, P. 1996–2004. RepeatMasker Open-3.0 http://repeatmasker.org/ . Smit, A., Hubley, R., & Green, P. (2014). RepeatModeler Open-1.0. 2008-2010. Soria-Carrasco, V., Gompert, Z., Comeault, A. A., Farkas, T. E., Parchman, T. L., Johnston, J. S., . . . Schwander, T. (2014). Stick insect genomes reveal natural selection’s role in parallel speciation. Science, 344 (6185), 738- 742. Spieth, H. T. (1974). Courtship behavior in Drosophila. Annual review of entomology, 19 (1), 385-405. Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics , btu033. Stamps, J. A., Yang, L. H., Morales, V. M., & Boundy-Mills, K. L. (2012). Drosophila regulate yeast density and increase yeast community similarity in a natural substrate. PLoS One, 7 (7), e42238. Stanke, M., & Waack, S. (2003). Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19 (suppl 2), ii215-ii225. Starmer, W. T., & Fogleman, J. C. (1986). Coadaptation ofDrosophila and yeasts in their natural habitat. Journal of Chemical Ecology, 12 (5), 1037-1055. Sun, F.-L., Cuaycong, M. H., Craig, C. A., Wallrath, L. L., Locke, J., & Elgin, S. C. (2000). The fourth chromosome of Drosophila melanogaster: interspersed euchromatic and heterochromatic domains. Proceedings of the National Academy of Sciences, 97 (10), 5340-5345. Taft, R. J., Pheasant, M., & Mattick, J. S. (2007). The relationship between non ‐protein ‐coding DNA and eukaryotic complexity. Bioessays, 29 (3), 288-299. Templeton, A. R. (1979). Once again, why 300 species of Hawaiian Drosophila? Evolution, 33 (1), 513-517. Thomas, J., Touchman, J., Blakesley, R., Bouffard, G., Beckstrom-Sternberg, S. M., Margulies, E., . . . McDowell, J. (2003). Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424 (6950), 788-793. Touhara, K., & Vosshall, L. B. (2009). Sensing odorants and pheromones with chemosensory receptors. Annual review of physiology, 71 , 307-332. Vezzi, F., Narzisi, G., & Mishra, B. (2012). Feature-by-feature–evaluating de novo sequence assembly. PLoS One, 7(2), e31002. Wagner, W. L., & Funk, V. A. (1995). Hawaiian biogeography : Smithsonian Institution Press. Walbot, V., & Petrov, D. A. (2001). Gene galaxies in the maize genome. Proceedings of the National Academy of Sciences, 98 (15), 8163-8164. Wang, P., Lyman, R. F., Shabalina, S. A., Mackay, T. F., & Anholt, R. R. (2007). Association of polymorphisms in odorant-binding protein genes with variation in olfactory response to benzaldehyde in Drosophila. Genetics, 177 (3), 1655-1665. Whiteman, N. K., & Pierce, N. E. (2008). Delicious poison: genetics of Drosophila host plant preference. Trends Ecol Evol, 23 (9), 473-478. doi:10.1016/j.tree.2008.05.010 Whiteman, N. K., & Pierce, N. E. (2008). Delicious poison: genetics of Drosophila host plant preference. Trends in Ecology & Evolution, 23 (9), 473-478. Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution, 24 (8), 1586-1591. Yang, Z., Wong, W. S., & Nielsen, R. (2005). Bayes empirical Bayes inference of amino acid sites under positive selection. Molecular biology and evolution, 22 (4), 1107-1118. Zhou, S., Stone, E. A., Mackay, T. F., & Anholt, R. R. (2009). Plasticity of the chemoreceptor repertoire in Drosophila melanogaster. PLoS Genet, 5 (10), e1000681.

84