GBE

Genomics of Ecological Adaptation in Cactophilic

Yolanda Guille´n1,Nu´ ria Rius1, Alejandra Delprat1, Anna Williford2,FrancescMuyas1, Marta Puig1, So` nia Casillas1,3, Miquel Ra`mia1,3, Raquel Egea1,3, Barbara Negre4,5, Gisela Mir6,7, Jordi Camps8, Valentı´ Moncunill9, Francisco J. Ruiz-Ruano10, Josefa Cabrero10,LeonardoG.deLima11, Guilherme B. Dias11, Jeronimo C. Ruiz12,Aure´lie Kapusta13, Jordi Garcia-Mas6, Marta Gut8,IvoG.Gut8, David Torrents9, Juan P. Camacho10,GustavoC.S.Kuhn11,Ce´dric Feschotte13,AndrewG.Clark14, Esther Betra´n2, Antonio Barbadilla1,3, and Alfredo Ruiz1,*

1Departament de Gene`tica i de Microbiologia, Universitat Auto` noma de Barcelona, Spain Downloaded from 2Department of Biology, University of Texas at Arlington 3Institut de Biotecnologia i de Biomedicina, Universitat Auto` noma de Barcelona, Spain 4EMBL/CRG Research Unit in Systems Biology, Centre for Genomic Regulation (CRG), Barcelona, Spain 5 Universitat Pompeu Fabra (UPF), Barcelona, Spain http://gbe.oxfordjournals.org/ 6IRTA, Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Edifici CRAG, Barcelona, Spain 7The Peter MacCallum Cancer Centre, East Melbourne, Victoria, Australia 8Centro Nacional de Ana´lisis Geno´ mico (CNAG), Parc Cientı´fic de Barcelona, Torre I, Barcelona, Spain 9Barcelona Supercomputing Center (BSC), Edifici TG (Torre Girona), Barcelona, Spain and Institucio´ Catalana de Recerca i Estudis Avanc¸ats (ICREA), Barcelona, Spain 10Departamento de Gene´tica, Facultad de Ciencias, Universidad de Granada, Spain 11 Instituto de Cieˆncias Biolo´ gicas, Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil at Universidad de Granada - Historia las Ciencias on August 15, 2015 12Informa´tica de Biossistemas, Centro de Pesquisas Rene´ Rachou—Fiocruz Minas, Belo Horizonte, MG, Brazil 13Department of Human Genetics, University of Utah School of Medicine 14Department of Molecular Biology and Genetics, Cornell University *Corresponding author: E-mail: [email protected]. Accepted: December 23, 2014

Abstract Cactophilic Drosophila species provide a valuable model to study gene–environment interactions and ecological adaptation. Drosophila buzzatii and Drosophila mojavensis are two cactophilic species that belong to the repleta group, but have very different geographical distributions and primary host plants. To investigate the genomic basis of ecological adaptation, we sequenced the genome and developmental transcriptome of D. buzzatii and compared its gene content with that of D. mojavensis and two other noncactophilic Drosophila species in the same subgenus. The newly sequenced D. buzzatii genome (161.5 Mb) comprises 826 scaffolds (>3 kb) and contains 13,657 annotated protein-coding genes. Using RNA sequencing data of five life-stages we found expression of 15,026 genes, 80% protein-coding genes, and 20% noncoding RNA genes. In total, we detected 1,294 genes putatively under positive selection. Interestingly, among genes under positive selection in the D. mojavensis lineage, there is an excess of genes involved in metabolism of heterocyclic compounds that are abundant in Stenocereus cacti and toxic to nonresident Drosophila species. We found 117 orphan genes in the shared D. buzzatii–D. mojavensis lineage. In addition, gene duplication analysis identified lineage-specific expanded families with functional annotations associated with proteolysis, zinc ion binding, chitin binding, sensory perception, ethanol tolerance, immunity, physiology, and reproduction. In summary, we identified genetic signatures of adaptation in the shared D. buzzatii–D. mojavensis lineage, and in the two separate D. buzzatii and D. mojavensis lineages. Many of the novel lineage-specific genomic features are promising candidates for explaining the adaptation of these species to their distinct ecological niches. Key words: cactophilic Drosophila, genome sequence, ecological adaptation, positive selection, orphan genes, gene duplication.

ß The Author(s) 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 349 Guille´netal. GBE

Introduction On the other hand, D. mojavensis is endemic to the deserts of Southwestern United States and Northwestern Mexico. Its Drosophila species are saprophagous that feed and primary host plants are Stenocereus gummosus (pitaya agria) breed on a variety of fermenting plant materials, chiefly in Baja California and Stenocereus thurberi (organ pipe) in fruits, flowers, slime fluxes, decaying bark, leaves and stems, Arizona and Sonora, but uses also Ferocactus cylindraceous cactus necroses, and fungi (Carson 1971). These substrates (California barrel) in Southern California and Opuntia sp. in include bacteria and yeasts that decompose the plant tissues Santa Catalina Island (Fellows and Heed 1972; Heed and and contribute to the nutrition of larvae and adults (Starmer Mangan 1986; Ruiz and Heed 1988; Etges et al. 1999). The 1981; Begon 1982). Only two species groups use cacti as their ecological conditions of the Sonoran Desert are extreme (dry, primary breeding site: repleta (Oliveira et al. 2012)andnan- arid, and hot), as attested by the fact that only four Drosophila Downloaded from noptera (Lang et al. 2014). Both species groups originated at species are endemic (Heed and Mangan 1986). In addition, the virilis–repleta radiation, 20–30 Ma (Throckmorton 1975; D. mojavensis chief host plants, pitaya agria and organ pipe, Morales-Hojas and Vieira 2012; Oliveira et al. 2012)but are chemically complex and contain large quantities of triter- adapted independently to the cactus niche. The “cactus- pene glycosides, unusual medium-chain fatty acids, and sterol yeast-Drosophila system” in arid zones provides a valuable diols (Kircher 1982; Fogleman and Danielson 2001). These http://gbe.oxfordjournals.org/ model to investigate gene–environment interactions and eco- allelochemicals are toxic to nonresident Drosophila species, logical adaptation from genetic and evolutionary perspectives decreasing significantly larval performance (Fogleman and (Barker and Starmer 1982; Barker et al. 1990). Rotting cacti Kircher 1986; Ruiz and Heed 1988; Fogleman and provide relatively abundant, predictable, and long-lasting re- Armstrong 1989; Frank and Fogleman 1992). In addition, sources that can sustain very large Drosophila populations. For host plant chemistry and fermentation byproducts affect instance, a single saguaro rot may weigh up to several tons, adult epicuticular hydrocarbons and mating behavior last for many months, and sustain millions of Drosophila larvae (Havens and Etges 2013) as well as expression of hundreds

and adults (Breitmeyer and Markow 1998). On the other at Universidad de Granada - Historia las Ciencias on August 15, 2015 of genes (Matzkin et al. 2006; Etges et al. 2015; Matzkin hand, cacti are usually found in arid climates with middle to 2014). high temperatures that may impose desiccation and thermal As a first step to understand the genetic bases of ecological stresses (Loeschcke et al. 1997; Hoffmann et al. 2003; adaptation, here we compare the genomes of the two cacto- Rajpurohit et al. 2013). Finally, some cacti may contain allelo- philic species with those of two noncactophilic species of the chemicals that can be toxic for Drosophila (see below). Thus, Drosophila subgenus: D. virilis that belongs to the virilis species adaptation to use cacti as breeding sites must have entailed a group and D. grimshawi that belongs to the picture wing fairly large number of changes in reproductive biology, behav- group of Hawaiian Drosophila (fig. 1). The lineage leading to ior, physiology, and biochemistry (Markow and O’Grady the common ancestor of D. buzzatii and D. mojavensis after 2008). diverging from D. virilis (#3 in fig. 1) represents the lineage that We have sequenced the genome and developmental tran- adapted to the cactus niche (likely Opuntia; Oliveira et al. scriptome of Drosophila buzzatii to carry out a comparative 2012), whereas the lineages leading to D. buzzatii (#1) and analysis with those of Drosophila mojavensis, Drosophila virilis, D. mojavensis (#2) adapted to the specific niche of each spe- and Drosophila grimshawi (Drosophila 12 Genomes cies. We carried out a genome-wide scan for 1) genes under Consortium et al. 2007). Drosophila buzzatii and D. mojaven- positive selection, 2) lineage-specific genes, and 3) gene- sis are both cactophilic species that belong to the mulleri sub- duplications in the three lineages (fig. 1).Basedontheresults group of the repleta group (Wasserman 1992; Oliveira et al. of our comparative analyses, we provide a list of candidate 2012), although they have very different geographical distri- genes that might play a meaningful role in the ecological ad- butions and host plants (fig. 1). Drosophila buzzatii is a sub- aptation of these fruit flies. cosmopolitan species which is found in four of the six major biogeographic regions (David and Tsacas 1980). This species is originally from Argentina and Bolivia but now has a wide geo- Materials and Methods graphical distribution that includes other regions of South We sequenced the genome of a highly inbreed D. buzzatii America, the Old World, and Australia (Carson and strain, st-1 (Betran et al. 1998). DNA was extracted from Wasserman 1965; Fontdevila et al. 1981; Hasson et al. male and female adults (Pin˜ ol et al. 1988; Milligan 1998). 1995; Manfrin and Sene 2006). It chiefly feeds and breeds Reads were generated with three different sequencing plat- in rotting tissues of several Opuntia cacti but can also occa- forms (supplementary fig. S2 and table S12, Supplementary sionally use columnar cacti (Hasson et al. 1992; Ruiz et al. Material online). The assembly of the genome was performed 2000; Oliveira et al. 2012). The geographical dispersal of in three stages (supplementary table S13, Supplementary Opuntia by humans in historical times is considered the Material online): Preassembly (Margulies et al. 2005), scaffold- main driver of the world-wide expansion of D. buzzatii ing (Boetzer et al. 2011), and gapfilling (Nadalin et al. 2012). In (Fontdevila et al. 1981; Hasson et al. 1995). each step, a few chimeric scaffolds were identified and split.

350 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 Genomics of Ecological Adaptation GBE

(a)(b) Downloaded from http://gbe.oxfordjournals.org/

FIG.1.—(a) Phylogenetic relationship of fruit fly species considered in our comparative analysis and their host preference. (b) Geographical distribution of cactophilic species D. buzzatii (red) and D. mojavensis (green) in America. at Universidad de Granada - Historia las Ciencias on August 15, 2015

The final assembly, named Freeze 1, contains 826 scaffolds acetic acid. Following Ruiz-Ruano et al. (2011),thesamples greater than 3 kb and N50 and N90 index are 30 and 158, were stained by Feulgen reaction and images obtained by respectively. The distribution of read depth in the preassembly optical microscopy were analyzed with the pyFIA software showed a Gaussian distribution with a prominent mode (supplementary fig. S5 and table S18, Supplementary centered at approximately 22Â (supplementary fig. S3, Material online). Supplementary Material online). CG content is approximately The 826 scaffolds in Freeze 1 were assigned to chromo- 35% overall, approximately 42% in gene regions (including somes by aligning their sequences with the D. mojavensis introns) and reaches approximately 52% in exons (supple- genome using MUMmer (Delcher et al. 2003). In addition, mentary table S14, Supplementary Material online). the 158 scaffolds in the N90 index were mapped, ordered, Unidentified nucleotides (N’s) represent approximately 9% and oriented (supplementary fig. S1, Supplementary Material overall, approximately 4% in gene regions, and 0.004% in online) using conserved linkage (Schaeffer et al. 2008), in situ exons. Sequence quality was assessed by comparing Freeze 1 hybridization, and additional information (Gonza´lez et al. with five Sanger sequenced bacterial artificial chromosomes 2005; Guille´n and Ruiz 2012). To estimate the number of (BACs) (Negre et al. 2005; Prada 2010; Calvete et al. 2012) rearrangements between D. buzzatii and D. mojavensis, and with Illumina genomic and RNA sequencing (RNA-Seq) their chromosomes were compared using GRIMM (Tesler reads (supplementary fig. S4, Supplementary Material online). 2002; Delprat A, Guille´nY,RuizA,inpreparation).Genesin Quality assessments gave an overall error rate of approxi- the Hox gene complex (HOM-C) and five other gene com- mately 0.0005 and a PHRED quality score of approximately plexes were searched in silico in the D. buzzatii genome and Q33 (supplementary tables S15 and S16, Supplementary manually annotated using available information (Negre et al. Material online). An overall proportion of segregating sites 2005), the annotated D. mojavensis and Drosophila melano- of approximately 0.1% was estimated (supplementary table gaster genomes, and the RNA-seq data generated for D. buz- S17, Supplementary Material online). zatii (Negre B, Muyas F, Guille´n Y, Ruiz A, in preparation). The genome size of two D. buzzatii strains, st-1 and j-19, Transposable elements (TEs) were annotated with was estimated by Feulgen Image Analysis Densitometry. The RepeatMasker using a comprehensive TE library compiled genome size of D. mojavensis 15081-1352.22 strain from FlyBase (St Pierre et al. 2014), Repbase (Jurka et al. (193,826,310 bp) was used as reference (Drosophila 12 2005), and RepeatModeler. Tandem Repeats Finder version Genomes Consortium et al. 2007). Testicles from anesthetized 4.04 (Benson 1999) was used to identify satellite DNAs males were dissected in saline solution and fixed in acetic- (satDNAs). alcohol 3:1. Double preparations of D. mojavensis and D. For the RNA-Seq experiments, RNA from frozen samples buzzatii were made by crushing the fixed testicles in 50% (embryos, larvae, pupae, adult males, and adult females) was

Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 351 Guille´netal. GBE

processed using the TruSeq RNA sample preparation kit pro- The data set was further modified to include additional family vided by Illumina. We used a Hi-Seq2000 Illumina Sequencer members based on sequence coverage and to exclude family to generate nonstrand-specific paired-end approximately members with internal stop codons and matches to TEs. Gene 100 bp reads from poly(A) + RNA. Between 60 and 89 million counts for each family from the four species were analyzed reads were generated per sample. A total of approximately with an updated version of CAFE (CAFE 3.1 provided by the 286 million filtered reads were mapped to Freeze 1 with authors; Han et al. 2013) to identify lineage-specific expan- TopHat (Trapnell et al. 2009) representing approximately sions. The sets of CAFE-identified expanded families in the D. 180 Â coverage of the total genome size (supplementary buzzatii and D. mojavensis genomes were examined for the table S19, Supplementary Material online). Transcripts were presence of lineage-specific duplications. Families that in- assembled with Cufflinks (Trapnell et al. 2010)using cluded members with ds <0.4 were examined manually and Downloaded from Annotation Release 1 as reference (see below). lineage-specific duplications were inferred when no hits were Protein-coding genes (PCGs) were annotated combining found in the syntenic region of the genome with a missing with EVidence Modeler (EVM; Haas et al. 2008)theresults copy. Drosophila buzzatii-specific RNA-mediated duplications of different predictors: Augustus (Stanke and Waack 2003), were identified by examining intron-less and intron-containing

SNAP (Korf 2004), N-SCAN (Korf et al. 2001), and Exonerate gene family members. A duplicate was considered a retrocopy http://gbe.oxfordjournals.org/ (Slater and Birney 2005). The EVM set contained 12,102 gene if its sequence spanned all introns of the parental gene. The models. We noticed that orthologs for a considerable number number of families identified by CAFE as expanded along the of D. mojavensis PCGs were absent from this data set. Thus, internal cactophilic branch was reduced by considering only we used the Exonerate predictions to detect another 1,555 those families that were also found in expanded category after PCGs not reported by EVM (Poptsova and Gogarten 2010). rerunning the analysis with a less stringent cutoff (35% amino Altogether, we predicted a total of 13,657 PCG models in the acid identity, 50% coverage). The overlapping set of ex- D. buzzatii reference genome (Annotation Release 1). Features panded families was manually examined to verify the absence of these models are given in supplementary table S20, of D. buzzatii and D. mojavensis new family members in the D. at Universidad de Granada - Historia las Ciencias on August 15, 2015 Supplementary Material online. The RSD (reciprocal smallest virilis genome. Functional annotation (i.e., Gene Ontology distance) algorithm (Wall and Deluca 2007) was used to iden- [GO] term) for all expanded families was obtained using the tify 9,114 1:1 orthologs between D. mojavensis and D. buz- DAVID annotation tool (Huang et al. 2009a, 2009b). For zatii. Orthology relationships among the four species in the genes without functional annotation in DAVID, annotations Drosophila subgenus (fig. 1) were inferred from D. buzzatii–D. of D. melanogaster orthologs were used. An extended version mojavensis list of orthologs and the OrthoDB catalog (version of these methods is given as supplementary methods, 6; Kriventseva et al. 2008). To test for positive selection, we Supplementary Material online. compared different codon substitution models using the like- lihood ratio test (LRT). We run two pairs if site models (SM) on the orthologs set between D. buzzatii and D. mojavensis:M7 Results versus M8 and M1a versus M2a (Yang 2007). Then, we used Features of the D. buzzatii Genome branch-site models (BSM) to test for positive selection in three Genome Sequencing and Assembly lineages (fig. 1): D. mojavensis lineage, D. buzzatii lineage, and the lineage that led to the two cactophilic species (D. buzzatii We sequenced and de novo assembled the genome of D. and D. mojavensis). We run Venny software (Oliveros 2007)to buzzatii line st-1 using shotgun and paired-end reads from create a Venn diagram showing shared selected genes among 454/Roche, mate-pair and paired-end reads from Illumina, the different models. We identified genes that are only pre- and Sanger BAC-end sequences (~22Â total expected cover- sent in the two cactophilic species, D. mojavensis and D. buz- age; see Materials and Methods for details). We consider the zatii, by blasting the amino acid sequences from the 9,114 1:1 resulting assembly (Freeze 1) as the reference D. buzzatii orthologs between D. mojavensis and D. buzzatii (excluding genome sequence (table 1). This assembly comprises 826 scaf- misannotated genes) against all the proteins from the remain- folds greater than 3 kb long with a total size of 161.5 Mb. ing 11 Drosophila species available in FlyBase protein data- Scaffold N50 and N90 indexes are 30 and 158, respectively, base, excluding D. mojavensis (St Pierre et al. 2014). whereas scaffold N50 and N90 lengths are 1.38 and 0.16 Mb, For gene duplication analysis (DNA- and RNA-mediated du- respectively (table 1). Quality controls (see Materials and plications), we used annotated PCGs from the four species of Methods) yielded a relatively low error rate of approximately the Drosophila subgenus (see supplementary methods, 0.0005 (PHRED quality score Q = 33). For comparison, we also Supplementary Material online). Briefly, we ran all-against-all assembled the genome of the same line (st-1) using only four BLASTp and selected hits with alignment length extending lanes of short (100 bp) Illumina paired-end reads (~76Â ex- over at least 50% of both proteins and with amino acid iden- pected coverage) and the SOAPdenovo software (Luo et al. tity of at least 50%. Markov Cluster Algorithm (Enright et al. 2012). This resulted in 10,949 scaffolds greater than 3 kb long 2002) was used to cluster retained proteins into gene families. with a total size of 144.2 Mb (table 1). All scaffolds are

352 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 Genomics of Ecological Adaptation GBE

Table 1 Table 2 Summary of Assembly Statistics for the Genome of Drosophila Transposable Element Content of Drosophila buzzatii Genome buzzatii Class Order Annotated Genome Assembly Freeze 1 SOAPdenovo Base Pair Coverage (%) Number of scaffolds (>3 kb) 826 10,949 I (retrotransposons) LTR 2,366,439 1.47 Coverage ~22Â ~76Â DIRS 55 0.00 Assembly size (bp) 161,490,851 144,184,967 LINE 2,541,645 1.57 Scaffold N50 index 30 2,035 II (DNA transposons) TIR 2,017,167 1.25 Scaffold N50 length (bp) 1,380,942 18,900 Helitron 5,531,009 3.42 Scaffold N90 index 158 7,509 Maverick 189,267 0.12 Downloaded from Scaffold N90 length (bp) 161,757 5,703 Unknown 973,759 0.60 Contig N50 index 1,895 2,820 Total 13,619,341 8.43 Contig N50 length (bp) 17,678 3,101 NOTE.—The classification follows Wicker et al. (2007).

available for download from the Drosophila buzzatii Genome interchromosomal reorganizations between D. buzzatii and http://gbe.oxfordjournals.org/ Project web page (http://dbuz.uab.cat, last accessed January D. mojavensis have previously been found (Ruiz et al. 1990; 7, 2015). This site also displays all the information generated in Ruiz and Wasserman 1993) all 826 scaffolds were assigned to this project (see below). chromosomes by BLASTn against the D. mojavensis genome. In addition, the 158 scaffolds in the N90 index were mapped Genome Size and Repeat Content to chromosomes, ordered, and oriented (supplementary fig. S1, Supplementary Material online; Delprat A, Guille´n Y, Ruiz The genome sizes of two D. buzzatii strains, st-1 and j-19, A, in preparation) using conserved linkage (Schaeffer et al. were estimated by Feulgen Image Analysis Densitometry on 2008) and additional information (Gonza´lez et al. 2005; at Universidad de Granada - Historia las Ciencias on August 15, 2015 testis cells (Ruiz-Ruano et al. 2011)usingD. mojavensis as Guille´n and Ruiz 2012). A bioinformatic comparison of D. reference. Integrative Optical Density values were 21% (st-1) buzzatii and D. mojavensis chromosomes confirmed that and 25% (j-19) smaller than those for D. mojavensis.Thus, chromosome 2 differs between these species by ten inversions taking 194 Mb (total assembly size) as the genome size of D. (2m, 2n, 2z7, 2c, 2f, 2g, 2h, 2q, 2r,and2s), chromosomes X mojavensis (Drosophila 12 Genomes Consortium et al. 2007) and 5 differ by one inversion each (Xe and 5g, respectively), we estimated the genome sizes for D. buzzatii st-1 and j-19 and chromosome 4 is homosequential as previously described lines as 153 and 146 Mb, respectively. (Ruiz et al. 1990; Ruiz and Wasserman 1993; Guille´n and Ruiz To assess the TE content of the D. buzzatii genome, we 2012). In contrast, we find that chromosome 3 differs by five masked the 826 scaffolds of Freeze 1 assembly using a library inversions instead of the expected two that were previously of TEs compiled from several sources (see Materials and identified by cytological analyses (Ruiz et al. 1990). These Methods). We detected a total of 56,901 TE copies covering three additional chromosome 3 inversions seem to be specific approximately 8.4% of the genome (table 2). The most abun- to the D. mojavensis lineage (Delprat A, Guille´n Y, Ruiz A, in dant TEs seem to be Helitrons, LINEs, long terminal repeat preparation). One of these inversions, 3f2, is polymorphic in (LTR) retrotransposons, and TIR transposons that cover natural populations of D. mojavensis, but, conflicting with 3.4%, 1.6%, 1.5%, and 1.2% of the genome, respectively previous reports (Ruiz et al. 1990; Schaeffer et al. 2008), ap- (table 2). In addition, we identified tandemly repeated pears to be homozygous in the sequenced strain. This has satDNAs with repeat units longer than 50 bp (Melters et al. been corroborated by the cytological reanalysis of its polytene 2013) (see Materials and Methods). The two most abundant chromosomes (Delprat et al. 2014). tandem repeat families are the pBuM189 satellite (Kuhn et al. Many developmental genes are arranged in gene com- 2008) and the DbuTR198 satellite, a novel family with repeat plexes each comprising a small number of functionally related units 198 bp long (table 3). The remaining tandem repeats had genes. We checked the organization of six of these gene sequence similarity to integral parts of TEs, such as the internal complexes in the D. buzzatii genome: HOM-C, Achaete– tandem repeats of the transposon Galileo (de Lima LG, scute complex, Iroquois complex, NK homeobox gene cluster Svartman M, Ruiz A, Kuhn GCS, in preparation). (NK-C), Enhancer of split complex, and Bearded complex (Brd- C) (Negre B, Muyas F, Guille´n Y, Ruiz A, in preparation). Hox Chromosomal Rearrangements genes were arranged in a single complex in the Drosophila The basic karyotype of D. buzzatii is similar to that of the genus ancestor (Hughes and Kaufman 2002). However, this Drosophila genus ancestor and consists of six chromosome HOM-C suffered two splits (caused by chromosomal inver- pairs: Four pairs of equal-length acrocentric autosomes, one sions) in the lineage leading to the repleta species group pair of “dot” autosomes, a long acrocentric X, and a small (Negre et al. 2005). In order to fully characterize HOM-C or- acrocentric Y (Ruiz and Wasserman 1993). Because no ganization in D. buzzatii, we manually annotated all Hox

Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 353 Guille´netal. GBE

Table 3 Satellite DNAs Identified in the Drosophila buzzatii Genome Tandem Repeat GC Genome Consensus Sequenceb Distribution repeat Length Content (%) Coverage (%)a Family pBuM189 189 29 0.039 GCAAAAGACTCCGTCAATTA D. buzzatii cluster species GAAAACAAAAAATGTTATAGTTTTGAGGATTAACC D. mojavensis GGCAAAAACCGTATTATTTGTTATAT GATTTCTGTATGGAATACCGTTTTAGAA GCGTCTTTTATCGTATTACTCAGATATATCT Downloaded from TAAGATTTAGCATAATCTAAGAACTTTT TGAAATATTCACATTTGTCCA DbuTR198 198 34 0.027 AAGGTAGAAAGGTAGTTGGTGAGATAAACCAGAAAAA D. buzzatii GAGCTAAAAACGGCTAAAAACGGCTAGAAAATAGCCA GAAAGGTAGATTGAACATTAATGGGCAAATGG

ATGGATAAATAAGACTGGTCATCATCCAA http://gbe.oxfordjournals.org/ TGAACAGAATCATGATTAAGAGATAGAAATA TGATTAGAAAGTAGGATAGAAAGGTTAGAAAG

aGenome fraction was calculated assuming a genome size of 163,547,398 bp (version 1 freeze of all contigs). bConsensus sequence generated after clustering TRF results (see Materials and Methods). genes and located them in three scaffolds (2, 5, and 229) of are used as references (Poptsova and Gogarten 2010). A total chromosome 2 (Negre B, Muyas F, Guille´nY,RuizA,inprep- of 13,657 PCGs were annotated in the D. buzzatii genome aration). The analysis of these scaffolds revealed that only two (Annotation Release 1). These PCG models contain a total of at Universidad de Granada - Historia las Ciencias on August 15, 2015 clusters of Hox genes are present. The distal cluster contains 52,250 exons with an average of 3.8 exons per gene. Gene proboscipedia, Deformed, Sex combs reduced, Antennapedia expression analyses provided transcriptional evidence for and Ultrabithorax, whereas the proximal cluster contains 88.4% of these gene models (see below). The number of labial, abdominal A and Abdominal B. This is precisely the PCGs annotated in D. buzzatii is lower than the number an- same HOM-C organization observed in D. mojavensis (Negre notated in D. mojavensis (14,595, Release 1.3), but quite close and Ruiz 2007). Therefore, there seem to be no additional to the number annotated in D. melanogaster (13,955, rearrangements of the HOM-C in D. buzzatii besides those Release 5.56), one of the best-known eukaryotic genomes already described in the genus Drosophila (Negre and Ruiz (St Pierre et al. 2014). However PCGs in both D. buzzatii and 2007). The other five developmental gene complexes contain D. mojavensis genomes tend to be smaller and contain fewer 4, 3, 6, 13, and 6 functionally related genes, respectively (Lai exons than those in the D. melanogaster genome (supple- et al. 2000; Garcia-Ferna`ndez 2005; Irimia et al. 2008; Negre mentary table S1, Supplementary Material online), which and Simpson 2009). All these complexes seem largely con- suggests that the annotation in the two cactophilic species served in the D. buzzatii genome with few exceptions might be incomplete. After applying several quality filters, a (Negre B, Muyas F, Guille´n Y, Ruiz A, in preparation). The total of 12,977 high confidence protein-coding sequences gene slouch is separated from the rest of the NK-C in D. (CDS) were selected for further analysis (see Materials and buzzatii andalsoinallotherDrosophila species outside of Methods). the melanogaster species group; in addition, the gene Bearded,amemberoftheBrd-C, is seemingly absent from Developmental Transcriptome the D. buzzatii and D. mojavensis genomes, although it is present in D. virilis and D. grimshawi. On the other hand, To characterize the expression profile throughout D. buzzatii genes flanking the complexes are often variable, presumably development, we performed RNA-Seq experiments using due to the fixation of chromosomal inversions with break- samples from five different stages: Embryos, larvae, pupae, points in the boundaries of the complexes. adult females, and adult males. Gene expression levels were calculated based on fragments per kilobase of exon per million fragments mapped (FPKM) values. PCG models that did not PCG Content show evidence of transcription (FPKM < 1) were classified as We used a combination of ab initio and similarity-based algo- nonexpressed PCGs, whereas transcribed regions that did not rithms in order to reduce the high false-positive rate associated overlap with any annotated PCG model were tentatively con- with de novo gene prediction (Wang et al. 2003; Misawa and sidered noncoding RNA (ncRNA) genes (fig. 2a). We detected Kikuno 2010) as well as to avoid the propagation of false- expression (FPKM > 1) of 26,455 transcripts and 15,026 positive predicted gene models when closely related species genes, 12,066 (80%) are PCGs and 2,960 (20%) are

354 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 Genomics of Ecological Adaptation GBE

(a) for PCGs and 2.2 for ncRNA genes. Adult males show more 12500 stage-specific genes (844 genes) compared with adult females (137 genes). 10000 PCGswithnoexpressioninthisstudy(FPKM< 1) might be

7500 expressed at a higher level in other tissues or times, or they ncRNA might be inducible under specific conditions that we did not PCG 5000 test (Weake and Workman 2010; Etges et al. 2015; Matzkin 2014). We also must expect that some remaining fraction of 2500 gene models will be false positives (Wang et al. 2003).

However, because we used a combination of different anno- Downloaded from Number of expressed genes Number of expressed 0 tation methods to reduce the proportion of false-positives, we female male embryos larvae pupae adults adults expect this proportion to be very small. On the other hand, Stage transcribed regions that do not overlap with any annotated (b) PCG models are likely ncRNA genes although we cannot dis-

card that some of them might be false negatives, that is, http://gbe.oxfordjournals.org/ 6000 genes that went undetected by our annotation methods per- haps because they contain small open reading frames

4000 (Ladoukakis et al. 2011). One observation supporting that ncRNA

PCG most of them are in fact ncRNA genes is that their expression breadth is quite different from that of PCGs and a high frac- 2000 tion of them are stage-specific genes. In most Drosophila spe- cies, with limited analyses of the transcriptome (Celniker et al. at Universidad de Granada - Historia las Ciencias on August 15, 2015 Number of expressed genes Number of expressed 0 2009), few ncRNA genes have been annotated. In contrast, in D. melanogaster with a very well-annotated genome, 2,096 12345 ncRNA genes have been found (Release 5.56, FlyBase). Thus, Number of stages the number of ncRNA found in D. buzzatii is comparable to that of D. melanogaster. FIG.2.—Developmental expression profile of D. buzzatii genes. (a) Number of expressed PCGs (red) and ncRNA genes (blue) along five de- velopmental stages. (b) Classification of PCGs and ncRNA genes according Website to the number of stages where they are expressed. Awebsite(http://dbuz.uab.cat, last accessed January 7, 2015) has been created to provide free access to all information and ncRNA genes. The number of expressed genes resources generated in this work. It includes a customized (PCGs + ncRNA) increases through the life cycle with a maxi- browser (GBrowse; Stein et al. 2002)fortheD. buzzatii mum of 12,171 in adult males (fig. 2a and supplementary genome incorporating multiple tracks for gene annotations table S2, Supplementary Material online), a pattern similar with different gene predictors, for expression levels and tran- to that found in D. melanogaster (Graveley et al. 2011). In script annotations for each developmental stage, and for addition, we observed a clear sex-biased expression in adults: repeat annotations. It contains also utilities to download con- Males express 1,824 more genes than females. Previous stud- tigs, scaffolds, and data files and to carry out Blast searches ies have attributed this sex-biased gene expression mainly to against all D. buzzatii contigs and scaffolds. the germ cells, indicating that the differences between ovary and testis are comparable to those between germ and somatic Lineage-Specific Analyses cells (Parisi et al. 2004; Graveley et al. 2011). We set up to analyze three lineages for several aspects that We assessed expression breadth for each gene simply as could reveal genes involved in adaptation to the cactophilic the number of developmental stages with evidence of expres- niche. These lineages are denoted as #1, #2, and #3, respec- sion (fig. 2b and supplementary table S2, Supplementary tively, in figure 1: D. buzzatii lineage, D. mojavensis lineage, Material online). Expression breadth is significantly different and cactophilic lineage (i.e., lineage shared by D. buzzatii and (P < 0.001) for PCGs and ncRNA genes. A total of 6,546 ex- D. mojavensis). We searched for genes under positive selec- pressed PCGs (54.2%) are constitutively expressed (i.e., we tion, duplicated genes, and orphan genes in those lineages. observed expression in the five stages), but only 260 of ncRNA genes (8.8%) are constitutively expressed (supplemen- tary table S2, Supplementary Material online). In contrast, 925 Genes under Positive Selection expressed PCGs (7.7%) and 1,292 ncRNA genes (43.6%) are We first searched for genes evolving under positive selection expressed only in one stage. Mean expression breadth was 3.9 during the divergence between D. buzzatii and D. mojavensis,

Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 355 Guille´netal. GBE

using codon substitution models implemented in the PAML 4 BSM BSM D. buzzatii branch D. mojavensis branch package (Yang 2007). Two pairs of different SM were com- SM BSM pared by the LRT, M1a versus M2a and M7 versus M8 (see D. buzz : D. moj cactophilic branch Materials and Methods). In each case, a model that allows for sites with o > 1 (positive selection) is compared with a null model that considers only sites with o < 1 (purifying selection) and o = 1 (neutrality). At P < 0.001, the first comparison (M1a vs. M2a) detected 915 genes whereas the second comparison (M7 vs. M8) detected 802 genes. Comparison of the two gene sets allowed us to detect 772 genes present in both, Downloaded from and this was taken as the final list of genes putatively under positive selection using SM (supplementary table S3,

Supplementary Material online). FIG.3.—Venn diagram showing the number of genes putatively Next, we used BSM from PAML 4 package (Yang 2007)to under positive selection detected by two different methods, SM and search for genes under positive selection in the phylogeny BSM using three different lineages as foreground branches. http://gbe.oxfordjournals.org/ of the four Drosophila subgenus species, D. buzzatii, D. moja- vensis, D. virilis, and D. grimshawi (fig. 1). Orthologous rela- show a significant enrichment in DNA-binding function. tionships among the four species were inferred from D. DNA-dependent regulation of transcription and phosphate buzzatii–D. mojavensis list of orthologs and the OrthoDB cat- metabolic processes were also overrepresented. We also alog (see Materials and Methods). A total of 8,328 unequiv- found a significant enrichment in the Ig-like domain, involved ocal 1:1:1:1 orthologs were included in the comparison of a in functions related to cell–cell recognition and immune BSM allowing sites with o > 1 (positive selection) and a null system. The 172 candidate genes in D. mojavensis lineage model that does not. We selected three branches to test for show a significant excess of genes related to the heterocycle at Universidad de Granada - Historia las Ciencias on August 15, 2015 positive selection (the foreground branches): D. buzzatii line- catabolic process (P = 5.9e-04). Interestingly, the main hosts of age, D. mojavensis lineage, and cactophilic lineage (denoted D. mojavensis (columnar cacti) contain large quantities of tri- as #1, #2, and #3 in fig. 1). The number of genes putatively terpene glycosides, which are heterocyclic compounds. under positive selection detected at P < 0.001 in the three Among the candidate genes in the branch leading to the branches was 350, 172, and 458, respectively (supplementary two cactophilic species, there are three overrepresented mo- table S3, Supplementary Material online). These genes only lecular functions related to both metal and DNA binding. The partially overlap those previously detected in the D. buzzatii–D. GO terms with the highest significance in the biological pro- mojavensis comparison using SM (fig. 3). Although 69.4% cess category are cytoskeleton organization and, once again, and 55.8% of the genes putatively under positive selection regulation of transcription. in the D. buzzatii and D. mojavensis lineages were also de- Using the RNA-Seq data we determined the expression tected in the D. buzzatii–D. mojavensis comparison, only profiles of all 1,294 genes putatively under positive selection. 22.3% of the genes detected in the cactophilic lineage were A total of 1,213 (93.7%) of these genes are expressed in at present in the previous list (fig. 3). Thus, the total number of least one developmental stage (supplementary table S2, genes putatively under positive selection is 1,294. Supplementary Material online). A comparison of expression We looked for functional categories overrepresented level and breadth between candidate and noncandidate among the candidate genes reported by both SM and BSM genes revealed that genes putatively under positive selection (table 4). We first performed a GO enrichment analysis with are expressed at a lower level (V2 =84.96, P < 2e-16) and in the 772 candidate genes uncovered by SM comparing fewer developmental stages (V2 = 26.99, P < 2e-6) than the D. mojavensis and D. buzzatii orthologs using DAVID tools rest. (Huang et al. 2007). Two molecular functions show higher proportion than expected by chance (relative to D. mojavensis Orphan Genes in the Cactophilic Lineage genome) within the list of candidate genes: Antiporter activity and transcription factor activity. With respect to the biological To detect orphan genes in the cactophilic lineage, we blasted process, regulation of transcription is the only overrepresented the amino acid sequences encoded by 9,114 D. buzzatii genes category. A significant enrichment in Src Homology-3 domain with D. mojavensis 1:1 orthologs against all proteins from the was observed. This domain is commonly found within proteins 12 Drosophila genomes except D. mojavensis available in with enzymatic activity and it is associated with protein bind- FlyBase (St Pierre et al. 2014). We found 117 proteins with ing function. no similarity to any predicted Drosophila protein (cutoff value A similar GO enrichment analysis was carried out with can- of 1e-05) and were considered to be encoded by putative didate genes found using BSM in each of the three targeted orphan genes. We focused on the evolutionary dynamics of branches. The 350 candidate genes in D. buzzatii lineage these orphan genes by studying their properties in comparison

356 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 eoiso clgclAdaptation Ecological of Genomics eoeBo.Evol. Biol. Genome ()3936 o:019/b/v21Advanc doi:10.1093/gbe/evu291 7(1):349–366.

Table 4 GO Analysis of Putative Genes under Positive Selection Detected by Both SM and BSM Codon Lineage Number of GO enrichment substitution (Branch Number) Candidates Models Molecular Function Biological Process Interpro Domain

ID Fold ID Fold ID Fold Enrichment Enrichment Enrichment SM Drosophila buzzatii versus 772 Antiporter activity 1.77 Regulation of transcription 4.90 Src homology-3 domain 1.60 Drosophila mojavensis Transcription factor 1.56 activity cespbiainDcme 1 2014 31, December publication Access e BSM D. buzzatii #1 350 DNA binding 1.36 Regulation of transcription 1.36 Immunoglobulin-like 1.33 DNA dependent Phosphate metabolic process 0.72 D. mojavensis #2 172 Dopamine 2.35 Heterocycle catabolic process 2.35 DOMON (DOpamine 2.35 beta-monooxigenase Cation transport 0.98 beta-MOnooxygenase activity Histidine family amino acid 2.35 N-terminal domain) catabolic process Cactophilic #3 458 Zinc ion binding 2.01 Cytoeskeleton organization 1.67 Zinc finger, PHD-type 1.93 Transition metal 2.01 Regulation of transcription 1.06 Proteinase inhibitor 2.20 ion binding DNA dependent I1 kazal DNA binding 1.66

NOTE.—Only categories showing an enrichment with a P value less than 1.0e-03 are included. GBE

357

Downloaded from from Downloaded http://gbe.oxfordjournals.org/ at Universidad de Granada - Historia de las Ciencias on August 15, 2015 15, August on Ciencias las de Historia - Granada de Universidad at Guille´netal. GBE

dn ds ω expressed exclusively in one developmental stage with mean -16 1.00 **p < 2.2x10 *p = 2.0x10-05 expression breadth of 2.56 (vs. 3.94 for nonorphans).

0.75 Gene Duplications The annotated PCGs from four species of the Drosophila sub-

**p < 2.2x10-16 0.50 genus were used to study gene family expansions in the D. buzzatii, D. mojavensis, and cactophilic lineages (fig. 1). Proteins that share 50% identity over 50% of their length 0.25

were clustered into gene families using Markov Cluster Downloaded from Algorithm. After additional quality filters (see Materials and Methods), the final data set consisted of a total of 56,587 0.00 proteins from four species clustered into 19,567 families, in- cluding single-gene families (supplementary tables S4–S7, FIG.4.—Patterns of divergence in orphan and nonorphan genes.

Supplementary Material online). http://gbe.oxfordjournals.org/ Orphan genes (blue) have significantly higher dn and o values compared with that of nonorphan genes (red). Nonorphan genes show significantly Considering the D. buzzatii genome alone (supplementary higher ds. table S4, Supplementary Material online), we find 11,251 single-copy genes and 1,851 duplicate genes (14%) clustered in 691 gene families. Among D. buzzatii gene families, about 70% of families have two members and the largest to the remaining 8,997 1:1 orthologs (fig. 4). We observed family includes 16 members (supplementary table S4, that median dn of orphan genes was significantly higher Supplementary Material online). Among single-copy genes, than that of nonorphan genes (dnorphan = 0.1291; 1,786 genes are only present in the D. buzzatii lineage. This at Universidad de Granada - Historia las Ciencias on August 15, 2015 dn = 0.0341; W = 846,254, P < 2.2e-16) and the nonorphan number decreases only to 1,624 when proteins are clustered o o o same pattern was observed for ( orphan = 0.4253, nonor- into families with a less stringent cutoff of 35% identity and phan = 0.0887, W = 951,117, P < 2.2e-16). However, median 50% coverage. Such lineage-specific single-copy genes have ds of orphan genes is somewhat lower than that for the rest of been found in all the 12 Drosophila genomes that have been genes (dsorphan = 0.3000, dsnonorphan = 0.4056, W = 406,799, analyzed, including D. mojavensis (Hahn et al. 2007), and P =2.4e-05). although traditionally they have been viewed as annotation Wefound19ofthe117orphangenesinthelistofcandi- artifacts, many of these genes may be either de novo or fast- date genes detected in the D. buzzatii–D. mojavensis compar- evolving genes (Reinhardt et al. 2013; Palmieri et al. 2014). ison (see above). This proportion (16.3%) was significantly Lineage-specific expansions were identified by analyzing higher than that found in nonorphan 1:1 orthologs (753/ the gene count for each family from the four species using 8,997 = 8.4%), which indicates an association between CAFE3.1 (see Materials and Methods). This analysis detected gene lineage-specificity and positive selection (Fisher exact expansions of 86 families along the D. buzzatii lineage. test, two-tailed, P < 0.0001). The 19 orphan genes included However, 15 families increased in size as the result of extra in the candidate gene group are not associated with any GO copies added to the data set after taking into account high category. As a matter of fact, information about protein do- sequence coverage. The expansions of these families cannot mains was found for only two of these genes (GYR and YLP be confirmed with the current genome assembly. The remain- motifs in both cases: GI20994 and GI20995). These results ing families were analyzed further in order to confirm D. buz- should be viewed cautiously as newer genes are functionally zatii-specific duplications. To do that, we first selected gene undercharacterized and GO databases are biased against families with members that have ds < 0.4 (median ds for D. them (Zhang et al. 2012). We also compared the protein mojavensis–D. buzzatii orthologs) and then manually exam- length between orphan and nonorphan gene products. Our ined syntenic regions in D. mojavensis genome. Although results showed that orphan genes are shorter (W = 68,825.5, this approach might miss some true lineage-specific expan- P < 2.2e-16) and have fewer exons than nonlineage-specific sions, it reduces the possibility of including old families into the genes (W = 201,068, P < 2.2e-16). expansion category that might have been misclassified as a RNA-Seq data allowed us to test for expression of orphan result of incomplete gene annotation in the genomes under genes. From the 117 gene candidates, 82 (70%) are ex- study or independent loss of family members in different lin- pressed at least in one of the five analyzed developmental eages. Of the 30 gene families whose members had ds < 0.4, stages. A comparison of the expression profiles between we confirmed the expansion of 20 families (supplementary orphan and the rest of 1:1 orthologous genes showed that table S8, Supplementary Material online).In12ofthe20fam- the expression breadth of orphans is different from that of ilies, new family members are found on the same scaffold in nonorphans (V2 = 101.4, P < 0.001): Most orphan genes are close proximity suggesting unequal crossing over or proximate

358 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 Genomics of Ecological Adaptation GBE

segmental duplication as the mechanisms for duplicate forma- been shown to encode female reproductive peptidases tion. The remaining eight families contain dispersed duplicates (Kelleher and Markow 2009), and members of three addi- found in different scaffolds. Six of these families expanded tional families (Families 187, 277, and 1234) encode proteins through retroposition, the RNA-mediated duplication mecha- that are found in D. mojavensis accessory gland proteome nism that allows insertion of reverse-transcribed mRNA nearly (Kelleher et al. 2009). anywhere in the genome. In most cases, family expansions are There are 20 gene families that expanded along the cacto- due to addition of a new single copy in the D. buzzatii lineage philic branch, that is, before the split between D. buzzatii and (in 25 of total 35 families). Two families that expanded the D. mojavensis (see Materials and Methods; supplementary most, with up to 5 (Family 95) and 9 (Family 126) new mem- table S10, Supplementary Material online). Most families (16 bers, encode various peptidases involved in protein degrada- of 20) have expanded through tandem or nearby segmental Downloaded from tion. Other expanded families are associated with a broad duplication and are still found within the same scaffold. The range of functions, including structural proteins of cu- remaining families with dispersed duplicates included one ret- ticle and chorion, enzymes involved in carbohydrate and lipid rogene, the duplicate of T-cp1, identified previously in D. metabolism, proteins that function in immune response, and mojavensis lineage (Han and Hahn 2012). The extent of per- olfactory receptors. In addition, Family 128 encodes female family expansions in the cactophilic lineage is modest, with http://gbe.oxfordjournals.org/ reproductive peptidases (Kelleher and Markow 2009)andit two new additional members found in four families and a appears that new family members have been acquired inde- single new copy in the remaining families. Members of the pendently in D. buzzatii and D. mojavensis lineages (supple- most expanded families encode guanylate cyclases that are mentary table S11, Supplementary Material online). involved in intracellular signal transduction, peptidases, and We find six families in D. buzzatii that expanded through carbon–nitrogen hydrolases. Members of other families in- retroposition in the 11 Myr since the split between D. buzzatii clude various proteins with metal-binding properties as well and D. mojavensis (supplementary table S9, Supplementary as proteins with a role in vesicle and transmembrane transport Material online). This gives a rate of 0.55 retrogenes/Myr, (supplementary table S11, Supplementary Material online). at Universidad de Granada - Historia las Ciencias on August 15, 2015 which is consistent with previous estimates of functional retro- We also see expansion of three families (Family 775, Family gene formation in Drosophila of 0.5 retrogenes/Myr (Bai et al. 776, and Family 800) with functions related to regulation of 2007). The expression of all but one retrogene is supported by juvenile hormone (JH) levels (see Discussion). RNA-Seq data, with no strong biases in expression between the sexes. Four retrogenes are duplicates of ribosomal pro- teins, and the parental genes from two of these families Discussion (RpL37a and RpL30) have been previously shown to generate The D. buzzatii Genome retrogenes in other Drosophila lineages (Bai et al. 2007; Han and Hahn 2012). Frequent retroposition of ribosomal proteins Drosophila is a leading model for comparative genomics, with could be explained by the high levels of transcription of ribo- 24 genomes of different species already sequenced (Adams somal genes although other Drosophila lineages do not show et al. 2000; Drosophila 12 Genomes Consortium et al. 2007; a bias in favor of retroduplication of ribosomal proteins (Bai Zhou et al. 2012; Zhou and Bachtrog 2012; Fonseca et al. et al. 2007; Han and Hahn 2012). The remaining two retro- 2013; Ometto et al. 2013; Chen et al. 2014). However, only genes include the duplicate of Caf1, protein that is involved in five of these species belong to the species-rich Drosophila histone modification, and the duplicate of VhaM9.7-b, a sub- subgenus, and only one of these species, D. mojavensis,isa unit of ATPase complex. cactophilic species from the large repleta species group. Here CAFE analysis identified 127 families that expanded along we sequenced the genome and transcriptome of D. buzzatii, the D. mojavensis lineage. Of these families, 86 contain mem- another cactophilic member of the repleta group, to investi- bers with ds < 0.4. Further examination of syntenic regions gate the genomic basis of adaptation to this distinct ecological confirmed expansion of only 17 families (supplementary niche. Using different sequencing platforms and a three-stage table S8, Supplementary Material online). New members in de novo assembly strategy, we generated a high quality two families (Families 1121 and 1330) are found in different genome sequence that consists of 826 scaffolds greater scaffolds and originated through RNA-mediated duplications. than 3 kb (Freeze 1). A large portion (>90%) of the These instances have been previously identified as D. moja- genome is represented by 158 scaffolds with a minimum vensis-specific retropositions (Han and Hahn 2012). Members size of 160 kb that have been assigned, ordered, and oriented of expanded families encode proteins that function in prote- in the six chromosomes of the D. buzzatii karyotype. As ex- olysis, peptide and ion transport, aldehyde and carbohydrate pected, the assembly is best for chromosome 2 (because of metabolism, as well as sensory perception (supplementary the use of Sanger generated BAC-end sequences) and worst table S11, Supplementary Material online). At least 4 of the for chromosome X (because of the three-fourth representa- 17 expanded families play a role in reproductive biology: tion of this chromosome in adults of both sexes). The quality Proteases of Family 128 with three new members have of our Freeze 1 assembly compares favorably with the

Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 359 Guille´netal. GBE

assembly generated using only Illumina reads and the copy fragmentation (Rius N, Guille´n Y, Kapusta A, Feschotte SOAPdenovo assembler, and with those of other Drosophila C, Ruiz A, in preparation). The contribution of satDNAs (table genomes generated using second-generation sequencing 3) is also an underestimate and further experiments are re- platforms (Zhou et al. 2012; Zhou and Bachtrog 2012; quired for a correct assessment of this component (de Lima Fonseca et al. 2013; Ometto et al. 2013; Chen et al. 2014), LG, Svartman M, Ruiz A, Kuhn GCS, in preparation). However, although our Freeze 1 does not attain the quality of the 12 we identified the pBuM189 satDNA as the most abundant Drosophila genomes generated using Sanger only (Drosophila tandem repeat of D. buzzatii. Previous in situ hybridization 12 Genomes Consortium et al. 2007). experiments revealed that pBuM189 copies are located in Drosophila buzzatii is a subcosmopolitan species that has the centromeric region of all chromosomes, except chromo- been able to colonize four of the six major biogeographical some X (Kuhn et al. 2008). Thus, pBuM189 satellite is likely Downloaded from regions (David and Tsacas 1980). Only two other repleta the main component of the D. buzzatii centromere. group species (Drosophila repleta and Drosophila hydei) Interestingly, a pBuM189 homologous sequence has recently have reached such widespread distribution. Invasive species been identified as the most abundant tandem repeat of D. are likely to share special genetic traits that enhance their mojavensis (Melters et al. 2013). Although the chromosome colonizing ability (Parsons 1983; Lee 2002). From an ecological location in D. mojavensis has not been determined, the per- http://gbe.oxfordjournals.org/ point of view we would expect colonizing species to be sistence of pBuM189 as the major satDNA in D. buzzatii and r-strategists with a short developmental time (Lewontin D. mojavensis may reflect a possible role for these sequences 1965). Because there is a correlation between developmental in centromere function (Ugarkovic´ 2009). time and genome size (Gregory and Johnston 2008), coloniz- ingspeciesarealsoexpectedtohaveasmallgenomesize Chromosome Evolution (Lavergne et al. 2010).ThegenomesizeofD. buzzatii was estimated in our assembly as 161 Mb and by cytological tech- The chromosomal evolution of D. buzzatii and D. mojavensis niques as 153 Mb, approximately 20% smaller than the D. has been previously studied by comparing the banding pattern at Universidad de Granada - Historia las Ciencias on August 15, 2015 mojavensis genome. The genome size of a second D. buzzatii of the salivary gland chromosomes (Ruiz et al. 1990; Ruiz and strain, estimated by cytological techniques, is even smaller, Wasserman 1993). Drosophila buzzatii has few fixed inversions 7 146 Mb. However, the relationship between genome size (2m, 2n, 2z ,and5g) when compared with the ancestor of the and colonizing ability does not hold in the Drosophila genus repleta group. In contrast, D. mojavensis showed ten fixed at large. Although colonizing species such as D. melanogaster inversions (Xe, 2c, 2f, 2g, 2h, 2q, 2r, 2s, 3a, and 3d), five of and Drosophila simulans have relatively small genomes, spe- them (Xe, 2q, 2r, 2s, and 3d) exclusive to D. mojavensis and the cialist species with a narrow distribution such as Drosophila rest shared with other cactophilic Drosophila (Guille´n and Ruiz sechelia and Drosophila erecta also have small genomes. On 2012). Thus, the D. mojavensis lineage appears to be a derived the other hand, Drosophila ananassae, Drosophila malerkotli- lineage with a relatively high rate of rearrangement fixation. ana, Drosophila suzuki, D. virilis,andZaprionus indianus are Here, we compared the organization of both genomes cor- also colonizing Drosophila species but have relatively large roborating all known inversions in chromosomes X, 2, 4, and genomes (Nardon et al. 2005; Bosco et al. 2007; Drosophila 5. In D. mojavensis chromosome 3, however, we found five 12 Genomes Consortium et al. 2007; Gregory and Johnston inversions instead of the two expected (Delprat A, Guille´nY, 2008). Further, there seems to be little difference in genome Ruiz A, in preparation). One of the three additional inversions is size between original and colonized populations within spe- the polymorphic inversion 3f2 (Ruiz et al. 1990). This inversion cies (Nardon et al. 2005). Seemingly, other factors such as has previously been found segregating in Baja California and historical or chance events, niche dispersion, genetic variabil- Sonora (Mexico) and is homozygous in the strain of Santa ity, or behavioral shifts are more significant than genome size Catalina Island (California) that was used to generate the D. in determining the current distribution of colonizing species mojavensis genome sequence (Drosophila 12 Genomes (Markow and O’Grady 2008). Consortium et al. 2007). Previously, the Santa Catalina Island TE content in the D. buzzatii genome was estimated as population was thought to have the standard (ancestral) ar- 8.4% (table 2), a relatively low value compared with that of rangements in all chromosomes, like the populations in D. mojavensis, 10–14% (Ometto et al. 2013; Rius et al., in Southern California and Arizona (Ruiz et al. 1990; Etges preparation). These data agree well with the smaller genome et al. 1999).Thepresenceofinversion3f2 in Santa Catalina size of D. buzzatii because genome size is positively correlated Island is remarkable because it indicates that the flies that col- with the contribution of TEs (Kidwell 2002; Feschotte and onized this island came from Baja California and are derived Pritham 2007). However, TE copy number and coverage esti- instead of ancestral with regard to the rest of D. mojavensis mated in D. buzzatii (table 2) must be taken cautiously. populations (Delpratetal.2014). The other two additional Coverage is surely underestimated due to the difficulties in chromosome 3 inversions are fixed in the D. mojavensis lineage assembling repeats, in particular with short sequence reads, and emphasize its rapid chromosomal evolution. Guille´nand whereas the number of copies may be overestimated due to Ruiz (2012) analyzed the breakpoint of all chromosome 2

360 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 Genomics of Ecological Adaptation GBE

inversions fixed in D. mojavensis and concluded that the nu- adaptation given the ecological properties of both cactophilic merous gene alterations at the breakpoints with putative species (table 4). The most interesting result of this analysis is adaptive consequences point directly to natural selection as that genes putatively under positive selection in D. mojavensis the cause of D. mojavensis rapid chromosomal evolution. branch are enriched in genes involved in heterocyclic catabolic The four fixed chromosome 3 inversions provide an opportu- processes. Four candidate D. mojavensis genes, GI19101, nity for further testing this hypothesis (Delprat A, Guille´nY, GI20678, GI21543 and GI22389, that are orthologous to D. Ruiz A, in preparation). melanogaster genes nahoda, CG5235, slgA and knk,respec- tively, participate in these processes and might be involved in Candidate Genes under Positive Selection and Orphan adaptation of D. mojavensis to the Stenocereus cacti,plants Genes with particularly large quantities of heterocyclic compounds Downloaded from (see Introduction). A difficulty with this interpretation is the Several methods have been developed to carry out genome- fact that the D. mojavensis genome sequence was generated wide scans for genes evolving under positive selection (Nielsen using a strain from Santa Catalina Island where D. mojavensis 2005; Anisimova and Liberles 2007; Vitti et al. 2013). We used inhabits Opuntia cactus (Drosophila 12 Genomes Consortium here a rather simple approach based on the comparison of the et al. 2007). However, the evidence indicates that the ancestral http://gbe.oxfordjournals.org/ nonsynonymous substitution rate (dn) with the synonymous D. mojavensis population is the agria-inhabiting Baja California substitution rate (ds) at the codon level (Yang et al. 2000; population and that the Mainland Sonora population split from Wong et al. 2004; Zhang et al. 2005; Yang 2007). Genes Baja California approximately 0.25 Ma whereas the Mojave putatively under positive selection were detected on the Desert and Mainland Sonora populations diverged more re- basis of statistical evidence for a subset of codons where re- cently, approximately 0.125 Ma (Smith et al. 2012). placement mutations were fixed faster than mutation at silent Moreover, the presence of inversion 3f2 in the Santa Catalina sites. Four species of the Drosophila subgenus (fig. 1)were Island population suggests that the flies that colonized this employed to search for genes under positive selection using island came from Baja California populations, where this inver- at Universidad de Granada - Historia las Ciencias on August 15, 2015 SM and BSM. We restricted the analysis to this subset of the sion is currently segregating, and not from the Mojave Desert, Drosophila phylogeny to avoid the saturation of synonymous where this inversion is not present (Delprat et al. 2014). This is substitutions expected with phylogenetically very distant spe- compatible with mitochondrial DNA sequence data (Reed et al. cies (Bergman et al. 2002; Larracuente et al. 2008), and also 2007) although in contrast to other data (Machado et al. 2007). because these are the genomes with the highest quality avail- Finally, the transcriptional profiles of the four D. mojavensis able (Schneider et al. 2009). A total of 1,294 candidate genes subpopulations reveal only minor gene expression differences were detected with both SM and BSM, which represents ap- between individuals from Santa Catalina Island and Baja proximately 14% of the total set of 1:1 orthologs between D. California (Matzkin and Markow 2013). mojavensis and D. buzzatii. Positive selection seems pervasive Orphan genes are genes with restricted taxonomic distri- in Drosophila (Sawyer et al. 2007; Singh et al. 2009; Sella et al. bution. Such genes have been suggested to play an important 2009; Mackay et al. 2012) and, using methods similar to ours, role in phenotypic and adaptive evolution in multiple species it has been estimated that 33% of single-copy orthologs in the (Domazet-Losˇo and Tautz 2003; Khalturin et al. 2009; Chen melanogaster group have experienced positive selection et al. 2013). The detection of orphan genes is highly depen- (Drosophila 12 Genomes Consortium et al. 2007). The smaller dent on the availability of sequenced and well-annotated ge- fraction of genes putatively under positive selection in our nomes of closely related species, and the total number of analyses may be due to the fewer lineages considered in our lineage-specific genes tend to be overestimated (Khalturin study. In addition, both studies may be underestimating the et al. 2009). We were as conservative as possible by consid- true proportion of positively selected genes because only 1:1 ering only high-confidence 1:1 orthologs in two species, D. orthologs were included in the analyses and genes that evolve buzzatii and D. mojavensis. The result is a set of 117 orphans in too fast may be missed by the methods used to establish the cactophilic lineage. orthology relationships (Bierne and Eyre-Walker 2004). At We observe that orphan genes clearly show a different any rate, the 1,294 candidate genes found here should be pattern of molecular evolution compared with that of older evaluated using other genomic methods for detecting positive conserved genes. Orphans exhibit a higher dn that can be selection, for example, those comparing levels of divergence attributed to more beneficial mutations fixed by positive se- and polymorphism (Vitti et al. 2013). Furthermore, functional lection or to lower constraint, or both (Cai and Petrov 2010; follow-up tests will be necessary for a full validation of their Chen et al. 2010). However, as the number of genes puta- adaptive significance (Lang et al. 2012). tively under positive selection within the set of orphan genes is BSM allowed us to search for positively selected genes in the higher than expected by chance, we suggest that the elevated three-targeted lineages (D. buzzatii, D. mojavensis, and cacto- dn likely reflects adaptive evolution. philic branch). We then performed GO enrichment analyses in Orphans also have fewer exons and encode shorter pro- order to identify potential candidates for environmental teins than nonorphans. This observation has been reported in

Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 361 Guille´netal. GBE

multiple eukaryotic organisms such as yeasts (Carvunis et al. incompleteness of the annotation of D. buzzatii PCGs pre- 2012), fruitflies (Domazet-Losˇo and Tautz 2003) and primates cludes us from being able to reliably identify gene families (Cai and Petrov 2010), and it is further supported by a positive that lost family members. correlation between protein length and sequence conserva- To minimize the possibility of missing gene copies that were tion (Lipman et al. 2002) (see above). We did not find expres- potentially collapsed into single genes during D. buzzatii sion support for all the orphan genes detected. This suggests genome assembly, we used sequence coverage to adjust to us that either orphans are more tissue- or stage-specific the size of gene families. Two of the families that expanded than nonorphans (Zhang et al. 2012) or we are actually de- as a result of this correction encoded chorion genes. However, tecting artifactual CDS that are not expressed. However, given chorion genes are known to undergo somatic amplifications the patterns of sequence evolution of orphan genes, we favor in ovarian follicle cells (Claycomb and Orr-Weaver 2005), and Downloaded from the first explanation for the majority of them. Collectively, all the use of sequence coverage to correct for “missing” copies these results support the conclusion that orphan genes evolve can be misleading in these cases. As there is no easy way to faster than older genes, and that they experience lower levels verify families that were placed into the expanded category of purifying selection and higher rates of adaptive evolution due to high sequence coverage alone, our discussion below is

(Chen et al. 2010). limited to gene duplicates that were annotated in the D. buz- http://gbe.oxfordjournals.org/ It has been widely reported that younger genes have lower zatii genome. expression levels than older genes on average (Cai and Petrov A recent survey of the functional roles of new genes across 2010; Tautz and Domazet-Losˇo2011; Zhang et al. 2012). various taxa offers evidence for the rapid recruitment of Here, we observe that orphan genes that are being tran- new genes into gene networks underlying a wide range scribed are less expressed than nonorphans (Kruskal test, of phenotypes including reproduction, behavior, and develop- V2 = 9.37, P = 0.002). One of the proposed hypotheses to ex- ment (Chen et al. 2013). A number of lineage-specific dupli- plain these observations is that genes that are more conserved cates identified in our study fit this description, but further are indeed involved in more functions (Pa´l et al. 2006; Tautz experimental confirmation of their functions through loss- at Universidad de Granada - Historia las Ciencias on August 15, 2015 and Domazet-Losˇo2011). of-function studies and characterization of molecular interac- Different studies have demonstrated that newer genes are tions are necessary. Among families that expanded in the more likely to have stage-specific expression than older genes D. buzzatii,theD. mojavensis, and the cactophilic lineages, (Zhang et al. 2012). Here, we show that the number of stage- 35% have functional annotations that are similar to those of specific expressed orphans is significantly higher than that of rapidly evolving families identified in the analysis of the 12 older genes. It has been proposed that newer genes tend to Drosophila genomes (Hahn et al. 2007). These families include be more developmentally regulated than older genes (Tautz genes that are involved in proteolysis, zinc ion binding, chitin and Domazet-Losˇo 2011). This means that they contribute binding, sensory perception, immunity, and reproduction. A most to the ontogenic differentiation between taxa (Chen fraction of these expanded families may reflect physiological et al. 2010). In D. buzzatii the vast majority of stage-specific adaptations to a novel habitat. For example, given the impor- orphan genes are expressed in larvae (15/29), indicating that tance of olfactory perception in recognition of the host cactus expression of younger genes is mostly related to stages in plants (Date et al. 2013), the duplication of an olfactory re- which D. buzzatii and D. mojavensis lineages most diverge ceptor in D. buzzatii may represent an adaptation to cactophi- from each other. lic substrates. Another D. buzzatii family includes ninjurin,a gene involved in tissue regeneration that is one of the com- ponents of the innate immune response (Boutros et al. 2002). Gene Duplication In D. mojavensis, we also observe the duplication of an odor- The study of gene duplications in the D. buzzatii and D. moja- ant receptor and, coinciding with a previous report (Croset vensis lineages aims at understanding the genetic bases of the et al. 2010), of an ionotropic glutamate receptor that belongs ecological specialization associated with colonization of novel to a novel family of diversified chemosensory receptors cactus habitats. Although we only considered expanded fam- (Benton et al. 2009; Croset et al. 2010). An aldehyde dehy- ilies, it is known that specialization sometimes involves gene drogenase is also duplicated in the D. mojavensis lineage and losses. For example, D. sechellia and D. erecta, which are spe- might reveal a role in detoxification of particular aldehydes cialized to grow on particular substrates, have lost gustatory and ethanol (Fry and Saweikis 2006). In the D. buzzatii–D. receptors and detoxification genes (Drosophila 12 Genomes mojavensis lineage, one family contains proteins with the Consortium et al. 2007; Dworkin and Jones 2009). Sometimes MD-2-related lipid recognition domain involved in pathogen the losses are driven by positive selection, as has been sug- recognition and in D. mojavensis we find a duplicate of a gested in the case of the neverland gene in Drosophila pachea phagosome-associated peptide transporter that is involved in (Lang et al. 2012) where positive selection appears to have bacterial response in D. melanogaster (Charrie`re et al. 2010). favored a novel neverland allele that has lost the ability to Several of the D. mojavensis-specific gene duplicates have metabolize cholesterol. In our study of gene families, the been described as male and female reproductive proteins.

362 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 Genomics of Ecological Adaptation GBE

Unlike the accessory gland proteins of D. melanogaster,the Acknowledgments proteome of D. mojavensis accessory glands is rich in meta- This work was supported by grants BFU2008-04988 and bolic enzymes and nutrient transport proteins (Kelleher et al. BFU2011-30476 from Ministerio de Ciencia e Innovacio´ n 2009). Three of the expanded families include metabolic pro- (Spain) to A.R., by an FPI fellowship to Y.G. and a PIF-UAB teins previously identified as candidate seminal fluid proteins fellowship to N.R, and by the National Institute of General specific to D. mojavensis lineage (Kelleher et al. 2009). We also Medical Sciences of the National Institute of Health under detect an increase of female reproductive tract proteases as a award number R01GM071813 to E.B. The content is solely possible counter adaptation to fast-evolving male ejaculate the responsibility of the authors and does not necessarily rep- (Kelleher and Markow 2009). resent the official views of the funding agencies. Three gene families are of particular interest among those Downloaded from that were expanded in the lineages leading to D. buzzatii and D. mojavensis, as they contain duplicates of genes with func- Literature Cited tions related to the regulation of JH levels. One family includes a Adams MD, et al. 2000. The genome sequence of Drosophila melanoga- new duplicate of JH esterase duplication gene (Jhedup in D. ster. Science 287:2185–2195. melanogaster). JH esterases are involved in JH degradation Anisimova M, Liberles DA. 2007. The quest for natural selection in the age http://gbe.oxfordjournals.org/ (Bloch et al. 2013), although Jhedup has much lower level of of comparative genomics. Heredity 99:567–579. JH esterase activity than Jhe (Crone et al. 2007). Another family Bai Y, Casola C, Feschotte C, Betra´n E. 2007. Comparative genomics re- includes new duplicate that encodes protein with sequence veals a constant rate of origination and convergent acquisition of functional retrogenes in Drosophila. Genome Biol. 8:R11. similarity to hemolymph JH-binding protein (CG5945 in D. Barker JSF, Starmer WT. 1982. Ecological genetics and evolution: the melanogaster). JH-binding proteins belong to a large gene cactus-yeast-Drosophila model system. Sydney (NSW): Academic family regulated by circadian genes and affect circadian behav- Press. ior, courtship behavior, metabolism, and aging (Vanaphan Barker JSF, Starmer WT, MacIntyre RJ. 1990. Ecological and evolutionary et al. 2012). This family includes JH-binding proteins that func- genetics of Drosophila.NewYork:PlenumPress. at Universidad de Granada - Historia las Ciencias on August 15, 2015 Begon M. 1982. Yeasts and Drosophila. In: Ashburner M, Carson HL, tion as carriers of JH through the hemolymph to its target tis- Thompson JN Jr, editors. The genetics and biology of Drosophila, sues (Bloch et al. 2013). The third family includes a new Vol. 3b. London: Academic Press. p. 3345–3384. duplicate of a dopamine synthase gene (ebony in D. melano- Benson G. 1999. Tandem repeats finder: a program to analyze DNA se- gaster). ebony is involved in the synthesis of dopamine, and it is quences. Nucleic Acids Res. 27:573–580. known that dopamine levels affect behavior and circadian Benton R, Vannice KS, Gomez-Diaz C, Vosshall LB. 2009. Variant iono- tropic glutamate receptors as chemosensory receptors in Drosophila. rhythms through regulation of hormone levels including JH Cell 136:149–162. (Rauschenbach et al. 2012). All three duplicates are expressed Bergman CM, et al. 2002. Assessing the impact of comparative genomic in D. buzzatii adults. At insect adult stage, JHs play a role in sequence data on the functional annotation of the Drosophila physiology and behavior, and their levels oscillate daily (Bloch genome. Genome Biol. 3:research0086. et al. 2013). Gene duplications of JH-binding proteins (JHBP), Betran E, Santos M, Ruiz A. 1998. Antagonistic pleiotropic effect of JH esterases (JHE), and ebony may change the timing and levels second-chromosome inversions on body size and early life-history traits in Drosophila buzzatii. Evolution 52:144–154. of active JHs which, in turn, alter the behavior and physiology Bierne N, Eyre-Walker A. 2004. The genomic rate of adaptive amino acid regulated by JHs. One interesting effect of mutations in circa- substitution in Drosophila. Mol Biol Evol. 21:1350–1360. dian rhythm genes, or of direct perturbations of the circadian Bloch G, Hazan E, Rafaeli A. 2013. Circadian rhythms and endocrine func- rhythm, is a reduced ethanol tolerance in D. melanogaster (Pohl tions in adult insects. J Insect Physiol. 59:56–69. et al. 2013). Intriguingly, Jhedup and another gene duplicated Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. 2011. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27:578–579. in the cactophilic lineage, Sirt2 (a protein deacetylase), have Bosco G, Campbell P, Leiva-Neto JT, Markow TA. 2007. Analysis of been also shown to affect ethanol tolerance and sensitivity Drosophila species genome size and satellite DNA content reveals sig- when mutated (Kong et al. 2010). Given that both D. moja- nificant differences among strains as well as between species. vensis and D. buzzatii breed and feed on rotting fruit, a shift in Genetics 177:1277–1290. tolerance to ethanol and other cactus-specific compounds is Boutros M, Agaisse H, Perrimon N. 2002. Sequential activation of signaling pathways during innate immune responses in Drosophila.DevCell.3: one of the expected adaptations associated with a switch to a 711–722. cactus host. Future functional studies of these new duplicates Breitmeyer CM, Markow TA. 1998. Resource availability and population are required to understand their role in physiological and be- size in cactophilic Drosophila. Funct Ecol. 12:14–21. havioral changes associated with a change to a new habitat. Cai JJ, Petrov DA. 2010. Relaxed purifying selection and possibly high rate of adaptation in primate lineage-specific genes. Genome Biol Evol. 2: 393–409. Supplementary Material Calvete O, Gonza´lez J, Betra´n E, Ruiz A. 2012. Segmental duplication, microinversion, and gene loss associated with a complex inversion Supplementary methods, figures S1–S6,andtables S1–S20 breakpoint region in Drosophila. Mol Biol Evol. 29:1875–1889. areavailableatGenome Biology and Evolution online (http:// Carson HL 1971. The ecology of Drosophila breeding sites. No. 2. Honolulu www.gbe.oxfordjournals.org/). (Hawaii): University of Hawaii Foundation Lyon Arboretum Fund.

Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 363 Guille´netal. GBE

Carson HL, Wasserman M. 1965. A widespread chromosomal polymor- Fogleman JC, Kircher HW. 1986. Differential effects of fatty acid chain phism in a widespread species, Drosophila buzzatii.AmNat.99: length on the viability of two species of cactophilic Drosophila.Comp 111–115. Biochem Physiol A Physiol. 83:761–764. Carvunis A-R, et al. 2012. Proto-genes and de novo gene birth. Nature Fonseca NA, et al. 2013. Drosophila americana as a model species for 487:370–374. comparative studies on the molecular basis of phenotypic variation. Celniker SE, et al. 2009. Unlocking the secrets of the genome. Nature 459: Genome Biol Evol. 5:661–679. 927–930. Fontdevila A, Ruiz A, Alonso G, Ocan˜ a J. 1981. Evolutionary history of Charrie`re GM, et al. 2010. Identification of Drosophila Yin and PEPT2 as Drosophila buzzatii. I. Natural chromosomal polymorphism in colo- evolutionarily conserved phagosome-associated muramyl dipeptide nized populations of the Old World. Evolution 35:148–157. transporters. J Biol Chem. 285:20147–20154. Frank MR, Fogleman JC. 1992. Involvement of cytochrome P450 in host- Chen S, Krinsky BH, Long M. 2013. New genes as drivers of phenotypic plant utilization by Sonoran Desert Drosophila. Proc Natl Acad Sci U S evolution. Nat Rev Genet. 14:645–660. A. 89:11998–12002. Downloaded from Chen S, Zhang YE, Long M. 2010. New genes in Drosophila quickly Fry JD, Saweikis M. 2006. Aldehyde dehydrogenase is essential for both become essential. Science 330:1682–1685. adult and larval ethanol resistance in Drosophila melanogaster. Genet Chen ZX, et al. 2014. Comparative validation of the D. melanogaster Res. 87:87–92. modENCODE transcriptome annotation. Genome Res. 24: Garcia-Ferna`ndez J. 2005. The genesis and evolution of homeobox gene 1209–1223. clusters. Nat Rev Genet. 6:881–892. Claycomb JM, Orr-Weaver TL. 2005. Developmental gene amplification: Gonza´lez J, et al. 2005. A BAC-based physical map of the Drosophila http://gbe.oxfordjournals.org/ insights into DNA replication and gene expression. Trends Genet. 21: buzzatii genome. Genome Res. 15:885–889. 149–162. Graveley BR, et al. 2011. The developmental transcriptome of Drosophila Crone EJ, et al. 2007. Only one esterase of Drosophila melanogaster is melanogaster. Nature 471:473–479. likely to degrade juvenile hormone in vivo. Insect Biochem Mol Biol. 37: Gregory TR, Johnston JS. 2008. Genome size diversity in the family 540–549. . Heredity 101:228–238. Croset V, et al. 2010. Ancient protostome origin of chemosensory iono- Guille´n Y, Ruiz A. 2012. Gene alterations at Drosophila inversion tropic glutamate receptors and the evolution of insect taste and olfac- breakpoints provide prima facie evidence for natural selection tion. PLoS Genet. 6:e1001064. as an explanation for rapid chromosomal evolution. BMC

Date P, et al. 2013. Divergence in olfactory host plant Genomics 13:53. at Universidad de Granada - Historia las Ciencias on August 15, 2015 preference in D. mojavensis in response to cactus host use. PLoS Haas BJ, et al. 2008. Automated eukaryotic gene structure annotation One 8:e70027. using EVidenceModeler and the Program to Assemble Spliced David J, Tsacas L. 1980. Cosmopolitan, subcosmopolitan and widespread Alignments. Genome Biol. 9:R7. species: different strategies within the Drosophilid family (Diptera). Hahn MW, Han MV, Han S-G. 2007. Gene family evolution across 12 C R Soc Bioge´ogr. 57:11–26. Drosophila genomes. PLoS Genet. 3:e197. Delcher AL, Salzberg SL, Phillippy AM. 2003. Using MUMmer to identify Han MV, Hahn MW. 2012. Inferring the history of interchromosomal gene similar regions in large sequence sets. Curr Protoc Bioinformatics. transposition in Drosophila using n-dimensional parsimony. Genetics Chapter 10:Unit 10.3. 190:813–825. Delprat A, Etges WJ, Ruiz A. 2014. Reanalysis of polytene chromosomes in Han MV, Thomas GWC, Lugo-Martinez J, Hahn MW. 2013. Estimating Drosophila mojavensis populations from Santa Catalina Island, gene gain and loss rates in the presence of error in genome assembly California, USA. Drosoph Inf Serv. 97:53–57. and annotation using CAFE 3. Mol Biol Evol. 30:1987–1997. Domazet-Losˇo T, Tautz D. 2003. An evolutionary analysis of orphan genes Hasson E, et al. 1995. The evolutionary history of Drosophila buzzatii. in Drosophila. Genome Res. 13:2213–2219. XXVI. Macrogeographic patterns of inversion polymorphism in New Drosophila 12 Genomes Consortium, et al. 2007. Evolution of genes and World populations. J Evol Biol. 8:369–384. genomes on the Drosophila phylogeny. Nature 450:203–218. Hasson E, Naveira H, Fontdevila A. 1992. The breeding sites of Argentinian Dworkin I, Jones CD. 2009. Genetic changes accompanying the evolu- cactophilic species of the Drosophila mulleri complex (subgenus tion of host specialization in Drosophila sechellia. Genetics 181: Drosophila-repleta group). Rev Chil Hist Nat. 65:319–326. 721–736. Havens JA, Etges WJ. 2013. Premating isolation is determined by larval Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efficient algorithm for rearing substrates in cactophilic Drosophila mojavensis.IX.Hostplant large-scale detection of protein families. Nucleic Acids Res. 30: and population specific epicuticular hydrocarbon expression influences 1575–1584. mate choice and sexual selection. J Evol Biol. 26:562–576. Etges WJ, et al. 2015. Deciphering life history transcriptomes in different Heed WB, Mangan RL. 1986. Community ecology of the Sonoran Desert environments. Mol Ecol. 24:151–179. Drosophila. In: Ashburner M, Carson HL, Thompson JN, editors. The Etges WJ, Johnson WR, Duncan GA, Huckins G, Heed WB. 1999. genetics and biology of Drosophila. Vol. 3e. London: Academic Press. Ecological genetics of cactophilic Drosophila. In: Robichaux R, editor. p. 311–345. Ecology of Sonoran Desert plants and plant communities. Tucson (AZ): Hoffmann AA, Sørensen JG, Loeschcke V. 2003. Adaptation of Drosophila University of Arizona Press. p. 164–214. to temperature extremes: bringing together quantitative and molecu- Fellows DP, Heed WB. 1972. Factors affecting host plant selection in lar approaches. J Therm Biol. 28:175–216. desert-adapted cactiphilic Drosophila. Ecology 53:850–858. Huang DW, et al. 2007. DAVID Bioinformatics Resources: expanded an- Feschotte C, Pritham EJ. 2007. DNA transposons and the evolution of notation database and novel algorithms to better extract biology from eukaryotic genomes. Annu Rev Genet. 41:331–368. large gene lists. Nucleic Acids Res. 35:W169–W175. Fogleman JC, Armstrong L. 1989. Ecological aspects of cactus triterpene Huang DW, Sherman BT, Lempicki RA. 2009a. Bioinformatics enrichment glycosides I. Their effect on fitness components of Drosophila moja- tools: paths toward the comprehensive functional analysis of large vensis. J Chem Ecol. 15:663–676. gene lists. Nucleic Acids Res. 37:1–13. Fogleman JC, Danielson PB. 2001. Chemical interactions in the cactus- Huang DW, Sherman BT, Lempicki RA. 2009b. Systematic and integrative microorganism-Drosophila model system of the Sonoran Desert. Am analysis of large gene lists using DAVID bioinformatics resources. Nat Zool. 41:877–889. Protoc. 4:44–57.

364 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 Genomics of Ecological Adaptation GBE

Hughes CL, Kaufman TC. 2002. Hox genes and the evolution of the ar- Luo R, et al. 2012. SOAPdenovo2: an empirically improved memory- thropod body plan. Evol Dev. 4:459–499. efficient short-read de novo assembler. GigaScience 1:18. Irimia M, Maeso I, Garcia-Ferna`ndez J. 2008. Convergent evolution of Machado CA, Matzkin LM, Reed LK, Markow TA. 2007. Multilocus nu- clustering of Iroquois homeobox genes across metazoans. Mol Biol clear sequences reveal intra- and interspecific relationships among Evol. 25:1521–1525. chromosomally polymorphic species of cactophilic Drosophila.Mol Jurka J, et al. 2005. Repbase Update, a database of eukaryotic repetitive Ecol. 16:3009–3024. elements. Cytogenet Genome Res. 110:462–467. Mackay TFC, et al. 2012. The Drosophila melanogaster genetic reference Kelleher ES, Markow TA. 2009. Duplication, selection and gene conversion panel. Nature 482:173–178. in a Drosophila mojavensis female reproductive protein family. Manfrin MH, Sene FM. 2006. Cactophilic Drosophila in South America: a Genetics 181:1451–1465. model for evolutionary studies. Genetica 126:57–75. Kelleher ES, Watts TD, LaFlamme BA, Haynes PA, Markow TA. 2009. Margulies M, et al. 2005. Genome sequencing in microfabricated high- Proteomic analysis of Drosophila mojavensis male accessory glands density picolitre reactors. Nature 437:376–380. Downloaded from suggests novel classes of seminal fluid proteins. Insect Biochem Mol Markow TA, O’Grady P. 2008. Reproductive ecology of Drosophila. Funct Biol. 39:366–371. Ecol. 22:747–759. Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TCG. 2009. More Matzkin LM. 2014. Ecological genomics of host shifts in Drosophila moja- than just orphans: are taxonomically-restricted genes important in evo- vensis. Adv Exp Med Biol. 781:233–247. lution? Trends Genet. 25:404–413. Matzkin LM, Markow TA. 2013. Transcriptional differentiation across the Kidwell MG. 2002. Transposable elements and the evolution of genome four subspecies of Drosophila mojavensis. In: Pawel M, editor. http://gbe.oxfordjournals.org/ size in eukaryotes. Genetica 115:49–63. Speciation: natural processes, genetics and biodiversity. New York: Kircher HW. 1982. Chemical composition of cacti and its relationship to Nova Scientific Publishers. Sonoran Desert Drosophila. In: Barker JSF, Starmer WT, editors. Matzkin LM, Watts TD, Bitler BG, Machado CA, Markow TA. 2006. Ecological genetics and evolution: the cactus-yeast-Drosophila model Functional genomics of cactus host shifts in Drosophila mojavensis. system. Sydney (NSW): Academic Press. p. 143–158. Mol Ecol. 15:4635–4643. Kong EC, et al. 2010. Ethanol-regulated genes that contribute to ethanol Melters DP, et al. 2013. Comparative analysis of tandem repeats from sensitivity and rapid tolerance in Drosophila. Alcohol Clin Exp Res. 34: hundreds of species reveals unique insights into centromere evolution. 302–316. Genome Biol. 14:R10.

Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59. Milligan B. 1998. Total DNA isolation. In: Hoelzel AR, editor. at Universidad de Granada - Historia las Ciencias on August 15, 2015 Korf I, Flicek P, Duan D, Brent MR. 2001. Integrating genomic homology Molecular genetic analysis of populations: A practical approach. into gene structure prediction. Bioinformatics 17(Suppl 1): S140–S148. Oxford (UK): IRL Press. p. 29–64. Kriventseva EV, Rahman N, Espinosa O, Zdobnov EM. 2008. OrthoDB: the Misawa K, Kikuno RF. 2010. GeneWaltz—a new method for reducing the hierarchical catalog of eukaryotic orthologs. Nucleic Acids Res. 36: false positives of gene finding. BioData Min. 3:6. D271–D275. Morales-Hojas R, Vieira J. 2012. Phylogenetic patterns of geographical and Kuhn GCS, Sene FM, Moreira-Filho O, Schwarzacher T, Heslop- ecological diversification in the subgenus Drosophila. PLoS One 7: Harrison JS. 2008. Sequence analysis, chromosomal distribution and e49552. long-range organization show that rapid turnover of new and old Nadalin F, Vezzi F, Policriti A. 2012. GapFiller: a de novo assembly ap- pBuM satellite DNA repeats leads to different patterns of variation in proach to fill the gap within paired reads. BMC Bioinformatics seven species of the Drosophila buzzatii cluster. Chromosome Res. 16: 13(Suppl 14): S8. 307–324. Nardon C, et al. 2005. Is genome size influenced by colonization of new Ladoukakis E, Pereira V, Magny EG, Eyre-Walker A, Couso JP. 2011. environments in dipteran species? Mol Ecol. 14:869–878. Hundreds of putatively functional small open reading frames in Negre B, et al. 2005. Conservation of regulatory sequences and gene Drosophila. Genome Biol. 12:R118. expression patterns in the disintegrating Drosophila Hox gene com- Lai EC, Bodner R, Posakony JW. 2000. The enhancer of split complex of plex. Genome Res. 15:692–700. Drosophila includes four Notch-regulated members of the bearded Negre B, Ruiz A. 2007. HOM-C evolution in Drosophila: is there a need for gene family. Development 127:3441–3455. Hox gene clustering? Trends Genet. 23:55–59. Lang M, et al. 2012. Mutations in the neverland gene turned Drosophila Negre B, Simpson P. 2009. Evolution of the achaete-scute complex in pachea into an obligate specialist species. Science 337:1658–1661. insects: convergent duplication of proneural genes. Trends Genet. Lang M, et al. 2014. Radiation of the Drosophila nannoptera species group 25:147–152. in Mexico. J Evol Biol. 27:575–584. Nielsen R. 2005. Molecular signatures of natural selection. Annu Rev Larracuente AM, et al. 2008. Evolution of protein-coding genes in Genet. 39:197–218. Drosophila. Trends Genet. 24:114–123. Oliveira DCSG, et al. 2012. Monophyly, divergence times, and evolution of Lavergne S, Muenke NJ, Molofsky J. 2010. Genome size reduction can host plant use inferred from a revised phylogeny of the Drosophila trigger rapid phenotypic evolution in invasive plants. Ann Bot. 105: repleta species group. Mol Phylogenet Evol. 64:533–544. 109–116. Oliveros J. 2007. VENNY. An interactive tool for comparing lists with Venn Lee CE. 2002. Evolutionary genetics of invasive species. Trends Ecol Evol. diagrams. Madrid (Spain): BioinfoGP CNB-CSIC. 17:386–391. Ometto L, et al. 2013. Linking genomics and ecology to investigate the Lewontin RC. 1965. Selection for colonizing ability. In: Baker HG, complex evolution of an invasive Drosophila pest. Genome Biol Evol. 5: Stebbins GL, editors. The genetics of colonizing species. New York: 745–757. Academic Press. p. 77–94. Pa´l C, Papp B, Lercher MJ. 2006. An integrated view of protein evolution. Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA. 2002. Nat Rev Genet. 7:337–348. The relationship of protein conservation and sequence length. BMC PalmieriN,KosiolC,Schlo¨ tterer C. 2014. The life cycle of Drosophila Evol Biol. 2:20. orphan genes. eLife 3:e01311. Loeschcke V, Krebs RA, Dahlgaard J, Michalak P. 1997. High-temperature Parisi M, et al. 2004. A survey of ovary-, testis-, and soma-biased stress and the evolution of thermal resistance in Drosophila. EXS 83: gene expression in Drosophila melanogaster adults. Genome Biol. 5: 175–190. R40.

Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 365 Guille´netal. GBE

Parsons P. 1983. The evolutionary biology of colonizing species. New York: Stanke M, Waack S. 2003. Gene prediction with a hidden Markov Cambridge University Press. model and a new intron submodel. Bioinformatics 19(Suppl 2): Pin˜ ol J, Francino O, Fontdevila A, Cabre´ O. 1988. Rapid isolation of ii215–ii225. Drosophila high molecular weight DNA to obtain genomic libraries. Starmer WT. 1981. A comparison of Drosophila habitats according to the Nucleic Acids Res. 16:2736. physiological attributes of the associated yeast communities. Evolution Pohl JB, et al. 2013. Circadian genes differentially affect tolerance to eth- 35:38–52. anol in Drosophila. Alcohol Clin Exp Res. 37:1862–1871. Stein LD, et al. 2002. The generic genome browser: a building block Poptsova MS, Gogarten JP. 2010. Using comparative genome analysis to for a model organism system database. Genome Res. 12: identify problems in annotated microbial genomes. Microbiology 156: 1599–1610. 1909–1917. Tautz D, Domazet-Losˇo T. 2011. The evolutionary origin of orphan genes. Prada CF 2010. Evolucio´ ncromoso´ mica del cluster Drosophila mar- Nat Rev Genet. 12:692–702. tensis: origen de las inversiones y reutilizacio´ n de los puntos de Tesler G. 2002. GRIMM: genome rearrangements web server. Downloaded from rotura [PhD thesis]. Barcelona (Spain): Universitat Auto` noma de Bioinformatics 18:492–493. Barcelona. Throckmorton L. 1975. The phylogeny, ecology and geography of Rajpurohit S, Oliveira CC, Etges WJ, Gibbs AG. 2013. Functional genomic Drosophila. In: King R, editor. Handbook of genetics. Vol. 3. New and phenotypic responses to desiccation in natural populations of a York: Plenum Press. p. 421–469. desert drosophilid. Mol Ecol. 22:2698–2715. Trapnell C, et al. 2010. Transcript assembly and quantification by RNA-Seq Rauschenbach IY, Bogomolova EV, Karpova EK, Shumnaya LV, reveals unannotated transcripts and isoform switching during cell dif- http://gbe.oxfordjournals.org/ Gruntenko NE. 2012. The role of D1 like receptors in the regulation ferentiation. Nat Biotechnol. 28:511–515. of juvenile hormone synthesis in Drosophila females with increased Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice junc- dopamine level. Dokl Biochem Biophys. 446:231–234. tions with RNA-Seq. Bioinformatics 25:1105–1111. Reed LK, Nyboer M, Markow TA. 2007. Evolutionary relationships of Ugarkovic´ D.- 2009. Centromere-competent DNA: structure and evolu- Drosophila mojavensis geographic host races and their sister species tion. In: Ugarkovic´ D,- editor. Centromere. Structure and evolution. Drosophila arizonae. Mol Ecol. 16:1007–1022. Progress in molecular and subcellular biology. Vol. 48. Berlin Reinhardt JA, et al. 2013. De novo ORFs in Drosophila are important to (Germany): Springer. p. 53–76. organismal fitness and evolved rapidly from previously non-coding Vanaphan N, Dauwalder B, Zufall RA. 2012. Diversification of takeout, a sequences. PLoS Genet. 9:e1003860. male-biased gene family in Drosophila. Gene 491:142–148. at Universidad de Granada - Historia las Ciencias on August 15, 2015 Ruiz A, Cansian AM, Kuhn GC, Alves MA, Sene FM. 2000. The Drosophila Vitti JJ, Grossman SR, Sabeti PC. 2013. Detecting natural selection in ge- serido speciation puzzle: putting new pieces together. Genetica 108: nomic data. Annu Rev Genet. 47:97–120. 217–227. Wall DP, Deluca T. 2007. Ortholog detection using the reciprocal smallest Ruiz A, Heed WB. 1988. Host-plant specificity in the cactophilic Drosophila distance algorithm. Methods Mol Biol. 396:95–110. mulleri species complex. J Anim Ecol. 57:237–249. Wang J, et al. 2003. Vertebrate gene predictions and the problem of large Ruiz A, Heed WB, Wasserman M. 1990. Evolution of the mojavensis clus- genes. Nat Rev Genet. 4:741–749. ter of cactophilic Drosophila with descriptions of two new species. J Wasserman M. 1992. Cytological evolution of the Drosophila repleta spe- Hered. 81:30–42. cies group. In: Krimbas CB, Powell JR, editors. Drosophila inversion Ruiz A, Wasserman M. 1993. Evolutionary cytogenetics of the Drosophila polymorphism. Boca Raton (FL): CRC Press. p. 455–552. buzzatii species complex. Heredity 70:582–596. Weake VM, Workman JL. 2010. Inducible gene expression: diverse regu- Ruiz-Ruano FJ, et al. 2011. DNA amount of X and B chromosomes in the latory mechanisms. Nat Rev Genet. 11:426–437. grasshoppers Eyprepocnemis plorans and Locusta migratoria. Wicker T, et al. 2007. A unified classification system for eukaryotic trans- Cytogenet Genome Res. 134:120–126. posable elements. Nat Rev Genet. 8:973–982. Sawyer SA, Parsch J, Zhang Z, Hartl DL. 2007. Prevalence of positive se- lection among nearly neutral amino acid replacements in Drosophila. Wong WS, Yang Z, Goldman N, Nielsen R. 2004. Accuracy and power of Proc Natl Acad Sci U S A. 104:6504–6510. statistical methods for detecting adaptive evolution in protein coding Schaeffer SW, et al. 2008. Polytene chromosomal maps of 11 Drosophila sequences and for identifying positively selected sites. Genetics 168: species: the order of genomic scaffolds inferred from genetic and 1041–1051. physical maps. Genetics 179:1601–1655. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Schneider A, et al. 2009. Estimates of positive Darwinian selection are Biol Evol. 24:1586–1591. inflated by errors in sequencing, annotation, and alignment. Yang Z, Nielsen R, Goldman N, Pedersen AM. 2000. Codon-substitution Genome Biol Evol. 1:114–118. models for heterogeneous selection pressure at amino acid sites. Sella G, Petrov DA, Przeworski M, Andolfatto P. 2009. Pervasive natural Genetics 155:431–449. selection in the Drosophila genome? PLoS Genet. 5:e1000495. Zhang J, Nielsen R, Yang Z. 2005. Evaluation of an improved branch-site Singh ND, Larracuente AM, Sackton TB, Clark AG. 2009. Comparative likelihood method for detecting positive selection at the molecular genomics on the Drosophila phylogenetic tree. Annu Rev Ecol Evol level. Mol Biol Evol. 22:2472–2479. Syst. 40:459–480. Zhang YE, Landback P, Vibranovski M, Long M. 2012. New genes ex- Slater GS, Birney E. 2005. Automated generation of heuristics for biolog- pressed in human brains: implications for annotating evolving ge- ical sequence comparison. BMC Bioinformatics 6:31. nomes. Bioessays 34:982–991. Smith G, Lohse K, Etges WJ, Ritchie MG. 2012. Model-based com- Zhou Q, Bachtrog D. 2012. Sex-specific adaptation drives early sex chro- parisons of phylogeographic scenarios resolve the intraspecific mosome evolution in Drosophila. Science 337:341–345. divergence of cactophilic Drosophila mojavensis. Mol Ecol. 21: Zhou Q, et al. 2012. Deciphering neo-sex and B chromosome evolu- 3293–3307. tion by the draft genome of Drosophila albomicans. BMC Genomics St Pierre SE, Ponting L, Stefancsik R, McQuilton P, FlyBase Consortium. 13:109. 2014. FlyBase 102—advanced approaches to interrogating FlyBase. Nucleic Acids Res. 42(Database issue):D780–D788. Associate editor: Mar Alba

366 Genome Biol. Evol. 7(1):349–366. doi:10.1093/gbe/evu291 Advance Access publication December 31, 2014 2 Tel 6 _1_2 (-) 149 (+) 101 (-) X 100 (-) 3 141 (+) 138 (+) 113 (-) _3_1 (-)

5 33 (+) 38 (+)

54 (-) 4 90 (-) 25 (+) 27 (+) _1_1 (+) 50 (+) 106 (+) 40 (+) 126 (-) 46 (+) 81 (+) 34 (-) 64 (-) 45 (-) _2_1 (+) 120 (+) 48 (+) 60_2 (+) 35 (-) 137 (+)

21 (-) 24_2 (+) 39 (-) 10 (+) 1 (-) 71 (-) 23 (-) 20_2 (-) 116 (-) 82 (-) 72 (+) 69 (-) 132 (-) 28 (-) 68 (-) 3 (-) 31 (+) 78 (-)

12 (+) 30 (+)

16 (-) 14 (+) 42 (-)

6 (-) 61 (+) 13 (-) 87 (-) 22 (+) 110 (+) 129 (+) 36_1 (+) 112 (-) 24_1 (+) 93 (-) 114 (-) 52 (+) 89 (+) 111 (-) 75 (-) 150 (-) 77 (+) 121 (-) 18_1 (-) 60_1 (-) 127 (-) 11 (+) 41 (-) 99 (-) 56 (+) 18_2 (+) 84 (-) 109 (-) 143 (+) 107 (+) 43 (-) 83 (-) 125 (-) 47 (-) 105 (-) 95 (-) 58 (+) 88 (+) 85 (-) 94 (+) 2 (+) 59 (-) 108 (-) 118 (+) 67 (-) 128 (-) 103 (-) 175 (-) 123 (+) 139 (-) 119 (+) 29 (-) 117 (+) 145 (-) 86 (-) 133 (-) 49 (-) 44_2 (+) 57 (-) 20_1 (+) 102 (-) 51 (-) 70 (-) 73 (+) 79 (+) 131 (-) 44_1 (+) 96 (+) 17 (-) 19 (-) 62_2 (-) 76 (-) 92 (+) 80 (+) 55 (+) 8 (+) 97 (+) 179 (+) 15 (-) 9_1 (-) 130 (-) 7 (-) 9_2 (+) 53 (-) 32 (+) 66 (+) 5 (-) 36_2 (+)

104 (-) 65 (+) 154 (-) Cen 37 (-) 62_1 (+) 74 (-) 148 (+) 63 (+) 115 (+) 124 (+) 122 (?) 177 (-) 188 (+) Figure S1. Order and orientation within D. buzzatii chromosomes of 158 scaffolds included in N90 index. The number of scaffolds in chromosomes X, 2, 3, 4, 5 and 6 is 48, 7, 38, 26, 35 and 4, respectively. Each scaffold is represented as a solid block and its orientation relative to telomere is marked by a positive (+) or negative (-) sign next to its identification number (? if direction is unknown).

25000000

20000000

15000000

Frequency Frequency 10000000

5000000

0 1 2 4 5 8 9 12 15 18 21 24 27 30 33 36 40 44 48 52 57 65 75 85 95 >100 Read depth SF G G G G G / G G G // G G / G G // // G G G

/S /F 7///// 7///// 7//q 7// ///qC//q ///C//q4/.qN/)

/ / 7///// 7///// 7// 7// ///C//qN/.qN/) ///C//qN/.qN)

/FN

5))))) D. buzzatii))))) ) ) ) ) ) ) D. buzzatii) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )))))))))))) bb

ω

bb bb b

bωbbbbbD. buzzatiibbD. mojavensisbb bbbbbbbbbbbωbbbbbb SUPPLEMENTARY METHODS

Flies

Two strains of Drosophila buzzatii, st-1 and j-19, were used. Strain st-1 was isolated from collected in Carboneras (Spain) by repeated sib-mating and selection for chromosome arrangement 2st (Betrán et al. 1998). This strain is isogenic for the major part of chromosome 2 and highly inbred for the rest of the genome. Strain j-19 was isolated from flies collected in Ticucho (Argentina) using the balanced-lethal stock Antp/ 5 (Piccinali et al. 2007). Individuals of j-19 strain are homozygous for chromosome arrangement 2j (Cáceres et al. 2001).

DNA extraction and sequencing

DNA was extracted from male and female adults of strain st-1 using the sodium dodecyl sulfate (SDS) method (Milligan 1998) or the method described by

Piñol et al. (1988) for isolating high molecular weight DNA. Reads from different sequencing platforms were generated for strain st-1 in order to achieve an accurate assembly of the genome of this strain (figure S2 and table S12). Shotgun reads (3 plates, ~8x) and paired-end (PE) reads (2 plates, ~3x) were generated using GS-FLX platform (454-Roche) at the Centre for Research in Agricultural

Genomics (CRAG, Barcelona, Spain). PE reads were produced from three different libraries with inserts of 6 kb (one half-plate), 7 kb (one plate) and 8 kb (one half- plate). We removed duplicate reads from 454 sequences using CDHIT 3.1.2 (Li and Godzik 2006). We also generated ~100 bp PE reads (4 lanes, ~76x) from

1 libraries with an insert size of ~500 bp using HiSeq2000 platform (Illumina) at the

Centro Nacional de Análisis Genómico (CNAG, Barcelona, Spain). An accurate pipeline was designed in order to filter Illumina reads based on their length and quality. We first trimmed the read ends discarding bases with a quality lower than

Q20 and then filtered low quality sequences (keeping only those with at least 95% of the bases with quality ≥ Q20). The final step was to discard exact duplicates and reverse complement exact duplicates from the final dataset. A mate pair (MP) library with ~7.5 kb fragments was also prepared and sequenced (one lane, ~12x) with Illumina at Macrogen Inc. (Seoul, Korea). Low quality reads as well as exact duplicates were removed (as before). Finally, we also used information provided by

BAC end-sequences (BES) of 1,152 BAC clones covering D. buzzatii chromosome

2 (Guillén and Ruiz 2012).

De novo assembly

The assembly of the genome of strain st-1 was performed in three stages

(supplementary table S13). In the first stage, Newbler 2.6 was fed with filtered 454 reads (shotgun and PE), Sanger BES and one of the four Illumina PE lane to obtain an initial de novo preassembly (figure S2). Prior to the assembly, false or chimeric 454 PE reads were discarded by mapping all the paired sequences against the D. mojavensis masked genome (Drosophila 12 Genomes Consortium et al. 2007) using gsMapper (Newbler 2.6). Those reads coming from the same fragment that aligned to different chromosomes as well as those aligning to multiple locations in the D. mojavensis scaffolds were removed. Likewise, all BES 2 were previously filtered by mapping them against the D. mojavensis genome in order to remove chimeric mates and artifacts using gsMapper. Out of the initial

2304 BES, 1799 reads were used for the preassembly. We used the “heterozygotic mode” option in Newbler 2.6 to allow for residual nucleotide variability in the inbreed st-1 strain. We also run the “large or complex genome” option as we were assembling a eukaryotic genome. Thus the assembly algorithm was prepared to deal with the problem of high-copy regions, although the number of output contigs was expected to be high. The preassembly contained 2,306 scaffolds. To estimate the number of chimeric artifacts, the 38 scaffolds contained in the N50 index were mapped to the D. mojavensis masked genome using NUCmer (Delcher et al. 2003).

Three scaffolds that matched two or more regions located in different D. mojavensis chromosomes were considered chimeric and split.

In a second stage, Illumina MP reads were used by SSPACE (Boetzer et al.

2011) to link output >3kb scaffolds from the preassembly and obtain 815 larger scaffolds (supplementary table S13). A minimum number of three mate pairs were required to connect two sequences (k=3). Prior to this operation, all Illumina MP reads were mapped against the D. buzzatii contigs of the preassembly stage (table

S13) using bowtie2 (Langmead and Salzberg 2012). We used only MP reads that obeyed the following criteria: (I) both end sequences from the same fragment mapped to different contigs (at unknown distance); and (II) both ends mapped in the same contig at a distance greater than 4.5 kb (thus excluding inward paired end contamination). SSPACE, the software used for the scaffolding step, excluded mates not mapping at the expected set distance. After this step, a second control

3 for chimerism was performed (as before), detecting another three chimeric scaffolds (4, 26 and 98), which were split resulting in six new scaffolds.

The third stage consisted of filling the gaps (N's) using the three short PE

Illumina libraries that were not included in the pre-assembly (supplementary table

S13). GapFiller (Nadalin et al. 2012) was used in this stage, running 10 iterations and at least 4 reads needed to call a base during an extension (figure S2). To further control for chimerism, the 818 scaffolds in the N90 scaffold index resulting from the third assembly step were blasted against the D. mojavensis masked genome using MUMmer (Delcher et al. 2003) and the resulting hits were reordered according to the D. mojavensis coordinates. This method allowed the detection of inversion breakpoint regions shared by these two species and putative chimeric scaffolds. Under a conservative criterion, eight scaffolds (9, 18, 20, 24, 36, 44, 60,

62) mapping in more than one location in the same chromosome but in regions where no inversion breakpoints or other rearrangements were expected (see

Results) were split. The final assembly, named Freeze 1, thus contains 826 scaffolds >3kb and N50 and N90 index are 30 and 158, respectively.

Fold redundancy and base composition

The distribution of read depth in the st-1 genome preassembly shows a

Gaussian distribution with a prominent mode centered at ~22x (figure S3).

Conceivably, the scaffolding and gap filling stages of the assembly did not alter significantly this distribution. However, its variance is much larger than that expected by random (~30 times higher), showing that there is an important bias on the coverage. In particular there is a long right tail that might reflect cases where 4 highly similar repetitive sequences or duplicated genes were merged into the same consensus sequence. One such case of misassembly was observed in the Hsp68 genes. In most Drosophila genomes there are two almost identical Hsp68 gene copies arranged head-to-head (Guillén and Ruiz 2012). In the D. buzzatii genome only one copy was found but it was in the vicinity of a gap (filled with N’s) about the same size, suggesting that the assembler had merged all Hsp68 reads into a single gene leaving a gap in the place of the second copy.

Base composition of genes, exons and overall for Freeze 1 assembly is summarized in supplementary table S14. CG content is ~35% overall, ~42% in gene regions (including introns) and reaches ~52% in exons. Unidentified nucleotides (N’s) represent ~9% overall, ~4% in gene regions and 0.004% in exons.

These patterns agree well with the reported higher CG content of genes and exons in many genomes including those of Drosophila (Adams et al. 2000; Heger and

Ponting 2007; Díaz-Castillo and Golic 2007) and humans (Bulmer 1987; Lander et al. 2001).

Sequence quality assessment and nucleotide polymorphism

To assess the quality of the Freeze 1 assembly sequence, we used ~800 kb of Sanger sequences corresponding to five D. buzzatii BAC clones: 40C11 and

5H14 (Negre et al. 2005), 20O19 and 1N19 (Calvete et al. 2012) and 1B03 (Prada

2010). These BAC sequences were aligned against the genome sequence using

MUMmer (Delcher et al. 2003). Some BAC regions containing repetitive elements matched multiple scaffold locations and were excluded (supplementary table S15).

Considering only the unambiguously covered regions (97.6%), the genome 5 sequence was 99.95% identical to that of the BAC sequences, giving an error rate of 0.0005 and a PHRED quality score of ~Q33.

In a second sequence quality assessment, we mapped the three Illumina runs (99,124,355 reads) that were used in the GapFiller stage of the assembly

(figure S2) and RNA-Seq data from adult males (44,840,622 reads, see below) against the Freeze 1 assembly using bowtie2 (Langmead and Salzberg 2012).

Mapping of genomic reads allowed us to assess the overall genome error rate, including both expressed and non-expressed regions, whereas mapping of RNA-

Seq reads reported the error rate exclusively for expressed regions. We considered as assembly errors those positions where 80% or more of the reads did not match the genome base and at least 80% of these unmatched positions had the same nucleotide (figure S4). Under a conservative criterion the overall error rate was estimated to 0.0005 and the average quality ~Q33, as before. A similar value was estimated when aligning the RNA-Seq reads to the expressed regions of the genome (supplementary table S16).

Strain st-1 used for generating the D. buzzatii reference genome was isogenic for a large portion of chromosome 2 and highly inbreed for the remaining genome (see above). We estimated the amount of residual nucleotide polymorphism in this strain by aligning the Illumina reads against the genome

Freeze 1 assembly (figure S4). An overall proportion of segregating sites of ~0.1% was estimated (supplementary table S17). About 15% of all the SNPs are located in gene sequences and 4% in coding exons. Thus the vast majority of SNPs are located in non-coding regions.

6 Genome size estimation

The genome size of two D. buzzatii strains, st-1 and j-19, was estimated by

Feulgen Image Analysis Densitometry. The genome size of D. mojavensis 15081-

1352.22 strain (193,826,310 bp) was used as reference (Drosophila 12 Genomes

Consortium et al. 2007). Testes from anesthetized males of both species and strains were dissected in saline solution and fixed in acetic-alcohol 3:1. Double preparations of D. mojavensis and D. buzzatii were prepared by crushing the fixed testes in 50% acetic acid. Following Ruiz-Ruano et al. (2011), the samples were stained by Feulgen reaction including a 5N HCl incubation for 5 minutes. Images obtained by optical microscopy were analyzed with the pyFIA software (Ruiz-

Ruano et al. 2011) (figure S5 and supplementary table S18).

Chromosome organization and evolution

The 826 scaffolds in Freeze 1 were assigned to chromosomes by aligning their sequences with the D. mojavensis genome using blastn from MUMmer

(Delcher et al. 2003). In addition, the 158 scaffolds in the N90 index were mapped, ordered and oriented in the chromosomes (figure S1). The seven scaffolds corresponding to chromosome 2 were ordered and oriented using D. buzzatii BAC- based physical map and BAC-end sequences (González et al. 2005, Guillén and

Ruiz 2012). Those scaffolds mapping to chromosomes X, 4, 5 and 6 were ordered and oriented by conserved linkage (Schaeffer et al. 2008). Briefly, we looked for the position in D. mojavensis of genes located at the ends of D. buzzatii scaffolds.

When two of these genes are closely located in the D. mojavensis genome (<200 kb in most cases) we can infer that they are also close in D. buzzatii, assuming 7 synteny conservation, and then the respective scaffolds must be adjacent. This method works as far as there are no inversion breakpoints between the two scaffolds and gave consistent results for the four forementioned chromosomes. In contrast, for chromosome 3, it yielded ambiguous or inconsistent results. We had to resort to in situ hybridization of PCR generated probes to anchor chromosome 3 scaffolds to D. buzzatii polytene chromosomes (Delprat et al. in preparation).

In order to determine the organization of the HOX gene complex (HOM-C), the eight Drosophila HOX genes were searched bioinformatically in the D. buzzatii genome and found in three chromosome 2 scaffolds: 2, 5 and 229. Scaffold 2 contained four Hox genes (pb, Scr, Antp and Ubx) and scaffold 5 another three (lab, abdA and AbdB) (see Results). The eighth HOX gene, Dfd, was found in the small scaffold 229 (49,930 bp). We looked for the genomic position of this scaffold using

BAC-end sequences and found that those of three BACs (3A12, 9B20 and 25B04) anchored this scaffold inside scaffold 2, precisely within the HOX gene complex where there is a 65-kb gap filled with N’s. We concluded that this was a case of misassembly and the correct order of D. buzzatii HOX genes at this chromosomal site must be pb, Dfd, Scr, Antp and Ubx. All genes (HOX genes, HOX-derived genes and non-HOX genes) within the HOM-C were manually annotated using the available information (Negre et al. 2005), the annotated D. mojavensis and D. melanogaster genomes, and the RNA-seq data generated for D. buzzatii.

Repeat identification and masking

A library of transposable elements (TEs) was constructed combining three different collections of repeats. The first collection was compiled blasting FlyBase 8 canonical set of TEs against an early assembly of D. buzzatii genome. For each query several significant hits were manually inspected in order to recover the most complete TE copy. The second collection was built with RepeatScout 1.0.5 (Price et al. 2005) and classified by Repclass (Feschotte et al. 2009) and the third is the result of RepeatModeler 1.0.5 (Smit and Hubley 2008), with RepeatScout and

RECON (Bao and Eddy 2002), both using the D. buzzatii early assembly. Manual analyses to reduce redundancy and remove possible protein coding genes were performed with RepeatMasker and blast searches resulting in a library with 357 TE sequences. This library was used to mask the repeats from Freeze 1 assembly with RepeatMasker v3.2.9 (Smit et al. 1996) and then annotate the protein coding genes (see below).

A second and more comprehensive TE library (4,802 sequences) was generated adding Repbase (Jurka et al. 2005) repeats from Insecta species to the previous library and running again RepeatScout and RepeatModeler with D. buzzatii Freeze 1 assembly. Additionally, sequences classified as simple repeats, satellite or low complexity, were removed from the library. Finally, a blast analysis was performed to filter non-TE related sequences. Sequences with significant hits

(e-value<1e-25) to D. mojavensis CDS and at the same time with no significant similarity to repeats deposited in Repbase were removed. This second TE library was then used to annotate and classify D. buzzatii TEs running RepeatMasker with the following options cutoff 250, -nolow and –norna, to prevent masking any low complexity regions and small RNA genes.

In order to identify satDNAs (highly abundant tandemly repeated DNA motifs) from the genome of D. buzzatii, we used the Tandem Repeats Finder (TRF) 9 software (version 4.04) (Benson 1999). Tandem repeats searches were performed in all contigs using the command line version of TRF with parameters 1, 1, 2, 80, 5,

200 and 750 for match, mismatch, indel, probability of match, probability of indel, min. score and max. period, respectively. Repeats with less than 50 bp were eliminated from the dataset. We developed a series of scripts and pipelines for clustering similar tandem repeats into major families and to eliminate redundancy between families (de Lima et al. in preparation). The outcome produced a table containing the repeat size, consensus sequence and genomic fraction of every tandem repeat family identified. From the final collection of tandem repeats, we selected the most likely satDNA families based on three main parameters: (i) abundance; (ii) no sequence similarity with transposable elements or to other non- satellite genomic elements (inferred by screening the Repbase, Genbank and

FlyBase databases) and (iii) the presence of several contigs made exclusively by repeats from the same tandem repeat family.

Developmental transcriptome

Flies of the D. buzzatii st-1 strain were reared on standard cornmeal-yeast- agar culture media. Ten to twenty individuals from each of five different life stages

(embryos, larvae, pupae, adult males and adult females) were collected and frozen at -80ºC. RNA from frozen samples was processed using the TruSeq RNA sample preparation kit provided by Illumina. The protocol included a poly-A selection to enrich for mRNA. Library preparation was carried out at Cornell's Molecular Biology and Genetics Department, whereas RNA sequencing was done at Weill Cornell

Medical College. The average insert size of the libraries from the 5 samples was 10 264 bp. Sequencing at PE 100 bp was performed on a Hi-Seq2000 Illumina

Sequencer. A total of 378,647,052 raw reads were generated (38 Gb of sequence) comprising between 60 and 89 million reads from each of the 5 samples. RNA-Seq reads were trimmed and filtered by quality (at least 95% of the bases had a quality

≥ Q20) (supplementary table S19). Filtered reads were mapped to Freeze 1 masked genome using TopHat version 1.3.3 allowing only for uniquely mapped reads (Trapnell et al. 2009). The common setting parameters used among different stages were: -g 1 (maximum multihits) -F 0 (suppression of transcripts below this abundance level) and -i 40 (minimum intron length). The rest of parameters were set by default.

We run Cufflinks to reconstruct transcripts models and their expression level for each stage (Trapnell et al. 2010) using Annotation Release 1 as reference (-g option activated). This allowed us to identify new isoforms from expressed protein- coding genes (PCG) and also non-coding RNA (ncRNA) genes. Transcription levels along the genome sequence and transcripts inferred by Cufflinks for each stage are included in the genome browser of the D. buzzatii Genome Project web

(http://dbuz.uab.cat).

Protein-coding gene annotation

The masked Freeze 1 assembly was used to annotate PCGs using a strategy that combined both ab initio and homology-based predictions. We used two HMM-based algorithms, Augustus (Stanke and Waack 2003) and SNAP (Korf

2004), and a dual-genome de novo software, N-SCAN (Korf et al. 2001) using as guide the alignment between D. buzzatii Freeze 1 assembly and D. mojavensis 11 masked genome (release 1.3). Exonerate was run to identify conserved genes aligning both D. mojavensis and D. melanogaster protein databases to Freeze 1 assembly (Slater and Birney 2005). All these predictions were combined by a weight-based consensus generator, EVidence Modeler (EVM) (Haas et al. 2008) using the following weights: Exonerate D. mojavensis (9), Exonerate D. melanogaster (6), NSCAN (6), Augustus (2) and SNAP (2). The EVM gene set contained 12,102 gene models.

There were 1,555 genes annotated by Exonerate but not reported by EVM due to their structural properties. We included these genes in Annotation Release 1 by combining EVM and Exonerate annotations using mergeBed tool from Bedtools package (Quinlan and Hall 2010). The Annotation Release 1 includes 13,657 annotated genes (12,102 annotated by EVM and 1,555 genes detected only by

Exonerate). The 1,555 genes annotated only by exonerate were shorter (Wilcoxon test, W=81226636, p-value<2.2e-16) and had fewer exons (W=15142546, p- value<2.2e-16). This fact indicates that algorithms that annotate genes by generating a consensus from multiple evidences are not efficient at identifying short and monoexonic genes. Some genes from the Annotation Release 1 contain internal stop codons and/or lack stop or start codons suggesting they might be misannotated PCG or pseudogenes (supplementary table S20). We filtered those

PCG that show at least one internal stop codon and/or were not multiple of three, leaving a total of 12,977 high-confidence protein-coding sequences for further analyses.

We computed the number of wrong assembled positions contained in the total span of the gene models as well as the errors located within exons of 12 Annotation Release 1 (see above). The vast majority of genes (91.3%) and exon

(99.2%) sequences showed no error nucleotides. Thus, we concluded that errors are mainly contained in non-exonic regions, and both the detection of positive selection and the divergence pattern analyses carried out subsequently will not be significantly altered by misassembled sequences (Schneider et al. 2009).

Detection of genes under positive selection

To test for positive selection we made a comparison between different pairs of codon substitution models. We first estimated the dn/ds ratios of 11,154 orthologs between D. mojavensis and D. buzzatii. Orthologus pairs that showed a length difference higher than 20% were excluded, as well as those orthologs with a ds >1, leaving a total of 9,114 PCGs. Then we run two site models on this gene set:

M7 (beta), which does not allow for positively selected sites (ω>1), and M8

(beta&ω), which includes one extra class of sites to the beta model allowing for sites with ω>1 (Yang 2007). Both models were then compared using a likelihood- ratio test (LRT). We also run two more site models, M1a and M2a, and compared them again using the LRT test. Only genes that were detected as putatively under positive selection by both model comparisons were analyzed in further detail (see

Results).

To perform the branch-site tests of positive selection, we identified 8,328

1:1:1:1 orthologs among the four available Drosophila subgenus species: D. buzzatii, D. mojavensis, D. virilis and D. grimshawi using OrthoDB version 6 database (Kriventseva et al. 2008). Branch-site models allow us detecting positive selection that affects particular sites and branches of the phylogeny. We decided to 13 test for positive selection on three different lineages: D. mojavensis lineage, D. buzzatii lineage, and the lineage that led to the two cactophilic species (D. buzzatii and D. mojavensis) (supplementary table S3). We run Venny software (Oliveros

2007) to create a Venn diagram showing shared selected genes among the different models. Gene expression information for positively selected genes was extracted from the Cufflinks output (see above).

Detection of orphan genes

We identified genes that are only present in the two cactophilic species, D. mojavensis and D. buzzatii, by blasting the amino acid sequences from the 9,114

1:1 orthologs between D. mojavensis and D. buzzatii against all the proteins from the remaining 11 Drosophila species available in FlyBase protein database

(excluding D. mojavensis). Proteins that showed no similarity with any Drosophila known gene product were considered putative orphans. We used a cutoff value of

1e-05 to avoid spurious hits. From the initial single-copy orthologs set between D. mojavensis and D. buzzatii, 117 proteins showed no similarity with any predicted

Drosophila polypeptides. We used this set to study genes unique to the cactophilic lineage (Supplementary table S3) and analyzed their expression pattern with

TopHat and Cufflinks (see above).

Gene Duplication Analyses

The longest isoforms of annotated PCGs from the four species of the

Drosophila subgenus (supplementary table S4) were used for the analysis of gene duplications in cactophilic species (D. mojavensis and D. buzzatii) and in the 14 lineage leading to D. buzzatii. We ran all-against-all blastp (version 2.2.25+;

Altschul et al. 1997) and selected hits with alignment length extending over at least

50% of both proteins and with amino acid identity of at least 50%. These proteins were clustered using Markov Cluster Algorithm (Enright et al. 2002) with the bit score value as the similarity measure and an inflation parameter (I) of 2. We then removed 636 genes that matched transposable elements (supplementary table S5) that were identified by tBlastn with an E-value cut off 10E-20 against the TE library consisting of Repbase TE set (Jurka et al. 2005) and the newly- identified D. buzzatii TEs (this work). We also removed 664 D. buzzatii ORFs that contained internal stop codons (supplementary table S6). To account for possible missing gene copies that might have been collapsed during the assembly of D. buzzatii genome an additional gene copy was added to a gene family when gene sequence coverage in D. buzzatii exceeded 2X the average coverage. This correction added a total of 155 genes, increasing the number of family members in

64 gene families. The final dataset included a total of 56,587 proteins from the 4 species clustered into 19,567 families (supplementary table S7)

Gene counts for each family from the 4 species were analyzed with an updated version of CAFE (CAFE 3.1 provided by the authors; Han et al. 2013) to identify lineage-specific expansions. Given the phylogeny and gene family sizes in each species, CAFE estimates the maximum likelihood value of gene birth and death rates and infers the ancestral family sizes. With known family sizes at the tree nodes, families that expand or contract in a particular lineage can be identified.

This analysis was performed only with families for which at least one member is inferred to be present at the root of the tree. Our detection of expanded families 15 was based on the 2-parameter model (separate birth rates for D. buzzatii branch and the rest of the tree) because it was significantly better than a single-parameter model (p<10-4; likelihood ratio test).

The sets of CAFE-identified expanded families in the D. buzzatii (86 families) and D. mojavensis (127 families) genomes were examined for the presence of lineage-specific duplications. First, we used PAML package (Yang

2007) to calculate pairwise dS for all family members and narrowed the list of candidates by selecting families that included members with dS<0.4 (30 families in

D. buzzatii and 86 families in D. mojavensis). These families were further examined manually and lineage-specific duplications were inferred when no hits were found in the syntenic region of the genome with a missing copy (20 families in

D. buzzatii and 17 families in D. mojavensis). The syntenic regions were identified by the closest D. mojavensis – D. buzzatii orthologs of the genes flanking the new duplicate in the D. buzzatii or D. mojavensis genome. D. buzzatii-specific RNA- mediated duplications were identified by examining intron-less and intron- containing gene family members. A duplicate was considered a retrocopy if its sequence spanned all introns of the parental gene.

The number of families identified by CAFE as expanded along the cactophilic lineage was reduced by considering only those families that were also found in expanded category after rerunning the analysis with a less stringent cutoff

(35% amino acid identity, 50% coverage). This procedure eliminates the effects of our arbitrary threshold in gene family assignment. The overlapping set of expanded families (27 families) was manually examined to verify the absence of D. buzzatii

16 and D. mojavensis new family members in the D. virilis genome (confirmed in 20 families).

Functional annotation (i.e., GO term) for all expanded families was obtained using the DAVID annotation tool (Huang et al. 2009a; Huang et al. 2009b).

For genes without functional annotation in DAVID, annotations of D. melanogaster orthologs were used. Consensus annotation for each expanded family is provided in supplementary table S11.

Additional References

Altschul SF et al. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-402

Bao Z, Eddy SR. 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12: 1269–1276.

Bulmer M. 1987. A statistical analysis of nucleotide sequences of introns and exons in human genes. Mol Biol Evol. 4: 395–405.

Cáceres M, Puig M, Ruiz A. 2001. Molecular characterization of two natural hotspots in the Drosophila buzzatii genome induced by transposon insertions. Genome Res. 11: 1353–1364.

Díaz-Castillo C, Golic KG. 2007. Evolution of gene sequence in response to chromosomal location. Genetics 177: 359–374.

Feschotte C, Keswani U, Ranganathan N, Guibotsy ML, Levine D. 2009. Exploring repetitive DNA landscapes using REPCLASS, a tool that automates the classification of transposable elements in eukaryotic genomes. Genome Biol Evol. 1: 205–220.

Heger A, Ponting CP. 2007. Evolutionary rate analyses of orthologs and paralogs from 12 Drosophila genomes. Genome Res. 17: 1837–1849.

Lander ES, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921.

17 Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9: 357–359.

Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22: 1658–1659.

Piccinali R, Mascord L, Barker J, Oakeshott J, Hasson E. 2007. Molecular Population Genetics of the α-Esterase5 Gene Locus in Original and Colonized Populations of Drosophila buzzatii and Its Sibling Drosophila koepferae. J Mol Evol. 64: 158–170.

Price AL, Jones NC, Pevzner PA. 2005. De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1: i351–358.

Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842.

Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996-2010. .

Smit AFA, Hubley R. RepeatModeler Open-1.0. 2008-2010. .

18 Table S1. Protein-coding gene content of D. buzzatii genome compared to those of D. mojavensis and D. melanogaster.

D. mojavensis D. melanogaster Species D. buzzatii R1.3 R5.55

Number of genes 13657 14595 13937

Mean gene size (bp) 3108 4429 6656

Mean protein size (aa) 498 494 690

Longest gene size (bp) 67103 299059 396068

Shortest gene size (bp) 63 105 117

Longest protein size (aa) 14469 8926 22949

Shortest protein size (aa) 21 34 11

Mean number of exons 3.80 3.78 5.50 Supplementary Table S2 (A) Number of protein coding genes (PCG) and non-coding genes (ncRNA) expressed along D. buzzatii development.

Stage PCG ncRNA Total Embryos 8552 1208 9760 Larvae 8709 810 9519 Pupae 10485 1574 12059 Adult females 9310 1037 10347 Adult males 10347 1824 12171 Total 47403 6453 53856

(B) Number of PCG and ncRNA genes expressed in one or more stages.

Stages PCG ncRNA Total 1 925 1292 2217 2 1655 689 2344 3 1322 393 1715 4 1618 326 1944 5 6546 260 6806 Total 12066 2960 15026

(C) Distribution of putatively selected genes expressed along D. buzzatii development.

Non- Stage Selected Total selected Embryos 881 7671 8552 Larvae 812 7897 8709 Pupae 1069 9416 10485 Adult females 932 8378 9310 Adult males 1000 9347 10347 Total 4694 42709 47403 (D) Expression breadth distribution of putatively selected genes in D. buzzatii.

Stages Selected Non-selected Total 1 106 819 925 2 166 1489 1655 3 119 1203 1322 4 211 1407 1618 5 611 5935 6546 Total 1213 10853 12066

(E) Distribution of orphan genes expression in D. buzzatii life cycle.

Stage Orphans Non-orphans Total Embryos 21 8531 8552 Larvae 49 8660 8709 Pupae 51 10434 10485 Adult females 35 9275 9310 Adult males 54 10293 10347 Total 210 47193 47403

(F) Number of orphans and no orphans expressed in one or more stages of D. buzzatii life cycle.

Stages Orphans Non-orphans total 1 29 896 925 2 18 1637 1655 3 11 1311 1322 4 8 1610 1618 5 16 6530 6546 Total 82 11984 12066 (G) Chromosomal location of putatively selected genes detected by site models (SM). The chromosomal location of one of the 772 gene candidates was unknown.

Chromosome Selected Non-selected Total X 168 1259 1427 2 154 2151 2305 3 129 1557 1686 4 155 1653 1808 5 161 1686 1847 6 4 25 29 Total 771 8331 9102

(H) Chromosomal location of putatively selected genes detected by all models (SM and BSM). The chromosomal locations of two of the 1294 genes were unknown.

Chromosome Selected Non-selected Total X 260 1167 1427 2 264 2041 2305 3 238 1448 1686 4 245 1563 1808 5 277 1570 1847 6 8 21 29 Total 1292 7810 9102 Table S12. Summary of sequencing data.

Library

# plates Mean Mean read Expected Strain Platform (454) insert size #Raw reads #Filtered reads Type length (bp) coverage or lanes (kb)

(Illumina)

Shotgun 3 - 4219296 3857039 335.23 8x 454 PE 2 6-8 2501837 1691215 304.92 3x

st-1 Sanger BES - 150 2304 1799 698.2 ~0.01x

PE 4 0.5 447062156 114499279 106.3 76x Illumina MP 1 7.5 41846306 19292893 97.8 12x Table S13. Three stages in the assembly of D. buzzatii st-1 genome.

# # putative N50 scaffold Max scaffold Stage Input Scaffolds chimerics #N's #gaps index size (> 3 kb) (split)

All 454 + BES 3 De novo Pre- + 1 library 2306 (interchromo 38 14579794 18060254 - assembly (Newbler) Illumina short somal) PE

Pre- 3 Scaffolding assembled 815 (interchromo 29 16289485 18991294 13409 (SSPACE) scaffolds + MP somal) libray

Scaffolds + 3 8 Gapfilling (GapFiller) Illumina short 818 (intrachromo 30 16306990 14974169 11462 PE somal) Table S14. Base composition by genome features.

Base Genome Genes Exons

composition

AT 55.81 % 54.24 % 48.17 %

GC 34.92 % 42.00 % 51.83 %

N 9.27 % 3.76 % 0.004 %

Total bases 161490851 42433860 20364820

Fraction 100 % 26.28 % 12.61 %

Table S15. Quality control of Freeze 1 assembly using sequenced BACs.

Matched scaffolds Unambiguous bp Average identity BAC Chromosome Length (bp) covered (%) (%) Number of Freeze 1 Aligned scaffolds scaffold id. blocks

1B03 2 258840 97.29 99.96 1 scaffold1 8

1N19 2 138724 98.97 99.92 1 scaffold1 8

20O19 2 143293 98.24 100 1 scaffold1 5

40C11 2 132938 100.00 99.88 1 scaffold2 6

5H14 2 124024 93.31 99.97 1 scaffold5 12 Table S16. Assembly error rate inferred by mapping genomic and RNAseq reads to Freeze 1 sequence. The overall error rate was computed using a coverage threshold of 4 aligned reads per position.

Genomic reads mapping RNAseq male adults reads mapping

# Putative assembly # Putative assembly Error rate Error rate sequence errors sequence errors

No coverage threshold 182598 0.00125 71499 0.00153

Coverage threshold ≥4 68898 0.00047 19042 0.00062

Table S17. Polymorphism rate estimation by mapping Illumina reads to Freeze 1 assembly.

Gapfiller reads mapping

# Polymorphic positions Polymorphism rate

No coverage threshold 148772 0.00102

Coverage threshold ≥4 141648 0.000972

Table S18. Optical Density (IOD) and genome size estimation.

IOD Genome size (pg) Genome size (Mb)

Species j19 st1 j19 st1 j19 st1

D. buzzatii 96.56 467.03 0.149 0.156 146 153

D. mojavensis 128.27 591.20 0.198a 0.198a 194b 194b

a Estimated by dividing genome size in Mb by 978 Mb/pg. b Total assembly size (Drosophila 12 Genomes Consortium). Table S19. RNAseq reads per sample

Reads Reads used Reads Mean Quality Paired filtered yielding Sample Yield (Mb) % bp Q ≥ 30 by TopHat (x 106) Score reads (x 106) unique hits (x 106) (x 106)

Embryos 9051 89.6 87.05 34.26 68.5 68.4 50.9

Larvae 6084 60.2 87.51 34.42 46.5 46.4 30.3

Pupae 7070 69.9 86.13 33.94 52.4 52.4 45.8

Female adults 8658 85.7 85.77 33.85 63.6 63.6 55.8

Male adults 7382 73.1 87.03 34.25 55.9 55.8 44.8

Total 38245 378.5 86.70 34.14 286.9 286.6 227.6 Table S20. Features of PCG models in Annotation Release 1.

EVM Exonerate Total

Annotated PCGs 12102 1555 13657

Putatively correct CDS 11213 0 11213

CDS with internal stop codons 334 330 664

CDS lacking start codon 163 0 163

CDS lacking stop codon 308 654 962

CDS lacking start and stop codons 68 571 639

CDS not multiple of 3 16 0 16