Edinburgh Research Explorer

Draft genome and reference transcriptomic resources for the urticating pine defoliator Thaumetopoea pityocampa

Citation for published version: Gschloessl, B, Dorkeld, F, Berges, H, Beydon, G, Bouchez, O, Branco, M, Bretaudeau, A, Burban, C, Dubois, E, Gauthier, P, Lhuillier, E, Nichols, J, Nidelet, S, Rocha, S, Sauné, L, Streiff, R, Gautier, M & Kerdelhué, C 2018, 'Draft genome and reference transcriptomic resources for the urticating pine defoliator Thaumetopoea pityocampa', Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.12756

Digital Object Identifier (DOI): 10.1111/1755-0998.12756

Link: Link to publication record in Edinburgh Research Explorer

Published In: Molecular Ecology Resources

General rights Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 28. Sep. 2021

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1111/1755-0998.12756

This article is protected by copyright. All rights reserved.

DR. BERNHARD GSCHLOESSL (Orcid ID : 0000-0002-7296-148X)
DR. MATHIEU GAUTIER (Orcid ID : 0000-0001-7257-5880)
DR. CAROLE KERDELHUE (Orcid ID : 0000-0001-7667-902X)

Article type : Resource Article

Draft genome and reference transcriptomic resources for the urticating pine defoliator Thaumetopoea pityocampa (Lepidoptera: Notodontidae)

B. Gschloessl 1a, F. Dorkeld 1a, H. Berges 2, G. Beydon 1a, O. Bouchez 3, M. Branco 4, A. Bretaudeau 5a,5b, C. Burban 6, E. Dubois 7, P. Gauthier 1b, E. Lhuillier 3, J. Nichols 8, S. Nidelet 7*, S. Rocha 4, L. Sauné 1a, R. Streiff 1a, M. Gautier 1a, C. Kerdelhué 1a

Full postal addresses

1a CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ. Montpellier, Montpellier, France; 755 avenue du Campus Agropolis, CS30016, F-34988 Montferrier-sur-Lez cedex, France
1b CBGP, IRD, CIRAD, INRA, Montpellier SupAgro, Univ. Montpellier, Montpellier, France; 755 avenue du Campus Agropolis, CS30016, F-34988 Montferrier-sur-Lez cedex, France
2 INRA, CNRGV, 24 Chemin de Borde Rouge, BP 52627, 31326 Castanet Tolosan Cedex, France
3 INRA, US 1426, GeT-PlaGe, Genotoul, Chemin de Borde Rouge, Auzeville, 31326 Castanet Tolosan Cedex, France
4 Forest Research Center (CEF), Instituto Superior de Agronomia (ISA), University of Lisbon (ULisboa), Tapada da Ajuda, 1349-017 Lisboa, Portugal
5a INRA, UMR Institut de Génétique, Environnement et Protection des Plantes (IGEPP), BioInformatics Platform for Agroecosystems (BIPAA), Campus Beaulieu, 35042 Rennes, France
5b INRIA, IRISA, GenOuest Core Facility, Campus de Beaulieu, 35042 Rennes, France
6 INRA, Univ. Bordeaux, BIOGECO, 69 route d'Arcachon, 33610 Cestas, France
7 Plateforme MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle (IGF), CNRS UMR 5203, INSERM U 661, Université de Montpellier, 141 rue de la Cardonille, 34094 Montpellier Cedex 05, France
8 Edinburgh Genomics, Ashworth Laboratories, The King's Buildings, The University of Edinburgh, Edinburgh, EH9 3FL, Scotland, UK

* Present address: CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ. Montpellier, Montpellier, France

Keywords: Lepidoptera, de novo assembly, genome, transcriptome, gene prediction, BAC library

Corresponding author (name, address, fax, email): Bernhard Gschloessl, INRA, CBGP, 755 avenue du Campus Agropolis, CS30016, F-34988 Montferrier-sur-Lez cedex, France. Tel.: +33 4 30 63 04 18; fax: +33 4 99 62 33 45. e-mail address: [email protected]

Running title: De novo genome and transcriptomes of T. pityocampa

ABSTRACT

The pine processionary moth Thaumetopoea pityocampa (Lepidoptera: Notodontidae) is the main pine defoliator in the Mediterranean region. Its urticating larvae cause severe human and animal health concerns in the invaded areas. This species shows a high phenotypic variability for various traits, such as phenology, fecundity, and tolerance to extreme temperatures. This study presents the construction and analysis of extensive genomic and transcriptomic resources, which are an obligate prerequisite to understand their underlying genetic architecture. Using a well-studied population from Portugal with peculiar phenological characteristics, the karyotype was first determined and a first draft genome of 537 Mb total length was assembled into 68 292 scaffolds (N50=164 kb). From this genome assembly, 29 415 coding genes were predicted. To circumvent some limitations for fine scale physical mapping of genomic regions of interest, a 3X coverage BAC library was also developed. In particular, 11 BACs from this library were individually sequenced to assess the assembly quality. Additionally, de novo transcriptomic resources were generated from various developmental stages sequenced with HiSeq and MiSeq Illumina technologies. The reads were de novo assembled into 62 376 and 63 175 transcripts, respectively. Then, a robust subset of the genome-predicted coding genes, the de novo transcriptome assemblies and previously published 454/Sanger data were clustered to obtain a high quality and comprehensive reference transcriptome consisting of 29 701 bona fide unigenes. These sequences covered 99% of the CEGMA and 88% of the BUSCO highly conserved eukaryotic genes and 84% of the BUSCO arthropod gene set. Moreover, 90% of these transcripts could be localized on the draft genome. The described information is available via a genome annotation portal (http://bipaa.genouest.org/sp/thaumetopoea_pityocampa/).

INTRODUCTION

The pine processionary moth Thaumetopoea pityocampa (Denis & Schiffermüller) (hereafter PPM) is a main pest of Mediterranean pine forests and is widespread in southern Europe and North Africa where it develops at the expense of most Pinus and Cedrus species (Battisti et al., 2015). The PPM has received increasing attention in the last decades because it is a human and animal health concern, as the highly urticating setae carried by the larvae can cause severe allergic reactions (Battisti, Holm, Fagrell, & Larsson, 2011; Battisti, Larsson, & Roques, 2017; Vega et al., 2014). Moreover, this species is expanding its geographical range due to climate warming, colonizing new, populated areas (Battisti et al., 2005; Robinet et al., 2012; Robinet, Rousselet, & Roques, 2014).

The pine processionary moth is univoltine, and its larvae develop during winter by feeding on needles of several native or introduced conifer species, in forest, agricultural or urbanized areas (Rossi, Garcia, Roques, & Rousselet, 2016). Host preference and larval performance were also found to differ between regions (Hodar, Zamora, & Castro, 2002; Zovi, Stastny, Battisti, & Larsson, 2008). The timing of sexual reproduction varies as a function of local conditions, adults emerging and mating earlier in colder environments (northern range and high altitudes) compared to warmer places. It has recently been hypothesized that phenology evolves as a response to climate change, and that this evolution should be taken into account to anticipate the future local expansion of the PPM (Robinet, Laparie, & Rousselet, 2015). The capacity of its larvae to cope with extreme temperatures was experimentally proved to be a variable adaptive trait (Santos, Paiva, Tavares, Kerdelhué, & Branco, 2011). Finally, a population showing an aberrant phenology with adults reproducing in spring and larvae developing during the summer was discovered in 1997 in Portugal (Pimentel et al., 2006), thus reported as the "summer population" (SP). This very unique population has been thoroughly studied in the past 10 years. It has been suggested that it probably emerged from a relatively recent phenological shift, and that gene flow between the SP and the sympatric "classical" winter populations is strongly reduced (Burban et al., 2016; Santos, Burban, et al., 2011).

Although many ecological and phenotypic studies have been conducted so far for the PPM, very little is known about the genomic evolution of its populations. Molecular studies have mostly involved sequencing of mitochondrial or nuclear DNA fragments to unravel phylogeographic patterns at various spatial scales (El Mokhefi et al., 2016; Kerdelhué et al., 2009; Rousselet et al., 2010), or neutral population genetics approaches using a handful of microsatellite markers to study geographical structure and introgression patterns (Burban et al., 2016), or allochronic differentiation (Santos, Burban, et al., 2011). Recently, transcriptomic resources were developed using Sanger and Roche 454 sequencing technologies (Gschloessl et al., 2014), to provide a first reference of PPM expressed genes with a particular focus on genes potentially involved in phenology. Although incomplete, this reference transcriptome was already useful to compare sets of transcripts between two allochronic populations. New resources were later released to identify associated viral sequences, but most probably corresponded to the sister species T. wilkinsoni (Jakubowska et al., 2015). Recent advances in high throughput sequencing technologies now make it possible to obtain valuable resources for non-model organisms, and to fill the gap between genomic and phenotypic evolution (Mueller, Kuhl, Timmermann, & Kempenaers, 2016). Availability of a reference genome eases large-scale analyses of genetic variation and the development of population genomic approaches to identify genomic regions prone to selection via genome-wide scans (Manel et al., 2016), to disentangle complex demographic histories, or to analyze mosaics of introgression in hybrid zones. In addition to a reference genome, a reference transcriptome further opens the possibility of differential expression studies, and can pave the way to find candidate genes involved in ecologically relevant traits. It represents a useful tool to annotate genomic sequences, and to interpret polymorphism patterns (Du et al., 2015; Fitak, Mohandesan, Corander, & Burger, 2016).

Such resources are urgently needed for on-going studies concerning the PPM and implying the development of pan-genomic markers to disentangle complex demographic scenarios (Leblois et al., 2018) or to improve the identification (and annotation) of loci subjected to adaptive constraints using genome-wide scans which detect footprints of selection (Gautier et al., pers. comm.). They will also be essential to characterize the genetic architecture of phenology (see for instance Derks et al., 2015), of the urticating system (Berardi et al., 2017) or of traits involved in the adaptation to high temperatures (such as heat shock proteins).

The aim of this study was to provide first insights into the genome and transcriptome of the PPM. Thus, the focus was set on the SP found in Portugal since genetic diversity is lower within this population (Santos, Burban, et al., 2011). This study reports the first karyotype description and de novo PPM genome assembly, using BAC sequencing as a quality assessment tool. Furthermore, coding genes predicted on the draft genome and de novo assembly of transcriptomic data were combined to obtain a robust and comprehensive reference set of expressed PPM genes that were in turn mapped on the genome.

MATERIAL AND METHODS

Sampling

All the material used in the experiments described below was sampled between 2010 and 2012 from the Mata Nacional de Leiria, Portugal (39°47'N, 8°58'W). Larvae were directly collected from the nests, pupae were dug out from the soil and males were caught using pheromone baited traps as detailed in Santos, Burban, et al. (2011). Samples from the main developmental stages (eggs, L1, L3 and L5 larvae, freshly buried and metamorphosing pupae, adults) were obtained from laboratory rearing as detailed in Branco, Paiva, Santos, Burban, and Kerdelhué (2017). Samples used for genomic sequencing were preserved in 95% ethanol, while samples used for BAC construction or RNA extraction were flash-frozen in liquid nitrogen and preserved at -80°C.

Karyotyping

Karyotypes were obtained using fresh eggs washed with a physiological solution (NaCl 0.9%) and following a protocol routinely used for cell lines (Popescu, Hayes, & Dutrillaux, 1998). After centrifugation (400 g for 5 min) and elimination of the supernatant, eggs were placed in RPMI 1640 culture medium with colchicine (0.04 µg/ml final concentration), crushed with a piston pellet and incubated for 3h at room temperature. After centrifugation (400 g for 5 min), supernatant was eliminated and the pellet resuspended in hypotonic solution (KCl 0.075 M) and incubated for 10 min at room temperature. Mitotic chromosomes were then fixed twice with a methanol:acetic acid solution (3:1), spread on slides and counted under a fluorescent microscope following DAPI staining.

BAC library construction

High molecular weight DNA was prepared from 130 L1 larvae hatched in the laboratory from egg masses collected in the field. The protocol described in Peterson, Tomkins, Frisch, Wing, and Paterson (2000) and Gonthier et al. (2010) was applied with the following modifications: (1) Sucrose based Extraction Buffer (SEB) was 0.01 M Tris, 0.1 M KCl, 0.01 M EDTA pH 9.4, 500 mM sucrose, 4 mM spermidine, 1 mM spermine tetrahydrochloride, 0.1% w/v ascorbic acid, 0.25% w/v PVP 40 000, and 0.13% w/v sodium diethyldithiocarbamate, (2) lysis buffer was 1% w/v sodium lauryl sarcosine, 0.1 mg/ml proteinase K, 0.13% w/v sodium diethyldithiocarbamate, 6 mM EGTA and 200 mM L-Lysine dissolved in 0.5 M EDTA pH 9.1, (3) after lysis of the nuclei, agarose plugs were pre-washed 1h in 0.5 M EDTA pH 9.1 at 50°C, 1h in 0.05 M EDTA pH 8 at 4°C, and then stored at 4°C. Partial digestion of high molecular weight genomic DNA with HindIII (Sigma-Aldrich, St Louis, Missouri), elution and ligation to pIndigoBAC-5 HindIII-Cloning Ready vector (Epicentre Biotechnologies, Madison, Wisconsin) were performed according to Chalhoub, Belcram, and Caboche (2004). The BAC library was deposited at the Centre National de Ressources Génomiques Végétales (Toulouse, France).

Sequencing and assembly of 11 BACs for assembly quality assessment

DNA was isolated from 11 randomly-chosen BACs using the Nucleobond Xtra Midi Plus kit (Macherey Nagel) following the manufacturer instructions using 100 mL of LB media with 12.5 µg/mL Chloramphenicol selective marker. Then, 15 µg of DNA were obtained for each BAC clone (ca. 150 ng/µL), of which 150 ng were digested with the fast NotI enzyme (Fermentas) and incubated 40 min at 37°C. After incubation, the enzymatic digestion was transferred on a 0.8% agarose gel (TBE 0.25X) for pulse field electrophoresis performed with a Chef Mapper XA CHILLER SYSTEM 220V (Bio-Rad) under the following conditions: voltage = 6 v/cm, included angle = 120°, initial switch time = 5 sec, final switch time = 15 sec, run time = 16 hours, ramping = linear. Each insert size was estimated using the software GeneTools (Syngene).

Libraries were constructed with the TruSeq Nano DNA sample preparation (low throughput protocol) kit from Illumina. 250 ng of BACs were fragmented through sonication on Covaris S220 to produce fragments of ca. 900 bp. The DNA fragments went through an end repair process, and were purified and size-selected on magnetic beads. Finally, a single 'A' base was added. The adapter and a molecular index were subsequently ligated. The DNA fragments were selectively enriched with 8 PCR cycles. The final DNA libraries were validated with a DNA 1000 Labchip on a Bioanalyzer (Agilent) and quantified with a KAPA qPCR kit. After normalization, libraries were pooled in equimolar amounts and sequencing was performed on an Illumina MiSeq using the paired-end (PE) protocol and the Reagent Kit v3 (2x300bp).

The raw PE reads were cleaned with the Trimmomatic software (v0.33, Bolger, Lohse, & Usadel, 2014) to remove the Illumina adapter as well as the vector sequences, using the following parameters: ILLUMINACLIP:$adapterfile:2:28:10 ILLUMINACLIP:$vectorfile:2:28:10 HEADCROP:15 LEADING:28 TRAILING:28 SLIDINGWINDOW:5:30 MINLEN:30. Each BAC assembly (Fig. 1A) was established using Velvet (v1.2.10, Zerbino & Birney, 2008) on both the cleaned PE reads (applying an inner distance between read1 and read2 of 300 bp) and the Single-End (SE) reads reported as 'unpaired' by Trimmomatic. Several assemblies were created using k-mer sizes from 31 to 261 with steps of 10. The best assembly was chosen as the one corresponding to the highest value of the product N50 [bp] x assembly size [bp] weighted by the number of contigs.

The contigs of the chosen assembly were aligned against NCBI NT (v07/2015, NCBI Resource Coordinators, 2016) to search for diverse contaminations (e.g. bacteria, nematodes) using the blastn BLAST web-analysis via http://www.blast.ncbi.nlm.nih.gov/Blast.cgi (Altschul, Gish, Miller, Myers, & Lipman, 1990). Contaminated contigs were identified and removed by manually analyzing all alignment results having high bit scores and corresponding to non-target species sequences. Contaminated contigs matched mainly to Proteobacteria, Actinobacteria and uncultured bacteria (i.e. adhering to rumen of cows) with e-values between e-14 and 0, and included a sequence best matching to Physeter catodon (sperm whale; e-value of e-14). Other contigs were identified as being artificial sequences (i.e. cloning vectors) and flatworms (Platyhelminthes) (both with e-values of 0).

The remaining contigs for each BAC were then scaffolded with the program SSPACE (v3.0, Boetzer, Henkel, Jansen, Butler, & Pirovano, 2011). Two independent scaffolding strategies were applied. The first scaffolding analysis included only the PE sequences (applying standard parameters: -k 5 -a 0.7 -x 0 -p 1) while the second approach also took into account the Trimmomatic SE reads (same parameters). This allowed SSPACE to extend contigs when possible. Scaffolds shorter than 1500 bp were removed from each assembly. Then, as above, the best of these two scaffoldings was chosen by calculating the (N50 [bp] x assembly size [bp] / contig number) criterion.
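The selection criterion used above for both the Velvet assemblies and the SSPACE scaffoldings (N50 x assembly size / number of sequences) can be sketched in a few lines of Python. This is only an illustration: the function names and the toy contig lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def selection_score(lengths):
    """N50 [bp] x assembly size [bp], weighted by the number of sequences."""
    if not lengths:
        return 0.0
    return n50(lengths) * sum(lengths) / len(lengths)


def pick_best(assemblies):
    """Return the name of the assembly with the highest score.

    `assemblies` maps an assembly name to its list of contig/scaffold lengths.
    """
    return max(assemblies, key=lambda name: selection_score(assemblies[name]))


# Toy example with invented contig lengths (bp) for two k-mer sizes.
candidates = {
    "k31": [1000, 800, 500, 200],
    "k61": [2000, 500],
}
best = pick_best(candidates)
```

With equal total sizes, the score favours the assembly made of fewer and longer sequences, which matches the intent of the criterion.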

The quality of the 11 assembled BACs was assessed by screening each BAC for the presence of duplicated regions. Thus, each assembly was aligned against itself by global alignment with LASTZ (v1.03.02, Harris, 2007) applying the following parameters: --gfextend --chain --gapped --identity=92 --matchcount=800 --notransition --step=20 --format=general:score,name1,zstart1,end1,strand1,size1,name2,zstart2+,end2+,strand2,size2. Read coverage was determined by mapping the Trimmomatic output reads against each BAC scaffold using Bowtie (v2.2.4, Langmead & Salzberg, 2012). In addition to default parameters, the options --no-discordant and --maxins 1100 were applied in order to retain only read pairs with mapping distances consistent with the specific library insert size. Subsequently, the aligned read percentage and the per bp read coverage were calculated for each BAC assembly using SAMtools (v1.2, H. Li et al., 2009), BAMtools (v2.3.0, Barnett, Garrison, Quinlan, Stromberg, & Marth, 2011) and BEDtools (v2.2.23, Quinlan & Hall, 2010).
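As an illustration of the self-alignment screen, the following Python sketch reads LASTZ output in the tab-separated "general" format with the column order given above, and discards the trivial alignment of a region onto itself. It is an assumption about how such output could be post-processed, not the procedure used in the study; the `Hit` type, the function names and the example coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    score: int
    name1: str
    start1: int
    end1: int
    name2: str
    start2: int
    end2: int
    strand2: str


def parse_lastz_general(text: str) -> List[Hit]:
    """Parse tab-separated rows with the columns
    score,name1,zstart1,end1,strand1,size1,name2,zstart2+,end2+,strand2,size2."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        f = line.split("\t")
        hits.append(Hit(int(f[0]), f[1], int(f[2]), int(f[3]),
                        f[6], int(f[7]), int(f[8]), f[9]))
    return hits


def duplicated_regions(hits: List[Hit]) -> List[Hit]:
    """Keep every hit except a region aligned onto itself on the forward strand."""
    return [h for h in hits
            if not (h.name1 == h.name2 and h.start1 == h.start2
                    and h.end1 == h.end2 and h.strand2 == "+")]


# Invented example: one whole-sequence self hit and one internal duplication,
# which a self-alignment reports in both directions.
example = "\n".join([
    "#score\tname1\tzstart1\tend1\tstrand1\tsize1\tname2\tzstart2+\tend2+\tstrand2\tsize2",
    "\t".join(["95000", "bac1", "0", "1000", "+", "1000", "bac1", "0", "1000", "+", "1000"]),
    "\t".join(["8000", "bac1", "100", "200", "+", "1000", "bac1", "700", "800", "+", "1000"]),
    "\t".join(["8000", "bac1", "700", "800", "+", "1000", "bac1", "100", "200", "+", "1000"]),
])
repeats = duplicated_regions(parse_lastz_general(example))
```

Because each duplication is reported once per direction, a downstream step would typically collapse symmetric pairs before counting duplicated regions.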

Construction and assessment of the quality of the Tpit-SP v1 genome assembly

Genome library construction and sequencing

Whole genomic DNA was isolated following a standard CTAB/phenol chloroform protocol from the head and thorax of 21 SP males caught by pheromone trapping. Six libraries, hereafter referred to as PE300i, PE600i, SE454, LJD3i, LJD8i and LJD20i, were further constructed. The PE300i and PE600i consisted of Illumina PE libraries with short-insert sizes of 300 and 600 bp respectively, and were constructed based on the genomic DNA from a single male. These two libraries were sequenced on a single lane of an Illumina HiSeq2000 in the Edinburgh Genomics sequencing facility (Edinburgh, Scotland). The four other libraries were constructed and sequenced by Eurofins MWG Operon (Ebersberg, Germany). In short, the SE454 library consisted of a whole genome SE library, built from a single male and sequenced on 3 runs of a Roche 454 GS FLX+. The LJD3i, LJD8i and LJD20i libraries were Long Jumping Distance (LJD) PE libraries with insert sizes of 3000, 8000 and 20 000 bp, respectively. LJDs were developed by Eurofins MWG Operon as an alternative to mate pair sequencing; they have larger insert sizes than mate pair libraries and are sequenced using the paired-end read protocol (https://www.eurofinsgenomics.eu/en/eurofins-genomics/product-faqs/next-generation-sequencing.aspx). Each library was constructed using a pool of 4 to 8 males. Pooling of DNAs was necessary to obtain enough DNA at the unwanted expense of increasing genetic diversity in the sequenced sample. These three libraries were sequenced on a single Illumina HiSeq2000 lane.

De novo genome assembly and characteristics

Assembly of all the resulting raw data was carried out by Eurofins MWG Operon. Briefly, the reads of all libraries were cleaned for quality and adapters were removed (Fig. 1B) with the Trimmomatic software (v0.22, window size of 20 bp, minimum PHRED quality of 20, Bolger et al., 2014). PE reads were filtered if they were shorter than 70 bp, LJD reads if they were shorter than 30 bp and SE454 reads if they were shorter than 100 bp. Furthermore, reads mapping to a bacterial genome in the NCBI Genome database were considered as contaminations and removed. Please note that this procedure might have resulted in potential removal of endosymbiont genes integrated in the nuclear genome. Eurofins MWG Operon applied their own assembly pipeline (CONVEY, http://www.conveycomputer.com) which ran multiple Velvet assemblies with all available Illumina and 454 data using a broad range of k-mer sizes and varying parameters. The best assembly (k-mer size 61) was selected on the basis of N50 size and maximal scaffold length. Scaffolds with a length of at least 1000 bp were retained to constitute the final genome assembly of T. pityocampa, named Tpit-SP v1 genome. The mitogenome was identified and isolated from the nuclear genome assembly. The AT-rich sequence of the mitochondrion was detected in a single genome scaffold of 15 717 bp, confirmed by blastn against the NCBI Nucleotide database and further manually edited. Furthermore, the circular structure of the mitogenome was verified.

Classical scaffold length statistics (N50, N90, mean, etc.) and the GC content of the genome were calculated on the assembled scaffolds. In order to evaluate genome coverage, the PE reads of the three genomic libraries were mapped on the contigs and scaffolds with Bowtie (as described above for BAC coverage), applying --maxins 500 bp for PE300i and 800 bp for PE600i, respectively. The SE454 reads were mapped in single-end mode (Bowtie parameter -U). In addition, to estimate in silico the expected full genome size of the Tpit-SP nuclear genome, the frequencies of kmers of 61 bp length were counted within all Illumina paired-end raw reads with the software Jellyfish (v1.1.11, Marcais & Kingsford, 2011). Subsequently, the results were analyzed with the web-software GenomeScope (parameters: kmer=61, readlength=100bp, maxcov=80; Vurture et al., 2017). Furthermore, the same LASTZ approach (see above) was applied as for the BAC analyses to identify duplicated regions and haplotypes possibly assembled in different scaffolds. Repeats in the genome assembly were identified with RepeatModeler (v1.0.8, Smit & Hubley, 2008-2015) (parameter: -engine ncbi) and RepeatMasker (v4.0.6, Smit, Hubley, & Green, 2013-2015) (parameters: -xsmall -norna -engine ncbi -poly -gff). The results of RepeatMasker were summarized by the tool "One code to find them all" (v03/2016, Bailly-Bechet, Haudry, & Lerat, 2014) applying default parameters.
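To make the idea behind a k-mer based genome size estimate concrete, the sketch below applies the classical approximation "total k-mers divided by the depth of the main peak" to a k-mer histogram. This is a deliberately crude stand-in and not the model fitted by GenomeScope, which additionally accounts for heterozygosity, repeats and sequencing errors. The two-column "depth count" input layout, the `min_depth` cut-off and the toy numbers are assumptions made for illustration.

```python
def parse_histogram(text):
    """Read 'depth count' pairs, one per line, into a dict {depth: count}."""
    histogram = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        depth, count = line.split()
        histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Crude estimate: k-mers at depth >= min_depth, divided by the peak depth.

    K-mers below `min_depth` are treated as sequencing errors and ignored
    (the threshold is an assumption and should be read off the histogram).
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers above the error threshold")
    peak_depth = max(solid, key=solid.get)          # depth with most distinct k-mers
    total_kmers = sum(d * c for d, c in solid.items())
    return total_kmers / peak_depth


# Invented toy histogram: an error peak at low depth and a main peak at 20x.
toy = parse_histogram("1 1000\n2 100\n19 50\n20 100\n21 50\n")
size = estimate_genome_size(toy)
```

On real data the histogram would come from a k-mer counter such as Jellyfish, and the estimate is only as good as the choice of error threshold and the identification of the homozygous peak.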

Quality assessment of the Tpit-SP v1 genome assembly

The completeness of the assembly was assessed by detecting well-conserved eukaryotic core genes, using (i) the CEGMA prediction pipeline (v2.5, default parameters, Parra, Bradnam, & Korf, 2007) which searches for 248 ancestral orthologous groups of proteins (KOGs, Tatusov et al., 2003), and (ii) the recently developed BUSCO program (v1.2, Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015), which was separately run to search for 429 conserved eukaryotic genes (parameters -l eukaryota -m genome --long) and 2675 conserved arthropod genes (parameters -l arthropoda -m genome --long). CEGMA and BUSCO protein sequences which were not identified in the genome were extracted from the CEGMA (kogs.fa) and BUSCO (corresponding 'ancestral' FASTA files for 'eukaryota' and 'arthropod' analyses) KOG sequences and subsequently aligned with tblastn (NCBI BLAST+ v2.2.29; parameters: -evalue 1e-5 -outfmt 7 -num_alignments 20) to the genome assembly. The corresponding KOGs were allowed to be split over several scaffolds and were considered as being present if at least 40% of the KOG sequence was aligned on two scaffold ends.

The corresponding KOGs were allowed to be split over several scaffolds, using an in-house Perl script. A KOG was counted as being present if at least 40% of the KOG sequence was aligned on two scaffold ends.

To further assess the quality of the genome assembly, LASTZ was run to align the genome scaffolds (same parameters as mentioned above) as well as the genome contigs (parameter matchcount=400, in order to take into account the smaller size of contigs) against the 11 BAC assemblies. In addition, the cleaned genomic shotgun reads from the libraries PE300i and PE600i were mapped with Bowtie on the BAC sequences in order to calculate the read per base coverage, applying the same protocol and parameters as described for the genome coverage analyses. The alignments of the Tpit-SP v1 scaffolds on the BAC assemblies were visualized separately for each BAC using the CIRCOS toolkit (v0.68, Krzywinski et al., 2009).

Construction and quality assessment of transcriptomic resources

Molecular procedures

RNA was extracted from samples corresponding to various developmental stages (eggs, L1, L3 and L5 larvae, freshly buried metamorphosing pupae, and adults) using a Trizol extraction procedure. RNA quality and integrity were evaluated through migration on an agarose gel and a nanodrop technology. Whenever the 260/280 OD ratio was below 1.7, samples were purified using the Qiagen RNeasy plant mini kit. RNA concentrations were estimated using the Qubit procedure (Quant-iT RNA assay kit). RNAs from 2 to 5 individuals were pooled for each development stage in equimolar proportions before library preparation. Libraries were constructed using the TruSeq stranded mRNA sample prep kit (Illumina) according to the manufacturer instructions, both with a standard insert size of 450 bp and with a long insert size of 850 bp obtained by modifying the fragmentation time and beads quantity.
Briefly, poly-A RNAs were purified using oligo-d(T) magnetic beads. The poly-A+ RNAs were fragmented and reverse transcribed using random hexamers, Super Script II (Life Technologies) and Actinomycin D. During the second strand generation step, dTTP was substituted by dUTP. This prevented the second strand from being used as a matrix during the final PCR amplification. Double-stranded cDNAs were adenylated at their 3' ends before ligation was performed using Illumina's indexed adapters. Ligated cDNAs were amplified through 15 PCR cycles and PCR products were purified using AMPure XP Beads (Beckman Coulter Genomics). Libraries were validated using a DNA1000 chip on an Agilent Bioanalyzer and quantified using the KAPA Library quantification kit (Clinisciences). The libraries obtained using 450 bp inserts were pooled in equimolar amounts and sequenced in two lanes of an HiSeq2000 using the PE 2x100 bp protocol. Then, all libraries (both standard and long insert sizes) were pooled and sequenced using the Illumina MiSeq PE 2x300 bp protocol.

Building a predicted gene set from the Tpit-SP v1 genome

The AUGUSTUS program (Stanke & Waack, 2003) was used to identify specific coding regions in the genome. This program relies on Markov models to represent and predict the gene structure in a specific genome. Due to the lack of a Lepidoptera model, a de novo gene model for T. pityocampa was created. Thus, WebAUGUSTUS (Stanke & Morgenstern, 2005) was run in the ‘Training’ mode, providing the assembled genome scaffolds and 43 486 Lepidoptera protein sequences obtained from the UniRef50 database (redundancy removed at 50% identity at maximum, Suzek, Wang, Huang, McGarvey, & Wu, 2015). The generated model was retained for the gene prediction analysis. To optimize the identification of coding regions, the Illumina HiSeq RNAseq reads were included as extrinsic data (i.e. evidence from external sources) into the prediction process.
These reads were mapped with TopHat (v2.0.12, Kim et al., 2013) on the genome, applying parameters -r 100 and --no-discordant. Following the AUGUSTUS recommendations for eukaryotic genomes, mapped RNAseq read clusters were identified as potential exons and the inter-exonic regions as potential introns. In order to obtain the most specific coding-gene structure annotation, two consecutive AUGUSTUS (v3.0.2) predictions (Fig. 1B) were applied with the following parameters as suggested in Stanke (2014): --alternatives-from-evidence=true --allow_hinted_splicesites=atac --introns=on --protein=on --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg --exonnames=on --species=$NEW_PPM_MODEL --codingseq=on $GENOME_FASTA. The first prediction (Aug2.1, including the parameters --hintsfile=$INTRON_HINTS --genemodel=complete) identified only complete coding genes and detected introns and coding sequences (CDS). This gene prediction represented the most exhaustive set of coding genes for the present draft genome. The second AUGUSTUS prediction (Aug2.2) was defined using more robust and more conservative criteria before being included to build the T. pityocampa reference transcriptome (see below). In detail, the intron genome coordinates were extracted from the Aug2.1 gene structure file (GFF3 format) to retrieve the FASTA sequences of the exon-exon junctions by taking 100 bp of each flanking exon. The HiSeq RNAseq reads were then mapped with Bowtie on these junctions as SE reads (‘-U’ option) to optimize the exon border reconstruction. Uniquely mapped exon-exon reads retrieved by SAMtools were combined using BAMtools with the uniquely mapped reads previously aligned by TopHat (excluding exon-exon mapping reads). The intron-exon structure coordinates file was updated (bam2hints tool provided with AUGUSTUS) and applied as input for the Aug2.2 prediction (including the parameter --hintsfile=$ALL_HINTS). Finally, the CDS sequences of the Aug2.1 and Aug2.2 predictions were extracted by applying the AUGUSTUS tool getAnnotFasta.pl (Stanke, 2009).
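
As an illustration of the exon-exon junction construction described above (100 bp taken from each exon flanking an intron), the following minimal Python sketch joins the bases on either side of one intron. It is not the authors' script: the function name, the 1-based inclusive coordinate convention (as used in GFF3) and the simplification that the flanks are cut directly from the scaffold sequence, without checking the true exon boundaries, are assumptions made for this example only.

```python
def junction_sequence(scaffold_seq, intron_start, intron_end, flank=100):
    """Return the spliced junction around one intron (hypothetical helper).

    intron_start and intron_end are 1-based, inclusive coordinates.
    Up to `flank` bases are taken on each side of the intron and joined,
    which mimics the sequence a spliced read would cover.
    """
    left_end = intron_start - 1   # 0-based index of the first intron base
    right_start = intron_end      # 0-based index of the first base after the intron
    left = scaffold_seq[max(0, left_end - flank):left_end]
    right = scaffold_seq[right_start:right_start + flank]
    return left + right


if __name__ == "__main__":
    # Toy scaffold: 10 exon bases, an 8 bp "intron", 10 exon bases.
    toy = "A" * 10 + "GTCCCCAG" + "T" * 10
    print(junction_sequence(toy, 11, 18, flank=4))
```

Single-end reads that align across the middle of such a junction sequence support the corresponding intron, which is the kind of evidence the hints file passes to the second prediction.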

De novo assembly of HiSeq and MiSeq transcriptomes

The HiSeq and MiSeq reads were independently analyzed by Trimmomatic for quality filtering and adapter trimming with the following parameters: ILLUMINACLIP:$adapterfile:2:40:15 HEADCROP:12 SLIDINGWINDOW:4:15 MINLEN:30. Subsequently, prinseq-lite (v0.20.2, Schmieder & Edwards, 2011; parameters: -trim_tail_left 5, -trim_tail_right 5, -min_len 30, -out_format 3) was applied to remove polyA and polyT ends longer than 5 bp. HiSeq and MiSeq data were assembled separately (Fig. 1C) in order to avoid creating chimeric transcripts or artefacts related to the different sequencing technologies. In each case, overlapping reads were combined with the program FLASH (v1.2.11, Magoc & Salzberg, 2011). Merged as well as unmerged FLASH output reads were retained for the assembly process. To decrease computation time and to facilitate the assembly process for the HiSeq reads, redundancy was removed using the BBNorm normalization tool (v35.21, Bushnel, 2014). Subsequently, Velvet (v1.2.08, default parameters) was run followed by Oases (v0.2.08, default parameters, Schulz, Zerbino, Vingron, & Birney, 2012). Odd k-mer values ranging from 27 to 81 were applied for HiSeq, and from 51 to 101 for MiSeq data, using a step size of 4. In both cases Velvet (velvetg parameters: -amos_file yes -read_trkg yes) was run to determine the insert length to be used in Oases, leading to 220 bp for HiSeq data and 210 and 460 bp for the two MiSeq libraries, respectively. Interestingly, for both MiSeq libraries the in silico estimated insert sizes were shorter than expected from the wet laboratory procedures. This could be due to the relatively wide and flat curve of DNA fragment insert sizes obtained from the Agilent Bioanalyzer quality check (data not shown), which most probably resulted in the over-representation of smaller fragments in the libraries. A comparative analysis (not shown) between Velvet/Oases assemblies established with the wet lab insert sizes further showed that the in silico-based insert sizes returned better transcript assemblies.
The best transcriptome for each data set was chosen based on the highest ratio of identified CEGMA core genes, including partial ones. Then, transcripts were filtered on their abundance with the program RSEM (v1.2.8, B. Li & Dewey, 2011), which was run by the tool align_and_estimate_abundance.pl (--est_method RSEM) embedded in the Trinity package. Each read was aligned with Bowtie (v2.2.4, default parameters) on the reconstructed transcripts of the corresponding transcriptome. Guided by these alignments, RSEM estimated the abundance of each transcript. For this purpose, the transcript length metric `fragments per kilobase per million fragments mapped` (FPKM) was chosen to further exclude transcripts with low coverage (i.e. FPKM ≤2). The read per base coverage and the remapping percentage were calculated for each transcriptome as described in the BAC coverage calculation section.

Establishing a reference transcriptome from the de novo transcriptomes and the predicted gene set

To generate a consistent reference set of protein-coding transcripts, first the coding regions of the assembled transcripts were identified. FrameDP (v1.2.2, default parameters, Gouzy, Carrère, & Schiex, 2009) was applied on each of the MiSeq and HiSeq transcripts, on the previously published 454/Sanger transcriptome resource obtained for the SP population (Gschloessl et al., 2014) and on the CDS of the Aug2.2 subset of coding genes. For each sequence including a coding region identified by FrameDP, only the longest peptide was retained to limit redundancy in the final data set. This set of protein sequences was then clustered by CD-HIT (CD-HIT package, v4.5.4, Fu, Niu, Zhu, Wu, & Li, 2012) using an identity of 90% (parameters: -c 0.90 -l 20), and the results were visualized by a Venn diagram using the R VennDiagram package (v1.6.17, H. Chen & Boutros, 2011). Again, only the longest protein sequence of each CD-HIT cluster was retained to build the final set of coded reference proteins.
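
The "keep only the longest protein per cluster" rule used above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a hypothetical pair of dictionaries (protein identifier to sequence, and protein identifier to cluster identifier), not the actual CD-HIT output format, and ties are resolved in favour of the first protein encountered.

```python
def longest_per_cluster(proteins, cluster_of):
    """Pick one representative (the longest protein) for every cluster.

    proteins   -- dict: protein id -> amino-acid sequence (hypothetical input)
    cluster_of -- dict: protein id -> cluster id (hypothetical input)
    Returns a dict: cluster id -> id of the longest protein in that cluster.
    """
    best = {}
    for pid, seq in proteins.items():
        cid = cluster_of[pid]
        # Replace the current representative only if this protein is strictly longer.
        if cid not in best or len(seq) > len(proteins[best[cid]]):
            best[cid] = pid
    return best


if __name__ == "__main__":
    proteins = {"hiseq_1": "MKTAYIAK", "miseq_7": "MKTAYIAKQR", "aug_3": "MSL"}
    cluster_of = {"hiseq_1": "c1", "miseq_7": "c1", "aug_3": "c2"}
    print(longest_per_cluster(proteins, cluster_of))
```

The same selection applies one step earlier, when several peptides are predicted for a single transcript and only the longest one is kept.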

Then, the reference Tpit-SP transcriptome was obtained by retrieving the coding nucleotide sequence corresponding to each reference protein.

Quality assessment of the de novo and the reference transcriptomes

The completeness of the HiSeq and MiSeq transcriptome assemblies as well as the reference transcriptome was assessed by running CEGMA and BUSCO (using the eukaryote and the arthropod sets) analyses as described above, except that the parameter --long was omitted and -m trans was applied, as the analysis was done on a transcript set. The ortholog hit ratio (OHR, O'Neil et al., 2010) was also calculated, which estimates the completeness of an assembled transcript based on its alignment length compared to the best match within a reference protein set. The HiSeq, MiSeq and reference transcripts were aligned with blastx (parameters: -outfmt 5 -evalue 1e-5 -gapopen 11 -gapextend 1 -word_size 3 -matrix BLOSUM62) against a reference set corresponding to 471 938 protein sequences extracted from the lepidopteran genome database Lepbase (v4, Challis, Kumar, Dasmahapatra, Jiggins, & Blaxter, 2016). OHR values were then calculated on the best blastx hit sequence as described in Gschloessl et al. (2014).

Reference transcript localization in the Tpit-SP v1 genome and functional annotation

Mapping the reference transcripts onto the genome assembly

The transcripts from the three established transcriptomes (HiSeq, MiSeq and reference) and the published 454/Sanger resource obtained from the same population were aligned to the de novo genome assembly using BLAT (v35, default parameters, Kent, 2002). As proposed by Stanke (2009), only those transcripts which were aligned to the genome with at least 80% of the transcript length at a 92% sequence identity were kept.
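
The retention rule for the transcript-to-genome alignments (at least 80% of the transcript length aligned, at 92% sequence identity) can be written as a simple predicate. The sketch below is illustrative only: the argument names are hypothetical, and identity is assumed here to be the number of matching bases divided by the number of aligned transcript bases, which may differ from the exact definition used in the original pipeline.

```python
def keep_alignment(aligned_bases, matching_bases, transcript_length,
                   min_coverage=0.80, min_identity=0.92):
    """Return True if an alignment passes both thresholds (illustrative sketch).

    aligned_bases     -- transcript bases included in the alignment
    matching_bases    -- aligned bases that are identical to the genome
    transcript_length -- full length of the transcript
    """
    if transcript_length <= 0 or aligned_bases <= 0:
        return False
    coverage = aligned_bases / transcript_length   # fraction of the transcript aligned
    identity = matching_bases / aligned_bases      # assumed identity definition
    return coverage >= min_coverage and identity >= min_identity


if __name__ == "__main__":
    # 900 of 1000 bases aligned, 880 of them identical: kept.
    print(keep_alignment(900, 880, 1000))
    # Only 70% of the transcript aligned: discarded.
    print(keep_alignment(700, 700, 1000))
```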

Furthermore, transcripts potentially split over several scaffolds were searched for, applying the same approach as for the quality assessment of the genome assembly. The reference transcripts were aligned with blastn (parameters: -soft_masking true -evalue 1e-5 -outfmt 7 -show_gis -perc_identity 70) against the Tpit-SP v1 assembly. Then, these transcripts were declared as being split on two scaffolds if at least 40% of the transcript was aligned on two scaffold termini.

Functional annotation

The proteins corresponding to the reference transcripts were predicted using FrameDP (default parameters). If multiple proteins were identified for a specific transcript, only the longest peptide sequence was kept. Then, the retained PPM proteins were aligned against the NCBI NR database (v5 june 2017) with blastp (v2.5.0, minimum e-value=1e-8, maximum target sequences=20). An InterProScan (v5.13-52.0, Zdobnov & Apweiler, 2001) analysis was conducted on the same dataset. Subsequently, Blast2GO (v2.5, database vOct2016, Conesa et al., 2005) assigned Gene Ontology (GO) terms to proteins, taking into account the InterProScan results. Finally, the OrthoMCL package (L. Li, Stoeckert, & Roos, 2003; blastp 2.5.0 parameters: -evalue 1e-5; orthomcl v2.0.9 parameters: percentMatchCutoff=50 and evalueExponentCutoff=-5; mcl v14-137 with parameters: --abc -I 1.5) was used to identify potential orthologs between the FrameDP predicted proteins corresponding to the reference transcript set and the protein sets of Bombyx mori (version http://sgp.dna.affrc.go.jp/ComprehensiveGeneSet), Danaus plexippus (version Danaus_plexippus.DanPle_1.0.25), Drosophila melanogaster (version dmel-all-translation-r6.03) and the noctuoid Spodoptera frugiperda. For this latter species, the transcript set published by Legeai et al. (2014) was retrieved, which is available in LepidoDB (http://bipaa.genouest.org/data/public/lepidodb/ – TR2012b.fa), and the corresponding predicted protein set was generated with FrameDP (default parameters).
Finally, the longest protein sequence was kept for each transcript with a predicted CDS and these 22 253 S. frugiperda proteins were included into the OrthoMCL ortholog analysis.

Public release of T. pityocampa SP resources in LepidoDB database

The genomic and transcriptomic resources developed in the present study were integrated into LepidoDB, a public database hosting genomic resources for several lepidopteran species, which can be accessed via http://bipaa.genouest.org/sp/thaumetopoea_pityocampa/. In detail, all predicted genes (Aug2.1 and Aug2.2 sets) and the reference transcripts along with their corresponding functional annotations were loaded into a Chado database (v1.31, Mungall & Emmert, 2007). The available data for the reference transcript set were gathered in specific web pages using an in-house J2EE application. For visualization purposes, a JBrowse (v1.12.1, Skinner, Uzilov, Stein, Mungall, & Holmes, 2009) genome browser was set up. To facilitate analyses of the PPM resource data, a search engine as well as a BLAST server and a Galaxy server (Blankenberg et al., 2010) can be accessed from the webpage.

RESULTS AND DISCUSSION

Karyotyping

A diploid number of 98 chromosomes was observed in 6 of the 8 detected metaphases (Fig. S1, Supporting Information), and slightly smaller diploid numbers in the 2 others. Because of the difficulty in spreading the small chromosomes of T. pityocampa, it was considered that the actual number was likely 2n=98, while the observations of lower counts possibly reflected incomplete spreading. The small number of observed metaphases may be due to the use of a protocol primarily developed for cell line cultures.
Yet, the results were sufficient to provide a first estimate of chromosome number for the studied species, and to compare them with other Lepidoptera. Indeed, in Lepidoptera, chromosome numbers range from n=5 to n=223 haploid chromosomes, with a majority of taxa showing a constant number of n=31 (i.e. chromosomal conservatism), which is supposed to correspond to the ancestral Lepidoptera karyotype (Ahola et al., 2014; Lukhtanov, 2014). As in other lepidopteran species, the relatively large number of small chromosomes (n=49) in the pine processionary moth is probably due to chromosomal rearrangements and high levels of repetitive elements (Ahola et al., 2014; Lukhtanov, 2014).

Characteristics and quality of the Tpit-SP v1 genome assembly

Overall statistics of the genome assembly are listed in Table 1. The de novo T. pityocampa SP genome was assembled into 675 934 contigs representing 507 Mb (Table 1). The N50 and N90 contig lengths were 1424 and 320 bp, respectively. These contigs were further assembled into 68 292 scaffolds. The Tpit-SP v1 scaffold lengths ranged from 1 kb to 2.1 Mb and the corresponding N50 and N90 scaffold lengths were 164 kb and 1951 bp, respectively. More details on all shotgun and LJD libraries can be found in Table S1 (Supporting Information). Relatively high scaffold numbers have also been reported most recently for other lepidopteran genome assemblies, which resulted in 142 to 80 479 scaffolds and N50 lengths ranging from 5.2 kb to 10.7 Mb (Table 2). The Tpit-SP v1 genome size was 537 Mb, which is close to the maximum documented for Notodontidae (estimated range 323-587 Mb, Gregory et al., 2007). The size of the Tpit-SP v1 assembly lies in between the ones published for the silkworm Bombyx mori (482 Mb) and the winter moth Operophtera brumata (638 Mb). It is larger than the genome of the Fall armyworm S. frugiperda (corn strain: 438 Mb, rice strain: 371 Mb) (Gouin et al., 2017), which is a main pest of rice and corn and the closest Noctuoidea species for which a genome was previously available.
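
For readers less familiar with the N50 and N90 statistics quoted above, the following short Python sketch shows one common way to compute them: sequences are sorted from longest to shortest and the statistic is the length of the sequence at which the running total first reaches the requested fraction of the assembly size. The function name and the toy lengths are made up for this example; the values reported in this study were of course obtained from the real contig and scaffold lengths.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx length (N50 for fraction=0.5, N90 for fraction=0.9).

    lengths  -- iterable of contig or scaffold lengths in bp
    fraction -- share of the total assembly size that must be reached
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # only reached through floating-point rounding


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]      # total = 20
    print(nxx(toy_lengths, 0.5))       # N50 of the toy set
    print(nxx(toy_lengths, 0.9))       # N90 of the toy set
```

A large gap between N50 and N90, as observed for the scaffolds here, indicates that a few long sequences carry half of the assembly while the remainder is spread over many short ones.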

Yet, using GenomeScope, the expected full genome size was estimated to range between 432 and 452 Mb, which suggests that the actual T. pityocampa genome size is probably shorter than the Tpit-SP v1 assembly size. The Tpit-SP assembly also had a GC content of 37%, which is very close to the other Noctuoidea species, S. frugiperda (36% for both strains, Gouin et al., 2017). The mean coverage in the Tpit-SP v1 genome assembly was 84 reads/bp (see Table S2 for average coverage per scaffold). All positions of the draft genome assembly (excluding positions with Ns) had a read per base coverage of at least 1. Moreover, 98% of the Tpit-SP v1 assembly had a coverage of at least 10, 85% of at least 30 and 39% of at least 100 reads per base pair. These findings confirmed the high quality assembly of the genome contigs. However, the proportion of repeated elements of 45% (Table 3) was higher than those found for S. frugiperda (29% for both strains, Gouin et al., 2017) and closer to the values found for B. mori (44%) and O. brumata (54%) (Derks et al., 2015; International Silkworm Genome, 2008). Lepidoptera with a larger number of chromosomes are supposed to have shorter chromosomes and a higher level of repeated elements than their counterparts with lower chromosome numbers (Ahola et al., 2014), which is consistent with the characteristics found in the T. pityocampa genome of the present study. Furthermore, 605 duplicated regions (average length: 1348 bp, range 809-8509 bp) were identified in 540 Tpit-SP v1 scaffolds.

Furthermore, the quality of the T. pityocampa genome assembly was evaluated by localizing the scaffolds on the reconstructed BACs. In short, the BAC library was composed of 19 968 clones with a mean insert size of 75 kb and represented 3 genome equivalents. Details on the paired-end libraries can be found in Table S3. The randomly chosen 11 sequenced BACs were assembled into 1 to 9 scaffolds (on average 3 scaffolds). Assembly sizes were consistent with the size estimates obtained during the library construction process. The read coverage of the BAC assemblies varied from 921 to 2148 reads/bp (Table 4). When aligning the Tpit-SP v1 genome scaffolds against the assembled BACs, all BACs could be recovered on average at 56% of their length (min. 14%, max. 83%). Between 5 and 32 genome scaffolds (on average 20 scaffolds) were aligned to each BAC sequence (Fig. 2: 4 chosen BACs, Fig. S2: all 11 BACs). On the other hand, the Tpit-SP v1 contigs covered on average 74% of the BAC sequences. In addition, 71 to 93% of the BAC lengths were covered by at least 10 reads per base pair when cleaned genomic reads were mapped (Table 4). Once more, these results illustrated that the Tpit-SP v1 genome assembly was of good sequence quality but remained fragmented.

At last, the number of conserved eukaryotic and arthropod genes being recovered in the assembly was determined. Among the 248 conserved eukaryotic genes searched by CEGMA, 145 (59%) were identified in the genome assembly, including 87 genes at full length. Interestingly, when allowing the CEGMA proteins to be split over two scaffolds, the ratio rose to 91% (226 out of 248 genes). In comparison, most of the published lepidopteran genomes available in Lepbase contain more than 94% of the corresponding conserved CEGMA genes at least partially. This proportion is lower in four species, namely ca. 93% for Heliconius erato demophoon, 92% for Plutella xylostella, 84% for Melitaea cinxia and 62% for Chilo suppressalis (Table 2). As for the BUSCO analyses, 34% and 47% of the conserved genes for eukaryotes and arthropods, respectively, were identified in the Tpit-SP v1 genome assembly. These proportions rose to 57% (eukaryotes) and 72% (arthropods) when genes were allowed to be split on two scaffolds. In comparison, in the high quality genome assemblies of the S. frugiperda strains, 87% (corn) and 92% (rice) of the BUSCO arthropod gene set were identified (Gouin et al., 2017).

All genome characteristics obtained in the present study indicated that the assembly is of acceptable completeness and that the contigs are of good quality. Yet, the scaffolding is not optimized and the assembly is still fragmented. However, the generated data permitted to develop a range of population genomics approaches such as RADseq or SNP identification from genome-wide resequencing (e.g., Leblois et al., 2018, Gautier et al.,
pers. comm.), and therefore represents a valuable resource. The high fragmentation still does not allow analyses of linkage disequilibrium, haplotypic data and comparative analyses over large scale (e.g. synteny). Hence, the genome still needs to be improved in the future, with a particular focus on scaffolding and thus the generation and the integration of long reads.

Construction and quality assessment of transcriptomic resources

Building a predicted gene set from the Tpit-SP v1 genome

In the genome assembly 29 415 predicted Tpit-SP coding genes (Aug2.1) were identified, which is close to the 26 329 predicted genes for the rice variant of S. frugiperda (Table 2). Within the available lepidopteran nuclear genomes, the average number of predicted coding genes is 17 157, with a minimum of 10 117 and a maximum of 26 329 genes (Table 2). In Aug2.1, 30 860 transcripts (CDS) were predicted. Furthermore, 89 225 exons were identified with an average length of 176 bp. On average, a gene was composed of 3 exons, with a maximum of 22 exons. The CDS of the predicted coding genes were on average 511 bp long, and summed up to 15.8 Mb. The cumulated length of the Aug2.1 introns was 220.8 Mb and the cumulated length of the intergenic regions was 301 Mb. The predicted Aug2.2 subset of high quality coding genes consisted of 8232 CDS which were retained to build the reference transcriptome.

De novo assembly of HiSeq and MiSeq transcriptomes

Regarding the de novo transcriptomes, the best assemblies (highest percentage of identified CEGMA genes) were obtained with k-mer values of 61 and 51 for the HiSeq and MiSeq transcriptomes, respectively. These two transcriptomes were thus kept and named HiSeq61 and MiSeq51, respectively. Around 86% (HiSeq61) and 70% (MiSeq51) of the set of cleaned, merged and extended (FLASH) RNAseq reads were used to establish the assemblies.
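As a rough consistency check on the gene-set figures quoted in this section, the following Python sketch recomputes a few derived quantities: exons per gene, mean exon length, mean CDS length, and the sum of the three sequence classes. The constant and function names are illustrative, and the inputs are simply the rounded values as read from the text, so small rounding differences from the reported averages are to be expected.

```python
# Figures as quoted in the text for the Aug2.1 gene set (rounded values).
N_GENES = 29_415
N_TRANSCRIPTS = 30_860
N_EXONS = 89_225
CDS_TOTAL_MB = 15.8
INTRON_TOTAL_MB = 220.8
INTERGENIC_TOTAL_MB = 301.0


def derived_gene_stats() -> dict:
    """Recompute simple ratios from the quoted totals (illustrative only)."""
    cds_total_bp = CDS_TOTAL_MB * 1e6
    return {
        # average number of exons per predicted gene
        "exons_per_gene": N_EXONS / N_GENES,
        # average exon length if all exon sequence is coding (assumption)
        "mean_exon_bp": cds_total_bp / N_EXONS,
        # average CDS length per predicted transcript
        "mean_cds_bp": cds_total_bp / N_TRANSCRIPTS,
        # coding + intronic + intergenic sequence, in Mb
        "summed_mb": CDS_TOTAL_MB + INTRON_TOTAL_MB + INTERGENIC_TOTAL_MB,
    }


if __name__ == "__main__":
    for name, value in derived_gene_stats().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the recomputed ratios land close to the reported averages (about 3 exons per gene, exons of roughly 176 bp, CDS of roughly 511 bp).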

The final average per base read coverage was 72 for HiSeq61 and 37 for MiSeq51. The HiSeq61 transcriptome covered 128 Mb (including alternative transcripts) and held 62 376 sequences which were grouped into 31 648 unigenes (Table 5). The MiSeq51 assembly had a total size of 152 Mb and consisted of 63 175 transcripts regrouped into 22 412 unigenes. The N50 values of the assembled HiSeq61 and MiSeq51 transcripts were 4177 and 3930 bp, respectively.

Reference transcriptome from the de novo transcriptomes and predicted gene set: construction, quality assessment and localization

The identification of coding sequences in de novo transcripts with FrameDP resulted in 36 547 MiSeq51 transcripts and in 31 415 HiSeq61 transcripts with predicted CDS. This set was complemented by 6486 Aug2.2 genes with predicted CDS and a previously published 454/Sanger SP transcriptome (Gschloessl et al., 2014) which had 5830 transcripts with predicted CDS, summing up to a total of 80 278 transcripts with CDS. Then, the longest FrameDP-peptide to each transcript was recovered and all protein sequences were subsequently clustered with CD-HIT into a reference protein set of 29 701 peptides. Of these reference sequences 20 465 (69%) had a full-length predicted CDS. Most reference proteins, i.e. 19 518 (66%) sequences, were only present in one set, i.e. were specifically reconstructed by a sequencing or prediction methodology, while 261 (1%) of these reference peptides were present in all four input protein sets, 4172 (14%) in three sets and 5750 (19%) in two sets (Fig. 3). Surprisingly, the HiSeq and MiSeq transcript assemblies showed a relatively low overlap of the corresponding protein sets. Precisely, 6606 HiSeq and 9540 MiSeq protein clusters contained sequences obtained by one of these technologies only, while 8754 clusters were shared among HiSeq and MiSeq results. While we did not clearly identify the cause of this low overlap, we suggest that sequencing specificities of HiSeq and MiSeq technologies might explain these differences.
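The set-membership percentages and the HiSeq/MiSeq overlap quoted above can be recomputed with a few lines of Python. This is only an illustrative sketch: the names are hypothetical, the counts are copied from the text, and the "shared fraction" is one possible way to express the overlap (shared clusters over the union of HiSeq and MiSeq clusters), not necessarily the measure the authors had in mind.

```python
# Counts as quoted in the text (illustrative variable names).
# Key = number of input protein sets a reference peptide was found in.
SETS_PER_REFERENCE = {1: 19_518, 2: 5_750, 3: 4_172, 4: 261}
HISEQ_ONLY, MISEQ_ONLY, SHARED = 6_606, 9_540, 8_754


def membership_shares(counts: dict) -> dict:
    """Rounded percentage of reference peptides found in exactly k input sets."""
    total = sum(counts.values())
    return {k: round(100 * n / total) for k, n in counts.items()}


def shared_fraction(only_a: int, only_b: int, both: int) -> float:
    """Share of all clusters (union of both sources) that contain both sources."""
    return both / (only_a + only_b + both)


if __name__ == "__main__":
    print(sum(SETS_PER_REFERENCE.values()))
    print(membership_shares(SETS_PER_REFERENCE))
    print(f"{shared_fraction(HISEQ_ONLY, MISEQ_ONLY, SHARED):.2f}")
```

The four membership counts add up to the 29 701 reference peptides, and on this definition roughly a third of the HiSeq/MiSeq clusters contain sequences from both technologies.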

These results emphasize the necessary cautions in future studies before combining data issued from different sequencing technologies. Yet, the filtering procedure which we applied to various data sets prior to the construction of the reference transcriptome was meant to lower potential biases and to retain only highly reliable genes and transcripts. This is underlined by the high quality of the final reference transcriptome, in which 246 out of 248 CEGMA eukaryotic core genes (Table 5) were at least partially present and 238 at full length. Concerning the 429 BUSCO conserved eukaryotic genes, 378 (88%) genes were found of which 350 were present at full length. The number of identified BUSCO arthropod genes within the reference transcriptome was 2236 out of 2675 (84%), 1938 being recovered at full-length. 24 648 (83%) reference transcripts matched with the Lepbase reference protein set. Of these, 13 802 reference transcripts had OHR values of at least 0.6, while 9578 (32%) reference transcripts had OHRs of at least 0.9, hence covering the respective best Lepbase protein sequence hits at least at 90% of their length. Overall, as can be seen in Table 5, compared to the three de novo transcriptomes and the Aug2.2 CDS set, the CD-HIT clustering approach generated a Tpit-SP reference transcriptome of higher quality than any of the four sets taken separately. Hence, by combining various approaches and sequencing technologies, the transcript set was significantly optimized. When the 29 701 reference transcripts were aligned against the Tpit-SP genome assembly, 9604 transcripts were located with 92% sequence identity on one single genome scaffold, while the other transcripts, 17 149 (58%), were identified as being split on two scaffolds. In total, 26 753 (90%) reference transcripts could be located in the Tpit-SP v1 genome assembly. The structural and functional gene annotation of the genomic resource was greatly improved by positioning these actual (not predicted) transcripts. Hence, the reference transcriptome, along with the genome draft, will constitute an important resource to interpret e.g. outlier SNPs in the population genomic studies, by providing valuable and annotated information in their vicinity.
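The ortholog hit ratio (OHR) thresholds quoted in this section can be illustrated with a short Python sketch. It assumes the common definition of the OHR as the aligned portion of the best-hit reference protein divided by that protein's full length. The function names, the 1-based inclusive coordinates and the example values are illustrative assumptions and are not taken from the pipeline used in this study.

```python
def ortholog_hit_ratio(hit_start: int, hit_end: int, hit_length: int) -> float:
    """Fraction of a reference protein covered by the aligned region.

    hit_start and hit_end are 1-based inclusive alignment coordinates on the
    reference (best-hit) protein; hit_length is its full length in residues.
    """
    if hit_length <= 0:
        raise ValueError("hit_length must be positive")
    if not 1 <= hit_start <= hit_end <= hit_length:
        raise ValueError("alignment coordinates must lie within the protein")
    return (hit_end - hit_start + 1) / hit_length


def classify(ohr: float) -> str:
    """Bin an OHR value using the two thresholds quoted in the text."""
    if ohr >= 0.9:
        return ">=0.9"
    if ohr >= 0.6:
        return ">=0.6"
    return "<0.6"


if __name__ == "__main__":
    # Hypothetical hit: residues 11..280 of a 300-residue reference protein.
    ratio = ortholog_hit_ratio(11, 280, 300)
    print(f"{ratio:.2f}", classify(ratio))
```

On this reading, a transcript with an OHR of at least 0.9 covers at least 90% of the length of its best Lepbase protein hit.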

Functional annotation and orthology analyses

Among the 29 701 reference transcripts, 71% (n=20 989) could be functionally annotated. This proportion is higher than those reported for S. frugiperda (42%), Spodoptera litura (38%) and Spodoptera exigua (40%) (Legeai et al., 2014; Pascual et al., 2012; Song et al., 2016). Of the annotated reference transcripts 13 907 (47%) were associated to 3896 distinct GO terms. The category ‘biological process’ comprised 2053 GO terms (15 205 assignments, Table S4), whereas 1275 GO terms belonged to ‘molecular function’ (18 361 assignments, Table S5) and 568 to ‘cellular component’ (10 609 assignments, Table S6). The three most represented biological processes were oxidation-reduction process (n=723), proteolysis (n=502) and regulation of transcription (n=500). ATP binding (n=1385), metal ion binding (n=873) and zinc ion binding (n=827) were the most frequent molecular function assignments. Concerning the GO category cellular component, integral component of membrane (n=3355), nucleus (n=913) and membrane (n=776) were the most represented terms.

The orthology study aimed to identify counterparts of T. pityocampa reference transcripts in the lepidopteran species S. frugiperda, B. mori and D. plexippus as well as in the fruit fly D. melanogaster. Hence, this analysis contributed to the biological validation of the reference coding gene set and represented an essential indicator for its quality. In total, OrthoMCL identified 16 779 groups for the five species taken together (Fig. 4). 16 743 (56%) among the PPM reference transcripts with predicted proteins were assigned to 11 187 ortholog groups. This number of unique genes is lower than the 19 471 ortholog groups reported for the S. frugiperda genome strains (Gouin et al., 2017).

A majority of the PPM orthologs (93%, n=10 355) were shared with at least one of the other lepidopteran species. 413 PPM orthologs (528 transcripts) were specific to both Noctuoidea species (S. frugiperda and T. pityocampa) and 774 orthologs (supplementary data […]) corresponding to 2675 protein-coding transcripts were highly divergent from the other four insect species.

CONCLUSIONS

This study provides the first genome assembly of a Notodontidae species, and the second for the species-rich Noctuoidea super-family which is currently underrepresented in the public databases. Right now (October 2017), the majority (n=13) of the 21 lepidopteran genomes publicly available in Lepbase belong to the Papilionoidea super-family. In addition to the Tpit draft genome, a high-quality reference transcriptome for this species is released. Thus, the resources developed in this study for PPM will be of prime importance for Lepidoptera comparative genomics and transcriptomics which will shed light on this neglected taxonomic clade. In the case of the PPM and in particular the evolution of the phenologically shifted SP in Portugal, the availability of this draft genome will allow genome-wide analyses of genetic diversity to disentangle its recent evolutionary history (Leblois et al., 2018), and to gain knowledge on the genetic architecture of major adaptive traits such as phenology or tolerance to high temperature, even though the so far limited scaffolding still impedes some applications requiring additional genomic information. The de novo transcriptomic resources will permit a thorough analysis of major gene families involved in adaptation to climatic conditions (e.g., heat shock proteins) or biotic interactions (e.g., P450 involved in host use, immune reactions linked to interactions with natural enemies). They will also allow comparative transcriptomics and differential expression studies using RNAseq approaches across developmental stages or populations and contribute to functionally annotate identified SNPs.

As shown by several quality measurements, the obtained contigs were characterized by good coverage and completeness values, while the scaffolding is still to be improved. One of the main limitations for building a genome-wide assembly in PPM is that inbred lines were not available due to heavy mortality in rearing conditions (Branco M.R., pers. comm.) and the very urticating L5 larval stage. The samples considered here, although originating from a population with a depleted genetic diversity, still remained sub-optimal for genome assembly because of its intrinsic heterozygosity. Yet, molecular and algorithmic alternatives can be developed to improve future genome assemblies. For example, adding long fragment sequences, such as those produced by PacBio or MinIon technologies (Flusberg et al., 2010; Giordano et al., 2017) in the assembly process has been proven to facilitate the scaffolding process and hence raise the overall quality of a de novo genome (Rhoads & Au, 2015). In addition, as shown in the current Tpit-SP v1 assembly, around 58% (n=17 149) of the established reference transcripts were split on two separate genome scaffolds and 2948 (10%) transcripts could not be localized at all on the genome. Recent studies (M. Chen et al., 2015; Xue et al., 2013) proposed to embed transcriptomic data into the assembly to improve recovery of coding regions and hence to increase the quality of scaffolds and coding regions. Last, building a linkage map could further contribute to associate and order BACs, genes and other markers to specific super-scaffolds. A future, improved T. pityocampa reference genome would allow to establish and refine studies of genetic diversity either to identify signature of selections through genomic scans (Gautier, 2015; Vitalis, Gautier, Dawson, & Beaumont, 2014) or to develop powerful analyses of demographic scenarios using a large amount of neutral SNPs and linkage disequilibrium data.

ACKNOWLEDGEMENTS

This work was supported by a grant from the French National Agency for Research (ANR-10-JCJC-1705-01-GENOPHENO) and by the Institut National de la Recherche Agronomique (GAPP project, INRA AIP BioRessources 2012). Work in Portugal was partially supported by the project UID/AGR/00239/2013. Karyotyping experiments were performed at the “Génomique Environnemental/Cytogénomique Evolutive”, and DNA and RNA sonication was performed at the GenSeq technical facilities, both from the LabEx CeMEB (Laboratoire d'Excellence Centre Méditerranéen de l'Environnement et de la Biodiversité, Montpellier, France). The authors acknowledge the staff from the GPTR genotyping platform of Cirad Montpellier (France) for their support in the development of "genotyping applications" (MiSeq sequencing). Most of the bioinformatics analyses were performed on the CBGP HPC computational platform (Montferrier-sur-Lez, France). The authors would also like to thank the bioinformatics platforms ABiMS (Roscoff, France) and GenOuest (Rennes, France) for providing access to their computational resources and Fabrice Legeai (IGEPP-BIPAA, INRA, INRIA/Irisa, Campus Beaulieu, Rennes, France) for his bioinformatics advices.

References

Ahola, V., Lehtonen, R., Somervuo, P., Salmela, L., Koskinen, P., Rastas, P., . . . Hanski, I. (2014). The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nature Communications, 5(4737). doi: 10.1038/ncomms5737

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410. doi: 10.1016/s0022-2836(05)80360-2

Bailly-Bechet, M., Haudry, A., & Lerat, E. (2014). “One code to find them all”: A perl tool to conveniently parse RepeatMasker output files. Mobile DNA, 5(13). doi: 10.1186/1759-8753-5-13

Barnett, D. W., Garrison, E. K., Quinlan, A. R., Stromberg, M. P., & Marth, G. T. (2011). BamTools: A C++ API and toolkit for analyzing and managing BAM files. Bioinformatics, 27(12), 1691-1692. doi: 10.1093/bioinformatics/btr174

Battisti, A., Avcı, M., Avtzis, D. N., Ben Jamaa, M. L., Berardi, L., Berretima, W., . . . Zamoum, M. (2015). Natural history of the processionary moths (Thaumetopoea spp.): new insights in relation to climate change. In A. Roques (Ed.), Processionary moths and climate change: an update (pp. 15-80): Springer / Quae Editions.

Battisti, A., Holm, G., Fagrell, B., & Larsson, S. (2011). Urticating hairs in arthropods: their nature and medical significance. Annual Review of Entomology, 56, 203-220. doi: 10.1146/annurev-ento-120709-144844

Battisti, A., Larsson, S., & Roques, A. (2017). Processionary moths and associated urtication risk: global change-driven effects. Annual Review of Entomology, 62, 323-342. doi: 10.1146/annurev-ento-031616-034918

Battisti, A., Stastny, M., Netherer, S., Robinet, C., Schopf, A., Roques, A., & Larsson, S. (2005). Expansion of geographic range in the pine processionary moth caused by increased winter temperatures. Ecological Applications, 15(6), 2084-2096. doi: 10.1890/04-1903

Berardi, L., Pivato, M., Arrigoni, G., Mitali, E., Trentin, A. R., Olivieri, M., . . . Masi, A. (2017). Proteome analysis of urticating setae from Thaumetopoea pityocampa (Lepidoptera: Notodontidae). Journal of Medical Entomology, 54(6), 1560-1566. doi:10.1093/jme/tjx144

Blankenberg, D., Von Kuster, G., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., . . . Taylor, J. (2010). Galaxy: A web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology, 19, 1021. doi: 10.1002/0471142727.mb1910s89

Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D., & Pirovano, W. (2011). Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 27(4), 578-579. doi: 10.1093/bioinformatics/btq683

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120. doi: 10.1093/bioinformatics/btu170

Branco, M., Paiva, M. R., Santos, H. M., Burban, C., & Kerdelhué, C. (2017). Experimental evidence for heritable reproductive time in 2 allochronic populations of pine processionary moth. Insect Science, 24(2), 325-335. doi: 10.1111/1744-7917.12287

Burban, C., Gautier, M., Leblois, R., Landes, J., Santos, H., Paiva, M. R., . . . Kerdelhué, C. (2016). Evidence for low-level hybridization between two allochronic populations of the pine processionary moth, Thaumetopoea pityocampa (Lepidoptera: Notodontidae). Biological Journal of the Linnean Society, 119(2), 311-328. doi: 10.1111/bij.12829

Bushnel, B. (2014). BBMap short read aligner, and other bioinformatic tools (Version 35.21). Retrieved from https://sourceforge.net/projects/bbmap/

Chalhoub, B., Belcram, H., & Caboche, M. (2004). Efficient cloning of plant genomes into bacterial artificial chromosome (BAC) libraries with larger and more uniform insert size. Plant Biotechnology Journal, 2(3), 181-188. doi: 10.1111/j.1467-7652.2004.00065.x

Challis, R. J., Kumar, S., Dasmahapatra, K. K. K., Jiggins, C. D., & Blaxter, M. (2016). Lepbase: The Lepidopteran genome database. bioRxiv. doi: 10.1101/056994

Chen, H., & Boutros, P. C. (2011). VennDiagram: A package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinformatics, 12, 35. doi: 10.1186/1471-2105-12-35
Chen, M., Hu, Y., Liu, J., Wu, Q., Zhang, C., Yu, J., . . . Wu, J. (2015). Improvement of genome assembly completeness and identification of novel full-length protein-coding genes by RNA-seq in the giant panda genome. Scientific Reports, 5(18019). doi: 10.1038/srep18019

Conesa, A., Götz, S., Garcia-Gomez, J. M., Terol, J., Talon, M., & Robles, M. (2005). Blast2GO: A universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21(18), 3674-3676. doi: 10.1093/bioinformatics/bti610

Cong, Q., Borek, D., Otwinowski, Z., & Grishin, N. V. (2015). Tiger swallowtail genome reveals mechanisms for speciation and caterpillar chemical defense. Cell Reports, 10(6), 910-919. doi: 10.1016/j.celrep.2015.01.026

Derks, M. F., Smit, S., Salis, L., Schijlen, E., Bossers, A., Mateman, C., . . . Megens, H. J. (2015). The genome of winter moth (Operophtera brumata) provides a genomic perspective on sexual dimorphism and phenology. Genome Biology and Evolution, 7(8), 2321-2332. doi: 10.1093/gbe/evv145

Du, L., Li, W., Fan, Z., Shen, F., Yang, M., Wang, Z., . . . Zhang, X. (2015). First insights into the giant panda (Ailuropoda melanoleuca) blood transcriptome: A resource for novel gene loci and immunogenetics. Molecular Ecology Resources, 15(4), 1001-1013. doi: 10.1111/1755-0998.12367

El Mokhefi, M., Kerdelhué, C., Burban, C., Battisti, A., Chakali, G., & Simonato, M. (2016). Genetic differentiation of the pine processionary moth at the southern edge of its range: Contrasting patterns between mitochondrial and nuclear markers. Ecology and Evolution, 6(13), 4274-4288. doi: 10.1002/ece3.2194

Fitak, R. R., Mohandesan, E., Corander, J., & Burger, P. A. (2016). The de novo genome assembly and annotation of a female domestic dromedary of North African origin. Molecular Ecology Resources, 16(1), 314-324. doi: 10.1111/1755-0998.12443

Flusberg, B. A., Webster, D. R., Lee, J. H., Travers, K. J., Olivares, E. C., Clark, T. A., . . . Turner, S. W. (2010). Direct detection of DNA methylation during single-molecule, real-time sequencing. Nature Methods, 7(6), 461-465. doi: 10.1038/nmeth.1459

Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150-3152. doi: 10.1093/bioinformatics/bts565

Gautier, M. (2015). Genome-wide scan for adaptive divergence and association with population-specific covariates. Genetics, 201(4), 1555-1579. doi: 10.1534/genetics.115.181453

Giordano, F., Aigrain, L., Quail, M. A., Coupland, P., Bonfield, J. K., Davies, R. M., . . . Jackson, D. K. (2017). De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms. Scientific Reports, 7(1), 3935. doi: 10.1038/s41598-017-03996-z

Gonthier, L., Bellec, A., Blassiau, C., Prat, E., Helmstetter, N., Rambaud, C., . . . Quillet, M. C. (2010). Construction and characterization of two BAC libraries representing a deep-coverage of the genome of chicory (Cichorium intybus L., Asteraceae). BMC Research Notes, 3, 225. doi: 10.1186/1756-0500-3-225

Gouin, A., Bretaudeau, A., Nam, K., Gimenez, S., Aury, J. M., Duvic, B., . . . Fournier, P. (2017). Two genomes of highly polyphagous lepidopteran pests (Spodoptera frugiperda, Noctuidae) with different host-plant ranges. Scientific Reports, 7(1), 11816. doi: 10.1038/s41598-017-10461-4

Gouzy, J., Carrère, S., & Schiex, T. (2009). FrameDP: Sensitive peptide detection on noisy matured sequences. Bioinformatics, 25(5), 670-671. doi: 10.1093/bioinformatics/btp024

Gregory, T. R., Nicol, J. A., Tamm, H., Kullman, B., Kullman, K., Leitch, I. J., . . . Bennett, M. D. (2007). Eukaryotic genome size databases. Nucleic Acids Research, 35(Database issue), D332-D338. doi:10.1093/nar/gkl828

Gschloessl, B., Vogel, H., Burban, C., Heckel, D., Streiff, R., & Kerdelhué, C. (2014). Comparative analysis of two phenologically divergent populations of the pine processionary moth (Thaumetopoea pityocampa) by de novo transcriptome sequencing. Insect Biochemistry and Molecular Biology, 46, 31-42. doi: 10.1016/j.ibmb.2014.01.005

Harris, R. S. (2007). Improved pairwise alignment of genomic DNA. (PhD thesis), The Pennsylvania State University, USA. Retrieved from http://www.bx.psu.edu/~rsharris/rsharris_phd_thesis_2007.pdf

Hodar, J. A., Zamora, R., & Castro, J. (2002). Host utilisation by moth and larval survival of pine processionary caterpillar Thaumetopoea pityocampa in relation to food quality in three Pinus species. Ecological Entomology, 27, 292-301. doi: 10.1046/j.1365-2311.2002.00415.x

International Silkworm Genome Consortium (2008). The genome of a lepidopteran model insect, the silkworm Bombyx mori. Insect Biochemistry and Molecular Biology, 38(12), 1036-1045. doi:10.1016/j.ibmb.2008.11.004

Jakubowska, A. K., Nalcacioglu, R., Millan-Leiva, A., Sanz-Carbonell, A., Muratoglu, H., Herrero, S., & Demirbag, Z. (2015). In search of pathogens: Transcriptome-based identification of viral sequences from the pine processionary moth (Thaumetopoea pityocampa). Viruses, 7(2), 456-479. doi: 10.3390/v7020456

Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Research, 12(4), 656-664. doi: 10.1101/gr.229202

Kerdelhué, C., Zane, L., Simonato, M., Salvato, P., Rousselet, J., Roques, A., & Battisti, A. (2009). Quaternary history and contemporary patterns in a currently expanding species. BMC Evolutionary Biology, 9, 220. doi: 10.1186/1471-2148-9-220

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(R36). doi: 10.1186/gb-2013-14-4-r36

Krzywinski, M., Schein, L. J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., . . . Marra, M. A. (2009). Circos: An information aesthetic for comparative genomics. Genome Research, 19(9), 1639-1645. doi: 10.1101/gr.092759.109

Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357-359. doi: 10.1038/nmeth.1923

Leblois, R., Gautier, M., Rohfritsch, A., Foucaud, J., Burban, C., Galan, M., . . . Kerdelhué, C. (2018). Deciphering the demographic history of allochronic differentiation in the pine processionary moth Thaumetopoea pityocampa. Molecular Ecology, accepted.

Legeai, F., Gimenez, S., Duvic, B., Escoubas, J. M., Gosselin Grenet, A. S., Blanc, F., . . . Fournier, P. (2014). Establishment and analysis of a reference transcriptome for Spodoptera frugiperda. BMC Genomics, 15, 704. doi: 10.1186/1471-2164-15-704

Li, B., & Dewey, C. N. (2011). RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12, 323. doi: 10.1186/1471-2105-12-323

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078-2079. doi: 10.1093/bioinformatics/btp352

Li, L., Stoeckert, C. J., & Roos, D. S. (2003). OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Research, 13(9), 2178-2189. doi: 10.1101/gr.1224503

This article isThis article by protected copyright. All rightsreserved. applications. its and Accepted sequencing PacBio (2015). F. K. Au, & A., Rhoads, comparing for utilities of suite flexible A BEDTools: (2010). M. I. Hall, & R., A. Quinlan, (1998). B. Dutrillaux, & H., Hayes, P., Popescu, (2006). Å. J. Nilsson, & M., Neves, C., Ferreira, M., Santos, T., Calvão, C., Pimentel, A. R. Wing, A., D. Frisch, P., J. Tomkins, G., D. Peterson, . . . G., Gloeckner, J., Ferre, J., Canizares, M., J. Blanca, K., A. Jakubowska, L., Pascual, & K., Bradnam, G., Parra, for Carmichael, D., J. Dzurisin, T., S. CenterO'Neil, National the of resources Database (2016). Coordinators. Resource ArticleNCBI ontology An study: case Chado A (2007). B. D. Emmert, & J., C. Mungall, the of Characterization (2016). B. Kempenaers, & B., Timmermann, H., Kuhl, C., J. Mueller, lock fast, A (2011). C. Kingsford, & G., Marcais, S.,Perrier,Manel, C.,Pratlong, M., Abi to reads short of adjustment length Fast FLASH: (2011). L. S. Salzberg, & T., Magoc, numbe Chromosome (2014). A. V. Lukhtanov, Proteomics Proteomics Bioinformatics, 13 genomic Pratiques Techniques et 10.1016/j.foreco.2006 Portugal. Coastal Central forest, pine production a in cycle life shifted a populationwith Notodontidae) Esta guide. illustrated An libraries: (BAC) chromosome artificial bacterial plant of Construction 10.1016/j.ibmb.2012.04.003570. doi: of microbes. transcriptome of types different The (2012). S. Herrero, genomes. eukaryotic 10.1093/bioinformatics/btm071 in genes 11 propertius Population (2010). 10.1093/nar/gkv1290doi: Information. Biotechnology 23 genome representing for schema 10.1111/1755561. doi: signals. selection and expression biased tit k blue the of transcriptome and genome of occurrences 10.1093/bioinformatics/btr011 of ge 10.1111/mec.13468 in selection positive (2016). assemblies. 10.1093/bioinformatics/btr507 genome improve 10.3897/CompCytogen.v8i4.8789 Hesperiidae). 
- i337 (13), 310 blishment and expansion of a of expansion and blishment Journal Agricultural Genomics, 5 of

eoi rsucs n ter nlec o te eeto o te inl of signal the of detection the on influence their and resources Genomic features. features.

and - i346. doi:10. Papilio zelicaon Papilio Bioinformatics, 26 oprtv Ctgntc, 8 Cytogenetics, Comparative -

ee tasrpoe eunig f omdl organisms nonmodel of sequencing transcriptome level Korf, I. (2007). CEGMA: A pipeline to accurately annotate core annotate accurately to pipeline A CEGMA: (2007). I. Korf, oet clg ad aaeet 233 Management, and Ecology Forest .06.005 - 0998.12450

(pp. 260). (pp. 260).

nome scans. scans. nome 1093/bioinformatics/btm189 Insect Biochemistry and Molecular Biology, 42 Biology, Molecular and Biochemistry Insect -

uli Ais eerh 44 Research, Acids Nucleic mers. (5), 278(5),

. - Rached, P.,&Aurelle,D. L.,Paganini,J.,Pontarotti, - BMC Genomics, BMC

soitd ilgcl information. biological associated R. D., Lobo, N. F., Emrich, S. J., & Hellmann, J. J. J. Hellmann, & J., S. Emrich, F., N. Lobo, D., R. Paris, FRA:INRA Editions. Thaumetopoea pityocampa Thaumetopoea

(6), 841 (6),

Bioinformatics, Bioinformatics, - 289. doi: 289. doi: 10.1016/j.gpb.2015.08.002 iifrais 27 Bioinformatics, iifrais 23 Bioinformatics, oeua Eooy 25 Ecology, Molecular - oeua Eco Molecular free approach for efficient parallel countingefficient approach for free Cyanistes caeruleus Cyanistes , 1 - r evolution in skippers (Lepidoptera, (Lepidoptera, skippers in evolution r 842. 10.1093/bioinformatics/btq033 doi: - pdpea exigua Spodoptera Techniques de cytogénétique animale. animale. cytogénétique de Techniques 3.

11 ,

310. doi: 10.1186/1471 doi: 310. & aesn A H (2000). H. A. Paterson, & , 27 oy eore, 16 Resources, logy 4, 275 (4), Dtbs ise, D7 issue), (Database 2) 2957 (21), 9, 1061 (9), 6, 764 (6),

(Den. & Schiff.) (Lep. Schiff.) & (Den. : Polymorphisms, sex Polymorphisms, : 1, 108 (1),

1, 170 (1), larvae exposed to to exposed larvae - Bioinformatics, Bioinformatics, based modular based - 291. doi: doi: 291. - - - 93 doi: 2963. 07 doi: 1067. 7. doi: 770. - -

1. doi: 115. 8. doi: 184. Genomics 2, 549 (2), (8), 557 (8), Erynnis Erynnis - 2164 - D19. - - - - -

This article isThis article by protected copyright. All rightsreserved. Tophat. with AUGUSTUS into RNAseq Accepted Illumina Incorporating (2014). M. Stanke, RNA Incorporating (2009). M. Stanke, Transcriptional (2016). Z. Huang, & X., Guan, M., Li, E., Shao, S., Wu, C., Chen, F., Song, A., F. A. Smit, (2008 R. Hubley, & A., F. A. Smit, Holme & J., C. Mungall, D., L. Stein, V., A. Uzilov, E., M. Skinner, E. Kriventseva, P., Ioannidis, M., R. Waterhouse, A., F. Simao, R D. Zerbino, H., M. Schulz, metagenomic of preprocessing and control Quality (2011). R. Edwards, & R., Schmieder, Kerdelhu C., Tavares, R., M. Paiva, H., Santos, Article (2011). C. Kerdelhué, & M., Branco, P., J. Rossi, J., Rousselet, C., Burban, H., Santos, D., Argal, R., Zhao, J., Rousselet, J. Rossi, J., Rousselet, C., Robinet, R & M., Laparie, C., Robinet, C. Imbert, C., Robinet, Retrieved 2017, from from 2017, greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.Tophat Retrieved http://augustus.gobics.de/binaries/re 10.1038/srep23861 doi: activation. toxin the in trypsin of involvement possible of analysis profiling Retrieved from from 10.1101/gr.094607.109 next A with 10.1093/bioinformatics/btv351 completeness annotation and assembly single genome Assessing BUSCO: (2015). 28 RNA datasets. divergence. allochronic Evolutionary under 9101.2011.02318.x population of Lepidoptera a Journal in observed shift niche moth processionary 146 pine the in pityocampa speciation allochronic Incipient of Biogeography, 37 pine the of history demographic the moth, processionary structuring in topography of role The (2010). 0239 for connectivity organisms. habitat forest on impact and distribution Spatial landscapes: agricultural challenges. ForestAnnals of Science, 71 future and model simulation a from results Preliminary France: in moth pine the in heterogeneity critical in result moth.processionary can phenologies Local change: global Europe. Human (2012). - (8), 1086 (8), . Gri, . Rqe, A. 
Roques, J., Garcia, P., - 158. doi:j.1420 - - http://www.repeatmasker.org 8 - seq assembly across the dynamic range of expression levels. levels. expression of range dynamic the across assembly seq oy orthologs. copy

- Biological Biological Invasions,14 generation genome browser. browser. genome generation

Bioinformatics, 27 uly R, Gen P (2013 P. Green, & R., Hubley, - 1092. doi: 10.1093/bioinformatics/bts0941092. doi:

( eiotr, Notodontidae). Lepidoptera, - http://www.repeatmasker.org - E., Rousselet, J., Sauvard, D., Garcia, J., Goussard, F., & Roques, A. A. Roques, & F., Goussard, J., Garcia, D., Sauvard, J., Rousselet, E., eitd long mediated

adcp Eooy 31 Ecology, Landscape & Roques, A. (2014). Potential spread of the pine processionary pine the of spread Potential (2014). A. Roques, & - (8), (8), 1478 Frontiers 6 Frontiers in Physiology, Thaumetopoea pityocampa Thaumetopoea 9101.2010.02147.x pdpea litura Spodoptera , igo, . & iny E (02. ae: Robust Oases: (2012). E. Birney, & M., Vingron, .,

ousselet, J. (2015). Looking beyond the large scale effects of effects scale large the beyond Looking (2015). J. ousselet,

(6), 863 (6), Simonato, M., Battisti, A., Roques, A., & Kerdelhué, C. C. Kerdelhué, & A., Roques, A., Battisti, M., Simonato, (2), 149 (2), - ilg, 24 Biology, 1490. do -

& osee, . 21) Tes usd frss in forests outside Trees (2016). J. Rousselet, & , 05. eetoee (eso open (Version RepeatModeler 2015). - iifrais 31 Bioinformatics, itne up o te ie rcsinr mt in moth processionary pine the of jumps distance (8), (8), 1557 -

Seq into AUGUSTUS. Retrieved 2017, from from 2017, Retrieved AUGUSTUS. into Seq - adme.rnaseq.html 864. 10.1093/bioinformatics/btr026 doi: - 160. doi: 10.1007/s13595160. doi: i: 10.1111/j.1365 i: eoe eerh 19 Research, Genome

- ave cha larvae - 1569. doi: 10.1007/s105301569. doi: é 05. eetakr Vrin open (Version RepeatMasker 2015). 9, 1897 (9), C, Bac, . (2011). M. Branco, & C., , 2, 243 (2), ora o Ev of Journal

(334). doi: 10.3389/fphys.2015.00334(334). doi:

(Lepidoptera: Notodontidae). (Lepidoptera:

lne wt VpA txn and toxin Vip3Aa with llenged -

5. o: 10.1007/s10980 doi: 254.

- 1) 3210 (19), 95 di 10.1111/j.1420 doi: 1905. - 2699.2010.02289.x Scientific Reports, 6 Reports, Scientific ltoay ilg, 24 Biology, olutionary s, I. H. (2009). JBrowse: (2009). H. I. s, V., & Zdobnov, E. M. M. E. Zdobnov, & V., - 013 9, 1630 (9), - 0287 - http://bioinf.uni - 011 - Bioinformatics, .) Retrieved 1.0). Thaumetopoea Thaumetopoea 22 doi: 3212.

- Temperature 7 - - 9979 68 doi: 1638.

(23861). (23861). e novo de Journal - 9 - - 4.0). 015

(1), (1),

- - -

This article isThis article by protected copyright. All rightsreserved. Accepted cost Ecological (2008). S. Larsson, & A., Battisti, M., Stastny, D., Zovi, for Algorithms Velvet: (2008). E. Birney, & R., D. Zerbino, InterProScan (2001). R. Apweiler, & M., E. Zdobnov, (2013). W. X. Sun, & Y., Y. Kuang, F., X. Kong, Y., G. Hou, P., Y. Zhu, T., J. Li, W., Xue, & J., Gurtowski, H., Fang, J., C. Underwood, M., Nattestad, J., F. Sedlazeck, W., G. Article Vurture, & J., K. Dawson, M., Gautier, R., Vitalis, Garcia I., Moneo, M., J. Vega, . . V., E. Koonin, B., Kiryutin, R., A. Jacobs, D., J. Jackson, D., N. Fedorova, L., R. Tatusov, clusters: UniRef (2015). H. C. Wu, & B., P. McGarvey, H., Huang, Y., Wang, E., B. Suzek, new a and model Markov hidden a with prediction Gene (2003). S. Waack, & M., Stanke, se web A AUGUSTUS: (2005). B. Morgenstern, & M., Stanke, f n net ebvr ipsd y ot lns n enemies. and 10.1890/071398. doi: plants host by imposed herbivore insect an of graphs. Bruijn 10.1101/gr.074492.107 de using 10.1093/bioinformatics signature 10.1186/1471doi: wi genomes Scaffolding L_RNA_scaffolder: reads. data. (20 C. M. Schatz, frequency gene 10.1534/genetics.113.152991 from 165 selection Immunology, and Allergy of 10.1159/000369807 Archives e International setae a of utility Diagnostic to sensitization IgE (2014). J. Vega, . . . I., A. Mahillo, BMC Bioinformatics, eukaryotes. includes version updated An database: COG The (2003). A. D. Natale, . Bioinformatics, 31 searches. similarity sequence improving for alternative scalable and comprehensive A submod 10.1093/bioinformatics/btg1080 intron issue),Server W465 user allows that eukaryotes Bioinformatics - recognition methods in InterPro. InterPro. in methods recognition 17). GenomeScope: Fast reference Fast GenomeScope: 17). el. el. - (6), 926(6), 2164

- W467. doi: 10.1093/nar/gki458W467. doi: . doi: 10.1093/bioinformatics/btx153 4 -

0883.1 ,

/17.9.847

- 41. doi: 10.1186/147141. doi: 14 iifrais 19 Bioinformatics, - - ri, . . Gonzalez C., J. Ortiz, 932. doi: 932. doi: 10.1093/bioinformatics/btu739 - 604 -

eie constraints. defined

tat ciia pcue n ascae rs factors. risk associated and picture clinical xtract,

Beaumont, M. A. (2014). Detecting and measuring measuring and Detecting (2014). A. M. Beaumont, eoe eerh 18 Research, Genome th transcripts. th - iifrais 17 Bioinformatics, 2105 Spl 2, ii215 2), (Suppl. eeis 196 Genetics,

- uo, . Ri, . Rodriguez C., Ruiz, M., Munoz, uli Ais eerh 33 Research, Acids Nucleic -- - -

4 n nerto pafr fr the for platform integration an free genome profiling from short short from profiling genome free - 41 e novo de

Thaumetopoea pityocampa Thaumetopoea vrfrgn peito in prediction gene for rver BMC Genomics, BMC 5, 821 (5), clg, 89 Ecology, 3, 799 (3),

4, 283 (4), s on local adaptation local on s hr ra assembly read short 9, 847 (9),

- ii225. doi: doi: ii225. - - 829. doi: doi: 829. - 817. doi: doi: 817. - 9. doi: 290. 5, 1388 (5), 848. doi: doi: 848. 14 , (Web

604. - - :

AUTHOR CONTRIBUTIONS

Conceived and designed the study: MG, BG, CK, RS. Performed sampling, insect rearing and ensured sample preservation: SR, MB. Performed wet lab experiments (karyotyping, DNA extraction, library construction, BAC construction, quality control, high-throughput sequencing): LS, CB, EL, OB, GB, HB, JN, SN, PG. Developed and performed bioinformatics analyses: BG, FD, AB, ED. Wrote the paper: BG, FD, CK. All authors read and approved the final manuscript.

DATA ACCESSIBILITY

The genome and transcriptome assemblies and all annotations are publicly available in the LepidoDB database (http://bipaa.genouest.org/sp/thaumetopoea_pityocampa/download/). The raw sequence reads as well as the draft genome, the BAC assemblies and the reference transcriptome are accessible as NCBI BioProject with the id PRJNA344465 (https://www.ncbi.nlm.nih.gov/bioproject/344465).

SUPPORTING INFORMATION

Additional Supporting Information may be found in the online version of this article:

Fig. S1 Metaphase obtained from an egg mass of T. pityocampa, allowing to determine 2n = 98 as the most likely number of chromosomes in that species.

Fig. S2 All 11 assembled BAC sequences and corresponding aligned Tpit-SP v1 genome scaffolds.

Table S1 Read counts for the libraries used for the Tpit-SP v1 genome assembly.

Table S2 Read coverage per base of paired-end reads on the Tpit-SP v1 genome scaffolds.

Table S3 Read counts for the libraries used for the BAC assemblies.

Table S4 GO term counts for the category 'biological process'.

Table S5 GO term counts for the category 'cellular component'.

Table S6 GO term counts for the category 'molecular function'.

FIGURE LEGENDS

Fig. 1 Workflow summarizing the bioinformatics analyses to generate the genomic and transcriptomic resources for Tpit-SP. A. Reconstruction of the 11 BAC sequences used for genome quality assessment; B. Assembly of the Tpit-SP v1 draft genome; C. Generation of the HiSeq, MiSeq and reference transcriptomes.

Fig. 2 Four chosen assembled BAC sequences and the corresponding aligned nuclear Tpit-SP v1 genome scaffolds.

Fig. 3 Venn diagram showing all reference transcripts and their coverage among the three transcriptome assemblies and the Aug2.2 CDS prediction.

Fig. 4 Venn diagram showing all OrthoMCL ortholog groups among the predicted Tpit-SP v1 reference proteins (TPIT) and the proteomes of B. mori (BMOR), S. frugiperda (SFRU), D. plexippus (DPLE) and D. melanogaster (DMEL).

TABLES

Table 1 Features of the contigs and the scaffolds (≥ 1 kb) retained in the assembly of the nuclear T. pityocampa genome. The coverage is defined as the average read count per assembled bp.

| Feature | Contigs | Large scaffolds |
|---|---|---|
| Total length [Mb] | 507 | 537 |
| Sequence count | 675 934 | 68 292 |
| Mean [bp] | 750 | 7870 |
| Median [bp] | 402 | 1757 |
| N50 [bp] | 1424 | 163 589 |
| N50 sequence count | 100 209 | 728 |
| N90 [bp] | 320 | 1951 |
| N90 sequence count | 379 839 | 29 439 |
| Minimum length [bp] | 51 | 1000 |
| Maximum length [bp] | 19 793 | 2 148 522 |
| Second largest length [bp] | 18 371 | 1 877 510 |
| Total coverage | 121 | 84 |
| PE300i coverage | 76 | 52 |
| PE600i coverage | 45 | 31 |
| SE454 coverage | 0.9 | 0.5 |
| GC content [%] | 38.0 | 37.2 |
| N base count | 0 | 116 266 296 |
| N content in assembly [%] | 0 | 21.6 |
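The N50 and N90 values and the associated sequence counts reported in Table 1 can be illustrated with a minimal sketch. It assumes the usual reading of the definition given with Table 5 (the length for which the stated fraction of the assembly size is represented by sequences of this size or longer). The function name `nx_stats` and the toy lengths are hypothetical and are not taken from the assembly pipeline used in this study.

```python
def nx_stats(lengths, fraction=0.5):
    """Return (Nx length, number of sequences of that length or longer needed).

    fraction=0.5 gives N50, fraction=0.9 gives N90.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    threshold = fraction * sum(ordered)       # share of the assembly size to reach
    running_total = 0
    for count, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= threshold:
            return length, count


# Hypothetical toy assembly (lengths in bp), not data from this study
toy_lengths = [100, 80, 60, 40, 20]
n50_length, n50_count = nx_stats(toy_lengths, 0.5)
n90_length, n90_count = nx_stats(toy_lengths, 0.9)
print(n50_length, n50_count)
print(n90_length, n90_count)
```

In this toy case two sequences are enough to cover half of the 300 bp total, whereas four are needed to reach 90%, which mirrors the gap between the N50 and N90 sequence counts in Table 1.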

Table 2 Characteristics of Lepidoptera genome assemblies available in Lepbase v4. Genome characteristics for Spodoptera frugiperda were taken from the corresponding publications Gouin et al. (2017) and Legeai et al. (pers. comm.). a) Counts refer to genes/transcripts predicted on the assembled genomes (Aug2.1 for Tpit-SP v1 T. pityocampa nuclear genome scaffolds). b) The CEGMA % for Melitaea cinxia were missing in Lepbase and were therefore obtained from the corresponding genome paper (Ahola et al., 2014). c) In brackets the percentage is shown, including the CEGMA genes which were split on two genome scaffolds.

Part A

| Species | Super family; family | Common name | Version | Accession | Date | Size (Mb) | Scaffold count |
|---|---|---|---|---|---|---|---|
| Amyelois transitella | Pyraloidea; Pyralidae | Navel orangeworm | v1.0 | GCA_001186105.1 | 2015-07-22 | 406 | 7301 |
| Bicyclus anynana | Papilionoidea; Nymphalidae | Squinting bush brown | v1.2 | N/A | 2015-10-28 | 475 | 10 800 |
| Bombyx mori | Bombycoidea; Bombycidae | Silkworm | v1.0 | GCA_000151625.1 | 2008-04-28 | 482 | 43 463 |
| Calycopis cecrops | Papilionoidea; Lycaenidae | Red-banded hairstreak | v1.1 | GCA_001625245.1 | 2016-04-21 | 729 | 60 049 |
| Chilo suppressalis | Pyraloidea; Crambidae | Rice striped stem borer | v1.0 | GCA_000636095.1 | 2014-04-22 | 372 | 80 479 |
| Danaus plexippus | Papilionoidea; Nymphalidae | Monarch butterfly | v3.0 | N/A | 2012-11-07 | 249 | 5397 |
| Heliconius erato demophoon | Papilionoidea; Nymphalidae | Crimson-patched longwing | v1.0 | N/A | 2016-03-11 | 383 | 196 |
| Heliconius erato lativitta | Papilionoidea; Nymphalidae | Crimson-patched longwing | v1.0 | N/A | 2016-09-12 | 418 | 142 |
| Heliconius melpomene melpomene | Papilionoidea; Nymphalidae | Postman butterfly | v2.0 | N/A | 2015-07-17 | 275 | 795 |
| Junonia coenia | Papilionoidea; Nymphalidae | Buckeye | v1.0 | N/A | 2017-05-13 | 586 | 1136 |
| Lerema accius | Hesperioidea; Hesperiidae | Clouded skipper | v1.1 | GCA_001278395.1 | 2015-09-02 | 298 | 29 988 |
| Manduca sexta | Bombycoidea; Sphingidae | Tobacco hornworm | v1.0 | GCA_000262585.1 | 2012-05-24 | 419 | 20 871 |
| Melitaea cinxia | Papilionoidea; Nymphalidae | Glanville fritillary | v1.0 | GCA_000716385.1 | 2015-04 | 390 | 8261 |
| Operophtera brumata | Geometroidea; Geometridae | Winter moth | v1.0 | GCA_001266575.1 | 2015-08-11 | 638 | 25 801 |
| Papilio glaucus | Papilionoidea; Papilionidae | Eastern tiger swallowtail | v1.1 | GCA_000931545.1 | 2015-03-20 | 376 | 68 029 |

This article isThis article by protected copyright. All rightsreserved. Accepted pityocampa Thaumetopoea Article(rice strain) Spodopterafrugiperda (corn strain) Spodopterafrugiperda Plutellaxylostella Plodiainterpunctella Phoebis sennae Papilioxuthus Papiliopolytes Papiliomachaon

Yponomeutoidea; Plutellidae Papilionoidea; Papilionidae Papilionoidea; Papilionidae Papilionoidea; Papilionoidea; Pieridae Noctuoidea; Noctuidae Noctuoidea; Noctuidae Noctuoidea; Noctuidae Pyraloidea; Pyralidae

Papilionidae

Old worldswallowtail Pine processionnary Diamondback moth Common Mormon Cloudless sulphur Asian swallowtail Indian mealmoth Fall armyworm Fall armyworm swallowtail moth

v1.0 v1.0 v1.0 v1.0 v1.0 v3.1 v1.1 v1.0 v1.1

GCA_001298345.1 GCA_000836215.1 GCA_001298355.1 GCA_000330985.1 GCA_001586405.1 PRJNA344465 PRJEB13834 PRJEB13110 N/A

2017 2017 2014 2015 2016 2015 2015 2015 2018 ------09 09 10 04 03 09 02 09 - 01 ------25 25 02 03 10 28 02 28

537 371 438 393 382 345 243 227 278

68 292 29 127 41 577 10 542 20 800 15 362 63 186 1794 3873

This article isThis article by protected copyright. All rightsreserved. B Part Accepted frugiperdaS. (rice strain) frugiperdaS. (corn P. xylostella P. interpunctella P. sennae P. xuthus P. polytes P. machaon P. glaucus O. brumata M. cinxia M. sexta L. accius J. coenia melpomene H. melpomene eratoH. lativitta eratoH. demophoon Article D. plexippus C. suppressalis C. cecrops B. mori B. anynana A. transitella

Species

N50 (kb) 10 1270.7 3432.6 3672.3 1174.3 1571.1 2102.7 5483.8 4008.4 1587.0 737.2 256.7 230.3 119.3 664.0 525.3 715.6 233.5 638.3 65.6 28.5 52.8

5.2 689.0

N90 (kb) 1432.2 2670.1 930.4 261.6 273.1 160.5 18.7 19.6 22.4 13.6 29.6 46.4 60.3 61.1 99.3 45.4 154 6.4 3.5 1.1 2.0 2.4 4.8

complete (%) CEGMA 77.0 78.2 85.1 82.3 91.5 83.9 87.9 84.3 64.1 85.9 83.9 88.7 81.1 90.3 41.5 70.6 76.6 81.0 78.2 N/A N/A N/A N/A b

)

partial (%) CEGMA At leastAt 83.9 92.3 96.4 96.0 96.4 94.0 94.8 96.0 94.0 96.0 95.2 96.8 95.0 93.2 96.0 61.7 94.8 96.8 97.2 95.2 N/A N/A N/A b

)

GC% 36.0 36.0 38.3 35.1 33.0 34.1 34.0 33.8 35.4 38.6 32.6 35.3 34.4 34.5 32.8 33.5 33.2 31.6 35.7 37.1 37.7 36.5 35.7

14.4 12.5 10.4 14.5 N% 0.0 2.6 4.6 3.2 5.4 3.9 4.5 3.6 2.1 7.4 4.7 2.9 0.0 0.4 1.3 1.4 2.7 5.5 1.2

Gene count 26 329 21 700 19 386 23 136 16 117 15 322 12 244 15 497 15 692 16 912 16 751 15 17 411 19 234 20 102 14 613 13 676 15 130 10 117 16 456 15 488 22 642 15 208 451

a )

Transcriptcount 21 779 26 357 23 907 24 497 16 492 15 322 12 244 15 497 15 692 16 912 16 790 27 403 17 411 19 234 21 661 14 613 20 118 15 130 10 132 16 456 22 061 22 642 19 808

a )

Table 2, Part B (continued)

| Species | N50 (kb) | N90 (kb) | CEGMA complete (%) b) | CEGMA at least partial (%) b) | GC% | N% | Gene count a) | Transcript count a) |
|---|---|---|---|---|---|---|---|---|
| T. pityocampa | 163.6 | 2.0 | 23.4 (35.1) c) | 91.1 | 37.2 | 21.6 | 29 415 | 30 860 |

Table 3 Number of repeated elements found in the Tpit-SP v1 genome and corresponding length and percentage of genome.

| Family | Fragments | Total length [Mb] | % of genome |
|---|---|---|---|
| LTR | 702 636 | 105.3 | 19.6 |
| LINE | 364 879 | 58.9 | 11.0 |
| SINE | 331 115 | 50.4 | 9.4 |
| DNA | 174 994 | 28.7 | 5.3 |
| Total | 1 573 624 | 243.3 | 45.3 |
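The repeat percentages in Table 3 can be cross-checked against the total length of the large scaffolds. The sketch below assumes that "% of genome" is the family length divided by the 537 Mb scaffold length reported in Table 1; the variable and function names are hypothetical, and the four lengths are the per-family totals quoted in Table 3.

```python
# Assumption: the denominator is the 537 Mb total length of the large scaffolds
GENOME_SIZE_MB = 537.0

# Per-family total lengths [Mb] as quoted in Table 3
family_lengths_mb = [105.3, 58.9, 50.4, 28.7]


def percent_of_genome(length_mb, genome_mb=GENOME_SIZE_MB):
    """Share of the assembly covered by a repeat family, rounded to one decimal."""
    return round(100.0 * length_mb / genome_mb, 1)


family_percentages = [percent_of_genome(x) for x in family_lengths_mb]
total_length_mb = round(sum(family_lengths_mb), 1)
total_percentage = percent_of_genome(total_length_mb)

print(family_percentages)
print(total_length_mb, total_percentage)
```

Under this assumption the four family lengths add up to the reported total of 243.3 Mb, and each recomputed percentage agrees with the tabulated value to one decimal place.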

Accepted isThis article by protected copyright. All rightsreserved. Article scaffolds, numberscaffoldsof the assembly, the for Table 4

Characteristics of and andassembled BACs.Theestimated Characteristics the11sequenced contigs

Tpi21M24_S24_L001 Tpi21K05_S23_L001 Tpi21P07_S22_L001 Tpi21D23_S21_L001 Tpi21C18_S20_L001 Tpi21L11_S19_L001 Tpi21H19_S18_L001 Tpi21A08_S17_L001 Tpi21F03_S16_L001 Tpi21G16_S15_L001 Tpi21J02_S14_L001 PE BAC library

and genomic reads

Estimated BAC size . [kb]

128

68 95 65 75 50 70 55 85 90 40 , N50 lengths, BAC read coverages BAC read lengths, N50 ,

SSPACE Velvet+ [kb] 100.0 102.6 81.2 94.7 46.8 83.4 60.1 64.8 95.8 88.2 49.4

assembly (k

Best - mer) 181 171 181 181 171 161 181 181 171 171 181

Scaffold count

1 1 2 9 2 3 2 1 1 5 2

[kb] N50 81.2 94.7 42.5 14.6 63.8 36.3 98.1 64.8 95.8 55.9 26.4

and the % of BAC sequence lengths coveredby lengths sequence BACof % and the

(reads/bp) coverage Average assembled 1541 2148 1202 1297 1484 1456 1398 1553 1249 1386 921

scaffolds Aligned Tpi # t

- sizes are given, are asthek sizes aswell SP 22 32 11 19 21 30 15 23 24 15 5

scaffolds covered genome by by % 68.9 72.4 27.1 14.1 55.4 52.2 65.0 73.5 65.5 41.4 83 . 3

contigs genome by c % overed 83.2 82.5 60.5 33.6 71.3 74.6 78.8 84.4 78.0 67.4 94.8

reads/bp genome by covered % Tpit ≥ 92.0 86.1 84.3 71.1 81.9 86.7 91.2 89.7 84.3 88.2 92.6 - mer length used mer used length 10 -

SP v1 genome genome SPv1

Table 5 Characteristics of the various Tpit-SP transcriptome assemblies and the robust CDS subset (Aug2.2) of the coding genes predicted from the Tpit-SP v1 genome. a) Reference transcriptome obtained by CD-HIT clustering: only unigenes are present (compared to presence of alternative transcripts in the other transcriptomes). b) Including alternative transcripts. c) For Augustus predictions the CDS lengths were considered. d) N50: contig length for which half of the assembly size is represented by contigs of this size or longer.

| Characteristic | Reference a) | Aug2.2 | Sanger / 454 | SP HiSeq | SP MiSeq (MiSeq1+2) |
|---|---|---|---|---|---|
| Raw number of NGS reads | N/A | N/A | 467 082 | 559 510 744 | 44 533 734 |
| Read number after cleaning | N/A | N/A | 465 703 | 514 941 496 | 39 613 596 |
| Read number after FLASH | N/A | N/A | N/A | 457 553 121 | 30 669 751 |
| Read number after BBNORM | N/A | N/A | N/A | 119 171 986 | N/A |
| Mean NGS trimmed read length [bp] (min–max) | N/A | N/A | 515 (334–593) | 91.1 (30–168) | 273.4 (30–568) |
| Number of Sanger reads | N/A | N/A | 5290 | N/A | N/A |
| Mean Sanger trimmed read length [bp] (min–max) | N/A | N/A | 308 (220–1832) | N/A | N/A |
| Size of transcriptome [Mb] | 67.7 | 3.1 | 9.4 | 128.3 | 151.9 |
| Number of mapped cleaned (and FLASH + BBNORM [only HiSeq]) reads | N/A | N/A | 309 200 | 102 164 686 | 21 300 953 |
| Coverage (mean read count per bp) | N/A | N/A | 17.1 | 72.4 | 37.3 |
| Number of unigenes | 29 701 | 8232 | 6696 | 31 648 | 22 412 |
| Number of transcripts | 29 701 | 8232 | 8119 | 62 376 | 63 175 |
| Mean transcript length [bp] | 2279 | 378 c) | 1156 b) | 2057 b) | 2404 b) |
| Median transcript length [bp] | 1564 | 294 c) | 1001 b) | 1130 b) | 1504 b) |
| Transcript N50 length [bp] d) | 3632 | 420 c) | 1386 b) | 4177 b) | 3930 b) |
| Located on Tpit-SP v1 genome [count] (%) | 17 149 (57.7) | 8232 (100) | 4196 (51.7) | 31 845 (51.1) | 36 476 (57.7) |
| Split on two Tpit-SP v1 scaffolds [count] (%) | 9604 (32.3) | 0 (0) | 3100 (38.2) | 22 824 (36.6) | 19 854 (31.4) |
| CEGMA identified [%] (count of 248) | 99.2 (246) | N/A | 66.9 (166) | 96.0 (238) | 95.2 (236) |

Table 5 (continued)

| Characteristic | Reference | Aug2.2 | Sanger / 454 | SP HiSeq | SP MiSeq (MiSeq1+2) |
|---|---|---|---|---|---|
| CEGMA full-length [%] (count of 248) | 96.0 (238) | N/A | 60.5 (150) | 89.5 (222) | 82.3 (204) |
| BUSCO euk identified [%] (count of 429) | 88.1 (378) | N/A | 52.4 (225) | 86.9 (373) | 86.0 (369) |
| BUSCO euk full-length [%] (count of 429) | 81.6 (350) | N/A | 43.1 (185) | 80.0 (343) | 72.3 (310) |
| BUSCO arthropod identified [%] (count of 2675) | 83.6 (2236) | N/A | 31.5 (843) | 81.7 (2186) | 80.0 (2141) |
| BUSCO arthropod full-length [%] (count of 2675) | 72.4 (1938) | N/A | 25.9 (692) | 70.4 (1882) | 60.5 (1618) |
| Transcripts with FrameDP peptide [count] (%) | 29 701 (100) | 6486 (78.8) | 5830 (71.8) | 31 415 (50.4) | 36 547 (57.9) |
| Transcripts with FrameDP full-length peptides [count] (%) | 20 465 (68.9) | 2655 (32.3) | 3987 (49.1) | 24 651 (39.5) | 27 314 (43.2) |
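The completeness percentages in Table 5 follow directly from the counts given in brackets, and the coverage values can be approximated from the mapped read numbers. The sketch below is illustrative only: the function names are hypothetical, and the coverage formula (mapped reads times mean read length, divided by transcriptome size) is an assumption about how "mean read count per bp" relates to the other rows, not a documented step of the analysis.

```python
def completeness_percent(found, total):
    """Percentage of a core gene set (CEGMA or BUSCO) that was recovered."""
    return round(100.0 * found / total, 1)


def mean_coverage(mapped_reads, mean_read_length_bp, transcriptome_size_mb):
    """Approximate per-base coverage; assumes every mapped read aligns over its full length."""
    return mapped_reads * mean_read_length_bp / (transcriptome_size_mb * 1e6)


# Reference transcriptome counts quoted in Table 5
cegma_full = completeness_percent(238, 248)        # CEGMA set of 248 genes
busco_euk = completeness_percent(378, 429)         # BUSCO eukaryote set of 429 genes
busco_arthropod = completeness_percent(2236, 2675)  # BUSCO arthropod set of 2675 genes

# HiSeq assembly: mapped reads, mean trimmed read length [bp], size [Mb] from Table 5
hiseq_coverage = mean_coverage(102_164_686, 91.1, 128.3)

print(cegma_full, busco_euk, busco_arthropod)
print(round(hiseq_coverage, 1))
```

With these inputs the recomputed percentages match the tabulated ones, and the approximate HiSeq coverage falls close to the reported value of 72.4, which suggests the assumed formula is at least a reasonable reading of the table.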
