
GigaScience

Chromosomal-level reference genome of Chinese peacock butterfly (Papilio bianor) based on third-generation DNA sequencing and Hi-C analysis
--Manuscript Draft--

Manuscript Number: GIGA-D-19-00120
Full Title: Chromosomal-level reference genome of Chinese peacock butterfly (Papilio bianor) based on third-generation DNA sequencing and Hi-C analysis
Article Type: Data Note

Funding Information:
National Natural Science Foundation of China (31621062): Dr. Wen Wang
Chinese Academy of Sciences (XDB13000000): Dr. Wen Wang
CAS “Light of West China”: Dr. Xueyan Li

Abstract: Background

Papilio bianor Cramer, 1777 (i.e. Chinese peacock) (Insecta, Lepidoptera, Papilionidae) is a widely distributed swallowtail butterfly with a large number of geographic populations from the Southeast of Russia to China, Japan, India, Vietnam, and Thailand. Its wing color consists of both pigmentary colored scales (black, reddish) and structural colored scales (iridescent blue or green dust). A high-quality reference genome of P. bianor is thus important for investigating iridescent color evolution, phylogeography, and the evolution of swallowtail butterflies.

Findings

Here, we obtained a chromosome-level de novo genome assembly of the highly heterozygous Chinese peacock (Papilio bianor) (heterozygosity 1.81%) using long Pacific Biosciences (PacBio) sequencing reads (43.19 Gb) and high-throughput chromosome conformation capture (Hi-C) technology. The final assembly is 402.00 Mb on 30 chromosomes (29 autosomes and 1 sex chromosome, W) with a contig N50 of 5.50 Mb and a scaffold N50 of 12.51 Mb. In total, 15,375 protein-coding genes and 222.29 Mb (55.30%) of repetitive sequences were identified. Phylogenetic trees of representative butterfly and moth species, constructed from one-to-one single-copy orthologous genes, indicate that the Chinese peacock separated from a common ancestor of swallowtails about 23.69-36.04 million years ago (mya). The demographic history inferred by Pairwise Sequentially Markovian Coalescent (PSMC) analysis suggested that the population expansion of this species from the last interglacial period to the last glacial maximum possibly resulted from a decrease in its natural enemies and its adaptation to climate diversity during the glacial period.

Conclusions

We present a high-quality chromosome-level reference genome of the Chinese peacock (Papilio bianor) using long-read single-molecule sequencing and Hi-C-based chromatin interaction maps. Our results lay the foundation for exploring the genetic basis of special biological features of the Chinese peacock butterfly, and also provide a useful data source for comparative genomics and phylogenomics among butterflies and moths.

Corresponding Author: Xueyan Li, Ph.D

CHINA
Corresponding Author Secondary Information:
Corresponding Author's Institution:
Corresponding Author's Secondary Institution:

Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation

First Author: Xueyan Li, Ph.D
First Author Secondary Information:
Order of Authors: Xueyan Li, Ph.D; Sihan Lu, Ph.D; Jie Yang; Xuelei Dai; Feiang Xie; Jinwu He; Zhiwei Dong; Junlai Mao; Guichun Liu; Zhou Chang; Ruoping Zhao; Wenting Wan; Ru Zhang; Wen Wang
Order of Authors Secondary Information:
Additional Information:
Question Response

Are you submitting this manuscript to a special series or article collection? No

Experimental design and statistics Yes

Full details of the experimental design and statistical methods used should be given in the Methods section, as detailed in our Minimum Standards Reporting Checklist. Information essential to interpreting the data presented should be made available in the figure legends.

Have you included all the information requested in your manuscript?

Resources Yes

A description of all resources used, including antibodies, cell lines, and software tools, with enough information to allow them to be uniquely identified, should be included in the Methods section. Authors are strongly encouraged to cite Research Resource Identifiers (RRIDs) for antibodies, model organisms and tools, where possible.

Have you included the information requested as detailed in our Minimum Standards Reporting Checklist?

Availability of data and materials Yes

All datasets and code on which the conclusions of the paper rely must be either included in your submission or deposited in publicly available repositories (where available and ethically appropriate), referencing such data using a unique identifier in the references and in the “Availability of Data and Materials” section of your manuscript.

Have you met the above requirement as detailed in our Minimum Standards Reporting Checklist?


Chromosomal-level reference genome of Chinese peacock butterfly (Papilio bianor) based on third-generation DNA sequencing and Hi-C analysis

Sihan Lu1,2,†, Jie Yang1,†, Xuelei Dai3,†, Feiang Xie4,†, Jinwu He1, Zhiwei Dong2, Junlai Mao4, Guichun Liu1,2, Zhou Chang2, Ruoping Zhao2, Wenting Wan1, Ru Zhang1, Wen Wang2,5,*,#, Xueyan Li2,*

1 Center for Ecological and Environmental Sciences, Northwestern Polytechnical University, Xi’an, Shaanxi 710072, China.
2 State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223, China.
3 Key Laboratory of Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
4 School of Marine Science and Technology, Zhejiang Ocean University, Zhoushan, Zhejiang 316022, China
5 Center for Excellence in Animal Evolution and Genetics, Kunming, Yunnan 650223, China

†These authors contributed equally to this work.
*Correspondence should be addressed to L.X.Y ([email protected]), W.W ([email protected]).
#Current address: Center for Ecological and Environmental Sciences, Northwestern Polytechnical University, Xi’an, Shaanxi 710072, China

Abstract

Background: Papilio bianor Cramer, 1777 (i.e. Chinese peacock) (Insecta, Lepidoptera, Papilionidae) is a widely distributed swallowtail butterfly with a large number of geographic populations from the Southeast of Russia to China, Japan, India, Vietnam, Myanmar and Thailand. Its wing color consists of both pigmentary colored scales (black, reddish) and structural colored scales (iridescent blue or green dust). A high-quality reference genome of P. bianor is thus important for investigating iridescent color evolution, phylogeography, and the evolution of swallowtail butterflies.

Findings: We obtained a chromosome-level de novo genome assembly of the highly heterozygous Papilio bianor (heterozygosity 1.81%) using long Pacific Biosciences (PacBio) sequencing reads and high-throughput chromosome conformation capture (Hi-C) technology.
The final assembly is 402.00 Mb on 30 chromosomes (29 autosomes and 1 sex chromosome, W) with a scaffold N50 of 12.51 Mb. In total, 15,375 protein-coding genes and 222.29 Mb of repetitive sequences were identified. The phylogenetic trees indicated that P. bianor separated from a common ancestor of swallowtails about 23.69-36.04 million years ago. The inferred demographic history suggested that the population expansion of this species from the last interglacial period to the last glacial maximum possibly resulted from a decrease in its natural enemies and its adaptation to climate diversity during the glacial period.

Conclusions: We present a high-quality chromosome-level reference genome of Papilio bianor using long-read single-molecule sequencing and Hi-C-based chromatin interaction maps. Our results lay the foundation for exploring the genetic basis of special biological features of P. bianor, and also provide a useful data source for comparative genomics and phylogenomics among butterflies and moths.

Keywords: Papilio bianor; single-molecule real-time (SMRT) sequencing; high-throughput chromosome conformation capture (Hi-C) map; chromosome-level reference genome; butterfly.

Background information

Butterflies are among the most charming animals, especially for their extraordinarily diverse wing patterns among species, populations, sexes, and even seasonal forms [1-3]. They also have many other intriguing traits, such as complex life cycles, diverse larval morphology and habits, and high species diversity [4].
Thus, butterflies have been regarded as one of the most important model organisms in fields ranging from morphology, physiology, ecology, development and genetics to evolutionary biology [4-6] since Darwin proposed his theory of natural selection in 1859 [7]. Back in 1864, Bates, the famous inventor of mimicry theory, predicted that “the study of butterflies…will someday be valued as one of the most important branches of Biological science.” [8]. Now that it is feasible to dissect the heterozygous genomes of wild organisms such as butterflies and to manipulate them genetically [9-11], butterflies are becoming a promising system for exploring genetics, evolution, morphological diversification and speciation.

Although more than 18,000 butterfly species have been described [12], reference genomes are available for only 37 butterfly species in 6 families, including five swallowtails (Papilionidae) [9, 13-30]. Among them, chromosomal-level reference genomes have been assembled only for two nymphalids (Heliconius melpomene and Melitaea cinxia) and one swallowtail (Papilio xuthus) [9, 24, 25], using the linkage map method. Chromosomal-level reference genomes for more butterflies are not only indispensable for identifying the subtle genetic variations underpinning morphological traits, which often result from small mutations in regulatory elements [31, 32], but will also provide a unique opportunity to promote evolutionary biological studies of the famous butterfly system.

The development of third-generation single-molecule sequencing technology has paved the way to dissect the complex genomes of different kinds of wild organisms, including butterflies [25, 28, 30, 33, 34].
In combination with high-throughput chromosome conformation capture (Hi-C) technology, which is mainly used to identify chromatin interactions across the entire genome and is now also a powerful tool for assisting genome assembly [35], single-molecule sequencing has yielded chromosomal-level reference genomes for several organisms, including insects such as fruit fly [36], mosquito [37] and moth [38, 39]. Nevertheless, no chromosomal-level reference genome assembled by combining single-molecule sequencing and Hi-C technologies has so far been reported for butterflies.

Papilio bianor Cramer, 1777 (Papilionidae, Papilioninae, Papilionini) (Fig. 1a), also known as the Chinese peacock black swallowtail emerald or Chinese peacock, is a widely distributed swallowtail butterfly with a large range of geographic populations from the Southeast of Russia to China, Japan, India, Vietnam, Myanmar and Thailand [40-42]. Its larvae mainly feed on plants of Rutaceae such as Citrus reticulata, Euodia meliifolia and Zanthoxylum bungeanum [40, 43, 44], and its complete life cycle takes 40 to 50 days. Its wing colors consist of both pigmentary colored scales (black, reddish) and structural colored scales (iridescent blue or green dust) [44], which makes it a promising model for exploring the origin and evolution of combined colors in insects. Scientific interest in P. bianor has long existed, for example in its prothoracicotropic hormones (PTTHs) [45], oviposition behavior [43, 46, 47], phylogenetic position and species delimitation [48-52], chromosome numbers [53] and mitochondrial genome [49, 54]. Here, combining SMRT and Hi-C technologies, we constructed the chromosome-level reference genome of P. bianor (30 chromosomes), which is the fourth chromosomal-level reference genome in butterflies.
Data Description

Collection and breeding

Wild eggs of P. bianor were collected in the northern suburb of Kunming city (Yunnan, China), and then reared at 26 ℃ and 80% relative humidity with a 16 h/8 h light/dark cycle. The hatched larvae were fed with the Rutaceous plant Zanthoxylum piperitum under the same conditions. Two 5th instar larvae were collected for Hi-C sequencing. The remaining individuals were reared under the same conditions as the eggs until their eclosion. Adults were collected for the genome survey using the Illumina platform and for de novo genome sequencing using the PacBio platform.

Genome survey using Illumina sequencing technology

Genomic DNA was isolated from the thorax and abdomen of a single male adult using a Gentra Puregene Blood kit (Qiagen, Germany) following the manufacturer's instructions. Paired-end libraries of two different insert sizes (150 bp and 500 bp) were constructed and sequenced on an Illumina HiSeq2000 platform at BGI (Shenzhen, China). The total sequencing output was approximately 16.45 Gb for PE150 and 28.42 Gb for PE500 (Table S1). We estimated the genome size from the Illumina short reads (PE150 and PE500) by k-mer distribution analysis with k = 17, using the formula G = k-mer_number / k-mer_depth [55]. Our data indicate that P. bianor has an estimated genome size of 473.07 Mb and a high heterozygosity of 1.81% (Fig. S1 & Table S2).

Library construction and sequencing using SMRT and Hi-C technologies

Genomic DNA was extracted from the thorax and abdomen of another male adult and used to construct one 20-kb library for the PacBio platform according to the manufacturer's protocols (NextOmics, China).
With ten single-molecule real-time (SMRT) cells on the PacBio RSII platform, we generated 43.19 Gb of subreads with an average read length of 16.4 kb after removing adaptor sequences (Table S1). The long subreads were used for de novo genome assembly of P. bianor.

A sample pooled from the whole bodies of two male fifth-instar larvae was used to construct the library for Hi-C sequencing, following a method similar to that of a previous study [35]. A 400-700 bp library was sequenced on the Illumina HiSeq X Ten platform in 150-bp paired-end mode, yielding ~75.11 Gb of raw reads (Table S1).

Chromosomal-level genome assembly

Considering the high heterozygosity of P. bianor (1.81%; Fig. S1 & Table S2), we first performed a PacBio-only assembly using Wtdbg (v1.2.8; with --tidy-reads 5000 -k 0 -p 17 -S 1) [56], a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies that is based on the fuzzy Bruijn graph (FBG) algorithm. Second, to reduce the effect of the high error rate of the PacBio long reads, we polished the PacBio-only assembly using Illumina reads as follows: all the Illumina reads were mapped to the PacBio-only assembly with BWA-MEM [57], and the assembly was then corrected with two rounds of Pilon (v1.21) [58, 59].
Third, because the polished assembly still contained a number of shorter contigs with significantly lower coverage, which perhaps represent highly heterozygous regions that were not merged with the equivalent segments of the homologous chromosomes, we used a looser identity cutoff (> 90%) to merge the contigs with lower coverage and smaller size (size < 1000 bp and coverage < 50, or size < 10000 bp and coverage < 35) into the longer contigs, as previously reported [14]. Fourth, the raw reads generated by Hi-C sequencing were mapped to the polished assembly using the Juicer [60] and 3D de novo assembly [37] software to improve the assembly. Approximately 90.50% of contigs were anchored onto 30 super-scaffolds (Fig. 1b & Table S3), which probably correspond to the 30 chromosomes reported by cytogenetic karyotyping [53]. Finally, we obtained a high-quality chromosomal-level assembly of P. bianor with a total length of ~402.00 Mb and the longest scaffold N50 (12.51 Mb) among the published butterfly genomes (Table 1 & Table S4). The assembled genome accounts for 85% of the genome size (473.07 Mb) estimated by the k-mer distribution analysis (Table S2).

Quality evaluation of assembled genome

The quality of the assembly was evaluated using three methods. First, the completeness of the assembly was evaluated with the Benchmarking Universal Single-Copy Orthologs (BUSCO) software (version 2.0; BUSCO, RRID:SCR_015008) [61]. The BUSCO results showed that the P. bianor assembly covered 96.90% of the core genes, with 96.30% of the core genes complete (Table S5), values similar to those of published high-quality butterfly genomes (Table 1).
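For readers less familiar with the contiguity figures quoted in this section, the following is a minimal sketch of how an N50 value and the assembled fraction of the estimated genome size can be computed. The function name `n50` and the toy scaffold lengths are hypothetical illustrations and are not part of the pipeline described here; only the two sizes 402.00 Mb and 473.07 Mb are taken from the text.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical values, not P. bianor data).
toy_scaffolds = [10, 8, 5, 3, 2]
toy_n50 = n50(toy_scaffolds)

# Assembled fraction, using the sizes reported in the text (in Mb).
assembly_mb = 402.00
estimated_mb = 473.07
assembled_fraction = assembly_mb / estimated_mb

print(f"toy N50 = {toy_n50}")
print(f"assembled fraction = {assembled_fraction:.2%}")
```

Applied to the real contig or scaffold lengths of an assembly, the same function would give the contig N50 or scaffold N50, respectively.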
Second, we checked the mapping rates of the Illumina and PacBio reads to the P. bianor assembly using BWA [57] and BLASR [62], respectively, and found high mapping rates of 96.31% and 96.86% (Table S6 & Table S7). Third, we compared the syntenic relationships between the genomes of P. bianor and P. xuthus (Fig. 1c) and found that 94.96% of the P. bianor assembled genome sequences can be aligned (1:1) to the P. xuthus reference genome. All these results suggest that the assembled P. bianor genome is of high quality in terms of completeness, continuity and base-level accuracy (Table 1).

Genome annotation

Repetitive sequences, including tandem repeats and transposable elements (TEs), were searched for in the P. bianor assembled genome. First, we used Tandem Repeats Finder (version 4.07b; with parameters 2 7 7 80 10 50 2000 -d -h) [63] to annotate the tandem repeats. Then, TEs were identified using a combination of de novo and homology-based approaches at both the DNA and protein levels. At the DNA level, we used RepeatModeler (version 1.0.4; RepeatModeler, RRID:SCR_015027) [64] to construct a de novo repeat library, which provided a repeat consensus database with classification information, and then used RepeatMasker (version 4.0.5) [65] to search for similar TEs against the known Repbase TE library (version 16.02) [66] and the de novo repeat library. We also used LTR_FINDER (LTR Finder, RRID:SCR_015247) [67] to find long terminal repeats. At the protein level, RepeatProteinMask [65] was used to search the assembled P. bianor genome against the TE protein database using a WU-BLASTX engine. Finally, we identified and masked 55.30% of the P. bianor assembly as repeat regions (Table S8), the highest proportion among published butterfly genomes (Table 1). Among all TEs, the most abundant class of repetitive elements is long interspersed nuclear elements (LINEs, 14.22%), followed by DNA transposons (8.81%) (Table S9). Compared with the reference genomes of other swallowtail butterflies, LINEs, DNA transposons and long terminal repeat (LTR) elements have expanded in the P. bianor genome (Fig. 2a).

To annotate the protein-coding genes of P. bianor, we used both de novo and homology-based gene prediction approaches. For de novo gene prediction, the repeat-masked genome was analyzed with SNAP (version 2006-07-28) [68], GENSCAN (version 1.0) [69], glimmerHMM (version 3.0.3) [70], and AUGUSTUS (version 2.5.5; Augustus: Gene Prediction, RRID:SCR_008417) [71]. For homology-based prediction, the protein sequences from eight insects, including the beetle Tribolium castaneum [72], the fruit fly Drosophila melanogaster [73], the silkworm Bombyx mori [74], the moth Helicoverpa armigera [75], and four butterflies, Papilio polytes [23], Papilio xuthus [9], Heliconius melpomene [24] and Danaus plexippus [20], were used as templates. We used TBLASTN [76] with an E-value cutoff of 1e-5 to align the protein sequences of the reference gene set to the P. bianor genome, and GeneWise (v2.2.0) [77] to perform more precise alignment. Gene sequences with length < 150 bp or percent identity < 25% were removed. EvidenceModeler (EVM, version 1.1.1) [78] was used to integrate the genes predicted by the homology-based and de novo approaches and to generate a comprehensive, non-redundant gene set.
Finally, 15,375 protein-coding genes were annotated in the assembled P. bianor genome (Table S10), a number similar to those of the published reference genomes of other swallowtail butterflies (Fig. S2).

The KEGG, TrEMBL, SwissProt and COG databases were searched for best matches to the P. bianor protein sequences yielded by EVM, using BLASTP (version 2.2.26) with an E-value cutoff of 1e-5, and the Pfam, PRINTS, ProDom and SMART databases were searched for known motifs and domains in our sequences using InterProScan (version 5.18-57.0; InterProScan, RRID:SCR_005829) [79]. We also searched all predicted gene sequences against the GenBank non-redundant protein (nr) database using BLASTN (RRID:SCR_001598) with a maximal E-value of 1e-5. In sum, 13,343 genes were annotated with at least 1 related function, accounting for about 86.78% of the P. bianor annotated genes (Table S11).

Gene family identification and phylogenetic analysis

We used OrthoMCL (version 2.0.9; OrthoMCL DB: Ortholog Groups of Protein Sequences, RRID:SCR_007839) [80] to cluster the P. bianor annotated genes, with an E-value cutoff of 1e-5 and Markov Chain Clustering with the default inflation parameter, in an all-to-all BLASTP analysis of entries for the reference genomes of six swallowtail butterflies: P. bianor from this study and the five others published so far (P. polytes, P. xuthus, P. machaon, P. glaucus, and P. memnon). The result showed that 293 gene families were specific to P. bianor (Fig. 2b). Using Computational Analysis of gene Family Evolution (CAFE; version 4.0.1) [81], we also identified 375 expanded gene families and 1863 contracted gene families in P. bianor. The P. bianor expanded gene families were enriched in 17 GO categories and the contracted gene families were enriched in 14 GO categories, most of which are related to oxygen metabolism (Table S12 & Table S13).

To reveal the phylogenetic position of P. bianor among Papilionoidea, we selected 14 butterfly species in six families (Papilionidae (6, including P. bianor): Papilio xuthus, Papilio polytes, Papilio machaon, Papilio glaucus, Papilio memnon; Hesperiidae (1): Lerema accius; Pieridae (2): Phoebis sennae, Pieris rapae; Nymphalidae (2): Bicyclus anynana, Heliconius melpomene; Riodinidae (2): Calephelis nemesis, Calephelis virginiensis; Lycaenidae (1): Calycopis cecrops) [9, 13-15, 17, 21, 23, 24, 26-28], with 2 moths (Bombyx mori [74], Helicoverpa armigera [75]) as outgroups for phylogenetic analysis. A total of 1378 one-to-one single-copy orthologs were identified from these 14 species, and their nucleic acid sequences were aligned using PRANK (version 3.8.31) [82] to construct the phylogenetic trees with RAxML (version 7.2.8; RAxML, RRID:SCR_006086) [83] under the GTR+G+I model. The phylogeny was further analyzed with the PAML MCMCtree program (version 4.5; PAML, RRID:SCR_014932) [84], calibrated with published timings for the divergence of different species [85]. Our phylogenetic tree showed that P. bianor clusters at the base of P. machaon and P. xuthus and diverged from them 23 million years ago (mya); all Papilio species form a monophyletic group that diverged from other butterflies approximately 41.07-56.86 mya (Fig. 2c). This tree is largely consistent with those constructed from cytochrome oxidase I (COI), cytochrome oxidase II (COII) and elongation factor 1α (EF-1α) [86, 87], and from 425 loci from two outgroups and 173 species of butterflies [88].
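The selection of one-to-one single-copy orthologs used for the phylogeny above can be illustrated with a small sketch: from a set of gene families, keep only those containing exactly one gene from every species. This is an assumption-laden illustration, not the authors' script; the function name `single_copy_families`, the input layout and the toy families are all hypothetical and do not reproduce the OrthoMCL output format.

```python
from collections import Counter


def single_copy_families(families, species):
    """Return IDs of families holding exactly one gene from every species.

    families: dict mapping a family ID to a list of (species, gene_id) pairs
              (hypothetical layout chosen for this sketch).
    species:  iterable of species names that must each appear exactly once.
    """
    required = set(species)
    selected = []
    for family_id, members in families.items():
        counts = Counter(sp for sp, _gene in members)
        # Keep the family only if every required species is present
        # exactly once and no other species occurs.
        if set(counts) == required and all(n == 1 for n in counts.values()):
            selected.append(family_id)
    return selected


# Toy input (hypothetical gene IDs, three species only).
toy_species = ["P_bianor", "P_xuthus", "B_mori"]
toy_families = {
    # one gene per species: kept
    "fam1": [("P_bianor", "g1"), ("P_xuthus", "g2"), ("B_mori", "g3")],
    # two P_bianor genes: dropped
    "fam2": [("P_bianor", "g4"), ("P_bianor", "g5"),
             ("P_xuthus", "g6"), ("B_mori", "g7")],
    # B_mori missing: dropped
    "fam3": [("P_bianor", "g8"), ("P_xuthus", "g9")],
}
kept = single_copy_families(toy_families, toy_species)
print(kept)
```

In a real analysis the sequences of each retained family would then be aligned and passed to tree inference, as described in the text.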
We also inferred the demographic history of P. bianor with the Pairwise Sequentially Markovian Coalescent (PSMC) method [89], based on SNPs called from Illumina short reads mapped to the assembled genome (mutation rate of 0.1 × 10^-8 per site per generation, calculated by r8s [90]; three or four generations per year [47]). Our result suggested that the effective population size increased significantly from the last interglacial period (LIG, approximately 0.1 million years before present) to its maximum at the last glacial maximum (LGM, approximately 0.01 million years before present) (Fig. 2d). We infer that the population expansion of this species possibly resulted from a decrease in its natural enemies (e.g. birds or lizards) and from its adaptation to diverse climate environments between the LIG and the LGM.

Conclusion

We present the chromosomal-level genome assembly of P. bianor, with contig and scaffold N50 values of 5.50 Mb and 12.51 Mb, respectively. The assembled genome includes 15,375 protein-coding genes, 293 species-specific gene families, 375 expanded gene families and 1863 contracted gene families. P. bianor diverged from other Papilio species approximately 23.69-36.04 mya. Our results also show that the effective population size of P. bianor increased significantly during the glacial period. Our results lay the foundation for exploring the special biological features of the Chinese peacock butterfly, and also provide a useful data source for comparative genomics and phylogenomics among butterflies and other Lepidoptera.

Availability of supporting data

The raw reads have been deposited at NCBI in the Sequence Read Archive (SRA) under BioProject Number: PRJNA530186.
The chromosome-level genome, annotation, and other supporting data are also available via the GigaScience database, GigaDB.

Abbreviations

bp: base pair; kb: kilo base; Mb: mega base; Gb: giga base; PE: paired-end; BUSCO: Benchmarking Universal Single-Copy Orthologs; TE: transposable element; GO: gene ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes.

Competing interests

The authors declare that they have no competing interests.

Author contributions

X.L. and W.W. conceived and supervised the study. J.H., Z.D., Z.C. and G.L. reared and collected the samples. G.L. and J.H. extracted the genomic DNA. S.L. and X.D. assembled the genome. S.L., J.Y. and F.X. carried out the quality assessment, repeat annotation, and gene annotation. J.Y., F.X. and J.M. carried out the evolutionary analyses. S.L. uploaded the raw read data, genome assembly, and annotation to the GenBank and GigaScience (GigaDB) databases. S.L., X.L. and W.W. wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by grants from the National Natural Science Foundation of China (No. 31621062) (to WW), the Chinese Academy of Sciences (XDB13000000) (to WW), and CAS “Light of West China” (to LXY).

References

1. Boggs CL, Watt WB and Ehrlich PR. Butterflies: ecology and evolution taking flight. University of Chicago Press; 2003.
2. Joron M and Mallet JLB. Diversity in mimicry: paradox or paradigm? Trends in ecology & evolution. 1998;13 11:461-6. doi:10.1016/S0169-5347(98)01483-9.
3. Nijhout HF. The development and evolution of butterfly wing patterns. Smithson Inst. 1991;293.
4. Heikkila M, Kaila L, Mutanen M, Pena C and Wahlberg N. Cretaceous origin and repeated tertiary diversification of the redefined butterflies. Proceedings Biological sciences. 2012;279 1731:1093-9. doi:10.1098/rspb.2011.1430.
5. Kawahara AY and Breinholt JW. Phylogenomics provides strong evidence for relationships of butterflies and moths. Proceedings Biological sciences. 2014;281 1788:20140970. doi:10.1098/rspb.2014.0970.
6. Mitter C, Davis DR and Cummings MP. Phylogeny and Evolution of Lepidoptera. Annual review of entomology. 2017;62:265-83. doi:10.1146/annurev-ento-031616-035125.
7. Darwin C. The Origin of Species; And, the Descent of Man. Modern library; 1859.
8. Bates H. New species of butterflies from Guatemala and Panama, collected by Osbert Salvin and F. du Cane Godman, Esqs. Entomologist's monthly Magazine. 1864;1 1/7:1-164.
9. Li X, Fan D, Zhang W, Liu G, Zhang L, Zhao L, et al. Outbred genome sequencing and CRISPR/Cas9 gene editing in butterflies. Nature communications. 2015;6:8212. doi:10.1038/ncomms9212.
10. Zhang LL and Reed RD. Genome editing in butterflies reveals that spalt promotes and Distal-less represses eyespot colour patterns. Nature communications. 2016;7 doi:10.1038/Ncomms11769.
11. Markert MJ, Zhang Y, Enuameh MS, Reppert SM, Wolfe SA and Merlin C. Genomic Access to Monarch Migration Using TALEN and CRISPR/Cas9-Mediated Targeted Mutagenesis. G3-Genes Genom Genet. 2016;6 4:905-15. doi:10.1534/g3.116.027029.
12. van Nieukerken EJ, Kaila L, Kitching IJ, Kristensen NP, Lees D, Minet J, et al. Order Lepidoptera Linnaeus, 1758. Zootaxa. 2011;3148:212-21.
13. Cong Q, Borek D, Otwinowski Z and Grishin NV. Skipper genome sheds light on unique phenotypic traits and phylogeny. BMC genomics. 2015;16:639. doi:10.1186/s12864-015-1846-0.
14. Cong Q, Borek D, Otwinowski Z and Grishin NV. Tiger Swallowtail Genome Reveals Mechanisms for Speciation and Caterpillar Chemical Defense. Cell reports. 2015;10 6:910-9. doi:10.1016/j.celrep.2015.01.026.
15. Shen J, Cong Q, Kinch LN, Borek D, Otwinowski Z and Grishin NV. Complete genome of Pieris rapae, a resilient alien, a cabbage pest, and a source of anti-cancer proteins. F1000Res. 2016;5:2631. doi:10.12688/f1000research.9765.1.
16. Cong Q, Li W, Borek D, Otwinowski Z and Grishin NV. The Bear Giant-Skipper genome suggests genetic adaptations to living inside yucca roots. Molecular genetics and genomics : MGG. 2018; doi:10.1007/s00438-018-1494-6.
17. Iijima T, Kajitani R, Komata S, Lin CP, Sota T, Itoh T, et al. Parallel evolution of Batesian mimicry supergene in two Papilio butterflies, P. polytes and P. memnon. Science advances. 2018;4 4 doi:10.1126/sciadv.aao5416.
18. Zhan S, Merlin C, Boore JL and Reppert SM. The monarch butterfly genome yields insights into long-distance migration. Cell. 2011;147 5:1171-85. doi:10.1016/j.cell.2011.09.052.
19. Hill JA, Neethiraj R, Rastas P, Clark N, Morehouse N, de la Paz Celorio-Mancera M, et al. A butterfly chromonome reveals selection dynamics during extensive and cryptic chromosomal reshuffling. bioRxiv. 2018:233700.
20. Zhan S, Zhang W, Niitepold K, Hsu J, Haeger JF, Zalucki MP, et al. The genetics of monarch butterfly migration and warning colouration. Nature. 2014;514 7522:317-21. doi:10.1038/nature13812.
21. Cong Q, Shen JH, Warren AD, Borek D, Otwinowski Z and Grishin NV. Speciation in Cloudless Sulphurs Gleaned from Complete Genomes. Genome biology and evolution. 2016;8 3:915-31. doi:10.1093/gbe/evw045.
22. Talla V, Suh A, Kalsoom F, Dinca V, Vila R, Friberg M, et al. Rapid Increase in Genome Size as a Consequence of Transposable Element Hyperactivity in Wood-White (Leptidea) Butterflies. Genome biology and evolution. 2017;9 10:2491-505. doi:10.1093/gbe/evx163.
23. Nishikawa H, Iijima T, Kajitani R, Yamaguchi J, Ando T, Suzuki Y, et al. A genetic mechanism for female-limited Batesian mimicry in Papilio butterfly. Nature genetics. 2015;47 4:405-U169. doi:10.1038/ng.3241.
24. Dasmahapatra KK, Walters JR, Briscoe AD, Davey JW, Whibley A, Nadeau NJ, et al. Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature. 2012;487 7405:94-8. doi:10.1038/nature11041.
25. Ahola V, Lehtonen R, Somervuo P, Salmela L, Koskinen P, Rastas P, et al. The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nature communications. 2014;5 doi:10.1038/Ncomms5737.
26. Cong Q, Shen JH, Borek D, Robbins RK, Otwinowski Z and Grishin NV. Complete genomes of Hairstreak butterflies, their speciation, and nucleo-mitochondrial incongruence. Scientific reports. 2016;6 doi:10.1038/Srep24863.
27. Cong Q, Shen JH, Li WL, Borek D, Otwinowski Z and Grishin NV. The first complete genomes of Metalmarks and the classification of butterfly families. Genomics. 2017;109 5-6:485-93. doi:10.1016/j.ygeno.2017.07.006.
28. Nowell RW, Elsworth B, Oostra V, Zwaan BJ, Wheat CW, Saastamoinen M, et al. A high-coverage draft genome of the mycalesine butterfly Bicyclus anynana. GigaScience. 2017;6 7 doi:10.1093/gigascience/gix035.
29. Mallet J. New genomes clarify mimicry evolution. Nature genetics. 2015;47 4:306-7. doi:10.1038/ng.3260.
30.
Davey JW, Chouteau M, Barker SL, Maroja L, Baxter SW, Simpson F, et al. 52 403 Major Improvements to the Heliconius melpomene Genome Assembly Used to 53 404 Confirm 10 Chromosome Fusion Events in 6 Million Years of Butterfly 54 55 405 Evolution. G3-Genes Genom Genet. 2016;6 3:695-708. 56 406 doi:10.1534/g3.115.023655. 57 407 31. Loehlin DW and Carroll SB. EVOLUTIONARY BIOLOGY Sex, lies and 58 408 butterflies. Nature. 2014;507 7491:172-3. doi:Doi 10.1038/Nature13066. 59 60 61 16 62 63 64 65 409 32. Brunetti CR, Selegue JE, Monteiro A, French V, Brakefield PM and Carroll 1 410 SB. The generation and diversification of butterfly eyespot color patterns. 2 411 Current Biology. 2001;11 20:1578-85. doi:Doi 10.1016/S0960- 3 4 412 9822(01)00502-4. 5 413 33. VanBuren R, Bryant D, Edger PP, Tang HB, Burgess D, Challabathula D, et al. 6 414 Single-molecule sequencing of the desiccation-tolerant grass Oropetium 7 415 thomaeum. Nature. 2015;527 7579:508-U209. doi:10.1038/nature15714. 8 416 34. Andere AA, Ii RNP, Ray DA and Picard CJ. Genome sequence of Phormia 9 10 417 regina Meigen (Diptera: Calliphoridae): implications for medical, veterinary 11 418 and forensic research. BMC genomics. 2016;17 doi:10.1186/s12864-016- 12 419 3187-z. 13 420 35. Belaghzal H, Dekker J and Gibcus JH. Hi-C 2.0: An optimized Hi-C 14 procedure for high-resolution genome-wide mapping of chromosome 15 421 16 422 conformation. Methods. 2017;123:56-65. doi:10.1016/j.ymeth.2017.04.004. 17 423 36. Chakraborty M, VanKuren NW, Zhao R, Zhang XW, Kalsow S and Emerson 18 424 JJ. Hidden genetic variation shapes the structure of functional elements in 19 425 Drosophila. Nature genetics. 2018;50 1:20-+. doi:10.1038/s41588-017-0010-y. 20 21 426 37. Dudchenko O, Batra SS, Omer AD, Nyquist SK, Hoeger M, Durand NC, et al. 22 427 De novo assembly of the Aedes aegypti genome using Hi-C yields 23 428 chromosome-length scaffolds. Science. 2017;356 6333:92-5. 24 429 doi:10.1126/science.aal3327. 25 430 38. 
Chen WB, Yang XW, Tetreau G, Song XZ, Coutu C, Hegedus D, et al. A 26 27 431 high-quality chromosome-level genome assembly of a generalist herbivore, 28 432 Trichoplusia ni. Molecular ecology resources. 2019;19 2:485-96. 29 433 doi:10.1111/1755-0998.12966. 30 434 39. Xiang H, Liu XJ, Li MW, Zhu YN, Wang LZ, Cui Y, et al. The evolutionary 31 32 435 road from wild moth to domestic silkworm. Nature ecology & evolution. 33 436 2018;2 8:1268-79. doi:10.1038/s41559-018-0593-4. 34 437 40. Wu C. Fauna Sinica Insect Vol. 25 Lepidoptera Papilionidae. Beijing: Science 35 438 Press, 2001. 36 439 41. Sinev SY. Catalogue of the Lepidoptera of Russia. Ed. SY Sinev. KMK, 37 38 440 Saint-Petersburg-Moscow, 2008. 39 441 42. Chou I. Monograph of Chinese butterflies. Zhengzhou: Henan Scientific and 40 442 Technological Publishing House. 1994:1-854. 41 443 43. Ono H, Nishida R and Kuwahara Y. Oviposition stimulant for a Rutaceae- 42 43 444 feeding swallowtail butterfly, Papilio bianor (Lepidoptera: Papilionidae): 44 445 Hydroxycinnamic acid derivative from Orixa japonica. Applied Entomology 45 446 and Zoology. 2000;35 1:119-23. 46 447 44. Perveen F, Khan A and Sikander. Characteristics of butterfly (Lepidoptera) 47 448 fauna from Kabal, Swat, Pakistan. Journal of Entomology and Zoology 48 49 449 Studies. 2014;2 1:56-69. 50 450 45. Yokoyama I, Endo K, Yamanaka A and Kumagai K. Species-specificity in the 51 451 action of big and small prothoracicotropic hormones (PTTHs) of the 52 452 swallowtail butterflies, Papilio xuthus, P. machaon, P. bianor and P. helenus. 53 453 Zoological Science. 1996;13 3:449-54. doi:Doi 10.2108/Zsj.13.449. 54 55 454 46. Ono H, Nishida R and Kuwahara Y. A dihydroxy-gamma-lactone as an 56 455 oviposition stimulant for the swallotail butterfly, Papilio bianor, from the 57 456 Rutaceous plant, Orixa japonica. Biosci Biotech Bioch. 2000;64 9:1970-3. 58 457 doi:Doi 10.1271/Bbb.64.1970. 59 60 61 17 62 63 64 65 458 47. Dongsheng L. 
A Preliminary Observation on the Artificial Rearing of Xinyang 1 459 Papilio bianor. JOURNAL OF XINYANG TEACHERS COLLEGE 2 460 (NATURAL SCIENCE EDITION). 1997;2. 3 4 461 48. Lixin Z, Xiaobing W, Chunsheng W and Banghe Y. Phylogenetic evaluation 5 462 of Papilio bianor and P. polyctor (Lepidoptera: Papilionidae). Oriental Insects. 6 463 2009;43 1:25-32. 7 464 49. Hou LX, Ying S, Yang XW, Yu Z, Li HM and Qin XM. The complete 8 465 mitochondrial genome of Papilio bianor (Lepidoptera: Papilionidae), and its 9 10 466 phylogenetic position within Papilionidae. Mitochondrial DNA Part A. 11 467 2016;27 1:102-3. doi:10.3109/19401736.2013.873923. 12 468 50. Ae S. Some problems in hybrids between Papilio bianor and P. maackii. 13 469 Academia (Nanzan Univ). 1962;33:21-8. 14 51. CHANG Y-J. A study on hybridization of two subspecies of Papilio bianor 15 470 16 471 (Lepidoptera, Papilionidae) in . Lepidoptera Science. 1990;41 1:1-6. 17 472 52. Yamada A. A study of interspecific hybrids between Papilio bianor and P. 18 473 maackii. The nature and insects. 1977;12:27-8. 19 474 53. Maeki K and Makino S. Chromosome numbers of some Japanese Rhopalocera. 20 21 475 Lepid news. 1953;7:36-8. 22 476 54. Dong Y, Zhu L-X, Wu Y-f and Wu X-B. The complete mitochondrial genome 23 477 of the Chinese peacock, Papilio bianor (Insecta: Lepidoptera: Papilionidae). 24 478 Mitochondrial DNA. 2013;24 6:636-8. 25 479 55. Li R, Fan W, Tian G, Zhu H, He L, Cai J, et al. The sequence and de novo 26 27 480 assembly of the giant panda genome. Nature. 2010;463 7279:311-7. 28 481 doi:10.1038/nature08696. 29 482 56. Ruan J and Li H. Fast and accurate long-read assembly with wtdbg2. BioRxiv. 30 483 2019:530972. 31 32 484 57. Li H. Aligning sequence reads, clone sequences and assembly contigs with 33 485 BWA-MEM. arXiv preprint arXiv:13033997. 2013. 34 486 58. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. 
35 487 Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and 36 488 Genome Assembly Improvement. PloS one. 2014;9 11 37 38 489 doi:10.1371/journal.pone.0112963. 39 490 59. Vaser R, Sovic I, Nagarajan N and Sikic M. Fast and accurate de novo genome 40 491 assembly from long uncorrected reads. Genome research. 2017;27 5:737-46. 41 492 doi:10.1101/gr.214270.116. 42 43 493 60. Durand NC, Shamim MS, Machol I, Rao SSP, Huntley MH, Lander ES, et al. 44 494 Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C 45 495 Experiments. Cell Syst. 2016;3 1:95-8. doi:10.1016/j.cels.2016.07.002. 46 496 61. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV and Zdobnov EM. 47 497 BUSCO: assessing genome assembly and annotation completeness with 48 49 498 single-copy orthologs. Bioinformatics. 2015;31 19:3210-2. 50 499 doi:10.1093/bioinformatics/btv351. 51 500 62. Chaisson MJ and Tesler G. Mapping single molecule sequencing reads using 52 501 basic local alignment with successive refinement (BLASR): application and 53 502 theory. BMC bioinformatics. 2012;13 doi:10.1186/1471-2105-13-238. 54 55 503 63. Benson G. Tandem repeats finder: a program to analyze DNA sequences. 56 504 Nucleic acids research. 1999;27 2:573-80. doi:Doi 10.1093/Nar/27.2.573. 57 505 64. Smith A, Hubley R and Green P. RepeatMasker Open-4.0.(2013-2015). 2016. 58 506 65. Chen N. Using RepeatMasker to identify repetitive elements in genomic 59 60 507 sequences. Current protocols in bioinformatics. 2004;5 1:4.10. 1-4.. 4. 61 18 62 63 64 65 508 66. Bao WD, Kojima KK and Kohany O. Repbase Update, a database of repetitive 1 509 elements in eukaryotic genomes. Mobile DNA-Uk. 2015;6 2 510 doi:10.1186/s13100-015-0041-9. 3 4 511 67. Xu Z and Wang H. LTR_FINDER: an efficient tool for the prediction of full- 5 512 length LTR retrotransposons. Nucleic acids research. 2007;35:W265-W8. 6 513 doi:10.1093/nar/gkm286. 7 514 68. Korf I. Gene finding in novel genomes. BMC bioinformatics. 
2004;5 doi:Doi 8 515 10.1186/1471-2105-5-59. 9 10 516 69. Burge C and Karlin S. Prediction of complete gene structures in human 11 517 genomic DNA. J Mol Biol. 1997;268 1:78-94. doi:DOI 12 518 10.1006/jmbi.1997.0951. 13 519 70. Majoros WH, Pertea M and Salzberg SL. TigrScan and GlimmerHMM: two 14 open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20 15 520 16 521 16:2878-9. doi:10.1093/bioinformatics/bth315. 17 522 71. Stanke M, Keller O, Gunduz I, Hayes A, Waack S and Morgenstern B. 18 523 AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic acids 19 524 research. 2006;34:W435-W9. doi:10.1093/nar/gkl200. 20 21 525 72. Tribolium Genome Sequencing C, Richards S, Gibbs RA, Weinstock GM, 22 526 Brown SJ, Denell R, et al. The genome of the model beetle and pest Tribolium 23 527 castaneum. Nature. 2008;452 7190:949-55. doi:10.1038/nature06784. 24 528 73. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, 25 529 et al. The genome sequence of Drosophila melanogaster. Science. 2000;287 26 27 530 5461:2185-95. 28 531 74. Duan J, Li R, Cheng D, Fan W, Zha X, Cheng T, et al. SilkDB v2.0: a 29 532 platform for silkworm (Bombyx mori) genome biology. Nucleic acids research. 30 533 2010;38 Database issue:D453-6. doi:10.1093/nar/gkp801. 31 32 534 75. Pearce SL, Clarke DF, East PD, Elfekih S, Gordon KHJ, Jermiin LS, et al. 33 535 Genomic innovations, transcriptional plasticity and gene loss underlying the 34 536 evolution and divergence of two highly polyphagous and invasive Helicoverpa 35 537 pest species. BMC biology. 2017;15 1:63. doi:10.1186/s12915-017-0402-6. 36 538 76. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, et al. 37 38 539 Gapped BLAST and PSI-BLAST: a new generation of protein database search 39 540 programs. Nucleic acids research. 1997;25 17:3389-402. doi:DOI 40 541 10.1093/nar/25.17.3389. 41 542 77. Birney E, Clamp M and Durbin R. GeneWise and genomewise. Genome 42 43 543 research. 
2004;14 5:988-95. doi:10.1101/gr.1865504. 44 544 78. Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, et al. Automated 45 545 eukaryotic gene structure annotation using EVidenceModeler and the program 46 546 to assemble spliced alignments. Genome biology. 2008;9 1 doi:10.1186/Gb- 47 547 2008-9-1-R7. 48 49 548 79. Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. 50 549 InterProScan 5: genome-scale protein function classification. Bioinformatics. 51 550 2014;30 9:1236-40. 52 551 80. Li L, Stoeckert CJ and Roos DS. OrthoMCL: identification of ortholog groups 53 552 for eukaryotic genomes. Genome research. 2003;13 9:2178-89. 54 55 553 81. De Bie T, Cristianini N, Demuth JP and Hahn MW. CAFE: a computational 56 554 tool for the study of gene family evolution. Bioinformatics. 2006;22 10:1269- 57 555 71. 58 556 82. Loytynoja A and Goldman N. An algorithm for progressive multiple 59 60 557 alignment of sequences with insertions. Proceedings of the National Academy 61 19 62 63 64 65 558 of Sciences of the United States of America. 2005;102 30:10557-62. 1 559 doi:10.1073/pnas.0409137102. 2 560 83. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post- 3 4 561 analysis of large phylogenies. Bioinformatics. 2014;30 9:1312-3. 5 562 doi:10.1093/bioinformatics/btu033. 6 563 84. Yang ZH. PAML 4: Phylogenetic analysis by maximum likelihood. Molecular 7 564 biology and evolution. 2007;24 8:1586-91. doi:10.1093/molbev/msm088. 8 565 85. Kumar S, Stecher G, Suleski M and Hedges SB. TimeTree: A Resource for 9 10 566 Timelines, Timetrees, and Divergence Times. Molecular biology and 11 567 evolution. 2017;34 7:1812-9. doi:10.1093/molbev/msx116. 12 568 86. Zakharov EV, Caterino MS and Sperling FA. Molecular phylogeny, historical 13 569 biogeography, and divergence time estimates for swallowtail butterflies of the 14 genus Papilio (Lepidoptera: Papilionidae). Systematic biology. 2004;53 15 570 16 571 2:278-98. 17 572 87. Dupuis JR and Sperling FA. 
Repeated reticulate evolution in North American 18 573 Papilio machaon group swallowtail butterflies. PloS one. 2015;10 19 574 10:e0141882. 20 21 575 88. Espeland M, Breinholt J, Willmott KR, Warren AD, Vila R, Toussaint EFA, et 22 576 al. A Comprehensive and Dated Phylogenomic Analysis of Butterflies. 23 577 Current Biology. 2018;28 5:770-+. doi:10.1016/j.cub.2018.01.061. 24 578 89. Li H and Durbin R. Inference of human population history from individual 25 579 whole-genome sequences. Nature. 2011;475 7357:493-U84. 26 27 580 doi:10.1038/nature10231. 28 581 90. Sanderson MJ. r8s: inferring absolute rates of molecular evolution and 29 582 divergence times in the absence of a molecular clock. Bioinformatics. 2003;19 30 583 2:301-2. doi:DOI 10.1093/bioinformatics/19.2.301. 31 32 584 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 20 62 63 64 65 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 585 Table 1: Comparison of quality and composition of different butterfly genomes. 
| Family | Species | Genome size (Mb) | Genome size without gap (Mb) | Heterozygosity a (%) | Scaffold N50 (kb) | BUSCO b (%) | De novo assembled transcripts a (%) | GC content (%) | Repeat (%) | Exon (%) | Intron (%) | Number of proteins (k) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Papilionidae | Papilio bianor | 402 | 402 | 1.8 | 12813 | 96.3 | NA | 36.6 | 55.3 | 5.05 | 27.44 | 15.4 |
| Papilionidae | Papilio xuthus | 244 | 238 | NA | 6199 | 97.6 | NA | 33.8 | 22.4 | 8.59 | 45.50 | 13.1 |
| Papilionidae | Papilio machaon | 281 | 266 | 1.2 | 1150 | 95.5 | 98 | 32.3 | 22.3 | 7.37 | 30.36 | 15.5 |
| Papilionidae | Papilio polytes | 227 | 218 | NA | 3672 | 91.8 | NA | 34.0 | 23.8 | 12.97 | 48.58 | 12.2 |
| Papilionidae | Papilio memnon | 233 | 219 | NA | 5457 | 96.6 | NA | 32.8 | 22.5 | 11.31 | 43.17 | 12.4 |
| Papilionidae | Papilio glaucus | 375 | 361 | 2.3 | 231 | 95.5 | 98 | 35.4 | 22.0 | 5.07 | 25.60 | 15.7 |
| Hesperiidae | Achalarus lyciades | 567 | 536 | 1.5 | 558 | 97.3 | 98 | 35.3 | 25.0 | 3.57 | 28.40 | 15.9 |
| Hesperiidae | Lerema accius | 298 | 290 | 1.5 | 525 | 95.1 | 98 | 34.4 | 15.5 | 6.96 | 31.60 | 17.4 |
| Hesperiidae | ursus violae | 429 | 427 | 0.1 | 4153 | 98.3 | 99 | 34.7 | 25.8 | 4.59 | 30.90 | 14.1 |
| Pieridae | Pieris rapae | 246 | 243 | 1.5 | 617 | 98.0 | 99 | 32.7 | 22.7 | 7.91 | 33.30 | 13.2 |
| Pieridae | Phoebis sennae | 406 | 347 | 1.2 | 257 | 97.7 | 97 | 39.0 | 17.2 | 6.20 | 25.50 | 16.5 |
| Nymphalidae | Danaus plexippus | 249 | 242 | 0.6 | 716 | 98.0 | 96 | 31.6 | 16.3 | 8.40 | 28.10 | 15.1 |
| Nymphalidae | Heliconius melpomene | 274 | 270 | NA | 194 | 95.6 | NA | 32.8 | 24.9 | 6.38 | 25.40 | 12.8 |
| Nymphalidae | Melitaea cinxia | 390 | 361 | NA | 119 | 83.0 | 97 | 32.6 | 27.5 | 4.34 | 31.20 | 16.7 |
| Nymphalidae | Bicyclus anynana | 475 | 470 | NA | 638 | 97.6 | NA | 36.5 | 25.8 | 4.73 | 38.36 | 22.6 |
| Riodinidae | Calephelis nemesis | 809 | 783 | 0.5 | 206 | 95.6 | 99 | 34.9 | 34.8 | 2.25 | 19.60 | 15.4 |
| Riodinidae | Calephelis virginiensis | 855 | 824 | 1.3 | 175 | 93.9 | 99 | 35.0 | 38.8 | 2.17 | 20.50 | 15.6 |
| Lycaenidae | Calycopis cecrops | 729 | 689 | 1.2 | 233 | 95.5 | 96 | 37.1 | 34.0 | 3.11 | 24.00 | 16.5 |

a NA: not available in the referenced citation.
b BUSCO is calculated in this study.

Figure legends

Figure 1. Characterization of Papilio bianor. (a) Female adult of P. bianor. Shown from left to right are: (1) dorsal view, (2) ventral view (scale bars = 20.0 mm; photo by Zhiwei Dong). (b) Heatmap of chromosomal interactions. Each chromosome is framed with a blue block, and each scaffold is framed with a green block. (c) Circos plot of the P. bianor chromosome-level reference genome with the previously released Papilio xuthus genome (obtained from the Chinese group). Shown from the outermost to innermost are: (1) gene density, (2) repeat element density, (3) GC content, and (4) syntenic regions with P. xuthus (left).

Figure 2. Genomic analysis of Papilio bianor. (a) Breakdown of the whole-genome assemblies into different segments in Papilio. (b) Venn diagram of the shared gene families of Papilio. (c) Maximum Likelihood (ML) phylogenetic tree of Papilionoidea inferred using orthologous genes. The numbers in square brackets on the nodes are the 95% confidence intervals of divergence time. (d) Demographic history of P. bianor. “g” indicates generation time in years, and “μ” indicates genomic substitution rate. Pb: Papilio bianor; Pgl: Papilio glaucus; Pma: Papilio machaon; Pme: Papilio memnon; Ppol: Papilio polytes; Pxu: Papilio xuthus.

Figure 1

Figure 2

Additional files

Figure S1: K-mer (k=17) distribution in the Papilio bianor genome. The first peak (depth=53) is a heterozygous peak, which is higher than the main peak (depth=26), suggesting that the P. bianor genome is highly heterozygous. The x-axis is depth (×); the y-axis is the proportion, i.e., the frequency at that depth divided by the total frequency over all depths.

Figure S2: The statistics of annotated protein-coding genes of Papilio. (a) mRNA length, (b) coding sequence (CDS) length, (c) exon length, (d) intron length, (e) exon number. The x-axis represents length or number and the y-axis represents the density of genes.

Table S1: The statistics of sequencing data generated for the Papilio bianor genome. The sequencing depth was calculated based on the assembled genome size.

Table S2: Genome size estimation of Papilio bianor with K-mer distribution analysis using k=17.

Table S3: The statistics of the assembled chromosome-level genome of Papilio bianor. The Hi-C data were filtered with the HiC-Pro software, leaving 6,690,421 read pairs (68.04% of the total Hi-C data) for the subsequent analysis.

Table S4: The continuity assessment of the genome assembly of Papilio bianor.

Table S5: The quality evaluation of the assembled genome of Papilio bianor by BUSCO software.

Table S6: The statistics of mapping ratio of Illumina reads to the Papilio bianor assembled genome.

Table S7: The statistics of mapping ratio of PacBio reads to the Papilio bianor assembled genome.

Table S8: The statistics of the annotated repeat sequences in the Papilio bianor genome.

Table S9: The statistics of the TE contents in the Papilio bianor genome.

Table S10: The statistics of predicted protein-coding genes in the Papilio bianor genome.

Table S11: The statistics of gene function annotation in the Papilio bianor genome.

Table S12: The GO term enrichment of expanded gene families in the Papilio bianor genome.

Table S13: The GO term enrichment of contracted gene families in the Papilio bianor genome.


Dear Editors of GigaScience,

We would like to submit our manuscript entitled “Chromosomal-level reference genome of Chinese peacock butterfly (Papilio bianor) based on third-generation DNA sequencing and Hi-C analysis” for your consideration as a Data Note in GigaScience. The submission comprises ~6,200 words of main text, 1 table, 2 figures, Supplementary Tables 1-13, and Supplementary Figures 1-2. We declare that none of the content of the manuscript has been published or submitted for publication elsewhere. All authors have contributed significantly and agree with the content of the manuscript.

Butterflies have been favored by naturalists for centuries, and the study of butterflies has been an integral part of ecology and evolution ever since Darwin proposed his theory of natural selection in 1859. Back in 1864, H. W. Bates, the famous originator of mimicry theory, predicted that “the study of butterflies…will someday be valued as one of the most important branches of Biological science.” The Chinese peacock butterfly, Papilio bianor, is an ideal model organism for research in genetics, evolutionary biology, and phylogeography owing to features such as ease of breeding, varied wing coloration, and wide geographic distribution. A high-quality chromosome-level reference genome of P. bianor is therefore important for investigating iridescent color evolution, phylogeography, and the evolution of swallowtail butterflies.

In this study, we assembled a chromosome-level genome of the highly heterozygous (1.81%) Chinese peacock butterfly (P. bianor) using a combination of Illumina, PacBio, and Hi-C technologies. The final assembly is 402.00 Mb on 30 chromosomes (29 autosomes and 1 sex chromosome, W), with a contig N50 of 5.50 Mb and a scaffold N50 of 12.51 Mb. The maximum contig and scaffold lengths are 15.05 Mb and 17.37 Mb, respectively. The genomic resources generated in this study lay the foundation for exploring the genetic basis of the special biological features of the Chinese peacock butterfly, and also provide a useful data source for comparative genomics and phylogenomics among butterflies and moths.

These findings are expected to be of interest to a broad audience of entomologists and ecologists, especially researchers working on butterflies. As an international journal focusing on ‘Big data’ research from the life and biomedical sciences, GigaScience represents the ideal platform for us to share our results with the international research community. We look forward to hearing from you at your earliest convenience.

Thank you for your consideration.

Sincerely,

Xueyan Li, Ph.D.
State Key Laboratory of Genetic Resources and Evolution
Kunming Institute of Zoology, Chinese Academy of Sciences (CAS), Kunming, Yunnan 650223, China
Email: [email protected], Tel: 86-871-68125339, Fax: 86-871-68125338

Wen Wang, Ph.D.
State Key Laboratory of Genetic Resources and Evolution
Kunming Institute of Zoology, Chinese Academy of Sciences (CAS), Kunming, Yunnan 650223, China
Center for Ecological and Environmental Sciences, Northwestern Polytechnical University, Xi’an 710072, China
Email: [email protected]