Flybase: Introduction of the Drosophila Melanogaster Release 6 Reference Genome Assembly and Large-Scale Migration of Genome Annotations

FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation dos Santos, Gilberto, Andrew J. Schroeder, Joshua L. Goodman, Victor B. Strelets, Madeline A. Crosby, Jim Thurmond, David B. Emmert, and William M. Gelbart. 2015. “FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations.” Nucleic Acids Research 43 (Database issue): D690-D697. doi:10.1093/nar/gku1099. http://dx.doi.org/10.1093/nar/gku1099. Published Version doi:10.1093/nar/gku1099 Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:15034860 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA D690–D697 Nucleic Acids Research, 2015, Vol. 43, Database issue Published online 14 November 2014 doi: 10.1093/nar/gku1099 FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations Gilberto dos Santos1,*, Andrew J. Schroeder1, Joshua L. Goodman2, Victor B. Strelets2, Madeline A. Crosby1, Jim Thurmond2, David B. Emmert1, William M. Gelbart1 and the FlyBase Consortium† 1The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA and 2Department of Biology, Indiana University, Bloomington, IN 47405, USA Received September 29, 2014; Revised October 17, 2014; Accepted October 22, 2014 ABSTRACT genome (Release 6). This improved assembly was coordi- nately integrated into FlyBase and NCBI (4) and became Release 6, the latest reference genome assembly of the reference genome assembly for D. melanogaster as of the the fruit ﬂy Drosophila melanogaster, was released summer of 2014. by the Berkeley Drosophila Genome Project in 2014; The challenge when replacing one assembly with another it replaces their previous Release 5 genome assem- is the migration of anchored genomic data from the old as- bly, which had been the reference genome assem- sembly to the new one. This is especially true given that bly for over 7 years. With the enormous amount of the scale of the data involved in the migration to the Re- information now attached to the D. melanogaster lease 6 genome assembly far exceeds that involved in the genome in public repositories and individual labo- last such migration to the Release 5 assembly in 2006; there ratories, the replacement of the previous assembly is an enormous amount of data in public repositories such by the new one is a major event requiring careful mi- as FlyBase, NCBI, modENCODE DCC (5), modMine (6) and the UCSC Genome Browser (7), as well as in private gration of annotations and genome-anchored data databases of individual laboratories that has to be updated. to the new, improved assembly. In this report, we de- The new Release 6 reference genome assembly, the migra- scribe the attributes of the new Release 6 reference tion of FlyBase annotations and genome-anchored data to genome assembly, the migration of FlyBase genome this new genome assembly, and a user guide on how to in- annotations to this new assembly, how genome fea- terrogate it and update genomic coordinates are the topics tures on this new assembly can be viewed in Fly- of this report. Base (http://ﬂybase.org) and how users can convert coordinates for their own data to the corresponding THE RELEASE 6 REFERENCE GENOME ASSEMBLY Release 6 coordinates. Description of BDGP Release 6 genome assembly A reference genome assembly for D. melanogaster was first INTRODUCTION released in 2000 (8,9). This assembly was based on nuclear FlyBase (http://flybase.org) is a database of Drosophila- DNA sequences derived from the ‘iso-1’ reference strain: related genetic and genomic information (1). From late- isogenic yellow (y1); cinnabar (cn1), brown (bw1), speck (sp1) 2006 until mid-2014, the reference genome assembly for (10). Since then, this iso-1-derived reference nuclear genome the biomedical model organism, Drosophila melanogaster, assembly has been revised to close gaps, improve sequence has been the Berkeley Drosophila Genome Project (BDGP, quality and add sequence from centric heterochromatin re- http://fruitfly.org) Release 5 genome assembly (2). During gions (2,11). this period, FlyBase has published 57 updates to the gene The latest BDGP Release 6 of the D. melanogaster model annotations of this species’ genome, and has inte- genome assembly, still based on the iso-1 reference strain, grated many other mapped genome features, most notably includes a total of 143.7 Mb on 1870 scaffolds compris- those of the modENCODE project (3). In the spring of ing 2442 contigs (Table 1). The vast majority, 137.6 Mb, 2014, BDGP released a new assembly of the D. melanogaster of this sequence resides on seven chromosome arms: X, *To whom correspondence should be addressed. Tel: +1 617 495 9925; Fax: +1 617 496 1354; Email: [email protected] †The members of the FlyBase Consortium are listed in the Acknowledgements. C The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Nucleic Acids Research, 2015, Vol. 43, Database issue D691 2L, 2R, 3L, 3R, 4 and Y (Table 2). Additionally, there are DATA MIGRATION TO THE RELEASE 6 REFERENCE 1862 minor scaffolds that lack precise localization. Almost GENOME ASSEMBLY half of these ‘unlocalized’ scaffolds (n = 884) have been Feature migration mapped to a chromosome region by comparative genome hybridization in embryos with various chromosome defi- FlyBase represents as many features located on the genome ciencies: 2CEN (centromere-proximal region of chromo- as possible. Primary among these are the gene model an- some 2), 3CEN, X, Y, XY (regions mapping to both X and notations, represented by localized exon features that are Y chromosomes) and rDNA (ribosomal DNA) (12). The joined into one or more models representing transcript iso- Release 6 reference genome assembly also replaces the pre- forms. In addition to the gene models, FlyBase contains nu- vious mitochondrial reference genome assembly, a compos- merous other types of localized features including ones that ite of sequences from various D. melanogaster strains, with are part of the sequenced genome (e.g. protein-binding sites, one derived exclusively from the iso-1 reference strain (Ta- enhancers), reagents and variations reported in the litera- ble 1). ture (e.g. aberrations, mutant alleles, mobile element inser- tion sites) and discretely aligned features (cDNAs, protein similarities, gene model predictions, RNA-Seq splice junc- tions). The replacement of the Release 5 genome assembly with Release 6 required migration of millions of discrete fea- Major improvements in the Release 6 genome assembly com- tures and over a hundred RNA-Seq datasets from the old to pared to Release 5 the new assembly. Assembly-to-assembly alignments between Release 5 and Release 6 has a number of important changes from Release Release 6 were generated by NCBI (4), using previously 5(Tables1 and 2). Release 6 is 4.2 Mb larger, even as the described methods (http://www.ncbi.nlm.nih.gov/genome/ total assembly gap length has been reduced to 1.2 Mb, a tools/remap/docs/alignments),andprovidedtoFlyBase. decrease of 1.5 Mb. The main areas of improvement are to Using the information provided in these alignments, a map the centric heterochromatin regions of the major chromo- was generated that identified coordinate spans in Release 5 some arm scaffolds (X, 2L, 2R, 3L, 3R, 4), which have in- that correspond to equivalent unchanged regions in the new corporated over 10 Mb of sequence from minor Release 5 Release 6 assembly. Because the Release 6 assembly con- scaffolds (XHet, 2LHet, 2RHet, 3LHet, 3RHet and U). The tains some regions that are inverted relative to their orienta- chromosome Y scaffold has been improved dramatically, in- tion in Release 5, and other regions that have moved to dif- creasing over 10-fold in size to 3.1 Mb. Almost all remain- ferent scaffolds, the mapping file and necessary coordinate ing gaps in the chromosome arm scaffolds are in the het- transformations needed to take these types of changes into erochromatic regions. Small, unmapped scaffolds are now account. The mapping file underlies the coordinates con- represented individually, instead of being concatenated into verter tool (see below) and is available at FlyBase (http:// pseudoscaffolds (e.g. Release 5 pseudoscaffolds U, 3LHet). flybase.org/reports/FBrf0225389.html). This map was used to perform the coordinate conversions necessary to update the location of the features from their old Release 5 coordinates and localize them to the new scaffolds of the Release 6 assembly. Access to the Release 6 and Release 5 genome assemblies The BDGP Release 5 and Release 6 genome assemblies have been deposited at NCBI. Full reports are available at NCBI Realignment of high-throughput datasets (4), which provides global statistics and quality metrics for each assembly, as well as a ‘full sequence report’ and links For many types of high-throughput datasets, direct map- to GenBank (13) and RefSeq (14) accessions for individual ping of reads to the new genome assembly is preferable to scaffolds (Table 3). the migration process described above, since it allows align- Beginning with FlyBase update FB2014 04 (released July ment to newly sequenced regions.

Flybase: Introduction of the Drosophila Melanogaster Release 6 Reference Genome Assembly and Large-Scale Migration of Genome Annotations

Maternally Inherited Pirnas Direct Transient Heterochromatin Formation

The Biogrid Interaction Database

Flybase: Updates to the Drosophila Melanogaster Knowledge Base Aoife Larkin 1,*, Steven J

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

The Drosophila Melanogaster Transcriptome by Paired-End RNA Sequencing

Rnacentral: Progress in the Development of the Non-Coding RNA Sequence Database

Basic Bioinformatics

Flybase 2.0: the Next Generation Jim Thurmond1, Joshua L

Phylogenetic Profiling of the Arabidopsis Thaliana Proteome

Arabidopsis Thaliana Functional Genomics Project Annual Report 2008

Bioinformatics Approaches for Functional Predictions in Diverse Informatics Environments. Paula Maria Moolhuijzen, Bsc This Thes

Abstracts Selected for Platform Presentation