A Converter Facilitating Genome Annotation Submission to European Nucleotide Archive Martin Norling1,2, Niclas Jareborg1,3 and Jacques Dainat1,2*
Total Page:16
File Type:pdf, Size:1020Kb
Norling et al. BMC Res Notes (2018) 11:584 https://doi.org/10.1186/s13104-018-3686-x BMC Research Notes RESEARCH NOTE Open Access EMBLmyGFF3: a converter facilitating genome annotation submission to European Nucleotide Archive Martin Norling1,2, Niclas Jareborg1,3 and Jacques Dainat1,2* Abstract Objective: The state-of-the-art genome annotation tools output GFF3 format fles, while this format is not accepted as submission format by the International Nucleotide Sequence Database Collaboration (INSDC) databases. Convert- ing the GFF3 format to a format accepted by one of the three INSDC databases is a key step in the achievement of genome annotation projects. However, the fexibility existing in the GFF3 format makes this conversion task difcult to perform. Until now, no converter is able to handle any GFF3 favour regardless of source. Results: Here we present EMBLmyGFF3, a robust universal converter from GFF3 format to EMBL format compatible with genome annotation submission to the European Nucleotide Archive. The tool uses json parameter fles, which can be easily tuned by the user, allowing the mapping of corresponding vocabulary between the GFF3 format and the EMBL format. We demonstrate the conversion of GFF3 annotation fles from four diferent commonly used anno- tation tools: Maker, Prokka, Augustus and Eugene. EMBLmyGFF3 is freely available at https://github.com/NBISweden/EMBLmyGFF3. Keywords: Annotation, Converter, Submission, EMBL, GFF3 Introduction the GFF3 format has become the de facto reference for- Over the last 20 years, many sequence annotation tools mat for annotations. Despite well-defned specifcations, have been developed, facilitating the production of rela- the format allows great fexibility allowing holding wide tively accurate annotation of a wide range of organisms variety of information. Tis fexibility has resulted in the in all kingdoms of the tree of life. To describe the features format being used by a broad range of annotation tools annotated within the genomes, the Generic Feature For- (e.g. MAKER [2], Augustus [3], Prokka [4], Eugene [5]), mat (GFF) was developed. Facing some limitation using and is used by most genome browsers (e.g. ARTEMIS the originally published Sanger specifcation, GFF has [6], Webapollo [7], IGV [8]). Te fexibility of the GFF3 evolved into diferent favours depending on the difer- format has facilitated its spread but raises a recurrent ent needs of diferent laboratories. Te Sequence Ontol- problem of interoperability with the diferent tools that ogy Project (http://www.sequenceontology.org; [1]) in use it. Indeed, one can meet as many favours of GFF3 2013 proposed the GFF3 format, which “addresses the format as tools producing it. One of the natural out- most common extensions to GFF, while preserving back- comes of a GFF3 fle is to be converted in a format that ward compatibility with previous formats”. Since then, can be submitted in one of the INSDC databases. Since 2016 NCBI released a beta version of a process to submit GFF3 or GTF to GenBank [9]. Tey describe what infor- *Correspondence: [email protected] mation is expected in the GFF3 fle and how to format it 1 National Bioinformatics Infrastructure Sweden (NBIS), SciLifeLab, in order to be accepted by the table2asn_GFF tool, which Uppsala Biomedicinska Centrum (BMC), Husargatan 3, 751 23 Uppsala, Sweden convert the well-formatted GFF3 into.sqn fles for sub- Full list of author information is available at the end of the article mission to GenBank. Modifying a GFF3 fle to fulfl the © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Norling et al. BMC Res Notes (2018) 11:584 Page 2 of 5 requirements is not often an easy task and programming called EMBLmyGFF3, which is a universal GFF3 to skills may be needed to automate it. To facilitate this step, EMBL converter allowing the submission to the ENA a user-friendly bioinformatics tool called Genome Anno- database. To our knowledge, this is the only tool able to tation Generator (GAG) has been implemented [10]. deal with any favour of GFF3 fle without any pre-pro- GAG provides a straightforward and consistent tool for cessing of the fle. Indeed, the originality lies in json map- producing a submission-ready NCBI annotation in.tbl ping fles allowing the mapping of vocabulary between format. Tis.tbl format is a tabulate format required with GFF3 and EMBL formats. two other fles (.sbt and.fsa) by the tbl2asn tool provided by the NCBI in order to produce a.sqn fle for submission Main text to GenBank. Implementation While the NCBI doesn’t accept their GenBank Flat File Te EMBLmyGFF3 tool is an implementation in the format but rather an.sqn intermediate fle for submission, python programming language of the verbose documen- EBI accepts submission in their EMBL fat fle format. tation provided by the European Bioinformatics Institute Here the difculty is to generate an EMBL fat fle from [13]. Te implementation is structured around two main a GFF3 fle. Several tools have been developed to per- modules: feature and qualifer. form this step i.e. Artemis [6], seqret from EMBOSS [11], (i) Te feature module contains the description of all GFF3toEMBL [12] but have limitations. While pletho- the EMBL features and their associated qualifers. Te ric number of annotation tools exist, GFF3toEMBL [12] feature module handles a parameter fle in json for- only deals with the GFF3 produced by the prokaryote mat, called translation_gf_feature_to_embl_feature. annotation tool Prokka. So, for annotation produced by json, allowing the proper mapping of the feature types other tools, users have to turn to other solutions. Arte- described in the 3rd column of the GFF3 fle to the cho- mis has a graphical user interface, which doesn’t allow sen EMBL features. an automation of the process. Seqret is designed to deal Below is an example how to map the GFF3 feature type with only one record at a time, which makes its use for “three_prime_UTR” to the EMBL feature type “3′UTR”: genome-wide annotation not straightforward. Te main bottleneck is that both tools need a properly formatted "three_prime_UTR": { GFF3 containing the INSDC expected vocabulary (3th and 9th column), while the annotation tools do not nec- "target": "3'UTR" essarily use this vocabulary. Te EMBL format follows the INSDC defnitions and accepts 52 diferent feature } types, whereas the GFF3 mandates the use of a Sequence Ontology term or accession number (3rd column of the GFF3), which nevertheless constitutes 2278 terms in We also provide the possibility to decide which features version 2.5.3 of the Sequence Ontology. Moreover, the will be printed in the output using the “remove” param- EMBL format accepts 98 diferent qualifers, where the eter. In the following example the feature type “three_ corresponding attribute tag types in the 9th column of prime_UTR” will be ignored: the GFF3 are unlimited. Consequently, in many cases the user may have to pre-process the GFF3 to adapt it to the "three_prime_UTR": { expected vocabulary. Te information contained and the vocabulary used "target": "3'UTR", in a GFF3 fle can difer a lot depending of the annota- tion tool used. On top of that the vocabulary used by the GFF3 format and by the EMBL format can difer in "remove": true many points. Tose diferences make it difcult to cre- ate a universal GFF3 to EMBL converter that would avoid } pre-processing of the GFF3 annotation fle. Te chal- lenge undoubtedly lies in the implementation of a correct mapping between the feature types described in the 3rd (ii) Te qualifer module defnes all the EMBL qualifers column, as well as the diferent attribute’s tags of the 9th (a defnition, a comment, and the format of the expected column of the GFF3 fle and the corresponding EMBL value) and has methods to access and print them. Within features and qualifers. the GFF3 fle, the qualifers are the attribute’s tags of the In collaboration with the European Nucleotide Archive 9th column. It is common that an attribute’s tag doesn’t we have developed a tool addressing these difculties correspond to a EMBL qualifer name. To address this Norling et al. BMC Res Notes (2018) 11:584 Page 3 of 5 difculty, the module handles a parameter fle in json produce the annotation, as well as metadata required format, called translation_gf_attribute_to_embl_quali- by ENA. Tere are metadata that are not contained in fer.json, allowing proper mapping of the attribute’s tag GFF3 format, but that are mandatory or recommended described in the 9th column of the GFF3 fle to the cho- for producing a valid EMBL fat fle. When all the manda- sen EMBL qualifer. Below is an example how to map the tory metadata are flled, the tool will proceed to the con- “Dbxref” attribute’s tags from the GFF3 fle to the “db_ version; otherwise it will help the user to fll the needed xref” qualifer expected by the EMBL format. information. "Dbxref": { Results "source description": "A database cross reference.", Te software has been used to convert the GFF3 anno- tation fle produced by diferent annotation tools (e.g. "target": "db_xref", Prokka, Maker, Augustus, Eugene). Tree test cases are included into the source code distribution.