Ancient Evolutionary Signals of Protein-Coding
Total Page:16
File Type:pdf, Size:1020Kb
Casimiro-Soriguer et al. BMC Genomics (2020) 21:210 https://doi.org/10.1186/s12864-020-6632-y RESEARCH ARTICLE Open Access Ancient evolutionary signals of protein- coding sequences allow the discovery of new genes in the Drosophila melanogaster genome Carlos S. Casimiro-Soriguer, Alejandro Rubio, Juan Jimenez and Antonio J. Pérez-Pulido* Abstract Background: The current growth in DNA sequencing techniques makes of genome annotation a crucial task in the genomic era. Traditional gene finders focus on protein-coding sequences, but they are far from being exhaustive. The number of this kind of genes continuously increases due to new experimental data and development of improved bioinformatics algorithms. Results: In this context, AnABlast represents a novel in silico strategy, based on the accumulation of short evolutionary signals identified by protein sequence alignments of low score. This strategy potentially highlights protein-coding regions in genomic sequences regardless of traditional homology or translation signatures. Here, we analyze the evolutionary information that the accumulation of these short signals encloses. Using the Drosophila melanogaster genome, we stablish optimal parameters for the accurate gene prediction with AnABlast and show that this new strategy significantly contributes to add genes, exons and pseudogenes regions, yet to be discovered in both already annotated and new genomes. Conclusions: AnABlast can be freely used to analyze genomic regions of whole genomes where it contributes to complete the previous annotation. Keywords: Ancient sequences, Gene finding, Protein-coding genes, Drosophila melanogaster, Protomotifs Background significant number of protein-coding sequences and other Research groups from all over the world are sequencing functional genomic elements are missing when using cur- whole genomes as a common task, taking advantage of rently available genomic annotation approaches. the current burst in the genomics era [1]. The analysis of One of the most intensively studied model organism is the sequences from those genomes is essential for accur- the fruit-fly Drosophila melanogaster. Its genome was se- ate annotation procedures. However, computational tools quenced in 2000, and 13,601 protein-coding genes were for gene discovery usually miss around 20% of protein- initially annotated, coming from the integration of the two coding genes when annotating a whole genome, or even used gene finders, which respectively predicted 13,189 and more in the case of eukaryotic organisms [2, 3]. Thus, a 17,464 genes [4]. From this milestone, the number of fruit-fly genes has changed, and numerous and significant discrepancies have arisen [5]. But nowadays the FlyBase * Correspondence: [email protected] database put this number at 14,133 [6], showing that the Centro Andaluz de Biologia del Desarrollo (CABD, UPO-CSIC-JA). Facultad de Ciencias Experimentales (Área de Genética), Universidad Pablo de Olavide, number of genes is constantly increasing over time, and a 41013 Sevilla, Spain © The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Casimiro-Soriguer et al. BMC Genomics (2020) 21:210 Page 2 of 10 greater increase is expected to come from the discovery of regions to evaluate the method accuracy and determine new kinds of genes, such as those shorter than 100 amino optimal AnABlast parameters. The fine-tuning proced- acids, which in the fruit-fly genome could account for ure was finally tested with an early version of a protein thousands of them [7]. database, and we show that many new genes predicted Traditional gene finders are routinely based on both sig- by this algorithm are really true genes that have been in- nificant sequence similarity and sequence signatures such as corporated into the current genome annotation of this those used to define open reading frames (ORF), signals in- organism. We also show that AnABlast is useful to dis- volved in splicing [8], or combined protocols to get better cover small ORF and fossil sequences that are hidden to results [9]. Among the new proposed methods, we have pre- conventional gene finder algorithms, and show how this viously shown that accumulation of low-score alignments, new strategy can contribute to discover the complete set which would represent footprints of ancient sequences, of protein-coding regions of a whole genome. highlights present and ancient protein-coding regions which are hard to discover by conventional methods [10]. Briefly, Results this novel computational approach, that we named AnA- Searching for protein-coding signals in the fruit-fly Blast, compares the putative amino acid sequences from the genome six reading frames of a genomic sequence against a non- AnABlast is a computational tool that searches for protein- redundant protein database, and collects the matches, in- coding regions in whole genomes by taking into account cluding low-score alignments, which we call protomotifs. low-score alignments shared by multiple unrelated protein These are specifically accumulated in coding but rarely in sequences. To this end, AnABlast uses the putative amino non-coding sequences. Thus, the profile of AnABlast with acid sequences translated from a genomic sequence to peaks of accumulated protomotifs, accurately marks puta- search for sequence similarity in a non-redundant protein tive protein-coding genes, pseudogenes, and fossils of an- database. Alignments obtained from this similarity search cient coding sequences, overcoming the effects of possible (called protomotifs), including those of a low score, are sequencing errors and reading frame shifts, since it does not then piled up along the query sequence, and peaks accu- search for reading frames but sequence coding signals. mulating protomotifs above a specific threshold will high- We previously showed that this strategy is useful to light potential protein-coding regions and will be discover putative protein-coding regions. It was tested considered coding-signals (Fig. 1). Finally, these coding- with both intergenic and intronic sequences from the signals can be evaluated: those ones underlying exons from fission yeast Schizosaccharomyces pombe, and 18 puta- aprotein-codinggenewillbetruepositivepredictions, tive genes were predicted [10]. But even though these exons without coding signals will be false negatives, and predictions had computational support and some trans- coding-signals underlying introns or intergenic regions will lation evidences, prediction parameters were empirically be putative false positives. But it should be noted that the established by training the algorithm with S. pombe cu- false positives could potentially underlie new coding se- rated genes. Here, we use the whole well-studied D. mel- quences which escaped to conventional annotation pipe- anogaster genome, and its annotated protein-coding lines. Thus, false positives highlighted by AnABlast may Fig. 1 Schematic diagram showing AnABlast profiles obtained in a theoretical genomic region with a two-exon gene. The peak-height is the maximum accumulation of protomotifs in a specific genomic position (BLAST alignments including low bit-scores). Peaks with a protomotif accumulation above a peak-height threshold are considered as putative protein-coding regions (coding-signal). Significant peaks matching a known exon represents true positive peaks, while those underlying a genomic region without known exons are considered false positive coding- signals. Well-known exons which do not significantly accumulate protomotifs (peak-height below the threshold) constitute false negatives Casimiro-Soriguer et al. BMC Genomics (2020) 21:210 Page 3 of 10 represent genomic regions encoding putative new proteins, positive signals represent a particularly interesting set of but also non-functional degenerated protein-coding re- genomic regions, since they could constitute new protein- gions, something of particular interest in current genome coding regions. research. To test the capability of AnABlast to discover protein- Protomotifs underlie into the true reading frame coding regions, we used this algorithm to analyze the Protein-coding signals highlighted by AnABlast are whole genome of the fruit-fly D. melanogaster (2012 and mainly composed by protein sequence alignments of low 2017 releases). Protein sequences coming from the virtual score, but also occasionally high score. To test if such translation of the complete fly genome (in all the six read- alignments are just random, or they actually match true ing