Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

microorganisms Article Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes Katelyn McNair 1,* , Carol L. Ecale Zhou 2, Brian Souza 3, Stephanie Malfatti 3 and Robert A. Edwards 1,4,* 1 Computational Sciences Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182, USA 2 Lawrence Livermore National Laboratory, Global Security Computing Applications, Livermore, CA 94550, USA; [email protected] 3 Lawrence Livermore National Laboratory, Biological Sciences Research Division, Livermore, CA 94550, USA; [email protected] (B.S.); [email protected] (S.M.) 4 College of Science and Engineering, Flinders University, Bedford Park, SA 5042, Australia * Correspondence: [email protected] (K.M.); [email protected] (R.A.E.) Abstract: One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode for a protein, and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach is using shared homology with a database of known genes. This method presents many pitfalls, most notably the catch that you only find genes that you have seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, Phanotate) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing for the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFS). Our workflow begins with taking the amino acid Citation: McNair, K.; Ecale Zhou, frequencies of each ORF, calculating an entropy density profile (EDP), using KMeans to cluster the C.L.; Souza, B.; Malfatti, S.; Edwards, EDPs, and then selecting the cluster with the lowest variation as the coding ORFs. To test the efficacy R.A. Utilizing Amino Acid of our method, we ran GOODORFS on 14,179 annotated phage genomes, and compared our results Composition and Entropy of to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Potential Open Reading Frames to Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), Identify Protein-Coding Genes. while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96). Microorganisms 2021, 9, 129. https://doi.org/10.3390/ Keywords: phage; genome; gene; annotation; machine learning; clustering; prediction microorganisms9010129 Received: 30 November 2020 Accepted: 5 January 2021 1. Introduction Published: 8 January 2021 The first genome ever sequenced was that of Bacteriophage MS2 [1]. Twenty years later, Publisher’s Note: MDPI stays neu- the first bacterial genome, Haemophilus influenza, was sequenced, and with it came the need tral with regard to jurisdictional clai- to computationally predict where protein-coding genes occur in prokaryotic genomes [2]. ms in published maps and institutio- This gave rise to the first of the gene annotations tools, GeneMark [3], Glimmer [4], and nal affiliations. CRITICA [5], a decade later Prodigal [6], and most recently PHANOTATE [7] for viral genomes. One thing each of these tools shares in common is the necessity for a training set of good genes—genes that are highly likely to encode proteins, and that the software can use to learn the features that segregate coding open-reading frames (ORFs) from Copyright: © 2021 by the authors. Li- noncoding ones. GeneMark and GLIMMER, and to an extent Prodigal and PHANOTATE, censee MDPI, Basel, Switzerland. all require precomputed gene models to find similar genes within the input genome, and This article is an open access article the better these training models, the better the predictions that each tool makes. Both distributed under the terms and con- GeneMark and CRITICA rely on previously annotated genomes to predict genes in the ditions of the Creative Commons At- input query genome. GeneMark selects one of its precomputed general heuristic models tribution (CC BY) license (https:// creativecommons.org/licenses/by/ based on the amino acid translation table and GC content of the input genome. CRITICA 4.0/). uses the shared homology between the known genes and the input genome, as well as Microorganisms 2021, 9, 129. https://doi.org/10.3390/microorganisms9010129 https://www.mdpi.com/journal/microorganisms Microorganisms 2021, 9, x FOR PEER REVIEW 2 of 11 GeneMark and CRITICA rely on previously annotated genomes to predict genes in the Microorganisms 2021, 9, 129 2 of 11 input query genome. GeneMark selects one of its precomputed general heuristic models based on the amino acid translation table and GC content of the input genome. CRITICA uses the shared homology between the known genes and the input genome, as well as non-comparative information such as contextu contextualal hexanucleotide frequency. In contrast, GLIMMER, Prodigal, and PHANOTATE create gene models from only the input query genome, which removesremoves thethe dependence dependence on on reference reference data. data. Each Each uses uses a differenta different method method to toselect select ORFs ORFs for for inclusion inclusion in a in training-set, a training-set, on which on which a gene a modelgene model is built. is GLIMMERbuilt. GLIMMER builds buildsits training-set its training-set from the from longest the longest (and thus (and most thus likely most tolikely be protein-encoding) to be protein-encoding) ORFs, whichORFs, whichare predicted are predicted by its by LONGORFS its LONGORFS program. program. Prodigal Prodigal uses uses the GCthe GC frame frame plot plot consensus consen- susof the of the ORFs, ORFs, and and performs performs the the first first of of its its two two dynamic dynamic programming programming steps steps toto buildbuild a training-set. PHANOTATEPHANOTATE creates creates a a training-set training-set by by taking taking all all the the ORFs ORFs that that begin begin with with the themost most common common start start codon codon ATG. ATG. An additional method, which has subsequently been been added added as as an option to LON- GORFS, isis thethe Multivariate Multivariate Entropy Entropy Distance Distance (MED2) (MED2) algorithm algorithm [8], [8], which which finds finds the entropythe en- tropydensity density profiles profiles (EDP) of(EDP) ORFs of andORFs compares and compares them to them precomputed to precomputed reference reference EDP profiles EDP profilesfor coding for andcoding noncoding and noncoding ORFs. The ORFs. EDP The of anEDP ORF of an is a ORF 20-dimensional is a 20-dimensional vector S vector= {si} of S the entropy of the 20 amino acid frequencies p , and is defined by: = {si} of the entropy of the 20 amino acid frequenciesi pi , and is defined by: −11 20 s = plogp where where H = − p log p (1) i i i ∑j j H j=1 This approach relies on the coding ORFs having a conserved amino acid composition that isThis different approach from relies the noncoding on the coding ORFs. ORFs Th havingis differential a conserved can be amino seen acidwhen composition comparing thethat observed is different amino from acid the frequency noncoding of ORFs. known This phage differential protein cancoding be seen genes when to the comparing expected frequenciesthe observed (Figure amino acid1A), frequencywhich are ofbased known purely phage on protein the percent coding AT|GC genes tocontent the expected of the genome.frequencies Since (Figure coding1A), ORFs which have are a bias based towa purelyrds certain on the amino percent acids, AT|GC and noncoding content of ORFs the havegenome. frequencies Since coding approximately ORFs have dependent a bias towards on the certain nucleotide amino composition, acids, and noncoding each will ORFs clus- terhave separately frequencies in a approximately 20-dimensional dependent amino acid on thespace nucleotide (Figure composition,1B). each will cluster separately in a 20-dimensional amino acid space (Figure1B). (A) (B) Figure 1. VisualizingVisualizing the the amino amino acid acid composition composition of open of open reading reading frames. frames. (A) Comparison (A) Comparison of average of average amino aminoacid occur- acid occurrencerence across across 14,179 14,179 phage phage genomes. genomes. Points Points correspond correspond to the to 20 the different 20 different amino amino acids acidsand are and labeled are labeled according according to their to International Union of Pure and Applied Chemistry (IUPAC) single letter abbreviations. The observed frequencies come their International Union of Pure and Applied Chemistry (IUPAC) single letter abbreviations. The observed frequencies from the annotated genome consensus gene calls, while the expected come from the overall codon probabilities calculated come from the annotated genome consensus gene calls, while the expected come from the overall codon probabilities from the GC content. Amino acids above the diagonal identity line occur more frequently than expected in coding open calculatedreading frames from (ORFs), the GC and content. those Amino below it acids occur above less frequent the diagonally than identity expected, line which occur alludes more frequently to a coding than bias expected signal. (B in) codingThe averaged open reading amino framesacid frequencies (ORFs), and of co thoseding below ORFs itchange occur lessbased frequently on the GC than cont expected,ent. The whichprevious alludes consensus to a coding calls were bias signal.averaged (B )for The each averaged genome amino and acidthen frequenciesplotted using of principle coding ORFs component change basedanalysis on (PCA), the GC and content. are colored The previous based consensuson the GC calls were averaged for each genome and then plotted using principle component analysis (PCA), and are colored based on the GC content of the genome. The (red) lower GC content genomes tend to favor the amino acids (FYNKI) with AT-rich codons, while the yellow high-GC content genomes tend to favor the amino acids (PRAGW) with GC-rich codons.

Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes

Microbial Gene Identification Using Interpolated Markov Models Steven L

Steven L. Salzberg

By Katherine E. L. Cox a Dissertation Submitted to Johns Hopkins University in Conformity with the Requirements for the Degree O

Improved Microbial Gene Identification with GLIMMER

The Effect of Machine Learning Algorithms on Metagenomics Gene

Genome Sequencing and Annotation

Gene Prediction with Glimmer for Metagenomic Sequences Augmented by Classification and Clustering

MCB 1200/1201, Virus Hunting

Machine Learning for Metagenomics: Methods and Tools

Glimmer-MG Release Notes Version 0.1

Information Theory, Graph Theory and Bayesian Statistics Based Improved and Robust Methods in Genome Assembly