GAPGOM—An R Package for Gene Annotation Prediction Using GO Metrics Casper Van Mourik1,2, Rezvan Ehsani3,4,5* and Finn Drabløs1*

van Mourik et al. BMC Res Notes (2021) 14:162
https://doi.org/10.1186/s13104-021-05580-1
BMC Research Notes
RESEARCH NOTE, Open Access

Abstract

Objective: Properties of gene products can be described or annotated with Gene Ontology (GO) terms. But for many genes we have limited information about their products, for example with respect to function. This is particularly true for long non-coding RNAs (lncRNAs), where the function in most cases is unknown. However, it has been shown that annotation as described by GO terms to some extent can be predicted by enrichment analysis on properties of co-expressed genes.

Results: GAPGOM integrates two relevant algorithms, lncRNA2GOA and TopoICSim, into a user-friendly R package. Here lncRNA2GOA does annotation prediction by co-expression, whereas TopoICSim estimates similarity between GO graphs, which can be used for benchmarking of prediction performance, but also for comparison of GO graphs in general. The package provides an improved implementation of the original tools, with substantial improvements in performance and documentation, unified interfaces, and additional features.

Keywords: Gene ontology, Annotation, Prediction, Long non-coding RNAs

*Correspondence: [email protected]; [email protected]
1 Department of Cancer Research and Molecular Medicine, NTNU-Norwegian University of Science and Technology, 7491 Trondheim, Norway
5 Present Address: Department of Informatics, University of Bergen, 5020 Bergen, Norway
Full list of author information is available at the end of the article

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction

The properties of gene products like proteins or non-protein coding RNAs (ncRNAs) can be described with Gene Ontology (GO) terms as provided by the GO Consortium [1]. GO resources describe a gene product by annotating it with standardized terms organized as a DAG (Directed Acyclic Graph). Such annotation should ideally be based on experimental data. However, the amount of experimental information that is available varies a lot, and for many genes very little is known about their function. This is particularly true for most ncRNAs, including long ncRNAs (lncRNAs). The number of lncRNAs seems to be comparable to the number of protein-coding genes [2], and they are known to be important in several processes of gene regulation [3], but any useful annotation is in most cases missing. However, it has been shown for example by Jiang et al. that annotation by GO terms to some extent can be predicted [4]. A typical approach will identify sets of genes with correlated expression pattern across several experiments, based on the assumption that genes with correlated expression pattern may be involved in similar processes. It will then use enrichment analysis of GO terms for the gene set to identify these processes. This approach was recently benchmarked using an improved strategy for comparing known and predicted GO terms for genes with known annotation [5]. Although the benchmarking had to be done on protein-coding genes (due to the lack of annotated lncRNA genes), the results indicate that the same approach can be used also on other classes of genes, like lncRNAs.

Here we present a user-friendly and well-documented implementation of two tools for annotation prediction and benchmarking, lncRNA2GOA (lncRNA to GO Annotation) [5] and TopoICSim (Topological Information Content Similarity) [6], integrated into the R package GAPGOM (Gene Annotation Prediction using GO Metrics). This package provides a workbench for exploring annotation prediction. We also show how these tools can be the basis for alternative approaches, by using other types of annotation terms or in combination with GO-based definition of gene sets for enrichment analysis.

Main text

Overview

GAPGOM integrates the tools lncRNA2GOA and TopoICSim together with libraries on properties like gene expression and annotation data into a pipeline as illustrated in Fig. 1. The main use of GAPGOM will be to go from expression data for a query gene to predicted GO annotation using lncRNA2GOA, by first defining a set of co-expressed genes by correlating the expression of the query gene to other genes across several experiments. The available methods for correlation represent both statistical (Pearson, Spearman, Kendall) and geometrical (Fisher, Sobolev) measures. The tool will then do an enrichment analysis on GO terms for the most correlated genes, by default using the top 250 genes, which previously has been found to be the optimal cutoff [5]. The enriched terms are used as predicted annotation for the query gene. The predicted GO annotation can optionally be compared to actual GO annotation on benchmark sets by using TopoICSim, for example in a simple leave-one-out approach. The overall performance will be indicative of the level of performance that can be expected also for novel genes. Both alternative similarity measures and more advanced cross-validations can easily be integrated with GAPGOM through the R framework.

The lncRNA2GOA tool can also be used in alternative approaches as indicated in Table 1. The main approach described above is Approach 1. A similar approach can be used for predicting “class” rather than GO annotation (Approach 2). The “class” can be any type of annotation term for which we have good training data, as for example whether the gene may be associated with specific types of cancer. This is basically the same as the standard approach, except that we need annotation of “class” rather than GO for the training data. Here we do not need TopoICSim for benchmarking as it will not involve comparisons between complex graph structures of GO terms. Instead, we will typically use a 2 × 2 confusion matrix where predictions on the benchmark set are classified as true or false positive or negative predictions.

The TopoICSim tool can also be used in another alternative approach (Approach 3) by predicting “class” from gene sets defined through similarity of GO terms rather than gene expression. This means that we can use GO annotation to predict “class”, which can be more specific and informative regarding the property we are investigating, compared to a set of GO terms. Again, benchmark scoring will typically be based on a 2 × 2 confusion matrix.
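The main approach (Approach 1 in the Overview) can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the GAPGOM implementation or its R interface. All function names and the toy data are hypothetical, Pearson correlation is just one of the measures listed in the text, and the one-sided hypergeometric test is an assumed choice of enrichment test, since the text does not specify which test is used.

```python
from math import comb, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def top_coexpressed(query, expression, k):
    """Return the k gene ids whose profiles correlate best with the query."""
    ranked = sorted(expression,
                    key=lambda g: pearson(query, expression[g]),
                    reverse=True)
    return ranked[:k]


def hypergeom_pvalue(hits, set_size, annotated, population):
    """P(X >= hits) when set_size genes are drawn from a population
    in which `annotated` genes carry the term (assumed enrichment test)."""
    upper = min(set_size, annotated)
    tail = sum(comb(annotated, i) * comb(population - annotated, set_size - i)
               for i in range(hits, upper + 1))
    return tail / comb(population, set_size)


def predict_terms(query, expression, annotation, k=250, alpha=0.05):
    """Terms enriched among the k most co-expressed genes, as {term: p}."""
    gene_set = top_coexpressed(query, expression, k)
    population = len(expression)
    candidates = {t for g in gene_set for t in annotation.get(g, ())}
    enriched = {}
    for term in candidates:
        annotated = sum(1 for g in expression if term in annotation.get(g, ()))
        hits = sum(1 for g in gene_set if term in annotation.get(g, ()))
        p = hypergeom_pvalue(hits, len(gene_set), annotated, population)
        if p <= alpha:
            enriched[term] = p
    return enriched


# Toy data (invented): three genes rise with the query, three fall.
expression = {
    "g1": [2.0, 4.0, 6.0, 8.0], "g2": [1.0, 2.0, 3.0, 5.0],
    "g3": [0.0, 1.0, 1.0, 2.0], "g4": [4.0, 3.0, 2.0, 1.0],
    "g5": [8.0, 6.0, 4.0, 2.0], "g6": [5.0, 3.0, 3.0, 1.0],
}
annotation = {"g1": {"term_A"}, "g2": {"term_A"}, "g3": {"term_A"},
              "g4": {"term_B"}, "g5": {"term_B"}, "g6": {"term_B"}}
print(predict_terms([1.0, 2.0, 3.0, 4.0], expression, annotation, k=3, alpha=0.1))
```

The default `k=250` mirrors the cutoff reported in the text. A real analysis would also need multiple-testing correction across terms, which this sketch leaves out.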
The use of lncRNA2GOA and TopoICSim has been integrated into the R package GAPGOM, with improved usability and user-friendliness, function documentation and vignettes, significant speed improvements, a more unified input interface, unit tests, and extra features. The main speedup is seen in TopoICSim with more efficient use of R-specific data structures and improvements in the algorithm. Integration through R also makes it easy to add new features. The package is available on Bioconductor and GitHub. Documentation and vignettes are made available in Rmarkdown and HTML (webpage) form and include both an extensive introduction to the package (An introduction to GAPGOM) and examples of benchmarking (Benchmarks and other GO similarity methods).

Fig. 1 The GAPGOM pipeline for annotation prediction. The flowchart shows the main steps of GAPGOM. For a query gene and a certain property type, a gene set is defined based on similarity over a property library. The gene set is then populated with data from an annotation library and enriched terms of the gene set are identified. The predicted annotation may optionally be compared to actual annotation if the query gene is part of a benchmark dataset (i.e., with known annotation)

Table 1 Main alternative approaches for annotation prediction with GAPGOM tools

Approach   Tool          Property     Annotation   Benchmark(a)
1          lncRNA2GOA    Expression   GO           TopoICSim
2          lncRNA2GOA    Expression   Class        2 × 2
3          TopoICSim     GO           Class        2 × 2

(a) 2 × 2 represents benchmarking using a confusion matrix over true and false positive and negative predictions

Applications

[...]

In our first approach to the challenge (the “direct” approach in [9]) the different cancer types were used as annotation terms for the lncRNAs in the training set. For benchmarking each lncRNA was then “re-annotated” with lncRNA2GOA by identifying co-expressed lncRNA genes of the training set and doing enrichment analysis on cancer type for the co-expressed set. This gave a statistically significant enrichment for the expected cancer type, but the enrichment was not very strong. This probably indicates that cancer type may not be strongly associated with co-expression in normal cells.
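The leave-one-out re-annotation and the 2 × 2 scoring described above can be sketched as follows. This is a hedged Python illustration, not the procedure used in the study. The function names, class labels and neighbour sets are invented, and a simple majority vote among co-expressed neighbours stands in for the enrichment analysis.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive):
    """Count TP, FP, FN and TN for one class over paired labels."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def leave_one_out(labels, predict):
    """Re-annotate each gene using all other genes as training data.

    labels:  {gene: class}
    predict: callable(gene, training_labels) -> predicted class
    """
    actual, predicted = [], []
    for gene, label in labels.items():
        training = {g: c for g, c in labels.items() if g != gene}
        actual.append(label)
        predicted.append(predict(gene, training))
    return actual, predicted


# Toy benchmark set (invented): known class per gene and, per gene,
# a hypothetical list of its most co-expressed genes.
labels = {"a": "class_X", "b": "class_X", "c": "class_X",
          "d": "other", "e": "other", "f": "other"}
neighbours = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "d", "e"],
              "d": ["e", "f"], "e": ["d", "f"], "f": ["a", "b"]}


def majority_class(gene, training):
    """Stand-in predictor: most common class among co-expressed neighbours."""
    votes = Counter(training[g] for g in neighbours[gene] if g in training)
    return votes.most_common(1)[0][0]


actual, predicted = leave_one_out(labels, majority_class)
print(confusion_matrix(actual, predicted, positive="class_X"))
```

Passing the predictor as a callable keeps the benchmarking loop independent of how the class is predicted, so an enrichment-based predictor could be substituted without changing the scoring.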
