PICKER-HG: a Web Server Using Random Forests for Classifying Human Genes Into Categories

bioRxiv preprint doi: https://doi.org/10.1101/681460; this version posted June 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Fabio Fabris¹, Daniel Palmer², Zoya Farooq², João Pedro de Magalhães²* and Alex A Freitas¹

*Correspondence: [email protected]
²Integrative Genomics of Ageing Group, Institute of Ageing and Chronic Disease, University of Liverpool, L7 8TX Liverpool, UK
Full list of author information is available at the end of the article

Abstract

Motivation: One of the main challenges faced by biologists is how to extract valuable knowledge from the data produced by high-throughput genomic experiments. Although machine learning can be used for this, machine learning tools on the web were, in general, not designed for biologist users. They require users to create suitable biological datasets and often produce results that are hard to interpret.

Objective: Our aim is to develop a freely available web server, named PerformIng Classification and Knowledge Extraction via Rules using random forests on Human Genes (PICKER-HG), aimed at biologists looking for a straightforward application of a powerful machine learning technique (random forests) to their data.

Results: We have developed the first web server that, as far as we know, dynamically constructs a classification dataset, given a list of human genes with annotations entered by the user, and outputs classification rules extracted from a Random Forest model. The web server can also classify a list of genes whose class labels are unknown, potentially assisting biologists investigating the association between class labels of interest and human genes.

Availability: http://machine-learning-genomics.com/

Keywords: Web server, Random Forest, Machine Learning, Classification

Introduction

The increasing volume of freely available biological data from sources such as the Gene Ontology (GO) [1], BioGRID [2], and GTEx [3] has enabled the use of several machine learning methods to assist biologists investigating their data [4, 5] using free and open source tools such as scikit-learn (https://scikit-learn.org/) and WEKA (https://www.cs.waikato.ac.nz/ml/weka/).

However, as far as we know, there is no online tool that applies the standard classification workflow from the machine learning field to biological data. Our freely available web server, the "PerformIng Classification and Knowledge Extraction via Rules using random forests on Human Genes (PICKER-HG)" web server, is designed to fill this niche: it is capable of reading user data in the form of class-labelled human gene lists and then preparing them for classification. Depending on user preferences, the genes are then annotated with either GO terms, Protein-Protein Interactions or baseline expression levels (from GTEx). The selected annotation type is then used by a sophisticated classification algorithm (random forests) to predict the class labels of the provided genes and to extract interpretable IF-THEN-ELSE rules with good predictive power directly from the classification model.

We hope that our web server will be useful to biologists exploring human genes by providing data-driven insights about various complex biological phenomena, possibly assisting in the functional classification of gene sets and the prioritization of candidate genes for further study.

Methods

The classification task handled by this server is the computational problem of inducing a model that maps given genes to class labels (e.g. whether or not a gene is associated with ageing) using annotations describing properties of each gene (e.g. functional annotations or expression levels in different tissues).

To perform this task, two sets of genes are usually available: a training set (which is used to build the model) and a testing set (the candidate genes). The training set contains genes for which the class label is already known, such as genes previously associated with ageing. The testing set, on the other hand, contains genes for which the class label is not known, such as genes not known to be associated with ageing.

Data sources

To compile the training and testing datasets, one needs to describe the instances (genes) using numerical features. One of the advantages of using our server is that once the Entrez IDs of the genes and the classes are defined by the user, the training and testing datasets are automatically generated. The PICKER-HG web server uses 3 sources of data to this end: Gene Ontology (GO) terms, Protein-Protein Interactions (PPI), and GTEx baseline expression levels.

GO features encode which GO terms are associated with a gene (an instance). A value of '1' for this feature means that the gene is known to be associated with the GO term. A value of '0' means that the gene is not known to be associated with the GO term. We have used GO annotations from GO release '2017-03-14'.

PPI features encode, for each gene, the list of proteins that interact with the products of the gene. In practice, a value of '1' for this feature indicates that the gene (or gene product) interacts with a given protein; a value of '0' indicates that there is no evidence for that interaction. We have used PPIs from BioGRID (version 3.4.146).

Lastly, GTEx features encode the expression value of the gene across several tissues. For more information about the types of expression values, please read the "Help" section of our web server. We have used GTEx version '2016-01-15 v7'.

Training the Random Forest

Before applying the classification model to the testing set, it should first be validated. We use the popular 10-fold cross-validation procedure to estimate the error of the classifier.

Once the model has been validated, if its estimated predictive performance is satisfactory, it can then be used to classify the genes in the testing set and to extract potentially useful information about the underlying classification problem. In this work we use a popular classification algorithm called Random Forest (RF) [6]. RF models are formed from several Random Trees (a type of Decision Tree); they achieve good predictive performance [7, 8] and are interpretable to some extent, since they consist of interpretable Decision Trees.

However, it is not practical to interpret all trees in the forest, since there are usually many of them and their predictions are often contradictory. Instead, we propose a "finer-grained" interpretation, converting trees to classification rules and then pruning the rules to improve their generality and reduce overfitting.

[...] Random Forest is equivalent to a set of IF-THEN rules, one rule for each path from the root node of a decision tree to a leaf, where a prediction is made. Given a set of such rules extracted from the RF, which usually comprises thousands of rules, we can find a subset of rules with high predictive performance and return this rule set to the user.

Selecting a measure of predictive performance from the many available measures [9] is a subjective choice, and using a single measure of predictive performance to rank the rules often puts more weight on one aspect of the classification performance than on others. For this reason, the PICKER-HG server reports four measures of predictive performance ('coverage', 'hits', 'precision' and 'hits − errors', defined next).

'Coverage' is the number of genes covered by the rule, i.e., the number of genes satisfying all conditions in the rule; 'hits' is the number of genes among the covered genes that were correctly classified by the rule; 'precision' is equal to 'hits' divided by 'coverage'. The default ranking criterion (from highest to lowest value) is 'hits − errors', in other words, the number of correctly classified genes minus the number of incorrectly classified genes.

We have used the following algorithm to extract rules from the RF. First, we take the final RF classification model and extract the equivalent rule set from it. Next, we discard all rules that satisfy at least one of the following criteria: 1) have a precision lower than 0.5, 2) cover fewer than 3 genes, or 3) have a precision smaller than the relative frequency of the class label. We call this subset of filtered rules R_good. Next, a post-processing algorithm takes each rule in R_good in turn and tries to simplify it by executing two procedures, as follows.

Procedure 1: Search for the single condition whose removal increases the precision of the rule the most (if there are ties, choose an arbitrary condition). If such a condition is found, permanently remove the condition from the rule and repeat Procedure 1. If no condition is found by Procedure 1 (there is no condition whose removal increases the rule's precision), execute Procedure 2. Note that the removal of conditions by Procedure 1 maintains or increases the precision of the rule and, at the same time, maintains or increases its coverage.

Procedure 2: Search for the single condition whose removal increases the number of 'hits' the most, accepting a decrease in precision if the new precision is [...]
