Gene Function Prediction

Amal S. Perera

Computer Science & Eng.

University of Moratuwa.

Moratuwa, Sri Lanka.

William Perrizo

Computer Science

North Dakota State University

Fargo, ND 58105

Abstract

Comprehensive genome databases provide several catalogues of information on some genomes, such as yeast (Saccharomyces cerevisiae). This paper presents the potential for using gene annotation data, which includes phenotype, localization, protein class, complexes, enzyme catalogues, pathway information, and protein-protein interactions, in predicting the functional class of yeast genes. We predict a rank-ordered list of genes for each functional class from the available information and determine the relevance of the different input data through systematic weighting using a genetic algorithm. The classification task has many unusual aspects, such as multi-valued attributes, classification into overlapping classes, a hierarchical structure of attributes, many unknown values, and interactions between genes. We use a bit-vector approach that treats each domain value as an attribute. The weight optimization uses Receiver Operating Characteristic evaluation in the fitness measure to handle the sparseness of the data.

1  Introduction

During the last three decades, the emergence of molecular and computational biology has shifted classical genetics research toward the study, at the molecular level, of the genome structure of different organisms. The resulting rapid development of more accurate and powerful tools has produced overwhelming volumes of genomic data, which are being accumulated in more than 300 biological databases. Gene sequencing projects on different organisms have helped to identify tens of thousands of potential new genes, yet their biological function often remains unclear. Different approaches have been proposed for large-scale gene function analysis. Traditionally, functions are inferred through sequence similarity algorithms [1] such as BLAST or PSI-BLAST [2]. Similarity searches have some shortcomings: the function of genes that are identified as similar may itself be unknown, or differences in the sequence may be so significant as to make conclusions unclear. For this reason some researchers use the sequence as an input to classification algorithms that predict function [3]. Another common approach to function prediction uses two steps: genes are first clustered based on similarity in expression data, and the clusters are then used to infer function from genes in the same cluster with known function [4]. Alternatively, function has been directly predicted from gene expression data using classification techniques such as Support Vector Machines [5].

We show the use of gene annotation data, which includes phenotype, localization, protein class, complexes, enzyme catalogues, pathway information, and protein-protein interactions, in predicting the functional class of yeast genes. Phenotype data has previously been used to construct individual rules that predict function for certain genes, based on the C4.5 decision tree algorithm [6]. The gene annotation data we used was extracted from the Munich Information Center for Protein Sequences (MIPS) database [7] and then processed and formatted to fit our purpose. MIPS is a genomic repository of data on Saccharomyces cerevisiae (yeast). Functional classes from the MIPS database include, among many others, Metabolism, Protein Synthesis, and Transcription. Each of these is in turn divided into classes and then again into subclasses, yielding a hierarchy of up to five levels. Each gene may have more than one function associated with it. In addition, MIPS has catalogues with other information such as phenotype, localization, protein class, protein complexes, enzyme catalogue, and data on protein-protein interactions. In this work, we predict functions for genes at the highest level of the functional hierarchy. Despite the fact that yeast is one of the most thoroughly studied organisms, the function of 30-40% of its ORFs currently remains unknown. For about 30% of the ORFs no information whatsoever is available, and for the remaining ones unknown attribute values are very common. This lack of information creates interesting challenges that cannot be addressed with standard data mining techniques. We have developed novel tools with the hope of helping biologists by providing experimental direction concerning the function of genes with unknown functions.

2  Experimental Method

Gene annotations are atypical data mining attributes in many ways. Each of the MIPS catalogues has a hierarchical structure. Each attribute, including the class label, gene function, is furthermore multi-valued. We therefore consider each domain value as a separate binary attribute rather than assigning labels to protein classes, phenotypes, etc. For the class label, this means that we must classify genes into overlapping classes, also referred to as multi-label classification, rather than multi-class classification, in which class labels are disjoint. To this end we represent each domain value of each property as a binary attribute that is either one (the gene has the property) or zero (the gene does not have the property). This representation has some similarity to bit-vector representations in Market Basket Research, in which the items in a shopping cart are represented as 1-values in the bit-vector of all items in a store. The classification problem is correspondingly broken up into a separate binary classification problem for each value in the function domain. The resulting classification problem has more than one thousand binary attributes, each of which is very sparse. Two attribute values should be considered related only if both are '1', i.e., both genes have a particular property. Not much can be concluded if two attribute values are '0', i.e., neither gene has the particular property.
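The encoding step described above can be illustrated with a minimal Python sketch; the gene names, property values, and function names here are purely illustrative, not taken from the MIPS data:

```python
# Encode multi-valued gene annotations as binary attribute vectors:
# one bit per domain value, 1 if the gene has that property.

def encode_bit_vectors(genes, domain):
    """genes: dict mapping gene -> set of property values it has.
    domain: ordered list of all possible property values.
    Returns dict mapping gene -> list of 0/1 bits, one per domain value."""
    return {g: [1 if v in props else 0 for v in domain]
            for g, props in genes.items()}

# Hypothetical example data (not from the paper):
genes = {
    "YAL001C": {"nucleus", "kinase"},
    "YBR072W": {"cytoplasm"},
}
domain = ["nucleus", "cytoplasm", "kinase", "phosphatase"]
vectors = encode_bit_vectors(genes, domain)
# vectors["YAL001C"] -> [1, 0, 1, 0]
```

Each gene's annotation set becomes a sparse bit vector over the full domain, which is the flat-table form later compressed into P-trees.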

Classification is furthermore influenced by the hierarchical nature of the attribute domains. Many models exist for the integration of hierarchical data into data mining tasks, such as text classification, mining of association rules, and interactive information retrieval, among others [16][17][18][19]. Recent work [15] introduces similarity measurements that can exploit a hierarchical domain, but focuses on the case where matching attributes are confined to the leaf level of the hierarchy. The data set we consider in this work has poly-hierarchical attributes, where attributes must be matched at multiple levels. Hierarchical information is represented using a separate set of attribute-bit columns for each level, where each binary position indicates the presence or absence of the corresponding category. Evidence for the use of multiple binary similarity metrics is available in the literature, and usage is based on computability and the requirements of the application [22]. In this work we use the similarity metric identified in the literature as "Russell-Rao", defined as follows [22]. Given two binary vectors Zi and Zj with N dimensions (categorical values), the similarity is

S(Zi, Zj) = (1/N) Σ(k=1..N) Zik · Zjk

The above similarity measure counts the number of matching "1" bits in the two binary vectors and divides it by the total number of bits. A similarity metric is a function that assigns a value to the degree of similarity between objects i and j. For each similarity measure, the corresponding dissimilarity is defined as the distance between the two objects; to be categorized as a metric it must be non-negative, symmetric, reflexive, and satisfy the triangle inequality. Dissimilarity functions that show only partial conformity to these properties are considered pseudo-metrics. The corresponding Russell-Rao dissimilarity satisfies the triangle inequality only when M ≤ N/2, where N is the number of dimensions and M is the number of matching "1" bits. For this application we find the use of the above measure appropriate. It is also important to note that in this application the existence of a categorical attribute for a given object is more informative than its non-existence; in other words, a "1" is more valuable than a "0" in the data. Therefore, the count of matching "1"s is more important for the task than the count of matching "0"s. The P-tree data structure we use also allows us to easily count the number of matching "1"s with the use of a root count operation.
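As a concrete sketch, the Russell-Rao similarity and the matching-"1" count M can be computed directly from two bit vectors (plain Python for illustration only; in the actual system this count is obtained from the P-tree root count operation):

```python
# Russell-Rao similarity: number of positions where both bits are 1,
# divided by the total number of positions N.

def russell_rao(zi, zj):
    assert len(zi) == len(zj)
    m = sum(a & b for a, b in zip(zi, zj))  # M: count of matching "1" bits
    return m / len(zi)

# Two vectors agreeing on two "1" bits out of four positions:
# russell_rao([1, 0, 1, 1], [1, 1, 1, 0]) -> 0.5
```

Note that matching "0"s contribute nothing, which matches the observation above that a shared "1" is informative while a shared "0" is not.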

Similarity is calculated by considering the matching similarity at each individual level. The total similarity is the weighted sum of the individual similarities at each level of the hierarchy. The total weight for attributes that match at multiple levels is thereby higher, indicating a closer match. Counting matching values corresponds to a simple additive similarity model. Additive models are often preferred for problems with many attributes because they can better handle the low density in attribute space, also referred to as the "curse of dimensionality" [8].
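The weighted per-level combination can be sketched as follows; the function name and example weights are ours, and the per-level weights are the quantities that the genetic algorithm in the next section optimizes:

```python
def hierarchical_similarity(zi_levels, zj_levels, weights):
    """Weighted sum of per-level Russell-Rao similarities.
    zi_levels, zj_levels: one bit vector per hierarchy level.
    weights: one weight per level."""
    total = 0.0
    for zi, zj, w in zip(zi_levels, zj_levels, weights):
        m = sum(a & b for a, b in zip(zi, zj))  # matching "1" bits at this level
        total += w * m / len(zi)
    return total

# A pair of genes matching at both hierarchy levels accumulates weight
# from both terms, scoring higher than a pair matching only at the top.
```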

3  Similarity Weight Optimization

Similarity models that consider all attributes as equal, such as K-Nearest-Neighbor classification (KNN), work well when all attributes are similar in their relevance to the classification task. This is, however, often not the case. The problem is particularly pronounced for categorical attributes, which can only have two distances: distance zero if attribute values are equal, and one (or some other fixed distance) if they differ. Many solutions have been proposed that weight dimensions according to their relevance to the classification problem. The weighting can be derived as part of the algorithm [9]. In an alternative strategy, the attribute dimensions are scaled, using, e.g., a genetic algorithm, to optimize the classification accuracy of a separate algorithm such as KNN [10]. Our algorithm is similar to the second approach, which is slower but more accurate. Modifications were necessary due to the nature of the data. Because class label values had a relatively low probability of being '1', we chose to use AROC values instead of accuracy as the criterion for the optimization [11]. Nearest-neighbor evaluation was replaced by the counting of matches as described above. We furthermore included importance measurements in the classification that are entirely independent of the neighbor concept. We evaluate the importance of a gene based on the number of possible genetic and physical interactions its protein has with the proteins of other genes. Interactions with lethal genes, i.e., genes that cannot be removed in gene deletion experiments because the organism cannot survive without them, were considered separately. The number of items of known information, such as localization and protein class, was also considered as an importance criterion.
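A minimal sketch of such a weight-optimization loop, assuming a caller-supplied fitness function (in our setting, the AROC of the weighted-match classifier under a given weight vector). The population size, mutation scale, and selection scheme below are illustrative choices, not the paper's actual parameters:

```python
import random

def evolve_weights(fitness, n_weights, pop_size=20, generations=50, seed=0):
    """Minimal genetic-algorithm sketch for real-valued attribute weights:
    truncation selection, uniform crossover, single-gene Gaussian mutation.
    'fitness' maps a weight vector to a score to maximize (e.g., AROC)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]        # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each weight from either parent.
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            # Mutate one weight, clamped to [0, 1].
            i = rng.randrange(n_weights)
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```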

4  ROC Evaluation

Many measures of prediction quality exist, the best known being prediction accuracy. There are several reasons why accuracy is not a suitable tool for our purposes. One main problem derives from the fact that, commonly, only a few genes are involved (positive) in a given function. This leads to large fluctuations in the number of correctly predicted participant genes (true positives). Secondly, we would like to get a ranking of genes rather than a strict separation into participants and non-participants, since our results may have to be combined with independently derived experimental probability levels. Furthermore, we have to account for the fact that not all functions of all genes have been determined yet. Similarly, there may be genes that are similar to ones involved in the function but are not experimentally seen as such due to masking. Therefore it may be more important, and more feasible, to recognize a potential candidate than to exclude an unlikely one. This corresponds to the situation faced in hypothesis testing: a false negative, i.e., a gene that is not recognized as a candidate, is considered more costly than a false positive, i.e., a gene that is considered a candidate although it is not involved in the function.

The method of choice for this type of situation is ROC (Receiver Operating Characteristic) analysis [11]. ROC analysis is designed to determine the quality of prediction of a given property, such as a gene being involved in a phenotype. Samples that are predicted as positive and indeed have that property are referred to as true positives, whereas samples that are negative but are incorrectly classified as positive are false positives. The ROC curve depicts the rate of true positives as a function of the false positive rate, for all possible probability thresholds. A measure of the quality of prediction is the area under the ROC curve; our prediction results are all given as values of the area under the ROC curve (AROC). To construct a ROC curve, samples are ordered in decreasing likelihood of being positive, and the threshold that delimits prediction as positive is then continuously varied. If all true positive samples are listed first, the ROC curve starts out following the y-axis until all positive samples have been plotted, and then continues horizontally for the negative samples. With appropriate normalization the area under this curve is one. If samples are listed in random order, the rates of true positives and false positives will be equal and the ROC curve will be the diagonal, with area 0.5.
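The AROC can be computed directly from a ranked list without plotting, using the standard identity that the area under the ROC curve equals the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one; the code below is our illustrative sketch, not the paper's implementation:

```python
def area_under_roc(scores, labels):
    """scores: predicted likelihood of being positive (higher = more likely).
    labels: 1 for positive samples, 0 for negative samples."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count positive/negative pairs where the positive outranks the
    # negative; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking (all positives scored above all negatives) gives 1.0:
# area_under_roc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) -> 1.0
```

Because it depends only on the ranking, this measure is stable even when positives are rare, which is exactly the sparse-class situation described above.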

5  Data Representation

The input data was converted to P-trees [12][13][14]. P-trees are a lossless, compressed, and data-mining-ready data structure. This data structure has been successfully applied in data mining applications on real-world data, ranging from classification and clustering with K-Nearest-Neighbor, to classification with decision tree induction, to association rule mining [12][13][14]. A basic P-tree represents one attribute bit, reorganized into a tree structure by recursive sub-division, recording for each division a predicate truth value regarding purity. Each level of the tree contains truth bits that represent pure sub-trees and can be used for fast computation of counts. The construction continues recursively down each tree path until an entirely pure sub-division is reached. Basic and complement P-trees are combined using Boolean algebra operations to produce P-trees for values, entire tuples, value intervals, or any other attribute pattern. The root count of any pattern tree indicates the occurrence count of that pattern, which makes the P-tree data structure well suited to counting patterns efficiently. The data representation can be conceptualized as a flat table in which each row is a bit vector containing a bit for each attribute, or part of an attribute, for each gene. Representing each attribute bit as a basic P-tree generates a compressed form of this flat table.
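A toy sketch of a basic P-tree over a single bit column, heavily simplified: real P-trees use a fixed fan-out, store compressed pure sub-trees, and support Boolean AND/OR/complement between trees, none of which is shown here:

```python
class PTree:
    """Toy P-tree: recursively halve the bit column, stopping at pure runs."""

    def __init__(self, bits):
        self.n = len(bits)
        self.count = sum(bits)                 # number of 1-bits in this subtree
        self.pure = self.count in (0, self.n)  # all-0 or all-1: no recursion needed
        if not self.pure and self.n > 1:
            mid = self.n // 2
            self.left = PTree(bits[:mid])
            self.right = PTree(bits[mid:])

    def root_count(self):
        # Occurrence count of the pattern, read off the root in O(1).
        return self.count

# PTree([1, 0, 1, 1, 0, 0, 1, 0]).root_count() -> 4
```

In the full scheme, ANDing the P-trees of two genes' bit columns and taking the root count of the result yields the matching-"1" count used in the Russell-Rao similarity.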

Experimental class labels and the other categorical attributes were each encoded in single bit columns. Protein interactions were encoded using a bit column for each gene in the data set, where the existence of an interaction with that particular gene was indicated by a truth bit.