Deep Autoencoder Neural Networks for Gene Ontology Annotation Predictions
Davide Chicco∗
Politecnico di Milano
Dipartimento di Elettronica, Informazione e Bioingegneria
Milan, Italy
[email protected]

Peter Sadowski†
University of California, Irvine
Dept. of Computer Science
Institute for Genomics and Bioinformatics
Irvine, CA, USA
[email protected]

Pierre Baldi
University of California, Irvine
Dept. of Computer Science
Institute for Genomics and Bioinformatics
Irvine, CA, USA
[email protected]

∗ corresponding author
† corresponding author

ABSTRACT

The annotation of genomic information is a major challenge in biology and bioinformatics. Existing databases of known gene functions are incomplete and prone to errors, and the biomolecular experiments needed to improve these databases are slow and costly. While computational methods are not a substitute for experimental verification, they can help in two ways: algorithms can aid in the curation of gene annotations by automatically suggesting inaccuracies, and they can predict previously-unidentified gene functions, accelerating the rate of gene function discovery. In this work, we develop an algorithm that achieves both goals using deep autoencoder neural networks. With experiments on gene annotation data from the Gene Ontology project, we show that deep autoencoder networks achieve better performance than other standard machine learning methods, including the popular truncated singular value decomposition.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning; J.3 [Life and Medical Sciences]: Biology and Genetics; H.2.8 [Database Applications]: Data mining

Keywords

biomolecular annotations, matrix completion, autoencoders, neural networks, Gene Ontology, truncated singular value decomposition, principal component analysis

1. INTRODUCTION

In bioinformatics, a controlled gene function annotation is a binary matrix associating genes or gene products with functional features from a controlled vocabulary. These annotations are important for effective communication in biomedical research, and lay the groundwork for bioinformatics software tools and data mining investigations. The in vitro biomolecular experiments used to validate gene functions are expensive, so the development of computational methods to identify errors and prioritize new biomolecular experiments is a worthwhile area of research [1].

The Gene Ontology project (GO) is a bioinformatics initiative to characterize all the important features of genes and gene products within a cell [2] [3]. GO is composed of three controlled vocabularies structured as mostly-separate sub-ontologies: biological processes, cellular components, and molecular functions. Each GO sub-ontology is structured as a directed acyclic graph of features (nodes) and ontological relationships (edges). In January 2014, GO contained 39,000 terms: more than 25,450 biological processes, 2,250 cellular components, and 9,650 molecular functions. However, GO annotations are constantly being revised and added as new experimental evidence is produced.

One approach to improving gene function annotation databases like GO is to use patterns in the known annotations to predict new annotations. This can be viewed as a matrix-completion problem, in which one attempts to recover a matrix with some underlying structure from noisy observations. Machine learning algorithms have proved very successful in similar applications, such as the famous million-dollar Netflix prize awarded in 2009. Many machine learning algorithms have already been applied to gene function annotation ([4] [5] [6] [7] [8] [9]), but to the best of our knowledge deep autoencoder neural networks have not. Deep networks with multiple hidden layers have an advantage over shallow machine learning methods in that they are able to model complex data with greater efficiency. They have proven their usefulness in fields such as vision and speech recognition, and promise to yield similar performance gains in other machine learning applications that have complex underlying structure in the data.

A popular algorithm for matrix completion is the truncated singular value decomposition method (tSVD). Khatri et al. first used this method for GO annotation prediction [10], and one of the authors of this work has extended their method with gene clustering and term-term similarity weights [11] [12]. However, the tSVD method can be viewed as a special linear case of a more general approach using autoencoders [13] [14] [15]. Deep, non-linear autoencoder neural networks have more expressive power, and may be better suited for discovering the underlying patterns in gene function annotation data. In this paper, we summarize the tSVD and autoencoder methods, show how they can be used to predict annotations, and compare their performance on six separate GO datasets.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
BCB'14, September 20–23, 2014, Newport Beach, CA, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2894-4/14/09 ...$15.00.
http://dx.doi.org/10.1145/2649387.2649442
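The claim that tSVD is a special linear case of the autoencoder approach can be illustrated numerically. The following sketch (a toy numpy experiment on a made-up binary matrix, not the paper's code or data) trains a single-hidden-layer linear autoencoder by gradient descent and compares its reconstruction error against the optimal rank-k tSVD reconstruction; in the linear case the autoencoder's optimum coincides with the projection onto the top-k singular subspace, so the two errors should nearly match:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 12, 3
# Synthetic binary "annotation" matrix, ~30% density (illustrative only)
A = (rng.random((m, n)) < 0.3).astype(float)

# Rank-k tSVD reconstruction: the optimal rank-k approximation in Frobenius norm
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_tsvd = (U[:, :k] * s[:k]) @ Vt[:k, :]
err_tsvd = np.linalg.norm(A - A_tsvd)

# Linear autoencoder: encoder W1 (n x k), decoder W2 (k x n),
# trained by full-batch gradient descent on squared reconstruction error
W1 = 0.1 * rng.standard_normal((n, k))
W2 = 0.1 * rng.standard_normal((k, n))
lr = 1e-3
for _ in range(20000):
    H = A @ W1           # hidden codes (m x k)
    E = A - H @ W2       # reconstruction residual
    gW2 = H.T @ E        # gradient direction for the decoder
    gW1 = (A.T @ E) @ W2.T  # gradient direction for the encoder
    W1 += lr * gW1
    W2 += lr * gW2
err_ae = np.linalg.norm(A - A @ W1 @ W2)
```

Since the rank-k tSVD is the best possible rank-k approximation, `err_tsvd` lower-bounds `err_ae`; with enough training steps the linear autoencoder's error approaches that bound. Non-linear, deep autoencoders remove this linearity restriction.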
2. SYSTEM AND METHODS

In this section we describe the two annotation-prediction algorithms used in this paper: Truncated Singular Value Decomposition and the Autoencoder Neural Network.

2.1 Truncated Singular Value Decomposition

Truncated Singular Value Decomposition (tSVD) [16] is a matrix factorization method that produces a low-rank approximation to a matrix. Define $A_d \in \{0,1\}^{m \times n}$ to be a matrix of annotations. The $m$ rows of $A_d$ correspond to genes, while the $n$ columns correspond to GO features, such that

\[
A_d(i,j) =
\begin{cases}
1 & \text{if gene $i$ is annotated with feature $j$,} \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
\]

When features are organized into ontologies, sometimes only the most specific feature is specified, and the more general features (ancestors) are implicit. Thus, in this work we consider a modified matrix $A$ defined as

\[
A(i,j) =
\begin{cases}
1 & \text{if gene $i$ is annotated with feature $j$ or with any descendant of $j$,} \\
0 & \text{otherwise.}
\end{cases}
\tag{2}
\]

The $i$th row of the $A$ matrix ($a_i^T$) contains all the direct and indirect annotations of gene $i$. The $j$th column encodes the list of genes that have been annotated (directly or indirectly) with feature $j$. This process is sometimes defined as annotation unfolding [17].

[Figure 1: An illustration of the Singular Value Decomposition (upper green image) and the Truncated SVD reconstruction (lower blue image) of the $A$ matrix. In the classical SVD decomposition, $A \in \{0,1\}^{m \times n}$, $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V^T \in \mathbb{R}^{n \times n}$. In the truncated decomposition, where $k \in \mathbb{N}$ is the truncation level, $U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, $V_k^T \in \mathbb{R}^{k \times n}$, and the output matrix $\tilde{A} \in \mathbb{R}^{m \times n}$.]

Predictions are produced by computing the SVD of the matrix $A$ and truncating the less-significant singular values. The SVD of the matrix $A$ is given by

\[
A = U \Sigma V^T
\tag{3}
\]

where $U$ is an $m \times m$ unitary matrix (i.e. $U^T U = I$), $\Sigma$ is a non-negative diagonal matrix of size $m \times n$, and $V^T$ is an $n \times n$ unitary matrix (i.e. $V^T V = I$). Conventionally, the entries along the diagonal of $\Sigma$ (the singular values) are sorted in non-increasing order. The number $r \le p$ of non-zero singular values is equal to the rank of the matrix $A$, where $p = \min(m, n)$. For a positive integer $k < r$, the tSVD matrix $\tilde{A}$ is given by

\[
\tilde{A} = U_k \Sigma_k V_k^T
\tag{4}
\]

where $U_k$ ($V_k$) is an $m \times k$ ($n \times k$) matrix obtained by retaining the first $k$ columns of $U$ ($V$), and $\Sigma_k$ is a $k \times k$ diagonal matrix.

The matrix $\tilde{A}$ is real-valued and can be interpreted as a model of the noisy, incomplete observations. It can be used to predict both inaccuracies and missing gene functions: a large value of $\tilde{a}_{ij}$ suggests that gene $i$ should be annotated with term $j$, whereas a value close to zero suggests the opposite. The choice of the truncation parameter $k$ controls the complexity of the model, and affects the predictions. Khatri et al. use a fixed value of $k = 500$ in [10] [18] [19], while one of the authors of this paper has developed a new discrete optimization algorithm to select the best truncation level on the basis of the ROC AUCs, described in [20].

In order to better comprehend why $\tilde{A}$ can be used to predict gene-to-term annotations, we highlight that an alternative expression of Equation (4) can be obtained using basic linear algebra manipulations:

\[
\tilde{A} = A V_k V_k^T
\tag{5}
\]

Additionally, the SVD of the matrix $A$ is related to the eigen-decomposition of the symmetric matrices $T = A^T A$ and $G = A A^T$. The columns of $V_k$ ($U_k$) are a set of $k$ eigenvectors corresponding to the $k$ largest eigenvalues of the matrix $T$ ($G$). The matrix $T$ has a simple interpretation in our context. In fact,

\[
T(j_1, j_2) = \sum_{i=1}^{m} A(i, j_1) \cdot A(i, j_2)
\tag{6}
\]

i.e. $T(j_1, j_2)$ is the number of genes annotated with both terms $j_1$ and $j_2$. Consequently, $T(j_1, j_2)$ indicates the (unnormalized) correlation between term pairs and it can be interpreted as a similarity score of the terms $j_1$ and $j_2$, the computation of which is based exclusively on the use of these terms in available annotations.
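The annotation unfolding of Equation (2) amounts to closing each gene's annotation set under the ontology's ancestor relation: if a gene is annotated with a term, it implicitly carries every ancestor of that term. A minimal sketch (the three-term mini-ontology and the names `parents`, `ancestors`, and `unfold` are hypothetical, not from the paper or the GO tooling):

```python
# Hypothetical mini-ontology: each term maps to its set of parent terms.
parents = {
    "t_leaf": {"t_mid"},
    "t_mid": {"t_root"},
    "t_root": set(),
}

def ancestors(term, parents):
    """All ancestors of `term` in the ontology DAG (excluding the term itself)."""
    seen = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def unfold(direct, parents):
    """Annotation unfolding: add every ancestor of each directly annotated term."""
    full = {}
    for gene, terms in direct.items():
        unfolded = set(terms)
        for t in terms:
            unfolded |= ancestors(t, parents)
        full[gene] = unfolded
    return full

direct = {"gene_1": {"t_leaf"}, "gene_2": {"t_mid"}}
full = unfold(direct, parents)
# gene_1 now carries t_leaf plus its ancestors t_mid and t_root
```

Each unfolded set then becomes a row of the binary matrix $A$.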
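Equations (4), (5), and (6) can be checked numerically on a toy annotation matrix. The sketch below (numpy, with a made-up random binary matrix; not the paper's code) computes the tSVD reconstruction, verifies that it equals the projection $A V_k V_k^T$, and forms the co-annotation matrix $T = A^T A$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = (rng.random((8, 5)) < 0.4).astype(float)  # toy unfolded annotation matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

A_tilde = Uk @ Sk @ Vtk        # Equation (4): rank-k reconstruction
A_tilde_alt = A @ Vtk.T @ Vtk  # Equation (5): projection onto top-k right singular vectors
assert np.allclose(A_tilde, A_tilde_alt)

# Entries of A_tilde act as annotation scores: a large value at (i, j) suggests
# a missing annotation; a near-zero value at an observed annotation suggests an error.

T = A.T @ A  # Equation (6): T[j1, j2] counts genes annotated with both terms
assert np.allclose(T[0, 1], np.sum(A[:, 0] * A[:, 1]))
```

The equivalence of (4) and (5) holds because $A V_k = U_k \Sigma_k$, which follows directly from Equation (3).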