Biodata Mining Biomed Central

BioData Mining BioMed Central Software article Open Access Fast Gene Ontology based clustering for microarray experiments Kristian Ovaska, Marko Laakso and Sampsa Hautaniemi* Address: Computational Systems Biology Laboratory, Institute of Biomedicine and Genome-Scale Biology Program, Biomedicum Helsinki, University of Helsinki, PO Box 63 (Haartmaninkatu 8), 00014 UNIVERSITY OF HELSINKI, Finland Email: Kristian Ovaska - [email protected]; Marko Laakso - [email protected]; Sampsa Hautaniemi* - [email protected] * Corresponding author Published: 21 November 2008 Received: 8 April 2008 Accepted: 21 November 2008 BioData Mining 2008, 1:11 doi:10.1186/1756-0381-1-11 This article is available from: http://www.biodatamining.org/content/1/1/11 © 2008 Ovaska et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Analysis of a microarray experiment often results in a list of hundreds of disease- associated genes. In order to suggest common biological processes and functions for these genes, Gene Ontology annotations with statistical testing are widely used. However, these analyses can produce a very large number of significantly altered biological processes. Thus, it is often challenging to interpret GO results and identify novel testable biological hypotheses. Results: We present fast software for advanced gene annotation using semantic similarity for Gene Ontology terms combined with clustering and heat map visualisation. The methodology allows rapid identification of genes sharing the same Gene Ontology cluster. Conclusion: Our R based semantic similarity open-source package has a speed advantage of over 2000-fold compared to existing implementations. From the resulting hierarchical clustering dendrogram genes sharing a GO term can be identified, and their differences in the gene expression patterns can be seen from the heat map. These methods facilitate advanced annotation of genes resulting from data analysis. Background protein product encoded by a gene) in terms of their bio- A microarray experiment may result in hundreds of differ- logical processes, cellular components and molecular entially expressed genes that are subject to interpretation functions [1]. The GO annotations are carried out by cura- and further analysis. As analysing these lists gene-by-gene tors of several bioinformatics databases, so the GO data- is tedious and error prone, the genes in the lists are rou- base is constantly updated. The ontology defines terms tinely annotated using Gene Ontology (GO) with an aim that are linked together to form a directed acyclic graph. to identify statistically significant biological processes or Gene products are annotated with a number of ontology pathways [1]. However, statistical analysis of GO annota- terms. Annotation with a given term also implies annotations can produce a very large number of significantly tion with all ancestors of the term. enriched or down-regulated biological processes. Thus, it is often challenging to interpret GO results and identify In this study we present methodology and software to novel testable biological hypotheses. cluster genes based on their biological functionality using GO annotations. Integral part of the methodology is the The GO project provides a species-independent control- ability to rapidly compute pair-wise distances between the led vocabulary for describing gene products (an RNA or gene annotation similarities. Page 1 of 8 (page number not for citation purposes) BioData Mining 2008, 1:11 http://www.biodatamining.org/content/1/1/11 Two approaches to gene similarity computation are graph and were introduced in the context of GO by Lord et al. structure -based (GS) and information content -based [7]. To compute the semantic similarity between terms t1 (IC) measures. GS-based methods use the hierarchical and t2, we first find the most informative common ances- structure of GO in computing gene similarity. IC-based tor A of t1 and t2, i.e., A is a term that is an ancestor of both methods additionally consider the a priori probabilities, t1 and t2 and has the maximum IC among common ances- or information contents, of GO terms in a reference gene tors CommonAnc(t1,t2) of the terms. Now, the Resnik sim- set. IC-based measures have been found to perform better ilarity [8] is defined as than pure graph-based measures [2,3]. SimResnik(t1, t2) = IC(A). Czekanowski-Dice similarity [4] is a GS-based method. Distance of genes G1 and G2 is defined as Several other measures are defined that also take the information contents of t1 and t2 into account. The Lin measure Δ [9] is defined as = #(GO ( G12 ) GO ( G )) dG(,12 G ) , #(GO ( G12 )∪+ GO ( G )) #( GO ( G 12 ) ∩ GO ( G )) = 2IC() A where Δ is the symmetric set difference, # is the number of Sim Lin(,tt12 ) . (1) IC() t12+ IC ( t ) elements in a set and GO(Gi) is the set of GO annotations for gene Gi. Similarity can be defined as 1 - d(G1, G2). Jiang and Conrath [10] define a semantic distance metric as In Kappa statistics [5], each gene is represented as a binary vector (g1,...,gN), where gi is 1 if the gene is annotated with dJC(t1, t2) = IC(t1) + IC(t2) - 2IC(A). the GO term gi and 0 otherwise. N is the total number of GO terms under consideration. The corresponding similarity measure for dJC(t1, t2) [6] is given by Similarity of genes G1 and G2 is defined as 1 Sim (,tt )= . OGG,,−AGG JC 12 d (,)tt +1 K = 12 12, JC 12 GG12, 1−A GG12, Finally, the Relevance measure [11] that combines Lin's where O represents observed co-occurrence of GO and Resnik's measures is defined as GG12, terms and A represents random co-occurrence. − − GG12, = 21logpt ( )( pt ( )) = 21IC()( A p ()) A Sim Rel(,tt12 ) max . t∈ CommonAnc(,) t t logpt ( ))log(+ pt ) IC() t+ IC ( t ) O is the relative frequency of agreeing locations in 12 1 2 12 GG12, The MICA-based measures can be modified to take into the two binary vectors, i.e., locations that are either both account so called disjunctive ancestor terms [6]. Two 0 or both 1. A is the expected relative frequency of ancestors a1 and a2 of a term t are disjunctive if there are GG12, independent paths from a1 to t and from a2 to t. Such such locations if the binary vectors were random, taking ancestors represent distinct interpretations of the term t. into account the observed probabilities of 0's and 1's. In the GraSM enhancement, all common disjunctive ancestors of terms t1 and t2 are considered when comput- The following discussion considers IC-based similarity ing Sim(t1,t2) [6]. GraSM modifies the computation of measures. The information content of a GO term is com- IC(A) and can be applied to the Resnik, Lin and Jiang- puted by the frequency of the term occurring in annota- Conrath measures. tions; a rarely used term contains a greater amount of information. Probability for observing a term t is defined After computing the pair-wise term similarities, the next Freq() t step in MICA-based measures is to calculate the similarity as pt()= , where MaxFreq is the maximum fre- MaxFreq between genes G1 and G2. This can be done in several ways quency of all terms [6]. The information content for a and our package supports three most commonly used term t is given as IC(t) = -log2p(t). Probabilities can be esti- methods. In the two simplest methods, the maximum or mated from a corpus of annotations, such as the Gene the mean of pair-wise GO term similarities between anno- Ontology database. tation sets of G1 and G2 is used as the similarity value [12]. That is, when G1 is annotated with terms t1,...,tn and G2 Several related similarity metrics are based on the most ′′ informative common ancestor (MICA) of two GO terms with terms tt1,..., m , pair-wise term similarities form an n Page 2 of 8 (page number not for citation purposes) BioData Mining 2008, 1:11 http://www.biodatamining.org/content/1/1/11 Czekanowski-Dice, Kappa, Resnik (with GraSM as an × m matrix S. Now, Simgene(G1, G2) is the maximum or the option), Jiang-Conrath (GraSM), Lin (GraSM), Relevance, mean of the matrix. In the third method, similarity is Cosine and SimGIC. The MICA-based measures (Resnik, defined as Sim (G , G ) = max{rowScore, columnScore} gene 1 2 Lin, Jiang-Conrath, Relevance and GraSM enhancements) [11], where are implemented as a combination of R and C++ code; the four other measures are implemented in R. In addition to n m the regular R package, csbl.go is available as a component ==11 rowScore ∑ maxSSij and columnScore ∑ max ij. for Anduril [15], a framework for high-throughput data n 1≤≤jm m 1≤≤in i=1 j=1 analysis we recently developed. The package is extensively In addition to MICA- and GraSM-based measures, we tested and includes a user guide. have implemented the cosine similarity and SimGIC measures. In cosine similarity [13], each gene G is repre- Similarity computation needs GO term probabilities for the reference gene set. We provide precomputed probabil- sented as a vector (w , w ,...,w ), where each w is IC(t ) if 1 2 N i i ity tables for Homo sapiens, Saccharomyces cerevisiae, G is annotated with the term ti, or 0 otherwise. N is the Caenorhabditis elegans, Drosophila melanogaster, Mus muscu- total number of GO terms under consideration. Similarity lus and Rattus norvegicus. The tables are computed based GG⋅ on all gene and protein annotations for the given organ- of genes G and G is defined as 12, where · is the 1 2 ||||GG12 ism found in the geneontology.org database.

Load more