Consensus Clustering and Functional Interpretation of Gene-Expression Data
Open Access
Method
Swift et al. 2004, Volume 5, Issue 11, Article R94

Stephen Swift*, Allan Tucker*, Veronica Vinciotti*, Nigel Martin†, Christine Orengo‡, Xiaohui Liu* and Paul Kellam§

Addresses: *Department of Information Systems and Computing, Brunel University, Uxbridge UB8 3PH, UK. †School of Computer Science and Information Systems, Birkbeck College, London WC1E 7HX, UK. ‡Department of Biochemistry and Molecular Biology, University College London, London WC1E 6BT, UK. §Virus Genomics and Bioinformatics Group, Department of Infection, Windeyer Institute, 46 Cleveland Street, University College London, London W1T 4JF, UK.

Correspondence: Paul Kellam. E-mail: [email protected]

Published: 1 November 2004
Received: 4 December 2003
Revised: 15 March 2004
Accepted: 13 September 2004

Genome Biology 2004, 5:R94

The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2004/5/11/R94

© 2004 Swift et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Consensus clustering, a new method for analyzing microarray data that takes a consensus set of clusters from various algorithms, is shown to perform better than individual methods alone.

Abstract
Microarray analysis using clustering algorithms can suffer from lack of inter-method consistency in assigning related gene-expression profiles to clusters. Obtaining a consensus set of clusters from a number of clustering methods should improve confidence in gene-expression analysis.
Here we introduce consensus clustering, which provides such an advantage. When coupled with a statistically based gene functional analysis, our method allowed the identification of novel genes regulated by NFκB and the unfolded protein response in certain B-cell lymphomas.

Background
There are many practical applications that involve the grouping of a set of objects into a number of mutually exclusive subsets. Methods to achieve the partitioning of objects related by correlation or distance metrics are collectively known as clustering algorithms. Any algorithm that applies a global search for optimal clusters in a given dataset will run in exponential time to the size of problem space, and therefore heuristics are normally required to cope with most real-world clustering problems. This is especially true in microarray analysis, where gene-expression data can contain many thousands of variables. The ability to divide data into groups of genes sharing patterns of coexpression allows more detailed biological insights into global regulation of gene expression and cellular function.

Many different heuristic algorithms are available for clustering. Representative statistical methods include k-means, hierarchical clustering (HC) and partitioning around medoids (PAM) [1-3]. Most algorithms make use of a starting allocation of variables based, for example, on random points in the data space or on the most correlated variables, and which therefore contain an inherent bias in their search space. These methods are also prone to becoming stuck in local maxima during the search. Nevertheless, they have been used for partitioning gene-expression data with notable success [4,5]. Artificial Intelligence (AI) techniques such as genetic algorithms, neural networks and simulated annealing (SA) [6] have also been used to solve the grouping problem, resulting in more general partitioning methods that can be applied to clustering [7,8]. In addition, other clustering methods developed within the bioinformatics community, such as the cluster affinity search technique (CAST), have been applied to gene-expression data analysis [9]. Importantly, all of these methods aim to overcome the biases and local maxima involved during a search but to do this requires fine-tuning of parameters.

Recently, a number of studies have attempted to compare and validate cluster method consistency. Cluster validation can be split into two main procedures: internal validation, involving the use of information contained within the given dataset to assess the validity of the clusters; or external validation, based on assessing cluster results relative to another data source, for example, gene function annotation. Internal validation methods include comparing a number of clustering algorithms based upon a figure of merit (FOM) metric, which rates the predictive power of a clustering arrangement using a leave-one-out technique [10]. This and other metrics for assessing agreement between two data partitions [11,12] readily show the different levels of cluster method disagreement. In addition, when the FOM metric was used with an external cluster validity measure, similar inconsistencies are observed [13].

These method-based differences in cluster partitions have led to a number of studies that produce statistical measures of cluster reliability either for the gene dimension [14,15] or the sample dimension of a gene-expression matrix. For example, the confidence in hierarchical clusters can be calculated by perturbing the data with Gaussian noise and subsequent reclustering of the noisy data [16]. Resampling methods (bagging) have been used to improve the confidence of a single clustering method, namely PAM in [17]. A simple method for comparison between two data partitions, the weighted-kappa metric [18], can also be used to assess gene-expression cluster consistency. This metric rates agreement between the classification decisions made by two or more observers. In this case the two observers are the clustering methods. The weighted-kappa compares clusters to generate the score within the range -1 (no concordance) to +1 (complete concordance) (Table 1). A high weighted-kappa indicates that the two arrangements are similar, while a low value indicates that they are dissimilar. In essence, the weighted-kappa metric is analogous to the adjusted Rand index used by others to compare cluster similarity [16,19].

Table 1. The weighted-kappa guideline

Weighted-kappa      Agreement strength
0.0 ≤ K ≤ 0.2       Poor
0.2 < K ≤ 0.4       Fair
0.4 < K ≤ 0.6       Moderate
0.6 < K ≤ 0.8       Good
0.8 < K ≤ 1.0       Very good

Despite the formal assessment of clustering methods, there remains a practical need to extract reliably clustered genes from a given gene-expression matrix. This could be achieved by capturing the relative merits of the different clustering algorithms and by providing a usable statistical framework for analyzing such clusters. Recently, methods for gene-function prediction using similarities in gene-expression profiles between annotated and uncharacterized genes have been described [20]. To circumvent the problems of clustering algorithm discordance, Wu et al. used five different clustering algorithms and a variety of parameter settings on a single gene-expression matrix to construct a database of different gene-expression clusters. From these clusters, statistically significant functions were assigned using existing biological knowledge.

In this paper, we confirm previous work showing gene-expression clustering algorithm discordance using a direct measurement of similarity: the weighted-kappa metric. Because of the observed variation between clustering methods, we have developed techniques for combining the results of different clustering algorithms to produce more reliable clusters. A method for clustering gene-expression data using resampling techniques on a single clustering method has been proposed for microarray analysis [19]. In addition, Wu et al. showed that clusters that are statistically significant with respect to gene function could be identified within a database of clusters produced from different algorithms [20]. Here we describe a fusion of these two approaches using a 'consensus' strategy to produce both robust and consensus clustering (CC) of gene-expression data and assign statistical significance to these clusters from known gene functions. Our method is different from the approach of Monti et al., in that different clustering algorithms are used rather than perturbing the gene-expression data for a single algorithm [19]. Our method is also distinct from the cluster database approach of Wu et al. [20]. There, clusters from different algorithms were in effect fused if the consensus view of all algorithms indicated that the gene-expression profiles clustered independently of the method. In the absence of a defined rule base for selecting clustering algorithms, we have implemented clustering methods from the statistical, AI and data-mining communities to prevent 'cluster-method type' biases.
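
As an illustration of how two clustering results can be scored against each other, the following Python sketch computes a kappa-style agreement between two partitions of the same genes and maps the score to the strength labels of Table 1. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes agreement is judged over all gene pairs (same cluster versus different cluster in each partition), which gives a 2 x 2 table on which weighted and unweighted kappa coincide. The function names `pair_agreement_kappa` and `agreement_strength` are hypothetical, and labelling negative scores as 'Poor' is an assumption, since Table 1 starts at 0.0.

```python
from itertools import combinations


def pair_agreement_kappa(labels_a, labels_b):
    """Kappa agreement between two partitions of the same genes.

    Assumption (not taken from the paper's text): each gene pair is
    classified as 'same cluster' or 'different cluster' by each method,
    and Cohen's kappa is computed on the resulting 2 x 2 table.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same genes")

    # counts[same_in_a][same_in_b]; booleans index as 0 or 1
    counts = [[0, 0], [0, 0]]
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        counts[same_a][same_b] += 1

    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("need at least two genes")

    observed = (counts[0][0] + counts[1][1]) / total
    same_rate_a = (counts[1][0] + counts[1][1]) / total
    same_rate_b = (counts[0][1] + counts[1][1]) / total
    # Agreement expected by chance from the marginal rates
    expected = (same_rate_a * same_rate_b
                + (1 - same_rate_a) * (1 - same_rate_b))
    if expected == 1.0:
        # Both partitions are all-in-one or all-singletons: identical
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_strength(kappa):
    """Map a kappa score to the Table 1 guideline labels.

    Assumption: scores below 0.0, which Table 1 does not list, are
    reported as 'Poor'.
    """
    if kappa <= 0.2:
        return "Poor"
    if kappa <= 0.4:
        return "Fair"
    if kappa <= 0.6:
        return "Moderate"
    if kappa <= 0.8:
        return "Good"
    return "Very good"


# Two methods that group the genes identically but name clusters differently
method_1 = [0, 0, 1, 1]
method_2 = [1, 1, 0, 0]
score = pair_agreement_kappa(method_1, method_2)
print(score, agreement_strength(score))
```

Because the comparison is made over gene pairs, the score does not depend on how each method names its clusters, only on which genes it places together.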
When consensus clustering was used with probabilistic measures of cluster membership derived from external validation with gene function annotations, it was possible to accurately and rapidly identify specific transcriptionally co-regulated