Analysis of Genetic Association Using Hierarchical Clustering and Cluster Validation Indices
Inti A. Pagnuco1,2,3, Juan I. Pastore1,2,3, Guillermo Abras1, Marcel Brun1,2, and Virginia L. Ballarin1
1 Digital Image Processing Lab, ICyTE, UNMdP; 2 Department of Mathematics, School of Engineering, UNMdP; 3 CONICET

Abstract
It is usually assumed that co-expressed genes suggest co-regulation in the underlying regulatory network. Determining sets of co-expressed genes, based on some criterion of similarity, is therefore an important task. This task is usually performed by clustering algorithms, where the genes are clustered into meaningful groups based on their expression values in a set of experiments. In this work we propose a method to find sets of co-expressed genes, based on cluster validation indices as a measure of similarity for individual gene groups, and on a combination of variants of hierarchical clustering to generate the candidate groups. We evaluated its ability to retrieve significant sets on simulated correlated data and on real genomic data, where performance is measured by the ability to detect co-regulated sets, compared against a full search. Additionally, we analyzed the quality of the best-ranked groups using an online bioinformatics tool that provides network information for the selected genes.

1 Introduction
New technologies for genomic analysis measure the expression of thousands of genes simultaneously. An important goal in some studies is to find sets of co-expressed genes, in order to study co-regulation and biological functions. This task is usually performed by clustering algorithms, where the whole family of genes, or a subset of them, is clustered into meaningful groups based on their expression values in a set of experiments. These techniques provide insight into possible co-regulation between genes, under the hypothesis that co-expression may indicate evidence of co-regulation; these hypotheses of co-regulation must eventually be corroborated or rejected by further experiments. Existing clustering algorithms are usually based on a global approach [1, 2, 3, 4, 5] or on just pairwise analysis [6]. While some of them provide methods for selecting the best number of clusters [2], they produce a very large number of clusters, depending on the number of groups required or determined by the algorithm. The large size of these sets makes their use as hypothesis generators improbable.

A more comprehensive approach consists of analyzing the co-expression of all possible subsets of genes. While simple, this approach fails because of the need to analyze an exponential number of candidate sets. This search for meaningful sets becomes impractical when the number of genes available for the analysis is large, which is usual in this field.

In this work we present a new algorithm that combines the particularities of hierarchical clustering algorithms with individual cluster validation based on the Silhouette and Dunn indices, to generate a ranked list of gene groups, avoiding the exhaustive search while still providing high-quality results. As in [2], we use a cluster validation index as a quality/homogeneity measure, but instead of using it to generate the partitions, we use it to select the best sets from several partitions obtained by different variants of hierarchical clustering. This approach permits the use of many different clustering algorithms, combining the best results of each one of them, avoiding a full-search approach while still providing, as shown in the results, a good approximation to the optimal results.

The Matlab code is available from https://subversion.assembla.com/svn/pdilab-gaa/tags/Tag.1.1.1

In the next sections we present a) an introduction to pattern recognition tools, b) the proposed algorithm, and c) the results on genomic data. The final conclusions show that these algorithms could be a useful tool for researchers as a preliminary technique for data analysis.

2 Approach

2.1 Pattern Recognition
Pattern recognition techniques have been widely used to identify objects according to their features [7]. Clustering algorithms are a family of these techniques that group unclassified patterns based on similarity.

In this work, the objects to be classified are genes. The data consist of a set of m samples S_1, S_2, ..., S_m and n genes g_1, g_2, ..., g_n, normally represented by a two-dimensional matrix M, where M(i, j) represents the expression of gene g_i in sample S_j. The expression of each gene g_i across all samples corresponds to a row of the matrix M, and is represented by a feature vector X_i = (x_i1, x_i2, ..., x_im), where each value x_ij represents the expression of gene g_i in sample S_j. The feature vector is thus composed of m values, so that the regions of the partition belong to R^m. Each gene can be assigned to one of k possible groups, and the result of a clustering algorithm is a partition of the set of genes into a family W = {W_1, W_2, ..., W_k}, where each W_i is a group of genes with a certain degree of similarity.
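For concreteness, a toy version of this representation is sketched below in Python/NumPy (this is not from the paper; the matrix values and dimensions are made up purely for illustration):

```python
import numpy as np

# Toy expression matrix M: n = 4 genes (rows) x m = 3 samples (columns).
# M[i, j] holds the expression of gene g_(i+1) in sample S_(j+1).
M = np.array([
    [2.1, 2.3, 1.9],   # feature vector X_1 for gene g_1
    [0.4, 0.5, 0.3],   # X_2
    [2.0, 2.2, 1.8],   # X_3
    [5.1, 0.2, 3.3],   # X_4
])

X_1 = M[0]             # profile of g_1 across all samples, a point in R^m
print(M.shape)         # (4, 3): n genes x m samples
```

A clustering algorithm then partitions the rows of M into groups W_1, ..., W_k of similar expression profiles.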
2.2 Hierarchical Clustering
Hierarchical clustering is one of the most used clustering algorithms in bioinformatics [8]. Given a set of N items to be clustered and an N × N distance (or similarity) matrix, the basic process of hierarchical clustering [9] is as follows: (i) Start by assigning each item to a different cluster, so that for N items there are N clusters, each containing just one item; let the distances between the clusters be the same as the distances between the items they contain. (ii) Find the closest pair of clusters and merge them into a single cluster, so that there is now one cluster less. (iii) Compute the distances between the new cluster and each of the old clusters. (iv) Repeat steps (ii) and (iii) until all items are clustered into a single cluster of size N. A more detailed description of hierarchical clustering algorithms can be found in [9, 10].

Step (iii) can be done in different ways, which is what distinguishes single-linkage, complete-linkage and average-linkage clustering. In single-linkage clustering, the distance between two clusters is the shortest distance from any member of one cluster to any member of the other cluster (if the data consist of similarities, the similarity between two clusters is the largest similarity from any member of one cluster to any member of the other cluster). In complete-linkage clustering, the distance between two clusters is the largest distance from any member of one cluster to any member of the other cluster. In average-linkage clustering, the distance between two clusters is the average distance from any member of one cluster to any member of the other cluster.
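As an illustration of these variants, the minimal sketch below uses Python and SciPy (not the authors' Matlab implementation); the random toy matrix, the Euclidean metric and the cut into four clusters are assumptions made only for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
M = rng.normal(size=(20, 6))          # toy data: 20 gene profiles x 6 samples

D = pdist(M, metric="euclidean")      # pairwise distances between gene profiles

# The rule used in step (iii) is what distinguishes the three variants.
for method in ("single", "complete", "average"):
    Z = linkage(D, method=method)                      # merge tree from steps (i)-(iv)
    labels = fcluster(Z, t=4, criterion="maxclust")    # cut the tree into 4 groups
    print(method, labels)
```

Each variant (and each cut level) yields a different partition; the groups produced by all of them can later be pooled as candidate gene sets.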
2.3 Quality measures
A common problem of clustering algorithms is the validation of results. In the next subsections we explain the two types of internal validation [11] used in this work.

2.3.1 Silhouette Index
One of the methods of internal validation used is the Silhouette index, which measures the quality of a clustering as the average quality of its elements [12]. The Silhouette value of an element ranges between −1 and 1; a high value indicates that the element is closer to the elements of its own cluster than to the elements that do not belong to it. To define the Silhouette index for the points, two measures need to be defined first. Let N be the number of clusters, let X ∈ C_k be a point (element) that belongs to the cluster C_k, and let n_k be the number of elements of C_k. The first measure, a(X), the average distance of X to the other points of C_k, is defined by

a(X) = \frac{1}{n_k - 1} \sum_{Y \in C_k,\, Y \neq X} d(X, Y)

The second measure, b(X), the average distance to the nearest cluster, for a point X ∈ C_k, is defined by

b(X) = \min_{h = 1, \dots, N,\; h \neq k} \frac{1}{n_h} \sum_{Y \in C_h} d(X, Y)

Finally, the Silhouette index S(X) is defined by

S(X) = \frac{b(X) - a(X)}{\max(a(X), b(X))}

2.3.2 Dunn Index
The second internal measure is a cluster-specific variant of the Dunn index. For a cluster C_k it is defined by

D(C_k) = \frac{d(C_k, C^*)}{\Delta(C_k)}

where C^* is the set of points not in C_k (the same idea used with the Silhouette index), d(C_k, C^*) is the distance between the two clusters, and \Delta(C_k) is the size of the cluster C_k. We call this index a Cluster Specific Dunn Index.

A key aspect of this work is the selection of a cluster score that reflects the compactness and separateness of each set of profiles relative to the rest of the profiles. Many cluster validation measures can be found in the literature, grouped into three categories: internal, relative and external [17, 18]. For this specific problem, external indices are not viable because the expected partition is simply not known. On the other hand, relative indices were shown to be unreliable regarding cluster quality in a probabilistic framework [19].

… on a list. (c) Once a list of sets and scores is obtained, sort the list based on score (higher to lower).

As an additional approach, we combine all the variants, forming a large set of candidate groups, still smaller than the maximum 2^N sets, and select the best groups from this set. This approach extends the search space, relative to the use of only one algorithm, while keeping the computational requirements low. Another advantage of this approach is that new clustering algorithms may be added to the pool, generating more candidate groups and therefore improving the overall quality of the result. In addition, this approach does not depend on the selection of a specific combination: every algorithm, or variant of an algorithm, used to partition the set of profiles can be added to the combination, because it can only improve the results by adding new sets not previously detected [21, …].
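To make the two per-group scores concrete, the sketch below (Python with NumPy/SciPy, not the authors' Matlab code) computes the mean Silhouette value of a cluster's members and a Dunn-type score for a single cluster, and shows how candidate groups could be ranked as in step (c). The choices of d(C_k, C*) as the minimum point-to-point distance and Δ(C_k) as the cluster diameter are assumptions for the example, since the exact definitions are not spelled out in this excerpt:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def silhouette_cluster_score(M, labels, k):
    """Mean Silhouette value S(X) over the members of cluster k (rows of M)."""
    members = M[labels == k]
    if len(members) < 2:
        return 0.0                                              # singleton: S conventionally 0
    scores = []
    for i, x in enumerate(members):
        a = cdist([x], np.delete(members, i, axis=0)).mean()    # a(X)
        b = min(cdist([x], M[labels == h]).mean()               # b(X)
                for h in np.unique(labels) if h != k)
        scores.append((b - a) / max(a, b))                      # S(X)
    return float(np.mean(scores))

def dunn_cluster_score(M, labels, k):
    """Cluster-specific Dunn-type score D(C_k) = d(C_k, C*) / Delta(C_k).
    Assumes d = minimum point-to-point distance to the complement C* and
    Delta = cluster diameter; the paper's exact choices may differ."""
    Ck, rest = M[labels == k], M[labels != k]
    diameter = pdist(Ck).max() if len(Ck) > 1 else np.finfo(float).eps
    return cdist(Ck, rest).min() / diameter

# Rank every cluster of one candidate partition, best first (step (c)):
# labels = fcluster(Z, t=4, criterion="maxclust")   # from the previous sketch
# ranked = sorted(((k, silhouette_cluster_score(M, labels, k))
#                  for k in np.unique(labels)), key=lambda t: t[1], reverse=True)
```

Pooling the clusters produced by several linkage variants and cut levels, scoring each with one of these indices, and sorting the resulting list follows the combined approach described above.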