Special Issue Article

Received: 21 September 2011, Revised: 3 January 2012, Accepted: 3 January 2012, Published online in Wiley Online Library: 2012

(wileyonlinelibrary.com) DOI: 10.1002/cem.1424

Coclustering—a useful tool for chemometrics

Rasmus Bro(a)*, Evangelos E. Papalexakis(b), Evrim Acar(a) and Nicholas D. Sidiropoulos(c)

Nowadays, chemometric applications can readily deal with tens of thousands of variables, for instance, in omics and environmental analysis. Other areas of chemometrics also deal with distilling relevant information from highly information-rich data sets. Traditional tools such as principal component analysis or hierarchical clustering are often not optimal for providing succinct and accurate information from high rank data sets. A relatively little known approach that has shown significant potential in other areas of research is coclustering, where a data matrix is simultaneously clustered in its rows and columns (usually objects and variables). Coclustering is the tool of choice when only a subset of variables is related to a specific grouping among objects. Hence, coclustering allows a select number of objects to share a particular behavior on a select number of variables. In this paper, we describe the basics of coclustering and use three different example data sets to show the advantages and shortcomings of coclustering. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords: clustering; coclustering; L1 norm; sparsity

1. INTRODUCTION

The chemometric field is dealing with increasingly complex data, for instance, in omics, quantitative structure–activity relationships, and environmental analysis. It is not uncommon to use hyphenated methods for measuring thousands of chemical compounds. This is quite different from traditional chemometric applications, for instance, in spectroscopy, where the number of variables (wavelengths) may be high but the actual number of chemicals reflected in the data—the chemical rank—is typically low. Approaches such as principal component analysis (PCA) are very well suited for analyzing fairly low rank data, especially when the gathered data are known to be relevant to the problem being investigated. Traditional clustering techniques are more useful for exploratory analyses of "classical" data. However, with the increasing number of variables being measured nowadays, there is an interesting opposite trend toward not being interested in modeling the full data. Instead, the focus is often on finding a few, so-called, biomarkers. A biomarker can be a specific chemical compound indicative of a pathological condition or indicative of intake of certain foodstuffs. Thus, even though the actual amount of data and "information" increases, at the same time, the need for simplifying the visualization, interpretation, and understanding increases.

In coclustering, a data matrix is simultaneously clustered in its rows and columns (usually objects and variables). Coclustering is by no means new [11], but it has attracted considerable interest in recent years because of some algorithmic developments and its promising performance in various applications—particularly in bioinformatics [15].

One of the main advantages of coclustering is that it clusters both objects (samples) and variables simultaneously. Suppose we have a data set that shows the food intake of various items for a group of people from Belgium and Korea. In order to find the clusters in this data set, we may use a simple approach where the samples are clustered first, and subsequently, the variables are clustered. It is conceivable that the main clusters could be exactly Asian and European because, overall, the main difference in intake relates to cultural differences. Hence, clustering among samples would split the samples into these two groups. It is also conceivable that there could be another grouping because of, for example, some people preferring fish. However, because fish-related items are only a small part of the variables and fish lovers appear in both populations, such a cluster cannot be realized. On the other hand, coclustering could capture both a country and a fish cluster because it considers which samples are related with which variables at the same time, rather than one modality at a time.

Hence, coclustering is the tool of choice when subsets of subjects are related with respect to corresponding subsets of variables. For some coclustering methods, it also holds that an individual subject (or variable) can belong to several (or no) clusters. This is so-called overlapping coclustering, as opposed to non-overlapping coclustering, where each variable is assigned to at most one cluster.

In the following, we describe the theory behind coclustering and subsequently exemplify coclustering on a toy data set reflecting different kinds of animals, on a data set of chromatographic measurements of olive oils, as well as on cancer gene expression data.

* Correspondence to: R. Bro, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, DK-1958 Frederiksberg, Denmark. E-mail: [email protected]

a R. Bro, E. Acar: Department of Food Science, Faculty of Life Sciences, University of Copenhagen, DK-1958 Frederiksberg, Denmark

b E. E. Papalexakis: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

c N. D. Sidiropoulos: Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA


2. THEORY

We assume that our data form a matrix X of dimensions I × J.

2.1. Coclustering with sparse matrix regression

Coclustering can be formulated as a constrained outer product decomposition of the data matrix, with sparsity on the latent factors of the bilinear model [17]. Each cocluster is represented by a rank-1 component of the decomposition. Instead of using a plain bilinear model, sparsity on the latent factors is imposed. Intuitively, latent sparsity selects the appropriate rows and columns that belong to each cocluster, rendering all other coefficients that do not belong to a certain cocluster exactly zero. Hence, each bilinear component represents a cocluster. Mathematically, this coclustering scheme may be stated as the minimization of the following loss function:

\[
\min_{\mathbf{A},\,\mathbf{B}} \; \left\| \mathbf{X} - \mathbf{A}\mathbf{B}^{T} \right\|_{F}^{2} \;+\; \lambda \sum_{i,k} \left| A_{ik} \right| \;+\; \lambda \sum_{j,k} \left| B_{jk} \right|
\]

where A and B are matrices of size I × K and J × K, respectively; K corresponds to the number of extracted coclusters. The sum of absolute values is used as a sparsity-inducing surrogate for the number of nonzero elements (see, for example, Ref. [19]), and λ is a sparsity-controlling parameter.

The loss function can be interpreted as a constrained version of a bilinear model such as PCA. Rotations such as varimax [12] also aim at simplicity and sparsity, but they do so in a lossless manner, where the actual bilinear approximation of the data is left unchanged. It is merely rotated toward a simpler view that will not usually lead to real sparsity.

Doubly sparse matrix factorization as shown above has been proposed earlier [13,20]. Witten et al. [20] proposed adding sparsity-inducing hard one-norm constraints on both left and right latent vectors, as a variation of sparse singular value decomposition and sparse canonical correlation analysis. Although their model was not developed with coclustering in mind, it is similar to sparse matrix regression (SMR), which uses soft one-norm penalties instead of hard constraints (and possibly non-negativity when appropriate). Algorithmically, Witten et al. [20] use a deflation algorithm that extracts one rank-1 component at a time, instead of alternating optimization across rank-1 components as in SMR. Lee et al. [13] proposed a similar approach specifically for coclustering. However, their algorithm is not guaranteed to converge because the penalties are not kept fixed during iterations. As a result, the algorithm in Lee et al. [13] does not monotonically reduce a tangible cost function, and instabilities are not uncommon.

In Papalexakis et al. [18], a coordinate descent algorithm is proposed in order to solve the given optimization problem. More specifically, one may solve this problem in an alternating fashion, where each subproblem is basically a least absolute shrinkage and selection operator (lasso) problem [16,19]. We have to note that a global minimum for the bilinear problem may not be attained; the existing algorithms guarantee a local minimum or saddle point solution only.

The SMR coclustering algorithm [18] may be characterized as a soft or fuzzy coclustering algorithm, in the sense that cocluster membership is not merely zero or one, but can be any value in between. Some rows and columns may not be assigned to any cocluster, and overlapping coclusters are allowed and can be extracted.
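To make the alternating strategy concrete, the following is a minimal Python sketch of this kind of coordinate descent. It is an illustration of the idea, not the authors' reference implementation; the initialization, stopping rule, and parameter defaults are our own arbitrary choices. Each column update is a one-coefficient-per-row lasso problem with a closed-form soft-thresholding solution.

```python
import numpy as np

def smr_cocluster(X, K, lam, nonneg=True, max_iter=500, tol=1e-9, seed=0):
    """Sketch of SMR coclustering: minimize
    ||X - A B^T||_F^2 + lam*(sum|A_ik| + sum|B_jk|)
    by cycling over the K rank-1 components and updating each factor
    column with all other factors held fixed."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    A = rng.random((I, K))
    B = rng.random((J, K))

    def soft(z, t):
        # Elementwise soft-thresholding operator (the lasso solution).
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    prev = np.inf
    for _ in range(max_iter):
        for k in range(K):
            # Residual with component k removed.
            R = X - A @ B.T + np.outer(A[:, k], B[:, k])
            # Closed-form lasso update of A[:, k] for fixed B[:, k].
            bb = B[:, k] @ B[:, k]
            if bb > 0:
                A[:, k] = soft(R @ B[:, k], lam / 2.0) / bb
            if nonneg:
                A[:, k] = np.maximum(A[:, k], 0.0)
            # Symmetric update of B[:, k] for the fixed, updated A[:, k].
            aa = A[:, k] @ A[:, k]
            if aa > 0:
                B[:, k] = soft(R.T @ A[:, k], lam / 2.0) / aa
            if nonneg:
                B[:, k] = np.maximum(B[:, k], 0.0)
        loss = ((X - A @ B.T) ** 2).sum() + lam * (np.abs(A).sum() + np.abs(B).sum())
        if prev - loss < tol:
            break
        prev = loss
    return A, B
```

Samples and variables with nonzero entries in column k of A and B constitute cocluster k; with an active penalty, many of these coefficients are exactly zero, which is what yields the coclustering interpretation.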
It follows that when sparsity is imposed to such an extent that rows and columns are completely left out, the concept of assessing residual sums of squares or fit values is not meaningful, or at least not meaningful in the same sense as for ordinary least squares fitting. Therefore, other means for evaluating the usefulness of a model are needed. Such means are described in the following section on metaparameters. Also, interpreting why certain samples or variables are left "orphan" may be useful for understanding the coclustering. This is usually an application-specific problem.

One may add non-negativity constraints to the given loss function formulation, which can be readily applied within the existing coordinate descent algorithm with minor modifications.

Although our focus here will be on SMR coclustering because of its appropriateness for chemometric applications, there are several types of coclustering models and algorithms that are popular in other areas and worth mentioning. Banerjee et al. [1,3,8] have introduced a class of coclustering algorithms that use Bregman divergences, unified in an abstract framework. Bregman coclustering is a hard coclustering technique, in the sense that it seeks to locate a non-overlapping "checkerboard" structure in the data. This type of coclustering is typically not of interest in chemometrics, where one often deals with data that contain large numbers of potentially irrelevant variables. Dhillon [7] has formulated coclustering as a bipartite graph partitioning problem, originally in the context of coclustering of documents and words from a document corpus. This algorithm can also be classified as hard coclustering. In addition, this algorithm works for non-negative data only. Initial testing of various algorithms has shown that the appearance of local minima is a common problem. In fact, most hard coclustering algorithms seem to have much more pronounced problems with local minima than soft coclustering ones. Furthermore, the possible local minima in soft coclustering are often distinct (e.g., rank deficient) and hence easier to spot. Other approaches that are more distantly related are the methods presented by Damian et al. [4] and Friedman and Meulman [9], which do not account for sparsity, and the hard coclustering method of Hageman et al. [10], which uses a genetic algorithm that is sensitive to local minima.

2.2. Metaparameters

For SMR, there are certain metaparameters, that is, the penalty λ and the number of coclusters, that need to be chosen. The number of coclusters must be selected in most coclustering methods, but for SMR, which is not based on hard clustering, it is found that in many cases, the clusters are exactly or approximately nested as we increase the number of clusters. Hence, for example, for a solution with five coclusters, it is often found that the first three coclusters are approximately equal to the solution found using only three coclusters. The reason for this approximate nestedness is currently being investigated further. In any case, it greatly simplifies the use of the method. For hard coclustering methods, a similar behavior is naturally not observed.

In practice, the metaparameters are mostly determined in the following way: the penalty for a given number of components is chosen so that it is active. Choosing a λ that is too small would give an inactive penalty, and choosing a λ that is too big would lead to some components/coclusters with all zero values. A simple line search can be implemented to find a value of λ that is active without leading to all zeros. It is generally seen that the specific setting of λ is not critical, but of course, any automatically determined value of λ can be further refined. This has not been pursued here.

In order to determine the number of coclusters, a fairly ad hoc approach has been used. Because coclustering is used for exploratory analysis and the solution is nested, we simply extract sufficiently many components to explain the main clusters. More rigorous approaches such as cross-validation could be implemented, but we do not see the predictive ability of coclustering as a very meaningful criterion to optimize. Rather, we find that interpretability of the clusters is what is often sought and what we focus on here.
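As a sketch of such a line search (again an illustration only: the bounds, the geometric bisection, and the sparsity tests are our assumptions, and smr_cocluster refers to the sketch in Section 2.1):

```python
import numpy as np

def find_active_lambda(X, K, lam_lo=1e-4, lam_hi=None, n_steps=15):
    """Search for a penalty that is active (induces exact zeros) without
    turning an entire component/cocluster to zero."""
    if lam_hi is None:
        # Crude upper bound: certainly strong enough to zero out everything.
        lam_hi = 2.0 * np.abs(X).sum()
    best = None
    for _ in range(n_steps):
        lam = np.sqrt(lam_lo * lam_hi)  # geometric bisection
        A, B = smr_cocluster(X, K, lam)
        dead = any((A[:, k] == 0).all() or (B[:, k] == 0).all() for k in range(K))
        active = (A == 0).any() or (B == 0).any()
        if dead:
            lam_hi = lam    # too strong: a whole cocluster vanished
        elif not active:
            lam_lo = lam    # too weak: the penalty has no effect
        else:
            best = lam      # acceptable; try a slightly stronger penalty
            lam_lo = lam
    return best
```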


Table 1. Animal data set used to illustrate coclustering

| Animal | Has eyes | Number of legs/arms | Carnivore | Feather | Wings | Domesticized | Eaten by Caucasians | >100 kg | >2 m | Breathe under water | Extinct | Dangerous | Life expectancy | Random | Has a beak | Walk on two legs | Speed (MPH) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Giraffe | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 30 | 1 | 0 | 0 | 32 |
| Cow | 1 | 4 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 15 | 3 | 0 | 0 | 30 |
| Lion | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 15 | 6 | 0 | 0 | 50 |
| Gorilla | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 30 | 2 | 0 | 1 | 25 |
| Fly | 1 | 6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 7 | 0 | 0 | 5 |
| Spider | 1 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 8 | 0 | 0 | 1 |
| Shark | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 50 | 4 | 0 | 0 | 30 |
| House | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 100 | 9 | 0 | 0 | 0 |
| Horse | 1 | 4 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 15 | 2 | 0 | 0 | 40 |
| Elephant | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 35 | 6 | 0 | 0 | 25 |
| Mammoth | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 35 | 5 | 0 | 0 | 25 |
| Sabre Tiger | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 15 | 7 | 0 | 0 | 40 |
| Pig | 1 | 4 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 25 | 8 | 0 | 0 | 11 |
| Cod | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 40 | 9 | 0 | 0 | 2 |
| Eel | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 55 | 1 | 0 | 0 | 20 |
| Jellyfish | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.7 | 3 | 0 | 0 | 1 |
| Dolphin | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 30 | 5 | 0 | 0 | 35 |
| Nemo | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 6 | 0 | 0 | 4 |
| Shrimp | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0.5 |
| Dog | 1 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 8 | 0 | 0 | 35 |
| Cat | 1 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 25 | 9 | 0 | 0 | 30 |
| Fox | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 4 | 0 | 0 | 42 |
| Wolf | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 18 | 3 | 0 | 0 | 25 |
| Rabbit | 1 | 4 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 9 | 8 | 0 | 0 | 35 |
| Chicken | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 15 | 1 | 1 | 1 | 9 |
| Eagle | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55 | 3 | 1 | 1 | 60 |
| Seagull | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 6 | 1 | 1 | 25 |
| Blackbird | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | 1 | 1 | 25 |
| Bat | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 | 4 | 0 | 0 | 8 |
| T. Rex. | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 40 | 9 | 0 | 1 | 25 |
| Neanderthal | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 50 | 8 | 0 | 1 | 18 |
| Triceratops | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 30 | 5 | 0 | 0 | 10 |
| Man | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 80 | 2 | 0 | 1 | 28 |
| Penguin | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 4 | 1 | 1 | 25 |
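The analyses in Section 4 use two different preprocessings of such data: auto-scaling for PCA and scaling without centering for the non-negative SMR models. A minimal sketch, assuming the data of Table 1 are held in a NumPy array X with samples in rows:

```python
import numpy as np

def autoscale(X):
    """Center each variable and scale it to unit variance (used for PCA)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def scale_only(X):
    """Scale each variable to unit variance without centering; this keeps
    non-negative data non-negative, as required when SMR is fitted with
    non-negativity constraints."""
    return X / X.std(axis=0, ddof=1)
```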


Figure 1. Preprocessed chromatographic data (intensity versus time, arbitrary units).

3. MATERIALS AND METHODS

A toy data set is constructed for illustrating the behavior of coclustering in general. This data set shows attributes of different animals, and the data were not made particularly meticulously. Several variables are not well defined, but this is of moderate consequence in this context. Also, the data were made from the authors' point of view, for example, in terms of which animals are domesticized. In Table 1, the data set is tabulated. Note that the data also include an outlying sample (house) and an outlying variable (random).

As another example, data from Refs [5,6] are analyzed. One hundred twenty-six oil samples are analyzed by HPLC coupled to a charged aerosol detector. Of the oil samples, 68 were various types and grades of olive oils, and the remaining were either non-olive vegetable oils or non-olive vegetable oils mixed with olive oil. The HPLC method is aimed at providing a triacylglyceride profile of the oils. The triacylglycerides are known to have a distinct pattern for olive oils. The data were baseline corrected and aligned as described in the original work, and the resulting data after removal of a few outliers are shown in Figure 1.

As a final data set, we looked at a typical gene expression data set. A total of 56 samples were selected from a cohort of lung cancer patients assayed by using the Affymetrix 95av2 GeneChip brand oligonucleotide array. The 56 patients represent four distinct histological types: normal lung, pulmonary carcinoid tumors, colon metastases, and small cell carcinoma. The data have been described in several publications [2,14] and also analyzed using coclustering [13]. The original data set contains 12 625 genes. Unlike most publications, no pre-selection to reduce the number of genes is performed here. Rather, coclustering is applied directly on the data. The data set holds information on 56 patients, of which 20 are pulmonary carcinoid samples, 13 colon cancer metastasis samples, 17 normal lung samples, and 6 small cell carcinoma samples. The data set is fairly easy to cluster into these four groups. The data and the algorithm can be found at www.models.life.ku.dk (January 2012).

4. RESULTS

4.1. Looking at the animal data set

It is interesting to investigate the outcome of a simple PCA model on the auto-scaled animal data. In Figure 2, a score plot of the first two components of a PCA model is shown. Component 1 seems to reflect birds, which is verified from the loading vector that has high values for the variables feather, wings, has a beak, and walk on two legs. Component 2, though, is difficult to interpret and seems to reflect a mix of different properties. This is also apparent from the loading plot. Looking at components 3 and 4 (Figure 3), similar complications arise in interpreting the meaning of the different components. All but the first component reflect several phenomena in a contrast fashion, and often, it is difficult to extract and distinguish the important variation.

Turning to SMR, a model is fitted using six coclusters. Similar results are obtained with different numbers of coclusters, but we chose six here to exemplify the results. The data are scaled, not centered, and non-negativity is imposed. It is possible to plot the resulting components/clusters as ordinary PCA components in scatter or line plots. However, the semi-discrete nature of the clusters sometimes makes such visualizations less efficient. Instead, we have developed a plot where each cluster is shown by the labels of all samples and variables larger than a threshold. This threshold was set to 20% of the maximum but was inactive here because all elements smaller than 20% of the maximum were exactly zero. Furthermore, the size of the label indicates the size of the element. This provides an intuitive visualization, as shown in Figure 4 for the six-cocluster SMR model; a sketch of such a label plot is given below.

It is striking how easy it is to assess the meaning of this model compared with the PCA model. Looking at the coclusters one at a time, it is observed that cocluster 1 is a bird cocluster. Cocluster 2 is given by one variable (extinct) and is evident. Cocluster 3 comprises big animals. Note how several samples in coclusters 2 and 3 coincide. Animals in cocluster 4 are "grown" and eaten by people, and cocluster 5 captures animals living in water. Finally, cocluster 6 is too dense to allow an easy interpretation. It is apparently a cocluster relating to the overall variation and is in this sense taking care of the offsets induced by the lack of centering.
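A sketch of such a label plot (our illustrative rendering, not the authors' plotting code) could look as follows, using the factors returned by the smr_cocluster sketch in Section 2.1:

```python
import numpy as np
import matplotlib.pyplot as plt

def label_plot(A, B, sample_names, var_names, threshold=0.2):
    """For each cocluster, print the labels of samples and variables whose
    coefficient exceeds `threshold` times the cluster maximum, with font
    size proportional to the coefficient ('belongingness')."""
    K = A.shape[1]
    fig, axes = plt.subplots(1, K, figsize=(2.5 * K, 6))
    for k, ax in enumerate(np.atleast_1d(axes)):
        col = np.concatenate([A[:, k], B[:, k]])
        names = list(sample_names) + list(var_names)
        top = col.max() if col.max() > 0 else 1.0
        kept = [(c / top, n) for c, n in zip(col, names) if c > threshold * top]
        # Stack the surviving labels vertically, largest coefficient on top.
        for row, (w, name) in enumerate(sorted(kept, reverse=True)):
            ax.text(0.05, 1.0 - (row + 1) / (len(kept) + 1), name,
                    fontsize=6 + 8 * w, transform=ax.transAxes)
        ax.set_title(f"Cluster {k + 1}")
        ax.axis("off")
    plt.show()
```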


Figure 2. Top: score plot of PCA model (PC1, 30.26% versus PC2, 18.30%). Bottom: corresponding loading plot.

There is a dramatic difference in how easy it is to visualize the results of PCA and SMR, but the data set is simple in the sense that there are no significant amounts of irrelevant variation. In order to see how SMR can deal with irrelevant variation, 30 random variables (uniformly distributed) were added to the original 17 variables. The data were scaled such that each variable had unit variance, and SMR was performed. In Figure 5, it is seen that the method very nicely distinguishes between the animal-related information and the random variables. All coclusters but cocluster 7 are easy to interpret. Cocluster 7 is not sparse at all—it comprises almost all variables and all samples. Also, note that the remaining coclusters are not identical to the coclusters found before, but they are indeed fairly similar.

4.2. Olive oils

For the olive oil data set, a nice separation is achieved with three coclusters. Adding more does not seem to change the coclusters obtained in the three-cocluster model, and the added coclusters are not immediately meaningful. In Figure 6, it is seen that cocluster 1 reflects olive oils, whereas cocluster 2 reflects non-olive oils. The mixed samples containing some olive oils are placed in between. The third cocluster seems to reflect only a fraction of the olive oils. This is likely related to the olive oils being a very diverse class of samples, spanning from pomace to extra virgin oil. The corresponding elution profiles of each cluster are meaningful. The first (olive oil) cocluster has peaks around 300 and 400 (arbitrary units), and those peaks represent the main olive oil triacylglycerides (triolein, 1,2-olein-3-palmitin, and 1,2-olein-3-linolein). Likewise, the non-olive oil cocluster represents trilinolein, 1,2-linolein-3-olein, and 1,2-linolein-3-palmitin, which are frequent in non-olive oils. It is satisfying to see that the olive oil samples are clustered together, as desired, even though SMR is an unsupervised approach that does not use any prior or side information.

The results obtained with coclustering are not too different from what would be obtained with PCA. In fact, it is somewhat disturbing that there is a distinct lack of sparsity. Although the model makes sense from a chemical point of view, little sparsity is seen, for example, in loadings 1 and 2 (on the other hand, loading 3 is sparse, and so are scores 2 and 3 to a certain extent). As described in the theory section, the magnitude of the L1 penalty is automatically chosen, but it turns out that it is not possible to obtain more sparsity than shown here.

Figure 3. Scores 3 and 4 from a PCA model (PC3, 13.15%; PC4, 8.87%; scores plotted versus sample).


Figure 4. Sparse matrix regression coclusters of animal data. Font size indicates "belongingness" to the cluster. [Each of the six panels lists the samples and variables assigned to one cocluster; for example, cluster 1 contains Penguin, Blackbird, Seagull, Eagle, and Chicken together with the variables walk on two legs, has a beak, wings, and feather.]

Figure 5. Sparse matrix regression coclusters with 30 random variables added to the data.

Manually increasing λ leads to a model where one component/cocluster turns all zero and hence becomes rank deficient. This points to a problem with the current coclustering approach. Because λ is the same for both the row and the column modes, problems or lack of sparsity may occur when the modes are quite different in dimension. The lack of sparsity is likely caused by the strong collinearity as well as by the lack of intrinsic sparsity in this type of data. It is questionable if coclustering as defined here is a suitable model for spectral-like data such as these. A more suitable approach could be an elastic net-type coclustering [21], which would allow the natural collinearities to be represented in the clusters; a possible formulation is sketched below. This seems like an interesting research direction.

Note that for this particular data set, it would be possible to integrate the chromatographic peaks and thereby obtain discrete data that would be more suitable for coclustering.
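Returning to the elastic net suggestion above, one plausible way to write such an objective (a sketch drawn by analogy with Ref. [21], not a formulation developed in the present work) is

\[
\min_{\mathbf{A},\,\mathbf{B}} \; \left\| \mathbf{X} - \mathbf{A}\mathbf{B}^{T} \right\|_{F}^{2}
\;+\; \lambda_{1} \Big( \sum_{i,k} |A_{ik}| + \sum_{j,k} |B_{jk}| \Big)
\;+\; \lambda_{2} \left( \|\mathbf{A}\|_{F}^{2} + \|\mathbf{B}\|_{F}^{2} \right)
\]

where the ridge term weighted by λ2 is known to encourage strongly correlated variables to receive similar coefficients, so that collinear elution variables could enter a cocluster together rather than one of them being selected arbitrarily.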


Figure 6. Three sparse matrix regression clusters are shown. Top plots show sample clusters (cluster belongingness versus sample number, with olive, non-olive, and mixed samples indicated) and bottom plots show elution time clusters (cluster belongingness versus retention time).

Figure 7. Sparse matrix regression coclusters of cancer data color-coded according to cancer class.

The intention, though, with the given example is to illustrate the behavior of coclustering on continuous data.

4.3. Cancer

When analyzing the gene expression data, the four different cancer types come out immediately when we fit a four-cocluster model, as shown in Figure 7, where the four cancer classes are color coded. It is apparent that the four cancer classes are perfectly clustered, but it is also apparent that the gene mode shows little sparsity in comparison with the patient mode. Hence, coclustering does not provide the sparsity desired in order to be able to talk meaningfully of specific biomarkers.

Performing a PCA on the same data (auto-scaled) provides a very clear grouping into the four cancer types (not shown). The separation is not as perfect as in Figure 7, but the tendency is very clear. Lee et al. [13] also performed coclustering with an algorithm similar to the SMR algorithm. The coclustering in the sample space that they obtained resembles the one obtained using PCA more than the distinct coclustering obtained in Figure 7. This, however, can be explained by the fact that penalties are chosen differently by Lee et al. using a Bayesian

information criterion. Regardless, as also observed with the SMR algorithm, the algorithm of Lee et al. produces solutions that are not as sparse as expected in the gene mode.

5. CONCLUSION

The basic principles behind coclustering have been explained, and a new model and algorithm have been favorably compared with common methods such as PCA. It is shown that coclustering can provide meaningful and easily interpretable results on both fairly simple and complex data compared with more traditional approaches. Limitations were encountered when the number of irrelevant samples grew too high and when spectral-like data were analyzed. More elaborate algorithms need to be developed for handling such situations.

Acknowledgements

N. Sidiropoulos was supported in part by ARO grant W911NF-11-1-0500.

REFERENCES

1. Banerjee A, Merugu S, Dhillon IS, Ghosh J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005; 6: 1705–1749.
2. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. U.S.A. 2001; 98: 13790–13795.
3. Cho H, Dhillon IS, Guan Y, Sra S. Minimum sum-squared residue co-clustering of gene expression data. Proceedings of the Fourth SIAM International Conference on Data Mining 2004; 114–125.
4. Damian D, Oresic M, Verheij E, Meulman J, Friedman J, Adourian A, Morel N, Smilde A, van der Greef J. Applications of a new subspace clustering algorithm (COSA) in medical systems biology. Metabolomics 2007; 3: 69–77.
5. de la Mata-Espinosa P, Bosque-Sendra JM, Bro R, Cuadros-Rodriguez L. Discriminating olive and non-olive oils using HPLC-CAD and chemometrics. Anal. Bioanal. Chem. 2011a; 399: 2083–2092.
6. de la Mata-Espinosa P, Bosque-Sendra JM, Bro R, Cuadros-Rodriguez L. Olive oil quantification of edible vegetable oil blends using triacylglycerols chromatographic fingerprints and chemometric tools. Talanta 2011b; 85: 177–182.
7. Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2001; 269–274.
8. Dhillon IS, Mallela S, Modha DS. Information-theoretic co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2003; 89–98.
9. Friedman JH, Meulman JJ. Clustering objects on subsets of attributes. J. Roy. Stat. Soc. B 2004; 66: 815–849.
10. Hageman JA, van den Berg RA, Westerhuis JA, van der Werf MJ, Smilde AK. Genetic algorithm based two-mode clustering of metabolomics data. Metabolomics 2008; 4: 141–149.
11. Hartigan JA. Direct clustering of a data matrix. J. Am. Stat. Assoc. 1972; 67: 123–129.
12. Kaiser HF. The varimax criterion for analytic rotation in factor analysis. Psychometrika 1958; 23: 187–200.
13. Lee M, Shen H, Huang JZ, Marron JS. Biclustering via sparse singular value decomposition. Biometrics 2010; 66: 1087–1095.
14. Liu Y, Hayes DN, Nobel A, Marron JS. Statistical significance of clustering for high-dimension, low-sample size data. J. Am. Stat. Assoc. 2008; 103: 1281–1293.
15. Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004; 1: 24–45.
16. Osborne MR, Presnell B, Turlach BA. On the LASSO and its dual. J. Comput. Graph. Stat. 2000; 9: 319–337.
17. Papalexakis EE, Sidiropoulos ND. Co-clustering as multilinear decomposition with sparse latent factors. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, 2011.
18. Papalexakis EE, Sidiropoulos ND, Garofalakis MN. Reviewer profiling using sparse matrix regression. 2010 IEEE International Conference on Data Mining Workshops, 2010; 1214–1219.
19. Tibshirani R. Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 1996; 58: 267–288.
20. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009; 10: 515–534.
21. Zou H, Hastie T. Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B 2005; 67: 301–320.
