<<

COMBINATORIAL OPTIMIZATION TECHNIQUES IN DATA MINING

By STANISLAV BUSYGIN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2007

© 2007 Stanislav Busygin

To all the people of goodwill who helped me along the path

ACKNOWLEDGMENTS

First of all, I would like to express my gratitude to Dr. Panos M. Pardalos for his support and guidance during my PhD studies at the University of Florida. I am grateful to the members of my supervisory committee, Dr. Stan Uryasev, Dr. Joseph Geunes, and Dr. William Hager, for their time and good judgement. I am also very grateful to my collaborators and friends Dr. Sergiy Butenko, Dr. Vladimir Boginski, Dr. Artyom Nahapetyan, and Dr. Oleg Prokopyev for their valuable contributions to our joint research.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
  1.1 General Overview
  1.2 Data Mining Problems and Optimization
2 BICLUSTERING IN DATA MINING
  2.1 The Main Concept
  2.2 Formal Setup
  2.3 Visualization of Biclustering
  2.4 Relation to SVD
  2.5 Methods
    2.5.1 "Direct Clustering"
    2.5.2 Node-Deletion Algorithm
    2.5.3 FLOC Algorithm
    2.5.4 Biclustering via Spectral Bipartite Graph Partitioning
    2.5.5 Matrix Iteration Algorithms for Minimizing Sum-Squared Residue
    2.5.6 Double Conjugated Clustering
    2.5.7 Information-Theoretic Based Co-Clustering
    2.5.8 Biclustering via Gibbs Sampling
    2.5.9 Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)
    2.5.10 Coupled Two-way Clustering
    2.5.11 Plaid Models
    2.5.12 Order-Preserving Submatrix (OPSM) Problem
    2.5.13 OP-Cluster
    2.5.14 Supervised Classification via Maximal δ-valid Patterns
    2.5.15 cMonkey
  2.6 Discussion and Concluding Remarks
3 CONSISTENT BICLUSTERING VIA FRACTIONAL 0–1 PROGRAMMING
  3.1 Consistent Biclustering
  3.2 Supervised Biclustering
  3.3 Fractional 0–1 Programming
  3.4 Algorithm for Biclustering
  3.5 Computational Results
    3.5.1 ALL vs. AML
    3.5.2 HuGE Index data set
  3.6 Conclusions and Future Research
4 AN OPTIMIZATION-BASED APPROACH FOR DATA CLASSIFICATION
  4.1 Basic Definitions
  4.2 Optimization Formulation and Classification Algorithm
  4.3 Computational Experiments
    4.3.1 ALL vs. AML Data Set
    4.3.2 Colon Cancer Data Set
  4.4 Conclusions
5 GRAPH MODELS IN DATA MINING
  5.1 Cluster Cores Based Clustering
  5.2 Decision-Making under Constraints of Conflicts
  5.3 Conclusions
6 A NEW TRUST REGION TECHNIQUE FOR THE MAXIMUM WEIGHT CLIQUE PROBLEM
  6.1 Introduction
  6.2 The Motzkin–Straus Theorem for Maximum Clique and Its Generalization
  6.3 The Trust Region Problem
  6.4 The QUALEX-MS Algorithm
  6.5 Computational Experiment Results
  6.6 Remarks and Conclusions

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 HuGE index biclustering
6-1 DIMACS maximum clique benchmark results
6-2 Performance of QUALEX-MS vs. PBH on random weighted graphs

LIST OF FIGURES

2-1 Partitioning of samples and features into 3 clusters
2-2 Coclus H1 algorithm
2-3 Coclus H2 algorithm
2-4 Gibbs biclustering algorithm
3-1 heuristic
3-2 ALL vs. AML heatmap
3-3 HuGE index heatmap
4-1 Data classification algorithm
5-1 Cluster cores based clustering algorithm
5-2 Example of two CaRTs for a ...
6-1 New-best-in weighted heuristic
6-2 NBIW-based graph preprocess algorithm
6-3 Meta-NBIW algorithm
6-4 QUALEX-MS algorithm

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

COMBINATORIAL OPTIMIZATION TECHNIQUES IN DATA MINING

By Stanislav Busygin

August 2007
Chair: Panos M. Pardalos
Major: Industrial and Systems Engineering

My research analyzes the role of combinatorial optimization in data mining research and proposes a collection of new, practically efficient data mining techniques based on combinatorial optimization algorithms. The data mining problems addressed include supervised clustering (classification), biclustering, feature selection, and outlier detection. The recent advances and trends in biclustering are surveyed and the major challenges for its further development are outlined. As with many other data mining methodologies, one of these challenges is the lack of mathematical justification for the significance of purported results. I address this issue by developing the notion of consistent biclustering. The significance of consistent biclustering is mathematically justified by the conic separation theorem establishing simultaneous delineation of both sample and attribute classes by convex cones. This required property of the obtained biclustering serves as a powerful tool for selecting those attributes of the data which are relevant to a particular studied phenomenon. As an example of such an application, several well-known DNA microarray data sets are considered, and consistent biclustering results are obtained for them. To further advance the application of mathematically well-justified optimization methods to major data mining problems, I developed a new optimization-based data classification framework which relies upon the same criteria of class separation that serve as the objectives in unsupervised clustering methods, but utilizes them instead

as the constraints on feature selection based upon the available training set of samples. The reliability and robustness of the methodology are also empirically confirmed by computational experiments on DNA microarray data. Next, I discuss the prominent role of graph models in data analysis, with emphasis on data analysis applications of the maximum clique/independent set problem. The great variety of real-world problems that can be tackled with graph-based models is surveyed along with the methodologies employed. Finally, I present a practically efficient maximum clique heuristic, QUALEX-MS. It utilizes a new simple generalization of the Motzkin–Straus theorem for the maximum weight clique problem. This generalization, a significant theoretical result in itself, maximally preserves the form of the original Motzkin–Straus formulation and is proved directly, without the use of mathematical induction. QUALEX-MS employs a new trust region heuristic based upon this new quadratic programming formulation. In contrast to usual trust region methods, it takes into account not only the global optimum of a quadratic objective over a sphere, but also a set of other stationary points. The developed method has complexity O(n^3), where n is the number of vertices of the graph. Computational experiments indicate that QUALEX-MS is exact on small graphs and very efficient on the DIMACS benchmark graphs and various random maximum weight clique problem instances. QUALEX-MS was also utilized for optimization of classification and regression trees.

CHAPTER 1
INTRODUCTION

1.1 General Overview

Due to recent technological advances in such areas as IT and biomedicine, researchers face ever-increasing challenges in extracting relevant information from the enormous volumes of available data. The so-called data avalanche is created by the fact that there is no concise set of parameters that can fully describe the state of the real-world complex systems studied nowadays by biologists, ecologists, sociologists, economists, etc. On the other hand, modern computers and other equipment are able to produce and store virtually unlimited data sets characterizing a complex system, and with the help of the available computational power there is a great potential for significant advances in both theoretical and applied research. That is why in recent years there has been a dramatic increase in interest in sophisticated data mining techniques utilizing not only statistical methods but also a wide spectrum of computational methods associated with large-scale optimization, including algebraic methods and neural networks.

This dissertation presents new combinatorial optimization models and algorithms for data mining problems with applications to biomedicine. Organizationally, the dissertation is divided into two major parts. The first part, which consists of Chapters 2 and 3, is concerned with biclustering models and methods. In particular, Chapter 2 reviews the ongoing development in research on biclustering and its applications and emphasizes the theoretical tools that are seemingly necessary for constructing robust and efficient biclustering algorithms. Chapter 3 presents a novel concept of consistent biclustering, a theoretical justification of its robustness, and fractional 0–1 programming models for supervised consistent biclustering with corresponding heuristic algorithms. The second part of the dissertation (Chapters 4, 5 and 6) is dedicated to other models for data analysis. Chapter 4 presents mathematical programming models for data classification whose constraints are related to the objectives of known

methods. Next, Chapters 5 and 6 are concerned with graph models in data mining and a new polynomial-time maximum clique heuristic algorithm that, in particular, can be utilized for extracting large groups of closely related data samples and for selecting optimal classification and regression models for databases. Chapter 5 discusses the modeling of data sets with graphs. Chapter 6 presents an efficient O(n^3) maximum weight clique heuristic, QUALEX-MS, using the Motzkin–Straus quadratic programming formulation of the problem.

1.2 Data Mining Problems and Optimization

Data mining is a broad area covering a variety of methodologies for analyzing and modeling large data sets. Generally speaking, it aims at revealing a genuine similarity in data profiles while discarding the diversity irrelevant to a particular investigated phenomenon. The problems associated with data mining tasks mainly fall into the following categories:

• Clustering: partition of a given set of samples into classes according to a certain similarity relevant to the purpose of the analysis;

• Dimensionality reduction: projection of a high-dimensional data set onto a low-dimensional space facilitating the data exploration;

• Reduction of noise: correcting or removing inaccurate measurements and atypical samples from a data set.

In particular, problems of the first category may correspond to either unsupervised or supervised clustering. In the latter case (also called classification), the researcher is given a so-called training set of samples whose classes are known, and this a priori information is supposed to be used to classify the test set of samples. Next, we should mention that there exists an important special case of dimensionality reduction called feature selection, where the new low-dimensional space is obtained by dropping a subset of the coordinates of the original space. Finally, the special case of reduction of noise in which samples with properties atypical for their classes are identified is called outlier detection.

All these problems are interrelated and usually represent certain stages of the whole data mining procedure. For instance, reduction of noise may be performed to refine the data set before a dimensionality reduction technique is applied to it, and, finally, classification or supervised clustering may be used to obtain the desired result. Data mining problems can be naturally treated as optimization problems. Indeed, whether one decides how to partition data into similar groups, how to construct a low-dimensional space and project the data onto it, or how to preprocess the data to reduce the noise, the objective of the task can be expressed as a certain mathematical function that needs to be maximized or minimized subject to proper constraints. Moreover, as the data always come as a sequence of observed samples, these optimization problems normally involve discrete variables, each of which represents a certain decision regarding one of the samples. Therefore, one may infer that such a theoretical area as combinatorial optimization finds an important application in data mining. In this work we present some new combinatorial optimization techniques for data mining problems. While having clustering as the main goal, these techniques are also able to handle dimensionality and noise reduction within the same optimization task due to the sophisticated mathematical models developed for the data mining problems.
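To make this viewpoint concrete, the following minimal sketch (my own illustration, not part of the dissertation; all names are hypothetical) evaluates a standard clustering objective, the within-cluster sum of squares minimized by k-means, as a function of a 0–1 assignment matrix:

import numpy as np

def clustering_objective(A, X):
    """Within-cluster sum of squares for the samples (columns of A).

    A : m x n data matrix (features x samples)
    X : n x r 0-1 assignment matrix, X[j, k] = 1 iff sample j is in cluster k
    """
    total = 0.0
    for k in range(X.shape[1]):
        members = np.where(X[:, k] == 1)[0]
        if members.size == 0:
            continue
        centroid = A[:, members].mean(axis=1, keepdims=True)
        total += ((A[:, members] - centroid) ** 2).sum()
    return total

# toy usage: 5 samples, 2 clusters
A = np.array([[1.0, 1.1, 0.9, 5.0, 5.2],
              [0.0, 0.1, -0.1, 3.0, 2.9]])
X = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
print(clustering_objective(A, X))

Choosing the binary matrix X that minimizes this function is exactly the kind of combinatorial optimization problem discussed above.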

CHAPTER 2
BICLUSTERING IN DATA MINING

2.1 The Main Concept

The problems of partitioning objects into a number of groups arise in many areas. For instance, the vector partition problem, which consists in partitioning n d-dimensional vectors into p parts, has broad expressive power and arises in a variety of applications ranging from economics to symbolic computation [7, 47, 55]. However, the most abundant area for partitioning problems is definitely data mining. Data mining is a broad area covering a variety of methodologies for analyzing and modeling large data sets. Generally speaking, it aims at revealing a genuine similarity in data profiles while discarding the diversity irrelevant to a particular investigated phenomenon. To analyze patterns existing in data, it is often desirable to partition the data samples according to some similarity criteria. This task is called clustering. There are many clustering techniques designed for a variety of data types – homogeneous and non-homogeneous numerical data, categorical data, 0–1 data. Among them one should mention the methods of [57], k-means [65], self-organizing maps (SOM) [60], support vector machines (SVM) [36, 87], logical analysis of data (LAD) [15, 16], etc. A recent survey on clustering methods can be found in [92]. However, when working with a data set, there is always a possibility to analyze not only properties of samples, but also of their components (usually called attributes or features). It is natural to expect that each part of the samples recognized as a cluster is induced by properties of a certain subset of features. With respect to these properties we can form an associated cluster of features and bind it to the cluster of samples. Such a pair is called a bicluster, and the problem of partitioning a data set into biclusters is called a biclustering problem.

2.2 Formal Setup

Let a data set of n samples and m features be given as a rectangular matrix A = (aij)m×n, where the value aij is the expression of the i-th feature in the j-th sample. We consider classification of the samples into classes

S1, S2, ..., Sr,   Sk ⊆ {1 ... n}, k = 1 ... r,

S1 ∪ S2 ∪ ... ∪ Sr = {1 ... n},

Sk ∩ Sℓ = ∅,   k, ℓ = 1 ... r, k ≠ ℓ.

This classification should be done so that samples from the same class share certain common properties. Correspondingly, a feature i may be assigned to one of the feature classes

F1, F2, ..., Fr,   Fk ⊆ {1 ... m}, k = 1 ... r,

F1 ∪ F2 ∪ ... ∪ Fr = {1 ... m},

Fk ∩ Fℓ = ∅,   k, ℓ = 1 ... r, k ≠ ℓ,

in such a way that features of the class Fk are “responsible” for creating the class of

samples Sk. Such a simultaneous classification of samples and features is called biclustering (or co-clustering). Definition 1. A biclustering of a data set is a collection of pairs of sample and feature

subsets B = ((S1, F1), (S2, F2),..., (Sr, Fr)) such that the collection (S1, S2,..., Sr) forms a

partition of the set of samples, and the collection (F1, F2,..., Fr) forms a partition of the

set of features. A pair (Sk, Fk) will be called a bicluster. It is important to note here that in some of the biclustering methodologies a direct one-to-one correspondence between classes of samples and classes of features is not required. Moreover, the number of sample and feature classes is allowed to be different.

This way we may consider not only pairs (Sk, Fk), but also other pairs (Sk, Fℓ), k ≠ ℓ.

Such pairs will be referred to as co-clusters. Another possible generalization is to allow overlapping of co-clusters. The criteria used to relate clusters of samples and clusters of features may be of a different nature. Most commonly, it is required that the submatrix corresponding to a bicluster either is overexpressed (i.e., mostly includes values above average) or has a lower variance than the whole data set, but in general, biclustering may rely on any kind of common pattern among the elements of a bicluster.

2.3 Visualization of Biclustering

One popular tool for visualizing data sets is the heatmap. A heatmap is a rectangular grid composed of pixels, each of which corresponds to a data value. The color of a pixel ranges between bright green or blue (lowest values) and bright red (highest values), visualizing the corresponding data value. This way, if the samples and/or features of the data set are ordered with respect to some pattern in the data, the pattern becomes easy to observe visually. When one constructs a reasonable biclustering of a data set and then reorders samples and features by cluster, the heatmap is supposed to show a "checkerboard" pattern whose diagonal blocks correspond to the biclusters, i.e., the submatrices distinguished by the biclustering method used. Figure 2-1 is an example of a data set with 3 biclusters of overexpressed values visualized as a heatmap (in a black-and-white diagram darker pixels correspond to higher values).
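As an illustration of this visualization step (a minimal sketch of my own, not code from the dissertation, assuming matplotlib is available), the following snippet reorders the rows and columns of a data matrix by given cluster labels and displays the resulting heatmap:

import numpy as np
import matplotlib.pyplot as plt

def biclustering_heatmap(A, feature_labels, sample_labels):
    """Reorder rows/columns by cluster label and plot the heatmap.

    A              : m x n data matrix (features x samples)
    feature_labels : length-m array of feature cluster indices
    sample_labels  : length-n array of sample cluster indices
    """
    row_order = np.argsort(feature_labels)
    col_order = np.argsort(sample_labels)
    B = A[np.ix_(row_order, col_order)]
    plt.imshow(B, aspect="auto", cmap="gray_r")  # darker pixels = higher values
    plt.xlabel("samples (grouped by cluster)")
    plt.ylabel("features (grouped by cluster)")
    plt.show()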

2.4 Relation to SVD

Singular value decomposition (SVD) is a remarkable matrix factorization which generalizes the eigendecomposition of a symmetric matrix providing an orthogonal basis of eigenvectors. The SVD is applicable to any rectangular matrix A = (aij)m×n. It delivers orthogonal matrices U = (uik)m×p and V = (vjk)n×p (the columns of the matrices are orthogonal to each other and have unit length) such that

U^T A V = diag(σ1, ..., σp),   p = min(m, n).   (2–1)

Figure 2-1. Partitioning of samples and features into 3 clusters

The numbers σ1 ≥ σ2 ≥ ... ≥ σp ≥ 0 are called singular values, the columns of U are called left singular vectors, and the columns of V are called right singular vectors of A. This way, the left singular vectors provide an orthonormal basis for the columns of A, and the right singular vectors provide an orthonormal basis for the rows of A. Moreover, these bases are coupled so that

A vk = σk uk,

A^T uk = σk vk,

where uk is the k-th left singular vector, and vk is the k-th right singular vector of the matrix. The singular values of A are precisely the lengths of the semi-axes of the hyperellipsoid

E = {Ax : ‖x‖2 = 1}. The SVD provides significant information about properties of the matrix. In particular, if σr is the last nonzero singular value (i.e., σr+1 = ... = σp = 0), then

rank(A) = r,

null(A) = span{vr+1, . . . , vn},

ran(A) = span{u1, . . . , ur},

where span{x1, ..., xk} denotes the linear subspace spanned by the vectors x1, ..., xk, null(A) = {x : Ax = 0} is the nullspace of the matrix, and ran(A) is the linear subspace spanned by the columns of A. It is easy to see from these properties that the SVD is a very useful tool for dimensionality reduction in data mining. Taking also into account that the Frobenius norm of the matrix satisfies

‖A‖F^2 = Σ_{i=1}^m Σ_{j=1}^n aij^2 = Σ_{k=1}^r σk^2,

one can obtain the best (in the sense of the Frobenius norm) low-rank approximation of the matrix by equating all singular values after some σℓ to zero and considering

Ã = Σ_{k=1}^ℓ σk uk vk^T.

Such a low-rank approximation may be found in principal component analysis (PCA) with the ℓ first principal components considered. PCA applies the SVD to the data matrix after certain preprocessing (centralization or standardization of data samples) is performed. We refer the reader to a linear algebra text [46] for a more theoretical consideration of the SVD properties and algorithms. One may relate biclustering to the SVD via consideration of an idealized data matrix. If the data matrix has a block-diagonal structure (with all elements outside the blocks equal to zero), it is natural to associate each block with a bicluster. On the other hand, it is easy to see that each pair of singular vectors will designate one such bicluster by its nonzero components. More precisely, if the data matrix has the block-diagonal form

A = diag(A1, A2, ..., Ar),

where {Ak}, k = 1 ... r, are arbitrary matrices, then for each Ak there will be a singular vector pair (uk, vk) such that the nonzero components of uk correspond to the rows occupied by Ak and the nonzero components of vk correspond to the columns occupied by Ak. In a less idealized case, when the elements outside the diagonal blocks are not necessarily zeros but the diagonal blocks still contain dominating values, the SVD is able to reveal the biclusters too, as dominating components in the singular vector pairs. Hence, the SVD represents a handy tool for biclustering algorithms. Below we show that many biclustering methods either use the SVD directly or have a certain association with the SVD concept.
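To illustrate this connection numerically (a small sketch of mine, not from the dissertation, using numpy), one can build a block-diagonal matrix and check that the leading singular vector pairs are supported exactly on the rows and columns of the individual blocks:

import numpy as np

# two "biclusters" placed on the diagonal of an otherwise zero matrix
A = np.zeros((5, 7))
A[:3, :4] = 5.0   # block A1
A[3:, 4:] = 3.0   # block A2

U, sigma, Vt = np.linalg.svd(A)
for k in range(2):  # the two dominant singular vector pairs
    rows = np.where(np.abs(U[:, k]) > 1e-8)[0]
    cols = np.where(np.abs(Vt[k, :]) > 1e-8)[0]
    print(f"singular value {sigma[k]:.2f}: rows {rows}, columns {cols}")

The first pair marks rows 0–2 and columns 0–3 (block A1), the second pair marks rows 3–4 and columns 4–6 (block A2).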

2.5 Methods

2.5.1 “Direct Clustering”

Apparently the earliest biclustering algorithm that may be found in the literature is the so-called direct clustering by Hartigan [51], also known as block clustering. This approach relies on statistical analysis of submatrices to form the biclusters. Namely, the quality of a bicluster (Sk, Fk) is assessed by the variance

VAR(Sk, Fk) = Σ_{i∈Fk} Σ_{j∈Sk} (aij − µk)^2,

where µk is the average value in the bicluster:

µk = (Σ_{i∈Fk} Σ_{j∈Sk} aij) / (|Fk| |Sk|).

A bicluster is considered perfect if it has zero variance, so biclusters with lower variance are considered to be better than biclusters with higher variance. This, however, leads to an undesirable effect: single-row, single-column submatrices become ideal biclusters as their variance is zero. The issue is resolved by fixing the number of biclusters and minimizing the objective

VAR(S, F) = Σ_{k=1}^r Σ_{i∈Fk} Σ_{j∈Sk} (aij − µk)^2.

Hartigan mentioned that other objective functions may be used to find biclusters with other desirable properties (such as minimizing variance in rows, variance in columns, or biclusters following certain patterns).
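A direct numpy transcription of this objective (my own sketch, not Hartigan's code; the cluster assignments are assumed to be given as lists of index arrays) could look as follows:

import numpy as np

def hartigan_variance(A, feature_clusters, sample_clusters):
    """Total within-bicluster variance VAR(S, F).

    A                : m x n data matrix (features x samples)
    feature_clusters : list of r arrays of row indices (F_1, ..., F_r)
    sample_clusters  : list of r arrays of column indices (S_1, ..., S_r)
    """
    total = 0.0
    for F_k, S_k in zip(feature_clusters, sample_clusters):
        block = A[np.ix_(F_k, S_k)]
        total += ((block - block.mean()) ** 2).sum()
    return total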

2.5.2 Node-Deletion Algorithm

A more sophisticated criterion for constructing patterned biclusters was introduced by Y. Cheng and G.M. Church [30]. It is based on minimization of the so-called mean squared residue. To formulate it, let us introduce the following notation. Let

µik^(r) = (1/|Sk|) Σ_{j∈Sk} aij   (2–2)

be the mean of the i-th row in the sample cluster Sk,

µjk^(c) = (1/|Fk|) Σ_{i∈Fk} aij   (2–3)

be the mean of the j-th column in the feature cluster Fk, and

µk = (Σ_{i∈Fk} Σ_{j∈Sk} aij) / (|Fk| |Sk|)

be the mean value in the bicluster (Sk, Fk). The residue of element aij is defined as

rij = aij − µik^(r) − µjk^(c) + µk,   (2–4)

i ∈ Fk, j ∈ Sk. Finally, the mean squared residue score of the bicluster (Sk, Fk) is defined as

Hk = Σ_{i∈Fk} Σ_{j∈Sk} rij^2.

This value is equal to zero if all columns of the bicluster are equal to each other (that would imply that all rows are equal too). A bicluster (Sk, Fk) is called a δ-bicluster if Hk ≤ δ. Cheng and Church proved that finding the largest square δ-bicluster is NP-hard. So, they used a greedy procedure starting from the entire data matrix and successively removing the columns or rows contributing most to the mean squared residue score. The brute-force deletion algorithm testing the deletion of each row and column would still be quite expensive in the sense of time complexity, as it would require O((m + n)mn) operations. However, the authors employed a simplified search for columns and rows to delete, choosing a column with maximal

d(j) = (1/|Fk|) Σ_{i∈Fk} rij^2,

a row with maximal

d(i) = (1/|Sk|) Σ_{j∈Sk} rij^2,

or subsets of columns or rows for which d(j) or d(i) exceeds a certain threshold above the current mean squared residue of the bicluster. They proved that any such deletion can only decrease the current mean squared residue. These deletions are performed until a δ-bicluster is obtained. Then, as the constructed co-cluster may not be maximal (some of the previously removed columns or rows can be added back without violating the δ-bicluster condition), the authors used a column and row addition algorithm. Namely, they proved that adding any column (row) with d(j) (d(i)) below the current mean squared residue does not increase it. Therefore, successive addition of such columns and rows leads to a maximal δ-bicluster. A software implementation of the method as well as some test data sets are available at [31]. K. Bryan et al. improved the node-deletion algorithm of Cheng and Church by applying a simulated annealing technique. They reported a better performance on a variety of datasets in [20].
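The mean squared residue score and the row/column scores d(i), d(j) that drive the greedy deletions are straightforward to compute; the following numpy sketch (mine, not the authors' code) uses the sum form of Hk given above:

import numpy as np

def residue_scores(A, F_k, S_k):
    """Residue matrix, score H_k, and per-row/column scores for bicluster (S_k, F_k)."""
    block = A[np.ix_(F_k, S_k)]
    row_means = block.mean(axis=1, keepdims=True)   # mu_ik^(r)
    col_means = block.mean(axis=0, keepdims=True)   # mu_jk^(c)
    mu = block.mean()                               # mu_k
    R = block - row_means - col_means + mu          # residues r_ij
    H_k = (R ** 2).sum()
    d_row = (R ** 2).mean(axis=1)                   # d(i), one value per row of the bicluster
    d_col = (R ** 2).mean(axis=0)                   # d(j), one value per column
    return R, H_k, d_row, d_col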

2.5.3 FLOC Algorithm

J. Yang et al. generalized the definition of residue used in the node-deletion algorithm to allow missing entries (some aij may be unknown) [93, 94]. For a bicluster (Sk, Fk), they introduced the notion of α-occupancy, meaning that for each sample j ∈ Sk the number of known data entries aij, i ∈ Fk, is greater than α|Fk|, and for each feature i ∈ Fk the number of known data entries aij, j ∈ Sk, is greater than α|Sk|. They also defined the volume of a bicluster as the number of known data entries in the bicluster, and the average values µik^(r), µjk^(c) and µk are calculated with respect to the known data entries only. The authors developed a heuristic algorithm FLOC (flexible overlapped clustering) to find r biclusters with low average residue. First, the biclusters are generated randomly with a chosen probability ρ for each sample and feature to be included in a bicluster. Then, for each feature and sample, and for each bicluster, it is calculated how much the addition of this feature/sample (if it is currently not in the bicluster) or its removal (if it is currently in the bicluster) reduces the residue of the bicluster. If at least one of such actions reduces the residue, the one achieving the largest reduction is performed. When no further residue reduction is possible, the method stops. It is easy to show that the computational complexity of the method is O((m + n)mnrp), where p is the number of iterations until termination. The authors claim that in the computational experiments they performed, p is of the order of 10. The FLOC algorithm is also able to take into account various additional constraints on biclusters by eliminating certain feature/sample additions/removals from consideration.

2.5.4 Biclustering via Spectral Bipartite Graph Partitioning

In [38] I.S. Dhillon proposed the following method of biclustering. Represent each sample and each feature of a dataset as a vertex of a graph G(V, E), |V| = m + n. Between the vertex corresponding to sample j = 1 ... n and the vertex corresponding to feature i = 1 ... m introduce an edge with weight aij. The graph has no edges between vertices representing samples, nor between vertices representing features. Thus, the graph is bipartite, with F and S representing its color classes. The graph G has the weighted adjacency matrix

M = [ 0    A ]
    [ A^T  0 ]   (2–5)

Now, a partition of the set of vertices into r parts V1,V2,...,Vr,

V = V1 ∪ V2 ∪ ... ∪ Vr,

Vk ∩ Vℓ = ∅, k ≠ ℓ, k, ℓ = 1 ... r,

will provide a biclustering of the dataset. Define the cost of the partition as the total weight of edges cut by it:

cut(V1, ..., Vr) = Σ_{k=1}^{r−1} Σ_{ℓ=k+1}^{r} Σ_{i∈Vk} Σ_{j∈Vℓ} mij   (2–6)

When we are looking for a biclustering maximizing in-class expression values (thus creating dominating submatrices of biclusters), it is natural to seek minimization of the defined cut value. Besides, we should be looking for biclusters that are rather balanced in size, as otherwise the cut value is most probably minimized with all but one bicluster containing only one sample-feature pair. This problem can be tackled with an SVD-related algorithm. Let us introduce the following

Definition 2. The Laplacian matrix LG of G(V, E) is a |V| × |V| symmetric matrix, with one row and one column for each vertex, such that

Lij = Σ_k mik,  if i = j,
Lij = −mij,     if i ≠ j and (i, j) ∈ E,
Lij = 0,        otherwise.

Let a partition V = V1 ∪ V2 of the graph be defined via a ±1 vector p = (pi), i = 1 ... |V|, such that pi = +1 if i ∈ V1 and pi = −1 if i ∈ V2. The Laplacian matrix is connected to the weight of a cut through the following

Theorem 1. Given the Laplacian matrix LG of G and a partition vector p, the Rayleigh quotient satisfies

(p^T L p) / (p^T p) = (4 / |V|) cut(V1, V2).

By this theorem, the cut is obviously minimized with the trivial solution, i.e., when

all pi are either −1 or 1. So, to achieve a balanced partition we need to modify the

objective function. Let us assign a positive weight wi to each vertex i ∈ V , and let

W = diag(w1, w2, . . . , w|V |) be the diagonal matrix of these weights. We denote

weight(Vℓ) = Σ_{i∈Vℓ} wi.

Now, the following objective function allows us to achieve balanced clusters:

Q(V1, V2) = cut(V1, V2)/weight(V1) + cut(V1, V2)/weight(V2).

Let us denote νℓ = Σ_{i∈Vℓ} wi and introduce the generalized partition vector q with elements

qi = +sqrt(ν2/ν1),  i ∈ V1,
qi = −sqrt(ν1/ν2),  i ∈ V2.

The following theorem generalizes Theorem 1.

Theorem 2.

(q^T L q) / (q^T W q) = cut(V1, V2)/weight(V1) + cut(V1, V2)/weight(V2).   (2–7)

Minimizing the expression (2–7) is NP-hard. However, a relaxed version of this problem can be solved via a generalized eigendecomposition (notice that q^T W e = 0).

Theorem 3. The problem

min_{x ≠ 0} (x^T L x) / (x^T W x)   s.t.   x^T W e = 0   (2–8)

is solved when q is the eigenvector corresponding to the second smallest eigenvalue λ2 of the generalized eigenvalue problem

L z = λ W z.   (2–9)

We can solve this problem for the bipartite graph case via the SVD. Choosing the weight matrix W to be equal to the degree matrix, we have

L = [ D1    −A ]
    [ −A^T  D2 ]

and

W = [ D1  0  ]
    [ 0   D2 ],

where D1 and D2 are diagonal matrices such that D1(i, i) = Σ_j aij and D2(j, j) = Σ_i aij. Then (2–9) becomes

[ D1    −A ] [ x ]  =  λ [ D1  0  ] [ x ]
[ −A^T  D2 ] [ y ]       [ 0   D2 ] [ y ],

or, denoting u = D1^{1/2} x and v = D2^{1/2} y,

D1^{−1/2} A D2^{−1/2} v = (1 − λ) u,

D2^{−1/2} A^T D1^{−1/2} u = (1 − λ) v,

which precisely defines the SVD of the normalized matrix Â = D1^{−1/2} A D2^{−1/2}. So, the balanced cut minimization problem can be solved by finding the second largest singular value of this normalized matrix and the corresponding singular vector pair, which can be used to obtain the biclustering into two classes. In the case of multiclass partitioning, Dhillon used ℓ = ⌈log2 r⌉ singular vectors u2, u3, ..., u_{ℓ+1} and v2, v3, ..., v_{ℓ+1} to form the ℓ-dimensional data set

Z = [ D1^{−1/2} U ]
    [ D2^{−1/2} V ],

where U = (u2, ..., u_{ℓ+1}) and V = (v2, ..., v_{ℓ+1}). After such a significant dimensionality reduction is performed, the rows of the matrix Z (which represent both samples and features of the original data set) are clustered with a simple k-means algorithm [65]. Dhillon reports encouraging computational results for text mining problems. Very similar spectral biclustering routines for microarray data have been suggested by Y. Kluger et al. [59]. In addition to working with the singular vectors of Â, they considered two other normalization methods that can be used before applying the SVD. The first one is bistochastization. It makes all row sums equal and all column sums equal too (generally, to a different constant). It is known from Sinkhorn's theorem that under quite general conditions on the matrix A there exist diagonal matrices D1 and D2 such that

D1 A D2 achieves bistochastization [6]. The other approach is applicable if sample/feature subvectors within a bicluster are expected to be shifted by a constant with respect to each other (i.e., vectors a and b are considered similar if a ≈ b + αe, where α is a constant and e is the all-one vector). When similar data are expected to be scaled by different constants (i.e., a ≈ αb), the desirable property can be achieved by applying a logarithm to all data entries. Then, defining

ā_i· = (1/n) Σ_{j=1}^n aij,

ā_·j = (1/m) Σ_{i=1}^m aij,

and

ā_·· = (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n aij,

the normalized data are obtained as

bij = aij − ā_i· − ā_·j + ā_··.

After the singular vectors are computed, it is decided which of them contain the relevant information about the optimal data partition. To extract partitioning information from the system of singular vectors, each of them is examined by fitting it to a piecewise constant vector. That is, the entries of an eigenvector are sorted and all possible thresholds between classes are considered. Such a procedure is equivalent to searching for good optima of the one-dimensional k-means problem. Then a few best singular vectors can be selected to run k-means on the data projected onto them.

2.5.5 Matrix Iteration Algorithms for Minimizing Sum-Squared Residue

H. Cho et al. proposed a co-clustering algorithm minimizing the sum-squared residue throughout all co-clusters [33]. Thus, this approach does not take into account any correspondence between clusters of samples and clusters of features, but considers all the submatrices formed by them. The algorithm is based on algebraic properties of the matrix of residues.

For a given clustering of features (F1, F2, ..., Fq), introduce a feature cluster indicator matrix F = (fik)m×q such that fik = |Fk|^{−1/2} if i ∈ Fk and fik = 0 otherwise. Also, for a given clustering of samples (S1, S2, ..., Sr), introduce a sample cluster indicator matrix S = (sjk)n×r such that sjk = |Sk|^{−1/2} if j ∈ Sk and sjk = 0 otherwise. Notice that these matrices are orthonormal, that is, all columns are orthogonal to each other and have unit length. Now, let H = (hij)m×n be the residue matrix. There are two choices for the definition of hij. It may be defined similarly to (2–4):

hij = aij − µik^(r) − µjℓ^(c) + µkℓ,   (2–10)

where i ∈ Fℓ, j ∈ Sk, µ^(r) and µ^(c) are defined as in (2–2) and (2–3), and µkℓ is the average of the co-cluster (Sk, Fℓ):

µkℓ = (Σ_{i∈Fℓ} Σ_{j∈Sk} aij) / (|Fℓ| |Sk|).

Alternatively, hij may be defined just as the difference between aij and the co-cluster average:

hij = aij − µkℓ.   (2–11)

By direct algebraic manipulations it can be shown that

H = A − F F^T A S S^T   (2–12)

in the case of (2–11), and

H = (I − F F^T) A (I − S S^T)   (2–13)

in the case of (2–10). The method tries to minimize ‖H‖^2 using an iterative process such that on each iteration the current co-clustering is updated so that ‖H‖^2, at least, does not increase. The authors point out that finding the global minimum of ‖H‖^2 over all possible co-clusterings would lead to an NP-hard problem. There are two types of clustering updates used: batch (when all samples or features may be moved between clusters at one time) and incremental (one sample or one feature is moved at a time). In the case of (2–11) the batch algorithm works as defined in Algorithm 2-2.

Input: data matrix A, number of sample clusters r, number of feature clusters q
Output: clustering indicators S and F
Initialize S and F;
objval ← ‖A − F F^T A S S^T‖^2;
∆ ← 1, τ ← 10^{−2} ‖A‖^2 {adjustable};
while ∆ > τ do
    A^S ← F F^T A S;
    for j ← 1 to n do
        assign the j-th sample to the cluster Sk with smallest ‖A·j − |Sk|^{−1/2} A^S·k‖^2;
    end
    update S with respect to the new clustering;
    A^F ← F^T A S S^T;
    for i ← 1 to m do
        assign the i-th feature to the cluster Fk with smallest ‖Ai· − |Fk|^{−1/2} A^F_k·‖^2;
    end
    update F with respect to the new clustering;
    oldobj ← objval, objval ← ‖A − F F^T A S S^T‖^2;
    ∆ ← |oldobj − objval|;
end

Figure 2-2. Coclus H1 algorithm
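The core matrix computations of this batch step are easy to express with numpy; the sketch below (my own illustration, not the software distributed at [34]) evaluates the objective ‖A − F F^T A S S^T‖^2 and performs one batch update of the sample clustering, with the cluster sizes |S_k| passed in as a vector:

import numpy as np

def indicator(labels, num_clusters):
    """Cluster indicator matrix with |cluster|^(-1/2) entries (orthonormal columns)."""
    M = np.zeros((labels.size, num_clusters))
    for k in range(num_clusters):
        members = np.where(labels == k)[0]
        if members.size > 0:
            M[members, k] = 1.0 / np.sqrt(members.size)
    return M

def h1_objective(A, F, S):
    """Sum-squared residue ||A - F F^T A S S^T||^2 for residue definition (2-11)."""
    H = A - F @ F.T @ A @ S @ S.T
    return (H ** 2).sum()

def batch_sample_update(A, F, S, sample_sizes):
    """One batch pass reassigning every sample (column of A); assumes no empty clusters."""
    AS = F @ F.T @ A @ S                      # m x r matrix of cluster prototypes
    scaled = AS / np.sqrt(sample_sizes)       # |S_k|^(-1/2) * AS[:, k]
    # squared distance of each column of A to each scaled prototype
    d = ((A[:, :, None] - scaled[:, None, :]) ** 2).sum(axis=0)
    return d.argmin(axis=1)                   # new sample cluster labels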

In the case of (2–10), it becomes Algorithm 2-3, which is similar but uses slightly different matrix manipulations. To describe the incremental algorithm, we first note that in the case of (2–10) H is defined as in (2–13), and minimization of ‖H‖^2 is equivalent to maximization of ‖F^T A S‖^2. So, suppose we would like to improve the objective function by moving a sample from cluster Sk to cluster Sk′. Denote F^T A by Ā and the new sample clustering indicator matrix by S̃. As S and S̃ differ only in columns k and k′, the objective change can be written as

‖Ā S̃·k′‖^2 − ‖Ā S·k′‖^2 + ‖Ā S̃·k‖^2 − ‖Ā S·k‖^2.   (2–14)

So, the inner loop of the incremental algorithm looks through all possible one-sample moves and chooses the one increasing (2–14) the most. A similar expression can be derived for features.

Input: data matrix A, number of sample clusters r, number of feature clusters q
Output: clustering indicators S and F
Initialize S and F;
objval ← ‖(I − F F^T) A (I − S S^T)‖^2;
∆ ← 1, τ ← 10^{−2} ‖A‖^2 {adjustable};
while ∆ > τ do
    A^S ← (I − F F^T) A S, A^P ← (I − F F^T) A;
    for j ← 1 to n do
        assign the j-th sample to the cluster Sk with smallest ‖A^P·j − |Sk|^{−1/2} A^S·k‖^2;
    end
    update S with respect to the new clustering;
    A^F ← F^T A (I − S S^T), A^P ← A (I − S S^T);
    for i ← 1 to m do
        assign the i-th feature to the cluster Fk with smallest ‖A^P_i· − |Fk|^{−1/2} A^F_k·‖^2;
    end
    update F with respect to the new clustering;
    oldobj ← objval, objval ← ‖(I − F F^T) A (I − S S^T)‖^2;
    ∆ ← |oldobj − objval|;
end

Figure 2-3. Coclus H2 algorithm

Next, it can be shown that in the case of (2–11), when H is defined as in (2–12), the objective can be reduced to

‖A S̃·k′‖^2 − ‖A S·k′‖^2 + ‖A S̃·k‖^2 − ‖A S·k‖^2 − ‖Ā S̃·k′‖^2 + ‖Ā S·k′‖^2 − ‖Ā S̃·k‖^2 + ‖Ā S·k‖^2,   (2–15)

so the incremental algorithm just uses (2–15) instead of (2–14). Notice the direct relation of the method to the SVD. Maximization of ‖F^T A S‖^2, if F and S were constrained only to be orthonormal matrices, would be solved by F = U and S = V, where U and V are as in (2–1). F and S have the additional constraint on their structure (being clustering indicators). However, the SVD helps to initialize the clustering indicator matrices and provides a lower bound on the objective (as the sum of squares of the singular values). Software with the implementation of both cases of this method is available at [34].

2.5.6 Double Conjugated Clustering

Double conjugated clustering (DCC) is a node-driven biclustering technique that can be considered a further development of such clustering methods as k-means [65] and self-organizing maps (SOM) [60]. The method was developed by S. Busygin et al. [23]. It operates in two spaces – space of samples and space of features – applying in each of them either k-means or SOM training iterations. Meanwhile, after each one-space iteration its result updates the other map of clusters by means of a matrix projection. The method works as follows.

Introduce a matrix C = (cik)m×r, which will be referred to as the sample nodes or sample map, and a matrix D = (djk)n×r, which will be referred to as the feature nodes or feature map. This designates r nodes for samples and r nodes for features that will be used for one-space clustering iterations such as k-means or SOM (in the latter case, the nodes are to be arranged with respect to a certain topology that will determine node neighborhoods). We start from the sample map, initialize it with random numbers, and perform a one-space clustering iteration (for instance, in the case of k-means we assign each sample to the closest node and then update each node, storing in it the centroid of the assigned samples). Now the content of C is projected to form D with a matrix transformation:

D := B(A^T C),

where B(M) is the operator normalizing each column of the matrix M to unit length. The matrix multiplication that transforms nodes of one space to the other can be justified with the following argument. The value cik is the weight of the i-th feature in the k-th node. So, the k-th node of the feature map is constructed as a linear combination of the features such that cik is the coefficient of the i-th feature in it. The unit normalization keeps the magnitude of the node vectors constrained. Next, after the projection, the feature map is updated with a similar one-space clustering iteration, and then the backwards projection is applied:

C := B(A D),

which is justified in a similar manner using the fact that djk is the weight of the j-th sample in the k-th node. This cycle is repeated until no samples and features are moved anymore, or it stops after a predefined number of iterations. To be consistent with the unit normalization of the projected nodes, the authors have chosen to use the cosine metric for the one-space iterations, which is not affected by differences in magnitudes of the clustered vectors. This also prevents the undesirable clustering of all low-magnitude elements into a single cluster that often happens when a node-driven clustering is performed using the Euclidean metric. The DCC method has a close connection to the SVD that can be observed in its computational routine. Notice that if one "forgets" to perform the one-space clustering iterations, then DCC executes nothing else but the power method for the SVD [46]. In such a case all sample nodes would converge to the dominating left singular vector and all feature nodes would converge to the dominating right singular vector of the data matrix. However, the one-space iterations prevent this from happening, moving the nodes towards centroids of different sample/feature clusters. This acts similarly to re-orthogonalization in the power method when not only the dominating but also a number of subsequent singular vector pairs are sought. This way DCC can be seen as an alteration of the power method for the SVD that relaxes the orthogonality requirement for the iterated vectors but makes them better adapted to groups of similar samples/features of the data.
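A compact sketch of one DCC cycle (my own reading of the description above, not the authors' implementation; k-means with cosine similarity is used as the one-space iteration):

import numpy as np

def normalize_columns(M):
    """Operator B(M): scale each column to unit Euclidean length."""
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0
    return M / norms

def one_space_kmeans_step(X, nodes):
    """Assign columns of X to nodes by cosine similarity, then recompute centroids."""
    sims = normalize_columns(X).T @ normalize_columns(nodes)   # columns x nodes
    labels = sims.argmax(axis=1)
    for k in range(nodes.shape[1]):
        members = np.where(labels == k)[0]
        if members.size > 0:
            nodes[:, k] = X[:, members].mean(axis=1)
    return nodes, labels

def dcc_cycle(A, C):
    """One DCC cycle: sample-space step, project, feature-space step, project back."""
    C, sample_labels = one_space_kmeans_step(A, C)        # samples are columns of A
    D = normalize_columns(A.T @ C)                        # D := B(A^T C)
    D, feature_labels = one_space_kmeans_step(A.T, D)     # features are columns of A^T
    C = normalize_columns(A @ D)                          # C := B(A D)
    return C, D, sample_labels, feature_labels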

2.5.7 Information-Theoretic Based Co-Clustering

In this method, developed by I. Dhillon et al. in [39], we treat the input data set (aij)m×n as a joint probability distribution p(X, Y) between two discrete random variables X and Y, which can take values in the sets {x1, x2, ..., xm} and {y1, y2, ..., yn}, respectively.

Formally speaking, the goal of the proposed procedure is to cluster X into at most k disjoint clusters X̂ = {x̂1, x̂2, ..., x̂k} and Y into at most l disjoint clusters Ŷ = {ŷ1, ŷ2, ..., ŷl}. Put differently, we are looking for mappings CX and CY such that

CX : {x1, x2, ..., xm} → {x̂1, x̂2, ..., x̂k},

CY : {y1, y2, ..., yn} → {ŷ1, ŷ2, ..., ŷl},

i.e., X̂ = CX(X) and Ŷ = CY(Y), and the tuple (CX, CY) is referred to as a co-clustering. Before we proceed with a description of the technique, let us recall some definitions from probability and information theory. The relative entropy, or the Kullback–Leibler (KL) divergence, between two probability distributions p1(x) and p2(x) is defined as

D(p1 || p2) = Σ_x p1(x) log (p1(x) / p2(x)).

The Kullback–Leibler divergence can be considered as a "distance" from a "true" distribution p1 to an approximation p2. The mutual information I(X; Y) of two random variables X and Y is the amount of information shared between these two variables. In other words, I(X; Y) = I(Y; X) measures how much X tells about Y and, vice versa, how much Y tells about X. It is defined as

I(X; Y) = Σ_y Σ_x p(x, y) log (p(x, y) / (p(x) p(y))) = D(p(x, y) || p(x) p(y)).

Now, we are looking for an optimal co-clustering, which minimizes the loss in mutual information

min_{X̂, Ŷ} I(X; Y) − I(X̂; Ŷ).   (2–16)

Define q(X, Y) to be the following distribution

q(x, y) = p(x̂, ŷ) p(x|x̂) p(y|ŷ),   (2–17)

where x ∈ x̂ and y ∈ ŷ. Obviously, p(x|x̂) = p(x)/p(x̂) if x̂ = CX(x), and 0 otherwise. The following result states an important relation between the loss of information and the distribution q(X, Y) [5]:

Lemma 1. For a fixed co-clustering (CX, CY), we can write the loss in mutual information as

I(X; Y) − I(X̂; Ŷ) = D(p(X, Y) || q(X, Y)).   (2–18)

In other words, finding an optimal co-clustering is equivalent to finding a distribution q defined by (2–17) which is close to p in KL divergence. Consider the joint distribution of X, Y, X̂ and Ŷ, denoted by p(X, Y, X̂, Ŷ). Following the above lemma and (2–17), we are looking for a distribution q(X, Y, X̂, Ŷ), an approximation of p(X, Y, X̂, Ŷ), such that:

q(x, y, x̂, ŷ) = p(x̂, ŷ) p(x|x̂) p(y|ŷ),

and p(X, Y) and q(X, Y) are considered as two-dimensional marginals of p(X, Y, X̂, Ŷ) and q(X, Y, X̂, Ŷ), respectively. The next lemma lies at the core of the algorithm proposed in [39].

Lemma 2. The loss in mutual information can be expressed as (i) a weighted sum of the relative entropies between the row distributions p(Y|x) and the "row-lumped" distributions q(Y|x̂), that is,

D(p(X, Y, X̂, Ŷ) || q(X, Y, X̂, Ŷ)) = Σ_{x̂} Σ_{x : CX(x) = x̂} p(x) D(p(Y|x) || q(Y|x̂)),

(ii) a weighted sum of the relative entropies between the column distributions p(X|y) and the "column-lumped" distributions q(X|ŷ), that is,

D(p(X, Y, X̂, Ŷ) || q(X, Y, X̂, Ŷ)) = Σ_{ŷ} Σ_{y : CY(y) = ŷ} p(y) D(p(X|y) || q(X|ŷ)).

Due to Lemma 2, the objective function can be expressed only in terms of the row-clustering or the column-clustering. Starting with some initial co-clustering (CX^0, CY^0) (and distribution q^0), we iteratively obtain new co-clusterings (CX^1, CY^1), (CX^2, CY^2), ..., using the column-clustering in order to improve the row-clustering as

CX^{t+1}(x) = arg min_{x̂} D(p(Y|x) || q^t(Y|x̂))   (2–19)

and, vice versa, using the row-clustering to improve the column-clustering as

CY^{t+2}(y) = arg min_{ŷ} D(p(X|y) || q^{t+1}(X|ŷ)).   (2–20)

Obviously, after each step (2–19) or (2–20) we need to recalculate the necessary distributions q^{t+1} and q^{t+2}. It can be proved that the described algorithm monotonically decreases the objective function (2–16), though it may converge only to a local minimum [39]. Software with the implementation of this method is available at [34]. In [5] the described alternating minimization scheme was generalized to Bregman divergences, which include the KL-divergence and the Euclidean distance as special cases.
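The quantities involved are simple to compute for a given co-clustering; the sketch below (my illustration, not the software distributed at [34]) builds the approximation q of (2–17) from a joint distribution p and cluster maps and evaluates the loss in mutual information via Lemma 1:

import numpy as np

def coclustering_loss(P, row_map, col_map, k, l):
    """Loss in mutual information I(X;Y) - I(Xhat;Yhat) = D(p || q), cf. Lemma 1.

    P       : m x n joint distribution p(x, y) (nonnegative, sums to 1)
    row_map : length-m array, row_map[i] = index of the cluster containing x_i
    col_map : length-n array, col_map[j] = index of the cluster containing y_j
    """
    m, n = P.shape
    px, py = P.sum(axis=1), P.sum(axis=0)
    Phat = np.zeros((k, l))                      # p(xhat, yhat)
    for i in range(m):
        for j in range(n):
            Phat[row_map[i], col_map[j]] += P[i, j]
    pxhat, pyhat = Phat.sum(axis=1), Phat.sum(axis=0)
    loss = 0.0
    for i in range(m):
        for j in range(n):
            if P[i, j] > 0:
                xh, yh = row_map[i], col_map[j]
                # q(x, y) = p(xhat, yhat) p(x | xhat) p(y | yhat), as in (2-17)
                q = Phat[xh, yh] * (px[i] / pxhat[xh]) * (py[j] / pyhat[yh])
                loss += P[i, j] * np.log(P[i, j] / q)
    return loss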

2.5.8 Biclustering via Gibbs Sampling

The Bayesian framework can be a powerful tool to tackle problems involving uncertainty and noisy patterns. Thus it comes as a natural choice to apply it to data mining problems such as biclustering. Q. Sheng et al. proposed a Bayesian technique for biclustering based on a simple frequency model for the expression pattern of a bicluster and on Gibbs sampling for parameter estimation [81]. This approach not only finds the samples and features of a bicluster but also represents the pattern of a bicluster as a probabilistic model defined by the posterior distribution of the data values within the bicluster. The choice of Gibbs sampling also helps to avoid local minima of the Expectation-Maximization procedure that is used to obtain and adjust the probabilistic model. Gibbs sampling is a well-known Markov chain Monte Carlo method [29]. It is used

to sample random variables (x1, x2, ..., xk) when the marginal distributions of their joint distribution are too complex to sample from directly, but the conditional distributions can be sampled easily. Starting from initial values (x1^(0), x2^(0), ..., xk^(0)), the Gibbs sampler draws values of the variables from the conditional distributions:

xi^(t+1) ∼ p(xi | x1^(t+1), ..., x_{i−1}^(t+1), x_{i+1}^(t), ..., xk^(t)),

i = 1 ... k, t = 0, 1, 2, .... It can be shown that the distribution of (x1^(t), x2^(t), ..., xk^(t)) converges to the true joint distribution p(x1, x2, ..., xk) and the distributions of the sequences {x1^(t)}, {x2^(t)}, ..., {xk^(t)} converge to the true marginal distributions of the corresponding variables.
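As a small, self-contained illustration of this generic scheme (my example, not from the dissertation), a Gibbs sampler for a bivariate standard normal distribution with correlation ρ alternates draws from the two one-dimensional conditionals:

import numpy as np

def gibbs_bivariate_normal(rho, num_iter=10000, seed=0):
    """Gibbs sampling from a bivariate standard normal with correlation rho.

    Each conditional is univariate normal:
    x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
    """
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                      # initial values x^(0)
    samples = np.empty((num_iter, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for t in range(num_iter):
        x1 = rng.normal(rho * x2, sd)      # draw x1 from p(x1 | x2)
        x2 = rng.normal(rho * x1, sd)      # draw x2 from p(x2 | x1)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(samples[2000:].T))       # empirical correlation approaches 0.8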

The biclustering method works with m + n 0-1 values f = (fi), i = 1 ... m (for features), and s = (sj), j = 1 ... n (for samples), indicating which features and samples are selected into the bicluster. These indicators are considered Bernoulli random variables with parameters λf and λs, respectively. The data are discretized and modeled with multinomial distributions. The background data (i.e., all the data that do not belong to the bicluster) are considered to follow one single distribution φ = (φ1, φ2, ..., φℓ), 0 ≤ φk ≤ 1, Σ_k φk = 1, k = 1, ..., ℓ, where ℓ is the total number of bins used for discretization. It is assumed that within the bicluster all features should behave similarly, but the samples are allowed to have different expression levels. That is, for the data values of each sample j within the bicluster we assume a different distribution (θ1j, θ2j, ..., θℓj), 0 ≤ θkj ≤ 1, Σ_k θkj = 1, k = 1, ..., ℓ, independent from the other samples. The λf, λs, {φk} and {θkj} are the parameters of this Bayesian model, and therefore we need to include in the model their conjugate priors. Typically for Bayesian models, one chooses the Beta distribution for the conjugate priors of Bernoulli random variables and the Dirichlet distribution for the conjugate priors of multinomial random variables:

φ ∼ Dirichlet(α),

θ·j ∼ Dirichlet(βj),

λf ∼ Beta(ξf), λs ∼ Beta(ξs),

where α and βj are parameter vectors of the Dirichlet distributions, and ξf and ξs are parameter vectors of the Beta distributions.

Denote the subvector of s with j-th component removed by s¯j and the subvector of f with i-th component removed by f¯i. To derive the full conditional distributions, one can use the relations between distributions

p(fi|f¯i, s, D) ∝ p(fi, f¯i, s, D) = p(f, s, D)

and

p(sj|f, s¯j,D) ∝ p(f, sj, s¯j,D) = p(f, s, D),

where D is the observed discretized data. The distribution p(f, s, D) can be obtained by

integrating θ, φ, λf and λs out of the likelihood function L(θ, φ, λf , λs|f, s, D):

L(θ, φ, λf , λs|f, s, D) = p(f, s, D|θ, φ, λf , λs) = p(D|f, s, θ, φ)p(f|λf )p(s|λs).

Using these conditional probabilities, we can perform the biclustering with Algorithm 2-4.

Initialize the vectors f and s randomly;
repeat
    for i ← 1 to m do    // each feature
        pi ← p(fi = 1 | f̄i, s, D);
        assign fi ← 1 with probability pi and fi ← 0 otherwise;
    end
    for j ← 1 to n do    // each sample
        pj ← p(sj = 1 | f, s̄j, D);
        assign sj ← 1 with probability pj and sj ← 0 otherwise;
    end
until the number of iterations exceeds a predetermined number;

Figure 2-4. Gibbs biclustering algorithm

To obtain the biclustering, the probabilities pi’s and pj’s are averaged over all iterations and a feature/sample is selected in the bicluster if the average probability corresponding to it is above a certain threshold. More than one bicluster can be

constructed by repeating the procedure while the probabilities corresponding to previously selected samples and features are permanently set to zero.

2.5.9 Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)

Consider a bipartite graph G(F, S,E), where the set of data features F and the set of data samples S form two independent sets, and there is an edge (i, j) ∈ E between each feature i and each sample j iff the expression level of feature i changes significantly

in sample j. Obviously, a bicluster B0 = (S0, F0) should correspond to a subgraph

H(F0, S0,E0) of G. Next assign some weights to the edges and non-edges of G in such a way that the statistical significance of a bicluster matches the weight of the respective subgraph. Hence, in this setup biclustering is reduced to a search for heavy subgraphs in G. This idea is a cornerstone of the statistical-algorithmic method for bicluster analysis (SAMBA) developed by Tanay et al. [83, 85]. Some additional details on construction of a bipartite graph G(F, S,E) corresponding to features and samples can be found in the supporting information of [84]. The idea behind one of the possible schemes for edges’ weight assignment from [85]

works as follows. Let pf,s be the fraction of bipartite graphs with the same degree sequence as G in which the edge (f, s) ∈ E. Suppose that the occurrence of an edge (f, s) is an independent Bernoulli random variable with parameter pf,s. In this case, the probability of observing a subgraph H is given by

p(H) = ( Π_{(f,s)∈E0} pf,s ) · ( Π_{(f,s)∉E0} (1 − pf,s) ).   (2–21)

Next consider another model, where edges between vertices from different partitions of the bipartite graph G occur independently with a constant probability pc > max_{(f,s)∈(F,S)} pf,s. Assigning the weight log(pc / pf,s) to each edge (f, s) ∈ E0 and the weight log((1 − pc) / (1 − pf,s)) to each non-edge (f, s) ∉ E0, we can observe that the log-likelihood ratio for a subgraph H,

log L(H) = Σ_{(f,s)∈E0} log(pc / pf,s) + Σ_{(f,s)∉E0} log((1 − pc) / (1 − pf,s)),   (2–22)

is equal to the weight of the subgraph H. If we assume that we are looking for biclusters with the features behaving similarly within the set of samples of the respective bicluster, then heavy subgraphs should correspond to "good" biclusters. In [85] the algorithm for finding heavy subgraphs (biclusters) is based on a procedure for solving the maximum bounded biclique problem. In this problem we are looking for a maximum weight biclique in a bipartite graph G(F, S, E) such that the degree of every feature vertex f ∈ F is at most d. It can be shown that the maximum bounded biclique problem can be solved in O(n 2^d) time. In the first step of SAMBA, for each vertex f ∈ F we find the k heaviest bicliques containing f. During the next phase of the algorithm we try to improve the weight of the obtained subgraphs (biclusters) using a simple local search procedure. Finally, we greedily filter out biclusters with more than L% overlap. A SAMBA implementation is available as a part of EXPANDER, a gene expression analysis and visualization tool, at [79].
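A direct computation of this subgraph weight (my sketch; the edge probabilities pf,s are assumed to be given, e.g. estimated from the degree sequence as described above) looks as follows:

import numpy as np

def subgraph_log_likelihood(P, p_c, features, samples, edges):
    """Weight (log-likelihood ratio) of the subgraph induced by features x samples.

    P        : matrix of edge probabilities p_{f,s} for the full bipartite graph
    p_c      : constant edge probability of the alternative model, p_c > max P
    features : iterable of feature vertices of the candidate bicluster
    samples  : iterable of sample vertices of the candidate bicluster
    edges    : set of (f, s) pairs that are edges of G
    """
    weight = 0.0
    for f in features:
        for s in samples:
            if (f, s) in edges:
                weight += np.log(p_c / P[f, s])
            else:
                weight += np.log((1.0 - p_c) / (1.0 - P[f, s]))
    return weight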

2.5.10 Coupled Two-way Clustering

Coupled two-way clustering (CTWC) is a framework that can be used to build a biclustering on the basis of any one-way clustering algorithm. It was introduced by G. Getz, E. Levine and E. Domany in [43]. The idea behind the method is to find stable clusters of samples and features such that using one of the feature clusters results in a stable clustering of samples, and vice versa. The iterative procedure runs as follows. Initially the entire set of samples S0^0 and the entire set of features F0^0 are considered stable clusters. F0^0 is used to cluster samples and S0^0 is used to cluster features. Denote by {Fi^1} and {Sj^1} the obtained clusters (which are considered stable with respect to F0^0 and S0^0). Now every pair (Fi^s, Sj^t), s, t ∈ {0, 1}, corresponds to a data submatrix, which can be clustered in the same two-way manner to obtain clusters of the second order, {Fi^2} and {Sj^2}. Then the process is repeated again with each pair (Fi^s, Sj^t) not used earlier to obtain the clusters of the next order, and so on until no new cluster satisfying certain criteria is obtained. The criteria used can impose constraints on cluster size, some statistical characteristics, etc. Though any one-way clustering algorithm can be used within the described iterative two-way clustering procedure, the authors chose the hierarchical clustering method SPC [12, 40]. The justification of this choice comes from the natural notion of relative cluster stability delivered by SPC. The SPC method originates from a physical model associating the break-up of a cluster with a certain temperature at which this cluster loses stability. Therefore, it is easy to designate more stable clusters as those requiring a higher temperature for further partitioning. An online implementation of CTWC is available at [88].

2.5.11 Plaid Models

Consider the perfect idealized biclustering situation. We have K biclusters along the main diagonal of the data matrix A = (aij)m×n with the same values of aij in each bicluster k, k = 1, ..., K:

aij = µ0 + Σ_{k=1}^K µk ρik κjk,   (2–23)

where µ0 is some constant value ("background color"), ρik = 1 if feature i belongs to bicluster k (ρik = 0 otherwise), κjk = 1 if sample j belongs to bicluster k (κjk = 0 otherwise), and µk is the value which corresponds to bicluster k (the "color" of bicluster k), i.e., aij = µ0 + µk if feature i and sample j belong to the same bicluster k. We also require that each feature and sample must belong to exactly one bicluster, that is,

∀i Σ_{k=1}^K ρik = 1   and   ∀j Σ_{k=1}^K κjk = 1,   (2–24)

respectively. In [61] Lazzeroni and Owen introduced a more complicated plaid model as a natural generalization of the idealization (2–23)–(2–24). In this model, biclusters are allowed to overlap, and are referred to as layers. The values of aij in each layer are represented as

aij = θij0 + Σ_{k=1}^K θijk ρik κjk,   (2–25)

where the value of θij0 corresponds to a background layer and θijk can be expressed as µk, µk + αik, µk + βjk, or µk + αik + βjk, depending on the particular situation. We are looking for a plaid model such that the following objective function is minimized:

min Σ_{i=1}^m Σ_{j=1}^n ( aij − θij0 − Σ_{k=1}^K θijk ρik κjk )^2.   (2–26)

In [61] the authors developed a heuristic iterative algorithm for solving (2–26). Next we briefly describe the main idea of the approach. Suppose we have K − 1 layers and we are looking for the K-th layer such that the objective function in (2–26) is minimized. Let

Zij = Zij^{K−1} = aij − θij0 − Σ_{k=1}^{K−1} θijk ρik κjk.   (2–27)

Substituting \theta_{ijK} by \mu_K + \alpha_{iK} + \beta_{jK}, the objective function from (2–26) can be rewritten in terms of Z_{ij} as

\sum_{i=1}^{m} \sum_{j=1}^{n} \left( Z_{ij} - (\mu_K + \alpha_{iK} + \beta_{jK}) \rho_{iK} \kappa_{jK} \right)^2. \qquad (2–28)

Let \rho_{iK}^{(0)} and \kappa_{jK}^{(0)} be some starting values of our iteration algorithm. At each iteration step s = 1, 2, ..., S we update the values of \rho_{iK}^{(s)}, \kappa_{jK}^{(s)} and \theta_{ijK}^{(s)} applying the following simple procedure. The value of \theta_{ijK}^{(s)} is obtained from \rho_{iK}^{(s-1)} and \kappa_{jK}^{(s-1)}; then the values of \rho_{iK}^{(s)} and \kappa_{jK}^{(s)} are updated using \theta_{ijK}^{(s)} and \kappa_{jK}^{(s-1)}, or \theta_{ijK}^{(s)} and \rho_{iK}^{(s-1)}, respectively. The variables \rho_{iK}^{(s)} and \kappa_{jK}^{(s)} are relaxed, i.e., they can take values between 0 and 1. We fix them to be in {0, 1} during one of the last iterations of the algorithm.

More specifically, given \rho_{iK} and \kappa_{jK}, the value of \theta_{ijK} = \mu_K + \alpha_{iK} + \beta_{jK} is updated as follows:

\mu_K = \frac{\sum_i \sum_j \rho_{iK} \kappa_{jK} Z_{ij}}{(\sum_i \rho_{iK}^2)(\sum_j \kappa_{jK}^2)},

\alpha_{iK} = \frac{\sum_j (Z_{ij} - \mu_K \rho_{iK} \kappa_{jK}) \kappa_{jK}}{\rho_{iK} \sum_j \kappa_{jK}^2}, \qquad \beta_{jK} = \frac{\sum_i (Z_{ij} - \mu_K \rho_{iK} \kappa_{jK}) \rho_{iK}}{\kappa_{jK} \sum_i \rho_{iK}^2}.

Given \theta_{ijK} and \kappa_{jK}, or \theta_{ijK} and \rho_{iK}, we update \rho_{iK}, or \kappa_{jK}, as

\rho_{iK} = \frac{\sum_j \theta_{ijK} \kappa_{jK} Z_{ij}}{\sum_j \theta_{ijK}^2 \kappa_{jK}^2}, \qquad \text{or} \qquad \kappa_{jK} = \frac{\sum_i \theta_{ijK} \rho_{iK} Z_{ij}}{\sum_i \theta_{ijK}^2 \rho_{iK}^2},

respectively.
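To make one update sweep concrete, the following Python/NumPy sketch applies the formulas above to the K-th layer given the residual matrix Z and the current relaxed memberships. The function name and the small epsilon guard against division by zero are our own illustrative assumptions, not part of [61]; for simplicity both rho and kappa are updated in the same call.

import numpy as np

def plaid_layer_update(Z, rho, kappa, eps=1e-12):
    """One relaxed update sweep for the K-th plaid layer.
    Z: (m, n) residual matrix; rho: (m,), kappa: (n,) memberships in (0, 1]."""
    # Layer effects mu, alpha, beta for fixed rho, kappa.
    mu = (rho @ Z @ kappa) / ((rho**2).sum() * (kappa**2).sum() + eps)
    R = Z - mu * np.outer(rho, kappa)            # residual after removing mu*rho*kappa
    alpha = (R @ kappa) / (rho * (kappa**2).sum() + eps)
    beta = (R.T @ rho) / (kappa * (rho**2).sum() + eps)
    theta = mu + alpha[:, None] + beta[None, :]  # theta_ijK = mu + alpha_i + beta_j
    # Membership updates for fixed theta (each uses the other's previous value).
    rho_new = (theta * kappa[None, :] * Z).sum(axis=1) / \
              ((theta**2 * kappa[None, :]**2).sum(axis=1) + eps)
    kappa_new = (theta * rho[:, None] * Z).sum(axis=0) / \
                ((theta**2 * rho[:, None]**2).sum(axis=0) + eps)
    return np.clip(rho_new, 0.0, 1.0), np.clip(kappa_new, 0.0, 1.0), theta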

For more details of this technique, including the selection of the starting values \rho_{iK}^{(0)} and \kappa_{jK}^{(0)}, stopping rules and other important issues, we refer the reader to [61]. Software with the implementation of the discussed method is available at [62].

2.5.12 Order-Preserving Submatrix (OPSM) Problem

In this model introduced by Ben-Dor et al. [9, 10], given the data set A = (aij)m×n,

the problem is to identify a k × ℓ submatrix (bicluster) (F_0, S_0) such that the expression

values of all features in F0 increase or decrease simultaneously within the set of samples

S0. In other words, in this submatrix we can find a permutation of columns such that in every row the values corresponding to selected columns are increasing. More formally, let

F_0 be a set of row indices {f_1, f_2, ..., f_k}. Then there exists a permutation of S_0, which

consists of column indices {s_1, s_2, ..., s_ℓ}, such that for all i = 1, ..., k and j = 1, ..., ℓ - 1 we have that

a_{f_i, s_j} < a_{f_i, s_{j+1}}.

In [9, 10] it is proved that the OPSM problem is NP -hard. So, the authors designed a greedy heuristic algorithm for finding large order-preserving submatrices, which we briefly outline next.

Let S_0 ⊂ {1, ..., n} be a set of column indices of size ℓ and π = (s_1, s_2, ..., s_ℓ)

be a linear ordering of S0. The pair (S0, π) is called a complete OPSM model. A row

i ∈ {1, . . . , m} supports a complete model (S0, π) if

a_{i, s_1} < a_{i, s_2} < ... < a_{i, s_ℓ}.

For a complete model (S0, π) all supporting rows can be found in O(nm) time.
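For illustration, a minimal Python sketch of this support check (the function name is ours); it scans each row once over the ordered columns, which is the O(nm) behavior mentioned above:

def supporting_rows(A, order):
    """Return the rows of A that support the complete OPSM model given by
    'order', a list of column indices (s_1, ..., s_l)."""
    rows = []
    for i, row in enumerate(A):
        values = [row[s] for s in order]
        # The row supports the model if its values strictly increase along the ordering.
        if all(values[j] < values[j + 1] for j in range(len(values) - 1)):
            rows.append(i)
    return rows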

A partial model θ = {⟨s_1, ..., s_c⟩, ⟨s_{ℓ-d+1}, ..., s_ℓ⟩, ℓ} of a complete model (S_0, π)

is given by the column indices of the c smallest elements ⟨s_1, ..., s_c⟩, the column indices

of the d largest elements ⟨s_{ℓ-d+1}, ..., s_ℓ⟩, and the size ℓ. We say that θ is a partial model of order (c, d). Obviously, a model of order (c, d) becomes complete if c + d = ℓ. The idea of the algorithm from [9, 10] is to increase c and d in the partial model until we get a good quality complete model. The total number of partial models of order (1, 1) in a matrix with n columns is n(n - 1). At the first step of the algorithm we select the t best partial models of order (1, 1). Next we try to derive partial models of order (2, 1) from the selected partial models of order (1, 1) and pick the t best models of order (2, 1). At the second step we try to extend them to partial models of order (2, 2). We continue this process until we get t models of order (⌈ℓ/2⌉, ⌈ℓ/2⌉). The overall complexity of the algorithm is O(tn^3 m) [9, 10].

2.5.13 OP-Cluster

The order-preserving cluster model (OP-Cluster) was proposed by J. Liu and W. Wang in [63]. This model is similar to the OPSM model discussed above and can be considered, in some sense, as its generalization. It aims at finding biclusters where the features follow the same order of values in all the samples. However, when two feature values in a sample are close enough, they are considered indistinguishable and allowed to be in any order in the sample. Formally, if features i, i + 1, ..., i + Δi are ordered in a non-decreasing sequence in a sample j (i.e., a_{ij} ≤ a_{i+1,j} ≤ ... ≤ a_{i+Δi,j}) and a user-specified grouping threshold δ > 0 is given, the sample j is called similar on these attributes if

a_{i+Δi,j} - a_{ij} < G(δ, a_{ij}), where G is a grouping function defining when feature values are considered equivalent. Such a sequence of features is called a group for sample j. The feature i is called the pivot point of this group. The function G may be defined in different ways. The authors use a simple

choice

G(δ, aij) = δaij.
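As an illustration of this grouping rule, here is a small Python sketch (our own, not from [63]) that greedily partitions a non-decreasing sequence of feature values into groups using G(δ, a) = δa, with the first value of each group acting as the pivot:

def greedy_groups(values, delta):
    """Split a non-decreasing list of feature values into groups whose members
    stay within delta * pivot of the group's pivot (its first, smallest value)."""
    groups, current, pivot = [], [], None
    for v in values:
        if current and (v - pivot) < delta * pivot:
            current.append(v)          # still indistinguishable from the pivot
        else:
            if current:
                groups.append(current)
            current, pivot = [v], v    # start a new group with v as its pivot
    if current:
        groups.append(current)
    return groups

# Example: greedy_groups([1.0, 1.05, 1.08, 2.0, 2.1], delta=0.1)
# -> [[1.0, 1.05, 1.08], [2.0, 2.1]]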

Next, a sequence of features is said to show an UP pattern in a sample if it can be partitioned into groups so that the pivot point of each group is not smaller than the preceding value in the sequence. Finally, a bicluster is called an order-preserving cluster (OP-Cluster) if there exists a permutation of its features such that they all show an UP pattern. The authors presented an algorithm for finding OP-Clusters with no less than a required number of samples n_s and number of features n_f. The algorithm essentially searches through all ordered subsequences of features existing in the samples to find maximal common ones, but due to a representation of feature sequences in a tree form allowing for an efficient pruning technique, the algorithm is sufficiently fast in practice to apply to real data.

2.5.14 Supervised Classification via Maximal δ-valid Patterns

In [27] the authors defined a δ-valid pattern as follows. Given a data matrix A =

(aij)m×n and δ > 0, a submatrix (F0, S0) of A is called a δ-valid pattern if

\forall i \in F_0 \quad \max_{j \in S_0} a_{ij} - \min_{j \in S_0} a_{ij} < \delta \qquad (2–29)

The δ-valid pattern is called maximal if it is not a submatrix of any larger submatrix of A, which is also a δ-valid pattern. Maximal δ-valid patterns can be found using the SPLASH algorithm [26]. The idea of the algorithm is find an optimal set of δ-patterns such that they cover the set of samples. It can be done using a greedy approach selecting first most statistically significant and most covering patterns. Finally, this set of δ-patterns is used to classify the test samples (samples with unknown classification). For more detailed description of the technique we refer the reader to [27].
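A direct NumPy check of definition (2–29) for a candidate submatrix is straightforward; the sketch below (function name ours) is only a verification helper, not the SPLASH search itself:

import numpy as np

def is_delta_valid(A, rows, cols, delta):
    """Check whether the submatrix of A on the given row and column index sets
    is a delta-valid pattern, i.e. every selected row varies by less than delta
    over the selected columns (condition (2-29))."""
    sub = np.asarray(A)[np.ix_(rows, cols)]
    return bool(np.all(sub.max(axis=1) - sub.min(axis=1) < delta))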

2.5.15 CMonkey

CMonkey is another statistical method for biclustering, recently introduced by Reiss et al. [77]. The method is developed specifically for genetic data and works at the same time with gene sequence data, gene expression data (from a microarray) and gene network association data. It constructs one bicluster at a time with an iterative procedure. First, the bicluster is created either randomly or from the result of some other clustering method. Then, at each step, for each sample and feature it is decided whether it should be added to or removed from the bicluster. For this purpose, the probabilities of the presence of the considered sample or feature in the bicluster with respect to the current structure of the bicluster at the three data levels are computed, and a simulated annealing formula is used to make the decision about the update on the basis of the computed probabilities. This way, even when these probabilities are not high, the update has a nonzero chance to occur (this allows escapes from local optima, as in any other simulated annealing technique for global optimization). The actual probability of the update also depends on the chosen annealing schedule, so earlier updates normally have a higher probability of acceptance, while the later steps become almost identical to local optimization. We refer the reader to [77] for a detailed description of the work of Reiss et al.

2.6 Discussion and Concluding Remarks

In this chapter we reviewed the most widely used and successful biclustering techniques and their related applications. Generally speaking, many of the approaches rely on arguments that are not mathematically rigorous, and there is a lack of methods to justify the quality of the obtained biclusters. Furthermore, additional efforts should be made to connect properties of the biclusters with phenomena relevant to the desired data analysis. Therefore, future development of biclustering should involve more theoretical studies of biclustering methodology and formalization of its quality criteria. More specifically, as we observed that the biclustering concept has a remarkable interplay with the algebraic notion

of the SVD, we believe that biclustering methodology should be further advanced in the direction of algebraic formalization. This should allow effective utilization of classical algebraic algorithms. In addition, a more formal setup for the desired class separability can be achieved by establishing new theoretical results on the properties of domains confining all samples/features of one bicluster. The range of biclustering applications can also be extended to other areas where simultaneous clustering of data samples and features (attributes) makes a lot of sense. For example, one of the promising directions may be biclustering of stock data. This way, clustering of equities may reveal groups of companies whose performance depends on the same (but possibly hidden) factors, while clusters of trading days may reveal unknown patterns of stock market returns. To summarize, we should emphasize that further successful development of biclustering theory and techniques is essential for progress in data mining and its applications (text mining, etc.)

CHAPTER 3
CONSISTENT BICLUSTERING VIA FRACTIONAL 0–1 PROGRAMMING

3.1 Consistent Biclustering

Let each sample be already assigned somehow to one of the classes S1, S2,..., Sr.

Introduce a 0–1 matrix S = (sjk)n×r such that sjk = 1 if j ∈ Sk, and sjk = 0 otherwise.

The sample class centroids can be computed as the matrix C = (cik)m×r:

C = AS(S^T S)^{-1}, \qquad (3–1)

whose k-th column represents the centroid of the class Sk. Consider a row i of the matrix C. Each value in it gives us the average expression of the i-th feature in one of the sample classes. As we want to identify the checkerboard pattern in the data, we have to assign the feature to the class where it is most expressed.

So, let us classify the i-th feature to the class \hat{k} with the maximal value c_{i\hat{k}}:¹

i \in F_{\hat{k}} \Rightarrow \forall k = 1 \ldots r,\ k \ne \hat{k}:\ c_{i\hat{k}} > c_{ik} \qquad (3–2)

Now, provided the classification of all features into classes F1, F2, ..., Fr, let us construct a classification of samples using the same principle of maximal average expression and see whether we will arrive at the same classification as the initially given

one. To do this, construct a 0–1 matrix F = (fik)m×r such that fik = 1 if i ∈ Fk and

fik = 0 otherwise. Then, the feature class centroids can be computed in the form of the matrix D = (d_{jk})_{n \times r}:

D = A^T F (F^T F)^{-1}, \qquad (3–3)

¹ Taking into account that in real-life data mining applications all data are fractional values whose accuracy is not perfect, we may disregard the case when this maximum is not unique. However, for the sake of theoretical purity, we further assume that if an ambiguity in classification occurs, we apply a negligible perturbation to the data set values and start the procedure anew.

whose k-th column represents the centroid of the class Fk. The condition on sample classification we need to verify is

j \in S_{\hat{k}} \Rightarrow \forall k = 1 \ldots r,\ k \ne \hat{k}:\ d_{j\hat{k}} > d_{jk} \qquad (3–4)
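For concreteness, the following NumPy sketch (ours, with hypothetical function names) computes the centroid matrices (3–1) and (3–3) from the 0–1 assignment matrices and verifies conditions (3–2) and (3–4); ties in the maxima are ignored, as discussed in the footnote above.

import numpy as np

def centroids(A, S):
    """Class centroids C = A S (S^T S)^{-1} for a 0-1 assignment matrix S."""
    return A @ S @ np.linalg.inv(S.T @ S)

def is_consistent(A, S, F):
    """S (n x r) assigns samples to classes, F (m x r) assigns features."""
    C = centroids(A, S)        # feature-by-class averages, (3-1)
    D = centroids(A.T, F)      # sample-by-class averages, (3-3)
    feat_ok = np.all(np.argmax(C, axis=1) == np.argmax(F, axis=1))   # condition (3-2)
    samp_ok = np.all(np.argmax(D, axis=1) == np.argmax(S, axis=1))   # condition (3-4)
    return bool(feat_ok and samp_ok)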

Let us state now the definition of biclustering and its consistency formally. Definition 1. A biclustering of a data set is a collection of pairs of sample and feature

subsets B = ((S1, F1), (S2, F2),..., (Sr, Fr)) such that the collection (S1, S2,..., Sr) forms a

partition of the set of samples, and the collection (F1, F2,..., Fr) forms a partition of the set of features. Definition 2. A biclustering B will be called consistent if both relations (3–2) and (3–4) hold, where the matrices C and D are defined as in (3–1) and (3–3). We will also say that a data set is biclustering-admitting if some consistent biclustering for it exists. Furthermore, the data set will be called conditionally biclustering-admitting with respect to a given (partial) classification of some samples and/or features if there exists a consistent biclustering preserving the given (partial) classification. Next, we will show that a consistent biclustering implies separability of the classes by

convex cones. Further we will denote j-th sample of the data set by a·j (which is the j-th

column of the matrix A), and i-th feature by ai· (which is the i-th row of the matrix A).

Theorem 4. Let B be a consistent biclustering. Then there exist convex cones P1, P2,..., Pr ⊆

R^m such that all samples from S_k belong to the cone P_k and no other sample belongs to it, k = 1 \ldots r.

Similarly, there exist convex cones Q_1, Q_2, \ldots, Q_r \subseteq R^n such that all features from F_k

belong to the cone Qk and no other feature belongs to it, k = 1 . . . r.

Proof. Let P_k be the conic hull of the samples of class S_k, that is, a vector x \in P_k if and only if it can be represented as

x = \sum_{j \in S_k} \gamma_j a_{\cdot j},

where all \gamma_j \ge 0. Obviously, P_k is convex and all samples of the class S_k belong to it.

Now, suppose there is a sample \hat{j} \in S_\ell, \ell \ne k, that belongs to the cone P_k. Then there exists a representation

a_{\cdot \hat{j}} = \sum_{j \in S_k} \gamma_j a_{\cdot j},

where all \gamma_j \ge 0. Next, consistency of the biclustering implies that in the matrix of feature centroids D the component d_{\hat{j}\ell} > d_{\hat{j}k}. This implies

\frac{\sum_{i \in F_\ell} a_{i\hat{j}}}{|F_\ell|} > \frac{\sum_{i \in F_k} a_{i\hat{j}}}{|F_k|}.

Plugging in a_{i\hat{j}} = \sum_{j \in S_k} \gamma_j a_{ij}, we obtain

\frac{\sum_{i \in F_\ell} \sum_{j \in S_k} \gamma_j a_{ij}}{|F_\ell|} > \frac{\sum_{i \in F_k} \sum_{j \in S_k} \gamma_j a_{ij}}{|F_k|}.

Changing the order of summation,

\sum_{j \in S_k} \gamma_j \left( \frac{\sum_{i \in F_\ell} a_{ij}}{|F_\ell|} \right) > \sum_{j \in S_k} \gamma_j \left( \frac{\sum_{i \in F_k} a_{ij}}{|F_k|} \right),

or

\sum_{j \in S_k} \gamma_j d_{j\ell} > \sum_{j \in S_k} \gamma_j d_{jk}.

On the other hand, for any j \in S_k, the biclustering consistency implies d_{j\ell} < d_{jk}, which contradicts the obtained inequality. Hence, the sample \hat{j} cannot belong to the cone P_k. Similarly, we can show that the stated conic separability holds for the classes of features. It also follows from the proved conic separability that convex hulls of the classes are separated, i.e., they do not intersect.

3.2 Supervised Biclustering

One of the most important problems for real-life data mining applications is supervised classification of test samples on the basis of information provided by training data. In such a setup, a training set of samples is supplied along with its classification

known a priori, and classification of additional samples, constituting the test set, has to be performed. That is, a supervised classification method consists of two routines, the first of which derives classification criteria while processing the training samples, and the second one applies these criteria to the test samples. In genomic and proteomic data analysis, as well as in other data mining applications where only a small subset of features is expected to be relevant to the classification of interest, the classification criteria should involve dimensionality reduction and feature selection. In this chapter, we handle such a task utilizing the notion of consistent biclustering. Namely, we select a subset of features of the original data set in such a way that the obtained subset of data becomes conditionally biclustering-admitting with respect to the given classification of training samples.

Assuming that we are given the training set A = (aij)m×n with the classification of

samples into classes S1, S2,..., Sr, we are able to construct the corresponding classification of features according to (3–2). Now, if the obtained biclustering is not consistent, our goal is to exclude some features from the data set so that the biclustering with respect to the residual feature set is consistent.

Formally, let us introduce a vector of 0–1 variables x = (xi)i=1...m and consider the

i-th feature selected if xi = 1. The condition of biclustering consistency (3–4), when only the selected features are used, becomes

\frac{\sum_{i=1}^{m} a_{ij} f_{i\hat{k}} x_i}{\sum_{i=1}^{m} f_{i\hat{k}} x_i} > \frac{\sum_{i=1}^{m} a_{ij} f_{ik} x_i}{\sum_{i=1}^{m} f_{ik} x_i}, \qquad \forall j \in S_{\hat{k}},\ \hat{k}, k = 1 \ldots r,\ k \ne \hat{k}. \qquad (3–5)

We will use the fractional relations (3–5) as constraints of an optimization problem selecting the feature set. It may incorporate various objective functions over x, depending on the desirable properties of the selected features, but one general choice is to select the maximal possible number of features in order to lose minimal amount of information provided by the training set. In this case, the objective function is

\max \sum_{i=1}^{m} x_i \qquad (3–6)

The optimization problem (3–6),(3–5) is a specific type of fractional 0–1 programming problem, which we discuss in the next section.

3.3 Fractional 0–1 Programming

The fractional 0–1 programming problem (or hyperbolic 0–1 programming problem) is defined as follows:

\max_{x \in \{0,1\}^m} f(x) = \sum_{j=1}^{n} \frac{\alpha_{j0} + \sum_{i=1}^{m} \alpha_{ji} x_i}{\beta_{j0} + \sum_{i=1}^{m} \beta_{ji} x_i}, \qquad (3–7)

where it is usually assumed that for all j and x \in \{0, 1\}^m the denominators in (3–7) are positive, i.e. \beta_{j0} + \sum_{i=1}^{m} \beta_{ji} x_i > 0. Problem (3–7) is known to be NP-hard [75]. For more information on complexity issues of fractional 0–1 programming problems we refer the reader to [75, 76]. Applications of constrained and unconstrained versions of problem (3–7) arise in numerous areas including but not limited to scheduling [78], query optimization in data bases and information retrieval [49], and p-choice facility location [86]. Generally, in the framework of fractional 0–1 programming we consider problems where we optimize a multiple-ratio fractional 0–1 function of type (3–7) subject to a set of linear constraints. Algorithms for solving problem (3–7) include linearization techniques [76, 86, 90], branch and bound methods [86], network-flow [74] and approximation [50] approaches. In this chapter we define a new class of fractional 0–1 programming problems, where the fractional terms are not in the objective function, but in the constraints, i.e. we optimize a linear objective function subject to fractional constraints. More formally, we define the following problem:

\max_{x \in \{0,1\}^m} g(x) = \sum_{i=1}^{m} w_i x_i \qquad (3–8)

\text{s.t.} \quad \sum_{j=1}^{n_s} \frac{\alpha_{j0}^s + \sum_{i=1}^{m} \alpha_{ji}^s x_i}{\beta_{j0}^s + \sum_{i=1}^{m} \beta_{ji}^s x_i} \ge p_s, \qquad s = 1, \ldots, S, \qquad (3–9)

where S is the number of fractional constraints, and we also assume that for all s, j and

x \in \{0, 1\}^m the denominators in (3–9) are positive, i.e. \beta_{j0}^s + \sum_{i=1}^{m} \beta_{ji}^s x_i > 0. This problem is clearly NP-hard since linear 0–1 programming is a special class of problem (3–8)-(3–9) if

\beta_{ji}^s = 0 and \beta_{j0}^s = 1 for j = 1, \ldots, n_s, i = 1, \ldots, m and s = 1, \ldots, S. A typical approach for solving problem (3–7) is to reformulate it as a linear mixed 0–1 programming problem, which can be addressed using standard solvers such as CPLEX [35]. For more detailed information on possible linearization methods for fractional 0–1 programming problems we refer the reader to [76, 86, 90]. Fortunately, a similar technique can also be applied to problem (3–8)-(3–9). The linearization approach discussed next is based on a very simple idea:

Theorem 5. A polynomial mixed 0–1 term z = xy, where x is a 0–1 variable and y is a continuous variable taking any positive value, can be represented by the following linear inequalities: (1) y - z \le M - Mx; (2) z \le y; (3) z \le Mx; (4) z \ge 0, where M is a large number greater than y.

A simple proof of this result can be found in [90].
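For completeness, the case analysis behind Theorem 5 can be sketched in two lines (our own restatement of the argument in [90]):

x = 0:\ (3),(4) \Rightarrow z = 0 = xy;\ (1) reduces to y \le M, which holds since M > y;\ (2) holds since 0 \le y.
x = 1:\ (1) \Rightarrow y - z \le 0, i.e. z \ge y;\ (2) \Rightarrow z \le y; hence z = y = xy;\ (3),(4) hold since 0 < y < M.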

Next define a set of new variables y_j^s such that

y_j^s = \frac{1}{\beta_{j0}^s + \sum_{i=1}^{m} \beta_{ji}^s x_i}, \qquad (3–10)

where j = 1, \ldots, n_s, and s = 1, \ldots, S. Since we assume that all denominators are positive, condition (3–10) is equivalent to

\beta_{j0}^s y_j^s + \sum_{i=1}^{m} \beta_{ji}^s x_i y_j^s = 1. \qquad (3–11)

In terms of the new variables y_j^s, problem (3–8)-(3–9) can be rewritten as

\max_{x \in \{0,1\}^m} g(x) = \sum_{i=1}^{m} w_i x_i \qquad (3–12)

\text{s.t.} \quad \sum_{j=1}^{n_s} \alpha_{j0}^s y_j^s + \sum_{j=1}^{n_s} \sum_{i=1}^{m} \alpha_{ji}^s x_i y_j^s \ge p_s, \qquad s = 1, \ldots, S, \qquad (3–13)

\beta_{j0}^s y_j^s + \sum_{i=1}^{m} \beta_{ji}^s x_i y_j^s = 1, \qquad j = 1, \ldots, n_s,\ s = 1, \ldots, S. \qquad (3–14)

In order to obtain a linear mixed 0–1 formulation, the nonlinear terms x_i y_j^s in (3–13) and

(3–14) can be linearized by introducing additional variables z_{ij}^s and applying the result of Theorem 5. The number of new variables y_j^s and z_{ij}^s is (m + 1) \sum_{s=1}^{S} n_s.

3.4 Algorithm for Biclustering

To linearize the fractional 0–1 program (3–6),(3–5), we should introduce, according to (3–10), the variables

y_k = \frac{1}{\sum_{i=1}^{m} f_{ik} x_i}, \qquad k = 1 \ldots r. \qquad (3–15)

Since fik can take values only zero or one, equation (3–15) can be equivalently rewritten as

\sum_{i=1}^{m} f_{ik} x_i \ge 1, \qquad k = 1 \ldots r. \qquad (3–16)

\sum_{i=1}^{m} f_{ik} x_i y_k = 1, \qquad k = 1 \ldots r. \qquad (3–17)

In terms of the new variables yk, condition (3–5) is replaced by

\sum_{i=1}^{m} a_{ij} f_{i\hat{k}} x_i y_{\hat{k}} > \sum_{i=1}^{m} a_{ij} f_{ik} x_i y_k \qquad \forall j \in S_{\hat{k}},\ \hat{k}, k = 1 \ldots r,\ k \ne \hat{k}. \qquad (3–18)

Next, observe that the term x_i y_k is present in (3–18) if and only if f_{ik} = 1, i.e., i \in F_k. So, there are in total only m such products in (3–18), and hence we can introduce m

variables z_i = x_i y_k, i \in F_k, to linearize the system by Theorem 5. Obviously, the parameter M can be set to 1. So, instead of (3–17) and (3–18), we have the following constraints:

\sum_{i=1}^{m} f_{ik} z_i = 1, \qquad k = 1 \ldots r. \qquad (3–19)

\sum_{i=1}^{m} a_{ij} f_{i\hat{k}} z_i > \sum_{i=1}^{m} a_{ij} f_{ik} z_i \qquad \forall j \in S_{\hat{k}},\ \hat{k}, k = 1 \ldots r,\ k \ne \hat{k}. \qquad (3–20)

y_k - z_i \le 1 - x_i, \quad z_i \le y_k, \quad z_i \le x_i, \quad z_i \ge 0, \qquad i \in F_k. \qquad (3–21)

Unfortunately, while the linearization by Theorem 5 works nicely for small-size problems, for larger problems it often creates instances where the gap between the integer programming optimum and the linear programming relaxation optimum is very big. As a consequence, the instance cannot be solved in a reasonable time even with the best techniques implemented in modern integer programming solvers. Hence, we have developed an alternative approach to solving the problem (3–6), (3–5) via mixed 0–1 programming, which is similar in its main idea to the method for solving specific fractional 0–1 programming problems described in [74].

Consider the meaning of variables zi. We have introduced them so that

z_i = \frac{x_i}{\sum_{\ell=1}^{m} f_{\ell k} x_\ell}, \qquad i \in F_k. \qquad (3–22)

Thus, for i ∈ Fk, zi is the reciprocal of the cardinality of the class Fk after the feature selection, if the i-th feature is selected, and 0 otherwise. This suggests that zi is also a binary variable by nature as xi is, but its nonzero value is just not set to 1. That value is not known unless the optimal sizes of feature classes are obtained. However, knowing zi is sufficient to define the value of xi, and the system of constraints with respect only to the continuous variables 0 ≤ zi ≤ 1 constitutes a linear relaxation of the biclustering constraints (3–5). Furthermore it can be strengthened by the system of inequalities

connecting zi to xi. Indeed, if we know that no more than mk features can be selected for class Fk, then it is valid to impose:

xi ≤ mkzi, xi ≥ zi, i ∈ Fk. (3–23)

We can prove

Theorem 6. If x^* is an optimal solution to (3–6), (3–5), and m_k = \sum_{i=1}^{m} f_{ik} x_i^*, then x^* is also an optimal solution to (3–6),(3–19),(3–20),(3–23).

Proof. Obviously, x^* is a feasible solution to the new program, so we just have to show that (3–6),(3–19),(3–20),(3–23) cannot have a better solution. Assume such a solution x^{**} exists. Then,

\sum_{i=1}^{m} x_i^{**} > \sum_{i=1}^{m} x_i^*,

and, therefore, at least for one k \in \{1 \ldots r\},

\sum_{i=1}^{m} f_{ik} x_i^{**} > \sum_{i=1}^{m} f_{ik} x_i^*.

On the other hand, x_i \le m_k z_i, and in conjunction with (3–19) it implies that

\sum_{i=1}^{m} f_{ik} x_i^{**} \le \sum_{i=1}^{m} m_k f_{ik} z_i^{**} = m_k = \sum_{i=1}^{m} f_{ik} x_i^*.

We have obtained a contradiction and, therefore, x^* is an optimal solution to (3–6), (3–19), (3–20), (3–23).

Hence, we can utilize Algorithm 3-1 as the iterative heuristic of feature selection.

for i ← 1 to m do
    x_i ← 1;
end
repeat
    m_k ← \sum_{i=1}^{m} f_{ik} x_i for all k = 1 . . . r;
    solve the mixed 0–1 programming formulation using the inequalities (3–23) instead of (3–21);
until m_k = \sum_{i=1}^{m} f_{ik} x_i for all k = 1 . . . r;

Figure 3-1. Feature selection heuristic
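A compact Python rendering of this loop is given below; solve_feature_selection_mip stands for a call to any mixed 0–1 programming solver on formulation (3–6),(3–19),(3–20),(3–23) with the current bounds m_k, and is a hypothetical helper, not a library routine:

def feature_selection_heuristic(A, F, sample_classes, solve_feature_selection_mip, max_iter=50):
    """Iterative heuristic of Figure 3-1. F is the 0-1 feature-to-class matrix (m x r)."""
    m, r = len(F), len(F[0])
    x = [1] * m                                    # start with all features selected
    for _ in range(max_iter):
        m_k = [sum(F[i][k] * x[i] for i in range(m)) for k in range(r)]
        x = solve_feature_selection_mip(A, F, sample_classes, m_k)   # assumed solver call
        new_m_k = [sum(F[i][k] * x[i] for i in range(m)) for k in range(r)]
        if new_m_k == m_k:                         # class sizes stabilized, so stop
            break
    return x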

Another modification of the program (3–6), (3–5) that may result in an improvement of the quality of the feature selection is strengthening the class separation by introducing a coefficient greater than 1 on the right-hand side of inequality (3–5). In this case, we replace (3–5) by the stronger relation

\frac{\sum_{i=1}^{m} a_{ij} f_{i\hat{k}} x_i}{\sum_{i=1}^{m} f_{i\hat{k}} x_i} \ge (1 + t) \frac{\sum_{i=1}^{m} a_{ij} f_{ik} x_i}{\sum_{i=1}^{m} f_{ik} x_i}, \qquad (3–24)

where t > 0 is a constant that becomes a parameter of the method (notice also that in doing this we have replaced the strict inequalities by non-strict ones and made the feasible domain closed). In the mixed 0–1 programming formulation, this is achieved by replacing (3–20) by

\sum_{i=1}^{m} a_{ij} f_{i\hat{k}} z_i \ge (1 + t) \sum_{i=1}^{m} a_{ij} f_{ik} z_i \qquad \forall j \in S_{\hat{k}},\ \hat{k}, k = 1 \ldots r,\ k \ne \hat{k}. \qquad (3–25)

After the feature selection is done, we perform classification of test samples according to (3–4). That is, if b = (b_i)_{i=1 \ldots m} is a test sample, we assign it to the class \hat{k} satisfying

\frac{\sum_{i=1}^{m} b_i f_{i\hat{k}} x_i}{\sum_{i=1}^{m} f_{i\hat{k}} x_i} > \frac{\sum_{i=1}^{m} b_i f_{ik} x_i}{\sum_{i=1}^{m} f_{ik} x_i}, \qquad k = 1 \ldots r,\ k \ne \hat{k}.
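In code, this classification rule amounts to comparing per-class average expressions over the selected features; a short NumPy sketch (function name ours, assuming every class keeps at least one selected feature, cf. (3–16)):

import numpy as np

def classify_sample(b, F, x):
    """Assign test sample b to the class whose selected features have the
    highest average expression. F: 0-1 feature-to-class matrix (m x r),
    x: 0-1 vector of selected features."""
    b, F, x = np.asarray(b, float), np.asarray(F), np.asarray(x)
    weights = F * x[:, None]                        # f_ik * x_i
    scores = (b @ weights) / weights.sum(axis=0)    # average expression per class
    return int(np.argmax(scores))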

3.5 Computational Results

3.5.1 ALL vs. AML data set

We applied supervised biclustering to a well-researched microarray data set containing samples from patients diagnosed with acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) diseases [45]. It has been the subject of a variety of research papers, e.g. [8, 11, 89, 91]. This data set was also used in the CAMDA data contest [28]. It is divided into two parts – the training set (27 ALL, 11 AML samples), and the test set (20 ALL, 14 AML samples). According to the described methodology, we performed feature selection for obtaining a consistent biclustering using the training set, and the samples of the test set were subsequently classified choosing for each of them the class with the highest average feature expression. The parameter of separation t = 0.1 was used. The algorithm selected 3439 features for class ALL and 3242 features for class AML. The obtained classification contains only one error: the AML-sample 66 was classified into the ALL class. To provide the justification of the quality of this result, we should mention that the support vector machines (SVM) approach delivers up to 5 classification errors on the ALL vs. AML data set depending on how the parameters of the method are tuned [89]. Furthermore, the perfect classification was obtained only with one specific set of values of the parameters.

The heatmap for the constructed biclustering is presented in Figure 3-2.

3.5.2 HuGE Index data set

Another computational experiment that we conducted was on feature selection for consistent biclustering of the Human Gene Expression (HuGE) Index data set [54]. The purpose of the HuGE project is to provide a comprehensive database of gene expressions in normal tissues of different parts of the human body and to highlight similarities and differences among the organ systems. We refer the reader to [53] for a detailed description of these studies. The data set consists of 59 samples from 19 distinct tissue types. It was obtained using oligonucleotide microarrays capturing 7070 genes. The samples were obtained from 49 human individuals: 24 males with a median age of 63 and 25 females with a median age of 50. Each sample came from a different individual, except for the first 7 BRA samples, which were from different brain regions of the same individual, and the 5th LI sample, which came from that individual as well. We applied Algorithm 3-1 to the data set with the parameter of separation t = 0.1. The obtained biclustering is summarized in Table 3-1 and its heatmap is presented in Figure 3-3. The distinct block-diagonal pattern of the heatmap evidences the high quality of the obtained feature classification. We also mention that the original studies of the HuGE Index data set in [53] were performed without 6 of the available samples: 2 KI samples, 2 LU samples, and 2 PR samples were excluded because their quality was too poor for the statistical methods used. Nevertheless, we may observe that none of them distorts the obtained biclustering pattern, which confirms the robustness of our method.

We have developed a new optimization framework to perform supervised biclustering with feature selection. It has been proved that the obtained partitions of samples and features of the data set satisfy a conic separation criterion of classification. Though the constructed fractional 0–1 programming formulation may be hard to tackle with direct solving methods, it admits a good linear continuous relaxation. Preliminary computational

Figure 3-2. ALL vs. AML heatmap

Figure 3-3. HuGE index heatmap

Table 3-1. HuGE index biclustering
Tissue type    Abbreviation  #samples  #features selected
Blood          BD            1         472
Brain          BRA           11        614
Breast         BRE           2         902
Colon          CO            1         367
Cervix         CX            1         107
Endometrium    ENDO          2         225
Esophagus      ES            1         289
Kidney         KI            6         159
Liver          LI            6         440
Lung           LU            6         102
Muscle         MU            6         532
Myometrium     MYO           2         163
Ovary          OV            2         272
Placenta       PL            2         514
Prostate       PR            4         174
Spleen         SP            1         417
Stomach        ST            1         442
Testes         TE            1         512
Vulva          VU            3         186

results show that by tightening it iteratively with valid inequalities linking the continuous and 0–1 variables, we are able to obtain a good heuristic solution providing a reliable feature selection and a test set classification based on it. We also note that, in contrast to many other data mining methodologies, the developed algorithm involves only one parameter that should be defined by the user. Further research work should reveal more properties relating solutions of the linear relaxation to solutions of the original fractional 0–1 programming problem. This should allow for more grounded choices of the class separation parameter t for feature selection and better solving methods. It is also interesting to investigate whether the problem (3–6) subject to (3–5) itself is NP-hard.

CHAPTER 4
AN OPTIMIZATION-BASED APPROACH FOR DATA CLASSIFICATION

4.1 Basic Definitions

A data set is normally given in the form of a rectangular matrix A = (a_{ij})_{m \times n}. The columns of this matrix represent n data samples, while the rows correspond to m features

of these samples. A matrix element a_{ij} gives us the expression of the i-th feature in the j-th sample. If the set of samples is partitioned into r classes, we will denote the k-th class by

Sk ⊂ {1 . . . n}, k = 1 . . . r. Next, we introduce a 0-1 matrix S = (sjk)n×r such that sjk = 1

if j ∈ Sk, and sjk = 0 otherwise. We will also consider centroids of those classes. Each

class centroid will be represented as a column of matrix C = (cik)m×r:

C = AS(S^T S)^{-1}.

The function

J(S) = \sum_{k=1}^{r} \sum_{j=1}^{n} s_{jk} \| a_{\cdot j} - c_{\cdot k} \|^2, \qquad (4–1)

where a_{\cdot j} denotes the j-th column of the matrix A, c_{\cdot k} denotes the k-th column of the matrix C, and \|\cdot\| denotes the Euclidean norm, will be called the sum-of-squares (or k-means) objective of the clustering given by the matrix S. The smaller the value J(S), the tighter are the clusters, as the sum of distances from cluster members to the corresponding centroid decreases. We will say that a clustering satisfies the sum-of-squares (or k-means)

criterion if, for each sample j, the distance from it to the centroid c_{\cdot\hat{k}} of the class S_{\hat{k}} \ni j is not greater than the distance to any other class centroid:

\| a_{\cdot j} - c_{\cdot\hat{k}} \| \le \| a_{\cdot j} - c_{\cdot k} \|, \qquad k = 1 \ldots r. \qquad (4–2)

The criterion (4–2) implies that the clustering is terminal for k-means algorithm, that is, if it is given as the input to the algorithm, no samples will be moved from one class to another, and the algorithm will terminate immediately.
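A small NumPy sketch of these definitions (ours; S is the 0–1 sample-to-class matrix) computes the centroids, the objective (4–1), and checks the k-means criterion (4–2):

import numpy as np

def kmeans_objective_and_criterion(A, S):
    """Return J(S) and whether criterion (4-2) holds for every sample.
    A: (m, n) data matrix; S: (n, r) 0-1 sample-to-class assignment."""
    C = A @ S @ np.linalg.inv(S.T @ S)             # class centroids, one per column
    dist = np.linalg.norm(A[:, :, None] - C[:, None, :], axis=0)   # (n, r) distances
    own = (dist * S).sum(axis=1)                   # distance of each sample to its own centroid
    J = float((own ** 2).sum())                    # sum-of-squares objective (4-1)
    criterion = bool(np.all(own[:, None] <= dist + 1e-12))   # (4-2), with a small tolerance
    return J, criterion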

61 Remark. Note, however, that the criterion (4–2) does not always guarantee that the clustering delivers even a local minimum to the objective (4–1). Indeed, let us consider, for example, three samples in the one-dimensional space represented by points 0, 5, and 8, with the clustering ({0, 5}, {8}) provided. This clustering is not (locally) optimal with respect to the k-means objective as moving 5 from the first class to the second decreases J(S) from 12.5 to 4.5. However, it is easy to verify that this clustering satisfies the k-means criterion (4–2). Next, we will say that a clustering satisfies the pairwise threshold criterion if the distance between any two samples that belong to the same class is always not greater than

any distance between two samples from different classes. This can be expressed by n(n-1)/2 inequalities of the form

\| a_{\cdot j_1} - a_{\cdot j_2} \|^2 \le D_{int}^2 \qquad (4–3)

if samples j1 and j2 are from the same class, and

\| a_{\cdot j_1} - a_{\cdot j_2} \|^2 \ge D_{ext}^2 \qquad (4–4)

if samples j1 and j2 are from different classes, with one additional inequality

Dint ≤ Dext. (4–5)

This way, Dint is an upper bound on the distance between samples in the same class and

Dext is a lower bound on the interclass distance. We will call the inequalities (4–3)-(4–5) the pairwise threshold constraints. The following property establishes a link between the sum-of-squares objective and the pairwise threshold criterion: Theorem 1. The sum of squares of distances between all samples of a class and the centroid of the class can be expressed via pairwise distances between the samples in the following way:

\sum_{j \in S_k} \| a_{\cdot j} - c_{\cdot k} \|^2 = \frac{1}{|S_k|} \sum_{j_1 \in S_k} \sum_{j_2 \in S_k, j_2 > j_1} \| a_{\cdot j_1} - a_{\cdot j_2} \|^2. \qquad (4–6)

Proof. By the Cosine Theorem,

\| a_{\cdot j_1} - a_{\cdot j_2} \|^2 = \| a_{\cdot j_1} - c_{\cdot k} \|^2 + \| a_{\cdot j_2} - c_{\cdot k} \|^2 - 2 \| a_{\cdot j_1} - c_{\cdot k} \| \| a_{\cdot j_2} - c_{\cdot k} \| \cos(a_{\cdot j_1} - c_{\cdot k},\ a_{\cdot j_2} - c_{\cdot k}) = \| a_{\cdot j_1} - c_{\cdot k} \|^2 + \| a_{\cdot j_2} - c_{\cdot k} \|^2 - 2 (a_{\cdot j_1} - c_{\cdot k})^T (a_{\cdot j_2} - c_{\cdot k}).

So,

\sum_{j_1 \in S_k} \sum_{j_2 \in S_k} \| a_{\cdot j_1} - a_{\cdot j_2} \|^2 = \sum_{j_1 \in S_k} \sum_{j_2 \in S_k} \left( \| a_{\cdot j_1} - c_{\cdot k} \|^2 + \| a_{\cdot j_2} - c_{\cdot k} \|^2 \right) - 2 \sum_{j_1 \in S_k} \sum_{j_2 \in S_k} (a_{\cdot j_1} - c_{\cdot k})^T (a_{\cdot j_2} - c_{\cdot k}).

The last term is zero, since

\sum_{j_1 \in S_k} \sum_{j_2 \in S_k} (a_{\cdot j_1} - c_{\cdot k})^T (a_{\cdot j_2} - c_{\cdot k}) = \sum_{j_1 \in S_k} (a_{\cdot j_1} - c_{\cdot k})^T \left( \sum_{j_2 \in S_k} (a_{\cdot j_2} - c_{\cdot k}) \right)

and

\sum_{j_2 \in S_k} (a_{\cdot j_2} - c_{\cdot k}) = \sum_{j_2 \in S_k} a_{\cdot j_2} - |S_k| c_{\cdot k} = |S_k| c_{\cdot k} - |S_k| c_{\cdot k} = 0.

Thus,

\sum_{j_1 \in S_k} \sum_{j_2 \in S_k} \| a_{\cdot j_1} - a_{\cdot j_2} \|^2 = \sum_{j_1 \in S_k} \sum_{j_2 \in S_k} \left( \| a_{\cdot j_1} - c_{\cdot k} \|^2 + \| a_{\cdot j_2} - c_{\cdot k} \|^2 \right) = 2 |S_k| \sum_{j \in S_k} \| a_{\cdot j} - c_{\cdot k} \|^2,

which implies (4–6).

Corollary 1. If a clustering S satisfies the pairwise threshold criterion, then

J(S) \le \frac{(n-r) D_{int}^2}{2}. \qquad (4–7)

Proof. Using the lemma, we obtain

J(S) = \sum_{k=1}^{r} \frac{1}{|S_k|} \sum_{j_1 \in S_k} \sum_{j_2 \in S_k, j_2 > j_1} \| a_{\cdot j_1} - a_{\cdot j_2} \|^2 \le \sum_{k=1}^{r} \frac{1}{|S_k|} \cdot \frac{|S_k|(|S_k| - 1)}{2} D_{int}^2 = \frac{(n-r) D_{int}^2}{2}.

Remark. Usually the pairwise threshold criterion is stronger than the k-means criterion, though a small example where it does not guarantee a local minimum of the objective (4–1) can be given. Consider the samples (0, -1), (0, 1), (2, 0), (4, 0), (4, 0), (4, 0) in the two-dimensional space. If the first two samples represent one class and the rest represent the other class, the pairwise threshold constraint is satisfied with D_{int} = 2, D_{ext} = \sqrt{5},

and J(S) = 5. However, if we move the sample (2, 0) to the first class, we have J(S) = 4\frac{2}{3} even though the pairwise threshold is then violated (the distance between the samples (0, 1) and (2, 0) of the same class is \sqrt{5}, but the distance between the samples (2, 0) and (4, 0) from different classes is 2).

Let us also introduce a vector of variables x = (xi)i=1...m bounded between 0 and 1 representing chosen feature weights:

0 ≤ xi ≤ 1, i = 1 . . . m. (4–8)

If xi = 0, then i-th feature is disregarded during the test set classification. If xi > 0 then

the value of x_i is the weight of the i-th feature, which we will use in our classification routine. This way, the vector x represents a feature selection made in the data set. Taking into account the feature selection, we may rewrite the sum-of-squares criterion as

\sum_{i=1}^{m} (a_{ij} - c_{i\hat{k}})^2 x_i \le \sum_{i=1}^{m} (a_{ij} - c_{ik})^2 x_i, \qquad k = 1 \ldots r, \qquad (4–9)

where \hat{k} is such that j \in S_{\hat{k}}. The pairwise threshold criterion will be defined by the inequalities

\sum_{i=1}^{m} (a_{ij_1} - a_{ij_2})^2 x_i \le D_{int}, \qquad (4–10)

64 if samples j1 and j2 are from the same class,

\sum_{i=1}^{m} (a_{ij_1} - a_{ij_2})^2 x_i \ge D_{ext} \qquad (4–11)

if samples j_1 and j_2 are from different classes, and

Dint ≤ Dext. (4–12)

4.2 Optimization Formulation and Classification Algorithm

Given a classification of the training set of samples, we will be performing feature selection in it utilizing the introduced clustering criteria. We will formulate an optimization problem over the variables x, where the objective is either to maximize the class separation or to minimize the information loss, and the constraints are given by one of the formulated clustering criteria. If the k-means criterion is employed, the objective will be

\max_{x} \sum_{i=1}^{m} x_i, \qquad (4–13)

which maximizes the total weight of the selected features, that is, omits the minimum amount of information from the data set. With the pairwise threshold criterion we will use the objective

\max_{x}\ D_{ext} - D_{int}, \qquad (4–14)

targeting the maximum separation between the classes. If this optimization problem has only the trivial solution x = 0, there is no possibility to satisfy the clustering criterion for the training set irrespective of the feature selection. In such cases we will relax the criterion by dropping some of the constraints. In order to decide which constraints should be dropped, we will analyze the Lagrangian multipliers (dual variables) corresponding to the trivial solution. We know that if the dual variable corresponding to a constraint is nonzero, this constraint is active and keeps the optimal solution from improvement. So, as long as x = 0 is the only feasible solution to the problem, we iteratively remove constraints with corresponding nonzero dual variables

until we obtain the opportunity to improve the solution. If this procedure leads to the removal of all constraints, we conclude that the given feature selection problem is not suitable for the chosen clustering criterion.
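As an illustration of the pairwise threshold formulation (4–14),(4–10)-(4–12),(4–8), the following sketch builds and solves the LP with the open-source PuLP modeler. This is our own rendering (the experiments in this chapter used CPLEX), and the dual-variable-driven constraint-dropping loop is omitted for brevity:

import itertools
import pulp

def pairwise_threshold_lp(A, labels):
    """A: m x n training matrix (list of rows), labels: class label per sample."""
    m, n = len(A), len(A[0])
    prob = pulp.LpProblem("feature_weights", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{i}", lowBound=0, upBound=1) for i in range(m)]   # (4-8)
    d_int = pulp.LpVariable("D_int", lowBound=0)
    d_ext = pulp.LpVariable("D_ext", lowBound=0)
    prob += d_ext - d_int                                    # objective (4-14)
    for j1, j2 in itertools.combinations(range(n), 2):
        expr = pulp.lpSum((A[i][j1] - A[i][j2]) ** 2 * x[i] for i in range(m))
        if labels[j1] == labels[j2]:
            prob += expr <= d_int                            # (4-10)
        else:
            prob += expr >= d_ext                            # (4-11)
    prob += d_int <= d_ext                                   # (4-12)
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [v.value() for v in x]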

After the feature selection is performed, we will assign a test sample b = (bi)i=1...m

• to the class S_{\hat{k}} having the nearest centroid c_{\cdot\hat{k}} with respect to the weights x, that is, for all k = 1 \ldots r,

\sum_{i=1}^{m} (b_i - c_{i\hat{k}})^2 x_i \le \sum_{i=1}^{m} (b_i - c_{ik})^2 x_i, \qquad (4–15)

if we use the k-means criterion;

• to the class S_{\hat{k}} \ni \hat{j} of the nearest neighbor \hat{j} from the training set with respect to the weights x, that is, for all j = 1 \ldots n,

\sum_{i=1}^{m} (b_i - a_{i\hat{j}})^2 x_i \le \sum_{i=1}^{m} (b_i - a_{ij})^2 x_i, \qquad (4–16)

if we use the pairwise threshold criterion.

In Figure 4-1 we describe our data classification algorithm formally.

Form the linear program (4–13),(4–9),(4–8) (if using the k-means criterion), or (4–14),(4–10),(4–11),(4–12),(4–8) (if using the pairwise threshold criterion);
repeat
    solve the linear programming problem;
    if x = 0 then
        drop constraints with |π_k| > ε, which are not the bound constraints (4–8), from the LP, where π_k is the dual variable corresponding to the k-th constraint;
    end
until x ≠ 0 or (4–8) are the only constraints remaining;
if x = 0 then
    exit; // no admissible feature selection exists with the chosen criterion
end
Classify each test sample using (4–15) (if using the k-means criterion) or using (4–16) (if using the pairwise threshold criterion);

Figure 4-1. Data classification algorithm
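The two classification rules (4–15) and (4–16) translate directly into code; a brief NumPy sketch (function names ours, inputs assumed to be NumPy arrays):

import numpy as np

def classify_nearest_centroid(b, C, x):
    """Rule (4-15): weighted nearest centroid. C: (m, r) centroid matrix, x: feature weights."""
    d = ((b[:, None] - C) ** 2 * x[:, None]).sum(axis=0)
    return int(np.argmin(d))

def classify_nearest_neighbor(b, A, labels, x):
    """Rule (4-16): weighted nearest neighbor over the training samples (columns of A)."""
    d = ((b[:, None] - A) ** 2 * x[:, None]).sum(axis=0)
    return labels[int(np.argmin(d))]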

We should note here that the algorithm has computational complexity comparable with solving the linear program, since whenever the linear program is solved more than once, its size decreases as constraints are removed. In the next section we discuss computational results on two DNA microarray data sets. For solving the linear programming problems we used CPLEX [35].

4.3 Computational Experiments

4.3.1 ALL vs. AML Data Set

The feature selection program with the k-means criterion (4–13),(4–9) delivered the optimum value 7069.3582 (which means that almost all features were selected with weights close to 1). The subsequent classification of the test set by (4–15) gave two misclassifications: the AML-samples 64 and 66 were classified into the ALL class. The pairwise threshold program (4–14),(4–10),(4–11),(4–12) selected 1457 features with nonzero weights. The subsequent classification of the test set was perfect: all ALL and AML test samples were classified into the appropriate classes. To provide justification of the quality of this result, we should mention that the support vector machines (SVM) approach delivers up to 5 classification errors on the ALL vs. AML data set depending on how the parameters of the method are tuned [89]. Furthermore, the perfect classification was obtained only with one specific set of values of the parameters.

4.3.2 Colon Cancer Data Set

A colon cancer microarray data set including expression profiles of 2000 genes from 22 normal tissues and 40 tumor samples was published in [3]. We randomly selected 11 normal and 20 tumor samples into the training set. The other half of samples were used as the test set. The feature selection program with the k-means criterion (4–13),(4–9) delivered the optimum value 1903.045. The number of features selected with nonzero weights was 1901.

The classification errors were as follows: 4 Normal samples (8, 12, 34, 36) were classified into the Tumor class, and 2 Tumor samples (30, 36) were classified into the Normal class. The pairwise threshold constraints allowed for a feasible solution only after two iterations of exclusion of active constraints, and after that only 32 features were selected with nonzero weights. The misclassified samples were 5 Normal (2, 8, 12, 34, 36) and 2 Tumor (30, 36).

4.4 Conclusions

We have developed an optimization approach to handling data classification problems, which uses a unified methodology for feature selection and classification with the possibility of outlier detection. It has a very natural connection to the concepts of unsupervised clustering. Since the used unsupervised clustering criteria are not fixed, the methodology is highly flexible and potentially may be used to process data of arbitrary nature. The fact that these practically important data mining problems can be represented as optimization problems allows us to use standard optimization software packages to solve them. This direction holds promise for a more efficient treatment of real-world problems, whose original formulation is normally quite fuzzy. The good performance on known microarray data sets confirms the reliability of the applied methodology.

CHAPTER 5
GRAPH MODELS IN DATA ANALYSIS

One of the most important aspects of data analysis is finding an efficient way of summarizing and visualizing the data that would allow one to obtain useful information about the properties of the data. As the data normally come as a sequence of samples and the crucial information characterizing the data most often lies in relations between the samples, graph models treating the samples as vertices and the relations as edges between them are a handy tool to represent and process the data. One such remarkable graph model can be found in the area of telecommunications.

The call graph G_C(V,E) is defined as a graph whose vertices V correspond to telephone numbers, and two of them are connected by an edge (u, v) ∈ E if a call was ever made from one number to the other. Abello et al. [2] used data from AT&T telephone billing records to construct this graph. This way, massive telecommunication traffic data were represented in a form suitable for information retrieval. Nevertheless, the call graph can be so large that the conventional methods of data analysis are not able to process it directly. Indeed, Abello et al. reported that, considering a 20-day period, one obtains a call graph of about 290 million vertices and 4 billion edges. The analyzed one-day call graph had 53,767,087 vertices and over 170 million edges. This graph had 3,667,448 connected components, most of which had just two vertices connected by an edge. Only 302,468 of the components (or 8%) had more than 3 vertices. It was also observed that the graph had a giant connected component of 44,989,297 vertices. Large cliques in the call graph may represent telephone service subscribers forming close groups, so finding them is useful for a number of data mining objectives (such as efficient marketing or detecting suspicious activities). Hence, the authors pursued finding large cliques (as well as even larger dense subgraphs) in the giant connected component using sophisticated heuristics. A similar graph model has been used to analyze connections among hosts of the Internet and the network of links of the World Wide Web. In the Internet graph G_I(V,E), the set

of vertices V corresponds to the set of routers navigating packets of data throughout the Internet, while an edge (u, v) ∈ E represents a physical connection between the router u ∈ V and the router v ∈ V (which can be either a cable connection or a wireless one). Specific data and research performed on the Internet graph can be found on the “Internet

Mapping Project” web page [32]. In the Web graph G_W(V,E), the set of vertices V represents the web pages existing in the World Wide Web, and an edge (u, v) ∈ E signifies a link from the page u ∈ V to the page v ∈ V. Notice that, in contrast to the graphs described earlier, the Web graph is genuinely directed, as each link in the World Wide Web has only one direction. However, for many purposes of data analysis the direction can be disregarded to simplify the model and to make applicable the methods proven to be efficient on other graph models of a similar type. The next remarkable graph model of similar spirit is the market graph introduced

by Boginski et al. [13]. In such a graph G_M(V,E) each vertex v ∈ V corresponds to a particular financial security. Then we compute the correlation matrix C (or the matrix of another similarity measure) for the set of securities V on the basis of observed prices over a certain period of time. Further, given some threshold c_0, we will consider an edge (u, v) to be present in the graph if and only if the correlation c_{uv} between securities u and v is

not less than c_0. Naturally, cliques in the market graph correspond to groups of securities showing similar behavior (thus, such securities are closely related to each other due to some structural properties of the economic system). On the other hand, independent sets suggest well diversified portfolios of securities whose behavior is less related to each other under the existing economic conditions. Finally, the last (but not least) application of the graph representation of pairwise relations comes from biomedicine. The human brain is one of the most complex systems ever studied by scientists, and the enormous number of neurons and the dynamic nature of connections between them make the analysis of brain function especially challenging. However, it is important to gain a certain understanding of such dynamics, at least,

for more efficient treatment of widespread neurological disorders, such as epilepsy. During the last decade the extensive use of electroencephalograms (EEG) allowed for significant progress in the quantitative representation of brain function. In short, EEG simultaneously records the electric signal from a number of prespecified spots on the brain, thus providing a collection of synchronized time series. Certain measures of similarity between these series (such as the T-index) can be computed and observed in their evolution throughout the time of recording (see, e.g., [56]). Then we can define the brain graph

G_B(V,E) whose vertices V correspond to the electrodes recording the EEG and, similarly to the market graph, an edge (u, v) ∈ E between two of them is introduced if their similarity exceeds a certain threshold. Cliques in the brain graph correspond to groups of functional areas entrained in a synchronous activity, which may clearly designate the current state of the brain or some existing pathological condition.

5.1 Cluster Cores Based Clustering

It comes as no surprise that the notion of large cliques is fruitful for clustering applications. Recently, Y.D. Shen et al. developed the so-called cluster cores based clustering for high-dimensional data [80]. The main advantages of this novel approach over conventional clustering methods consist in extending the applicable similarity measures between samples to semantic-based ones, and in linear scalability with respect to the number of samples. In fact, the similarity criterion for samples may be communicated by the user of the data mining application via a set of rules after the data set is already formed and made available for the analysis. For instance, if data samples represent different people and there are attributes expressing their occupation, age and income, the user may choose to take into account all of them, disregard one of them (say, age) as irrelevant to a particular research goal, or pick just one of them (say, income) as the only one of these three attributes significant for the goal. Such interactivity helps to avoid the undesirable influence of a priori irrelevant attributes on the outcome of the data mining procedure and, at the same time, to fight the "curse of dimensionality" arising from the enormous amount of

features that may characterize the samples of the data set, but which are not important to certain goals. In fact, the cores based clustering avoids working with the features completely by operating on the similarity graph defined for samples alone. This graph has the set of vertices representing the samples of the data set, and an edge between two vertices is introduced if and only if these two samples are recognized as similar by the user. The cluster cores are defined as maximal cliques in the similarity graph exceeding a certain size α:

Definition 3. A subset of samples Cr is a cluster core if it satisfies the following three conditions:

1. |Cr| ≥ α;

2. Any two samples from Cr are similar;

3. There exists no C′ such that Cr ⊂ C′ and C′ satisfies 2.

Then, a cluster is defined as a natural extension of the core, allowing additional samples that have no less than θ · |Cr| similar samples in Cr:

Definition 4. A subset of samples C is a cluster with respect to a selected threshold θ if it contains every sample of the data set such that the sample

1. either is a member of the core Cr,

2. or has at least θ · |Cr| samples similar to it in Cr, 0 ≤ θ ≤ 1.

Note that a cluster core Cr and a threshold 0 ≤ θ ≤ 1 define a unique cluster. Y.D. Shen et al. utilized the defined notions to construct clustering Algorithm 5-1.

5.2 Decision-Making under Constraints of Conflicts

Another fruitful application of graph models to data analysis comes from applications where a set of possible decisions is considered under the constraints of pairwise conflicts between them. In such a setup, each decision is modeled as a vertex of a graph, while an edge between two vertices is introduced if the two corresponding decisions have conflicts.

Input: similarity graph G(V,E), threshold parameter 0 ≤ θ ≤ 1
Output: set of clusters {Ci}
i ← 1;
while V ≠ ∅ do
    find a cluster core Cr ⊆ V;
    construct the cluster Ci as defined by Cr and θ;
    remove all vertices of Ci from G(V,E);
    i ← i + 1;
end

Figure 5-1. Cluster cores based clustering algorithm
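A rough Python rendering of the procedure in Figure 5-1 is given below. It uses networkx's maximal clique enumeration as a simple stand-in for the core search heuristic of [80] (which is far more scalable), so it only illustrates the definitions, not the original algorithm's performance; vertices left without an admissible core remain unclustered.

import networkx as nx

def cluster_cores_clustering(G, alpha, theta):
    """Cluster the vertices of the similarity graph G using cluster cores of size >= alpha."""
    G = G.copy()
    clusters = []
    while G.number_of_nodes() > 0:
        # Take a largest maximal clique as the core (exponential in the worst case).
        core = max(nx.find_cliques(G), key=len)
        if len(core) < alpha:
            break                                    # no admissible core remains
        core_set = set(core)
        cluster = core_set | {
            v for v in G.nodes
            if v not in core_set
            and len(core_set & set(G.neighbors(v))) >= theta * len(core_set)
        }
        clusters.append(cluster)
        G.remove_nodes_from(cluster)
    return clusters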

age  salary   assets   credit
20   30,000   25,000   poor
25   76,000   75,000   good
30   90,000   200,000  good
40   100,000  175,000  poor
50   110,000  250,000  good
60   50,000   150,000  good
70   35,000   125,000  poor
75   15,000   100,000  poor

Figure 5-2. Example of two CaRTs for a database

Then, any admissible set of decisions corresponds to an independent set of the graph, and, following a certain objective, we can associate appropriate weights with the graph vertices to obtain a correspondence between optimal sets of decisions and maximum weight independent sets of the graph. The maximum weight independent set problem is equivalent to the maximum weight clique problem for the complementary graph, hence it can also be tackled with efficient maximum clique heuristics. One remarkable application of this technique is modeling data sets/databases with a set of classification and regression trees (CaRTs) [17]. Let us illustrate the concept of a CaRT with a simple example considered in [4] (Figure 5-2). The presented table is an excerpt from a database table with 4 attributes and 8 records. Suppose that we consider

insignificant an error of magnitude 2 for age, 5,000 for salary, and 25,000 for assets. The figure depicts two possible decision trees for this dataset: the first (classification) tree predicts the credit attribute (with salary as the predictor attribute), and the second (regression) tree predicts the assets (with salary and age as the predictor attributes). When we model a dataset/database via CaRTs, we omit the predicted attributes in all but the outlier samples/records, and store the corresponding CaRTs instead of them. The value of such a representation is twofold. First, the CaRTs tell us a lot about the patterns existing in the data and have a powerful prediction ability for the properties of new samples (records) that may potentially be stored in the future. Second, the omission of predicted attributes is an opportunity to reduce the storage space required to store the data, which is a vital necessity in today's IT world overloaded with immense amounts of data and data flows. Virtually all industrial databases are maintained today without any compression, even when they are data warehouses of many terabytes in size (as an illustration of the emerging trends see, e.g., media coverage of “Top Ten Biggest, Baddest Databases” at http://www.eweek.com/article2/0,4149,1410944,00.asp). Finding the optimal set of CaRTs minimizing the size of the compressed table is a very challenging combinatorial optimization problem. In fact, there is not only an exponential number of possible CaRT-based models to choose from, but also constructing CaRTs (to estimate their compression benefits) is a computationally intensive task. Therefore, employment of sophisticated techniques from the knowledge discovery and combinatorial optimization areas is crucial. As the authors point out in [4], selection of the optimal CaRT models for a database table can be performed via solving the maximum weight independent set problem. Indeed, each CaRT can be treated as a decision representing a vertex in the graph. Two CaRTs have a conflict between each other if they either predict the same attribute, or one of them utilizes an attribute predicted by the other CaRT. The appropriate weight to assign to a CaRT is either some quantification of its prediction value

for the data analysis purposes, or the amount of storage that will be saved if the CaRT is used for the database compression purposes.

5.3 Conclusions

We have discussed a number of models and applications utilizing graphs and networks for data analysis purposes. Ranging from IT and telecommunications to biomedicine and finance, these methodologies provide a handy tool to grasp the most important characteristics of the data and visualize them. As technological progress continues, it may be expected that more and more practical fields will become a source of various massive data sets, for which graph models will be the efficient approach to apply. The remarkable role of cliques and independent sets in the constructed graphs gives additional motivation to come up with more practically efficient algorithms for the maximum clique/independent set problem, which is of great importance for computational complexity theory, being an NP-hard problem with one of the simplest formulations. Hence, the significance of this problem for modeling data sets and complex systems cannot be overestimated.

CHAPTER 6
A NEW TRUST REGION TECHNIQUE FOR THE MAXIMUM WEIGHT CLIQUE PROBLEM

6.1 Introduction

Let G(V,E) be a simple undirected graph, V = {1, 2, . . . , n}. The adjacency matrix

of G is the matrix A_G = (a_{ij})_{n \times n}, where a_{ij} = 1 if (i, j) ∈ E, and a_{ij} = 0 if (i, j) ∉ E. The set of vertices adjacent to a vertex i ∈ V will be denoted by N(i) = {j ∈ V : (i, j) ∈ E} and called the neighborhood of the vertex i. A subgraph G′(V′, E′), V′ ⊆ V, will be called induced by the vertex subset V′ if (i, j) ∈ E′ whenever i ∈ V′, j ∈ V′, and (i, j) ∈ E, and E′ includes no other edges. A clique Q is a subset of V such that any two vertices of Q are adjacent. It is called maximal if there is no other vertex in the graph connected with all vertices of Q. Similarly, an independent set S is a subset of V such that no two vertices of S are adjacent, and S is maximal if any other vertex of the graph is connected with at least one vertex of S. A graph is called complete multipartite if its vertex set can be partitioned into maximal independent sets (parts) such that any two vertices from different parts are adjacent. Obviously, a clique is a complete multipartite graph all of whose parts are single vertices. The maximum clique problem asks for a clique of maximum cardinality. This cardinality is called the clique number of the graph and is denoted by ω(G).

Next, we associate with each vertex i ∈ V a positive number wi called the vertex

weight. This way, along with the adjacency matrix AG, we consider the vector of vertex

weights w ∈ Rn. The total weight of a vertex subset S ⊆ V will be denoted by

W(S) = \sum_{i \in S} w_i.

The maximum weight clique problem asks for a clique Q of the maximum W (Q) value. We denote this value by ω(G, w). Both the maximum cardinality and the maximum weight clique problems are NP -hard [42], so it is considered unlikely that an exact polynomial time algorithm for

76 them exists. Approximation of large cliques is also hard. It was shown in [52] that unless NP = ZPP no polynomial time algorithm can approximate the clique number within a factor of n1− for any  > 0. Recently this margin was tightened to n/2(log n)1− [58]. Hence there is a great need in practically efficient heuristic algorithms. For an extensive survey of developed methods, see [14]. The approaches offered include such common combinatorial optimization techniques as sequential greedy heuristics, local search heuristics, methods based on simulated annealing, neural networks, genetic algorithms, tabu search, etc. Among the most recent and promising combinatorial algorithms are the augmentation algorithm based on edge projection by Mannino and Stefanutti [66] and the decomposition method with penalty evaporation heuristic suggested by St-Louis, Ferland, and Gendron [82]. Finally, there are methods utilizing various formulations of the clique problem as a continuous (nonconvex) optimization problem. The most recent methods of this kind include PBH algorithm by Massaro, Pelillo, and Bomze [67], and Max-AO algorithm by Burer, Monteiro, and Zhang [21]. The first one is based on linear complementarity formulation of the clique problem, while the second one employs a low-rank restriction upon the primal semidefinite program computing the Lov´asz number (ϑ-function) of a graph. In this chapter we present a continuous maximum weight clique algorithm named QUALEX-MS (QUick ALmost EXact Motzkin–Straus-based.) It follows the idea of finding stationary points of a quadratic function over a sphere for guessing near-optimum cliques exploited in QUALEX and QSH algorithms [22, 25]. However, QUALEX-MS is based on a new generalized version of the Motzkin–Straus quadratic programming formulation for the maximum weight clique, and we attribute its better performance to specific properties of its optima also discussed in this chapter. A software package implementing QUALEX-MS is available at [25]. The chapter is organized as follows. In Section 6.2 we revise the Motzkin–Straus theorem to use the quadratic programming formulation for the maximum weight clique

problem. Section 6.3 reviews the trust region problem and the computation of its stationary points. In Section 6.4 we provide a theoretical result connecting the trust region stationary points with maximum clique finding and formulate the QUALEX-MS method itself. Section 6.5 describes computational experiments with the algorithm and their results. In the final Section 6.6 we make some conclusions and outline further research work.
6.2 The Motzkin–Straus Theorem for Maximum Clique and Its Generalization

In 1965, Motzkin and Straus formulated the maximum clique problem as a certain quadratic program over a simplex [70]. Theorem 7 (Motzkin–Straus). The global optimum value of the quadratic program

\max f(x) = \frac{1}{2} x^T A_G x    (6–1)

subject to

\sum_{i∈V} x_i = 1,  x ≥ 0    (6–2)

is

\frac{1}{2} (1 − \frac{1}{ω(G)}).    (6–3)

See [1] for a recent direct proof. We formulate a simple generalization of this result for the maximum weight clique problem and prove it similarly to [1]. In contrast to the generalization established in [44], this one does not require any reformulation of the maximum clique quadratic program to another minimization problem. It maximally preserves the form of the original Motzkin–Straus result.
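As a small numerical illustration of (6–1)–(6–3) (assuming NumPy; the 5-vertex graph below is a toy example, not taken from the text), evaluating f at the uniform vector supported on a maximum clique recovers the stated optimum value.

import numpy as np

# 5-cycle plus one chord; the maximum clique is {0, 1, 2}, so omega(G) = 3
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
n, omega = 5, 3
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

x = np.zeros(n)
x[[0, 1, 2]] = 1.0 / omega             # simplex point spread uniformly over the clique
f = 0.5 * x @ A @ x
print(f, 0.5 * (1 - 1.0 / omega))      # both print 0.3333...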

Let w_min be the smallest vertex weight existing in the graph. We introduce a vector d ∈ R^n such that

d_i = 1 − \frac{w_min}{w_i}.

Consider the following quadratic program:

\max f(x) = x^T (A_G + diag(d_1, ..., d_n)) x    (6–4)

subject to

\sum_{i∈V} x_i = 1,  x ≥ 0.    (6–5)

First, we formulate a preliminary lemma.

Lemma 1. Let x' be a feasible solution of the program (6–4)-(6–5) and (i,j) ∉ E be a non-adjacent vertex pair such that x'_i > 0, x'_j > 0, and (without loss of generality) \frac{∂f}{∂x_i}(x') ≥ \frac{∂f}{∂x_j}(x'). Then the point x'', where

x''_i = x'_i + x'_j,  x''_j = 0,  x''_k = x'_k,  k ∈ V, i ≠ k ≠ j,    (6–6)

is also a feasible solution of (6–4)-(6–5) and f(x'') ≥ f(x'). The equality f(x'') = f(x') holds if and only if w_i = w_j = w_min and \sum_{k∈N(i)} x'_k = \sum_{k∈N(j)} x'_k.

Proof. It is easy to see that x'' satisfies the constraints (6–5) and hence it is a feasible

solution. Now we show that this solution is at least as good as x'. Since (i,j) ∉ E, a_{ij} = 0 and there is no x_i x_j term in the objective f(x). So, we can partition f(x) into terms dependent on x_i, terms dependent on x_j, and the other terms:

f_i(x) = d_i x_i^2 + 2 x_i \sum_{k∈N(i)} x_k,

f_j(x) = d_j x_j^2 + 2 x_j \sum_{k∈N(j)} x_k,

\bar{f}_{ij}(x) = f(x) − f_i(x) − f_j(x).

The partial derivatives of f(x) with respect to x_i and x_j are:

\frac{∂f}{∂x_i} = \frac{∂f_i}{∂x_i} = 2 d_i x_i + 2 \sum_{k∈N(i)} x_k,

\frac{∂f}{∂x_j} = \frac{∂f_j}{∂x_j} = 2 d_j x_j + 2 \sum_{k∈N(j)} x_k.

We have \bar{f}_{ij}(x'') = \bar{f}_{ij}(x') and f_j(x'') = 0, so to compare f(x'') to f(x') we should evaluate f_i(x'') and compare it to f_i(x') + f_j(x'). In these computations we take into account that d_i and d_j are always nonnegative:

f_i(x'') = d_i (x'_i + x'_j)^2 + 2 (x'_i + x'_j) \sum_{k∈N(i)} x'_k
= f_i(x') + 2 d_i x'_i x'_j + d_i (x'_j)^2 + 2 x'_j \sum_{k∈N(i)} x'_k
= f_i(x') + x'_j (2 d_i x'_i + 2 \sum_{k∈N(i)} x'_k) + d_i (x'_j)^2
= f_i(x') + x'_j \frac{∂f_i}{∂x_i}(x') + d_i (x'_j)^2 ≥ f_i(x') + x'_j \frac{∂f_j}{∂x_j}(x') + d_i (x'_j)^2    (6–7)
≥ f_i(x') + x'_j \frac{∂f_j}{∂x_j}(x') = f_i(x') + 2 d_j (x'_j)^2 + 2 x'_j \sum_{k∈N(j)} x'_k
≥ f_i(x') + d_j (x'_j)^2 + 2 x'_j \sum_{k∈N(j)} x'_k = f_i(x') + f_j(x').

Hence f(x'') ≥ f(x'). Next, we observe that all the ≥-relations in (6–7) become equalities if and only if d_i = d_j = 0 and \frac{∂f_i}{∂x_i}(x') = \frac{∂f_j}{∂x_j}(x'). The first immediately implies w_i = w_j = w_min, and together with the second it means that \sum_{k∈N(i)} x'_k = \sum_{k∈N(j)} x'_k. This completes the proof of the lemma.

Now we are ready to establish the generalized version of the Motzkin–Straus theorem.

Theorem 8. The global optimum value of the program (6–4)-(6–5) is

1 − \frac{w_min}{ω(G, w)}.    (6–8)

For each maximum weight clique Q* of the graph G(V,E) there is a global optimum x* of the program (6–4)-(6–5) such that

x*_i = w_i/ω(G, w) if i ∈ Q*,  and  x*_i = 0 if i ∈ V \ Q*.    (6–9)

Proof. Let us define the support of a feasible solution x' as the set of indices of its nonzero variables, V' = {i ∈ V : x'_i > 0}. From Lemma 1 it follows that the program (6–4)-(6–5) has a global optimum whose support is a clique. Indeed, if x' is a global optimum such that for some non-adjacent vertex pair (i,j) ∉ E, x'_i > 0 and x'_j > 0, then the point x'' defined in (6–6) is also a global optimum. Using this property we can always obtain a global optimum x* whose support is a clique Q*. Now we show that Q* is necessarily a maximum weight clique.

Indeed, in the subspace {x_i : i ∈ Q*} we have the program:

\max f(x) = \sum_{i∈Q*} d_i x_i^2 + \sum_{i∈Q*} \sum_{j∈Q*, j≠i} x_i x_j    (6–10)

subject to \sum_{i∈Q*} x_i = 1.

The objective may be transformed to

(\sum_{i∈Q*} x_i)^2 − \sum_{i∈Q*} \frac{w_min x_i^2}{w_i}.

The first term is equal to 1 due to the constraint, so we may consider an equivalent program:

\sum_{i∈Q*} \frac{x_i^2}{w_i} → \min.

The Lagrangian of this program is

\sum_{i∈Q*} \frac{x_i^2}{w_i} − λ (\sum_{i∈Q*} x_i − 1).

It is easy to see it has the only stationary point

x_i = \frac{w_i}{W(Q*)}, i ∈ Q*;  λ = \frac{2}{W(Q*)},

and this point is the minimum. So, x*_i = w_i/W(Q*), i ∈ Q*. Evaluate the objective f(x*). It is

1 − \sum_{i∈Q*} \frac{w_min (x*_i)^2}{w_i} = 1 − \sum_{i∈Q*} \frac{w_min w_i}{(W(Q*))^2} = 1 − \frac{w_min}{W(Q*)}.

This value is largest when W(Q*) is largest, so the objective attains a global optimum when Q* is a maximum weight clique. Therefore,

\max f(x) = f(x*) = 1 − \frac{w_min}{ω(G, w)}.

Finally, it is easy to see that for any maximum weight clique Q** the point x** defined as x**_i = w_i/ω(G, w) if i ∈ Q** and x**_i = 0 if i ∈ V \ Q** provides the objective value (6–8). So, each maximum weight clique has a global optimum of the program (6–4)-(6–5) corresponding to it, as claimed.

We extend Theorem 8 by the following result characterizing the global optima of (6–4)-(6–5):

Theorem 9. Let x* be a global optimum of the program (6–4)-(6–5) and G*(V*,E*) be

the subgraph induced by the support V* = {i ∈ V : x*_i > 0} of x*. Then G* is a complete multipartite graph, any part of which may have more than one vertex only if all vertices of this part have the same weight w_min, and any maximal clique of G* is a maximum weight clique of the graph G(V,E).

Proof. First we prove that if the subgraph G* includes a non-adjacent vertex pair (i,j) ∉ E, then the vertices i and j necessarily have the same neighborhood in it. Lemma 1 necessitates the conditions w_i = w_j = w_min and \sum_{k∈N(i)} x*_k = \sum_{k∈N(j)} x*_k. Suppose there is some ℓ ∈ V* such that (i,ℓ) ∈ E while (j,ℓ) ∉ E. Then Lemma 1 also necessitates w_ℓ = w_min and \sum_{k∈N(ℓ)} x*_k = \sum_{k∈N(j)} x*_k. Obtain the point x** from x* by altering only two coordinates: x**_i = x*_i + x*_j/2 and x**_j = x*_j/2. Obviously, x** is also a global optimum, as because of the abovementioned conditions the sum of the terms of the objective dependent on x_i or x_j remains the same. But now

\sum_{k∈N(ℓ)} x**_k = \sum_{k∈N(ℓ)} x*_k + x*_j/2 = \sum_{k∈N(j)} x**_k + x*_j/2 > \sum_{k∈N(j)} x**_k,

so the value f(x**) can be improved by increasing x**_ℓ while further decreasing x**_j. Hence neither x** nor x* is a global optimum, and we have obtained a contradiction. Similarly, we can show that there is no ℓ ∈ V* such that (j,ℓ) ∈ E while (i,ℓ) ∉ E. Thus, i and j have the same neighborhood in G*.

Now it is easy to see that the maximal independent sets of G* do not intersect, and hence it is a complete multipartite graph. As it can have a non-adjacent pair of vertices only if both vertices of this pair have the weight w_min, we obtain that G* cannot have multivertex parts with vertices of another weight. Next, using the transformation (6–6) for each non-adjacent vertex pair in G*, we can arrive at another global optimum whose support is an arbitrary maximal clique of G*. As we have shown in the proof of Theorem 8, this implies that such a clique is a maximum weight one. Therefore, all maximal cliques of G* are maximum weight cliques of G.

Theorem 9 evidences that all global optima of the Motzkin–Straus program are equally useful for solving the clique problem, and there is no need to drive out "spurious" optima not corresponding directly to cliques. Even more, a global optimum whose support includes non-adjacent vertices provides more information, as it immediately reveals a family of optimum cliques. One may observe that the program (6–4)-(6–5) has a similar correspondence of its local optima to other maximal cliques of the graph. Hence it is complicated to arrive at an optimum clique by applying gradient-based optimization methods to the Motzkin–Straus program. So, in our work we explore another approach.

For the development of our method we will use a rescaled form of the quadratic program (6–4)-(6–5). First of all, for the graph G(V,E) with the vertex weights w define

the weighted adjacency matrix A_G^{(w)} = (a_{ij}^{(w)})_{n×n} such that

a_{ij}^{(w)} = w_i − w_min if i = j;  a_{ij}^{(w)} = \sqrt{w_i w_j} if (i,j) ∈ E;  a_{ij}^{(w)} = 0 if i ≠ j and (i,j) ∉ E.    (6–11)

Obviously, it is the ordinary adjacency matrix when all vertex weights are ones. Next, we introduce the vector of vertex weight square roots

z ∈ R^n:  z_i = \sqrt{w_i}.    (6–12)

The rescaled formulation is given in the following corollary of Theorem 8.

Corollary 1. The global optimum value of the quadratic program

\max f(x) = x^T A_G^{(w)} x    (6–13)

subject to

z^T x = 1,  x ≥ 0    (6–14)

is

1 − \frac{w_min}{ω(G, w)}.

For each maximum weight clique Q* of the graph G(V,E) there is a global optimum of (6–13)-(6–14) such that

x*_i = z_i/ω(G, w) if i ∈ Q*,  and  x*_i = 0 if i ∈ V \ Q*.    (6–15)

Proof. Perform the variable scaling x_i → \sqrt{w_i} x_i in the formulation of Theorem 8. The corollary is obtained immediately.

A useful property of the rescaled formulation is that the optima corresponding to all maximum weight cliques are located at the same distance from the origin. Now we state this fact formally.

Definition 5. An indicator of a clique Q ⊆ V is a vector x^Q ∈ R^n such that x^Q_i = z_i/W(Q) if i ∈ Q, and x^Q_i = 0 if i ∈ V \ Q.

Proposition 1. All cliques of the same weight σ have indicators located at the same distance 1/\sqrt{σ} from the origin.

Proof. It follows immediately that the indicator of a clique Q ⊆ V with the weight W(Q) = σ is a vector of length

\sqrt{\sum_{i∈Q} (z_i/σ)^2} = \sqrt{W(Q)}/σ = 1/\sqrt{σ}.
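The following sketch (assuming NumPy; the 4-vertex weighted graph is a made-up example) illustrates (6–11), (6–13) and Proposition 1 by building the weighted adjacency matrix and a maximum weight clique indicator.

import numpy as np

w = np.array([2.0, 3.0, 5.0, 1.0])                 # vertex weights, w_min = 1
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]           # maximum weight clique Q* = {0,1,2}, W(Q*) = 10
n = len(w)
wmin = w.min()
z = np.sqrt(w)                                     # vector of weight square roots (6-12)

Aw = np.diag(w - wmin)                             # weighted adjacency matrix (6-11)
for i, j in edges:
    Aw[i, j] = Aw[j, i] = np.sqrt(w[i] * w[j])

Q = [0, 1, 2]
WQ = w[Q].sum()                                    # here W(Q*) = omega(G, w) = 10
xQ = np.zeros(n)
xQ[Q] = z[Q] / WQ                                  # clique indicator (Definition 5)

print(xQ @ Aw @ xQ, 1 - wmin / WQ)                 # objective (6-13) attains 1 - w_min/omega(G,w)
print(np.linalg.norm(xQ), 1 / np.sqrt(WQ))         # indicator radius 1/sqrt(W(Q)) (Proposition 1)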

We may notice here that cliques of larger weight have indicators located closer to the origin. The indicators of the maximum weight cliques have the smallest radius, namely 1/\sqrt{ω(G, w)}. The idea of our method is to replace the nonnegativity constraint x ≥ 0 in (6–14) by a ball constraint x^T x ≤ r^2 of a radius r ≈ 1/\sqrt{ω(G, w)} and to regard the stationary points of this new program as vectors significantly correlating with the maximum weight clique indicators. In the next section we outline how the stationary points of a quadratic on a sphere can be found in polynomial time. In our case this technique can be used after the objective is orthogonally projected onto the hyperplane z^T x = 1, so this equality may be removed from the constraints. In the subsequent section we give a substantiation of the constraint replacement used, proving a particular case when a spherical stationary point is exactly an optimum of the program (6–13)-(6–14), and formulate the algorithm itself.
6.3 The Trust Region Problem

The trust region problem is the optimization of a quadratic function subject to a ball constraint. The term originates from a nonlinear programming application of this problem. Namely, to improve a feasible point, a small ball – trust region – around the point is introduced and a quadratic approximation of the objective is optimized in it. Then, if the objective approximation is good enough within this locality, the ball optimum of the quadratic is very close to the optimum of the objective there, and it may be taken as the next improved feasible solution. This technique is very attractive in many cases since the optimization of a quadratic function over a sphere is polynomially solvable in contrast to general nonconvex programming [95]. There is a vast range of other sources describing

theoretical and practical results on the trust region problem [41, 48, 64, 69]. Here we outline the complete diagonalization method deriving not only the global optimum at a given sphere radius, but all stationary points corresponding to particular radii we want to consider. That is, the radius value remains non-fixed up to a final step when it appears as a parameter of a univariate equation determining the stationary points. We note that for our application we are interested in hyperbolic objectives only, so interior stationary points never exist. Thus, consider finding the stationary points for the problem

f(x) = x^T A x + 2 b^T x    (6–16)

s.t. \sum_{i=1}^{n} x_i^2 = r^2,

where A is a given real symmetric n × n matrix, b ∈ R^n is a given vector, and x ∈ R^n is the vector of variables. First, we diagonalize the quadratic form in (6–16) performing the eigendecomposition of A:

A = Q diag(λ_1, ..., λ_n) Q^T,

where Q is the matrix of eigenvectors (stored as columns) and the eigenvalues {λ_i} are in nondecreasing order. In the eigenvector basis, (6–16) is

f(y) = \sum_{i=1}^{n} λ_i y_i^2 + 2 \sum_{i=1}^{n} c_i y_i,    (6–17)

\sum_{i=1}^{n} y_i^2 = r^2,    (6–18)

and the following relations hold:

x = Q y,  y = Q^T x,  b = Q c,  c = Q^T b.    (6–19)

The Lagrangian of (6–17)-(6–18) is

L(y, µ) = \sum_{i=1}^{n} λ_i y_i^2 + 2 \sum_{i=1}^{n} c_i y_i − µ (\sum_{i=1}^{n} y_i^2 − r^2).    (6–20)

Here µ is the Lagrangian multiplier of the spherical constraint. We take it with a negative sign for the sake of convenience. The stationary conditions are

\frac{∂L}{∂y_i} = 0,  \frac{∂L}{∂µ} = 0.

So,

\frac{∂L}{∂y_i} = 2 (λ_i − µ) y_i + 2 c_i = 0,

and assuming µ ≠ λ_i,

y_i = \frac{c_i}{µ − λ_i}.    (6–21)

Substituting (6–21) into the spherical constraint (6–18), we get

\sum_{i=1}^{n} \frac{c_i^2}{(µ − λ_i)^2} − r^2 = 0.    (6–22)

The left-hand side of (6–22) is a univariate function consisting of n + 1 continuous and convex pieces. As all the numerators are positive, in each piece between two successive eigenvalues of A it may intersect the µ-axis twice (determining two stationary points on the sphere), touch it once (determining one stationary point), or stay above the axis (no stationary point corresponds to these µ values). That depends on the chosen radius r: the greater the radius, the more cases of two spherical stationary points within one continuous piece of (6–22). The two outermost continuous pieces are (−∞, λ_1) and (λ_n, +∞). In each of them (6–22) always has one and only one root. The root in the first piece is the global minimum, the root in the second piece is the global maximum.
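For illustration, the sketch below (assuming NumPy and SciPy; the matrix is random toy data, and only the non-degenerate case with all c_i ≠ 0 is handled) finds the two extreme roots of (6–22) by bracketed root finding and recovers the corresponding stationary points via (6–21).

import numpy as np
from scipy.optimize import brentq

def extreme_stationary_points(A, b, r):
    """Global min/max of x^T A x + 2 b^T x on the sphere ||x|| = r via (6-17)-(6-22)."""
    lam, Q = np.linalg.eigh(A)          # eigenvalues in nondecreasing order
    c = Q.T @ b
    g = lambda mu: np.sum(c**2 / (mu - lam)**2) - r**2   # left-hand side of (6-22)

    points = []
    for lo_piece in (True, False):
        edge = lam[0] if lo_piece else lam[-1]
        step = 1.0 + abs(edge)
        far = edge - step if lo_piece else edge + step
        while g(far) > 0:               # far from the spectrum g tends to -r^2 < 0
            step *= 2.0
            far = edge - step if lo_piece else edge + step
        near = edge - 1e-9 if lo_piece else edge + 1e-9   # g -> +inf next to an eigenvalue
        mu = brentq(g, min(near, far), max(near, far))    # bracketed root of (6-22)
        y = c / (mu - lam)              # stationary point in the eigenbasis, (6-21)
        points.append(Q @ y)            # back to the original coordinates, (6-19)
    return points                        # [global minimizer, global maximizer]

# toy data (assumed c_i != 0 for all i, i.e., the non-degenerate case)
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2
b = rng.standard_normal(5)
for x in extreme_stationary_points(A, b, r=1.0):
    print(np.linalg.norm(x), x @ A @ x + 2 * b @ x)      # norm is 1.0; min and max values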

A degenerate case when µ = λ_i for some i is possible if c_i = 0. Then, if λ_i is a multiple eigenvalue of A, all c_j corresponding to λ_j = λ_i must be equal to zero to cause the degeneracy. Then all y_j such that µ ≠ λ_j should be computed by (6–21), and if the sum of their squares is not above r^2, any combination of the remaining entries of y obeying (6–18) gives a stationary point. Formally, we have a cluster of k equal eigenvalues λ_i = λ_{i+1} = ... = λ_{i+k−1} with

c_i = c_{i+1} = ... = c_{i+k−1} = 0.    (6–23)

If

r_0^2 = \sum_{j=1}^{i−1} y_j^2 + \sum_{j=i+k}^{n} y_j^2 ≤ r^2,    (6–24)

where the values y_j are computed by (6–21) with µ = λ_i, then any y_i, y_{i+1}, ..., y_{i+k−1} such that

\sum_{j=i}^{i+k−1} y_j^2 = r^2 − r_0^2

provide a stationary point. So, it is possible then that the number of stationary points is infinite. In our method we will consider, in the degenerate case, only such points that all but one of the entries y_i, y_{i+1}, ..., y_{i+k−1} are zero. There are 2k cases:

y_i = ±\sqrt{r^2 − r_0^2}, y_{i+1} = 0, ..., y_{i+k−1} = 0,
y_i = 0, y_{i+1} = ±\sqrt{r^2 − r_0^2}, ..., y_{i+k−1} = 0,    (6–25)
...
y_i = 0, y_{i+1} = 0, ..., y_{i+k−1} = ±\sqrt{r^2 − r_0^2},

so an eigenvalue of multiplicity k gives 2k points to consider.

Finally, we note that the total complexity of the above procedure is O(n^3) if we derive O(n) stationary points and it takes not more than O(n^2) time to get one µ value. Indeed, the eigendecomposition may be computed up to any fixed precision in O(n^3) time [72], and each basis conversion in (6–19) takes quadratic time, so generally we have one O(n^3) computation at the beginning of the procedure, and O(n) computations of O(n^2) complexity each afterwards.

6.4 The QUALEX-MS Algorithm

Thus, we will work with the program

\max f(x) = x^T A_G^{(w)} x    (6–26)

s.t. z^T x = 1,  x^T x ≤ r^2,

where r is a parameter not fixed a priori. We now describe a particular case when a stationary point of the program (6–26) is an optimum of the program (6–13)-(6–14). It happens when for each vertex outside a maximum weight clique the weight sum of its adjacent vertices in the clique is constant. Namely, the following theorem holds.

Theorem 10. Let Q ⊆ V be a maximal clique of the graph G(V,E) such that

∀v ∈ V \ Q :  W(N(v) ∩ Q) = C,

where C is some fixed value. Then the indicator x^Q of Q,

x^Q_i = z_i/W(Q) if i ∈ Q,  and  x^Q_i = 0 if i ∈ V \ Q,

is a stationary point of the program (6–26) when the parameter r = 1/\sqrt{W(Q)}.

Proof. Consider the Lagrangian of the program (6–26). It is

L(x^Q, µ_1, µ_2) = (x^Q)^T A_G^{(w)} x^Q + µ_1 (z^T x^Q − 1) + µ_2 ((x^Q)^T x^Q − r^2).

Its partial derivatives are

\frac{∂L}{∂x^Q_i} = 2 \sum_{j∈V} a_{ij}^{(w)} x^Q_j + z_i µ_1 + 2 x^Q_i µ_2
= 2 z_i (z_i x^Q_i + \sum_{j∈N(i)} z_j x^Q_j) − 2 w_min x^Q_i + z_i µ_1 + 2 x^Q_i µ_2.

Let i ∈ Q. Then it gives

\frac{∂L}{∂x^Q_i} = 2 z_i \sum_{j∈Q} \frac{w_j}{W(Q)} − 2 w_min \frac{z_i}{W(Q)} + z_i µ_1 + \frac{2 z_i}{W(Q)} µ_2
= z_i (2 − \frac{2 w_min}{W(Q)} + µ_1 + \frac{2 µ_2}{W(Q)}).

Conversely, if i ∈ V \ Q,

\frac{∂L}{∂x^Q_i} = 2 z_i \sum_{j∈N(i)∩Q} z_j x^Q_j + z_i µ_1 = z_i (\frac{2C}{W(Q)} + µ_1).

We may see that in both cases the partial derivative is the same for each i up to a nonzero multiplier z_i. So, the stationary point criterion system ∂L/∂x^Q_i = 0 is reduced to two equations over the two variables µ_1 and µ_2. The second equation directly gives

µ_1 = −\frac{2C}{W(Q)}.

Substituting this into the first equation, we obtain

µ_2 = C + w_min − W(Q).

So, there are values of the Lagrangian multipliers satisfying the stationary point criterion. Therefore, x^Q is a stationary point of the program (6–26).
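Theorem 10 can also be checked numerically on a small example (a sketch assuming NumPy; the unweighted 5-vertex graph is our own toy instance with C = 2): stationarity of x^Q means that the objective gradient 2 A_G^{(w)} x^Q lies in the span of z and x^Q, which we verify by a least-squares residual.

import numpy as np

# made-up unweighted graph: triangle Q = {0,1,2}; vertex 3 ~ {0,1}; vertex 4 ~ {1,2}
# every outside vertex sees total weight C = 2 inside Q, so Theorem 10 applies
edges = [(0, 1), (1, 2), (0, 2), (3, 0), (3, 1), (4, 1), (4, 2)]
n = 5
w = np.ones(n)
wmin, z = 1.0, np.sqrt(w)

Aw = np.diag(w - wmin)                       # (6-11); the diagonal is zero here
for i, j in edges:
    Aw[i, j] = Aw[j, i] = np.sqrt(w[i] * w[j])

Q = [0, 1, 2]
WQ = w[Q].sum()
xQ = np.zeros(n)
xQ[Q] = z[Q] / WQ                            # clique indicator, ||xQ|| = 1/sqrt(W(Q))

grad = 2 * Aw @ xQ                           # gradient of the objective of (6-26) at xQ
B = np.column_stack([z, xQ])                 # stationarity: grad + mu1*z + 2*mu2*xQ = 0
coef = np.linalg.lstsq(B, -grad, rcond=None)[0]
print(np.linalg.norm(B @ coef + grad))       # ~0: the gradient lies in span{z, xQ}
print(coef)                                  # [mu1, 2*mu2] = [-2C/W(Q), 2(C + w_min - W(Q))]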

We notice that the obtained µ_2 value is negative unless the weight of the clique Q can be increased by a one-to-one vertex exchange. This means that at the stationary points we are interested in, the gradient of the objective is directed outside the constraining sphere. This is consistent with the fact that we look for a maximum of the objective. We note a special case of Theorem 10 corresponding to the maximum cardinality clique problem.

Corollary 2. Let Q ⊆ V be a maximal clique of the graph G(V,E) such that

∀v ∈ V \ Q :  |N(v) ∩ Q| = C,

where C is some fixed value, and all vertex weights w_i equal 1. Then the indicator x^Q of Q,

x^Q_i = 1/|Q| if i ∈ Q,  and  x^Q_i = 0 if i ∈ V \ Q,

is a stationary point of the program (6–26) when the parameter r = 1/\sqrt{|Q|}.

Generally, optima of (6–13)-(6–14) cannot be found directly as stationary points of (6–26). However, we accept the assumption that if the parameter r is close to 1/\sqrt{ω(G, w)}, then the stationary points of (6–26) at which the objective gradient is directed outside provide significant information about maximum weight clique indicators. This may be supported by the fact that the conjunction of the three imposed requirements – maximization of a quadratic form whose matrix is nonnegative, a positive dot product with the positive vector z, and a rather small norm of the sought vector x – suggests that the vector x should be composed mostly of positive entries. Thus, we heuristically expect that violation of the nonnegativity constraint is not very dramatic.

As the next step, we show how to reduce the program (6–26) to a trust region problem by projecting the objective orthogonally onto the hyperplane z^T x = 1. First, we move the origin to a new point

x^0 = z/W(V).    (6–27)

This point is the orthogonal projection of the origin onto the hyperplane z^T x = 1. That is, we introduce new variables x̂ = x − z/W(V). This way we obtain a new program equivalent to (6–26):

\max g(x̂) = x̂^T A_G^{(w)} x̂ + 2 (x^0)^T A_G^{(w)} x̂    (6–28)

s.t. z^T x̂ = 0,  x̂^T x̂ ≤ r̂^2,

where r̂^2 = r^2 − 1/W(V) (here we took into account that (x^0)^T x^0 = 1/W(V)). Now the constraining equality determines a linear subspace. The orthogonal projector onto it is a matrix P = (p_{ij})_{n×n}, where

p_{ij} = 1 − w_i/W(V) if i = j,  and  p_{ij} = −\sqrt{w_i w_j}/W(V) if i ≠ j.

Thus, the program (6–28) may be reformulated as

\max g(x̂) = x̂^T Â x̂ + 2 b̂^T x̂    (6–29)

s.t. x̂^T x̂ ≤ r̂^2,

where Â = P A_G^{(w)} P and b̂^T = (x^0)^T A_G^{(w)} P. This is a trust region problem – the optimization of a quadratic subject to a single ball constraint. Direct matrix manipulations show that Â and b̂ can be computed by the formulas

â_{ij} = a_{ij}^{(w)} − x^0_j δ_i^{(w)} − x^0_i δ_j^{(w)} + x^0_i x^0_j D    (6–30)

and

b̂_i = \frac{δ_i^{(w)} − x^0_i D}{W(V)},    (6–31)

where

δ_i^{(w)} = \sqrt{w_i} (w_i − w_min + \sum_{j∈N(i)} w_j)    (6–32)

(which are the vertex degrees in the unweighted case), and

D = \sum_{j∈V} w_j (w_j − w_min) + \sum_{(j,k)∈E} w_j w_k.    (6–33)
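As a sanity check of the reduction (6–27)-(6–29), the sketch below (assuming NumPy; the weighted graph is a made-up example) builds P, Â and b̂ directly from their definitions rather than from the closed formulas (6–30)-(6–33), and verifies that the projected program reproduces the original objective up to the constant (x^0)^T A_G^{(w)} x^0.

import numpy as np

# toy weighted graph, chosen arbitrarily for illustration
w = np.array([2.0, 3.0, 5.0, 1.0])
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n, wmin, z = len(w), w.min(), np.sqrt(w)
WV = w.sum()

Aw = np.diag(w - wmin)                       # weighted adjacency matrix (6-11)
for i, j in edges:
    Aw[i, j] = Aw[j, i] = np.sqrt(w[i] * w[j])

x0 = z / WV                                  # new origin (6-27)
P = np.eye(n) - np.outer(z, z) / WV          # orthogonal projector onto z^T xhat = 0
Ahat = P @ Aw @ P                            # (6-29): Ahat = P A P
bhat = P @ Aw @ x0                           # (6-29): bhat^T = (x0)^T A P

# pick any x on the hyperplane z^T x = 1 and compare f(x) with g(xhat) + constant
rng = np.random.default_rng(1)
x = rng.random(n)
x += z * (1 - z @ x) / WV                    # project onto z^T x = 1 (note z^T z = W(V))
xhat = x - x0
f = x @ Aw @ x
g = xhat @ Ahat @ xhat + 2 * bhat @ xhat
print(f - x0 @ Aw @ x0, g)                   # equal: the reduction preserves the objective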

Thus, if Q is a maximum weight clique obeying the conditions of Theorem 10, its indicator may be recognized by the trust region procedure described in the previous section. Generally, we will handle the maximum weight clique problem in the following way, allowing us to keep the total complexity of the method within O(n^3) time.

Before applying the trust region technique, we find a possibly best clique Q by a fast greedy procedure. To improve it, we will try to search for cliques weighing at least W(Q) + w_min using the stationary points of the program (6–26). It follows from Proposition 1 that we should be interested in those points where

r̂^2 = \frac{1}{W(Q) + w_min} − \frac{1}{W(V)}    (6–34)

or less. In our method we consider the stationary points having this r̂^2 value, plus those corresponding to the µ values minimizing the left-hand side of (6–22) in each continuous section. Since cliques of larger weight correspond to smaller radii, we have a chance to correct the "shallowness" of the formula (6–34) by considering the minimum possible radii. Besides, to find stationary points at any fixed radius, we need to find those minimizing µ values anyway to determine how many roots (6–22) has on each continuous section. If the left-hand side minimum on a continuous section is negative, there are two roots, and each of them is bracketed between the minimizing point and one of the section bounds. Both univariate minimization and univariate root finding when a root is bracketed may be efficiently performed by Brent's method [18]. Next, each of the obtained stationary points is passed to a greedy heuristic as a new vertex weight vector, and the found clique is compared to the best clique known at this moment (initially it is the clique found at the preliminary stage). The algorithm result is the best clique obtained upon completion of this process.

The greedy heuristic used in our method to process the stationary points (Figure 6-1) is a generalization of the New-Best-In sequential degree heuristic. It runs in O(n^2) time. The usual version of this algorithm is obtained when the input vector x is the vertex weight vector w. Within our trust region technique we submit to this routine the obtained spherical stationary points.

Before anything else we apply a preprocessing able to reduce the input graph in some instances. It is clear that removing too loosely connected vertices and preselecting too highly connected vertices – when these operations do not lead to missing the exact solution – are desirable, as the Theorem 10 condition may be violated most by such vertices.

Input: graph G(V,E), vector x ∈ R^n
Output: a maximal clique Q
construct the vector y ∈ R^n such that y_i = x_i + \sum_{j∈N(i)} x_j;
V_1 ← V, k ← 1, Q ← ∅;
while V_k ≠ ∅ do
    choose a vertex v_k ∈ V_k such that y_{v_k} is greatest;
    Q ← Q ∪ {v_k};
    V_{k+1} ← V_k ∩ N(v_k);
    for each j ∈ V_{k+1} do
        y_j ← y_j − \sum_{ℓ∈(V_k \ V_{k+1})∩N(j)} x_ℓ;
    end
    k ← k + 1;
end

Figure 6-1. New-best-in weighted heuristic
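A direct Python transcription of the heuristic in Figure 6-1 might look as follows (a sketch assuming an adjacency-set representation of the graph; it is not the original implementation).

def nbiw(adj, x):
    """New-Best-In weighted heuristic (Figure 6-1).

    adj: list of sets, adj[i] = N(i); x: score vector (vertex weights or a
    spherical stationary point). Returns a maximal clique as a list of vertices.
    """
    n = len(adj)
    # y_i = x_i + sum of x over the neighborhood of i
    y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]
    Vk = set(range(n))
    Q = []
    while Vk:
        v = max(Vk, key=lambda i: y[i])      # best remaining vertex by y
        Q.append(v)
        Vnext = Vk & adj[v]                  # candidates must stay adjacent to all of Q
        dropped = Vk - Vnext
        for j in Vnext:
            # discount the scores by the dropped neighbors, as in Figure 6-1
            y[j] -= sum(x[l] for l in dropped & adj[j])
        Vk = Vnext
    return Q

# tiny usage example: a 4-cycle with a chord; prints a maximal clique, e.g. [0, 2, 1]
adj = [{1, 2, 3}, {0, 2}, {0, 1, 3}, {0, 2}]
print(nbiw(adj, [1.0] * 4))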

Thus, we iteratively remove vertices whose weight together with the neighborhood weight is below the clique weight derived by Algorithm 6-1, and preselect any vertex non-adjacent only to a set weighing not more than the vertex itself. It is easy to see that one cycle of steps 2.6–2.11 of Algorithm 6-2 (its inner repeat loop) takes not more than O(n^2) time and is repeated only if at least one vertex is removed from the graph. As well, there are not more than n calls of Algorithm 6-1. Hence, the preprocessing complexity is in O(n^3).

The preliminary greedy heuristic we use to derive a first approximation of the maximum weight clique calls Algorithm 6-1 n times, starting from each of the vertices as chosen a priori. Obviously, the complexity of Algorithm 6-3 is n · O(n^2) = O(n^3). It does not exceed the trust region procedure complexity, so this process does not increase the total complexity of the method. Thus, we propose Algorithm 6-4 for the maximum weight clique problem.
6.5 Computational Experiment Results

The goal of the first computational experiment was to find the smallest maximum clique instance on which QUALEX-MS cannot find an exact solution. We used the program geng, available at [68], to generate all pairwise non-isomorphic graphs with up to 10 vertices inclusive. QUALEX-MS successfully found exact solutions to all those instances.

Input: graph G(V,E), its vertex weight vector w
Output: reduced graph G(V,E), preselected vertex subset Q', clique Q
Q' ← ∅, B ← 0;
repeat
    assign Q the result of Algorithm 6-1 for G(V,E) with w;
    if W(Q) ≤ B then
        break;
    end
    B ← W(Q);
    flag ← false;
    compose the set R of vertices i ∈ V such that w_i + \sum_{j∈N(i)} w_j < B;
    if R ≠ ∅ then
        flag ← true;
    end
    repeat
        remove the vertex subset R from the graph G(V,E);
        compose a clique P of vertices i ∈ V such that w_i ≥ \sum_{j∈V\N(i)\{i}} w_j;
        B ← B − \sum_{j∈P} w_j;
        Q' ← Q' ∪ P;
        compose the set R of vertices i ∈ V such that P \ N(i) ≠ ∅;
        if R ≠ ∅ then
            flag ← true;
        end
    until R = ∅;
until not flag or V = ∅;

Figure 6-2. NBIW-based graph preprocess algorithm

Input: graph G(V,E), its vertex weight vector w
Output: maximal clique Q̂
Q̂ ← ∅;
for each i ∈ V do
    construct the subgraph N_G^i induced by N(i);
    assign Q the result of Algorithm 6-1 for N_G^i with its vertex weight subvector;
    Q ← Q ∪ {i};
    if Q is better than Q̂ then
        Q̂ ← Q;
    end
end

Figure 6-3. Meta-NBIW algorithm

Input: graph G(V,E), its vertex weight vector w
Output: maximal clique Q
execute Algorithm 6-2; store the preselected vertex set Q' and the clique Q;
if V = ∅ then
    Q ← Q ∪ Q';
    exit;
end
execute Algorithm 6-3 and store the result Q̂;
compute z by (6–12), x^0 by (6–27), δ^{(w)} by (6–32), and D by (6–33);
compute Â by (6–30) and b̂ by (6–31);
perform the eigendecomposition Â = R diag(λ_1, ..., λ_n) R^T;
compute the vector c = R^T b̂;
compute r^2 as r̂^2 by (6–34) for W(Q̂);
for each µ > 0 minimizing the left-hand side of (6–22) in a continuous interval or satisfying (6–22) do
    compute y by (6–21);
    x ← R y + x^0;
    rescale x_i ← z_i x_i, i ∈ V;
    execute Algorithm 6-1 with the vector x and rewrite the result in Q̂ if it is a better solution;
end
for each eigenvalue cluster λ_i = ... = λ_{i+k−1} > 0 satisfying (6–23) do
    compute all y_j, j ∈ V \ {i, ..., i+k−1}, by (6–21);
    compute r_0^2 by (6–24);
    if r_0^2 ≤ r^2 then
        for each combination of y_j, j ∈ {i, ..., i+k−1}, defined by (6–25) do
            x ← R y + x^0;
            rescale x_i ← z_i x_i, i ∈ V;
            execute Algorithm 6-1 with the vector x and rewrite the result in Q̂ if it is a better solution;
        end
    end
end
if Q̂ is a better solution than Q then
    Q ← Q̂;
end
Q ← Q ∪ Q';

Figure 6-4. QUALEX-MS algorithm

Though it cannot be excluded that with another vertex numbering in one of them the exact solution would have been lost, we consider this result to be strong evidence that counterexamples to the algorithm do not exist at least up to 11-vertex graphs. Unfortunately, there are too many non-isomorphic 11-vertex graphs to continue the experiment in the same way, so it has not been completed.

Next, we tested QUALEX-MS on all 80 DIMACS maximum clique instances (available at ftp://dimacs.rutgers.edu/pub/challenge/graph/) and compared the results against our earlier algorithms QSH and QUALEX 2.0 [22, 25]. All three programs were run on a Pentium IV 1.4GHz computer under OS Linux RedHat. However, the QUALEX-MS package makes use of a new eigendecomposition routine, DSYEVR from LAPACK, involving Relatively Robust Representations to compute eigenpairs after the matrix is reduced to a tridiagonal form [37]. This explains the improvement of the average running time versus the two other programs. As a BLAS implementation, the platform-specific prebuilt of the ATLAS library (available at http://www.netlib.org/atlas/archives/) was used.

Exact or best known solutions were found by QUALEX-MS in 57 instances. This is significantly better than the 39 exact or best known solutions by QSH and an advance compared to the 51 exact or best known solutions by QUALEX 2.0. For the remaining DIMACS graphs QUALEX-MS obtained good approximate solutions. The results are presented in Table 6-1.

The last computational experiment performed with QUALEX-MS was finding maximum weight cliques. Since there are no widely accepted maximum weight clique test suites, we followed the approach accepted in [67] and tested the algorithm against normal and irregular random graphs with various edge densities, comparing it to the algorithm PBH (which is another recent continuous optimization based heuristic). To generate the irregular random graphs, Algorithm 4.1 from [67] was used. Vertex weights were evenly distributed

97 Table 6-1. DIMACS maximum clique benchmark results Instance n den- ω(G) QSH QUALEX 2.0 QUALEX-MS sity found time found time found time brock200 1 200 0.745 21 21 1 21 1 21 1 brock200 2 200 0.496 12 12 < 1 12 < 1 12 < 1 brock200 3 200 0.605 15 15 < 1 15 1 15 1 brock200 4 200 0.658 17 17 1 17 < 1 17 < 1 brock400 1 400 0.748 27 27 4 27 4 27 2 brock400 2 400 0.749 29 29 4 29 4 29 3 brock400 3 400 0.748 31 31 4 31 5 31 2 brock400 4 400 0.749 33 33 4 33 4 33 2 brock800 1 800 0.649 23 17 37 23 36 23 18 brock800 2 800 0.651 24 24 38 24 35 24 18 brock800 3 800 0.649 25 25 38 25 37 25 18 brock800 4 800 0.650 26 26 37 26 35 26 18 C125.9 125 0.898 ≥ 34 31 < 1 33 < 1 34 < 1 C250.9 250 0.899 ≥ 44 42 1 43 1 44 1 C500.9 500 0.900 ≥ 57 52 8 53 8 55 4 C1000.9 1000 0.901 ≥ 68 62 103 63 71 64 27 C2000.5 2000 0.500 ≥ 16 13 1593 16 1547 16 278 C2000.9 2000 0.900 ≥ 77 67 1545 72 1519 72 215 C4000.5 4000 0.500 ≥ 18 15 16198 17 15558 17 2345 c-fat200-1 200 0.077 12 12 < 1 12 < 1 12 < 1 c-fat200-2 200 0.163 24 24 < 1 24 < 1 24 < 1 c-fat200-5 200 0.426 58 58 < 1 58 < 1 58 < 1 c-fat500-1 500 0.036 14 14 5 14 4 14 1 c-fat500-2 500 0.073 26 26 5 26 3 26 2 c-fat500-5 500 0.186 64 64 2 64 2 64 2 c-fat500-10 500 0.374 126 126 3 126 3 126 2 DSJC500.5 500 0.500 ≥ 13 11 9 13 8 13 5 DSJC1000.5 1000 0.500 ≥ 15 13 85 14 74 14 36 gen200 p0.9 44 200 0.900 44 37 1 39 1 42 < 1 gen200 p0.9 55 200 0.900 55 55 < 1 55 < 1 55 1 gen400 p0.9 55 400 0.900 55 48 4 50 4 51 2 gen400 p0.9 65 400 0.900 65 63 4 65 4 65 2 gen400 p0.9 75 400 0.900 75 75 4 75 4 75 2 hamming6-2 64 0.905 32 32 < 1 32 < 1 32 < 1 hamming6-4 64 0.349 4 4 < 1 4 < 1 4 < 1 hamming8-2 256 0.969 128 128 1 128 1 128 < 1 hamming8-4 256 0.639 16 16 1 16 1 16 1 hamming10-2 1024 0.990 512 512 72 512 61 512 38 hamming10-4 1024 0.829 ≥ 40 36 70 36 62 36 45

98 Table 6-1 Continued Instance n den- ω(G) QSH QUALEX 2.0 QUALEX-MS sity found time found time found time johnson8-2-4 28 0.556 4 4 < 1 4 < 1 4 < 1 johnson8-4-4 70 0.768 14 14 < 1 14 < 1 14 < 1 johnson16-2-4 120 0.765 8 8 < 1 8 < 1 8 < 1 johnson32-2-4 496 0.879 16 16 5 16 5 16 8 keller4 171 0.649 11 11 < 1 11 < 1 11 1 keller5 776 0.751 27 23 22 26 19 26 16 keller6 3361 0.818 ≥ 59 48 6095 51 5721 53 1291 MANN a9 45 0.927 16 16 < 1 16 < 1 16 < 1 MANN a27 378 0.990 126 125 2 126 2 125 1 MANN a45 1035 0.996 345 342 70 342 61 342 17 MANN a81 3321 0.999 ≥ 1100 1096 6671 1096 6057 1096 477 p hat300-1 300 0.244 8 7 1 8 2 8 1 p hat300-2 300 0.489 25 24 2 24 1 25 1 p hat300-3 300 0.744 36 33 1 35 2 35 1 p hat500-1 500 0.253 9 9 9 9 9 9 3 p hat500-2 500 0.505 36 33 8 36 9 36 4 p hat500-3 500 0.752 ≥ 50 46 8 48 9 48 4 p hat700-1 700 0.249 11 8 23 11 24 11 10 p hat700-2 700 0.498 44 42 24 43 26 44 12 p hat700-3 700 0.748 ≥ 62 59 24 61 24 62 11 p hat1000-1 1000 0.245 ≥ 10 9 82 10 76 10 28 p hat1000-2 1000 0.490 ≥ 46 43 85 45 79 45 34 p hat1000-3 1000 0.744 ≥ 68 62 83 65 76 65 32 p hat1500-1 1500 0.253 12 10 458 12 489 12 95 p hat1500-2 1500 0.506 ≥ 65 62 453 64 507 64 111 p hat1500-3 1500 0.754 ≥ 94 85 465 91 486 91 108 san200 0.7 1 200 0.700 30 30 < 1 30 < 1 30 1 san200 0.7 2 200 0.700 18 18 1 18 < 1 18 < 1 san200 0.9 1 200 0.900 70 70 < 1 70 1 70 < 1 san200 0.9 2 200 0.900 60 60 < 1 60 < 1 60 1 san200 0.9 3 200 0.900 44 35 1 40 < 1 40 < 1 san400 0.5 1 400 0.500 13 9 3 13 4 13 2 san400 0.7 1 400 0.700 40 40 4 40 4 40 3 san400 0.7 2 400 0.700 30 30 4 30 4 30 2 san400 0.7 3 400 0.700 22 16 4 17 4 18 2 san400 0.9 1 400 0.900 100 100 3 100 4 100 2 san1000 1000 0.502 15 10 76 15 69 15 25 sanr200 0.7 200 0.697 18 15 1 17 1 18 1 sanr200 0.9 200 0.898 42 37 < 1 41 < 1 41 < 1 sanr400 0.5 400 0.501 13 11 4 12 4 13 2 sanr400 0.7 400 0.700 ≥ 21 18 4 20 5 20 2

Table 6-2. Performance of QUALEX-MS vs. PBH on random weighted graphs

                    QUALEX-MS                              PBH
              Normal            Irregular           Normal            Irregular
  n   density Avg. R    St.D.   Avg. R    St.D.     Avg. R    St.D.   Avg. R    St.D.
 100   0.10   100.00%   ±0.00   100.00%   ±0.00     97.95%    ±0.15   98.44%    ±0.13
 100   0.20   100.00%   ±0.00    99.88%   ±0.05     97.73%    ±0.16   98.64%    ±0.12
 100   0.30    99.87%   ±0.05    99.89%   ±0.04     97.25%    ±0.17   98.84%    ±0.11
 100   0.40    99.48%   ±0.18    99.75%   ±0.05     95.04%    ±0.23   98.53%    ±0.12
 100   0.50    99.45%   ±0.19    99.81%   ±0.04     94.61%    ±0.24   98.74%    ±0.12
 100   0.60    99.18%   ±0.21    99.93%   ±0.02     94.71%    ±0.23   99.64%    ±0.06
 100   0.70    98.02%   ±0.32    99.84%   ±0.03     96.10%    ±0.20   98.94%    ±0.11
 100   0.80    98.54%   ±0.29    99.99%   ±0.00     93.13%    ±0.26   98.56%    ±0.12
 100   0.90    98.43%   ±0.27    99.99%   ±0.00     94.29%    ±0.24   99.56%    ±0.07
 100   0.95    98.72%   ±0.20   100.00%   ±0.00     96.49%    ±0.19   99.75%    ±0.05
 200   0.10   100.00%   ±0.00    99.97%   ±0.04
 200   0.20    99.55%   ±0.19    99.86%   ±0.04
 200   0.30    99.33%   ±0.29    99.45%   ±0.16
 200   0.40    99.08%   ±0.45    99.36%   ±0.35
 200   0.50    98.34%   ±0.46    99.32%   ±0.14
 200   0.60    98.00%   ±0.35    99.61%   ±0.10
 200   0.70    96.99%   ±0.64    99.54%   ±0.12
 200   0.80    96.21%   ±0.55    99.71%   ±0.10

random integer numbers from 1 to 10. Due to the significantly better speed of QUALEX-MS compared to the heuristics considered in [67] and the availability of a highly optimized exact maximum weight clique solver, cliquer, by P. Östergård and S. Niskanen [71], we were able to perform the tests not only on 100-vertex graphs but also on 200-vertex graphs up to the edge density 0.8. As well, we increased the number of tested graphs in each group from 20 to 50. The running time of QUALEX-MS on all those instances is within 1 second, so it may be considered negligible. However, similar testing on larger graphs is unfortunately difficult because of the significant slowing down of the exact solver. Table 6-2 presents the results of this computational experiment. The measured value is the percentage of the found clique weight relative to the optimum clique weight, averaged over all graphs of a group (Avg. R columns). The second result columns give the standard deviations of these values (St.D. columns). The obtained figures show that our method

strictly outperforms the algorithm PBH, and the difference between the maximum weight cliques and those found by QUALEX-MS is rather negligible. Apart from these computational experiments, Babu et al. report that QUALEX-MS has always been able to find the exact solutions of the maximum weight clique problem instances constructed to optimize classification and regression for databases, which were considered in their database compression experiments [4].
6.6 Remarks and Conclusions

We have presented a new fast heuristic method for the maximum weight clique problem. It has been shown empirically that the method is exact on a considerable range of instances. Among them are the Brockington–Culberson graphs from the DIMACS test suite [19] (brock*), which are exceptionally hard for all other types of heuristics that may be found in the literature. Besides, we have specified theoretically a non-trivial class of instances where the used trust region formulation may directly deliver a maximum weight clique indicator (Theorem 10). As the next step of QUALEX-MS development, it should be investigated whether it is possible to express Motzkin–Straus optima as a function of a particular subset of the spherical stationary points. This may lead to a generalization of Theorem 10 expanding the class of maximum weight clique instances where the optimum is directly computable by the presented trust region procedure. A case that theoretically seems to be the worst for the described technique is when there are multiple eigenvalues causing the trust region problem degeneracy. It may be supposed that a special submethod dealing with such instances should be developed.

101 REFERENCES [1] J. Abello, S. Butenko, P.M. Pardalos, M.G.C. Resende, Finding independent sets in a graph using continuous multivariable polynomial formulations, J. Global Optim. 21 (4) (2001) 111–137. [2] J. Abello, P.M. Pardalos, M.G.C. Resende, On maximum clique problem in very large graphs, DIMACS Series in and Theoretical , vol. 50, AMS Providence, RI, 1999, pp. 119–130. [3] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences 96 (1999) 6745–6750. [4] S. Babu, M. Garofalakis, R. Rastogi, SPARTAN: A model-based semantic compression system for massive data tables, in: Proceedings of the 2001 ACM International Conference on Management of Data (SIGMOD), 2001, pp. 283–295. [5] A. Banerjee, I.S. Dhillon, J. Ghosh, S. Merugu, D.S. Modha, Generalized maximum entropy approach to Bregman co-clustering and matrix approximations, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2004, pp. 509–514. [6] R.B. Bapat, T.E.S. Raghavan, Non-negative Matrices and Applications (Chapter 6), Cambridge University Press, Cambridge, UK, 1997. [7] E.R. Barnes, A.J. Hoffman, U.G. Rothblum, Optimal partitions having disjoint convex and conic hulls, Math. Program. 54 (1) (1992) 69–86. [8] A. Ben-Dor, L. Bruhn, I. Nachman, M. Schummer, Z. Yakhini, Tissue classification with gene expression profiles, J. Comput. Biol. 7 (2000) 559–584. [9] A. Ben-Dor, B. Chor, R. Karp, Z. Yakhini, Discovering local structure in gene expression data: the order-preserving submatrix problem, in: Proceedings of the 6th Annual International Conference on Computational Biology (RECOMB), 2002, pp. 49–57. [10] A. Ben-Dor, B. Chor, R. Karp, Z. Yakhini, Discovering local structure in gene expression data: the order-preserving submatrix problem, J. Comput. Biol. 10 (3-4) (2003) 373–384. [11] A. Ben-Dor, N. Friedman, Z. Yakhini, Class discovery in gene expression data, in: Proceedings of 5th Annual International Conference on Computational Molecular Biology (RECOMB), 2001, pp. 31–38. [12] M. Blatt, S. Wiseman, E. Domany, Data clustering using a model granular magnet, Neural Comput. 9 (8) (1997) 1805–1842.

102 [13] V. Boginski, S. Butenko, P.M. Pardalos, On structural properties of the market graph, in: A. Nagurney (Ed.), Innovations in Financial and Economic Networks, Edward Elgar Publishers, Cheltenham, UK–Northampton, MA, USA, 2003, pp. 28–45. [14] I.M. Bomze, M. Budinich, P.M. Pardalos, M. Pelillo, The maximum clique problem, in: D.-Z. Du and P.M. Pardalos, (Eds.), Handbook of Combinatorial Optimization (Supplement Volume A), Kluwer Academic, Dordrecht, 1999, pp. 1–74. [15] E. Boros, P. Hammer, T. Ibaraki, A. Kogan, Logical analysis of numerical data, Math. Program. 79 (1997) 163–190. [16] E. Boros, P. Hammer, T. Ibaraki, A. Kogan, E. Mayoraz, I. Muchnik, An implementation of logical analysis of data, IEEE Transactions on Knowledge and Data Engineering 12 (2000) 292–306. [17] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Chapman & Hall/CRC Press, 1984. [18] R.P. Brent, Algorithms for Minimization without Derivatives, Prentice-Hall, Englewood Cliffs, NJ, 1973. [19] M. Brockington, J.C. Culberson, Camouflaging independent sets in quasi-random graphs, in: D. Johnson and M.A. Trick (Eds.), Cliques, Coloring and Satisfiability, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 26, AMS Providence, RI, 1996, pp. 75–88. [20] K. Bryan, P. Cunningham, N. Bolshakova, Biclustering of expression data using simulated annealing, in: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS), 2005, pp. 383–388. [21] S. Burer, R.D.C. Monteiro, Y. Zhang, Maximum stable set formulations and heuristics based on continuous optimization, Math. Program. 94 (1) (2002) 137–166. [22] S. Busygin, S. Butenko, P.M. Pardalos, A heuristic for the maximum independent set problem based on optimization of a quadratic over a sphere, Journal of Comb. Optim. 6 (3) (2002) 287–297. [23] S. Busygin, G. Jacobsen, E. Kr¨amer,Double conjugated clustering applied to leukemia microarray data, SIAM Data Mining Workshop on Clustering High Dimensional Data and Its Applications, 2002. [24] S. Busygin, O.A. Prokopyev, P.M. Pardalos, Feature selection for consistent biclustering via fractional 0–1 programming, J. Comb. Optim. 10 (1) (2005) 7–21. [25] S. Busygin (1998), Stas Busygin’s NP-completeness page, http://www.busygin.dp.ua/npc.html

103 [26] A. Califano, SPLASH: Structural pattern localization analysis by sequential histograms, 16 (2000) 341–357. [27] A. Califano, S. Stolovitzky, Y. Tu, Analysis of gene expression microarrays for phenotype classification, in: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), 2000, pp. 75–85. [28] CAMDA Conference Website (2001), http://www.camda.duke.edu/camda01.html, last accessed April 2007. [29] G. Casella, E.I. George, Explaining the Gibbs sampler, The American Statistician 46 (1992) 167–174. [30] Y. Cheng, G.M. Church, Biclustering of expresssion data, in: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), 2000, pp. 93–103. [31] Y. Cheng, G.M. Church, Biclustering of expresssion data (Supplementary information), http://arep.med.harvard.edu/biclustering/, last accessed April 2007. [32] W. Cheswick, H. Burch, Internet Mapping Project, http://www.cs.bell-labs.com/who/ches/map/, last accessed April 2007. [33] H. Cho, I.S. Dhillon, Y. Guan, S. Sra, Minimum sum-squared residue co-clustering of gene expression data, in: Proceedings of the 4th SIAM International Conference on Data Mining (SDM), 2004, pp. 114–125. [34] H. Cho, Y. Guan, S. Sra, Co-clustering software, version 1.1, http://www.cs.utexas.edu/users/dml/Software/cocluster.html, last accessed April 2007. [35] ILOG Inc., CPLEX 9.0 User’s Manual, 2004. [36] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, UK, 2000. [37] I.S. Dhillon, A new O(n2) algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem, Computer Science Division Technical Report No. UCB//CSD-97-971, UC Berkeley, 1997. [38] I.S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, in: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2001, pp. 269–274. [39] I.S. Dhillon, S. Mallela, D.S. Modha, Information-theoretic co-clustering, in: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2003, pp. 89–98.

104 [40] E. Domany, Super-paramagnetic clustering of data, Physica A 263 (1999) 158–169. [41] G.E. Forsythe, G.H. Golub, On the stationary values of a second degree polynomial on the unit sphere, SIAM J. Appl. Math. 13 (1965) 1050–1068. [42] M. Garey, D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman & Co., New York, 1979. [43] G. Getz, E. Levine, E. Domany, Coupled two-way clustering analysis of gene microarray data, Proceedings of the National Academy of Sciences 97 (2000) 12079–12084. [44] L.E. Gibbons, D.W. Hearn, P.M. Pardalos, M.V. Ramana, Continuous characterizations of the maximum clique problem, Math. Oper. Res. 22 (1997) 754–768. [45] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, E.S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537. [46] G.H. Golub, C.F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore MA, 1996. [47] D. Granot, U.G. Rothblum, The Pareto set of the partition bargaining game, Games and Economic Behavior 3 (1991) 163–182. [48] W. Hager, Minimizing a quadratic over a sphere, SIAM J. Optim. 12 (2001) 188–208. [49] P. Hansen, M. Poggi de Arag˜ao, C.C. Ribeiro, Hyperbolic 0–1 programming and query optimization in information retrieval, Math. Program. 52 (2) (1991) 256–263. [50] S. Hashizume, M. Fukushima, N. Katoh, T. Ibaraki, Approximation algorithms for combinatorial fractional programming problems, Math. Program. 37 (3) (1987) 255–267. [51] J.A. Hartigan, Direct clustering of a data matrix, J. Amer. Stat. Assoc. 67 (1972) 123–129. [52] J. H˚astad,Clique is hard to approximate within n1−, in: Proceedings of 37th Annual IEEE Symposium on the Foundations of Computer Science (FOCS), 1996, pp. 627–636. [53] L-L. Hsiao, F. Dangond, T. Yoshida, R. Hong, R.V. Jensen, J. Misra, W. Dillon, K.F. Lee, K.E. Clark, P. Haverty, Z. Weng, G. Mutter, M.P. Frosch, M.E. MacDonald, E.L. Milford, C.P. Crum, R. Bueno, R.E. Pratt, M. Mahadevappa, J.A. Warrington, G. Stephanopoulos, G. Stephanopoulos, S.R. Gullans, A compendium of gene expression in normal human tissues, Physiol. Genomics 7 (2001) 97–104.

105 [54] HuGE Index.org Website, http://www.hugeindex.org, last accessed April 2007. [55] F.K. Hwang, S. Onn, U.G. Rothblum, Linear shaped partition problems, Oper. Res. Let. 26 (2000) 159–163. [56] L.D. Iasemidis, P.M. Pardalos, J.C. Sackellares, D.S. Shiau, Quadratic binary programming and dynamical system approach to determine the predictability of epileptic seizures, J. Comb. Optim. 5 (1) (2001) 9–26. [57] S.C. Johnson, Hierarchical clustering schemes, Psychometrika 2 (1967) 241–254. [58] S. Khot, Improved inapproximability results for maxclique, chromatic number and approximate graph coloring, in: Proceedings of 42nd Annual IEEE Symposium on the Foundations of Computer Science (FOCS), 2001, pp. 600–609. [59] Y. Kluger, R. Basri, J.T. Chang, M. Gerstein, Spectral biclustering of microarray data: coclustering genes and conditions, Genome Res. 13 (4) (2003) 703–716. [60] T. Kohonen, Self-Organization Maps, Springer-Verlag, Berlin-Heidelberg, 1995. [61] L. Lazzeroni, A. Owen, Plaid models for gene expression data, Sinica 12 (2002) 61–86. [62] L. Lazzeroni, A. Owen Plaid models, for microarrays and expression, http://www-stat.stanford.edu/˜owen/plaid/, last accessed April 2007. [63] J. Liu, W. Wang, OP-cluster: Clustering by tendency in high dimensional Space, in: Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), 2003, pp. 187–194. [64] S. Lyle, M. Szularz, Local minima of the trust region problem, J. Optim. Theory Appl. 80 (1994) 117–134. [65] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Symposium on Mathematics and Probability, 1967, pp. 281–297. [66] C. Mannino, E. Stefanutti, An augmentation algorithm for the maximum weighted stable set problem, Comput. Optim. Appl. 14 (1999) 367–381. [67] A. Massaro, M. Pelillo, I.M. Bomze, A complementary pivoting approach to the maximum weight clique problem, SIAM J. Optim. 12 (4) (2002) 928–948. [68] B.D. McKay (1984), The nauty page, http://cs.anu.edu.au/˜bdm/nauty/, last accessed April 2007. [69] J.J. Mor´e,D.S. Sorensen, Computing a trust region step, SIAM J. Sci. Statist. Comput. 4 (1983) 553–572.

106 [70] T.S. Motzkin, E.G. Straus, Maxima for graphs and a new proof of a theorem of Turan, Canad. J. Math. 17 (4) (1965) 533–540. [71] P. Osterg˚ard,S.¨ Niskanen (2002), Cliquer – routines for clique searching, http://users.tkk.fi/˜pat/cliquer.html, last accessed April 2007. [72] V.Y. Pan, Z.Q. Chen, The complexity of the matrix eigenproblem, in: Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), 1999, pp. 507–516. [73] P.M. Pardalos, S. Busygin, O.A. Prokopyev, On biclustering with feature selection for microarray data sets, in: R. Mondaini (Ed.), BIOMAT 2005 – International Symposium on Mathematical and Computational Biology, World Scientific, 2006, pp. 367–378. [74] J.-C. Picard, M. Queyranne, A network flow solution to some nonlinear 0–1 programming problems, with applications to , Networks 12 (1982) 141–159. [75] O.A. Prokopyev, H.-X. Huang, P.M. Pardalos, On complexity of unconstrained hyperbolic 0–1 programming problems, Oper. Res. Lett. 33 (2005) 312–318. [76] O.A. Prokopyev, C. Meneses, C.A.S. Oliveira, P.M. Pardalos, On multiple-ratio hyperbolic 0–1 programming problems, Pacific J. Optim. 1 (2) (2005) 327–345. [77] D.J. Reiss, N.S. Baliga, R. Bonneau, Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks, BMC Bioinformatics 7 (280) (2006). [78] S. Saipe, Solving a (0,1) hyperbolic program by branch and bound, Naval Res. Logist. Quarterly 22 (1975) 497–515. [79] R. Shamir, EXPANDER: A gene expression analysis and visualization software, http://www.cs.tau.ac.il/˜rshamir/expander/expander.html, last accessed April 2007. [80] Y.D. Shen, Z.Y. Shen, S.M. Zhang, Q. Yang, Cluster cores-based clustering for high dimensional data, in: Proceedings of 4th IEEE International Conference on Data Mining (ICDM), 2004, pp. 519–522. [81] Q. Sheng, Y. Moreau, B. De Moor, Biclustering microarray data by Gibbs sampling, Bioinformatics 19 (2003) ii196–ii205. [82] P. St-Louis, J.A. Ferland, and B. Gendron (2006), A penalty-evaporation heuristic in a decomposition method for the maximum clique problem, Technical Report, D´epartment d’informatique et de recherche op´erationnelle, Universit´ede Montr´eal, http://www.iro.umontreal.ca/˜gendron/publi.html, last accessed April 2007.

107 [83] A. Tanay, Computational analysis of transcriptional programs: function and evolution, PhD thesis, 2005, http://www.cs.tau.ac.il/˜rshamir/theses/amos phd.pdf, last accessed April 2007. [84] A. Tanay, R. Sharan, M. Kupiec, R. Shamir, Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data, Proceedings of the National Academy of Sciences 101 (2004) 2981–2986. [85] A. Tanay, R. Sharan, R. Shamir, Discovering statistically significant bilcusters in gene expression data, Bioinformatics 18 (2002) S136–S144. [86] M. Tawarmalani, S. Ahmed, N. Sahinidis, Global optimization of 0–1 hyperbolic programs, J. Global Optim. 24 (2002) 385–416. [87] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, Berlin-Heidelberg, 1999. [88] Weizmann Institue of Science (2000), The coupled two way clustering algorithm, http://ctwc.weizmann.ac.il/, last accessed April 2007. [89] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, V. Vapnik, Feature selection for SVMs, in: S.A. Solla, T.K. Leen, Klaus-Robert M¨uller(Eds.), Advances in Neural Systems, MIT Press, 2001. [90] T.-H. Wu, A note on a global approach for general 0–1 fractional programming, European J. Oper. Res. 101 (1997) 220–223. [91] E.P. Xing, R.M. Karp, CLIFF: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts, Bioinformatics Discovery Note 1 (2001) 1–9. [92] R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Transactions on Neural Networks 16 (2005) 645–648. [93] J. Yang, W. Wang, H. Wang, P. Yu, δ-clusters: Capturing subspace correlation in a large data set, in: Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), 2002, pp. 517–528. [94] J. Yang, W. Wang, H. Wang, P. Yu, Enhanced biclustering on expression data, in: Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering (BIBE), 2003, pp. 321–327. [95] Y. Ye, A new complexity result on minimization of a quadratic function with a sphere constraint, in: C.A. Floudas, P.M. Pardalos (Eds.), Recent Advances in Global Optimization, Princeton University Press, Princeton, NJ, 1992, pp. 19–31.

BIOGRAPHICAL SKETCH
Stanislav Busygin was born on July 14, 1974, in Dzhankoy, Crimea (Ukraine). He received his Specialist Degree in applied mathematics from Dnipropetrovsk National University (Ukraine) in 1996. From 1996 to 2003, he worked as a software engineer and a scientific consultant in a number of technologically innovative companies in Ukraine and Western Europe. In 2003, Stanislav Busygin entered the graduate program in industrial and systems engineering at the University of Florida. He received his M.S. degree in industrial and systems engineering from the University of Florida in April 2005. Stanislav Busygin is the author of almost a dozen peer-reviewed scientific research papers and surveys.
