A Robust Minimax Approach to Classification
Journal of Machine Learning Research 3 (2002) 555-582    Submitted 5/02; Published 12/02

Gert R.G. Lanckriet    [email protected]
Laurent El Ghaoui    [email protected]
Chiranjib Bhattacharyya    [email protected]
Michael I. Jordan    [email protected]
Department of Electrical Engineering and Computer Science and Department of Statistics
University of California, Berkeley, CA 94720, USA

Editor: Bernhard Schölkopf

© 2002 Gert R.G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya and Michael I. Jordan.

Abstract

When constructing a classifier, the probability of correct classification of future data points should be maximized. We consider a binary classification problem where the mean and covariance matrix of each class are assumed to be known. No further assumptions are made with respect to the class-conditional distributions. Misclassification probabilities are then controlled in a worst-case setting: that is, under all possible choices of class-conditional densities with given mean and covariance matrix, we minimize the worst-case (maximum) probability of misclassification of future data points. For a linear decision boundary, this desideratum is translated in a very direct way into a (convex) second order cone optimization problem, with complexity similar to a support vector machine problem. The minimax problem can be interpreted geometrically as minimizing the maximum of the Mahalanobis distances to the two classes. We address the issue of robustness with respect to estimation errors (in the means and covariances of the classes) via a simple modification of the input data. We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries, yielding a classifier which proves to be competitive with current methods, including support vector machines. An important feature of this method is that a worst-case bound on the probability of misclassification of future data is always obtained explicitly.

Keywords: classification, kernel methods, convex optimization, second order cone programming

1. Introduction

Consider the problem of choosing a linear discriminant by minimizing the probabilities that data vectors fall on the wrong side of the boundary. One way to attempt to achieve this is via a generative approach in which one makes distributional assumptions about the class-conditional densities and thereby estimates and controls the relevant probabilities. The need to make distributional assumptions, however, casts doubt on the generality and validity of such an approach, and in discriminative solutions to classification problems it is common to attempt to dispense with class-conditional densities entirely.

Rather than avoiding any reference to class-conditional densities, it might be useful to attempt to control misclassification probabilities in a worst-case setting; that is, under all possible choices of class-conditional densities with a given mean and covariance matrix, minimize the worst-case (maximum) probability of misclassification of future data points. Such a minimax approach could be viewed as providing an alternative justification for discriminative approaches. In this paper we show how such a minimax program can be carried out in the setting of binary classification.
Our approach involves exploiting the following powerful theorem due to Marshall and Olkin (1960), which has recently been revisited in the light of convex optimization techniques by Popescu and Bertsimas (2001):

$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{y \in \mathcal{S}\} = \frac{1}{1+d^2}, \quad \text{with} \quad d^2 = \inf_{y \in \mathcal{S}} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}),$$

where $y$ is a random vector, $\mathcal{S}$ is a given convex set, and where the supremum is taken over all distributions for $y$ having mean $\bar{y}$ and covariance matrix $\Sigma_y$ (assumed to be positive definite for simplicity). This theorem provides us with the ability to bound the probability of misclassifying a point, without making Gaussian or other specific distributional assumptions. We will show how to exploit this ability in the design of linear classifiers, for which the set $\mathcal{S}$ is a half-space. The usefulness of this theorem will also be illustrated in the case of unsupervised learning, for the single class problem.

One of the appealing features of this formulation is that one obtains an explicit upper bound on the probability of misclassification of future data: $1/(1 + d^2)$. We will also show that the learning problem can be interpreted geometrically, as yielding the hyperplane that minimizes the maximum of the Mahalanobis distances from the class means to the hyperplane.

Another feature of this formulation is the possibility to study directly the effect of using plug-in estimates for the means and covariance matrices rather than their real, but unknown, values. We will assume the real mean and covariance are unknown, but bounded in a convex region, and show how this affects the resulting classifier: it amounts to incorporating a regularization term in the estimator and results in an increase in the upper bound on the probability of misclassification of future data.

A third important feature of this approach is that, as in linear discriminant analysis (Mika et al., 1999), it is possible to generalize the basic methodology to allow nonlinear decision boundaries via the use of Mercer kernels. The resulting nonlinear classifiers are competitive with existing classifiers, including support vector machines.

The paper is organized as follows: in Section 2 we present the minimax formulation for linear classifiers and our main results. In Section 3, we show how a simple modification of our method makes it possible to treat estimation errors in the mean and covariance matrices within the minimax formulation. The single class case is studied in Section 4. In Section 5 we deal with kernelizing the method. Finally, we present empirical results in Section 6. The results in Sections 2, 5 and 6 have been presented earlier (Lanckriet et al., 2002a), and are expanded significantly in the current paper, while the results in Sections 3 and 4 are new. Matlab code to build and evaluate the different types of classifiers can be downloaded from http://robotics.eecs.berkeley.edu/~gert/.

2. Minimax Probabilistic Decision Hyperplane

In this section, the minimax formulation for linear classifiers is presented. After defining the problem in Section 2.1, we state the main result in Section 2.2 and propose an optimization algorithm in Section 2.3. In Sections 2.4 and 2.5, we show how the optimization problem can be interpreted in a geometric way. Section 2.6 addresses the effect of making Gaussian assumptions, and the link with Fisher discriminant analysis is discussed in Section 2.7.
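Before turning to the formal development, the worst-case bound quoted above can be checked numerically. The following minimal sketch is in Python with NumPy rather than the Matlab code referenced above; the variable names (`ybar`, `Sigma_y`, `a`, `b`) and the data values are assumptions chosen purely for illustration. It evaluates the distribution-free bound $1/(1+d^2)$ for a half-space $\mathcal{S} = \{y : a^T y \geq b\}$, using the closed-form expression for $d$ derived in Section 2.2, and compares it to a Monte Carlo estimate of the same probability under one particular distribution with the given mean and covariance, namely a Gaussian.

```python
# Illustrative sketch (assumed example data, not from the paper):
# compare the distribution-free bound sup Pr{y in S} = 1/(1+d^2) for the
# half-space S = {y : a^T y >= b} with a Monte Carlo estimate of Pr{y in S}
# under a Gaussian sharing the same mean and covariance.
import numpy as np

rng = np.random.default_rng(0)

# Example problem data (illustrative assumptions).
ybar = np.array([0.0, 0.0])
Sigma_y = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
a = np.array([1.0, -1.0])
b = 3.0

# d^2 is the squared Mahalanobis distance from ybar to the half-space,
# whose closed form appears as equation (5) in Section 2.2.
d2 = max(b - a @ ybar, 0.0) ** 2 / (a @ Sigma_y @ a)
worst_case = 1.0 / (1.0 + d2)

# Monte Carlo estimate under one member of the class: a Gaussian distribution.
samples = rng.multivariate_normal(ybar, Sigma_y, size=200_000)
gaussian_prob = np.mean(samples @ a >= b)

print(f"worst-case bound 1/(1+d^2):          {worst_case:.4f}")
print(f"Gaussian estimate of Pr(a^T y >= b): {gaussian_prob:.4f}")
```

In this example the Gaussian probability is roughly an order of magnitude below the worst-case bound, as expected: the bound is the supremum over all distributions with the given first two moments, and is attained only by a worst-case member of that class.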
2.1 Problem Definition

Let $x$ and $y$ denote random vectors in a binary classification problem, modelling data from each of two classes, with means and covariance matrices given by $(\bar{x}, \Sigma_x)$ and $(\bar{y}, \Sigma_y)$, respectively, with $x, \bar{x}, y, \bar{y} \in \mathbb{R}^n$, and $\Sigma_x, \Sigma_y \in \mathbb{R}^{n \times n}$ both symmetric and positive semidefinite. With a slight abuse of notation, we let $x$ and $y$ denote the classes.

We wish to determine a hyperplane $\mathcal{H}(a,b) = \{z \mid a^T z = b\}$, where $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$, which separates the two classes of points with maximal probability with respect to all distributions having these means and covariance matrices. This is expressed as

$$\max_{\alpha, a \neq 0, b} \alpha \quad \text{s.t.} \quad \inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{a^T x \geq b\} \geq \alpha, \quad \inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \leq b\} \geq \alpha, \qquad (1)$$

where the notation $x \sim (\bar{x}, \Sigma_x)$ refers to the class of distributions that have prescribed mean $\bar{x}$ and covariance $\Sigma_x$, but are otherwise arbitrary; likewise for $y$. Future points $z$ for which $a^T z \geq b$ are then classified as belonging to the class associated with $x$; otherwise they are classified as belonging to the class associated with $y$. In formulation (1) the term $1 - \alpha$ is the worst-case (maximum) misclassification probability, and our classifier minimizes this maximum probability.

For simplicity, we will assume that both $\Sigma_x$ and $\Sigma_y$ are positive definite; our results can be extended to the general positive semidefinite case, using the appropriate extension of Marshall and Olkin's result, as discussed by Popescu and Bertsimas (2001). However, we do not handle this case here, since in practice we always add a regularization term to the covariance matrices. A complete discussion of the choice of the regularization parameter is given in Section 3.

2.2 Main Result

We begin with a lemma that will be used in the sequel.

Lemma 1 Using the notation of Section 2.1, with $\bar{y}$, $\Sigma_y$ positive definite, $a \neq 0$, $b$ given, such that $a^T \bar{y} \leq b$ and $\alpha \in [0, 1)$, the condition

$$\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \leq b\} \geq \alpha \qquad (2)$$

holds if and only if

$$b - a^T \bar{y} \geq \kappa(\alpha) \sqrt{a^T \Sigma_y a}, \qquad (3)$$

where $\kappa(\alpha) = \sqrt{\frac{\alpha}{1-\alpha}}$.

Proof of lemma: In view of the result of Marshall and Olkin (1960) and using $\mathcal{S} = \{a^T y \geq b\}$, we obtain:

$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \geq b\} = \frac{1}{1+d^2}, \quad \text{with} \quad d^2 = \inf_{a^T y \geq b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}). \qquad (4)$$

The minimum distance $d$ admits a simple closed-form expression (see Appendix A):

$$d^2 = \inf_{a^T y \geq b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) = \frac{\max((b - a^T \bar{y}), 0)^2}{a^T \Sigma_y a}. \qquad (5)$$

The probability constraint (2) in this lemma is equivalent to $\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \geq b\} \leq 1 - \alpha$.
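The equivalence asserted by Lemma 1 can be sanity-checked numerically. The sketch below is again Python with NumPy rather than the authors' Matlab; the helper names `worst_case_prob_geq` and `soc_condition`, and the example data, are illustrative assumptions. It computes the worst-case probability via equations (4)-(5) and confirms that constraint (2), i.e. $\sup \Pr\{a^T y \geq b\} \leq 1 - \alpha$, holds exactly when the second order cone condition (3) holds.

```python
# Illustrative check of Lemma 1 (assumed example data, not from the paper):
# the probabilistic constraint (2) holds iff the deterministic condition (3) holds.
import numpy as np

def worst_case_prob_geq(a, b, ybar, Sigma_y):
    """Worst-case Pr{a^T y >= b} over all distributions with mean ybar and
    covariance Sigma_y, using equations (4)-(5)."""
    d2 = max(b - a @ ybar, 0.0) ** 2 / (a @ Sigma_y @ a)
    return 1.0 / (1.0 + d2)

def soc_condition(a, b, ybar, Sigma_y, alpha):
    """Condition (3): b - a^T ybar >= kappa(alpha) * sqrt(a^T Sigma_y a)."""
    kappa = np.sqrt(alpha / (1.0 - alpha))
    return b - a @ ybar >= kappa * np.sqrt(a @ Sigma_y @ a)

# Example data with a^T ybar <= b, as required by the lemma.
ybar = np.array([0.0, 0.0])
Sigma_y = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
a = np.array([1.0, -1.0])

for alpha in (0.5, 0.8, 0.95):
    for b in (0.5, 1.0, 2.0, 4.0):
        holds_2 = worst_case_prob_geq(a, b, ybar, Sigma_y) <= 1.0 - alpha
        holds_3 = soc_condition(a, b, ybar, Sigma_y, alpha)
        assert holds_2 == holds_3
print("constraint (2) and condition (3) agree on all test cases")
```

It is this rewriting of each probabilistic constraint as a second order cone constraint of the form (3) that turns problem (1) into the convex second order cone optimization problem mentioned in the abstract.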