Rounding Methods for Discrete Linear Classification
Yann Chevaleyre [email protected]
LIPN, CNRS UMR 7030, Université Paris Nord, 99 Avenue Jean-Baptiste Clément, 93430 Villetaneuse, France

Frédéric Koriche [email protected]
CRIL, CNRS UMR 8188, Université d'Artois, Rue Jean Souvraz SP 18, 62307 Lens, France

Jean-Daniel Zucker [email protected]
INSERM U872, Université Pierre et Marie Curie, 15 Rue de l'Ecole de Médecine, 75005 Paris, France

Abstract

Learning discrete linear classifiers is known as a difficult challenge. In this paper, this learning task is cast as a combinatorial optimization problem: given a training sample formed by positive and negative feature vectors in the Euclidean space, the goal is to find a discrete linear function that minimizes the cumulative hinge loss of the sample. Since this problem is NP-hard, we examine two simple rounding algorithms that discretize the fractional solution of the problem. Generalization bounds are derived for several classes of binary-weighted linear functions, by analyzing the Rademacher complexity of these classes and by establishing approximation bounds for our rounding algorithms. Our methods are evaluated on both synthetic and real-world data.

1. Introduction

Linear classification is a well-studied learning problem in which one needs to extrapolate, from a set of positive and negative examples represented in Euclidean space by their feature vector, a linear hypothesis h(x) = sgn(⟨w, x⟩ − b) that correctly classifies future, unseen, examples. In the past decades, a wide variety of theoretical results and efficient algorithms have been obtained for learning real-weighted linear functions (also known as "perceptrons"). Notably, it is well known that the linear classification problem can be cast as a convex optimization problem and solved in polynomial time by support vector machines if the performance of hypotheses is measured by convex loss functions such as the hinge loss (see e.g. Shawe-Taylor and Cristianini (2000)). Much less is known, however, about learning discrete linear classifiers. Indeed, integer weights, and in particular {0,1}-valued and {−1,0,1}-valued weights, can play a crucial role in many application domains in which the classifier has to be interpretable by humans.

One of the main motivating applications for this work comes from the field of quantitative metagenomics, which is the study of the collective genome of the micro-organisms inhabiting our body. It is now technically possible to measure the abundance of a bacterial species by measuring the activity of specific tracer genes for that species. Moreover, it is known that the abundance of some bacterial species in our body is related to obesity or leanness. Instead of learning a standard linear classifier to predict obesity, biologists would like to find two small groups of bacterial species such that, if the abundance of bacteria in the first group is greater than that of the second group, then the individual is classified as being obese. Given a dataset in which features represent the abundance of specific bacterial species, this problem boils down to learning a linear classifier with {−1,0,1}-valued weights.

In other domains, such as medical diagnosis, the interpretability of predictive models is also a key aspect. The most common diagnostic models are M-of-N rules (Towell and Shavlik, 1993), according to which patients are classified as ill if at least M criteria among N are satisfied. However, learning M-of-N rules is hard (a proof is provided in the extended version of this work (Chevaleyre et al., 2013)). In binary classification, linear threshold functions with {0,1}-valued weights are equivalent to M-of-N rules. Thus, the theory and the algorithms described in this paper can also be used to learn such rules, as shown in the experimental section.
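To make this equivalence concrete, the display below sketches one standard way of writing an M-of-N rule as a {0,1}-weighted linear threshold function over boolean criteria; the indicator vector c and the threshold M − 1/2 are illustrative choices, not notation taken from the paper:

\[
h(x) \;=\; \mathrm{sgn}\big(\langle c, x\rangle - (M - \tfrac{1}{2})\big), \qquad c \in \{0,1\}^n,
\]

so that, for x ∈ {0,1}^n, h(x) = +1 exactly when at least M of the N criteria selected by c (those with c_i = 1) are satisfied.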
Perhaps the major obstacle to the development of discrete linear functions lies in the fact that, in the standard distribution-free PAC learning model, the problem of finding an integer-weighted linear function that is consistent with a training set is equivalent to the (Zero-One) Integer Linear Programming problem (Pitt and Valiant, 1988), which is NP-complete. In order to alleviate this issue, several authors have investigated the learnability of discrete linear functions in distribution-specific models, such as the uniform distribution (Golea and Marchand, 1993a; Köhler et al., 1990; Opper et al., 1990; Venkatesh, 1991), or the product distribution (Golea and Marchand, 1993b). Yet, beyond this pioneering work, many questions remain open, especially when the model is distribution-free but the loss functions are convex.

In this paper, we consider just such a scenario by examining the problem of learning binary-weighted linear functions with the hinge loss, a well-known surrogate of the zero-one loss. The key components of the classification problem are a set C ⊆ {0,1}^n of boolean vectors¹ from which the learner picks his hypotheses, and a fixed (yet hidden) probability distribution over the set ℝ^n × {±1} of examples. For a hinge parameter γ > 0, the hinge loss penalizes a hypothesis c ∈ C on an example (x, y) if its margin y⟨c, x⟩ is less than γ. The performance of a hypothesis c ∈ C is measured by its risk, denoted risk(c), and defined as the expected loss of c on an example (x, y) drawn from the underlying distribution. Typically, risk(c) is upper-bounded by the sum of two terms: a sample estimate risk_m(c) of the performance of c and a penalty term T_m(C) that depends on the hypothesis class C and, potentially, also on the training set. The sample estimate risk_m(c) is simply the averaged cumulative hinge loss of c on a set {(x_i, y_i)}_{i=1}^m of examples drawn independently from the underlying distribution. The penalty term T_m(C) can be given by the VC-dimension of C, or its Rademacher complexity with respect to the size m of the training set. For binary-weighted linear classifiers, the penalty term induced by their Rademacher complexity can be substantially smaller than the penalty term induced by their VC dimension. So, by a simple adaptation of Bartlett and Mendelson's framework (2002), our risk bounds take the form of

\[
\mathrm{risk}(c) \;\le\; \mathrm{risk}_m(c) \;+\; \frac{2}{\gamma}\, R_m(C) \;+\; \sqrt{\frac{8\ln(2/\delta)}{m}} \qquad (1)
\]

where R_m(C) is the Rademacher complexity of C with respect to m, and δ ∈ (0,1) is a confidence parameter.

¹ As explained in Section 4.2, {−1,0,1}-weighted classification can be reduced to {0,1}-weighted classification.

Ideally, we would like to have at our disposal an efficient algorithm for minimizing risk_m(c). The resulting minimizer, say c*, would be guaranteed to provide an optimal hypothesis because the other terms in the risk bound (1) do not depend on the choice of the hypothesis. Unfortunately, because the class C of discrete linear classifiers is not a convex set, the convexity of the hinge loss does not help in finding c* and, as shown by Theorem 1 in the next section, the optimization problem remains NP-hard.

The key message to be gleaned from this paper is that the convexity of the loss function does help in approximating the combinatorial optimization problem, using simple rounding methods. Our first algorithm is a standard randomized rounding (RR) method that starts from a fractional solution w* in the convex hull of C, and then builds c by viewing the fractional value w*_i as the probability that c_i should be set to 1. The second algorithm, called greedy rounding (GR), is essentially a derandomization of RR that iteratively rounds the coordinates of the fractional solution while maintaining a constraint on the sum of the weights.
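As an illustration of these two rounding schemes, the sketch below gives a minimal Python/numpy rendering. It assumes the fractional solution w has already been computed (for instance by minimizing the empirical hinge loss over the box [0,1]^n with any convex solver), the helper hinge_risk uses one common normalization of the γ-hinge loss, and the greedy variant shown here simply keeps the better of the two bit values at each coordinate; it does not maintain the sum constraint used by the paper's GR algorithm, so it should be read as a simplified sketch rather than the authors' exact procedure.

import numpy as np

def hinge_risk(w, X, y, gamma):
    # Average hinge loss of a (possibly fractional) weight vector w on the
    # sample (X, y); margins below gamma are penalized linearly.
    margins = y * (X @ w)
    return float(np.mean(np.maximum(0.0, 1.0 - margins / gamma)))

def randomized_rounding(w, seed=None):
    # RR: treat each fractional weight w_i in [0, 1] as the probability
    # that the binary weight c_i is set to 1.
    rng = np.random.default_rng(seed)
    return (rng.random(w.shape) < w).astype(int)

def greedy_rounding(w, X, y, gamma):
    # Simplified greedy derandomization: round coordinates one at a time,
    # keeping whichever bit value currently gives the smaller empirical
    # hinge risk of the partially rounded vector.
    c = np.asarray(w, dtype=float).copy()
    for i in range(c.size):
        best_bit, best_risk = 0.0, np.inf
        for bit in (0.0, 1.0):
            c[i] = bit
            risk = hinge_risk(c, X, y, gamma)
            if risk < best_risk:
                best_bit, best_risk = bit, risk
        c[i] = best_bit
    return c.astype(int)

Under these assumptions, greedy_rounding(w, X, y, gamma) returns a vector in {0,1}^n whose empirical hinge risk can be compared with hinge_risk(w, X, y, gamma); it is precisely this additive gap that the bounds quoted below control.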
For the class C of binary-weighted linear functions, we show that the greedy rounding algorithm is guaranteed to return a concept c ∈ C satisfying

\[
\mathrm{risk}_m(c) \;\le\; \mathrm{risk}_m(c^*) \;+\; \frac{X_2}{2\gamma}
\]

where X_p = max_{i=1,...,m} ||x_i||_p, and ||x||_p is the L_p-norm of x. We also show that the problem of improving this bound up to a constant factor is NP-hard. Combining greedy rounding's performance with the Rademacher complexity of C yields the risk bound

\[
\mathrm{risk}(c) \;\le\; \mathrm{risk}_m(c^*) \;+\; \frac{X_2}{2\gamma} \;+\; \frac{2}{\gamma}\, X_1 \min\!\Big\{1,\ \sqrt{\tfrac{n}{m}}\Big\} \;+\; \sqrt{\frac{8\ln(2/\delta)}{m}}.
\]

For the subclass C_k of sparse binary-weighted linear functions involving at most k ones among n, we show that greedy rounding is guaranteed to return a concept c ∈ C_k satisfying

\[
\mathrm{risk}_m(c) \;\le\; \mathrm{risk}_m(c^*) \;+\; \frac{X_1 \sqrt{k}}{\gamma}.
\]

Using the Rademacher complexity of C_k, which is substantially smaller than that of C, we have

\[
\mathrm{risk}(c) \;\le\; \mathrm{risk}_m(c^*) \;+\; \frac{X_1 \sqrt{k}}{\gamma} \;+\; \frac{2}{\gamma}\, X_1 k \sqrt{\frac{2\log(n/k)}{m}} \;+\; \sqrt{\frac{8\ln(2/\delta)}{m}}.
\]

Similar results are derived with the randomized rounding algorithm, with less sharp bounds due to the randomization process. We evaluate these rounding methods on both synthetic and real-world datasets, showing good performance in comparison with standard linear classifiers.

Proof. In what follows, we denote by c* any vector in C for which risk_m(c*) is minimal. For an undirected graph G = (V, E), the Max-Cut problem is to find a subset S ⊂ V such that the number of edges with one endpoint in S and the other in V \ S is maximal.
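For reference, the Max-Cut objective invoked here can be stated as the following combinatorial optimization problem (a textbook formulation, not quoted from the paper):

\[
\max_{S \subseteq V} \;\Big|\big\{\, \{u, v\} \in E \;:\; u \in S,\ v \in V \setminus S \,\big\}\Big|.
\]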