
Rounding Methods for Discrete Linear Classification

Yann Chevaleyre, LIPN, CNRS UMR 7030, Université Paris Nord, 99 Avenue Jean-Baptiste Clément, 93430 Villetaneuse, France
Frédéric Koriche, CRIL, CNRS UMR 8188, Université d'Artois, Rue Jean Souvraz SP 18, 62307 Lens, France
Jean-Daniel Zucker, INSERM U872, Université Pierre et Marie Curie, 15 Rue de l'Ecole de Médecine, 75005 Paris, France

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

Learning discrete linear classifiers is known as a difficult challenge. In this paper, this learning task is cast as a combinatorial optimization problem: given a training sample formed by positive and negative feature vectors in the Euclidean space, the goal is to find a discrete linear function that minimizes the cumulative hinge loss of the sample. Since this problem is NP-hard, we examine two simple rounding algorithms that discretize the fractional solution of the problem. Generalization bounds are derived for several classes of binary-weighted linear functions, by analyzing the Rademacher complexity of these classes and by establishing approximation bounds for our rounding algorithms. Our methods are evaluated on both synthetic and real-world data.

1. Introduction

Linear classification is a well-studied learning problem in which one needs to extrapolate, from a set of positive and negative examples represented in Euclidean space by their feature vector, a linear hypothesis h(x) = sgn(⟨w, x⟩ − b) that correctly classifies future, unseen, examples. In the past decades, a wide variety of theoretical results and efficient algorithms have been obtained for learning real-weighted linear functions (also known as "perceptrons"). Notably, it is well known that the linear classification problem can be cast as a convex optimization problem and solved in polynomial time by support vector machines if the performance of hypotheses is measured by convex loss functions such as the hinge loss (see e.g. Shawe-Taylor and Cristianini (2000)). Much less is known, however, about learning discrete linear classifiers. Indeed, integer weights, and in particular {0, 1}-valued and {−1, 0, 1}-valued weights, can play a crucial role in many application domains in which the classifier has to be interpretable by humans.

One of the main motivating applications for this work comes from the field of quantitative metagenomics, which is the study of the collective genome of the micro-organisms inhabiting our body. It is now technically possible to measure the abundance of bacterial species by measuring the activity of specific tracer genes for that species. Moreover, it is known that the abundance of some bacterial species in our body is related to obesity or leanness. Instead of learning a standard linear classifier to predict obesity, biologists would like to find two small groups of bacterial species, such that if the abundance of bacteria in the first group is greater than that of the second group, then the individual is classified as being obese. Given a dataset in which features represent the abundance of specific bacterial species, this problem boils down to learning a linear classifier with {−1, 0, 1}-valued weights.

In other domains such as medical diagnosis, the interpretability of predictive models is also a key aspect. The most common diagnostic models are M-of-N rules (Towell and Shavlik, 1993), according to which patients are classified as ill if at least M criteria among N are satisfied. However, learning M-of-N rules is hard (a proof is provided in the extended version of this work (Chevaleyre et al., 2013)). In binary classification, linear threshold functions with {0, 1}-valued weights are equivalent to M-of-N rules. Thus, the theory and the
algorithms described in this paper can also be used to learn such rules, as shown in the experimental section.

Perhaps the major obstacle to the development of discrete linear functions lies in the fact that, in the standard distribution-free PAC learning model, the problem of finding an integer-weighted linear function that is consistent with a training set is equivalent to the (Zero-One) Integer Programming problem (Pitt and Valiant, 1988), which is NP-complete. In order to alleviate this issue, several authors have investigated the learnability of discrete linear functions in distribution-specific models, such as the uniform distribution (Golea and Marchand, 1993a; Köhler et al., 1990; Opper et al., 1990; Venkatesh, 1991), or the product distribution (Golea and Marchand, 1993b). Yet, beyond this pioneering work, many questions remain open, especially when the model is distribution-free but the loss functions are convex.

In this paper, we consider just such a scenario by examining the problem of learning binary-weighted linear functions with the hinge loss, a well-known surrogate of the zero-one loss. The key components of the classification problem are a set C ⊆ {0, 1}^n of boolean vectors from which the learner picks his hypotheses (as explained in Section 4.2, {−1, 0, 1}-weighted classification can be reduced to {0, 1}-weighted classification), and a fixed (yet hidden) probability distribution over the set R^n × {±1} of examples. For a hinge parameter γ > 0, the hinge loss penalizes a hypothesis c ∈ C on an example (x, y) if its margin y⟨c, x⟩ is less than γ. The performance of a hypothesis c ∈ C is measured by its risk, denoted risk(c), and defined as the expected loss of c on an example (x, y) drawn from the underlying distribution. Typically, risk(c) is upper-bounded by the sum of two terms: a sample estimate risk_m(c) of the performance of c, and a penalty term T_m(C) that depends on the hypothesis class C and, potentially, also on the training set. The sample estimate risk_m(c) is simply the averaged cumulative hinge loss of c on a set {(x_i, y_i)}_{i=1}^m of examples drawn independently from the underlying distribution. The penalty term T_m(C) can be given by the VC-dimension of C, or by its Rademacher complexity with respect to the size m of the training set. For binary-weighted linear classifiers, the penalty term induced by their Rademacher complexity can be substantially smaller than the penalty term induced by their VC dimension. So, by a simple adaptation of Bartlett and Mendelson's framework (2002), our risk bounds take the form:

    risk(c) ≤ risk_m(c) + (2/γ) R_m(C) + √(8 ln(2/δ) / m)    (1)

where R_m(C) is the Rademacher complexity of C with respect to m, and δ ∈ (0, 1) is a confidence parameter.
Ideally, we would like to have at our disposal an efficient algorithm for minimizing risk_m(c). The resulting minimizer, say c*, would be guaranteed to provide an optimal hypothesis, because the other terms in the risk bound (1) do not depend on the choice of the hypothesis. Unfortunately, because the class C of discrete linear classifiers is not a convex set, the convexity of the hinge loss does not help in finding c*, and, as shown by Theorem 1 in the next section, the optimization problem remains NP-hard.

The key message to be gleaned from this paper is that the convexity of the loss function does help in approximating the combinatorial optimization problem, using simple rounding methods. Our first algorithm is a standard randomized rounding (RR) method that starts from a fractional solution w* in the convex hull of C, and then builds c by viewing the fractional value w*_i as the probability that c_i should be set to 1. The second algorithm, called greedy rounding (GR), is essentially a derandomization of RR that iteratively rounds the coordinates of the fractional solution by maintaining a constraint on the sum of weights.

For the class C of binary-weighted linear functions, we show that the greedy rounding algorithm is guaranteed to return a concept c ∈ C satisfying:

    risk_m(c) ≤ risk_m(c*) + X_2 / (2γ)

where X_p = max_{i=1,...,m} ‖x_i‖_p and ‖x‖_p is the L_p-norm of x. We also show that the problem of improving this bound up to a constant factor is NP-hard. Combining greedy rounding's performance with the Rademacher complexity of C yields the risk bound:

    risk(c) ≤ risk_m(c*) + X_2 / (2γ) + (2/γ) X_1 min(1, √(n/m)) + √(8 ln(2/δ) / m)

For the subclass C_k of sparse binary-weighted linear functions involving at most k ones among n, we show that greedy rounding is guaranteed to return a concept c ∈ C_k satisfying:

    risk_m(c) ≤ risk_m(c*) + X_∞ √k / γ

Using the Rademacher complexity of C_k, which is substantially smaller than that of C, we have:

    risk(c) ≤ risk_m(c*) + X_∞ √k / γ + (2/γ) X_∞ k √(2 log(n/k) / m) + √(8 ln(2/δ) / m)

Similar results are derived with the randomized rounding algorithm, with less sharp bounds due to the randomization process. We evaluate these rounding methods on both synthetic and real-world datasets, showing good performance in comparison with standard linear optimization methods.

The proofs of the preparatory Lemmas 2 and 5 are not included for space reasons, but can be found in the extended version of this work (Chevaleyre et al., 2013).

2. Binary-Weighted Linear Classifiers

Notation. The set of positive integers {1, ..., n} is denoted [n]. For a subset S ⊆ R^n, we denote by conv(S) the convex hull of S. For two vectors u, v ∈ R^n and p ≥ 1, the L_p-norm of u is denoted ‖u‖_p and the inner product between u and v is denoted ⟨u, v⟩. Given a vector u ∈ R^n and k ∈ [n], we denote by u_{1:k} the prefix (u_1, ..., u_k) of u. Finally, given a training set {(x_i, y_i)}_{i=1}^m, we write X_p = max_{i=1,...,m} ‖x_i‖_p.

In this study, we shall examine classification problems for which the set of instances is the Euclidean space R^n and the hypothesis class is a subset of {0, 1}^n. Specifically, we shall focus on the class C = {0, 1}^n of all binary-weighted linear functions, and on the subclass C_k of all binary-weighted linear functions with at most k ones among n. The parameterized loss function ℓ_γ : R × {±1} → R examined in this work is the hinge loss defined by:

    ℓ_γ(p, y) = (1/γ) max(0, γ − py)    where γ > 0

2.1. Computational Complexity

For a training set {(x_i, y_i)}_{i=1}^m, the empirical risk of a weight vector c ∈ C, denoted risk_m(c), is defined by its averaged cumulative loss:

    risk_m(c) = (1/m) Σ_{i=1}^m ℓ_γ(⟨c, x_i⟩, y_i)
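For concreteness, the loss and the empirical risk can be computed in a few lines. The sketch below is ours (it assumes numpy and is not part of the paper's implementation):

```python
import numpy as np

def hinge_loss(p, y, gamma):
    # l_gamma(p, y) = (1/gamma) * max(0, gamma - p*y)
    return np.maximum(0.0, gamma - p * y) / gamma

def empirical_risk(c, X, y, gamma):
    # risk_m(c): averaged cumulative hinge loss of the weight vector c on the sample (X, y)
    return hinge_loss(X @ c, y, gamma).mean()
```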
By c*, we denote any minimizer of the objective function risk_m. Recall that if C is a convex subset of R^n, then c* can be found in polynomial time using convex optimization algorithms. However, for the discrete class C = {0, 1}^n, the next result states that the optimization problem is much harder.

Theorem 1. There exists a constant α > 0 such that, unless P=NP, there is no polynomial time algorithm capable of learning from any dataset of size m a hypothesis c ∈ C such that:

    risk_m(c) ≤ min_{c′ ∈ C} risk_m(c′) + α X_2 / γ

Proof. In what follows, we denote by c* any vector in C for which risk_m(c*) is minimal. For an undirected graph G = (V, E), the Max-Cut problem is to find a subset S ⊂ V such that the number of edges with one end point in S and the other in V \ S is maximal. Unless P=NP, no polynomial-time algorithm can achieve an approximation ratio of 0.997 for Max-Cut on 3-regular graphs (Berman and Karpinski, 1999).

Based on this result, we first construct a dataset from a 3-regular graph G = (V, E) having an even number of vertices. Our dataset consists of n = |V| + 1 features and m = 2|E| examples. The first |V| features are associated with the vertices of G. For each edge (j, j′) ∈ E, we build two positively labeled examples x and x′ in the following way. In the first example x, features j and j′ are set to γ, and all other features are set to 0. In the second example x′, features j and j′ are set to −γ, feature |V| + 1 is set to 2γ and all others are set to 0. Consider any weight vector c where c_{|V|+1} is equal to 0. Clearly, setting c_{|V|+1} to 1 will strictly decrease the loss of c if at least one coordinate in c is nonzero. Thus, we will assume from now on and without loss of generality that c_{|V|+1} is always set to 1. Observe that the loss on the two examples x and x′ is

    ℓ_γ(⟨c, x⟩, +1) + ℓ_γ(⟨c, x′⟩, +1) = 0 if c_j ≠ c_{j′}, and 1 otherwise.

Let us now define cut(c) = |{(i, j) ∈ E : c_i ≠ c_j}|. By viewing c as the characteristic vector of a subset of vertices, cut(c) is the value of the cut in G induced by this subset. Note that we have cut(c) = |E| − 2|E| risk_m(c). Thus, minimizing the loss on the dataset maximizes the cut on the graph. Consequently, cut(c*) is the optimal value of the Max-Cut problem.

Finally, suppose by contradiction that for all α > 0, there is a polynomial-time algorithm capable of learning from any dataset of size m a vector c satisfying

    risk_m(c) ≤ risk_m(c*) + α X_2 / γ

Notably, in the dataset constructed above, the value of X_2 is γ√6. Thus, for this dataset, we get that risk_m(c) ≤ risk_m(c*) + α√6, and hence,

    cut(c) ≥ cut(c*) − 2α |E| √6    (2)

To this point, Feige et al. (2001) have shown that on 3-regular graphs the optimal cut has a value of at least |E|/2. By reporting this value into (2), we obtain cut(c) ≥ cut(c*) − 4α√6 cut(c*) = cut(c*)(1 − 4α√6). Because α can be arbitrarily close to 0, this would imply that Max-Cut is approximable within any constant factor, which contradicts Berman and Karpinski's (1999) inapproximability result.
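As an illustration only, the dataset gadget used in this reduction is easy to generate. The sketch below is our own (vertices indexed from 0, the extra feature stored last) and simply follows the construction described in the proof:

```python
import numpy as np

def maxcut_dataset(edges, n_vertices, gamma):
    """Two positively labeled examples per edge (j, j'), plus an extra feature |V|+1,
    as in the proof of Theorem 1."""
    rows = []
    for j, jp in edges:
        x = np.zeros(n_vertices + 1)
        x[[j, jp]] = gamma                 # first example: features j and j' set to gamma
        xp = np.zeros(n_vertices + 1)
        xp[[j, jp]] = -gamma               # second example: j and j' set to -gamma ...
        xp[n_vertices] = 2 * gamma         # ... and the extra feature set to 2*gamma
        rows.extend([x, xp])
    return np.vstack(rows), np.ones(2 * len(edges))
```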

2.2. Rademacher Complexity

Suppose that our training set S = {(x_i, y_i)}_{i=1}^m consists of examples generated by independent draws from some fixed probability distribution on R^n × {±1}. For a class F of real-valued functions f : R^n → R, define its Rademacher complexity on S to be:

    R_S(F) = E[ sup_{f ∈ F} (1/m) Σ_{i=1}^m σ_i f(x_i) ]

Here, the expectation is over the Rademacher random variables σ_1, ..., σ_m, which are drawn from {±1} with equal probability. Since S is random, we can also take the expectation over the choice of S and define R_m(F) = E[R_S(F)], which gives us a quantity that depends on both the function class and the sample size. As indicated by inequality (1), bounds on the Rademacher complexity of a class immediately yield risk bounds for classifiers picked from that class. For continuous linear functions, sharp Rademacher complexity bounds have been provided by Kakade et al. (2008). We provide here similar bounds for two important classes of discrete linear functions.

Lemma 2. Let σ_1, ..., σ_m be Rademacher variables. Then E[|Σ_{i=1}^k σ_i|] ≥ √(k/8) for any even k ≥ 2.

Theorem 3. Let C = {0, 1}^n be the class of all binary-weighted linear functions. Then,

    R_m(C) ≤ X_1 min(1, √(n/m))

This bound is tight up to a constant factor.

Proof. Consider the hypothesis class:

    F_{p,v} = {x ↦ ⟨w, x⟩ : w ∈ R^n, ‖w‖_p ≤ v}

By Theorem 1 in (Kakade et al., 2008), we have R_m(F_{2,v}) ≤ v X_2 / √m. Moreover, for any c ∈ {0, 1}^n, we have ‖c‖_2 ≤ √n. It follows that C ⊆ F_{2,√n}, and since ‖·‖_2 ≤ ‖·‖_1, we get that:

    R_m(C) ≤ R_m(F_{2,√n}) ≤ X_2 √(n/m) ≤ X_1 √(n/m)

Now, let us prove that this bound is tight. First, let us rewrite the Rademacher complexity over samples in a more convenient form:

    R_S(C) = (1/m)  E[ Σ_{j=1}^n sup_{c_j ∈ {0,1}} c_j Σ_{i=1}^m σ_i x_{i,j} ]
           = (1/2m) E[ Σ_{j=1}^n sup_{w_j ∈ {−1,1}} w_j Σ_{i=1}^m σ_i x_{i,j} ]
           = (1/2m) Σ_{j=1}^n E[ |Σ_{i=1}^m σ_i x_{i,j}| ]    (3)

For the case n ≥ m, consider a training set S such that x_{i,i} = X_1 for all i ∈ [m], and zero everywhere else. Clearly, equation (3) implies R_S(C) = X_1 / 2. For the case n < m, assume m is a multiple of 2n and consider a dataset S in which each example contains only one non-zero value, equal to X_1, and such that the number of nonzero values per column is m/n. Then, by applying Lemma 2 to equation (3), we obtain:

    R_S(C) ≥ (n X_1 / m) √(m / (32 n)) = X_1 √(n / (32 m))

Theorem 4. For a constant k > 0, let C_k be the class of binary-weighted linear functions with at most k ones among n. Then,

    R_m(C_k) ≤ X_∞ k √(2 log(n/k) / m)

Proof. For a closed convex set S ⊂ R^n_+, consider the hypothesis class:

    F_S = {x ↦ ⟨w, x⟩ : w ∈ S}

Using the convex function F(w) = Σ_{j=1}^n (w_j / W_1) ln(w_j / W_1) + ln n, where W_1 = max_{w ∈ S} ‖w‖_1, we get from Theorem 1 in (Kakade et al., 2008):

    R_m(F_S) ≤ X_∞ W_1 √( 2 sup{F(w) : w ∈ S} / m )    (4)

For any k, let S_k = conv({w ∈ {0, 1}^n : ‖w‖_1 ≤ k}), where conv(·) is the convex hull. Because F is convex and S_k is a convex polytope, the supremum of F over S_k is attained at one of the vertices of the polytope. Thus,

    sup_{w ∈ S_k} F(w) = sup{F(w) : w ∈ {0, 1}^n, ‖w‖_1 ≤ k} ≤ Σ_{l=1}^k (1/k) ln(1/k) + ln n = ln(n/k)

The result follows by reporting this value into (4), noting that W_1 = k for S = S_k.
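Equation (3) also gives a direct way to estimate the empirical Rademacher complexity of C by Monte Carlo sampling of the sign vectors. The snippet below is a quick numerical check of ours, not something used in the paper:

```python
import numpy as np

def rademacher_C(X, n_draws=1000, rng=None):
    # Monte Carlo estimate of R_S(C) = (1/2m) * sum_j E| sum_i sigma_i x_{ij} |  (equation (3))
    rng = np.random.default_rng() if rng is None else rng
    m, _ = X.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    return np.abs(sigma @ X).sum(axis=1).mean() / (2.0 * m)
```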
3. Rounding Methods

This section exploits the convexity of the hinge loss to derive simple approximation algorithms for minimizing the empirical risk. The overall idea is to first relax the optimization problem by deriving a fractional solution w*, and then to round the solution w* using a deterministic or a randomized method. The convex optimization setting we consider is defined by:

    w* = argmin_{w ∈ [0,1]^n ∩ S} risk_m(w)    (5)

where S = R^n_+ for the hypothesis class C = {0, 1}^n, and S = {w ∈ R^n_+ : ‖w‖_1 ≤ k} for the subclass C_k of binary-weighted linear functions with at most k ones among n. Note that the empirical risk minimization problem for C can be viewed as an optimization problem over R^n_+ under the L_∞-norm constraint. The problem of minimizing the empirical risk in the convex hull of C_k is illustrated in the left part of Figure 1.

The accuracy of rounding methods depends on the number of non-fractional values in the relaxed solution w*. Indeed, if most weights of w* are already in {0, 1}, then these values will remain unchanged by the rounding phase, and the final approximation c will be close to w*. Figure 1 illustrates this phenomenon by representing a case where w* and c coincide. The objective function is represented by ellipses, and the four dots at the corners of the square are the vectors of {0, 1}². The hinge loss also plays an important part in the quality of the rounding process. Increasing the parameter γ increases the likelihood that weights become binary. Taking this to the extreme, if γ ≥ X_1, then the hinge loss is linear inside the [0, 1]^n hypercube and all the convex optimization tasks described above will yield solutions with binary weights. We note in passing that a similar phenomenon arises in the Lasso feature selection procedure, where the weight vectors are more likely to fall on a vertex of the L_1-ball as the margin increases.

Figure 1. (left) Intersection of the ℓ_1 ball of radius 2 and the ℓ_∞ ball of radius 1, for non-negative coordinates. (right) The solution to the convex relaxation coincides with the solution to the original problem.
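Since the hinge loss is piecewise linear, the relaxation (5) is a linear program. The experiments use CPLEX; the sketch below is a minimal stand-in of ours based on scipy.optimize.linprog, with slack variables for the hinge constraints and an optional cardinality constraint for C_k:

```python
import numpy as np
from scipy.optimize import linprog

def relaxed_erm(X, y, gamma, k=None):
    """Solve (5): minimize (1/m) sum_i max(0, 1 - y_i <w, x_i>/gamma) over w in [0,1]^n,
    optionally subject to sum_j w_j <= k (class C_k)."""
    m, n = X.shape
    # variables: [w_1..w_n, xi_1..xi_m]; the objective is the mean of the slacks xi
    obj = np.concatenate([np.zeros(n), np.ones(m) / m])
    # hinge constraints: xi_i >= 1 - (y_i/gamma)<x_i, w>   <=>   -(y_i/gamma) x_i.w - xi_i <= -1
    A_ub = np.hstack([-(y[:, None] / gamma) * X, -np.eye(m)])
    b_ub = -np.ones(m)
    if k is not None:                                   # cardinality constraint for C_k
        A_ub = np.vstack([A_ub, np.concatenate([np.ones(n), np.zeros(m)])])
        b_ub = np.append(b_ub, k)
    bounds = [(0.0, 1.0)] * n + [(0.0, None)] * m
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]                                    # fractional solution w*
```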

3.1. Randomized Rounding

The randomized rounding (RR) algorithm is one of the most popular approximation schemes for combinatorial optimization (Raghavan and Thompson, 1987; Williamson and Shmoys, 2011). In the setting of our framework, the algorithm starts from the fractional solution of the problem and draws a random concept c ∈ C_k by setting each value c_i independently to 1 with probability w*_i and to 0 with probability 1 − w*_i. The following lemma (derived from Bernstein's inequality) states that using c instead of w* to compute a dot product yields a bounded deviation.

Algorithm 1 Randomized Rounding (RR)
  Parameters: a set of m examples, a convex set S
  1. Solve w* = argmin_{w ∈ [0,1]^n ∩ S} risk_m(w)
  2. For each i ∈ [n], set c_i to 1 with probability w*_i
  3. Return c

Lemma 5. Let x ∈ R^n, w* ∈ [0,1]^n and c ∈ {0,1}^n be a random vector such that P[c_i = 1] = w*_i for all i ∈ [n]. Then, with probability at least 1 − δ, the following inequalities hold:

    ⟨c, x⟩ ∈ ⟨w*, x⟩ ± 1.52 ‖x‖_2 ln(2/δ)    and
    ⟨c, x⟩ ∈ ⟨w*, x⟩ ± ‖x‖_∞ (2/3 + 1.7 √‖w*‖_1) ln(2/δ)

Theorem 6. Let c be the vector returned by the randomized rounding algorithm. Then, with probability 1 − δ, the following hold:

  • For the class C:   risk_m(c) ≤ risk_m(c*) + (1.52/γ) X_2 ln(2m/δ)

  • For the class C_k:  risk_m(c) ≤ risk_m(c*) + ((2/3 + 1.7 √‖w*‖_1)/γ) X_∞ ln(2m/δ)

Proof. Since the γ-hinge loss is 1/γ-Lipschitz, we have:

    risk_m(c) − risk_m(c*) ≤ risk_m(c) − risk_m(w*) ≤ (1/(γm)) Σ_{i=1}^m |⟨c, x_i⟩ − ⟨w*, x_i⟩|    (6)

Applying the union bound to Lemma 5, we get:

    P[∃i ∈ [m] : |⟨c, x_i⟩ − ⟨w*, x_i⟩| ≥ t] ≤ Σ_{i=1}^m P[|⟨c, x_i⟩ − ⟨w*, x_i⟩| ≥ t] ≤ m δ′

The result follows by setting δ = m δ′ and reporting into (6) the values t = 1.52 X_2 ln(2/δ′) and t = (2/3 + 1.7 √‖w*‖_1) X_∞ ln(2/δ′).
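A direct implementation of Algorithm 1 takes a few lines of numpy. The sketch below is ours, and the optional trials parameter corresponds to the "best of T roundings" variant (RR×5) used later in Section 4.3:

```python
import numpy as np

def hinge_risk(w, X, y, gamma):
    return np.maximum(0.0, 1.0 - y * (X @ w) / gamma).mean()

def randomized_rounding(w_star, X, y, gamma, trials=1, rng=None):
    """Algorithm 1 (RR): set c_i = 1 with probability w*_i; with trials > 1,
    keep the draw with the lowest empirical hinge risk."""
    rng = np.random.default_rng() if rng is None else rng
    draws = (rng.random((trials, w_star.size)) < w_star).astype(float)
    best = min(draws, key=lambda c: hinge_risk(c, X, y, gamma))
    return best.astype(int)
```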

3.2. Greedy Rounding

Despite its relatively weak guarantees, the randomized rounding procedure can be used as a building block for constructing more efficient algorithms.

Specifically, the new approximation scheme we propose, called Greedy Rounding (GR), is essentially a derandomization of RR with some improvements. As described in Algorithm 2, the procedure starts again by computing the fractional solution w* of the optimization task (Line 1). Then, the coordinates of w* are rounded in a sequential manner by simply maintaining a constraint on the admissible values (Line 2). The algorithm uses a matrix Θ = [θ_{i,k}] of parameters defined as follows. For any k ∈ [n], let c_{1:k} = (c_1, ..., c_k) be the prefix of the vector c built at the end of step k. Then, θ_{i,k} = Σ_{j=1}^k x_{i,j}(c_j − w*_j) for each i ∈ [m].

Algorithm 2 Greedy Rounding (GR)
  Parameters: a set of m examples, an integer k ≤ n
  1. Solve w* = argmin_{w ∈ [0,1]^n ∩ S} risk_m(w)
  2. For k = 1 to n, set
       A_k ← {a ∈ {0,1} : ∀i = 1, ..., m, (θ_{i,k−1} + x_{i,k}(a − w*_k))² ≤ θ²_{i,k−1} + x²_{i,k} w*_k (1 − w*_k)}
       c_k ← argmin_{a ∈ A_k} risk_m(c_1, ..., c_{k−1}, a, w*_{k+1}, ..., w*_n)
  3. Return (c_1, ..., c_n)

The next result, which we call the derandomization lemma, shows that at each step k of the rounding procedure, there is a value a ∈ {0, 1} which does not increase the loss too much.

Lemma 7. For any k ≤ n and any (c_1, ..., c_{k−1}) ∈ {0,1}^{k−1}, there exists a ∈ {0, 1} such that, for all i ∈ [m],

    (θ_{i,k−1} + x_{i,k}(a − w*_k))² ≤ θ²_{i,k−1} + x²_{i,k} w*_k (1 − w*_k)

m Proof. Let z be a random vector taking values in 1 X n ∗ risk (c) − risk (c∗) ≤ |hc, x i − hw∗, x i| {0, 1} such that P[zj = 1] = wj for all j ∈ [n]. m m γm i i Clearly, we have E [z − w∗] = 0. For any i ∈ [m], i=1 ∗ ∗ 2 m let fi(z, w ) = hz − w , xii . We can observe that 1 X ∗ ∗ Pn 2 ∗ ∗ ≤ hc − w , xii [fi(z, w )] = x w 1 − w . Taking condi- γm E j=1 i,j j j i=1 tional expectations, we have: m 1 X ≤ |θi,n| (7) [f (z, w∗) | z = c ] = γm E i 1:k−1 1:k−1 i=1 h i hc − w∗ , x i2 + hz − w∗ , x i2 E 1:k−1 1:k−1 i,1:k−1 k:n k:n i,k:n Now, by application of lemma7, we know that for each step k of GR, any value a ∈ Ak is such that ∗ 2 2 2 ∗ ∗ In the right hand side of this equation, the squared (θi,k−1 + xi,k(a − wk)) ≤ θi,k−1 + xi,kwk (1 − wk) for 2 sum is equal to the sum of squares because the all i ∈ [m]. Since ck ∈ Ak, we must have θi,k ≤ Rounding Methods for Discrete Linear Classification

Reporting this inequality into (7),

    risk_m(c) − risk_m(c*) ≤ (1/(γm)) Σ_{i=1}^m √( Σ_{j=1}^n x²_{i,j} w*_j (1 − w*_j) )

Let R = risk_m(c) − risk_m(c*). For the class C, using the fact that w*_j (1 − w*_j) ≤ 1/4, we have R ≤ X_2 / (2γ). For the class C_k, using Hölder's inequality and the fact that ‖w*‖_1 ≤ k, we obtain R ≤ X_∞ √k / γ.
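For reference, a direct (unoptimized) rendering of Algorithm 2 in numpy is given below. It is our own sketch: it recomputes the empirical risk from scratch at every step, and the small tolerance added to the Lemma 7 bound only guards against floating-point noise.

```python
import numpy as np

def hinge_risk(w, X, y, gamma):
    return np.maximum(0.0, 1.0 - y * (X @ w) / gamma).mean()

def greedy_rounding(w_star, X, y, gamma):
    """Algorithm 2 (GR): round w* coordinate by coordinate, keeping at step k only
    the values a in {0,1} allowed by Lemma 7, and breaking ties by the empirical
    risk of the partially rounded vector (c_1..c_{k-1}, a, w*_{k+1}..w*_n)."""
    m, n = X.shape
    c = w_star.astype(float).copy()     # prefix c[:k] rounded, suffix still fractional
    theta = np.zeros(m)                 # theta_{i,k-1} = sum_{j<k} x_{ij} (c_j - w*_j)
    for k in range(n):
        bound = theta**2 + X[:, k]**2 * w_star[k] * (1.0 - w_star[k])
        admissible = [a for a in (0.0, 1.0)
                      if np.all((theta + X[:, k] * (a - w_star[k]))**2 <= bound + 1e-12)]
        a_best = min(admissible,
                     key=lambda a: hinge_risk(np.r_[c[:k], a, w_star[k + 1:]], X, y, gamma))
        c[k] = a_best
        theta = theta + X[:, k] * (a_best - w_star[k])
    return c.astype(int)
```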

4. Experiments

We tested the empirical performance of our algorithms by conducting experiments on a synthetic problem and on several real-world domains. Besides the Randomized Rounding (RR) algorithm and the Greedy Rounding (GR) algorithm, we evaluated the behavior of two fractional optimization techniques: the Convex (Cvx) optimization method, which returns the fractional solution of the problem specified by (5), and the Support Vector Machine (L1-SVM), which solves the ℓ_1-constrained version of the problem. For small datasets, we also evaluated MIP (mixed integer programming), which computes the exact solution of the combinatorial problem. In our implementation of the algorithms, we used the linear programming software CPLEX to obtain the fractional solution of the convex optimization tasks.

4.1. Synthetic Data

In order to validate different aspects of our algorithms, we designed a simple artificial dataset generator. Called with parameters k, n, m, η, the generator builds a dataset composed of m examples, each with n features. Examples are drawn from a uniform distribution over [−10, 10]^n. Also, the generator randomly draws a target function with exactly k ones, and each example is labeled with respect to this target. Finally, the coordinates of each example are perturbed with a normal law of standard deviation η.

We first evaluated the generalization performance of the optimization algorithms. Setting k = 10, n = 100, η = 0.1, we generated datasets with an increasing number m of examples, and plotted the generalization zero-one loss measured on test data (upper part of Figure 2). Next, we evaluated the robustness of our algorithms with respect to irrelevant attributes. Setting k = 10, m = 50, η = 0.1, we generated datasets with n varying from 20 to 800, and plotted again the generalization zero-one loss measured on test data (lower part of Figure 2). It is apparent that GR and RR perform significantly better than L1-SVM, which is not surprising, because the target concepts have {0, 1}-weights. On synthetic data, GR is slightly less accurate than RR, whose performance is close to Cvx.

Figure 2. Test error rates on synthetic data, as a function of the number of examples (upper part) and of the number of irrelevant features (lower part).

4.2. Metagenomic Data

In metagenomic classification, discrete linear functions have a natural interpretation in terms of bacterial abundance. We used a real-world dataset containing 38 individuals and 69 features. The dataset is divided into two well-balanced classes: obese people and non-obese people. Each feature represents the abundance of a bacterial species. As mentioned in the introduction, the weight of each feature captures a qualitative effect encoded by a value in {−1, 0, +1} (negative, null effect, positive). Let POS (respectively NEG) denote the group of bacterial species whose feature has a weight of 1

(respectively −1). If the abundance of all bacteria in POS is greater than the abundance of the bacteria in NEG, then the individual will be classified as obese.

In order to learn ternary-weighted linear functions with our algorithms, we used a simple trick that reduces the classification task to a binary-weighted learning problem. The idea is to duplicate the attributes in the following way: to each instance x ∈ R^n we associate an instance x′ ∈ R^d where d = 2n and x′ = (x_1, −x_1, x_2, −x_2, ..., x_n, −x_n). Given a binary-weighted concept c′ ∈ {0,1}^d, the corresponding ternary-weighted concept c ∈ {−1, 0, +1}^n is recovered by setting c_i = c′_{2i−1} − c′_{2i}. Based on this transformation, it is easy to see that ℓ_γ(c′; x′, y) = ℓ_γ(c; x, y). So, if c′ minimizes the empirical risk on the set {(x′_t, y_t)}, then c minimizes the empirical risk on {(x_t, y_t)}. If, in addition, c′ is k-sparse, then c is k-sparse.
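The duplication trick itself is mechanical; here is a minimal numpy sketch of ours implementing the transformation and the recovery of the ternary weights:

```python
import numpy as np

def duplicate_features(X):
    # x -> x' = (x_1, -x_1, x_2, -x_2, ..., x_n, -x_n), so d = 2n
    m, n = X.shape
    X2 = np.empty((m, 2 * n))
    X2[:, 0::2] = X
    X2[:, 1::2] = -X
    return X2

def ternary_from_binary(c_bin):
    # recover c in {-1,0,+1}^n from c' in {0,1}^{2n} via c_i = c'_{2i-1} - c'_{2i}
    return c_bin[0::2] - c_bin[1::2]
```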
The test error rates of the algorithms are reported in Table 1. Test errors were measured by conducting 10-fold cross-validation, averaged over 10 experiments. In light of these results, it is apparent that RR slightly outperforms both L1-SVM and Cvx, which clearly overfit the data even in presence of the L_1-ball constraint (for the first two rows). Unsurprisingly, the MIP solver generated a model superior to the others. For k ≥ 20, the mixed integer program did not finish in reasonable time, so we left the corresponding entries of the table blank. In a nutshell, we can conclude that the accuracy does not suffer from switching to ternary weights, but this learning task looks challenging.

k         L1-SVM   Cvx     RR      GR      MIP
10        0.46     0.48    0.41    0.43    0.35
20        0.46     0.44    0.41    0.44
∞         0.44     0.40    0.39    0.43
run time  0.02s    0.02s   0.04s   0.98s   13s

Table 1. Test error rates and average running time in seconds on metagenomic data.

4.3. Colon Cancer

To demonstrate the performance of discrete linear classifiers for gene selection, we applied our algorithms to microarray data on colon cancer, which is publicly available. The dataset consists of 62 samples, 22 of which are normal and 40 of which are from colon cancer tissues. The genes are already pre-filtered, and the dataset consists of 2,000 genes. We launched our algorithms with k = 15 to select 15 genes only. We did not plot the result of GR because each run of GR took a huge amount of time. Instead, RR×5 is a variant of randomized rounding that selects the best out of 5 random roundings at each time step. It turns out that RR×5 achieves a much better error rate than RR in this case (but not on the datasets of the previous subsections). Thus, we obtain a concept much simpler than the linear hypothesis generated by the SVM, with comparable accuracy.

L1-SVM   Cvx     RR      RR×5
0.15     0.15    0.2     0.164
0.04s    0.04s   0.04s   2.22s

Table 2. Test error rates and average running time in seconds on colon cancer data.

4.4. Mushrooms

Finally, we ran experiments on the "mushrooms" dataset to evaluate how M-of-N rules are learnt using rounding algorithms. This dataset contains 22 features, which are all nominal. We transformed these features into binary features, and ran our discrete linear learning algorithms on this dataset, without imposing any cardinality constraint.

With an accuracy of 98%, the M-of-N rule shown in Table 3 was produced. We ran 10 times 10-fold cross-validation with our algorithms (see Table 4). Algorithm Cvx achieves a perfect classification. Here, GR outperforms RR, but running RR several times (and choosing the best solution) considerably improves the accuracy results.

If at least 3 of the following conditions are met, then the mushroom is poisonous:
  bruises = yes
  odor ∈ {almond, foul, musty, none, pungent}
  gill attachment = attached
  gill spacing = crowded
  stalk root = rooted
  stalk color above ring = pink
  stalk color below ring = pink
  ring number = one
  ring type ∈ {large, pendant}
  spore print color = brown

Table 3. M-of-N rule for the mushrooms dataset.

Cvx     RR      RR×20   GR
0       0.6     0.014   0.023
0.24s   0.26s   0.74s   11s

Table 4. Test error rates and average running time in seconds on the mushrooms data.

References

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

P. Berman and M. Karpinski. On some tighter inapproximability results (extended abstract). In Automata, Languages and Programming, 26th International Colloquium (ICALP), pages 200–209. Springer, 1999.

Y. Chevaleyre, F. Koriche, and J. D. Zucker. Rounding methods for discrete linear classification (extended version). Technical Report hal-00771012, hal, 2013.

U. Feige, M. Karpinski, and M. Langberg. A note on approximating Max-Bisection on regular graphs. In- formation Processing Letters, 79(4):181–188, 2001.

M. Golea and M. Marchand. Average case analysis of the clipped Hebb rule for nonoverlapping Perceptron networks. In Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT'93). ACM, 1993a.

M. Golea and M. Marchand. On learning perceptrons with binary weights. Neural Computation, 5(5):767– 782, 1993b.

S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, mar- gin bounds, and regularization. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), pages 793–800, 2008.

H. Köhler, S. Diederich, W. Kinzel, and M. Opper. Learning algorithm for a neural network with binary synapses. Zeitschrift für Physik B Condensed Matter, 78:333–342, 1990.

M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581–L586, 1990.

L. Pitt and L. G. Valiant. Computational limitations on learning from examples. J. ACM, 35(4):965–984, 1988.

P. Raghavan and C. D. Thompson. Randomized rounding: A technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374, 1987.

J. Shawe-Taylor and N. Cristianini. An Introduction to Support Vector Machines. Cambridge, 2000.

G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993.

S. Venkatesh. On learning binary weights for majority functions. In Proceedings of the 4th Annual Workshop on Computational Learning Theory (COLT), pages 257–266. Morgan Kaufmann, 1991.

D. P. Williamson and D. B. Shmoys. The Design of Approximation Algorithms. Cambridge, 2011.