
Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case

Alina Beygelzimer*1  Dávid Pál*1  Balázs Szörényi*1  Devanathan Thiruvenkatachari*2  Chen-Yu Wei*3  Chicheng Zhang*4

*The authors are listed in alphabetical order. 1Yahoo Research, New York, NY, USA. 2New York University, New York, NY, USA. 3University of Southern California, Los Angeles, CA, USA. 4Microsoft Research, New York, NY, USA. Correspondence to: Dávid Pál <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

arXiv:1902.02244v2 [cs.LG] 18 Jun 2019

Abstract

We study the problem of efficient online multiclass linear classification with bandit feedback, where all examples belong to one of K classes and lie in the d-dimensional Euclidean space. Previous works have left open the challenge of designing efficient algorithms with finite mistake bounds when the data is linearly separable by a margin γ. In this work, we take a first step towards this problem. We consider two notions of linear separability, strong and weak.

1. Under the strong linear separability condition, we design an efficient algorithm that achieves a near-optimal mistake bound of O(K/γ²).

2. Under the more challenging weak linear separability condition, we design an efficient algorithm with a mistake bound of min(2^{Õ(K log²(1/γ))}, 2^{Õ(√(1/γ) log K)}).¹ Our algorithm is based on kernel Perceptron and is inspired by the work of Klivans & Servedio (2008) on improperly learning intersection of halfspaces.

¹We use the notation Õ(f(·)) = O(f(·) polylog(f(·))).

1. Introduction

We study the problem of ONLINE MULTICLASS LINEAR CLASSIFICATION WITH BANDIT FEEDBACK (Kakade et al., 2008). The problem can be viewed as a repeated game between a learner and an adversary. At each time step t, the adversary chooses a labeled example (x_t, y_t) and reveals the feature vector x_t to the learner. Upon receiving x_t, the learner makes a prediction ŷ_t and receives feedback. In contrast with the standard full-information setting, where the feedback given is the correct label y_t, here the feedback is only a binary indicator of whether the prediction was correct or not. The protocol of the problem is formally stated below.

Protocol 1  ONLINE MULTICLASS LINEAR CLASSIFICATION WITH BANDIT FEEDBACK
Require: Number of classes K, number of rounds T.
Require: Inner product space (V, ⟨·,·⟩).
for t = 1, 2, ..., T do
    Adversary chooses example (x_t, y_t) ∈ V × {1, 2, ..., K}, where x_t is revealed to the learner.
    Predict class label ŷ_t ∈ {1, 2, ..., K}.
    Observe feedback z_t = 1[ŷ_t ≠ y_t] ∈ {0, 1}.

The performance of the learner is measured by its cumulative number of mistakes ∑_{t=1}^{T} z_t = ∑_{t=1}^{T} 1[ŷ_t ≠ y_t], where 1 denotes the indicator function.
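To make the interaction concrete, the following Python sketch plays out Protocol 1. The `adversary` and `learner` objects and their method names are hypothetical placeholders introduced here for illustration; only the order in which information is exchanged and the way mistakes are counted follow the protocol.

```python
def run_protocol(adversary, learner, T):
    """Minimal sketch of Protocol 1 (online multiclass classification, bandit feedback).

    `adversary` and `learner` are hypothetical objects; only the interaction
    order follows the protocol: reveal x_t, predict, observe a single bit.
    """
    mistakes = 0
    for t in range(T):
        # Adversary commits to a labeled example; only the feature vector is revealed.
        x_t, y_t = adversary.choose_example(t)
        # Learner predicts one of the K class labels from x_t alone.
        y_hat = learner.predict(x_t)
        # Bandit feedback: a single bit z_t = 1[y_hat != y_t], not the true label.
        z_t = int(y_hat != y_t)
        learner.update(x_t, y_hat, z_t)
        mistakes += z_t
    # Cumulative number of mistakes, sum_{t=1}^{T} z_t.
    return mistakes
```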
In this paper, we focus on the special case when the examples chosen by the adversary lie in R^d and are linearly separable with a margin. We introduce two notions of linear separability, weak and strong, formally stated in Definition 1. The standard notion of multiclass linear separability (Crammer & Singer, 2003) corresponds to weak linear separability. For multiclass classification with K classes, weak linear separability requires that all examples from the same class lie in an intersection of K − 1 halfspaces and all other examples lie in the complement of the intersection of the halfspaces. Strong linear separability means that examples from each class are separated from the remaining examples by a single hyperplane.

In the full-information feedback setting, it is well known (Crammer & Singer, 2003) that if all examples have norm at most R and are weakly linearly separable with a margin γ, then the MULTICLASS PERCEPTRON algorithm makes at most ⌊2(R/γ)²⌋ mistakes. It is also known that any (possibly randomized) algorithm must make (1/2)(R/γ)² mistakes in the worst case. The MULTICLASS PERCEPTRON achieves an information-theoretically optimal mistake bound, while being time and memory efficient.²,³

²We call an algorithm computationally efficient if its running time is polynomial in K, d, 1/γ and T.
³For completeness, we present these folklore results along with their proofs in Appendix A in the supplementary material.

The bandit feedback setting, however, is much more challenging. For the strongly linearly separable case, we are not aware of any prior efficient algorithm with a finite mistake bound.⁴ We design a simple and efficient algorithm (Algorithm 1) that makes at most O(K(R/γ)²) mistakes in expectation. Its memory complexity and per-round time complexity are both O(dK). The algorithm can be viewed as running K copies of the BINARY PERCEPTRON algorithm, one copy for each class. We prove that any (possibly randomized) algorithm must make Ω(K(R/γ)²) mistakes in the worst case. The extra O(K) multiplicative factor in the mistake bound, as compared to the full-information setting, is the price we pay for the bandit feedback, or more precisely, the lack of full-information feedback.
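To illustrate the structure just described, K copies of a binary Perceptron driven by one bit of feedback, here is a small Python sketch. The class name, the rule of sampling uniformly among classes with nonnegative score, and the signs of the updates are assumptions made for this illustration; they follow the spirit of the description above rather than reproducing Algorithm 1 verbatim.

```python
import numpy as np

class BanditMulticlassPerceptron:
    """K binary-Perceptron-style weight vectors, updated from bandit feedback only.

    Illustrative sketch of the "K copies of the BINARY PERCEPTRON" idea; the
    exact prediction and update rules of Algorithm 1 in the paper may differ.
    Classes are indexed 0, ..., K-1 here.
    """

    def __init__(self, K, d, rng=None):
        self.K = K
        self.w = np.zeros((K, d))              # one weight vector per class
        self.rng = rng or np.random.default_rng()

    def predict(self, x):
        scores = self.w @ x
        # Classes whose score is nonnegative are the current candidates.
        candidates = np.flatnonzero(scores >= 0)
        if candidates.size == 0:
            candidates = np.arange(self.K)
        # Randomizing over candidates keeps the single-bit feedback informative.
        return int(self.rng.choice(candidates))

    def update(self, x, y_hat, z):
        score = self.w[y_hat] @ x
        if z == 0 and score < 0:
            self.w[y_hat] += x    # predicted class was correct but scored negative
        elif z == 1 and score >= 0:
            self.w[y_hat] -= x    # predicted class was wrong but scored nonnegative
```

This learner plugs directly into the protocol sketch above through its `predict` and `update` methods; its memory and per-round time are O(dK), matching the complexity stated for Algorithm 1.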
For the case when the examples are weakly linearly separable, it was open for a long time whether there exist efficient algorithms with a finite mistake bound (Kakade et al., 2008; Beygelzimer et al., 2017). Furthermore, Kakade et al. (2008) ask the question: Is there any algorithm with a finite mistake bound that has no explicit dependence on the dimensionality of the feature vectors? We answer both questions affirmatively by providing an efficient algorithm with a finite dimensionless mistake bound (Algorithm 2).⁵

The strategy used in Algorithm 2 is to construct a nonlinear feature mapping φ and an associated positive definite kernel k(x, x′) that makes the examples strongly linearly separable in a higher-dimensional space. We then use the kernelized version of Algorithm 1 for the strongly separable case. The kernel k(x, x′) corresponding to the feature mapping φ has a simple explicit formula and can be computed in O(d) time, making Algorithm 2 computationally efficient. For details on kernel methods see e.g. (Schölkopf & Smola, 2002) or (Shawe-Taylor & Cristianini, 2004).

The number of mistakes of the kernelized algorithm depends on the margin in the corresponding feature space. We analyze how the mapping φ transforms the margin parameter of weak separability in the original space R^d into a margin parameter of strong separability in the new feature space. This problem is related to the problem of learning intersection of halfspaces and has been studied previously by Klivans & Servedio (2008). As a side result, we improve on the results of Klivans & Servedio (2008) by removing the dependency on the original dimension d.

The resulting kernelized algorithm runs in time polynomial in the original dimension of the feature vectors d, the number of classes K, and the number of rounds T. We prove that if the examples lie in the unit ball of R^d and are weakly linearly separable with margin γ, Algorithm 2 makes at most min(2^{Õ(K log²(1/γ))}, 2^{Õ(√(1/γ) log K)}) mistakes.
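The kernelized version of Algorithm 1 never needs the feature mapping φ explicitly. A hedged sketch of this kernel-Perceptron-style bookkeeping is below: each class stores the examples on which it was updated together with ±1 coefficients, and all scores are computed through kernel evaluations. The kernel shown (`placeholder_kernel`, a simple polynomial kernel) is only a stand-in, since the paper's actual kernel k(x, x′) and its explicit O(d)-time formula are not reproduced in this excerpt, and the prediction and update rules mirror the earlier sketch rather than the paper's exact Algorithm 2.

```python
import numpy as np

def placeholder_kernel(x, xp, degree=3):
    """Stand-in positive definite kernel; the paper's kernel k(x, x') is different."""
    return (1.0 + np.dot(x, xp)) ** degree

class KernelizedBanditPerceptron:
    """Kernelized variant of the per-class Perceptron sketch above.

    Each class keeps (coefficient, example) pairs; the implicit weight vector in
    feature space is sum_i alpha_i * phi(x_i), so only kernel evaluations are needed.
    Per-round time grows with the number of stored examples, i.e. polynomially in T.
    """

    def __init__(self, K, kernel=placeholder_kernel, rng=None):
        self.K = K
        self.kernel = kernel
        self.support = [[] for _ in range(K)]   # per class: list of (alpha, x) pairs
        self.rng = rng or np.random.default_rng()

    def score(self, i, x):
        # <w_i, phi(x)> = sum_j alpha_j * k(x_j, x) over the stored pairs of class i.
        return sum(alpha * self.kernel(xs, x) for alpha, xs in self.support[i])

    def predict(self, x):
        scores = np.array([self.score(i, x) for i in range(self.K)])
        candidates = np.flatnonzero(scores >= 0)
        if candidates.size == 0:
            candidates = np.arange(self.K)
        return int(self.rng.choice(candidates))

    def update(self, x, y_hat, z):
        s = self.score(y_hat, x)
        if z == 0 and s < 0:
            self.support[y_hat].append((+1.0, x))   # positive update on predicted class
        elif z == 1 and s >= 0:
            self.support[y_hat].append((-1.0, x))   # negative update on predicted class
```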
In Appendix G, we propose and analyze a very different algorithm for weakly linearly separable data. The algorithm is based on the obvious idea that two points that are close enough must have the same label.

Finally, we study two questions related to the computational and information-theoretic hardness of the problem. Any algorithm for the bandit setting collects information in the form of so-called strongly labeled and weakly labeled examples. Strongly labeled examples are those for which we know the class label. A weakly labeled example is an example for which we know that the class label can be anything except one particular class. In Appendix H, we show that the offline problem of finding a multiclass linear classifier consistent with a set of strongly and weakly labeled examples is NP-hard. In Appendix I, we prove a lower bound on the number of mistakes of any algorithm that uses only strongly labeled examples and ignores weakly labeled examples.

2. Related work

The problem of online bandit multiclass learning was initially formulated in the pioneering work of Auer & Long (1999) under the name of "weak reinforcement model". They showed that if all examples agree with some classifier from a prespecified hypothesis class H, then the optimal mistake bound in the bandit setting can be upper bounded by the optimal mistake bound in the full-information setting, times a factor of (2.01 + o(1)) K ln K. Long (2017) later improved the factor to (1 + o(1)) K ln K and showed its near-optimality. Daniely & Helbertal (2013) extended the results to the setting where the performance of the algorithm is measured by its regret, i.e., the difference between the number of mistakes made by the algorithm and the number of mistakes made by the best classifier in H in hindsight. We remark that all algorithms developed in this context are computationally inefficient.

The linear classification version of this problem was initially studied by Kakade et al. (2008).

⁴Although Chen et al. (2009) claimed that their Conservative OVA algorithm with PA-I update has a finite mistake bound under the strong linear separability condition, their Theorem 2 is incorrect: first, their Lemma 1 (with C = +∞) along with their Theorem 1 implies a mistake upper bound of (R/γ)², which contradicts the lower bound in our Theorem 3; second, their Lemma 1