JMLR: Workshop and Conference Proceedings 25:333–348, 2012, Asian Conference on Machine Learning

QBoost: Large Scale Classifier Training with Adiabatic Quantum Optimization

Hartmut Neven ([email protected]), Google Inc.
Vasil S. Denchev ([email protected]), Purdue University
Geordie Rose ([email protected]) and William G. Macready ([email protected]), D-Wave Systems Inc.

Editors: Steven C.H. Hoi and Wray Buntine

Abstract

We introduce a novel discrete optimization method for training in the context of a boosting framework for large scale binary classifiers. The motivation is to cast the training problem into the format required by existing adiabatic quantum hardware. First we provide theoretical arguments concerning the transformation of an originally continuous optimization problem into one with discrete variables of low bit depth. Next we propose QBoost, an iterative training algorithm in which a subset of weak classifiers is selected by solving a hard optimization problem in each iteration. A strong classifier is incrementally constructed by concatenating the subsets of weak classifiers. We supplement the findings with experiments on one synthetic and two natural data sets and compare against the performance of existing boosting algorithms. Finally, by conducting a quantum Monte Carlo simulation we gather evidence that adiabatic quantum optimization is able to handle the discrete optimization problems generated by QBoost.

Keywords: adiabatic quantum optimization, discrete optimization, machine learning, supervised learning, boosting

1. Introduction

We perform binary classifier training using discrete optimization in a formulation adapted to take advantage of emerging hardware that performs adiabatic quantum optimization (AQO). AQO, first introduced by Farhi et al. (2000, 2001), is a quantum computing model with good prospects for scalable and practically useful hardware implementations. Aharonov et al. (2004) showed it to be polynomially equivalent to the gate model of quantum computation. Theoretical and numerical studies of the purported computational superiority of AQO over classical computing have repeatedly given encouraging results, e.g. Santoro et al. (2002); Farhi et al. (2009); Amin and Choi (2009); Dickson and Amin (2011). Significant investments are underway by the Canadian company D-Wave Systems to develop a hardware implementation. A series of rigorous studies of the quantum mechanical properties of the D-Wave processors, culminating in a recent Nature publication (Johnson et al., 2011), have increased the excitement in the quantum computing community for this approach.

© 2012 H. Neven, V. S. Denchev, G. Rose & W. G. Macready.

This was further fueled by news of Lockheed Martin purchasing a D-Wave machine. Most recently, benchmarking tests performed by D-Wave on their current chips (Rose, 2011) have indicated that they are indeed capable of providing dramatic speedups with respect to the best known classical algorithms for certain NP-hard problems.

For machine learning purposes, D-Wave's implementation of AQO can be regarded as a discrete optimization engine that accepts any problem formulated as quadratic unconstrained binary optimization (QUBO), a format also equivalent to the Ising model and to weighted MAX-2-SAT. It should be noted that the training formulation we propose here is a good format for AQO independently of D-Wave's efforts, since it can be physically realized as the simplest possible multi-qubit configuration, an Ising system (Brush, 1967).
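
To make the QUBO/Ising equivalence mentioned above concrete, the sketch below carries out the standard change of variables $x = (1+s)/2$ between binary variables $x \in \{0,1\}^N$ and spins $s \in \{-1,+1\}^N$. This is a minimal illustration we added of that equivalence, not D-Wave's actual hardware embedding; the function name `qubo_to_ising` is ours.

```python
import numpy as np
from itertools import product

def qubo_to_ising(Q):
    """Rewrite a QUBO energy x^T Q x (x in {0,1}^N, Q symmetric, linear terms
    carried on the diagonal) in Ising form
        E(s) = sum_{i<j} J_ij s_i s_j + sum_i h_i s_i + c,  s in {-1,+1}^N,
    via the substitution x = (1 + s) / 2."""
    Q = np.asarray(Q, dtype=float)
    off = Q - np.diag(np.diag(Q))
    J = (np.triu(off, 1) + np.tril(off, -1).T) / 4.0  # one J_ij per pair i < j
    h = Q.sum(axis=1) / 2.0
    c = np.trace(Q) / 2.0 + off.sum() / 4.0
    return J, h, c

# Brute-force check of the identity on a random 4-variable instance.
rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 4))
Q = (Q + Q.T) / 2.0
J, h, c = qubo_to_ising(Q)
for bits in product([0, 1], repeat=4):
    x = np.array(bits, dtype=float)
    s = 2.0 * x - 1.0
    assert np.isclose(x @ Q @ x, s @ J @ s + h @ s + c)
```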

2. The Learning Task

We study binary classifiers of the form $y = \mathrm{sign}(w^T x + b)$, where $x \in \mathbb{R}^N$ is a general input pattern to be classified,$^1$ $y \in \{-1, 1\}$ is the label associated with $x$, $w \in \mathbb{R}^N$ is a vector of weights to be optimized, and $b \in \mathbb{R}$ is the bias. Training, also known as regularized risk minimization, consists of choosing $w$ and $b$ by simultaneously minimizing two terms: the empirical risk $R(w, b) = \frac{1}{S} \sum_{s=1}^{S} L(m(x_s, y_s, w, b))$ and the regularization $\Omega(w)$. $R$, via a loss function $L$, estimates the error that any candidate classifier causes over a set of $S$ training examples $\{(x_s, y_s) \mid s = 1, \ldots, S\}$. The argument of $L$ is known as the margin of example $s$ with respect to the decision hyperplane defined by $w$ and $b$: $m(x_s, y_s, w, b) = y_s(w^T x_s + b)$.

$\Omega$ controls the complexity of the classifier and is necessary for good generalization because classifiers with high complexity display overfitting: they can classify the training set with low error but may not do well on previously unseen data. Training amounts to solving

$$(w, b)^* = \arg\min_{w, b} \left\{ R(w, b) + \Omega(w) \right\}. \qquad (1)$$

The most natural choice for $L$ is the 0-1 loss, which simply counts misclassifications:

$$L_{0\text{-}1}(m) = (1 - \mathrm{sign}(m))/2 \qquad (2)$$

The optimization problem (1) with $L_{0\text{-}1}$ is NP-hard due to non-convexity (Feldman et al., 2010). Also, because $L_{0\text{-}1}$ does not enforce a margin, the generalization of classifiers trained with it is poor even when regularization is applied (Vapnik, 1998). In order to avoid dealing with NP-hard optimization problems and to build large-margin classifiers, in practice $L_{0\text{-}1}$ is replaced by some convex upper bound (e.g. square, logistic, exponential, or hinge loss). This leads to convex optimization problems that can be rigorously analyzed and efficiently solved by classical means. An example of a convex upper bound to $L_{0\text{-}1}$ is the square loss:

$$L_{\text{square}}(m) = (m - 1)^2 \qquad (3)$$
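
As a quick sanity check on the upper-bound claim, the snippet below (our own illustration, not part of the original derivation) compares the two losses on a grid of margins; $(m-1)^2 \geq (1-\mathrm{sign}(m))/2$ holds for every $m$, with the bound loosest for strongly negative margins.

```python
import numpy as np

m = np.linspace(-3.0, 3.0, 601)       # grid of margins
l01 = (1.0 - np.sign(m)) / 2.0        # 0-1 loss, eq. (2)
lsq = (m - 1.0) ** 2                  # square loss, eq. (3)
assert np.all(lsq >= l01)             # square loss upper-bounds 0-1 loss
```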

The natural choice for $\Omega$ is $\ell_0$-norm penalization of $w$, because it explicitly drives weights towards exactly zero. This is associated not only with good generalization but also with fast execution during the performance phase. However, $\ell_0$-norm regularization leads to non-convex optimization problems.

1. Any feature vector consisting of raw data, extracted features, kernel or weak classifier outputs, etc.


Again, to avoid computational hardness, the $\ell_0$-norm is frequently replaced by convex alternatives such as the $\ell_2$- or $\ell_1$-norm. Regularization based on the $\ell_2$-norm has the nicest mathematical form in the sense that the resulting optimization problems are convex and differentiable. However, when continuous weight variables are used in training, $\ell_2$-norm regularization only decreases the magnitude of the weights but does not produce exact zeros. The reason is that $\ell_2$-norm regularization yields vanishing gradient magnitudes in the direction of zero weights. The alternative, the $\ell_1$-norm, may under certain conditions succeed at enforcing sparsity but leads to more complicated convex optimization problems due to non-differentiability.

3. Low-Precision Discrete Variables

The D-Wave quantum optimization processor that we aim to deploy for training requires problems to be discrete and formulated as QUBO. Further, the current hardware generation, Vesuvius (Rose, 2011), can handle a maximum of 512 binary variables, which imposes the additional requirement of being frugal with the bit depth of the weight variables.

3.1. The Discrete Optimization Problem

We discretize the elements of $w$ to some low bit depth $d_w < 64$. While this approach is somewhat unconventional, in Subsection 3.2 we argue that the weights do not need high precision. In fact we show a favorable condition, $d_w \geq \log_2(S/N)$ up to an additive constant, in the case of binary features such as the weak classifiers typically used in boosting. Even though analyzing classifiers constructed out of more general sets of features appears to be more difficult, experiments provide support for using low-precision weights. Finally, we also discretize the bias $b$ with some low bit depth $d_b < 64$.

Given the options for loss function and regularization discussed in Section 2, here we study regularized risk minimization with $L_{\text{square}}$ and $\ell_0$-norm regularization over discrete variables $\dot{w}$ and $\dot{b}$ of bit depth $d_w$ and $d_b$, respectively. This leads to a QUBO-compatible baseline optimization problem:

$$(\dot{w}, \dot{b})^* = \arg\min_{\dot{w}, \dot{b}} \left\{ \frac{1}{S} \sum_{s=1}^{S} L_{\text{square}}\!\left( y_s (\dot{w}^T x_s + \dot{b}) \right) + \lambda \|\dot{w}\|_0 \right\}, \qquad (4)$$

where $\lambda \in \mathbb{R}_{>0}$ controls the relative importance of regularization. The full QUBO derivation for a similar learning formulation is illustrated in Appendix A of Denchev et al. (2012).
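
To make the reduction concrete, here is a minimal sketch of how (4) can be cast as a QUBO in the simplest setting: 1-bit weights $\dot{w} \in \{0,1\}^N$ and no bias term. Expanding the square loss and using $w_i^2 = w_i$ for binary variables moves all linear terms onto the diagonal; the constant $+1$ per example is dropped. The function name `training_qubo` is ours, and this is an illustration of the reduction rather than the exact embedding used on the hardware.

```python
import numpy as np
from itertools import product

def training_qubo(X, y, lam):
    """QUBO matrix Q for objective (4) with 1-bit weights w in {0,1}^N and
    no bias: for binary w, w^T Q w equals the training objective up to an
    additive constant. X is the (S, N) matrix of feature (or weak-classifier)
    outputs, y the (S,) vector of +-1 labels, lam the l0 penalty weight."""
    S, N = X.shape
    Q = (X.T @ X) / S                       # quadratic terms (1/S) sum_s x_si x_sj
    lin = -2.0 * (X * y[:, None]).sum(axis=0) / S + lam
    Q[np.diag_indices(N)] += lin            # linear terms, using w_i^2 = w_i
    return Q

# Check against a direct evaluation of (4) on a tiny random instance.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(8, 3))
y = rng.choice([-1.0, 1.0], size=8)
Q = training_qubo(X, y, lam=0.1)
for bits in product([0, 1], repeat=3):
    w = np.array(bits, dtype=float)
    direct = np.mean((y * (X @ w) - 1.0) ** 2) + 0.1 * w.sum()
    assert np.isclose(w @ Q @ w + 1.0, direct)   # +1.0 is the dropped constant
```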

3.2. Bit-Depth of Weight Variables

It is evident from (2) that $L_{0\text{-}1}$ enforces one inequality constraint per training example:

$$y_s (w^T x_s + b) \geq 0 \quad \text{for } s = 1, \ldots, S \qquad (5)$$

Each training example thus demands that the weights lie on one side of a diagonal hyperplane in $N$-dimensional space. If we restrict each $x_s \in \{-1, 1\}^N$, the hyperplane corresponding to example $s$ is defined by a set of $\pm 1$ coefficients depending on $x_s$. Fig. 1 illustrates the situation for $N = 3$.


Figure 1: Arrangement of the diagonal hyperplanes that define the solution spaces for selecting $w^*$. Depicted is the situation for $N = 3$, which yields 14 regions. The number of solution regions grows rapidly: $N = 4$ leads to 104 and $N = 5$ to 1882 regions. Here all possible hyperplanes are shown; in practice, however, $S$ training examples invoke only a small subset of the $2^{N-1}$ possible hyperplanes. The blue dots are the vertices of a cube placed in the positive quadrant with one vertex coinciding with the origin. They correspond to weight configurations that can be represented with one bit. Multi-bit weights give rise to a cube-shaped lattice.

The number of regions created by $S$ hyperplanes can be calculated using the characteristic polynomial of the arrangement (Orlik and Terao, 1992):

$$N_{\text{regions}} = (-1)^N \sum_{S_k} (-1)^k (-1)^{\dim(\cap S_k)}, \qquad (6)$$

where $S_k$ designates the $k$-element subsets of the $S$ hyperplanes and $\dim(\cap S_k)$ is the dimension of the intersection of $S_k$.$^2$ Due to linear dependencies among the hyperplanes, which occur for $N \geq 4$, we are not able to find a closed-form expression for $\dim(\cap S_k)$ and instead have to resort to an upper bound for $N_{\text{regions}}$ (Orlik and Terao, 1992; Sauer, 1972):

$$N_{\text{regions}} \leq \sum_{k=0}^{N} \binom{S}{k} \qquad (7)$$

2. In this calculation we ignore the fact that a hyperplane or parts of it can become a solution space itself. This can occur when there are two training examples $s'$ and $s''$ for which $x_{s'}$ and $x_{s''}$ differ by a global sign but have the same label $y_{s'} = y_{s''}$. Since this case is exceedingly unlikely, the probability being $O(S/2^{2N})$, we can afford not to consider this situation.


It is possible that multiple training examples generate identical inequality constraints for $w$, so (7) is a conservative estimate; the actual number of solution spaces is often lower.

Discrete weight configurations with a finite bit depth $d_w$ lie on an $N$-dimensional hypercubic lattice with edges that have $2^{d_w}$ vertices. This gives a total of $2^{d_w N}$ vertices on the lattice. If each solution region contains a lattice vertex, then all classifiers that can be attained with real-valued weights can also be realized by discrete weight configurations. Thus, one obtains the necessary bit depth by demanding that the number of vertices on the lattice be at least as large as $N_{\text{regions}}$, the number of solution regions created by the hyperplanes:

$$\frac{\text{Vertices on Lattice}}{\text{Regions in Positive Quadrant}} \approx \frac{(2^{d_w})^N}{N_{\text{regions}}} \geq \frac{2^{d_w N}}{\sum_{k=0}^{N} \binom{S}{k}} \geq \frac{2^{d_w N}}{(eS/N)^N} = \left( \frac{2^{d_w} N}{eS} \right)^N \overset{!}{\geq} 1$$
$$\Rightarrow \left( \frac{2^{d_w} N}{eS} \right)^N = \left( \frac{2^{d_w} N}{efN} \right)^N = \left( \frac{2^{d_w}}{ef} \right)^N \overset{!}{\geq} 1 \qquad (8)$$

$$\Rightarrow d_w \geq \log_2(f) + \log_2(e),$$

where $e$ is Euler's number and $f = S/N$. In (8) we used a standard bound on sums of binomial coefficients, $\sum_{k=0}^{N} \binom{S}{k} \leq (eS/N)^N$, which holds for $S \geq N$; smaller numbers of training examples lead to even better bounds than (8). This is an important result, as it shows that the required bit depth for the weight variables grows only logarithmically with the ratio of the number of training examples to the number of features. Thus for many problems that arise in practice we can get away with very few bits, and often a single bit may suffice.
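
For a feel of the numbers, the helper below (our own, not from the paper) evaluates the bound $d_w \geq \log_2(S/N) + \log_2(e)$ for $S \geq N$:

```python
import math

def required_bit_depth(S, N):
    """Smallest integer bit depth satisfying bound (8); assumes S >= N."""
    return max(1, math.ceil(math.log2(S / N) + math.log2(math.e)))

print(required_bit_depth(10**4, 10**2))   # f = 100 -> 9 bits
print(required_bit_depth(10**3, 10**3))   # f = 1   -> 2 bits
```

For $S$ comparable to $N$ the bound is already near one bit, consistent with the single-bit discretization used in the experiments of Section 6.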

4. Comparison to AdaBoost

In the case of a finite dictionary of weak classifiers $\{h_i(x) \mid i = 1, \ldots, N\}$, AdaBoost can be seen as a greedy minimization of the exponential loss (Zhang, 2008):

$$\alpha^* = \arg\min_{\alpha} \left\{ \frac{1}{S} \sum_{s=1}^{S} \exp\left( -y_s \sum_{i=1}^{N} \alpha_i h_i(x_s) \right) \right\}, \qquad (9)$$

with $\alpha_i \in \mathbb{R}_{>0}$. There are two differences between the baseline objective (4) and the one employed by AdaBoost. The first is that we use $\ell_0$-norm regularization. Second, we employ square loss, while AdaBoost works with exponential loss.

It can be shown that including $\ell_0$-norm regularization in the objective (4) leads to improved generalization as compared to using square loss only. An upper bound for the Vapnik-Chervonenkis dimension of a strong classifier of the form $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ is

$$VC_H = 2(VC_{\{h_t\}} + 1)(T + 1) \log_2(e(T + 1)), \qquad (10)$$

where $VC_{\{h_t\}}$ is the VC dimension of the weak classifiers (Freund and Schapire, 1995), a complexity measure of the class of functions the dictionary can represent. Then the strong classifier's generalization error has the upper bound (Vapnik and Chervonenkis, 1971):

$$\text{Error}_{\text{test}} \leq \text{Error}_{\text{train}} + \sqrt{ \frac{ VC_H \left( \ln\frac{2S}{VC_H} + 1 \right) + \ln\frac{9}{\delta} }{S} }. \qquad (11)$$
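
The snippet below (our own illustration) evaluates (10) and plugs it into (11) for a few classifier sizes $T$, showing how a more compact strong classifier tightens the guarantee for a fixed training error; `vc_weak` stands in for $VC_{\{h_t\}}$, and the numeric values chosen are arbitrary. The absolute numbers are loose, as VC bounds typically are; the point is the monotone dependence on $T$.

```python
import math

def generalization_bound(err_train, T, vc_weak, S, delta=0.05):
    """Evaluate the VC bound (10) and plug it into the error bound (11)."""
    vc_H = 2.0 * (vc_weak + 1) * (T + 1) * math.log2(math.e * (T + 1))
    slack = math.sqrt((vc_H * (math.log(2.0 * S / vc_H) + 1.0)
                       + math.log(9.0 / delta)) / S)
    return err_train + slack

for T in (25, 100, 400):   # smaller T -> smaller VC_H -> tighter bound
    print(T, round(generalization_bound(0.05, T, vc_weak=2, S=20000), 3))
```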



Figure 2: AdaBoost applied to a simple classification task. A shows the data, a separable case consisting of a two-dimensional cluster of positive examples (blue) surrounded by negative ones (red). B shows the random division into training (saturated colors) and test data (light colors). The dictionary of weak classifiers is constructed out of axis-parallel one-dimensional hyperplanes. C shows the optimal classifier for this situation, which employs four weak classifiers to partition the input space into positive and negative areas. The lower row shows partitions generated by AdaBoost after 10, 20, and 640 iterations. The configuration at T = 640 is the asymptotic one, which does not change in subsequent iterations. The breakout regions outside the bounding box of the positive cluster occur in areas in which the training set contains no negative examples. This problem becomes more severe for higher-dimensional data.

Evidently, a more compact strong classifier that achieves a given $\text{Error}_{\text{train}}$ with a smaller number $T$ of weak classifiers, and hence with a smaller $VC_H$, comes with a guarantee of lower generalization error. Looking at the optimization problem (4), one can see that if the regularization parameter $\lambda$ is chosen weak enough, i.e. $\lambda < \frac{2}{N} + \frac{1}{N^2}$, then the effect of regularization is merely to thin out the strong classifier without sacrificing training accuracy. One arrives at this condition for $\lambda$ by demanding that the reduction of the regularization term $\Delta\Omega(\dot{w})$ obtainable by switching a $\dot{w}_i$ to zero be smaller than the smallest associated increase in empirical risk $\Delta R(\dot{w})$ that comes from incorrectly classifying a training example. This condition guarantees that weak classifiers are not eliminated at the expense of a higher training error. Therefore regularization keeps only a minimal set of components: those needed to achieve the minimal training error obtainable when using the loss term only. In this regime the VC bound of the resulting strong classifier is lower than or equal to the VC bound of a classifier trained without regularization.

AdaBoost does not explicitly enforce sparsity, so the classifier may use a richer set of weak classifiers than needed for the minimal training error, which in turn leads to degraded generalization. Fig. 2 illustrates this fact for hand-crafted data with simple structure.


Due to AdaBoost's greedy approach, the optimal configuration for this data is not found, despite the fact that the weak classifiers necessary to construct the ideal classifier are generated.

In practice we do not operate in the weak-λ regime but rather determine the regularization strength by cross-validation: we measure the performance of the classifier for different values of λ on a validation set and then choose the value with minimal validation error. In this regime the optimization may perform a trade-off and accept a somewhat higher empirical loss if the classifier can be kept more compact. In other words, it may choose to misclassify training examples if the classifier can be kept simpler. This leads to increased robustness on noisy data, and indeed we observe the most significant gains over AdaBoost for noisy data sets where the Bayes error is high. The fact that boosting in its standard formulation, with convex loss and no explicit regularization, is not robust against label noise has drawn attention recently (Long and Servedio, 2010; Freund, 2009). We also experimented with a version of AdaBoost with explicit $\ell_1$ regularization (Duchi and Singer, 2009), but it did not perform better than plain AdaBoost on our data.

The second difference to the baseline objective, namely that it employs square loss while AdaBoost works with exponential loss, is of smaller importance. In fact, the discussion above about the role of regularization does not change if we replace square loss by exponential loss. The literature agrees that the use of exponential loss in AdaBoost is not essential and that other loss functions could yield classifiers with similar performance (Friedman et al., 2000; Wyner, 2002). From a statistical perspective, square loss is satisfactory since a classifier that minimizes it is Bayes consistent, i.e. with an increasing number of training examples it asymptotically approaches the Bayes-optimal classifier (Zhang, 2008).

5. Large Scale Classifiers by QBoost

The disadvantage of the baseline (4) is that it assumes a dictionary of weak classifiers small enough that all weight variables can be considered in a single AQO run. This approach needs to be modified if the goal is to train a large-scale classifier. Large scale here means that either the dictionary contains more weak classifiers than can be considered in a single AQO run, or the final strong classifier consists of a number of weak classifiers that exceeds the number of variables that can be handled at once. Typical problem sizes usually satisfy both conditions. The state-of-the-art commercial solver CPLEX operates in heuristic mode for the problems we study (mixed integer quadratic programs) and can solve problems in reasonable time for up to about $10^3$ variables. Existing quantum hardware currently handles 512-variable problems. In order to train a strong classifier we often sift through millions of features. Moreover, dictionaries of weak learners often depend on continuous parameters, which makes their cardinality effectively infinite. We estimate that typical classifiers employed in vision-based products today use thousands of weak learners. Therefore it is not possible to determine all weights in a single AQO run; the problem must be broken into smaller chunks.

Let $T$ denote the size of the final strong classifier and $Q$ the number of variables that a single optimization run can handle. $Q$ is determined by the number of available qubits or, if classical solvers such as CPLEX or tabu search (Palubeckis, 2004) are employed, by the maximum problem size for which optimal solutions can be obtained in reasonable time. QBoost Algorithms 1 and 2 consider two cases: $T \leq Q$ and $T > Q$.


Algorithm 1: T ≤ Q (QBoost Inner Loop)

Require: Training and validation data $\{x_s\}$, dictionary of weak classifiers $\{h_i(x)\}$, regularization parameters $\lambda_{\min}$, $\lambda_{\text{step}}$, and $\lambda_{\max}$
Ensure: Strong classifier $H_{\dot{w}^*}(x)$
1: Initialize: $\forall s$, $d_{\text{inner}}(s) = \frac{1}{S}$; $T_{\text{inner}} = 0$; empty strong classifier $H_{\dot{w}^*}(x)$; storage for a pool of $Q$ candidate weak learners $\{h_q\}$
2: repeat
3:   Optimize the members of the dictionary $\{h_i\}$ according to the current $d_{\text{inner}}$
4:   From $\{h_i\}$ select the $Q - T_{\text{inner}}$ weak classifiers with the smallest training error rates weighted by $d_{\text{inner}}$ and add them to the pool $\{h_q\}$
5:   for $\lambda = \lambda_{\min} : \lambda_{\text{step}} : \lambda_{\max}$ do
6:     Optimize $\dot{w}^* = \arg\min_{\dot{w}} \left\{ \sum_{s=1}^{S} \left( y_s \sum_{q=1}^{Q} \dot{w}_q h_q(x_s) - 1 \right)^2 + \lambda \|\dot{w}\|_0 \right\}$
7:     Set $T_{\text{inner}} = \|\dot{w}^*\|_0$
8:     Construct strong classifier $H_{\dot{w}^*}(x) = \mathrm{sign}\left( \sum_{q=1}^{Q} \dot{w}^*_q h_q(x) \right)$
9:     Measure validation error $\text{Error}_{\text{val}}$ of $H_{\dot{w}^*}(x)$ on the unweighted validation set
10:  end for
11:  Save $\dot{w}^*$, $T_{\text{inner}}$, $H_{\dot{w}^*}(x)$ and $\text{Error}_{\text{val}}$ from the optimization run that has yielded the lowest validation error so far
12:  Update $d_{\text{inner}}(s) = d_{\text{inner}}(s) \left( y_s \sum_{q=1}^{Q} \dot{w}^*_q h_q(x_s) - 1 \right)^2$
13:  Normalize $d_{\text{inner}}(s) = d_{\text{inner}}(s) / \sum_{s=1}^{S} d_{\text{inner}}(s)$
14:  Delete from the pool $\{h_q\}$ the $Q - T_{\text{inner}}$ weak learners for which $\dot{w}^*_q = 0$
15: until validation error $\text{Error}_{\text{val}}$ stops decreasing

Algorithm 2: T > Q (QBoost Outer Loop)

Require: Training and validation data $\{x_s\}$, dictionary of weak classifiers $\{h_i(x)\}$, regularization parameters $\lambda_{\min}$, $\lambda_{\text{step}}$, and $\lambda_{\max}$
Ensure: Strong classifier $H_{\dot{w}^*_{\text{outer}}}(x)$
1: Initialize: $\forall s$, $d_{\text{outer}}(s) = \frac{1}{S}$; $T_{\text{outer}} = 0$; empty strong classifier $H_{\dot{w}^*_{\text{outer}}}(x)$; storage for selected weak learners $\{h_t\}$
2: repeat
3:   Run Algorithm 1 with $\{x_s\}$, $\{h_i(x)\}$, $\lambda_{\min}$, $\lambda_{\text{step}}$, $\lambda_{\max}$, initializing $d_{\text{inner}}$ from $d_{\text{outer}}$, and using an objective function that takes into account the current $H_{\dot{w}^*_{\text{outer}}}(x)$:
     $\dot{w}^* = \arg\min_{\dot{w}} \left\{ \sum_{s=1}^{S} \left( y_s \left( \sum_{t=1}^{T_{\text{outer}}} \dot{w}^*_{\text{outer},t} h_t(x_s) + \sum_{q=1}^{Q} \dot{w}_q h_q(x_s) \right) - 1 \right)^2 + \lambda \|\dot{w}\|_0 \right\}$
4:   $H_{\dot{w}^*_{\text{outer}}}(x) = \mathrm{sign}\left( \sum_{t=1}^{T_{\text{outer}}} \dot{w}^*_{\text{outer},t} h_t(x) + \sum_{q=1}^{Q} \dot{w}^*_q h_q(x) \right)$
5:   Set $\dot{w}^*_{\text{outer}} = [\dot{w}^*_{\text{outer}}, \dot{w}^*]$
6:   Set $\{h_t\} = \{\{h_t\}, \{h_q\}\}$
7:   Set $T_{\text{outer}} = T_{\text{outer}} + T_{\text{inner}}$
8:   Update $d_{\text{outer}}(s) = d_{\text{outer}}(s) \left( y_s \sum_{t=1}^{T_{\text{outer}}} \dot{w}^*_{\text{outer},t} h_t(x_s) - 1 \right)^2$
9:   Normalize $d_{\text{outer}}(s) = d_{\text{outer}}(s) / \sum_{s=1}^{S} d_{\text{outer}}(s)$
10: until validation error $\text{Error}_{\text{val}}$ stops decreasing


One way to think about Algorithm 1 is as an enrichment process. In the first iteration the algorithm selects, out of a subset of $Q$ weak classifiers, those $T_{\text{inner}}$ that produce the optimal validation error. The subset of $Q$ weak classifiers is preselected from a dictionary whose cardinality may be much larger than $Q$. In the next step the algorithm fills the remaining $Q - T_{\text{inner}}$ empty slots in the solver with the best weak classifiers drawn from a modified dictionary, adapted by taking into account for which examples the strong classifier constructed in the first iteration is already good and where it still makes errors. This is the boosting idea (Freund and Schapire, 1999). Under the assumption that the solver always finds the global minimum, it is guaranteed that for a given $\lambda$ the solutions found in a subsequent round have lower or equal objective value, i.e. they achieve lower empirical risk, represent a more compact strong classifier, or stay the same. The fact that the algorithm considers groups of $Q$ weak classifiers simultaneously (instead of augmenting the strong classifier by one) and then tries to find a subset that produces low empirical risk allows it to find good configurations more efficiently. If the validation error cannot be decreased any further using the inner loop, one may conclude that more weak classifiers are needed for constructing the strong one. In this case Algorithm 2 "freezes" the classifier obtained so far and adds another partial classifier trained by the inner loop. A compact sketch of the inner loop is given below.
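
The following is a minimal sketch of Algorithm 1 for a fixed, precomputed dictionary: `H_train` and `H_val` hold the ±1 outputs of all dictionary members (columns) on the training and validation examples, `solve_qubo` stands in for the AQO or tabu solver (any routine returning a binary weight vector minimizing $w^T A w$), and `training_qubo` is the reduction sketched in Section 3.1. Step 3 of the algorithm (re-optimizing dictionary members, e.g. stump thresholds, under $d_{\text{inner}}$) is omitted; all helper names are ours.

```python
import numpy as np

def qboost_inner(H_train, y, H_val, y_val, Q, lambdas, solve_qubo):
    """Sketch of Algorithm 1 (QBoost inner loop) for a fixed dictionary.

    H_train, H_val: (S, N) / (S_val, N) arrays of +-1 weak-classifier outputs;
    solve_qubo: stand-in solver mapping a QUBO matrix to a binary weight
    vector; training_qubo: the reduction sketched in Section 3.1.
    """
    S, N = H_train.shape
    d = np.full(S, 1.0 / S)                  # line 1: uniform example weights
    pool = np.array([], dtype=int)           # indices of pooled weak learners
    best_err, best_w, best_pool = np.inf, None, None
    while True:
        # Line 4: refill the pool to Q learners with smallest d-weighted error.
        errs = (H_train != y[:, None]).astype(float).T @ d
        errs[pool] = np.inf                  # keep current pool members out
        pool = np.concatenate([pool, np.argsort(errs)[: Q - len(pool)]])
        # Lines 5-11: sweep lambda, solving the discrete problem on the pool.
        round_err, round_w = np.inf, None
        for lam in lambdas:
            w = solve_qubo(training_qubo(H_train[:, pool].astype(float), y, lam))
            err = np.mean(np.sign(H_val[:, pool] @ w) != y_val)
            if err < round_err:
                round_err, round_w = err, w
        if round_err >= best_err:            # line 15: validation error stalls
            return best_pool, best_w, best_err
        best_err, best_w, best_pool = round_err, round_w.copy(), pool.copy()
        # Lines 12-13: boosting-style reweighting of the training examples.
        d = d * (y * (H_train[:, pool] @ round_w) - 1.0) ** 2
        d = d / d.sum()
        # Line 14: drop pooled learners whose weight was set to zero.
        pool = pool[round_w > 0]
```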

6. Experimental Results

In this section we present experimental results from training with QBoost on three different data sets. The first two can be considered toy problems of limited practical importance, so we train classifiers for them with a classical heuristic algorithm serving as a stand-in for quantum hardware. The third data set represents a truly difficult large-scale practical problem, and for it we use an actual quantum chip consisting of 52 operational qubits.

6.1. Optimization by Classical Heuristic

We assess the performance of classifiers trained by QBoost Algorithm 2 on synthetic and natural data sets. The synthetic data was sampled in 30 dimensions from
$$P(x, y) = \tfrac{1}{2}\, \delta(y - 1)\, \mathcal{N}(x \mid \mu_+, I) + \tfrac{1}{2}\, \delta(y + 1)\, \mathcal{N}(x \mid \mu_-, I/2),$$
where $\mathcal{N}(x \mid \mu, \Sigma)$ is a spherical Gaussian with mean $\mu$ and covariance $\Sigma$. This data captures generic characteristics of classification problems via an overlap coefficient that determines the separation of the two Gaussians. The natural data consists of 30-dimensional vectors of Gabor wavelet amplitudes extracted at eye locations in images showing faces. The data sets consist of $2 \cdot 10^4$ input vectors, which we divide evenly into training, validation, and test sets. We use tabu search as a classical heuristic stand-in for quantum hardware, discretize the weight variables with $d_w = 1$, and employ a dictionary of decision stumps:

$$h^1_{l,p}(x) = \mathrm{sign}\left( (-1)^p x_l - \Theta_{p,l} \right) \quad \text{for } l = 1 : M;\ p \in \{0, 1\}$$
$$h^2_{l,p}(x) = \mathrm{sign}\left( (-1)^p x_i x_j - \Theta_{p,i,j} \right) \quad \text{for } l = 1 : M(M-1)/2;\ i, j = 1 : M;\ i < j;\ p \in \{0, 1\}$$

Here $M$ is the dimensionality of the input vector $x$; $x_l$, $x_i$, $x_j$ are elements of $x$; and $\Theta_{p,l}$, $\Theta_{p,i,j}$ are optimized thresholds. For the 30-dimensional input data the dictionary comprises 930 weak classifiers, and for the 96-dimensional data it comprises 9312 weak learners.
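
A small sketch of this dictionary construction: the base features are the $M$ coordinates plus the $M(M-1)/2$ pairwise products, each used with both polarities, which reproduces the dictionary sizes quoted above ($2(M + M(M-1)/2) = 930$ for $M = 30$ and $9312$ for $M = 96$). The threshold search below is a plain grid over observed feature values with an equivalent parameterization $p(f - \theta)$ of the polarity; the paper's exact threshold optimization may differ, and both function names are ours.

```python
import numpy as np
from itertools import combinations

def stump_features(X):
    """Base features for the stump dictionary: each coordinate x_l plus
    every pairwise product x_i * x_j with i < j. X has shape (S, M)."""
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + pairs)

def fit_stump(f, y, d):
    """For one base feature column f, return the +-1 outputs of the stump
    that minimizes classification error weighted by the distribution d,
    searching both polarities p and a grid of thresholds theta."""
    best_err, best_out = np.inf, None
    for theta in np.unique(f):
        for p in (1.0, -1.0):
            out = np.where(p * (f - theta) > 0, 1.0, -1.0)
            err = d @ (out != y)
            if err < best_err:
                best_err, best_out = err, out
    return best_out
```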


Figure 3: Test errors for the synthetic data set. We ran the outer loop algorithm for three different values of $Q$: 64, 128, 256. The plots show means over 100 runs, and the error bars indicate the corresponding standard deviations. All three versions outperform AdaBoost. The gain increases as the classification problem gets harder, i.e. as the overlap between the two Gaussians increases. An overlap coefficient of 1 means that $\mu_+ = \mu_-$, and the Bayes error rate for this case is approximately 0.05. One can also see that there is a benefit to being able to run larger optimizations, since the error rate decreases with increasing $Q$. For comparison we also include results for a classifier trained on a fixed dictionary (QP), for which the training was performed as per (4).

Table 1: Results obtained from the eye data. As with the synthetic data, we compare the outer loop algorithm for three different optimization sizes ($Q$ = 64, 128, 256) to AdaBoost. The means and standard deviations are obtained from $10^3$ runs. Here QBoost leads to only slightly lower test errors but obtains them with a significantly reduced number of weak classifiers. Also, the number of reweighting iterations needed in training is more than 4 times lower than what AdaBoost requires.

                                          QBoost Outer Loop
                    AdaBoost        Q = 64          Q = 128         Q = 256
Test error          0.258 ± 0.006   0.254 ± 0.010   0.249 ± 0.010   0.246 ± 0.009
Weak classifiers    257.8 ± 332.1   116.8 ± 139.0   206.1 ± 241.8   356.3 ± 420.3
Reweightings        658.9 ± 209.3   130.8 ± 65.3    145.6 ± 65.5    159.8 ± 63.7
Training error      0.038 ± 0.054   0.185 ± 0.039   0.170 ± 0.039   0.158 ± 0.040
Outer loops         --              11.9 ± 5.0      12.2 ± 4.6      12.6 ± 4.4


Test results for the synthetic data are shown in Fig. 3, and Table 1 displays results obtained from the natural data. We use a validation set to determine the optimal size $T$ of the strong classifier generated by AdaBoost.

6.2. Optimization by Quantum Hardware

In this experiment we applied QBoost to the training of a detector for cars in digital images and employed an early-generation D-Wave chip (Rainier) to solve optimization problems of size $Q = 52$ by AQO. The training and test sets consist of $2 \cdot 10^4$ images, with roughly half the images in each set containing cars in side view and the other half containing city streets and buildings without cars. The images containing cars are human-labeled ground truth data with tight bounding boxes drawn around each car. The actual data seen by the training system is obtained by randomly sampling subregions of the input training and test data, yielding about $10^5$ patches.

Starting from a highly tuned training system implementing cascaded confidence-rated boosting (Schapire and Singer, 1998; Viola and Jones, 2004), we replaced its feature-selection routine with QBoost. For each stage the original system calls the feature-selection routine, requesting a fixed number $T$ of features for that stage. The numbers of features for different stages are determined by experience and typically involve smaller numbers for earlier stages and larger numbers for later stages, in order to achieve the optimal trade-off between accuracy and processing time during evaluation of the final detector. The guiding principle is to reduce the false positives by 50% at each stage of the cascade. Consequently, for each stage of the cascade QBoost is invoked with a request to select $T$ weak classifiers out of the several million weak classifiers that compose the dictionary for this problem. We keep the optimization size $Q$ fixed at 52, which is the maximum problem size that the available quantum hardware could handle. Since $T$ can be both smaller and larger than 52, early stages utilize only Algorithm 1, while later stages invoke Algorithm 2.

We found that training using the quantum hardware led to a classifier with useful classification accuracy. More importantly, as shown in Fig. 4, the accuracy obtained with the hardware solver is better than that of the classifier trained with tabu search. However, the classifier obtained with the hardware solver, even though it is sparser by about 10%, has a false negative rate approximately 10% higher than the boosting framework we employed as a point of departure. A possible reason is the extremely small $Q$ relative to the number of weak learners, as here we select from a dictionary consisting of several million weak classifiers. Future training runs with larger hardware are expected to produce drastically better results.
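
As an aside on the cascade mechanics, the helper below (our own illustration, not code from the paper) shows one common way to realize the 50% false-positive reduction per stage: choose each stage's decision threshold as the score quantile that rejects half of the negatives surviving the previous stages, with a check on the positive pass-rate.

```python
import numpy as np

def stage_threshold(neg_scores, pos_scores, fp_keep=0.5, min_tpr=0.99):
    """Pick a stage threshold that lets only an fp_keep fraction of the
    surviving negatives pass; warn if the positive pass-rate drops too low."""
    thr = np.quantile(neg_scores, 1.0 - fp_keep)  # top fp_keep of negatives pass
    tpr = np.mean(pos_scores > thr)
    if tpr < min_tpr:
        print(f"warning: stage keeps only {tpr:.3f} of positives")
    return thr
```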

7. Scaling Analysis via Quantum Monte Carlo Simulation

We used the quantum Monte Carlo (QMC) simulator of Farhi et al. (2009) to estimate the time complexity of AQO on the optimization problems that QBoost generates. According to the adiabatic theorem (Messiah, 1999), the ground state of the problem Hamiltonian $H_P$ is found with high probability by AQO, provided that the evolution time $T$ from the initial Hamiltonian $H_B$ to $H_P$ is $\Omega(g_{\min}^{-2})$, where $g_{\min}$ is the minimum gap. Here $H_B = \sum_{i=1}^{N} (1 - \sigma_i^x)/2$, where $\sigma^x$ is a Pauli matrix. The minimum gap is the smallest energy gap between the ground state energy $E_0$ and the first excited state energy $E_1$ of the time-dependent Hamiltonian $H(t) = (1 - t/T) H_B + (t/T) H_P$ at any $0 \leq t \leq T$.


[Figure 4: two panels comparing the hardware solver and tabu search. Left panel: false negative rate versus false positive rate. Right panel: cumulative number of weak classifiers versus stage number.]

Figure 4: Error rates and sparsity levels for tabu search compared to the quantum hardware solver. The training consisted of 24 cascaded stages. Each stage is trained on the false negatives of the previous stage as its positive examples, complemented with true negative examples unseen so far (Viola and Jones, 2004). The error rates of the hardware solver are superior to the ones obtained by tabu search across all stages. The reported error rates are per-pixel errors. The cumulative number of weak classifiers employed at a given stage does not suggest a clear advantage for either method.

For notational convenience we also use $\tilde{H}(s) = (1 - s) H_B + s H_P$ with $0 \leq s \leq 1$. More details can be found in the seminal work of Farhi et al. (2000).

To find the time complexity of AQO for a given objective function, one needs to estimate the asymptotic scaling of the minimum gap as observed on a collection of average-case instances of corresponding problems. As noted in Amin and Choi (2009), analytically extracting the minimum gap scaling has proven extremely difficult in practice, except for a few special cases. The only alternative is to resort to numerical methods, namely diagonalization and QMC simulation. Unfortunately, diagonalization is currently limited to about $N = 30$ and QMC to about $N = 256$, where $N$ is the number of binary variables (Young et al., 2010). Hence, the best that can be done with the currently available tools is to compute the minimum gap via QMC simulations on small problem instances and attempt to extrapolate the scaling of the time complexity to larger instances.

Using the QMC simulator of Farhi et al. (2009), we indirectly measured the minimum gap by estimating the magnitude of the second derivative of the ground state energy with respect to $s$, which is related to the minimum gap (Young et al., 2008). This quantity is an upper bound on $2|V_{01}|^2 / g_{\min}$, where $V_{01} = \langle \Psi_0 | \, d\tilde{H}/ds \, | \Psi_1 \rangle$ and $\Psi_0$, $\Psi_1$ are the eigenstates corresponding to the ground and first excited states of $\tilde{H}$. However, the quantity we seek for the time scaling of AQO is $|V_{01}| / g_{\min}^2$. Nevertheless, assuming that the matrix element $V_{01}$ is not extremely small, the scaling of the second derivative, polynomial or exponential, can be used to infer whether the time complexity of AQO is polynomial or exponential in $N$.
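
For problem sizes where exact diagonalization is feasible (roughly $N < 30$, and far fewer with dense matrices as below), the minimum gap can be computed directly by scanning $\tilde{H}(s)$. The sketch below does this for a QUBO-type problem Hamiltonian; it is our illustration of the diagonalization cross-check shown in Fig. 5, not the QMC simulator itself.

```python
import numpy as np

def hamiltonians(Q):
    """Dense H_B and H_P for an N-variable QUBO. H_P assigns each basis
    state z in {0,1}^N the energy z^T Q z; H_B = sum_i (1 - sigma_x_i)/2."""
    N = Q.shape[0]
    dim = 2 ** N
    z = ((np.arange(dim)[:, None] >> np.arange(N)[None, :]) & 1).astype(float)
    HP = np.diag(np.einsum('bi,ij,bj->b', z, Q, z))
    sx, I2 = np.array([[0.0, 1.0], [1.0, 0.0]]), np.eye(2)
    HB = np.zeros((dim, dim))
    for i in range(N):
        term = np.array([[1.0]])
        for j in range(N):
            term = np.kron(term, sx if j == i else I2)
        HB += (np.eye(dim) - term) / 2.0
    return HB, HP

def minimum_gap(Q, steps=101):
    """Scan H(s) = (1 - s) H_B + s H_P and return min_s (E_1(s) - E_0(s))."""
    HB, HP = hamiltonians(np.asarray(Q, dtype=float))
    gaps = []
    for s in np.linspace(0.0, 1.0, steps):
        ev = np.linalg.eigvalsh((1.0 - s) * HB + s * HP)
        gaps.append(ev[1] - ev[0])
    return min(gaps)

# Example on a random 6-variable instance (64 x 64 matrices).
rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 6))
print(minimum_gap((Q + Q.T) / 2.0))
```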


Figure 5: Left: the quantity $\left| \frac{d^2 E_0}{ds^2} s^2 (1 - s)^2 \right|$ determined by quantum Monte Carlo simulation as well as exact diagonalization for a training problem with 20 weight variables. For small problem instances with fewer than 30 variables we can determine this quantity via exact diagonalization of the Hamiltonian $\tilde{H}(s)$. As one can see, the results obtained by diagonalization coincide very well with the ones determined by QMC. The training objective is given by (4) using the synthetic data set with an overlap coefficient of 0.95. Right: a plot of the peaks of the mean of $\left| \frac{d^2 E_0}{ds^2} s^2 (1 - s)^2 \right|$ against the number of qubits for values in [10, 100]. The error bars indicate standard deviations. Each point of the plot represents 20 QMC runs. The data is well fitted by a linear function. From the fact that the scaling is at most polynomial in the problem size we can infer that the minimum gap, and hence the run time of AQO, scales polynomially as well.

Fig. 5 shows our scaling analysis for the synthetic data set. The result is encouraging, as the maxima of $\left| \frac{d^2 E_0}{ds^2} s^2 (1 - s)^2 \right|$ seem to scale only linearly with the problem size. This implies that the runtime of AQO on this data set is likely to scale at most polynomially. It is not possible to judge how typical this data set and scaling behavior are. We know from related experiments with known optimal solutions that tabu search often fails to obtain the optimal solution for training problems of sizes as small as 64 variables. Obviously, scaling should depend on the input data. In fact, it is possible to take a hard problem known to defeat AQO (Amin and Choi, 2009) and encode it as a training problem, which would cause the scaling to become exponential. Nevertheless, a minimum gap between ground and first excited state energies that shrinks exponentially with increasing problem size only implies that the system is unlikely to be found in its ground state at the end of an adiabatic evolution completed in polynomial time; it says nothing about the quality of the resulting approximate solutions. Due to quantum tunneling, approximate solutions found by AQO can still be significantly better than those found by classical solvers.


Further, results from approximability theory (Zuckerman, 2006) imply that obtaining such approximate solutions may still be NP-hard, i.e. finding a low-lying excited state of $H_P$ may still be difficult for classical algorithms but not for AQO. Finally, alternative versions of AQO, e.g. Farhi et al. (2009), may overcome the known failure cases.

8. Conclusion

We presented AQO as a discrete optimization method for training binary classifiers. The proposed algorithm, QBoost, enables handling training tasks of large sizes as they occur in production systems today. QBoost offers gains over standard AdaBoost in three aspects: (i) lower generalization error; (ii) faster classification, since it employs fewer weak classifiers; (iii) smaller computational effort for training, since it requires fewer boosting iterations. In all experiments we find that the classifier constructed with discrete optimization is significantly more compact; the gain in accuracy, however, is more varied. The good performance of a large scale binary classifier trained using iterated discrete optimization in a form amenable to AQO, but solved with classical heuristics, shows that it is possible to map the training of a classifier to AQO with negative translation costs. Any improvements by AQO to the solution of the training problem will directly increase the advantage of the algorithm proposed here over conventional approaches. Access to emerging hardware that realizes AQO is needed to establish the size of the gain over classical solvers. We already completed the first such experiment with 52 operational qubits, but processors with larger qubit numbers will be needed in order to see a more pronounced effect.

Further, we conclude that bit-constrained learning machines can be expected to exhibit better generalization than what is obtained when the weight variables are represented with higher precision. To the best of our knowledge this has not been studied before. Bit-constraining can be regarded as intrinsic regularization that contributes to keeping the model complexity low. Our finding that the bit depth needed for realizing the optimal classifier grows only logarithmically with the ratio of training examples to features supplies insight into why few-bit learning machines work. The competitive performance of bit-constrained classifiers suggests that training benefits from being treated as an integer program. This has a twofold implication. First, it is good news for hardware-constrained implementations such as cell phones, sensor networks, or early quantum chips with small numbers of qubits. Second, it renders the training problem manifestly NP-hard, thus further motivating the application of quantum algorithms that may generate better approximate solutions than those available classically.

We close with the remark that the study of bit-constrained learning may have implications for understanding plasticity in the nervous system, where it is still unknown how a synapse can store information reliably over a long period of time (Kandel et al., 2000).

Acknowledgments

We would like to thank Hartwig Adam, Xiaowei Li, and Jiayong Zhang for help with data preparation; Alessandro Bissacco, Ulrich Buddemeier, and Jiayong Zhang for assistance with MapReduce and boosting code; Edward Farhi and David Gosset for support with QMC simulations; Hartwig Adam, Boris Babenko, Ulrich Buddemeier, Vint Cerf, Corinna Cortes, Edward Farhi, Yoram Singer, and Jiayong Zhang for useful comments.


References

D. Aharonov, W. van Dam, J. Kempe, Z. Landau, S. Lloyd, and O. Regev. Adiabatic quantum computation is equivalent to standard quantum computation. In IEEE Annual Symposium on Foundations of Computer Science, pages 42–51, 2004.

M. H. S. Amin and V. Choi. First order quantum phase transition in adiabatic quantum computation, 2009. arXiv:0904.1387.

S. Brush. History of the Lenz-Ising model. Reviews of Modern Physics, 39:883–893, 1967.

V. S. Denchev, N. Ding, S. V. N. Vishwanathan, and H. Neven. Robust classification with adiabatic quantum optimization. 2012. arXiv:1205.1148v2.

N. G. Dickson and M. H. S. Amin. Algorithmic approach to adiabatic quantum optimization, 2011. arXiv:1108.3303v1.

J. Duchi and Y. Singer. Boosting with structural sparsity. In Proceedings of the 26th International Conference on Machine Learning, 2009.

E. Farhi, J. Goldstone, S. Gutmann, and M. Sipser. Quantum computation by adiabatic evolution. 2000. arXiv:quant-ph/0001106v1.

E. Farhi, J. Goldstone, S. Gutmann, J. Lapan, A. Lundgren, and D. Preda. A quantum adiabatic evolution algorithm applied to random instances of an NP-complete problem. Science, 292:472–476, 2001.

E. Farhi, J. Goldstone, D. Gosset, S. Gutmann, H. B. Meyer, and P. Shor. Quantum adiabatic algorithms, small gaps, and different paths. 2009. arXiv:0909.4766v1.

V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of monomials by halfspaces is hard. 2010. arXiv:1012.0729v1.

Y. Freund. A more robust boosting algorithm. 2009. arXiv:0905.2138v1.

Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of online learning and an application to boosting. AT&T Bell Labs Technical Report, 1995.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.

M. W. Johnson, M. H. S. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A. J. Berkley, J. Johansson, P. Bunyk, E. M. Chapple, C. Enderud, J. P. Hilton, K. Karimi, E. Ladizinsky, N. Ladizinsky, T. Oh, I. Perminov, C. Rich, M. C. Thom, E. Tolkacheva, C. J. S. Truncik, S. Uchaikin, J. Wang, B. Wilson, and G. Rose. Quantum annealing with manufactured spins. Nature, 473(7346):194–198, May 2011.


E. R. Kandel, J. H. Schwartz, and T. M. Jessell. Principles of Neural Science. McGraw-Hill, 2000.

P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287–304, 2010.

A. Messiah. Quantum Mechanics: Two Volumes Bound as One. Dover, Mineola, NY, 1999. Translation of: Mécanique quantique, t. 1, Paris, Dunod, 1959.

P. Orlik and H. Terao. Arrangements of Hyperplanes. Fundamental Principles of Mathematical Sciences. Springer-Verlag, Berlin, Germany, 1992. ISBN 3-540-55259-6.

G. Palubeckis. Multistart tabu search strategies for the unconstrained binary quadratic optimization problem. Annals of Operations Research, 131:259–282, 2004.

G. Rose. The D-Wave One quantum computer: Technology and applications. Presentation at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2011. http://dwave.files.wordpress.com/2011/11/20111117_sc11_rose.pdf.

G. E. Santoro, R. Martonak, E. Tosatti, and R. Car. Theory of quantum annealing of an Ising spin glass. Science, 295(5564):2427–2430, 2002.

N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, 1972.

R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In COLT, pages 80–91, 1998.

V. Vapnik. Statistical Learning Theory. Wiley, 1998.

V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

A. J. Wyner. Boosting and the exponential loss. In Proceedings of the Ninth Annual Conference on AI and Statistics, 2002.

A. P. Young, S. Knysh, and V. N. Smelyanskiy. Size dependence of the minimum excitation gap in the quantum adiabatic algorithm. Physical Review Letters, 101:170503, 2008.

A. P. Young, S. Knysh, and V. N. Smelyanskiy. First order phase transition in the quantum adiabatic algorithm. Physical Review Letters, 104:020502, 2010.

T. Zhang. Forward-backward greedy algorithm for learning sparse representations. Rutgers Statistics Department Technical Report, 2008.

D. Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. In STOC, pages 681–690, 2006.
