Limited-memory Common-directions Method for Distributed L1-regularized Linear Classification

Wei-Lin Chiang∗ Yu-Sheng Li† Ching-pei Lee‡ Chih-Jen Lin§

Abstract

For distributed linear classification, L1 regularization is useful because of a smaller model size. However, with the non-differentiability, it is more difficult to develop efficient optimization algorithms. In the past decade, OWLQN has emerged as the major method for distributed training of L1 problems. In this work, we point out issues in OWLQN's search directions. Then we extend the recently developed limited-memory common-directions method for L2-regularized problems to L1 scenarios. Through a unified interpretation of batch methods for L1 problems, we explain why OWLQN has been a popular method and why our method is superior in distributed environments. Experiments confirm that the proposed method is faster than OWLQN in most situations.

1 Introduction

Given training data (y_i, x_i), i = 1, . . . , l with label y_i = ±1 and feature vector x_i ∈ R^n, we consider the L1-regularized linear classification problem

(1.1)    min_w f(w) ≡ ||w||_1 + C Σ_{i=1}^l ξ(y_i w^T x_i),

where ||w||_1 = Σ_{j=1}^n |w_j| and ξ is a differentiable and convex loss function. Here we consider the logistic loss ξ(z) = log(1 + e^{−z}). For large applications, (1.1) is often considered because of the model sparsity. Many optimization techniques have been proposed to solve (1.1); see, for example, the comparison in [13]. Generally (1.1) is more difficult than L2-regularized problems because of the non-differentiability.

For large-scale data, distributed training is needed. Although for some applications an online method like [9] is effective, for many other applications batch methods are used to more accurately solve the optimization problem. Currently, OWLQN [1], an extension of a limited-memory quasi-Newton method (LBFGS) [7], is the most commonly used distributed method for L1-regularized classification. For example, it is the main linear classifier in Spark MLlib [10], a popular machine learning tool on Spark. OWLQN's popularity comes from several attributes. First, in a single-machine setting, the comparison [13] shows that OWLQN is competitive among solvers for L1-regularized logistic regression. Second, while coordinate descent (CD) or its variants [14] are state-of-the-art, they are inherently sequential and are difficult to parallelize in distributed environments. Even if they have been modified for distributed training (e.g., [8]), usually a requirement is that data points are stored in a feature-wise manner. In contrast, as we will see in the discussion in this paper, OWLQN is easier to parallelize and allows data to be stored either in an instance-wise or a feature-wise way.

The motivation of this work is to study why OWLQN has been successful and whether we can develop a better distributed training method. We begin with showing in Section 2 how OWLQN is extended from the method LBFGS [7]. From that we point out some issues of OWLQN's direction finding at each iteration. Then in Section 3 we extend a recently developed limited-memory common-directions method [12, 6] for L2-regularized problems to the L1 setting. In Section 4, through a unified interpretation of methods for L1 problems, we explain why our proposed method is more principled than OWLQN. Although from Section 3 our method is slightly more expensive per iteration, the explanation in Section 4 and the past results on L2-regularized problems indicate the potential of fewer iterations. For distributed implementations, in Section 5 we show that OWLQN and our method have similar communication costs per iteration.
Therefore, our method can be very useful for distributed training because the communication occupies a significant portion of the total running time after parallelizing the computation, and our method needs fewer total iterations. The result is confirmed through detailed experiments in Section 6.

∗ National Taiwan University. [email protected]. Part of this work was done when the first, the second, and the last authors visited Alibaba, Inc. This work is also partially supported by MOST of Taiwan grant 104-2221-E-002-047-MY3.
† National Taiwan University. [email protected]
‡ University of Wisconsin-Madison. [email protected]
§ National Taiwan University. [email protected]

The proposed method has been implemented in MPI-LIBLINEAR (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/). Supplementary materials and programs used in this paper are available at http://www.csie.ntu.edu.tw/~cjlin/papers/l-commdir-l1/.

2 OWLQN: Orthant-Wise Limited-memory Quasi-Newton Method

We introduce OWLQN and discuss its issues.

2.1 Limited-memory BFGS (LBFGS) Method. OWLQN is an extension of the method LBFGS [7] for the following L2-regularized problem.

(2.2)    f(w) ≡ w^T w / 2 + L(w),

where

(2.3)    L(w) ≡ C Σ_{i=1}^l ξ(y_i w^T x_i).

To minimize f(w), Newton methods are commonly used. At the current w_k, a direction is obtained by

(2.4)    d = −∇²f(w_k)^{−1} ∇f(w_k).

Because calculating ∇²f(w_k) and its inverse may be expensive, quasi-Newton techniques have been proposed to obtain an approximate direction

d = −B_k ∇f(w_k), where B_k ≈ ∇²f(w_k)^{−1}.

BFGS [11] is a representative quasi-Newton technique that uses information from all past iterations and the following update formula

B_k = V_{k−1}^T B_{k−1} V_{k−1} + ρ_{k−1} s_{k−1} s_{k−1}^T,

where

V_{k−1} ≡ I − ρ_{k−1} u_{k−1} s_{k−1}^T,  ρ_{k−1} ≡ 1/(u_{k−1}^T s_{k−1}),
s_{k−1} ≡ w_k − w_{k−1},  u_{k−1} ≡ ∇f(w_k) − ∇f(w_{k−1}),

and I is the identity matrix.

To reduce the cost, LBFGS [7] proposes using information from the previous m iterations. From the derivation in [7], the direction d can be efficiently obtained by 2m inner products using columns in the following matrix

(2.5)    P = [s_{k−m}, u_{k−m}, . . . , s_{k−1}, u_{k−1}] ∈ R^{n×2m}.

After obtaining d, a line search ensures the sufficient decrease of the function value (details not shown). Algorithm I in the supplementary materials summarizes the procedure of LBFGS.
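For readers who want to see this limited-memory computation concretely, below is a minimal NumPy sketch of the standard L-BFGS two-loop recursion [11, 7]. It is the textbook form rather than the exact Algorithm II of the supplementary materials, and the initial scaling B_0 = γI with γ = s^T u / u^T u is a common heuristic we assume here; the function name is ours.

```python
import numpy as np

def lbfgs_direction(grad, s_list, u_list):
    """Return d = -B_k * grad via the two-loop recursion, where B_k
    approximates the inverse Hessian from the last m pairs (s_i, u_i)."""
    if not s_list:                 # no history yet: fall back to steepest descent
        return -grad
    q = grad.copy()
    rhos = [1.0 / u.dot(s) for s, u in zip(s_list, u_list)]
    alphas = []
    # First loop: newest pair to oldest
    for s, u, rho in reversed(list(zip(s_list, u_list, rhos))):
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * u
    # Initial scaling B_0 = gamma * I (a common heuristic, assumed here)
    gamma = s_list[-1].dot(u_list[-1]) / u_list[-1].dot(u_list[-1])
    r = gamma * q
    # Second loop: oldest pair to newest
    for (s, u, rho), a in zip(zip(s_list, u_list, rhos), reversed(alphas)):
        b = rho * u.dot(r)
        r += (a - b) * s
    return -r
```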

2.2 Modification from LBFGS to OWLQN. OWLQN extends LBFGS by noticing that instead of the optimality condition

∇f(w) = 0

for smooth optimization, for L1 problems in (1.1), w is a global optimum if and only if the projected gradient (PG) is zero:

∇^P f(w) = 0,

where for j = 1, . . . , n,

(2.6)    ∇_j^P f(w) ≡
             ∇_j L(w) + 1   if w_j > 0, or w_j = 0 and ∇_j L(w) + 1 < 0,
             ∇_j L(w) − 1   if w_j < 0, or w_j = 0 and ∇_j L(w) − 1 > 0,
             0              otherwise.

The concept of projected gradient is from bound-constrained optimization. If we let

w = w^+ − w^−,

an equivalent bound-constrained problem of (1.1) is

    min_{w^+, w^−}  Σ_j w_j^+ + Σ_j w_j^− + C Σ_{i=1}^l ξ(y_i (w^+ − w^−)^T x_i)
    subject to      w_j^+ ≥ 0, w_j^− ≥ 0, ∀j.
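To make (2.6) concrete, here is a small NumPy sketch that evaluates the projected gradient from w and ∇L(w); the function name and the dense-vector representation are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def projected_gradient(w, grad_L):
    """Projected gradient of f(w) = ||w||_1 + L(w), following (2.6)."""
    pg = np.zeros_like(w)
    # Keep grad_L plus the sign of the L1 term on the open orthants;
    # at w_j = 0, move only if doing so can decrease the objective.
    pos = (w > 0) | ((w == 0) & (grad_L + 1 < 0))
    neg = (w < 0) | ((w == 0) & (grad_L - 1 > 0))
    pg[pos] = grad_L[pos] + 1
    pg[neg] = grad_L[neg] - 1
    return pg
```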

Roughly speaking, the projected gradient indicates whether we can update w_j by a step or not. For example, if

(2.7)    (w_k)_j = 0 and ∇_j L(w_k) + 1 < 0,

then

(w_k)_j − α(∇_j L(w_k) + 1) > 0, ∀α > 0

remains in the orthant of w_j ≥ 0 (or w_j^+ ≥ 0, w_j^− = 0). On this face f is differentiable with respect to w_j and

(2.8)    ∇_j f(w_k) = ∇_j L(w_k) + 1

exists. Therefore, if we update w_j along the direction of −∇_j f(w_k) with a small enough step size, the objective value will decrease. Hence if (2.7) holds, the projected gradient is defined to be the value in (2.8).

From the above explanation, the projected gradient roughly splits all variables into two sets: an active one containing elements that might still be modified, and a non-active one including elements that should remain the same. By defining the following active set

(2.9)    A ≡ {j | ∇_j^P f(w_k) ≠ 0},

OWLQN simulates LBFGS on the face characterized by the set A, and proposes the following modifications.

1. Because ∇f(w_k) does not exist, the direction d is obtained by using the projected gradient

   (2.10)    d = −B_k ∇^P f(w_k).

   They apply the same procedure of O(m) vector operations (Algorithm II in the supplementary materials) to get d, but u_{k−1} is replaced by

   u_{k−1} ≡ ∇L(w_k) − ∇L(w_{k−1}).

   Namely, now B_k is an approximation of ∇²L(w_k)^{−1} but not ∇²f(w_k)^{−1}.
2. The search direction is aligned with −∇^P f(w_k):

   (2.11)    d_j ← 0 if −d_j ∇_j^P f(w_k) ≤ 0.

3. In line search, they ensure that the new point stays in the same orthant as the original one:

   (2.12)    w_j ←
                 max(0, w_j + α d_j)   if w_j > 0, or w_j = 0 and ∇_j L(w) + 1 < 0,
                 min(0, w_j + α d_j)   if w_j < 0, or w_j = 0 and ∇_j L(w) − 1 > 0,
                 0                     otherwise.
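The following NumPy sketch mirrors modifications 2 and 3 only, i.e., the alignment (2.11) and the orthant-constrained update (2.12); the function names are hypothetical and OWLQN's actual direction computation (modification 1) is not shown.

```python
import numpy as np

def align_direction(d, pg):
    """(2.11): zero out components not aligned with the negative projected gradient."""
    d = d.copy()
    d[-d * pg <= 0] = 0.0
    return d

def orthant_projected_step(w, d, alpha, grad_L):
    """(2.12): take a step of size alpha but stay in the orthant chosen at w."""
    w_new = np.zeros_like(w)
    pos = (w > 0) | ((w == 0) & (grad_L + 1 < 0))
    neg = (w < 0) | ((w == 0) & (grad_L - 1 > 0))
    w_new[pos] = np.maximum(0.0, w[pos] + alpha * d[pos])
    w_new[neg] = np.minimum(0.0, w[neg] + alpha * d[neg])
    return w_new
```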

2.3 Complexity. In one iteration, the following operations are conducted.
1. From (2.13), the evaluation of ∇L(w_k) takes O(#nnz), where #nnz is the number of non-zeros in the training data.

   (2.13)    ∇L(w_k) = C Σ_{i=1}^l (σ(y_i w_k^T x_i) − 1) y_i x_i,  with σ(z) ≡ 1/(1 + e^{−z}).

2. At each step of line search, calculating w_k^T x_i, ∀i needs O(#nnz).
3. For obtaining the direction in (2.10), the 2m inner products take O(mn).
Therefore, the cost per OWLQN iteration is

(2.14)    (1 + #line-search steps) × O(#nnz) + O(mn).

For L2-regularized problems, a trick can be applied to significantly reduce the line-search cost. However, this trick is not applicable for L1-regularized problems because of the max and min operations in (2.12); see more details in Section IV of the supplementary materials. Therefore, ensuring a small number of line-search steps is very essential.
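For concreteness, a dense NumPy sketch of the O(#nnz) operations in items 1-2: one pass over the data gives Xw, from which both the objective of (1.1) and ∇L(w) in (2.13) follow. In practice X is sparse and the products below are sparse matrix-vector products; the helper name is ours.

```python
import numpy as np

def f_and_grad_L(w, X, y, C):
    """Objective (1.1) with logistic loss, and the gradient of its smooth part (2.13).
    X: l-by-n data matrix, y: labels in {-1, +1}."""
    z = y * (X @ w)                            # y_i * w^T x_i, the O(#nnz) part
    loss = np.logaddexp(0.0, -z).sum()         # sum_i log(1 + exp(-z_i)), stable form
    f = np.abs(w).sum() + C * loss
    sigma = 1.0 / (1.0 + np.exp(-z))           # sigma(z_i)
    grad_L = C * (X.T @ ((sigma - 1.0) * y))   # another O(#nnz) product
    return f, grad_L
```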

2.4 Issues of OWLQN. Although efficient in practice, OWLQN still possesses several issues in different aspects, as we describe below. First, it is known that this method lacks a convergence guarantee, though a slightly modified algorithm with asymptotic convergence is proposed recently [5]. Second, under an active set A, we would like to get a good direction by minimizing the following second-order approximation.

(2.15)    min_{d_A} (1/2) d_A^T ∇²_{AA} L(w_k) d_A + ∇_A^P f(w_k)^T d_A,

where L(w_k) is defined in (2.3). Thus a quasi-Newton setting should approximate

∇²_{AA} L(w_k)^{−1} rather than (∇²L(w_k)^{−1})_{AA},

but the latter is closer to what OWLQN uses. We see that the mapping to A by an alignment with −∇^P f(w_k) in (2.11) is conducted after the direction finding in (2.10). Indeed we observe that using a larger m in OWLQN does not help improve the convergence in terms of the number of iterations, while it does for LBFGS on smooth problems, indicating that direction finding in OWLQN is not ideal.

3 Our Limited-memory Common-directions Algorithm for L1-regularized Classification

We extend the limited-memory common-directions method for L2-regularized problems [12, 6] to L1-regularized classification.

3.1 Past Developments for L2-regularized Problems. To solve (2.2), the method [12] considers an orthonormal basis P_k of all past gradients (called common directions) and obtains the direction d = P_k t by solving the following second-order sub-problem.

(3.16)    min_t ∇f(w_k)^T P_k t + (1/2) t^T P_k^T ∇²f(w_k) P_k t,

or equivalently, the following linear system:

(3.17)    P_k^T ∇²f(w_k) P_k t = −P_k^T ∇f(w_k).

Therefore, in contrast to (2.4) for obtaining a Newton direction, here the direction is restricted to be a linear combination of P_k's columns. For (2.2), we can derive

(3.18)    ∇²f(w_k) = I + C X^T D_{w_k} X,

where D_{w_k} with (D_{w_k})_{ii} = ξ''(y_i w_k^T x_i) is a diagonal matrix and X = [x_1, . . . , x_l]^T ∈ R^{l×n} includes all training instances. Then we get

(3.19)    P_k^T ∇²f(w_k) P_k = P_k^T P_k + C (X P_k)^T D_{w_k} (X P_k).

Note that P_k^T P_k = I from the orthonormality of P_k. In [12], they devise a technique to cheaply maintain X P_k.

Then (3.17) is a linear system of k variables and can be easily solved if the number of iterations is not large.

To further save the cost, following the idea of LBFGS in Section 2.1, [6] proposes using only information from the past m iterations. They checked several choices of common directions, but for simplicity we consider the most effective one of using the past m gradients, the past m steps, and the current gradient:

(3.20)    P_k = [∇f(w_{k−m}), s_{k−m}, . . . , ∇f(w_{k−1}), s_{k−1}, ∇f(w_k)] ∈ R^{n×(2m+1)}.

Because P_k no longer contains orthonormal columns, P_k^T P_k ≠ I. Similar to [2], [6] developed an O(mn) technique to easily update P_{k−1}^T P_{k−1} to P_k^T P_k. The cost is about the same as the 2m vector operations in LBFGS for obtaining the update direction. (See Algorithm II in the supplementary materials.) After solving (3.16) and finding a direction P_k t, they conduct a line search procedure to find a suitable step size. Experiments in [6] show their method is superior to LBFGS.

To maintain X P_k, we need two new vectors

X s_{k−1} and X ∇f(w_k),

each of which expensively costs O(#nnz). If we maintain X w_{k−1} from the function evaluation of the previous iteration, then

X s_{k−1} = X w_k − X w_{k−1}

can be done more cheaply in O(l). Thus X ∇f(w_k) is the only expensive operation for maintaining X P_k. Therefore, in comparison with LBFGS, the extra costs at each iteration include
• O(#nnz) for X ∇f(w_k) in the matrix X P_k,
• O(lm²) for (X P_k)^T D_{w_k} (X P_k),
• O(mn) for the right-hand side −P_k^T ∇f(w_k) of the linear system (3.17),
• O(m³) for solving (3.17), and
• O(mn) for computing d = P_k t after obtaining t.
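These steps can be sketched as follows for the L2-regularized case (our own dense NumPy illustration; it recomputes P^T P and takes a maintained XP as input instead of performing the incremental updates described in the text).

```python
import numpy as np

def common_directions_step(grad, P, XP, D, C):
    """Solve (3.17) restricted to span(P) and return the direction d = P t.
    grad: gradient of f at w_k, P: n-by-(2m+1) common directions,
    XP: the maintained product X @ P, D: diagonal of D_{w_k} (length l)."""
    # (3.19): P^T * Hessian * P = P^T P + C (XP)^T D_{w_k} (XP)
    H = P.T @ P + C * (XP.T @ (D[:, None] * XP))   # O(lm^2) dense products
    rhs = -(P.T @ grad)                            # O(mn)
    t = np.linalg.solve(H, rhs)                    # O(m^3) small system
    return P @ t                                   # O(mn)
```

In the actual method, P^T P and XP are updated incrementally across iterations rather than recomputed as in this sketch.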
3.2 Modification for L1-regularized Problems. We propose the following simple modifications.
1. In the sub-problem (3.16), ∇f(w_k) or ∇²f(w_k) may not exist, so instead we use ∇^P f(w_k) and ∇²L(w_k), respectively. Thus (3.16) becomes

   (3.21)    min_t ∇^P f(w_k)^T P_k t + (1/2) t^T P_k^T ∇²L(w_k) P_k t.

2. In the matrix P_k of common directions, we use the past projected gradients

   ∇^P f(w_{k−m}), . . . , ∇^P f(w_k)

   rather than gradients.
3. Similar to the setting in OWLQN, the search direction is aligned with −∇^P f(w_k); see (2.11). Furthermore, during line search, the new point stays in the same orthant as the original one; see (2.12).

In Section 4 we will argue that the resulting Algorithm III (in supplementary materials) is more reasonable than OWLQN.

3.3 Complexity. In Sections 3.1-3.2, we have discussed operations related to using the common directions. Besides them, function/gradient evaluations are needed. Therefore, the cost per iteration is

(3.22)    (2 + #line-search steps) × O(#nnz) + O(lm²) + O(mn) + O(m³).

With m ≪ l and m ≪ n, among the last three terms, O(lm²) is the dominant one. In some situations, it may even be the most expensive term in (3.22). For example, if line search terminates in one or two steps and the data set is highly sparse such that

#nnz ≈ lm, then lm² ≈ m × #nnz

becomes the bottleneck. Fortunately, the O(lm²) term comes from dense matrix-matrix products for constructing the linear system in (3.17). They can be efficiently conducted by using optimized BLAS. Therefore, in general the first term in (3.22) for function/gradient evaluations and X ∇^P f(w_k) is still the main bottleneck.

A comparison between (3.22) and OWLQN's complexity in (2.14) shows that at every iteration, we need one more O(#nnz) operation for calculating X ∇^P f(w_k). Therefore, if line search takes only one step per iteration (i.e., #line-search steps = 1; see more details in Section 3.4), then we have the following difference between OWLQN and our proposed method:

(3.23)    2 × O(#nnz) versus 3 × O(#nnz).

This seems to indicate that each iteration of our method is 50% more expensive than that of OWLQN. However, the three O(#nnz) operations include

one X^T × (a vector) and two X × (a vector).

In Section 5 we show that in a distributed environment, the communication cost of X^T × (a vector) is much higher. Thus our method and OWLQN have similar communication cost per iteration. Then the higher computational cost may pay off if the number of iterations becomes smaller.

3.4 Reducing the Number of Line-Search Steps. We have shown in Section 3.3 that each line-search step needs O(#nnz) cost for calculating the function value. Because a search starting from α = 1 may result in many steps, we propose using the α of the previous iteration as the initial step size. However, to avoid the slow convergence of taking too small step sizes, once in several iterations we double the α of the previous iteration as the initial step size.
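A minimal sketch of the Section 3.4 strategy, assuming a plain backtracking search with a simplified acceptance test; the doubling period, the backtracking ratio, and the function names are illustrative choices, not values fixed by the paper.

```python
def search_step_size(f, step, w, d, alpha_prev, iteration,
                     double_every=5, beta=0.5, max_steps=30):
    """Backtracking line search warm-started from the previous step size and
    doubled once every `double_every` iterations (the Section 3.4 strategy).
    f(w) returns the objective; step(w, d, alpha) applies the orthant-
    constrained update (2.12)."""
    alpha = 2.0 * alpha_prev if iteration % double_every == 0 else alpha_prev
    f_old = f(w)
    for _ in range(max_steps):
        w_new = step(w, d, alpha)
        if f(w_new) < f_old:      # simplified acceptance test; the paper's
            return alpha, w_new   # line search uses a sufficient-decrease rule
        alpha *= beta             # backtrack
    return alpha_prev, w          # fall back if no decrease is found
```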

Figure 1: An illustration of Newton methods for unconstrained and bound-constrained optimization. (Axes: distance to optimum versus time; the bound-constrained curve shows a longer slow-convergence phase because of face selection before entering the fast-convergence phase.)

Figure 2: An illustration of optimization methods for L1 problems (curves: projected gradient, Newton, and a better algorithm in between; axes: distance to optimum versus time). The two horizontal lines indicate a suitable range to stop for applications.

Algorithm 1 An optimization framework for L1-regularized problems.
1: while true do
2:    Compute the projected gradient ∇^P f(w) by (2.6)
3:    Find the set A in (2.9)
4:    Get a direction d with d_j = 0, ∀j ∉ A
5:    Align the direction d with −∇^P f(w); see (2.11)
6:    Line search (and update model parameters)

4 A Unified Interpretation of Methods for L1-regularized Classification

We give a unified interpretation of methods for L1-regularized classification. It helps to explain first why our proposed method is superior to OWLQN, and second, why OWLQN has been popularly used so far.

In Section 2.2 we explain that the L1-regularized problem is equivalent to a bound-constrained problem. Most existing methods for bound-constrained optimization conduct the following two steps at each iteration.
• Identify a face to be used for getting a direction.
• Under a given face, obtain a search direction by unconstrained optimization techniques.
The idea is that if we are at the same (or a similar) face as an optimal point, unconstrained optimization techniques can be effectively applied there. In general the identified face is related to non-zero elements of the projected gradient ∇^P f(w). For an easy discussion, here we do not touch past developments on face selection; instead, following OWLQN, we consider the set A in (2.9) that consists of all non-zero elements in ∇^P f(w). We then focus on obtaining a suitable direction on A. A framework under this setting is in Algorithm 1. Two commonly used search directions are
1. the negative projected gradient direction −∇_A^P f(w), and
2. the Newton direction obtained by minimizing (2.15),

   −∇²_{AA} L(w)^{−1} ∇_A^P f(w).

Note that a convex loss implies only that ∇²L(w) is positive semi-definite. For simplicity, we assume it is positive definite and therefore invertible.

For the Newton direction, because ∇²_{AA} L(w) may be too large to be stored, in linear classification commonly the special structure in (3.18) (without the identity term) is considered, so that a conjugate gradient (CG) procedure is used to solve the linear system

(4.24)    ∇²_{AA} L(w) d_A = −∇_A^P f(w).

CG involves a sequence of matrix-vector products, each performed by

(4.25)    (X_{:,A})^T (D_w (X_{:,A} v)),

where D_w with (D_w)_{ii} = ξ''(y_i w^T x_i) is a diagonal matrix and v is a vector involved in the CG procedure.
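For illustration, the matrix-vector product (4.25), including the C factor of ∇²L, can be coded as below; the function name and the dense column indexing are our own choices.

```python
import numpy as np

def hessian_vector_product(X, D, A, v, C):
    """Product of the face-restricted Hessian with v:
    C * X_{:,A}^T (D_w (X_{:,A} v)), cf. (4.25).
    X: l-by-n data, D: diagonal of D_w (length l), A: index array of the face."""
    XA = X[:, A]                  # only the active columns are touched
    return C * (XA.T @ (D * (XA @ v)))
```

A standard CG routine (for example, one wrapping this product in a scipy.sparse.linalg.LinearOperator) can then be used to solve (4.24) approximately.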
We compare the costs of using the above two directions by assuming that line search terminates in one step. Their complexities per iteration are respectively

2 × O(#nnz) and (2 + #CG) × O(#nnz).

The above cost comes from one function and one gradient evaluation, and for the Newton method, each matrix-vector product in (4.25) requires O(#nnz). One may argue that (4.25) costs O(#nnz × |A|/l) because only |A| columns of X are used. We consider O(#nnz) because, firstly, the analysis is simpler, and secondly, if X is stored in an instance-wise manner, O(#nnz) is needed. Clearly the cost of using a Newton direction is much higher. Unfortunately, the higher cost may not pay off because of the following reasons.
1. The second-order approximation is accurate only near a solution, so a Newton direction in the early stage may not be better than the negative gradient direction.
2. The face A is still very different from that of the convergent point, so a good direction obtained on an unsuitable face is not useful.
The first issue has occurred in unconstrained optimization: a Newton method has fast final convergence, but the early convergence is slow. The second issue, specific to bound-constrained optimization, makes the decision of using an expensive direction or not even more difficult. We give an illustration of Newton methods for unconstrained and bound-constrained optimization in Figure 1. For the method of using the projected gradient, it not only inherits the slow convergence of gradient descent methods for unconstrained optimization, but also has the issue of face identification. Therefore, the two methods behave like the curves shown in Figure 2. Neither is satisfactory in practice. What we hope is something in between: the direction should be better than the negative projected gradient, but not as expensive as the Newton direction. The dashed curve in Figure 2 is such an example. It is slower than Newton at the final stage, but for machine learning applications, often the optimization procedure can stop earlier, for example, at the region between the two horizontal lines in Figure 2. Then the new method is the fastest. To have a method in between, we can either
1. improve the projected gradient direction, or
2. reduce the cost of obtaining a Newton direction.
OWLQN is an example. Its quasi-Newton direction needs O(mn) additional cost, but should be better than the projected gradient. Results of OWLQN with similar trends to those shown in Figure 2 have been confirmed empirically in [13]. Thus our analysis illustrates why OWLQN has been a popular method for L1 problems.

From the above discussion, many algorithms can be proposed. For example, to reduce the cost of obtaining the Newton direction, we can take fewer CG steps to loosely solve the linear system (4.24). For example, if #CG = m, then the cost per iteration is reduced to

(2 + m) × O(#nnz).

Next we derive the proposed limited-memory common-directions method from the viewpoint of reducing the cost of obtaining a Newton direction.
By restricting the direction to be a linear combination of some vectors (i.e., columns in P), (2.15) becomes

    min_t ∇_A^P f(w_k)^T (P t)_A + (1/2) ((P t)_A)^T ∇²_{AA} L(w_k) (P t)_A.

From (3.19), the quadratic term is

(4.26)    ((P t)_A)^T ∇²_{AA} L(w_k) (P t)_A = C t^T (X_{:,A} P_{A,:})^T D_{w_k} (X_{:,A} P_{A,:}) t.

If P has m columns, obtaining X_{:,A} P_{A,:} requires m matrix-vector products, each of which costs O(#nnz). We then solve a linear system of m variables. The setting is similar to that in Section 3. Therefore, the cost per iteration is

(2 + m) × O(#nnz) + O(nm) + O(lm²) + O(m³).

We refer to this approach as "L-Comm-Face." Unfortunately, the cost may be too high. It is as expensive as the Newton method of running m CG steps. We therefore propose an approximation of X_{:,A} P_{A,:} that eventually leads to the proposed method in Section 3. If the past m projected gradients have similar active sets to the current ∇^P f(w_k), then all m past steps have similar non-zero entries. Therefore,

(4.27)    P_{j,:} ≈ [0, . . . , 0]  ∀j ∉ A

implies that

X_{:,A} P_{A,:} ≈ X P and (4.26) ≈ t^T P^T ∇²L(w_k) P t.

Thus the method becomes that in Section 3 by solving (3.21). The complexity is in (3.22), or if #line-search steps = 1, the complexity becomes

3 × O(#nnz) + O(lm²) + O(mn) + O(m³).

We refer to the method in Section 3 as "L-Comm," which is an approximation of "L-Comm-Face" by (4.27). The cost is now close to that of OWLQN. In Section 2, we showed some issues of OWLQN in approximating the Newton direction. In contrast, our method is more principled. In Table 1, we summarize the complexity per iteration of each discussed method.

Table 1: A summary of complexity per iteration for all methods, sorted from cheapest to most expensive. We assume #line-search steps = 1. Only leading terms are presented by assuming m ≪ l and m ≪ n.

Method             | Cost per iteration
-------------------|----------------------------------
Projected gradient | 2 × O(#nnz)
OWLQN              | 2 × O(#nnz) + O(mn)
L-Comm             | 3 × O(#nnz) + O(lm²)
L-Comm-Face        | (2 + m) × O(#nnz) + O(lm²)
Newton + m CG      | (2 + m) × O(#nnz)

5 Distributed Implementation

We show that the proposed method is very useful in distributed environments. To discuss the communication cost, we make the following assumptions.
1. The data set X is split across K nodes in an instance-wise manner: J_r, r = 1, . . . , K form a partition of {1, . . . , l}, and the r-th node stores (y_i, x_i) for i ∈ J_r. In addition, X P is split and stored in the same way as X.
2. The matrix P ∈ R^{n×2m} in (2.5) for OWLQN and P ∈ R^{n×(2m+1)} in (3.20) for the common-directions method are stored in the same K nodes in a row-wise manner. Thus {1, . . . , n} are partitioned into J̄_1, . . . , J̄_K.
3. The model vector w and the projected gradient ∇^P f(w) are made available to all K nodes.

Communication cost occurs in the following places.
1. Compute

   ∇L(w) = C ⊕_{r=1}^{K} (X_{J_r,:})^T [ξ'(y_i w^T x_i)]_{i ∈ J_r},

   where ⊕ is the allreduce operation that sums up values from nodes and broadcasts the result back. The communication cost is O(n).¹
2. After obtaining d_{J̄_r} at node r, we need an allgather operation to make the whole direction d available at every node. The communication cost is O(n/K).
3. At each line-search step, to get the new function value, we need an allreduce operation:

   f(w) = ||w||_1 + C ⊕_{r=1}^{K} Σ_{i ∈ J_r} ξ(y_i w^T x_i).

   At each node, the sum of local losses is the only value to be communicated, so the cost is O(1).
4. (OWLQN only) Because each node now possesses only a sub-matrix of P, in the procedure of obtaining the search direction d, 2m+1 distributed inner products are needed. Details are left to the supplementary materials. For each distributed inner product, the communication cost is O(1).
5. (L-Comm only) To construct the linear system in (3.17), we need the following allreduce operations. The communication cost is O(m²).

   (X P)^T D_w (X P) = ⊕_{r=1}^{K} (X_{J_r,:} P)^T (D_w)_{J_r,J_r} (X_{J_r,:} P),
   P^T ∇^P f(w) = ⊕_{r=1}^{K} (P_{J̄_r,:})^T ∇^P_{J̄_r} f(w).
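To make the communication pattern concrete, here is a hypothetical mpi4py sketch of items 1 and 3; the paper's MPI-LIBLINEAR implementation is in C/C++, and the object-based allreduce below (which pickles NumPy arrays) is used only for clarity.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def distributed_grad_L(w, X_local, y_local, C):
    """Item 1: each node forms its local part of (2.13), then an allreduce of
    an n-dimensional vector (cost O(n)) sums and broadcasts the result."""
    z = y_local * (X_local @ w)
    sigma = 1.0 / (1.0 + np.exp(-z))
    local = X_local.T @ ((sigma - 1.0) * y_local)
    return C * comm.allreduce(local, op=MPI.SUM)

def distributed_f(w, X_local, y_local, C):
    """Item 3: only the scalar sum of local losses is communicated (cost O(1))."""
    z = y_local * (X_local @ w)
    local_loss = np.logaddexp(0.0, -z).sum()
    return np.abs(w).sum() + C * comm.allreduce(local_loss, op=MPI.SUM)
```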

A rough estimate of the communication cost is

OWLQN:   O(n) + O(n/K) + O(1) + (2m + 1) × O(1),
L-Comm:  O(n) + O(n/K) + O(1) + O(m²).

For large and sparse data, n ≫ m². The dominant term is O(n), so the two methods have similar communication costs. This situation is different from that of computational complexity. In (3.23), we state that our method may be 50% more expensive than OWLQN for computing X ∇^P f(w). However, this operation does not involve any communication because a local node simply calculates X_{J_r,:} ∇^P f(w) to update the local X_{J_r,:} P. Thus our method is useful in a distributed setting. Its higher cost per iteration becomes a less serious concern because the computation is parallelized and the communication may take a significant portion of the total time. Algorithm VI in the supplementary materials gives details of a distributed version of L-Comm.

6 Experiments

We consider binary classification data sets listed in Table 2.² We implement all methods under the framework of LIBLINEAR [3] (for Section 6.1) or Distributed LIBLINEAR (for Section 6.2). OpenBLAS³ is used for dense vector/matrix operations.

Table 2: Data statistics.

Data set   | #instances  | #features  | #nonzeros
-----------|-------------|------------|---------------
rcv1_test  | 677,399     | 47,226     | 49,556,258
news20     | 19,996      | 1,355,191  | 9,097,916
yahoojp    | 176,203     | 832,026    | 23,506,415
url        | 2,396,130   | 3,231,961  | 277,058,644
yahookr    | 460,554     | 3,052,939  | 156,436,656
KDD2010-b  | 19,264,097  | 29,890,096 | 566,345,888
criteo     | 45,840,617  | 1,000,000  | 1,787,773,969
kdd2012    | 149,639,105 | 54,686,452 | 1,646,030,155

We check the convergence of optimization methods by showing the number of iterations or the running time versus the relative difference to the optimal function value,

(f(w_k) − f*) / f*,

where f* is an approximate optimal objective value obtained by accurately solving the optimization problem. For the regularization parameter C, we select C_Best that achieves the best cross validation accuracy. More details and results of using other C values are in the supplementary materials.

6.1 Comparisons on Number of Iterations. From Section 4, both OWLQN and L-Comm are designed to approximate the Newton direction because conceptually better directions should lead to a smaller number of iterations. Without considering the cost per iteration, we begin with checking this result by comparing the following methods discussed in Section 4.
• OWLQN [1]: We consider m = 10.
• L-COMM-FACE: We use information from the past 10 iterations to form the common directions.
• L-COMM: The proposed method in Section 3, which according to the discussion in Section 4 is an approximation of L-COMM-FACE.
• NEWTON: We run 50 CG steps at each iteration. This setting simulates the situation where the full Newton direction on the face is obtained.

¹ The communication cost depends on the implementation of the allreduce operation. Here for easy analysis we use the length of a vector at a local node to be transferred to others as the communication cost.
² All data sets except yahoojp and yahookr are from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
³ http://www.openblas.net/

(a) rcv1_test  (b) news20  (c) yahoojp  (d) url

Figure 3: Iteration versus the difference to the optimal value. The horizontal lines indicate when OWLQN meets the stopping condition ||∇^P f(w)|| ≤ ε (min(#positive, #negative)/l) ||∇^P f(0)||, with ε = 10^{−2}, 10^{−3}, and 10^{−4}.

(a) yahookr  (b) KDD2010-b  (c) criteo  (d) kdd2012

Figure 4: Comparison of different algorithms by using 32 nodes. Upper: iterations. Lower: running time in seconds. Other settings are the same as Figure 3.

All these methods find a direction on the face decided by the projected gradient and conduct a line search to find a suitable step size.⁴ Comparison results are in Figure 3 and in the supplementary materials. We make the following observations.
• NEWTON is generally among the best, but not always (e.g., news20). A possible explanation is that these methods are not compared under a fixed face. Instead, each method goes through a different sequence of faces. Thus a method using a better direction locally may not be the best globally. Nevertheless, the overall good performance of NEWTON indicates that one should consider a better direction under any given face if possible. Further, the shape of NEWTON's curve is close to the illustration in Figure 1 for smaller C (see supplementary materials), but for larger C, it is not that close. An explanation is that for larger C, the problem is more difficult and NEWTON is still under the stage of finding a suitable face.
• The convergence of L-COMM-FACE is generally very good, indicating that its use of common directions to approximate the Newton direction is very effective.
• L-COMM is an excellent approximation of L-COMM-FACE. Its convergence is similar or slightly inferior to that of L-COMM-FACE.
• In general, OWLQN has the worst convergence speed in terms of iterations.
Overall, the results are consistent with the analysis in Section 4. If we also consider the cost per iteration, NEWTON and L-COMM-FACE are not practical. From Table 1, each of their iterations is O(m) times more expensive than that of OWLQN and L-COMM, and their savings on iterations are not enough to compensate for the higher cost per iteration. We expect L-COMM to be the best in terms of running time because it needs fewer iterations than OWLQN and requires only a comparable cost per iteration. This will be confirmed in Section 6.2.

⁴ A backtracking line search with initial step size = 1 is used.

6.2 Timing Comparison of Distributed Implementations. We consider a distributed implementation of OWLQN and L-COMM using OpenMPI [4]. We use 32 m4.2xlarge nodes on AWS with 1Gbps network bandwidth. The strategy in Section 3.4 is applied to possibly reduce the number of line-search steps. Since the focus is on comparing distributed implementations, we do not parallelize the computation at each local node.

Results in Figure 4 show that L-COMM always converges faster than OWLQN in terms of iterations, which is consistent with the results in Section 6.1. Next we present the running time, also in Figure 4. In Sections 3-4 we have indicated that L-COMM is slightly more expensive than OWLQN per iteration, so the smaller number of iterations may not lead to shorter total time. However, the analysis in Section 5 shows that both methods have similar communication cost per iteration. Because in a distributed setting the communication may take a large portion of the total running time, we observe that the figures for timing comparisons are very close to those for iterations. An exception is criteo, which has a relatively smaller number of features. Thus, for this set the communication cost is insignificant among the total cost. Overall, this experiment shows that the proposed L-COMM method is more efficient than OWLQN in a distributed environment.

7 Conclusions

In this work we present a limited-memory common-directions method for L1-regularized classification. The method costs slightly more than OWLQN per iteration, but may need much fewer iterations. Because both methods' communication is proportional to the number of iterations, our method is shown to be faster than OWLQN in distributed environments. In addition, we give a unified interpretation of methods for L1-regularized classification. From that we see our method is a more principled approach than OWLQN.

References

[1] G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In ICML, 2007.
[2] R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program., 63:129–156, 1994.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. JMLR, 9:1871–1874, 2008.
[4] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In European PVM/MPI Users' Group Meeting, pages 97–104, 2004.
[5] P. Gong and J. Ye. A modified orthant-wise limited memory quasi-Newton method with convergence analysis. In ICML, 2015.
[6] C.-P. Lee, P.-W. Wang, W. Chen, and C.-J. Lin. Limited-memory common-directions method for distributed optimization and its application on empirical risk minimization. In SDM, 2017.
[7] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1):503–528, 1989.
[8] D. Mahajan, S. S. Keerthi, and S. Sundararajan. A distributed block coordinate descent method for training l1 regularized linear classifiers. JMLR, 2017.
[9] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: a view from the trenches. In KDD, 2013.
[10] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. JMLR, 17:1235–1241, 2016.
[11] J. Nocedal and S. Wright. Numerical Optimization. Springer, second edition, 2006.
[12] P.-W. Wang, C.-P. Lee, and C.-J. Lin. The common directions method for regularized empirical loss minimization. Technical report, National Taiwan University, 2016.
[13] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. JMLR, 11:3183–3234, 2010.
[14] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. JMLR, 13, 2012.
