Proximal Quasi-Newton for Computationally Intensive ℓ1-regularized M-estimators

Kai Zhong¹, Ian E.H. Yen², Inderjit S. Dhillon², Pradeep Ravikumar²
¹ Institute for Computational Engineering & Sciences, ² Department of Computer Science, University of Texas at Austin
[email protected], {ianyen,inderjit,pradeepr}@cs.utexas.edu

Abstract

We consider the class of optimization problems arising from computationally intensive ℓ1-regularized M-estimators, where the function or gradient values are very expensive to compute. A particular instance of interest is the ℓ1-regularized MLE for learning Conditional Random Fields (CRFs), which are a popular class of statistical models for varied structured prediction problems such as sequence labeling, alignment, and classification with label taxonomy. ℓ1-regularized MLEs for CRFs are particularly expensive to optimize since computing the gradient values requires an expensive inference step. In this work, we propose the use of a carefully constructed proximal quasi-Newton algorithm for such computationally intensive M-estimation problems, where we employ an aggressive active set selection technique. In a key contribution of the paper, we show that the proximal quasi-Newton method is provably super-linearly convergent, even in the absence of strong convexity, by leveraging a restricted variant of strong convexity. In our experiments, the proposed algorithm converges considerably faster than the current state-of-the-art on the problems of sequence labeling and hierarchical classification.

1 Introduction

ℓ1-regularized M-estimators have attracted considerable interest in recent years due to their ability to fit large-scale statistical models, where the underlying model parameters are sparse. The optimization problem underlying these ℓ1-regularized M-estimators takes the form:

$$\min_{w} \; f(w) := \lambda \|w\|_1 + \ell(w), \qquad (1)$$

where ℓ(w) is a convex differentiable loss function. In this paper, we are particularly interested in the case where the function or gradient values are very expensive to compute; we refer to these functions as computationally intensive functions, or CI functions in short. A particular case of interest are ℓ1-regularized MLEs for Conditional Random Fields (CRFs), where computing the gradient requires an expensive inference step. There has been a line of recent work on computationally efficient methods for solving (1), including [2, 8, 13, 21, 23, 4]. It has now become well understood that it is key to leverage the sparsity of the optimal solution by maintaining sparse intermediate iterates [2, 5, 8]. Coordinate Descent (CD) based methods, like CDN [8], maintain the sparsity of intermediate iterates by focusing on an active set of working variables. A caveat with such methods is that, for CI functions, each coordinate update typically requires a call to the inference oracle to evaluate the partial derivative for a single coordinate. One approach, adopted in [16], is to use Blockwise Coordinate Descent, which updates a block of variables at a time while ignoring second-order effects; this, however, sacrifices the convergence guarantee. Newton-type methods have also attracted a surge of interest in recent years [5, 13], but these require computing the exact Hessian or Hessian-vector product, which is very

expensive for CI functions. This then suggests the use of quasi-Newton methods, popular instances of which include OWL-QN [23], which is adapted from ℓ2-regularized L-BFGS, as well as Projected Quasi-Newton (PQN) [4]. A key caveat with OWL-QN and PQN, however, is that they do not exploit the sparsity of the underlying solution. In this paper, we consider the class of Proximal Quasi-Newton (Prox-QN) methods, which we argue seem particularly well-suited to such CI functions, for the following three reasons. Firstly, they require gradient evaluations only once in each outer iteration. Secondly, they are second-order methods, with asymptotic superlinear convergence. Thirdly, they can employ an active-set strategy to reduce the time complexity from O(d) to O(nnz), where d is the number of parameters and nnz is the number of non-zero parameters.

While there has been some recent work on Prox-QN algorithms [2, 3], we carefully construct an implementation that is particularly suited to CI ℓ1-regularized M-estimators. We carefully maintain the sparsity of intermediate iterates, and at the same time reduce the gradient evaluation time. A key facet of our approach is our aggressive active set selection (which we also term a "shrinking strategy") to reduce the number of active variables under consideration at any iteration, and correspondingly the number of evaluations of partial derivatives in each iteration. Our strategy is particularly aggressive in that it runs over multiple epochs; in each epoch, it chooses the next working set as a subset of the current working set rather than of the whole set, while at the end of an epoch it allows other variables to come back in. As a result, in most iterations, our aggressive shrinking strategy only requires the evaluation of partial gradients in the current working set. Moreover, we adapt the L-BFGS update to the shrinking procedure such that the update can be conducted without any loss of accuracy caused by aggressive shrinking. Finally, we store our data in a feature-indexed structure to exploit data sparsity as well as iterate sparsity.

Lee et al. [26] showed global convergence and asymptotic superlinear convergence for Prox-QN methods under the assumption that the loss function is strongly convex. However, this assumption is known to fail to hold in high-dimensional sampling settings, where the Hessian is typically rank-deficient, or indeed even in low-dimensional settings where there are redundant features. In a key contribution of the paper, we provide provable guarantees of asymptotic superlinear convergence for the Prox-QN method, even without assuming strong convexity, but under a restricted variant of strong convexity, termed Constant Nullspace Strong Convexity (CNSC), which is typically satisfied by standard M-estimators.

To summarize, our contributions are twofold. (a) We present a carefully constructed proximal quasi-Newton method for computationally intensive (CI) ℓ1-regularized M-estimators, which we empirically show to outperform many state-of-the-art methods on CRF problems. (b) We provide the first proof of asymptotic superlinear convergence for Prox-QN methods without strong convexity, but under a restricted variant of strong convexity, satisfied by typical M-estimators, including the ℓ1-regularized CRF MLEs.

2 Proximal Quasi-Newton Method

A proximal quasi-Newton approach to solving M-estimators of the form (1) proceeds by iteratively constructing a quadratic approximation of the objective function (1) to find the quasi-Newton direction, and then conducting a line search procedure to obtain the next iterate.

Given a solution estimate wt at iteration t, the proximal quasi-Newton method computes a descent direction by minimizing the following regularized quadratic model,

$$d_t = \arg\min_{\Delta} \; g_t^T \Delta + \tfrac{1}{2}\Delta^T B_t \Delta + \lambda\|w_t + \Delta\|_1 \qquad (2)$$

where g_t = g(w_t) is the gradient of ℓ(w_t) and B_t is an approximation to the Hessian of ℓ(w). B_t is usually constructed by the L-BFGS algorithm. This subproblem (2) can be efficiently solved by a randomized coordinate descent algorithm, as shown in Section 2.2.

The next iterate is obtained from the backtracking line search procedure, w_{t+1} = w_t + α_t d_t, where the step size α_t is tried over {β^0, β^1, β^2, ...} until the Armijo rule is satisfied,

$$f(w_t + \alpha_t d_t) \le f(w_t) + \alpha_t \sigma \Delta_t,$$

where 0 < β < 1, 0 < σ < 1 and $\Delta_t = g_t^T d_t + \lambda(\|w_t + d_t\|_1 - \|w_t\|_1)$.
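As an illustration of the line search step, the following is a minimal Python sketch (our own function names and constants; the quadratic toy loss in the usage example is only for illustration and is not part of the paper's setup):

```python
# Backtracking line search with the Armijo rule for f(w) = lam*||w||_1 + loss(w).
import numpy as np

def backtracking_line_search(w, d, loss, grad, lam, beta=0.5, sigma=0.1, max_tries=30):
    """Return a step size alpha in {beta^0, beta^1, ...} satisfying the Armijo rule."""
    f_w = lam * np.abs(w).sum() + loss(w)
    # Delta_t = g_t^T d_t + lam*(||w_t + d_t||_1 - ||w_t||_1)
    delta = grad(w) @ d + lam * (np.abs(w + d).sum() - np.abs(w).sum())
    alpha = 1.0
    for _ in range(max_tries):
        w_new = w + alpha * d
        if lam * np.abs(w_new).sum() + loss(w_new) <= f_w + alpha * sigma * delta:
            return alpha
        alpha *= beta
    return alpha

# Toy usage with a least-squares loss (purely for illustration).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y)
w = np.zeros(5)
d = -grad(w)                      # e.g., a descent direction
alpha = backtracking_line_search(w, d, loss, grad, lam=0.1)
```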

2.1 BFGS update formula

B_t can be efficiently updated from the gradients of the previous iterations according to the BFGS update [18],

$$B_t = B_{t-1} - \frac{B_{t-1} s_{t-1} s_{t-1}^T B_{t-1}}{s_{t-1}^T B_{t-1} s_{t-1}} + \frac{y_{t-1} y_{t-1}^T}{y_{t-1}^T s_{t-1}} \qquad (3)$$

where s_t = w_{t+1} - w_t and y_t = g_{t+1} - g_t. We use the compact formula for B_t [18],

$$B_t = B_0 - Q R Q^T = B_0 - Q \hat{Q},$$

where

$$Q := [\, B_0 S_t \;\; Y_t \,], \quad R := \begin{bmatrix} S_t^T B_0 S_t & L_t \\ L_t^T & -D_t \end{bmatrix}^{-1}, \quad \hat{Q} := R Q^T,$$

$$S_t = [s_0, s_1, \ldots, s_{t-1}], \quad Y_t = [y_0, y_1, \ldots, y_{t-1}],$$

$$D_t = \mathrm{diag}[s_0^T y_0, \ldots, s_{t-1}^T y_{t-1}] \quad \text{and} \quad (L_t)_{i,j} = \begin{cases} s_{i-1}^T y_{j-1} & \text{if } i > j \\ 0 & \text{otherwise.} \end{cases}$$

In our practical implementation, we apply limited-memory BFGS (L-BFGS). It only uses the information of the most recent m gradients, so that Q and \hat{Q} have size d × 2m and 2m × d, respectively. B_0 is usually set to γ_t I for computing B_t, where γ_t = y_{t-1}^T s_{t-1} / s_{t-1}^T s_{t-1} [18]. As will be discussed in Section 2.3, Q (\hat{Q}) is updated only on the rows (columns) corresponding to the working set A. The time complexity of the L-BFGS update is O(m²|A| + m³).
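The compact representation can be summarized in a short sketch. The code below is a minimal illustration, assuming dense d × m matrices S and Y of stored differences; the paper's implementation additionally restricts the update to the working set, which is omitted here:

```python
# Minimal sketch of the compact L-BFGS form B_t = gamma_t*I - Q Qhat (our naming).
import numpy as np

def compact_lbfgs(S, Y):
    """S, Y: d x m arrays of recent s_i = w_{i+1}-w_i and y_i = g_{i+1}-g_i."""
    gamma = (Y[:, -1] @ S[:, -1]) / (S[:, -1] @ S[:, -1])   # B_0 = gamma * I
    StS = S.T @ S
    StY = S.T @ Y
    L = np.tril(StY, k=-1)                                  # (L)_{ij} = s_i^T y_j for i > j
    D = np.diag(np.diag(StY))                               # diag(s_i^T y_i)
    Q = np.hstack([gamma * S, Y])                           # d x 2m, i.e. [B_0 S, Y]
    R = np.linalg.inv(np.block([[gamma * StS, L],
                                [L.T,        -D]]))         # 2m x 2m
    Qhat = R @ Q.T                                          # 2m x d
    return gamma, Q, Qhat

def B_times(gamma, Q, Qhat, v):
    """Compute B_t v = gamma*v - Q (Qhat v) in O(m d) time."""
    return gamma * v - Q @ (Qhat @ v)

# Toy usage on random difference pairs (illustration only).
rng = np.random.default_rng(0)
S = rng.standard_normal((8, 3))
Y = S + 0.1 * rng.standard_normal((8, 3))
gamma, Q, Qhat = compact_lbfgs(S, Y)
Bv = B_times(gamma, Q, Qhat, rng.standard_normal(8))
```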

2.2 Coordinate Descent for Inner Problem

Randomized coordinate descent is carefully employed to solve the inner problem (2), as in Tang and Scheinberg [2]. In the update for coordinate j, d ← d + z* e_j, where z* is obtained by solving the one-dimensional problem,

$$z^* = \arg\min_{z} \; \tfrac{1}{2}(B_t)_{jj} z^2 + \big((g_t)_j + (B_t d)_j\big) z + \lambda \,|(w_t)_j + d_j + z|.$$

This one-dimensional problem has a closed-form solution, z* = -c + S(c - b/a, λ/a), where S is the soft-threshold function and a = (B_t)_{jj}, b = (g_t)_j + (B_t d)_j and c = (w_t)_j + d_j. For B_0 = γ_t I, the diagonal of B_t can be computed by (B_t)_{jj} = γ_t - q_j^T \hat{q}_j, where q_j is the j-th row of Q and \hat{q}_j is the j-th column of \hat{Q}. The second term in b, (B_t d)_j, can be computed by

$$(B_t d)_j = \gamma_t d_j - q_j^T \hat{Q} d = \gamma_t d_j - q_j^T \hat{d},$$

where \hat{d} := \hat{Q} d. Since \hat{d} has only 2m dimensions, it is fast to update (B_t d)_j from q_j and \hat{d}. In each inner iteration, only d_j is updated, so we have the fast update \hat{d} ← \hat{d} + \hat{q}_j z*. Since we only update the coordinates in the working set, the above algorithm has computational complexity O(m|A| × inner_iter), where inner_iter is the number of iterations used for solving the inner problem.
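The following sketch puts these pieces together for the inner solver. It assumes B_t is given in the compact form γI − QQ̂ (e.g., from the previous sketch) and ignores the working-set restriction; the function and variable names are our own:

```python
# Randomized coordinate descent for the subproblem (2) with B = gamma*I - Q Qhat.
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def solve_inner(w, g, gamma, Q, Qhat, lam, n_iters=10, rng=None):
    """Approximately minimize g^T d + 0.5*d^T B d + lam*||w + d||_1 over d."""
    rng = rng or np.random.default_rng(0)
    dim = w.shape[0]
    d = np.zeros(dim)
    dhat = np.zeros(Qhat.shape[0])                    # dhat = Qhat @ d (2m-dimensional)
    for _ in range(n_iters * dim):
        j = rng.integers(dim)
        a = gamma - Q[j] @ Qhat[:, j]                 # (B_t)_{jj}
        b = g[j] + gamma * d[j] - Q[j] @ dhat         # (g_t)_j + (B_t d)_j
        c = w[j] + d[j]
        z = -c + soft_threshold(c - b / a, lam / a)   # closed-form 1-D solution
        d[j] += z
        dhat += Qhat[:, j] * z                        # fast update of Qhat @ d
    return d

# Toy usage with empty L-BFGS memory, i.e. B_t = gamma*I.
dim = 6
w0, g0 = np.zeros(dim), np.array([1.0, -2.0, 0.3, 0.0, 0.5, -0.1])
d = solve_inner(w0, g0, gamma=1.0, Q=np.zeros((dim, 0)), Qhat=np.zeros((0, dim)), lam=0.4)
```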

2.3 Implementation

In this section, we discuss several key implementation details used in our algorithm to speed up the optimization.

Shrinking Strategy
In each iteration, we select an active or working subset A of the set of all variables: only the variables in this set are updated in the current iteration. The complementary set, also called the fixed set, has only values of zero and is not updated. The use of such a shrinking strategy reduces the overall complexity from O(d) to O(|A|). Specifically, we (a) update the gradients just on the working set, (b) update Q (\hat{Q}) just on the rows (columns) corresponding to the working set, and (c) compute the latest entries in D_t, γ_t, L_t and S_t^T S_t using just the corresponding working set rather than the whole set.

The key facet of our "shrinking strategy", however, is in aggressively shrinking the active set: at the next iteration, we set the active set to be a subset of the previous active set, so that A_t ⊂ A_{t-1}. Such an aggressive shrinking strategy, however, is not guaranteed to only weed out irrelevant variables. Accordingly, we proceed in epochs. In each epoch, we progressively shrink the active set as above, until the iterations seem to converge. At that point, we allow all the "shrunk" variables to come back and start a new epoch. Such a strategy was also called an ε-cooling strategy by Fan et al. [14], where the shrinking stopping criterion is loose at the beginning, and progressively becomes more strict each time all the variables are brought back. For the L-BFGS update, when a new epoch starts, the memory of L-BFGS is cleared to prevent any loss of accuracy. Because at the first iteration of each new epoch the entire gradient over all coordinates is evaluated, the computation time for those iterations accounts for a significant portion of the total time complexity. Fortunately, our experiments show that the number of epochs is typically between 3 and 5.

Inexact inner problem solution
Like many other proximal methods, e.g. GLMNET and QUIC, we solve the inner problem inexactly. This reduces the time complexity of the inner problem dramatically. The amount of inexactness is based on a heuristic which aims to balance the computation time of the inner problem across outer iterations. The computation time of the inner problem is determined by the number of inner iterations and the size of the working set. Thus, we let the number of inner iterations be inner_iter = min{max_inner, ⌊d/|A|⌋}, where max_inner = 10 in our experiments.

Data structure for both model sparsity and data sparsity
In our implementation we take two sparsity patterns into consideration: (a) model sparsity, which accounts for the fact that most parameters are equal to zero in the optimal solution; and (b) data sparsity, wherein most feature values of any particular instance are zero. We use a feature-indexed data structure to take advantage of both sparsity patterns (a minimal sketch of this inverted structure is given after Algorithm 1). Computations involving the data will be time-consuming if we compute over all the instances, including those that are zero. So we leverage the sparsity of the data by using vectors of pairs, whose members are an index and its value. Traditionally, each vector represents an instance and the indices in its pairs are feature indices. In our implementation, however, to take both model sparsity and data sparsity into account, we use an inverted data structure, where each vector represents one feature (feature-indexed) and the indices in its pairs are instance indices. This data structure facilitates the computation of the gradient for a particular feature, which involves only the instances related to this feature. We summarize these steps in the algorithm below; a detailed algorithm is given in Appendix 2.

Algorithm 1 Proximal Quasi-Newton Algorithm (Prox-QN)
Input: Dataset {x^(i), y^(i)}_{i=1,2,...,N}, termination criterion ε, λ and L-BFGS memory size m.
Output: w converging to w* = arg min_w f(w).
1: Initialize w ← 0, g ← ∂ℓ(w)/∂w, working set A ← {1, 2, ..., d}, and S, Y, Q, \hat{Q} ← ∅.
2: while termination criterion is not satisfied or working set does not contain all the variables do
3:   Shrink working set.
4:   if shrinking stopping criterion is satisfied then
5:     Take all the shrunken variables back into the working set and clear the memory of L-BFGS.
6:     Update the shrinking stopping criterion and continue.
7:   end if
8:   Solve inner problem (2) over the working set and obtain the new direction d.
9:   Conduct line search based on the Armijo rule and obtain the new iterate w.
10:  Update g, s, y, S, Y, Q, \hat{Q} and related matrices over the working set.
11: end while
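As referenced in Section 2.3, a minimal sketch of the feature-indexed (inverted) data structure follows. The helper names and the generic per-instance "residual" used to form a partial derivative are illustrative assumptions; the actual gradient computation in CRFs additionally involves the inference oracle:

```python
# Feature-indexed (inverted) sparse storage: one list of (instance, value) pairs
# per feature, so the partial derivative for feature j touches only the instances
# in which feature j is non-zero.
from collections import defaultdict

def invert(instances):
    """instances: list of {feature_index: value} sparse rows."""
    by_feature = defaultdict(list)
    for i, row in enumerate(instances):
        for j, v in row.items():
            by_feature[j].append((i, v))              # feature j -> (instance index, value)
    return by_feature

def partial_gradient(by_feature, residual, j):
    """Accumulate a gradient component for w_j using only the relevant instances."""
    return sum(v * residual[i] for i, v in by_feature[j])

# Toy usage: three sparse instances over five features.
data = [{0: 1.0, 3: 2.0}, {3: 1.0}, {1: 0.5, 4: 1.0}]
inv = invert(data)
residual = [0.2, -0.1, 0.4]                           # e.g., per-instance model error
gj = partial_gradient(inv, residual, j=3)             # touches only instances 0 and 1
```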

3 Convergence Analysis

In this section, we analyze the convergence behavior of the proximal quasi-Newton method in the super-linear convergence phase, where the unit step size is chosen. To simplify the analysis, in this section we assume the inner problem is solved exactly and no shrinking strategy is employed. We also provide a global convergence proof for the Prox-QN method with shrinking strategy in Appendix 1.5. In the current literature, the analysis of proximal Newton-type methods relies on the assumption of

a strongly convex objective function to prove superlinear convergence [3]; otherwise, only a sublinear rate can be proved [25]. However, our objective (1) is not strongly convex when the dimension is very large or there are redundant features. In particular, the Hessian H(w) of the smooth function ℓ(w) is not positive-definite. We thus leverage a recently introduced restricted variant of strong convexity, termed Constant Nullspace Strong Convexity (CNSC) in [1]. There the authors analyzed the behavior of proximal gradient and proximal Newton methods under such a condition. The proximal quasi-Newton procedure in this paper, however, requires a subtler analysis, but in a key contribution of the paper, we are nonetheless able to show asymptotic superlinear convergence of the Prox-QN method under this restricted variant of strong convexity.

Definition 1 (Constant Nullspace Strong Convexity (CNSC)). A composite function (1) is said to have Constant Nullspace Strong Convexity restricted to space T (CNSC-T) if there is a constant vector space T s.t. ℓ(w) depends only on proj_T(w), i.e. ℓ(w) = ℓ(proj_T(w)), and its Hessian satisfies

$$m\|v\|^2 \le v^T H(w) v \le M\|v\|^2, \quad \forall v \in \mathcal{T}, \; \forall w \in \mathbb{R}^d \qquad (4)$$

for some M ≥ m > 0, and

$$H(w) v = 0, \quad \forall v \in \mathcal{T}^{\perp}, \; \forall w \in \mathbb{R}^d, \qquad (5)$$

where proj_T(w) is the projection of w onto T and T^⊥ is the complementary space orthogonal to T.

This condition can be seen to be an algebraic condition that is satisfied by typical M-estimators considered in high-dimensional settings. In this paper, we will abuse the use of CNSC-T for symmetric matrices: we say a symmetric matrix H satisfies the CNSC-T condition if H satisfies (4) and (5). In the following theorems, we will denote the orthogonal basis of T as U ∈ R^{d×d̂}, where d̂ ≤ d is the dimensionality of the space T and U^T U = I. Then the projection onto T can be written as proj_T(w) = U U^T w.

Theorem 1 (Asymptotic Superlinear Convergence). Assume ∇²ℓ(w) and ∇ℓ(w) are Lipschitz continuous. Let B_t be the matrices generated by the BFGS update (3). Then if ℓ(w) and B_t satisfy the CNSC-T condition, the proximal quasi-Newton method has q-superlinear convergence:

$$\|z_{t+1} - z^*\| \le o\big(\|z_t - z^*\|\big),$$

where z_t = U^T w_t, z* = U^T w* and w* is an optimal solution of (1).

The proof is given in Appendix 1.4. We prove it by exploiting the CNSC-T property. First, we re-build our problem and algorithm on the reduced space Z = {z ∈ R^{d̂} | z = U^T w}, where the strong-convexity property holds. Then we prove the asymptotic superlinear convergence on Z following Theorem 3.7 in [26].
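The CNSC-T condition can be illustrated on a toy example (our own, not from the paper): a squared loss with a duplicated feature column has a rank-deficient Hessian, yet satisfies (4)-(5) with T equal to the row space of the design matrix:

```python
# Numerical illustration of CNSC-T for H = X^T X with a redundant (duplicated) column.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
X = np.hstack([A, A[:, :1]])            # 4th column duplicates the 1st -> redundancy
H = X.T @ X                             # Hessian of 0.5*||Xw - y||^2 (rank 3, not PD)

# Orthonormal basis U of T = row space of X, via the eigen/SVD decomposition of H.
U, s, _ = np.linalg.svd(H)
U = U[:, s > 1e-10]                     # columns spanning T (here, 3 of them)

restricted = U.T @ H @ U                # H restricted to T: eigenvalues in [m, M], m > 0
print(np.linalg.eigvalsh(restricted))   # all strictly positive

v_perp = np.array([1.0, 0, 0, -1.0])    # lies in T-perp (w_1 and w_4 are interchangeable)
print(np.linalg.norm(H @ v_perp))       # ~0, i.e. H(w) v = 0 for v in T-perp
```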

Theorem 2. For Lipschitz continuous ℓ(w), the sequence {w_t} produced by the proximal quasi-Newton method in the super-linear convergence phase has

$$f(w_t) - f(w^*) \le L \|z_t - z^*\|, \qquad (6)$$

where L = L_ℓ + λ√d, L_ℓ is the Lipschitz constant of ℓ(w), z_t = U^T w_t and z* = U^T w*.

The proof is also in Appendix 1.4. It is proved by showing that both the smooth part and the non-differentiable part satisfy the modified Lipschitz continuity.

4 Application to Conditional Random Fields with ℓ1 Penalty

In CRF problems, we are interested in learning a conditional distribution of labels y ∈ Y given an observation x ∈ X, where y has application-dependent structure, such as a sequence, tree, or table, in which label assignments have inter-dependency. The distribution is of the form

$$P_w(y|x) = \frac{1}{Z_w(x)} \exp\Big\{ \sum_{k=1}^{d} w_k f_k(y, x) \Big\},$$

where f_k are the feature functions, w_k are the associated weights, d is the number of feature functions and Z_w(x) is the partition function. Given a training data set {(x^(i), y^(i))}_{i=1}^N, our goal is to find the optimal weights w such that the following ℓ1-regularized negative log-likelihood is minimized:

$$\min_{w} \; f(w) = \lambda \|w\|_1 - \sum_{i=1}^{N} \log P_w(y^{(i)}|x^{(i)}) \qquad (7)$$

Since |Y|, the number of possible values y takes, can be exponentially large, the evaluation of ℓ(w) and the gradient ∇ℓ(w) needs application-dependent oracles to carry out the summation over Y. For example, in the sequence labeling problem, an oracle, the forward-backward algorithm, is usually employed to compute ∇ℓ(w). Such an oracle can be very expensive. In the Prox-QN algorithm for the sequence labeling problem, the forward-backward algorithm takes O(|Y|² N T × exp) time, where exp is the time for the expensive exponential computation, T is the sequence length and Y is the set of possible labels for a symbol in the sequence. Given the output of the oracle, the evaluation of the partial gradients over the working set A has time complexity O(D_nnz |A| T), where D_nnz is the average number of instances related to a feature. Thus when O(|Y|² N T × exp + D_nnz |A| T) > O(m³ + m²|A|), the gradient evaluation time dominates.
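For concreteness, the sketch below shows the kind of forward recursion such an inference oracle performs to compute log Z_w(x) for a linear-chain CRF in O(|Y|²T) time. The unary and pairwise score matrices and the variable names are illustrative assumptions; the backward pass and the marginals needed for ∇ℓ(w) are analogous:

```python
# Forward algorithm (in log space) for the log partition function of a chain CRF.
import numpy as np
from scipy.special import logsumexp

def log_partition(unary, pairwise):
    """unary: T x |Y| scores (e.g. Theta_y^T x_t); pairwise: |Y| x |Y| scores (e.g. Lambda)."""
    T, Y = unary.shape
    alpha = unary[0].copy()                              # log-forward messages at t = 0
    for t in range(1, T):
        # alpha_t(y') = logsumexp_y [ alpha_{t-1}(y) + pairwise[y, y'] ] + unary_t(y')
        alpha = logsumexp(alpha[:, None] + pairwise, axis=0) + unary[t]
    return logsumexp(alpha)                              # log Z_w(x), O(|Y|^2 T) time

# Toy usage: a length-5 sequence with 4 labels.
rng = np.random.default_rng(0)
logZ = log_partition(rng.standard_normal((5, 4)), rng.standard_normal((4, 4)))
```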

The following theorem shows that the ℓ1-regularized CRF MLEs satisfy the CNSC-T condition.

Theorem 3. With the ℓ1 penalty, the CRF loss function, ℓ(w) = -∑_{i=1}^{N} log P_w(y^(i)|x^(i)), satisfies the CNSC-T condition with T = N^⊥, where N = {v ∈ R^d | Φ^T v = 0} is a constant subspace of R^d and Φ ∈ R^{d×(N|Y|)} is defined as

$$\Phi_{kn} = f_k(y_l, x^{(i)}) - \mathbb{E}\big[ f_k(y, x^{(i)}) \big],$$

where n = (i-1)|Y| + l, l = 1, 2, ..., |Y| and E is the expectation over the conditional probability P_w(y|x^(i)).

According to the definition of the CNSC-T condition, the ℓ1-regularized CRF MLEs do not satisfy the classical strong-convexity condition when N has non-zero members, which happens in the following two cases: (i) the exponential representation is not minimal [27], i.e. for any instance i there exists a non-zero vector a and a constant b_i such that ⟨a, φ(y, x^(i))⟩ = b_i, where φ(y, x^(i)) = [f_1(y, x^(i)), f_2(y, x^(i)), ..., f_d(y, x^(i))]^T; (ii) d > N|Y|, i.e., the number of feature functions is very large. The first case holds in many problems, like the sequence labeling and hierarchical classification problems discussed in Section 6, and the second case holds in high-dimensional problems.

5 Related Methods

There have been several methods proposed for solving ℓ1-regularized M-estimators of the form in (7). In this section, we discuss these in relation to our method.

Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), introduced by Andrew and Gao [23], extends L-BFGS to ℓ1-regularized problems. In each iteration, OWL-QN computes a generalized gradient called the pseudo-gradient to determine the orthant and the search direction, then does a line search and projects the new iterate back onto the orthant. Due to its fast convergence, it is widely implemented in many software packages, such as CRF++, CRFsuite and Wapiti. But OWL-QN does not take advantage of the model sparsity in the optimization procedure, and moreover Yu et al. [22] have raised issues with its convergence proof.

Stochastic Gradient Descent (SGD) uses the gradient of a single sample as the search direction at each iteration. Thus, the computation for each iteration is very fast, which leads to fast convergence at the beginning. However, the convergence becomes slower than that of second-order methods when the iterate is close to the optimal solution. Recently, an ℓ1-regularized SGD algorithm proposed by Tsuruoka et al. [21] is claimed to have faster convergence than OWL-QN. It incorporates ℓ1-regularization by using a cumulative ℓ1 penalty, which is close to the ℓ1 penalty the parameter would have received had it been updated by the true gradient. Tsuruoka et al. do consider data sparsity, i.e. for each instance, only the parameters related to the current instance are updated. But they too do not take the model sparsity into account.

Coordinate Descent (CD) and Blockwise Coordinate Descent (BCD) are popular methods for ℓ1-regularized problems. Each coordinate descent iteration solves a one-dimensional quadratic approximation of the objective function, which has a closed-form solution. It requires the second partial derivative with respect to the coordinate. But as discussed by Sokolovska et al. [16], the exact second derivative in the CRF problem is intractable. So they instead use an approximation of the second derivative, which can be computed efficiently by the same inference oracle queried for the gradient evaluation. However, pure CD is very expensive because it requires calling the inference oracle for the instances related to the current coordinate in each coordinate update. BCD alleviates this problem by grouping the parameters with the same x feature into a block. Then each block update only

needs to call the inference oracle once for the instances related to the current x feature. However, it cannot alleviate the large number of inference oracle calls unless the data is so sparse that every instance appears in only very few blocks.

The proximal Newton method has proven successful on problems of ℓ1-regularized logistic regression [13] and Sparse Inverse Covariance Estimation [5], where the Hessian-vector product can be cheaply re-evaluated for each coordinate update. However, the Hessian-vector product for a CI function like the CRF loss requires a query of the inference oracle no matter how many coordinates are updated at a time [17], which makes a coordinate update on the quadratic approximation as expensive as a coordinate update in the original problem. Our proximal quasi-Newton method avoids this problem by replacing the Hessian with a low-rank matrix from the BFGS update.

6 Numerical Experiments

We compare our approach, Prox-QN, with four other methods: Proximal Gradient (Prox-GD), OWL-QN [23], SGD [21] and BCD [16]. For OWL-QN, we directly use the OWL-QN optimizer developed by Andrew et al.¹, where we set the memory size to m = 10, the same as in Prox-QN. For SGD, we implement the algorithm proposed by Tsuruoka et al. [21], and use the cumulative ℓ1 penalty with learning rate η_k = η_0/(1 + k/N), where k is the SGD iteration and N is the number of samples. For BCD, we follow Sokolovska et al. [16] but with three modifications. First, we add a line search procedure in each block update since we found it is required for convergence. Secondly, we apply the shrinking strategy as discussed in Section 2.3. Thirdly, when the second derivative for some coordinate is less than 10^{-10}, we set it to 10^{-10}, because otherwise the lack of ℓ2-regularization in our problem setting can lead to a very large new iterate.

We evaluate the performance of the Prox-QN method on two problems, sequence labeling and hierarchical classification. In particular, we plot the relative objective difference (f(w_t) - f(w*))/f(w*) and the number of non-zero parameters (on a log scale) against time in seconds. More experimental results, for example the testing accuracy and the performance for different λ's, are in Appendix 5. All the experiments are executed on a 2.8GHz Intel Xeon E5-2680 v2 Ivy Bridge processor with 1/4TB memory and Linux OS.

6.1 Sequence Labeling

In sequence labeling problems, each instance (x, y) = {(x_t, y_t)}_{t=1,2,...,T} is a sequence of T pairs of observations and the corresponding labels. Here we consider the optical character recognition (OCR) problem, which aims to recognize handwritten words. The dataset² was preprocessed by Taskar et al. [19] and was originally collected by Kassel [20]; it contains 6877 words (instances). We randomly divide the dataset into two parts: a training part with 6216 words and a testing part with 661 words. The character label set Y consists of the 26 English letters, and the observations are characters represented by images of 16 by 8 binary pixels, as shown in Figure 1(a). We use degree-2 pixel features as the raw features, which means all pixel pairs are considered. Therefore, the number of raw features is J = 128 × 127/2 + 128 + 1, including a bias. For degree-2 features, x_{tj} = 1 only when both pixels are 1 and otherwise x_{tj} = 0, where x_{tj} is the j-th raw feature of x_t. For the feature functions, we use unigram feature functions 1(y_t = y, x_{tj} = 1) and bigram feature functions 1(y_t = y, y_{t+1} = y') with their associated weights, Θ_{y,j} and Λ_{y,y'}, respectively. So w = {Θ, Λ} with Θ ∈ R^{|Y|×J} and Λ ∈ R^{|Y|×|Y|}, and the total number of parameters is d = |Y|² + |Y| × J = 215,358. Using the above feature functions, the potential function can be specified as

$$\tilde{P}_w(y, x) = \exp\Big\{ \big\langle \Theta, \textstyle\sum_{t=1}^{T} e_{y_t} x_t^T \big\rangle + \big\langle \Lambda, \textstyle\sum_{t=1}^{T-1} e_{y_t} e_{y_{t+1}}^T \big\rangle \Big\},$$

where ⟨·, ·⟩ is the sum of the element-wise product and e_y ∈ R^{|Y|} is a unit vector with 1 at the y-th entry and 0 at the other entries. The gradient and the inference oracle are given in Appendix 4.1.

In our experiment, λ is set to 100, which leads to a relatively high testing accuracy and an optimal solution with a relatively small number of non-zero parameters (see Appendix 5.2). The learning rate η_0 for SGD is tuned to 2 × 10^{-4} for best performance. In BCD, the unigram parameters are grouped into J blocks according to the x features, while the bigram parameters are grouped into one block. Our proximal quasi-Newton method can be seen to be much faster than the other methods.
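A small sketch of the degree-2 raw feature construction described above (helper names are our own):

```python
# Degree-2 raw features for a 16x8 binary character image (128 pixels):
# all pixel pairs, the single pixels, and a bias, giving J = 128*127/2 + 128 + 1 = 8257.
import numpy as np
from itertools import combinations

def degree2_features(pixels):
    """pixels: length-128 binary array for one character observation x_t."""
    pairs = [pixels[i] * pixels[j] for i, j in combinations(range(128), 2)]  # x_{tj}=1 iff both pixels are 1
    return np.concatenate([pairs, pixels, [1.0]])        # pairs + single pixels + bias

x_t = (np.random.default_rng(0).random(128) > 0.5).astype(float)
phi = degree2_features(x_t)
assert phi.shape[0] == 128 * 127 // 2 + 128 + 1          # J = 8257
```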

¹ http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/
² http://www.seas.upenn.edu/~taskar/ocr/

[Figure 1 panels: (a) Graphical model of OCR; (b) Relative objective difference vs. time (s); (c) Number of non-zero parameters vs. time (s). Curves compare BCD, OWL-QN, Prox-GD, Prox-QN and SGD at λ = 100.]

Figure 1: Sequence Labeling Problem

6.2 Hierarchical Classification

In hierarchical classification problems, we have a label taxonomy, where the classes are grouped into a tree as shown in Figure 2(a). Here y ∈ Y is one of the leaf nodes. If we have K classes in total (number of nodes) and J raw features, then the number of parameters is d = K × J. Let W ∈ R^{K×J} denote the weights. The feature function corresponding to W_{k,j} is f_{k,j}(y, x) = 1[k ∈ Path(y)] x_j, where k ∈ Path(y) means class k is an ancestor of y or y itself. The potential function is

$$\tilde{P}_W(y, x) = \exp\Big\{ \textstyle\sum_{k \in \mathrm{Path}(y)} w_k^T x \Big\},$$

where w_k is the weight vector of the k-th class, i.e. the k-th row of W. The gradient and the inference oracle are given in Appendix 4.2.

The dataset comes from Task 1 of the dry-run dataset of LSHTC1³. It has 4,463 samples, each with J = 51,033 raw features. The hierarchical tree has 2,388 classes, which include 1,139 leaf labels. Thus, the number of parameters is d = 121,866,804. The feature values are scaled by the svm-scale program in the LIBSVM package. We set λ = 1 to achieve a relatively high testing accuracy and high sparsity of the optimal solution. The SGD initial learning rate is tuned to η_0 = 10 for best performance. In BCD, parameters are grouped into J blocks according to the raw features.
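A minimal sketch of this hierarchical potential (our own names; the tree is represented by parent pointers):

```python
# Path(y) collects y and its ancestors; the score of a leaf y is sum_{k in Path(y)} w_k^T x,
# and normalizing exp(score) over the leaves gives P_W(y|x).
import numpy as np

def path(y, parent):
    """Return [y, parent(y), ...] up to the root; parent[root] == -1."""
    nodes = []
    while y != -1:
        nodes.append(y)
        y = parent[y]
    return nodes

def leaf_distribution(W, x, leaves, parent):
    scores = np.array([sum(W[k] @ x for k in path(y, parent)) for y in leaves])
    scores -= scores.max()                               # for numerical stability
    p = np.exp(scores)
    return p / p.sum()                                   # P_W(y | x) over leaf labels

# Toy tree: node 0 is the root with children 1 and 2; the leaves are 1 and 2.
parent = {0: -1, 1: 0, 2: 0}
W = np.random.default_rng(0).standard_normal((3, 4))     # K = 3 classes, J = 4 features
print(leaf_distribution(W, np.ones(4), leaves=[1, 2], parent=parent))
```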

[Figure 2 panels: (a) Label Taxonomy; (b) Relative objective difference vs. time (s); (c) Number of non-zero parameters vs. time (s). Curves compare BCD, OWL-QN, Prox-GD, Prox-QN and SGD at λ = 1.]

Figure 2: Hierarchical Classification Problem

As Figures 1(b)-(c) and 2(b)-(c) show, Prox-QN achieves much faster convergence and moreover obtains a sparse model in much less time.

Acknowledgement

This research was supported by NSF grants CCF-1320746 and CCF-1117055. P.R. acknowledges the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033. K.Z. acknowledges the support of the National Initiative for Modeling and Simulation fellowship.

³ http://lshtc.iit.demokritos.gr/node/1

References

[1] I. E. H. Yen, C.-J. Hsieh, P. Ravikumar, and I. S. Dhillon. Constant Nullspace Strong Convexity and Fast Convergence of Proximal Methods under High-Dimensional Settings. In NIPS, 2014.
[2] X. Tang and K. Scheinberg. Efficiently Using Second Order Information in Large l1 Regularization Problems. arXiv:1303.6935, 2013.
[3] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. In NIPS, 2012.
[4] M. Schmidt, E. Van Den Berg, M. P. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected Quasi-Newton algorithm. In Int. Conf. Artif. Intell. Stat., 2009.
[5] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance estimation using quadratic approximation. In NIPS, 2011.
[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, Cambridge, U.K., 2003.
[7] P.-W. Wang and C.-J. Lin. Iteration Complexity of Feasible Descent Methods for Convex Optimization. Technical report, Department of Computer Science, National Taiwan University, Taipei, Taiwan, 2013.
[8] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research (JMLR), 11:3183-3234, 2010.
[9] A. Agarwal, S. Negahban, and M. Wainwright. Fast Global Convergence Rates of Gradient Methods for High-Dimensional Statistical Recovery. In NIPS, 2010.
[10] K. Hou, Z. Zhou, A. M.-S. So, and Z.-Q. Luo. On the linear convergence of the proximal gradient method for trace norm regularization. In NIPS, 2014.
[11] L. Xiao and T. Zhang. A proximal-gradient homotopy method for the l1-regularized least-squares problem. In ICML, 2012.
[12] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Prog. B, 117, 2009.
[13] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. JMLR, 13:1999-2030, 2012.
[14] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871-1874, 2008.
[15] A. J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 1952.
[16] N. Sokolovska, T. Lavergne, O. Cappe, and F. Yvon. Efficient Learning of Sparse Conditional Random Fields for Supervised Sequence Labelling. arXiv:0909.1308, 2009.
[17] Y. Tsuboi, Y. Unno, H. Kashima, and N. Okazaki. Fast Newton-CG Method for Batch Learning of Conditional Random Fields. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[18] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, New York, NY, USA, 2nd edition, 2006.
[19] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[20] R. Kassel. A Comparison of Approaches to On-line Handwritten Character Recognition. PhD thesis, MIT Spoken Language Systems Group, 1995.
[21] Y. Tsuruoka, J. Tsujii, and S. Ananiadou. Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 477-485, Suntec, Singapore, 2009.
[22] J. Yu, S. V. N. Vishwanathan, S. Gunter, and N. N. Schraudolph. A Quasi-Newton approach to nonsmooth convex optimization problems in machine learning. JMLR, 11:1-57, 2010.

[23] G. Andrew and J. Gao. Scalable training of ℓ1-regularized log-linear models. In ICML, 2007.
[24] J. E. Dennis and J. J. Moré. A characterization of superlinear convergence and its application to Quasi-Newton methods. Math. Comp., 28(126):549-560, 1974.
[25] K. Scheinberg and X. Tang. Practical Inexact Proximal Quasi-Newton Method with Global Complexity Analysis. COR@L Technical Report, Lehigh University. arXiv:1311.6547, 2013.
[26] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. arXiv:1206.1623, 2012.
[27] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, Dept. Statistics, Univ. California, Berkeley, 2003.
