CME307/MS&E311: Optimization Lecture Note #16

Block Coordinate and Stochastic Descent Methods

Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A.

http://www.stanford.edu/~yyye (Chapter 8)


Recall Generic Algorithms for Minimization and Global Convergence Theorem

A Generic Algorithm: a point-to-set mapping in a subspace of $\mathbb{R}^n$.

Theorem 1 Let A be an “algorithmic mapping” defined over set X, and let the sequence $\{x^k\}$, starting from a given point $x^0$, be generated from $x^{k+1} \in A(x^k)$.

Let a solution set S ⊂ X be given, and suppose

i) all points $\{x^k\}$ are in a compact set;

ii) there is a continuous (merit) function z(x) such that if $x \notin S$, then $z(y) < z(x)$ for all $y \in A(x)$; otherwise, $z(y) \le z(x)$ for all $y \in A(x)$;

iii) the mapping A is closed at points outside S.

Then, the limit of any convergent subsequence of $\{x^k\}$ is a solution in S.


Block Coordinate Descent Method for Unconstrained Optimization I

$$\min_{x \in \mathbb{R}^N} f(x) = f((x_1; x_2; \ldots; x_n)), \quad \text{where } x = (x_1; x_2; \ldots; x_n).$$

For presentation simplicity, we let each xj be a scalar variable so that N = n.

Let f(x) be differentiable everywhere and satisfy the (first-order) β-Coordinate Lipschitz condition; that is, for any two vectors x and d,

$$\|\nabla_j f(x + e_j \mathbin{.*} d) - \nabla_j f(x)\| \le \beta_j \|e_j \mathbin{.*} d\|, \qquad (1)$$

where $e_j$ is the unit vector whose jth entry equals 1 and all other entries are zero, and $\mathbin{.*}$ is the component-wise product.

Cyclic Block Coordinate Descent (CBCD) Method (Gauss-Seidel):

$$x_1 \leftarrow \arg\min_{x_1} f(x_1, \ldots, x_n),$$
$$\vdots$$
$$x_n \leftarrow \arg\min_{x_n} f(x_1, \ldots, x_n).$$
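As a concrete illustration, here is a minimal Python sketch of CBCD (not from the notes): it assumes scalar coordinates and a smooth f, and delegates each one-dimensional argmin to scipy's scalar minimizer; the quadratic test function is a toy choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cbcd(f, x0, sweeps=100):
    """Cyclic BCD: exactly minimize f over one coordinate at a time."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(sweeps):
        for i in range(len(x)):  # one Gauss-Seidel sweep over all coordinates
            def f_along_i(t, i=i):
                y = x.copy()
                y[i] = t
                return f(y)
            x[i] = minimize_scalar(f_along_i).x  # 1-D argmin in coordinate i
    return x

# Illustration: f(x) = x^T Q x with a small positive definite Q
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
x_min = cbcd(lambda x: x @ Q @ x, x0=[1.0, -1.0])
print(x_min)  # approaches the unique minimizer, the origin
```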


Aitken Double Sweep Method:

$$x_1 \leftarrow \arg\min_{x_1} f(x_1, \ldots, x_n),$$
$$\vdots$$
$$x_n \leftarrow \arg\min_{x_n} f(x_1, \ldots, x_n),$$
$$x_{n-1} \leftarrow \arg\min_{x_{n-1}} f(x_1, \ldots, x_n),$$
$$\vdots$$
$$x_2 \leftarrow \arg\min_{x_2} f(x_1, \ldots, x_n).$$

Gauss-Southwell Method:

• Compute the gradient vector $\nabla f(x)$ and let $i^* = \arg\max_j \{|\nabla f(x)_j|\}$.

• $x_{i^*} \leftarrow \arg\min_{x_{i^*}} f(x_1, \ldots, x_n)$.


Block Coordinate Descent Method for Unconstrained Optimization II

Randomly-Permuted Cyclic Block Coordinate Descent (RCBCD) Method:

• Draw a random permutation $\sigma = (\sigma(1), \ldots, \sigma(n))$ of $\{1, \ldots, n\}$;

• $x_{\sigma(1)} \leftarrow \arg\min_{x_{\sigma(1)}} f(x_1, \ldots, x_n)$,
$$\vdots$$
$x_{\sigma(n)} \leftarrow \arg\min_{x_{\sigma(n)}} f(x_1, \ldots, x_n)$.

Randomized Block Coordinate Descent (RBCD) Method:

• Randomly choose $i^* \in \{1, 2, \ldots, n\}$.

• $x_{i^*} \leftarrow \arg\min_{x_{i^*}} f(x_1, \ldots, x_n)$.
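These variants differ only in how the coordinate to update is chosen; the update applied to the chosen coordinate is the same one-dimensional argmin as in the CBCD sketch above. A small sketch of the three selection rules, assuming a gradient oracle is available for Gauss-Southwell:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_southwell_index(grad):
    """Pick the coordinate with the largest absolute gradient entry."""
    return int(np.argmax(np.abs(grad)))

def rcbcd_order(n):
    """Draw a fresh random permutation of {0, ..., n-1} for one cycle."""
    return rng.permutation(n)

def rbcd_index(n):
    """Pick one coordinate uniformly at random."""
    return int(rng.integers(n))
```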


Convergence of the BCD Methods

The following theorem gives conditions under which the deterministic BCD method generates a sequence of iterates whose accumulation points are stationary points.

Theorem 2 Let $f : \mathbb{R}^n \to \mathbb{R}$ be given. For some given point $x^0 \in \mathbb{R}^n$, let the level set

$$X^0 = \{x \in \mathbb{R}^n : f(x) \le f(x^0)\}$$

be bounded. Assume further that f is continuously differentiable on the convex hull of $X^0$. Let $\{x^k\}$ be the sequence of points generated by the Cyclic Block Coordinate Descent Method initiated at $x^0$. Then every accumulation point of $\{x^k\}$ is a stationary point of f.

For strictly convex quadratic minimization with Hessian Q, for example, the linear convergence rate of Gauss-Southwell satisfies
$$\left(1 - \frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)(n-1)}\right)^{n-1} \ge 1 - \frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)} \ge \left(\frac{\lambda_{\max}(Q) - \lambda_{\min}(Q)}{\lambda_{\max}(Q) + \lambda_{\min}(Q)}\right)^2.$$


Randomized Block Coordinate Method

At the kth iteration of the Randomized Block Coordinate Gradient Descent (RBCGD) method:

• Randomly choose $i_k \in \{1, 2, \ldots, n\}$.

• $x^{k+1}_{i_k} = x^k_{i_k} - \frac{1}{\beta_{i_k}} \nabla_{i_k} f(x^k)$, and $x^{k+1}_i = x^k_i$ for all $i \ne i_k$.
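A one-iteration sketch of this update, assuming a per-coordinate gradient oracle grad_i(x, i) and known coordinate Lipschitz constants beta (both hypothetical names):

```python
import numpy as np

def rbcgd_step(x, grad_i, beta, rng):
    """One RBCGD iteration: a gradient step of length 1/beta_i on a
    uniformly random coordinate; all other coordinates are unchanged."""
    i = int(rng.integers(len(x)))        # randomly choose i_k
    x_new = x.copy()
    x_new[i] -= grad_i(x, i) / beta[i]   # x_i <- x_i - (1/beta_i) * grad_i f(x)
    return x_new
```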

Theorem 3 (Expected Error Convergence Estimate Theorem) Let the objective function f(x) be convex, satisfy the (first-order) β-Coordinate Lipschitz condition, and admit a minimizer $x^*$. Then
$$\mathbb{E}_{\xi^k}[f(x^{k+1})] - f(x^*) \le \frac{n}{n+k+1}\left(\frac{1}{2}\|x^0 - x^*\|_\beta^2 + f(x^0) - f(x^*)\right),$$
where the random vector $\xi^k = (i_0, i_1, \ldots, i_k)$ and the norm-square $\|x\|_\beta^2 = \sum_j \beta_j x_j^2$.


Proof: Denote $\delta_k = f(x^k) - f(x^*)$, $\Delta^k = x^k - x^*$, and
$$(r^k)^2 = \|x^k - x^*\|_\beta^2 = \sum_j \beta_j (x^k_j - x^*_j)^2.$$

Then, from the RBCGD iteration

$$(r^{k+1})^2 = (r^k)^2 - 2\nabla_{i_k} f(x^k)\,(x^k_{i_k} - x^*_{i_k}) + \frac{1}{\beta_{i_k}}\left(\nabla_{i_k} f(x^k)\right)^2.$$
It follows from the β-Coordinate Lipschitz condition that

$$f(x^{k+1}) - f(x^k) \le \nabla_{i_k} f(x^k)\,(x^{k+1}_{i_k} - x^k_{i_k}) + \frac{1}{2\beta_{i_k}}\left(\nabla_{i_k} f(x^k)\right)^2 = \frac{-1}{2\beta_{i_k}}\left(\nabla_{i_k} f(x^k)\right)^2.$$
Combining the two inequalities, we have
$$(r^{k+1})^2 \le (r^k)^2 - 2\nabla_{i_k} f(x^k)\,(x^k_{i_k} - x^*_{i_k}) + 2\left(f(x^k) - f(x^{k+1})\right).$$

Dividing both sides by 2 and taking expectation with respect to ik yields

$$\mathbb{E}_{i_k}\!\left[\tfrac{1}{2}(r^{k+1})^2\right] \le \tfrac{1}{2}(r^k)^2 - \tfrac{1}{n}\nabla f(x^k)^T (x^k - x^*) + f(x^k) - \mathbb{E}_{i_k}[f(x^{k+1})],$$

8 CME307/MS&E311: Optimization Lecture Note #16

which together with the convexity assumption $\nabla f(x^k)^T (x^* - x^k) \le f(x^*) - f(x^k)$ gives
$$\mathbb{E}_{i_k}\!\left[\tfrac{1}{2}(r^{k+1})^2\right] \le \tfrac{1}{2}(r^k)^2 + \tfrac{1}{n} f(x^*) + \tfrac{n-1}{n} f(x^k) - \mathbb{E}_{i_k}[f(x^{k+1})].$$
Rearranging gives, for each $k \ge 0$,
$$\mathbb{E}_{i_k}\!\left[\tfrac{1}{2}(r^{k+1})^2 + \delta_{k+1}\right] \le \left(\tfrac{1}{2}(r^k)^2 + \delta_k\right) - \tfrac{1}{n}\delta_k.$$
Taking expectation with respect to $\xi^{k-1}$ on both sides,

$$\mathbb{E}_{\xi^k}\!\left[\tfrac{1}{2}(r^{k+1})^2 + \delta_{k+1}\right] \le \mathbb{E}_{\xi^{k-1}}\!\left[\tfrac{1}{2}(r^k)^2 + \delta_k\right] - \tfrac{1}{n}\mathbb{E}_{\xi^{k-1}}[\delta_k] = \mathbb{E}_{\xi^k}\!\left[\tfrac{1}{2}(r^k)^2 + \delta_k\right] - \tfrac{1}{n}\mathbb{E}_{\xi^k}[\delta_k],$$
where the equality holds because $(r^k)^2$ and $\delta_k$ do not depend on $i_k$. Recursively applying this inequality, and noting that $\mathbb{E}_{\xi^k}[f(x^{k+1})]$ is monotonically decreasing in k,

$$\mathbb{E}_{\xi^k}[\delta_{k+1}] \le \mathbb{E}_{\xi^k}\!\left[\tfrac{1}{2}(r^{k+1})^2 + \delta_{k+1}\right] \le \left(\tfrac{1}{2}(r^0)^2 + \delta_0\right) - \tfrac{1}{n}\sum_{j=0}^{k}\mathbb{E}_{\xi^k}[\delta_j] \le \left(\tfrac{1}{2}(r^0)^2 + \delta_0\right) - \tfrac{k+1}{n}\mathbb{E}_{\xi^k}[\delta_{k+1}],$$
which leads to the desired result.


Worst-Case Convergence Comparison of BCDs

There is a convex quadratic minimization problem of dimension n:

$$\min_x\ x^T Q x, \quad \text{where for } \gamma \in (0, 1), \qquad Q = \begin{pmatrix} 1 & \gamma & \cdots & \gamma \\ \gamma & 1 & \cdots & \gamma \\ \vdots & \vdots & \ddots & \vdots \\ \gamma & \gamma & \cdots & 1 \end{pmatrix}.$$

• CBCD is $\frac{n}{2\pi^2}$ times slower than SDM (steepest descent);

• CBCD is $\frac{n^2}{2\pi^2}$ times slower than RBCD (each iteration randomly selects one of the n coordinates to update);

• CBCD is $\frac{n(n+1)}{2\pi^2}$ times slower than RCBCD.

Randomization makes a difference; a numerical sketch on an instance of this form follows.
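This is a small experiment sketch of my own, assuming the stated Q (ones on the diagonal, γ off the diagonal). One CBCD cycle and n RBCD updates cost the same amount of work, and each coordinate step uses the closed-form minimizer $x_i = -(\sum_{j \ne i} Q_{ij} x_j)/Q_{ii}$:

```python
import numpy as np

n, gamma = 50, 0.9
Q = gamma * np.ones((n, n)) + (1.0 - gamma) * np.eye(n)  # 1 diag, gamma off-diag
rng = np.random.default_rng(0)

def coord_min(x, i):
    """Exact minimizer of x^T Q x over coordinate i, others fixed."""
    x[i] = -(Q[i] @ x - Q[i, i] * x[i]) / Q[i, i]

x_cyc = rng.standard_normal(n)
x_rand = x_cyc.copy()
for _ in range(200):
    for i in range(n):                  # one CBCD cycle: fixed order
        coord_min(x_cyc, i)
    for i in rng.integers(n, size=n):   # n RBCD steps: random coordinates
        coord_min(x_rand, i)

# objective values after an equal amount of work for the two methods
print(x_cyc @ Q @ x_cyc, x_rand @ Q @ x_rand)
```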


Recall Optimization under Uncertainty

In many applications, the objective value is partially determined by decision makers and partially determined by “Nature”.

$$(OPT)\qquad \min_x\ f(x, \omega) \quad \text{s.t.}\ c(x, \omega) \in K \subset \mathbb{R}^m, \qquad (2)$$

where ω represents the uncertain data, $x \in \mathbb{R}^n$ is the decision vector, and K is a constraint set.

For deterministic optimization, we assume the data ω is known and fixed. In reality, we may have

• the (exact) probability distribution ξ of the data ω;

• the sample distribution and/or a few moments of the data ω;

• knowledge that ω belongs to a given uncertainty set U.

In the following we consider the unconstrained case.


Stochastic Optimization and Stochastic Gradient Descent (SGD) Methods

$$\min_x\ F(x) := \mathbb{E}_\xi[f(x, \omega)].$$

Sample Average Approximation (SAA) Method:
$$\min_x\ \frac{1}{M} \sum_{i=1}^{M} f(x, \omega^i).$$

Two Implementations:

• Sample-First and Optimize-Second, in particular SAA: collect enough samples, then solve the resulting approximate deterministic optimization problem.

• Sample and Optimize concurrently, in particular SGD: collect a sample set $S_k$ of a few samples of ω at iteration k:
$$\hat g^k = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla f(x^k, \omega^i) \quad \text{and} \quad x^{k+1} = x^k - \alpha_k \hat g^k.$$

A key question: how many samples are sufficient for an ϵ-approximate solution to the original stochastic optimization problem? This is an information complexity issue, besides the computational complexity issue, in optimization.
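A minimal sketch of the concurrent sample-and-optimize loop, under toy assumptions of my own: $f(x, \omega) = \tfrac{1}{2}\|x - \omega\|^2$ with ω ~ N(1, I), so the true minimizer of F is E[ω]:

```python
import numpy as np

rng = np.random.default_rng(0)
grad_f = lambda x, w: x - w        # gradient of f(x, w) = 0.5 * ||x - w||^2

x = np.zeros(3)
for k in range(1, 2001):
    S_k = rng.normal(loc=1.0, size=(8, 3))   # sample set S_k with |S_k| = 8
    g_hat = grad_f(x, S_k).mean(axis=0)      # averaged stochastic gradient
    x = x - (1.0 / k) * g_hat                # step size alpha_k = 1/k
print(x)  # approaches E[omega] = (1, 1, 1)
```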


Information Complexity and Sample Size in SAA

• In SAA, the required number of samples, M, should be larger than the dimension of the decision vector and grows polynomially with the dimensionality. Specifically, let $x^{SAA}$ be the optimal solution from the SAA method. Then to ensure $P[F(x^{SAA}) - F(x^*) \le \epsilon] \ge 1 - \alpha$,
$$M = O\!\left(\frac{1}{\epsilon^2}\right)\left(n \ln\frac{1}{\epsilon} + \ln\frac{1}{\alpha}\right).$$

• If $x^*$ is sparse or can be approximated by a sparse solution with cardinality $p \ll n$, then by adding a regularizing penalty function to the objective,
$$\min_x\ \frac{1}{M} \sum_{i=1}^{M} f(x, \omega^i) + P(x),$$
the sample size can be reduced to

$$M = O\!\left(\frac{1}{\epsilon^2}\right)\left(p \ln^{1.5}\frac{n}{\epsilon} + \ln\frac{1}{\alpha}\right); \quad \text{or, in the convex case:} \quad M = O\!\left(\frac{1}{\epsilon^2}\right)\left(p \ln\frac{n}{\epsilon} + \ln\frac{1}{\alpha}\right).$$


SGD and its Advantages

SGD with one sample $\omega^k$ drawn uniformly at iteration k:

$$\hat g^k = \nabla f(x^k, \omega^k) \quad \text{and} \quad x^{k+1} = x^k - \alpha_k \hat g^k.$$

• Works with the step-size rule
$$\alpha_k \to 0 \quad \text{and} \quad \sum_{k=0}^{\infty} \alpha_k \to \infty \quad (\text{e.g., } \alpha_k = O(k^{-1})).$$

• A great technique to potentially reduce the computational complexity – fewer samples are needed at the beginning.

• Potentially only select important and sensitive samples – learn where to sample.

• Dynamically incorporate new empirical observations to tune up the probability distribution.


Variance Reduction in Stochastic Algorithm Design

• The VR technique has been used extensively in the design of fast stochastic methods for solving large-scale optimization problems.

• High Level Idea: Reduce the variance of an estimate X by using another estimate Y with known expectation.

• Specifically, consider $Z_\alpha = \alpha(X - Y) + \mathbb{E}[Y]$.

– $\mathbb{E}[Z_\alpha] = \alpha\,\mathbb{E}[X] + (1 - \alpha)\,\mathbb{E}[Y]$

– $\mathrm{var}(Z_\alpha) = \mathbb{E}\!\left[(Z_\alpha - \mathbb{E}[Z_\alpha])^2\right] = \alpha^2\left[\mathrm{var}(X) + \mathrm{var}(Y) - 2\,\mathrm{cov}(X, Y)\right]$

– When α = 1, we have $\mathbb{E}[Z_\alpha] = \mathbb{E}[X]$, which is useful for establishing concentration bounds.

– When α < 1, $Z_\alpha$ will potentially have a smaller variance than X, but we no longer have $\mathbb{E}[Z_\alpha] = \mathbb{E}[X]$. (In what follows, we let α = 1.)

– Overall, variance reduction occurs if cov(X, Y) is sufficiently positive; with α = 1, we need $2\,\mathrm{cov}(X, Y) > \mathrm{var}(Y)$.
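A quick numerical illustration of this idea with α = 1, using toy distributions chosen for the example: $X = e^W$ with W standard normal, and Y = 1 + W as a correlated estimate with known mean E[Y] = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal(100_000)
X = np.exp(W)        # want E[X]; true value is e^{1/2} ~ 1.6487
Y = 1.0 + W          # control variate with known mean E[Y] = 1
Z = X - Y + 1.0      # Z_1 = X - Y + E[Y], so E[Z] = E[X]

print(X.mean(), Z.mean())  # both are unbiased estimates of E[X]
print(X.var(), Z.var())    # var(Z) < var(X) since 2 cov(X, Y) > var(Y)
```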


VR Illustration: Finite-Sum Minimization I

• Consider the following so-called finite-sum minimization problem:
$$\min_x\ \left\{ F(x) = \frac{1}{M} \sum_{i=1}^{M} f_i(x) \right\}. \qquad (3)$$

Here, $f_1, \ldots, f_M$ are smooth (convex) loss functions and M is huge, so the computation of $\nabla F(\cdot)$ is costly.

• Examples:

– Linear regression: $f_i(x) = (a_i^T x - b_i)^2$

– Logistic regression: $f_i(x) = \ln\!\left(1 + \exp(-b_i a_i^T x)\right)$

• Stochastic Gradient Descent (SGD): choose $i_k$ from $\{1, \ldots, M\}$ uniformly at random and let $x^{k+1} = x^k - \alpha_k \nabla f_{i_k}(x^k)$.

– We have $\mathbb{E}\!\left[\nabla f_{i_k}(x^k)\right] = \nabla F(x^k)$, but the variance of the estimate can be large.

– To guarantee convergence, we generally need diminishing step sizes (e.g., $\alpha_k = O(k^{-1})$).


VR Illustration: Finite-Sum Minimization II

• Now let $X = \nabla f_{i_k}(x^k)$ for estimating $\nabla F(x^k)$. What Y should we use to reduce the variance of the estimate?

– Try $Y = \nabla f_{i_k}(\tilde x^k)$ for some fixed $\tilde x^k$.

– Note that $\mathbb{E}[Y] = \nabla F(\tilde x^k)$.

• Now form $Z = X - Y + \mathbb{E}[Y] = \nabla f_{i_k}(x^k) - \nabla f_{i_k}(\tilde x^k) + \nabla F(\tilde x^k)$ and set
$$x^{k+1} = x^k - \alpha_k\left(\nabla f_{i_k}(x^k) - \nabla f_{i_k}(\tilde x^k) + \nabla F(\tilde x^k)\right).$$

• Since the computation of $\nabla F(\tilde x^k)$ is costly, we don't want to update $\tilde x^k$ too often, but only once in a while.

• This is the core idea behind Johnson and Zhang's stochastic variance-reduced gradient (SVRG) method, which has generated much recent research; see Johnson, Zhang. Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction. NIPS 2013.


VR Illustration: Finite-Sum Minimization III

• One choice is to update x̃ at a uniform (or geometric) pace; that is, when $k = rK$ (or $k = 2^r$) for a nonnegative integer r, we let $\tilde x^k = x^k$, and it remains unchanged from iteration k to k + K (or to 2k).

• Thus, from iteration 1 to k, x˜k is updated, or ∇F (x˜k) is computed, only k/K (or log(k)) times.

• Moreover, most likely $\mathrm{cov}(x^k, \tilde x^k) > 0$ during the iteration period k to k + K, since both $x^k$ and $\tilde x^k$ converge to the same limit solution. A compact sketch combining these pieces follows.
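Putting the pieces together, a compact SVRG sketch under assumptions of my own (least-squares components $f_i(x) = (a_i^T x - b_i)^2$, a fixed step size, and a snapshot refreshed every 2M inner steps); this is a sketch of the scheme above, not Johnson and Zhang's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 200, 5
A = rng.standard_normal((M, n))
b = A @ rng.standard_normal(n)                        # consistent linear system

grad_i = lambda x, i: 2.0 * (A[i] @ x - b[i]) * A[i]  # gradient of f_i
full_grad = lambda x: 2.0 * A.T @ (A @ x - b) / M     # gradient of F

x, alpha, K = np.zeros(n), 1e-3, 2 * M
for epoch in range(30):
    x_tilde, mu = x.copy(), full_grad(x)   # snapshot and its full gradient
    for _ in range(K):                     # cheap variance-reduced steps
        i = rng.integers(M)
        z = grad_i(x, i) - grad_i(x_tilde, i) + mu    # VR gradient estimate
        x -= alpha * z
print(np.linalg.norm(full_grad(x)))        # near zero at the minimizer
```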


VR Illustration: Finite-Sum Minimization IV

• The VR-SGD method can be shown to converge linearly when F satisfies the so-called error bound condition: there exists a τ > 0 such that

$$\mathrm{dist}(x, X^*) \le \tau \|\nabla F(x)\|_2 \quad \text{for all } x, \qquad (4)$$

where $X^*$ is the set of optimal solutions.

• If F is strongly convex, then it satisfies (4). However, the converse need not hold; for details, see So, Zhou. Non-Asymptotic Convergence Analysis of Inexact Gradient Methods for Machine Learning Without Strong Convexity. Optim. Methods Softw. 32(4): 963–992, 2017.

• Extensions of the VR-SGD method to the case where F is non-convex have been proposed and analyzed in Reddi, Hefny, Sra, Póczos, Smola. Stochastic Variance Reduction for Nonconvex Optimization. ICML 2016, and Allen-Zhu, Hazan. Variance Reduction for Faster Nonconvex Optimization. ICML 2016.


Variance Reduction in Stochastic Value Iteration for MDP I

Let $y \in \mathbb{R}^m$ represent the cost-to-go values of the m states (the ith entry for the ith state) of a given policy. The MDP problem entails choosing the fixed-point value vector $y^*$ such that it satisfies:

$$y^*_i = \min_{j \in A_i} \{c_j + \gamma p_j^T y^*\}, \quad \forall i.$$

The Value-Iteration (VI) Method is, starting from any $y^0$,

$$y^{k+1}_i = \min_{j \in A_i} \{c_j + \gamma p_j^T y^k\}, \quad \forall i.$$

If the initial $y^0$ is strictly feasible for state i, that is, $y^0_i < c_j + \gamma p_j^T y^0,\ \forall j \in A_i$, then $y^k_i$ would be increasing in the VI iteration for all i and k.

The computational work for state i at iteration k is to compute $p_j^T y^k = \mu_j(y^k)$ for each $j \in A_i$. This needs O(m) operations.

Could we approximate $\mu_j(y^k)$ by sampling (in some RL models, $p_j$ is not explicitly given, so the mean of $y^k$ has to be estimated by sampling)?


Variance Reduction in Stochastic Value Iteration for MDP II

We carry out the VI iteration as:
$$y^{k+1}_i = \min_{j \in A_i} \{c_j + \gamma p_j^T \tilde y^k + \gamma p_j^T (y^k - \tilde y^k)\}, \quad \forall i,$$

where $\tilde y^k$ is updated at the geometric pace as before. Or, once in a while, compute a hash vector
$$\tilde c^k_j = c_j + \gamma p_j^T \tilde y^k, \quad \forall j,$$

and do
$$y^{k+1}_i = \min_{j \in A_i} \{\tilde c^k_j + \gamma p_j^T (y^k - \tilde y^k)\}, \quad \forall i.$$

Then we only need to approximate
$$p_j^T (y^k - \tilde y^k) = \mu_j(y^k - \tilde y^k).$$

Since $y^* \ge y^k \ge \tilde y^k$ during the period from k to 2k and $(y^k - \tilde y^k)$ monotonically converges to zero, the norm of $(y^k - \tilde y^k)$ becomes smaller and smaller, so only a constant number of samples is needed to estimate the mean to the desired accuracy. This leads to a geometrically convergent algorithm with high probability.
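A sketch of this variance-reduced sampled VI under simplifying assumptions of my own: one action per state (so the min over $A_i$ disappears and the update is policy evaluation) and a fixed sample budget N per inner estimate. The hash vector is computed exactly once per epoch; only the small correction $p_j^T(y^k - \tilde y^k)$ is estimated by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
m, gamma, N = 30, 0.9, 64         # states, discount, samples per estimate
c = rng.uniform(1.0, 2.0, m)      # one cost per state (single-action MDP)
P = rng.dirichlet(np.ones(m), m)  # row j is the transition distribution p_j

y = np.zeros(m)
for epoch in range(20):
    y_tilde = y.copy()
    c_hash = c + gamma * (P @ y_tilde)   # exact hash vector, once per epoch
    for _ in range(m):                   # cheap sampled inner iterations
        d = y - y_tilde                  # small correction to be estimated
        est = np.array([d[rng.choice(m, size=N, p=P[j])].mean()
                        for j in range(m)])
        y = c_hash + gamma * est         # VR value-iteration update
print(np.max(np.abs(y - (c + gamma * P @ y))))  # small Bellman residual
```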
