Large Scale Optimization for Machine Learning
Meisam Razaviyayn
Lecture 21
[email protected]

Announcements:
• Midterm exams: graded exams will be returned next Tuesday.

Non-smooth Objective Functions
• Sub-gradient method: typically slow, with no good termination criterion (other than cross-validation).
• Proximal gradient: fast, assuming each iteration is easy to compute.
• Block coordinate descent (BCD): also helpful for exploiting multi-block structure.
• Alternating Direction Method of Multipliers (ADMM): will be covered later.

Multi-Block Structure and the BCD Method
Block Coordinate Descent (BCD) method: at iteration r, choose an index i and update only the i-th block, keeping the others fixed:
$$x_i^{r+1} \in \arg\min_{x_i \in X_i} f\big(x_i, x_{-i}^r\big), \qquad x_j^{r+1} = x_j^r \ \text{ for } j \neq i.$$
• Very different from incremental GD, SGD, …: those methods split the objective over data samples, whereas BCD splits the optimization variables into blocks.
• Choice of index i: cyclic, randomized, or greedy.
• Simple and scalable: the Lasso is the canonical example (see the code sketch after Example 1 below).
• Geometric interpretation: successive exact minimization along block-coordinate directions.

Convergence of the BCD Method
Assumptions for the cyclic update rule (see "Nonlinear Programming", D.P. Bertsekas):
• separable constraints;
• differentiable/smooth objective;
• a unique minimizer at each step.
Proof? Which of these assumptions are necessary?

Necessity of the Smoothness Assumption
Smoothness cannot simply be dropped: on a non-smooth objective that is not "regular", BCD can get stuck at a non-stationary point. For regular objectives such as the Lasso, where the non-smooth part is separable across blocks, convergence is recovered by the following result.

BCD and Non-smooth Objectives
Theorem [Tseng 2001]. Assume
1) the feasible set is compact;
2) the minimizer at each step is unique;
3) the constraints are separable across blocks;
4) the objective function is regular.
Then every limit point of the iterates is a stationary point. The result holds for the cyclic, randomized, and greedy index rules. (What is the right definition of stationarity for non-smooth functions?)
Rate of convergence of BCD:
• Similar to GD: sublinear for general convex objectives, linear under strong convexity.
• The same rates can be shown for most of the popular non-smooth objectives.

Uniqueness of the Minimizer
[Michael J. D. Powell 1973]: a classical counterexample in which cyclic coordinate minimization fails to converge to a stationary point, the iterates cycling indefinitely. It shows that the uniqueness-of-minimizer assumption cannot simply be dropped.

Uniqueness of the Minimizer: Tensor PARAFAC Decomposition
• Computing the PARAFAC (CP) decomposition is NP-hard [Hastad 1990].
• [Carroll 1970], [Harshman 1970]: Alternating Least Squares (ALS).
• ALS suffers from the "swamp" effect: the error can stall at a plateau for many iterations before dropping sharply.
[Figure: reconstruction error versus iterates for ALS, illustrating the "swamp" effect.]

BCD Limitations
• Uniqueness of the minimizer is required.
• Each sub-problem needs to be easily solvable.
Popular solution: Inexact BCD. At iteration r, choose an index i and minimize a local approximation of the objective function rather than the objective itself. This idea appears under many names: block successive upper-bound minimization (BSUM), block successive convex approximation, convex-concave procedure, majorization-minimization, DC-programming, BCGD, …

Idea of Block Successive Upper-bound Minimization
Minimize a block surrogate $u_i(x_i; x^r)$ in place of $f$, chosen to be
• a global upper bound: $u_i(x_i; x^r) \ge f(x_i, x_{-i}^r)$ for all feasible $x_i$;
• locally tight: $u_i(x_i^r; x^r) = f(x^r)$.
The resulting algorithm is monotone (the objective never increases), and every limit point is a stationary point.

Example 1: Block Coordinate (Proximal) Gradient Descent
• Smooth scenario: the surrogate $u_i(x_i; x^r) = f(x^r) + \langle \nabla_i f(x^r), x_i - x_i^r \rangle + \frac{L_i}{2}\|x_i - x_i^r\|^2$ yields a gradient step on block i.
• Non-smooth scenario: keeping a separable non-smooth term $g_i(x_i)$ in the surrogate yields a proximal gradient step on block i.
• The quadratic term can be replaced by a Bregman divergence.
• Alternating proximal minimization: take $u_i(x_i; x^r) = f(x_i, x_{-i}^r) + \frac{1}{2\gamma}\|x_i - x_i^r\|^2$.
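As a concrete instance of these updates, here is a minimal NumPy sketch of cyclic BCD for the Lasso, $\min_x \frac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$. It is an illustrative sketch, not code from the lecture: the function names and the running-residual trick are my choices, and the columns of A are assumed nonzero. Each scalar sub-problem has a unique closed-form soft-thresholding solution, and the $\ell_1$ term is separable across coordinates, so Tseng's conditions apply.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t*|.| : shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_bcd(A, b, lam, iters=100):
    """Cyclic BCD for min 0.5*||Ax - b||^2 + lam*||x||_1.
    Each scalar sub-problem is solved exactly by soft-thresholding,
    which is also a proximal gradient step with step 1/||A_j||^2."""
    n, d = A.shape
    x = np.zeros(d)
    r = b - A @ x                    # running residual b - A x
    col_sq = (A * A).sum(axis=0)     # ||A_j||^2 for each column (assumed > 0)
    for _ in range(iters):
        for j in range(d):           # cyclic index rule
            r += A[:, j] * x[j]      # remove coordinate j from the residual
            x[j] = soft_threshold(A[:, j] @ r, lam) / col_sq[j]
            r -= A[:, j] * x[j]      # put the updated coordinate back
    return x
```

Maintaining the residual makes each coordinate update O(n) rather than O(nd), which is what makes BCD simple and scalable on this problem.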
Example 2: Expectation Maximization Algorithm
The EM algorithm is a special case of (B)SUM: by Jensen's inequality, the E-step builds a locally tight global upper bound on the negative log-likelihood, and the M-step minimizes that bound.

Example 3: Transcript Abundance Estimation
The likelihood of the observed reads $R_1, \ldots, R_N$ given the abundance levels $\rho_1, \ldots, \rho_M$ can be written as
$$\Pr(R_1,\ldots,R_N;\rho_1,\ldots,\rho_M) = \prod_{n=1}^{N} \Pr(R_n;\rho_1,\ldots,\rho_M) = \prod_{n=1}^{N}\Big(\sum_{m=1}^{M} \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m)\,\Pr(s_m)\Big) = \prod_{n=1}^{N}\Big(\sum_{m=1}^{M} \alpha_{nm}\rho_m\Big),$$
where $\alpha_{nm} \triangleq \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m)$ can be obtained efficiently using an alignment algorithm such as the ones based on the Burrows-Wheeler transform; see, e.g., [85], [86]. Therefore, given $\{\alpha_{nm}\}_{n,m}$, the maximum likelihood estimation of the abundance levels can be stated as
$$\hat{\rho}^{\mathrm{ML}} = \arg\min_{\rho}\ -\sum_{n=1}^{N} \log\Big(\sum_{m=1}^{M} \alpha_{nm}\rho_m\Big) \quad \text{s.t.}\ \sum_{m=1}^{M}\rho_m = 1,\ \ \rho_m \ge 0,\ \forall\, m=1,\ldots,M. \qquad (36)$$
As a special case of the EM algorithm, a popular approach for solving this optimization problem is to successively minimize a locally tight upper bound of the objective function. In particular, the eXpress software [87] solves the following optimization problem at the r-th iteration of the algorithm:
$$\rho^{r+1} = \arg\min_{\rho}\ -\sum_{n=1}^{N}\bigg(\sum_{m=1}^{M} \frac{\alpha_{nm}\rho_m^{r}}{\sum_{m'=1}^{M}\alpha_{nm'}\rho_{m'}^{r}}\,\log\Big(\frac{\rho_m}{\rho_m^{r}}\Big) + \log\Big(\sum_{m=1}^{M}\alpha_{nm}\rho_m^{r}\Big)\bigg) \quad \text{s.t.}\ \sum_{m=1}^{M}\rho_m = 1,\ \ \rho_m \ge 0,\ \forall\, m. \qquad (37)$$
Using Jensen's inequality, it is not hard to check that (37) is a valid upper bound of (36) in the BSUM framework. Moreover, (37) has a closed-form solution given by
$$\rho_m^{r+1} = \frac{1}{N}\sum_{n=1}^{N} \frac{\alpha_{nm}\rho_m^{r}}{\sum_{m'=1}^{M}\alpha_{nm'}\rho_{m'}^{r}}, \qquad \forall\, m = 1,\ldots,M,$$
which makes the algorithm computationally efficient at each step. For another application of the BSUM algorithm in classical genetics, the reader is referred to the traditional gene-counting algorithm [88].
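A minimal NumPy sketch of this update follows, assuming the alignment scores are already given as an N x M array in which every read aligns to at least one sequence; the function name and the uniform initialization are illustrative assumptions, not part of eXpress.

```python
import numpy as np

def em_abundance(alpha, iters=100):
    """EM / BSUM iteration (37) for the ML problem (36).
    alpha[n, m] = Pr(R_n | read R_n from sequence s_m)."""
    N, M = alpha.shape
    rho = np.full(M, 1.0 / M)               # start at the center of the simplex
    for _ in range(iters):
        w = alpha * rho                      # unnormalized posteriors, N x M
        w /= w.sum(axis=1, keepdims=True)    # w[n, m] = posterior that R_n came from s_m
        rho = w.mean(axis=0)                 # closed-form minimizer of the upper bound
    return rho
```

Each iteration computes the posterior probability that read n came from transcript m and then averages these posteriors over the reads, which is exactly the closed-form minimizer above; the iterates remain on the simplex automatically, so the constraints never need to be enforced explicitly.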
Tensor decomposition: The CANDECOMP/PARAFAC (CP) decomposition has applications in different areas such as chemometrics [89], [90], clustering [91], and compression [92].
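To tie this back to the ALS discussion above, here is a minimal NumPy sketch of a rank-R CP decomposition of a dense 3-way tensor by alternating least squares; the function name, random initialization, and fixed iteration count are illustrative assumptions, not the lecture's code. Each factor update is an exact least-squares block minimization, i.e., BCD with three blocks.

```python
import numpy as np

def cp_als(T, R, iters=300, seed=0):
    """Illustrative rank-R CP (PARAFAC) decomposition of a dense
    3-way tensor T via ALS: each factor matrix is the exact
    minimizer of the squared error with the other two fixed."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    for _ in range(iters):
        # Normal equations per block: X @ Gram = (T contracted with the
        # other two factors), where Gram is the Hadamard product of the
        # two R x R Gram matrices; solve() inverts it.
        A = np.linalg.solve((B.T @ B) * (C.T @ C),
                            np.einsum('ijk,jr,kr->ir', T, B, C).T).T
        B = np.linalg.solve((A.T @ A) * (C.T @ C),
                            np.einsum('ijk,ir,kr->jr', T, A, C).T).T
        C = np.linalg.solve((A.T @ A) * (B.T @ B),
                            np.einsum('ijk,ir,jr->kr', T, A, B).T).T
    err = np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C))
    return A, B, C, err / np.linalg.norm(T)
```

Tracking the relative error over the iterations is how the "swamp" effect from the slides shows up in practice: for ill-conditioned (e.g., nearly collinear) factors, the error can plateau for many iterations before dropping, even though every block update is exactly optimal.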