Large Scale Optimization for Machine Learning

Meisam Razaviyayn
[email protected]

Lecture 21

Announcements:

• Midterm exams will be returned next Tuesday

Non-smooth Objective Function

• Subgradient methods: typically slow, and no good termination criterion (other than cross-validation)

• Proximal Gradient: fast, assuming each iteration (the proximal step) is easy to compute

• Block Coordinate Descent: also helpful for exploiting multi-block structure

• Alternating Direction Method of Multipliers (ADMM): will be covered later

Multi-Block Structure and the BCD Method

Block Coordinate Descent (BCD) Method:

At iteration r, choose an index i and minimize the objective exactly over block i, keeping the other blocks fixed:

x_i^{r+1} ∈ argmin_{x_i ∈ X_i} f(x_1^r, …, x_{i−1}^r, x_i, x_{i+1}^r, …, x_n^r)

(with the cyclic rule, the already-updated blocks j < i enter with their new values x_j^{r+1}).

Very different from the methods covered previously (incremental GD, SGD, …).
Choice of index i: cyclic, randomized, or greedy.
Simple and scalable. Example: the Lasso, where each coordinate sub-problem is one-dimensional and solved in closed form by soft-thresholding; see the sketch below.
Geometric interpretation: each step moves to the exact minimizer along one coordinate block.
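Here is a minimal sketch of cyclic BCD for the Lasso (our own illustrative code, not from any particular library; `soft_threshold` and the variable names are ours, and it assumes the columns of A are nonzero):

```python
import numpy as np

def soft_threshold(z, t):
    """Solves min_x 0.5*(x - z)^2 + t*|x| (the 1-D Lasso sub-problem)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cyclic_bcd(A, b, lam, num_epochs=100):
    """Cyclic BCD for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Each coordinate sub-problem is one-dimensional with a unique
    closed-form minimizer, so every BCD step is exact and cheap.
    """
    m, n = A.shape
    x = np.zeros(n)
    r = b.copy()                      # residual b - Ax (x starts at 0)
    col_sq = (A ** 2).sum(axis=0)     # ||a_i||^2 for each column
    for _ in range(num_epochs):
        for i in range(n):
            r += A[:, i] * x[i]       # remove coordinate i from the residual
            x[i] = soft_threshold(A[:, i] @ r, lam) / col_sq[i]
            r -= A[:, i] * x[i]       # restore the full residual
    return x
```

Maintaining the residual makes each coordinate update cost O(m), so a full epoch costs about the same as one gradient evaluation.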

Convergence of the BCD Method

Assumptions (following D.P. Bertsekas, e.g., Nonlinear Programming) for the cyclic update rule:

• Separable constraints
• Differentiable/smooth objective
• Unique minimizer at each step

Proof? Which of these assumptions are necessary?

Necessity of the Smoothness Assumption

Without smoothness the objective may fail to be “regular”: BCD can get stuck at a point that is a minimizer along every coordinate block yet is not stationary, so the smoothness assumption cannot simply be dropped. When the non-smooth part is separable across blocks, however, the objective is regular and coordinate-wise optimality still implies stationarity.

Example: the Lasso, whose ℓ1 term is separable across coordinates.

BCD and Non-smooth Objectives

Theorem [Tseng 2001]. Assume:
1) the feasible set is compact;
2) the minimizer at each step is unique;
3) the constraint set is separable across blocks;
4) the objective function is regular.
Then every limit point of the BCD iterates is a stationary point.

This holds for the cyclic, randomized, and greedy index rules. (What is the right definition of stationarity for a non-smooth objective? Regularity is exactly what guarantees that coordinate-wise stationarity implies stationarity along all feasible directions.)

Rate of convergence of BCD:
• Similar to gradient descent: sublinear for general convex objectives and linear for strongly convex ones.
• The same rates can be shown for most of the popular non-smooth objectives.

Uniqueness of the Minimizer

Powell's classical example shows that this assumption cannot be dropped: without it, cyclic BCD can fail to converge, with the iterates cycling indefinitely among non-stationary points [Michael J. D. Powell, 1973].

Uniqueness of the Minimizer (cont.)

Tensor PARAFAC (CP) decomposition: the decomposition problem is NP-hard [Håstad 1990]. The classical heuristic is Alternating Least Squares [Carroll 1970], [Harshman 1970]: BCD over the factor matrices, where each block update is a linear least-squares problem whose minimizer need not be unique.

[Figure: error (0 to 0.8) versus iterates (0 to 300) for ALS; the error stays nearly flat for many iterations before dropping, the “swamp” effect.]
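For concreteness, a minimal NumPy sketch of ALS for a rank-R CP decomposition of a 3-way tensor (illustrative code; the unfolding and Khatri-Rao conventions below are one standard choice). Tracking `errors` against the iteration counter reproduces the kind of long plateau sketched above:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (I x R) and V (J x R)."""
    I, R = U.shape
    J = V.shape[0]
    return (U[:, None, :] * V[None, :, :]).reshape(I * J, R)

def cp_als(T, rank, num_iters=300, seed=0):
    """ALS for T ~ sum_r a_r (outer) b_r (outer) c_r: BCD over (A, B, C).

    Each block update is an exact least-squares solve, but its minimizer
    need not be unique (e.g., when the Khatri-Rao matrix is rank-deficient),
    one source of the long "swamp" plateaus seen in the error curve.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((T.shape[0], rank))
    B = rng.standard_normal((T.shape[1], rank))
    C = rng.standard_normal((T.shape[2], rank))
    errors = []
    for _ in range(num_iters):
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B)).T
        approx = np.einsum('ir,jr,kr->ijk', A, B, C)
        errors.append(np.linalg.norm(T - approx) / np.linalg.norm(T))
    return A, B, C, errors
```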

BCD Limitations

• Uniqueness of the minimizer
• Each sub-problem needs to be easily solvable


Popular Solution: Inexact BCD

At iteration r, choose an index i and minimize a local approximation u_i(·; x^r) of the objective function over block i:

x_i^{r+1} ∈ argmin_{x_i ∈ X_i} u_i(x_i; x^r)


This idea appears under many names: block successive upper-bound minimization (BSUM), block successive convex approximation, the convex-concave procedure, majorization-minimization, DC programming, BCGD, …

Idea of Block Successive Upper-bound Minimization

Choose the surrogate u_i(·; x^r) so that it satisfies two properties:

• Global upper bound: u_i(x_i; x^r) ≥ f(x_1^r, …, x_i, …, x_n^r) for all feasible x_i.
• Locally tight: u_i(x_i^r; x^r) = f(x^r), with matching directional derivatives at x_i^r.

Monotone algorithm: the two properties immediately give descent at every iteration,

f(x^{r+1}) ≤ u_i(x_i^{r+1}; x^r) ≤ u_i(x_i^r; x^r) = f(x^r).

Under mild additional assumptions, every limit point of the iterates is a stationary point.
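A minimal skeleton of the BSUM loop (illustrative code with hypothetical helper names; the caller must supply surrogate minimizers satisfying the two properties above):

```python
import numpy as np

def bsum(x0, min_surrogate, num_iters=100):
    """Block Successive Upper-bound Minimization with the cyclic rule.

    x0            : list of initial block values (numpy arrays).
    min_surrogate : callable (i, x) -> new value for block i, the minimizer
                    over block i of a surrogate u_i(.; x) that (1) globally
                    upper-bounds f in that block and (2) is tight at x[i].
    With these two properties, each iteration cannot increase f.
    """
    x = [np.asarray(b, dtype=float).copy() for b in x0]
    for r in range(num_iters):
        i = r % len(x)            # cyclic block selection
        x[i] = min_surrogate(i, x)
    return x
```

For a smooth f with block-Lipschitz gradients, `min_surrogate(i, x) = x[i] - alpha_i * grad_i_f(x)` with alpha_i ≤ 1/L_i is one valid choice; this is exactly Example 1 below.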

Example 1: Block Coordinate (Proximal) Gradient Descent

Smooth scenario: instead of minimizing f exactly over block i, minimize the quadratic surrogate

u_i(x_i; x^r) = f(x^r) + ⟨∇_i f(x^r), x_i − x_i^r⟩ + (1/(2α)) ‖x_i − x_i^r‖²,

a global upper bound whenever α ≤ 1/L_i (with L_i the Lipschitz constant of ∇_i f). Its minimizer is the block gradient step x_i^{r+1} = x_i^r − α ∇_i f(x^r).

Non-smooth scenario: for f(x) = g(x) + Σ_i h_i(x_i) with g smooth and each h_i convex, majorize only g and keep h_i exact:

u_i(x_i; x^r) = g(x^r) + ⟨∇_i g(x^r), x_i − x_i^r⟩ + (1/(2α)) ‖x_i − x_i^r‖² + h_i(x_i).

The minimizer is the block proximal gradient step x_i^{r+1} = prox_{α h_i}(x_i^r − α ∇_i g(x^r)).

Using a Bregman divergence: the proximal term (1/(2α)) ‖x_i − x_i^r‖² can be replaced by a Bregman divergence D_φ(x_i, x_i^r) = φ(x_i) − φ(x_i^r) − ⟨∇φ(x_i^r), x_i − x_i^r⟩, yielding mirror-descent-style block updates adapted to the geometry of the feasible set.

Alternating proximal minimization: majorize f by adding a proximal term to the objective itself,

x_i^{r+1} = argmin_{x_i ∈ X_i} f(x_1^r, …, x_i, …, x_n^r) + (1/(2α)) ‖x_i − x_i^r‖².

The added quadratic makes each block sub-problem strongly convex, so its minimizer is unique; this restores the uniqueness assumption that plain BCD needs.
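A sketch of the non-smooth block update above, assuming the caller supplies the block gradients of g, the proximal operators of the h_i, and step sizes α_i ≤ 1/L_i (all names here are illustrative, not from any particular library):

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def block_prox_gradient(grad_g, prox_h, x0, steps, num_epochs=100):
    """Block coordinate proximal gradient for min g(x) + sum_i h_i(x_i).

    grad_g(x) returns the list of block gradients of the smooth part g;
    prox_h(i, v, t) is the proximal operator of h_i; steps[i] should be
    at most 1/L_i, the inverse block Lipschitz constant of grad_i g.
    Each update minimizes the quadratic-plus-h_i surrogate from the slide.
    """
    x = [np.asarray(b, dtype=float).copy() for b in x0]
    for _ in range(num_epochs):
        for i in range(len(x)):            # cyclic rule
            g_i = grad_g(x)[i]             # block-i gradient at the current point
            x[i] = prox_h(i, x[i] - steps[i] * g_i, steps[i])
    return x
```

For h_i = λ‖·‖_1, passing `prox_h = lambda i, v, t: prox_l1(v, lam * t)` recovers a block version of the Lasso solver sketched earlier.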

Example 2: Expectation-Maximization (EM) Algorithm

Maximum-likelihood estimation with latent (hidden) variables: maximize log Pr(x; θ) = log Σ_z Pr(x, z; θ), where the sum over z is what makes direct maximization hard.


Key tool: Jensen's inequality. The E-step builds a locally tight lower bound on the log-likelihood (equivalently, a locally tight upper bound on the negative log-likelihood), and the M-step minimizes it, so EM is a special case of SUM.
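In symbols (a standard derivation, not taken from the slides; z denotes the latent variables and q any distribution over them):

```latex
\log \Pr(x;\theta)
  = \log \sum_{z} q(z)\,\frac{\Pr(x,z;\theta)}{q(z)}
  \;\ge\; \sum_{z} q(z)\,\log \frac{\Pr(x,z;\theta)}{q(z)},
```

with equality at q(z) = Pr(z | x; θ^r). Maximizing this bound over θ is exactly the M-step, and Example 3 below instantiates it for a mixture model.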

Example 3: Transcript Abundance Estimation

The likelihood of the reads $R_1,\ldots,R_N$ given the abundance levels $\rho_1,\ldots,\rho_M$ can be written as

$$
\begin{aligned}
\Pr(R_1,\ldots,R_N;\,\rho_1,\ldots,\rho_M)
  &= \prod_{n=1}^{N} \Pr(R_n;\,\rho_1,\ldots,\rho_M) \\
  &= \prod_{n=1}^{N} \left( \sum_{m=1}^{M} \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m)\, \Pr(s_m) \right) \\
  &= \prod_{n=1}^{N} \left( \sum_{m=1}^{M} \alpha_{nm}\,\rho_m \right),
\end{aligned}
$$

where $\alpha_{nm} \triangleq \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m)$ can be obtained efficiently using an alignment algorithm such as the ones based on the Burrows-Wheeler transform; see, e.g., [85], [86]. Therefore, given $\{\alpha_{nm}\}_{n,m}$, the maximum likelihood estimation of the abundance levels can be stated as

$$
\widehat{\rho}^{\,\mathrm{ML}} = \arg\min_{\rho}\; -\sum_{n=1}^{N} \log\!\left( \sum_{m=1}^{M} \alpha_{nm}\,\rho_m \right)
\quad \text{s.t.} \quad \sum_{m=1}^{M} \rho_m = 1,\;\; \rho_m \ge 0,\;\; \forall\, m = 1,\ldots,M. \tag{36}
$$

As a special case of the EM algorithm, a popular approach for solving this optimization problem is to successively minimize a locally tight upper bound of the objective function. In particular, the eXpress software [87] solves the following optimization problem at the $r$-th iteration of the algorithm:

$$
\rho^{r+1} = \arg\min_{\rho}\; -\sum_{n=1}^{N} \left( \sum_{m=1}^{M} \frac{\alpha_{nm}\,\rho_m^r}{\sum_{m'=1}^{M} \alpha_{nm'}\,\rho_{m'}^r} \log\!\left( \frac{\rho_m}{\rho_m^r} \right) + \log\!\left( \sum_{m=1}^{M} \alpha_{nm}\,\rho_m^r \right) \right)
\quad \text{s.t.} \quad \sum_{m=1}^{M} \rho_m = 1,\;\; \rho_m \ge 0,\;\; \forall\, m. \tag{37}
$$

Using Jensen's inequality, it is not hard to check that the objective in (37) is a valid upper bound of the one in (36) in the BSUM framework. Moreover, (37) has a closed-form solution, obtained by

$$
\rho_m^{r+1} = \frac{1}{N} \sum_{n=1}^{N} \frac{\alpha_{nm}\,\rho_m^r}{\sum_{m'=1}^{M} \alpha_{nm'}\,\rho_{m'}^r}, \qquad \forall\, m = 1,\ldots,M,
$$

which makes the algorithm computationally efficient at each step (a closed-form update). For another application of the BSUM algorithm in classical genetics, the readers are referred to the traditional gene counting algorithm [88].

2) Tensor decomposition: The CANDECOMP/PARAFAC (CP) decomposition has applications in different areas such as chemometrics [89], [90], clustering [91], and compression [92].
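Returning to the abundance update: a minimal NumPy sketch of the closed-form EM/BSUM iteration above (function and variable names are ours, not the eXpress API; it assumes every read aligns to at least one sequence, so no row of responsibilities is all-zero):

```python
import numpy as np

def em_abundances(alpha, num_iters=200):
    """BSUM/EM iteration for transcript abundances (closed-form update above).

    alpha : (N, M) array of alignment probabilities alpha_{nm}.
    """
    N, M = alpha.shape
    rho = np.full(M, 1.0 / M)                 # uniform start on the simplex
    for _ in range(num_iters):
        # responsibility of sequence m for read n:
        #   alpha_{nm} rho_m / sum_{m'} alpha_{nm'} rho_{m'}
        resp = alpha * rho
        resp /= resp.sum(axis=1, keepdims=True)
        rho = resp.mean(axis=0)               # averaging keeps rho on the simplex
    return rho
```

Each iteration costs O(NM), and the simplex constraint in (36) is maintained automatically, which is exactly why the closed-form update makes the algorithm cheap per step.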
