
Analysis of the Optimization Landscapes for Overcomplete Representation Learning

Qing Qu

Center for Data Science, New York University

July 19, 2020

Data Increasingly Massive & High-Dimensional...

[Examples: hyperspectral imaging, autonomous driving, social networks, healthcare]

Learning Compact Representations

Summary of Main Results

Convolutional/overcomplete dictionary learning can be provably solved with simple methods.

Q. Qu, Y. Zhai, X. Li, Y. Zhang, Z. Zhu, Analysis of the optimization landscapes for overcomplete representation learning, ICLR'20.

Outline

Overcomplete Dictionary Learning

Convolutional Dictionary Learning

Conclusion

Learning Sparsely-Used Dictionaries

Given Y, can we jointly learn a compact dictionary A0 and sparse X0?

Learning Compact Representations

[Applications: denoising, image restoration, super-resolution, image half-toning. Images courtesy of Julien Mairal et al.]

Dictionary Learning - Symmetry

♦ Permutation symmetry (2^n n! signed permutations Π):
$$Y = A_0 X_0 = (A_0 \Pi)\,(\Pi^\top X_0).$$
♦ Equivalent solution pairs: $(A_0, X_0) \iff (A_0 \Pi,\ \Pi^\top X_0)$.
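To make the ambiguity concrete, here is a minimal NumPy check (illustrative, not from the talk): the observations Y cannot distinguish (A0, X0) from any signed-permuted pair.

```python
# Minimal NumPy check of the signed-permutation ambiguity Y = A0 X0 = (A0 Pi)(Pi^T X0).
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200
A0 = rng.standard_normal((n, n))                                 # a (here square) dictionary
X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) < 0.1)    # sparse coefficients

perm = rng.permutation(n)
signs = rng.choice([-1.0, 1.0], size=n)
Pi = np.zeros((n, n))
Pi[perm, np.arange(n)] = signs                                   # a signed permutation matrix

Y = A0 @ X0
print(np.allclose(Y, (A0 @ Pi) @ (Pi.T @ X0)))                   # True: the pairs are indistinguishable
```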

Dictionary Learning - Symmetry Leads to Nonconvexity

Because all signed permutations of a solution are equally good, the landscape cannot be convex: equivalent global minimizers are separated by "flat" saddle points.

Why Do Nonconvex Problems Seem "Scary"?

Many learning problems are naturally formulated as nonconvex optimization problems:
$$\min_{W \in \mathcal{M}} \ \varphi\big(\underbrace{Y}_{\text{data}};\ \underbrace{W}_{\text{model parameters}}\big),$$
whose landscapes may contain "bad" local minima and "flat" saddle points. In the worst case, finding a local minimizer is NP-hard [Murty et al. 1987]...

Symmetry Creates Benign Nonconvex Geometry

♦ No "bad" local minima;
♦ No "flat" saddle points.

Nonconvex learning problems can be solved efficiently to global solutions!

A Fairly Broad Class of Nonconvex Problems

♦ phase retrieval [Sun et al.'18]
♦ low-rank matrix recovery [Ge et al.'16, Zhu et al.'18]
♦ phase synchronization [Boumal'17]
♦ shallow/linear neural networks [Kawaguchi'17, Du et al.'19]
♦ sparse blind deconvolution [Li et al.'18, Qu et al.'19]
♦ overcomplete dictionary learning [this work]
♦ convolutional dictionary learning [this work]
♦ tensor decomposition [Ge et al.'16]

Global Geometry for Overcomplete DL

Given Y = A0 · X0, learn overcomplete A0 and sparse X0?

Find one column of A0 by

$$\min_{q \in \mathbb{S}^{n-1}} \ \varphi_{\mathrm{DL}}(q) := -\frac{1}{4}\left\| q^\top Y \right\|_4^4$$

Theorem (Global Geometry, Qu et al.’19)

Suppose (i) K = m/n is constant and (ii) A0 is near orthogonal. Then every critical point of φDL(q) is either

1. a saddle point that exhibits negative curvature; or
2. close to a target solution, i.e., one column of A0.

Assumptions - Overcomplete DL

Assumptions: near orthogonality of A0

1. Rows orthogonal (unit-norm tight frame, UNTF):
$$A_0 A_0^\top = \frac{m}{n}\, I, \qquad \|a_{0i}\| = 1.$$
2. Columns (nearly) orthogonal (incoherence):
$$\max_{i \neq j}\ \left| \left\langle \frac{a_{0i}}{\|a_{0i}\|},\ \frac{a_{0j}}{\|a_{0j}\|} \right\rangle \right| \ \leq\ \mu.$$
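To make the two conditions concrete, here is a small NumPy sketch (illustrative; the dimensions n = 50, K = 3 are hypothetical): row-whitening a random dictionary yields an exact tight frame with roughly unit-norm columns, whose mutual coherence µ can then be measured directly. The same row-whitening idea reappears later in the talk as the preconditioning step.

```python
# Construct a dictionary satisfying the tight-frame condition exactly: row-whiten a random
# A0 so that A0 A0^T = (m/n) I, then measure the column norms and the mutual coherence mu.
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 3
m = K * n
B = rng.standard_normal((n, m))
# Inverse square root of B B^T (eigendecomposition of a symmetric positive definite matrix).
w, V = np.linalg.eigh(B @ B.T)
A0 = np.sqrt(m / n) * (V @ np.diag(w ** -0.5) @ V.T) @ B      # now A0 A0^T = (m/n) I exactly

col_norms = np.linalg.norm(A0, axis=0)
G = A0.T @ A0 / np.outer(col_norms, col_norms)                 # normalized Gram matrix
mu = np.abs(G - np.eye(m)).max()                               # mutual coherence

print(f"frame error = {np.abs(A0 @ A0.T - (m / n) * np.eye(n)).max():.2e}")
print(f"column norms in [{col_norms.min():.2f}, {col_norms.max():.2f}],  mu = {mu:.2f}")
```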

Problem Formulation - Overcomplete DL

Find one column of A0 by
$$\min_{q}\ \varphi_{\mathrm{DL}}(q) := -\frac{1}{4}\left\| q^\top Y \right\|_4^4, \quad \text{s.t. } q \in \mathbb{S}^{n-1}.$$
(the sphere constraint removes the scaling ambiguity; the negative ℓ4-norm is spikiness promoting)

High-level intuition: for any q⋆ = a0k,
$$\mathbb{E}_{X_0}\!\left[ \varphi\big(q_\star^\top Y\big) \right] \ \propto\ \varphi\big( \underbrace{q_\star^\top A_0}_{\text{correlation } \zeta(q_\star)} \big).$$

If q⋆ = a01, as the columns of A0 are incoherent,
$$\zeta(q_\star) = \Big[\, 1,\ \underbrace{a_{01}^\top a_{02}}_{|\cdot| < \mu},\ \cdots,\ \underbrace{a_{01}^\top a_{0m}}_{|\cdot| < \mu} \,\Big],$$
so ζ(q⋆) is spiky exactly at the target solutions.

Complete DL:      φ(·) = ∥·∥₁    (minimize: promotes sparsity of ζ(q)),    ρ(ζ) = #NNZ;
Overcomplete DL:  φ(·) = −∥·∥₄⁴  (maximize: promotes spikiness of ζ(q)),   ρ(ζ) = ζ(1)/ζ(2).

3a. Li et al., Global geometry of multichannel sparse blind deconvolution on the sphere, NeurIPS'18.
3b. Zhang et al., Structured local optima in sparse blind deconvolution problem, IEEE TIT'19.
3c. Zhai et al., Complete dictionary learning via ℓ4-norm maximization over the orthogonal group, JMLR'20.
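A minimal optimization sketch for this formulation (assumptions: plain projected gradient descent with a hand-picked step size, not the saddle-escaping methods the theory in this talk is stated for; find_one_column is a hypothetical helper name):

```python
# Projected gradient descent on the sphere for  min_q -1/4 ||q^T Y||_4^4,  q in S^{n-1}.
import numpy as np

def find_one_column(Y, step=1e-3, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    for _ in range(iters):
        c = Y.T @ q                          # correlations q^T y_j
        g = -Y @ c ** 3                      # Euclidean gradient of -1/4 ||q^T Y||_4^4
        g -= (g @ q) * q                     # project onto the tangent space of the sphere
        q -= step * g
        q /= np.linalg.norm(q)               # retract back onto the sphere
    return q                                 # should align with (a signed) column of A0

# One run targets one column; repeated independent trials are used to recover the full A0
# (see the experiment slide).
```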

Experiment - Overcomplete DL

♦ Recover the full A0 via repeated independent trials;
♦ practice: m < n^2 vs. theory: m < Cn.

Choice of ℓk-Norm for Dictionary Learning

[Figures: maximizing the ℓk-norm with different k; average error with varying sample complexity p/n and k.]

Image credited to Yuexiang Zhai and Yifei Shen.

- Zhai et al. Complete dictionary learning via ℓ4-norm maximization over the orthogonal group, JMLR’20.

- Shen et al., Complete dictionary learning via ℓp-norm maximization, arXiv preprint'20.

Outline

Overcomplete Dictionary Learning

Convolutional Dictionary Learning

Conclusion

Convolutional Dictionary Learning

Given multiple measurements of circulant convolutions

$$y_i \;=\; \sum_{k=1}^{K} a_k \circledast x_{ki}, \qquad 1 \le i \le p,$$

can we jointly learn all {a_k}_k and sparse {x_ki}_{k,i}?

♦ Here, y_i, a_k, x_ki ∈ R^n;
♦ the signals x_ki are sparse.

- Il Yong Chun and Jeffrey A. Fessler, Convolutional dictionary learning: Acceleration and convergence, TIP'18.
- Cristina Garcia-Cardona and Brendt Wohlberg, Convolutional dictionary learning: A comparative review and new algorithms, TCI'18.
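A small sketch of this measurement model (illustrative; the dimensions and the Bernoulli-Gaussian sparse maps with level θ are assumptions):

```python
# Generate p samples  y_i = sum_k a_k (circular conv) x_ki  with K filters and sparse maps.
import numpy as np

rng = np.random.default_rng(0)
n, K, p, theta = 64, 3, 100, 0.05                     # theta: sparsity level (assumed)

A = rng.standard_normal((K, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)         # unit-norm filters a_k
X = rng.standard_normal((K, p, n)) * (rng.random((K, p, n)) < theta)   # sparse maps x_ki

def cconv(a, x):
    """Circular (circulant) convolution of two length-n signals via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(x)))

Y = np.zeros((n, p))
for i in range(p):
    Y[:, i] = sum(cconv(A[k], X[k, i]) for k in range(K))
```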

Motivations

♦ Learning compact convolutional representations [Bristow et al.'13, Gu et al.'15, Garcia-Cardona et al.'18];
♦ Learning physical models from scientific data [Cheung et al.'18];
♦ Can be viewed as one layer of a convolutional neural network [Papyan et al.'17&18, Ye et al.'18].

- Papyan et al., Convolutional neural networks analyzed via convolutional sparse coding, JMLR'18.
- Papyan et al., Theoretical foundations of deep learning via sparse representations: A multilayer sparse model and its connection to convolutional neural networks, SPM'18.

Problem Formulation - Circulant Matrices

For each ak ⊛ xki, equivalently

$$a_k \circledast x_{ki} \;\Longleftrightarrow\; C_{a_k} \cdot C_{x_{ki}}, \qquad 1 \le k \le K,\ \ 1 \le i \le p,$$

where C_a denotes the circulant matrix whose columns are the cyclic shifts of a.
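A quick NumPy check of this lifting (illustrative): a ⊛ x equals C_a x, and the circulant matrix of a ⊛ x factors as C_a C_x.

```python
# Verify  a (circ-conv) x = C_a x  and  C_{a*x} = C_a C_x  for random length-n signals.
import numpy as np

def circulant(v):
    """Circulant matrix whose j-th column is the j-step cyclic shift of v."""
    n = len(v)
    return np.column_stack([np.roll(v, j) for j in range(n)])

rng = np.random.default_rng(0)
n = 16
a, x = rng.standard_normal(n), rng.standard_normal(n)
conv = np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(x)))        # a ⊛ x via the FFT

print(np.allclose(circulant(a) @ x, conv))                        # a ⊛ x = C_a x
print(np.allclose(circulant(a) @ circulant(x), circulant(conv)))  # C_a C_x = C_{a ⊛ x}
```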

Convolutional DL vs. Overcomplete DL

For each sample (1 ≤ i ≤ p),
$$y_i = \sum_{k=1}^{K} a_k \circledast x_{ki},$$
equivalently,
$$C_{y_i} \;=\; \underbrace{\begin{bmatrix} C_{a_1} & \cdots & C_{a_K} \end{bmatrix}}_{\text{overcomplete } A_0} \cdot \underbrace{\begin{bmatrix} C_{x_{1i}} \\ \vdots \\ C_{x_{Ki}} \end{bmatrix}}_{\text{sparse } X_i}.$$

Given Y = A0 · X0, learn overcomplete A0 and sparse X0?

- Huang et al., Convolutional dictionary learning through tensor factorization, NeurIPS Workshop'15.

From Overcomplete DL to Convolutional DL?

Find one shift of a_i (1 ≤ i ≤ K) via
$$\min_{q}\ -\frac{1}{4}\left\| q^\top Y \right\|_4^4, \quad \text{s.t. } q \in \mathbb{S}^{n-1}\ ?$$

For generic ai and A0, it does NOT work!

♦ Generic ai enforces structures on A0;

♦ For convolutional DL, in general A0 is NOT near orthogonal.

Problem Formulation - Convolutional DL

$$\min_{q}\ \varphi_{\mathrm{CDL}}(q) = -\frac{1}{4}\left\| q^\top P Y \right\|_4^4, \quad \text{s.t. } q \in \mathbb{S}^{n-1},$$

♦ Preconditioning matrix:
$$P \;=\; \Big( (\theta n K)^{-1}\, Y Y^\top \Big)^{-1/2} \;\approx\; \big( A_0 A_0^\top \big)^{-1/2};$$
♦ Row orthogonalization of A0:
$$P Y \;\approx\; \underbrace{\big( A_0 A_0^\top \big)^{-1/2} A_0}_{A}\, X_0 \;=\; A X_0, \qquad\text{so}\qquad \varphi_{\mathrm{CDL}}(q) \;\approx\; -\frac{1}{4}\left\| q^\top A X_0 \right\|_4^4;$$
♦ A is a tight frame (but not necessarily unit norm).
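A sketch of computing this preconditioner (assumptions: the sparsity level θ is known, the sample covariance is averaged over the p samples, and the helper name precondition is hypothetical; a global rescaling of P only rescales the objective, so the exact normalization is not critical):

```python
# P = ((theta*n*K)^{-1} * (1/p) Y Y^T)^{-1/2}, applied to the data before the l4 objective.
import numpy as np

def precondition(Y, theta, K):
    n, p = Y.shape
    M = (Y @ Y.T) / (theta * n * K * p)          # scaled sample covariance, prop. to A0 A0^T
    w, V = np.linalg.eigh(M)                     # assumed positive definite
                                                 # (filters not low-pass, cf. the conclusion)
    P = V @ np.diag(w ** -0.5) @ V.T             # inverse matrix square root
    return P, P @ Y                              # preconditioned data P Y ≈ A X0
```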

Problem Formulation & Preconditioning

[Figures: landscape of the ℓ4-loss without preconditioning (✗) vs. with preconditioning (✓).]

Preconditioning in Neural Networks

Learning a multi-layer perceptron network on the MNIST dataset [2].

2. Credited to Ye et al., Network Deconvolution, ICLR'20.

Convolutional DL - Benign Local Geometry

Theorem (Local Geometry, Qu et al.'19)
Suppose (i) A is near orthogonal and (ii) K is constant. Then, locally, every critical point of φCDL(q) over RCDL is either
1. a saddle point that exhibits negative curvature, or
2. near a target solution, i.e., a preconditioned shift of some a_i.

♦ We show the result over the region
$$\mathcal{R}_{\mathrm{CDL}} \;:=\; \Big\{\, q \in \mathbb{S}^{n-1} \;:\; \mathbb{E}_{X}\big[ \varphi_{\mathrm{CDL}}(q) \big] \;\le\; -\,\xi_{\mathrm{CDL}}\, \big\| A^\top q \big\|^{2} \,\Big\};$$
♦ Initialization is required.

From Geometry to Global Optimization

Theorem (Global convergence, Qu et al.'19)
For typical A0, suppose the sparsity level satisfies θ ≤ (Kn)^(−2/3). Then
1. the initialization q^(0) falls into RCDL;
2. all iterates stay within RCDL;
3. the solution q^(∞) is close to target solutions.

Algorithm:
1. Initialize q^(0) = P_{S^{n−1}}(P y_i) with a random sample y_i, so that
$$\zeta\big(q^{(0)}\big) \;=\; \underbrace{A^\top q^{(0)}}_{\text{spiky}} \;\approx\; \sqrt{K}\; \mathcal{P}_{\mathbb{S}^{n-1}}\big( \underbrace{A^\top A}_{\approx\ \text{diagonal}}\, x_i \big);$$
2. optimize with a saddle-escaping method to obtain q^(∞);
3. return the estimate a⋆ = P_{S^{n−1}}(P^{−1} q^(∞)).
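Putting the three steps together, a compact end-to-end sketch (illustrative: plain projected gradient descent stands in for the saddle-escaping method assumed by the theorem; learn_one_filter is a hypothetical helper name):

```python
# Learn (a shift of) one filter: precondition, initialize from a random sample, optimize the
# spherical l4 objective, then undo the preconditioning.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def learn_one_filter(Y, theta, K, step=1e-3, iters=2000, seed=0):
    n, p = Y.shape
    M = (Y @ Y.T) / (theta * n * K * p)
    w, V = np.linalg.eigh(M)
    P = V @ np.diag(w ** -0.5) @ V.T                  # preconditioner (see earlier sketch)
    Z = P @ Y                                         # Z ≈ A X0
    rng = np.random.default_rng(seed)
    q = normalize(Z[:, rng.integers(p)])              # step 1: random preconditioned sample
    for _ in range(iters):                            # step 2: minimize -1/4 ||q^T Z||_4^4
        g = -Z @ (Z.T @ q) ** 3
        g -= (g @ q) * q
        q = normalize(q - step * g)
    return normalize(np.linalg.solve(P, q))           # step 3: a_star = P_S(P^{-1} q)
```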

Experiment - Convolutional DL

[Figure: learning 3 random filters (Filter 1, Filter 2, Filter 3) by the proposed approach.]

Outline

Overcomplete Dictionary Learning

Convolutional Dictionary Learning

Conclusion

Conclusion & Future Work

♦ Improvement on bounds for overcompleteness;
♦ Extension to study multi-layer convolutional neural networks;
♦ Dealing with low-pass (non-invertible) filters.

Acknowledgement

Yuexiang Zhai (UC Berkeley), Xiao Li (CUHK), Yuqian Zhang (Rutgers), Zhihui Zhu (Denver U.)

THANK YOU!