Analysis of the Optimization Landscapes for Overcomplete Representation Learning
Qing Qu
Center for Data Science New York University
July 19, 2020

Data Increasingly Massive & High-Dimensional: hyperspectral imaging, autonomous driving, social networks, healthcare.

Learning Compact Representations

Summary of Main Results
Convolutional/overcomplete dictionary learning can be provably solved with simple methods.
Q. Qu, Y. Zhai, X. Li, Y. Zhang, Z. Zhu, Analysis of optimization landscapes for overcomplete learning, ICLR'20.

Outline
Overcomplete Dictionary Learning
Convolutional Dictionary Learning
Conclusion

Learning Sparsely-Used Dictionaries
Given Y, can we jointly learn a compact dictionary A0 and sparse X0?

Learning Compact Representations
Applications: denoising, image restoration, super-resolution, image half-toning. (Images courtesy of Julien Mairal et al.)

Dictionary Learning - Symmetry
♦ Permutation symmetry (2^n n! signed permutations Π):

    Y = A0 X0 = (A0 Π)(Π^⊤ X0);

♦ Equivalent solution pairs: (A0, X0) ⟺ (A0 Π, Π^⊤ X0).

Dictionary Learning - Symmetry Leads to Nonconvexity
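This equivalence is easy to check numerically. A minimal sketch (variable names are mine, not from the talk): any signed permutation Π is orthogonal, so (A0 Π)(Π^⊤ X0) reproduces the same data Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 5, 5, 20
A0 = rng.standard_normal((n, m))
X0 = rng.standard_normal((m, p))

# Build a signed permutation: permute columns and flip signs
perm = rng.permutation(m)
signs = rng.choice([-1.0, 1.0], size=m)
Pi = np.zeros((m, m))
Pi[perm, np.arange(m)] = signs        # Pi is orthogonal: Pi @ Pi.T = I

# (A0, X0) and (A0 Pi, Pi^T X0) generate identical data
Y = A0 @ X0
assert np.allclose(Y, (A0 @ Pi) @ (Pi.T @ X0))
```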
Why Do Nonconvex Problems Seem "Scary"?

These learning problems are naturally formulated as nonconvex optimization:

    min_{W ∈ M} φ(Y; W),    (Y: data; W: model parameters)

whose landscape may contain "bad" local minima and "flat" saddle points. In the worst case, finding even a local minimizer is NP-hard [Murty et al. 1987].
Symmetry Creates Benign Nonconvex Geometry

♦ No "bad" local minima;
♦ No "flat" saddle points.
A Fairly Broad Class of Nonconvex Problems

Nonconvex learning problems can be solved efficiently to global solutions!

♦ sparse blind deconvolution [Li et al.'18, Qu et al.'19]
♦ overcomplete dictionary learning [this work]
♦ convolutional dictionary learning [this work]
♦ tensor decomposition [Ge et al.'16]
♦ phase retrieval [Sun et al.'18]
♦ low-rank matrix recovery [Ge et al.'16, Zhu et al.'18]
♦ phase synchronization [Boumal'17]
♦ shallow/linear neural networks [Kawaguchi'17, Du et al.'19]

Global Geometry for Overcomplete DL
Given Y = A0 · X0, learn overcomplete A0 and sparse X0?
Find one column of A0 by

    min_{q ∈ S^{n-1}} φ_DL(q) := -(1/4) ‖q^⊤ Y‖_4^4.

Theorem (Global Geometry, Qu et al.'19). Suppose (i) K = m/n is constant and (ii) A0 is near orthogonal. Then every critical point of φ_DL(q) is either
1. a saddle point that exhibits negative curvature, or
2. close to a target solution, i.e., one column of A0.

Assumptions - Overcomplete DL
Assumptions: near orthogonality of A0

1. Row orthogonal, i.e., unit-norm tight frame (UNTF): √(n/m)·A0 has orthonormal rows,

       (n/m) A0 A0^⊤ = I,   ‖a_{0i}‖ = 1;

2. Column (near) orthogonal, i.e., incoherence:

       max_{i≠j} |⟨a_{0i}/‖a_{0i}‖, a_{0j}/‖a_{0j}‖⟩| ≤ µ.

High-level intuition: for any q⋆ = a_{0k},

    E_{X0}[φ(q⋆^⊤ Y)] ∝ φ(q⋆^⊤ A0),   with correlation ζ(q⋆) := q⋆^⊤ A0.

Complete DL: φ(·) = ‖·‖_1, minimize the sparsity of ζ(q); ρ(ζ) = #NNZ.
Overcomplete DL: φ(·) = -‖·‖_4^4, maximize the spikiness of ζ(q); ρ(ζ) = ζ_(1)/ζ_(2).

3a. Li et al., Global geometry of multichannel sparse blind deconvolution on the sphere, NeurIPS'18.
3b. Zhang et al., Structured local optima in sparse blind deconvolution problem, IEEE TIT'19.
3c. Zhai et al., Complete dictionary learning via ℓ4-norm maximization over the orthogonal group, JMLR'20.
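As a quick numerical illustration of the two assumptions, the sketch below checks the UNTF and incoherence conditions for a dictionary built by stacking two orthonormal bases (the helper names `check_untf` and `coherence` are mine):

```python
import numpy as np

def check_untf(A, tol=1e-8):
    """Unit-norm tight frame: unit-norm columns and (n/m) * A A^T = I."""
    n, m = A.shape
    unit = np.allclose(np.linalg.norm(A, axis=0), 1.0, atol=tol)
    tight = np.allclose((n / m) * (A @ A.T), np.eye(n), atol=tol)
    return unit and tight

def coherence(A):
    """mu = max over i != j of |<a_i/||a_i||, a_j/||a_j||>|."""
    An = A / np.linalg.norm(A, axis=0)
    G = np.abs(An.T @ An)
    np.fill_diagonal(G, 0.0)
    return G.max()

# A union of two orthonormal bases is a UNTF with m = 2n
n = 8
Q = np.linalg.qr(np.random.default_rng(0).standard_normal((n, n)))[0]
A = np.hstack([np.eye(n), Q])
assert check_untf(A)
mu = coherence(A)   # small when the two bases are mutually incoherent
```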
Problem Formulation - Overcomplete DL

Find one column of A0 by

    min_q φ_DL(q) := -(1/4) ‖q^⊤ Y‖_4^4,   s.t. q ∈ S^{n-1}.

The sphere constraint removes the scaling ambiguity, and the negative ℓ4-norm promotes spikiness.
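A minimal numerical sketch of this objective and its Riemannian gradient on the sphere (function names are mine). For Y = I, each standard basis vector maximizes ‖q^⊤Y‖_4^4 over the sphere and is therefore a critical point of φ_DL:

```python
import numpy as np

def phi_dl(q, Y):
    """phi_DL(q) = -(1/4) * ||q^T Y||_4^4."""
    return -0.25 * np.sum((q @ Y) ** 4)

def riemannian_grad(q, Y):
    """Riemannian gradient on S^{n-1}: Euclidean gradient projected
    onto the tangent space at q."""
    g = -Y @ ((q @ Y) ** 3)      # Euclidean gradient of phi_DL
    return g - (g @ q) * q

# Sanity check with Y = I: phi_DL(e_1) = -1/4 and e_1 is a critical point
n = 4
Y = np.eye(n)
e1 = np.zeros(n); e1[0] = 1.0
assert np.isclose(phi_dl(e1, Y), -0.25)
assert np.allclose(riemannian_grad(e1, Y), 0.0)
```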
If q⋆ = a_{01}, then since the columns of A0 are incoherent,

    ζ(q⋆) = [1, a_{01}^⊤ a_{02}, …, a_{01}^⊤ a_{0m}],   with |a_{01}^⊤ a_{0j}| ≤ µ for j ≠ 1,

so the correlation ζ(q⋆) is spiky: one entry equals 1 while all others are at most µ in magnitude.
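The spikiness of the correlation at a dictionary column can be seen on a random incoherent dictionary; a small sketch (dimensions and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 200
A0 = rng.standard_normal((n, m))
A0 /= np.linalg.norm(A0, axis=0)   # unit-norm columns, incoherent w.h.p.

# Correlation zeta(q) = q^T A0 evaluated at the first column
zeta = A0[:, 0] @ A0
mu = np.sort(np.abs(zeta))[-2]     # largest off-peak correlation

# The matched entry equals 1; every other entry is at most mu << 1
peak = np.argmax(np.abs(zeta))
```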
Experiment - Overcomplete DL
In practice, m < n² suffices (the full A0 is recovered via repeated independent trials), versus m < Cn in the theory.

Choice of ℓk-Norm for Dictionary Learning

[Figures: average error with varying sample complexity p/n and k; maximizing the ℓk-norm with different k. Images credited to Yuexiang Zhai and Yifei Shen.]

- Zhai et al., Complete dictionary learning via ℓ4-norm maximization over the orthogonal group, JMLR'20.
- Shen et al., Complete dictionary learning via ℓp-norm maximization, arXiv preprint'20.

Outline
Overcomplete Dictionary Learning
Convolutional Dictionary Learning
Conclusion

Convolutional Dictionary Learning
Given multiple measurements y_i generated by circulant convolutions

    y_i = Σ_{k=1}^{K} a_k ⊛ x_{ki},   1 ≤ i ≤ p,

can we jointly learn all the {a_k}_k and sparse {x_{ki}}_{k,i}?

♦ Here y_i, a_k, x_{ki} ∈ R^n;
♦ each x_{ki} is sparse.

- Il Yong Chun and Jeffrey A. Fessler, Convolutional dictionary learning: Acceleration and convergence, TIP'18.
- Cristina Garcia-Cardona and Brendt Wohlberg, Convolutional dictionary learning: A comparative review and new algorithms, TCI'18.

Motivations
Learning compact convolutional representations [Bristow et al.’13, Gu et al.’15, Garcia-Cardona et al.’18]
Learning physical models from scientific data [Cheung et al.'18]
Can be viewed as one-layer of convolutional neural network [Papyan et al.’17&18, Ye et al.’18].
- Papyan et al., Convolutional neural networks analyzed via convolutional sparse coding, JMLR’18.
- Papyan et al., Theoretical foundations of deep learning via sparse representations: A multilayer sparse model and
its connection to convolutional neural networks, SPM'18.

Problem Formulation - Circulant Matrices
For each term a_k ⊛ x_{ki}, equivalently,

    a_k ⊛ x_{ki} ⟺ C_{a_k} · x_{ki},   1 ≤ k ≤ K, 1 ≤ i ≤ p,

where C_{a_k} is the circulant matrix generated by a_k (so that C_{a_k ⊛ x_{ki}} = C_{a_k} C_{x_{ki}}).
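The circulant identities above can be verified directly; a minimal sketch (function names are mine, circular convolution computed via the FFT):

```python
import numpy as np

def circulant(a):
    """Circulant matrix whose j-th column is a circularly shifted by j."""
    return np.column_stack([np.roll(a, j) for j in range(len(a))])

def circ_conv(a, x):
    """Circular convolution a ⊛ x via the FFT diagonalization."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(x)))

rng = np.random.default_rng(0)
a, x = rng.standard_normal(8), rng.standard_normal(8)

# a ⊛ x equals the circulant matrix-vector product C_a x ...
assert np.allclose(circ_conv(a, x), circulant(a) @ x)
# ... and circulants compose: C_{a ⊛ x} = C_a C_x
assert np.allclose(circulant(circ_conv(a, x)), circulant(a) @ circulant(x))
```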
For each sample (1 ≤ i ≤ p),

    y_i = Σ_{k=1}^{K} a_k ⊛ x_{ki},

equivalently,

    C_{y_i} = [C_{a_1} ⋯ C_{a_K}] · [C_{x_{1i}}; ⋯ ; C_{x_{Ki}}],

where [C_{a_1} ⋯ C_{a_K}] plays the role of the overcomplete A0 and the stacked C_{x_{ki}} the role of the sparse X_i.

Convolutional DL vs. Overcomplete DL
Given Y = A0 · X0, learn overcomplete A0 and sparse X0?
- Huang et al., Convolutional dictionary learning through tensor factorization, NeurIPS Workshop'15.

From Overcomplete DL to Convolutional DL?
Find one shift of a_i (1 ≤ i ≤ K) via

    min_q -(1/4) ‖q^⊤ Y‖_4^4,   s.t. q ∈ S^{n-1} ?

For generic a_i and A0, it does NOT work!

♦ Generic a_i enforce structure on A0;
♦ for convolutional DL, A0 is in general NOT near orthogonal.

Problem Formulation - Convolutional DL
    min_q φ_CDL(q) := -(1/4) ‖q^⊤ P Y‖_4^4,   s.t. q ∈ S^{n-1},

♦ Preconditioning matrix:

    P = ((θnK)^{-1} Y Y^⊤)^{-1/2} ≈ (A0 A0^⊤)^{-1/2}.
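The preconditioner can be formed directly from the data. A sketch (function names are mine); by construction (PY)(PY)^⊤ = θnK · I, i.e., the rows of Y are exactly orthogonalized:

```python
import numpy as np

def inv_sqrt(M):
    """Inverse matrix square root of a symmetric positive definite M."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def precondition(Y, theta, K):
    """P = ((theta * n * K)^{-1} Y Y^T)^{-1/2}."""
    n = Y.shape[0]
    return inv_sqrt(Y @ Y.T / (theta * n * K))

# Sanity check: P Y has exactly orthogonal rows, (P Y)(P Y)^T = theta*n*K * I
rng = np.random.default_rng(0)
n, p, theta, K = 16, 500, 0.1, 2
Y = rng.standard_normal((n, p))
P = precondition(Y, theta, K)
Z = P @ Y
assert np.allclose(Z @ Z.T, theta * n * K * np.eye(n))
```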
With preconditioning,

    φ_CDL(q) ≈ -(1/4) ‖q^⊤ A X0‖_4^4,   q ∈ S^{n-1},

♦ Row orthogonalization of A0:

    P Y ≈ (A0 A0^⊤)^{-1/2} A0 X0 =: A X0;

♦ A is a tight frame (but not necessarily unit norm).

Problem Formulation & Preconditioning
[Figure: landscape of the ℓ4-loss without vs. with preconditioning.]

Preconditioning in Neural Networks

Learning a multi-layer perceptron network on the MNIST dataset. (Credited to Ye et al., Network Deconvolution, ICLR'20.)
Convolutional DL - Benign Local Geometry

Theorem (Local Geometry, Qu et al.'19). Suppose (i) A is near orthogonal and (ii) K is constant. Locally, every critical point of φ_CDL(q) on R_CDL is either
1. a saddle point that exhibits negative curvature, or
2. near a target solution, i.e., a preconditioned shift of some a_i.
♦ We show the result over the region

    R_CDL := { q ∈ S^{n-1} : E_X[φ_CDL(q)] ≤ -ξ_CDL ‖A^⊤ q‖² };

♦ initialization is required:

1. Initialize q^(0) = P_{S^{n-1}}(P y_i) with a random sample y_i, so that
   ζ(q^(0)) = A^⊤ q^(0) ≈ √K · P_{S^{n-1}}(A^⊤ A x_i), where A^⊤ A is approximately (spiky) diagonal;
2. optimize with a saddle-escaping method to obtain q^(∞);
3. return the estimate a⋆ = P_{S^{n-1}}(P^{-1} q^(∞)).
From Geometry to Global Optimization
Theorem (Global Convergence, Qu et al.'19). For typical A0, suppose the sparsity level satisfies θ ≤ (Kn)^{-2/3}. Then
1. the initialization q^(0) falls into R_CDL;
2. all iterates stay within R_CDL;
3. the solution q^(∞) is close to the target solutions.

Experiment - Convolutional DL
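As a toy end-to-end illustration, the sketch below runs the initialize-optimize-rescale pipeline on synthetic convolutional data. All names are mine, and the saddle-escaping step is replaced by a simple power-iteration-style update (my simplification: since ‖q^⊤Z‖_4^4 is convex in q, the update q ← P_{S^{n-1}}(∇f(q)) never decreases the objective):

```python
import numpy as np

def circulant(a):
    """Circulant matrix whose j-th column is a circularly shifted by j."""
    return np.column_stack([np.roll(a, j) for j in range(len(a))])

def inv_sqrt(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def sphere(v):
    """Projection onto the unit sphere."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
n, K, p, theta = 32, 2, 200, 0.1

# Synthetic data: K random unit-norm filters, Bernoulli-Gaussian sparse maps
a_true = [sphere(rng.standard_normal(n)) for _ in range(K)]
X = rng.standard_normal((K, p, n)) * (rng.random((K, p, n)) < theta)
Y = np.stack([sum(circulant(a_true[k]) @ X[k, i] for k in range(K))
              for i in range(p)], axis=1)        # n x p data matrix

# Step 0: precondition, P = ((theta*n*K)^{-1} Y Y^T)^{-1/2}
P = inv_sqrt(Y @ Y.T / (theta * n * K))
Z = P @ Y

# Step 1: initialize from a random preconditioned sample
q = sphere(Z[:, 0])

# Step 2: maximize f(q) = ||q^T Z||_4^4 over the sphere
f = lambda q: np.sum((q @ Z) ** 4)
f0 = f(q)
for _ in range(100):
    q = sphere(Z @ (q @ Z) ** 3)   # normalized-gradient update

# Step 3: undo the preconditioning to estimate one (shifted) filter
a_est = sphere(np.linalg.inv(P) @ q)
```

One can then compare `a_est` against all circular shifts of the true filters, e.g. the maximum over k, j of `abs(np.roll(a_true[k], j) @ a_est)`, which approaches 1 when a shift is recovered.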
[Figure: panels "Filter 1", "Filter 2", "Filter 3"; learning 3 random filters by the proposed approach.]

Outline
Overcomplete Dictionary Learning
Convolutional Dictionary Learning
Conclusion

Conclusion & Future Work

♦ Improvement on bounds for overcompleteness;
♦ Extension to study multi-layer convolutional neural networks;
♦ Dealing with low-pass (non-invertible) filters.

Acknowledgement
Yuexiang Zhai (UC Berkeley), Xiao Li (CUHK), Yuqian Zhang (Rutgers), Zhihui Zhu (Denver U.)

THANK YOU!