Analysis of the Optimization Landscapes for Overcomplete Representation Learning
Qing Qu
Center for Data Science New York University
July 19, 2020

Data Increasingly Massive & High-Dimensional: hyperspectral imaging, autonomous driving, social networks, healthcare.

Learning Compact Representations

Summary of Main Results
Convolutional/overcomplete dictionary learning can be provably solved with simple methods.
Q. Qu, Y. Zhai, X. Li, Y. Zhang, Z. Zhu, Analysis of optimization landscapes for overcomplete learning, ICLR'20.

Outline
Overcomplete Dictionary Learning
Convolutional Dictionary Learning
Conclusion

Learning Sparsely-Used Dictionaries
Given Y, can we jointly learn a compact dictionary A0 and sparse X0?

Learning Compact Representations
Applications: denoising, image restoration, super-resolution, image half-toning. (Images courtesy of Julien Mairal et al.)

Dictionary Learning - Symmetry
♦ Permutation symmetry (2^n n! signed permutations Π):

    Y = A0 X0 = (A0 Π)(Π^⊤ X0);

♦ Equivalent solution pairs: (A0, X0) ⟺ (A0 Π, Π^⊤ X0).

Dictionary Learning - Symmetry Leads to Nonconvexity
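This equivalence is easy to check numerically. A minimal sketch (variable names are mine, not from the talk): any signed permutation Π is orthogonal, so (A0 Π)(Π^⊤ X0) reproduces the same data Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 5, 5, 20
A0 = rng.standard_normal((n, m))
X0 = rng.standard_normal((m, p))

# Build a signed permutation: permute columns and flip signs
perm = rng.permutation(m)
signs = rng.choice([-1.0, 1.0], size=m)
Pi = np.zeros((m, m))
Pi[perm, np.arange(m)] = signs        # Pi is orthogonal: Pi @ Pi.T = I

# (A0, X0) and (A0 Pi, Pi^T X0) generate identical data
Y = A0 @ X0
assert np.allclose(Y, (A0 @ Pi) @ (Pi.T @ X0))
```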
Why Do Nonconvex Problems Seem "Scary"?

These learning problems are naturally formulated as nonconvex optimization:

    min_{W ∈ M} φ(Y; W),    (Y: data; W: model parameters)

whose landscape may contain "bad" local minima and "flat" saddle points. In the worst case, finding even a local minimizer is NP-hard [Murty et al. 1987].
Symmetry Creates Benign Nonconvex Geometry

♦ No "bad" local minima;
♦ No "flat" saddle points.
A Fairly Broad Class of Nonconvex Problems

Nonconvex learning problems can be solved efficiently to global solutions!

♦ sparse blind deconvolution [Li et al.'18, Qu et al.'19]
♦ overcomplete dictionary learning [this work]
♦ convolutional dictionary learning [this work]
♦ tensor decomposition [Ge et al.'16]
♦ phase retrieval [Sun et al.'18]
♦ low-rank matrix recovery [Ge et al.'16, Zhu et al.'18]
♦ phase synchronization [Boumal'17]
♦ shallow/linear neural networks [Kawaguchi'17, Du et al.'19]

Global Geometry for Overcomplete DL
Given Y = A0 · X0, learn overcomplete A0 and sparse X0?
Find one column of A0 by

    min_{q ∈ S^{n-1}} φ_DL(q) := -(1/4) ‖q^⊤ Y‖_4^4.

Theorem (Global Geometry, Qu et al.'19). Suppose (i) K = m/n is constant and (ii) A0 is near orthogonal. Then every critical point of φ_DL(q) is either
1. a saddle point that exhibits negative curvature, or
2. close to a target solution, i.e., one column of A0.

Assumptions - Overcomplete DL
Assumptions: near orthogonality of A0

1. Row orthogonal, i.e., unit-norm tight frame (UNTF): √(n/m)·A0 has orthonormal rows,

       (n/m) A0 A0^⊤ = I,   ‖a_{0i}‖ = 1;

2. Column (near) orthogonal, i.e., incoherence:

       max_{i≠j} |⟨a_{0i}/‖a_{0i}‖, a_{0j}/‖a_{0j}‖⟩| ≤ µ.

High-level intuition: for any q⋆ = a_{0k},

    E_{X0}[φ(q⋆^⊤ Y)] ∝ φ(q⋆^⊤ A0),   with correlation ζ(q⋆) := q⋆^⊤ A0.

Complete DL: φ(·) = ‖·‖_1, minimize the sparsity of ζ(q); ρ(ζ) = #NNZ.
Overcomplete DL: φ(·) = -‖·‖_4^4, maximize the spikiness of ζ(q); ρ(ζ) = ζ_(1)/ζ_(2).

3a. Li et al., Global geometry of multichannel sparse blind deconvolution on the sphere, NeurIPS'18.
3b. Zhang et al., Structured local optima in sparse blind deconvolution problem, IEEE TIT'19.
3c. Zhai et al., Complete dictionary learning via ℓ4-norm maximization over the orthogonal group, JMLR'20.
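As a quick numerical illustration of the two assumptions, the sketch below checks the UNTF and incoherence conditions for a dictionary built by stacking two orthonormal bases (the helper names `check_untf` and `coherence` are mine):

```python
import numpy as np

def check_untf(A, tol=1e-8):
    """Unit-norm tight frame: unit-norm columns and (n/m) * A A^T = I."""
    n, m = A.shape
    unit = np.allclose(np.linalg.norm(A, axis=0), 1.0, atol=tol)
    tight = np.allclose((n / m) * (A @ A.T), np.eye(n), atol=tol)
    return unit and tight

def coherence(A):
    """mu = max over i != j of |<a_i/||a_i||, a_j/||a_j||>|."""
    An = A / np.linalg.norm(A, axis=0)
    G = np.abs(An.T @ An)
    np.fill_diagonal(G, 0.0)
    return G.max()

# A union of two orthonormal bases is a UNTF with m = 2n
n = 8
Q = np.linalg.qr(np.random.default_rng(0).standard_normal((n, n)))[0]
A = np.hstack([np.eye(n), Q])
assert check_untf(A)
mu = coherence(A)   # small when the two bases are mutually incoherent
```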
Problem Formulation - Overcomplete DL

Find one column of A0 by

    min_q φ_DL(q) := -(1/4) ‖q^⊤ Y‖_4^4,   s.t. q ∈ S^{n-1}.

The sphere constraint removes the scaling ambiguity, and the negative ℓ4-norm promotes spikiness.
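A minimal numerical sketch of this objective and its Riemannian gradient on the sphere (function names are mine). For Y = I, each standard basis vector maximizes ‖q^⊤Y‖_4^4 over the sphere and is therefore a critical point of φ_DL:

```python
import numpy as np

def phi_dl(q, Y):
    """phi_DL(q) = -(1/4) * ||q^T Y||_4^4."""
    return -0.25 * np.sum((q @ Y) ** 4)

def riemannian_grad(q, Y):
    """Riemannian gradient on S^{n-1}: Euclidean gradient projected
    onto the tangent space at q."""
    g = -Y @ ((q @ Y) ** 3)      # Euclidean gradient of phi_DL
    return g - (g @ q) * q

# Sanity check with Y = I: phi_DL(e_1) = -1/4 and e_1 is a critical point
n = 4
Y = np.eye(n)
e1 = np.zeros(n); e1[0] = 1.0
assert np.isclose(phi_dl(e1, Y), -0.25)
assert np.allclose(riemannian_grad(e1, Y), 0.0)
```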
If q⋆ = a_{01}, then since the columns of A0 are incoherent,

    ζ(q⋆) = [1, a_{01}^⊤ a_{02}, …, a_{01}^⊤ a_{0m}],   with |a_{01}^⊤ a_{0j}| ≤ µ for j ≠ 1,

so the correlation ζ(q⋆) is spiky: one entry equals 1 while all others are at most µ in magnitude.
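The spikiness of the correlation at a dictionary column can be seen on a random incoherent dictionary; a small sketch (dimensions and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 200
A0 = rng.standard_normal((n, m))
A0 /= np.linalg.norm(A0, axis=0)   # unit-norm columns, incoherent w.h.p.

# Correlation zeta(q) = q^T A0 evaluated at the first column
zeta = A0[:, 0] @ A0
mu = np.sort(np.abs(zeta))[-2]     # largest off-peak correlation

# The matched entry equals 1; every other entry is at most mu << 1
peak = np.argmax(np.abs(zeta))
```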
Experiment - Overcomplete DL
In practice, m < n² suffices (the full A0 is recovered via repeated independent trials), versus m < Cn in the theory.

Choice of ℓk-Norm for Dictionary Learning

[Figures: average error with varying sample complexity p/n and k; maximizing the ℓk-norm with different k. Images credited to Yuexiang Zhai and Yifei Shen.]

- Zhai et al., Complete dictionary learning via ℓ4-norm maximization over the orthogonal group, JMLR'20.
- Shen et al., Complete dictionary learning via ℓp-norm maximization, arXiv preprint'20.

Outline
Overcomplete Dictionary Learning
Convolutional Dictionary Learning
Conclusion

Convolutional Dictionary Learning
Given multiple measurements y_i generated by circulant convolutions

    y_i = Σ_{k=1}^{K} a_k ⊛ x_{ki},   1 ≤ i ≤ p,

can we jointly learn all the {a_k}_k and sparse {x_{ki}}_{k,i}?

♦ Here y_i, a_k, x_{ki} ∈ R^n;
♦ each x_{ki} is sparse.

- Il Yong Chun and Jeffrey A. Fessler, Convolutional dictionary learning: Acceleration and convergence, TIP'18.
- Cristina Garcia-Cardona and Brendt Wohlberg, Convolutional dictionary learning: A comparative review and new algorithms, TCI'18.

Motivations
Learning compact convolutional representations [Bristow et al.’13, Gu et al.’15, Garcia-Cardona et al.’18]
Learning physical models from scientific data [Cheung et al.'18]
Can be viewed as one-layer of convolutional neural network [Papyan et al.’17&18, Ye et al.’18].
- Papyan et al., Convolutional neural networks analyzed via convolutional sparse coding, JMLR’18.
- Papyan et al., Theoretical foundations of deep learning via sparse representations: A multilayer sparse model and
its connection to convolutional neural networks, SPM'18.

Problem Formulation - Circulant Matrices
For each term a_k ⊛ x_{ki}, equivalently,

    a_k ⊛ x_{ki} ⟺ C_{a_k} · x_{ki},   1 ≤ k ≤ K, 1 ≤ i ≤ p,

where C_{a_k} is the circulant matrix generated by a_k (so that C_{a_k ⊛ x_{ki}} = C_{a_k} C_{x_{ki}}).
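The circulant identities above can be verified directly; a minimal sketch (function names are mine, circular convolution computed via the FFT):

```python
import numpy as np

def circulant(a):
    """Circulant matrix whose j-th column is a circularly shifted by j."""
    return np.column_stack([np.roll(a, j) for j in range(len(a))])

def circ_conv(a, x):
    """Circular convolution a ⊛ x via the FFT diagonalization."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(x)))

rng = np.random.default_rng(0)
a, x = rng.standard_normal(8), rng.standard_normal(8)

# a ⊛ x equals the circulant matrix-vector product C_a x ...
assert np.allclose(circ_conv(a, x), circulant(a) @ x)
# ... and circulants compose: C_{a ⊛ x} = C_a C_x
assert np.allclose(circulant(circ_conv(a, x)), circulant(a) @ circulant(x))
```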
For each sample (1 ≤ i ≤ p),

    y_i = Σ_{k=1}^{K} a_k ⊛ x_{ki},

equivalently,

    C_{y_i} = [C_{a_1} ⋯ C_{a_K}] · [C_{x_{1i}}; ⋯ ; C_{x_{Ki}}],

where [C_{a_1} ⋯ C_{a_K}] plays the role of the overcomplete A0 and the stacked C_{x_{ki}} the role of the sparse X_i.

Convolutional DL vs. Overcomplete DL
Given Y = A0 · X0, learn overcomplete A0 and sparse X0?
- Huang et al., Convolutional dictionary learning through tensor factorization, NeurIPS Workshop'15.

From Overcomplete DL to Convolutional DL?
Find one shift of a_i (1 ≤ i ≤ K) via

    min_q -(1/4) ‖q^⊤ Y‖_4^4,   s.t. q ∈ S^{n-1} ?

For generic a_i and A0, it does NOT work!

♦ Generic a_i enforce structure on A0;
♦ for convolutional DL, A0 is in general NOT near orthogonal.

Problem Formulation - Convolutional DL
    min_q φ_CDL(q) := -(1/4) ‖q^⊤ P Y‖_4^4,   s.t. q ∈ S^{n-1},

♦ Preconditioning matrix:

    P = ((θnK)^{-1} Y Y^⊤)^{-1/2} ≈ (A0 A0^⊤)^{-1/2}.
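The preconditioner can be formed directly from the data. A sketch (function names are mine); by construction (PY)(PY)^⊤ = θnK · I, i.e., the rows of Y are exactly orthogonalized:

```python
import numpy as np

def inv_sqrt(M):
    """Inverse matrix square root of a symmetric positive definite M."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def precondition(Y, theta, K):
    """P = ((theta * n * K)^{-1} Y Y^T)^{-1/2}."""
    n = Y.shape[0]
    return inv_sqrt(Y @ Y.T / (theta * n * K))

# Sanity check: P Y has exactly orthogonal rows, (P Y)(P Y)^T = theta*n*K * I
rng = np.random.default_rng(0)
n, p, theta, K = 16, 500, 0.1, 2
Y = rng.standard_normal((n, p))
P = precondition(Y, theta, K)
Z = P @ Y
assert np.allclose(Z @ Z.T, theta * n * K * np.eye(n))
```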
With preconditioning,

    φ_CDL(q) ≈ -(1/4) ‖q^⊤ A X0‖_4^4,   q ∈ S^{n-1},

♦ Row orthogonalization of A0:

    P Y ≈ (A0 A0^⊤)^{-1/2} A0 X0 =: A X0;

♦ A is a tight frame (but not necessarily unit norm).

Problem Formulation & Preconditioning
[Figure: landscape of the ℓ4-loss without vs. with preconditioning.]

Preconditioning in Neural Networks

Learning a multi-layer perceptron network on the MNIST dataset. (Credited to Ye et al., Network Deconvolution, ICLR'20.)
Convolutional DL - Benign Local Geometry

Theorem (Local Geometry, Qu et al.'19). Suppose (i) A is near orthogonal and (ii) K is constant. Locally, every critical point of φ_CDL(q) on R_CDL is either
1. a saddle point that exhibits negative curvature, or
2. near a target solution, i.e., a preconditioned shift of some a_i.
♦ We show the result over the region

    R_CDL := { q ∈ S^{n-1} : E_X[φ_CDL(q)] ≤ -ξ_CDL ‖A^⊤ q‖² };

♦ initialization is required:

1. Initialize q^(0) = P_{S^{n-1}}(P y_i) with a random sample y_i, so that
   ζ(q^(0)) = A^⊤ q^(0) ≈ √K · P_{S^{n-1}}(A^⊤ A x_i), where A^⊤ A is approximately (spiky) diagonal;
2. optimize with a saddle-escaping method to obtain q^(∞);
3. return the estimate a⋆ = P_{S^{n-1}}(P^{-1} q^(∞)).
From Geometry to Global Optimization
Theorem (Global Convergence, Qu et al.'19). For typical A0, suppose the sparsity level satisfies θ ≤ (Kn)^{-2/3}. Then
1. the initialization q^(0) falls into R_CDL;
2. all iterates stay within R_CDL;
3. the solution q^(∞) is close to the target solutions.

Experiment - Convolutional DL
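As a toy end-to-end illustration, the sketch below runs the initialize-optimize-rescale pipeline on synthetic convolutional data. All names are mine, and the saddle-escaping step is replaced by a simple power-iteration-style update (my simplification: since ‖q^⊤Z‖_4^4 is convex in q, the update q ← P_{S^{n-1}}(∇f(q)) never decreases the objective):

```python
import numpy as np

def circulant(a):
    """Circulant matrix whose j-th column is a circularly shifted by j."""
    return np.column_stack([np.roll(a, j) for j in range(len(a))])

def inv_sqrt(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def sphere(v):
    """Projection onto the unit sphere."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
n, K, p, theta = 32, 2, 200, 0.1

# Synthetic data: K random unit-norm filters, Bernoulli-Gaussian sparse maps
a_true = [sphere(rng.standard_normal(n)) for _ in range(K)]
X = rng.standard_normal((K, p, n)) * (rng.random((K, p, n)) < theta)
Y = np.stack([sum(circulant(a_true[k]) @ X[k, i] for k in range(K))
              for i in range(p)], axis=1)        # n x p data matrix

# Step 0: precondition, P = ((theta*n*K)^{-1} Y Y^T)^{-1/2}
P = inv_sqrt(Y @ Y.T / (theta * n * K))
Z = P @ Y

# Step 1: initialize from a random preconditioned sample
q = sphere(Z[:, 0])

# Step 2: maximize f(q) = ||q^T Z||_4^4 over the sphere
f = lambda q: np.sum((q @ Z) ** 4)
f0 = f(q)
for _ in range(100):
    q = sphere(Z @ (q @ Z) ** 3)   # normalized-gradient update

# Step 3: undo the preconditioning to estimate one (shifted) filter
a_est = sphere(np.linalg.inv(P) @ q)
```

One can then compare `a_est` against all circular shifts of the true filters, e.g. the maximum over k, j of `abs(np.roll(a_true[k], j) @ a_est)`, which approaches 1 when a shift is recovered.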
[Figure: panels "Filter 1", "Filter 2", "Filter 3"; learning 3 random filters by the proposed approach.]

Outline
Overcomplete Dictionary Learning
Convolutional Dictionary Learning
Conclusion

Conclusion & Future Work

♦ Improvement on bounds for overcompleteness;
♦ Extension to study multi-layer convolutional neural networks;
♦ Dealing with low-pass (non-invertible) filters.

Acknowledgement
Yuexiang Zhai (UC Berkeley), Xiao Li (CUHK), Yuqian Zhang (Rutgers), Zhihui Zhu (Denver U.)

THANK YOU!