Guaranteed Non-Convex Machine Learning using Tensor Methods

Anima Anandkumar


Amazon & U.C. Irvine


Winedale 2016

Regime of Modern Machine Learning

Massive datasets, growth in computation power, challenging tasks.

Recent Success in Machine Learning

Tremendous improvement in prediction accuracy. Deployment in the real world.

Image classification, speech recognition, text processing.

Automated algorithms without requiring extensive tuning.

Missing Link in AI: Guaranteed Methods

Automated algorithms without requiring extensive tuning.

Missing Link in AI: Unsupervised Learning

Unsupervised learning to achieve human-level intelligence.

Image Credit: Whizz Education

Convex vs. Nonconvex Optimization

Convex: unique optimum (global and local coincide). Nonconvex: multiple local optima.

Guaranteed approaches for reaching global optima?

Guaranteed Learning through Tensor Methods

Replace the objective function, e.g. maximum likelihood vs. best tensor decomposition.

Preserves Global Optimum (infinite samples)

$$\arg\max_{\theta} p(x;\theta) \;=\; \arg\min_{\theta} \|\widehat{T}(x) - T(\theta)\|_F^2$$

$\widehat{T}(x)$: empirical tensor, $T(\theta)$: low-rank tensor based on $\theta$.

Finding globally optimal tensor decomposition: simple algorithms succeed under mild and natural conditions for many learning problems.
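As a concrete illustration of the objective above, here is a minimal numpy sketch (not the paper's algorithm): form an empirical third-order moment tensor from simulated data and evaluate the Frobenius-norm objective against a candidate low-rank tensor. The model, sample sizes, and parameters are made up for the example.

```python
import numpy as np

# Hedged sketch: compare the empirical third-order moment tensor T_hat(x)
# with a candidate low-rank tensor T(theta) in Frobenius norm.
rng = np.random.default_rng(0)
d, n, k = 5, 10000, 2

# Ground-truth components (the "theta" of a simple mixture-style model).
A_true = rng.normal(size=(d, k))
weights = np.array([0.6, 0.4])

# Simulated samples: each x is one component plus small noise (purely illustrative).
z = rng.choice(k, size=n, p=weights)
x = A_true[:, z].T + 0.01 * rng.normal(size=(n, d))

# Empirical third-moment tensor T_hat = mean of x ⊗ x ⊗ x.
T_hat = np.einsum('ni,nj,nk->ijk', x, x, x) / n

def low_rank_tensor(A, lam):
    """T(theta) = sum_r lam_r * a_r ⊗ a_r ⊗ a_r."""
    return np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)

# Frobenius-norm objective || T_hat - T(theta) ||_F^2, evaluated at the true parameters.
obj = np.linalg.norm(T_hat - low_rank_tensor(A_true, weights)) ** 2
print(obj)
```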

Outline

1 Introduction

2 Why Tensors?

3 Tensor Decomposition Methods

4 Learning Using Tensor Methods

5 Conclusion

Example: Discovering Latent Factors

List of scores for students (Alice, Bob, Carol, Dave, Eve) in different tests (Math, Physics, Music, Classics).

Learn hidden factors for Verbal and Mathematical Intelligence [C. Spearman 1904].

$$\text{Score}(\text{student}, \text{test}) = \text{student}_{\text{verbal-intlg}} \times \text{test}_{\text{verbal}} + \text{student}_{\text{math-intlg}} \times \text{test}_{\text{math}}$$
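A minimal numpy sketch of this model: the score matrix is the sum of two rank-1 (outer-product) terms. All names and numbers below are made up for illustration.

```python
import numpy as np

# Illustrative sketch: a students-by-tests score matrix as a sum of two rank-1 terms.
students = ["Alice", "Bob", "Carol", "Dave", "Eve"]
tests = ["Math", "Physics", "Music", "Classics"]

verbal_intlg = np.array([0.9, 0.4, 0.7, 0.3, 0.8])   # per-student verbal intelligence
math_intlg   = np.array([0.5, 0.9, 0.6, 0.8, 0.2])   # per-student math intelligence
test_verbal  = np.array([0.2, 0.1, 0.8, 0.9])        # how "verbal" each test is
test_math    = np.array([0.9, 0.8, 0.1, 0.2])        # how "mathematical" each test is

# Score(student, test) = verbal_intlg * test_verbal + math_intlg * test_math
scores = np.outer(verbal_intlg, test_verbal) + np.outer(math_intlg, test_math)
print(scores.shape)  # (5, 4): students x tests, a rank-2 matrix
```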

Matrix Decomposition: Discovering Latent Factors

[Figure: students-by-tests score matrix decomposed into a Verbal rank-1 term plus a Math rank-1 term]

Identifying hidden factors influencing the observations, characterized as matrix decomposition.

Tensor: Shared Matrix Decomposition

[Figure: oral-exam and written-exam score matrices share the same Verbal and Math rank-1 factors, with different scalings]

Shared decomposition with different scaling factors. Combine matrix slices as a tensor.

Tensor Decomposition

[Figure: (students x tests x oral/written) tensor decomposed into Verbal and Math rank-1 terms]

Outer product notation:

$$T = u \otimes v \otimes w + \tilde{u} \otimes \tilde{v} \otimes \tilde{w}, \qquad T_{i_1,i_2,i_3} = u_{i_1} \cdot v_{i_2} \cdot w_{i_3} + \tilde{u}_{i_1} \cdot \tilde{v}_{i_2} \cdot \tilde{w}_{i_3}$$
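A short numpy sketch of the outer-product notation (dimensions chosen arbitrarily): build a rank-2 third-order tensor and check one entry against the elementwise formula.

```python
import numpy as np

# Minimal sketch: T = u ⊗ v ⊗ w + u~ ⊗ v~ ⊗ w~, verified entrywise.
rng = np.random.default_rng(0)
u, v, w = rng.normal(size=3), rng.normal(size=4), rng.normal(size=5)
ut, vt, wt = rng.normal(size=3), rng.normal(size=4), rng.normal(size=5)

T = np.einsum('i,j,k->ijk', u, v, w) + np.einsum('i,j,k->ijk', ut, vt, wt)

i1, i2, i3 = 1, 2, 3
entry = u[i1] * v[i2] * w[i3] + ut[i1] * vt[i2] * wt[i3]
assert np.isclose(T[i1, i2, i3], entry)
print(T.shape)  # (3, 4, 5)
```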

Identifiability under Tensor Decomposition

$$T = \lambda_1\, a_1^{\otimes 3} + \lambda_2\, a_2^{\otimes 3} + \cdots$$

Uniqueness of Tensor Decomposition [J. Kruskal 1977]

Above tensor decomposition: unique when the rank-one components are linearly independent.

Matrix case: unique only when the rank-one components are orthogonal.

[Figure: components λ1 a1 and λ2 a2 illustrated for the tensor vs. matrix case]

Outline

1 Introduction

2 Why Tensors?

3 Tensor Decomposition Methods

4 Learning Using Tensor Methods

5 Conclusion

Notion of Tensor Contraction

Extends the notion of matrix product.

Matrix product: $Mv = \sum_{j} v_j M_j$.   Tensor contraction: $T(u, v, \cdot) = \sum_{i,j} u_i v_j T_{i,j,:}$.



StridedBatchedGEMM for tensor contractions, in cuBLAS 8.0. Avoids explicit transpositions or permutations. 10x (GPU) and 2x (CPU) speedup on small and moderate-sized tensors.
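A hedged numpy sketch (not the cuBLAS kernel itself) of why this works: many single-mode tensor contractions are simply a batch of ordinary GEMMs over one index, which is the view StridedBatchedGEMM exploits, with no data permutation needed.

```python
import numpy as np

# Sketch: a mode contraction written as one einsum and, equivalently, as a batch of GEMMs.
rng = np.random.default_rng(0)
m, k, n, p = 8, 6, 7, 5                 # p plays the role of the "batched" mode
A = rng.normal(size=(m, k, p))          # third-order tensor
B = rng.normal(size=(k, n))             # matrix contracted against mode 2 of A

# Contraction C[i, j, b] = sum_k A[i, k, b] * B[k, j], as one einsum ...
C = np.einsum('ikb,kj->ijb', A, B)

# ... and as a batch of GEMMs: one matrix product per slice b, each slice read
# with a fixed stride between batches (the "strided batched" view).
C_batched = np.stack([A[:, :, b] @ B for b in range(p)], axis=-1)

assert np.allclose(C, C_batched)
```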

Y. Shi, U.N. Niranjan, A., C. Cecka, "Tensor Contractions with Extended BLAS Kernels on CPU and GPU," HiPC 2016.

Symmetric Tensor Decomposition


$$T = v_1^{\otimes 3} + v_2^{\otimes 3} + \cdots$$

A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Symmetric Tensor Decomposition: Tensor Power Method

Tensor power method: $v \mapsto \dfrac{T(v, v, \cdot)}{\|T(v, v, \cdot)\|}$

$$T(v, v, \cdot) = \langle v, v_1 \rangle^2 v_1 + \langle v, v_2 \rangle^2 v_2$$

Exponential number of stationary points for the power method: $T(v, v, \cdot) = \lambda v$. Stable: the components $v_i$. Unstable: other stationary points.
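A minimal numpy sketch of the power iteration on an orthogonally decomposable tensor (weights of 1, no deflation or restarts; the robust algorithm in the JMLR 2014 paper is more involved):

```python
import numpy as np

# Sketch: tensor power iteration on T = v1^{⊗3} + v2^{⊗3} with orthonormal v1, v2.
rng = np.random.default_rng(0)
d = 4
v1 = np.eye(d)[0]                     # orthonormal ground-truth components
v2 = np.eye(d)[1]
T = np.einsum('i,j,k->ijk', v1, v1, v1) + np.einsum('i,j,k->ijk', v2, v2, v2)

v = rng.normal(size=d)
v /= np.linalg.norm(v)
for _ in range(30):
    # Contraction T(v, v, .) followed by normalization.
    Tv = np.einsum('ijk,i,j->k', T, v, v)
    v = Tv / np.linalg.norm(Tv)

# The iterate converges to one of the components (up to sign).
print(np.abs(v @ v1), np.abs(v @ v2))
```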

A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Non-orthogonal Tensor Decomposition


$$T = v_1^{\otimes 3} + v_2^{\otimes 3} + \cdots$$


Orthogonalization

[Figure: whitening map W sends the non-orthogonal v1, v2 to orthogonal v~1, v~2]

$$T(W, W, W) = \tilde{T}$$

Find $W$ using SVD of the matrix slice $M = T(\cdot, \cdot, \theta)$.
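A hedged numpy sketch of this orthogonalization (whitening) step: build the whitening matrix from the eigendecomposition of a matrix slice and contract the tensor with it. Components and the probe vector theta are chosen positive so the slice has positive weights; all sizes are illustrative.

```python
import numpy as np

# Sketch: whiten T using a matrix slice M = T(., ., theta) so that T(W, W, W)
# has an orthogonal decomposition and the power method applies.
rng = np.random.default_rng(0)
d, k = 6, 3
V = rng.uniform(0.1, 1.0, size=(d, k))          # non-orthogonal components v_1..v_k
T = np.einsum('ir,jr,kr->ijk', V, V, V)          # T = sum_r v_r^{⊗3}

theta = rng.uniform(0.1, 1.0, size=d)
M = np.einsum('ijk,k->ij', T, theta)             # slice M = sum_r <v_r, theta> v_r v_r^T

# Whitening matrix from the top-k eigenpairs of M, so that W^T M W = I_k.
eigvals, eigvecs = np.linalg.eigh(M)
idx = np.argsort(eigvals)[::-1][:k]
W = eigvecs[:, idx] / np.sqrt(eigvals[idx])

T_tilde = np.einsum('ijk,ia,jb,kc->abc', T, W, W, W)   # whitened tensor T(W, W, W)

print(np.round(W.T @ M @ W, 3))                  # ~ identity: whitening succeeded
```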

A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Scaling up Tensor Decomposition

Randomized dimensionality reduction through sketching.

Complexity independent of tensor order: exponential gain!

[Figure: tensor T mapped to a small sketch s via random +1/-1 signs]

State-of-the-art results for visual Q&A.
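A hedged numpy sketch in the spirit of this idea: the count sketch of a rank-1 tensor x ⊗ x ⊗ x is the circular convolution of three vector count sketches, computable via FFT, so the empirical third moment can be sketched without ever forming the d^3 tensor. Hash functions, sketch length b, and the data are illustrative choices, not the paper's exact construction.

```python
import numpy as np

# Sketch of FFT-based tensor sketching: accumulate the sketch of (1/n) sum_n x_n^{⊗3}.
rng = np.random.default_rng(0)
d, b, n = 100, 64, 500

# Three independent count-sketch hash/sign functions.
h = rng.integers(0, b, size=(3, d))
s = rng.choice([-1.0, 1.0], size=(3, d))

def count_sketch(x, h, s, b):
    out = np.zeros(b)
    np.add.at(out, h, s * x)        # out[j] = sum_{i: h(i)=j} s(i) * x(i)
    return out

X = rng.normal(size=(n, d))          # data samples

sketch = np.zeros(b, dtype=complex)
for x in X:
    f = [np.fft.fft(count_sketch(x, h[m], s[m], b)) for m in range(3)]
    sketch += f[0] * f[1] * f[2]     # product in Fourier domain = circular convolution
sketch = np.real(np.fft.ifft(sketch / n))

print(sketch.shape)                  # (b,): per-sample cost is O(d + b log b), not O(d^3)
```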

Wang, Tung, Smola, A., "Guaranteed Tensor Decomposition via Sketching," NIPS 2015.

A. Fukui, D.H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding," CVPR 2016.

Outline

1 Introduction

2 Why Tensors?

3 Tensor Decomposition Methods

4 Learning Using Tensor Methods

5 Conclusion

Tensor Methods for Topic Modeling

Topic-word matrix $P[\text{word} = i \mid \text{topic} = j]$, with linearly independent columns.

Moment Tensor: Co-occurrence of Word Triplets

[Figure: third-order co-occurrence tensor over word triplets, e.g. (witness, police, campus)]
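An illustrative numpy sketch of this moment tensor: build the empirical third-order word-triplet co-occurrence tensor from a tiny bag-of-words corpus. The documents and vocabulary below are made up.

```python
import numpy as np
from itertools import permutations

# Sketch: empirical third-order moment of word-triplet co-occurrences.
vocab = ["campus", "police", "witness", "game", "score"]
word_id = {w: i for i, w in enumerate(vocab)}
docs = [
    ["campus", "police", "witness", "police"],
    ["game", "score", "campus", "game"],
    ["police", "witness", "witness", "campus"],
]

d = len(vocab)
M3 = np.zeros((d, d, d))
count = 0
for doc in docs:
    ids = [word_id[w] for w in doc]
    # Average over ordered triples of distinct positions in the document.
    for i, j, k in permutations(range(len(ids)), 3):
        M3[ids[i], ids[j], ids[k]] += 1.0
        count += 1
M3 /= count

# Under a single-topic model, M3 decomposes as sum_t P(topic=t) * a_t ⊗ a_t ⊗ a_t,
# where a_t is the topic-word column P[word | topic = t], recoverable by tensor methods.
print(M3.shape, M3.sum())   # (5, 5, 5), entries sum to 1
```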

[Figure: word-triplet moment tensor decomposed into topics, e.g. Education, Sports, Crime]

Tensors vs. Variational Inference

Criterion: Perplexity = exp[-likelihood].

Learning topics from PubMed on Spark, 8 million articles.

[Plots: perplexity and running time, tensor vs. variational]

Learning network communities from social network data: Facebook n ~ 20k, Yelp n ~ 40k, DBLP-sub n ~ 1e5, DBLP n ~ 1e6.

[Plots: error and running time on Facebook, Yelp, DBLP-sub, DBLP]

Orders of magnitude faster and more accurate.

F. Huang, U.N. Niranjan, M. Hakeem, A., "Online Tensor Methods for Training Latent Variable Models," JMLR 2014.

Learning Representations using Tensors

Shift-invariant Dictionary: Convolutional Model

[Figure: image = sum of dictionary filters convolved (*) with activation maps]
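An illustrative numpy sketch of the convolutional (shift-invariant) model, simplified to 1-D: the signal is a sum of dictionary filters convolved with sparse activation maps. All sizes and values are made up.

```python
import numpy as np

# Sketch: signal = sum_l (filter_l * activation_l), with sparse activations.
rng = np.random.default_rng(0)
sig_len, filt_len, n_filters = 64, 8, 2

filters = rng.normal(size=(n_filters, filt_len))       # shift-invariant dictionary
activations = np.zeros((n_filters, sig_len))
activations[0, [5, 40]] = 1.0                           # sparse activation maps
activations[1, [20, 50]] = -0.5

# 1-D "same"-length convolution of each filter with its activation map, then sum.
signal = sum(
    np.convolve(activations[l], filters[l], mode="same") for l in range(n_filters)
)
print(signal.shape)   # (64,)
```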

A. Agarwal, A., P. Jain, P. Netrapalli, "Learning Sparsely Used Overcomplete Dictionaries," COLT 2014.

A., M. Janzamin, R. Ge, "Overcomplete Tensor Decomposition," COLT 2015.

Huang, A., "Convolutional Dictionary Learning through Tensor Factorization," Proc. of JMLR 2015.

Learning Representations using Tensors

Efficient Tensor Decomposition with Shifted Components


$$M_3 = (F_1^*)^{\otimes 3} + \mathrm{shift}(F_1^*)^{\otimes 3} + \cdots + (F_2^*)^{\otimes 3} + \mathrm{shift}(F_2^*)^{\otimes 3} + \cdots$$
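A hedged numpy rendering of this structure (filters, sizes, and the use of all circular shifts are illustrative assumptions): the moment tensor is a sum of rank-1 cubes over shifted copies of each filter, so the many shifted components share only a few underlying filter parameters.

```python
import numpy as np

# Sketch: moment tensor for a shift-invariant dictionary as a sum of cubes of shifts.
rng = np.random.default_rng(0)
d, n_filters = 16, 2
F = rng.normal(size=(n_filters, d))                 # filters F_1, F_2 embedded in R^d

def cube(v):
    return np.einsum('i,j,k->ijk', v, v, v)         # v^{⊗3}

M3 = np.zeros((d, d, d))
for l in range(n_filters):
    for shift in range(d):                          # all circular shifts of filter l
        M3 += cube(np.roll(F[l], shift))

# The shifted copies of each filter share one set of parameters, so the decomposition
# is generated from only n_filters underlying filters.
print(M3.shape)
```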

Shift-invariant Dictionary: Linear Model with Constraints

[Figure: image = dictionary matrix times coefficient vector, with the shift structure imposed as constraints]

Results using Sentence Embeddings

No outside information for tensor methods; trained on ~4k samples. Skip-thought vectors trained on 74 million sentences.

Sentiment Analysis

Method              MR    SUBJ
Paragraph-vector    74.8  90.5
Skip-thought        75.5  92.1
ConvDic+DeconvDec   78.9  92.4

Paraphrase Detection

Method              Outside Information   F-score
Vector Similarity   word similarity       75.3%
RMLMG               syntactic info        80.5%
ConvDic+DeconvDec   none                  80.7%
Skip-thought        book corpus           81.9%

Reinforcement Learning of POMDPs

[Figure: observation window; plot of average reward vs. time for SM-UCRL-POMDP vs. DNN]

POMDP model with 8 hidden states (trained using tensor methods) vs. NN with 3 hidden layers of 30 neurons each (trained using RMSProp).

K. Azizzadenesheli, A. Lazaric, A., "Reinforcement Learning of POMDPs using Spectral Methods," COLT 2016.

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

Faster convergence to a better solution via tensor methods.

Outline

1 Introduction

2 Why Tensors?

3 Tensor Decomposition Methods

4 Learning Using Tensor Methods

5 Conclusion

Conclusion: Guaranteed Non-Convex Machine Learning

Matrix and tensor methods have desirable guarantees on reaching the global optimum for nonconvex problems.

Method-of-moments estimators using tensor decomposition.

Low-order polynomial computational and sample complexity.

Faster and better performance in practice.

Research Connections and Resources

Collaborators: Rong Ge (Duke), Daniel Hsu (Columbia), Sham Kakade (UW), Jennifer Chayes, Christian Borgs, Alex Smola (CMU), Prateek Jain, Alekh Agarwal & Praneeth Netrapalli (MSR), Srinivas Turaga (Janelia), Alessandro Lazaric (Inria), Hossein Mobahi.

Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/

Machine Learning @ AWS

THE TRINITY FUELING DIVERSE UTILITY: ALGORITHMS, LARGE DATASETS, COMPUTING

Apply for AWS credits for your research:

https://aws.amazon.com/grants/

ACADEMIC ENGAGEMENTS: Conduct research and build products at AWS: internships and full-time positions!

Send me an email: [email protected]

P2, DEEP AMI, AND DEEP TEMPLATE

P2 INSTANCES: up to 40k CUDA cores. DEEP AMI: pre-configured for deep learning. DEEP TEMPLATE: deep learning clusters.

MXNET IS AWS'S DEEP LEARNING FRAMEWORK OF CHOICE

MULTIPLE LANGUAGES: frontend and backend.

Single implementation of backend system and common operators; performance guarantee regardless of which front-end language is used.

AMALGAMATION
• Fit the core library with all dependencies into a single C++ source file
• Easy to compile on …

RUNS IN BROWSER WITH JAVASCRIPT. DEPLOY ANYWHERE.

The first image for search "dog" at images.google.com:

Outputs "beagle" with prob=73% within 1 second. BlindTool by Joseph Paul Cohen, demo on Nexus 4.

WRITING PARALLEL PROGRAMS IS PAINFUL

Each forward-backward-update involves O(num_layer), which is often 100-1,000, tensor computations and communications.

[Figure: dependency graph for 2-layer neural networks with 2 GPUs, e.g. fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0]); fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50]); fc1_weight[cpu] -= lr * fc1_wgrad[gpu0]; ...]
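To make the pain concrete, here is a hedged, pure-numpy sketch (not MXNet code) of one manual data-parallel step for a single linear layer on two "devices": split the batch, compute per-device gradients, aggregate, update, and broadcast back. Doing this bookkeeping by hand for every layer and iteration is exactly what MXNet's dependency engine automates.

```python
import numpy as np

# Sketch of one manual data-parallel update for a linear model on two devices.
rng = np.random.default_rng(0)
d, batch = 10, 100
weight_cpu = rng.normal(size=d)

data = rng.normal(size=(batch, d))
label = data @ np.arange(d, dtype=float) + 0.1 * rng.normal(size=batch)

# "Copy" parameters to each device and split the batch.
weight_gpu = [weight_cpu.copy(), weight_cpu.copy()]
shards = [(data[:50], label[:50]), (data[50:], label[50:])]

grads = []
for dev, (x, y) in enumerate(shards):
    pred = x @ weight_gpu[dev]                       # forward on device `dev`
    grads.append(x.T @ (pred - y) / len(y))          # backward on device `dev`

weight_cpu -= 0.01 * (grads[0] + grads[1])           # aggregate gradients, update on cpu
weight_gpu = [weight_cpu.copy(), weight_cpu.copy()]  # broadcast updated weights back
print(weight_cpu[:3])
```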

[Plots: multi-GPU scaling vs. ideal for Inception v3, ResNet, and AlexNet; 91% efficiency at 16 GPUs and 88% efficiency at 256 GPUs]

CREATE AND COMPARE YOUR OWN FRAMEWORK PERFORMANCE: SPIN UP, LOG IN, RUN BENCHMARKS

github.com/awslabs/deeplearning-benchmark