Guaranteed Machine Learning using Tensor Methods
Anima Anandkumar
Amazon & U.C. Irvine
Winedale 2016

Regime of Modern Machine Learning
Massive datasets, growth in computation power, challenging tasks.
Recent Success in Machine Learning
Tremendous improvement in prediction accuracy; deployment in the real world: image classification, speech recognition, text processing. Automated algorithms without requiring extensive tuning.

Missing Link in AI: Guaranteed Methods

Missing Link in AI: Unsupervised Learning
Unsupervised learning to achieve human-level intelligence.
(Image credit: Whizz Education)

Convex vs. Nonconvex Optimization
Convex: unique optimum, global = local. Nonconvex: multiple local optima. Guaranteed approaches for reaching global optima?

Guaranteed Learning through Tensor Methods
Replace the objective function, e.g., max likelihood vs. best tensor decomposition. Preserves the global optimum (infinite samples):

    arg max_θ p(x; θ) = arg min_θ ‖T̂(x) − T(θ)‖²_F

where T̂(x) is the empirical tensor and T(θ) is the low-rank tensor based on θ. Finding the globally optimal tensor decomposition: simple algorithms succeed under mild and natural conditions for many learning problems.
[Figure: datasets map to a model class via tensor decomposition.]

Outline
1 Introduction
2 Why Tensors?
3 Tensor Decomposition Methods
4 Learning Using Tensor Methods
5 Conclusion

Example: Discovering Latent Factors
List of scores for students (Alice, Bob, Carol, Dave, Eve) in different tests (Math, Physics, Music, Classics). Learn hidden factors for verbal and mathematical intelligence [C. Spearman 1904].
Score(student, test) = student_verbal-intlg × test_verbal + student_math-intlg × test_math

Matrix Decomposition: Discovering Latent Factors
[Figure: the (students × tests) score matrix equals the sum of a verbal rank-one component and a math rank-one component.]
Identifying hidden factors influencing the observations; characterized as matrix decomposition.

Tensor: Shared Matrix Decomposition
[Figure: the oral and written score matrices each decompose into verbal and math rank-one components.]
Shared decomposition with different scaling factors. Combine matrix slices as a tensor.

Tensor Decomposition
[Figure: the (students × tests × oral/written) tensor equals a sum of rank-one components.]
Outer product notation:

    T = u ⊗ v ⊗ w + ũ ⊗ ṽ ⊗ w̃,  i.e.,  T_{i1,i2,i3} = u_{i1} · v_{i2} · w_{i3} + ũ_{i1} · ṽ_{i2} · w̃_{i3}

Identifiability under Tensor Decomposition
[Figure: tensor as a sum of rank-one terms.]

    T = λ1 a1^⊗3 + λ2 a2^⊗3 + · · ·
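To make the rank-one notation concrete, here is a minimal NumPy sketch (dimensions, seeds, and variable names are illustrative, not from the talk) that builds T = λ1 a1^⊗3 + λ2 a2^⊗3 and checks a single entry against the componentwise formula:

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
a1, a2 = rng.standard_normal(d), rng.standard_normal(d)
lam1, lam2 = 2.0, 0.5

def rank_one_cube(a):
    # a^{⊗3}: entries (a ⊗ a ⊗ a)[i, j, k] = a[i] * a[j] * a[k]
    return np.einsum('i,j,k->ijk', a, a, a)

T = lam1 * rank_one_cube(a1) + lam2 * rank_one_cube(a2)

# Entry-wise check of the decomposition
i, j, k = 1, 2, 3
assert np.isclose(T[i, j, k],
                  lam1 * a1[i] * a1[j] * a1[k] + lam2 * a2[i] * a2[j] * a2[k])
```

The same `einsum` pattern extends directly to higher orders and more components.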
Uniqueness of Tensor Decomposition [J. Kruskal 1977]
Above tensor decomposition: unique when the rank-one components are linearly independent.
Matrix case: unique only when the rank-one components are orthogonal.
[Figure: two non-orthogonal components λ1 a1 and λ2 a2.]
Notion of Tensor Contraction
Extends the notion of matrix product.
Matrix product: Mv = Σ_j v_j M_j
Tensor contraction: T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:}
[Figure: contracting a sum of rank-one tensors term by term.]
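The contraction above has a one-line NumPy expression; this small sketch (toy sizes, illustrative names) checks `einsum` against the explicit double sum over slices T[i, j, :]:

```python
import numpy as np

d = 5
rng = np.random.default_rng(1)
T = rng.standard_normal((d, d, d))
u, v = rng.standard_normal(d), rng.standard_normal(d)

# einsum expresses T(u, v, ·) = sum_{i,j} u_i v_j T[i, j, :] directly ...
w = np.einsum('ijk,i,j->k', T, u, v)

# ... and matches the explicit double sum over matrix slices
w_explicit = sum(u[i] * v[j] * T[i, j, :] for i in range(d) for j in range(d))
assert np.allclose(w, w_explicit)
```

In practice such contractions are mapped onto batched matrix multiplies, which is what the StridedBatchedGEMM kernels below exploit.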
StridedBatchedGEMM for tensor contractions, in cuBLAS 8.0. Avoids explicit transpositions or permutations. 10x (GPU) and 2x (CPU) speedups on small and moderately sized tensors.

Y. Shi, U.N. Niranjan, A., C. Cecka, "Tensor Contractions with Extended BLAS Kernels on CPU and GPU," HiPC 2016.

Symmetric Tensor Decomposition
[Figure: symmetric tensor as a sum of rank-one terms.]

    T = v1^⊗3 + v2^⊗3 + · · ·
A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Tensor Power Method
Power iteration: v ↦ T(v, v, ·) / ‖T(v, v, ·)‖, where T(v, v, ·) = ⟨v1, v⟩² v1 + ⟨v2, v⟩² v2 for the orthogonal decomposition above.
Exponentially many stationary points T(v, v, ·) = λv: the true components are stable; other stationary points are unstable.
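A minimal sketch of the power iteration on an orthogonally decomposable tensor (toy dimensions and seed; not the talk's implementation). With orthonormal components, the iterate converges rapidly to one of them:

```python
import numpy as np

d = 6
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
v1, v2 = Q[:, 0], Q[:, 1]              # orthonormal components
lam1, lam2 = 3.0, 1.5

cube = lambda a: np.einsum('i,j,k->ijk', a, a, a)
T = lam1 * cube(v1) + lam2 * cube(v2)

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for _ in range(50):
    v = np.einsum('ijk,j,k->i', T, v, v)   # T(v, v, ·)
    v /= np.linalg.norm(v)

# The iterate lands on one of the true components (up to sign)
dist = min(np.linalg.norm(v - v1), np.linalg.norm(v + v1),
           np.linalg.norm(v - v2), np.linalg.norm(v + v2))
assert dist < 1e-6
```

Which component wins depends on the random start; deflation (subtracting the recovered λ v^⊗3 and repeating) recovers the rest.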
A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Non-orthogonal Tensor Decomposition
[Figure: tensor as a sum of rank-one terms with non-orthogonal components.]

    T = v1^⊗3 + v2^⊗3 + · · ·

A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.
Orthogonalization
[Figure: non-orthogonal components v1, v2 mapped by W to orthonormal ṽ1, ṽ2.]
T(W, W, W) = T̃
Find W using the SVD of a matrix slice: M = T(·, ·, θ).
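A hedged sketch of the whitening step (toy sizes; for simplicity it whitens against M = v1 v1ᵀ + v2 v2ᵀ, a stand-in for a slice T(·, ·, θ) with positive weights). After whitening, the mapped components are orthonormal and T(W, W, W) is orthogonally decomposable:

```python
import numpy as np

d = 5
rng = np.random.default_rng(3)
v1, v2 = rng.standard_normal(d), rng.standard_normal(d)   # not orthogonal
cube = lambda a: np.einsum('i,j,k->ijk', a, a, a)
T = cube(v1) + cube(v2)

# Whitening matrix from the SVD of a PSD matrix built on the components
M = np.outer(v1, v1) + np.outer(v2, v2)
U, S, _ = np.linalg.svd(M)
W = U[:, :2] / np.sqrt(S[:2])          # d×2, satisfies Wᵀ M W = I

Tt = np.einsum('ijk,ia,jb,kc->abc', T, W, W, W)   # T(W, W, W)

# Whitened components are orthonormal, and Tt is their orthogonal decomposition
u1, u2 = W.T @ v1, W.T @ v2
assert np.isclose(u1 @ u2, 0.0)
assert np.allclose(Tt, cube(u1) + cube(u2))
```

The orthogonal power method can then be run on the small whitened tensor Tt, and the components are un-whitened afterwards.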
A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Scaling up Tensor Decomposition
Randomized dimensionality reduction through sketching. Complexity independent of tensor order: exponential gain! State-of-the-art results for visual Q&A.
[Figure: tensor T hashed into a sketch s with ±1 signs.]
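The key identity behind tensor sketching can be demonstrated in a few lines (toy hash sizes; names are illustrative): the count-sketch of a rank-one term under the combined hash (h1(i)+h2(j)) mod b and sign s1(i)s2(j) equals the circular convolution of the two vector sketches, computable by FFT. This is why the cost stays per-order rather than exponential in the order:

```python
import numpy as np

b, d = 16, 10
rng = np.random.default_rng(4)
h1, h2 = rng.integers(0, b, d), rng.integers(0, b, d)
s1, s2 = rng.choice([-1, 1], d), rng.choice([-1, 1], d)
u, v = rng.standard_normal(d), rng.standard_normal(d)

def count_sketch(x, h, s):
    out = np.zeros(b)
    np.add.at(out, h, s * x)    # accumulate signed entries per bucket
    return out

# Direct sketch of the outer product u ⊗ v under the combined hash/sign
direct = np.zeros(b)
for i in range(d):
    for j in range(d):
        direct[(h1[i] + h2[j]) % b] += s1[i] * s2[j] * u[i] * v[j]

# Same sketch via FFT: circular convolution of the two vector sketches
via_fft = np.real(np.fft.ifft(np.fft.fft(count_sketch(u, h1, s1)) *
                              np.fft.fft(count_sketch(v, h2, s2))))
assert np.allclose(direct, via_fft)
```

For an order-p rank-one term, one multiplies p FFT-transformed vector sketches instead of two, never materializing the d^p tensor.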
Wang, Tung, Smola, A., "Guaranteed Tensor Decomposition via Sketching," NIPS 2015.
A. Fukui, D.H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding," CVPR 2016.

Tensor Methods for Topic Modeling
Topic-word matrix P[word = i | topic = j]: linearly independent columns.

Moment Tensor: Co-occurrence of Word Triplets
[Figure: the co-occurrence tensor of word triplets (witness, police, campus) decomposes into topic components: education, sports, crime.]

Tensors vs. Variational Inference
Criterion: perplexity = exp[−likelihood]. Learning topics from PubMed on Spark, 8 million articles.
[Figure: bar charts of perplexity and running time, tensor vs. variational.]
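A toy sketch of the empirical word-triplet moment feeding the topic-model tensor decomposition (vocabulary, documents, and names are invented for illustration; real pipelines average over all triples per document and apply exchangeability corrections):

```python
import numpy as np

V = 6                                       # toy vocabulary size
docs = [(0, 2, 3), (1, 4, 5), (0, 3, 2)]    # one word triple per document

# Empirical third moment: average of one-hot outer products e_w1 ⊗ e_w2 ⊗ e_w3
M3 = np.zeros((V, V, V))
for (w1, w2, w3) in docs:
    e1, e2, e3 = np.eye(V)[w1], np.eye(V)[w2], np.eye(V)[w3]
    M3 += np.einsum('i,j,k->ijk', e1, e2, e3)
M3 /= len(docs)

# M3[i, j, k] estimates P[word1 = i, word2 = j, word3 = k]
assert np.isclose(M3.sum(), 1.0)
assert np.isclose(M3[0, 2, 3], 1 / 3)
```

For single-topic documents this moment is, in expectation, a low-rank sum over topic-word vectors, which the decomposition methods above recover.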
Learning network communities from social network data: Facebook n ≈ 20k, Yelp n ≈ 40k, DBLP-sub n ≈ 1e5, DBLP n ≈ 1e6.
[Figure: bar charts of error and running time on FB, Yelp, DBLP-sub, DBLP.]
Orders of magnitude faster and more accurate.

F. Huang, U.N. Niranjan, M. Hakeem, A., "Online tensor methods for training latent variable models," JMLR 2014.

Learning Representations using Tensors
Shift-invariant Dictionary: Convolutional Model
[Figure: the image equals a sum of dictionary filters convolved with activation maps.]
A. Agarwal, A., P. Jain, P. Netrapalli, "Learning Sparsely Used Overcomplete Dictionaries," COLT 2014.
A., M. Janzamin, R. Ge, "Overcomplete Tensor Decomposition," COLT 2015.
Huang, A., "Convolutional Dictionary Learning through Tensor Factorization," Proc. of JMLR 2015.
Efficient Tensor Decomposition with Shifted Components

    M3 = (F1*)^⊗3 + shift(F1*)^⊗3 + · · · + (F2*)^⊗3 + shift(F2*)^⊗3 + · · ·

Shift-invariant Dictionary: Linear Model with Constraints
[Figure: the image equals the dictionary matrix times the activation coefficients.]

Results using Sentence Embeddings
No outside information for tensor methods; trained on ~4k samples. Skip-thought vectors trained on 74 million sentences.

Sentiment Analysis
Method              MR    SUBJ
Paragraph-vector    74.8  90.5
Skip-thought        75.5  92.1
ConvDic+DeconvDec   78.9  92.4

Paraphrase Detection
Method              Outside Information  F-score
Vector Similarity   word similarity      75.3%
RMLMG               syntactic info       80.5%
ConvDic+DeconvDec   none                 80.7%
Skip-thought        book corpus          81.9%

Reinforcement Learning of POMDPs
Observation window. Average reward vs. time.
[Figure: average reward over time; SM-UCRL-POMDP rises above DNN.]
POMDP model with 8 hidden states (trained using tensor methods) vs. NN with 3 hidden layers of 30 neurons each (trained using RMSProp). Faster convergence to a better solution via tensor methods.

K. Azizzadenesheli, A. Lazaric, A., "Reinforcement Learning of POMDPs using Spectral Methods," COLT 2016.
http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html
Conclusion

Guaranteed Non-Convex Machine Learning: matrix and tensor methods have desirable guarantees on reaching the global optimum for nonconvex problems. Method-of-moments estimators using tensor decomposition. Low-order polynomial computational and sample complexity. Faster and better performance in practice.
Research Connections and Resources
Collaborators: Rong Ge (Duke), Daniel Hsu (Columbia), Sham Kakade (UW), Jennifer Chayes, Christian Borgs, Alex Smola (CMU), Prateek Jain, Alekh Agarwal & Praneeth Netrapalli (MSR), Srinivas Turaga (Janelia), Alessandro Lazaric (Inria), Hossein Mobahi (Google).
Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/

Machine Learning @ AWS
The trinity fueling artificial intelligence: algorithms, diverse & large datasets, utility computing.
Apply for AWS credits for your research: https://aws.amazon.com/grants/
Academic engagements: conduct research and build products at AWS, with internships and full-time positions! Send me an email: [email protected]

P2, Deep AMI, and Deep Template
P2 instances: up to 40k CUDA cores. Deep AMI: pre-configured for deep learning. Deep Template: deep learning clusters.

MXNet is AWS's Deep Learning Framework of Choice
Multiple frontend languages, one backend: a single implementation of the backend system and common operators gives a performance guarantee regardless of which front-end language is used.
Amalgamation: fit the core library with all dependencies into a single C++ source file; easy to compile on … Runs in the browser with JavaScript; deploy anywhere.
[Figure: the first image for the search "dog" at images.google.com.]
BlindTool by Joseph Paul Cohen, demo on a Nexus 4: outputs "beagle" with prob = 73% within 1 second.

Beyond: Dependency Graph for 2-layer Neural Networks with 2 GPUs
Writing parallel programs is painful: each forward-backward-update involves O(num_layer), often 100–1,000, tensor computations and communications. The slide's dependency graph, per GPU:

    data = next_batch()
    data[gpu0].copyfrom(data[0:50]); data[gpu1].copyfrom(data[51:100])
    # forward on each GPU
    fc1[gpu] = FullcForward(data[gpu], fc1_weight[gpu])
    fc2[gpu] = FullcForward(fc1[gpu], fc2_weight[gpu])
    # backward on each GPU
    fc2_ograd[gpu] = LossGrad(fc2[gpu], label_slice)
    fc1_ograd[gpu], fc2_wgrad[gpu] = FullcBackward(fc2_ograd[gpu], fc2_weight[gpu])
    _, fc1_wgrad[gpu] = FullcBackward(fc1_ograd[gpu], fc1_weight[gpu])
    # aggregate gradients, update on CPU, broadcast weights back
    fc2_wgrad[cpu] = fc2_wgrad[gpu0] + fc2_wgrad[gpu1]
    fc2_weight[cpu] -= lr * fc2_wgrad[cpu]
    fc2_weight[cpu].copyto(fc2_weight[gpu0], fc2_weight[gpu1])
    fc1_wgrad[cpu] = fc1_wgrad[gpu0] + fc1_wgrad[gpu1]
    fc1_weight[cpu] -= lr * fc1_wgrad[cpu]
    fc1_weight[cpu].copyto(fc1_weight[gpu0], fc1_weight[gpu1])
[Figure: multi-GPU scaling from 1 to 16 GPUs (91% efficiency) and 1 to 256 GPUs (88% efficiency) for Inception v3, ResNet, and AlexNet vs. ideal speedup.]

Create and compare your own framework performance: spin up, log in, run benchmarks.
github.com/awslabs/deeplearning-benchmark