Guaranteed Non-Convex Machine Learning using Tensor Methods

Anima Anandkumar


Amazon & U.C. Irvine


Winedale 2016

Regime of Modern Machine Learning

Massive datasets, growth in computation power, challenging tasks.

Recent Success in Machine Learning

Tremendous improvement in prediction accuracy. Deployment in the real world.

Image classification, speech recognition, text processing.

Automated algorithms without requiring extensive tuning.

Missing Link in AI: Guaranteed Methods

Automated algorithms without requiring extensive tuning.

Missing Link in AI: Unsupervised Learning

Unsupervised learning to achieve human-level intelligence.

Image Credit: Whizz Education

Convex vs. Nonconvex Optimization

Convex: unique optimum (global and local coincide). Nonconvex: multiple local optima.

Guaranteed approaches for reaching global optima?

Guaranteed Learning through Tensor Methods

Replace the objective function, e.g. maximum likelihood vs. best tensor decomposition.

Preserves Global Optimum (infinite samples)

$$\arg\max_{\theta} p(x;\theta) \;=\; \arg\min_{\theta} \|\widehat{T}(x) - T(\theta)\|_F^2$$

$\widehat{T}(x)$: empirical tensor, $T(\theta)$: low-rank tensor based on $\theta$.

Finding globally optimal tensor decomposition: simple algorithms succeed under mild and natural conditions for many learning problems.
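As a concrete illustration of the objective above, here is a minimal numpy sketch (not the paper's algorithm): form an empirical third-order moment tensor from simulated data and evaluate the Frobenius-norm objective against a candidate low-rank tensor. The model, sample sizes, and parameters are made up for the example.

```python
import numpy as np

# Hedged sketch: compare the empirical third-order moment tensor T_hat(x)
# with a candidate low-rank tensor T(theta) in Frobenius norm.
rng = np.random.default_rng(0)
d, n, k = 5, 10000, 2

# Ground-truth components (the "theta" of a simple mixture-style model).
A_true = rng.normal(size=(d, k))
weights = np.array([0.6, 0.4])

# Simulated samples: each x is one component plus small noise (purely illustrative).
z = rng.choice(k, size=n, p=weights)
x = A_true[:, z].T + 0.01 * rng.normal(size=(n, d))

# Empirical third-moment tensor T_hat = mean of x ⊗ x ⊗ x.
T_hat = np.einsum('ni,nj,nk->ijk', x, x, x) / n

def low_rank_tensor(A, lam):
    """T(theta) = sum_r lam_r * a_r ⊗ a_r ⊗ a_r."""
    return np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)

# Frobenius-norm objective || T_hat - T(theta) ||_F^2, evaluated at the true parameters.
obj = np.linalg.norm(T_hat - low_rank_tensor(A_true, weights)) ** 2
print(obj)
```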

Outline

1 Introduction

2 Why Tensors?

3 Tensor Decomposition Methods

4 Learning Using Tensor Methods

5 Conclusion

Example: Discovering Latent Factors

List of scores for students (Alice, Bob, Carol, Dave, Eve) in different tests (Math, Physics, Music, Classics).

Learn hidden factors for Verbal and Mathematical Intelligence [C. Spearman 1904].

$$\text{Score}(\text{student}, \text{test}) = \text{student}_{\text{verbal-intlg}} \times \text{test}_{\text{verbal}} + \text{student}_{\text{math-intlg}} \times \text{test}_{\text{math}}$$
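A minimal numpy sketch of this model: the score matrix is the sum of two rank-1 (outer-product) terms. All names and numbers below are made up for illustration.

```python
import numpy as np

# Illustrative sketch: a students-by-tests score matrix as a sum of two rank-1 terms.
students = ["Alice", "Bob", "Carol", "Dave", "Eve"]
tests = ["Math", "Physics", "Music", "Classics"]

verbal_intlg = np.array([0.9, 0.4, 0.7, 0.3, 0.8])   # per-student verbal intelligence
math_intlg   = np.array([0.5, 0.9, 0.6, 0.8, 0.2])   # per-student math intelligence
test_verbal  = np.array([0.2, 0.1, 0.8, 0.9])        # how "verbal" each test is
test_math    = np.array([0.9, 0.8, 0.1, 0.2])        # how "mathematical" each test is

# Score(student, test) = verbal_intlg * test_verbal + math_intlg * test_math
scores = np.outer(verbal_intlg, test_verbal) + np.outer(math_intlg, test_math)
print(scores.shape)  # (5, 4): students x tests, a rank-2 matrix
```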

Matrix Decomposition: Discovering Latent Factors

[Figure: students-by-tests score matrix decomposed into a Verbal rank-1 term plus a Math rank-1 term]

Identifying hidden factors influencing the observations, characterized as matrix decomposition.

Tensor: Shared Matrix Decomposition

[Figure: oral-exam and written-exam score matrices share the same Verbal and Math rank-1 factors, with different scalings]

Shared decomposition with different scaling factors. Combine matrix slices as a tensor.

Tensor Decomposition

[Figure: (students x tests x oral/written) tensor decomposed into Verbal and Math rank-1 terms]

Outer product notation:

$$T = u \otimes v \otimes w + \tilde{u} \otimes \tilde{v} \otimes \tilde{w}, \qquad T_{i_1,i_2,i_3} = u_{i_1} \cdot v_{i_2} \cdot w_{i_3} + \tilde{u}_{i_1} \cdot \tilde{v}_{i_2} \cdot \tilde{w}_{i_3}$$
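A short numpy sketch of the outer-product notation (dimensions chosen arbitrarily): build a rank-2 third-order tensor and check one entry against the elementwise formula.

```python
import numpy as np

# Minimal sketch: T = u ⊗ v ⊗ w + u~ ⊗ v~ ⊗ w~, verified entrywise.
rng = np.random.default_rng(0)
u, v, w = rng.normal(size=3), rng.normal(size=4), rng.normal(size=5)
ut, vt, wt = rng.normal(size=3), rng.normal(size=4), rng.normal(size=5)

T = np.einsum('i,j,k->ijk', u, v, w) + np.einsum('i,j,k->ijk', ut, vt, wt)

i1, i2, i3 = 1, 2, 3
entry = u[i1] * v[i2] * w[i3] + ut[i1] * vt[i2] * wt[i3]
assert np.isclose(T[i1, i2, i3], entry)
print(T.shape)  # (3, 4, 5)
```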

Identifiability under Tensor Decomposition

$$T = \lambda_1\, a_1^{\otimes 3} + \lambda_2\, a_2^{\otimes 3} + \cdots$$

Uniqueness of Tensor Decomposition [J. Kruskal 1977]

Above tensor decomposition: unique when the rank-one components are linearly independent.

Matrix case: unique only when the rank-one components are orthogonal.

[Figure: components λ1 a1 and λ2 a2 illustrated for the tensor vs. matrix case]

Outline

1 Introduction

2 Why Tensors?

3 Tensor Decomposition Methods

4 Learning Using Tensor Methods

5 Conclusion

Notion of Tensor Contraction

Extends the notion of matrix product.

Matrix product: $Mv = \sum_{j} v_j M_j$.   Tensor contraction: $T(u, v, \cdot) = \sum_{i,j} u_i v_j T_{i,j,:}$.



StridedBatchedGEMM for tensor contractions, in cuBLAS 8.0. Avoids explicit transpositions or permutations. 10x (GPU) and 2x (CPU) speedup on small and moderate-sized tensors.
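A hedged numpy sketch (not the cuBLAS kernel itself) of why this works: many single-mode tensor contractions are simply a batch of ordinary GEMMs over one index, which is the view StridedBatchedGEMM exploits, with no data permutation needed.

```python
import numpy as np

# Sketch: a mode contraction written as one einsum and, equivalently, as a batch of GEMMs.
rng = np.random.default_rng(0)
m, k, n, p = 8, 6, 7, 5                 # p plays the role of the "batched" mode
A = rng.normal(size=(m, k, p))          # third-order tensor
B = rng.normal(size=(k, n))             # matrix contracted against mode 2 of A

# Contraction C[i, j, b] = sum_k A[i, k, b] * B[k, j], as one einsum ...
C = np.einsum('ikb,kj->ijb', A, B)

# ... and as a batch of GEMMs: one matrix product per slice b, each slice read
# with a fixed stride between batches (the "strided batched" view).
C_batched = np.stack([A[:, :, b] @ B for b in range(p)], axis=-1)

assert np.allclose(C, C_batched)
```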

Y. Shi, U.N. Niranjan, A., C. Cecka, "Tensor Contractions with Extended BLAS Kernels on CPU and GPU," HiPC 2016.

Symmetric Tensor Decomposition


$$T = v_1^{\otimes 3} + v_2^{\otimes 3} + \cdots$$

A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Symmetric Tensor Decomposition: Tensor Power Method

Tensor power method: $v \mapsto \dfrac{T(v, v, \cdot)}{\|T(v, v, \cdot)\|}$

$$T(v, v, \cdot) = \langle v, v_1 \rangle^2 v_1 + \langle v, v_2 \rangle^2 v_2$$

Exponential number of stationary points for the power method: $T(v, v, \cdot) = \lambda v$. Stable: the components $v_i$. Unstable: other stationary points.
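A minimal numpy sketch of the power iteration on an orthogonally decomposable tensor (weights of 1, no deflation or restarts; the robust algorithm in the JMLR 2014 paper is more involved):

```python
import numpy as np

# Sketch: tensor power iteration on T = v1^{⊗3} + v2^{⊗3} with orthonormal v1, v2.
rng = np.random.default_rng(0)
d = 4
v1 = np.eye(d)[0]                     # orthonormal ground-truth components
v2 = np.eye(d)[1]
T = np.einsum('i,j,k->ijk', v1, v1, v1) + np.einsum('i,j,k->ijk', v2, v2, v2)

v = rng.normal(size=d)
v /= np.linalg.norm(v)
for _ in range(30):
    # Contraction T(v, v, .) followed by normalization.
    Tv = np.einsum('ijk,i,j->k', T, v, v)
    v = Tv / np.linalg.norm(Tv)

# The iterate converges to one of the components (up to sign).
print(np.abs(v @ v1), np.abs(v @ v2))
```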

A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Non-orthogonal Tensor Decomposition


$$T = v_1^{\otimes 3} + v_2^{\otimes 3} + \cdots$$


Orthogonalization

[Figure: whitening map W sends the non-orthogonal v1, v2 to orthogonal v~1, v~2]

$$T(W, W, W) = \tilde{T}$$

Find $W$ using SVD of the matrix slice $M = T(\cdot, \cdot, \theta)$.
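A hedged numpy sketch of this orthogonalization (whitening) step: build the whitening matrix from the eigendecomposition of a matrix slice and contract the tensor with it. Components and the probe vector theta are chosen positive so the slice has positive weights; all sizes are illustrative.

```python
import numpy as np

# Sketch: whiten T using a matrix slice M = T(., ., theta) so that T(W, W, W)
# has an orthogonal decomposition and the power method applies.
rng = np.random.default_rng(0)
d, k = 6, 3
V = rng.uniform(0.1, 1.0, size=(d, k))          # non-orthogonal components v_1..v_k
T = np.einsum('ir,jr,kr->ijk', V, V, V)          # T = sum_r v_r^{⊗3}

theta = rng.uniform(0.1, 1.0, size=d)
M = np.einsum('ijk,k->ij', T, theta)             # slice M = sum_r <v_r, theta> v_r v_r^T

# Whitening matrix from the top-k eigenpairs of M, so that W^T M W = I_k.
eigvals, eigvecs = np.linalg.eigh(M)
idx = np.argsort(eigvals)[::-1][:k]
W = eigvecs[:, idx] / np.sqrt(eigvals[idx])

T_tilde = np.einsum('ijk,ia,jb,kc->abc', T, W, W, W)   # whitened tensor T(W, W, W)

print(np.round(W.T @ M @ W, 3))                  # ~ identity: whitening succeeded
```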

A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, "Tensor Decompositions for Learning Latent Variable Models," JMLR 2014.

Scaling up Tensor Decomposition

Randomized dimensionality reduction through sketching.

Complexity independent of tensor order: exponential gain!

[Figure: tensor T mapped to a small sketch s via random +1/-1 signs]

State-of-the-art results for visual Q&A.
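A hedged numpy sketch in the spirit of this idea: the count sketch of a rank-1 tensor x ⊗ x ⊗ x is the circular convolution of three vector count sketches, computable via FFT, so the empirical third moment can be sketched without ever forming the d^3 tensor. Hash functions, sketch length b, and the data are illustrative choices, not the paper's exact construction.

```python
import numpy as np

# Sketch of FFT-based tensor sketching: accumulate the sketch of (1/n) sum_n x_n^{⊗3}.
rng = np.random.default_rng(0)
d, b, n = 100, 64, 500

# Three independent count-sketch hash/sign functions.
h = rng.integers(0, b, size=(3, d))
s = rng.choice([-1.0, 1.0], size=(3, d))

def count_sketch(x, h, s, b):
    out = np.zeros(b)
    np.add.at(out, h, s * x)        # out[j] = sum_{i: h(i)=j} s(i) * x(i)
    return out

X = rng.normal(size=(n, d))          # data samples

sketch = np.zeros(b, dtype=complex)
for x in X:
    f = [np.fft.fft(count_sketch(x, h[m], s[m], b)) for m in range(3)]
    sketch += f[0] * f[1] * f[2]     # product in Fourier domain = circular convolution
sketch = np.real(np.fft.ifft(sketch / n))

print(sketch.shape)                  # (b,): per-sample cost is O(d + b log b), not O(d^3)
```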

Wang, Tung, Smola, A., "Guaranteed Tensor Decomposition via Sketching," NIPS 2015.

A. Fukui, D.H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding," CVPR 2016.

Outline

1 Introduction

2 Why Tensors?

3 Tensor Decomposition Methods

4 Learning Using Tensor Methods

5 Conclusion

Tensor Methods for Topic Modeling

Topic-word matrix $P[\text{word} = i \mid \text{topic} = j]$, with linearly independent columns.

Moment Tensor: Co-occurrence of Word Triplets

[Figure: third-order co-occurrence tensor over word triplets, e.g. (witness, police, campus)]
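An illustrative numpy sketch of this moment tensor: build the empirical third-order word-triplet co-occurrence tensor from a tiny bag-of-words corpus. The documents and vocabulary below are made up.

```python
import numpy as np
from itertools import permutations

# Sketch: empirical third-order moment of word-triplet co-occurrences.
vocab = ["campus", "police", "witness", "game", "score"]
word_id = {w: i for i, w in enumerate(vocab)}
docs = [
    ["campus", "police", "witness", "police"],
    ["game", "score", "campus", "game"],
    ["police", "witness", "witness", "campus"],
]

d = len(vocab)
M3 = np.zeros((d, d, d))
count = 0
for doc in docs:
    ids = [word_id[w] for w in doc]
    # Average over ordered triples of distinct positions in the document.
    for i, j, k in permutations(range(len(ids)), 3):
        M3[ids[i], ids[j], ids[k]] += 1.0
        count += 1
M3 /= count

# Under a single-topic model, M3 decomposes as sum_t P(topic=t) * a_t ⊗ a_t ⊗ a_t,
# where a_t is the topic-word column P[word | topic = t], recoverable by tensor methods.
print(M3.shape, M3.sum())   # (5, 5, 5), entries sum to 1
```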

[Figure: word-triplet moment tensor decomposed into topics, e.g. Education, Sports, Crime]

Tensors vs. Variational Inference

Criterion: Perplexity = exp[-likelihood].

Learning topics from PubMed on Spark, 8 million articles.

[Plots: perplexity and running time, tensor vs. variational]

Learning network communities from social network data: Facebook n ~ 20k, Yelp n ~ 40k, DBLP-sub n ~ 1e5, DBLP n ~ 1e6.

[Plots: error and running time on Facebook, Yelp, DBLP-sub, DBLP]

Orders of magnitude faster and more accurate.

F. Huang, U.N. Niranjan, M. Hakeem, A., "Online Tensor Methods for Training Latent Variable Models," JMLR 2014.

Learning Representations using Tensors

Shift-invariant Dictionary: Convolutional Model

[Figure: image = sum of dictionary filters convolved (*) with activation maps]
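An illustrative numpy sketch of the convolutional (shift-invariant) model, simplified to 1-D: the signal is a sum of dictionary filters convolved with sparse activation maps. All sizes and values are made up.

```python
import numpy as np

# Sketch: signal = sum_l (filter_l * activation_l), with sparse activations.
rng = np.random.default_rng(0)
sig_len, filt_len, n_filters = 64, 8, 2

filters = rng.normal(size=(n_filters, filt_len))       # shift-invariant dictionary
activations = np.zeros((n_filters, sig_len))
activations[0, [5, 40]] = 1.0                           # sparse activation maps
activations[1, [20, 50]] = -0.5

# 1-D "same"-length convolution of each filter with its activation map, then sum.
signal = sum(
    np.convolve(activations[l], filters[l], mode="same") for l in range(n_filters)
)
print(signal.shape)   # (64,)
```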

A. Agarwal, A., P. Jain, P. Netrapalli, "Learning Sparsely Used Overcomplete Dictionaries," COLT 2014.

A., M. Janzamin, R. Ge, "Overcomplete Tensor Decomposition," COLT 2015.

Huang, A., "Convolutional Dictionary Learning through Tensor Factorization," Proc. of JMLR 2015.

Learning Representations using Tensors

Efficient Tensor Decomposition with Shifted Components


$$M_3 = (F_1^*)^{\otimes 3} + \mathrm{shift}(F_1^*)^{\otimes 3} + \cdots + (F_2^*)^{\otimes 3} + \mathrm{shift}(F_2^*)^{\otimes 3} + \cdots$$
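A hedged numpy rendering of this structure (filters, sizes, and the use of all circular shifts are illustrative assumptions): the moment tensor is a sum of rank-1 cubes over shifted copies of each filter, so the many shifted components share only a few underlying filter parameters.

```python
import numpy as np

# Sketch: moment tensor for a shift-invariant dictionary as a sum of cubes of shifts.
rng = np.random.default_rng(0)
d, n_filters = 16, 2
F = rng.normal(size=(n_filters, d))                 # filters F_1, F_2 embedded in R^d

def cube(v):
    return np.einsum('i,j,k->ijk', v, v, v)         # v^{⊗3}

M3 = np.zeros((d, d, d))
for l in range(n_filters):
    for shift in range(d):                          # all circular shifts of filter l
        M3 += cube(np.roll(F[l], shift))

# The shifted copies of each filter share one set of parameters, so the decomposition
# is generated from only n_filters underlying filters.
print(M3.shape)
```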

Shift-invariant Dictionary: Linear Model with Constraints

[Figure: image = dictionary matrix times coefficient vector, with the shift structure imposed as constraints]

Results using Sentence Embeddings

No outside information for tensor methods; trained on ~4k samples. Skip-thought vectors trained on 74 million sentences.

Sentiment Analysis

Method              MR    SUBJ
Paragraph-vector    74.8  90.5
Skip-thought        75.5  92.1
ConvDic+DeconvDec   78.9  92.4

Paraphrase Detection

Method              Outside Information   F-score
Vector Similarity   word similarity       75.3%
RMLMG               syntactic info        80.5%
ConvDic+DeconvDec   none                  80.7%
Skip-thought        book corpus           81.9%

Reinforcement Learning of POMDPs

[Figure: observation window; plot of average reward vs. time for SM-UCRL-POMDP vs. DNN]

POMDP model with 8 hidden states (trained using tensor methods) vs. NN with 3 hidden layers of 30 neurons each (trained using RMSProp).

K. Azizzadenesheli, A. Lazaric, A., "Reinforcement Learning of POMDPs using Spectral Methods," COLT 2016.

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

Faster convergence to a better solution via tensor methods.

Outline

1 Introduction

2 Why Tensors?

3 Tensor Decomposition Methods

4 Learning Using Tensor Methods

5 Conclusion

Conclusion: Guaranteed Non-Convex Machine Learning

Matrix and tensor methods have desirable guarantees on reaching the global optimum for nonconvex problems.

Method-of-moments estimators using tensor decomposition.

Low-order polynomial computational and sample complexity.

Faster and better performance in practice.

Research Connections and Resources

Collaborators: Rong Ge (Duke), Daniel Hsu (Columbia), Sham Kakade (UW), Jennifer Chayes, Christian Borgs, Alex Smola (CMU), Prateek Jain, Alekh Agarwal & Praneeth Netrapalli (MSR), Srinivas Turaga (Janelia), Alessandro Lazaric (Inria), Hossein Mobahi.

Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/

Machine Learning @ AWS

THE TRINITY FUELING DIVERSE UTILITY: ALGORITHMS, LARGE DATASETS, COMPUTING

Apply for AWS credits for your research:

https://aws.amazon.com/grants/

ACADEMIC ENGAGEMENTS: Conduct research and build products at AWS: internships and full-time positions!

Send me an email: [email protected]

P2, DEEP AMI, AND DEEP TEMPLATE

P2 INSTANCES: up to 40k CUDA cores. DEEP AMI: pre-configured for deep learning. DEEP TEMPLATE: deep learning clusters.

MXNET IS AWS'S DEEP LEARNING FRAMEWORK OF CHOICE

MULTIPLE LANGUAGES: frontend and backend.

Single implementation of backend system and common operators; performance guarantee regardless of which front-end language is used.

AMALGAMATION
• Fit the core library with all dependencies into a single C++ source file
• Easy to compile on …

RUNS IN BROWSER WITH JAVASCRIPT. DEPLOY ANYWHERE.

The first image for search "dog" at images.google.com:

Outputs "beagle" with prob=73% within 1 second. BlindTool by Joseph Paul Cohen, demo on Nexus 4.

WRITING PARALLEL PROGRAMS IS PAINFUL

Each forward-backward-update involves O(num_layer), which is often 100-1,000, tensor computations and communications.

[Figure: dependency graph for 2-layer neural networks with 2 GPUs, e.g. fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0]); fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50]); fc1_weight[cpu] -= lr * fc1_wgrad[gpu0]; ...]
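To make the pain concrete, here is a hedged, pure-numpy sketch (not MXNet code) of one manual data-parallel step for a single linear layer on two "devices": split the batch, compute per-device gradients, aggregate, update, and broadcast back. Doing this bookkeeping by hand for every layer and iteration is exactly what MXNet's dependency engine automates.

```python
import numpy as np

# Sketch of one manual data-parallel update for a linear model on two devices.
rng = np.random.default_rng(0)
d, batch = 10, 100
weight_cpu = rng.normal(size=d)

data = rng.normal(size=(batch, d))
label = data @ np.arange(d, dtype=float) + 0.1 * rng.normal(size=batch)

# "Copy" parameters to each device and split the batch.
weight_gpu = [weight_cpu.copy(), weight_cpu.copy()]
shards = [(data[:50], label[:50]), (data[50:], label[50:])]

grads = []
for dev, (x, y) in enumerate(shards):
    pred = x @ weight_gpu[dev]                       # forward on device `dev`
    grads.append(x.T @ (pred - y) / len(y))          # backward on device `dev`

weight_cpu -= 0.01 * (grads[0] + grads[1])           # aggregate gradients, update on cpu
weight_gpu = [weight_cpu.copy(), weight_cpu.copy()]  # broadcast updated weights back
print(weight_cpu[:3])
```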

[Plots: multi-GPU scaling vs. ideal for Inception v3, ResNet, and AlexNet; 91% efficiency at 16 GPUs and 88% efficiency at 256 GPUs]

CREATE AND COMPARE YOUR OWN FRAMEWORK PERFORMANCE: SPIN UP, LOG IN, RUN BENCHMARKS

github.com/awslabs/deeplearning-benchmark