
CS7015 (Deep Learning) : Lecture 7
Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders

Mitesh M. Khapra

Department of Computer Science and Engineering, Indian Institute of Technology Madras

Module 7.1: Introduction to Autoencoders

An autoencoder is a special type of feedforward neural network which does the following:

It encodes its input $x_i$ into a hidden representation $h$:
$$h = g(Wx_i + b)$$

It decodes the input again from this hidden representation:
$$\hat{x}_i = f(W^*h + c)$$

The model is trained to minimize a certain loss function which will ensure that $\hat{x}_i$ is close to $x_i$ (we will see some such loss functions soon).

[Figure: a network with input layer $x_i$, encoder weights $W$, hidden layer $h$, decoder weights $W^*$, and reconstruction $\hat{x}_i$.]
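As a concrete illustration of the two maps above, here is a minimal NumPy sketch of a single-hidden-layer autoencoder forward pass. The function names and dimensions are illustrative assumptions, not part of the lecture; $g$ is taken to be the sigmoid and $f$ the identity for now (both choices are discussed below).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W, b, W_star, c):
    """Encode x into h, then decode h back into x_hat.

    h     = g(W x + b)      (encoder)
    x_hat = f(W* h + c)     (decoder; f = identity in this sketch)
    """
    h = sigmoid(W @ x + b)      # hidden representation
    x_hat = W_star @ h + c      # reconstruction
    return h, x_hat

# Illustrative shapes: n-dimensional input, k-dimensional hidden layer
n, k = 5, 3
rng = np.random.default_rng(0)
W, b = rng.normal(size=(k, n)), np.zeros(k)
W_star, c = rng.normal(size=(n, k)), np.zeros(n)

x = rng.normal(size=n)
h, x_hat = autoencoder_forward(x, W, b, W_star, c)
print(h.shape, x_hat.shape)     # (3,) (5,)
```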

Let us consider the case where $\dim(h) < \dim(x_i)$.

If we are still able to reconstruct $\hat{x}_i$ perfectly from $h$, then what does it say about $h$? It says that $h$ is a loss-free encoding of $x_i$: it captures all the important characteristics of $x_i$. Do you see an analogy with PCA?

An autoencoder where $\dim(h) < \dim(x_i)$ is called an undercomplete autoencoder.

Let us consider the case when $\dim(h) \geq \dim(x_i)$.

In such a case the autoencoder could learn a trivial encoding by simply copying $x_i$ into $h$ and then copying $h$ into $\hat{x}_i$. Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data.

An autoencoder where $\dim(h) \geq \dim(x_i)$ is called an overcomplete autoencoder.

The Road Ahead:
Choice of $f(x_i)$ and $g(x_i)$
Choice of loss function

Suppose all our inputs are binary (each $x_{ij} \in \{0, 1\}$), e.g. $x_i = (0, 1, 1, 0, 1)$. Which of the following functions would be most apt for the decoder?

$$\hat{x}_i = \tanh(W^*h + c)$$
$$\hat{x}_i = W^*h + c$$
$$\hat{x}_i = \text{logistic}(W^*h + c)$$

The logistic function, as it naturally restricts all outputs to be between 0 and 1.

$g$ is typically chosen as the sigmoid function.

Now suppose all our inputs are real valued (each $x_{ij} \in \mathbb{R}$), e.g. $x_i = (0.25, 0.5, 1.25, 3.5, 4.5)$. Which of the same three functions would be most apt for the decoder?

What will logistic and tanh do? They will restrict the reconstructed $\hat{x}_i$ to lie in $[0, 1]$ or $[-1, 1]$, whereas we want $\hat{x}_i \in \mathbb{R}^n$. So the linear decoder $\hat{x}_i = W^*h + c$ is the apt choice here.

Again, $g$ is typically chosen as the sigmoid function.
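A small sketch (an illustrative assumption, not from the slides) contrasting the two decoder choices just discussed: a logistic decoder for binary inputs and a linear decoder for real-valued inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode(h, W_star, c, output="real"):
    """Decoder x_hat = f(W* h + c) with f chosen to match the input type."""
    a = W_star @ h + c
    if output == "binary":
        return sigmoid(a)   # outputs in (0, 1), interpretable as probabilities
    return a                # linear decoder: outputs anywhere in R^n

rng = np.random.default_rng(0)
h = rng.normal(size=3)
W_star, c = rng.normal(size=(5, 3)), np.zeros(5)
print(decode(h, W_star, c, output="binary"))  # all entries in (0, 1)
print(decode(h, W_star, c, output="real"))    # unbounded real values
```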

The Road Ahead: Choice of loss function

Consider the case when the inputs are real valued.

The objective of the autoencoder is to reconstruct $\hat{x}_i$ to be as close to $x_i$ as possible. This can be formalized using the following objective function:

$$\min_{W, W^*, c, b}\ \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}(\hat{x}_{ij} - x_{ij})^2$$

i.e.,

$$\min_{W, W^*, c, b}\ \frac{1}{m}\sum_{i=1}^{m}(\hat{x}_i - x_i)^T(\hat{x}_i - x_i)$$

We can then train the autoencoder just like a regular feedforward network using backpropagation. All we need is a formula for $\frac{\partial \mathscr{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathscr{L}(\theta)}{\partial W}$, which we will see now.
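The squared-error objective above can be written directly in NumPy. The sketch below (illustrative, with assumed variable names) computes the loss over a batch of $m$ examples stored as rows of a matrix.

```python
import numpy as np

def squared_error_loss(X, X_hat):
    """(1/m) * sum_i sum_j (x_hat_ij - x_ij)^2 for an (m, n) batch."""
    m = X.shape[0]
    return np.sum((X_hat - X) ** 2) / m

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))                   # m = 4 examples, n = 5 dimensions
X_hat = X + 0.1 * rng.normal(size=X.shape)    # stand-in reconstructions
print(squared_error_loss(X, X_hat))
```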

Note that the loss function is shown for only one training example:

$$\mathscr{L}(\theta) = (\hat{x}_i - x_i)^T(\hat{x}_i - x_i)$$

Here the network layers are $h_0 = x_i$, then $a_1$ and $h_1$ (computed with the encoder weights $W$), then $a_2$ and $h_2 = \hat{x}_i$ (computed with the decoder weights $W^*$). By the chain rule,

$$\frac{\partial \mathscr{L}(\theta)}{\partial W^*} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2}\frac{\partial h_2}{\partial a_2}\frac{\partial a_2}{\partial W^*}$$

$$\frac{\partial \mathscr{L}(\theta)}{\partial W} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2}\frac{\partial h_2}{\partial a_2}\frac{\partial a_2}{\partial h_1}\frac{\partial h_1}{\partial a_1}\frac{\partial a_1}{\partial W}$$

We have already seen how to calculate the expressions in the boxes (the remaining factors in these chain rules) when we learnt backpropagation. The new term is

$$\frac{\partial \mathscr{L}(\theta)}{\partial h_2} = \frac{\partial \mathscr{L}(\theta)}{\partial \hat{x}_i} = \nabla_{\hat{x}_i}\left\{(\hat{x}_i - x_i)^T(\hat{x}_i - x_i)\right\} = 2(\hat{x}_i - x_i)$$

Consider the case when the inputs are binary. We use a sigmoid decoder, which will produce outputs between 0 and 1, and these can be interpreted as probabilities.

For a single $n$-dimensional $i$-th input we can use the following (cross-entropy) loss function:

$$\min\left\{-\sum_{j=1}^{n}\left(x_{ij}\log\hat{x}_{ij} + (1 - x_{ij})\log(1 - \hat{x}_{ij})\right)\right\}$$

What value of $\hat{x}_{ij}$ will minimize this function? If $x_{ij} = 1$? If $x_{ij} = 0$? Indeed, the above function is minimized when $\hat{x}_{ij} = x_{ij}$!

Again, all we need is a formula for $\frac{\partial\mathscr{L}(\theta)}{\partial W^*}$ and $\frac{\partial\mathscr{L}(\theta)}{\partial W}$ to use backpropagation.
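A small numerical check of this loss (illustrative, with assumed names): the cross-entropy is lowest when each $\hat{x}_{ij}$ matches $x_{ij}$.

```python
import numpy as np

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """-(x log x_hat + (1 - x) log(1 - x_hat)), summed over dimensions."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

x = np.array([0, 1, 1, 0, 1], dtype=float)
print(binary_cross_entropy(x, np.full(5, 0.5)))                       # mediocre
print(binary_cross_entropy(x, np.array([0.1, 0.9, 0.9, 0.1, 0.9])))   # better
print(binary_cross_entropy(x, x))                                     # (near) perfect: ~0
```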

For this loss (again shown for one training example),

$$\mathscr{L}(\theta) = -\sum_{j=1}^{n}\left(x_{ij}\log\hat{x}_{ij} + (1 - x_{ij})\log(1 - \hat{x}_{ij})\right)$$

the chain rule gives, as before,

$$\frac{\partial\mathscr{L}(\theta)}{\partial W^*} = \frac{\partial\mathscr{L}(\theta)}{\partial h_2}\frac{\partial h_2}{\partial a_2}\frac{\partial a_2}{\partial W^*}$$

$$\frac{\partial\mathscr{L}(\theta)}{\partial W} = \frac{\partial\mathscr{L}(\theta)}{\partial h_2}\frac{\partial h_2}{\partial a_2}\frac{\partial a_2}{\partial h_1}\frac{\partial h_1}{\partial a_1}\frac{\partial a_1}{\partial W}$$

We have already seen how to calculate the expressions in the square boxes (the remaining factors) when we learnt backpropagation. The first two terms on the RHS can be computed as:

$$\frac{\partial\mathscr{L}(\theta)}{\partial h_{2j}} = -\frac{x_{ij}}{\hat{x}_{ij}} + \frac{1 - x_{ij}}{1 - \hat{x}_{ij}}, \qquad \frac{\partial h_{2j}}{\partial a_{2j}} = \sigma(a_{2j})(1 - \sigma(a_{2j}))$$

where

$$\frac{\partial\mathscr{L}(\theta)}{\partial h_2} = \begin{bmatrix} \frac{\partial\mathscr{L}(\theta)}{\partial h_{21}} \\ \frac{\partial\mathscr{L}(\theta)}{\partial h_{22}} \\ \vdots \\ \frac{\partial\mathscr{L}(\theta)}{\partial h_{2n}} \end{bmatrix}$$

Module 7.2: Link between PCA and Autoencoders

We will now see that the encoder part of an autoencoder is equivalent to PCA if we
use a linear encoder,
use a linear decoder,
use the squared error loss function, and
normalize the inputs to
$$\hat{x}_{ij} = \frac{1}{\sqrt{m}}\left(x_{ij} - \frac{1}{m}\sum_{k=1}^{m}x_{kj}\right)$$

[Figure: the autoencoder $x_i \to h \to \hat{x}_i$ shown alongside the PCA picture with principal directions $u_1, u_2$; PCA finds a projection matrix $P$ such that $P^TX^TXP = D$.]

First let us consider the implication of normalizing the inputs to

$$\hat{x}_{ij} = \frac{1}{\sqrt{m}}\left(x_{ij} - \frac{1}{m}\sum_{k=1}^{m}x_{kj}\right)$$

The operation in the bracket ensures that the data now has zero mean along each dimension $j$ (we are subtracting the mean).

Let $X'$ be this zero-mean data matrix; then what the above normalization gives us is $X = \frac{1}{\sqrt{m}}X'$.

Now $X^TX = \frac{1}{m}(X')^TX'$ is the covariance matrix (recall that the covariance matrix plays an important role in PCA).
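A quick numerical check of this claim (illustrative, with assumed names): after the normalization above, $X^TX$ equals the covariance matrix of the raw data, using the $\frac{1}{m}$ convention.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 4
X_raw = rng.normal(size=(m, n)) * np.array([1.0, 2.0, 0.5, 3.0])

# Normalize: subtract the per-dimension mean, then scale by 1/sqrt(m)
X = (X_raw - X_raw.mean(axis=0)) / np.sqrt(m)

X_centered = X_raw - X_raw.mean(axis=0)
cov = (X_centered.T @ X_centered) / m     # covariance with the 1/m convention

print(np.allclose(X.T @ X, cov))          # True
```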

First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function

$$\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}(x_{ij} - \hat{x}_{ij})^2$$

is obtained when we use a linear encoder.

18/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 v u m n ∗ 2 uX X 2 min (kX − HW kF ) kAkF = t aij W ∗H i=1 j=1

(just writing the expression (1) in matrix form and using the definition of ||A||F ) (we are ignoring the biases)

This is equivalent to

From SVD we know that optimal solution to the above problem is given by

∗ T HW = U.,≤kΣk,kV.,≤k

By matching variables one possible solution is

H = U.,≤kΣk,k ∗ T W = V.,≤k

m n X X 2 min (xij − xˆij) (1) θ i=1 j=1

19/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 v u m n ∗ 2 uX X 2 min (kX − HW kF ) kAkF = t aij W ∗H i=1 j=1

(just writing the expression (1) in matrix form and using the definition of ||A||F ) (we are ignoring the biases) From SVD we know that optimal solution to the above problem is given by

∗ T HW = U.,≤kΣk,kV.,≤k

By matching variables one possible solution is

H = U.,≤kΣk,k ∗ T W = V.,≤k

m n X X 2 min (xij − xˆij) (1) θ i=1 j=1 This is equivalent to

19/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 v u m n uX X 2 kAkF = t aij i=1 j=1

(just writing the expression (1) in matrix form and using the definition of ||A||F ) (we are ignoring the biases) From SVD we know that optimal solution to the above problem is given by

∗ T HW = U.,≤kΣk,kV.,≤k

By matching variables one possible solution is

H = U.,≤kΣk,k ∗ T W = V.,≤k

m n X X 2 min (xij − xˆij) (1) θ i=1 j=1 This is equivalent to

∗ 2 min (kX − HW kF ) W ∗H

19/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 (just writing the expression (1) in matrix form and using the definition of ||A||F ) (we are ignoring the biases) From SVD we know that optimal solution to the above problem is given by

∗ T HW = U.,≤kΣk,kV.,≤k

By matching variables one possible solution is

H = U.,≤kΣk,k ∗ T W = V.,≤k

m n X X 2 min (xij − xˆij) (1) θ i=1 j=1 This is equivalent to v u m n ∗ 2 uX X 2 min (kX − HW kF ) kAkF = t aij W ∗H i=1 j=1

19/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 From SVD we know that optimal solution to the above problem is given by

∗ T HW = U.,≤kΣk,kV.,≤k

By matching variables one possible solution is

H = U.,≤kΣk,k ∗ T W = V.,≤k

m n X X 2 min (xij − xˆij) (1) θ i=1 j=1 This is equivalent to v u m n ∗ 2 uX X 2 min (kX − HW kF ) kAkF = t aij W ∗H i=1 j=1

(just writing the expression (1) in matrix form and using the definition of ||A||F ) (we are ignoring the biases)

19/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 By matching variables one possible solution is

H = U.,≤kΣk,k ∗ T W = V.,≤k

m n X X 2 min (xij − xˆij) (1) θ i=1 j=1 This is equivalent to v u m n ∗ 2 uX X 2 min (kX − HW kF ) kAkF = t aij W ∗H i=1 j=1

(just writing the expression (1) in matrix form and using the definition of ||A||F ) (we are ignoring the biases) From SVD we know that optimal solution to the above problem is given by

∗ T HW = U.,≤kΣk,kV.,≤k

19/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 m n X X 2 min (xij − xˆij) (1) θ i=1 j=1 This is equivalent to v u m n ∗ 2 uX X 2 min (kX − HW kF ) kAkF = t aij W ∗H i=1 j=1

(just writing the expression (1) in matrix form and using the definition of ||A||F ) (we are ignoring the biases) From SVD we know that optimal solution to the above problem is given by

∗ T HW = U.,≤kΣk,kV.,≤k

By matching variables one possible solution is

H = U.,≤kΣk,k ∗ T W = V.,≤k
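The sketch below (illustrative; it assumes numpy.linalg.svd and made-up data) forms the truncated-SVD factors $U_{.,\leq k}$, $\Sigma_{k,k}$, $V_{.,\leq k}$ and checks that $HW^*$ is the rank-$k$ reconstruction of $X$ given by the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 20, 6, 2
X = rng.normal(size=(m, n))

# Full SVD: X = U Sigma V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncated factors (top-k singular values/vectors)
H = U[:, :k] * s[:k]          # H  = U_{.,<=k} Sigma_{k,k}
W_star = Vt[:k, :]            # W* = V_{.,<=k}^T

X_rank_k = H @ W_star         # rank-k approximation of X (optimal in Frobenius norm)
print(X_rank_k.shape)                      # (20, 6)
print(np.linalg.matrix_rank(X_rank_k))     # 2
print(np.linalg.norm(X - X_rank_k))        # reconstruction error
```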

19/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I)

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV )

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I)

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A )

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I)

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A )

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k)

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 H = XV.,≤k

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

20/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U.,≤kΣk,k T T −1 T T −1 = (XX )(XX ) U.,≤K Σk,k (pre-multiplying (XX )(XX ) = I) T T T T T −1 T = (XV Σ U )(UΣV V Σ U ) U.,≤kΣk,k (using X = UΣV ) T T T T −1 T = XV Σ U (UΣΣ U ) U.,≤kΣk,k (V V = I) T T T −1 T −1 −1 −1 −1 = XV Σ U U(ΣΣ ) U U.,≤kΣk,k ((ABC) = C B A ) T T −1 T T = XV Σ (ΣΣ ) U U.,≤kΣk,k (U U = I) T T −1 −1 T −1 −1 −1 = XV Σ Σ Σ U U.,≤kΣk,k ((AB) = B A ) −1 T = XV Σ I.,≤kΣk,k (U U.,≤k = I.,≤k) −1 −1 = XVI.,≤k (Σ I.,≤k = Σk,k)

H = XV.,≤k

Thus H is a linear transformation of X and W = V.,≤k
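The following sketch (illustrative, with assumed names) checks both claims numerically: $U_{.,\leq k}\,\Sigma_{k,k}$ equals $XV_{.,\leq k}$, and $V$ (from the SVD of $X$) contains the eigenvectors of $X^TX$, which for data normalized as above is the covariance matrix used in PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 5, 2
X_raw = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # correlated data

# Normalize as on the earlier slide: zero mean per dimension, scale by 1/sqrt(m)
X = (X_raw - X_raw.mean(axis=0)) / np.sqrt(m)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Claim 1: H = U_{.,<=k} Sigma_{k,k} is the same as X V_{.,<=k}
H = U[:, :k] * s[:k]
print(np.allclose(H, X @ V[:, :k]))          # True

# Claim 2: columns of V are eigenvectors of X^T X (the covariance matrix here)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
top = eigvecs[:, ::-1][:, :k]                # top-k eigenvectors (PCA directions)
match = [np.allclose(np.abs(V[:, j]), np.abs(top[:, j])) for j in range(k)]
print(all(match))                            # True (up to sign)
```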

We have encoder $W = V_{.,\leq k}$.

From SVD, we know that $V$ is the matrix of eigenvectors of $X^TX$.

From PCA, we know that $P$ is the matrix of eigenvectors of the covariance matrix.

We saw earlier that, if the entries of $X$ are normalized by

$$\hat{x}_{ij} = \frac{1}{\sqrt{m}}\left(x_{ij} - \frac{1}{m}\sum_{k=1}^{m}x_{kj}\right)$$

then $X^TX$ is indeed the covariance matrix.

Thus, the encoder matrix for the linear autoencoder ($W$) and the projection matrix ($P$) for PCA could indeed be the same. Hence proved.

21/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Remember
The encoder of a linear autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use a squared error loss function
and normalize the inputs to

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)
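As a sanity check, here is a minimal numpy sketch of this equivalence (not part of the lecture; the sizes m, n, k and variable names are illustrative). After the normalization above, the top-k right singular vectors of X coincide, up to sign, with the top-k eigenvectors of the covariance matrix X^T X, so the linear-AE code H = X V_{.,\le k} matches the PCA projection.

```python
import numpy as np

# Toy data: m examples, n features, k retained dimensions (illustrative sizes)
rng = np.random.default_rng(0)
m, n, k = 200, 10, 3
X_raw = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))

# Normalize as on the slide: subtract column means, scale by 1/sqrt(m)
X = (X_raw - X_raw.mean(axis=0)) / np.sqrt(m)

# SVD of the normalized data: X = U Sigma V^T
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                                   # V_{., <= k}

# PCA: eigenvectors of the covariance matrix X^T X, largest eigenvalues first
_, eigvecs = np.linalg.eigh(X.T @ X)
P_k = eigvecs[:, ::-1][:, :k]

# The columns agree up to a sign flip, so compare absolute cosines and scores
print(np.abs(np.sum(V_k * P_k, axis=0)))                    # ~[1. 1. 1.]
print(np.max(np.abs(np.abs(X @ V_k) - np.abs(X @ P_k))))    # ~0
```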

22/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Module 7.3: Regularization in autoencoders (Motivation)

23/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Figure: autoencoder x_i → h → x̂_i (weights W, W^*)

While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders
Here, (as stated earlier) the model can simply learn to copy x_i to h and then h to x̂_i
To avoid poor generalization, we need to introduce regularization

24/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

The simplest solution is to add an L2-regularization term to the objective function

\min_{\theta = \{W, W^*, b, c\}} \; \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 + \lambda \|\theta\|^2

This is very easy to implement and just adds a term \lambda W to the gradient \partial\mathscr{L}(\theta)/\partial W (and similarly for the other parameters)
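As a rough illustration (a minimal numpy sketch with assumed shapes and a made-up helper name, not the lecture's code), the regularized objective looks like this; the only change to the gradients is the extra term proportional to λW.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularized_loss(X, W, W_star, b, c, lam):
    """Squared-error reconstruction loss plus an L2 penalty on the weights.

    X: (m, n) inputs, W: (k, n) encoder, W_star: (n, k) decoder (assumed shapes).
    """
    m = X.shape[0]
    H = sigmoid(X @ W.T + b)             # h = g(W x_i + b), one row per example
    X_hat = H @ W_star.T + c             # x_hat_i = f(W* h + c), with f = identity here
    recon = np.sum((X_hat - X) ** 2) / m
    return recon + lam * (np.sum(W ** 2) + np.sum(W_star ** 2))

# Gradient of the penalty alone: d(lam * ||W||^2)/dW = 2 * lam * W,
# i.e. the extra lambda*W term on the slide (the factor 2 is usually folded into lambda).
```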

25/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Another trick is to tie the weights of the encoder and decoder, i.e., W^* = W^T
This effectively reduces the capacity of the autoencoder and acts as a regularizer
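A minimal sketch of weight tying (illustrative only; the shapes and the activation choice are assumptions): the decoder reuses the transpose of the encoder matrix, so only one weight matrix is stored and learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tied_forward(x, W, b, c):
    """Forward pass of a tied-weight autoencoder.

    W: (k, n) encoder matrix. The decoder uses W.T, so W* = W^T is never a
    separate parameter and the model has roughly half the weights to learn.
    """
    h = sigmoid(W @ x + b)     # h = g(W x + b)
    x_hat = W.T @ h + c        # x_hat = f(W^T h + c), with f = identity
    return h, x_hat
```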

26/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Module 7.4: Denoising Autoencoders

27/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Figure: denoising autoencoder x_i → x̃_i (corruption P(x̃_ij | x_ij)) → h → x̂_i

A denoising autoencoder simply corrupts the input data using a probabilistic process P(x̃_ij | x_ij) before feeding it to the network
A simple P(x̃_ij | x_ij) used in practice is the following

P(x̃_ij = 0 | x_ij) = q
P(x̃_ij = x_ij | x_ij) = 1 - q

In other words, with probability q the input is flipped to 0 and with probability (1 - q) it is retained as it is
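A minimal sketch of this masking corruption (illustrative numpy; the function name is made up):

```python
import numpy as np

def corrupt_mask(x, q, rng):
    """Masking noise: each entry x_ij is set to 0 with probability q, kept otherwise."""
    keep = rng.random(x.shape) >= q              # True with probability (1 - q)
    return x * keep

rng = np.random.default_rng(0)
x = rng.random((4, 8))
x_tilde = corrupt_mask(x, q=0.25, rng=rng)       # roughly 25% of the entries zeroed
```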

28/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

How does this help ?
This helps because the objective is still to reconstruct the original (uncorrupted) x_i

\arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

It no longer makes sense for the model to copy the corrupted x̃_i into h(x̃_i) and then into x̂_i (the objective function will not be minimized by doing so)
Instead the model will now have to capture the characteristics of the data correctly.
For example, it will have to learn to reconstruct a corrupted x_ij correctly by relying on its interactions with other elements of x_i

29/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

We will now see a practical application in which AEs are used and then compare Denoising Autoencoders with regular autoencoders

30/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Task: Hand-written digit recognition
|x_i| = 784 = 28 × 28
Figure: MNIST Data (28 × 28 images)
Figure: Basic approach (we use raw data as input features; the outputs are the 10 digit classes 0 to 9)

31/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

x̂_i ∈ R^{784}, h ∈ R^d
Figure: AE approach (first learn important characteristics of data)

32/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Figure: AE approach (and then train a classifier on top of this hidden representation)

33/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

We will now see a way of visualizing AEs and use this visualization to compare different AEs

34/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration x_i
For example,

h_1 = \sigma(W_1^T x_i)   [ignoring bias b]

where W_1 is the trained vector of weights connecting the input to the first hidden neuron
What values of x_i will cause h_1 to be maximum (or maximally activated)?
Suppose we assume that our inputs are normalized so that \|x_i\| = 1

\max_{x_i} \; W_1^T x_i \quad \text{s.t.} \quad \|x_i\|^2 = x_i^T x_i = 1

Solution: x_i = \frac{W_1}{\sqrt{W_1^T W_1}}

35/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Thus the inputs

x_i = \frac{W_1}{\sqrt{W_1^T W_1}}, \frac{W_2}{\sqrt{W_2^T W_2}}, \ldots, \frac{W_n}{\sqrt{W_n^T W_n}}

will respectively cause hidden neurons 1 to n to maximally fire
Let us plot these images (x_i's) which maximally activate the first k neurons of the hidden representations learned by a vanilla autoencoder and different denoising autoencoders
These x_i's are computed by the above formula using the weights (W_1, W_2, ..., W_k) learned by the respective autoencoders
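A rough sketch of this visualization step (illustrative; it assumes the learned encoder weights are available as a (k, 784) array with one row per hidden neuron):

```python
import numpy as np

def maximally_activating_inputs(W):
    """For each hidden neuron l, return the unit-norm input W_l / sqrt(W_l^T W_l)
    that maximizes W_l^T x subject to ||x|| = 1."""
    norms = np.sqrt(np.sum(W ** 2, axis=1, keepdims=True))
    return W / norms                      # shape (k, 784), one "filter" per row

# To inspect neuron 0 of a trained MNIST autoencoder (assuming matplotlib):
# filters = maximally_activating_inputs(W_learned)
# plt.imshow(filters[0].reshape(28, 28), cmap="gray"); plt.show()
```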

36/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Figure: Vanilla AE (No noise)   Figure: 25% Denoising AE (q=0.25)   Figure: 50% Denoising AE (q=0.5)

The vanilla AE does not learn many meaningful patterns
The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0' or a '2' or a '3' or a '8' or a '9')
As the noise increases, the filters become wider because the neuron has to rely on more adjacent pixels to feel confident about a stroke

37/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Figure: denoising autoencoder x_i → x̃_i (corruption P(x̃_ij | x_ij)) → h → x̂_i

We saw one form of P(x̃_ij | x_ij) which flips a fraction q of the inputs to zero
Another way of corrupting the inputs is to add Gaussian noise to the input

x̃_ij = x_ij + N(0, 1)

We will now use such a denoising AE on a different dataset and see its performance
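The additive-Gaussian corruption is a one-liner; a minimal sketch (illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 8))
x_tilde = x + rng.normal(loc=0.0, scale=1.0, size=x.shape)   # x_ij + N(0, 1)
```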

38/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Figure: Weight decay filters   Figure: AE filters   Figure: Data

The hidden neurons essentially behave like edge detectors
PCA does not give such edge detectors

39/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Module 7.5: Sparse Autoencoders

40/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

A hidden neuron with sigmoid activation will have values between 0 and 1
We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.
A sparse autoencoder tries to ensure that the neuron is inactive most of the time.

41/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

The average value of the activation of a neuron l is given by

\hat{\rho}_l = \frac{1}{m} \sum_{i=1}^{m} h(x_i)_l

If the neuron l is sparse (i.e. mostly inactive) then \hat{\rho}_l \to 0
A sparse autoencoder uses a sparsity parameter \rho (typically very close to 0, say, 0.005) and tries to enforce the constraint \hat{\rho}_l = \rho
One way of ensuring this is to add the following term to the objective function

\Omega(\theta) = \sum_{l=1}^{k} \rho \log\frac{\rho}{\hat{\rho}_l} + (1-\rho) \log\frac{1-\rho}{1-\hat{\rho}_l}

When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
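As a sketch (illustrative numpy; names assumed), the penalty is just a sum of KL divergences between Bernoulli(ρ) and Bernoulli(ρ̂_l), computed from a batch of hidden activations:

```python
import numpy as np

def sparsity_penalty(H, rho):
    """Omega(theta) for hidden activations H of shape (m, k).

    rho_hat_l is the mean activation of neuron l over the batch; the penalty is
    sum_l KL( Bernoulli(rho) || Bernoulli(rho_hat_l) ), which is 0 iff rho_hat_l = rho.
    """
    rho_hat = H.mean(axis=0)                      # shape (k,)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
```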

42/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Figure: plot of \Omega(\theta) versus \hat{\rho}_l for \rho = 0.2 (the curve attains its minimum at \hat{\rho}_l = 0.2)

The function will reach its minimum value(s) when \hat{\rho}_l = \rho.

43/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Now,

\hat{\mathscr{L}}(\theta) = \mathscr{L}(\theta) + \Omega(\theta)

\mathscr{L}(\theta) is the squared error loss or cross entropy loss and \Omega(\theta) is the sparsity constraint.
We already know how to calculate \frac{\partial \mathscr{L}(\theta)}{\partial W}
Let us see how to calculate \frac{\partial \Omega(\theta)}{\partial W}.

\Omega(\theta) = \sum_{l=1}^{k} \rho \log\frac{\rho}{\hat{\rho}_l} + (1-\rho) \log\frac{1-\rho}{1-\hat{\rho}_l}

Can be re-written as

\Omega(\theta) = \sum_{l=1}^{k} \rho\log\rho - \rho\log\hat{\rho}_l + (1-\rho)\log(1-\rho) - (1-\rho)\log(1-\hat{\rho}_l)

By Chain rule:

\frac{\partial \Omega(\theta)}{\partial W} = \frac{\partial \Omega(\theta)}{\partial \hat{\rho}} \cdot \frac{\partial \hat{\rho}}{\partial W}

\frac{\partial \Omega(\theta)}{\partial \hat{\rho}} = \left[ \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_1}, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_2}, \ldots, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_k} \right]^T

For each neuron l \in 1 \ldots k in the hidden layer, we have

\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_l} = -\frac{\rho}{\hat{\rho}_l} + \frac{1-\rho}{1-\hat{\rho}_l}

and \frac{\partial \hat{\rho}_l}{\partial W} = x_i (g'(W^T x_i + b))^T (see next slide)

Finally,

\frac{\partial \hat{\mathscr{L}}(\theta)}{\partial W} = \frac{\partial \mathscr{L}(\theta)}{\partial W} + \frac{\partial \Omega(\theta)}{\partial W}

(and we know how to calculate both terms on the R.H.S.)

44/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
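A minimal numpy sketch of this gradient (illustrative; the shapes, names, and the sigmoid choice of g are assumptions, using the slide's convention h = g(W^T x + b) with W of shape (n, k)), together with a finite-difference check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def omega(W, X, b, rho):
    """Sparsity penalty Omega(theta) for the encoder h = g(W^T x + b), g = sigmoid."""
    rho_hat = sigmoid(X @ W + b).mean(axis=0)          # shape (k,)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def omega_grad(W, X, b, rho):
    """dOmega/dW via the chain rule on the slide:
    dOmega/drho_hat_l = -rho/rho_hat_l + (1-rho)/(1-rho_hat_l)
    drho_hat_l/dW_jl  = (1/m) sum_i g'(W_{:,l}^T x_i + b_l) x_ij."""
    m = X.shape[0]
    H = sigmoid(X @ W + b)                             # (m, k)
    rho_hat = H.mean(axis=0)
    d = -rho / rho_hat + (1 - rho) / (1 - rho_hat)     # (k,)
    return (X.T @ (H * (1 - H) * d)) / m               # (n, k)

# Finite-difference check on a tiny random problem (illustrative sizes)
rng = np.random.default_rng(0)
m, n, k, rho = 50, 6, 4, 0.05
X, W, b = rng.normal(size=(m, n)), 0.1 * rng.normal(size=(n, k)), np.zeros(k)
G = omega_grad(W, X, b, rho)
eps = 1e-6
W2 = W.copy(); W2[2, 1] += eps
print(G[2, 1], (omega(W2, X, b, rho) - omega(W, X, b, rho)) / eps)  # should agree
```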

44/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 Derivation ∂ρˆ =  ∂ρˆ1 ∂ρˆ2 ... ∂ρˆk  ∂W ∂W ∂W ∂W

∂ρˆl For each element in the above equation we can calculate ∂W (which is the partial derivative of a scalar w.r.t. a matrix = matrix). For a single element of a matrix Wjl:-

h 1 Pm T i ∂ρˆ ∂ m i=1 g W:,lxi + bl l = ∂Wjl ∂Wjl h i m ∂ gW T x + b  1 X :,l i l = m ∂W i=1 jl m 1 X = g0W T x + b x m :,l i l ij i=1 So in matrix notation we can write it as : ∂ρˆ l = x (g0(W T x + b))T ∂W i i 45/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 Module 7.6: Contractive Autoencoders

46/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.
It does so by adding the following regularization term to the loss function

\Omega(\theta) = \|J_x(h)\|_F^2

where J_x(h) is the Jacobian of the encoder.
Let us see what it looks like.

47/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

If the input has n dimensions and the hidden layer has k dimensions then

J_x(h) = \begin{bmatrix} \frac{\partial h_1}{\partial x_1} & \cdots & \frac{\partial h_1}{\partial x_n} \\ \frac{\partial h_2}{\partial x_1} & \cdots & \frac{\partial h_2}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_k}{\partial x_1} & \cdots & \frac{\partial h_k}{\partial x_n} \end{bmatrix}

In other words, the (l, j) entry of the Jacobian captures the variation in the output of the l-th neuron with a small variation in the j-th input.

\|J_x(h)\|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2
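For a sigmoid encoder h = σ(Wx + b) the Jacobian has the closed form J_lj = h_l(1 - h_l) W_lj, which makes the penalty cheap to compute. Here is a minimal sketch (shapes and names assumed), with a numerically estimated Jacobian as a sanity check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """||J_x(h)||_F^2 for h = sigmoid(W x + b), using J_lj = h_l (1 - h_l) W_lj."""
    h = sigmoid(W @ x + b)                                   # shape (k,)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

# Compare against a finite-difference Jacobian
rng = np.random.default_rng(0)
n, k = 5, 3
x, W, b = rng.normal(size=n), rng.normal(size=(k, n)), rng.normal(size=k)
eps = 1e-6
J_num = np.zeros((k, n))
for j in range(n):
    x_eps = x.copy(); x_eps[j] += eps
    J_num[:, j] = (sigmoid(W @ x_eps + b) - sigmoid(W @ x + b)) / eps
print(contractive_penalty(x, W, b), np.sum(J_num ** 2))      # should be close
```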

48/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

What is the intuition behind this ?

\|J_x(h)\|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2

Consider \frac{\partial h_1}{\partial x_1}: what does it mean if \frac{\partial h_1}{\partial x_1} = 0?
It means that this neuron is not very sensitive to variations in the input x_1.
But doesn't this contradict our other goal of minimizing \mathscr{L}(\theta), which requires h to capture variations in the input?

49/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Indeed it does, and that's the idea
By putting these two contradicting objectives against each other, we ensure that h is sensitive to only very important variations as observed in the training data.
\mathscr{L}(\theta) - capture important variations in data
\Omega(\theta) - do not capture variations in data
Tradeoff - capture only very important variations in the data

50/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 Let us try to understand this with the help of an illustration.

51/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 y

Figure: data points in the x-y plane with directions u_1 (the direction of large variation) and u_2 (the direction of small variation)

Consider the variations in the data along directions u_1 and u_2
It makes sense to maximize a neuron to be sensitive to variations along u_1
At the same time it makes sense to inhibit a neuron from being sensitive to variations along u_2 (as it seems to be small noise and unimportant for reconstruction)
By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity.
What does this remind you of ?

52/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

Module 7.7 : Summary

53/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 ˆx y PCA

Figure: linear autoencoder x → h → x̂ shown alongside the PCA directions u_1, u_2 (the code h corresponds to the projection onto u_1, u_2)

PCA: P^T X^T X P = D
Linear autoencoder: \min_{\theta} \|X - HW^*\|_F^2, with X factorized as U\Sigma V^T (SVD)

54/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7 Regularization

Figure: denoising autoencoder x_i → x̃_i (corruption P(x̃_ij | x_ij)) → h → x̂_i

\Omega(\theta) = \lambda \|\theta\|^2   (weight decay)

\Omega(\theta) = \sum_{l=1}^{k} \rho \log\frac{\rho}{\hat{\rho}_l} + (1-\rho) \log\frac{1-\rho}{1-\hat{\rho}_l}   (sparse)

\Omega(\theta) = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2   (contractive)

55/55 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7