
Neural Networks:

Machine Learning

Based on slides and material from Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

This lecture

• What is a neural network?

• Predicting with a neural network

• Training neural networks – Backpropagation

• Practical concerns

Training a neural network

• Given
– A network architecture (layout of neurons, their connectivity and activations)
– A dataset of labeled examples S = {(x_i, y_i)}

• The goal: Learn the weights of the neural network

• Remember: For a fixed architecture, a neural network is a function parameterized by its weights
– Prediction: y = NN(w, x)

Recall: Learning as loss minimization

We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss L:

min_w Σ_i L(NN(w, x_i), y_i)    (perhaps with a regularizer)

So far, we saw that this strategy worked for:
1. Logistic Regression
2. Support Vector Machines
3. …
4. LMS regression
Each minimizes a different loss function.

All of these are linear models

Same idea for non-linear models too!

Back to our running example

Given an input x, how is the output predicted?

output y = w_0 + w_1 z_1 + w_2 z_2

z_1 = σ(u_0 + u_1 x_1 + u_2 x_2)

z_2 = σ(v_0 + v_1 x_1 + v_2 x_2)

Suppose the true label for this example is a number y*

We can write the square loss for this example as:

L = ½ (y − y*)²

Learning as loss minimization

We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss L:

min_w Σ_i L(NN(w, x_i), y_i)    (perhaps with a regularizer)

How do we solve the optimization problem?
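To make the setup concrete, here is a minimal sketch of the running example's forward pass and squared loss. The weight names u, v, w are illustrative assumptions, since the slide's own symbols did not survive extraction:

```python
import math

def sigmoid(a):
    # Logistic activation used by the hidden units
    return 1.0 / (1.0 + math.exp(-a))

def forward(x1, x2, u, v, w):
    # Hidden units: z1 = sigma(u0 + u1*x1 + u2*x2), z2 = sigma(v0 + v1*x1 + v2*x2)
    z1 = sigmoid(u[0] + u[1] * x1 + u[2] * x2)
    z2 = sigmoid(v[0] + v[1] * x1 + v[2] * x2)
    # Linear output: y = w0 + w1*z1 + w2*z2
    return w[0] + w[1] * z1 + w[2] * z2

def square_loss(y, y_star):
    # L = 1/2 (y - y*)^2
    return 0.5 * (y - y_star) ** 2
```

For instance, with all hidden-layer weights zero, both hidden units output σ(0) = 0.5, so the prediction is w_0 + 0.5·w_1 + 0.5·w_2.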

Stochastic gradient descent: min_w Σ_i L(NN(w, x_i), y_i)

Given a training set S = {(x_i, y_i)}, x ∈ ℜⁿ
1. Initialize parameters w
   (The objective is not convex. Initialization can be important.)
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      • Treat this example as the entire dataset; compute the gradient of the loss ∇L(NN(w, x_i), y_i)
      • Update: w ← w − γ_t ∇L(NN(w, x_i), y_i)
        (γ_t: learning rate, many tweaks possible)
3. Return w

Have we solved everything?
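A minimal sketch of this loop, assuming a plain squared loss on a one-dimensional linear model y = w·x (so the per-example gradient is (w·x − y)·x):

```python
import random

def sgd(S, epochs=50, lr=0.1, seed=0):
    """Stochastic gradient descent for y = w * x with squared loss."""
    rng = random.Random(seed)
    w = 0.0                          # 1. initialize parameters
    for _ in range(epochs):          # 2. for each epoch
        rng.shuffle(S)               # 2.1 shuffle the training set
        for x, y in S:               # 2.2 treat each example as the whole dataset
            grad = (w * x - y) * x   # gradient of 1/2 (w*x - y)^2
            w -= lr * grad           # update step
    return w                         # 3. return w
```

On data generated as y = 2x, the loop recovers w ≈ 2.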

The derivative of the loss function? ∇L(NN(w, x), y)

If the neural network is a differentiable function, we can find the gradient
– Or maybe its sub-gradient
– This is decided by the activation functions and the loss function

It was easy for SVMs and logistic regression
– Only one layer

But how do we find the sub-gradient of a more complex function?
– E.g.: A ~150 layer neural network for image classification!

We need an efficient algorithm: Backpropagation

Checkpoint: Where are we?

If we have a neural network (structure, activations and weights), we can make a prediction for an input

If we have the true label of the input, then we can define the loss for that example

If we can take the derivative of the loss with respect to each of the weights, we can take a gradient step in SGD

Questions?


Let's look at some simple expressions

f(x, y) = x + y:   ∂f/∂x = 1,   ∂f/∂y = 1

f(x, y) = xy:   ∂f/∂x = y,   ∂f/∂y = x

f(x, y) = max(x, y):   ∂f/∂x = 1 if x ≥ y, 0 otherwise;   ∂f/∂y = 1 if y ≥ x, 0 otherwise

Useful to keep in mind what these represent. In these (and all other) cases, ∂f/∂x represents the rate of change of the function f with respect to a small change in x.

More complicated cases? f(x, y, z) = z(x + y)

This is still simple enough to manually take derivatives, but let us work through this in a slightly different way.

Break down the function in terms of simple forms:

q = x + y
f = qz

Each of these is a simple form. We know how to compute ∂q/∂x, ∂q/∂y, ∂f/∂q, ∂f/∂z.

Key idea: Build up derivatives of compound expressions by breaking them down into simpler pieces, and applying the chain rule:

∂f/∂x = (∂f/∂q) · (∂q/∂x) = z · 1 = z

In terms of “computation graphs”: f(x, y, z) = z(x + y)

[Graph: inputs x = 2 and y = 1 feed a node that computes q = x + y = 3; q and the input z = 5 feed a node that computes f = qz = 15.]

The forward pass: computes function values for specific inputs.
The backward pass: computes derivatives of each intermediate node.

∂f/∂q = z = 5        ∂f/∂z = q = 3

∂f/∂x = (∂f/∂q) · (∂q/∂x) = 5 × 1 = 5

∂f/∂y = (∂f/∂q) · (∂q/∂y) = 5 × 1 = 5

The abstraction

• Each node in the graph knows two things: 1. How to compute the value of a function with respect to its inputs (forward)

2. How to compute the partial derivative of its output with respect to each of its inputs (backward)

• These can be defined independently of what happens in the rest of the graph

• We can build up complicated functions using simple nodes, and compute values and partial derivatives of these as well
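A toy sketch of this abstraction (illustrative only, not any particular library's API): each node stores its value and the local derivative with respect to each input, and the backward pass multiplies and accumulates these along the graph.

```python
class Node:
    """A value in a computation graph, with reverse-mode derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (input node, local derivative)
        self.grad = 0.0

    def backward(self, upstream=1.0):
        # Accumulate d(output)/d(this node), then push to inputs via the chain rule
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

def add(a, b):
    # d(a+b)/da = 1, d(a+b)/db = 1
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    # d(ab)/da = b, d(ab)/db = a
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

# The graph from the slides: f(x, y, z) = z(x + y), at x = 2, y = 1, z = 5
x, y, z = Node(2.0), Node(1.0), Node(5.0)
q = add(x, y)    # q = 3
f = mul(q, z)    # f = 15
f.backward()     # fills in df/dx, df/dy, df/dz
```

Here f.value is 15, and since ∂f/∂q = z = 5, both x.grad and y.grad come out to 5, while z.grad = q = 3.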

Meaning of the partial derivatives: how sensitive the value of f is to the value of each variable.

A notational convenience

� � �� �� �� = ⋅ = 3×2×2 = 12 2 1 �� �� �� �� �� �� = ⋅ = 3×1 = 3 �� �� �� 46 A notational convenience

Commonly nodes in the networks represent not only single numbers (e.g. features, outputs) but also entire vectors (an array of numbers), matrices (a 2d array of numbers) or tensors (an n-dimensional array of numbers).

47 A notational convenience

Commonly nodes in the networks represent not only single numbers (e.g. features, outputs) but also entire vectors (an array of numbers), matrices (a 2d array of numbers) or tensors (an n-dimensional array of numbers).

� Represents [�, �, �]

48 A notational convenience

Commonly nodes in the networks represent not only single numbers (e.g. features, outputs) but also entire vectors (an array of numbers), matrices (a 2d array of numbers) or tensors (an n-dimensional array of numbers).

� � � � = � � � �

49 A notational convenience

Commonly nodes in the networks represent not only single numbers (e.g. features, outputs) but also entire vectors (an array of numbers), matrices (a 2d array of numbers) or tensors (an n-dimensional array of numbers).

� = � �� �

50 A notational convenience

Commonly nodes in the networks represent not only single numbers (e.g. features, outputs) but also entire vectors (an array of numbers), matrices (a 2d array of numbers) or tensors (an n-dimensional array of numbers).

� = � �� �

Each element of � is �, � and is generated by the sigmoid activation to � each element of � �.

51 A notational convenience

Commonly nodes in the networks represent not only single numbers (e.g. features, outputs) but also entire vectors (an array of numbers), matrices (a 2d array of numbers) or tensors (an n-dimensional array of numbers).

� = �, �, �

� = � �� �

52 A notational convenience

Commonly nodes in the networks represent not only single numbers (e.g. features, outputs) but also entire vectors (an array of numbers), matrices (a 2d array of numbers) or tensors (an n-dimensional array of numbers).

� � = ��

� = � �� �

53 A notational convenience

Commonly nodes in the networks represent not only single numbers (e.g. features, outputs) but also entire vectors (an array of numbers), matrices (a 2d array of numbers) or tensors (an n-dimensional array of numbers).

� � = �� No activation because the output is defined to be linear �

� = � �� �

54 Reminder: Chain rule for derivatives

– If z is a function of y, and y is a function of x
• Then z is a function of x, as well
– Question: how to find dz/dx

dz/dx = (dz/dy) · (dy/dx)

– If z = a function of y_1 + a function of y_2, and the y's are functions of x
• Then z is a function of x, as well

dz/dx = (∂z/∂y_1) · (dy_1/dx) + (∂z/∂y_2) · (dy_2/dx)

– If z = a sum of functions of y_1, …, y_n, each a function of x
• Then z is a function of x, as well

dz/dx = Σ_i (∂z/∂y_i) · (dy_i/dx)
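These cases can be sanity-checked numerically. A sketch for the two-path rule, using the illustrative choices y_1 = x², y_2 = 3x and z = y_1 · y_2 (so z = 3x³ and dz/dx = 9x²):

```python
def dz_dx_chain(x):
    y1, y2 = x * x, 3.0 * x            # the y's as functions of x
    dz_dy1, dz_dy2 = y2, y1            # z = y1 * y2
    dy1_dx, dy2_dx = 2.0 * x, 3.0
    # Multivariable chain rule: sum the contribution of every path from x to z
    return dz_dy1 * dy1_dx + dz_dy2 * dy2_dx

def dz_dx_numeric(x, eps=1e-6):
    # Central finite difference on the directly composed function z = 3 x^3
    z = lambda t: (t * t) * (3.0 * t)
    return (z(x + eps) - z(x - eps)) / (2.0 * eps)
```

At x = 2 both give 9x² = 36, up to finite-difference error.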

Backpropagation

L = ½ (y − y*)²

output y = w_0 + w_1 z_1 + w_2 z_2

z_1 = σ(u_0 + u_1 x_1 + u_2 x_2)

z_2 = σ(v_0 + v_1 x_1 + v_2 x_2)

We want to compute the derivative of L with respect to each of the weights.
Important: L is a differentiable function of all the weights.

Backpropagation: applying the chain rule to compute the gradient (and remembering partial computations along the way to speed things up).

Backpropagation example: Output layer

L = ½ (y − y*)²,   output y = w_0 + w_1 z_1 + w_2 z_2

∂L/∂w_0 = (∂L/∂y) · (∂y/∂w_0)

∂L/∂y = y − y*        ∂y/∂w_0 = 1

∂L/∂w_1 = (∂L/∂y) · (∂y/∂w_1)

∂L/∂y = y − y*        ∂y/∂w_1 = z_1

We have already computed the partial derivative ∂L/∂y for the previous case.

Cache to speed up!
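The output-layer gradients in code, computing the shared factor ∂L/∂y once and reusing it (a sketch in the notation above):

```python
def output_layer_grads(y, y_star, z1, z2):
    """Gradients of L = 1/2 (y - y*)^2 w.r.t. w0, w1, w2, where y = w0 + w1*z1 + w2*z2."""
    dL_dy = y - y_star        # computed once and cached
    dL_dw0 = dL_dy * 1.0      # dy/dw0 = 1
    dL_dw1 = dL_dy * z1       # dy/dw1 = z1
    dL_dw2 = dL_dy * z2       # dy/dw2 = z2
    return dL_dw0, dL_dw1, dL_dw2
```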

Backpropagation example: Hidden layer derivatives

L = ½ (y − y*)²

output y = w_0 + w_1 z_1 + w_2 z_2

z_1 = σ(u_0 + u_1 x_1 + u_2 x_2)

z_2 = σ(v_0 + v_1 x_1 + v_2 x_2)

We want ∂L/∂u_1.

∂L/∂u_1 = (∂L/∂y) · (∂y/∂u_1)

∂y/∂u_1 = ∂(w_0 + w_1 z_1 + w_2 z_2)/∂u_1 = w_1 (∂z_1/∂u_1) + w_2 (∂z_2/∂u_1)

∂z_2/∂u_1 = 0, because z_2 is not a function of u_1. So:

∂y/∂u_1 = w_1 (∂z_1/∂u_1)

z_1 = σ(u_0 + u_1 x_1 + u_2 x_2). Call the argument of σ here s, so that z_1 = σ(s) and

∂z_1/∂u_1 = (∂z_1/∂s) · (∂s/∂u_1)

Putting it together:

∂L/∂u_1 = (∂L/∂y) · w_1 · (∂z_1/∂s) · (∂s/∂u_1)

Each of these partial derivatives is easy:

∂L/∂y = y − y*
∂z_1/∂s = z_1(1 − z_1)    (because σ is the logistic function we have already seen)
∂s/∂u_1 = x_1

More important: we have already computed many of these partial derivatives, because we are proceeding from top to bottom (i.e. backwards).
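Putting the chain together in code, with a finite-difference check that the product really equals ∂L/∂u_1 (weight names u, v, w are illustrative assumptions):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(x1, x2, u, v, w, y_star):
    # Forward pass of the running example, then squared loss
    z1 = sigmoid(u[0] + u[1] * x1 + u[2] * x2)
    z2 = sigmoid(v[0] + v[1] * x1 + v[2] * x2)
    y = w[0] + w[1] * z1 + w[2] * z2
    return 0.5 * (y - y_star) ** 2

def dL_du1(x1, x2, u, v, w, y_star):
    # Chain rule: dL/du1 = (y - y*) * w1 * z1 * (1 - z1) * x1
    z1 = sigmoid(u[0] + u[1] * x1 + u[2] * x2)
    z2 = sigmoid(v[0] + v[1] * x1 + v[2] * x2)
    y = w[0] + w[1] * z1 + w[2] * z2
    return (y - y_star) * w[1] * z1 * (1.0 - z1) * x1
```

Nudging u_1 by ±1e-6 changes the loss at exactly the rate the chain rule predicts.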

The Backpropagation Algorithm

The same algorithm works for multiple layers, and more complicated architectures

Repeated application of the chain rule for partial derivatives:
– First perform a forward pass from inputs to the output
– Compute the loss

– From the loss, proceed backwards to compute partial derivatives using the chain rule
– Cache partial derivatives as you compute them
• They will be used for lower layers
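A sketch of these steps for a simple chain of sigmoid units (an illustrative toy, not code from the lecture): the forward pass caches every activation, and the backward pass reuses the cached values and the running upstream product as it moves to lower layers.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, ws):
    """Forward pass through a chain of sigmoid units; cache every activation."""
    acts = [x]
    for w in ws:
        acts.append(sigmoid(w * acts[-1]))
    return acts

def backward(acts, ws, dL_dout):
    """Backward pass: reuse cached activations, return dL/dw for each layer."""
    grads = [0.0] * len(ws)
    upstream = dL_dout                       # dL / d(current activation)
    for i in reversed(range(len(ws))):
        a_out, a_in = acts[i + 1], acts[i]
        local = a_out * (1.0 - a_out)        # sigmoid'(s), from the cached value
        grads[i] = upstream * local * a_in   # dL/dw_i
        upstream = upstream * local * ws[i]  # pass the derivative to the layer below
    return grads
```

Each layer's gradient reuses the upstream product already computed for the layers above it; nothing is recomputed.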

Mechanizing learning

• Backpropagation gives you the gradient that will be used for gradient descent – SGD gives us a generic learning algorithm – Backpropagation is a generic method for computing partial derivatives

• A recursive algorithm that proceeds from the top of the network to the bottom

• Modern neural network libraries implement automatic differentiation using backpropagation
– Allows easy exploration of network architectures
– Don't have to keep deriving the gradients by hand each time

Stochastic gradient descent: min_w Σ_i L(NN(w, x_i), y_i)

Given a training set S = {(x_i, y_i)}, x ∈ ℜⁿ
1. Initialize parameters w
   (The objective is not convex. Initialization can be important.)
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      • Treat this example as the entire dataset
      • Compute the gradient of the loss ∇L(NN(w, x_i), y_i) using backpropagation
      • Update: w ← w − γ_t ∇L(NN(w, x_i), y_i)
        (γ_t: learning rate, many tweaks possible)
3. Return w
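Putting the two pieces together for the running example: backpropagation supplies the per-example gradient, and SGD consumes it. The weight names u, v, w and the initial values are illustrative assumptions:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, u, v, w):
    z1 = sigmoid(u[0] + u[1] * x[0] + u[2] * x[1])
    z2 = sigmoid(v[0] + v[1] * x[0] + v[2] * x[1])
    y = w[0] + w[1] * z1 + w[2] * z2
    return z1, z2, y

def backprop(x, y_star, u, v, w):
    # Forward pass, then chain rule from the loss down, caching shared factors
    z1, z2, y = forward(x, u, v, w)
    dL_dy = y - y_star
    dw = [dL_dy, dL_dy * z1, dL_dy * z2]
    dL_ds1 = dL_dy * w[1] * z1 * (1 - z1)   # dL/d(pre-activation of z1), shared by all u gradients
    dL_ds2 = dL_dy * w[2] * z2 * (1 - z2)   # likewise for z2 and the v gradients
    du = [dL_ds1, dL_ds1 * x[0], dL_ds1 * x[1]]
    dv = [dL_ds2, dL_ds2 * x[0], dL_ds2 * x[1]]
    return du, dv, dw

def train(S, epochs=2000, lr=0.3):
    u, v, w = [0.1, 0.2, -0.1], [-0.2, 0.1, 0.2], [0.0, 0.3, -0.3]
    for _ in range(epochs):
        for x, y_star in S:
            du, dv, dw = backprop(x, y_star, u, v, w)   # gradient via backpropagation
            u = [ui - lr * g for ui, g in zip(u, du)]   # SGD update
            v = [vi - lr * g for vi, g in zip(v, dv)]
            w = [wi - lr * g for wi, g in zip(w, dw)]
    return u, v, w
```

Each SGD step calls backprop on a single example, exactly as in the boxed algorithm.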
