
Section 3: Gradient Descent & Backpropagation Practice Problems

Problem 1. Computation Graph Review

Let's assume we have a simple function f(x, y, z) = (x + y)z. We can break this up into the equations q = x + y and f = qz. Using this simplified notation, we can represent the function as a computation graph:

Now let's assume that we are evaluating this function at x = -2, y = 5, and z = -4. In addition, let the value of the upstream gradient (the gradient of the loss with respect to our function, ∂L/∂f) equal 1. These values are filled in for you in the computation graph.
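As a reminder, backpropagation multiplies the upstream gradient by the local gradient at each node; for this graph that means, for example, ∂f/∂x = (∂f/∂q)(∂q/∂x) and, for the loss, ∂L/∂x = (∂L/∂f)(∂f/∂x).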

Solve for the following values, both symbolically (without plugging in specific values of x, y, z) and evaluated at x = -2, y = 5, z = -4, and ∂L/∂f = 1:

     Symbolically          Evaluated
1.   ∂f/∂q =               ∂f/∂q =
2.   ∂q/∂x =               ∂q/∂x =
3.   ∂q/∂y =               ∂q/∂y =
4.   ∂f/∂z =               ∂f/∂z =
5.   ∂f/∂x =               ∂f/∂x =
6.   ∂f/∂y =               ∂f/∂y =


Problem 2. Computation Graphs on Steroids

Now let's perform backpropagation through a single neuron of a neural network with a sigmoid activation. Specifically, we will define the pre-activation z = w0x0 + w1x1 + w2 and we will define the activation value α = σ(z) = 1 / (1 + e^(−z)). The computation graph is visualized below:

In the graph we've filled out the forward activations, on top of the lines, as well as the upstream gradient (the gradient of the loss with respect to our neuron's output, ∂L/∂α). Use this information to compute the rest of the gradients (labelled with question marks) throughout the graph.

Hint: A calculator may be helpful here.

Finally, report the symbolic gradients with respect to the input parameters x0, x1, w0, w1, w2:

1. ∂α / ∂x0 =

2. ∂α / ∂w0 =

3. ∂α / ∂x1 =

4. ∂α / ∂w1 =

5. ∂α / ∂w2 =


Problem 3. Backpropagation Basics: Dimensions & Derivatives

Let's assume we have a two-layer neural network, as defined below:

z1 = W1 x^(i) + b1
a1 = ReLU(z1)
z2 = W2 a1 + b2
ŷ^(i) = σ(z2)
L^(i) = y^(i) * log(ŷ^(i)) + (1 − y^(i)) * log(1 − ŷ^(i))
J = −(1/m) ∑_{i=1}^{m} L^(i)

Note that x^(i) represents a single input example and is of shape Dx × 1. Further, y^(i) is a single output label and is a scalar. There are m examples in our dataset. We will use Da1 nodes in our hidden layer; that is, z1's shape is Da1 × 1.

1. What are the shapes of W1, b1, W2, b2? If we were vectorizing this network across multiple examples, what would the shapes of the weights/biases be instead, and what would the shapes of X and Y be?

2. What is ∂L^(i)/∂ŷ^(i)? Refer to this result as δ1^(i). Using this result, what is ∂J/∂ŷ^(i)?

3. What is ∂ŷ^(i)/∂z2? Refer to this result as δ2^(i).


Equations reproduced below for the remaining parts of the question:

z1 = W1 x^(i) + b1
a1 = ReLU(z1)
z2 = W2 a1 + b2
ŷ^(i) = σ(z2)
L^(i) = y^(i) * log(ŷ^(i)) + (1 − y^(i)) * log(1 − ŷ^(i))
J = −(1/m) ∑_{i=1}^{m} L^(i)

Note that x^(i) represents a single input example and is of shape Dx × 1. Further, y^(i) is a single output label and is a scalar. There are m examples in our dataset. We will use Da1 nodes in our hidden layer; that is, z1's shape is Da1 × 1.

4. What is ∂z2/∂a1? Refer to this result as δ3^(i).

5. What is ∂a1/∂z1? Refer to this result as δ4^(i).

6. What is ∂z1/∂W1? Refer to this result as δ5^(i).

7. What is ∂J/∂W1? It may help to reuse work from the previous parts. Hint: Be careful with the shapes!


Problem 4. Bonus!

Apart from simple mathematical operations like multiplication or exponentiation, and piecewise operations like the max used in ReLU activations, we can also perform complex operations in our neural networks. For this question, we'll be exploring the sort operation in hopes of better understanding how to backpropagate gradients through a sort. This is applicable in a variety of real-world use cases, including differentiable non-max suppression for object detection networks.

For each of the following parts, assume you are given an input vector x ∈ R^n and some upstream gradient vector ∂L/∂F, and you want to calculate ∂L/∂x, where F is a function of x that also returns a vector. You may assume all values in x are distinct. Note that x0 is the first component of the vector x, i.e. x = [x0, x1, ..., x_{n−1}].

1. F(x) = x0 * x

2. F(x) = sort(x)

3. F(x) = x0 * sort(x)


Section 3 Solutions

Problem 1. Computation Graph Review

1. ∂f/∂q = z = −4
2. ∂q/∂x = 1
3. ∂q/∂y = 1
4. ∂f/∂z = q = x + y = 3
5. ∂f/∂x = z * 1 = z = −4
6. ∂f/∂y = z * 1 = z = −4
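These values can also be double-checked numerically; a minimal sketch in plain Python (the finite-difference helper below is illustrative, not part of the handout):

def f(x, y, z):
    return (x + y) * z

def numeric_grad(fn, args, idx, eps=1e-6):
    # Central-difference estimate of the partial derivative w.r.t. args[idx].
    plus, minus = list(args), list(args)
    plus[idx] += eps
    minus[idx] -= eps
    return (fn(*plus) - fn(*minus)) / (2 * eps)

point = [-2.0, 5.0, -4.0]  # x, y, z from the problem
for idx, name in enumerate(["x", "y", "z"]):
    print(f"df/d{name} =", round(numeric_grad(f, point, idx), 4))
# Expected: df/dx = -4.0, df/dy = -4.0, df/dz = 3.0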

Problem 2. Computation Graphs on Steroids

1. ∂α / ∂x0 = σ(z) (1 − σ(z)) w0

2. ∂α / ∂w0 = σ(z) (1 − σ(z)) x0

3. ∂α / ∂x1 = σ(z) (1 − σ(z)) w1

4. ∂α / ∂w1 = σ(z) (1 − σ(z)) x1

5. ∂α / ∂w2 = σ(z) (1 − σ(z))
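A minimal sketch in plain Python that checks these symbolic gradients against finite differences; the numeric inputs are arbitrary placeholders, since the actual values live in the figure:

import math

# Single sigmoid neuron: alpha = sigmoid(w0*x0 + w1*x1 + w2).
params = {"w0": 2.0, "x0": -1.0, "w1": -3.0, "x1": -2.0, "w2": -3.0}

def neuron(p):
    z = p["w0"] * p["x0"] + p["w1"] * p["x1"] + p["w2"]
    return 1.0 / (1.0 + math.exp(-z))

a = neuron(params)
local = a * (1.0 - a)  # sigma'(z) = sigma(z) * (1 - sigma(z))

# Symbolic gradients from the solution above.
symbolic = {"x0": local * params["w0"], "w0": local * params["x0"],
            "x1": local * params["w1"], "w1": local * params["x1"],
            "w2": local}

# Finite-difference check of each one.
eps = 1e-6
for name, g in symbolic.items():
    plus, minus = dict(params), dict(params)
    plus[name] += eps
    minus[name] -= eps
    numeric = (neuron(plus) - neuron(minus)) / (2.0 * eps)
    print(name, round(g, 6), round(numeric, 6))  # the two columns should match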


Problem 3. Backpropagation Basics: Dimensions & Derivatives

1. W1 ∈ R^(Da1 × Dx), b1 ∈ R^(Da1 × 1), W2 ∈ R^(1 × Da1), b2 ∈ R^(1 × 1). The shapes of the weights/biases would be the same after vectorizing. After vectorizing, X ∈ R^(Dx × m) and Y ∈ R^(m × 1).

2. δ1^(i) = y^(i) / ŷ^(i) − (1 − y^(i)) / (1 − ŷ^(i)). Then ∂J/∂ŷ^(i) = −(1/m) δ1^(i).
3. δ2^(i) = σ(z2) (1 − σ(z2))

4. δ3^(i) = W2

5. δ4^(i) = 0 where z1 < 0, 1 where z1 ≥ 0 (applied elementwise to z1)

6. δ5^(i) = x^(i)T

7. δ6 = ∂J/∂W1 = −(1/m) ∑_i δ1^(i) * δ2^(i) * (δ3^(i)T ∘ δ4^(i)) * δ5^(i). (δ3 is transposed so that its elementwise product with δ4 has shape Da1 × 1; the result then has the same shape as W1, namely Da1 × Dx.)
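For concreteness, here is a minimal NumPy sketch (the sizes and random data are illustrative assumptions, not from the handout) that implements the forward pass and the ∂J/∂W1 formula above, and spot-checks it with finite differences:

import numpy as np

rng = np.random.default_rng(0)
Dx, Da1, m = 4, 3, 5                                   # illustrative sizes
X = rng.standard_normal((Dx, m))                       # one example per column
Y = rng.integers(0, 2, size=(1, m)).astype(float)      # binary labels
W1 = 0.5 * rng.standard_normal((Da1, Dx))
b1 = np.zeros((Da1, 1))
W2 = 0.5 * rng.standard_normal((1, Da1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W1):
    # Forward pass and J = -(1/m) * sum_i L^(i), vectorized across the m columns.
    Z1 = W1 @ X + b1
    A1 = np.maximum(Z1, 0)                             # ReLU
    Yhat = sigmoid(W2 @ A1 + b2)
    L = Y * np.log(Yhat) + (1 - Y) * np.log(1 - Yhat)
    return -L.sum() / m

# Backward pass assembled from delta_1 ... delta_5 exactly as in part 7.
Z1 = W1 @ X + b1
A1 = np.maximum(Z1, 0)
Yhat = sigmoid(W2 @ A1 + b2)
d1 = Y / Yhat - (1 - Y) / (1 - Yhat)                   # dL/dyhat, shape (1, m)
d2 = Yhat * (1 - Yhat)                                 # dyhat/dz2, shape (1, m)
d4 = (Z1 >= 0).astype(float)                           # ReLU derivative, shape (Da1, m)
dW1 = -(1.0 / m) * ((W2.T * d4) * (d1 * d2)) @ X.T     # shape (Da1, Dx), same as W1

# Finite-difference spot check on one entry of W1.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
print(dW1[0, 0], (cost(W1p) - cost(W1m)) / (2 * eps))  # the two numbers should agree closely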

Problem 4. Bonus!

1. As an example, say x = [x0, x1, x2]. Then F(x) = [x0 * x0, x0 * x1, x0 * x2], and ∂L/∂x will be a vector. For component i, where i is not 0, it is ∂L/∂F_i * x0. For the 0th component, it is 2 * x0 * ∂L/∂F_0 + ∑_{i ≠ 0} ∂L/∂F_i * x_i.

2. Sorting will simply reroute the gradients. As an example, say x = [x0, x1, x2], we have upstream gradients ∂L/∂F = [∂0, ∂1, ∂2], and F(x) = sort(x) = [x1, x2, x0]. Then ∂L/∂x = [∂2, ∂0, ∂1] (the gradients are moved so as to reverse the permutation from x to F(x)).

3. This can be viewed as a computation graph where the multiplication happens first and then the sorting happens. As such, it simply requires rerouting the gradients to account for the sort as in #2, and then applying the multiplication rule from #1 to account for multiplying by x0.
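A minimal NumPy sketch of this rerouting idea (the helper names sort_backward and combined_backward are illustrative, not from the handout), with a finite-difference check for the combined case in #3:

import numpy as np

def sort_backward(x, dL_dF):
    # Route upstream gradients of F = sort(x) back to the positions in x.
    order = np.argsort(x)          # order[j] is the index in x that landed at slot j
    dL_dx = np.zeros_like(x)
    dL_dx[order] = dL_dF           # undo the permutation
    return dL_dx

def combined_backward(x, dL_dF):
    # Backprop through F(x) = x[0] * sort(x): sort rerouting, then the product rule.
    s = np.sort(x)
    dL_ds = x[0] * dL_dF                    # gradient w.r.t. sort(x)
    dL_dx = sort_backward(x, dL_ds)         # reroute through the sort
    dL_dx[0] += np.dot(dL_dF, s)            # extra term from the x[0] factor
    return dL_dx

# Finite-difference check of combined_backward.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
g = rng.standard_normal(5)                  # arbitrary upstream gradient dL/dF
analytic = combined_backward(x, g)

eps = 1e-6
numeric = np.zeros_like(x)
for i in range(len(x)):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    Fp, Fm = xp[0] * np.sort(xp), xm[0] * np.sort(xm)
    numeric[i] = np.dot(g, (Fp - Fm)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))   # expect True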
