
Section 3: Gradient Descent & Backpropagation Practice Problems

Problem 1. Computation Graph Review

Let's assume we have a simple function f(x, y, z) = (x + y)z. We can break this up into the equations q = x + y and f = qz. Using this simplified notation, we can represent the function as a computation graph:

Now let's assume that we are evaluating this function at x = -2, y = 5, and z = -4. In addition, let the value of the upstream gradient (the gradient of the loss with respect to our function, ∂L/∂f) equal 1. These values are filled in for you in the computation graph.
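As a reminder, backpropagation multiplies the upstream gradient by the local gradient at each node; for this graph that means, for example, ∂f/∂x = (∂f/∂q)(∂q/∂x) and, for the loss, ∂L/∂x = (∂L/∂f)(∂f/∂x).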

Solve for the following values, both symbolically (without plugging in specific values of x, y, z) and evaluated at x = -2, y = 5, z = -4, and ∂L/∂f = 1:

     Symbolically          Evaluated
1.   ∂f/∂q =               ∂f/∂q =
2.   ∂q/∂x =               ∂q/∂x =
3.   ∂q/∂y =               ∂q/∂y =
4.   ∂f/∂z =               ∂f/∂z =
5.   ∂f/∂x =               ∂f/∂x =
6.   ∂f/∂y =               ∂f/∂y =


Problem 2. Computation Graphs on Steroids

Now let's perform backpropagation through a single neuron of a neural network with a sigmoid activation. Specifically, we will define the pre-activation z = w0x0 + w1x1 + w2 and we will define the activation value α = σ(z) = 1 / (1 + e^(−z)). The computation graph is visualized below:

In the graph we've filled out the forward activations, on top of the lines, as well as the upstream gradient (the gradient of the loss with respect to our neuron's output, ∂L/∂α). Use this information to compute the rest of the gradients (labelled with question marks) throughout the graph.

Hint: A calculator may be helpful here.

Finally, report the symbolic gradients with respect to the input parameters x0, x1, w0, w1, w2:

1. ∂α / ∂x0 =

2. ∂α / ∂w0 =

3. ∂α / ∂x1 =

4. ∂α / ∂w1 =

5. ∂α / ∂w2 =


Problem 3. Backpropagation Basics: Dimensions & Derivatives

Let's assume we have a two-layer neural network, as defined below:

z1 = W1 x^(i) + b1
a1 = ReLU(z1)
z2 = W2 a1 + b2
ŷ^(i) = σ(z2)
L^(i) = y^(i) * log(ŷ^(i)) + (1 − y^(i)) * log(1 − ŷ^(i))
J = −(1/m) ∑_{i=1}^{m} L^(i)

Note that x^(i) represents a single input example and is of shape Dx × 1. Further, y^(i) is a single output label and is a scalar. There are m examples in our dataset. We will use Da1 nodes in our hidden layer; that is, z1's shape is Da1 × 1.

1. What are the shapes of W1, b1, W2, b2? If we were vectorizing this network across multiple examples, what would the shapes of the weights/biases be instead, and what would the shapes of X and Y be?

2. What is ∂L^(i)/∂ŷ^(i)? Refer to this result as δ1^(i). Using this result, what is ∂J/∂ŷ^(i)?

3. What is ∂ŷ^(i)/∂z2? Refer to this result as δ2^(i).


Equations reproduced below for the remaining parts of the question:

z1 = W1 x^(i) + b1
a1 = ReLU(z1)
z2 = W2 a1 + b2
ŷ^(i) = σ(z2)
L^(i) = y^(i) * log(ŷ^(i)) + (1 − y^(i)) * log(1 − ŷ^(i))
J = −(1/m) ∑_{i=1}^{m} L^(i)

Note that x^(i) represents a single input example and is of shape Dx × 1. Further, y^(i) is a single output label and is a scalar. There are m examples in our dataset. We will use Da1 nodes in our hidden layer; that is, z1's shape is Da1 × 1.

4. What is ∂z2/∂a1? Refer to this result as δ3^(i).

5. What is ∂a1/∂z1? Refer to this result as δ4^(i).

6. What is ∂z1/∂W1? Refer to this result as δ5^(i).

7. What is ∂J/∂W1? It may help to reuse work from the previous parts. Hint: Be careful with the shapes!


Problem 4. Bonus!

Apart from simple mathematical operations like multiplication or exponentiation, and piecewise operations like the max used in ReLU activations, we can also perform complex operations in our neural networks. For this question, we'll be exploring the sort operation in hopes of better understanding how to backpropagate gradients through a sort. This is applicable in a variety of real-world use cases, including differentiable non-max suppression for object detection networks.

For each of the following parts, assume you are given an input vector x ∈ R^n and some upstream gradient vector ∂L/∂F, and you want to calculate ∂L/∂x, where F is a function of x that also returns a vector. You may assume all values in x are distinct. Note that x0 is the first component of the vector x, i.e. x = [x0, x1, ..., x_{n−1}].

1. F(x) = x0 * x

2. F(x) = sort(x)

3. F(x) = x0 * sort(x)


Section 3 Solutions

Problem 1. Computation Graph Review

1. ∂f/∂q = z = −4
2. ∂q/∂x = 1
3. ∂q/∂y = 1
4. ∂f/∂z = q = x + y = 3
5. ∂f/∂x = z * 1 = z = −4
6. ∂f/∂y = z * 1 = z = −4
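These values can also be double-checked numerically; a minimal sketch in plain Python (the finite-difference helper below is illustrative, not part of the handout):

def f(x, y, z):
    return (x + y) * z

def numeric_grad(fn, args, idx, eps=1e-6):
    # Central-difference estimate of the partial derivative w.r.t. args[idx].
    plus, minus = list(args), list(args)
    plus[idx] += eps
    minus[idx] -= eps
    return (fn(*plus) - fn(*minus)) / (2 * eps)

point = [-2.0, 5.0, -4.0]  # x, y, z from the problem
for idx, name in enumerate(["x", "y", "z"]):
    print(f"df/d{name} =", round(numeric_grad(f, point, idx), 4))
# Expected: df/dx = -4.0, df/dy = -4.0, df/dz = 3.0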

Problem 2. Computation Graphs on Steroids

1. ∂α / ∂x0 = σ(z) (1 − σ(z)) w0

2. ∂α / ∂w0 = σ(z) (1 − σ(z)) x0

3. ∂α / ∂x1 = σ(z) (1 − σ(z)) w1

4. ∂α / ∂w1 = σ(z) (1 − σ(z)) x1

5. ∂α / ∂w2 = σ(z) (1 − σ(z))
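A minimal sketch in plain Python that checks these symbolic gradients against finite differences; the numeric inputs are arbitrary placeholders, since the actual values live in the figure:

import math

# Single sigmoid neuron: alpha = sigmoid(w0*x0 + w1*x1 + w2).
params = {"w0": 2.0, "x0": -1.0, "w1": -3.0, "x1": -2.0, "w2": -3.0}

def neuron(p):
    z = p["w0"] * p["x0"] + p["w1"] * p["x1"] + p["w2"]
    return 1.0 / (1.0 + math.exp(-z))

a = neuron(params)
local = a * (1.0 - a)  # sigma'(z) = sigma(z) * (1 - sigma(z))

# Symbolic gradients from the solution above.
symbolic = {"x0": local * params["w0"], "w0": local * params["x0"],
            "x1": local * params["w1"], "w1": local * params["x1"],
            "w2": local}

# Finite-difference check of each one.
eps = 1e-6
for name, g in symbolic.items():
    plus, minus = dict(params), dict(params)
    plus[name] += eps
    minus[name] -= eps
    numeric = (neuron(plus) - neuron(minus)) / (2.0 * eps)
    print(name, round(g, 6), round(numeric, 6))  # the two columns should match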


Problem 3. Backpropagation Basics: Dimensions & Derivatives

1. W1 ∈ R^(Da1 × Dx), b1 ∈ R^(Da1 × 1), W2 ∈ R^(1 × Da1), b2 ∈ R^(1 × 1). The shapes of the weights/biases would be the same after vectorizing. After vectorizing, X ∈ R^(Dx × m) and Y ∈ R^(m × 1).

2. δ1^(i) = y^(i) / ŷ^(i) − (1 − y^(i)) / (1 − ŷ^(i)). Then ∂J/∂ŷ^(i) = −(1/m) δ1^(i).
3. δ2^(i) = σ(z2) (1 − σ(z2))

4. δ3^(i) = W2

5. δ4^(i) = 0 where z1 < 0, 1 where z1 ≥ 0 (applied elementwise to z1)

6. δ5^(i) = x^(i)T

7. δ6 = ∂J/∂W1 = −(1/m) ∑_i δ1^(i) * δ2^(i) * (δ3^(i)T ∘ δ4^(i)) * δ5^(i). (δ3 is transposed so that its elementwise product with δ4 has shape Da1 × 1; the result then has the same shape as W1, namely Da1 × Dx.)
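For concreteness, here is a minimal NumPy sketch (the sizes and random data are illustrative assumptions, not from the handout) that implements the forward pass and the ∂J/∂W1 formula above, and spot-checks it with finite differences:

import numpy as np

rng = np.random.default_rng(0)
Dx, Da1, m = 4, 3, 5                                   # illustrative sizes
X = rng.standard_normal((Dx, m))                       # one example per column
Y = rng.integers(0, 2, size=(1, m)).astype(float)      # binary labels
W1 = 0.5 * rng.standard_normal((Da1, Dx))
b1 = np.zeros((Da1, 1))
W2 = 0.5 * rng.standard_normal((1, Da1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W1):
    # Forward pass and J = -(1/m) * sum_i L^(i), vectorized across the m columns.
    Z1 = W1 @ X + b1
    A1 = np.maximum(Z1, 0)                             # ReLU
    Yhat = sigmoid(W2 @ A1 + b2)
    L = Y * np.log(Yhat) + (1 - Y) * np.log(1 - Yhat)
    return -L.sum() / m

# Backward pass assembled from delta_1 ... delta_5 exactly as in part 7.
Z1 = W1 @ X + b1
A1 = np.maximum(Z1, 0)
Yhat = sigmoid(W2 @ A1 + b2)
d1 = Y / Yhat - (1 - Y) / (1 - Yhat)                   # dL/dyhat, shape (1, m)
d2 = Yhat * (1 - Yhat)                                 # dyhat/dz2, shape (1, m)
d4 = (Z1 >= 0).astype(float)                           # ReLU derivative, shape (Da1, m)
dW1 = -(1.0 / m) * ((W2.T * d4) * (d1 * d2)) @ X.T     # shape (Da1, Dx), same as W1

# Finite-difference spot check on one entry of W1.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
print(dW1[0, 0], (cost(W1p) - cost(W1m)) / (2 * eps))  # the two numbers should agree closely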

Problem 4. Bonus!

1. As an example, say x = [x0, x1, x2]. Then F(x) = [x0 * x0, x0 * x1, x0 * x2], and ∂L/∂x will be a vector. For component i, where i is not 0, it is ∂L/∂F_i * x0. For the 0th component, it is 2 * x0 * ∂L/∂F_0 + ∑_{i ≠ 0} ∂L/∂F_i * x_i.

2. Sorting will simply reroute the gradients. As an example, say x = [x0, x1, x2], we have upstream gradients ∂L/∂F = [∂0, ∂1, ∂2], and F(x) = sort(x) = [x1, x2, x0]. Then ∂L/∂x = [∂2, ∂0, ∂1] (the gradients are moved so as to reverse the permutation from x to F(x)).

3. This can be viewed as a computation graph where the multiplication happens first and then the sorting happens. As such, it simply requires rerouting the gradients to account for the sort as in #2, and then applying the multiplication rule from #1 to account for multiplying by x0.
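A minimal NumPy sketch of this rerouting idea (the helper names sort_backward and combined_backward are illustrative, not from the handout), with a finite-difference check for the combined case in #3:

import numpy as np

def sort_backward(x, dL_dF):
    # Route upstream gradients of F = sort(x) back to the positions in x.
    order = np.argsort(x)          # order[j] is the index in x that landed at slot j
    dL_dx = np.zeros_like(x)
    dL_dx[order] = dL_dF           # undo the permutation
    return dL_dx

def combined_backward(x, dL_dF):
    # Backprop through F(x) = x[0] * sort(x): sort rerouting, then the product rule.
    s = np.sort(x)
    dL_ds = x[0] * dL_dF                    # gradient w.r.t. sort(x)
    dL_dx = sort_backward(x, dL_ds)         # reroute through the sort
    dL_dx[0] += np.dot(dL_dF, s)            # extra term from the x[0] factor
    return dL_dx

# Finite-difference check of combined_backward.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
g = rng.standard_normal(5)                  # arbitrary upstream gradient dL/dF
analytic = combined_backward(x, g)

eps = 1e-6
numeric = np.zeros_like(x)
for i in range(len(x)):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    Fp, Fm = xp[0] * np.sort(xp), xm[0] * np.sort(xm)
    numeric[i] = np.dot(g, (Fp - Fm)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))   # expect True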
