Natural Language Processing with Deep Learning Winter 2018 Midterm Exam

CS224N: Natural Language Processing with Deep Learning
Winter 2018 Midterm Exam

This examination consists of 17 printed sides, 5 questions, and 100 points. The exam accounts for 20% of your total grade. Please write your answers on the exam paper in the spaces provided. You may use the 16th page if necessary, but you must make a note in the question's answer box. You have 80 minutes to complete the exam. Exams turned in after the end of the examination period will be either penalized or not graded at all.

The exam is closed book and allows only a single page of notes. You are not allowed to: use a phone, laptop/tablet, calculator or spreadsheet, access the internet, communicate with others, or use other programming capabilities. You must disable all networking and radios ("airplane mode"). If you are taking the exam remotely, please send us the exam by Tuesday, February 13 at 5:50 pm PDT as a scanned PDF copy to [email protected].

Stanford University Honor Code: I attest that I have not given or received aid in this examination, and that I have done my share and taken an active part in seeing to it that others as well as myself uphold the spirit and letter of the Honor Code.

SUNet ID:
Signature:
Name (printed):
SCPD student:

Question            Points   Score
Multiple Choice       18
Short Questions       32
Word Vectors          12
Backpropagation       17
RNNs                  21
Total:               100

The standard of academic conduct for Stanford students is as follows:

1. The Honor Code is an undertaking of the students, individually and collectively:
   a. that they will not give or receive aid in examinations; that they will not give or receive unpermitted aid in class work, in the preparation of reports, or in any other work that is to be used by the instructor as the basis of grading;
   b. that they will do their share and take an active part in seeing to it that they as well as others uphold the spirit and letter of the Honor Code.
2. The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create temptations to violate the Honor Code.
3. While the faculty alone has the right and obligation to set academic requirements, the students and faculty will work together to establish optimal conditions for honorable academic work.

1. Multiple Choice (18 points)

For each of the following questions, color all the circles you think are correct. No explanations are required.

(a) (2 points) Which of the following statements about Skip-gram are correct?
( ) It predicts the center word from the surrounding context words
(✓) The final word vector for a word is the average or sum of the input vector v and output vector u corresponding to that word
( ) When it comes to a small corpus, it has better performance than GloVe
( ) It makes use of global co-occurrence statistics
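As an illustration (not part of the original exam), here is a minimal NumPy sketch of the idea behind the checked option: skip-gram keeps separate center ("input", v) and outside ("output", u) vectors, scores context words given the center word with a softmax over dot products, and one common choice for a word's final vector is the sum (or average) of its u and v vectors. All names and sizes below are illustrative.

```python
import numpy as np

np.random.seed(0)
V, d = 5, 4                        # toy vocabulary size and embedding dimension
v_center = np.random.randn(V, d)   # "input" vectors v (used when a word is the center)
u_outside = np.random.randn(V, d)  # "output" vectors u (used when a word is in the context)

def skipgram_probs(center_id):
    """P(o | c) = softmax over the scores u_o . v_c for every outside word o."""
    scores = u_outside @ v_center[center_id]   # shape (V,)
    exp = np.exp(scores - scores.max())        # numerically stabilized softmax
    return exp / exp.sum()

print(skipgram_probs(2))                       # distribution over context words given center word 2

# One common choice for the final embedding of each word: sum (or average) of u and v.
final_vectors = u_outside + v_center           # or (u_outside + v_center) / 2
print(final_vectors[2])
```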
(b) (2 points) Which of the following statements about dependency trees are correct?
( ) Each word is connected to exactly one dependent (i.e., each word has exactly one outgoing edge)
( ) A dependency tree with crossing edges is called a "projective" dependency tree.
(✓) Assuming it parses the sentence correctly, the last transition made by the dependency parser from class and assignment 2 will always be a RIGHT-ARC connecting ROOT to some word.
( ) None of the above

(c) (2 points) Which of the following statements is true of language models?
( ) Neural window-based models share weights across the window
( ) Neural window-based language models suffer from the sparsity problem, but n-gram language models do not
( ) The number of parameters in an RNN language model grows with the number of time steps
(✓) Neural window-based models can be parallelized, but RNN language models cannot

Solution: D. Gradients must flow through each time step in an RNN, whereas a neural window-based model can perform its forward and backward passes in parallel.

(d) (2 points) Assume x and y get multiplied together element-wise (x ∗ y). Sx and Sy are the shapes of x and y respectively. For which Sx and Sy will NumPy / TensorFlow throw an error? (A short NumPy check of these shape rules appears after this section.)
( ) Sx = [1, 10]; Sy = [10, 10]
( ) Sx = [10, 1]; Sy = [10, 1]
(✓) Sx = [10, 100]; Sy = [100, 10]
( ) Sx = [1, 10, 100]; Sy = [1, 1, 1]

(e) (2 points) Suppose that you are training a neural network for classification, but you notice that the training loss is much lower than the validation loss. Which of the following can be used to address the issue (select all that apply)?
(✓) Use a network with fewer layers
( ) Decrease dropout probability
(✓) Increase L2 regularization weight
( ) Increase the size of each hidden layer

Solution: A, C. The model is overfitting, and both of these are valid techniques to address it. However, since dropout is a form of regularization, decreasing it (B) will have the opposite effect, as will increasing network size (D).

(f) (2 points) Suppose a classifier predicts each possible class with equal probability. If there are 10 classes, what will the cross-entropy error be on a single example?
( ) −log(10)
( ) −0.1 log(1)
(✓) −log(0.1)
( ) −10 log(0.1)

Solution: C. Cross-entropy loss simplifies to the negative log of the predicted probability for the correct class. With approximately uniform probabilities over the 10 classes, the correct class gets probability 0.1, so the loss is approximately −log(0.1). (This value is also computed in the NumPy sketch after this section.)

(g) (2 points) Suppose we have a loss function f(x, y, θ), defined in terms of parameters θ, inputs x and labels y. Suppose that, for some special input and label pair (x0, y0), the loss equals zero. True or false: it follows that the gradient of the loss with respect to θ is equal to zero.
(✓) True
( ) False

(h) (2 points) During backpropagation, when the gradient flows backwards through the sigmoid or tanh non-linearities, it cannot change sign.
(✓) True
( ) False

Solution: A. True. The local gradient of both of these non-linearities is always positive.

(i) (2 points) Suppose we are training a neural network with stochastic gradient descent on minibatches. True or false: summing the cost across the minibatch is equivalent to averaging the cost across the minibatch, if in the first case we divide the learning rate by the minibatch size.
(✓) True
( ) False

Solution: A. True. There is only a constant-factor difference between averaging and summing, so scaling the learning rate accordingly gives the same update.
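As an illustration (not part of the original exam), the following NumPy sketch checks the claims in questions (d) and (f): under NumPy's standard broadcasting rules, only the shape pair [10, 100] vs. [100, 10] raises an error, and the cross-entropy of a uniform 10-class prediction comes out to −log(0.1) ≈ 2.30. The shapes and the chosen "true class" are illustrative.

```python
import numpy as np

# Question (d): element-wise multiplication under broadcasting.
shape_pairs = [
    ((1, 10), (10, 10)),       # broadcasts: the leading 1 stretches to 10
    ((10, 1), (10, 1)),        # identical shapes
    ((10, 100), (100, 10)),    # incompatible: 100 vs 10 in the trailing axis -> error
    ((1, 10, 100), (1, 1, 1)), # broadcasts: every size-1 axis stretches
]
for sx, sy in shape_pairs:
    try:
        _ = np.ones(sx) * np.ones(sy)
        print(sx, "*", sy, "-> ok")
    except ValueError as err:
        print(sx, "*", sy, "-> error:", err)

# Question (f): cross-entropy of a uniform prediction over 10 classes.
probs = np.full(10, 0.1)    # the classifier predicts every class with probability 0.1
true_class = 3              # arbitrary index of the correct class (one-hot target)
cross_entropy = -np.log(probs[true_class])
print(cross_entropy)        # ~2.3026, i.e. -log(0.1)
```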
2. Short Questions (32 points)

Please write answers to the following questions in a sentence or two.

(a) Suppose we want to classify movie review text as (1) either positive or negative sentiment, and (2) either action, comedy or romance movie genre. To perform these two related classification tasks, we use a neural network that shares the first layer, but branches into two separate layers to compute the two classifications. The loss is a weighted sum of the two cross-entropy losses:

    h   = ReLU(W0 x + b0)
    ŷ1  = softmax(W1 h + b1)
    ŷ2  = softmax(W2 h + b2)
    J   = α CE(y1, ŷ1) + β CE(y2, ŷ2)

Here the input x ∈ R^10 is some vector encoding of the input text, the label y1 ∈ R^2 is a one-hot vector encoding the true sentiment, the label y2 ∈ R^3 is a one-hot vector encoding the true movie genre, h ∈ R^10 is a hidden layer, W0 ∈ R^(10×10), W1 ∈ R^(2×10), W2 ∈ R^(3×10) are weight matrices, and α and β are scalars that balance the two losses.

i. (4 points) To compute backpropagation for this network, we can use the multivariable chain rule. Assuming we already know

    ∂ŷ1/∂x = Δ1,   ∂ŷ2/∂x = Δ2,   ∂J/∂ŷ1 = δ3^T,   ∂J/∂ŷ2 = δ4^T,

what is ∂J/∂x?

Solution: δ3^T Δ1 + δ4^T Δ2 (row convention), or Δ1 δ3^T + Δ2 δ4^T (column convention).
If you are using the row convention (numerator-layout notation), the dimensions of the quantities in the question are δ3^T: 1×2, Δ1: 2×10, δ4^T: 1×3, Δ2: 3×10.
If you are using the column convention (denominator-layout notation), the dimensions are δ3^T: 2×1, Δ1: 10×2, δ4^T: 3×1, Δ2: 10×3.

ii. (3 points) When we train this model, we find that it underfits the training data. Why might underfitting happen in this case? Provide at least one suggestion to reduce underfitting.

Solution: The model is too simple (just two layers, the hidden layer's dimension is only 10, and the input feature dimension is only 10). Anything increasing the complexity of the model would be accepted, including:
• increasing the dimension of the hidden layer
• adding more layers
• splitting the model into two separate networks (with more overall parameters)

(b) Practical Deep Learning Training

i. (2 points) In assignment 1, we saw that we could use gradient check, which calculates numerical gradients using the central difference formula, as a way to validate the accuracy of our analytical gradients. Why don't we use the numerical gradient to train neural networks in practice?

Solution: Calculating the complete numerical gradient for a single update requires iterating through all dimensions of all the parameters and computing two forward passes for each entry, which is far too expensive to use for training in practice.
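As an illustration (not part of the original exam), here is a minimal NumPy sketch of the central-difference gradient check described in the solution above, applied to the shared-layer network from question 2(a). The initialization, the finite-difference step eps, and the choice to check the bias b1 are all illustrative; the point is that each checked parameter entry costs two extra forward passes, which is why the numerical gradient is used only to verify analytic gradients, not to train.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_loss(params, x, y1, y2, alpha=1.0, beta=1.0):
    """J = alpha * CE(y1, yhat1) + beta * CE(y2, yhat2) for the shared-layer network."""
    W0, b0, W1, b1, W2, b2 = params
    h = np.maximum(0, W0 @ x + b0)       # h = ReLU(W0 x + b0)
    yhat1 = softmax(W1 @ h + b1)         # sentiment head (2 classes)
    yhat2 = softmax(W2 @ h + b2)         # genre head (3 classes)
    return -alpha * np.log(yhat1[y1]) - beta * np.log(yhat2[y2])

# Toy dimensions from the question: x in R^10, h in R^10, 2 and 3 output classes.
x = rng.normal(size=10)
y1, y2 = 1, 2                            # indices of the true sentiment / genre classes
params = [rng.normal(scale=0.1, size=s) for s in
          [(10, 10), (10,), (2, 10), (2,), (3, 10), (3,)]]

def numerical_gradient(params, which, eps=1e-5):
    """Central differences: (J(theta + eps) - J(theta - eps)) / (2 eps), one entry at a time."""
    grad = np.zeros_like(params[which])
    it = np.nditer(grad, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        orig = params[which][idx]
        params[which][idx] = orig + eps
        plus = forward_loss(params, x, y1, y2)
        params[which][idx] = orig - eps
        minus = forward_loss(params, x, y1, y2)
        params[which][idx] = orig
        grad[idx] = (plus - minus) / (2 * eps)   # two forward passes per parameter entry
    return grad

print(numerical_gradient(params, which=3))       # numerical dJ/db1, shape (2,)
```

In a gradient check one would compare this numerical result against the analytic gradient (here, dJ/db1 = alpha * (yhat1 - onehot(y1))) and require the relative error to be tiny.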