Learning to Attend with Neural Networks
by
Lei (Jimmy) Ba
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Electrical & Computer Engineering
University of Toronto
© Copyright 2020 by Lei (Jimmy) Ba

Abstract
Learning to Attend with Neural Networks
Lei (Jimmy) Ba
Doctor of Philosophy
Graduate Department of Electrical & Computer Engineering
University of Toronto
2020
As more computational resources become widely available, artificial intelligence and machine learning researchers design ever larger and more complicated neural networks to learn from millions of data points.
Although traditional convolutional neural networks (CNNs) can achieve superhuman accuracy in object recognition tasks, they brute-force the problem by scanning over every location in the input images with the same fidelity. This thesis introduces a new class of neural networks inspired by the human visual system. Unlike CNNs, which process the entire image at once at each hidden layer, attention allows salient features to dynamically come to the forefront as needed. The ability to attend is especially important when there is a lot of clutter in a scene. However, learning attention-based neural networks poses some challenges for current machine learning techniques: What information should the neural network "pay attention" to? Where does the network store its sequences of "glimpses"? Can our learning algorithms do better than simple "trial-and-error"?
To address these computational questions, we first describe a novel recurrent visual attention model in the context of variational inference. Because the standard REINFORCE or trial-and-error algorithm can be slow due to its high-variance gradient estimates, we show that a re-weighted wake-sleep objective can improve the training performance. We also demonstrate that the visual attention models outperform the previous state-of-the-art methods based on CNNs on image and caption generation tasks. Furthermore, we discuss how the visual attention mechanism can improve the working memory of recurrent neural networks (RNNs) through a novel form of self-attention. The second half of the thesis focuses on gradient-based learning algorithms. We develop a new first-order optimization algorithm to overcome the slow convergence of stochastic gradient descent in RNNs and attention-based models.
Finally, we explore the benefits of applying second-order optimization methods to training neural networks.
Acknowledgements
I would like to thank my advisors: Geoffrey Hinton, Brendan Frey and Ruslan Salakhutdinov. This PhD thesis would not have been possible without the support of these amazing mentors. I am extremely grateful to Geoff for being the most caring supervisor I could ask for. His mathematical intuition, insight into neural networks and enthusiasm for the highest research standards inspired many ideas in this thesis. I was fortunate to work with an incredible group of colleagues at the University of Toronto Machine Learning group: Roger Grosse, Jamie Kiros, James Martens, Kevin Swersky, Ilya Sutskever, Kelvin Xu, Hui Yuan Xiong, Chris Maddison. Vlad Mnih and Rich Caruana, to whom I owe a debt of gratitude, provided me with outstanding and fruitful research internship experiences outside of academia. Among the many who lent help along the way, I give my dearest thanks to my parents for their unwavering support.
Contents

1 Introduction
  1.1 What are neural networks and why do we need attention?
  1.2 Overview
  1.3 Neural networks
  1.4 Convolutional neural networks
  1.5 Recurrent neural networks
  1.6 Learning
    1.6.1 Maximum likelihood estimation and Kullback-Leibler divergence
    1.6.2 Regularization
    1.6.3 Gradient descent

2 Deep recurrent visual attention
  2.1 Motivation
  2.2 Learning where and what
  2.3 Variational lower bound objective
    2.3.1 Maximize the variational lower bound
    2.3.2 Multi-object/Sequential classification as a visual attention task
    2.3.3 Comparison with CNN
    2.3.4 Discussion
  2.4 Improved learning with re-weighted wake-sleep objective
    2.4.1 Wake-Sleep recurrent attention model
    2.4.2 An improved lower bound on the log-likelihood
    2.4.3 Training an inference network
    2.4.4 Control variates
    2.4.5 Encouraging exploration
    2.4.6 Experiments
  2.5 Summary

3 Generating image (and) captions with visual attention
  3.1 Problem definition
  3.2 Related work
  3.3 Image Caption Generation with Attention Mechanism
    3.3.1 Model details
    3.3.2 Learning stochastic "hard" vs deterministic "soft" Attention
    3.3.3 Experiments
  3.4 Generating images
    3.4.1 Model architecture
    3.4.2 Learning
    3.4.3 Generating images from captions
    3.4.4 Experiments
  3.5 Summary

4 Stabilizing RNN training with layer normalization
  4.1 Motivation
  4.2 Batch and weight normalization
  4.3 Layer normalization
    4.3.1 Layer normalized recurrent neural networks
  4.4 Related work
  4.5 Analysis
    4.5.1 Invariance under weights and data transformations
    4.5.2 Geometry of parameter space during learning
  4.6 Experimental results
    4.6.1 Order embeddings of images and language
    4.6.2 Teaching machines to read and comprehend
    4.6.3 Skip-thought vectors
    4.6.4 Modeling binarized MNIST using DRAW
    4.6.5 Handwriting sequence generation
    4.6.6 Permutation invariant MNIST
    4.6.7 Convolutional Networks
  4.7 Summary

5 Self-attention to the recent past using fast weights
  5.1 Motivation
  5.2 Evidence from physiology that temporary memory may not be stored as neural activities
  5.3 Fast Associative Memory
    5.3.1 Layer normalized fast weights
    5.3.2 Implementing the fast weights "inner loop" in biological neural networks
  5.4 Experimental results
    5.4.1 Associative retrieval
    5.4.2 Integrating glimpses in visual attention models
    5.4.3 Facial expression recognition
    5.4.4 Agents with memory
  5.5 Summary

6 Accelerating learning using Adaptive Moment methods
  6.1 Motivation
  6.2 Algorithm
    6.2.1 Adam's update rule
  6.3 Initialization bias correction
  6.4 Convergence analysis
    6.4.1 Convergence proof
  6.5 Related work
  6.6 Experiments
    6.6.1 Logistic regression
    6.6.2 Multi-layer neural networks
    6.6.3 Convolutional neural networks
    6.6.4 Bias-correction term
  6.7 Extensions
    6.7.1 AdaMax
    6.7.2 Temporal averaging
  6.8 Summary

7 Scale up learning with the distributed natural gradient methods
  7.1 Motivation
  7.2 Fisher information matrix and natural gradient
    7.2.1 Kronecker factored approximate Fisher
    7.2.2 Approximate natural gradient using K-FAC
    7.2.3 Related works
  7.3 Distributed Optimization using K-FAC
    7.3.1 Asynchronous Fisher block inversion
    7.3.2 Asynchronous statistics computation
  7.4 Doubly-factored Kronecker approximation for large convolution layers
    7.4.1 Factored Tikhonov damping for the doubly-factored Kronecker approximation
  7.5 Step size selection
    7.5.1 Experimental evaluation of the step-size selection method of Section 7.5
  7.6 Automatic construction of the K-FAC computation graph
  7.7 Experiments
    7.7.1 CIFAR-10 classification and asynchronous Fisher block inversion
    7.7.2 ImageNet classification
    7.7.3 A cheaper Kronecker factor approximation for convolution layers
  7.8 Summary

8 Conclusion
  8.1 Attention-based neural networks
  8.2 Stochastic optimization
  8.3 Future directions
    8.3.1 Variance reduction and learning
    8.3.2 Beyond maximum likelihood learning

Bibliography
Relationship to Prior Work
The chapters in this thesis describe work that has been published in the following conferences:
• Chapter 2: Multiple object recognition with visual attention. Ba, J., Mnih, V., & Kavukcuoglu, K. (2015). International conference on learning representations.
• Chapter 2: Learning wake-sleep recurrent attention models. Ba, J., Salakhutdinov, R. R., Grosse, R. B., & Frey, B. J. (2015). Advances in neural information processing systems (pp. 2593–2601).
• Chapter 3: Show, attend and tell: neural image caption generation with visual attention. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Bengio, Y. (2015). International conference on machine learning (pp. 2048–2057).
• Chapter 3: Generating images from captions with attention. Mansimov, E., Parisotto, E., Ba, J. L., & Salakhutdinov, R. (2015). International conference on learning representations.
• Chapter 4: Layer normalization. Ba, J., Kiros, J. R., & Hinton, G. E. (2016). Advances in neural information processing systems deep learning symposium.
• Chapter 5: Using fast weights to attend to the recent past. Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., & Ionescu, C. (2016). Advances in neural information processing systems (pp. 4331–4339).
• Chapter 6: Adam: a method for stochastic optimization. Kingma, D., & Ba, J. (2015). International conference on learning representations.
• Chapter 7: Distributed second-order optimization using Kronecker-factored approximations. Ba, J., Grosse, R., & Martens, J. (2016). International conference on learning representations.
Chapter 1
Introduction
1.1 What are neural networks and why do we need attention?
In recent years, researchers have tackled problems in computer vision, speech recognition and natural language processing using deep learning methods [Hinton et al., 2012b, Krizhevsky and Hinton, 2009, Sutskever et al., 2014b] that learn powerful feature detectors directly from inputs with little or no pre-processing. Deep learning avoids the time-consuming process of designing features by hand, and as datasets get larger it can discover better features with no additional human effort. Many deep learning systems use feed-forward neural networks with many layers. While they have been very successful, state-of-the-art deep neural networks can be computationally expensive: training these models often takes weeks even when parallelizing over many machines. The high runtime cost of these models at test time is a further problem for many real-time applications on smart phones or wearable devices.

As more computational resources become available, artificial intelligence and machine learning researchers train ever larger neural networks using millions of data points. Many of these systems use the largest neural network that can conveniently fit into a modern computer to exhaustively process the entire input all at once. A convolutional neural network (CNN) may recognize thousands of objects with superhuman accuracy, but standard CNNs are computationally clumsy and expensive because they examine all image locations at the same level of detail. Such brute-force learning approaches have yielded impressive results thus far, but for tasks such as learning from the rapidly growing amount of video data on YouTube, a less brute-force approach should be far more effective.

One of the most curious facets of the human visual system is the presence of attention [Rensink, 2000, Corbetta and Shulman, 2002]. Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. The human visual system converts a high-dimensional visual scene into a sequence of glimpses by using intelligently selected fixation points. This dynamically allocates computational resources to the more informative parts of the input, and internal, covert attention amplifies this effect. One may, for example, spend a few minutes translating a long French paragraph to English by going back and forth between the ambiguous sentences in the source document. The same person in an unfamiliar train station can quickly find out which way to go by glancing at the useful signs for only a fraction of a second. Inspired by the human visual system, this thesis explores the topic of learning neural networks that can integrate and retrieve information by intelligent sampling.
1.2 Overview
Much of the recent work on neural networks focuses on sequence modeling using recurrent neural networks (RNNs). The applications of these models to machine translation, speech recognition, and language modeling have shown promising results in practice. Despite their success, training RNNs is often unstable and slow, and traditional RNN architectures also fail to learn long-term dependencies among their input sequences.

We are interested in how neural networks can integrate information over long sequences using attention mechanisms. We take inspiration from the way humans perform sequence recognition tasks such as reading: by continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to our internal representation of the sequence. Traditional sequence-to-sequence models have turned out to be challenging to train for these tasks. In this thesis, we describe methods that address the difficulty of training RNNs on problems with complicated long-range temporal structure.

There are two major themes in this thesis: modeling and learning. For modeling, our contribution is in developing novel attention-based neural networks for object recognition and sequence generation tasks in both computer vision and natural language processing. For learning, our contribution is in developing new optimization algorithms to address the challenges in learning attention-based neural networks.

Outline of Thesis

In chapter 1, we describe many popular techniques used in neural network research and deep learning. The topics can be divided into neural network architectures and traditional learning algorithms. This will serve as a foundation on which we can place the contributions of this thesis. This chapter also introduces the detailed terminology and notation that will be used throughout the thesis.

Chapter 2 begins the discussion of a new attention-based neural network architecture for vision tasks. In particular, we describe an extension to the recurrent visual attention model [Mnih et al., 2014b] so that it can solve multi-object recognition tasks. We also present a new learning objective derived from variational inference. We then define a formal connection between REINFORCE and variational inference.

Chapter 3 then discusses two variants of visual attention applied to image and caption generation tasks. We found our model outperformed all the previous caption generation models at the time.

In Chapter 4, we investigate the unstable training dynamics of attention-based RNNs on very long sequences and discuss the learning challenges present in these models. We describe a new normalization technique that addresses the challenge by normalizing groups of hidden neurons to have the same mean and standard deviation at each time step of the sequence. Building upon this technique, in Chapter 5 we learn a "self-attention" neural network that can attend to its past computation. This form of "self-attention" can be used to store temporary memories of the recent past. We show this new form of temporary storage is very helpful in sequence-to-sequence models.

In Chapter 6, we discuss the learning algorithms themselves by presenting a new stochastic optimization algorithm, "Adam", for training neural networks. We analyze the convergence properties of the algorithm and discuss recent developments in stochastic optimization. Our numerical experiments demonstrate that "Adam" can speed up convergence when learning various attention-based neural networks.

Chapter 7 continues the discussion of learning algorithms for neural networks using second-order optimization techniques. We develop a novel distributed optimizer that scales the K-FAC natural gradient algorithm to train state-of-the-art deep learning models with tens of millions of parameters. Our experiments show distributed K-FAC can speed up convergence linearly with respect to the size of the mini-batch.
1.3 Neural networks
In the rest of this chapter, we give the background on neural networks and optimization that will make this thesis self-contained. We first define the basic notation for neural networks that we will use for the rest of the thesis. A feed-forward neural network, or multilayer perceptron (MLP) [Rumelhart et al., 1986], is the most common neural network architecture and consists of layers of simple neuron-like processing units. Such a network of artificial neurons maps an input vector x to an output y. The neuron-like processing units compute a weighted sum of their inputs using a set of incoming weights w and pass it through an activation function f. Namely, the output of a neuron is given by:
z = \sum_i w_i x_i + b, \quad a = f(z),    (1.1)

where we denote the neuron activations before and after the nonlinear function as z and a respectively. An additional bias scalar b is included in the weighted sum to help learning. In a deep feedforward neural network with d layers, the computation can be expressed in terms of matrix-vector operations as follows:
z_1 = W_1 x + b_1, \quad a_1 = f(z_1),    (1.2)
z_2 = W_2 a_1 + b_2, \quad a_2 = f(z_2),    (1.3)
\cdots
y = W_d a_{d-1} + b_d.    (1.4)
We use the subscripts to index the layers. Note that there are many element-wise nonlinear activation functions to choose from. In the past, logistic activation functions such as the sigmoid or tanh were the most popular options. Krizhevsky et al. [2012e] showed that rectified linear unit (ReLU) nonlinearities could be highly successful for computer vision tasks and proved faster to train than the standard sigmoid units.
\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \mathrm{ReLU}(z) = \max(0, z).    (1.5)
The neural networks in the rest of the thesis are assumed to have the ReLU activation function unless otherwise stated.
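To make the notation concrete, the following NumPy sketch (not from the thesis; the layer sizes and variable names are illustrative) implements the forward pass of Equations 1.2–1.4 with ReLU hidden units.

```python
import numpy as np

def relu(z):
    # Rectified linear unit, Equation 1.5.
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases):
    """Forward pass of a feed-forward network (Equations 1.2-1.4).

    weights, biases: per-layer parameters [W_1, ..., W_d], [b_1, ..., b_d].
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b                    # pre-activation z_l = W_l a_{l-1} + b_l
        a = relu(z)                      # activation a_l = f(z_l)
    return weights[-1] @ a + biases[-1]  # linear output y = W_d a_{d-1} + b_d

# Example: a small network with two hidden layers and random weights.
rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
weights = [rng.normal(0, 0.01, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.normal(size=784), weights, biases)
print(y.shape)  # (10,)
```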
1.4 Convolutional neural networks
Although many prior works have shown that feed-forward neural networks can express any function given enough hidden units, more complex neural network architectures are often preferred in practice because of their expressivity and better generalization. One such example is the convolutional neural network (CNN) [Rumelhart et al., 1986, LeCun et al., 1990, Krizhevsky and Hinton, 2009]. The idea of convolutional filters, small local filter banks applied across the entire image, has been explored in many classical works on image processing and computer vision. CNNs build this prior knowledge into the neural network architecture. Unlike fully connected neural networks, the weights in CNNs are heavily constrained and local. Each neuron only processes a local source of information from the output of the previous layer. The incoming weights of the convolutional neurons, or receptive fields, are also shared across spatial locations. These receptive fields act like feature detectors looking for a particular pattern anywhere in the input image. Therefore, the outputs of a convolutional layer are called feature maps, which are computed as:
a_d = f(W_d * a_{d-1} + b_d),    (1.6)

where * is the convolution operator. Both the weight matrices and the biases are shared across receptive fields. Weight sharing not only greatly reduces the number of free parameters in the network, but also encodes prior knowledge about image processing and the local pattern recognition of vision tasks.
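To illustrate the weight sharing described above, the following sketch (my own example, with arbitrary sizes) computes a single feature map of Equation 1.6 by sliding one shared 3x3 receptive field over a grayscale image.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def conv2d_single_filter(image, kernel, bias):
    """Valid 2-D convolution with one shared filter (one feature map of Eq. 1.6)."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]              # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias    # same weights at every location
    return relu(out)

image = np.random.default_rng(0).normal(size=(28, 28))
kernel = np.random.default_rng(1).normal(scale=0.1, size=(3, 3))
feature_map = conv2d_single_filter(image, kernel, bias=0.0)
print(feature_map.shape)  # (26, 26)
```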
1.5 Recurrent neural networks
Another important class of neural network architectures is recurrent neural networks (RNNs), which are a temporal generalization of classical feed-forward neural networks. RNNs map an input sequence to an output sequence as a nonlinear dynamical system. The network updates its hidden activations from the current input and the activations of the previous timestep using the bottom-up and the recurrent weights respectively. The hidden-to-hidden recurrent connections allow RNNs to aggregate information over time. The recurrent computation also enables the possibility of storing long-term dependencies between the inputs in its hidden activations. Formally, a standard one-layer RNN mapping an input sequence {x_1, ..., x_T} to an output sequence {y_1, ..., y_T} is defined as:
a_1 = f(W_{in} x_1 + b_{rec}),    (1.7)
a_t = f(W_{in} x_t + W_{rec} a_{t-1} + b_{rec}),    (1.8)
y_t = W_{out} a_t + b_{out},    (1.9)

where W_{in}, W_{rec} and W_{out} are respectively the input, recurrent and output weights shared across the T timesteps. This RNN is analogous to a feed-forward neural network with T layers, except that the weights are shared between the layers. The weight sharing in RNNs allows the same model to process sequences with any number of timesteps and to remember information from the past. However, it also makes the recurrent activations grow in magnitude as the input sequence gets longer. The standard logistic activation functions, such as sigmoid or tanh, tend to saturate over longer timesteps, whereas ReLU leads to exploding hidden activations. In other words, any small change at the beginning of the input sequence could cause exploding activations many timesteps later under ReLU. Le et al. [2015] point out that if the recurrent weights are initialized close to a scaled identity matrix, we can partially alleviate this problem and successfully train ReLU RNNs over thousands of timesteps. Arjovsky et al. [2016] later generalized the identity matrix result to scaled orthonormal matrices, where all the eigenvalues of the recurrent weights are close to one.
Figure 1.1: An LSTM cell. Lines with bolded squares imply projections with a learnt weight vector. Each cell learns how to weigh its input components (input gate), while learning how to modulate that contribution to the memory (input modulator). It also learns weights which erase the memory cell (forget gate), and weights which control how this memory should be emitted (output gate).
However, these initialization schemes require hyperparameter tuning and are sensitive to changes of scale in the input sequences. Orthogonal to the weight initializations, Hochreiter and Schmidhuber [1997b] address the exploding activation problem by introducing a new RNN activation function, the long short-term memory (LSTM) unit. LSTMs have a set of gating units that control the information flow in the RNN. These gates turn on or off to update a set of linear memory neurons.
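A minimal NumPy sketch of the recurrence in Equations 1.7–1.9 is shown below (sizes and initialization are illustrative; the experiments in this thesis mostly use LSTM units rather than this vanilla RNN).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def rnn_forward(xs, W_in, W_rec, W_out, b_rec, b_out, f=np.tanh):
    """Run a one-layer RNN (Equations 1.7-1.9) over a sequence of input vectors."""
    a = np.zeros(W_rec.shape[0])                  # initial hidden state a_0 = 0
    ys = []
    for x in xs:                                  # same weights reused at every timestep
        a = f(W_in @ x + W_rec @ a + b_rec)       # a_t = f(W_in x_t + W_rec a_{t-1} + b_rec)
        ys.append(W_out @ a + b_out)              # y_t = W_out a_t + b_out
    return ys

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 8, 32, 4, 20
xs = [rng.normal(size=n_in) for _ in range(T)]
ys = rnn_forward(xs,
                 rng.normal(0, 0.1, (n_hid, n_in)),
                 np.eye(n_hid) * 0.9,             # recurrent weights near a scaled identity matrix
                 rng.normal(0, 0.1, (n_out, n_hid)),
                 np.zeros(n_hid), np.zeros(n_out),
                 f=relu)
print(len(ys), ys[0].shape)  # 20 (4,)
```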
1.6 Learning
Neural networks can be used as parametric models for many standard statistical learning tasks such as regression and classification. In statistical learning, the goal is to estimate a set of parameters from a given training dataset. For simplicity of notation, we lump all the neural network weights into a parameter vector \theta = [\mathrm{vec}\{W_1\}^T, \mathrm{vec}\{W_2\}^T, \cdots]^T, where \mathrm{vec}\{\cdot\} is the vectorization operator that converts a matrix or tensor to a column vector. Given a dataset of N input and target pairs D_{train} = \{(x^{(n)}, t^{(n)})\}_{n=1}^{N}, we can measure the performance of the neural network, according to a loss function and the current weights, as the averaged loss over the training examples:
L(\theta) = \frac{1}{N} \sum_{n=1}^{N} l^{(n)}(\theta) = \frac{1}{N} \sum_{n=1}^{N} l((x^{(n)}, t^{(n)}), \theta),    (1.10)

where l^{(n)} is the loss for each training example and L denotes the averaged loss over the whole training set. The above averaged loss measured on the training set is also known as the empirical risk. There is a wide variety of loss functions studied in the field of machine learning. In this thesis, we choose to focus on the following two loss functions, which appear in many regression and classification problems of practical importance: the mean squared error,
l_{MSE}((x^{(n)}, t^{(n)}), \theta) = \frac{1}{2} \| y^{(n)} - t^{(n)} \|_2^2,    (1.11)

and the multi-class cross-entropy loss,
l_{CE}((x^{(n)}, t^{(n)}), \theta) = -\sum_i t_i^{(n)} \log p_i^{(n)}, \quad p = \mathrm{softmax}(y),    (1.12)

where \mathrm{softmax}(z) = \left[ \frac{\exp(z_1)}{\sum_j \exp(z_j)}, \frac{\exp(z_2)}{\sum_j \exp(z_j)}, \cdots \right] is a generalization of the element-wise logistic function. Because both loss functions are smooth, they combine well with the gradient-based learning procedures used for neural networks. Given a neural network architecture, the learning problem involves searching for a set of weights that minimizes the chosen loss function.
\theta^{*} = \arg\min_{\theta} L(\theta)    (1.13)
In the case of a single neuron, the above minimization can be solved using a set of linear algebraic equations or convex programming. In general, obtaining the optimal set of weights in neural network learning is intractable due to the highly non-convex loss function L in the weight space.
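For concreteness, the two loss functions of Equations 1.11 and 1.12 can be written in a few lines of NumPy; this is an illustrative sketch (my own variable names) with a numerically stabilized softmax.

```python
import numpy as np

def mse_loss(y, t):
    # Equation 1.11: 0.5 * ||y - t||_2^2
    return 0.5 * np.sum((y - t) ** 2)

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy_loss(y, t):
    # Equation 1.12: -sum_i t_i * log p_i with p = softmax(y)
    p = softmax(y)
    return -np.sum(t * np.log(p + 1e-12))

y = np.array([2.0, -1.0, 0.5])
t = np.array([1.0, 0.0, 0.0])   # one-hot target
print(mse_loss(y, t), cross_entropy_loss(y, t))
```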
1.6.1 Maximum likelihood estimation and Kullback-Leibler divergence
Loss minimization considers neural network learning from an optimization perspective; a different perspective on Eq. 1.13 is grounded in statistical inference. The neural network outputs can be viewed as defining a conditional distribution for the targets t given the input vector x. In most applications, we are not interested in the distribution of the inputs themselves, but rather in predicting the possible values of the targets given a user-chosen input. In a regression problem, we may model the real-valued targets as a conditional Gaussian distribution with a mean equal to the output of the neural network and an identity covariance matrix. The log likelihood of such a model can be written as:
\log p(t^{(n)} | x^{(n)}, \theta) = -\frac{1}{2} \sum_i \left[ (y_i^{(n)} - t_i^{(n)})^2 + \log 2\pi \right].    (1.14)
For classification problems, the conditional probability of the input belonging to the cth class can be modeled as a multinomial distribution,
\log p(t^{(n)} = c \,|\, x^{(n)}, \theta) = \log p_c^{(n)}, \quad p = \mathrm{softmax}(y).    (1.15)
It is easy to see that the above log likelihoods simply correspond to the negative loss functions we defined earlier for regression and classification. Therefore, we can directly think of neural network learning as a maximum likelihood estimation (MLE) problem that maximizes the likelihood of the conditional probability defined by the outputs of the network, where the estimates are the learnable weights. Thus we can argue that a learning algorithm that optimizes these loss functions shares many desirable properties with MLE methods, such as statistical consistency and efficiency guarantees.

One may also be tempted to minimize the difference between the neural network's output distribution and the empirical target distribution. The Kullback-Leibler (KL) divergence is a natural choice to measure the difference between two distributions,
\mathbb{E}_{x \sim D_{train}} \left[ \mathrm{KL}(p_{data}(t | x) \,\|\, p(t | x, \theta)) \right] = -\frac{1}{N} \sum_n \mathbb{E}_{t^{(n)} \sim p_{data}} \left[ \log p(t^{(n)} | x^{(n)}, \theta) \right] + \mathrm{const}.    (1.16)
Thus, minimizing the KL divergence between the data distribution and the network’s output distribution is equivalent to MLE.
1.6.2 Regularization
It is a common belief in neural network design to prefer more expressive models with many layers. These models typically have more parameters than the number of training examples. This belief is justified by the increasing complexity of the datasets and the fast-growing parallel computation resources. In such an overparameterized regime, it is easy for the learning algorithm to discover a network that overfits: the model achieves zero training loss but fails to generalize to unseen test data. Thus, the loss functions are usually modified with an additional constraint to limit the capacity of the neural network in order to prevent overfitting. The constraint could be imposed on the architecture via weight sharing, as in the case of CNNs, or could take the form of a data-independent regularization term added to the loss function. For example, weight decay, the sum of the squared ℓ2 norms of the weight matrices, is a simple yet effective regularizer,
L(\theta) = L_{CE/MSE}(\theta) + R(\theta), \quad R(\theta) = \lambda \|\theta\|_2^2,    (1.17)

where \lambda is the weight decay coefficient. Even with such regularization terms, very wide and deep neural networks are still vulnerable to overfitting. Dropout [Hinton et al., 2012a] is an effective technique for avoiding co-adaptation of the neurons, thus reducing overfitting in neural networks. During training, each neuron has an independent probability p of being dropped from the computation,
z_d = W_d a_{d-1} + b_d, \quad a_d = M_d \odot f(z_d), \quad M_d \sim \mathrm{Bern}(1 - p),    (1.18)

where M_d is a binary mask, drawn i.i.d. from a Bernoulli distribution, that indicates which neurons are kept. Dropout can also be viewed as training an ensemble of neural nets, each with a different connectivity pattern but with the weights shared among all the ensemble members. At test time, a multiplier of 1 - p is applied to the hidden activations to correct for the neurons that were missing during training.
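The following sketch (illustrative hyperparameter values, not from the thesis) shows the weight decay penalty of Equation 1.17 and the dropout mask of Equation 1.18, including the 1 − p multiplier applied at test time.

```python
import numpy as np

def weight_decay(weights, lam=1e-4):
    # R(theta) = lambda * ||theta||_2^2, summed over all weight matrices (Eq. 1.17).
    return lam * sum(np.sum(W ** 2) for W in weights)

def dropout_layer(a, p=0.5, train=True, rng=np.random.default_rng(0)):
    if train:
        mask = rng.binomial(1, 1.0 - p, size=a.shape)   # M_d ~ Bern(1 - p), Eq. 1.18
        return mask * a
    return (1.0 - p) * a                                # test time: scale activations by 1 - p

a = np.ones(6)
print(dropout_layer(a, p=0.5, train=True))    # roughly half the units zeroed
print(dropout_layer(a, p=0.5, train=False))   # deterministic scaled activations
```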
1.6.3 Gradient descent
In practice, finding the weights and biases that minimize the given loss function is done using gradient-based optimization methods. Gradient-based learning rules for neural networks were given different names in the past [Rumelhart et al., 1986, LeCun et al., 1990]. Starting from a set of randomly initialized weights, simple gradient-based learning algorithms change the weights in the negative gradient direction of the loss function with respect to the weights,
g \triangleq \frac{\partial L}{\partial \theta}, \quad G_W \triangleq \frac{\partial L}{\partial W},    (1.19)

\Delta W = -\eta \, G_W, \quad \Delta\theta = [\mathrm{vec}\{\Delta W_1\}^T, \mathrm{vec}\{\Delta W_2\}^T, \cdots]^T,    (1.20)

\theta \leftarrow \theta + \Delta\theta,    (1.21)

where \eta is the learning rate or step size. We use G_W as a shorthand notation for the gradient of the loss with respect to a particular weight matrix. At each iteration, the weights "descend" along the gradient direction. There are typically many local optima in the weight space. For typical neural networks, there are no global convergence guarantees for these gradient descent algorithms. Despite this, the solutions discovered in practice often perform very well.

The computational cost of the gradient updates G_W grows linearly with the dataset size. A typical computer vision dataset will contain hundreds of thousands of training examples, which makes computing the full gradient update expensive. To address this problem, a stochastic approximation variant of gradient descent, the stochastic gradient descent (SGD) algorithm, computes an estimate of the gradients on a small mini-batch randomly sampled from the total training set. SGD is an unbiased approximation of the full gradient, and the variance of the estimate is inversely proportional to the mini-batch size,
G_W = \frac{1}{|B|} \sum_{i \in B} \frac{\partial l^{(i)}}{\partial W}, \quad B \subset \{1, 2, \cdots, N\},    (1.22)

\mathbb{E}[G_W] = \frac{\partial L}{\partial W}, \quad \mathrm{Var}[G_W] = \frac{1}{|B|} \mathrm{Var}\left[\frac{\partial l^{(i)}}{\partial W}\right],    (1.23)

where B is a uniformly sampled subset of the training set. Using a larger mini-batch size tends to work better, and the update computation can take advantage of the parallelism of modern parallel computing hardware.

Most of the learning algorithms for neural networks rely on computing the gradient G_W in their inner loop. Rumelhart et al. [1986] provide an efficient algorithm to compute these gradients that "backpropagates" the difference between the network's prediction and the target through the layers of the neural network, from the output layer back to the inputs. In general, any neural network can be expressed as a computation graph. The backpropagation algorithm is a special case of backwards-mode automatic differentiation on the computation graph.
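As a summary of this section, a minimal mini-batch SGD loop implementing the updates of Equations 1.20–1.22 might look like the following sketch; `grad_loss` is a hypothetical stand-in for the gradient computed by backpropagation.

```python
import numpy as np

def sgd(theta, grad_loss, data, eta=0.1, batch_size=64, num_steps=1000,
        rng=np.random.default_rng(0)):
    """Stochastic gradient descent: theta <- theta - eta * G_W (Eqs. 1.20-1.22).

    grad_loss(theta, batch) must return the mini-batch gradient estimate;
    in practice it would be computed by backpropagation.
    """
    N = len(data)
    for _ in range(num_steps):
        batch_idx = rng.choice(N, size=batch_size, replace=False)  # B subset of {1..N}
        G = grad_loss(theta, [data[i] for i in batch_idx])
        theta = theta - eta * G
    return theta

# Toy usage: fit the mean of some data by minimizing 0.5 * (theta - x)^2.
data = list(np.random.default_rng(1).normal(loc=3.0, size=1000))
grad = lambda theta, batch: np.mean([theta - x for x in batch])
print(sgd(np.array(0.0), grad, data))   # converges to roughly 3.0
```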
Chapter 2

Deep recurrent visual attention
2.1 Motivation
Convolutional neural networks have recently been very successful on a variety of recognition and classification tasks [Krizhevsky et al., 2012b, Goodfellow et al., 2013, Jaderberg et al., 2014, Vinyals et al., 2014b, Karpathy and Fei-Fei, 2014]. One of the main drawbacks of convolutional networks (ConvNets) is their poor scalability with increasing input image size, so efficient implementations of these models on multiple GPUs [Krizhevsky et al., 2012b] or even spanning multiple machines [Dean et al., 2012] have become necessary. Applications of ConvNets to multi-object and sequence recognition from images have avoided working with big images and instead focused on using ConvNets for recognizing characters or short sequence segments from image patches containing reasonably tightly cropped instances [Goodfellow et al., 2013, Jaderberg et al., 2014]. Applying such a recognizer to large images containing uncropped instances requires integrating it with a separately trained sequence detector or a bottom-up proposal generator. Non-maximum suppression is often performed to obtain the final detections. While combining separate components trained using different objective functions has been shown to be worse than end-to-end training of a single system in other domains, integrating object localization and recognition into a single globally-trainable architecture has been difficult.

In this chapter, we take inspiration from the way humans perform visual sequence recognition tasks such as reading: by continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to our internal representation of the sequence. Our proposed system is a deep recurrent neural network that at each step processes a multi-resolution crop of the input image, called a glimpse. The network uses information from the glimpse to update its internal representation of the input, and outputs the next glimpse location and possibly the next object in the sequence. The process continues until the model decides that there are no more objects to process. We show how the proposed system can be trained end-to-end by approximately maximizing a variational lower bound on the label sequence log-likelihood. This training procedure can be used to train the model to both localize and recognize multiple objects purely from label sequences.

We evaluate the model on the task of transcribing multi-digit house numbers from publicly available Google Street View imagery. Our attention-based model outperforms the state-of-the-art ConvNets on tightly cropped inputs while using both fewer parameters and much less computation. We also show
that our model outperforms ConvNets by a much larger margin in the more realistic setting of larger and less tightly cropped input sequences.

Figure 2.1: The deep recurrent attention model.
2.2 Learning where and what
For simplicity, we first describe how our model can be applied to classifying a single object and later show how it can be extended to multiple objects. Processing an image x with an attention-based model is a sequential process with N steps, where each step consists of a saccade followed by a glimpse. At each step n, the model receives a location l_n along with a glimpse observation x_n taken at location l_n.
The model uses the observation to update its internal state and outputs the location l_{n+1} to process at the next time-step. Usually the number of pixels in the glimpse x_n is much smaller than the number of pixels in the original image x, making the computational cost of processing a single glimpse independent of the size of the image. A graphical representation of our model is shown in Figure 2.1. The model can be broken down into a number of sub-components, each mapping some input into a vector output. We will use the term "network" to describe these non-linear sub-components since they are typically multi-layered neural networks.

Glimpse network: The glimpse network is a non-linear function that receives the current input image patch, or glimpse, x_n and its location tuple l_n, where l_n is a 2-dimensional vector representing the x- and y-coordinates of the patch, as input and outputs a vector g_n. The job of the glimpse network is to extract a set of useful features from location l_n of the raw visual input. We will use G_{image}(x_n | W_{image}) to denote the output vector of the function G_{image}(·), which takes an image patch x_n and is parameterized by weights W_{image}. G_{image}(·) typically consists of three convolutional hidden layers without any pooling layers, followed by a fully connected layer. Separately, the location tuple is mapped by G_{loc}(l_n | W_{loc}) using a fully connected hidden layer, where both G_{image}(x_n | W_{image}) and G_{loc}(l_n | W_{loc}) have the same dimension. We combine the high-bandwidth image information with the low-bandwidth location tuple by multiplying the two vectors element-wise to get the final glimpse feature vector g_n,
g_n = G_{image}(x_n | W_{image}) \, G_{loc}(l_n | W_{loc}).    (2.1)
This type of multiplicative interaction between “what” and “where” was initially proposed by Larochelle and Hinton [2010b].
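To make the "what"/"where" combination of Equation 2.1 concrete, here is a minimal sketch; the layer sizes are arbitrary and both sub-networks are reduced to single fully connected layers, whereas the actual glimpse network uses three convolutional layers for G_image.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def glimpse_network(x_patch, l, W_image, W_loc):
    """g_n = G_image(x_n | W_image) * G_loc(l_n | W_loc)  (Equation 2.1)."""
    g_image = relu(W_image @ x_patch.ravel())   # "what": features of the glimpse patch
    g_loc = relu(W_loc @ l)                     # "where": embedding of the (x, y) location
    return g_image * g_loc                      # element-wise multiplicative interaction

rng = np.random.default_rng(0)
patch = rng.normal(size=(12, 12))               # glimpse extracted at location l
l = np.array([0.2, -0.4])                       # 2-d location tuple
g = glimpse_network(patch, l,
                    rng.normal(0, 0.05, (256, 144)),
                    rng.normal(0, 0.05, (256, 2)))
print(g.shape)  # (256,)
```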
Recurrent network: The recurrent network aggregates information extracted from the individual glimpses and combines the information in a coherent manner that preserves spatial information. The glimpse feature vector g_n from the glimpse network is supplied as input to the recurrent network at each time step. The recurrent network consists of two recurrent layers with non-linear function R_{recur}. We define the two outputs of the recurrent layers as r_n^{(1)} and r_n^{(2)}:
r_n^{(1)} = R_{recur}(g_n, r_{n-1}^{(1)} | W_{r1}) \quad \text{and} \quad r_n^{(2)} = R_{recur}(r_n^{(1)}, r_{n-1}^{(2)} | W_{r2})    (2.2)
We use Long Short-Term Memory units [Hochreiter and Schmidhuber, 1997c] for the non-linearity R_{recur} because of their ability to learn long-range dependencies and their stable learning dynamics.
Emission network: The emission network takes the current state of the recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states of the recurrent network. It consists of a fully connected hidden layer that maps the feature vector r_n^{(2)} from the top recurrent layer to a coordinate tuple \hat{l}_{n+1}:
\hat{l}_{n+1} = E(r_n^{(2)} | W_e)    (2.3)
Context network: The context network provides the initial state for the recurrent network, and its output is used by the emission network to predict the location of the first glimpse. The context network C(·) takes a down-sampled low-resolution version of the whole input image, I_{coarse}, and outputs a fixed-length vector c_I. The contextual information provides sensible hints on where the potentially interesting regions are in a given image. The context network employs three convolutional layers that map a coarse image I_{coarse} to a feature vector used as the initial state of the top recurrent layer r^{(2)} in the recurrent network. However, the bottom layer r^{(1)} is initialized with a vector of zeros for reasons we will explain later.
Classification network: The classification network outputs a prediction for the class label y based on the final feature vector r_N^{(1)} of the lower recurrent layer. The classification network has one fully connected hidden layer and a softmax output layer for the class y:
P(y | I) = O(r_N^{(1)} | W_o)    (2.4)
Ideally, the deep recurrent attention model should learn to look at locations that are relevant for classifying objects of interest. The existence of the contextual information, however, provides a "short cut" solution such that it is much easier for the model to learn from contextual information than by combining information from different glimpses. We prevent such undesirable behavior by connecting the context network and classification network to different recurrent layers in our deep model. As a result, the contextual information cannot be used directly by the classification network and only affects the sequence of glimpse locations produced by the model.
2.3 Variational lower bound objective
2.3.1 Maximize the variational lower bound
Given the class labels y of an input image I, we can formulate learning as a supervised classification problem with the cross-entropy objective function. The attention model predicts the class label conditioned on intermediate latent location variables l from each glimpse and extracts the corresponding patches. Let \theta = [\mathrm{vec}\{W_{image}\}^{\top}, \mathrm{vec}\{W_{loc}\}^{\top}, \mathrm{vec}\{W_{r1}\}^{\top}, \mathrm{vec}\{W_{r2}\}^{\top}, \mathrm{vec}\{W_{e}\}^{\top}, \mathrm{vec}\{W_{o}\}^{\top}]^{T} denote the concatenated model parameters. We can thus maximize the likelihood of the class label by marginalizing over the glimpse locations, \log p(y | I, \theta) = \log \sum_l p(l | I, \theta) \, p(y | l, I, \theta).

The marginalized objective function can be learned by optimizing its variational free energy lower bound F:

\log \sum_l p(l | I, \theta) \, p(y | l, I, \theta) \geq \sum_l p(l | I, \theta) \log p(y, l | I, \theta) + H[l]    (2.5)
= \sum_l p(l | I, \theta) \log p(y | l, I, \theta)    (2.6)
The learning rule to update the model parameters θ follows the gradient of the above free energy:
\frac{\partial F}{\partial \theta} = \sum_l p(l | I, \theta) \frac{\partial \log p(y | l, I, \theta)}{\partial \theta} + \sum_l \log p(y | l, I, \theta) \frac{\partial p(l | I, \theta)}{\partial \theta}    (2.7)
= \sum_l p(l | I, \theta) \left[ \frac{\partial \log p(y | l, I, \theta)}{\partial \theta} + \log p(y | l, I, \theta) \frac{\partial \log p(l | I, \theta)}{\partial \theta} \right]    (2.8)
Because there are exponentially many possible glimpse location sequences, the summation in Equation 2.8 is difficult to evaluate during training. It can instead be approximated using Monte Carlo samples:
\tilde{l}^m \sim p(l_n | I, \theta) = \mathcal{N}(l_n; \hat{l}_n, \Sigma)    (2.9)

\frac{\partial F}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{\partial \log p(y | \tilde{l}^m, I, \theta)}{\partial \theta} + \log p(y | \tilde{l}^m, I, \theta) \frac{\partial \log p(\tilde{l}^m | I, \theta)}{\partial \theta} \right]    (2.10)
Equation 2.10 gives a practical algorithm to train the deep attention model. Namely, we can sample the glimpse location prediction from the model after each glimpse. The samples are then used in standard backpropagation to obtain an estimator of the gradient with respect to the model parameters. Notice that the log likelihood \log p(y | \tilde{l}^m, I, \theta) has an unbounded range, which can introduce substantial variance into the gradient estimator. In particular, when the sampled location is far from the object in the image, the log likelihood induces an undesirably large gradient update that is backpropagated through the rest of the model.

We can reduce the variance of the estimator in Equation 2.10 by replacing \log p(y | \tilde{l}^m, I, \theta) with a 0/1 discrete indicator function R and using the baseline technique of Mnih et al. [2014b]:

R = \begin{cases} 1 & y = \arg\max_y \log p(y | \tilde{l}^m, I, \theta) \\ 0 & \text{otherwise} \end{cases}    (2.11)
b_n = E_{baseline}(r_n^{(2)} | W_{baseline})    (2.12)
As shown, the recurrent network state vector r_n^{(2)} is used to estimate a state-based baseline b for each glimpse, which significantly improves the learning efficiency. The baseline effectively centers the random variable R and can be learned by regressing towards the expected value of R. Given both the indicator function and the baseline, we have the following gradient update:
\frac{\partial F}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{\partial \log p(y | \tilde{l}^m, I, \theta)}{\partial \theta} + \lambda (R - b) \frac{\partial \log p(\tilde{l}^m | I, \theta)}{\partial \theta} \right]    (2.13)

where the hyper-parameter \lambda balances the scale of the two gradient components. In fact, by using the 0/1 indicator function, the learning rule in Equation 2.13 is equivalent to the REINFORCE [Williams, 1992b] learning rule employed in Mnih et al. [2014b] for training their attention model. When viewed as a reinforcement learning update, the second term in Equation 2.13 is an unbiased estimate of the gradient with respect to W of the expected reward R under the model's glimpse policy. Here we show that such a learning rule can also be motivated by approximately optimizing the free energy.

During inference, the feed-forward location prediction can be used as a deterministic prediction of the location coordinates at which to extract the next input image patch, so the model behaves as a normal feed-forward network. Alternatively, our marginalized objective function in Equation 2.5 suggests a procedure to estimate the expected class prediction by using samples of location sequences \{\tilde{l}^m_1, \cdots, \tilde{l}^m_N\} and averaging their predictions,
\mathbb{E}_l[p(y | I)] \approx \frac{1}{M} \sum_{m=1}^{M} p(y | \tilde{l}^m, I).    (2.14)
This allows the attention model to be evaluated multiple times on each image with the classification predictions being averaged. In practice, we found that averaging the log probabilities gave the best performance.
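As an illustration of the training update, the sketch below assembles the Monte Carlo gradient estimate of Equation 2.13 from M sampled glimpse sequences. It is a sketch only: `attention_forward` is a hypothetical helper standing in for a full forward pass plus backpropagation, and the returned fields are assumptions about what such a helper would provide.

```python
from types import SimpleNamespace
import numpy as np

def dram_gradient_estimate(image, label, attention_forward, M=8, lam=1.0):
    """Monte Carlo estimate of dF/dtheta following Equation 2.13.

    attention_forward(image, label) is a stand-in that samples a glimpse sequence
    l~ ~ N(l; l_hat, Sigma), runs the model, and returns:
      correct      : whether the argmax prediction equals the label (R in Eq. 2.11)
      baseline     : the predicted state-based baseline b (Eq. 2.12)
      grad_logp_y  : gradient of log p(y | l~, I, theta) w.r.t. theta
      grad_logp_l  : gradient of log p(l~ | I, theta) w.r.t. theta
    """
    total = 0.0
    for _ in range(M):
        out = attention_forward(image, label)
        R = 1.0 if out.correct else 0.0
        # classification gradient + REINFORCE term with baseline-centred reward
        total = total + out.grad_logp_y + lam * (R - out.baseline) * out.grad_logp_l
    return total / M

# Dummy stand-in so the function can be exercised end to end.
def fake_attention_forward(image, label, rng=np.random.default_rng(0)):
    return SimpleNamespace(correct=bool(rng.integers(0, 2)), baseline=0.5,
                           grad_logp_y=rng.normal(size=4), grad_logp_l=rng.normal(size=4))

print(dram_gradient_estimate(None, 0, fake_attention_forward))
```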
Here, we encode the real-valued glimpse location tuple l_n using a Cartesian coordinate system centered at the middle of the input image. The ratio converting unit width in the coordinate system to the number of pixels is a hyper-parameter. This ratio presents an exploration versus exploitation trade-off, and the proposed model's performance is very sensitive to this setting. We found that setting its value to around 15% of the input image width tends to work well.
2.3.2 Multi-object/Sequential classification as a visual attention task
Our proposed attention model can be easily extended to solve classification tasks involving multiple objects. To train the deep recurrent attention model for the sequential recognition task, the multiple object labels for a given image need to be cast into an ordered sequence {y_1, y_2, ..., y_S}. The deep recurrent attention model then learns to predict one object at a time as it explores the image in a sequential manner. We can utilize a simple fixed number of glimpses for each target in the sequence. In addition, a new class label for the "end-of-sequence" symbol is included to deal with variable numbers of objects in an image. We can stop the recurrent attention model once a terminal symbol is predicted.
Concretely, the objective function for the sequential prediction is
\log p(y_1, y_2, \cdots, y_S | I, \theta) = \sum_{s=1}^{S} \log \sum_{l^s} p(l^s | I, \theta) \, p(y_s | l^s, I, \theta)    (2.15)
The learning rule is derived as in equation 2.13 from the free energy, and the gradient is accumulated across all targets. We assign a fixed number of glimpses, N, for each target. Assuming S targets in an image, the model would be trained with N × (S + 1) glimpses. The benefit of using a recurrent model for multiple object recognition is that it is a compact and simple form yet flexible enough to deal with images containing variable numbers of objects. Learning a model from images of many objects is a challenging setup. We can reduce the difficulty by modifying our indicator function R to be proportional to the number of targets the model predicted correctly:
R_s = \sum_{j \leq s} R_j    (2.16)
In addition, we restrict the gradient of the objective function so that it only contains glimpses up to the first mislabeled target and ignores the targets after the first mistake. This curriculum-like adaptation of the learning is crucial for obtaining a high-performance attention model for sequential prediction.
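As a concrete illustration (my own sketch, not code from the thesis), the cumulative reward of Equation 2.16 and the gradient mask described above can be computed as follows; the convention of including the first mislabeled target in the mask is an assumption.

```python
def sequence_rewards(predictions, targets):
    """Cumulative rewards R_s (Equation 2.16) and a gradient mask that keeps the
    glimpses up to the first mislabeled target and ignores everything after it."""
    rewards, mask = [], []
    num_correct, mistake_seen = 0, False
    for pred, tgt in zip(predictions, targets):
        if pred == tgt and not mistake_seen:
            num_correct += 1
        rewards.append(num_correct)                  # R_s = sum_{j <= s} R_j
        mask.append(0.0 if mistake_seen else 1.0)    # ignore targets after the first mistake
        if pred != tgt:
            mistake_seen = True
    return rewards, mask

print(sequence_rewards([3, 1, 7, 9], [3, 1, 2, 9]))
# ([1, 2, 2, 2], [1.0, 1.0, 1.0, 0.0])
```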
2.3.3 Comparison with CNN
To show the effectiveness of the deep recurrent attention model (DRAM), we first investigate a number of multi-object classification tasks involving a variant of MNIST. We then apply the proposed attention model to a real-world object recognition task using the multi-digit SVHN dataset [Netzer et al., 2011] and compare with the state-of-the-art deep ConvNets. As suggested in Mnih et al. [2014b], we used a glimpse network with two different scales to improve the classification performance. Namely, given a glimpse location l_n, we extract two patches (x_n^1, x_n^2), where x_n^1 is the original patch and x_n^2 is a down-sampled coarser image patch. We use the concatenation of x_n^1 and x_n^2 as the glimpse observation (a "foveal" feature). The hyper-parameters in our experiments are the learning rate η and the location variance Σ in equation 2.9. They are determined by grid search and cross-validation.
Learning to find digits
We first evaluate the effectiveness of the controller in the deep recurrent attention model using the MNIST handwritten digit dataset. We generated a dataset of pairs of randomly picked handwritten digits in a 100x100 image with distraction noise in the background. The task is to identify the 55 different combinations of the two digits as a classification problem. The attention models are allowed 4 glimpses before making a classification prediction. The goal of this experiment is to evaluate the ability of the controller and recurrent network to combine information from multiple glimpses with minimum effort from the glimpse network. The results are shown in table (2.1). The DRAM model with a context network significantly outperforms the other models. Chapter 2. Deep recurrent visual attention 15
Table 2.1: Error rates on the MNIST pairs classification task.

    Model                      Test Err.
    RAM Mnih et al. [2014b]    9%
    DRAM w/o context           7%
    DRAM                       5%

Table 2.2: Error rates on the MNIST two-digit addition task.

    Model                      Test Err.
    ConvNet 64-64-64-512       3.2%
    DRAM                       2.5%
Figure 2.2: Left) Two examples of the learned policy on the digit pair classification task. The first column shows the input image while the next 5 columns show the selected glimpse locations. Right) Two examples of the learned policy on the digit addition task. The first column shows the input image while the next 5 columns show the selected glimpse locations.
Learning to do addition
For a more challenging task, we designed another dataset with two MNIST digits on an empty 100x100 background, where the task is to predict the sum of the two digits in the image as a classification problem with 19 targets. The model has to find where each digit is and add them up. When the two digits are sampled uniformly from all classes, the label distribution for the summation is heavily imbalanced, with most of the probability mass concentrated around 10. Also, many digit combinations map to the same target, for example, [5,5] and [3,7]. The class label therefore provides a weaker association between the visual features and the supervision signal in this task than in the digit combination task. We used the same model as in the combination task. The deep recurrent attention model is able to discover a glimpse policy that solves this task, achieving a 2.5% error rate. In comparison, ConvNets take longer to learn and perform worse when given weak supervision. Some inference samples are shown in figure 2.2. It is surprising that the learned glimpse policy for predicting the next glimpse is very different in the addition task compared to the combination task. The model that learned to do addition toggles its glimpses between the two digits.
Learning to read house numbers
The publicly available multi-digit street view house number (SVHN) dataset [Netzer et al., 2011] consists of images of digits taken from pictures of house fronts. Following Goodfellow et al. [2013], we formed a validation set of 5,000 images by randomly sampling images from the training set and the extra set; these were used for selecting the learning rate and the sampling variance of the stochastic glimpse policy. The models are trained using the remaining 200,000 training images. We follow the preprocessing technique from Goodfellow et al. [2013] to generate tightly cropped 64x64 images with the multiple digits at the center, and similar data augmentation is used to create 54x54 jittered images during training. We also convert the RGB images to grayscale, as we observe that the color information does not affect the final classification performance. We trained a model to classify all the digits in an image sequentially with the objective function
defined in equation 2.15. The label sequence ordering is chosen to go from left to right as the natural ordering of the house numbers. The attention model is given 3 glimpses for each digit before making a prediction. The recurrent model keeps running until it predicts a terminal label or until the longest digit length in the dataset is reached. In the SVHN dataset, up to 5 digits can appear in an image. This means the recurrent model will run up to 18 glimpses per image, that is, 5 × 3 plus 3 glimpses for the terminal label. Learning the attention model took around 3 days on a GPU. The model performance is shown in table (2.3).

Table 2.3: Whole sequence recognition error rates on multi-digit SVHN.

    Model                                       Test Err.
    11 layer CNN Goodfellow et al. [2013]       3.96%
    10 layer CNN                                4.11%
    Single DRAM                                 5.1%
    Single DRAM MC avg.                         4.4%
    forward-backward DRAM MC avg.               3.9%

Table 2.4: Whole sequence recognition error rates on enlarged multi-digit SVHN.

    Model                                       Test Err.
    10 layer CNN resize                         50%
    10 layer CNN re-trained                     5.60%
    Single DRAM focus                           5.7%
    forward-backward DRAM focus                 5.0%
    Single DRAM fine-tuned                      5.1%
    forward-backward DRAM fine-tuning           4.46%

We found that there is still a performance gap between the state-of-the-art deep ConvNet and a single DRAM that "reads" from left to right, even with the Monte Carlo averaging. The DRAM often predicts additional digits in the place of the terminal class. In addition, the distribution of the leading digit in real life follows Benford's law¹. We therefore train a second recurrent attention model to "read" the house numbers from right to left as a backward model. The forward and backward models can share the same weights for their glimpse networks, but they have different weights for their recurrent and emission networks. The predictions of both forward and backward models can be combined to estimate the final sequence prediction. Following the observation that attention models often overestimate the sequence length, we flip the first k predictions of the backward model, where k is the shorter of the sequence length predictions of the forward and backward models. This simple heuristic works very well in practice, and we obtain state-of-the-art performance on the Street View house number dataset with the forward-backward recurrent attention model. Videos showing sample runs of the forward and backward models on SVHN test data can be found at http://www.psi.toronto.edu/~jimmy/dram/forward.avi and http://www.psi.toronto.edu/~jimmy/dram/backward.avi respectively. These visualizations show that the attention model learns to follow the slope of multi-digit house numbers when they go up or down.

For comparison, we also implemented a deep ConvNet with a similar architecture to the one used in Goodfellow et al. [2013]. The network had 8 convolutional layers with 128 filters in each, followed by 2 fully connected layers of 3096 ReLU units. Dropout is applied to all 10 layers with a 50% dropout rate to prevent over-fitting. Moreover, we generate a less tightly cropped 110x110 multi-digit SVHN dataset by enlarging the bounding box of each image such that the relative size of the digits stays the same as in the 54x54 images. Our deep attention model trained on 54x54 images can be directly applied to the new 110x110 dataset with no modification. The performance can be further improved by "focusing" the model on where the digits are.
¹Benford's law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.
    (Giga) floating-point op.   10 layer CNN   DRAM    DRAM MC avg.   F-B DRAM MC avg.
    54x54                       2.1            ≤ 0.2   0.35           0.7
    110x110                     8.5            ≤ 0.2   1.1            2.2

    param. (millions)           10 layer CNN   DRAM    DRAM MC avg.   F-B DRAM MC avg.
    54x54                       51             14      14             28
    110x110                     169            14      14             28

Table 2.5: Computation cost of DRAM vs. deep ConvNets.
We run the model once, crop a 54x54 bounding box around the glimpse location sequence, and feed the 54x54 bounding box to the attention model again to generate the final prediction. This allows DRAM to "focus" and obtain a similar prediction accuracy on the enlarged images as on the cropped images, without ever being trained on large images. We also compared the deep ConvNet trained on the 110x110 images with the fine-tuned attention model. The deep attention model significantly outperforms the deep ConvNet with very little training time. The DRAM model only takes a few hours to fine-tune on the enlarged SVHN data, compared to one week for the deep 10-layer ConvNet.
2.3.4 Discussion
In our experiments, the proposed deep recurrent attention model (DRAM) outperforms the state-of-the-art deep ConvNets on the standard SVHN sequence recognition task. Moreover, as we increase the image area around the house numbers or lower the signal-to-noise ratio, the advantage of the attention model becomes more significant.

In table 2.5, we compare the computational cost of our proposed deep recurrent attention model with that of deep ConvNets in terms of the number of floating-point operations for the multi-digit SVHN models, along with the number of parameters in each model. The recurrent attention models, which only process a selected subset of the input, scale better than a ConvNet that looks over an entire image. The estimated cost for the DRAM is calculated using the maximum sequence length in the dataset; however, the expected computational cost is much lower in practice, since most of the house numbers are around 2–3 digits long. In addition, since the attention-based model does not process the whole image, it can naturally work on images of different sizes with the same computational cost, independent of the input dimensionality.

We also found that the attention-based model is less prone to over-fitting than ConvNets, likely because of the stochasticity in the glimpse policy during training. Though it is still beneficial to regularize the attention model with some dropout noise between the hidden layers during training, we found that it gives a very marginal performance boost of 0.1% on the multi-digit SVHN task. On the other hand, the deep 10-layer ConvNet is only able to achieve a 5.5% error rate when dropout is only applied to the last two fully connected hidden layers.

Finally, we note that DRAM can easily deal with variable-length label sequences. Moreover, a model trained on a dataset with a fixed sequence length can easily be transferred and fine-tuned on a similar dataset with longer target sequences. This is especially useful when there is a lack of data for the task with longer sequences.
Figure 2.3: The Wake-Sleep Recurrent Attention Model. [Figure: an inference network q(a_n | y, a_{1:n-1}, I, η), which receives the label y, sits on top of a two-layer recurrent prediction network p(a_n | a_{1:n-1}, I, θ); the prediction network receives a low-resolution version of the input image, observes the glimpses (x_1, a_1), ..., (x_N, a_N), and outputs the class prediction p(y | a, I, θ).]
2.4 Improved learning with re-weighted wake-sleep objective
Training stochastic attention models is difficult because the loss gradient involves intractable posterior expectations, and because the stochastic gradient estimates can have high variance. (The latter problem was also observed by Zaremba and Sutskever [2015] in the context of memory networks.) In this section, we propose the Wake-Sleep Recurrent Attention Model (WS-RAM), a method for training stochastic recurrent attention models which deals with the problems of intractable inference and high-variance gradients by taking advantage of several advances from the literature on training deep generative models: inference networks Dayan et al. [1995], the reweighted wake-sleep algorithm Bornschein and Bengio [2014], and control variates Paisley et al. [2012], Mnih and Gregor [2014]. During training, the WS-RAM approximates posterior expectations using importance sampling, with a proposal distribution computed by an inference network. Unlike the prediction network, the inference network has access to the object category label, which helps it choose better glimpse locations.
2.4.1 Wake-Sleep recurrent attention model
We now describe our wake-sleep recurrent attention model (WS-RAM). Given an image I, the network first chooses a sequence of glimpses a = (a_1, ..., a_N), and after each glimpse, receives an observation x_n computed by a mapping g(a_n, I). This mapping might, for instance, extract an image patch at a given scale. The first glimpse is based on a low-resolution version of the input, while subsequent glimpses are chosen based on information acquired from previous glimpses. The glimpses are chosen stochastically according to a distribution p(a_n | a_{1:n-1}, I, θ), where θ denotes the parameters of the network. This is in contrast with soft attention models, which deterministically allocate attention across all image locations. After the last glimpse, the network predicts a distribution p(y | a, I, θ) over the target y (for instance, the caption or image category).

As shown in Figure 2.3, the core of the attention network is a two-layer recurrent network, which we term the "prediction network", where the output at each time step is an action (saccade) which is used to compute the input at the next time step. A low-resolution version of the input image is fed to the network at the first time step, and the network predicts the class label at the final time step. Importantly, the low-resolution input is fed to the second layer, while the class label prediction is made by the first layer, preventing information from propagating directly from the low-resolution image to the output. This prevents local optima where the network learns to predict y directly from the low-resolution input, disregarding attention completely.

On top of the prediction network is an inference network, which receives both the class label and the attention network's top layer representation as inputs. It tries to predict the posterior distribution q(a_{n+1} | y, a_{1:n}, I, η), parameterized by η, over the next saccade, conditioned on the image category being correctly predicted. Its job is to guide the posterior sampler during training, thereby acting as a "teacher" for the attention network. The inference network is described further in Section 2.4.3.

One of the benefits of stochastic attention models is that the mapping g can be restricted to a small image region or a coarse granularity, which means it can potentially be made very efficient. Furthermore, g need not be differentiable, which allows for operations (such as choosing a scale) which would be difficult to implement in a soft attention network. The cost of this flexibility is that standard backpropagation cannot be applied, so instead we use the novel algorithms described in the next section.

Assume we have a dataset with labels y for the supervised prediction task (e.g. object category). In contrast to the supervised saliency prediction task (e.g. Itti et al. [1998], Judd et al. [2009]), there are no labels for where to attend. Instead, we learn an attention policy based on the idea that the best locations to attend to are the ones which most robustly lead the model to predict the correct category. In particular, we aim to maximize the probability of the class label (or equivalently, minimize the cross-entropy) by marginalizing over the actions at each glimpse:
\[
\mathcal{L} = \log p(y \mid \mathcal{I}, \theta) = \log \sum_{a} p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta). \tag{2.17}
\]
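Before turning to the learning objectives, the following toy Python sketch illustrates one possible form of the glimpse mapping g(a_n, I) described above: crop a square window at a chosen location and scale, then subsample it to a fixed-size patch. The patch size, the discrete scale levels and the nearest-neighbour subsampling are assumptions made purely for illustration, not the extractor used in the experiments.

import numpy as np

def extract_glimpse(image, loc, scale, out_size=12):
    """A toy g(a, I): crop a square window centred at `loc` whose side
    doubles with each scale level, then subsample it to out_size x out_size.
    Assumes the window fits inside the image."""
    side = out_size * (2 ** scale)                    # side length at this scale
    y = int(np.clip(loc[0], side // 2, image.shape[0] - side // 2))
    x = int(np.clip(loc[1], side // 2, image.shape[1] - side // 2))
    window = image[y - side // 2:y + side // 2, x - side // 2:x + side // 2]
    step = max(side // out_size, 1)
    return window[::step, ::step][:out_size, :out_size]   # coarse, retina-like patch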
2.4.2 An improved lower bound on the log-likelihood
In this section, we first describe the previous variational lower bound objective. We then introduce a new objective function which directly estimates the gradients of L. The new method can be seen as maximizing a tighter lower bound on L.

Let q(a | y, I) be an approximating distribution. The lower bound on L is then given by:
\[
\mathcal{L} = \log \sum_{a} p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta) \;\geq\; \sum_{a} q(a \mid y, \mathcal{I}) \log p(y, a \mid \mathcal{I}, \theta) + \mathcal{H}[q] = \mathcal{F}. \tag{2.18}
\]
In the case where q(a | y, I) = p(a | I, θ) is the prior, as considered by Ba et al. [2015], this reduces to
\[
\mathcal{F} = \sum_{a} p(a \mid \mathcal{I}, \theta) \log p(y \mid a, \mathcal{I}, \theta). \tag{2.19}
\]
The learning rules can be derived by taking derivatives of Eqn. 2.19 with respect to the model parameters:
\[
\frac{\partial \mathcal{F}}{\partial \theta} = \sum_{a} p(a \mid \mathcal{I}, \theta) \left[ \frac{\partial \log p(y \mid a, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid a, \mathcal{I}, \theta)\, \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \tag{2.20}
\]
The summation can be approximated using M Monte Carlo samples ã^m drawn from p(a | I, θ):
\[
\frac{\partial \mathcal{F}}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{\partial \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)\, \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \tag{2.21}
\]
The partial derivative terms can each be computed using standard backpropagation. This suggests a simple gradient-based training algorithm: for each image, one first draws the samples ã^m from the prior p(a | I, θ), and then updates the parameters according to Eqn. 2.21. As observed by Ba et al. [2015], one must carefully use control variates in order to make this technique practical; we defer discussion of control variates to Section 2.4.4.

The variational method described above has some counterintuitive properties early in training. First, because it averages the log-likelihood over actions, it greatly amplifies the differences in probabilities assigned to the true category by different bad glimpses. For instance, a glimpse sequence which leads to 0.01 probability assigned to the correct class is considered much worse than one which leads to 0.02 probability under the variational objective, even though in practice they may be equally bad, since both have missed the relevant information. A second odd behavior is that all glimpse sequences are weighted equally in the log-likelihood gradient. It would be better if the training procedure focused its effort on the glimpses which contain the relevant information. Both of these effects contribute noise to the training procedure, especially in its early stages.

Instead, we adopt an approach based on the wake-p step of reweighted wake-sleep Bornschein and Bengio [2014], where we attempt to maximize the marginal log-probability L directly. We differentiate the marginal log-likelihood objective in Eqn. 2.17 with respect to the model parameters:
\[
\frac{\partial \mathcal{L}}{\partial \theta} = \frac{1}{p(y \mid \mathcal{I}, \theta)} \sum_{a} p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta) \left[ \frac{\partial \log p(y \mid a, \mathcal{I}, \theta)}{\partial \theta} + \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \tag{2.22}
\]
The summation and the normalizing constant are both intractable to evaluate, so we estimate them using importance sampling. We must define a proposal distribution q(a | y, I), which ideally should be close to the posterior p(a | y, I, θ). One reasonable choice is the prior p(a | I, θ); another choice is described in Section 2.4.3. Normalized importance sampling gives a biased but consistent estimator of the gradient of L. Given samples ã^1, ..., ã^M from q(a | y, I), the (unnormalized) importance weights are computed as:
\[
\tilde{w}^m = \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta)\, p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{q(\tilde{a}^m \mid y, \mathcal{I})}. \tag{2.23}
\]
The Monte Carlo estimate of the gradient is given by:
\[
\frac{\partial \mathcal{L}}{\partial \theta} \approx \sum_{m=1}^{M} w^m \left[ \frac{\partial \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{\partial \theta} + \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right], \tag{2.24}
\]
where w^m = w̃^m / Σ_{i=1}^{M} w̃^i are the normalized importance weights. When q is chosen to be the prior, this approach is equivalent to the method of Tang and Salakhutdinov [2013] for learning generative feed-forward networks.

Our importance sampling based estimator can also be viewed as gradient ascent on the objective function E[ log (1/M) Σ_{m=1}^{M} w̃^m ]. Combining Jensen's inequality with the unbiasedness of the w̃^m shows that this is a lower bound on the log-likelihood:
\[
\mathbb{E}\left[\log \frac{1}{M} \sum_{m=1}^{M} \tilde{w}^m\right] \;\leq\; \log \mathbb{E}\left[\frac{1}{M} \sum_{m=1}^{M} \tilde{w}^m\right] = \log \mathbb{E}\left[\tilde{w}^m\right] = \mathcal{L}. \tag{2.25}
\]
We relate this to the previous section by noting that F = E[log w̃^m]. Another application of Jensen's inequality shows that our proposed bound is at least as accurate as F:
\[
\mathcal{F} = \mathbb{E}\left[\log \tilde{w}^m\right] = \mathbb{E}\left[\frac{1}{M} \sum_{m=1}^{M} \log \tilde{w}^m\right] \;\leq\; \mathbb{E}\left[\log \frac{1}{M} \sum_{m=1}^{M} \tilde{w}^m\right]. \tag{2.26}
\]
Burda et al. [2015] further analyzed a closely related importance sampling based estimator in the context of generative models, bounding the mean absolute deviation and showing that the bias decreases monotonically with the number of samples.
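To make the estimator of Eqns. 2.23-2.24 concrete, the following NumPy sketch computes the normalized importance weights from per-sample log-probabilities. The arrays are placeholders for quantities the prediction and inference networks would produce; this is a sketch of the weighting step only, not a full training loop.

import numpy as np

def normalized_importance_weights(log_p_a, log_p_y_given_a, log_q_a):
    """w^m from Eqns. 2.23-2.24 for M sampled glimpse sequences.

    log_p_a:         log p(a~^m | I, theta)      (length-M array)
    log_p_y_given_a: log p(y | a~^m, I, theta)   (length-M array)
    log_q_a:         log q(a~^m | y, I)          (length-M array)
    """
    log_w = log_p_a + log_p_y_given_a - log_q_a   # unnormalized weights, in log space
    log_w -= log_w.max()                          # subtract max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()                            # normalized weights sum to one

# The gradient estimate of Eqn. 2.24 then weights each sample's gradient
# contribution: grad ~= sum_m w[m] * (grad_log_p_y[m] + grad_log_p_a[m]).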
2.4.3 Training an inference network
Late in training, once the attention model has learned an effective policy, the prior distribution p(a | I, θ) is a reasonable choice for the proposal distribution q(a | y, I), as it puts significant probability mass on good actions. But early in training, the model may have only a small probability of choosing a good set of glimpses, and the prior may have little overlap with the posterior. To deal with this, we train an inference network to predict, given the observations as well as the class label, where the network should look to correctly predict that class (see Figure 2.3). With this additional information, the inference network can act as a "teacher" for the attention policy. The inference network predicts a sequence of glimpses stochastically:
\[
q(a \mid y, \mathcal{I}, \eta) = \prod_{n=1}^{N} q(a_n \mid y, \mathcal{I}, \eta, a_{1:n-1}). \tag{2.27}
\]
This distribution is analogous to the prior, except that each decision also takes into account the class label y. We denote the parameters for the inference network as η. During training, the prediction network is learnt by following the gradient of the estimator in Eqn. 2.24, with samples ã^m ~ q(a | y, I, η) drawn from the inference network output.

Our training procedure for the inference network parallels the wake-q step of reweighted wake-sleep Bornschein and Bengio [2014]. Intuitively, the inference network is most useful if it puts large probability density over locations in an image that are most informative for predicting class labels. We therefore train the inference weights η to minimize the Kullback-Leibler divergence between the recognition model prediction q(a | y, I, η) and the posterior distribution from the attention model p(a | y, I, θ):
\[
\min_{\eta} D_{KL}(p \,\|\, q) = \min_{\eta} \; -\sum_{a} p(a \mid y, \mathcal{I}, \theta) \log q(a \mid y, \mathcal{I}, \eta). \tag{2.28}
\]
The gradient update for the recognition weights can be obtained by taking derivatives of Eq. (2.28) with respect to the recognition weights η:
\[
\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} = \mathbb{E}_{p(a \mid y, \mathcal{I}, \theta)} \left[ \frac{\partial \log q(a \mid y, \mathcal{I}, \eta)}{\partial \eta} \right]. \tag{2.29}
\]
Since the posterior expectation is intractable, we estimate it with importance sampling. In fact, we reuse the importance weights computed for the prediction network update (see Eqn. 2.23) to obtain the following gradient estimate for the recognition network:
\[
\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} \approx \sum_{m=1}^{M} w^m\, \frac{\partial \log q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\partial \eta}. \tag{2.30}
\]
2.4.4 Control variates
The speed of convergence of gradient ascent with the gradients defined in Eqns. 2.24 and 2.30 suffers from high variance of the stochastic gradient estimates. Past work using similar gradient updates has found significant benefit from the use of control variates, or reward baselines, to reduce the variance Williams [1992a], Paisley et al. [2012], Mnih et al. [2014a], Mnih and Gregor [2014], Ba et al. [2015]. Choosing effective control variates for the stochastic gradient estimators amounts to finding a function that is highly correlated with the gradient vectors, and whose expectation is known or tractable to compute Paisley et al. [2012], Weaver and Tao [2001b]. Unfortunately, a good choice of control variate is highly model-dependent. We first note that:
\[
\mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)} \left[ \frac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)}\, \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right] = 0, \qquad \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)} \left[ \frac{\partial \log q(a \mid y, \mathcal{I}, \eta)}{\partial \eta} \right] = 0. \tag{2.31}
\]
The terms inside the expectations are very similar to the gradients in Eqns. 2.24 and 2.30, suggesting that stochastic estimates of these expectations would make good control variates. To increase the correlation between the gradients and the control variates, we reuse the same set of samples and importance weights for the gradients and the control variates. Using these control variates in the gradient estimates for the prediction and recognition networks, we obtain:
\[
\frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \approx \sum_{m=1}^{M} \left( w^m - \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta) / q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\sum_{i=1}^{M} p(\tilde{a}^i \mid \mathcal{I}, \theta) / q(\tilde{a}^i \mid y, \mathcal{I}, \eta)} \right) \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta}, \tag{2.32}
\]
\[
\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} \approx \sum_{m=1}^{M} \left( w^m - \frac{1}{M} \right) \frac{\partial \log q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\partial \eta}. \tag{2.33}
\]
Our use of control variates does not bias the gradient estimates (beyond the bias which is present due to importance sampling). However, as we show in the experiments, the resulting estimates have much lower variance than those of Eqns. 2.24 and 2.30. Following the analogy with reinforcement learning highlighted by Mnih and Gregor [2014], these control variates can also be viewed as reward baselines:
\[
b_p = \frac{\mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)} \left[ \frac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)}\, p(y \mid a, \mathcal{I}, \theta) \right]}{M \cdot \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)} \left[ \frac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)} \right] \cdot \mathbb{E}_{q(a \mid y, \mathcal{I}, \theta)} \left[ p(y \mid a, \mathcal{I}, \theta) \right]} \;\approx\; \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta) / q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\sum_{i=1}^{M} p(\tilde{a}^i \mid \mathcal{I}, \theta) / q(\tilde{a}^i \mid y, \mathcal{I}, \eta)}, \tag{2.34}
\]
\[
b_q = \frac{\mathbb{E}_{p(a \mid \mathcal{I}, \theta)} \left[ p(y \mid a, \mathcal{I}, \theta) \right]}{M \cdot \mathbb{E}_{p(a \mid \mathcal{I}, \theta)} \left[ p(y \mid a, \mathcal{I}, \theta) \right]} = \frac{1}{M}, \tag{2.35}
\]
where M is the number of samples drawn from the proposal q.
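The baseline corrections of Eqns. 2.34-2.35 amount to a small adjustment of the sample weights before they multiply the score-function terms. The NumPy sketch below mirrors that adjustment on placeholder arrays; it is an illustration of the weight arithmetic only, not the full parameter update.

import numpy as np

def baseline_corrected_weights(w, log_p_a, log_q_a):
    """Subtract the reward baselines b_p (Eqn. 2.34) and b_q (Eqn. 2.35)
    from the normalized importance weights w (all length-M arrays)."""
    M = len(w)
    log_ratio = log_p_a - log_q_a
    ratio = np.exp(log_ratio - log_ratio.max())   # p/q up to a constant; it cancels below
    b_p = ratio / ratio.sum()                     # per-sample baseline of Eqn. 2.34
    b_q = 1.0 / M                                 # constant baseline of Eqn. 2.35
    w_theta = w - b_p    # multiplies d/d(theta) log p(a~^m | I, theta), as in Eqn. 2.32
    w_eta = w - b_q      # multiplies d/d(eta)  log q(a~^m | y, I, eta), as in Eqn. 2.33
    return w_theta, w_eta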
2.4.5 Encouraging exploration
Similarly to other methods based on reinforcement learning, stochastic attention networks face the problem of encouraging exploration of different actions. Since the gradient in Eqn. 2.24 only rewards or punishes glimpse sequences which are actually performed, any part of the space which is never visited will receive no reward signal. Ba et al. [2015] introduced several heuristics to encourage exploration, including: (1) raising the temperature of the proposal distribution, (2) regularizing the attention policy to encourage viewing all image locations, and (3) adding a regularization term to encourage high entropy in the action distribution. We have implemented all three heuristics for the WS-RAM and for the baselines. While these heuristics are important for good performance of the baselines, we found that they made little difference to the WS-RAM because the basic method already explores adequately.
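As an illustration of heuristic (3), an entropy bonus on the action distribution can be added to the training objective. The sketch below shows such a term for a categorical glimpse distribution; the weighting lam and the exact form are assumptions for illustration, not the precise regularizer used in the experiments.

import numpy as np

def entropy_bonus(action_probs, lam=0.01):
    """lam * H[p] for a categorical distribution over glimpse actions.
    Maximizing this term discourages the policy from collapsing onto a
    single location or scale early in training."""
    p = np.clip(np.asarray(action_probs), 1e-8, 1.0)
    return lam * float(-np.sum(p * np.log(p)))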
2.4.6 Experiments
To measure the effectiveness of the proposed WS-RAM method, we first investigated a toy classification task involving a variant of the MNIST handwritten digits dataset LeCun et al. [1998a] where transformations were applied to the images. We then evaluated the proposed method on a substantially more difficult image caption generation task using the Flickr8k Hodosh et al. [2013] dataset.
Translated scaled MNIST
We generated a dataset of randomly translated and scaled handwritten digits from the MNIST dataset LeCun et al. [1998a]. Each digit was placed in a 100x100 black background image at a random location and scale. The task was to identify the digit class. The attention models were allowed four glimpses before making a classification prediction. The goal of this experiment was to evaluate the effectiveness of our proposed WS-RAM compared with the variational approach of Ba et al. [2015].

For both the WS-RAM and the baseline, the architecture was a stochastic attention model which used ReLU units in all recurrent layers. The actions included both continuous and discrete latent variables, corresponding to the glimpse location and scale, respectively. The distribution over actions was represented as a Gaussian random variable for the location and an independent multinomial random variable for the scale. All networks were trained using Adam [Kingma and Ba, 2014a], with the learning rate set to the highest value that allowed the model to successfully converge to a sensible attention policy.

The classification performance results are shown in Table 2.6. In Figure 2.4, the WS-RAM is compared with the variational baseline, each using the same number of samples (in order to make computation time roughly equivalent). We also show comparisons against ablated versions of the WS-RAM where the control variates and inference network were removed. When the inference network was removed, the prior p(a | I, θ) was used for the proposal distribution.

In addition to the classification results, we measured the effective sample size (ESS) of our method with and without control variates and the inference network. ESS is a standard metric for evaluating importance samplers and is defined as 1 / Σ_m (w^m)^2, where w^m denotes the normalized importance weights. Results are shown in Figure 2.4. Using the inference network reduced the variance of the gradient estimates, although this improvement was not reflected in the ESS. Control variates improved both metrics.
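The effective sample size reported here is a one-line computation on the normalized importance weights; a small NumPy sketch:

import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum_m (w^m)^2 for normalized importance weights w.
    It equals M when all weights are 1/M and approaches 1 when a single
    sample dominates."""
    w = np.asarray(w)
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size([0.2] * 5))   # -> 5.0, the maximum for 5 samples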
Figure 2.4: Left: Training error as a function of the number of updates. Middle: variance of the gradient estimates. Right: effective sample size (max = 5). Horizontal axis: thousands of updates. VAR: variational baseline; WS-RAM: our proposed method; +q: uses the inference network for the proposal distribution; +c: uses control variates.
Table 2.6: Classification error rate comparison for the attention models trained using different algorithms on translated scaled MNIST. The numbers are reported after 10 million updates using 5 samples.

Test err.    VAR      WS-RAM    WS-RAM + q
no c.v.      3.11%    4.23%     2.59%
+ c.v.       1.81%    1.85%     1.62%

Figure 2.5: The effect of the exploration heuristics on the variational baseline and the WS-RAM. [Training error curves for VAR+c and WS-RAM+q+c, each with and without the exploration heuristics.]
In Section 2.4.5, we described heuristics which encourage the models to explore the action space. Figure 2.5 compares training with and without these heuristics. Without the heuristics, the variational method quickly fell into a local minimum where the model predicted only one glimpse scale for all images; the exploration heuristics fixed this problem. By contrast, the WS-RAM did not appear to have this problem, so the heuristics were not necessary.
2.5 Summary
Convolutional neural networks, trained end-to-end, have been shown to substantially outperform previous approaches to various supervised learning tasks in computer vision (e.g. Krizhevsky et al. [2012a]). Despite their wide success, convolutional nets are computationally expensive when processing high-resolution input images, because they must examine all image locations at a fine scale. This has motivated our development of the visual attention-based models in this chapter, which reduce the number of parameters and computational operations by selecting informative regions of an image to focus on. In addition to computational speedups, one can understand what information a neural network is using by seeing where it is looking.

We studied two learning objectives for stochastic attention models, the variational lower bound and the reweighted importance sampling bound, and compared them on standard handwritten digit classification tasks. We demonstrated that our stochastic attention model can learn to (1) classify translated and scaled MNIST digits, and (2) generate image captions by attending to the relevant objects in images and their corresponding scale. The proposed reweighted wake-sleep algorithm shows much improved training performance for the stochastic visual attention models.

Chapter 3
Generating image (and) captions with visual attention
3.1 Problem definition
Automatically generating captions for an image is a task close to the heart of scene understanding, one of the primary goals of computer vision. Not only must caption generation models be able to solve the computer vision challenges of determining what objects are in an image, but they must also be powerful enough to capture and express their relationships in natural language. For this reason, caption generation has long been seen as a difficult problem. It amounts to mimicking the remarkable human ability to compress huge amounts of salient visual information into descriptive language and is thus an important challenge for machine learning and AI research.

Yet despite the difficult nature of this task, there has been a recent surge of research interest in attacking the image caption generation problem. Aided by advances in training deep neural networks [Krizhevsky et al., 2012c] and the availability of large classification datasets [Russakovsky et al., 2014], recent work has significantly improved the quality of caption generation using a combination of convolutional neural networks (convnets) to obtain vectorial representations of images and recurrent neural networks to decode those representations into natural language sentences (see Sec. 3.2).

Unlike CNNs that compress an entire image into a static representation, visual attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. Using representations (such as those from the very top layer of a convnet) that distill the information in an image down to the most salient objects is one effective solution that has been widely adopted in previous work. Encouraged by recent advances in caption generation and inspired by recent successes in employing attention in machine translation [Bahdanau et al., 2014], we investigate visual attention models that can attend to salient parts of an image while generating its caption.
3.2 Related work
In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of
these methods are based on recurrent neural networks and inspired by the successful use of sequence-to-sequence training with neural networks for machine translation [Cho et al., 2014c, Bahdanau et al., 2014, Sutskever et al., 2014c, Kalchbrenner and Blunsom, 2013]. The encoder-decoder framework [Cho et al., 2014c] of machine translation is well suited, because it is analogous to "translating" an image to a sentence.

Figure 3.1: Our model learns a words/image alignment. The visualized attentional maps (3) are explained in Sections 3.3.1 & 3.3.3. [Figure panels: 1. Input Image; 2. Convolutional Feature Extraction (14x14 feature map); 3. RNN with attention over the image (LSTM); 4. Word by word generation (e.g. "A bird flying over a body of water").]

The first approach to using neural networks for caption generation was proposed by Kiros et al. [2014a], who used a multimodal log-bilinear model that was biased by features from the image. The model presented in this chapter was later followed by Kiros et al. [2014b], whose method was designed to explicitly allow for a natural way of doing both ranking and generation. Mao et al. [2014] used a similar approach to generation but replaced a feedforward neural language model with a recurrent one. Both Vinyals et al. [2014a] and Donahue et al. [2014] used recurrent neural networks (RNNs) based on long short-term memory (LSTM) units [Hochreiter and Schmidhuber, 1997a] for their models. Unlike Kiros et al. [2014a] and Mao et al. [2014], whose models see the image at each time step of the output word sequence, Vinyals et al. [2014a] only showed the image to the RNN at the beginning. Along with images, Donahue et al. [2014] and Yao et al. [2015] also applied LSTMs to videos, allowing their models to generate video descriptions.

Most of these works represent images as a single feature vector from the top layer of a pre-trained convolutional network. Karpathy and Li [2014] instead proposed to learn a joint embedding space for ranking and generation; their model learns to score sentence and image similarity as a function of R-CNN object detections combined with outputs of a bidirectional RNN. Fang et al. [2014] proposed a three-step pipeline for generation by incorporating object detections. Their models first learn detectors for several visual concepts based on a multi-instance learning framework. A language model trained on captions was then applied to the detector outputs, followed by rescoring from a joint image-text embedding space. Unlike these models, our proposed attention framework does not explicitly use object detectors but instead learns latent alignments from scratch. This allows our model to go beyond "objectness" and learn to attend to abstract concepts.

Prior to the use of neural networks for generating captions, two main approaches were dominant. The first involved generating caption templates which were filled in based on the results of object detections and attribute discovery (Kulkarni et al. [2013], Li et al. [2011], Yang et al. [2011], Mitchell et al. [2012],
Elliott and Keller [2013]). The second approach was based on first retrieving similar captioned images from a large database and then modifying these retrieved captions to fit the query [Kuznetsova et al., 2012, 2014]. These approaches typically involved an intermediate "generalization" step to remove the specifics of a caption that are only relevant to the retrieved image, such as the name of a city. Both of these approaches have since fallen out of favour, giving way to the now dominant neural network methods.

There has been a long line of previous work incorporating the idea of attention into neural networks. Some that share the same spirit as our work include Larochelle and Hinton [2010a], Denil et al. [2012], Tang et al. [2014] and, more recently, Gregor et al. [2015b]. In particular, however, our work directly extends the work of Bahdanau et al. [2014], Mnih et al. [2014b], Ba et al. [2014], Graves [2013a].
3.3 Image Caption Generation with Attention Mechanism
3.3.1 Model details
In this section, we describe the two variants of our attention-based model by first describing their common framework. The key difference is the definition of the ψ function which we describe in detail in Sec. 3.3.2. See Fig. 3.1 for the graphical illustration of the proposed model. We denote vectors with bolded font and matrices with capital letters. In our description below, we suppress bias terms for readability.
Encoder: convolutional features
Our model takes a single raw image and generates a caption y encoded as a sequence of 1-of-K encoded words.
\[
y = \{ y_1, \ldots, y_C \}, \quad y_i \in \mathbb{R}^K,
\]
where K is the size of the vocabulary and C is the length of the caption.

We use a convolutional neural network in order to extract a set of feature vectors which we refer to as annotation vectors. The extractor produces L vectors, each of which is a D-dimensional representation corresponding to a part of the image,
\[
a = \{ \phi_1, \ldots, \phi_L \}, \quad \phi_i \in \mathbb{R}^D.
\]
In order to obtain a correspondence between the feature vectors and portions of the 2-D image, we extract features from a lower convolutional layer unlike previous work which instead used a fully connected layer. This allows the decoder to selectively focus on certain parts of an image by weighting a subset of all the feature vectors.
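Concretely, with the 14x14x512 convolutional feature map used later in the experiments (Sec. 3.3.2), the annotation set is simply the flattened grid of L = 196 vectors of dimension D = 512. A NumPy sketch, with a random placeholder standing in for the real feature map:

import numpy as np

# Placeholder for the output of a lower convolutional layer (H x W x D).
feature_map = np.random.randn(14, 14, 512)

# One D-dimensional annotation vector phi_i per spatial location,
# so L = 14 * 14 = 196 and D = 512.
annotations = feature_map.reshape(-1, feature_map.shape[-1])   # shape (196, 512)
L, D = annotations.shape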
Decoder: long short-term memory network
We use a long short-term memory (LSTM) network [Hochreiter and Schmidhuber, 1997a] that produces a caption by generating one word at every time step, conditioned on a context vector, the previous hidden state and the previously generated words. Our implementation of LSTMs, shown in Fig. 1.1, closely follows the one used in Zaremba et al. [2014].
Figure 3.2: Visualization of the attention for each generated word. The rough visualizations are obtained by upsampling the attention weights and smoothing. (top) "soft" and (bottom) "hard" attention (note that both models generated the same caption in this example).
In simple terms, the context vector ẑ_t is a dynamic representation of the relevant part of the image input at time t. We define a mechanism ψ that computes ẑ_t from the annotation vectors φ_i, i = 1, ..., L, corresponding to the features extracted at different image locations. For each location i, the mechanism generates a positive weight α_i which can be interpreted either as the probability that location i is the right place to focus on for producing the next word (stochastic attention mechanism), or as the relative importance to give to location i in blending the φ_i's together (deterministic attention mechanism). The weight α_i of each annotation vector φ_i is computed by an attention model f_att for which we use a multilayer perceptron conditioned on the previous hidden state a_{t-1}. To emphasize, we note that the hidden state varies as the output RNN advances in its output sequence: "where" the network looks next depends on the sequence of words that has already been generated.
\[
e_{ti} = f_{att}(\phi_i, a_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}.
\]
Once the weights (which sum to one) are computed, the context vector ẑ_t is computed by
\[
\hat{z}_t = \psi(\{\phi_i\}, \{\alpha_i\}), \tag{3.1}
\]
where ψ is a function that returns a single vector given the set of annotation vectors and their corresponding weights. The details of the ψ function are discussed in Sec. 3.3.2.
The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (f_init,c and f_init,h):
\[
c_0 = f_{\text{init},c}\!\left( \frac{1}{L} \sum_{i}^{L} \phi_i \right), \qquad a_0 = f_{\text{init},h}\!\left( \frac{1}{L} \sum_{i}^{L} \phi_i \right).
\]
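A minimal NumPy sketch of the attention computation above, using the deterministic ("soft") choice of ψ discussed later in Sec. 3.3.2: a small MLP f_att scores each annotation vector against the previous hidden state, the scores are passed through a softmax to give α, and the context vector is the weighted sum. The single-hidden-layer form of f_att and the parameter shapes are assumptions made for illustration only.

import numpy as np

def soft_attention(annotations, h_prev, W_a, W_h, v):
    """annotations: (L, D) annotation vectors phi_i; h_prev: (n,) previous
    hidden state; W_a: (D, k), W_h: (n, k), v: (k,) parameterize f_att."""
    scores = np.tanh(annotations @ W_a + h_prev @ W_h) @ v   # e_ti, shape (L,)
    scores -= scores.max()                                   # softmax stability
    alpha = np.exp(scores) / np.exp(scores).sum()            # weights sum to one
    z_hat = alpha @ annotations                              # context vector, shape (D,)
    return z_hat, alpha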
We use a deep output layer [Pascanu et al., 2014] to compute the output word probability. Its inputs are cues from the image (the context vector), the previously generated word, and the decoder state a_t:
\[
p(y_t \mid \phi, y_1^{t-1}) \propto \exp\!\left( L_o (E y_{t-1} + L_h a_t + L_z \hat{z}_t) \right), \tag{3.2}
\]
where L_o ∈ R^{K×m}, L_h ∈ R^{m×n}, L_z ∈ R^{m×D}, and E are learned parameters initialized randomly.
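A matching NumPy sketch of the deep output layer in Eq. (3.2), with placeholder parameter matrices of the shapes given above:

import numpy as np

def word_distribution(y_prev_embed, h_t, z_hat, L_o, L_h, L_z):
    """p(y_t | .) from Eq. (3.2) as a softmax over the vocabulary.
    y_prev_embed = E y_{t-1}, shape (m,); h_t: decoder state, shape (n,);
    z_hat: context vector, shape (D,); L_o: (K, m), L_h: (m, n), L_z: (m, D)."""
    logits = L_o @ (y_prev_embed + L_h @ h_t + L_z @ z_hat)   # shape (K,)
    logits -= logits.max()                                    # softmax stability
    probs = np.exp(logits)
    return probs / probs.sum()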
3.3.2 Learning stochastic “hard” vs deterministic “soft” Attention
In this section we discuss two alternative mechanisms for the attention model f_att: stochastic attention and deterministic attention.
Stochastic “hard” attention
We represent the location variable s_t as where the model decides to focus attention when generating the t-th word. s_{t,i} is an indicator one-hot variable which is set to 1 if the i-th location (out of L) is the one used to extract visual features. By treating the attention locations as intermediate latent variables, we can assign a multinoulli distribution parametrized by {α_i}, and view ẑ_t as a random variable:
\[
p(s_{t,i} = 1 \mid s_{j<t}, \phi) = \alpha_{t,i}. \tag{3.3}
\]
Similar to Chapter 2, we define a variational lower bound objective L_s on the marginal log-likelihood log p(y | φ) of observing the sequence of words y given image features φ. Similar to work in deep generative modeling [Kingma and Welling, 2014b, Rezende et al., 2014], the learning algorithm for the parameters θ of the model can be derived by directly optimizing
\[
\mathcal{L}_s = \sum_{s} p(s \mid \phi) \log p(y \mid s, \phi) \;\leq\; \log \sum_{s} p(s \mid \phi)\, p(y \mid s, \phi) = \log p(y \mid \phi), \tag{3.5}
\]
following its gradient
\[
\frac{\partial \mathcal{L}_s}{\partial \theta} = \sum_{s} p(s \mid \phi) \left[ \frac{\partial \log p(y \mid s, \phi)}{\partial \theta} + \log p(y \mid s, \phi)\, \frac{\partial \log p(s \mid \phi)}{\partial \theta} \right]. \tag{3.6}
\]
We approximate this gradient of L_s by a Monte Carlo method such that
\[
\frac{\partial \mathcal{L}_s}{\partial \theta} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, \phi)}{\partial \theta} + \log p(y \mid \tilde{s}^n, \phi)\, \frac{\partial \log p(\tilde{s}^n \mid \phi)}{\partial \theta} \right], \tag{3.7}
\]
where s̃^n = (s_1^n, s_2^n, ...) is a sequence of sampled attention locations. We sample the location s_t from the multinoulli distribution defined by Eq. (3.3): s̃_t^n ~ Multinoulli_L({α_i^n}).

We reduce the variance of this estimator with the moving average baseline technique [Weaver and Tao, 2001a]. Upon seeing the k-th mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log-likelihoods with exponential decay:
\[
b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(y \mid \tilde{s}_k, \phi).
\]
To further reduce the estimator variance, the gradient of the entropy H[s] of the multinoulli distribution is added to the RHS of Eq. (3.7). The final learning rule for the model is then
\[
\frac{\partial \mathcal{L}_s}{\partial \theta} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, \phi)}{\partial \theta} + \lambda_r \left( \log p(y \mid \tilde{s}^n, \phi) - b \right) \frac{\partial \log p(\tilde{s}^n \mid \phi)}{\partial \theta} + \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial \theta} \right],
\]
where λ_r and λ_e are two hyper-parameters set by cross-validation. As pointed out and used by Ba et al. [2014] and Mnih et al. [2014b], this formulation is equivalent to the REINFORCE learning rule [Williams, 1992c], where the reward for the attention choosing a sequence of actions is a real value proportional to the log-likelihood of the target sentence under the sampled attention trajectory.

In order to further improve the robustness of this learning rule, with probability 0.5 for a given image we set the sampled attention location s̃ to its expected value α (equivalent to the deterministic attention in Sec. 3.3.2).

Deterministic "soft" attention

Learning stochastic attention requires sampling the attention location s_t each time; instead, we can take the expectation of the context vector ẑ_t directly,
\[
\mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \sum_{i=1}^{L} \alpha_{t,i}\, \phi_i, \tag{3.8}
\]
and formulate a deterministic attention model by computing a soft attention weighted annotation vector ψ({φ_i}, {α_i}) = Σ_i^L α_i φ_i, as proposed by Bahdanau et al. [2014]. This corresponds to feeding in a softly α-weighted context into the system. The whole model is smooth and differentiable under the deterministic attention, so learning end-to-end is trivial by using standard back-propagation.

Learning the deterministic attention can also be understood as approximately optimizing the marginal likelihood in Eq. (3.5) under the attention location random variable s_t from Sec. 3.3.2. The hidden activation of the LSTM, a_t, is a linear projection of the stochastic context vector ẑ_t followed by a tanh non-linearity. To a first-order Taylor approximation, the expected value E_{p(s_t | a)}[a_t] is equivalent to computing a_t using a single forward computation with the expected context vector E_{p(s_t | a)}[ẑ_t]. Let n_{t,i} denote the pre-softmax activation in Eq. (3.2) with ẑ_t set to φ_i.
Then, we can write the normalized weighted geometric mean (NWGM) of the softmax of the k-th word prediction as
\[
\text{NWGM}[p(y_t = k \mid \phi)] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1 \mid a)}} = \frac{\exp\!\left( \mathbb{E}_{p(s_t \mid a)}[n_{t,k}] \right)}{\sum_j \exp\!\left( \mathbb{E}_{p(s_t \mid a)}[n_{t,j}] \right)}.
\]
This implies that the NWGM of the word prediction can be well approximated by using the expected context vector E[ẑ_t] instead of the sampled context vector φ_i. Furthermore, from the result by Baldi and Sadowski [2014], the NWGM above, which can be computed by a single feedforward computation, approximates the expectation E[p(y_t = k | φ)] of the output over all possible attention locations induced by the random variable s_t. This suggests that the proposed deterministic attention model approximately maximizes the marginal likelihood over all possible attention locations.

Doubly stochastic attention

In training the deterministic version of our model, we introduce a form of doubly stochastic regularization that encourages the model to pay equal attention to every part of the image. Whereas the attention at every point in time sums to 1 by construction (i.e. Σ_i α_{ti} = 1), the attention Σ_t α_{ti} over time is not constrained in any way. This makes it possible for the decoder to ignore some parts of the input image. In order to alleviate this, we encourage Σ_t α_{ti} ≈ τ where τ ≥ L/D. In our experiments, we observed that this penalty quantitatively improves overall performance and qualitatively leads to more descriptive captions.

Additionally, the soft attention model predicts a gating scalar β from the previous hidden state a_{t-1} at each time step t, such that ψ({φ_i}, {α_i}) = β Σ_i^L α_i φ_i, where β_t = σ(f_β(a_{t-1})). This gating variable lets the decoder decide whether to put more emphasis on language modeling or on the context at each time step. Qualitatively, we observe that the gating variable is larger when the decoder describes an object in the image.

The soft attention model is trained end-to-end by minimizing the following penalized negative log-likelihood:
\[
\mathcal{L}_d = -\log p(y \mid \phi) + \lambda \sum_{i}^{L} \left( 1 - \sum_{t}^{C} \alpha_{ti} \right)^2, \tag{3.9}
\]
where we simply fixed τ to 1.

Training procedure

Both variants of our attention model were trained with stochastic gradient descent using adaptive learning rates. For the Flickr8k dataset, we found that RMSProp [Tieleman and Hinton, 2012b] worked best, while for the Flickr30k/MS COCO datasets we found the recently proposed Adam algorithm [Kingma and Ba, 2014c] to be quite effective.

To create the annotation vectors φ_i used by our decoder, we used the Oxford VGGnet [Simonyan and Zisserman, 2014a] pre-trained on ImageNet without finetuning. In our experiments we use the 14 x 14 x 512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196 x 512 (i.e. L x D) encoding. In principle, however, any encoding function could be used. In addition, with enough data, the encoder could also be trained from scratch (or fine-tuned) with the rest of the model.

As our implementation requires time proportional to the length of the longest sentence per update, we found training on a random group of captions to be computationally wasteful. To mitigate this problem, in preprocessing we build a dictionary mapping the length of a sentence to the corresponding subset of captions. Then, during training we randomly sample a length and retrieve a mini-batch of size 64 of that length.
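The length-bucketing trick just described can be sketched in a few lines of Python. The batch size of 64 follows the text; the sampling with replacement within a bucket is a simplification for illustration.

import random
from collections import defaultdict

def build_length_buckets(captions):
    """Preprocessing: map each caption length to the indices of captions
    with that length."""
    buckets = defaultdict(list)
    for idx, caption in enumerate(captions):
        buckets[len(caption)].append(idx)
    return buckets

def sample_minibatch(buckets, batch_size=64):
    """Training: pick a random length, then draw a mini-batch of captions
    of that length, so no time steps are wasted on padding."""
    length = random.choice(list(buckets.keys()))
    pool = buckets[length]
    return [random.choice(pool) for _ in range(batch_size)]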
We found that this greatly improved convergence speed with no noticeable diminution in performance. On our largest dataset (MS COCO), our soft attention model took less than 3 days to train on an NVIDIA Titan Black GPU.

In addition to dropout [Srivastava et al., 2014], the only other regularization strategy we used was early stopping on BLEU score. We observed a breakdown in the correlation between the validation set log-likelihood and BLEU in the later stages of training during our experiments. Since BLEU is the most commonly reported metric, we used BLEU on our validation set for model selection.

In our experiments with soft attention, we used Whetlab1 [Snoek et al., 2012, 2014] in our Flickr8k experiments. Some of the intuitions we gained from the hyperparameter regions it explored were especially important in our Flickr30k and COCO experiments. We make our code for these models publicly available to encourage future research in this area2.

1 https://www.whetlab.com/
2 https://github.com/kelvinxu/arctic-captions

Figure 3.3: Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word).

3.3.3 Experiments

Dataset     Model                                       BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR
Flickr8k    Google NIC [Vinyals et al., 2014a]†Σ        63       41       27       —        —
            Log Bilinear [Kiros et al., 2014a]◦         65.6     42.4     27.7     17.7     17.31
            Soft-Attention                              67       44.8     29.9     19.5     18.93
            Hard-Attention                              67       45.7     31.4     21.3     20.30
Flickr30k   Google NIC†◦Σ                               66.3     42.3     27.7     18.3     —
            Log Bilinear                                60.0     38       25.4     17.1     16.88
            Soft-Attention                              66.7     43.4     28.8     19.1     18.49
            Hard-Attention                              66.9     43.9     29.6     19.9     18.46
COCO        CMU/MS Research [Chen and Zitnick, 2014]a   —        —        —        —        20.41
            MS Research [Fang et al., 2014]†a           —        —        —        —        20.71
            BRNN [Karpathy and Li, 2014]◦               64.2     45.1     30.4     20.3     —
            Google NIC†◦Σ                               66.6     46.1     32.9     24.6     —
            Log Bilinear◦                               70.8     48.9     34.4     24.3     20.03
            Soft-Attention                              70.7     49.2     34.4     24.3     23.90
            Hard-Attention                              71.8     50.4     35.7     25.0     23.04

Table 3.1: BLEU-1,2,3,4/METEOR metrics compared to other methods. † indicates a different split, (—) indicates an unknown metric, ◦ indicates the authors kindly provided missing metrics by personal communication, Σ indicates an ensemble, a indicates using AlexNet.

We describe our experimental methodology and quantitative results which validate the effectiveness of our model for caption generation.

Data

We report results on the widely-used Flickr8k and Flickr30k datasets as well as the more recently introduced MS COCO dataset. Each image in the Flickr8k/30k datasets has 5 reference captions. In preprocessing our COCO dataset, we maintained the same number of references between our datasets by discarding captions in excess of 5. We applied only basic tokenization to MS COCO so that it is consistent with the tokenization present in Flickr8k and Flickr30k. For all our experiments, we used a fixed vocabulary size of 10,000.

Results for our attention-based architecture are reported in Table 3.1. We report results with the frequently used BLEU metric3, which is the standard in image caption generation research. We report BLEU4 from 1 to 4 without a brevity penalty. There has been, however, criticism of BLEU, so we also report another common metric, METEOR [Denkowski and Lavie, 2014], and compare whenever possible.

Figure 3.4: Examples of mistakes where we can use attention to gain intuition into what the model saw.
Evaluation procedures

A few challenges exist for comparison, which we explain here. The first is the difference in the choice of convolutional feature extractor. For identical decoder architectures, using more recent architectures such as GoogLeNet [Szegedy et al., 2014a] or Oxford VGG [Simonyan and Zisserman, 2014a] can give a boost in performance over using AlexNet [Krizhevsky et al., 2012c]. In our evaluation, we compare directly only with results which use the comparable GoogLeNet/Oxford VGG features, but for METEOR comparison we include some results that use AlexNet.

The second challenge is single model versus ensemble comparison. While other methods have reported performance boosts by using ensembling, in our results we report single model performance.

3 We verified that our BLEU evaluation code matches the authors of Vinyals et al. [2014a], Karpathy and Li [2014] and Kiros et al. [2014b]. For fairness, we only compare against results for which we have verified that our BLEU evaluation code is the same.
4 BLEU-n is the geometric average of the n-gram precision. For instance, BLEU-1 is the unigram precision, and BLEU-2 is the geometric average of the unigram and bigram precision.

Finally, there is a challenge due to differences between dataset splits. In our reported results, we use the pre-defined splits of Flickr8k. However, for the Flickr30k and COCO datasets there is a lack of standardized splits for which results are reported. As a result, we report results with the publicly available splits5 used in previous work [Karpathy and Li, 2014]. We note, however, that the differences in splits do not make a substantial difference in overall performance.

Quantitative analysis

In Table 3.1, we provide a summary of the experiments validating the quantitative effectiveness of attention. We obtain state-of-the-art performance on Flickr8k, Flickr30k and MS COCO. In addition, we note that in our experiments we are able to significantly improve the state-of-the-art METEOR score on MS COCO. We speculate that this is connected to some of the regularization techniques we used (see Sec. 3.3.2) and our lower-level representation.

Qualitative analysis: learning to attend

By visualizing the attention learned by the model, we are able to add an extra layer of interpretability to the output of the model (see Fig. 3.1). Other systems that have done this rely on object detection systems to produce candidate alignment targets [Karpathy and Li, 2014]. Our approach is much more flexible, since the model can attend to "non-object" salient regions.

The 19-layer OxfordNet uses stacks of 3x3 filters, meaning the only times the feature maps decrease in size are at the max pooling layers. The input image is resized so that the shortest side is 256 pixels, preserving the aspect ratio. The input to the convolutional network is the center-cropped 224x224 image. Consequently, with four max pooling layers, the output of the top convolutional layer has spatial dimension 14x14. Thus, in order to visualize the attention weights for the soft model, we upsample the weights by a factor of 2^4 = 16 and apply a Gaussian filter to emulate the large receptive field size.

As we can see in Figs. 3.2 and 3.3, the model learns alignments that agree very strongly with human intuition. Especially from the examples of mistakes in Fig.
3.4, we see that it is possible to exploit such visualizations to get an intuition as to why those mistakes were made. We provide a more extensive list of visualizations in the supplementary materials for the reader.

5 http://cs.stanford.edu/people/karpathy/deepimagesent/

3.4 Generating images

Statistical natural image modelling remains a fundamental problem in computer vision and image understanding. This has motivated recent approaches in generative modelling applied to natural images by employing deep neural networks for their inference and generative components. Image generative models studied previously are often restricted to learning unconditional models of the image distribution, or models conditioned on simple structured annotations, for example classification labels. Despite the advances in generative models, learning the highly structured natural image distribution in the high dimensional pixel space alone proves to be a difficult task.

In the real world, however, images rarely appear in isolation. They are often accompanied by their unstructured textual descriptions on web pages and in books. One domain presents a substantial amount of relevant information for the other domain. The additional information from images and unstructured text descriptions can be used to simplify the image modelling task.

Figure 3.5: The image-to-caption generative model.

There are two directions in learning a generative model of image and text. One approach is to learn a text generative model conditioned on the images. A significant amount of recent work has been focused on generating captions from images [Karpathy and Li, 2015], [Xu et al., 2015], [Kiros et al., 2014d], etc. These models take an image descriptor and generate unstructured text through a recurrent decoder. By contrast, learning a generative model for image and text may also be studied by generating images that correctly interpret the text description. Generating high dimensional realistic images from their descriptions is a more difficult approach that combines two challenging components: language modelling and image generation. Namely, the model has to capture the semantic meaning expressed in the description and then use that knowledge to generate the pixel intensities of the image. Although the interesting high dimensional natural images lie on a small manifold that is difficult to capture, the additional text description cues of a target image may simplify the learning problem by focusing on the conditional distribution.

In this section, we illustrate how simple sequential deep learning techniques can be used to build an effective conditional probabilistic model over the natural image space. By using a sequence-to-sequence framework to approach the problem of image generation from unstructured natural language captions, our model iteratively draws patches on a canvas, while attending to the relevant words in the description.

3.4.1 Model architecture

Our proposed model can be viewed as an instance of the sequence-to-sequence framework [Sutskever et al., 2014a], [Cho et al., 2014b], [Srivastava et al., 2015], where captions are represented as a sequence of consecutive words and images are represented as a sequence of patches drawn on a canvas over time t = 1, ..., T. Let y be the input caption, consisting of N words y_1, y_2, ..., y_N, and let x be the image corresponding to that caption.
Language model: the bidirectional attention RNN

The input caption sentences are fed into a deterministic bidirectional LSTM that encodes the variable-size sentences into a vector representation. The bidirectional LSTM consists of a Forward LSTM and a Backward LSTM, which combine information from the past and the future respectively. The Forward LSTM computes the sequence of forward hidden states \([\overrightarrow{h}^{lang}_1, \overrightarrow{h}^{lang}_2, \ldots, \overrightarrow{h}^{lang}_N]\), whereas the Backward LSTM computes the sequence of backward hidden states \([\overleftarrow{h}^{lang}_1, \overleftarrow{h}^{lang}_2, \ldots, \overleftarrow{h}^{lang}_N]\). These hidden states are then concatenated together into the sequence \([h^{lang}_1, h^{lang}_2, \ldots, h^{lang}_N]\), where \(h^{lang}_n = [\overrightarrow{h}^{lang}_n, \overleftarrow{h}^{lang}_n]\), 1 ≤ n ≤ N.

Image model: the conditional DRAW network

The DRAW network Gregor et al. [2015c] is a sequential probabilistic model that generates images by accumulating its output at each iterative step. While the original DRAW network assumes the latent variables are independent, it has been shown in [Bachman and Precup, 2015] that model performance is improved by including dependencies between the latent variables. We extended the architecture of the DRAW network generative process to include the additional input caption from the language model described in Sec. (3.4.1).

Similarly to the original DRAW network, the conditional DRAW network is a stochastic recurrent neural network that consists of an Inference LSTM that infers the distribution of the latent variables of the image x given y, and a Generative LSTM that uses the inferred latent variables in order to reconstruct the image x given y. The align function is used to compute the alignment between the input caption and the intermediate image generative steps, as in Bahdanau et al. [2015a]. Formally, the image is generated by iteratively computing the following equations for t = 1, ..., T:
\[
\hat{x}_t = x - \sigma(c_{t-1}) \tag{3.10}
\]
\[
r_t = \mathrm{read}(x_t, \hat{x}_t, h^{dec}_{t-1}) \tag{3.11}
\]
\[
h^{enc}_t = \mathrm{LSTM}^{enc}(h^{enc}_{t-1}, [r_t, h^{dec}_{t-1}]) \tag{3.12}
\]
\[
z_t \sim q(Z_t \mid h^{enc}_t) \tag{3.13}
\]
\[
h^{dec}_t = \mathrm{LSTM}^{dec}(h^{dec}_{t-1}, z_t, s_{t-1}) \tag{3.14}
\]
\[
s_t = \mathrm{align}(h^{dec}_{t-1}, h^{lang}) \tag{3.15}
\]
\[
c_t = c_{t-1} + \mathrm{write}(h^{dec}_t) \tag{3.16}
\]
where read and write are the same attention operators as in [Gregor et al., 2015c]. Given the caption representation from the language model, \(h^{lang} = [h^{lang}_1, h^{lang}_2, \ldots, h^{lang}_N]\), the align operator computes the final sentence representation s_t through a weighted sum using alignment probabilities α_{1...N}:
\[
s_t = \mathrm{align}(h^{dec}_{t-1}, h^{lang}) = \alpha_1 h^{lang}_1 + \alpha_2 h^{lang}_2 + \ldots + \alpha_N h^{lang}_N. \tag{3.17}
\]
The corresponding alignment probabilities α_{1...N} at each step are obtained by:
\[
e_{tj} = v^{\top} \tanh(U h^{lang}_j + W h^{dec}_t + b), \tag{3.18}
\]
\[
\alpha_j = \frac{\exp(e_{tj})}{\sum_{j=1}^{N} \exp(e_{tj})}. \tag{3.19}
\]
Here \(h^{lang}_0\) is initialized to a learned bias. Setting α_{1...N} to 1/N turns the encoder into the vanilla model introduced in [Cho et al., 2014b] without attention.

3.4.2 Learning

The model is learned by a modified version of the Stochastic Gradient Variational Bayes (SGVB) algorithm introduced by Kingma and Welling [2014a]. The model is trained to maximize the lower bound L of the marginal likelihood of the correct image x given the input caption y. L is decomposed into the latent loss L_z and the reconstruction loss L_x.

The reconstruction loss L_x equals (1/L) Σ_{l=1}^{L} log p(x | y, Z_{1:T}), where L is the number of samples used during training, which was set to 1 in our experiments.
The latent loss L_z is a negative sum of Kullback-Leibler divergence terms between the distribution q(Z_t | h_t^enc) and some prior distribution p(Z_t) over time t = 1, ..., T, which can be seen as a regularization term. Since the patches drawn on the canvas over time are not independent of each other, the sufficient statistics of the prior distribution at time t should naturally depend on the sufficient statistics of the prior distribution at time t - 1. Therefore, instead of setting p(Z_1), ..., p(Z_T) to be independent unit Gaussian distributions, the mean and variance of p(Z_t) depend on h_{t-1}^dec, which forms a Markov chain p(Z_1), p(Z_2 | Z_1), ..., p(Z_T | Z_{T-1}) as in [Bachman and Precup, 2015], where
\[
\mu^{prior}_t = \tanh(W_{\mu} h^{dec}_{t-1}) \tag{3.20}
\]
\[
\sigma^{prior}_t = \exp\!\left( \tanh(W_{\sigma} h^{dec}_{t-1}) \right) \tag{3.21}
\]
Overall, the loss function L is calculated as follows:
\[
\mathcal{L} = \mathbb{E}_{Q(Z_{1:T} \mid y, x)} \left[ -\log p(x \mid y, Z_{1:T}) + \sum_{t=2}^{T} D_{KL}\!\left( q(Z_t \mid Z_{t-1}, y, x) \,\|\, p(Z_t \mid Z_{t-1}, y) \right) \right] + D_{KL}\!\left( q(Z_1 \mid y, x) \,\|\, p(Z_1 \mid y) \right). \tag{3.22}
\]
The expectation can be approximated by M Monte Carlo samples Z̃_{1:T} drawn from q(Z_{1:T} | y, x):
\[
\mathcal{L} \approx \frac{1}{M} \sum_{m=1}^{M} \left[ -\log p(x \mid y, \tilde{Z}^m_{1:T}) + \sum_{t=2}^{T} D_{KL}\!\left( q(Z_t \mid \tilde{Z}^m_{t-1}, y, x) \,\|\, p(Z_t \mid \tilde{Z}^m_{t-1}, y) \right) \right] + D_{KL}\!\left( q(Z_1 \mid y, x) \,\|\, p(Z_1 \mid y) \right). \tag{3.23}
\]

3.4.3 Generating images from captions

During the image generation step, we discard the inference network from the decoder and instead sample from the prior distribution. Due to the blurriness of samples generated by the DRAW model, we perform an additional post-processing step, in which we use the generator of an adversarial network trained on residuals of a Laplacian pyramid to sharpen the generated images, similar to [Denton et al., 2015]. By fixing the prior of the adversarial generator to the mean of the uniform distribution, it is treated as a deterministic neural network, which allows us to calculate the lower bound of the likelihood. The reconstruction loss becomes the loss between the sharpened image and the correct image, whereas the latent loss stays the same. We also noticed that sampling from the mean of the uniform distribution allowed us to generate much less noisy samples than sampling from the uniform distribution itself.

Figure 3.6: Examples of changing the color while keeping the caption fixed ("A yellow school bus parked in a parking lot." / "A red school bus parked in a parking lot." / "A green school bus parked in a parking lot." / "A blue school bus parked in a parking lot.").

3.4.4 Experiments

COCO

Microsoft COCO [Lin et al., 2014b] is the largest dataset of images annotated with captions, consisting of roughly 83k images. The rich collection of images with a variety of styles, backgrounds and objects makes the task of learning a good generative model conditioned on captions very challenging. Since some of the images have more than five captions attached to them, for consistency with related work on caption generation we disregard the extra captions. In the following subsections we present both qualitative and quantitative analyses of our model and compare its performance with that of other related generative models.

Analysis of generated images

The main goal of image generation is to learn a model that can understand the semantic meaning expressed in the descriptions of images, such as the properties of objects, the relationships between them, etc., and then use that knowledge to generate relevant images.
To verify this, we wrote a set of captions inspired by the COCO dataset and changed some words in the captions to see whether the model made the relevant changes in the generated samples.

First, we wanted to see whether the model understood one of the most basic properties of any object, its color. As shown in Figure 3.6, we generated images of school buses with four different colors: yellow, red, green and blue. Although there are images of buses with different colors in the training set, all school buses specifically are colored yellow. Despite that, the model managed to generate images of an object reminiscent of a bus painted with the relevant color.

Apart from changing the colors of objects, we were curious whether changing the background of the scene described in the caption would result in the appropriate changes in the generated samples. Changing the background in images is a somewhat harder task for a model than changing the color, due to the larger visual area in images that is taken up by the background. Nevertheless, as shown in Figure 3.7, changing the skies from blue to rainy, as well as changing the grass type from dry to green, resulted in the appropriate changes. The nearest images from the training set also indicate that the model was not simply copying the patterns it observed during the learning phase.

Figure 3.7: Examples of changing the background while keeping the caption fixed ("A very large commercial plane flying in blue skies." / "A very large commercial plane flying in rainy skies." / "A herd of elephants walking across a dry grass field." / "A herd of elephants walking across a green grass field."). The respective nearest training images based on pixelwise distance are displayed on top.

Despite an infinite number of ways of changing colors and backgrounds in descriptions, in general we found that the model made appropriate changes as long as some similar pattern was present in the training set. However, this was not always the case when changing an object itself in the description. As can be seen in Figure 3.8, when objects did not have clear fine-grained differences, such as a different shape or color, the relevant changes in the generated samples were not clearly visible. This highlights the limitation of the model in grasping a detailed understanding of each object.

Figure 3.8: Examples of changing the object while keeping the caption fixed ("The decadent chocolate desert is on the table." / "A bowl of bananas is on the table." / "A vintage photo of a cat." / "A vintage photo of a dog.").

Analysis of attention

Unfortunately, we found that there was no connection between the patches drawn on the canvas and the most attended words at each timestep. During image generation, the model mostly focused on the several words that carried the semantic meaning of the caption. The words which were most attended during all generation steps indicated the kind of scene the model would generate. For example, as shown in Figure 3.9, attending equally to the words desert and forest allowed the model to make the relevant changes in the scene, whereas in the second example, the model completely ignored the word sun and did not make any changes.

Figure 3.9: Examples of most attended words while changing the background in the caption ("A rider on a blue motorcycle in the desert." / "A rider on a blue motorcycle in the forest." / "A surfer, a woman, and a child walk on the beach." / "A surfer, a woman, and a child walk on the sun.").
Figure 3.10: Four different models (including AlignDRAW, LAPGAN and Conv. VAE) displaying samples for the caption "A group of people walk on a beach with surf boards."

Comparison with other models

The quantitative evaluation of generative models has been a subject of ambiguity in the machine learning community. In contrast to reporting classification accuracies for discriminative models, the measures used to evaluate generative models are most of the time intractable and might not reflect the real quality of the model. To obtain a better comparison between generative models, we report results on two different metrics as well as a qualitative comparison of the different models.

As you can see in Figure 3.10, we generated several samples from the prior of each of the current state-of-the-art generative models for the caption "A group of people walk on a beach with surf boards". While all of the samples look sharp, the images generated by LAPGAN look noisier and do not have a very clear structure in them, whereas the images generated by the variational models trained with an L2 cost function have a watercolor effect.

As for the quantitative comparison of the different models, we first compare the performance of the models trained with variational methods. We rank the images in the test set conditioned on the captions based on the variational lower bound of the likelihood and then report the Precision-Recall metric to evaluate the quality of the generative model. As we expected, the quality of image retrieval using generative models is worse compared to discriminative models that were specifically built for retrieval. To deal with the large computational complexity of looping through each test image, we create a shortlist of a hundred images, including the correct one, based on the convolutional features of a VGG-like model trained on the CIFAR dataset. Since there are "easy" images for which the model assigns a high likelihood independent of the query caption, we look at the ratio of the likelihood of the image conditioned on the sentence to the likelihood of the image conditioned on the mean sentence representation in the training set.

We found that the reconstruction error L_x increased for the sharpened images, which considerably hurt the retrieval results. Since sharpening changes the statistics of the images, computing a reconstruction error for each pixel is not necessarily a good metric. Instead of calculating the error per pixel, we turn to a smarter metric, the Structural Similarity Index (SSI), which incorporates luminance and contrast masking into the error calculation. Due to the strong inter-dependencies of nearby pixels, the metric is calculated on small windows of the image. To calculate SSI, we sampled fifty images from the prior of each generative model for each caption in the test set.

COCO (before sharpening)
                              Image Search                        Image Similarity
Model                         R@1    R@5    R@10   R@50   Med r   SSI
LAPGAN                        -      -      -      -      -       0.08
Fully-conn. VAE (L2 cost)     1.0    5.6    10.4   51.1   51      0.156
Conv. VAE (L2 cost)           1.0    5.9    10.9   50.8   50      0.164
Skipthought DRAW              2.0    11.2   18.9   63.3   36      0.157
Noalign DRAW                  2.8    14.1   23.1   68.0   31      0.155
Align DRAW                    3.0    14.0   22.9   68.5   31      0.156
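To make the windowed structural-similarity computation described above concrete, here is a rough sketch, not the thesis evaluation code, of the kind of per-window luminance and contrast comparison the SSI metric performs; the window size and the stabilizing constants are illustrative assumptions.

```python
# A rough sketch of a windowed structural-similarity score: luminance/contrast
# statistics are computed over small image windows and averaged. The constants
# follow the common SSIM convention for 8-bit images; window size is assumed.
import numpy as np

def windowed_ssi(img_a, img_b, win=8, c1=(0.01 * 255)**2, c2=(0.03 * 255)**2):
    """Mean structural similarity over non-overlapping win x win patches."""
    scores = []
    h, w = img_a.shape
    for i in range(0, h - win + 1, win):
        for j in range(0, w - win + 1, win):
            a = img_a[i:i + win, j:j + win].astype(np.float64)
            b = img_b[i:i + win, j:j + win].astype(np.float64)
            mu_a, mu_b = a.mean(), b.mean()
            var_a, var_b = a.var(), b.var()
            cov_ab = ((a - mu_a) * (b - mu_b)).mean()
            num = (2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)
            den = (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
            scores.append(num / den)
    return float(np.mean(scores))
```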
3.5 Summary

In this chapter, we discussed two attention-based neural networks that tackle the image and caption generation problems. Empirically, our attention-based approaches obtain state-of-the-art performance on standard vision benchmark datasets. We also show how the learned attention can be exploited to give more interpretability into the model's generation process, and demonstrate that the learned alignments correspond very well to human intuition.

Chapter 4

Stabilizing RNN training with layer normalization

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this chapter, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.

4.1 Motivation

Deep neural networks trained with some version of Stochastic Gradient Descent have been shown to substantially outperform previous approaches on various supervised learning tasks in computer vision [Krizhevsky et al., 2012e] and speech processing [Hinton et al., 2012b]. But state-of-the-art deep neural networks often require many days of training. It is possible to speed up learning by computing gradients for different subsets of the training cases on different machines or by splitting the neural network itself over many machines [Dean et al., 2012], but this can require a lot of communication and complex software. It also tends to lead to rapidly diminishing returns as the degree of parallelization increases. An orthogonal approach is to modify the computations performed in the forward pass of the neural net to make learning easier. Recently, batch normalization [Ioffe and Szegedy, 2015b] has been proposed to reduce training time by including additional normalization stages in deep neural networks. The normalization standardizes each summed input using its mean and its standard deviation across the training data. Feedforward neural networks trained using batch normalization converge faster even with simple SGD. In addition to the training time improvement, the stochasticity from the batch statistics serves as a regularizer during training. Despite its simplicity, batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer.
However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence, so applying batch normalization to RNNs appears to require different statistics for different time-steps. Furthermore, batch normalization cannot be applied to online learning tasks or to extremely large distributed models where the minibatches have to be small.

This chapter introduces layer normalization, a simple normalization method to improve the training speed for various neural network models. Unlike batch normalization, the proposed method directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer, so the normalization does not introduce any new dependencies between training cases. We show that layer normalization works well for RNNs and improves both the training time and the generalization performance of several existing RNN models.

4.2 Batch and weight normalization

Consider the d-th hidden layer in a deep feed-forward neural network, and let z_d be the vector representation of the summed inputs to the neurons in that layer. The summed inputs are computed through a linear projection with the weight matrix W_d and the bottom-up inputs a_d as follows:

z_{d,i} = w_{d,i}^\top a_d, \qquad a_{d+1,i} = f(z_{d,i} + b_{d,i})    (4.1)

where f(·) is an element-wise non-linear function, w_{d,i} is the incoming weight vector of the i-th hidden unit in the d-th layer and b_{d,i} is its scalar bias parameter. The parameters in the neural network are learnt using gradient-based optimization algorithms, with the gradients being computed by back-propagation.

One of the challenges of deep learning is that the gradients with respect to the weights in one layer are highly dependent on the outputs of the neurons in the previous layer, especially if these outputs change in a highly correlated way. Batch normalization [Ioffe and Szegedy, 2015b] was proposed to reduce such undesirable "covariate shift". The method normalizes the summed inputs to each hidden unit over the training cases. Specifically, for the i-th summed input in the d-th layer, the batch normalization method rescales the summed inputs according to their variances under the distribution of the data:

\bar{z}_{d,i} = \frac{g_{d,i}}{\sigma_{d,i}}\,(z_{d,i} - \mu_{d,i}), \qquad \mu_{d,i} = \mathbb{E}_{x \sim p(x)}[z_{d,i}], \qquad \sigma_{d,i} = \sqrt{\mathbb{E}_{x \sim p(x)}\big[(z_{d,i} - \mu_{d,i})^2\big]}    (4.2)

where \bar{z}_{d,i} is the normalized summed input to the i-th hidden unit in the d-th layer and g_{d,i} is a gain parameter scaling the normalized activation before the non-linear activation function. Note that the expectation is taken under the whole training data distribution. It is typically impractical to compute the expectations in Eq. (4.2) exactly, since this would require forward passes through the whole training dataset with the current set of weights. Instead, µ and σ are estimated using the empirical samples from the current mini-batch. This puts constraints on the size of a mini-batch and makes the method hard to apply to recurrent neural networks.
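As a point of reference for the layer normalization introduced next, the following is a minimal numpy sketch, with illustrative variable names rather than the thesis code, of Eqs. (4.1)-(4.2) with the expectations replaced by mini-batch estimates as described above.

```python
# Batch-normalized summed inputs, Eqs. (4.1)-(4.2), with mini-batch statistics.
import numpy as np

def batch_norm_layer(A, W, b, g, eps=1e-5):
    """A: (batch, in_dim) bottom-up inputs; W: (in_dim, H); b, g: (H,) per-unit bias/gain."""
    Z = A @ W                                  # summed inputs z_{d,i}, Eq. (4.1)
    mu = Z.mean(axis=0)                        # per-unit mean over the mini-batch
    sigma = np.sqrt(Z.var(axis=0) + eps)       # per-unit std over the mini-batch
    Z_bar = g * (Z - mu) / sigma               # Eq. (4.2)
    return np.maximum(Z_bar + b, 0.0)          # f(.) taken to be ReLU here
```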
4.3 Layer normalization

We now consider the layer normalization method, which is designed to overcome the drawbacks of batch normalization. Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot. This suggests the "covariate shift" problem can be reduced by fixing the mean and the variance of the summed inputs within each layer. We thus compute the layer normalization statistics over all the hidden units in the same layer as follows:

\mu_d = \frac{1}{H}\sum_{i=1}^{H} z_{d,i}, \qquad \sigma_d = \sqrt{\frac{1}{H}\sum_{i=1}^{H} (z_{d,i} - \mu_d)^2}    (4.3)

where H denotes the number of hidden units in a layer. The difference between Eq. (4.2) and Eq. (4.3) is that under layer normalization, all the hidden units in a layer share the same normalization terms µ and σ, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of a mini-batch and it can be used in the pure online regime with batch size 1.

4.3.1 Layer normalized recurrent neural networks

The recent sequence-to-sequence models [Sutskever et al., 2014b] utilize compact recurrent neural networks to solve sequential prediction problems in natural language processing. It is common among NLP tasks to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.

In a standard RNN, the summed inputs in the recurrent layer are computed from the current input x_t and the previous vector of hidden states a_{t-1} as z_t = W_{rec} a_{t-1} + W_{in} x_t. The layer normalized recurrent layer re-centers and re-scales its activations using the extra normalization terms, similar to Eq. (4.3):

a_t = f\Big[\frac{g}{\sigma_t} \odot (z_t - \mu_t) + b\Big], \qquad \mu_t = \frac{1}{H}\sum_{i=1}^{H} z_{t,i}, \qquad \sigma_t = \sqrt{\frac{1}{H}\sum_{i=1}^{H} (z_{t,i} - \mu_t)^2}    (4.4)

where W_{rec} is the recurrent hidden-to-hidden weight matrix and W_{in} is the bottom-up input-to-hidden weight matrix, ⊙ is the element-wise multiplication between two vectors, and b and g are the bias and gain parameters of the same dimension as a_t.

In a standard RNN, there is a tendency for the average magnitude of the summed inputs to the recurrent units to either grow or shrink at every time-step, leading to exploding or vanishing gradients. In a layer normalized RNN, the normalization terms make it invariant to re-scaling all of the summed inputs to a layer, which results in much more stable hidden-to-hidden dynamics.

Table 4.1: Invariance properties under the normalization methods.

              Weight matrix  Weight matrix  Weight vector  Dataset     Dataset       Single training case
              re-scaling     re-centering   re-scaling     re-scaling  re-centering  re-scaling
Batch norm    Invariant      No             Invariant      Invariant   Invariant     No
Weight norm   Invariant      No             Invariant      No          No            No
Layer norm    Invariant      Invariant      No             Invariant   No            Invariant
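To make Eq. (4.4) concrete, here is a minimal sketch of one layer-normalized recurrent step. The parameter names (W_rec, W_in, g, b) mirror the notation above, and the code is illustrative rather than the implementation used in the experiments.

```python
# One layer-normalized RNN step following Eqs. (4.3)-(4.4).
import numpy as np

def layer_norm(z, g, b, eps=1e-5):
    """Normalize a summed-input vector using its own mean/std, Eq. (4.3)."""
    mu = z.mean()
    sigma = np.sqrt(((z - mu) ** 2).mean() + eps)
    return g * (z - mu) / sigma + b

def ln_rnn_step(a_prev, x_t, W_rec, W_in, g, b):
    """Compute z_t = W_rec a_{t-1} + W_in x_t, normalize it, then apply tanh."""
    z_t = W_rec @ a_prev + W_in @ x_t
    return np.tanh(layer_norm(z_t, g, b))
```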
4.4 Related work

Batch normalization has previously been extended to recurrent neural networks [Laurent et al., 2015, Amodei et al., 2015, Cooijmans et al., 2016]. The previous work [Cooijmans et al., 2016] suggests that the best performance of recurrent batch normalization is obtained by keeping independent normalization statistics for each time-step. The authors show that initializing the gain parameter in the recurrent batch normalization layer to 0.1 makes a significant difference in the final performance of the model. Our work is also related to weight normalization [Salimans and Kingma, 2016]. In weight normalization, instead of the variance, the L2 norm of the incoming weights is used to normalize the summed inputs to a neuron. Applying either weight normalization or batch normalization using expected statistics is equivalent to a different parameterization of the original feed-forward neural network. Re-parameterization of ReLU networks was studied in Path-normalized SGD [Neyshabur et al., 2015]. Our proposed layer normalization method, however, is not a re-parameterization of the original neural network. The layer normalized model thus has different invariance properties from the other methods, which we study in the following section.

4.5 Analysis

In this section, we investigate the invariance properties of the different normalization schemes.

4.5.1 Invariance under weights and data transformations

The proposed layer normalization is related to batch normalization and weight normalization. Although their normalization scalars are computed differently, these methods can be summarized as normalizing the summed input z_i to a neuron through the two scalars µ and σ. They also learn an adaptive bias b and gain g for each neuron after the normalization:

a_i = f\Big(\frac{g_i}{\sigma_i}(z_i - \mu_i) + b_i\Big)    (4.5)

Note that for batch normalization and layer normalization, µ and σ are computed according to Eq. (4.2) and Eq. (4.3) respectively. In weight normalization, µ is 0 and σ = ||w||_2.

Table 4.1 highlights the following invariance results for the three normalization methods.

Weight re-scaling and re-centering: First, observe that under batch normalization and weight normalization, any re-scaling of the incoming weights w_i of a single neuron has no effect on the normalized summed inputs to that neuron. To be precise, under batch and weight normalization, if the weight vector is scaled by δ, the two scalars µ and σ will also be scaled by δ. The normalized summed inputs stay the same before and after scaling, so batch and weight normalization are invariant to re-scaling of the weights. Layer normalization, on the other hand, is not invariant to the individual scaling of single weight vectors. Instead, layer normalization is invariant to scaling of the entire weight matrix and invariant to a shift of all of the incoming weights in the weight matrix. Let there be two sets of model parameters θ, θ' whose weight matrices W and W' differ by a scaling factor δ and where all of the incoming weights in W' are also shifted by a constant vector γ, that is W' = δW + 1γ^T. Under layer normalization, the two models effectively compute the same output:

a' = f\Big(\frac{g}{\sigma'}(W'x - \mu') + b\Big) = f\Big(\frac{g}{\sigma'}\big((\delta W + \mathbf{1}\gamma^\top)x - \mu'\big) + b\Big) = f\Big(\frac{g}{\sigma}(Wx - \mu) + b\Big) = a.    (4.6)

Notice that if the normalization is only applied to the input before the weights, the model will not be invariant to re-scaling and re-centering of the weights.

Data re-scaling and re-centering: We can show that all the normalization methods are invariant to re-scaling the dataset by verifying that the summed inputs of the neurons stay constant under the change. Furthermore, layer normalization is invariant to re-scaling of individual training cases, because the normalization scalars µ and σ in Eq. (4.3) only depend on the current input data. Let x' be a new data point obtained by re-scaling x by δ. Then we have

a_i' = f\Big(\frac{g_i}{\sigma'}\big(w_i^\top x' - \mu'\big) + b_i\Big) = f\Big(\frac{g_i}{\delta\sigma}\big(\delta w_i^\top x - \delta\mu\big) + b_i\Big) = a_i.    (4.7)

It is easy to see that re-scaling individual data points does not change the model's prediction under layer normalization. Similar to the re-centering of the weight matrix in layer normalization, we can also show that batch normalization is invariant to re-centering of the dataset.
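The weight-matrix invariance in Eq. (4.6) is easy to check numerically. The small script below is illustrative, not from the thesis: it scales the whole weight matrix by δ and shifts every incoming weight by a vector γ, and confirms that the layer-normalized output is unchanged.

```python
# Numerical check of Eq. (4.6): layer norm is invariant to W' = delta*W + 1*gamma^T.
import numpy as np

def ln_layer(W, x, g, b, eps=1e-12):
    z = W @ x
    mu = z.mean()
    sigma = np.sqrt(((z - mu) ** 2).mean() + eps)
    return np.tanh(g * (z - mu) / sigma + b)

rng = np.random.default_rng(0)
H, D = 5, 4
W = rng.normal(size=(H, D)); x = rng.normal(size=D)
g = rng.normal(size=H); b = rng.normal(size=H)

delta, gamma = 3.7, rng.normal(size=D)
W_prime = delta * W + np.outer(np.ones(H), gamma)   # W' = delta*W + 1*gamma^T

print(np.allclose(ln_layer(W, x, g, b), ln_layer(W_prime, x, g, b)))  # True
```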
4.5.2 Geometry of parameter space during learning

We have investigated the invariance of the model's prediction under re-centering and re-scaling of the parameters. Learning, however, can behave very differently under different parameterizations, even though the models express the same underlying function. In this section, we analyze learning behavior through the geometry and the manifold of the parameter space. We show that the normalization scalar σ can implicitly reduce the learning rate and make learning more stable.

Riemannian metric

The learnable parameters in a statistical model form a smooth manifold that consists of all possible input-output relations of the model. For models whose output is a probability distribution, a natural way to measure the separation of two points on this manifold is the Kullback-Leibler divergence between their model output distributions. Under the KL divergence metric, the parameter space is a Riemannian manifold. The curvature of a Riemannian manifold is entirely captured by its Riemannian metric, whose quadratic form is denoted ds^2; this is the infinitesimal distance in the tangent space at a point in the parameter space. Intuitively, it measures the change in the model output induced by moving in the parameter space along a tangent direction. The Riemannian metric under KL was previously studied [Amari, 1998] and was shown to be well approximated under a second order Taylor expansion using the Fisher information matrix:

ds^2 = D_{KL}\big[p(y \mid x; \theta)\,\|\,p(y \mid x; \theta + \delta)\big] \approx \frac{1}{2}\,\delta^\top F(\theta)\,\delta,    (4.8)

F(\theta) = \mathbb{E}_{x \sim p(x),\, y \sim p(y \mid x)}\left[\frac{\partial \log p(y \mid x; \theta)}{\partial \theta}\,\frac{\partial \log p(y \mid x; \theta)}{\partial \theta}^\top\right],    (4.9)

where δ is a small change to the parameters. The Riemannian metric above presents a geometric view of parameter spaces. The following analysis of the Riemannian metric provides some insight into how normalization methods could help in training neural networks.
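As an illustration of Eq. (4.8), the following sketch, a toy logistic-regression GLM not taken from the thesis, compares the average KL divergence between the output distributions at θ and θ + δ with the quadratic form ½ δ^T F(θ) δ; for small δ the two quantities agree closely.

```python
# Numerical illustration of ds^2 ~= 0.5 * delta^T F(theta) delta, Eq. (4.8).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))          # inputs drawn from p(x)
theta = rng.normal(size=3)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fisher(theta):
    # For Bernoulli outputs, E_y[grad grad^T] = p(1-p) x x^T for each input x.
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X)

def avg_kl(theta_a, theta_b):
    pa, pb = sigmoid(X @ theta_a), sigmoid(X @ theta_b)
    kl = pa * np.log(pa / pb) + (1 - pa) * np.log((1 - pa) / (1 - pb))
    return kl.mean()

delta = 1e-2 * rng.normal(size=3)
print(avg_kl(theta, theta + delta))            # true average KL
print(0.5 * delta @ fisher(theta) @ delta)     # quadratic approximation
```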
The geometry of normalized generalized linear models

We focus our geometric analysis on the generalized linear model. The results from the following analysis can be easily applied to understanding deep neural networks with a block-diagonal approximation to the Fisher information matrix, where each block corresponds to the parameters of a single neuron.

A generalized linear model (GLM) can be regarded as parameterizing an output distribution from the exponential family using a weight vector w and a bias scalar b. To be consistent with the previous sections, the log likelihood of the GLM can be written using the summed input a as follows:

\log p(y \mid x; w, b) = \frac{(a + b)y - \eta(a + b)}{\phi} + c(y, \phi),    (4.10)

\mathbb{E}[y \mid x] = f(a + b) = f(w^\top x + b), \qquad \mathrm{Var}[y \mid x] = \phi f'(a + b),    (4.11)

where f(·) is the transfer function that is the analog of the non-linearity in neural networks, f'(·) is the derivative of the transfer function, η(·) is a real-valued function and c(·) is the log partition function. φ is a constant that scales the output variance. Assume an H-dimensional output vector y = [y_1, y_2, ..., y_H] is modeled using H independent GLMs and \log p(y \mid x; W, b) = \sum_{i=1}^{H} \log p(y_i \mid x; w_i, b_i). Let W be the weight matrix whose rows are the weight vectors of the individual GLMs, let b denote the bias vector of length H and let vec(·) denote the Kronecker vector operator. The Fisher information matrix for the multi-dimensional GLM with respect to its parameters \theta = [w_1^\top, b_1, \dots, w_H^\top, b_H]^\top = \mathrm{vec}([W, b]^\top) is simply the expected Kronecker product of the data features and the output covariance matrix:

F(\theta) = \mathbb{E}_{x \sim P(x)}\left[\frac{\mathrm{Cov}[y \mid x]}{\phi^2} \otimes \begin{bmatrix} xx^\top & x \\ x^\top & 1 \end{bmatrix}\right].    (4.12)

We obtain normalized GLMs by applying the normalization methods to the summed inputs a in the original model through µ and σ. Without loss of generality, we denote by \bar{F} the Fisher information matrix under the normalized multi-dimensional GLM with the additional gain parameters \theta = \mathrm{vec}([W, b, g]^\top):

\bar{F}(\theta) = \begin{bmatrix} \bar{F}_{11} & \cdots & \bar{F}_{1H} \\ \vdots & \ddots & \vdots \\ \bar{F}_{H1} & \cdots & \bar{F}_{HH} \end{bmatrix}, \qquad \bar{F}_{ij} = \mathbb{E}_{x \sim P(x)}\left[\frac{\mathrm{Cov}[y_i, y_j \mid x]}{\phi^2} \begin{bmatrix} \frac{g_i g_j}{\sigma_i \sigma_j}\chi_i \chi_j^\top & \frac{g_i}{\sigma_i}\chi_i & \frac{g_i (a_j - \mu_j)}{\sigma_i \sigma_j}\chi_i \\ \frac{g_j}{\sigma_j}\chi_j^\top & 1 & \frac{a_j - \mu_j}{\sigma_j} \\ \frac{g_j (a_i - \mu_i)}{\sigma_i \sigma_j}\chi_j^\top & \frac{a_i - \mu_i}{\sigma_i} & \frac{(a_i - \mu_i)(a_j - \mu_j)}{\sigma_i \sigma_j} \end{bmatrix}\right]    (4.13)

\chi_i = x - \frac{\partial \mu_i}{\partial w_i} - \frac{a_i - \mu_i}{\sigma_i}\frac{\partial \sigma_i}{\partial w_i}.    (4.14)

Implicit learning rate reduction through the growth of the weight vector: Notice that, compared to the standard GLM, the block \bar{F}_{ij} along the weight vector w_i direction is scaled by the gain parameters and the normalization scalar σ_i. If the norm of the weight vector w_i grows twice as large, even though the model's output remains the same, the Fisher information matrix will be different. The curvature along the w_i direction will change by a factor of ½ because σ_i will also be twice as large. As a result, for the same parameter update in the normalized model, the norm of the weight vector effectively controls the learning rate for that weight vector. During learning, it is harder to change the orientation of a weight vector with a large norm. The normalization methods therefore have an implicit "early stopping" effect on the weight vectors and help to stabilize learning towards convergence.

Learning the magnitude of incoming weights: In normalized models, the magnitude of the incoming weights is explicitly parameterized by the gain parameters. We compare how the model output changes between updating the gain parameters in the normalized GLM and updating the magnitude of the equivalent weights under the original parameterization during learning. The direction along the gain parameters in \bar{F} captures the geometry of the magnitude of the incoming weights. We show that the Riemannian metric along the magnitude of the incoming weights for the standard GLM is scaled by the norm of its input, whereas learning the gain parameters for the batch normalized and layer normalized models depends only on the magnitude of the prediction error. Learning the magnitude of incoming weights in the normalized model is therefore more robust to the scaling of the input and its parameters than in the standard model.
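The "implicit learning rate reduction" argument above can also be checked numerically: doubling the weights of a layer-normalized layer leaves the output, and hence the loss, unchanged, but the gradient with respect to the weights shrinks by the same factor, so a fixed-size update rotates a large-norm weight vector less. A small illustrative script, not from the thesis:

```python
# Doubling W leaves the layer-normalized output unchanged but halves the
# gradient norm with respect to W (checked here with finite differences).
import numpy as np

def ln_out(W, x, g, b):
    z = W @ x
    mu, sigma = z.mean(), z.std()
    return np.tanh(g * (z - mu) / sigma + b)

def loss(W, x, g, b, target):
    return 0.5 * np.sum((ln_out(W, x, g, b) - target) ** 2)

def num_grad(W, x, g, b, target, eps=1e-6):
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps; Wm[idx] -= eps
        grad[idx] = (loss(Wp, x, g, b, target) - loss(Wm, x, g, b, target)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3)); x = rng.normal(size=3)
g = rng.normal(size=4); b = rng.normal(size=4); target = rng.normal(size=4)

g1 = num_grad(W, x, g, b, target)
g2 = num_grad(2.0 * W, x, g, b, target)
print(np.linalg.norm(g2) / np.linalg.norm(g1))   # approximately 0.5
```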
4.6 Experimental results

We perform experiments with layer normalization on 6 tasks, with a focus on recurrent neural networks: image-sentence ranking, question-answering, contextual language modelling, generative modelling, handwriting sequence generation and MNIST classification. Unless otherwise noted, the default initialization of layer normalization in the experiments is to set the adaptive gains to 1 and the biases to 0.

4.6.1 Order embeddings of images and language

In this experiment, we apply layer normalization to the recently proposed order-embeddings model of Vendrov et al. [2016] for learning a joint embedding space of images and sentences. We follow the same experimental protocol as Vendrov et al. [2016] and modify their publicly available code¹ to incorporate layer normalization; the code utilizes Theano [Team et al., 2016]. Images and sentences from the Microsoft COCO dataset [Lin et al., 2014a] are embedded into a common vector space, where a GRU [Cho et al., 2014a] is used to encode sentences and the outputs of a pre-trained VGG ConvNet [Simonyan and Zisserman, 2014b] (10-crop) are used to encode images. The order-embedding model represents images and sentences as a 2-level partial ordering and replaces the cosine similarity scoring function used in Kiros et al. [2014c] with an asymmetric one; an illustrative sketch of such a score is given at the end of this subsection.

We trained two models: the baseline order-embedding model as well as the same model with layer normalization applied to the GRU. After every 300 iterations, we compute Recall@K (R@K) values on a held-out validation set and save the model whenever R@K improves. The best performing models are then evaluated on 5 separate test sets, each containing 1000 images and 5000 captions, for which the mean results are reported. Both models use Adam [Kingma and Ba, 2014a] with the same initial hyperparameters and both models are trained using the same architectural choices as in Vendrov et al. [2016].

Figure 4.1 illustrates the validation curves of the models, with and without layer normalization. We plot R@1, R@5 and R@10 for the image retrieval task. We observe that layer normalization offers a per-iteration speedup across all metrics and converges to its best validation model in 60% of the time it takes the baseline model to do so. In Table 4.2, the test set results are reported, from which we observe that layer normalization also results in improved generalization over the original model. The results we report are state-of-the-art for RNN embedding models, with only the structure-preserving model of Wang et al. [2016] reporting better results on this task. However, they evaluate under different conditions (1 test set instead of the mean over 5) and are thus not directly comparable.

Figure 4.1: Recall@K curves using order-embeddings with and without layer normalization. Panels (a) Recall@1, (b) Recall@5 and (c) Recall@10 show the mean recall on image retrieval (validation) against iteration x 300.

Table 4.2: Average results across 5 test splits for caption and image retrieval on MSCOCO. R@K is Recall@K (high is good). Mean r is the mean rank (low is good). Sym corresponds to the symmetric baseline while OE indicates order-embeddings.

                              Caption Retrieval                 Image Retrieval
Model                         R@1    R@5    R@10   Mean r       R@1    R@5    R@10   Mean r
Sym [Vendrov et al., 2016]    45.4   -      88.7   5.8          36.3   -      85.8   9.0
OE [Vendrov et al., 2016]     46.7   -      88.9   5.7          37.9   -      85.9   8.1
OE (ours)                     46.6   79.3   89.1   5.2          37.8   73.6   85.7   7.9
OE + LN                       48.5   80.6   89.8   5.1          38.9   74.3   86.3   7.6

¹ https://github.com/ivendrov/order-embedding
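For illustration only, the following sketch shows the general shape of an asymmetric order-violation score of the kind used in place of cosine similarity. The exact argument order and ordering convention follow Vendrov et al. [2016] and are an assumption here, not a reproduction of the thesis code; the point is only that the score is not symmetric in its two arguments.

```python
# Illustrative asymmetric order-violation score (assumed convention, see lead-in).
import numpy as np

def order_violation(u, v):
    """Penalty for v failing to lie "below" u in the coordinate-wise partial order."""
    return float(np.sum(np.maximum(0.0, v - u) ** 2))

def score(caption_emb, image_emb):
    """Asymmetric similarity: negative order violation (higher is better)."""
    return -order_violation(caption_emb, image_emb)
```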
Figure 4.2: Validation curves for the attentive reader model (validation error rate against training steps in thousands, for the LSTM baseline, BN-LSTM, BN-everywhere and LN-LSTM). BN results are taken from [Cooijmans et al., 2016].

4.6.2 Teaching machines to read and comprehend

In order to compare layer normalization to the recently proposed recurrent batch normalization [Cooijmans et al., 2016], we train a unidirectional attentive reader model on the CNN corpus, both introduced by Hermann et al. [2015]. This is a question-answering task where a query description about a passage must be answered by filling in a blank. The data is anonymized such that entities are given randomized tokens to prevent degenerate solutions, and these tokens are consistently permuted during training and evaluation. We follow the same experimental protocol as Cooijmans et al. [2016] and modify their public code² to incorporate layer normalization; the code uses Theano [Team et al., 2016]. We obtained the pre-processed dataset used by Cooijmans et al. [2016], which differs from the original experiments of Hermann et al. [2015] in that each passage is limited to 4 sentences. In Cooijmans et al. [2016], two variants of recurrent batch normalization are used: one where BN is only applied to the LSTM while the other applies BN everywhere throughout the model. In our experiment, we only apply layer normalization within the LSTM; a sketch of what a layer normalized LSTM step can look like is given at the end of this subsection.

The results of this experiment are shown in Figure 4.2. We observe that layer normalization not only trains faster but converges to a better validation result than both the baseline and the BN variants. In Cooijmans et al. [2016], it is argued that the scale parameter in BN must be carefully chosen and is set to 0.1 in their experiments. We experimented with layer normalization with both 1.0 and 0.1 scale initialization and found that the former model performed significantly better. This demonstrates that layer normalization is not sensitive to the initial scale in the same way that recurrent BN is.³

² https://github.com/cooijmanstim/Attentive_reader/tree/bn
³ We only produce results on the validation set, as in the case of Cooijmans et al. [2016].
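Below is a hedged sketch of one layer-normalized LSTM step. The placement of the normalization, applied separately to the recurrent and input projections and to the cell state before the output non-linearity, is one common choice and is assumed here; it is not necessarily the exact placement used in the experiments above.

```python
# A sketch of one layer-normalized LSTM step (normalization placement assumed).
import numpy as np

def layer_norm(z, g, b, eps=1e-5):
    mu = z.mean()
    sigma = np.sqrt(((z - mu) ** 2).mean() + eps)
    return g * (z - mu) / sigma + b

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def ln_lstm_step(x_t, h_prev, c_prev, p):
    """p holds W_x (4H x D), W_h (4H x H), bias b (4H,), and LN gains/biases:
    g_x, b_x, g_h, b_h of size 4H and g_c, b_c of size H."""
    z = layer_norm(p["W_x"] @ x_t, p["g_x"], p["b_x"]) \
        + layer_norm(p["W_h"] @ h_prev, p["g_h"], p["b_h"]) + p["b"]
    i, f, o, g_in = np.split(z, 4)                      # gate pre-activations
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g_in)
    h_t = sigmoid(o) * np.tanh(layer_norm(c_t, p["g_c"], p["b_c"]))
    return h_t, c_t
```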
4.6.3 Skip-thought vectors

Skip-thoughts [Kiros et al., 2015] is a generalization of the skip-gram model [Mikolov et al., 2013] for learning unsupervised distributed sentence representations. Given contiguous text, a sentence is encoded with an encoder RNN and decoder RNNs are used to predict the surrounding sentences. Kiros et al. [2015] showed that this model could produce generic sentence representations that perform well on several tasks without being fine-tuned. However, training this model is time-consuming, requiring several days of training in order to produce meaningful results. In this experiment we determine to what extent layer normalization can speed up training. Using the publicly available code of Kiros et al. [2015]⁴, we train two models on the BookCorpus dataset [Zhu et al., 2015]: one with and one without layer normalization. These experiments are performed with Theano [Team et al., 2016]. We adhere to the experimental setup used in Kiros et al. [2015], training a 2400-dimensional sentence encoder with the same hyperparameters. Given the size of the states used, it is conceivable that layer normalization would produce slower per-iteration updates than the baseline. However, we found that, provided CNMeM⁵ is used, there was no significant difference between the two models. We checkpoint both models after every 50,000 iterations and evaluate their performance on five tasks: semantic relatedness (SICK) [Marelli et al., 2014], movie review sentiment (MR) [Pang and Lee, 2005], customer product reviews (CR) [Hu and Liu, 2004], subjectivity/objectivity classification (SUBJ) [Pang and Lee, 2004] and opinion polarity (MPQA) [Wiebe et al., 2005]. We plot the performance of both models at each checkpoint on all tasks to determine whether performance can be improved with LN.

The experimental results are illustrated in Figure 4.3. We observe that applying layer normalization results both in a speedup over the baseline as well as better final results after 1M iterations, as shown in Table 4.3. We also let the model with layer normalization train for a total of a month, resulting in further performance gains across all but one task.

Figure 4.3: Performance of skip-thought vectors with and without layer normalization on downstream tasks as a function of training iterations. Panels: (a) SICK(r), (b) SICK(MSE), (c) MR, (d) CR, (e) SUBJ, (f) MPQA, each plotted against iteration x 50000. The Original lines are the reported results in [Kiros et al., 2015]. Plots with error bars use 10-fold cross validation. Best seen in color.

Table 4.3: Skip-thoughts results. The first two evaluation columns indicate Pearson and Spearman correlation, the third is mean squared error and the remaining columns indicate classification accuracy. Higher is better for all evaluations except MSE. Our models were trained for 1M iterations, with the exception of (†), which was trained for 1 month (approximately 1.7M iterations).

Method                          SICK(r)  SICK(ρ)  SICK(MSE)  MR    CR    SUBJ  MPQA
Original [Kiros et al., 2015]   0.848    0.778    0.287      75.5  79.3  92.1  86.9
Ours                            0.842    0.767    0.298      77.3  81.8  92.6  87.9
Ours + LN                       0.854    0.785    0.277      79.5  82.6  93.4  89.0
Ours + LN (†)                   0.858    0.788    0.270      79.4  83.1  93.7  89.3

⁴ https://github.com/ryankiros/skip-thoughts
⁵ https://github.com/NVIDIA/cnmem
We note that the performance differences between the originally reported results and ours are likely due to the fact that the publicly available code does not condition at each timestep of the decoder, whereas the original model does.

4.6.4 Modeling binarized MNIST using DRAW

We also experimented with generative modelling on the MNIST dataset. The Deep Recurrent Attention Writer (DRAW) [Gregor et al., 2015a] has previously achieved state-of-the-art performance on modeling the distribution of MNIST digits. The model uses a differentiable attention mechanism and a recurrent neural network to sequentially generate pieces of an image. We evaluate the effect of layer normalization on a DRAW model using 64 glimpses and 256 LSTM hidden units. The model is trained with the default settings of the Adam [Kingma and Ba, 2014a] optimizer and a minibatch size of 128. Previous publications on binarized MNIST have used various training protocols to generate their datasets. In this experiment, we used the fixed binarization from Larochelle and Murray [2011]. The dataset has been split into 50,000 training, 10,000 validation and 10,000 test images.

Figure 4.4: DRAW model test negative log likelihood with and without layer normalization (test variational bound against epoch, for the baseline, WN and LN models).

Figure 4.4 shows the test variational bound for the first 100 epochs. It highlights the speedup benefit of applying layer normalization: the layer normalized DRAW converges almost twice as fast as the baseline model. After 200 epochs, the baseline model converges to a variational log likelihood of 82.36 nats on the test data and the layer normalization model obtains 82.09 nats.

4.6.5 Handwriting sequence generation

The previous experiments mostly examine RNNs on NLP tasks whose sequence lengths are in the range of 10 to 40. To show the effectiveness of layer normalization on longer sequences, we performed handwriting generation tasks using the IAM Online Handwriting Database [Liwicki and Bunke, 2005]. IAM-OnDB consists of handwritten lines collected from 221 different writers. Given an input character string, the goal is to predict the sequence of x and y pen co-ordinates of the corresponding handwriting line on the whiteboard. There are, in total, 12179 handwriting line sequences. The input string is typically more than 25 characters long and the average handwriting line has a length of around 700. We used the same model architecture as in Section (5.2) of Graves [2013b]. The model architecture consists of three hidden layers of 400 LSTM cells, which produce 20 bivariate Gaussian mixture components at the output layer, and a size 3 input layer. The character sequence was encoded with one-hot