
Learning to Attend with Neural Networks

by

Lei (Jimmy) Ba

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical & Computer Engineering

© Copyright 2020 by Lei (Jimmy) Ba

Abstract

Learning to Attend with Neural Networks

Lei (Jimmy) Ba
Doctor of Philosophy
Graduate Department of Electrical & Computer Engineering
University of Toronto
2020

As more computational resources become widely available, artificial intelligence and machine learning researchers design ever larger and more complicated neural networks to learn from millions of data points.

Although traditional convolutional neural networks (CNNs) can achieve superhuman accuracy in object recognition tasks, they brute-force the problem by scanning over every location in the input images with the same fidelity. This thesis introduces a new class of neural networks inspired by the human visual system. Unlike CNNs, which process the entire image at once into the current hidden layer, attention allows for salient features to dynamically come to the forefront as needed. The ability to attend is especially important when there is a lot of clutter in a scene. However, learning attention-based neural networks poses some challenges to current machine learning techniques: What information should the neural network “pay attention” to? Where does the network store its sequences of “glimpses”? Can our learning algorithms do better than simple “trial-and-error”?

To address these computational questions, we first describe a novel recurrent visual attention model in the context of variational inference. Because the standard REINFORCE or trial-and-error algorithm can be slow due to its high-variance gradient estimates, we show that a re-weighted wake-sleep objective can improve training performance. We also demonstrate that the visual attention models outperform the previous state-of-the-art methods based on CNNs in image and caption generation tasks. Furthermore, we discuss how the visual attention mechanism can improve the working memory of recurrent neural networks (RNNs) through a novel form of self-attention. The second half of the thesis focuses on gradient-based learning algorithms. We develop a new first-order optimization algorithm to overcome the slow convergence of stochastic gradient descent in RNNs and attention-based models.

Finally, we explore the benefits of applying second-order optimization methods to training neural networks.

Acknowledgements

I would like to thank my advisors: Geoffrey Hinton, Brendan Frey and Ruslan Salakhutdinov. This PhD thesis would not have been possible without the support of these amazing mentors. I am extremely grateful to Geoff for being the most caring supervisor I could ask for. His mathematical intuition, insight into neural networks and enthusiasm for the highest research standards inspired many ideas in this thesis. I am fortunate to have worked with an incredible group of colleagues at the University of Toronto Machine Learning group: Roger Grosse, Jamie Kiros, James Martens, Kevin Swersky, Ilya Sutskever, Kelvin Xu, Hui Yuan Xiong, Chris Maddison. Vlad Mnih and Rich Caruana provided me with an outstanding and fruitful research internship experience outside of academia, and I owe them a debt of gratitude. Among the many who lent help along the way, I give my dearest thanks to my parents for their unwavering support.

Contents

1 Introduction
   1.1 What are neural networks and why do we need attention?
   1.2 Overview
   1.3 Neural networks
   1.4 Convolutional neural networks
   1.5 Recurrent neural networks
   1.6 Learning
      1.6.1 Maximum likelihood estimation and Kullback-Leibler divergence
      1.6.2 Regularization
      1.6.3 Gradient descent

2 Deep recurrent visual attention
   2.1 Motivation
   2.2 Learning where and what
   2.3 Variational lower bound objective
      2.3.1 Maximize the variational lower bound
      2.3.2 Multi-object/Sequential classification as a visual attention task
      2.3.3 Comparison with CNN
      2.3.4 Discussion
   2.4 Improved learning with re-weighted wake-sleep objective
      2.4.1 Wake-Sleep recurrent attention model
      2.4.2 An improved lower bound on the log-likelihood
      2.4.3 Training an inference network
      2.4.4 Control variates
      2.4.5 Encouraging exploration
      2.4.6 Experiments
   2.5 Summary

3 Generating image (and) captions with visual attention
   3.1 Problem definition
   3.2 Related work
   3.3 Image Caption Generation with Attention Mechanism
      3.3.1 Model details
      3.3.2 Learning stochastic “hard” vs deterministic “soft” Attention
      3.3.3 Experiments
   3.4 Generating images
      3.4.1 Model architecture
      3.4.2 Learning
      3.4.3 Generating images from captions
      3.4.4 Experiments
   3.5 Summary

4 Stabilizing RNN training with layer normalization
   4.1 Motivation
   4.2 Batch and weight normalization
   4.3 Layer normalization
      4.3.1 Layer normalized recurrent neural networks
   4.4 Related work
   4.5 Analysis
      4.5.1 Invariance under weights and data transformations
      4.5.2 Geometry of parameter space during learning
   4.6 Experimental results
      4.6.1 Order embeddings of images and language
      4.6.2 Teaching machines to read and comprehend
      4.6.3 Skip-thought vectors
      4.6.4 Modeling binarized MNIST using DRAW
      4.6.5 Handwriting sequence generation
      4.6.6 Permutation invariant MNIST
      4.6.7 Convolutional Networks
   4.7 Summary

5 Self-attention to the recent past using fast weights
   5.1 Motivation
   5.2 Evidence from physiology that temporary memory may not be stored as neural activities
   5.3 Fast Associative Memory
      5.3.1 Layer normalized fast weights
      5.3.2 Implementing the fast weights “inner loop” in biological neural networks
   5.4 Experimental results
      5.4.1 Associative retrieval
      5.4.2 Integrating glimpses in visual attention models
      5.4.3 Facial expression recognition
      5.4.4 Agents with memory
   5.5 Summary

6 Accelerating learning using Adaptive Moment methods
   6.1 Motivation
   6.2 Algorithm
      6.2.1 Adam’s update rule
   6.3 Initialization bias correction
   6.4 Convergence analysis
      6.4.1 Convergence proof
   6.5 Related work
   6.6 Experiments
      6.6.1 Logistic regression
      6.6.2 Multi-layer neural networks
      6.6.3 Convolutional neural networks
      6.6.4 Bias-correction term
   6.7 Extensions
      6.7.1 AdaMax
      6.7.2 Temporal averaging
   6.8 Summary

7 Scale up learning with the distributed natural gradient methods
   7.1 Motivation
   7.2 Fisher information matrix and natural gradient
      7.2.1 Kronecker factored approximate Fisher
      7.2.2 Approximate natural gradient using K-FAC
      7.2.3 Related works
   7.3 Distributed Optimization using K-FAC
      7.3.1 Asynchronous Fisher block inversion
      7.3.2 Asynchronous statistics computation
   7.4 Doubly-factored Kronecker approximation for large convolution layers
      7.4.1 Factored Tikhonov damping for the double-factored Kronecker approximation
   7.5 Step size selection
      7.5.1 Experimental evaluation of the step-size selection method of Section 7.5
   7.6 Automatic construction of the K-FAC computation graph
   7.7 Experiments
      7.7.1 CIFAR-10 classification and asynchronous Fisher block inversion
      7.7.2 ImageNet classification
      7.7.3 A cheaper Kronecker factor approximation for convolution layers
   7.8 Summary

8 Conclusion
   8.1 Attention-based neural networks
   8.2 Stochastic optimization
   8.3 Future directions
      8.3.1 Variance reduction and learning
      8.3.2 Beyond maximum likelihood learning

Bibliography

Relationship to Prior Work

The chapters in this thesis describe work that has been published in the following conferences:

• Chapter 2: Multiple object recognition with visual attention. Ba, J., Mnih, V., & Kavukcuoglu, K. (2015). International conference on learning representations.

• Chapter 2: Learning wake-sleep recurrent attention models. Ba, J., Salakhutdinov, R. R., Grosse, R. B., & Frey, B. J. (2015). Advances in neural information processing systems (pp. 2593–2601).

• Chapter 3: Show, attend and tell: neural image caption generation with visual attention. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Bengio, Y. (2015). International conference on machine learning (pp. 2048–2057).

• Chapter 3: Generating images from captions with attention. Mansimov, E., Parisotto, E., Ba, J. L., & Salakhutdinov, R. (2015). International conference on learning representations.

• Chapter 4: Layer normalization. Ba, J., Kiros, J. R., & Hinton, G. E. (2016). Advances in neural information processing systems symposium.

• Chapter 5: Using fast weights to attend to the recent past. Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., & Ionescu, C. (2016). Advances in neural information processing systems (pp. 4331–4339).

• Chapter 6: Adam: a method for stochastic optimization. Kingma, D., & Ba, J. (2015). International conference on learning representations.

• Chapter 7: Distributed second-order optimization using Kronecker-factored approximations. Ba, J., Grosse, R., & Martens, J. (2016). International conference on learning representations.

Chapter 1

Introduction

1.1 What are neural networks and why do we need attention?

In recent years, researchers have tackled problems in computer vision, speech recognition, and natural language processing by using deep learning methods [Hinton et al., 2012b, Krizhevsky and Hinton, 2009, Sutskever et al., 2014b] that learn powerful feature detectors directly from inputs with little or no pre-processing. Deep learning avoids the time-consuming process of designing features by hand, and as datasets get larger it can discover better features with no additional human effort. Many deep learning systems use feed-forward neural networks of many layers. While they have been very successful, state-of-the-art deep neural networks can be computationally expensive: training these models often takes weeks even when parallelizing over many machines. The high runtime cost of these models at test time is a further problem for many real-time applications on smart phones or wearable devices. As more computational resources become available, artificial intelligence and machine learning researchers train ever larger neural networks using millions of data points. Many of these systems use the largest neural network that can conveniently fit into a modern computer to exhaustively process the entire input all at once. A convolutional neural network (CNN) may recognize thousands of objects with superhuman accuracy, but standard CNNs are computationally clumsy and expensive because they examine all image locations at the same level of detail. Such brute-force learning approaches have yielded impressive results thus far, but for tasks such as learning from the rapidly growing amount of video data on YouTube, a less brute-force approach should be far more effective. One of the most curious facets of the human visual system is the presence of attention [Rensink, 2000, Corbetta and Shulman, 2002]. Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. The human visual system converts a high-dimensional visual scene into a sequence of glimpses by using intelligently selected fixation points. This dynamically allocates computational resources to the more informative parts of the input, and internal, covert attention amplifies this effect. One may, for example, spend a few minutes translating a long French paragraph to English by going back and forth between the ambiguous sentences in the source document. The same person in an unfamiliar train station can quickly find out which way to go by only glancing at the useful signs for a fraction of a second. Inspired by the human visual system, this thesis explores the topic of learning neural networks that can integrate and retrieve information by intelligent sampling.


1.2 Overview

Much of the recent work on neural networks focuses on sequence modeling using recurrent neural networks (RNNs). The applications of these models to machine translation, speech recognition, and language modeling have shown promising results in practice. Despite their success, training RNNs is often unstable and slow. The traditional RNN architectures also fail to learn long-term dependencies among their input sequences. We are interested in how neural networks can integrate information over long sequences using attention mechanisms. We take inspiration from the way humans perform sequence recognition tasks such as reading: by continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to our internal representation of the sequence. The traditional sequence-to-sequence models have turned out to be challenging to train for these tasks. In this thesis, we describe methods that address the difficulty of training RNNs on problems with complicated long-range temporal structure. There are two major themes in this thesis: modeling and learning. For modeling, our contribution is in developing novel attention-based neural networks for object recognition tasks and sequence generation tasks in both computer vision and natural language processing. For learning, our contribution is in developing new optimization algorithms to address the challenges in learning attention-based neural networks.

Outline of Thesis. In Chapter 1, we describe many popular techniques used in neural network research and deep learning. The topics can be divided into neural network architectures and traditional learning algorithms. This will serve as a foundation on which we can place the contributions of this thesis. This chapter also introduces the detailed terminology and notation that will be used throughout the thesis. Chapter 2 begins the discussion of a new attention-based neural network architecture for vision tasks. In particular, we describe an extension to the recurrent visual attention model [Mnih et al., 2014b] so that it can solve multi-object recognition tasks. We also present a new learning objective derived from variational inference, and establish a formal connection between REINFORCE and variational inference. Chapter 3 then discusses two variants of visual attention applied to caption and image generation tasks. We found our model outperformed all previous caption generation models at the time. In Chapter 4, we investigate the unstable training dynamics of attention-based RNNs on very long sequences and discuss the learning challenges present in these models. We describe a new normalization technique that addresses this challenge by normalizing groups of hidden neurons to have the same mean and standard deviation at each time step of the sequence. Building upon this, in Chapter 5 we apply the normalization technique to learn a “self-attention” neural network that can attend to its past computation. This form of “self-attention” can be used to store temporary memories of the recent past. We show this new form of temporary storage is very helpful in sequence-to-sequence models. In Chapter 6, we discuss the learning algorithms themselves by presenting a new stochastic optimization algorithm, “Adam”, for training neural networks. We analyze the convergence properties of the algorithm and discuss recent developments in stochastic optimization.
Our numerical experiments demonstrate that “Adam” can speed up convergence in learning various attention-based neural networks. Chapter 7 continues the discussion of learning algorithms for neural networks using second-order optimization techniques. We develop a novel distributed optimizer that scales the K-FAC natural gradient algorithm to train state-of-the-art deep learning models with tens of millions of parameters. Our experiments show distributed K-FAC can speed up convergence linearly with respect to the size of the mini-batch.

1.3 Neural networks

In the rest of this chapter, we give the background on neural networks and optimization that will make this thesis self-contained. We first define the basic notation for neural networks that we will use for the rest of the thesis. A feed-forward neural network or multilayer perceptron (MLP) [Rumelhart et al., 1986] is the most common neural network architecture and consists of layers of simple neuron-like processing units. Such a network of artificial neurons maps an input vector x to an output y. The neuron-like processing units compute a weighted sum of their inputs using a set of incoming weights w and pass them through an activation function f. Namely, the output of a neuron is given by:

$z = \sum_i w_i x_i + b, \qquad a = f(z), \qquad (1.1)$

where we denote the neuron activations before and after the nonlinear function as $z$ and $a$. An additional bias scalar $b$ is included in the weighted sum, which helps learning. In a deep feed-forward neural network with $d$ layers, the computation can be expressed in terms of matrix-vector operations as follows:

$z_1 = W_1 x + b_1, \qquad a_1 = f(z_1), \qquad (1.2)$

$z_2 = W_2 a_1 + b_2, \qquad a_2 = f(z_2), \qquad (1.3)$

$\cdots$

$y = W_d a_{d-1} + b_d. \qquad (1.4)$

We use the subscript to index the layers. Note that there are many element-wise nonlinear activation functions to choose from. In the past, logistic activation functions, such as the sigmoid or tanh, were the most popular options. Krizhevsky et al. [2012e] showed that rectified linear unit (ReLU) nonlinearities could be highly successful for computer vision tasks and proved faster to train than the standard sigmoid units.

$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathrm{ReLU}(z) = \max(0, z). \qquad (1.5)$

The neural networks in the rest of the thesis are assumed to have the ReLU activation function unless otherwise stated.
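To make the notation concrete, the following is a minimal NumPy sketch of the forward pass in Eqs. 1.2–1.4; the layer sizes and random initialization are illustrative rather than taken from the thesis.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases):
    """Forward pass of Eqs. 1.2-1.4: ReLU hidden layers, linear output."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)                  # z_d = W_d a_{d-1} + b_d, a_d = f(z_d)
    return weights[-1] @ a + biases[-1]      # y = W_d a_{d-1} + b_d

# Illustrative sizes: 784-dim input, two hidden layers of 256 units, 10 outputs.
rng = np.random.default_rng(0)
sizes = [784, 256, 256, 10]
weights = [rng.normal(0.0, 0.01, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.normal(size=784), weights, biases)
```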

1.4 Convolutional neural networks

Although many prior works have shown that feed-forward neural networks can express any function given enough hidden units, more complex neural network architectures are often preferred in practice because of their expressivity and better generalization. One such example is the convolutional neural network (CNN) [Rumelhart et al., 1986, LeCun et al., 1990, Krizhevsky and Hinton, 2009]. The idea of convolutional filters, which are small local filter banks applied to the entire image, has been explored in many classical works on image processing and computer vision tasks. CNNs build such prior knowledge into the neural network architecture. Unlike fully connected neural networks, the weights in CNNs are heavily constrained and local. Each neuron only processes a local source of information from the output of the previous layer. The incoming weights of the convolutional neurons, or the receptive fields, are also shared across spatial locations. These receptive fields act like feature detectors looking for a particular pattern anywhere on the input images. The outputs of a convolutional layer are therefore called feature maps, which are computed as:

$a_d = f(W_d * a_{d-1} + b_d), \qquad (1.6)$

where $*$ is the convolution operator. Both the weight matrices and the biases are shared across receptive fields. Weight sharing not only greatly reduces the number of free parameters in the network, but also encodes prior knowledge about image processing and local pattern recognition for vision tasks.
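As an illustration of Eq. 1.6, here is a minimal NumPy sketch that computes a single “valid” feature map with a shared filter; the loop-based implementation and the patch sizes are illustrative only.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """One feature map of Eq. 1.6: slide a shared local filter over the image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same weights apply at every location: shared receptive field.
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel) + bias
    return np.maximum(0.0, out)   # a_d = f(W_d * a_{d-1} + b_d) with ReLU

feature_map = conv2d_valid(np.random.rand(28, 28), np.random.randn(5, 5))
```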

1.5 Recurrent neural networks

Another important class of neural network architectures is recurrent neural networks (RNNs), which are a temporal generalization of classical feed-forward neural networks. RNNs map an input sequence to an output sequence as a nonlinear dynamical system. The network updates its hidden activations from the current input and the activations from the previous timestep using the bottom-up and the recurrent weights respectively. The hidden-to-hidden recurrent connections allow RNNs to aggregate information over time. The recurrent computation also enables the possibility of storing long-term dependencies between the inputs in its hidden activations. Formally, a standard one-layer RNN mapping of an input sequence $\{x_1, \cdots, x_T\}$ to an output sequence $\{y_1, \cdots, y_T\}$ is defined as:

$a_1 = f(W_{\text{in}} x_1 + b_{\text{rec}}), \qquad (1.7)$

$a_t = f(W_{\text{in}} x_t + W_{\text{rec}} a_{t-1} + b_{\text{rec}}), \qquad (1.8)$

$y_t = W_{\text{out}} a_t + b_{\text{out}}, \qquad (1.9)$

where $W_{\text{in}}$, $W_{\text{rec}}$ and $W_{\text{out}}$ are respectively the input, recurrent and output weights shared across the $T$ timesteps. This RNN is analogous to a feed-forward neural network with $T$ layers, except that the weights are shared between the layers. The weight sharing in RNNs allows the same model to process sequences of any number of timesteps and remember information from the past. However, it also makes the recurrent activations grow in magnitude as the input sequence gets longer. The standard logistic activation functions, such as sigmoid or tanh, tend to saturate over longer timesteps, whereas ReLU leads to exploding hidden activations. In other words, any small change at the beginning of the input sequence can cause exploding activations many timesteps later under ReLU. Le et al. [2015] point out that if the recurrent weights are initialized close to a scaled identity matrix, we can partially alleviate this problem and successfully train ReLU RNNs over thousands of timesteps. Arjovsky et al. [2016] later generalized the identity matrix result to scaled orthonormal matrices, where all the eigenvalues of the recurrent weights are close to one.
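The following is a minimal NumPy sketch of the recurrence in Eqs. 1.7–1.9; tanh is used as the activation $f$ here, and the near-identity recurrent initialization is only a nod to Le et al. [2015], with all sizes chosen for illustration.

```python
import numpy as np

def rnn_forward(xs, W_in, W_rec, W_out, b_rec, b_out):
    """Eqs. 1.7-1.9: a one-layer RNN unrolled over T timesteps.
    With a_0 = 0, Eq. 1.7 is the t = 1 special case of Eq. 1.8."""
    a = np.zeros(W_rec.shape[0])
    ys = []
    for x in xs:
        a = np.tanh(W_in @ x + W_rec @ a + b_rec)   # shared weights at every step
        ys.append(W_out @ a + b_out)                # y_t = W_out a_t + b_out
    return ys

# Illustrative sizes: 10-dim inputs, 32 hidden units, 4 outputs, T = 20 steps.
rng = np.random.default_rng(1)
ys = rnn_forward([rng.normal(size=10) for _ in range(20)],
                 rng.normal(0.0, 0.1, (32, 10)), 0.9 * np.eye(32),
                 rng.normal(0.0, 0.1, (4, 32)), np.zeros(32), np.zeros(4))
```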

Figure 1.1: An LSTM cell; lines with bolded squares imply projections with a learnt weight vector. Each cell learns how to weigh its input components (input gate), while learning how to modulate that contribution to the memory (input modulator). It also learns weights which erase the memory cell (forget gate), and weights which control how this memory should be emitted (output gate).

However, these initialization schemes require hyperparameter tuning and are sensitive to changes of scale in the input sequences. Orthogonal to the weight initializations, Hochreiter and Schmidhuber [1997b] address the exploding activation problem by introducing a new RNN activation function, the long short-term memory (LSTM) unit. LSTMs have a set of gating units that control the information flow in the RNN. These gates turn on or off to update a set of linear memory neurons.

1.6 Learning

Neural networks can be used as parametric models for many standard statistical learning tasks such as regression and classification. In statistical learning, the goal is to estimate a set of parameters from a given training dataset. For simplicity of notation, we will lump all the neural network weights into a parameter vector $\theta = [\mathrm{vec}\{W_1\}^T, \mathrm{vec}\{W_2\}^T, \cdots]^T$, where $\mathrm{vec}$ is the vectorization operator that converts a matrix or tensor to a column vector. Given a dataset of $N$ input and target pairs $\mathcal{D}_{\text{train}} = \{(x^{(n)}, t^{(n)})\}_{n=1}^N$, we can measure the performance of the neural network under the current weights according to a loss function, as the averaged loss over the training examples:

$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} l^{(n)}(\theta) = \frac{1}{N} \sum_{n=1}^{N} l((x^{(n)}, t^{(n)}), \theta), \qquad (1.10)$

where $l^{(n)}$ is the loss on each training example and $\mathcal{L}$ denotes the averaged loss over the whole training set. The above averaged loss measured on the training set is also known as the empirical risk. There is a wide variety of loss functions studied in the field of machine learning. In this thesis, we choose to focus on the following two loss functions, which appear in many regression and classification problems of practical importance: the mean squared error,

$l_{\text{MSE}}((x^{(n)}, t^{(n)}), \theta) = \frac{1}{2} \left\| y^{(n)} - t^{(n)} \right\|_2^2, \qquad (1.11)$

and the multi-class cross-entropy loss,

$l_{\text{CE}}((x^{(n)}, t^{(n)}), \theta) = -\sum_i t_i^{(n)} \log p_i^{(n)}, \qquad p = \mathrm{softmax}(y), \qquad (1.12)$

where $\mathrm{softmax}(z) = \left[ \frac{\exp(z_1)}{\sum_j \exp(z_j)}, \frac{\exp(z_2)}{\sum_j \exp(z_j)}, \cdots \right]$ is a generalization of the element-wise logistic function. Because both loss functions are smooth, they combine well with the optimization-based learning procedures used for neural networks. Given a neural network architecture, the learning problem involves searching for a set of weights that minimizes the chosen loss function:

$\theta^* = \arg\min_\theta \mathcal{L}(\theta) \qquad (1.13)$

In the case of a single neuron, the above minimization can be solved using a set of linear algebraic equations or convex programming. In general, obtaining the optimal set of weights in neural network learning is intractable due to the highly non-convex loss function $\mathcal{L}$ in the weight space.
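As a concrete check of these definitions, here is a minimal NumPy sketch of the two losses in Eqs. 1.11–1.12; the max-subtraction inside the softmax is a standard numerical-stability trick, and the sample inputs are illustrative.

```python
import numpy as np

def mse_loss(y, t):
    """Eq. 1.11: mean squared error for one example."""
    return 0.5 * np.sum((y - t) ** 2)

def softmax(z):
    z = z - np.max(z)             # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(y, t_onehot):
    """Eq. 1.12: multi-class cross-entropy on the network outputs y."""
    p = softmax(y)
    return -np.sum(t_onehot * np.log(p))

logits = np.array([2.0, 0.5, -1.0])
target = np.array([1.0, 0.0, 0.0])
print(mse_loss(logits, target), cross_entropy_loss(logits, target))
```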

1.6.1 Maximum likelihood estimation and Kullback-Leibler divergence

Loss minimization considers neural network learning from an optimization perspective; a different perspective on Eq. 1.13 is grounded in statistical inference. The neural network outputs can be viewed as defining a conditional distribution for the targets t, given the input vector x. In most applications, we are usually not interested in the distribution of the inputs themselves, but rather in predicting the possible values of the targets given a user-chosen input. In a regression problem, we may model the real-valued targets as a conditional Gaussian distribution with a mean equal to the output of the neural network and an identity covariance matrix. The log-likelihood of such a model can be written as:

$\log p(t^{(n)} \mid x^{(n)}, \theta) = -\frac{1}{2} \sum_i \left[ (y_i^{(n)} - t_i^{(n)})^2 + \log 2\pi \right]. \qquad (1.14)$

For classification problems, the conditional probability of the input belonging to the $c$th class can be modeled as a multinomial distribution,

$\log p(t^{(n)} = c \mid x^{(n)}, \theta) = \log p_c^{(n)}, \qquad p = \mathrm{softmax}(y). \qquad (1.15)$

It is easy to see that the above log-likelihoods simply correspond to the negative loss functions we defined earlier for regression and classification. Therefore, we can directly think of neural network learning as a maximum likelihood estimation (MLE) problem that maximizes the likelihood of the conditional probability defined by the outputs of the network, where the estimates are the learnable weights. Thus we can argue that a learning algorithm that optimizes these loss functions shares many desired properties with MLE methods, such as statistical consistency and efficiency guarantees. One may also be tempted to minimize the difference between the neural network’s output distribution and the empirical target distribution. The Kullback-Leibler (KL) divergence is a natural choice to measure the difference between two distributions,

$\mathbb{E}_{x \sim \mathcal{D}_{\text{train}}} \left[ \mathrm{KL}\left( p_{\text{data}}(t \mid x) \,\|\, p(t \mid x, \theta) \right) \right] = -\frac{1}{N} \sum_n \mathbb{E}_{t^{(n)} \sim p_{\text{data}}} \left[ \log p(t^{(n)} \mid x^{(n)}, \theta) \right] + \text{const.} \qquad (1.16)$

Thus, minimizing the KL divergence between the data distribution and the network’s output distribution is equivalent to MLE.
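A toy numerical check of this equivalence, under the assumption of a single input with a soft target distribution: the KL divergence in Eq. 1.16 differs from the expected negative log-likelihood only by the data entropy, which is constant in the model parameters.

```python
import numpy as np

# Toy check of Eq. 1.16: KL(p_data || p_model) = NLL - H(p_data).
p_data = np.array([0.7, 0.2, 0.1])           # empirical target distribution
p_model = np.array([0.5, 0.3, 0.2])          # network's predictive distribution

kl = np.sum(p_data * np.log(p_data / p_model))
nll = -np.sum(p_data * np.log(p_model))      # expected negative log-likelihood
entropy = -np.sum(p_data * np.log(p_data))   # independent of the parameters
assert np.isclose(kl, nll - entropy)
```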

1.6.2 Regularization

It is a common belief in neural network design to prefer more expressive models with many layers. These models typically have more parameters than the number of training examples. This belief is justified by the increasing complexity of the datasets and the fast-growing parallel computation resources. In such an overparameterized regime, it is easy for the learning algorithm to discover a network that overfits: the model achieves zero training loss but fails to generalize to unseen test data. Thus, the loss functions are usually modified with an additional constraint to limit the capacity of the neural network in order to prevent overfitting. The constraint could be imposed on the architecture via weight sharing, as in the case of CNNs, or could take the form of a data-independent regularization term in the modified loss function. For example, weight decay, the sum of the $\ell_2$ norms of the weight matrices, is a simple yet effective regularizer,

$\mathcal{L}(\theta) = \mathcal{L}_{\text{CE/MSE}}(\theta) + \mathcal{R}(\theta), \qquad \mathcal{R}(\theta) = \lambda \|\theta\|_2^2, \qquad (1.17)$

where $\lambda$ is the weight decay coefficient. Even with regularization terms, very wide and deep neural networks are still vulnerable to overfitting. Dropout [Hinton et al., 2012a] is an effective technique for avoiding co-adaptation of the neurons, thus reducing overfitting in neural networks. During training, each neuron has an independent probability $p$ of being dropped from the computation,

$z_d = W_d a_{d-1} + b_d, \qquad a_d = M_d \odot f(z_d), \qquad M_d \sim \mathrm{Bern}(1 - p), \qquad (1.18)$

where $M_d$ is a binary mask drawn i.i.d. from a Bernoulli distribution indicating which neurons are kept. Dropout can also be viewed as training an ensemble of neural nets, each with a different connectivity pattern but with the weights shared among all the ensemble members. At test time, a multiplier of $1 - p$ is applied to the hidden activations to correct for the neurons missing during training.
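A minimal sketch of Eq. 1.18 and the test-time correction, assuming the mask is resampled on every training pass (the shapes and example activations are illustrative):

```python
import numpy as np

def dropout_layer(a, p, training, rng=np.random.default_rng()):
    """Eq. 1.18: drop each neuron with probability p during training;
    scale activations by (1 - p) at test time to correct for missing units."""
    if training:
        mask = rng.random(a.shape) >= p   # keep each unit with probability 1 - p
        return mask * a
    return (1.0 - p) * a

h = np.ones(8)
print(dropout_layer(h, p=0.5, training=True))   # roughly half the units zeroed
print(dropout_layer(h, p=0.5, training=False))  # deterministic, scaled output
```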

1.6.3 Gradient descent

In practice, finding the weights and biases that minimize the given loss function is done using gradient-based optimization methods. Gradient-based learning rules for neural networks were given different names in the past [Rumelhart et al., 1986, LeCun et al., 1990]. Starting from a set of randomly initialized weights, simple gradient-based learning algorithms change the weights in the negative gradient direction of the loss function with respect to the weights,

$g \triangleq \frac{\partial \mathcal{L}}{\partial \theta}, \qquad \mathcal{G}_W \triangleq \frac{\partial \mathcal{L}}{\partial W}, \qquad (1.19)$

$\Delta W = -\eta \, \mathcal{G}_W, \qquad \Delta\theta = [\mathrm{vec}\{\Delta W_1\}^T, \mathrm{vec}\{\Delta W_2\}^T, \cdots]^T, \qquad (1.20)$

$\theta \leftarrow \theta + \Delta\theta, \qquad (1.21)$

where $\eta$ is the learning rate or step size. We use $\mathcal{G}_W$ as a shorthand notation for the gradient of the loss with respect to a particular weight matrix. At each iteration, the weights “descend” along the gradient direction. There are typically many local optima in the weight space, and for typical neural networks there are no global convergence guarantees for these gradient descent algorithms. Despite this, the solutions discovered in practice often perform very well. The computation cost of the gradient updates $\mathcal{G}_W$ grows linearly with the dataset size. A typical computer vision dataset contains hundreds of thousands of training examples, which makes computing the full gradient update expensive. To address this problem, a stochastic approximation variant of gradient descent, the stochastic gradient descent (SGD) algorithm, computes an estimate of the gradient on a small mini-batch randomly sampled from the full training set. SGD is an unbiased approximation of the full gradient, and the variance of the estimate is inversely proportional to the mini-batch size,

$\mathcal{G}_W = \frac{1}{|B|} \sum_{i \in B} \frac{\partial l^{(i)}}{\partial W}, \qquad B \subset \{1, 2, \cdots, N\}, \qquad (1.22)$

$\mathbb{E}[\mathcal{G}_W] = \frac{\partial \mathcal{L}}{\partial W}, \qquad \mathrm{Var}[\mathcal{G}_W] = \frac{1}{|B|} \mathrm{Var}\left[ \frac{\partial l^{(i)}}{\partial W} \right], \qquad (1.23)$

where $B$ is a uniformly sampled subset of the training set. Using a larger mini-batch size tends to work better, and the update computation can take advantage of the parallelism of modern parallel computing hardware. Most of the learning algorithms for neural networks rely on computing the gradient $\mathcal{G}_W$ in the inner loop. Rumelhart et al. [1986] provide an efficient algorithm to compute these gradients that “backpropagates” the difference between the network’s prediction and the target through the layers of the neural network, from the output layer back to the inputs. In general, any neural network can be expressed as a computation graph. The backpropagation algorithm is a special case of reverse-mode automatic differentiation on the computation graph.
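A minimal sketch of the SGD loop in Eqs. 1.19–1.22, fitting a toy one-parameter model; the gradient function, learning rate and batch size are illustrative choices, not values from the thesis.

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.1, batch_size=32, steps=1000,
        rng=np.random.default_rng()):
    """Eqs. 1.19-1.22: descend along a mini-batch estimate of the gradient."""
    N = len(data)
    for _ in range(steps):
        batch = rng.choice(N, size=batch_size, replace=False)   # B ⊂ {1..N}
        grad = sum(grad_fn(params, data[i]) for i in batch) / batch_size
        params = params - lr * grad                             # θ ← θ - η G_W
    return params

# Toy usage: fit a scalar mean by minimizing 0.5 (θ - x)² over the samples.
data = np.random.default_rng(2).normal(3.0, 1.0, size=1000)
theta = sgd(np.zeros(1), lambda th, x: th - x, data)
print(theta)   # ≈ 3.0, the sample mean
```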

Chapter 2

Deep recurrent visual attention

2.1 Motivation

Convolutional neural networks have recently been very successful on a variety of recognition and classification tasks [Krizhevsky et al., 2012b, Goodfellow et al., 2013, Jaderberg et al., 2014, Vinyals et al., 2014b, Karpathy and Fei-Fei, 2014]. One of the main drawbacks of convolutional networks (ConvNets) is their poor scalability with increasing input image size, so efficient implementations of these models on multiple GPUs [Krizhevsky et al., 2012b] or even spanning multiple machines [Dean et al., 2012] have become necessary. Applications of ConvNets to multi-object and sequence recognition from images have avoided working with big images and instead focused on using ConvNets for recognizing characters or short sequence segments from image patches containing reasonably tightly cropped instances [Goodfellow et al., 2013, Jaderberg et al., 2014]. Applying such a recognizer to large images containing uncropped instances requires integrating it with a separately trained sequence detector or a bottom-up proposal generator. Non-maximum suppression is often performed to obtain the final detections. While combining separate components trained using different objective functions has been shown to be worse than end-to-end training of a single system in other domains, integrating object localization and recognition into a single globally-trainable architecture has been difficult. In this chapter, we take inspiration from the way humans perform visual sequence recognition tasks such as reading: continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to our internal representation of the sequence. Our proposed system is a deep recurrent neural network that at each step processes a multi-resolution crop of the input image, called a glimpse. The network uses information from the glimpse to update its internal representation of the input, and outputs the next glimpse location and possibly the next object in the sequence. The process continues until the model decides that there are no more objects to process. We show how the proposed system can be trained end-to-end by approximately maximizing a variational lower bound on the label sequence log-likelihood. This training procedure can be used to train the model to both localize and recognize multiple objects purely from label sequences. We evaluate the model on the task of transcribing multi-digit house numbers from publicly available Google Street View imagery. Our attention-based model outperforms the state-of-the-art ConvNets on tightly cropped inputs while using both fewer parameters and much less computation. We also show

that our model outperforms ConvNets by a much larger margin in the more realistic setting of larger and less tightly cropped input sequences.

Figure 2.1: The deep recurrent attention model.

2.2 Learning where and what

For simplicity, we first describe how our model can be applied to classifying a single object and later show how it can be extended to multiple objects. Processing an image $x$ with an attention-based model is a sequential process with $N$ steps, where each step consists of a saccade followed by a glimpse. At each step $n$, the model receives a location $l_n$ along with a glimpse observation $x_n$ taken at location $l_n$.

The model uses the observation to update its internal state and outputs the location $l_{n+1}$ to process at the next time-step. Usually the number of pixels in the glimpse $x_n$ is much smaller than the number of pixels in the original image $x$, making the computational cost of processing a single glimpse independent of the size of the image. A graphical representation of our model is shown in Figure 2.1. The model can be broken down into a number of sub-components, each mapping some input into a vector output. We will use the term “network” to describe these non-linear sub-components since they are typically multi-layered neural networks.

Glimpse network: The glimpse network is a non-linear function that receives the current input image patch, or glimpse, $x_n$ and its location tuple $l_n$, where $l_n$ is a 2-dimensional vector representing the x- and y-coordinate of the patch, and outputs a vector $g_n$. The job of the glimpse network is to extract a set of useful features from location $l_n$ of the raw visual input. We will use $G_{\text{image}}(x_n \mid W_{\text{image}})$ to denote the output vector of the function $G_{\text{image}}(\cdot)$ that takes an image patch $x_n$ and is parameterized by weights $W_{\text{image}}$. $G_{\text{image}}(\cdot)$ typically consists of three convolutional hidden layers without any pooling layers, followed by a fully connected layer. Separately, the location tuple is mapped by $G_{\text{loc}}(l_n \mid W_{\text{loc}})$ using a fully connected hidden layer, where both $G_{\text{image}}(x_n \mid W_{\text{image}})$ and $G_{\text{loc}}(l_n \mid W_{\text{loc}})$ have the same dimension. We combine the high-bandwidth image information with the low-bandwidth location tuple by multiplying the two vectors element-wise to get the final glimpse feature vector $g_n$,

$g_n = G_{\text{image}}(x_n \mid W_{\text{image}}) \odot G_{\text{loc}}(l_n \mid W_{\text{loc}}), \qquad (2.1)$

where $\odot$ denotes the element-wise product.

This type of multiplicative interaction between “what” and “where” was initially proposed by Larochelle and Hinton [2010b].
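A minimal sketch of the multiplicative “what × where” combination in Eq. 2.1; in the thesis $G_{\text{image}}$ is a small ConvNet and $G_{\text{loc}}$ a fully connected layer, but here a single linear layer followed by ReLU stands in for each, with illustrative dimensions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def glimpse_feature(x_patch, loc, W_image, W_loc):
    """Eq. 2.1: combine 'what' and 'where' by an element-wise product.
    One linear + ReLU layer stands in for each sub-network here."""
    g_what = relu(W_image @ x_patch.ravel())   # G_image: features of the patch
    g_where = relu(W_loc @ loc)                # G_loc: embedding of the (x, y) tuple
    return g_what * g_where                    # element-wise product -> g_n

rng = np.random.default_rng(3)
g = glimpse_feature(rng.random((12, 12)), np.array([0.1, -0.3]),
                    rng.normal(0.0, 0.1, (256, 144)),
                    rng.normal(0.0, 0.1, (256, 2)))
```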

Recurrent network: The recurrent network aggregates information extracted from the individual glimpses and combines the information in a coherent manner that preserves spatial information. The glimpse feature vector $g_n$ from the glimpse network is supplied as input to the recurrent network at each time step. The recurrent network consists of two recurrent layers with non-linear function $R_{\text{recur}}$. We define the two outputs of the recurrent layers as $r^{(1)}$ and $r^{(2)}$:

$r_n^{(1)} = R_{\text{recur}}(g_n, r_{n-1}^{(1)} \mid W_{r1}) \quad \text{and} \quad r_n^{(2)} = R_{\text{recur}}(r_n^{(1)}, r_{n-1}^{(2)} \mid W_{r2}) \qquad (2.2)$

We use Long Short-Term Memory units [Hochreiter and Schmidhuber, 1997c] for the non-linearity $R_{\text{recur}}$ because of their ability to learn long-range dependencies and stable learning dynamics.

Emission network: The emission network takes the current state of the recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states from the recurrent network. It consists of a fully connected hidden layer that maps the feature vector $r_n^{(2)}$ from the top recurrent layer to a coordinate tuple $\hat{l}_{n+1}$:

$\hat{l}_{n+1} = E(r_n^{(2)} \mid W_e) \qquad (2.3)$

Context network: The context network provides the initial state for the recurrent network, and its output is used by the emission network to predict the location of the first glimpse. The context network $C(\cdot)$ takes a down-sampled, low-resolution version $I_{\text{coarse}}$ of the whole input image and outputs a fixed-length vector $c_I$. The contextual information provides sensible hints about where the potentially interesting regions are in a given image. The context network employs three convolutional layers that map the coarse image $I_{\text{coarse}}$ to a feature vector used as the initial state of the top recurrent layer $r^{(2)}$ in the recurrent network. The bottom layer $r^{(1)}$, however, is initialized with a vector of zeros for reasons we will explain later.

Classification network: The classification network outputs a prediction for the class label $y$ based on the final feature vector $r_N^{(1)}$ of the lower recurrent layer. The classification network has one fully connected hidden layer and a softmax output layer for the class $y$:

$P(y \mid I) = O(r_N^{(1)} \mid W_o) \qquad (2.4)$

Ideally, the deep recurrent attention model should learn to look at locations that are relevant for classifying objects of interest. The existence of the contextual information, however, provides a “short cut” solution such that it is much easier for the model to learn from contextual information than by combining information from different glimpses. We prevent such undesirable behavior by connecting the context network and classification network to different recurrent layers in our deep model. As a result, the contextual information cannot be used directly by the classification network and only affects the sequence of glimpse locations produced by the model.

2.3 Variational lower bound objective

2.3.1 Maximize the variational lower bound

Given the class labels $y$ of an input image $\mathcal{I}$, we can formulate learning as a supervised classification problem with the cross-entropy objective function. The attention model predicts the class label conditioned on intermediate latent location variables $l$ from each glimpse and extracts the corresponding patches. Let $\theta = [\mathrm{vec}\{W_{\text{image}}\}^\top, \mathrm{vec}\{W_{\text{loc}}\}^\top, \mathrm{vec}\{W_{r1}\}^\top, \mathrm{vec}\{W_{r2}\}^\top, \mathrm{vec}\{W_e\}^\top, \mathrm{vec}\{W_o\}^\top]^\top$ denote the concatenated model parameters. We can then maximize the likelihood of the class label by marginalizing over the glimpse locations: $\log p(y \mid \mathcal{I}, \theta) = \log \sum_l p(l \mid \mathcal{I}, \theta) \, p(y \mid l, \mathcal{I}, \theta)$.

The marginalized objective function can be learned by optimizing its variational free energy lower bound $\mathcal{F}$:

$\log \sum_l p(l \mid \mathcal{I}, \theta) \, p(y \mid l, \mathcal{I}, \theta) \geq \sum_l p(l \mid \mathcal{I}, \theta) \log p(y, l \mid \mathcal{I}, \theta) + H[l] \qquad (2.5)$

$= \sum_l p(l \mid \mathcal{I}, \theta) \log p(y \mid l, \mathcal{I}, \theta), \qquad (2.6)$

where the entropy term $H[l]$ cancels the $\log p(l \mid \mathcal{I}, \theta)$ contribution inside $\log p(y, l \mid \mathcal{I}, \theta)$, leaving (2.6).

The learning rule to update the model parameters θ follows the gradient of the above free energy:

$\frac{\partial \mathcal{F}}{\partial \theta} = \sum_l p(l \mid \mathcal{I}, \theta) \frac{\partial \log p(y \mid l, \mathcal{I}, \theta)}{\partial \theta} + \sum_l \log p(y \mid l, \mathcal{I}, \theta) \frac{\partial p(l \mid \mathcal{I}, \theta)}{\partial \theta} \qquad (2.7)$

$= \sum_l p(l \mid \mathcal{I}, \theta) \left[ \frac{\partial \log p(y \mid l, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid l, \mathcal{I}, \theta) \frac{\partial \log p(l \mid \mathcal{I}, \theta)}{\partial \theta} \right] \qquad (2.8)$

It is intractable to evaluate the exponentially many glimpse location sequences during training, so the summation in equation 2.8 is approximated using Monte Carlo samples:

$\tilde{l}_n^m \sim p(l_n \mid \mathcal{I}, \theta) = \mathcal{N}(l_n; \hat{l}_n, \Sigma), \qquad (2.9)$

$\frac{\partial \mathcal{F}}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{\partial \log p(y \mid \tilde{l}^m, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid \tilde{l}^m, \mathcal{I}, \theta) \frac{\partial \log p(\tilde{l}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right] \qquad (2.10)$

Equation 2.10 gives a practical algorithm for training the deep attention model. Namely, we can sample the glimpse location prediction from the model after each glimpse. The samples are then used in standard backpropagation to obtain an estimator of the gradient with respect to the model parameters. Notice that the log-likelihood $\log p(y \mid \tilde{l}^m, \mathcal{I}, \theta)$ has an unbounded range, which can introduce substantial variance into the gradient estimator. Especially when the sampled location is off from the object in the image, the log-likelihood will induce an undesirably large gradient update that is backpropagated through the rest of the model.

We can reduce the variance of the estimator 2.10 by replacing $\log p(y \mid \tilde{l}^m, \mathcal{I}, \theta)$ with a 0/1 discrete indicator function $R$ and using the baseline technique from Mnih et al. [2014b]:

$R = \begin{cases} 1 & \text{if } y = \arg\max_{y'} \log p(y' \mid \tilde{l}^m, \mathcal{I}, \theta) \\ 0 & \text{otherwise} \end{cases} \qquad (2.11)$

$b_n = E_{\text{baseline}}(r_n^{(2)} \mid W_{\text{baseline}}) \qquad (2.12)$

As shown, the recurrent network state vector $r_n^{(2)}$ is used to estimate a state-based baseline $b$ for each glimpse, which significantly improves learning efficiency. The baseline effectively centers the random variable $R$ and can be learned by regressing towards the expected value of $R$. Given both the indicator function and the baseline, we have the following gradient update:

$\frac{\partial \mathcal{F}}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{\partial \log p(y \mid \tilde{l}^m, \mathcal{I}, \theta)}{\partial \theta} + \lambda (R - b) \frac{\partial \log p(\tilde{l}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right], \qquad (2.13)$

where the hyper-parameter $\lambda$ balances the scale of the two gradient components. In fact, by using the 0/1 indicator function, the learning rule from equation 2.13 is equivalent to the REINFORCE [Williams, 1992b] learning rule employed in Mnih et al. [2014b] for training their attention model. When viewed as a reinforcement learning update, the second term in equation 2.13 is an unbiased estimate of the gradient, with respect to $W$, of the expected reward $R$ under the model’s glimpse policy. Here we have shown that this learning rule can also be motivated by approximately optimizing the free energy. During inference, the feed-forward location prediction can be used as a deterministic prediction of the location coordinates from which to extract the next input image patch for the model; the model then behaves as a normal feed-forward network. Alternatively, our marginalized objective function, equation 2.5, suggests a procedure to estimate the expected class prediction by using samples of location sequences $\{\tilde{l}_1^m, \cdots, \tilde{l}_N^m\}$ and averaging their predictions,

$\mathbb{E}_l[p(y \mid I)] \approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid \mathcal{I}, \tilde{l}^m). \qquad (2.14)$

This allows the attention model to be evaluated multiple times on each image with the classification predictions being averaged. In practice, we found that averaging the log probabilities gave the best performance.
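The following sketch shows the shape of the hybrid estimator in Eq. 2.13. The per-sample gradients here are random placeholders standing in for quantities that backpropagation through the classifier and the glimpse policy would produce; only the form of the update, with the baseline acting as a control variate, is being illustrated.

```python
import numpy as np

def hybrid_gradient(dlogp_y, dlogp_l, R, b, lam=1.0):
    """Eq. 2.13: average, over M sampled glimpse sequences, of the
    classification gradient plus the REINFORCE term, with the baseline b
    centering the 0/1 reward R to reduce variance."""
    M = R.shape[0]
    return (dlogp_y + lam * (R - b)[:, None] * dlogp_l).sum(axis=0) / M

# Placeholder shapes: M = 16 sampled glimpse sequences, D = 100 parameters.
rng = np.random.default_rng(4)
M, D = 16, 100
grad = hybrid_gradient(rng.normal(size=(M, D)),                 # d log p(y|l,I)/dθ
                       rng.normal(size=(M, D)),                 # d log p(l|I)/dθ
                       rng.integers(0, 2, size=M).astype(float),  # rewards R
                       np.full(M, 0.5))                         # learned baseline b
```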

Here, we encode the real-valued glimpse location tuple $l_n$ using a Cartesian coordinate system centered at the middle of the input image. The ratio converting unit width in the coordinate system to the number of pixels is a hyper-parameter. This ratio presents an exploration versus exploitation trade-off. The proposed model’s performance is very sensitive to this setting; we found that setting its value to around 15% of the input image width tends to work well.

2.3.2 Multi-object/Sequential classification as a visual attention task

Our proposed attention model can be easily extended to solve classification tasks involving multiple objects. To train the deep recurrent attention model for the sequential recognition task, the multiple object labels for a given image need to be cast into an ordered sequence $\{y_1, y_2, \cdots, y_S\}$. The deep recurrent attention model then learns to predict one object at a time as it explores the image in a sequential manner. We can utilize a simple fixed number of glimpses for each target in the sequence. In addition, a new class label for the “end-of-sequence” symbol is included to deal with variable numbers of objects in an image. We can stop the recurrent attention model once the terminal symbol is predicted.

Concretely, the objective function for the sequential prediction is

$\log p(y_1, y_2, \cdots, y_S \mid \mathcal{I}, \theta) = \sum_{s=1}^{S} \log \sum_{l} p(l \mid \mathcal{I}, \theta) \, p(y_s \mid l_s, \mathcal{I}, \theta) \qquad (2.15)$

The learning rule is derived from the free energy as in equation 2.13, and the gradient is accumulated across all targets. We assign a fixed number of glimpses, $N$, to each target. Assuming $S$ targets in an image, the model would be trained with $N \times (S + 1)$ glimpses. The benefit of using a recurrent model for multiple object recognition is that it is a compact and simple form, yet flexible enough to deal with images containing variable numbers of objects. Learning a model from images of many objects is a challenging setup. We can reduce the difficulty by modifying our indicator function $R$ to be proportional to the number of targets the model predicted correctly:

$R_s = \sum_{j \leq s} R_j \qquad (2.16)$

In addition, we restrict the gradient of the objective function so that it only contains glimpses up to the first mislabeled target and ignores the targets after the first mistake. This curriculum-like adaptation of the learning is crucial for obtaining a high-performance attention model for sequential prediction. A sketch of this bookkeeping follows.
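A minimal sketch of one plausible reading of the cumulative reward in Eq. 2.16 and the truncation at the first mistake; the exact bookkeeping used in the original implementation is not specified here, so treat this as an assumption.

```python
import numpy as np

def sequence_rewards(correct):
    """Eq. 2.16 plus truncation: the reward for target s counts the correct
    predictions up to s, and gradients are kept only up to the first mistake."""
    correct = np.asarray(correct, dtype=float)   # per-target 0/1 indicators R_j
    R = np.cumsum(correct)                       # R_s = sum_{j <= s} R_j
    first_err = np.argmin(correct) if correct.min() == 0 else len(correct)
    keep = np.arange(len(correct)) <= first_err  # later glimpses are ignored
    return R, keep

print(sequence_rewards([1, 1, 0, 1]))  # R = [1 2 2 3], keep = [T T T F]
```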

2.3.3 Comparison with CNN

To show the effectiveness of the deep recurrent attention model (DRAM), we first investigate a number of multi-object classification tasks involving a variant of MNIST. We then apply the proposed attention model to a real-world object recognition task using the multi-digit SVHN dataset [Netzer et al., 2011] and compare with state-of-the-art deep ConvNets. As suggested in Mnih et al. [2014b], we used a glimpse network with two different scales to improve the classification performance. Namely, given a glimpse location $l_n$, we extract two patches $(x_n^1, x_n^2)$, where $x_n^1$ is the original patch and $x_n^2$ is a down-sampled, coarser image patch. We use the concatenation of $x_n^1$ and $x_n^2$ as the glimpse observation, the “foveal” feature. The hyper-parameters in our experiments are the learning rate $\eta$ and the location variance $\Sigma$ in equation 2.9. They are determined by grid search and cross-validation.

Learning to find digits

We first evaluate the effectiveness of the controller in the deep recurrent attention model using the MNIST handwritten digit dataset. We generated a dataset of pairs of randomly picked handwritten digits in a 100x100 image with distraction noise in the background. The task is to identify the 55 different combinations of the two digits as a classification problem. The attention models are allowed 4 glimpses before making a classification prediction. The goal of this experiment is to evaluate the ability of the controller and recurrent network to combine information from multiple glimpses with minimum effort from the glimpse network. The results are shown in table (2.1). The DRAM model with a context network significantly outperforms the other models.

Table 2.1: Error rates on the MNIST pairs classification task.

  Model                      Test Err.
  RAM Mnih et al. [2014b]    9%
  DRAM w/o context           7%
  DRAM                       5%

Table 2.2: Error rates on the MNIST two digit addition task.

  Model                      Test Err.
  ConvNet 64-64-64-512       3.2%
  DRAM                       2.5%

Figure 2.2: Left) Two examples of the learned policy on the digit pair classification task. The first column shows the input image while the next 5 columns show the selected glimpse locations. Right) Two examples of the learned policy on the digit addition task. The first column shows the input image while the next 5 columns show the selected glimpse locations.

Learning to do addition

For a more challenging task, we designed another dataset with two MNIST digits on an empty 100x100 background, where the task is to predict the sum of the two digits in the image as a classification problem with 19 targets. The model has to find where each digit is and add them up. When the two digits are sampled uniformly from all classes, the label distribution of the summation is heavily imbalanced, with most of the probability mass concentrated around 10. Also, there are many digit combinations that map to the same target, for example, [5,5] and [3,7]. The class label therefore provides a weaker association between the visual features and the supervision signal in this task than in the digit combination task. We used the same model as in the combination task. The deep recurrent attention model is able to discover a glimpse policy that solves this task, achieving a 2.5% error rate. In comparison, ConvNets take longer to learn and perform worse when given weak supervision. Some inference samples are shown in figure 2.2. Interestingly, the learned policy for predicting the next glimpse is very different in the addition task compared to the combination task: the model that learned to do addition toggles its glimpses between the two digits.

Learning to read house numbers

The publicly available multi-digit street view house number (SVHN) dataset [Netzer et al., 2011] consists of images of digits taken from pictures of house fronts. Following Goodfellow et al. [2013], we formed a validation set of 5000 images by randomly sampling images from the training set and the extra set, and these were used for selecting the learning rate and the sampling variance for the stochastic glimpse policy. The models are trained using the remaining 200,000 training images. We follow the preprocessing technique from Goodfellow et al. [2013] to generate tightly cropped 64x64 images with multi-digits at the center, and similar data augmentation is used to create 54x54 jittered images during training. We also convert the RGB images to grayscale, as we observe that the color information does not affect the final classification performance. We trained a model to classify all the digits in an image sequentially with the objective function defined in equation 2.15.

Table 2.3: Whole sequence recognition error rates on multi-digit SVHN.

  Model                                    Test Err.
  11 layer CNN Goodfellow et al. [2013]    3.96%
  10 layer CNN                             4.11%
  Single DRAM                              5.1%
  Single DRAM MC avg.                      4.4%
  forward-backward DRAM MC avg.            3.9%

Table 2.4: Whole sequence recognition error rates on enlarged multi-digit SVHN.

  Model                                    Test Err.
  10 layer CNN resize                      50%
  10 layer CNN re-trained                  5.60%
  Single DRAM focus                        5.7%
  forward-backward DRAM focus              5.0%
  Single DRAM fine-tuned                   5.1%
  forward-backward DRAM fine-tuning        4.46%

The label sequence ordering is chosen to go from left to right, following the natural ordering of house numbers. The attention model is given 3 glimpses for each digit before making a prediction. The recurrent model keeps running until it predicts a terminal label or until the longest digit length in the dataset is reached. In the SVHN dataset, up to 5 digits can appear in an image. This means the recurrent model will run up to 18 glimpses per image, that is, 5 x 3 plus 3 glimpses for a terminal label. Learning the attention model took around 3 days on a GPU. The model performance is shown in table (2.3). We found that there is still a performance gap between the state-of-the-art deep ConvNet and a single DRAM that “reads” from left to right, even with the Monte Carlo averaging. The DRAM often over-predicts additional digits in the place of the terminal class. In addition, the distribution of the leading digit in real life follows Benford’s law¹. We therefore train a second recurrent attention model to “read” the house numbers from right to left as a backward model. The forward and backward models can share the same weights for their glimpse networks, but they have different weights for their recurrent and their emission networks. The predictions of both forward and backward models can be combined to estimate the final sequence prediction. Following the observation that attention models often overestimate the sequence length, we can flip the first k sequence predictions from the backward model, where k is the shorter of the two sequence length predictions from the forward and backward models. This simple heuristic works very well in practice, and we obtain state-of-the-art performance on the Street View house number dataset with the forward-backward recurrent attention model. Videos showing sample runs of the forward and backward models on SVHN test data can be found at http://www.psi.toronto.edu/~jimmy/dram/forward.avi and http://www.psi.toronto.edu/~jimmy/dram/backward.avi respectively. These visualizations show that the attention model learns to follow the slope of multi-digit house numbers when they go up or down.

For comparison, we also implemented a deep ConvNet with a similar architecture to the one used in Goodfellow et al. [2013]. The network had 8 convolutional layers with 128 filters in each, followed by 2 fully connected layers of 3096 ReLU units. Dropout is applied to all 10 layers with a 50% dropout rate to prevent over-fitting. Moreover, we generate a less tightly cropped 110x110 multi-digit SVHN dataset by enlarging the bounding box of each image such that the relative size of the digits stays the same as in the 54x54 images. Our deep attention model trained on 54x54 images can be directly applied to the new 110x110 dataset with no modification. The performance can be further improved by “focusing” the model on where the digits are: we run the model once, crop a 54x54 bounding box around the glimpse location sequence, and feed the 54x54 bounding box to the attention model again to generate the final prediction. This allows DRAM to “focus” and obtain a similar prediction accuracy on the enlarged images as on the cropped images without ever being trained on large images. We also compared the deep ConvNet trained on the 110x110 images with the fine-tuned attention model. The deep attention model significantly outperforms the deep ConvNet with very little training time. The DRAM model only takes a few hours to fine-tune on the enlarged SVHN data, compared to one week for the deep 10 layer ConvNet.

Table 2.5: Computation cost of DRAM vs. deep ConvNets.

  (Giga) floating-point op.   10 layer CNN   DRAM    DRAM MC avg.   F-B DRAM MC avg.
  54x54                       2.1            0.2     0.35           0.7
  110x110                     8.5            ≤ 0.2   1.1            2.2

  param. (millions)           10 layer CNN   DRAM    DRAM MC avg.   F-B DRAM MC avg.
  54x54                       51             14      14             28
  110x110                     169            14      14             28

¹ Benford’s law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.

2.3.4 Discussion

In our experiments, the proposed deep recurrent attention model (DRAM) outperforms the state-of-the-art deep ConvNets on the standard SVHN sequence recognition task. Moreover, as we increase the image area around the house numbers or lower the signal-to-noise ratio, the advantage of the attention model becomes more significant. In table 2.5, we compare the computational cost of our proposed deep recurrent attention model with that of deep ConvNets in terms of the number of floating-point operations for the multi-digit SVHN models, along with the number of parameters in each model. The recurrent attention models, which only process a selected subset of the input, scale better than a ConvNet that looks over an entire image. The estimated cost for the DRAM is calculated using the maximum sequence length in the dataset; the expected computational cost is much lower in practice, since most of the house numbers are around 2-3 digits long. In addition, since the attention-based model does not process the whole image, it can naturally work on images of different sizes with the same computational cost, independent of the input dimensionality. We also found that the attention-based model is less prone to over-fitting than ConvNets, likely because of the stochasticity in the glimpse policy during training. Though it is still beneficial to regularize the attention model with some dropout noise between the hidden layers during training, we found that this gives a very marginal performance boost of 0.1% on the multi-digit SVHN task. On the other hand, the deep 10 layer ConvNet is only able to achieve a 5.5% error rate when dropout is only applied to the last two fully connected hidden layers. Finally, we note that DRAM can easily deal with variable length label sequences. Moreover, a model trained on a dataset with a fixed sequence length can easily be transferred and fine-tuned on a similar dataset with longer target sequences. This is especially useful when there is a lack of data for the task with longer sequences.

[Figure 2.3: The Wake-Sleep Recurrent Attention Model. An inference network q(a_n | y, a_{1:n-1}, I, η) is stacked on top of the prediction network, which models p(a_n | a_{1:n-1}, I, θ) from a low-resolution view of the image I and the glimpse observations (x_1, a_1), ..., (x_N, a_N), and outputs the prediction p(y | a, I, θ).]

2.4 Improved learning with re-weighted wake-sleep objective

Training stochastic attention models is difficult because the loss gradient involves intractable posterior expectations, and because the stochastic gradient estimates can have high variance. (The latter problem was also observed by Zaremba and Sutskever [2015] in the context of memory networks.) In this section, we propose the Wake-Sleep Recurrent Attention Model (WS-RAM), a method for training stochastic recurrent attention models which deals with the problems of intractable inference and high-variance gradients by taking advantage of several advances from the literature on training deep generative models: inference networks Dayan et al. [1995], the reweighted wake-sleep algorithm Bornschein and Bengio [2014], and control variates Paisley et al. [2012], Mnih and Gregor [2014]. During training, the WS-RAM approximates posterior expectations using importance sampling, with a proposal distribution computed by an inference network. Unlike the prediction network, the inference network has access to the object category label, which helps it choose better glimpse locations.

2.4.1 Wake-Sleep recurrent attention model

We now describe our wake-sleep recurrent attention model (WS-RAM). Given an image I, the network first chooses a sequence of glimpses a = (a_1, ..., a_N), and after each glimpse, receives an observation x_n computed by a mapping g(a_n, I). This mapping might, for instance, extract an image patch at a given scale. The first glimpse is based on a low-resolution version of the input, while subsequent glimpses are chosen based on information acquired from previous glimpses. The glimpses are chosen stochastically according to a distribution p(a_n | a_{1:n-1}, I, θ), where θ denotes the parameters of the network. This is in contrast with soft attention models, which deterministically allocate attention across all image locations. After the last glimpse, the network predicts a distribution p(y | a, I, θ) over the target y (for instance, the caption or image category).

As shown in Figure 2.3, the core of the attention network is a two-layer recurrent network, which we term the "prediction network", where the output at each time step is an action (saccade) which is used to compute the input at the next time step. A low-resolution version of the input image is fed to the network at the first time step, and the network predicts the class label at the final time step. Importantly, the low-resolution input is fed to the second layer, while the class label prediction is made by the first layer, preventing information from propagating directly from the low-resolution image to the output. This prevents local optima where the network learns to predict y directly from the low-resolution input, disregarding attention completely.

On top of the prediction network is an inference network, which receives both the class label and the attention network's top layer representation as inputs. It tries to predict the posterior distribution q(a_{n+1} | y, a_{1:n}, I, η), parameterized by η, over the next saccade, conditioned on the image category being correctly predicted. Its job is to guide the posterior sampler during training time, thereby acting as a "teacher" for the attention network. The inference network is described further in Section 2.4.3.

One of the benefits of stochastic attention models is that the mapping g can be localized to a small image region or coarse granularity, which means it can potentially be made very efficient. Furthermore, g need not be differentiable, which allows for operations (such as choosing a scale) which would be difficult to implement in a soft attention network. The cost of this flexibility is that standard backpropagation cannot be applied, so instead we use the novel algorithms described in the next section.

Assume we have a dataset with labels y for the supervised prediction task (e.g. object category). In contrast to the supervised saliency prediction task (e.g. Itti et al. [1998], Judd et al. [2009]), there are no labels for where to attend. Instead, we learn an attention policy based on the idea that the best locations to attend to are the ones which most robustly lead the model to predict the correct category. In particular, we aim to maximize the probability of the class label (or equivalently, minimize the cross-entropy) by marginalizing over the actions at each glimpse:

\mathcal{L} = \log p(y \mid \mathcal{I}, \theta) = \log \sum_a p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta). \qquad (2.17)

2.4.2 An improved lower bound on the log-likelihood

In this section, we first describe the previous variational lower bound objective. We then introduce a new objective function which directly estimates the gradients of \mathcal{L}. The new method can be seen as maximizing a tighter lower bound on \mathcal{L}.

Let q(a | y, I) be an approximating distribution. The lower bound on \mathcal{L} is then given by:

\mathcal{L} = \log \sum_a p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta) \geq \sum_a q(a \mid y, \mathcal{I}) \log p(y, a \mid \mathcal{I}, \theta) + \mathcal{H}[q] = \mathcal{F}. \qquad (2.18)

In the case where q(a | y, I) = p(a | I, θ) is the prior, as considered by Ba et al. [2015], this reduces to

\mathcal{F} = \sum_a p(a \mid \mathcal{I}, \theta) \log p(y \mid a, \mathcal{I}, \theta). \qquad (2.19)

The learning rules can be derived by taking derivatives of Eqn. 2.19 with respect to the model parameters:

\frac{\partial \mathcal{F}}{\partial \theta} = \sum_a p(a \mid \mathcal{I}, \theta) \left[ \frac{\partial \log p(y \mid a, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid a, \mathcal{I}, \theta)\, \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \qquad (2.20)

The summation can be approximated using M Monte Carlo samples \tilde{a}^m from p(a | I, θ):

\frac{\partial \mathcal{F}}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^M \left[ \frac{\partial \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)\, \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \qquad (2.21)
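In an autograd framework, this estimator is typically implemented as a surrogate objective whose gradient matches Eqn. 2.21. A minimal PyTorch-style sketch, assuming `log_py` and `log_pa` hold log p(y | ã^m, I, θ) and log p(ã^m | I, θ) for the M samples:

```python
import torch

def variational_surrogate(log_py, log_pa):
    """log_py, log_pa: shape (M,) tensors computed with autograd.
    Differentiating the returned scalar reproduces Eqn. 2.21: the first
    term backpropagates through log p(y | a, I, theta), and the second is
    the REINFORCE term with the detached log-likelihood as the reward."""
    reinforce = log_py.detach() * log_pa
    return -(log_py + reinforce).mean()   # minimize the negative bound
```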

The partial derivative terms can each be computed using standard backpropagation. This suggests a simple gradient-based training algorithm: for each image, one first computes the samples \tilde{a}^m from the prior p(a | I, θ), and then updates the parameters according to Eqn. 2.21. As observed by Ba et al. [2015], one must carefully use control variates in order to make this technique practical; we defer discussion of control variates to Section 2.4.4.

The variational method described above has some counterintuitive properties early in training. First, because it averages the log-likelihood over actions, it greatly amplifies the differences in probabilities assigned to the true category by different bad glimpse sequences. For instance, a glimpse sequence which leads to 0.01 probability assigned to the correct class is considered much worse than one which leads to 0.02 probability under the variational objective, even though in practice they may be equally bad, since both have missed the relevant information. A second odd behavior is that all glimpse sequences are weighted equally in the log-likelihood gradient. It would be better if the training procedure focused its effort on those glimpses which contain the relevant information. Both of these effects contribute noise to the training procedure, especially in its early stages.

Instead, we adopt an approach based on the wake-p step of reweighted wake-sleep Bornschein and Bengio [2014], where we attempt to maximize the marginal log-probability \mathcal{L} directly. We differentiate the marginal log-likelihood objective in Eqn. 2.17 with respect to the model parameters:

\frac{\partial \mathcal{L}}{\partial \theta} = \frac{1}{p(y \mid \mathcal{I}, \theta)} \sum_a p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta) \left[ \frac{\partial \log p(y \mid a, \mathcal{I}, \theta)}{\partial \theta} + \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \qquad (2.22)

The summation and normalizing constant are both intractable to evaluate, so we estimate them using importance sampling. We must define a proposal distribution q(a | y, I), which ideally should be close to the posterior p(a | y, I, θ). One reasonable choice is the prior p(a | I, θ), but another choice is described in Section 2.4.3. Normalized importance sampling gives a biased but consistent estimator of the gradient of \mathcal{L}. Given samples \tilde{a}^1, ..., \tilde{a}^M from q(a | y, I), the (unnormalized) importance weights are computed as:

\tilde{w}^m = \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta)\, p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{q(\tilde{a}^m \mid y, \mathcal{I})}. \qquad (2.23)

The Monte Carlo estimate of the gradient is given by:

\frac{\partial \mathcal{L}}{\partial \theta} \approx \sum_{m=1}^M w^m \left[ \frac{\partial \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{\partial \theta} + \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right], \qquad (2.24)

where w^m = \tilde{w}^m / \sum_{i=1}^M \tilde{w}^i are the normalized importance weights. When q is chosen to be the prior, this approach is equivalent to the method of Tang and Salakhutdinov [2013] for learning generative feed-forward networks.

Our importance sampling based estimator can also be viewed as the gradient ascent update on the objective function \mathbb{E}\left[\log \frac{1}{M} \sum_{m=1}^M \tilde{w}^m\right]. Combining Jensen's inequality with the unbiasedness of the \tilde{w}^m shows that this is a lower bound on the log-likelihood:

" M # " M # 1 X 1 X log w˜m log w˜m = log [w ˜m] = . (2.25) E M ≤ E M E L m=1 m=1

We relate this to the previous section by noting that \mathcal{F} = \mathbb{E}[\log \tilde{w}^m]. Another application of Jensen's inequality shows that our proposed bound is at least as accurate as \mathcal{F}:

\mathcal{F} = \mathbb{E}[\log \tilde{w}^m] = \mathbb{E}\left[\frac{1}{M} \sum_{m=1}^M \log \tilde{w}^m\right] \leq \mathbb{E}\left[\log \frac{1}{M} \sum_{m=1}^M \tilde{w}^m\right]. \qquad (2.26)

Burda et al. [2015] further analyzed a closely related importance sampling based estimator in the context of generative models, bounding the mean absolute deviation and showing that the bias decreases monotonically with the number of samples.
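A sketch of the corresponding reweighted wake-sleep (wake-p) update, under the same assumptions as the earlier sketch; the normalized weights of Eqn. 2.23 are treated as constants so that differentiating the surrogate reproduces Eqn. 2.24:

```python
import torch

def wake_p_surrogate(log_py, log_pa, log_qa):
    """Normalized importance sampling estimate of the gradient of
    log p(y | I, theta). log_py, log_pa, log_qa: per-sample log
    probabilities under the classifier, the prior, and the proposal."""
    log_w = (log_pa + log_py - log_qa).detach()   # unnormalized log weights
    w = torch.softmax(log_w, dim=0)               # normalized weights w^m
    return -(w * (log_py + log_pa)).sum()         # surrogate loss (Eqn. 2.24)
```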

2.4.3 Training an inference network

Late in training, once the attention model has learned an effective policy, the prior distribution p(a | I, θ) is a reasonable choice for the proposal distribution q(a | y, I), as it puts significant probability mass on good actions. But early in training, the model may have only a small probability of choosing a good set of glimpses, and the prior may have little overlap with the posterior. To deal with this, we train an inference network to predict, given the observations as well as the class label, where the network should look to correctly predict that class (see Figure 2.3). With this additional information, the inference network can act as a "teacher" for the attention policy. The inference network predicts a sequence of glimpses stochastically:

q(a \mid y, \mathcal{I}, \eta) = \prod_{n=1}^N q(a_n \mid y, \mathcal{I}, \eta, a_{1:n-1}). \qquad (2.27)

This distribution is analogous to the prior, except that each decision also takes into account the class label y. We denote the parameters of the inference network by η. During training, the prediction network is learnt by following the gradient of the estimator in Eqn. 2.24, with samples \tilde{a}^m ∼ q(a | y, I, η) drawn from the inference network output.

Our training procedure for the inference network parallels the wake-q step of reweighted wake-sleep Bornschein and Bengio [2014]. Intuitively, the inference network is most useful if it puts large probability density over locations in an image that are most informative for predicting class labels. We therefore train the inference weights η to minimize the Kullback-Leibler divergence between the recognition model prediction q(a | y, I, η) and the posterior distribution from the attention model p(a | y, I, θ):

\min_\eta D_{KL}(p \,\|\, q) = \min_\eta \left[ -\sum_a p(a \mid y, \mathcal{I}, \theta) \log q(a \mid y, \mathcal{I}, \eta) \right]. \qquad (2.28)

The gradient update for the recognition weights can be obtained by taking the derivative of Eq. (2.28) with respect to the recognition weights η:

\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} = -\mathbb{E}_{p(a \mid y, \mathcal{I}, \theta)}\left[ \frac{\partial \log q(a \mid y, \mathcal{I}, \eta)}{\partial \eta} \right]. \qquad (2.29)

Since the posterior expectation is intractable, we estimate it with importance sampling. In fact, we reuse the importance weights computed for the prediction network update (see Eqn. 2.23) to obtain the following gradient estimate for the recognition network:

\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} \approx -\sum_{m=1}^M w^m \frac{\partial \log q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\partial \eta}. \qquad (2.30)
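The corresponding inference-network (wake-q) update can reuse the same log-weights; a sketch under the same assumptions as the earlier ones:

```python
import torch

def wake_q_surrogate(log_qa, log_w):
    """Inference-network update (Eqn. 2.30): minimize the importance-
    weighted negative log proposal probability, reusing the weights
    computed for the wake-p step (Eqn. 2.23)."""
    w = torch.softmax(log_w.detach(), dim=0)   # same normalized weights w^m
    return -(w * log_qa).sum()
```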

2.4.4 Control variates

The speed of convergence of gradient ascent with the gradients defined in Eqns. 2.24 and 2.30 suffers from high variance of the stochastic gradient estimates. Past work using similar gradient updates has found significant benefit from the use of control variates, or reward baselines, to reduce the variance Williams [1992a], Paisley et al. [2012], Mnih et al. [2014a], Mnih and Gregor [2014], Ba et al. [2015]. Choosing effective control variates for the stochastic gradient estimators amounts to finding a function that is highly correlated with the gradient vectors, and whose expectation is known or tractable to compute Paisley et al. [2012], Weaver and Tao [2001b]. Unfortunately, a good choice of control variate is highly model-dependent. We first note that:

\mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ \frac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)} \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right] = 0, \qquad \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ \frac{\partial \log q(a \mid y, \mathcal{I}, \eta)}{\partial \eta} \right] = 0. \qquad (2.31)

The terms inside the expectations are very similar to the gradients in Eqns. 2.24 and 2.30, suggesting that stochastic estimates of these expectations would make good control variates. To increase the correlation between the gradients and the control variates, we reuse the same set of samples and importance weights for the gradients and the control variates. Using these control variates in the gradient estimates for the prediction and recognition networks, we obtain:

\frac{\partial \mathcal{L}}{\partial \theta} \approx \sum_{m=1}^M \left( w^m - \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta) / q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\sum_{i=1}^M p(\tilde{a}^i \mid \mathcal{I}, \theta) / q(\tilde{a}^i \mid y, \mathcal{I}, \eta)} \right) \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta}, \qquad (2.32)

(the term involving ∂ log p(y | ã^m, I, θ)/∂θ from Eqn. 2.24 is left unchanged), and

\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} \approx -\sum_{m=1}^M \left( w^m - \frac{1}{M} \right) \frac{\partial \log q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\partial \eta}. \qquad (2.33)

Our use of control variates does not bias the gradient estimates (beyond the bias which is present due to importance sampling). However, as we show in the experiments, the resulting estimates have much lower variance than those of Eqns. 2.24 and 2.30. Following the analogy with reinforcement learning highlighted by Mnih and Gregor [2014], these control variates can also be viewed as reward baselines:

b_p = \frac{\dfrac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)} \cdot \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ p(y \mid a, \mathcal{I}, \theta) \right]}{M \cdot \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ \dfrac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)} \right] \cdot \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ p(y \mid a, \mathcal{I}, \theta) \right]} \approx \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta) / q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\sum_{i=1}^M p(\tilde{a}^i \mid \mathcal{I}, \theta) / q(\tilde{a}^i \mid y, \mathcal{I}, \eta)}, \qquad (2.34)

b_q = \frac{\mathbb{E}_{p(a \mid \mathcal{I}, \theta)}\left[ p(y \mid a, \mathcal{I}, \theta) \right]}{M \cdot \mathbb{E}_{p(a \mid \mathcal{I}, \theta)}\left[ p(y \mid a, \mathcal{I}, \theta) \right]} = \frac{1}{M}, \qquad (2.35)

where M is the number of samples drawn from the proposal q.
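A sketch of the control-variate-corrected weights of Eqns. 2.32 and 2.33, again with hypothetical per-sample log-probability tensors:

```python
import torch

def centered_weights(log_py, log_pa, log_qa):
    """Control-variate-corrected weights. w: weights from the full
    unnormalized posterior (Eqn. 2.23); r: weights from the prior/proposal
    ratio alone, whose weighted gradient has zero expectation (Eqn. 2.31)."""
    M = log_py.shape[0]
    w = torch.softmax((log_pa + log_py - log_qa).detach(), dim=0)
    r = torch.softmax((log_pa - log_qa).detach(), dim=0)
    w_theta = w - r         # multiplies d log p(a | I, theta)   (Eqn. 2.32)
    w_eta = w - 1.0 / M     # multiplies d log q(a | y, I, eta)  (Eqn. 2.33)
    return w_theta, w_eta
```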

2.4.5 Encouraging exploration

Similarly to other methods based on reinforcement learning, stochastic attention networks face the problem of encouraging exploration of different actions. Since the gradient in Eqn. 2.24 only rewards or punishes glimpse sequences which are actually performed, any part of the space which is never visited receives no reward signal. Ba et al. [2015] introduced several heuristics to encourage exploration, including: (1) raising the temperature of the proposal distribution, (2) regularizing the attention policy to encourage viewing all image locations, and (3) adding a regularization term to encourage high entropy in the action distribution. We have implemented all three heuristics for the WS-RAM and for the baselines. While these heuristics are important for good performance of the baselines, we found that they made little difference to the WS-RAM, because the basic method already explores adequately.

2.4.6 Experiments

To measure the effectiveness of the proposed WS-RAM method, we first investigated a toy classification task involving a variant of the MNIST handwritten digits dataset LeCun et al. [1998a] where transformations were applied to the images. We then evaluated the proposed method on a substantially more difficult image caption generation task using the Flickr8k Hodosh et al. [2013] dataset.

Translated scaled MNIST

We generated a dataset of randomly translated and scaled handwritten digits from the MNIST dataset LeCun et al. [1998a]. Each digit was placed at a random location and scale in a 100x100 image with a black background. The task was to identify the digit class. The attention models were allowed four glimpses before making a classification prediction. The goal of this experiment was to evaluate the effectiveness of our proposed WS-RAM model compared with the variational approach of Ba et al. [2015].

For both the WS-RAM and the baseline, the architecture was a stochastic attention model which used ReLU units in all recurrent layers. The actions included both continuous and discrete latent variables, corresponding to the glimpse location and scale, respectively. The distribution over actions was represented as a Gaussian random variable for the location and an independent multinomial random variable for the scale. All networks were trained using Adam [Kingma and Ba, 2014a], with the learning rate set to the highest value that allowed the model to successfully converge to a sensible attention policy.

The classification performance results are shown in Table 2.6. In Figure 2.4, the WS-RAM is compared with the variational baseline, each using the same number of samples (in order to make computation time roughly equivalent). We also show comparisons against ablated versions of the WS-RAM where the control variates and inference network were removed. When the inference network was removed, the prior p(a | I, θ) was used as the proposal distribution.

In addition to the classification results, we measured the effective sample size (ESS) of our method with and without control variates and the inference network. ESS is a standard metric for evaluating importance samplers, and is defined as 1 / \sum_m (w^m)^2, where w^m denotes the normalized importance weights. Results are shown in Figure 2.4. Using the inference network reduced the variance of the gradient estimates, although this improvement did not reflect itself in the ESS. Control variates improved both metrics.
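For reference, ESS can be computed from the log-weights in a numerically stable way; a small sketch:

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = 1 / sum_m (w^m)^2 with normalized importance weights;
    equals M for uniform weights and 1 when one sample dominates."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())   # subtract max for stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)
```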

[Figure 2.4: Left: Training error as a function of the number of updates. Middle: variance of the gradient estimates. Right: effective sample size (max = 5). Horizontal axis: thousands of updates. VAR: variational baseline; WS-RAM: our proposed method; +q: uses the inference network for the proposal distribution; +c: uses control variates.]

[Figure 2.5: The effect of the exploration heuristics on the variational baseline and the WS-RAM. Curves: VAR+c and WS-RAM+q+c, each with and without exploration.]

Table 2.6: Classification error rate comparison for attention models trained using different algorithms on translated scaled MNIST. The numbers are reported after 10 million updates using 5 samples.

Test err.    VAR      WS-RAM   WS-RAM+q
no c.v.      3.11%    4.23%    2.59%
+ c.v.       1.81%    1.85%    1.62%

In Section 2.4.5, we described heuristics which encourage the models to explore the action space. Figure 2.5 compares training with and without these heuristics. Without the heuristics, the variational method quickly fell into a local minimum where the model predicted only one glimpse scale for all images; the exploration heuristics fixed this problem. By contrast, the WS-RAM did not appear to have this problem, so the heuristics were not necessary.

2.5 Summary

Convolutional neural networks, trained end-to-end, have been shown to substantially outperform previous approaches to various supervised learning tasks in computer vision (e.g. Krizhevsky et al. [2012a]). Despite their wide success, convolutional nets are computationally expensive when processing high-resolution input images, because they must examine all image locations at a fine scale. This has motivated our development of the visual attention-based models in this chapter, which reduce the number of parameters and computational operations by selecting informative regions of an image to focus on. In addition to computational speedups, one can understand what information a neural network is using by seeing where it is looking.

We studied two learning objectives for stochastic attention models, the variational lower bound and the reweighted importance sampling bound, and compared them on standard handwritten digit classification tasks. We demonstrated that our stochastic attention model can learn to (1) classify translated and scaled MNIST digits, and (2) generate image captions by attending to the relevant objects in images at their corresponding scale. The proposed reweighted wake-sleep algorithm shows much improved training performance for the stochastic visual attention models.

Chapter 3

Generating image (and) captions with visual attention

3.1 Problem definition

Automatically generating captions for an image is a task close to the heart of scene understanding — one of the primary goals of computer vision. Not only must caption generation models be able to solve the computer vision challenges of determining what objects are in an image, but they must also be powerful enough to capture and express their relationships in natural language. For this reason, caption generation has long been seen as a difficult problem. It amounts to mimicking the remarkable human ability to compress huge amounts of salient visual information into descriptive language and is thus an important challenge for machine learning and AI research.

Yet despite the difficult nature of this task, there has been a recent surge of research interest in attacking the image caption generation problem. Aided by advances in training deep neural networks [Krizhevsky et al., 2012c] and the availability of large classification datasets [Russakovsky et al., 2014], recent work has significantly improved the quality of caption generation using a combination of convolutional neural networks (convnets) to obtain vectorial representations of images and recurrent neural networks to decode those representations into natural language sentences (see Sec. 3.2). Unlike CNNs that compress an entire image into a static representation, visual attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. Using representations (such as those from the very top layer of a convnet) that distill information in an image down to the most salient objects is one effective solution that has been widely adopted in previous work. Encouraged by recent advances in caption generation and inspired by recent successes in employing attention in machine translation [Bahdanau et al., 2014], we investigate visual attention models that can attend to salient parts of an image while generating its caption.

3.2 Related work

In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of


[Figure 3.1: Our model learns a words/image alignment. The visualized attentional maps (3) are explained in Sections 3.3.1 & 3.3.3. Pipeline: 1. Input Image; 2. Convolutional Feature Extraction (14x14 feature map); 3. RNN with attention over the image; 4. Word-by-word generation (sample output: "A bird flying over a body of water").]

these methods are based on recurrent neural networks and inspired by the successful use of sequence-to-sequence training with neural networks for machine translation [Cho et al., 2014c, Bahdanau et al., 2014, Sutskever et al., 2014c, Kalchbrenner and Blunsom, 2013]. The encoder-decoder framework [Cho et al., 2014c] of machine translation is well suited, because it is analogous to "translating" an image to a sentence.

The first approach to using neural networks for caption generation was proposed by Kiros et al. [2014a], who used a multimodal log-bilinear model that was biased by features from the image. This work was later followed by Kiros et al. [2014b], whose method was designed to explicitly allow for a natural way of doing both ranking and generation. Mao et al. [2014] used a similar approach to generation, but replaced a feedforward neural language model with a recurrent one. Both Vinyals et al. [2014a] and Donahue et al. [2014] used recurrent neural networks (RNNs) based on long short-term memory (LSTM) units [Hochreiter and Schmidhuber, 1997a] for their models. Unlike Kiros et al. [2014a] and Mao et al. [2014], whose models see the image at each time step of the output word sequence, Vinyals et al. [2014a] only showed the image to the RNN at the beginning. Along with images, Donahue et al. [2014] and Yao et al. [2015] also applied LSTMs to videos, allowing their models to generate video descriptions.

Most of these works represent images as a single feature vector from the top layer of a pre-trained convolutional network. Karpathy and Li [2014] instead proposed to learn a joint embedding space for ranking and generation; their model learns to score sentence and image similarity as a function of R-CNN object detections and outputs of a bidirectional RNN. Fang et al. [2014] proposed a three-step pipeline for generation by incorporating object detections. Their model first learns detectors for several visual concepts based on a multi-instance learning framework. A language model trained on captions is then applied to the detector outputs, followed by rescoring from a joint image-text embedding space. Unlike these models, our proposed attention framework does not explicitly use object detectors but instead learns latent alignments from scratch. This allows our model to go beyond "objectness" and learn to attend to abstract concepts.

Prior to the use of neural networks for generating captions, two main approaches were dominant. The first involved generating caption templates which were filled in based on the results of object detections and attribute discovery (Kulkarni et al. [2013], Li et al. [2011], Yang et al. [2011], Mitchell et al. [2012],

Elliott and Keller [2013]). The second approach was based on first retrieving similar captioned images from a large database and then modifying these retrieved captions to fit the query [Kuznetsova et al., 2012, 2014]. These approaches typically involved an intermediate "generalization" step to remove the specifics of a caption that are only relevant to the retrieved image, such as the name of a city. Both of these approaches have since fallen out of favour to the now dominant neural network methods. There has been a long line of previous work incorporating the idea of attention into neural networks. Some that share the same spirit as our work include Larochelle and Hinton [2010a], Denil et al. [2012], Tang et al. [2014] and more recently Gregor et al. [2015b]. In particular, however, our work directly extends the work of Bahdanau et al. [2014], Mnih et al. [2014b], Ba et al. [2014], Graves [2013a].

3.3 Image Caption Generation with Attention Mechanism

3.3.1 Model details

In this section, we describe the two variants of our attention-based model by first describing their common framework. The key difference is the definition of the ψ function which we describe in detail in Sec. 3.3.2. See Fig. 3.1 for the graphical illustration of the proposed model. We denote vectors with bolded font and matrices with capital letters. In our description below, we suppress bias terms for readability.

Encoder: convolutional features

Our model takes a single raw image and generates a caption y encoded as a sequence of 1-of-K encoded words.

y = \{y_1, \ldots, y_C\}, \quad y_i \in \mathbb{R}^K

where K is the size of the vocabulary and C is the length of the caption. We use a convolutional neural network in order to extract a set of feature vectors which we refer to as annotation vectors. The extractor produces L vectors, each of which is a D-dimensional representation corresponding to a part of the image.

a = \{\phi_1, \ldots, \phi_L\}, \quad \phi_i \in \mathbb{R}^D

In order to obtain a correspondence between the feature vectors and portions of the 2-D image, we extract features from a lower convolutional layer unlike previous work which instead used a fully connected layer. This allows the decoder to selectively focus on certain parts of an image by weighting a subset of all the feature vectors.

Decoder: long short-term memory network

We use a long short-term memory (LSTM) network [Hochreiter and Schmidhuber, 1997a] that produces a caption by generating one word at every time step, conditioned on a context vector, the previous hidden state and the previously generated words. Our implementation of LSTMs, shown in Fig. 1.1, closely follows the one used in Zaremba et al. [2014].

Figure 3.2: Visualization of the attention for each generated word. The rough visualizations are obtained by upsampling the attention weights and smoothing. (top) "soft" and (bottom) "hard" attention (note that both models generated the same captions in this example).

In simple terms, the context vector ẑ_t is a dynamic representation of the relevant part of the image input at time t. We define a mechanism ψ that computes ẑ_t from the annotation vectors φ_i, i = 1, ..., L, corresponding to the features extracted at different image locations. For each location i, the mechanism generates a positive weight α_i which can be interpreted either as the probability that location i is the right place to focus on for producing the next word (stochastic attention mechanism), or as the relative importance to give to location i in blending the φ_i's together (deterministic attention mechanism). The weight α_i of each annotation vector φ_i is computed by an attention model f_att, for which we use a multilayer perceptron conditioned on the previous hidden state a_{t-1}. To emphasize, we note that the hidden state varies as the output RNN advances in its output sequence: "where" the network looks next depends on the sequence of words that has already been generated.

e_{ti} = f_{att}(\phi_i, a_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^L \exp(e_{tk})}.

Once the weights (which sum to one) are computed, the context vector ẑ_t is computed by

\hat{z}_t = \psi\left( \{\phi_i\}, \{\alpha_i\} \right), \qquad (3.1)

where ψ is a function that returns a single vector given the set of annotation vectors and their corresponding weights. The details of the ψ function are discussed in Sec. 3.3.2.
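A minimal sketch of this attention step with the deterministic "soft" ψ of Sec. 3.3.2; `f_att` is assumed to be a small MLP returning a scalar score:

```python
import torch
import torch.nn.functional as F

def soft_context(phi, a_prev, f_att):
    """phi: (L, D) annotation vectors; a_prev: previous hidden state;
    f_att: an MLP scoring each location (a sketch of Eqn. 3.1)."""
    L = phi.shape[0]
    e = torch.stack([f_att(phi[i], a_prev) for i in range(L)])  # e_{ti}
    alpha = F.softmax(e, dim=0)                                 # weights sum to 1
    z = (alpha.unsqueeze(1) * phi).sum(dim=0)                   # context vector
    return z, alpha
```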

The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (f_{init,c} and f_{init,h}):

c_0 = f_{init,c}\left( \frac{1}{L} \sum_{i}^{L} \phi_i \right), \qquad a_0 = f_{init,h}\left( \frac{1}{L} \sum_{i}^{L} \phi_i \right)

We use a deep output layer [Pascanu et al., 2014] to compute the output word probability. Its inputs are cues from the image (the context vector), the previously generated word, and the decoder state (a_t).

p(y_t \mid \phi, y_{t-1}) \propto \exp\left( L_o (E y_{t-1} + L_h a_t + L_z \hat{z}_t) \right), \qquad (3.2)

where L_o \in \mathbb{R}^{K \times m}, L_h \in \mathbb{R}^{m \times n}, L_z \in \mathbb{R}^{m \times D}, and E are learned parameters initialized randomly.

3.3.2 Learning stochastic “hard” vs deterministic “soft” Attention

In this section we discuss two alternative mechanisms for the attention model fatt: stochastic attention and deterministic attention.

Stochastic “hard” attention

We represent the location variable s_t as where the model decides to focus attention when generating the t-th word. s_{t,i} is an indicator one-hot variable which is set to 1 if the i-th location (out of L) is the one used to extract visual features. By treating the attention locations as intermediate latent variables, we can assign a multinoulli distribution parametrized by \{\alpha_i\}, and view ẑ_t as a random variable:

p(s_{t,i} = 1 \mid s_{j<t}, \phi) = \alpha_{t,i} \qquad (3.3)

Similar to Chapter 2, we define a variational lower bound \mathcal{L}_s on the marginal log-likelihood \log p(y \mid \phi) of observing the sequence of words y given image features φ. Similar to work in deep generative modeling [Kingma and Welling, 2014b, Rezende et al., 2014], the learning algorithm for the parameters θ of the model can be derived by directly optimizing

\mathcal{L}_s = \sum_s p(s \mid \phi) \log p(y \mid s, \phi) \leq \log \sum_s p(s \mid \phi)\, p(y \mid s, \phi) = \log p(y \mid \phi), \qquad (3.5)

following its gradient

\frac{\partial \mathcal{L}_s}{\partial \theta} = \sum_s p(s \mid \phi) \left[ \frac{\partial \log p(y \mid s, \phi)}{\partial \theta} + \log p(y \mid s, \phi)\, \frac{\partial \log p(s \mid \phi)}{\partial \theta} \right]. \qquad (3.6)

We approximate this gradient of \mathcal{L}_s by a Monte Carlo method such that

\frac{\partial \mathcal{L}_s}{\partial \theta} \approx \frac{1}{N} \sum_{n=1}^N \left[ \frac{\partial \log p(y \mid \tilde{s}^n, \phi)}{\partial \theta} + \log p(y \mid \tilde{s}^n, \phi)\, \frac{\partial \log p(\tilde{s}^n \mid \phi)}{\partial \theta} \right], \qquad (3.7)

where \tilde{s}^n = (s_1^n, s_2^n, \ldots) is a sequence of sampled attention locations. We sample the location s_t^n from a multinoulli distribution defined by Eq. (3.3):

\tilde{s}_t^n \sim \text{Multinoulli}_L(\{\alpha_i^n\}).

We reduce the variance of this estimator with the moving average baseline technique [Weaver and Tao, 2001a]. Upon seeing the k-th mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log likelihoods with exponential decay:

b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(y \mid \tilde{s}_k, \phi)

To further reduce the estimator variance, the gradient of the entropy H[s] of the multinoulli distribution is added to the RHS of Eq. (3.7).

The final learning rule for the model is then

\frac{\partial \mathcal{L}_s}{\partial \theta} \approx \frac{1}{N} \sum_{n=1}^N \left[ \frac{\partial \log p(y \mid \tilde{s}^n, \phi)}{\partial \theta} + \lambda_r \left( \log p(y \mid \tilde{s}^n, \phi) - b \right) \frac{\partial \log p(\tilde{s}^n \mid \phi)}{\partial \theta} + \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial \theta} \right],

where λ_r and λ_e are two hyper-parameters set by cross-validation. As pointed out and used by Ba et al. [2014] and Mnih et al. [2014b], this formulation is equivalent to the REINFORCE learning rule [Williams, 1992c], where the reward for the attention choosing a sequence of actions is a real value proportional to the log-likelihood of the target sentence under the sampled attention trajectory. In order to further improve the robustness of this learning rule, with probability 0.5 for a given image, we set the sampled attention location \tilde{s} to its expected value α (equivalent to the deterministic attention in Sec. 3.3.2).
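A sketch of this learning rule as a surrogate loss (hypothetical names; `lr` and `le` stand for λ_r and λ_e):

```python
import torch

def hard_attention_loss(log_py, log_ps, entropy, baseline, lr=1.0, le=0.01):
    """One sample per row: log_py = log p(y | s~, phi),
    log_ps = log p(s~ | phi), entropy = H[s~]; baseline is the
    moving-average b. Differentiating reproduces the rule above."""
    reward = log_py.detach() - baseline          # centered reward
    surrogate = log_py + lr * reward * log_ps + le * entropy
    return -surrogate.mean()

# Moving-average baseline, updated once per mini-batch k:
# baseline = 0.9 * baseline + 0.1 * log_py.detach().mean()
```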

Deterministic “soft” attention

Learning stochastic attention requires sampling the attention location s_t each time; instead, we can take the expectation of the context vector ẑ_t directly,

\mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \sum_{i=1}^L \alpha_{t,i}\, \phi_i \qquad (3.8)

and formulate a deterministic attention model by computing a soft attention weighted annotation vector \psi\left( \{\phi_i\}, \{\alpha_i\} \right) = \sum_{i}^{L} \alpha_i \phi_i, as proposed by Bahdanau et al. [2014]. This corresponds to feeding a softly weighted context into the system. The whole model is smooth and differentiable under the deterministic attention, so learning end-to-end is trivial using standard back-propagation. Learning the deterministic attention can also be understood as approximately optimizing the marginal likelihood in Eq. (3.5) under the attention location random variable s_t from Sec. 3.3.2. The hidden activation of the LSTM, a_t, is a linear projection of the stochastic context vector ẑ_t followed by a tanh non-linearity.

To first-order Taylor approximation, the expected value \mathbb{E}_{p(s_t \mid a)}[a_t] is equivalent to computing a_t using a single forward computation with the expected context vector \mathbb{E}_{p(s_t \mid a)}[\hat{z}_t].

Let us denote by n_{t,i} the quantity n in Eq. (3.2) with ẑ_t set to φ_i. Then we can write the normalized weighted geometric mean (NWGM) of the softmax of the k-th word prediction as

\text{NWGM}[p(y_t = k \mid \phi)] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1 \mid a)}} = \frac{\exp\left( \mathbb{E}_{p(s_t \mid a)}[n_{t,k}] \right)}{\sum_j \exp\left( \mathbb{E}_{p(s_t \mid a)}[n_{t,j}] \right)}

This implies that the NWGM of the word prediction can be well approximated by using the expected context vector \mathbb{E}[\hat{z}_t], instead of the sampled context vector φ_i. Furthermore, from the result of Baldi and Sadowski [2014], the NWGM above, which can be computed by a single feedforward computation, approximates the expectation \mathbb{E}[p(y_t = k \mid \phi)] of the output over all possible attention locations induced by the random variable s_t. This suggests that the proposed deterministic attention model approximately maximizes the marginal likelihood over all possible attention locations.

Doubly stochastic attention

In training the deterministic version of our model, we introduce a form of doubly stochastic regularization that encourages the model to pay equal attention to every part of the image. Whereas the attention at every point in time sums to 1 by construction (i.e. \sum_i \alpha_{ti} = 1), the attention \sum_t \alpha_{ti} at each location is not constrained in any way. This makes it possible for the decoder to ignore some parts of the input image. In order to alleviate this, we encourage \sum_t \alpha_{ti} \approx \tau where \tau \geq L/D. In our experiments, we observed that this penalty quantitatively improves overall performance and qualitatively leads to more descriptive captions.

Additionally, the soft attention model predicts a gating scalar β from the previous hidden state a_{t-1} at each time step t, such that \psi\left( \{\phi_i\}, \{\alpha_i\} \right) = \beta \sum_{i}^{L} \alpha_i \phi_i, where \beta_t = \sigma(f_\beta(a_{t-1})). This gating variable lets the decoder decide whether to put more emphasis on language modeling or on the context at each time step. Qualitatively, we observe that the gating variable is larger when the decoder describes an object in the image.

The soft attention model is trained end-to-end by minimizing the following penalized negative log-likelihood:

\mathcal{L}_d = -\log p(y \mid \phi) + \lambda \sum_{i}^{L} \left( 1 - \sum_{t}^{C} \alpha_{ti} \right)^2, \qquad (3.9)

where we simply fixed τ to 1.
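A sketch of the penalized objective in Eq. (3.9), assuming `alphas` stacks the attention weights over time:

```python
import torch

def doubly_stochastic_loss(nll, alphas, lam):
    """alphas: (C, L) attention weights over C time steps and L locations.
    The per-location attention summed over time is pushed toward tau = 1."""
    penalty = ((1.0 - alphas.sum(dim=0)) ** 2).sum()
    return nll + lam * penalty
```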

Training procedure

Both variants of our attention model were trained with stochastic gradient descent using adaptive learning rates. For the Flickr8k dataset, we found that RMSProp [Tieleman and Hinton, 2012b] worked best, while for the Flickr30k/MS COCO datasets we found the recently proposed Adam algorithm [Kingma and Ba, 2014c] to be quite effective.

To create the annotation vectors φ_i used by our decoder, we used the Oxford VGGnet [Simonyan and Zisserman, 2014a] pre-trained on ImageNet without finetuning. In our experiments we use the 14 × 14 × 512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196 × 512 (i.e. L × D) encoding. In principle, however, any encoding function could be used. In addition, with enough data, the encoder could also be trained from scratch (or fine-tuned) with the rest of the model.

As our implementation requires time proportional to the length of the longest sentence per update, we found training on a random group of captions to be computationally wasteful. To mitigate this problem, in preprocessing we build a dictionary mapping the length of a sentence to the corresponding subset of captions. Then, during training we randomly sample a length and retrieve a mini-batch of size 64 of that length; a sketch is given below. We found that this greatly improved convergence speed with no noticeable diminution in performance. On our largest dataset (MS COCO), our soft attention model took less than 3 days to train on an NVIDIA Titan Black GPU.

In addition to dropout [Srivastava et al., 2014], the only other regularization strategy we used was early stopping on BLEU score. We observed a breakdown in the correlation between the validation set log-likelihood and BLEU in the later stages of training during our experiments. Since BLEU is the most commonly reported metric, we used BLEU on our validation set for model selection.
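A sketch of the length-bucketed mini-batch sampling described above (hypothetical helper names):

```python
import random
from collections import defaultdict

def build_length_buckets(captions):
    """Map caption length -> indices of captions with that length,
    as in the preprocessing step described above."""
    buckets = defaultdict(list)
    for idx, cap in enumerate(captions):
        buckets[len(cap)].append(idx)
    return buckets

def sample_minibatch(buckets, batch_size=64):
    """Sample a length, then a mini-batch of captions of that length,
    so no update wastes computation on padding."""
    length = random.choice(list(buckets))
    pool = buckets[length]
    return random.sample(pool, min(batch_size, len(pool)))
```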

In our experiments with soft attention, we used Whetlab1 [Snoek et al., 2012, 2014] for hyperparameter search in our Flickr8k experiments. Some of the intuitions we gained from the hyperparameter regions it explored were especially important in our Flickr30k and COCO experiments. We make our code for these models publicly available to encourage future research in this area2.

Figure 3.3: Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word)

3.3.3 Experiments

Dataset     Model                                      BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR
Flickr8k    Google NIC [Vinyals et al., 2014a]†Σ       63       41       27       —        —
            Log Bilinear [Kiros et al., 2014a]◦        65.6     42.4     27.7     17.7     17.31
            Soft-Attention                             67       44.8     29.9     19.5     18.93
            Hard-Attention                             67       45.7     31.4     21.3     20.30
Flickr30k   Google NIC†◦Σ                              66.3     42.3     27.7     18.3     —
            Log Bilinear                               60.0     38       25.4     17.1     16.88
            Soft-Attention                             66.7     43.4     28.8     19.1     18.49
            Hard-Attention                             66.9     43.9     29.6     19.9     18.46
COCO        CMU/MS Research [Chen and Zitnick, 2014]a  —        —        —        —        20.41
            MS Research [Fang et al., 2014]†a          —        —        —        —        20.71
            BRNN [Karpathy and Li, 2014]◦              64.2     45.1     30.4     20.3     —
            Google NIC†◦Σ                              66.6     46.1     32.9     24.6     —
            Log Bilinear◦                              70.8     48.9     34.4     24.3     20.03
            Soft-Attention                             70.7     49.2     34.4     24.3     23.90
            Hard-Attention                             71.8     50.4     35.7     25.0     23.04

Table 3.1: BLEU-1,2,3,4/METEOR metrics compared to other methods. † indicates a different split, (—) indicates an unknown metric, ◦ indicates the authors kindly provided missing metrics by personal communication, Σ indicates an ensemble, a indicates using AlexNet.

We describe our experimental methodology and quantitative results which validate the effectiveness of our model for caption generation.

1https://www.whetlab.com/
2https://github.com/kelvinxu/arctic-captions

Data

We report results on the widely used Flickr8k and Flickr30k datasets as well as the more recently introduced MS COCO dataset. Each image in the Flickr8k/30k datasets has 5 reference captions. In preprocessing the COCO dataset, we maintained the same number of references between our datasets by discarding captions in excess of 5. We applied only basic tokenization to MS COCO so that it is consistent with the tokenization present in Flickr8k and Flickr30k. For all our experiments, we used a fixed vocabulary size of 10,000.

Results for our attention-based architecture are reported in Table 3.1. We report results with the frequently used BLEU metric3, which is the standard in image caption generation research. We report BLEU4 from 1 to 4 without a brevity penalty. There has been, however, criticism of BLEU, so we also report another common metric, METEOR [Denkowski and Lavie, 2014], and compare whenever possible.

Figure 3.4: Examples of mistakes where we can use attention to gain intuition into what the model saw.

Evaluation procedures

A few challenges exist for comparison, which we explain here. The first challenge is a difference in choice of convolutional feature extractor. For identical decoder architectures, using more recent architectures such as GoogLeNet [Szegedy et al., 2014a] or Oxford VGG [Simonyan and Zisserman, 2014a] can give a boost in performance over using the AlexNet [Krizhevsky et al., 2012c]. In our evaluation, we compare directly only with results which use the comparable GoogLeNet/Oxford VGG features, but for METEOR comparison we include some results that use AlexNet. The second challenge is a single model versus ensemble comparison. While other methods have reported performance boosts by using ensembling, in our results we report a single model performance.

3We verified that our BLEU evaluation code matches the authors of Vinyals et al. [2014a], Karpathy and Li [2014] and Kiros et al. [2014b]. For fairness, we only compare against results for which we have verified that our BLEU evaluation code is the same.
4BLEU-n is the geometric average of the n-gram precisions. For instance, BLEU-1 is the unigram precision, and BLEU-2 is the geometric average of the unigram and bigram precisions.

Finally, there is a challenge due to differences between dataset splits. In our reported results, we use the pre-defined splits of Flickr8k. However, for the Flickr30k and COCO datasets there is a lack of standardized splits for which results are reported. As a result, we report results with the publicly available splits5 used in previous work [Karpathy and Li, 2014]. We note, however, that the differences in splits do not make a substantial difference in overall performance.

Quantitative analysis

In Table 3.1, we provide a summary of the experiments validating the quantitative effectiveness of attention. We obtain state-of-the-art performance on Flickr8k, Flickr30k and MS COCO. In addition, we note that in our experiments we are able to significantly improve the state-of-the-art METEOR score on MS COCO. We speculate that this is connected to some of the regularization techniques we used (see Sec. 3.3.2) and our lower-level representation.

Qualitative analysis: learning to attend

By visualizing the attention learned by the model, we are able to add an extra layer of interpretability to the output of the model (see Fig. 3.1). Other systems that have done this rely on object detection systems to produce candidate alignment targets [Karpathy and Li, 2014]. Our approach is much more flexible, since the model can attend to "non-object" salient regions.

The 19-layer OxfordNet uses stacks of 3x3 filters, meaning the only time the feature maps decrease in size is due to the max pooling layers. The input image is resized so that the shortest side is 256 pixels, with preserved aspect ratio. The input to the convolutional network is the center-cropped 224x224 image. Consequently, with four max pooling layers, the output dimension of the top convolutional layer is 14x14. Thus, in order to visualize the attention weights for the soft model, we upsample the weights by a factor of 2^4 = 16 and apply a Gaussian filter to emulate the large receptive field size.

As we can see in Figs. 3.2 and 3.3, the model learns alignments that agree very strongly with human intuition. Especially from the examples of mistakes in Fig. 3.4, we see that it is possible to exploit such visualizations to get an intuition as to why those mistakes were made. We provide a more extensive list of visualizations in the supplementary materials for the reader.

3.4 Generating images

Statistical natural image modelling remains a fundamental problem in computer vision and image understanding. This has motivated recent approaches in generative modelling applied to natural images by employing deep neural networks for their inference and generative components. Image generative models studied previously are often restricted to learning unconditional models of the image distribution, or models conditioned on simple structured annotations, for example classification labels. Despite the advances in generative models, learning the highly structured natural image distribution in the high dimensional pixel space alone proves to be a difficult task. In the real world, however, images rarely appear in isolation. They are often accompanied by unstructured textual descriptions on web pages and in books. One domain presents a substantial amount of relevant information for the other domain. The

5http://cs.stanford.edu/people/karpathy/deepimagesent/

Figure 3.5: The image-to-caption generative model

additional information from images and unstructured text descriptions can be used to simplify the image modelling task.

There are two directions in learning a generative model of images and text. One approach is to learn a text generative model conditioned on images. A significant amount of recent work has focused on generating captions from images [Karpathy and Li, 2015], [Xu et al., 2015], [Kiros et al., 2014d], etc. These models take an image descriptor and generate unstructured text through a recurrent decoder. By contrast, a generative model of images and text may also be studied by generating images that correctly interpret the text description. Generating high dimensional realistic images from their descriptions is a more difficult approach that combines two challenging components: language modelling and image generation. Namely, the model has to capture the semantic meaning expressed in the description and then use that knowledge to generate the pixel intensities of the image. Although interesting high dimensional natural images lie on a small manifold that is difficult to capture, the additional text description of a target image may simplify the learning problem by focusing on the conditional distribution.

In this section, we illustrate how simple sequential deep learning techniques can be used to build an effective conditional probabilistic model over the natural image space. By using a sequence-to-sequence framework to approach the problem of image generation from unstructured natural language captions, our model iteratively draws patches on a canvas, while attending to the relevant words in the description.

3.4.1 Model architecture

Our proposed model can be viewed as an instance of the sequence-to-sequence framework [Sutskever et al., 2014a], [Cho et al., 2014b], [Srivastava et al., 2015], where captions are represented as a sequence of consecutive words and images are represented as a sequence of patches drawn on a canvas over time t = 1, ..., T. Let y be the input caption, consisting of N words y_1, y_2, ..., y_N, and let x be the image corresponding to that caption.

Language model: the bidirectional attention RNN

The input caption sentence is fed into a deterministic bidirectional LSTM that encodes the variable-size sentence into a vector representation s. The bidirectional LSTM consists of a forward LSTM and a backward LSTM, which combine information from the past and the future respectively. The forward LSTM computes the sequence of forward hidden states [\overrightarrow{h}_1^{lang}, \overrightarrow{h}_2^{lang}, \ldots, \overrightarrow{h}_N^{lang}], whereas the backward LSTM computes the sequence of backward hidden states [\overleftarrow{h}_1^{lang}, \overleftarrow{h}_2^{lang}, \ldots, \overleftarrow{h}_N^{lang}]. These hidden states are then concatenated into the sequence [h_1^{lang}, h_2^{lang}, \ldots, h_N^{lang}], where h_n^{lang} = [\overrightarrow{h}_n^{lang}, \overleftarrow{h}_n^{lang}], 1 \leq n \leq N.

Image model: the conditional DRAW network

The DRAW network Gregor et al. [2015c] is a sequential probabilistic model that generates images by accumulating an output at each iterative step. While the original DRAW network assumes the latent variables are independent, it has been shown in [Bachman and Precup, 2015] that model performance is improved by including dependencies between the latent variables. We extended the DRAW network's generative process to include an additional input caption from the language model described in Sec. 3.4.1. Similarly to the original DRAW network, the conditional DRAW network is a stochastic recurrent neural network that consists of an inference LSTM that infers the distribution of the latent variables of the image x given y, and a generative LSTM that uses the inferred latent variables in order to reconstruct the image x given y. The align function is used to compute the alignment between the input caption and the intermediate image generative steps, as in Bahdanau et al. [2015a]. Formally, the image is generated by iteratively computing the following equations for t = 1, ..., T:

\hat{x}_t = x - \sigma(c_{t-1}) \qquad (3.10)
r_t = read(x_t, \hat{x}_t, h_{t-1}^{dec}) \qquad (3.11)
h_t^{enc} = LSTM^{enc}(h_{t-1}^{enc}, [r_t, h_{t-1}^{dec}]) \qquad (3.12)
z_t \sim q(Z_t \mid h_t^{enc}) \qquad (3.13)
h_t^{dec} = LSTM^{dec}(h_{t-1}^{dec}, z_t, s_{t-1}) \qquad (3.14)
s_t = align(h_{t-1}^{dec}, h^{lang}) \qquad (3.15)
c_t = c_{t-1} + write(h_t^{dec}) \qquad (3.16)

where read and write are the same attention operators as in [Gregor et al., 2015c]. Given the caption representation from the language model, h^{lang} = [h_1^{lang}, h_2^{lang}, \ldots, h_N^{lang}], the align operator computes the final sentence representation s_t through a weighted sum using alignment probabilities \alpha_{1 \ldots N}:

s_t = align(h_{t-1}^{dec}, h^{lang}) = \alpha_1 h_1^{lang} + \alpha_2 h_2^{lang} + \ldots + \alpha_N h_N^{lang}. \qquad (3.17)

The corresponding alignment probabilities \alpha_{1 \ldots N} at each step are obtained by:

e_{tj} = v^T \tanh(U h_j^{lang} + W h_t^{dec} + b), \qquad (3.18)
\alpha_j = \frac{\exp(e_{tj})}{\sum_{j=1}^N \exp(e_{tj})}. \qquad (3.19)

Here h_0^{lang} is initialized to a learned bias. Setting \alpha_{1 \ldots N} to \frac{1}{N} turns the encoder into the vanilla model introduced in [Cho et al., 2014b], without the attention.
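A sketch of the align operator (Eqns. 3.17–3.19); `U`, `W`, `v`, `b` are the learned parameters, with shapes assumed:

```python
import torch

def align(h_dec, h_lang, U, W, v, b):
    """h_lang: (N, d_lang) hidden states from the bidirectional LSTM;
    h_dec: the DRAW decoder state. Returns the sentence representation
    s_t and the alignment probabilities."""
    e = torch.tanh(h_lang @ U.T + h_dec @ W.T + b) @ v   # e_{tj}, shape (N,)
    alpha = torch.softmax(e, dim=0)                      # alignment probabilities
    s = (alpha.unsqueeze(1) * h_lang).sum(dim=0)         # weighted sum (Eqn. 3.17)
    return s, alpha
```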

3.4.2 Learning

The model is learned using a modified version of the Stochastic Gradient Variational Bayes (SGVB) algorithm introduced by Kingma and Welling [2014a]. The model is trained to maximize the lower bound \mathcal{L} of the marginal likelihood of the correct image x given the input caption y. The bound \mathcal{L} is decomposed into the latent loss \mathcal{L}_z and the reconstruction loss \mathcal{L}_x.

The reconstruction loss \mathcal{L}_x equals \frac{1}{L} \sum_{l=1}^L \log p(x \mid y, z^l), where L is the number of samples used during training, which was set to 1 in our experiments. The latent loss \mathcal{L}_z is a negative sum of Kullback-Leibler divergence terms between the distribution q(Z_t \mid h_t^{enc}) and some prior distribution p(Z_t) over time t = 1, ..., T, which can be seen as a regularization term. Since the patches drawn on the canvas over time are not independent of each other, the sufficient statistics of the prior distribution at time t should naturally depend on the sufficient statistics of the prior distribution at time t - 1. Therefore, instead of setting p(Z_1), ..., p(Z_T) to be independent unit Gaussian distributions, the mean and variance of p(Z_t) depend on h_{t-1}^{dec}, which forms a Markov chain p(Z_1), p(Z_2 \mid Z_1), \ldots, p(Z_T \mid Z_{T-1}) as in [Bachman and Precup, 2015], where

\mu_t^{prior} = \tanh(W_\mu h_{t-1}^{dec}) \qquad (3.20)
\sigma_t^{prior} = \exp\left( \tanh(W_\sigma h_{t-1}^{dec}) \right) \qquad (3.21)
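A sketch of the Markov prior statistics (Eqns. 3.20–3.21) and the per-step diagonal-Gaussian KL term that appears in the loss below:

```python
import torch

def prior_params(h_dec_prev, W_mu, W_sigma):
    """Sufficient statistics of p(Z_t | Z_{t-1}): both depend on the
    previous decoder state, forming a Markov chain over the latents."""
    mu = torch.tanh(h_dec_prev @ W_mu.T)
    sigma = torch.exp(torch.tanh(h_dec_prev @ W_sigma.T))
    return mu, sigma

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, one term of Eqn. 3.22."""
    return (torch.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5).sum()
```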

Overall, the loss function \mathcal{L} is calculated as follows:

\mathcal{L} = \mathbb{E}_{Q(Z_{1:T} \mid y, x)}\left[ -\log p(x \mid y, Z_{1:T}) + \sum_{t=2}^T D_{KL}\left( q(Z_t \mid Z_{t-1}, y, x) \,\|\, p(Z_t \mid Z_{t-1}, y) \right) \right] + D_{KL}\left( q(Z_1 \mid y, x) \,\|\, p(Z_1 \mid y) \right). \qquad (3.22)

The expectation can be approximated using M Monte Carlo samples \tilde{Z}_{1:T} from q(Z_{1:T} \mid y, x):

\mathcal{L} \approx \frac{1}{M} \sum_{m=1}^M \left[ -\log p(x \mid y, \tilde{Z}_{1:T}^m) + \sum_{t=2}^T D_{KL}\left( q(Z_t \mid \tilde{Z}_{t-1}^m, y, x) \,\|\, p(Z_t \mid \tilde{Z}_{t-1}^m, y) \right) \right] + D_{KL}\left( q(Z_1 \mid y, x) \,\|\, p(Z_1 \mid y) \right). \qquad (3.23)

3.4.3 Generating images from captions

During the image generation step, we discard the inference network from the decoder and instead sample from the prior distribution. Due to the blurriness of samples generated by the DRAW model, we perform an additional post-processing step, where we use the generator of an adversarial network trained on residuals of a Laplacian pyramid to sharpen the generated images, similar to [Denton et al., 2015]. By fixing the prior of the adversarial generator to the mean of the uniform distribution, it gets treated as a deterministic neural network, which allows us to calculate the lower bound of the likelihood. The reconstruction loss becomes the loss between the sharpened image and the correct image, whereas the latent loss stays the same. We also noticed that sampling from the mean of the uniform distribution allowed us to generate much less noisy samples than sampling from the uniform distribution itself.

[Figure 3.6: Examples of changing the color while keeping the caption fixed. Captions: "A yellow school bus parked in a parking lot." / "A red school bus parked in a parking lot." / "A green school bus parked in a parking lot." / "A blue school bus parked in a parking lot."]

3.4.4 Experiments

COCO

Microsoft COCO [Lin et al., 2014b] is the largest dataset of images annotated with captions, consisting of roughly 83k images. The rich collection of images with a variety of styles, backgrounds and objects makes the task of learning a good generative model conditioned on captions very challenging. Since some of the images have more than five captions attached to them, for consistency with related work on caption generation we disregard the extra captions. In the following subsections we make both qualitative and quantitative analyses of our model, as well as compare its performance with the performance of other related generative models.

Analysis of generated images

The main goal of image generation is to learn a model that can understand the semantic meaning expressed in the descriptions of the images, such as the properties of objects, the relationships between them, etc., and then use that knowledge to generate relevant images. To verify this, we wrote a set of captions inspired by the COCO dataset and changed some words in the captions to see whether the model made the relevant changes in the generated samples.

First, we wanted to see whether the model understood one of the most basic properties of any object, the color. As shown in Figure 3.6, we generated images of school buses with four different colors: yellow, red, green and blue. Although there are images of buses with different colors in the training set, all school buses specifically are colored yellow. Despite that, the model managed to generate images of an object reminiscent of a school bus painted with the relevant color.

Apart from changing the colors of objects, we were curious whether changing the background of the scene described in the caption would result in appropriate changes in the generated samples. Changing the background in images is a somewhat harder task for the model than changing the color, due to the larger visual area in images taken by the background. Nevertheless, as shown in Figure 3.7, changing the skies from blue to rainy as well as changing the grass type from dry to green resulted in the appropriate changes. The nearest images from the training set also indicate that the model was not simply copying the patterns it observed during the learning phase.

Despite an infinite number of ways of changing colors and backgrounds in descriptions, in general we found that the model made appropriate changes as long as some similar pattern was present in the training set. However, this wasn't always the case when changing an object itself in the description.

Figure 3.7: Examples of changing the background while keeping the caption fixed. The captions read "A very large commercial plane flying in blue skies.", "A very large commercial plane flying in rainy skies.", "A herd of elephants walking across a dry grass field." and "A herd of elephants walking across a green grass field.". The respective nearest training images based on pixelwise distance are displayed on top.

As you can see in Figure 3.8, when the objects did not have clear fine-grained differences, such as a distinctive shape or color, the relevant changes in the generated samples were not clearly visible. This highlights the model's limited ability to grasp a detailed understanding of each object.

Figure 3.8: Examples of changing the object while keeping the caption fixed. The captions read "The decadent chocolate dessert is on the table.", "A bowl of bananas is on the table.", "A vintage photo of a cat." and "A vintage photo of a dog.".

Analysis of attention

Unfortunately, we found no connection between the patches drawn on the canvas and the most attended words at each timestep. During image generation, the model mostly focused on the few words that carried the semantic meaning of the caption. The words attended to most during all generation steps indicated the kind of scene the model would generate. For example, as shown in Figure 3.9, attending equally to the words "desert" and "forest" allowed the model to make the relevant changes in the scene, whereas in the second example the model completely ignored the word "sun" and made no changes.

Figure 3.9: Examples of most attended words while changing the background in the caption. The captions read "A rider on a blue motorcycle in the desert.", "A rider on a blue motorcycle in the forest.", "A surfer, a woman, and a child walk on the beach." and "A surfer, a woman, and a child walk on the sun.".

Figure 3.10: Four different models displaying results from sampling the caption "A group of people walk on a beach with surf boards."; the labeled panels include Align DRAW, LAPGAN and Conv. VAE.

Comparison with other models

The quantitative evaluation of generative models has been a subject of ambiguity in the machine learning community. Compared to reporting classification accuracies for discriminative models, the measures defining generative models are intractable most of the time and might not correctly reflect the real quality of the model. To better compare the performance of generative models, we report results on two different metrics as well as a qualitative comparison of the different models. As shown in Figure 3.10, we generated several samples from the prior of each of the current state-of-the-art generative models corresponding to the caption "A group of people walk on a beach with surf boards". While all of the samples look sharp, the images generated by LAPGAN look noisier and do not have a very clear structure in them, whereas the images generated by variational models trained with the L2 cost function have a watercolor effect. As for the quantitative comparison of the different models, we first compare the performance of the models trained with variational methods. We rank the images in the test set conditioned on the captions based on the variational lower bound of the likelihood and then report the Precision-Recall metric to evaluate the quality of the generative model. As we expected, the quality of image retrieval using generative models is worse than that of discriminative models that were specifically built for retrieval. To deal with the large computational complexity of looping through each test image, we create a shortlist of one hundred images, including the correct one, based on the convolutional features of a VGG-like model trained on the CIFAR dataset. Since there are "easy" images to which the model assigns high likelihood independent of the query caption, we look at the ratio of the likelihood of the image conditioned on the sentence to the likelihood of the image conditioned on the mean sentence representation in the training set. We found that the reconstruction error L_x increased for the sharpened images, which considerably hurt the retrieval results. Since sharpening changes the statistics of the images, computing the reconstruction error for each pixel is not necessarily a good metric. Instead of calculating the error per pixel, we turn to a smarter metric, the Structural Similarity Index (SSI), which incorporates luminance and contrast masking into the error calculation. Due to the strong inter-dependencies between nearby pixels, the metric is calculated on small windows of the image. To calculate SSI, we sampled fifty images from the prior of each generative model for each caption in the test set.
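For reference, the windowed SSI computation can be sketched in a few lines. This is an illustration rather than the exact evaluation script: it relies on scikit-image's structural_similarity, and the helper name mean_ssi, its arguments and the [0, 1] scaling are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssi(samples, reference, win_size=7):
    """Average windowed SSIM between generated samples and a reference image.

    samples:   array of shape (N, H, W), grayscale images scaled to [0, 1]
    reference: array of shape (H, W)
    The score is computed over small sliding windows, which accounts for
    the strong dependencies between nearby pixels.
    """
    scores = [structural_similarity(s, reference,
                                    win_size=win_size, data_range=1.0)
              for s in samples]
    return float(np.mean(scores))
```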

COCO (before sharpening)

                            Image Search                      Image Similarity
Model                       R@1   R@5   R@10  R@50  Med r     SSI
LAPGAN                      -     -     -     -     -         0.08
Fully-conn. VAE (L2 cost)   1.0   5.6   10.4  51.1  51        0.156
Conv. VAE (L2 cost)         1.0   5.9   10.9  50.8  50        0.164
Skipthought DRAW            2.0   11.2  18.9  63.3  36        0.157
Noalign DRAW                2.8   14.1  23.1  68.0  31        0.155
Align DRAW                  3.0   14.0  22.9  68.5  31        0.156

3.5 Summary

In this chapter, we discussed two attention-based neural networks that tackle the image and caption generation problems. Empirically, our attention-based approaches obtain state-of-the-art performance on standard vision benchmark datasets. We also showed how the learned attention can be exploited to give more interpretability into the model's generation process, and demonstrated that the learned alignments correspond very well to human intuition.

Chapter 4

Stabilizing RNN training with layer normalization

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this chapter, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.

4.1 Motivation

Deep neural networks trained with some version of stochastic gradient descent have been shown to substantially outperform previous approaches on various supervised learning tasks in computer vision [Krizhevsky et al., 2012e] and speech processing [Hinton et al., 2012b]. But state-of-the-art deep neural networks often require many days of training. It is possible to speed up learning by computing gradients for different subsets of the training cases on different machines, or by splitting the neural network itself over many machines [Dean et al., 2012], but this can require a lot of communication and complex software. It also tends to lead to rapidly diminishing returns as the degree of parallelization increases. An orthogonal approach is to modify the computations performed in the forward pass of the neural net to make learning easier. Recently, batch normalization [Ioffe and Szegedy, 2015b] has been proposed to reduce training time by including additional normalization stages in deep neural networks.

The normalization standardizes each summed input using its mean and its standard deviation across the training data. Feedforward neural networks trained using batch normalization converge faster even with simple SGD. In addition to the improvement in training time, the stochasticity from the batch statistics serves as a regularizer during training. Despite its simplicity, batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence, so applying batch normalization to RNNs appears to require different statistics for different time-steps. Furthermore, batch normalization cannot be applied to online learning tasks or to extremely large distributed models where the minibatches have to be small. This chapter introduces layer normalization, a simple normalization method to improve the training speed of various neural network models. Unlike batch normalization, the proposed method directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer, so the normalization does not introduce any new dependencies between training cases. We show that layer normalization works well for RNNs and improves both the training time and the generalization performance of several existing RNN models.

4.2 Batch and weight normalization

Consider the d-th hidden layer in a deep feed-forward neural network, and let z_d be the vector representation of the summed inputs to the neurons in that layer. The summed inputs are computed through a linear projection with the weight matrix W_d and the bottom-up inputs a_d as follows:

z_{d,i} = w_{d,i}^\top a_d, \qquad a_{d+1,i} = f(z_{d,i} + b_{d,i})   (4.1)

where f(\cdot) is an element-wise non-linear function, w_{d,i} is the vector of incoming weights to the i-th hidden unit in the d-th layer and b_{d,i} is its scalar bias parameter. The parameters in the neural network are learnt using gradient-based optimization algorithms, with the gradients computed by back-propagation. One of the challenges of deep learning is that the gradients with respect to the weights in one layer are highly dependent on the outputs of the neurons in the previous layer, especially if these outputs change in a highly correlated way. Batch normalization [Ioffe and Szegedy, 2015b] was proposed to reduce such undesirable "covariate shift". The method normalizes the summed inputs to each hidden unit over the training cases. Specifically, for the i-th summed input in the d-th layer, the batch normalization method rescales the summed inputs according to their variances under the distribution of the data:

\bar{z}_{d,i} = \frac{g_{d,i}}{\sigma_{d,i}} (z_{d,i} - \mu_{d,i}), \qquad \mu_{d,i} = \mathbb{E}_{x \sim p(x)}[z_{d,i}], \qquad \sigma_{d,i} = \sqrt{\mathbb{E}_{x \sim p(x)}\left[(z_{d,i} - \mu_{d,i})^2\right]}   (4.2)

where \bar{z}_{d,i} is the normalized summed input to the i-th hidden unit in the d-th layer and g_{d,i} is a gain parameter scaling the normalized activation before the non-linear activation function. Note that the expectation is under the whole training data distribution. It is typically impractical to compute the expectations in Eq. (4.2) exactly, since it would require forward passes through the whole training dataset with the current set of weights. Instead, \mu and \sigma are estimated using the empirical samples from the current mini-batch. This puts constraints on the size of a mini-batch, and it is hard to apply to recurrent neural networks.
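For concreteness, a minimal numpy sketch of the mini-batch estimate of Eq. (4.2); the epsilon term is an added numerical-stability assumption rather than part of the equation.

```python
import numpy as np

def batch_norm(Z, g, b, eps=1e-5):
    """Batch-normalize summed inputs Z of shape (batch_size, H).

    mu and sigma are per-unit mini-batch estimates of the expectations
    in Eq. (4.2); g and b are the per-unit gain and bias.
    """
    mu = Z.mean(axis=0)                                  # mean over the batch
    sigma = np.sqrt(((Z - mu) ** 2).mean(axis=0) + eps)  # std over the batch
    return g * (Z - mu) / sigma + b
```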

4.3 Layer normalization

We now consider the layer normalization method, which is designed to overcome the drawbacks of batch normalization. Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot. This suggests the "covariate shift" problem can be reduced by fixing the mean and the variance of the summed inputs within each layer. We thus compute the layer normalization statistics over all the hidden units in the same layer as follows:

\mu_d = \frac{1}{H} \sum_{i=1}^{H} z_{d,i}, \qquad \sigma_d = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (z_{d,i} - \mu_d)^2}   (4.3)

where H denotes the number of hidden units in a layer. The difference between Eq. (4.2) and Eq. (4.3) is that under layer normalization, all the hidden units in a layer share the same normalization terms \mu and \sigma, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of a mini-batch and it can be used in the pure online regime with batch size 1.
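The corresponding computation for a single training case is equally simple. A minimal numpy sketch of Eq. (4.3) together with the adaptive gain and bias; the epsilon term is again an assumed numerical-stability constant.

```python
import numpy as np

def layer_norm(z, g, b, eps=1e-5):
    """Layer-normalize the summed inputs z of one training case, shape (H,).

    The statistics are computed across the H hidden units of the layer,
    so no other training case is involved.
    """
    mu = z.mean()
    sigma = np.sqrt(((z - mu) ** 2).mean() + eps)
    return g * (z - mu) / sigma + b
```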

4.3.1 Layer normalized recurrent neural networks

The recent sequence-to-sequence models [Sutskever et al., 2014b] utilize compact recurrent neural networks to solve sequential prediction problems in natural language processing. It is common among NLP tasks to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps. In a standard RNN, the summed inputs in the recurrent layer are computed from the current input x_t and the previous vector of hidden states a_{t-1} as z_t = W_{rec} a_{t-1} + W_{in} x_t. The layer normalized recurrent layer re-centers and re-scales its activations using extra normalization terms similar to Eq. (4.3):

a_t = f\left[\frac{g}{\sigma_t} \odot (z_t - \mu_t) + b\right], \qquad \mu_t = \frac{1}{H} \sum_{i=1}^{H} z_{t,i}, \qquad \sigma_t = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (z_{t,i} - \mu_t)^2}   (4.4)

where W_{rec} is the recurrent hidden-to-hidden weight matrix and W_{in} is the bottom-up input-to-hidden weight matrix, \odot is the element-wise multiplication between two vectors, and b and g are the bias and gain parameters of the same dimension as a_t. In a standard RNN, there is a tendency for the average magnitude of the summed inputs to the recurrent units to either grow or shrink at every time-step, leading to exploding or vanishing gradients. In a layer normalized RNN, the normalization terms make it invariant to re-scaling all of the summed inputs to a layer, which results in much more stable hidden-to-hidden dynamics.
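A sketch of a single time step of Eq. (4.4), reusing the layer_norm helper above; the function name ln_rnn_step and the tanh default are illustrative assumptions.

```python
import numpy as np

def ln_rnn_step(a_prev, x_t, W_rec, W_in, g, b, f=np.tanh):
    """One recurrent step of a layer-normalized RNN (Eq. 4.4).

    The normalization statistics depend only on this time step's summed
    inputs z_t, and the same g and b are shared across all time steps.
    """
    z_t = W_rec @ a_prev + W_in @ x_t   # summed inputs at time t
    return f(layer_norm(z_t, g, b))
```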

Table 4.1: Invariance properties under the normalization methods.

              Weight matrix  Weight matrix  Weight vector  Dataset     Dataset       Single training case
              re-scaling     re-centering   re-scaling     re-scaling  re-centering  re-scaling
Batch norm    Invariant      No             Invariant      Invariant   Invariant     No
Weight norm   Invariant      No             Invariant      No          No            No
Layer norm    Invariant      Invariant      No             Invariant   No            Invariant

4.4 Related work

Batch normalization has been previously extended to recurrent neural networks [Laurent et al., 2015, Amodei et al., 2015, Cooijmans et al., 2016]. The previous work [Cooijmans et al., 2016] suggests the best performance of recurrent batch normalization is obtained by keeping independent normalization statistics for each time-step. The authors show that initializing the gain parameter in the recurrent batch normalization layer to 0.1 makes a significant difference in the final performance of the model. Our work is also related to weight normalization [Salimans and Kingma, 2016]. In weight normalization, instead of the variance, the L2 norm of the incoming weights is used to normalize the summed inputs to a neuron. Applying either weight normalization or batch normalization using expected statistics is equivalent to a different parameterization of the original feed-forward neural network. Re-parameterization in the ReLU network was studied in Path-normalized SGD [Neyshabur et al., 2015]. Our proposed layer normalization method, however, is not a re-parameterization of the original neural network. The layer normalized model thus has different invariance properties than the other methods, which we will study in the following section.

4.5 Analysis

In this section, we investigate the invariance properties of different normalization schemes.

4.5.1 Invariance under weights and data transformations

The proposed layer normalization is related to batch normalization and weight normalization. Although their normalization scalars are computed differently, these methods can all be summarized as normalizing the summed input z_i to a neuron through two scalars \mu and \sigma. They also learn an adaptive bias b and gain g for each neuron after the normalization:

a_i = f\left(\frac{g_i}{\sigma_i} (z_i - \mu_i) + b_i\right)   (4.5)

Note that for layer normalization and batch normalization, \mu and \sigma are computed according to Eqs. (4.2) and (4.3). In weight normalization, \mu is 0 and \sigma = \|w\|_2.

Table 4.1 summarizes the following invariance results for the three normalization methods.

Weight re-scaling and re-centering: First, observe that under batch normalization and weight normalization, any re-scaling of the incoming weights w_i of a single neuron has no effect on the normalized summed inputs to that neuron. To be precise, under batch and weight normalization, if the weight vector is scaled by \delta, the two scalars \mu and \sigma will also be scaled by \delta. The normalized summed inputs stay the same before and after scaling, so batch and weight normalization are invariant to re-scaling of the weights. Layer normalization, on the other hand, is not invariant to the individual scaling of single weight vectors. Instead, layer normalization is invariant to scaling of the entire weight matrix and to a shift of all the incoming weights in the weight matrix. Let there be two sets of model parameters \theta and \theta' whose weight matrices W and W' differ by a scaling factor \delta, with all of the incoming weights in W' also shifted by a constant vector \gamma, that is, W' = \delta W + \mathbf{1}\gamma^\top. Under layer normalization, the two models effectively compute the same output:

a' = f\left(\frac{g}{\sigma'} (W'x - \mu') + b\right) = f\left(\frac{g}{\sigma'} \big((\delta W + \mathbf{1}\gamma^\top)x - \mu'\big) + b\right) = f\left(\frac{g}{\sigma} (Wx - \mu) + b\right) = a.   (4.6)

Notice that if the normalization were only applied to the input before the weights, the model would not be invariant to re-scaling and re-centering of the weights.

Data re-scaling and re-centering: We can show that all the normalization methods are invariant to re-scaling the dataset by verifying that the summed inputs of the neurons stay constant under the change. Furthermore, layer normalization is invariant to re-scaling of individual training cases, because the normalization scalars \mu and \sigma in Eq. (4.3) only depend on the current input data. Let x' be a new data point obtained by re-scaling x by \delta. Then we have,

a_i' = f\left(\frac{g_i}{\sigma'} (w_i^\top x' - \mu') + b_i\right) = f\left(\frac{g_i}{\delta\sigma} (\delta w_i^\top x - \delta\mu) + b_i\right) = a_i.   (4.7)

It is easy to see that re-scaling individual data points does not change the model's prediction under layer normalization. Similarly to the re-centering of the weight matrix in layer normalization, we can also show that batch normalization is invariant to re-centering of the dataset.
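The weight-matrix invariance of Eq. (4.6) is easy to verify numerically with the layer_norm sketch above; the particular scale delta and shift gamma below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 8, 5
W = rng.normal(size=(H, D))
x = rng.normal(size=D)
g, b = np.ones(H), np.zeros(H)

delta = 3.7                                # scale the entire weight matrix
gamma = rng.normal(size=D)                 # shift all incoming weights
W_prime = delta * W + np.outer(np.ones(H), gamma)  # W' = delta W + 1 gamma^T

a = layer_norm(W @ x, g, b)
a_prime = layer_norm(W_prime @ x, g, b)
assert np.allclose(a, a_prime)             # same output, as in Eq. (4.6)
```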

4.5.2 Geometry of parameter space during learning

We have investigated the invariance of the model's prediction under re-centering and re-scaling of the parameters. Learning, however, can behave very differently under different parameterizations, even though the models express the same underlying function. In this section, we analyze learning behavior through the geometry and the manifold of the parameter space. We show that the normalization scalar \sigma can implicitly reduce the learning rate and make learning more stable.

Riemannian metric

The learnable parameters in a statistical model form a smooth manifold that consists of all possible input-output relations of the model. For models whose output is a probability distribution, a natural way to measure the separation of two points on this manifold is the Kullback-Leibler divergence between their model output distributions. Under the KL divergence metric, the parameter space is a Riemannian manifold. The curvature of a Riemannian manifold is entirely captured by its Riemannian metric, whose quadratic form is denoted ds^2; this is the infinitesimal distance in the tangent space at a point in the parameter space. Intuitively, it measures the change in the model output along a tangent direction of the parameter space. The Riemannian metric under KL was previously studied [Amari, 1998] and was shown to be well approximated under a second-order Taylor expansion using the Fisher information matrix:

ds^2 = D_{KL}\big[p(y \mid x; \theta)\,\|\,p(y \mid x; \theta + \delta)\big] \approx \frac{1}{2}\,\delta^\top F(\theta)\,\delta,   (4.8)

F(\theta) = \mathbb{E}_{x \sim p(x),\, y \sim p(y \mid x)}\left[\frac{\partial \log p(y \mid x; \theta)}{\partial \theta} \left(\frac{\partial \log p(y \mid x; \theta)}{\partial \theta}\right)^\top\right],   (4.9)

where \delta is a small change to the parameters. The Riemannian metric above presents a geometric view of parameter spaces. The following analysis of the Riemannian metric provides some insight into how normalization methods could help in training neural networks.
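The approximation in Eq. (4.8) is the standard second-order expansion of the KL divergence; a worked version of the argument, assuming the usual regularity conditions for exchanging differentiation and expectation, is:

```latex
\begin{aligned}
D_{\mathrm{KL}}\big[p(y \mid x;\theta)\,\|\,p(y \mid x;\theta+\delta)\big]
 &= -\,\mathbb{E}_{y \sim p(y \mid x;\theta)}\big[\log p(y \mid x;\theta+\delta) - \log p(y \mid x;\theta)\big] \\
 &\approx -\,\mathbb{E}_{y}\big[\delta^{\top}\nabla_{\theta}\log p(y \mid x;\theta)\big]
    - \tfrac{1}{2}\,\delta^{\top}\,\mathbb{E}_{y}\big[\nabla_{\theta}^{2}\log p(y \mid x;\theta)\big]\,\delta \\
 &= \tfrac{1}{2}\,\delta^{\top} F(\theta)\,\delta ,
\end{aligned}
```

where the first-order term vanishes because \mathbb{E}_y[\nabla_\theta \log p] = 0, and F(\theta) = -\mathbb{E}_y[\nabla^2_\theta \log p] = \mathbb{E}_y[\nabla_\theta \log p\, \nabla_\theta \log p^\top].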

The geometry of normalized generalized linear models

We focus our geometric analysis on the generalized linear model. The results of the following analysis can be easily applied to understand deep neural networks with a block-diagonal approximation to the Fisher information matrix, where each block corresponds to the parameters of a single neuron.

A generalized linear model (GLM) can be regarded as parameterizing an output distribution from the exponential family using a weight vector w and a bias scalar b. To be consistent with the previous sections, the log likelihood of the GLM can be written using the summed input a as follows:

\log p(y \mid x; w, b) = \frac{(a + b)y - \eta(a + b)}{\phi} + c(y, \phi),   (4.10)

\mathbb{E}[y \mid x] = f(a + b) = f(w^\top x + b), \qquad \mathrm{Var}[y \mid x] = \phi f'(a + b),   (4.11)

where f(\cdot) is the transfer function that is the analog of the non-linearity in neural networks, f'(\cdot) is the derivative of the transfer function, \eta(\cdot) is a real-valued function and c(\cdot) is the log partition function. \phi is a constant that scales the output variance. Assume an H-dimensional output vector y = [y_1, y_2, \ldots, y_H] is modeled using H independent GLMs, so that \log p(y \mid x; W, b) = \sum_{i=1}^{H} \log p(y_i \mid x; w_i, b_i). Let W be the weight matrix whose rows are the weight vectors of the individual GLMs, let b denote the bias vector of length H and let \mathrm{vec}(\cdot) denote the Kronecker vector operator. The Fisher information matrix for the multi-dimensional GLM with respect to its parameters \theta = [w_1^\top, b_1, \ldots, w_H^\top, b_H]^\top = \mathrm{vec}([W, b]^\top) is simply the expected Kronecker product of the data features and the output covariance matrix:

F(\theta) = \mathbb{E}_{x \sim P(x)}\left[\frac{\mathrm{Cov}[y \mid x]}{\phi^2} \otimes \begin{bmatrix} x x^\top & x \\ x^\top & 1 \end{bmatrix}\right].   (4.12)

We obtain normalized GLMs by applying the normalization methods to the summed inputs a in the original model through \mu and \sigma. Without loss of generality, we denote \bar{F} as the Fisher information matrix under the normalized multi-dimensional GLM with the additional gain parameters \theta = \mathrm{vec}([W, b, g]^\top):

\bar{F}(\theta) = \begin{bmatrix} \bar{F}_{11} & \cdots & \bar{F}_{1H} \\ \vdots & \ddots & \vdots \\ \bar{F}_{H1} & \cdots & \bar{F}_{HH} \end{bmatrix}, \qquad
\bar{F}_{ij} = \mathbb{E}_{x \sim P(x)}\left[\frac{\mathrm{Cov}[y_i, y_j \mid x]}{\phi^2}
\begin{bmatrix}
\frac{g_i g_j}{\sigma_i \sigma_j} \chi_i \chi_j^\top & \frac{g_i}{\sigma_i} \chi_i & \frac{g_i (a_j - \mu_j)}{\sigma_i \sigma_j} \chi_i \\
\frac{g_j}{\sigma_j} \chi_j^\top & 1 & \frac{a_j - \mu_j}{\sigma_j} \\
\frac{g_j (a_i - \mu_i)}{\sigma_i \sigma_j} \chi_j^\top & \frac{a_i - \mu_i}{\sigma_i} & \frac{(a_i - \mu_i)(a_j - \mu_j)}{\sigma_i \sigma_j}
\end{bmatrix}\right],   (4.13)

\chi_i = x - \frac{\partial \mu_i}{\partial w_i} - \frac{a_i - \mu_i}{\sigma_i} \frac{\partial \sigma_i}{\partial w_i}.   (4.14)

Implicit learning rate reduction through the growth of the weight vector: Notice that, compared to the standard GLM, the block \bar{F}_{ij} along the weight vector w_i direction is scaled by the gain parameters and the normalization scalar \sigma_i. If the norm of the weight vector w_i grows twice as large, even though the model's output remains the same, the Fisher information matrix will be different. The curvature along the w_i direction will change by a factor of 1/2 because \sigma_i will also be twice as large. As a result, for the same parameter update in the normalized model, the norm of the weight vector effectively controls the learning rate for that weight vector. During learning, it is harder to change the orientation of a weight vector with large norm. The normalization methods therefore have an implicit "early stopping" effect on the weight vectors and help to stabilize learning towards convergence.

Learning the magnitude of incoming weights: In normalized models, the magnitude of the incoming weights is explicitly parameterized by the gain parameters. We compare how the model output changes between updating the gain parameters in the normalized GLM and updating the magnitude of the equivalent weights under the original parameterization during learning. The direction along the gain parameters in \bar{F} captures the geometry for the magnitude of the incoming weights. We show that the Riemannian metric along the magnitude of the incoming weights for the standard GLM is scaled by the norm of its input, whereas learning the gain parameters for the batch normalized and layer normalized models depends only on the magnitude of the prediction error. Learning the magnitude of incoming weights in the normalized model is therefore more robust to the scaling of the input and its parameters than in the standard model.

4.6 Experimental results

We perform experiments with layer normalization on six tasks, with a focus on recurrent neural networks: image-sentence ranking, question-answering, contextual language modelling, generative modelling, handwriting sequence generation and MNIST classification. Unless otherwise noted, the default initialization of layer normalization in the experiments is to set the adaptive gains to 1 and the biases to 0.

4.6.1 Order embeddings of images and language

In this experiment, we apply layer normalization to the recently proposed order-embeddings model of Vendrov et al. [2016] for learning a joint embedding space of images and sentences. We follow the same experimental protocol as Vendrov et al. [2016] and modify their publicly available code1, which utilizes Theano [Team et al., 2016], to incorporate layer normalization. Images and sentences from the Microsoft

1https://github.com/ivendrov/order-embedding

Figure 4.1: Recall@K curves on the image retrieval validation set using order-embeddings with and without layer normalization. (a) Recall@1, (b) Recall@5, (c) Recall@10.

MSCOCO

                             Caption Retrieval              Image Retrieval
Model                        R@1   R@5   R@10  Mean r      R@1   R@5   R@10  Mean r
Sym [Vendrov et al., 2016]   45.4  -     88.7  5.8         36.3  -     85.8  9.0
OE [Vendrov et al., 2016]    46.7  -     88.9  5.7         37.9  -     85.9  8.1
OE (ours)                    46.6  79.3  89.1  5.2         37.8  73.6  85.7  7.9
OE + LN                      48.5  80.6  89.8  5.1         38.9  74.3  86.3  7.6

Table 4.2: Average results across 5 test splits for caption and image retrieval. R@K is Recall@K (high is good). Mean r is the mean rank (low is good). Sym corresponds to the symmetric baseline while OE indicates order-embeddings.

COCO dataset [Lin et al., 2014a] are embedded into a common vector space, where a GRU [Cho et al., 2014a] is used to encode sentences and the outputs of a pre-trained VGG ConvNet [Simonyan and Zisserman, 2014b] (10-crop) are used to encode images. The order-embedding model represents images and sentences as a 2-level partial ordering and replaces the cosine similarity scoring function used in Kiros et al. [2014c] with an asymmetric one.

We trained two models: the baseline order-embedding model as well as the same model with layer normalization applied to the GRU. After every 300 iterations, we compute Recall@K (R@K) values on a held out validation set and save the model whenever R@K improves. The best performing models are then evaluated on 5 separate test sets, each containing 1000 images and 5000 captions, for which the mean results are reported. Both models use Adam [Kingma and Ba, 2014a] with the same initial hyperparameters and both models are trained using the same architectural choices as used in Vendrov et al. [2016].

Figure 4.1 illustrates the validation curves of the models, with and without layer normalization. We plot R@1, R@5 and R@10 for the image retrieval task. We observe that layer normalization offers a per-iteration speedup across all metrics and converges to its best validation model in 60% of the time it takes the baseline model to do so. In Table 4.2, the test set results are reported, from which we observe that layer normalization also results in improved generalization over the original model. The results we report are state-of-the-art for RNN embedding models, with only the structure-preserving model of Wang et al. [2016] reporting better results on this task. However, they evaluate under different conditions (1 test set instead of the mean over 5) and are thus not directly comparable.


Figure 4.2: Validation error rate against training steps (in thousands) for the attentive reader model, comparing LSTM, BN-LSTM, BN-everywhere and LN-LSTM. BN results are taken from [Cooijmans et al., 2016].

4.6.2 Teaching machines to read and comprehend

In order to compare layer normalization to the recently proposed recurrent batch normalization [Cooijmans et al., 2016], we train a unidirectional attentive reader model on the CNN corpus, both introduced by Hermann et al. [2015]. This is a question-answering task where a query description about a passage must be answered by filling in a blank. The data is anonymized such that entities are given randomized tokens to prevent degenerate solutions, and these tokens are consistently permuted during training and evaluation. We follow the same experimental protocol as Cooijmans et al. [2016] and modify their public code2, which uses Theano [Team et al., 2016], to incorporate layer normalization. We obtained the pre-processed dataset used by Cooijmans et al. [2016], which differs from the original experiments of Hermann et al. [2015] in that each passage is limited to 4 sentences. In Cooijmans et al. [2016], two variants of recurrent batch normalization are used: one where BN is only applied to the LSTM and another that applies BN everywhere throughout the model. In our experiment, we only apply layer normalization within the LSTM. The results of this experiment are shown in Figure 4.2. We observe that layer normalization not only trains faster but converges to a better validation result than both the baseline and the BN variants. In Cooijmans et al. [2016], it is argued that the scale parameter in BN must be carefully chosen and it is set to 0.1 in their experiments. We experimented with layer normalization for both 1.0 and 0.1 scale initialization and found that the former performed significantly better. This demonstrates that layer normalization is not sensitive to the initial scale in the same way that recurrent BN is.3

4.6.3 Skip-thought vectors

Skip-thoughts [Kiros et al., 2015] is a generalization of the skip-gram model [Mikolov et al., 2013] for learning unsupervised distributed sentence representations. Given contiguous text, a sentence is encoded with an encoder RNN, and decoder RNNs are used to predict the surrounding sentences. Kiros et al. [2015] showed that this model could produce generic sentence representations that perform well on several tasks without being fine-tuned. However, training this model is time-consuming, requiring several days of training in order to produce meaningful results.

2https://github.com/cooijmanstim/Attentive_reader/tree/bn
3We only produce results on the validation set, as in the case of Cooijmans et al. [2016].


Figure 4.3: Performance of skip-thought vectors with and without layer normalization on downstream tasks as a function of training iterations: (a) SICK(r), (b) SICK(MSE), (c) MR, (d) CR, (e) SUBJ, (f) MPQA. The original lines are the reported results in [Kiros et al., 2015]. Plots with error bars use 10-fold cross validation. Best seen in color.

Method                         SICK(r)  SICK(ρ)  SICK(MSE)  MR    CR    SUBJ  MPQA
Original [Kiros et al., 2015]  0.848    0.778    0.287      75.5  79.3  92.1  86.9
Ours                           0.842    0.767    0.298      77.3  81.8  92.6  87.9
Ours + LN                      0.854    0.785    0.277      79.5  82.6  93.4  89.0
Ours + LN †                    0.858    0.788    0.270      79.4  83.1  93.7  89.3

Table 4.3: Skip-thoughts results. The first two evaluation columns indicate Pearson and Spearman correlation, the third is mean squared error and the remaining columns indicate classification accuracy. Higher is better for all evaluations except MSE. Our models were trained for 1M iterations, with the exception of (†), which was trained for 1 month (approximately 1.7M iterations).

In this experiment we determine to what extent layer normalization can speed up training. Using the publicly available code of Kiros et al. [2015]4, we train two models on the BookCorpus dataset [Zhu et al., 2015]: one with and one without layer normalization. These experiments are performed with Theano [Team et al., 2016]. We adhere to the experimental setup used in Kiros et al. [2015], training a 2400-dimensional sentence encoder with the same hyperparameters. Given the size of the states used, it is conceivable that layer normalization would produce slower per-iteration updates than without. However, we found that, provided CNMeM5 is used, there was no significant difference between the two models. We checkpoint both models after every 50,000 iterations and evaluate their performance on five tasks: semantic relatedness (SICK) [Marelli et al., 2014], movie review sentiment (MR) [Pang and Lee, 2005],

customer product reviews (CR) [Hu and Liu, 2004], subjectivity/objectivity classification (SUBJ) [Pang and Lee, 2004] and opinion polarity (MPQA) [Wiebe et al., 2005]. We plot the performance of both models at each checkpoint on all tasks to determine whether the performance can be improved with LN. The experimental results are illustrated in Figure 4.3. We observe that applying layer normalization results both in a speedup over the baseline as well as better final results after 1M iterations, as shown in Table 4.3. We also let the model with layer normalization train for a total of a month, resulting in further performance gains across all but one task. We note that the performance differences between the originally reported results and ours are likely due to the fact that the publicly available code does not condition at each timestep of the decoder, whereas the original model does.

4https://github.com/ryankiros/skip-thoughts
5https://github.com/NVIDIA/cnmem

4.6.4 Modeling binarized MNIST using DRAW

We also experimented with generative modeling on the MNIST dataset. The Deep Recurrent Attention Writer (DRAW) [Gregor et al., 2015a] had previously achieved state-of-the-art performance on modeling the distribution of MNIST digits. The model uses a differential attention mechanism and a recurrent neural network to sequentially generate pieces of an image. We evaluate the effect of layer normalization on a DRAW model using 64 glimpses and 256 LSTM hidden units. The model is trained with the default settings of the Adam [Kingma and Ba, 2014a] optimizer and a minibatch size of 128. Previous publications on binarized MNIST have used various training protocols to generate their datasets. In this experiment, we used the fixed binarization from Larochelle and Murray [2011]. The dataset was split into 50,000 training, 10,000 validation and 10,000 test images. Figure 4.4 shows the test variational bound for the first 100 epochs. It highlights the speedup benefit of applying layer normalization: the layer normalized DRAW converges almost twice as fast as the baseline model. After 200 epochs, the baseline model converges to a variational log likelihood of 82.36 nats on the test data, whereas the layer normalization model obtains 82.09 nats.

Figure 4.4: DRAW model test negative log likelihood with and without layer normalization; curves are shown for the baseline, weight normalization (WN) and layer normalization (LN).

4.6.5 Handwriting sequence generation

The previous experiments mostly examine RNNs on NLP tasks whose lengths are in the range of 10 to 40. To show the effectiveness of layer normalization on longer sequences, we performed handwriting generation tasks using the IAM Online Handwriting Database [Liwicki and Bunke, 2005]. IAM-OnDB consists of handwritten lines collected from 221 different writers. Given an input character string, the goal is to predict a sequence of x and y pen co-ordinates of the corresponding handwriting line on the whiteboard. There are, in total, 12179 handwriting line sequences. The input string is typically more than 25 characters and the average handwriting line has a length of around 700. We used the same model architecture as in Section (5.2) of Graves [2013b]. The model architecture consists of three hidden layers of 400 LSTM cells, which produce 20 bivariate Gaussian mixture components at the output layer, and a size 3 input layer. The character sequence was encoded with one-hot vectors, and hence the window vectors were size 57. A mixture of 10 Gaussian functions was used for the window parameters, requiring a size 30 parameter vector. The total number of weights was increased to approximately 3.7M. The model is trained using mini-batches of size 8 and the Adam [Kingma and Ba, 2014a] optimizer. The combination of small mini-batch size and very long sequences makes it important to have very stable hidden dynamics. Figure 4.5 shows that layer normalization converges to a comparable log likelihood as the baseline model but is much faster.


Figure 4.5: Handwriting sequence generation model negative log likelihood with and without layer normalization. The models are trained with mini-batch size of 8 and sequence length of 500.


Figure 4.6: Permutation invariant MNIST 784-1000-1000-10 model negative log likelihood and test error with layer normalization and batch normalization. (Left) The models are trained with a batch size of 128. (Right) The models are trained with a batch size of 4.

4.6.6 Permutation invariant MNIST

In addition to RNNs, we investigated layer normalization in feed-forward networks. We show how layer normalization compares with batch normalization on the well-studied permutation invariant MNIST classification problem. From the previous analysis, layer normalization is invariant to input re-scaling, which is desirable for the internal hidden layers. But this is unnecessary for the logit outputs, where the prediction confidence is determined by the scale of the logits. We therefore only apply layer normalization to the fully-connected hidden layers, excluding the final softmax layer.

All models were trained using 55,000 training data points and the Adam [Kingma and Ba, 2014a] optimizer. For the smaller batch size, the variance term for batch normalization is computed using the unbiased estimator. The experimental results in Figure 4.6 highlight that layer normalization is robust to the batch size and exhibits faster training convergence compared with batch normalization applied to all layers.

4.6.7 Convolutional networks

We have also experimented with convolutional neural networks. In our preliminary experiments, we observed that layer normalization offers a speedup over the baseline model without normalization, but batch normalization outperforms the other methods. With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction, and re-centering and re-scaling the summed inputs to a layer works well. However, the assumption of similar contributions no longer holds for convolutional neural networks. The large number of hidden units whose receptive fields lie near the boundary of the image are rarely turned on and thus have very different statistics from the rest of the hidden units within the same layer. We think further research is needed to make layer normalization work well in ConvNets.

4.7 Summary

In this chapter, we introduced layer normalization to speed up the training of neural networks. We provided a theoretical analysis that compared the invariance properties of layer normalization with batch normalization and weight normalization. We showed that layer normalization is invariant to per-training-case feature shifting and scaling. Empirically, we showed that recurrent neural networks benefit the most from the proposed method, especially for long sequences and small mini-batches.

Chapter 5

Self-attention to the recent past using fast weights

Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These “fast weights” can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.

5.1 Motivation

Ordinary recurrent neural networks typically have two types of memory that have very different time scales, very different capacities and very different computational roles. The history of the sequence currently being processed is stored in the hidden activity vector, which acts as a short-term memory that is updated at every time step. The capacity of this memory is O(H), where H is the number of hidden units. Long-term memory about how to convert the current input and hidden vectors into the next hidden vector and a predicted output vector is stored in the weight matrices connecting the hidden units to themselves and to the inputs and outputs. These matrices are typically updated at the end of a sequence and their capacity is O(H^2) + O(IH) + O(HO), where I and O are the numbers of input and output units. Long short-term memory networks [Hochreiter and Schmidhuber, 1997b] are a more complicated type of RNN that works better for discovering long-range structure in sequences, for two main reasons: First, they compute increments to the hidden activity vector at each time step rather than recomputing the full vector1. This encourages information in the hidden states to persist for much longer. Second, they allow the hidden activities to determine the states of gates that scale the effects of the weights. These multiplicative interactions allow the effective weights to be dynamically adjusted by the input or hidden

1This assumes the "remember gates" of the LSTM memory cells are set to one.

activities via the gates. However, LSTMs are still limited to a short-term memory capacity of O(H) for the history of the current sequence. Until recently, there was surprisingly little practical investigation of other forms of memory in recurrent nets, despite strong psychological evidence that it exists and obvious computational reasons why it is needed. There were occasional suggestions that neural networks could benefit from a third form of memory that has much higher storage capacity than the neural activities but much faster dynamics than the standard slow weights. This memory could store information specific to the history of the current sequence, so that this information is available to influence the ongoing processing without using up the memory capacity of the hidden activities. Hinton and Plaut [1987] suggested that fast weights could be used to allow true recursion in a neural network, and Schmidhuber [1993] pointed out that a system of this kind could be trained end-to-end using backpropagation, but neither of these papers actually implemented this method of achieving recursion.

5.2 Evidence from physiology that temporary memory may not be stored as neural activities

Processes like working memory, attention, and priming operate on timescales of 100 ms to minutes. This is simultaneously too slow to be mediated by neural activations without dynamical attractor states (10 ms timescale) and too fast for long-term synaptic plasticity mechanisms to kick in (minutes to hours). While artificial neural network research has typically focused on methods that maintain temporary state in activation dynamics, that focus may be inconsistent with evidence that the brain also, or perhaps primarily, maintains temporary state information through short-term synaptic plasticity mechanisms [Tsodyks et al., 1998, Abbott and Regehr, 2004, Barak and Tsodyks, 2007]. The brain implements a variety of short-term plasticity mechanisms that operate on intermediate timescales. For example, short-term facilitation is implemented by leftover [Ca2+] in the axon terminal after depolarization, while short-term depression is implemented by presynaptic neurotransmitter depletion [Zucker and Regehr, 2002]. Spike-time dependent plasticity can also be invoked on this timescale [Markram et al., 1997, Bi and Poo, 1998]. These plasticity mechanisms are all synapse-specific. Thus they are more accurately modeled by a memory with O(H^2) capacity than by the O(H) capacity of standard recurrent artificial neural nets and LSTMs.

5.3 Fast Associative Memory

One of the main preoccupations of neural network research in the 1970s and early 1980s [Willshaw et al., 1969, Kohonen, 1972, Anderson and Hinton, 1981, Hopfield, 1982] was the idea that memories were not stored by somehow keeping copies of patterns of neural activity. Instead, these patterns were reconstructed when needed from information stored in the weights of an associative network, and the very same weights could store many different memories. An auto-associative memory that has N^2 weights cannot be expected to store more than N real-valued vectors with N components each. How close we can come to this upper bound depends on which storage rule we use. Hopfield nets use a simple, one-shot, outer-product storage rule and achieve a capacity of approximately 0.15N binary vectors using weights that require log(N) bits each. Much more efficient use can be made of the weights by using an iterative, error-correction storage rule to learn weights that can retrieve each bit of a pattern from all the other bits [Gardner, 1988], but for our purposes maximizing the capacity is less important than having a simple, non-iterative storage rule, so we will use an outer product rule to store hidden activity vectors in fast weights that decay rapidly. The usual weights in an RNN will be called slow weights; they learn by stochastic gradient descent on an objective function that takes into account the fact that changes in the slow weights will lead to changes in what gets stored automatically in the fast associative memory.

Figure 5.1: The fast associative memory model.

A fast associative memory has several advantages when compared with the types of memory assumed by a Neural Turing Machine (NTM) [Graves et al., 2014], Neural Stack [Grefenstette et al., 2015], or Memory Network [Weston et al., 2014]. First, it is not at all clear how a real brain would implement the more exotic structures in these models, e.g., the tape of the NTM, whereas it is clear that the brain could implement a fast associative memory in synapses with the appropriate dynamics. Second, in a fast associative memory there is no need to decide where or when to write to memory and where or when to read from memory. The fast memory is updated all the time and the writes are all superimposed on the same fast-changing component of the strength of each synapse. Every time the input changes, there is a transition to a new hidden state which is determined by a combination of three sources of information: the new input via the slow input-to-hidden weights, C, the previous hidden state via the slow transition weights, W, and the recent history of hidden state vectors via the fast weights, A. The effect of the first two sources of information on the new hidden state can be computed once and then maintained as a sustained boundary condition for a brief iterative settling process which allows the fast weights to influence the new hidden state. Assuming that the fast weights decay exponentially, we now show that the effect of the fast weights on the hidden vector during an iterative settling phase is to provide an additional input that is proportional to the sum, over all recent hidden activity vectors, of the scalar product of each recent hidden vector with the current hidden activity vector, with each term in this sum weighted by the decay rate raised to the power of how long ago that hidden vector occurred. So fast weights act like a kind of attention to the recent past, but with the strength of the attention determined by the scalar product between the current hidden vector and the earlier hidden vector, rather than by a separate parameterized computation of the type used in neural machine translation models [Bahdanau et al., 2015b].

The update rule for the fast memory weight matrix, A, is simply to multiply the current fast weights by a decay rate, \lambda, and add the outer product of the hidden state vector, a(t), multiplied by a learning rate, \eta:

A(t) = \lambda A(t-1) + \eta\, a(t)\, a(t)^\top   (5.1)

The next vector of hidden activities, a(t+1), is computed in two steps. The "preliminary" vector a^0(t+1) is determined by the combined effects of the input vector x(t) and the previous hidden vector: a^0(t+1) = f(W a(t) + C x(t)), where W and C are slow weight matrices and f(\cdot) is the nonlinearity used by the hidden units. The preliminary vector is then used to initiate an "inner loop" iterative process which runs for S steps and progressively changes the hidden state into a(t+1) = a^S(t+1):

a^{s+1}(t+1) = f\big([W a(t) + C x(t)] + A(t)\, a^s(t+1)\big),   (5.2)

where the terms in square brackets are the sustained boundary conditions. In a real neural net, A could be implemented by rapidly changing synapses, but in a computer simulation that uses sequences with fewer time steps than the dimensionality of a, A will be of less than full rank and it is more efficient to compute the term A(t)\, a^s(t+1) without ever computing the full fast weight matrix, A. Assuming A is 0 at the beginning of the sequence,

A(t) = \eta \sum_{\tau=1}^{t} \lambda^{t-\tau}\, a(\tau)\, a(\tau)^\top   (5.3)

A(t)\, a^s(t+1) = \eta \sum_{\tau=1}^{t} \lambda^{t-\tau}\, a(\tau)\, \big[a(\tau)^\top a^s(t+1)\big]   (5.4)

The term in square brackets is just the scalar product of an earlier hidden state vector, a(τ), with the current hidden state vector, a^s(t+1), during the iterative inner loop. So at each iteration of the inner loop, the fast weight matrix is exactly equivalent to attending to past hidden vectors in proportion to their scalar product with the current hidden vector, weighted by a decay factor. During the inner loop iterations, attention will become more focussed on past hidden states that manage to attract the current hidden state. The equivalence between using a fast weight matrix and comparing with a set of stored hidden state vectors is very helpful for computer simulations. It allows us to explore what can be done with fast weights without incurring the huge penalty of having to abandon the use of mini-batches during training. At first sight, mini-batches cannot be used because the fast weight matrix is different for every sequence, but comparing with a set of stored hidden vectors does allow mini-batches.
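A minimal numpy sketch of one transition computed in this efficient form (Eqs. 5.1 to 5.4). The function name fast_weights_step, the ReLU nonlinearity and the convention that history already contains a(t) are illustrative assumptions, not the exact training code.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def fast_weights_step(a_t, x_t, W, C, history, lam=0.95, eta=0.5, S=1):
    """One outer-loop transition followed by S inner-loop settling steps.

    history holds the past hidden vectors a(1), ..., a(t) (including a_t),
    so A(t) a^s(t+1) in Eq. (5.4) is computed as decayed attention over
    the stored vectors without ever forming the fast weight matrix A.
    """
    H = a_t.shape[0]
    boundary = W @ a_t + C @ x_t            # sustained boundary condition
    past = np.stack(history) if history else np.zeros((0, H))
    t = past.shape[0]
    decay = eta * lam ** np.arange(t - 1, -1, -1)   # eta * lam^(t - tau)
    a_s = relu(boundary)                    # preliminary vector a^0(t+1)
    for _ in range(S):
        scores = past @ a_s                 # scalar products a(tau)^T a^s(t+1)
        a_s = relu(boundary + (decay * scores) @ past)
    return a_s
```

The caller appends each new hidden vector to history after the step, so the memory always reflects all hidden states up to and including the current one, and a mini-batch of sequences can be handled by batching the stored vectors.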

5.3.1 Layer normalized fast weights

A potential problem with fast associative memory is that the scalar product of two hidden vectors could vanish or explode depending on the norm of the hidden vectors. Recently, layer normalization [Ba et al., 2016] has been shown to be very effective at stabilizing the hidden state dynamics in RNNs and reducing training time. Layer normalization is applied to the vector of summed inputs to all the recurrent units at a particular time step. It uses the mean and variance of the components of this vector to re-center and re-scale those summed inputs. Then, before applying the nonlinearity, it includes a learned, neuron-specific bias and gain. We apply layer normalization to the fast associative memory as follows:

a^{s+1}(t+1) = f\big(\mathrm{LN}\big[W a(t) + C x(t) + A(t)\, a^s(t+1)\big]\big)   (5.5)

where \mathrm{LN}[\cdot] denotes layer normalization. We found that applying layer normalization at each iteration of the inner loop makes the fast associative memory more robust to the choice of learning rate and decay hyper-parameters. For the rest of the chapter, fast weight models are trained using layer normalization and the outer product learning rule with a fast learning rate of 0.5 and a decay rate of 0.95, unless otherwise noted.
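In the sketch above, the corresponding change for Eq. (5.5) is to normalize the full summed input on every inner-loop iteration; the helper below omits the learned gain and bias for brevity.

```python
def ln(v, eps=1e-5):
    """Zero-mean, unit-variance normalization across the hidden units."""
    mu = v.mean()
    return (v - mu) / np.sqrt(((v - mu) ** 2).mean() + eps)

# Inside fast_weights_step, each inner-loop update becomes (cf. Eq. 5.5):
#     a_s = relu(ln(boundary + (decay * scores) @ past))
```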

5.3.2 Implementing the fast weights "inner loop" in biological neural networks

We considered two different ways of performing this inner-loop settling. In method 1 (which is what we use), the inputs to the hidden units after an outer-loop transition using W are stored and provide sustained boundary conditions during the inner-loop settling. In method 2 (which is more biologically plausible), we simply add the identity matrix to the fast weight matrix so that the inner-loop settling tends to sustain the hidden activity vector. For ReLUs, these two methods are equivalent when the fast weight matrix is zero. They are similar but not exactly equivalent when the fast weights are non-zero. Using layer normalization, we found that method 1 worked slightly better than method 2, but method 2 would be much easier to implement in a biological network.

5.4 Experimental results

To demonstrate the effectiveness of the fast associative memory, we first investigated the problems of associative retrieval (section 5.4.1) and MNIST classification (section 5.4.2). We compared fast weight models to regular RNNs and LSTM variants. We then applied the proposed fast weights to a facial expression recognition task using a fast associative memory model to store the results of processing at one level while examining a sequence of details at a finer level (section 5.4.3). The hyper-parameters of the experiments were selected through grid search on the validation set. All the models were trained using mini-batches of size 128 and the Adam optimizer [Kingma and Ba, 2014a].

5.4.1 Associative retrieval

We start by demonstrating that the method we propose for storing and retrieving temporary memories works effectively on a toy task to which it is very well suited. Consider a task where multiple key-value pairs are presented in a sequence. At the end of the sequence, one of the keys is presented and the model must predict the value that was temporarily associated with the key. We used strings that contained characters from the English alphabet, together with the digits 0 to 9. To construct a training sequence, we first randomly sample a character from the alphabet without replacement. This is the first key. Then a single digit is sampled as the associated value for that key. After generating a sequence of K character-digit pairs, one of the K different characters is selected at random as the query and the network must

predict the associated digit. Some examples of such string sequences and their targets are shown below:

Input string    Target
c9k8j3f1??c     9
j0a5s5z2??a     5

where ‘?’ is the token to separate the query from the key-value pairs. We generated 100,000 training examples, 10,000 validation examples and 20,000 test examples. To solve this task, a standard RNN has to end up with hidden activities that somehow store all of the key-value pairs after the keys and values are presented sequentially. This makes it a significant challenge for models only using slow weights. We used a neural network with a single recurrent layer for this experiment. The recurrent network processes the input sequence one character at a time. The input character is first converted into a learned 100-dimensional embedding vector which then provides input to the recurrent layer.² The output of the recurrent layer at the end of the sequence is then processed by another hidden layer of 100 ReLUs before the final softmax layer. We augment the ReLU RNN with a fast associative memory and compare it to an LSTM model with the same architecture. Although the original LSTMs do not have explicit long-term storage capacity, recent work from Danihelka et al. [2016] extended LSTMs by adding a complex associative memory. In our experiments, we compared fast associative memory to both LSTM variants. Figure 5.2 and Table 5.1 show that when the number of recurrent units is small, the fast associative memory significantly outperforms the LSTMs with the same number of recurrent units. The result fits with our hypothesis that the fast associative memory allows the RNN to use its recurrent units more effectively. In addition to having higher retrieval accuracy, the model with fast weights also converges faster than the LSTM models.

Table 5.1: Classification error rate comparison on the associative retrieval task (R denotes the number of recurrent hidden units).

Model          R=20      R=50      R=100
IRNN           62.11%    60.23%    0.34%
LSTM           60.81%    1.85%     0%
A-LSTM         60.13%    1.62%     0%
Fast weights   1.81%     0%        0%

Figure 5.2: Comparison of the test log likelihood on the associative retrieval task with 50 recurrent hidden units.

²To make the architecture for this task more similar to the architecture for the next task, we first compute a 50-dimensional embedding vector and then expand this to a 100-dimensional embedding.
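As a concrete illustration of the string-generation procedure described above, the following sketch samples one training example. The function name and exact sampling details are our own assumptions.

import random
import string

def make_example(num_pairs=4):
    """Sample one associative retrieval string and its target digit."""
    keys = random.sample(string.ascii_lowercase, num_pairs)   # keys without replacement
    values = [random.choice(string.digits) for _ in keys]     # digit values may repeat
    query = random.choice(keys)
    sequence = "".join(k + v for k, v in zip(keys, values)) + "??" + query
    return sequence, values[keys.index(query)]

# e.g. make_example(4) could return ('c9k8j3f1??c', '9')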

5.4.2 Integrating glimpses in visual attention models

Despite their many successes, convolutional neural networks are computationally expensive and the representations they learn can be hard to interpret. The visual attention models introduced in Chapter 2 have been shown to overcome some of the limitations in ConvNets. One can understand what signals the algorithm is using by seeing where the model is looking. Also, the visual attention model is able to selectively focus on important parts of visual space and thus avoid any detailed processing of much of the background clutter. In this section, we show that visual attention models can use fast weights to store

information about object parts, though we use a very restricted set of glimpses that do not correspond to natural parts of the objects. Given an input image, a visual attention model computes a sequence of glimpses over regions of the image. The model not only has to determine where to look next, but also has to remember what it has seen so far in its working memory so that it can make the correct classification later. Visual attention models can learn to find multiple objects in a large static input image and classify them correctly, but the learnt glimpse policies are typically over-simplistic: they only use a single scale of glimpses and they tend to scan over the image in a rigid way. Human eye movements and fixations are far more complex. The ability to focus on different parts of a whole object at different scales allows humans to apply the very same knowledge in the weights of the network at many different scales, but it requires some form of temporary memory to allow the network to integrate what it discovered in a set of glimpses. Improving the model's ability to remember recent glimpses should help the visual attention model discover non-trivial glimpse policies. Because the fast weights can store all the glimpse information in the sequence, the hidden activity vector is freed up to learn how to intelligently integrate visual information and retrieve the appropriate memory content for the final classifier. To explicitly verify that a larger memory capacity is beneficial to visual attention-based models, we simplify the learning process in the following way: first, we provide a pre-defined glimpse control signal so the model knows where to attend, rather than having to learn the control policy through reinforcement learning. Second, we introduce an additional control signal to the memory cells so the attention model knows when to store the glimpse information. A typical visual attention model is complex and has high variance in its performance due to the need to learn the policy network and the classifier at the same time. Our simplified learning procedure enables us to discern the performance improvement contributed by using fast weights to remember the recent past. We consider a simple recurrent visual attention model that has a similar architecture to the RNN from the previous experiment. It does not predict where to attend but rather is given a fixed sequence of locations: the static input image is broken down recursively into four non-overlapping quadrants, at two scale levels. The four coarse regions, down-sampled to 7×7, along with their four 7×7 quadrants, are presented in a single sequence, as shown in Figure 5.1. Notice that the two glimpse scales form a two-level hierarchy in the visual space. In order to solve this task successfully, the attention model needs to integrate the glimpse information from different levels of the hierarchy. One solution is to use the model's hidden states to both store and integrate the glimpses of different scales. A much more efficient solution is to use a temporary “cache” to store any unfinished glimpse computation when processing the glimpses from a finer scale in the hierarchy.
Once the computation is finished at that scale, the results can be integrated with the partial results at the higher level by “popping” the previous result from the “cache”. Fast weights, therefore, can act as a neurally plausible “cache” for storing partial results. The slow weights of the same model can then specialize in integrating glimpses at the same scale. Because the slow weights are shared for all glimpse scales, the model should be able to store the partial results at several levels in the same set of fast weights, though we have only demonstrated the use of fast weights for storage at a single level. We used a single hidden layer recurrent neural network which takes a 100-dimensional embedding vector as its input. We compared the fast weights memory against three other RNN architectures: the IRNN, the standard LSTM and the associative LSTM. The non-recurrent slow weights are initialized from a uniform distribution between (−1/√H, 1/√H), where H is the number of outgoing weights.
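To make the fixed two-level glimpse policy concrete, here is a minimal NumPy sketch. The ordering of the coarse and fine glimpses, and hence the exact glimpse count, are our own assumptions and may differ from what was used in the experiments.

import numpy as np

def quadrants(img):
    """Split an image into its four non-overlapping quadrants (TL, TR, BL, BR)."""
    h, w = img.shape
    return [img[:h//2, :w//2], img[:h//2, w//2:], img[h//2:, :w//2], img[h//2:, w//2:]]

def downsample(img, size=7):
    """Average-pool an image down to size x size (assumes divisibility)."""
    h, w = img.shape
    return img.reshape(size, h // size, size, w // size).mean(axis=(1, 3))

def glimpse_sequence(image, size=7):
    """A sketch of the fixed two-level glimpse policy: each coarse quadrant
    (down-sampled to size x size) is followed by its four fine quadrants."""
    seq = []
    for coarse in quadrants(image):            # four 14x14 regions of a 28x28 image
        seq.append(downsample(coarse, size))   # coarse glimpse at the higher level
        seq.extend(quadrants(coarse))          # four fine 7x7 glimpses
    return seq                                 # 4 + 16 patches of size x size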

Figure 5.3: The multi-level fast associative memory model, showing the slow transition weights, the fast transition weights, the integration transition weights, and the point at which the fast weights are updated and the hidden state is wiped out.

Table 5.2: Classification error rates on MNIST.

Model          50 features   100 features   200 features
IRNN           12.95%        1.95%          1.42%
LSTM           12%           1.55%          1.10%
ConvNet        1.81%         1.00%          0.9%
Fast weights   7.21%         1.30%          0.85%

The slow weights learning rate is tuned using the 10,000 validation examples. The specific hyper-parameter settings for the models used in the experiments are the following. Fast weights: the fast weights learning rate, η, is set to 0.5 and the fast weights decay rate, λ, is set to 0.9. The fast weights are updated once at every time step; we experimented with more iterations of the “inner loop” and the performance was similar. The recurrent slow weights are initialized to an identity matrix scaled by 0.05. We use the ReLU activation for f(·) in the recurrent layer. IRNN: the recurrent slow weights are initialized to an identity matrix scaled by 0.5. ReLU is used as the non-linearity in the recurrent layer. Associative LSTM: we used 4 copies of memory cells for the associative LSTM, with 3 read-write heads used for storage and retrieval memory access. We evaluated the multi-level visual attention model on the MNIST handwritten digit dataset. MNIST is a well-studied problem on which many other techniques have been benchmarked. It contains the ten classes of handwritten digits, ranging from 0 to 9. The task is to predict the class label of an isolated and roughly normalized 28×28 image of a digit. The glimpse sequence, in this case, consists of 24 patches of 7×7 pixels. Table 5.2 compares classification results for a ReLU RNN with a multi-level fast associative memory against an LSTM that gets the same sequence of glimpses. Again the results show that when the number of hidden units is limited, fast weights give a significant improvement over the other models.

Figure 5.4: Examples of the near frontal faces from the MultiPIE dataset.

Table 5.3: Classification accuracy comparison on the facial expression recognition task.

Model           IRNN    LSTM    ConvNet   Fast weights
Test accuracy   81.11   81.32   88.23     86.34

As we increase the memory capacities, the multi-level fast associative memory consistently outperforms the LSTM in classification accuracy. Unlike models that must integrate a sequence of glimpses, convolutional neural networks process all the glimpses in parallel and use layers of hidden units to hold all their intermediate computational results. We further demonstrate the effectiveness of the fast weights by comparing them to a three-layer convolutional neural network that uses the same patches as the glimpses presented to the visual attention model. From Table 5.2, we see that the multi-level model with fast weights reaches performance very similar to the ConvNet model's, without requiring any biologically implausible weight sharing.

5.4.3 Facial expression recognition

To further investigate the benefits of using fast weights in the multi-level visual attention model, we performed facial expression recognition tasks on the CMU Multi-PIE face database [Gross et al., 2010]. The dataset was preprocessed to align each face by the eye and nose fiducial points, and was downsampled to 48×48 greyscale. The full dataset contains 15 photos taken from cameras with different viewpoints for each illumination × expression × identity × session condition. We used only the images taken from the three central cameras, corresponding to −15°, 0°, 15° views, since facial expressions were not discernible from the more extreme viewpoints. The resulting dataset contained over 100,000 images. 317 identities appeared in the training set, with the remaining 20 identities in the test set. Given the input face image, the goal is to classify the subject's facial expression into one of six categories: neutral, smile, surprise, squint, disgust and scream. The task is more realistic and challenging than the previous MNIST experiments: not only does the dataset have unbalanced numbers of labels, but some of the expressions, for example squint and disgust, are very hard to distinguish. In order to perform well on this task, the models need to generalize over different lighting conditions and viewpoints. We used the same multi-level attention model as in the MNIST experiments, with 200 recurrent hidden units. The model sequentially attends to non-overlapping 12×12 pixel patches at two different scales and there are, in total, 24 glimpses. Similarly, we designed a two-layer ConvNet that has 12×12 receptive fields. From Table 5.3, we see that the multi-level fast weights model that knows when to store information outperforms the LSTM and the IRNN. The results are consistent with the previous MNIST experiments.


Figure 5.5: (a) Sample screen from the game “Catch”. (b) Performance curves (average reward vs. steps) for Catch with N = 16, M = 3. (c) Performance curves for Catch with N = 24, M = 5.

However, the ConvNet is able to perform better than the multi-level attention model on this near-frontal face dataset. We think the efficient weight-sharing and architectural engineering in the ConvNet, combined with the simultaneous availability of all the information at each level of processing, allows the ConvNet to generalize better on this task. Our use of a rigid and predetermined policy for where to glimpse eliminates one of the main potential advantages of the multi-level attention model: it can process informative details at high resolution whilst ignoring most of the irrelevant details. To realize this advantage we will need to combine the use of fast weights with the learning of complicated policies.

5.4.4 Agents with memory

While different kinds of memory and attention have been studied extensively in the supervised learning setting [Graves, 2014, Mnih et al., 2014a, Bahdanau et al., 2015b], the use of such models for learning long-range dependencies in reinforcement learning has received less attention. We compare different memory architectures on a partially observable variant of the game “Catch” described in [Mnih et al., 2014a]. The game is played on an N×N screen of binary pixels and each episode consists of N frames. Each trial begins with a single pixel, representing a ball, appearing somewhere in the first row, and a two-pixel “paddle” controlled by the agent in the bottom row. After observing a frame, the agent gets to either keep the paddle stationary or move it right or left by one pixel. The ball descends by a single pixel after each frame. The episode ends when the ball pixel reaches the bottom row, and the agent receives a reward of +1 if the paddle touches the ball and a reward of −1 if it doesn't. Solving the fully observable task is straightforward and requires the agent to move the paddle to the column with the ball. We make the task partially observable by providing the agent blank observations after the Mth frame. Solving the partially observable version of the game requires remembering the positions of the paddle and ball after M frames and moving the paddle to the correct position using the stored information. All agents used recurrent networks to represent the policy. At each time step the input was passed through a hidden layer with 128 ReLU units and then passed to the recurrent core. All agents used 128 recurrent cells. The output at every step was a softmax over the valid actions and a single linear output for the estimate of the value function. We used random search to find hyperparameter values for the learning rate, the number of Hebbian steps, and the fast weight learning rate and decay where applicable.
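For concreteness, a minimal NumPy sketch of the partially observable Catch environment described above is given below. The interface (reset/step) and the exact timing details are our own assumptions, not the environment code used in the experiments.

import numpy as np

class PartiallyObservableCatch:
    """N x N binary screen; the ball falls one row per frame; observations go
    blank after the M-th frame; terminal reward is +1 or -1 (a sketch)."""

    def __init__(self, n=16, m=3, seed=None):
        self.n, self.m = n, m
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.ball = [0, int(self.rng.integers(self.n))]    # ball starts in first row
        self.paddle = int(self.rng.integers(self.n - 1))   # left pixel of the paddle
        return self._observe()

    def step(self, action):
        # action in {-1, 0, +1}: move the paddle left, keep it still, or move right.
        self.paddle = int(np.clip(self.paddle + action, 0, self.n - 2))
        self.ball[0] += 1                                  # ball descends one pixel
        self.t += 1
        done = self.ball[0] == self.n - 1
        reward = 0.0
        if done:                                           # +1 if caught, -1 otherwise
            reward = 1.0 if self.paddle <= self.ball[1] <= self.paddle + 1 else -1.0
        return self._observe(), reward, done

    def _observe(self):
        screen = np.zeros((self.n, self.n), dtype=np.float32)
        if self.t < self.m:                                # blank after the M-th frame
            screen[self.ball[0], self.ball[1]] = 1.0
            screen[self.n - 1, self.paddle:self.paddle + 2] = 1.0
        return screen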

We averaged results over the top 5 models. We used the recently proposed asynchronous advantage actor-critic method [Mnih et al., 2016] to train agents with three types of memory on different sizes of the partially observable Catch task. The three agents included a ReLU RNN, an LSTM, and a fast weights RNN. Figure 5.5 shows the learning progress of the different agents on two variants of the game (N = 16, M = 3 and N = 24, M = 5). The agent using the fast weights architecture as its policy representation is able to learn faster than the agents using a ReLU RNN or an LSTM to represent the policy. The improvement obtained by fast weights is also more significant on the larger version of the game, which requires more memory.

5.5 Summary

In this chapter, we showed that the performance of RNNs on a variety of different tasks can be improved by introducing a mechanism that allows each new state of the hidden units to be attracted towards recent hidden states in proportion to their scalar products with the current state. Layer normalization makes this kind of attention work much better. This is a form of attention to the recent past that is somewhat similar to the attention mechanism that has recently been used to dramatically improve the sequence-to-sequence RNNs used in machine translation.

Chapter 6

Accelerating learning using Adaptive Moment methods

6.1 Motivation

Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized objective function requiring maximization or minimization with respect to its parameters. If the function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same computational complexity as just evaluating the function. Often, objective functions are stochastic. For example, many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD has proved itself an efficient and effective optimization method that was central to many machine learning success stories, such as recent advances in deep learning [Deng et al., 2013, Krizhevsky et al., 2012d, Hinton and Salakhutdinov, 2006, Hinton et al., 2012b, Graves et al., 2013]. Objectives may also have sources of noise other than data subsampling, such as dropout [Hinton et al., 2012c] regularization. For all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this chapter is on the optimization of stochastic objectives with high-dimensional parameter spaces; higher-order optimization methods are of interest in the next chapter, and the discussion in this chapter is restricted to first-order methods. In this chapter, we introduce Adam, a new algorithm for efficient stochastic optimization that requires only first-order gradients and little memory. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients; the name Adam is derived from adaptive moment estimation. Our method is designed to combine the advantages of two recently popular methods: AdaGrad [Duchi et al., 2011], which works well with sparse gradients, and RMSProp [Tieleman and Hinton, 2012a], which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are clarified in section 6.5. Some of Adam's advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form


of step size annealing. In section 6.2 we describe the algorithm and the properties of its update rule. Section 6.3 explains our initialization bias correction technique, and section 6.4 provides a theoretical analysis of Adam's convergence in online convex programming. Empirically, our method consistently outperforms other methods for a variety of models and datasets, as shown in section 6.6. Overall, we show that Adam is a versatile algorithm that scales to large-scale high-dimensional machine learning problems.

Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 6.2 for details, and for a slightly more efficient (but less clear) order of computation. g_t² indicates the elementwise square g_t ⊙ g_t. Good default settings for the tested machine learning problems are α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10⁻⁸. All operations on vectors are element-wise. With β1^t and β2^t we denote β1 and β2 to the power t.

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  v0 ← 0 (Initialize 2nd moment vector)
  t ← 0 (Initialize timestep)
  while θt not converged do
      t ← t + 1
      g_t ← ∇θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
      m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
      v_t ← β2 · v_{t−1} + (1 − β2) · g_t² (Update biased second raw moment estimate)
      m̂_t ← m_t / (1 − β1^t) (Compute bias-corrected first moment estimate)
      v̂_t ← v_t / (1 − β2^t) (Compute bias-corrected second raw moment estimate)
      θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ε) (Update parameters)
  end while
  return θ_t (Resulting parameters)

6.2 Algorithm

See Algorithm 1 for pseudo-code of our proposed algorithm Adam. In the rest of the section, we follow the notational conventions of numerical optimization and consider f(θ) a noisy objective function: a stochastic scalar function that is differentiable w.r.t. the neural network's parameters θ. We are interested in minimizing the expected value of this function, E[f(θ)], w.r.t. its parameters θ. With f_1(θ), ..., f_T(θ) we denote the realisations of the stochastic function at subsequent timesteps 1, ..., T. The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise. With g_t = ∇θ f_t(θ) we denote the gradient, i.e. the vector of partial derivatives of f_t w.r.t. θ, evaluated at timestep t.

The algorithm updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t), where the hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are initialized as (vectors of) 0's, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the βs are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates m̂_t and v̂_t.
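As a concrete illustration (our own, not a reference implementation), one step of Algorithm 1 in NumPy:

import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Algorithm 1) given the stochastic gradient g at theta."""
    t += 1
    m = beta1 * m + (1 - beta1) * g            # biased first moment estimate
    v = beta2 * v + (1 - beta2) * g * g        # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second raw moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t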

See section 6.3 for more details. Note that the efficiency of Algorithm 1 can, at the expense of clarity, be improved by changing the order of computation, e.g. by replacing the last three lines in the loop with the following: \alpha_t = \alpha \cdot \sqrt{1-\beta_2^t}/(1-\beta_1^t) and \theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t/(\sqrt{v_t} + \hat{\epsilon}).

6.2.1 Adam's update rule

An important property of Adam's update rule is its careful choice of stepsizes. Assuming ε = 0, the effective step taken in parameter space at timestep t is Δ_t = α · m̂_t/√v̂_t. The effective stepsize has two upper bounds: |Δ_t| ≤ α · (1 − β1)/√(1 − β2) in the case (1 − β1) > √(1 − β2), and |Δ_t| ≤ α otherwise. The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize will be smaller. When (1 − β1) = √(1 − β2) we have that |m̂_t/√v̂_t| < 1 and therefore |Δ_t| < α. In more common scenarios, we will have that m̂_t/√v̂_t ≈ ±1 since |E[g]|/√E[g²] ≤ 1. The effective magnitude of the steps taken in parameter space at each timestep is thus approximately bounded by the stepsize setting α, i.e., |Δ_t| ⪅ α. This can be understood as establishing a trust region around the current parameter value, beyond which the current gradient estimate does not provide sufficient information. This typically makes it relatively easy to know the right scale of α in advance. For many machine learning models, for instance, we often know in advance that good optima are with high probability within some set region in parameter space; it is not uncommon, for example, to have a prior distribution over the parameters. Since α sets (an upper bound of) the magnitude of steps in parameter space, we can often deduce the right order of magnitude of α such that optima can be reached from θ0 within some number of iterations. With a slight abuse of terminology, we will call the ratio m̂_t/√v̂_t the signal-to-noise ratio (SNR). With a smaller SNR the effective stepsize Δ_t will be closer to zero. This is a desirable property, since a smaller SNR means that there is greater uncertainty about whether the direction of m̂_t corresponds to the direction of the true gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize Δ_t is also invariant to the scale of the gradients: rescaling the gradients g with a factor c will scale m̂_t by a factor c and v̂_t by a factor c², which cancel out: (c · m̂_t)/√(c² · v̂_t) = m̂_t/√v̂_t.

6.3 Initialization bias correction

As explained in section 6.2, Adam utilizes initialization bias correction terms. We will here derive the term for the second moment estimate; the derivation for the first moment estimate is completely analogous. Let g be the gradient of the stochastic objective f, and we wish to estimate its second raw moment (uncentered variance) using an exponential moving average of the squared gradient, with decay rate β2. Let g_1, ..., g_T be the gradients at subsequent timesteps, each a draw from an underlying gradient distribution g_t ∼ p(g_t). Let us initialize the exponential moving average as v_0 = 0 (a vector of zeros). First note that the update at timestep t of the exponential moving average, v_t = β2 · v_{t−1} + (1 − β2) · g_t² (where g_t² indicates the elementwise square g_t ⊙ g_t), can be written as a function of the gradients at all previous timesteps:

v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2    (6.1)

We wish to know how E[v_t], the expected value of the exponential moving average at timestep t, relates to the true second moment E[g_t²], so we can correct for the discrepancy between the two. Taking expectations of the left-hand and right-hand sides of eq. (6.1):

E[v_t] = E\left[(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}\cdot g_i^2\right]    (6.2)
       = E[g_t^2]\cdot(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i} + \zeta    (6.3)
       = E[g_t^2]\cdot(1-\beta_2^t) + \zeta    (6.4)

where ζ = 0 if the true second moment E[g_i²] is stationary; otherwise ζ can be kept small, since the exponential decay rate β2 can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term (1 − β2^t), which is caused by initializing the running average with zeros. In Algorithm 1 we therefore divide by this term to correct the initialization bias. In the case of sparse gradients, a reliable estimate of the second moment requires averaging over many gradients, i.e. choosing a small decay rate, with β2 close to 1; however, it is exactly in this case that a lack of initialization bias correction would lead to initial steps that are much larger.
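A quick numerical check of eq. (6.4), as our own illustration: with stationary unit-variance gradients, the mean of the uncorrected moving average concentrates around (1 − β2^T) · E[g²], and dividing by (1 − β2^T) recovers the true second moment.

import numpy as np

rng = np.random.default_rng(0)
beta2, T = 0.999, 100
g = rng.normal(size=(10000, T))          # stationary gradients with E[g^2] = 1

v = np.zeros(10000)
for t in range(T):                       # uncorrected moving average, v_0 = 0
    v = beta2 * v + (1 - beta2) * g[:, t] ** 2

print(v.mean())                          # ~ (1 - beta2**T) * E[g^2], about 0.095
print(v.mean() / (1 - beta2 ** T))       # bias-corrected estimate, about 1.0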

6.4 Convergence analysis

We analyze the convergence of Adam using the online learning framework proposed in [Zinkevich, 2003].

Given an arbitrary, unknown sequence of convex cost functions f_1(θ), f_2(θ), ..., f_T(θ), at each time t our goal is to predict the parameter θ_t and evaluate it on a previously unknown cost function f_t. Since the nature of the sequence is unknown in advance, we evaluate our algorithm using the regret, that is, the sum over all previous steps of the difference between the online prediction f_t(θ_t) and the value f_t(θ*) at the best fixed parameter point θ* from a feasible set. Concretely, the regret is defined as:

R(T) = \sum_{t=1}^{T}\left[f_t(\theta_t) - f_t(\theta^*)\right]    (6.5)

where θ* = arg min_{θ∈X} Σ_{t=1}^{T} f_t(θ). We show that Adam has an O(√T) regret bound; our result is comparable to the best known bound for this general convex online learning problem. We also use some definitions to simplify our notation: g_t ≜ ∇f_t(θ_t), with g_{t,i} denoting its i-th element. We define g_{1:t,i} ∈ R^t as the vector that contains the i-th dimension of the gradients over all iterations up to t, g_{1:t,i} = [g_{1,i}, g_{2,i}, ..., g_{t,i}]. Also, we define γ ≜ β1²/√β2. Our following theorem holds when the learning rate α_t decays at a rate of t^{−1/2} and the first moment running average coefficient β_{1,t} decays exponentially with λ, which is typically chosen to be close to 1, e.g. 1 − 10⁻⁸.

Theorem 6.4.1. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖_2 ≤ G and ‖∇f_t(θ)‖_∞ ≤ G_∞ for all θ ∈ R^d, and that the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖_2 ≤ D and ‖θ_m − θ_n‖_∞ ≤ D_∞ for any m, n ∈ {1, ..., T}, and that β1, β2 ∈ [0, 1) satisfy β1²/√β2 < 1. Let α_t = α/√t and β_{1,t} = β1 λ^{t−1}, λ ∈ (0, 1). Adam achieves the following guarantee, for all T ≥ 1:

R(T) \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}

Our Theorem 6.4.1 implies that when the data features are sparse, the summation terms can be much smaller than their upper bounds, Σ_{i=1}^{d} ‖g_{1:T,i}‖_2 << dG_∞√T and Σ_{i=1}^{d} √(T v̂_{T,i}) << dG_∞√T, in particular if the class of functions and data features are of the form in section 1.2 of [Duchi et al., 2011]. Their results for the expected value E[Σ_{i=1}^{d} ‖g_{1:T,i}‖_2] also apply to Adam. In particular, adaptive methods such as Adam and Adagrad can achieve O(log d · √T), an improvement over O(√(dT)) for the non-adaptive method. Decaying β_{1,t} towards zero is important in our theoretical analysis and also matches previous empirical findings; e.g., [Sutskever et al., 2013] suggest that reducing the momentum coefficient at the end of training can improve convergence. Finally, we can show that the average regret of Adam converges:

Corollary 6.4.2. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖_2 ≤ G and ‖∇f_t(θ)‖_∞ ≤ G_∞ for all θ ∈ R^d, and that the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖_2 ≤ D and ‖θ_m − θ_n‖_∞ ≤ D_∞ for any m, n ∈ {1, ..., T}. Adam achieves the following guarantee, for all T ≥ 1:

\frac{R(T)}{T} = O\left(\frac{1}{\sqrt{T}}\right)

This result can be obtained by using Theorem 6.4.1 and Σ_{i=1}^{d} ‖g_{1:T,i}‖_2 ≤ dG_∞√T. Thus, lim_{T→∞} R(T)/T = 0.

6.4.1 Convergence proof

Definition 6.4.3. A function f : R^d → R is convex if for all x, y ∈ R^d and all λ ∈ [0, 1],

\lambda f(x) + (1-\lambda) f(y) \ge f(\lambda x + (1-\lambda) y)

Also, notice that a convex function can be lower bounded by a hyperplane at its tangent.

Lemma 6.4.4. If a function f : R^d → R is convex, then for all x, y ∈ R^d,

f(y) \ge f(x) + \nabla f(x)^\top (y - x)

The above lemma can be used to upper bound the regret, and our proof for the main theorem is constructed by substituting the hyperplane with the Adam update rules. The following two lemmas are used to support our main theorem. We again use the definitions g_t ≜ ∇f_t(θ_t), with g_{t,i} denoting the i-th element, and g_{1:t,i} ∈ R^t denoting the vector that contains the i-th dimension of the gradients over all iterations up to t: g_{1:t,i} = [g_{1,i}, g_{2,i}, ..., g_{t,i}].

Lemma 6.4.5. Let g_t = ∇f_t(θ_t) and g_{1:t} be defined as above and bounded, ‖g_t‖_2 ≤ G and ‖g_t‖_∞ ≤ G_∞. Then,

\sum_{t=1}^{T}\sqrt{\frac{g_{t,i}^2}{t}} \le 2G_\infty\|g_{1:T,i}\|_2

Proof. We will prove the inequality using induction over T.

The base case T = 1 holds, since \sqrt{g_{1,i}^2} \le 2G_\infty\|g_{1,i}\|_2. For the inductive step,

\sum_{t=1}^{T}\sqrt{\frac{g_{t,i}^2}{t}} = \sum_{t=1}^{T-1}\sqrt{\frac{g_{t,i}^2}{t}} + \sqrt{\frac{g_{T,i}^2}{T}}
  \le 2G_\infty\|g_{1:T-1,i}\|_2 + \sqrt{\frac{g_{T,i}^2}{T}}
  = 2G_\infty\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}}

From \|g_{1:T,i}\|_2^2 - g_{T,i}^2 + \frac{g_{T,i}^4}{4\|g_{1:T,i}\|_2^2} \ge \|g_{1:T,i}\|_2^2 - g_{T,i}^2, we can take the square root of both sides and obtain

\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2\|g_{1:T,i}\|_2}
  \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2\sqrt{T G_\infty^2}}

Rearranging the inequality and substituting for the \sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} term,

2G_\infty\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}} \le 2G_\infty\|g_{1:T,i}\|_2

Lemma 6.4.6. Let γ ≜ β1²/√β2. For β1, β2 ∈ [0, 1) that satisfy β1²/√β2 < 1 and bounded g_t, ‖g_t‖_2 ≤ G, ‖g_t‖_∞ ≤ G_∞, the following inequality holds:

\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} \le \frac{2}{1-\gamma}\frac{1}{\sqrt{1-\beta_2}}\|g_{1:T,i}\|_2

Proof. Under the assumption, \frac{\sqrt{1-\beta_2^t}}{(1-\beta_1^t)^2} \le \frac{1}{(1-\beta_1)^2}. We can expand the last term in the summation using the update rules in Algorithm 1:

\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} = \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\frac{\left(\sum_{k=1}^{T}(1-\beta_1)\beta_1^{T-k}g_{k,i}\right)^2}{\sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}g_{j,i}^2}}
  \le \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\sum_{k=1}^{T}\frac{T\left((1-\beta_1)\beta_1^{T-k}g_{k,i}\right)^2}{\sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}g_{j,i}^2}}
  \le \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\sum_{k=1}^{T}\frac{T\left((1-\beta_1)\beta_1^{T-k}g_{k,i}\right)^2}{\sqrt{T(1-\beta_2)\beta_2^{T-k}g_{k,i}^2}}
  \le \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}(1-\beta_1)^2}{(1-\beta_1^T)^2\sqrt{T(1-\beta_2)}}\sum_{k=1}^{T}T\left(\frac{\beta_1^2}{\sqrt{\beta_2}}\right)^{T-k}\|g_{k,i}\|_2
  \le \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{T}{\sqrt{T(1-\beta_2)}}\sum_{k=1}^{T}\gamma^{T-k}\|g_{k,i}\|_2

Similarly, we can upper bound the rest of the terms in the summation:

\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} \le \sum_{t=1}^{T}\frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T-t}t\gamma^{j}
  \le \sum_{t=1}^{T}\frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T}t\gamma^{j}

For γ < 1, using the upper bound on the arithmetic-geometric series, \sum_{t} t\gamma^{t} < \frac{1}{(1-\gamma)^2}:

\sum_{t=1}^{T}\frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T}t\gamma^{j} \le \frac{1}{(1-\gamma)^2\sqrt{1-\beta_2}}\sum_{t=1}^{T}\frac{\|g_{t,i}\|_2}{\sqrt{t}}

Applying Lemma 6.4.5,

\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} \le \frac{2G_\infty}{(1-\gamma)^2\sqrt{1-\beta_2}}\|g_{1:T,i}\|_2

To simplify the notation, we define γ ≜ β1²/√β2. Intuitively, our following theorem holds when the learning rate α_t decays at a rate of t^{−1/2} and the first moment running average coefficient β_{1,t} decays exponentially with λ, which is typically chosen to be close to 1, e.g. 1 − 10⁻⁸.

Theorem 6.4.7. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖_2 ≤ G and ‖∇f_t(θ)‖_∞ ≤ G_∞ for all θ ∈ R^d, and that the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖_2 ≤ D and ‖θ_m − θ_n‖_∞ ≤ D_∞ for any m, n ∈ {1, ..., T}, and that β1, β2 ∈ [0, 1) satisfy β1²/√β2 < 1. Let α_t = α/√t and β_{1,t} = β1 λ^{t−1}, λ ∈ (0, 1). Adam achieves the following guarantee, for all T ≥ 1:

R(T) \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}

Proof. Using Lemma 6.4.4, we have,

f_t(\theta_t) - f_t(\theta^*) \le g_t^\top(\theta_t - \theta^*) = \sum_{i=1}^{d} g_{t,i}(\theta_{t,i} - \theta^*_{,i})

From the update rules presented in Algorithm 1,

\theta_{t+1} = \theta_t - \alpha_t\,\hat{m}_t/\sqrt{\hat{v}_t}
             = \theta_t - \frac{\alpha_t}{1-\beta_1^t}\left(\frac{\beta_{1,t}}{\sqrt{\hat{v}_t}}m_{t-1} + \frac{1-\beta_{1,t}}{\sqrt{\hat{v}_t}}g_t\right)

We focus on the i-th dimension of the parameter vector θ_t ∈ R^d. Subtracting the scalar θ*_{,i} and squaring both sides of the above update rule, we have:

(\theta_{t+1,i}-\theta^*_{,i})^2 = (\theta_{t,i}-\theta^*_{,i})^2 - \frac{2\alpha_t}{1-\beta_1^t}\left(\frac{\beta_{1,t}}{\sqrt{\hat{v}_{t,i}}}m_{t-1,i} + \frac{1-\beta_{1,t}}{\sqrt{\hat{v}_{t,i}}}g_{t,i}\right)(\theta_{t,i}-\theta^*_{,i}) + \alpha_t^2\left(\frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}}\right)^2

We can rearrange the above equation and use Young's inequality, ab \le a^2/2 + b^2/2. Also, it can be shown that \sqrt{\hat{v}_{t,i}} = \sqrt{\sum_{j=1}^{t}(1-\beta_2)\beta_2^{t-j}g_{j,i}^2}\,\big/\sqrt{1-\beta_2^t} \le \|g_{1:t,i}\|_2 and \beta_{1,t} \le \beta_1. Then

g_{t,i}(\theta_{t,i}-\theta^*_{,i}) = \frac{(1-\beta_1^t)\sqrt{\hat{v}_{t,i}}}{2\alpha_t(1-\beta_{1,t})}\left((\theta_{t,i}-\theta^*_{,i})^2 - (\theta_{t+1,i}-\theta^*_{,i})^2\right) + \frac{\beta_{1,t}}{1-\beta_{1,t}}(\theta^*_{,i}-\theta_{t,i})\frac{\hat{v}_{t-1,i}^{1/4}}{\sqrt{\alpha_{t-1}}}\sqrt{\alpha_{t-1}}\frac{m_{t-1,i}}{\hat{v}_{t-1,i}^{1/4}} + \frac{\alpha_t(1-\beta_1^t)\sqrt{\hat{v}_{t,i}}}{2(1-\beta_{1,t})}\left(\frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}}\right)^2
  \le \frac{1}{2\alpha_t(1-\beta_1)}\left((\theta_{t,i}-\theta^*_{,i})^2 - (\theta_{t+1,i}-\theta^*_{,i})^2\right)\sqrt{\hat{v}_{t,i}} + \frac{\beta_{1,t}}{2\alpha_{t-1}(1-\beta_{1,t})}(\theta^*_{,i}-\theta_{t,i})^2\sqrt{\hat{v}_{t-1,i}} + \frac{\beta_1\alpha_{t-1}}{2(1-\beta_1)}\frac{m_{t-1,i}^2}{\sqrt{\hat{v}_{t-1,i}}} + \frac{\alpha_t}{2(1-\beta_1)}\frac{\hat{m}_{t,i}^2}{\sqrt{\hat{v}_{t,i}}}

We apply Lemma 6.4.6 to the above inequality and derive the regret bound by summing across all the dimensions i ∈ {1, ..., d} in the upper bound of f_t(θ_t) − f_t(θ*) and over the sequence of convex functions for t ∈ {1, ..., T}:

R(T) \le \sum_{i=1}^{d}\frac{1}{2\alpha_1(1-\beta_1)}(\theta_{1,i}-\theta^*_{,i})^2\sqrt{\hat{v}_{1,i}} + \sum_{i=1}^{d}\sum_{t=2}^{T}\frac{1}{2(1-\beta_1)}(\theta_{t,i}-\theta^*_{,i})^2\left(\frac{\sqrt{\hat{v}_{t,i}}}{\alpha_t} - \frac{\sqrt{\hat{v}_{t-1,i}}}{\alpha_{t-1}}\right) + \frac{\beta_1\alpha G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \frac{\alpha G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{2\alpha_t(1-\beta_{1,t})}(\theta^*_{,i}-\theta_{t,i})^2\sqrt{\hat{v}_{t,i}}

From the assumption, ‖θ_t − θ*‖_2 ≤ D and ‖θ_m − θ_n‖_∞ ≤ D_∞, we have:

R(T) \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \frac{D_\infty^2}{2\alpha}\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}\sqrt{t\hat{v}_{t,i}}
  \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\alpha}\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}\sqrt{t}

We can use the arithmetic-geometric series upper bound for the last term:

\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}\sqrt{t} \le \sum_{t=1}^{T}\frac{1}{1-\beta_1}\lambda^{t-1}\sqrt{t}
  \le \sum_{t=1}^{T}\frac{1}{1-\beta_1}\lambda^{t-1}t
  \le \frac{1}{(1-\beta_1)(1-\lambda)^2}

Therefore, we have the following regret bound:

R(T) \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}

6.5 Related work

Optimization methods bearing a direct relation to Adam are RMSProp [Tieleman and Hinton, 2012a, Graves, 2013b] and AdaGrad [Duchi et al., 2011]; these relationships are discussed below. Other stochastic optimization methods include vSGD [Schaul et al., 2012], AdaDelta [Zeiler, 2012] and the natural Newton method of Roux and Fitzgibbon [2010], all of which set stepsizes by estimating curvature from first-order information. The Sum-of-Functions Optimizer (SFO) [Sohl-Dickstein et al., 2014] is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained systems such as a GPU. Like natural gradient descent (NGD) [Amari, 1998], Adam employs a preconditioner that adapts to the geometry of the data, since v̂_t is an approximation to the diagonal of the Fisher information matrix [Pascanu and Bengio, 2013]; however, Adam's preconditioner (like AdaGrad's) is more conservative in its adaptation than vanilla NGD, preconditioning with the square root of the inverse of the diagonal Fisher information matrix approximation.

RMSProp: An optimization method closely related to Adam is RMSProp [Tieleman and Hinton, 2012a]. A version with momentum has sometimes been used [Graves, 2013b]. There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly Chapter 6. Accelerating learning using Adaptive Moment methods 76


estimated using a running average of the first and second moments of the gradient. RMSProp also lacks a bias-correction term; this matters most in the case of a value of β2 close to 1 (required in the case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and often divergence, as we also empirically demonstrate in section 6.6.4.

Figure 6.1: Logistic regression training negative log likelihood on MNIST images and IMDB movie reviews with 10,000 bag-of-words (BoW) feature vectors.

AdaGrad: An algorithm that works well for sparse gradients is AdaGrad [Duchi et al., 2011]. Its basic version updates parameters as \theta_{t+1} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}. Note that if we choose β2 to be infinitesimally close to 1 from below, then \lim_{\beta_2\to 1}\hat{v}_t = t^{-1}\cdot\sum_{i=1}^{t} g_i^2. AdaGrad corresponds to a version of Adam with β1 = 0, infinitesimal (1 − β2) and a replacement of α by an annealed version \alpha_t = \alpha\cdot t^{-1/2}, namely \theta_t - \alpha\cdot t^{-1/2}\cdot\hat{m}_t/\sqrt{\lim_{\beta_2\to 1}\hat{v}_t} = \theta_t - \alpha\cdot t^{-1/2}\cdot g_t/\sqrt{t^{-1}\sum_{i=1}^{t} g_i^2} = \theta_t - \alpha\cdot g_t/\sqrt{\sum_{i=1}^{t} g_i^2}. Note that this direct correspondence between Adam and Adagrad does not hold when removing the bias-correction terms; without bias correction, as in RMSProp, a β2 infinitesimally close to 1 would lead to infinitely large bias, and infinitely large parameter updates.
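This limiting correspondence is easy to check numerically; the following sketch (our own illustration) uses a β2 very close to 1 as a stand-in for the limit:

import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=20)                        # a 1-D gradient history (illustration)
t = len(g)

adagrad_step = g[-1] / np.sqrt(np.sum(g ** 2))

beta2 = 1 - 1e-8                               # beta2 very close to 1
v = 0.0
for gi in g:
    v = beta2 * v + (1 - beta2) * gi ** 2
v_hat = v / (1 - beta2 ** t)                   # approx. (1/t) * sum of squared gradients
adam_step = t ** -0.5 * g[-1] / np.sqrt(v_hat) # beta1 = 0, annealed stepsize alpha_t

print(np.allclose(adagrad_step, adam_step))    # True up to the beta2 approximation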

6.6 Experiments

To empirically evaluate the proposed method, we investigated different popular machine learning models, including logistic regression, multilayer fully connected neural networks and deep convolutional neural networks. Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems. We use the same parameter initialization when comparing different optimization algorithms. The hyper-parameters, such as learning rate and momentum, are searched over a dense grid and the results are reported using the best hyper-parameter setting.

6.6.1 Logistic regression

We evaluate our proposed method on L2-regularized multi-class logistic regression using the MNIST dataset. Logistic regression has a well-studied convex objective, making it suitable for comparing different optimizers without worrying about local minimum issues. The stepsize α in our logistic regression experiments is adjusted with a 1/√t decay, namely α_t = α/√t, which matches our theoretical prediction from section 6.4. The logistic regression classifies the class label directly on the 784-dimensional image vectors. We compare Adam to accelerated SGD with Nesterov momentum and to Adagrad, using a minibatch size of 128. According to Figure 6.1, Adam yields similar convergence as SGD with momentum, and both converge faster than Adagrad. As discussed in [Duchi et al., 2011], Adagrad can efficiently deal with sparse features and gradients as one of its main theoretical results, whereas SGD is slow at learning rare features. Adam with a 1/√t decay on its stepsize should theoretically match the performance of Adagrad. We examine the sparse feature problem using the IMDB movie review dataset from [Maas et al., 2011]. We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words. The 10,000-dimensional BoW feature vector for each review is highly sparse. As suggested in [Wang and Manning, 2013], 50% dropout noise can be applied to the BoW features during training to prevent over-fitting. In Figure 6.1, Adagrad outperforms SGD with Nesterov momentum by a large margin both with and without dropout noise, and Adam converges as fast as Adagrad. The empirical performance of Adam is consistent with our theoretical findings in sections 6.2 and 6.4: similar to Adagrad, Adam can take advantage of sparse features and obtain a faster convergence rate than normal SGD with momentum.

6.6.2 Multi-layer neural networks

Multi-layer neural networks are powerful models with non-convex objective functions. Although our convergence analysis does not apply to non-convex problems, we empirically found that Adam often outperforms other methods in such cases. In our experiments, we made model choices that are consistent with previous publications in the area; a neural network model with two fully connected hidden layers of 1000 hidden units each and ReLU activations is used for this experiment, with a minibatch size of 128. First, we study different optimizers using the standard deterministic cross-entropy objective function with L2 weight decay on the parameters to prevent over-fitting. The sum-of-functions (SFO) method [Sohl-Dickstein et al., 2014] is a recently proposed quasi-Newton method that works with minibatches of data and has shown good performance on the optimization of multi-layer neural networks. We used their implementation and compared it with Adam for training such models. Figure 6.2 shows that Adam makes faster progress in terms of both the number of iterations and wall-clock time. Due to the cost of updating curvature information, SFO is 5-10x slower per iteration than Adam, and has a memory requirement that is linear in the number of minibatches. Stochastic regularization methods, such as dropout, are an effective way to prevent over-fitting and are often used in practice due to their simplicity. SFO assumes deterministic subfunctions, and indeed failed to converge on cost functions with stochastic regularization. We compare the effectiveness of Adam to other stochastic first-order methods on multi-layer neural networks trained with dropout noise. Figure 6.2 shows our results; Adam shows better convergence than the other methods.

6.6.3 Convolutional neural networks

Convolutional neural networks (CNNs) with several layers of convolution, pooling and non-linear units have shown considerable success in computer vision tasks. Unlike most fully connected neural nets, weight sharing in CNNs results in vastly different gradients in different layers. A smaller learning rate Chapter 6. Accelerating learning using Adaptive Moment methods 78


for the convolution layers is often used in practice when applying SGD. We show the effectiveness of Adam in deep CNNs. Our CNN architecture has three alternating stages of 5×5 convolution filters and 3×3 max pooling with a stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs). The input images are pre-processed by whitening, and dropout noise is applied to the input layer and fully connected layer. The minibatch size is also set to 128, as in the previous experiments. Interestingly, although both Adam and Adagrad make rapid progress lowering the cost in the initial stage of training, shown in Figure 6.3 (left), Adam and SGD eventually converge considerably faster than Adagrad for CNNs, shown in Figure 6.3 (right). We notice the second moment estimate v̂_t vanishes to zero after a few epochs and is dominated by the ε in Algorithm 1. The second moment estimate is therefore a poor approximation to the geometry of the cost function in CNNs compared to the fully connected networks of Section 6.6.2. Reducing the minibatch variance through the first moment is more important in CNNs, and contributes to the speed-up. As a result, Adagrad converges much slower than the others in this particular experiment. Though Adam shows only marginal improvement over SGD with momentum, it adapts the learning rate scale for the different layers instead of requiring it to be hand-picked as in SGD.

Figure 6.2: Training of multilayer neural networks on MNIST images. (a) Neural networks using dropout stochastic regularization. (b) Neural networks with a deterministic cost function. We compare with the sum-of-functions (SFO) optimizer [Sohl-Dickstein et al., 2014].

6.6.4 Bias-correction term

We also empirically evaluate the effect of the bias correction terms explained in sections 6.2 and 6.3. As discussed in section 6.5, removal of the bias correction terms results in a version of RMSProp [Tieleman and Hinton, 2012a] with momentum. We vary β1 and β2 when training a variational auto-encoder (VAE) with the same architecture as in [Kingma and Welling, 2013]: a single hidden layer with 500 hidden units with softplus nonlinearities and a 50-dimensional spherical Gaussian latent variable. We iterated over a broad range of hyper-parameter choices, i.e. β1 ∈ [0, 0.9], β2 ∈ {0.99, 0.999, 0.9999}, and log10(α) ∈ [−5, ..., −1]. Values of β2 close to 1, required for robustness to sparse gradients, result in larger initialization bias; we therefore expect the bias correction term to be important in such cases of slow decay, preventing an adverse effect on optimization.


Figure 6.3: Convolutional neural networks training cost. (left) Training cost for the first three epochs. (right) Training cost over 45 epochs. CIFAR-10 with c64-c64-c128-1000 architecture.

In Figure 6.4, values of β2 close to 1 indeed lead to instabilities in training when no bias correction term is present, especially during the first few epochs of training. The best results were achieved with small values of (1 − β2) and bias correction; this was more apparent towards the end of optimization, when gradients tend to become sparser as hidden units specialize to specific patterns. In summary, Adam performed equal to or better than RMSProp, regardless of the hyper-parameter setting.

Figure 6.4: Effect of bias-correction terms (red line) versus no bias correction terms (green line) after 10 epochs (left) and 100 epochs (right) on the loss (y-axes) when learning a Variational Auto-Encoder (VAE) [Kingma and Welling, 2013], for different settings of stepsize α (x-axes) and hyper-parameters β1 and β2.

6.7 Extensions

6.7.1 AdaMax

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L² norm of their individual current and past gradients. We can generalize the L² norm based update rule to an Lᵖ norm based update rule. Such variants become numerically unstable for large p. However, in the special case where we let p → ∞, a surprisingly simple and stable algorithm emerges; see Algorithm 2. We will now derive the algorithm. Let, in the case of the Lᵖ norm, the stepsize at time t be inversely proportional to v_t^{1/p}, where:

v_t = \beta_2^p v_{t-1} + (1-\beta_2^p)|g_t|^p    (6.6)
    = (1-\beta_2^p)\sum_{i=1}^{t}\beta_2^{p(t-i)}\cdot|g_i|^p    (6.7)

Note that the decay term is here equivalently parameterised as β2^p instead of β2. Now let p → ∞, and


define u_t = \lim_{p\to\infty}(v_t)^{1/p}; then:

u_t = \lim_{p\to\infty}(v_t)^{1/p} = \lim_{p\to\infty}\left((1-\beta_2^p)\sum_{i=1}^{t}\beta_2^{p(t-i)}\cdot|g_i|^p\right)^{1/p}    (6.8)
    = \lim_{p\to\infty}(1-\beta_2^p)^{1/p}\left(\sum_{i=1}^{t}\beta_2^{p(t-i)}\cdot|g_i|^p\right)^{1/p}    (6.9)
    = \lim_{p\to\infty}\left(\sum_{i=1}^{t}\left(\beta_2^{t-i}\cdot|g_i|\right)^p\right)^{1/p}    (6.10)
    = \max\left(\beta_2^{t-1}|g_1|,\;\beta_2^{t-2}|g_2|,\;\ldots,\;\beta_2|g_{t-1}|,\;|g_t|\right)    (6.11)

This corresponds to the remarkably simple recursive formula

u_t = \max(\beta_2\cdot u_{t-1},\;|g_t|)    (6.12)

with initial value u_0 = 0. Note that, conveniently enough, we don't need to correct for initialization bias in this case. Also note that the magnitude of parameter updates has a simpler bound with AdaMax than with Adam, namely |Δ_t| ≤ α.

6.7.2 Temporal averaging

Since the last iterate is noisy due to stochastic approximation, better generalization performance is often achieved by averaging. Previously, in Moulines and Bach [2011], Polyak-Ruppert averaging [Polyak and Juditsky, 1992, Ruppert, 1988] has been shown to improve the convergence of standard SGD, where θ̄_t = (1/t) Σ_{k=1}^{t} θ_k. Alternatively, an exponential moving average over the parameters can be used, giving higher weight to more recent parameter values. This can be trivially implemented by adding one line to the inner loops of Algorithms 1 and 2: θ̄_t ← β2 · θ̄_{t−1} + (1 − β2)θ_t, with θ̄_0 = 0. Initialization bias can again be corrected by the estimator θ̂_t = θ̄_t/(1 − β2^t).
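A minimal sketch of this one-line extension (function and variable names are ours):

import numpy as np

def averaged_params(theta_bar, theta, t, beta=0.999):
    """One step of the exponential parameter average described above, with the same
    zero-initialization bias correction as Adam's moment estimates."""
    theta_bar = beta * theta_bar + (1 - beta) * theta
    theta_hat = theta_bar / (1 - beta ** t)      # bias-corrected average for evaluation
    return theta_bar, theta_hat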

Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See section 6.7.1 for details. Good default settings for the tested machine learning problems are α = 0.002, β1 = 0.9 and β2 = 0.999. With β1^t we denote β1 to the power t. Here, α/(1 − β1^t) is the learning rate with the bias-correction term for the first moment. All operations on vectors are element-wise.

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  u0 ← 0 (Initialize the exponentially weighted infinity norm)
  t ← 0 (Initialize timestep)
  while θt not converged do
      t ← t + 1
      g_t ← ∇θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
      m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
      u_t ← max(β2 · u_{t−1}, |g_t|) (Update the exponentially weighted infinity norm)
      θ_t ← θ_{t−1} − (α/(1 − β1^t)) · m_t/u_t (Update parameters)
  end while
  return θ_t (Resulting parameters)
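For illustration, one AdaMax step in NumPy; the small constant eps is our own addition, guarding the division before any non-zero gradient has been seen:

import numpy as np

def adamax_step(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update (Algorithm 2); a sketch, not a reference implementation."""
    t += 1
    m = beta1 * m + (1 - beta1) * g               # biased first moment estimate
    u = np.maximum(beta2 * u, np.abs(g))          # exponentially weighted infinity norm
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + eps)
    return theta, m, u, t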

6.8 Summary

In this chapter, we introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed at machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis of the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.

Chapter 7

Scale up learning with the distributed natural gradient methods

As more computational resources become available, machine learning researchers train ever larger neural networks on millions of data points using stochastic gradient descent (SGD). Although SGD scales well in terms of both the size of the dataset and the number of parameters of the model, it has rapidly diminishing returns as parallel computing resources increase. Second-order optimization methods have an affinity for well-estimated gradients and large mini-batches, and can therefore benefit much more from parallel computation in principle. Unfortunately, they often employ severe approximations to the curvature matrix in order to scale to large models with millions of parameters, limiting their effectiveness in practice versus well-tuned SGD with momentum. The recently proposed K-FAC method [Martens and Grosse, 2015] uses a stronger and more sophisticated curvature approximation, and has been shown to make much more per-iteration progress than SGD, while only introducing a modest overhead. In this chapter, we develop a version of K-FAC that distributes the computation of gradients and additional quantities required by K-FAC across multiple machines, thereby taking advantage of the method's superior scaling to large mini-batches and mitigating its additional overheads. We provide a Tensorflow implementation of our approach which is easy to use and can be applied to many existing codebases without modification. Additionally, we develop several algorithmic enhancements to K-FAC which can improve its computational performance for very large models. Finally, we show that our distributed K-FAC method speeds up training of various state-of-the-art ImageNet classification models by a factor of two compared to an improved form of Batch Normalization [Ioffe and Szegedy, 2015a].

7.1 Motivation

Current state-of-the-art deep neural networks [Szegedy et al., 2014b, Krizhevsky et al., 2012e, He et al., 2015] often require days of training time with millions of training cases. The typical strategy to speed up neural network training is to allocate more parallel resources over many machines and cluster nodes [Dean et al., 2012]. Parallel training also enables researchers to build larger models where different machines

compute different splits of the mini-batches. Although we have improved our distributed training setups over the years, neural networks are still trained with various simple first-order stochastic gradient descent (SGD) algorithms. Although SGD scales well with both the size of the model and the size of the dataset, it does not scale well with parallel computation resources. Larger mini-batches and more parallel computations exhibit diminishing returns for SGD and related algorithms.

Second-order optimization methods, which use second-order information to construct updates that account for the curvature of the objective function, represent a promising alternative. The canonical second-order methods work by inverting a large curvature matrix (traditionally the Hessian), but this does not scale well to deep neural networks with millions of parameters. Various approximations to the curvature matrix have been proposed to help alleviate this problem, such as diagonal [LeCun et al., 1998b, Duchi et al., 2011, Kingma and Ba, 2014b], block-diagonal [Le Roux et al., 2008], and low-rank ones [Schraudolph et al., 2007, Bordes et al., 2009, Wang et al., 2014, Keskar and Berahas, 2015, Moritz et al., 2016, Byrd et al., 2016, Curtis, 2016, Ramamurthy and Duffy]. Another strategy is to use Krylov-subspace methods and efficient matrix-vector product algorithms to avoid the inversion problem entirely [Martens, 2010, Vinyals and Povey, 2012, Kiros, 2013, Cho et al., 2015, He et al., 2016]. The usual problem with curvature approximations, especially low-rank and diagonal ones, is that they are very crude and only model superficial aspects of the true curvature in the objective function. Krylov-subspace methods, on the other hand, suffer because they still rely on 1st-order methods to compute their updates.

More recently, several approximations have been proposed based on statistical approximations of the Fisher information matrix [Heskes, 2000, Ollivier, 2013, Grosse and Salakhutdinov, 2015, Povey et al., 2015, Desjardins et al., 2015]. In the K-FAC approach [Martens and Grosse, 2015, Grosse and Martens, 2016], these approximations result in a block-diagonal approximation to the Fisher information matrix (with blocks corresponding to entire layers) where each block is approximated as a Kronecker product of two much smaller matrices, both of which can be estimated and inverted fairly efficiently. Because the inverse of a Kronecker product of two matrices is the Kronecker product of their inverses, this allows the entire matrix to be inverted efficiently. Martens and Grosse [2015] found that K-FAC scales very favorably to larger mini-batches compared to SGD, enjoying a nearly linear relationship between mini-batch size and per-iteration progress for medium-to-large sized mini-batches. One possible explanation for this phenomenon is that second-order methods make more rapid progress exploring the error surface and reaching a neighborhood of a local minimum where gradient noise (which is inversely proportional to mini-batch size) becomes the chief limiting factor in convergence¹. This observation implies that K-FAC would benefit in particular from a highly parallel distributed implementation.
In this chapter, we will discuss an asynchronous distributed version of K-FAC that can effectively exploit large amounts of parallel computing resources, and which scales to industrial-scale neural net models with hundreds of millions of parameters. Our method augments the traditional distributed synchronous SGD setup with additional computation nodes that update the approximate Fisher and compute its inverse. The proposed method achieves a per-iteration runtime comparable to that of normal SGD using the same mini-batch size on a typical 4-GPU cluster. We also developed a "doubly factored"

¹Mathematical evidence for this idea can be found in Martens [2014], where it is shown that (convex quadratic) objective functions decompose into noise-dependent and noise-independent terms, and that second-order methods make much more rapid progress optimizing the noise-independent term compared to SGD, while having no effect on the noise-dependent term (which shrinks with the size of the mini-batch).

Kronecker approximation for layers whose inputs are feature maps that are normally too large to be handled by the standard Kronecker-factored approximation. Finally, we empirically demonstrate that the proposed method speeds up learning of various state-of-the-art ImageNet models by a factor of two over Batch Normalization [Ioffe and Szegedy, 2015a].

7.2 Fisher information matrix and natural gradient

7.2.1 Kronecker factored approximate Fisher

Let $\mathcal{D}W = \nabla_W \mathcal{L}$ be the gradient of the negative log-likelihood $\mathcal{L}$ of a neural network w.r.t. some weight matrix $W \in \mathbb{R}^{C_{out} \times C_{in}}$ in a layer, where $C_{in}$, $C_{out}$ are the number of input/output units of the layer. The block of the Fisher information matrix of that layer is given by:

$$
F = \mathbb{E}_{x,y \sim P}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right], \qquad (7.1)
$$

where $P$ is the distribution over the input $x$ and the network's distribution over targets $y$ (implied by the negative log-likelihood objective). We assume that expectations are taken with respect to $P$ (and not the training distribution over $y$).

K-FAC [Martens and Grosse, 2015, Grosse and Martens, 2016] uses a Kronecker-factored approximation to each block which we now describe. Denote the input activation vector to the layer as $a \in \mathbb{R}^{C_{in}}$, the pre-activation inputs as $z = Wa$ and the back-propagated loss derivatives as $\mathcal{D}z = \frac{\partial \mathcal{L}}{\partial z} \in \mathbb{R}^{C_{out}}$. Note that the gradient of the weights is the outer product of the input activations and the back-propagated derivatives, $\mathcal{D}W^{\top} = a\, \mathcal{D}z^{\top}$. K-FAC approximates the Fisher block as a Kronecker product of the second-order statistics of the inputs and the backpropagated derivatives:

$$
F = \mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] = \mathbb{E}\left[ a a^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right] \approx \mathbb{E}\left[ a a^{\top} \right] \otimes \mathbb{E}\left[ \mathcal{D}z \mathcal{D}z^{\top} \right] \triangleq \hat{F}. \qquad (7.2)
$$

This approximation can be interpreted as making the assumption that the second-order statistics of the activations and the backpropagated derivatives are uncorrelated.
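To make Eq. 7.2 concrete, here is a small NumPy sketch (ours, not code from any K-FAC release) of how the two Kronecker factors might be estimated from a mini-batch; note that in K-FAC the derivatives are sampled from the model's own predictive distribution, consistent with the expectation being taken w.r.t. $P$.

```python
import numpy as np

def kfac_factors(a, dz):
    """Estimate the Kronecker factors of Eq. 7.2 from a mini-batch.
    a:  (N, C_in)  input activations to the layer
    dz: (N, C_out) derivatives w.r.t. pre-activations, sampled from the
        model's predictive distribution (the expectation is over P)."""
    n = a.shape[0]
    A = a.T @ a / n     # estimate of E[a a^T]
    G = dz.T @ dz / n   # estimate of E[Dz Dz^T]
    return A, G
```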

7.2.2 Approximate natural gradient using K-FAC

The natural gradient [Amari, 1998] is defined as the inverse of the Fisher times the gradient. It is traditionally interpreted as the direction in parameter space that achieves the largest (instantaneous) improvement in the objective per unit of change in the output distribution of the network (as measured using the KL-divergence). Under certain conditions, which almost always hold in practice, it can also be interpreted as a second-order update computed by minimizing a local quadratic approximation of the log-likelihood objective, where the Hessian is approximated using the Fisher [Martens, 2014]. To compute the approximate natural gradient in K-FAC, one multiplies the gradient for the weights of each layer by the inverse of the corresponding approximate Fisher block Fˆ for that layer. Denote

the gradient of the loss function with respect to the weights $W$ by $\mathcal{G}_W \in \mathbb{R}^{C_{in} \times C_{out}}$. We will assume the use of the factorized Tikhonov damping approach described by Martens and Grosse [2015], where the addition of the damping term $\lambda I$ to $\hat{F}$ is approximated by adding $\pi_a \lambda^{\frac{1}{2}} I$ to $\mathbb{E}[a a^{\top}]$ and $\pi_{\mathcal{D}z} \lambda^{\frac{1}{2}} I$ to $\mathbb{E}[\mathcal{D}z \mathcal{D}z^{\top}]$, where $\pi_a$ and $\pi_{\mathcal{D}z}$ are adjustment factors that are described in detail and generalized in

Sec. 7.4.1. (Note that one can also include the contribution to the curvature from any L2 regularization terms with $\lambda$.)

By exploiting the basic identities $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$ and $(A \otimes B)\,\mathrm{vec}(C) = \mathrm{vec}(B C A^{\top})$, the approximate natural gradient update $v$ can then be computed as:

$$
v = \left( \hat{F} + \lambda I \right)^{-1} \mathrm{vec}\{\mathcal{G}_W\} \approx \mathrm{vec}\left\{ \left( \mathbb{E}\left[a a^{\top}\right] + \pi_a \lambda^{\frac{1}{2}} I \right)^{-1} \mathcal{G}_W \left( \mathbb{E}\left[\mathcal{D}z \mathcal{D}z^{\top}\right] + \pi_{\mathcal{D}z} \lambda^{\frac{1}{2}} I \right)^{-1} \right\}, \qquad (7.3)
$$

which amounts to several matrix inversion and multiplication operations involving matrices roughly the same size as the weight matrix $W$.
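A minimal sketch of this update, using the factored Tikhonov damping just described; the Kronecker identities reduce the inversion of $\hat{F} + \lambda I$ to two small matrix inversions.

```python
import numpy as np

def kfac_natural_gradient(grad_W, A, G, lam, pi_a, pi_dz):
    """Approximate natural gradient of Eq. 7.3 for one layer.
    grad_W: (C_in, C_out) gradient w.r.t. the weights
    A: (C_in, C_in) estimate of E[a a^T]; G: (C_out, C_out) estimate of E[Dz Dz^T]
    lam: damping strength; pi_a, pi_dz: factored Tikhonov adjustment factors."""
    A_d = A + pi_a * np.sqrt(lam) * np.eye(A.shape[0])
    G_d = G + pi_dz * np.sqrt(lam) * np.eye(G.shape[0])
    # (A ⊗ G)^{-1} vec(grad_W) corresponds to A_d^{-1} @ grad_W @ G_d^{-1}
    return np.linalg.solve(A_d, grad_W) @ np.linalg.inv(G_d)
```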

7.2.3 Related works

Large distributed optimization frameworks have been previously studied for 1st-order methods. Dean et al. [2012] proposed a distributed system that computes gradients asynchronously over thousands of machines. The scalability of a distributed 1st-order method is heavily related to the network communication between the computation nodes. Seide et al. [2014] show an impressive speed-up for distributed SGD by reducing the network traffic overhead. Perhaps the work that is closest to ours is that of Povey et al. [2014]. Unlike our approach, their method computes the natural gradients asynchronously over many computation nodes, and the master parameters are synchronized after a time interval. Distributed K-FAC instead computes gradients synchronously, as in synchronous SGD, and only the 2nd-order statistics and eigendecompositions are computed asynchronously.

7.3 Distributed Optimization using K-FAC

Stochastic optimization algorithms benefit from low-variance gradient estimates (as might be obtained from larger mini-batches). Prior work suggests that approximate natural gradient algorithms might benefit more than standard SGD from reducing the variance [Martens and Grosse, 2015, Grosse and Martens, 2016]. One way to efficiently obtain low-variance gradient estimates is to parallelize the gradient computation across many machines in a distributed system (thus allowing large mini-batches to be processed efficiently). Because the gradient computation in K-FAC is identical to that of SGD, we parallelize the gradient computation using the standard synchronous SGD model. However, K-FAC also introduces other forms of overhead not found in SGD — in particular, estimation of second-order statistics and computation of inverses or eigenvalues of the Kronecker factors. In this section, we describe how these additional computations can be performed asynchronously. While this asynchronous computation introduces an additional source of error into the algorithm, we find that it does not significantly affect the per-iteration progress in practice. All in all, the per-iteration wall clock time of our distributed K-FAC implementation is only 5-10% higher compared to synchronous SGD with the same mini-batch size.

7.3.1 Asynchronous Fisher block inversion

Figure 7.1: The diagram illustrates the distributed computation of K-FAC. Gradient workers (blue) compute the gradient w.r.t. the loss function. Stats workers (grey) compute the sampled second-order statistics. Additional workers (red) compute inverse Fisher blocks. The parameter server (orange) uses gradients and their inverse Fisher blocks to compute parameter updates.

Computing the parameter updates as per Eq. 7.3 requires the estimated gradients to be multiplied by the inverses of the smaller Kronecker factors. This requires periodically computing (typically) either inverses or eigendecompositions of each of these factors. While these factors typically have sizes only in the hundreds or low thousands, very deep networks may have hundreds of such matrices (2 or more for each layer). Furthermore, matrix inversion and eigendecomposition see little benefit from GPU computation, so they can be more expensive than standard neural network operations. For these reasons, inverting the approximate Fisher blocks represents a significant computational cost.

It has been observed that refreshing the inverses of the Fisher blocks only occasionally and using stale values otherwise has only a small detrimental effect on average per-iteration progress, perhaps because the curvature changes relatively slowly [Martens and Grosse, 2015]. We push this a step further by computing the inverses asynchronously while the network is still training. Because the required linear algebra operations are CPU-bound while the rest of our computations are GPU-bound, we perform them on the CPU with little effective overhead. Our curvature statistics are somewhat more stale as a result, but this does not appear to significantly affect per-iteration optimization performance. In our experiments, we found that computing the inverses asynchronously usually offered a 40-50% speed-up to the overall wall-clock time of the K-FAC algorithm. A schematic sketch of this scheme is given below.
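In this sketch, a background CPU thread periodically refreshes the factor inverses while the GPU-bound training loop keeps reading (possibly stale) values. The names `get_factors`, `inverses`, and `stop` are illustrative, not part of any real K-FAC implementation.

```python
import threading
import numpy as np

inverses = {}             # stale inverses, read by the training loop
lock = threading.Lock()

def inversion_worker(get_factors, stop):
    """Refresh Fisher-factor inverses on the CPU while training runs on GPU."""
    while not stop.is_set():
        factors = get_factors()   # snapshot the current second-order statistics
        fresh = {name: np.linalg.inv(F) for name, F in factors.items()}
        with lock:
            inverses.update(fresh)   # the next optimization step picks these up

# Started once before training, e.g.:
# threading.Thread(target=inversion_worker, args=(get_factors, stop)).start()
```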

7.3.2 Asynchronous statistics computation

The other major source of computational overhead in K-FAC is the estimation of the second-order statistics of the activations and derivatives, which are needed for the Kronecker factors. In the standard K-FAC algorithm, these statistics are computed on the same mini-batches as the gradients, allowing the forward pass computations to be shared between the gradient and statistics computations. By computing the gradients and statistics on separate mini-batches, we can enable a higher degree of parallelism, at the expense of slightly more total computational operations. Under this scheme, the statistics estimation is independent of the gradient computation, so it can be done on one or more separate worker nodes with their own independent data shards. These worker nodes receive parameters from the parameter server (just as in synchronous SGD) and communicate statistics back to the parameter server. In our experiments, we assigned at most one worker to computing statistics.

7.4 Doubly-factored Kronecker approximation for large convolution layers

Computing the standard Kronecker factored Fisher approximation for a given layer involves operations on matrices whose dimension is the number of input units or output units. The cost of these operations is reasonable for most fully-connected networks because the number of units in each layer rarely exceeds a couple thousand. Large convolutional neural networks, however, often include a fully-connected layer that "pools" over a large feature map before the final softmax classification. For instance, the output of the last pooling layer of AlexNet is of size $6 \times 6 \times 256 = 9216$, which then provides inputs to the subsequent fully connected layer of 4096 ReLUs. VGG models also share a similar architecture. For the standard Kronecker-factored approximation one of the factors will be a matrix of size $9216 \times 9216$, which is too expensive to be explicitly inverted as often as is needed during training.

In this section we propose a "doubly-factored" Kronecker approximation for layers whose input is a large feature map. Specifically, we approximate the second-order statistics matrix of the inputs as itself factoring as a Kronecker product. This gives an approximation which is a Kronecker product of three matrices.

Using the AlexNet example, the $9216 \times 4096$ weight matrix in the first fully connected layer is equivalent to a filterbank of 4096 filters with kernel size $6 \times 6$ on 256 input channels. Let $a$ be a matrix of dimension $T$-by-$C_{in}$ representing the input activations (for a single training case), where $T = K_w \times K_h$ is the feature map height times width, and $C_{in}$ is the number of input channels. The Fisher block for such a layer can be written as:

$$
\mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] = \mathbb{E}\left[ \mathrm{vec}\{a\}\, \mathrm{vec}\{a\}^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right], \qquad a \in \mathbb{R}^{T \times C_{in}}. \qquad (7.4)
$$

We begin by making the following rank-1 approximation:

$$
a \approx \mathcal{K} \Psi^{\top}, \qquad (7.5)
$$

where $\mathcal{K} \in \mathbb{R}^{T}$ and $\Psi \in \mathbb{R}^{C_{in}}$ are the factors along the spatial location dimension and the input channel dimension. The optimal solution of a low-rank approximation under the Frobenius norm is given by the singular value decomposition. The activation matrix $a$ is small enough that its SVD can be computed efficiently. Let $\sigma_1$, $u_1$, $v_1$ be the first singular value and its left and right singular vectors of the activation matrix $a$, respectively. The factors of the rank-1 approximation are then chosen to be $\mathcal{K} = \sqrt{\sigma_1}\, u_1$ and $\Psi = \sqrt{\sigma_1}\, v_1$. $\mathcal{K}$ captures the activation patterns across spatial locations in a feature map and $\Psi$ captures the pattern across the filter responses. Under the rank-1 approximation of $a$ we have:

$$
\mathbb{E}\left[ \mathrm{vec}\{a\}\, \mathrm{vec}\{a\}^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right] \approx \mathbb{E}\left[ \mathrm{vec}\{\mathcal{K} \Psi^{\top}\}\, \mathrm{vec}\{\mathcal{K} \Psi^{\top}\}^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right] \qquad (7.6)
$$

$$
= \mathbb{E}\left[ \mathcal{K} \mathcal{K}^{\top} \otimes \Psi \Psi^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right]. \qquad (7.7)
$$

We further assume the second order statistics are three-way independent between the loss derivatives $\mathcal{D}z$, the activations along the input channels $\Psi$, and the activations along spatial locations $\mathcal{K}$:

$$
\mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] \approx \mathbb{E}\left[ \mathcal{K} \mathcal{K}^{\top} \right] \otimes \mathbb{E}\left[ \Psi \Psi^{\top} \right] \otimes \mathbb{E}\left[ \mathcal{D}z \mathcal{D}z^{\top} \right]. \qquad (7.8)
$$

The final approximated Fisher block is a Kronecker product of three small matrices. Note that although we assumed the feature map activations have low-rank structure, the resulting approximated Fisher is not low-rank. The approximate natural gradient for this layer can then be computed by multiplying the inverses of each of the smaller matrices against the respective dimensions of the gradient tensor. We define a

function $\mathcal{R}_i : \mathbb{R}^{d_1 \times d_2 \times d_3} \to \mathbb{R}^{d_j d_k \times d_i}$ that constructs a matrix from a 3D tensor by "reshaping" it so that the desired target dimension $i \in \{1, 2, 3\}$ maps to the columns, while the remaining dimensions ($j$ and $k$) are "folded together" and map to the rows. Given the gradient of the weights $\mathcal{G}_W \in \mathbb{R}^{T \times C_{in} \times C_{out}}$, we can compute the matrix-vector product with the inverse of the doubly-factored Kronecker approximated Fisher block as:

$$
\mathcal{R}_3^{-1}\left( \mathbb{E}\left[\mathcal{D}z \mathcal{D}z^{\top}\right]^{-1} \mathcal{R}_3\left( \mathcal{R}_2^{-1}\left( \mathbb{E}\left[\Psi \Psi^{\top}\right]^{-1} \mathcal{R}_2\left( \mathcal{R}_1^{-1}\left( \mathbb{E}\left[\mathcal{K} \mathcal{K}^{\top}\right]^{-1} \mathcal{R}_1\left( \mathcal{G}_W \right) \right) \right) \right) \right) \right), \qquad (7.9)
$$

which is a nested application of the reshape function $\mathcal{R}_i(\cdot)$ at each of the dimensions of the gradient tensor.

The doubly-factored Kronecker approximation provides a computationally feasible alternative to the standard Kronecker-factored approximation for layers that have a number of parameters in the order of hundreds of millions. For example, inverting it for the first fully connected layer of AlexNet takes about 15 seconds on an 8-core Intel Xeon CPU, and such time is amortized in our asynchronous algorithm.

Unfortunately, the homogeneous coordinate formulation is no longer applicable under this new approximation. Instead, we lump the bias parameters together and associate a full Fisher block with them, which can be explicitly computed and inverted since the number of bias parameters per layer is small.
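A NumPy sketch of the two pieces just described, under the assumption that the per-case activation matrix and the factor inverses are available as dense arrays: the rank-1 factors of Eq. 7.5 come from an SVD, and the inverse of the three-factor approximation is applied to the gradient tensor via mode-wise products, which is equivalent to the nested reshapes of Eq. 7.9.

```python
import numpy as np

def rank1_factors(a):
    """Rank-1 factors of Eq. 7.5 for a (T, C_in) activation matrix a."""
    U, s, Vt = np.linalg.svd(a, full_matrices=False)
    K = np.sqrt(s[0]) * U[:, 0]    # spatial-location factor, shape (T,)
    Psi = np.sqrt(s[0]) * Vt[0]    # input-channel factor, shape (C_in,)
    return K, Psi

def doubly_factored_nat_grad(grad_W, KK_inv, PsiPsi_inv, GG_inv):
    """Apply (E[KK^T] ⊗ E[ΨΨ^T] ⊗ E[Dz Dz^T])^{-1} to the gradient tensor
    grad_W of shape (T, C_in, C_out), as in Eq. 7.9."""
    v = np.einsum('ts,sio->tio', KK_inv, grad_W)   # spatial mode
    v = np.einsum('ij,tjo->tio', PsiPsi_inv, v)    # input-channel mode
    v = np.einsum('op,tip->tio', GG_inv, v)        # output-channel mode
    return v
```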

7.4.1 Factored Tikhonov damping for the doubly-factored Kronecker approximation

In second-order optimization methods, "damping" performs the crucial task of correcting for the inaccuracies of the local quadratic approximation of the objective that is (perhaps implicitly) optimized when computing the update [e.g. Martens and Sutskever, 2012, Martens, 2014]. In the well-known Tikhonov damping/regularization approach, one adds a multiple of the identity $\lambda I$ to the Fisher before inverting it (as one also does for L2-regularization / weight-decay), which roughly corresponds to imposing a spherical trust-region on the update.

The inverse of a Kronecker product can be computed efficiently as the Kronecker product of the inverses of its factors. Adding a multiple of the identity complicates this computation (although it can still be performed tractably using eigendecompositions). The "factored Tikhonov damping" technique proposed in [Martens and Grosse, 2015] is appealing because it preserves the Kronecker structure of the factorization, and thus the inverse can still be computed by inverting each of the smaller matrices (avoiding the more expensive eigendecomposition operation). In our experiments with large ImageNet models, we also observed that factored damping seems to perform better in practice. In this subsection we derive a generalized version of factored Tikhonov damping for the doubly-factored Kronecker approximation.

Suppose we wish to add $\lambda I$ to our approximate Fisher block $A \otimes B \otimes C$. In the factored Tikhonov scheme this is approximated by adding $\pi_a \lambda^{\frac{1}{3}} I$, $\pi_b \lambda^{\frac{1}{3}} I$, and $\pi_c \lambda^{\frac{1}{3}} I$ to $A$, $B$ and $C$ respectively, for non-negative scalars $\pi_a$, $\pi_b$ and $\pi_c$ satisfying $\pi_a \pi_b \pi_c = 1$. The error associated with this approximation is:

$$
\left(A + \pi_a \lambda^{\frac{1}{3}} I\right) \otimes \left(B + \pi_b \lambda^{\frac{1}{3}} I\right) \otimes \left(C + \pi_c \lambda^{\frac{1}{3}} I\right) - \left(A \otimes B \otimes C + \lambda I\right) \qquad (7.10)
$$

$$
= \pi_c \lambda^{\frac{1}{3}}\, A \otimes B \otimes I + \pi_b \lambda^{\frac{1}{3}}\, A \otimes I \otimes C + \pi_a \lambda^{\frac{1}{3}}\, I \otimes B \otimes C + \pi_b \pi_c \lambda^{\frac{2}{3}}\, A \otimes I \otimes I + \pi_a \pi_c \lambda^{\frac{2}{3}}\, I \otimes B \otimes I + \pi_a \pi_b \lambda^{\frac{2}{3}}\, I \otimes I \otimes C \qquad (7.11)
$$

Following Martens and Grosse [2015], we choose $\pi_a$, $\pi_b$ and $\pi_c$ by taking the nuclear norm in Eq. 7.11 and minimizing its triangle inequality-derived upper-bound. Note that the nuclear norm of a Kronecker product is the product of the nuclear norms of the individual matrices: $\|A \otimes B\|_* = \|A\|_* \|B\|_*$. This gives the following formula for the value of $\pi_a$:

$$
\pi_a = \sqrt[3]{ \left( \frac{\|A\|_*}{d_A} \right)^{2} \left( \frac{\|B\|_*}{d_B} \frac{\|C\|_*}{d_C} \right)^{-1} }, \qquad (7.12)
$$

where the $d$'s are the numbers of rows (equivalently, columns) of the corresponding Kronecker factor matrices.

The corresponding formulae for $\pi_b$ and $\pi_c$ are analogous. Intuitively, Eq. 7.12 rescales the contribution to each factor matrix according to the geometric mean of the ratios of its norm to the norms of the other factor matrices. This results in the contribution being upscaled if the factor's norm is larger than the average norm, for example. Note that this formula generalizes to Kronecker products of arbitrary numbers of matrices as the geometric mean of the norm ratios.
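As a minimal sketch, the adjustment factors of Eq. 7.12 can be computed directly from the nuclear norms of the three factors:

```python
import numpy as np

def tikhonov_pi_factors(A, B, C):
    """Adjustment factors of Eq. 7.12 for a three-factor Kronecker product."""
    def r(M):  # nuclear norm divided by the factor's dimension
        return np.linalg.norm(M, ord='nuc') / M.shape[0]
    ra, rb, rc = r(A), r(B), r(C)
    pi_a = (ra ** 2 / (rb * rc)) ** (1.0 / 3.0)
    pi_b = (rb ** 2 / (ra * rc)) ** (1.0 / 3.0)
    pi_c = (rc ** 2 / (ra * rb)) ** (1.0 / 3.0)
    return pi_a, pi_b, pi_c   # their product is 1, as required
```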

7.5 Step size selection

Although Grosse and Martens [2016] found that Polyak averaging [Polyak and Juditsky, 1992] obviated the need for tuning learning rate schedules on some problems, we observed the choice of learning rate schedules to be an important factor in our ImageNet experiments (perhaps due to higher stochasticity in the updates). On ImageNet, it is common to use a fixed exponential decay schedule [Szegedy et al., 2014b, 2015]. As an alternative to learning rate schedules, we instead use curvature information to control the amount by which the predictive distribution is allowed to change after each update. In particular, given a parameter update vector v, the second-order Taylor approximation to the KL divergence between the predictive distributions before and after the update is given by the (squared) Fisher norm:

$$
D_{\mathrm{KL}}[q \,\|\, p] \approx \frac{1}{2} v^{\top} F v \qquad (7.13)
$$

This quantity can be computed with a curvature-vector product [Schraudolph, 2002]. Observe that choosing a step size of $\eta$ will produce an update with squared Fisher norm $\eta^2 v^{\top} F v$. Instead of using a learning rate schedule, we choose $\eta$ in each iteration such that the squared Fisher norm is at most some value $c$:

$$
\eta = \min\left( \eta_{\max},\ \sqrt{\frac{c}{v^{\top} F v}} \right) \qquad (7.14)
$$

Grosse and Martens [2016] used this method to clip updates at the start of training, but we found it useful to use it throughout training. We use an exponential decay schedule $c_k = c_0 \zeta^k$, where $c_0$ and $\zeta$ are tunable parameters, and $k$ is incremented periodically (every half an epoch in our ImageNet experiments). Shrinking the maximum change in the model's predictions after each update is analogous to shrinking the trust region of the second-order optimization.

In practice, computing curvature-vector products after every update introduces significant computational overhead, so we instead used the approximate Fisher $\hat{F}$ in place of $F$, which allows the approximate Fisher norm to be computed efficiently as $v^{\top} \hat{F} v = v^{\top} \hat{F} (\hat{F}^{-1} \mathcal{G}_W) = v^{\top} \mathcal{G}_W$. The maximum step size $\eta_{\max}$ was set to a large value, and in practice this maximum was reached only at the beginning of training, when $F$ was small in magnitude. We found this outperformed simple exponential learning rate decay in our ImageNet experiments.

Figure 7.2: decayKL indicates the proposed step-size selection method and decayLR indicates standard exponential learning rate decay.
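A minimal sketch of the step-size rule of Eq. 7.14 with the decaying threshold $c_k$, using the cheap identity $v^{\top} \hat{F} v = v^{\top} \mathcal{G}_W$ noted above. The default `eta_max` below is an arbitrary placeholder; the text only says it was set to a large value.

```python
import numpy as np

def kl_step_size(v_dot_grad, k, c0=0.01, zeta=0.96, eta_max=10.0):
    """Step size of Eq. 7.14 with decayed threshold c_k = c0 * zeta**k.
    v_dot_grad: v^T G_W, which equals the approximate Fisher norm v^T F_hat v."""
    c = c0 * zeta ** k                             # k incremented every half epoch
    return min(eta_max, np.sqrt(c / max(v_dot_grad, 1e-12)))
```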

7.5.1 Experimental evaluation of the step-size selection method of Section 7.5

To compare our proposed step-size selection method from Sec. 7.5 with the commonly-used exponential learning rate decay, we performed a simple experiment training GoogLeNet. Both the learning rate and the threshold $c$ on the squared Fisher norm are decayed by a factor of 0.96 every 3200 iterations. The results of this experiment are plotted in Fig. 7.2, and indicate that our method outperforms the standard baseline.

7.6 Automatic construction of the K-FAC computation graph

In recent years, deep learning libraries have moved towards the computational graph abstraction [Bergstra et al., 2010, Abadi et al., 2016] to represent neural network computations. In this section we give a high-level description of an algorithm that scans a computational graph for parameters to which one of the various Kronecker-factored approximations can be applied, locates the nodes containing the information required to compute the second-order statistics used by the approximations, and then constructs a new graph that computes the approximations and uses them to update the parameters.

For the sake of discussion, we will assume the computation graph is a directed bipartite graph that has a set of operator nodes doing some computation, and a set of variable nodes that hold intermediate computational results. The trainable parameters are stored in memory that is loaded or mutated through read/write operator nodes. We also assume that the trainable parameters are grouped layer-wise as sets of weights and biases. Finally, we assume the gradient computation for the trainable parameters is performed by a computation graph (which is usually generated via automatic differentiation).

In analogy to generating the gradient computation graph through automatic differentiation, given an arbitrary computation graph with a set of trainable parameters, we would like to use the existing nodes in the given graph to automatically generate a new computation graph, a "K-FAC computation graph", that computes the Kronecker-factored approximate Fisher blocks associated with each group of parameters (typically layers in a neural net), and then uses them to update the parameters.

To compute the Fisher block for a given layer, we want to find all the nodes holding the gradients of the trainable parameters in a computation graph. One simple strategy is to traverse the computation graph from the gradient nodes to their immediate parent nodes. A set of parameters has a Kronecker-factored approximation to its Fisher block if its corresponding gradient node has a matrix product or convolution operator node as its immediate parent node. For these parameters, the Kronecker factor matrices are the second-order statistics of the inputs to the parent operator node of their gradient nodes (typically the activities $a$ and the back-propagated derivatives $\mathcal{D}z$). For other sets of parameters an exact Fisher block can be computed instead (assuming they have low enough dimension). In a typical neural network, most of the parameters are concentrated in weight matrices that are used for matrix product or convolution operations, for which one of the existing Kronecker-factored approximations applies. Homogeneous coordinates can be used if the weights and biases of the same layer are annotated in the computation graph. The rest of the parameters are often gain and bias vectors for each hidden unit, and it is feasible to compute and invert exact Fisher blocks for these.

Kronecker factors can sometimes be shared by the approximate Fisher blocks of two or more parameters. This is the case, for example, when a vector of units serves as inputs to two different weight-matrix multiplication operations. In such cases, the computation of the second-order statistics can be reused, which is what we do in our implementation.

A neural network can also be instantiated multiple times in a computational graph (with shared parameters) to process different inputs. The gradient of the parameters shared across the instantiations is the sum of the individual gradients from each instantiation. Given such a computation graph, the immediate parent operator node of the gradient is a summation whose inputs are computed by the same type of operators. Without additional knowledge about the computation graph, one approximation is to treat the individual gradient contributions in the summation as statistically independent of each other (similarly to how gradient contributions from multiple spatial locations are treated as independent in the KFC approximation [Grosse and Martens, 2016]). Under this approximation, the Kronecker factors associated with the gradient can be computed by lumping the statistics associated with each of the gradient contributions together.

Our implementation of distributed K-FAC in TensorFlow applies the above strategy to automatically generate K-FAC computation graphs without requiring the user to modify their existing model-definition code. A sketch of the scanning step is given below.
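The scanning step might look like the following; `parent_op` and `op_type` are hypothetical attributes of a generic graph node, not a real TensorFlow API, and the approximation names are placeholders.

```python
def assign_fisher_approximations(grad_nodes):
    """Pick a Fisher-block approximation for each parameter group by
    inspecting the operator that produced its gradient node (a sketch)."""
    plan = {}
    for param, g in grad_nodes.items():
        op = g.parent_op                       # immediate parent operator node
        if op.op_type == 'MatMul':
            plan[param] = 'kronecker'          # standard K-FAC block
        elif op.op_type == 'Conv2D':
            plan[param] = 'kfc'                # Kronecker Factors for Convolution
        elif op.op_type == 'AddN':             # shared parameters: summed gradients
            plan[param] = 'kronecker-lumped'   # lump statistics across contributions
        else:
            plan[param] = 'exact'              # small gain/bias vectors, etc.
    return plan
```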

7.7 Experiments

We experimentally evaluated distributed K-FAC on several large convolutional neural network training tasks involving the CIFAR-10 and ImageNet classification datasets.

Due to computational resource constraints, we used a single GPU server with 8 Nvidia K80 GPUs to simulate a large distributed system. The GPUs were used as gradient workers that computed the gradient over a large mini-batch, with the CPUs acting as a parameter server. The Fisher block inversions were performed on the CPUs in parallel, using as many threads as possible. The second-order statistics required for the various Fisher block approximations were computed either synchronously by the gradient workers after each gradient computation (CIFAR-10 experiments), or asynchronously using a separate dedicated "stats worker" (ImageNet experiments). Meta-parameters such as learning rates, damping parameters, and the decay-rate for the second-order statistics were optimized carefully by hand for each method. The momentum was fixed to 0.9. Similarly to Martens and Grosse [2015], we applied an exponentially decayed Polyak averaging scheme to the sequence of output iterates produced by each method. We found this improved their convergence rate in the later stages of optimization, and reduced or eliminated the need to decay the learning rates.

We chose to base our implementation of distributed K-FAC on the TensorFlow framework [Abadi et al., 2016] because it provides well-engineered and scalable primitives for distributed computation. We implement distributed K-FAC in TensorFlow by scanning the gradient-computing graph for groups of parameters whose gradient computations have particular structures. Having identified such groups, we compute/approximate their Fisher blocks using a method tailored to the type of structure observed. This type of implementation can be applied to existing model-specification code without significant modification of said code. And because TensorFlow's parallel primitives were designed with scalability in mind, it should be possible to scale our implementation to a larger distributed system with hundreds of workers.

7.7.1 CIFAR-10 classification and asynchronous Fisher block inversion

In our first experiment we evaluated the effectiveness of asynchronously computing the approximate Fisher inverses (as described in Section 7.3.1). We considered the effect that this has both on the quality of the updates, as measured by per-iteration progress on the objective, and on the average per-iteration wall-clock time.

The task is to train a basic convolutional network model on the CIFAR-10 image classification dataset [Krizhevsky and Hinton, 2009]. The model has 3 convolutional layers of 32-32-64 filters, each with a receptive field size of 5x5, followed by a softmax layer that predicts 10 classes. This model is similar, but not identical, to the one used by Grosse and Martens [2016]. All the CIFAR-10 experiments use a mini-batch size of 512.

The baseline method is a simple synchronous version of distributed K-FAC with a fixed learning rate, and up to 4 GPUs acting as gradient and stats workers, which recomputes the inverses of the approximate Fisher blocks once every 20 iterations. This baseline method behaves similarly to the implementation of K-FAC in Grosse and Martens [2016], while being potentially faster due to its greater use of parallelism. We compare this baseline to a version of distributed K-FAC where the approximate Fisher blocks are inverted asynchronously and in parallel with the rest of the optimization process. Note that under this scheme, inverses are updated about once every 16 iterations for the single-GPU condition, and every 30 iterations for the four-GPU condition. For networks larger than this relatively small CIFAR-10 net they may get updated (far) less often (e.g. the AlexNet experiments in Section 7.7.3).

The results of this first experiment are plotted in Fig. 7.3. We found that the asynchronous version iterated about 1.5 times faster than the synchronous version, while its per-iteration progress remained comparable. The plots show that the asynchronous version is better at taking advantage of parallel computation and displayed an almost linear speed-up as the number of gradient workers increases to 4.

Figure 7.3: The results from our CIFAR-10 experiment looking at the effectiveness of asynchronously computing the approximate Fisher inverses. gpu indicates the number of gradient workers. Dashed lines denote training curves and solid lines denote test curves. Top row: cross entropy loss and classification error vs the number of updates. Bottom row: cross entropy loss and classification error vs wallclock time.

In terms of the wall-clock time, using only 4 GPUs the asynchronous version of distributed K-FAC is able to complete 700 iterations in under a minute, where it achieves the minimum test error (19%).

7.7.2 ImageNet classification

In our second set of experiments we benchmarked distributed K-FAC against several other popular approaches, and considered the effect of mini-batch size on per-iteration progress. To do this we trained various off-the-shelf convnet architectures for image classification on the ImageNet dataset [Russakovsky et al., 2015]: AlexNet [Krizhevsky et al., 2012e], GoogLeNet InceptionV1 [Szegedy et al., 2014b] and the 50-layer Residual network [He et al., 2015].

Although the ImageNet training set contains 1.2 million images, a data pre-processing pipeline that includes image jittering and aspect distortion is almost always used for training on ImageNet. We used a less extensive dataset augmentation/pre-processing pipeline than is typically used for ImageNet, as the purpose of this study is not to achieve state-of-the-art ImageNet results, but rather to evaluate the optimization performance of distributed K-FAC. In particular, the dataset consists of 224x224 images and during training the original images are first resized to 256x256 and then randomly cropped back down to 224x224 before being fed to the network. Note that while it is typically the case that validation error is higher than training error, this data pre-processing pipeline for ImageNet creates an augmented training set that is more difficult than the undistorted validation set, and therefore the validation error is often lower than the training error during the first 90% of training. This observation is consistent with previously published results [He et al., 2015].

In all our ImageNet experiments, we used the cheaper Kronecker factorization from Section 7.7.3, and the KL-based step-size selection method described in Section 7.5 with parameters $c_0 = 0.01$ and $\zeta = 0.96$. The SGD baselines use an exponential learning rate decay schedule with a decay rate of 0.96. Decaying is applied after each half-epoch for distributed K-FAC and SGD+Batch Normalization, and after every two epochs for plain SGD, which is consistent with the experimental setup of Ioffe and Szegedy [2015a].

Figure 7.4: Optimization performance of distributed K-FAC and SGD training GoogLeNet on ImageNet. Dashed lines denote training curves and solid lines denote validation curves. bz indicates the size of mini-batches. rbz indicates the size of chunks used to assemble the BN updates. Top row: cross entropy loss and classification error vs the number of updates. Bottom row: cross entropy loss and classification error vs wallclock time (in hours). All methods used 4 GPUs, with distributed K-FAC using the 4-th GPU as a dedicated asynchronous stats worker.

GoogLeNet and Batch Normalization

Batch Normalization [Ioffe and Szegedy, 2015a] is a reparameterization of neural networks that can make them easier to train with first-order methods, and has been successfully applied to large ImageNet models. It can be thought of as a modification of the units of a neural network so that each one centers and normalizes its own raw input over the current mini-batch (or subset thereof), after which it applies a separate shift and scaling operation via its own local "bias" and "gain" parameters (which are optimized). These shift and scaling operations can learn to effectively undo the centering and normalization, thus preserving the class of functions that the network can compute. Batch Normalization (BN) is closely related to centering techniques [Schraudolph, 1998], and likely helps for the same reason that they do, which is that the alternative parameterization gives rise to loss surfaces with more favorable curvature properties. The main difference between BN and traditional centering is that BN makes the centering and normalization operations part of the model instead of the optimization algorithm (and thus "backprops" through them when computing the gradient), which helps stabilize the optimization.

Without any changes to the algorithm, distributed K-FAC can be used to train neural networks that have BN layers. The weight-matrix gradient for such layers has the same structure as it does for standard layers, and so Fisher blocks can be approximated using the same set of techniques. The per-unit gain and bias parameters cause a minor complication, but because they are relatively few in number, one can compute an exact Fisher block for each of them.

Computing updates for BN networks over large mini-batches is usually done by splitting the mini-batch into chunks of size 32, computing the gradients separately for these chunks (using only the data in the chunk to compute the mean and variance statistics), and then summing them together. Using small sample sets to compute the statistics like this introduces additional stochasticity into the BN update that acts as a regularizer, but can also hurt optimization performance. To help decouple the effects of regularization and optimization, we also compared to a BN baseline that uses larger chunks. We found that using larger chunks can give a factor of 2 speed-up in optimization performance over the standard BN baseline. In our figures rbz will indicate the chunk size, which defaults to 32 if left unspecified.

In Fig. 7.4, we compare distributed K-FAC to SGD on GoogLeNet with and without BN. All methods used 4 GPUs, with distributed K-FAC using the 4-th GPU as a dedicated asynchronous stats worker.

Figure 7.5: Empirical evaluation of the proposed cheaper Kronecker approximation on GoogLeNet. bz indicates the size of the mini-batches. Dashed lines denote training curves and solid lines denote validation curves. Top row: cross entropy loss and classification error vs the number of updates. Bottom row: cross entropy loss and classification error vs wallclock time.

We observe that the per-iteration progress made by distributed K-FAC on the training objective is not significantly affected by the use of BN. Moreover, distributed K-FAC is 3.5 times faster than SGD with the standard BN baseline (orange line) and 1.5-2 times faster than the enhanced BN baseline (blue line). BN, however, does help distributed K-FAC generalize better, likely due to its aforementioned regularizing effect. For the simplicity of our discussion, distributed K-FAC is not combined with BN in the rest of the experiments, as we are chiefly interested in evaluating optimization performance, not regularization, and BN does not seem to provide any additional benefit to distributed K-FAC with regard to the former. Note that this is not too surprising, given that K-FAC is provably invariant to the kind of centering and normalization transformations that BN performs [Martens and Grosse, 2015].

7.7.3 A cheaper Kronecker factor approximation for convolution layers

In a convolution layer, the gradient is the sum of the outer products between the receptive field input activations $a_t$ and the back-propagated derivatives $\mathcal{D}z_t$ at each spatial location $t \in \mathcal{T}$. One cannot simply apply the standard Kronecker-factored approximation from Martens and Grosse [2015] to each location, sum the results, and then take the inverse, as there is no known efficient algorithm for computing the inverse of such a sum. In Grosse and Martens [2016], a Kronecker-factored approximation for convolutional layers called Kronecker Factors for Convolution (KFC) was developed. It works by introducing additional statistical assumptions about how the weight gradients are related across locations. In particular, KFC assumes spatial homogeneity, i.e. that all locations have the same statistics, and spatially uncorrelated derivatives, which (essentially) means that gradients from any two different locations are statistically independent. This yields the following approximation:

$$
\mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] \approx |\mathcal{T}|\, \mathbb{E}\left[ a_t a_t^{\top} \right] \otimes \mathbb{E}\left[ \mathcal{D}z_t \mathcal{D}z_t^{\top} \right]. \qquad (7.15)
$$
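In a NumPy-style sketch, and assuming the per-location activations and derivatives are available as dense arrays, the KFC factors of Eq. 7.15 are estimated by averaging the per-location second-order statistics:

```python
import numpy as np

def kfc_conv_factors(a, dz):
    """KFC factors (Eq. 7.15), averaging per-location outer products.
    a:  (N, T, C_in)  receptive-field activations at each spatial location
    dz: (N, T, C_out) backpropagated derivatives at each location."""
    N, T, _ = a.shape
    A = np.einsum('nti,ntj->ij', a, a) / (N * T)     # E[a_t a_t^T]
    G = np.einsum('nto,ntp->op', dz, dz) / (N * T)   # E[Dz_t Dz_t^T]
    return T * A, G   # |T| scaling from Eq. 7.15 folded into the first factor
```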

In this section we introduce an arguably simpler Kronecker factored approximation for convolutional layers that is cheaper to compute. In practice, it appears to be competitive with the original KFC approximation in terms of per-iteration progress on the objective, working worse in some experiments and better in others, while (often) improving wall-clock time due to its cheaper cost.


Figure 7.6: Optimization performance of distributed K-FAC and SGD training AlexNet on ImageNet. Dashed lines denote training curves and solid lines denote validation curves. bz indicates the size of the mini-batches. rbz indicates the size of chunks used to assemble the BN updates. Top row: cross entropy loss and validation error vs the number of updates. Bottom row: cross entropy loss and validation error vs wallclock time (in hours). All methods used 8 GPUs, with distributed K-FAC using the 8-th GPU as a dedicated asynchronous stats worker.

It works by approximating the sum of the gradients over spatial locations as the outer product of the averaged receptive field activations $\mathbb{E}_t[a_t]$ and the averaged back-propagated derivatives $\mathbb{E}_t[\mathcal{D}z_t]$, multiplied by the number of spatial locations $|\mathcal{T}|$. In other words:

$$
\mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] = \mathbb{E}\left[ \mathrm{vec}\Big\{ \sum_{t \in \mathcal{T}} \mathcal{D}z_t a_t^{\top} \Big\}\, \mathrm{vec}\Big\{ \sum_{t \in \mathcal{T}} \mathcal{D}z_t a_t^{\top} \Big\}^{\top} \right] \qquad (7.16)
$$

$$
= \mathbb{E}\left[ \Big( \sum_{t \in \mathcal{T}} a_t \otimes \mathcal{D}z_t \Big) \Big( \sum_{t \in \mathcal{T}} a_t \otimes \mathcal{D}z_t \Big)^{\top} \right] \qquad (7.17)
$$

$$
\approx \mathbb{E}\left[ \Big( |\mathcal{T}|\, \mathbb{E}_t[a_t] \otimes \mathbb{E}_t[\mathcal{D}z_t] \Big) \Big( |\mathcal{T}|\, \mathbb{E}_t[a_t] \otimes \mathbb{E}_t[\mathcal{D}z_t] \Big)^{\top} \right] \qquad (7.18)
$$

Under the approximation assumption that the second-order statistics of the average activations $\mathbb{E}_t[a_t]$ and the second-order statistics of the average derivatives $\mathbb{E}_t[\mathcal{D}z_t]$ are uncorrelated, this becomes:

$$
|\mathcal{T}|^2\, \mathbb{E}\left[ \mathbb{E}_t[a_t]\, \mathbb{E}_t[a_t]^{\top} \right] \otimes \mathbb{E}\left[ \mathbb{E}_t[\mathcal{D}z_t]\, \mathbb{E}_t[\mathcal{D}z_t]^{\top} \right] \qquad (7.19)
$$

This approximation is cheaper than the original KFC approximation because it is easier to compute a single outer product (after averaging over locations) than it is to compute an outer product at each location and then average. In the synchronous setting, for the large convolutional networks we experimented with, this trick resulted in a 20-30% decrease in overall wall clock time per iteration, with little effect on per-iteration progress.
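For contrast with the KFC sketch above, the cheaper factorization of Eq. 7.19 averages over locations first and forms a single outer product per case:

```python
import numpy as np

def cheap_conv_factors(a, dz):
    """Cheaper Kronecker factors (Eq. 7.19): average over locations first.
    a:  (N, T, C_in); dz: (N, T, C_out), as in the KFC sketch."""
    N, T, _ = a.shape
    a_bar = a.mean(axis=1)      # E_t[a_t] per training case
    dz_bar = dz.mean(axis=1)    # E_t[Dz_t] per training case
    A = T * (a_bar.T @ a_bar) / N
    G = T * (dz_bar.T @ dz_bar) / N
    return A, G   # the |T|^2 scaling of Eq. 7.19 split evenly between factors
```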

AlexNet and the doubly-factored Kronecker approximation

To demonstrate that distributed K-FAC can efficiently optimize models with very wide layers, we train AlexNet using distributed K-FAC and compare to SGD+BN. The doubly-factored Kronecker approximation proposed in Section 7.4 is applied to the first fully-connected layer of AlexNet, which has 9216 input units and is thus too wide for the standard Kronecker approximation to be feasible. Note that even with this additional approximation, computing all of the Fisher block inverses for AlexNet is very expensive, and in our experiments they only get updated once every few hundred iterations by our 16-core Xeon 2.2GHz CPU. The results from this experiment are plotted in Fig. 7.6. They show that distributed K-FAC still works well despite the potentially extreme staleness of the Fisher block inverses, speeding up training by a factor of 1.5 over the improved SGD+BN baseline.

Figure 7.7: Optimization performance of distributed K-FAC and SGD training ResNet50 on ImageNet. Dashed lines are the training curves and solid lines are the validation curves. bz indicates the size of mini-batches. rbz indicates the size of chunks used to assemble the BN updates. Top row: cross entropy loss and classification error vs the number of updates. Bottom row: cross entropy loss and classification error vs wallclock time (in hours). All methods used 8 GPUs, with distributed K-FAC using the 8-th GPU as a dedicated asynchronous stats worker.

Very deep architectures (ResNets)

In recent years very deep convolutional architectures have been successfully applied to ImageNet classification. These networks are particularly challenging to train because the usual difficulties associated with deep learning are especially severe. Fortunately second-order optimization is perhaps ideally suited to addressing these difficulties in a robust and principled way [Martens, 2010].

To investigate whether distributed K-FAC can scale to such architectures and provide useful acceleration, we compared it to SGD+BN using the 50-layer ResNet architecture [He et al., 2015]. The results from this experiment are plotted in Fig. 7.7. They show that distributed K-FAC provides a significant speed-up during the early stages of training compared to SGD+BN.

Mini-batch size scaling properties

In our final experiment we explored how well distributed K-FAC scales as additional parallel computing resources become available. To do this we trained GoogLeNet with varying mini-batch sizes of {256, 1024, 2048}, and measured per-training-case progress. Ideally, if extra gradient data is being used efficiently, one should expect the per-training-case progress to remain relatively constant with respect to mini-batch size. The results from this experiment are plotted in Fig. 7.8, and show that distributed K-FAC exhibits something close to this ideal behavior, while SGD+BN rapidly loses data efficiency when moving beyond a mini-batch size of 256. These results suggest that distributed K-FAC, more so than the SGD+BN baseline, is capable of speeding up training in proportion to the amount of parallel computational resources used.

Figure 7.8: The comparison of distributed K-FAC and SGD on per-training-case progress on training loss and errors. The experiments were conducted using GoogLeNet with various mini-batch sizes.

7.8 Summary

We have introduced distributed K-FAC, an asynchronous distributed second-order optimization algorithm which computes Kronecker-factored Fisher approximations and stochastic gradients over larger mini-batches asynchronously and in parallel. Our experiments show that the extra overhead introduced by distributed K-FAC is mostly mitigated by the use of parallel asynchronous computation, resulting in updates that can be computed in a similar amount of time to those of distributed SGD, while making much more progress on the objective function per iteration. We showed that in practice this can lead to speedups of roughly 3.5x compared to standard SGD + Batch Normalization (BN), and 2x compared to SGD + an improved version of BN on large-scale convolutional network training tasks. We also proposed a doubly-factored Kronecker approximation that allows distributed K-FAC to scale up to large models with hundreds of millions of parameters, and demonstrated the effectiveness of this approach in experiments. Finally, we showed that distributed K-FAC enjoys a favorable scaling property with mini-batch size that is seemingly not shared by SGD+BN. In particular, we showed that per-iteration progress tends to be proportional to the mini-batch size up to a much larger threshold than for SGD+BN. This suggests that it will yield even further reductions in total wall-clock training time when implemented in a larger distributed system than the one we considered.

Chapter 8

Conclusion

The focus of this thesis is to develop attention-based deep learning models, illustrated through examples from computer vision tasks. Using the recurrent attention models of Chapter 3 as our starting point, we introduce several computational models of attention in the context of neural networks, applied to a variety of sequence learning problems. The overall development is two-fold: first, we design attention-based neural networks from a probabilistic inference perspective; we then derive improved learning algorithms for these models. By studying these attention-based neural networks, we have gained key insights into which information the model is using to make its decisions and, at the same time, can substantially improve the computational efficiency of neural networks.

8.1 Attention-based neural networks

Attention-based neural networks are a new class of supervised learning models. Unlike traditional CNNs, visual attention-based models can reduce the number of parameters and computational operations by selecting informative regions within an image to focus on. In Chapter 3, we formally establish the connection between variational inference and learning stochastic recurrent attention-based neural networks. The DRAM model obtains high performance on the Street View House Numbers (SVHN) recognition task, comparable to the previous state-of-the-art CNN approaches. This was the first time a very different, sequential computational model outperformed deep CNNs.

Although DRAM models have shown promise in scalability, they remain difficult to train because of intractable posterior inference over the sequences of glimpse locations and high variance in the stochastic gradient estimates. Borrowing techniques from the literature on training deep generative models and probabilistic inference, we developed a new learning algorithm using weighted importance sampling for training stochastic attention networks. This approach optimizes an improved lower bound on the log-likelihood and obtains better approximate inference compared to the previous variational approximation. This also reduces the variability in the stochastic gradients. Empirically, the improvements speed up the training time for stochastic attention networks.

In Chapter 4, we apply the attention-based models to automatically generate image captions and highlight which image region was relevant for each word in the caption. We demonstrate that attention can also add a degree of interpretability to current deep neural networks, as one can understand

what signals the algorithm is using by examining where it is attending. Our method shows significant improvement over simpler, feed-forward CNNs for generating richer and more detailed captions.

Beyond processing large input images or long input sequences, we studied in Chapter 5 how an attention mechanism can also be used to enhance temporary storage in recurrent neural networks (RNNs) by selectively attending to relevant recent memories. We show that a form of self-attention, which acts as a fast associative memory, can be combined with a standard RNN. Fast weights, therefore, provide a "temporal cache" to implement recursion, extending the very limited memory capacity of standard RNNs by adding the high-capacity temporary storage that is needed to integrate perception over many glimpses or to achieve powerful reasoning capabilities.

8.2 Stochastic optimization

Stochastic optimization is at the core of modern deep learning systems. Although we have improved our neural network models over the years, they are still trained with various simple SGD-like algorithms because, unlike most second-order methods, SGD scales well to millions of training cases and millions of model parameters. In Chapter 6, we present a simple improvement to standard SGD, Adam, that adapts the learning rate using the second moment of the stochastic gradients. Adam has minimal computational overhead and only requires first-order gradient information. Our method has the advantages that its parameter updates are invariant to gradient rescaling, and that it naturally performs adaptive learning rate annealing, which helps stabilize convergence toward the end of learning. We show that the default hyper-parameters of our method, including the learning rate, work robustly across most deep learning models and datasets.

Second-order optimization methods, such as natural gradient, benefit more from well-estimated gradients and can therefore make better use of larger mini-batches. However, in practice, they are computationally intractable for large-scale deep learning models. In Chapter 7, we investigate how a distributed asynchronous natural gradient method can alleviate this computational cost. The method is designed to combine the lower overhead of distributed SGD with the superior scaling to large mini-batches of Kronecker-factored natural gradient. The second-order information is incorporated asynchronously, thus mitigating the additional overheads. In the experiments, distributed K-FAC shows a 3.5x speedup compared to SGD with batch normalization in training state-of-the-art deep neural networks.

8.3 Future directions

Despite the new progress in deep learning discussed in this thesis, ever-growing dataset sizes with in- creasing complexity will pose significant challenges to the current machine learning methods in the next decade. Here, we propose several future directions to improve the learning algorithms.

8.3.1 Variance reduction and learning

In a learning problem, achieving a certain loss under a fixed computation budget depends on various factors, e.g. dataset size, input dimensionality, model complexity and the number of training samples to estimate the gradient at every step [Bottou, 2010]. One of the most important factors affecting the Chapter 8. Conclusion 101 performance of the SGD-like optimization algorithm is the variance of the gradient estimated from the mini-batches. Various works have studied the trade-offs for a range of mini-batch sizes [Goyal et al., 2017, Masters and Luschi, 2018]. However, such trade-offs are not yet clear for learning the large-scale neural networks used in practice. Another important factor affecting the variance trade-offs is the variability in training examples. Consider two illustrative examples. Suppose we have an objective function that could have zero gradients at some inputs, e.g. hinge loss for binary classification. For inputs at which the gradient of the objective is zero, there is no contribution made to the gradient estimator. In this case, using large mini-batches is inefficient because it spends the same amount of time on computing the gradient of any data point. The second example is when we have exact duplicate training data. We can save computation time by removing all duplicates except one. Although not necessarily desirable, we can keep the gradient estimates unbiased by always multiplying the gradient of the remaining example proportional to the number of duplicates. In both cases, a smart learning algorithm would identify the inappropriate training examples from the training set, excluding them from sampling. Such an algorithm would achieve a lower variance in the gradient estimates under the fixed computation cost and hence faster convergence.

8.3.2 Beyond maximum likelihood learning

Maximum likelihood estimation is the preferred learning framework in most of the scenarios. While it has many desired properties, maximum likelihood learning of a mapping between two domains requires paired information or alignment between the input vectors and the label data. This is undesirables as most of the training data are unaligned in the wild. Another successful framework for learning from the data distribution is Generative Adversarial Networks (GANs)[Goodfellow et al., 2014]. In the absence of alignment, GANs aims at learning a deterministic neural network mapping from an input distribution, e.g. Gaussian, to an output distribution, which is only represented by their empirical samples. In order to measure the mapping quality, various metrics between the probability distributions have been proposed. While the MLE learning can be viewed as minimizing the KL divergence between the empirical training data and model distribution, the Jensen-Shannon distance in the original work on GANs [Goodfellow et al., 2014] can be viewed as minimizing the KL divergence in the reverse order, that is between the model distribution and the empirical training data. Later works have proposed other metrics such as f-divergences [Nowozin et al., 2016]. Recently, the introduction of Wasserstein distance [Arjovsky et al., 2017] as a metric for GANs re-surged the interest in the field of optimal transport [Villani, 2008]. Arjovsky et al. [2017] provides a game-representation for their proposed Wasserstein GAN formulation based on the dual form of the resulting optimal transport problem. The optimal transport framework is an elegant solution addressing the distribution learning problem from samples without alignment. From an optimization perspective, minimizing the optimal transport distance remains challenging. Developing efficient neural network learning algorithms for large-scale optimal transport problems remains an important future research direction for unsupervised learning, self-supervised learning and representation learning. Bibliography

Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

LF Abbott and Wade G Regehr. Synaptic computation. Nature, 431(7010):796–803, 2004.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. arXiv preprint arXiv:1512.02595, 2015.

James A Anderson and Geoffrey E Hinton. Models of information processing in the brain. Parallel models of associative memory, pages 9–48, 1981.

Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

Martin Arjovsky, Soumith Chintala, and L´eon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.

J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In International Conference on Learning Representations, 2015.

J. Ba, R. Kiros, and G. Hinton. Layer normalization. arXiv:1607.06450, 2016.

Jimmy Lei Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv:1412.7755 [cs.LG], December 2014.

Philip Bachman and Doina Precup. Data generation as sequential decision making. In NIPS, 2015.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015a.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015b.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs.CL], September 2014.

102 BIBLIOGRAPHY 103

Pierre Baldi and Peter Sadowski. The dropout learning algorithm. Artificial intelligence, 210:78–122, 2014.

Omri Barak and Misha Tsodyks. Persistent activity in neural networks with dynamic synapses. PLoS Comput Biol, 3(2):e35, 2007.

James Bergstra, Olivier Breuleux, Fr´ed´eric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Des- jardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A cpu and gpu math compiler in python. In Proc. 9th Python in Science Conf, pages 1–7, 2010.

Guo-qiang Bi and Mu-ming Poo. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. The Journal of neuroscience, 18(24): 10464–10472, 1998.

Antoine Bordes, L´eonBottou, and Patrick Gallinari. Sgd-qn: Careful quasi-newton stochastic gradient descent. Journal of Machine Learning Research, 10(Jul):1737–1754, 2009.

J. Bornschein and Y. Bengio. Reweighted wake-sleep. arXiv:1406.2751, 2014.

L´eonBottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMP- STAT’2010, pages 177–186. Springer, 2010.

Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv:1509.00519, 2015.

Richard H Byrd, SL Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

Xinlei Chen and C Lawrence Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.

Kyunghyun Cho, Bart Van Merri¨enboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014a.

Kyunghyun Cho, Bart van Merrienboer, C¸aglar G¨ul¸cehre,Dzmitry Bahdanau, Fethi Bougares, Hol- ger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014b.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine transla- tion. In EMNLP, October 2014c.

Minhyung Cho, Chandra Dhir, and Jaehyung Lee. Hessian-free optimization for learning deep multidi- mensional recurrent neural networks. In Advances in Neural Information Processing Systems, pages 883–891, 2015.

Tim Cooijmans, Nicolas Ballas, C´esar Laurent, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016. BIBLIOGRAPHY 104

Maurizio Corbetta and Gordon L Shulman. Control of goal-directed and stimulus-driven attention in the brain. Nature reviews neuroscience, 3(3):201–215, 2002.

Frank Curtis. A self-correcting variable-metric algorithm for stochastic optimization. In Proceedings of The 33rd International Conference on Machine Learning, pages 632–641, 2016.

Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short- term memory. arXiv preprint arXiv:1602.03032, 2016.

P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7: 889–904, 1995.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.

Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, et al. Recent advances in deep learning for speech research at microsoft. ICASSP 2013, 2013.

Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 2012.

Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.

Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.

Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.

Jeff Donahue, Lisa Anne Hendrikcs, Segio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389v2 [cs.CV], November 2014.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Desmond Elliott and Frank Keller. Image description using visual dependency representations. In EMNLP, pages 1292–1302, 2013.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Doll´ar,Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, et al. From captions to visual concepts and back. arXiv:1411.4952 [cs.CV], November 2014.

Elizabeth Gardner. The space of interactions in neural network models. Journal of physics A: Mathe- matical and general, 21(1):257, 1988. BIBLIOGRAPHY 105

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information pro- cessing systems, pages 2672–2680, 2014.

Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit num- ber recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.

Priya Goyal, Piotr Doll´ar,Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2014.

Alex Graves. Generating sequences with recurrent neural networks. Technical report, arXiv preprint arXiv:1308.0850, 2013a.

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013b.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to trans- duce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1819– 1827, 2015.

K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: a recurrent neural network for image generation. arXiv:1502.04623, 2015a.

Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015b.

Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015c.

Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, 2010.

Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.

Roger Grosse and Ruslan Salakhutdinov. Scaling up natural gradient by factorizing fisher information. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. BIBLIOGRAPHY 106

Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy, and Martin Tak´aˇc.Large scale distributed hessian-free optimization for deep neural network. arXiv preprint arXiv:1606.00511, 2016.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suley- man, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.

Tom Heskes. On “natural” learning and pruning in multilayered perceptrons. Neural Computation, 12 (4):881–901, 2000.

G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012a.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012b.

Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society, pages 177–186. Erlbaum, 1987.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdi- nov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012c.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997a.

Sepp Hochreiter and J¨urgenSchmidhuber. Long short-term memory. Neural computation, 9(8):1735– 1780, 1997b.

Sepp Hochreiter and J¨urgenSchmidhuber. Long short-term memory. Neural computation, 9(8):1735– 1780, 1997c.

Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, pages 853–899, 2013.

John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.

Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015a.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015b. BIBLIOGRAPHY 107

L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions of Pattern Analysis and Machine Intelligence, 20(11):1254–59, November 1998.

Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.

T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In Interna- tional Conference on Computer Vision, 2009.

Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics, 2013.

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306, 2014.

Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306 [cs.CV], December 2014.

Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Nitish Shirish Keskar and Albert S Berahas. adaqn: An adaptive quasi-newton algorithm for training rnns. arXiv preprint arXiv:1511.01169, 2015.

D. Kingma and J. L. Ba. Adam: a method for stochastic optimization. ICLR, 2014a. arXiv:1412.6980.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR 2015, 2014b.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG], December 2014c.

Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In The 2nd International Conference on Learning Representations (ICLR), 2013.

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014a.

Durk P. Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014b.

Ryan Kiros. Training neural networks with stochastic hessian-free optimization. arXiv preprint arXiv:1301.3641, 2013.

Ryan Kiros, Ruslan Salahutdinov, and Richard Zemel. Multimodal neural language models. In Interna- tional Conference on Machine Learning, pages 595–603, 2014a.

Ryan Kiros, Ruslan Salakhutdinov, and Richard Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539 [cs.LG], November 2014b.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014c. BIBLIOGRAPHY 108

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014d.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In NIPS, 2015.

Teuvo Kohonen. Correlation matrix memories. Computers, IEEE Transactions on, 100(4):353–359, 1972.

A. Krizhevsky, I. Sutskever, , and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012a.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. , University of Toronto, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012b.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In NIPS. 2012c.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012d.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012e.

Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Babytalk: Understanding and generating simple image descriptions. PAMI, IEEE Transactions on, 35(12):2891–2903, 2013.

Polina Kuznetsova, Vicente Ordonez, Alexander C Berg, Tamara L Berg, and Yejin Choi. Collective generation of natural image descriptions. In Association for Computational Linguistics: Long Papers, pages 359–368. Association for Computational Linguistics, 2012.

Polina Kuznetsova, Vicente Ordonez, Tamara L Berg, and Yejin Choi. Treetalk: Composition and compression of trees for image descriptions. TACL, 2(10):351–362, 2014.

Hugo Larochelle and Geoffrey E Hinton. Learning to combine foveal glimpses with a third-order boltz- mann machine. In NIPS, pages 1243–1251, 2010a.

Hugo Larochelle and Geoffrey E. Hinton. Learning to combine foveal glimpses with a third-order boltz- mann machine. In NIPS, 2010b.

Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, volume 6, page 622, 2011.

C´esarLaurent, Gabriel Pereyra, Phil´emonBrakel, Ying Zhang, and Yoshua Bengio. Batch normalized recurrent neural networks. arXiv preprint arXiv:1510.01378, 2015.

Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015. BIBLIOGRAPHY 109

Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in neural information processing systems, pages 849–856, 2008.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recogni- tion. Proceedings of the IEEE, 86(11):2278–2324, 1998a.

Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.

Yann LeCun, L´eonBottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998b.

Siming Li, Girish Kulkarni, Tamara L Berg, Alexander C Berg, and Yejin Choi. Composing simple image descriptions using web-scale n-grams. In Computational Natural Language Learning, pages 220–228. Association for Computational Linguistics, 2011.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. ECCV, 2014a.

T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar,and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014b.

Marcus Liwicki and Horst Bunke. Iam-ondb-an on-line english sentence database acquired from hand- written text on a whiteboard. In ICDAR, 2005.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics, 2011.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv:1412.6632 [cs.CV], December 2014.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zam- parelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval-2014, 2014.

Henry Markram, Joachim L¨ubke, Michael Frotscher, and Bert Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic aps and epsps. Science, 275(5297):213–215, 1997.

James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 735–742, 2010.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2408–2417, 2015. BIBLIOGRAPHY 110

James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479–535. Springer, 2012.

Dominic Masters and Carlo Luschi. Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612, 2018.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daum´eIII. Midge: Generating image descriptions from computer vision detections. In European Chapter of the Association for Computational Linguistics, pages 747– 756. Association for Computational Linguistics, 2012.

A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.

V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Neural Information Processing Systems, 2014a.

Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. arXiv preprint arXiv:1406.6247, 2014b.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic L-BFGS algo- rithm. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 249–258, 2016.

Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 4, 2011.

Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2413–2421, 2015.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016.

Yann Ollivier. Riemannian metrics for neural networks i: feedforward networks. arXiv preprint arXiv:1303.0818, 2013.

J. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012. BIBLIOGRAPHY 111

Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, 2004.

Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115–124, 2005.

Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. In ICLR, 2014.

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of dnns with natural gradient and parameter averaging. arXiv preprint arXiv:1410.7455, 2014.

Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of DNNs with natural gradient and parameter averaging. In International Conference on Learning Representations: Workshop track, 2015.

Vivek Ramamurthy and Nigel Duffy. L-SR1: A novel second order optimization method for deep learning.

Ronald A Rensink. The dynamic representation of scenes. Visual cognition, 7(1-3):17–42, 2000.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. Technical report, arXiv:1401.4082, 2014.

Nicolas L Roux and Andrew W Fitzgibbon. A fast natural newton method. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 623–630, 2010.

David E Rumelhart, Geoffrey E Hintont, and Ronald J Williams. Learning representations by back- propagating errors. Nature, 323(6088):533–536, 1986.

David Ruppert. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.

Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. arXiv preprint arXiv:1206.1106, 2012. BIBLIOGRAPHY 112

J Schmidhuber. Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In ICANN93, pages 460–463. Springer, 1993.

Nicol N. Schraudolph. Centering neural network gradient factors. In Genevieve B. Orr and Klaus-Robert M¨uller,editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 207–226. Springer Verlag, Berlin, 1998.

Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 2002.

Nicol N Schraudolph, Jin Yu, Simon G¨unter, et al. A stochastic quasi-newton method for online convex optimization. In AISTATS, volume 7, pages 436–443, 2007.

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In INTERSPEECH, pages 1058–1062, 2014.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014a.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recog- nition. arXiv preprint arXiv:1409.1556, 2014b.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.

Jasper Snoek, Kevin Swersky, Richard S Zemel, and Ryan P Adams. Input warping for bayesian opti- mization of non-stationary functions. arXiv preprint arXiv:1402.0929, 2014.

Jascha Sohl-Dickstein, Ben Poole, and Surya Ganguli. Fast large-scale optimization by unifying stochas- tic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Ma- chine Learning (ICML-14), pages 604–612, 2014.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video repre- sentations using LSTMs. In ICML, 2015.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014a.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014b.

Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014c. BIBLIOGRAPHY 113

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014a.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014b.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.

Y. Tang and R. Salakhutdinov. Learning stochastic feedforward neural networks. In Neural Information Processing Systems, 2013.

Yichuan Tang, Nitish Srivastava, and Ruslan R Salakhutdinov. Learning generative models with visual attention. In NIPS, pages 1808–1816, 2014.

The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Anger- mueller, Dzmitry Bahdanau, Nicolas Ballas, Fr´ed´ericBastien, Justin Bayer, Anatoly Belikov, et al. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.

T. Tieleman and G Hinton. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learn- ing. Technical report, 2012a.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp. Technical report, 2012b.

Misha Tsodyks, Klaus Pawelzik, and Henry Markram. Neural networks with dynamic synapses. Neural computation, 10(4):821–835, 1998.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. ICLR, 2016.

C´edricVillani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Oriol Vinyals and Daniel Povey. Krylov subspace descent for deep learning. In AISTATS, pages 1261– 1268, 2012.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. arXiv:1411.4555 [cs.CV], November 2014a.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014b.

Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. CVPR, 2016.

Sida Wang and Christopher Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 118–126, 2013.

Xiao Wang, Shiqian Ma, and Wei Liu. Stochastic quasi-newton methods for nonconvex stochastic optimization. arXiv preprint arXiv:1412.1196, 2014. BIBLIOGRAPHY 114

Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI’2001, pages 538–545, 2001a.

Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 538–545. Morgan Kaufmann Publishers Inc., 2001b.

Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language resources and evaluation, 2005.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992a.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992b.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992c.

David J Willshaw, O Peter Buneman, and Hugh Christopher Longuet-Higgins. Non-holographic asso- ciative memory. Nature, 1969.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

Yezhou Yang, Ching Lik Teo, Hal Daum´eIII, and Yiannis Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, pages 444–454. Association for Computational Linguistics, 2011.

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. arXiv preprint arXiv:1502.08029, April 2015.

W. Zaremba and I. Sutskever. Reinforcement learning neural Turing machines. arXiv:1505.00521, 2015.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, September 2014.

Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 2003.

Robert S Zucker and Wade G Regehr. Short-term synaptic plasticity. Annual review of physiology, 64 (1):355–405, 2002.