
Learning to Attend with Neural Networks

by

Lei (Jimmy) Ba

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical & Computer Engineering

© Copyright 2020 by Lei (Jimmy) Ba

Abstract

Learning to Attend with Neural Networks

Lei (Jimmy) Ba
Doctor of Philosophy
Graduate Department of Electrical & Computer Engineering
University of Toronto
2020

As more computational resources become widely available, artificial intelligence and machine learning researchers design ever larger and more complicated neural networks to learn from millions of data points.

Although traditional convolutional neural networks (CNNs) can achieve superhuman accuracy in object recognition tasks, they brute-force the problem by scanning over every location in the input images with the same fidelity. This thesis introduces a new class of neural networks inspired by the human visual system. Unlike CNNs, which process the entire image at once into the current hidden layer, attention allows for salient features to dynamically come to the forefront as needed. The ability to attend is especially important when there is a lot of clutter in a scene. However, learning attention-based neural networks poses some challenges to current machine learning techniques: What information should the neural network “pay attention” to? Where does the network store its sequences of “glimpses”? Can our learning algorithms do better than simple “trial-and-error”?

To address these computational questions, we first describe a novel recurrent visual attention model in the context of variational inference. Because the standard REINFORCE or trial-and-error algorithm can be slow due to its high-variance gradient estimates, we show that a re-weighted wake-sleep objective can improve training performance. We also demonstrate that the visual attention models outperform the previous state-of-the-art methods based on CNNs in image and caption generation tasks. Furthermore, we discuss how the visual attention mechanism can improve the working memory of recurrent neural networks (RNNs) through a novel form of self-attention. The second half of the thesis focuses on gradient-based learning algorithms. We develop a new first-order optimization algorithm to overcome the slow convergence of stochastic gradient descent in RNNs and attention-based models.

Finally, we explore the benefits of applying second-order optimization methods to training neural networks.

Acknowledgements

I would like to thank my advisors: Geoffrey Hinton, Brendan Frey and Ruslan Salakhutdinov. This PhD thesis would not have been possible without the support of these amazing mentors. I am extremely grateful to Geoff for being the most caring supervisor I could ask for. His mathematical intuition, insight into neural networks and enthusiasm for the highest research standards inspired many ideas in this thesis. I am fortunate to have worked with an incredible group of colleagues at the University of Toronto Machine Learning group: Roger Grosse, Jamie Kiros, James Martens, Kevin Swersky, Ilya Sutskever, Kelvin Xu, Hui Yuan Xiong, Chris Maddison. Vlad Mnih and Rich Caruana provided me with an outstanding and fruitful research internship experience outside of academia, and I owe them a debt of gratitude. Among the many who lent help along the way, I give my dearest thanks to my parents for their unwavering support.

Contents

1 Introduction
   1.1 What are neural networks and why do we need attention?
   1.2 Overview
   1.3 Neural networks
   1.4 Convolutional neural networks
   1.5 Recurrent neural networks
   1.6 Learning
      1.6.1 Maximum likelihood estimation and Kullback-Leibler divergence
      1.6.2 Regularization
      1.6.3 Gradient descent

2 Deep recurrent visual attention
   2.1 Motivation
   2.2 Learning where and what
   2.3 Variational lower bound objective
      2.3.1 Maximize the variational lower bound
      2.3.2 Multi-object/Sequential classification as a visual attention task
      2.3.3 Comparison with CNN
      2.3.4 Discussion
   2.4 Improved learning with re-weighted wake-sleep objective
      2.4.1 Wake-Sleep recurrent attention model
      2.4.2 An improved lower bound on the log-likelihood
      2.4.3 Training an inference network
      2.4.4 Control variates
      2.4.5 Encouraging exploration
      2.4.6 Experiments
   2.5 Summary

3 Generating image (and) captions with visual attention
   3.1 Problem definition
   3.2 Related work
   3.3 Image Caption Generation with Attention Mechanism
      3.3.1 Model details
      3.3.2 Learning stochastic “hard” vs deterministic “soft” Attention
      3.3.3 Experiments
   3.4 Generating images
      3.4.1 Model architecture
      3.4.2 Learning
      3.4.3 Generating images from captions
      3.4.4 Experiments
   3.5 Summary

4 Stabilizing RNN training with layer normalization
   4.1 Motivation
   4.2 Batch and weight normalization
   4.3 Layer normalization
      4.3.1 Layer normalized recurrent neural networks
   4.4 Related work
   4.5 Analysis
      4.5.1 Invariance under weights and data transformations
      4.5.2 Geometry of parameter space during learning
   4.6 Experimental results
      4.6.1 Order embeddings of images and language
      4.6.2 Teaching machines to read and comprehend
      4.6.3 Skip-thought vectors
      4.6.4 Modeling binarized MNIST using DRAW
      4.6.5 Handwriting sequence generation
      4.6.6 Permutation invariant MNIST
      4.6.7 Convolutional Networks
   4.7 Summary

5 Self-attention to the recent past using fast weights
   5.1 Motivation
   5.2 Evidence from physiology that temporary memory may not be stored as neural activities
   5.3 Fast Associative Memory
      5.3.1 Layer normalized fast weights
      5.3.2 Implementing the fast weights “inner loop” in biological neural networks
   5.4 Experimental results
      5.4.1 Associative retrieval
      5.4.2 Integrating glimpses in visual attention models
      5.4.3 Facial expression recognition
      5.4.4 Agents with memory
   5.5 Summary

6 Accelerating learning using Adaptive Moment methods
   6.1 Motivation
   6.2 Algorithm
      6.2.1 Adam’s update rule
   6.3 Initialization bias correction
   6.4 Convergence analysis
      6.4.1 Convergence proof
   6.5 Related work
   6.6 Experiments
      6.6.1 Logistic regression
      6.6.2 Multi-layer neural networks
      6.6.3 Convolutional neural networks
      6.6.4 Bias-correction term
   6.7 Extensions
      6.7.1 AdaMax
      6.7.2 Temporal averaging
   6.8 Summary

7 Scale up learning with the distributed natural gradient methods
   7.1 Motivation
   7.2 Fisher information matrix and natural gradient
      7.2.1 Kronecker factored approximate Fisher
      7.2.2 Approximate natural gradient using K-FAC
      7.2.3 Related works
   7.3 Distributed Optimization using K-FAC
      7.3.1 Asynchronous Fisher block inversion
      7.3.2 Asynchronous statistics computation
   7.4 Doubly-factored Kronecker approximation for large convolution layers
      7.4.1 Factored Tikhonov damping for the double-factored Kronecker approximation
   7.5 Step size selection
      7.5.1 Experimental evaluation of the step-size selection method of Section 7.5
   7.6 Automatic construction of the K-FAC computation graph
   7.7 Experiments
      7.7.1 CIFAR-10 classification and asynchronous Fisher block inversion
      7.7.2 ImageNet classification
      7.7.3 A cheaper Kronecker factor approximation for convolution layers
   7.8 Summary

8 Conclusion
   8.1 Attention-based neural networks
   8.2 Stochastic optimization
   8.3 Future directions
      8.3.1 Variance reduction and learning
      8.3.2 Beyond maximum likelihood learning

Bibliography

Relationship to Prior Work

The chapters in this thesis describe work that has been published in the following conferences:

• Chapter 2: Multiple object recognition with visual attention. Ba, J., Mnih, V., & Kavukcuoglu, K. (2015). International conference on learning representations.

• Chapter 2: Learning wake-sleep recurrent attention models. Ba, J., Salakhutdinov, R. R., Grosse, R. B., & Frey, B. J. (2015). Advances in neural information processing systems (pp. 2593–2601).

• Chapter 3: Show, attend and tell: neural image caption generation with visual attention. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Bengio, Y. (2015). International conference on machine learning (pp. 2048–2057).

• Chapter 3: Generating images from captions with attention. Mansimov, E., Parisotto, E., Ba, J. L., & Salakhutdinov, R. (2015). International conference on learning representations.

• Chapter 4: Layer normalization. Ba, J., Kiros, J. R., & Hinton, G. E. (2016). Advances in neural information processing systems symposium.

• Chapter 5: Using fast weights to attend to the recent past. Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., & Ionescu, C. (2016). Advances in neural information processing systems (pp. 4331–4339).

• Chapter 6: Adam: a method for stochastic optimization. Kingma, D., & Ba, J. (2015). International conference on learning representations.

• Chapter 7: Distributed second-order optimization using Kronecker-factored approximations. Ba, J., Grosse, R., & Martens, J. (2016). International conference on learning representations.

Chapter 1

Introduction

1.1 What are neural networks and why do we need attention?

In recent years, researchers have tackled problems in computer vision, speech recognition, and natural language processing by using deep learning methods [Hinton et al., 2012b, Krizhevsky and Hinton, 2009, Sutskever et al., 2014b] that learn powerful feature detectors directly from inputs with little or no pre-processing. Deep learning avoids the time-consuming process of designing features by hand, and as datasets get larger it can discover better features with no additional human effort. Many deep learning systems use feed-forward neural networks of many layers. While they have been very successful, state-of-the-art deep neural networks can be computationally expensive: training these models often takes weeks even when parallelizing over many machines. The high runtime cost of these models at test time is a further problem for many real-time applications on smart phones or wearable devices. As more computational resources become available, artificial intelligence and machine learning researchers train ever larger neural networks using millions of data points. Many of these systems use the largest neural network that can conveniently fit into a modern computer to exhaustively process the entire input all at once. A convolutional neural network (CNN) may recognize thousands of objects with superhuman accuracy, but standard CNNs are computationally clumsy and expensive because they examine all image locations at the same level of detail. Such brute-force learning approaches have yielded impressive results thus far, but for tasks such as learning from the rapidly growing amount of video data on YouTube, a less brute-force approach should be far more effective. One of the most curious facets of the human visual system is the presence of attention [Rensink, 2000, Corbetta and Shulman, 2002]. Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. The human visual system converts a high-dimensional visual scene into a sequence of glimpses by using intelligently selected fixation points. This dynamically allocates computational resources to the more informative parts of the input, and internal, covert attention amplifies this effect. One may, for example, spend a few minutes translating a long French paragraph to English by going back and forth between the ambiguous sentences in the source document. The same person in an unfamiliar train station can quickly find out which way to go by only glancing at the useful signs for a fraction of a second. Inspired by the human visual system, this thesis explores the topic of learning neural networks that can integrate and retrieve information by intelligent sampling.


1.2 Overview

Much of the recent work on neural networks focuses on sequence modeling using recurrent neural networks (RNNs). The applications of these models to machine translation, speech recognition, and language modeling have shown promising results in practice. Despite their success, training RNNs is often unstable and slow. The traditional RNN architectures also fail to learn long-term dependencies among their input sequences. We are interested in how neural networks can integrate information over long sequences using attention mechanisms. We take inspiration from the way humans perform sequence recognition tasks such as reading: by continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to our internal representation of the sequence. The traditional sequence-to-sequence models have turned out to be challenging to train for these tasks. In this thesis, we describe methods that address the difficulty of training RNNs on problems with complicated long-range temporal structure. There are two major themes in this thesis: modeling and learning. For modeling, our contribution is in developing novel attention-based neural networks for object recognition tasks and sequence generation tasks in both computer vision and natural language processing. For learning, our contribution is in developing new optimization algorithms to address the challenges in learning attention-based neural networks.

Outline of Thesis. In Chapter 1, we describe many popular techniques used in neural network research and deep learning. The topics can be divided into neural network architectures and traditional learning algorithms. This will serve as a foundation on which we can place the contributions of this thesis. This chapter also introduces the detailed terminology and notation that will be used throughout the thesis. Chapter 2 begins the discussion of a new attention-based neural network architecture for vision tasks. In particular, we describe an extension to the recurrent visual attention model [Mnih et al., 2014b] so that it can solve multi-object recognition tasks. We also present a new learning objective derived from variational inference, and establish a formal connection between REINFORCE and variational inference. Chapter 3 then discusses two variants of visual attention applied to caption and image generation tasks. We found our model outperformed all previous caption generation models at the time. In Chapter 4, we investigate the unstable training dynamics of attention-based RNNs on very long sequences and discuss the learning challenges present in these models. We describe a new normalization technique that addresses this challenge by normalizing groups of hidden neurons to have the same mean and standard deviation at each time step of the sequence. Building upon this, in Chapter 5 we apply the normalization technique to learn a “self-attention” neural network that can attend to its past computation. This form of “self-attention” can be used to store temporary memories of the recent past. We show this new form of temporary storage is very helpful in sequence-to-sequence models. In Chapter 6, we discuss the learning algorithms themselves by presenting a new stochastic optimization algorithm, “Adam”, for training neural networks. We analyze the convergence properties of the algorithm and discuss recent developments in stochastic optimization.
Our numerical experiments demonstrate that “Adam” can speed up convergence in learning various attention-based neural networks. Chapter 7 continues the discussion of learning algorithms for neural networks using second-order optimization techniques. We develop a novel distributed optimizer that scales the K-FAC natural gradient algorithm to train state-of-the-art deep learning models with tens of millions of parameters. Our experiments show distributed K-FAC can speed up convergence linearly with respect to the size of the mini-batch.

1.3 Neural networks

In the rest of this chapter, we give the background on neural networks and optimization that will make this thesis self-contained. We first define the basic notation for neural networks that we will use for the rest of the thesis. A feed-forward neural network or multilayer perceptron (MLP) [Rumelhart et al., 1986] is the most common neural network architecture and consists of layers of simple neuron-like processing units. Such a network of artificial neurons maps an input vector x to an output y. The neuron-like processing units compute a weighted sum of their inputs using a set of incoming weights w and pass them through an activation function f. Namely, the output of a neuron is given by:

$z = \sum_i w_i x_i + b, \qquad a = f(z), \qquad (1.1)$

where we denote the neuron activations before and after the nonlinear function as $z$ and $a$. An additional bias scalar $b$ is included in the weighted sum, which helps learning. In a deep feed-forward neural network with $d$ layers, the computation can be expressed in terms of matrix-vector operations as follows:

$z_1 = W_1 x + b_1, \qquad a_1 = f(z_1), \qquad (1.2)$

$z_2 = W_2 a_1 + b_2, \qquad a_2 = f(z_2), \qquad (1.3)$

$\cdots$

$y = W_d a_{d-1} + b_d. \qquad (1.4)$

We use the subscript to index the layers. Note that there are many element-wise nonlinear activation functions to choose from. In the past, logistic activation functions, such as the sigmoid or tanh, were the most popular options. Krizhevsky et al. [2012e] showed that rectified linear unit (ReLU) nonlinearities could be highly successful for computer vision tasks and proved faster to train than the standard sigmoid units.

$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathrm{ReLU}(z) = \max(0, z). \qquad (1.5)$

The neural networks in the rest of the thesis are assumed to have the ReLU activation function unless otherwise stated.
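To make the notation concrete, the following is a minimal NumPy sketch of the forward pass in Eqs. 1.2–1.4; the layer sizes and random initialization are illustrative rather than taken from the thesis.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases):
    """Forward pass of Eqs. 1.2-1.4: ReLU hidden layers, linear output."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)                  # z_d = W_d a_{d-1} + b_d, a_d = f(z_d)
    return weights[-1] @ a + biases[-1]      # y = W_d a_{d-1} + b_d

# Illustrative sizes: 784-dim input, two hidden layers of 256 units, 10 outputs.
rng = np.random.default_rng(0)
sizes = [784, 256, 256, 10]
weights = [rng.normal(0.0, 0.01, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.normal(size=784), weights, biases)
```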

1.4 Convolutional neural networks

Although many prior works have shown that feed-forward neural networks can express any function given enough hidden units, more complex neural network architectures are often preferred in practice because of their expressivity and better generalization. One such example is the convolutional neural network (CNN) [Rumelhart et al., 1986, LeCun et al., 1990, Krizhevsky and Hinton, 2009]. The idea of convolutional filters, which are small local filter banks applied to the entire image, has been explored in many classical works on image processing and computer vision tasks. CNNs build such prior knowledge into the neural network architecture. Unlike fully connected neural networks, the weights in CNNs are heavily constrained and local. Each neuron only processes a local source of information from the output of the previous layer. The incoming weights of the convolutional neurons, or the receptive fields, are also shared across spatial locations. These receptive fields act like feature detectors looking for a particular pattern anywhere on the input images. The outputs of a convolutional layer are therefore called feature maps, which are computed as:

$a_d = f(W_d * a_{d-1} + b_d), \qquad (1.6)$

where $*$ is the convolution operator. Both the weight matrices and the biases are shared across receptive fields. Weight sharing not only greatly reduces the number of free parameters in the network, but also encodes prior knowledge about image processing and local pattern recognition for vision tasks.
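As an illustration of Eq. 1.6, here is a minimal NumPy sketch that computes a single “valid” feature map with a shared filter; the loop-based implementation and the patch sizes are illustrative only.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """One feature map of Eq. 1.6: slide a shared local filter over the image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same weights apply at every location: shared receptive field.
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel) + bias
    return np.maximum(0.0, out)   # a_d = f(W_d * a_{d-1} + b_d) with ReLU

feature_map = conv2d_valid(np.random.rand(28, 28), np.random.randn(5, 5))
```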

1.5 Recurrent neural networks

Another important class of neural network architectures is recurrent neural networks (RNNs), which are a temporal generalization of classical feed-forward neural networks. RNNs map an input sequence to an output sequence as a nonlinear dynamical system. The network updates its hidden activations from the current input and the activations from the previous timestep using the bottom-up and the recurrent weights respectively. The hidden-to-hidden recurrent connections allow RNNs to aggregate information over time. The recurrent computation also enables the possibility of storing long-term dependencies between the inputs in its hidden activations. Formally, a standard one-layer RNN mapping of an input sequence $\{x_1, \cdots, x_T\}$ to an output sequence $\{y_1, \cdots, y_T\}$ is defined as:

$a_1 = f(W_{\text{in}} x_1 + b_{\text{rec}}), \qquad (1.7)$

$a_t = f(W_{\text{in}} x_t + W_{\text{rec}} a_{t-1} + b_{\text{rec}}), \qquad (1.8)$

$y_t = W_{\text{out}} a_t + b_{\text{out}}, \qquad (1.9)$

where $W_{\text{in}}$, $W_{\text{rec}}$ and $W_{\text{out}}$ are respectively the input, recurrent and output weights shared across the $T$ timesteps. This RNN is analogous to a feed-forward neural network with $T$ layers, except that the weights are shared between the layers. The weight sharing in RNNs allows the same model to process sequences of any number of timesteps and remember information from the past. However, it also makes the recurrent activations grow in magnitude as the input sequence gets longer. The standard logistic activation functions, such as sigmoid or tanh, tend to saturate over longer timesteps, whereas ReLU leads to exploding hidden activations. In other words, any small change at the beginning of the input sequence can cause exploding activations many timesteps later under ReLU. Le et al. [2015] point out that if the recurrent weights are initialized close to a scaled identity matrix, we can partially alleviate this problem and successfully train ReLU RNNs over thousands of timesteps. Arjovsky et al. [2016] later generalized the identity matrix result to scaled orthonormal matrices, where all the eigenvalues of the recurrent weights are close to one.
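The following is a minimal NumPy sketch of the recurrence in Eqs. 1.7–1.9; tanh is used as the activation $f$ here, and the near-identity recurrent initialization is only a nod to Le et al. [2015], with all sizes chosen for illustration.

```python
import numpy as np

def rnn_forward(xs, W_in, W_rec, W_out, b_rec, b_out):
    """Eqs. 1.7-1.9: a one-layer RNN unrolled over T timesteps.
    With a_0 = 0, Eq. 1.7 is the t = 1 special case of Eq. 1.8."""
    a = np.zeros(W_rec.shape[0])
    ys = []
    for x in xs:
        a = np.tanh(W_in @ x + W_rec @ a + b_rec)   # shared weights at every step
        ys.append(W_out @ a + b_out)                # y_t = W_out a_t + b_out
    return ys

# Illustrative sizes: 10-dim inputs, 32 hidden units, 4 outputs, T = 20 steps.
rng = np.random.default_rng(1)
ys = rnn_forward([rng.normal(size=10) for _ in range(20)],
                 rng.normal(0.0, 0.1, (32, 10)), 0.9 * np.eye(32),
                 rng.normal(0.0, 0.1, (4, 32)), np.zeros(32), np.zeros(4))
```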

Figure 1.1: An LSTM cell; lines with bolded squares imply projections with a learnt weight vector. Each cell learns how to weigh its input components (input gate), while learning how to modulate that contribution to the memory (input modulator). It also learns weights which erase the memory cell (forget gate), and weights which control how this memory should be emitted (output gate).

However, these initialization schemes require hyperparameter tuning and are sensitive to changes of scale in the input sequences. Orthogonal to the weight initializations, Hochreiter and Schmidhuber [1997b] address the exploding activation problem by introducing a new RNN activation function, the long short-term memory (LSTM) unit. LSTMs have a set of gating units that control the information flow in the RNN. These gates turn on or off to update a set of linear memory neurons.

1.6 Learning

Neural networks can be used as parametric models for many standard statistical learning tasks such as regression and classification. In statistical learning, the goal is to estimate a set of parameters from a given training dataset. For simplicity of notation, we will lump all the neural network weights into a parameter vector $\theta = [\mathrm{vec}\{W_1\}^T, \mathrm{vec}\{W_2\}^T, \cdots]^T$, where $\mathrm{vec}$ is the vectorization operator that converts a matrix or tensor to a column vector. Given a dataset of $N$ input and target pairs $\mathcal{D}_{\text{train}} = \{(x^{(n)}, t^{(n)})\}_{n=1}^N$, we can measure the performance of the neural network under the current weights according to a loss function, as the averaged loss over the training examples:

$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} l^{(n)}(\theta) = \frac{1}{N} \sum_{n=1}^{N} l((x^{(n)}, t^{(n)}), \theta), \qquad (1.10)$

where $l^{(n)}$ is the loss on each training example and $\mathcal{L}$ denotes the averaged loss over the whole training set. The above averaged loss measured on the training set is also known as the empirical risk. There is a wide variety of loss functions studied in the field of machine learning. In this thesis, we choose to focus on the following two loss functions, which appear in many regression and classification problems of practical importance: the mean squared error,

$l_{\text{MSE}}((x^{(n)}, t^{(n)}), \theta) = \frac{1}{2} \left\| y^{(n)} - t^{(n)} \right\|_2^2, \qquad (1.11)$

and the multi-class cross-entropy loss,

$l_{\text{CE}}((x^{(n)}, t^{(n)}), \theta) = -\sum_i t_i^{(n)} \log p_i^{(n)}, \qquad p = \mathrm{softmax}(y), \qquad (1.12)$

where $\mathrm{softmax}(z) = \left[ \frac{\exp(z_1)}{\sum_j \exp(z_j)}, \frac{\exp(z_2)}{\sum_j \exp(z_j)}, \cdots \right]$ is a generalization of the element-wise logistic function. Because both loss functions are smooth, they combine well with the optimization-based learning procedures used for neural networks. Given a neural network architecture, the learning problem involves searching for a set of weights that minimizes the chosen loss function:

$\theta^* = \arg\min_\theta \mathcal{L}(\theta) \qquad (1.13)$

In the case of a single neuron, the above minimization can be solved using a set of linear algebraic equations or convex programming. In general, obtaining the optimal set of weights in neural network learning is intractable due to the highly non-convex loss function $\mathcal{L}$ in the weight space.
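As a concrete check of these definitions, here is a minimal NumPy sketch of the two losses in Eqs. 1.11–1.12; the max-subtraction inside the softmax is a standard numerical-stability trick, and the sample inputs are illustrative.

```python
import numpy as np

def mse_loss(y, t):
    """Eq. 1.11: mean squared error for one example."""
    return 0.5 * np.sum((y - t) ** 2)

def softmax(z):
    z = z - np.max(z)             # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(y, t_onehot):
    """Eq. 1.12: multi-class cross-entropy on the network outputs y."""
    p = softmax(y)
    return -np.sum(t_onehot * np.log(p))

logits = np.array([2.0, 0.5, -1.0])
target = np.array([1.0, 0.0, 0.0])
print(mse_loss(logits, target), cross_entropy_loss(logits, target))
```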

1.6.1 Maximum likelihood estimation and Kullback-Leibler divergence

Loss minimization considers neural network learning from an optimization perspective; a different perspective on Eq. 1.13 is grounded in statistical inference. The neural network outputs can be viewed as defining a conditional distribution for the targets t, given the input vector x. In most applications, we are usually not interested in the distribution of the inputs themselves, but rather in predicting the possible values of the targets given a user-chosen input. In a regression problem, we may model the real-valued targets as a conditional Gaussian distribution with a mean equal to the output of the neural network and an identity covariance matrix. The log-likelihood of such a model can be written as:

$\log p(t^{(n)} \mid x^{(n)}, \theta) = -\frac{1}{2} \sum_i \left[ (y_i^{(n)} - t_i^{(n)})^2 + \log 2\pi \right]. \qquad (1.14)$

For classification problems, the conditional probability of the input belonging to the $c$th class can be modeled as a multinomial distribution,

$\log p(t^{(n)} = c \mid x^{(n)}, \theta) = \log p_c^{(n)}, \qquad p = \mathrm{softmax}(y). \qquad (1.15)$

It is easy to see that the above log-likelihoods simply correspond to the negative loss functions we defined earlier for regression and classification. Therefore, we can directly think of neural network learning as a maximum likelihood estimation (MLE) problem that maximizes the likelihood of the conditional probability defined by the outputs of the network, where the estimates are the learnable weights. Thus we can argue that a learning algorithm that optimizes these loss functions shares many desired properties with MLE methods, such as statistical consistency and efficiency guarantees. One may also be tempted to minimize the difference between the neural network’s output distribution and the empirical target distribution. The Kullback-Leibler (KL) divergence is a natural choice to measure the difference between two distributions,

$\mathbb{E}_{x \sim \mathcal{D}_{\text{train}}} \left[ \mathrm{KL}\left( p_{\text{data}}(t \mid x) \,\|\, p(t \mid x, \theta) \right) \right] = -\frac{1}{N} \sum_n \mathbb{E}_{t^{(n)} \sim p_{\text{data}}} \left[ \log p(t^{(n)} \mid x^{(n)}, \theta) \right] + \text{const.} \qquad (1.16)$

Thus, minimizing the KL divergence between the data distribution and the network’s output distribution is equivalent to MLE.
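A toy numerical check of this equivalence, under the assumption of a single input with a soft target distribution: the KL divergence in Eq. 1.16 differs from the expected negative log-likelihood only by the data entropy, which is constant in the model parameters.

```python
import numpy as np

# Toy check of Eq. 1.16: KL(p_data || p_model) = NLL - H(p_data).
p_data = np.array([0.7, 0.2, 0.1])           # empirical target distribution
p_model = np.array([0.5, 0.3, 0.2])          # network's predictive distribution

kl = np.sum(p_data * np.log(p_data / p_model))
nll = -np.sum(p_data * np.log(p_model))      # expected negative log-likelihood
entropy = -np.sum(p_data * np.log(p_data))   # independent of the parameters
assert np.isclose(kl, nll - entropy)
```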

1.6.2 Regularization

It is a common belief in neural network design to prefer more expressive models with many layers. These models typically have more parameters than the number of training examples. This belief is justified by the increasing complexity of the datasets and the fast-growing parallel computation resources. In such an overparameterized regime, it is easy for the learning algorithm to discover a network that overfits: the model achieves zero training loss but fails to generalize to unseen test data. Thus, the loss functions are usually modified with an additional constraint to limit the capacity of the neural network in order to prevent overfitting. The constraint could be imposed on the architecture via weight sharing, as in the case of CNNs, or could take the form of a data-independent regularization term in the modified loss function. For example, weight decay, the sum of the $\ell_2$ norms of the weight matrices, is a simple yet effective regularizer,

$\mathcal{L}(\theta) = \mathcal{L}_{\text{CE/MSE}}(\theta) + \mathcal{R}(\theta), \qquad \mathcal{R}(\theta) = \lambda \|\theta\|_2^2, \qquad (1.17)$

where $\lambda$ is the weight decay coefficient. Even with regularization terms, very wide and deep neural networks are still vulnerable to overfitting. Dropout [Hinton et al., 2012a] is an effective technique for avoiding co-adaptation of the neurons, thus reducing overfitting in neural networks. During training, each neuron has an independent probability $p$ of being dropped from the computation,

$z_d = W_d a_{d-1} + b_d, \qquad a_d = M_d \odot f(z_d), \qquad M_d \sim \mathrm{Bern}(1 - p), \qquad (1.18)$

where $M_d$ is a binary mask drawn i.i.d. from a Bernoulli distribution indicating which neurons are kept. Dropout can also be viewed as training an ensemble of neural nets, each with a different connectivity pattern but with the weights shared among all the ensemble members. At test time, a multiplier of $1 - p$ is applied to the hidden activations to correct for the neurons missing during training.
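A minimal sketch of Eq. 1.18 and the test-time correction, assuming the mask is resampled on every training pass (the shapes and example activations are illustrative):

```python
import numpy as np

def dropout_layer(a, p, training, rng=np.random.default_rng()):
    """Eq. 1.18: drop each neuron with probability p during training;
    scale activations by (1 - p) at test time to correct for missing units."""
    if training:
        mask = rng.random(a.shape) >= p   # keep each unit with probability 1 - p
        return mask * a
    return (1.0 - p) * a

h = np.ones(8)
print(dropout_layer(h, p=0.5, training=True))   # roughly half the units zeroed
print(dropout_layer(h, p=0.5, training=False))  # deterministic, scaled output
```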

1.6.3 Gradient descent

In practice, finding the weights and biases that minimize the given loss function is done using gradient-based optimization methods. Gradient-based learning rules for neural networks were given different names in the past [Rumelhart et al., 1986, LeCun et al., 1990]. Starting from a set of randomly initialized weights, simple gradient-based learning algorithms change the weights in the negative gradient direction of the loss function with respect to the weights,

$g \triangleq \frac{\partial \mathcal{L}}{\partial \theta}, \qquad \mathcal{G}_W \triangleq \frac{\partial \mathcal{L}}{\partial W}, \qquad (1.19)$

$\Delta W = -\eta \, \mathcal{G}_W, \qquad \Delta\theta = [\mathrm{vec}\{\Delta W_1\}^T, \mathrm{vec}\{\Delta W_2\}^T, \cdots]^T, \qquad (1.20)$

$\theta \leftarrow \theta + \Delta\theta, \qquad (1.21)$

where $\eta$ is the learning rate or step size. We use $\mathcal{G}_W$ as a shorthand notation for the gradient of the loss with respect to a particular weight matrix. At each iteration, the weights “descend” along the gradient direction. There are typically many local optima in the weight space, and for typical neural networks there are no global convergence guarantees for these gradient descent algorithms. Despite this, the solutions discovered in practice often perform very well. The computation cost of the gradient updates $\mathcal{G}_W$ grows linearly with the dataset size. A typical computer vision dataset contains hundreds of thousands of training examples, which makes computing the full gradient update expensive. To address this problem, a stochastic approximation variant of gradient descent, the stochastic gradient descent (SGD) algorithm, computes an estimate of the gradient on a small mini-batch randomly sampled from the full training set. SGD is an unbiased approximation of the full gradient, and the variance of the estimate is inversely proportional to the mini-batch size,

$\mathcal{G}_W = \frac{1}{|B|} \sum_{i \in B} \frac{\partial l^{(i)}}{\partial W}, \qquad B \subset \{1, 2, \cdots, N\}, \qquad (1.22)$

$\mathbb{E}[\mathcal{G}_W] = \frac{\partial \mathcal{L}}{\partial W}, \qquad \mathrm{Var}[\mathcal{G}_W] = \frac{1}{|B|} \mathrm{Var}\left[ \frac{\partial l^{(i)}}{\partial W} \right], \qquad (1.23)$

where $B$ is a uniformly sampled subset of the training set. Using a larger mini-batch size tends to work better, and the update computation can take advantage of the parallelism of modern parallel computing hardware. Most of the learning algorithms for neural networks rely on computing the gradient $\mathcal{G}_W$ in the inner loop. Rumelhart et al. [1986] provide an efficient algorithm to compute these gradients that “backpropagates” the difference between the network’s prediction and the target through the layers of the neural network, from the output layer back to the inputs. In general, any neural network can be expressed as a computation graph. The backpropagation algorithm is a special case of reverse-mode automatic differentiation on the computation graph.
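A minimal sketch of the SGD loop in Eqs. 1.19–1.22, fitting a toy one-parameter model; the gradient function, learning rate and batch size are illustrative choices, not values from the thesis.

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.1, batch_size=32, steps=1000,
        rng=np.random.default_rng()):
    """Eqs. 1.19-1.22: descend along a mini-batch estimate of the gradient."""
    N = len(data)
    for _ in range(steps):
        batch = rng.choice(N, size=batch_size, replace=False)   # B ⊂ {1..N}
        grad = sum(grad_fn(params, data[i]) for i in batch) / batch_size
        params = params - lr * grad                             # θ ← θ - η G_W
    return params

# Toy usage: fit a scalar mean by minimizing 0.5 (θ - x)² over the samples.
data = np.random.default_rng(2).normal(3.0, 1.0, size=1000)
theta = sgd(np.zeros(1), lambda th, x: th - x, data)
print(theta)   # ≈ 3.0, the sample mean
```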

Chapter 2

Deep recurrent visual attention

2.1 Motivation

Convolutional neural networks have recently been very successful on a variety of recognition and classification tasks [Krizhevsky et al., 2012b, Goodfellow et al., 2013, Jaderberg et al., 2014, Vinyals et al., 2014b, Karpathy and Fei-Fei, 2014]. One of the main drawbacks of convolutional networks (ConvNets) is their poor scalability with increasing input image size, so efficient implementations of these models on multiple GPUs [Krizhevsky et al., 2012b] or even spanning multiple machines [Dean et al., 2012] have become necessary. Applications of ConvNets to multi-object and sequence recognition from images have avoided working with big images and instead focused on using ConvNets for recognizing characters or short sequence segments from image patches containing reasonably tightly cropped instances [Goodfellow et al., 2013, Jaderberg et al., 2014]. Applying such a recognizer to large images containing uncropped instances requires integrating it with a separately trained sequence detector or a bottom-up proposal generator. Non-maximum suppression is often performed to obtain the final detections. While combining separate components trained using different objective functions has been shown to be worse than end-to-end training of a single system in other domains, integrating object localization and recognition into a single globally-trainable architecture has been difficult. In this chapter, we take inspiration from the way humans perform visual sequence recognition tasks such as reading: continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to our internal representation of the sequence. Our proposed system is a deep recurrent neural network that at each step processes a multi-resolution crop of the input image, called a glimpse. The network uses information from the glimpse to update its internal representation of the input, and outputs the next glimpse location and possibly the next object in the sequence. The process continues until the model decides that there are no more objects to process. We show how the proposed system can be trained end-to-end by approximately maximizing a variational lower bound on the label sequence log-likelihood. This training procedure can be used to train the model to both localize and recognize multiple objects purely from label sequences. We evaluate the model on the task of transcribing multi-digit house numbers from publicly available Google Street View imagery. Our attention-based model outperforms the state-of-the-art ConvNets on tightly cropped inputs while using both fewer parameters and much less computation. We also show

that our model outperforms ConvNets by a much larger margin in the more realistic setting of larger and less tightly cropped input sequences.

Figure 2.1: The deep recurrent attention model.

2.2 Learning where and what

For simplicity, we first describe how our model can be applied to classifying a single object and later show how it can be extended to multiple objects. Processing an image $x$ with an attention-based model is a sequential process with $N$ steps, where each step consists of a saccade followed by a glimpse. At each step $n$, the model receives a location $l_n$ along with a glimpse observation $x_n$ taken at location $l_n$.

The model uses the observation to update its internal state and outputs the location $l_{n+1}$ to process at the next time-step. Usually the number of pixels in the glimpse $x_n$ is much smaller than the number of pixels in the original image $x$, making the computational cost of processing a single glimpse independent of the size of the image. A graphical representation of our model is shown in Figure 2.1. The model can be broken down into a number of sub-components, each mapping some input into a vector output. We will use the term “network” to describe these non-linear sub-components since they are typically multi-layered neural networks.

Glimpse network: The glimpse network is a non-linear function that receives the current input image patch, or glimpse, $x_n$ and its location tuple $l_n$, where $l_n$ is a 2-dimensional vector representing the x- and y-coordinate of the patch, and outputs a vector $g_n$. The job of the glimpse network is to extract a set of useful features from location $l_n$ of the raw visual input. We will use $G_{\text{image}}(x_n \mid W_{\text{image}})$ to denote the output vector of the function $G_{\text{image}}(\cdot)$ that takes an image patch $x_n$ and is parameterized by weights $W_{\text{image}}$. $G_{\text{image}}(\cdot)$ typically consists of three convolutional hidden layers without any pooling layers, followed by a fully connected layer. Separately, the location tuple is mapped by $G_{\text{loc}}(l_n \mid W_{\text{loc}})$ using a fully connected hidden layer, where both $G_{\text{image}}(x_n \mid W_{\text{image}})$ and $G_{\text{loc}}(l_n \mid W_{\text{loc}})$ have the same dimension. We combine the high-bandwidth image information with the low-bandwidth location tuple by multiplying the two vectors element-wise to get the final glimpse feature vector $g_n$,

$g_n = G_{\text{image}}(x_n \mid W_{\text{image}}) \odot G_{\text{loc}}(l_n \mid W_{\text{loc}}), \qquad (2.1)$

where $\odot$ denotes the element-wise product.

This type of multiplicative interaction between “what” and “where” was initially proposed by Larochelle and Hinton [2010b].
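A minimal sketch of the multiplicative “what × where” combination in Eq. 2.1; in the thesis $G_{\text{image}}$ is a small ConvNet and $G_{\text{loc}}$ a fully connected layer, but here a single linear layer followed by ReLU stands in for each, with illustrative dimensions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def glimpse_feature(x_patch, loc, W_image, W_loc):
    """Eq. 2.1: combine 'what' and 'where' by an element-wise product.
    One linear + ReLU layer stands in for each sub-network here."""
    g_what = relu(W_image @ x_patch.ravel())   # G_image: features of the patch
    g_where = relu(W_loc @ loc)                # G_loc: embedding of the (x, y) tuple
    return g_what * g_where                    # element-wise product -> g_n

rng = np.random.default_rng(3)
g = glimpse_feature(rng.random((12, 12)), np.array([0.1, -0.3]),
                    rng.normal(0.0, 0.1, (256, 144)),
                    rng.normal(0.0, 0.1, (256, 2)))
```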

Recurrent network: The recurrent network aggregates information extracted from the individual glimpses and combines the information in a coherent manner that preserves spatial information. The glimpse feature vector $g_n$ from the glimpse network is supplied as input to the recurrent network at each time step. The recurrent network consists of two recurrent layers with non-linear function $R_{\text{recur}}$. We define the two outputs of the recurrent layers as $r^{(1)}$ and $r^{(2)}$:

$r_n^{(1)} = R_{\text{recur}}(g_n, r_{n-1}^{(1)} \mid W_{r1}) \quad \text{and} \quad r_n^{(2)} = R_{\text{recur}}(r_n^{(1)}, r_{n-1}^{(2)} \mid W_{r2}) \qquad (2.2)$

We use Long Short-Term Memory units [Hochreiter and Schmidhuber, 1997c] for the non-linearity $R_{\text{recur}}$ because of their ability to learn long-range dependencies and stable learning dynamics.

Emission network: The emission network takes the current state of the recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states from the recurrent network. It consists of a fully connected hidden layer that maps the feature vector $r_n^{(2)}$ from the top recurrent layer to a coordinate tuple $\hat{l}_{n+1}$:

$\hat{l}_{n+1} = E(r_n^{(2)} \mid W_e) \qquad (2.3)$

Context network: The context network provides the initial state for the recurrent network, and its output is used by the emission network to predict the location of the first glimpse. The context network $C(\cdot)$ takes a down-sampled, low-resolution version $I_{\text{coarse}}$ of the whole input image and outputs a fixed-length vector $c_I$. The contextual information provides sensible hints about where the potentially interesting regions are in a given image. The context network employs three convolutional layers that map the coarse image $I_{\text{coarse}}$ to a feature vector used as the initial state of the top recurrent layer $r^{(2)}$ in the recurrent network. The bottom layer $r^{(1)}$, however, is initialized with a vector of zeros for reasons we will explain later.

Classification network: The classification network outputs a prediction for the class label $y$ based on the final feature vector $r_N^{(1)}$ of the lower recurrent layer. The classification network has one fully connected hidden layer and a softmax output layer for the class $y$:

$P(y \mid I) = O(r_N^{(1)} \mid W_o) \qquad (2.4)$

Ideally, the deep recurrent attention model should learn to look at locations that are relevant for classifying objects of interest. The existence of the contextual information, however, provides a “short cut” solution such that it is much easier for the model to learn from contextual information than by combining information from different glimpses. We prevent such undesirable behavior by connecting the context network and classification network to different recurrent layers in our deep model. As a result, the contextual information cannot be used directly by the classification network and only affects the sequence of glimpse locations produced by the model.

2.3 Variational lower bound objective

2.3.1 Maximize the variational lower bound

Given the class labels $y$ of an input image $\mathcal{I}$, we can formulate learning as a supervised classification problem with the cross-entropy objective function. The attention model predicts the class label conditioned on intermediate latent location variables $l$ from each glimpse and extracts the corresponding patches. Let $\theta = [\mathrm{vec}\{W_{\text{image}}\}^\top, \mathrm{vec}\{W_{\text{loc}}\}^\top, \mathrm{vec}\{W_{r1}\}^\top, \mathrm{vec}\{W_{r2}\}^\top, \mathrm{vec}\{W_e\}^\top, \mathrm{vec}\{W_o\}^\top]^\top$ denote the concatenated model parameters. We can then maximize the likelihood of the class label by marginalizing over the glimpse locations: $\log p(y \mid \mathcal{I}, \theta) = \log \sum_l p(l \mid \mathcal{I}, \theta) \, p(y \mid l, \mathcal{I}, \theta)$.

The marginalized objective function can be learned by optimizing its variational free energy lower bound $\mathcal{F}$:

$\log \sum_l p(l \mid \mathcal{I}, \theta) \, p(y \mid l, \mathcal{I}, \theta) \geq \sum_l p(l \mid \mathcal{I}, \theta) \log p(y, l \mid \mathcal{I}, \theta) + H[l] \qquad (2.5)$

$= \sum_l p(l \mid \mathcal{I}, \theta) \log p(y \mid l, \mathcal{I}, \theta), \qquad (2.6)$

where the entropy term $H[l]$ cancels the $\log p(l \mid \mathcal{I}, \theta)$ contribution inside $\log p(y, l \mid \mathcal{I}, \theta)$, leaving (2.6).

The learning rule to update the model parameters θ follows the gradient of the above free energy:

$\frac{\partial \mathcal{F}}{\partial \theta} = \sum_l p(l \mid \mathcal{I}, \theta) \frac{\partial \log p(y \mid l, \mathcal{I}, \theta)}{\partial \theta} + \sum_l \log p(y \mid l, \mathcal{I}, \theta) \frac{\partial p(l \mid \mathcal{I}, \theta)}{\partial \theta} \qquad (2.7)$

$= \sum_l p(l \mid \mathcal{I}, \theta) \left[ \frac{\partial \log p(y \mid l, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid l, \mathcal{I}, \theta) \frac{\partial \log p(l \mid \mathcal{I}, \theta)}{\partial \theta} \right] \qquad (2.8)$

It is intractable to evaluate the exponentially many glimpse location sequences during training, so the summation in equation 2.8 is approximated using Monte Carlo samples:

$\tilde{l}_n^m \sim p(l_n \mid \mathcal{I}, \theta) = \mathcal{N}(l_n; \hat{l}_n, \Sigma), \qquad (2.9)$

$\frac{\partial \mathcal{F}}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{\partial \log p(y \mid \tilde{l}^m, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid \tilde{l}^m, \mathcal{I}, \theta) \frac{\partial \log p(\tilde{l}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right] \qquad (2.10)$

Equation 2.10 gives a practical algorithm for training the deep attention model. Namely, we can sample the glimpse location prediction from the model after each glimpse. The samples are then used in standard backpropagation to obtain an estimator of the gradient with respect to the model parameters. Notice that the log-likelihood $\log p(y \mid \tilde{l}^m, \mathcal{I}, \theta)$ has an unbounded range, which can introduce substantial variance into the gradient estimator. Especially when the sampled location is off from the object in the image, the log-likelihood will induce an undesirably large gradient update that is backpropagated through the rest of the model.

We can reduce the variance of the estimator 2.10 by replacing $\log p(y \mid \tilde{l}^m, \mathcal{I}, \theta)$ with a 0/1 discrete indicator function $R$ and using the baseline technique from Mnih et al. [2014b]:

$R = \begin{cases} 1 & \text{if } y = \arg\max_{y'} \log p(y' \mid \tilde{l}^m, \mathcal{I}, \theta) \\ 0 & \text{otherwise} \end{cases} \qquad (2.11)$

$b_n = E_{\text{baseline}}(r_n^{(2)} \mid W_{\text{baseline}}) \qquad (2.12)$

As shown, the recurrent network state vector $r_n^{(2)}$ is used to estimate a state-based baseline $b$ for each glimpse, which significantly improves learning efficiency. The baseline effectively centers the random variable $R$ and can be learned by regressing towards the expected value of $R$. Given both the indicator function and the baseline, we have the following gradient update:

$\frac{\partial \mathcal{F}}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{\partial \log p(y \mid \tilde{l}^m, \mathcal{I}, \theta)}{\partial \theta} + \lambda (R - b) \frac{\partial \log p(\tilde{l}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right], \qquad (2.13)$

where the hyper-parameter $\lambda$ balances the scale of the two gradient components. In fact, by using the 0/1 indicator function, the learning rule from equation 2.13 is equivalent to the REINFORCE [Williams, 1992b] learning rule employed in Mnih et al. [2014b] for training their attention model. When viewed as a reinforcement learning update, the second term in equation 2.13 is an unbiased estimate of the gradient, with respect to $W$, of the expected reward $R$ under the model’s glimpse policy. Here we have shown that this learning rule can also be motivated by approximately optimizing the free energy. During inference, the feed-forward location prediction can be used as a deterministic prediction of the location coordinates from which to extract the next input image patch for the model; the model then behaves as a normal feed-forward network. Alternatively, our marginalized objective function, equation 2.5, suggests a procedure to estimate the expected class prediction by using samples of location sequences $\{\tilde{l}_1^m, \cdots, \tilde{l}_N^m\}$ and averaging their predictions,

$\mathbb{E}_l[p(y \mid I)] \approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid \mathcal{I}, \tilde{l}^m). \qquad (2.14)$

This allows the attention model to be evaluated multiple times on each image with the classification predictions being averaged. In practice, we found that averaging the log probabilities gave the best performance.
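The following sketch shows the shape of the hybrid estimator in Eq. 2.13. The per-sample gradients here are random placeholders standing in for quantities that backpropagation through the classifier and the glimpse policy would produce; only the form of the update, with the baseline acting as a control variate, is being illustrated.

```python
import numpy as np

def hybrid_gradient(dlogp_y, dlogp_l, R, b, lam=1.0):
    """Eq. 2.13: average, over M sampled glimpse sequences, of the
    classification gradient plus the REINFORCE term, with the baseline b
    centering the 0/1 reward R to reduce variance."""
    M = R.shape[0]
    return (dlogp_y + lam * (R - b)[:, None] * dlogp_l).sum(axis=0) / M

# Placeholder shapes: M = 16 sampled glimpse sequences, D = 100 parameters.
rng = np.random.default_rng(4)
M, D = 16, 100
grad = hybrid_gradient(rng.normal(size=(M, D)),                 # d log p(y|l,I)/dθ
                       rng.normal(size=(M, D)),                 # d log p(l|I)/dθ
                       rng.integers(0, 2, size=M).astype(float),  # rewards R
                       np.full(M, 0.5))                         # learned baseline b
```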

Here, we encode the real-valued glimpse location tuple $l_n$ using a Cartesian coordinate system centered at the middle of the input image. The ratio converting unit width in the coordinate system to the number of pixels is a hyper-parameter. This ratio presents an exploration versus exploitation trade-off. The proposed model’s performance is very sensitive to this setting; we found that setting its value to around 15% of the input image width tends to work well.

2.3.2 Multi-object/Sequential classification as a visual attention task

Our proposed attention model can be easily extended to solve classification tasks involving multiple objects. To train the deep recurrent attention model for the sequential recognition task, the multiple object labels for a given image need to be cast into an ordered sequence $\{y_1, y_2, \cdots, y_S\}$. The deep recurrent attention model then learns to predict one object at a time as it explores the image in a sequential manner. We can utilize a simple fixed number of glimpses for each target in the sequence. In addition, a new class label for the “end-of-sequence” symbol is included to deal with variable numbers of objects in an image. We can stop the recurrent attention model once the terminal symbol is predicted.

Concretely, the objective function for the sequential prediction is

$\log p(y_1, y_2, \cdots, y_S \mid \mathcal{I}, \theta) = \sum_{s=1}^{S} \log \sum_{l} p(l \mid \mathcal{I}, \theta) \, p(y_s \mid l_s, \mathcal{I}, \theta) \qquad (2.15)$

The learning rule is derived from the free energy as in equation 2.13, and the gradient is accumulated across all targets. We assign a fixed number of glimpses, $N$, to each target. Assuming $S$ targets in an image, the model would be trained with $N \times (S + 1)$ glimpses. The benefit of using a recurrent model for multiple object recognition is that it is a compact and simple form, yet flexible enough to deal with images containing variable numbers of objects. Learning a model from images of many objects is a challenging setup. We can reduce the difficulty by modifying our indicator function $R$ to be proportional to the number of targets the model predicted correctly:

$R_s = \sum_{j \leq s} R_j \qquad (2.16)$

In addition, we restrict the gradient of the objective function so that it only contains glimpses up to the first mislabeled target and ignores the targets after the first mistake. This curriculum-like adaptation of the learning is crucial for obtaining a high-performance attention model for sequential prediction. A sketch of this bookkeeping follows.
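A minimal sketch of one plausible reading of the cumulative reward in Eq. 2.16 and the truncation at the first mistake; the exact bookkeeping used in the original implementation is not specified here, so treat this as an assumption.

```python
import numpy as np

def sequence_rewards(correct):
    """Eq. 2.16 plus truncation: the reward for target s counts the correct
    predictions up to s, and gradients are kept only up to the first mistake."""
    correct = np.asarray(correct, dtype=float)   # per-target 0/1 indicators R_j
    R = np.cumsum(correct)                       # R_s = sum_{j <= s} R_j
    first_err = np.argmin(correct) if correct.min() == 0 else len(correct)
    keep = np.arange(len(correct)) <= first_err  # later glimpses are ignored
    return R, keep

print(sequence_rewards([1, 1, 0, 1]))  # R = [1 2 2 3], keep = [T T T F]
```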

2.3.3 Comparison with CNN

To show the effectiveness of the deep recurrent attention model (DRAM), we first investigate a number of multi-object classification tasks involving a variant of MNIST. We then apply the proposed attention model to a real-world object recognition task using the multi-digit SVHN dataset [Netzer et al., 2011] and compare with state-of-the-art deep ConvNets. As suggested in Mnih et al. [2014b], we used a glimpse network with two different scales to improve the classification performance. Namely, given a glimpse location $l_n$, we extract two patches $(x_n^1, x_n^2)$, where $x_n^1$ is the original patch and $x_n^2$ is a down-sampled, coarser image patch. We use the concatenation of $x_n^1$ and $x_n^2$ as the glimpse observation, the “foveal” feature. The hyper-parameters in our experiments are the learning rate $\eta$ and the location variance $\Sigma$ in equation 2.9. They are determined by grid search and cross-validation.

Learning to find digits

We first evaluate the effectiveness of the controller in the deep recurrent attention model using the MNIST handwritten digit dataset. We generated a dataset of pairs of randomly picked handwritten digits in a 100x100 image with distraction noise in the background. The task is to identify the 55 different combinations of the two digits as a classification problem. The attention models are allowed 4 glimpses before making a classification prediction. The goal of this experiment is to evaluate the ability of the controller and recurrent network to combine information from multiple glimpses with minimum effort from the glimpse network. The results are shown in table (2.1). The DRAM model with a context network significantly outperforms the other models.

Table 2.1: Error rates on the MNIST pairs classification task.

  Model                      Test Err.
  RAM Mnih et al. [2014b]    9%
  DRAM w/o context           7%
  DRAM                       5%

Table 2.2: Error rates on the MNIST two digit addition task.

  Model                      Test Err.
  ConvNet 64-64-64-512       3.2%
  DRAM                       2.5%

Figure 2.2: Left) Two examples of the learned policy on the digit pair classification task. The first column shows the input image while the next 5 columns show the selected glimpse locations. Right) Two examples of the learned policy on the digit addition task. The first column shows the input image while the next 5 columns show the selected glimpse locations.

Learning to do addition

For a more challenging task, we designed another dataset with two MNIST digits on an empty 100x100 background, where the task is to predict the sum of the two digits in the image as a classification problem with 19 targets. The model has to find where each digit is and add them up. When the two digits are sampled uniformly from all classes, the label distribution of the summation is heavily imbalanced, with most of the probability mass concentrated around 10. Also, there are many digit combinations that map to the same target, for example, [5,5] and [3,7]. The class label therefore provides a weaker association between the visual features and the supervision signal in this task than in the digit combination task. We used the same model as in the combination task. The deep recurrent attention model is able to discover a glimpse policy that solves this task, achieving a 2.5% error rate. In comparison, ConvNets take longer to learn and perform worse when given weak supervision. Some inference samples are shown in figure 2.2. Interestingly, the learned policy for predicting the next glimpse is very different in the addition task compared to the combination task: the model that learned to do addition toggles its glimpses between the two digits.

Learning to read house numbers

The publicly available multi-digit street view house number (SVHN) dataset [Netzer et al., 2011] consists of images of digits taken from pictures of house fronts. Following Goodfellow et al. [2013], we formed a validation set of 5000 images by randomly sampling images from the training set and the extra set, and these were used for selecting the learning rate and the sampling variance for the stochastic glimpse policy. The models are trained using the remaining 200,000 training images. We follow the preprocessing technique from Goodfellow et al. [2013] to generate tightly cropped 64x64 images with multi-digits at the center, and similar data augmentation is used to create 54x54 jittered images during training. We also convert the RGB images to grayscale, as we observe that the color information does not affect the final classification performance. We trained a model to classify all the digits in an image sequentially with the objective function defined in equation 2.15.

Table 2.3: Whole sequence recognition error rates on multi-digit SVHN.

  Model                                    Test Err.
  11 layer CNN Goodfellow et al. [2013]    3.96%
  10 layer CNN                             4.11%
  Single DRAM                              5.1%
  Single DRAM MC avg.                      4.4%
  forward-backward DRAM MC avg.            3.9%

Table 2.4: Whole sequence recognition error rates on enlarged multi-digit SVHN.

  Model                                    Test Err.
  10 layer CNN resize                      50%
  10 layer CNN re-trained                  5.60%
  Single DRAM focus                        5.7%
  forward-backward DRAM focus              5.0%
  Single DRAM fine-tuned                   5.1%
  forward-backward DRAM fine-tuning        4.46%

The label sequence ordering is chosen to go from left to right, following the natural ordering of house numbers. The attention model is given 3 glimpses for each digit before making a prediction. The recurrent model keeps running until it predicts a terminal label or until the longest digit length in the dataset is reached. In the SVHN dataset, up to 5 digits can appear in an image. This means the recurrent model will run up to 18 glimpses per image, that is, 5 x 3 plus 3 glimpses for a terminal label. Learning the attention model took around 3 days on a GPU. The model performance is shown in table (2.3). We found that there is still a performance gap between the state-of-the-art deep ConvNet and a single DRAM that “reads” from left to right, even with the Monte Carlo averaging. The DRAM often over-predicts additional digits in the place of the terminal class. In addition, the distribution of the leading digit in real life follows Benford’s law¹. We therefore train a second recurrent attention model to “read” the house numbers from right to left as a backward model. The forward and backward models can share the same weights for their glimpse networks, but they have different weights for their recurrent and their emission networks. The predictions of both forward and backward models can be combined to estimate the final sequence prediction. Following the observation that attention models often overestimate the sequence length, we can flip the first k sequence predictions from the backward model, where k is the shorter of the two sequence length predictions from the forward and backward models. This simple heuristic works very well in practice, and we obtain state-of-the-art performance on the Street View house number dataset with the forward-backward recurrent attention model. Videos showing sample runs of the forward and backward models on SVHN test data can be found at http://www.psi.toronto.edu/~jimmy/dram/forward.avi and http://www.psi.toronto.edu/~jimmy/dram/backward.avi respectively. These visualizations show that the attention model learns to follow the slope of multi-digit house numbers when they go up or down.

For comparison, we also implemented a deep ConvNet with a similar architecture to the one used in Goodfellow et al. [2013]. The network had 8 convolutional layers with 128 filters in each, followed by 2 fully connected layers of 3096 ReLU units. Dropout is applied to all 10 layers with a 50% dropout rate to prevent over-fitting. Moreover, we generate a less tightly cropped 110x110 multi-digit SVHN dataset by enlarging the bounding box of each image such that the relative size of the digits stays the same as in the 54x54 images. Our deep attention model trained on 54x54 images can be directly applied to the new 110x110 dataset with no modification. The performance can be further improved by “focusing” the model on where the digits are: we run the model once, crop a 54x54 bounding box around the glimpse location sequence, and feed the 54x54 bounding box to the attention model again to generate the final prediction. This allows DRAM to “focus” and obtain a similar prediction accuracy on the enlarged images as on the cropped images without ever being trained on large images. We also compared the deep ConvNet trained on the 110x110 images with the fine-tuned attention model. The deep attention model significantly outperforms the deep ConvNet with very little training time. The DRAM model only takes a few hours to fine-tune on the enlarged SVHN data, compared to one week for the deep 10 layer ConvNet.

Table 2.5: Computation cost of DRAM vs. deep ConvNets.

  (Giga) floating-point op.   10 layer CNN   DRAM    DRAM MC avg.   F-B DRAM MC avg.
  54x54                       2.1            0.2     0.35           0.7
  110x110                     8.5            ≤ 0.2   1.1            2.2

  param. (millions)           10 layer CNN   DRAM    DRAM MC avg.   F-B DRAM MC avg.
  54x54                       51             14      14             28
  110x110                     169            14      14             28

¹ Benford’s law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.

2.3.4 Discussion

In our experiments, the proposed deep recurrent attention model (DRAM) outperforms the state-of-the-art deep ConvNets on the standard SVHN sequence recognition task. Moreover, as we increase the image area around the house numbers or lower the signal-to-noise ratio, the advantage of the attention model becomes more significant. In table 2.5, we compare the computational cost of our proposed deep recurrent attention model with that of deep ConvNets in terms of the number of floating-point operations for the multi-digit SVHN models, along with the number of parameters in each model. The recurrent attention models, which only process a selected subset of the input, scale better than a ConvNet that looks over an entire image. The estimated cost for the DRAM is calculated using the maximum sequence length in the dataset; the expected computational cost is much lower in practice, since most of the house numbers are around 2-3 digits long. In addition, since the attention-based model does not process the whole image, it can naturally work on images of different sizes with the same computational cost, independent of the input dimensionality. We also found that the attention-based model is less prone to over-fitting than ConvNets, likely because of the stochasticity in the glimpse policy during training. Though it is still beneficial to regularize the attention model with some dropout noise between the hidden layers during training, we found that this gives a very marginal performance boost of 0.1% on the multi-digit SVHN task. On the other hand, the deep 10 layer ConvNet is only able to achieve a 5.5% error rate when dropout is only applied to the last two fully connected hidden layers. Finally, we note that DRAM can easily deal with variable length label sequences. Moreover, a model trained on a dataset with a fixed sequence length can easily be transferred and fine-tuned on a similar dataset with longer target sequences. This is especially useful when there is a lack of data for the task with longer sequences.

[Figure 2.3: The Wake-Sleep Recurrent Attention Model. An inference network q(a_n | y, a_{1:n-1}, I, η) is stacked on top of the prediction network, which models p(a_n | a_{1:n-1}, I, θ) from a low-resolution view of the image I and the glimpse observations (x_1, a_1), ..., (x_N, a_N), and outputs the prediction p(y | a, I, θ).]

2.4 Improved learning with re-weighted wake-sleep objective

Training stochastic attention models is difficult because the loss gradient involves intractable posterior expectations, and because the stochastic gradient estimates can have high variance. (The latter problem was also observed by Zaremba and Sutskever [2015] in the context of memory networks.) In this section, we propose the Wake-Sleep Recurrent Attention Model (WS-RAM), a method for training stochastic recurrent attention models which deals with the problems of intractable inference and high-variance gradients by taking advantage of several advances from the literature on training deep generative models: inference networks Dayan et al. [1995], the reweighted wake-sleep algorithm Bornschein and Bengio [2014], and control variates Paisley et al. [2012], Mnih and Gregor [2014]. During training, the WS-RAM approximates posterior expectations using importance sampling, with a proposal distribution computed by an inference network. Unlike the prediction network, the inference network has access to the object category label, which helps it choose better glimpse locations.

2.4.1 Wake-Sleep recurrent attention model

We now describe our wake-sleep recurrent attention model (WS-RAM). Given an image I, the network first chooses a sequence of glimpses a = (a_1, ..., a_N), and after each glimpse, receives an observation x_n computed by a mapping g(a_n, I). This mapping might, for instance, extract an image patch at a given scale. The first glimpse is based on a low-resolution version of the input, while subsequent glimpses are chosen based on information acquired from previous glimpses. The glimpses are chosen stochastically according to a distribution p(a_n | a_{1:n-1}, I, θ), where θ denotes the parameters of the network. This is in contrast with soft attention models, which deterministically allocate attention across all image locations. After the last glimpse, the network predicts a distribution p(y | a, I, θ) over the target y (for instance, the caption or image category).

As shown in Figure 2.3, the core of the attention network is a two-layer recurrent network, which we term the "prediction network", where the output at each time step is an action (saccade) which is used to compute the input at the next time step. A low-resolution version of the input image is fed to the network at the first time step, and the network predicts the class label at the final time step. Importantly, the low-resolution input is fed to the second layer, while the class label prediction is made by the first layer, preventing information from propagating directly from the low-resolution image to the output. This prevents local optima where the network learns to predict y directly from the low-resolution input, disregarding attention completely.

On top of the prediction network is an inference network, which receives both the class label and the attention network's top layer representation as inputs. It tries to predict the posterior distribution q(a_{n+1} | y, a_{1:n}, I, η), parameterized by η, over the next saccade, conditioned on the image category being correctly predicted. Its job is to guide the posterior sampler during training time, thereby acting as a "teacher" for the attention network. The inference network is described further in Section 2.4.3.

One of the benefits of stochastic attention models is that the mapping g can be localized to a small image region or coarse granularity, which means it can potentially be made very efficient. Furthermore, g need not be differentiable, which allows for operations (such as choosing a scale) which would be difficult to implement in a soft attention network. The cost of this flexibility is that standard backpropagation cannot be applied, so instead we use the novel algorithms described in the next section.

Assume we have a dataset with labels y for the supervised prediction task (e.g. object category). In contrast to the supervised saliency prediction task (e.g. Itti et al. [1998], Judd et al. [2009]), there are no labels for where to attend. Instead, we learn an attention policy based on the idea that the best locations to attend to are the ones which most robustly lead the model to predict the correct category. In particular, we aim to maximize the probability of the class label (or equivalently, minimize the cross-entropy) by marginalizing over the actions at each glimpse:

\mathcal{L} = \log p(y \mid \mathcal{I}, \theta) = \log \sum_a p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta). \qquad (2.17)

2.4.2 An improved lower bound on the log-likelihood

In this section, we first describe the previous variational lower bound objective. We then introduce a new objective function which directly estimates the gradients of \mathcal{L}. The new method can be seen as maximizing a tighter lower bound on \mathcal{L}.

Let q(a | y, I) be an approximating distribution. The lower bound on \mathcal{L} is then given by:

\mathcal{L} = \log \sum_a p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta) \geq \sum_a q(a \mid y, \mathcal{I}) \log p(y, a \mid \mathcal{I}, \theta) + \mathcal{H}[q] = \mathcal{F}. \qquad (2.18)

In the case where q(a | y, I) = p(a | I, θ) is the prior, as considered by Ba et al. [2015], this reduces to

\mathcal{F} = \sum_a p(a \mid \mathcal{I}, \theta) \log p(y \mid a, \mathcal{I}, \theta). \qquad (2.19)

The learning rules can be derived by taking derivatives of Eqn. 2.19 with respect to the model parameters:

\frac{\partial \mathcal{F}}{\partial \theta} = \sum_a p(a \mid \mathcal{I}, \theta) \left[ \frac{\partial \log p(y \mid a, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid a, \mathcal{I}, \theta)\, \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \qquad (2.20)

The summation can be approximated using M Monte Carlo samples \tilde{a}^m from p(a | I, θ):

\frac{\partial \mathcal{F}}{\partial \theta} \approx \frac{1}{M} \sum_{m=1}^M \left[ \frac{\partial \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{\partial \theta} + \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)\, \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \qquad (2.21)
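In an autograd framework, this estimator is typically implemented as a surrogate objective whose gradient matches Eqn. 2.21. A minimal PyTorch-style sketch, assuming `log_py` and `log_pa` hold log p(y | ã^m, I, θ) and log p(ã^m | I, θ) for the M samples:

```python
import torch

def variational_surrogate(log_py, log_pa):
    """log_py, log_pa: shape (M,) tensors computed with autograd.
    Differentiating the returned scalar reproduces Eqn. 2.21: the first
    term backpropagates through log p(y | a, I, theta), and the second is
    the REINFORCE term with the detached log-likelihood as the reward."""
    reinforce = log_py.detach() * log_pa
    return -(log_py + reinforce).mean()   # minimize the negative bound
```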

The partial derivative terms can each be computed using standard backpropagation. This suggests a simple gradient-based training algorithm: for each image, one first computes the samples \tilde{a}^m from the prior p(a | I, θ), and then updates the parameters according to Eqn. 2.21. As observed by Ba et al. [2015], one must carefully use control variates in order to make this technique practical; we defer discussion of control variates to Section 2.4.4.

The variational method described above has some counterintuitive properties early in training. First, because it averages the log-likelihood over actions, it greatly amplifies the differences in probabilities assigned to the true category by different bad glimpse sequences. For instance, a glimpse sequence which leads to 0.01 probability assigned to the correct class is considered much worse than one which leads to 0.02 probability under the variational objective, even though in practice they may be equally bad, since both have missed the relevant information. A second odd behavior is that all glimpse sequences are weighted equally in the log-likelihood gradient. It would be better if the training procedure focused its effort on those glimpses which contain the relevant information. Both of these effects contribute noise to the training procedure, especially in its early stages.

Instead, we adopt an approach based on the wake-p step of reweighted wake-sleep Bornschein and Bengio [2014], where we attempt to maximize the marginal log-probability \mathcal{L} directly. We differentiate the marginal log-likelihood objective in Eqn. 2.17 with respect to the model parameters:

\frac{\partial \mathcal{L}}{\partial \theta} = \frac{1}{p(y \mid \mathcal{I}, \theta)} \sum_a p(a \mid \mathcal{I}, \theta)\, p(y \mid a, \mathcal{I}, \theta) \left[ \frac{\partial \log p(y \mid a, \mathcal{I}, \theta)}{\partial \theta} + \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right]. \qquad (2.22)

The summation and normalizing constant are both intractable to evaluate, so we estimate them using importance sampling. We must define a proposal distribution q(a | y, I), which ideally should be close to the posterior p(a | y, I, θ). One reasonable choice is the prior p(a | I, θ), but another choice is described in Section 2.4.3. Normalized importance sampling gives a biased but consistent estimator of the gradient of \mathcal{L}. Given samples \tilde{a}^1, ..., \tilde{a}^M from q(a | y, I), the (unnormalized) importance weights are computed as:

\tilde{w}^m = \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta)\, p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{q(\tilde{a}^m \mid y, \mathcal{I})}. \qquad (2.23)

The Monte Carlo estimate of the gradient is given by:

\frac{\partial \mathcal{L}}{\partial \theta} \approx \sum_{m=1}^M w^m \left[ \frac{\partial \log p(y \mid \tilde{a}^m, \mathcal{I}, \theta)}{\partial \theta} + \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta} \right], \qquad (2.24)

where w^m = \tilde{w}^m / \sum_{i=1}^M \tilde{w}^i are the normalized importance weights. When q is chosen to be the prior, this approach is equivalent to the method of Tang and Salakhutdinov [2013] for learning generative feed-forward networks.

Our importance sampling based estimator can also be viewed as the gradient ascent update on the objective function \mathbb{E}\left[\log \frac{1}{M} \sum_{m=1}^M \tilde{w}^m\right]. Combining Jensen's inequality with the unbiasedness of the \tilde{w}^m shows that this is a lower bound on the log-likelihood:

" M # " M # 1 X 1 X log w˜m log w˜m = log [w ˜m] = . (2.25) E M ≤ E M E L m=1 m=1

We relate this to the previous section by noting that \mathcal{F} = \mathbb{E}[\log \tilde{w}^m]. Another application of Jensen's inequality shows that our proposed bound is at least as accurate as \mathcal{F}:

\mathcal{F} = \mathbb{E}[\log \tilde{w}^m] = \mathbb{E}\left[\frac{1}{M} \sum_{m=1}^M \log \tilde{w}^m\right] \leq \mathbb{E}\left[\log \frac{1}{M} \sum_{m=1}^M \tilde{w}^m\right]. \qquad (2.26)

Burda et al. [2015] further analyzed a closely related importance sampling based estimator in the context of generative models, bounding the mean absolute deviation and showing that the bias decreases monotonically with the number of samples.
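A sketch of the corresponding reweighted wake-sleep (wake-p) update, under the same assumptions as the earlier sketch; the normalized weights of Eqn. 2.23 are treated as constants so that differentiating the surrogate reproduces Eqn. 2.24:

```python
import torch

def wake_p_surrogate(log_py, log_pa, log_qa):
    """Normalized importance sampling estimate of the gradient of
    log p(y | I, theta). log_py, log_pa, log_qa: per-sample log
    probabilities under the classifier, the prior, and the proposal."""
    log_w = (log_pa + log_py - log_qa).detach()   # unnormalized log weights
    w = torch.softmax(log_w, dim=0)               # normalized weights w^m
    return -(w * (log_py + log_pa)).sum()         # surrogate loss (Eqn. 2.24)
```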

2.4.3 Training an inference network

Late in training, once the attention model has learned an effective policy, the prior distribution p(a | I, θ) is a reasonable choice for the proposal distribution q(a | y, I), as it puts significant probability mass on good actions. But early in training, the model may have only a small probability of choosing a good set of glimpses, and the prior may have little overlap with the posterior. To deal with this, we train an inference network to predict, given the observations as well as the class label, where the network should look to correctly predict that class (see Figure 2.3). With this additional information, the inference network can act as a "teacher" for the attention policy. The inference network predicts a sequence of glimpses stochastically:

q(a \mid y, \mathcal{I}, \eta) = \prod_{n=1}^N q(a_n \mid y, \mathcal{I}, \eta, a_{1:n-1}). \qquad (2.27)

This distribution is analogous to the prior, except that each decision also takes into account the class label y. We denote the parameters of the inference network by η. During training, the prediction network is learnt by following the gradient of the estimator in Eqn. 2.24, with samples \tilde{a}^m ∼ q(a | y, I, η) drawn from the inference network output.

Our training procedure for the inference network parallels the wake-q step of reweighted wake-sleep Bornschein and Bengio [2014]. Intuitively, the inference network is most useful if it puts large probability density over locations in an image that are most informative for predicting class labels. We therefore train the inference weights η to minimize the Kullback-Leibler divergence between the recognition model prediction q(a | y, I, η) and the posterior distribution from the attention model p(a | y, I, θ):

\min_\eta D_{KL}(p \,\|\, q) = \min_\eta \left[ -\sum_a p(a \mid y, \mathcal{I}, \theta) \log q(a \mid y, \mathcal{I}, \eta) \right]. \qquad (2.28)

The gradient update for the recognition weights can be obtained by taking the derivative of Eq. (2.28) with respect to the recognition weights η:

\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} = -\mathbb{E}_{p(a \mid y, \mathcal{I}, \theta)}\left[ \frac{\partial \log q(a \mid y, \mathcal{I}, \eta)}{\partial \eta} \right]. \qquad (2.29)

Since the posterior expectation is intractable, we estimate it with importance sampling. In fact, we reuse the importance weights computed for the prediction network update (see Eqn. 2.23) to obtain the following gradient estimate for the recognition network:

\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} \approx -\sum_{m=1}^M w^m \frac{\partial \log q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\partial \eta}. \qquad (2.30)
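The corresponding inference-network (wake-q) update can reuse the same log-weights; a sketch under the same assumptions as the earlier ones:

```python
import torch

def wake_q_surrogate(log_qa, log_w):
    """Inference-network update (Eqn. 2.30): minimize the importance-
    weighted negative log proposal probability, reusing the weights
    computed for the wake-p step (Eqn. 2.23)."""
    w = torch.softmax(log_w.detach(), dim=0)   # same normalized weights w^m
    return -(w * log_qa).sum()
```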

2.4.4 Control variates

The speed of convergence of gradient ascent with the gradients defined in Eqns. 2.24 and 2.30 suffers from high variance of the stochastic gradient estimates. Past work using similar gradient updates has found significant benefit from the use of control variates, or reward baselines, to reduce the variance Williams [1992a], Paisley et al. [2012], Mnih et al. [2014a], Mnih and Gregor [2014], Ba et al. [2015]. Choosing effective control variates for the stochastic gradient estimators amounts to finding a function that is highly correlated with the gradient vectors, and whose expectation is known or tractable to compute Paisley et al. [2012], Weaver and Tao [2001b]. Unfortunately, a good choice of control variate is highly model-dependent. We first note that:

\mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ \frac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)} \frac{\partial \log p(a \mid \mathcal{I}, \theta)}{\partial \theta} \right] = 0, \qquad \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ \frac{\partial \log q(a \mid y, \mathcal{I}, \eta)}{\partial \eta} \right] = 0. \qquad (2.31)

The terms inside the expectations are very similar to the gradients in Eqns. 2.24 and 2.30, suggesting that stochastic estimates of these expectations would make good control variates. To increase the correlation between the gradients and the control variates, we reuse the same set of samples and importance weights for the gradients and the control variates. Using these control variates in the gradient estimates for the prediction and recognition networks, we obtain:

\frac{\partial \mathcal{L}}{\partial \theta} \approx \sum_{m=1}^M \left( w^m - \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta) / q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\sum_{i=1}^M p(\tilde{a}^i \mid \mathcal{I}, \theta) / q(\tilde{a}^i \mid y, \mathcal{I}, \eta)} \right) \frac{\partial \log p(\tilde{a}^m \mid \mathcal{I}, \theta)}{\partial \theta}, \qquad (2.32)

(the term involving ∂ log p(y | ã^m, I, θ)/∂θ from Eqn. 2.24 is left unchanged), and

\frac{\partial D_{KL}(p \,\|\, q)}{\partial \eta} \approx -\sum_{m=1}^M \left( w^m - \frac{1}{M} \right) \frac{\partial \log q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\partial \eta}. \qquad (2.33)

Our use of control variates does not bias the gradient estimates (beyond the bias which is present due to importance sampling). However, as we show in the experiments, the resulting estimates have much lower variance than those of Eqns. 2.24 and 2.30. Following the analogy with reinforcement learning highlighted by Mnih and Gregor [2014], these control variates can also be viewed as reward baselines:

b_p = \frac{\dfrac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)} \cdot \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ p(y \mid a, \mathcal{I}, \theta) \right]}{M \cdot \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ \dfrac{p(a \mid \mathcal{I}, \theta)}{q(a \mid y, \mathcal{I}, \eta)} \right] \cdot \mathbb{E}_{q(a \mid y, \mathcal{I}, \eta)}\left[ p(y \mid a, \mathcal{I}, \theta) \right]} \approx \frac{p(\tilde{a}^m \mid \mathcal{I}, \theta) / q(\tilde{a}^m \mid y, \mathcal{I}, \eta)}{\sum_{i=1}^M p(\tilde{a}^i \mid \mathcal{I}, \theta) / q(\tilde{a}^i \mid y, \mathcal{I}, \eta)}, \qquad (2.34)

b_q = \frac{\mathbb{E}_{p(a \mid \mathcal{I}, \theta)}\left[ p(y \mid a, \mathcal{I}, \theta) \right]}{M \cdot \mathbb{E}_{p(a \mid \mathcal{I}, \theta)}\left[ p(y \mid a, \mathcal{I}, \theta) \right]} = \frac{1}{M}, \qquad (2.35)

where M is the number of samples drawn from the proposal q.
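A sketch of the control-variate-corrected weights of Eqns. 2.32 and 2.33, again with hypothetical per-sample log-probability tensors:

```python
import torch

def centered_weights(log_py, log_pa, log_qa):
    """Control-variate-corrected weights. w: weights from the full
    unnormalized posterior (Eqn. 2.23); r: weights from the prior/proposal
    ratio alone, whose weighted gradient has zero expectation (Eqn. 2.31)."""
    M = log_py.shape[0]
    w = torch.softmax((log_pa + log_py - log_qa).detach(), dim=0)
    r = torch.softmax((log_pa - log_qa).detach(), dim=0)
    w_theta = w - r         # multiplies d log p(a | I, theta)   (Eqn. 2.32)
    w_eta = w - 1.0 / M     # multiplies d log q(a | y, I, eta)  (Eqn. 2.33)
    return w_theta, w_eta
```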

2.4.5 Encouraging exploration

Similarly to other methods based on reinforcement learning, stochastic attention networks face the problem of encouraging exploration of different actions. Since the gradient in Eqn. 2.24 only rewards or punishes glimpse sequences which are actually performed, any part of the space which is never visited receives no reward signal. Ba et al. [2015] introduced several heuristics to encourage exploration, including: (1) raising the temperature of the proposal distribution, (2) regularizing the attention policy to encourage viewing all image locations, and (3) adding a regularization term to encourage high entropy in the action distribution. We have implemented all three heuristics for the WS-RAM and for the baselines. While these heuristics are important for good performance of the baselines, we found that they made little difference to the WS-RAM, because the basic method already explores adequately.

2.4.6 Experiments

To measure the effectiveness of the proposed WS-RAM method, we first investigated a toy classification task involving a variant of the MNIST handwritten digits dataset LeCun et al. [1998a] where transformations were applied to the images. We then evaluated the proposed method on a substantially more difficult image caption generation task using the Flickr8k Hodosh et al. [2013] dataset.

Translated scaled MNIST

We generated a dataset of randomly translated and scaled handwritten digits from the MNIST dataset LeCun et al. [1998a]. Each digit was placed at a random location and scale in a 100x100 image with a black background. The task was to identify the digit class. The attention models were allowed four glimpses before making a classification prediction. The goal of this experiment was to evaluate the effectiveness of our proposed WS-RAM model compared with the variational approach of Ba et al. [2015].

For both the WS-RAM and the baseline, the architecture was a stochastic attention model which used ReLU units in all recurrent layers. The actions included both continuous and discrete latent variables, corresponding to the glimpse location and scale, respectively. The distribution over actions was represented as a Gaussian random variable for the location and an independent multinomial random variable for the scale. All networks were trained using Adam [Kingma and Ba, 2014a], with the learning rate set to the highest value that allowed the model to successfully converge to a sensible attention policy.

The classification performance results are shown in Table 2.6. In Figure 2.4, the WS-RAM is compared with the variational baseline, each using the same number of samples (in order to make computation time roughly equivalent). We also show comparisons against ablated versions of the WS-RAM where the control variates and inference network were removed. When the inference network was removed, the prior p(a | I, θ) was used as the proposal distribution.

In addition to the classification results, we measured the effective sample size (ESS) of our method with and without control variates and the inference network. ESS is a standard metric for evaluating importance samplers, and is defined as 1 / \sum_m (w^m)^2, where w^m denotes the normalized importance weights. Results are shown in Figure 2.4. Using the inference network reduced the variance of the gradient estimates, although this improvement did not reflect itself in the ESS. Control variates improved both metrics.
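For reference, ESS can be computed from the log-weights in a numerically stable way; a small sketch:

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = 1 / sum_m (w^m)^2 with normalized importance weights;
    equals M for uniform weights and 1 when one sample dominates."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())   # subtract max for stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)
```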

[Figure 2.4: Left: Training error as a function of the number of updates. Middle: variance of the gradient estimates. Right: effective sample size (max = 5). Horizontal axis: thousands of updates. VAR: variational baseline; WS-RAM: our proposed method; +q: uses the inference network for the proposal distribution; +c: uses control variates.]

[Figure 2.5: The effect of the exploration heuristics on the variational baseline and the WS-RAM. Curves: VAR+c and WS-RAM+q+c, each with and without exploration.]

Table 2.6: Classification error rate comparison for attention models trained using different algorithms on translated scaled MNIST. The numbers are reported after 10 million updates using 5 samples.

Test err.    VAR      WS-RAM   WS-RAM+q
no c.v.      3.11%    4.23%    2.59%
+ c.v.       1.81%    1.85%    1.62%

In Section 2.4.5, we described heuristics which encourage the models to explore the action space. Figure 2.5 compares training with and without these heuristics. Without the heuristics, the variational method quickly fell into a local minimum where the model predicted only one glimpse scale for all images; the exploration heuristics fixed this problem. By contrast, the WS-RAM did not appear to have this problem, so the heuristics were not necessary.

2.5 Summary

Convolutional neural networks, trained end-to-end, have been shown to substantially outperform previous approaches to various supervised learning tasks in computer vision (e.g. Krizhevsky et al. [2012a]). Despite their wide success, convolutional nets are computationally expensive when processing high-resolution input images, because they must examine all image locations at a fine scale. This has motivated our development of the visual attention-based models in this chapter, which reduce the number of parameters and computational operations by selecting informative regions of an image to focus on. In addition to computational speedups, one can understand what information a neural network is using by seeing where it is looking.

We studied two learning objectives for stochastic attention models, the variational lower bound and the reweighted importance sampling bound, and compared them on standard handwritten digit classification tasks. We demonstrated that our stochastic attention model can learn to (1) classify translated and scaled MNIST digits, and (2) generate image captions by attending to the relevant objects in images at their corresponding scale. The proposed reweighted wake-sleep algorithm shows much improved training performance for the stochastic visual attention models.

Chapter 3

Generating image (and) captions with visual attention

3.1 Problem definition

Automatically generating captions for an image is a task close to the heart of scene understanding — one of the primary goals of computer vision. Not only must caption generation models be able to solve the computer vision challenges of determining what objects are in an image, but they must also be powerful enough to capture and express their relationships in natural language. For this reason, caption generation has long been seen as a difficult problem. It amounts to mimicking the remarkable human ability to compress huge amounts of salient visual information into descriptive language and is thus an important challenge for machine learning and AI research.

Yet despite the difficult nature of this task, there has been a recent surge of research interest in attacking the image caption generation problem. Aided by advances in training deep neural networks [Krizhevsky et al., 2012c] and the availability of large classification datasets [Russakovsky et al., 2014], recent work has significantly improved the quality of caption generation using a combination of convolutional neural networks (convnets) to obtain vectorial representations of images and recurrent neural networks to decode those representations into natural language sentences (see Sec. 3.2). Unlike CNNs that compress an entire image into a static representation, visual attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. Using representations (such as those from the very top layer of a convnet) that distill information in an image down to the most salient objects is one effective solution that has been widely adopted in previous work. Encouraged by recent advances in caption generation and inspired by recent successes in employing attention in machine translation [Bahdanau et al., 2014], we investigate visual attention models that can attend to salient parts of an image while generating its caption.

3.2 Related work

In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of


[Figure 3.1: Our model learns a words/image alignment. The visualized attentional maps (3) are explained in Sections 3.3.1 & 3.3.3. Pipeline: 1. Input Image; 2. Convolutional Feature Extraction (14x14 feature map); 3. RNN with attention over the image; 4. Word-by-word generation (sample output: "A bird flying over a body of water").]

these methods are based on recurrent neural networks and inspired by the successful use of sequence-to-sequence training with neural networks for machine translation [Cho et al., 2014c, Bahdanau et al., 2014, Sutskever et al., 2014c, Kalchbrenner and Blunsom, 2013]. The encoder-decoder framework [Cho et al., 2014c] of machine translation is well suited, because it is analogous to "translating" an image to a sentence.

The first approach to using neural networks for caption generation was proposed by Kiros et al. [2014a], who used a multimodal log-bilinear model that was biased by features from the image. This work was later followed by Kiros et al. [2014b], whose method was designed to explicitly allow for a natural way of doing both ranking and generation. Mao et al. [2014] used a similar approach to generation, but replaced a feedforward neural language model with a recurrent one. Both Vinyals et al. [2014a] and Donahue et al. [2014] used recurrent neural networks (RNNs) based on long short-term memory (LSTM) units [Hochreiter and Schmidhuber, 1997a] for their models. Unlike Kiros et al. [2014a] and Mao et al. [2014], whose models see the image at each time step of the output word sequence, Vinyals et al. [2014a] only showed the image to the RNN at the beginning. Along with images, Donahue et al. [2014] and Yao et al. [2015] also applied LSTMs to videos, allowing their models to generate video descriptions.

Most of these works represent images as a single feature vector from the top layer of a pre-trained convolutional network. Karpathy and Li [2014] instead proposed to learn a joint embedding space for ranking and generation; their model learns to score sentence and image similarity as a function of R-CNN object detections and outputs of a bidirectional RNN. Fang et al. [2014] proposed a three-step pipeline for generation by incorporating object detections. Their model first learns detectors for several visual concepts based on a multi-instance learning framework. A language model trained on captions is then applied to the detector outputs, followed by rescoring from a joint image-text embedding space. Unlike these models, our proposed attention framework does not explicitly use object detectors but instead learns latent alignments from scratch. This allows our model to go beyond "objectness" and learn to attend to abstract concepts.

Prior to the use of neural networks for generating captions, two main approaches were dominant. The first involved generating caption templates which were filled in based on the results of object detections and attribute discovery (Kulkarni et al. [2013], Li et al. [2011], Yang et al. [2011], Mitchell et al. [2012],

Elliott and Keller [2013]). The second approach was based on first retrieving similar captioned images from a large database and then modifying these retrieved captions to fit the query [Kuznetsova et al., 2012, 2014]. These approaches typically involved an intermediate "generalization" step to remove the specifics of a caption that are only relevant to the retrieved image, such as the name of a city. Both of these approaches have since fallen out of favour to the now dominant neural network methods. There has been a long line of previous work incorporating the idea of attention into neural networks. Some that share the same spirit as our work include Larochelle and Hinton [2010a], Denil et al. [2012], Tang et al. [2014] and more recently Gregor et al. [2015b]. In particular, however, our work directly extends the work of Bahdanau et al. [2014], Mnih et al. [2014b], Ba et al. [2014], Graves [2013a].

3.3 Image Caption Generation with Attention Mechanism

3.3.1 Model details

In this section, we describe the two variants of our attention-based model by first describing their common framework. The key difference is the definition of the ψ function which we describe in detail in Sec. 3.3.2. See Fig. 3.1 for the graphical illustration of the proposed model. We denote vectors with bolded font and matrices with capital letters. In our description below, we suppress bias terms for readability.

Encoder: convolutional features

Our model takes a single raw image and generates a caption y encoded as a sequence of 1-of-K encoded words.

y = \{y_1, \ldots, y_C\}, \quad y_i \in \mathbb{R}^K

where K is the size of the vocabulary and C is the length of the caption. We use a convolutional neural network in order to extract a set of feature vectors which we refer to as annotation vectors. The extractor produces L vectors, each of which is a D-dimensional representation corresponding to a part of the image.

a = \{\phi_1, \ldots, \phi_L\}, \quad \phi_i \in \mathbb{R}^D

In order to obtain a correspondence between the feature vectors and portions of the 2-D image, we extract features from a lower convolutional layer unlike previous work which instead used a fully connected layer. This allows the decoder to selectively focus on certain parts of an image by weighting a subset of all the feature vectors.

Decoder: long short-term memory network

We use a long short-term memory (LSTM) network [Hochreiter and Schmidhuber, 1997a] that produces a caption by generating one word at every time step, conditioned on a context vector, the previous hidden state and the previously generated words. Our implementation of LSTMs, shown in Fig. 1.1, closely follows the one used in Zaremba et al. [2014].

Figure 3.2: Visualization of the attention for each generated word. The rough visualizations are obtained by upsampling the attention weights and smoothing. (top) "soft" and (bottom) "hard" attention (note that both models generated the same captions in this example).

In simple terms, the context vector ẑ_t is a dynamic representation of the relevant part of the image input at time t. We define a mechanism ψ that computes ẑ_t from the annotation vectors φ_i, i = 1, ..., L, corresponding to the features extracted at different image locations. For each location i, the mechanism generates a positive weight α_i which can be interpreted either as the probability that location i is the right place to focus on for producing the next word (stochastic attention mechanism), or as the relative importance to give to location i in blending the φ_i's together (deterministic attention mechanism). The weight α_i of each annotation vector φ_i is computed by an attention model f_att, for which we use a multilayer perceptron conditioned on the previous hidden state a_{t-1}. To emphasize, we note that the hidden state varies as the output RNN advances in its output sequence: "where" the network looks next depends on the sequence of words that has already been generated.

e_{ti} = f_{att}(\phi_i, a_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^L \exp(e_{tk})}.

Once the weights (which sum to one) are computed, the context vector ẑ_t is computed by

\hat{z}_t = \psi\left( \{\phi_i\}, \{\alpha_i\} \right), \qquad (3.1)

where ψ is a function that returns a single vector given the set of annotation vectors and their corresponding weights. The details of the ψ function are discussed in Sec. 3.3.2.
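A minimal sketch of this attention step with the deterministic "soft" ψ of Sec. 3.3.2; `f_att` is assumed to be a small MLP returning a scalar score:

```python
import torch
import torch.nn.functional as F

def soft_context(phi, a_prev, f_att):
    """phi: (L, D) annotation vectors; a_prev: previous hidden state;
    f_att: an MLP scoring each location (a sketch of Eqn. 3.1)."""
    L = phi.shape[0]
    e = torch.stack([f_att(phi[i], a_prev) for i in range(L)])  # e_{ti}
    alpha = F.softmax(e, dim=0)                                 # weights sum to 1
    z = (alpha.unsqueeze(1) * phi).sum(dim=0)                   # context vector
    return z, alpha
```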

The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (f_{init,c} and f_{init,h}):

c_0 = f_{init,c}\left( \frac{1}{L} \sum_{i}^{L} \phi_i \right), \qquad a_0 = f_{init,h}\left( \frac{1}{L} \sum_{i}^{L} \phi_i \right)

We use a deep output layer [Pascanu et al., 2014] to compute the output word probability. Its inputs are cues from the image (the context vector), the previously generated word, and the decoder state (a_t).

p(y_t \mid \phi, y_{t-1}) \propto \exp\left( L_o (E y_{t-1} + L_h a_t + L_z \hat{z}_t) \right), \qquad (3.2)

where L_o \in \mathbb{R}^{K \times m}, L_h \in \mathbb{R}^{m \times n}, L_z \in \mathbb{R}^{m \times D}, and E are learned parameters initialized randomly.

3.3.2 Learning stochastic “hard” vs deterministic “soft” Attention

In this section we discuss two alternative mechanisms for the attention model fatt: stochastic attention and deterministic attention.

Stochastic “hard” attention

We represent the location variable s_t as where the model decides to focus attention when generating the t-th word. s_{t,i} is an indicator one-hot variable which is set to 1 if the i-th location (out of L) is the one used to extract visual features. By treating the attention locations as intermediate latent variables, we can assign a multinoulli distribution parametrized by \{\alpha_i\}, and view ẑ_t as a random variable:

p(s_{t,i} = 1 \mid s_{j<t}, \phi) = \alpha_{t,i} \qquad (3.3)

Similar to Chapter 2, we define a variational lower bound \mathcal{L}_s on the marginal log-likelihood \log p(y \mid \phi) of observing the sequence of words y given image features φ. Similar to work in deep generative modeling [Kingma and Welling, 2014b, Rezende et al., 2014], the learning algorithm for the parameters θ of the model can be derived by directly optimizing

\mathcal{L}_s = \sum_s p(s \mid \phi) \log p(y \mid s, \phi) \leq \log \sum_s p(s \mid \phi)\, p(y \mid s, \phi) = \log p(y \mid \phi), \qquad (3.5)

following its gradient

\frac{\partial \mathcal{L}_s}{\partial \theta} = \sum_s p(s \mid \phi) \left[ \frac{\partial \log p(y \mid s, \phi)}{\partial \theta} + \log p(y \mid s, \phi)\, \frac{\partial \log p(s \mid \phi)}{\partial \theta} \right]. \qquad (3.6)

We approximate this gradient of \mathcal{L}_s by a Monte Carlo method such that

\frac{\partial \mathcal{L}_s}{\partial \theta} \approx \frac{1}{N} \sum_{n=1}^N \left[ \frac{\partial \log p(y \mid \tilde{s}^n, \phi)}{\partial \theta} + \log p(y \mid \tilde{s}^n, \phi)\, \frac{\partial \log p(\tilde{s}^n \mid \phi)}{\partial \theta} \right], \qquad (3.7)

where \tilde{s}^n = (s_1^n, s_2^n, \ldots) is a sequence of sampled attention locations. We sample the location s_t^n from a multinoulli distribution defined by Eq. (3.3):

\tilde{s}_t^n \sim \text{Multinoulli}_L(\{\alpha_i^n\}).

We reduce the variance of this estimator with the moving average baseline technique [Weaver and Tao, 2001a]. Upon seeing the k-th mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log likelihoods with exponential decay:

b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(y \mid \tilde{s}_k, \phi)

To further reduce the estimator variance, the gradient of the entropy H[s] of the multinoulli distribution is added to the RHS of Eq. (3.7).

The final learning rule for the model is then

\frac{\partial \mathcal{L}_s}{\partial \theta} \approx \frac{1}{N} \sum_{n=1}^N \left[ \frac{\partial \log p(y \mid \tilde{s}^n, \phi)}{\partial \theta} + \lambda_r \left( \log p(y \mid \tilde{s}^n, \phi) - b \right) \frac{\partial \log p(\tilde{s}^n \mid \phi)}{\partial \theta} + \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial \theta} \right],

where λ_r and λ_e are two hyper-parameters set by cross-validation. As pointed out and used by Ba et al. [2014] and Mnih et al. [2014b], this formulation is equivalent to the REINFORCE learning rule [Williams, 1992c], where the reward for the attention choosing a sequence of actions is a real value proportional to the log-likelihood of the target sentence under the sampled attention trajectory. In order to further improve the robustness of this learning rule, with probability 0.5 for a given image, we set the sampled attention location \tilde{s} to its expected value α (equivalent to the deterministic attention in Sec. 3.3.2).
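A sketch of this learning rule as a surrogate loss (hypothetical names; `lr` and `le` stand for λ_r and λ_e):

```python
import torch

def hard_attention_loss(log_py, log_ps, entropy, baseline, lr=1.0, le=0.01):
    """One sample per row: log_py = log p(y | s~, phi),
    log_ps = log p(s~ | phi), entropy = H[s~]; baseline is the
    moving-average b. Differentiating reproduces the rule above."""
    reward = log_py.detach() - baseline          # centered reward
    surrogate = log_py + lr * reward * log_ps + le * entropy
    return -surrogate.mean()

# Moving-average baseline, updated once per mini-batch k:
# baseline = 0.9 * baseline + 0.1 * log_py.detach().mean()
```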

Deterministic “soft” attention

Learning stochastic attention requires sampling the attention location s_t each time; instead, we can take the expectation of the context vector ẑ_t directly,

\mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \sum_{i=1}^L \alpha_{t,i}\, \phi_i \qquad (3.8)

and formulate a deterministic attention model by computing a soft attention weighted annotation vector \psi\left( \{\phi_i\}, \{\alpha_i\} \right) = \sum_{i}^{L} \alpha_i \phi_i, as proposed by Bahdanau et al. [2014]. This corresponds to feeding a softly weighted context into the system. The whole model is smooth and differentiable under the deterministic attention, so learning end-to-end is trivial using standard back-propagation. Learning the deterministic attention can also be understood as approximately optimizing the marginal likelihood in Eq. (3.5) under the attention location random variable s_t from Sec. 3.3.2. The hidden activation of the LSTM, a_t, is a linear projection of the stochastic context vector ẑ_t followed by a tanh non-linearity.

To first-order Taylor approximation, the expected value \mathbb{E}_{p(s_t \mid a)}[a_t] is equivalent to computing a_t using a single forward computation with the expected context vector \mathbb{E}_{p(s_t \mid a)}[\hat{z}_t].

Let us denote by n_{t,i} the quantity n in Eq. (3.2) with ẑ_t set to φ_i. Then we can write the normalized weighted geometric mean (NWGM) of the softmax of the k-th word prediction as

\text{NWGM}[p(y_t = k \mid \phi)] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1 \mid a)}} = \frac{\exp\left( \mathbb{E}_{p(s_t \mid a)}[n_{t,k}] \right)}{\sum_j \exp\left( \mathbb{E}_{p(s_t \mid a)}[n_{t,j}] \right)}

This implies that the NWGM of the word prediction can be well approximated by using the expected context vector \mathbb{E}[\hat{z}_t], instead of the sampled context vector φ_i. Furthermore, from the result of Baldi and Sadowski [2014], the NWGM above, which can be computed by a single feedforward computation, approximates the expectation \mathbb{E}[p(y_t = k \mid \phi)] of the output over all possible attention locations induced by the random variable s_t. This suggests that the proposed deterministic attention model approximately maximizes the marginal likelihood over all possible attention locations.

Doubly stochastic attention

In training the deterministic version of our model, we introduce a form of doubly stochastic regularization that encourages the model to pay equal attention to every part of the image. Whereas the attention at every point in time sums to 1 by construction (i.e. \sum_i \alpha_{ti} = 1), the attention \sum_t \alpha_{ti} at each location is not constrained in any way. This makes it possible for the decoder to ignore some parts of the input image. In order to alleviate this, we encourage \sum_t \alpha_{ti} \approx \tau where \tau \geq L/D. In our experiments, we observed that this penalty quantitatively improves overall performance and qualitatively leads to more descriptive captions.

Additionally, the soft attention model predicts a gating scalar β from the previous hidden state a_{t-1} at each time step t, such that \psi\left( \{\phi_i\}, \{\alpha_i\} \right) = \beta \sum_{i}^{L} \alpha_i \phi_i, where \beta_t = \sigma(f_\beta(a_{t-1})). This gating variable lets the decoder decide whether to put more emphasis on language modeling or on the context at each time step. Qualitatively, we observe that the gating variable is larger when the decoder describes an object in the image.

The soft attention model is trained end-to-end by minimizing the following penalized negative log-likelihood:

\mathcal{L}_d = -\log p(y \mid \phi) + \lambda \sum_{i}^{L} \left( 1 - \sum_{t}^{C} \alpha_{ti} \right)^2, \qquad (3.9)

where we simply fixed τ to 1.
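A sketch of the penalized objective in Eq. (3.9), assuming `alphas` stacks the attention weights over time:

```python
import torch

def doubly_stochastic_loss(nll, alphas, lam):
    """alphas: (C, L) attention weights over C time steps and L locations.
    The per-location attention summed over time is pushed toward tau = 1."""
    penalty = ((1.0 - alphas.sum(dim=0)) ** 2).sum()
    return nll + lam * penalty
```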

Training procedure

Both variants of our attention model were trained with stochastic gradient descent using adaptive learning rates. For the Flickr8k dataset, we found that RMSProp [Tieleman and Hinton, 2012b] worked best, while for the Flickr30k/MS COCO datasets we found the recently proposed Adam algorithm [Kingma and Ba, 2014c] to be quite effective.

To create the annotation vectors φ_i used by our decoder, we used the Oxford VGGnet [Simonyan and Zisserman, 2014a] pre-trained on ImageNet without finetuning. In our experiments we use the 14 × 14 × 512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196 × 512 (i.e. L × D) encoding. In principle, however, any encoding function could be used. In addition, with enough data, the encoder could also be trained from scratch (or fine-tuned) with the rest of the model.

As our implementation requires time proportional to the length of the longest sentence per update, we found training on a random group of captions to be computationally wasteful. To mitigate this problem, in preprocessing we build a dictionary mapping the length of a sentence to the corresponding subset of captions. Then, during training we randomly sample a length and retrieve a mini-batch of size 64 of that length; a sketch is given below. We found that this greatly improved convergence speed with no noticeable diminution in performance. On our largest dataset (MS COCO), our soft attention model took less than 3 days to train on an NVIDIA Titan Black GPU.

In addition to dropout [Srivastava et al., 2014], the only other regularization strategy we used was early stopping on BLEU score. We observed a breakdown in the correlation between the validation set log-likelihood and BLEU in the later stages of training during our experiments. Since BLEU is the most commonly reported metric, we used BLEU on our validation set for model selection.
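A sketch of the length-bucketed mini-batch sampling described above (hypothetical helper names):

```python
import random
from collections import defaultdict

def build_length_buckets(captions):
    """Map caption length -> indices of captions with that length,
    as in the preprocessing step described above."""
    buckets = defaultdict(list)
    for idx, cap in enumerate(captions):
        buckets[len(cap)].append(idx)
    return buckets

def sample_minibatch(buckets, batch_size=64):
    """Sample a length, then a mini-batch of captions of that length,
    so no update wastes computation on padding."""
    length = random.choice(list(buckets))
    pool = buckets[length]
    return random.sample(pool, min(batch_size, len(pool)))
```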

In our experiments with soft attention, we used Whetlab1 [Snoek et al., 2012, 2014] for hyperparameter search in our Flickr8k experiments. Some of the intuitions we gained from the hyperparameter regions it explored were especially important in our Flickr30k and COCO experiments. We make our code for these models publicly available to encourage future research in this area2.

Figure 3.3: Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word)

3.3.3 Experiments

Dataset     Model                                      BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR
Flickr8k    Google NIC [Vinyals et al., 2014a]†Σ       63       41       27       —        —
            Log Bilinear [Kiros et al., 2014a]◦        65.6     42.4     27.7     17.7     17.31
            Soft-Attention                             67       44.8     29.9     19.5     18.93
            Hard-Attention                             67       45.7     31.4     21.3     20.30
Flickr30k   Google NIC†◦Σ                              66.3     42.3     27.7     18.3     —
            Log Bilinear                               60.0     38       25.4     17.1     16.88
            Soft-Attention                             66.7     43.4     28.8     19.1     18.49
            Hard-Attention                             66.9     43.9     29.6     19.9     18.46
COCO        CMU/MS Research [Chen and Zitnick, 2014]a  —        —        —        —        20.41
            MS Research [Fang et al., 2014]†a          —        —        —        —        20.71
            BRNN [Karpathy and Li, 2014]◦              64.2     45.1     30.4     20.3     —
            Google NIC†◦Σ                              66.6     46.1     32.9     24.6     —
            Log Bilinear◦                              70.8     48.9     34.4     24.3     20.03
            Soft-Attention                             70.7     49.2     34.4     24.3     23.90
            Hard-Attention                             71.8     50.4     35.7     25.0     23.04

Table 3.1: BLEU-1,2,3,4/METEOR metrics compared to other methods. † indicates a different split, (—) indicates an unknown metric, ◦ indicates the authors kindly provided missing metrics by personal communication, Σ indicates an ensemble, a indicates using AlexNet.

We describe our experimental methodology and quantitative results which validate the effectiveness of our model for caption generation.

1https://www.whetlab.com/
2https://github.com/kelvinxu/arctic-captions

Data

We report results on the widely used Flickr8k and Flickr30k datasets as well as the more recently introduced MS COCO dataset. Each image in the Flickr8k/30k datasets has 5 reference captions. In preprocessing the COCO dataset, we maintained the same number of references between our datasets by discarding captions in excess of 5. We applied only basic tokenization to MS COCO so that it is consistent with the tokenization present in Flickr8k and Flickr30k. For all our experiments, we used a fixed vocabulary size of 10,000.

Results for our attention-based architecture are reported in Table 3.1. We report results with the frequently used BLEU metric3, which is the standard in image caption generation research. We report BLEU4 from 1 to 4 without a brevity penalty. There has been, however, criticism of BLEU, so we also report another common metric, METEOR [Denkowski and Lavie, 2014], and compare whenever possible.

Figure 3.4: Examples of mistakes where we can use attention to gain intuition into what the model saw.

Evaluation procedures

A few challenges exist for comparison, which we explain here. The first challenge is a difference in choice of convolutional feature extractor. For identical decoder architectures, using more recent architectures such as GoogLeNet [Szegedy et al., 2014a] or Oxford VGG [Simonyan and Zisserman, 2014a] can give a boost in performance over using the AlexNet [Krizhevsky et al., 2012c]. In our evaluation, we compare directly only with results which use the comparable GoogLeNet/Oxford VGG features, but for METEOR comparison we include some results that use AlexNet. The second challenge is a single model versus ensemble comparison. While other methods have reported performance boosts by using ensembling, in our results we report a single model performance.

3We verified that our BLEU evaluation code matches the authors of Vinyals et al. [2014a], Karpathy and Li [2014] and Kiros et al. [2014b]. For fairness, we only compare against results for which we have verified that our BLEU evaluation code is the same.
4BLEU-n is the geometric average of the n-gram precisions. For instance, BLEU-1 is the unigram precision, and BLEU-2 is the geometric average of the unigram and bigram precisions.

Finally, there is a challenge due to differences between dataset splits. In our reported results, we use the pre-defined splits of Flickr8k. However, for the Flickr30k and COCO datasets there is a lack of standardized splits for which results are reported. As a result, we report results with the publicly available splits5 used in previous work [Karpathy and Li, 2014]. We note, however, that the differences in splits do not make a substantial difference in overall performance.

Quantitative analysis

In Table 3.1, we provide a summary of the experiments validating the quantitative effectiveness of attention. We obtain state-of-the-art performance on Flickr8k, Flickr30k and MS COCO. In addition, we note that in our experiments we are able to significantly improve the state-of-the-art METEOR score on MS COCO. We speculate that this is connected to some of the regularization techniques we used (see Sec. 3.3.2) and our lower-level representation.

Qualitative analysis: learning to attend

By visualizing the attention learned by the model, we are able to add an extra layer of interpretability to the output of the model (see Fig. 3.1). Other systems that have done this rely on object detection systems to produce candidate alignment targets [Karpathy and Li, 2014]. Our approach is much more flexible, since the model can attend to "non-object" salient regions.

The 19-layer OxfordNet uses stacks of 3x3 filters, meaning the only time the feature maps decrease in size is due to the max pooling layers. The input image is resized so that the shortest side is 256 pixels, with preserved aspect ratio. The input to the convolutional network is the center-cropped 224x224 image. Consequently, with four max pooling layers, the output dimension of the top convolutional layer is 14x14. Thus, in order to visualize the attention weights for the soft model, we upsample the weights by a factor of 2^4 = 16 and apply a Gaussian filter to emulate the large receptive field size.

As we can see in Figs. 3.2 and 3.3, the model learns alignments that agree very strongly with human intuition. Especially from the examples of mistakes in Fig. 3.4, we see that it is possible to exploit such visualizations to get an intuition as to why those mistakes were made. We provide a more extensive list of visualizations in the supplementary materials for the reader.

3.4 Generating images

Statistical natural image modelling remains a fundamental problem in computer vision and image understanding. This has motivated recent approaches in generative modelling applied to natural images by employing deep neural networks for their inference and generative components. Image generative models studied previously are often restricted to learning unconditional models of the image distribution, or models conditioned on simple structured annotations, for example classification labels. Despite the advances in generative models, learning the highly structured natural image distribution in the high dimensional pixel space alone proves to be a difficult task. In the real world, however, images rarely appear in isolation. They are often accompanied by unstructured textual descriptions on web pages and in books. One domain presents a substantial amount of relevant information for the other domain. The

5http://cs.stanford.edu/people/karpathy/deepimagesent/

Figure 3.5: The image-to-caption generative model

additional information from images and unstructured text descriptions can be used to simplify the image modelling task.

There are two directions in learning a generative model of images and text. One approach is to learn a text generative model conditioned on images. A significant amount of recent work has focused on generating captions from images [Karpathy and Li, 2015], [Xu et al., 2015], [Kiros et al., 2014d], etc. These models take an image descriptor and generate unstructured text through a recurrent decoder. By contrast, a generative model of images and text may also be studied by generating images that correctly interpret the text description. Generating high dimensional realistic images from their descriptions is a more difficult approach that combines two challenging components: language modelling and image generation. Namely, the model has to capture the semantic meaning expressed in the description and then use that knowledge to generate the pixel intensities of the image. Although interesting high dimensional natural images lie on a small manifold that is difficult to capture, the additional text description of a target image may simplify the learning problem by focusing on the conditional distribution.

In this section, we illustrate how simple sequential deep learning techniques can be used to build an effective conditional probabilistic model over the natural image space. By using a sequence-to-sequence framework to approach the problem of image generation from unstructured natural language captions, our model iteratively draws patches on a canvas, while attending to the relevant words in the description.

3.4.1 Model architecture

Our proposed model can be viewed as an instance of the sequence-to-sequence framework [Sutskever et al., 2014a], [Cho et al., 2014b], [Srivastava et al., 2015], where captions are represented as a sequence of consecutive words and images are represented as a sequence of patches drawn on a canvas over time t = 1, ..., T. Let y be the input caption, consisting of N words y_1, y_2, ..., y_N, and let x be the image corresponding to that caption.

Language model: the bidirectional attention RNN

The input caption sentence is fed into a deterministic bidirectional LSTM that encodes the variable-size sentence into a vector representation s. The bidirectional LSTM consists of a forward LSTM and a backward LSTM, which combine information from the past and the future respectively. The forward LSTM computes the sequence of forward hidden states [\overrightarrow{h}_1^{lang}, \overrightarrow{h}_2^{lang}, \ldots, \overrightarrow{h}_N^{lang}], whereas the backward LSTM computes the sequence of backward hidden states [\overleftarrow{h}_1^{lang}, \overleftarrow{h}_2^{lang}, \ldots, \overleftarrow{h}_N^{lang}]. These hidden states are then concatenated into the sequence [h_1^{lang}, h_2^{lang}, \ldots, h_N^{lang}], where h_n^{lang} = [\overrightarrow{h}_n^{lang}, \overleftarrow{h}_n^{lang}], 1 \leq n \leq N.

Image model: the conditional DRAW network

The DRAW network Gregor et al. [2015c] is a sequential probabilistic model that generates images by accumulating an output at each iterative step. While the original DRAW network assumes the latent variables are independent, it has been shown in [Bachman and Precup, 2015] that model performance is improved by including dependencies between the latent variables. We extended the DRAW network's generative process to include an additional input caption from the language model described in Sec. 3.4.1. Similarly to the original DRAW network, the conditional DRAW network is a stochastic recurrent neural network that consists of an inference LSTM that infers the distribution of the latent variables of the image x given y, and a generative LSTM that uses the inferred latent variables in order to reconstruct the image x given y. The align function is used to compute the alignment between the input caption and the intermediate image generative steps, as in Bahdanau et al. [2015a]. Formally, the image is generated by iteratively computing the following equations for t = 1, ..., T:

\hat{x}_t = x - \sigma(c_{t-1}) \qquad (3.10)
r_t = read(x_t, \hat{x}_t, h_{t-1}^{dec}) \qquad (3.11)
h_t^{enc} = LSTM^{enc}(h_{t-1}^{enc}, [r_t, h_{t-1}^{dec}]) \qquad (3.12)
z_t \sim q(Z_t \mid h_t^{enc}) \qquad (3.13)
h_t^{dec} = LSTM^{dec}(h_{t-1}^{dec}, z_t, s_{t-1}) \qquad (3.14)
s_t = align(h_{t-1}^{dec}, h^{lang}) \qquad (3.15)
c_t = c_{t-1} + write(h_t^{dec}) \qquad (3.16)

where read and write are the same attention operators as in [Gregor et al., 2015c]. Given the caption representation from the language model, h^{lang} = [h_1^{lang}, h_2^{lang}, \ldots, h_N^{lang}], the align operator computes the final sentence representation s_t through a weighted sum using alignment probabilities \alpha_{1 \ldots N}:

s_t = align(h_{t-1}^{dec}, h^{lang}) = \alpha_1 h_1^{lang} + \alpha_2 h_2^{lang} + \ldots + \alpha_N h_N^{lang}. \qquad (3.17)

The corresponding alignment probabilities \alpha_{1 \ldots N} at each step are obtained by:

e_{tj} = v^T \tanh(U h_j^{lang} + W h_t^{dec} + b), \qquad (3.18)
\alpha_j = \frac{\exp(e_{tj})}{\sum_{j=1}^N \exp(e_{tj})}. \qquad (3.19)

Here h_0^{lang} is initialized to a learned bias. Setting \alpha_{1 \ldots N} to \frac{1}{N} turns the encoder into the vanilla model introduced in [Cho et al., 2014b], without the attention.
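A sketch of the align operator (Eqns. 3.17–3.19); `U`, `W`, `v`, `b` are the learned parameters, with shapes assumed:

```python
import torch

def align(h_dec, h_lang, U, W, v, b):
    """h_lang: (N, d_lang) hidden states from the bidirectional LSTM;
    h_dec: the DRAW decoder state. Returns the sentence representation
    s_t and the alignment probabilities."""
    e = torch.tanh(h_lang @ U.T + h_dec @ W.T + b) @ v   # e_{tj}, shape (N,)
    alpha = torch.softmax(e, dim=0)                      # alignment probabilities
    s = (alpha.unsqueeze(1) * h_lang).sum(dim=0)         # weighted sum (Eqn. 3.17)
    return s, alpha
```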

3.4.2 Learning

The model is learned using a modified version of the Stochastic Gradient Variational Bayes (SGVB) algorithm introduced by Kingma and Welling [2014a]. The model is trained to maximize the lower bound \mathcal{L} of the marginal likelihood of the correct image x given the input caption y. The bound \mathcal{L} is decomposed into the latent loss \mathcal{L}_z and the reconstruction loss \mathcal{L}_x.

The reconstruction loss \mathcal{L}_x equals \frac{1}{L} \sum_{l=1}^L \log p(x \mid y, z^l), where L is the number of samples used during training, which was set to 1 in our experiments. The latent loss \mathcal{L}_z is a negative sum of Kullback-Leibler divergence terms between the distribution q(Z_t \mid h_t^{enc}) and some prior distribution p(Z_t) over time t = 1, ..., T, which can be seen as a regularization term. Since the patches drawn on the canvas over time are not independent of each other, the sufficient statistics of the prior distribution at time t should naturally depend on the sufficient statistics of the prior distribution at time t - 1. Therefore, instead of setting p(Z_1), ..., p(Z_T) to be independent unit Gaussian distributions, the mean and variance of p(Z_t) depend on h_{t-1}^{dec}, which forms a Markov chain p(Z_1), p(Z_2 \mid Z_1), \ldots, p(Z_T \mid Z_{T-1}) as in [Bachman and Precup, 2015], where

\mu_t^{prior} = \tanh(W_\mu h_{t-1}^{dec}) \qquad (3.20)
\sigma_t^{prior} = \exp\left( \tanh(W_\sigma h_{t-1}^{dec}) \right) \qquad (3.21)
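A sketch of the Markov prior statistics (Eqns. 3.20–3.21) and the per-step diagonal-Gaussian KL term that appears in the loss below:

```python
import torch

def prior_params(h_dec_prev, W_mu, W_sigma):
    """Sufficient statistics of p(Z_t | Z_{t-1}): both depend on the
    previous decoder state, forming a Markov chain over the latents."""
    mu = torch.tanh(h_dec_prev @ W_mu.T)
    sigma = torch.exp(torch.tanh(h_dec_prev @ W_sigma.T))
    return mu, sigma

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, one term of Eqn. 3.22."""
    return (torch.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5).sum()
```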

Overall, the loss function \mathcal{L} is calculated as follows:

\mathcal{L} = \mathbb{E}_{Q(Z_{1:T} \mid y, x)}\left[ -\log p(x \mid y, Z_{1:T}) + \sum_{t=2}^T D_{KL}\left( q(Z_t \mid Z_{t-1}, y, x) \,\|\, p(Z_t \mid Z_{t-1}, y) \right) \right] + D_{KL}\left( q(Z_1 \mid y, x) \,\|\, p(Z_1 \mid y) \right). \qquad (3.22)

The expectation can be approximated using M Monte Carlo samples \tilde{Z}_{1:T} from q(Z_{1:T} \mid y, x):

\mathcal{L} \approx \frac{1}{M} \sum_{m=1}^M \left[ -\log p(x \mid y, \tilde{Z}_{1:T}^m) + \sum_{t=2}^T D_{KL}\left( q(Z_t \mid \tilde{Z}_{t-1}^m, y, x) \,\|\, p(Z_t \mid \tilde{Z}_{t-1}^m, y) \right) \right] + D_{KL}\left( q(Z_1 \mid y, x) \,\|\, p(Z_1 \mid y) \right). \qquad (3.23)

3.4.3 Generating images from captions

During the image generation step, we discard the inference network from the decoder and instead sample from the prior distribution. Due to the blurriness of samples generated by the DRAW model, we perform an additional post-processing step, where we use the generator of an adversarial network trained on residuals of a Laplacian pyramid to sharpen the generated images, similar to [Denton et al., 2015]. By fixing the prior of the adversarial generator to the mean of the uniform distribution, it gets treated as a deterministic neural network, which allows us to calculate the lower bound of the likelihood. The reconstruction loss becomes the loss between the sharpened image and the correct image, whereas the latent loss stays the same. We also noticed that sampling from the mean of the uniform distribution allowed us to generate much less noisy samples than sampling from the uniform distribution itself.

[Figure 3.6: Examples of changing the color while keeping the caption fixed. Captions: "A yellow school bus parked in a parking lot." / "A red school bus parked in a parking lot." / "A green school bus parked in a parking lot." / "A blue school bus parked in a parking lot."]

3.4.4 Experiments

COCO

Microsoft COCO [Lin et al., 2014b] is the largest dataset of images annotated with captions, consisting of roughly 83k images. The rich collection of images with a variety of styles, backgrounds and objects makes the task of learning a good generative model conditioned on captions very challenging. Since some of the images have more than five captions attached to them, for consistency with related work on caption generation we disregard the extra captions. In the following subsections we make both qualitative and quantitative analyses of our model, as well as compare its performance with the performance of other related generative models.

Analysis of generated images

The main goal of image generation is to learn a model that can understand the semantic meaning expressed in the descriptions of the images, such as the properties of objects, the relationships between them, etc., and then use that knowledge to generate relevant images. To verify this, we wrote a set of captions inspired by the COCO dataset and changed some words in the captions to see whether the model made the relevant changes in the generated samples.

First, we wanted to see whether the model understood one of the most basic properties of any object, the color. As shown in Figure 3.6, we generated images of school buses with four different colors: yellow, red, green and blue. Although there are images of buses with different colors in the training set, all school buses specifically are colored yellow. Despite that, the model managed to generate images of an object reminiscent of a school bus painted with the relevant color.

Apart from changing the colors of objects, we were curious whether changing the background of the scene described in the caption would result in appropriate changes in the generated samples. Changing the background in images is a somewhat harder task for the model than changing the color, due to the larger visual area in images taken by the background. Nevertheless, as shown in Figure 3.7, changing the skies from blue to rainy as well as changing the grass type from dry to green resulted in the appropriate changes. The nearest images from the training set also indicate that the model was not simply copying the patterns it observed during the learning phase.

Despite an infinite number of ways of changing colors and backgrounds in descriptions, in general we found that the model made appropriate changes as long as some similar pattern was present in the training set. However, this wasn't always the case when changing an object itself in the description.

Figure 3.7: Examples of changing the background while keeping the caption fixed. The captions read "A very large commercial plane flying in blue skies.", "A very large commercial plane flying in rainy skies.", "A herd of elephants walking across a dry grass field." and "A herd of elephants walking across a green grass field.". The respective nearest training images based on pixelwise distance are displayed on top.

As you can see in Figure 3.8, when the objects did not have clear fine-grained differences, such as a distinctive shape or color, the relevant changes in the generated samples were not clearly visible. This highlights the model's limited ability to grasp a detailed understanding of each object.

Figure 3.8: Examples of changing the object while keeping the caption fixed. The captions read "The decadent chocolate dessert is on the table.", "A bowl of bananas is on the table.", "A vintage photo of a cat." and "A vintage photo of a dog.".

Analysis of attention

Unfortunately, we found no connection between the patches drawn on the canvas and the most attended words at each timestep. During image generation, the model mostly focused on the few words that carried the semantic meaning of the caption. The words attended to most during all generation steps indicated the kind of scene the model would generate. For example, as shown in Figure 3.9, attending equally to the words "desert" and "forest" allowed the model to make the relevant changes in the scene, whereas in the second example the model completely ignored the word "sun" and made no changes.

Figure 3.9: Examples of most attended words while changing the background in the caption. The captions read "A rider on a blue motorcycle in the desert.", "A rider on a blue motorcycle in the forest.", "A surfer, a woman, and a child walk on the beach." and "A surfer, a woman, and a child walk on the sun.".

Figure 3.10: Four different models displaying results from sampling the caption "A group of people walk on a beach with surf boards."; the labeled panels include Align DRAW, LAPGAN and Conv. VAE.

Comparison with other models

The quantitative evaluation of generative models has been a subject of ambiguity in the machine learning community. Compared to reporting classification accuracies for discriminative models, the measures defining generative models are intractable most of the time and might not correctly reflect the real quality of the model. To better compare the performance of generative models, we report results on two different metrics as well as a qualitative comparison of the different models. As shown in Figure 3.10, we generated several samples from the prior of each of the current state-of-the-art generative models corresponding to the caption "A group of people walk on a beach with surf boards". While all of the samples look sharp, the images generated by LAPGAN look noisier and do not have a very clear structure in them, whereas the images generated by variational models trained with the L2 cost function have a watercolor effect. As for the quantitative comparison of the different models, we first compare the performance of the models trained with variational methods. We rank the images in the test set conditioned on the captions based on the variational lower bound of the likelihood and then report the Precision-Recall metric to evaluate the quality of the generative model. As we expected, the quality of image retrieval using generative models is worse than that of discriminative models that were specifically built for retrieval. To deal with the large computational complexity of looping through each test image, we create a shortlist of one hundred images, including the correct one, based on the convolutional features of a VGG-like model trained on the CIFAR dataset. Since there are "easy" images to which the model assigns high likelihood independent of the query caption, we look at the ratio of the likelihood of the image conditioned on the sentence to the likelihood of the image conditioned on the mean sentence representation in the training set. We found that the reconstruction error L_x increased for the sharpened images, which considerably hurt the retrieval results. Since sharpening changes the statistics of the images, computing the reconstruction error for each pixel is not necessarily a good metric. Instead of calculating the error per pixel, we turn to a smarter metric, the Structural Similarity Index (SSI), which incorporates luminance and contrast masking into the error calculation. Due to the strong inter-dependencies between nearby pixels, the metric is calculated on small windows of the image. To calculate SSI, we sampled fifty images from the prior of each generative model for each caption in the test set.
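For reference, the windowed SSI computation can be sketched in a few lines. This is an illustration rather than the exact evaluation script: it relies on scikit-image's structural_similarity, and the helper name mean_ssi, its arguments and the [0, 1] scaling are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssi(samples, reference, win_size=7):
    """Average windowed SSIM between generated samples and a reference image.

    samples:   array of shape (N, H, W), grayscale images scaled to [0, 1]
    reference: array of shape (H, W)
    The score is computed over small sliding windows, which accounts for
    the strong dependencies between nearby pixels.
    """
    scores = [structural_similarity(s, reference,
                                    win_size=win_size, data_range=1.0)
              for s in samples]
    return float(np.mean(scores))
```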

COCO (before sharpening)

                            Image Search                      Image Similarity
Model                       R@1   R@5   R@10  R@50  Med r     SSI
LAPGAN                      -     -     -     -     -         0.08
Fully-conn. VAE (L2 cost)   1.0   5.6   10.4  51.1  51        0.156
Conv. VAE (L2 cost)         1.0   5.9   10.9  50.8  50        0.164
Skipthought DRAW            2.0   11.2  18.9  63.3  36        0.157
Noalign DRAW                2.8   14.1  23.1  68.0  31        0.155
Align DRAW                  3.0   14.0  22.9  68.5  31        0.156

3.5 Summary

In this chapter, we discussed two attention-based neural networks that tackle the image and caption generation problems. Empirically, our attention-based approaches obtain state-of-the-art performance on standard vision benchmark datasets. We also showed how the learned attention can be exploited to give more interpretability into the model's generation process, and demonstrated that the learned alignments correspond very well to human intuition.

Chapter 4

Stabilizing RNN training with layer normalization

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this chapter, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.

4.1 Motivation

Deep neural networks trained with some version of stochastic gradient descent have been shown to substantially outperform previous approaches on various supervised learning tasks in computer vision [Krizhevsky et al., 2012e] and speech processing [Hinton et al., 2012b]. But state-of-the-art deep neural networks often require many days of training. It is possible to speed up learning by computing gradients for different subsets of the training cases on different machines, or by splitting the neural network itself over many machines [Dean et al., 2012], but this can require a lot of communication and complex software. It also tends to lead to rapidly diminishing returns as the degree of parallelization increases. An orthogonal approach is to modify the computations performed in the forward pass of the neural net to make learning easier. Recently, batch normalization [Ioffe and Szegedy, 2015b] has been proposed to reduce training time by including additional normalization stages in deep neural networks.

The normalization standardizes each summed input using its mean and its standard deviation across the training data. Feedforward neural networks trained using batch normalization converge faster even with simple SGD. In addition to the improvement in training time, the stochasticity from the batch statistics serves as a regularizer during training. Despite its simplicity, batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence, so applying batch normalization to RNNs appears to require different statistics for different time-steps. Furthermore, batch normalization cannot be applied to online learning tasks or to extremely large distributed models where the minibatches have to be small. This chapter introduces layer normalization, a simple normalization method to improve the training speed of various neural network models. Unlike batch normalization, the proposed method directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer, so the normalization does not introduce any new dependencies between training cases. We show that layer normalization works well for RNNs and improves both the training time and the generalization performance of several existing RNN models.

4.2 Batch and weight normalization

Consider the d-th hidden layer in a deep feed-forward neural network, and let z_d be the vector representation of the summed inputs to the neurons in that layer. The summed inputs are computed through a linear projection with the weight matrix W_d and the bottom-up inputs a_d as follows:

z_{d,i} = w_{d,i}^\top a_d, \qquad a_{d+1,i} = f(z_{d,i} + b_{d,i})   (4.1)

where f(\cdot) is an element-wise non-linear function, w_{d,i} is the vector of incoming weights to the i-th hidden unit in the d-th layer and b_{d,i} is its scalar bias parameter. The parameters in the neural network are learnt using gradient-based optimization algorithms, with the gradients computed by back-propagation. One of the challenges of deep learning is that the gradients with respect to the weights in one layer are highly dependent on the outputs of the neurons in the previous layer, especially if these outputs change in a highly correlated way. Batch normalization [Ioffe and Szegedy, 2015b] was proposed to reduce such undesirable "covariate shift". The method normalizes the summed inputs to each hidden unit over the training cases. Specifically, for the i-th summed input in the d-th layer, the batch normalization method rescales the summed inputs according to their variances under the distribution of the data:

\bar{z}_{d,i} = \frac{g_{d,i}}{\sigma_{d,i}} (z_{d,i} - \mu_{d,i}), \qquad \mu_{d,i} = \mathbb{E}_{x \sim p(x)}[z_{d,i}], \qquad \sigma_{d,i} = \sqrt{\mathbb{E}_{x \sim p(x)}\left[(z_{d,i} - \mu_{d,i})^2\right]}   (4.2)

where \bar{z}_{d,i} is the normalized summed input to the i-th hidden unit in the d-th layer and g_{d,i} is a gain parameter scaling the normalized activation before the non-linear activation function. Note that the expectation is under the whole training data distribution. It is typically impractical to compute the expectations in Eq. (4.2) exactly, since it would require forward passes through the whole training dataset with the current set of weights. Instead, \mu and \sigma are estimated using the empirical samples from the current mini-batch. This puts constraints on the size of a mini-batch, and it is hard to apply to recurrent neural networks.
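For concreteness, a minimal numpy sketch of the mini-batch estimate of Eq. (4.2); the epsilon term is an added numerical-stability assumption rather than part of the equation.

```python
import numpy as np

def batch_norm(Z, g, b, eps=1e-5):
    """Batch-normalize summed inputs Z of shape (batch_size, H).

    mu and sigma are per-unit mini-batch estimates of the expectations
    in Eq. (4.2); g and b are the per-unit gain and bias.
    """
    mu = Z.mean(axis=0)                                  # mean over the batch
    sigma = np.sqrt(((Z - mu) ** 2).mean(axis=0) + eps)  # std over the batch
    return g * (Z - mu) / sigma + b
```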

4.3 Layer normalization

We now consider the layer normalization method, which is designed to overcome the drawbacks of batch normalization. Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot. This suggests the "covariate shift" problem can be reduced by fixing the mean and the variance of the summed inputs within each layer. We thus compute the layer normalization statistics over all the hidden units in the same layer as follows:

\mu_d = \frac{1}{H} \sum_{i=1}^{H} z_{d,i}, \qquad \sigma_d = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (z_{d,i} - \mu_d)^2}   (4.3)

where H denotes the number of hidden units in a layer. The difference between Eq. (4.2) and Eq. (4.3) is that under layer normalization, all the hidden units in a layer share the same normalization terms \mu and \sigma, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of a mini-batch and it can be used in the pure online regime with batch size 1.
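The corresponding computation for a single training case is equally simple. A minimal numpy sketch of Eq. (4.3) together with the adaptive gain and bias; the epsilon term is again an assumed numerical-stability constant.

```python
import numpy as np

def layer_norm(z, g, b, eps=1e-5):
    """Layer-normalize the summed inputs z of one training case, shape (H,).

    The statistics are computed across the H hidden units of the layer,
    so no other training case is involved.
    """
    mu = z.mean()
    sigma = np.sqrt(((z - mu) ** 2).mean() + eps)
    return g * (z - mu) / sigma + b
```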

4.3.1 Layer normalized recurrent neural networks

The recent sequence-to-sequence models [Sutskever et al., 2014b] utilize compact recurrent neural networks to solve sequential prediction problems in natural language processing. It is common among NLP tasks to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps. In a standard RNN, the summed inputs in the recurrent layer are computed from the current input x_t and the previous vector of hidden states a_{t-1} as z_t = W_{rec} a_{t-1} + W_{in} x_t. The layer normalized recurrent layer re-centers and re-scales its activations using extra normalization terms similar to Eq. (4.3):

a_t = f\left[\frac{g}{\sigma_t} \odot (z_t - \mu_t) + b\right], \qquad \mu_t = \frac{1}{H} \sum_{i=1}^{H} z_{t,i}, \qquad \sigma_t = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (z_{t,i} - \mu_t)^2}   (4.4)

where W_{rec} is the recurrent hidden-to-hidden weight matrix and W_{in} is the bottom-up input-to-hidden weight matrix, \odot is the element-wise multiplication between two vectors, and b and g are the bias and gain parameters of the same dimension as a_t. In a standard RNN, there is a tendency for the average magnitude of the summed inputs to the recurrent units to either grow or shrink at every time-step, leading to exploding or vanishing gradients. In a layer normalized RNN, the normalization terms make it invariant to re-scaling all of the summed inputs to a layer, which results in much more stable hidden-to-hidden dynamics.
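A sketch of a single time step of Eq. (4.4), reusing the layer_norm helper above; the function name ln_rnn_step and the tanh default are illustrative assumptions.

```python
import numpy as np

def ln_rnn_step(a_prev, x_t, W_rec, W_in, g, b, f=np.tanh):
    """One recurrent step of a layer-normalized RNN (Eq. 4.4).

    The normalization statistics depend only on this time step's summed
    inputs z_t, and the same g and b are shared across all time steps.
    """
    z_t = W_rec @ a_prev + W_in @ x_t   # summed inputs at time t
    return f(layer_norm(z_t, g, b))
```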

Table 4.1: Invariance properties under the normalization methods.

              Weight matrix  Weight matrix  Weight vector  Dataset     Dataset       Single training case
              re-scaling     re-centering   re-scaling     re-scaling  re-centering  re-scaling
Batch norm    Invariant      No             Invariant      Invariant   Invariant     No
Weight norm   Invariant      No             Invariant      No          No            No
Layer norm    Invariant      Invariant      No             Invariant   No            Invariant

4.4 Related work

Batch normalization has been previously extended to recurrent neural networks [Laurent et al., 2015, Amodei et al., 2015, Cooijmans et al., 2016]. The previous work [Cooijmans et al., 2016] suggests the best performance of recurrent batch normalization is obtained by keeping independent normalization statistics for each time-step. The authors show that initializing the gain parameter in the recurrent batch normalization layer to 0.1 makes a significant difference in the final performance of the model. Our work is also related to weight normalization [Salimans and Kingma, 2016]. In weight normalization, instead of the variance, the L2 norm of the incoming weights is used to normalize the summed inputs to a neuron. Applying either weight normalization or batch normalization using expected statistics is equivalent to a different parameterization of the original feed-forward neural network. Re-parameterization in the ReLU network was studied in Path-normalized SGD [Neyshabur et al., 2015]. Our proposed layer normalization method, however, is not a re-parameterization of the original neural network. The layer normalized model thus has different invariance properties than the other methods, which we will study in the following section.

4.5 Analysis

In this section, we investigate the invariance properties of different normalization schemes.

4.5.1 Invariance under weights and data transformations

The proposed layer normalization is related to batch normalization and weight normalization. Although their normalization scalars are computed differently, these methods can all be summarized as normalizing the summed input z_i to a neuron through two scalars \mu and \sigma. They also learn an adaptive bias b and gain g for each neuron after the normalization:

a_i = f\left(\frac{g_i}{\sigma_i} (z_i - \mu_i) + b_i\right)   (4.5)

Note that for layer normalization and batch normalization, \mu and \sigma are computed according to Eqs. (4.2) and (4.3). In weight normalization, \mu is 0 and \sigma = \|w\|_2.

Table 4.1 summarizes the following invariance results for the three normalization methods.

Weight re-scaling and re-centering: First, observe that under batch normalization and weight normalization, any re-scaling of the incoming weights w_i of a single neuron has no effect on the normalized summed inputs to that neuron. To be precise, under batch and weight normalization, if the weight vector is scaled by \delta, the two scalars \mu and \sigma will also be scaled by \delta. The normalized summed inputs stay the same before and after scaling, so batch and weight normalization are invariant to re-scaling of the weights. Layer normalization, on the other hand, is not invariant to the individual scaling of single weight vectors. Instead, layer normalization is invariant to scaling of the entire weight matrix and to a shift of all the incoming weights in the weight matrix. Let there be two sets of model parameters \theta and \theta' whose weight matrices W and W' differ by a scaling factor \delta, with all of the incoming weights in W' also shifted by a constant vector \gamma, that is, W' = \delta W + \mathbf{1}\gamma^\top. Under layer normalization, the two models effectively compute the same output:

a' = f\left(\frac{g}{\sigma'} (W'x - \mu') + b\right) = f\left(\frac{g}{\sigma'} \big((\delta W + \mathbf{1}\gamma^\top)x - \mu'\big) + b\right) = f\left(\frac{g}{\sigma} (Wx - \mu) + b\right) = a.   (4.6)

Notice that if the normalization were only applied to the input before the weights, the model would not be invariant to re-scaling and re-centering of the weights.

Data re-scaling and re-centering: We can show that all the normalization methods are invariant to re-scaling the dataset by verifying that the summed inputs of the neurons stay constant under the change. Furthermore, layer normalization is invariant to re-scaling of individual training cases, because the normalization scalars \mu and \sigma in Eq. (4.3) only depend on the current input data. Let x' be a new data point obtained by re-scaling x by \delta. Then we have,

a_i' = f\left(\frac{g_i}{\sigma'} (w_i^\top x' - \mu') + b_i\right) = f\left(\frac{g_i}{\delta\sigma} (\delta w_i^\top x - \delta\mu) + b_i\right) = a_i.   (4.7)

It is easy to see that re-scaling individual data points does not change the model's prediction under layer normalization. Similarly to the re-centering of the weight matrix in layer normalization, we can also show that batch normalization is invariant to re-centering of the dataset.
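The weight-matrix invariance of Eq. (4.6) is easy to verify numerically with the layer_norm sketch above; the particular scale delta and shift gamma below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 8, 5
W = rng.normal(size=(H, D))
x = rng.normal(size=D)
g, b = np.ones(H), np.zeros(H)

delta = 3.7                                # scale the entire weight matrix
gamma = rng.normal(size=D)                 # shift all incoming weights
W_prime = delta * W + np.outer(np.ones(H), gamma)  # W' = delta W + 1 gamma^T

a = layer_norm(W @ x, g, b)
a_prime = layer_norm(W_prime @ x, g, b)
assert np.allclose(a, a_prime)             # same output, as in Eq. (4.6)
```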

4.5.2 Geometry of parameter space during learning

We have investigated the invariance of the model's prediction under re-centering and re-scaling of the parameters. Learning, however, can behave very differently under different parameterizations, even though the models express the same underlying function. In this section, we analyze learning behavior through the geometry and the manifold of the parameter space. We show that the normalization scalar \sigma can implicitly reduce the learning rate and make learning more stable.

Riemannian metric

The learnable parameters in a statistical model form a smooth manifold that consists of all possible input-output relations of the model. For models whose output is a probability distribution, a natural way to measure the separation of two points on this manifold is the Kullback-Leibler divergence between their model output distributions. Under the KL divergence metric, the parameter space is a Riemannian manifold. The curvature of a Riemannian manifold is entirely captured by its Riemannian metric, whose quadratic form is denoted ds^2; this is the infinitesimal distance in the tangent space at a point in the parameter space. Intuitively, it measures the change in the model output along a tangent direction of the parameter space. The Riemannian metric under KL was previously studied [Amari, 1998] and was shown to be well approximated under a second-order Taylor expansion using the Fisher information matrix:

ds^2 = D_{KL}\big[p(y \mid x; \theta)\,\|\,p(y \mid x; \theta + \delta)\big] \approx \frac{1}{2}\,\delta^\top F(\theta)\,\delta,   (4.8)

F(\theta) = \mathbb{E}_{x \sim p(x),\, y \sim p(y \mid x)}\left[\frac{\partial \log p(y \mid x; \theta)}{\partial \theta} \left(\frac{\partial \log p(y \mid x; \theta)}{\partial \theta}\right)^\top\right],   (4.9)

where \delta is a small change to the parameters. The Riemannian metric above presents a geometric view of parameter spaces. The following analysis of the Riemannian metric provides some insight into how normalization methods could help in training neural networks.
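The approximation in Eq. (4.8) is the standard second-order expansion of the KL divergence; a worked version of the argument, assuming the usual regularity conditions for exchanging differentiation and expectation, is:

```latex
\begin{aligned}
D_{\mathrm{KL}}\big[p(y \mid x;\theta)\,\|\,p(y \mid x;\theta+\delta)\big]
 &= -\,\mathbb{E}_{y \sim p(y \mid x;\theta)}\big[\log p(y \mid x;\theta+\delta) - \log p(y \mid x;\theta)\big] \\
 &\approx -\,\mathbb{E}_{y}\big[\delta^{\top}\nabla_{\theta}\log p(y \mid x;\theta)\big]
    - \tfrac{1}{2}\,\delta^{\top}\,\mathbb{E}_{y}\big[\nabla_{\theta}^{2}\log p(y \mid x;\theta)\big]\,\delta \\
 &= \tfrac{1}{2}\,\delta^{\top} F(\theta)\,\delta ,
\end{aligned}
```

where the first-order term vanishes because \mathbb{E}_y[\nabla_\theta \log p] = 0, and F(\theta) = -\mathbb{E}_y[\nabla^2_\theta \log p] = \mathbb{E}_y[\nabla_\theta \log p\, \nabla_\theta \log p^\top].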

The geometry of normalized generalized linear models

We focus our geometric analysis on the generalized linear model. The results of the following analysis can be easily applied to understand deep neural networks with a block-diagonal approximation to the Fisher information matrix, where each block corresponds to the parameters of a single neuron.

A generalized linear model (GLM) can be regarded as parameterizing an output distribution from the exponential family using a weight vector w and a bias scalar b. To be consistent with the previous sections, the log likelihood of the GLM can be written using the summed input a as follows:

\log p(y \mid x; w, b) = \frac{(a + b)y - \eta(a + b)}{\phi} + c(y, \phi),   (4.10)

\mathbb{E}[y \mid x] = f(a + b) = f(w^\top x + b), \qquad \mathrm{Var}[y \mid x] = \phi f'(a + b),   (4.11)

where f(\cdot) is the transfer function that is the analog of the non-linearity in neural networks, f'(\cdot) is the derivative of the transfer function, \eta(\cdot) is a real-valued function and c(\cdot) is the log partition function. \phi is a constant that scales the output variance. Assume an H-dimensional output vector y = [y_1, y_2, \ldots, y_H] is modeled using H independent GLMs, so that \log p(y \mid x; W, b) = \sum_{i=1}^{H} \log p(y_i \mid x; w_i, b_i). Let W be the weight matrix whose rows are the weight vectors of the individual GLMs, let b denote the bias vector of length H and let \mathrm{vec}(\cdot) denote the Kronecker vector operator. The Fisher information matrix for the multi-dimensional GLM with respect to its parameters \theta = [w_1^\top, b_1, \ldots, w_H^\top, b_H]^\top = \mathrm{vec}([W, b]^\top) is simply the expected Kronecker product of the data features and the output covariance matrix:

F(\theta) = \mathbb{E}_{x \sim P(x)}\left[\frac{\mathrm{Cov}[y \mid x]}{\phi^2} \otimes \begin{bmatrix} x x^\top & x \\ x^\top & 1 \end{bmatrix}\right].   (4.12)

We obtain normalized GLMs by applying the normalization methods to the summed inputs a in the original model through \mu and \sigma. Without loss of generality, we denote \bar{F} as the Fisher information matrix under the normalized multi-dimensional GLM with the additional gain parameters \theta = \mathrm{vec}([W, b, g]^\top):

\bar{F}(\theta) = \begin{bmatrix} \bar{F}_{11} & \cdots & \bar{F}_{1H} \\ \vdots & \ddots & \vdots \\ \bar{F}_{H1} & \cdots & \bar{F}_{HH} \end{bmatrix}, \qquad
\bar{F}_{ij} = \mathbb{E}_{x \sim P(x)}\left[\frac{\mathrm{Cov}[y_i, y_j \mid x]}{\phi^2}
\begin{bmatrix}
\frac{g_i g_j}{\sigma_i \sigma_j} \chi_i \chi_j^\top & \frac{g_i}{\sigma_i} \chi_i & \frac{g_i (a_j - \mu_j)}{\sigma_i \sigma_j} \chi_i \\
\frac{g_j}{\sigma_j} \chi_j^\top & 1 & \frac{a_j - \mu_j}{\sigma_j} \\
\frac{g_j (a_i - \mu_i)}{\sigma_i \sigma_j} \chi_j^\top & \frac{a_i - \mu_i}{\sigma_i} & \frac{(a_i - \mu_i)(a_j - \mu_j)}{\sigma_i \sigma_j}
\end{bmatrix}\right],   (4.13)

\chi_i = x - \frac{\partial \mu_i}{\partial w_i} - \frac{a_i - \mu_i}{\sigma_i} \frac{\partial \sigma_i}{\partial w_i}.   (4.14)

Implicit learning rate reduction through the growth of the weight vector: Notice that, compared to the standard GLM, the block \bar{F}_{ij} along the weight vector w_i direction is scaled by the gain parameters and the normalization scalar \sigma_i. If the norm of the weight vector w_i grows twice as large, even though the model's output remains the same, the Fisher information matrix will be different. The curvature along the w_i direction will change by a factor of 1/2 because \sigma_i will also be twice as large. As a result, for the same parameter update in the normalized model, the norm of the weight vector effectively controls the learning rate for that weight vector. During learning, it is harder to change the orientation of a weight vector with large norm. The normalization methods therefore have an implicit "early stopping" effect on the weight vectors and help to stabilize learning towards convergence.

Learning the magnitude of incoming weights: In normalized models, the magnitude of the incoming weights is explicitly parameterized by the gain parameters. We compare how the model output changes between updating the gain parameters in the normalized GLM and updating the magnitude of the equivalent weights under the original parameterization during learning. The direction along the gain parameters in \bar{F} captures the geometry for the magnitude of the incoming weights. We show that the Riemannian metric along the magnitude of the incoming weights for the standard GLM is scaled by the norm of its input, whereas learning the gain parameters for the batch normalized and layer normalized models depends only on the magnitude of the prediction error. Learning the magnitude of incoming weights in the normalized model is therefore more robust to the scaling of the input and its parameters than in the standard model.

4.6 Experimental results

We perform experiments with layer normalization on six tasks, with a focus on recurrent neural networks: image-sentence ranking, question-answering, contextual language modelling, generative modelling, handwriting sequence generation and MNIST classification. Unless otherwise noted, the default initialization of layer normalization in the experiments is to set the adaptive gains to 1 and the biases to 0.

4.6.1 Order embeddings of images and language

In this experiment, we apply layer normalization to the recently proposed order-embeddings model of Vendrov et al. [2016] for learning a joint embedding space of images and sentences. We follow the same experimental protocol as Vendrov et al. [2016] and modify their publicly available code1, which utilizes Theano [Team et al., 2016], to incorporate layer normalization. Images and sentences from the Microsoft

1https://github.com/ivendrov/order-embedding

Figure 4.1: Recall@K curves on the image retrieval validation set using order-embeddings with and without layer normalization. (a) Recall@1, (b) Recall@5, (c) Recall@10.

MSCOCO

                             Caption Retrieval              Image Retrieval
Model                        R@1   R@5   R@10  Mean r      R@1   R@5   R@10  Mean r
Sym [Vendrov et al., 2016]   45.4  -     88.7  5.8         36.3  -     85.8  9.0
OE [Vendrov et al., 2016]    46.7  -     88.9  5.7         37.9  -     85.9  8.1
OE (ours)                    46.6  79.3  89.1  5.2         37.8  73.6  85.7  7.9
OE + LN                      48.5  80.6  89.8  5.1         38.9  74.3  86.3  7.6

Table 4.2: Average results across 5 test splits for caption and image retrieval. R@K is Recall@K (high is good). Mean r is the mean rank (low is good). Sym corresponds to the symmetric baseline while OE indicates order-embeddings.

COCO dataset [Lin et al., 2014a] are embedded into a common vector space, where a GRU [Cho et al., 2014a] is used to encode sentences and the outputs of a pre-trained VGG ConvNet [Simonyan and Zisserman, 2014b] (10-crop) are used to encode images. The order-embedding model represents images and sentences as a 2-level partial ordering and replaces the cosine similarity scoring function used in Kiros et al. [2014c] with an asymmetric one.

We trained two models: the baseline order-embedding model as well as the same model with layer normalization applied to the GRU. After every 300 iterations, we compute Recall@K (R@K) values on a held out validation set and save the model whenever R@K improves. The best performing models are then evaluated on 5 separate test sets, each containing 1000 images and 5000 captions, for which the mean results are reported. Both models use Adam [Kingma and Ba, 2014a] with the same initial hyperparameters and both models are trained using the same architectural choices as used in Vendrov et al. [2016].

Figure 4.1 illustrates the validation curves of the models, with and without layer normalization. We plot R@1, R@5 and R@10 for the image retrieval task. We observe that layer normalization offers a per-iteration speedup across all metrics and converges to its best validation model in 60% of the time it takes the baseline model to do so. In Table 4.2, the test set results are reported, from which we observe that layer normalization also results in improved generalization over the original model. The results we report are state-of-the-art for RNN embedding models, with only the structure-preserving model of Wang et al. [2016] reporting better results on this task. However, they evaluate under different conditions (1 test set instead of the mean over 5) and are thus not directly comparable.


Figure 4.2: Validation error rate against training steps (in thousands) for the attentive reader model, comparing LSTM, BN-LSTM, BN-everywhere and LN-LSTM. BN results are taken from [Cooijmans et al., 2016].

4.6.2 Teaching machines to read and comprehend

In order to compare layer normalization to the recently proposed recurrent batch normalization [Cooijmans et al., 2016], we train a unidirectional attentive reader model on the CNN corpus, both introduced by Hermann et al. [2015]. This is a question-answering task where a query description about a passage must be answered by filling in a blank. The data is anonymized such that entities are given randomized tokens to prevent degenerate solutions, and these tokens are consistently permuted during training and evaluation. We follow the same experimental protocol as Cooijmans et al. [2016] and modify their public code2, which uses Theano [Team et al., 2016], to incorporate layer normalization. We obtained the pre-processed dataset used by Cooijmans et al. [2016], which differs from the original experiments of Hermann et al. [2015] in that each passage is limited to 4 sentences. In Cooijmans et al. [2016], two variants of recurrent batch normalization are used: one where BN is only applied to the LSTM and another that applies BN everywhere throughout the model. In our experiment, we only apply layer normalization within the LSTM. The results of this experiment are shown in Figure 4.2. We observe that layer normalization not only trains faster but converges to a better validation result than both the baseline and the BN variants. In Cooijmans et al. [2016], it is argued that the scale parameter in BN must be carefully chosen and it is set to 0.1 in their experiments. We experimented with layer normalization for both 1.0 and 0.1 scale initialization and found that the former performed significantly better. This demonstrates that layer normalization is not sensitive to the initial scale in the same way that recurrent BN is.3

4.6.3 Skip-thought vectors

Skip-thoughts [Kiros et al., 2015] is a generalization of the skip-gram model [Mikolov et al., 2013] for learning unsupervised distributed sentence representations. Given contiguous text, a sentence is encoded with an encoder RNN, and decoder RNNs are used to predict the surrounding sentences. Kiros et al. [2015] showed that this model could produce generic sentence representations that perform well on several tasks without being fine-tuned. However, training this model is time-consuming, requiring several days of training in order to produce meaningful results.

2https://github.com/cooijmanstim/Attentive_reader/tree/bn
3We only produce results on the validation set, as in the case of Cooijmans et al. [2016].


Figure 4.3: Performance of skip-thought vectors with and without layer normalization on downstream tasks as a function of training iterations: (a) SICK(r), (b) SICK(MSE), (c) MR, (d) CR, (e) SUBJ, (f) MPQA. The original lines are the reported results in [Kiros et al., 2015]. Plots with error bars use 10-fold cross validation. Best seen in color.

Method                         SICK(r)  SICK(ρ)  SICK(MSE)  MR    CR    SUBJ  MPQA
Original [Kiros et al., 2015]  0.848    0.778    0.287      75.5  79.3  92.1  86.9
Ours                           0.842    0.767    0.298      77.3  81.8  92.6  87.9
Ours + LN                      0.854    0.785    0.277      79.5  82.6  93.4  89.0
Ours + LN †                    0.858    0.788    0.270      79.4  83.1  93.7  89.3

Table 4.3: Skip-thoughts results. The first two evaluation columns indicate Pearson and Spearman correlation, the third is mean squared error and the remaining columns indicate classification accuracy. Higher is better for all evaluations except MSE. Our models were trained for 1M iterations, with the exception of (†), which was trained for 1 month (approximately 1.7M iterations).

In this experiment we determine to what extent layer normalization can speed up training. Using the publicly available code of Kiros et al. [2015]4, we train two models on the BookCorpus dataset [Zhu et al., 2015]: one with and one without layer normalization. These experiments are performed with Theano [Team et al., 2016]. We adhere to the experimental setup used in Kiros et al. [2015], training a 2400-dimensional sentence encoder with the same hyperparameters. Given the size of the states used, it is conceivable that layer normalization would produce slower per-iteration updates than without. However, we found that, provided CNMeM5 is used, there was no significant difference between the two models. We checkpoint both models after every 50,000 iterations and evaluate their performance on five tasks: semantic relatedness (SICK) [Marelli et al., 2014], movie review sentiment (MR) [Pang and Lee, 2005],

customer product reviews (CR) [Hu and Liu, 2004], subjectivity/objectivity classification (SUBJ) [Pang and Lee, 2004] and opinion polarity (MPQA) [Wiebe et al., 2005]. We plot the performance of both models at each checkpoint on all tasks to determine whether the performance can be improved with LN. The experimental results are illustrated in Figure 4.3. We observe that applying layer normalization results both in a speedup over the baseline as well as better final results after 1M iterations, as shown in Table 4.3. We also let the model with layer normalization train for a total of a month, resulting in further performance gains across all but one task. We note that the performance differences between the originally reported results and ours are likely due to the fact that the publicly available code does not condition at each timestep of the decoder, whereas the original model does.

4https://github.com/ryankiros/skip-thoughts
5https://github.com/NVIDIA/cnmem

4.6.4 Modeling binarized MNIST using DRAW

We also experimented with generative modeling on the MNIST dataset. The Deep Recurrent Attention Writer (DRAW) [Gregor et al., 2015a] had previously achieved state-of-the-art performance on modeling the distribution of MNIST digits. The model uses a differential attention mechanism and a recurrent neural network to sequentially generate pieces of an image. We evaluate the effect of layer normalization on a DRAW model using 64 glimpses and 256 LSTM hidden units. The model is trained with the default settings of the Adam [Kingma and Ba, 2014a] optimizer and a minibatch size of 128. Previous publications on binarized MNIST have used various training protocols to generate their datasets. In this experiment, we used the fixed binarization from Larochelle and Murray [2011]. The dataset was split into 50,000 training, 10,000 validation and 10,000 test images. Figure 4.4 shows the test variational bound for the first 100 epochs. It highlights the speedup benefit of applying layer normalization: the layer normalized DRAW converges almost twice as fast as the baseline model. After 200 epochs, the baseline model converges to a variational log likelihood of 82.36 nats on the test data, whereas the layer normalization model obtains 82.09 nats.

Figure 4.4: DRAW model test negative log likelihood with and without layer normalization; curves are shown for the baseline, weight normalization (WN) and layer normalization (LN).

4.6.5 Handwriting sequence generation

The previous experiments mostly examine RNNs on NLP tasks whose lengths are in the range of 10 to 40. To show the effectiveness of layer normalization on longer sequences, we performed handwriting generation tasks using the IAM Online Handwriting Database [Liwicki and Bunke, 2005]. IAM-OnDB consists of handwritten lines collected from 221 different writers. Given an input character string, the goal is to predict a sequence of x and y pen co-ordinates of the corresponding handwriting line on the whiteboard. There are, in total, 12179 handwriting line sequences. The input string is typically more than 25 characters and the average handwriting line has a length of around 700. We used the same model architecture as in Section (5.2) of Graves [2013b]. The model architecture consists of three hidden layers of 400 LSTM cells, which produce 20 bivariate Gaussian mixture components at the output layer, and a size 3 input layer. The character sequence was encoded with one-hot vectors, and hence the window vectors were size 57. A mixture of 10 Gaussian functions was used for the window parameters, requiring a size 30 parameter vector. The total number of weights was increased to approximately 3.7M. The model is trained using mini-batches of size 8 and the Adam [Kingma and Ba, 2014a] optimizer. The combination of small mini-batch size and very long sequences makes it important to have very stable hidden dynamics. Figure 4.5 shows that layer normalization converges to a comparable log likelihood as the baseline model but is much faster.


Figure 4.5: Handwriting sequence generation model negative log likelihood with and without layer normalization. The models are trained with mini-batch size of 8 and sequence length of 500.


Figure 4.6: Permutation invariant MNIST 784-1000-1000-10 model negative log likelihood and test error with layer normalization and batch normalization. (Left) The models are trained with a batch size of 128. (Right) The models are trained with a batch size of 4.

4.6.6 Permutation invariant MNIST

In addition to RNNs, we investigated layer normalization in feed-forward networks. We show how layer normalization compares with batch normalization on the well-studied permutation invariant MNIST classification problem. From the previous analysis, layer normalization is invariant to input re-scaling, which is desirable for the internal hidden layers. But this is unnecessary for the logit outputs, where the prediction confidence is determined by the scale of the logits. We therefore only apply layer normalization to the fully-connected hidden layers, excluding the final softmax layer.

All models were trained using 55,000 training data points and the Adam [Kingma and Ba, 2014a] optimizer. For the smaller batch size, the variance term for batch normalization is computed using the unbiased estimator. The experimental results in Figure 4.6 highlight that layer normalization is robust to the batch size and exhibits faster training convergence compared with batch normalization applied to all layers.

4.6.7 Convolutional networks

We have also experimented with convolutional neural networks. In our preliminary experiments, we observed that layer normalization offers a speedup over the baseline model without normalization, but batch normalization outperforms the other methods. With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction, and re-centering and re-scaling the summed inputs to a layer works well. However, the assumption of similar contributions no longer holds for convolutional neural networks. The large number of hidden units whose receptive fields lie near the boundary of the image are rarely turned on and thus have very different statistics from the rest of the hidden units within the same layer. We think further research is needed to make layer normalization work well in ConvNets.

4.7 Summary

In this chapter, we introduced layer normalization to speed up the training of neural networks. We provided a theoretical analysis that compared the invariance properties of layer normalization with batch normalization and weight normalization. We showed that layer normalization is invariant to per-training-case feature shifting and scaling. Empirically, we showed that recurrent neural networks benefit the most from the proposed method, especially for long sequences and small mini-batches.

Chapter 5

Self-attention to the recent past using fast weights

Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These “fast weights” can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.

5.1 Motivation

Ordinary recurrent neural networks typically have two types of memory that have very different time scales, very different capacities and very different computational roles. The history of the sequence currently being processed is stored in the hidden activity vector, which acts as a short-term memory that is updated at every time step. The capacity of this memory is O(H), where H is the number of hidden units. Long-term memory about how to convert the current input and hidden vectors into the next hidden vector and a predicted output vector is stored in the weight matrices connecting the hidden units to themselves and to the inputs and outputs. These matrices are typically updated at the end of a sequence and their capacity is O(H^2) + O(IH) + O(HO), where I and O are the numbers of input and output units. Long short-term memory networks [Hochreiter and Schmidhuber, 1997b] are a more complicated type of RNN that works better for discovering long-range structure in sequences, for two main reasons: First, they compute increments to the hidden activity vector at each time step rather than recomputing the full vector1. This encourages information in the hidden states to persist for much longer. Second, they allow the hidden activities to determine the states of gates that scale the effects of the weights. These multiplicative interactions allow the effective weights to be dynamically adjusted by the input or hidden

1This assumes the "remember gates" of the LSTM memory cells are set to one.

activities via the gates. However, LSTMs are still limited to a short-term memory capacity of O(H) for the history of the current sequence. Until recently, there was surprisingly little practical investigation of other forms of memory in recurrent nets, despite strong psychological evidence that it exists and obvious computational reasons why it is needed. There were occasional suggestions that neural networks could benefit from a third form of memory that has much higher storage capacity than the neural activities but much faster dynamics than the standard slow weights. This memory could store information specific to the history of the current sequence, so that this information is available to influence the ongoing processing without using up the memory capacity of the hidden activities. Hinton and Plaut [1987] suggested that fast weights could be used to allow true recursion in a neural network, and Schmidhuber [1993] pointed out that a system of this kind could be trained end-to-end using backpropagation, but neither of these papers actually implemented this method of achieving recursion.

5.2 Evidence from physiology that temporary memory may not be stored as neural activities

Processes like working memory, attention, and priming operate on timescales of 100 ms to minutes. This is simultaneously too slow to be mediated by neural activations without dynamical attractor states (10 ms timescale) and too fast for long-term synaptic plasticity mechanisms to kick in (minutes to hours). While artificial neural network research has typically focused on methods that maintain temporary state in activation dynamics, that focus may be inconsistent with evidence that the brain also, or perhaps primarily, maintains temporary state information through short-term synaptic plasticity mechanisms [Tsodyks et al., 1998, Abbott and Regehr, 2004, Barak and Tsodyks, 2007]. The brain implements a variety of short-term plasticity mechanisms that operate on intermediate timescales. For example, short-term facilitation is implemented by leftover [Ca2+] in the axon terminal after depolarization, while short-term depression is implemented by presynaptic neurotransmitter depletion [Zucker and Regehr, 2002]. Spike-time dependent plasticity can also be invoked on this timescale [Markram et al., 1997, Bi and Poo, 1998]. These plasticity mechanisms are all synapse-specific. Thus they are more accurately modeled by a memory with O(H^2) capacity than by the O(H) capacity of standard recurrent artificial neural nets and LSTMs.

5.3 Fast Associative Memory

One of the main preoccupations of neural network research in the 1970s and early 1980s [Willshaw et al., 1969, Kohonen, 1972, Anderson and Hinton, 1981, Hopfield, 1982] was the idea that memories were not stored by somehow keeping copies of patterns of neural activity. Instead, these patterns were reconstructed when needed from information stored in the weights of an associative network, and the very same weights could store many different memories. An auto-associative memory that has N^2 weights cannot be expected to store more than N real-valued vectors with N components each. How close we can come to this upper bound depends on which storage rule we use. Hopfield nets use a simple, one-shot, outer-product storage rule and achieve a capacity of approximately 0.15N binary vectors using weights that require log(N) bits each. Much more efficient use can be made of the weights by using an iterative, error-correction storage rule to learn weights that can retrieve each bit of a pattern from all the other bits [Gardner, 1988], but for our purposes maximizing the capacity is less important than having a simple, non-iterative storage rule, so we will use an outer product rule to store hidden activity vectors in fast weights that decay rapidly. The usual weights in an RNN will be called slow weights; they learn by stochastic gradient descent on an objective function that takes into account the fact that changes in the slow weights will lead to changes in what gets stored automatically in the fast associative memory.

Figure 5.1: The fast associative memory model.

A fast associative memory has several advantages when compared with the types of memory assumed by a Neural Turing Machine (NTM) [Graves et al., 2014], Neural Stack [Grefenstette et al., 2015], or Memory Network [Weston et al., 2014]. First, it is not at all clear how a real brain would implement the more exotic structures in these models, e.g., the tape of the NTM, whereas it is clear that the brain could implement a fast associative memory in synapses with the appropriate dynamics. Second, in a fast associative memory there is no need to decide where or when to write to memory and where or when to read from memory. The fast memory is updated all the time and the writes are all superimposed on the same fast-changing component of the strength of each synapse. Every time the input changes, there is a transition to a new hidden state which is determined by a combination of three sources of information: the new input via the slow input-to-hidden weights, C, the previous hidden state via the slow transition weights, W, and the recent history of hidden state vectors via the fast weights, A. The effect of the first two sources of information on the new hidden state can be computed once and then maintained as a sustained boundary condition for a brief iterative settling process which allows the fast weights to influence the new hidden state. Assuming that the fast weights decay exponentially, we now show that the effect of the fast weights on the hidden vector during an iterative settling phase is to provide an additional input that is proportional to the sum, over all recent hidden activity vectors, of the scalar product of each recent hidden vector with the current hidden activity vector, with each term in this sum weighted by the decay rate raised to the power of how long ago that hidden vector occurred. So fast weights act like a kind of attention to the recent past, but with the strength of the attention determined by the scalar product between the current hidden vector and the earlier hidden vector, rather than by a separate parameterized computation of the type used in neural machine translation models [Bahdanau et al., 2015b].

The update rule for the fast memory weight matrix, A, is simply to multiply the current fast weights by a decay rate, \lambda, and add the outer product of the hidden state vector, a(t), multiplied by a learning rate, \eta:

A(t) = \lambda A(t-1) + \eta\, a(t)\, a(t)^\top   (5.1)

The next vector of hidden activities, a(t+1), is computed in two steps. The "preliminary" vector a^0(t+1) is determined by the combined effects of the input vector x(t) and the previous hidden vector: a^0(t+1) = f(W a(t) + C x(t)), where W and C are slow weight matrices and f(\cdot) is the nonlinearity used by the hidden units. The preliminary vector is then used to initiate an "inner loop" iterative process which runs for S steps and progressively changes the hidden state into a(t+1) = a^S(t+1):

a^{s+1}(t+1) = f\big([W a(t) + C x(t)] + A(t)\, a^s(t+1)\big),   (5.2)

where the terms in square brackets are the sustained boundary conditions. In a real neural net, A could be implemented by rapidly changing synapses, but in a computer simulation that uses sequences with fewer time steps than the dimensionality of a, A will be of less than full rank and it is more efficient to compute the term A(t)\, a^s(t+1) without ever computing the full fast weight matrix, A. Assuming A is 0 at the beginning of the sequence,

A(t) = \eta \sum_{\tau=1}^{t} \lambda^{t-\tau}\, a(\tau)\, a(\tau)^\top   (5.3)

A(t)\, a^s(t+1) = \eta \sum_{\tau=1}^{t} \lambda^{t-\tau}\, a(\tau)\, \big[a(\tau)^\top a^s(t+1)\big]   (5.4)

The term in square brackets is just the scalar product of an earlier hidden state vector, a(τ), with the current hidden state vector, a^s(t+1), during the iterative inner loop. So at each iteration of the inner loop, the fast weight matrix is exactly equivalent to attending to past hidden vectors in proportion to their scalar product with the current hidden vector, weighted by a decay factor. During the inner loop iterations, attention will become more focussed on past hidden states that manage to attract the current hidden state. The equivalence between using a fast weight matrix and comparing with a set of stored hidden state vectors is very helpful for computer simulations. It allows us to explore what can be done with fast weights without incurring the huge penalty of having to abandon the use of mini-batches during training. At first sight, mini-batches cannot be used because the fast weight matrix is different for every sequence, but comparing with a set of stored hidden vectors does allow mini-batches.
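A minimal numpy sketch of one transition computed in this efficient form (Eqs. 5.1 to 5.4). The function name fast_weights_step, the ReLU nonlinearity and the convention that history already contains a(t) are illustrative assumptions, not the exact training code.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def fast_weights_step(a_t, x_t, W, C, history, lam=0.95, eta=0.5, S=1):
    """One outer-loop transition followed by S inner-loop settling steps.

    history holds the past hidden vectors a(1), ..., a(t) (including a_t),
    so A(t) a^s(t+1) in Eq. (5.4) is computed as decayed attention over
    the stored vectors without ever forming the fast weight matrix A.
    """
    H = a_t.shape[0]
    boundary = W @ a_t + C @ x_t            # sustained boundary condition
    past = np.stack(history) if history else np.zeros((0, H))
    t = past.shape[0]
    decay = eta * lam ** np.arange(t - 1, -1, -1)   # eta * lam^(t - tau)
    a_s = relu(boundary)                    # preliminary vector a^0(t+1)
    for _ in range(S):
        scores = past @ a_s                 # scalar products a(tau)^T a^s(t+1)
        a_s = relu(boundary + (decay * scores) @ past)
    return a_s
```

The caller appends each new hidden vector to history after the step, so the memory always reflects all hidden states up to and including the current one, and a mini-batch of sequences can be handled by batching the stored vectors.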

5.3.1 Layer normalized fast weights

A potential problem with fast associative memory is that the scalar product of two hidden vectors could vanish or explode depending on the norm of the hidden vectors. Recently, layer normalization [Ba et al., 2016] has been shown to be very effective at stabilizing the hidden state dynamics in RNNs and reducing training time. Layer normalization is applied to the vector of summed inputs to all the recurrent units at a particular time step. It uses the mean and variance of the components of this vector to re-center and re-scale those summed inputs. Then, before applying the nonlinearity, it includes a learned, neuron-specific bias and gain. We apply layer normalization to the fast associative memory as follows:

a^{s+1}(t+1) = f\big(\mathrm{LN}\big[W a(t) + C x(t) + A(t)\, a^s(t+1)\big]\big)   (5.5)

where \mathrm{LN}[\cdot] denotes layer normalization. We found that applying layer normalization at each iteration of the inner loop makes the fast associative memory more robust to the choice of learning rate and decay hyper-parameters. For the rest of the chapter, fast weight models are trained using layer normalization and the outer product learning rule with a fast learning rate of 0.5 and a decay rate of 0.95, unless otherwise noted.
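In the sketch above, the corresponding change for Eq. (5.5) is to normalize the full summed input on every inner-loop iteration; the helper below omits the learned gain and bias for brevity.

```python
def ln(v, eps=1e-5):
    """Zero-mean, unit-variance normalization across the hidden units."""
    mu = v.mean()
    return (v - mu) / np.sqrt(((v - mu) ** 2).mean() + eps)

# Inside fast_weights_step, each inner-loop update becomes (cf. Eq. 5.5):
#     a_s = relu(ln(boundary + (decay * scores) @ past))
```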

5.3.2 Implementing the fast weights "inner loop" in biological neural networks

We considered two different ways of performing this inner-loop settling. In method 1 (which is what we use), the inputs to the hidden units after an outer-loop transition using W are stored and provide sustained boundary conditions during the inner-loop settling. In method 2 (which is more biologically plausible), we simply add the identity matrix to the fast weight matrix so that the inner-loop settling tends to sustain the hidden activity vector. For ReLUs, these two methods are equivalent when the fast weight matrix is zero. They are similar but not exactly equivalent when the fast weights are non-zero. Using layer normalization, we found that method 1 worked slightly better than method 2, but method 2 would be much easier to implement in a biological network.

5.4 Experimental results

To demonstrate the effectiveness of the fast associative memory, we first investigated the problems of associative retrieval (section 5.4.1) and MNIST classification (section 5.4.2). We compared fast weight models to regular RNNs and LSTM variants. We then applied the proposed fast weights to a facial expression recognition task using a fast associative memory model to store the results of processing at one level while examining a sequence of details at a finer level (section 5.4.3). The hyper-parameters of the experiments were selected through grid search on the validation set. All the models were trained using mini-batches of size 128 and the Adam optimizer [Kingma and Ba, 2014a].

5.4.1 Associative retrieval

We start by demonstrating that the method we propose for storing and retrieving temporary memories works effectively on a toy task to which it is very well suited. Consider a task where multiple key-value pairs are presented in a sequence. At the end of the sequence, one of the keys is presented and the model must predict the value that was temporarily associated with the key. We used strings that contained characters from the English alphabet, together with the digits 0 to 9. To construct a training sequence, we first randomly sample a character from the alphabet without replacement. This is the first key. Then a single digit is sampled as the associated value for that key. After generating a sequence of K character-digit pairs, one of the K different characters is selected at random as the query and the network must

predict the associated digit. Some examples of such string sequences and their targets are shown below:

Input string    Target
c9k8j3f1??c     9
j0a5s5z2??a     5

where ‘?’ is the token to separate the query from the key-value pairs. We generated 100,000 training examples, 10,000 validation examples and 20,000 test examples. To solve this task, a standard RNN has to end up with hidden activities that somehow store all of the key-value pairs after the keys and values are presented sequentially. This makes it a significant challenge for models only using slow weights. We used a neural network with a single recurrent layer for this experiment. The recurrent network processes the input sequence one character at a time. The input character is first converted into a learned 100-dimensional embedding vector which then provides input to the recurrent layer.² The output of the recurrent layer at the end of the sequence is then processed by another hidden layer of 100 ReLUs before the final softmax layer. We augment the ReLU RNN with a fast associative memory and compare it to an LSTM model with the same architecture. Although the original LSTMs do not have explicit long-term storage capacity, recent work from Danihelka et al. [2016] extended LSTMs by adding a complex associative memory. In our experiments, we compared fast associative memory to both LSTM variants. Figure 5.2 and Table 5.1 show that when the number of recurrent units is small, the fast associative memory significantly outperforms the LSTMs with the same number of recurrent units. The result fits with our hypothesis that the fast associative memory allows the RNN to use its recurrent units more effectively. In addition to having higher retrieval accuracy, the model with fast weights also converges faster than the LSTM models.

Table 5.1: Classification error rate comparison on the associative retrieval task (R denotes the number of recurrent hidden units).

Model          R=20      R=50      R=100
IRNN           62.11%    60.23%    0.34%
LSTM           60.81%    1.85%     0%
A-LSTM         60.13%    1.62%     0%
Fast weights   1.81%     0%        0%

Figure 5.2: Comparison of the test log likelihood on the associative retrieval task with 50 recurrent hidden units.

²To make the architecture for this task more similar to the architecture for the next task, we first compute a 50-dimensional embedding vector and then expand this to a 100-dimensional embedding.
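As a concrete illustration of the string-generation procedure described above, the following sketch samples one training example. The function name and exact sampling details are our own assumptions.

import random
import string

def make_example(num_pairs=4):
    """Sample one associative retrieval string and its target digit."""
    keys = random.sample(string.ascii_lowercase, num_pairs)   # keys without replacement
    values = [random.choice(string.digits) for _ in keys]     # digit values may repeat
    query = random.choice(keys)
    sequence = "".join(k + v for k, v in zip(keys, values)) + "??" + query
    return sequence, values[keys.index(query)]

# e.g. make_example(4) could return ('c9k8j3f1??c', '9')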

5.4.2 Integrating glimpses in visual attention models

Despite their many successes, convolutional neural networks are computationally expensive and the representations they learn can be hard to interpret. The visual attention models introduced in Chapter 2 have been shown to overcome some of the limitations in ConvNets. One can understand what signals the algorithm is using by seeing where the model is looking. Also, the visual attention model is able to selectively focus on important parts of visual space and thus avoid any detailed processing of much of the background clutter. In this section, we show that visual attention models can use fast weights to store

information about object parts, though we use a very restricted set of glimpses that do not correspond to natural parts of the objects. Given an input image, a visual attention model computes a sequence of glimpses over regions of the image. The model not only has to determine where to look next, but also has to remember what it has seen so far in its working memory so that it can make the correct classification later. Visual attention models can learn to find multiple objects in a large static input image and classify them correctly, but the learnt glimpse policies are typically over-simplistic: they only use a single scale of glimpses and they tend to scan over the image in a rigid way. Human eye movements and fixations are far more complex. The ability to focus on different parts of a whole object at different scales allows humans to apply the very same knowledge in the weights of the network at many different scales, but it requires some form of temporary memory to allow the network to integrate what it discovered in a set of glimpses. Improving the model's ability to remember recent glimpses should help the visual attention model discover non-trivial glimpse policies. Because the fast weights can store all the glimpse information in the sequence, the hidden activity vector is freed up to learn how to intelligently integrate visual information and retrieve the appropriate memory content for the final classifier. To explicitly verify that a larger memory capacity is beneficial to visual attention-based models, we simplify the learning process in the following way: first, we provide a pre-defined glimpse control signal so the model knows where to attend, rather than having to learn the control policy through reinforcement learning. Second, we introduce an additional control signal to the memory cells so the attention model knows when to store the glimpse information. A typical visual attention model is complex and has high variance in its performance due to the need to learn the policy network and the classifier at the same time. Our simplified learning procedure enables us to discern the performance improvement contributed by using fast weights to remember the recent past. We consider a simple recurrent visual attention model that has a similar architecture to the RNN from the previous experiment. It does not predict where to attend but rather is given a fixed sequence of locations: the static input image is broken down recursively into four non-overlapping quadrants, at two scale levels. The four coarse regions, down-sampled to 7×7, along with their four 7×7 quadrants, are presented in a single sequence, as shown in Figure 5.1. Notice that the two glimpse scales form a two-level hierarchy in the visual space. In order to solve this task successfully, the attention model needs to integrate the glimpse information from different levels of the hierarchy. One solution is to use the model's hidden states to both store and integrate the glimpses of different scales. A much more efficient solution is to use a temporary “cache” to store any unfinished glimpse computation when processing the glimpses from a finer scale in the hierarchy.
Once the computation is finished at that scale, the results can be integrated with the partial results at the higher level by “popping” the previous result from the “cache”. Fast weights, therefore, can act as a neurally plausible “cache” for storing partial results. The slow weights of the same model can then specialize in integrating glimpses at the same scale. Because the slow weights are shared for all glimpse scales, the model should be able to store the partial results at several levels in the same set of fast weights, though we have only demonstrated the use of fast weights for storage at a single level. We used a single hidden layer recurrent neural network which takes a 100-dimensional embedding vector as its input. We compared the fast weights memory against three other RNN architectures: the IRNN, the standard LSTM and the associative LSTM. The non-recurrent slow weights are initialized from a uniform distribution between (−1/√H, 1/√H), where H is the number of outgoing weights.
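To make the fixed two-level glimpse policy concrete, here is a minimal NumPy sketch. The ordering of the coarse and fine glimpses, and hence the exact glimpse count, are our own assumptions and may differ from what was used in the experiments.

import numpy as np

def quadrants(img):
    """Split an image into its four non-overlapping quadrants (TL, TR, BL, BR)."""
    h, w = img.shape
    return [img[:h//2, :w//2], img[:h//2, w//2:], img[h//2:, :w//2], img[h//2:, w//2:]]

def downsample(img, size=7):
    """Average-pool an image down to size x size (assumes divisibility)."""
    h, w = img.shape
    return img.reshape(size, h // size, size, w // size).mean(axis=(1, 3))

def glimpse_sequence(image, size=7):
    """A sketch of the fixed two-level glimpse policy: each coarse quadrant
    (down-sampled to size x size) is followed by its four fine quadrants."""
    seq = []
    for coarse in quadrants(image):            # four 14x14 regions of a 28x28 image
        seq.append(downsample(coarse, size))   # coarse glimpse at the higher level
        seq.extend(quadrants(coarse))          # four fine 7x7 glimpses
    return seq                                 # 4 + 16 patches of size x size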

Figure 5.3: The multi-level fast associative memory model, showing the slow transition weights, the fast transition weights, the integration transition weights, and the point at which the fast weights are updated and the hidden state is wiped out.

Table 5.2: Classification error rates on MNIST.

Model          50 features   100 features   200 features
IRNN           12.95%        1.95%          1.42%
LSTM           12%           1.55%          1.10%
ConvNet        1.81%         1.00%          0.9%
Fast weights   7.21%         1.30%          0.85%

The slow weights learning rate is tuned using the 10,000 validation examples. The specific hyper-parameter settings for the models used in the experiments are the following. Fast weights: the fast weights learning rate, η, is set to 0.5 and the fast weights decay rate, λ, is set to 0.9. The fast weights are updated once at every time step; we experimented with more iterations of the “inner loop” and the performance was similar. The recurrent slow weights are initialized to an identity matrix scaled by 0.05. We use the ReLU activation for f(·) in the recurrent layer. IRNN: the recurrent slow weights are initialized to an identity matrix scaled by 0.5. ReLU is used as the non-linearity in the recurrent layer. Associative LSTM: we used 4 copies of memory cells for the associative LSTM, with 3 read-write heads used for storage and retrieval memory access. We evaluated the multi-level visual attention model on the MNIST handwritten digit dataset. MNIST is a well-studied problem on which many other techniques have been benchmarked. It contains the ten classes of handwritten digits, ranging from 0 to 9. The task is to predict the class label of an isolated and roughly normalized 28×28 image of a digit. The glimpse sequence, in this case, consists of 24 patches of 7×7 pixels. Table 5.2 compares classification results for a ReLU RNN with a multi-level fast associative memory against an LSTM that gets the same sequence of glimpses. Again the results show that when the number of hidden units is limited, fast weights give a significant improvement over the other models.

Figure 5.4: Examples of the near frontal faces from the MultiPIE dataset.

Table 5.3: Classification accuracy comparison on the facial expression recognition task.

Model           IRNN    LSTM    ConvNet   Fast weights
Test accuracy   81.11   81.32   88.23     86.34

As we increase the memory capacities, the multi-level fast associative memory consistently outperforms the LSTM in classification accuracy. Unlike models that must integrate a sequence of glimpses, convolutional neural networks process all the glimpses in parallel and use layers of hidden units to hold all their intermediate computational results. We further demonstrate the effectiveness of the fast weights by comparing them to a three-layer convolutional neural network that uses the same patches as the glimpses presented to the visual attention model. From Table 5.2, we see that the multi-level model with fast weights reaches performance very similar to the ConvNet model's, without requiring any biologically implausible weight sharing.

5.4.3 Facial expression recognition

To further investigate the benefits of using fast weights in the multi-level visual attention model, we performed facial expression recognition tasks on the CMU Multi-PIE face database [Gross et al., 2010]. The dataset was preprocessed to align each face by the eye and nose fiducial points, and was downsampled to 48×48 greyscale. The full dataset contains 15 photos taken from cameras with different viewpoints for each illumination × expression × identity × session condition. We used only the images taken from the three central cameras, corresponding to −15°, 0°, 15° views, since facial expressions were not discernible from the more extreme viewpoints. The resulting dataset contained over 100,000 images. 317 identities appeared in the training set, with the remaining 20 identities in the test set. Given the input face image, the goal is to classify the subject's facial expression into one of six categories: neutral, smile, surprise, squint, disgust and scream. The task is more realistic and challenging than the previous MNIST experiments: not only does the dataset have unbalanced numbers of labels, but some of the expressions, for example squint and disgust, are very hard to distinguish. In order to perform well on this task, the models need to generalize over different lighting conditions and viewpoints. We used the same multi-level attention model as in the MNIST experiments, with 200 recurrent hidden units. The model sequentially attends to non-overlapping 12×12 pixel patches at two different scales and there are, in total, 24 glimpses. Similarly, we designed a two-layer ConvNet that has 12×12 receptive fields. From Table 5.3, we see that the multi-level fast weights model that knows when to store information outperforms the LSTM and the IRNN. The results are consistent with the previous MNIST experiments.


Figure 5.5: (a) Sample screen from the game “Catch”. (b) Performance curves (average reward vs. steps) for Catch with N = 16, M = 3. (c) Performance curves for Catch with N = 24, M = 5.

However, the ConvNet is able to perform better than the multi-level attention model on this near-frontal face dataset. We think the efficient weight-sharing and architectural engineering in the ConvNet, combined with the simultaneous availability of all the information at each level of processing, allows the ConvNet to generalize better on this task. Our use of a rigid and predetermined policy for where to glimpse eliminates one of the main potential advantages of the multi-level attention model: it can process informative details at high resolution whilst ignoring most of the irrelevant details. To realize this advantage we will need to combine the use of fast weights with the learning of complicated policies.

5.4.4 Agents with memory

While different kinds of memory and attention have been studied extensively in the supervised learning setting [Graves, 2014, Mnih et al., 2014a, Bahdanau et al., 2015b], the use of such models for learning long-range dependencies in reinforcement learning has received less attention. We compare different memory architectures on a partially observable variant of the game “Catch” described in [Mnih et al., 2014a]. The game is played on an N×N screen of binary pixels and each episode consists of N frames. Each trial begins with a single pixel, representing a ball, appearing somewhere in the first row, and a two-pixel “paddle” controlled by the agent in the bottom row. After observing a frame, the agent gets to either keep the paddle stationary or move it right or left by one pixel. The ball descends by a single pixel after each frame. The episode ends when the ball pixel reaches the bottom row, and the agent receives a reward of +1 if the paddle touches the ball and a reward of −1 if it doesn't. Solving the fully observable task is straightforward and requires the agent to move the paddle to the column with the ball. We make the task partially observable by providing the agent blank observations after the Mth frame. Solving the partially observable version of the game requires remembering the positions of the paddle and ball after M frames and moving the paddle to the correct position using the stored information. All agents used recurrent networks to represent the policy. At each time step the input was passed through a hidden layer with 128 ReLU units and then passed to the recurrent core. All agents used 128 recurrent cells. The output at every step was a softmax over the valid actions and a single linear output for the estimate of the value function. We used random search to find hyperparameter values for the learning rate, the number of Hebbian steps, and the fast weight learning rate and decay where applicable.
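For concreteness, a minimal NumPy sketch of the partially observable Catch environment described above is given below. The interface (reset/step) and the exact timing details are our own assumptions, not the environment code used in the experiments.

import numpy as np

class PartiallyObservableCatch:
    """N x N binary screen; the ball falls one row per frame; observations go
    blank after the M-th frame; terminal reward is +1 or -1 (a sketch)."""

    def __init__(self, n=16, m=3, seed=None):
        self.n, self.m = n, m
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.ball = [0, int(self.rng.integers(self.n))]    # ball starts in first row
        self.paddle = int(self.rng.integers(self.n - 1))   # left pixel of the paddle
        return self._observe()

    def step(self, action):
        # action in {-1, 0, +1}: move the paddle left, keep it still, or move right.
        self.paddle = int(np.clip(self.paddle + action, 0, self.n - 2))
        self.ball[0] += 1                                  # ball descends one pixel
        self.t += 1
        done = self.ball[0] == self.n - 1
        reward = 0.0
        if done:                                           # +1 if caught, -1 otherwise
            reward = 1.0 if self.paddle <= self.ball[1] <= self.paddle + 1 else -1.0
        return self._observe(), reward, done

    def _observe(self):
        screen = np.zeros((self.n, self.n), dtype=np.float32)
        if self.t < self.m:                                # blank after the M-th frame
            screen[self.ball[0], self.ball[1]] = 1.0
            screen[self.n - 1, self.paddle:self.paddle + 2] = 1.0
        return screen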

We averaged results over the top 5 models. We used the recently proposed asynchronous advantage actor-critic method [Mnih et al., 2016] to train agents with three types of memory on different sizes of the partially observable Catch task. The three agents included a ReLU RNN, an LSTM, and a fast weights RNN. Figure 5.5 shows the learning progress of the different agents on two variants of the game (N = 16, M = 3 and N = 24, M = 5). The agent using the fast weights architecture as its policy representation is able to learn faster than the agents using a ReLU RNN or an LSTM to represent the policy. The improvement obtained by fast weights is also more significant on the larger version of the game, which requires more memory.

5.5 Summary

In this chapter, we showed that the performance of RNNs on a variety of different tasks can be improved by introducing a mechanism that allows each new state of the hidden units to be attracted towards recent hidden states in proportion to their scalar products with the current state. Layer normalization makes this kind of attention work much better. This is a form of attention to the recent past that is somewhat similar to the attention mechanism that has recently been used to dramatically improve the sequence-to-sequence RNNs used in machine translation.

Chapter 6

Accelerating learning using Adaptive Moment methods

6.1 Motivation

Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized objective function requiring maximization or minimization with respect to its parameters. If the function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same computational complexity as just evaluating the function. Often, objective functions are stochastic. For example, many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD has proved itself an efficient and effective optimization method that was central to many machine learning success stories, such as recent advances in deep learning [Deng et al., 2013, Krizhevsky et al., 2012d, Hinton and Salakhutdinov, 2006, Hinton et al., 2012b, Graves et al., 2013]. Objectives may also have sources of noise other than data subsampling, such as dropout [Hinton et al., 2012c] regularization. For all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this chapter is on the optimization of stochastic objectives with high-dimensional parameter spaces; higher-order optimization methods are of interest in the next chapter, and the discussion in this chapter is restricted to first-order methods. In this chapter, we introduce Adam, a new algorithm for efficient stochastic optimization that requires only first-order gradients and little memory. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients; the name Adam is derived from adaptive moment estimation. Our method is designed to combine the advantages of two recently popular methods: AdaGrad [Duchi et al., 2011], which works well with sparse gradients, and RMSProp [Tieleman and Hinton, 2012a], which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are clarified in section 6.5. Some of Adam's advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form


of step size annealing. In section 6.2 we describe the algorithm and the properties of its update rule. Section 6.3 explains our initialization bias correction technique, and section 6.4 provides a theoretical analysis of Adam's convergence in online convex programming. Empirically, our method consistently outperforms other methods for a variety of models and datasets, as shown in section 6.6. Overall, we show that Adam is a versatile algorithm that scales to large-scale high-dimensional machine learning problems.

Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 6.2 for details, and for a slightly more efficient (but less clear) order of computation. g_t² indicates the elementwise square g_t ⊙ g_t. Good default settings for the tested machine learning problems are α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10⁻⁸. All operations on vectors are element-wise. With β1^t and β2^t we denote β1 and β2 to the power t.

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  v0 ← 0 (Initialize 2nd moment vector)
  t ← 0 (Initialize timestep)
  while θt not converged do
      t ← t + 1
      g_t ← ∇θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
      m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
      v_t ← β2 · v_{t−1} + (1 − β2) · g_t² (Update biased second raw moment estimate)
      m̂_t ← m_t / (1 − β1^t) (Compute bias-corrected first moment estimate)
      v̂_t ← v_t / (1 − β2^t) (Compute bias-corrected second raw moment estimate)
      θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ε) (Update parameters)
  end while
  return θ_t (Resulting parameters)

6.2 Algorithm

See Algorithm 1 for pseudo-code of our proposed algorithm Adam. In the rest of the section, we follow the notational conventions of numerical optimization and consider f(θ) a noisy objective function: a stochastic scalar function that is differentiable w.r.t. the neural network's parameters θ. We are interested in minimizing the expected value of this function, E[f(θ)], w.r.t. its parameters θ. With f_1(θ), ..., f_T(θ) we denote the realisations of the stochastic function at subsequent timesteps 1, ..., T. The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise. With g_t = ∇θ f_t(θ) we denote the gradient, i.e. the vector of partial derivatives of f_t w.r.t. θ, evaluated at timestep t.

The algorithm updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t), where the hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are initialized as (vectors of) 0's, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the βs are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates m̂_t and v̂_t.
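As a concrete illustration (our own, not a reference implementation), one step of Algorithm 1 in NumPy:

import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Algorithm 1) given the stochastic gradient g at theta."""
    t += 1
    m = beta1 * m + (1 - beta1) * g            # biased first moment estimate
    v = beta2 * v + (1 - beta2) * g * g        # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second raw moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t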

See section 6.3 for more details. Note that the efficiency of Algorithm 1 can, at the expense of clarity, be improved by changing the order of computation, e.g. by replacing the last three lines in the loop with the following: \alpha_t = \alpha \cdot \sqrt{1-\beta_2^t}/(1-\beta_1^t) and \theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t/(\sqrt{v_t} + \hat{\epsilon}).

6.2.1 Adam's update rule

An important property of Adam's update rule is its careful choice of stepsizes. Assuming ε = 0, the effective step taken in parameter space at timestep t is Δ_t = α · m̂_t/√v̂_t. The effective stepsize has two upper bounds: |Δ_t| ≤ α · (1 − β1)/√(1 − β2) in the case (1 − β1) > √(1 − β2), and |Δ_t| ≤ α otherwise. The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize will be smaller. When (1 − β1) = √(1 − β2) we have that |m̂_t/√v̂_t| < 1 and therefore |Δ_t| < α. In more common scenarios, we will have that m̂_t/√v̂_t ≈ ±1 since |E[g]|/√E[g²] ≤ 1. The effective magnitude of the steps taken in parameter space at each timestep is thus approximately bounded by the stepsize setting α, i.e., |Δ_t| ⪅ α. This can be understood as establishing a trust region around the current parameter value, beyond which the current gradient estimate does not provide sufficient information. This typically makes it relatively easy to know the right scale of α in advance. For many machine learning models, for instance, we often know in advance that good optima are with high probability within some set region in parameter space; it is not uncommon, for example, to have a prior distribution over the parameters. Since α sets (an upper bound of) the magnitude of steps in parameter space, we can often deduce the right order of magnitude of α such that optima can be reached from θ0 within some number of iterations. With a slight abuse of terminology, we will call the ratio m̂_t/√v̂_t the signal-to-noise ratio (SNR). With a smaller SNR the effective stepsize Δ_t will be closer to zero. This is a desirable property, since a smaller SNR means that there is greater uncertainty about whether the direction of m̂_t corresponds to the direction of the true gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize Δ_t is also invariant to the scale of the gradients: rescaling the gradients g with a factor c will scale m̂_t by a factor c and v̂_t by a factor c², which cancel out: (c · m̂_t)/√(c² · v̂_t) = m̂_t/√v̂_t.

6.3 Initialization bias correction

As explained in section 6.2, Adam utilizes initialization bias correction terms. We will here derive the term for the second moment estimate; the derivation for the first moment estimate is completely analogous. Let g be the gradient of the stochastic objective f, and we wish to estimate its second raw moment (uncentered variance) using an exponential moving average of the squared gradient, with decay rate β2. Let g_1, ..., g_T be the gradients at subsequent timesteps, each a draw from an underlying gradient distribution g_t ∼ p(g_t). Let us initialize the exponential moving average as v_0 = 0 (a vector of zeros). First note that the update at timestep t of the exponential moving average, v_t = β2 · v_{t−1} + (1 − β2) · g_t² (where g_t² indicates the elementwise square g_t ⊙ g_t), can be written as a function of the gradients at all previous timesteps:

v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2    (6.1)

We wish to know how E[v_t], the expected value of the exponential moving average at timestep t, relates to the true second moment E[g_t²], so we can correct for the discrepancy between the two. Taking expectations of the left-hand and right-hand sides of eq. (6.1):

E[v_t] = E\left[(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}\cdot g_i^2\right]    (6.2)
       = E[g_t^2]\cdot(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i} + \zeta    (6.3)
       = E[g_t^2]\cdot(1-\beta_2^t) + \zeta    (6.4)

where ζ = 0 if the true second moment E[g_i²] is stationary; otherwise ζ can be kept small, since the exponential decay rate β2 can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term (1 − β2^t), which is caused by initializing the running average with zeros. In Algorithm 1 we therefore divide by this term to correct the initialization bias. In the case of sparse gradients, a reliable estimate of the second moment requires averaging over many gradients, i.e. choosing a small decay rate, with β2 close to 1; however, it is exactly in this case that a lack of initialization bias correction would lead to initial steps that are much larger.
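A quick numerical check of eq. (6.4), as our own illustration: with stationary unit-variance gradients, the mean of the uncorrected moving average concentrates around (1 − β2^T) · E[g²], and dividing by (1 − β2^T) recovers the true second moment.

import numpy as np

rng = np.random.default_rng(0)
beta2, T = 0.999, 100
g = rng.normal(size=(10000, T))          # stationary gradients with E[g^2] = 1

v = np.zeros(10000)
for t in range(T):                       # uncorrected moving average, v_0 = 0
    v = beta2 * v + (1 - beta2) * g[:, t] ** 2

print(v.mean())                          # ~ (1 - beta2**T) * E[g^2], about 0.095
print(v.mean() / (1 - beta2 ** T))       # bias-corrected estimate, about 1.0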

6.4 Convergence analysis

We analyze the convergence of Adam using the online learning framework proposed in [Zinkevich, 2003].

Given an arbitrary, unknown sequence of convex cost functions f_1(θ), f_2(θ), ..., f_T(θ), at each time t our goal is to predict the parameter θ_t and evaluate it on a previously unknown cost function f_t. Since the nature of the sequence is unknown in advance, we evaluate our algorithm using the regret, that is, the sum over all previous steps of the difference between the online prediction f_t(θ_t) and the value f_t(θ*) at the best fixed parameter point θ* from a feasible set. Concretely, the regret is defined as:

R(T) = \sum_{t=1}^{T}\left[f_t(\theta_t) - f_t(\theta^*)\right]    (6.5)

where θ* = arg min_{θ∈X} Σ_{t=1}^{T} f_t(θ). We show that Adam has an O(√T) regret bound; our result is comparable to the best known bound for this general convex online learning problem. We also use some definitions to simplify our notation: g_t ≜ ∇f_t(θ_t), with g_{t,i} denoting its i-th element. We define g_{1:t,i} ∈ R^t as the vector that contains the i-th dimension of the gradients over all iterations up to t, g_{1:t,i} = [g_{1,i}, g_{2,i}, ..., g_{t,i}]. Also, we define γ ≜ β1²/√β2. Our following theorem holds when the learning rate α_t decays at a rate of t^{−1/2} and the first moment running average coefficient β_{1,t} decays exponentially with λ, which is typically chosen to be close to 1, e.g. 1 − 10⁻⁸.

Theorem 6.4.1. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖_2 ≤ G and ‖∇f_t(θ)‖_∞ ≤ G_∞ for all θ ∈ R^d, and that the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖_2 ≤ D and ‖θ_m − θ_n‖_∞ ≤ D_∞ for any m, n ∈ {1, ..., T}, and that β1, β2 ∈ [0, 1) satisfy β1²/√β2 < 1. Let α_t = α/√t and β_{1,t} = β1 λ^{t−1}, λ ∈ (0, 1). Adam achieves the following guarantee, for all T ≥ 1:

R(T) \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}

Our Theorem 6.4.1 implies that when the data features are sparse, the summation terms can be much smaller than their upper bounds, Σ_{i=1}^{d} ‖g_{1:T,i}‖_2 << dG_∞√T and Σ_{i=1}^{d} √(T v̂_{T,i}) << dG_∞√T, in particular if the class of functions and data features are of the form in section 1.2 of [Duchi et al., 2011]. Their results for the expected value E[Σ_{i=1}^{d} ‖g_{1:T,i}‖_2] also apply to Adam. In particular, adaptive methods such as Adam and Adagrad can achieve O(log d · √T), an improvement over O(√(dT)) for the non-adaptive method. Decaying β_{1,t} towards zero is important in our theoretical analysis and also matches previous empirical findings; e.g., [Sutskever et al., 2013] suggest that reducing the momentum coefficient at the end of training can improve convergence. Finally, we can show that the average regret of Adam converges:

Corollary 6.4.2. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖_2 ≤ G and ‖∇f_t(θ)‖_∞ ≤ G_∞ for all θ ∈ R^d, and that the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖_2 ≤ D and ‖θ_m − θ_n‖_∞ ≤ D_∞ for any m, n ∈ {1, ..., T}. Adam achieves the following guarantee, for all T ≥ 1:

\frac{R(T)}{T} = O\left(\frac{1}{\sqrt{T}}\right)

This result can be obtained by using Theorem 6.4.1 and Σ_{i=1}^{d} ‖g_{1:T,i}‖_2 ≤ dG_∞√T. Thus, lim_{T→∞} R(T)/T = 0.

6.4.1 Convergence proof

Definition 6.4.3. A function f : R^d → R is convex if for all x, y ∈ R^d and all λ ∈ [0, 1],

\lambda f(x) + (1-\lambda) f(y) \ge f(\lambda x + (1-\lambda) y)

Also, notice that a convex function can be lower bounded by a hyperplane at its tangent.

Lemma 6.4.4. If a function f : R^d → R is convex, then for all x, y ∈ R^d,

f(y) \ge f(x) + \nabla f(x)^\top (y - x)

The above lemma can be used to upper bound the regret, and our proof for the main theorem is constructed by substituting the hyperplane with the Adam update rules. The following two lemmas are used to support our main theorem. We again use the definitions g_t ≜ ∇f_t(θ_t), with g_{t,i} denoting the i-th element, and g_{1:t,i} ∈ R^t denoting the vector that contains the i-th dimension of the gradients over all iterations up to t: g_{1:t,i} = [g_{1,i}, g_{2,i}, ..., g_{t,i}].

Lemma 6.4.5. Let g_t = ∇f_t(θ_t) and g_{1:t} be defined as above and bounded, ‖g_t‖_2 ≤ G and ‖g_t‖_∞ ≤ G_∞. Then,

\sum_{t=1}^{T}\sqrt{\frac{g_{t,i}^2}{t}} \le 2G_\infty\|g_{1:T,i}\|_2

Proof. We will prove the inequality using induction over T.

The base case T = 1 holds, since \sqrt{g_{1,i}^2} \le 2G_\infty\|g_{1,i}\|_2. For the inductive step,

\sum_{t=1}^{T}\sqrt{\frac{g_{t,i}^2}{t}} = \sum_{t=1}^{T-1}\sqrt{\frac{g_{t,i}^2}{t}} + \sqrt{\frac{g_{T,i}^2}{T}}
  \le 2G_\infty\|g_{1:T-1,i}\|_2 + \sqrt{\frac{g_{T,i}^2}{T}}
  = 2G_\infty\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}}

From \|g_{1:T,i}\|_2^2 - g_{T,i}^2 + \frac{g_{T,i}^4}{4\|g_{1:T,i}\|_2^2} \ge \|g_{1:T,i}\|_2^2 - g_{T,i}^2, we can take the square root of both sides and obtain

\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2\|g_{1:T,i}\|_2}
  \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2\sqrt{T G_\infty^2}}

Rearranging the inequality and substituting for the \sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} term,

2G_\infty\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}} \le 2G_\infty\|g_{1:T,i}\|_2

Lemma 6.4.6. Let γ ≜ β1²/√β2. For β1, β2 ∈ [0, 1) that satisfy β1²/√β2 < 1 and bounded g_t, ‖g_t‖_2 ≤ G, ‖g_t‖_∞ ≤ G_∞, the following inequality holds:

\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} \le \frac{2}{1-\gamma}\frac{1}{\sqrt{1-\beta_2}}\|g_{1:T,i}\|_2

Proof. Under the assumption, \frac{\sqrt{1-\beta_2^t}}{(1-\beta_1^t)^2} \le \frac{1}{(1-\beta_1)^2}. We can expand the last term in the summation using the update rules in Algorithm 1:

\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} = \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\frac{\left(\sum_{k=1}^{T}(1-\beta_1)\beta_1^{T-k}g_{k,i}\right)^2}{\sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}g_{j,i}^2}}
  \le \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\sum_{k=1}^{T}\frac{T\left((1-\beta_1)\beta_1^{T-k}g_{k,i}\right)^2}{\sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}g_{j,i}^2}}
  \le \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\sum_{k=1}^{T}\frac{T\left((1-\beta_1)\beta_1^{T-k}g_{k,i}\right)^2}{\sqrt{T(1-\beta_2)\beta_2^{T-k}g_{k,i}^2}}
  \le \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}(1-\beta_1)^2}{(1-\beta_1^T)^2\sqrt{T(1-\beta_2)}}\sum_{k=1}^{T}T\left(\frac{\beta_1^2}{\sqrt{\beta_2}}\right)^{T-k}\|g_{k,i}\|_2
  \le \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} + \frac{T}{\sqrt{T(1-\beta_2)}}\sum_{k=1}^{T}\gamma^{T-k}\|g_{k,i}\|_2

Similarly, we can upper bound the rest of the terms in the summation:

\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} \le \sum_{t=1}^{T}\frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T-t}t\gamma^{j}
  \le \sum_{t=1}^{T}\frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T}t\gamma^{j}

For γ < 1, using the upper bound on the arithmetic-geometric series, \sum_{t} t\gamma^{t} < \frac{1}{(1-\gamma)^2}:

\sum_{t=1}^{T}\frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T}t\gamma^{j} \le \frac{1}{(1-\gamma)^2\sqrt{1-\beta_2}}\sum_{t=1}^{T}\frac{\|g_{t,i}\|_2}{\sqrt{t}}

Applying Lemma 6.4.5,

\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\hat{v}_{t,i}}} \le \frac{2G_\infty}{(1-\gamma)^2\sqrt{1-\beta_2}}\|g_{1:T,i}\|_2

To simplify the notation, we define γ ≜ β1²/√β2. Intuitively, our following theorem holds when the learning rate α_t decays at a rate of t^{−1/2} and the first moment running average coefficient β_{1,t} decays exponentially with λ, which is typically chosen to be close to 1, e.g. 1 − 10⁻⁸.

Theorem 6.4.7. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖_2 ≤ G and ‖∇f_t(θ)‖_∞ ≤ G_∞ for all θ ∈ R^d, and that the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖_2 ≤ D and ‖θ_m − θ_n‖_∞ ≤ D_∞ for any m, n ∈ {1, ..., T}, and that β1, β2 ∈ [0, 1) satisfy β1²/√β2 < 1. Let α_t = α/√t and β_{1,t} = β1 λ^{t−1}, λ ∈ (0, 1). Adam achieves the following guarantee, for all T ≥ 1:

R(T) \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}

Proof. Using Lemma 6.4.4, we have,

f_t(\theta_t) - f_t(\theta^*) \le g_t^\top(\theta_t - \theta^*) = \sum_{i=1}^{d} g_{t,i}(\theta_{t,i} - \theta^*_{,i})

From the update rules presented in Algorithm 1,

\theta_{t+1} = \theta_t - \alpha_t\,\hat{m}_t/\sqrt{\hat{v}_t}
             = \theta_t - \frac{\alpha_t}{1-\beta_1^t}\left(\frac{\beta_{1,t}}{\sqrt{\hat{v}_t}}m_{t-1} + \frac{1-\beta_{1,t}}{\sqrt{\hat{v}_t}}g_t\right)

We focus on the i-th dimension of the parameter vector θ_t ∈ R^d. Subtracting the scalar θ*_{,i} and squaring both sides of the above update rule, we have:

(\theta_{t+1,i}-\theta^*_{,i})^2 = (\theta_{t,i}-\theta^*_{,i})^2 - \frac{2\alpha_t}{1-\beta_1^t}\left(\frac{\beta_{1,t}}{\sqrt{\hat{v}_{t,i}}}m_{t-1,i} + \frac{1-\beta_{1,t}}{\sqrt{\hat{v}_{t,i}}}g_{t,i}\right)(\theta_{t,i}-\theta^*_{,i}) + \alpha_t^2\left(\frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}}\right)^2

We can rearrange the above equation and use Young's inequality, ab \le a^2/2 + b^2/2. Also, it can be shown that \sqrt{\hat{v}_{t,i}} = \sqrt{\sum_{j=1}^{t}(1-\beta_2)\beta_2^{t-j}g_{j,i}^2}\,\big/\sqrt{1-\beta_2^t} \le \|g_{1:t,i}\|_2 and \beta_{1,t} \le \beta_1. Then

g_{t,i}(\theta_{t,i}-\theta^*_{,i}) = \frac{(1-\beta_1^t)\sqrt{\hat{v}_{t,i}}}{2\alpha_t(1-\beta_{1,t})}\left((\theta_{t,i}-\theta^*_{,i})^2 - (\theta_{t+1,i}-\theta^*_{,i})^2\right) + \frac{\beta_{1,t}}{1-\beta_{1,t}}(\theta^*_{,i}-\theta_{t,i})\frac{\hat{v}_{t-1,i}^{1/4}}{\sqrt{\alpha_{t-1}}}\sqrt{\alpha_{t-1}}\frac{m_{t-1,i}}{\hat{v}_{t-1,i}^{1/4}} + \frac{\alpha_t(1-\beta_1^t)\sqrt{\hat{v}_{t,i}}}{2(1-\beta_{1,t})}\left(\frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}}\right)^2
  \le \frac{1}{2\alpha_t(1-\beta_1)}\left((\theta_{t,i}-\theta^*_{,i})^2 - (\theta_{t+1,i}-\theta^*_{,i})^2\right)\sqrt{\hat{v}_{t,i}} + \frac{\beta_{1,t}}{2\alpha_{t-1}(1-\beta_{1,t})}(\theta^*_{,i}-\theta_{t,i})^2\sqrt{\hat{v}_{t-1,i}} + \frac{\beta_1\alpha_{t-1}}{2(1-\beta_1)}\frac{m_{t-1,i}^2}{\sqrt{\hat{v}_{t-1,i}}} + \frac{\alpha_t}{2(1-\beta_1)}\frac{\hat{m}_{t,i}^2}{\sqrt{\hat{v}_{t,i}}}

We apply Lemma 6.4.6 to the above inequality and derive the regret bound by summing across all the dimensions i ∈ {1, ..., d} in the upper bound of f_t(θ_t) − f_t(θ*) and over the sequence of convex functions for t ∈ {1, ..., T}:

R(T) \le \sum_{i=1}^{d}\frac{1}{2\alpha_1(1-\beta_1)}(\theta_{1,i}-\theta^*_{,i})^2\sqrt{\hat{v}_{1,i}} + \sum_{i=1}^{d}\sum_{t=2}^{T}\frac{1}{2(1-\beta_1)}(\theta_{t,i}-\theta^*_{,i})^2\left(\frac{\sqrt{\hat{v}_{t,i}}}{\alpha_t} - \frac{\sqrt{\hat{v}_{t-1,i}}}{\alpha_{t-1}}\right) + \frac{\beta_1\alpha G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \frac{\alpha G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{2\alpha_t(1-\beta_{1,t})}(\theta^*_{,i}-\theta_{t,i})^2\sqrt{\hat{v}_{t,i}}

From the assumption, ‖θ_t − θ*‖_2 ≤ D and ‖θ_m − θ_n‖_∞ ≤ D_∞, we have:

R(T) \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \frac{D_\infty^2}{2\alpha}\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}\sqrt{t\hat{v}_{t,i}}
  \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\alpha}\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}\sqrt{t}

We can use the arithmetic-geometric series upper bound for the last term:

\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}\sqrt{t} \le \sum_{t=1}^{T}\frac{1}{1-\beta_1}\lambda^{t-1}\sqrt{t}
  \le \sum_{t=1}^{T}\frac{1}{1-\beta_1}\lambda^{t-1}t
  \le \frac{1}{(1-\beta_1)(1-\lambda)^2}

Therefore, we have the following regret bound:

R(T) \le \frac{D^2}{2\alpha(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}

6.5 Related work

Optimization methods bearing a direct relation to Adam are RMSProp [Tieleman and Hinton, 2012a, Graves, 2013b] and AdaGrad [Duchi et al., 2011]; these relationships are discussed below. Other stochastic optimization methods include vSGD [Schaul et al., 2012], AdaDelta [Zeiler, 2012] and the natural Newton method of Roux and Fitzgibbon [2010], all of which set stepsizes by estimating curvature from first-order information. The Sum-of-Functions Optimizer (SFO) [Sohl-Dickstein et al., 2014] is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained systems such as a GPU. Like natural gradient descent (NGD) [Amari, 1998], Adam employs a preconditioner that adapts to the geometry of the data, since v̂_t is an approximation to the diagonal of the Fisher information matrix [Pascanu and Bengio, 2013]; however, Adam's preconditioner (like AdaGrad's) is more conservative in its adaptation than vanilla NGD, preconditioning with the square root of the inverse of the diagonal Fisher information matrix approximation.

RMSProp: An optimization method closely related to Adam is RMSProp [Tieleman and Hinton, 2012a]. A version with momentum has sometimes been used [Graves, 2013b]. There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly Chapter 6. Accelerating learning using Adaptive Moment methods 76


estimated using a running average of the first and second moments of the gradient. RMSProp also lacks a bias-correction term; this matters most in the case of a value of β2 close to 1 (required in the case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and often divergence, as we also empirically demonstrate in section 6.6.4.

Figure 6.1: Logistic regression training negative log likelihood on MNIST images and IMDB movie reviews with 10,000 bag-of-words (BoW) feature vectors.

AdaGrad: An algorithm that works well for sparse gradients is AdaGrad [Duchi et al., 2011]. Its basic version updates parameters as \theta_{t+1} = \theta_t - \alpha \cdot g_t / \sqrt{\sum_{i=1}^{t} g_i^2}. Note that if we choose β2 to be infinitesimally close to 1 from below, then \lim_{\beta_2\to 1}\hat{v}_t = t^{-1}\cdot\sum_{i=1}^{t} g_i^2. AdaGrad corresponds to a version of Adam with β1 = 0, infinitesimal (1 − β2) and a replacement of α by an annealed version \alpha_t = \alpha\cdot t^{-1/2}, namely \theta_t - \alpha\cdot t^{-1/2}\cdot\hat{m}_t/\sqrt{\lim_{\beta_2\to 1}\hat{v}_t} = \theta_t - \alpha\cdot t^{-1/2}\cdot g_t/\sqrt{t^{-1}\sum_{i=1}^{t} g_i^2} = \theta_t - \alpha\cdot g_t/\sqrt{\sum_{i=1}^{t} g_i^2}. Note that this direct correspondence between Adam and Adagrad does not hold when removing the bias-correction terms; without bias correction, as in RMSProp, a β2 infinitesimally close to 1 would lead to infinitely large bias, and infinitely large parameter updates.
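This limiting correspondence is easy to check numerically; the following sketch (our own illustration) uses a β2 very close to 1 as a stand-in for the limit:

import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=20)                        # a 1-D gradient history (illustration)
t = len(g)

adagrad_step = g[-1] / np.sqrt(np.sum(g ** 2))

beta2 = 1 - 1e-8                               # beta2 very close to 1
v = 0.0
for gi in g:
    v = beta2 * v + (1 - beta2) * gi ** 2
v_hat = v / (1 - beta2 ** t)                   # approx. (1/t) * sum of squared gradients
adam_step = t ** -0.5 * g[-1] / np.sqrt(v_hat) # beta1 = 0, annealed stepsize alpha_t

print(np.allclose(adagrad_step, adam_step))    # True up to the beta2 approximation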

6.6 Experiments

To empirically evaluate the proposed method, we investigated different popular machine learning models, including logistic regression, multilayer fully connected neural networks and deep convolutional neural networks. Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems. We use the same parameter initialization when comparing different optimization algorithms. The hyper-parameters, such as learning rate and momentum, are searched over a dense grid and the results are reported using the best hyper-parameter setting.

6.6.1 Logistic regression

We evaluate our proposed method on L2-regularized multi-class logistic regression using the MNIST dataset. Logistic regression has a well-studied convex objective, making it suitable for comparing different optimizers without worrying about local minimum issues. The stepsize α in our logistic regression experiments is adjusted with a 1/√t decay, namely α_t = α/√t, which matches our theoretical prediction from section 6.4. The logistic regression classifies the class label directly on the 784-dimensional image vectors. We compare Adam to accelerated SGD with Nesterov momentum and to Adagrad, using a minibatch size of 128. According to Figure 6.1, Adam yields similar convergence as SGD with momentum, and both converge faster than Adagrad. As discussed in [Duchi et al., 2011], Adagrad can efficiently deal with sparse features and gradients as one of its main theoretical results, whereas SGD is slow at learning rare features. Adam with a 1/√t decay on its stepsize should theoretically match the performance of Adagrad. We examine the sparse feature problem using the IMDB movie review dataset from [Maas et al., 2011]. We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words. The 10,000-dimensional BoW feature vector for each review is highly sparse. As suggested in [Wang and Manning, 2013], 50% dropout noise can be applied to the BoW features during training to prevent over-fitting. In Figure 6.1, Adagrad outperforms SGD with Nesterov momentum by a large margin both with and without dropout noise, and Adam converges as fast as Adagrad. The empirical performance of Adam is consistent with our theoretical findings in sections 6.2 and 6.4: similar to Adagrad, Adam can take advantage of sparse features and obtain a faster convergence rate than normal SGD with momentum.

6.6.2 Multi-layer neural networks

Multi-layer neural networks are powerful models with non-convex objective functions. Although our convergence analysis does not apply to non-convex problems, we empirically found that Adam often outperforms other methods in such cases. In our experiments, we made model choices that are consistent with previous publications in the area; a neural network model with two fully connected hidden layers of 1000 hidden units each and ReLU activations is used for this experiment, with a minibatch size of 128. First, we study different optimizers using the standard deterministic cross-entropy objective function with L2 weight decay on the parameters to prevent over-fitting. The sum-of-functions (SFO) method [Sohl-Dickstein et al., 2014] is a recently proposed quasi-Newton method that works with minibatches of data and has shown good performance on the optimization of multi-layer neural networks. We used their implementation and compared it with Adam for training such models. Figure 6.2 shows that Adam makes faster progress in terms of both the number of iterations and wall-clock time. Due to the cost of updating curvature information, SFO is 5-10x slower per iteration than Adam, and has a memory requirement that is linear in the number of minibatches. Stochastic regularization methods, such as dropout, are an effective way to prevent over-fitting and are often used in practice due to their simplicity. SFO assumes deterministic subfunctions, and indeed failed to converge on cost functions with stochastic regularization. We compare the effectiveness of Adam to other stochastic first-order methods on multi-layer neural networks trained with dropout noise. Figure 6.2 shows our results; Adam shows better convergence than the other methods.

6.6.3 Convolutional neural networks

Convolutional neural networks (CNNs) with several layers of convolution, pooling and non-linear units have shown considerable success in computer vision tasks. Unlike most fully connected neural nets, weight sharing in CNNs results in vastly different gradients in different layers. A smaller learning rate Chapter 6. Accelerating learning using Adaptive Moment methods 78


for the convolution layers is often used in practice when applying SGD. We show the effectiveness of Adam in deep CNNs. Our CNN architecture has three alternating stages of 5×5 convolution filters and 3×3 max pooling with a stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs). The input images are pre-processed by whitening, and dropout noise is applied to the input layer and fully connected layer. The minibatch size is also set to 128, as in the previous experiments. Interestingly, although both Adam and Adagrad make rapid progress lowering the cost in the initial stage of training, shown in Figure 6.3 (left), Adam and SGD eventually converge considerably faster than Adagrad for CNNs, shown in Figure 6.3 (right). We notice the second moment estimate v̂_t vanishes to zero after a few epochs and is dominated by the ε in Algorithm 1. The second moment estimate is therefore a poor approximation to the geometry of the cost function in CNNs compared to the fully connected networks of Section 6.6.2. Reducing the minibatch variance through the first moment is more important in CNNs, and contributes to the speed-up. As a result, Adagrad converges much slower than the others in this particular experiment. Though Adam shows only marginal improvement over SGD with momentum, it adapts the learning rate scale for the different layers instead of requiring it to be hand-picked as in SGD.

Figure 6.2: Training of multilayer neural networks on MNIST images. (a) Neural networks using dropout stochastic regularization. (b) Neural networks with a deterministic cost function. We compare with the sum-of-functions (SFO) optimizer [Sohl-Dickstein et al., 2014].

6.6.4 Bias-correction term

We also empirically evaluate the effect of the bias correction terms explained in sections 6.2 and 6.3. As discussed in section 6.5, removal of the bias correction terms results in a version of RMSProp [Tieleman and Hinton, 2012a] with momentum. We vary β1 and β2 when training a variational auto-encoder (VAE) with the same architecture as in [Kingma and Welling, 2013]: a single hidden layer with 500 hidden units with softplus nonlinearities and a 50-dimensional spherical Gaussian latent variable. We iterated over a broad range of hyper-parameter choices, i.e. β1 ∈ [0, 0.9], β2 ∈ {0.99, 0.999, 0.9999}, and log10(α) ∈ [−5, ..., −1]. Values of β2 close to 1, required for robustness to sparse gradients, result in larger initialization bias; we therefore expect the bias correction term to be important in such cases of slow decay, preventing an adverse effect on optimization.


Figure 6.3: Convolutional neural networks training cost. (left) Training cost for the first three epochs. (right) Training cost over 45 epochs. CIFAR-10 with c64-c64-c128-1000 architecture.

In Figure 6.4, values of β2 close to 1 indeed lead to instabilities in training when no bias correction term is present, especially during the first few epochs of training. The best results were achieved with small values of (1 − β2) and bias correction; this was more apparent towards the end of optimization, when gradients tend to become sparser as hidden units specialize to specific patterns. In summary, Adam performed equal to or better than RMSProp, regardless of the hyper-parameter setting.

Figure 6.4: Effect of bias-correction terms (red line) versus no bias correction terms (green line) after 10 epochs (left) and 100 epochs (right) on the loss (y-axes) when learning a Variational Auto-Encoder (VAE) [Kingma and Welling, 2013], for different settings of stepsize α (x-axes) and hyper-parameters β1 and β2.

6.7 Extensions

6.7.1 AdaMax

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L² norm of their individual current and past gradients. We can generalize the L² norm based update rule to an Lᵖ norm based update rule. Such variants become numerically unstable for large p. However, in the special case where we let p → ∞, a surprisingly simple and stable algorithm emerges; see Algorithm 2. We will now derive the algorithm. Let, in the case of the Lᵖ norm, the stepsize at time t be inversely proportional to v_t^{1/p}, where:

v_t = \beta_2^p v_{t-1} + (1-\beta_2^p)|g_t|^p    (6.6)
    = (1-\beta_2^p)\sum_{i=1}^{t}\beta_2^{p(t-i)}\cdot|g_i|^p    (6.7)

Note that the decay term is here equivalently parameterised as β2^p instead of β2. Now let p → ∞, and


define u_t = \lim_{p\to\infty}(v_t)^{1/p}; then:

u_t = \lim_{p\to\infty}(v_t)^{1/p} = \lim_{p\to\infty}\left((1-\beta_2^p)\sum_{i=1}^{t}\beta_2^{p(t-i)}\cdot|g_i|^p\right)^{1/p}    (6.8)
    = \lim_{p\to\infty}(1-\beta_2^p)^{1/p}\left(\sum_{i=1}^{t}\beta_2^{p(t-i)}\cdot|g_i|^p\right)^{1/p}    (6.9)
    = \lim_{p\to\infty}\left(\sum_{i=1}^{t}\left(\beta_2^{t-i}\cdot|g_i|\right)^p\right)^{1/p}    (6.10)
    = \max\left(\beta_2^{t-1}|g_1|,\;\beta_2^{t-2}|g_2|,\;\ldots,\;\beta_2|g_{t-1}|,\;|g_t|\right)    (6.11)

This corresponds to the remarkably simple recursive formula

u_t = \max(\beta_2\cdot u_{t-1},\;|g_t|)    (6.12)

with initial value u_0 = 0. Note that, conveniently enough, we don't need to correct for initialization bias in this case. Also note that the magnitude of parameter updates has a simpler bound with AdaMax than with Adam, namely |Δ_t| ≤ α.

6.7.2 Temporal averaging

Since the last iterate is noisy due to stochastic approximation, better generalization performance is often achieved by averaging. Previously, in Moulines and Bach [2011], Polyak-Ruppert averaging [Polyak and Juditsky, 1992, Ruppert, 1988] has been shown to improve the convergence of standard SGD, where θ̄_t = (1/t) Σ_{k=1}^{t} θ_k. Alternatively, an exponential moving average over the parameters can be used, giving higher weight to more recent parameter values. This can be trivially implemented by adding one line to the inner loops of Algorithms 1 and 2: θ̄_t ← β2 · θ̄_{t−1} + (1 − β2)θ_t, with θ̄_0 = 0. Initialization bias can again be corrected by the estimator θ̂_t = θ̄_t/(1 − β2^t).
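A minimal sketch of this one-line extension (function and variable names are ours):

import numpy as np

def averaged_params(theta_bar, theta, t, beta=0.999):
    """One step of the exponential parameter average described above, with the same
    zero-initialization bias correction as Adam's moment estimates."""
    theta_bar = beta * theta_bar + (1 - beta) * theta
    theta_hat = theta_bar / (1 - beta ** t)      # bias-corrected average for evaluation
    return theta_bar, theta_hat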

Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See section 6.7.1 for details. Good default settings for the tested machine learning problems are α = 0.002, β1 = 0.9 and β2 = 0.999. With β1^t we denote β1 to the power t. Here, α/(1 − β1^t) is the learning rate with the bias-correction term for the first moment. All operations on vectors are element-wise.

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  u0 ← 0 (Initialize the exponentially weighted infinity norm)
  t ← 0 (Initialize timestep)
  while θt not converged do
      t ← t + 1
      g_t ← ∇θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
      m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
      u_t ← max(β2 · u_{t−1}, |g_t|) (Update the exponentially weighted infinity norm)
      θ_t ← θ_{t−1} − (α/(1 − β1^t)) · m_t/u_t (Update parameters)
  end while
  return θ_t (Resulting parameters)
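For illustration, one AdaMax step in NumPy; the small constant eps is our own addition, guarding the division before any non-zero gradient has been seen:

import numpy as np

def adamax_step(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update (Algorithm 2); a sketch, not a reference implementation."""
    t += 1
    m = beta1 * m + (1 - beta1) * g               # biased first moment estimate
    u = np.maximum(beta2 * u, np.abs(g))          # exponentially weighted infinity norm
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + eps)
    return theta, m, u, t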

6.8 Summary

In this chapter, we introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed at machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis of the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.

Chapter 7

Scale up learning with the distributed natural gradient methods

As more computational resources become available, machine learning researchers train ever larger neural networks on millions of data points using stochastic gradient descent (SGD). Although SGD scales well in terms of both the size of the dataset and the number of parameters of the model, it has rapidly diminishing returns as parallel computing resources increase. Second-order optimization methods have an affinity for well-estimated gradients and large mini-batches, and can therefore benefit much more from parallel computation in principle. Unfortunately, they often employ severe approximations to the curvature matrix in order to scale to large models with millions of parameters, limiting their effectiveness in practice versus well-tuned SGD with momentum. The recently proposed K-FAC method [Martens and Grosse, 2015] uses a stronger and more sophisticated curvature approximation, and has been shown to make much more per-iteration progress than SGD, while only introducing a modest overhead. In this chapter, we develop a version of K-FAC that distributes the computation of gradients and additional quantities required by K-FAC across multiple machines, thereby taking advantage of the method's superior scaling to large mini-batches and mitigating its additional overheads. We provide a Tensorflow implementation of our approach which is easy to use and can be applied to many existing codebases without modification. Additionally, we develop several algorithmic enhancements to K-FAC which can improve its computational performance for very large models. Finally, we show that our distributed K-FAC method speeds up training of various state-of-the-art ImageNet classification models by a factor of two compared to an improved form of Batch Normalization [Ioffe and Szegedy, 2015a].

7.1 Motivation

Current state-of-the-art deep neural networks [Szegedy et al., 2014b, Krizhevsky et al., 2012e, He et al., 2015] often require days of training time with millions of training cases. The typical strategy to speed up neural network training is to allocate more parallel resources over many machines and cluster nodes [Dean et al., 2012]. Parallel training also enables researchers to build larger models where different machines

compute different splits of the mini-batches. Although we have improved our distributed training setups over the years, neural networks are still trained with various simple first-order stochastic gradient descent (SGD) algorithms. Although SGD scales well with both the size of the model and the size of the dataset, it does not scale well with parallel computation resources. Larger mini-batches and more parallel computations exhibit diminishing returns for SGD and related algorithms.

Second-order optimization methods, which use second-order information to construct updates that account for the curvature of the objective function, represent a promising alternative. The canonical second-order methods work by inverting a large curvature matrix (traditionally the Hessian), but this does not scale well to deep neural networks with millions of parameters. Various approximations to the curvature matrix have been proposed to help alleviate this problem, such as diagonal [LeCun et al., 1998b, Duchi et al., 2011, Kingma and Ba, 2014b], block-diagonal [Le Roux et al., 2008], and low-rank ones [Schraudolph et al., 2007, Bordes et al., 2009, Wang et al., 2014, Keskar and Berahas, 2015, Moritz et al., 2016, Byrd et al., 2016, Curtis, 2016, Ramamurthy and Duffy]. Another strategy is to use Krylov-subspace methods and efficient matrix-vector product algorithms to avoid the inversion problem entirely [Martens, 2010, Vinyals and Povey, 2012, Kiros, 2013, Cho et al., 2015, He et al., 2016]. The usual problem with curvature approximations, especially low-rank and diagonal ones, is that they are very crude and only model superficial aspects of the true curvature in the objective function. Krylov-subspace methods, on the other hand, suffer because they still rely on 1st-order methods to compute their updates.

More recently, several approximations have been proposed based on statistical approximations of the Fisher information matrix [Heskes, 2000, Ollivier, 2013, Grosse and Salakhutdinov, 2015, Povey et al., 2015, Desjardins et al., 2015]. In the K-FAC approach [Martens and Grosse, 2015, Grosse and Martens, 2016], these approximations result in a block-diagonal approximation to the Fisher information matrix (with blocks corresponding to entire layers) where each block is approximated as a Kronecker product of two much smaller matrices, both of which can be estimated and inverted fairly efficiently. Because the inverse of a Kronecker product of two matrices is the Kronecker product of their inverses, this allows the entire matrix to be inverted efficiently. Martens and Grosse [2015] found that K-FAC scales very favorably to larger mini-batches compared to SGD, enjoying a nearly linear relationship between mini-batch size and per-iteration progress for medium-to-large sized mini-batches. One possible explanation for this phenomenon is that second-order methods make more rapid progress exploring the error surface and reaching a neighborhood of a local minimum where gradient noise (which is inversely proportional to mini-batch size) becomes the chief limiting factor in convergence¹. This observation implies that K-FAC would benefit in particular from a highly parallel distributed implementation.
In this chapter, we will discuss an asynchronous distributed version of K-FAC that can effectively exploit large amounts of parallel computing resources, and which scales to industrial-scale neural net models with hundreds of millions of parameters. Our method augments the traditional distributed synchronous SGD setup with additional computation nodes that update the approximate Fisher and compute its inverse. The proposed method achieves a per-iteration runtime comparable to that of normal SGD using the same mini-batch size on a typical 4-GPU cluster. We also developed a "doubly factored"

¹Mathematical evidence for this idea can be found in Martens [2014], where it is shown that (convex quadratic) objective functions decompose into noise-dependent and noise-independent terms, and that second-order methods make much more rapid progress optimizing the noise-independent term compared to SGD, while having no effect on the noise-dependent term (which shrinks with the size of the mini-batch).

Kronecker approximation for layers whose inputs are feature maps that are normally too large to be handled by the standard Kronecker-factored approximation. Finally, we empirically demonstrate that the proposed method speeds up learning of various state-of-the-art ImageNet models by a factor of two over Batch Normalization [Ioffe and Szegedy, 2015a].

7.2 Fisher information matrix and natural gradient

7.2.1 Kronecker factored approximate Fisher

Let $\mathcal{D}W = \nabla_W \mathcal{L}$ be the gradient of the negative log-likelihood $\mathcal{L}$ of a neural network w.r.t. some weight matrix $W \in \mathbb{R}^{C_{out} \times C_{in}}$ in a layer, where $C_{in}$, $C_{out}$ are the number of input/output units of the layer. The block of the Fisher information matrix of that layer is given by:

$$
F = \mathbb{E}_{x,y \sim P}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right], \qquad (7.1)
$$

where $P$ is the distribution over the input $x$ and the network's distribution over targets $y$ (implied by the negative log-likelihood objective). We assume that expectations are taken with respect to $P$ (and not the training distribution over $y$).

K-FAC [Martens and Grosse, 2015, Grosse and Martens, 2016] uses a Kronecker-factored approximation to each block which we now describe. Denote the input activation vector to the layer as $a \in \mathbb{R}^{C_{in}}$, the pre-activation inputs as $z = Wa$ and the back-propagated loss derivatives as $\mathcal{D}z = \frac{\partial \mathcal{L}}{\partial z} \in \mathbb{R}^{C_{out}}$. Note that the gradient of the weights is the outer product of the input activations and the back-propagated derivatives, $\mathcal{D}W^{\top} = a\, \mathcal{D}z^{\top}$. K-FAC approximates the Fisher block as a Kronecker product of the second-order statistics of the inputs and the backpropagated derivatives:

$$
F = \mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] = \mathbb{E}\left[ a a^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right] \approx \mathbb{E}\left[ a a^{\top} \right] \otimes \mathbb{E}\left[ \mathcal{D}z \mathcal{D}z^{\top} \right] \triangleq \hat{F}. \qquad (7.2)
$$

This approximation can be interpreted as making the assumption that the second-order statistics of the activations and the backpropagated derivatives are uncorrelated.
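To make Eq. 7.2 concrete, here is a small NumPy sketch (ours, not code from any K-FAC release) of how the two Kronecker factors might be estimated from a mini-batch; note that in K-FAC the derivatives are sampled from the model's own predictive distribution, consistent with the expectation being taken w.r.t. $P$.

```python
import numpy as np

def kfac_factors(a, dz):
    """Estimate the Kronecker factors of Eq. 7.2 from a mini-batch.
    a:  (N, C_in)  input activations to the layer
    dz: (N, C_out) derivatives w.r.t. pre-activations, sampled from the
        model's predictive distribution (the expectation is over P)."""
    n = a.shape[0]
    A = a.T @ a / n     # estimate of E[a a^T]
    G = dz.T @ dz / n   # estimate of E[Dz Dz^T]
    return A, G
```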

7.2.2 Approximate natural gradient using K-FAC

The natural gradient [Amari, 1998] is defined as the inverse of the Fisher times the gradient. It is traditionally interpreted as the direction in parameter space that achieves the largest (instantaneous) improvement in the objective per unit of change in the output distribution of the network (as measured using the KL-divergence). Under certain conditions, which almost always hold in practice, it can also be interpreted as a second-order update computed by minimizing a local quadratic approximation of the log-likelihood objective, where the Hessian is approximated using the Fisher [Martens, 2014]. To compute the approximate natural gradient in K-FAC, one multiplies the gradient for the weights of each layer by the inverse of the corresponding approximate Fisher block Fˆ for that layer. Denote

the gradient of the loss function with respect to the weights $W$ by $\mathcal{G}_W \in \mathbb{R}^{C_{in} \times C_{out}}$. We will assume the use of the factorized Tikhonov damping approach described by Martens and Grosse [2015], where the addition of the damping term $\lambda I$ to $\hat{F}$ is approximated by adding $\pi_a \lambda^{\frac{1}{2}} I$ to $\mathbb{E}[a a^{\top}]$ and $\pi_{\mathcal{D}z} \lambda^{\frac{1}{2}} I$ to $\mathbb{E}[\mathcal{D}z \mathcal{D}z^{\top}]$, where $\pi_a$ and $\pi_{\mathcal{D}z}$ are adjustment factors that are described in detail and generalized in

Sec. 7.4.1. (Note that one can also include the contribution to the curvature from any L2 regularization terms with $\lambda$.)

By exploiting the basic identities $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$ and $(A \otimes B)\,\mathrm{vec}(C) = \mathrm{vec}(B C A^{\top})$, the approximate natural gradient update $v$ can then be computed as:

$$
v = \left( \hat{F} + \lambda I \right)^{-1} \mathrm{vec}\{\mathcal{G}_W\} \approx \mathrm{vec}\left\{ \left( \mathbb{E}\left[a a^{\top}\right] + \pi_a \lambda^{\frac{1}{2}} I \right)^{-1} \mathcal{G}_W \left( \mathbb{E}\left[\mathcal{D}z \mathcal{D}z^{\top}\right] + \pi_{\mathcal{D}z} \lambda^{\frac{1}{2}} I \right)^{-1} \right\}, \qquad (7.3)
$$

which amounts to several matrix inversion and multiplication operations involving matrices roughly the same size as the weight matrix $W$.
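A minimal sketch of this update, using the factored Tikhonov damping just described; the Kronecker identities reduce the inversion of $\hat{F} + \lambda I$ to two small matrix inversions.

```python
import numpy as np

def kfac_natural_gradient(grad_W, A, G, lam, pi_a, pi_dz):
    """Approximate natural gradient of Eq. 7.3 for one layer.
    grad_W: (C_in, C_out) gradient w.r.t. the weights
    A: (C_in, C_in) estimate of E[a a^T]; G: (C_out, C_out) estimate of E[Dz Dz^T]
    lam: damping strength; pi_a, pi_dz: factored Tikhonov adjustment factors."""
    A_d = A + pi_a * np.sqrt(lam) * np.eye(A.shape[0])
    G_d = G + pi_dz * np.sqrt(lam) * np.eye(G.shape[0])
    # (A ⊗ G)^{-1} vec(grad_W) corresponds to A_d^{-1} @ grad_W @ G_d^{-1}
    return np.linalg.solve(A_d, grad_W) @ np.linalg.inv(G_d)
```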

7.2.3 Related works

Large distributed optimization frameworks have been previously studied for 1st-order methods. Dean et al. [2012] proposed a distributed system that computes gradients asynchronously over thousands of machines. The scalability of a distributed 1st-order method is heavily related to the network communication between the computation nodes. Seide et al. [2014] show an impressive speed-up for distributed SGD by reducing the network traffic overhead. Perhaps the work that is closest to ours is that of Povey et al. [2014]. Unlike our approach, their method computes the natural gradients asynchronously over many computation nodes, and the master parameters are synchronized after a time interval. Distributed K-FAC instead computes gradients synchronously, as in synchronous SGD, and only the 2nd-order statistics and eigendecompositions are computed asynchronously.

7.3 Distributed Optimization using K-FAC

Stochastic optimization algorithms benefit from low-variance gradient estimates (as might be obtained from larger mini-batches). Prior work suggests that approximate natural gradient algorithms might benefit more than standard SGD from reducing the variance [Martens and Grosse, 2015, Grosse and Martens, 2016]. One way to efficiently obtain low-variance gradient estimates is to parallelize the gradient computation across many machines in a distributed system (thus allowing large mini-batches to be processed efficiently). Because the gradient computation in K-FAC is identical to that of SGD, we parallelize the gradient computation using the standard synchronous SGD model. However, K-FAC also introduces other forms of overhead not found in SGD — in particular, estimation of second-order statistics and computation of inverses or eigenvalues of the Kronecker factors. In this section, we describe how these additional computations can be performed asynchronously. While this asynchronous computation introduces an additional source of error into the algorithm, we find that it does not significantly affect the per-iteration progress in practice. All in all, the per-iteration wall clock time of our distributed K-FAC implementation is only 5-10% higher compared to synchronous SGD with the same mini-batch size.

7.3.1 Asynchronous Fisher block inversion

Figure 7.1: The diagram illustrates the distributed computation of K-FAC. Gradient workers (blue) compute the gradient w.r.t. the loss function. Stats workers (grey) compute the sampled second-order statistics. Additional workers (red) compute inverse Fisher blocks. The parameter server (orange) uses gradients and their inverse Fisher blocks to compute parameter updates.

Computing the parameter updates as per Eq. 7.3 requires the estimated gradients to be multiplied by the inverses of the smaller Kronecker factors. This requires periodically computing (typically) either inverses or eigendecompositions of each of these factors. While these factors typically have sizes only in the hundreds or low thousands, very deep networks may have hundreds of such matrices (2 or more for each layer). Furthermore, matrix inversion and eigendecomposition see little benefit from GPU computation, so they can be more expensive than standard neural network operations. For these reasons, inverting the approximate Fisher blocks represents a significant computational cost.

It has been observed that refreshing the inverses of the Fisher blocks only occasionally and using stale values otherwise has only a small detrimental effect on average per-iteration progress, perhaps because the curvature changes relatively slowly [Martens and Grosse, 2015]. We push this a step further by computing the inverses asynchronously while the network is still training. Because the required linear algebra operations are CPU-bound while the rest of our computations are GPU-bound, we perform them on the CPU with little effective overhead. Our curvature statistics are somewhat more stale as a result, but this does not appear to significantly affect per-iteration optimization performance. In our experiments, we found that computing the inverses asynchronously usually offered a 40-50% speed-up to the overall wall-clock time of the K-FAC algorithm. A schematic sketch of this scheme is given below.
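In this sketch, a background CPU thread periodically refreshes the factor inverses while the GPU-bound training loop keeps reading (possibly stale) values. The names `get_factors`, `inverses`, and `stop` are illustrative, not part of any real K-FAC implementation.

```python
import threading
import numpy as np

inverses = {}             # stale inverses, read by the training loop
lock = threading.Lock()

def inversion_worker(get_factors, stop):
    """Refresh Fisher-factor inverses on the CPU while training runs on GPU."""
    while not stop.is_set():
        factors = get_factors()   # snapshot the current second-order statistics
        fresh = {name: np.linalg.inv(F) for name, F in factors.items()}
        with lock:
            inverses.update(fresh)   # the next optimization step picks these up

# Started once before training, e.g.:
# threading.Thread(target=inversion_worker, args=(get_factors, stop)).start()
```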

7.3.2 Asynchronous statistics computation

The other major source of computational overhead in K-FAC is the estimation of the second-order statistics of the activations and derivatives, which are needed for the Kronecker factors. In the standard K-FAC algorithm, these statistics are computed on the same mini-batches as the gradients, allowing the forward pass computations to be shared between the gradient and statistics computations. By computing the gradients and statistics on separate mini-batches, we can enable a higher degree of parallelism, at the expense of slightly more total computational operations. Under this scheme, the statistics estimation is independent of the gradient computation, so it can be done on one or more separate worker nodes with their own independent data shards. These worker nodes receive parameters from the parameter server (just as in synchronous SGD) and communicate statistics back to the parameter server. In our experiments, we assigned at most one worker to computing statistics.

7.4 Doubly-factored Kronecker approximation for large convolution layers

Computing the standard Kronecker factored Fisher approximation for a given layer involves operations on matrices whose dimension is the number of input units or output units. The cost of these operations is reasonable for most fully-connected networks because the number of units in each layer rarely exceeds a couple thousand. Large convolutional neural networks, however, often include a fully-connected layer that "pools" over a large feature map before the final softmax classification. For instance, the output of the last pooling layer of AlexNet is of size $6 \times 6 \times 256 = 9216$, which then provides inputs to the subsequent fully connected layer of 4096 ReLUs. VGG models also share a similar architecture. For the standard Kronecker-factored approximation one of the factors will be a matrix of size $9216 \times 9216$, which is too expensive to be explicitly inverted as often as is needed during training.

In this section we propose a "doubly-factored" Kronecker approximation for layers whose input is a large feature map. Specifically, we approximate the second-order statistics matrix of the inputs as itself factoring as a Kronecker product. This gives an approximation which is a Kronecker product of three matrices.

Using the AlexNet example, the $9216 \times 4096$ weight matrix in the first fully connected layer is equivalent to a filterbank of 4096 filters with kernel size $6 \times 6$ on 256 input channels. Let $a$ be a matrix of dimension $T$-by-$C_{in}$ representing the input activations (for a single training case), where $T = K_w \times K_h$ is the feature map height times width, and $C_{in}$ is the number of input channels. The Fisher block for such a layer can be written as:

$$
\mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] = \mathbb{E}\left[ \mathrm{vec}\{a\}\, \mathrm{vec}\{a\}^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right], \qquad a \in \mathbb{R}^{T \times C_{in}}. \qquad (7.4)
$$

We begin by making the following rank-1 approximation:

$$
a \approx \mathcal{K} \Psi^{\top}, \qquad (7.5)
$$

where $\mathcal{K} \in \mathbb{R}^{T}$ and $\Psi \in \mathbb{R}^{C_{in}}$ are the factors along the spatial location dimension and the input channel dimension. The optimal solution of a low-rank approximation under the Frobenius norm is given by the singular value decomposition. The activation matrix $a$ is small enough that its SVD can be computed efficiently. Let $\sigma_1$, $u_1$, $v_1$ be the first singular value and its left and right singular vectors of the activation matrix $a$, respectively. The factors of the rank-1 approximation are then chosen to be $\mathcal{K} = \sqrt{\sigma_1}\, u_1$ and $\Psi = \sqrt{\sigma_1}\, v_1$. $\mathcal{K}$ captures the activation patterns across spatial locations in a feature map and $\Psi$ captures the pattern across the filter responses. Under the rank-1 approximation of $a$ we have:

$$
\mathbb{E}\left[ \mathrm{vec}\{a\}\, \mathrm{vec}\{a\}^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right] \approx \mathbb{E}\left[ \mathrm{vec}\{\mathcal{K} \Psi^{\top}\}\, \mathrm{vec}\{\mathcal{K} \Psi^{\top}\}^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right] \qquad (7.6)
$$

$$
= \mathbb{E}\left[ \mathcal{K} \mathcal{K}^{\top} \otimes \Psi \Psi^{\top} \otimes \mathcal{D}z \mathcal{D}z^{\top} \right]. \qquad (7.7)
$$

We further assume the second order statistics are three-way independent between the loss derivatives $\mathcal{D}z$, the activations along the input channels $\Psi$, and the activations along spatial locations $\mathcal{K}$:

$$
\mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] \approx \mathbb{E}\left[ \mathcal{K} \mathcal{K}^{\top} \right] \otimes \mathbb{E}\left[ \Psi \Psi^{\top} \right] \otimes \mathbb{E}\left[ \mathcal{D}z \mathcal{D}z^{\top} \right]. \qquad (7.8)
$$

The final approximated Fisher block is a Kronecker product of three small matrices. Note that although we assumed the feature map activations have low-rank structure, the resulting approximated Fisher is not low-rank. The approximate natural gradient for this layer can then be computed by multiplying the inverses of each of the smaller matrices against the respective dimensions of the gradient tensor. We define a

function $\mathcal{R}_i : \mathbb{R}^{d_1 \times d_2 \times d_3} \to \mathbb{R}^{d_j d_k \times d_i}$ that constructs a matrix from a 3D tensor by "reshaping" it so that the desired target dimension $i \in \{1, 2, 3\}$ maps to the columns, while the remaining dimensions ($j$ and $k$) are "folded together" and map to the rows. Given the gradient of the weights $\mathcal{G}_W \in \mathbb{R}^{T \times C_{in} \times C_{out}}$, we can compute the matrix-vector product with the inverse of the doubly-factored Kronecker approximated Fisher block as:

$$
\mathcal{R}_3^{-1}\left( \mathbb{E}\left[\mathcal{D}z \mathcal{D}z^{\top}\right]^{-1} \mathcal{R}_3\left( \mathcal{R}_2^{-1}\left( \mathbb{E}\left[\Psi \Psi^{\top}\right]^{-1} \mathcal{R}_2\left( \mathcal{R}_1^{-1}\left( \mathbb{E}\left[\mathcal{K} \mathcal{K}^{\top}\right]^{-1} \mathcal{R}_1\left( \mathcal{G}_W \right) \right) \right) \right) \right) \right), \qquad (7.9)
$$

which is a nested application of the reshape function $\mathcal{R}_i(\cdot)$ at each of the dimensions of the gradient tensor.

The doubly-factored Kronecker approximation provides a computationally feasible alternative to the standard Kronecker-factored approximation for layers that have a number of parameters in the order of hundreds of millions. For example, inverting it for the first fully connected layer of AlexNet takes about 15 seconds on an 8-core Intel Xeon CPU, and such time is amortized in our asynchronous algorithm.

Unfortunately, the homogeneous coordinate formulation is no longer applicable under this new approximation. Instead, we lump the bias parameters together and associate a full Fisher block with them, which can be explicitly computed and inverted since the number of bias parameters per layer is small.
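A NumPy sketch of the two pieces just described, under the assumption that the per-case activation matrix and the factor inverses are available as dense arrays: the rank-1 factors of Eq. 7.5 come from an SVD, and the inverse of the three-factor approximation is applied to the gradient tensor via mode-wise products, which is equivalent to the nested reshapes of Eq. 7.9.

```python
import numpy as np

def rank1_factors(a):
    """Rank-1 factors of Eq. 7.5 for a (T, C_in) activation matrix a."""
    U, s, Vt = np.linalg.svd(a, full_matrices=False)
    K = np.sqrt(s[0]) * U[:, 0]    # spatial-location factor, shape (T,)
    Psi = np.sqrt(s[0]) * Vt[0]    # input-channel factor, shape (C_in,)
    return K, Psi

def doubly_factored_nat_grad(grad_W, KK_inv, PsiPsi_inv, GG_inv):
    """Apply (E[KK^T] ⊗ E[ΨΨ^T] ⊗ E[Dz Dz^T])^{-1} to the gradient tensor
    grad_W of shape (T, C_in, C_out), as in Eq. 7.9."""
    v = np.einsum('ts,sio->tio', KK_inv, grad_W)   # spatial mode
    v = np.einsum('ij,tjo->tio', PsiPsi_inv, v)    # input-channel mode
    v = np.einsum('op,tip->tio', GG_inv, v)        # output-channel mode
    return v
```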

7.4.1 Factored Tikhonov damping for the doubly-factored Kronecker approximation

In second-order optimization methods, "damping" performs the crucial task of correcting for the inaccuracies of the local quadratic approximation of the objective that is (perhaps implicitly) optimized when computing the update [e.g. Martens and Sutskever, 2012, Martens, 2014]. In the well-known Tikhonov damping/regularization approach, one adds a multiple of the identity $\lambda I$ to the Fisher before inverting it (as one also does for L2-regularization / weight-decay), which roughly corresponds to imposing a spherical trust-region on the update.

The inverse of a Kronecker product can be computed efficiently as the Kronecker product of the inverses of its factors. Adding a multiple of the identity complicates this computation (although it can still be performed tractably using eigendecompositions). The "factored Tikhonov damping" technique proposed in [Martens and Grosse, 2015] is appealing because it preserves the Kronecker structure of the factorization, and thus the inverse can still be computed by inverting each of the smaller matrices (avoiding the more expensive eigendecomposition operation). In our experiments with large ImageNet models, we also observed that factored damping seems to perform better in practice. In this subsection we derive a generalized version of factored Tikhonov damping for the doubly-factored Kronecker approximation.

Suppose we wish to add $\lambda I$ to our approximate Fisher block $A \otimes B \otimes C$. In the factored Tikhonov scheme this is approximated by adding $\pi_a \lambda^{\frac{1}{3}} I$, $\pi_b \lambda^{\frac{1}{3}} I$, and $\pi_c \lambda^{\frac{1}{3}} I$ to $A$, $B$ and $C$ respectively, for non-negative scalars $\pi_a$, $\pi_b$ and $\pi_c$ satisfying $\pi_a \pi_b \pi_c = 1$. The error associated with this approximation is:

$$
\left(A + \pi_a \lambda^{\frac{1}{3}} I\right) \otimes \left(B + \pi_b \lambda^{\frac{1}{3}} I\right) \otimes \left(C + \pi_c \lambda^{\frac{1}{3}} I\right) - \left(A \otimes B \otimes C + \lambda I\right) \qquad (7.10)
$$

$$
= \pi_c \lambda^{\frac{1}{3}}\, A \otimes B \otimes I + \pi_b \lambda^{\frac{1}{3}}\, A \otimes I \otimes C + \pi_a \lambda^{\frac{1}{3}}\, I \otimes B \otimes C + \pi_b \pi_c \lambda^{\frac{2}{3}}\, A \otimes I \otimes I + \pi_a \pi_c \lambda^{\frac{2}{3}}\, I \otimes B \otimes I + \pi_a \pi_b \lambda^{\frac{2}{3}}\, I \otimes I \otimes C \qquad (7.11)
$$

Following Martens and Grosse [2015], we choose $\pi_a$, $\pi_b$ and $\pi_c$ by taking the nuclear norm in Eq. 7.11 and minimizing its triangle inequality-derived upper-bound. Note that the nuclear norm of a Kronecker product is the product of the nuclear norms of the individual matrices: $\|A \otimes B\|_* = \|A\|_* \|B\|_*$. This gives the following formula for the value of $\pi_a$:

$$
\pi_a = \sqrt[3]{ \left( \frac{\|A\|_*}{d_A} \right)^{2} \left( \frac{\|B\|_*}{d_B} \frac{\|C\|_*}{d_C} \right)^{-1} }, \qquad (7.12)
$$

where the $d$'s are the numbers of rows (equivalently, columns) of the corresponding Kronecker factor matrices.

The corresponding formulae for $\pi_b$ and $\pi_c$ are analogous. Intuitively, Eq. 7.12 rescales the contribution to each factor matrix according to the geometric mean of the ratios of its norm to the norms of the other factor matrices. This results in the contribution being upscaled if the factor's norm is larger than the average norm, for example. Note that this formula generalizes to Kronecker products of arbitrary numbers of matrices as the geometric mean of the norm ratios.
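As a minimal sketch, the adjustment factors of Eq. 7.12 can be computed directly from the nuclear norms of the three factors:

```python
import numpy as np

def tikhonov_pi_factors(A, B, C):
    """Adjustment factors of Eq. 7.12 for a three-factor Kronecker product."""
    def r(M):  # nuclear norm divided by the factor's dimension
        return np.linalg.norm(M, ord='nuc') / M.shape[0]
    ra, rb, rc = r(A), r(B), r(C)
    pi_a = (ra ** 2 / (rb * rc)) ** (1.0 / 3.0)
    pi_b = (rb ** 2 / (ra * rc)) ** (1.0 / 3.0)
    pi_c = (rc ** 2 / (ra * rb)) ** (1.0 / 3.0)
    return pi_a, pi_b, pi_c   # their product is 1, as required
```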

7.5 Step size selection

Although Grosse and Martens [2016] found that Polyak averaging [Polyak and Juditsky, 1992] obviated the need for tuning learning rate schedules on some problems, we observed the choice of learning rate schedules to be an important factor in our ImageNet experiments (perhaps due to higher stochasticity in the updates). On ImageNet, it is common to use a fixed exponential decay schedule [Szegedy et al., 2014b, 2015]. As an alternative to learning rate schedules, we instead use curvature information to control the amount by which the predictive distribution is allowed to change after each update. In particular, given a parameter update vector v, the second-order Taylor approximation to the KL divergence between the predictive distributions before and after the update is given by the (squared) Fisher norm:

$$
D_{\mathrm{KL}}[q \,\|\, p] \approx \frac{1}{2} v^{\top} F v \qquad (7.13)
$$

This quantity can be computed with a curvature-vector product [Schraudolph, 2002]. Observe that choosing a step size of $\eta$ will produce an update with squared Fisher norm $\eta^2 v^{\top} F v$. Instead of using a learning rate schedule, we choose $\eta$ in each iteration such that the squared Fisher norm is at most some value $c$:

$$
\eta = \min\left( \eta_{\max},\ \sqrt{\frac{c}{v^{\top} F v}} \right) \qquad (7.14)
$$

Grosse and Martens [2016] used this method to clip updates at the start of training, but we found it useful to use it throughout training. We use an exponential decay schedule $c_k = c_0 \zeta^k$, where $c_0$ and $\zeta$ are tunable parameters, and $k$ is incremented periodically (every half an epoch in our ImageNet experiments). Shrinking the maximum change in the model's predictions after each update is analogous to shrinking the trust region of the second-order optimization.

In practice, computing curvature-vector products after every update introduces significant computational overhead, so we instead used the approximate Fisher $\hat{F}$ in place of $F$, which allows the approximate Fisher norm to be computed efficiently as $v^{\top} \hat{F} v = v^{\top} \hat{F} (\hat{F}^{-1} \mathcal{G}_W) = v^{\top} \mathcal{G}_W$. The maximum step size $\eta_{\max}$ was set to a large value, and in practice this maximum was reached only at the beginning of training, when $F$ was small in magnitude. We found this outperformed simple exponential learning rate decay in our ImageNet experiments.

Figure 7.2: decayKL indicates the proposed step-size selection method and decayLR indicates standard exponential learning rate decay.
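A minimal sketch of the step-size rule of Eq. 7.14 with the decaying threshold $c_k$, using the cheap identity $v^{\top} \hat{F} v = v^{\top} \mathcal{G}_W$ noted above. The default `eta_max` below is an arbitrary placeholder; the text only says it was set to a large value.

```python
import numpy as np

def kl_step_size(v_dot_grad, k, c0=0.01, zeta=0.96, eta_max=10.0):
    """Step size of Eq. 7.14 with decayed threshold c_k = c0 * zeta**k.
    v_dot_grad: v^T G_W, which equals the approximate Fisher norm v^T F_hat v."""
    c = c0 * zeta ** k                             # k incremented every half epoch
    return min(eta_max, np.sqrt(c / max(v_dot_grad, 1e-12)))
```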

7.5.1 Experimental evaluation of the step-size selection method of Section 7.5

To compare our proposed step-size selection method from Sec. 7.5 with the commonly-used exponential learning rate decay, we performed a simple experiment training GoogLeNet. Both the learning rate and the threshold $c$ on the squared Fisher norm are decayed by a factor of 0.96 every 3200 iterations. The results of this experiment are plotted in Fig. 7.2, and indicate that our method outperforms the standard baseline.

7.6 Automatic construction of the K-FAC computation graph

In recent years, deep learning libraries have moved towards the computational graph abstraction [Bergstra et al., 2010, Abadi et al., 2016] to represent neural network computations. In this section we give a high-level description of an algorithm that scans a computational graph for parameters to which one of the various Kronecker-factored approximations can be applied, locates the nodes containing the information required to compute the second-order statistics used by the approximations, and then constructs a new graph that computes the approximations and uses them to update the parameters.

For the sake of discussion, we will assume the computation graph is a directed bipartite graph that has a set of operator nodes doing some computation, and a set of variable nodes that hold intermediate computational results. The trainable parameters are stored in memory that is loaded or mutated through read/write operator nodes. We also assume that the trainable parameters are grouped layer-wise as sets of weights and biases. Finally, we assume the gradient computation for the trainable parameters is performed by a computation graph (which is usually generated via automatic differentiation).

In analogy to generating the gradient computation graph through automatic differentiation, given an arbitrary computation graph with a set of trainable parameters, we would like to use the existing nodes in the given graph to automatically generate a new computation graph, a "K-FAC computation graph", that computes the Kronecker-factored approximate Fisher blocks associated with each group of parameters (typically layers in a neural net), and then uses them to update the parameters.

To compute the Fisher block for a given layer, we want to find all the nodes holding the gradients of the trainable parameters in a computation graph. One simple strategy is to traverse the computation graph from the gradient nodes to their immediate parent nodes. A set of parameters has a Kronecker-factored approximation to its Fisher block if its corresponding gradient node has a matrix product or convolution operator node as its immediate parent node. For these parameters, the Kronecker factor matrices are the second-order statistics of the inputs to the parent operator node of their gradient nodes (typically the activities $a$ and the back-propagated derivatives $\mathcal{D}z$). For other sets of parameters an exact Fisher block can be computed instead (assuming they have low enough dimension). In a typical neural network, most of the parameters are concentrated in weight matrices that are used for matrix product or convolution operations, for which one of the existing Kronecker-factored approximations applies. Homogeneous coordinates can be used if the weights and biases of the same layer are annotated in the computation graph. The rest of the parameters are often gain and bias vectors for each hidden unit, and it is feasible to compute and invert exact Fisher blocks for these.

Kronecker factors can sometimes be shared by the approximate Fisher blocks of two or more parameters. This is the case, for example, when a vector of units serves as inputs to two different weight-matrix multiplication operations. In such cases, the computation of the second-order statistics can be reused, which is what we do in our implementation.

A neural network can also be instantiated multiple times in a computational graph (with shared parameters) to process different inputs. The gradient of the parameters shared across the instantiations is the sum of the individual gradients from each instantiation. Given such a computation graph, the immediate parent operator node of the gradient is a summation whose inputs are computed by the same type of operators. Without additional knowledge about the computation graph, one approximation is to treat the individual gradient contributions in the summation as statistically independent of each other (similarly to how gradient contributions from multiple spatial locations are treated as independent in the KFC approximation [Grosse and Martens, 2016]). Under this approximation, the Kronecker factors associated with the gradient can be computed by lumping the statistics associated with each of the gradient contributions together.

Our implementation of distributed K-FAC in TensorFlow applies the above strategy to automatically generate K-FAC computation graphs without requiring the user to modify their existing model-definition code. A sketch of the scanning step is given below.
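The scanning step might look like the following; `parent_op` and `op_type` are hypothetical attributes of a generic graph node, not a real TensorFlow API, and the approximation names are placeholders.

```python
def assign_fisher_approximations(grad_nodes):
    """Pick a Fisher-block approximation for each parameter group by
    inspecting the operator that produced its gradient node (a sketch)."""
    plan = {}
    for param, g in grad_nodes.items():
        op = g.parent_op                       # immediate parent operator node
        if op.op_type == 'MatMul':
            plan[param] = 'kronecker'          # standard K-FAC block
        elif op.op_type == 'Conv2D':
            plan[param] = 'kfc'                # Kronecker Factors for Convolution
        elif op.op_type == 'AddN':             # shared parameters: summed gradients
            plan[param] = 'kronecker-lumped'   # lump statistics across contributions
        else:
            plan[param] = 'exact'              # small gain/bias vectors, etc.
    return plan
```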

7.7 Experiments

We experimentally evaluated distributed K-FAC on several large convolutional neural network training tasks involving the CIFAR-10 and ImageNet classification datasets.

Due to computational resource constraints, we used a single GPU server with 8 Nvidia K80 GPUs to simulate a large distributed system. The GPUs were used as gradient workers that computed the gradient over a large mini-batch, with the CPUs acting as a parameter server. The Fisher block inversions were performed on the CPUs in parallel, using as many threads as possible. The second-order statistics required for the various Fisher block approximations were computed either synchronously by the gradient workers after each gradient computation (CIFAR-10 experiments), or asynchronously using a separate dedicated "stats worker" (ImageNet experiments). Meta-parameters such as learning rates, damping parameters, and the decay-rate for the second-order statistics were optimized carefully by hand for each method. The momentum was fixed to 0.9. Similarly to Martens and Grosse [2015], we applied an exponentially decayed Polyak averaging scheme to the sequence of output iterates produced by each method. We found this improved their convergence rate in the later stages of optimization, and reduced or eliminated the need to decay the learning rates.

We chose to base our implementation of distributed K-FAC on the TensorFlow framework [Abadi et al., 2016] because it provides well-engineered and scalable primitives for distributed computation. We implement distributed K-FAC in TensorFlow by scanning the gradient-computing graph for groups of parameters whose gradient computations have particular structures. Having identified such groups, we compute/approximate their Fisher blocks using a method tailored to the type of structure observed. This type of implementation can be applied to existing model-specification code without significant modification of said code. And because TensorFlow's parallel primitives were designed with scalability in mind, it should be possible to scale our implementation to a larger distributed system with hundreds of workers.

7.7.1 CIFAR-10 classification and asynchronous Fisher block inversion

In our first experiment we evaluated the effectiveness of asynchronously computing the approximate Fisher inverses (as described in Section 7.3.1). We considered the effect that this has both on the quality of the updates, as measured by per-iteration progress on the objective, and on the average per-iteration wall-clock time.

The task is to train a basic convolutional network model on the CIFAR-10 image classification dataset [Krizhevsky and Hinton, 2009]. The model has 3 convolutional layers of 32-32-64 filters, each with a receptive field size of 5x5, followed by a softmax layer that predicts 10 classes. This model is similar, but not identical, to the one used by Grosse and Martens [2016]. All the CIFAR-10 experiments use a mini-batch size of 512.

The baseline method is a simple synchronous version of distributed K-FAC with a fixed learning rate, and up to 4 GPUs acting as gradient and stats workers, which recomputes the inverses of the approximate Fisher blocks once every 20 iterations. This baseline method behaves similarly to the implementation of K-FAC in Grosse and Martens [2016], while being potentially faster due to its greater use of parallelism. We compare this baseline to a version of distributed K-FAC where the approximate Fisher blocks are inverted asynchronously and in parallel with the rest of the optimization process. Note that under this scheme, inverses are updated about once every 16 iterations for the single-GPU condition, and every 30 iterations for the four-GPU condition. For networks larger than this relatively small CIFAR-10 net they may get updated (far) less often (e.g. the AlexNet experiments in Section 7.7.3).

The results of this first experiment are plotted in Fig. 7.3. We found that the asynchronous version iterated about 1.5 times faster than the synchronous version, while its per-iteration progress remained comparable. The plots show that the asynchronous version is better at taking advantage of parallel computation and displayed an almost linear speed-up as the number of gradient workers increases to 4.

Figure 7.3: The results from our CIFAR-10 experiment looking at the effectiveness of asynchronously computing the approximate Fisher inverses. gpu indicates the number of gradient workers. Dashed lines denote training curves and solid lines denote test curves. Top row: cross entropy loss and classification error vs the number of updates. Bottom row: cross entropy loss and classification error vs wallclock time.

In terms of the wall-clock time, using only 4 GPUs the asynchronous version of distributed K-FAC is able to complete 700 iterations in under a minute, where it achieves the minimum test error (19%).

7.7.2 ImageNet classification

In our second set of experiments we benchmarked distributed K-FAC against several other popular approaches, and considered the effect of mini-batch size on per-iteration progress. To do this we trained various off-the-shelf convnet architectures for image classification on the ImageNet dataset [Russakovsky et al., 2015]: AlexNet [Krizhevsky et al., 2012e], GoogLeNet InceptionV1 [Szegedy et al., 2014b] and the 50-layer Residual network [He et al., 2015].

Although the ImageNet training set contains 1.2 million images, a data pre-processing pipeline that includes image jittering and aspect distortion is almost always used for training on ImageNet. We used a less extensive dataset augmentation/pre-processing pipeline than is typically used for ImageNet, as the purpose of this study is not to achieve state-of-the-art ImageNet results, but rather to evaluate the optimization performance of distributed K-FAC. In particular, the dataset consists of 224x224 images and during training the original images are first resized to 256x256 and then randomly cropped back down to 224x224 before being fed to the network. Note that while it is typically the case that validation error is higher than training error, this data pre-processing pipeline for ImageNet creates an augmented training set that is more difficult than the undistorted validation set, and therefore the validation error is often lower than the training error during the first 90% of training. This observation is consistent with previously published results [He et al., 2015].

In all our ImageNet experiments, we used the cheaper Kronecker factorization from Section 7.7.3, and the KL-based step-size selection method described in Section 7.5 with parameters $c_0 = 0.01$ and $\zeta = 0.96$. The SGD baselines use an exponential learning rate decay schedule with a decay rate of 0.96. Decaying is applied after each half-epoch for distributed K-FAC and SGD+Batch Normalization, and after every two epochs for plain SGD, which is consistent with the experimental setup of Ioffe and Szegedy [2015a].

Figure 7.4: Optimization performance of distributed K-FAC and SGD training GoogLeNet on ImageNet. Dashed lines denote training curves and solid lines denote validation curves. bz indicates the size of mini-batches. rbz indicates the size of chunks used to assemble the BN updates. Top row: cross entropy loss and classification error vs the number of updates. Bottom row: cross entropy loss and classification error vs wallclock time (in hours). All methods used 4 GPUs, with distributed K-FAC using the 4-th GPU as a dedicated asynchronous stats worker.

GoogLeNet and Batch Normalization

Batch Normalization [Ioffe and Szegedy, 2015a] is a reparameterization of neural networks that can make them easier to train with first-order methods, and has been successfully applied to large ImageNet models. It can be thought of as a modification of the units of a neural network so that each one centers and normalizes its own raw input over the current mini-batch (or subset thereof), after which it applies a separate shift and scaling operation via its own local "bias" and "gain" parameters (which are optimized). These shift and scaling operations can learn to effectively undo the centering and normalization, thus preserving the class of functions that the network can compute. Batch Normalization (BN) is closely related to centering techniques [Schraudolph, 1998], and likely helps for the same reason that they do, which is that the alternative parameterization gives rise to loss surfaces with more favorable curvature properties. The main difference between BN and traditional centering is that BN makes the centering and normalization operations part of the model instead of the optimization algorithm (and thus "backprops" through them when computing the gradient), which helps stabilize the optimization.

Without any changes to the algorithm, distributed K-FAC can be used to train neural networks that have BN layers. The weight-matrix gradient for such layers has the same structure as it does for standard layers, and so Fisher blocks can be approximated using the same set of techniques. The per-unit gain and bias parameters cause a minor complication, but because they are relatively few in number, one can compute an exact Fisher block for each of them.

Computing updates for BN networks over large mini-batches is usually done by splitting the mini-batch into chunks of size 32, computing the gradients separately for these chunks (using only the data in the chunk to compute the mean and variance statistics), and then summing them together. Using small sample sets to compute the statistics like this introduces additional stochasticity into the BN update that acts as a regularizer, but can also hurt optimization performance. To help decouple the effects of regularization and optimization, we also compared to a BN baseline that uses larger chunks. We found that using larger chunks can give a factor of 2 speed-up in optimization performance over the standard BN baseline. In our figures rbz will indicate the chunk size, which defaults to 32 if left unspecified.

In Fig. 7.4, we compare distributed K-FAC to SGD on GoogLeNet with and without BN. All methods used 4 GPUs, with distributed K-FAC using the 4-th GPU as a dedicated asynchronous stats worker.

Figure 7.5: Empirical evaluation of the proposed cheaper Kronecker approximation on GoogLeNet. bz indicates the size of the mini-batches. Dashed lines denote training curves and solid lines denote validation curves. Top row: cross entropy loss and classification error vs the number of updates. Bottom row: cross entropy loss and classification error vs wallclock time.

We observe that the per-iteration progress made by distributed K-FAC on the training objective is not significantly affected by the use of BN. Moreover, distributed K-FAC is 3.5 times faster than SGD with the standard BN baseline (orange line) and 1.5-2 times faster than the enhanced BN baseline (blue line). BN, however, does help distributed K-FAC generalize better, likely due to its aforementioned regularizing effect. For the simplicity of our discussion, distributed K-FAC is not combined with BN in the rest of the experiments, as we are chiefly interested in evaluating optimization performance, not regularization, and BN does not seem to provide any additional benefit to distributed K-FAC with regard to the former. Note that this is not too surprising, given that K-FAC is provably invariant to the kind of centering and normalization transformations that BN performs [Martens and Grosse, 2015].

7.7.3 A cheaper Kronecker factor approximation for convolution layers

In a convolution layer, the gradient is the sum of the outer products between the receptive field input activations $a_t$ and the back-propagated derivatives $\mathcal{D}z_t$ at each spatial location $t \in \mathcal{T}$. One cannot simply apply the standard Kronecker-factored approximation from Martens and Grosse [2015] to each location, sum the results, and then take the inverse, as there is no known efficient algorithm for computing the inverse of such a sum. In Grosse and Martens [2016], a Kronecker-factored approximation for convolutional layers called Kronecker Factors for Convolution (KFC) was developed. It works by introducing additional statistical assumptions about how the weight gradients are related across locations. In particular, KFC assumes spatial homogeneity, i.e. that all locations have the same statistics, and spatially uncorrelated derivatives, which (essentially) means that gradients from any two different locations are statistically independent. This yields the following approximation:

$$
\mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] \approx |\mathcal{T}|\, \mathbb{E}\left[ a_t a_t^{\top} \right] \otimes \mathbb{E}\left[ \mathcal{D}z_t \mathcal{D}z_t^{\top} \right]. \qquad (7.15)
$$
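In a NumPy-style sketch, and assuming the per-location activations and derivatives are available as dense arrays, the KFC factors of Eq. 7.15 are estimated by averaging the per-location second-order statistics:

```python
import numpy as np

def kfc_conv_factors(a, dz):
    """KFC factors (Eq. 7.15), averaging per-location outer products.
    a:  (N, T, C_in)  receptive-field activations at each spatial location
    dz: (N, T, C_out) backpropagated derivatives at each location."""
    N, T, _ = a.shape
    A = np.einsum('nti,ntj->ij', a, a) / (N * T)     # E[a_t a_t^T]
    G = np.einsum('nto,ntp->op', dz, dz) / (N * T)   # E[Dz_t Dz_t^T]
    return T * A, G   # |T| scaling from Eq. 7.15 folded into the first factor
```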

In this section we introduce an arguably simpler Kronecker factored approximation for convolutional layers that is cheaper to compute. In practice, it appears to be competitive with the original KFC approximation in terms of per-iteration progress on the objective, working worse in some experiments and better in others, while (often) improving wall-clock time due to its cheaper cost.


Figure 7.6: Optimization performance of distributed K-FAC and SGD training AlexNet on ImageNet. Dashed lines denote training curves and solid lines denote validation curves. bz indicates the size of the mini-batches. rbz indicates the size of chunks used to assemble the BN updates. Top row: cross entropy loss and validation error vs the number of updates. Bottom row: cross entropy loss and validation error vs wallclock time (in hours). All methods used 8 GPUs, with distributed K-FAC using the 8-th GPU as a dedicated asynchronous stats worker.

It works by approximating the sum of the gradients over spatial locations as the outer product of the averaged receptive field activations $\mathbb{E}_t[a_t]$ and the averaged back-propagated derivatives $\mathbb{E}_t[\mathcal{D}z_t]$, multiplied by the number of spatial locations $|\mathcal{T}|$. In other words:

$$
\mathbb{E}\left[ \mathrm{vec}\{\mathcal{D}W\}\, \mathrm{vec}\{\mathcal{D}W\}^{\top} \right] = \mathbb{E}\left[ \mathrm{vec}\Big\{ \sum_{t \in \mathcal{T}} \mathcal{D}z_t a_t^{\top} \Big\}\, \mathrm{vec}\Big\{ \sum_{t \in \mathcal{T}} \mathcal{D}z_t a_t^{\top} \Big\}^{\top} \right] \qquad (7.16)
$$

$$
= \mathbb{E}\left[ \Big( \sum_{t \in \mathcal{T}} a_t \otimes \mathcal{D}z_t \Big) \Big( \sum_{t \in \mathcal{T}} a_t \otimes \mathcal{D}z_t \Big)^{\top} \right] \qquad (7.17)
$$

$$
\approx \mathbb{E}\left[ \Big( |\mathcal{T}|\, \mathbb{E}_t[a_t] \otimes \mathbb{E}_t[\mathcal{D}z_t] \Big) \Big( |\mathcal{T}|\, \mathbb{E}_t[a_t] \otimes \mathbb{E}_t[\mathcal{D}z_t] \Big)^{\top} \right] \qquad (7.18)
$$

Under the approximation assumption that the second-order statistics of the average activations $\mathbb{E}_t[a_t]$ and the second-order statistics of the average derivatives $\mathbb{E}_t[\mathcal{D}z_t]$ are uncorrelated, this becomes:

$$
|\mathcal{T}|^2\, \mathbb{E}\left[ \mathbb{E}_t[a_t]\, \mathbb{E}_t[a_t]^{\top} \right] \otimes \mathbb{E}\left[ \mathbb{E}_t[\mathcal{D}z_t]\, \mathbb{E}_t[\mathcal{D}z_t]^{\top} \right] \qquad (7.19)
$$

This approximation is cheaper than the original KFC approximation because it is easier to compute a single outer product (after averaging over locations) than it is to compute an outer product at each location and then average. In the synchronous setting, for the large convolutional networks we experimented with, this trick resulted in a 20-30% decrease in overall wall clock time per iteration, with little effect on per-iteration progress.
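For contrast with the KFC sketch above, the cheaper factorization of Eq. 7.19 averages over locations first and forms a single outer product per case:

```python
import numpy as np

def cheap_conv_factors(a, dz):
    """Cheaper Kronecker factors (Eq. 7.19): average over locations first.
    a:  (N, T, C_in); dz: (N, T, C_out), as in the KFC sketch."""
    N, T, _ = a.shape
    a_bar = a.mean(axis=1)      # E_t[a_t] per training case
    dz_bar = dz.mean(axis=1)    # E_t[Dz_t] per training case
    A = T * (a_bar.T @ a_bar) / N
    G = T * (dz_bar.T @ dz_bar) / N
    return A, G   # the |T|^2 scaling of Eq. 7.19 split evenly between factors
```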

AlexNet and the doubly-factored Kronecker approximation

To demonstrate that distributed K-FAC can efficiently optimize models with very wide layers, we train AlexNet using distributed K-FAC and compare to SGD+BN. The doubly-factored Kronecker approximation proposed in Section 7.4 is applied to the first fully-connected layer of AlexNet, which has 9216 input units and is thus too wide for the standard Kronecker approximation to be feasible. Note that even with this additional approximation, computing all of the Fisher block inverses for AlexNet is very expensive, and in our experiments they only get updated once every few hundred iterations by our 16-core Xeon 2.2GHz CPU. The results from this experiment are plotted in Fig. 7.6. They show that distributed K-FAC still works well despite the potentially extreme staleness of the Fisher block inverses, speeding up training by a factor of 1.5 over the improved SGD+BN baseline.

Figure 7.7: Optimization performance of distributed K-FAC and SGD training ResNet50 on ImageNet. Dashed lines are the training curves and solid lines are the validation curves. bz indicates the size of mini-batches. rbz indicates the size of chunks used to assemble the BN updates. Top row: cross entropy loss and classification error vs the number of updates. Bottom row: cross entropy loss and classification error vs wallclock time (in hours). All methods used 8 GPUs, with distributed K-FAC using the 8-th GPU as a dedicated asynchronous stats worker.

Very deep architectures (ResNets)

In recent years very deep convolutional architectures have been successfully applied to ImageNet classification. These networks are particularly challenging to train because the usual difficulties associated with deep learning are especially severe. Fortunately second-order optimization is perhaps ideally suited to addressing these difficulties in a robust and principled way [Martens, 2010].

To investigate whether distributed K-FAC can scale to such architectures and provide useful acceleration, we compared it to SGD+BN using the 50-layer ResNet architecture [He et al., 2015]. The results from this experiment are plotted in Fig. 7.7. They show that distributed K-FAC provides a significant speed-up during the early stages of training compared to SGD+BN.

Mini-batch size scaling properties

In our final experiment we explored how well distributed K-FAC scales as additional parallel computing resources become available. To do this we trained GoogLeNet with varying mini-batch sizes of {256, 1024, 2048}, and measured per-training-case progress. Ideally, if extra gradient data is being used efficiently, one should expect the per-training-case progress to remain relatively constant with respect to mini-batch size. The results from this experiment are plotted in Fig. 7.8, and show that distributed K-FAC exhibits something close to this ideal behavior, while SGD+BN rapidly loses data efficiency when moving beyond a mini-batch size of 256. These results suggest that distributed K-FAC, more so than the SGD+BN baseline, is capable of speeding up training in proportion to the amount of parallel computational resources used.

Figure 7.8: The comparison of distributed K-FAC and SGD on per-training-case progress on training loss and errors. The experiments were conducted using GoogLeNet with various mini-batch sizes.

7.8 Summary

We have introduced distributed K-FAC, an asynchronous distributed second-order optimization algorithm which computes Kronecker-factored Fisher approximations and stochastic gradients over larger mini-batches asynchronously and in parallel. Our experiments show that the extra overhead introduced by distributed K-FAC is mostly mitigated by the use of parallel asynchronous computation, resulting in updates that can be computed in a similar amount of time to those of distributed SGD, while making much more progress on the objective function per iteration. We showed that in practice this can lead to speedups of roughly 3.5x compared to standard SGD + Batch Normalization (BN), and 2x compared to SGD + an improved version of BN on large-scale convolutional network training tasks. We also proposed a doubly-factored Kronecker approximation that allows distributed K-FAC to scale up to large models with hundreds of millions of parameters, and demonstrated the effectiveness of this approach in experiments. Finally, we showed that distributed K-FAC enjoys a favorable scaling property with mini-batch size that is seemingly not shared by SGD+BN. In particular, we showed that per-iteration progress tends to be proportional to the mini-batch size up to a much larger threshold than for SGD+BN. This suggests that it will yield even further reductions in total wall-clock training time when implemented in a larger distributed system than the one we considered.

Chapter 8

Conclusion

The focus of this thesis is to develop attention-based deep learning models, illustrated through examples from computer vision tasks. Using the recurrent attention models of Chapter 3 as our starting point, we introduce several computational models of attention in the context of neural networks, applied to a variety of sequence learning problems. The overall development is two-fold: first, we design attention-based neural networks from a probabilistic inference perspective; we then derive improved learning algorithms for these models. By studying these attention-based neural networks, we have gained key insights into which information the model is using to make its decisions and, at the same time, can substantially improve the computational efficiency of neural networks.

8.1 Attention-based neural networks

Attention-based neural networks are a new class of supervised learning models. Unlike traditional CNNs, visual attention-based models can reduce the number of parameters and computational operations by selecting informative regions within an image to focus on. In Chapter 3, we formally establish the connection between variational inference and learning stochastic recurrent attention-based neural networks. The DRAM model obtains high performance on the Street View House Numbers (SVHN) recognition task, comparable to the previous state-of-the-art CNN approaches. This was the first time a very different, sequential computational model outperformed deep CNNs.

Although DRAM models have shown promise in scalability, they remain difficult to train because of intractable posterior inference over the sequences of glimpse locations and high variance in the stochastic gradient estimates. Borrowing techniques from the literature on training deep generative models and probabilistic inference, we developed a new learning algorithm using weighted importance sampling for training stochastic attention networks. This approach optimizes an improved lower bound on the log-likelihood and obtains better approximate inference compared to the previous variational approximation. This also reduces the variability in the stochastic gradients. Empirically, the improvements speed up the training time for stochastic attention networks.

In Chapter 4, we apply the attention-based models to automatically generate image captions and highlight which image region was relevant for each word in the caption. We demonstrate that attention can also add a degree of interpretability to current deep neural networks, as one can understand

what signals the algorithm is using by examining where it is attending. Our method shows significant improvement over simpler, feed-forward CNNs for generating richer and more detailed captions.

Beyond processing large input images or long input sequences, we studied in Chapter 5 how an attention mechanism can also be used to enhance temporary storage in recurrent neural networks (RNNs) by selectively attending to relevant recent memories. We show that a form of self-attention, which acts as a fast associative memory, can be combined with a standard RNN. Fast weights, therefore, provide a "temporal cache" to implement recursion, extending the very limited memory capacity of standard RNNs by adding the high-capacity temporary storage that is needed to integrate perception over many glimpses or to achieve powerful reasoning capabilities.

8.2 Stochastic optimization

Stochastic optimization is at the core of modern deep learning systems. Although we have improved our neural network models over the years, they are still trained with various simple SGD-like algorithms because, unlike most second-order methods, SGD scales well to millions of training cases and millions of model parameters. In Chapter 6, we present a simple improvement to standard SGD, Adam, that adapts the learning rate using the second moment of the stochastic gradients. Adam has minimal computational overhead and only requires first-order gradient information. Our method has the advantages that its parameter updates are invariant to gradient rescaling, and that it naturally performs adaptive learning rate annealing, which helps stabilize convergence toward the end of learning. We show that the default hyper-parameters of our method, including the learning rate, work robustly across most deep learning models and datasets.

Second-order optimization methods, such as natural gradient, benefit more from well-estimated gradients and can therefore make better use of larger mini-batches. However, in practice, they are computationally intractable for large-scale deep learning models. In Chapter 7, we investigate how a distributed asynchronous natural gradient method can alleviate this computational cost. The method is designed to combine the lower overhead of distributed SGD with the superior scaling to large mini-batches of Kronecker-factored natural gradient. The second-order information is incorporated asynchronously, thus mitigating the additional overheads. In the experiments, distributed K-FAC shows a 3.5x speedup compared to SGD with batch normalization in training state-of-the-art deep neural networks.

8.3 Future directions

Despite the new progress in deep learning discussed in this thesis, ever-growing dataset sizes with in- creasing complexity will pose significant challenges to the current machine learning methods in the next decade. Here, we propose several future directions to improve the learning algorithms.

8.3.1 Variance reduction and learning

In a learning problem, achieving a certain loss under a fixed computation budget depends on various factors, e.g. dataset size, input dimensionality, model complexity and the number of training samples to estimate the gradient at every step [Bottou, 2010]. One of the most important factors affecting the Chapter 8. Conclusion 101 performance of the SGD-like optimization algorithm is the variance of the gradient estimated from the mini-batches. Various works have studied the trade-offs for a range of mini-batch sizes [Goyal et al., 2017, Masters and Luschi, 2018]. However, such trade-offs are not yet clear for learning the large-scale neural networks used in practice. Another important factor affecting the variance trade-offs is the variability in training examples. Consider two illustrative examples. Suppose we have an objective function that could have zero gradients at some inputs, e.g. hinge loss for binary classification. For inputs at which the gradient of the objective is zero, there is no contribution made to the gradient estimator. In this case, using large mini-batches is inefficient because it spends the same amount of time on computing the gradient of any data point. The second example is when we have exact duplicate training data. We can save computation time by removing all duplicates except one. Although not necessarily desirable, we can keep the gradient estimates unbiased by always multiplying the gradient of the remaining example proportional to the number of duplicates. In both cases, a smart learning algorithm would identify the inappropriate training examples from the training set, excluding them from sampling. Such an algorithm would achieve a lower variance in the gradient estimates under the fixed computation cost and hence faster convergence.

8.3.2 Beyond maximum likelihood learning

Maximum likelihood estimation is the preferred learning framework in most of the scenarios. While it has many desired properties, maximum likelihood learning of a mapping between two domains requires paired information or alignment between the input vectors and the label data. This is undesirables as most of the training data are unaligned in the wild. Another successful framework for learning from the data distribution is Generative Adversarial Networks (GANs)[Goodfellow et al., 2014]. In the absence of alignment, GANs aims at learning a deterministic neural network mapping from an input distribution, e.g. Gaussian, to an output distribution, which is only represented by their empirical samples. In order to measure the mapping quality, various metrics between the probability distributions have been proposed. While the MLE learning can be viewed as minimizing the KL divergence between the empirical training data and model distribution, the Jensen-Shannon distance in the original work on GANs [Goodfellow et al., 2014] can be viewed as minimizing the KL divergence in the reverse order, that is between the model distribution and the empirical training data. Later works have proposed other metrics such as f-divergences [Nowozin et al., 2016]. Recently, the introduction of Wasserstein distance [Arjovsky et al., 2017] as a metric for GANs re-surged the interest in the field of optimal transport [Villani, 2008]. Arjovsky et al. [2017] provides a game-representation for their proposed Wasserstein GAN formulation based on the dual form of the resulting optimal transport problem. The optimal transport framework is an elegant solution addressing the distribution learning problem from samples without alignment. From an optimization perspective, minimizing the optimal transport distance remains challenging. Developing efficient neural network learning algorithms for large-scale optimal transport problems remains an important future research direction for unsupervised learning, self-supervised learning and representation learning. Bibliography

Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

LF Abbott and Wade G Regehr. Synaptic computation. Nature, 431(7010):796–803, 2004.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. arXiv preprint arXiv:1512.02595, 2015.

James A Anderson and Geoffrey E Hinton. Models of information processing in the brain. Parallel models of associative memory, pages 9–48, 1981.

Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

Martin Arjovsky, Soumith Chintala, and L´eon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.

J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In International Conference on Learning Representations, 2015.

J. Ba, R. Kiros, and G. Hinton. Layer normalization. arXiv:1607.06450, 2016.

Jimmy Lei Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv:1412.7755 [cs.LG], December 2014.

Philip Bachman and Doina Precup. Data generation as sequential decision making. In NIPS, 2015.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015a.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015b.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs.CL], September 2014.

102 BIBLIOGRAPHY 103

Pierre Baldi and Peter Sadowski. The dropout learning algorithm. Artificial intelligence, 210:78–122, 2014.

Omri Barak and Misha Tsodyks. Persistent activity in neural networks with dynamic synapses. PLoS Comput Biol, 3(2):e35, 2007.

James Bergstra, Olivier Breuleux, Fr´ed´eric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Des- jardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A cpu and gpu math compiler in python. In Proc. 9th Python in Science Conf, pages 1–7, 2010.

Guo-qiang Bi and Mu-ming Poo. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. The Journal of neuroscience, 18(24): 10464–10472, 1998.

Antoine Bordes, L´eonBottou, and Patrick Gallinari. Sgd-qn: Careful quasi-newton stochastic gradient descent. Journal of Machine Learning Research, 10(Jul):1737–1754, 2009.

J. Bornschein and Y. Bengio. Reweighted wake-sleep. arXiv:1406.2751, 2014.

L´eonBottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMP- STAT’2010, pages 177–186. Springer, 2010.

Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv:1509.00519, 2015.

Richard H Byrd, SL Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

Xinlei Chen and C Lawrence Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.

Kyunghyun Cho, Bart Van Merri¨enboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014a.

Kyunghyun Cho, Bart van Merrienboer, C¸aglar G¨ul¸cehre,Dzmitry Bahdanau, Fethi Bougares, Hol- ger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014b.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine transla- tion. In EMNLP, October 2014c.

Minhyung Cho, Chandra Dhir, and Jaehyung Lee. Hessian-free optimization for learning deep multidi- mensional recurrent neural networks. In Advances in Neural Information Processing Systems, pages 883–891, 2015.

Tim Cooijmans, Nicolas Ballas, C´esar Laurent, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016. BIBLIOGRAPHY 104

Maurizio Corbetta and Gordon L Shulman. Control of goal-directed and stimulus-driven attention in the brain. Nature reviews neuroscience, 3(3):201–215, 2002.

Frank Curtis. A self-correcting variable-metric algorithm for stochastic optimization. In Proceedings of The 33rd International Conference on Machine Learning, pages 632–641, 2016.

Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short- term memory. arXiv preprint arXiv:1602.03032, 2016.

P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7: 889–904, 1995.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.

Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, et al. Recent advances in deep learning for speech research at microsoft. ICASSP 2013, 2013.

Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 2012.

Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.

Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.

Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.

Jeff Donahue, Lisa Anne Hendrikcs, Segio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389v2 [cs.CV], November 2014.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Desmond Elliott and Frank Keller. Image description using visual dependency representations. In EMNLP, pages 1292–1302, 2013.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Doll´ar,Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, et al. From captions to visual concepts and back. arXiv:1411.4952 [cs.CV], November 2014.

Elizabeth Gardner. The space of interactions in neural network models. Journal of physics A: Mathe- matical and general, 21(1):257, 1988. BIBLIOGRAPHY 105

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information pro- cessing systems, pages 2672–2680, 2014.

Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit num- ber recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.

Priya Goyal, Piotr Doll´ar,Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2014.

Alex Graves. Generating sequences with recurrent neural networks. Technical report, arXiv preprint arXiv:1308.0850, 2013a.

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013b.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to trans- duce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1819– 1827, 2015.

K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: a recurrent neural network for image generation. arXiv:1502.04623, 2015a.

Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015b.

Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015c.

Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, 2010.

Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.

Roger Grosse and Ruslan Salakhutdinov. Scaling up natural gradient by factorizing fisher information. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. BIBLIOGRAPHY 106

Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy, and Martin Tak´aˇc.Large scale distributed hessian-free optimization for deep neural network. arXiv preprint arXiv:1606.00511, 2016.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suley- man, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.

Tom Heskes. On “natural” learning and pruning in multilayered perceptrons. Neural Computation, 12 (4):881–901, 2000.

G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012a.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012b.

Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society, pages 177–186. Erlbaum, 1987.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdi- nov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012c.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997a.

Sepp Hochreiter and J¨urgenSchmidhuber. Long short-term memory. Neural computation, 9(8):1735– 1780, 1997b.

Sepp Hochreiter and J¨urgenSchmidhuber. Long short-term memory. Neural computation, 9(8):1735– 1780, 1997c.

Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, pages 853–899, 2013.

John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.

Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015a.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015b. BIBLIOGRAPHY 107

L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions of Pattern Analysis and Machine Intelligence, 20(11):1254–59, November 1998.

Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.

T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In Interna- tional Conference on Computer Vision, 2009.

Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics, 2013.

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306, 2014.

Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306 [cs.CV], December 2014.

Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Nitish Shirish Keskar and Albert S Berahas. adaqn: An adaptive quasi-newton algorithm for training rnns. arXiv preprint arXiv:1511.01169, 2015.

D. Kingma and J. L. Ba. Adam: a method for stochastic optimization. ICLR, 2014a. arXiv:1412.6980.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR 2015, 2014b.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG], December 2014c.

Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In The 2nd International Conference on Learning Representations (ICLR), 2013.

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014a.

Durk P. Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014b.

Ryan Kiros. Training neural networks with stochastic hessian-free optimization. arXiv preprint arXiv:1301.3641, 2013.

Ryan Kiros, Ruslan Salahutdinov, and Richard Zemel. Multimodal neural language models. In Interna- tional Conference on Machine Learning, pages 595–603, 2014a.

Ryan Kiros, Ruslan Salakhutdinov, and Richard Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539 [cs.LG], November 2014b.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014c. BIBLIOGRAPHY 108

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014d.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In NIPS, 2015.

Teuvo Kohonen. Correlation matrix memories. Computers, IEEE Transactions on, 100(4):353–359, 1972.

A. Krizhevsky, I. Sutskever, , and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012a.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. , University of Toronto, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012b.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In NIPS. 2012c.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012d.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012e.

Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Babytalk: Understanding and generating simple image descriptions. PAMI, IEEE Transactions on, 35(12):2891–2903, 2013.

Polina Kuznetsova, Vicente Ordonez, Alexander C Berg, Tamara L Berg, and Yejin Choi. Collective generation of natural image descriptions. In Association for Computational Linguistics: Long Papers, pages 359–368. Association for Computational Linguistics, 2012.

Polina Kuznetsova, Vicente Ordonez, Tamara L Berg, and Yejin Choi. Treetalk: Composition and compression of trees for image descriptions. TACL, 2(10):351–362, 2014.

Hugo Larochelle and Geoffrey E Hinton. Learning to combine foveal glimpses with a third-order boltz- mann machine. In NIPS, pages 1243–1251, 2010a.

Hugo Larochelle and Geoffrey E. Hinton. Learning to combine foveal glimpses with a third-order boltz- mann machine. In NIPS, 2010b.

Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, volume 6, page 622, 2011.

C´esarLaurent, Gabriel Pereyra, Phil´emonBrakel, Ying Zhang, and Yoshua Bengio. Batch normalized recurrent neural networks. arXiv preprint arXiv:1510.01378, 2015.

Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015. BIBLIOGRAPHY 109

Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in neural information processing systems, pages 849–856, 2008.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recogni- tion. Proceedings of the IEEE, 86(11):2278–2324, 1998a.

Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.

Yann LeCun, L´eonBottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998b.

Siming Li, Girish Kulkarni, Tamara L Berg, Alexander C Berg, and Yejin Choi. Composing simple image descriptions using web-scale n-grams. In Computational Natural Language Learning, pages 220–228. Association for Computational Linguistics, 2011.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. ECCV, 2014a.

T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar,and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014b.

Marcus Liwicki and Horst Bunke. Iam-ondb-an on-line english sentence database acquired from hand- written text on a whiteboard. In ICDAR, 2005.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics, 2011.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv:1412.6632 [cs.CV], December 2014.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zam- parelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval-2014, 2014.

Henry Markram, Joachim L¨ubke, Michael Frotscher, and Bert Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic aps and epsps. Science, 275(5297):213–215, 1997.

James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 735–742, 2010.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2408–2417, 2015. BIBLIOGRAPHY 110

James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479–535. Springer, 2012.

Dominic Masters and Carlo Luschi. Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612, 2018.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daum´eIII. Midge: Generating image descriptions from computer vision detections. In European Chapter of the Association for Computational Linguistics, pages 747– 756. Association for Computational Linguistics, 2012.

A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.

V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Neural Information Processing Systems, 2014a.

Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. arXiv preprint arXiv:1406.6247, 2014b.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic L-BFGS algo- rithm. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 249–258, 2016.

Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 4, 2011.

Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2413–2421, 2015.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016.

Yann Ollivier. Riemannian metrics for neural networks i: feedforward networks. arXiv preprint arXiv:1303.0818, 2013.

J. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012. BIBLIOGRAPHY 111

Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, 2004.

Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115–124, 2005.

Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. In ICLR, 2014.

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of dnns with natural gradient and parameter averaging. arXiv preprint arXiv:1410.7455, 2014.

Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of DNNs with natural gradient and parameter averaging. In International Conference on Learning Representations: Workshop track, 2015.

Vivek Ramamurthy and Nigel Duffy. L-SR1: A novel second order optimization method for deep learning.

Ronald A Rensink. The dynamic representation of scenes. Visual cognition, 7(1-3):17–42, 2000.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. Technical report, arXiv:1401.4082, 2014.

Nicolas L Roux and Andrew W Fitzgibbon. A fast natural newton method. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 623–630, 2010.

David E Rumelhart, Geoffrey E Hintont, and Ronald J Williams. Learning representations by back- propagating errors. Nature, 323(6088):533–536, 1986.

David Ruppert. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.

Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. arXiv preprint arXiv:1206.1106, 2012. BIBLIOGRAPHY 112

J Schmidhuber. Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In ICANN93, pages 460–463. Springer, 1993.

Nicol N. Schraudolph. Centering neural network gradient factors. In Genevieve B. Orr and Klaus-Robert M¨uller,editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 207–226. Springer Verlag, Berlin, 1998.

Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 2002.

Nicol N Schraudolph, Jin Yu, Simon G¨unter, et al. A stochastic quasi-newton method for online convex optimization. In AISTATS, volume 7, pages 436–443, 2007.

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In INTERSPEECH, pages 1058–1062, 2014.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014a.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recog- nition. arXiv preprint arXiv:1409.1556, 2014b.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.

Jasper Snoek, Kevin Swersky, Richard S Zemel, and Ryan P Adams. Input warping for bayesian opti- mization of non-stationary functions. arXiv preprint arXiv:1402.0929, 2014.

Jascha Sohl-Dickstein, Ben Poole, and Surya Ganguli. Fast large-scale optimization by unifying stochas- tic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Ma- chine Learning (ICML-14), pages 604–612, 2014.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video repre- sentations using LSTMs. In ICML, 2015.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014a.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014b.

Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014c. BIBLIOGRAPHY 113

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014a.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014b.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.

Y. Tang and R. Salakhutdinov. Learning stochastic feedforward neural networks. In Neural Information Processing Systems, 2013.

Yichuan Tang, Nitish Srivastava, and Ruslan R Salakhutdinov. Learning generative models with visual attention. In NIPS, pages 1808–1816, 2014.

The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Anger- mueller, Dzmitry Bahdanau, Nicolas Ballas, Fr´ed´ericBastien, Justin Bayer, Anatoly Belikov, et al. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.

T. Tieleman and G Hinton. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learn- ing. Technical report, 2012a.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp. Technical report, 2012b.

Misha Tsodyks, Klaus Pawelzik, and Henry Markram. Neural networks with dynamic synapses. Neural computation, 10(4):821–835, 1998.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. ICLR, 2016.

C´edricVillani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Oriol Vinyals and Daniel Povey. Krylov subspace descent for deep learning. In AISTATS, pages 1261– 1268, 2012.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. arXiv:1411.4555 [cs.CV], November 2014a.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014b.

Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. CVPR, 2016.

Sida Wang and Christopher Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 118–126, 2013.

Xiao Wang, Shiqian Ma, and Wei Liu. Stochastic quasi-newton methods for nonconvex stochastic optimization. arXiv preprint arXiv:1412.1196, 2014. BIBLIOGRAPHY 114

Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI’2001, pages 538–545, 2001a.

Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 538–545. Morgan Kaufmann Publishers Inc., 2001b.

Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language resources and evaluation, 2005.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992a.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992b.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992c.

David J Willshaw, O Peter Buneman, and Hugh Christopher Longuet-Higgins. Non-holographic asso- ciative memory. Nature, 1969.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

Yezhou Yang, Ching Lik Teo, Hal Daum´eIII, and Yiannis Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, pages 444–454. Association for Computational Linguistics, 2011.

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. arXiv preprint arXiv:1502.08029, April 2015.

W. Zaremba and I. Sutskever. Reinforcement learning neural Turing machines. arXiv:1505.00521, 2015.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, September 2014.

Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 2003.

Robert S Zucker and Wade G Regehr. Short-term synaptic plasticity. Annual review of physiology, 64 (1):355–405, 2002.