Towards Lifelong Reasoning with Sparse and Compressive Memory Systems
Jack William Rae
PhD Thesis
CoMPLEX, University College London

Declaration

I, Jack William Rae, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Abstract

Humans have a remarkable ability to remember information over long time horizons. When reading a book, we build up a compressed representation of the past narrative, such as the characters and events that have built up the story so far. We can do this even if they are separated from the current text by thousands of words, or by long stretches of time between readings. During our lives, we build up and retain memories that tell us where we live, what we have experienced, and who we are.

Adding memory to artificial neural networks has been transformative in machine learning, allowing models to extract structure from temporal data and more accurately model the future. However, the capacity for long-range reasoning in current memory-augmented neural networks is considerably limited in comparison to humans, despite access to powerful modern computers.

This thesis explores two prominent approaches towards scaling artificial memories to lifelong capacity: sparse access and compressive memory structures. With sparse access, only a very small subset of pertinent memory is inspected, retrieved, and updated. Sparse memory access is found to be beneficial for learning, allowing for improved data-efficiency and improved generalisation. From a computational perspective, sparsity allows scaling to memories with millions of entities on a simple CPU-based machine. It is shown that memory systems that compress the past to a smaller set of representations reduce redundancy, can speed up the learning of rare classes, and can improve upon classical data structures in database systems. Compressive memory architectures are also devised for sequence prediction tasks and are observed to significantly advance the state of the art in modelling natural language.

Impact Statement

This study investigates memory structures for artificial neural networks that can scale to longer periods of time and larger quantities of data than previously demonstrated possible. It is demonstrated that neural networks can learn to store and retrieve salient information over hundreds of thousands of time-steps, an improvement of around two orders of magnitude over prior work. This is achieved using efficient sparse memory access operations and compressive memory data structures to reduce the computational cost of accessing memory and the amount of space required to store it.

The long-term goal for this work is to embed these memory structures within a complex agent, so it can reason over the stream of data it experiences during its lifetime. This could be a robotic agent which acts in the real world, or a dialogue agent which can reason over past conversations and auxiliary text context (e.g. the web). This study demonstrates that better long-range memory architectures can be used to improve state-of-the-art performance in question answering (Chapter 3), database systems (Chapter 5), and the modelling of language and speech (Chapters 4 and 6).

Acknowledgements

I dedicate this work to my wife Laura and to my son William.
I would like to thank my family and friends for their support throughout this project, in particular my mother, who has given considerable guidance along the way.

I would like to thank my supervisor Tim Lillicrap, both for introducing me to the topics of scalable and compressive memory systems, and for guiding me through the research process over the past couple of years. His consistent energy and creativity have propelled me through many patches of uncertainty, and it has been a wonderful experience to work with him. I would also like to thank my supervisor Peter Dayan for his wider perspective spanning multiple fields. Peter's detailed, analytical process of thinking and questioning has shaped the structure of my research.

I would like to thank Yori Zwols, my manager throughout the majority of my doctoral studies. Yori helped me to select research projects that suited the self-contained nature of doctoral study, whilst being relevant for DeepMind's mission of solving intelligence.

I would like to thank Alex Graves and Greg Wayne for their mentorship. Alex has introduced me to many ideas in the space of memory and sequence modelling. He has also taught me how to write, indirectly via his own works and directly via his astute feedback. Greg has taught me to select projects that can be tentatively validated or disproved within a few weeks of work, which has proven to be one of the most important pieces of advice in the uncertain game of research. I would like to thank Demis Hassabis and Joel Leibo for initially introducing me to memory at DeepMind in 2014 and piquing my interest through their own innovative neuroscience-based memory research. Several research leads at DeepMind have overseen these research projects, providing resources and guidance, namely Daan Wierstra, Oriol Vinyals, and Koray Kavukcuoglu.

For work on sparse attention, I would like to thank Jonathan Hunt, who collaborated closely on this project. Jonny provided ideas for sparsifying the DNC's link matrix and instrumented the bAbI evaluation. I would also like to thank Malcolm Reynolds, Fumin Wang and Yori Zwols for their advice and comments.

For work on compressive memory systems for classification, I would like to thank Chris Dyer and Gabor Melis for providing invaluable direction in getting set up with language modelling. I would also like to thank Oriol Vinyals, Pablo Sprechmann, Siddhant Jayakumar, Charles Blundell and Adam Santoro for a multitude of enlightening comments and suggestions.

For work on the Neural Bloom Filter, I would like to thank my collaborator Sergey Bartunov, who implemented the database experiments and made several architectural contributions. I thank Andras Gyorgy and Tor Lattimore for their input on the general space complexity of approximate set membership from an information-theoretic perspective.

For work on the Compressive Transformer, I would like to thank my collaborator Anna Potapenko, who ran several experiments and developed the PG-19 dataset. I would also like to thank Siddhant Jayakumar for advice and experiments on the memory maze tasks, and Chloe Hillier for management of the research project and help in open-sourcing the model code and dataset. I thank Tim Lillicrap for his numerous inputs on model design, experiment design, and the manuscript. I thank Chris Dyer, Felix Gimeno, and Koray Kavukcuoglu for reviewing the text.
I also thank Peter Dayan, Adam Santoro, Jacob Menick, Emilio Parisotto, Hyunjik Kim, Simon Osindero, Sergey Bartunov, David Raposo, and Daan Wierstra for ideas regarding model design. I thank Yazhe Li and Aaron van den Oord for their help and advice in instrumenting the speech modelling experiments. Finally, I would like to thank the long list of unnamed DeepMind colleagues and UCL staff who have provided library code, talks, discussions, feedback, and endless enthusiasm and inspiration.

Contents

1 Introduction
  1.1 Limitations of Existing Approaches
  1.2 Overview of Contributions

2 Background
  2.1 Memory in Animals
    2.1.1 Tulving's Taxonomy for Long-Term Memories
    2.1.2 The Hippocampus
    2.1.3 Memory as a Complementary Learning System
  2.2 Data Storage in Computing
    2.2.1 Hashing
    2.2.2 Nearest Neighbour Search
    2.2.3 Set Membership
    2.2.4 Bloom Filter
  2.3 Memory-Augmented Neural Networks
    2.3.1 Hopfield Networks
    2.3.2 Recurrent Neural Networks
    2.3.3 Gated Recurrent Neural Networks
    2.3.4 Attention
    2.3.5 Memory Networks
    2.3.6 Neural Turing Machines
    2.3.7 Neural Stacks
    2.3.8 Differentiable Neural Computer
    2.3.9 Transformer
    2.3.10 Non-parametric Classification
  2.4 Memory Tasks
    2.4.1 Selective Processing of a Stream of Numbers
    2.4.2 Algorithmic Tasks
    2.4.3 Language Modelling
    2.4.4 Question Answering

3 Scaling Memory with Sparsity
  3.1 Motivation
  3.2 Architecture
    3.2.1 Read
    3.2.2 Sparse Attention
    3.2.3 Write
    3.2.4 Setting the Memory Size N vs T
    3.2.5 Controller
    3.2.6 Efficient backpropagation through time
    3.2.7 Approximate nearest neighbours
  3.3 Time and space complexity
    3.3.1 Initialisation
    3.3.2 Complexity of read
    3.3.3 Complexity of write
    3.3.4 Content-based addressing in O(log N) time
    3.3.5 Optimality
  3.4