Continual Reinforcement Learning with Memory at Multiple Timescales
Imperial College
Department of Computing

Continual Reinforcement Learning with Memory at Multiple Timescales

Christos Kaplanis

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the University of London and the Diploma of Imperial College, April 2020

Declaration of Originality and Copyright

This is to certify that this thesis was composed solely by myself. Except where it is stated otherwise by reference or acknowledgment, the work presented is entirely my own.

The copyright of this thesis rests with the author. Unless otherwise indicated, its contents are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA). Under this licence, you may copy and redistribute the material in any medium or format. You may also create and distribute modified versions of the work. This is on the condition that you credit the author, do not use it for commercial purposes, and share any derivative works under the same licence. When reusing or sharing this work, ensure you make the licence terms clear to others by naming the licence and linking to the licence text. Where a work has been adapted, you should indicate that the work has been changed and describe those changes. Please seek permission from the copyright holder for uses of this work that are not included in this licence or permitted under UK Copyright Law.

Abstract

In the past decade, with increased availability of computational resources and several improvements in training techniques, artificial neural networks (ANNs) have been rediscovered as a powerful class of machine learning methods, featuring in several groundbreaking applications of artificial intelligence. Most of these successes have been achieved in stationary, confined domains, such as game playing [MKS+15, SSS+17] and image recognition [KSH12], but, ultimately, we want to apply artificial intelligence to problems that require it to interact with the real world, which is both vast and nonstationary. Unfortunately, ANNs have long been known to suffer from the phenomenon of catastrophic forgetting [MC89], whereby, in a setting where the data distribution is changing over time, new learning can lead to an abrupt erasure of previously acquired knowledge. The resurgence of ANNs has led to an increased urgency to solve this problem and endow them with the capacity for continual learning [Rin94], which refers to the ability to build on their knowledge over time in environments that are constantly evolving.

The most common setting for evaluating continual learning approaches to date has been training on a number of distinct tasks in sequence, and as a result many of these approaches use knowledge of task boundaries to consolidate knowledge during training [RE13, KPR+17, ZPG17]. In the real world, however, changes to the distribution may occur more gradually and at times that are not known in advance. The goal of this thesis has been to develop continual learning approaches that can cope with both discrete and continuous changes to the data distribution, without any prior knowledge of the nature or timescale of the changes.
I present three new methods, all of which involve learning at multiple timescales, and evaluate them in the context of deep reinforcement learning, a paradigm that combines reinforcement learning [SB98] with neural networks and provides a natural testbed for continual learning, since (i) it involves interacting with an environment, and (ii) it can feature nonstationarity at unpredictable timescales during training of a single task. The first method is inspired by the process of synaptic consolidation in the brain and involves multi-timescale memory at the level of the parameters of the network; the second extends the first by directly consolidating the agent's policy over time, rather than its individual parameters; finally, the third approach extends the experience replay database [Lin92, MKS+15], which typically maintains a buffer of the agent's most recent experiences in order to decorrelate them during training, by enabling it to store data over multiple timescales.

Acknowledgements

I'd like to express my immense gratitude to my supervisors, Claudia and Murray, for all their support and guidance over the past few years. Because of them and all the time they dedicated to me, I've learnt invaluable lessons that have shaped my approach to research, and I've really enjoyed the process. Beyond the many stimulating intellectual discussions, they really helped me navigate the psychological ups and downs of the PhD.

The whole experience would not have been the same without all the people I shared it with at Imperial, namely all those who passed through Huxley 444 during my time there, the current and former members of the Clopath lab, and the 'lunch buddies'. I will miss physics, the walks down the hall for terrible (but free) coffee, weekly lab meetings and JC, perpetual discussions about consciousness and AI, Ludwig, chess puzzles, the lab xmas activities, and much more. I'm lucky to have met all of you legends. I must also thank CSG for their invaluable technical help in times of need.

I want to thank my parents, for their endless support and for giving me all the opportunities in life to get to this point, my sister, who could empathise with all the PhD-related struggles, and all my other family and friends who made life easier for me when research was tough.

Finally, I am utterly beholden to Georgie, who has been an absolute rock for me every single day and in every aspect of life - I can't imagine any of it without her. Oh, and to our two dogs, Gus and Hector, whose presence has the handy effect of making any stress or worry evaporate instantaneously.

Dedication

This thesis is dedicated to my παππού Χριστάκη, who has been there for me despite my never having met him, and to my γιαγιά Νίτσα, who was so warm and clever and made a chocolate Rice Krispie cake that would reliably serve as a reminder of why life is worth living.

'A good idea is a good idea forever.'
David Brent (The Office, Season 1, Episode 4)

Contents

Declaration of Originality and Copyright
Abstract
Acknowledgements

1 Introduction
  1.1 Continual Learning
    1.1.1 How do we define it?
    1.1.2 How do we measure it?
    1.1.3 Continual Reinforcement Learning
    1.1.4 Continual Reinforcement Learning with Memory at Multiple Timescales
  1.2 Thesis Outline and Contributions
  1.3 Publications

2 Background
  2.1 Artificial Neural Networks
    2.1.1 Origins of ANN
    2.1.2 Training Deep Neural Networks
  2.2 Reinforcement Learning
    2.2.1 Markov Decision Processes
    2.2.2 Nonstationarity in Reinforcement Learning
    2.2.3 Value Functions
    2.2.4 Value Function Approximation
    2.2.5 Policy Gradients
  2.3 Catastrophic Forgetting in ANN
    2.3.1 Discovery and Early Approaches
    2.3.2 Regularisation-based Methods
    2.3.3 Replay-based Methods
    2.3.4 Architectural Approaches
    2.3.5 Sparse Coding / Semi-distributed Representations
    2.3.6 Task-free Methods

3 Continual Reinforcement Learning with Complex Synapses
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 The Benna-Fusi Model
    3.2.2 RL Algorithms
  3.3 Experiments
    3.3.1 Continual Q-learning
    3.3.2 Continual Multi-task Deep RL
    3.3.3 Continual Learning within a Single Task
  3.4 Related Work
  3.5 Conclusion
  3.6 Experimental Details
  3.7 Additional Experiments
    3.7.1 Varying Epoch Lengths
    3.7.2 Three-task Experiments
    3.7.3 Varying Size of Replay Database
    3.7.4 Catcher Single Task
    3.7.5 Varying Final Exploration Value
  3.8 Online Convex Optimisation Perspective

4 Policy Consolidation for Continual Reinforcement Learning
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 PPO
    4.2.2 Multi-agent RL with Competitive Self-play
  4.3 From Synaptic Consolidation to Policy Consolidation
    4.3.1 Synaptic Consolidation
    4.3.2 Policy Consolidation
  4.4 Experiments
    4.4.1 Single agent Experiments
    4.4.2 Multi-agent Experiments
    4.4.3 Further Analyses
  4.5 Related Work
  4.6 Conclusion and Future Work
  4.7 Experimental Details
    4.7.1 Single agent Experiments
    4.7.2 Self-play Experiments
  4.8 Additional Experiments
    4.8.1 Directionality of KL constraint

5 Continual Reinforcement Learning with Multi-Timescale Replay
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Soft Actor-Critic
    5.2.2 Invariant Risk Minimisation
  5.3 Multi-Timescale Replay
    5.3.1 Power Law Forgetting
    5.3.2 IRM version of MTR
  5.4 Experiments
  5.5 Related Work
  5.6 Conclusion
  5.7 Experimental Details
  5.8 Additional Experiments
    5.8.1 Importance Sampling
    5.8.2 ReF-ER
    5.8.3 Behaviour Regularised Actor Critic
  5.9 Multi-task Experiments

6 Conclusion
  6.1 Limitations and Future Work ...