Imperial College Department of Computing

Continual Reinforcement Learning with Memory at Multiple Timescales

Christos Kaplanis

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the University of London and the Diploma of Imperial College, April 2020

Declaration of Originality and Copyright

This is to certify that this thesis was composed solely by myself. Except where it is stated otherwise by reference or acknowledgment, the work presented is entirely my own.

The copyright of this thesis rests with the author. Unless otherwise indicated, its contents are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA).

Under this licence, you may copy and redistribute the material in any medium or format. You may also create and distribute modified versions of the work. This is on the condition that you credit the author, do not use it for commercial purposes, and share any derivative works under the same licence.

When reusing or sharing this work, ensure you make the licence terms clear to others by naming the licence and linking to the licence text. Where a work has been adapted, you should indicate that the work has been changed and describe those changes.

Please seek permission from the copyright holder for uses of this work that are not included in this licence or permitted under UK Copyright Law.

Abstract

In the past decade, with increased availability of computational resources and several improvements in training techniques, artificial neural networks (ANNs) have been rediscovered as a powerful class of machine learning methods, featuring in several groundbreaking applications of artificial intelligence. Most of these successes have been achieved in stationary, confined domains, such as game playing [MKS+15, SSS+17] and image recognition [KSH12], but, ultimately, we want to apply artificial intelligence to problems that require it to interact with the real world, which is both vast and nonstationary. Unfortunately, ANNs have long been known to suffer from the phenomenon of catastrophic forgetting [MC89], whereby, in a setting where the data distribution is changing over time, new learning can lead to an abrupt erasure of previously acquired knowledge. The resurgence of ANNs has led to an increased urgency to solve this problem and endow them with the capacity for continual learning [Rin94], which refers to the ability to build on their knowledge over time in environments that are constantly evolving. The most common setting for evaluating continual learning approaches to date has been in the context of training on a number of distinct tasks in sequence, and as a result many of them use the knowledge of task boundaries to consolidate knowledge during training [RE13, KPR+17, ZPG17]. In the real world, however, the changes to the distribution may occur more gradually and at times that are not known in advance.

The goal of this thesis has been to develop continual learning approaches that can cope with both discrete and continuous changes to the data distribution, without any prior knowledge of the nature or timescale of the changes. I present three new methods, all of which involve learning at multiple timescales, and evaluate them in the context of deep reinforcement learning, a paradigm that combines reinforcement learning [SB98] with neural networks, which provides a natural testbed for continual learning as (i) it involves interacting with an environment, and (ii) it can feature non-stationarity at unpredictable timescales during training of a single task. The first method is inspired by the process of synaptic consolidation in the brain and involves multi-timescale memory at the level of the parameters of the network; the second extends the first by directly consolidating the agent’s policy over time, rather than its individual parameters; finally, the third approach extends the experience replay database [Lin92, MKS+15], which typically maintains a buffer of the agent’s most recent experiences in order to decorrelate them during training, by enabling it to store data over multiple timescales.

Acknowledgements

I’d like to express my immense gratitude to my supervisors, Claudia and Murray, for all their support and guidance over the past few years. Because of them and all the time they dedicated to me, I’ve learnt invaluable lessons that have shaped my approach to research and I’ve really enjoyed the process. Beyond the many stimulating intellectual discussions, they really helped me navigate the psychological ups and downs of the PhD.

The whole experience would not have been the same without all the people I shared it with at Imperial, namely all those who passed through Huxley 444 during my time there, the current and former members of the Clopath lab, and the ‘lunch buddies’. I will miss physics, the walks down the hall for terrible (but free) coffee, weekly lab meetings and JC, perpetual discussions about consciousness and AI, Ludwig, chess puzzles, the lab xmas activities, and much more. I’m lucky to have met all of you legends. I must also thank CSG for their invaluable technical help in times of need.

I want to thank my parents, for their endless support and for giving me all the opportunities in life to get to this point, my sister, who could empathise with all the PhD-related struggles, and all my other family and friends who made life easier for me when research was tough.

Finally, I am utterly beholden to Georgie, who has been an absolute rock for me every single day and in every aspect of life - I can’t imagine any of it without her. Oh, and to our two dogs, Gus and Hector, whose presence has the handy effect of making any stress or worry evaporate instantaneously.

Dedication

This thesis is dedicated to my παππού Χριστάκη, who has been there for me despite my never having met him, and to my γιαγιά Νίτσα, who was so warm and clever and made a chocolate Rice Krispie cake that would reliably serve as a reminder of why life is worth living.

‘A good idea is a good idea forever.’

David Brent

(The Office, Season 1, Episode 4)

Contents

Declaration of Originality and Copyright i

Abstract iii

Acknowledgements v

1 Introduction 1

1.1 Continual Learning ...... 4

1.1.1 How do we define it? ...... 4

1.1.2 How do we measure it? ...... 7

1.1.3 Continual Reinforcement Learning ...... 8

1.1.4 Continual Reinforcement Learning with Memory at Multiple Timescales . 10

1.2 Thesis Outline and Contributions ...... 10

1.3 Publications ...... 12

2 Background 13

2.1 Artificial Neural Networks ...... 13

2.1.1 Origins of ANN ...... 14

2.1.2 Training Deep Neural Networks ...... 17


2.2 Reinforcement Learning ...... 23

2.2.1 Markov Decision Processes ...... 24

2.2.2 Nonstationarity in Reinforcement Learning ...... 26

2.2.3 Value Functions ...... 27

2.2.4 Value Function Approximation ...... 31

2.2.5 Policy Gradients ...... 33

2.3 Catastrophic Forgetting in ANN ...... 36

2.3.1 Discovery and Early Approaches ...... 37

2.3.2 Regularisation-based Methods ...... 39

2.3.3 Replay-based Methods ...... 43

2.3.4 Architectural Approaches ...... 45

2.3.5 Sparse Coding / Semi-distributed Representations ...... 47

2.3.6 Task-free Methods ...... 48

3 Continual Reinforcement Learning with Complex Synapses 51

3.1 Introduction ...... 51

3.2 Preliminaries ...... 52

3.2.1 The Benna-Fusi Model ...... 52

3.2.2 RL Algorithms ...... 54

3.3 Experiments ...... 57

3.3.1 Continual Q-learning ...... 57

3.3.2 Continual Multi-task Deep RL ...... 61

3.3.3 Continual Learning within a Single Task ...... 66

3.4 Related Work ...... 68

3.5 Conclusion ...... 69

3.6 Experimental Details ...... 70

3.7 Additional Experiments ...... 72

3.7.1 Varying Epoch Lengths ...... 72

3.7.2 Three-task Experiments ...... 72

3.7.3 Varying Size of Replay Database ...... 74

3.7.4 Catcher Single Task ...... 75

3.7.5 Varying Final Exploration Value ...... 76

3.8 Online Convex Optimisation Perspective ...... 76

4 Policy Consolidation for Continual Reinforcement Learning 82

4.1 Introduction ...... 82

4.2 Preliminaries ...... 83

4.2.1 PPO ...... 83

4.2.2 Multi-agent RL with Competitive Self-play ...... 85

4.3 From Synaptic Consolidation to Policy Consolidation ...... 85

4.3.1 Synaptic Consolidation ...... 86

4.3.2 Policy Consolidation ...... 88

4.4 Experiments ...... 89

4.4.1 Single agent Experiments ...... 90

4.4.2 Multi-agent Experiments ...... 92

4.4.3 Further Analyses ...... 94

4.5 Related Work ...... 98

4.6 Conclusion and Future Work ...... 98

4.7 Experimental Details ...... 99

4.7.1 Single agent Experiments ...... 99

4.7.2 Self-play Experiments ...... 100

4.8 Additional Experiments ...... 101

4.8.1 Directionality of KL constraint ...... 101

5 Continual Reinforcement Learning with Multi-Timescale Replay 103

5.1 Introduction ...... 103

5.2 Preliminaries ...... 105

5.2.1 Soft Actor-Critic ...... 105

5.2.2 Invariant Risk Minimisation ...... 107

5.3 Multi-Timescale Replay ...... 108

5.3.1 Power Law Forgetting ...... 110

5.3.2 IRM version of MTR ...... 111

5.4 Experiments ...... 112

5.5 Related Work ...... 120

5.6 Conclusion ...... 121

5.7 Experimental Details ...... 124

5.8 Additional Experiments ...... 124

5.8.1 Importance Sampling ...... 125

5.8.2 ReF-ER ...... 127

5.8.3 Behaviour Regularised Actor Critic ...... 129

5.9 Multi-task Experiments ...... 131

6 Conclusion 132

6.1 Limitations and Future Work ...... 134

Bibliography 137

List of Tables

3.1 Hyperparameters for Tabular Q-learning Experiments ...... 70

3.2 Hyperparameters for Deep RL Experiments ...... 71

4.1 Hyperparameters for Policy Consolidation Experiments ...... 101

5.1 Hyperparameters for Multi-Timescale Replay Experiments ...... 124

List of Figures

1.1 (Left) Memories in the brain are distributed across neurons and synapses rather than stored in localised memory addresses like in a computer. (Right) Memories in the brain are formed via changes in the strengths of synaptic connections, known as synaptic plasticity...... 2

2.1 Depiction of a McCulloch-Pitts neuron ...... 15

2.2 Image taken from [Sar18] showing the types of visual features learnt in each hidden layer of a neural network trained with facial images...... 17

2.3 Diagram of the interaction between the agent and the environment in an MDP (taken from [SB98]) ...... 25

3.1 Diagrams adapted from [BF16] depicting the chain model (top) and the analogy to liquid flowing between a series of beakers of increasing size and decreasing tube widths (bottom)...... 54

3.2 (Top) How long it took each agent to relearn to navigate to the first reward at the beginning of each epoch. (Bottom) How many time steps it took for the 20-episode moving average of episode lengths to drop below 13, as a measure of how long it took to (re)learn a good policy. Mean over 3 runs with 1 s.d. error bars...... 60


3.3 Surface plots of a snapshot of the visible (V1) and hidden (V2 and V3) values of each state during training. While V1 only appears to retain information about the current reward at (10,10), V2 and V3 still remember that there is value at (0,0). When the reward location is switched back to (0,0), flow from the deeper variables in the chain back into V1 makes it easier for the agent to relearn the previous reward location. See https://youtu.be/_KgGpT-sjAU for an animation of the values over training...... 61

3.4 (Top) How long it took for the agents to relearn each task from the beginning of each epoch; the # of training episodes needed for the 10 test-episode moving average of reward to surpass the threshold is plotted for 3 runs per agent. Runs that did not relearn within the epoch are marked at ∞. (Bottom) Reward per episode averaged over each epoch for each task; means with s.d. error bars over 3 runs. . 65

3.5 The 1000 test-episode moving average of reward in Cart-Pole for the Benna-Fusi agent and control agents with different learning rates; means and s.d. errorbars over 3 runs per agent...... 68

3.6 Comparison of time to (re)learn CartPole in the control agent (blue) and the Benna-Fusi agent (orange) for different epoch lengths...... 72

3.7 Comparison of time to (re)learn Catcher in the control agent (blue) and the Benna-Fusi agent (orange) for different epoch lengths...... 73

3.8 Comparison of time to (re)learn each task in the control agent (blue) and the Benna-Fusi agent (orange) for different epoch lengths. Both agents had a learning rate of 0.001 and the runs with longer epochs were run for fewer epochs. In all cases the Benna-Fusi agent becomes quicker (or in a couple of instances equally quick) at relearning each task than the control agent, demonstrating the Benna-Fusi model’s ability to improve memory at a range of timescales...... 74

3.10 100 test-episode moving average of reward in Cart-Pole for control agents (all with η = 0.001) with different sized experience replay databases and the Benna-Fusi agent in just the online setting. For these experiments, 1 experience was sampled for training from the database after every time step. In the control cases, when the database is too small, the agent cannot attain a stable performance on the task while the Benna-Fusi agent can...... 74

3.9 Comparison of time to (re)learn each task in the control agent (blue) and the Benna-Fusi agent (orange) for the three different tasks. Each epoch was run for 20000 episodes and both agents had a learning rate of 0.001. While the Benna-Fusi agent took a little longer to learn Catcher than the control agent, by the end of the simulation the Benna-Fusi agent could learn to recall each task much faster than the control...... 75

3.11 The 100 test-episode moving average of reward per episode in Catcher for the Benna-Fusi agent and the best control agent. The control agent learns faster but both end up learning a good policy...... 75

3.12 The 100 test-episode moving average of reward per episode in Cart-Pole for control agents where epsilon was not allowed to decay below different minimum values. None of the runs yielded a good stable performance...... 76

3.13 Effective learning rates ηt for updates at corresponding times t for OGD and the Benna-Fusi model after 100 time steps using η = 1. The learning rate is higher for more recent data points in the Benna-Fusi model, while in OGD, the algorithm becomes less and less adaptive to recent data as time progresses. . . . 79

4.1 (a) Depiction of synaptic consolidation model (adapted from [BF16]). (b) Depiction of policy consolidation model. The arrows linking the π_k's to the π_k^old's represent KL constraints between them, with thicker arrows implying larger constraints, enforcing the policies to be closer together...... 87

4.2 Reward over time for (a) alternating task and (b) single task runs; comparison of PC agent with fixed KL (with different βs), clipped (with different clip coefficients) and adaptive KL agents (omitted for some runs since return went very negative). Means and s.d. error bars over 3 runs per setting...... 91

4.3 Moving averages of mean scores over time in RoboSumo environment of (a) the final version of each model against its past self at different stages of its history, and (b) the PC agents against the baselines at equivalent points in history. Mean scores calculated over 30 runs using 1 for a win, 0.5 for a draw and 0 for a loss. Error bars in (b) are s.d. across three PC runs, which are shown individually in (a)...... 94

4.4 (a) Reward over time of the policies of the networks at different cascade depths on HumanoidSmallLeg-v0, having been trained alternately on HumanoidSmallLeg-v0 and HumanoidBigLeg-v0. (b) Reward over time on alternating Humanoid tasks for different combinations of cascade length and ω...... 96

4.5 Reward over time for (a) PC model and (b) fixed-KL baseline with β = 10 for different task-switching schedules between the HumanoidSmallLeg-v0 and HumanoidBigLeg-v0 tasks. The PC model relearns quickly after each task switch for a range of switching schedules, allowing it to build on its knowledge, while the baseline is unable to...... 97

4.6 Reward over time using the (a) D_KL(π_k^old || π_k) and (b) D_KL(π_k || π_k^old) constraints...... 102

5.1 Diagram of MTR buffer; each blue box corresponds to a FIFO sub-buffer. . . . . 109

5.2 Histograms of age of experiences in different types of buffer after 5 million experiences have been inserted. Each buffer has a maximum size of 1 million experiences, and for the MTR buffer, only the distribution of experiences in the cascade is shown (not the overflow buffer)...... 110

5.3 Gravity settings over the course of the simulation in each of the three set-ups. . 113

5.4 Fixed gravity setting. Training reward for (a) HalfCheetah and (b) Ant. Mean and standard error bars over three runs...... 115

5.5 Linearly increasing gravity setting. (Top) Training reward for (a) HalfCheetah and (b) Ant. (Bottom) Mean evaluation reward for (c) HalfCheetah and (d) Ant. Mean and standard error bars over three runs...... 117

5.6 Individual Evaluation rewards for linearly increasing gravity HalfCheetah. Mean and standard error bars over three runs...... 118

5.7 Fluctuating gravity setting. (Top) Training reward for (a) HalfCheetah and (b) Ant. (Bottom) Mean evaluation reward for (c) HalfCheetah and (d) Ant. Mean and standard error bars over three runs...... 119

5.8 Individual Evaluation rewards for fluctuating gravity HalfCheetah with (a) FIFO buffer and (b) MTR-IRM buffer. Mean and standard error bars over three runs. 120

5.9 Comparison of training performance of agents on the fluctuating-gravity HalfCheetah task (a) without and (b) with weighted importance sampling...... 127

5.10 Comparing training performance of runs with and without ReF-ER on the HalfCheetah task with (a) linearly increasing gravity and (b) fixed gravity...... 129

5.11 Training reward for HalfCheetah task with fixed gravity using a reservoir buffer with different BRAC coefficients...... 130

5.12 Multitask setting: (a) training performance and (b) evaluation performance with uniformly random gravity between −7 and −17 m/s² with a FIFO buffer. This experiment shows that the agent has the capacity to represent good policies for all evaluation settings if trained in a non-sequential setting...... 131

Chapter 1

Introduction

I’m still learning.

- Michelangelo (on his deathbed), 18th February 1564

At every waking moment as humans, our brains are faced with the challenge of processing a torrent of sensory data incoming from the world around us. Despite the enormity and ephemerality of this data, we learn to make sense of it - forming useful abstractions, selectively storing information, and continually deepening our understanding of the environment so that we can interact with it more effectively and make better decisions to achieve our goals. While we experience this processing ability as automatic and effortless, from a computational standpoint it is a substantial feat, made harder by the fact that our environment is constantly changing and that our neural resources are limited. One ostensibly elementary aspect of our capacity to learn over a lifetime, but one that in reality poses a great mechanistic conundrum in the context of how learning is thought to occur in the brain, is how we strike a balance between the retention of old knowledge and the acquisition of new memories and skills.

The nature of this conundrum, which is often referred to as the stability-plasticity dilemma [CG87], can be understood by considering some basic neuroscience. The fundamental processing units of the brain are specialised cells called neurons, which communicate with each other via electric signals across connections known as synapses.

Figure 1.1: (Left) Memories in the brain are distributed across neurons and synapses rather than stored in localised memory addresses like in a computer. (Right) Memories in the brain are formed via changes in the strengths of synaptic connections, known as synaptic plasticity.

The physical basis of behavioural learning and memory formation is widely thought to be a consequence of synaptic plasticity, which refers to the process that changes the strengths of these connections (Figure 1.1 (right)). Unlike in the hard drive of a computer, where data is stored at distinct and localised memory addresses, memories in the brain are thought to be distributed within networks of neurons and across different brain regions, and overlapping in that neurons and synapses can participate in the recall of several different memories (Figure 1.1 (left)). The dilemma lies herein - since a single synapse can play a role in multiple memories, how is it ensured that the imprinting of a new one does not interfere with or overwrite an older one? It would seem that synapses need to be simultaneously stable enough to prevent forgetting and plastic enough to be able to acquire new knowledge.

It was in the context of connectionist models of human memory [MRG+87], namely artificial neural networks (ANNs), that this problem really came to the fore [MC89]. Artificial neurons were originally designed as a mathematical abstraction for how biological neurons might function [MP43]; later they were combined with computational theories of learning [Heb49] to show that they could be trained to perform simple tasks [Ros58]. Eventually, methods for training networks consisting of multiple layers of artificial neurons (known today as deep neural networks) were discovered [Wer74, RHW85], which were able to learn more complex tasks that single neurons could not. Soon after, however, it became evident that these artificial neural networks, which also feature a distributed and overlapping form of knowledge representation, were unable to cope well with the stability-plasticity tradeoff and, instead, suffered from a phenomenon known as catastrophic forgetting when trained in a setting where the data distribution is changing over time [MC89]. Paradigmatically, if an ANN were trained on task A and then subsequently trained on task B (without simultaneously giving it any data for task A), then upon returning to task A, the network had often completely forgotten how to perform the task. Catastrophic forgetting in humans, on the other hand, is rare - our memories deteriorate gradually and gracefully for the most part [BU59]; for example, our skills become rusty if we have not practised them in a while, or we forget certain details about an event over time, rather than suddenly forgetting that the event had ever occurred.

In recent years, thanks to the discovery of better training techniques and a massive increase in the availability of computational resources, deep neural networks have had a renaissance and have played a key role in many successful applications of artificial intelligence, ranging from machine translation [BCB14] and speech recognition [GMH13] to image recognition [KSH12] and video-game playing [MKS+15]. The vast majority of these systems, however, have each been trained and applied in a single, stationary domain. Typically, this involves learning using fixed datasets or environments, which contrasts greatly with the world that humans are faced with, which is constantly evolving. If we eventually want to deploy neural-network-based algorithms that operate in the real world and can build on their knowledge over time, like humans they will also have to be able to learn from a nonstationary stream of data without catastrophically forgetting. Indeed, the rise of ANNs has also led to an increased urgency in finding ways to endow neural networks with this ability, commonly referred to as the capacity for continual [Rin94] or lifelong learning [TM95].

1.1 Continual Learning

1.1.1 How do we define it?

A strict definition of what constitutes continual learning (CL), in terms of the qualities that should be desired from an algorithm capable of CL and the conditions that should be present in the learning setting in order to test these qualities, is elusive and is currently a matter of discussion in the community [Rin05, clw16, SvHM+18]. The term continual learning was originally conceived in the context of reinforcement learning (RL) [SB98, Rin94], which involves an agent learning to take actions in an environment and is arguably the machine learning paradigm that most closely emulates the challenges of a human acting in the real world. Catastrophic forgetting in neural networks, on the other hand, has most commonly been evaluated in a supervised learning setting, which involves learning input-output mappings from labelled datasets, typically using variations on standard classification problems [LCB10, KH+09]. In this thesis, I use RL as a context for evaluation, for reasons that are discussed later on, and I often refer to CL agents, but the key factors elucidated below that characterise CL in neural networks (which in the literature has now taken on a meaning beyond the confines of RL [PKP+19]) can be applied to any machine learning paradigm.

Algorithms for CL in neural networks are not that good yet, and so they are not usually tested in the real world, but rather in simulation - for this reason, it is important to define the conditions that must be present in the simulated environment in order to evaluate continual learning. A few of the most agreed-upon conditions for testing CL agents / algorithms are outlined below:

1. The ‘firehose’ [SvHM+18]. The dataset is not fixed; it is constantly streaming in, and the amount of data flowing in is too large for the agent / algorithm to store. This condition is key to emulating the real-time and voluminous nature of data accumulation associated with interacting in the real world.

2. Limited resources. Related to the first condition, this one states that both the overall size of the learning system and the computational resources available to it must be limited.

The extent to which this is treated as a hard limit that should be set from the start or a softer constraint under which the necessary resources must grow slowly over time varies in the literature, but the underlying principle that they cannot grow unsustainably is clear, since the goal is to create agents that learn forever or for an indefinitely long period. The overall size of the system concerns both the amount of raw data that can be stored by the agent as well as the number of parameters that are needed to describe the agent, for example the values of the strengths of the connections between neurons in an ANN.

3. Nonstationarity. The data flowing in need not be originating from a stationary distribution - this may be due to the fact that the world itself is changing or simply because the world that the agent interacts with is so vast that it can only experience one small part of it at any given time. This contrasts with the typical training regime for neural networks, in which the data is assumed to be independently and identically distributed (i.i.d.).

As far as what agents should be able to accomplish under these conditions, the key desiderata are as follows:

1. Graceful forgetting. In the context of limited resources, it is unreasonable to expect agents to be able to accumulate an unlimited amount of knowledge over time, but we want their forgetting to be predictable and not abrupt or catastrophic. Furthermore, their forgetting should be intelligent, in the sense that, with experience, the agent should learn what knowledge and skills are important to retain and what can be safely forgotten to make space for new learning.

2. Positive transfer. A crucial characteristic of a successful CL agent is that it is able to distill and utilise knowledge acquired in one task or environment to perform better in others that share similarities, thus enabling it to build on its knowledge in an intelligent way. A strong CL agent should be able to (i) identify and exploit similarities at different levels of abstraction, (ii) use previous knowledge to inform behaviour in future tasks (forward transfer) as well as use more recent knowledge to improve on previously encountered tasks (backward transfer), and (iii) learn how to learn more effectively - having encountered lots of different tasks, the agent should be able to improve on its own learning procedure for new tasks, an ability commonly referred to as meta-learning [SS10]. The goal of mitigating catastrophic forgetting can be interpreted as one of avoiding negative backward transfer.

How important are these criteria, really, if our ultimate goal is to deploy artificially intelligent agents into the real world? Just because humans face the constraints described above, is it necessary to burden machines with them? Some typical pushbacks to the continual learning effort are:

• “Data storage has become extremely cheap - why don’t we just store all the data that comes in?”. Even if we had infinite space to store all the data, training on an ever-growing database will inevitably become slower and slower or require more and more computing power - creating scalable CL agents necessitates that training time does not increase indefinitely.

• “Can’t we just shuffle data all together so that it’s all i.i.d.?”. One practical issue with this approach relates to the previous point in that this can only be done without forgetting if all the data is stored. Another reason this might not be the ideal solution, however, is that if we have an agent robust to catastrophic forgetting, it might actually be beneficial to train on data that appears sequentially or that does not all come from the same distribution. There is evidence that humans may learn better in blocks, rather than with interleaved training [WS02, FBD+18], and we know that humans learn with curricula - learning simple things first allows us to learn related complex tasks more easily. Curriculum learning is already a field of interest in machine learning [BLCW09], but it could be a more powerful technique if forgetting is mitigated. Furthermore, it’s possible that having access to data that comes from different environments can allow us to learn more robust and generalisable solutions, by identifying what is invariant across these environments [ABGLP19], an approach that is investigated in Chapter 5 of this thesis.

• “Why don’t we just train on data from different environments using separate models to avoid forgetting?”. This approach is problematic from two standpoints: (i) the first is that it relies on a constant rise in the number of total parameters required to keep learning, which may be unsustainable, and (ii) by using separate models for different tasks, we preclude the possibility of knowledge transfer from one task to another. The distributed nature of knowledge within an ANN is key to its ability to form abstractions and generalise to new situations and across tasks - for this reason, in order to harness this ability, there is a strong motivation to use just one network for all tasks. Maintaining the benefits of distributed knowledge representations while preventing catastrophic forgetting is a challenging source of tension in continual learning research.

Finally, with regards to defining continual learning, it is important to briefly address alternative terminology and varied usage of the term. In a lot of the recent literature on continual learning in ANNs, the term continual learning is synonymous with avoiding catastrophic forgetting, with less regard to the issue of forward transfer, which is the prime concern of the literature on meta-learning, or learning to learn [Sch87]. Other similar terms include never-ending learning [MCH+18] and lifelong learning [TM95], both of which, as originally conceived, focused more on the capability of improving future learning from past experience, rather than on the forgetting problem. Never-ending learning is not a commonly used term in the neural network literature, but lifelong learning, similarly to continual learning, is often used in the context of ANNs to refer to the catastrophic forgetting problem [PKP+19]. Finally, online learning is a term that corresponds to the setting where the data becomes available in a streaming fashion, as in continual learning, but typically there is no restriction on data storage.

1.1.2 How do we measure it?

Having established the conditions for continual learning, the question arises of how to design practical settings in which to simulate them and test the desiderata outlined above.

Most commonly, the evaluation of continual learning techniques has been conducted in the context of training on a number of distinct tasks in sequence. For example, typical baselines include variations of the MNIST dataset [LCB10], which consists of thousands of images of handwritten digits that must be classified into the corresponding ten categories. The variations involve generating sets of tasks that can then be used to train in sequence: for example, in ‘permuted MNIST’ [ZPG17], different tasks are created by generating random permutations of the pixels in each image, and in ‘split MNIST’, different tasks are created by splitting the original dataset by digit category [GMX+13]. Other more challenging collections of tasks include sequential image classification [KH+09, ZPG17, RBB18] or reinforcement learning of sequences of ATARI video games [BNVB13, KPR+17, SCL+18]. In order to evaluate continual learning in these settings, the agents can be tested (at different time points during training) on all previously learned tasks, in order to test for catastrophic forgetting and backward transfer, and also on future tasks that have not been trained on, in order to test for forward transfer.
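As a concrete illustration of how such task sequences are typically constructed, here is a minimal NumPy sketch (my own illustration, not code from the thesis); the flattened-image array shape, function names and digit pairing are assumptions:

```python
import numpy as np

def make_permuted_tasks(images, labels, num_tasks, seed=0):
    """Build 'permuted MNIST'-style tasks: the same dataset under a different
    fixed pixel permutation for each task. `images` is assumed to have shape (N, 784)."""
    rng = np.random.RandomState(seed)
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(images.shape[1])   # one fixed permutation per task
        tasks.append((images[:, perm], labels))
    return tasks

def make_split_tasks(images, labels,
                     digit_pairs=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9))):
    """Build 'split MNIST'-style tasks: each task is a binary problem over a digit pair."""
    tasks = []
    for a, b in digit_pairs:
        mask = (labels == a) | (labels == b)
        tasks.append((images[mask], (labels[mask] == b).astype(int)))
    return tasks
```

Training then proceeds on the tasks in sequence, with evaluation on all earlier tasks at each stage to measure forgetting and backward transfer, and on later tasks to measure forward transfer.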

A consequence of choosing this format for evaluation, where the changes to the distribution are discrete and predictable (in the sense that they can be set by the researcher), has been that many of the methods designed for improving continual learning utilise the task boundaries for consolidation of knowledge during training [RE13, KPR+17, ZPG17]. The problem with this is that, in the real world, agents are likely to encounter situations where the data distribution is (i) evolving gradually over time, in a way that cannot be easily split up into discrete tasks, and (ii) changing in an unpredictable way, that cannot be determined by the agent in advance. In these scenarios, many of the existing continual learning methods are simply not applicable.

1.1.3 Continual Reinforcement Learning

The key motivation for the work that comprises this thesis has been to develop continual learning agents that can (i) cope with both discrete and continuous changes to the data distribution, and (ii) do so with little or no prior knowledge of the nature or timescale of these changes to the distribution. One experimental paradigm that naturally poses these challenges to the agent is that of reinforcement learning (RL) [SB98]. As briefly described earlier, the RL setting consists of an agent acting in an environment: at each discrete time step, the agent receives a set of observations and a reward from the environment and uses that information to take an action, which in turn causes the environment to generate the next state and reward in a closed loop system. The goal of the agent is to learn a policy, a mapping from states to actions, that maximises the reward that it collects in the environment - importantly, the agent has no prior knowledge of the environment and has to learn through a process of trial and error.
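To make the closed loop concrete, the following sketch uses the classic OpenAI Gym interface (an assumption; the environment name and the random stand-in for a learned policy are placeholders, not part of the thesis):

```python
import gym

env = gym.make("CartPole-v1")                     # placeholder environment
obs = env.reset()                                 # initial observation from the environment
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()            # stand-in for a learned policy pi(a | obs)
    obs, reward, done, info = env.step(action)    # the environment closes the loop
    episode_return += reward                      # the agent aims to maximise this quantity
print(episode_return)
```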

The goal of continual learning is to develop agents that can operate and learn autonomously in the real world - from this perspective, RL, in contrast to the other two major machine learning paradigms (supervised and unsupervised learning), lends itself particularly well to evaluating the CL desiderata since it involves interacting with an environment. As mentioned earlier, the problem of CL has often been formulated as an extension or subset of RL [Rin94, SvHM+18]. Furthermore, unlike most supervised and unsupervised learning tasks that utilise fixed datasets, the data distribution in RL can change over the course of a single task, often at numerous timescales and in unpredictable ways, due to the unknown dynamics of the environment, for example:

• On a short timescale, there can be positive correlations between successive states in the environment.

• On the (typically longer) timescale of the agent’s learning process, changes to the policy of the agent will naturally change the distribution of states that it experiences in the environment.

• In some settings, for example if there are other learning agents in the environment, the dynamics of the environment from the perspective of the agent can change over time, resulting in changes to the distribution that are outside of the agent’s control.

In this thesis, I have used RL as a context for all the evaluation of the CL algorithms I developed, as it serves as a particularly challenging setting for CL and one that is ripe for testing agents that have to deal with changes to the data distribution that are unpredictable in their nature and timing.

1.1.4 Continual Reinforcement Learning with Memory at Multiple Timescales

In this thesis, I develop and evaluate three novel algorithms for continual reinforcement learning, all of which share the common theme of memory at multiple timescales. The main idea is that if the agent does not know how and when the data distribution is going to change, then learning at multiple timescales can help it cover many possible scenarios: the shorter timescales allow the agent to adapt fast to new incoming data, while the slower timescales ensure that the agent does not forget too quickly, thus improving the stability-plasticity tradeoff. This concept might be important for continual learning in the brain, where memory at multiple timescales is thought to feature both at the level of individual synapses [BF16], as well as across different brain areas [MMO95] - the former phenomenon served as a particularly important inspiration for the first algorithm presented in Chapter 3. The next section provides an overview of the three approaches I developed, in the context of a breakdown of the thesis by chapter.

1.2 Thesis Outline and Contributions

Below is an outline of the thesis, with a summary of the contributions contained within each chapter.

Chapter 1 This chapter provides a high-level introduction to the problem of catastrophic forgetting and the goals of continual learning, followed by a motivation for the need to develop CL techniques that can deal with unpredictable changes to the data distribution.

Chapter 2 This chapter lays out the background knowledge upon which the original work of this thesis is built, covering the basics of artificial neural networks and reinforcement learning, as well as an overview of existing approaches to mitigating catastrophic forgetting.

Chapter 3 This chapter presents the first method I developed for tackling catastrophic forgetting in an RL setting, which involves multi-timescale learning at the level of individual parameters in the neural network. The method was inspired by an existing model for the process of synaptic consolidation in the brain, in which the different timescales of learning might correspond to biochemical processes at the level of a single synapse [BF16]. I present results showing that the method mitigates forgetting at multiple timescales in the context of tabular and deep reinforcement learning. I also discuss how this method can be interpreted in the online convex optimisation framework [Zin03].

Chapter 4 One of the limitations of the method from the previous chapter is that improving the memory of individual parameters does not always translate smoothly into improving the behavioural memory of the agent. In this chapter, I develop a method that addresses this issue by directly consolidating the policy of the agent at multiple timescales, by adapting the previous method with techniques for ‘distilling’ the knowledge from one neural network to another [HVD15, RCG+15] and controlling the magnitude of the agent’s learning step in policy space [SWD+17]. The method is shown to improve continual learning on a number of continuous control tasks in single-task, alternating task and multi-agent competitive self-play settings.

Chapter 5 Experience replay (ER) is a commonly used technique in off-policy RL that involves storing a fixed number of the agent’s most recent experiences and shuffling them before training in order to eliminate the short-term correlations in the data and make them i.i.d. [MKS+15]. In this chapter, I develop an ER method that records the experiences of the agent over multiple timescales, rather than just the most recent ones, with the idea of providing a better balance between new learning and retention of old knowledge. Furthermore, I investigate how having the data split into timescale buckets can allow the agent to learn a more robust policy that is invariant across time, using a technique called invariant risk minimisation [ABGLP19]. The two multi-timescale replay (MTR) methods are evaluated relative to baselines on a number of continuous control tasks, in which the environment was manipulated continuously throughout training, and in many cases are shown to improve continual learning.

Chapter 6 The final chapter reviews the contributions of the thesis and discusses potential avenues for future work.

1.3 Publications

The first two projects in this thesis (Chapters 3 and 4) are largely based on the following two papers accepted at the International Conference on Machine Learning (ICML) in 2018 and 2019 respectively:

• C. Kaplanis, M. Shanahan, C. Clopath. “Continual Reinforcement Learning with Complex Synapses”. Proceedings of the 35th International Conference on Machine Learning (ICML-18). 2018.

• C. Kaplanis, M. Shanahan, C. Clopath. “Policy Consolidation for Continual Reinforcement Learning”. Proceedings of the 36th International Conference on Machine Learning (ICML-19). 2019.

These two works also featured at the following workshops in a condensed format:

• C. Kaplanis, M. Shanahan, C. Clopath. “Continual Reinforcement Learning with Complex Synapses”. Proceedings of the 2nd Lifelong Learning: A Reinforcement Learning Approach (LLARLA) Workshop, Stockholm, Sweden, FAIM. 2018.

• C. Kaplanis, M. Shanahan, C. Clopath. “Policy Consolidation for Continual Reinforcement Learning”. Proceedings of NeurIPS 2018 Workshop on Continual Learning, Montréal, Canada. 2018.

Finally, the work presented in Chapter 5 is currently under preparation for submission to NeurIPS 2020.

Chapter 2

Background

In this chapter, I lay out the foundational knowledge that the original work of this thesis builds upon. I start with an introduction to artificial neural networks (ANN), explaining how they came about and how they are typically trained to perform tasks. Then, I explain the main concepts of reinforcement learning, as well as how it has been integrated with neural networks (a paradigm known as “deep” reinforcement learning), and why it is a particularly interesting and challenging setting for evaluating continual learning and catastrophic forgetting. Finally, in order to situate my contributions within the existing literature, I review a number of existing approaches to mitigating catastrophic forgetting, broken down into a number of main categories.

2.1 Artificial Neural Networks

Artificial neural networks (ANNs) form a class of machine learning methods that have been harnessed in recent years to yield groundbreaking results in a spectrum of applied domains. ANNs possess a number of desirable properties; for example, (i) they are capable of learning from raw data, without excessive hand-crafting of features by humans, (ii) they are capable of representing complex and highly non-linear functions, and (iii) they can generalise well to unseen data after training. As a result of these generically useful attributes, they have found use across the various machine learning paradigms, namely:


• Supervised learning, which involves learning a mapping of inputs to outputs based on a training set of labelled input-output pairs. In this context, for example, ANNs have transformed the field of computer vision with state-of-the-art results in image classification [KSH12], and have similarly advanced speech recognition [GMH13].

• Unsupervised learning, which concerns learning about the underlying structure of the data distribution based on a sample from the distribution - often this involves developing a generative model that can produce new data instances. Recently, a neural network architecture known as a transformer [VSP+17] was scaled up to create an impressively coherent generative language model [RWC+19].

• Reinforcement learning (RL), which involves training agents to learn a behaviour that maximises the reward collected in their environment. Notable achievements that combine neural networks with reinforcement learning (deep reinforcement learning) include developing algorithms that are as good as humans at playing ATARI video games [MKS+15], can beat the world’s best Go players [SSS+17], and can manipulate a Rubik’s cube with a robotic hand [AAC+19]. In this thesis, all the experimental evaluations I perform are in the RL paradigm which, as I discuss in a later section in this chapter, is a particularly interesting and challenging setting for continual learning.

Below, I continue with a brief history of artificial neural networks and an explanation of how they are typically trained to perform tasks.

2.1.1 Origins of ANN

The inception of artificial neural networks is usually attributed to Warren McCulloch and Walter Pitts who, in 1943, proposed a mathematical abstraction for the function of a biological neuron that bears many similarities to the units used in contemporary artificial neural networks [MP43]. In the McCulloch-Pitts model, each neuron receives a number of binary inputs x1, ..., xN (corresponding biologically to currents arriving from connections with other neurons) that are individually weighted (by the strengths of these incoming connections w1, ..., wN), summed, and finally passed through a threshold function that yields a binary output y (Figure 2.1):

y = H( Σ_{k=1}^{N} w_k x_k − T )    (2.1)

where T is the activation threshold and H is the Heaviside step function, defined as:

H(a) = 1 if a ≥ 0, and 0 if a < 0    (2.2)

Figure 2.1: Depiction of a McCulloch-Pitts neuron

Artificial neurons used in modern neural networks are almost identical, except that the inputs are typically not constrained to be binary and the threshold function H is replaced with other types of non-linearities known as activation functions, examples of which are given in the next section.
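As a small illustration (my own sketch, not code from the thesis), Equations (2.1) and (2.2) can be implemented directly; the AND-gate weights and threshold below are hand-picked assumptions:

```python
import numpy as np

def heaviside(a):
    """H(a) = 1 if a >= 0, else 0, as in Equation (2.2)."""
    return np.where(a >= 0, 1, 0)

def mcculloch_pitts(x, w, threshold):
    """Binary output y = H(sum_k w_k x_k - T), as in Equation (2.1)."""
    return heaviside(np.dot(w, x) - threshold)

# Example: an AND gate over two binary inputs.
w = np.array([1.0, 1.0])
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, mcculloch_pitts(np.array(x, dtype=float), w, threshold=1.5))
```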

McCulloch and Pitts’ theory of neuronal function stopped short, however, of explaining how networks of neurons adapt to give rise to learning. In 1949, Donald Hebb was the first to propose that learning was a consequence of changes to the strengths of the connections between neurons [Heb49], a process known as synaptic plasticity that is now widely accepted to form the basis of learning and memory in the brain. Hebb’s specific proposal was that if two neurons were coactive, then the connection between them would tend to strengthen, a rule colloquially described in the computational neuroscience community as “what fires together wires together”.

In 1958, Frank Rosenblatt invented the perceptron, which adapted the McCulloch-Pitts neuron and endowed it with a Hebb-inspired rule that allowed it to learn a binary classification of its inputs by adapting its incoming weights [Ros58]. The key idea was that the weights of the neuron could be trained to perform an accurate classification by minimising the error between the actual output of the network and the desired output of the network - a concept that remains essential to the way neural networks are trained today.
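The perceptron learning rule itself is only a few lines; the sketch below (an illustration under assumed hyperparameters, not code from the thesis) trains it on a linearly separable OR-gate problem, where it converges; on XOR, as discussed next, it would not:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Classic perceptron rule: nudge the weights whenever the thresholded
    prediction disagrees with the binary target (y in {0, 1})."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if (w @ x_i + b) >= 0 else 0
            error = y_i - y_hat              # zero when the prediction is already correct
            w += lr * error * x_i            # error-driven, Hebb-inspired update
            b += lr * error
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # OR gate inputs
y = np.array([0, 1, 1, 1])                                    # OR gate targets
w, b = train_perceptron(X, y)
print([1 if w @ x + b >= 0 else 0 for x in X])                # [0, 1, 1, 1]
```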

The perceptron initially caused a lot of excitement, with the prospect that it could be the building block of human intelligence, but over time it was found to be limited in its use for pattern recognition and, in 1969, Minsky and Papert published a book that analytically demonstrated some limitations of the perceptron that, some argue, led to the prolonged lull in ANN research that lasted until the mid-1980s [MP17]. Minsky and Papert showed that the perceptron could not learn the XOR function, an example of the fact that it can only classify patterns that are linearly separable. While it was known that multilayer neural networks could be hand-designed to perform the XOR function, Minsky and Papert showed that, in general, the perceptron learning rule could not be extended in a practical way to train multilayer networks.

Multilayer neural networks, commonly referred to as deep neural networks, consist of several layers of neurons where the outputs of one layer provide the inputs to the next layer. This deep structure allows the network to represent complex functions via hierarchical abstractions of the input - neurons in layer k can select combinations of features in layer k − 1 to respond to (Figure 2.2). For a long time it was not clear how deep neural networks could be trained effectively to make use of this powerful property. In 1974, it was first suggested that a technique now referred to as backpropagation of error could be used to train deep neural networks [Wer74]. It took, however, until 1985 for a famous paper by Rumelhart, Hinton and Williams to popularise the technique and give it its name [RHW85]. Gradient descent with backpropagation is now by far the most common way of training neural networks and it features in all the modern neural network successes mentioned earlier. In the next section, I explain how this powerful method works and, later on in this chapter, I discuss how it can be used to train deep neural networks in the context of reinforcement learning.

Figure 2.2: Image taken from [Sar18] showing the types of visual features learnt in each hidden layer of a neural network trained with facial images.

2.1.2 Training Deep Neural Networks

In this section, I describe the architecture of a standard deep neural network and outline the steps required to optimise it with gradient descent and backpropagation. More specifically, I will consider a feedforward network with fully connected layers, in which every neuron in layer l projects to every neuron in layer l + 1 and there are no feedback connections that project to lower layers of the network. This neural network architecture, often referred to as a multi-layer perceptron (MLP), is the most basic and commonly used setup, and is what I use in all the experiments in this thesis. Other common architectures include recurrent neural networks (RNNs) [HS97], which maintain a hidden state over time and are often used to process sequential data such as text or speech, and convolutional neural networks (CNNs) [LBD+90], which use shared weights to learn translation-invariant features and are usually used to process images [KSH12]. In this section, vectors and matrices are denoted in bold type for clarity, but this convention is not strictly upheld for the remainder of the thesis.

MLP Architecture

A deep neural network defines a function:

y = f(x; θ)    (2.3)

where x is the vector-valued input, y is the vector-valued output, and θ is a vector of the parameters of the network. The function can be decomposed into a sequence of functions:

f(x) = f^(n)(··· f^(2)(f^(1)(x)) ···)    (2.4)

where f^(i) corresponds to the function of the i-th layer of the network, which itself can be decomposed into the composition of an affine transformation and a (typically) non-linear activation function:

h^l = f^(l)(h^{l−1}) = g(W^{l⊤} h^{l−1} + b^l)    (2.5)

where W^l and b^l correspond respectively to the weights and biases of layer l, and g corresponds to the activation function. The collection of weights and biases of all the layers of the network comprise the network parameters θ. The final layer of the network is called the output layer, and the intermediate layers are known as hidden layers, with the intermediate values h^l being referred to as the hidden activations of layer l. The matrix W^l will have a number of rows equal to the number of hidden units in layer h^{l−1} and a number of columns equal to the number of units in layer h^l.

Typical activation functions include the sigmoid function σ(a) = 1 / (1 + e^{−a}) (which squishes the output to a number between 0 and 1), the hyperbolic tangent function tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}) (which limits the output to between −1 and 1), and the Rectified Linear Unit relu(a) = max(0, a). While different activation functions possess different qualities, it is crucial that they are non-linear so that the network is expressive enough to represent complex functions, and, for training with backpropagation, it’s important that they are easily differentiable. The output layer will often have a different activation function depending on the task being trained for. In regression tasks, the identity function is typically used after the final affine transformation and, in classification tasks, where the outputs correspond to probabilities of each class that must sum to one, a softmax function is often applied to enforce this constraint:

y_i = softmax(a)_i = e^{a_i} / Σ_j e^{a_j}    (2.6)

where the pre-activations here, a, are interpreted as log probabilities that can have unconstrained values.
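Putting Equations (2.3)-(2.6) together, a forward pass through a fully connected network can be sketched in a few lines of NumPy (an illustrative sketch with assumed layer sizes, not code from the thesis):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)      # Equation (2.6)

def forward(x, params):
    """Forward pass: h^l = g(W^{l T} h^{l-1} + b^l) for each layer (Equation (2.5)),
    with a softmax output layer for classification."""
    h = x
    for W, b in params[:-1]:
        h = relu(W.T @ h + b)
    W, b = params[-1]
    return softmax(W.T @ h + b)

# Toy network: 4-dimensional input, one hidden layer of 8 units, 3 output classes.
rng = np.random.RandomState(0)
params = [(0.1 * rng.randn(4, 8), np.zeros(8)),
          (0.1 * rng.randn(8, 3), np.zeros(3))]
print(forward(rng.randn(4), params))              # three class probabilities summing to 1
```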

Training Steps

1. Initialisation. The question of how to initialise the parameters of a neural network is an active research area and the optimal initialisation can depend on many factors, including the depth and width of the network and the type of activation function used [GB10, YS17]. One common property shared across initialisation regimes, however, is that the parameters are chosen to be different to one another, usually sampled at random, for example from a Gaussian distribution. This is important in order to break the symmetry of the network, particularly when a deterministic algorithm is used for updating the parameters, like gradient descent. Otherwise, parameters initialised at the same values within a layer will all be updated in the same way and each hidden unit will end up learning the same feature of the input, severely limiting the representational capacity of the network.

2. Forward pass. The first step in training, referred to as the forward pass, is to evaluate the neural network function f for a given input x or batch of inputs, which can be passed in as a matrix X. This is done by evaluating the hidden activations of each layer in order of depth until the output layer is reached. The values of predicted outputs and the hidden activations are stored for each input, to be used later in the backpropagation algorithm.

3. Loss evaluation. Neural networks are usually optimised to minimise a loss function of the network parameters that can usually be defined as the expectation or average of a per-example loss function of each datapoint in the dataset D. The form of the loss function will typically depend on the type of task that the network is being trained to solve. For example, in a supervised regression problem, a typical loss function used is the mean squared error (MSE) loss, which quantifies the difference between the actual outputs of the network f(x; θ), often denoted as ŷ, and the desired outputs y for the corresponding inputs as follows:

L_MSE(θ) = E_{x,y∼D}[(y − f(x; θ))²] = (1/|D|) Σ_{(x,y)∈D} (y − f(x; θ))²    (2.7)

Minimising the MSE loss is also equivalent to finding the parameters that maximise the probability of the training data under the assumption that the true outputs are normally distributed around the predicted values of the neural network - an example of a process known more generally as maximum likelihood estimation (MLE). More precisely, minimising the MSE is equivalent to maximising the log likelihood log p(y|x, θ) with respect to θ, where we assume that y = f(x; θ) + ε, with ε ∼ N(0, I).

When training networks in the context of reinforcement learning, which is the paradigm used for all experiments in this thesis, the loss function is typically constructed such that minimising it corresponds to learning a policy that maximises the reward collected by the agent - the various ways of doing this are described in more detail in the Reinforcement Learning section of this chapter.

Typically during training, for reasons discussed later when describing the parameter update step, the loss is not calculated using the whole dataset; instead, the average loss is evaluated over a subset of the data, referred to as a mini-batch, which serves as an unbiased estimate of the loss over the entire dataset provided the data points in the mini-batch are sampled in an independently and identically distributed (i.i.d.) manner from the dataset. The i.i.d. assumption is often broken in a continual learning setting, leading to instability in training and catastrophic forgetting. The formulation of the loss function used in deep reinforcement learning differs in several ways from the supervised learning setting, and is discussed in detail in the next section of this chapter.
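Pulling this step together, here is a minimal sketch (continuing the earlier example and reusing its function names) of the MSE loss of Equation 2.7 evaluated over an i.i.d. mini-batch; the batch size and the sampling scheme are assumptions of the example.

```python
import numpy as np

def mse_loss(y_hat, y):
    # mean squared error over a (mini-)batch, as in Equation 2.7
    return np.mean((y - y_hat) ** 2)

def sample_minibatch(X, Y, batch_size, rng):
    # i.i.d. sampling of a mini-batch gives an unbiased estimate of the
    # full-dataset loss; correlated (e.g. sequential) data would break this
    idx = rng.integers(0, X.shape[0], size=batch_size)
    return X[idx], Y[idx]
```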

4. Backpropagation. The backpropagation algorithm [RHW85] is the method used for calculating the gradient of the loss function with respect to the neural network parameters - in other words, it is a way to assign credit to each parameter for its impact on the performance of the network on the task at hand. The gradient can then be used to modify the parameters of the network to decrease the loss and improve the performance of the network.

Backpropagation works by sequentially propagating information about the loss function from the output layer of the network down through the hidden layers, making ample use of the chain rule of calculus¹. It starts by calculating the derivative of the loss function with respect to the outputs of the network, ∂L/∂ŷ_k, which is then propagated down to calculate the errors of the neurons in the last hidden layer, δ^(N−1) = ∂L/∂a^(N−1), where a^l = W^(l⊤) h^(l−1) + b^l is the vector of weighted inputs to the neurons in layer l. These errors are in turn used to calculate the errors of the layer below, δ^(N−2), and so on until the δ^l are obtained for all neurons in the network. In order to propagate the error by applying the chain rule it is thus essential that the activation function used is differentiable. Once the errors have been calculated it is then easy to use them to obtain the derivatives for all the parameters in the network, e.g.:

∂L/∂w_{ij}^l = (∂L/∂a_j^l)(∂a_j^l/∂w_{ij}^l) = δ_j^l ∂/∂w_{ij}^l (W_j^(l⊤) h^(l−1) + b_j^l) = δ_j^l h_i^(l−1)    (2.9)

where w_{ij}^l corresponds to the weight from the i-th neuron in the (l − 1)-th layer to the j-th neuron in the l-th layer. A similar computation is executed to calculate the derivatives with respect to the bias terms.

¹If y = f(x) and z = g(f(x)), where x is a scalar and f and g map scalars to scalars, then the chain rule states that dz/dx = (dz/dy)(dy/dx).    (2.8)
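A hand-derived sketch of this step for the same two-layer network as above: it propagates the errors δ from the output layer back to the hidden layer and returns the gradients of the MSE loss with respect to every parameter, as in Equation 2.9. This is an illustration under the assumptions of the earlier sketches, not code from the thesis.

```python
def backward(params, cache, y_hat, y):
    # backpropagation of the MSE loss through the two-layer tanh network
    X, a1, h1 = cache
    grads = {}
    # dL/dy_hat for L = mean((y_hat - y)^2) over all elements of the batch
    delta2 = 2.0 * (y_hat - y) / y.size
    grads["W2"] = h1.T @ delta2            # delta_j * h_i, as in Equation 2.9
    grads["b2"] = delta2.sum(axis=0)
    # propagate the error through W2 and the tanh non-linearity (tanh' = 1 - tanh^2)
    delta1 = (delta2 @ params["W2"].T) * (1.0 - h1 ** 2)
    grads["W1"] = X.T @ delta1
    grads["b1"] = delta1.sum(axis=0)
    return grads
```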

5. Parameter update. Once the gradient ∇L(θ) has been calculated, the parameters can be adjusted with a small step in the direction of the negative gradient, implementing the gradient descent algorithm:

θ ← θ − η∇L(θ)    (2.10)

where η is the learning rate, which determines how large the step size is for a given iteration. After a step is taken, the gradient is recalculated and the process repeated, until the network is deemed to have reached an acceptable level of performance. If the whole dataset is used to calculate the gradient, this process is called batch gradient descent but if, as mentioned earlier, usually only a sample from the dataset, known as a minibatch, is used, then it is referred to as stochastic gradient descent (SGD). Though it uses a noisy estimate of the gradient, SGD typically results in much faster convergence than batch gradient descent since the computation time does not grow with the size of the dataset, and it is much more commonly used in practice.

In stationary supervised learning problems, the learning rate is often reduced over time in order to reduce the noise induced by SGD when close to convergence. In a continual learning setting, however, the network has to always be able to adapt to new data streaming in, and so the learning rate is often kept constant. In general, the magnitude of the learning rate can have a large impact on training progress: too small a learning rate can lead to slow progress, while too large a learning rate can lead to instability of performance due to a combination of (i) amplification of the noise induced by SGD and (ii) the local direction of improvement implied by the gradient being a bad proxy when moving too far away from the starting point of the update. In a continual learning context, the magnitude of the learning rate has an important effect on the tradeoff between memory and adaptability - a large learning rate allows the network to adapt quickly to new data but can also exacerbate forgetting due to large changes to the network parameters, and vice versa for a small learning rate.

Since SGD makes updates based on the local gradient of the loss function in parameter space, it does not necessarily take the fastest route globally to a minimum of the loss function. Many improvements to SGD have been developed to improve its convergence rate - for example, the addition of momentum to the SGD updates [Pol64, SMDH13], which, rather than just using the gradient at the current point in parameter space, updates the network based on a moving average of the past gradients in training. Another commonly used method, called RMSProp [TH12], maintains an estimate of the running variance for each parameter and stabilises learning by reducing the learning rate for high variance parameters. In this thesis, most of the experiments employ a very popular method called ADAM [KB14], which essentially combines the techniques of momentum and RMSProp into one algorithm.

2.2 Reinforcement Learning

Reinforcement learning is a mathematical framework for addressing the problem of an agent learning to interact with its environment in order to achieve its goals [SB98]. In this section, I will motivate the use of RL in this thesis, establish the key concepts of the RL framework and describe the various categories of RL algorithms, which will provide some context for the approaches used in subsequent chapters.

All the experimental evaluations of the continual learning algorithms developed in this thesis were conducted within the RL framework, a choice that was motivated by a number of factors:

• The goal of continual learning is to create agents that can incrementally build on their knowledge and skills when deployed in the real world. Of the three main machine learning paradigms, RL is the only one that involves interacting with an environment, a compelling reason to use it to develop algorithms that must ultimately interact with the most complex of environments - the world that we live in.

• The main objective of this thesis has been to develop continual learning methods that can cope with a data stream that features either discrete or continuous changes to the distribution that occur unpredictably over time. As will be discussed in more detail later in this section, these kinds of changes to the data distribution occur naturally in an RL setting. Relating to the previous point, many of the sources of nonstationarity that occur in RL are similar in nature to those that one would expect to encounter in the real world - for example, those induced by the change in behaviour of the agent itself or of other agents in the environment.

• As mentioned earlier in this chapter, deep reinforcement learning, which combines artificial neural networks with reinforcement learning, has been at the core of a number of groundbreaking successes in artificial intelligence in recent years [MKS+15, SSS+17, AAC+19]. The proven capabilities of deep RL make it a candidate framework for building even more intelligent systems in the future that can solve important real-world problems - for this reason, it makes sense to try to enhance it with continual learning.

The remainder of this section is structured as follows: (i) I begin by describing the mathematical formalisation of RL as Markov Decision Processes (MDPs) [Wat89] and recapping the sources of nonstationarity of the data distribution in RL; then, (ii) I explain the concept of value functions and their estimation in value-based RL algorithms such as Q-learning; subsequently, (iii) I describe the theory behind another set of algorithms known as policy gradient methods, and finally, (iv) I introduce actor-critic methods which hybridise policy- and value-based approaches. All three classes of algorithms are used at various points for evaluation throughout this thesis.

2.2.1 Markov Decision Processes

The RL framework is formalised as a mathematical object known as a Markov Decision Process (MDP) [Wat89]. An MDP represents the interaction between the agent and its environment through a sequence of discrete time steps (Figure 2.3). At each time step t, the agent receives the state of the environment st ∈ S, where S represents the set of states reachable by the agent in the environment. Having observed the state, the agent then selects an action at ∈ A, where A denotes the set of actions available to the agent (typically assumed to be the same set for each state). As a result of taking action at in state st, the agent then observes the next state st+1 and a reward rt+1, where st+1 is sampled with probability p(st+1|st, at) from the distribution determined by the transition function T : S × S × A → [0, 1], and rt+1 is a scalar value sampled with probability p(rt+1|st, at) from a distribution determined by the reward function R : R × S × A → [0, 1]. The reward is sometimes assumed to be deterministic given the previous state and action, in which case it can be denoted as r : S × A → R. The MDP can thus be succinctly represented as a tuple ⟨S, A, T , R⟩.

Figure 2.3: Diagram of the interaction between the agent and the environment in an MDP (taken from [SB98]).

The goal of the agent is to find a policy, defined by a probability distribution over actions given the state π(at|st), that maximises its expected return E[Gt], where Gt is the sum of future rewards until the end of the fixed-length episode at time T:

" T # ∗ X π = arg max π[Gt] = arg max π r(st0 , at0 ) (2.11) π E π E t0=t where Eπ is the expectation under the reward distribution defined by policy π. Sometimes, however, it is not convenient to consider fixed-length episodes when the task is a continuing one with no natural end, i.e. where T = ∞. In this case, Gt could be infinite and so the maximisation would not be well-defined. The standard solution to this is to introduce a discount factor γ, which has the role of weighting rewards received in the distant future less than those acquired more immediately, and ensures that the return is finite. The revised goal is to find a policy that maximises the expected discounted return:

G_t = Σ_{t'=t}^{∞} γ^(t'−t) r(s_{t'}, a_{t'})    (2.12)
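As a small illustration of Equation 2.12 in code, the sketch below computes the discounted return at every time step of a finite trajectory using the recursion G_t = r_t + γG_{t+1}; the reward values and the choice of γ are arbitrary example inputs.

```python
def discounted_returns(rewards, gamma=0.99):
    # computes G_t = sum_{t' >= t} gamma^(t'-t) * r_t' for every t of a finite
    # trajectory, iterating backwards with the recursion G_t = r_t + gamma * G_{t+1}
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# example: discounted_returns([1.0, 0.0, 2.0]) -> [1.0 + 0.99**2 * 2.0, 0.99 * 2.0, 2.0]
```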

One of the most important features of an MDP is what is known as the Markov property, which is that the probability distributions that determine the next state st+1 and next reward rt+1 are fully determined by the previous state st and previous action at. In other words, at time t, the future states of the agent are conditionally independent of its past states given st and at. When the state of the environment is not fully observable to the agent, then the Markov property may not hold, in which case it is more accurate to model the process as a partially observable Markov decision process (POMDP) [Mon82]. In a POMDP, the observations of the agent and the true states of the environment are separate variables, and the agent typically maintains a belief distribution of what state it is likely to be in, in order to make informed decisions. A POMDP can then be expressed as a fully observable MDP over belief states; a typical way of training deep RL agents in partially observable environments is to use an RNN, where the hidden state of the RNN can correspond to the belief state of the agent.

In the real world, the environment is highly likely to be only partially observable to an agent, and thus partial observability is an important consideration for building continual learning agents. In this thesis, however, I chose not to deal with partial observability and only evaluate agents in fully observable environments, since the primary goal was to tackle the issue of catastrophic forgetting.

2.2.2 Nonstationarity in Reinforcement Learning

The unpredictable and complex nature of changes to the data distribution in RL constitutes a key reason for using it as a testbed for the CL algorithms developed in this thesis. In the Introduction, I outlined three common sources of nonstationarity, which can now be described in the context of the MDP framework:

1. One form of nonstationarity arises due to the correlation between consecutive states in the environment; in general, p(st+1) ≠ p(st+1|st) since consecutive states are likely to be similar to each other. This is commonly dealt with by using an experience replay database, described in the next section, that shuffles the most recent experiences during training.

2. Another source of change to the data distribution arises due to the fact that the evolution of the agent's policy π(at|st) as it learns will alter the distribution of states it encounters.

3. Finally, in some circumstances the dynamics of the environment, described by p(st+1|st, at), may change over time. In Chapter 5, in some of the experiments, this effect is created artificially by smoothly adjusting the strength of gravity in the environment over time. In Chapter 4, some experiments are run in a multi-agent setting; in this case, from the perspective of one of the agents, p(st+1|st, at) evolves over time because of the changes in the other agents' policies.

2.2.3 Value Functions

One of the major classifications of RL algorithms is into model-free or model-based approaches. Model-based approaches involve learning how the world works, for example by explicitly building a model of p(st+1|st, at), and then making decisions by planning with the model. In this thesis, I only ever use model-free approaches, which do not build an explicit model of the environment and can be further split up into two sets of approaches: value-based methods and policy-based methods.

In this section, I discuss value-based methods, which are based on the estimation of value functions. Value functions represent the expected future return of the agent, Gt, when starting in a particular state (the state-value function) or when taking a particular action in a particular state (the action-value or Q-value function). The expected future return from a particular state depends on the behaviour of the agent and so value functions are defined with respect to a given policy. Once the value function for a policy has been learnt, via a process called policy evaluation, it can then be used to adapt the policy of the agent to increase the expected return, by adjusting it to select actions with the highest value in each state, via a process called policy improvement. It can be shown, via the Bellman optimality equations that are described later on, that by interleaving policy evaluation and policy improvement, the agent can find the optimal policy for the MDP. There are many ways of doing this, all of which fall under the umbrella of generalised policy iteration and collectively represent a large proportion of RL algorithms.

Formally, the state-value function can be expressed as follows:

" ∞ # π X t0−t V (s) = Eπ [Gt|st = s] = Eπ γ r(st0 , at0 )|st = s (2.13) t0=t

The action-value function is defined as:

" ∞ # π X t0−t Q (s, a) = Eπ [Gt|st = s, at = a] = Eπ γ r(st0 , at0 )|st = s, at = a (2.14) t0=t

Value functions can be learned via the agent's interaction with the environment using a given policy. One way to estimate them is by averaging the realised returns of the agent starting from a given state or after taking an action in a given state - this general approach forms the basis of a class of RL algorithms known as Monte Carlo methods (MC methods). For example, a basic every-visit MC method updates the value function estimate for state st, each time it is encountered, as follows:

V(st) ← V(st) + η(Gt − V(st))    (2.15)

where η is the learning rate. MC methods typically provide an unbiased estimate of the expected return but can suffer from high variance. Additionally, they suffer from the fact that Gt is only known at the end of the episode, so the value estimate can only be updated then, meaning they cannot be used in a continuing task setting.

The other common way of estimating the value function is by bootstrapping on the estimated value of the next state V(st+1), borrowing ideas from the field of dynamic programming and exploiting the Markov property in an MDP; this technique, when combined with sampling from experience as in MC methods, forms the basis of the class of RL algorithms known as temporal difference (TD) methods. In order to see how the bootstrapping works, we note that the state-value function, for example, can be rewritten as follows:

" ∞ # 0 π X t −t V (s) = Eπ γ rt0+1 st = s (2.16) t0=t " ∞ # 0 X t −(t+1) = Eπ rt+1 + γ γ rt0+1 st = s (2.17) t0=t+1 h i π = Eπ rt+1 + γV (st+1) st = s (2.18) X X = π(a|s) p(s0, r|s, a)[r + γV π(s0)] (2.19) a s0,r where p(s0, r|s, a) is the joint probability of observing state s0 and reward r having performed action a in state s in the previous time step. The equation above is known as the Bellman expectation equation for V π and it relates the value of state s to the value of the states that succeed it under the policy π. A similar equation can be written for the action-value function Qπ. The Bellman optimality equation for the state-value function is another important relation that states that, under the optimal policy π∗, the value of a state s must be equal to the expected return after taking the best action in that state (i.e. the one that maximises Qπ∗ (s, a)) and then following π∗ :

V^{π*}(s) = max_{a∈A} Q^{π*}(s, a)    (2.20)
          = max_{a∈A} Σ_{s',r} p(s', r|s, a) [r + γ V^{π*}(s')]    (2.21)

A similar equation can be derived for Q^{π*} and, once the optimal value functions are obtained, the optimal policy can easily be inferred by choosing the action that maximises Q^{π*} in each state. As mentioned earlier, numerous methods under the umbrella of generalised policy iteration can be used to determine the value function of the optimal policy. One example, which requires knowledge of the dynamics of the environment T, is known as value iteration and it involves repeated application of the Bellman optimality backup operator T*:

(T*V^π)(s) := max_{a∈A} Σ_{s',r} p(s', r|s, a) [r + γ V^π(s')]    (2.22)

Incredibly, T* can be shown to have a unique fixed point at V^{π*}, demonstrating the power of bootstrapping for finding the optimal policy. In the next subsection, I describe a temporal difference method called Q-learning that does not require knowledge of the environment dynamics, but uses the same technique of bootstrapping to learn an optimal policy.
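To make the Bellman optimality backup of Equation 2.22 concrete, here is a sketch of value iteration on a small tabular MDP; it assumes the transition probabilities P[s, a, s'] and expected immediate rewards R[s, a] are known and stored as arrays, and the convergence tolerance is an arbitrary choice of the example.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    # P[s, a, s2]: probability of moving to state s2 after action a in state s
    # R[s, a]: expected immediate reward for taking action a in state s
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: V(s) <- max_a [R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2)]
        Q = R + gamma * (P @ V)          # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            # return the (approximate) optimal values and the greedy policy
            return V_new, Q.argmax(axis=1)
        V = V_new
```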

Q-learning

Q-learning [WD92] is a well-known reinforcement learning algorithm that involves directly learning the Q-values for each state-action pair. The algorithm can be applied to cases where the state space and the action space are both finite, so that a table of all the estimated Q-values (of size |S||A|) can be maintained - for this reason, it is known as a tabular RL method. In Chapter 3, a version of the Q-learning algorithm that involves eligibility traces, which form a hybrid of MC and TD learning, is used in an initial evaluation of the synaptic consolidation method in the context of a simple navigation task. Below, however, I describe the procedure for the standard Q-learning algorithm.

As the agent acts in the environment, at each time step it collects an experience of the form (st, at, rt+1, st+1). After each step, it uses this experience to update the table of Q-values as follows:

δt ← rt+1 + γV (st+1) − Q(st, at) (2.23)

Q(st, at) ← Q(st, at) + ηδt, (2.24)

where η is the learning rate, V (st+1) = maxa Q(st+1, a) and δt is referred to as the temporal difference error. If the policy of the agent is sufficiently exploratory, then Q will eventually converge to Q∗, the value function of the optimal policy π∗, which is derived as

π*(a|s) = 1 if a = arg max_{a'} Q*(s, a'), and 0 otherwise    (2.25)

The condition that the policy the agent uses to explore the environment is sufficiently exploratory is an important one. If the agent were to follow a greedy policy that always chooses the action with the highest Q-value estimate, then even after interacting with the environment for a long time, there may be states that it has not visited or actions that it has not tried. If this is the case, then the under-visited Q-values may not converge (since they are not being updated) and so it is not guaranteed that the algorithm will find the optimal Q-value function. A typical approach is to use an ε-greedy policy for exploration, which, at each time step, with probability 1−ε chooses the action with the highest Q-value, and with probability ε chooses an action uniformly at random from A. This consideration of the agent's behavioural policy is an example of one of the most commonly encountered dilemmas in RL, which is the trade-off between exploration and exploitation - with the ε-greedy strategy, this trade-off is directly controlled by the value of ε.
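A compact sketch of the tabular Q-learning update (Equations 2.23 and 2.24) with an ε-greedy behavioural policy. The environment is assumed to expose a simple reset()/step(action) interface returning (next_state, reward, done), and the hyperparameter values are illustrative assumptions rather than settings used in the thesis.

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               eta=0.1, gamma=0.99, epsilon=0.1, seed=0):
    # tabular Q-learning with an epsilon-greedy behavioural policy
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # with probability epsilon explore, otherwise act greedily w.r.t. Q
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # temporal difference error (Equation 2.23) and tabular update (Equation 2.24)
            bootstrap = 0.0 if done else np.max(Q[s_next])
            td_error = r + gamma * bootstrap - Q[s, a]
            Q[s, a] += eta * td_error
            s = s_next
    return Q
```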

Q-learning falls into the category of off-policy learning algorithms, which means that the behavioural policy of the agent is not necessarily the same as the policy that the agent is trying to optimise. In the example above, the behavioural policy is ε-greedy, while the policy being optimised is the greedy policy (i.e. ε = 0). In on-policy RL algorithms, on the other hand, the policy being optimised is the behavioural policy itself. In this thesis, we will use both off-policy (Chapters 3 and 5) and on-policy (Chapter 4) algorithms. In general, off-policy algorithms are more powerful but often more complex and harder to get to work, especially when combined with function approximation, which is discussed in the next sub-section.

2.2.4 Value Function Approximation

The fatal flaw of Q-learning is that, while it works well for problems with relatively small state and action spaces, it does not scale well to more complex problems. Often we want to apply RL to problems where the state space is multi-dimensional and even continuous; in these cases, maintaining a table of each individual Q-value is either extremely memory intensive or simply impossible. Furthermore, the time it would take for the agent to explore enough to populate the table becomes computationally infeasible. The typical solution to this is to replace the table with a value function approximator, a parameterised function that takes the state as an input and outputs an estimated Q-value for each action. The benefits of the approximator are that (i) it limits the memory size required to represent the Q-value function to a fixed number of parameters, and (ii) it allows the agent to provide estimates for previously unvisited states by drawing on similarities to past experiences, i.e. enabling it to generalise to new situations. When the function approximator used is an artificial neural network, the algorithm falls into the category of deep reinforcement learning. Next, I describe one such algorithm, known as deep Q-learning or deep Q-networks (DQN), which led to a breakthrough in deep RL when it was used to train agents to play ATARI video games to a super-human level, using just the pixel values on the screen as input [MKS+15].

Deep Q Networks

Deep Q-Networks [MKS+15] are artificial neural networks that are trained to approximate a mapping from states to Q-values; in DQN, the state space can be continuous but the action space must be discrete and the dimension of the output of the network is given by the number of available actions. In the original paper, a deep convolutional network was used, since they are good for processing images, but in general the technique can also be applied to other neural network architectures. In tabular Q-learning, the Q-value of each (st, at) pair is adjusted, as it is experienced, to be slightly closer to the bootstrapped target rt+1 + γV (st+1); in DQN, the squared difference between the Q-value and this target is minimised by incorporating it into the following loss function:

L(θ) = E_{(st,at,rt+1,st+1)∼D}[ (r_{t+1} + γV(s_{t+1}; θ⁻) − Q(s_t, a_t; θ))² ]    (2.26)

where θ are the parameters of the Q-network, θ⁻ are the parameters of an older version of the network (referred to as the target network), and V(s_{t+1}; θ⁻) = max_{a'} Q(s_{t+1}, a'; θ⁻). The target network parameters are updated much less frequently than the Q-network parameters and are done so by copying the Q-network parameters at fixed intervals. Off-policy RL with function approximation and bootstrapping, a combination known as the 'deadly triad' [SB98], is notoriously unstable and the target network is one feature that is used to prevent divergence by increasing the stability of the target used during optimisation. D in the loss function above refers to the experience replay database, which records the agent's experiences of the form (st, at, rt+1, st+1) in a First-In-First-Out (FIFO) queue and is sampled from at random during training. As mentioned earlier, one source of nonstationarity in RL is that consecutive experiences are usually highly correlated with one another. For this reason, training in an online fashion can cause the network to overfit to recent data; by jumbling together old and new data, the replay database thus plays an essential role in decorrelating updates to the network and preventing catastrophic forgetting of older experiences. In Chapter 5, I develop a method that adapts the replay database to record experiences over multiple timescales, rather than just the most recent ones, in an attempt to improve continual learning in situations when the timescale of changes to the distribution is unknown.
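For concreteness, a PyTorch-style sketch of the DQN loss in Equation 2.26; q_net and target_net are assumed to map a batch of states to one Q-value per action, and the batch layout, names and discount value are assumptions of the example rather than details of the original implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: tensors of states, actions (long), rewards, next states and done flags
    states, actions, rewards, next_states, dones = batch
    # Q(s_t, a_t; theta) for the actions that were actually taken
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # bootstrapped target uses the slowly updated target network theta^-
        next_v = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_v
    return F.mse_loss(q_taken, targets)

def update_target(q_net, target_net):
    # periodically copy the Q-network parameters into the target network
    target_net.load_state_dict(q_net.state_dict())
```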

In Chapter 3, I make use of an algorithm called soft Q-learning [HTAL17], which is based on the original DQN algorithm but is adapted to simultaneously maximise the entropy of the agent’s policy, in order to encourage it to find more diverse and robust solutions to the task.

2.2.5 Policy Gradients

In value-based methods such as Q-learning and DQN, the policy of the agent is implicitly defined by the value function, typically by picking the action with the highest estimated value. Policy gradient (PG) methods form an alternate class of RL approaches that learn a parameterised policy (such as by a deep neural network) that directly maps states to a distribution over actions. As we will see, value functions can still be used to facilitate the learning of the policy, but in PG methods they are not used for selecting actions. One of the advantages of PG methods over value-based methods is that they can easily represent policies in tasks with continuous action spaces; in Chapters 4 and 5 evaluation is performed on a number of continuous control tasks, and for this reason the CL methods are designed on top of algorithms that use policy gradients, the specifics of which are detailed in the respective chapters.

PG methods work by ascending the gradient of an objective function J(θ) that in some way represents the expected return of the agent, and thus they require the policy to be differentiable with respect to its parameters. A typical objective function is the average future return from states occupied by the agent, in other words the average value of states occupied when following the policy π_θ:

J(θ) = E_{s∼μ_θ}[ G_t | s_t = s ]    (2.27)
     = E_{s∼μ_θ}[ V^{π_θ}(s) ]    (2.28)
     = Σ_{s∈S} μ_θ(s) Σ_{a∈A} π_θ(a|s) Q^{π_θ}(s, a)    (2.29)

where µθ(s) is the stationary distribution of the Markov chain induced by policy πθ, in other words the probability (in the long run) of being in state s if the agent follows policy πθ. It is important to clarify that V πθ and Qπθ are not functions parameterised by θ, but the true value functions implied by the policy πθ.

The difficulty of differentiating the objective J(θ) to calculate its gradient with respect to θ lies in the fact that, while it is easy to determine how a small change in θ affects the policy π_θ, it is less obvious how it affects the state distribution μ_θ, since this depends partially on the environment dynamics, which are unknown to the agent. Fortunately, the policy gradient theorem provides a way of estimating ∇J(θ) from the agent's experience without differentiating through μ_θ and forms the basis of all PG algorithms [SMSM00]. The policy gradient theorem states that, for various policy objectives, including the one stated above, the gradient of J(θ) can be expressed as follows:

∇J(θ) ∝ Σ_{s∈S} μ_θ(s) Σ_{a∈A} Q^{π_θ}(s, a) ∇_θ π_θ(a|s)    (2.30)
       = E_π[ Σ_{a∈A} Q^{π_θ}(s_t, a) ∇_θ π_θ(a|s_t) ]    (2.31)

The REINFORCE algorithm [Wil92] is a classic PG method that reformulates the equation above into a form that allows ∇J(θ) to be estimated from the realised returns of the agent G_t:

" # X πθ ∇J(θ) = Eπ Q (s, a)∇θπθ(a|s) (2.32) a∈A " # ∇ π (a|s ) X πθ θ θ t = Eπ πθ(a|st)Q (st, a) (2.33) πθ(a|st) a∈A

= Eπ [Gt∇θ log πθ(at|st)] (2.34)

The expectation above can be estimated by sampling from the agent's experiences. REINFORCE is a Monte Carlo method as it uses G_t as an input, and so it can only be used for episodic tasks, with updates occurring at the end of each episode. One can see from the formulation of the gradient used in REINFORCE that the parameters will be adjusted such that the probability of taking actions that led to high returns is increased more than that of actions that led to low returns. This is intuitive, but if all the returns are positive, for example, this means that we are increasing the probability of all actions that are taken, just by varying amounts - this can intuitively lead to a lot of variance in the updates, when perhaps it would make more sense to reduce the probability of actions leading to lower than average reward. The introduction of a state-dependent baseline is a common way of reducing the variance of the REINFORCE algorithm that does not introduce bias to the gradient estimate, since its contribution is zero in expectation:

∇J(θ) = Eπ [(Gt − b(st)) ∇θ log πθ(at|st)] (2.35)

A common baseline is to use the state-value function V^{π_θ}(s_t), which can be estimated using a separate function approximator. The difference between the realised return and the baseline is often referred to as the advantage function, denoted Â_t.
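A sketch of the REINFORCE gradient with a state-value baseline (Equation 2.35), written with PyTorch; here policy is assumed to return a torch.distributions.Categorical over actions and value_fn an estimate of V(s_t), both of which are illustrative assumptions rather than components described in the thesis.

```python
import torch

def reinforce_losses(policy, value_fn, states, actions, returns):
    # log pi_theta(a_t | s_t) for the actions taken along the trajectory
    dist = policy(states)
    log_probs = dist.log_prob(actions)
    # advantage = realised return minus state-value baseline; detaching the
    # baseline stops policy gradients flowing into the value network here
    values = value_fn(states).squeeze(-1)
    advantages = returns - values.detach()
    # minimising this loss ascends the policy gradient of Equation 2.35
    policy_loss = -(advantages * log_probs).mean()
    # the baseline itself is fit by regression onto the realised returns
    value_loss = ((values - returns) ** 2).mean()
    return policy_loss, value_loss
```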

Another drawback of many PG methods, including REINFORCE, is that they are on-policy algorithms, which means that experiences from the current policy must be used for each update, resulting in poor sample efficiency. Old experiences can be (re)used if the gradients are corrected with importance factors, but these typically introduce a lot of variance. Below I describe actor-critic methods, which combine PG methods with TD methods in order to find a better bias-variance tradeoff.

Actor-Critic Methods

Even with a suitable baseline, the REINFORCE algorithm has high variance because it uses the full returns G_t as an estimate for Q^{π_θ}. Another way to trade off variance by introducing some bias is to use a value function approximator to estimate Q^{π_θ} by bootstrapping, a technique which characterises a class of approaches known as actor-critic methods. For example, the one-step actor-critic method learns to approximate the state-value function with a separate model with parameters φ and approximates Q^{π_θ} with the one-step bootstrapped return, as well as using a state-value baseline:

∇J(θ) = Eπ [(rt+1 + γVφ(st+1) − Vφ(st)) ∇θ log πθ(at|st)] (2.36)

= Eπ [δt∇θ log πθ(at|st)] (2.37)

where δ_t, which results from subtracting the baseline V_φ(s_t) from the bootstrapped return, is the TD error. In Chapter 5, an off-policy actor-critic method called soft actor-critic [HZAL18] is used in the experiments; as discussed earlier, off-policy algorithms are often less stable than their on-policy counterparts, but soft actor-critic improves stability by simultaneously maximising the entropy of the agent's policy, leading to more diverse and robust solutions to the task.
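Similarly, a sketch of the one-step actor-critic losses of Equations 2.36 and 2.37, with the same illustrative PyTorch conventions as in the previous sketch: the TD error scales the policy gradient for the actor and serves as the squared regression error for the critic.

```python
import torch

def actor_critic_losses(policy, critic, s, a, r, s_next, done, gamma=0.99):
    dist = policy(s)
    log_prob = dist.log_prob(a)
    v = critic(s).squeeze(-1)
    with torch.no_grad():
        # one-step bootstrapped target r_{t+1} + gamma * V_phi(s_{t+1})
        target = r + gamma * (1.0 - done) * critic(s_next).squeeze(-1)
    td_error = target - v
    # Equation 2.37: the (detached) TD error weights the score function
    actor_loss = -(td_error.detach() * log_prob).mean()
    # the critic is trained to reduce the squared TD error
    critic_loss = (td_error ** 2).mean()
    return actor_loss, critic_loss
```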

2.3 Catastrophic Forgetting in ANN

All the methods developed in this thesis aim to tackle the issue of catastrophic forgetting, also known as catastrophic interference, in neural networks; the purpose of this section is to provide a historical and contemporary context for these methods by giving an overview of previous approaches to the problem. I start by discussing some of the early analyses of catastrophic forgetting, which began in the late 1980s after the emergence of backpropagation as a way of training deep networks; then I cover a selection of approaches to mitigating the problem by dividing them into three broad categories: (i) regularisation-based methods, (ii) replay-based methods, and (iii) architectural approaches. Amongst the description of the various methods, when appropriate, I relate them to the approaches I develop in subsequent chapters of this thesis and also to theories of how the brain avoids catastrophic forgetting. Finally, at the end of the section I briefly discuss the settings in which methods for mitigating catastrophic forgetting are typically evaluated.

2.3.1 Discovery and Early Approaches

In 1989, McCloskey and Cohen published an analysis that showed that, when neural networks were trained with backpropagation on a sequence of tasks, new learning severely disrupted previously acquired knowledge, a phenomenon they named 'catastrophic interference' [MC89]. This seminal work, along with a paper by Roger Ratcliff in 1990 [Rat90], is usually credited with bringing the attention of the scientific community to the problem of catastrophic forgetting. McCloskey and Cohen's first set of experiments was with networks trained in sequence on two arithmetic tasks: the first involved adding 1 to single digit numbers, and the second involved adding 2 to single digit numbers. They found that, using various metrics of performance, training on the second task led to a radical deterioration of the network's ability to perform the first task. The authors noted that the disruption of old knowledge by new learning in connectionist models of memory was already known, under the name of the stability-plasticity dilemma [CG87], and indeed in humans as 'retroactive interference' [BU59], but it was the severity of the forgetting that was so striking and that undermined the use of neural networks trained with backpropagation as a model of human memory. This point was emphasised with a second set of experiments they ran, in which they replicated a study on forgetting in humans in an associative learning context [BU59], finding that neural networks exhibited extreme forgetting in comparison to humans, where the deterioration was shallower and more graceful. Furthermore, they showed that none of their attempts to mitigate the issue worked, including: increasing the number of hidden units, changing the learning rate, overtraining on the first task, freezing half the weights, changing the target activation values, and making the input and output representations more localised.

McCloskey and Cohen identified the cause of catastrophic forgetting in the distributed nature of representations in neural networks, the same property that provides them with the powerful ability to generalise. Many early approaches to mitigating catastrophic forgetting focused on reducing the overlap between representations of different inputs in the network, often by modifying the backpropagation algorithm. Robert French, for example, proposed an 'activation sharpening algorithm' that resulted in sparse activations, where only a small number of the hidden nodes were active at any one time [Fre91]. Learning sparse or semi-distributed representations is an approach that is still pursued today for mitigating catastrophic forgetting, examples of which are described later in this section. Indeed, many of the concepts behind modern methods have their origins in the last century. In 1990, Kortge modified the backpropagation algorithm with the 'novelty rule' in the context of auto-associative learning, which had the effect of only updating weights that contributed to parts of the input that the network could not yet encode well - this was shown to improve the retention of knowledge in the network [Kor90]. This method relates to a large class of modern approaches that use regularisation of parameters in the network in order to prevent forgetting. In 1995, Robins investigated the use of rehearsal mechanisms, i.e. retraining on a subset of previously seen data, for mitigating forgetting and also introduced the technique of pseudorehearsal, which interleaves training on new data with that of randomly generated pseudo-data that represents what the network used to know [Rob95]. Many recently developed techniques use these same core ideas of rehearsal and pseudorehearsal and are known as replay-based methods, as they share similarities with experience replay in RL [Lin92] and with the theory of hippocampal replay in the biological brain [MMO95].

While the discovery of backpropagation as a training method for deep neural networks sparked interest in the connectionist research community in solving the phenomenon of catastrophic forgetting, the rediscovered power of deep neural networks in recent years has triggered a renewed urgency to address the problem, resulting in an explosion in the number of new approaches. The coming subsections provide an overview of these approaches, broken down into a number of main categories.

2.3.2 Regularisation-based Methods

Regularisation-based approaches to mitigating catastrophic forgetting work by adding constraints to the updates made to the network parameters, in a way that preserves previously acquired knowledge. One way of doing this, sometimes referred to as structural regularisation [ZPG17], involves applying constraints to the parameters of the network, which takes inspiration from the fact that, in the brain, individual synapses can become consolidated over time. Another set of approaches apply functional regularisation to the neural network, which, rather than considering the stabilisation of parameters as a proxy for consolidating the function of the network, add constraints that prevent changes to the overall mapping of the network in a more direct fashion. Below I discuss a variety of methods that fall into each category.

Synaptic Consolidation and Structural Regularisation

In the brain, synaptic plasticity has not only been observed to occur at multiple timescales, such as long-term potentiation [BL73] and short-term plasticity [TW13], but the level of plasticity can also change over time, for example via the process of synaptic consolidation [FM97]. Synaptic consolidation refers to a reduction in plasticity of the synapse, rendering it more stable, which has intuitively been hypothesised as a way for the brain to consolidate knowledge and prevent forgetting [FDA05, BF16]. Consolidating all synapses would render new learning impossible and so the brain must be selective about which ones to stabilise. While neither the conceptual nor precise biochemical mechanisms for synaptic consolidation are well established, in experiments it has been associated with repeated stimulation of the synapse [FM97] and also with the presence of a biochemical called dopamine [FS+90]; these factors could play a role in selecting what synapses to consolidate. The activation of dopaminergic neurons in the brain has been associated with reward prediction error [Sch07] and with novel stimuli [BMMH10]. The former association is often cited as evidence of how the brain performs reinforcement learning, and the latter association evokes a comparison with Kortge's 'novelty rule'.

The techniques that take inspiration from synaptic consolidation to mitigate catastrophic forgetting in artificial neural networks often select which parameters to consolidate by assigning them importance factors that determine their stability. In elastic weight consolidation (EWC) [KPR+17], after training on task A with data from D_A to yield parameters θ*_A, the importance factors for the parameters are calculated by estimating the diagonal of the Fisher information matrix, which for each parameter θ_i is given by F_{θ*_{A,i}} = E_{x,y∼D_A}[ ((∂/∂θ_i) log p(y|x, θ) |_{θ*_A})² ]. This measure essentially expresses how sensitive the loss function is with respect to each parameter in the network - the parameters with the greatest Fisher coefficients, and thus the ones the loss function is most sensitive to, are the most important ones to stabilise and protect from subsequent adaptation. This is achieved by adding quadratic constraints to the loss function when training on the next task (task B) that penalise the parameters from moving away from where they were at the end of training on task A, weighted by the Fisher coefficients:

L(θ) = L_B(θ) + Σ_i (λ/2) F_{θ*_{A,i}} (θ_i − θ*_{A,i})²    (2.38)

Additional quadratic constraints are then added for subsequent tasks that are trained on. EWC was shown to mitigate catastrophic forgetting in a supervised learning context, on the permuted MNIST tasks, and in an RL context, on sequential training of ATARI 2600 games.
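For concreteness, a sketch of the EWC regulariser of Equation 2.38 in PyTorch; fisher and theta_star are assumed to be dictionaries holding the per-parameter Fisher estimates and the parameters at the end of task A, and the sketch is an illustration of the published idea rather than the authors' code.

```python
import torch

def ewc_penalty(model, fisher, theta_star, lam):
    # sum_i (lambda / 2) * F_i * (theta_i - theta*_{A,i})^2, as in Equation 2.38
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - theta_star[name]) ** 2).sum()
    return 0.5 * lam * penalty

# during training on task B (illustrative usage):
# loss = task_b_loss(model, batch) + ewc_penalty(model, fisher_A, theta_star_A, lam)
```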

In [ZPG17], a similar method was developed concurrently which also used separate quadratic constraints for each parameter, weighted by importance factors, to prevent forgetting. In this case, however, the importance factor for each parameter is calculated in an online fashion by determining its contribution to the drop in the loss function over training; in contrast to EWC, which uses a local approximation of importance with the Fisher information metric, this method provides a more global interpretation by considering a parameter's impact over the whole learning trajectory. In [CDAT18], these two methods are combined by calculating the path integral of the parameters on the Riemannian manifold defined by the Fisher metric. In another structural regularisation approach [ABE+18], the importance factors are made proportional to the derivative of the L2 norm of the network output with respect to each parameter; by not relying on the loss function for consolidation, their method can be flexibly used to constrain the model with data different from those it was trained on. Finally, it is also worth mentioning an earlier regularisation-based method for lifelong learning in the context of linear and logistic regression called ELLA [RE13], which constructs models for each task via a combination of shared latent components and a set of task-specific vectors. As new tasks come in, both the shared and specific components are updated, while ensuring that the parameters are constrained to be close to the optimal parameters for each task in a metric defined by the Hessian matrix at this optimal point, which is stored for each task.

Though it was inspired by synaptic consolidation, EWC was mathematically derived as learning an approximation to the posterior distribution of the parameters, p(θ|DA, DB), using Bayes’ rule:

log p(θ|DA, DB) = log p(DB|θ) + log p(θ|DA) − log p(DB|DA) (2.39)

Unfortunately, the prior probability in this equation p(θ|D_A) is intractable when using a deep neural network, and so EWC uses the Laplace approximation to approximate it with a diagonal Gaussian with the Fisher coefficients described above. This interpretation of parameter regularisation has led to this class of methods sometimes being referred to as prior-focused approaches [FG18]. While EWC approximates the posterior after the first task is learnt, when it starts to add constraints for multiple tasks, the Bayesian interpretation is lost [Hus18]. In variational continual learning [NLBT18], however, an approximation to the posterior is maintained throughout the training of multiple tasks by minimising the Kullback-Leibler (KL) divergence² between the current distribution and the previous posterior, which takes the form of a Gaussian with a diagonal covariance. In all the regularisation methods mentioned so far, the parameters of the network are assumed to be independent of one another; in the case of a Gaussian approximation, the posterior distribution is assumed to have a diagonal covariance matrix with no interactions between the weights. Some other works relax this assumption, allowing for more expressive distributions with interactions between the weights by using different factorisations for the Gaussian approximate posterior [RBB18, KGP+18, ZGHS18].

²The Kullback-Leibler divergence is a measure of how similar two probability distributions are. For two continuous distributions, p and q, it is defined as: D_KL(p||q) = ∫_{−∞}^{∞} p(x) log(p(x)/q(x)) dx. Note that, in general, D_KL(p||q) ≠ D_KL(q||p).

The algorithm I present in Chapter 3 classifies as a structural regularisation method and is inspired by a model of synaptic consolidation [BF16]. Similarly to many of the methods described above, it assumes that the parameters are independent of one another, but it differs in that (i) there are no importance factors - each parameter is simply constrained to be where it was at different periods in time, and (ii) it does not require the data distribution to be split up into discrete tasks. In general, structural regularisation methods have important advantages, namely that they allow sequential training without the storage of previous data and without growing the number of parameters required to learn, but they also have a number of downsides, for example: (i) the constraints on individual parameters can become stale (there may be more efficient ways to encode information in light of seeing data from different distributions), (ii) as more constraints are added, the network can become too stable, making it hard to learn from new data, and (iii) since approximations to the posterior distribution of parameters have to be used for tractability, they do not always preserve the network's function accurately.

Functional Regularisation

Rather than add constraints to the network's parameters, it is also possible to regularise the network function directly. One common way of doing this is by the use of the knowledge distillation framework [HVD15, RCG+15], which allows knowledge to be transferred from one network to another by training the latter to match the input-output mapping of the former. In [LH17], this technique is used to ensure that, during the training of task B, the input-output mapping of the network using the data from task B does not deviate too much from where it was right after having trained on task A. In [SCL+18], the balance between new learning and retention of old knowledge is maintained by using two separate networks, a flexible 'active column' and a more stable 'knowledge base', whereby knowledge is frequently distilled from the flexible network to the stable one. Conversely, in [FZS+16] distillation is used to transfer knowledge from a more stable network to a more flexible one. These two papers relate to the biological theory of systems consolidation, which is described in more detail in the next section on replay-based methods. The distillation technique is a key component of the method I develop in Chapter 4, where it is used to transfer knowledge between networks that are learning at different timescales.

Other forms of functional regularisation for continual learning involve regularising the representations in the hidden layers to be similar to those in previous tasks [JJJK16], and the use of Gaussian processes to maintain a distribution over functions [TSdGM+20].

2.3.3 Replay-based Methods

One salient theory of how the brain is capable of continual learning is that it is comprised of two complementary learning systems (CLS) [MMO95]: one is the hippocampus, which is very plastic and is responsible for fast acquisition of new memories, as well as for replaying memories to the second system, the neocortex, which is less plastic and, with the aid of hippocampal replays, learns slowly in an interleaved fashion, allowing it to retain memories over a longer period and generalise from experiences over a lifetime. This process of replaying is often referred to as systems consolidation. Though in reality the distinction is not so clear cut [KHM16], in CLS theory, the hippocampus is thought to store episodic memories, which constitute specific autobiographical events that can be explicitly recalled and stated, while the cortex stores semantic memory, which represents more general, structured knowledge about the world. CLS theory has inspired or has at least served as a biological correlate to a number of algorithms for mitigating catastrophic forgetting, known collectively as replay-based methods. In these methods, a portion of the incoming data is stored, corresponding to the episodic memories stored in the hippocampus, and then frequently replayed to the neural network, which plays the role of the slow-learning cortex, in order to prevent forgetting. This set of approaches relaxes the assumption that, in a continual learning context, no data can be stored, but they typically ensure that the total amount of data stored is either fixed or grows slowly over time.

As discussed in the preceding section of this chapter, the technique of maintaining a buffer of the agent's previous experiences has long been used in RL and is known as experience replay [Lin92]. Typically in deep RL, the buffer contains the agent's most recent experiences in the form of a FIFO buffer [MKS+15], which prevents short-term forgetting, but recently a number of different strategies for selecting which memories to store have been explored for use in a continual learning context. In [IC18], the authors investigate (i) prioritising the storage of the most surprising experiences, defined as those with a high TD error, (ii) storing the most rewarding experiences, (iii) storing a uniform sample over the whole history of experiences of the agent using the technique of reservoir sampling [Vit85], and (iv) selecting experiences to store that maximise the coverage of the state space. The reservoir sampling method was found to be the best performing overall, but it was also necessary to maintain a small FIFO buffer to ensure that the network trained on all incoming experiences. The importance of maintaining a balance of new and old experiences in the replay database is something that has been noted in several other analyses, both in single task and sequential multi-task settings [dBKTB15, dBKTB16, ZS17, RAS+19, WR19]; old experiences are necessary to prevent forgetting, but a high density of new experiences is crucial for fine-tuning the policy in the current environment, especially when it may be evolving quickly. In Chapter 5, I address this issue by developing a replay database that records experiences over multiple timescales, with the idea of providing a smooth balance between new and old experiences in situations where the timescale of nonstationarity in the environment is unknown.
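Since reservoir sampling [Vit85] is mentioned above as a way of keeping a uniform sample over the agent's entire history in a fixed-size buffer, here is a minimal sketch of the algorithm; the buffer interface and capacity are assumptions of the example rather than details of any of the cited implementations.

```python
import random

class ReservoirBuffer:
    """Keeps a uniform random sample of all experiences seen so far in a fixed-size buffer."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, experience):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            # the i-th item seen (1-indexed) replaces a random slot with
            # probability capacity / i, which keeps every item seen so far
            # equally likely to be present in the buffer
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.buffer[j] = experience
```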

Replay buffers have also been used for continual learning outside of RL. In the context of classification tasks, for example, a typical strategy is to store a small number of datapoints per class, sometimes by using clustering techniques to identify 'typical' members of a given class, known as exemplars [RKSL17, LP+17, HCK19]. Some methods, rather than simply replay the items in memory to the network, use the memory to regularise the learning updates in order to prevent forgetting. For example, in [LP+17] and [CRRE19], the learning updates are constrained to ensure that they do not increase the loss on the examples stored in memory. In a similar vein, in [RCA+19], a common meta-learning algorithm [FAL17] is used in combination with a replay database to learn a set of network parameters from which future gradient updates are unlikely to interfere with each other. Interestingly, in [CRE+19], it is found that, across a number of different experience replay methods on sequences of classification tasks, even very small replay buffers can improve memory retention significantly.

Explicitly storing raw data in a fixed size database, however, is unlikely to scale very well over a long period of time or for a large number of tasks. An approach that avoids the explicit storage of raw data is that of pseudo-rehearsal, pioneered by [Rob95], whereby pseudo-data is created by sampling random inputs from a uniform distribution that are matched with the predicted outputs of the network - this pseudo-data is later used to replay to the network to prevent forgetting. Some modern versions of pseudo-rehearsal take advantage of recent advances in training generative models using neural networks, rather than using uniformly sampled random inputs. In [SLKK17], generative adversarial networks [GPAM+14] are trained sequentially to mimic data from previous tasks, which is then interleaved with new data to train the task solver. Similarly, in [KGL17], a variational autoencoder [KW13] is trained to generate pseudo-data from the historical distribution. An important deficiency of such methods is that the problem of catastrophic forgetting is then passed down to the generative network - after a certain period of time, it may not remember how to mimic old parts of the data distribution.

2.3.4 Architectural Approaches

In this subsection, I describe a selection of approaches to mitigating catastrophic forgetting by adapting the architecture of the neural network. A number of the methods in this category can be classified as dynamic architectures, in the sense that the architecture of the model is modified throughout training; this usually entails adding neural resources to be able to incorporate new knowledge without forgetting and, in this way, they relax the criterion that the network capacity must be fixed in a continual learning setting. It is not inconceivable that this is one strategy employed by the brain - it’s thought that the creation of new neurons, known as neurogenesis, could contribute to the assimilation of new memories in humans [EPBE+98].

In [RRD+16], a new neural network, or column, is instantiated for every newly encountered task, along with lateral connections from the hidden layers of all existing columns to the new one; during training of the new task, only the weights of the new column and the incoming lateral connections are adjusted - the parameters in all other columns are frozen. This setup ensures that there is no catastrophic forgetting of previous tasks and also enables transfer between tasks via the lateral connections; the key disadvantage is that the number of parameters grows quadratically with the number of tasks, primarily due to the lateral connections. In [YYLH18], rather than adding a fixed number of parameters per task, neurons are added adaptively: if the loss on a task after training is above a certain threshold, an optimal number of neurons is added per layer, determined by a process of group sparse regularisation; then, the neurons that have had the largest semantic change, determined by the change to their incoming weights, are duplicated, such that one neuron has the old incoming weights and one has the new ones, in order to prevent catastrophic forgetting of previous tasks. In [XZ18], reinforcement learning is used to decide how many neurons to add at each layer after each task, using a reward function that combines performance on the task and model simplicity - a method that requires fewer hyperparameters than [YYLH18].

Related to the methods just described, another set of approaches use ensembles of networks, which are combined to make predictions and prevent forgetting. Some of these, like [RRD+16], add a new sub-model for every new task or batch of data [PUUH01, DYXY07], while others, like [YYLH18, XZ18], are more efficient by selectively adding capacity [WFYH03, RWLG17]. A very recent method efficiently adds capacity by maintaining one set of ‘slow’ weights and multiple sets of low rank ‘fast’ weight matrices per layer that are combined with the slow weights via a Hadamard product to generate an ensemble [WTB20].

Overall, the methods described above suffer from the fact that they involve a growing number of parameters, which calls into question their scalability. Some other approaches use implicit ensembles that do not require growth in the number of parameters. In [GMX+13], a method called Dropout [SHK+14] was shown to somewhat mitigate catastrophic forgetting; Dropout works by randomly silencing neurons during training, creating implicit subnetworks within the full network that can be thought of as an ensemble. In [FBB+17], a genetic algorithm called PathNet is used to discover optimal ‘paths’ through the network for each task; these paths or subnetworks are then frozen for training on subsequent tasks to avoid forgetting, while allowing for feature reuse. Dropout, however, was shown to be inferior to EWC in [KPR+17], and PathNet was only evaluated on forward transfer, not catastrophic forgetting.

2.3.5 Sparse Coding / Semi-distributed Representations

As mentioned earlier, one of the first approaches to mitigating catastrophic forgetting was to directly tackle what was thought to be the cause of it: distributed representations. Since distributed representations are also what give neural networks the power to generalise effectively, some approaches try to find a balance by encouraging the network to learn sparse or semi-distributed representations [Fre91]; in a sparse representation, only a small subset of the hidden units in a layer are active for a given input, reducing the amount of representational overlap. In [Fre91], an ‘activation sharpening’ algorithm was proposed whereby, for a given input, the activations of the most active units in a layer were increased, while those of the less active units were decreased; the method was shown to improve the relearning ability of networks in a sequential associative learning setup.

In the brain, neurons are split into two broad types: excitatory neurons, which excite the activity of other neurons that they are connected to, and inhibitory neurons, which dampen the activity of other neurons. One potential role of inhibitory neurons is to create a winner-takes-all (WTA) mechanism among excitatory neurons via lateral inhibition, whereby the activity of the most active neurons excites the inhibitory neurons, which in turn dampen the activity of other excitatory neurons; this ends up amplifying the activity of the most active neurons relative to the less active ones, creating a similar effect to French’s activation sharpening algorithm. More recent approaches have also found that introducing lateral inhibition to encourage sparse representations can mitigate catastrophic forgetting in ANNs [SMK+13, ART19]. In [SMK+13], layers of linear neurons are divided into small blocks and a hard WTA mechanism is implemented within each block, with all but the most active neuron in each block being silenced; networks with this mechanism displayed better memory retention in a sequential setting than non-linear networks without it. In [ART19], sparse representations via local inhibition are shown to alleviate the problem of capacity saturation that is often encountered when using regularisation-based methods for mitigating forgetting, such as EWC [KPR+17]. Their method involves lateral inhibition of ‘nearby’ neurons in a layer, using a Gaussian kernel on the difference in index values, and neuronal importance factors for discounting inhibition, in order to reduce representational overlap between different tasks with similar input distributions.
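To illustrate the kind of hard, block-wise WTA mechanism described above, a minimal sketch is given below; the block size, array shapes and function name are illustrative assumptions rather than details taken from [SMK+13].

```python
import numpy as np

def blockwise_hard_wta(activations, block_size=4):
    """Silence all but the most active unit within each block of a layer's activations.

    activations: 1-D array of layer activations; its length is assumed to be
    divisible by block_size for simplicity.
    """
    blocks = activations.reshape(-1, block_size)
    winners = blocks.argmax(axis=1)
    mask = np.zeros_like(blocks)
    mask[np.arange(len(blocks)), winners] = 1.0
    return (blocks * mask).reshape(-1)

# Example: a layer of 8 linear units split into two blocks of 4.
h = np.array([0.2, 1.5, -0.3, 0.7, 0.9, 0.1, 2.0, -1.0])
print(blockwise_hard_wta(h))  # only the winner in each block remains active
```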

In [JW19], it is noted that encouraging sparsity is a proxy for learning a representation that avoids interference with future tasks. With this in mind, they developed a technique that meta-trains a ‘representation learning network’ with MAML [FAL17] over multiple continual learning problems to provide the input to a ‘prediction learning network’, which is specific to each continual learning problem (defined as training on a number of separate tasks in sequence). In this way, the representation learning network learns to produce representations that are generally useful for learning sequences of tasks without forgetting, and they found that their method improved on methods that directly targeted sparsity. The meta-learned representations were indeed very sparse but, in contrast to algorithms that specifically target sparsity, there were fewer ‘dead’ neurons (ones that are inactive for all inputs), showing that they were making better use of the representational capacity.

Finally, it is worth noting that the use of sparse representations in deep reinforcement learning has been shown to improve performance in a single-task setting [LKLW19], where the authors show that the locality of the representation mitigates catastrophic forgetting, which can have knock-on effects on performance due to bootstrapping with inaccurate value estimates.

2.3.6 Task-free Methods

As mentioned in the Introduction, methods for mitigating catastrophic forgetting have largely been evaluated in the context of training on a number of distinct tasks in sequence. This is problematic because, for algorithms deployed in the real world, not only may task boundaries be unknown but changes to the data distribution may be gradual rather than discrete. As a result of this context for evaluation, many methods rely on the knowledge of task boundaries. For example, many of the regularisation methods add constraints to the loss function after every task switch [NLBT18, ZPG17, ABE+18], and many of the dynamic architectures add capacity at the task boundaries [RRD+16, YYLH18, XZ18]. Some recent methods address the case where the task boundaries are not known, referred to as task-agnostic continual learning, but are typically still evaluated in a sequential task setting. For example, in [KPR+17], a generative model called the Forget-Me-Not process [MVK+16] is used to detect task boundaries at which to apply EWC; in [AKT19], the regularisation-based method memory-aware synapses [ABE+18] is extended to a task-agnostic setting by choosing to update the importance weights when the loss function has been stable for a sufficient period of time. Some methods do not try to detect discrete changes to the distribution, such as [ZGHS18], which proposes a Bayesian method for updating the posterior of the weights (approximated with a diagonal Gaussian) at every iteration. In [RVR+19], a method is developed that learns a continuous task-specific representation in an unsupervised learning setting - this is one of the few papers that is actually evaluated in a setting where the distribution is changed gradually over time.

Some replay-based methods extend naturally to the task-free setting where the distribution may be changing gradually. For example, the reservoir buffer, which maintains a fixed-size uniform sample over all historical data, does not assume or require the distribution to be split up into discrete tasks, though again it has only been tested in a sequential task setting [IC18, RAS+19]. In [ABT+19], a method is developed for selecting data points from a reservoir buffer to be replayed, prioritising those that would suffer from the biggest increase in loss from updates using incoming data. Additionally, some of the pseudo-rehearsal methods do not require a discretisation of the data distribution into tasks [SLKK17, KGL17]. The multi-timescale replay buffer developed in Chapter 5 is designed for and tested in a task-free setting.
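Since the reservoir buffer recurs throughout task-free continual learning, a minimal sketch of standard reservoir sampling is given below for concreteness; the capacity and interface are illustrative assumptions rather than details from the cited works.

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer maintaining a uniform sample over all data seen so far
    (standard reservoir sampling), with no notion of task boundaries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0

    def add(self, item):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            # Each item seen so far is kept with probability capacity / n_seen.
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = item

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))
```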

While several methods for mitigating catastrophic forgetting have been evaluated in the context of RL, this has predominantly been on sequences of discrete tasks [KPR+17, SCL+18]. A handful of analyses address the fact that catastrophic forgetting can occur in a single task setting, where the changes to the distribution can be gradual, unpredictable and arise from multiple sources - most of these investigate this with respect to the composition of the replay buffer. As mentioned earlier, the FIFO buffer is very commonly used to counteract short-term nonstationarity [MKS+15], but it is not designed to deal with longer term changes to the distribution and is rarely characterised as preventing ‘forgetting’. In [dBKTB15, dBKTB16, ZS17, ZZP+19], the importance of maintaining a diverse replay database that balances old and new memories is analysed with various approaches in a single task setting. One different approach shows that inducing sparse representations can mitigate forgetting over the course of learning a single task [LKLW19].

As mentioned in the RL section of this chapter, the combination of off-policy learning, bootstrapping and function approximation is known as the ‘deadly triad’ in RL [SB98] and can cause instability in learning; one of the reasons that catastrophic forgetting is often not considered explicitly as a problem in single task RL might be that it is not easy to separate from the other potential causes of instability. For example, bootstrapping can cause instability when the value estimates used for bootstrapping are inaccurate, leading to compounding of errors [KFS+19]; inaccurate estimates might arise because of inadequate generalisation of the function approximator to states it has never seen before, or due to catastrophic forgetting of the values of states that it has seen, but a long time ago. Several techniques have been developed for improving the stability of deep RL with bootstrapping, such as target networks [MKS+15, LHP+15] and double Q-learning [VHGS16], but most have not explicitly targeted catastrophic forgetting.

The methods developed in this thesis are specifically designed to address catastrophic forgetting in both a sequence of discrete RL tasks and within a single task, and evaluations are performed in both settings for each method, as well as in more challenging ones: the method developed in Chapter 4 is also evaluated in a multi-agent setting, introducing a new unpredictable source of nonstationarity, and the algorithm presented in Chapter 5 is tested on tasks where the physical laws of the environment are modified gradually over time.

Chapter 3

Continual Reinforcement Learning with Complex Synapses

3.1 Introduction

Whereas in ANNs, the parameters are usually modelled as scalar values, an individual synapse in the brain comprises a complex network of interacting components that evolve at different timescales. In this chapter, I present a method for continual reinforcement learning that takes inspiration from this complexity and the fact that synaptic plasticity has been observed to occur at multiple timescales in the brain, including short-term plasticity [ZR02], long-term plasticity [BL73] and synaptic consolidation [CZV+08]. Intuitively, the slow components of plasticity could ensure that a synapse retains memory of a long history of its modifications, while the fast components render the synapse highly adaptable to the formation of new memories, perhaps providing a solution to the stability-plasticity dilemma.

In particular, I explore whether a biologically plausible synaptic model [BF16], which abstractly models plasticity over a range of timescales using multiple components per synapse (rather than a sole scalar value), can be applied to mitigate catastrophic forgetting in a reinforcement learning context. By running experiments with both tabular and deep RL agents, it is shown that the model helps continual learning across two simple tasks as well as within a single task, by

alleviating the need for an experience replay database, indicating that the incorporation of different timescales of plasticity can correspondingly result in improved behavioural memory over distinct timescales. Furthermore, this is achieved even though the process of synaptic consolidation has no prior knowledge of the timing or timescale of changes in the data distribution, nor does it rely on the knowledge of task boundaries.

I start by describing the model for synaptic consolidation developed by Benna and Fusi [BF16] and how it can be combined with any synaptic learning rule, and follow by detailing the RL algorithms that I use in the experimental evaluations. Subsequently, I describe the experiments undertaken in both the tabular and deep RL settings and analyse the results. I then briefly discuss how the method fits into related work and propose avenues for future work. Tables of parameters used for experiments are added at the end of the chapter, along with the results of additional experiments that are referenced in the main section. Finally, the last section, which is based on work done after the rest of this project was completed, discusses how the synaptic consolidation method can be viewed in the context of the online convex optimisation framework [Zin03]1.

3.2 Preliminaries

3.2.1 The Benna-Fusi Model

In this chapter, I make use of a synaptic model that was originally derived to maximise the expected signal to noise ratio (SNR) of memories over time in a population of synapses undergoing continual plasticity in the form of random, uncorrelated modifications [BF16]. The model assumes that a synaptic weight w at time t is determined by its history of modifications up until that time, ∆w(t'), which are filtered by some kernel r(t − t'), such that

w(t) = \sum_{t'} \Delta w(t')\, r(t - t') \qquad (3.1)

1 The material for this chapter featured in a paper accepted at ICML in 2018 [KSC18], except for the online convex optimisation section, which was work done post-submission to the conference.

While constraining the variance of the synaptic weights to be finite, the expected (doubly logarithmic) area under the SNR vs. time curve of a given memory is typically maximised when r(t) \sim t^{-1/2}, i.e. when the kernel decays with a power law.

Implementing this model directly is impractical and unrealistic, since it would require recording the time and size of every synaptic modification; however, the authors show that the power law decay can be closely approximated by a synaptic model consisting of a finite chain of N communicating dynamic variables (as depicted in Figure 3.1). The dynamics of each variable u_k in the chain are determined by interaction with its neighbours in the chain:

C_k \frac{du_k}{dt} = g_{k-1,k}(u_{k-1} - u_k) + g_{k,k+1}(u_{k+1} - u_k) \qquad (3.2)

except for k = 1, for which we have

C_1 \frac{du_1}{dt} = \frac{dw_{\text{ext}}}{dt} + g_{1,2}(u_2 - u_1) \qquad (3.3)

where the g_{k-1,k} and g_{k,k+1} terms are constants, and \frac{dw_{\text{ext}}}{dt} corresponds to a continuous form of the ∆w(t') updates (Equation 3.1). For k = N, there is a leak term, which is constructed by setting u_{N+1} to 0. The synaptic weight itself, w, is simply read off from the value of u_1, while the other variables are hidden and have the effect of regularising the value of the weight by the history of its modifications.

From a mechanical perspective, one can draw a comparison between the dynamics of the chain of variables and liquid flowing through a series of beakers with different base areas C_k connected by tubes of widths g_{k-1,k} and g_{k,k+1}, where the value of a u_k variable corresponds to the level of liquid in the corresponding beaker (Figure 3.1).

Given a finite number of beakers per synapse, the best approximation to a power law decay is achieved by exponentially increasing the base areas of the beakers and exponentially decreasing the tube widths as you move down the chain, such that C_k = 2^{k-1} and g_{k,k+1} ∝ 2^{-k-2}. Beakers with wide bases that are connected by smaller tubes will necessarily evolve at longer timescales. From a biological perspective, the dynamic variables can be likened to reversible biochemical processes that are related to plasticity and occur at a large range of timescales.

Figure 3.1: Diagrams adapted from [BF16] depicting the chain model (top) and the analogy to liquid flowing between a series of beakers of increasing size and decreasing tube widths (bottom).

Importantly, the model abstracts away from the causes of the synaptic modifications ∆w and so is amenable to testing in different learning settings. In the original paper [BF16], the model was shown to extend the lifetimes of random, uncorrelated memories in a perceptron and a Hopfield network, while in this work I test the capacity of the model to mitigate behavioural forgetting in more realistic tasks where synaptic updates are unlikely to be uncorrelated.

In all the experiments, the Benna-Fusi ODEs were simulated using the Euler method for numerical integration in order to convert them into discrete updates of the form:

u_1 \leftarrow u_1 + \frac{\eta}{C_1}\left(\Delta w + g_{1,2}(u_2 - u_1)\right) \qquad (3.4)

u_k \leftarrow u_k + \frac{\eta}{C_k}\left(g_{k-1,k}(u_{k-1} - u_k) + g_{k,k+1}(u_{k+1} - u_k)\right) \qquad (3.5)

where the magnitude of the learning rate η corresponds to the coarseness of the time discretisation ∆t.
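As a concrete illustration, a minimal sketch of these discrete updates for a vector of parameters is given below; the class structure, default constants and the exact exponential scaling of the beaker sizes and tube widths are illustrative assumptions rather than the code used for the experiments in this chapter.

```python
import numpy as np

class BennaFusiChain:
    """Minimal sketch of the discretised Benna-Fusi chain (Eqs. 3.4-3.5).

    Each parameter is represented by N interacting variables u_1..u_N;
    u_1 is the visible weight, the deeper variables act as a slow memory.
    """

    def __init__(self, shape, n_vars=3, g12=1e-5, lr=1.0):
        self.u = [np.zeros(shape) for _ in range(n_vars)]
        self.C = [2.0 ** k for k in range(n_vars)]            # C_k growing exponentially
        self.g = [g12 * 2.0 ** (-k) for k in range(n_vars)]   # g_{k,k+1} shrinking exponentially
        self.lr = lr

    def step(self, delta_w):
        u, C, g, eta = self.u, self.C, self.g, self.lr
        new_u = [x.copy() for x in u]
        # Visible variable: external update plus flow from u_2 (Eq. 3.4).
        new_u[0] += (eta / C[0]) * (delta_w + g[0] * (u[1] - u[0]))
        # Hidden variables: flow from both neighbours (Eq. 3.5); u_{N+1} = 0 gives the leak.
        for k in range(1, len(u)):
            deeper = u[k + 1] if k + 1 < len(u) else 0.0
            new_u[k] += (eta / C[k]) * (g[k - 1] * (u[k - 1] - u[k]) + g[k] * (deeper - u[k]))
        self.u = new_u

    @property
    def weight(self):
        return self.u[0]
```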

3.2.2 RL Algorithms

All experiments in this chapter were conducted in an RL paradigm. Below I describe the two algorithms used: the first is a tabular method and the second is a deep RL method.

Q-learning with Eligibility Traces

The Q-learning algorithm [WD92] is one of the most common tabular RL methods and was described in Section 2.2.3 of the Background chapter. In this chapter, I use a variation of Q-learning that uses eligibility traces, which bridges the gap between temporal difference and Monte Carlo methods. As a reminder, in standard Q-learning, after every transition (s_t, a_t, r_{t+1}, s_{t+1}) the following updates are made to the Q-value table:

\delta_t \leftarrow r_{t+1} + \gamma V(s_{t+1}) - Q(s_t, a_t) \qquad (3.6)

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\,\delta_t \qquad (3.7)

In the ‘naive’ Q(λ) algorithm, in addition to the Q-value table, a table of eligibility traces e(s, a) is maintained for each state-action pair; these are all updated as follows at every transition:

e_t(s, a) = \begin{cases} 1 & \text{if } s_t = s \text{ and } a_t = a \\ \gamma\lambda\, e_{t-1}(s, a) & \text{otherwise} \end{cases} \qquad (3.8)

where λ ∈ [0, 1] is a constant decay parameter. All the Q-values are then updated at each time step as follows:

Q(s, a) \leftarrow Q(s, a) + \eta\,\delta_t\, e_t(s, a) \qquad (3.9)

If λ = 1, the algorithm is equivalent to a Monte Carlo method in which the full return G_t is used to update the Q-values; if λ = 0, the algorithm is equivalent to standard Q-learning. By choosing λ to be somewhere between 0 and 1, learning can often be sped up by improving the bias-variance tradeoff between the purely TD and MC extremes. Additionally, unlike pure MC methods, eligibility traces can be used to learn in a continuing task setting.
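To make the update rule concrete, a minimal sketch of a single naive Q(λ) step is shown below; the default hyperparameter values mirror those in Table 3.1, but the function interface and array layout are illustrative assumptions, not the thesis code.

```python
import numpy as np

def q_lambda_step(Q, E, s, a, r, s_next, gamma=0.9, lam=0.9, eta=0.1):
    """One naive Q(lambda) update (Eqs. 3.6-3.9) on tabular Q-values Q and
    eligibility traces E, both of shape (n_states, n_actions)."""
    # TD error using the greedy value of the next state (Eq. 3.6).
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    # Decay all traces, then reset the trace of the visited pair (Eq. 3.8).
    E *= gamma * lam
    E[s, a] = 1.0
    # Update every Q-value in proportion to its trace (Eq. 3.9).
    Q += eta * delta * E
    return Q, E
```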

As well as speeding up learning, I chose to use this method because, as elucidated further on in the Experiments section, the eligibility traces can also be used to modulate the rate of synaptic consolidation and so improve memory retention.

Soft Q-learning

For the deep RL experiments in this chapter, I use an algorithm called soft Q-learning [HTAL17], which is a generalised form of deep Q-learning where the goal is to maximise not only the expected future reward, but also the entropy of the agent’s policy:

\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) + \alpha H(\pi(\cdot|s_t))\big)\right] \qquad (3.10)

where the entropy of the policy is H(\pi(\cdot|s_t)) := \mathbb{E}_{\pi}[-\log \pi(a_t|s_t)] and α is a constant that controls the balance between reward and entropy maximisation. The cost function is similar to that of DQN except that the state-value function is calculated as a soft-max rather than a hard-max over Q-values:

V_{\text{soft}}(s) = \alpha \log \sum_{a'} \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(s, a')\right), \qquad (3.11)

and the policy π, rather than deterministically choosing the action with the highest Q-value, picks stochastically as follows:

\pi(a|s) \propto \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(s, a)\right) \qquad (3.12)

Soft Q-learning is a strict generalisation of DQN, since one can recover DQN by letting α tend to 0. One benefit of soft Q-learning is that it can generate a more robust policy, as it encourages the agent to learn multiple solutions to the task. It is more effective than simply using a softmax policy, which maximises only the one-step entropy of the policy, since soft Q-learning biases the policy towards states that will also lead to future states with higher entropy. In the experiments with DQN, I used the soft Q-learning objective as I found that it helped to stabilise performance over time.
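The soft value and the soft policy are straightforward to express in code; the snippet below is a small illustrative sketch using a numerically stable log-sum-exp, not the exact implementation used for the experiments.

```python
import numpy as np

def soft_value(q_values, alpha=0.01):
    """Soft state value (Eq. 3.11): alpha * log sum_a exp(Q(s,a)/alpha),
    computed with the log-sum-exp trick for numerical stability."""
    z = q_values / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

def soft_policy(q_values, alpha=0.01):
    """Soft Q-learning policy (Eq. 3.12): pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    z = q_values / alpha
    z -= z.max()              # shifting does not change the softmax but avoids overflow
    probs = np.exp(z)
    return probs / probs.sum()

# Example: sample an action for a state with three actions.
q = np.array([1.0, 1.2, 0.9])
a = np.random.choice(len(q), p=soft_policy(q, alpha=0.5))
```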

3.3 Experiments

The overarching goal of the experiments was to test whether applying the Benna-Fusi model to an agent’s parameters could enhance its ability to learn continually in an RL setting. The aim was to demonstrate the potential for the model in enabling continual learning and, for this reason, it was tested in relatively simple settings, where catastrophic forgetting is nevertheless still an issue.

The first experiments, which apply the model in a simple tabular Q-learning agent, were intended to serve as a proof of principle and as a means of gaining an intuition for the mechanics of the model through visualisation. Subsequently, I tested it in a deep RL agent to evaluate its effect on the agent’s ability to learn continually across two simple tasks and also within a single task.

3.3.1 Continual Q-learning

The first set of experiments was conducted in order to test whether applying the Benna-Fusi model to tabular Q-values could be used to facilitate continual reinforcement learning in a simple grid-world setting.

Experimental Setup

The environment consisted of 100 states organised into a 10x10 two-dimensional grid and the agent was equipped with 5 actions, 4 of which deterministically move the agent to a vertically or horizontally adjacent state, and the last of which is a pick-up action that must be chosen to collect the reward when in the correct location. If the agent took an action that would move it off the grid, it would instead just remain in its current location. The agent was trained alternately on two different tasks; in the first, the reward was located in the upper right-hand corner of the grid and, in the second, it was in the bottom left-hand corner. An episode was terminated if the agent reached the goal state and successfully picked up the reward, or if it took a maximum number of steps without reaching the goal. In order to test the agent’s ability to learn continually, the goal location was switched every 10,000 episodes (one epoch) and the time taken for the agent to relearn to capture the reward was measured.

Three different agents were trained and compared:

• A control agent trained in an online fashion with naive Q(λ) using an ε-greedy policy.

• A Benna-Fusi agent, also trained with naive Q(λ), but for which the tabular Q-values were modelled as Benna-Fusi synapses, each with their own chain of interacting dynamic variables. For a given state-action pair (s, a), the first variable in the chain is denoted Q^1(s, a); it corresponds to u_1 in Equation 3.3 and is the ‘visible’ Q-value that determines the agent’s policy at any time. The Q-learning updates η δ_t e_t(s, a) correspond to the ∆w(t) modifications. The deeper variables in the chain Q^k(s, a), with k > 1, can be thought of as ‘hidden’ Q-values that ‘remember’ what the visible Q-value function was over longer timescales and regularise it by its history.

• A modified Benna-Fusi agent, whereby at every time step the flow from Q^k(s, a) to Q^{k+1}(s, a) for all variables in the chain was scaled by a multiple of the eligibility trace e_t(s, a). The flow from shallow variables to deeper variables in the chain can be thought of as a process of consolidation of the synapse, or in this case the Q-value. The rationale for modulating this flow by the eligibility trace is that it only makes sense to consolidate parameters that are actually being used and modified; for example, if a state s has not been visited for a long time, we should not become increasingly sure of any of the Q-values Q^1(s, a).

In a Benna-Fusi chain of length N, 1/g_{1,2} and 1/g_{N,N+1} determine the shortest and longest memory timescales of the hidden variables respectively. In these experiments, I set g_{1,2} to 10^{-5}, to correspond roughly to the inverse of the minimum number of Q-learning updates per epoch, and the number of variables in each chain to 3, all of which were initialised to 0. The ODEs were numerically integrated after every Q-learning update with a time step of ∆t = 1. A table of all parameters used for simulation is shown in Table 3.1 in Section 3.6.
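As an illustration of the modified Benna-Fusi agent, the sketch below scales the beaker-to-beaker flow for the tabular Q-value chains elementwise by the eligibility traces; the function signature, the clipping of the effective flow and the separation of the Q-learning update from the consolidation step are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def consolidate_q_chain(U, E, g, C, trace_scale=10.0, dt=1.0):
    """One eligibility-modulated consolidation step for tabular Q-value chains.

    U: list of arrays [Q^1, ..., Q^N], each of shape (n_states, n_actions).
    E: eligibility traces e_t(s, a), shape (n_states, n_actions).
    g: tube widths, where g[k] connects U[k] and U[k+1]; C: beaker sizes C_k.
    The Q-learning update itself (eta * delta * e) is applied to U[0] separately.
    """
    scale = np.minimum(trace_scale * E, 1.0)   # bound the effective flow (assumption)
    new_U = [u.copy() for u in U]
    for k in range(len(U)):
        inflow = g[k - 1] * (U[k - 1] - U[k]) if k > 0 else 0.0
        deeper = U[k + 1] if k + 1 < len(U) else 0.0   # u_{N+1} = 0 acts as a leak
        outflow = g[k] * (deeper - U[k])
        # Q-values of rarely visited state-action pairs are consolidated more slowly.
        new_U[k] += (dt / C[k]) * scale * (inflow + outflow)
    return new_U
```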

Results

The Benna-Fusi agents learned to switch between good policies for each task significantly faster than the control agent, with the modified Benna-Fusi agent being the quickest to relocate the reward for the first time at the beginning of each epoch (Figure 3.2). After the agents have learned to perform the task in the first epoch, it takes them all a long time to find the reward when its location is switched to the opposite corner at the beginning of the second epoch, since their policies are initially tuned to move actively away from the reward. After subsequent reward switches, however, while the control agent continues to take a long time to relearn a good policy due to the negative transfer between the two tasks, both Benna-Fusi agents learn to re-attain a good level of performance on the task much faster than after the initial task-switch (see bottom of Figure 3.2).

In order to visualise the role of the hidden variables of the Benna-Fusi model in enabling continual learning, I define V^k(s) := \max_{a'} Q^k(s, a'); for k = 1, this simply corresponds to the traditional value function V(s) and, for k > 1, one can interpret V^k(s) as a ‘hidden’ value function that records the value function over longer timescales. In Figure 3.3, a snapshot of the V^k values during training is depicted. The plot of V^1 indicates that the reward is currently in the upper right-hand corner of the grid at location (10,10), since the value function almost monotonically slopes upwards to that point; on the other hand, in V^2 and V^3, which evolve at slower timescales, the hidden value functions also slope up towards (0,0) in the lower left-hand corner, indicating that they still ‘remember’ that there was once a reward at that location. When the reward then switches back to (0,0), the downward pressure from the Q-learning updates on the values in the lower left-hand corner will ease and, due to the memory of high values in this area of the grid in V^2, this hidden value will flow into the corresponding Q-values in V^1, encouraging the agent to re-explore the area and capture the reward.2

2 See https://youtu.be/_KgGpT-sjAU for a video showing the evolution of the value functions over time.

Figure 3.2: (Top) How long it took each agent to relearn to navigate to the first reward at the beginning of each epoch. (Bottom) How many time steps it took for the 20-episode moving average of episode lengths to drop below 13, as a measure of how long it took to (re)learn a good policy. Mean over 3 runs with 1 s.d. error bars.

Figure 3.3: Surface plots of a snapshot of the visible (V^1) and hidden (V^2 and V^3) values of each state during training. While V^1 only appears to retain information about the current reward at (10,10), V^2 and V^3 still remember that there is value at (0,0). When the reward location is switched back to (0,0), flow from the deeper variables in the chain back into V^1 makes it easier for the agent to recall the previous reward location. See https://youtu.be/_KgGpT-sjAU for an animation of the values over training.

3.3.2 Continual Multi-task Deep RL

The next set of experiments was designed to test whether similar improvements to continual learning could be observed when the Benna-Fusi model was applied to the parameters of a deep RL agent alternately performing two simple tasks. While better memory retention for tabular Q-values has a direct impact on an agent’s ability to recall a previous policy, it is less obvious that longer memory lifetimes in individual synapses (which know nothing about each other) should yield better behavioural memory in a distributed system such as a deep Q-network.

Experimental Setup

The two tasks used for this experiment were Cart-Pole3 and Catcher, which were suitable for training on the same network because the dimensions of their state spaces and action spaces match. In Cart-Pole, the agent must learn to move a cart from side to side in order to balance a pole vertically, one end of which is attached to the centre of the cart. The agent has access to the position and velocity of the cart along the x axis, and to the angle and angular velocity of the pole. In Catcher, the agent learns to move a paddle left and right in order to catch fruit that is dropping from random horizontal locations from the top to the bottom of a 2D frame.

3 The version used was CartPole-v1 from the OpenAI Gym [BCP+16].

The agent observes the x and y position of the fruit, as well as the position and velocity of the paddle along the x axis.

Similarly to the tabular Q-learning experiments, an agent was trained alternately on the two tasks (for 40 epochs of 20,000 episodes) and, as a measure of its ability to learn continually, the time taken for the agent to (re)learn the task after every switch was recorded. A task was deemed to have been (re)learnt if a moving average of the reward per episode moved above a predetermined level (450 for Cart-Pole, which has max reward 500, and 10 for Catcher, which has max reward about 14).

Experiments were run with two types of agent, a control agent and a Benna-Fusi agent. In order to ensure that the difference in performance of the two agents was not just due to differences in the effective learning rate (which is likely to be lower in the Benna-Fusi agent as the parameters are regularised by the hidden variables), the control agent was run with several different learning rates. The Benna-Fusi agent was only run with η = 0.001.

The control agent was essentially a DQN [MKS+15] with two fully connected hidden layers of 400 and 200 ReLUs respectively, but with a number of modifications that were made in order to give it as good a chance as possible to learn continually. The network was trained with soft Q-learning [HTAL17], which I found helped to stabilise learning in each task, presumably by maintaining a more diverse set of experiences in the replay database4. Furthermore, as in [KPR+17], while the network weights were shared between tasks, each layer of the network was allowed to utilise task-specific gains and biases, such that the computation at each layer was of the form:

y_i = g_i^c\left(b_i^c + \sum_j W_{ij} x_j\right) \qquad (3.13)

where c indexes the task being trained on. This helped overcome the issue of training a network on two different Q-functions, which has been reported to be very challenging even as a regression task [RCG+15].
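For illustration, a layer with shared weights but per-task gains and biases (Equation 3.13) might be written as follows; this is a hedged sketch rather than the thesis code, and the module name, initialisation and interface are invented for the example.

```python
import torch
import torch.nn as nn

class TaskConditionedLinear(nn.Module):
    """Linear layer with weights shared across tasks but task-specific
    gains g_i^c and biases b_i^c, as in Equation 3.13."""

    def __init__(self, in_features, out_features, n_tasks):
        super().__init__()
        self.W = nn.Parameter(torch.randn(out_features, in_features) * 0.01)  # shared weights
        self.gains = nn.Parameter(torch.ones(n_tasks, out_features))          # g_i^c
        self.biases = nn.Parameter(torch.zeros(n_tasks, out_features))        # b_i^c

    def forward(self, x, task_id):
        # y_i = g_i^c * (b_i^c + sum_j W_ij x_j)
        pre = x @ self.W.t()
        return self.gains[task_id] * (self.biases[task_id] + pre)

# Usage sketch: first hidden layer for the 4-dimensional Cart-Pole/Catcher state.
layer = TaskConditionedLinear(4, 400, n_tasks=2)
y = layer(torch.zeros(1, 4), task_id=0)
```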

The experience replay database had a size of 2000, from which 64 experiences were sampled for training with Adam [KB14] at the end of every episode. Crucially, the database was cleared at the end of every epoch in order to ensure that the agent was only training on one task at a time. The agent was ε-greedy with respect to the stochastic soft Q-learning policy and ε was decayed from 1 to almost 0 over the course of each epoch. Finally, ‘soft’ target network updates were used as in [LHP+15], rather than the hard periodic updates used in the original DQN. A full table of the parameters used can be seen in Table 3.2 in Section 3.6.

4 In particular, I found this to be more effective than having a larger replay database, decaying ε to a positive value or just having a softmax policy.

The Benna-Fusi agent was identical to the control agent, except that each network parameter was modelled as a Benna-Fusi synapse with 30 variables and with g_{1,2} set to 0.001625, ensuring that the longest timescale (∝ C_{30}/g_{30,31}) comfortably exceeded the total number of updates over training (≈ 2^{25}). In order to speed up computation, rather than simulating 64 time steps of the ODEs after every replay batch, these were approximated by conducting one Euler update with ∆t = 64. For this reason, the effective flow between u_1 and u_2 was 64 × 0.001625 ≈ 0.1; if it were larger than 1 this would lead to instability, unwanted oscillations or negative u-values, so g_{1,2} could not be increased much further. The complexity of the algorithm is O(mN), where N is the number of trainable parameters in the network and m is the number of Benna-Fusi variables per parameter.

The initial values of the hidden variables were normally distributed with variances decaying linearly with the depth in the chain, approximately matching the equilibrium distribution shown for random, uncorrelated memories in the original paper [BF16]. Furthermore, flow from the deeper variables to the shallow ones was enabled incrementally, so that the parameters were not constrained much by the random initialisation but only by hidden variables that had had enough time to adapt to the actual experiences of the agent. Specifically, flow from u_{k+1} to u_k was only enabled after 2^k/g_{1,2} gradient updates.

It is worth mentioning here that the task-specific gains and biases used for the control agent were also used in the Benna-Fusi agent, and so neither agent was entirely task-agnostic. However: (i) as will be seen in the results, these gains and biases do not prevent catastrophic forgetting in the control agent; (ii) the number of task-specific gains and biases scales linearly with the number of neurons (i.e. it is not as cumbersome as growing linearly with the number of weights, which would be quadratic in the number of neurons); and (iii) the consolidation process of the weights in the Benna-Fusi agent is not informed in any way by the task boundaries.

Results

Over the course of training, the Benna-Fusi agent became faster at reaching an adequate level of performance on each task than the control agents (Figure 3.4, top), thus demonstrating a better ability for continual learning. Interestingly, while the control agents were all able to learn Cart-Pole at the beginning of training, subsequent training on Catcher then left the network at a starting point that made it very hard or impossible for the agents to relearn Cart-Pole (as evidenced by the number of epochs where an adequate performance was never reached), exhibiting a severe case of catastrophic forgetting. The Benna-Fusi agent did not display this behaviour and, instead, relearned the task quickly in all epochs. It is important to note that parameters were chosen such that the control agents were all capable of learning a very good policy for either task when trained from scratch. In Catcher, the Benna-Fusi agent took longer to converge to a good performance in the first few epochs of training, but subsequently became faster than all the control agents in recalling how to perform the task. On a separate measure of average reward per episode, the Benna-Fusi agent was better than any of the controls on Cart-Pole (Figure 3.4, bottom), but two of the controls reached a slightly higher level of performance on Catcher.

In order to test the agents’ ability to remember over multiple timescales, additional experiments were run with different epoch lengths, ranging from 2,500 to 160,000 episodes; the Benna-Fusi agent demonstrated a better memory than the control in all cases (Figures 3.6 and 3.7). Furthermore, in order to ensure that these benefits are not limited to a two-task setting, experiments were run rotating over three tasks and similar results were obtained (Figure 3.9).

Figure 3.4: (Top) How long it took for the agents to relearn each task from the beginning of each epoch; the number of training episodes needed for the 10 test-episode moving average of reward to surpass the threshold is plotted for 3 runs per agent. Runs that did not relearn within the epoch are marked at ∞. (Bottom) Reward per episode averaged over each epoch for each task; means with s.d. error bars over 3 runs.

3.3.3 Continual Learning within a Single Task

The continual learning problem is normally posed as the challenge of learning how to perform a series of well-defined tasks in sequence; in RL, however, the issue of nonstationary data often occurs within the training of one task. This effect occurs primarily due to (i) strong correlation in time between consecutive states and (ii) changes in the agent’s policy altering the distribution of experiences. The most common way to deal with this problem is to use an experience replay database (in the form of a FIFO buffer) to decorrelate the data, without which the agent can struggle to learn a stable policy (Figure 3.10). In the final set of experiments, I wanted to see whether using the Benna-Fusi model could enable stable learning in a single task without the use of a replay database.

Experimental Setup

Control and Benna-Fusi agents were trained on Cart-Pole and Catcher separately in an online setting, such that there was no experience replay database and the agents were trained after every time step on the most recent experience.

The architectures of the control and Benna-Fusi agents were the same as in the previous set of experiments bar a couple of differences: the network was smaller (two hidden layers of 100 and 50 units respectively) and, in the Benna-Fusi agent, g1,2 was set to a larger value of 0.01 in order to be able to remember experiences over shorter timescales.

Results

While none of the control agents was able to learn and maintain a consistently good policy for the Cart-Pole task, the Benna-Fusi agent learned to perform the task to perfection in most cases (Figure 3.5). For Catcher, however, all agents were able to learn a consistently good policy, with the control agent learning a bit faster (Figure 3.11).

The reason that the control agents struggle to learn a stable policy for Cart-Pole in an online setting, but not for Catcher, could be that the training data is more nonstationary and thus the agents are more prone to catastrophic forgetting as they learn. A common aspect of control tasks such as Cart-Pole is that a successful policy often involves restricting experiences to a small part of the state space [dBKTB16]. For example, in Cart-Pole the aim is to keep the pole upright, so if an agent trains for a while on a good policy, it may begin to overwrite knowledge of Q-values in states where the pole is significantly tilted. Since the agent is constantly learning, it could at some point make an update that leads to a wrong action, causing the pole to tilt to an angle that it has not experienced in a while. At this point, the agent might not only perform poorly, since it has forgotten the correct policy in this region of the state space, but its policy might be further destabilised by training on these ‘new’ experiences. Furthermore, at this stage the exploration rate might have decayed to a low level, making it harder to relearn.

One idea is not to let ε decay to 0, but in practice I found that this does not solve the problem and can actually make learning less stable (Figure 3.12). This could be (i) because the agent still overfits to states experienced during a good policy and the extra exploration just serves to perturb it into the negative spiral described above faster than otherwise, or (ii), as noted in [dBKTB16], because in control tasks the policy often needs to be very fine-tuned in an unstable region of the state space; this requires high-frequency sampling of a good policy and so makes excessive exploration undesirable [dBKTB15]. In Cart-Pole, the Benna-Fusi agent succeeds in honing its performance with recent experiences of a good policy while simultaneously remaining robust to perturbations by maintaining a memory of what to do in suboptimal situations that it has not experienced for a while.

In Catcher, a good policy will still visit a large part of the state space and consecutive states are also less correlated in time since fruit falls from random locations at the top of the screen. This may explain why the control agent does not have a problem learning the task successfully.

Figure 3.5: The 1000 test-episode moving average of reward in Cart-Pole for the Benna-Fusi agent and control agents with different learning rates; means and s.d. error bars over 3 runs per agent.

3.4 Related Work

As discussed in the Background chapter, the concept of synaptic consolidation has been applied in a number of recent works that tackle the continual learning problem by adding quadratic terms to the cost function that selectively penalise moving parameters according to how important they are for the recall of previous tasks [RE13, KPR+17, ZPG17, ABE+17]. The Benna-Fusi model also constrains parameters to be close to their previous values and so can be considered a regularisation-based method but, in contrast to the approaches described above, consolidation occurs (i) over a range of timescales, (ii) without any derived importance factors, and (iii) without any knowledge of task boundaries. These characteristics are useful for situations where one does not have prior knowledge of when and over what timescale the training data will change - a possibly realistic assumption for robots deployed to act and learn in the real world. Furthermore, the importance factors derived in the other works could feasibly be used to modulate the flow between the hidden variables as a way of combining approaches.

The idea of modelling plasticity at different timescales to mitigate catastrophic forgetting in

ANNs is not new: in [HP87], each weight is split into separate ‘fast’ and ‘slow’ components, which allows the network to retrieve old memories quickly after training on new data. However, this model was only tested in a very simple setting, matching random binary inputs and outputs, and it is shown in [BF16] that allowing the different components to interact with each other theoretically yields much longer memory lifetimes than keeping them separate. The momentum variables in Adam [KB14] and the soft target updates in [LHP+15, PJ92] also effectively remember the parameter values at longer timescales, but their memory declines exponentially, i.e. much faster than the power law decay in the Benna-Fusi model.

3.5 Conclusion

In this chapter, I took inspiration from a computational model of biological synapses [BF16] to show that expressing each parameter of a tabular or deep RL agent as a dynamical system of interacting variables, rather than just a scalar value, can help to mitigate catastrophic forgetting over multiple timescales. On a longer timescale, I found that agents equipped with the Benna-Fusi model displayed a better capacity for continual learning than control agents over sequential training of two tasks across a wide range of switching frequencies. On a shorter timescale, unlike any of the control agents, the Benna-Fusi agent was able to stably learn to perform a task that features a high degree of temporal nonstationarity in the state space without the use of an experience replay database.

This chapter is intended as a proof of concept that could be extended in several ways. First, it would be interesting to investigate the sensitivity of continual learning performance to the parameters of the model, such as the number of hidden variables and the granularity of timescales along the cascade. It would also be interesting to look into the information content held at different depths of the chain, which could yield more effective readout schemes for the value of each weight. Furthermore, it would be informative to test the model’s capabilities in a more challenging setting by increasing the number and complexity of tasks, potentially using different architectures such as actor-critic models [LHP+15, HZAL18], as well as to see if the model can facilitate transfer learning in a series of related tasks. In some initial experiments with larger DQNs on tasks from the Arcade Learning Environment [BNVB13, BCP+16], I found that Benna-Fusi agents struggled to reach the same level of performance as the control agents. In the next chapter, I speculate that this might be due to the fact that improving the memory for the network’s individual parameters might not always correspond well to improving the agent’s behavioural memory; this motivates a new technique that consolidates the agent’s policy in a more direct fashion.

Finally, it would be interesting to adapt the model in light of the fact that synaptic consolidation is known to be regulated by neuromodulators such as dopamine, which, for example, has been associated with reward prediction error and exposure to novel stimuli [CZV+08]. One could modulate the flow between the hidden variables in the cascade by factors such as these, or by one of the importance factors used in other regularisation-based methods, in order to consolidate memory more selectively and efficiently.

3.6 Experimental Details

Tables of parameters for both the tabular and deep RL experiments are shown below.

Table 3.1: Hyperparameters for Tabular Q-learning Experiments

Parameter                        Value
# Epochs                         24
# Episodes/Epoch                 10000
Max # steps per episode          20000
γ                                0.9
λ                                0.9
ε                                0.05
Learning rate                    0.1
Grid size                        10x10
# Benna-Fusi variables           3
Benna-Fusi g_{1,2}               10^{-5}
Eligibility trace scale factor*  10

*Multiple of the eligibility trace by which the flow between beakers is scaled in the modified Benna-Fusi model.

Table 3.2: Hyperparameters for Deep RL Experiments

Parameter                     Multi-task            Single task
# Epochs                      40                    1
# Episodes/Epoch              20000                 100000
Max # time steps / episode    500                   500
Cart-Pole γ                   0.95                  0.95
Catcher γ                     0.99                  0.99
Initial ε (epoch start)       1                     1
ε-decay / episode             0.9995                0.9995
Minimum ε                     0                     0
Neuron type                   ReLU                  ReLU
Width hidden layer 1          400                   100
Width hidden layer 2          200                   50
Optimiser                     Adam                  Adam
Learning rate                 10^{-3} to 10^{-6}    10^{-3} to 10^{-6}
Adam β1                       0.9                   0.9
Adam β2                       0.999                 0.999
Experience replay size        2000                  1
Replay batch size*            64                    1
Soft target update τ          0.01                  0.01
Soft Q-learning α             0.01                  0.01
# Benna-Fusi variables        30                    30
Benna-Fusi g_{1,2}            0.001625              0.01
Test frequency (episodes)     10                    10

*Updates were made sequentially as in stochastic gradient descent, not all in one go as a minibatch.

3.7 Additional Experiments

3.7.1 Varying Epoch Lengths

Figures 3.6 and 3.7 show results for experiments run for a wide range of switching schedules between tasks. In all cases the Benna-Fusi agent becomes quicker (or in a couple of instances equally quick) at relearning each task than the control agent, demonstrating the Benna-Fusi model’s ability to improve memory at a range of timescales. Agents had a learning rate of 0.001 and the runs with longer epochs were run for fewer epochs.

2500 episodes 5000 episodes 10000 episodes

2000 4000 8000

1500 3000 6000

1000 2000 4000 (episodes) 500 1000 2000 Time to (re)learn 0 0 0 0 50 100 150 0 20 40 60 80 0 10 20 30 40 Epoch # Epoch # Epoch #

40000 episodes 80000 episodes 160000 episodes

32000 64000 128000

24000 48000 96000

16000 32000 64000 (episodes) 8000 16000 32000 Time to (re)learn

0 0 0 0 2 4 6 8 0 1 2 3 4 0.0 0.2 0.4 0.6 0.8 1.0 Epoch # Epoch # Epoch #

Figure 3.6: Comparison of time to (re)learn CartPole in the control agent (blue) and the Benna-Fusi agent (orange) for different epoch lengths.

3.7.2 Three-task Experiments

In order to ensure that the benefits of the Benna-Fusi model were not limited to the two-task setting, I introduced a new task and ran experiments where training was rotated over the three tasks. The new task was a modified version of Cart-Pole in which the length of the pole is doubled (dubbed Cart-PoleLong); the criterion I used for judging that this task was different enough from Cart-Pole to be considered a new task was that, when trained sequentially after Cart-Pole in a control agent, it led to catastrophic forgetting of the agent’s policy for the Cart-Pole task.

Figure 3.7: Comparison of time to (re)learn Catcher in the control agent (blue) and the Benna-Fusi agent (orange) for different epoch lengths (panels: 2500, 5000, 10000, 40000, 80000 and 160000 episodes per epoch).

Figure 3.9 shows the remembering times for each task for a control agent and a Benna-Fusi agent when training was rotated over the three tasks (Cart-PoleLong → Catcher → Cart-Pole) over a total of 24 epochs. The results indicate that the Benna-Fusi model exhibits the same benefits as in the two-task setting.

Figure 3.9: Comparison of time to (re)learn each of the three tasks (Cart-PoleLong, Catcher and Cart-Pole) in the control agent (blue) and the Benna-Fusi agent (orange). Each epoch was run for 20000 episodes and both agents had a learning rate of 0.001. While the Benna-Fusi agent took a little longer to learn Catcher than the control agent, by the end of the simulation the Benna-Fusi agent could recall each task much faster than the control.

3.7.3 Varying Size of Replay Database

Figure 3.10: 100 test-episode moving average of reward in Cart-Pole for control agents (all with η = 0.001) with different sized experience replay databases, and for the Benna-Fusi agent in the online setting. For these experiments, 1 experience was sampled for training from the database after every time step. In the control cases, when the database is too small, the agent cannot attain a stable performance on the task, while the Benna-Fusi agent can.

3.7.4 Catcher Single Task

Figure 3.11: The 100 test-episode moving average of reward per episode in Catcher for the Benna-Fusi agent and the best control agent. The control agent learns faster but both end up learning a good policy.

3.7.5 Varying Final Exploration Value

Figure 3.12: The 100 test-episode moving average of reward per episode in Cart-Pole for control agents where ε was not allowed to decay below different minimum values (0.01, 0.05 and 0.1). None of the runs yielded a good, stable performance.

3.8 Online Convex Optimisation Perspective

In this section, I show that when the external weight updates in the Benna-Fusi model correspond to gradient updates with respect to a loss function, it can be thought of as an algorithm for online convex optimisation (OCO) with some useful properties. It is first necessary to cover some of the basics of what constitutes OCO. I begin with two key definitions, with notation borrowed from [H+16]; in order to remain consistent with [H+16], vectors in this section are denoted in bold. A set K \subseteq \mathbb{R}^n is convex if, for any \mathbf{x} and \mathbf{y} in K, every point on the line that connects \mathbf{x} and \mathbf{y} is also in K:

\alpha\mathbf{x} + (1 - \alpha)\mathbf{y} \in K, \qquad \forall \alpha \in [0, 1]. \qquad (3.14)

A function f : K \mapsto \mathbb{R} is convex if

f(\alpha\mathbf{x} + (1 - \alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}), \qquad \forall \alpha \in [0, 1],\ \forall \mathbf{x}, \mathbf{y} \in K. \qquad (3.15)

Convex optimisation is the study of how to minimise convex functions over convex sets and, since modern neural networks are typically highly non-convex, most of the guarantees from convex optimisation methods do not apply in the context of training ANNs. Nevertheless, empirically they often work well and have provided the basic building blocks for neural network optimisation - a classic example being gradient descent. In the reverse sense, it can also be helpful to examine optimisation methods specifically designed for neural networks in the convex setting, because it is often easier to analyse them there and form a stronger understanding of their properties. Online Convex Optimisation (OCO) extends convex optimisation to the online learning setting, where the data becomes available sequentially and the loss function can change over time, in a manner akin to the continual learning setting.

The online convex optimisation (OCO) framework [Zin03] constitutes a repeated game: at each time step t, an agent chooses \mathbf{w}_t \in K, where K \subseteq \mathbb{R}^n is a convex set in Euclidean space, after which a convex cost function f_t \in \mathcal{F}, with f_t : K \mapsto \mathbb{R}, is revealed and the agent incurs a cost f_t(\mathbf{w}_t). The performance of an algorithm \mathcal{A} that chooses the \mathbf{w}_t at every iteration is often evaluated by its regret, or static regret, after T iterations:

\text{regret}_T(\mathcal{A}) = \sup_{\{f_1, \dots, f_T\} \subseteq \mathcal{F}} \left\{ \sum_{t=1}^{T} f_t(\mathbf{w}_t) - \min_{\mathbf{w} \in K} \sum_{t=1}^{T} f_t(\mathbf{w}) \right\} \qquad (3.16)

In other words, the regret measures the difference between the actual accumulated loss and that of the best fixed solution in hindsight. If the regret increases sublinearly with T, then the algorithm is said to have zero regret, since its average regret \frac{\text{regret}_T(\mathcal{A})}{T} \to 0 as T \to \infty. It turns out that the lower bound on the regret of any algorithm is \Omega(\sqrt{T}), a bound that is achieved by the well-known Online Gradient Descent (OGD) algorithm (Algorithm 1) under the assumption that the learning rates \eta_t decay proportionally to 1/\sqrt{t} [H+16]. This result holds for any sequence of f_t's, even if chosen adversarially. While this is a powerful guarantee for OGD, it comes at the cost of decreasing adaptability over time; as the learning rate decreases, the agent is less and less able to adjust to recent data.

Algorithm 1: Online gradient descent [H+16]

Input: convex set K, time horizon T, initial point \mathbf{w}_1 \in K, step sizes \{\eta_t\}
for t = 1 to T do
    Play \mathbf{w}_t and observe the cost f_t(\mathbf{w}_t) from the convex cost function f_t.
    Update and project:
        \mathbf{y}_{t+1} = \mathbf{w}_t - \eta_t \nabla f_t(\mathbf{w}_t)
        \mathbf{w}_{t+1} = \Pi_K(\mathbf{y}_{t+1})
end for
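A minimal code sketch of projected online gradient descent is given below; the Euclidean-ball constraint set, function names and example losses are illustrative assumptions, not part of [H+16].

```python
import numpy as np

def project_onto_ball(y, radius=1.0):
    """Euclidean projection onto the ball {w : ||w|| <= radius} (an example convex set K)."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def online_gradient_descent(grad_fns, w1, radius=1.0):
    """Run OGD (Algorithm 1) over a sequence of loss gradients.

    grad_fns: list of callables, where grad_fns[t](w) returns the gradient of f_t at w.
    Step sizes decay as eta_t proportional to 1/sqrt(t), as assumed for the regret bound.
    """
    w = w1.copy()
    iterates = [w.copy()]
    for t, grad in enumerate(grad_fns, start=1):
        eta_t = 1.0 / np.sqrt(t)
        y = w - eta_t * grad(w)           # gradient step
        w = project_onto_ball(y, radius)  # projection back onto K
        iterates.append(w.copy())
    return iterates

# Example: quadratic losses f_t(w) = ||w - c_t||^2 with gradients 2(w - c_t).
targets = [np.array([0.5, 0.0]), np.array([0.0, 0.5])]
grads = [lambda w, c=c: 2 * (w - c) for c in targets]
ws = online_gradient_descent(grads, w1=np.zeros(2))
```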

Figure 3.13: Effective learning rates \eta_t for updates at corresponding times t for OGD and the Benna-Fusi model after 100 time steps using η = 1. The learning rate is higher for more recent data points in the Benna-Fusi model, while in OGD, the algorithm becomes less and less adaptive to recent data as time progresses.

In the Benna-Fusi model [BF16], it is shown that the effect of a weight update decays approximately with the inverse square root of the time since the update. This arises from the fact that the cascade functions as a spatial and temporal discretisation of the diffusion equation. Another way to frame it is that the effective learning rate at time T for an update that occurred at time t is proportional to 1/\sqrt{T+1-t}. If the updates are in fact the gradients of a loss function, \nabla f_t(\mathbf{w}_t), then, if we ignore the projection step, we can view the consolidation model as implementing

OGD with learning rates η ∝ √ 1 : t T +1−t

T X η wT ≈ ∇ft(wt)√ (3.17) t=1 T + 1 − t where η is a fixed learning rate. As mentioned earlier, the traditional OGD algorithm that √ achieves the Ω( T ) regret bound has η ∝ √1 , so that: t t

T X η wT = ∇ft(wt)√ (3.18) t=1 t

In other words, the learning rate schedule in the synaptic consolidation model is the opposite of that in the standard OGD algorithm. While OGD takes larger steps for older data points, the Benna-Fusi model places more importance on the most recent data points and so is more adaptive in comparison (Figure 3.13). Higher adaptivity usually comes at the cost of more forgetting but, if we make the simplifying assumption that in the Benna-Fusi model the effective learning rates are exactly proportional to 1/√(T+1−t) (rather than an approximation), then we can show that it has the same optimal regret bound as OGD with η_t = 1/√t.

We can show this by adapting the proof given in [H+16] that OGD achieves the optimal O(√T) regret bound with η_t ∝ 1/√t to the case where η_t ∝ 1/√(T+1−t); the key is that it is the total sum of the historical learning rates that matters for the proof, which is the same in both cases.

Of course, simply applying the OGD algorithm with learning rates η_t ∝ 1/√(T+1−t) (rather than approximating them with the Benna-Fusi algorithm) is not possible without knowing T in advance of running the algorithm. With the Benna-Fusi approximation, prior knowledge of T is not necessary and T can be arbitrarily large.

First, we need some assumptions that are made for the OGD regret proof. Let D be an upper bound on the diameter of K, i.e. ∀x, y ∈ K:

||x − y|| ≤ D (3.19)

Also, assume that all f_t are G-Lipschitz, i.e. ∀t and ∀x, y ∈ K:

|ft(x) − ft(y)| ≤ G||x − y|| (3.20)

With these assumptions, we use the following result from [H+16], which gives a regret bound that depends on the sum of the ηts:

Lemma 3.1. The regret of OGD using any {ηt} is:

\mathrm{regret}_T = \sum_{t=1}^{T} \left( f_t(w_t) - f_t(w^\star) \right) \le \frac{1}{2}\left( \frac{D^2}{\eta_T} + G^2 \sum_{t=1}^{T} \eta_t \right) \qquad (3.21)

where w⋆ ∈ arg min_{w∈K} Σ_{t=1}^{T} f_t(w).

Proof. The full proof of this lemma can be found in the proof of the regret bound of OGD (Theorem 3.1) in [H+16]; it follows from the convexity of the f_t s and of the set K.

Now we can show the following:

Theorem 1. By applying OGD with learning rates η_t = (D/G) · 1/√(T+1−t) (which can be approximated in an online way by applying the Benna-Fusi algorithm to gradient updates with η = D/G), it is guaranteed that regret_T ≤ DG(√T + 1/2) = O(DG√T), ∀T ≥ 1, achieving the best possible regret bound for a general OCO algorithm (see Theorem 3.2 in [H+16]).

Proof. From Lemma 3.1, we have that for any {ηt}:

\mathrm{regret}_T = \sum_{t=1}^{T} \left( f_t(w_t) - f_t(w^\star) \right) \le \frac{1}{2}\left( \frac{D^2}{\eta_T} + G^2 \sum_{t=1}^{T} \eta_t \right) \qquad (3.22)

Thus, plugging in η_t = (D/G) · 1/√(T+1−t), we have:

\mathrm{regret}_T \le \frac{1}{2}\left( DG + G^2 \sum_{t=1}^{T} \frac{D}{G} \frac{1}{\sqrt{T+1-t}} \right) \qquad (3.23)
= \frac{1}{2}\left( DG + DG \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \right) \qquad (3.24)
\le \frac{1}{2}\left( DG + 2DG\sqrt{T} \right) \qquad (3.25)
= DG\left( \sqrt{T} + \frac{1}{2} \right) = O(\sqrt{T}) \qquad (3.26)

The key step is in Equation 3.24, which uses the fact that the sum of the learning rates is the same as for the standard OGD algorithm with learning rates {1/√t}. The other steps closely follow the proof of Theorem 3.1 in [H+16].
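As a quick numerical sanity check of this step (separate from the formal argument), one can verify that the two learning-rate schedules have identical sums and that this sum is bounded by 2√T; the horizon below is an arbitrary illustrative choice.

```python
import numpy as np

T = 1000
ogd_rates = 1.0 / np.sqrt(np.arange(1, T + 1))          # eta_t proportional to 1/sqrt(t)
bf_rates = 1.0 / np.sqrt(T + 1 - np.arange(1, T + 1))   # eta_t proportional to 1/sqrt(T+1-t)

assert np.isclose(ogd_rates.sum(), bf_rates.sum())      # same total sum (one is the reverse of the other)
assert ogd_rates.sum() <= 2 * np.sqrt(T)                # the bound used in Equation 3.25
print(ogd_rates.sum(), 2 * np.sqrt(T))
```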

The above is not intended as a rigorous proof of the regret bound of the Benna-Fusi algorithm when applied to gradient updates, since I made the assumption that the 1/√(T+1−t) decay in the effective learning rates was exact rather than approximate. In future work, it would be interesting to investigate whether such a bound can be attained rigorously.

Chapter 4

Policy Consolidation for Continual Reinforcement Learning

4.1 Introduction

In the conclusions of the previous chapter, I noted that a theoretical limitation of the synaptic consolidation model was that improving the memory of individual parameter values would not necessarily translate well into better behavioural memory, as defined by the agent’s ability to recall how to perform previously learned tasks. In this chapter, I address this limitation by developing an approach called policy consolidation (PC) that directly consolidates the agent’s policy over multiple timescales during training.

Consolidation in the PC model occurs at all times without relying on task boundaries, with the agent’s policy being continually distilled into a cascade of hidden networks that evolve over a range of timescales. The hidden networks, in turn, distill knowledge back through the cascade into the policy network in order to ensure that the agent’s policy does not deviate too much from where it was previously. The PC model is derived by adapting the synaptic consolidation model from the previous chapter via a reinterpretation of the dynamics induced by the cascade as a result of regularisation terms that can be incorporated into the loss function of the agent. The model also takes inspiration from the technique of knowledge distillation [HVD15, RCG+15] and one of the proximal policy optimisation (PPO) algorithms for RL [SWD+17] in order to implement multi-timescale learning directly at the policy level.

The PC agent’s capability for continual learning is evaluated by training it on a number of continuous control tasks in three nonstationary settings that differ in how the data distribution changes over time: (i) alternating between two tasks during training, (ii) training on just one task, and (iii) in a multi-agent self-play environment. Some key differences to the evaluations used in the previous chapter are that the tasks have larger state and action spaces, the action spaces are continuous, and the multi-agent self-play setting introduces a new continual learning challenge where the environment dynamics p(st+1|st, at) from the perspective of one agent are continuously evolving throughout training. The PC model is shown to improve continual learning relative to baselines in all three evaluation settings.

The rest of the chapter is structured as follows: (i) an introduction to the PPO algorithms [SWD+17], which play a key role in the development of the PC method as well as serving as baselines to evaluate it against, and a brief discussion on multi-agent RL; (ii) an explanation of how the PC method is derived from the original synaptic consolidation algorithm; (iii) a description of the experimental setup and a discussion of the results; (iv) conclusions and suggestions for future work.

4.2 Preliminaries

4.2.1 PPO

The PPO algorithms fall into the category of policy gradient (PG) methods, which were described in Section 2.2.5 in the Background chapter. As mentioned previously, PG methods have the advantage that they can easily deal with continuous action spaces, and they have recently been shown to perform well on the types of continuous control tasks used for evaluation in this chapter [SWD+17, LHP+15].

One difficulty with PG methods arises from the fact that the magnitude of a gradient step in parameter space is often not proportional to its magnitude in policy space. As such, a small step in parameter space can sometimes cause an excessively large step in policy space, leading to a collapse in performance that is hard to recover from. The Proximal Policy Optimization (PPO) algorithms [SWD+17] tackle this problem by optimising a surrogate objective that penalises changes to the parameters that yield large changes in policy. For example, the objective that is maximised for the fixed-KL version of PPO is given by:

L^{KL}(\theta) = \mathbb{E}_t\left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t - \beta D_{KL}\big(\pi_{\theta_{old}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)\big) \right] \qquad (4.1)

where the first term in the expectation comprises the likelihood-ratio policy gradient, which uses the generalised advantage estimate for Â_t [SML+15], and the second term penalises the Kullback-Leibler divergence between the policy and where it was at the start of each update. At each iteration, trajectories are sampled using π_θ_old (where θ_old refers to the parameters of the policy network at the beginning of the iteration) and then used to update π_θ with several steps of stochastic gradient ascent on L^{KL}(θ). The coefficient β controls the magnitude of the penalty for large step sizes in policy space.

In the adaptive-KL version of PPO, the β parameter is dynamically adjusted throughout training such that the step size in policy space, defined as d = E_t[D_KL(π_θ_old(·|s_t) || π_θ(·|s_t))], remains close to a target value d_targ. If d is too large, then β is increased to encourage smaller steps, and vice versa. In the clipped version of PPO, a loss function is devised that disincentivises making updates that change the action probabilities by more than a fixed ratio:

L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t,\ \mathrm{clip}\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right] \qquad (4.2)

where ε is a hyperparameter that modulates the step sizes in policy space.

The various PPO algorithms are used as baselines in all experiments in this chapter. We shall see later on that the policy consolidation model can be viewed as an extension of the fixed-KL version of PPO operating at multiple timescales.
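For concreteness, the following is a minimal PyTorch-style sketch of the two surrogate objectives, written as losses to be minimised; the tensor names and shapes are illustrative assumptions rather than the exact implementation used for the experiments in this chapter.

```python
import torch

def fixed_kl_ppo_loss(logp_new, logp_old, advantages, kl_old_new, beta):
    """Negative of Equation 4.1: likelihood-ratio term minus a KL penalty.

    logp_new/logp_old: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t),
    kl_old_new: per-state KL(pi_theta_old || pi_theta); all tensors of shape [batch].
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = ratio * advantages - beta * kl_old_new
    return -surrogate.mean()

def clipped_ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative of Equation 4.2: clipped likelihood-ratio objective."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```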

4.2.2 Multi-agent RL with Competitive Self-play

While nonstationarity arises in single agent RL due to correlations in successive states and changes in the agent's policy, the dynamics of the environment given by p(s_{t+1}|s_t, a_t) are typically stable. In multi-agent RL, the environment dynamics are often unstable since the observations of one agent are affected by the actions of the other agents, whose policies may also evolve over time. In competitive multi-agent environments, training via self-play, whereby the agents' policies are all governed by the same controller, has had several recent successes [SHM+16, Ope].

One reason cited for the success of self-play is that agents are provided with the perfect curriculum, as they are always competing against an opponent of the same calibre. In practice, however, it has been reported that only training an agent against the most recent version(s) of itself can lead to instabilities during training [HS16]. In particular, it might result in catastrophic forgetting as the agent overwrites knowledge of how to beat past versions of itself, preventing monotonic policy improvement. For this reason, agents are normally trained against historical versions of themselves to ensure stability and continuous improvement, often sampling from their entire history [BPS+18]. In a continual learning setting it may not be possible to store all historical agents, and the training time will become increasingly prohibitive as the history grows. In this work, I evaluate the continual learning ability of the PC model in a self-play setting by only training each agent against the most recent version of itself.

4.3 From Synaptic Consolidation to Policy Consolidation

As mentioned earlier, one of the theoretical limitations of the synaptic consolidation model used in the previous chapter was that, while it improved memory at the level of individual parameters, this would not guarantee improved behavioural memory of the agent, due to a highly nonlinear relationship between the parameters and the output of the network. The PC model addresses this limitation by consolidating memory directly at the behavioural level.

The PC model can also be viewed as an extension of one of the PPO algorithms [SWD+17], which ensures that the policy does not change too much with every policy gradient step - in a sense, preventing catastrophic forgetting at a very short timescale. The PC agent operates on the same principle, except that its policy is constrained to be close to where it was at several stages in its history, rather than just at the previous step.

Below I motivate and derive the policy consolidation framework: first, the synaptic consolidation model is reinterpreted by incorporating it into the objective function of the RL agent, and then it is combined with concepts from PPO and knowledge distillation in order to directly consolidate the agent's behavioural memory at a range of timescales.

4.3.1 Synaptic Consolidation

As mentioned in the previous chapter, the synaptic model from [BF16] was originally described by analogy to a chain of communicating beakers of liquid (Figure 4.1a). The level of liquid in the first beaker corresponds to the visible synaptic weight, i.e. the one that is used for neural computation. Liquid can be added or subtracted from this beaker in accordance with any synaptic learning algorithm. The remaining beakers in the chain correspond to ‘hidden’ synaptic variables, which have two simultaneous functions: (i) the flow of liquid from shallower to deeper beakers records the value of the synaptic weight at a wide range of timescales, and (ii) the flow from deeper beakers back through the shallower ones regularises the synaptic weight by its own history, constraining it to be close to previous values. The wide range of timescales is implemented by letting the tube widths between beakers decrease exponentially and the beaker widths grow exponentially as one traverses deeper into the chain.

Figure 4.1: (a) Depiction of the synaptic consolidation model (adapted from [BF16]); (b) depiction of the policy consolidation model. The arrows linking the π_k s to the π_k_old s represent KL constraints between them, with thicker arrows implying larger constraints, enforcing the policies to be closer together.

The synaptic consolidation process can be formally described with a set of first-order linear differential equations, which can be translated into discrete updates with the Euler method as follows:

u_1 \leftarrow u_1 + \frac{\eta}{C_1}\left(\Delta w + g_{1,2}(u_2 - u_1)\right)
u_k \leftarrow u_k + \frac{\eta}{C_k}\left(g_{k-1,k}(u_{k-1} - u_k) + g_{k,k+1}(u_{k+1} - u_k)\right) \qquad (4.3)

where u_k corresponds to the value of the kth variable in the chain (for k > 1), the g_{k,k+1} s correspond to the tube widths, the C_k s correspond to the beaker widths, Δw corresponds to the learning update and η is the learning rate.
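A minimal NumPy sketch of the discrete updates in Equation 4.3 for a single synapse is given below; the particular constants chosen for the tube widths g_{k,k+1} and beaker widths C_k follow the exponential scheme described above but are otherwise illustrative.

```python
import numpy as np

def benna_fusi_step(u, delta_w, eta=0.1, g0=1.0, C0=1.0):
    """One Euler step of Equation 4.3 for a single synapse.

    u: array of beaker values [u_1, ..., u_N]; u[0] is the visible weight.
    delta_w: learning update applied to the first beaker.
    """
    N = len(u)
    g = g0 * 0.5 ** np.arange(N - 1)   # tube widths g_{k,k+1}, shrinking exponentially
    C = C0 * 2.0 ** np.arange(N)       # beaker widths C_k, growing exponentially
    new_u = u.copy()
    # first beaker: learning update plus flow from the second beaker
    new_u[0] += (eta / C[0]) * (delta_w + g[0] * (u[1] - u[0]))
    # hidden beakers: flow from both neighbours (the last beaker has no deeper neighbour)
    for k in range(1, N):
        inflow = g[k - 1] * (u[k - 1] - u[k])
        outflow = g[k] * (u[k + 1] - u[k]) if k < N - 1 else 0.0
        new_u[k] += (eta / C[k]) * (inflow + outflow)
    return new_u

u = np.zeros(8)
for _ in range(1000):
    u = benna_fusi_step(u, delta_w=np.random.randn())
```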

Now consider, as in the previous chapter, that we apply this model to the parameters of a neural network that encodes the policy of an RL agent. Let Uk denote a vector of the kth beaker values for all the parameters, let U denote a matrix of all the Uks and let L(U1) be the RL objective that the network is being trained to minimise. If we define a new loss function L∗(U) as follows:

L^*(U) = L(U_1) + \frac{1}{2}\sum_{k=1}^{N-1} g_{k,k+1}\,\|U_k - U_{k+1}\|_2^2 \qquad (4.4)

then we notice that, by differentiating it with respect to U_k, a negative gradient step with learning rate η/C_k implements the consolidation updates in Equation 4.3, since

-\nabla_{U_k} L^*(U) = g_{k-1,k}(U_{k-1} - U_k) + g_{k,k+1}(U_{k+1} - U_k) \qquad (4.5)

for k > 1 and a step in the direction of −∇U1 L(U1) corresponds to ∆w in Equation 4.3. Thus we can view the synaptic consolidation model as a process that minimises the Euclidean distance between the vector of parameters and its own history at different timescales. It is not obvious, however, that distance in parameter space is a good measure of behavioural dissimilarity of the agent from its past.

4.3.2 Policy Consolidation

Since each parameter has its own chain of hidden variables, each U_k actually defines its own neural network and thus its own policy, which is denoted by π_k. With this view, I propose a new loss function that replaces the Euclidean distances in parameter space, given by the ||U_k − U_{k+1}||_2^2 terms, with a distance in policy space between adjacent beakers:

L^*(\pi) = L(\pi_1) + \mathbb{E}_{s_t\sim\rho_1}\left[ \sum_{k=1}^{N-1} g_{k,k+1}\, D_{KL}\left(\pi_k \,\|\, \pi_{k+1}\right) \right] \qquad (4.6)

where π = (π_1, ..., π_N), ρ_1 is the state distribution induced by following policy π_1, and the D_KL(π_k || π_{k+1}) terms refer to the Kullback-Leibler (KL) divergence between the action distributions (given state s_t) of adjacent policies in the chain.

Policy consolidation can be implemented in a practical RL algorithm by adapting the fixed-KL version of PPO [SWD+17]. As a reminder, this version of PPO features a cost term of the form βD_KL(π_θ_old || π_θ), where β controls the size of the penalty for large updates in policy space.

The PC model uses the same policy gradient as in PPO and introduces similar KL terms for each π_k, with β_k coefficients that increase exponentially for deeper beakers: β_k = βω^{k−1}. This ensures that the deeper beakers evolve at longer timescales in policy space, with the β_k terms corresponding to the C_k terms in the synaptic consolidation model. In the experiments conducted, it was found that the PC agent's performance was often more stable when we used the reverse KL constraint D_KL(π_k || π_k_old) (Section 4.8.1), and so the final objective for the PC model was given by:

L^{PC}(\pi) = L^{PG}(\pi_1) + L^{PPO}(\pi) + L^{CASC}(\pi) \qquad (4.7)

" # PG π1 ˆ PPO where L (π1) is the policy gradient Et At , L (π) constitutes the PPO constraints π1old that determine the timescales of the πks,

" N # PPO X h k−1 i L (π) = −Et βω DKL (πk||πkold ) , (4.8) k=1 and LCASC (π) captures the KL terms between neighbouring policies in the cascade,

" N # CASC X h i L (π) = −Et ω1,2DKL (π1| |π2old ) + ωDKL (πk| |πk−1old ) + DKL (πk| |πk+1old ) , (4.9) k=2

where π_{N+1_old} := π_N, ω_{1,2} controls how much the agent's policy is regularised by its history, and ω > 1 determines the ratio of timescales of consecutive beakers. A smaller ω gives a higher granularity of timescales along the cascade of networks, but also a smaller maximum timescale, determined by β_N = βω^{N−1}. Each policy is constrained to be close to the old versions of neighbouring policies, which are fixed during the handful of gradient steps taken per iteration (as in PPO), in order to ensure a stable optimisation at each iteration (Figure 4.1b).
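The following PyTorch-style sketch shows one way the combined objective in Equations 4.7-4.9 could be assembled from per-policy KL terms; the helper inputs (per-policy KL tensors and the visible policy's likelihood ratio) are assumed to be computed elsewhere, and the snippet is a simplified illustration rather than the exact implementation used in this thesis.

```python
import torch

def pc_loss(ratio_1, advantages, kl_to_own_old, kl_to_prev_old, kl_to_next_old,
            beta=0.5, omega=4.0, omega_12=1.0):
    """Negative of Equation 4.7, written as a loss to be minimised.

    ratio_1:           pi_1(a_t|s_t) / pi_1_old(a_t|s_t) for the visible policy, shape [batch].
    kl_to_own_old[k]:  KL(pi_{k+1} || pi_{k+1}_old) for k = 0..N-1   (Equation 4.8 terms).
    kl_to_prev_old[k]: KL(pi_{k+1} || pi_k_old) for k = 1..N-1       (Equation 4.9 terms).
    kl_to_next_old[k]: KL(pi_{k+1} || pi_{k+2}_old), with pi_{N+1}_old := pi_N.
    """
    N = len(kl_to_own_old)
    loss_pg = -(ratio_1 * advantages).mean()
    # PPO-style constraints with exponentially growing coefficients beta * omega^(k-1)
    loss_ppo = sum(beta * omega ** k * kl_to_own_old[k].mean() for k in range(N))
    # cascade constraints tying each policy to the old versions of its neighbours
    loss_casc = omega_12 * kl_to_next_old[0].mean()
    for k in range(1, N):
        loss_casc = loss_casc + omega * kl_to_prev_old[k].mean() + kl_to_next_old[k].mean()
    return loss_pg + loss_ppo + loss_casc
```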

4.4 Experiments

In order to test how effective the model is at alleviating catastrophic forgetting, the performance of a PC agent was evaluated over a number of continuous control tasks [BCP+16, HCS+17, ASBB+18] in three separate RL settings. The settings are differentiated by the nature of how the data distribution changes over time in each case:

(i) In the first setting, the agent was trained alternately on pairs of separate tasks. This is akin to the most common continual learning setting in the literature, where the changes to the distribution are discrete. While in other methods these discrete transitions are often used to inform consolidation in the model [KPR+17, ZPG17], in the PC model the consolidation process has no explicit knowledge of task transitions.

(ii) In the second setting, the agent was trained on single tasks for long uninterrupted periods. The goal here was to test how well the PC model could handle the continual changes to the distribution of experiences caused by the evolution of the agent's policy during training. Policy-driven changes to the state distribution have previously been shown to cause instability and catastrophic forgetting in continuous control tasks, e.g. in [dBKTB16] and in the Cart-Pole experiments in the previous chapter with small experience replay databases.

In this case the dynamics of the environment, given by p(st+1|st, at), are stationary.

(iii) In the third setting, the agent was trained in a competitive two-player environment via self-play, whereby the same controller was used and updated for the policy of each player. In this case, the distribution of experiences of each agent is not only affected by its own actions, but also, unlike in the single agent experiments, by changes to the state transition function that are influenced by the opponent's evolving policy.

In each setting, baseline agents were also trained using the fixed-KL, adaptive-KL and clipped versions of PPO [SWD+17, DHK+17] with a range of βs and clip coefficients for comparison. The results presented in the main results section use the reverse KL constraint for the fixed- and adaptive-KL baselines, since that was what was used for the PC agent, but similar results were observed using the original PPO objectives (Section 4.8.1).

4.4.1 Single agent Experiments

Setup

All agents shared the same architecture for the policy network, namely an MLP with two hidden layers of 64 ReLUs. The PC agent used for all experiments (unless otherwise stated) consisted of 7 hidden policy networks with β = 0.5 and ω = 4.0. The hyperparameters used for training were largely the same as those used for the Mujoco tasks in the original PPO paper [SWD+17]. The hyperparameters were also constant across tasks, except for the learning rate, which was lower for all agents (including the baselines) in the Humanoid tasks.

Figure 4.2: Reward over time for (a) alternating task and (b) single task runs; comparison of the PC agent with fixed-KL (with different βs), clipped (with different clip coefficients) and adaptive-KL agents (omitted for some runs since the return went very negative). Means and s.d. error bars over 3 runs per setting.

In the alternating task setting, the task was switched every 1 million time steps for a total of 20 million steps of training. Three pairs of continuous control tasks were used (taken from [BCP+16, HCS+17, SWD+17]); the tasks in each pair were similar to one another, but different enough to induce forgetting in the baseline agents in most cases. In the single task setting, agents were trained for either 20 million or 50 million time steps per task. In both settings, the mean reward on the current task was recorded during training. Full implementational details are included in Section 4.7.

Results

In the alternating task setting, the PC agent performs better on average than any of the baselines, showing a particularly stark improvement in the Humanoid tasks where none of the baselines was able to successfully learn both tasks. The PC agent, on the other hand, was able to continue to increase its mean reward on each task throughout training (Figure 4.2a). While the performance of the agent drops when the task is switched, the PC agent is able to relearn quickly and gradually build on its knowledge, particularly in the HalfCheetah and Humanoid tasks. In the single task setting, the PC agent does as well as or better than the baselines in all tasks and also exhibits strikingly low variance in the Humanoid task (Figure 4.2b).

4.4.2 Multi-agent Experiments

Setup

For the multi-agent experiments, agents were trained via self-play in the RoboSumo-Ant-vs-Ant-v0 environment developed in [ASBB+18]. The architecture of the agents in the self-play experiments was the same as in the single-agent runs, but some of the hyperparameters for training were altered, mainly due to the fact that training required many more time steps of experience than in the single agent runs. One important change was that the batch sizes per update were much larger, as the trajectories were made longer and also generated by multiple distributed actors (as in [SWD+17]). A larger batch size reduces the variance of the policy gradient, which allowed for larger updates in policy space at each iteration in the PC model, achieved by decreasing β and ω_{1,2} (from 0.5 and 4 to 0.1 and 0.25 respectively), and thus sped up training. Full implementational details are given in Section 4.7.

While the agents were not trained on past versions of themselves, the historical models were saved for evaluation purposes at test time. During training, episodes were allowed to have a maximum length of 500 time steps, but this was increased to 5000 at test time in order to reduce the number of draws between agents of similar ability and more easily differentiate between them.

Results

The first experiment I ran post-training was to pit the final agent for each model against the past versions of itself at the various stages of training. The final PC agents were better than almost all their historical selves, only being marginally beaten by a few of the agents very late on in training. Additionally, the decline in performance against later agents was relatively monotonic for the PC agents, indicating a smooth improvement in the capability of the agent. The fixed-KL agents with low β (0.5 and 1.0) and the adaptive-KL agents were defeated by a significant portion of their historical selves, whilst the clipped-PPO agents exhibited substantial volatility in their performance, demonstrating signs of catastrophic forgetting during training (Figure 4.3a).

The second experiment I ran was to match the PC agents against each of the baselines at their equivalent stages of training. While some of the baseline agents were better than the PC agents during the early stages of training, the PC agents were better on average than all the baselines by the end (Figure 4.3b). The superiority of the PC agents over the adaptive-KL agent was only marginal but it seemed to be slowly increasing over time and, as mentioned in the previous paragraph, the adaptive-KL agent was showing signs of catastrophic forgetting by the end of training. The two baseline agents that did not appear to exhibit much forgetting in the first experiment, β = 2.0 and β = 5.0, were inferior to the PC agents at all stages of training. In the future it would be interesting to train agents for longer to see how their relative performances evolve. Furthermore, it is possible that the PC model is slow to learn at the beginning of training because it is over-consolidated at the initial policy; it would be interesting to implement incremental flow from deeper policies into shallower ones in the PC model as training progresses (as in the synaptic consolidation model in the previous chapter) to see if this problem is resolved.

Figure 4.3: Moving averages of mean scores over time in the RoboSumo environment of (a) the final version of each model against its past self at different stages of its history, and (b) the PC agents against the baselines at equivalent points in history. Mean scores calculated over 30 runs using 1 for a win, 0.5 for a draw and 0 for a loss. Error bars in (b) are s.d. across three PC runs, which are shown individually in (a).

4.4.3 Further Analyses

Testing policies of hidden networks

In all experiments described thus far, the actions taken by the PC agent were all determined by the policy of the first network in the cascade. However, I thought that testing the performance of the hidden policies might provide some insight into the workings of the PC model. I evaluated the cascade policies of a PC agent that had been trained on the alternating Humanoid tasks by testing its performance on just one of the tasks at the various stages of training. It could be observed that (i) all policies quickly drop in performance each time training on the other task commences, and (ii) the visible policy outperforms all the hidden policies at almost all times during training (Figure 4.4a).

It is to be expected that the shallower policies in the cascade will forget quickly when the task is switched, since they operate at short timescales, but one might have expected the deeper policies to maintain a good performance even after a task switch. The results tell us that the memory of the hidden policy networks is not always stored in the form of coherent policies. This phenomenon might be understood by considering how the PC model is trained.

Currently, the PC model tries to minimise the KL divergence between the one-step action distributions (conditioned on state) of adjacent policies in the cascade; this is not, however, the same as minimising the KL divergence between the trajectory distributions of adjacent policies. Let us define a trajectory as a sequence of states and actions τ = (s_1, a_1, ..., s_T, a_T), and let us denote the probability of a trajectory τ while following policy π by p_π(τ). Then we have:

p_\pi(\tau) = \prod_{t=1}^{T} \pi(a_t|s_t)\, p(s_{t+1}|s_t, a_t). \qquad (4.10)

We can then define the KL divergence between trajectory distributions:

D_{KL}(p_{\pi_k} \,\|\, p_{\pi_{k+1}}) = \sum_{\tau} p_{\pi_k}(\tau) \log \frac{p_{\pi_k}(\tau)}{p_{\pi_{k+1}}(\tau)} \qquad (4.11)
= \mathbb{E}_{\pi_k}\left[ \log \frac{p_{\pi_k}(\tau)}{p_{\pi_{k+1}}(\tau)} \right] \qquad (4.12)

This expectation can be estimated using trajectories sampled from π_k, since the p(s_{t+1}|s_t, a_t) terms cancel out in the p_{π_k}(τ)/p_{π_{k+1}}(τ) ratios; however, in the PC model, trajectories are only sampled from π_1. In order to estimate D_KL(p_{π_k} || p_{π_{k+1}_old}) for k > 1 using trajectories from π_1, importance sampling factors of the form p_{π_k}/p_{π_1} must be introduced:

D_{KL}(p_{\pi_k} \,\|\, p_{\pi_{k+1}}) = \mathbb{E}_{\pi_1}\left[ \frac{p_{\pi_k}(\tau)}{p_{\pi_1}(\tau)} \log \frac{p_{\pi_k}(\tau)}{p_{\pi_{k+1}}(\tau)} \right] \qquad (4.13)
= \mathbb{E}_{\pi_1}\left[ \frac{\prod_{t=1}^{T} \pi_k(a_t|s_t)}{\prod_{t=1}^{T} \pi_1(a_t|s_t)} \log \frac{p_{\pi_k}(\tau)}{p_{\pi_{k+1}}(\tau)} \right] \qquad (4.14)

In some initial experiments, I found that these importance sampling factors introduced huge variance to the updates, especially for deeper policies in the cascade, and harmed the performance of the agent. This is due to the fact that, for policies that are even slightly different, the large products in the numerator and denominator of the factors can end up differing greatly. In future work, it would be interesting to investigate if the variance can be lowered in some way and whether it can improve the coherence of the hidden policies. This could potentially improve the continual learning ability of the agent and also possibly make it beneficial for the agent to sample from the hidden policies to improve its performance after a task switch.
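A tiny NumPy experiment illustrating this variance problem: even when the per-step action probabilities of two policies differ only slightly, the product of per-step likelihood ratios over a trajectory has a variance that grows rapidly with the trajectory length. The distributions and numbers below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trajectories, horizons = 10000, [10, 50, 200]

for T in horizons:
    # per-step log-ratios log(pi_k / pi_1); small noise models slightly different policies
    log_ratios = rng.normal(loc=0.0, scale=0.1, size=(n_trajectories, T))
    weights = np.exp(log_ratios.sum(axis=1))   # importance weights prod_t pi_k / pi_1
    print(T, weights.mean(), weights.var())    # the variance blows up as T grows
```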

Figure 4.4: (a) Reward over time of the policies of the networks at different cascade depths on HumanoidSmallLeg-v0, having been trained alternately on HumanoidSmallLeg-v0 and HumanoidBigLeg-v0. (b) Reward over time on alternating Humanoid tasks for different combinations of cascade length and ω.

Effects of cascade length and granularity

Further experiments were run in the alternating Humanoid task setting in order to evaluate the importance of (i) the granularity of timescales and (ii) the maximum timescale of the cascade (controlled by the β_k coefficient of the deepest hidden policy) for the continual learning ability of the PC model. To test the effect of reducing the granularity, the length of the cascade was reduced to 4 and 2 networks (from 8) while maintaining the maximum timescale of the deepest policy (β_N = β × ω^{N−1}) by increasing ω accordingly. To test the effect of decreasing the maximum timescale, ω was reduced from 4 to 2 and √2 while keeping the length of the cascade constant at 8, effectively decreasing β_N. The original agent with N = 8 and ω = 4 was found to be the best at continual learning, and the negative effect of reducing the maximum timescale was much more drastic than that of coarsening the range of timescales (Figure 4.4b).

Effects of changing task switching schedule

In order to test how robust the PC model was to different timescales of switching between tasks, experiments were run on the alternating Humanoid tasks with different task-switching schedules. The task-switching schedule was altered by factors that differed from the ratio of timescales between hidden policies, to ensure that the continual learning ability of the agent was not the result of a harmonic resonance effect. I found that, while continual learning was still possible with slower schedules, at a much faster schedule the agent struggled to switch between tasks (Figure 4.5a). At first glance, this is perhaps counterintuitive since the fast switching schedule should be closer to the i.i.d. setting and thus easier to learn. However, it could just be that the policies of the two tasks cannot easily be represented simultaneously in the same network. This may be remedied with the introduction of a task-id or a recurrent model that can recognise the task at hand. A more speculative, alternative explanation could be that consolidation in the PC model is more effective when learning occurs in blocks, as is sometimes thought to be the case for humans [WS02, FBD+18].

The same experiment was also run for one of the baselines (fixed-KL with β = 10), which was not adept at continual learning for any of the switching schedules (Figure 4.5b). An interesting point to note is that in the baseline runs with slower task-switching schedules, the performances on both tasks decrease over time, with the agent unable to reach previously attained highs. In other words, the agent not only catastrophically forgets, but learning one task puts the network in a state from which it struggles to (re)learn the other task at all.

Figure 4.5: Reward over time for (a) the PC model and (b) the fixed-KL baseline with β = 10 for different task-switching schedules between the HumanoidSmallLeg-v0 and HumanoidBigLeg-v0 tasks. The PC model relearns quickly after each task switch for a range of switching schedules, allowing it to build on its knowledge, while the baseline is unable to.

4.5 Related Work

The PC model clearly takes a lot of inspiration from the synaptic consolidation model from the previous chapter, as well as from PPO [SWD+17], but it also relates to other continual learning methods that use the knowledge distillation technique [HVD15]. In [SCL+18], knowledge is distilled unidirectionally from a flexible network to a more stable one, and vice versa in [FZS+16]. The PC model differs from these two in that the distillation is bidirectional and that networks at multiple timescales are used, rather than just two. Bidirectional distillation is also employed in [TBC+17], but for transfer learning in a multitask context, rather than in a sequential setting. The PC model also relates to ensemble methods for continual learning, in that it involves a collection of networks - the main difference is that, in the current implementation, only one of the networks is used for prediction. Both distillation methods and ensemble methods are discussed in more detail in the Background chapter of this thesis.

4.6 Conclusion and Future Work

In this chapter, I introduced the PC model, which was shown to reduce catastrophic forgetting in a number of RL settings without prior knowledge of the frequency or timing of changes to the agent’s distribution of experiences, whether discrete or continuous.

There are several potential avenues for future work and improvement of the model. A first step would be to run more experiments to compare the PC model to other continual learning methods that are theoretically able to handle continuously changing environments, such as the synaptic consolidation method and some of the methods reviewed in Section 2.3.6, to test the model on a greater variety and number of tasks in sequence, and to perform a more thorough analysis of the hyperparameters of the model.

One limitation of the PC model is that the action distributions are consolidated equally for every state, while it may be more effective to prioritise the storage of particularly important experiences. One way could be to use importance sampling factors, as discussed earlier on, to distill the trajectory distributions rather than single-step action distributions between hidden policies. Another way could be to prioritise consolidation in states where there is large variability in estimated value for the available actions (i.e. ones where the action chosen is particularly crucial).

It is well-known in psychology that the spacing of repetition is important for the consolidation of knowledge in humans [And00]. In this vein, it would also be interesting to see if the PC method could be adapted for off-policy RL in order to investigate any potential synergies with experience replay, perhaps incorporating ideas from some of the existing episodic memory methods for continual learning described in the previous section.

4.7 Experimental Details

Much of the code for the PC model was built on top of and adapted from the distributed PPO implementation in [DHK+17].

4.7.1 Single agent Experiments

For the baseline models, I broadly used the same set of hyperparameters used for the training of Mujoco tasks in [SWD+17]. The value function network shared parameters with the policy network and no task-id input was given to the agents. As in [DHK+17], the running mean and variance of the inputs was recorded and used to normalise the input to mean 0 and variance 1. The gradients were also clipped to a norm of 0.5 as in [DHK+17]. In [SWD+17], different parameters were used for the Humanoid tasks as well as multiple actors - for simplicity we used the Mujoco parameters and a single actor. The hidden policies were all initialised with the same parameters as the visible policy for the PC agent, which means that the beginning of training can be slow as the agent is over-consolidated at the initial weights. This might be remedied in the future by introducing incremental flow from the deeper beakers as training progresses, as was done in the previous chapter with the synaptic consolidation model.

Table 4.1 shows a list of hyperparameters used for the experiments. In future, it would be useful to conduct a broader hyperparameter search for both the baselines and the policy consolidation model. For this work, many more baselines were run than policy consolidation agents in the interest of fairness.

4.7.2 Self-play Experiments

For the self-play experiments, the agents were trained for much longer than in the single agent tasks. For this reason, in order to speed up training, a number of changes were made, namely: using multiple environments in parallel to generate experience, increasing the trajectory length, increasing the minibatch size, and reducing the number of epochs per update. As a result of increasing the number of experiences trained on per update as well as the trajectory length, it was reasonable to expect that the variance of the updates would decrease and that short term nonstationarity would be better dealt with. For this reason, I reduced ω_{1,2} and β in the PC model to allow larger updates per iteration. Additionally, I compared the PC model to a lower range of βs for the fixed-KL baselines for fairness.

The primary (sparse) reward for the RoboSumo agent was administered at the end of an episode, with 2000 for a win, -2000 for a loss and -2000 for a draw. To encourage faster learning, as in [ASBB+18] and [BPS+18], I also trained all agents using a dense reward curriculum in the initial stages of training. I refer readers to [ASBB+18] for the details of the curriculum, which include auxiliary rewards for agents staying close to the centre of the ring and for being in contact with their opponent. Specifically, for the first 15% of training episodes, the agent was given a linear interpolation of the dense and sparse rewards, αr_dense + (1 − α)r_sparse, with α decayed linearly from 1 to 0 over the course of those episodes, after which only the sparse reward was administered. Only the experiences from one of the players in each environment were used to update the agent.

Table 4.1: Hyperparameters for Policy Consolidation Experiments

Parameter                    | Multi-task            | Single task                                    | Self-play
# Task switches              | 19                    | 0                                              | 0
# Timesteps/Task             | 1m                    | 50m (Humanoid) / 20m (others)                  | 600m
Discount γ                   | 0.99                  | 0.99                                           | 0.995
GAE parameter (λ)            | 0.95                  | 0.95                                           | 0.95
Horizon                      | 2048                  | 2048                                           | 8192
Adam stepsize (kth policy)   | ω^{1−k} × 3 × 10^{−4} | ω^{1−k} × 3 × 10^{−4} or ω^{1−k} × 3 × 10^{−5} | ω^{1−k} × 10^{−4}
VF coefficient               | 0.5                   | 0.5                                            | 0.5
# Epochs per update          | 10                    | 10                                             | 6
# Minibatches                | 64                    | 64                                             | 32
Neuron type                  | ReLU                  | ReLU                                           | ReLU
Width hidden layer 1         | 64                    | 64                                             | 64
Width hidden layer 2         | 64                    | 64                                             | 64
Adam β1                      | 0.9                   | 0.9                                            | 0.9
Adam β2                      | 0.999                 | 0.999                                          | 0.999
# Hidden policies            | 7                     | 7                                              | 7
ω_{1,2}                      | 1                     | 1                                              | 0.25
ω                            | 4                     | 4                                              | 4
β (pol. cons.)               | 0.5                   | 0.5                                            | 0.1
Adaptive KL d_targ           | 0.01                  | 0.01                                           | 0.01
# Environments               | 1                     | 1                                              | 16

4.8 Additional Experiments

4.8.1 Directionality of KL constraint

In some initial experiments I found that using a D_KL(π_k || π_k_old) constraint for each policy in the PC model, rather than the D_KL(π_k_old || π_k) constraint used in the KL versions of PPO [SWD+17], resulted in better continual learning, and so in the main results section I compared the PC model with KL baselines that also used the D_KL(π_k || π_k_old) constraint. Here I show in a few experiments that the same qualitative improvements are gained from the PC agent if the original KL constraint from PPO is used for both the PC model and the baselines (Figure 4.6). As can be seen, particularly in the HalfCheetah and Humanoid alternating task settings, the D_KL(π_k || π_k_old) version performs better.

The effect of the directionality of this KL constraint, as well as the directionality of the KL constraints between adjacent policies (of which there are four possible combinations), warrants further investigation and is an important avenue for future work.

Figure 4.6: Reward over time using the (a) D_KL(π_k_old || π_k) and (b) D_KL(π_k || π_k_old) constraints. Panels show runs on the HalfCheetahBigLeg-v0 single task, the [HalfCheetah-v2, HalfCheetahBigLeg-v0] alternating tasks and the [HumanoidSmallLeg-v0, HumanoidBigLeg-v0] alternating tasks.

Chapter 5

Continual Reinforcement Learning with Multi-Timescale Replay

5.1 Introduction

The previous two chapters of this thesis showed how processes at multiple timescales could improve continual reinforcement learning via the consolidation of the agent's parameters and policy respectively; in this chapter, I investigate how multi-timescale memory can help in the context of the replay database. As discussed previously, so-called experience replay (ER) [Lin92, MKS+15] has long been used to counteract short-term correlations between consecutive states in RL, whereby the agent's most recent experiences are stored in a first-in-first-out (FIFO) buffer, which is then sampled from at random during training. By shuffling the experiences in the buffer, the data are rendered independently and identically distributed (i.i.d.) at training time, which prevents forgetting over the (short) timescale of the buffer since the distribution over this period is now stationary.

It has been natural, as such, for the community to investigate whether ER can be used to mitigate forgetting over the longer timescales that are typically associated with continual learning, particularly because it does not necessarily require prior knowledge of the changes to the data distribution. An overview of replay-based methods is provided in Section 2.3.3 in the Background chapter of this thesis. One key observation that has been made both in a sequential multi-task setting [IC18, RAS+19] and in a single task setting [dBKTB15, dBKTB16, ZS17, WR19] has been the importance of maintaining a balance between the storage of new and old experiences in the buffer. By just focusing on recent experiences, the agent can easily forget what to do when it revisits states it has not seen in a while, resulting in catastrophic forgetting and instability; by retaining too many old experiences, on the other hand, the agent might focus too much on replaying states that are not relevant to its current policy, resulting in a sluggish and/or noisy improvement in its performance.

In this chapter, I propose a multi-timescale replay (MTR) buffer to improve continual reinforcement learning, with the following three motivating factors:

• Several of the previously mentioned replay methods use just two timescales of memory in order to strike a balance between new and old experiences [IC18, RAS+19, ZS17]. For example, one of the methods in [IC18] combines a small FIFO buffer with a reservoir buffer that maintains a uniform distribution over the agent’s entire history of experiences [Vit85] - this means that the composition of the replay database will adjust to short term changes in the distribution (with the FIFO buffer) and to long term changes (with the reservoir buffer), but it will not be as sensitive to medium term changes. The MTR method, on the other hand, maintains several sub-buffers that store experiences at a range of timescales, meaning that it can adjust well in scenarios where the rate of change of the distribution is unknown and can vary over time.

• The MTR buffer also draws inspiration from psychological evidence that the function relating the strength of a memory to its age follows a power law [WE91]; forgetting tends to be fast soon after the memory is acquired, and then it proceeds slowly with a long tail. In the MTR buffer, as a result of the combination of multiple timescales of memory, the probability of a given experience lasting beyond a time t in the database before being discarded also follows a power law - more specifically, it approximates a 1/t decay.

• While shuffling the data to make it i.i.d. helps to prevent forgetting, it also discards structural information that may exist in the sequential progression of the data distribution - something that is preserved to a degree in the cascade of sub-buffers in the MTR method. Invariant risk minimisation (IRM) is a recently developed method that assumes that the training data has been split up into different environments in order to train a model that is invariant across these environments and is thus more likely to generalise well. In a second version of the MTR model, the MTR-IRM agent, I apply the IRM principle by treating each sub-buffer as a different environment, to see if it can improve continual learning by encouraging the agent to learn a policy that is robust across the changing distribution of its experiences over time.

The two MTR methods are evaluated in RL agents trained on continuous control tasks in a standard RL setting, as well as in settings where the environment dynamics are continuously modified over time. I find that the standard MTR agent is the best continual learner overall when compared to a number of baselines, and that the MTR-IRM agent improves continual learning in some of the more nonstationary settings, where one would expect a robust policy to be more beneficial. The rest of the chapter is structured as follows: (i) a small background section covering the RL algorithm used for experiments and an explanation of the IRM principle; (ii) a description of the multi-timescale replay method and how it is combined with IRM; (iii) an explanation of the experimental setup and an analysis of the results of the experiments; (iv) conclusion and discussion of future work, followed by an appendix including a table of hyperparameters and the results of some additional experiments.

5.2 Preliminaries

5.2.1 Soft Actor-Critic

The RL algorithm used for all experiments is the soft actor-critic (SAC) method [HZAL18]. SAC is in the same family of maximum entropy RL algorithms as the soft Q-learning algorithm used in the first chapter [HTAL17], which generalise the standard RL objective of maximising return by simultaneously maximising the entropy of the agent's policy and encouraging the agent to find multiple ways of achieving its goal, resulting in more robust solutions. Robustness was an important factor in choosing an RL algorithm for this project, since the added nonstationarity of the environment in two of the experimental settings can easily destabilise the agent's performance; in initial experiments, I found that SAC was more stable than other algorithms such as DDPG [LHP+15]. Importantly, SAC is an off-policy learning algorithm, which enables it to make use of a replay database, and it uses a parameterised actor, which can easily deal with tasks with continuous action spaces, like the ones I use for evaluation in this chapter.

SAC uses four separate function approximators (ANNs): two for approximating separate Q-value functions Q_θ1 and Q_θ2, one for estimating the state value function V_ψ, and one for the policy of the agent, mapping states to action distributions π_φ. The state value network also has an associated target network V_ψ̄. The value loss is given as:

" # 1   2 JV (ψ) = Est∼D Vψ(st) − Eat∼πφ min Qθi (st, at) − αt log πφ(at|st) (5.1) 2 i∈{1,2}

where D is the replay database and α_t is the entropy regularisation coefficient. In the original paper [HZAL18], α_t is fixed but, in the experiments in this chapter, I used the automatic entropy regulariser from [HZH+18], which adapts α_t over time to target a fixed entropy and was found to be more robust than a fixed entropy regularisation coefficient. The minimum over Q-functions is called the clipped double-Q trick, which is a method for correcting the overestimation bias that occurs when Q-learning is combined with function approximation [FvHM18]. The Q-value losses are given as follows:

J_Q(\theta_i) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\left[ \frac{1}{2}\left( Q_{\theta_i}(s_t,a_t) - \hat{Q}(s_t,a_t) \right)^2 \right] \qquad (5.2)

for i ∈ {1, 2}, where Q̂ is the Q-value target:

\hat{Q}(s_t,a_t) = r(s_t,a_t) + \gamma V_{\bar{\psi}}(s_{t+1}) \qquad (5.3)

Finally, the policy loss is given by:

J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D},\, a\sim\pi_\phi}\left[ \alpha_t \log\pi_\phi(a|s_t) - Q_{\theta_1}(s_t,a) \right] \qquad (5.4)

For the experiments in this chapter, I adapted the code of a version of SAC from [HRE+18]. In all experiments, the output of the policy network was two concatenated vectors (each with the dimension of the action space) representing the mean and log standard deviations of a diagonal Gaussian, which was sampled from to give the action at each time step.
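A condensed PyTorch-style sketch of the three SAC losses (Equations 5.1-5.4) is shown below; the network modules and batch format are assumptions made for illustration, and details such as the squashed-Gaussian policy parameterisation, the automatic entropy tuning and the target-network updates are omitted.

```python
import torch

def sac_losses(batch, policy, q1, q2, v, v_target, alpha, gamma=0.99):
    """Compute value, Q and policy losses for a batch of replayed transitions.

    batch: dict of tensors with keys 's', 'a', 'r', 's_next'.
    policy(s) is assumed to return (sampled action, log-probability).
    """
    s, a, r, s_next = batch['s'], batch['a'], batch['r'], batch['s_next']

    # Value loss (Equation 5.1): regress V(s) towards min_i Q_i(s, a~pi) - alpha * log pi
    a_pi, logp_pi = policy(s)
    q_min = torch.min(q1(s, a_pi), q2(s, a_pi))
    v_loss = 0.5 * (v(s) - (q_min - alpha * logp_pi).detach()).pow(2).mean()

    # Q losses (Equations 5.2-5.3): targets use the target value network
    q_target = (r + gamma * v_target(s_next)).detach()
    q_loss = 0.5 * ((q1(s, a) - q_target).pow(2).mean()
                    + (q2(s, a) - q_target).pow(2).mean())

    # Policy loss (Equation 5.4)
    pi_loss = (alpha * logp_pi - q1(s, a_pi)).mean()
    return v_loss, q_loss, pi_loss
```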

5.2.2 Invariant Risk Minimisation

Much of machine learning is based on the principle of empirical risk minimisation [Vap06], which involves the minimisation of the mean value of some loss function over a set of training data. The training data usually only comprise a subset of the domain that we wish to make predictions over, but the hope is that, by minimising the loss on the training data (the empirical risk), our model will generalise well to the entire data distribution that we care about (the true risk). Under certain assumptions, and importantly when the test data is taken from the same distribution as the training data, the empirical risk will tend to the true risk in the limit of infinite data [VC82], but when there is not much training data or if the test data comes from a different distribution, the guarantees are weaker or non-existent. When the true risk is higher than the empirical risk, the model is said to have overfit to the training data, which is a sign that the model has picked up on spurious correlations in the training data, in other words correlations between variables that are not causally linked. For instance, to borrow an example from [ABGLP19], a predictor trained to classify images of cows (that are typically pictured on green pastures) and camels (typically in the desert) may misclassify images of cows on a sandy beach as camels, since it has learnt to associate green pixels with the presence of a cow.

Invariant risk minimisation [ABGLP19] seeks to train machine learning models that avoid spurious correlations and instead learn invariant or stable properties that enable out-of-distribution generalisation. While typically the training data and test data are randomly shuffled in order to ensure they are from the same distribution, IRM posits that information is actually lost this way and it starts with the assumption that the data can be split up into a number of different environments e ∈ E_tr. The IRM loss function encourages the model to learn a mapping that is invariant across all the different training environments, with the hope that, if it is stable across these, then it is more likely to perform well in previously unseen environments. The IRM loss is constructed as follows:

\min_{\Phi:\mathcal{X}\to\mathcal{Y}} \sum_{e\in\mathcal{E}_{tr}} R^e(\Phi) + \lambda \cdot \left\| \nabla_{w|w=1.0} R^e(w\cdot\Phi) \right\|^2 \qquad (5.5)

where Φ is the mapping induced by the model that maps the inputs to the outputs (and is a function of the model parameters), R^e is the loss function for environment e, w is a dummy variable and λ is a parameter that balances the importance of the empirical loss (the first term) and the IRM loss (the second term). The goal of the IRM loss is to find a representation Φ such that the optimal readout w is the same (i.e. the gradient of the readout is zero), no matter the environment; this way, when a new environment is encountered, it is less likely that the representation or readout will have to change in order to suit it.

As will be described in more detail later on, the IRM principle is applied in one version of the MTR replay database where the different environments correspond to buffers of experiences collected at different timescales.
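A minimal PyTorch sketch of the IRM penalty in Equation 5.5 is given below, where the dummy scalar w is the device used in [ABGLP19] to measure how far the current representation is from having an environment-invariant optimal readout; the loss function and data format are illustrative assumptions.

```python
import torch

def irm_penalty(env_loss_fn, phi_outputs, targets):
    """Gradient-squared penalty ||grad_w R^e(w * Phi)||^2 evaluated at w = 1.0."""
    w = torch.tensor(1.0, requires_grad=True)
    loss = env_loss_fn(w * phi_outputs, targets)
    grad_w, = torch.autograd.grad(loss, [w], create_graph=True)
    return grad_w.pow(2)

def irm_objective(env_batches, model, env_loss_fn, lam=1.0):
    """Sum of empirical risks plus lambda-weighted IRM penalties over environments."""
    total = 0.0
    for inputs, targets in env_batches:           # one (inputs, targets) pair per environment
        phi_outputs = model(inputs)
        total = total + env_loss_fn(phi_outputs, targets) \
                      + lam * irm_penalty(env_loss_fn, phi_outputs, targets)
    return total
```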

5.3 Multi-Timescale Replay

The multi-timescale replay (MTR) database of size N consists of a cascade of n_b FIFO sub-buffers, each with maximum size N/n_b, and a separate overflow buffer (also FIFO), which has a dynamic maximum size that is equal to the difference between N and the number of experiences currently stored in the cascade (Figure 5.1). New experiences of the form (s_t, a_t, r_{t+1}, s_{t+1}) are pushed into the first sub-buffer. When a given sub-buffer is full, the oldest experience in the buffer is pushed out, at which point it has two possible fates: (i) with a predefined probability

β_mtr it gets pushed into the next sub-buffer, or (ii) with probability 1 − β_mtr it gets added to the overflow buffer. If the total number of experiences stored in the cascade and the overflow buffer exceeds N, the overflow buffer is shrunk with the oldest experiences being removed until the database has at most N experiences. Once the cascade of sub-buffers is full, the size of the overflow buffer will be zero and any experience that is pushed out of any of the sub-buffers is discarded. During training, the number of experiences sampled from each sub-buffer (including the overflow buffer) is proportional to the fraction it contains of the total number of experiences in the database.

Figure 5.1: Diagram of MTR buffer; each blue box corresponds to a FIFO sub-buffer.

Figure 5.2 shows histograms of the age of experiences in FIFO, reservoir and MTR buffers. The reservoir buffer is implemented using a priority queue, as in [IC18], in the form of a binary min-heap, a data structure that can efficiently keep track of its smallest element. When a new experience is added to the buffer, it is assigned a uniform random priority value between 0 and 1, which is used to insert it into the heap. Once the heap has reached its maximum size, each time a new experience is added, the experience with the lowest priority is removed from the heap. Since the priorities are assigned at random, the removed experience is effectively uniformly sampled from all past experiences - in this way, the set of experiences maintained in the reservoir buffer is a uniform sample of the entire history of experiences of the agent.
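For concreteness, a minimal sketch of this min-heap implementation of reservoir sampling is shown below; the insertion counter used as a tie-breaker is my own addition, included so that experiences themselves are never compared.

```python
import heapq
import random

class ReservoirBuffer:
    """Reservoir buffer backed by a binary min-heap of random priorities, as described above."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []       # entries are (priority, insertion_index, experience)
        self.counter = 0     # tie-breaker so that experiences themselves are never compared

    def push(self, experience):
        entry = (random.random(), self.counter, experience)
        self.counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
        else:
            # Add the new entry and remove the entry with the lowest priority (possibly
            # the new one), so the retained set stays a uniform sample of all experiences.
            heapq.heappushpop(self.heap, entry)

    def sample(self, batch_size):
        return [exp for _, _, exp in random.sample(self.heap, batch_size)]
```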

(a) FIFO (b) Reservoir (c) MTR

Figure 5.2: Histograms of age of experiences in different types of buffer after 5 million expe- riences have been inserted. Each buffer has a maximum size of 1 million experiences, and for the MTR buffer, only the distribution of experiences in the cascade is shown (not the overflow buffer).

5.3.1 Power Law Forgetting

As mentioned in the introduction of this chapter, several studies have shown that memory performance in humans declines as a power law function of time [WE91, RW96]; in other words, the accuracy on a memory task at time t is given by y = at^{-b} for some a, b ∈ ℝ⁺ [KA17]. Here I provide a mathematical intuition, without giving a formal proof, for how the MTR buffer approximates a power law forgetting function of the form 1/t. If we assume the cascade is full, then the probability of an experience being pushed into the k-th sub-buffer is β_mtr^{k-1}, since, for this to happen, it must be pushed from the 1st to the 2nd with probability β_mtr, then from the 2nd to the 3rd with the same probability, and so on. So, in expectation, (N/n_b) · (1/β_mtr^{k-1}) new experiences must be added to the database for an experience to move from the beginning to the end of the k-th sub-buffer. Thus, if an experience reaches the end of the k-th buffer, the expected number of time steps that have passed since that experience was added to the first buffer is given by:

\[
\hat{t}_k = \mathbb{E}\left[t \mid \text{end of } k\text{th buffer}\right] = \sum_{i=1}^{k} \frac{N}{n_b} \cdot \frac{1}{\beta_{mtr}^{i-1}} \tag{5.6}
\]
\[
= \frac{N}{n_b} \cdot \frac{\beta_{mtr}}{1-\beta_{mtr}} \cdot \left( \frac{1}{\beta_{mtr}^{k}} - 1 \right) \tag{5.7}
\]

If we approximate the distribution P(t | end of k-th buffer) with a delta function at its mean, t̂_k, and we note that the probability of an experience making it into the (k+1)-th buffer at all is β_mtr^k, then we can say that:

\[
P(\text{experience lifetime} > \hat{t}_k) \approx P(\text{experience reaches the } (k+1)\text{th buffer}) = \beta_{mtr}^{k} \tag{5.8}
\]

Now, by substituting this approximation into Equation 5.7, we can say that the probability of an experience lasting more than t̂_k time steps in the database is given by:

\[
P(\text{experience lifetime} > \hat{t}_k) \approx \frac{1}{\frac{n_b}{N} \cdot \frac{1-\beta_{mtr}}{\beta_{mtr}} \cdot \hat{t}_k + 1} \tag{5.9}
\]

In other words, the probability of an experience having been retained after t time steps is roughly proportional to 1/t. Note that this contrasts with the slower 1/√t decay of the parameter-level memory in the synaptic consolidation model used in Chapter 3 [BF16], though both are power law functions.

The expected number of experiences required to fill up the MTR cascade (such that the size of the overflow buffer goes to zero) is calculated as follows:

\[
\sum_{i=1}^{n_b} \frac{N}{n_b} \cdot \frac{1}{\beta_{mtr}^{i-1}} \tag{5.10}
\]

which for N = 1e6, n_b = 20 and β_mtr = 0.85 evaluates to approximately 7 million experiences.
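The short script below is a numerical check rather than part of the method: it reproduces these quantities for the settings above, confirming that Equation 5.9 agrees with β_mtr^k and that Equation 5.10 gives roughly 7 million experiences.

```python
# Numerical check of the power-law intuition above.
N, n_b, beta = 1_000_000, 20, 0.85

def t_hat(k):
    # Equation 5.7: expected age of an experience that reaches the end of the k-th sub-buffer.
    return (N / n_b) * (beta / (1 - beta)) * (1 / beta**k - 1)

for k in (1, 5, 10, 20):
    # Equation 5.9 should match the survival probability beta**k from Equation 5.8.
    approx = 1.0 / ((n_b / N) * ((1 - beta) / beta) * t_hat(k) + 1)
    print(f"k={k:2d}  t_hat={t_hat(k):12.0f}  beta^k={beta**k:.4f}  eq(5.9)={approx:.4f}")

# Equation 5.10: expected number of experiences needed to fill the cascade.
fill = sum((N / n_b) / beta**(i - 1) for i in range(1, n_b + 1))
print(f"expected experiences to fill the cascade: {fill:,.0f}")  # roughly 7 million
```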

5.3.2 IRM version of MTR

In the IRM version of the MTR agent (MTR-IRM), I assume that the set of experiences in each sub-buffer of the MTR cascade corresponds to data collected in a different environment. Under this assumption, we can apply the IRM principle to the policy network of the SAC agent, so that each R^e(Φ) corresponds to the policy loss calculated using the data in the corresponding sub-buffer. While it would be interesting to apply IRM to the value losses too, in this work, for simplicity, I only applied it to the policy loss of the agent. The per-experience policy loss for SAC is as follows:

\[
L_\pi(\phi, s) = \mathbb{E}_{a \sim \pi_\phi} \left[ \alpha_t \log \pi_\phi(a|s) - Q_{\theta_1}(s, a) \right] \tag{5.11}
\]

where φ are the parameters of the policy network, π_φ is the conditional action distribution implied by the policy network, s is the state, a is the action sampled from π_φ, α_t is the dynamic entropy coefficient and Q_{θ_1} is the Q-value function implied by one of the two Q-value networks. The policy loss at each iteration is calculated by averaging the per-experience loss shown above over a mini-batch of experiences chosen from the replay database. In combination with IRM, however, the overall policy loss is evaluated as follows:

\[
L_{\pi\text{-}IRM}(\phi) = \mathbb{E}_{s_t \sim D} \left[ L_\pi(\phi, s_t) \right] + \lambda_{IRM} \sum_{i=1}^{n_b} \frac{|D_i|}{|D_{cascade}|} \cdot \left\| \nabla_{w|w=1.0} \, \mathbb{E}_{s_{t'} \sim D_i} \left[ L_\pi(\phi, s_{t'}, w) \right] \right\|^2 \tag{5.12}
\]

where |D_i| is the number of experiences in the i-th sub-buffer in the cascade, |D_cascade| is the total number of experiences stored in the cascade of sub-buffers, w is a dummy variable, and L_π is overloaded, such that:

\[
L_\pi(\phi, s_{t'}, w) = \mathbb{E}_{a \sim \pi_\phi} \left[ \alpha_t \log \pi_\phi(w \cdot a | s) - Q_{\theta_1}(s, a) \right] \tag{5.13}
\]

The balance between the original policy loss and the extra IRM constraint is controlled by λ_IRM.
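A minimal PyTorch-style sketch of Equation 5.12 is given below; `policy_loss` stands for the (overloaded) per-state SAC policy loss of Equations 5.11 and 5.13 and is assumed to be provided by the agent, so the function name and interface here are illustrative only.

```python
import torch

def mtr_irm_policy_loss(policy_loss, db_batch, sub_batches, sub_sizes, lam_irm=0.1):
    # policy_loss(states, w) returns the per-state loss of Equation 5.13, where the dummy
    # scalar w multiplies the sampled action inside log pi; w = 1 recovers Equation 5.11.
    loss = policy_loss(db_batch, torch.tensor(1.0)).mean()   # standard SAC policy loss term
    cascade_size = float(sum(sub_sizes))
    for states, size in zip(sub_batches, sub_sizes):          # one batch per cascade sub-buffer
        w = torch.tensor(1.0, requires_grad=True)
        risk = policy_loss(states, w).mean()
        grad, = torch.autograd.grad(risk, [w], create_graph=True)
        loss = loss + lam_irm * (size / cascade_size) * grad.pow(2)
    return loss
```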

5.4 Experiments

Setup

The two MTR methods were evaluated in RL agents trained with SAC [HZAL18] on two different continuous control tasks (RoboschoolAnt and RoboschoolHalfCheetah)¹, where the strength of gravity in the environments was modified continuously throughout training in three different ways: fixed gravity, gravity fluctuating in a sine wave, and linearly increasing gravity. The idea was to see how the agents could cope with changes to the distribution that arise from different sources and at different timescales. In the fixed gravity setting, which constitutes the standard RL setup, the changes to the distribution result from correlation between successive states and changes to the agent's policy. In the other two settings, continual learning is made more challenging because changes are also made to the dynamics of the environment. In the linear setting, the gravity is adjusted slowly over time, with no setting being repeated at any point during training; in the fluctuating setting, the changes are faster and gravity settings are revisited so that the relearning ability of the agent can be observed.

¹These environments are originally from [SWD+17] and were modified by adapting code from [PGK+18].

Graphs of how the gravity changes in each of the different settings are shown in Figure 5.3. The fixed and linear gravity experiments were run for 5 million time steps, while the fluctuating gravity experiments were run for 12 million steps, comprising 3 full cycles of 4 million steps each. The value of the gravity setting was appended to the state vector of the agent so that there was no ambiguity about what environment the agent was in at each time step.
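The schedules can be sketched as simple functions of the training step, as below; the exact gravity bounds of the linear and fluctuating schedules are not stated in the text, so the range of −7 to −17 m/s² used here is an assumption taken from the evaluation settings.

```python
import math

G_MIN, G_MAX, G_FIXED = -17.0, -7.0, -9.81   # assumed bounds; fixed setting uses -9.81 m/s^2
CYCLE_STEPS = 4_000_000                      # one full cycle of the fluctuating schedule

def fixed_gravity(step):
    return G_FIXED

def linear_gravity(step, total_steps=5_000_000):
    # Gravity strength increases linearly over training, never revisiting a setting.
    frac = min(step / total_steps, 1.0)
    return G_MAX + frac * (G_MIN - G_MAX)

def fluctuating_gravity(step):
    # Sinusoidal fluctuation between the bounds; three full cycles over 12 million steps.
    mid, amp = (G_MIN + G_MAX) / 2, (G_MAX - G_MIN) / 2
    return mid + amp * math.sin(2 * math.pi * step / CYCLE_STEPS)

# In the experiments, the gravity value is updated every 1000 time steps and appended to
# the agent's state vector.
```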

(a) Fixed gravity (b) Linearly increasing gravity (c) Fluctuating gravity

Figure 5.3: Gravity settings over the course of the simulation in each of the three set-ups.

The MTR and MTR-IRM methods were compared with FIFO, reservoir and half-reservoir- half-FIFO baselines. In the last baseline, new experiences are pushed either into a FIFO buffer or a reservoir buffer (both of equal size) with equal probability. The maximum size of each database used is 1 million experiences and was chosen such that, in every experimental setting, the agent is unable to store the entire history of its experiences.

In order to evaluate the continual learning ability of the agents, their performance was recorded over the course of training (in terms of mean reward) on (i) the current task at hand, referred to as the 'training reward', and (ii) a uniformly spaced subset of the gravity environments experienced during training (−7, −9.5, −12, −14.5 and −17 m/s²), referred to as the 'mean evaluation reward'. A table of the hyperparameters used for training is shown in Section 5.7.
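A sketch of this evaluation protocol is shown below; `make_env` and `run_episode` are placeholders for the actual environment construction and rollout code.

```python
EVAL_GRAVITIES = [-7.0, -9.5, -12.0, -14.5, -17.0]  # m/s^2

def mean_evaluation_reward(policy, make_env, run_episode):
    # Run one episode per held-out gravity setting and average the episode returns.
    returns = []
    for g in EVAL_GRAVITIES:
        env = make_env(gravity=g)   # the gravity value is also appended to the state vector
        returns.append(run_episode(policy, env))
    return sum(returns) / len(returns)
```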

Results

Across all three continual learning settings, the standard MTR agent was the most consistent performer, demonstrating either the best or second best results in terms of training reward and evaluation reward in all tasks, indicating that recording experiences over multiple timescales can improve the tradeoff between new learning and retention of old knowledge in RL agents. The MTR-IRM agent achieved the best evaluation reward in two of the more nonstationary settings for the HalfCheetah task, but not in the Ant task, indicating that learning a policy that is invariant across the experiences faced by the agent over time can be beneficial for generalisation and mitigating forgetting, but that it might depend on the particular task setting and the transfer potential between the different environments. Below, I discuss the results from each setting in more detail.

Fixed gravity In the fixed gravity experiments, the FIFO and MTR agents are consistently the best performers (Figure 5.4), with both agents achieving a similar final level of training reward in both the HalfCheetah and Ant tasks. One would expect the FIFO agent to be a relatively strong performer in this setting, since the environment dynamics are stable and so the retention of old memories is likely to be less crucial than in the other two more nonstationary settings. The fact that the standard MTR agent performs as well as the FIFO agent shows that the replay of some older experiences does not hold back the progress of the agent, but also that it does not seem to particularly help the overall performance either. The MTR-IRM agent, on the other hand, performed poorly in the fixed gravity setting, presumably because there is not enough nonstationarity to reap any generalisation benefits from learning an invariant representation for the policy, and instead the IRM constraints just slow down the pace of improvement of the agent. The fixed gravity agents were not evaluated on different gravities, as they were only trained in one gravity environment (−9.81 m/s²).

(a) HalfCheetah Train (b) Ant Train

Figure 5.4: Fixed gravity setting. Training reward for (a) HalfCheetah and (b) Ant. Mean and standard error bars over three runs.

Linearly increasing gravity In the linearly increasing gravity experiments, the results were more mixed (Figure 5.5). In the HalfCheetah task, the FIFO agent performed best in terms of training reward, but was the worst performer when evaluated on the 5 different gravity settings. This is somewhat intuitive: the FIFO agent can be expected to do well on the training task as it is only training with the most recent data, which are the most pertinent to the task at hand; on the other hand, it quickly forgets what to do in gravity settings that it experienced a long time ago (Figure 5.6(a)). Interestingly, the MTR-IRM agent surpassed all other agents in the evaluation setting by the end of training, with the standard MTR agent coming second, perhaps indicating that, in a more nonstationary setting (in contrast with the fixed gravity experiments), learning a policy that is invariant across the agent’s history of experiences can lead to a better overall performance in different environments.

In Figure 5.6, we can see that, while in the FIFO agent the cycles of learning and forgetting are quite clear, in all other agents, where older experiences were maintained for longer in the buffer, the forgetting process is slower. This does not seem to be qualitatively different for the MTR-IRM agent - it just seems to be able to reach a good balance between achieving a high performance in the various settings and forgetting slowly. In particular, it is hard to identify whether there has been much forward transfer to gravity settings that have yet to be trained on, which one might hope for by learning an invariant policy; at the beginning of training, the extra IRM constraints seem to inhibit the progress on all settings (as compared to the standard MTR agent), but in the latter stages the performance on a number of the later settings improves drastically.

One curious result is that in the reservoir agent, by the end of training, the performance on the low gravities experienced earlier on in training (−7 and −9.5) is worse than on the higher gravities experienced later on, despite the fact that the reservoir buffer maintains a uniform sample of the entire history of experiences. It could be that the task is simply more difficult in a lower gravity setting, or that it is less compatible with the task in higher gravity settings - in other words, it may require a radically different policy that is hard to represent in the same network. Alternatively, it could be a result of the order of training - this could be tested in future work by reversing the order in which the different gravities are experienced by the agent over time. Experiments run in a traditional multi-task setting, where the gravity was randomly sampled from the interval [−7, −17] throughout training, i.e. the tasks are interleaved, showed that it is possible to learn good policies for all tasks in the same policy network, though the performance on the lowest gravity setting was slightly less stable (Figure 5.12).

In the linearly increasing setting for the Ant task, the FIFO, MTR and MTR-IRM agents are equally strong on the training task (Figure 5.5(b)), and the MTR and reservoir agents are joint best with regards to the mean evaluation reward (Figure 5.5(d)). An interesting result to note is that the reservoir and half-reservoir-half-FIFO agents both perform poorly on the training task towards the end of training (something also observed in the equivalent setting on the HalfCheetah task). One explanation might be that, as time passes, the proportion of recent experiences declines in the reservoir buffer, making it harder for the agent to hone its performance on the task at hand. Furthermore, the set of experiences in the reservoir becomes more stationary over time, and so it is possible that the agent starts to overfit to the data distribution in the buffer. However, this would not explain why the half-reservoir-half-FIFO agent actually performs worse than the reservoir agent, since the memories in the FIFO part of the buffer are regularly refreshed. One speculative explanation for this is that the half-reservoir-half-FIFO agent accumulates two sub-buffers of experiences at very different timescales that can end up representing quite different distributions, which might result in high-variance gradient updates and instability in learning. In contrast, the MTR buffers maintain a greater number of sub-buffers at different timescales, smoothing the distribution of experiences in the database.

Finally, the MTR-IRM agent does not show the same benefits as in the linearly increasing HalfCheetah setting; this could be because it is more difficult to learn an invariant policy across the different gravity settings across this task, with less potential for transfer between policies for the various environments. The transferability of policies between different environments and the effects of the order in which the environments are experienced are important topics for future investigation.

(a) HalfCheetah Train (b) Ant Train

(c) HalfCheetah Mean Eval (d) Ant Mean Eval

Figure 5.5: Linearly increasing gravity setting. (Top) Training reward for (a) HalfCheetah and (b) Ant. (Bottom) Mean evaluation reward for (c) HalfCheetah and (d) Ant. Mean and standard error bars over three runs.

(a) FIFO (b) Reservoir (c) Half Reservoir Half FIFO

(d) MTR (e) MTR with IRM

Figure 5.6: Individual Evaluation rewards for linearly increasing gravity HalfCheetah. Mean and standard error bars over three runs.

Fluctuating gravity In the fluctuating gravity setting, the performances of the various agents were less differentiated than in the linearly increasing gravity setting, perhaps because the timescale of changes to the distribution was shorter and the agents had the opportunity to revisit gravity environments (Figure 5.7). In the HalfCheetah task, the MTR-IRM agent was the best performer in terms of final training and evaluation rewards, though by a very small margin. In the Ant task, the best performer was the standard MTR agent, which reached a higher and more stable evaluation reward than any of the other agents. Once again, as in the linearly increasing gravity setting, the MTR-IRM agent struggled comparatively on the Ant task.

An interesting observation with regards to the agents' ability to relearn can be made by comparing the individual evaluation rewards of the FIFO and MTR-IRM agents in the fluctuating gravity setting. The fluctuating performance on each of the different gravity evaluation settings can be observed very clearly in the results of the FIFO agent (Figure 5.8(a)), where the ups and downs in performance reflect the fluctuations of the gravity setting being trained on. While these fluctuations in performance can also be observed in the MTR-IRM agent, the dips in performance on gravity settings that have not been experienced in a while become significantly shallower as training progresses, providing evidence that the agent is consolidating its knowledge over time (Figure 5.8(b)). It would be interesting, in future work, to run experiments for longer to see if the MTR and MTR-IRM agents continue to build on their knowledge in the manner observed.

(a) HalfCheetah Train (b) Ant Train

(c) HalfCheetah Mean Eval (d) Ant Mean Eval

Figure 5.7: Fluctuating gravity setting. (Top) Training reward for (a) HalfCheetah and (b) Ant. (Bottom) Mean evaluation reward for (c) HalfCheetah and (d) Ant. Mean and standard error bars over three runs.

(a) FIFO (b) MTR-IRM

Figure 5.8: Individual Evaluation rewards for fluctuating gravity HalfCheetah with (a) FIFO buffer and (b) MTR-IRM buffer. Mean and standard error bars over three runs.

Extrapolation error In general, the replay methods that kept older experiences (i.e. all but the FIFO buffer) often experienced instability in the latter stages of training, which may have been due to a known problem with off-policy methods sometimes referred to as the extrapolation error, whereby a mismatch between the data used to train the agent and the distribution of experiences generated by the current policy leads to errors in the estimation and learning of the value function [FMP19]. In some additional experiments, I tested a few existing methods used to mitigate this problem, but none of them resulted in significant improvements to stability and performance (Section 5.8).

5.5 Related Work

Section 2.3.3 of the Background chapter provides a broad survey of replay-based methods for continual learning, and so here I briefly elaborate on a selection of works cited in the Introduction that investigate or make use of multiple timescales in the replay database. In [dBKTB15], it is shown that retaining older experiences as well as the most recent ones can improve the performance of deep RL agents on the pendulum swing-up task, particularly for smaller replay databases. In [ZS17], it is shown that combined experience replay, which trains agents on a combination of the most recent experiences as they come in and those in the replay buffer, enables faster learning than training on just one or the other in both a tabular and a linear function approximation setting, particularly when the replay database is very large. In [IC18], various experience selection methods are evaluated for deep RL, and it was noted that for each method, a small FIFO buffer was maintained in parallel in order to ensure that the agent did not overfit to the more long-term memory buffer and had a chance to train on all experiences. In [RAS+19], a similar protocol is used to combine new experiences and old ones in a deep RL setting with the use of multiple actors and a single learner: each actor maintains a reservoir buffer and periodically feeds a set of experiences to the learner, which is comprised of a certain proportion of freshly generated experiences and another of ones sampled from the reservoir. It is shown that a 50/50 split of new and old experiences provides a good balance between mitigating forgetting and reaching a high overall level of performance on different tasks. As discussed in the Introduction, these methods employ two different timescales in the distribution of experiences used for training, while the MTR methods use multiple timescales, which makes them sensitive to changes to the data distribution that occur at a range of different speeds or frequencies.

Finally, it is worth mentioning that in [WR19], it is shown that prioritising the replay of recent experiences in the buffer improves the performance of deep RL agents. In this paper, a FIFO buffer is used, so the data is only stored at the most recent timescale, but the probability of an experience being chosen for replay decays exponentially with its age.

5.6 Conclusion

In this chapter, I investigated whether a replay buffer set up to record experiences at multiple timescales could help in a continual reinforcement learning setting where the timing, timescale and nature of changes to the incoming data distribution are unknown to the agent. One of the versions of the multi-timescale replay database was combined with the invariant risk minimisation principle [ABGLP19] in order to try and learn a policy that is invariant across time, with the idea that it might lead to a more robust policy that is more resistant to catastrophic forgetting. I tested numerous agents on two different continuous control tasks in three different continual learning settings, where the gravity in the environment was dynamically adjusted throughout training. I found that the standard MTR agent was the most consistent performer overall, when measured on reward on the training environment and on a set of evaluation environments. The MTR-IRM agent was the best continual learner in two of the more nonstationary settings in the HalfCheetah task, but was relatively poor on the Ant task, indicating that the utility of the IRM principle may depend on specific aspects of the tasks at hand and the transferability between the policies in different environments. The MTR method provides a configurable compromise, with respect to the replay buffer distribution, between the typically used FIFO replay buffer, which only holds short-term experiences, and the reservoir buffer, which maintains a uniform distribution over the agent's entire history. While the empirical results were mixed, one of the aims of this chapter was to contribute a method (in the MTR-IRM agent) that attempts to use the nonstationarity of the data distribution to the agent's advantage, rather than treat it as a hindrance, which is the approach of most replay-based methods, where old data and new data are shuffled to make them i.i.d. An important direction for future continual learning methods will be to figure out how to best use this additional information embedded in the nonstationarity, rather than discarding it.

Future Work One important avenue for future work would be to evaluate the MTR model in a broader range of training settings; for example: (i) it would be interesting to vary the timescales at which the environment is adjusted (e.g. gravity fluctuations at different frequencies) in order to evaluate the robustness of the method, and (ii) it would be interesting to deploy it in a multi-agent setting as in Chapter 4. Another aspect of the training environments that deserves more investigation is the relatedness of the tasks at different gravity settings, which might help explain why the MTR-IRM agent performed better on the HalfCheetah task than the Ant task. Furthermore, it would be useful to evaluate the sensitivity of the MTR method's performance to its hyperparameters (β_mtr and n_b). Finally, it is worth noting that, in its current incarnation, the MTR method does not select which memories to retain for longer in an intelligent way - it is simply determined with a coin toss. In this light, it would be interesting to explore ways of prioritising the retention of certain memories from one sub-buffer to the next, for example by the temporal difference error, which is used in [SQAS15] to prioritise the replay of memories in the buffer.

5.7 Experimental Details

Hyperparameters Below is a table of the relevant hyperparameters used in our experiments.

Table 5.1: Hyperparameters for Multi-Timescale Replay Experiments

Parameter                                        Value
# hidden layers (all networks)                   2
# units per hidden layer                         256
Learning rate                                    0.0003
Optimiser                                        Adam
Adam β1                                          0.9
Adam β2                                          0.999
Replay database size (all buffers)               1e6
# MTR sub-buffers n_b                            20
β_mtr                                            0.85
Hidden neuron type                               ReLU
Target network τ                                 0.005
Target update frequency / time steps             1
Batch size                                       256
# Training time steps                            5e6 (fixed / linear), 1.2e7 (fluctuating)
Training frequency / time steps                  1
Gravity adjustment interval / time steps         1000
Evaluation interval / episodes                   100
# Episodes per evaluation                        1
IRM policy coefficient λ_IRM                     0.1

5.8 Additional Experiments

In many of the experiments run in this chapter, I found that agents trained with methods that retained experiences for a long period of time (essentially all but the FIFO buffer) often displayed instability during training, with large drops in performance occurring during the latter stages of training. In light of this, I ran some experiments to try to address this problem, under the hypothesis that these methods become subject to what is sometimes referred to as extrapolation error, whereby a mismatch between the data used to train the agent and the distribution of experiences generated by the current policy leads to errors in the estimation and learning of the value function [FMP19]. The extrapolation error is a consequence of the 'deadly triad' in deep RL (the combination of off-policy learning, bootstrapping and function approximation), and has recently been studied in the context of batch RL [FMP19, JGS+19, WTN19], which is concerned with training RL agents with a fixed dataset of experiences (possibly from some other agent) without any interactions with the environment. In this setting, the policy of the agent can start off being very different to that of the agent that generated the training data; during training, the value function is updated based on experiences in the training data, but this may end up generating bad predictions of the values of states that are actually visited by the agent under the current policy if they do not match sufficiently, not allowing the agent's performance to improve.

While the experiments in this chapter are not performed in the batch RL setting, when experi- ences are kept in the replay database for a very long time, some of them may become unlikely and irrelevant under the current policy and cause similar issues with extrapolation error. Below I describe three different methods that I employed to try to correct this problem; unfortunately, in some preliminary experiments with each of them, none seemed to clearly improve the stability of training or the performance of the agent.

5.8.1 Importance Sampling

One of the first attempts to address the extrapolation error was to reweight the per-experience value function losses by how likely the action taken in the recorded experience is under the current policy, using importance sampling. For example, the Q-value loss in Equation 5.2 is usually estimated with a randomly selected batch B from the replay database as follows:

\[
J_Q(\theta_i) \approx \frac{1}{|B|} \sum_{(s_t, a_t, r_{t+1}, s_{t+1}) \in B} \frac{1}{2} \left( Q_{\theta_i}(s_t, a_t) - \left( r_{t+1} + \gamma V_{\bar{\psi}}(s_{t+1}) \right) \right)^2 \tag{5.14}
\]

As the policy of the agent changes, the distribution of experiences in the batch sampled from the replay database (some of which may be quite old) may not match well with the distribution of experiences under the current policy, leading to extrapolation error. With the inclusion of importance-sampling (IS) factors, the loss becomes:

\[
J_Q(\theta_i) \approx \frac{1}{|B|} \sum_{(s_t, a_t, r_{t+1}, s_{t+1}) \in B} \rho_t \cdot \frac{1}{2} \left( Q_{\theta_i}(s_t, a_t) - \left( r_{t+1} + \gamma V_{\bar{\psi}}(s_{t+1}) \right) \right)^2 \tag{5.15}
\]

where ρ_t is the IS factor. Assuming uniform sampling from the replay database and using ordinary importance sampling, we would set ρ_t = π_φ(a_t|s_t). I instead used a form of weighted importance sampling, which normalises each factor by the sum of the factors over the batch:

\[
\rho_t = \frac{\pi_\phi(a_t|s_t)}{\sum_{(s_{t'}, a_{t'}) \in B} \pi_\phi(a_{t'}|s_{t'})}
\]

In [SQAS15], a similar scheme is used to reweight the value updates of experiences that have been sampled in a prioritised fashion from the replay database, as opposed to the usual uniform sampling method.
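A small sketch of this reweighting in PyTorch is shown below; the per-experience Bellman errors and action log-probabilities are assumed to be supplied by the SAC training loop, and the factors are treated as constants when weighting the value loss.

```python
import torch

def weighted_is_factors(log_probs: torch.Tensor) -> torch.Tensor:
    # rho_t = pi_phi(a_t|s_t) normalised by the sum of the factors over the batch.
    probs = log_probs.exp()
    return probs / probs.sum()

def weighted_q_loss(per_experience_errors: torch.Tensor, log_probs: torch.Tensor) -> torch.Tensor:
    # Equation 5.15: average the squared Bellman errors weighted by the (detached) IS factors.
    return (weighted_is_factors(log_probs).detach() * per_experience_errors).mean()
```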

The idea behind implementing importance sampling here is to emphasise value updates that are relevant to the current policy and to minimise the noise that results from those that are not. Empirically, however, I found that weighted importance sampling had little effect on the performance of the FIFO and reservoir agents but negatively affected that of the MTR agent (Figure 5.9). I also tried importance sampling factors that took into account the probability of the action at the time of the experience, μ(a_t|s_t), but this yielded similar results. Possible explanations for why importance sampling did not help are: (i) importance sampling can add variance to the updates, which can in turn affect performance, (ii) ignoring currently irrelevant actions might not be so useful in a nonstationary setting, where these actions may become relevant again at a future stage, or (iii), as noted in [FMP19], simply reweighting the updates by the action probabilities does not deal with the fact that high probability state-action pairs under the current policy might not even feature in the replay database.

(a) No IS (b) With IS

Figure 5.9: Comparison of training performance of agents on the fluctuating-gravity HalfChee- tah task (a) without and (b) with weighted importance sampling.

5.8.2 ReF-ER

Some approaches to dealing with the extrapolation error in the batch RL setting try to ensure that the distribution of experiences under the current policy remains close to the distribution in the replay database [NK19, FMP19]. In this section, I show the results of implementing one such method called Remember-and-Forget Experience Replay (ReF-ER) [NK19]. Using the notation from the original paper, if we let ĝ(φ) be the estimated policy gradient, then with ReF-ER it is adapted as follows:

\[
\hat{g}_{ReF\text{-}ER}(\phi) =
\begin{cases}
\beta_{ReF\text{-}ER} \, \hat{g}(\phi) - (1 - \beta_{ReF\text{-}ER}) \, \hat{g}^D(\phi) & \text{if } \frac{1}{c_{max}} < \rho_t < c_{max} \\
-(1 - \beta_{ReF\text{-}ER}) \, \hat{g}^D(\phi) & \text{otherwise}
\end{cases} \tag{5.16}
\]

where the ρ_t are importance values of the form π_φ(a_t|s_t)/μ(a_t|s_t) (where μ(a_t|s_t) is the recorded conditional action probability at the time of the experience), c_max > 1 is a tolerance value that determines whether an experience is likely enough under the current policy to be used to improve the policy, and ĝ^D(φ) is given as follows:

\[
\hat{g}^D(\phi) = \mathbb{E}_{s_t \sim B} \left[ \nabla D_{KL}\left( \mu_t \, \| \, \pi_\phi(\cdot|s_t) \right) \right]. \tag{5.17}
\]

Thus, updates in the direction of −ĝ^D(φ) minimise the KL divergence between the distribution of actions under the current policy and that of the replay database. Additionally, β_ReF-ER is adjusted dynamically in order to target a fixed proportion D of experiences in the replay database that are considered too off-policy by the threshold c_max:

\[
\beta_{ReF\text{-}ER} \leftarrow
\begin{cases}
(1 - \eta_\beta) \, \beta_{ReF\text{-}ER} & \text{if } n_{far}/N > D \\
(1 - \eta_\beta) \, \beta_{ReF\text{-}ER} + \eta_\beta & \text{otherwise}
\end{cases} \tag{5.18}
\]
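A sketch of these two update rules is given below; the hyperparameter values (c_max, D, η_β) are placeholders rather than the ones used in [NK19].

```python
def refer_gradient(g_hat, g_d, rho, beta, c_max=4.0):
    # Equation 5.16: use the policy gradient only for experiences whose importance ratio
    # rho lies inside the trust region (1/c_max, c_max); always apply the term that pulls
    # the policy back towards the replay distribution.
    if 1.0 / c_max < rho < c_max:
        return beta * g_hat - (1.0 - beta) * g_d
    return -(1.0 - beta) * g_d

def update_beta(beta, frac_far, target_D=0.1, eta_beta=1e-4):
    # Equation 5.18: anneal beta so that a fraction target_D of the replay database is
    # considered too off-policy; frac_far is n_far / N.
    if frac_far > target_D:
        return (1.0 - eta_beta) * beta
    return (1.0 - eta_beta) * beta + eta_beta
```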

Figure 5.10(a) shows the training performance of a number of agents trained with ReF-ER on the HalfCheetah task with linearly increasing gravity, all of which collapse before the end of training. In practice, what happens in all runs is that eventually the KL divergence between the current policy and the replay experiences widens to the extent that the −ĝ^D(φ) updates made in order to correct it become too large, leading to a vicious cycle of larger and larger updates and a collapse in performance. Even the agents with the FIFO buffer, where the replay experiences should deviate the least from the current policy, suffer from this eventual collapse. Figure 5.10(b) shows similar results in the fixed gravity setting, where the nonstationarity is less extreme; here the FIFO with ReF-ER runs do not collapse, but they also do not reach as high a performance as the agents with the standard FIFO buffer, indicating that the ReF-ER method is over-constraining the policy of the agent, not allowing it to improve as quickly as it otherwise could.

(a) Linear Gravity (b) Fixed Gravity

Figure 5.10: Comparing training performance of runs with and without ReF-ER on the HalfCheetah task with (a) linearly increasing gravity and (b) fixed gravity.

5.8.3 Behaviour Regularised Actor Critic

The final set of experiments used one version of a method for batch RL called Behaviour Regularised Actor Critic (BRAC) [WTN19]. Once again, the idea here is to ensure that the data generated by the current policy does not differ too much from the data in the replay buffer, but instead of regularising the policy directly, the value-penalty version of BRAC regularises the value of states where the current policy diverges from the buffer. In particular, BRAC learns the following value function:

" ∞ # π X t VD (s) = Eπ γ (r(st, at) − αbracD(π(·|st), πb(·|st))) (5.19) t=0

where π_b is the behavioural policy (i.e. the source of the training data), D is a divergence function between probability distributions, e.g. the KL divergence, and α_brac weighs the importance of minimising this divergence term against reward maximisation. The value penalty in BRAC is analogous to SAC, where an entropy term appears in place of the divergence term. In the experiments I ran, I recorded the action distribution for each experience in the replay database in order to calculate the D(π(·|s_t), π_b(·|s_t)) term, using the KL divergence for D.
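The sketch below shows one way the value penalty of Equation 5.19 can enter a bootstrapped target; the discount and coefficient values are placeholders, and the penalty is applied at the current state following the recursive expansion of Equation 5.19.

```python
def brac_value_target(reward, next_value, kl_current_vs_behaviour, gamma=0.99, alpha_brac=0.1):
    # Penalised reward r(s_t, a_t) - alpha_brac * D(pi(.|s_t), pi_b(.|s_t)), then bootstrap.
    return reward - alpha_brac * kl_current_vs_behaviour + gamma * next_value
```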

Figure 5.11 shows the performance of agents with different α_brac coefficients; one setting reached a higher maximum performance than the normal reservoir buffer, but at the cost of higher variance during training. A possible explanation for this is that, unlike in the batch RL setting, for which this method was designed, the replay buffer is constantly changing, and so the KL divergence is a moving target, leading to instability.

Figure 5.11: Training reward for HalfCheetah task with fixed gravity using a reservoir buffer with different BRAC coefficients.

5.9 Multi-task Experiments

(a) Training performance (b) Evaluation performance

Figure 5.12: Multitask setting: (a) training performance and (b) evaluation performance with uniformly random gravity between −7 and −17 m/s² with a FIFO buffer. This experiment shows that the agent has the capacity to represent good policies for all evaluation settings if trained in a non-sequential setting.

Chapter 6

Conclusion

The primary goal of this thesis has been to develop algorithms that can improve the continual learning ability of deep reinforcement learning agents in circumstances where they have little or no prior knowledge of the timing, timescale or nature of the changes to the distribution of their experiences over time. The key reasons for choosing this focus were as follows:

• If we want to deploy neural network-based algorithms in the real world, they have to be able to build on their knowledge from information that is streaming in from an environment that is continuously changing in unpredictable ways.

• Neural networks trained with backpropagation are prone to catastrophic forgetting [MC89], whereby learning from new data leads to erasure of previously acquired knowledge, which poses one of the greatest obstacles to their ability to learn continually and is thus important to resolve.

• Many methods have been developed to mitigate catastrophic forgetting, but they have largely been tested in the context of sequences of discrete tasks. As a result, a large number of these approaches rely on the knowledge of the boundaries between tasks, which cannot be applied if these boundaries are not known, or if the changes to the distribution are continuous.


• For these reasons, there is a need to develop and test algorithms that can deal with both discrete and continuous changes to the data distribution, with little or no prior knowledge of how and when these changes will occur. Reinforcement learning provides a natural test bed for these situations and arguably most closely simulates acting in the real world, as compared to the other two main machine learning paradigms of supervised and unsupervised learning.

In this thesis, I presented three new methods for mitigating catastrophic forgetting in a continual reinforcement learning setting by exploiting the concept of memory at multiple timescales in various ways:

1. In Chapter 3, I presented an algorithm that took inspiration from the fact that, whereas in an ANN, the parameters are typically modelled as scalar values, an individual synapse in the brain comprises a complex network of interacting biochemical components that evolve at different timescales. I showed that by equipping both tabular and deep RL agents with a synaptic model that incorporates this biological complexity [BF16], catastrophic forgetting can be mitigated at multiple timescales. In particular, I found that as well as enabling continual learning across sequential training of two simple tasks over a variety of switching schedules, it can also be used to overcome within-task forgetting by reducing the need for an experience replay database.

2. In Chapter 4, I developed a method that aimed to address one of the limitations of the synaptic consolidation model, namely that improving the memory of individual parameters in the network would not necessarily translate smoothly into improving the behavioural memory of the agent, since the output of the network is a complex non-linear function that involves codependent terms between the parameters. The policy consolidation method works by directly consolidating the policy of the agent using a cascade of hidden policy networks that distil knowledge [HVD15] between their neighbours at a range of timescales. By combining bidirectional distillation with an RL method called PPO [SWD+17] that limits the size of updates to the network in policy space, the cascade is able to simultaneously record the agent's policy at multiple timescales and regularise

the agent’s behaviour by its own history. The method was shown to improve continual learning relative to several PPO baselines on a number of continuous control tasks in single-task, alternating two-task, and multi-agent competitive self-play settings.

3. In Chapter 5, I proposed a new type of experience replay database that records the experiences of an agent at multiple timescales during training, in contrast with the typical FIFO or reservoir buffers, which only record over one timescale. As with the previous two methods, the use of multi-timescale memory here was to provide a balance between new learning and retention of old knowledge. I also proposed an adapted version of the MTR database, which uses the concept of invariant risk minimisation [ABGLP19] to learn a policy that is invariant across the experiences recorded in each timescale bucket of the replay database; this way, the policy should theoretically become robust across different environments and thereby minimise catastrophic forgetting. The two MTR methods were evaluated in comparison with several baselines on two different continuous control tasks in three different continual learning settings, where the environment was modified smoothly over time by slowly adjusting the strength of gravity. The standard MTR agent was the best overall performer, and the MTR-IRM agent demonstrated benefits to continual learning in some of the more nonstationary settings.

Overall, the methods in this thesis show that memory at multiple timescales, whether it is at the level of the parameters, the policy or the episodic memory of the agent, can help to maintain a balance between the preservation of previously acquired knowledge and the ability to adapt quickly to new incoming data - particularly in scenarios where the dynamics of the data distribution are unpredictable.

6.1 Limitations and Future Work

In this final section, I discuss some of the limitations of the methods I developed and potential avenues for future work.

Coarseness of old memories In continual learning, the condition that requires total memory resources to be fixed or limited entails an unavoidable tradeoff between learning new things and remembering old things, since with a limited amount of space one can only store a certain amount of information. In the methods developed in this thesis, the tradeoff manifests itself in the granularity at which information is stored at different timescales. For example, in the synaptic consolidation model, while the hidden variables with short timescales are sensitive to sharp changes to the parameter value, the hidden variables with long timescales adjust very slowly - this way they store a long term average of where the value of the parameter used to be, but short, sharp changes to the parameter value in its distant past will be smoothed out. Equally, in the MTR buffer, the long timescale sub-buffers in the cascade will hold a sparse sample of a long period in the history of experiences, and so they may not include experiences that are representative of a short period in the distant past when the agent was in a completely different environment from normal. A similar effect can be expected in the policy consolidation agent.

Prioritisation The effect of older memories being 'blurred' or incomplete can be problematic if there are parameter changes, policy changes or events that occurred a long time ago and are particularly important to remember. All three models currently operate with the assumption that the recent past is always more important than the distant past, which in general may be a good heuristic, since the recent past is likely to be more relevant for the present and future challenges faced by the agent, but perhaps it can be improved by introducing mechanisms for prioritising the storage of certain memories. As discussed in the Background chapter, many existing regularisation-based methods use importance factors to prioritise the consolidation of certain parameters, for example by the Fisher information metric [KPR+17] or by the sensitivity of the output of the network with respect to the parameters [ABE+17]. Prioritising the replay of experiences by the TD error has also been used to speed up learning in RL agents [SQAS15], and the storage of particularly 'surprising' data points has been used to mitigate forgetting [IC18, NO17]. Furthermore, in the brain, the neuromodulator dopamine, which has been associated with reward prediction error [Sch07] and with the presence of novel stimuli [BMMH10], has been shown to be important for synaptic consolidation [FS+90] and thus could facilitate memory prioritisation.

Prioritisation of this sort is currently lacking in the multi-timescale methods developed in this thesis, but could be incorporated by using one of the factors mentioned above to modulate the consolidation of memories at different timescales. For example, in the synaptic consolidation model, importance factors could be used to modulate the flow between adjacent hidden variables, or in the policy consolidation model, similar factors could be used to dynamically adjust the distillation constraints between adjacent hidden networks or between a hidden network and its own policy at the previous time step. In the MTR buffer, the importance of an experience by some measure could be used to determine the probability of its transition to the next sub-buffer in the cascade. It would be interesting to see if this approach could improve continual learning in any of the multi-timescale methods.

Spacing and consolidation In the original paper for the synaptic consolidation model adapted in Chapter 3 [BF16], it is noted that the frequency of plasticity events can affect how well a synaptic memory is consolidated, resulting in an 'optimal' temporal spacing between updates that is neither too large nor too small. In psychology, the existence of spacing effects in learning has long been known, with the discovery usually being attributed to Ebbinghaus, who noted that spacing study sessions is more effective for memory retention than cramming [Ebb13]. Spacing effects have since been heavily studied and have also given rise to several software applications for improving human memory retention [WG94, duo11]. Consequently, it would be interesting to see if spaced repetition could be used to enhance the behavioural memory retention of the synaptic consolidation agent in Chapter 3, which could be done, for example, by devising a mechanism that controls how often memories from the buffer are replayed to the network. This concept might also yield a potential synergy between the MTR buffer and the synaptic consolidation model. Some psychological research has suggested that learning could be optimised by spacing repetitions with gradually increasing intervals [Mac32, Pim67]; part of the rationale here is that, since forgetting happens fastest soon after a memory is acquired, it is more important to repeat the information in the early stages of learning. Though the MTR method does not schedule the replay of memories, it might have a similar effect to this type of spaced repetition, since the power law distribution of memories it maintains emphasises the replay of recent memories over old ones.

Curriculum learning While the spacing effect relates to the impact of the frequency of repetition on recall and learning, the study of curriculum learning concerns the importance of the order in which items or tasks are presented for the speed and effectiveness of learning. Human education typically follows curricula; we learn simple things first, which allows us to learn related but more complex tasks more quickly by composing and transferring previously acquired knowledge to the harder problems. This has been shown to be an effective technique in animal learning, where it is often referred to as 'shaping' [Ski58], and it has also been studied in the context of training artificial neural networks [Elm93, BLCW09]. Since ANNs are known to suffer from catastrophic forgetting, however, this raises the question of to what degree curriculum learning techniques could be improved by methods that mitigate forgetting. An investigation into the effect of the ordering of tasks and changes to the distribution, which would be impacted by the relatedness of skills required at different stages in time, would be an interesting extension to the experiments conducted in this thesis.

Invariance and consolidation In Chapter 5, I combined the MTR buffer with the IRM method [ABGLP19] in order to encourage the agent to learn a policy that is invariant across time and is therefore robust across different environments. As a final thought, it is perhaps worth considering whether consolidation could also play a role in encouraging invariant representations. In the synaptic consolidation model, the longer a parameter remains at a particular value, the more consolidated it becomes at that value, in effect preserving the values of parameters that have been stable, or invariant, across time under the synaptic learning rule. In the experiments run in Chapter 3, I did not test whether in practice this leads to a policy that generalises better to unseen environments, but it would be interesting to investigate in future work. If not, it would be worth thinking about ways in which the IRM principle could be used to inform a new consolidation method at the synaptic or neuronal level to this end.

Bibliography

[AAC+19] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.

[ABE+17] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. arXiv preprint arXiv:1711.09601, 2017.

[ABE+18] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.

[ABGLP19] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

[ABT+19] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems, pages 11849–11860, 2019.

[AKT19] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.


[And00] John Robert Anderson. Learning and memory: An integrated approach. John Wiley & Sons Inc, 2000.

[ART19] Rahaf Aljundi, Marcus Rohrbach, and Tinne Tuytelaars. Selfless sequential learn- ing. In International Conference on Learning Representations, 2019.

[ASBB+18] Maruan Al-Shedivat, Trapit Bansal, Yura Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations, 2018.

[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[BCP+16] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

[BF16] M. K. Benna and S. Fusi. Computational principles of synaptic memory consol- idation. Nature Neuroscience, 19(12):1697–1706, 2016.

[BL73] T. V. P. Bliss and T. Lømo. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. The Journal of Physiology, 232(2):331–356, 1973.

[BLCW09] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.

[BMMH10] Ethan S Bromberg-Martin, Masayuki Matsumoto, and Okihide Hikosaka. Dopamine in motivational control: rewarding, aversive, and alerting. Neuron, 68(5):815–834, 2010.

[BNVB13] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[BPS+18] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. In International Conference on Learning Representations, 2018.

[BU59] Jean M Barnes and Benton J Underwood. “Fate” of first-list associations in transfer theory. Journal of experimental psychology, 58(2):97, 1959.

[CDAT18] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.

[CG87] G. A. Carpenter and S. Grossberg. Art 2: Self-organization of stable category recognition codes for analog input patterns. Applied optics, 26(23):4919–4930, 1987.

[clw16] Continual Learning and deep networks workshop. Advances in Neural Informa- tion Processing Systems, 2016.

[CRE+19] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.

[CRRE19] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In International Conference on Learning Representations, 2019.

[CZV+08] C. Clopath, L. Ziegler, E. Vasilaki, L. Büsing, and W. Gerstner. Tag-trigger-consolidation: a model of early and late long-term-potentiation and depression. PLoS Computational Biology, 4(12):e1000248, 2008.

[dBKTB15] T. de Bruin, J. Kober, K. Tuyls, and R. Babuška. The importance of experience replay database composition in deep reinforcement learning. In Deep Reinforcement Learning Workshop, Advances in Neural Information Processing Systems (NIPS), 2015.

[dBKTB16] T. de Bruin, J. Kober, K. Tuyls, and R. Babuška. Off-policy experience retention for deep actor-critic learning. In Deep Reinforcement Learning Workshop, Advances in Neural Information Processing Systems (NIPS), 2016.

[DHK+17] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI baselines. https://github.com/openai/baselines, 2017.

[duo11] Duolingo. http://www.duolingo.com/, 2011.

[DYXY07] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. Boosting for transfer learning. In Proceedings of the 24th international conference on Machine learning, pages 193–200, 2007.

[Ebb13] H Ebbinghaus. Memory: A contribution to experimental psychology (HA Ruger & CE Bussenius, trans.). New York, NY, USA, 1913.

[Elm93] Jeffrey L Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993.

[EPBE+98] Peter S Eriksson, Ekaterina Perfilieva, Thomas Björk-Eriksson, Ann-Marie Alborn, Claes Nordborg, Daniel A Peterson, and Fred H Gage. Neurogenesis in the adult human hippocampus. Nature medicine, 4(11):1313–1317, 1998.

[FAL17] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org, 2017.

[FBB+17] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

[FBD+18] Timo Flesch, Jan Balaguer, Ronald Dekker, Hamed Nili, and Christopher Summerfield. Comparing continual task learning in minds and machines. Proceedings of the National Academy of Sciences, 115(44):E10313–E10322, 2018.

[FDA05] Stefano Fusi, Patrick J Drew, and Larry F Abbott. Cascade models of synaptically stored memories. Neuron, 45(4):599–611, 2005.

[FG18] Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.

[FM97] Uwe Frey and Richard GM Morris. Synaptic tagging and long-term potentiation. Nature, 385(6616):533–536, 1997.

[FMP19] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.

[Fre91] Robert M French. Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In Proceedings of the 13th annual cognitive science society conference, volume 1, pages 173–178, 1991.

[FS+90] Uwe Frey, Helmut Schroeder, et al. Dopaminergic antagonists prevent long-term maintenance of posttetanic LTP in the CA1 region of rat hippocampal slices. Brain research, 522(1):69–75, 1990.

[FvHM18] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[FZS+16] Tommaso Furlanello, Jiaping Zhao, Andrew M Saxe, Laurent Itti, and Bosco S Tjan. Active long term memory networks. arXiv preprint arXiv:1606.02355, 2016.

[GB10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE, 2013.

[GMX+13] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[H+16] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[HCK19] Tyler L Hayes, Nathan D Cahill, and Christopher Kanan. Memory efficient experience replay for streaming learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 9769–9776. IEEE, 2019.

[HCS+17] P. Henderson, W.-D. Chang, F. Shkurti, J. Hansen, D. Meger, and G. Dudek. Benchmark environments for multitask learning in continuous domains. ICML Lifelong Learning: A Reinforcement Learning Approach Workshop, 2017.

[Heb49] Donald Olding Hebb. The organization of behavior: a neuropsychological theory. J. Wiley; Chapman & Hall, 1949.

[HP87] G. E. Hinton and D. C. Plaut. Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society, pages 177–186, 1987.

[HRE+18] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Stable Baselines. https://github.com/hill-a/stable-baselines, 2018.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[HS16] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.

[HTAL17] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR.org, 2017.

[Hus18] Ferenc Huszár. Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences, page 201717042, 2018.

[HVD15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[HZAL18] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor- critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870, 2018.

[HZH+18] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

[IC18] David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[JGS+19] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.

[JJJK16] Heechul Jung, Jeongwoo Ju, Minju Jung, and Junmo Kim. Less-forgetting learning in deep neural networks. arXiv preprint arXiv:1607.00122, 2016.

[JW19] Khurram Javed and Martha White. Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems, pages 1818–1828, 2019.

[KA17] Michael J Kahana and Mark Adler. Note on the power law of forgetting. bioRxiv, page 173765, 2017.

[KB14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[KFS+19] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11761–11771, 2019.

[KGL17] Nitin Kamra, Umang Gupta, and Yan Liu. Deep generative dual memory network for continual learning. arXiv preprint arXiv:1710.10368, 2017.

[KGP+18] Max Kochurov, Timur Garipov, Dmitry Podoprikhin, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Bayesian incremental learning for deep neural networks. arXiv preprint arXiv:1802.07329, 2018.

[KH+09] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[KHM16] Dharshan Kumaran, Demis Hassabis, and James L McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in cognitive sciences, 20(7):512–534, 2016.

[Kor90] Chris A Kortge. Episodic memory in connectionist networks. In Proceedings of the 12th Annual Conference of the Cognitive Science Society, pages 764–771. Erlbaum, 1990.

[KPR+17] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[KSC18] Christos Kaplanis, Murray Shanahan, and Claudia Clopath. Continual reinforcement learning with complex synapses. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2497–2506, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[KW13] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[LBD+90] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.

[LCB10] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.

[LH17] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[LHP+15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[Lin92] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.

[LKLW19] Vincent Liu, Raksha Kumaraswamy, Lei Le, and Martha White. The utility of sparse representations for control in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4384–4391, 2019.

[LP+17] David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

[Mac32] Cecil Alec Mace. The psychology of study. 1932.

[MC89] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.

[MCH+18] Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge, Andrew Carlson, Bhavana Dalvi, Matt Gardner, Bryan Kisiel, et al. Never-ending learning. Communications of the ACM, 61(5):103–115, 2018.

[MKS+15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[MMO95] J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.

[Mon82] George E Monahan. State of the art—a survey of partially observable Markov decision processes: theory, models, and algorithms. Management science, 28(1):1–16, 1982.

[MP43] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

[MP17] Marvin Minsky and Seymour A Papert. Perceptrons: An introduction to compu- tational geometry. MIT press, 2017.

[MRG+87] James L McClelland, David E Rumelhart, PDP Research Group, et al. Parallel distributed processing, volume 2. MIT Press, Cambridge, MA, 1987.

[MVK+16] Kieran Milan, Joel Veness, James Kirkpatrick, Michael Bowling, Anna Koop, and Demis Hassabis. The forget-me-not process. In Advances in Neural Information Processing Systems, pages 3702–3710, 2016.

[NK19] Guido Novati and Petros Koumoutsakos. Remember and forget for experience replay. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4851–4860, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

[NLBT18] Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018.

[NO17] David G Nagy and Gergő Orbán. Episodic memory for continual model learning. arXiv preprint arXiv:1712.01169, 2017.

[Ope] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/.

[PGK+18] Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning, 2018.

[Pim67] Paul Pimsleur. A memory schedule. The Modern Language Journal, 51(2):73–75, 1967.

[PJ92] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[PKP+19] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

[Pol64] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[PUUH01] Robi Polikar, Lalita Upda, Satish S Upda, and Vasant Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE transactions on systems, man, and cybernetics, part C (applications and reviews), 31(4):497–508, 2001.

[RAS+19] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, pages 348–358, 2019.

[Rat90] Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990.

[RBB18] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, pages 3738–3748, 2018.

[RCA+19] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, 2019.

[RCG+15] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

[RE13] P. Ruvolo and E. Eaton. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning, pages 507–515, 2013.

[RHW85] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

[Rin94] Mark Bishop Ring. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, Austin, TX, 1994.

[Rin05] Mark B Ring. Toward a formal framework for continual learning. In NIPS Workshop on Inductive Transfer, Whistler, Canada, 2005.

[RKSL17] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.

[Rob95] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.

[Ros58] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

[RRD+16] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[RVR+19] Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pages 7645–7655, 2019.

[RW96] David C Rubin and Amy E Wenzel. One hundred years of forgetting: A quantitative description of retention. Psychological review, 103(4):734, 1996.

[RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[RWLG17] Boya Ren, Hongzhi Wang, Jianzhong Li, and Hong Gao. Life-long learning based on dynamic combination model. Applied Soft Computing, 56:398–404, 2017.

[Sar18] Dipanjan Sarkar. A comprehensive hands-on guide to transfer learning with real-world applications in deep learning, 2018.

[SB98] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[Sch87] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

[Sch07] W. Schultz. Reward signals. Scholarpedia, 2(6):2184, 2007. revision #145291.

[SCL+18] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress and compress: A scalable framework for continual learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4528–4537, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.

[SHM+16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[Ski58] Burrhus F Skinner. Reinforcement today. American Psychologist, 13(3):94, 1958.

[SLKK17] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.

[SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.

[SMK+13] Rupesh K Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In Advances in neural information processing systems, pages 2310–2318, 2013.

[SML+15] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[SMSM00] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.

[SQAS15] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[SS10] Tom Schaul and Jürgen Schmidhuber. Metalearning. Scholarpedia, 5(6):4650, 2010.

[SSS+17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[SvHM+18] Tom Schaul, Hado van Hasselt, Joseph Modayil, Martha White, Adam White, Pierre-Luc Bacon, Jean Harb, Shibl Mourad, Marc Bellemare, and Doina Precup. The Barbados 2018 list of open issues in continual learning. arXiv preprint arXiv:1811.07004, 2018.

[SWD+17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[TBC+17] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.

[TH12] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.

[TM95] S. Thrun and T. M. Mitchell. Lifelong robot learning. Robotics and autonomous systems, 15(1-2):25–46, 1995.

[TSdGM+20] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, and Yee Whye Teh. Functional regularisation for continual learning with Gaussian processes. In International Conference on Learning Representations, 2020.

[TW13] M. Tsodyks and S. Wu. Short-term synaptic plasticity. Scholarpedia, 8(10):3153, 2013. revision #182521.

[Vap06] Vladimir Vapnik. Estimation of dependences based on empirical data. Springer Science & Business Media, 2006.

[VC82] Vladimir N Vapnik and A Ya Chervonenkis. Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability & Its Applications, 26(3):532–553, 1982.

[VHGS16] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, 2016.

[Vit85] Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

[Wat89] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, 1989.

[WD92] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279– 292, 1992.

[WE91] John T Wixted and Ebbe B Ebbesen. On the form of forgetting. Psychological science, 2(6):409–415, 1991.

[Wer74] P.J. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, 1974.

[WFYH03] Haixun Wang, Wei Fan, Philip S Yu, and Jiawei Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 226–235, 2003.

[WG94] PA Wozniak and Edward J Gorzelanczyk. Optimization of repetition spacing in the practice of learning. Acta neurobiologiae experimentalis, 54:59–59, 1994.

[Wil92] Ronald J Williams. Simple statistical gradient-following algorithms for connec- tionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

[WR19] Che Wang and Keith Ross. Boosting soft actor-critic: Emphasizing recent experience without forgetting the past. arXiv preprint arXiv:1906.04009, 2019.

[WS02] Gabriele Wulf and Charles H Shea. Principles derived from the study of simple skills do not generalize to complex skill learning. Psychonomic bulletin & review, 9(2):185–211, 2002.

[WTB20] Yeming Wen, Dustin Tran, and Jimmy Ba. BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. In International Conference on Learning Representations, 2020.

[WTN19] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

[XZ18] Ju Xu and Zhanxing Zhu. Reinforced continual learning. In Advances in Neural Information Processing Systems, pages 899–908, 2018.

[YS17] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pages 7103–7114, 2017.

[YYLH18] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, 2018.

[ZGHS18] Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Task agnostic continual learning using online variational Bayes. arXiv preprint arXiv:1803.10123, 2018.

[Zin03] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

[ZPG17] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995, 2017.

[ZR02] R. S. Zucker and W. G. Regehr. Short-term synaptic plasticity. Annual Review of Physiology, 64(1):355–405, 2002.

[ZS17] S. Zhang and R.S. Sutton. A deeper look at experience replay. In Deep Reinforcement Learning Symposium, NIPS 2017, 2017.

[ZZP+19] Linjing Zhang, Zongzhang Zhang, Zhiyuan Pan, Yingfeng Chen, Jiangcheng Zhu, Zhaorong Wang, Meng Wang, and Changjie Fan. A framework of dual replay buffer: Balancing forgetting and generalization in reinforcement learning. In Proceedings of the 2nd Workshop on Scaling Up Reinforcement Learning (SURL), International Joint Conference on Artificial Intelligence (IJCAI), 2019.