Reinforcement Learning in Persistent Environments: Representation Learning and Transfer
A dissertation presented by Diana Borsa to The Department of Computer Science, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Machine Learning.

University College London
London, UK
January 2020

© 2020 - Diana Borsa. All rights reserved.

In memory of my dad, Prof. Vasile Borsa (1963-2014)

Declaration

I, Diana Borsa, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. – Diana Borsa

Impact Statement

In a nutshell, this body of work has focused on the problem of a learning agent situated in an environment, trying to learn behaviour policies that maximise different reward signals, corresponding to different tasks in this environment. We would argue that this is a common scenario for many decision-making systems in the real world. Reinforcement learning, as a modelling paradigm, has already been shown to successfully tackle complex decision-making problems. Our contributions here stem from two key observations: a) in general, we would like our agents to achieve more than one goal in a given environment; b) many of the methods underlying even the single-task setting involve building, incrementally, multiple prediction problems that enable improvements in the agent's behaviour. Thus, an agent's journey to an optimal value function can naturally be cast as a multitask prediction problem. The only difference between a) and b) is whether or not one varies both the policy and the reward structure. In this work, we have focused on the more general scenario where both of these dimensions vary. In this setting we have shown the benefits of treating the above as a multitask problem and learning common representations to enable transfer between the many prediction problems encountered. Given the generality of the setup and the organic nature of the assumptions imposed, we believe the scope and applicability of this work to be quite broad.

Moreover, the first part of this work has focused on an offline batch scenario, similar to ones encountered in real application domains, where the data was generated a priori by a policy we might not have access to. Setups here might include recommender systems under multiple prediction metrics (engagement, retention, expenditure); medical data collected under different protocols and different policies; and energy management systems under different manual policies. Treating these problems and data sources independently can be very expensive. In such restrictive settings, the encouraging results obtained by the algorithms in Chapters 3-4 suggest that modelling these problems jointly can significantly improve the quality of the resulting policies and thus reduce the sample complexity needed to achieve competitive performance.

The second part of this thesis investigates a different kind of representation, particularly suitable for transferring knowledge within a more restricted set of tasks. This leads to a very effective type of generalisation in the span of tasks considered. Outside its potential applications to a multitask RL agent, this line of research has already inspired studies in other scientific fields.
In particular, in cognitive science and neuroscience, our colleagues have argued that similar representations are formed in the brain, and they have found evidence for this type of generalisation, via policy re-evaluation, being at the core of transfer behaviours in several species (including humans). Nevertheless, the above are really just scratching the surface of the potential benefits of such representation learning in RL. Much more research is needed, and we do hope that some of the work presented here will inspire other researchers, across communities, to investigate these paradigms further and improve on current solutions.

Acknowledgments

This whole process has been a long, formative journey. It is almost intimidating looking back, all the way back into this chapter of my life; acknowledging all the turns, the bumps, the occasional wins, the ups and downs, the deep valleys, the scarce summits. That is truly an onerous credit-assignment problem and this is merely an attempt to do it justice.

First and foremost, I am deeply grateful to have had two wonderful supervisors, and most of all mentors, Thore Graepel and John Shawe-Taylor. Both of them have poured countless hours into my continuous growth as a researcher and beyond. I will forever be grateful for everything I have learnt from you, all the trust and encouragement I was given, the freedom to pursue my ideas – even the bad ones, the guidance to shape my understanding, the many opportunities you put in front of me. You have been absolutely instrumental throughout this journey and the reason I can call myself a researcher today. Thore, your curiosity and fascination with RL led me to reconsider this line of research in a very different light. Thank you for bringing me to this paradigm, this way to see and model the world, to a complex problem that I am still engagingly pursuing. John, you are the reason I want to stay in academia. The impact I have witnessed over multiple generations cannot be overstated. Your constant positivity and open-mindedness when approaching a problem, paired with critical thinking and mathematical rigour, are something I aspire to achieve.

Secondly, I was privileged enough to split my time, in the first few years, between UCL and Microsoft Research Cambridge. In many ways, MSRC was where I 'grew up' as a researcher and there are many people who contributed to this endeavour. I would like to extend a special thank you to Andy Gordon who, together with Thore, hosted me for my first project there and stayed up, late into the night, when I was submitting my first paper. Many thanks to Katja and Yoram for many fruitful discussions, for their guidance, their friendship and for sharing their PhD survival wisdom. To the many fellow interns and friends I made during those times: Yoad, Yair, Rati, Irina, Yani, Mat and Roberto. To the many engineers that supported my work, Lucas, Tim, Dave and Matthew Johnson, who all took the time to educate a mathematician in software development. Lastly, but certainly not least, I would like to thank Chris Bishop for taking the time and interest to discuss my journey and professional development, even when it diverged away from the lab.

I would like to thank my colleagues and officemates in the CS department at UCL: Andrew, Dimitrios, Verena, Alex, Tom and Ronnie. Especially in those last years at UCL, your company, fellowship and humour made those long days at the office particularly enjoyable.
Thank you also to the faculty in CSML for their teaching and occasional guidance: Iasonas Kokkinos, Mark Herbster, David Barber, Arthur Gretton and Sebastian Riedel.

The last stop in this journey was DeepMind and the i-Team. I am extremely grateful for the warm welcome the i-Team has given me from day one. Thank you to Rémi Munos and Olivier Pietquin for bringing me onto this team, for all your guidance, your trust and for always making me feel like a valuable member of this family. Rémi, I especially enjoyed our long sessions on the whiteboard exploring new operators and trying to formalise new continual learning paradigms. Outside my team, a very special thank you goes to one of my closest collaborators and kindred minds, André. It has always been a pleasure brainstorming, collaborating and discussing new ideas with you. To all my other collaborators, Hado, Tom, Anna, Doina, Dave, Will, Georg, Bilal, Nicolas and Julien, thank you for always keeping me grounded and true to the science. You all make me a better researcher!

Thank you to Philip Treleaven, Yonita Carter and the UK Centre for Doctoral Training in Financial Computing and Analytics for their funding. Thank you also to JJ and Sarah for their commitment to making our experience at UCL more enjoyable and helping us out with the omnipresent, copious amounts of paperwork required. Thank you also to all other organisations that supported my travels to conferences, academic visits and summer schools. All of these have been highly enriching experiences.

Finally, I would like to thank my parents for all the sacrifices they made to support my academic journey, for all the training and knowledge I received from them, for always instilling a sense of wonder and curiosity about all aspects of the world. Also a special thank you to Iunia for always pushing me to do better, to achieve more, to be a better example.

Thesis advisor: Thore Graepel

Abstract

Reinforcement learning (RL) provides a general framework for modelling and reasoning about agents capable of sequential decision making, with the goal of maximising a reward signal. In this work, we focus on the study of situated agents designed to learn autonomously through direct interaction with their environment, under limited or sparse feedback. We consider an agent in a persistent environment. The dynamics of this 'world' do not change over time, much like the laws of physics, and the agent needs to learn to master a potentially vast set of tasks in this environment. To efficiently tackle learning in multiple tasks, with the ultimate goal of scaling to a life-long learning agent, we turn our attention to transfer learning. The main insight behind this paradigm is that generalisation may occur not only within tasks, but also across them. The objective of transfer in RL is to accelerate learning by building and reusing knowledge obtained in previously encountered tasks.
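As a point of reference for the setting the abstract describes, the following is a minimal sketch of the standard formalism for a persistent environment with multiple tasks: a family of Markov decision processes sharing states, actions and dynamics, and differing only in their reward functions. The notation below (task index $w$, reward $r_w$, value $Q^{\pi}_{w}$) is illustrative and not necessarily the notation adopted in the thesis.

% A family of tasks over one persistent environment: shared states S, actions A
% and transition dynamics p(s' | s, a); only the reward r_w varies with the task w.
\[
  \mathcal{M}_w \;=\; \langle \mathcal{S}, \mathcal{A}, p, r_w, \gamma \rangle ,
  \qquad w \in \mathcal{W}.
\]
% Each (policy, task) pair induces a prediction problem: the action-value function
\[
  Q^{\pi}_{w}(s, a) \;=\;
  \mathbb{E}^{\pi}\!\left[ \sum_{t \ge 0} \gamma^{t}\, r_w(S_t, A_t)
  \;\middle|\; S_0 = s,\ A_0 = a \right].
\]
% Transfer in this setting means reusing knowledge (e.g. representations shared
% across the Q^{\pi}_{w}) so that a new task w' need not be learned from scratch.

Under this reading, both observations a) and b) from the impact statement correspond to varying $w$ and/or $\pi$ in $Q^{\pi}_{w}$, which is what motivates treating learning as a multitask prediction problem with shared representations.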