Learning in a State of Confusion: Employing active perception and reinforcement learning in partially observable worlds

Paul Anthony Crook

Doctor of Philosophy
Institute of Perception, Action and Behaviour
School of Informatics
University of Edinburgh
2006

Abstract

In applying reinforcement learning to agents acting in the real world we are often faced with tasks that are non-Markovian in nature. Much work has been done using state estimation algorithms to try to uncover Markovian models of tasks in order to allow the learning of optimal solutions using reinforcement learning. Unfortunately these algorithms, which attempt to simultaneously learn a Markov model of the world and how to act, have proved very brittle. Our focus differs. In considering embodied, embedded and situated agents we have a preference for simple learning algorithms which reliably learn satisficing policies. The learning algorithms we consider do not try to uncover the underlying Markovian states; instead they aim to learn successful deterministic reactive policies, such that agents' actions are based directly upon the observations provided by their sensors.

Existing results have shown that such reactive policies can be arbitrarily worse than a policy that has access to the underlying Markov process, and that in some cases no satisficing reactive policy can exist. Our first contribution is to show that providing agents with alternative actions and viewpoints on the task, through the addition of active perception, can provide a practical solution in such circumstances. We demonstrate empirically that: (i) adding arbitrary active perception actions to agents which can only learn deterministic reactive policies can allow the learning of satisficing policies where none were originally possible; (ii) active perception actions allow the learning of better satisficing policies than those that existed previously; and (iii) our approach converges more reliably to satisficing solutions than existing state estimation algorithms such as U-Tree and the Lion Algorithm.

Our other contributions focus on issues which affect the reliability with which deterministic reactive satisficing policies can be learnt in non-Markovian environments. We show that greedy action selection may be a necessary condition for the existence of stable deterministic reactive policies on partially observable Markov decision processes (POMDPs). We also set out the concept of Consistent Exploration: the idea of estimating state-action values by acting as though the policy has been changed to incorporate the action being explored. We demonstrate that this concept can be used to develop better algorithms for learning reactive policies for POMDPs by presenting a new reinforcement learning algorithm, the Consistent Exploration Q(λ) algorithm (CEQ(λ)). We demonstrate on a significant number of problems that CEQ(λ) is more reliable at learning satisficing solutions than SARSA(λ), the algorithm currently regarded as the best for learning deterministic reactive policies.
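To make the abstract's central objects concrete, the sketch below illustrates a deterministic reactive learner of the kind described above: tabular SARSA(λ), the baseline against which CEQ(λ) is compared, learning observation-action values directly, with the agent's action set extended by an active-perception action. This is a minimal illustrative sketch only; the environment interface, the "peek" action and all constants are assumptions made for this example, not the thesis's actual algorithms or test problems.

import random
from collections import defaultdict

ALPHA, GAMMA, LAMBDA, EPSILON = 0.1, 0.9, 0.9, 0.1

PHYSICAL_ACTIONS = ["left", "right"]
PERCEPTUAL_ACTIONS = ["peek"]            # hypothetical active-perception action
ACTIONS = PHYSICAL_ACTIONS + PERCEPTUAL_ACTIONS

Q = defaultdict(float)                   # Q[(observation, action)]

def greedy(obs):
    # Deterministic reactive policy: map the current observation
    # directly to the currently highest-valued action.
    return max(ACTIONS, key=lambda a: Q[(obs, a)])

def epsilon_greedy(obs):
    return random.choice(ACTIONS) if random.random() < EPSILON else greedy(obs)

def sarsa_lambda_episode(env):
    # One episode of tabular SARSA(lambda) with replacing eligibility traces,
    # learning over raw observations rather than underlying Markov states.
    # `env` is an assumed interface: reset() returns an observation,
    # step(action) returns (observation, reward, done).
    traces = defaultdict(float)
    obs = env.reset()
    action = epsilon_greedy(obs)
    done = False
    while not done:
        next_obs, reward, done = env.step(action)
        next_action = epsilon_greedy(next_obs)
        target = reward if done else reward + GAMMA * Q[(next_obs, next_action)]
        delta = target - Q[(obs, action)]
        traces[(obs, action)] = 1.0      # replacing trace
        for key in list(traces):
            Q[key] += ALPHA * delta * traces[key]
            traces[key] *= GAMMA * LAMBDA
        obs, action = next_obs, next_action

Consistent Exploration, as set out in the abstract, differs from this baseline in how exploratory actions are valued: the agent estimates observation-action values as though its policy had been changed to incorporate the action being explored, rather than treating the exploratory step as a one-off deviation.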
Dedication

This thesis is dedicated to “Freddy” Flintoff, Simon Jones, Ashley Giles, Matthew Hoggard, Steve Harmison, Andrew Strauss, Kevin Pietersen, Michael Vaughan, Marcus Trescothick, Geraint Jones, Ian Bell, Paul Collingwood, Gary Pratt, Shane Warne, Brett Lee, Glenn McGrath, Shaun Tait, Justin Langer, Matthew Hayden, Adam Gilchrist, Ricky Ponting, Michael Clarke, Simon Katich, Damien Martyn, Jason Gillespie, Billy Bowden, Rudi Koertzen and everyone else involved in the 2005 Ashes series, for providing entertainment and nail-biting drama over the summer whilst I was writing this thesis. To TMS for relaying all the action in their intimate and whimsical style, and who will always conjure up images of idealised English summers in my head. To the idiosyncratic game that is cricket, and to England for winning the Ashes.

Acknowledgements

Many thanks to my supervisors: Gillian Hayes for all her ideas and input, and Bob Fisher for his constructive criticism and advice. Thanks to Jay Bradley, Aaron Crane, Seb Mhatre, Fiona McNeill, and Daniel Winterstein, who read various parts of this thesis. Thanks to Peter Ottery for writing the first version of taskFarmer, a program which was instrumental in spreading the computational load of many experiments in this thesis. Special thanks to my study buddy Alison Pease, who not only read parts of this thesis, but had to put up with my whinging whilst I was writing it. Thanks also to my examiners Sethu Vijayakumar and Eric Postma, whose comments and criticisms helped to shape the final version of this thesis. Finally, my biggest debt of gratitude is to all my friends and flatmates, without whom this period of research and study in Edinburgh would not have been half as much fun.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Paul Anthony Crook)

Publications

The material contained in chapter 4 of this thesis has been previously published in conference and workshop proceedings:

• The Sixth European Workshop on Reinforcement Learning (EWRL-6), Nancy, France, September 2003 [Crook and Hayes, 2003a].
• 14th European Conference on Machine Learning (ECML 2003), Cavtat-Dubrovnik, Croatia, September 2003. Published in Springer Verlag Lecture Notes in Artificial Intelligence, volume 2837, pages 72–83 [Crook and Hayes, 2003b].
• Towards Intelligent Mobile Robots (TIMR) 2003, UWE, Bristol, August 2003 [Crook and Hayes, 2003c].

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Aims
    1.2.1 First hypothesis
    1.2.2 Second hypothesis
  1.3 Contributions
    1.3.1 Chapter 2
    1.3.2 Chapter 4
    1.3.3 Chapter 5
    1.3.4 Chapter 7
    1.3.5 Chapter 8
    1.3.6 Appendix A
    1.3.7 Reinforcement learning test platform
  1.4 Organisation of this Thesis
    1.4.1 Main body of thesis
    1.4.2 Appendices

2 Background
  2.1 Active Perception
  2.2 Reinforcement Learning
    2.2.1 Policy types
    2.2.2 Markov decision processes (MDPs)
    2.2.3 Partially observable Markov decision processes (POMDPs)
    2.2.4 Learning with knowledge of environment dynamics
    2.2.5 Learning internal models
  2.3 Modifiable, Fixed (Fixed-SE) and Basic Approaches
    2.3.1 Why define these approaches?
  2.4 Chapter Summary

3 Literature Review
  3.1 Fixed-SE Approach
    3.1.1 Whitehead and Ballard
    3.1.2 Chapman and Kaelbling
    3.1.3 Ming Tan
    3.1.4 Whitehead and Lin (review)
    3.1.5 Hochreiter, Bakker and Schmidhuber
    3.1.6 Loch and Singh
    3.1.7 Chrisman
    3.1.8 McCallum
    3.1.9 Predictive state representations
  3.2 Basic Approach
    3.2.1 Littman, Loch and Singh
    3.2.2 Singh, Jaakkola and Jordan
    3.2.3 Pendrith and McGarity
    3.2.4 Perkins
    3.2.5 Perkins and Pendrith
    3.2.6 Baird
  3.3 Modifiable Approaches
    3.3.1 Active perception agents
    3.3.2 Internal memory agents
  3.4 Chapter Summary
    3.4.1 Fixed-SE approaches
    3.4.2 Basic approaches
    3.4.3 Modifiable approaches
    3.4.4 Summary table
    3.4.5 Issues arising from literature review

4 Learning with Active Perception
  4.1 Agent with Limited Fixed Perception
    4.1.1 Experimental setup
    4.1.2 Results
    4.1.3 Discussion
    4.1.4 Summary
  4.2 Agent with Access To An Oracle
    4.2.1 Experimental setup
    4.2.2 Results
    4.2.3 Discussion
  4.3 Agent with Active Perception
    4.3.1 Experimental setup
    4.3.2 Results
    4.3.3 Discussion
  4.4 McCallum's M-maze
  4.5 Limitations on Learning of Physically Optimal Policies
  4.6 Chapter Summary

5 Effects of Action Selection
  5.1 Dependency Between Observation-Action Values and Exploration
    5.1.1 Perkins and Pendrith's simple POMDP example
    5.1.2 Policy stability
    5.1.3 Analogy with simulated annealing