
Developing a predictive approach to knowledge

by

Adam White

A thesis submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Department of Computing Science

University of Alberta

© Adam White, 2015

Abstract

Understanding how an artificial agent may represent, acquire, update, and use large amounts of knowledge has long been an important research challenge in artificial intelligence. The quantity of knowledge, or knowing a lot, may be nicely thought of as making and updating many predictions about many different courses of action. This predictive approach to knowledge ensures the knowledge is grounded in and learned from low-level data generated by an autonomous agent interacting with the world. Because predictive knowledge can be maintained without human intervention, its acquisition can potentially scale with available data and computing resources. The idea that knowledge might be expressed as prediction has been explored by Cunningham (1972), Becker (1973), Drescher (1990), Sutton and Tanner (2005), Rafols (2006), and Sutton (2009, 2012). Other uses of predictions include representing state with predictions (Littman, Sutton & Singh, 2002; Boots et al., 2010) and modeling partially observable domains (Talvitie & Singh, 2011). Unfortunately, technical challenges related to numerical instability, divergence under off-policy sampling, and computational complexity have limited the applicability and scalability of predictive knowledge acquisition in practice. This thesis explores a new approach to representing and acquiring predictive knowledge on a robot. The key idea is that value functions, from reinforcement learning, can be used to represent policy-contingent declarative and goal-oriented predictive knowledge. We use recently developed gradient-TD methods that are compatible with off-policy learning and function approximation to explore the practicality of making and updating many predictions in parallel, from continuous inputs, while the agent interacts with the world on a robot. The work described here includes both empirical demonstrations of the effectiveness of our new approach and new algorithmic contributions useful for scaling prediction learning. We demonstrate that our value functions are practically learnable and can encode a variety of knowledge with several experiments—including a demonstration of the psychological phenomenon of nexting, learning predictions with refined termination conditions, learning policy-contingent predictions from off-policy samples, and learning procedural goal-directed knowledge—all on two different robot platforms. Our results demonstrate the potential scalability of our approach: making and updating thousands of predictions from hundreds of thousands of multi-dimensional data samples, in realtime and on a robot—beyond the scalability of related predictive approaches. We also introduce a new online estimate of off-policy learning progress, and demonstrate its usefulness in tracking the performance of thousands of predictions about hundreds of distinct policies. Finally, we conduct a novel empirical investigation of one of our main learning algorithms, GTD(λ), revealing several new insights of particular relevance to predictive knowledge acquisition. All told, the work described here significantly develops the predictive approach to knowledge.

Acknowledgements

There are many people who helped me along the long road to finishing this document. First and foremost, my supervisor Rich Sutton. He is of course a visionary, a careful scientist, a leader of my field of study, and someone who encourages his students to work on big and bold ideas—to not be afraid of being ahead of the field. He would always generate new and unexpected ideas and look at a problem in a unique way to help me get unstuck. Rich taught me two important lessons as a student which I think had a large role in my success. The first lesson was to pay attention to and love getting the tiny details right. The second was to never be afraid of a new problem or obstacle; many things can be solved and we should be excited about the prospect of such a pursuit. I also owe many thanks to Joseph Modayil. He joined our research group when I first started working with off-policy learning and robots, and his vision, ideas, wisdom, and guidance were essential in helping me become a useful scientist. Joseph was my co-supervisor in every way but official title. I thank the postdocs, Thomas Degris, Patrick Pilarski, Hado van Hasselt, and Harm van Seijen. The lively research meetings, long afternoon coffees, dinners, and board game nights were so productive both professionally and personally. Finally, I thank the members of my examining committee, Pierre-Yves Oudeyer, Marek Reformat, Pierre Boulanger, and Michael Bowling, for their intriguing questions and helpful suggestions for improving both the content of my work and the presentation of my ideas.

Table of Contents

1 Introduction
1.1 Objective
1.2 Approach
1.3 Contributions
1.4 Thesis layout
1.5 Summary

2 Background
2.1 Reinforcement learning
2.2 Value functions
2.3 Function approximation
2.4 Estimating the value function
2.5 Off-policy learning
2.6 Computing the MSPBE
2.7 Recent algorithmic advances
2.8 Learning policies
2.9 Options
2.10 Summary

3 Sensorimotor data streams and robots
3.1 Learning about sensorimotor data
3.2 The iRobot Create
3.3 The Critterbot
3.4 Critterbot sensorimotor data
3.5 Summary

4 General Value Functions
4.1 The setting
4.2 Cumulants
4.3 Termination
4.4 General Value functions
4.5 Learning GVFs
4.6 Independence of predictive span
4.7 A Horde of Demons
4.8 Related approaches
4.9 Summary

5 Nexting
5.1 Predicting what will happen next
5.2 Nexting as multiple value functions
5.3 A scaling experiment
5.4 Accuracy of learned predictions
5.5 Unmodeled situations
5.6 Linear and quadratic computation
5.7 Other ways to encode nexting predictions
5.8 Distinctiveness of nexting
5.9 Summary

6 Experiments with GVFs on robots
6.1 Experiments with more complex terminations
6.2 Off-policy prediction learning
6.2.1 Experiments on the Critterbot
6.2.2 Experiments on the Create
6.3 Off-policy control learning
6.4 Other demonstrations of GVF learning
6.5 Summary

7 Experiments with gradient-TD learning
7.1 Experiments on Markov chains
7.1.1 Problem
7.1.2 Learning h
7.1.3 Tuning GTD(λ) for similarity
7.1.4 MSPBE minimization and h
7.1.5 Overall conclusions
7.2 Experiments on Baird's counterexample
7.2.1 Problem
7.2.2 The role of h
7.2.3 Similarity measures of h
7.3 Related Work
7.4 Summary

8 Estimating off-policy progress
8.1 Measuring progress on a robot
8.2 A new proposal
8.3 Experiments on the Markov chain
8.3.1 Comparing RUPEE and the RMSPBE
8.3.2 A non-stationary domain
8.4 Experiments on the Critterbot
8.4.1 Prediction accuracy of many demons
8.4.2 Comparing RUPEE and prediction accuracy
8.4.3 Many demons and many target policies
8.5 Summary

9 Adapting the behavior policy
9.1 Unexpected demon error
9.2 Adapt the behavior policy of a robot
9.2.1 Problem
9.2.2 Experiment
9.2.3 Results and conclusions
9.3 Discussion
9.4 Summary

10 Perspectives and Future Work
10.1 General Value Functions
10.2 Online off-policy progress estimation
10.3 Empirical study of GTD(λ)
10.4 Limitations and future work
10.4.1 Comparing predictive representations of knowledge
10.4.2 Predictive features and a Horde of demons
10.4.3 A theoretical analysis of RUPEE
10.4.4 Mitigating variance in off-policy learning
10.4.5 Putting it all together
10.5 Summary

Bibliography

Appendix
A A new hybrid-TD algorithm
B Algorithms for off-policy GVF learning
C Intrinsically motivated reinforcement learning
D The parallel scalability of Horde

List of Tables

5.1 Summary of the tile-coding strategy used to produce feature vectors from sensorimotor data. For each sensor of a given type, its tilings were either 1-dimensional or 2-dimensional, with the given number of intervals. Only the first four of the robot's eight thermal sensors were included in the tile coding due to a coding error.

7.1 The values for α and αh found to minimize RMSPBE on average, in each instance of the chain problem. Each instance is defined by the combination of target policy and behavior policy. The probability of taking the right action for the target policy is denoted by pπ, and likewise for the behavior policy: pµ. See text for a description of the experiment.
7.2 Different ways to measure the similarity of the secondary weights of GTD(λ) to their optimal values given by the parameters of the MDP.
7.3 The values for α and αh found to minimize total-difference on average, in each instance of the chain problem.

1 A summary of the different internal and external reward functions used in several intrinsically motivated reinforcement learning systems. We use ri ∈ R to denote the internal reward, re ∈ R to denote the external reward, and r ∈ R to denote total reward. Input data, either sensorimotor data or observations, are denoted with o ∈ R. Models, either one-step forward models or multi-step option models, are denoted by M. The δt corresponds generically to an instantaneous error, such as TD error or squared one-step prediction error. Finally, v(s) and q(s, a) refer to conventional state and state-action value functions. The table continues in Table 2.
2 A summary of the different internal and external reward functions used in several intrinsically motivated reinforcement learning systems (continued from Table 1).
3 Number of parallel cores versus time to perform a single update of 3000 off-policy demons, the corresponding speedup, and the theoretical speedup.
4 Number of demons versus runtime on a machine with eight parallel cores.

List of Figures

2.1 A reinforcement learning agent's interaction with an unknown environment.
2.2 This is an example of how tile coding can be used to map continuous sensor data into a binary feature vector. On the left we see a single tiling of the continuous 2D space corresponding to the output of two distance sensors. The space was tiled into four intervals in each dimension, for 16 tiles overall. On the right we see all four tilings, each offset by a different negative amount such that they were equally spaced, with the first tiling starting at the lower left of the sensor space (as shown on the left) and the last tiling ending at the upper right of the space. A sensor reading input to the tile coder is a point in the space, like that shown by the white dot. The output of tile coding is the indices of the four tiles that contain the point, as shown on the right. These tiles are said to be active, and their corresponding components take on the value one, while all the non-active tiles correspond to feature components with the value zero. Note how the four tilings provide a dense grid of lines, each a distinction that can be made between input points, yet the four active tiles together span a substantial portion of the sensor space. In this way, multiple tilings provide a binary representation that enables both fine resolution and broad generalization.
3.1 The top view of an iRobot Create. This small wheelchair-drive robot can rotate and move forward and backward at a maximum speed of 0.5 meters per second. The front left and right bump sensors are marked with blue on the figure, and report binary values. An additional virtual sensor provides a binary indication if both bump sensors are activated. The side-facing infrared (IR) sensor reports nearby obstacles (approximately two inch range) as an integer from zero to 255. The IR beacon sensor reports integer values from zero to 255, and is used to detect IR emittance from the robot's charger and button presses on the Create's remote control.
3.2 A view of the underside of the iRobot Create. Two powered wheels and a rotating non-powered slide wheel comprise the wheelchair drive system. The drive velocity, measured in wheel rotations, and rotation direction of each of the two powered wheels are reported by the robot. The four downward-facing IR cliff sensors are located along the front bumper of the robot and report integer values between zero and 255.
3.3 Side and front views of the Critterbot. The labels indicate the locations of several IR distance sensors, the charger connection points, two of the robot's three batteries, a motor drive housing, the sensor board responsible for polling the sensors, the wireless antenna, and the ring of LED lights on top of the robot used for basic communication. The Critterbot is composed of three plates, each of which supports a computing board. The top plate contains an ARM processor and the light, thermal, acceleration, magnetic, and gyroscope sensors. The middle plate contains the IR distance sensors and a Linux computer. The bottom plate contains the three motor housings, three motor controllers, and three batteries. An external shell (not shown) can be attached to prevent piercing the internals of the robot.

3.4 Top and bottom views of the Critterbot. In the left figure we can see how the IR distance, light, thermal, and IR light sensors are arranged. The right figure shows the robot's wheel layout, and the frame of reference for robot control. Three wheels are offset by 120 degrees and each wheel contains many rollers. One wheel can act as a slider while the other two wheels rotate for translational motion. Commands can be sent using one of three modes. In wheel velocity control mode, the desired velocity of each wheel is sent to the motor controllers and achieved, if possible, by PID control. In robot velocity or X-Y-θ mode, the command is specified as a forward velocity (+/-) in the X direction, sideways velocity in the Y direction, and rotational velocity or θ. In voltage mode, each dimension of the command specifies a voltage sent to each wheel. Voltage mode, aside from overheating control, is the only mode that does not use an intermediate controller, and thus is the lowest level of control possible on the Critterbot. The bottom-most figure shows the motor controller board, the motors, and the wheels in detail.
3.5 These figures provide a sense of the distinctiveness of the Critterbot's sensor readings in a variety of orientations and locations in the pen. The Critterbot was photographed once per second, on one day, for over an hour while moving about the pen. Using the batch of photos and the corresponding sensor log, we overlay the pictures of the robot corresponding to different sensor ranges. The images above show six such compositions. (a) The front ambient light sensor reading greater than 200 corresponds to the corner of the pen, facing toward the wall: the brightest location in the pen. (b) A low reading on the magnetic x-axis sensor was recorded in the middle of the pen. (c) The highest infrared light was recorded when the robot faced the dock. (d) The front IR distance sensor was near its maximum when the robot was pressed against the wall. (e) Large negative accelerations occurred when the robot impacted the wall with its tail during a backward motion. (f) A compositional condition where the front infrared-distance reading was high and the magnetic x-axis reading was low corresponds to facing the wall in two distinct locations in the pen: an intersection of image (b) and image (d).
3.6 A plot of the sensorimotor data stream produced by the Critterbot. Fifty sensor readings are plotted over several seconds, sampled every 100 ms. Several sensor readings, such as the thermal readings, are well off the scale.
3.7 A summary of one year's worth of ambient light readings from the Critterbot. The heat map shows the light intensity with time of day on the x-axis and month of year on the y-axis, displaying one reading every ten minutes. Bright orange indicates high ambient light, and dark blue indicates low light. White indicates no data was available, typically when the robot was broken and powered down. Notice the seasonal change in the light intensity over the months. There is a faint light reading during most nights, corresponding to the LED lights flashing during charging.

3.8 This figure shows variations in sensor readings from the Critterbot during persistent movements in its pen. The lefthand side of each subfigure provides a cartoon illustration of what the robot was doing, and the righthand side plots several sensor readings that occur over the course of the movement. (a) The robot executed a southward translation, with the front of the robot facing the north wall of the pen. Correspondingly, the front-facing IR distance reading decreased while the back-facing, tail IR reading increased. (b) A counter-clockwise rotation near the east wall. The robot initially contacted the wall, reducing the rotational velocity (which pushes the robot westward), and then the robot freely rotated at a constant velocity completing a full rotation. (c) A westward translation while facing the north wall. The front and rear IR distance readings remained within a bounded range. The right-side IR reading dropped slowly, while the left-side IR reading slowly increased. (d) A northward translation, followed by a wall impact which caused a negative spike in acceleration and then near zero acceleration. After impact, the robot continued pushing against the wall causing a slow counter-clockwise rotation and a small increase in the inside-tail IR distance reading.
3.9 This figure describes how the action selection mechanism can affect the sensor patterns observed on the Critterbot. Each row above corresponds to a different policy, including the one-second duration random actions policy, the extended duration random actions policy (selecting the same action for 50 consecutive time steps), the hand-coded wall following policy, and a random walk in motor-voltage space via Brownian motion. Each column corresponds to one of three different sensors of interest, including ambient light (all four), magnetic x-axis, and rotational velocity. All four policies selected actions every 100 ms. The wall following policy yielded roughly cyclic patterns, while the data from the one-second duration random action policy did not yield obvious patterns. The wall following policy rotation data exhibited many small adjustments to keep the robot parallel to the wall, and periodic large changes corresponding to each corner of the pen. In summary, different methods for selecting actions can produce noticeably different sensorimotor data streams on the Critterbot.
4.1 The learning setting: the agent passively observes an unending stream of feature vectors and discrete actions.
4.2 Several GVFs can be specified for a robot exploring its world. The red text specifies the cumulant signal, the green text specifies the termination signal, and the blue text specifies the target policy for each GVF. In this example, the behavior policy is random, selecting from a discrete set of actions with equal probability. We can specify several GVFs and approximate each in parallel from a single interaction stream produced by one robot.
4.3 A Horde of demons. This diagram shows a possible arrangement of multiple prediction and control demons, each updating their approximate GVFs from a shared feature vector. The feature vector in this example is produced by a sparse recoding of input sensory data (such as tile coding). Some of the predictions are fed into the sparse re-coder, enabling predictions to participate in the generation of the next feature vector.
5.1 Examples of the variance of sensorimotor data at different time scales on the robot: (a) acceleration varying over tenths of a second, (b) motor current varying over fractions of a second, (c) infrared distance varying over seconds, and (d) ambient light varying over tens of seconds. The ranges of the sensorimotor data vary across the different sensor types.
5.2 An illustration of the wall-following behavior that generated the data. The circuits around the pen involved substantial random variation, but almost always included passing the bright light on the lower-left side.

5.3 Predictions of the Light3 cumulant at the eight-second time scale. The upper graph shows the Light3 sensor data spiking and saturating on three circuits around the pen and the corresponding target (computed afterwards from the future cumulants). Note that the target shows the signature of nexting—a substantial increase prior to the spikes in cumulant. The lower graph shows the same target compared to the prediction of the TD(λ) algorithm and the prediction of the best static weight vector. These feature-based predictions are more variable, but substantially track the target.
5.4 Average of Light3 predictions (like those in the lower portion of Figure 5.3) over 100 circuits around the pen and aligned at the onset of Light3 saturation.
5.5 Learning curves for eight-second Light3 predictions made by various algorithms over the full data set. Each point is the RMSE of the prediction of the algorithm up to that time. Most algorithms use only the data available up to that time, but the best-static-w and best-constant algorithms use knowledge of the whole data set. The errors of all algorithms increased at about 130 and 150 minutes because the motors overheated and shut down at those times while the robot was passing near the light, causing an unusual pattern in the sensorimotor data. In spite of the unusual events, the RMSE of TD(λ) still approached that of the best static weight vector. See text for the other algorithms.
5.6 Learning curves for the 212 predictions whose cumulant corresponds to each sensor. The median and several representative learning curves are shown on a linear scale on the left, and the mean learning curve is shown on a logarithmic scale on the right. The mean curve is high because of a minority of the sensors whose absolute values are high and whose variance is low. If the experiment is rerun using cumulants with their average value subtracted out, then the mean performance is greatly improved, as shown on the right, explaining 78% of the variance in the target by the end of the data set.
5.7 Predictions of the MagneticX sensor at the eight-second time scale. The TD(λ) prediction was close to the target, explaining 91 percent of its variance.
5.8 A plot of the average runtime used by the TD(0) and LSTD(0) algorithms to update a single nexting demon with different feature vector lengths on a simulated problem. Note the log scale. This graph also includes extrapolations of the runtimes exhibited by each algorithm. Both TD(0) implementations fit a linear trend, which we then extrapolated to estimate the runtime for larger feature vector lengths. The two LSTD(0) implementations exhibit a quadratic trend, which is used to extrapolate the runtime of the LSTD implementations. See text for a description of the experimental setup.
5.9 This graph shows an estimate of the number of nexting predictions that could be updated in a 100 millisecond time step, assuming perfect parallelism on a four-core CPU. Note the log scale. These runtimes should be considered an approximation of what is achievable with a full prediction learning system on a robot.
5.10 Possible weightings for future cumulants. Consider the zero on the x-axis to be the current time labelled “now”. Weighting (a) corresponds to the classical prediction by k = 80 steps from now with no averaging or smoothing. A rectangle weighting (b) can be used to generate different windowed averages of the cumulant. The red weighting results in a uniform weighting of the cumulants over 20 steps centered at 80 steps into the future. The green weighting equally averages all observed cumulants up until 80 steps into the future, and the blue weighting averages a window of cumulants starting at 80 steps. Continuous weightings, like the Gaussian function (c), can place more weight on the near-term cumulants (green) or in the future (blue). Finally, (d) the exponential weighting, produced by a constant γ, always weights near-term cumulants highest and cumulants in the far future exponentially less.

5.11 The target computed using several different cumulant weightings with Critterbot data, over time. The IR beacon sensor values are also plotted. The IR sensor's output has three large ‘humps’ as the robot drives past its charger (which pulses IR light 50 times a second). The pulsing, however, also causes significant variability in the cumulant signal. The inset figures correspond to the weighting used in each target computation. In this example, we see the challenge of learning and using a prediction about a single time step—the prediction-by-k weighting. The exponential weighting—corresponding to constant γ values—smooths out the cumulant, but also provides a clear sense of anticipation. The target rises and falls in advance of changes in the data series. Different time scales of prediction can be achieved with different parameterizations of the exponential weighting (i.e., different γ values). Faster decays represent near-term predictions and slower decays represent longer-term predictions. Finally, note that the rectangle, Gaussian, and exponential weightings can produce similar target plots depending on their parameter values.
6.1 A demon's prediction of total power consumption over an eight-second time scale or up until the Light3 sensor value is saturated. This corresponds to how much power will be used to reach light saturation or the effective end of the time horizon. To express this kind of prediction, the termination signal must vary with time (in this case dropping to zero upon Light3 saturation).
6.2 Predictions of the imminence of the onset of an event regardless of its duration. The event here is being too close (closer than about 12 cm) to a side wall of the pen, and imminence is with respect to a half-second time scale. The demon's learned predictions rise before the event and follow the shape of the target. Overall, the normalized mean squared error of this prediction was 23%.
6.3 Predictions of what each of the four light-sensor outputs will be when the robot rounds its next corner. The greyed time steps indicate those in which the robot was considered to be rounding a corner. Because the termination signal is greater than zero during the event, the light data from several time steps contribute to the target as the corner is entered. The normalized mean squared errors for the four predictions were 6%, 10%, 10%, and 11% over the course of the data set.
6.4 Predicting time-to-obstacle on the Critterbot. The robot was repeatedly driven toward a wall at a constant wheel speed. For each of three regions of the sensor space, for each time step spent in that region, we plot the prediction, Qt = q̂(St, At, w), on that step (bold line) and the target from that step (thin line).
6.5 Predicting time-to-stop on the Critterbot. The robot was repeatedly rotated up to a standard wheel speed, then switched to a policy that always took the STOP action, on each of three different floor surfaces. Shown is the prediction Qt made on visits to a region of high velocity while stopping (bold line) together with the target computed from that visit (thin line). The floor surface was changed after visits 338 and 534.

6.6 Predicting the imminence of the onset of a front bump on the iRobot Create. This figure includes a set of representative frames gathered while testing a single demon's predictions. Each of the six frames above includes a video still of the robot under human remote control, a visualization of whether one of the bumpers was active (lefthand box labelled ‘bump state’), a visualization of the demon's prediction (righthand box labelled ‘GTD(λ) prediction’), and a visualization of the current action (forward and backward with a gray arrow, no action with a gray pause symbol). For the bump state, blue indicates no bump detected and green indicates bump detected. For the predictions, blue indicates low prediction magnitude, green indicates high prediction magnitude, and blue-green indicates an intermediate prediction magnitude. Above, we see the robot driving toward the wall (frames #0, #1, #3) and the demon's prediction rising in anticipation of bump. As the robot backs away from the wall (frames #2 and #5) the prediction falls. A bump event occurs in frame #1. Even on an interrupted attempt (frame #4), the prediction recedes as the robot then begins to back away from the wall (frame #5).
6.7 Three demons' predictions of the imminence of the onset of a front bump learned on the iRobot Create. This figure includes a set of representative frames gathered while testing the robot's predictions. Each of the seven frames above includes a video still of the robot under human remote control, a visualization of the current bump state (left three boxes), a visualization of each demon's prediction (righthand three boxes labelled ‘GTD(λ) prediction’), and a visualization of the current action with gray arrows and pause symbols. The orientation of the robot is marked with a red line. The blue and green colors have the same interpretation as they did in Figure 6.6. Above we see the robot bumping into the wall (frames #3, #4 and #5), facing the wall with a high prediction of the imminence of bump onset if the forward action were taken (frames #0, #1, #2 and #7), and facing away from the wall with a low prediction of the imminence of bump onset (frame #6).
6.8 Learning light-seeking behavior from random behavior. Shown are superimposed images of robot positions: Left) In testing, the robot under control of the target policy turns and drives straight to the light source at the bottom of the image; Middle) Under control of the random behavior policy for the same amount of time, the robot instead wanders all over the pen; Right) Light sensor outputs averaged over seven such pairs of runs, showing much higher values for the learned target policy.
7.1 The Markov chain domain with discrete states and actions. The nonzero cumulants are labelled in red, and the feature vector corresponding to each state is given below each state.
7.2 The similarity of h to h* and of hGTD-LS to h* when the parameters of the GTD(λ) algorithm were tuned to minimize the MSPBE on several instances of the Markov chain domain. Each column corresponds to one of five chain problem instances. Each of the first three rows corresponds to one of the three similarity measures described in the text. Each subfigure in the first three rows plots the similarity of h to h* (plotted with green lines) and of hGTD-LS to h* (plotted with red lines) over 200 episodes on one instance of the chain task. The last row of figures plots the ℓ2-norm of h (in green) and h* (in red) for each problem instance. The results for the correction-difference between hGTD-LS and h* were always near zero and were excluded from the results presentation. Note that the cosine-similarity of hGTD-LS to h* quickly approaches one—high similarity—but it is difficult to see in the first row of results.
7.3 The similarity of h to h* and of hGTD-LS to h* when the parameters of the GTD(λ) algorithm were tuned to optimize similarity on several instances of the Markov chain domain. The format, organization, and notational conventions of these graphs are the same as those established in Figure 7.2.

7.4 The RMSPBE curves for four instances of the GTD(λ) algorithm on five instances of the Markov chain domain. We have labelled each variant of GTD(λ) in the figure itself. See text for a description of the experiment and discussion of the results.
7.5 A variation of Baird's counterexample due to Maei (2011).
7.6 The learning curves for three variants of GTD on Baird's counterexample. Plotted is the RMSPBE versus time step with a log scale on the x-axis. Each variant uses the best parameters found over a large systematic sweep.
7.7 The average RMSPBE of GTD(λ) on Baird's counterexample, for three different values of λ. Each heat map shows the mean RMSPBE (color) over 5000 steps for different combinations of α and η. White corresponds to massive errors, dark red (near black) regions correspond to low RMSPBE, and lighter colors indicate large RMSPBE.
7.8 The similarity between the secondary weight vector h, learned by GTD(0), and h* measured on Baird's counterexample. As before, we measure the correction-difference, the weighted-difference, and the cosine-similarity (defined in Table 7.2).

8.1 An evaluation of RUPEEvec and RUPEEscalar compared to four other estimators for estimating the RMSPBE of GTD(λ) on five instances of the Markov chain task. Each row of the figure corresponds to an instance of the Markov chain problem. Each of the first three columns corresponds to a different evaluation measure, while the last column simply plots the RMSPBE of the GTD(λ) algorithm and several estimates of the RMSPBE versus episode number. See text for a description of the experiment and discussion of the results.
8.2 A comparison of the RMSPBE of the GTD(λ) algorithm (black) and the RUPEEvec measure's estimate of the RMSPBE (blue) on a non-stationary variant of the Markov chain domain. Each subplot corresponds to one of five Markov chain problem instances. The GTD(λ) algorithm learned for 200 episodes, and then the primary weight vector was modified: w200 = −w200. See text for a description of the experiment and discussion of the results.
8.3 Aggregate learning curves for the 749 predictions whose cumulant corresponds to each sensor and one of five constant-action target policies. Each point in each subgraph is the aggregate error of the predictions of the demons up to that test. The top left plot shows the median NMSE curve, and the bottom plot shows the mean and median SMAPE curves. The top right curve shows what the mean NMSE would be if the experiment were rerun using cumulants with their average value subtracted out.
8.4 Off-policy learning progress of 245 predictions learned by the GTD(λ) algorithm, as measured by the RMSE from on-policy test excursions and two estimates of the RMSPBE. The figure shows the mean RMSE, mean RUPEEvec, and mean RUPEEscalar plotted against time. The mean of each curve was computed from the learning curves of 245 predictions. A sudden change (marked by the gray vertical line at the mid-way point) was simulated by resetting each prediction's primary weight vector, w(i) = −w(i). The change caused the demons' predictions to become inaccurate and caused all three measures to increase dramatically.
8.5 Scaling off-policy learning to 2500 demons and 300 distinct target policies on a robot, with progress measured using RUPEEvec. The bottom plot shows the estimated RMSPBE versus time for a selection of 30 of the 2500 demons. The top two plots show the mean and median estimated RMSPBE curves computed over all 2500 demons. The median estimated RMSPBE is higher than in the previous experiment because the individual cumulants were not scaled to be between zero and one.

9.1 A plot of UDE and an error signal sampled from a normal distribution: δt ∼ N(0, 1.0). In the top panel, we see that the UDE fluctuates close to zero, compared with the confidence bounds equal to +σ and −σ plotted in green and red. The fluctuations in the errors are small and so is the UDE. In the middle panel, we see the outcome of a single spurious error that is significantly outside the confidence bounds. The bottom panel shows a large sustained perturbation of the error signal, causing a large increase in UDE that does not vanish for an extended period of time.
9.2 The UDE measure computed from the TD error of the TD(λ) algorithm's eight-second prediction of future light from the nexting data set. Both graphs share the same x-axis. During the first 80 seconds, the predictions are accurate and the UDE is low. After the 80-second mark, the robot stalls in front of the light source, and the prediction of future light becomes substantially different from the target: the predictions are highly inaccurate and the UDE becomes large. The zoomed-in section of the graph shows the TD(λ) prediction initially going down in anticipation of passing the light.
9.3 A visualization of how the pixels from the USB camera image are randomly sampled and binned to create the forward demon's feature vector. On the left side, we see the 100 randomly sampled color components (red and blue) and luminance components (green) sampled from the current 120 by 160 image. The selection is performed at the beginning of each experiment, and the mapping remains fixed throughout the experiment. The righthand side shows a visualization intended to give the reader an idea of the demon's view of the world using this feature construction method.
9.4 The iRobot Create selecting actions according to the non-reactive behavior policy in its pen.
9.5 A plot of the UDE measure for the forward demon (top) and the rotation demon (bottom). Both graphs show the three phases of the experiment: (1) initial learning under the non-reactive behavior, (2) learning under the adaptive behavior, and (3) learning under the adaptive behavior after the load was added to the robot. The dotted line on the bottom plot marks the 0.2 threshold used by the adaptive behavior to switch target policies. The light red vertical shaded region marks when the robot (and learning) were paused to place the load on the robot. A large spike in the UDE measure of the rotation demon occurs after the load is added in phase three. The UDE reduces quickly thereafter, because the robot continuously rotates, providing data to update the demon's prediction of battery current. In this particular run, a second spike in UDE occurs after the initial spike. The robot's initial rotation due to the increase in UDE continued until the UDE was suppressed below the 0.2 threshold. After several seconds, the rotation demon encountered a new situation—according to its feature vector—where its prediction of battery current was still incorrect, and thus the UDE spiked again. This caused another extended, but shorter, rotation.
9.6 These two plots show the action selection choices of the adaptive behavior policy before and after the load was added to the robot (marked pause). The top figure shows which demon's policy was followed by the behavior as a binary time course. At times, neither policy is followed because the robot automatically reverses and counter-clockwise rotates after bumping, which is not part of either target policy. The bottom histogram shows the number of time steps on which the rotation action was selected over the same time period. The upper plot shows that the robot continually selected actions in accordance with the rotation demon's target policy after the load was added, and then returned to normal operation. The second extended rotation is also visible. The lower plot illustrates the same phenomenon, and also shows the variability in the number of rotations throughout the run.

1 A diagram of how many demons' predictions could participate in the feature vector generation process, and thus indirectly in the behavior. At the highest level (dotted lines and gray box) we see the familiar reinforcement learning agent-environment interaction diagram.

Chapter 1

Introduction

Understanding how an artificial agent may represent, acquire, update, and use large amounts of knowledge has long been an important research challenge in artificial intelligence. Intelligent agents know a lot about their world. Humans know that the sun rises every day, how long it will take to get to the bathroom, and that significant energy is required to lift a heavy box. Artificial agents, such as Google's self-driving car, know which regions are drivable and which are not, the likely future positions of moving objects, and what to do when a light turns red. Future intelligent agents—such as a nanny bot to care for the elderly—may know how to vacuum the floor, and what time you usually wake.

Much of what people believe to be everyday or common-sense knowledge may be thought of as predictive. Knowing your keys are in the desk is to predict that you will see the keys if you open the desk drawer. Knowing about a chair is to predict the success of attempting to sit on it. Knowing how a ball will feel in your hand is to predict the sensations you should observe if you picked up the ball. We call these policy-contingent predictions because they are statements about the outcome of a course of action: opening the drawer, sitting, and picking up. The quantity of knowledge, or knowing a lot, can be nicely thought of as making and updating many predictions about many different courses of action.

Predictive knowledge acquisition is potentially more scalable than other forms of knowledge that rely on correctness defined by people. Updating a prediction can be as simple as making a prediction, following some course of action, observing the outcome, and refining the prediction—learning from an agent's interaction with the world. A prediction's correctness is defined by how well it matches what actually occurs when the agent takes the course of action corresponding to the prediction. In comparison, conventional objective knowledge—such as a database of logical facts—may be maintained correct by logical inference or human consistency checking. Unfortunately, as the size of the database of knowledge increases, human-driven consistency checking becomes error prone and logical consistency checking becomes computationally infeasible. Predictive knowledge acquisition, on the other hand, may be scaled with computation and data. Computational resources constrain how many predictions the agent can store and how quickly those predictions may be updated; experience governs the accuracy of the agent's predictions. The potential scalability of predictive knowledge acquisition has by and large remained an untested hypothesis.

This thesis explores the practicality of predictive representations of knowledge by attempting to improve the applicability and scalability of prediction specification and acquisition. The idea that knowledge might be expressed as prediction has been explored by Cunningham (1972), Becker (1973), Drescher (1990), and Sutton (2009, 2012), but only in small-scale simulations. Other well-developed uses of prediction, including predictive state representations (PSRs) (Littman et al., 2002; Singh et al., 2004; Bowling et al., 2006; Wingate & Singh, 2008), temporal difference networks (TD-nets) (Sutton & Tanner, 2005; Sutton et al., 2006), and prediction profiles (Talvitie & Singh, 2011), have also been limited to small, discrete simulations. More recently, spectral learning techniques from system identification have been used to learn the parameters of a PSR from continuous low-level sensor data produced by a robot (Boots et al., 2010; Boots & Gordon, 2011). All told, these prior demonstrations of prediction learning have been limited to making and updating a few dozen predictions, and do not easily extend to updating predictions from continuous inputs at the same time as the agent is interacting with the world. This thesis describes a new approach to predictive knowledge based on ideas and algorithms from reinforcement learning, and demonstrates for the first time making and updating thousands of predictions from hours of high-dimensional sensor data produced by a mobile robot. Our new approach improves the applicability and scalability of the predictive approach to knowledge.

1.1 Objective

This thesis seeks to answer the following question:

Can we develop a predictive approach to knowledge, improving the applicability and scalability of incremental prediction learning?

Learning predictive knowledge is taken to mean approximating a mapping from observed data to summary statements about the future. To acquire predictive knowledge is to refine and update predictions based on the agent's experience: the data generated by an agent exploring its world. We call this learning online, because the agent modifies its prediction immediately after each new observation, using an incremental learning algorithm. Our agent will learn many predictive mappings, each about the outcome of following a target policy.

Improving the applicability of prediction learning refers to both prediction specification and how the predictions are learned. This thesis investigates policy-contingent predictions: what-if predictions about the outcomes of following a decision-making policy. Such predictions can compactly summarize an observed outcome contingent on a closed-loop temporally extended way of behaving (the policy), while abstracting away a potentially long sequence of observations and decisions that are not relevant to the prediction's outcome. This thesis also investigates how to learn predictions from continuous data, such as a robot's sensor readings, and how predictions can be updated off-policy. In the off-policy setting, we update a prediction about the observed outcome of following a target policy from data generated while the agent follows another, potentially different, policy. This combination of policy-contingent predictions, learned off-policy from continuous inputs, has not been previously demonstrated.

One of the main objectives of this thesis is to build agents that could know a lot, and thus our work focuses on scaling predictive knowledge learning. Here we consider scaling in two dimensions: scaling the length of the data stream produced by the agent interacting with the world, and scaling the number of unique predictions to be learned. Our objective is to demonstrate learning more predictions, from more data, than any previous predictive knowledge implementation, and to develop representations and learning algorithms that facilitate scaling with available computational resources. Overall, our aim is to broaden the class of predictions that may be posed, increase the domains in which those predictions may be learned from data, and scale up the number of things that can be predicted online during learning. The next section describes our approach to achieving these objectives.

1.2 Approach

The approach explored in this thesis has three key components: (1) learning predictions from and about low-level data produced by mobile robots, (2) framing the task of learning a prediction as one of estimating a value function, and (3) learning predictions with incremental temporal difference (TD) off-policy learning algorithms from reinforcement learning. Similar to prior work on PSRs and TD-nets, the agent's task is to predict the low-level data produced by its own interaction with the world. In this setting, the agent has no access to data structures with externally defined semantics, such as a true state or a motion model; rather, the agent only observes a stream of low-level data. As opposed to a task-driven setting, our aim is for the agent to continually make and update many different predictions about the future values of its own data stream.

Unlike almost all prior work in predictive knowledge learning, our agent will seek to predict the sensor stream produced by a mobile robot. One of our robots, called the Critterbot, produces a 53-dimensional vector of real-valued sensor readings roughly every 10 milliseconds (ms): 1.8 × 10^7 sensor readings an hour. Our agent's task is to predict this continuous-valued, low-level sensor stream on a moment-by-moment basis, making and updating each prediction in realtime as the data are produced. This setup facilitates specification of large numbers of predictions and generates ample data for learning and evaluating each prediction.

We represent the problem of predicting future sensory data as a value estimation problem. Reinforcement learning is a problem formalism in which an agent learns a policy to maximize a scalar reward signal, usually corresponding to some explicit goal or desired behavior. Many reinforcement learning algorithms estimate a value function as an intermediary step inside iterative policy learning algorithms. A value function is an estimate of the expected total future reward the agent would receive if it followed some policy from the current state—it is a prediction of future reward. In this thesis, we develop the idea that rewards need not be tied to some objective the agent is trying to achieve; rather, a reward may be taken to be any signal of interest that the agent wishes to predict, such as one of the robot's sensors. The agent's task is to estimate a general value function (GVF), similar to a conventional value function but defined with this broader view of reward, and a time-varying definition of discounting and termination. One consequence of this approach is that value function learning algorithms from reinforcement learning may be applied to the task of learning predictive knowledge. Specifically, efficient, linear-complexity TD-based learning methods can be used to learn GVFs online and incrementally from continuous inputs via function approximation. Our GVF-based problem formulation is similar to prior work on option models and TD-nets. The main point of distinction is our use of modern off-policy reinforcement learning methods and more general function approximation, which improves applicability and substantially increases scalability compared to previous work.
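As a sketch of the object being estimated (Chapter 4 gives the precise definition, assumptions, and notation, which may differ slightly from this informal version), a GVF replaces the reward with an arbitrary cumulant signal C, lets the termination (or discount) γ depend on the state, and takes the expectation over action selection by the target policy π:

\[
v_{\pi,\gamma,c}(s) = \mathbb{E}\big[\, C_{t+1} + \gamma(S_{t+1})\, C_{t+2} + \gamma(S_{t+1})\gamma(S_{t+2})\, C_{t+3} + \cdots \;\big|\; S_t = s,\ A_t, A_{t+1}, \ldots \sim \pi \,\big].
\]

Setting C to the reward and γ to a constant recovers the conventional value function; setting C to a sensor reading and γ(s) to zero at an event of interest yields the kinds of predictions demonstrated in Chapters 5 and 6.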

Our approach to GVF learning makes use of recently developed off-policy gradient-TD methods. Off-policy updating facilitates asynchronous updating of multiple GVFs at a time, enabling massive scaling of online prediction learning. Somewhat surprisingly, conventional off-policy value function estimation methods, such as Q-learning, may diverge when used with function approximation (see Baird, 1995; Sutton & Barto, 1998). In addition, prior off-policy learning methods based on importance sampling (Precup et al., 2000; Sutton, Rafols & Koop, 2006) exhibited high variance, and thus the agent often learned more slowly than if it had sequentially updated each prediction with on-policy data (see Rafols, 2006). Recently, new TD methods were introduced which have convergence guarantees under off-policy updating with function approximation, and do not exhibit the same large variance under off-policy sampling. We use these new gradient-TD methods to learn thousands of predictions represented as GVFs on two different mobile robot platforms, demonstrating a measurable improvement in both the applicability and scalability of predictive knowledge learning.
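To make the role of these methods concrete, the following is a minimal sketch of a linear GTD(λ)-style update for a single prediction, in the form commonly given for gradient-TD methods (Maei, 2011); it is an illustration rather than the exact pseudocode used in later chapters, and the variable names and step-size values are assumptions made here for readability.

    import numpy as np

    def gtd_lambda_update(w, h, e, phi, phi_next, cumulant,
                          gamma, gamma_next, lam, rho,
                          alpha=0.01, alpha_h=0.001):
        # w, h : primary and secondary (correction) weight vectors
        # e    : eligibility trace vector
        # phi, phi_next : feature vectors for the current and next time step
        # cumulant      : the signal being predicted (the GVF's "reward")
        # gamma, gamma_next : termination values at the current and next step
        # lam  : trace-decay parameter lambda
        # rho  : importance-sampling ratio pi(A|S) / mu(A|S) for off-policy updating
        delta = cumulant + gamma_next * w.dot(phi_next) - w.dot(phi)
        e = rho * (gamma * lam * e + phi)
        w = w + alpha * (delta * e - gamma_next * (1.0 - lam) * e.dot(h) * phi_next)
        h = h + alpha_h * (delta * e - h.dot(phi) * phi)
        return w, h, e

    # Illustrative usage with arbitrary binary features and an on-policy sample (rho = 1):
    n = 50
    w, h, e = np.zeros(n), np.zeros(n), np.zeros(n)
    phi = np.random.binomial(1, 0.1, n).astype(float)
    phi_next = np.random.binomial(1, 0.1, n).astype(float)
    w, h, e = gtd_lambda_update(w, h, e, phi, phi_next, cumulant=1.0,
                                gamma=0.9875, gamma_next=0.9875, lam=0.9, rho=1.0)

The secondary weights h, updated on the last line of the function, are the quantities examined empirically in Chapter 7; the per-step cost is linear in the number of features, which is what makes updating many such predictions in parallel feasible.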

1.3 Contributions

This thesis contains five proposed contributions.

General value functions (Chapter 4)
We propose GVFs as a language for representing predictive knowledge. Although formally presented in prior work (Maei & Sutton, 2010), we are the first to develop GVFs as a representation for policy-contingent predictive knowledge. We describe how the discount rate may be a function of the state and how rewards and discounting can be mixed, providing substantial additional flexibility and expressive power compared to conventional value functions. GVFs may be learned in parallel with efficient, off-policy gradient-TD methods from continuous inputs using general function approximation architectures; an improvement in both applicability and potential scalability over previous approaches to representing and learning predictive knowledge.
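As an informal illustration of this added flexibility (the particular values here are assumptions for exposition, chosen to be consistent with the experiments described later), a constant termination value corresponds to a fixed prediction time scale, while a state-dependent termination encodes a prediction that ends when an event of interest occurs:

\[
\gamma = 1 - \frac{\Delta t}{T} \quad \text{(e.g., } \Delta t = 0.1\,\text{s and } T = 8\,\text{s give } \gamma = 0.9875\text{)}, \qquad
\gamma(s) = \begin{cases} 0 & \text{if the event of interest occurs in } s, \\ \gamma_{\text{const}} & \text{otherwise.} \end{cases}
\]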

Nexting (Chapter 5)
Psychologists refer to nexting as the propensity of people and many other animals to continually predict what will happen next in an immediate, local, and personal sense. We develop a computational model of nexting based on GVFs, and present results with a robot that learns to next in realtime, making thousands of predictions about sensory input signals at time scales from 0.1 to 8 seconds. We show that 6000 GVFs, each computed as a function of 6000 features of the state, can be learned and updated online 10 times per second on a laptop computer, using the standard TD(λ) algorithm with linear function approximation. This approach is sufficiently computationally efficient to be used for realtime learning on the robot and sufficiently data efficient to achieve substantial accuracy within 30 minutes. Moreover, a single shared tile-coded feature representation suffices to accurately predict many different signals over a significant range of time scales. Our nexting experiment provides evidence for our claims regarding the expressiveness of GVFs as a language for predictive knowledge and the practicality of our specific approach to learning.
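As a rough sketch of why this scales (assuming a vectorized implementation; this is not the thesis's code, and the sizes and step size below are illustrative), an on-policy TD(λ) step for many nexting predictions that share one feature vector is a handful of matrix operations, and with tile-coded binary features only the few active columns need to be touched:

    import numpy as np

    def td_lambda_parallel(W, E, phi, phi_next, cumulants, gammas, lam, alpha):
        # W, E : (num_demons, num_features) weight and eligibility-trace matrices
        # phi, phi_next : feature vectors shared by every prediction
        # cumulants : (num_demons,) signals to predict, e.g., individual sensor readings
        # gammas    : (num_demons,) constant termination value, one per time scale
        predictions = W @ phi                              # one prediction per demon
        deltas = cumulants + gammas * (W @ phi_next) - predictions
        E = (gammas * lam)[:, None] * E + phi              # shared features, per-demon traces
        W = W + alpha * deltas[:, None] * E
        return W, E, predictions

    # Illustrative sizes only; the step size of 0.1 / (number of active features) is an assumption.
    num_demons, num_features, num_active = 1000, 1000, 50
    W, E = np.zeros((num_demons, num_features)), np.zeros((num_demons, num_features))
    phi = np.zeros(num_features); phi[np.random.choice(num_features, num_active, False)] = 1.0
    phi_next = np.zeros(num_features); phi_next[np.random.choice(num_features, num_active, False)] = 1.0
    cumulants, gammas = np.random.rand(num_demons), np.full(num_demons, 0.9875)
    W, E, preds = td_lambda_parallel(W, E, phi, phi_next, cumulants, gammas,
                                     lam=0.9, alpha=0.1 / num_active)

With a sparse binary feature vector, the dense matrix products above reduce to sums over the active feature indices, which is the property the realtime results rely on.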

GVF learning on robots (Chapter 6)
We move beyond the simple time scales used in the on-policy nexting predictions, and demonstrate learning GVFs of a significantly more general and expressive form, on two different robot platforms. Specifically, our experiments provide concrete examples of learning GVFs defined by mixtures of hard and soft terminations, outcome signals, and policy-contingent predictions learned from off-policy samples. These experiments help further demonstrate the representational capabilities of GVFs, and the practicality of learning GVFs with reinforcement learning methods on robots. This chapter contains both the first demonstrations of GVF learning on robots, and some of the most extensive empirical investigations of the practicality of off-policy gradient-TD learning on robots.

An empirical study of the GTD(λ) algorithm (Chapter 7)
Gradient-TD methods make exploring parallel, policy-contingent prediction learning practical for the first time, but our empirical experience with these new algorithms is limited. In Chapter 7, we perform an extensive empirical study of one gradient-TD method, the GTD(λ) algorithm. We investigate the secondary weight vector of GTD(λ) in several variants of a Markov chain task and one domain that causes divergence of conventional TD-based learning methods. Our experiments provide new insight into how well these weights are learned in practice, and how these weights affect the performance of the GTD(λ) algorithm overall. Our experiments are the first to specifically investigate the secondary weights, and they reveal several new insights of particular relevance to predictive knowledge learning.
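For reference, under linear function approximation the secondary weights have a natural target, the fixed point of their expected update (written here for the trace vector e, which reduces to the feature vector φ when λ = 0; this is the standard characterization, and the thesis's exact notation may differ):

\[
h^{*} = \mathbb{E}\big[\boldsymbol{\phi}\boldsymbol{\phi}^{\top}\big]^{-1} \mathbb{E}\big[\delta\, \mathbf{e}\big],
\]

and it is against a quantity of this form that the learned secondary weights are compared in the similarity measures of Chapter 7.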

RUPEE, a measure of off-policy learning progress (Chapter 8)
Estimating the accuracy of predictions learned off-policy may be done in several ways, but many of these involve restricting the learning process in some way. We propose two new estimates of off-policy learning progress, called RUPEE, that estimate the objective function which gradient-TD methods optimize—the mean squared projected Bellman error (MSPBE). Our new measures can be estimated incrementally during learning with linear computational complexity and storage, and do not interfere with the learning process. We empirically evaluate how well our new measures estimate the MSPBE in several experiments in the Markov chain domain, and then show how both RUPEE measures provide a surrogate measure of the prediction accuracy of hundreds of policy-contingent predictions learned off-policy by GTD(λ) on a robot. Finally, we demonstrate one of the benefits of an online estimate of progress by using one of the new RUPEE measures to estimate the learning progress of thousands of GVFs with hundreds of distinct policies on a robot.
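For context, the objective that RUPEE tracks can be written in expectation form (shown here for the λ = 0 case; the general expressions are given in Chapters 2 and 8):

\[
\text{MSPBE}(\mathbf{w}) = \mathbb{E}\big[\delta\, \boldsymbol{\phi}\big]^{\top} \, \mathbb{E}\big[\boldsymbol{\phi}\boldsymbol{\phi}^{\top}\big]^{-1} \, \mathbb{E}\big[\delta\, \boldsymbol{\phi}\big].
\]

Because gradient-TD learners already maintain incremental quantities related to these expectation terms, a running estimate of the MSPBE can plausibly be formed from values available during learning, which is consistent with RUPEE's linear computation and storage noted above.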

This thesis also contains two smaller contributions. The first, found in Chapter 9, is a demonstration of adapting the behavior policy on a robot, based on the learning errors of multiple GVFs, similar to artificial curiosity. The second, found in Appendix A, is the derivation of a new gradient-TD method, called Hybrid-TD(λ), that automatically performs conventional TD updates when data are generated on-policy, and performs corrected, off-policy updates otherwise, ensuring convergence.

1.4 Thesis layout

This document contains 10 chapters: one that covers background material, one that describes the robot hardware and sensorimotor data, followed by six chapters of technical content. This section provides selected descriptions of the contents of each chapter. The second and third chapters can be viewed as background material that can be skipped by some readers. Chapter 2 describes important concepts in reinforcement learning, specifically value functions, function approximation, on-policy temporal difference learning, the MSPBE, and off-policy temporal difference learning. This second chapter is not essential, as our problem formulation and notations will be defined in Chapter 4. Chapter 3 describes the two robots used throughout our experiments and contains several visualizations of the low-level sensory data produced by one of our robots. In the cases where some concept is also explained in Chapters 2 and 3, we will explicitly point out the connection.

Chapter 4 is divided into three main parts. The first part formally defines our prediction learning problem, specifying the assumptions, notations, and naming conventions used throughout the rest of the thesis. The second part defines GVFs and describes how predictions may be represented using GVFs. The third and final part discusses prior work on representing and learning predictive knowledge, pointing out the similarities and differences to GVFs. Demonstrating the kinds of predictions that can be represented with GVFs, and the practicality of learning GVFs on robots, is the topic of Chapters 5 and 6. These chapters, along with the remaining technical chapters of this thesis, are summarized in the previous section. Chapter 10 concludes the thesis with a discussion of how successful we were in addressing the research question posed above, and closes with a discussion of related work, one idea corresponding to each technical chapter of the thesis.

1.5 Summary

In this chapter we introduced the topic of investigation tackled by this thesis, developing predictive knowledge learning, and described our approach based on learning policy-contingent predictions, represented as value functions, from low-level sensory data produced by a robot. We outlined the five proposed contributions described in the pages of this document, and closed with an outline of the thesis.

Chapter 2

Background

This chapter introduces and describes basic concepts of reinforcement learning which will be useful for understanding the rest of this document. We begin with the reinforcement learning problem formulation and Markov decision processes, then move on to value functions, linear function approximation, and algorithms for learning value functions from an agent's interaction with the world. We describe the off-policy learning problem, divergence of temporal difference methods, an objective function for off-policy learning, and a new family of gradient TD methods for learning value functions from off-policy samples. We finish the chapter with brief overviews of algorithms for learning policies to maximize reward, recent algorithmic advances in reinforcement learning, and option models. The contents of this chapter are meant to serve as background only. Our precise problem formulation and notations will be described in Chapter 4. Readers familiar with estimating value functions, the mean squared projected Bellman error, and gradient temporal difference learning algorithms should feel free to skip this chapter.

2.1 Reinforcement learning

The decision maker and learner is called the agent and the entity the agent interacts with (and in fact everything that is not directly under the control of the agent) is called the environment. An agent's interaction with the world can be represented using the agent-environment interaction model of reinforcement learning (see Sutton and Barto, 1998). The agent's interaction with the environment occurs in discrete time steps. At the beginning of each time-step t = 1, 2, 3, ..., the agent receives a representation of the environment's state

St ∈ S, where S is a finite set of possible states. Based on St the agent selects an action denoted by At, stochastically according to the target policy π : S × A → [0, 1], where A is a finite set of possible actions available to the agent. We denote the probability of selecting action a in state s under π with π(s, a). On the next time-step, the agent receives a scalar reward Rt+1 ∈ R, and the environment transitions to a new state St+1 based, in part, on the agent's action At. This interaction, depicted in Figure 2.1, continues, producing a temporal interaction stream of random states, actions, and rewards:

S_1, A_1, R_2, S_2, A_2, R_3, S_3, ....    (2.1)

Often, the agent-environment interaction continues forever without interruption, which is referred to as continuing. Sometimes, there may be natural divisions in time where some

goal is achieved, causing a transition into a terminal state ST ∈ S at time T. In this episodic setting, time is divided into discrete, not necessarily equal-length chunks called episodes. The specification of the environment's state-transition dynamics and reward function specifies an instance of a reinforcement learning task.


Figure 2.1: A reinforcement learning agent's interaction with an unknown environment.

An important concept underlying the design of many reinforcement learning algorithms is called the Markov property. The state, environment, and learning task are said to be Markov if the environment’s state transition and reward at time t + 1 depend only on the previous state and action:

Pr{R_{t+1}, S_{t+1} | S_1, A_1, R_2, ..., S_{t−1}, A_{t−1}, R_t, S_t, A_t} = Pr{R_{t+1}, S_{t+1} | S_t, A_t},

for all possible histories of states, actions, and rewards. In other words, the state is said to be Markov if knowledge of the current state and action is sufficient for predicting the next state and expected next reward. One special class of reinforcement learning tasks are Markov Decision Processes, or MDPs. If the state and action spaces are finite and the state is Markov, then the learning problem can be represented as a finite MDP. Here we define an MDP with a set of states and actions, and a specification of the one-step transition dynamics P : S × A × S → [0, 1], where P(s, a, s′) is the probability of transitioning into state s′ when taking action a in state s. Additionally, the expected next reward is given by

r(s′) \overset{def}{=} E[R_{t+1} | S_{t+1} = s′].

Oftentimes, the expected next reward is defined instead as r(s, a, s′) \overset{def}{=} E[R_{t+1} | S_{t+1} = s′, A_t = a, S_t = s]. We use the slightly less general definition above, and consequently a slightly less general formulation of MDPs, because it is sufficient for our purposes and is simpler. At this point, it is also convenient to define the state-transition dynamics while following policy π as P^π : S × S → [0, 1], where

P^π(s, s′) \overset{def}{=} \sum_{a ∈ A} π(s, a) P(s, a, s′).

MDPs are a subclass of reinforcement learning problems, and provide the theoretical framework on which most modern reinforcement learning algorithms are based. All the algorithms used in this thesis assume that the world (which a robot operates in) corresponds to some unknown, possibly massive MDP. As you will see later, this is not as limiting as it appears. In fact, we can use the MDP framework to design learning algorithms with asymptotic guarantees, even in cases where the agent has access to an incomplete representation of the environment's state.
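
A finite MDP of this kind can be represented concretely with a few arrays. The following sketch is our own illustration, not code used elsewhere in this thesis: it stores the one-step dynamics P and the expected next rewards r(s′) for a toy MDP with made-up numbers, and computes the policy-conditioned dynamics P^π used later in this chapter. All names in it are hypothetical.

    import numpy as np

    def policy_transition_matrix(P, pi):
        """Compute P^pi(s, s') = sum_a pi(s, a) P(s, a, s')."""
        # P: (S, A, S) one-step transition probabilities; pi: (S, A) action probabilities.
        return np.einsum('sa,sap->sp', pi, P)

    # A tiny two-state, two-action MDP with made-up numbers, for illustration only.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
    r = np.array([0.0, 1.0])                   # r(s'): expected reward on arriving in s'
    pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # a uniformly random target policy
    P_pi = policy_transition_matrix(P, pi)     # shape (2, 2)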

2.2 Value functions

Many reinforcement learning algorithms estimate the discounted sum of future rewards

called the return. The return Gt ∈ R at time t is defined by:

G_t \overset{def}{=} R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \cdots = \sum_{k=0}^{∞} γ^k R_{t+k+1},    (2.2)

where the discount factor γ ∈ (0, 1] discounts the contribution of future rewards as a function of time, ensuring the infinite sum is bounded. In episodic domains, the sum in Equation 2.2 terminates at a random time T, and γ is typically set to one, indicating that all rewards are equally valuable. We can define the return to cover both continuing and episodic problems:

G_t \overset{def}{=} \sum_{k=0}^{T−t−1} γ^k R_{t+k+1},

where T may be ∞. The agent might compute a value function in order to estimate how good a particular state is, in terms of future reward. A state-value function is a function vπ : S → R that maps states to expected returns. The value of state s, equal to the discounted sum of future rewards the agent would observe when starting in state s, selecting actions according to policy π, and transitioning according to the dynamics of the MDP, is defined by:

v^π(s) \overset{def}{=} \sum_{s′ ∈ S} P^π(s′ | s) [ r(s′) + γ v^π(s′) ]    (2.3)
       = E_{A_{t:∞} ∼ π, S_{t+1:∞} ∼ P}[ G_t | S_t = s ]
       = E_π\Big[ \sum_{k=0}^{T−t−1} γ^k R_{t+k+1} \,\Big|\, S_t = s \Big].    (2.4)

The notation in the subscript of the expectation is somewhat awkward, but is useful to explain how things are sampled. The expectation is computed over future states, which are generated by selecting actions according to π and state transitions according to P. From

now on, we will use the shorthand Eπ. Equation 2.3 is known as the Bellman equation for vπ, and specifies the recursive relationship between the value of state s and the value of the next state s′. The state-action value function, qπ : S × A → R, maps states and actions to expected returns. The state-action value of state s and action a, equal to the discounted sum of future rewards the agent would observe when starting in state s, selecting the action a, and thereafter selecting actions according to policy π and transitioning according to the dynamics of the MDP, is defined by:

q^π(s, a) \overset{def}{=} \sum_{s′ ∈ S} P(s, a, s′) \Big[ r(s′) + γ \sum_{a′ ∈ A} π(s′, a′) q^π(s′, a′) \Big]    (2.5)
          = E_π[ G_t | S_t = s, A_t = a ]
          = E_π\Big[ \sum_{k=0}^{T−t−1} γ^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a \Big].    (2.6)

Equation 2.5 is known as the Bellman equation for qπ.
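
When the dynamics of a small MDP are known, the Bellman equation (2.3) is simply a linear system in the |S| unknowns vπ(s) and can be solved exactly. The sketch below is our own illustration under that assumption; the arrays P_pi and r are the hypothetical ones from the previous sketch.

    import numpy as np

    def solve_state_values(P_pi, r, gamma):
        """Solve v = P_pi (r + gamma v) exactly, i.e., Equation (2.3) in matrix form."""
        n_states = P_pi.shape[0]
        # Rearranged: (I - gamma * P_pi) v = P_pi r
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, P_pi @ r)

    # Example usage: v_pi = solve_state_values(P_pi, r, gamma=0.9)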

2.3 Function approximation

In domains where there are a large number of states, it can be challenging to compute a value function. In order to store and update an estimate of the value function, the agent must fill in a table with one entry for each state or state-action pair. What if we wish to estimate value functions based on high-dimensional and continuous robot sensor data? The state space may be very large and the state is likely unobservable by the agent. The agent may never observe the same state twice; filling in the table may take considerable time and memory. In such domains, we seek to generalize the values of previously observed states to

other states. In order to generalize, we can use samples of the desired function (usually the value function) to construct an approximation of the entire function. Function approximation is an instance of supervised learning and therefore, in principle, many supervised methods such as artificial neural networks, decision trees, and multi-variate regression can be used to approximate the value function. In this thesis we use linear function approximation. We do not assume the agent can directly observe the state of the environment; instead the agent's learning is based on a vector representation of the state called a feature vector. Corresponding to every state, s ∈ S, there is a column vector of features

x \overset{def}{=} x(s) \overset{def}{=} (f^{(1)}(s), f^{(2)}(s), \ldots, f^{(n)}(s))^\top,    (2.7)

where the f^{(i)} : S → R are called feature functions. The size of the feature vector, x, is denoted by n ≪ |S|. This mapping is usually not one-to-one: many underlying MDP states may correspond to the same feature vector. The features may be constructed from incomplete and noisy observations of the state, for example robot sensor readings or a history of sensor readings. We denote the current feature vector by x_t \overset{def}{=} x(S_t). In the linear function approximation setting, the value function estimate v̂ : S × R^n → R is a linear function of a modifiable weight vector w ∈ R^n. The approximate state-value function v̂ is defined by:

v^π(s) ≈ v̂(s, w) \overset{def}{=} w^\top x(s),

and the agent's current estimate of the value of the state St at time t is defined by:

w_t^\top x_t = \sum_{i=1}^{n} f^{(i)}(S_t) w_t(i).

Usually, the agent's approximation of the value function is limited by the approximation architecture, perhaps because the number of feature functions is much less than the number of states in S or because vπ is not a linear function. As a consequence, we must balance the agent's limited resources—the finite set of modifiable weights. We may be forced to decide in which situations a higher approximation error is acceptable, and in which situations a more accurate value estimate is needed. In the case of approximate state-action value functions, the feature vector is defined in a slightly different way. Corresponding to every state, s ∈ S, and action, a ∈ A, there is a column vector of features:

x_a \overset{def}{=} x(s, a) \overset{def}{=} (f^{(1)}(s, a), f^{(2)}(s, a), \ldots, f^{(m)}(s, a))^\top,    (2.8)

and the f^{(i)} : S × A → R are feature functions. The size of the state-action feature vector, x_a, is denoted by m ≪ |S|. The state-action feature vector for the current state and action is denoted by x_{a,t} \overset{def}{=} x(S_t, A_t). The approximate state-action value function q̂ : S × A × R^m → R is defined by:

q^π(s, a) ≈ q̂(s, a, w) \overset{def}{=} w^\top x(s, a),

with a modifiable weight vector w ∈ R^m. The agent's current estimate of the value of state s and action a at time t is defined by:

w_t^\top x_{a,t} = \sum_{i=1}^{m} f^{(i)}(s, a) w_t(i).

Linear approximation architectures have several attributes that make them attractive for learning value functions in large and continuous domains. Linear architectures are simple to implement and debug. One can determine which feature components have the highest and lowest contributions to the approximate value function by comparing the learned parameter values, which can help with feature function engineering. Many of the theoretical guarantees for reinforcement learning algorithms apply to the case of linear function approximation (see Sutton and Barto, 1998). Linear architectures, however, are unable to represent many functions that can be represented by non-linear approximation architectures like neural networks. This limitation can sometimes be overcome by using non-linear basis functions f^{(i)}(s), as we do in tile coding, which is described next. In this case, the approximate value function is still linear in w, but the representation capacity may be improved.

Tile coding is a technique for converting continuous variables into sparse feature vectors, and this approach is well suited for learning while interacting with the environment. The inputs are taken in small groups and partitioned into non-overlapping regions called tiles. One example tiling over two continuous sensor readings is shown on the left side of Figure 2.2. In tile coding, the feature vector is binary and sparse: x ∈ {0, 1}^n. The number of active components, or ones, is constant, and thus the norm of the feature vector remains constant. The approximate value function can be efficiently queried by only accessing the small set of active indices. We can query v̂(s, w) by:

v̂(s, w) = \sum_{i ∈ F} w(i).

We can query q̂(s, a, w) by:

q̂(s, a, w) = \sum_{i ∈ M} w(i),


Figure 2.2: This is an example of how tile coding can be used to map continuous sensor data into a binary feature vector. On the left we see a single tiling of the continuous 2D space corresponding to the output of two distance sensors. The space was tiled into four intervals in each dimension, for 16 tiles overall. On the right we see all four tilings, each offset by a different negative amount such that they were equally spaced, with the first tiling starting at the lower left of the sensor space (as shown on the left) and the last tiling ending at the upper right of the space. A sensor reading input to the tile coder is a point in the space, like that shown by the white dot. The output of tile coding is the indices of the four tiles that contain the point, as shown on the right. These tiles are said to be active, and their corresponding components take on the value one, while all the non-active tiles correspond to feature components with the value zero. Note how the four tilings provide a dense grid of lines, each a distinction that can be made between input points, yet the four active tiles together span a substantial portion of the sensor space. In this way, multiple tilings provide a binary representation that enables both fine resolution and broad generalization.

where F and M denote the indices of the active tiles for x or x_a. In tile coding, the size of the feature vector corresponds to the total number of tiles, not the dimensionality of the state space. The agent is thus free to use available computational resources in whatever way best suits the task at hand: using fewer tiles when a small number of components is sufficient for learning, and up to the maximum allowable computation when more resolution is desired. Tile coding is considerably more powerful than a simple discretization of the states. Multiple overlapping tilings can be offset from each other, as shown in the right side of Figure 2.2. An input (e.g., the white dot in the figure) activates a single tile in each tiling. The resolution of the tile coding is finer than that of the individual tilings, as can be seen by the greater density of lines in Figure 2.2 (right). With four tilings, the effective resolution is roughly four times that of the original tiling. The advantage of multiple tilings over a single tiling with four times the resolution is that generalization will be broader with multiple tilings, which typically leads to much faster learning. With tile coding, one can quickly learn a coarse approximation to the desired mapping and then refine it with further data, simultaneously obtaining the benefits of both coarse and fine discretizations.
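
To make the mechanics of tile coding concrete, the following is a minimal sketch of a grid tile coder for a two-dimensional sensor input. It is not the tile-coding software used in our experiments; practical tile coders typically use asymmetric offsets and hashing, and every name and constant below is a hypothetical placeholder.

    import numpy as np

    def tile_indices(inputs, num_tilings=4, tiles_per_dim=4, value_range=(0.0, 255.0)):
        """Return the indices of the active tiles for a continuous input vector.

        Each tiling is a uniform grid over the scaled input space, offset from the
        others by a fraction of a tile width, so exactly one tile per tiling is active.
        """
        lo, hi = value_range
        scaled = (np.asarray(inputs, dtype=float) - lo) / (hi - lo) * tiles_per_dim
        tiles_per_tiling = tiles_per_dim ** len(scaled)
        active = []
        for tiling in range(num_tilings):
            offset = tiling / num_tilings                    # fractional offset of this tiling
            coords = np.clip(np.floor(scaled + offset).astype(int), 0, tiles_per_dim - 1)
            index = 0
            for c in coords:                                 # row-major index within the tiling
                index = index * tiles_per_dim + c
            active.append(tiling * tiles_per_tiling + index)
        return active

    # Two IR distance readings -> 4 active tiles among 4 * 4 * 4 = 64 binary features.
    # active = tile_indices([120.0, 30.0])
    # value_estimate = w[active].sum()   # querying a linear value estimate, as above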

2.4 Estimating the value function

An agent can learn an estimate of the state-value function from samples generated by interaction with the environment. On each time-step, the agent observes a sample transition

{x(S_t), R_{t+1}, x(S_{t+1})},

from which the agent can update its estimate of v̂(S_t, w) by adapting the modifiable weight vector w:

δ_t = R_{t+1} + γ w_t^\top x(S_{t+1}) − w_t^\top x(S_t)    (2.9)

w_{t+1} = w_t + α δ_t x(S_t)    (2.10)

where α > 0 is the step-size parameter and δt is a random variable. This algorithm is called linear temporal difference learning, or linear TD(0).
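
The TD(0) update is only a few lines of code. The sketch below is our own, assumes the features are supplied as dense numpy vectors, and uses placeholder step-size and discount values.

    import numpy as np

    def td0_update(w, x, reward, x_next, alpha=0.1, gamma=0.9):
        """One linear TD(0) update, following Equations (2.9) and (2.10)."""
        delta = reward + gamma * np.dot(w, x_next) - np.dot(w, x)   # TD error
        return w + alpha * delta * x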

The TD error δt is the target of the learning rule; it is the quantity we wish to make zero. In this case, the target is formed by a difference between the value estimate in state

St and value estimate in state St+1, plus the reward for the transition into St+1. Loosely speaking, the algorithm is trying to make successive value estimates close. More precisely,

the TD error will be zero if v̂(S_t, w) = R_{t+1} + γ v̂(S_{t+1}, w), and the algorithm would make no update to w. The update of linear TD(0) is said to bootstrap its estimate of the value of one state with the value of the next state. At convergence, as t → ∞, the weight vector learned by linear TD(0) satisfies:

E_π[δ_t x(S_t)] \overset{def}{=} \sum_{s_t ∈ S} d^π(s_t) \sum_{s_{t+1} ∈ S} P^π(s_t, s_{t+1}) \big( r(s_{t+1}) + γ x(s_{t+1})^\top w_t − x(s_t)^\top w_t \big) x(s_t) = 0,

called the TD(0) fixed point. The expectation is weighted by the stationary distribution

under the policy, dπ : S → [0, 1]. This is the state occupation probability for each state s ∈ S, equal to the limiting distribution produced by π and P: dπ(s) \overset{def}{=} \lim_{t→∞} Pr(S_t = s). Instead of bootstrapping the estimate between only two successive states, we can let the TD error update the estimates of several recently visited states through the weight vector w. The linear TD(λ) algorithm (Sutton, 1988) is specified as follows:

δ_t = R_{t+1} + γ w_t^\top x(S_{t+1}) − w_t^\top x(S_t)

w_{t+1} = w_t + α δ_t e_t

e_t = γλ e_{t−1} + x(S_t),

where e ∈ R^n is called the eligibility trace, e_0 = \vec{0}, and λ ∈ [0, 1] is the bootstrapping parameter. The eligibility trace can be updated in a number of ways. The equation above, which we use, is called an accumulating trace. The linear TD(λ) algorithm spans a spectrum from Monte Carlo methods at one extreme, when λ is one, to one-step TD methods at the other, when λ is zero. For values of λ in between, TD(λ) becomes an intermediate method that is often better than either extreme (see Sutton & Barto, 1998). The algorithm uses O(n) storage and computation per update. In the limit of infinite sampling, the linear TD(λ) algorithm can find the

weight vector wt that satisfies:

Eπ[δtet] = 0, (2.11)

called the TD(λ) fixed point. The proof of convergence of the linear TD(λ) algorithm to the TD(λ) fixed point depends on several technical conditions on α, on the mixing characteristics of the MDP and policy π, and on the use of linear function approximation (see Szepesvari, 2010 for precise details). In practice, the linear TD(λ) algorithm can outperform the well-known least mean squares rule (also known as the Widrow-Hoff algorithm) on temporal prediction tasks (Sutton, 1988), and has been successfully applied to more complex domains like Backgammon (Tesauro, 1995), computer Go (Silver et al., 2007), and elevator scheduling (Crites & Barto, 1996).

Instead of incrementally updating the modifiable weight vector from a sequence of agent-environment interactions, we can directly compute the solution to the TD fixed point using least squares. Given a finite batch of T samples, we wish to make the error zero: \frac{1}{T} \sum_{t=1}^{T} δ_t e_t = 0. Here δ_t ∈ R refers to the non-random sample of the TD error from time-step t: δ_t = r_{t+1} + γ w_t^\top x(s_{t+1}) − w_t^\top x(s_t).

We form a linear system of equations,

A_T w_T = b_T,    (2.12)

where b_T \overset{def}{=} \frac{1}{T} \sum_{t=1}^{T} e_t r_{t+1}, with b_T ∈ R^n, and A_T \overset{def}{=} \frac{1}{T} \sum_{t=1}^{T} e_t (x_t − γ x_{t+1})^\top, with A_T ∈ R^{n×n}. Assuming A_T is non-singular, then

w_T = A_T^{−1} b_T

gives the solution equal to the weight vector to which linear TD(λ) converges. This algorithm is called least-squares temporal difference learning, or LSTD(λ), due to Boyan

(2002).1 The LSTD(λ) algorithm has an incremental form:

e_t = γλ e_{t−1} + x(S_t)

b_{t+1} = b_t + e_t R_{t+1}

y = x(S_t) − γ x(S_{t+1})

Â_{t+1} = Â_t − \frac{(Â_t e_t)(y^\top Â_t)}{1 + (y^\top Â_t) e_t}.

The matrix Â_0 can be initialized equal to the identity matrix times a scalar, Iη, η ∈ R, to improve the stability of the algorithm in early learning when few samples have been observed. Aside from this optional initialization, there are no learning-rate parameters in this incremental form of LSTD(λ). This update procedure uses the Sherman-Morrison formula to incrementally compute the inverse of the A matrix, which we denote by Â. This algorithm uses O(n^2) computation and storage per update.

The LSTD(λ) algorithm can be substantially more sample efficient than the TD(λ) algorithm (Boyan, 1999; Boyan, 2002; Xu et al., 2002; Geramifard et al., 2006). This may not be too surprising, given that LSTD(λ) computes the solution to which the TD(λ) algorithm converges. In some domains, particularly non-stationary ones, the TD(λ) algorithm can nonetheless outperform LSTD(λ).
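
The incremental form above translates almost line for line into code. The following sketch is our own, assumes dense feature vectors, and uses placeholder parameter values; it is not the implementation used in our experiments.

    import numpy as np

    class IncrementalLSTD:
        """Incremental LSTD(lambda) with Sherman-Morrison updates of A-inverse."""

        def __init__(self, n, lam=0.0, gamma=0.9, eta=1.0):
            self.lam, self.gamma = lam, gamma
            self.e = np.zeros(n)            # eligibility trace
            self.b = np.zeros(n)
            self.A_inv = eta * np.eye(n)    # running estimate of the inverse of A

        def update(self, x, reward, x_next):
            self.e = self.gamma * self.lam * self.e + x
            self.b += self.e * reward
            y = x - self.gamma * x_next
            Ae = self.A_inv @ self.e
            yA = y @ self.A_inv
            self.A_inv -= np.outer(Ae, yA) / (1.0 + yA @ self.e)

        def weights(self):
            return self.A_inv @ self.b      # w_T = A_T^{-1} b_T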

2.5 Off-policy learning

The agent can learn about some policy other than the one used to generate its actions. In reinforcement learning, actions are selected according to some behavior policy µ : S × A → [0, 1], while the objective of learning is to estimate the value function vπ. The expectation in the definition of the value function is conditioned on actions selected according to π. This is the target of the agent's learning, and thus π is the target policy. In the usual, on-policy setting, the target and behavior policies are the same. In off-policy learning, the target and behavior policies differ. One challenge in off-policy reinforcement learning is to estimate the value function for the target policy with samples generated from the behavior policy. The behavior policy may choose actions differently than the target policy in some states. To correct for potential mismatches between π and µ, we can compute the likelihood ratio of the probability of choosing action At in state St under both policies:

ρ_t \overset{def}{=} \frac{π(S_t, A_t)}{µ(S_t, A_t)}.    (2.13)

1The LSTD(0) algorithm was introduced by Bradtke & Barto (1996).

The expectation of ρt is one under the behavior policy:

E_µ[ρ_t | S_t = s] = \sum_{a ∈ A} µ(s, a) \frac{π(s, a)}{µ(s, a)} = \sum_{a ∈ A} π(s, a) = 1.

The likelihood ratio will be exactly one for on-policy transitions and greater than or equal to zero for off-policy transitions. Note that µ(St, At) is always greater than zero, because At was selected according to µ. Using the likelihood ratios, we can construct a linear off-policy TD(0) update:

w_{t+1} = w_t + ρ_t α δ_t x(S_t),    (2.14)

where δt is the same as before. Unfortunately, temporal difference methods, like the linear off-policy TD(0) algorithm described above, can be shown to diverge under off-policy sampling in the case of linear function approximation. Baird (1995) provided a counterexample that causes the linear TD(0) update to become unstable with off-policy updating, and the modifiable weight vector to diverge to infinity. Another example of divergence is given by Bertsekas and Tsitsiklis (1996), example 6.7. Counterexamples to convergence exist for other temporal difference methods, including Q-learning (Baird, 1995) and TD(0) in the case of non-linear function approximation (Bertsekas & Tsitsiklis, 1996).

Recently, a new family of algorithms was introduced that can be shown to not diverge under off-policy sampling in the case of linear function approximation. These methods minimize the mean squared projected Bellman error, or MSPBE:

MSPBE(w) \overset{def}{=} \| Xw − Π T^{γλ} Xw \|_B^2,    (2.15)

first introduced by Antos et al. (2008). At this point, it is convenient to use matrix notation to explain the terms inside the MSPBE. Let X ∈ R^{|S|×n} be a matrix with |S| rows, one for each s ∈ S, where each row of X is equal to x(s). The approximation of the value of each state is given by the column vector V^π \overset{def}{=} Xw ∈ R^{|S|}, with one entry for each state in S. The matrix B ∈ R^{|S|×|S|} is a diagonal matrix whose diagonal entries correspond to the stationary distribution under the behavior policy: B(s, s) \overset{def}{=} d_µ(s) ∀ s ∈ S. The projection matrix Π ∈ R^{|S|×|S|} is equal to X(X^\top B X)^{−1} X^\top B. The Bellman operator T^{γλ} : R^{|S|} → R^{|S|} helps to define the Bellman equation of the value function, in matrix form:

V^π \overset{def}{=} R + γ P^π V^π    (2.16)
    \overset{def}{=} T^{γλ} V^π,    (2.17)

where R ∈ R^{|S|} is a column vector with each component equal to E[R_{t+1} | S_t = s]. Finally, the norm is weighted by the matrix B: \|y\|_B^2 \overset{def}{=} y^\top B y.

The projection matrix and the weighting by the stationary distribution probabilities have important practical consequences. First, the projection matrix projects points in the value space back into the space of representable functions defined by X. This means the optimal value function, with respect to the MSPBE, is guaranteed to be representable by the function approximator, and that we can achieve an MSPBE(w) equal to zero even if the features cannot precisely represent the true, un-approximated value function vπ. Second, the error in each state is scaled by the probability of that state under the stationary distribution induced by the behavior policy. States that are never visited contribute zero error; errors in states that are infrequently visited contribute less to the MSPBE than errors in states visited frequently.

The MSPBE can also be written in terms of expectations (as originally shown by Maei, 2011), which is a more useful form for deriving incremental learning algorithms:

MSPBE(w) = E_µ[δ_t e_t]^\top \, E_µ[x(S_t) x(S_t)^\top]^{−1} \, E_µ[δ_t e_t].    (2.18)

Taking the gradient of the objective function, we get:

−\frac{1}{2} ∇ MSPBE(w_t) = E_µ[δ_t e_t] − E_µ[γ(1 − λ) x(S_{t+1}) e_t^\top] E_µ[x(S_t) x(S_t)^\top]^{−1} E_µ[δ_t e_t],

where e_t \overset{def}{=} ρ_t (γλ e_{t−1} + x(S_t)) (see Maei, 2011, page 72). In order to design an online algorithm for minimizing the MSPBE, we need to sample the gradient, which is challenging due to the product of expectations. We can let the last two terms equal a secondary weight vector h ∈ R^n, and sample the remaining terms:

Δw_{t+1} ∝ δ_t e_t − γ(1 − λ) x(S_{t+1}) (e_t^\top h_t).    (2.19)

Maei (2011, page 72) shows that

h_t ∝ E_µ[x(S_t) x(S_t)^\top]^{−1} E_µ[δ_t e_t] = E_µ[x(S_t) x(S_t)^\top]^{−1} E_µ[δ_t^{λρ} x(S_t)],

where δ_t^{λρ} ∈ R is called the off-policy, truncated, forward-view TD error, and is defined using future states and rewards. Now, notice that E_µ[x(S_t) x(S_t)^\top]^{−1} E_µ[δ_t^{λρ} x(S_t)] has the same form as the solution to a least squares problem, h = (X^\top X)^{−1} X^\top y, with X = x(S_t) and y = δ_t^{λρ}. This means we can form an incremental update rule for h_t using a least mean squares (LMS) rule (due to Widrow & Stearns, 1985):

Δh_{t+1} = α_h \big[ (δ_t^{λρ} − h_t^\top x(S_t)) x(S_t) \big],

where α_h > 0 is a step-size parameter. The above LMS rule, however, uses future states and rewards to compute δ_t^{λρ}, so we exploit the fact that E_µ[δ_t^{λρ} x(S_t)] = E_µ[δ_t e_t] and obtain an update for the secondary weights:

Δh_{t+1} = α_h \big[ δ_t e_t − (h_t^\top x(S_t)) x(S_t) \big].

Putting it all together, we get an MSPBE-based, off-policy, value estimation algorithm called gradient TD(λ), or GTD(λ). The following equations specify the algorithm:

e_t = ρ_t (γλ e_{t−1} + x(S_t))

w_{t+1} = w_t + α \big[ δ_t e_t − γ(1 − λ) x(S_{t+1}) (e_t^\top h_t) \big]

h_{t+1} = h_t + α_h \big[ δ_t e_t − (h_t^\top x(S_t)) x(S_t) \big],

where ρt and δt are the same as before. This algorithm is due to Maei (2011). The algorithm converges to the minimum of the MSPBE under off-policy sampling, given several technical conditions, including that the trace vector, et, remains bounded (Maei, 2011). The GTD(λ) algorithm, like TD(λ), uses O(n) memory and computation per update step. There is also a gradient temporal difference algorithm for approximating state-action value functions, called GQ(λ), specified by the following equations:

δ_t = R_{t+1} + γ w_t^\top \bar{x}_{t+1} − w_t^\top x(S_t, A_t)

e_t = ρ_t γλ e_{t−1} + x(S_t, A_t)

w_{t+1} = w_t + α \big[ δ_t e_t − γ(1 − λ) \bar{x}_{t+1} (e_t^\top h_t) \big]

h_{t+1} = h_t + α_h \big[ δ_t e_t − (h_t^\top x(S_t, A_t)) x(S_t, A_t) \big].

The \bar{x}_t term denotes the expected next feature vector:

\bar{x}_t \overset{def}{=} \sum_{a ∈ A} π(S_t, a) x(S_t, a).

The GQ(λ) algorithm has been proven to converge under off-policy sampling in the case of linear function approximation (see Maei & Sutton, 2010). The algorithm uses O(m) computation and storage per step, linear in the size of the state-action feature vector. Note that we have omitted the interest function I : S → [0, 1] included in Maei and Sutton's (2010) original specification of the algorithm.
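
For concreteness, the GTD(λ) updates above can be transcribed directly into code. The sketch below is ours, not the implementation used in this thesis; feature vectors are assumed dense and the step sizes are placeholder values.

    import numpy as np

    class GTDLambda:
        """Linear GTD(lambda) for off-policy state-value prediction (a sketch)."""

        def __init__(self, n, alpha=0.01, alpha_h=0.001, lam=0.9, gamma=0.9):
            self.alpha, self.alpha_h, self.lam, self.gamma = alpha, alpha_h, lam, gamma
            self.w = np.zeros(n)    # primary weights
            self.h = np.zeros(n)    # secondary weights
            self.e = np.zeros(n)    # eligibility trace

        def update(self, x, reward, x_next, rho):
            delta = reward + self.gamma * np.dot(self.w, x_next) - np.dot(self.w, x)
            self.e = rho * (self.gamma * self.lam * self.e + x)
            correction = self.gamma * (1.0 - self.lam) * np.dot(self.e, self.h) * x_next
            self.w += self.alpha * (delta * self.e - correction)
            self.h += self.alpha_h * (delta * self.e - np.dot(self.h, x) * x)
            return delta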

There are other reinforcement learning algorithms with convergence guarantees under off-policy sampling and linear function approximation. The first is equivalent to GTD(λ) with λ equal to zero, which we call TDC for historical reasons, and is due to Sutton et al. (2009). Another algorithm, called GTD2, also minimizes the MSPBE, but is produced by a different derivation strategy than the GTD(λ) algorithm, and does not make use of eligibility traces. The following equations specify the GTD2 algorithm:

δ_t = R_{t+1} + γ w_t^\top x(S_{t+1}) − w_t^\top x(S_t)

w_{t+1} = w_t + α (x(S_t) − γ x(S_{t+1})) (x(S_t)^\top h_t)

h_{t+1} = h_t + α_h (δ_t − h_t^\top x(S_t)) x(S_t).

The original GTD algorithm minimizes a different objective:

NEU(w) \overset{def}{=} E_µ[δ_t x(S_t)]^\top E_µ[δ_t x(S_t)].

The algorithm is due to Sutton et al. (2009) and is specified by the following equations:

δ_t = R_{t+1} + γ w_t^\top x(S_{t+1}) − w_t^\top x(S_t)

w_{t+1} = w_t + α (x(S_t) − γ x(S_{t+1})) x(S_t)^\top h_t

h_{t+1} = h_t + α_h (δ_t x(S_t) − h_t).

The TDC, GTD, and GTD2 algorithms all have convergence results, and use computation and storage that is linear in the size of the feature vector. In practice, the empirical results in the literature (Sutton et al., 2009) suggest that TDC is superior to GTD and GTD2 in both Markov chains and computer Go. Finally, non-linear versions of GTD2 and TDC converge to a local optimum of a non-linear variant of the MSPBE, enabling, for the first time, convergence of a temporal difference method with smooth value function approximation, such as neural networks.

Two different models have been used over the years to develop off-policy TD algorithms. The GTD(λ) algorithm and other MSPBE minimization methods make use of the excursion model for off-policy learning. In this model, the state-value function is defined as the expected discounted return of executing the target policy from the current state. For example, if the behavior policy were random and the target policy were go-to-nearest-wall, then the value function would estimate the expected, discounted sum of rewards if the agent ran the go-to-nearest-wall policy from the current state. Prior work on off-policy TD algorithms used the alternative-life model, where the value function is defined as the expected, discounted sum of rewards assuming the agent had been following the target policy since the beginning of time. In practice, the algorithms derived under the alternative-life model exhibit large variance (demonstrated by Precup et al., 2001; Precup et al., 2006; Rafols, 2006).

2.6 Computing the MSPBE

Given the parameters of the MDP, we can compute the MSPBE for the weight vector learned by a gradient TD algorithm. This can be useful for experiments. The following describes how we arrive at our final equation for the MSPBE.

We begin with an expansion of Eµ[δtet]:

E_µ[δ_t e_t] = \sum_{s} d_µ(s) E_µ[ e_t (R_{t+1} + (γ x_{t+1} − x(S_t))^\top w_t) \mid S_t = s ]

            = \sum_{s} d_µ(s) E_µ[ e_t \mid S_t = s ] \, E_µ[ R_{t+1} + (γ x_{t+1} − x(S_t))^\top w_t \mid S_t = s ]

            = \sum_{s} d_µ(s) E_µ\Big[ \sum_{m=−∞}^{t} (γλ)^{t−m} x(S_m) \,\Big|\, S_t = s \Big] \cdot E_µ\Big[ R_{t+1} + \Big( γ \sum_{s′ ∈ S} P^π(s, s′) x(s′) − x(S_t) \Big)^{\!\top} w_t \,\Big|\, S_t = s \Big]

            = \sum_{s} d_µ(s) E_µ\Big[ \sum_{m=−∞}^{t} (γλ)^{t−m} x(S_m) \,\Big|\, S_t = s \Big] 1(S_t = s)^\top (R + γ P^π X w_t − X w_t).

Let \bar{x}_t \overset{def}{=} \sum_{a ∈ A} π(S_t, a) x(S_t). Now, we can simplify the eligibility trace using matrices.

Considering the sum over states before St:

" t # X t−m Eµ (γλ) x(Sm)|St = s m=−∞ X π = x(s) + γλ P (st−1, s)x(st−1)

st−1∈S 2 X π X π + (γλ) P (st−1, s) P (st−2, st−1)x(st−2) + ...

st−1∈S st−2∈S > π > 2 π 2 > = X 1(St = s) + γλ(P X) 1(St = s) + (γλ) ((P ) X) 1(St = s) + ... ∞ > X m π m > > π −> = X ( (γλ) (P ) ) 1(St = s) = X (I − γλP ) 1(St = s). m=0

Since 1(S_t = s) 1(S_t = s)^\top is simply a matrix with precisely a single one on the diagonal, corresponding to state s, (I − γλ P^π)^{−\top} 1(S_t = s) 1(S_t = s)^\top selects the row in (I − γλ P^π)^{−\top} corresponding to s. Similarly, 1(S_t = s) 1(S_t = s)^\top (I − γλ P^π)^{−1} selects the column in (I − γλ P^π)^{−1} corresponding to s. Therefore, (I − γλ P^π)^{−\top} 1(S_t = s) 1(S_t = s)^\top = 1(S_t = s) 1(S_t = s)^\top (I − γλ P^π)^{−1}, and we obtain:

E_µ[δ_t e_t] = \sum_{s} d_µ(s) X^\top (I − γλ P^π)^{−\top} 1(S_t = s) 1(S_t = s)^\top (R + γ P^π X w_t − X w_t)
            = X^\top \sum_{s} d_µ(s) 1(S_t = s) 1(S_t = s)^\top (I − γλ P^π)^{−1} (R + γ P^π X w_t − X w_t)
            = X^\top B (I − γλ P^π)^{−1} (R + γ P^π X w_t − X w_t)
            = X^\top B (I − γλ P^π)^{−1} R − X^\top B (I − γλ P^π)^{−1} (I − γ P^π) X w_t.    (2.20)

Now, using the expectation-based definition of the MSPBE(w), we get:

MSPBE(w_t) = (−A^π w_t + b^π)^\top C^{−1} (−A^π w_t + b^π),

where C = X^\top B X, A^π = X^\top B (I − γλ P^π)^{−1} (I − γ P^π) X, and b^π = X^\top B (I − γλ P^π)^{−1} R. Now, given access to P^π, γ, and R, we can compute the MSPBE for our algorithms.
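
Given the MDP quantities named above, this closed form is straightforward to evaluate. The following sketch is our own illustration; X is the feature matrix, d_mu the stationary distribution of the behavior policy, and all arguments are assumed to be supplied by the experimenter (with C non-singular).

    import numpy as np

    def mspbe(w, X, P_pi, R, d_mu, gamma, lam):
        """Evaluate the MSPBE of weight vector w from the MDP parameters."""
        n_states = P_pi.shape[0]
        B = np.diag(d_mu)
        M = np.linalg.inv(np.eye(n_states) - gamma * lam * P_pi)
        C = X.T @ B @ X
        A = X.T @ B @ M @ (np.eye(n_states) - gamma * P_pi) @ X
        b = X.T @ B @ M @ R
        err = b - A @ w                       # equals E_mu[delta_t e_t], Equation (2.20)
        return err @ np.linalg.solve(C, err)  # err^T C^{-1} err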

2.7 Recent algorithmic advances

During the development of this document, several reinforcement learning algorithms have been improved in non-trivial ways. These new algorithms could likely be integrated into our learning systems without much trouble. One of these new algorithmic developments involves methods for computing eligibility traces in TD learning. The derivation of the linear TD(λ) algorithm uses the theoretical constructs of forward and backward views (see Sutton and Barto, 1998 for additional details). The forward view specifies how a learning algorithm should update the value function on each time-step using information about future rewards, from the current time-step. To generate incremental algorithms that do not rely on future data, the forward view is converted into a backward view, which uses data from the past to make updates on each time-step

using the eligibility traces, et. It was previously believed that this conversion from forward to backward views could only be approximate (Sutton & Barto, 1998), and thus that the linear TD(λ) algorithm only approximately implements the corresponding forward view. Recently, however, it has been shown that this conversion can be made equivalent in the interim (see van Seijen & Sutton, 2014). The new conversion yields a new linear TD(λ) algorithm called True Online TD(λ) and a new form of eligibility trace. True Online TD(λ) appears to always equal or outperform linear TD(λ), and the control version, true online Sarsa(λ), shows great promise as well.

New True Online off-policy learning algorithms have also been recently introduced (van Hasselt, Mahmood & Sutton, 2014). Empirical studies show that these new off-policy

algorithms can outperform the GTD(λ) algorithm in some simulation tasks. These new off-policy algorithms use computation and storage that is still linear in the size of the feature vector. It will be intriguing to investigate whether these new algorithms result in noticeable improvements in sample efficiency or prediction accuracy on robot data.

2.8 Learning policies

The work in this thesis focuses almost exclusively on the case of value function estimation, or prediction learning. However, reinforcement learning methods are perhaps best known for learning policies that maximize reward, like finding the shortest path in a maze. Temporal difference value-function learning methods can be used iteratively, in concert with a policy improvement step, to improve the agent's policy over time. This process is called generalized policy iteration, and like value function estimation methods, these algorithms come in two flavors: on- and off-policy. On-policy, TD-based policy iteration methods include the well-known Sarsa(λ) algorithm (Rummery, 1995), and the expected Sarsa(λ) algorithm (van Seijen et al., 2009). Off-policy, policy iteration methods find a policy to maximize reward, while selecting actions according to some other, possibly exploratory, behavior policy. Q-learning (Watkins, 1989) is an example of an off-policy, TD-based, policy iteration algorithm, and is perhaps the best known algorithm in reinforcement learning. Unfortunately, the update of the Q-learning algorithm can diverge (see Baird, 1995 and Boyan & Moore, 1995). A policy learning variant of the GQ(λ) algorithm, called greedy-GQ(λ) (Maei et al., 2010b), provides some convergence guarantees and solves Baird's Q-learning counterexample. This thesis includes one demonstration of off-policy policy learning on a robot in Chapter 6, while the rest of our results and demonstrations focus on prediction learning for fixed behavior policies.
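
As one concrete instance of TD-based control, the familiar tabular Q-learning update can be written in a couple of lines. This is the standard textbook form of the update (Watkins, 1989), shown only as an illustration; it is not used in this thesis's experiments, and the parameter values are placeholders.

    import numpy as np

    def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
        """One tabular Q-learning update: bootstrap from the greedy next action."""
        target = reward + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        return Q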

2.9 Options

An option is a generalization of actions to include extended courses of action (Sutton, Precup & Singh, 1999). An option, denoted by o, is defined by a tuple ⟨π, β, I⟩, where π is

called the option policy, β : S → [0, 1] is the termination function, where β(St) denotes the probability of π terminating in state St, and I ⊆ S is the set of states in which the option policy is applicable. An MDP with options included in the action set becomes a discrete-time semi-MDP, which is like an MDP with variable-length time steps. An option is like a closed-loop control rule that enables temporal abstraction: abstraction of action rather than

state. For example, one might define a pass-through-door option. In this case, π selects the forward action in every state, β equals one once the agent passes through a doorway, and I includes all states where the agent is facing toward the doorway. Every primitive MDP action is also an option. For example, the hypothetical up action is an option where π(s, up) = 1 and β(s) = 1 ∀ s ∈ S (indicating certain termination after one step), and I = S.

An agent can use options in several different ways. Given a set of options, the agent can learn an approximate state-action value function and policy over a mixture of MDP actions and option policies. The agent can also learn the transition and reward models of an option. The transition model gives the probability that the option policy will terminate in any other state (given an input state), and the reward model gives the expected total reward over the option policy's execution. The agent can update an option value function and option model, without ever executing the option's policy to termination, using off-policy learning methods. Finally, the agent can learn the policies inside options, learning how to execute an option to achieve a subgoal, defined by a special subgoal terminal reward.
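
The option tuple ⟨π, β, I⟩ maps naturally onto a small data structure. The sketch below is our own illustration, with the policy and termination function left as hypothetical callables; it is not part of the software described in this thesis.

    from dataclasses import dataclass
    from typing import Callable, FrozenSet

    @dataclass
    class Option:
        """An option <pi, beta, I> as defined above."""
        policy: Callable[[object], object]       # pi: state -> action
        termination: Callable[[object], float]   # beta: state -> probability of terminating
        initiation_set: FrozenSet[object]        # I: states where the option may be started

    # A primitive action 'up' viewed as an option: always select 'up' and terminate
    # with probability one after a single step. Here the empty frozenset is only a
    # placeholder standing in for the full state set S.
    up_option = Option(policy=lambda s: 'up',
                       termination=lambda s: 1.0,
                       initiation_set=frozenset())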

2.10 Summary

This chapter covered basic concepts in reinforcement learning, which will be useful for understanding the rest of this document. Our survey covered the reinforcement learning problem formulation and MDPs. We introduced value functions, linear function approximation, and described algorithms for learning value functions from an agent's interaction with the world. We discussed the off-policy learning problem, divergence of TD methods, the MSPBE objective function, and the family of gradient TD methods for learning value functions from off-policy samples. We closed the chapter with a brief overview of policy learning algorithms, recent algorithmic advances in reinforcement learning, and option models.

Chapter 3

Sensorimotor data streams and robots

In this chapter, we describe the target of our predictive knowledge learning approach: learning predictions from and about low-level data produced by robots. We describe two robots capable of generating data suitable for our efforts, which are used in the prediction learning experiments throughout this document. We finish the chapter with an investigation of the data stream produced by one of our robots.

In this thesis we use mobile robots as a platform for demonstrating how our approach improves the applicability and scalability of predictive knowledge learning. Our robots are capable of generating multi-dimensional sensor vectors of continuous numbers, multiple times a second, producing data streams that are longer and of higher dimension than any used in prior investigations of predictive knowledge learning. The main aim of this chapter is to introduce these robots and give the reader an idea of what these data streams look like.

3.1 Learning about sensorimotor data

We follow in the footsteps of previous work on predictive representations of state and temporal difference networks, using low-level interaction data as the target of learning in our approach to predictive knowledge. Specifically, the data will be a stream of instantaneous, uninterpreted sensor readings (including low-level motor commands) generated by a mobile robot interacting with a real physical environment. This sensorimotor data is low-level, fine-grained, and rapidly changing. Our agent will seek to predict the sensorimotor data itself—future values of its sensor readings—and not things external to the data stream such as the robot's x−y−θ position in 3D space. In Chapter 4 we will more precisely formalize how we form predictions about future sensorimotor data. In addition, the agent's learning

will be based exclusively on the sensorimotor data stream, and its predictions will not rely on constructs with special external semantics. For example, our agent cannot use human-defined hidden states or motion models, but we do allow feature vectors generated by an algorithm, because the features can be used without external interpretation. These restrictions are enforced to guarantee that the contents of the agent's knowledge can always be checked for correctness by comparing the predictions with the data stream.

We seek to give our agent access to the robot's hardware at the lowest level reasonably possible. The nature of how an agent's action choices are manifest in observed consequences in the world relates to the difficulty of the agent's prediction task. If there is significant delay between action selection and observed consequences, then learning can be more challenging, as longer delays are more difficult to predict. If the agent's actions are converted into motor commands by some complex subsystem, then accurately predicting the effects of actions may involve modeling the complex action processing procedure. Of course, delayed and unexpected action consequences are a fact of life. We want to learn about the consequences of the robots' interaction with the world, and reduce unnecessary preprocessing and internal abstractions. We are not the first to attempt to learn predictive knowledge from data produced by a robot (see Boots et al., 2010), but we are the first to attempt to learn in realtime while the robot is operating.

The sensorimotor data is collected and presented to our agent at a semi-regular cycle time. The variability in the cycle time is primarily due to the variability in wireless communication, when it is used. From the agent's perspective, the cycle length is strictly enforced. The agent must process the sensory data, perform learning updates, and generate a new motor command (its action) within the cycle time. If the agent fails to select a new action in time, the robot executes the most recent command issued by the agent and delivers a new batch of sensor readings to the agent. Our learning agent learns its predictions online.

The next two sections describe our two robot platforms. One is a commercial product based on the Roomba, and the other is a custom-built robot that enables low-level hardware access and fast cycle times.
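
The timing constraint described above amounts to a simple realtime loop: read the latest sensor vector, perform the learning updates, and emit a command before the deadline. The sketch below is our own schematic of such a loop, with hypothetical robot and agent interfaces; it is not the software actually used on our robots.

    import time

    CYCLE = 0.1  # seconds; a hypothetical 100 ms update cycle

    def run(agent, robot, num_steps):
        """Schematic realtime loop. If the deadline is missed, the robot simply
        repeats the last command, so the agent must finish within one cycle."""
        action = robot.default_action()
        for _ in range(num_steps):
            start = time.time()
            observation = robot.read_sensors()         # latest pooled sensor vector
            action = agent.step(observation, action)   # learning updates + action selection
            robot.send_command(action)
            time.sleep(max(0.0, CYCLE - (time.time() - start)))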

3.2 The iRobot Create

The iRobot Create is a commercially produced robot based on the Roomba vacuum robot. The Create can run multiple times a day, for weeks on end, with inconsequential wear and

tear. This robustness is partially due to the low-speed wheelchair drive system. Bumping into objects at maximal velocity results in no apparent damage, although the bumpers eventually wear out. Damaging motor overheating is rare. Only once did we observe some melted body plastic, due to the robot driving into a corner for over an hour (repeating the same action while staying in place). The drive system of the Create can repeatedly perform the same forward and backward movements reliably with little translational variance, even over a couple of meters. The Create's built-in find-charger routine can autonomously find and connect with its charger using a simple infrared sensor and a charging-station infrared beacon.

The update cycle of the Create can be as fast as 30 ms. Communication with the robot can be wireless, via Bluetooth to a nearby computer, or hard-wired using a small computer, such as a Raspberry Pi connected to the robot's serial port and battery. Like many mobile robots, the Create experiences wheel slippage, communication interruptions, and occasional erratic sensor readings. This robot is robust, reliable, and simple to program.

bump sensors

IR beacon sensor

side IR wall sensor

Figure 3.1: The top view of an iRobot Create. This small wheelchair-drive robot can rotate and move forward and backward at a maximum speed of 0.5 meters per second. The front left and right bump sensors are marked with blue on the figure, and report binary values. An additional virtual sensor provides a binary indication of whether both bump sensors are activated. The side-facing infrared (IR) sensor reports nearby obstacles (approximately two inch range) as an integer from zero to 255. The IR beacon sensor reports integer values from zero to 255, and is used to detect IR emittance from the robot's charger and button presses on the Create's remote control.

The Create has limited sensing abilities. The robot's sensors include nine external readings and basic wheel odometry, described in Figures 3.1 and 3.2. The Create will report unique sensor readings when it is beside a wall, bumping an object, near its charger, and near a drop-off in the floor (e.g., top of the stairs).


Figure 3.2: A view of the underside of the iRobot Create. Two powered wheels and a rotating non-powered slide wheel comprise the wheelchair drive system. The drive velocity, measured in wheel rotations, and the rotation direction of each of the two powered wheels are reported by the robot. The four downward-facing IR cliff sensors are located along the front bumper of the robot and report integer values between zero and 255.

Using only its sensor readings, the Create cannot tell where it is once it loses sensor contact with the walls and is outside its charge beacon's range. Although this robot's sensor set seems ill-suited for conventional robot tasks, such as mapping and navigating an unknown environment, there are many predictive questions we can pose, and learn to answer, from and about the Create's sensorimotor data stream.

3.3 The Critterbot

We built a custom robot designed specifically for learning about the sensorimotor stream of a robot.1 This robot is called the RLAI Critterbot, and can be seen in Figure 3.3. The majority of our experiments with the Critterbot and the Create were conducted inside a small pen. The pen is a two meter by two meter box with short walls (about the height of the Critterbot), and either a smooth or carpeted floor (it can be easily changed). The Critterbot's charger is located in one corner of the pen. The pen (1) reduces annoyance to the humans occupying the lab, (2) prevents potentially dangerous unplanned human-robot interaction, and (3) limits the complexity of the robot's environment, enabling interesting but constrained experiments.

1The Critterbot was designed by Richard Sutton and Mike Sokolsky, and constructed by Mike Sokolsky.

All our experiments with the Critterbot were run on a laptop computer communicating with the robot in realtime over a wireless link.


Figure 3.3: Side and front views of the Critterbot. The labels indicate the locations of several IR distance sensors, the charger connection points, two of the robot's three batteries, a motor drive housing, the sensor board responsible for polling the sensors, the wireless antenna, and the ring of LED lights on top of the robot used for basic communication. The Critterbot is composed of three plates, each of which supports a computing board. The top plate contains an ARM processor and the light, thermal, acceleration, magnetic, and gyroscope sensors. The middle plate contains the IR distance sensors and a Linux computer. The bottom plate contains the three motor housings, three motor controllers, and three batteries. An external skin (not shown) can be attached to prevent piercing the internals of the robot.

The Critterbot is outfitted with a reasonably large, heterogeneous sensor set. The robot has five groups of external sensors and six groups of internal sensors. The external sensors measure distance via IR reflectance (ten), ambient light (four), thermal energy (eight), magnetic field strength (three-axis), and IR light (eight). The external sensors, excluding the magnetic sensor, are arranged in a circular pattern around the robot to provide sensing in multiple directions. Internally, the robot reports its power source, charge state, bus voltage, acceleration (three-axis), rotational velocity, and wheel and motor information (motor current, motor command, motor speed, and motor temperature) for each of the

three motors. In all, 52 sensor readings are pooled and made available every 10 milliseconds. Figure 3.3 shows the locations of several sensors in the robot, and Figure 3.4 gives an overhead view of the sensor layout.


Figure 3.4: A view of the top and bottom of the Critterbot. In the left figure we can see how the IR distance, light, thermal, and IR light sensors are arranged. The right figure shows the robot's wheel layout, and the frame of reference for robot control. Three wheels are offset by 120 degrees and each wheel contains many rollers. One wheel can act as a slider while the other two wheels rotate for translational motion. Commands can be sent using one of three modes. In wheel velocity control mode, the desired velocity of each wheel is sent to the motor controllers and achieved (if possible) by PID control. In robot velocity or X-Y-θ mode, the command is specified as a forward velocity (+/-) in the X direction, a sideways velocity in the Y direction, and a rotational velocity θ. In voltage mode, each dimension of the command specifies a voltage sent to each wheel. Voltage mode, aside from overheating control, is the only mode that does not use an intermediate controller, and thus is the lowest level of control possible on the Critterbot. The bottom-most figure shows the motor controller board, the motors, and the wheels in detail.

The Critterbot sensors provide a kind of awareness and sense of location in the pen. The Critterbot reports unique sensory information when it is impacted, lifted off the ground, flipped over, pushed across the ground, pushing a heavy object, or driving on different

surfaces. External to itself, the robot reports when it is near a large magnetic or heat source or an object like a wall, when it is connected to its charger, and when lights are turned on. The Critterbot might even perceive, on its accelerometer, large vibrations from a heavy object dropped on the floor or a heavy-footed person nearby. Figure 3.5 illustrates how the robot's sensors can help distinguish many interesting situations.

The Critterbot was designed to be robust to physical interaction with other objects in the world. The robot is assembled from three aluminum plates locked together with a steel frame to protect sensory equipment and on-board computers from collision. The holonomic drive system allows the robot to rotate in place and translate in a number of directions regardless of the robot's initial pose. Commands can be sent to the robot as low-level motor currents, as desired wheel velocities (converted into motor currents by a proportional-integral-derivative [PID] controller), or in an X-Y-θ mode (also by a PID controller). The robot is shaped like a comma, with a rigid tail to facilitate basic object interaction, such as moving or shooting a ball. The Critterbot's translational movements, in X-Y-θ control, have greater variance in comparison to the Create's movements. This is partially due to the differences between wheelchair drives and omni-drives, but also due to the Critterbot's wheel design. The wheels of the Critterbot are 3D printed and, due to the strain of motion, warp fairly quickly. The consequence of this is slightly curved translational movements (that change over time), and many hours of the author's time spent printing and installing new wheels.

The Critterbot was designed to run for many hours a day and for multiple days on end. The robot's charger emits an IR pulse 50 times per second, which is visible to the robot through its IR light sensors. This pulse can be used to determine the position of the robot relative to the charger, facilitating autonomous connection and recharging. The robot does not sleep during charging; all sensors continue to report while the robot is connected to the charger. On average, the Critterbot can run for six to eight hours continuously, depending on its activity level and motor speeds. Figure 3.4 provides a view of the Critterbot's omni-wheels and drive system.

3.4 Critterbot sensorimotor data

The aim of this section is twofold. The first is to give the reader a sense of what the sensorimotor data stream of the Critterbot looks like. Our secondary aim is to discuss why the Critterbot's sensorimotor stream might be useful for developing our predictive approach to knowledge.


Figure 3.5: These figures provide a sense of the distinctiveness of the Critterbot’s sensor readings in a variety of orientations and locations in the pen. The Critterbot was photographed once per second, on one day, for over an hour while moving about the pen. Using the batch of photos and the corresponding sensor-log, we overlay the pictures of the robot corresponding to different sensor ranges. The images above show six such compositions. (a) The front ambient light sensor reading greater than 200 corresponds to the corner of the pen, facing toward the wall: the brightest location in the pen. (b) A low reading on the magnetic x-axis sensor was recorded in the middle of the pen. (c) Highest infrared light was recorded when the robot faced the dock. (d) The front IR distance sensor was near its max when the robot was pressed against the wall. (e) Large negative accelerations occurred when the robot impacted the wall with its tail during a backward motion. (f) A compositional condition where front infrared-distance was high and magnetic x-axis reading was low corresponds to facing the wall in two distinct locations in the pen: an intersection of image (b) and image (d).

The Critterbot can facilitate the generation of an enormous amount of data. Assuming a 100 ms update cycle, the Critterbot can produce over 19 million sensor readings an hour, over 150 million readings a day (assuming 8 hour runs), and $5.57 \times 10^{10}$ readings a year; that is over 222 gigabytes of raw sensor data a year. We can get a sense of the scale and complexity of the data generated by the Critterbot by looking at a snapshot over several seconds in Figure 3.6, and over an entire year (see Figure 3.7). The Critterbot can produce sufficient data for updating and evaluating a large number of predictions.

Figure 3.6: A plot of the sensorimotor data stream produced by the Critterbot. Fifty sensor readings are plotted over several seconds, sampled every 100 ms. Several sensor readings, such as the thermal readings, are well off the scale.

The sensorimotor data produced by the Critterbot exhibits different regularities depending on how the robot's actions are selected. For instance, if the robot executes persistent movements as in Figure 3.8, then the sensor readings change in expected, certainly non-random ways. The data in Figure 3.8 suggest that our agent might hope to predict the consequences of these basic movements. In Figure 3.9, we can see that different policies produce very different sensorimotor data streams.2 Sometimes the data exhibits repeated patterns, such as the light and magnetic readings from the wall-following policy. Sometimes there appears to be almost no regularity regardless of the policy, as seen in the rotational velocity plots. The data exhibits predictable regularity over seconds (e.g., Figures 3.8 and 3.9), days, and years (e.g., Figure 3.7).

2The Brownian motion dataset was collected by Joseph Modayil.

Figure 3.7: A summary of one year's worth of ambient light readings from the Critterbot. The heat map shows the light intensity with time of day on the x-axis and month of year on the y-axis, displaying one reading every ten minutes. Bright red-orange indicates high ambient light, and dark blue indicates low light. White indicates no data was available, typically when the robot was broken and powered down. Notice the seasonal change in the light intensity over the months. There is a faint light reading during most nights, corresponding to the LED lights flashing during charging.

There is a great diversity of things to predict in the sensorimotor stream of the Critterbot, and those things are potentially predictable and exhibit non-random and non-trivial variations.

3.5 Summary

In this chapter we introduced the target of our predictive knowledge learning approach: learning about low-level sensorimotor data produced by a robot interacting with its world. We described the two robot platforms that will be used in the experiments throughout this thesis: the iRobot Create and the RLAI Critterbot. Finally, we provided several plots of the sensorimotor stream generated by the Critterbot, under several different policies, along the way highlighting the potential benefits of predicting sensorimotor data.

The hardware platforms introduced in this chapter are useful for demonstrating our contributions to predictive knowledge learning. The majority of prior work on predictive knowledge (e.g., Drescher, 1991; Sutton, 2009; Sutton, 2011) has focused on small, discrete simulation worlds, with the notable exception of the work of Boots et al. (2010) on predictive representations of state. The robots described in this chapter are capable of generating dozens of sensor readings many times per second for hours, and potentially days, on end. The ability to accurately predict data of this kind would represent a development in both the applicability and scalability of predictive knowledge learning.


Figure 3.8: This figure shows variations in sensor readings from the Critterbot during persistent movements in its pen. The lefthand side of each subfigure provides a cartoon illustration of what the robot was doing, and the righthand side plots several sensor readings that occur over the course of the movement. (a) The robot executed a southward translation, with the front of the robot facing the north wall of the pen. Correspondingly, the front-facing IR distance reading decreased while the back-facing tail IR reading increased. (b) A counter-clockwise rotation near the east wall. The robot initially contacted the wall, reducing the rotational velocity (which pushes the robot westward), and then the robot freely rotated at a constant velocity completing a full rotation. (c) A westward translation while facing the north wall. The front and rear IR distance readings remained within a bounded range. The right-side IR reading dropped slowly, while the left-side IR reading slowly increased. (d) A northward translation, followed by a wall impact which caused a negative spike in acceleration and then near zero acceleration. After impact, the robot continued pushing against the wall causing a slow counter-clockwise rotation and a small increase in the inside-tail IR distance reading.


Figure 3.9: This figure describes how the action selection mechanism can affect the sensor patterns observed on the Critterbot. Each row corresponds to a different policy, including the one-second duration random actions policy, extended duration random actions policy (selecting the same action for 50 consecutive time steps), hand-coded wall following policy, and random walk in motor-voltage space via Brownian motion. Each column corresponds to one of three different sensors of interest, including ambient light (all four), magnetic x-axis, and rotational velocity. All four policies selected actions every 100 ms. The wall following policy yielded roughly cyclic patterns, while the data from the one-second duration random action policy did not yield obvious patterns. The wall following policy rotation data exhibited many small adjustments to keep the robot parallel to the wall, and periodic large changes corresponding to each corner of the pen. In summary, different methods for selecting actions can produce noticeably different sensorimotor data streams on the Critterbot.

The next chapter describes our approach to learning predictive knowledge from and about a robot's sensorimotor data stream.

Chapter 4

General Value Functions1

This chapter describes general value functions (GVFs), our proposed approach to representing predictive knowledge. The main idea is to adapt conventional value functions from reinforcement learning in order to (a) represent potentially useful predictive knowledge about the world, and (b) leverage the strengths of conventional value function learning methods. Compared to other predictive knowledge systems and other well-developed uses of prediction, the relative strength of GVFs is the practicality and scalability of learning. Considerable progress has been made in demonstrating the potential compactness of predictive representations of knowledge (Littman, Sutton, & Singh, 2000; Talvitie & Singh, 2011), learning compositional predictions (Sutton & Tanner, 2005), learning temporally abstract predictions (Sutton et al., 2006), and learning predictive knowledge from continuous inputs (Boots et al., 2010). Our GVFs build on prior work, and make an important step towards improving both the applicability of prediction specification and the scalability of prediction learning via value function estimation algorithms from reinforcement learning. Consequently, GVFs are learnable from continuous-valued inputs, such as robot sensors, and can be incrementally updated online and in parallel. In addition, GVFs can represent a broad class of predictive knowledge similar to PSRs (Littman, Sutton, & Singh, 2000), TD-nets (Sutton & Tanner, 2005), TPSRs (Boots et al., 2010), and prediction profile models (Talvitie & Singh, 2011). The focus of this chapter is to introduce and explain GVFs. We demonstrate the potential benefits of representing predictions as GVFs in later chapters. The contributions of this chapter are a description of how GVFs may represent predictive knowledge and a discussion of how GVFs relate to the literature. We begin by formalizing the setting under which we propose to learn GVFs.

1General value functions were first introduced by Maei & Sutton (2010), and proposed as an approach for predictive knowledge in two papers coauthored by this author (Sutton et al., 2011; Modayil, White & Sutton, 2014 ). The formalisms and text contained in this chapter are either adapted from co-authored sections of these two papers or originally composed for this document.

The second part of this chapter describes GVFs and how multiple GVFs may be organized and learned in parallel. The third and final part of this chapter summarizes related work, highlighting the similarities and points of distinction compared to GVFs.

4.1 The setting

The algorithmic and experimental work described in this thesis applies to a simple interface depicted in Figure 4.1, similar to the agent-environment interaction from reinforcement learning. On each discrete time-step $t = 1, 2, \ldots$, the agent observes a new feature vector and action. The feature vector is a real vector of $n$ elements, where $\mathbf{x}_t \in \mathbb{R}^n$ denotes the feature vector at time $t$. The action is an integer in a finite set: $A_t \in \{1, 2, \ldots, k\}$.

$\ldots, (\mathbf{x}_t, A_t),\ (\mathbf{x}_{t+1}, A_{t+1}),\ (\mathbf{x}_{t+2}, A_{t+2}),\ (\mathbf{x}_{t+3}, A_{t+3}), \ldots$

Figure 4.1: The learning setting: the agent passively observes an unending stream of feature vectors and discrete actions.

The objective of our learning system is to estimate a scalar signal $G_t \in \mathbb{R}$, called the target. We consider the case in which a new estimate is made repeatedly, at regular intervals, on each time-step. The estimate of the target is called the agent's prediction, and is denoted by $V_t \in \mathbb{R}$. The extent to which $V_t$ matches $G_t$ defines the accuracy of the agent's prediction. The target may be thought of as specifying a question about the agent's interaction with the world, and the agent's prediction is the agent's answer to the target's question. The target is a summary of the future. The target is computed from two other signals.

The first is called the cumulant $Z_t \in \mathbb{R}$, and plays a similar role to reward as used in reinforcement learning. As the name suggests, the target accumulates future values of the cumulant. The cumulant signal is weighted by a termination signal $\gamma_t \in [0, 1]$, that we previously defined as the discount factor in Chapter 2. The termination signal has a different interpretation compared to the discount factor in reinforcement learning, and will be explained in the following section. $G_t$ is computed from the future values of the cumulant and termination signal, beyond the current time-step:

$$G_t = Z_{t+1} + \gamma_{t+1} Z_{t+2} + \gamma_{t+1}\gamma_{t+2} Z_{t+3} + \gamma_{t+1}\gamma_{t+2}\gamma_{t+3} Z_{t+4} + \cdots \qquad (4.1)$$

The target therefore summarizes future values of the cumulant. The system's interaction evolves in a straightforward manner: on each time-step, the agent observes the current $\mathbf{x}_t$ and $A_t$, and produces a prediction $V_t$ to estimate $G_t$.

It is useful to think of the cumulant, termination, and feature vector as the outputs of functions of an unobservable MDP state. The termination function $\gamma : \mathcal{S} \to [0, 1]$ outputs the termination signal based on the agent's observation of the MDP state (defined in Section 2.1), with $\gamma_t \overset{\text{def}}{=} \gamma(S_t)$. The cumulant function, $z : \mathcal{S} \to \mathbb{R}$, and the feature function, $\mathbf{x} : \mathcal{S} \to \mathbb{R}^n$, are defined similarly. We may also view the actions as produced by some mapping from states to actions, $\mu : \mathcal{S} \times \mathcal{A} \to [0, 1]$, called the behavior policy. We will use these state-based definitions to define our general value functions.

The sensorimotor data stream produced by our robots, and discussed in Chapter 3, is manifested in the feature vectors, cumulants, and termination signals. In practice, $Z_t$ and $\gamma_t$ may be computed from any information available to the agent at the current time, including the sensorimotor data and the components of $\mathbf{x}_t$. Similarly, $\mathbf{x}_t$ may be computed from available state information. To make things more concrete, consider a simple example of predicting a robot's light sensor as it drives around a room. In this case, the feature vector could be constructed from the robot's external sensor readings, and the actions might correspond to a finite set of motor commands. The cumulant could be the current light sensor reading and the time-step could be one second long. The target could be the total future light, with future light readings weighted exponentially less each second in the future. In this example, our aim is to predict the total discounted future light based on the robot's sensors.
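To make the definition of the target concrete, the following sketch computes a truncated version of the sum in (4.1) from logged cumulant and termination values. It is an illustration only, not code from the thesis; the function name and the truncation scheme are ours.

```python
def truncated_target(cumulants, terminations, t, horizon=10000):
    """Approximate the target G_t of equation (4.1) from a logged data stream.

    cumulants[k] holds Z_k and terminations[k] holds gamma_k. The infinite sum
    is truncated after `horizon` steps, or once the running product of
    termination signals has decayed to (near) zero.
    """
    g, weight = 0.0, 1.0
    for k in range(1, horizon + 1):
        if t + k >= len(cumulants):
            break
        g += weight * cumulants[t + k]     # Z_{t+k} weighted by gamma_{t+1}...gamma_{t+k-1}
        weight *= terminations[t + k]      # extend the product of termination signals
        if weight < 1e-8:                  # remaining terms are negligible
            break
    return g
```

For the light-sensor example above, `cumulants` would hold the logged light readings and `terminations` a constant value such as 0.95 on every step.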

4.2 Cumulants

Usually in reinforcement learning, the objective of learning is to maximize a scalar reward signal. The agent approximates a value function, because an estimate of the value function is helpful for adapting the agent’s policy to accumulate more reward. Predictions of future reward are used as an intermediary step inside policy learning algorithms. Our cumulant signal plays a similar role to the reward in construction of the target, but the agent does not necessarily maximize the cumulant. The cumulant is similar to reward in that it is used to define the target, and that target is computed by summing cumulants into

the future, just as with reward. Our chief concern is predicting future cumulants rather than maximizing the cumulant signal. The cumulant may be any signal observable by the agent, even ones that may not make much sense to maximize. In fact, we may specify multiple cumulant signals for an agent to learn about, even if the agent's behavior maximizes a conventional reward signal. A cumulant could equal any scalar signal that the agent wishes to predict via the target. To make these ideas more concrete, consider an agent driving a car from one location to another. The policy that generates actions may be learned by reinforcement learning, with a constant negative reward on each time-step. We may specify several cumulants, such as a cumulant equal to the continuous fuel consumption of the car, or a cumulant equal to one when the car is too close to the car in front of it. A third cumulant could equal the temperature outside the car. This third cumulant is not obviously related to the goal of driving from one location to another quickly, but predicting future temperature may be useful for other tasks, or might be best thought of as general knowledge of the agent's environment.
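The cumulants in the driving example are easy to state as code. The sketch below is purely illustrative; the observation fields (`fuel_rate`, `gap_m`, `outside_temp_c`) are hypothetical names we introduce here, not signals defined in the thesis.

```python
# Cumulants are just observable scalar signals the agent chooses to predict.
def fuel_cumulant(obs):
    return obs["fuel_rate"]                        # continuous fuel consumption

def tailgating_cumulant(obs):
    return 1.0 if obs["gap_m"] < 10.0 else 0.0     # one when too close to the car ahead

def temperature_cumulant(obs):
    return obs["outside_temp_c"]                   # unrelated to the driving reward
```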

4.3 Termination

In conventional reinforcement learning, γ is a discount factor that weights the value of future rewards. In continuing tasks, the target would become infinite without discounting future rewards; thus, each reward in the sum must be weighted by powers of γ, and γ must be less than one. The concept has reasonable intuition: rewards in the near future are more valuable to the agent than rewards in the distant future. In episodic tasks, the discount factor is one because the episodes are always of finite length and the target is bounded, meaning all rewards are considered equally valuable.

It is useful to think of discounting as a constant probability of terminating or continuing on each time-step. In episodic tasks, γ is one on each time-step, indicating the certainty of continuing the episode on the current step $t$. When the agent reaches a terminating state $S_T$, γ becomes equal to zero, indicating zero probability of continuing the episode beyond step $T$. We call this hard termination. In a continuing task, the discount is a constant and is less than one, indicating a γ probability of continuing on the current step and a $1 - \gamma$ probability of the trajectory terminating on the current step. This may be thought of as a constant probability of terminating on each step of the trajectory, which we call soft termination. In reinforcement learning, γ is usually treated as a binary signal that only changes on termination, not during an episode. Relating termination and continuation of a trajectory to γ makes it easier to think about γ changing with time, which we discuss next.

Hypothetical terminations, reflected by $\gamma_t$, may occur at any time. Termination conventionally refers to an interruption in the normal flow of state transitions and a reset to a starting state or starting-state distribution. On the other hand, a GVF's termination signal, $\gamma_t$, does not directly interrupt the state transitions. The behavior policy may still have real terminations, different from the hypothetical terminations defined by $\gamma_t$. Just as with cumulants, we may specify many different termination signals.

Let us continue the car driving example to solidify the concepts of hypothetical termination. The policy controlling the car may experience a real termination when the car reaches its destination. We can define a valid termination signal that becomes zero at the half-way mark of the car's route, or when it begins to rain, or when the car's remaining fuel falls below some threshold. Instead of these hard, binary terminations, we could define soft terminations: for example, a constant $\gamma_t$ corresponding to the target's time horizon (larger values corresponding to longer horizons), or a value of γ related to the remaining fuel in the gas tank. The termination signal may take on any value between zero and one, based on the data stream. The termination signal also facilitates mixing soft and hard terminations; for example, a $\gamma_t$ equal to zero when the car arrives at its destination, and equal to 0.9 otherwise. In Chapter 6, we provide several different examples of cumulants and termination signals. We formalize this concept in the next section.

In the special case where $\gamma_t$ is less than one and constant, the termination signal maps to a prediction time scale in the following way. Recall that $1 - \gamma(S_t)$ corresponds to the termination probability upon entry into state $S_t$. The expected termination time for a constant-γ prediction is $\frac{\Delta t}{1 - \gamma_t}$. This is like the expected number of rolls of a die needed to get a particular outcome: with a fair six-sided die we would expect to roll it six times before observing a three. Here $\Delta t$ simply corresponds to the update cycle of the agent-environment interaction. Using this language we can talk about a prediction with $\gamma_t = 0.95$ and $\Delta t = 1$ as corresponding to a 20 step prediction.
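The mapping from a constant termination signal to a time scale is a one-line computation. The helper below is ours, added only for illustration.

```python
def expected_timescale(gamma, dt=1.0):
    """Expected termination time dt / (1 - gamma) for a constant termination signal."""
    return dt / (1.0 - gamma)

expected_timescale(0.95)            # 20.0 -- the 20 step prediction discussed above
expected_timescale(0.9875, dt=0.1)  # 8.0  -- an 8 second prediction on a 100 ms update cycle
```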

4.4 General Value Functions

The goal of this section is to describe a particular kind of value function that is well suited for representing predictive knowledge of the world. The first step is to define the GVF’s target which the agent must estimate. With a constant discount factor γ, the targets are

restricted to simple time scales in which the cumulants are weighted geometrically less the more each one is delayed, as was given by the earlier definition of the return in the background chapter. With time-variable termination signals, the weighting is not by simple powers of γ, but by products of $\gamma_t$:

$$G_t \overset{\text{def}}{=} \sum_{k=0}^{\infty} \left( \prod_{j=1}^{k} \gamma_{t+j} \right) Z_{t+k+1}. \qquad (4.2)$$

This small change results in a significant increase in the kinds of targets that may be expressed, as you will see in our experiments later (and in the literature, including Sutton, 1995; Sutton et al., 2011; Modayil, White & Sutton, 2014). If $\gamma_t$ is taken to be a constant, then $\prod_{j=1}^{k} \gamma_{t+j} = \gamma^k$, and (4.2) becomes the familiar discounted sum used in reinforcement learning. Note that for the first term in the sum of (4.2), we define $\prod_{j=1}^{0} \gamma_{t+j} = \gamma^0 = 1$.

Formally, we define a general value function, or GVF, as a function $v : \mathcal{S} \to \mathbb{R}$, with three auxiliary functional inputs: a target policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, $\gamma : \mathcal{S} \to [0, 1]$, and $z : \mathcal{S} \to \mathbb{R}$. A GVF specifies the expected value of the target, given that actions are generated according to the target policy:

$$v(s; \pi, \gamma, z) \overset{\text{def}}{=} \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad (4.3)$$

where $G_t$ is defined by (4.2), but now with respect to the given functions; the $\mathbb{E}_\pi$ was introduced in Equation 2.3 of Chapter 2. Using the state-based definitions of the cumulant and termination functions, we can ensure the expectation in (4.3) is well defined. The expectation is conditioned on the future states produced by the state-transition dynamics of the underlying MDP ($P$) and the actions selected by π. We can write the GVF's definition in terms of these state-based functions:

$$v(s; \pi, \gamma, z) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \left( \prod_{j=1}^{k} \gamma(S_{t+j}) \right) z(S_{t+k+1}) \,\Big|\, S_t = s \right] \qquad (4.4)$$

$$= \sum_{a \in \mathcal{A}} \pi(s, a) \sum_{s' \in \mathcal{S}} P(s, a, s') \left[ z(s') + \gamma(s')\, v(s'; \pi, \gamma, z) \right]. \qquad (4.5)$$

Equation (4.5) is called the GVF's Bellman equation. We can also define a state-action GVF as a function $q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, again with three auxiliary functional inputs π, γ, and $z$:

$$q(s, a; \pi, \gamma, z) \overset{\text{def}}{=} \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]. \qquad (4.6)$$

A GVF specifies a precise question about the agent's interaction with the world, and the prediction is the agent's learned approximate answer to that question. The three functions π, γ, and $z$ are referred to collectively as the GVF's question functions. Note that conventional value functions remain a special case of GVFs. In Section 4.1, we referred to the target $G_t$ as specifying a question. The precise question is defined by the GVF: the expected value of the target given that actions are selected according to the target policy. The next section describes how we can learn approximate answers to a GVF's question.
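One way to picture a GVF's question is as a small bundle of the three question functions. The sketch below is ours and only illustrative; the class name, the field names, and the `"light"` observation key are assumptions, not part of the thesis.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GVFQuestion:
    """A GVF question: the three question functions pi, gamma, and z."""
    target_policy: Callable   # pi(s, a) -> probability of taking action a in s
    termination: Callable     # gamma(s) -> value in [0, 1]
    cumulant: Callable        # z(s)     -> scalar signal to be accumulated

# The light-prediction example of Section 4.1, posed here (as an assumption of
# this sketch) with a target policy that always drives forward and a 0.95 termination.
light_question = GVFQuestion(
    target_policy=lambda s, a: 1.0 if a == "forward" else 0.0,
    termination=lambda s: 0.95,
    cumulant=lambda s: s["light"],
)
```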

4.5 Learning GVFs

We begin by introducing approximate GVFs. Our approximate GVF, denoted $\hat{v} : \mathcal{S} \times \mathbb{R}^n \to \mathbb{R}$, approximates the expected value of the target $G_t$:

$$\hat{v}(s, \mathbf{w}) \approx v(s; \pi, \gamma, z), \qquad (4.7)$$

where $\mathbf{w} \in \mathbb{R}^n$ is the weight vector to be learned. Similarly, the approximate state-action GVF, $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^m \to \mathbb{R}$, approximates the state-action GVF:

$$\hat{q}(s, a, \mathbf{w}) \approx q(s, a; \pi, \gamma, z). \qquad (4.8)$$

Note the weight vector for the state-action function can be a different size than the weight vector for the state function; for convenience we overload $\mathbf{w}$ in this way. Our approximate GVFs are linear in the feature vector. That is, we assume that the state of the world at time $t$ is characterized by the feature vector $\mathbf{x}_t = \mathbf{x}(S_t)$ with $\mathbf{x} : \mathcal{S} \to \mathbb{R}^n$, and $\hat{v}$ is approximated with a finite number of weights, $\mathbf{w} \in \mathbb{R}^n$, giving prediction $V_t$:

$$V_t \overset{\text{def}}{=} \hat{v}(S_t, \mathbf{w}_t) = \mathbf{x}_t^\top \mathbf{w}_t. \qquad (4.9)$$

Similarly, the approximate state-action GVF is also a linear function of a feature vector $\mathbf{x}_{a,t} = \mathbf{x}(S_t, A_t)$ with $\mathbf{x} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^m$, giving the prediction:

$$Q_t \overset{\text{def}}{=} \hat{q}(S_t, A_t, \mathbf{w}_t) = \mathbf{x}_{a,t}^\top \mathbf{w}_t. \qquad (4.10)$$

In conventional reinforcement learning, $V_t$ and $Q_t$ are referred to as the agent's state value estimate and the state-action value estimate, respectively. From here on, we refer to $Q_t$ as simply the agent's prediction. We adopt a linear approach to function approximation, common in reinforcement learning; this is a convenient special case, but is not essential to our approach.

An approximate GVF can be learned using temporal difference learning algorithms from reinforcement learning. A GVF is similar to a conventional value function, except the termination signal may change over time. The learning algorithms remain unchanged in form and computational complexity. All the temporal difference learning methods described in the background chapter remain the same, except γ is replaced by $\gamma_t$ or $\gamma_{t+1}$, and $Z_{t+1}$ replaces $R_{t+1}$. The TD(λ) algorithm for learning approximate state GVFs from a sequence of feature vectors and actions is given by (introduced in Modayil, White & Sutton 2014):

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left( Z_{t+1} + \gamma_{t+1} \mathbf{x}_{t+1}^\top \mathbf{w}_t - \mathbf{x}_t^\top \mathbf{w}_t \right) \mathbf{e}_t$$

$$\mathbf{e}_t = \gamma_t \lambda\, \mathbf{e}_{t-1} + \mathbf{x}_t.$$

The GTD(λ) algorithm for learning state GVFs from a sequence of feature vectors and actions is given by (introduced in Maei 2011):

$$\delta_t = Z_{t+1} + \gamma_{t+1} \mathbf{x}_{t+1}^\top \mathbf{w}_t - \mathbf{x}_t^\top \mathbf{w}_t$$

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left[ \delta_t \mathbf{e}_t - \gamma_{t+1}(1 - \lambda)(\mathbf{e}_t^\top \mathbf{h}_t)\, \mathbf{x}_{t+1} \right]$$

$$\mathbf{h}_{t+1} = \mathbf{h}_t + \alpha_h\left[ \delta_t \mathbf{e}_t - (\mathbf{h}_t^\top \mathbf{x}_t)\, \mathbf{x}_t \right]$$

$$\mathbf{e}_t = \rho_t\left( \gamma_t \lambda\, \mathbf{e}_{t-1} + \mathbf{x}_t \right).$$
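The GTD(λ) updates above translate directly into code. The NumPy sketch below is our own transcription, written only to make the update order concrete (compute $\delta_t$ and $\mathbf{e}_t$, then update $\mathbf{w}$ using the old $\mathbf{h}$, then update $\mathbf{h}$); the step-sizes and the importance-sampling ratio $\rho_t$ are supplied by the caller.

```python
import numpy as np

class GTDLambda:
    """A sketch of the GTD(lambda) updates given above (not code from the thesis)."""

    def __init__(self, n, alpha, alpha_h, lam):
        self.w = np.zeros(n)   # primary weights: the approximate GVF
        self.h = np.zeros(n)   # auxiliary weights used by the gradient correction
        self.e = np.zeros(n)   # eligibility trace
        self.alpha, self.alpha_h, self.lam = alpha, alpha_h, lam

    def update(self, x_t, z_next, gamma_t, gamma_next, x_next, rho_t):
        """One update from the transition (x_t, A_t) -> (Z_{t+1}, x_{t+1})."""
        prediction = x_t.dot(self.w)                       # V_t, before learning
        delta = z_next + gamma_next * x_next.dot(self.w) - prediction
        self.e = rho_t * (gamma_t * self.lam * self.e + x_t)
        self.w += self.alpha * (delta * self.e
                                - gamma_next * (1.0 - self.lam)
                                * self.e.dot(self.h) * x_next)
        self.h += self.alpha_h * (delta * self.e - self.h.dot(x_t) * x_t)
        return prediction
```

With $\rho_t = 1$ on every step the algorithm operates on-policy; the same structure, with state-action features and an expectation over the target policy's actions, gives GQ(λ) below.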

Finally, the GQ(λ) algorithm for learning state-action GVFs from a sequence of feature vectors and actions is given by (introduced in Maei & Sutton 2010):

$$\delta_t = Z_{t+1} + \gamma_{t+1} \bar{\mathbf{x}}_{t+1}^\top \mathbf{w}_t - \mathbf{x}_{a,t}^\top \mathbf{w}_t$$

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left[ \delta_t \mathbf{e}_t - \gamma_{t+1}(1 - \lambda)(\mathbf{e}_t^\top \mathbf{h}_t)\, \bar{\mathbf{x}}_{t+1} \right]$$

$$\mathbf{h}_{t+1} = \mathbf{h}_t + \alpha_h\left[ \delta_t \mathbf{e}_t - (\mathbf{h}_t^\top \mathbf{x}_{a,t})\, \mathbf{x}_{a,t} \right]$$

$$\mathbf{e}_t = \rho_t \gamma_t \lambda\, \mathbf{e}_{t-1} + \mathbf{x}_{a,t},$$

where $\bar{\mathbf{x}}_t \overset{\text{def}}{=} \sum_{a \in \mathcal{A}} \pi(S_t, a)\, \mathbf{x}(S_t, a)$. The nature of the agent's approximation of a GVF is governed by the feature vector generation scheme, the eligibility trace-decay rate λ, and the behavior policy, which can be thought of as functions of state. The current state feature vector2 can be defined as the output of a feature generating function $\mathbf{x} : \mathcal{S} \to \mathbb{R}^n$ where $\mathbf{x}_t = \mathbf{x}(S_t)$, as described in

Chapter 2. Similarly, λ can be defined as a function $\lambda : \mathcal{S} \to \mathbb{R}$ where $\lambda_t = \lambda(S_t)$. In all our experiments we will consider the special case where λ is a constant function. For simplicity we have not included a time index on λ in any of our equations, to highlight our focus on this special case.

2We define the state-action feature function in an analogous way.

We call these three functions the answer functions. The majority of the experiments contained in this thesis focus on the case where these three functions are given in the experimental design: the setting we introduced in the beginning of this chapter.

Learning approximate GVFs does not necessitate the invention of new learning algorithms; we can learn the predictive knowledge specified by a GVF with existing value function learning algorithms from reinforcement learning. Specifically, we use temporal difference methods that minimize the mean squared projected Bellman error (MSPBE) and require linear computation per time-step (in the size of the feature vector): TD(λ) and GTD(λ) to approximate state GVFs ($v(s; \pi, \gamma, z)$), and GQ(λ) to approximate state-action GVFs ($q(s, a; \pi, \gamma, z)$). These methods have several desirable properties, including incremental updating during interaction, robustness to non-stationary settings, compatibility with off-policy sampling, convergence guarantees, and compatibility with linear and non-linear function approximation. Chapter 5 discusses the possibility of using quadratic temporal difference methods that also minimize the MSPBE (e.g., LSTD(λ)) for GVF learning, and Appendix B contains a discussion on the possibility of using algorithms that minimize a different objective, the mean-square Bellman error or MSBE.

4.6 Independence of predictive span

An important computational property of incremental multi-step prediction learning algorithms is independence of the span of a prediction. The span of a prediction is the number of time-steps elapsing between when the prediction is made and when its target value is known. In this thesis, the time-step is taken as discrete, and therefore the span is always an integer. An example span would be seven days or less, if the agent's prediction was about rain at the end of the week. An algorithm achieves span-independent computation if the computation and storage used to update and make each prediction on each time step are independent of the span of the prediction. Span independence is a computational statement; it is not a statement regarding how much data is needed to learn an accurate prediction. The significance of independence of predictive span was first introduced by Sutton & van Hasselt (in prep.).

Let us consider a particular learning algorithm that does not achieve independence of span in its update mechanism. Consider predicting the precise value of the cumulant $Z$ at some termination time $T$. The learning procedure begins at time $t$ equals zero and proceeds in the usual way with the agent observing a feature vector, $\mathbf{x}_t$, and making predictions, $V_t$,

on each time-step until $Z$ is observed at time $T$. Once $Z$ is observed, we could update all the previous predictions, $V_0$ through $V_{T-1}$, toward $Z$ using an LMS rule:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left( Z - \mathbf{w}_t^\top \mathbf{x}_t \right) \mathbf{x}_t \qquad t = 0, 1, \ldots, T - 1.$$

The algorithm cannot update any of the predictions until $Z$ is observed. If the algorithm waits for $Z$ to be known, then the agent must remember $T$ previous feature vectors and make all the updates at once, and that is not span independent.3 An algorithm may also violate span independence in the way it computes or makes new predictions. One-step prediction models, popular in time-series forecasting (Ljung, 1998), are not span independent if they are iterated to form multi-step predictions. Consider predicting rainfall. As before, the learning procedure begins at time zero and proceeds

with the agent observing the amount of rainfall $Z_t$, and making predictions $V_{t+1}$ about the amount of rainfall to be observed on the next time-step, $Z_{t+1}$. Instead of using a feature vector for prediction, the algorithm keeps a finite-length history of previously observed rainfall, $[Z_t, Z_{t-1}, \ldots, Z_{t-k}]$. We can construct a one-step prediction model $M$ that outputs an estimate of what the rainfall will be on the next time-step based on the history:

$$M([Z_t, Z_{t-1}, \ldots, Z_{t-k}]) = V_{t+1} \approx Z_{t+1}.$$

In order to estimate rainfall two steps into the future, we iterate the model, using the model's estimate of rainfall on time-step $t + 1$ as a stand-in for $Z_{t+1}$:

$$M([V_{t+1}, Z_t, Z_{t-1}, \ldots, Z_{t-k+1}]) = V_{t+2} \approx Z_{t+2}.$$

Producing a $k$-step prediction typically involves iterating the model $k$ steps, and again that is dependent on the span of the prediction.

Span independence is a useful attribute for an algorithm to have. Although available computation continues to grow yearly, our agent's resources will be limited. A learning and prediction-making algorithm that is not span independent may have to balance the span and number of predictions it learns. In addition, a non-span-independent algorithm may take considerable time just to output a prediction, if the span of the prediction is large.

Estimating GVFs with temporal difference methods and making predictions with approximate GVFs is span independent. The GVF's termination signal, $\gamma_t$, determines the span of a GVF's prediction. The computation and storage needed to estimate a value function (learning $\mathbf{w}$) is independent of $\gamma_t$. Because of the recursive definition of a GVF given

3This case cannot be handled independent of span with GVFs either. Nevertheless, we can form a GVF whose prediction terminates upon observing some event (as you will see in Chapter 6).

by the Bellman equation (4.5), the value function can be estimated incrementally with constant computation per step with temporal difference learning methods (e.g., the TD(λ) algorithm). In addition, producing a new prediction (e.g., computing $V_t$) involves the computation of a simple dot product, and is clearly span independent. A temporal difference based algorithm pays the same computational price for updating and making both short and long term predictions. Span independence is a desirable property, and it is achievable with temporal difference methods.
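The contrast between the two ways of producing a multi-step prediction can be made concrete in a few lines. The sketch below is ours: a GVF prediction is one dot product regardless of time scale, while a one-step model must be applied $k$ times over the hypothetical rainfall window used above.

```python
import numpy as np

def gvf_prediction(x_t, w):
    """Span independent: one dot product, whatever time scale gamma encodes."""
    return x_t.dot(w)

def iterated_prediction(model, history, k):
    """Span dependent: a k-step forecast needs k applications of a one-step model.

    `history` is the window [Z_t, Z_{t-1}, ..., Z_{t-k}]; each iteration slides
    the window forward, substituting the model's own estimate for the unseen value.
    """
    history = list(history)
    for _ in range(k):
        estimate = model(history)
        history = [estimate] + history[:-1]
    return history[0]
```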

4.7 A Horde of Demons

In the conventional on-policy setting, the GVF's question is about the course of actions specified by the behavior policy, and thus π = µ. In the off-policy setting, π ≠ µ, and the GVF's question is about a hypothetical course of action defined by the target policy π. Just as we may define multiple termination signals for the same stream of experience generated by the agent's behavior, we may also define multiple target policies. Each GVF specifies a policy-contingent question about the expected future cumulant if the agent were to execute some target policy π. The agent's prediction answers the what-if question specified by a GVF.

Let us continue with the car driving example to solidify the concepts of hypothetical policies. As we drive home we might think about how long it would take to reach the store if we deviated from the current course, or how long it will take to get home if we took a different route. Finally, we may consider other target policies completely unrelated to driving. For example, am I likely to feel pain in the next couple of minutes if I attempt to shave while the car is moving? Figure 4.2 provides several other concrete examples of GVFs, specified by different cumulants, terminations, and target policies.

Off-policy learning methods, such as GTD(λ) and GQ(λ), update the agent's approximation to the GVF using snippets of experience where the actions selected by the behavior policy are similar to the actions that the target policy would take. For example, we can update the prediction about how long it takes to get to the store as we drive home, because the store is along the way; we are getting relevant experience about driving to the store as we drive home. At any given time, multiple target policies corresponding to multiple GVFs may match the behavior's action choices, and in this case, multiple approximate GVFs can be updated off-policy and in parallel. The Horde architecture consists of a main agent composed of many sub-agents, called demons.

charge-current flag, 0.9875 (8 seconds), docking policy

down-facing IR reading, connected to charger, docking policy

bump-sensor reading, 0.95 (2 seconds), drive forward

Figure 4.2: Several GVFs can be specified for a robot exploring its world. In each line above, the first entry specifies the cumulant signal, the second the termination signal, and the third the target policy of the GVF. In this example, the behavior policy is random, selecting from a discrete set of actions with equal probability. We can specify several GVFs and approximate each in parallel from a single interaction stream produced by one robot.

Each demon is an independent reinforcement-learning agent responsible for learning one piece of predictive knowledge about the main agent's interaction with its environment. Each demon learns an approximation to the GVF that corresponds to the setting of the three question functions, π, γ, and z. Each demon learns its approximate GVF using a temporal difference method: TD(λ) for on-policy demons, and GTD(λ) or GQ(λ) for off-policy demons. We consider two kinds of demons. A prediction demon learns an approximate GVF given a target policy π; the target policy is not learned. A control demon learns its target policy, for example, a greedy policy with respect to its own approximate state-action GVF

(i.e., $\pi = \mathrm{greedy}(\hat{q})$, or $\pi(S_t, \arg\max_{a \in \mathcal{A}} \hat{q}(S_t, a, \mathbf{w})) = 1$). Control demons can learn and represent how to achieve goals, whereas the knowledge in prediction demons is better thought of as declarative facts. The Horde architecture supports parallel demon learning. Each demon updates its

corresponding approximate GVF. Each demon may use an independent feature vector generation process, but in all the experiments in this document, the demons used a common shared feature vector. Finally, all demons update their approximate GVFs from a shared common sequence of actions produced by the behavior policy. Given parallel computing hardware, many demons can be updated in parallel. Figure 4.3 provides a diagram of how many demons might be arranged in a single learning system.

The demons in the Horde architecture are nearly independent, but we do allow them to interact in two ways. One way in which the demons are not completely independent is that one demon can reference the approximate GVF or target policy of another demon. For example, in this way we could ask questions such as "If I follow this wall as long as I can, will my light sensor then have a high reading?". In this case, a prediction demon is referencing the target policy of a control demon. Demons may also use each other's answers in their questions. For example, one demon might learn a prediction of near obstacle: the probability of a high proximity-sensor reading after executing a random wander. Then a second demon could learn a different prediction dependent on the first demon's prediction. For example, "If I follow this wall to its end, will I then be near an obstacle?". This can be encoded using the first demon's approximate GVF in the second demon's cumulant function

(e.g., $z(S_t) = (1 - \gamma_t) \max_{a \in \mathcal{A}} \hat{q}(S_t, a, \mathbf{w}_{\text{first demon}})$, with $\gamma_t = 1$ during wall following and $\gamma_t = 0$ at completion). This is an example of a compositional prediction: a prediction about the outcome of another prediction. The second way in which the demons are not completely independent is that the predictions of one or more demons may be used as input to the feature vector generation process.

The prediction, $V_t$, of a prediction demon from time-step $t$ may participate in the generation of the next feature vector $\mathbf{x}_{t+1}$. If the feature vector was produced by tile coding a robot's instantaneous sensor readings, for example, then $V_t$ could participate in the generation of $\mathbf{x}_{t+1}$ by tile coding $V_t$ with the sensor readings observed on time-step $t + 1$. This method of generating feature vectors may be thought of as a form of predictive state representation. Figure 4.3 illustrates how predictions can participate in feature vector generation. Chapter 6 presents a robot learning experiment where predictions are used as input to the feature generation process to improve off-policy prediction learning.
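The following sketch shows one simplified way a prediction could be folded into the next feature vector. It is not the thesis's tile coder: each signal is binned into a single one-hot block rather than into several overlapping tilings, the sensor range and bin count are made up, and the prediction is assumed to have been rescaled into the same range as the sensors.

```python
import numpy as np

def features_with_prediction(sensors, prediction, bins=8, value_range=(0.0, 1023.0)):
    """Build a sparse binary feature vector from sensor readings plus the previous
    step's (rescaled) prediction, each discretized into a one-hot block."""
    lo, hi = value_range
    blocks = []
    for v in list(sensors) + [prediction]:
        idx = int(np.clip((v - lo) / (hi - lo) * bins, 0, bins - 1))
        block = np.zeros(bins)
        block[idx] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)
```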


Figure 4.3: A Horde of demons. This diagram shows a possible arrangement of multiple prediction and control demons, each updating their approximate GVFs from a shared fea- ture vector. The feature vector in this example is produced by a sparse recoding of input sensory data (such as tile coding). Some of the predictions are fed into the sparse re-coder, enabling predictions to participate in the generation of the next feature vector.

4.8 Related approaches

In this section we review prior work on predictive knowledge learning, highlighting the similarities to our approach based on GVFs. We conclude the review with a summary of how our approach is distinct from prior work, and how we extend the practicality and scalability of predictive knowledge learning.

The idea that an agent's knowledge may be represented as predictions dates back to at least the work of Cunningham (1972) and Becker (1973), who investigated constructivist approaches to learning and cognition. One of the main ideas behind constructivism is that the agent constructs its own understanding of the world, rather than a human building it in. One of the early computer simulations of prediction learning and constructivism was performed by Drescher (1991). Drescher built a computer program to simulate an infant learning, inspired by the developmental learning stages described by Piaget. The simulation contained a movable hand, an eye, and two moveable objects in a grid. Drescher reported on a single run of his program, lasting a couple of hours before it ran out of memory. Although Drescher's thought experiments were more advanced than his computer simulations, his program exhibited acquisition of several high-level concepts, such as the expected outcome when closing the hand around an object. Drescher's schemas are similar to GVFs conditioned on abstract actions about abstract outcomes (both given a priori) rather than weighted sums of cumulants.

Another approach involves modeling the world solely in terms of predictions about the low-level data stream produced by an agent's interaction with the world. This approach was

explored as predictive state representations or PSRs (Littman, Sutton & Singh, 2002) and the closely related observable operator models (Jaeger, 1997). A PSR represents the state of the world by a real-valued vector of predictions about the outcomes of experiments or tests. A test is a sequence of actions and observations. The agent can compute the probability of any test succeeding, starting at the current time-step, via a linear or non-linear projection from the probability of success of each of a minimal set of core tests. The vector of probabilities of each core test succeeding provides a complete summary of the entire history of the agent's interaction with the world, and is thus called state. PSRs are strictly more general than other models for partially observable domains including partially observable MDPs (POMDPs) and k-Markov models (Singh, James, & Rudary, 2004), and potentially more compact (Littman, Sutton & Singh, 2002; Singh, James & Rudary, 2004). Because a PSR is defined using only observable quantities, unlike a POMDP with hidden states, PSR parameter learning can occur without human-specified models, and should be less susceptible to local minima compared to POMDPs (Singh, James, & Rudary, 2004). Learning the parameters of a PSR involves learning the weights used in the projection (see Wolf, James & Singh, 2005). Finding the minimal set of core tests involves a combinatorial search (McCracken & Bowling, 2006). GVFs are different from PSRs in that a GVF specifies an expectation about the future values of a single cumulant signal, whereas a PSR models the success of tests of all possible lengths, given the history. A demon's feature vector may include components computed from other demons' predictions, enabling predictive representations, but unlike PSRs, a demon's feature vector can contain non-predictive components as well.

Another approach to low-level prediction learning, called transformed PSRs or TPSRs, attempts to improve PSR parameter learning using a singular-value decomposition (SVD) and features instead of action-observation primitives. A TPSR transforms the learning problem by multiplying the matrix of test probabilities, learned from data, by a matrix produced by an SVD. Learning with continuous inputs is achieved by estimating the probabilities of sequences of feature vectors, rather than a sequence of observations as in PSRs. The TPSR approach reduces the dimensionality of PSR learning with an SVD and then computes the parameters of the TPSRs based on a batch of data. This approach was pioneered in subspace identification (Van Overschee & Moor, 1996), where SVD-based approaches achieve state-of-the-art modeling (e.g., the N4SID method). TPSRs have been applied to high-dimensional continuous observation systems such as robots (Boots et al., 2010). Recent extensions enable online parameter estimation during learning (Boots & Gordon, 2011), though the system has never been used in this way due to the computational cost of TPSR

learning algorithms. To improve both the efficiency and scalability of prediction learning, local modeling techniques have also been explored. Both PSRs and TPSRs specify a complete representation of an agent's interaction with an unknown environment. The core tests can be used to predict any length action-observation sequence: a PSR learns and makes all possible predictions. In many domains it may be either infeasible to learn this full model, or it may not be practically useful for the agent. For example, a robot with infrared distance sensors may predict how long it takes to cross the room without representing the dynamics of the overhead lights in the room. Recently, Talvitie and Singh (2011) explored learning a subset of a PSR's tests, called prediction profile models, rather than learning the full PSR. A prediction profile model is similar to a demon, in that it specifies a single independent predictive question of interest, and a collection of demons forms a kind of distributed model of the agent's interaction with an unknown world, like a collection of prediction profile models.

Temporal difference networks, or TD-nets, are another approach to learning predictions about the agent's interaction with the world. A TD-net represents and estimates a collection of scalar predictions arranged in an interconnected network (Sutton & Tanner, 2005). Network nodes represent primitive observations, or predictions. Each non-primitive node predicts the output of another node, conditioned on a discrete action. This arrangement enables compositional predictions where one node may predict the output of another node, which in turn predicts the output of another node, and so on. In addition, a node's prediction may be conditioned on an option policy and updated off-policy (as in Sutton, Rafols & Koop, 2006). The arrangement of nodes in the network defines each prediction's question, and is called the question network. The approximation architecture and learning rules used to compute the estimates of the predictions define the answers to the predictions' questions, and are called the answer network. A TD-net's predictions can be compositional and policy contingent, just like a GVF. The main algorithmic differences between TD-nets and Horde are that Horde has a more straightforward handling of state and function approximation, and Horde uses more efficient algorithms for off-policy learning (Maei & Sutton 2010; Maei 2011).

Option models, from reinforcement learning, are similar to GVFs. An option model is a partial model of the world that consists of predictions about which states the option terminates in, and the cumulative reward observed during the option. An agent may learn many option models, and update them in parallel using off-policy learning algorithms (as done by Singh, Barto & Chentanez, 2005). Option models use a single global reward signal,

whereas each GVF has its own cumulant. To construct an option to achieve a goal, we define a special goal reward function. A GVF can achieve the same effect using a specific instantiation of the cumulant signal (discussed in Chapter 6)4, without the need for a special outcome variable in the update equations of the learning algorithm.

Horde builds on prior work, achieving online and off-policy learning from continuous sensor streams generated by a robot. Previous work was done in small simulations, with the work on TPSRs being the notable exception. In comparison with Horde, TPSRs do not support policy-contingent prediction, and empirical demonstrations have been limited to learning from a few thousand samples offline (not learning while collecting data). Our results with Horde demonstrate incrementally learning thousands of predictions, on- and off-policy, from hours of data, while maintaining a 100 ms update cycle. Later chapters of this thesis demonstrate the practicality and scalability of our architecture with experiments on two robot platforms as well as simulation worlds. Although we make no formal arguments about the representational ability of GVFs, our empirical demonstrations illustrate that GVFs can capture much predictive knowledge. Taken together, our approach and empirical demonstrations further the development of predictive knowledge learning.

4.9 Summary

In this chapter we formalized the setting under which we propose to learn predictive knowledge. We introduced GVFs, a new way to represent predictive knowledge. We described how many GVFs can be organized and learned by a Horde of demons. The chapter closed with a description of previous efforts to represent and learn knowledge that is predictive.

A variety of predictions may be represented by a general kind of value function called a GVF. Because GVFs are value functions, they can be learned online with computationally efficient value function estimation algorithms that are compatible with function approximation. Compared with other predictive representations of knowledge, GVFs appear to be more suitable for online, incremental updating from continuous outputs, and are potentially more scalable because of the linear, span-independent computation and parallel updating of value function learning. At this point, these supposed benefits remain, for the most part, theoretical. The next few chapters provide more concrete illustrations of the usefulness of GVFs for representing predictive knowledge, with experiments on robots.

4This simplification was first observed by Joseph Modayil.

Chapter 5

Nexting1

The term nexting has been used by psychologists to refer to the natural inclination of people and many animals to continually predict what will happen next. In this chapter, we describe how a robot can learn to next in realtime, making 6000 predictions, each computed as a function of 6000 features of the state. These predictions are formulated as GVFs. The results presented in this chapter demonstrate how GVFs can be used to represent an interesting and important class of predictions, and that these predictions can be learned, at scale, using conventional value-function learning algorithms with function approximation.

5.1 Predicting what will happen next

Psychologists conjecture that people and animals continually make large numbers of short-term predictions about their sensory input (Gilbert, 2006; Brogden, 1939; Pezzulo, 2008; Carlsson et al., 2000). When a person hears a melody they might predict what the next note will be or when the next downbeat will occur. When you see an object in flight, hear your own footsteps, or handle an object, you seem to continually make and confirm multiple predictions about your sensory input. Our surprise and disappointment when these predictions are disconfirmed (an unexpected note or a sudden change in direction) suggests we are engaged in continual prediction. When you ride a bike you have finely tuned moment-by-moment predictions of whether you will fall and of how your trajectory will change as a result. In all these examples, we continually predict what will happen to us next. Making predictions of this simple, personal, short-term kind has been called nexting by Gilbert (2006).

1The approach to nexting, experimental results, and text contained in this chapter are either adapted from co-authored sections of several papers (Modayil, White, Pilarski & Sutton, 2012; Modayil, White, Pilarski & Sutton, 2012b; Modayil, White & Sutton, 2012; Modayil, White & Sutton, 2014) or originally generated for this document.


Figure 5.1: Examples of the variance of sensorimotor data at different time scales on the robot: (a) acceleration varying over tenths of a second, (b) motor current varying over fractions of a second, (c) infrared distance varying over seconds, and (d) ambient light varying over tens of seconds. The ranges of the sensorimotor data vary across the different sensor types.

Nexting predictions are specific to one individual's environment, capabilities, and experiences. Predictions of the stock market, of political events, or of technological trends seem to involve cultural and communal knowledge, are fewer in number, and are about much longer time scales. Nexting, on the other hand, is the continual process of making massive numbers of short-term predictions in parallel. Nexting predictions appear to happen automatically and unconsciously. Moreover, nexting predictions seem to be made simultaneously at multiple time scales. When we read, for example, it seems likely that we next at the letter, word, and sentence levels, each involving substantially different time scales (Gilbert 2006). In a similar fashion to these regularities in a person or animal's experience, our robot observes predictable regularities at time scales ranging from tenths of seconds to tens of seconds (Figure 5.1).

The ability to predict and anticipate has often been proposed as a key part of human and animal intelligence (e.g., Tolman, 1951; Hawkins & Blakeslee, 2004; Butz, Sigaud & Gérard, 2003; Wolpert, Ghahramani & Jordan, 1995; Clark, 2012). The modern view of the basic learning and behavioral modification known as classical conditioning says that people

and a wide variety of animals learn and make simple predictions at a range of short time scales (Rescorla, 1980; Pavlov, 1927). In a standard classical conditioning experiment, an animal is repeatedly given a neutral sensory cue followed by a special stimulus that invokes a built-in reflex response. For example, the sound of a bell might be followed by a shock to the paw, which causes the animal to retract its limb. After a while the limb starts to be retracted early, in response to the bell. This is interpreted as the bell causing a prediction of the shock, which then triggers limb retraction. In other experiments, known as sensory preconditioning (Brogden, 1939; Rescorla, 1980), it has been shown that animals learn predictive relationships between neutral stimuli (e.g., light or tone). These stimuli are neither inherently good nor bad (unlike food or shock) and are not connected to a built-in response. In this case, the animal still makes predictive associations between the stimuli even though there is no behavioral evidence of the association; later experiments must be performed to show prediction is occurring. Animals seem to be wired to learn many predictive relationships in their world.

To be able to next is to have a basic kind of knowledge about how the world works in interaction with one's body. It is to have a limited form of forward model (Jordan & Rumelhart, 1992) of the world's dynamics, specifically, "What will happen next if I continue my current course of action?". In the next section, we show how nexting predictions can be represented as state GVFs and learned using the TD(λ) algorithm.

5.2 Nexting as multiple value functions

We will represent each nexting prediction with a GVF, learned by a prediction demon. We are interested in learning to predict the multi-dimensional sensorimotor data produced by a robot interacting with the world (see Figure 5.2). The data is multi-dimensional, and thus we must specify multiple cumulant signals. The value at time $t$ of the cumulant signal pertaining to the $i$th GVF is denoted $Z_t^{(i)} \in \mathbb{R}$. The $i$th prediction itself, denoted $V_t^{(i)} \in \mathbb{R}$, is meant to estimate a GVF with a constant termination signal denoted $\gamma^{(i)} \in [0, 1)$, and actions selected by π:

$$V_t^{(i)} \approx \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} (\gamma^{(i)})^k Z_{t+k+1}^{(i)} \,\Big|\, S_t = s \right] = \mathbb{E}_\pi\left[ G_t^{(i)} \mid S_t = s \right], \qquad (5.1)$$

where $G_t^{(i)}$ denotes the current value of the target of the $i$th GVF. In this on-policy setting, the behavior policy is equal to the target policy π. In experimental results, the cumulant corresponded to one of the robot's sensors or else a component of a feature vector, and the termination signal was one of four fixed values, $\gamma^{(i)} \in \{0, 0.8, 0.95, 0.9875\}$, corresponding to time scales ($T$ values) of 0.1, 0.5, 2, or 8 seconds. In the following experiments the update cycle of our robot was set to 10 times a second, and thus our {1, 5, 20, 80} step predictions correspond to 0.1, 0.5, 2, or 8 seconds.

Each demon's prediction $V_t^{(i)}$ is formed as an inner product of $\mathbf{x}_t$ with a corresponding weight vector $\mathbf{w}_t^{(i)}$:

$$V_t^{(i)} = \mathbf{x}_t^\top \mathbf{w}_t^{(i)} = \sum_{j=1}^{n} x_t(j)\, w_t^{(i)}(j).$$

In our experiments, the feature vectors had $n = 6065$ components, but only a fraction of them were nonzero, so the sums could be very cheaply computed. We used linear TD(λ) inside each prediction demon for learning the weight vectors:

$$\mathbf{w}_{t+1}^{(i)} = \mathbf{w}_t^{(i)} + \alpha\left( Z_{t+1}^{(i)} + \gamma^{(i)} \mathbf{x}_{t+1}^\top \mathbf{w}_t^{(i)} - \mathbf{x}_t^\top \mathbf{w}_t^{(i)} \right) \mathbf{e}_t^{(i)}, \qquad (5.2)$$

where a common step-size parameter and feature vector are shared amongst all demons, and $\mathbf{e}_t^{(i)}$ is the eligibility trace vector for each demon. As usual, $\mathbf{e}_t^{(i)}$ is initially set to zero and then updated on each step by:

$$\mathbf{e}_t^{(i)} = \gamma^{(i)} \lambda\, \mathbf{e}_{t-1}^{(i)} + \mathbf{x}_t, \qquad (5.3)$$

where the trace-decay parameter λ ∈ [0, 1] is shared amongst all demons.

The TD(λ) algorithm has been used as a model of classical conditioning (Sutton & Barto, 1990) within which various different stimuli are viewed as playing the role of reward in the learning algorithm. The approach to nexting taken here can be seen as taking this approach to the extreme, using GVFs to predict a large number of a great variety of reward-like time series at many time scales (cf. Sutton 1995, Sutton & Tanner 2005).

Under common assumptions and a decreasing step-size parameter, TD(λ) with λ = 1 converges asymptotically to the weight vector that minimizes the mean squared error between the prediction and the target (5.1). In practice, smaller values of λ are almost always used because they can result in significantly faster learning (e.g., see Sutton & Barto 1998, Figure 8.15), but the λ = 1 case still provides an important theoretical touchstone.
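Because every demon shares the same feature vector and differs only in its cumulant, termination, and weights, the per-demon TD(λ) update of equations (5.2) and (5.3) is easy to run for thousands of demons in one pass per time-step. The sketch below is ours, written with dense NumPy vectors for clarity; an implementation at the scale reported in this chapter would exploit the sparse, binary features so that each update touches only the active components.

```python
import numpy as np

class NextingDemon:
    """One prediction demon: linear TD(lambda) with a constant termination gamma,
    following equations (5.2) and (5.3). A sketch, not the thesis's code."""

    def __init__(self, n, gamma, alpha, lam):
        self.w = np.zeros(n)
        self.e = np.zeros(n)
        self.gamma, self.alpha, self.lam = gamma, alpha, lam

    def update(self, x_t, z_next, x_next):
        self.e = self.gamma * self.lam * self.e + x_t                      # (5.3)
        delta = z_next + self.gamma * x_next.dot(self.w) - x_t.dot(self.w)
        self.w += self.alpha * delta * self.e                              # (5.2)
        return x_next.dot(self.w)   # the demon's new prediction

def update_all(demons, z_next, x_t, x_next):
    """Update every demon from the same transition; z_next[i] is demon i's cumulant Z_{t+1}."""
    return [d.update(x_t, z, x_next) for d, z in zip(demons, z_next)]
```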

$$w_*^{(i)} = \arg\min_{w} \sum_{t=1}^{N} \left(x_t^\top w - G_t^{(i)}\right)^2. \qquad (5.4)$$

The best static weight vector can be computed offline by standard algorithms for solving large least-squares regression problems. In the experiments that follow, we computed the

least-squares solution for the best static weight vector, using an eigendecomposition to invert the sample covariance matrix of feature vectors. We reduced the dimension of the system by thresholding based on the eigenvalues. This algorithm uses $O(n^2)$ memory and $O(Nn^2)$ computation (because $N$ is much larger than $n$), per prediction, and is just barely tractable for offline use at the scale we consider here (in which $n = 6065$). Although this algorithm is not practical for online use, its solution $w_*^{(i)}$ provides a useful performance standard. Note that even the best static weight vector will incur some error. It is even theoretically possible that an online learning algorithm could perform better than $w_*^{(i)}$, by adapting to gradual changes in the world or robot.
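To make the per-demon computation concrete, the following is a minimal sketch of the linear TD(λ) update of Equations 5.2 and 5.3 for a single prediction demon. It is our own illustration rather than the thesis implementation: the class name, the dense NumPy representation, and the method signatures are assumptions, and the sparse-vector optimizations used on the robot are omitted.

```python
import numpy as np

class NextingDemon:
    """One nexting prediction demon: linear TD(lambda) with a constant
    termination signal (Equations 5.2 and 5.3)."""

    def __init__(self, n, gamma, lam, alpha):
        self.gamma = gamma          # constant termination signal in [0, 1)
        self.lam = lam              # trace-decay parameter lambda
        self.alpha = alpha          # step-size parameter (shared across demons)
        self.w = np.zeros(n)        # weight vector w^(i)
        self.e = np.zeros(n)        # eligibility-trace vector e^(i)

    def predict(self, x):
        # V_t^(i) = x_t^T w_t^(i)
        return float(np.dot(x, self.w))

    def update(self, x_t, z_next, x_next):
        """One TD(lambda) step from the feature vector x_t, the next cumulant
        value Z_{t+1}, and the next feature vector x_{t+1}."""
        delta = z_next + self.gamma * np.dot(x_next, self.w) - np.dot(x_t, self.w)
        self.e = self.gamma * self.lam * self.e + x_t     # Equation 5.3
        self.w = self.w + self.alpha * delta * self.e     # Equation 5.2
```

With binary tile-coded features, the inner products and trace updates above reduce to work over the roughly 457 active indices per step, which is what makes thousands of such demons affordable in realtime.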

5.3 A scaling experiment

To explore the practicality of our approach to nexting as described above, we performed a large-scale experiment in which thousands of prediction demons were learned from thousands of features in realtime on the Critterbot. The robot's interaction with its environment was structured in a tight loop with a 100 millisecond (ms) time-step. At each step, the sensory information was used to select one of seven actions corresponding to basic movements of the robot (forward, backward, slide right, slide left, turn right, turn left, and stop). Each action caused a different set of voltage commands to be sent to the three motors driving the wheels. The state of the robot was characterized by 53 real or virtual sensors of 13 types, as summarized in the first two columns of Table 5.1. The experiment was conducted in the robot's pen with a lamp on one edge (Figure 5.2).

The robot selected actions according to a fixed stochastic policy that caused it to generally follow a wall on its right side. The policy selected the forward action by default, the slide-left or slide-right action when the right-side-facing IR distance sensor exceeded or fell below given thresholds, and the backward action when the front IR distance sensor exceeded another threshold (indicating an obstacle ahead). The thresholds were chosen such that the robot rarely collided with the wall and rarely strayed more than half a meter from the wall. By design, the backward action also caused the robot to turn slightly to the left, facilitating the many left turns needed for wall following on the right. To inject some variability into the behavior, on 5% of the time steps the policy instead chose an action at random from the seven possibilities with equal probability. Following this policy, the robot usually completed a circuit of the pen in about 40 seconds. A circuit took significantly longer if the motors overheated and temporarily shut themselves down. In this case the robot did not move, irrespective of the action chosen by the policy. Shutdowns occurred approximately every eight minutes and lasted for about seven minutes. This simple policy was sufficient for the robot to reliably follow the wall for hours.
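A rough sketch of this thresholded wall-following policy is given below. The sensor names, threshold values, and the sign convention for the IR readings are illustrative placeholders, since the exact constants used on the Critterbot are not given here.

```python
import random

ACTIONS = ["forward", "backward", "slide_left", "slide_right",
           "turn_left", "turn_right", "stop"]

def wall_following_action(front_ir, right_ir,
                          front_thresh=200, right_near=180, right_far=60):
    """Hand-designed wall-following policy: forward by default, slide to keep
    the right-side IR distance reading inside a band, back up when an obstacle
    is ahead (the backward motor command also turns slightly left), and act at
    random on 5% of time-steps. All threshold values here are illustrative."""
    if random.random() < 0.05:
        return random.choice(ACTIONS)
    if front_ir > front_thresh:       # obstacle directly ahead
        return "backward"
    if right_ir > right_near:         # too close to the wall on the right
        return "slide_left"
    if right_ir < right_far:          # drifting too far from the wall
        return "slide_right"
    return "forward"
```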

sensor type          num of sensors   tiling type   num of intervals   num of tilings
IRdistance           10               1D            8                  8
                                      1D            2                  4
                                      2D            4                  4
                                      2D+1          4                  4
Light                4                1D            4                  8
                                      2D            4                  1
IRlight              8                1D            8                  6
                                      1D            4                  1
                                      2D            8                  1
                                      2D+1          8                  1
Thermal              4(8)             1D            8                  4
RotationalVelocity   1                1D            8                  8
Magnetic             3                1D            8                  8
Acceleration         3                1D            8                  8
MotorSpeed           3                1D            8                  4
                                      2D            8                  8
MotorVoltage         3                1D            8                  2
MotorCurrent         3                1D            8                  2
MotorTemperature     3                1D            4                  4
LastMotorRequest     3                1D            6                  4
OverheatingFlag      1                1D            2                  4

Table 5.1: Summary of the tile-coding strategy used to produce feature vectors from sensorimotor data. For each sensor of a given type, its tilings were either 1-dimensional or 2-dimensional, with the given number of intervals. Only the first four of the robot's eight thermal sensors were included in the tile coding due to a coding error.

Figure 5.2: An illustration of the wall-following behavior that generated the data. The circuits around the pen involved substantial random variation, but almost always included passing the bright light on the lower-left side.

To produce the feature vectors needed for each demon, the sensorimotor data were coarsely coded according to a tile-coding strategy, as summarized in Table 5.1. Most of the tilings were 1-dimensional (1D), that is, over a single sensor's output, in which case a tile was simply an interval of the sensor's output value. For some sensors, 2-dimensional (2D) tilings were used by taking neighboring sensors in pairs. This enabled the robot's features to distinguish between, for example, a wall and a corner. To provide further discriminatory power, for some sensor types we also used 2-dimensional tilings over pairs consisting of a sensor and its second-neighboring sensor; these are indicated as 2D+1 tilings in Table 5.1. Finally, there was a tiling with a single tile that covered the entire sensor space; its corresponding feature, called the bias feature, was always active. Altogether, the tile-coding strategy used 457 tilings, producing feature vectors with $n = 6065$ components, most of which were zeros, but exactly 457 of which were ones. See Chapter 2 (background) for a description of tile coding.
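As a rough illustration of how such sparse binary feature vectors arise, the sketch below tile-codes a single sensor reading with several offset one-dimensional tilings. The function name, the offset scheme, and the example values are ours and only approximate the scheme of Table 5.1.

```python
def tile_code_1d(value, low, high, num_intervals, num_tilings, base_index):
    """Return the indices of the active features produced by `num_tilings`
    offset one-dimensional tilings of a single sensor reading; exactly one
    tile per tiling is active."""
    active = []
    width = (high - low) / num_intervals
    for tiling in range(num_tilings):
        offset = tiling * width / num_tilings        # shift each tiling slightly
        idx = int((value - low + offset) / width)
        idx = min(max(idx, 0), num_intervals - 1)    # clip to the sensor's range
        active.append(base_index + tiling * num_intervals + idx)
    return active

# Example: a light reading in [0, 1023] coded with 8 tilings of 4 intervals each
# contributes 8 active indices; concatenating all sensors' tilings, plus the
# always-on bias feature, yields the sparse binary feature vector x_t.
active_indices = tile_code_1d(value=742.0, low=0.0, high=1023.0,
                              num_intervals=4, num_tilings=8, base_index=0)
```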

Each nexting prediction was formalized as a GVF with the question functions defined as follows: a target policy equal to the behavior policy, a cumulant equal to a sensor reading or feature-vector component, and a constant termination signal. The cumulant $Z_t^{(i)}$ of each GVF corresponded to one of the 53 sensors or to one of a random selection of 487 of the 6064 non-bias feature-vector components. For each, four GVFs were learned with the four values of the termination signal $\gamma^{(i)} = 0$, 0.8, 0.95, and 0.9875, corresponding to time scales of 0.1, 0.5, 2, and 8 seconds respectively. Thus, a total of (53 + 487) × 4 = 2160 prediction demons were updated.

Temporal-difference learning was used in all 2160 prediction demons. The answer functions were the same for every demon: a hand-designed behavior policy for wall following, a feature vector produced by multiple independent tilings of the robot's sensors, and a constant λ = 0.9. The parameters of TD(λ) were α = 0.1/457 (as there are 457 active features), and the initial weight vector was zero. Data were logged to disk for later analysis. The total run time for this experiment was approximately three hours and twenty minutes (120,000 time steps).

We can now address the main question: is realtime nexting practical at this scale? This comes down to whether or not all the computations for making and learning so many complex predictions can be reliably completed within the robot's 100 ms time-step. The wall-following policy, tile coding, and TD(λ) were all implemented in Java and run on a laptop computer connected to the robot by a dedicated wireless link. The laptop used an Intel Core 2 Duo processor with a 2.4 GHz clock, 3 MB of shared L3 cache, and 4 GB of DDR3 RAM. Four threads were used for the learning code. The total memory consumption was 400 MB. With this setup, the time required to make and update all 2160 predictions was 55 ms, well within the 100 ms duty cycle of the robot. This demonstrates that it is indeed practical to do large-scale nexting on a robot with conventional computational resources. Later, a newer laptop computer (Intel Core i7, 2.7 GHz quad core, 8 GB 1600 MHz DDR3 RAM, 8 threads), with the same style of predictions and the same features, was able to make 8000 predictions in 85 ms. This shows that with more computational resources, the number of predictions (or the size of the feature vectors) can be increased proportionally. This strategy for nexting should scale readily to millions of predictions with foreseeable increases in computing power over the next decade. The experiment described above and the robot experiments of Chapters 6 and 8 were implemented in a publicly available framework for reinforcement learning called RL-Park.2 All learning algorithms used in this thesis are available within RL-Park.

2RL-Park was developed by Thomas Degris, with contributions from this author and several others. The framework supports plugins for interfacing with the Critterbot and the iRobot Create. The code can be downloaded from http://rlpark.github.io/
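The overall structure of the scaling experiment can be summarized in a sketch of the main loop: one shared feature vector per time-step, an independent weight and trace vector per demon, and a check of each cycle's wall-clock time against the 100 ms budget. This is a simplified rendering of the idea, not the RL-Park code; the function names and thread-pool arrangement are assumptions, the demons are assumed to have the update interface of the sketch above, and in Python the global interpreter lock limits true parallelism, unlike the Java implementation described here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_nexting_loop(demons, get_features, get_cumulants, time_step=0.1, workers=4):
    """Update every prediction demon once per robot time-step and warn when the
    updates do not fit within the duty cycle. `get_features` returns the shared
    feature vector and `get_cumulants` returns one cumulant value per demon."""
    pool = ThreadPoolExecutor(max_workers=workers)
    x_t = get_features()
    while True:
        z_next = get_cumulants()
        x_next = get_features()
        start = time.monotonic()
        # demon updates are independent of one another, so they can be farmed out
        list(pool.map(lambda dz: dz[0].update(x_t, dz[1], x_next),
                      zip(demons, z_next)))
        elapsed = time.monotonic() - start
        if elapsed > time_step:
            print(f"warning: updates took {elapsed:.3f}s, over the {time_step}s budget")
        time.sleep(max(0.0, time_step - elapsed))
        x_t = x_next
```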

5.4 Accuracy of learned predictions

The predictions were learned with substantial accuracy. For example, consider the eight-second prediction whose cumulant is the third light sensor's output (Light3). Notice that there is a bright lamp in the lower-left corner of the pen (Figure 5.2). On each trip around the pen, the output from this light sensor increased to its maximal level and then fell back to a low level, as shown in the upper portion of Figure 5.3. If the features are sufficiently informative, then the robot should be able to anticipate the rising and falling of this sensor's output. Also shown in the figure is the target, $G_t^{(i)}$, computed retrospectively from the subsequent output produced by the light sensor. Of course no incremental prediction algorithm could achieve this without access to future data—our learning algorithms seek to approximate this 'clairvoyant' prediction using only the sensorimotor data available in the current feature vector.

The prediction made by TD(λ) is shown in the lower portion of Figure 5.3, along with the prediction made by the best static weight vector $w_*^{(i)}$ computed retrospectively as described in Section 5.2. The key result is that the TD(λ) prediction anticipates both the rise and fall of the light. Both the learned prediction and the best static prediction track the target, though with visible fluctuations.

To remove these fluctuations and highlight the general trends in the eight-second predictions of Light3, we can average the predictions over 100 circuits around the pen, aligning each circuit's data to the time of initial saturation of the light sensor. The averages of the target, the TD(λ) prediction, and the best-static-weight-vector prediction for 15 seconds near the time of saturation are shown in Figure 5.4. All three averages rise in anticipation of the onset of Light3 saturation and fall rapidly afterwards. The target peaks before saturation, because the Light3 output regularly became elevated prior to saturation. The two learned predictions are roughly similar to the target and to each other, but there are substantial differences. These differences do not necessarily indicate error or essential characteristics of the algorithms. For example, such differences can arise because the average is over a biased sample of data—those time steps that preceded a large rise in the cumulant. We have established that some of the differences are due to the motor shutdowns. Notably, if the data from the shutdowns are excluded, then the prominent bump in the best-static-w prediction (in Figure 5.4) at the time of saturation onset disappears.


Figure 5.3: Predictions of the Light3 cumulant at the eight-second time scale. The upper graph shows the Light3 sensor data spiking and saturating on three circuits around the pen and the corresponding target (computed afterwards from the future cumulants). Note that the target shows the signature of nexting—a substantial increase prior to the spikes in cumulant. The lower graph shows the same target compared to the prediction of the TD(λ) algorithm and the prediction of the best static weight vector. These feature-based predictions are more variable, but substantially track the target.


Figure 5.4: Average of Light3 predictions (like those in the lower portion of Figure 5.3) over 100 circuits around the pen and aligned at the onset of Light3 saturation.


Figure 5.5: Learning curves for eight-second Light3 predictions made by various algorithms over the full data set. Each point is the RMSE of the prediction of the algorithm up to that time. Most algorithms use only the data available up to that time, but the best-static-w and best-constant algorithms use knowledge of the whole data set. The errors of all algorithms increased at about 130 and 150 minutes because the motors overheated and shut down at those times while the robot was passing near the light, causing an unusual pattern in the sensorimotor data. In spite of the unusual events, the RMSE of TD(λ) still approached that of the best static weight vector. See text for the other algorithms.


Figure 5.6: Learning curves for the 212 predictions whose cumulant corresponds to each sensor. The median and several representative learning curves are shown on a linear scale on the left, and the mean learning curve is shown on a logarithmic scale on the right. The mean curve is high because of a minority of the sensors whose absolute values are high and whose variance is low. If the experiment is rerun using cumulants with their average value subtracted out, then the mean performance is greatly improved, as shown on the right, explaining 78% of the variance in the target by the end of the data set.

Assessing the quality of the eight-second Light3 prediction as it evolves over time and data is relatively straightforward. As a measure of the quality of a prediction sequence $\{V_t^{(i)}\}$ up through time $T$, we can use the root mean squared error, defined as:

$$\mathrm{RMSE}(i, T) = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left(V_t^{(i)} - G_t^{(i)}\right)^2}.$$

Figure 5.5 shows the RMSE of the eight-second Light3 predictions for various algorithms. For TD(λ), the parameters were set as described above. For all the other algorithms, their parameters were tuned manually to optimize the final RMSE. The other algorithms included TD(λ) for λ = 0 and λ = 1, both of which performed slightly worse than λ = 0.9. Also shown is the RMSE of the prediction of the best static weight vector and of the best constant prediction. In these cases, the prediction function does not actually change over time, but the RMSE measure varies as harder or easier situations from which to make predictions are encountered. Note that the RMSE of the TD(λ) prediction comes to closely approach that of the best static weight vector after about 90 minutes. This demonstrates that online learning on robots can be effective in realtime with a few hours of experience, even with a large feature vector.

The benefits of a large representation are shown in Figure 5.5 by the substantially improved performance over the 'Bias' algorithm, which was TD(0) with a trivial representation consisting only of the bias feature (the single feature that is always 1). As an additional performance standard, also shown is the RMSE of a multi-step (non-iterated) variation of the autoregressive algorithm (e.g., see Box, Jenkins & Reinsel, 2011) that uses previous outputs of the Light3 sensor as features of a linear predictor, with the weights trained according to the least-mean-square rule. To incrementally train the autoregressive model, the learning was delayed by 600 time steps to compute the target. The best performance of this algorithm was obtained using a model of order 300, meaning the last 300 Light3 sensor readings were used. The autoregressive model performed much worse than all the algorithms that used a rich feature representation.

Moving beyond the single prediction of one light sensor at one time scale, we would like to evaluate the accuracy of all 212 GVFs about sensors at various time scales. To measure the accuracy of predictions with different magnitudes, we can use a normalized mean squared error:

$$\mathrm{NMSE}(i, t) = \frac{\mathrm{RMSE}^2(i, t)}{\mathrm{var}(i)},$$

in which the mean squared error is scaled by $\mathrm{var}(i)$, the sample variance of the targets $G_t^{(i)}$ over all the time steps. This error measure can be interpreted as the percent of variance not explained by the prediction. It is equal to one when the prediction is constant at the average target value, but can be much larger than one.

Learning curves using the NMSE measure for the 212 prediction demons whose cumulant corresponds to a sensor are shown in Figure 5.6. The left panel shows the median learning curve and curves for a selection of individual prediction demons. In most cases, the error decreased rapidly over time, falling substantially below the unit-variance line. The median prediction explained 80% of the variance at the end of training and 71% of the variance after just 30 minutes. In many cases the decrease in error was not monotonic, sometimes rising sharply (presumably as a new part of the state space was encountered) before falling further. In some cases, such as the 2-second Y-acceleration GVF shown, the sensor was never effectively predicted, evidenced by its NMSE never falling below one. This cumulant was simply unpredictable given the feature representation provided.

The mean learning curve, shown in the right panel of Figure 5.6 on a log scale, fell rapidly but was always substantially above one. This was due to a minority of the sensors (mainly the thermal sensors) whose values were far from zero but whose variance was small. The learning curves for the corresponding predictions were all very high (and do not appear in the left panel because they were well off the scale). Why did this happen? Note that the prediction algorithm was biased in that all the initial predictions were zero (because the weight vector was initialized to zero). When the cumulants are large relative to their variance, this bias can result in a very large NMSE that takes a long time to subside. One way to eliminate the bias is to modify the cumulant by subtracting from each sensor value the average of its values up to that time (e.g., the first cumulant is always zero). This average was computed as a long-run sample average, not a moving or windowed average. It is easily computed and uses only information readily available at the time. Most importantly, choosing the initial predictions to be zero is then no longer a bias but simply the right choice. When we modified our cumulants in this way and reran TD(λ) on the logged data, we obtained the much lower mean learning curve shown in Figure 5.6 (right). In the mean, the predictions learned with the average subtracted explained 78% of the variance of the target by the end of the data set.

Finally, consider the majority of the prediction demons whose cumulant was one of the binary features making up the feature vector. There were 170 constant features among the 487 binary features that were selected to be cumulants, and thus, with the average subtracted, both the targets and the learned predictions were constant at zero. For these constant predictions, the RMSE was zero and the variance was zero, and we excluded these predictions from further analysis.

For the remainder of the prediction demons, both the median and mean explained 30% of the variance of the target by the end of the data set.
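For concreteness, the following sketch computes the retrospective target, the RMSE, and the NMSE used in these evaluations from a logged cumulant sequence and a logged prediction sequence, along with the running-average subtraction described above. It assumes a constant termination signal and leaves the final target at zero where no future data exist; the function names are ours.

```python
import numpy as np

def retrospective_targets(cumulants, gamma):
    """Compute G_t = sum_k gamma^k Z_{t+k+1} from a logged cumulant sequence by
    the backward recursion G_t = Z_{t+1} + gamma * G_{t+1}."""
    G = np.zeros(len(cumulants))
    for t in range(len(cumulants) - 2, -1, -1):
        G[t] = cumulants[t + 1] + gamma * G[t + 1]
    return G

def rmse(predictions, targets):
    predictions, targets = np.asarray(predictions), np.asarray(targets)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))

def nmse(predictions, targets):
    # the percent of target variance not explained by the prediction
    return rmse(predictions, targets) ** 2 / float(np.var(targets))

def mean_subtracted(cumulants):
    """Bias-removal variant: replace each cumulant by its deviation from the
    long-run sample average of the cumulant observed so far (the first value
    is therefore always zero)."""
    out, total = [], 0.0
    for t, z in enumerate(cumulants):
        total += z
        out.append(z - total / (t + 1))
    return out
```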

5.5 Unmodeled situations

When learning and interacting on real physical hardware, there are often unexpected events and unknowns that were not considered ahead of time. For example, the magnetic sensors of the robot were designed to give the robot a rough sense of direction. However, when we inspect the magnetic data from the robot (see Figure 5.7) we can observe an unusual pattern in the data that does not correspond with our a priori expectations. There appears to be something in the floor of one corner of the pen, perhaps a power cable, that dramatically affected the magnetic sensor reading. This small unexpected regularity in the sensorimotor stream was discovered and eventually predicted by our robot.


Figure 5.7: Predictions of the MagneticX sensor at the eight-second time scale. The TD(λ) prediction was close to the target, explaining 91 percent of its variance.

5.6 Linear and quadratic computation

We have demonstrated that the nexting predictions learned by TD(λ) obtained good accuracy compared to several important baselines, but what about the computational requirements of nexting? We used the TD(λ) algorithm to learn nexting predictions because the algorithm's linear memory and computational footprint were deemed essential for real-time updating of thousands of predictions.

In problems where quadratic computation can be accommodated, the LSTD(λ) algorithm is sometimes preferred to TD(λ) due to its potentially superior data efficiency. Could we learn thousands of nexting predictions in realtime, updating each demon with LSTD(λ)? The aim of this section is to answer this question.

First let us make some rough calculations of the computational demands of nexting. Our setup with TD(λ) was capable of learning over 6000 predictions, with a 6065-dimensional feature vector, at 10 times a second, requiring approximately 364 million operations a second using dense vector computations. LSTD(λ), in contrast, would perform approximately $2.2 \times 10^{12}$ dense vector operations per second. In our nexting experiment, we used sparse-vector optimizations, reducing TD(λ)'s computation to approximately 327 million operations per second. A sparse implementation of LSTD(λ), with incremental matrix inversion, would perform approximately $1.1 \times 10^{12}$ operations per second to achieve nexting-scale prediction learning.

Let us move beyond rough calculations and perform a timing experiment to directly compare prediction learning with TD(0) and LSTD(0). We exclude traces for simplicity. The problem is one of learning a single nexting prediction using a binary feature vector, which is generated randomly with 10% of $x_t$ equal to one on each time-step. This problem has a single cumulant equal to the Light3 sensor from the log of the nexting experiment and a constant termination signal equal to 0.9875. In this experiment we tested MATLAB and Java implementations of the TD(0) and LSTD(0) algorithms (with sparse operations optimized). Each algorithm was timed with different feature vector sizes ranging from 1 to 316228, and the experiment was repeated for 100 runs. The random seed was initialized to the same value for all algorithm instances at the beginning of each group of 100 runs. For each run, the runtime for each feature vector length was recorded and averaged over runs to produce the curves shown in Figure 5.8. These results do not simulate other costs, such as querying the function approximator, communicating with the robot, or imperfect parallel execution. These runtimes should be considered an upper limit on what is achievable with a full prediction learning system.

Next, we used the results of the timing experiment to approximate the number of nexting demons the Java implementations of TD(0) and LSTD(0) could update within the time-step of our robot. Using the runtime measurements of each implementation, and assuming perfect parallelism on a four-core CPU, we can compute the number of predictions that could be updated in a 100 millisecond time-step for a given feature vector length.

Figure 5.9 plots these results. The results indicate that the LSTD(0) algorithm can update only a single prediction within the time-step of our robot when using the same size of feature vector as we used in the nexting experiment.
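The gap between the two methods follows directly from their per-step updates. Below is a minimal dense sketch, assuming a constant termination signal and using the Sherman-Morrison identity for LSTD(0)'s incremental inverse; it is meant only to illustrate the O(n) versus O(n²) cost per step and is not a reproduction of our Java or MATLAB implementations.

```python
import numpy as np

def td0_step(w, x, z_next, x_next, gamma, alpha):
    """Linear TD(0): O(n) memory and O(n) computation per step."""
    delta = z_next + gamma * np.dot(x_next, w) - np.dot(x, w)
    return w + alpha * delta * x

def lstd0_step(A_inv, b, x, z_next, x_next, gamma):
    """Incremental LSTD(0): O(n^2) memory and O(n^2) computation per step.
    Maintains the inverse of A = sum_t x_t (x_t - gamma * x_{t+1})^T via the
    Sherman-Morrison identity and b = sum_t Z_{t+1} x_t; the weights are A^{-1} b."""
    d = x - gamma * x_next                      # rank-one update: A <- A + x d^T
    Av = A_inv @ x
    A_inv = A_inv - np.outer(Av, d @ A_inv) / (1.0 + d @ Av)
    b = b + z_next * x
    return A_inv, b, A_inv @ b
```

In practice the inverse matrix is initialized to a scaled identity, and the matrix-vector products above are what make each LSTD(0) step roughly a factor of n more expensive than a TD(0) step.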


Figure 5.8: A plot of the average runtime used by the TD(0) and LSTD(0) algorithms to update a single nexting demon with different feature vector lengths on a simulated problem. Note the log scale. This graph also includes extrapolations of the runtimes exhibited by each algorithm. Both TD(0) implementations fit a linear trend, which we then extrapolated to estimate the runtime for larger feature vector lengths. The two LSTD(0) implementations exhibit a quadratic trend, which is used to extrapolate the runtime of the LSTD implementations. See text for a description of the experimental setup.

We conclude from these results that nexting with a quadratic learning method, like LSTD(0), is not computationally feasible given the feature vectors we used. Our approach to nexting involves learning many predictions in real-time with large feature vectors, and for this purpose a linear-complexity TD method is more suitable from a computational perspective.

We might hope to improve upon the efficiency of LSTD(λ) in several ways. For instance, nexting predictions could share data structures. In particular, the A matrix and eligibility trace vector of LSTD(λ) could be shared amongst any demons with the same termination signal. This approach is of limited value because, as highlighted by our timing results, we cannot update even one nexting prediction with LSTD(λ). Another option is to use lower-dimensional random projections to estimate LSTD's A matrix, thereby reducing computation. This optimization, like sharing data structures, would involve more complex implementations of LSTD(λ), whereas our conventional implementation of TD(λ) enabled large-scale, realtime parallel updating and accurate prediction.


Figure 5.9: This graph shows an estimate of the number of nexting predictions that could be updated in a 100 millisecond time-step, assuming perfect parallelism on a four-core CPU. Note the log scale. These estimates should be considered an approximation of what is achievable with a full prediction learning system on a robot.

5.7 Other ways to encode nexting predictions

Using a constant γ geometrically down-weights the contribution of future cumulant values in the computation of the target, just as conventional discounting down-weights future reward in reinforcement learning. This usage of γ can be called dispersive, because it smoothly blurs the cumulants over time, whereas a non-dispersive γ would place all the weight at precisely one time-step in the future. The weighting of cumulants due to γ follows an exponential profile. Although not easily represented as a GVF, we might be interested in other, non-exponential weightings of future cumulants, including a rectangle weighting or a Gaussian weighting. Figure 5.10 shows several different weightings, both dispersive and specific. In order to better assess these alternative weightings and examine their usefulness for nexting, let us use each to form targets using data from the Critterbot. In Figure 5.11, we see several different targets, each with a different weighting, plotted over time based on IR beacon data from the Critterbot (no learned predictions are included, just the targets). The by-k weighting illustrates that perfect precision using non-dispersive weightings may not always be desirable, especially with low-level robot data. The target fluctuates widely from time-step to time-step, which may make accurate prediction challenging. The other weightings summarize future data in a smoother fashion, while providing an intuitive demonstration of anticipation—predicting the rise and fall of the cumulant in advance of changes.


Figure 5.10: Possible weightings for future cumulants. Consider the zero on the x-axis to be the current time, labelled "now". Weighting (a) corresponds to the classical prediction by k = 80 steps from now, with no averaging or smoothing. A rectangle weighting (b) can be used to generate different windowed averages of the cumulant. The red weighting results in a uniform weighting of the cumulants over 20 steps centered at 80 steps into the future. The green weighting equally averages all observed cumulants up until 80 steps into the future, and the blue weighting averages a window of cumulants starting at 80 steps. Continuous weightings, like the Gaussian function (c), can place more weight on the near-term cumulants (green) or further in the future (blue). Finally, (d) the exponential weighting, produced by a constant γ, always weights near-term cumulants highest and cumulants in the far future exponentially less.

Interestingly, given different parameterizations, the dispersive weightings provide fairly similar summaries of future data. Predictions formed using exponential weightings, or GVFs with constant γ, can be incrementally updated with computation independent of span. These exponential weightings have a recursive form that allows computation of the target with constant computation and storage per step. The Gaussian weighting, on the other hand, does not have a recursive form, and must be re-computed on each time-step (Mozer, 2001). Of the weightings discussed so far, only the exponential can be updated with constant work. Nexting predictions, represented as GVFs with constant γ, can be efficiently updated and appear to provide similar representations of the target as other weightings when applied to Critterbot data.
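The computational distinction can be made concrete with a small sketch: the exponential weighting admits the constant-work recursion G_t = Z_{t+1} + γG_{t+1}, whereas a Gaussian weighting must form an explicit weighted sum over a window of future cumulants for every target. The function below is our own illustration; the truncation scheme and parameter names are assumptions.

```python
import numpy as np

def gaussian_targets(cumulants, center, width):
    """Targets under a Gaussian weighting of future cumulants, centred `center`
    steps ahead with standard deviation `width` (in steps). Unlike the
    exponential weighting, which satisfies G_t = Z_{t+1} + gamma * G_{t+1} and
    so needs only constant work per step, this weighting has no recursive form:
    every target is an explicit weighted sum over a window of future data
    (truncated here at three standard deviations)."""
    span = int(3 * width)
    offsets = np.arange(max(1, center - span), center + span + 1)
    weights = np.exp(-0.5 * ((offsets - center) / width) ** 2)
    weights /= weights.sum()
    targets = []
    for t in range(len(cumulants)):
        future = np.asarray(cumulants[t + offsets[0]: t + offsets[-1] + 1], dtype=float)
        if len(future) < len(offsets):
            targets.append(None)      # not enough future data near the end of the log
        else:
            targets.append(float(np.dot(weights, future)))
    return targets
```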

5.8 Distinctiveness of nexting

From the perspective of conventional robotics research, there are three aspects of our nexting robot that are distinctive. The first is that the robot updates a very large number of predictions, in real time, about diverse aspects of its experience, giving it a distinctively rich awareness of its surroundings. This contrasts with the conventional approach to robot engineering, in which designers identify the minimal set of state variables needed to solve a specific task, and the robot is oblivious to all others. This approach is perhaps the source of the popular notion that to perform a task "like a robot" is to do it with minimal awareness and understanding.

Figure 5.11: The target computed using several different cumulant weightings with Critterbot data, over time. The IR beacon sensor values are plotted in black. The IR sensor's output has three large 'humps' as the robot drives past its charger (which pulses IR light 50 times a second). The pulsing, however, also causes significant variability in the cumulant signal. The inset figures correspond to the weighting used in each target computation. In this example, we see the challenge of learning and using a prediction about a single time-step—the prediction-by-k weighting. The exponential weightings—corresponding to constant γ values—smooth out the cumulant, but also provide a clear sense of anticipation. The target rises and falls in advance of changes in the data series. Different time scales of prediction can be achieved with different parameterizations of the exponential weighting (i.e., different γ values). Faster decays represent near-term predictions and slower decays represent longer-term predictions. Finally, note that the rectangle, Gaussian, and exponential weightings can produce similar target plots depending on their parameter values.

Nowadays, it is not unusual for advanced robots to have a substantial awareness of their environment. The premier example of this is probably self-driving cars (e.g., Wang, Thorpe & Thrun, 2003; Markoff, 2010). These systems can simultaneously track many objects including cars, people, bicycles, and traffic signs. Self-driving cars are highly engineered, and the things predicted by the underlying algorithms are strongly interrelated and cover a few structured types. This contrasts with our approach to nexting, in which no prior model is used and each prediction is formed independently of the others. Our realization of nexting facilitates scaling to large numbers of sensors of arbitrary types.

The second way in which our nexting robot is distinctive is that it learns to make its predictions online, during its normal operation. Most learning robots complete their learning before being put into use, in special training sessions requiring information that will not be available during use, such as human-provided labels, demonstrations, or calibrations. Most of the learning in self-driving cars and in SLAM robots is of this sort, with important final tuning and local mapping done online. Classical state-estimation methods, such as Kalman filters, adapt only low-dimensional gain parameters online. Finally, there have been a handful of works with reinforcement learning robots that learn value functions or policies online (e.g., Peters & Schaal, 2008; Tedrake, Zhang & Seung, 2005; Degris et al., 2012). In all cases, the online learning is limited in its scale and diversity, typically to learning a single value function.

The third way in which our nexting robot is distinctive is that its predictions are relatively long-term, extending significantly beyond a single time-step. Although prediction is widely used in modern control theory, it is almost always limited to one-step (or differential) predictions (e.g., conventional Kalman filtering (Welch & Bishop, 1995) and system identification (Ljung, 1998)). In time-series prediction, similar to nexting, one attempts to make predictions about the future values of temporally correlated data series (e.g., classical autoregressive methods). These prediction methods are typically applied offline (after data collection), to shorter (hundreds to thousands of samples) and stationary (after preprocessing) time series. Often, one-step time-series prediction models are iterated to make multi-step predictions. That can work well, but it does not scale to long time scales (it is not span independent) or to large numbers of predictions such as we have used here. Recent work has explored direct multistep time-series prediction (Bontempi, 1999; Bontempi & Taieb, 2011; Coulibaly, 2000; Zhang & Qi, 2005; Wong, 2010; Parlos, 2000; Cheng et al., 2006; Chevillon & Hendry, 2005).

Nevertheless, time-series prediction attempts to predict data at precise instants in time, as opposed to the dispersive predictions defined by a GVF's target. It is not entirely clear how to predict GVF targets with conventional time-series prediction methods without computing the targets ahead of time.

Compared to previous AI research on predictive knowledge validated from experience, our work is distinctive in showing practicality and scalability with a physical robot. The supposition that knowledge might be expressed in terms of predictions has been explored by Cunningham (1972), Becker (1973), Drescher (1990), and Sutton (2009, 2012), but only in small-scale simulations, and in most cases with substantial abstractions given a priori.

5.9 Summary

We have successfully implemented a robot version of the psychological phenomenon of nexting. The robot learned to predict thousands of aspects of its near-future experience ten times each second. It predicted at a range of time scales, from one-tenth of a second to eight seconds. Perhaps our most important result was to show that robot nexting is not only possible, but practical. Using computationally inexpensive methods such as TD(λ), linear function approximation, and tile coding, we showed that the nexting computations easily scale to thousands of predictions based on thousands of features on a small computer. Although these algorithms are computationally cheap, they worked well. An extensive analysis of a subset of the learned predictions found them to be substantially accurate within 30 minutes of real-time training—fast enough for frequent retraining or adaptation to new sensors or environments. It is also notable that we used a single set of parameters and a single set of features for all predictions, despite variations in signal scales, signal variability, and time scales. Being able to treat all predictions uniformly in these ways facilitates the general application of nexting.

This chapter provides our first piece of evidence that our approach to predictive knowledge enables practical, large-scale learning. Our study of nexting shows that GVFs can represent an interesting and important class of predictive knowledge, and that this knowledge can be learned at scale and on a robot with a value-function learning algorithm from reinforcement learning. Our results show that a predictive approach to knowledge is practical on a physical robot from the level of sensors and motors, using features that are constructed from the same.

The next chapter continues our experimentation with learning GVFs on robots.

In particular, the next chapter provides demonstrations of time-varying termination signals, cumulants that mix terminations with observed signals, off-policy updating, and learning control—all on robots.

Chapter 6

Experiments with GVFs on robots1

The nexting experiment of the previous chapter illustrated how GVFs capture an interesting class of predictions, and that those GVFs can be learned accurately and at scale from data generated by a mobile robot. The aim of this chapter is to provide additional empirical evidence that GVFs and Horde increase the applicability and practicality of predictive knowledge learning. Specifically, this chapter contains experiments illustrating learning of both state and state-action GVFs with more complex, time-varying termination signals. These GVFs were learned with a variety of feature-generation schemes, with three different learning algorithms, on two different robot platforms, and from data generated by wall following, simple hand-coded behaviors, and human tele-operated behaviors. The results indicate that our reinforcement learning-based approach to predictive knowledge learning works across a range of feature representations, parameters, questions, and goals. This chapter contains a series of prediction-demon experiments and one off-policy control-demon learning experiment. The chapter concludes with a discussion of other demonstrations of GVF learning from the literature.

The experiments in this chapter illustrate some of the possibilities for specifying GVFs. In many cases the examples are designed to have a straightforward objective meaning, to make it easier for the reader to follow, although our robots could certainly learn predictive questions with little objective meaning to humans. The topic of automatically generating GVFs is beyond the scope of this work.

1The experimental results and text of this chapter were both adapted from co-authored sections of several papers (Sutton et al., 2011; Modayil, White, Pilarski & Sutton, 2012b; Modayil, White & Sutton, 2012; and Modayil, White & Sutton, 2014) or originally generated for this document.

6.1 Experiments with more complex terminations

A GVF's ability to represent interesting predictions about the world is related to the definition of the cumulant and termination signals. In our nexting experiments, we investigated the case where γ is taken to be a constant less than one. With a constant termination signal, predictions are restricted to simple time scales in which cumulants are weighted geometrically less the longer they are delayed. In this section, we move beyond these simple time scales and describe variable termination signals. The aim of this section is to apply and demonstrate these generalized terminations with TD(λ) in three examples of nexting on a robot. All three experiments involved applying TD(λ) to the log of data collected during the nexting experiment of the previous chapter. Therefore, all three examples make use of prediction demons and on-policy updating.

Up to now, the termination signal, $\gamma^{(i)}$, has been varied only from prediction to prediction; for the $i$th prediction, $\gamma^{(i)}$ was constant and determined its time scale. Now we will consider the case where the termination signal for an individual prediction varies over time depending on the state the robot finds itself in. We can rewrite the definition of the target given in Equation 4.2 of Chapter 4 to include both terminations that vary with time and multiple GVFs indexed by $i = 1, 2, \ldots$:

$$G_t^{(i)} = \sum_{k=0}^{\infty} \left( \prod_{j=1}^{k} \gamma_{t+j}^{(i)} \right) Z_{t+k+1}^{(i)}, \qquad (6.1)$$

and our $i$th prediction, as before, is $V_t^{(i)} \approx \mathbb{E}_\pi[G_t^{(i)} \mid S_t = s]$. Technically, $\gamma$ changes with time because it is a function of state.

Event triggered termination

In our first experiment, consider a termination signal that is usually constant and near one, but falls to zero when some designated event occurs:

$$\gamma_t^{(i)} = \begin{cases} 0 & \text{if Light3 is saturated at time } t; \\ 0.9875 & \text{otherwise.} \end{cases}$$

As long as Light3 is not saturated, this termination signal works like a simple eight-second time scale—cumulants are weighted by 0.9875 carried to the power of how many steps they are delayed. But if Light3 becomes saturated, then all cumulants after that time are given zero weight. This kind of termination enables us to predict how much of something will occur prior to a designated event (in this case, prior to Light3 saturation).

The cumulant in this first experiment is a measure of the total power consumption of the Critterbot's three motors,

$$Z_t^{(i)} = \sum_{j=1}^{3} \left| \mathrm{MotorVoltage}_t(j) \times \mathrm{MotorCurrent}_t(j) \right|.$$

As shown in Figure 6.1, power consumption tended to vary between 1000 and 3000 depending on how many motors were active. The target $G_t$ (defined in Equation 6.1), also shown in Figure 6.1, was similar to that of a simple eight-second time scale for much of the time, but notice how it falls all the way to zero during Light3 saturation. Even though there was substantial power consumption within the subsequent eight seconds, this has no effect on the target because of the saturation-triggered termination.

The target policy was the same as in nexting, the hand-designed wall-following policy. The answer functions are the same as the ones used in the nexting experiment: the wall-following behavior policy (equal to the target policy, and thus on-policy), tile-coding features, and λ = 0.9. Figure 6.1 shows that the TD(λ) demon performed well (after training on the previous 150 minutes of experience): over the entire data set the predictions captured approximately 88% of the variance in the target.
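The sketch below shows how such an event-triggered termination enters the target computation, evaluating Equation 6.1 retrospectively from logged per-step termination values and cumulants. The saturation threshold and function names are placeholders rather than the exact values used on the Critterbot.

```python
def variable_gamma_targets(cumulants, gammas):
    """Evaluate Equation 6.1, G_t = sum_k (prod_{j=1..k} gamma_{t+j}) Z_{t+k+1},
    by the backward recursion G_t = Z_{t+1} + gamma_{t+1} * G_{t+1}."""
    G = [0.0] * len(cumulants)
    for t in range(len(cumulants) - 2, -1, -1):
        G[t] = cumulants[t + 1] + gammas[t + 1] * G[t + 1]
    return G

def gamma_until_saturation(light3, saturation_level=1023):
    # the termination signal drops to zero when the designated event occurs
    return 0.0 if light3 >= saturation_level else 0.9875

def power_cumulant(motor_voltages, motor_currents):
    # total power consumption of the three motors
    return sum(abs(v * c) for v, c in zip(motor_voltages, motor_currents))
```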

Predicting outcomes

The target $G_t$ in the above experiment, like those of simple time scales, always weights delayed cumulants less than immediate ones. It cannot put higher weight on cumulants received later than it puts on those received immediately. This limitation is inherent in the definition of the target (Equation 6.1) together with the restriction of the termination signal to [0, 1]. However, it is only a limitation with respect to the cumulant; if signals are mixed into the cumulant in the right way, then predictions about the signals can be made with different temporal profiles. Our next experiment demonstrates one possible mixture.

This experiment demonstrates how a demon can predict the value of a signal at the time some event occurs. Suppose we have some signal $O_t$ whose value we wish to predict, not in the short term, but rather at the time of some event. To do this, we define a termination signal $\gamma_t^{(i)}$ that is one up until the event has occurred, then is zero. The cumulant is then constructed as follows:

$$Z_t^{(i)} = (1 - \gamma_t^{(i)})\, O_t. \qquad (6.2)$$

This cumulant is forced to be zero prior to the event (because $1 - \gamma_t^{(i)}$ is zero) and thus nothing that happens during this time can affect the target.


Figure 6.1: A demon’s prediction of total power consumption over an eight-second time scale or up until the Light3 sensor value is saturated. This corresponds to how much power will be used to reach light saturation or the effective end of the time horizon. To express this kind of prediction, the termination signal must vary with time (in this case dropping to zero upon Light3 saturation).

The target will be exactly the value of $O_t$ at the time $\gamma_t^{(i)}$ first becomes zero. The definition of the target remains the same as Equation 6.1.

For our second experiment, we made $O_t$ equal to the indicator function for an event (equal to one during it, and zero otherwise) and $\gamma_t^{(i)}$ a constant less than one prior to the event (and zero during the event). In this configuration, the prediction will be of how imminent the onset of the event is. More precisely, we specified a single demon with the question functions defined as follows: $\gamma_t^{(i)}$ prior to the event was 0.8 (corresponding to a half-second time scale) and 0 otherwise, the cumulant was defined by (6.2) with $O_t$ equal to one during the event and zero otherwise, and the target policy was the same as before. The event was defined as the right-facing IR sensor exceeding a threshold (corresponding to being within 12 cm of the wall). The answer functions were the same as in the previous experiment. Figure 6.2 shows the results of the demon's learning using the data from the Critterbot.
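The cumulant construction of Equation 6.2 is simple enough to state in a few lines; the sketch below gives the general outcome-at-event form and the onset-imminence special case used in this experiment. The function names and the event test are placeholders.

```python
def outcome_cumulant(o_t, gamma_t):
    # Equation 6.2: the cumulant is zero while termination is near one, and
    # carries the outcome signal O_t once termination begins
    return (1.0 - gamma_t) * o_t

def onset_question(event_occurring):
    """Question functions for the onset-imminence demon: O_t is the event
    indicator, and gamma_t is 0.8 before the event and 0 during it."""
    o_t = 1.0 if event_occurring else 0.0
    gamma_t = 0.0 if event_occurring else 0.8
    return outcome_cumulant(o_t, gamma_t), gamma_t
```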

Soft termination

Constructing the cumulant by (6.2) has several possible interpretations depending on the exact form of $O_t$ and $\gamma_t^{(i)}$.


Figure 6.2: Predictions of the imminence of the onset of an event regardless of its duration. The event here is being too close (<≈12cm) to a side wall of the pen, and imminence is with respect to a half-second time scale. The demon’s learned predictions rise before the event and follow the shape of the target. Overall, the normalized mean squared error of this prediction was 23%.

In our final experiment of this section, shown in Figure 6.3, we illustrate the use of signals $O_t$ that are not binary and termination signals $\gamma_t^{(i)}$ that do not fall all the way to zero—a soft termination. The idea here is to predict what the four light-sensor readings will be as the robot rounds the next corner of the pen. Four demons are specified—one for each light sensor. The question functions of each demon were defined as follows. The outcome signal is equal to the sensor reading: $O_t^{(i)} = \mathrm{Light}_t^{(i)}$. The termination signal, $\gamma_t^{(i)}$, of each demon is set equal to 0.9875 (an eight-second time scale) unless the robot rounded a corner, during which time the termination signal changed to 0.3. Because the termination signal is greater than zero during the event, the light readings from several time steps contribute to the target, $G_t^{(i)}$, as the corner is entered. Rounding a corner is an event indicated by a value from the side IR distance sensor corresponding to a large distance (>≈25 cm); this typically occurs for several seconds during each corner turn. The target policy is equal to the behavior, which completes our specification of the question functions. Figure 6.3 shows each demon's prediction, learned from the nexting log file. In this experiment, the future cumulants observed in the next corner the robot will encounter have the largest contribution to the target. The target gives non-zero weight to only a few cumulants beyond the next corner, but not so much that the target would include light two corners ahead in time.


Figure 6.3: Predictions of what each of the four light-sensor outputs will be when the robot rounds its next corner. The greyed time steps indicate those in which the robot was considered to be rounding a corner. Because the termination signal is greater than zero during the event, the light data from several time steps contribute to the target as the corner is entered. The normalized mean squared errors for the four predictions were 6%, 10%, 10%, and 11% over the course of the data set.

6.2 Off-policy prediction learning

The experiments of the previous section were all learned via on-policy TD(λ). In this section, we provide four examples of off-policy prediction learning on two different robots: the Critterbot and the iRobot Create. These examples demonstrate realtime learning of both state-action GVFs with GQ(λ) and state GVFs with GTD(λ).

The remaining experiments in this chapter were conducted before the nexting experiment of the previous chapter. At the time these experiments were conducted, we were just beginning to understand how to effectively use the new gradient-TD methods on robots. In the following experiments you will notice that we used different time-steps, feature vectors, and learning parameters in each experiment as we sought to find the best ways to achieve stable learning. As a result of the nexting experiment and the experiments described in Chapters 7 and 8, we now have a clearer understanding of how to set the parameters for off-policy learning on robots.

6.2.1 Experiments on the Critterbot

We performed two experiments to examine demon learning with GQ(λ) on the Critterbot.2 In both experiments, the robot's sensors and actions were tiled to form a state-action feature vector $x_a$, and a discrete set of actions was used. Both experiments involve a single prediction demon. The time-step used in these experiments was approximately 30 ms.

Our first off-policy experiment involved a single demon that might be useful in ensuring safety: "How much time will pass before the robot hits an obstacle, if the robot drives straight forward?". The question functions for this demon were: $\pi(s, \mathrm{FORWARD}) = 1$ for all $s \in \mathcal{S}$, $Z_t = 1$, and $\gamma_t = 0$ if the value of the Critterbot's front-pointing IR proximity sensor was over a fixed threshold, else $\gamma_t = 1$. The answer functions were λ = 0.4 and features generated from a single tiling into twenty-six regions of the robot's front IR sensor (therefore $x_a$ has a single active component). The behavior policy cycled between three actions: driving forward, driving in reverse, and resting. The GQ(λ) step sizes were α = 0.3 and $\alpha_w$ = 0.00001.

Figure 6.4 shows a comparison between predicted and observed time steps needed to reach obstacles (the wall) when driving forward. Shown is the demon's prediction $Q_t$ on each visit (bold line) and the target from that step, $G_t$ (thin line), in each of three regions. Each of the three regions corresponds to a non-overlapping discretization of the front IR sensor, and a visit to a region corresponds to a time-step in which the IR sensor reading was within the corresponding range. The regions were used only for evaluation and visualization purposes—only a single prediction demon was used. Note that as the robot drives forward and passes through a region, multiple visits are recorded (both the current prediction and a sample of the target) and plotted in Figure 6.4.

Our second off-policy experiment also involved a single demon that might be useful in ensuring robot safety: "How much time does the robot need to stop?". We defined a single demon to predict the number of time steps until one of the robot's wheels approaches zero velocity (i.e., comes to a complete stop) under current environmental conditions.

The question functions for this demon were: $\pi(s, \mathrm{STOP}) = 1$ for all $s \in \mathcal{S}$, $Z_t = 1$, and $\gamma_t = 0$ if the wheel's velocity sensor was below a fixed threshold, else $\gamma_t = 1$. The answer functions were λ = 0.1 and features generated from a single tiling into eight regions of the wheel's velocity sensor. The behavior policy alternated at fixed intervals between spinning at full speed and resting. The floor surface, and thus the nature of the stopping problem, was changed after visits 338 and 534. The GQ(λ) step sizes were α = 0.1 and $\alpha_w$ = 0.001.

2Both experiments were run by Patrick Pilarski.


Figure 6.4: Predicting time-to-obstacle on the Critterbot. The robot was repeatedly driven toward a wall at a constant wheel speed. For each of three regions of the sensor space, for each time-step spent in that region, we plot the prediction, $Q_t = \hat{q}(S_t, A_t, w)$, on that step (bold line) and the target from that step (thin line).

Figure 6.5 demonstrates the demon's ability to accurately predict stopping times on different surfaces. Shown is the prediction $Q_t = \hat{q}(S_t, A_t; \pi, \gamma, z)$ made on visits to a region of high velocity while stopping (bold line), together with the target computed from that visit (thin line). As illustrated in Figure 6.5, this demon learned to correctly predict the target (time steps to stopping) on carpet, then adapted its prediction when the surface was changed to air and then to wood flooring.

6.2.2 Experiments on the Create

We performed two experiments to examine GVF learning with GTD(λ) on the iRobot Create. In both experiments, the robot's sensors and actions were tiled to form a state feature vector $x$, and a discrete set of actions was used, matching the formulation of the GTD(λ) algorithm. The first experiment involves a single prediction demon, and the second experiment involves several prediction demons learned in parallel. The time-step used in these experiments was approximately 0.5 seconds in length.

The previous experiments of this chapter provide a quantitative assessment of demon prediction learning.


Figure 6.5: Predicting time-to-stop on the Critterbot. The robot was repeatedly rotated up to a standard wheel speed, then switched to a policy that always took the STOP action, on each of three different floor surfaces. Shown is the prediction $Q_t$ made on visits to a region of high velocity while stopping (bold line) together with the target computed from that visit (thin line). The floor surface was changed after visits 338 and 534.

In the following two experiments we take a qualitative approach to evaluating prediction learning. Specifically, we visually inspect and assess each prediction's correctness during a tele-operation phase after learning is complete. This style of prediction evaluation is of interest because quantitative measures recorded during learning may not always tell the whole story. For example, the predictions may behave erratically or not as expected in new situations, even if a quantitative measure indicates low prediction error during learning. By visualizing the learning in a qualitative way, we can further validate the prediction-accuracy claims from the quantitative measures.

Predictive feature components

Our first Create prediction experiment was designed to illustrate how a demon's prediction can participate in the generation of the feature vector. This experiment's demon was also related to robot safety, and can be loosely interpreted as: "How imminent is bumping, if the robot were to drive toward the wall?". The question functions for this demon were: a target policy that always selects the forward action, $Z_t$ equal to one if either bumper is activated and zero otherwise, and $\gamma_t$ equal to zero if the Create's front bumper sensor was activated, and otherwise $\gamma_t = 0.8$, corresponding to a 2.5-second time scale. The robot always remained essentially perpendicular to the wall (due to the specification of the behavior policy, described below), and thus driving forward will eventually lead to a bump while driving backward will not induce a bump.

87 behavior policy, described below), and thus driving forward will eventually lead to a bump while driving backward will not induce a bump. Unlike our previous experiments, the feature vector in this experiment includes pre- dictive components. The feature vector xt is a binary vector with two components which encode if bumping was detected or not, three components to encode the previous time-step’s action (forward, backward, or pause), 128 components produced by tile coding the demon’s prediction of bumping from the previous time-step, and a bias unit. Specifically, 128 com- ponents of xt were produced by tile-coding Vt−1 =v ˆ(St−1, w) with eight one-dimensional

tilings into 16 regions. Overall xt contains 134 binary components, and 11 components are non-zero on each time-step. Due to the Create’s limited sensory apparatus directly sensing the distance to the wall is not often possible. This feature vector was used to disambiguate the robot’s current situation. The remaining answer functions were λ = 0.9, and the behavior policy was fixed, probabilistically switching between driving forward (60%) and driving backward (40%), and thus keeping the robot near the wall. The behavior never included rotation actions. The robot’s movement with respect to the wall can be seen in the screen captures in Figure

6.6. The GTD(λ) algorithm’s learning-rate parameters were set to α = 0.1/11 and αw = 0.0001. In order to test the accuracy of the robot’s prediction, we controlled the robot with a remote and recorded the predictions. The prediction demon learned an estimate of the im- minence of the onset of a bump from approximately three minutes of interaction generated by the behavior policy. After learning, the demon’s learning was disabled and the robot was tele-operated for approximately two minutes. The robot was driven toward and away from the wall, and the demon’s predictions were recorded. Figure 6.6 shows several represen- tative frames from a video of the robot under human control. The demon’s prediction of imminence of the onset of bumping is included with each frame along with an indication if bumping was detected. The intensity of the color of the prediction represents the magnitude of the demon’s prediction. We conclude from the results in Figure 6.6 that the demon’s predictions of the immi- nence of the bumping are substantially correct. Here we take correctness to mean matching our human expectations of how the prediction should change with the robot’s actions. The prediction is high when the robot is near the wall, low when the robot is far from the wall, and rises and falls as the robot moves forward and backward. Thus, the predictions match our expectations.
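To make the construction of the predictive feature components concrete, the following is a minimal Python sketch of how such a feature vector could be assembled. The tile coder, the ordering of the components, and the function names are illustrative assumptions rather than the exact code used on the robot.

```python
import numpy as np

def tile_code_prediction(pred, n_tilings=8, n_tiles=16, lo=0.0, hi=1.0):
    """One-dimensional tile coding of a scalar prediction.

    Returns a binary vector of length n_tilings * n_tiles with exactly
    n_tilings active components (one tile per tiling)."""
    phi = np.zeros(n_tilings * n_tiles)
    width = (hi - lo) / n_tiles                       # width of one tile
    p = float(np.clip(pred, lo, hi - 1e-8))
    for k in range(n_tilings):
        offset = k * width / n_tilings                # each tiling is slightly shifted
        idx = min(int((p - lo + offset) / width), n_tiles - 1)
        phi[k * n_tiles + idx] = 1.0
    return phi

def build_feature_vector(bumped, last_action, prev_prediction):
    """Assemble the 134-component binary feature vector described in the text:
    2 bump components + 3 action components + 128 tile-coded prediction
    components + 1 bias unit (11 components active per step)."""
    bump = np.array([1.0, 0.0]) if bumped else np.array([0.0, 1.0])
    action = np.zeros(3)                              # forward, backward, pause
    action[last_action] = 1.0
    pred_features = tile_code_prediction(prev_prediction)
    bias = np.array([1.0])
    return np.concatenate([bump, action, pred_features, bias])
```

The important property is that the demon's own prediction from the previous time-step re-enters the representation as 128 binary features, which is what allows the representation to disambiguate situations the Create's sensors alone cannot.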

Figure 6.6: Predicting the imminence of the onset of a front bump on the iRobot Create. This figure includes a set of representative frames (#0 to #5) gathered while testing a single demon's predictions. Each of the six frames includes a video still of the robot under human remote control, a visualization of whether one of the bumpers was active (lefthand box labelled 'bump state'), a visualization of the demon's prediction (righthand box labelled 'GTD(λ) prediction'), and a visualization of the current action (forward and backward with a gray arrow, no action with a gray pause symbol). For the bump state, blue indicates no bump detected and green indicates a bump detected. For the predictions, blue indicates low prediction magnitude, green indicates high prediction magnitude, and blue-green indicates an intermediate prediction magnitude. We see the robot driving toward the wall (frames #0, #1, #3) and the demon's prediction rising in anticipation of a bump. As the robot backs away from the wall (frames #2 and #5), the prediction falls. A bump event occurs in frame #1. Even on an interrupted attempt (frame #4), the prediction recedes as the robot begins to back away from the wall (frame #5).

Multiple predictions

Our second Create prediction experiment was designed to illustrate multiple predictions learned in parallel. In this experiment, we used three prediction demons whose predictions depend on the robot's orientation with respect to the wall; they can be loosely interpreted as: "How imminent is bumping in the current configuration, with respect to each bump sensor, if the robot were to drive forward?". In this experiment, the robot was allowed to rotate, changing its orientation with respect to the wall. Depending on the robot's orientation, driving forward will cause a left bump, a right bump, a left and right bump, or no bump.

Each GVF was about a different bump sensor: left bumper, right bumper, and both bumpers. The question functions corresponding to the first demon were: π^(1) continuously selects the forward action, Z_t^(1) equal to the binary left bumper reading, and γ_t^(1) equal to zero if the left bumper was activated, else γ_t^(1) equal to 0.5, corresponding to a 0.5 second time scale. The second demon's question functions were defined in a similar way: π^(2) equal to the forward policy, Z_t^(2) equal to the right bumper reading, and γ_t^(2) equal to zero if the right bumper was active, and 0.5 otherwise. Finally, the third demon's question functions were: π^(3) equal to the forward policy, Z_t^(3) equal to one and γ_t^(3) equal to zero if either the left or right bumper was active, and Z_t^(3) equal to zero and γ_t^(3) equal to 0.5 otherwise. All three prediction demons used the same answer functions.

The robot's behavior policy in this experiment was generated by human tele-operated control. The robot was controlled with a remote, switching between rotating back and forth and driving forward, producing a three-minute log of data. Because the behavior was generated by human inputs, we set µ(St, At) equal to one.3 During learning, the Create remained close to the wall, never facing away from the wall. The downward-facing IR sensors' readings of a black strip of tape along the ground near the wall provided the robot with a noisy indication of its current orientation (electrical tape gives a distinctive IR output compared to the rest of the floor). The feature vector included six two-dimensional tile codings of all pairwise combinations of the four downward-facing IR sensors, each using four tilings into eight regions. The feature vector also included four components encoding the state of the left and right bump sensors, four components encoding the previous time-step's action (forward, backward, rotate right, rotate left), and a bias unit. In total, the feature vector contained 1545 components, 27 of which were equal to 1 on each time-step. The λ function was set to a constant value of 0.9. The parameters to GTD(λ) were α = 0.1/27 and αh = 0.0001.

We trained the three prediction demons off-line and off-policy in five passes over the training set. Figure 6.7 shows several representative frames from a video of the robot under human tele-operation in various configurations near the wall, the current bump readings, and the three demon predictions. As before, we tested the predictions by controlling the robot by remote after learning, while recording the predictions. The predictions were not perfect; for example, the prediction in frame #3 was not 1.0, and the predictions when the robot was facing away from the wall were not precisely zero. Overall, we conclude that the three predictions are substantially correct, illustrating that the demons can predict the imminence of the onset of bumping according to each bump sensor in a variety of positions.

3 This choice technically introduces bias, causing the algorithm to converge to a different solution. In this case the final predictions were reasonable, but more generally this is a challenge when using behavior generated by human control.

Figure 6.7: Three demons' predictions of the imminence of the onset of a front bump, learned on the iRobot Create. This figure includes a set of representative frames (#0 to #7) gathered while testing the robot's predictions. Each frame includes a video still of the robot under human remote control, a visualization of the current bump state (left three boxes), a visualization of each demon's prediction (righthand three boxes labelled 'GTD(λ) prediction'), and a visualization of the current action with gray arrows and pause symbols. The orientation of the robot is marked with a red line. The blue and green colors have the same interpretation as in Figure 6.6. We see the robot bumping into the wall (frames #3, #4 and #5), facing the wall with a high prediction of the imminence of bump onset if the forward action were taken (frames #0, #1, #2 and #7), and facing away from the wall with a low prediction of the imminence of bump onset (frame #6).

More generally, these results indicate that multiple GVFs can be learned off-policy from robot data, and that those predictions are reasonably accurate when tested in a variety of configurations different from the training data.
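The three demons share the behavior data and the feature vector; only their question functions differ. The following Python sketch illustrates one way the question functions above could be written down and evaluated for a shared transition; the dictionary layout and function names are illustrative assumptions, not the code used in the experiment.

```python
import numpy as np

FORWARD = 0  # index of the forward action in the Create's discrete action set

def forward_policy(action):
    """Target policy shared by all three demons: always drive forward."""
    return 1.0 if action == FORWARD else 0.0

# Question functions for the three bump demons described above.  Each demon is
# defined by its cumulant Z and termination gamma, both functions of the next
# observation (the bumper readings), plus the shared forward target policy.
demons = {
    "left":   {"z": lambda o: float(o["left_bump"]),
               "gamma": lambda o: 0.0 if o["left_bump"] else 0.5},
    "right":  {"z": lambda o: float(o["right_bump"]),
               "gamma": lambda o: 0.0 if o["right_bump"] else 0.5},
    "either": {"z": lambda o: 1.0 if (o["left_bump"] or o["right_bump"]) else 0.0,
               "gamma": lambda o: 0.0 if (o["left_bump"] or o["right_bump"]) else 0.5},
}

def demon_targets(action, obs_next, mu_prob=1.0):
    """For one shared behavior transition, compute each demon's importance
    sampling ratio, cumulant, and termination.  In this experiment mu(s, a)
    was set to one, so rho reduces to pi(a|s)."""
    rho = forward_policy(action) / mu_prob
    return {name: (rho, d["z"](obs_next), d["gamma"](obs_next))
            for name, d in demons.items()}

# Example: the behavior drove forward and the left bumper then activated.
print(demon_targets(FORWARD, {"left_bump": True, "right_bump": False}))
```

Each demon is then updated by GTD(λ) from the shared feature vectors using its own (ρ, Z, γ) triple.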

6.3 Off-policy control learning

Our final experiment in this chapter examined whether a control demon, using GQ(λ), could learn a goal-directed policy when given a much greater breadth of experience. The objective was to learn how-to knowledge about maximizing the light sensor reading. We used a large discrete action set with 27 possible actions corresponding to velocities for the robot's three wheels (e.g., {10, 10, 10}, {10, 10, 0}, ..., {-10, -10, -10}). The time-step corresponded to approximately 500 ms. The question functions corresponded to the goal of maximizing the near-term value of the front light sensor of the Critterbot (Light0): π = greedy(q̂), γt = 0.9, and Zt equal to a scaled reading from Light0. The answer functions were λ = 0.9, a behavior policy that picks randomly from the set of 27 actions, and a state-action feature vector produced by tile coding the light sensor readings. In particular, we used 32 individual tilings over each of the four light sensors, where each tile covered about 1/8th of the range. With the addition of a bias unit, this made for a total of 27,675 binary feature components, of which 129 were active on each time step.4 The parameters to GQ(λ) were α = 0.1/129 and αh = 0.0.

Using the random behavior policy, we collected a training set of 61,200 time steps (approximately 8.5 hours) with a bright light at nearly floor level on one side of the Critterbot's pen. During this time, the robot wandered all over the pen in many orientations. We trained the control demon off-line and off-policy in two passes over the training set. To assess what had been learned, we then placed the robot in the middle of the pen facing away from the light and gave control to the demon's learned policy. The robot would typically turn immediately and drive toward the light, as shown in the first panel of Figure 6.8. We evaluated the learned policy with eight independent runs, four with the light on the south side of the pen and four with the light on the west side of the pen. In all eight trials the robot successfully navigated to and stayed at the light. We conclude that the results of this experiment demonstrate that a demon can learn an effective goal-directed behavior from substantially different training behavior.
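The 27-action set and the shape of the state-action features can be sketched as follows. The exact tile-coder offsets and index layout used on the robot are not specified above, so the ones below are illustrative assumptions chosen only to match the stated counts (27,675 binary components, 129 active per step).

```python
import itertools
import numpy as np

# The 27-action set: one target velocity per wheel, each in {-10, 0, +10}.
ACTIONS = list(itertools.product([-10, 0, 10], repeat=3))   # 3^3 = 27 actions

def state_action_features(light_readings, action_index,
                          n_tilings=32, n_tiles=8, n_per_action=1025):
    """A simplified stand-in for the state-action features: the four light
    sensors are tile coded (32 tilings, tiles ~1/8 of the range) and the
    resulting block is stacked per action, with a per-action bias unit."""
    phi = np.zeros(len(ACTIONS) * n_per_action)
    base = action_index * n_per_action
    for s, reading in enumerate(light_readings):             # readings scaled to [0, 1]
        r = float(np.clip(reading, 0.0, 1.0 - 1e-8))
        for k in range(n_tilings):
            offset = k / (n_tilings * n_tiles)                # shift each tiling slightly
            idx = min(int((r + offset) * n_tiles), n_tiles - 1)
            phi[base + s * n_tilings * n_tiles + k * n_tiles + idx] = 1.0
    phi[base + n_per_action - 1] = 1.0                        # per-action bias unit
    return phi

def greedy_action(light_readings, w):
    """Target policy pi = greedy(q_hat): pick the action whose state-action
    feature vector has the largest inner product with the learned weights."""
    values = [w @ state_action_features(light_readings, a) for a in range(len(ACTIONS))]
    return int(np.argmax(values))
```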

4 Setting αh = 0.0 means the secondary weights are not used in the GQ(λ) update. In practice we have found that very small values of αh result in faster learning compared to αh values closer in magnitude to α. The experiments of Chapter 7 show that αh = 0.0 can lead to divergence, and thus setting αh very small is a dangerous choice for predictive knowledge acquisition. Nevertheless, at the time this experiment was conducted, this setting produced the best result for the particular GVF we specified.

Figure 6.8: Learning light-seeking behavior from random behavior. Shown are superimposed images of robot positions: Left) In testing, the robot under control of the target policy turns and drives straight to the light source at the bottom of the image; Middle) Under control of the random behavior policy for the same amount of time, the robot instead wanders all over the pen; Right) Front light sensor (Light0) outputs over time (in minutes), averaged over seven such pairs of runs, showing much higher values for the learned target policy.

6.4 Other demonstrations of GVF learning

The experiments described in this chapter and our paper (Sutton et al., 2011) were the first empirical demonstrations of GVF learning, and they have inspired several other empirical studies of GVF learning. This section summarizes these other works, which add further supporting evidence of the usefulness of GVFs as a language for predictive knowledge and of the practicality of learning GVFs with reinforcement learning methods.

On-policy GVF learning has been explored in other robot domains, demonstrating concrete benefits of predicting what will happen next. In the domain of adaptive prosthetics, GVFs have been used to represent predictions about a multi-degree-of-freedom robotic arm (Pilarski et al., 2012). Each prediction was represented as a constant-termination GVF and learned on-policy and in parallel with the TD(λ) algorithm and tile coding function approximation. After learning, the predictions were used to adapt the switching order of the robot arm, significantly improving task completion efficiency. Later work showed that these nexting predictions can be learned online while a human amputee generates the behavior data, significantly improving the amputee's user experience (Edwards et al., 2014). This adaptive prosthetics work is important because it provides another demonstration of our basic learning scheme (linear temporal difference learning and large binary feature representations) working on a very different robot platform. Their work also goes beyond the experiments of this chapter, demonstrating a concrete application of GVF learning: improving switching performance. Our experiments (in this chapter and Sutton et al., 2011) remain the only ones demonstrating off-policy GVF learning on robots.

General value function learning has also been studied in simulated robot domains. For example, nexting predictions of laser welding-process sensors were learned in simulation from a multi-level representation computed from simulated camera images (Gunther et al., 2014). General value function learning has also been explored in the RoboCup 3D soccer simulation league environment, where a set of control demons was used to learn the role assignment of each robot in order to maximize the team's winnings (Abeyruwan et al., 2014).

Beyond the results of this chapter, the practicality of learning GVFs and the representational abilities of GVFs have been demonstrated in small simulation worlds. The work of Degris et al. (2012) contains the most extensive investigation of control demon learning, with four different off-policy reinforcement learning methods compared in three continuous-state domains. The results of Degris's study indicate that control policies learned off-policy from randomly generated behavior can effectively solve several simulated cost-to-go problems. Recently, multi-step predictions represented as GVFs were compared to PSRs for representing partially observable domains (Schaul & Ring, 2013). Specifically, the study compared estimating a state-action value function for navigating a discrete maze with features computed from a set of GVF predictions and features computed from a set of PSR predictions. The results showed that predictive features based on GVF predictions were more useful than those based on PSR predictions, although the predictions were given, not learned.

6.5 Summary

This chapter presented several demonstrations of GVF learning on robots: three on-policy prediction learning experiments on the Critterbot, two off-policy prediction learning experiments on the Critterbot, two off-policy prediction learning experiments on the Create, and one off-policy control learning experiment on the Critterbot. The chapter closed with a discussion of experiments with GVF learning from the literature.

The aim of this chapter was to demonstrate how many different predictions can be formulated as GVFs, and that these GVFs can be practically learned from samples generated by a robot. To achieve this goal we conducted several robot experiments demonstrating demon learning with Horde. Our experiments covered a wide variety of configurations including on-policy updating, off-policy updating, learning state GVFs, learning state-action GVFs, learning policies, different feature representations, predicting many different signals with time-varying termination, and three different temporal difference learning methods, on two different robots. In each configuration, the approximate GVFs were learned to acceptable levels of accuracy using simple linear function approximation. The results of this chapter provide evidence of the contribution of GVFs and Horde towards developing predictive knowledge.

Chapter 7

Experiments with gradient-TD learning

Gradient-TD methods make exploring parallel demon learning practical for the first time, but our empirical experience with these new algorithms is limited. Gradient-TD methods, in theory, have all the right attributes for large-scale prediction learning, but like any new algorithm, they require experimental validation.1 This chapter contributes new experimental results and insights to the growing body of experimental work on linear-complexity, off-policy reinforcement learning (see Sutton et al., 2009; Delp, 2010; Hackman, 2012; Geist & Scherrer, 2014; Dann et al., 2014).

The experiments of this chapter focus on the GTD(λ) algorithm. The GTD(λ) algorithm, as our experiments in Chapter 6 show, can be effective for learning state GVFs on robots. To improve our understanding, we test the GTD(λ) algorithm in more controlled empirical settings, which is difficult to do on a robot. In particular, our experiments investigate the role of the secondary weight vector of the GTD(λ) algorithm. Prior studies have cast some doubt on the benefit of the secondary weights in practice (specifically Delp, 2010; Hackman, 2012; Degris et al., 2012), suggesting that in some domains a large learning-rate for the secondary weights, αh, can slow learning. Our experiments investigate how well the GTD(λ) algorithm learns its secondary weights in a Markov chain domain and in Baird's counterexample. Our experiments also investigate how well the GTD(λ) algorithm can minimize the MSPBE when the secondary weights are ignored, when the secondary weights are replaced by their correct (unlearned) values, and when the secondary weights are learned in the usual way. Taken together, the experiments of this chapter provide new insights into how the performance of the GTD(λ) algorithm relates to variations in its parameter values.

1 Appendix B contains a discussion of other GVF learning methods besides those that minimize the MSPBE, and a summary of the evidence for and against the MSPBE as an objective function for GVF learning.

All the experiments of this chapter focus on state GVF estimation and linear function approximation. We begin with experiments on a Markov chain problem and then move to Baird's counterexample.

7.1 Experiments on Markov chains

We begin our experimental exploration of GTD(λ) and its secondary weight vector on the Markov chain problem, previously used in other empirical explorations of gradient-TD learning methods (e.g., Sutton et al., 2009; Maei, 2011; Hackman, 2012).

7.1.1 Problem

We use a Markov chain task, depicted in Figure 7.1, to better understand the secondary weights of the GTD(λ) algorithm. The chain has two actions, left and right, and two terminal states at the ends of the chain, with the left termination producing a cumulant of -1 and the right termination producing a cumulant of +1. Cumulant values on all other transitions are zero. Upon entry into one of the terminal states, the agent is teleported to the middle state of the chain.

Figure 7.1: The Markov chain domain with discrete states and actions. The terminal states are at the two ends of the chain (states 0 and 6), and the nonzero cumulants, -1 on the left termination and +1 on the right, are labelled in red. The feature vector corresponding to each state is given below each state: x(1) = (1, 0, 0)^T, x(2) = (1/√2, 1/√2, 0)^T, x(3) = (1/√3, 1/√3, 1/√3)^T, x(4) = (0, 1/√2, 1/√2)^T, and x(5) = (0, 0, 1)^T.

The learning problem is specified by the three question functions. The target policy selects the right action with probability p and the left action with probability 1 − p. The cumulants are given by the definition of the problem, and γt becomes zero upon entry into either terminal state and is 1.0 otherwise.

The answer functions are: λ is a constant function equal to 0.9, and the behavior policy, like the target policy, takes the right action with probability p and the left action with probability 1 − p (possibly with a different value of p than the target policy). Depending on the values of p for the target and behavior policies, an instance of the chain problem can be on- or off-policy. The feature vectors are given in Figure 7.1. Note that the feature vector is of lower dimension than the number of states in the chain, and the representation is insufficient to exactly represent the value function.
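A minimal Python sketch of this problem follows, based on our reading of Figure 7.1; the state indexing and the handling of the teleport on termination are assumptions.

```python
import numpy as np

# Feature vectors of the five non-terminal states (1..5), as given in Figure 7.1.
X = np.array([[1.0,          0.0,          0.0],
              [1/np.sqrt(2), 1/np.sqrt(2), 0.0],
              [1/np.sqrt(3), 1/np.sqrt(3), 1/np.sqrt(3)],
              [0.0,          1/np.sqrt(2), 1/np.sqrt(2)],
              [0.0,          0.0,          1.0]])

LEFT, RIGHT = 0, 1

def sample_action(p_right, rng):
    """Both policies have the same form: right with probability p, else left."""
    return RIGHT if rng.random() < p_right else LEFT

def step(state, action):
    """One transition in the chain (non-terminal states are 1..5).  Entering a
    terminal end yields cumulant -1 (left) or +1 (right), gamma becomes zero,
    and the agent is teleported back to the middle state, 3."""
    next_state = state - 1 if action == LEFT else state + 1
    if next_state == 0:
        return 3, -1.0, 0.0
    if next_state == 6:
        return 3, +1.0, 0.0
    return next_state, 0.0, 1.0

# Example: one behavior step from the middle state under p_mu = 0.5.
rng = np.random.default_rng(0)
print(step(3, sample_action(0.5, rng)))
```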

The chain domain enables direct computation of the optimal value of ht given the current estimate of wt, which we denote h*. We can compute h* directly from the parameters of the MDP and the current estimate wt:

h*_t = C^{-1}(−A^π w_t + b^π),    (7.1)

where C = X^T B X, A^π = X^T B (I − λγP^π)^{-1}(I − γP^π) X, and b^π = X^T B (I − λγP^π)^{-1} R. See Section 2.6 for the derivation of this equation.

We can also compute the least-squares solution for the secondary weights, denoted by h^{GTD-LS}. This approximation is computed from the data generated by the demon's interaction with the Markov chain task and wt:

h^{GTD-LS}_t = ⟨x x^T⟩^{-1} ⟨δ_t e_t⟩,    (7.2)

where ⟨·⟩ denotes the sample average. Note that h^{GTD-LS}_t estimates h*_t by sampling the expectations E_µ[x(S_t) x(S_t)^T]^{-1} E_µ[δ_t e_t]. Computing h^{GTD-LS}_t via Equation 7.2 makes use of the complete history of sample transitions, (x_t, ρ_t, r_{t+1}, x_{t+1}), to recompute E_µ[δ_t e_t] on each time-step. This computation is needed because δ_t is a function of the current w_t.
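For concreteness, the quantities above can be computed with a few lines of numpy. The sketch below assumes a constant γ and λ, a diagonal matrix B of state-visitation probabilities, and a substochastic P^π with terminations folded in; it also includes the (root) MSPBE computed from the same matrices, since we use the RMSPBE as a performance measure throughout this chapter. The function names are ours.

```python
import numpy as np

def mdp_matrices(X, B, P_pi, R, gamma, lam):
    """The matrices used in Equation 7.1 (constant gamma and lambda assumed,
    with terminations folded into a substochastic P_pi)."""
    n = P_pi.shape[0]
    M = np.linalg.inv(np.eye(n) - lam * gamma * P_pi)
    C = X.T @ B @ X
    A = X.T @ B @ M @ (np.eye(n) - gamma * P_pi) @ X
    b = X.T @ B @ M @ R
    return A, b, C

def h_star(X, B, P_pi, R, gamma, lam, w):
    """Equation 7.1: h* = C^{-1}(b^pi - A^pi w) for the current primary weights."""
    A, b, C = mdp_matrices(X, B, P_pi, R, gamma, lam)
    return np.linalg.solve(C, b - A @ w)

def rmspbe(X, B, P_pi, R, gamma, lam, w):
    """Root MSPBE from the same matrices: MSPBE(w) = (b - Aw)^T C^{-1} (b - Aw)."""
    A, b, C = mdp_matrices(X, B, P_pi, R, gamma, lam)
    err = b - A @ w
    return float(np.sqrt(err @ np.linalg.solve(C, err)))

def h_gtd_ls(features, deltas, traces):
    """Equation 7.2: h = <x x^T>^{-1} <delta e>, with <.> the sample average.
    The TD errors must be recomputed with the current w before each call."""
    xs = np.asarray(features)                      # shape (T, n)
    C_hat = xs.T @ xs / len(xs)
    target = np.mean(np.asarray(deltas)[:, None] * np.asarray(traces), axis=0)
    return np.linalg.solve(C_hat, target)
```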

7.1.2 Learning h

Our first set of experiments seeks to answer a basic question: "Are the secondary weights learned by GTD(λ) similar to their theoretical values (h*) if the parameters of GTD(λ) are tuned to minimize the projected Bellman error?". We are interested in the case where the parameters of GTD(λ) are tuned to minimize the MSPBE because this represents the usual use case for GTD(λ). That is, informally, we are interested in how well the GTD(λ) algorithm learns the h vector under normal use. Similarity will be defined below.

The prior experimental work on gradient-TD learning did not address the question stated above. Several studies reported good results when the secondary weights are learned slowly, using a small value for αh compared to α (Sutton et al., 2009; Degris et al., 2012; Hackman, 2013). This is perplexing because we might expect it to be difficult for the update of h to track the moving target E_µ[δ_t e_t] when αh is smaller than α. Recall that GTD(λ) uses an LMS rule to learn h based in part on the TD errors produced by the update to the primary weight vector wt. On the other hand, Maei's (2011) experiments with GTD(0) on Baird's counterexample used a value of αh much larger than α, but those experiments did not investigate or report whether h approximated h* well. Based on prior evidence, we hypothesize that the GTD(λ) algorithm's best performance will be achieved with small values of αh relative to α, and thus that the secondary weights will not match what the theory tells us h should be.
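For reference in the experiments that follow, here is a compact sketch of the GTD(λ) update as we use it, with the secondary weights h learned by the LMS rule just mentioned. This is a paraphrase of the algorithm from Chapter 2; the placement of the γ and λ terms follows the trace and correction expressions used later in this chapter and may differ cosmetically from the original pseudocode.

```python
import numpy as np

class GTDLambda:
    """A sketch of the GTD(lambda) update: primary weights w, secondary weights
    h learned by an LMS rule, and an importance-weighted eligibility trace e."""

    def __init__(self, n_features, alpha, alpha_h, lam):
        self.w = np.zeros(n_features)
        self.h = np.zeros(n_features)
        self.e = np.zeros(n_features)
        self.alpha, self.alpha_h, self.lam = alpha, alpha_h, lam

    def update(self, x, x_next, z, gamma, gamma_next, rho):
        """One transition: features x and x_next, cumulant z, termination
        signals gamma (current) and gamma_next, and importance ratio rho."""
        delta = z + gamma_next * (self.w @ x_next) - self.w @ x
        self.e = rho * (gamma * self.lam * self.e + x)
        # Primary weights: TD update along the trace, plus the gradient correction.
        self.w += self.alpha * (delta * self.e
                                - gamma_next * (1.0 - self.lam) * (self.h @ self.e) * x_next)
        # Secondary weights: LMS rule tracking E[delta * e] for the current w.
        self.h += self.alpha_h * (delta * self.e - (self.h @ x) * x)
        return delta
```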

Experiment

The task of our first experiment was to find the values of α and αh that produce the lowest MSPBE over the last 50 episodes of a 200-episode experiment on five instances of the chain problem. We tested 72 instances of GTD(λ), each with a different combination of α and αh: all combinations of α ∈ {0.0025, 0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32} and αh ∈ {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0.25, 0.5, 0.75, 1.0}. The five problem instances included two on-policy instances and three off-policy instances. Each algorithm instance was initialized with w = 0 and h = 0, and then run for 200 episodes. The whole procedure was then repeated for 100 runs. The random seed was initialized to the same value for all algorithm instances at the beginning of each group of 100 runs. We used the MSPBE as a measure of performance, like many previous studies of gradient-TD methods (e.g., Sutton et al., 2009; Maei, 2011; Hackman, 2012). For each run, the root MSPBE, or RMSPBE, was recorded at the end of each episode and averaged over runs. At the beginning of each run, the weight vectors of GTD(λ) were both set to zero. We selected the α and αh that achieved the best total RMSPBE over the last 50 episodes.

The values of α and αh that achieved the best total RMSPBE over the last 50 episodes are given in Table 7.1. Small values of αh yielded the best RMSPBE, but the smallest available value, αh = 0.00001, did not produce the lowest average RMSPBE in any chain instance.

chain instance   pµ=0.5, pπ=0.5   pµ=0.5, pπ=0.25   pµ=0.5, pπ=0.75   pµ=0.6, pπ=0.4   pµ=0.95, pπ=0.95
α                0.01             0.01              0.01              0.01             0.08
αh               0.01             0.1               0.0001            0.01             0.01

Table 7.1: The values for α and αh found to minimize RMSPBE on average, in each instance of the chain problem. Each instance is defined by the combination of target policy and behavior policy. The probability of taking the right action for the target policy is denoted by pπ, and likewise for the behavior: pµ. See text for a description of the experiment.

cosine-similarity(a, b) = (x^T B y) / (√(x^T B x) √(y^T B y)), where x = Xa, y = Xb

weighted-difference(a, b) = √( Σ_{s∈S} d(s) [(a − b)^T x(s)]^2 ) = √( (X(a − b))^T B (X(a − b)) )

correction-difference(a, b) = |e_t^T a − e_t^T b|

Table 7.2: Three different ways to measure the similarity of the secondary weights of GTD(λ) and their optimal values given by the parameters of the MDP.

We now turn to describing how we measured the similarity of h to h*, in order to answer the question specified in Section 7.1.2. We used three measures, summarized in Table 7.2. The first was the cosine-similarity, which measures the angle between two vectors, weighted by the state visitation probabilities induced by the behavior policy and the transition dynamics of the MDP. The cosine-similarity takes on values between -1 and +1, with -1 indicating the two vectors point in opposite directions, +1 indicating the two vectors point in precisely the same direction, and 0 indicating the two vectors are orthogonal. The second measure, the weighted-difference, measures the 2-norm difference between two vectors, again weighted by the distribution of state visitation. The minimum value of the weighted-difference is zero, indicating high similarity, while higher values indicate low similarity. In all our experiments the weighted-difference was always less than one. The third measure, the correction-difference, computes the absolute difference in the correction term used in the GTD(λ) algorithm's update, h^T e_t. Again, the minimum value of the correction-difference is zero, indicating high similarity, while higher values indicate low similarity.

To answer our question, we performed an experiment using the parameter settings from Table 7.1. One instance of GTD(λ) was run on each of the five chain problem instances. Each algorithm instance was initialized with w = 0 and h = 0, and the values of α and αh were set according to Table 7.1. Each combination was run for 200 episodes. The whole procedure was then repeated for 200 runs. For each run, the similarity of h to h* and the similarity of h^{GTD-LS} to h* (according to our three measures) were recorded at the end of each of the 200 episodes and averaged over runs. We also recorded the ℓ2-norm of h and h* for each problem instance, at the end of each episode, again averaged over 200 runs. Figure 7.2 presents the results.
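The three measures of Table 7.2 are straightforward to compute; a sketch follows, where B is the diagonal matrix of state-visitation probabilities, X is the feature matrix, and e is an eligibility-trace vector (the argument names are ours).

```python
import numpy as np

def cosine_similarity(a, b, X, B):
    """Angle between Xa and Xb, weighted by the state-visitation matrix B."""
    x, y = X @ a, X @ b
    return float((x @ B @ y) / (np.sqrt(x @ B @ x) * np.sqrt(y @ B @ y)))

def weighted_difference(a, b, X, B):
    """Weighted 2-norm difference between the two value estimates Xa and Xb."""
    d = X @ (a - b)
    return float(np.sqrt(d @ B @ d))

def correction_difference(a, b, e):
    """Absolute difference in the correction term e^T h used by GTD(lambda)."""
    return float(abs(e @ a - e @ b))
```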

Results and conclusions

The results in Figure 7.2 show that, over the five problem instances, the similarity of h to h* was low compared to the similarity of h^{GTD-LS} to h*. The cosine-similarity of h^{GTD-LS} to h* approached its maximum value of 1.0 after only a few episodes, and the weighted-difference of h^{GTD-LS} to h* quickly decreased to its minimum value after a few episodes. In comparison, the cosine-similarity of h and h* seemed to get worse with each episode. The weighted-difference and correction-difference of h and h* appeared to improve over time.

The smallest weighted-difference and correction-difference between h and h* were achieved in the last problem instance, which featured on-policy updates and a behavior that nearly always selected the right action. The cosine-similarity was low for this problem instance. The plot of the norm indicates that both the learned h and h* did not change much from zero, perhaps due to the speed of learning the primary weights (due to the high α = 0.08 value). Good reduction in weighted-difference and correction-difference was also observed in the (pµ = 0.5, pπ = 0.75) problem instance. In this instance, αh was small, equal to 0.0001, and h was initialized to the zero vector. Therefore, h was never moved far from the zero vector, which can be seen in the plot of the norm of h. In this instance, the reduction in the norm of h* over time seems to account for the shape of the weighted-difference and correction-difference plots.

Overall, we conclude that the secondary weights learned by the GTD(λ) algorithm do not match their theoretical values (h*) well, when compared to how well h^{GTD-LS} matched h*. However, our results do suggest that the similarity between the secondary weights and h* can improve over time, though not according to the cosine-similarity. The overall similarity is typically highest when αh/α > 1.0, meaning that h can track changes in δt caused by the more slowly changing wt, as observed in our results.

7.1.3 Tuning GTD(λ) for similarity

Our next question is: "Does the similarity between h and h* improve if we tune α and αh to optimize some measure of similarity?". There are no related experiments in the literature to give us insight into the answer to this question. One might expect that directly optimizing for similarity should improve similarity, but we are limited by our ability to define a useful measure with which to optimize the learning-rate parameters of the GTD(λ) algorithm. In addition, we cannot guarantee that the GTD(λ) algorithm will be able to learn h well in the chain instances we have chosen. Therefore, we hypothesize a small improvement in similarity over the results of the previous section.

Figure 7.2: The similarity of h to h* and of h^{GTD-LS} to h* when the parameters of the GTD(λ) algorithm were tuned to minimize the MSPBE on several instances of the Markov chain domain. Each column corresponds to one of the five chain problem instances (pµ=.5, pπ=.5; pµ=.5, pπ=.25; pµ=.5, pπ=.75; pµ=.6, pπ=.4; pµ=.95, pπ=.95). Each of the first three rows corresponds to one of the three similarity measures described in the text. Each subfigure in the first three rows plots the similarity of h to h* (green lines) and of h^{GTD-LS} to h* (red lines) over 200 episodes on one instance of the chain task. The last row of figures plots the ℓ2-norm of h (in green) and h* (in red) for each problem instance. The results for the correction-difference between h^{GTD-LS} and h* were always near zero and were excluded from the presentation of the results. Note that the cosine-similarity of h^{GTD-LS} to h* quickly approaches one (high similarity), but this is difficult to see in the first row of results.

Experiment

We used the same five instances of the Markov chain task and conducted an experiment to find values of α and αh that optimize the overall similarity. We created a total-difference measure that attempts to balance the cosine-similarity and weighted-difference measures used in the previous experiment:

total-difference(a, b) = weighted-difference(a, b) / e^{cosine-similarity(a, b)}.    (7.3)

The total-difference measures how well two vectors match in both magnitude and direction, while incorporating the weighting by the stationary distribution included in the definitions of the weighted-difference and the cosine-similarity. The exponential accounts for the cosine-similarity taking on values less than zero. The correction-difference was not included because it did not seem to contribute much more information than the weighted-difference in the previous experiment. If the total-difference is zero, its minimum value, then the two vectors are the same.
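A direct transcription of Equation 7.3 (with B and X as in the similarity measures of Table 7.2):

```python
import numpy as np

def total_difference(a, b, X, B):
    """Equation 7.3: the weighted-difference divided by e^(cosine-similarity)."""
    x, y = X @ a, X @ b
    cos = (x @ B @ y) / (np.sqrt(x @ B @ x) * np.sqrt(y @ B @ y))
    d = x - y
    wd = np.sqrt(d @ B @ d)
    return float(wd / np.exp(cos))
```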

Our first task was to find values of α and αh that produce the lowest total-difference over the first 100 episodes on the five instances of the chain problem. We chose to optimize the parameters over the first 100 episodes because manual exploration revealed that near the end of the 200 episodes, h*_t begins moving toward the zero vector. We were more interested in early learning, when h*_t is not zero, and thus our optimization criterion focused on the first 100 episodes.

We tested 72 instances of GTD(λ), each with a different combination of α and αh sampled from the same sets as in the previous experiment. The experiment was run exactly as before, running 200 episodes per run and 100 runs for each algorithm instance on each problem instance. For each run, the mean total-difference over the first 100 episodes was recorded and averaged over runs.

The values of α and αh that achieved the lowest total-difference over the first 100 episodes are given in Table 7.3. In all problem instances tested, αh > α produced the lowest total-difference.

chain instance   pµ=0.5, pπ=0.5   pµ=0.5, pπ=0.25   pµ=0.5, pπ=0.75   pµ=0.6, pπ=0.4   pµ=0.95, pπ=0.95
α                0.0025           0.005             0.005             0.0025           0.04
αh               0.01             0.01              0.01              0.01             0.1

Table 7.3: The values for α and αh found to minimize total-difference on average, in each instance of the chain problem.

Given these values of α and αh, we ran an experiment to measure the similarity of h to h* on the chain problem. The experiment was run exactly as before, except that the values of α and αh for each problem instance were set according to Table 7.3. Figure 7.3 summarizes the results.

Results and conclusions

The results of this experiment (see Figure 7.3) illustrate some improvement in the similarity between h and h*, by all three measures, in comparison with the results in Figure 7.2. Specifically, the weighted-difference and correction-difference improved in the first two chain instances. The weighted-difference and correction-difference improved steadily over time for the (pµ = 0.5, pπ = 0.75) instance, but, unlike in Figure 7.2, the norm of h was not zero. Overall, the cosine-similarity showed the most significant improvement compared to the results in Figure 7.2, and the norm of h aligned well with the norm of h* in every task after an initial transient period.

Figure 7.3: The similarity of h to h* and of h^{GTD-LS} to h* when the parameters of the GTD(λ) algorithm were tuned to optimize similarity on several instances of the Markov chain domain. The format, organization, and notational conventions of these graphs are the same as those established in Figure 7.2.

Overall, we conclude that, in the chain problem instances tested here, when the parameters of the GTD(λ) algorithm are tuned to optimize our total-difference measure, the secondary weights learned by the GTD(λ) algorithm match their theoretical values (h*) well in comparison to the results of the previous section's experiments (Section 7.1.2). We can force the secondary weights to better match their optimal values through optimization of similarity, at least in the configurations tested.

The ratio αh/α found to optimize similarity was, for each of the five problem instances, always larger than 1.0. We speculate that the h vector can more easily track the changes in its non-stationary update target, the error due to the current value of the primary weights (δt et), when αh ≥ α.

7.1.4 MSPBE minimization and h

Prior experimental studies of gradient-TD methods (e.g., Sutton et al., 2009; Degris et al., 2012; Hackman, 2013; Dann et al., 2014) demonstrated that αh values near zero can achieve good performance in terms of MSPBE optimization and reward maximization (see Degris et al.'s (2012) policy-learning experiments). Perhaps the GTD(λ) algorithm works best when not using the secondary weights at all (αh = 0). Our experiments so far suggest that the values of α and αh that optimize the similarity of h to h* can be very different from the ones found to optimize the MSPBE.

Our next question is long but simple: "How does the performance of GTD(λ) with α and αh optimized for similarity compare to the performance of GTD(λ) with (1) α and αh optimized to minimize the MSPBE, (2) αh = 0 and α optimized to minimize the MSPBE, and (3) ht = h*_t and α optimized to minimize the MSPBE?". Based on the experimental evidence from the literature, we hypothesize that the variant of GTD(λ) with αh equal to zero and the variant that uses h* will yield the best MSPBE minimization.

Experiment

We used the same five problem instances as the preceding experiments, but this time we compared the performance of four variants of the GTD(λ) algorithm. The first variant used the settings for α and αh given in Table 7.1. The second variant of GTD(λ) used the settings for α and αh given in Table 7.3. The third variant used αh = 0 and α optimized for minimum RMSPBE (same range and setup as the previous experiments). The final variant used h*_t instead of ht in the update of GTD(λ), with α optimized for minimum RMSPBE. The first and second variants of GTD(λ) correspond to optimizing the MSPBE and similarity, respectively. The third variant corresponds to not using the secondary weights in the update, and the final variant uses the theoretically optimal h vector given the current estimate of wt and the parameters of the MDP.

In our final experiment with the chain problem, we tested each variant of GTD(λ) on each of the five chain problem instances. Our experimental setup was similar to before. Each algorithm instance was initialized with w = 0 and h = 0, and then run for 200 episodes. The whole procedure was then repeated for 1000 runs. For each run, the RMSPBE was recorded at the end of each of the 200 episodes and averaged over runs to produce the learning curves in Figure 7.4.

Results and conclusions

The results show that the GTD(λ) algorithm with αh = 0 produced the best learning curve in each of the five problem instances. In addition, the variant of GTD(λ) with αh tuned to minimize RMSPBE was at least as good across the five problems as the variant of GTD(λ) that used h*. Finally, the variant of the GTD(λ) algorithm with αh tuned to optimize similarity was the slowest to learn in all five problem instances, but did approach the performance of the other algorithm variants by the end of the experiment in three problem instances.

Figure 7.4: The RMSPBE learning curves for four variants of the GTD(λ) algorithm on five instances of the Markov chain domain (pµ=.5, pπ=.5; pµ=.5, pπ=.25; pµ=.5, pπ=.75; pµ=.6, pπ=.4; pµ=.95, pπ=.95). The four variants, labelled in the figure itself, are: parameters tuned for similarity; αh = 0 with α tuned for RMSPBE; h = h* with α tuned for RMSPBE; and parameters tuned for RMSPBE. See text for a description of the experiment and discussion of the results.

Our main conclusion from these results is that learning the secondary weights with GTD(λ), with α and αh set to minimize RMSPBE, performs quite well compared to not using the secondary weights at all (i.e., αh = 0), and compared to using the optimal secondary weights (i.e., h* instead of h). We also conclude that tuning the learning-rate parameters of GTD(λ) to optimize similarity results in a noticeable performance penalty in terms of MSPBE reduction over time. However, it appears that there is less of a performance penalty in problem instances with more deterministic behavior policies. Finally, we also conclude that the variant of GTD(λ) with αh = 0 can perform better in initial learning than the variant that used h*. We conjecture that, in problem instances where this is true, the correction term is not needed for convergence and actually slows learning, even when h* is used. Verifying this conjecture requires additional experiments, which we leave to future work.

7.1.5 Overall conclusions

Overall, we draw several conclusions from our experiments on the Markov chain problem. The first is that optimizing the parameters of GTD(λ) for low MSPBE will not necessarily facilitate learning a secondary weight vector that is nearly equal to h*. It appears that learning the secondary weights, albeit with αh/α < 1.0, yields MSPBE learning curves similar to those obtained using the secondary weight vector specified by h*. If, on the other hand, αh/α > 1.0, then the learned secondary weights can approximate h* better, though not perfectly, and the algorithm will typically yield slower MSPBE minimization.

7.2 Experiments on Baird’s counterexample

There is some uncertainty surrounding the importance, in practice, of the secondary weights in problems featuring off-policy sampling. Prior experimental studies have demonstrated that both off-policy linear TD(0) and expected Sarsa(0) can outperform their gradient-TD counterparts (Sutton et al., 2009; Maei, 2011; Hackman, 2012) in on-policy simulation problems, and expected Sarsa(0) can outperform GQ(0) in some off-policy domains (see Hackman, 2012). Our experiments in this chapter show that fast learning can be achieved using GTD(λ) with the secondary weights equal to zero under off-policy sampling. Our experiments with learning off-policy on robots in the previous chapter showed accurate predictions, and that policies can be learned with αh small and even zero. On the other hand, there are known counterexamples for the convergence of linear off-policy TD(0) with function approximation (Baird, 1995), with non-linear function approximation (Maei, 2011), and for Q-learning (Baird, 1993). Each of these counterexamples has been solved by gradient-TD learning methods.

In order to gain some new understanding of the importance of the secondary weights in off-policy domains, we performed several experiments with the GTD(λ) algorithm on Baird's counterexample: an MDP that causes the value estimates of linear off-policy TD(0) to diverge.

7.2.1 Problem

We use a variant of Baird's counterexample (Baird, 1995) described by Maei (2011) and shown in Figure 7.5. We will refer to this problem, slightly inaccurately, as Baird's counterexample. This MDP contains seven discrete states and no terminal states; it is a continuing problem. There are two actions available in each state: action one (solid arrow) and action two (dashed arrow). Action one moves the demon to state seven from every state, including from state seven. The second action, in states one to six, moves the demon to a new state from one to six randomly with equal probability, but never to state seven. The second action, in state seven, likewise moves the demon to any state from one to six, randomly with equal probability, but never to state seven.

Figure 7.5: A variation of Baird's counterexample due to Maei (2011). Each state i ∈ {1, . . . , 6} has feature vector x(i) = e_1 + 2e_{i+1}, and state seven has feature vector x(7) = 2e_1 + e_8, where e_j denotes the jth standard basis vector in R^8.

The learning problem is specified by the question functions. The target policy selects action one deterministically in every state. The cumulants are zero on every transition, and γt is constant and equal to 0.99. This setting of the question functions specifies a single GVF to learn.

The answer functions for this task are straightforward. The behavior policy selects action one with probability 1/7, and action two with probability 6/7, in every state. The feature vectors for each state are given in Figure 7.5. The λ value is a constant function equal to 0.0. The learning task specified by these question functions has a solution of v(s; π, γ, z) = 0 for all states, attained by w = 0 and, more generally, by w = c[−2, 1, 1, 1, 1, 1, 1, 4]^T for any c ∈ R: an infinite number of solutions.

The combination of question and answer functions causes the parameter value estimates of TD(0 ≤ λ < 1) to diverge to infinity. The difficulty is due to the interaction of the initialization of the weight vector, w_0 = [1, 1, 1, 1, 1, 1, 1, 10]^T, the large value of γ, the feature component shared amongst all states, and the mismatch between the target and behavior policies. Let us trace some transitions through the MDP and see what happens to the off-policy linear TD(0) algorithm, described in Chapter 2. Consider the first transition into state seven. The weight vector wt will still equal [1, 1, 1, 1, 1, 1, 1, 10]^T, because any transitions from states one through six are off-policy, with ρt = 0, and cause no update. The TD error for the transition from state six to seven will be:

δ_t = z_t + γ_{t+1} x_{t+1}^T w_t − x_t^T w_t = 0 + 0.99(2 + 10) − (1 + 2) = 8.88.

Assuming α = 0.1, wt+1 then becomes [7.216, 1, 1, 1, 1, 1, 13.432, 10]^T:

w_{t+1} = w_t + α ρ_t δ_t x_t = w_t + 0.1(7.0)(8.88) x_t.

Next, the behavior will likely transition the demon into some other state from one to six, with a TD error of zero and no update. The demon will jump around between states one through six (again with no updates to wt) and then eventually transition to state seven again. Every time the demon transitions from a state to state seven for the first time, the shared component w(1) is increased, along with the component of wt corresponding to the state where the transition originated. Subsequent repeated transitions into state seven (e.g., six to seven from the example above) will cause alternating small-magnitude negative and positive updates to w(1) and no change to w(8). Only a transition from state seven to state seven will depress w(8). For example, if such a transition happened next:

δ_t = 0 + 0.99(24.432) − 24.432 = −0.24432,

then w_{t+1} = [6.873952, 1, 1, 1, 1, 1, 13.432, 9.828976]^T. These seven-to-seven transitions are rare and produce smaller-magnitude updates compared to the transitions into state seven. Over time, the magnitudes of δt and wt grow larger and larger, causing divergence in the value estimates.
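The divergence just traced, and the stabilizing effect of the secondary weights examined in the next section, can be reproduced in a short simulation. The sketch below uses the feature vectors of Figure 7.5 and, for comparison, a GTD(0) update of the form discussed later in this section; the step sizes, the exact form of the secondary-weight update, and the random-seed handling are illustrative assumptions rather than the settings of our experiments.

```python
import numpy as np

def baird_features():
    """Feature matrix following Figure 7.5: rows are states one to seven;
    states 1-6 use e_1 + 2*e_{i+1}, and state seven uses 2*e_1 + e_8."""
    X = np.zeros((7, 8))
    for i in range(6):
        X[i, 0], X[i, i + 1] = 1.0, 2.0
    X[6, 0], X[6, 7] = 2.0, 1.0
    return X

def run(algorithm, steps=5000, alpha=0.01, alpha_h=0.05, gamma=0.99, seed=0):
    """Off-policy linear TD(0) versus GTD(0) on this problem.
    Returns the history of ||w||."""
    rng = np.random.default_rng(seed)
    X = baird_features()
    w = np.array([1., 1., 1., 1., 1., 1., 1., 10.])    # the problematic initialization
    h = np.zeros(8)
    s = int(rng.integers(0, 7))
    norms = []
    for _ in range(steps):
        solid = rng.random() < 1.0 / 7.0               # behavior takes action one w.p. 1/7
        s_next = 6 if solid else int(rng.integers(0, 6))   # action two: uniform over 1-6
        rho = 7.0 if solid else 0.0                    # target policy always takes action one
        x, x_next = X[s], X[s_next]
        delta = 0.0 + gamma * (w @ x_next) - w @ x     # the cumulant is always zero
        if algorithm == "td":
            w = w + alpha * rho * delta * x
        else:                                          # GTD(0), i.e., lambda = 0
            w = w + alpha * rho * (delta * x - gamma * (x @ h) * x_next)
            h = h + alpha_h * (rho * delta - x @ h) * x
        norms.append(float(np.linalg.norm(w)))
        s = s_next
    return norms

print(run("td")[-1], run("gtd")[-1])
```

Run with small step sizes, the TD(0) weight norm grows steadily while the GTD(0) weight norm remains bounded, mirroring the analysis above.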

7.2.2 The role of h

The ambition of these experiments is to investigate the contribution of the secondary weights and the parameter sensitivity of GTD(λ) in a problem in which the TD(λ) algorithm will surely diverge. Specifically, we seek to answer two questions: (1) "How will GTD(0) perform on a known counterexample (a) if we learn the secondary weights as usual, (b) if αh = 0, and (c) if we use h* instead of h?"; (2) "What is the parameter sensitivity of GTD(λ) with respect to λ, α, and αh on a known counterexample?".

We hypothesize that the answer to question one will be that GTD(0) with αh = 0 will diverge, but it is unclear which of the other two variants will perform best. The answer to the second question is a mystery, as no prior work has attempted this specific study.

Our first task was to find the best parameters for the three variants of GTD(0) on Baird's counterexample. We ran an experiment, sweeping over many settings of the learning-rate parameters. The first variant of GTD(0) had its α and αh parameters determined by a parameter sweep. The second variant used αh = 0 and a value of α determined by a parameter sweep. The final variant used h* instead of h (as before) and a value of α determined by a parameter sweep. The parameters α and αh (when applicable) were sampled from all combinations of α ∈ {0.1 × 2^{-i} : i = 1, 2, . . . , 20} and αh = η × α, where η ∈ {1.5^{j} : j = −2, −1, . . . , 12}. Each combination of algorithm instance, α, and αh was initialized with w = w_0 (as described above) and h = 0, and then run for 5000 steps. The whole procedure was then repeated for 200 runs. The random seed was initialized to the same value for all algorithm instances at the beginning of each group of 200 runs. For each run, the mean RMSPBE over all 5000 steps was recorded and averaged over runs. The best parameter settings for each algorithm variant were used in the next experiment.

Next we ran the experiment with the three variants of GTD(0), each with the α and η values found to be best in the parameter sweep. All values of α tested caused divergence for the variant of GTD(0) with αh = 0; we used α = 0.0125 (the learning-rate parameter value that was best for the GTD(λ) variant with αh > 0). Again we ran each variant for 5000 steps and repeated the experiment 200 times, recording the RMSPBE on each step averaged over runs to produce the learning curves in Figure 7.6.

We also investigated the parameter sensitivity of the GTD(λ) algorithm. We tested GTD(λ) with λ equal to 0.0, 0.5, and 0.9, and the same combinations of α and η described above. Again we ran each combination of λ, α, and η for 5000 steps, and repeated the experiment 200 times, recording the mean RMSPBE over all 5000 steps averaged over runs. The results of this experiment are summarized in the heat maps in Figure 7.7.

Results, analysis, and conclusions

The results of our experiment show that GTD(0) with αh = 0 diverged, GTD(0) with learned h did not diverge, and GTD(0) with h* performed best over 5000 steps on Baird's counterexample. The best performance of the variant of GTD(0) with learned h was achieved with α = 0.0125 and αh = 3.375α. The best performance of GTD(0) with h = h* was achieved with α = 0.025.

Figure 7.6: The learning curves for three variants of GTD(0) on Baird's counterexample: GTD(0) with αh = 0, GTD(0) with learned h, and GTD(0) with h = h*. Plotted is the RMSPBE versus time-step, with a log scale on the x-axis. Each variant uses the best parameters found over a large systematic sweep.

Consider the updates of the GTD(0) algorithm in Baird's counterexample. Recall that the GTD(0) algorithm updates the primary weights with a correction term involving the next state's feature vector:

w_{t+1} = w_t + α ρ_t [δ_t x_t − γ_{t+1}(x_t^T h_t) x_{t+1}],

since λ = 0. In Baird's counterexample, the correction helps by depressing the value of w(8). The GTD(0) algorithm's estimate of δt, approximately x_t^T h_t, will become large after several transitions into state seven. Once x_t^T h_t becomes large, w(8) will be decreased in proportion to −α(x_t^T h_t) x_{t+1} on each transition into state seven. In this problem, the correction helps prevent divergence and helps GTD(0) find a close approximation to the correct weight vector, as supported by the results in Figure 7.6.

From the results in Figure 7.6 we conclude that the secondary weights of GTD(0) are useful for avoiding divergence in Baird's counterexample, and that learning is faster if the theoretically optimal secondary weights (h*) are used, compared to learning the secondary weights in the usual way.

The results of the parameter sweep showed that the average RMSPBE achieved by GTD(λ) in this domain was worse for the two values of λ larger than zero. Also, the range of α and η values that did not cause divergence appeared to become smaller for the larger λ settings. The individual learning curves of GTD(λ = 0.9), for various parameter settings, exhibited highly erratic performance (not shown), and all parameter combinations caused divergence, even for α as small as 0.1 × 2^{-20}.

Figure 7.7: The average RMSPBE of GTD(λ) on Baird's counterexample, for three different values of λ (0.0, 0.5, and 0.9). Each heat map shows the mean RMSPBE (color) over 5000 steps for different combinations of α and η. White corresponds to massive errors, dark red (near black) regions correspond to low RMSPBE, and lighter colors indicate large RMSPBE.

One explanation for the performance of the GTD(λ) algorithm on Baird's counterexample might be large weight updates, caused by large likelihood ratios multiplying in the trace update. In this problem, the traces are frequently cleared by the off-policy transitions. However, in the unlikely event that several self-transitions are made in state seven, the trace updates of the GTD(λ) algorithm can grow large quickly. To see this, consider successive state-seven to state-seven transitions with λ = 0.9, assuming e_{t−1} = x_{t−1} because ρ_{t−1} = 0:

e_t = ρ_t(γλ e_{t−1} + x_t) = 7(0.99 × 0.9 x_{t−1} + x_t) = 6.236 x_{t−1} + 7 x_t

e_{t+1} = ρ_{t+1}(γλ e_t + x_{t+1}) = 7(0.99 × 0.9(6.236 x_{t−1} + 7 x_t) + x_{t+1}) = 41.31 x_{t−1} + 43.659 x_t + 7 x_{t+1}

e_{t+2} = ρ_{t+2}(γλ e_{t+1} + x_{t+2}) = 7(0.99 × 0.9(41.31 x_{t−1} + 43.659 x_t + 7 x_{t+1}) + x_{t+2}) = 357.65 x_{t−1} + 272.30 x_t + 43.659 x_{t+1} + 7 x_{t+2}.

The effect of such sequences, though rare, is a large update, which causes instability in the GTD(λ) algorithm. Larger λ values lead to larger possible updates. Individual runs of the GTD(λ) algorithm on Baird's counterexample can exhibit large spikes in RMSPBE due to these large updates, causing the performance of the algorithm to suffer in practice. This instability did not arise in our experiments with the Markov chain, because the likelihood ratio was never greater than 1.5 and the episodic structure of the problem precludes large traces.

Another item of interest concerns the diminished role of the GTD(λ) algorithm's correction term in Baird's counterexample when λ is large. The update of the primary weights in the GTD(λ) algorithm is corrected by −γ_{t+1}(1 − λ)(h_t^T e_t) x_{t+1}, and thus as λ becomes large the correction becomes smaller. Additional experiments are needed to conclude whether (1) the large updates are due to the traces, or (2) the smaller-magnitude gradient corrections are the primary cause of the unsatisfactory performance we observed on Baird's counterexample with large λ.

These experiments shed light on a potential sensitivity problem of the GTD(λ) algorithm, previously unreported in the literature. Our experiments also highlight why the proof of convergence for the GQ(λ) algorithm (the action-value variant of GTD(λ)) includes a condition regarding the boundedness of the eligibility traces (see Maei & Sutton, 2010).
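The compounding in the trace can also be checked numerically; a small sketch follows (the initial trace is set to x(7) purely for illustration).

```python
import numpy as np

def trace_growth(n_self_transitions, rho=7.0, gamma=0.99, lam=0.9, dim=8):
    """Magnitude of the eligibility trace after repeated state-seven
    self-transitions, using e_t = rho * (gamma * lam * e_{t-1} + x_t)."""
    x7 = np.zeros(dim)
    x7[0], x7[7] = 2.0, 1.0             # feature vector of state seven (Figure 7.5)
    e = x7.copy()                        # trace entering the sequence
    norms = [float(np.linalg.norm(e))]
    for _ in range(n_self_transitions):
        e = rho * (gamma * lam * e + x7)
        norms.append(float(np.linalg.norm(e)))
    return norms

# Each self-transition multiplies the old trace by rho*gamma*lam (about 6.24),
# so a short run of such transitions inflates the update by orders of magnitude.
print(trace_growth(4))
```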

7.2.3 Similarity measures of h

In our final experiment of this chapter, we return to investigating vector similarity. Our previous experiments on the Markov chain suggest that the similarity between h and h* may be low when the parameters of the GTD(λ) algorithm are optimized to minimize the MSPBE. Our experiments on Baird's counterexample suggest that the secondary weights of GTD(0) were important for avoiding divergence. The question we seek to answer with our final experiment is: "How similar is the h learned by GTD(0) to h* in Baird's counterexample?".

Based on all our previous experiments, our hypothesis is that even when α and αh are tuned to optimize RMSPBE, the similarity of h and h* will be high in Baird's counterexample.

Experiment

We conducted an experiment to test the similarity of ht to h*(wt) on Baird's counterexample. We tested the GTD(0) algorithm with the values of α and αh found to be best in the previous experiment, recording the similarity according to our three measures. The experiment lasted for 5000 steps and was repeated 400 times, recording the mean cosine-similarity, mean weighted-difference, and mean correction-difference on each of the 5000 steps, averaged over runs. The results of this experiment are summarized in the similarity curves in Figure 7.8.

Results and conclusions

We conclude that, in Baird's counterexample, the h vector learned by GTD(0) matches h* well. Across all three measures, h was quite similar to h* in comparison to the previous similarity results (see Figure 7.3), with the cosine-similarity being the exception. This high similarity, however, was achieved without tuning α and αh to optimize similarity.

Figure 7.8: The similarity between the secondary weight vector h learned by GTD(0) and h*, measured on Baird's counterexample. As before, we measure the correction-difference, the weighted-difference, and the cosine-similarity (defined in Table 7.2), each plotted against time steps.

This is not entirely surprising given our previous results with this problem, and the fact that αh was over seven times larger than α; consequently, ht can track changes in its update target over time.

7.3 Related Work

Much of the experimental work on gradient-TD methods has rightly focused on the question of overall performance in comparison to other methods. Sutton et al. (2009) provided the first positive demonstrations of the GTD2 and TDC algorithms on on-policy Markov chains,

Boyan's chain, Baird's counterexample, and even computer Go. Degris et al. (2012) compared the greedy-GQ(λ) algorithm to Q-learning and an off-policy actor-critic algorithm for learning control policies in three simulation domains, demonstrating that greedy-GQ(λ) can sometimes perform poorly. Finally, Maei (2011) contributed experiments illustrating convergence of non-linear GTD(0) and greedy-GQ(λ) on counterexamples.

There have also been several studies of the parameter sensitivity of off-policy learning methods. The largest of these studies was by Dann et al. (2014), comparing nine value function estimation methods on six simulation domains (including Baird's counterexample). Dann's work provides insights into the role of λ, α, and αh in several off-policy domains, but the results are difficult to trust because the parameter sweeps were based on only three repetitions of their experiment. Geist and Scherrer (2014) compared the role of λ in eight value function estimation algorithms (including TD(λ) and GTD(λ)) on two randomly generated MDPs. The result of that particular experiment was inconclusive (as noted by the authors), but their remaining experiments showed that MSPBE-minimization methods, such as GTD(λ) and LSTD(λ), outperformed Bellman-error-minimization methods across 200 random MDPs. Hackman (2012) contributed an impressive set of experiments concerning state-action value function estimation in a Markov chain domain (slightly different from ours) with GQ(0) and expected Sarsa(0). Hackman found that the η parameter (αh = α ∗ η) had little effect on the RMSPBE achieved by a variant of GQ(0) in on-policy problem instances. Delp (2011) studied the parameter sensitivity of state-action value function estimation with GQ(λ) in a discrete gridworld with linear function approximation. Delp found that when λ = 0 or 0.5, setting αh equal to zero performed best, just as in our Markov chain results. Finally, a recently proposed gradient-TD method was compared empirically with the GTD(λ) algorithm, including a sweep of λ and α (van Hasselt et al., 2014). The results showed that high values of λ and α yield high value function error in a random walk, whereas the new method produced good results over a wide range of λ and α values.

The experiments in this chapter are distinct in three ways. First, our experiments are the first to study a gradient-TD method's ability to learn the secondary weights. Second, we are the first to compare learning the secondary weights with not using them at all, and with using the optimal secondary weights given by the parameters of the MDP. Third and finally, our experiments are the first to empirically demonstrate the instability of the GTD(λ) algorithm with λ greater than zero in Baird's counterexample. Although we cannot claim our results and insights encompass other gradient-TD

methods (e.g., GQ(λ)), the questions we posed and results generated may provide clues and a starting place for the study of other related methods.

7.4 Summary

In this chapter, we presented a series of experiments designed to provide insight into the inner workings of one gradient-TD learning algorithm, GTD(λ). Our first set of experiments used variants of a Markov chain and studied how well the GTD(λ) algorithm learns the secondary weight vector. We then tested how well GTD(λ) minimized RMSPBE with different variations on how the secondary weights were learned. Our second batch of experiments focused on a variant of Baird's counterexample. Again we studied RMSPBE minimization with different variations on how the secondary weights were learned, and we tested the sensitivity of the GTD(λ) algorithm to variations in its parameter settings. Finally, we investigated how well GTD(λ) learned the secondary weight vector in Baird's counterexample.

The overall conclusion from all our experiments is: in some problems the secondary weights of the GTD(λ) algorithm are important, but in other problems the correction can slow learning. In the chain, the fact that the secondary weights were not learned well did not affect overall MSPBE optimization, agreeing with many of the experimental results in the literature. On the other hand, in Baird's counterexample, a large value of αh is needed to achieve stable learning and the secondary weights are important for preventing divergence. Reducing the influence of the secondary weights and using longer eligibility traces caused the GTD(λ) algorithm to perform poorly in Baird's counterexample. In addition, our experiments demonstrate that the GTD(λ) algorithm can achieve good performance with very different parameter settings, depending on the problem. Without prior knowledge of the problems our learning agents will face, it can be difficult to set the learning-rate parameter values of this algorithm. There are several possible approaches we could take to make the GTD(λ) algorithm more robust, including heuristic rules to adjust α, αh, and λ during learning, but this topic is left to future research.

Off-policy updating is a key aspect of our approach to predictive knowledge learning, and understanding one of our learning methods is important for legitimizing the practicality of our approach. It is only recently that efficient, linear-complexity, non-divergent off-policy algorithms like GTD(λ) were invented. New insight into and understanding of a key algorithm for GVF learning contributes a better understanding of the stability and overall

practicality of our approach to predictive knowledge.

Chapter 8

Estimating off-policy progress1

Tracking the learning progress of a large collection of demons is both useful and non-trivial. In the off-policy setting, gradient temporal difference methods can be used to update and refine each demon's prediction over time, but the algorithms themselves do not provide a measure of how much learning progress has been made by each demon. As an experimenter, we often want to know which demons are learning, which demons are not learning (no progress), which demons need significantly more time to learn, and which demons are nearly done learning. Developing a predictive approach to knowledge involves experimentation and testing; a measure of learning progress is important for demonstrating the practicality of our approach. We have demonstrated several different approaches to measuring off-policy learning progress on a robot in earlier chapters. As you will see later, these approaches are difficult to use with a large number of demons and a large number of distinct target policies.

This chapter introduces a new approach to measuring learning progress based on estimating the RMSPBE. We demonstrate empirically that our new measure provides a useful measure of learning progress with experiments on a Markov chain and the Critterbot.

8.1 Measuring progress on a robot

Estimating off-policy learning progress on a robot is particularly challenging, as our previous experiments illustrate. On a robot, we do not have access to the parameters of the MDP, so we cannot exactly compute the MSPBE or the true value function, v(s; π, γ, z). We must estimate progress in some other way. In on-policy learning, estimating progress is more straightforward because all the data generated for learning is also useful for progress

1Some of the text and results contained in this chapter also appear in a conference paper (White, Modayil & Sutton, 2013). That paper was drafted in its entirety and its experiments conducted by this author.

evaluation, but this is not necessarily the case in off-policy learning. We have already tried several approaches to measuring off-policy learning progress. We estimated the target Gt and then computed the RMSE for each demon by forcing the behavior policy to execute long sequences of actions according to each target policy. This method was used in Chapter 6, experiments 6.4 and 6.5. We also tried a training-testing split, where the robot followed the behavior policy and updated the demons; then, after learning, the demons were tested by executing each demon's target policy in turn. We used this approach in experiments 6.6, 6.7, and 6.8 of Chapter 6. A training-testing split may not be ideal because we cannot measure learning progress on each step during learning, and thus debugging and improving the system may become time consuming and frustrating. A final approach involves interspersing tests during learning, periodically pausing learning to generate a trajectory to evaluate a demon's prediction. We demonstrate this approach on the Critterbot in this chapter.

All of these approaches induce a trade-off between how much time we devote to updating the demons and how much time we devote to progress estimation. If the measures are updated too infrequently, then the estimates of learning progress may become out-of-date and potentially highly variable. As the number of unique target policies grows large, it will typically take more time to evaluate each demon's progress, thus potentially reducing the amount of learning data or the timeliness of the estimate of progress.

We now specify a set of criteria for an estimate of learning progress that would be useful for large-scale demon learning on a robot. First, ideally the measure should be online, updated after each learning step, and thus always up-to-date for each demon. The measure should be updated on each step with linear (or less) computation and storage (linear in the number of feature components) to ensure the scalability of Horde. The estimate should not place strong requirements on the behavior policy; the system could then be free to learn on every interaction step with no interruptions for testing. Finally, the estimate should provide an accurate estimate of learning progress, and in particular be robust to non-stationarity, which is common on real hardware systems such as robots. In the next section we propose a new estimate of learning progress that we believe meets all four of these criteria.

8.2 A new proposal

Instead of using prediction accuracy or value error, we propose to estimate learning progress by estimating the mean squared projected Bellman error of a demon's learned weight vector w:

MSPBE(w) = ||Xw − ΠT^{γλ}Xw||²_B,

originally defined in Chapter 2. This measure of progress does not always correspond to prediction accuracy. Our notion of learning progress does, however, indicate when a demon's predictions are approaching their best estimates, given the setting of the demon's answer functions. Additionally, in the case where the features can perfectly represent the true GVF, the MSPBE will correspond to the closeness of the demon's estimate to the true GVF, (v(s; π, γ, z) − v̂(s, w))², weighted by the distribution over states induced by the behavior. We choose to estimate the MSPBE because that is the objective that the demons attempt to optimize. Our aim is to estimate the learning progress for any value function estimation algorithm that minimizes the MSPBE, by estimating the RMSPBE.2 Recall that the secondary weight vector of GTD(λ) is equal to part of the gradient of the MSPBE,

h = E_µ[x(S_t)x(S_t)^⊤]^{-1} E_µ[δ_t e_t],

which can be learned by an LMS rule,

h_{t+1} = h_t + α_h[δ_t e_t − (h_t^⊤ x_t)x_t],

the same LMS rule used in the update of the GTD(λ) algorithm. The MSPBE of wt is equal to E_µ[δ_t e_t]^⊤ h_t, using the forward-backward equivalence described in the background chapter.

Therefore, we can sample Eµ[δtet], and form an estimate of the RMSPBE for each demon, on each time-step t, by:

RMSPBE(w_t) \approx \text{RUPEE}_{\text{vec}}(w_t) \overset{\text{def}}{=} \sqrt{\hat{h}_t^\top \, \overline{\delta e}^{\,\beta}}, \qquad (8.1)

or

RMSPBE(w_t) \approx \text{RUPEE}_{\text{scalar}}(w_t) \overset{\text{def}}{=} \sqrt{\overline{\hat{h}_t^\top \delta_t e_t}^{\,\beta}}. \qquad (8.2)

These measures are Recent Unsigned Projected Error Estimates, and thus we call them both RUPEE. The first measure, RUPEEvec, stores a moving average of δe, and the second measure, RUPEEscalar, makes use of an average of a single scalar. Both estimators use an independent, additional weight vector, ĥt ∈ R^n, updated by the same LMS rule as

2In intrinsically motivated reinforcement learning, learning progress is often defined as the derivative of the prediction error (see Oudeyer et al., 2007). Our notion of progress is simply an estimate of the MSPBE. We do not claim our notion of progress is more or less effective for intrinsically motivated reinforcement learning, but simply highlight the difference in terminology compared to the literature.

GTD(λ), but not used by the GTD(λ) algorithm. This duplication of weights enables each estimator's learning-rate parameter αĥ ∈ R to be tuned independently of αh.

The moving average of δe in RUPEEvec is computed using a variant of an exponential moving average with an unbiased initialization,

τ_{t+1} = (1 − β_0)τ_t + β_0, \qquad β = β_0 / τ_{t+1},

\overline{\delta e}_{t+1} = (1 − β)\overline{\delta e}_t + β δ_t e_t,

where β_0 > 0 is the user-specified averaging constant for the trace of δe and τ_0 = 0. This method acts like a sample average when t is small, and smoothly transitions to a normal exponentially weighted moving average when t grows large. The moving average inside

RUPEEscalar is computed in a similar fashion, as are all moving averages computed in this thesis. Estimating the RMSPBE using either estimator provides a notion of learning progress contingent on the representational and data constraints induced by the choice of the demon's answer functions. The RMSPBE can be zero even when the demon's approximate GVF is very different from the true GVF, and even when the demon's predictions are not

accurate in comparison to samples of the target Gt. The RMSPBE can be zero in these

cases both because it is a projected objective and because the objective is weighted by dµ (both discussed in the background chapter). In other words, the RMSPBE does not care about states the behavior never visits or value functions outside the representational class of the features, and neither do our RUPEE measures.

The TD error could be used to measure demon progress, but it is not ideal in our setting for two reasons. The squared TD error can be used to form an unbiased estimate of the mean squared Bellman error (MSBE). However, our TD-based demons optimize the MSPBE, and thus the MSBE would be a surrogate measure of MSPBE optimization. In addition, it is not clear how to use the TD error to estimate the MSBE when λ is greater than zero. Regardless of the relative merits and weaknesses of using an estimate of the RMSPBE to measure learning progress, a major benefit of the RMSPBE is that it is computable from observable quantities (see Appendix B for a discussion of objective functions for value function estimation).

Our new progress measures appear to satisfy the first three criteria we specified in the previous section. Both RUPEE measures can be updated on every time-step with computation and storage that is linear for each demon. They can be computed online from

immediately available data and are thus always up-to-date with recent experience. We do not need to interrupt learning with tests or use train and test phases, and thus the behavior is unrestricted, producing data for both learning and updating our estimates of the RMSPBE on every time-step. The accuracy of these new RUPEE measures and their usefulness for measuring the progress of many demons is still unclear, however, and this is the subject of the remaining sections of this chapter.
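To show how little machinery the two estimators require, here is a minimal per-demon sketch of RUPEEvec and RUPEEscalar (Equations 8.1 and 8.2), including the unbiased-initialization moving average described above. The class and variable names are ours, the inputs δt, et, and xt are assumed to come from the demon's GTD(λ) update, and clipping the inner product at zero before the square root is our own convention for handling negative samples.

```python
import numpy as np

class Rupee:
    """Sketch of the RUPEE estimators (Equations 8.1 and 8.2) for one demon."""

    def __init__(self, n, alpha_h_hat, beta_0):
        self.h_hat = np.zeros(n)   # extra weights, same LMS rule as GTD(lambda)'s h
        self.de_avg = np.zeros(n)  # moving average of delta_t * e_t (vector form)
        self.s_avg = 0.0           # moving average of h_hat^T (delta_t e_t) (scalar form)
        self.tau = 0.0             # normalizer for the unbiased initialization
        self.alpha_h_hat = alpha_h_hat
        self.beta_0 = beta_0

    def update(self, delta, e, x):
        de = delta * e
        # Independent LMS estimate of E[x x^T]^{-1} E[delta e]; not used by GTD(lambda).
        self.h_hat += self.alpha_h_hat * (de - self.h_hat.dot(x) * x)
        # Exponential moving average with unbiased initialization.
        self.tau = (1.0 - self.beta_0) * self.tau + self.beta_0
        beta = self.beta_0 / self.tau
        self.de_avg = (1.0 - beta) * self.de_avg + beta * de
        self.s_avg = (1.0 - beta) * self.s_avg + beta * self.h_hat.dot(de)

    def vec(self):
        # RUPEE_vec (Equation 8.1); clipped at zero because the sampled
        # inner product can be negative early in learning (our choice).
        return np.sqrt(max(self.h_hat.dot(self.de_avg), 0.0))

    def scalar(self):
        # RUPEE_scalar (Equation 8.2), same clipping convention.
        return np.sqrt(max(self.s_avg, 0.0))
```

Each demon would hold one such estimator and call update(delta, e, x) on every learning step, reading off vec() or scalar() whenever an estimate of the RMSPBE is needed.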

8.3 Experiments on the Markov chain

In our first set of experiments, we are primarily concerned with how well our RUPEE measures estimate the RMSPBE of the GTD(λ) algorithm in several simple simulation problems. In addition, we also investigate how well our new measures track the RMSPBE of the GTD(λ) algorithm when a large change occurs that causes a spike in the MSPBE. In later sections of this chapter we will investigate how well both RUPEE measures relate empirically to prediction accuracy, and investigate estimating the RMSPBE of thousands of demons on a robot. We begin with an experiment to investigate both RUPEE measures in a simulation domain that affords exact computation of the RMSPBE.

8.3.1 Comparing RUPEE and the RMSPBE

Our first experiment seeks to answer the question: "How well do both RUPEE measures estimate the RMSPBE of the GTD(λ) algorithm compared to several other possible estimators?". By "other possible estimators" we mean different ways to compute RUPEE, as well as other ways to estimate the RMSPBE that use either parameters of the MDP or more than linear computation and storage. Our experiments on the GTD(λ) algorithm suggest that the secondary weights can be learned well at the expense of slightly higher RMSPBE, and thus we might expect that both RUPEE measures, which are based on GTD(λ), might do a good job approximating the RMSPBE. We hypothesize that the estimators of the RMSPBE that use the parameters of the MDP or greater-than-linear computation will do a better job approximating the RMSPBE than either variation of RUPEE.

Experiment

The problem specification, question functions, and answer functions were the same as those used in the experiments of Chapter 7. We used the Markov chain domain, with the same five combinations of behavior and target policy probabilities, and the same feature vectors given in our original problem description. In all experiments, λ was taken to be a constant

function equal to 0.9. A single state GVF was learned (for each problem instance) by a single prediction demon using the GTD(λ) algorithm, with α and αh set to the values that minimize the RMSPBE, found in the parameter sweep described in Chapter 7 (Table 7.1, specifically). In order to assess the performance of the proposed measures, we defined four alternative estimates of the RMSPBE. The first three estimators use computation and storage that is greater than linear:

• RMSPBE1(w_t) = \sqrt{\overline{\delta e}^{\,\top} \, (\overline{x x^\top})^{-1} \, \overline{\delta e}}

• RMSPBE2(w_t) = \sqrt{(\overline{\delta e}^{\,\beta})^\top \, h_t^*}

• RMSPBE3(w_t) = \sqrt{E_µ[δ_t e_t]^\top \, \hat{h}_t}

The RMSPBE1 estimator computes the least squares estimate of the RMSPBE from a history of all previously observed features and cumulants. The overline denotes the sample long-run

average. The RMSPBE2 estimator combines a linear estimator of E_µ[δ_t e_t] with the value of h_t^* = E_µ[x_t x_t^\top]^{-1} E_µ[δ_t e_t] as defined by the parameters of the MDP. Finally, the RMSPBE3 estimator combines the true value of E_µ[δ_t e_t] with an estimate of E_µ[x_t x_t^\top]^{-1} E_µ[δ_t e_t]

computed similarly to RUPEEvec (via an LMS rule). The remaining estimator can be considered an alternative implementation of RUPEE:

RMSPBE4(w_t) = \sqrt{\hat{h}_t^\top \, δ_t e_t}

The RMSPBE4 estimator uses only instantaneous samples to estimate the RMSPBE, which might yield less bias but higher variance compared to RUPEEvec and RUPEEscalar in some domains. RMSPBE4 requires no memory for a moving average, in contrast to the O(n) storage RUPEEvec uses for \overline{\delta e}.

In our first experiment, we performed a parameter sweep to find the best parameter values for each estimator, in each instance of the Markov chain problem. The parameters were chosen from αĥ ∈ {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 0.05, 10^{-1}, 0.25, 0.5, 0.75, 1.0} and

β0 ∈ {0.001, 0.012, 0.023, 0.034, 0.045, 0.056, 0.067, 0.078, 0.089, 0.1, 0.25, 0.5, 0.75, 1.0}. Each parameter combination was evaluated at the end of 200 episodes using all 200 RMSPBEs and estimator values observed during the run:

\text{estimator-strength}(\hat{\theta}) = \frac{\text{the correlation between RMSPBE and } \hat{\theta} \text{ after 200 episodes}}{\text{the total squared difference between RMSPBE and } \hat{\theta} \text{ over 200 episodes}}, \qquad (8.3)

The parameter estimate θ̂ ∈ R was one of {RMSPBE1, RMSPBE2, RMSPBE3, RMSPBE4, RUPEEvec, RUPEEscalar}. An estimator performs well when the estimator strength is large and positive. We defined Equation 8.3 in order to provide one metric with which to optimize the parameters of RUPEE and the other estimators for comparison. Once the best parameters were selected, we investigated several measures of performance. All six estimators of the GTD(λ) algorithm's RMSPBE were run on each of the five problem instances for 200 episodes, and the whole procedure was repeated for 100 runs. The random seed was initialized to the same value for each estimator instance at the beginning of each group of 100 runs. The GTD(λ) algorithm's weight vectors and the vectors and averages of each estimator (including RUPEEvec and RUPEEscalar) were all set to zero at the beginning of each run. For each run, the estimator strength (according to Equation 8.3) was recorded and averaged over runs. We selected the β0 and αĥ that achieved the best estimator strength for each estimator, on each problem instance.

To answer the question posed at the beginning of this section, we performed an experiment using the parameter settings found in the previous experiment. In this experiment, the GTD(λ) algorithm was run on each problem instance for 200 episodes, with the RMSPBE estimate of each of the six estimators recorded at the end of each episode. The whole procedure was repeated for 200 runs. We recorded the closeness of each estimator to the true RMSPBE of the GTD(λ) algorithm, after each episode i, using three different measures:

• empirical-bias(θ̂_i) = θ̂_i − (RMSPBE at the end of episode i)

• squared-error(θ̂_i) = (empirical-bias(θ̂_i))²

• correlation(θ̂_i) = the correlation between RMSPBE and θ̂_i up to episode i.

Each of the three measures for each of the six estimators (including RUPEEvec and RUPEEscalar) was averaged over runs, and the weight vectors of the GTD(λ) algorithm and the data structures of all the estimators were set to zero at the beginning of each run. Figure 8.1 presents the results. We also include a plot of the RMSPBE of the GTD(λ) algorithm and the estimates of the RMSPBE according to RUPEEvec, RUPEEscalar, RMSPBE1, and RMSPBE4 over episodes.
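As a concrete illustration, the following sketch computes Equation 8.3 and the three per-episode measures above from two logged traces (one value per episode, for a single run). The function names and the array-based interface are ours, and in our experiments each measure was additionally averaged over independent runs.

```python
import numpy as np

def estimator_strength(estimate, rmspbe):
    """Equation 8.3: correlation over the run divided by the total squared difference."""
    estimate, rmspbe = np.asarray(estimate, float), np.asarray(rmspbe, float)
    corr = np.corrcoef(estimate, rmspbe)[0, 1]
    return corr / np.sum((estimate - rmspbe) ** 2)

def evaluation_measures(estimate, rmspbe):
    """Per-episode empirical bias, squared error, and running correlation."""
    estimate, rmspbe = np.asarray(estimate, float), np.asarray(rmspbe, float)
    bias = estimate - rmspbe
    squared_error = bias ** 2
    correlation = np.full(len(estimate), np.nan)
    for i in range(1, len(estimate)):
        # correlation between the estimator and the true RMSPBE up to episode i;
        # undefined (NaN) over constant segments
        correlation[i] = np.corrcoef(estimate[:i + 1], rmspbe[:i + 1])[0, 1]
    return bias, squared_error, correlation
```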

Results and conclusions

The results show that RUPEEvec achieved lower empirical bias, lower squared error, and higher correlation than the RMSPBE4 and RUPEEscalar estimators. The RMSPBE1 estimator (dashed line) performed best across all five problems and all three measures. The RMSPBE2 (cyan) and RMSPBE3 (purple) estimators achieved better empirical bias, squared difference, and correlation on the problem instances where the target policy was less random, but in problem instances (pµ = 0.5, pπ = 0.5) and (pµ = 0.6, pπ = 0.4) the

RUPEEvec was nearly as good, especially according to the correlation measure. The RMSPBE4 (red) estimator performed the worst according to all measures in all problem instances. All estimators besides RMSPBE1 and RMSPBE2 exhibited non-vanishing bias in problem

instance (pµ = 0.6, pπ = 0.4), while only RMSPBE1 and RUPEEvec exhibited diminishing

bias in problem instance (pµ = 0.5, pπ = 0.5). This may be due to limited sampling (not enough episodes) and to optimizing the estimators' parameters over a window of episodes.

The plots of the estimators over episodes show that RUPEEvec (blue) tracks the RMSPBE (black) well, and that the RMSPBE1 estimator is nearly equal to the RMSPBE.

Overall we conclude from these results that RUPEEvec estimates the RMSPBE well across several problem instances compared to several other estimators. RUPEEvec performed better than the other two more computationally frugal estimators, RMSPBE4 and

RUPEEscalar. The RMSPBE4 estimator performed particularly badly, suggesting that the averaging performed in both RUPEE measures is helpful. The RMSPBE1 estimator performed best overall, but it uses substantial computation and storage, which would make it difficult to update quickly with many demons. The RMSPBE2 estimator performed next best overall, indicating that knowledge of h*(w_t) = E_µ[x(S_t)x(S_t)^⊤]^{-1} E_µ[e_t δ_t] helps more than knowledge of

Eµ[etδt] in these problem instances.

8.3.2 A non-stationary domain

Our next experimental question is: "How well does RUPEEvec track the RMSPBE of the GTD(λ) algorithm when a demon experiences a single change that invalidates its current predictions?". This question is motivated by situations that might occur while learning many GVFs on a robot: perhaps the robot's wheels malfunction, or a human turns off the lights in the lab. In either case, several of the demons' predictions may become incorrect, and it would be useful if our estimate of the RMSPBE could reflect the change in the world. This would be useful to human experimenters, enabling analysis of what happened, and it would be useful to the robot, so that the robot might change its behavior in an effort to more efficiently relearn the inaccurate predictions.


Figure 8.1: An evaluation of RUPEEvec and RUPEEscalar compared to four other estimators for estimating the RMSPBE of GTD(λ) on five instances of the Markov chain task. Each row of the figure corresponds to an instance of the Markov chain problem. Each of the first three columns corresponds to a different evaluation measure, while the last column simply plots the RMSPBE of the GTD(λ) algorithm and several estimates of the RMSPBE versus episode number. See text for a description of the experiment and discussion of the results.

The key challenge for RUPEEvec here is to detect the large increase in the RMSPBE of the GTD(λ) algorithm via observations of GTD(λ)'s TD errors, before relearning returns the RMSPBE to normal pre-change levels. It is difficult to hypothesize how quickly

RUPEEvec will track the change in RMSPBE; however, we are optimistic that the linear

LMS rule and the moving average used inside RUPEEvec should adapt fairly quickly. We omitted RUPEEscalar from this experiment because RUPEEvec outperformed RUPEEscalar in all five variants of the chain domain in the previous experiment.

Experiment

In this experiment we compared RUPEEvec’s estimate of the RMSPBE to the true RMSPBE of the GTD(λ) algorithm in each of the five chain problem instances over 400 episodes. As before, a single demon was used to learn the state GVF. This time, the w vector of the

GTD(λ) algorithm was negated at the beginning of the 200th episode, w_200 ← −w_200, to simulate a large change in the environment. Aside from this modification, the experiment was run as before. The αĥ and β0 parameters were the same as those used in the previous experiment, except for the fourth chain instance, where we increased the value of β0 slightly.

Each combination of problem instance and RUPEEvec was run for 400 episodes, and the whole procedure was repeated for 400 runs, with the weight vectors of GTD(λ), (w, h), and the ĥ and \overline{\delta e}^β of RUPEEvec all initialized to zero at the beginning of each run. Figure 8.2 plots RUPEEvec and the true RMSPBE after each episode, averaged over runs.

Results and conclusions

The results indicate that RUPEEvec tracks the true RMSPBE of the GTD(λ) algorithm in each of the five Markov chain problem instances. The RUPEEvec measure reacts to the change in the TD errors of the GTD(λ) algorithm, increasing its estimate of the RMSPBE well before the true RMSPBE is suppressed by relearning.

We conclude from this experiment that RUPEEvec can react to changes in the world that cause an increase in the RMSPBE. Any changes in the accuracy of the GTD(λ) algorithm's predictions are immediately reflected in the RMSPBE. In comparison, RUPEEvec uses linear-complexity operations and two simple memories of the TD errors (the exponential average and the ĥ vector) to estimate and track changes in the RMSPBE. Overall, in these chain tasks, it seems RUPEEvec provides a reasonable estimate of the RMSPBE, especially given that it meets our somewhat restrictive criteria of being always up-to-date and computationally efficient.

8.4 Experiments on the Critterbot

Consider the task of learning, and assessing the learning progress of, many prediction demons, perhaps thousands of demons, on a robot. On a robot, we do not have access to the parameters of the MDP, so we cannot compute the RMSPBE. In addition, our computation is very limited, and thus it is not practical to compute the least squares estimate of the RMSPBE (e.g., RMSPBE1 from the previous section) for more than a handful of demons.


Figure 8.2: A comparison of the RMSPBE of the GTD(λ) algorithm (black) and the RUPEEvec measure's estimate of the RMSPBE (blue) on a non-stationary variant of the Markov chain domain. Each subplot corresponds to one of the five Markov chain problem instances. The GTD(λ) algorithm learned for 200 episodes, and then the primary weight vector was modified: w_200 ← −w_200. See text for a description of the experiment and discussion of the results.

In this setting, we can estimate learning progress by using either RUPEE measure or some other method that is not based on the RMSPBE. Another method for estimating progress in this scenario is to compute the accuracy of each demon's predictions. If we could sample each demon's target, Gt, then we could compute the RMSE of each prediction. Estimating prediction accuracy in this way on a robot, however, is not ideal in many ways, as you will see in the following experiments. If the prediction accuracy and the RUPEE measures exhibited similar empirical convergence profiles and both indicated a plateau in performance improvement at roughly the same time, then we might avoid these problematic prediction accuracy estimates and just use one of the

RUPEE measures. The experiments of this section investigate the empirical relationship between RUPEEvec, RUPEEscalar, and prediction accuracy across hundreds of demons.

8.4.1 Prediction accuracy of many demons

In order to compare RMSPBE estimation and prediction accuracy, we first develop a framework for estimating the prediction accuracy of policy-contingent predictions on a robot. In this experiment, we use interspersed on-policy test excursions during learning to evaluate the accuracy of each demon's prediction. We attempt to answer a simple question: "Can the robot learn hundreds of prediction demons in realtime, while measuring prediction accuracy with on-policy test excursions?". Off-policy state GVF learning of hundreds of predictions on a robot has never been demonstrated before. Our first experiment concerns learning many off-policy GVFs with performance evaluated using test excursions, not with RUPEE. Later we will use the same setup to investigate measuring progress with RUPEE compared to test excursions.

Problem

The robot's interaction with its environment was structured in a 100 millisecond (ms) time-step. At each step, the sensory information was used to select one of five actions corresponding to basic movements of the robot (forward, backward, turn right, turn left, and stop). Each action caused a different set of voltage commands to be sent to the three motors driving the wheels. The state of the robot was characterized by 49 real or virtual sensors of 13 types, summarized in the first two columns of Table 5.1 of Chapter 5 (we exclude the stall flag and the previous wheel commands, which were used in the nexting experiment).

The question functions of each demon are specified as follows. The ith demon is constructed using one of five possible target policies, each corresponding to repeated execution of one of the robot's actions (e.g., π^{(i)}(s, forward) = 1 for all s ∈ S), which we call a constant-action policy. The termination signal of the ith demon was a constant function taking on one of three possible values, {0.0, 0.5, 0.8}, corresponding to time scales of 0.1 seconds, 0.3 seconds, and 0.5 seconds. The cumulant of the ith demon, Z_t^{(i)}, corresponds to one of the robot's 49 sensors, for example Z_t^{(i)} = IRDistance0_t, just as in the nexting experiment. The question asked here is what will happen next if the robot were to execute a target policy; this differs from nexting in that the predictions are about many different ways of behaving. In total we have 735 prediction demons.
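The resulting set of question functions is simply the cross product of the five constant-action policies, the three termination signals, and the 49 sensors, as the following sketch illustrates (the sensor names here are placeholders, not the robot's actual sensor identifiers).

```python
from itertools import product

ACTIONS = ["forward", "backward", "turn_right", "turn_left", "stop"]
GAMMAS = [0.0, 0.5, 0.8]                        # constant termination signals
SENSORS = ["sensor_%d" % i for i in range(49)]  # placeholders for the 49 sensors

demons = [
    {"target_policy": a,   # constant-action policy: pi(s, a) = 1 for all s
     "gamma": g,           # constant termination signal
     "cumulant": s}        # Z_t is the reading of this sensor
    for a, g, s in product(ACTIONS, GAMMAS, SENSORS)
]

assert len(demons) == 735  # 5 policies x 3 terminations x 49 sensors
```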

The answer functions are similar to the nexting experiment except for the behavior policy. The behavior policy selects one of the five actions randomly with equal probability, then on each step repeats the selected action with probability 0.5 and with probability 0.5 randomly selects a new action. The feature vector is constructed in the same way as in the nexting experiment, except the tile coding of the thermal sensors was corrected to include all eight sensors, producing a binary feature vector of length 6065, of which 473 components are equal to one on each time-step (see Table 5.1 of Chapter 5 to refresh your memory). Finally, the λ function was again constant and equal to 0.9.
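A minimal sketch of this behavior policy is given below; the function name and the handling of the very first step are our own choices.

```python
import numpy as np

def behavior_step(previous_action, actions, rng):
    """With probability 0.5 repeat the previous action, otherwise
    select a new action uniformly at random (it may repeat by chance)."""
    if previous_action is None or rng.random() >= 0.5:
        return actions[rng.integers(len(actions))]
    return previous_action

# Example usage (hypothetical action names):
rng = np.random.default_rng(0)
actions = ["forward", "backward", "turn_right", "turn_left", "stop"]
a = behavior_step(None, actions, rng)
```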

Experiment

In order to evaluate the prediction accuracy, we interspersed on-policy tests into the behavior policy. During each test, the robot selects actions in agreement with one of the five target policies (on-policy), selected randomly at the beginning of the test and called the testing policy. While the testing policy was executing, we calculated a truncated sample of the target for precisely 147 of the demons, those whose target policy was equal to the testing policy. To do this, the test policy was followed for a fixed number of steps that was long enough to compute a reasonably accurate truncated sample of the target for each of the three termination signals {0, 0.5, 0.8}. The length of each test was set to 50 steps. This number was produced by calculating the number of steps T for which the value of (0.8)^T in the estimation of the target, G_t^{(i)}, would be less than a small tolerance level (equal to 0.00001, by the formula ln(0.00001)/ln(0.8) and rounding up). At the beginning of the kth test phase, we recorded the demon's prediction, denoted V_k^{(i)} ∈ R, for each of the 147 demons, and incrementally computed a truncated sample of each demon's target, denoted Ĝ_k^{(i)} ∈ R:

\hat{G}_k^{(i)} = \sum_{j=0}^{50} \left(\gamma_t^{(i)}\right)^j Z_{t+j+1}^{(i)},

where t denotes the time-step of the beginning of the test. After 50 test steps, indexed above by j, each prediction was compared to its truncated sample of the target and the squared error, (Ĝ_k^{(i)} − V_k^{(i)})², recorded. As a measure of the quality of the prediction sequence {V_k^{(i)}} of demon i up through test number K, we can use the root mean squared error, defined as

\text{RMSE}(i, K) = \sqrt{\overline{\left(V^{(i)} - \hat{G}^{(i)}\right)^2}^{\,\beta}},

where β is the averaging parameter of the moving average, incrementally computed from samples. The β0 parameter of the moving average was equal to 0.001 in this experiment.

We use a different definition of RMSE from the one used in Chapter 5 because we have significantly fewer samples from which to compute errors (>100,000 in nexting, versus hundreds here). This means the RMSE curves of each demon may be slightly more noisy and the average will overcome initial errors more quickly. After every test, the robot followed a hand-coded recovery policy to move the robot back to the center of the pen. This recovery phase did not last longer than two seconds, and during this time learning was disabled. The recovery policy was found to be useful to balance the amount of training time the robot spent in free space and near the walls. Algorithm 8.4.1 provides pseudocode for how the behavior, on-policy test excursions, and recovery occurred during our experiment. No learning occurs during test excursions or recovery.

Algorithm 1 Off-policy demon training with interspersed on-policy tests

let x be the initial feature vector
for each time-step t ≈ 100 ms do
  if (not testing) then
    ## follow the behavior policy and learn
    A ← µ(x, ·)
    take action A and observe the next feature vector x′
    update each demon using the sample (x, A, x′)
    x ← x′
    probabilistically interrupt learning to start a test:
      select a test policy π(j) from the set of five target policies
      for each demon i whose π(i) = π(j) do
        record the prediction: V(i) = x⊤w(i)
        Ĝ(i) ← 0
      k ← 0; testing ← true
  else
    ## on-policy test following target policy π(j)
    A ← π(j); take action A and observe the cumulants
    for each demon i with π(i) = π(j) do
      update the truncated target: Ĝ(i) ← Ĝ(i) + (γ(i))^k Z_{t+k+1}
    k ← k + 1
    if k ≥ 50 then
      ## test concluded
      testing ← false
      for each demon i with π(i) = π(j) do
        record the squared error: (V(i) − Ĝ(i))²
      execute the recovery policy and observe the current feature vector x after recovery
    end if
  end if
end for

In order to evaluate the accuracy of all 749 predictions about sensors at various time scales, we aggregated the performance across all predictions. To measure the accuracy of predictions with different magnitudes, we used a normalized mean squared error,

\text{NMSE}(i, k) = \frac{\text{RMSE}^2(i, k)}{\text{var}(i)},

in which the mean squared error is scaled by var(i) (the sample variance of the truncated targets from all K tests of the ith demon's prediction). This error measure can be interpreted as the percentage of variance not explained by the prediction. It is equal to one when the prediction is constant at the average target value, but can be much larger than one. Several demons' curves were excluded from the results because their truncated targets were always constant and had zero variance (e.g., the ones corresponding to the motor current of motor two). In addition to the NMSE, we also recorded another aggregate measure of accuracy based on the absolute error. The symmetric mean absolute percentage error, or SMAPE,

\text{SMAPE}(i, k) = \overline{\left(\frac{|V_k^{(i)} - \hat{G}_k^{(i)}|}{|V_k^{(i)}| + |\hat{G}_k^{(i)}|}\right)}^{\,\beta}, \qquad (8.4)

where β is the parameter of the moving average over tests, and β was equal to 0.001 in our experiments. The SMAPE is bounded between zero and one. The SMAPE is undefined if |V_k^{(i)}| + |Ĝ_k^{(i)}| is zero, and this situation arose for 15 demons (again, the predictions of future motor current of motor two). These predictions were excluded from the calculations involving SMAPE.

We ran the Critterbot for 6.7 hours, updating 749 demons with the parameters of GTD(λ) equal to α = (1 − λ)α0/||x|| = (1 − 0.9)(0.1)/473 and αh = 0.01α. The run produced 238,000 samples, and approximately half of the time was spent performing test excursions. Data was logged to disk for later analysis. Each prediction was tested 400 times, yielding the aggregate learning curves for NMSE and SMAPE shown in Figure 8.3.
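For illustration, the following sketch computes the per-test quantities behind these aggregate curves: a truncated sample of the target, the NMSE, and the per-test SMAPE term. The function names are ours, and a plain average stands in for the exponentially weighted averages (with β and β0) used above.

```python
import numpy as np

def truncated_return(cumulants, gamma, steps=50):
    """Truncated sample of the target: sum_{j=0}^{steps} gamma^j * Z_{t+j+1}."""
    return sum((gamma ** j) * z for j, z in enumerate(cumulants[:steps + 1]))

def nmse(squared_errors, target_variance):
    """Fraction of target variance not explained by the prediction."""
    return np.mean(squared_errors) / target_variance

def smape(prediction, target):
    """Per-test term of Equation 8.4, before the moving average over tests."""
    denominator = abs(prediction) + abs(target)
    return abs(prediction - target) / denominator if denominator > 0 else float("nan")
```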

Results and conclusions

The results of this experiment indicate that hundreds of predictions can be updated and tested in realtime on the Critterbot. The randomized behavior policy, tile coding, test excursions, recovery policy, and GTD(λ) were all implemented in Java and run on a laptop computer connected to the robot by a dedicated wireless link. The laptop used an Intel quad-core processor with a 2.7 GHz clock cycle and 16 GB of DDR3 RAM.


Figure 8.3: Aggregate learning curves for the 749 predictions whose cumulants correspond to each sensor and one of five constant-action target policies. Each point in each subgraph is the aggregate error of the demons' predictions up to that test. The top left plot shows the median NMSE curve, and the bottom plot shows the mean and median SMAPE curves. The top right curve shows what the mean NMSE would be if the experiment were rerun using cumulants with their average value subtracted out.

Eight threads were used for the learning code. With this setup, the time required to make and update all 749 predictions was 2.2 ms, well within the 100 ms cycle of the Critterbot.

The results in Figure 8.3 report the overall accuracy of the predictions learned by each demon. The top left plot reports the median NMSE computed from the individual predictions, achieving 40% variance explained by the end of training. The learning curve in the top right plot reports the average NMSE when the cumulant of each prediction was adjusted by the mean of the cumulant (as done in Chapter 5). When we modified our cumulants in this way, we reran GTD(λ) on the logged data. In the mean, the predictions learned with the average subtracted explained 49% of the variance of the target by the end of the data set. The bottom plot reports the average and median prediction SMAPE curves. The SMAPE value at the end of the data set was 15.8% and 0.67% for the average and median aggregations, respectively. We also recorded accuracy measures for predictions aggregated by sensor group. The predictions whose cumulant (not mean adjusted) was an external sensor (Thermal, IRLight,

IRDistance, Mag, Light) explained 52% of the variance by the end of the run, while the predictions whose cumulant was an internal sensor (motor information, acceleration, rotational velocity, etc.) explained only 24% of the variance at the end of the run. Using absolute error to aggregate prediction accuracy, the external-sensor predictions achieved 5.8% median SMAPE, and the internal-sensor predictions achieved 42% median SMAPE by the end of the run.

In conclusion, these results indicate that it is indeed practical to do large-scale prediction learning, off-policy, and on a robot with conventional computational resources, and that the prediction accuracy can be measured by on-policy test excursions. Overall, no divergence was observed, and the predictions achieved reasonable accuracy.

These policy-contingent predictions are not as accurate as the nexting predictions learned in Chapter 5. We used the same robot, a similar learning algorithm, and the same feature generation mechanism, but the important difference was the policies. In the nexting experiment, the robot generated long trajectories of experience (several minutes between overheating stalls), observing similar sensory patterns with each loop around the pen. Each sample generated by the robot was relevant to every single demon on every time-step. In the off-policy experiment, the data generated by the behavior was only ever relevant to a subset of demons at any one time. In addition, the behavior produced only short trajectories of on-policy experience before switching to another policy: just snippets of experience to update each demon. Of course there may have been other factors at play, but it seems the simple answer is that the off-policy learning problem was more challenging than the task of on-policy nexting.

8.4.2 Comparing RUPEE and prediction accuracy

Although it may not be apparent from the results of the previous section, estimating prediction accuracy using interspersed on-policy test excursions is not entirely compatible with what we mean by large-scale prediction learning on a robot. The first source of incompatibility is caused by the need to follow each target policy many times, limiting the number of target policies the robot can learn about at one time. Second, this test-excursion framework limits us to considering GVFs with short time scales (γ). We must test each target policy, and thus we must run each test long enough to form a reasonable approximation of the infinite sum of the target. A prediction defined by an eight-second time scale, for example, would need at least a 90-second test. Using on-policy tests for evaluation restricts scaling by limiting the number of distinct what-if questions we can ask, and the span of each prediction.
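As a rough check on the excursion lengths quoted here, assume the time scale of a prediction with termination γ is about 0.1/(1 − γ) seconds (with 100 ms steps) and that tests are truncated at the 10^{-5} tolerance used in the previous section. An eight-second time scale then gives

$$\gamma \approx 1 - \frac{0.1}{8} = 0.9875, \qquad T \ge \frac{\ln(0.00001)}{\ln(0.9875)} \approx 915 \text{ steps} \approx 92 \text{ seconds},$$

which matches the 90-second figure above; the same calculation for γ = 0.95 gives roughly 225 steps, or about 22 seconds.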

Using on-policy test excursions to evaluate policy-contingent predictions makes our experimental apparatus unnecessarily complex. We need to take care that the tests are long enough, and frequent enough, to ensure the estimated accuracy is up to date and reflective of the robot's learning progress. If tests are frequently started near a wall, but most behavior data was generated in open space, then the prediction accuracy might indicate that no learning has occurred. On the other hand, frequent testing consumes valuable robot time that could have been used more effectively by the behavior policy, and striking the right balance between testing and training is hard. Perhaps we can use one of RUPEEvec or RUPEEscalar to estimate learning progress and avoid many of these difficulties.

The question we seek to answer with our next experiment is: "Does the learning progress of prediction demons measured by RUPEEvec and RUPEEscalar empirically correspond to the prediction accuracy measured by the RMSE?". Our previous experiments with

RUPEEvec show that it corresponds well with the RMSPBE in simulation domains. In our next experiment we seek to compare our new measures to the RMSE of predictions learned on a robot.

Experiment

In this experiment, we simply used the log file produced by the last experiment to generate aggregate plots of RUPEEvec, RUPEEscalar, and the RMSE for comparison. The log file contained approximately six hours of data: two hours and 52 minutes of the log corresponded to time-steps where the behavior policy was in control and the demons were updated (or 103,200 time-steps). For this experiment, we consider a subset of only 245 demons, corresponding to the longest time scale, γt = 0.8. In addition, all the cumulants are scaled between zero and one, using the minimum and maximum values for each sensor observed in the log. We used a subset of demons and scaled cumulants in order to facilitate aggregating many MSE and RUPEE curves for comparison.3 As before, each demon was tasked to predict the future values of a sensor, if actions were selected according to one of the five constant-action target policies. Each demon used the GTD(λ) algorithm to update its prediction incrementally from the log file, and the parameters of GTD(λ) were the same as in the previous experiment.

We compared RUPEEvec, RUPEEscalar, and the RMSE from test excursions, all computed from the log of data. The RUPEEvec and RUPEEscalar of each demon were recorded for

3We cannot compute a variance-unexplained variant of RUPEE, thus comparing aggregate (e.g., median) NMSE curves and RUPEE curves would be misleading.

every time-step in the log file, but only updated on every non-test time-step. The RMSE was also recorded for every demon on each time-step, but only updated after the completion of each test. The averaging parameter of RUPEEvec was set to β0 = (1 − λ)α0/30, and the averaging parameter of RUPEEscalar was set to β0 = (1 − λ)α0/20. Both RUPEE measures used αĥ = 5α. The averaging parameter β for the RMSE was set to 0.0005.

Mirroring our experiments in the Markov chain, we reset the primary weight vector of each demon, w(i) ← −w(i), at the halfway point of the log to induce a sudden and unexpected change. The reset caused each demon to predict the opposite of what it had learned up to the halfway point, and should significantly perturb both the prediction accuracy and the RMSPBE of all the demons. Figure 8.4 plots the mean RMSE, RUPEEvec, and

RUPEEscalar curves computed from the log file, sampled on the time-steps when the behavior policy was in control in the original experiment.

Results and conclusions

The results in Figure 8.4 show that the RMSE from test excursions and both RUPEE measures have roughly the same overall shape when aggregating via the mean, and all three react similarly to the reset of the weights at the halfway point of the experiment. Using the median to aggregate errors resulted in similar curves, which was expected because the cumulants were scaled to the zero-one range.

Notice that the mean RMSE, mean RUPEEvec, and mean RUPEEscalar, though similar in profile, are not equal in magnitude. The RMSE does not equal the two RUPEE measures because they estimate different quantities: the prediction error and the RMSPBE. Given enough data, the RUPEE measures should both approach zero. The RMSE, on the other hand, will likely never achieve zero on robot data for a number of reasons, including but not limited to (1) the features do not capture the state of the environment, and (2) the testing regimen does not precisely reflect the true accuracy of the predictions (e.g., the tests sometimes occur in situations which have correspondingly less training data). The two RUPEE measures achieve similar magnitude but are not exactly equal either, because they used different averaging schemes and thus will likely have different biases. There may be cases when all three measures are nearly equal (e.g., tabular features), but that is not the case here. We have plotted each measure separately to focus on the shape of the curves rather than the precise magnitudes, which are expected to differ in general.


Figure 8.4: Off-policy learning progress of 245 predictions learned by the GTD(λ) algorithm, as measured by the RMSE from on-policy test excursions and two estimates of the RMSPBE. The figure shows the mean RMSE, mean RUPEEvec, and mean RUPEEscalar plotted against time. The mean of each curve was computed from the learning curves of 245 predictions. A sudden change (marked by the gray vertical line at the mid-way point) was simulated by resetting each prediction's primary weight vector, w(i) ← −w(i). The change caused the demons' predictions to become inaccurate and caused all three measures to increase dramatically.

We conclude from the results of this experiment that the estimates of the RMSPBE of the GTD(λ) algorithm and the RMSE of its predictions exhibit similar profiles. Furthermore, the RMSPBE estimates seem to react just as quickly as the RMSE to a sudden change in the accuracy of the demons' predictions. Our experiment provides evidence that both RUPEE measures can track the learning progress of many demons on a robot, and that the RUPEE measures can also track the improvement in prediction accuracy of those demons over time.

8.4.3 Many demons and many target policies

Our last experimental investigation concerns scalability of RMSPBE estimation. So far, our experiments have explored learning about only a few target policies at a time. Given enough computation and sufficient training data, we may be able to learn about many different target policies in realtime and in parallel. Using RUPEEvec, we may be able to estimate the learning progress of each demon, and thus we may be able to greatly scale up the number of distinct target policies without ever having to run a single on-policy test excursion. We

seek to answer: "Is it practical to perform the GTD(λ) update and estimate RUPEEvec for many demons with many distinct target policies, all within the update cycle time of the Critterbot?". Additionally, can we do it with at least as many demons as in the nexting experiments of Chapter 5? Our final experiment seeks the answer to these questions.

Problem

In order to generate a large number of distinct target policies, we use randomly parameterized probabilistic policies. Each policy is represented as a discrete distribution over actions. The next action is sampled probabilistically on each time-step from a discrete-action, linearly parameterized Gibbs distribution,

\pi^{(i)}(x, a) = \frac{e^{x^\top u^{(i)}(a)}}{\sum_{a' \in \mathcal{A}} e^{x^\top u^{(i)}(a')}},

where u ∈ R^{n|A|} is a vector of policy parameters and u(a) ∈ R^n. Each policy is generated by selecting 50 components of u(a) at random at the beginning of the experiment and then assigning each selected component a value independently drawn from a normal distribution with mean zero and variance one. In our experiment we generated 300 different probabilistic policies.

The only other significant change from our previous experiments was how the robot behaved. We eliminated the on-policy test excursions, and periodically invoked the recovery policy to ensure the robot's behavior was not concentrated near the walls of the pen. The behavior policy and the λ function were exactly the same as before.
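A sketch of sampling from such a policy, and of generating its random parameters, is given below. The function names are ours, and the choice of drawing 50 nonzero components per action (rather than 50 components of the full parameter vector) is an assumption about the description above.

```python
import numpy as np

def sample_gibbs_action(x, u, rng):
    """Sample an action from a linearly parameterized Gibbs (softmax) policy.

    x : feature vector of length n
    u : parameter matrix of shape (num_actions, n), row a holding u(a)
    """
    preferences = u @ x
    preferences -= preferences.max()      # subtract the max for numerical stability
    probabilities = np.exp(preferences)
    probabilities /= probabilities.sum()
    return rng.choice(len(probabilities), p=probabilities)

def random_gibbs_params(n, num_actions, rng, nonzero_per_action=50):
    """Generate one random target policy (50 N(0,1) components per action, assumed)."""
    u = np.zeros((num_actions, n))
    for a in range(num_actions):
        idx = rng.choice(n, size=nonzero_per_action, replace=False)
        u[a, idx] = rng.standard_normal(nonzero_per_action)
    return u
```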

Experiment

We instantiated 2500 prediction demons by randomly combining one of the 300 target policies, a cumulant equal to one of the robot's sensors, and a constant termination signal from the set {0.0, 0.5, 0.8, 0.95}. The randomization process checked to ensure that each demon was created with a unique combination of π(i), γ(i), and Z(i). Each demon used the GTD(λ) algorithm with α = (1 − λ)α0/||x|| = (1 − 0.9)(0.05)/473 and αh = 0.001α. The robot was run for six hours, updating 2500 demons and estimating learning progress with RUPEEvec. The RUPEEvec estimator was used with β0 = (1 − λ)α0/100 and αĥ =

5α0. Figure 8.5 shows the estimated mean and median RMSPBE curves aggregated over all 2500 demons, and a sampling of estimated RMSPBE curves for 30 demons.

Results and conclusions

The time required to make and update all 2500 predictions was 91 ms, well within the 100 ms cycle of the Critterbot. Roughly a quarter of the demons used γ = 0.95, which would need a 22 second excursion to estimate the RMSE under the on-policy testing framework. Appendix D contains additional timing results and a discussion of the parallel scalability of Horde. It is not entirely appropriate to compute the median and mean RMSPBE curves because the demons have very different time scales and cumulant ranges. Nevertheless, we do so to give an idea of overall RMSPBE reduction, as well as plotting the RMSPBE curves for several demons individually. Afterwards, we reprocessed the data using RUPEEscalar to estimate the RMSPBE, yielding qualitatively similar RMSPBE curves for both individual

demons and in the aggregate. Using RUPEEscalar, we could update 3000 demons in 77.96 ms per step, on average.


Figure 8.5: Scaling off-policy learning to 2500 demons and 300 distinct target policies on a robot, with progress measured using RUPEEvec. The bottom plot shows the estimated RMSPBE versus time for a selection of 30 of the 2500 demons. The top two plots show the mean and median estimated RMSPBE curves computed over all 2500 demons. The median estimated RMSPBE is higher than in the previous experiment because the individual cumulants were not scaled to be between zero and one.

The plots in Figure 8.5 show that the estimated RMSPBE, via RUPEEvec, in the aggregate and for individual demons reduces significantly during the course of the experiment.

Assuming that the accuracy of each demon's prediction (or RMSE) follows the same profile as the estimate of the RMSPBE, as it did in the previous experiment, the overall prediction accuracy also reduces significantly during the course of the experiment.

In conclusion, these results suggest that learning predictions about the consequences of many different target policies is feasible in realtime and on a robot. The ability to monitor and track the learning progress of many different demons and target policies is only possible due to the availability of an online estimate of the RMSPBE, such as RUPEEvec.

8.5 Summary

In this chapter we introduced two new methods for estimating off-policy learning progress for gradient temporal difference learning methods, called RUPEEvec and RUPEEscalar. We conducted a series of experiments on the Markov chain domain, which aimed to empirically test the practicality of estimating the RMSPBE of the GTD(λ) algorithm with our two RUPEE measures, in a domain that affords exact calculation of the MSPBE. We then moved to the Critterbot to investigate how well our RUPEE measures compared to the prediction accuracy of hundreds of prediction demons. Our final experiment provided a demonstration of learning thousands of prediction demons, with hundreds of distinct target policies, measuring learning progress with RUPEEvec, all on a robot.

Overall, RUPEEvec appears to be a reasonable estimator of the RMSPBE, given its light memory and computational footprint compared to computing a least-squares solution (e.g., RMSPBE1). It also appears that the empirical convergence profiles of both RUPEEvec and RUPEEscalar are similar to the empirical convergence profile of the prediction accuracy computed from interspersed test excursions. Scaling the number of predictions about distinct target policies on a robot was made easier by a computationally frugal online performance estimate, such as RUPEEvec. Our new estimators are useful for experimentation with and analysis of predictive knowledge architectures like Horde. The accuracy and correctness of the predictive knowledge is essentially private and uninterpretable by humans, and thus an estimate of the MSPBE facilitates monitoring the learning progress of a large collection of demons, individually and in the aggregate, without human intervention, testing, or expensive computations.

Chapter 9

Adapting the behavior policy1

Up to now, our agent has operated in a passive regime: observing an unending sequence of feature vectors and actions, and predicting future cumulants. In all our experiments so far, the behavior policy has been hand-coded ahead of time by humans, who use imperfect knowledge of the situations the agent might face in the future. Even in the restricted world of the wooden pen, the robot encountered unexpected things, like the strange magnetic readings in the floor. The behavior policy is an answer function, and just like the other answer functions its definition determines the nature of each demon's approximation of its GVF. Improving the answer functions could improve a demon's approximation. An improved behavior policy might be one that adapts action selection during learning, perhaps specializing to each demon's unique requirements, in order to improve prediction accuracy and improve overall system responsiveness.

This chapter investigates how feedback from several demons can be used to adapt the behavior policy of a robot, demonstrating one possible use of our demons. Other systems have used predictions as components of state representations, as in predictive state representations (Littman, Sutton & Singh, 2002; Singh, James & Rudary, 2004; Boots, Siddiqi & Gordon, 2011; Sutton & Tanner, 2005). Cunningham (1972), Becker (1973), Drescher (1990), and Sutton (2009, 2012) explored representing knowledge and adapting behavior based on predictions, but only in small-scale simulations. The potential uses of Horde's demons are too many and too varied to treat properly in a single chapter. Nevertheless, we propose a simple scheme for behavior adaption and conduct a small-scale experiment on a robot, contributing both the first demonstration of behavior adaption based on demon feedback on a robot, and a concrete demonstration of the potential usefulness of prediction demons. Our demonstration constitutes further development of the predictive approach to knowledge.

1Some of the experiments and text in this chapter are borrowed from a workshop paper (White, Modayil & Sutton, 2014). The text of that paper and the experiments within were almost exclusively produced by this author.


9.1 Unexpected demon error

One way to adapt the behavior is to listen to the feedback of the demons: triggering behavior change when one or more of the demons encounters an unexpected error. This idea is motivated by situations where something unexpected happens, or where the robot encounters a new, previously unencountered situation. In these cases, we might want the robot to investigate. The purpose of the robot's investigation might be to update predictions that have become inaccurate due to a change in the world, or to learn, for the first time, about a new part of the world. After some updating and exploration, the robot might move on and do something else. These changes in the robot's behavior could be triggered by changes in the error of each demon's predictions. In particular, we could track each demon's learning, recording whenever a demon's predictions were unexpectedly inaccurate. In the previous chapter we developed a measure of learning progress. Here we are interested in a measure of sudden or unexpected change in the demon's regular learning, which is different. Learning progress, as measured by RUPEE, reflects how much learning has occurred and how much learning remains, and it is updated on the same time scale as the learning itself. Unexpected error, on the other hand, is a measure of some unexpected event, which we wish to recognize and react to before the base learning accounts for the change. We define a measure of Unexpected Demon Error, or UDE, based on the internal error of each demon's approximate GVF i,

UDE_t^{(i)} = \left| \frac{\bar{\delta}^{(i)}_{\beta}}{\sqrt{\mathrm{var}[\delta^{(i)}] + \epsilon}} \right| \qquad (9.1)

where the bar denotes a moving average with parameter β, δ^(i) ∈ R is the TD error of each demon, and ε ∈ R is a small constant to avoid division by zero. In all our experiments, we used β0 equal to 10 times the learning rate of the demon's learning algorithm (α). The variance is incrementally computed with a long-run sample average. We take the absolute value of the measure because equal-magnitude unexpected errors, either positive or negative, are considered the same. The UDE is independent of cumulant scale, and can aggregate the performance of many demons. Updating the UDE requires O(1) computation and storage per demon.
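As a concrete illustration, the UDE update can be sketched as follows. This is a minimal sketch assuming a simple exponential moving average for the mean TD error and a Welford-style long-run sample variance; variable names are ours, not those of our implementation. The short synthetic check at the end roughly mirrors the sustained perturbation of Figure 9.1.

import random

class UDE:
    """Minimal sketch of an incremental Unexpected Demon Error estimate:
    |moving average of delta| / sqrt(long-run sample variance of delta + eps)."""
    def __init__(self, beta, epsilon=1e-6):
        self.beta = beta            # moving-average rate (e.g., 10x the demon's alpha)
        self.epsilon = epsilon      # avoids division by zero
        self.delta_bar = 0.0        # exponential moving average of the TD error
        self.n = 0                  # sample count for the long-run variance
        self.mean = 0.0             # long-run sample mean of the TD error
        self.m2 = 0.0               # running sum of squared deviations (Welford)

    def update(self, delta):
        self.delta_bar += self.beta * (delta - self.delta_bar)
        self.n += 1
        d = delta - self.mean
        self.mean += d / self.n
        self.m2 += d * (delta - self.mean)
        var = self.m2 / self.n if self.n > 1 else 0.0
        return abs(self.delta_bar) / (var + self.epsilon) ** 0.5

# Synthetic check: zero-mean errors with a sustained +5 perturbation after step 600.
rng = random.Random(0)
ude = UDE(beta=0.05)
curve = [ude.update(rng.gauss(0.0, 1.0) + (5.0 if t >= 600 else 0.0))
         for t in range(1000)]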

Consider how the UDE responds to changes when applied to zero-mean errors generated by a symmetric unimodal distribution, as shown in Figure 9.1. When the error signal becomes unexpectedly large, the moving average becomes larger than the sample standard deviation, and the UDE is large. The error is unexpected because it is much larger than the sample standard deviation. If the fluctuations in the error are expected, again in comparison to the sample standard deviation, then the UDE is small. A high value of UDE will become small if the TD error becomes small, or if the error remains large long enough that the sample standard deviation becomes large. In either case, the UDE will decrease.


Figure 9.1: A plot of the UDE and an error signal sampled from a normal distribution, δt ∼ N(0, 1.0). In the top panel, the UDE fluctuates close to zero, compared with the confidence bounds +σ and −σ plotted in green and red. The fluctuations in the errors are small and so is the UDE. The middle panel shows the outcome of a single spurious error that is significantly outside the confidence bounds. The bottom panel shows a large sustained perturbation of the error signal, causing a large increase in UDE that does not vanish for an extended period of time.

As a second illustration, we computed the UDE of TD(λ) from the log file of the nexting experiment of Chapter 5. In Figure 9.2 we see the UDE of the Light3 prediction. After over an hour of learning (corresponding to zero on the x-axis), the TD(λ) predictions were well aligned with the Light3 target, and consequently the UDE was small, but not zero. After the 80 second mark, the robot stalled in front of the light source, saturating its Light3 sensor. Usually, the light reading would subside as the robot drove past (which can be seen in the small down-tick in the robot's light prediction highlighted in the graph). In this case, the UDE shot up, remained high, and only began to decay as the Light3 prediction began to match the robot's new reality. In the next section, we describe an experiment where the robot is allowed to change its action selection based on unexpected demon error.


Figure 9.2: The UDE measure computed from the TD error of the TD(λ) algorithm's eight-second prediction of future light from the nexting data set. The top panel shows the Light3 cumulant and target together with the TD(λ) prediction; the bottom panel shows the UDE; both panels share the same x-axis (seconds). During the first 80 seconds, the predictions are accurate and the UDE is low. After the 80 second mark, the robot stalls in front of the light source, and the prediction of future light becomes substantially different from the target: the predictions are highly inaccurate and the UDE becomes large. The zoomed-in section of the graph shows the TD(λ) prediction initially going down in anticipation of passing the light.

9.2 Adapt the behavior policy of a robot

This section describes and demonstrates one simple way to use UDE to adapt a robot's behavior. We consider the case where each demon's learning has plateaued, and then an event occurs that causes a substantial and sustained increase in the UDE of only one of the demons. We attempt to demonstrate how the behavior can be adapted in response to the increase in UDE, in order to improve the demon's prediction. Specifically, we place a five pound weight on the back of an iRobot Create, causing one demon's prediction about future battery current to become inaccurate. The behavior responds by continually rotating until the demon's prediction about battery current becomes accurate again and the UDE goes down. The following sections describe the experiment in detail.

9.2.1 Problem

The robot's interaction with its environment is structured in a loop with a 30 ms time-step. We use the iRobot Create with a top-mounted, front-facing USB camera, and all computing is performed onboard via a Raspberry Pi computer. The action set contains five discrete actions, {forward, backward, rotate clockwise, rotate counter-clockwise, stop}, which are mapped to constant wheel-velocity commands on the robot. The 120 × 160 USB camera images are sampled at 30 frames per second, and the actions are also polled every 30 ms.

The learning task is specified by two different prediction demons, each with different question functions. The first demon's question corresponds to a prediction about the number of time steps until the onset of bumping: "How many time steps until the robot bumps or the effective end of the time horizon is reached, if the robot were to drive forward?". This question is encoded by the question functions: a target policy, π(1), that always selects the forward action, Z(1) equal to one, and γ(1)(s) equal to zero if either bump sensor equals one and equal to 0.9 otherwise (corresponding to a three second time scale). We shall refer to this first demon as the forward demon. The second demon's question corresponds to a prediction about the total discounted future battery current draw if the robot rotates in place (similar to a nexting prediction with the target policy equal to constant rotation). This question is encoded by the question functions: a target policy, π(2), that continuously rotates counter-clockwise, Z(2) = battery-current, and the constant function γ(2)(s) equal to 0.8 (corresponding to a 1.5 second time scale). We will refer to the second demon as the rotation demon.

Each demon uses a distinct feature-vector generation method. The forward demon's feature vector is constructed from the most recent camera image. At the start of the experiment, 100 pixels are selected at random, and from these pixels either the luminance or a color channel is selected equally and at random. Figure 9.3 illustrates the random pixel sampling scheme. Each pixel value (between 0 and 255) is independently binned into 16 non-overlapping bins of width 16, producing a binary feature vector with 16,000 components, of which 100 are non-zero on each time-step. The second demon's feature vector is constructed by discretizing a decaying trace of the battery current reading, scaled to be between zero and one, with bin widths of 1/16th. The rotation demon's feature vector thus contains 16 components, of which only one is active on each time-step.
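The two feature constructions just described can be sketched roughly as follows. The function names are ours, the image is assumed to be a 120 × 160 × 3 array of 8-bit values, the luminance/color choice is simplified to a random choice among three channels, and the decaying trace of battery current is assumed to be computed elsewhere and already scaled to [0, 1]. Note that this simplified sketch produces a 100 × 16 = 1,600 component vector for the forward demon rather than the 16,000 quoted above, so it should be read only as an illustration of the sampling-and-binning idea.

import random
import numpy as np

NUM_PIXELS = 100          # randomly chosen pixel/channel pairs (fixed for a run)
NUM_BINS = 16             # pixel values 0-255 fall into 16 bins of width 16

def sample_pixel_channels(height=120, width=160, channels=3, seed=0):
    """Choose the random (row, column, channel) triples once, at the start of
    the experiment; the mapping then stays fixed."""
    rng = random.Random(seed)
    return [(rng.randrange(height), rng.randrange(width), rng.randrange(channels))
            for _ in range(NUM_PIXELS)]

def forward_demon_features(image, pixel_channels):
    """Binary feature vector with exactly one active bin per sampled pixel."""
    x = np.zeros(NUM_PIXELS * NUM_BINS)
    for i, (r, c, ch) in enumerate(pixel_channels):
        bin_index = int(image[r, c, ch]) // 16       # width-16, non-overlapping bins
        x[i * NUM_BINS + bin_index] = 1.0
    return x

def rotation_demon_features(current_trace):
    """Binary feature vector with 16 components, exactly one of which is active,
    from a battery-current trace scaled to lie in [0, 1]."""
    x = np.zeros(NUM_BINS)
    x[min(int(current_trace * NUM_BINS), NUM_BINS - 1)] = 1.0
    return x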

Figure 9.3: A visualization of how the pixels from the USB camera image are randomly sampled and binned to create the forward demon's feature vector. On the left side, we see the 100 randomly sampled color components (red and blue) and luminance components (green) sampled from the current 120 by 160 image. The selection is performed at the beginning of each experiment, and the mapping remains fixed throughout the experiment. The right hand side shows a visualization intended to give the reader an idea of the demon's view of the world using this feature construction method.

The remaining answer functions were shared by both demons. The λ function was constant and equal to 0.9. Both demons share a common behavior policy, but the behavior is composed of two distinct policies. The first policy, called the non-reactive behavior, alternates between driving forward until bumping and rotating counter-clockwise in free space (not against the wall). The non-reactive behavior performs the same sequence of actions every time the robot bumps into a wall: stopping, reversing for five steps, rotating clockwise (for a probabilistic duration), and then driving forward again until the next bump or an interruption. The forward movement is probabilistically interrupted to initiate an extended counter-clockwise rotation of probabilistic duration. Figure 9.4 shows a snapshot of the non-reactive behavior.

Figure 9.4: The iRobot Create selecting actions according to the non-reactive behavior policy in its pen.
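For concreteness, the non-reactive behavior described above can be sketched as a small state machine. The rotation durations and the interruption probability below are placeholder values, since the text only specifies that they are probabilistic.

import random

class NonReactiveBehavior:
    """Sketch of the hand-coded non-reactive behavior: drive forward until a
    bump, then stop, reverse for five steps, rotate clockwise for a random
    duration, and drive forward again; forward movement is occasionally
    interrupted by an extended counter-clockwise rotation.  Durations and the
    interruption probability are illustrative placeholders."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.plan = []                     # queue of actions still to execute

    def step(self, bumped):
        if bumped:                         # fixed reaction to every bump
            self.plan = (["stop"] + ["backward"] * 5
                         + ["rotate_cw"] * self.rng.randint(20, 80))
        elif not self.plan and self.rng.random() < 0.005:
            # probabilistic interruption: extended counter-clockwise rotation
            self.plan = ["rotate_ccw"] * self.rng.randint(40, 160)
        return self.plan.pop(0) if self.plan else "forward"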

The second policy, called the adaptive behavior, is rule-based and adapts action selection based on the UDE of each demon. The adaptive behavior is designed to be very similar to the non-reactive behavior. The only difference is that instead of probabilistically interrupting forward movement with a rotation, the adaptive behavior decides whether to continue the forward movement or start an extended counter-clockwise rotation based on the UDE of each demon. Algorithm 2 contains the pseudocode for the target policy selection mechanism.

Algorithm 2: Target policy selection for the adaptive behavior
    j = argmax_{j ≠ i} UDE_t^(j)
    if UDE_t^(j) < 0.2 then
        pick j randomly
    end if
    µ = π^(j)
    select actions according to µ for 120 consecutive steps

Once the adaptive behavior selects one of the demons' target policies, it follows that policy for 120 consecutive steps, unless it is interrupted by a bump. If 120 steps occur and no bump is observed, then the adaptive behavior policy decides whether to continue selecting actions according to π(i) for another 120 steps, or to switch to a different target policy π(j), based on the UDE of each demon. If all demons' UDE measures are below the 0.2 threshold, a target policy is selected at random. The adaptive behavior policy enables selecting actions according to the same target policy for many steps consecutively. When the UDE of each demon is below 0.2, the adaptive behavior will act almost identically to the non-reactive behavior, except that the duration of rotations under the adaptive behavior is fixed to 120 steps.2
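Algorithm 2 can be made runnable roughly as follows. The UDE values are assumed to come from per-demon estimators such as the sketch in Section 9.1, the 0.2 threshold and the 120-step duration are taken from the text, and drawing the random policy from all demons is our reading of the pseudocode.

import random

def select_target_policy(udes, current, threshold=0.2):
    """Sketch of Algorithm 2.  `udes` maps demon index -> current UDE estimate;
    `current` is the index of the demon whose policy is being followed now.
    Returns the index of the demon whose target policy the behavior should
    follow for the next 120 consecutive steps."""
    candidates = [j for j in udes if j != current]
    j = max(candidates, key=lambda k: udes[k])     # demon with the largest UDE
    if udes[j] < threshold:                        # all UDEs below threshold:
        j = random.choice(list(udes))              # pick a target policy at random
    return j

# Example with the two demons of this experiment and hypothetical UDE values:
# select_target_policy({"forward": 0.05, "rotation": 0.6}, current="forward")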

9.2.2 Experiment

All learning, feature processing, and data collection were done on the Raspberry Pi directly connected to an iRobot Create, within the 30 ms update cycle of the robot. Both demons used separate instances of the GTD(λ) algorithm for learning, with the same learning-rate parameter values: α = 0.05 and αh = α/100. The code for this experiment was based on a publicly available C programming interface for the iRobot Create (Mahmood & Sutton, 2013).
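For reference, a minimal sketch of a single demon's GTD(λ) update with linear function approximation, following the standard gradient-TD form (Maei, 2011); the defaults match the learning rates above, but details such as how γ and λ are allowed to vary per step are simplified here, and the thesis's exact implementation may differ.

import numpy as np

class GTDLambda:
    """Sketch of one demon's GTD(lambda) learner with linear function
    approximation (standard gradient-TD form)."""

    def __init__(self, num_features, alpha=0.05, alpha_h=0.05 / 100, lam=0.9):
        self.w = np.zeros(num_features)    # primary weights (the prediction)
        self.h = np.zeros(num_features)    # secondary (correction) weights
        self.e = np.zeros(num_features)    # eligibility trace
        self.alpha, self.alpha_h, self.lam = alpha, alpha_h, lam

    def update(self, x, z, x_next, gamma, gamma_next, rho):
        # TD error with cumulant z and (possibly state-dependent) termination.
        delta = z + gamma_next * self.w.dot(x_next) - self.w.dot(x)
        # Off-policy eligibility trace with importance-sampling ratio rho.
        self.e = rho * (gamma * self.lam * self.e + x)
        # Gradient-corrected update of the primary weights.
        self.w += self.alpha * (delta * self.e
                                - gamma_next * (1.0 - self.lam)
                                  * self.e.dot(self.h) * x_next)
        # LMS-style update of the secondary weights.
        self.h += self.alpha_h * (delta * self.e - x.dot(self.h) * x)
        return self.w.dot(x)               # current prediction for state x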

2The threshold of 0.2 was determined via trial and error and worked well for our robot and the particular GVFs specified. The demonstration of this chapter is meant simply to highlight one use of learning multiple GVFs. As such, our adaptive behavior algorithm is functionally useful for our purposes, but perhaps not as general as it could be. Much more work is needed to determine a general-purpose algorithm for adapting behavior based on parallel GVF learning.

Our experiment involved three phases. During the first phase (typically four to six minutes), the robot followed the non-reactive behavior policy. After the UDE of both demons decreased substantially, the behavior policy was switched to the adaptive behavior (Algorithm 2), beginning the second phase. After approximately two minutes, the robot (including learning) was paused and a five pound load was placed in the cargo bay of the Create. Then the robot resumed following the adaptive behavior and learning. The load had a significant effect on the battery current draw, but was not heavy enough to affect the robot's ability to achieve the requested wheel velocity for the forward policy. This experiment was repeated several times, varying the lighting conditions, the wall-clock duration of each phase, and the camera orientation on the robot. Figures 9.5 and 9.6 show the results of one such run.

9.2.3 Results and conclusions


Figure 9.5: A plot of the UDE measure for the forward demon (top) and rotation demon (bottom). Both graphs show the three phases of the experiment: (1) initial learning under the non-reactive behavior, (2) learning under the adaptive behavior, and (3) learning under the adaptive behavior after the load was added to the robot. The dotted line on the bottom plot marks the 0.2 threshold used by the adaptive behavior to switch target policies. The light red vertical shaded region marks when the robot (and learning) were paused to place the load on the robot. A large spike in the UDE measure of the rotation demon occurs after the load is added in phase three. The UDE reduces quickly thereafter, because the robot continuously rotates, providing data to update the demon's prediction of battery current. In this particular run, a second spike in UDE occurs after the initial spike. The robot's initial rotation due to the increase in UDE continued until the UDE was suppressed below the 0.2 threshold. After several seconds, the rotation demon encountered a new situation (according to its feature vector) where its prediction of battery current was still incorrect, and thus the UDE spiked again. This caused another extended, but shorter, rotation.

Figure 9.5 shows the UDE of each demon. There was no noticeable change in the UDE of either demon when the behavior policy was changed at the end of phase one. After the load was added to the robot in phase three, the UDE of the forward demon was largely unaffected. The UDE of the rotation demon, on the other hand, shot up quickly after the load was added, but reduced back down to a low level quickly, in comparison to the UDE reduction observed at the beginning of the experiment. The decrease in the rotation demon's UDE under continual rotation (with the load added) was faster than the decrease in UDE observed during initial learning, when rotation occurred sporadically. The extra load generated a clear and obvious change in the robot's behavior, as shown in Figure 9.6 and observed by the experimenters. During phase two, when neither demon's UDE measure exceeded the threshold, the adaptive behavior only occasionally executed rotations in free space.


Figure 9.6: These two plots show the action selection choices of the adaptive behavior policy before and after the load was added to the robot (marked pause). The top figure shows which demon's policy was followed by the behavior as a binary time course. At times, neither policy is followed, because the robot automatically reverses and rotates counter-clockwise after bumping, which is not part of either target policy. The bottom histogram shows the number of time steps on which the rotation action was selected over the same time period. The upper plot shows that the robot continually selected actions in accordance with the rotation demon's target policy after the load was added, and then returned to normal operation. The second extended rotation is also visible. The lower plot illustrates the same phenomenon, and also shows the variability in the number of rotations throughout the run.

The first time the adaptive behavior selected a counter-clockwise rotation after the load was added in phase three, a large increase in UDE was produced by the rotation demon. The increased UDE caused the adaptive behavior to produce an extended counter-clockwise rotation: a very distinctive change in behavior. Figure 9.6 provides two quantitative measures of how the actions were selected during phases two and three. After several seconds of rotation, the UDE of the rotation demon subsided below 0.2, the constant rotation ceased, and the behavior returned to normal operation (in comparison to phase two). This simple experiment demonstrates one way to usefully adapt a robot's behavior based on the feedback of several demons. We have demonstrated that a change in the robot's environment that causes a substantial change in a single demon's prediction accuracy can be detected using our UDE measure, and that the increase in UDE can be used to adapt the robot's behavior to help relearn the demon's predictions.

9.3 Discussion

Our simple adaptive behavior corresponds to a limited form of artificial curiosity. In our experiment, the behavior would focus its action selection on one of the target policies if the robot happened to stumble across a situation that caused an increase in UDE, in this case a human loading the robot. The robot would continue to focus on the new error-generating situation until learning caused the UDE to decrease, eventually permitting the robot to return to normal operation. This behavior is curious and reactive to situations that result in increased demon error, becoming temporarily fixated and then moving on. Our approach is limited because the behavior policy is not permitted to select a UDE-maximizing action on each time-step. The adaptive behavior only periodically queries the UDE measures, and only permits following a single target policy for several consecutive time steps. This means the robot cannot actively search out new situations that yield high UDE, instead relying on happening upon situations that induce high UDE. In addition, the robot cannot discover unique sequences of actions, different from those produced by the target policies, that might reduce the UDE of many demons simultaneously.

The ideas and demonstrations in this chapter are related to and inspired by the rich history of work in intrinsically motivated reinforcement learning (IMRL). Many IMRL systems make use of an additional intrinsic reward function computed from things internal to the agent, such as value function change (Simsek & Barto, 2006; Schembri et al., 2007), model error (Schmidhuber, 1991; Lopes et al., 2012), and prediction error (Singh et al., 2005; Oudeyer et al., 2007).

These internal rewards can be used to encourage the agent to select actions it would not normally select under the external reward, in order to facilitate learning more generally applicable policies or to increase the speed of learning on tasks encountered in the future. This approach views internal motivations and curiosity as a tool to help improve the agent's performance. Internal rewards can also be used to simulate artificial curiosity, where the agent takes actions to generate new experiences solely for the sake of exploring (e.g., Oudeyer et al., 2007). Finally, a recent and significant departure from previous IMRL work takes the view that all reward functions are both internal and external, and that these functions are produced by an evolutionary search (see Singh et al., 2010). Table 1 in Appendix C summarizes several studies of artificial curiosity and IMRL, focusing in particular on each system's definition of internal and external reward. In addition, Baldassarre & Mirolli (2013) provide an excellent survey of work on intrinsic motivation in reinforcement learning and psychology.

There are similarities between previous work on IMRL and our behavior adaption demonstration on the Create. Several studies demonstrated the utility of IMRL for learning multiple value functions and policies (Singh et al., 2005; Simsek & Barto, 2006; Oudeyer et al., 2007; Schembri et al., 2007; Singh et al., 2010), like our parallel demon learning, and several others also used either prediction error or TD error as a measure of learning (Schmidhuber, 1991; Singh et al., 2005; Schembri et al., 2007), including the one work that included demonstrations on a real robot (Oudeyer et al., 2007). The learning system of Singh et al. (2005) is most related to ours: they also used parallel off-policy reinforcement learning methods (Q-learning) to learn value functions and defined internal rewards equal to prediction errors. We could improve our own system by using a reinforcement learning algorithm to learn a behavior policy that maximizes the sum of UDE over all the demons, treating it as an intrinsic reward. Such a behavior might seek out situations where the demons' knowledge was incorrect, similar to the robot of Oudeyer et al. (2007). Ours is the first demonstration combining off-policy reinforcement learning, function approximation, and behavior adaption on a robot.

9.4 Summary

In this chapter we investigated how feedback from several demons could be used to adapt the behavior policy of a robot. First we introduced a measure of unexpected demon error, which captures when a demon's predictions become unaligned with recent experience.

Next we described a simple behavior policy that periodically selects actions according to the demon with the highest UDE. We then demonstrated the usefulness of our approach on the iRobot Create, showing that the robot could effectively adapt its behavior in response to a dramatic increase in UDE when a load was placed on top of the robot. The chapter closed with a discussion of prior work related to UDE and adapting behavior based on demon feedback. This experiment provides a demonstration of how feedback from a Horde of demons can be used to improve learning. In addition, our demonstration also constitutes the first demonstration of predictive knowledge learning and online behavior adaption on a robot. The work described in this chapter makes a small addition to the development of predictive knowledge learning. There remain many interesting questions regarding how to use predictions and other feedback from demons to adapt the behavior policy. These items are left to future work.

Chapter 10

Perspectives and Future Work

This thesis investigated the problem of representing an agent's knowledge of the world with predictions. The aim was to develop the predictive approach to knowledge, improving both applicability and scalability of learning with a novel application of value functions and learning algorithms from reinforcement learning. Towards this goal, we claim progress has been demonstrated in three areas. First, GVFs have been proposed as a new representation for predictive knowledge that is expressive and potentially learnable with computationally efficient value function learning algorithms. GVF learning has been demonstrated at large scale from long, high-dimensional, continuous data streams with several experiments on mobile robots. Second, a new measure called RUPEE has been presented for estimating off-policy learning progress in situations where measuring prediction accuracy is practically challenging, such as learning about many distinct target policies on a robot. Third, several issues have been exposed and investigated regarding the empirical performance of the GTD(λ) algorithm, including the role of the secondary weight vector and the algorithm's parameter sensitivity in on-policy and off-policy learning settings. We begin by summarizing the three proposed contributions of our work, and close with future work.

10.1 General Value Functions

The application of GVFs to represent predictive knowledge facilitates posing a large diversity of predictive questions through various combinations of target policies, cumulants, and termination signals. In this thesis, we investigated several such questions, including questions about what will happen next, such as the future value of the magnetic sensor in the next 100 ms (see Chapter 5), policy-contingent questions, such as the time to stop on different surfaces (see Chapter 6), and questions about complex termination conditions, such as predicting light after rounding the next corner (see Chapter 6).

Our approach is promising because GVFs make it possible to capture policy-contingent declarative and goal-oriented predictive knowledge, and a large amount of knowledge is of this form. Conventional knowledge representations that can capture this kind of knowledge (e.g., high-level, symbolic methods such as rules, operators, and production systems) are not as grounded and therefore not as learnable as GVFs. Although value functions have always been potentially learnable, only recently have scalable learning methods become available that make it practical to explore the idea of GVFs with off-policy learning and function approximation. With small modifications for parallel learning, temporal difference learning algorithms were used to make and update predictions online, while the agent interacted with the world, from continuous inputs and off-policy samples. In comparison, prior prediction learning systems are either not well suited for continuous domains, or are difficult to scale to large numbers of predictions and long data streams. We have demonstrated with our nexting (Chapter 5) and policy-contingent prediction experiments on the Critterbot (Chapter 8) that GVF learning can be scaled to learning thousands of predictions from hours of data, well beyond the demonstrations of previous work.

10.2 Online off-policy progress estimation

One special challenge for large-scale off-policy learning on a robot involves evaluating the learning progress of each individual prediction. Many approaches to progress estimation use knowledge of the MDP parameters, limit the span of individual predictions, or limit the number of distinct target policies that can be used to specify predictions. In Chapter 8, we proposed an alternate approach based on estimating the MSPBE of each demon. Exploiting the fact that the secondary weights of the GTD(λ) algorithm estimate a portion of the MSPBE, we developed two measures, called RUPEEvec and RUPEEscalar, that combine GTD(λ)'s partial estimate of the MSPBE with a trace of recent errors. The resultant measures can be updated incrementally with linear computation per demon, refining their estimate of the MSPBE with each sample, and are both naturally parallelizable. These RUPEE measures do not require special testing phases, and thus may simplify the design and implementation of the behavior policy.

Our RUPEE measures performed well in both a Markov chain domain and on the Critterbot. One of our RUPEE measures provided a reasonable estimate of the MSPBE in several variants of the Markov chain domain.

In addition, both RUPEE measures seemed to provide a surrogate measure of aggregate prediction accuracy, as measured by test excursions, for hundreds of policy-contingent predictions learned on the Critterbot. It was also empirically demonstrated that our measures can react quickly to changes, in both the chain domain and the robot prediction tasks. Finally, RUPEE was essential for our demonstration of policy-contingent prediction learning on the Critterbot, providing an estimate of learning progress for thousands of demons whose GVFs were contingent on hundreds of distinct target policies. Such a demonstration would not have been possible without an online measure of learning progress.

10.3 Empirical study of GTD(λ)

Gradient-TD methods make exploring parallel demon learning practical for the first time, but our empirical understanding of these new algorithms is limited. This thesis provides new insights for one such algorithm, GTD(λ). The Markov chain experiments of Chapter 7 investigate a distinctive aspect of the GTD(λ) algorithm: the secondary weight vector and gradient correction. Our experiments demonstrate that when GTD(λ) is tuned to optimize the MSPBE, the algorithm may not accurately learn the secondary weights. On the other hand, when we optimized the learning-rate parameters of GTD(λ) to learn the secondary weights, GTD(λ) more effectively learned these weights, and did not exhibit divergence. We also demonstrated that GTD(λ) can learn efficiently without the secondary weights, even on off-policy problem instances, but the performance of GTD(λ) with the secondary weights, learned in the usual way, was competitive.

The gradient-TD methods are intended to prevent divergence under off-policy sampling; to continue our study we moved to a variant of Baird's (1995) counterexample that causes divergence of linear off-policy TD. The results in this domain were different from those in the chain domain: the secondary weights appear to be important for avoiding divergence, and the default behavior of the algorithm (minimizing the MSPBE) causes the secondary weights to be learned well. In addition, our counterexample experiments revealed that large values of λ can lead to instability in GTD(λ), due to successive large importance sampling corrections multiplying in the trace update. Our experiments in these two problems illustrate that good performance for GTD(λ) can be produced with very different parameter settings. Our results highlight limitations of the GTD(λ) algorithm, and suggest a starting point for future algorithmic development.

155 10.4 Limitations and future work

The investigation of GVFs as a representation for predictive knowledge begun here is far from complete. We have opened or continued study on several topics, and much remains to be done, including: (1) comparing the representation capabilities of GVFs with those of PSRs, TPSRs, TD-nets, and prediction profile models, (2) empirically exploring the use of predictive-state features with Horde, (3) a theoretical characterization of RUPEE, (4) developing algorithms for mitigating variance in off-policy learning attributed to large importance sampling corrections, and (5) combining on- and off-policy demon learning, control demon learning, and behavior learning in a much larger experiment.

10.4.1 Comparing predictive representations of knowledge

Our discussion of related approaches in Chapter 4 was limited to high-level comparisons with other representations of predictive knowledge. Our focus in this thesis was to ensure our knowledge representation based on GVFs achieved certain desirable attributes present, in part, in other predictive representations, including compatibility with predictive feature components, compositional prediction, and policy-contingent prediction. A possible direction for future work involves more direct and specific comparisons among GVFs and PSRs, TPSRs, TD-nets, and prediction profile models. This could be done by characterizing what types of predictions can be encoded and learned with each representation. It may also be fruitful to analyze and relate the compactness of GVFs with other representations, similar to how PSRs have been compared with POMDP models (Littman, Sutton & Singh, 2002; Singh et al., 2004). Finally, the usefulness of predictive knowledge encoded as GVFs could be evaluated against other representations, for example by examining their usefulness as features for policy learning, similar to the work of Rafols (2005) and Schaul and Ring (2013).

10.4.2 Predictive features and a Horde of demons

In Chapter 6, we provided a small demonstration of how predictions can participate in feature generation, similar to the idea of predictive state representations. Our demonstration was limited in that only a single prediction was used as part of the feature vector, and that prediction was used only to improve its own learning. Although an interesting special case, our demonstration was far from the other extreme, exemplified by PSRs, where all the feature components are computed from predictions. Horde enables part of the feature vector to be predictive and part to be non-predictive.

An initial subset of the predictions could be learned using only the non-predictive components of the feature vector; later, once this subset of the predictions has become accurate, other more complex predictions could be learned with the greater representational capacity afforded by the predictive feature components. Future work could investigate expanding the representational power of the feature vector through predictive features, with experiments in several domains with varying degrees of partial state information.
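As an illustration of what such a mixed feature vector might look like, the sketch below appends a one-hot binning of a single demon's scalar prediction to a non-predictive base vector. The binning scheme and bin count are illustrative choices of ours, not the construction used in Chapter 6.

import numpy as np

def augment_with_prediction(base_features, prediction, low, high, num_bins=16):
    """Append a one-hot encoding of a demon's scalar prediction to a
    non-predictive binary feature vector.  `low` and `high` bound the
    prediction's expected range; values outside are clipped."""
    clipped = min(max(prediction, low), high)
    bin_index = min(int((clipped - low) / (high - low) * num_bins), num_bins - 1)
    predictive_part = np.zeros(num_bins)
    predictive_part[bin_index] = 1.0
    return np.concatenate([base_features, predictive_part])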

10.4.3 A theoretical analysis of RUPEE

In Chapter 8, we proposed two ways to estimate the MSPBE. Our analysis of these new estimators was limited to empirical studies in simulation domains and on a robot, where these new measures appeared to suit our purposes. In addition to empirical success, the bias, consistency, and variance of these estimators are also of interest. Both our RUPEE measures used the secondary weight vector that is updated by an LMS rule inside the GTD(λ) algorithm. Much of the LMS literature is concerned with the statistical properties of the estimate x^⊤h, compared to the estimate given the optimal weight vector, x^⊤h^*, whereas in RUPEE the quantity of interest is the vector h itself. A theoretical analysis of RUPEE could potentially yield improvements in how we estimate the MSPBE, and insights into how the GTD(λ) algorithm could be improved.

10.4.4 Mitigating variance in off-policy learning

The experiments of Chapter 7 revealed a previously undocumented stability problem with the GTD(λ) algorithm on Baird's counterexample. At first it appeared we had come across an example of divergence for an algorithm we believed to be convergent, given that a large sweep of the learning-rate parameters did not yield a non-divergent learning curve. Independently, Sutton et al. (2015) encountered similar behavior with a different TD-based algorithm. Our empirical study of this issue was limited to a single counterexample, and it remains unclear how pervasive such variance is in prediction tasks. Nevertheless, this behavior is concerning because we would like to ensure, if at all possible, that no particular predictive question causes a demon's predictions to become unstable. A possible direction for future work involves developing new learning algorithms that are robust to this setting. One idea is to set λ as a function of state, allowing it to change with time, perhaps reducing λ to zero when the magnitude of the trace becomes large. The λ parameter is part of the definition of the MSPBE, and thus how to construct an objective function that enables optimization of λ is not at all clear at this time.

Another option would be to use some form of weighted importance sampling variant of GTD(λ), which should reduce variance. The details of weighted importance sampling have recently been worked out for off-policy LSTD (Mahmood et al., 2014), but extensions to linear-complexity learning methods remain an open problem.
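The first idea can be sketched as a simple modification of the trace update: reduce λ to zero for the current step whenever the usual update would make the trace too large. The norm threshold and the all-or-nothing reduction below are illustrative choices of ours, not a worked-out proposal.

import numpy as np

def clamped_trace_update(e, x, rho, gamma, lam, max_norm=10.0):
    """Eligibility-trace update in which lambda is effectively set to zero
    whenever the ordinary update would produce a trace whose norm exceeds
    max_norm, limiting the build-up of large importance-sampling corrections."""
    candidate = rho * (gamma * lam * e + x)
    if np.linalg.norm(candidate) > max_norm:
        candidate = rho * x        # equivalent to using lambda = 0 on this step
    return candidate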

10.4.5 Putting it all together

Throughout this thesis we demonstrated several components of the Horde architecture, including large-scale nexting predictions in Chapter 5, large-scale policy-contingent prediction in Chapter 8, predictive state components in Chapter 6, and behavior adaption based on demon learning in Chapter 9. Our experiments were limited to fixed linear function approximation architectures, and to learning from only a few hours of data. A potentially fruitful trajectory for future work involves running experiments with more data and more adaption, including learning the behavior policy and automatically generating new features, and with even more demons. These experiments could provide insight into how many demons could be either exhaustively generated, or even adaptively generated (i.e., autonomous goal generation). Such experiments may require more advanced parallel implementations of Horde, such as on GPUs and new robot platforms. Appendix C contains one proposal for a scaling experiment.

10.5 Summary

This chapter closed the thesis with concluding thoughts, a discussion of the limitations of our work, and possible directions for future work. We recapped the three areas in which progress can be claimed, owing to the work summarized in this document. We then closed with a discussion of topics for future work, including comparing different representations of predictive knowledge, investigating predictive feature components, a theoretical analysis of RUPEE, mitigating variance in off-policy learning, and further scaling up and integrating prediction learning.

Bibliography

Antos, A., Szepesvari, C., & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1), 89–129.

Abeyruwan, S., Seekircher, A., & Visser, U. (2014). Off-policy general value functions to represent dynamic role assignments in RoboCup 3D soccer simulation. arXiv preprint arXiv:1402.4525.

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approxima- tion. In Proceedings of the Twelfth International Conference on Machine Learning, 30– 37.

Baldassarre, G., & Mirolli, M. (Eds.). (2013). Intrinsically Motivated Learning in Natural and Artificial Systems. Berlin, Springer.

Becker, J. D. (1973). A model for the encoding of experiential information. In R. C. Shank and K. M. Colby (Eds.), Computer models of thought and language (pp. 396–434). San Francisco, W. H. Freeman and Company.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scien- tific, Belmont, MA.

Boots, B., Siddiqi, S. & Gordon G. (2010). Closing the learning-planning loop with predic- tive state representations. In Proceedings of Robotics: Science and Systems.

Boots, B., Siddiqi, S. M. & Gordon, G. J. (2011). Closing the learning-planning loop with predictive state representations. International Journal of Robotics Research, 30(7), 954– 966.

Boots, B., & Gordon, G. J. (2011). An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence.

Bontempi, G. (1999). Local Learning Techniques for Modelling, Prediction and Control (Doctoral dissertation). Université Libre de Bruxelles, Belgium.

Bontempi, G., & Ben Taieb, S. (2011). Conditionally dependent strategies for multiple- step-ahead prediction in local learning. In International Journal of Forecasting, 27(3), 689–699.

Box, G. E., Jenkins, G. M., & Reinsel, G. C. (2011). Time Series Analysis: Forecasting and Control. Wiley.

Boyan, J. A. (1999). Least-squares temporal difference learning. International Conference on Machine Learning, 49–56.

Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Ma- chine Learning, 49(2-3), 233–246.

Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal differ- ence learning. Machine Learning, 22(1-3), 33-57.

Brogden, W. (1939). Sensory pre-conditioning. Journal of Experimental Psychology, 25(4), 323–332.

Butz, M., Sigaud, O., & Gérard, P. (Eds.). (2003). Anticipatory Behaviour in Adaptive Learning Systems: Foundations, Theories, and Systems, LNAI 2684, Springer.

Camacho, E. F., & Bordons, C. (2004). Model Predictive Control. Springer.

Carlsson, K., Petrovic, P., Skare, S., Petersson, K., & Ingvar, M. (2000). Tickling expec- tations: Neural processing in anticipation of a sensory stimulus. Journal of Cognitive Neuroscience, 12(4), 691–703.

Cheng, H., Tan, P., Gao, J., & Scripps, J. (2006) Multistep-ahead time series prediction. Lecture Notes in Computer Science, Springer, 765–774.

Chevillon, G., & Hendry, D. F. (2005). Non-parametric direct multi-step estimation for forecasting economic processes. International Journal of Forecasting, 21(2), 201–218.

Clark, A. (2012). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181–204.

Crites, R. H., & Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3), 235–262.

Cunningham, M. (1972). Intelligence: Its Organization and Development, New York, Academic Press.

Dann, C., Neumann, G., & Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15, 809–883.

Degris, T., Pilarski, P. M., & Sutton, R. S. (2012). Model-free reinforcement learning with continuous action in practice. In Proceedings of the American Control Conference, 2177–2182.

Degris, T., White, M., & Sutton, R. S. (2012). Off-policy Actor-Critic. International Con- ference on Machine Learning.

Delp, M. (2011). Experiments in Off-Policy Reinforcement Learning with the GQ(λ) (Mas- ters thesis), University of Alberta.

Drescher, G. L. (1991). Made-up Minds: A Constructivist Approach to Artificial Intelli- gence. MIT Press.

Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A., Lally, A., Murdock, J. W., Nyberg, E., Prager, J., Schlaefer, N., & Welty, C. (2010). Building Watson: An overview of the DeepQA project. AI Magazine, 31(3), 59–79.

Edwards, A. L., Dawson, M. R., Hebert, J. S., Sutton, R. S., Chan, K. M., & Pilarski, P. M. (2014). Adaptive switching in practice: Improving myoelectric prosthesis performance through reinforcement learning. In Proceedings of the Myoelectric Controls Symposium, 69–73.

Geist, M., & Scherrer, B. (2014). Off-policy learning with eligibility traces: A survey. Journal of Machine Learning Research, 15(1), 289–333.

Geramifard, A., Bowling, M., & Sutton, R. S. (2006). Incremental least-squares temporal difference learning. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 356–361.

Gilbert, D. (2006). Stumbling on Happiness. Knopf Press.

Grush, R. (2004). The emulation theory of representation: Motor control, imagery, and perception. Behavioural and Brain Sciences, 27, 377–442.

Gunther, J., Pilarski, P. M., Helfrich, G., Shen, H., & Diepold, K. (2014). First steps towards an intelligent laser welding architecture using deep neural networks and reinforcement learning. Procedia Technology, 15, 474–483.

Hackman, L. (2012). Faster Gradient-TD Algorithms (Masters thesis). University of Al- berta.

Hawkins, J., & Blakeslee, S. (2004). On Intelligence. Times Books.

Huron, D. (2006). Sweet Anticipation: Music and the Psychology of Expectation. MIT Press.

Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. Advances in Neural Information Processing Systems, 547–554.

Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3), 307–354.

LaValle, S. M. (2006). Planning Algorithms. Cambridge University Press.

Levitin, D. (2006). This is Your Brain on Music. Dutton Books.

Littman, M. L., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. Advances in Neural Information Processing Systems, 14, 1555–1561.

Ljung, L. (1998). System Identification: Theory for the User. Prentice Hall.

Lopes, M., Lang, T., Toussaint, M., & Oudeyer, P. Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. Advances in Neural Information Processing Systems, 206–214.

Maei, H. R., Szepesvari, C., Bhatnagar, S., Precup, D., Silver, D., & Sutton, R. S. (2010). Convergent temporal-difference learning with arbitrary smooth function approximation. Advances in Neural Information Processing Systems.

Maei, H. R., & Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, 91–96.

Maei, H. R., Szepesvári, C., Bhatnagar, S., & Sutton, R. S. (2010b). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, 719–726.

Maei, H. R. (2011). Gradient Temporal-Difference Learning Algorithms (Doctoral disser- tation). University of Alberta.

Mahmood, A. R., & Sutton, R. S. (2013). Create Serial Port Packet Processor. [Software]. Available from https://github.com/armahmood/Create-Serial-Port-Packet-Processor

Mahmood, A. R., van Hasselt, H & Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. Advances in Neural Information Processing Systems 27.

Markoff, J. (2010, October 9). Google cars drive themselves, in traffic. The New York Times.

McCracken, P., & Bowling, M. (2005). Online discovery and learning of predictive state representations. Advances in Neural Information Processing Systems, 875-882.

Modayil, J., White, A., & Sutton, R. S. (2012a) Multi-timescale nexting in a reinforcement learning robot. In T. Ziemke, C. Balkenius, J. Hallam (Eds.), From Animals to Animats: 12th International Conference on Simulation of Adaptive Behavior, SAB 2012 (pp. 299– 309), LNAI 7426, Heidelberg, Springer.

Modayil, J., White, A., Pilarski, P. M., & Sutton, R. S. (2012b). Acquiring a broad range of empirical knowledge in real time by temporal-difference learning. 2012 IEEE Inter- national Conference on Systems, Man, and Cybernetics, 1903-1910.

Modayil, J., White, A., Pilarski, P. M., & Sutton, R. S. (2012c). Acquiring diverse predic- tive knowledge in real time by temporal-difference learning. International Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems.

Modayil, J., White, A., & Sutton, R. S. (2014). Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 22(2), 146–160.

Oudeyer, P.-Y., Kaplan, F., & Hafner, V. (2007). Intrinsic motivation systems for au- tonomous mental development. IEEE Transactions on Evolutionary Computation, 11, 265–286.

Pacheco, P. S. (1997). Parallel Programming with MPI. Morgan Kaufmann.

Parlos, A. G., Rais, O. T., & Atiya, A. F. (2000). Multi-step-ahead prediction using dynamic recurrent neural networks. Neural networks, 13(7), 765–786.

Pavlov, I. (1927). Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex. G. V. Anrep (Ed. & Trans.), Oxford University Press.

Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7), 1180–1190.

Pezzulo, G. (2008). Coordinating with the future: the anticipatory nature of representation. Minds and Machines, 18(2), 179–225.

Pierce, D., & Kuipers, B. J. (1997). Map learning with uninterpreted sensors and effectors. Artificial Intelligence, 92(1), 169–227.

Pilarski, P.M., Dawson, M.R., Degris, T.,Carey, J.P., & Sutton, R.S. (2012). Dynamic switching and real-time machine learning for improved human control of assistive biomed- ical robots. In Proceedings of the 4th IEEE International Conference on Biomedical Robotics and Biomechatronics, 296–302.

Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the International Conference on Ma- chine Learning, 417–424.

Precup, D., Sutton, R. S., Paduraru, C., Koop, A., & Singh, S. (2005). Off-policy learn- ing with options and recognizers. Advances in Neural Information Processing Systems, 1097–1104.

Rafols, E. J., Ring, M. B., Sutton, R. S. & Tanner, B. (2005) Using predictive representa- tions to improve generalization in reinforcement learning. In Proceedings of the 2005 International Joint Conference on Artificial Intelligence, 835–840.

Rafols, E. (2006). Temporal Abstraction in Temporal-difference Networks (Masters thesis). University of Alberta.

Rescorla, R. (1980). Simultaneous and successive associations in sensory preconditioning. Journal of Experimental Psychology: Animal Behavior Processes, 6(3), 207–216.

Rummery, G. A. (1995). Problem Solving with Reinforcement Learning (Doctoral disserta- tion). Cambridge University.

Schaul, T., & Ring, M. (2013). Better generalization with forecasts. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, 1656–1662.

Schembri, M., Mirolli, M., & Baldassarre, G. (2007). Evolving internal reinforcers for an intrinsically motivated reinforcement-learning robot. Development and Learning, 282– 287.

Scherrer, B. (2010). Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In Proceedings of The International Conference on Machine Learning.

Scherrer, B. & Geist, M. (2011). Recursive least-squares learning with eligibility traces. European Workshop on Reinforcement Learning, 11.

Schmidhuber J. (1991). A possibility for implementing curiosity and boredom in model- building neural controllers. In Proceedings of the 1st International Conference on Simu- lation of Adaptive Behavior, 222–227.

Silver, D., Sutton, R. S., & Muller, M. (2007). Reinforcement Learning of Local Shape in the Game of Go. The Joint Conferences on Artificial Intelligence, 7, 1053–1058.

Simsek, O., & Barto, A. G. (2006). An intrinsic reward mechanism for efficient exploration. In Proceedings of the 23rd international conference on Machine learning, 833–840.

Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 512–519.

Singh S., Barto, A. G., & Chentanez, N. (2005). Intrinsically motivated reinforcement learning. Advances in Neural Information Processing Systems, 17, 1281–1288.

Singh, S., Lewis, R. L., Barto, A. G., & Sorg, J. (2010). Intrinsically motivated reinforce- ment learning: an evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2), 70–82.

Stone, P., Sutton, R. S., & Kuhlmann, G. (2005). Reinforcement learning for robocup soccer keepaway. Adaptive Behavior, 13(3), 165–188.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Proceedings of the Seventh International Conference on Machine Learning, 216–224.

Sutton, R. S. (1995). TD models: modeling the world at a mixture of time scales. In: Proceedings of the International Conference on Machine Learning, 531–539.

Sutton, R. S. (2009). The grand challenge of predictive empirical abstract knowledge. In: Working Notes of the IJCAI-09 Workshop on Grand Challenges for Reasoning from Experiences.

Sutton, R. S. (2012). Beyond reward: The problem of knowledge and data. In S. H. Mug- gleton, A. Tamaddoni-Nezhad, F. A. Lisi (Eds.), Proceedings of the 21st International Conference on Inductive Logic Programming (pp. 2–6), LNAI 7207, Springer, Heidel- berg.

Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. Learning and Computational Neuroscience: Foundations of Adaptive Networks, 497– 537. MIT Press.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Koop, A. & Silver, D. (2007). On the role of tracking in stationary environ- ments. In Proceedings of the International Conference on Machine Learning.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., & Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, 993–1000.

Sutton, R. S., Mahmood, A. R., Precup, D., & van Hasselt, H. (2014). A new Q(λ) with interim forward view and Monte Carlo equivalence. In The Proceedings of the 31st International Conference on Machine Learning.

Sutton, R. S., Mahmood, A. R., & White, M. (2015). An emphatic approach to the problem of off-policy temporal-difference learning. arXiv preprint arXiv:1503.04269.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., & Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsuper- vised sensorimotor interaction. In: Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems, 761–768.

Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.

Sutton, R. S., Rafols, E. J., & Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems, 18, MIT Press.

Sutton, R. S., & Tanner, B. (2005). Temporal-difference networks. Advances in Neural Information Processing Systems, 17, 1377–1384.

Sutton, R. S., Szepesvári, Cs., & Maei, H. R. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems, 21. MIT Press.

Szepesvári, Cs. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1), 1–103.

Talvitie, E., & Singh, S. (2011). Learning to make predictions in partially observable environments without a generative model. Journal of Artificial Intelligence Research, 42(1), 353–392.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58–68.

Tedrake, R., Zhang, T., & Seung, H. (2005). Stochastic policy gradient reinforcement learning on a simple 3D biped. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 3, 2849–2854.

Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. MIT Press.

Thrun, S., Montemerlo, M., et al. (2006). Stanley: The robot that won the DARPA grand challenge. Journal of Field Robotics, 23(9), 661–692.

Tolman, E. C. (1951). Purposive Behavior in Animals and Men. University of California Press.

Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.

van Hasselt, H., Mahmood, A. R., & Sutton, R. S. (2014). Off-policy TD(λ) with a true online equivalence. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence.

van Overschee, P., & De Moor, B. (1996). Subspace identification for linear systems: theory, implementation, applications. Springer.

van Seijen, H., van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 177–184.

van Seijen, H., & Sutton, R. S. (2014). True online TD(λ). In Proceedings of the 31st International Conference on Machine Learning.

Wang, C. C., Thorpe, C., & Thrun, S. (2003). Online simultaneous localization and mapping with detection and tracking of moving objects: theory and results from a ground vehicle in crowded urban areas. In Proceedings of the IEEE International Conference on Robotics and Automation, 842–849.

Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dissertation). Cambridge University.

Welch, G., & Bishop, G. (1995). An Introduction to the Kalman filter (Technical Report 95-041). University of North Carolina at Chapel Hill, Computer Science Department.

White, R. W. (1959). Motivation reconsidered: The concept of competence. Psychological Review, 66(5).

White, A., Modayil, J., & Sutton, R. S. (2012). Scaling life-long off-policy learning. In Proceedings of the Second Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics.

White, A., Modayil, J., & Sutton, R. S. (2014). Surprise and curiosity for big data robotics. In Workshop Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence: Sequential Decision Making and Big Data.

White, A., White, M., & Sutton, R. S. (in prep). The off-policy advantage. Manuscript in preparation.

Widrow, B., & Stearns, S. D. (1985). Adaptive Signal Processing. Englewood Cliffs, NJ, Prentice-Hall, Inc.

Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256.

Wingate, D. (2008). Exponential Family Predictive Representations of State (Doctoral dissertation). University of Michigan.

Wingate, D., & Singh, S. (2008). Efficiently learning linear-linear exponential family predictive representations of state. In Proceedings of the 25th International Conference on Machine Learning, 1176–1183.

Wolfe, B., James, M. R., & Singh, S. (2005). Learning predictive state representations in dynamical systems without reset. In Proceedings of the 22nd International Conference on Machine Learning, 980–987.

Wolpert, D., Ghahramani, Z., & Jordan, M. (1995). An internal model for sensori-motor integration. Science, 269(5232), 1880–1882.

Wong, W. K., Xia, M., & Chu, W. C. (2010). Adaptive neural network model for time-series forecasting. European Journal of Operational Research, 207(2), 807–816.

Xu, X., He, H. G., & Hu, D. (2002). Efficient reinforcement learning using recursive least-squares methods. Journal of Artificial Intelligence Research, 16(1), 259–292.

Yu, C., & Ballard, D. (2004). A multimodal learning interface for grounding spoken language in sensory perceptions. ACM Transactions on Applied Perception, 1, 57–80.

Zhang, G. P., & Qi, M. (2005). Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 160(2), 501–514.

Appendix

A A new hybrid-TD algorithm

Conventional temporal difference updating can be more data efficient than gradient temporal difference updating, but the correction term used by gradient-TD methods helps prevent divergence. Sutton's (2009) results suggested that there are situations (specifically on-policy) where linear TD(0) can outperform gradient-TD methods, and Hackman (2012) demonstrated that expected Sarsa(0) can outperform multiple variants of the GQ(0) algorithm, even under off-policy sampling. Our own experiments on the Markov chain demonstrated that a variant of GTD(λ) without gradient correction can outperform GTD(λ) with gradient correction. On the other hand, linear off-policy TD(0) and GTD(λ) without gradient correction diverged on Baird's counterexample.

The idea of hybrid-TD methods is to achieve sample efficiency closer to conventional TD learning, while ensuring non-divergence under off-policy sampling. To achieve this, a hybrid algorithm could perform conventional, uncorrected TD updates when the data is sampled on-policy, and use gradient corrections when the data is sampled off-policy. This approach was pioneered by Maei (2011), leading to the derivation of Hybrid Temporal Difference learning, or HTD(0). Later, Hackman (2012) produced hybrid versions of the GQ(0) algorithm. In this section, we derive the first hybrid temporal difference method to make use of eligibility traces, called HTD(λ), using a derivation scheme similar to that of Maei (2011).

The key idea behind the derivation of hybrid temporal difference learning methods is to modify the gradient of the MSPBE to produce a new learning algorithm. The gradient of the MSPBE is:

$$-\tfrac{1}{2}\nabla_w \mathrm{MSPBE}(w_t) = \mathbb{E}_\pi\!\left[(x(S_t) - x^{\pi,\gamma}(S_t))\,e_t^\top\right]^\top \mathbb{E}_\mu\!\left[x(S_t)x(S_t)^\top\right]^{-1} \mathbb{E}_\mu[\delta_t e_t],$$

where

$$x^{\pi,\gamma}(s) \overset{\mathrm{def}}{=} \sum_{a\in\mathcal{A}} \pi(s,a) \sum_{s'\in\mathcal{S}} P(s'|s,a)\,\gamma(s')\,x(s').$$

Substituting $C = \mathbb{E}_\mu[x(S_t)x(S_t)^\top]$, $A_\pi = \mathbb{E}_\pi[(x(S_t) - x^{\pi,\gamma}(S_t))e_t^\top]$, $b = \mathbb{E}_\pi[R_t x(S_t)]$, and $-A_\pi w_t + b = \mathbb{E}_\mu[\delta_t e_t]$ (see Maei, 2011), we get the well-known TD fixed-point equation,

$$0 = -\tfrac{1}{2}\nabla_w \mathrm{MSPBE}(w_t) = A_\pi^\top C^{-1}(-A_\pi w_t + b). \qquad (1)$$

The value of $w_t$ for which Equation (1) is zero is the solution found by linear TD(λ) and LSTD(λ). The gradient of the MSPBE yields an incremental learning rule with the following general form (see Bertsekas & Tsitsiklis, 1996):

$$w_{t+1} = w_t + \alpha(Mw_t + b),$$

where $M = -A_\pi^\top C^{-1}A_\pi$. The update rule, in the case of TD(λ), will yield stable convergence if $A_\pi$ is positive definite (as shown by Tsitsiklis & Van Roy, 1997). In off-policy learning, we require $A_\pi^\top C^{-1}A_\pi$ to be positive definite to satisfy the conditions of the ordinary differential equation proof of convergence (Maei & Sutton, 2010), which holds because $C^{-1}$ is positive definite and therefore $A_\pi^\top C^{-1}A_\pi$ is positive definite, because $A_\pi$ is full rank (true by assumption). See Sutton, Mahmood & White (2015) for a nice discussion of why the $A_\pi$ matrix must be positive definite to ensure stable, non-divergent iteration. The $C$ matrix in Equation (1) can be replaced by any positive definite matrix and the fixed point will be unaffected, but the rate of convergence will almost certainly change.

Instead of following the usual recipe for deriving GTD, let us try replacing $C^{-1}$ with $A_\mu^{-\top}$, where

$$A_\mu = \mathbb{E}_\mu\!\left[e_t\,(x(S_t) - x^{\mu,\gamma}(S_t))^\top\right].$$

The matrix $A_\mu^{-\top}$ is a positive definite matrix (proved by Bertsekas & Tsitsiklis, 1996).

Plugging $A_\mu^{-\top}$ into Equation (1) results in the following expected update:

$$\begin{aligned}
\frac{1}{\alpha}\,\mathbb{E}[\Delta w_t] &= A_\pi^\top A_\mu^{-\top}(-A_\pi w_t + b)\\
&= (A_\mu^\top - A_\mu^\top + A_\pi^\top)\,A_\mu^{-\top}(-A_\pi w_t + b)\\
&= (A_\mu^\top A_\mu^{-\top})(-A_\pi w_t + b) + (A_\pi^\top - A_\mu^\top)\,A_\mu^{-\top}(-A_\pi w_t + b)\\
&= (-A_\pi w_t + b) + (A_\pi^\top - A_\mu^\top)\,A_\mu^{-\top}(-A_\pi w_t + b)\\
&= (-A_\pi w_t + b) + \Big(\mathbb{E}_\mu[(x(S_t) - x^{\pi,\gamma}(S_t))\,e_t^\top] - \mathbb{E}_\mu[(x(S_t) - x^{\mu,\gamma}(S_t))\,e_t^\top]\Big)\,A_\mu^{-\top}(-A_\pi w_t + b)\\
&= (-A_\pi w_t + b) - \mathbb{E}_\mu[(x^{\pi,\gamma}(S_t) - x^{\mu,\gamma}(S_t))\,e_t^\top]\,A_\mu^{-\top}(-A_\pi w_t + b)\\
&= \mathbb{E}_\mu[\delta_t e_t] - \left(\sum_{s_t\in\mathcal{S}} d_\mu(s_t) \sum_{a\in\mathcal{A}} \big(\pi(s_t,a) - \mu(s_t,a)\big) \sum_{s'\in\mathcal{S}} P(s'|s_t,a)\,\gamma(s')\,x(s')\,e_t^\top\right) A_\mu^{-\top}(-A_\pi w_t + b), \qquad (2)
\end{aligned}$$

where $-A_\pi w_t + b = \mathbb{E}_\mu[\delta_t e_t]$ by definition (from Maei, 2011). As in the derivation of GTD (see Maei, 2011), let the vector $h_t$ form a quasi-stationary estimate of the final term:

$$A_\mu^{-\top}(-A_\pi w_t + b).$$

Now multiplying the second term of Equation (2) by $\frac{\mu(s_t,a)}{\mu(s_t,a)}$ gives:

$$\frac{1}{\alpha}\,\mathbb{E}[\Delta w_t] = \mathbb{E}_\mu[\delta_t e_t] - \left(\sum_{s_t\in\mathcal{S}} d_\mu(s_t) \sum_{a\in\mathcal{A}} \mu(s_t,a)\,\frac{\pi(s_t,a) - \mu(s_t,a)}{\mu(s_t,a)} \sum_{s'\in\mathcal{S}} P(s'|s_t,a)\,\gamma(s')\,x(s')\,e_t^\top\right) h_t. \qquad (3)$$

We can sample Equation (3) according to $d_\mu$, $\mu$, and $P$, yielding the following stochastic update:

$$\Delta w_t = \alpha\left(\delta_t e_t + \gamma_{t+1}\,\frac{\pi(s_t,a_t) - \mu(s_t,a_t)}{\mu(s_t,a_t)}\,(h_t^\top e_t)\,x_{t+1}\right). \qquad (4)$$

The sample update for the eligibility trace is $e_t = \rho_t(x_t + \gamma_t\lambda e_{t-1})$ (see Maei, 2011, Equation A.7).

In Equation (4), we have achieved the goal of creating a hybrid algorithm. Notice that when the data is generated on-policy ($\pi = \mu$),

$$\frac{\pi(s_t,a) - \mu(s_t,a)}{\mu(s_t,a)} = 0$$

and $\rho_t = 1$, and thus the correction term disappears and we are left with precisely linear TD(λ). When $\pi \neq \mu$, the likelihood-ratio TD update is corrected as in GTD, and unsurprisingly, the correction is slightly different but has the same basic form.

To complete the derivation, we must derive an incremental update rule for $h_t$. We have a linear system, because

$$h_t = A_\mu^{-\top}(-A_\pi w_t + b) \;\Longrightarrow\; A_\mu^\top h_t = -A_\pi w_t + b.$$

It is well known that the expected update

$$h_{t+1} = h_t + \alpha_h\big((-A_\pi w_t + b) - A_\mu^\top h_t\big)$$

converges if $A_\mu^\top$ is positive definite and $\alpha_h$ is chosen appropriately (see Sutton, Mahmood & White, 2015). To sample this update, recall

$$(-A_\pi w_t + b) - A_\mu^\top h_t = \mathbb{E}_\mu[\delta_t e_t] - \mathbb{E}_\mu[(x(S_t) - x^{\mu,\gamma}(S_t))\,e_t^\top]\,h_t;$$

we can rearrange terms and sample to obtain a stochastic update rule for $h_t$:

$$\Delta h_t = \alpha_h\big(\delta_t e_t - (x_t - \gamma_{t+1}x_{t+1})\,e_t^\top h_t\big).$$

As in GTD, $\alpha$ and $\alpha_h$ are step-size parameters, and $\delta_t = R_{t+1} + \gamma_{t+1}w_t^\top x_{t+1} - w_t^\top x_t$. This hybrid-TD algorithm should converge under off-policy sampling using a proof technique similar to the one used for GQ(λ) (see Maei & Sutton, 2010), but this is not proven here.
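For concreteness, the per-time-step computation implied by the derivation can be collected into a few lines. The sketch below assumes linear function approximation with NumPy feature vectors and known target- and behavior-policy probabilities for the action taken; it is a minimal illustration of the updates above (the trace, Equation (4), and the rule for $h_t$), not a tested implementation.

    import numpy as np

    class HTDLambda:
        """Minimal sketch of the HTD(lambda) updates derived above (illustrative, untested)."""

        def __init__(self, num_features, alpha, alpha_h, lmbda):
            self.w = np.zeros(num_features)   # value-function weights
            self.h = np.zeros(num_features)   # auxiliary weights, quasi-stationary estimate of A_mu^{-T}(-A_pi w + b)
            self.e = np.zeros(num_features)   # eligibility trace
            self.alpha, self.alpha_h, self.lmbda = alpha, alpha_h, lmbda

        def update(self, x_t, x_tp1, reward, gamma_t, gamma_tp1, pi_prob, mu_prob):
            rho = pi_prob / mu_prob                                # importance-sampling ratio for the taken action
            self.e = rho * (x_t + gamma_t * self.lmbda * self.e)   # e_t = rho_t (x_t + gamma_t lambda e_{t-1})
            delta = reward + gamma_tp1 * self.w.dot(x_tp1) - self.w.dot(x_t)
            # Equation (4): the correction vanishes on-policy because pi_prob == mu_prob.
            correction = gamma_tp1 * ((pi_prob - mu_prob) / mu_prob) * self.h.dot(self.e)
            delta_w = self.alpha * (delta * self.e + correction * x_tp1)
            # Auxiliary rule: Delta h_t = alpha_h (delta_t e_t - (x_t - gamma_{t+1} x_{t+1}) e_t^T h_t).
            delta_h = self.alpha_h * (delta * self.e - (x_t - gamma_tp1 * x_tp1) * self.e.dot(self.h))
            self.w += delta_w
            self.h += delta_h
            return delta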

Summary

In this appendix we derived a new TD algorithm, called HTD(λ). This state GVF estimation algorithm performs conventional temporal difference updates when data is sampled on-policy, and gradient-corrected updates otherwise. This new algorithm should be non-divergent, even in the off-policy case (not shown), and it is the first hybrid temporal difference method to make use of eligibility traces. Although not empirically validated here, this new algorithm should improve the efficiency of off-policy GVF learning in practice. We leave it to future work to more fully demonstrate HTD(λ)'s contribution to predictive knowledge learning.

B Algorithms for off-policy GVF learning

Our approach to learning predictive knowledge is based upon learning GVFs with incremental, linear-complexity, temporal difference learning methods. The experiments contained in these pages have shown that these TD algorithms allow substantial scaling of GVF learning, and that these algorithms can learn accurate predictions from real robot data. There are other value function learning algorithms from reinforcement learning that we could use instead of temporal difference methods. This section discusses the merits and limitations of these alternative methods, in the context of large-scale demon learning.

Although we have already discussed the scalability of least-squares temporal difference methods in Chapter 5, there are important considerations concerning non-linear function approximation. Linear gradient temporal difference learning methods, such as GTD(λ), extend with little modification to arbitrary smooth function approximation, and convergence to a local minimum is guaranteed (Maei et al., 2010). Least-squares reinforcement learning methods, on the other hand, are not applicable to non-linear function approximation by design. To see this, recall that LSTD computes the analytical solution to the MSPBE:

$$w^\star = \arg\min_w \|Xw - \Pi T_\pi^{\gamma\lambda}Xw\|_B^2,$$

a least-squares loss function. The projection matrix $\Pi$ linearly transforms the updates to the value function due to the application of the Bellman operator $T_\pi^{\gamma\lambda}$, enabling algebraic isolation and direct solution of $w^\star$. A non-linear function approximator, such as a neural network, requires a non-linear projection, in which case there is typically no closed-form solution for $w$.
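To make the contrast with incremental TD methods explicit, here is a minimal batch LSTD(0) sketch: it accumulates the linear system implied by the transitions and solves for the weights directly. That direct solve is exactly the step with no analogue when the projection is non-linear. The function name and regularization constant are illustrative choices, not the LSTD variant analyzed in Chapter 5.

    import numpy as np

    def lstd0(transitions, num_features, reg=1e-6):
        """Batch LSTD(0): accumulate the linear system and solve A w = b directly.

        transitions: iterable of (x_t, reward, gamma_tp1, x_tp1) with NumPy feature vectors.
        reg: small ridge term to keep A invertible (illustrative choice).
        """
        A = reg * np.eye(num_features)
        b = np.zeros(num_features)
        for x_t, reward, gamma_tp1, x_tp1 in transitions:
            A += np.outer(x_t, x_t - gamma_tp1 * x_tp1)
            b += reward * x_t
        return np.linalg.solve(A, b)   # the closed-form solve that non-linear approximation rules out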

Within the class of linear complexity value function learning algorithms, there are several methods potentially suitable for GVF learning. One important requirement, in the context of predictive knowledge acquisition, is non-divergence under off-policy sampling, which facilitates parallel demon learning. Three families of linear-time complexity methods have such convergence guarantees: 1) the new gradient TD methods that minimize the MSPBE, 2) residual gradient methods, and 3) incremental least-squares methods.

Residual gradient methods use the mean squared Bellman error:

$$\mathrm{MSBE}(w) = \|Xw - T_\pi^{\gamma\lambda}Xw\|_B^2.$$

Temporal difference algorithms, including TD(λ), GTD(λ), and LSTD(λ), do not typically converge to the minimum of the MSBE with function approximation. Residual gradient (RG) algorithms, due to Baird (1995), are true stochastic gradient algorithms in that they perform updates based on sample estimates of the gradient of the MSBE with respect to the weight vector, and are guaranteed to converge under off-policy sampling. The RG algorithm requires two independent samples of the next-state feature vector ($x_{t+1}$ and $x'_{t+1}$), as can be seen in the following update equation:

$$\Delta w = \alpha\big(r_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t\big)\big(\gamma x'_{t+1} - x_t\big).$$

Using a single sample of the next state can induce bias. To avoid this bias we could generate two independent samples of the next state with a simulator, or restrict application to deterministic domains where the two samples would be equal.

There has been some confusion in the community about which objective function, the MSPBE or the MSBE, is best for reinforcement learning. A reinforcement learning algorithm may find a bad solution, in the sense that the value estimates, $\hat{v}(s, w)$, are significantly different from the true un-approximated value function in a mean squared error sense (e.g., $\sum_{s\in\mathcal{S}} d_\mu(s)(\hat{v}(s, w) - v(s; \pi, \gamma, z))^2$). This is not the same as divergence. There have been several studies comparing MSBE minimization to MSPBE minimization. Scherrer (2010), for instance, found that RG, on average over random MDPs, produced worse solutions than TD(0) in comparison to the true value function. Scherrer also reported that the worst performance of TD(0) produced value function estimates that were further from the true value function than the worst-case estimates of RG. Scherrer called this "numerical instability" of TD(0), but did not report divergence. Scherrer's results, however, were obtained using the two-sample version of RG, and thus he did not compare against the biased version of RG. Scherrer also conjectured that using MSPBE minimization algorithms with λ greater than zero would eliminate the numerical instability issues. A later study confirmed this hypothesis (Geist & Scherrer, 2013), showing that LSTD(λ) did indeed outperform RG(λ) (a variant of RG with eligibility traces) across a large set of random MDPs. Geist and Scherrer claimed the existence of a case of divergence for LSTD(λ), but this was in fact an example of a bad solution, not divergence, and could be avoided by changing the value of λ. A recent study of value function approximation in reinforcement learning (Dann et al., 2013) reported no evidence of "numerical instability" of TD, and found that MSPBE minimization methods always outperformed MSBE methods across 16 small simulation problems. We conclude, based on the evidence from the literature, that MSPBE-based methods should be robust and efficient for GVF learning.

Another class of learning algorithms of interest, which do minimize the MSPBE, attempts to achieve the sample efficiency of LSTD with computational requirements closer to those of linear TD methods. The incremental least-squares temporal difference learning algorithm, or iLSTD (Geramifard et al., 2006), does indeed achieve $O(n)$ computation per step, though it is not as computationally frugal as TD(λ), and it requires $O(n^2)$ storage. In practice iLSTD, like LSTD, does not easily extend to non-linear function approximation, and so far related extensions to the control case have not been found to be well suited to online use (Bowling, Geramifard, & Wingate, 2008).

C Intrinsically motivated reinforcement learning

Exploration bonuses (Sutton, 1990): The state-action value is augmented, $q(s,a) = q(s,a) + b\sqrt{n_c(s,a)}$, where $n_c(s,a)$ is a count of visits to the state-action pair. Goal: maximize external reward and explore recently unexplored states.

Maximizing error (Schmidhuber, 1991): The reward to the controller is $r_t = M(o_t, \theta_{t-1}) - M(o_t, \theta_t)$, where $M$ is a forward model with parameters $\theta$. Goal: take actions that induce large modeling errors.

Option progress (Singh et al., 2005): The behavior reward is $r_t = r^e_t + r^i_t$, with internal reward $r^i_t = \tau[1 - M^\pi_o(s_{t+1}|s_t)]$ on a transition to a salient state $s_{t+1}$, where $M^\pi_o(s_{t+1}|s_t)$ predicts the probability of reaching $s_{t+1}$ from $s_t$ under $\pi$, and the option-policy $\pi$ maximizes the external reward $r^e_t$. Goal: select actions to improve option training by reducing salience, and maximize external reward on a complex hierarchical task.

Value-function $\Delta$ (Simsek & Barto, 2006): The behavior reward is $r^i_t = \tau + \sum_{s\in\mathcal{S}}(v^{\max}_t(s) - v^{\max}_{t-1}(s))$, where $v_t(s)$ is updated using the external reward $r^e_t$, and $v^{\max}_t(s) = \max_{j\le t} v_j(s)$ is the maximum value observed over all previous visits to $s$. Goal: focus exploration in regions of the state space where learning improves the value function.

$\Delta$ in prediction error (Oudeyer et al., 2007): The behavior takes the action of the expert with the maximum estimated progress in the current context. Expert $i$'s progress is $r^i_t = -(\Delta_0 - \Delta_k)$, where $\Delta_k = \frac{1}{J}\sum_{i=0}^{J}\delta_{t-i-k}$ and $\delta_t = \|o_{t+1} - M^i(o_t, a_t)\|^2$ is the one-step prediction error of expert $i$. Goal: select the expert's action with the highest expected change in prediction error.

Table 1: A summary of the different internal and external reward functions used in several intrinsically motivated reinforcement learning systems. We use $r^i \in \mathbb{R}$ to denote the internal reward, $r^e \in \mathbb{R}$ to denote the external reward, and $r \in \mathbb{R}$ to denote the total reward. Input data, either sensorimotor data or observations, are denoted by $o \in \mathbb{R}$. Models, either one-step forward models or multi-step option models, are denoted by $M$. The $\delta_t$ corresponds generically to an instantaneous error, such as a TD error or a squared one-step prediction error. Finally, $v(s)$ and $q(s,a)$ refer to conventional state and state-action value functions. The table continues in Table 2.

Subtask reward (Schembri et al., 2007): The behavior reward is $r^e_{t+1} = \delta^i_t$ of the selected expert $i$. Expert $i$ learns a policy with internal reward $r^i_t = M^i(o_{t+1})$, where $M^i$ is trained to predict the external subtask reward obtained from executing the policy of expert $i$. Goal: select actions from the expert with the largest critic error.

Evolved rewards (Singh et al., 2010): Find the best reward function, on average, over a distribution of agents and environments, with externally defined fitness. Goal: search for rewards, then compute the reward-maximizing policy.

Model progress (Lopes et al., 2012): The behavior reward is equal to the external task reward plus an estimate of model improvement. Model improvement is estimated by the log likelihood of the data evaluated on all observed data and on a subset of the data. Goal: the behavior gets a bonus for exploring state-action pairs where the expected progress is large.

Table 2: A summary (continued from Table 1) of the different internal and external reward functions used in several intrinsically motivated reinforcement learning systems. We use $r^i \in \mathbb{R}$ to denote the internal reward, $r^e \in \mathbb{R}$ to denote the external reward, and $r \in \mathbb{R}$ to denote the total reward. Input data, either sensorimotor data or observations, are denoted by $o \in \mathbb{R}$. Models, either one-step forward models or multi-step option models, are denoted by $M$. The $\delta_t$ corresponds generically to an instantaneous error, such as a TD error or a squared one-step prediction error. Finally, $v(s)$ and $q(s,a)$ refer to conventional state and state-action value functions.

Thoughts on organizing predictive knowledge

This section addresses the question of how we might organize and use the predictive knowledge stored in our Horde of demons. Let us set aside the question of how to generate new demons for the moment, and first consider how to organize different types of on-policy, off-policy, exponential-form, and state-based GVFs. We can take further inspiration from psychology and nexting, and imagine a demon organization scheme that could improve demon learning and behavior adaptation.

A fundamental characteristic of nexting knowledge, as argued by Gilbert (2007), is that it is a special form of knowledge of the future which is unconscious and automatic. Nexting knowledge is also often related to the efficiency of the agent (in his case, humans). For example, we next at the letter and word level while reading, which in turn allows faster reading. We make predictions about the next locations of a ball as it flies through the air, which aids tracking and ultimately improves our chances of catching the ball. As we bike down the road we predict future bumps, resulting in better steering and a smoother ride. These short, instant-by-instant predictions seem to happen automatically, without deliberation. Nexting predictions might play a role in updating other predictions and in basic skill acquisition.

[Figure 1 here: nexting GVFs and procedural/declarative GVFs produce predictions that are fed, through function approximation of the agent state, back into the shared feature vector and the behavior policy.]

Figure 1: A diagram of how many demons' predictions could participate in the feature vector generation process, and then indirectly in the behavior. At the highest level (dotted lines and gray box) we see the familiar reinforcement learning agent-environment interaction diagram.

One way to leverage nexting knowledge is to allow nexting demon predictions to participate in the feature vector generation process. The feature vector can be constructed from many different demons' predictions, as we demonstrated in the iRobot Create experiments of Chapter 6. All the demons have access to and share the feature vector, and thus any demon can take advantage of the potential increased representational power afforded by this form of predictive state feature vector. Figure 1 visualizes how the outputs of many demons could interact.

On-policy nexting predictions might be good candidates for generating predictive feature components because they are easy to specify and faster to learn than other forms of prediction. The question of where the demons come from is more straightforward to answer in the case of on-policy nexting: essentially, we generate as many nexting demons as can be updated given our computational budget, forming unique combinations of sensor and feature-component cumulants at as many time scales as possible. Nexting predictions become accurate quickly compared to policy-contingent predictions, because all interaction data is relevant and useful for learning, unlike in off-policy learning.

The usefulness of a nexting-based feature component is related, in part, to the relevance of the behavior to each demon's question. If the behavior data is not relevant to some demon's target policy, then we would expect that demon's learning to be slowed, and the nexting-based feature components might not be useful either. The opposite should also be true: if the behavior policy generates training data relevant to some demon, then we might expect that the nexting predictions would also be relevant to that demon. Many demons' predictions could participate in the feature generation process, over time yielding increasingly complex interactions between the behavior policy, feature vector generation, and prediction learning.
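As a small illustration of this idea, the sketch below appends the current predictions of a set of nexting demons to a base feature vector before it is handed to every demon's learner. The linear demons and function names here are illustrative assumptions, not the representation used in the Chapter 6 experiments.

    import numpy as np

    class LinearDemon:
        """A minimal linear GVF learner whose prediction is w^T x (illustrative only)."""
        def __init__(self, num_features):
            self.w = np.zeros(num_features)

        def predict(self, x):
            return float(np.dot(self.w, x))

    def predictive_features(base_features, nexting_demons):
        """Append each nexting demon's current prediction to the base feature vector.

        Every demon then learns from this shared, extended vector, so any demon can
        use another demon's prediction as part of its representation.
        """
        predictions = np.array([d.predict(base_features) for d in nexting_demons])
        return np.concatenate([base_features, predictions])

    # Example: 4 base features and 3 nexting demons give a 7-dimensional shared vector.
    x_base = np.array([0.0, 1.0, 0.0, 1.0])
    demons = [LinearDemon(num_features=4) for _ in range(3)]
    x_shared = predictive_features(x_base, demons)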

Given the demon organization scheme outlined above, let us get back to the question of how to generate new demons. Loosely speaking, an ambitious overall goal for Horde would be to understand a simple and limited physical world, in some concrete, measurable way: for example, an agent that could predict the sensorimotor consequences of all its actions, and attempt to control its sensorimotor stream. Predicting and controlling the world have been referred to as mastery over one's environment (White, 1959). White proposed that (1) knowing what is possible in a particular environment and (2) bringing the environment under one's control are two important abilities of an autonomous agent. The idea of mastery suggests an intriguing experiment and a straightforward way to generate tens of thousands of GVFs: predicting and controlling all of a robot's sensors. This could be implemented as a collection of control demons learning policies which maximize and minimize each of the robot's sensors, at several time scales. In addition, we would specify prediction demons using (1) each sensor, (2) several constant termination signals, and (3) each control demon's policy. In total, this setup would produce approximately 90,000 demons. This experiment could provide a concrete demonstration of a robot that is aware of its surroundings, and knows about its environment in a real sense, even if the robot was restricted to a limited space like the pen.
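To make the combinatorics concrete, the following sketch enumerates such a demon specification. The sensor, time-scale, and termination counts are hypothetical placeholders chosen only to illustrate how a collection of this size arises; they are not the counts used on the thesis robots.

    from itertools import product

    # Hypothetical counts, for illustration only.
    num_sensors = 60                          # sensors to predict and control
    time_scales = (0.0, 0.8, 0.95, 0.9875)    # constant gammas, i.e., several time scales
    terminations = (0.0, 0.8, 0.95)           # constant termination signals for prediction demons

    # Control demons: maximize and minimize each sensor at each time scale.
    control_demons = [(s, direction, g)
                      for s, direction, g in product(range(num_sensors), ("max", "min"), time_scales)]

    # Prediction demons: each sensor as cumulant, each constant termination,
    # and each control demon's policy as the target policy.
    prediction_demons = [(s, gamma, policy)
                         for s, gamma, policy in product(range(num_sensors), terminations, control_demons)]

    total = len(control_demons) + len(prediction_demons)
    print(len(control_demons), len(prediction_demons), total)
    # With these placeholder counts: 480 control demons and 86,400 prediction demons,
    # roughly the scale of the 90,000-demon experiment described above.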

D The parallel scalability of Horde

The Horde architecture is a parallel computing system, and thus we can analyze the speedup and parallel scalability of its implementation. The majority of the operations within each demon are easily parallelizable: we can update each target policy, reward function, GTD(λ) instance, and RUPEEscalar measure asynchronously. Only the update of the tile coder and communication with the robot are sequential. Amdahl's law (see Pacheco, 1997) specifies the theoretical speedup possible using $Pn \in \{1, 2, \ldots\}$ processors or cores, given the fraction of operations requiring sequential processing, $F \in [0, 1]$:

$$S(Pn) = \frac{1}{F + \frac{1-F}{Pn}},$$

where $S(Pn)$ is the speedup possible with $Pn$ processors. In addition, we can calculate the total possible speedup: letting $Pn$ go to infinity, $S(Pn)$ becomes $\frac{1}{F}$. Our parallel architecture, while updating 3000 demons, achieved $F = 1\%$ as measured by the YourKit Java profiler, meaning that 99% of the system's operations are parallelizable. We should therefore, under ideal conditions, expect nearly 100 times speedup given enough parallel cores.

# Parallel Cores    Runtime (ms)    Speedup (SU)    Theoretical SU
1                   200.74          1.0             1.0
2                   115.60          1.74            1.98
3                    90.26          2.22            2.94
4                    86.15          2.33            3.88
8                    76.84          2.61            7.48

Table 3: Number of parallel cores versus the time to perform a single update of 3000 off-policy demons, the corresponding speedup, and the theoretical speedup.
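The theoretical speedup column of Table 3 follows directly from Amdahl's law with $F = 0.01$; a minimal sketch reproducing those numbers:

    def amdahl_speedup(cores, serial_fraction):
        """Theoretical speedup from Amdahl's law: S = 1 / (F + (1 - F) / Pn)."""
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

    F = 0.01  # fraction of sequential operations measured while updating 3000 demons
    for cores in (1, 2, 3, 4, 8):
        print(cores, round(amdahl_speedup(cores, F), 2))
    # Prints approximately 1.0, 1.98, 2.94, 3.88, 7.48, matching the last column of Table 3;
    # the limit as the number of cores grows is 1 / F = 100.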

These calculations provide an upper bound on possible speedup, and are therefore somewhat idealistic because they do not account for cache coherence, communication, or other implementation details. As a test we ran 3000 demons with 300 distinct target policies, with performance measured via RUPEEscalar. We used this measure because it stores only a scalar value, rather than the vector storage used by RUPEEvec, and thus RUPEEscalar is more computationally frugal. We varied both the number of parallel cores and the number of demons. Table 3 reports the results.

We do not achieve the theoretical speedup. Our implementation achieves parallelization using multithreading in Java, which is not a classic parallel computing language such as C or C++ with OpenMP or a message passing interface. The test machine had only four physical cores, using hyperthreading to simulate eight virtual cores, and thus the results with eight cores might not be representative of eight physical cores.

# GVFs    Runtime (ms)    Linear trend
3000      77.96           77.96
1500      39.42           39.88
750       24.84           19.49
375       11.90            9.75

Table 4: Number of demons versus runtime on an eight-parallel-core machine.

Table 4 reports how the update time changed as a function of the number of demons. As the number of demons decreased, the computation time decreased almost linearly, as we would expect. Once the number of demons became smaller than 750, the measured F rose to 18%, potentially explaining the deviation from the linear trend.

The take-home message is that the majority of operations involved in demon learning are parallelizable. We predict that the number of demons updatable in realtime should continue to scale, approaching tens or hundreds of thousands in the next couple of years, as the processing power of computing resources increases. Further scaling could be achieved using vectorized GPU implementations and hybrid shared/distributed compute clusters requiring message passing between nodes. These implementations are left to future research.
