An Object-Oriented Representation for Efficient Reinforcement Learning
AN OBJECT-ORIENTED REPRESENTATION FOR EFFICIENT REINFORCEMENT LEARNING

by

CARLOS GREGORIO DIUK WASSER

A dissertation submitted to the Graduate School—New Brunswick, Rutgers, The State University of New Jersey, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Graduate Program in Computer Science.

Written under the direction of Michael L. Littman and approved by

New Brunswick, New Jersey
October, 2010

© 2010 Carlos Gregorio Diuk Wasser
ALL RIGHTS RESERVED


ABSTRACT OF THE DISSERTATION

An Object-oriented Representation for Efficient Reinforcement Learning

by Carlos Gregorio Diuk Wasser

Dissertation Director: Michael L. Littman

Agents (humans, mice, computers) need to constantly make decisions to survive and thrive in their environment. In the reinforcement-learning problem, an agent needs to learn to maximize its long-term expected reward through direct interaction with the world. To achieve this goal, the agent needs to build some sort of internal representation of the relationship between its actions, the state of the world and the reward it expects to obtain. In this work, I show how the way in which the agent represents state and models the world plays a key role in its ability to learn effectively. I will introduce a new representation, based on objects and their interactions, and show how it enables several orders of magnitude faster learning on a large class of problems. I claim that this representation is a natural way of modeling state and that it bridges a gap between generality and tractability in a broad and interesting class of domains, namely those of a relational nature. I will present a set of learning algorithms that make use of this representation in both deterministic and stochastic environments, and present polynomial bounds that prove their efficiency in terms of learning complexity.


Acknowledgements

I came to Rutgers in the Fall semester of 2003, planning to work at the intersection of machine learning, artificial intelligence and cognitive science. At that point in time, I did not know who I was going to work with. In my first year I took three different classes that felt relevant to my interests, and they all happened to be taught by Michael Littman. I realized then that he would be the perfect advisor, and I have not been disappointed. Working with Michael has been inspiring, he became an academic role model, and it never stopped being a lot of fun.

But working with Michael would not have been enough if it had not been for my labmates and colleagues at the Rutgers Laboratory for Real-Life Reinforcement Learning, or RL3. A special mention should first go to the people I collaborated with the most: Thomas Walsh, Alexander Strehl and Lihong Li. But the lab would not have been the lab without the useful discussions, lunches and exchanges with Ali Nouri, Bethany Leffler, Istvan Szita, Ari Weinstein, John Asmuth, Chris Mansley, Sergiu Goschin and Monica Babes. The long lab afternoons were also made possible by my frequent corridor crossings into the office of my great friends Miguel Mosteiro and Rohan Fernandes. I refuse to call the hours spent in their office “procrastination”, as they were incredibly educational, from formal probability theory and theorem proving to the history and politics of India. Part of my intellectual development and life in the CS Department was also immensely facilitated by my friendship with Matias Cuenca Acuna, Christian Rodriguez, Marcelo Mydlarz, Smriti Bhagat, Nikita Lytkin and Andre Madeira.
One often misses the subtle nuances of a subject until one has to teach it. My understanding of many subjects has improved with my experiences as a teaching assistant. I owe a debt to the kind supervision of Martin Farach-Colton, S. Muthukrishnan and Leonid Khachiyan.

The years at Rutgers were also one of the most gratifying experiences I will always cherish, thanks to many people. To the Rutgers crowd: Miguel Mosteiro, Silvia Romero, Pablo Mosteiro, Martin Mosteiro, Roberto Gerardi, Ignacia “Nachi” Perugorria, Alberto Garcia-Raboso, Rocio Naveiras, Jose Juknevich, Mercedes Morales, Matias Cuenca, Cecilia Martinez, Selma Cohen, Gustavo Crembil, Christian Rodriguez, Marcelo Mydlarz, Cecilia Della Valle, Macarena Urzua, Felipe and Claudia Menanteau, Felipe Troncoso, Claudia Cabello, Juan Jose Adriasola, Viviana Cobos, Molly Palmer, Silvia Font, Cristobal Cardemil, Soledad Chacon, Dudo and Guiomar Vergara.

Another paragraph is deserved by my friends at the Graduate Student Association and our union, the Rutgers AAUP-AFT: Scott Bruton, Ryan Fowler, Naomi Fleming, Brian Graf, Elric Kline and Charlotte Whalen.

My great friends in Argentina, who always visited, skyped and were there: Nico Barea, Juli Herrera, Hernan Schikler, Sebastian Ludmer, Fer Sciarrotta, Maria Elinger, Martin Mazzini, Sebas Pechersky, Ana Abramowski, Lu Litichever, Laura Gomez, Sergi Daicz, Sergi “Techas” Zlotnik, Dan Zlotnik, Charly Lopez Holtmann, Diego Fernandez-Slezak, Vic Robledo, Gaston Giribet, Julian Faivovich, and the list goes on and on! To Fiorella Kotsias and Alejo Flah, Fran D’Alessio and Pau Lamamie.

A source of constant encouragement and support has been my family: my parents Juan Diuk and Lilian Wasser, my siblings Pablo, Beatriz and Maria Ana Diuk, and my fantastic nephews Nico, Ari, Santi, Diego and Tamara.

Last, but by no means least, Vale, to whom this dissertation is dedicated, and Manu, who arrived in the last few months of my graduate work and helped me finish. I will never forget his long naps on my lap while I was writing this dissertation.


Dedication

To Vale and Manu, for everything.


Table of Contents

Abstract
Acknowledgements
Dedication
List of Tables
List of Figures

1. Introduction
   1.1. Learning by Direct Interaction
   1.2. The Reinforcement Learning Problem Setting
   1.3. Challenges in Reinforcement Learning
        1.3.1. Exploration
               Bayesian RL
               Almost Solving the Problem, Most of the Time
               Optimism in the Face of Uncertainty
        1.3.2. The Curse of Dimensionality
               State aggregation
   1.4. A Change of Representation
        1.4.1. Objects and the Physical World
        1.4.2. Relational Reinforcement Learning
   1.5. Object-oriented Representations in Reinforcement Learning

2. Markov Decision Processes
   2.1. Definitions
        2.1.1. The Markov Property
        2.1.2. The Interaction Protocol
   2.2. Policies and Value Functions
   2.3. Solving an MDP
        2.3.1. Exact Planning
        2.3.2. Approximate Planning and Sparse Sampling
   2.4. Learning
   2.5. Measuring performance
        2.5.1. Computational complexity
        2.5.2. Space complexity
        2.5.3. Learning complexity
               PAC-MDP
   2.6. The KWIK Framework
        2.6.1. KWIK-RMax
        2.6.2. Example: Learning a Finite MDP
               Other Supervised Learning Frameworks: PAC and Mistake-bound
   2.7. Summary

3. Learning Algorithms and the Role of Representations
   3.1. The Taxi Problem
        3.1.1. Taxi experimental setup
   3.2. The Role of Models and Exploration
        3.2.1. Flat state representations
        3.2.2. Model-free learning: Q-learning and ε-greedy exploration
               Q-learning results
        3.2.3. Model-based learning and KWIK-Rmax exploration
        3.2.4. Experimental results
   3.3. The Role of State Representations
               Factored Rmax
        3.3.1. Experimental results
   3.4. The Role of Task Decompositions
               MaxQ
               A Model-based Version of MaxQ
   3.5. What Would People Do?
   3.6. Object-oriented Representation
   3.7. Summary and Discussion

4. Object-Oriented Markov Decision Processes
   4.1. Pitfall: playing with objects
   4.2. The Propositional OO-MDP Formalism
        4.2.1. Relational transition rules
        4.2.2. Formalism summary
   4.3. First-Order OO-MDPs: An Extension
   4.4. Examples
        4.4.1. Taxi
        4.4.2. Pitfall
        4.4.3. Goldmine
        4.4.4. Pac-Man
   4.5. Transition Cycle
   4.6. OO-MDPs and Factored-state Representations Using DBNs
   4.7. OO-MDPs and Relational RL
   4.8. OO-MDPs and Deictic Representations in Reinforcement Learning
   4.9. Summary

5. Learning Deterministic OO-MDPs
   5.1. The KWIK Enumeration Algorithm
   5.2. Learning a Condition
        5.2.1. Condition Learner Algorithm
               Analysis
        5.2.2. Example: Learning a Condition in Taxi
   5.3. Learning an Effect
        5.3.1. Example: Learning an Effect
   5.4. Learning a Condition-Effect Pair
   5.5. Learning Multiple Condition-Effect Pairs: the Tree model
        5.5.1.