Compact Parametric Models for Efficient Sequential Decision
Total Page:16
File Type:pdf, Size:1020Kb
Compact Parametric Models for Efficient Sequential Decision Making in High-dimensional, Uncertain Domains by Emma Patricia Brunskill Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2009 c Massachusetts Institute of Technology 2009. All rights reserved. Author................................................ ............. Department of Electrical Engineering and Computer Science May 21, 2009 Certifiedby.......................................... ............... Nicholas Roy Assistant Professor Thesis Supervisor Acceptedby........................................... .............. Terry P.Orlando Chairman, Department Committee on Graduate Theses Compact Parametric Models for Efficient Sequential Decision Making in High-dimensional, Uncertain Domains by Emma Patricia Brunskill Submitted to the Department of Electrical Engineering and Computer Science on May 21, 2009, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract Within artificial intelligence and robotics there is considerable interest in how a single agent can autonomously make sequential decisions in large, high-dimensional, uncertain domains. This thesis presents decision-making algorithms for maximizing the expcted sum of future rewards in two types of large, high-dimensional, uncertain situations: when the agent knows its current state but does not have a model of the world dynamics within a Markov decision process (MDP) framework, and in partially observable Markov decision processes (POMDPs), when the agent knows the dynamics and reward models, but only receives information about its state through its potentially noisy sensors. One of the key challenges in the sequential decision making field is the tradeoff between optimality and tractability. To handle high-dimensional (many variables), large (many po- tential values per variable) domains, an algorithm must have a computational complexity that scales gracefully with the number of dimensions. However, many prior approaches achieve such scalability through the use of heuristic methods with limited or no guarantees on how close to optimal, and under what circumstances, are the decisions made by the al- gorithm. Algorithms that do provide rigorous optimality bounds often do so at the expense of tractability. This thesis proposes that the use of parametric models of the world dynamics, rewards and observations can enable efficient, provably close to optimal, decision making in large, high-dimensional uncertain environments. In support of this, we present a reinforcement learning (RL) algorithm where the use of a parametric model allows the algorithm to make close to optimal decisions on all but a number of samples that scales polynomially with the dimension, a significant improvement over most prior RL provably approximately optimal algorithms. We also show that parametric models can be used to reduce the computational complexity from an exponential to polynomial dependence on the state dimension in for- ward search partially observable MDP planning. Under mild conditions our new forward- search POMDP planner maintains prior optimality guarantees on the resulting decisions. We present experimental results on a robot navigation over varying terrain RL task and a large global driving POMDP planning simulation. Thesis Supervisor: Nicholas Roy Title: Assistant Professor 3 Acknowledgments As I pass out of the hallowed hallows of the infamous Stata center, I remain deeply grateful to those that have supported my graduate experience. I wish to thank my advisor Nicholas Roy, for all of his insight, support and particularly for always urging me to think about the potential impact of my work and ideas, and encouraging me to be broad in that impact. Leslie Kaelbling and Tomas Lozano-Perez, two of my committee members, possess the remarkable ability to switch their attention instantly to a new topic and shed their immense insight on whatever technical challenge I posed to them. My fourth committee member, Michael Littman, helped push me to start thinking about the theoretical directions of my work, and I greatly enjoyed a very productive collaboration with him, and his graduate students Lihong Li and Bethany Leffler. I have had the fortune to interact extensively with two research groups during my time at MIT: the Robust Robotics Group and the Learning and Intelligent Systems group. Both groups have been fantastic environments. In addition, Luke Zettlemoyer, Meg Aycinena Lippow and Sarah Finney were the best officemates I could have ever dreamed of, always willing to read paper drafts, listen attentively to repeated practice presentations, and share in the often essential afternoon treat session. I wish to particularly thank Meg and Sarah, my two research buddies and good friends, for proving that an article written on providing advice on how to be a good graduate student was surprizingly right on target, at least on making my experience in graduate school significantly better. Far from the work in this thesis I have been involved in a number of efforts related to international development during my stay at MIT. It has been my absolute pleasure to collaborate extensively with Bill Thies, Somani Patnaik, and Manish Bhardwaj on some of this work. The MIT Public Service Center has generously provided financial assistance and advice on multiple occasions for which I am deeply grateful. I also appreciate my advisor Nick’s interest and support of my work in this area. I will always appreciate my undergraduate operating systems professor Brian Bershad for encouraging me to turn my gaze from physics and seriously consider computer science. I credit he, Gretta Bartels, the graduate student that supervised my first computer science research project, and Hank Levy for showing me how exciting computer science could be, and convincing me that I could have a place in this field. My friends have been an amazing presence in my life throughout this process. Molly Eitzen and Nikki Yarid have been there since my 18th birthday party in the first house we all lived in together. The glory of Oxford was only so because of Chris Bradley, In- grid Barnsley, Naana Jumah, James Analytis, Lipika Goyal, Kirsti Samuels, Ying Wu, and Chelsea Elander Flanagan Bodnar and it has been wonderful to have several of them over- lap in Boston during my time here. My fantastic former housemate Kate Martin has the unique distinction of making even situps seem fun. Chris Rycroft was a fantastic running partner and always willing to trade math help for scones. I’ve also had the fortunate to cross paths on both coasts and both sides of the Atlantic pond with Marc Hesse since the wine and cheese night when we both first started MIT. There are also many other wonderful individuals I’ve been delighted to meet and spend time with during my time as a graduate student. 5 Most of the emotional burden of me completing a PhD has fallen on only a few unlucky souls and I count myself incredibly lucky to have had the willing ear and support of two of my closest friends, Sarah Stewart Johnson and Rebecca Hendrickson. Both have always inspired me with the joy, grace and meaning with which they conduct their lives. My beloved sister Amelia has been supportive throughout this process and her dry librarian wit has brought levity to many a situation. My parents have always been almost unreasonably supportive of me throughout my life. I strongly credit their consistent stance that it is better to try and fail than fail to try, with their simultaneous belief that my sister and I are capable of the world, to all that I have accomplished. Most of all, to my fiance Raffi Krikorian. For always giving me stars, in so many ways. 6 To my parents and sister, for always providing me with a home, no matter our location, and for their unwavering love and support. Contents 1 Introduction 14 1.1 Framework .................................. 15 1.2 Optimality and Tractability . 17 1.3 ThesisStatement ............................... 19 1.4 Contributions ................................. 19 1.5 Organization.................................. 20 2 Background 22 2.1 MarkovDecisionProcesses. 23 2.2 ReinforcementLearning ... .... .... .... .... .... ... 26 2.2.1 Infinite or No Guarantee Reinforcement Learning . 26 2.2.2 Optimal Reinforcement Learning . 29 2.2.3 Probably Approximately Correct RL . 31 2.3 POMDPs ................................... 32 2.3.1 Planning in Discrete-State POMDPs . 34 2.4 Approximate POMDP Solvers . 38 2.4.1 ModelStructure ........................... 39 2.4.2 BeliefStructure............................ 41 2.4.3 Value function structure . 44 2.4.4 Policystructure............................ 47 2.4.5 Online planning . 48 2.5 Summary ................................... 49 3 Switching Mode Planning for Continuous Domains 50 3.1 Switching State-space Dynamics Models . ..... 52 3.2 SM-POMDP ................................. 54 3.2.1 Computational complexity of Exact value function backups . 58 3.2.2 Approximating the α-functions ................... 59 3.2.3 HighestPeaksProjection. 59 3.2.4 L2minimization ........................... 60 3.2.5 Point-based minimization . 62 3.2.6 Planning ............................... 63 3.3 Analysis.................................... 63 3.3.1 Exact value function backups . 67 3.3.2 Highest peak projection analysis . 67 3.3.3 L2 minimization analysis . 68 3.3.4 Point-based minimization