Hierarchical Deep Reinforcement Learning For Robotics and Data Science

Sanjay Krishnan

Electrical Engineering and Computer Sciences University of California at Berkeley

Technical Report No. UCB/EECS-2018-101 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-101.html

August 7, 2018

Copyright © 2018, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Hierarchical Deep Reinforcement Learning For Robotics and Data Science

by Sanjay Krishnan

A dissertation submitted in partial satisfaction of the requirements for the degree of:

Doctor of Philosophy in Electrical Engineering and Computer Sciences in the Graduate Division of the University of California at Berkeley

Committee in charge:
Kenneth Goldberg
Benjamin Recht
Joshua S. Bloom

Summer 2018

Abstract

Hierarchical Deep Reinforcement Learning for Robotics and Data Science

by Sanjay Krishnan

Doctor of Philosophy in Electrical Engineering and Computer Sciences

Ken Goldberg, Chair

This dissertation explores learning important structural features of a Markov Decision Process from offline data to significantly improve the sample-efficiency, stability, and robustness of solutions even with high dimensional action spaces and long time horizons. It presents applications to surgical robot control, data cleaning, and generating efficient execution plans for relational queries. The dissertation contributes: (1) Sequential Windowed Reinforcement Learning: a framework that approximates a long-horizon MDP with a sequence of shorter term MDPs with smooth quadratic cost functions from a small number of expert demonstrations, (2) Deep Discovery of Options: an algorithm that discovers hierarchical structure in the action space from observed demonstrations, (3) AlphaClean: a system that decomposes a data cleaning task into a set of independent search problems and uses deep Q-learning to share structure across the problems, and (4) a Learning Query Optimizer: a system that observes executions of a dynamic program for SQL query optimization and learns a model to predict cost-to-go values to greatly speed up future search problems.

Contents

1 Introduction 4
  Background and Related Work 7
  The Ubiquity of Markov Decision Processes 8
  Reinforcement Learning (RL) 9
  Imitation Learning 9
  Reinforcement Learning with Demonstrations 10

2 Deep Discovery of Deep Options: Learning Hierarchies From Data 12
  2.1 Overview 13
  2.2 Generative Model 14
  2.3 Expectation-Gradient Inference Algorithm 14
  2.4 Experiments: Imitation 20
  2.5 Segmentation of Robotic-Assisted Surgery 38
  2.6 Experiments: Reinforcement 39

3 Sequential Windowed Inverse Reinforcement Learning: Learning Reward Decomposition From Data 45
  3.1 Transition State Clustering 46
    TSC Simulated Experimental Evaluation 50
    Surgical Data Experiments 62
  3.2 SWIRL: Learning With Transition States 73
    Reward Learning Algorithm 75
    Policy Learning 76
    Simulated Experiments 78
    Physical Experiments with the da Vinci Surgical Robot 87

4 Alpha Clean: Reinforcement Learning For Data Cleaning 92
  4.1 Problem Setup 94
  4.2 Architecture and API 98
  4.3 Search Algorithm 99
  4.4 Experiments 103


5 Reinforcement Learning For SQL Query Optimization 113
  5.1 Problem Setup 113
  5.2 Background 115
  5.3 Learning to Optimize 119
  5.4 Reduction factor learning 121
  5.5 Optimizer Architecture 121
  5.6 Experiments 122

6 Reinforcement Learning for Surgical Tensioning 127
  6.1 Motivation 128
  6.2 Physics of Multi-Dimensional Spring Systems 128
  6.3 Reinforcement Learning For Policy Search 131
  6.4 Approximate Solution 132
  6.5 Experiments 134

7 Conclusion 140
  7.1 Challenges and Open Problems 140

Chapter 1

Introduction

Bellman's "Principle of Optimality" and the characterization of dynamic programming is one of the most important results in computing [11]. Its importance stems from the ubiquity of Markov decision processes (MDPs), which formalize a wide range of problems from path planning to scheduling [49]. In the most abstract form, there is an agent who makes a sequence of decisions to effect change on a system that processes these decisions and updates its internal state, possibly non-deterministically. The process is "Markovian" in the sense that the system's current state completely determines its future progression. As an example of an MDP, one might have to plan a sequence of motor commands to a robot to control it to a target position. Or, one might have to schedule a sequence of cluster computing tasks while avoiding double scheduling on a node. The solution to an MDP is a decision making policy: the optimal decision to make given any current state of the system. Since the MDP framework is extremely general (it encompasses shortest-path graph search, supervised learning, and optimal control), the difficulty in solving a particular MDP relates to what assumptions can be made about the system. A setting of interest is when one only assumes black-box access to the system, where no parametric model of the system is readily available and the agent must iteratively query the system to optimize its decision policy. This setting, also called reinforcement learning [121], has been the subject of significant recent research interest. First, there are many dynamical problems for which a closed-form analytical description of a system's behavior is not available but one has a programmatic approximation of how the system transitions (i.e., a simulation). Reinforcement Learning (RL) allows for direct optimization of decision policies over such simulated systems. Next, by virtue of its minimal assumptions, RL algorithms are extremely general and widely applicable across many different problem settings. From a software engineering perspective, this unification allows the research community to develop a small number of optimized libraries for RL rather than each domain designing and maintaining problem-specific algorithms. Over the last few years, the combination of deep neural networks and reinforcement learning, or Deep RL, has emerged as an important area of research in robotics, AI, and machine learning [85, 110, 115, 119]. Deep RL correlates features of states to successful decisions using neural networks.


Since RL relies on black-box queries to optimize the policy, it can be very inefficient in problems where the decision space is high-dimensional and the decision horizon (the length of the sequence of decisions the agent needs to make) is very long. The algorithm has to simultaneously correlate features that are valuable for making a decision in a state while also searching through the space of decisions to figure out which sequences of decisions are valuable. A failure to learn means that the algorithm cannot intelligently search the space of decisions, and a failure to find promising sequences early means that the algorithm cannot learn. This creates a fundamentally unstable algorithmic setting, where the hope is that the algorithm discovers good decision sequences by chance and can bootstrap from these relatively rare examples. The higher the dimensionality of the problem, the less likely purely random search will be successful. Fortunately, in many domains of interest, a limited amount of expert knowledge is available, e.g., a human teleoperator can guide the motion of a robot to show an example solution of one problem instance, or a slow classical algorithm can be used to generate samples for limited problem instances. Collecting such "demonstrations" to exhaustively learn a solution for all possible progressions of an MDP (called Imitation Learning [91]) can be expensive, but we might have sufficient data to find important structural features of a search problem, including task decomposition, subspaces of actions that are irrelevant to the task, and the structure of the cost function. One such structure is hierarchy, where primitive actions are hierarchically composed into higher level behaviors. The main contribution of this dissertation is the thesis that learning action hierarchies from data can significantly improve the sample-efficiency, stability, and robustness of Deep RL in high-dimensional and long-horizon problems. With this additional structure, the RL process can be restricted to those sequences that are grounded in action sequences seen in the expert data. My work over the last 6 years explores this approach in several contexts: control of imprecise cable-driven surgical robots, automatically synthesizing data-cleaning programs to meet quality specifications, and generating efficient execution plans for relational queries. I describe algorithmic contributions, theoretical analysis, the implementations themselves, the architecture of the RL systems, and data analysis from physical systems and simulations.

Contributions: This dissertation contributes:

1. Deep Discovery of Options (DDO): A new Bayesian learning framework for learning parametrized control hierarchies from expert data (Chapter 2). DDO infers the parameters of an Abstract HMM model to decompose the action space into a hierarchy of discrete skills relevant to a task. I show that the hierarchical model represents policies more efficiently (requires less data to learn) than a flat model on real and simulated robot control tasks. I also show that this hierarchy can be used to guide exploration in search problems through compositions of the discrete skills seen in the data rather than arbitrary sequences of actions. I apply this model to significantly accelerate learning in self-play of Atari games. Results suggest that DDO can take 3x fewer demonstrations to achieve the same reward compared to a baseline imitation learning approach, and cut the sample-complexity of the RL phase by up to an order of magnitude.

Krishnan, Sanjay, Roy Fox, Ion Stoica, and Ken Goldberg. DDCO: Discovery of Deep Continuous Options for Robot Learning from Demonstrations. Proceedings of Machine Learning Research: Conference on Robot Learning. 2017.

Code for DDO is available at: https://bitbucket.org/sjyk/segment-centroid/

2. Sequential Windowed Inverse Reinforcement Learning (SWIRL): A learning framework for approximating an MDP with a sequence of shorter horizon MDPs with quadratic cost functions (Chapter 3). SWIRL can be thought of as a special parametric case of DDO where skills terminate at goal states. I show on two surgical robot control tasks, cutting along a line and tensioning fabric, that SWIRL significantly reduces the policy search time of a Q-Learning algorithm. In simulation, SWIRL achieves the maximum reward on the task with 85% fewer rollouts than Q-Learning, and 8x fewer demonstrations than behavioral cloning. In physical trials, it achieves a 36% relative improvement in reward compared to baselines.

Krishnan, Sanjay, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller, Florian T. Pokorny, and Ken Goldberg. SWIRL: A Sequential Windowed Inverse Reinforcement Learning Algorithm for Robot Tasks with Delayed Rewards. International Journal of Robotics Research. 2018.

3. Transition State Clustering (TSC): The underlying task decomposition algorithm in SWIRL is a Bayesian clustering framework called TSC (Chapter 3). This model correlates spatial and temporal features with changes in the motion of the demonstrator. The crucial advantage of this framework is that it can exploit "third person" demonstration data where only the states of the MDP are visible but not the decisions the expert took. The motivating application is learning from expert surgeons by analyzing data from a surgical robot. I show that TSC is more robust to spatial and temporal variation compared to other segmentation methods and can apply to both kinematic and visual demonstration data. In these settings, TSC runs 100x faster than the next most accurate alternative, auto-regressive models, which require expensive MCMC-based inference, and has fewer parameters to tune.

Krishnan, Sanjay, Animesh Garg, Sachin Patil, Colin Lea, Gregory Hager, Pieter Abbeel, and Ken Goldberg. Transition State Clustering: Unsupervised surgical trajectory segmentation for robot learning. International Journal of Robotics Research. 2018.

Code for TSC and SWIRL is available at: http://berkeleyautomation.github.io/tsc-dl/

4. Alpha Clean: A system for synthesizing data cleaning programs to enforce database integrity constraints (Chapter 4). The main algorithm in the system decomposes the integrity constraints on a table into a collection of independent search problems called blocks. The algorithm starts by exhaustively searching the initial blocks and then incrementally learns a Q-function to make the search on future blocks increasingly precise.

This process leverages the fact that data cleaning is not an arbitrary constraint satisfaction problem and that most database corruption is systematic, i.e., correlated with features of the data.

Krishnan, Sanjay, Eugene Wu, Michael Franklin, and Ken Goldberg. Alpha Clean: Data Cleaning with Distributed Tree Search and Learning. 2018.

Code for AlphaClean is available at: https://github.com/sjyk/alphaclean

5. A new approximate dynamic programming framework for SQL query optimization. Classical query optimizers leverage dynamic programs for optimally nesting join queries. This process creates a table that memoizes cost-to-go estimates of intermediate subplans. By representing the memoization table with a neural network, the optimizer can estimate the cost-to-go of even previously unseen plans, allowing for a vast expansion of the search space. I show that this process is a form of Deep Q-Learning where the state is a query graph and the actions are contractions on the query graph. The approach achieves plan costs within a factor of 2 of the optimal solution on all cost models and improves on the next best heuristic by up to 3x. Furthermore, it can execute up to 10,000x faster than exhaustive enumeration and more than 10x faster than left/right-deep enumeration on the largest queries in the benchmark.

Krishnan, Sanjay, Zongheng Yang, Ion Stoica, Joseph Hellerstein, and Ken Goldberg. Learning to Efficiently Enumerate Joins with Deep Reinforcement Learning. 2018.

Code for this project is available at: https://github.com/sjyk/rlqopt

6. An application of deep reinforcement learning to synthesizing surgical thin tissue tensioning policies. To improve the search time, the algorithm initializes its search with an analytical approximation of the equilibrium state of a FEM simulator. The result is a search algorithm that searches for a policy over several hundred timesteps in less than a minute of latency (Chapter 6).

Sanjay Krishnan and Ken Goldberg. Bootstrapping Deep Reinforcement Learning of Surgical Tensioning with An Analytic Model. C4 Surgical Workshop 2017.

Code for this project is available at: https://github.com/BerkeleyAutomation/clothsimulation

Background and Related Work

A discrete-time discounted Markov Decision Process (MDP) is described by a 6-tuple $\langle S, A, p_0, p, R, \gamma \rangle$, where $S$ denotes the state space, $A$ the action space, $p_0$ the initial state distribution, $p(s_{t+1} \mid s_t, a_t)$ the state transition distribution, $R(s_t, a_t) \in \mathbb{R}$ the reward function, and $\gamma \in [0, 1)$ the discount factor. The objective of an MDP is to find a decision policy, a probability distribution over actions $\pi : S \mapsto \Delta(A)$. A policy $\pi$ induces the distribution over trajectories $\xi = [(s_0, a_0), (s_1, a_1), \ldots, (s_N, a_N)]$:

$$P_\pi(\xi) = p_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).$$

The value of a policy is its expected total discounted reward over trajectories:

$$V_\pi = \mathbb{E}_{\xi \sim P_\pi}\left[ \sum_{t=0}^{T-1} \gamma^t R(s_t, a_t) \right].$$

The objective is to find a policy in a class of allowed policies $\pi^* \in \Pi$ that maximizes the return:

$$\pi^* = \arg\max_{\pi \in \Pi} V_\pi \tag{1}$$
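To make these quantities concrete, here is a minimal sketch (in Python with NumPy; the callables `sample_initial_state`, `policy`, `transition`, and `reward` are hypothetical stand-ins for $p_0$, $\pi$, $p$, and $R$, not code from the dissertation) that estimates $V_\pi$ by averaging discounted returns over sampled trajectories:

```python
import numpy as np

def estimate_value(policy, sample_initial_state, transition, reward,
                   gamma=0.99, horizon=100, n_rollouts=200):
    """Monte Carlo estimate of V_pi = E[ sum_t gamma^t R(s_t, a_t) ].

    Each rollout samples s_0 ~ p_0, then repeatedly samples a_t ~ pi(.|s_t)
    and s_{t+1} ~ p(.|s_t, a_t), accumulating the discounted reward.
    """
    returns = []
    for _ in range(n_rollouts):
        s = sample_initial_state()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            total += discount * reward(s, a)
            s = transition(s, a)
            discount *= gamma
        returns.append(total)
    return np.mean(returns)
```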

The Ubiquity of Markov Decision Processes

While it is true that many systems of interest are not Markov and do not offer direct observation of their internal states, the MDP actually covers a substantial number of classical problems in Computer Science. Consider the following reductions:

Shortest Path Graph Search: A graph search problem instance is defined as follows. Let $G = (V, E)$ be a graph with vertices $V$ and edges $E$, and let $q \in V$ denote a starting vertex and $t \in V$ denote a target vertex. Find a path connected by edges from $q$ to $t$. In MDP notation, we can consider a hypothetical agent whose state is a pointer to a vertex, its actions are moving to an adjacent vertex, the transition executes this move, and its reward function is an indicator of whether the state is $t$. More formally, $S = V$, $A = S \times S$, $R = \mathbf{1}(s_t = t)$, and $p_0 = q$. A discount factor of $\gamma = 1$ specifies that any path is optimal, and $\gamma < 1$ specifies that shorter paths are preferred.

Supervised Learning: In empirical risk minimization for supervised learning, one is given a set of tuples $(X, Y) = (x_0, y_0), \ldots, (x_N, y_N)$ and the objective is to find a function $f : X \mapsto Y$ that minimizes some point-wise measure of disagreement called a loss function, $\sum_{i=0}^{N} \ell(f(x_i), y_i)$. In MDP notation, this is a "stateless" problem. The agent's state is a randomly chosen example $S = X$, its action space is $Y$, and the reward function is the loss function.

Optimal Control: Optimal control problems also constitute Markov Decision Process problems:

$$\min_{a_1, \ldots, a_T} \sum_{i=1}^{T} \gamma^i \cdot J(s_i, a_i)$$

$$\text{subject to: } s_{i+1} = f(s_i, a_i), \quad a_i \in A, \quad s_i \in S, \quad s_1 = c$$

The problem is to select $T$ decisions where each decision $a_i$ resides in an action space $A$. The decision making problem is stateful: the world has an initial state $s_1$, and this state is affected by every decision the agent selects through the transition model $f(s_i, a_i)$, which transitions the state to another state in the set $S$. The objective is to optimize the cumulative cost of these decisions $J(s_i, a_i)$, potentially subject to an exponential discount $\gamma$ that controls a bias towards short term or long term costs.
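As an illustration of the graph-search reduction above, the following sketch (hypothetical class and method names) wraps a graph in the MDP interface: the state is the current vertex, actions move along edges, and the reward is an indicator of reaching the target:

```python
class GraphSearchMDP:
    """Shortest-path graph search cast as an MDP: S = V, actions move to an
    adjacent vertex, R = 1(s == target), and p_0 places the agent at the
    start vertex. A discount gamma < 1 makes shorter paths preferable."""

    def __init__(self, adjacency, start, target, gamma=0.95):
        self.adjacency = adjacency        # dict: vertex -> list of adjacent vertices
        self.start, self.target, self.gamma = start, target, gamma

    def initial_state(self):
        return self.start

    def actions(self, s):
        return self.adjacency[s]

    def step(self, s, a):
        next_state = a                    # deterministic move along the chosen edge
        reward = 1.0 if next_state == self.target else 0.0
        return next_state, reward
```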

Reinforcement Learning (RL)

The reinforcement learning setting further assumes "black box" (also called oracular) access to the state transition distribution $p$ and the reward function $R$. This means the optimization algorithm is only allowed queries of the following form:

$$q_t : (s_t, a_t) \rightarrow \texttt{system()} \rightarrow (s_{t+1}, r_t)$$

This dissertation terms any algorithm that satisfies this query model as RL. The number of such queries issued by an RL algorithm is called its sample-complexity. This relaxes the restriction of any analytic knowledge about the structure of $R$ or $p$, and only requires a system model that can be queried (e.g., written in code, implemented as a physical system). The focus of this dissertation is on cases where ab initio RL has a prohibitive sample-complexity due to high-dimensional action spaces or long time horizons. With no prior knowledge of which actions lead to rewards, any RL algorithm has to essentially start with random decisions, and even when the algorithm observes a positive signal it has no notion of directionality.
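This query model is essentially the step interface exposed by common RL toolkits. A minimal sketch of such a black-box wrapper (the class and attribute names are illustrative, not from the dissertation) that only answers queries $q_t$ and counts them:

```python
class BlackBoxSystem:
    """Oracular access to (p, R): the agent may only issue queries
    q_t : (s_t, a_t) -> (s_{t+1}, r_t). The query counter corresponds to the
    sample-complexity of whatever algorithm uses this wrapper."""

    def __init__(self, transition_fn, reward_fn):
        self._transition_fn = transition_fn   # hidden dynamics p
        self._reward_fn = reward_fn           # hidden reward R
        self.num_queries = 0

    def query(self, state, action):
        self.num_queries += 1
        reward = self._reward_fn(state, action)
        next_state = self._transition_fn(state, action)
        return next_state, reward
```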

Imitation Learning

A contrasting approach to RL is imitation learning [91], where one assumes access to a supervisor who samples from an unknown policy $\hat{\pi} \approx \pi^*$ that approximates the optimal policy; these samples are called demonstrations. Rather than querying the system to optimize the policy, the problem is to imitate the supervisor as best as possible. Consider a worker in a factory moving a robot with a joystick. Here the objective of the worker is unknown; one observes only a trajectory of states and actions. Similarly, in programming-by-examples, one only observes input and output data and not a complete specification of the program. In the most basic form, such a problem reduces to Maximum Likelihood Estimation. A policy $\pi_\theta(a_t|s_t)$ defines the distribution over controls given the state, parametrized by $\theta \in \Theta$. In Behavior Cloning (BC), one trains the parameter $\theta$ so that the policy fits the dataset of observed demonstrations and imitates the supervisor. For example, we can maximize the log-likelihood $L[\theta; \xi]$ that the stochastic process induced by the policy $\pi_\theta$ assigns to each demonstration trajectory $\xi$:

$$L[\theta; \xi] = \log p_0(s_0) + \sum_{t=0}^{T-1} \log\big(\pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)\big).$$

When $\log \pi_\theta$ is parametrized by a deep neural network, we can perform stochastic gradient descent by sampling a batch of transitions, e.g., one complete trajectory, and computing the gradient

$$\nabla_\theta L[\theta; \xi] = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t).$$

Note that this method can be applied model-free, without any knowledge of the system dynamics $p$. Another approach to the imitation setting is Inverse Reinforcement Learning [3, 88, 137]. This approach infers a reward function from the observed data (and possibly the system dynamics), thus reducing the problem to the original RL problem setting. [2] argue that the reward function is often a more concise representation of a task than a policy. As such, a concise reward function is more likely to be robust to small perturbations in the task description. The downside is that the reward function is not useful on its own, and ultimately a policy must be retrieved. In the most general case, an RL algorithm must be used to optimize for that reward function [2].
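A minimal behavior cloning sketch under simplifying assumptions (discrete actions, a linear-softmax policy, plain NumPy rather than a deep network); `demos` is an illustrative list of $(s_t, a_t)$ pairs and the update follows the gradient $\nabla_\theta L$ above:

```python
import numpy as np

def behavior_cloning(demos, state_dim, n_actions, lr=0.05, epochs=50):
    """Fit a linear-softmax policy pi_theta(a|s) by gradient ascent on
    sum_t log pi_theta(a_t|s_t); no dynamics model p is needed."""
    theta = np.zeros((state_dim, n_actions))
    for _ in range(epochs):
        for s, a in demos:                      # one SGD step per transition
            s = np.asarray(s, dtype=float)
            logits = s @ theta
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            grad = -np.outer(s, probs)          # d/d theta of log pi_theta(a|s)
            grad[:, a] += s
            theta += lr * grad
    return theta
```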

Reinforcement Learning with Demonstrations

However, imitation learning places a significant burden on a supervisor to exhaustively cover the scenarios the robot may encounter during execution [74]. To address the limitations on either extreme of imitation and reinforcement, this dissertation proposes a hybrid of the exploration and demonstration learning paradigms.

Problem 1 (Reinforcement Learning with Demonstrations) Given an MDP and a set of demonstration trajectories $D = \{\xi_1, \ldots, \xi_N\}$ from a supervisor, return a policy $\pi^*$ that maximizes the cumulative reward of the MDP with a reinforcement learning algorithm.

This is a problem setting that has been studied by a few recent works. In Deeply AggreVaTeD [118], the expert must provide a value function in addition to actions, which ultimately creates an algorithm similar to Reinforcement Learning. This basic setting is also similar to the problem setting considered in [96]. [117] consider a model where the random search policy of the algorithm is guided by expert demonstrations. Rather than manipulating the search strategy, [19] modify the "shape" of the reward function to match trajectories seen in demonstration. This is an idea that has gotten recent traction in the robot learning community [31, 48, 51].

This recent work raises a central question: what should be learned from demonstrations to effectively bootstrap reinforcement learning? One approach is using imitation learning to learn a policy which initializes the search in RL. While this can be a very effective strategy, it neglects other structure that is latent in the demonstrations, such as whether there are common sequences of actions that tend to occur, or whether parts of the task decompose into simpler units. In the high-dimensional setting, it is crucial to exploit such structure, and this dissertation shows that different aspects of this structure, such as task decomposition, action hierarchy, and action relevancy, can be cast as Bayesian latent variable problems. I explore learning this structure in detail and the impact of learning this structure on several RL problems.

Chapter 2

Deep Discovery of Deep Options: Learning Hierarchies From Data

In high-dimensional or combinatorial action spaces, the vast majority of action sequences are irrelevant to the task at hand. Therefore, purely random exploration is likely to be prohibitively wasteful. What makes many sequential problems particularly challenging for reinforcement learning is that all of these random action sequences are, in a sense, equally irrelevant. This means that there might not be any signal in the sampled data that the Q-learning algorithm can exploit. Until the agent serendipitously discovers such a sequence, no learning can occur.

In the first section of this dissertation, I explore this problem in the context of robotics, self-play in Atari games, and program imitation. The basic insight is that while the space of all action sequences $A^T$ is very large, there is often a much smaller subset of those that are potentially relevant to typical problem instances, $A_{relevant} \subset A^T$. Given a small amount of expert information, I describe techniques for learning $A_{relevant}$ from data.

How do we parametrize $A_{relevant}$? One approach is to describe the agent's actions as a composition of higher-level behaviors called options [122], each consisting of a control policy for one region of the state space, and a termination condition recognizing leaving that region. This augmentation naturally defines a hierarchical structure of high-level meta-control policies that invoke lower-level options to solve sub-tasks. This leads to a "divide and conquer" relationship between the levels, where each option can specialize in short-term planning over local state features, and the meta-control policy can specialize in long-term planning over slowly changing state features. This means that any search can be restricted to the action sequences that can be formed as a composition of skills.

Abstractions for decomposing an MDP into subtasks have been studied in the area of hierarchical reinforcement learning (HRL) [9, 95, 122]. Early work in hierarchical control demonstrated the advantages of hierarchical structures by handcrafting hierarchical policies [18] and by learning them given various manual specifications: state abstractions [28, 47, 60, 61], a set of waypoints [53], low-level skills [6, 50, 78], a set of finite-state meta-controllers [94], or a set of subgoals [29, 122].


The key abstraction in HRL is the "options framework" [122], which defines a hierarchy of increasingly complex meta-actions. An option represents a lower-level control primitive that can be invoked by the meta-control policy at a higher level of the hierarchy, in order to perform a certain subroutine (a useful sequence of actions). The meta-actions invoke specialized policies rather than just taking primitive actions. Formally, an option $h$ is described by a triplet

$$\langle I_h, \pi_h, \psi_h \rangle,$$

where $I_h \subset S$ denotes the initiation set, $\pi_h(a_t \mid s_t)$ the control policy, and $\psi_h(s_t) \in [0, 1]$ the termination policy. When the process reaches a state $s \in I_h$, the option $h$ can be invoked to run the policy $\pi_h$. After each action is taken and the next state $s'$ is reached, the option $h$ terminates with probability $\psi_h(s')$ and returns control up the hierarchy to its invoking level. The options framework enables multi-level hierarchies to be formed by allowing options to invoke other options. A higher-level meta-control policy is defined by augmenting its action space $A$ with the set $H$ of all lower-level options. The options framework has been applied in robotics [63, 65, 106] and in the analysis of biological systems [15, 16, 113, 131, 135]. Since then, the focus of research has shifted towards discovery of the hierarchical structure itself, by: trading off value with description length [125], identifying transitional states [73, 80, 82, 112, 116], inference from demonstrations [20, 27, 65, 66], iteratively expanding the set of solvable initial states [62, 63], policy gradient [77], trading off value with informational constraints [34, 36, 41, 52], active learning [44], or, recently, value-function approximation [5, 45, 107].

2.1 Overview

First, I describe a new learning framework for discovering a parametrized option structure from a small amount of expert demonstrations. I assume that these demonstrations are state-action demonstrations. Despite recent results in option discovery, some proposed techniques do not generalize well to multi-level hierarchies [5, 45, 72], while others are inefficient for learning expressive representations (e.g., options parametrized by neural networks) [20, 27, 44, 73]. I introduce the Discovery of Deep Options (DDO), an algorithm for efficiently discovering deep hierarchies of deep options. DDO is a policy-gradient algorithm that discovers parametrized options from a set of demonstration trajectories (sequences of states and actions) provided either by a supervisor or by roll-outs of previously learned policies. These demonstrations need not be given by an optimal agent, but it is assumed that they are informative of the preferred actions to take in each visited state, and are not just random walks. DDO is an inference algorithm that applies to the supervised demonstration setting. Given a set of trajectories, the algorithm discovers a fixed, predetermined number of options that are most likely to generate the observed trajectories. Since an option is represented by both a control policy and a termination condition, my algorithm simultaneously (1) infers option boundaries in demonstrations, which segment trajectories into different control regimes, (2) infers the meta-control policy for selecting options as a mapping of segments to the option that likely generated them, and (3) learns a control policy for each option, which can be interpreted as a soft clustering where the centroids correspond to prototypical behaviors of the agent.

2.2 Generative Model

First, we describe a generative model for expert demonstrations. This can be thought of as a generalization of standard imitation learning to hierarchical control. In this generative model, the meta-control signals that form the hierarchy are unobservable, latent variables that must be inferred. Consider a trajectory $\xi = (s_0, a_0, s_1, \ldots, s_T)$ that is generated by a two-level hierarchy. The low level implements a set $H$ of options $\langle \pi_h, \psi_h \rangle_{h \in H}$. The high level implements a meta-control policy $\eta(h_t|s_t)$ that repeatedly chooses an option $h_t \sim \eta(\cdot|s_t)$ given the current state, and runs it until termination. Our hierarchical generative model is:

Initialize $t \leftarrow 0$, $s_0 \sim p_0$, $b_0 \leftarrow 1$
for $t \leftarrow 0, \ldots, T-1$ do
    if $b_t = 1$ then
        Draw $h_t \sim \eta(\cdot|s_t)$
    else
        Set $h_t \leftarrow h_{t-1}$
    end if
    Draw $a_t \sim \pi_{h_t}(\cdot|s_t)$
    Draw $s_{t+1} \sim p(\cdot|s_t, a_t)$
    Draw $b_{t+1} \sim \mathrm{Ber}(\psi_{h_t}(s_{t+1}))$
end for
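This generative process can be written directly as a sampler. The sketch below is illustrative (the callables `p0`, `eta`, `dynamics`, and the `options` dictionary are hypothetical stand-ins for $p_0$, $\eta$, $p$, and $\langle \pi_h, \psi_h \rangle$) and follows the pseudocode above: a new option is drawn whenever the termination indicator fires.

```python
import numpy as np

def sample_hierarchical_trajectory(p0, eta, options, dynamics, horizon):
    """Sample xi = (s_0, a_0, ..., s_T) from the two-level generative model.

    eta(s) samples an option index; options[h] = (pi_h, psi_h) where pi_h(s)
    samples an action and psi_h(s) returns a termination probability;
    dynamics(s, a) samples the next state.
    """
    s = p0()
    b, h = 1, None
    states, actions, latents = [s], [], []
    for _ in range(horizon):
        if b == 1:
            h = eta(s)                        # meta-control draws a new option
        pi_h, psi_h = options[h]
        a = pi_h(s)                           # the active option picks the action
        s = dynamics(s, a)
        b = np.random.binomial(1, psi_h(s))   # b_{t+1} ~ Ber(psi_h(s_{t+1}))
        states.append(s); actions.append(a); latents.append((h, b))
    return states, actions, latents
```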

2.3 Expectation-Gradient Inference Algorithm

We denote by $\theta$ the vector of parameters for $\pi_h$, $\psi_h$ and $\eta$. For example, $\theta$ can be the weights and biases of a feed-forward network that computes these probabilities. This generic notation allows us the flexibility of a completely separate network for the meta-control policy and for each option, $\theta = (\theta_\eta, (\theta_h)_{h \in H})$, or the efficiency of sharing some of the parameters between options, similarly to a Universal Value Function Approximator [101]. We want to find the $\theta \in \Theta$ that maximizes the log-likelihood assigned to a given dataset of trajectories. The likelihood of a trajectory depends on the latent sequence $\zeta = (b_0, h_0, b_1, h_1, \ldots, h_{T-1})$ of meta-actions and termination indicators, and in order to use a gradient-based optimization method we rewrite the gradient using the following EG-trick:

$$\nabla_\theta L[\theta; \xi] = \nabla_\theta \log P_\theta(\xi) = \frac{1}{P_\theta(\xi)} \nabla_\theta \sum_{\zeta \in (\{0,1\} \times H)^T} P_\theta(\zeta, \xi) = \sum_\zeta \frac{P_\theta(\zeta, \xi)}{P_\theta(\xi)} \nabla_\theta \log P_\theta(\zeta, \xi) = \mathbb{E}_{\zeta|\xi;\theta^-}\left[\nabla_\theta \log P_\theta(\zeta, \xi)\right],$$

which is the so-called Expectation-Gradient method [81, 100]. Here $\theta^-$ denotes the current parameter, taken as fixed outside the gradient. The generative model in the previous section implies the likelihood

$$P_\theta(\zeta, \xi) = p_0(s_0)\, \delta_{b_0=1}\, \eta(h_0|s_0) \prod_{t=1}^{T-1} P_\theta(b_t, h_t \mid h_{t-1}, s_t) \prod_{t=0}^{T-1} \pi_{h_t}(a_t|s_t)\, p(s_{t+1}|s_t, a_t),$$

with

$$P_\theta(b_t{=}1, h_t \mid h_{t-1}, s_t) = \psi_{h_{t-1}}(s_t)\, \eta(h_t|s_t)$$
$$P_\theta(b_t{=}0, h_t \mid h_{t-1}, s_t) = (1 - \psi_{h_{t-1}}(s_t))\, \delta_{h_t = h_{t-1}}.$$

Here $\delta_{h_t = h_{t-1}}$ denotes the indicator that $h_t = h_{t-1}$. Applying the EG-trick and ignoring the terms that do not depend on $\theta$, we can simplify the gradient to:

$$\nabla_\theta L[\theta; \xi] = \mathbb{E}_{\zeta|\xi;\theta^-}\left[ \nabla_\theta \log \eta(h_0|s_0) + \sum_{t=1}^{T-1} \nabla_\theta \log P_\theta(b_t, h_t|h_{t-1}, s_t) + \sum_{t=0}^{T-1} \nabla_\theta \log \pi_{h_t}(a_t|s_t) \right].$$

The log-likelihood gradient can therefore be computed as the sum of the log-probability gradients of the various parametrized networks, weighted by the marginal posteriors

$$u_t(h) = P_\theta(h_t{=}h \mid \xi)$$
$$v_t(h) = P_\theta(b_t{=}1, h_t{=}h \mid \xi)$$
$$w_t(h) = P_\theta(h_t{=}h, b_{t+1}{=}0 \mid \xi).$$

In the Expectation-Gradient algorithm, the E-step computes u, v and w, and the G-step updates the parameter with a gradient step, namely

$$\nabla_\theta L[\theta; \xi] = \sum_{h \in H} \Bigg( \sum_{t=0}^{T-1} \Big( v_t(h) \nabla_\theta \log \eta(h|s_t) + u_t(h) \nabla_\theta \log \pi_h(a_t|s_t) \Big) + \sum_{t=0}^{T-2} \Big( (u_t(h) - w_t(h)) \nabla_\theta \log \psi_h(s_{t+1}) + w_t(h) \nabla_\theta \log(1 - \psi_h(s_{t+1})) \Big) \Bigg).$$

These equations lead to the natural iterative algorithm. In each iteration, the marginal posteriors $u$, $v$ and $w$ can be computed with a forward-backward message-passing algorithm similar to Baum-Welch, with time complexity $O(|H|^2 T)$. Importantly, this algorithm can be performed without any knowledge of the state dynamics. Then, the computed posteriors can be used in a gradient descent algorithm to update the parameters:

$$\theta \leftarrow \theta + \alpha \sum_i \nabla_\theta L[\theta; \xi_i].$$

This update can be made stochastic using a single trajectory, uniformly chosen from the demonstration dataset, to perform each update. Intuitively, the algorithm attempts to jointly optimize three objectives:

• Infer the option boundaries at which $b = 1$ appears likely relative to $b = 0$, as given by $(u - w)$ and $w$ respectively; this segments the trajectory into regimes where we expect $h$ to persist and employ the same control law. In the G-step we reduce the cross-entropy loss between the unnormalized distribution $(w, u - w)$ and the termination indicator $\psi_h$;

• Infer the option selection after a switch, given by $v$; in the G-step we reduce the cross-entropy loss between that distribution, weighted by the probability of a switch, and the meta-control policy $\eta$; and

• Reduce the cross-entropy loss between the empirical action distribution, weighted by the probability for $h$, and the control policy $\pi_h$.

This can be interpreted as a form of soft clustering. The data points are one-hot representations of each $a_t$ in the space of distributions over actions. Each time-step $t$ is assigned to option $h$ with probability $u_t(h)$, forming a soft clustering of data points. The G-step directly minimizes the KL-divergence of the control policy $\pi_h$ from the weighted centroid of the corresponding cluster.

Let $\delta_{a_t}(a|s_t) = \delta_{a = a_t}$ be the degenerate "empirical" action distribution of step $t$; the weighted centroid of the cluster for option $h$ is the $u_t(h)$-weighted average of these distributions, and the G-step minimizes the KL divergence of $\pi_h$ from this centroid.

Forward-Backward Algorithm

Despite the exponential domain size of the latent variable $\zeta$, Expectation-Gradient for trajectories allows us to decompose the posterior $P_\theta(\zeta|\xi)$ and only concern ourselves with each marginal posterior separately. These marginal posteriors can be computed by a forward-backward dynamic programming algorithm, similar to Baum-Welch. Omitting the current parameter $\theta$ and trajectory $\xi$ from our notation, we start by computing the likelihood of a trajectory prefix

$$\phi_t(h) = P(s_0, a_0, \ldots, s_t, h_t = h),$$

using the forward recursion

$$\phi_0(h) = p_0(s_0)\, \eta(h|s_0)$$
$$\phi_{t+1}(h') = \sum_{h \in H} \phi_t(h)\, \pi_h(a_t|s_t)\, p(s_{t+1}|s_t, a_t)\, P(h'|h, s_{t+1}),$$

with

$$P(h'|h, s_{t+1}) = \psi_h(s_{t+1})\, \eta(h'|s_{t+1}) + (1 - \psi_h(s_{t+1}))\, \delta_{h,h'}.$$

We similarly compute the likelihood of a trajectory suffix

$$\omega_t(h) = P(a_t, s_{t+1}, \ldots, s_T \mid s_t, h_t = h),$$

using the backward recursion

$$\omega_T(h) = 1$$
$$\omega_t(h) = \pi_h(a_t|s_t)\, p(s_{t+1}|s_t, a_t) \sum_{h' \in H} P(h'|h, s_{t+1})\, \omega_{t+1}(h').$$

We can now compute our target likelihood using any $0 \le t \le T$:

$$P(\xi) = \sum_{h \in H} P(\xi, h_t = h) = \sum_{h \in H} \phi_t(h)\, \omega_t(h).$$

The marginal posteriors are

$$u_t(h) = \frac{\phi_t(h)\, \omega_t(h)}{P(\xi)}$$
$$v_{t+1}(h') = \frac{\left(\sum_{h \in H} \phi_t(h)\, \pi_h(a_t|s_t)\, p(s_{t+1}|s_t, a_t)\, \psi_h(s_{t+1})\right) \eta(h'|s_{t+1})\, \omega_{t+1}(h')}{P(\xi)}$$
$$w_t(h) = \frac{\phi_t(h)\, \pi_h(a_t|s_t)\, p(s_{t+1}|s_t, a_t)\, (1 - \psi_h(s_{t+1}))\, \omega_{t+1}(h)}{P(\xi)}.$$

Note that the constant $p_0(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)$ is cancelled out in these normalizations. This allows us to omit these terms during the forward-backward algorithm, which can thus be applied without any knowledge of the dynamics.
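The following is a compact sketch of this E-step for a single trajectory, under the simplifying assumption that the per-step quantities $\pi_h(a_t|s_t)$, $\psi_h(s_{t+1})$, and $\eta(h|s_t)$ have already been evaluated and stored in arrays (the function and array names are illustrative). As noted above, the dynamics terms are dropped because they cancel in the normalization.

```python
import numpy as np

def marginal_posteriors(pi, psi, eta):
    """Forward-backward E-step for a trajectory with T actions.

    pi[t, h]  = pi_h(a_t | s_t)   for t = 0..T-1
    psi[t, h] = psi_h(s_{t+1})    for t = 0..T-1
    eta[t, h] = eta(h | s_t)      for t = 0..T
    Returns u (T+1 x |H|), v (T+1 x |H|), w (T x |H|).
    """
    T, H = pi.shape
    phi = np.zeros((T + 1, H))
    omega = np.ones((T + 1, H))
    phi[0] = eta[0]
    for t in range(T):                       # forward recursion
        switch = (phi[t] * pi[t] * psi[t]).sum()
        phi[t + 1] = switch * eta[t + 1] + phi[t] * pi[t] * (1.0 - psi[t])
    for t in range(T - 1, -1, -1):           # backward recursion
        cont = psi[t] * (eta[t + 1] @ omega[t + 1]) + (1.0 - psi[t]) * omega[t + 1]
        omega[t] = pi[t] * cont
    evidence = (phi[0] * omega[0]).sum()     # P(xi), up to the cancelled constant
    u = phi * omega / evidence               # u_t(h) = P(h_t = h | xi)
    w = phi[:T] * pi * (1.0 - psi) * omega[1:] / evidence
    v = np.zeros((T + 1, H))
    v[0] = u[0]                              # b_0 = 1 by construction
    for t in range(T):
        v[t + 1] = (phi[t] * pi[t] * psi[t]).sum() * eta[t + 1] * omega[t + 1] / evidence
    return u, v, w
```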

Stochastic Variant

We may collect a large number of trajectories, making it difficult to scale the EG algorithm. The expensive step in this algorithm is usually the forward-backward calculation, which is an $O(|H|^2 T)$ operation. To address this problem, we can apply a stochastic variant of the EG algorithm, which optimizes over a single trajectory in each iteration:

• E-Step: Draw a trajectory $i$ at random and calculate the marginal posteriors $w_i(h, t)$ and $q_i(h, t)$ with the forward-backward algorithm.

• G-Step: Update the parameters of the policies and the termination conditions:

$$\theta_h^{(j+1)} \leftarrow \theta_h^{(j)} - \alpha \sum_{t=1}^{T} w_i(h, t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)$$
$$\mu_h^{(j+1)} \leftarrow \mu_h^{(j)} - \alpha \sum_{t=1}^{T} q_i(h, t)\, \nabla_\mu \log \rho_\mu(\perp \mid s_t).$$

Deeper Hierarchies

Our ultimate goal is to use the algorithm presented here to discover a multi-level hierarchical structure; the key insight is that the problem is recursive in nature. A $D$-level hierarchy can be viewed as a 2-level hierarchy, in which the "high level" has a $(D-1)$-level hierarchical structure. The challenge is the coupling between the levels; namely, the value of a set of options is determined by its usefulness for meta-control [36], while the value of a meta-control policy depends on which options are available. This potentially leads to an exponential growth in the size of the latent variables required for inference. The available data may be insufficient to learn a policy so expressive. We can avoid this problem by using a simplified parametrization for the intermediate meta-control policy $\eta_d$ used when discovering level-$d$ options. In the extreme, we can fix a uniform meta-control policy that chooses each option with probability $1/|H_d|$. Discovery of the entire hierarchy can now proceed recursively from the lowest level upward: level-$d$ options can invoke already-discovered lower-level options, and are discovered in the context of a simplified level-$d$ meta-control policy, decoupled from higher-level complexity. One of the contributions of this work is to demonstrate that, perhaps counter-intuitively, this assumption does not sacrifice too much during option discovery. An informative meta-control policy would serve as a prior on the assignment of demonstration segments to the options that generated them, but with sufficient data this assignment can also be inferred from the low-level model, purely based on the likelihood of each segment being generated by each option. We use the following algorithm to iteratively discover a hierarchy of $D$ levels, each level $d$ consisting of $k_d$ options:

for $d = 1, \ldots, D-1$ do
    Initialize a set of options $H_d = \{h_{d,1}, \ldots, h_{d,k_d}\}$
    DDO: train options $\langle \pi_h, \psi_h \rangle_{h \in H_d}$ with $\eta_d$ fixed
    Augment the action space $A \leftarrow A \cup H_d$
end for
Use an RL algorithm to train the high-level policy
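A schematic sketch of this bottom-up loop (the routines `ddo` and `train_high_level` are hypothetical stand-ins for the DDO training step and any downstream RL algorithm):

```python
def discover_hierarchy(demonstrations, primitive_actions, ks, ddo, train_high_level):
    """Discover a D-level hierarchy bottom-up.

    ks[d] is the number of options to discover at the (d+1)-th level;
    ddo(demos, action_space, k) fits k options over the current action space
    with a fixed (e.g., uniform) meta-control policy; train_high_level is an
    RL algorithm run over the final augmented action space.
    """
    action_space = list(primitive_actions)
    for k in ks:                                  # levels d = 1, ..., D-1
        options = ddo(demonstrations, action_space, k)
        action_space = action_space + options     # A <- A union H_d
    return train_high_level(action_space)
```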

First, we approximate the high-level policy $\psi$, which selects policies based on the current state, with an i.i.d. selection of policies based only on the previous primitive:

$$\psi \sim P(h'|h)$$

This approximation is so that we do not have to simultaneously learn parameters for the high-level policy while trying to optimize for the parameters of the policies and termination conditions. Next, if we collect trajectories from multiple high-level policies, there may not exist a single high-level policy that can capture all of the trajectories. The approximation allows us to make the fewest assumptions about the structure of this policy. Using an approximation that $\psi^*$ is a state-independent uniform distribution sampled i.i.d., I will show that we can apply an algorithm similar to typical imitation learning approaches that recovers the most likely parameters to estimate $\pi_1^*, \ldots, \pi_k^*$ and $\{\rho_1^*, \ldots, \rho_k^*\}$. The basic idea is to define the probability that the current $(s_t, a_t)$ tuple is generated by a particular primitive $h$:

$$w(h, t) = \Pr[h \mid (s_t, a_t), \tau_{0<t}],$$

and, given that the currently selected primitive is $h$, the probability that the current time-step is a termination:

$$q(h, t) = \Pr[\perp \mid (s_t, a_t), \tau_{0<t}, h].$$

Given these probabilities, we can define an Expectation-Gradient descent over the parameter vector $\theta$:

$$\theta \leftarrow \theta + \alpha \nabla_\theta L[\theta; \xi].$$

Cross-Validation For Parameter Tuning

The number of options $k$ is a crucial hyper-parameter of the algorithm. In simulation experiments, one can roll out the learned hierarchy online and tune the hierarchy based on task success. Such tuning is infeasible on a real robot, as it would require many executions of the learned policy. We explored whether it is possible to tune the number of options offline. DDO is based on a maximum-likelihood formulation, which describes the likelihood that the observed demonstrations are generated by a hierarchy parametrized by $\theta$. However, the model expressiveness is strictly increasing in $k$, causing the optimal training likelihood to increase even beyond the point where the model overfits to the demonstrations and fails to generalize to unseen states. This likelihood is a proxy for task success, and we therefore tune $k$ in a way that maximizes it; however, we sometimes encounter a problem similar to over-fitting: increasing $k$ changes the expressiveness of the hierarchy, as with more options it can fit more complicated behaviors. This means that the tuned parameters may not generalize. We therefore adopt a cross-validation technique that holds out 10% of the demonstration trajectories for each of 10 folds, trains on the remaining data, and validates the trained model on the held-out data. We select the value of $k$ that achieves the highest average log-likelihood over the 10 folds, suggesting that training such a hierarchical model generalizes well.

We train the final policy over the entire dataset. This means the tuned parameter must work well on a held-out set. We find empirically that the cross-validated log-likelihood serves as a good proxy for actual task performance.
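A sketch of this offline tuning loop (the `fit_ddo` routine and its `log_likelihood` method are illustrative stand-ins for the DDO training and evaluation code):

```python
import numpy as np

def select_num_options(trajectories, candidate_ks, fit_ddo, n_folds=10):
    """Pick the number of options k by 10-fold cross-validated log-likelihood.

    fit_ddo(train, k) trains a k-option hierarchy and returns a model with a
    log_likelihood(trajs) method evaluated on held-out trajectories.
    """
    folds = np.array_split(np.arange(len(trajectories)), n_folds)
    scores = {}
    for k in candidate_ks:
        fold_scores = []
        for held_out in folds:
            held_out = set(held_out.tolist())
            train = [t for i, t in enumerate(trajectories) if i not in held_out]
            test = [t for i, t in enumerate(trajectories) if i in held_out]
            model = fit_ddo(train, k)
            fold_scores.append(model.log_likelihood(test))
        scores[k] = float(np.mean(fold_scores))
    best_k = max(scores, key=scores.get)   # highest average held-out log-likelihood
    return best_k, scores
```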

Vector Quantization For Initialization

One challenge with DDO is initialization. When real perceptual data is used, if all of the low-level policies are initialized randomly, the forward-backward estimates needed for the Expectation-Gradient will be poorly conditioned, with an extremely low likelihood assigned to any particular observation. The EG algorithm relies on a segment-cluster-imitate loop, where initial policy guesses are used to segment the data based on which policy best explains the given time-step, then the segments are clustered, and the policies are updated. In a continuous control space, a randomly initialized policy may not explain any of the observed data well. This means that small differences in initialization can lead to large changes in the learned hierarchy. We found that a necessary pre-processing step was a variant of vector quantization, originally proposed for problems in speech recognition. We first cluster the state observations using k-means clustering and train $k$ behavioral cloning policies, one for each of the clusters. We use these $k$ policies as the initialization for the EG iterations. Unlike a random initialization, this means that the initial low-level policies will demonstrate some preference for actions in different parts of the state-space. We set $k$ to be the same as the number of options, and use the same optimization parameters.
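A sketch of this initialization step, assuming scikit-learn's KMeans for the clustering and a `behavior_clone` routine (an illustrative name) that fits one cloning policy per cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_initialize(states, actions, k, behavior_clone):
    """Vector-quantization initialization: cluster the state observations with
    k-means, then fit one behavioral cloning policy per cluster. The resulting
    k policies seed the low-level options before the EG iterations."""
    states = np.asarray(states, dtype=float)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(states)
    initial_policies = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        cluster_actions = [actions[i] for i in idx]
        initial_policies.append(behavior_clone(states[idx], cluster_actions))
    return initial_policies
```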

2.4 Experiments: Imitation

In the first set of experiments, we evaluate DDO in its ability to represent hierarchical control policies.

Box2D Simulation: 2D Surface Pushing with Friction and Gravity

In the first experiment, we simulate a 3-link robot arm in Box2D (Figure 2.1). This arm consists of three links of lengths 5 units, 5 units, and 3 units, connected by ideal revolute joints. The arm is controlled by setting the values of the joint angular velocities $\dot{\phi}_1, \dot{\phi}_2, \dot{\phi}_3$. In the environment, there is a box that lies on a flat surface with uniform friction. The objective is to push this box without toppling it until it rests in a randomly chosen goal position. After this goal state is reached, the goal position is regenerated randomly. The task is for the robot to push the box to as many goals as possible in 2000 time-steps. Our algorithmic supervisor runs the RRT Connect motion planner of the Open Motion Planning Library, ompl, at each time-step, planning to reach the goal.

We designed this experiment to illustrate how options can lead to more concise representations, since they can specialize in different regions of the state space. Due to the geometry of the configuration space and the task, there are two classes of trajectories that are generated, depending on whether the goal is to the left or right of the arm. The 50 sampled trajectories plotted in joint angle space in Figure 2.1 are clearly separated into two distinct skills, backhand and forehand. Furthermore, these skills can be implemented by locally affine policies in the joint angle space.

Figure 2.1: A 3-link robot arm has to push a box along a surface with friction to a randomly chosen goal state to the box's left or right without toppling it. Due to the collision and contact constraints of the 2D Surface Pushing task, the task leads to two geometrically distinct pushing skills, backhand and forehand, for goals on the left and on the right. Right: 50 sampled trajectories from a motion-planning supervisor.

Our algorithmic supervisor runs an RRT Connect motion planner (from the Open Motion Planning Library, ompl) at each time-step, planning to the goal. The motion planning algorithm contains a single important hyper-parameter, which is the maximum length of branches to add to the tree. We set this to 0.1 units (for a 20 x 20 unit space). First, we consider the case where we observe the full low-dimensional state of the system: three joint angles, the box's position and orientation, and the position of the goal state. We compare our hierarchical policies with two flat, non-hierarchical policies. One baseline policy we consider is an under-parametrized linear policy which is not expressive enough to approximate the supervisor policy well. The other baseline policy is a multi-layer perceptron (MLP) policy. We train each of these policies via Behavior Cloning (BC), i.e., by maximizing the likelihood each gives to the set of demonstrations.

As expected, the flat linear policy is unsuccessful at the task for any number of observed demonstrations (Figure 2.2 Top-A). The MLP policy, on the other hand, can achieve the maximum reward when trained on 60 demonstrations or more. We apply DDO and learn two 2-level hierarchical policies, one with linear low-level options, and the other with MLP low-level options of the same architecture used for the flat policy.

Multi-Layer Perceptron Flat Policy: One of the baseline policies is a multi-layer perceptron (MLP) policy which has a single Rectified Linear Unit (ReLU) hidden layer of 64 nodes. This policy is implemented in Tensorflow and is trained with a stochastic gradient descent optimizer with learning rate $10^{-5}$.

DDO Policy 1: In the first policy trained by DDO, we have a logistic regression meta-policy that selects from one of $k$ linear sub-policies. The linear sub-policies execute until a termination condition determined again by a logistic regression. This policy is implemented in Tensorflow and is trained with an ADAM optimizer with learning rate $10^{-5}$. For the linear hierarchy, we set DDO to discover 5 options, which is tuned using the cross-validation method described before.

DDO Policy 2: In the second policy trained by DDO, we have a logistic regression meta-policy that selects from one of $k$ multi-layer perceptron sub-policies. As with the flat policy, it has a single ReLU hidden layer of 64 nodes. The MLP sub-policies execute until a termination condition determined again by a logistic regression. This policy is implemented in Tensorflow and is trained with an ADAM optimizer with learning rate $10^{-5}$. For the MLP hierarchy, we set DDO to discover 2 options, which is tuned using the cross-validation method described before.

In both cases, the termination conditions are parametrized by a logistic regression from the state to the termination probability, and the high-level policy is a logistic regression from the state to an option selection. The MLP hierarchical policy can achieve the maximum reward with 30 demonstrations, and is therefore 2x more sample-efficient than its flat counterpart (Figure 2.2 Top-A). We also vary the number of options discovered by DDO, and plot the reward obtained by the resulting policy (Figure 2.2 Top-B). While the performance is certainly sensitive to the number of options, we find that the benefit of having sufficiently many options is only diminished gradually with each additional option beyond the optimum. Importantly, the peak in the cross-validated log-likelihood corresponds to the number of options that achieves the maximum reward (Figure 2.2 Top-C). This allows us to use cross-validation to select the number of options without having to evaluate the policy by rolling it out in the environment.

Observing Images: Next, we consider the case where the sensor inputs are 640x480 images of the scene. The low-dimensional state is still fully observable in these images, however these features are not observed explicitly, and must be extracted by the control policy. We consider two neural network architectures to represent the policy: a convolutional layer followed by either a fully connected layer or an LSTM, respectively forming a feed-forward (FF) network and a recurrent network (RNN).

Figure 2.2: 2D Surface Pushing from low-dimensional observations and image observations. (A) The sample complexity of different policy representations, (B) the success rate for different numbers of discovered options, (C) the peak in the reward over the number of options correlates with the 10-fold cross-validated log-likelihood.

We use these architectures both for flat policies and for low-level options in hierarchical policies. For the hierarchical policies, as in the previous experiment, we consider a high-level policy that only selects options. For the FF policy we discover 4 options, and for the RNN policy we discover 2 options.

DDO FF Policy: First, we consider a neural network with a convolutional layer with 64 5x5 filters followed by a fully connected layer, forming a feed-forward (FF) network. The high-level model only selects options and is parametrized by the same general architecture with an additional softmax layer after the fully connected layer. This means that the meta-control policy is a two-layer convolutional network whose output is a softmax categorical distribution over options. This policy is implemented in Tensorflow and is trained with an ADAM optimizer with learning rate $10^{-5}$. We used $k = 2$ options for the FF policy.

DDO RNN Policy: Next, we consider a neural network with a convolutional layer with 64 5x5 filters followed by an LSTM layer, forming a recurrent network (RNN). The high-level model only selects options and is parametrized by the same general architecture with an additional softmax layer after the LSTM. This policy is implemented in Tensorflow and is trained with a Momentum optimizer with learning rate $10^{-4}$ and momentum $10^{-3}$. We used $k = 4$ options for the RNN policy.

The hierarchical policies require 30% fewer demonstrations than the flat policies to achieve the maximum reward (Figure 2.2 Bottom-A). Figure 2.2 Bottom-B and C show how the success rate and cross-validated log-likelihood vary with the number of options. As for the low-dimensional inputs, the success rate curve is correlated with the cross-validated log-likelihood. We can rely on this to select the number of options offline without rolling out the learned hierarchical policy.

Control Space Augmentation (Kinematic): We test two different architectures for the output layer of the high-level policy: either a softmax categorical distribution selecting an option, or the hybrid categorical-continuous distribution output described in. The low-level policies are linear. Figure 2.4 describes the empirical estimation, through policy rollouts, of the success rate as a function of the number of options, and of the fraction of time that the high-level policy applies physical control.

Figure 2.3: 2D Surface Pushing from image observations. (A) The sample complexity of different policy representations, (B) the success rate for different numbers of discovered options, (C) the peak in the reward over the number of options correlates with the 10-fold cross-validated log-likelihood.

When the options are too few to provide skills that are useful throughout the state space, physical controls can be selected instead to compensate in states where no option would perform well. This is indicated by physical control being selected with greater frequency. As more options allow a better coverage of the state space, the high-level policy selects physical controls less often, allowing it to generalize better by focusing on correct option selection. With too many discovered options, each option is trained from less data on average, making some options overfit and become less useful to the high-level policy.

Figure 2.4: (A) The sample complexity of augmented vs. option-only high-level control spaces. (B) The fraction of high-level selections that are of physical control. The frequency of selecting a physical control first decreases with the number of options, then increases as options overfit.

Control Space Augmentation (Images): We ran the same experiment but on the image state-space instead of the low-dimensional state. The sensor inputs are 640x480 images of the scene and the task is the Box2D pushing task. Figure 2.5 describes the empirical estimation, through policy rollouts, of the success rate as a function of the number of options, and of the fraction of time that the high-level policy applies physical control. When the options are too few to provide skills that are useful throughout the state space, physical controls can be selected instead to compensate in states where no option would perform well.

Figure 2.5: Augmenting the action space instead of selecting only options with the meta-policy attenuates the losses when selecting too few or too many options.

Box2D Simulation: Discussion

These results may seem counterintuitive. While it is easy to reason about the low-dimensional kinematic case (where motions are essentially piecewise affine in the state), the results for image observations are less clear: why does hierarchy appear to help? To understand why, we ran a follow-up experiment. Figure 2.6 illustrates two plots color coded by kinematic class (forehand and backhand pushes). The flat plot takes a t-SNE visualization of the output of the first convolutional layer of the "flat" network and illustrates how the policy is structured. While there is clear structure, the same neurons are active for forehand and backhand motions. On the other hand, when we visualize how the hierarchical policy partitions the trajectory space, it gives a cleaner separation between the forehand and the backhand motions. We speculate that while the action space is continuous, there is a discontinuity in the set of actions relevant to the task due to the geometry and kinematic constraints. The inherently discrete hierarchical representation captures this discontinuity more accurately without a lot of data.

Physical Experiment 1: Needle Insertion

We constructed a physical environment to replicate the simulation results on a real robotic platform. We consider a needle orientation and insertion task on the da Vinci Research Kit (DVRK) surgical robot (Figure 2.7). In this task, the robot must grasp a surgical needle, reorient it in parallel to a tissue phantom, and insert the needle into the tissue phantom. The task is successful if the needle is inserted into a 1 cm diameter target region on the phantom.

Figure 2.6: A visualization of how the flat and hierarchical policies separate the space of observations. The flat policy is visualized with a t-SNE plot of its first convolutional layer.

Small changes in the needle's initial orientation can lead to large changes in the in-gripper pose of the needle due to deflection. The state is the current 6-DoF pose of the robot gripper, and algorithmically extracted visual features that describe the estimated pose of the needle. These features are derived from an image segmentation that masks the needle from the background and fits an ellipsoid to the resulting pixels. The principal axis of this 2D ellipsoid is a proxy for the pose of the needle. The task runs for a fixed 15 time-steps, and the policy must set the joint angles of the robot at each time-step.

Figure 2.7: A needle orienting and insertion task. The robot must grasp a surgical needle, reorient it in parallel to a tissue phantom, and insert the needle.

The needle's deflection coupled with the inaccurate kinematics of the DVRK makes it challenging to plan trajectories that insert the needle properly. A visual servoing policy needs to be trained that can both grasp the needle in the correct position and reorient the gripper in the correct direction after grasping. To collect demonstrations, we programmed an initial open-loop control policy, interrupted the robot via keyboard input when adjustment was needed, and kinesthetically adjusted the pose of the gripper. We collected 100 such demonstrations.

We evaluated the following alternatives: (1) a single flat MLP policy with continuous output, (2) a flat policy consisting of 15 distinct MLP networks, one for each time-step, and (3) a hierarchical policy with 5 options trained with DDO. We considered a hierarchy where the high-level policy is an MLP with a softmax output that selects the appropriate option, and each option is parametrized by a distinct MLP with continuous outputs. DDO learns 5 options, two of which roughly correspond to the visual servoing for grasping and lifting the needle, while the other three handle three different types of reorientation. For 100 demonstrations, the hierarchical policy learned with DDO has a 45% higher log-likelihood measured in cross-validation than the flat policy, and a 24% higher log-likelihood than the per-timestep policy.

Task Description: The robot must grasp a 1 mm diameter surgical needle, re-orient it parallel to a tissue phantom, and insert the needle into the tissue phantom. The task is successful if the needle is inserted into a 1 cm diameter target region on the phantom. In this task, the state-space is the current 6-DoF pose of the robot gripper and visual features that describe the estimated pose of the needle. These features are derived from an image segmentation that masks the needle from the background and fits an ellipsoid to the resulting pixels. The principal axis of this 2D ellipsoid is a proxy for the pose of the needle. The task runs for a fixed 15 time-steps and the policy must set the joint angles of the robot at each time-step.

Robot Parameters: The challenge is that the curved needle is sensitive to the way that it is grasped. Small changes in the needle's initial orientation can lead to large changes in the in-gripper pose of the needle due to deflection. This deflection, coupled with the inaccurate kinematics of the DVRK, leads to very different trajectories to insert the needle properly. The robotic setup includes a stereo endoscope camera located 650 mm above the 10 cm x 10 cm workspace. After registration, the DVRK has an RMSE kinematic error of 3.3 mm; for reference, the gripper width is 1 cm. In some regions of the state-space this error is even higher, with a 75th percentile error of 4.7 mm. The learning in this task couples a visual servoing policy to grasp the needle with the decision of which direction to orient the gripper after grasping.

Demonstration Protocol: To collect demonstrations, we programmed an initial open-loop control policy. This policy traced out the basic desired robot motion, avoiding collisions and respecting joint limits, grasping where it believed the needle was, and following an open-loop strategy to pin the needle in the phantom. It was implemented by 15 joint-angle waypoints which were interpolated by a motion planner. We observed the policy execute and interrupted the robot via keyboard input when adjustment was needed.
This interruption triggered a clutching mechanism, and we could then kinesthetically adjust the joints of the robot and the pose of the gripper (but not the open-close state). The adjustment was recorded as a delta in joint-angle space which was propagated through the rest of the trajectory. We collected 100 such demonstrations; images of these adjustments are visualized in Figure 2.8.
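The visual features described above (segment the needle from the background, fit an ellipse, and use its principal axis as a proxy for needle orientation) could be implemented roughly as follows. This is a hedged sketch, not the thesis implementation: the color threshold, the assumption that the largest blob is the needle, and the returned feature vector are all hypothetical.

```python
import cv2
import numpy as np

def needle_pose_features(bgr_image):
    """Return [cx, cy, angle_deg, major_axis] estimated from a needle segmentation mask."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Hypothetical color range separating the metallic needle from the phantom background.
    mask = cv2.inRange(hsv, (0, 0, 180), (180, 60, 255))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    # OpenCV 4 returns (contours, hierarchy).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    needle = max(contours, key=cv2.contourArea)   # assume the largest blob is the needle
    if len(needle) < 5:                           # fitEllipse requires at least 5 points
        return None
    (cx, cy), (ax1, ax2), angle = cv2.fitEllipse(needle)
    return np.array([cx, cy, angle, max(ax1, ax2)])
```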

Figure 2.8: To collect demonstrations, we programmed an initial open-loop control pol- icy. We observed the policy execute and interrupted the robot via keyboard input when adjustment was needed.

Learning the Parameters: Figure 2.9A plots the cross-validation log-likelihood as a function of the number of demonstrations. We find that the hierarchical model has a higher likelihood than the alternatives, meaning that it more accurately explains the observed data and generalizes better to held-out data. At some points, the relative difference is over 30%. It additionally provides some interpretability for the learned policy. Figure 2.9B visualizes two representative trajectories, color-coded by the option active at each state (estimated by DDO). The algorithm separates each trajectory into 3 segments: needle grasping, needle lifting, and needle orienting. The two trajectories have the same first two options but differ in the orientation step. One of the trajectories has to rotate in a different direction to orient the needle before insertion.

Flat Policy: One of the baseline policies is a multi-layer perceptron (MLP) policy with a single ReLU hidden layer of 64 nodes. This policy is implemented in TensorFlow and is trained with the ADAM optimizer with learning rate 10^-5.

Per-Timestep Policy: Next, we consider a degenerate case of options where each policy executes for a single time-step. We train 15 distinct MLP policies, each with a single ReLU hidden layer of 64 nodes. These policies are implemented in TensorFlow and are trained with the ADAM optimizer with learning rate 10^-5.

DDO Policy: DDO trains a hierarchical policy with 5 options. We considered a hierarchy where the meta-policy is an MLP with a softmax output that selects the appropriate option, and the options are parametrized by distinct MLPs with continuous outputs. Each of the MLP policies has a single ReLU hidden layer of 64 nodes. These policies are implemented in TensorFlow and are trained with the ADAM optimizer with learning rate 10^-5.

Execution: For each of the methods, we execute ten trials and report the success rate (successfully grasped and inserted the needle into the target region) and the accuracy.
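As a concrete illustration of the flat baseline above (one 64-unit ReLU hidden layer, ADAM with learning rate 10^-5), here is a minimal Keras sketch. The state and action dimensions are assumptions for illustration; the thesis implementation used TensorFlow directly.

```python
import tensorflow as tf

def build_flat_policy(state_dim=9, action_dim=7):
    # Single ReLU hidden layer of 64 units, continuous joint-angle outputs.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(action_dim),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss="mse")
    return model

# Behavioral cloning on demonstration (state, action) pairs:
# policy = build_flat_policy()
# policy.fit(states, actions, epochs=50, batch_size=32)
```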

Figure 2.9: (A) We plot the cross-validation likelihood of the different methods as a function of the number of demonstrations. (B) We visualize two representative trajectories (position and gripper orientation) color-coded by the most likely option applied at that timestep. We find that the two trajectories have the same first two options but then differ in the final step due to the re-orientation of the gripper before insertion.

The results are described in aggregate in the table below:

Method                 Overall Success   Grasp Success   Insertion Success   Insertion Accuracy
Open Loop              2/10              2/10            0/0                 7 ± 1 mm
Behavioral Cloning     3/10              6/10            3/6                 6 ± 2 mm
Per Timestep           6/10              7/10            6/7                 5 ± 1 mm
DDO                    8/10              10/10           8/10                5 ± 2 mm

We ran preliminary trials to confirm that the trained options can be executed on the robot. For each of the methods, we report the success rate in 10 trials, i.e., the fraction of trials in which the needle was successfully grasped and inserted into the target region. All of the techniques had comparable accuracy in trials where they successfully grasped and inserted the needle into the 1 cm diameter target region. The algorithmic open-loop policy only succeeded 2/10 times. Surprisingly, Behavior Cloning (BC) did not do much better than the open-loop policy, succeeding only 3/10 times. Per-timestep BC was far more successful (6/10). Finally, the hierarchical policy learned with DDO succeeded 8/10 times. Over 10 trials, it succeeded 5 more times than the direct BC approach and 2 more times than the per-timestep BC approach. While not statistically significant, our preliminary results suggest that hierarchical imitation learning is also beneficial in terms of task success, in addition to improving model generalization and interpretability.

Figure 2.10: Illustration of the needle orientation and insertion task. Above are images illustrating the variance in the initial state; below are corresponding final states after executing DDO.

Figure 2.11: The endoscope image and a corresponding binary mask with a selected grasp. The arrow corresponds to the orientation of the gripper along the grasp axis.

Physical Experiment 2: Surgical Bin Picking In this task, the robot is given a foam bin with a pile of 5–8 needles of three different types, each 1–3 mm in diameter. The robot must extract needles of a specified type and place them in an “accept” cup, while placing all other needles in a “reject” cup. The task is successful if the entire foam bin is cleared into the correct cups. In initial trials, the kinematics of the DVRK were not precise enough for grasping needles, so visual servoing is needed, which requires learning. However, even with visual servoing, failures are common, and we would like to also learn automatic recovery behaviors.

To define the state space for this task, we first generate binary images from overhead stereo images and apply a color-based segmentation to identify the needles (the image input). Then, we use a classifier trained in advance on 40 hand-labeled images to identify and provide a candidate grasp point, specified by position and direction in image space (the grasp input). Additionally, the 6-DoF robot gripper pose and the open-closed state of the gripper are observed (the kin input). The state space of the robot is (image, grasp, kin), and the control space is the 6 joint angles and the gripper angle. Each sequence of grasp, lift, move, and drop operations is implemented in 10 control steps of joint angle positions. As in the previous task, we programmed an initial open-loop control policy, interrupted the robot via keyboard input when adjustment was needed, and kinesthetically adjusted the pose of the gripper. We collected 60 such demonstrations, each fully clearing a pile of 3–8 needles from the bin, for a total of 450 individual grasps.

Figure 2.12: (Left) Surgical Bin Picking: the robot is given a foam bin with a pile of 5–8 needles of three different types, each 1–3mm in diameter. The robot must extract needles of a specified type and place them in an “accept” cup, while placing all other needles in a “reject” cup. (Right) For each of the 4 options, we plot how heavily the different inputs are weighted (image, grasp, or kin) in computing the option’s action. Nonzero values of the ReLU units are marked in white and indicate input relevance.

We apply DDO to the collected demonstrations with the policy architecture described below (Figure 2.14). In this network, there are three inputs: a binary image, a candidate grasp, and kinematics (kin). These inputs are processed by distinct branches of the network before being aggregated into a single feature layer. Using the cross-validation technique described before, we selected the number of options to be k = 4. We plot the average activations of the feature layer for each low-level option (Figure 2.12). While this is only a coarse analysis, it gives some indication of which inputs (image, grasp, or kin) are relevant to the policy and termination. We see that the options are clearly specialized: the first option has a strong dependence only on the grasp candidate, the second option attends almost exclusively to the image, while the last two options rely mostly on kin. The experimental details are described below.

Task Description: We consider a task with a more complicated high-level structure. In this task, the robot is given a foam bin with 3 different types of needles (1 mm–3 mm in diameter) lying in a pile (5–8 needles in the experiments). The robot must extract a particular type of needle and place it in the accept cup, and place all others in a reject cup. The task is successful if the entire foam bin is cleared. Figure 2.13 shows representative objects for this task. We consider three types of “needles”: dissection pins, suturing needles, and wires. Dissection pins are placed in the accept cup and the other two are placed in the reject cup.

Robot and State-Space Parameters: As in the previous task, learning is required because the kinematics of the DVRK are such that the precision needed for grasping needles requires visual servoing. However, even with visual servoing, failures are common due to the pile (grasps of 2, 3, or 4 objects), and we would like to automatically learn recovery behaviors. In our robotic setup, there is an overhead endoscopic stereo camera located 650 mm above the workspace. To define the state-space for this task, we first generate binary images from the stereo images and apply a color-based segmentation to identify the needles (we call this feature image). Then, we use a classifier derived from 40 hand-labeled images to identify possible grasp points and sample a candidate grasp (left pixel value, right pixel value, and direction); we call this feature grasp. These features are visualized in Figure 2.11. Additionally, there is the 6-DoF robot gripper pose and the open-close state of the gripper (we call this feature kin). The state-space of the robot is (kin, image, grasp), and the action space for the robot is the 6 joint angles and the gripper angle. Each grasp, lift, move, and drop operation consists of 10 time steps of joint angle positions. The motion between the joint angles is performed using a SLERP-based motion planner.

Figure 2.13: There are two bins: an accept bin and a reject bin. Dissection pins are placed in the accept bin; suturing needles and wires are placed in the reject bin.

Demonstration Protocol: As in the previous task, to collect demonstrations, we start with a hard-coded open-loop policy. We roll this policy out and interrupt it when we anticipate a failure. Then, we kinesthetically adjust the pose of the DVRK and it continues. We collected 60 such demonstrations of fully clearing the bin filled with 3 to 8 needles each, corresponding to 450 individual grasps. We also introduced a key that allows the robot to stop in place and drop its currently grasped needle. Recovery behaviors were triggered when the robot grasped no objects or more than one object.

Due to the kinesthetic corrections, a very high percentage of the attempted grasps (94%) grasped at least one object. Of the successful grasps, when 5 objects were in the pile, 32% of grasps picked up 2 objects, 14% picked up 3 objects, and 0.5% picked up 4. In recovery, the gripper is opened and the episode ends, leaving the arm in place. The next grasping trial starts from this point.

Figure 2.14: We use the above neural network to parametrize each of the four options learned by DDO. In this network, there are three inputs: a binary image, a candidate grasp, and kinematics. These inputs are processed through individual neural network branches and aggregated into a single output layer. This output layer sets the joint angles of the robot and the gripper state (open or closed). Option termination is also determined from this output.

Policy Parametrization: We apply DDO to the collected demonstrations with the policy parametrization described in Figure 2.14. In this network, there are three inputs: a binary image, a candidate grasp, and kinematics. These inputs are processed through individual neural network branches and aggregated into a single output layer. This output layer sets the joint angles of the robot and the gripper state (open or closed). Option termination is also determined from this output. Using the cross-validation technique described before, we identify 4 options (Figure 2.14). The high-level policy has the same parametrization, but after the output layer there is a softmax operation that selects a lower-level option.

Coordinate Systems: We also experimented with different action space representations to see whether they had an effect on single-object grasps and categorizations. We trained alternative policies on the collected dataset where, instead of predicting joint angles, we predicted either 6-DoF end-effector poses with a binary gripper open-close state, or end-effector poses represented as a rotation/translation matrix with a binary gripper open-close state. We found that the joint angle representation was the most effective. In particular, we found that for the grasping part of the task, a policy that controlled the robot in terms of tooltip poses was unreliable.
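A hedged Keras sketch of a three-branch parametrization in the spirit of Figure 2.14 is shown below. The layer sizes, input shapes, and activations are assumptions for illustration, not the thesis architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_option_network(image_shape=(64, 64, 1), grasp_dim=3, kin_dim=7, n_joints=6):
    image_in = layers.Input(shape=image_shape, name="image")
    grasp_in = layers.Input(shape=(grasp_dim,), name="grasp")
    kin_in = layers.Input(shape=(kin_dim,), name="kin")

    # Each input is processed by its own branch before aggregation.
    x = layers.Conv2D(16, 5, strides=2, activation="relu")(image_in)
    x = layers.Conv2D(32, 3, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    g = layers.Dense(32, activation="relu")(grasp_in)
    k = layers.Dense(32, activation="relu")(kin_in)

    features = layers.Concatenate()([x, g, k])
    features = layers.Dense(64, activation="relu", name="feature_layer")(features)

    joints = layers.Dense(n_joints, name="joint_angles")(features)
    gripper = layers.Dense(1, activation="sigmoid", name="gripper")(features)
    terminate = layers.Dense(1, activation="sigmoid", name="termination")(features)
    return tf.keras.Model([image_in, grasp_in, kin_in], [joints, gripper, terminate])
```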

Figure 2.15: We evaluated the generalization of the learned policy on a small set of unseen objects. This was to understand which features of the object's binary mask are used to determine behaviors.

Representation    Items   Successful Grasp   Successful Recovery   Successful Categorizations
Joint Angle       8       7/8                2/2                   7/7
Tooltip Pose      8       3/8                5/5                   3/3
Rotation          8       2/8                0/8                   0

Figure 2.16: We plot the time distribution of the options selected by the high-level policy. The x-axis represents time, and the bars represent the probability mass assigned to each option. We find that the structure of the options aligns with key phases in the task such as servoing to the needle, grasping it, and categorizing.

Interpreting Learned Options: We additionally analyzed the learned options to see if there was an interpretable structure. We examined the collected demonstrations and looked at the segmentation structure. We average over the 60 trials the probability that the high-level policy chooses each option in each of the first 10 time-steps during training (Figure 2.16). We find that the options indeed cluster visited states and segment them in alignment with key phases in the task, such as servoing to the needle, grasping it, dropping it if necessary, and categorizing it into the accept cup.

Full Breakdown of Experimental Results: In 10 trials, 7 out of 10 were successful (Table 2.4). The main failure mode was unsuccessful grasping (defined as grasping either no needles or multiple needles). As the piles were cleared and became sparser, the robot's grasping policy became somewhat brittle. The grasp success rate was 66% on 99 attempted grasps. In contrast, we rarely observed failures in the other aspects of the task. Of the grasps that failed, nearly 34% were due to grasping multiple objects.

Trial #   # of Needles   # Needles Cleared   Grasping       Recovery       Categorization
1         6              6                   6/9            3/3            6/6
2         8              6                   7/12           5/6            6/7
3         7              7                   7/8            1/1            7/7
4         6              6                   6/10           4/4            6/6
5         7              6                   6/11           5/5            6/6
6         8              8                   8/13           5/5            8/8
7         6              5                   5/9            4/4            5/5
8         7              7                   7/10           3/3            7/7
9         8              8                   8/10           2/2            8/8
10        6              6                   6/7            1/1            6/6
Totals    69             Success: 70%        Success: 66%   Success: 97%   Success: 98%

Generalization to unseen objects: We also evaluated the learned policy on a few unseen objects (of similar size and color) to show that there is some level of generalization in the learning. We tried four novel objects and evaluated what the learned policy did for each (Figure 2.15). For each of the novel objects we attempted 10 grasps in random locations and orientations in the bin (with no other objects present). We evaluated grasp success and whether the object was categorized consistently (i.e., does the learned policy consistently treat it as a pin or as a needle). We found that the diode was consistently grasped and categorized as a dissection pin. We conjecture this is because of its head and thin metallic wires. On the other hand, the screw and the thread were categorized into the reject cup. For 8/10 of the successful grasps, the nail was treated as a failed grasp and dropped. We conjecture that since it is a large object, it looks similar to the two-object grasps seen in the demonstrations.

Object   Grasps   Successful   Drop   Categorize Accept   Categorize Reject
Thread   10       10           1      0                   9
Diode    10       10           0      10                  0
Nail     10       8            8      0                   0
Screw    10       4            0      0                   4

Finally, we evaluate the success of the learned hierarchical policy in 10 full trials, according to 4 different success criteria. First, the overall task success is measured by the success rate of fully clearing the bin without failing 4 consecutive grasping attempts. Second, we measure the success rate of picking exactly one needle in individual grasp attempts. Third, we measure the success rate of appropriately reacting to grasps, by dropping the load and retrying unsuccessful grasps, and not dropping successful grasps. Fourth, we measure the success rate of needle categorization.

In 10 trials, 7/10 were successful. The main failure mode was unsuccessful grasping due to picking either no needles or multiple needles. As the piles were cleared and became sparser, the robot’s grasping policy became somewhat brittle. The grasp success rate was 66% on 99 attempted grasps. In contrast, we rarely observed failures at the other aspects of the task, reaching 97% successful recovery on 34 failed grasps.

Vector Quantization For Initialization One challenge with DDO is initialization. When real perceptual data is used, if all of the low-level policies are initialized randomly, the forward-backward estimates needed for the Expectation-Gradient are poorly conditioned: an extremely low likelihood is assigned to any particular observation. The EG algorithm relies on a segment-cluster-imitate loop, where initial policy guesses are used to segment the data based on which policy best explains each time-step, then the segments are clustered, and the policies are updated. In a continuous control space, a randomly initialized policy may not explain any of the observed data well. This means that small differences in initialization can lead to large changes in the learned hierarchy. We found that a necessary pre-processing step was a variant of vector quantization, originally proposed for problems in speech recognition. We first cluster the state observations using k-means clustering and train k behavioral cloning policies, one for each of the clusters. We use these k policies as the initialization for the EG iterations. Unlike the random initialization, this means that the initial low-level policies will demonstrate some preference for actions in different parts of the state-space. We set k to be the same as the number of options, and use the same optimization parameters.
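A simplified sketch of this initialization is below, assuming scikit-learn for k-means and small Keras MLPs for the behavioral-cloning policies; the data format and network sizes are illustrative assumptions, not the thesis code.

```python
import numpy as np
from sklearn.cluster import KMeans
import tensorflow as tf

def vq_initialize_options(states, actions, k, state_dim, action_dim):
    """states: (N, state_dim), actions: (N, action_dim) NumPy arrays.
    Returns k policies pretrained by behavioral cloning, one per k-means cluster."""
    assignments = KMeans(n_clusters=k).fit_predict(states)
    options = []
    for c in range(k):
        idx = assignments == c
        policy = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(state_dim,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(action_dim),
        ])
        policy.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="mse")
        policy.fit(states[idx], actions[idx], epochs=20, verbose=0)  # behavioral cloning
        options.append(policy)
    return options
```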

Layer-wise Hierarchy Training We also found that layer-wise training of the hierarchy greatly reduced the likelihood of a degenerate solution. While, at least in principle, one can train all levels jointly, when the meta-policy is very expressive and the options are initialized poorly, the learned solution can sometimes degenerate to excessively using the meta-policy (“high-fitting”). We can avoid this problem by using a simplified parametrization for the meta-control policy ηd used when discovering the low-level options. For example, we can fix a uniform meta-control policy that chooses each option with probability 1/k. Once these low-level options are discovered, we can augment the action space with the options and train the meta-policy. This same algorithm can recursively proceed to deep hierarchies from the lowest level upward: level-d options can invoke already-discovered lower-level options, and are discovered in the context of a simplified level-d meta-control policy, decoupled from higher-level complexity. Perhaps counter-intuitively, this layer-wise training does not sacrifice too much during option discovery, and in fact, initial results seem to indicate that it improves the stability of the algorithm. An informative meta-control policy would serve as a prior on the assignment of demonstration segments to the options that generated them, but with sufficient data this assignment can also be inferred from the low-level model, purely based on the likelihood of each segment being generated by each option.

Figure 2.17: This experiment measures the variance in log likelihood over 10 different runs of DDO with different training and initialization strategies. Vector Quantization (VQ) and Layer-Wise training (LW) help stabilize the solution.

Results: Stability of DDO In the first experiment, we measure the variance in log-likelihood over 10 different runs of DDO with different training and initialization strategies (Figure 2.17). We use the entire datasets presented before and use the FF architecture for the Box2D experiments. Vector Quantization (VQ) and Layer-Wise training (LW) help stabilize the solution and greatly reduce the variance of the algorithm. This reduction in variance is over 50% in the Box2D experiments, and somewhat more modest for the real data. In the next experiment, we measure the consistency of the solutions found by DDO (Figure 2.18). For each of the demonstration trajectories, we annotate each time-step with the most likely option. One can view this annotation as a hard clustering of the state-action tuples. Then, we measure the average normalized mutual information (NMI) between all pairs of the 10 different runs of DDO. NMI is a measure of how aligned two clusterings are, ranging from 0 to 1, where 1 indicates perfect alignment. As with the likelihood, Vector Quantization (VQ) and Layer-Wise training (LW) significantly improve the consistency of the algorithm. In the last experiment, we measure the symptoms of the “high-fitting” problem (Figure 2.19). We plot the probability that the high-level policy selects an option; in some sense, this measures how much control the high-level policy delegates to the options. Surprisingly, VQ and LW have an impact on this: hierarchies trained with VQ and LW rely more heavily on the options.
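The consistency measure above can be sketched as follows, assuming scikit-learn's NMI implementation and that each DDO run is summarized by a hard option label per time-step.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def average_pairwise_nmi(option_labelings):
    """option_labelings: list of 1-D integer arrays, one per DDO run, each assigning an
    option index to every (state, action) tuple in the concatenated demonstrations."""
    scores = [normalized_mutual_info_score(a, b)
              for a, b in combinations(option_labelings, 2)]
    return float(np.mean(scores))
```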

The Effects of Dropout For the real datasets, we leverage a technique called dropout, which has been widely used in neural network training to prevent overfitting. Dropout randomly removes a unit from the network along with all of its connections. We found that this technique improved performance when the datasets were small.

Figure 2.18: We measure the average normalized mutual information between multiple runs of DDO. This measures the consistency of the solutions across initializations.

Figure 2.19: We measure the probability that an option is selected by the high-level policy.

We set the dropout parameter to 0.5 and measured the performance on a hold-out set. We ran an experiment on the needle insertion task dataset, where we measured the cross-validation accuracy on held-out data with and without dropout (Figure 2.20). Dropout has a substantial effect on improving cross-validation accuracy on the hold-out set.

2.5 Segmentation of Robotic-Assisted Surgery

In this section, we illustrate the wide applicability of the DDO framework by applying it to human demonstrations in a robotic domain. We apply DDO to long robotic trajectories (e.g., 3 minutes) demonstrating an intricate task, and discover options for useful subtasks, as well as a segmentation of the demonstrations into semantic units.

Figure 2.20: Dropout seems to have a substantial effect on improving cross-validation accuracy on a hold out set.

The JIGSAWS dataset consists of surgical robot trajectories of human surgeons performing training procedures [39]. The dataset was captured using the da Vinci Surgical System from eight surgeons with different skill levels, performing five repetitions each of needle passing, suturing, and knot tying. This dataset consists of videos and kinematic data of the robot arms, and is annotated by experts identifying the activity occurring in each frame. Each policy network takes as input a three-channel RGB 200 × 200 image, downscaled from 640 × 480 in the dataset, applies three convolutional layers with ReLU activations followed by two fully-connected dense layers reducing to 64 and then to eight real-valued components. An action is represented by 3D translations and the opening angles of the left and right arm grippers. We investigate how well the segmentation provided by DDO corresponds to expert annotations, when applied to demonstrations of the three surgical tasks. Figure 2.21 shows a representative sample of 10 trajectories from each task, with each time step colored by the most likely option to be active at that time. Human boundary annotations are marked with ×. We quantify the match between the manual and automatic annotation by the fraction of option boundaries that have exactly one human annotation in a 300 ms window around them. By this metric, DDO obtains 72% accuracy, while random guessing gets only 14%. These results suggest that DDO succeeds in learning some latent structure of the task.
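The boundary-matching metric above can be sketched as follows; a hedged assumption here is that "a 300 ms window around" a boundary means ±150 ms, and that timestamps are in seconds.

```python
import numpy as np

def boundary_match_accuracy(option_boundaries, human_annotations, window=0.3):
    """Fraction of option boundaries with exactly one human annotation within
    a `window`-second interval centered on the boundary."""
    option_boundaries = np.asarray(option_boundaries, dtype=float)
    human_annotations = np.asarray(human_annotations, dtype=float)
    hits = 0
    for b in option_boundaries:
        in_window = np.abs(human_annotations - b) <= window / 2.0
        if in_window.sum() == 1:        # exactly one annotation inside the window
            hits += 1
    return hits / max(len(option_boundaries), 1)
```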

2.6 Experiments: Reinforcement

We present an empirical study of DDO in the Reinforcement Learning setting. Our results suggest that DDO can discover options that accelerate reinforcement learning.

Four Rooms GridWorld We study a simple four-room domain (Figure 2.22). On a 15 × 11 grid, the agent can move in four directions; moving into a wall has no effect.

Figure 2.21: Segmentation of human demonstrations in three surgical tasks. Each line represents a trajectory, with segment color indicating the option inferred by DDO as most likely to be active at that time. Human annotations are marked as × above the segments. Automated segmentation achieves a good alignment with human annotation: 72% of option boundaries have exactly one annotation in a 300 ms window around them.

To simulate environment noise, we replace the agent's action with a random one with probability 0.3. An observable apple is spawned in a random location in one of the rooms. Upon taking the apple, the agent gets a unit reward and the apple is re-spawned.

We use the following notation to describe the different ways we can parametrize option discovery: A, a baseline using only atomic actions; H1u, discovering a single level of options where the higher level is parametrized by a uniform distribution; H1s, discovering a single level of options where the higher level is parametrized by a multi-layer perceptron (MLP); H2u and H2s, the two-level counterparts of H1u and H1s, respectively. All of the discovered options are used in an RL phase to augment the action space of a high-level global policy.

Figure 2.22: Trajectories generated by options in a hierarchy discovered by DDO. Each level has 4 options, where level 2 builds on level 1, which in turn uses atomic actions. In the lower level, two of the options move the agent between the upper and lower rooms, while the other two options move it from one side to the other. In the higher level, each option takes the agent to a specific room. The lower-level options are color-coded to show how they are composed into the higher-level options.

Baseline. We use Value Iteration to compute the optimal policy, and then use this policy as a supervisor to generate 50 trajectories of length 1000. All policies, whether for control, meta-control, or termination, are parametrized by an MLP with a single two-node hidden layer and tanh activation functions. The MLP's input consists of the full state (agent and apple locations), and the output is computed by a softmax over the MLP output vector, which has length |A| for control policies and two for termination. The options corresponding to H2u are visualized in Figure 2.22 by trajectories generated using each option from a random initial state until termination.
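To make the supervisor above concrete, here is a small tabular value-iteration sketch with a noisy rollout routine; the MDP encoding (a transition tensor P and a per-state reward R) is an assumption for illustration, not the thesis code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P: (A, S, S) transition tensor, R: (S,) reward received on entering a state.
    Returns the optimal value function and the greedy (supervisor) policy."""
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = P @ (R + gamma * V)            # (A, S): expected return of each action
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

def rollout(P, policy, start_state, horizon=1000, noise=0.3, rng=np.random):
    """Roll out the supervisor; with probability `noise` the chosen action is replaced
    by a random one, mimicking the environment noise described above."""
    A, S, _ = P.shape
    trajectory, s = [], start_state
    for _ in range(horizon):
        a = rng.randint(A) if rng.rand() < noise else policy[s]
        trajectory.append((s, a))
        s = rng.choice(S, p=P[a, s])
    return trajectory
```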

Figure 2.23: 15-trial mean reward for the Supervised and Exploration problem settings when running DQN with no options (A), low-level options (H1u), and lower- and higher-level options (H2u) augmenting the action space. The options discovered by DDO can accelerate learning, since they benefit from not being interrupted by random exploration.

At the first level, two of the discovered options move the agent between the upper and lower rooms, and two move it from one side to the other. At the second level, the discovered options aggregate these primitives into higher-level behaviors that move the agent from any initial location to a specific room.

Impact of options and hierarchy depth. To evaluate the quality of the discovered options, we train a Deep Q-Network (DQN) with the same MLP architecture and an action space augmented by the options. The exploration is ε-greedy with ε = 0.2 and the discount factor is γ = 0.9. Figure 2.23 shows the average reward over 15 runs of the algorithm in the Supervised and Exploration experimental settings. The results illustrate that augmenting the action space with options can significantly accelerate learning. Note that the options learned with a two-level hierarchy (H2u) provide significant benefits over the options learned only with a single-level hierarchy (H1u). The hierarchical approaches achieve roughly the same average reward after 1000 episodes as A does after 5000 episodes.

Impact of policy parametrization. To evaluate the effect of meta-control policy parametrization, we also compare the rewards during DQN reinforcement learning with options discovered with MLP meta-control policies (H1s and H2s). Our empirical results suggest that a less expressive parametrization of the meta-control policy does not significantly hurt performance, and in some cases can even provide a benefit (Figure 2.23). This is highly important, because the high sample complexity of jointly training all levels of a deep hierarchy necessitates simplifying the meta-control policy, which would otherwise be represented by a one-level-shallower hierarchy. We conjecture that the reason for the improved performance of the less expressive model is that a more complex parametrization of the meta-control policy increases the prevalence of local optima in the inference problem, which may lead to worse options.
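The action-space augmentation used here can be sketched schematically as follows. This is not the thesis implementation: the environment interface (env.step returning a state, reward, done triple) and the option object's policy and termination_prob methods are hypothetical assumptions used only to illustrate how an augmented action index is executed.

```python
import numpy as np

def execute_augmented_action(env, state, action_index, atomic_actions, options,
                             rng=np.random):
    """Execute either an atomic action or a discovered option from the augmented
    action space; option indices follow the atomic-action indices."""
    if action_index < len(atomic_actions):
        return env.step(atomic_actions[action_index])          # primitive (atomic) action
    option = options[action_index - len(atomic_actions)]
    total_reward, done = 0.0, False
    while not done:
        state, reward, done = env.step(option.policy(state))   # follow the option's policy
        total_reward += reward
        if rng.rand() < option.termination_prob(state):        # option decides to terminate
            break
    return state, total_reward, done
```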

Exploration Setting. Finally, we demonstrate that options can also be useful when discovered from self-demonstrations by a partially trained agent, rather than by an expert supervisor. We run the same DQN as above, with only atomic actions, for 2000 episodes. We then use the greedy policy for the learned Q-function to generate 100 trajectories. We reset the Q-function (except for the baseline A), and run DQN again with the augmented action space. Figure 2.23 illustrates that even when these options are not discovered from demonstrations of optimal behavior, they are useful in accelerating reinforcement learning. The reason is that options are policy fragments that have been discovered to lead to interesting states, and therefore benefit from not being interrupted by random exploration.

Atari RAM Games The RAM variant of the popular Atari Deep Reinforcement Learning domains considers a game-playing agent which is given not the screen, but rather the RAM state of the Atari machine. This RAM state is a 128-byte vector that completely determines the state of the game, and can be encoded in one-hot representation as $s \in \mathbb{R}^{128 \times 256}$. The RAM state-space illustrates the power of an automated option discovery framework, as it would be infeasible to manually code options without carefully understanding the game's memory structure. With a discovery algorithm, we have a general-purpose approach to learning in this environment. All policies are parametrized with a deep network: three dense layers, each with tanh activations, and the output distribution is a softmax of the last layer, which has length |A| for control policies and two for termination. We use a single level of options, with the number of options tuned to optimize performance, as given in Figure 2.25. For each game, we first run the DQN for 1000 episodes, then generate 100 trajectories from the greedy policy and use them to discover options with DDO. The DQN has the same architecture, using ε-greedy exploration for 1000 episodes with ε = 0.05 and discount factor γ = 0.85 (similar to the parameters used by Sygnowski and Michalewski). Finally, we augment the action space with the discovered options and rerun DQN for 4000 episodes. We compare this to the baseline of running DQN for 5000 episodes with actions only. Figure 2.25 plots the estimated value, averaged over 50 trials, of the learned policies for five Atari games: Atlantis, Pooyan, Gopher, Space Invaders, and Sea Quest. In four out of five games, we see a significant acceleration of learning. The relative improvements are largest for the three hardest domains: Gopher, Sea Quest, and Space Invaders. It is promising that DDO offers such an advantage where other methods struggle. Figure 2.24 shows four frames from one of the options discovered by DDO for the Atlantis game. The option appears to identify an incoming alien and determine when to fire the gun, terminating when the alien is destroyed. As in the GridWorld experiments, the options are policy fragments that have been discovered to lead to high-value states, and therefore benefit from not being interrupted by random exploration.
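The one-hot RAM encoding described above is straightforward to sketch: each of the 128 bytes is mapped to a 256-dimensional indicator vector, giving a 128 × 256 binary representation.

```python
import numpy as np

def one_hot_ram(ram_bytes):
    """ram_bytes: length-128 array of uint8 values. Returns a (128, 256) binary matrix."""
    ram_bytes = np.asarray(ram_bytes, dtype=np.uint8)
    encoding = np.zeros((128, 256), dtype=np.float32)
    encoding[np.arange(128), ram_bytes] = 1.0
    return encoding
```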

Figure 2.24: Frames sampled from a demonstration trajectory assigned to one of the primitives learned by DDO.

Figure 2.25: Atari RAM Games: Average reward computed from 50 rollouts when running DQN with atomic actions for 1000 episodes, then generating 100 trajectories from the greedy policy, from which DDO discovers options in a 2-level hierarchy. DQN is restarted with the action space augmented by these options, which accelerates learning in comparison to running DQN with atomic actions for 5000 episodes. Results suggest significant improvements in 4 out of 5 domains.

Chapter 3

Sequential Windowed Inverse Reinforcement Learning: Learning Reward Decomposition From Data

In the previous chapter, I considered a model where sequences of state-action tuples from an expert demonstrator were available. Next, we consider a setting where only the state sequence is observed. This can happen when the demonstration modality differs from the execution setting, e.g., learning from third-person videos, motion capture of a human doing a task, or kinesthetic demonstrations. The data cannot be directly used to derive a controller since the actions are not visible. So, we approach this problem with a simple premise: a change in reward function implies a change in behavior. Like DDO, we apply the method to a small number of initial expert demonstrations to learn a latent reward structure. This requires two pieces: robustly detecting consistent changes in state trajectories, and constructing reward functions around those change points. As a motivating example from this problem setting, consider the widespread adoption of robot-assisted minimally invasive surgery (RMIS), which is generating datasets of kinematic and video recordings of surgical procedures [39]. Explicitly modeling the system dynamics can require learning a large number of parameters. This makes a direct system identification approach somewhat sensitive to any noise in the dataset, especially when the datasets are small. Furthermore, the states are often high-dimensional with combinations of kinematic and visual features. Suppose one could decompose a surgical task into a consistent sequence of sub-tasks. While each demonstrated motion may vary and be noisy, each demonstration also contains roughly the same order of true segments. This consistent, repeated structure can be exploited to infer global segmentation criteria.


By assuming a known sequential segment-to-segment ordering, the problem reduces to identifying a common set of segment-to-segment transition events, rather than corresponding entire trajectory segments across the whole dataset. This allows us to apply coarser, imperfect motion-based segmentation algorithms first to create a large set of candidate transitions. Then, we can filter this set by identifying transition events that occurred at similar times and states. Our experiments suggest that this approach has improved robustness and sample-efficiency, while approximating the behavior of more complicated dynamical-systems-based approaches in many real problems. I formalize this intuition into a new hierarchical clustering algorithm for unsupervised segmentation called Transition State Clustering. The proposed approach is also relevant to problems in other domains, but I will focus on results from surgical applications. TSC first applies a motion-based segmentation model to the noisy trajectories and identifies a set of candidate segment transitions in each. TSC then clusters the transition states (the states at the times transitions occur) in terms of kinematic, sensory, and temporal similarity. The clustering process is hierarchical: the transition states are first assigned to Gaussian Mixture clusters according to kinematic state, then these clusters are sub-divided using the sensory features, and finally by time. We present experiments where these sensory features are constructed from video. The learned clustering model can be applied to segment previously unseen trajectories by the same global criteria. To avoid setting the number of clusters at each level of the hierarchy in advance, the number of regions is determined by a Dirichlet Process prior. A series of merging and pruning steps removes sparse clusters and repetitive loops.

3.1 Transition State Clustering

This section describes the problem setting, assumptions, and notation. Let D “ tdiu be a set of demonstrations of a robotic task. Each demonstration of a task d is a discrete-time sequence of T state vectors in a feature-space X . The feature space is a concatenation of kinematic features X (e.g., robot position) and sensory features V (e.g., visual features from the environment). Definition 1 (Segmentation) A segmentation of a task is defined as a function S that as- signs each state in every demonstration trajectory to an integer 1, 2, ..., k:

$$S : d \mapsto (a_n)_{1,\dots,|d|}, \quad a_n \in \{1, \dots, k\},$$

and S is a non-decreasing function in time (no repetitions).

Suppose we are given a function that just identifies candidate segment endpoints based on the kinematic features. Such a function is weaker than a segmentation function since it does not globally label the detected segments. This leads to the following definition:

Definition 2 (Transition Indicator Function) A transition indicator function T is a function that maps each kinematic state in a demonstration d to {0, 1}:

$$T : d \mapsto (a_n)_{1,\dots,|d|}, \quad a_n \in \{0, 1\}.$$

The above definition naturally leads to a notion of transition states, the states and times at which transitions occur.

Definition 3 (Transition States) For a demonstration $d_i$, let $o_{i,t}$ denote the kinematic state, visual state, and time $(x, v, t)$ at time $t$. Transition States are the set of state-time tuples where the indicator is 1:

$$\Gamma = \bigcup_{i=1}^{N} \{o_{i,t} \in d_i : T(d_i)_t = 1\}.$$

The goal of TSC is to take the transition states $\Gamma$ and recover a segmentation function S. This segmentation function is stronger than the provided T since it not only indicates that a transition has occurred but also labels the segment transitions consistently across demonstrations.

Assumptions We assume that all possible true segments are represented in each demonstration by at least one transition (some might be false positives). Given the segmentation function $S(d_i)$, one can define a set of true transition states:

$$\Gamma^* = \{o_{i,t} \in d_i : S(d_i)_{t-1} \neq S(d_i)_t,\; t > 0\}.$$

These satisfy the following property:

$$\Gamma^* \subseteq \Gamma.$$

In other words, we assume that a subset of the transition states discovered by the indicator function corresponds to the true segment transitions. There can be false positives, but no false negatives (a demonstration where a segment transition is missed by the transition indicator function). Since the segmentation function is sequential and in a fixed order, this leads to a model where we are trying to find the $k - 1$ true segment-to-segment transition points in $\Gamma$.

Problem Statement and Method Overview These definitions allow us to formalize the transition state clustering problem.

Problem 2 (Transition State Clustering) Given a set of regular demonstrations D and a transition identification function T, find a segmentation S.

Candidate Transitions: To implement T, TSC fits a Gaussian mixture model to sliding windows over each of the demonstration trajectories and identifies consecutive times with different most-likely mixture components.

Algorithm 1 Transition Identification
1: Input: demonstrations D, a window size $\ell$, and a Dirichlet Process prior $\alpha$.
2: For each demonstration, generate a set of sliding windows $w_t^{(\ell)} = [x_{t-\ell}, \dots, x_t]^\top$. Let W be the set of all sliding windows across all demonstrations.
3: Fit a mixture model to W, assigning each state to its most likely component.
4: Identify times t in each demonstration when $w_t$ has a different most likely mixture component than $w_{t+1}$; the start and finish times ($t = 0$, $t = T_i$) are automatically transitions.
5: Return: a set of transition states $\Gamma$, the $(x, v, t)$ tuples at which transitions occur.
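A sketch of Algorithm 1 is below, under the assumptions that demonstrations are NumPy arrays and that scikit-learn's BayesianGaussianMixture stands in for the Dirichlet Process prior. The window size and component cap are illustrative choices.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def identify_transitions(demos, window=2, max_components=10):
    """demos: list of (T_i, d) state arrays. Returns a list of (demo_idx, t) transitions."""
    windows, index = [], []
    for i, d in enumerate(demos):
        for t in range(window, len(d)):
            # Window of `window`+1 states, normalized relative to its first state.
            windows.append((d[t - window:t + 1] - d[t - window]).ravel())
            index.append((i, t))
    gmm = BayesianGaussianMixture(n_components=max_components,
                                  weight_concentration_prior_type="dirichlet_process")
    labels = gmm.fit_predict(np.vstack(windows))
    transitions = []
    for j in range(len(labels) - 1):
        same_demo = index[j][0] == index[j + 1][0]
        if same_demo and labels[j] != labels[j + 1]:
            transitions.append(index[j + 1])       # (demo, time) where the component changes
    return transitions
```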

Transition State Clusters: The states at which those transitions occur are called transition states. TSC uses a GMM to cluster the transition states in terms of spatial and temporal similarity to find S.

Optimizations: To avoid setting the number of clusters at each level of the hierarchy in advance, the number of regions is determined by a Dirichlet Process prior. A series of merging and pruning steps removes sparse clusters and repetitive loops.

Gaussian Mixture Transition Identification While we can use any transition identification function to get $\Gamma$ (as long as it satisfies the assumptions), we present one implementation based on Gaussian Mixtures that we used in a number of our experiments. We found that this GMM approach was scalable (in terms of data and dimensionality) and had fewer hyper-parameters to tune than more complex models. Combined with the subsequent hierarchical clustering, this approach proved to be robust in all of our experiments.

Each demonstration trajectory $d_i$ is a trajectory of $T_i$ state-vectors $[x_1, \dots, x_{T_i}]$. For a given time t, we can define a window of length $\ell$ as:

$$w_t^{(\ell)} = [x_{t-\ell}, \dots, x_t]^\top$$

We can further normalize this window relative to its first state:

$$n_t^{(\ell)} = [x_{t-\ell} - x_{t-\ell}, \dots, x_t - x_{t-\ell}]^\top$$

This represents the “delta” in movement over the time span of a window. Then, for each demonstration trajectory we can also generate a trajectory of $T - \ell$ windowed states:

$$d^{(\ell)} = [n_\ell^{(\ell)}, \dots, n_T^{(\ell)}]$$

Over the entire set of windowed demonstrations, we collect a dataset of all of the $n_t^{(\ell)}$ vectors. We fit a GMM model to these vectors. The GMM model defines m multivariate Gaussian distributions and a probability that each observation $n_t^{(\ell)}$ is sampled from each of the m distributions.

Algorithm 2 Transition State Clustering
1: Input: transition states $\Gamma$, pruning parameter $\rho$.
2: Fit a mixture model to the set of transition states $\Gamma$ in the kinematic state space.
3: Fit a mixture model to the sensory features of the transitions within every kinematic cluster i.
4: Fit a mixture model to the times within every kinematic and sensory cluster pair (i, j).
5: Remove clusters whose transition states come from fewer than $\rho \cdot N$ distinct demonstrations.
6: Output: a set of transitions, which are regions of the state-space and temporal intervals defined by Gaussian confidence intervals.
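A rough sketch of this hierarchical clustering with pruning is shown below, again using scikit-learn's BayesianGaussianMixture as a stand-in for the DP-GMM at each level. The array-based inputs and the component cap are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def dp_gmm_labels(X, max_components=10):
    model = BayesianGaussianMixture(
        n_components=min(max_components, len(X)),
        weight_concentration_prior_type="dirichlet_process")
    return model.fit_predict(X)

def transition_state_clustering(kin, vis, times, demo_ids, rho=0.8):
    """kin: (M, dk), vis: (M, dv), times: (M,), demo_ids: (M,) NumPy arrays, one row per
    transition state. Returns clusters as lists of member indices after pruning."""
    n_demos = len(set(demo_ids.tolist()))
    kin_labels = dp_gmm_labels(kin)
    clusters = []
    for ki in set(kin_labels.tolist()):
        mask_k = kin_labels == ki
        vis_labels = dp_gmm_labels(vis[mask_k])              # sensory sub-clusters
        for vj in set(vis_labels.tolist()):
            idx_v = np.where(mask_k)[0][vis_labels == vj]
            t_labels = dp_gmm_labels(times[idx_v].reshape(-1, 1))   # temporal sub-clusters
            for tk in set(t_labels.tolist()):
                members = idx_v[t_labels == tk]
                # Prune clusters supported by fewer than rho * N distinct demonstrations.
                if len(set(demo_ids[members].tolist())) >= rho * n_demos:
                    clusters.append(members)
    return clusters
```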

We annotate each observation with the most likely mixture component. Times at which $n_t^{(\ell)}$ and $n_{t+1}^{(\ell)}$ have different most likely components are marked as transitions. This model captures some dynamical behaviors while not requiring explicit modeling of the state-to-state transition function. Sometimes the MDP's states are more abstract and do not map to a space where the normalized windows make sense. We can still apply the same method when we only have a positive definite kernel function over all pairs of states $\kappa(s_i, s_j)$. We can construct this kernel function for all of the states observed in the demonstrations and apply Kernelized PCA to the features before learning the transitions, a technique used in Computer Vision [84]. The top $p_1$ eigenvalues define a new embedded feature vector for each $\omega$ in $\mathbb{R}^{p_1}$. We can then apply the algorithm above in this embedded feature space.

TSC Inference Algorithm We present the clustering algorithm, which is summarized in Algorithm 2. In a first pass, the transition states are clustered with respect to the kinematic states, then sub-clustered with respect to the sensory states, and then sub-clustered temporally. The sub-clusters can be used to formulate the segmentation criteria.

Kinematic Step: We want our model to capture the idea that transitions occurring in similar positions in the state-space across all demonstrations are actual transitions, and we would like to aggregate these transitions into logical events. Hypothetically, if we had infinite demonstrations, $\Gamma$ would define a density of transition events throughout the state-space. The modes of this density, which intuitively represent the propensity of a state x to trigger a segment change, are of key interest to us. We can think of the set of identified transition states $\Gamma$ as a sample of this density. We fit a DP-GMM to the kinematic features of the transition states. Each transition state will have an assignment probability to one of the mixture components. We convert this to a hard assignment by assigning the transition state to the most likely component.

Sensory Step: Then, we apply the second level of DP-GMM fitting over the sensory features (if available). Within each kinematic cluster, we fit a DP-GMM to find sub-clusters in the sensory features. Note that the transitions were only identified with kinematic features; this step grounds the detected transitions in sensory clusters.

Temporal Step: Finally, we apply the last level of DP-GMM fitting over the time axis. Without temporal localization, the transitions may be ambiguous. For example, in a figure-8 motion, the robot may pass over a point twice in the same task. Conditioned on the particular state-space cluster assignment, we fit a DP-GMM to each subset of times. The final result contains sub-clusters that are indexed both in the state-space and in time.

Enforcing Consistency: The learned clusters will vary in size, as some may consist of transitions that appear in only a few demonstrations. The goal of TSC is to identify those clusters that correspond to state and time transition conditions common to all demonstrations of a task. We frame this as a pruning problem, where we want to enforce that all learned clusters contain transitions from at least a fraction $\rho$ of the distinct demonstrations. Clusters whose constituent transition states come from fewer than a fraction $\rho$ of the demonstrations are pruned. $\rho$ should be set based on the expected rarity of outliers. For example, if $\rho$ is 100%, then the only mixture components that are found are those with at least one transition state from every demonstration (i.e., the regularity assumption). If $\rho$ is less than 100%, then every mixture component must cover some subset of the demonstrations. In our experiments, we set the parameter $\rho$ to 80% and show the results with and without this step.

Segmentation Criteria: Finally, if there are k remaining clusters $\{C_1, \dots, C_k\}$, we can use these clusters to form a criterion for segmentation. Each cluster is formed using a GMM triplet in the kinematic state, visual state, and time. The quantiles of the three GMMs define an ordered sequence of regions $[\rho_1, \dots, \rho_k]$ over the state-space, and each of these regions has an associated time interval defined by the Gaussian confidence interval for some confidence level $z_\alpha$.

TSC Simulated Experimental Evaluation We evaluate TSC’s robustness in the following way:

1. Precision. Results suggest TSC significantly reduces the number of false positive segments in simulated examples with noise.

2. Recall. Among algorithms that use piecewise linear segment models, results suggest TSC recovers segments of a generated piecewise linear trajectory more consistently in the presence of process and observation noise.

3. Applicability to Real-World Data. Results suggest that TSC recovers qualitatively relevant segments in real surgical trajectory data.

Precision in Synthetic Examples Our first experiment evaluates the following hypothesis: TSC significantly reduces the number of false positive segments in a simple simulated example with noise. These experiments evaluate TSC against algorithms with a single level of clustering. We compare the following segmentation criteria (7 alternatives plus TSC):

1. Zero-Velocity Crossing (VEL): This algorithm detects a change in the sign of the velocity.

2. Smoothed Zero-Velocity Crossing (VELS): This algorithm applies a low-pass filter (exponentially weighted moving average) to the trajectory, and then detects a change in the sign of the velocity.

3. Acceleration (ACC): This algorithm detects any change in the velocity by looking for non-zero acceleration.

4. Gaussian Mixture Model (GMM): This algorithm applies a GMM to the observed states and detects changes in the most likely assignment. The number of clusters was set to 2.

5. Windowed Gaussian Mixture Model (GMMW): This algorithm is the first phase of TSC. It applies a GMM to windows of size 2, and detects changes in most likely assignment. The number of clusters was set to 2, unlike in TSC where we use the DP to set the number of clusters.

6. Auto-Regressive Mixture (AR): This model fits a piecewise linear transition law to the observed data.

7. Coresets (CORE): We evaluate against a standard coreset model [120, 130], and the particular variant is implemented with weighted k-means. We applied this to the same augmented state-vector as in the previously mentioned GMM.

8. TSC: Our proposed approach with a pruning threshold of 0.8 and no loop compaction.

Bouncing Ball: We first revisit the bouncing-ball example from the introduction, which can be modeled as the following 1D double-integrator system:

$$\ddot{x} = -9.8$$

This system is observed with additive Gaussian white noise with std. 10 (Moderate Noise):

$$y = x + N(0, 10)$$

and std. 20 (High Noise):

$$y = x + N(0, 20)$$

Figure 3.1: (Above) The position and velocity of the bouncing ball without noise. (Below) 5 trajectories of the ball with different realizations of the noise.

The system is initialized at $x_0 = 122.5$ and bounces when $x = 20$, at which point the velocity is negated. Figure 3.1 illustrates the ideal trajectory and noisy realizations of these trajectories. We apply the segmentation algorithms to the trajectories and plot the results in Figure 3.2. When there is no noise, all of the algorithms are equally precise, and there is no trouble corresponding segments across demonstrations. All of the “rate-of-change” methods (VEL, VELS, ACC) reliably identify the point where the ball bounces. The GMM and the Coreset methods do not segment the trajectory at the bounce point. On the other hand, the windowed GMM takes two consecutive positions and velocities into account during the clustering. Similarly, the autoregressive model can accurately identify the bounce point. With no noise, TSC differs little from the windowed GMM. Differences arise when we observe the trajectory with additive Gaussian noise. The “rate-of-change” methods have some spurious segmentation points due to noise. The GMM-based methods are more robust to this noise, and they retain similar precision. This motivates our choice of a windowed GMM approach for the first phase of the TSC algorithm.
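A short simulation sketch of the noisy bouncing-ball example is shown below: a 1-D double integrator with gravity, a bounce (velocity negation) at x = 20, initial height x0 = 122.5, and additive Gaussian observation noise. The time step and horizon are illustrative assumptions.

```python
import numpy as np

def simulate_bouncing_ball(T=50, dt=0.1, x0=122.5, bounce_at=20.0, noise_std=10.0,
                           rng=np.random):
    x, v, observations = x0, 0.0, []
    for _ in range(T):
        v += -9.8 * dt
        x += v * dt
        if x <= bounce_at:                 # bounce: negate the velocity
            x, v = bounce_at, -v
        observations.append(x + rng.randn() * noise_std)   # noisy observation y = x + N(0, std)
    return np.array(observations)

# Five noisy demonstrations, as in Figure 3.1:
# demos = [simulate_bouncing_ball() for _ in range(5)]
```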

Figure 3.2: Plots the identified transitions with each segmentation algorithm with and without noise. While all techniques are precise when there is no noise, TSC is the most robust in the presence of noise.

However, the GMM approaches still have some spurious transitions. With these spurious points, it becomes challenging to reliably correspond trajectories across segments. So, TSC applies a second phase of clustering to correspond the transitions and prune the sparse clusters. This results in accurate segmentation even in the presence of noise. As the noise increases, TSC is still able to find accurate segments. In the high-noise case, the bounce point is still identified in 4 out of 5 trajectories. It is important to note that we do not claim that one segmentation algorithm is more accurate than another, or that TSC more accurately reflects “real” segments. These results only suggest that TSC is more precise than alternatives; that is, given the assumptions in TSC, it consistently recovers segments according to those assumptions. The next experiments will study the recall characteristics.

Bouncing Ball with Air Resistance: In the first set of experiments, we illustrated TSC’s robustness to variance in the state-space. Next, we illustrate how TSC can still correspond segments with temporal variation. Consider the dynamics of the bouncing ball with an additional term to account for air resistance:

ẍ = −9.8 + K_v ẋ

We draw the air-resistance constant K_v uniformly from K_v ∼ U[1, 5]. The consequence is that the ball will bounce at different times in different trajectories. Figure 3.3 illustrates the results. In the 5 trajectories, the ball bounces between time-step 5 and 7. With no noise, VEL, VELS, ACC, GMMW, and TSC can identify the bounce point.

Figure 3.3: Plots the identified transitions with each segmentation algorithm with and without noise. In this example, temporal variation is added by incorporating a random “air resistance” factor. TSC is consistent even in the presence of this temporal variation.

Then, the system is observed with additive Gaussian white noise with std. 10:

y = x + N(0, 10)

We find that TSC recovers a consistent set of segments even with the temporal variation.

Hybrid Approaches: In the previous experiments, we presented TSC using a windowed GMM approach to identify transitions. Next, we consider TSC with alternative transition identification functions. Consider a “Figure 8” trajectory defined parametrically as:

x = cos(t), y = 0.5 sin(2t)

The trajectory is visualized in Figure 3.4. The trajectory starts at the far right and progresses until it returns to the same spot. Velocity-based segmentation finds one transition point where there is a change in direction (far left of the trajectory) (Figure 3.5). A windowed GMM where the number of clusters is set by a DP finds three transition points.

Figure 3.4: A “figure 8” trajectory in the plane, and 5 noisy demonstrations. The trajectory starts at the far right and progresses until it returns to the same spot.

Figure 3.5: Plots the identified transitions with each segmentation algorithm with and without noise. Velocity-based segmentation finds one transition point where there is a change in direction. A windowed GMM where the number of clusters is set by a DP finds three transition points. TSC can improve the precision of both techniques.

These three points correspond to the far left point as well as the crossing point in the figure 8 (which happens twice). These are two different segmentation criteria, and both are reasonable with respect to their respective assumptions. Next, this parametric trajectory is observed with additive Gaussian noise of std. 0.1 (Figure 3.4). We see that both the GMM approach and the velocity approach have several spurious transitions (Figure 3.5). TSC can improve the precision of both techniques by adding a layer of clustering.

Rotations

Handling orientations is a challenging problem due to the topology of SO(3) [126]. As an example of what can go wrong, consider a 2D square rotating in the plane. We construct a 1 m × 1 m 2D square and track a point on one corner of the square. The square rotates clockwise at π/10 radians per time-step for 10 time-steps, then switches and rotates in the other direction at the same angular speed. The state of the system is the (x, y) position of the corner. We add Gaussian observation noise with a standard deviation of 0.1 m to the observed trajectories.
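As a concrete illustration of this setup, the sketch below generates the noisy corner trajectory. The initial corner angle, the assumption that the tracked corner lies at distance √2/2 from the center, and the per-step rotation rate of π/10 are illustrative choices consistent with the description above, not the exact experimental configuration.

import numpy as np

def rotating_square_corner(steps=10, rate=np.pi / 10, noise_std=0.1, seed=0):
    """Track the (x, y) position of one corner of a 1x1 square that rotates
    clockwise for `steps` time-steps and then counter-clockwise, observed
    with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    radius = np.sqrt(2) / 2.0               # corner of a unit square about its center
    theta, angles = np.pi / 4, []
    for _ in range(steps):
        theta -= rate                        # clockwise rotation
        angles.append(theta)
    for _ in range(steps):
        theta += rate                        # reversed rotation direction
        angles.append(theta)
    corner = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return corner + rng.normal(0.0, noise_std, size=corner.shape)

demos = [rotating_square_corner(seed=s) for s in range(5)]   # 5 noisy trajectories
print(demos[0].shape)                                        # (20, 2)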

Figure 3.6: Plots the identified transitions with each segmentation algorithm with and without noise. While all techniques are precise when there is no noise, TSC is the most robust in the presence of noise but finds additional segments.

We apply the segmentation algorithms to 5 trajectories and plot the results in Figure 3.6. As before, with no noise, all of the techniques are equally precise. In this example, there is a difference between how the different techniques segment the trajectories. The rate-of-change methods segment the trajectory at the point when the block changes rotation direction. The GMM and the windowed GMM approaches cut the trajectory into 3 even segments, missing the direction change. TSC cuts the trajectory into 4 segments, including the direction change. TSC differs from the windowed GMM because it sets the number of clusters using the Dirichlet Process prior. With noise, the rate-of-change techniques have a number of spurious segments. The GMM-based approaches are more robust, and TSC improves on the windowed GMM even further by clustering the detected transitions. However, if the initial transitions were found in angular space, then TSC would have found one segment. In this sense, the definition of the state-space changes the segments found. We hope to explore these issues in more detail in future work.

Figure 3.7: One of 20 instances with random goal points G1, G2, G3. (a) Observations from a simulated demonstration with three regimes, (b) observations corrupted with Gaussian white sensor noise, (c) observations corrupted with low-frequency process noise, and (d) observations corrupted with an inserted loop. See Figure 3.11 for evaluation on loops.

Figure 3.8: (a) Nominal trajectory, (b) 1 std. of high frequency observation noise, (c) 2 std. of high frequency observation noise, (d) 1 std. of low frequency process noise, and (e) 2 std. of low frequency process noise.

Recall in Synthetic Examples

Comparing different segmentation models can be challenging due to differing segmentation criteria. However, we identified some algorithms that identify locally linear or near-linear segments. We developed a synthetic dataset generator that produces piecewise linear segments and compared the algorithms on the generated dataset. Note, we do not intend this to be a comprehensive evaluation of the accuracy of the different techniques, but rather a characterization of the approaches on a locally linear example to study the key tradeoffs.

We model the trajectory of a point robot with two-dimensional position state (x, y) between k goal points {g_1, ..., g_k}. We apply position control to guide the robot to the targets, and without disturbance this motion is linear (Figure 3.7a). We add various types of disturbances (and in varying amounts), including Gaussian observation noise, low-frequency process noise, and repetitive loops (Figure 3.7b-d). We report noise values in terms of standard deviations. Figure 3.8 illustrates the relative magnitudes. A demonstration d_i is a sample trajectory from this system.

Task: Every segmentation algorithm will be evaluated on its ability to identify the k − 1 segments (i.e., the paths between the goal points). Furthermore, we evaluate algorithms on random instances of this task. In the beginning, we select 3 random goal points. From a fixed initial position, we control the simulated point robot to the goal points with position control. Without any disturbance, this follows a linear motion. For a given noise setting, we sample demonstrations from this system and apply/evaluate each algorithm (a minimal sketch of this generator appears after the list of compared algorithms below). We present results aggregated over 20 such random instances. This is important since many of the segmentation algorithms proposed in the literature have some crucial hyper-parameters, and we present results with a single choice of parameters averaged over multiple tasks. This way, the hyper-parameter tuning cannot overfit to any given instance of the problem and has to be valid for the entire class of tasks. We believe that this is important since tuning these hyper-parameters in practice (i.e., not in simulation) is challenging since there is no ground truth. The experimental code is available at: http://berkeleyautomation.github.io/tsc/.

Algorithms: We compare TSC against alternatives where the authors explicitly find (or approximately find) locally linear segments. It is important to reiterate that different segmentation techniques optimize different objectives, and this benchmark is meant to characterize the performance on a common task. All of the techniques are based on Gaussian distributions or linear auto-regressive models.

1. (GMM) Same as the previous experiment. In this experiment, we set the number of clusters to the optimal choice of 3 without automatic tuning.

2. (GMM+HMM) A natural extension to this model is to enforce a transition structure on the regimes with a latent Markov Chain [4, 21, 71, 127]. We use the same state vector as above, without time augmentation as this is handled by the HMM. We fit the model using the forward-backward algorithm.

3. Coresets (Same as previous experiment).

4. (HSMM) We evaluated a Gaussian Hidden Semi-Markov Model, implemented with the package pyhsmm. We directly applied this model to the demonstrations with no augmentation or normalization of features, as in the GMM approaches. We ran our MCMC sampler for 10000 iterations, discarding the first 2500 as burn-in and thinning the chain by 15.

5. (AR-HMM) We evaluated a Bayesian Autoregressive HMM model as used in [89], implemented with the packages pybasicbayes and pyhsmm-ar. The autoregressive order was 10, and we ran our MCMC sampler for 10000 iterations, discarding the first 2500 as burn-in and thinning the chain by 15.
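Below is a minimal sketch of the point-robot demonstration generator referenced above. The controller gain, convergence tolerance, and step cap are illustrative assumptions; the released experimental code linked above is the authoritative implementation.

import numpy as np

def generate_demo(goals, start=(0.0, 0.0), gain=0.1, obs_std=0.0,
                  process_std=0.0, max_steps=500, seed=0):
    """Drive a 2D point robot through a sequence of goal points with
    proportional position control; return the noisy trajectory and the
    ground-truth segment end indices (one segment per goal)."""
    rng = np.random.default_rng(seed)
    x = np.array(start, dtype=float)
    traj, boundaries = [], []
    for g in goals:
        g = np.asarray(g, dtype=float)
        for _ in range(max_steps):
            if np.linalg.norm(g - x) <= 0.05:
                break
            x = x + gain * (g - x) + rng.normal(0.0, process_std, size=2)
            traj.append(x + rng.normal(0.0, obs_std, size=2))   # noisy observation
        boundaries.append(len(traj))                            # end of this segment
    return np.array(traj), boundaries

rng = np.random.default_rng(1)
goals = rng.uniform(-1.0, 1.0, size=(3, 2))                     # 3 random goal points
demos = [generate_demo(goals, obs_std=0.05, seed=s) for s in range(50)]
print(demos[0][1])                                              # segment boundaries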

Evaluation Metric: There is considerable debate on metrics to evaluate the accuracy of unsupervised segmentation and activity recognition techniques, e.g., frame accuracy [wu2015watch] and Hamming distance [fox2009sharing]. Typically, these metrics have two steps: (1) corresponding predicted segments to the ground truth, and (2) measuring the similarity between corresponded segments. We have made this feature extensible and evaluated several different accuracy metrics (Jaccard Similarity, Frame Accuracy, Segment Accuracy, Intersection over Union). We found that the following procedure led to the most insightful results for differentiating the techniques. In the first phase, we match segments in our predicted sequence to those in the ground truth. We do this with a procedure identical to the one proposed in [132]. We define a bipartite graph of predicted segments to ground-truth segments, and add weighted edges where weights represent the overlap between a predicted segment and a ground-truth segment (i.e., the recall over time-steps). Each predicted segment is matched to its highest-weighted ground-truth segment. Each predicted segment is assigned to exactly one ground-truth segment, while a ground-truth segment may have none, one, or more corresponding predictions. After establishing the correspondence between predictions and ground truth, we count a true positive (a ground-truth segment is correctly identified) if the overlap (intersection-over-union) between the ground-truth segment and its corresponding predicted segments is more than a default threshold of 60%. Then, we compute Segment Accuracy as the ratio of the ground-truth segments that are correctly detected. In [132], the authors use a 40% threshold but apply the metric to real data. Since this is a synthetic example, we increase this threshold to 60%, which we empirically found accounted for boundary effects, especially in the Bayesian approaches (i.e., repeated transitions around segment endpoints).
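The following sketch implements a simplified version of this Segment Accuracy computation, with segments represented as (start, end) frame intervals. The matching weight here is intersection-over-union rather than the per-time-step recall used in the text, and matched predictions are merged into a single covering interval for the threshold test; both are simplifications for illustration.

def interval_iou(a, b):
    """Intersection-over-union of two (start, end) time intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_accuracy(predicted, ground_truth, iou_threshold=0.6):
    """Match each predicted segment to its best-overlapping ground-truth
    segment, then count a ground-truth segment as detected if the IoU with
    the union of its matched predictions exceeds the threshold."""
    matched = {i: [] for i in range(len(ground_truth))}
    for p in predicted:
        overlaps = [interval_iou(p, g) for g in ground_truth]
        best = max(range(len(ground_truth)), key=lambda i: overlaps[i])
        if overlaps[best] > 0:
            matched[best].append(p)
    detected = 0
    for i, g in enumerate(ground_truth):
        if not matched[i]:
            continue
        merged = (min(p[0] for p in matched[i]), max(p[1] for p in matched[i]))
        if interval_iou(merged, g) >= iou_threshold:
            detected += 1
    return detected / len(ground_truth)

# Three ground-truth segments, four slightly fragmented predictions.
print(segment_accuracy([(0, 9), (10, 14), (15, 20), (21, 30)],
                       [(0, 10), (10, 20), (20, 30)]))          # 1.0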

Accuracy vs. Noise

In our first experiment, we measured the segment accuracy for each of the algorithms with 50 demonstrations. We also varied the amount of process and observation noise in the system. As Figure 3.8 illustrates, this is a very significant amount of noise in the data, and successful techniques must exploit the structure in multiple demonstrations. Figure 3.9a illustrates the performance of each of the techniques as a function of high-frequency observation noise. Results suggest that TSC is more robust to noise than the alternatives (nearly 20% more accurate at 2.5 std of noise). The Bayesian ARHMM approach is nearly identical to TSC when the noise is low but quickly loses accuracy as more noise is added. We attribute this robustness to TSC’s pruning step, which ensures that only transition state clusters with sufficient coverage across all demonstrations are kept. These results are even more pronounced for low-frequency process noise (Figure 3.9b). TSC is 49% more accurate than all competitors at 2.5 std of added noise. We find that the Bayesian approaches are particularly susceptible to such noise. Furthermore, Figure 3.9c shows TSC requires no more data than the alternatives to achieve such robustness. Another point to note is that TSC is solved much more efficiently than ARHMM or HSMM, which require expensive MCMC sampling. While parameter inference in these models can be solved more efficiently (but approximately) with mean-field stochastic variational inference, we found that the results were not as accurate. TSC is about 6x slower than using Coresets or the direct GMM approach, but it is over 100x faster than the MCMC for the ARHMM model. Figure 3.10 compares the runtime of each of the algorithms as a function of the number of demonstrations.

Figure 3.9: Each data point represents 20 random instances of a 3-segment problem with varying levels of high-frequency noise, low-frequency noise, and demonstrations. We measure the segmentation accuracy for the compared approaches. (A) TSC finds a more accurate segmentation than all of the alternatives even under significant high-frequency observation noise, (B) TSC is more robust to low-frequency process noise than the alternatives, (C) the Bayesian techniques solved with MCMC (ARHMM, HSMM) are more sensitive to the number of demonstrations provided than the others.

TSC Hyper-Parameters: Next, we explored the dependence of the performance on the hyper-parameters of TSC. We focus on the window size and the pruning parameter. Figure 3.11a shows how varying the window size affects the performance curves. Larger window sizes can reject more low-frequency process noise. However, larger windows are also less efficient when the noise is low. Similarly, Figure 3.11b shows how increasing the pruning parameter affects the robustness to high-frequency observation noise. However, a larger pruning parameter is less efficient at low noise levels. Based on these curves, we selected (w = 3, ρ = 0.3) in our synthetic experiments.

Loops: Finally, we evaluated 4 algorithms on how well they can detect and adjust for loops. TSC compacts adjacent motions that are overly similar, while HMM-based approaches correspond similar-looking motions. An HMM grammar over segments is clearly more expressive than TSC’s, and we explore whether it is necessary to learn a full transition structure to compensate for loops. We compare the accuracy of the different segmentation techniques in detecting that a loop is present (Figure 3.12). Figure 3.12a shows that TSC is competitive with the HMM approaches as we vary the observation noise; however, the results suggest that ARHMM provides the most accurate loop detection. On the other hand, Figure 3.12b suggests that process noise has a very different effect: TSC is actually more accurate than the HMM approaches when the process noise is high, even without learning a transition structure.

Figure 3.10: TSC is about 6x slower than using Coresets or the direct GMM approach, but it is over 100x faster than the MCMC for the ARHMM model.

Figure 3.11: (A) shows the performance curves for different choices of window size as a function of the process noise. Larger windows can reject higher amounts of process noise but are less efficient at low noise levels. (B) shows the performance curves for different choices of the pruning threshold. Larger pruning thresholds are more robust to high amounts of observation noise but less accurate in the low-noise setting. We selected (w = 3, ρ = 0.3) in our synthetic experiments.

Figure 3.12: (A) illustrates the accuracy of TSC’s compaction step as a function of observation noise. TSC is competitive with the HMM-based approaches without having to model the full transition matrix. (B) TSC is actually more robust to low-frequency process noise in the loops than the HMM-based approaches.

Scaling with Dimensionality: We investigate how the accuracy of TSC scales with the dimensionality of the state-space. As in the previous experiments, we measured the segment accuracy for each of the algorithms with 50 demonstrations. This time, we generated the line segments in increasingly higher-dimensional spaces (from 2-D to 35-D). The noise added to the trajectories has a std of 0.1. Figure 3.13a plots the segment accuracy as a function of the dimensionality of the state-space. While the accuracy of TSC does decrease as the dimensionality increases, it is more robust than some of the alternatives: ARHMM and HSMM. One possible explanation is that both of those techniques rely on Gibbs sampling for inference, which is somewhat more sensitive to dimensionality than the expectation-maximization inference procedure used in GMM and GMM+HMM. Figure 3.13b shows one aspect of TSC that is more sensitive to the dimensionality: the loop compaction step requires dynamic time warping and then a comparison to fuse repeated segments together. This step is not as robust in higher-dimensional state-spaces, possibly due to the use of the L2 distance metric to compare candidate partial trajectories for compaction. TSC runs in 4 seconds on the 2-D case, 16 seconds on the 10-D case, and 59 seconds on the 35-D case.

Surgical Data Experiments

We describe the three tasks used in our evaluation and the corresponding manual segmentation (Figure 3.14). This will serve as ground truth when qualitatively evaluating our segmentation on real data. This set of experiments primarily evaluates the utility of the segments learned by TSC. Data was collected beforehand as a part of prior work. Our hypothesis is that even though TSC is unsupervised, it identifies segments that often align with manual annotations.

Figure 3.13: We investigate how the accuracy of TSC scales with the dimensionality of the state-space. In (A) we consider the problem with no loops or compaction, and in (B) we measure the accuracy of the compaction step as a function of dimensionality.

In all of our experiments, the pruning parameter ρ is set to 80% and the compaction heuristic δ is set to 1 cm.


Figure 3.14: Hand annotations of the three tasks: (a) circle cutting, (b) needle passing, and (c) suturing. Right arm actions are listed in dark blue and left arm actions are listed in yellow.

The state-space is the 6D end-effector position. In some experiments, we augment this state-space with the following visual features:

1. Grasp. 0 if empty, 1 otherwise.

2. Needle Penetration. We use an estimate of the penetration depth based on the robot kinematics to encode this feature. If there is no penetration (as detected by video), the value is 0, otherwise the value of penetration is the robot’s z position.

Our goal with these features was to illustrate that TSC applies to general state-spaces as well as spatial ones, and not to address the perception problem. These features were constructed via manual annotation, where the Grasp and Needle Penetration features were identified by reviewing the videos and marking the frames at which they occurred.

Circle Cutting: A 5 cm diameter circle is drawn on a piece of gauze. The first step is to cut a notch into the circle. The second step is to cut clockwise half-way around the circle. Next, the robot transitions to the other side, cutting counter-clockwise. Finally, the robot finishes the cut at the meeting point of the two cuts. As the left arm’s only action is to maintain the gauze in tension, we exclude it from the analysis. In Figure 3.14a, we mark 6 manually identified transition points for this task from [87]: (1) start, (2) notch, (3) finish 1st cut, (4) cross-over, (5) finish 2nd cut, and (6) connect the two cuts. For the circle cutting task, we collected 10 demonstrations by researchers who were not surgeons but were familiar with operating the da Vinci Research Kit (dVRK). We also perform experiments using the JIGSAWS dataset [39] consisting of surgical activity for human motion modeling. The dataset was captured using the da Vinci Surgical System from eight surgeons with different levels of skill, performing five repetitions each of Needle Passing and Suturing.

Needle Passing: We applied TSC to 28 demonstrations of the needle passing task. The robot passes a needle through a hoop using its right arm, then uses its left arm to pull the needle through the hoop. Then, the robot hands the needle off from the left arm to the right arm. This procedure is repeated four times, as illustrated with a manual segmentation in Figure 3.14b.

Suturing: Next, we explored 39 examples of a 4-throw suturing task (Figure 3.14c). Using the right arm, the first step is to penetrate one of the points on the right side. The next step is to force the needle through the phantom to the other side. Using the left arm, the robot pulls the needle out of the phantom and then hands it off to the right arm for the next point.

Results

Circle Cutting: Figure 3.15a shows the transition states obtained from our algorithm, and Figure 3.15b shows the TSC clusters learned (numbered by time interval midpoint). The algorithm found 8 clusters, one of which was pruned using our ρ = 80% threshold rule. The remaining 7 clusters correspond well to the manually identified transition points. It is worth noting that there is one extra cluster (marked 2′) that does not correspond to a transition in the manual segmentation. At 2′, the operator finishes a notch and begins to cut. While at a logical level notching and cutting are both penetration actions, they correspond to two different linear transition regimes due to the positioning of the end-effector. Thus, TSC separates them into different clusters even though the human annotators did not. This illustrates why supervised segmentation is challenging. Human annotators segment trajectories on boundaries that are hard to characterize mathematically, e.g., whether frame 34 or frame 37 is the segment boundary. Supervisors may also miss crucial motions that are useful for automation or learning.

Needle Passing: In Figure 3.16a, we plot the transition states in (x, y, z) end-effector space for both arms. We find that these transition states correspond well to the logical segments of the task (Figure 3.14b). These demonstrations are noisier than the circle cutting demonstrations, and there are more outliers. The subsequent clustering finds 9 clusters (2 pruned). Next, Figures 3.16b-c illustrate the TSC clusters. We find that, again, TSC learns a small parametrization for the task structure, with the clusters corresponding well to the manual segments. However, in this case, the noise does lead to a spurious cluster (4, marked in green). One possible explanation is that the demonstrations contain many adjustments to avoid colliding with the needle hoop and the other arm while passing the needle through, leading to numerous transition states in that location.


Figure 3.15: (a) The transition states for the circle cutting task are marked in black. (b) The TSC clusters, which are clusters of the transition states, are illustrated with their 75% confidence ellipsoids.

Figure 3.16: (a) The transition states for the task are marked in orange (left arm) and blue (right arm). (b-c) The TSC clusters, which are clusters of the transition states, are illustrated with their 75% confidence ellipsoids for both arms.

Suturing: In Figure 3.17, we show the transition states and clusters for the suturing task. As before, we mark the left arm in orange and the right arm in blue. This task was far more challenging than the previous tasks as the demonstrations were inconsistent. These inconsistencies were in the way the suture is pulled after insertion (some pull to the left, some to the right, etc.), leading to transition states all over the state space. Furthermore, there were numerous demonstrations with looping behaviors for the left arm. In fact, the DP-GMM method gives us 23 clusters, 11 of which represent less than 80% of the demonstrations and thus are pruned (we illustrate the effect of the pruning in the next section). In the early stages of the task, the clusters clearly correspond to the manually segmented transitions. As the task progresses, we see that some of the later clusters do not.

Table 3.1: This table compares transitions learned by TSC and transitions identified by manual annotators in the JIGSAWS dataset. We found that the transitions mostly aligned. 83% and 73% of transition clusters for needle passing and suturing respectively contained exactly one surgeme transition when both kinematics and vision were used. Results suggest that the hierarchical clustering is more suited for mixed video and kinematic feature spaces.

                                  No. of Surgeme Segments   No. of Clusters   seg-surgeme   surgeme-seg
Needle Passing TSC(Kin+Video)           14.4 ± 2.57               11              83%           74%
Needle Passing TSC(Video)               14.4 ± 2.57                7              62%           69%
Needle Passing TSC(Kin)                 14.4 ± 2.57               16              87%           62%
Needle Passing TSC(VELS)                14.4 ± 2.57               13              71%           70%
Needle Passing TSC(No-H)                14.4 ± 2.57                5              28%           34%
Suturing TSC(Kin+Video)                 15.9 ± 3.11               13              73%           66%
Suturing TSC(Video)                     15.9 ± 3.11                4              21%           39%
Suturing TSC(Kin)                       15.9 ± 3.11               13              68%           61%
Suturing TSC(VELS)                      15.9 ± 3.11               17              48%           57%
Suturing TSC(No-H)                      15.9 ± 3.11                9              51%           52%


Figure 3.17: (a) The transition states for the task are marked in orange (left arm) and blue (right arm). (b-c) The clusters, which are clusters of the transition states, are illustrated with their 75% confidence ellipsoids for both arms.

Comparison to Surgemes

Surgical demonstrations have an established set of primitives called surgemes, and we evaluate whether the segments discovered by our approach correspond to surgemes. In Table 3.1, we compare the number of TSC segments for needle passing and suturing to the number of annotated surgeme segments. We apply different variants of the TSC algorithm and evaluate their ability to recover segments similar to surgemes. We consider: (Kin+Video), which is the full TSC algorithm; (Kin), which only uses kinematics; (Video), which only uses the visual annotations; (VELS), which uses the zero-crossing velocity heuristic to get the initial transitions; and (No-H), which treats all of the variables as one big feature space and does not hierarchically cluster. A key difference between our segmentation and the number of annotated surgemes is our compaction and pruning steps. To account for this, we first select a set of surgemes that are expressed in most demonstrations (i.e., simulating pruning), and we also apply a compaction step to the surgeme segments: when surgemes appear consecutively, we keep only one instance of each. We explore two metrics: seg-surgeme, the fraction of TSC clusters with only one surgeme switch (averaged over all demonstrations), and surgeme-seg, the fraction of surgeme switches that fall inside exactly one TSC cluster (a sketch of these two scores appears below). We found that the transitions learned by TSC with both the kinematic and video features were the most aligned with the surgemes. 83% and 73% of transition clusters for needle passing and suturing respectively contained exactly one surgeme transition when both were used. For the needle passing task, we found that the video features alone could give a reasonably accurate segmentation. However, this did not hold for the suturing dataset. The manual video features are low dimensional and tend to under-segment. For the suturing dataset, a combination of the visual and kinematic features was most aligned with the surgemes. Similarly, this scaling problem affects the variant that does not hierarchically cluster, leading to a small number of clusters and inaccuracy.
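A simplified sketch of these two alignment scores for a single demonstration is shown below; TSC clusters are represented as frame intervals and surgemes as per-frame labels. The exact bookkeeping in the dissertation (e.g., averaging across demonstrations and the compaction of consecutive surgemes) is omitted, and the example values are hypothetical.

import numpy as np

def alignment_metrics(cluster_intervals, surgeme_labels):
    """Compute seg-surgeme and surgeme-seg for one demonstration.
    cluster_intervals: list of (start, end) frame intervals for TSC clusters;
    surgeme_labels: per-frame surgeme annotation."""
    switches = [t for t in range(1, len(surgeme_labels))
                if surgeme_labels[t] != surgeme_labels[t - 1]]
    # seg-surgeme: fraction of clusters containing exactly one surgeme switch
    per_cluster = [sum(1 for t in switches if s <= t < e)
                   for (s, e) in cluster_intervals]
    seg_surgeme = float(np.mean([c == 1 for c in per_cluster]))
    # surgeme-seg: fraction of switches falling inside exactly one cluster
    per_switch = [sum(1 for (s, e) in cluster_intervals if s <= t < e)
                  for t in switches]
    surgeme_seg = float(np.mean([c == 1 for c in per_switch])) if switches else 1.0
    return seg_surgeme, surgeme_seg

labels = [0] * 10 + [1] * 10 + [2] * 10          # three surgemes, two switches
clusters = [(0, 12), (12, 22), (22, 30)]         # hypothetical TSC clusters
print(alignment_metrics(clusters, labels))       # (0.67, 1.0)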

Pruning and Compaction

In Figure 3.18, we highlight the benefit of pruning and compaction using the suturing task as an exemplar. First, we show the transition states without applying the compaction step to remove looping transition states (Figure 3.18a). We find that there are many more transition states at the “insert” step of the task. Compaction removes the segments that correspond to a loop of the insertions. Next, we show all of the clusters found by the first step of segmentation. The centroids of these clusters are marked in Figure 3.18b. In all, 11 clusters are pruned by this rule.


Figure 3.18: We first show the transition states without compaction (in black and green), and then show the clusters without pruning (in red). Compaction sparsifies the transition states and pruning significantly reduces the number of clusters.

TSC with Deep Features

Inspired by the recent success of deep neural networks in reinforcement learning [75, 76], we then explored how visual features extracted from Convolutional Neural Networks (CNNs) can be used for transition state identification. We use layers from a pre-trained CNN to derive the features frame-by-frame. In particular, we explore two architectures designed for image classification on natural images: (a) AlexNet: Krizhevsky et al. proposed a multilayer CNN architecture with 5 convolutional layers [70], and (b) VGG: Simonyan et al. proposed an alternative architecture, termed VGG (an acronym for Visual Geometry Group), which increased the number of convolutional layers significantly (16 in all) [111]. In our experiments, we explore the level of generality of features required for segmentation. We also compare these features to other visual featurization techniques such as SIFT for the purpose of task segmentation using TSC.

Evaluation of Visual Featurization

In our first experiment, we explore different visual featurization, encoding, and dimensionality reduction techniques. We applied TSC to our suturing experimental dataset and measured the silhouette score (ss) of the resulting transition state clusters. On this dataset, our results suggest that features extracted from the pre-trained CNNs resulted in tighter transition state clusters compared to SIFT features, which had a 3% lower ss than the worst CNN result. We found that features extracted with the VGG architecture resulted in the highest ss, with a 3% higher ss than the best AlexNet result. We also found that PCA for dimensionality reduction achieved an ss 7% higher than the best GRP result and 10% higher than the best CCA result. Because CCA finds projections of high correlation between the kinematics and video, we believe that CCA discards informative features, resulting in reduced clustering performance. We note that neither of the encoding schemes, VLAD or LCD-VLAD, significantly improves the ss. There are two hyper-parameters for TSC which we set empirically: the sliding window size (T = 3) and the number of PCA dimensions (k = 100). In Figure 3.19, we show a sensitivity plot with the ss as a function of each parameter. We calculated the ss using the same subset of the suturing dataset as above and with the VGG conv5_3 CNN features. We found that T = 3 gave the best performance. We also found that PCA with k = 1000 dimensions was only marginally better than k = 100 yet required over 30 minutes to run. For computational reasons, we selected k = 100.

t-SNE Visualization of Visual Features

One of the main insights of this study is that features from pre-trained CNNs exhibit locally-linear behavior, which allows the application of a switching linear dynamical system model. We experimentally tested this by applying t-SNE to trajectories of features from different video featurization techniques. Figure 3.20 shows t-SNE embeddings of visual features extracted for a single demonstration of suturing. The deep features display clear locally-linear properties and can be more easily clustered than SIFT features extracted for the corresponding frames. We speculate that SIFT breaks up trajectory structure due to its natural scale and location invariance properties. We also compared to using the raw RGB image pixel values and discovered that the deep features result in more well-formed locally linear trajectories. However, it is important to note that, unlike spatial trajectories, there are discrete jumps in the convolutional trajectories. We hope to explore this problem in more detail in future work.

Figure 3.19: We evaluate the sensitivity of two hyperparameters set in advance: number of PCA dimensions and sliding window size. The selected value is shown in red double circles.


Figure 3.20: Each data point in the figure corresponds to a t-SNE visualization of the features of a single frame in the video. (a) RGB pixel values of the original image, (b) shallow SIFT features, (c) CNN features from AlexNet pool5, (d) CNN features from VGG conv5_3.
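For reference, the sketch below extracts per-frame conv5_3 activations with torchvision's pre-trained VGG16 and projects them to 100 dimensions with PCA. The original experiments predate this API, so the pre-trained weights, preprocessing constants, and layer index (28, which is the conv5_3 module in torchvision's VGG16) are stand-in assumptions rather than the exact pipeline used in the dissertation.

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA

# conv5_3 is module index 28 of torchvision's VGG16 feature extractor.
# (Newer torchvision releases use models.vgg16(weights="IMAGENET1K_V1") instead of pretrained=True.)
vgg = models.vgg16(pretrained=True).features[:29].eval()
preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def conv5_3_features(frames):
    """Extract flattened conv5_3 activations for a list of HxWx3 uint8 frames."""
    feats = []
    with torch.no_grad():
        for frame in frames:
            x = preprocess(frame).unsqueeze(0)           # 1 x 3 x 224 x 224
            feats.append(vgg(x).flatten().numpy())       # 512 x 14 x 14 activations
    return np.stack(feats)

def reduce_dimension(features, k=100):
    """Project per-frame features to k dimensions with PCA (k = 100 in the text)."""
    return PCA(n_components=k).fit_transform(features)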

End-to-End Evaluation

For all subsequent experiments on real data, we used a pre-trained VGG CNN (conv5_3) and encoded the features with PCA with 100 dimensions.

1. Suturing: We apply our method to a subset of the JIGSAWS dataset [39] consisting of surgical task demonstrations under teleoperation using the da Vinci surgical system. The dataset was captured from eight surgeons with different levels of skill, performing five repetitions each of suturing and needle passing. We report quantitative results for both needle passing and suturing with both ss and NMI agreement with the human labels. Demonstrations from the JIGSAWS dataset were annotated with the skill level of the demonstrators (Expert (E), Intermediate (I), and Novice (N)). For the suturing dataset, we find that using both kinematics and video gives up to a 30.1% improvement in ss and a 52.3% improvement in NMI over using kinematics alone. Not surprisingly, we also find that the expert demonstrations, which are usually smoother and faster, lead to improved segmentation performance when using only the kinematic data. However, when we incorporate the visual data, the trend is not as clear. We speculate this has to do with the tradeoff between collecting more data (denser clusters and more accurate modeling) versus inconsistencies due to novice errors, and this tradeoff is evident in higher-dimensional data. We visualize the results of the segmentation on one representative trajectory (Figure 3.21). With combined kinematics and vision, TSC learns many of the important segments identified by annotation in [39]. Upon further investigation of the false positives, we found that they corresponded to meaningful actions missed by human annotators. For example, TSC discovers a repositioning step where many demonstrators penetrate and push through the needle in two different motions. While this is largely anecdotal evidence, we were able to find explanations for some of the false positives found by TSC.

2. Needle Passing: Next, we applied TSC to 28 demonstrations of the needle passing task. These demonstrations were annotated in [39]. In this task, the robot passes a needle through a loop using its right arm, then uses its left arm to pull the needle through the loop. Then, the robot hands the needle off from the left arm to the right arm. This is repeated four times. Similar to the suturing dataset, we find that the combination of the features gives the best results. For the needle passing dataset, we find that using both kinematics and video gives up to a 22.2% improvement in ss and a 49.7% improvement in NMI over using the best of either kinematics or vision alone. We found that the learned segments for the needle passing task were less accurate than those learned for the suturing task. We speculate that this is due to the multilateral nature of this task. This task uses both arms more than the suturing task, and as a result, there are many visual occlusions for a fixed camera. Important features such as the needle pose and the thread may be obscured at different points during the task. Furthermore, we constructed the state-space using the states of both arms. For such a task, it may be better to segment each of the arms independently.

3. PR2: Legos and Toy Plane Assembly: In our next experiment, we explore segmenting a multi-step assembly task using (1) large Lego blocks and (2) a toy plane from the YCB dataset [22].

Figure 3.21: The first row shows a manual segmentation of the suturing task in 4 steps: (1) Needle Positioning, (2) Needle Pushing, (3) Pulling Needle, (4) Hand-off. TSC extracts many of the important transitions without labels and also discovers unlabeled transition events.

We demonstrate that TSC applies generally outside of surgical robotics. We collected eight kinesthetic demonstrations for each task on the PR2 robot. Figure 3.23 illustrates the segmentation for the plane assembly task. We find that segmenting the plane assembly task using kinematics or vision alone results in a large number of segments. The combination can help remove spurious segments, restricting our segments to those transitions that occur in most of the demonstrations, agreeing in similarity both kinematically and visually.

4. Human Demonstration of Toy Plane Assembly: We extend the toy plane assembly experiment to collect eight demonstrations each from two human users. These examples only have videos and no kinematic information. We note that there was a difference between users in the grasping location of the fuselage. We also learn a segmentation for these human demonstrations. The results of TSC performance are summarized in Table 3.2. We omit a visualization of the results for the Lego assembly; however, we summarize the results quantitatively in the table.

Figure 3.22: Comparison of TSC performance on the Suturing and Needle Passing tasks. We compare the prediction performance by incrementally adding demonstrations from Experts (E), Intermediates (I), and Novices (N) respectively to the dataset.

Table 3.2: Plane and Lego Assembly Tasks. Both tasks show improvements in clustering and prediction accuracy using multi-modal data as compared to either modality alone. Only vision (Z) is available for the human demos of the plane assembly task; comparable segmentation results are obtained using only video input for the human demos. Higher Silhouette Scores and NMI scores are better, respectively.

                   K               Z               K+Z
Silhouette Score – intrinsic evaluation
Lego (Robot)       0.653 ± 0.003   0.644 ± 0.026   0.662 ± 0.053
Plane (Robot)      0.741 ± 0.011   0.649 ± 0.007   0.771 ± 0.067
Plane (Human 1)    –               0.601 ± 0.010   –
Plane (Human 2)    –               0.628 ± 0.015   –
NMI Score – extrinsic evaluation against manual labels
Lego (Robot)       0.542 ± 0.058   0.712 ± 0.041   0.688 ± 0.037
Plane (Robot)      0.768 ± 0.015   0.726 ± 0.040   0.747 ± 0.016
Plane (Human 1)    –               0.726 ± 0.071   –
Plane (Human 2)    –               0.806 ± 0.034   –

Figure 3.23: We compare TSC on 12 kinesthetic demonstrations (top) and 8 human demonstrations (bottom). No kinematics were available for the human demonstrations. We illustrate the segmentation for an example demonstration of each. Our manual annotation of the task has five steps, and TSC recovers this structure separately for both kinesthetic demos on the PR2 and human demos with the same visual features.

3.2 SWIRL: Learning With Transition States

Next, we build on TSC to construct a planner that leverages the structure learned by TSC. Real-world tasks often naturally decompose into a sequence of simpler, locally-solvable sub-tasks. For example, an assembly task might decompose into completing the part’s constituent sub-assemblies, or a surgical task might decompose into a sequence of movement primitives. Such structure imposes a strong prior on the class of successful policies and can focus exploration in reinforcement learning. It reduces the effective time horizon of learning to the start of the next subtask rather than until task completion. We apply a clustering algorithm to identify a latent set of state-space subgoals that sequentially compose to form the global task. This leads to a novel policy search algorithm, called Sequential Windowed Inverse Reinforcement Learning (SWIRL), where the demonstrations can bootstrap a self-supervised Q-learning algorithm.

Transitions are defined as significant changes in the state trajectory. These transitions can be spatially and temporally clustered to identify whether there are common conditions that trigger a change in motion across demonstrations. SWIRL extends this basic model with an Inverse Reinforcement Learning step that extracts subgoals and computes local cost functions from the learned clusters. Learning a policy over the segmented task is nontrivial because solving k independent problems neglects any shared structure in the value function during the policy learning phase (e.g., a common failure state). Jointly learning over all segments introduces a dependence on history, namely, any policy must complete step i before step i + 1. Learning a memory-dependent policy could lead to an exponential overhead of additional states. We show that the problem can be posed as a proper MDP in a lifted state-space that includes an indicator variable of the highest-index transition region in {1, ..., k} that has been reached so far, provided there are Markovian regularity assumptions on the clustering algorithm.

Model and Notation

Directly optimizing the reward function R in the MDP from the previous section might be very challenging. We propose to approximate R with a sequence of smoother reward functions.

Definition 4 (Proxy Task) A proxy task is a set of k MDPs with the same state set, action set, and dynamics. Associated with each MDP i is a reward function R_i : S × A → R. Additionally, associated with each R_i is a transition region ρ_i ⊆ S, which is a subset of the state-space. A robot accumulates a reward R_i until it reaches the transition region ρ_i, then the robot switches to the next reward and transition pair. This process continues until ρ_k is reached.

A robot is deemed successful when all of the ρ_i are reached in sequence within a global time-horizon T. SWIRL uses a set of initial supervisor demonstrations to construct a proxy task that approximates the original MDP. To make this problem computationally tractable, we make some modeling assumptions.

Modeling Assumption 1. Successful Demonstrations: We need conditions on the demonstrations to be able to infer the sequential structure. We assume that all demonstrations are successful, that is, they visit each ρ_i in the same sequence.

Modeling Assumption 2. Quadratic Rewards: We assume that each reward function R_i can be expressed as a quadratic of the form −(s − s_0)^T Ψ (s − s_0) for some positive semi-definite Ψ and a center point s_0 with s_0^T Ψ s_0 = 0.

Modeling Assumption 3. Ellipsoidal Approximation: Finally, we assume that the transition regions ρ_i can be approximated by a set of disjoint ellipsoids.

Algorithm Overview

SWIRL can be described in terms of three sub-algorithms, which take a set of demonstrations D as input and produce a policy π as output:

1. Sequence Learning: Given D, SWIRL applies TSC to partition the task into k sub-tasks whose start and end are defined by arrival at a sequence of transitions G = [ρ_1, ..., ρ_k].

2. Reward Learning: Given G and D, SWIRL associates a local reward function with each segment, resulting in a sequence of rewards R_seq = [R_1, ..., R_k].

3. Policy Learning: Given R_seq and G, SWIRL applies reinforcement learning to optimize a policy π for the task.

In principle, one could couple steps 1 and 2, similar to the results in [97]. We separate these steps since that allows us to use a different set of features for segmentation than for reward learning. Perceptual features can provide a valuable signal for segmentation, but quadratic reward functions may not be meaningful in all perceptual feature spaces.

Sequence Learning Algorithm

First, SWIRL applies a clustering algorithm to the initial demonstrations to learn the transition regions. The clustering model is based on our prior work on Transition State Clustering (TSC) [67, 86]. Transitions are defined as significant changes in the state trajectory. These transitions can be spatially and temporally clustered to identify whether there are common conditions that trigger a change in motion across demonstrations.

Reward Learning Algorithm

After the sequence learning phase, each demonstration is partitioned into k segments. The reward learning phase uses the learned [ρ_1, ..., ρ_k] to construct the local rewards [R_1, ..., R_k] for the task. Each R_i is a quadratic cost parametrized by a positive semi-definite matrix Ψ. The role of the reward function is to guide the robot to the next transition region ρ_i. A first approach is to define, for each segment i, a reward function as follows:

R_i(s) = −‖s − μ_i‖²,

which is just the (negative squared) Euclidean distance to the centroid. A problem with using the Euclidean distance directly is that it uniformly penalizes disagreement with μ_i in all dimensions. During different stages of a task, some directions will likely naturally vary more than others. To account for this, we can instead set Ψ to the inverse of the scatter matrix of all of the state vectors in the segment:

Ψ = ( ∑_{t=start}^{end} s_t s_t^T )^{−1},    (2)

which is a p × p matrix computed over all of the states in the segment from transition i − 1 to transition i. Intuitively, if a feature has low variance during this segment, deviations in that feature from the desired target are penalized more heavily. This is exactly the Mahalanobis distance to the next transition. For example, suppose one of the features j measures the distance to a reference trajectory u_t. Further, suppose in step one of the task the demonstrator’s actions are perfectly correlated with the reference trajectory (the variance of the distance feature is low, so Ψ_i[j, j] is high), and in step two the actions are uncorrelated with the reference trajectory (the variance is high, so Ψ_i[j, j] is low). Thus, Ψ will penalize deviation from μ_i[j] more in step one than in step two.

Algorithm 3 Reward Inference
Input: Demonstrations D and sub-goals [ρ_1, ..., ρ_k]
Segment each demonstration d_i into k sub-sequences, where the j-th is denoted by d_i[j].
Apply Equation 2 to each set of sub-sequences 1, ..., k.
Return: R_seq
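A minimal sketch of this reward construction is shown below: Ψ is the inverse of the (regularized) scatter of the segment's states, and the reward is the negative squared Mahalanobis distance to the centroid of the next transition region. The ridge regularization and the choice to center the scatter are assumptions for numerical stability, not specified in the text.

import numpy as np

def segment_reward(segment_states, target):
    """Build the local quadratic reward for one segment (sketch of Equation 2).
    segment_states: T x p array of states in the segment; target: centroid of
    the next transition region rho_i."""
    S = np.asarray(segment_states, dtype=float)
    mu = S.mean(axis=0)
    scatter = (S - mu).T @ (S - mu)
    psi = np.linalg.inv(scatter + 1e-6 * np.eye(S.shape[1]))   # low variance => high penalty
    target = np.asarray(target, dtype=float)
    def R(s):
        d = np.asarray(s, dtype=float) - target
        return -float(d @ psi @ d)
    return R

# Hypothetical usage on one segmented demonstration d (a list of k state arrays):
# rewards = [segment_reward(d[j], centroids[j]) for j in range(k)]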

Policy Learning

SWIRL uses the learned transitions [ρ_1, ..., ρ_k] and R_seq to construct a proxy task to solve via reinforcement learning. In this section, we describe learning a policy π given the rewards R_seq and an ordered sequence of transitions G. This problem is not trivial since solving k independent problems neglects potential shared value structure between the local problems (e.g., a common failure state). Furthermore, simply taking the aggregate of the rewards can lead to inconsistencies since there is nothing enforcing the order of operations. We show that a single policy can be learned jointly over all segments on a modified problem, where the state-space is augmented with additional variables that keep track of the previously achieved segments. We present a Q-Learning algorithm [85, 121] that captures the coupling noted above between task segments. In principle, similar techniques can be used for any other policy search method.

Jointly Learning Over All Segments

In our sequential task definition, we cannot transition to reward R_{i+1} unless all previous transition regions ρ_1, ..., ρ_i have been reached in sequence. We can leverage the definition of the Markov segmentation function formalized earlier to jointly learn across all segments, while leveraging the segmented structure. We know that the reward transitions (R_i to R_{i+1}) only depend on an arrival at the transition region ρ_i and not on any other aspect of the history. Therefore, we can store an index v that indicates which transition region i ∈ {0, ..., k} has been reached.

This index can be efficiently incremented when the current state s ∈ ρ_{i+1}. The result is an augmented state (s, v) that accounts for previous progress. In this lifted space, the problem is a fully observed MDP. The additional complexity of representing the reward with history over S × [k] is then only O(k), instead of exponential in the time horizon.
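Concretely, one way to write this lifted construction (our notation, not the dissertation's) is:

s̃_t = (s_t, v_t),    v_{t+1} = v_t + 1{ s_{t+1} ∈ ρ_{v_t+1} },    R̃(s̃_t, a_t) = R_{v_t+1}(s_t, a_t),

where 1{·} is the indicator function and v_t ∈ {0, ..., k} counts the transition regions reached so far.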

Segmented Q-Learning

At a high level, the objective of standard Q-Learning is to learn the function Q(s, a) of the optimal policy, which is the expected reward the agent will receive taking action a in state s, assuming future behavior is optimal. Q-Learning works by first initializing a random Q function. Then, it samples rollouts from an exploration policy, collecting (s, a, r, s′) tuples. From these tuples, one can calculate the following value:

y_i = R(s, a) + max_{a′} Q(s′, a′)

Each of the y_i can be used to define a loss function since, if Q were the true Q function, then the following recurrence would hold:

Q(s, a) = R(s, a) + max_{a′} Q(s′, a′)

So, Q-Learning defines a loss:

L(Q) = ∑_i ‖ y_i − Q(s_i, a_i) ‖²

This loss can be optimized with gradient descent. When the state and action spaces are discrete, the representation of the Q function is a table, and we get the familiar Q-Learning algorithm [121], where each gradient step updates the table with the appropriate value. When the Q function needs to be approximated, we get the Deep Q-Network algorithm [85]. SWIRL applies a variant of Q-Learning to optimize the policy over the sequential rewards. The basic change to the algorithm is to augment the state-space with an indicator vector that records the transition regions that have been reached. So each of the rollouts now records a tuple (s, v, a, r, s′, v′) that additionally stores this information. The Q function is now defined over states, actions, and the segment index, which also selects the appropriate local reward function:

Q(s, a, v) = R_v(s, a) + max_{a′} Q(s′, a′, v′)

We also need to define an exploration policy, i.e., a stochastic policy with which we will collect rollouts. To initialize the Q-Learning, we apply Behavioral Cloning locally for each of the segments to get a policy π_i. We apply an ε-greedy version of these policies to collect rollouts.
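The sketch below is a tabular version of this segmented Q-learning loop on a toy chain environment. The environment, hyper-parameters, and the dictionary-backed Q table are illustrative stand-ins: the physical experiments in this chapter use function approximation and a behavioral-cloning initialization rather than a random one.

import random
from collections import defaultdict

def segmented_q_learning(env, rewards, regions, episodes=500, horizon=200,
                         alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning over the lifted state (s, v), where v is the index
    of the last transition region reached in sequence."""
    Q = defaultdict(float)                       # keys: (s, v, a)
    k = len(regions)
    for _ in range(episodes):
        s, v = env.reset(), 0
        for _ in range(horizon):
            if random.random() < epsilon:        # epsilon-greedy exploration
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(s, v, b)])
            s_next = env.step(s, a)
            r = rewards[v](s_next)               # local reward for the current segment
            v_next = v + 1 if regions[v](s_next) else v
            target = r + gamma * max(Q[(s_next, v_next, b)] for b in env.actions)
            Q[(s, v, a)] += alpha * (target - Q[(s, v, a)])
            s, v = s_next, v_next
            if v == k:                           # all transition regions reached
                break
    return Q

class Chain:
    """Toy 1-D chain used only to exercise the sketch."""
    actions = (-1, +1)
    def reset(self):
        return 0
    def step(self, s, a):
        return max(0, min(10, s + a))

regions = [lambda s: s >= 5, lambda s: s >= 10]            # reach 5, then 10
rewards = [lambda s: -abs(s - 5), lambda s: -abs(s - 10)]  # local costs toward each subgoal
Q = segmented_q_learning(Chain(), rewards, regions)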

Simulated Experiments

We constructed a parallel parking scenario for a robot car with non-holonomic dynamics and two obstacles (Figure 3.24a), and we also experimented on the standard acrobot domain (Figure 3.24b).

Experimental Methodology

First, we overview our basic experimental methodology. We evaluate SWIRL on two simulated RL benchmarks and two deformable manipulation tasks on the da Vinci surgical robot [57]. In all of these tasks, there is a single MDP of interest where the reward is sparse (for a substantial amount of the state-space the reward is zero). The goal of SWIRL is to improve convergence on this task.

Supervision

In both of the RL benchmarks, the reward function defines a target configuration of the robot. We generated the initial demonstrations using an RRT* motion planner (assuming deterministic dynamics) [55]. As both RL benchmarks are stochastic in nature, we used an MPC-style re-planning approach to control the robot to the target region. In the physical experiments, we provided the robot with demonstrations collected via tele-operation. We collected fully observed trajectories with both states and actions.

Basic Baselines

Pure Exploration: The first set of baselines studies a pure exploration approach to learning a policy. The baseline approach applies RL to the global task. In all of our experiments, we use variants of Q-Learning with different function approximators. The Q function is randomly initialized and is updated with data collected from episodic rollouts. The hyper-parameters for each experiment are listed in the appendix. We use approximate Q-Learning [13, 124] in the simulated benchmarks and Deep Q-Networks in the physical experiments [85].

Pure Demonstration: This baseline directly learns a policy from an initial set of demonstrations using supervised learning. This approach is called Behavioral Cloning (see the survey of imitation learning in [91]), and in each of our experiments, we describe the policy models used. It is important to note that this approach requires fully observed demonstrations.

Warm Start Exploration: Next, we consider approaches that leverage both demonstrations and exploration. One approach is to use the demonstrations to initialize the Q-function for RL and then perform rollouts. This approach requires fully observed demonstrations as well.

Inverse Reinforcement Learning: Alternatively, we can also use the demonstrations to infer a new reward function. We use IRL to infer a smoother quadratic reward function that explains the demonstrator’s behavior. We infer this quadratic reward function using MaxEnt-IRL. We consider both estimated dynamics and ground-truth dynamics for this baseline. When the dynamics are estimated, this approach requires fully observed demonstrations. After inferring the reward, the task is solved with RL w.r.t. the quadratic reward function.

Figure 3.24: (A) Simulated control task with a car with noisy non-holonomic dynamics. The car (A1) is controlled by accelerating and turning in discrete increments. The task is to park the car between two obstacles. (B) The standard acrobot domain, an under-actuated two-link pendulum.

Tasks

For the car domain, the car can accelerate or decelerate in discrete ±0.1 meters-per-second-squared increments (the car can reverse), and change its heading in 5° increments. The car’s velocity and heading θ are inputs to a bicycle steering model which computes the next state. The car observes its x position, y position, orientation, and speed in a global coordinate frame. The car’s dynamics are noisy and, with probability 0.1, will randomly add or subtract 2.5° from the steering angle. If the car parks between the obstacles, i.e., zero speed within a 15° heading tolerance and a positional tolerance of 1 meter, the task is a success and the car receives a reward of 1. The obstacles are 5 meters apart (2.5 car lengths). If the car collides with one of the obstacles or does not park within 200 timesteps, the episode ends with a reward of 0. The acrobot domain consists of an under-actuated two-link pendulum with gravity and with torque controls on the joints. There are 4 discrete actions that correspond to clockwise and counter-clockwise torques on each of the links. The robot observes the angles θ_1, θ_2 and angular velocities ω_1, ω_2 at each of the links. The dynamics are noisy, with a small amount of random noise added to each torque applied to the pendulum. The robot has 1000 timesteps to raise the arm above horizontal (y = 1 in the image); if the task is successful, the robot receives a reward of 1. The expected reward is equivalent to the probability that the current policy will successfully raise the arm above horizontal.
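A sketch of one step of this simulated car is given below. The time step, the treatment of acceleration as a per-step velocity increment, and the simplified kinematic update are illustrative assumptions; the text's bicycle steering model is abstracted to a heading change followed by a forward kinematic update.

import numpy as np

def car_step(state, accel, dheading, dt=0.1, rng=None):
    """One noisy step of the parking car (sketch). state = (x, y, theta, v);
    accel is a +/-0.1 m/s^2 increment and dheading a +/-5 degree heading change.
    With probability 0.1, +/-2.5 degrees of noise is added to the heading change."""
    rng = rng or np.random.default_rng()
    x, y, theta, v = state
    if rng.random() < 0.1:
        dheading += rng.choice([-1.0, 1.0]) * np.radians(2.5)
    v = v + accel * dt
    theta = theta + dheading
    x = x + v * np.cos(theta) * dt
    y = y + v * np.sin(theta) * dt
    return (x, y, theta, v)

def parked(state, target_xy, pos_tol=1.0, heading_tol=np.radians(15.0)):
    """Success test: near-zero speed within positional and heading tolerance."""
    x, y, theta, v = state
    near = np.hypot(x - target_xy[0], y - target_xy[1]) <= pos_tol
    return near and abs(v) < 1e-2 and abs(theta) <= heading_tol

state = (0.0, 0.0, 0.0, 0.0)
state = car_step(state, accel=0.1, dheading=np.radians(5.0))
print(state)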

Pure Exploration vs. SWIRL In the first set of experiments, we compare the learning efficiency of pure exploration to SWIRL (Figure 3.25). The baseline Q-Learning approach (QL) is very slow because it relies on random exploration to achieve the goal at least once before it can start estimating the value of states and actions. We fix the number of initial demonstrations provided to SWIRL, apply the segmentation and reward learning algorithms, and construct a proxy task. In both domains, SWIRL significantly accelerates learning and converges to a successful policy with significantly fewer rollouts. We find that in the parallel parking domain this improvement is more substantial. This is likely because the task more naturally partitions into discrete subtasks. In the appendix, we visualize the segments discovered by the algorithm.

Pure Demonstration vs. SWIRL Next, we evaluate SWIRL against a behavioral cloning approach (Figure 3.26). We collect the initial set of demonstrations and directly learn a policy with an SVM. For the parallel parking task, we use a linear SVM. For the acrobot task, we use a kernel SVM with an RBF kernel. We fix the number of autonomous rollouts that SWIRL can observe (500 for the parallel parking task and 3000 for the acrobot task). Note that the SVM technique requires observing the actions in the demonstration trajectories, which may not be possible in all applications. The SVM approach does have the advantage that it doesn't require any further exploration. However, SWIRL and the pure demonstration approach are not mutually exclusive. As we show in our physical experiments, we can initialize Q-learning with a behavioral cloning policy. The combination of the two approaches allows us to take advantage of a small number of demonstrations and learn to refine the initial policy through exploration.

The SVM approach requires more than 10x the demonstrations to be competitive. In particular, there is an issue with exhaustively demonstrating all the scenarios a robot may encounter. Learning from autonomous trials in addition to the initial demonstrations can augment the data without burdening the supervisor. Perhaps surprisingly, the initial dataset of demonstrations can be quite small. On both tasks, with only five demonstrations, SWIRL is within 20% of its maximum reward. Representing a policy is often more complex than representing a reward function that guides the agent to valuable states, and learning this structure requires less data than learning a full policy. This suggests that SWIRL can exploit a very small number of expert demonstrations to dramatically reduce the number of rollouts needed to learn a successful policy.

Figure 3.25: Learning curves for Q-Learning (QL) and SWIRL on both simulated tasks. (Parallel Parking) For a fixed number of demonstrations (5), we vary the number of rollouts and measure the average reward at each rollout. (Acrobot) For a fixed number of demonstrations (15), we vary the number of rollouts and measure the average reward at each rollout.

SWIRL vs. Other Hybrid Approaches Finally, we compare SWIRL to two other hybrid demonstration-exploration approaches (Figure 3.27). The goal of these experiments is to show that the sequential structure learned by SWIRL is a strong prior. As in the previous experiment, it is important to note that SWIRL only requires a state trajectory as a demonstration and does not need to explicitly observe the actions taken by the expert demonstrator.

Initializing the Q-function with the demonstrations did not yield a significant improvement over random initialization. This is because in expert demonstrations one rarely observes failures, and if the Q-learning algorithm does not observe poor decisions it will not be able to avoid them in the future. We also applied an IRL approach using estimated dynamics. The IRL approach is substantially better than the basic Q-Learning algorithm in the parallel parking domain. This is likely because it smooths out the sparse reward by fitting a quadratic function, which serves to guide the robot to the goal states to some extent. Finally, SWIRL is the most sample-efficient algorithm. This is because the sequential quadratic rewards it learns better align with the true value functions in both tasks, and this structure can be learned from a small number of demonstrations.

Figure 3.26: Demonstration curves for imitation learning (SVM) and SWIRL on both simulated tasks. (Parallel Parking) We fix the number of rollouts to 500 and vary the number of demonstration trajectories each approach observes. (Acrobot) For a fixed number of rollouts (3000), we vary the number of demonstration trajectories given to each technique.

Figure 3.27: Comparison of hybrid approaches. (Parallel Parking) For a fixed number of demonstrations (5), we vary the number of rollouts and measure the average reward at each rollout. (Acrobot) For a fixed number of demonstrations (15), we vary the number of rollouts and measure the average reward at each rollout.

Figure 3.28: For a fixed number of rollouts (3000) and 250 demonstrations, we measure the transfer as a function of varying the link size. (QL) denotes Q-learning, (SVM) denotes a baseline of behavioral cloning with a kernel SVM policy representation, (IRL) denotes MaxEnt-IRL using estimated dynamics learned from the demonstrations, and (SWIRL) denotes SWIRL. The SVM policy fails as soon as the link size is changed. SWIRL is robust until the change becomes very large.

The Benefit Of Models Next, we consider the benefits of using Inverse Reinforcement Learning with ground truth dynamics models compared to models estimated from data (Figure 3.29). One scenario where this problem setting is useful is when the demonstration dynamics are known but differ from the execution dynamics. Most common IRL frameworks, like Maximum Entropy IRL, assume access to the dynamics model. In the previous experiments, these models were estimated from data, and here we show the benefit of providing the true models to the algorithms. Both IRL and SWIRL improve their sample efficiency significantly when ground truth models are given. This experiment illustrates that the principles behind SWIRL are compatible with model-based methodologies.

Different Segmentation Methods In our general framework, SWIRL is compatible with any heuristic for segmenting the initial demonstration trajectories. This heuristic serves to oversegment, and the unsupervised learning model builds a model of sequential rewards from this heuristic. The previous experiments use a GMM-based approach as the segmentation heuristic; this experiment evaluates the same domains with other heuristics. In particular, we consider two other models: segmentation based on changes in the direction of velocity and segmentation based on linear dynamical regimes. Figure 3.30 illustrates the results. While there are differences between the performance of the different heuristics, we found that the GMM-based approach was the most reliable across domains.

Figure 3.29: (Parallel Parking) For a fixed number of demonstrations (5), we vary the number of rollouts and measure the average reward at each rollout. (Acrobot) For a fixed number of demonstrations (15), we vary the number of rollouts and measure the average reward at each rollout.

Figure 3.30: We compare different transition indicator heuristics within SWIRL. (Parallel Parking) For a fixed number of demonstrations (5), we vary the number of rollouts and measure the average reward at each rollout. (Acrobot) For a fixed number of demonstrations (15), we vary the number of rollouts and measure the average reward at each rollout.

Transfer We constructed two transfer scenarios to evaluate whether the learned structure overfits to the initial demonstrations. In essence, this is an evaluation of how well the approaches handle transfer if the dynamics change between demonstration and execution. We collected N = 100 demonstrations on the original task, and then used the learned rewards or policies on a perturbed task. For the parallel parking task, we modified the execution environment such that the dynamics are coupled in a way that turning right causes the car to accelerate forward by 0.05 meters per second. In the perturbed task, the car must learn to adjust to this acceleration during the reversing phase. In the new domain, each approach is allowed 500 rollouts. We report the results in Figure 3.31.

The success rate of the policy learned with Q-Learning is more or less constant between the two domains. This is because Q-learning does not use any information from the original domain. The SVM behavioral cloning policy has a drastic change. On the original domain it achieves a 95% success rate (with 100 demonstrations); however, on the perturbed domain, it is never successful. This is because the SVM learned a policy that causes it to crash into one of the obstacles in the perturbed environment. The IRL techniques are more robust during the transfer. This is because the rewards learned are quadratic functions of the state and do not encode anything specific about the dynamics. Similarly, in SWIRL, the rewards and transition regions are invariant to the dynamics in this transfer problem. For SWIRL-E and SWIRL-G, there is only a drop of 5% in the success rate. On the other hand, the model-free version of SWIRL reports a larger drop of 16%. This is because the model-free version is not a true IRL algorithm and may encode some aspects of the dynamics in the learned reward function. Coincidentally, this experiment also shows us how to construct a failure mode for SWIRL. If the perturbation in the task is such that it “invalidates” a transition region, e.g., a new obstacle, then SWIRL may not be able to learn to complete the task. However, the transition regions give us a formalism for detecting such problems during learning, as we can keep track of which regions are possible to reach.

As in the parallel parking scenario, we evaluate how the different approaches handle transfer if the dynamics change between demonstration and execution. With N = 250 demonstrations, we learn the rewards, policies, and segments on the standard pendulum, and then during learning, we vary the size of the second link of the pendulum. We plot the success rate (after a fixed 3000 rollouts) as a function of the increasing link size (Figure 3.28). As the link size increases, even the baseline Q-learning becomes less successful. This is because the system becomes more unstable and it is harder to learn a policy. The behavioral cloning SVM policy immediately fails as the link size is increased. IRL is more robust but does not offer much of an advantage in this problem. SWIRL is robust until the change in the link size becomes large. This is because for the larger link size, SWIRL might require different segments (or one of the learned segments becomes unreachable).

Figure 3.31: For 500 rollouts and 100 demonstrations, we measure the robustness of the approaches to changes in the execution dynamics. (QL) denotes Q-learning, (SVM) denotes a baseline of behavioral cloning with an SVM policy representation, (IRL) denotes MaxEnt-IRL with estimated dynamics, and (SWIRL) denotes the model-free version of SWIRL. While the SVM is 95% successful on the original domain, its success does not transfer to the perturbed setting. On the other hand, SWIRL learns rewards and segments that transfer to the new dynamics since they are state-space goals.

Sensitivity Next, we evaluated the sensitivity of SWIRL to different initial demonstration sets (Figure 3.32). We sampled random initial demonstration sets and re-ran the algorithm on each of the two domains 100 times. Figure 3.32 plots the mean reward as a function of the number of rollouts, with error bars of 2 standard deviations over all of the trials. We find that SWIRL is not very sensitive to the particular initial demonstration dataset. In fact, the 2 standard deviation error bar is smaller than the improvement in convergence observed in the previous experiments.

Segmentation and Partial Observation Next, we made the Parallel Parking domain more difficult to illustrate the connection between segmentation and memory in RL. We hid the velocity state from the robot, so the car only sees (x, y, θ). As before, if the car collides with one of the obstacles or does not park within 200 timesteps, the episode ends. We call this domain Parallel Parking with Partial Observation (PP-PO).

Figure 3.32: Sensitivity of SWIRL. (Parallel Parking) We generate a random set of 5 demonstrations, vary the number of rollouts, and measure the average reward at each rollout. We plot the mean and 2 standard deviations over 100 trials. (Acrobot) We generate a random set of 15 demonstrations, vary the number of rollouts, and measure the average reward at each rollout. We plot the mean and 2 standard deviations over 100 trials.

This form of partial observation creates an interesting challenge. There is no longer a stationary policy that can achieve the reward. During the reversing phase of parallel parking, the car does not know that it is currently reversing, so there is ambiguity in that state about whether to pull up or reverse. We will see that segmentation can help disambiguate the action in this state. As before, we generated 5 demonstrations using an RRT* motion planner (assuming deterministic dynamics) and applied each of the approaches. The techniques that model this problem with a single MDP all fail to converge; the Q-Learning approach achieves some non-zero reward by chance. The learned segments in SWIRL help disambiguate the dependence on history, since the segment indicator tells the car which stage of the task is currently active (pulling up or reversing). After 250,000 time-steps, the policy learned with model-based SWIRL has a 95% success rate, in comparison to a <10% success rate for the baseline RL, 0% for MaxEnt-IRL, and 0% for the SVM.
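One way to realize this disambiguation is to append the index of the currently active segment to the partial observation before applying Q-learning. The sketch below illustrates the idea; the transition-region predicates are placeholders for whatever membership test the segmentation produces, and this is not necessarily the exact mechanism used inside SWIRL.

import numpy as np

def augment_with_segment(obs, segment_id, n_segments):
    """Append a one-hot segment indicator to the partial observation (x, y, theta)."""
    one_hot = np.zeros(n_segments)
    one_hot[segment_id] = 1.0
    return np.concatenate([np.asarray(obs, dtype=float), one_hot])

def update_segment(segment_id, obs, transition_regions):
    """Advance to the next segment when obs enters the current transition region.

    transition_regions[i] is assumed to be a predicate over observations,
    e.g., a membership test for a region learned from the demonstrations.
    """
    if segment_id < len(transition_regions) and transition_regions[segment_id](obs):
        return segment_id + 1
    return segment_id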

Physical Experiments with the da Vinci Surgical Robot In the next set of experiments, we evaluate SWIRL on two tasks on the da Vinci Surgical Robot. The da Vinci Research Kit is a surgical robot originally designed for tele-operation, and we consider autonomous execution of surgical subtasks. Based on a chessboard calibration, we found that the robot has an RMSE kinematic error of 3.5 mm and thus requires feedback from vision for accurate manipulation. In our robotic setup, there is an overhead endoscopic stereo camera that can be used to find visual features for learning, located 650 mm above the workspace. This camera is registered to the workspace with an RMSE calibration error of 2.2 mm.

Figure 3.33: We hid the velocity state from the robot, so the robot only sees (x, y, θ). For a fixed number of demonstrations (5), we vary the number of rollouts and measure the average reward at each rollout. (QL) denotes Q-learning, (SVM) denotes a baseline of behavioral cloning with an SVM policy representation, (IRL) denotes MaxEnt-IRL with estimated dynamics, and (SWIRL) denotes SWIRL. SWIRL converges while the other approaches fail to reach a reliable success rate.

Deformable Sheet Tensioning In the first experiment, we consider the task of deformable sheet tensioning. A sheet of surgical gauze is fixtured at the two far corners using a pair of clips. The unclipped part of the gauze is allowed to rest on soft silicone padding. The robot's task is to reach for the unclipped part, grasp it, lift the gauze, and tension the sheet to be as planar as possible. An open-loop policy typically fails on this task because it requires feedback on whether the gauze is properly grasped, on how the gauze has deformed after grasping, and visual feedback on whether the gauze is planar. The task is sequential, as some grasps pick up more or less of the material and the flattening procedure has to be modified accordingly.

The state space is the 6-DoF end-effector position of the robot, the current load on the wrist of the robot, and a visual feature measuring the flatness of the gauze. This feature is computed from a set of fiducial markers on the gauze which are segmented by color using the stereo camera. We then correspond the segmented contours and estimate a z position for each marker (relative to the horizontal plane). The variance in the z positions is a proxy for flatness, and we include this as a feature for learning (we call this disparity).

The action space is discretized into an 8-dimensional vector (±x, ±y, ±z, open/close gripper), where the robot moves in 2 mm increments. We provided 15 demonstrations through a keyboard-based tele-operation interface. The average length of the demonstrations was 48.4 actions (although we sampled observations at a higher frequency, about 10 observations for every action). From these 15 demonstrations, SWIRL identifies four segments. One of the segments corresponds to moving to the correct grasping position, one corresponds to making the grasp, one corresponds to lifting the gauze up again, and one corresponds to straightening the gauze. One of the interesting aspects of this task is that the segmentation requires multiple features.

Then, we tried to learn a policy from the rewards constructed by SWIRL. In this experiment, we initialized the policy learning phase of SWIRL with the Behavioral Cloning policy. We define a Q-network as a single-hidden-layer Multi-Layer Perceptron with 32 hidden units and sigmoid activation. For each of the segments, we apply Behavioral Cloning locally with the same architecture as the Q-network (with an additional softmax over the output layer) to get an initial policy. We roll out 100 trials with an ε = 0.1 greedy version of these segmented policies.

The learning results of this experiment are summarized below with different baselines. The value of the policy is a measure of the average disparity over the gauze accumulated over the task (if the gauze is flatter for longer, then the value is greater). As a baseline, we applied RL for 100 rollouts with no other information. RL did not successfully grasp the gauze even once. Next, we applied Behavioral Cloning (BC) directly. BC was able to reach the gauze but not successfully grasp it. Then, we applied the segmentation from SWIRL and applied BC directly to each local segment (without further refinement). This was able to complete the full task with a cumulative disparity score of −3516. Finally, we applied all of SWIRL and found the highest-value results (−3110). For comparison, we applied SWIRL without the BC initialization and found that it was only successful at the first two steps. This indicates that in real tasks the initialization is crucial.
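A minimal sketch of this setup is shown below, assuming a Keras-style implementation; the architecture (one hidden layer of 32 sigmoid units), the Behavioral Cloning initialization, and the ε = 0.1 greedy action selection follow the description above, while the optimizer, training epochs, and data handling are illustrative assumptions.

import numpy as np
import tensorflow as tf

def build_q_network(state_dim, n_actions):
    """Single hidden layer (32 sigmoid units) with one output per discrete action."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(32, activation='sigmoid'),
        tf.keras.layers.Dense(n_actions)           # Q-value / logit per action
    ])

def behavioral_cloning_init(states, actions, state_dim, n_actions, epochs=50):
    """Train the same architecture as a classifier on demonstrated actions,
    then copy its weights into the Q-network so rollouts start from the cloned policy."""
    clone = build_q_network(state_dim, n_actions)
    clone.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    clone.fit(np.asarray(states, dtype=np.float32),
              np.asarray(actions), epochs=epochs, verbose=0)
    q_net = build_q_network(state_dim, n_actions)
    q_net.set_weights(clone.get_weights())
    return q_net

def epsilon_greedy_action(q_net, state, n_actions, epsilon=0.1):
    """epsilon = 0.1 greedy action selection used during the 100 rollouts."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q = q_net(np.asarray(state, dtype=np.float32)[None, :])
    return int(np.argmax(q.numpy()[0]))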

Table 3.3: Results from the deformable sheet tensioning experiment

Technique                  # Demonstrations   # Rollouts   Value
Pure Exploration (RL)      -                  100          -8210
Pure Demonstration (BC)    15                 -            -7591
Segmented Demos.           15                 -            -3516
SWIRL                      15                 100          -3110

Surgical Line Cutting In the next experiment, we evaluate generalization to different task instances. We apply SWIRL to learn to cut along a marked line in gauze, similar to [87]. This is a multi-step problem where the robot starts from a random initial state, has to move to a position that allows it to start the cut, and then cut along the marked line. We provide the robot 5 kinesthetic demonstrations by positioning the end-effector and then following various marked straight lines. The state space of the robot includes the end-effector position (x, y) as well as a visual feature indicating its pixel distance to the marked line (pix). This visual feature is constructed using OpenCV thresholding for the black line. Since the gauze is planar, the robot's actions are unit steps in the ±x, ±y axes. Figure 3.34 illustrates the training and test scenarios.

SWIRL identifies two segments corresponding to the positioning step and the termination. The learned reward function for the positioning step minimizes the x, y, pix distance to the starting point, and for the cutting step the reward function is more heavily weighted to minimize the pix distance. We defined task success as positioning within 1 cm of the starting position of the line and, during the following stage, missing the line by no more than 1 cm (estimated from pixel distance). We evaluated the model-free version of SWIRL, Q-Learning, and Behavioral Cloning with an SVM. SWIRL was the only technique able to achieve the combined task.

We evaluated the learned tracking policy to cut gauze. We ran trials on different sequences of curves and straight lines. Out of the 15 trials, 11 were successful. 2 failed due to SWIRL errors (tracking or positioning was imprecise) and 2 failed due to cutting errors (the gauze deformed, causing the task to fail). 1 of the failures was on the 4.5 cm curvature line and 3 were on the 3.5 cm curvature line.
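A minimal sketch of how such a pixel-distance feature might be computed with OpenCV thresholding is shown below; the threshold value and the assumption that the end-effector's pixel coordinates are available from the camera registration are illustrative, not the exact implementation.

import cv2
import numpy as np

def pixel_distance_to_line(image_bgr, ee_pixel, thresh=60):
    """Pixel distance from the end-effector to the marked (dark) line.

    image_bgr is the overhead camera frame; ee_pixel is the (row, col) of the
    end-effector projected into the image via the camera registration.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # the dark marked line becomes white in the inverted binary mask
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)
    line_pixels = np.column_stack(np.where(mask > 0))    # (row, col) pairs
    if len(line_pixels) == 0:
        return float('inf')
    dists = np.linalg.norm(line_pixels - np.asarray(ee_pixel), axis=1)
    return float(dists.min())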

Figure 3.34: We collected demonstrations on the da Vinci surgical robot kinesthetically. The task was to cut a marked line on gauze. We demonstrated the location of the line without actually cutting it. The goal is to infer that the demonstrator’s reward function has two steps: position at a start position before the line, and then following the line. We applied this same reward to curved lines that started in different positions.

Next, we characterized the repeatability of the learned policy. We applied SWIRL to lines of various curvature, spanning from straight lines to a curvature radius of 1.5 cm.

Table 3.4: With 5 kinesthetic demonstrations of following marked straight lines on gauze, we applied SWIRL to learn to follow lines of various curvature. After 25 episodes of exploration, we evaluated the policies on the ability to position at the correct cutting location and track the line. We compare to SVM on each individual segment. SVM is comparably accurate on the straight line (training set) but does not generalize well to the curved lines.

Curvature Radius (cm)   SVM Pos. Error (cm)   SVM Tracking Error (cm)   SWIRL Pos. Error (cm)   SWIRL Tracking Error (cm)
straight                0.46                  0.23                      0.42                    0.21
4.0                     0.43                  0.59                      0.45                    0.33
3.5                     0.51                  1.21                      0.56                    0.38
3.0                     0.86                  3.03                      0.66                    0.57
2.5                     1.43                  -                         0.74                    0.87
2.0                     -                     -                         0.87                    1.45
1.5                     -                     -                         1.12                    2.44

Table 3.4 summarizes the results on lines of various curvature. While the SVM approach did not work on the combined task, we evaluated its accuracy on each individual step to illustrate the benefits of SWIRL. On following straight lines, SVM was comparable to SWIRL in terms of accuracy. However, as the lines become increasingly curved, SWIRL generalizes more robustly than the SVM. A single SVM has to learn both the positioning and cutting policies, and the combined policy is much more complicated than the individual policies, e.g., go to a goal and then follow a line.

Chapter 4

Alpha Clean: Reinforcement Learning For Data Cleaning

It is widely known that data cleaning is one of the most time-consuming steps of the data analysis process, and designing algorithms and systems to automate or partially automate data cleaning continues to be an active area of research [24]. Automation in data cleaning is challenging because real-world data is highly variable. A single data set can have many different types of data corruption, such as statistical outliers, constraint violations, and duplicates. Once an error is detected, there is a further question of how to repair this error, which often depends on how the data will be used in the future.

This variability creates a tension between generality and efficiency. While one might want a single cleaning framework that addresses all types of errors and repair operations, it is far more efficient to consider more specialized frameworks. Data cleaning tools are often highly optimized for particular problems (e.g., statistical outliers [46], enforcing logical constraints [24], entity resolution [64]). Consequently, a recent survey of industry suggests that data cleaning pipelines are often a patchwork of custom scripts and multiple specialized systems [68]. The overhead to set up, learn, and manage multiple cleaning systems can easily outweigh the benefits. Some data scientists eschew automated tools altogether and simply write data cleaning programs from scratch. The customized programming approach quickly becomes difficult to maintain and interpret, especially for users without data engineering expertise [104].

A single cleaning system that exposes a sufficiently flexible interface is clearly desirable, as long as the runtime and accuracy can be made comparable to existing alternatives. To achieve this, we turn to the contemporary AI literature, which has produced exciting results, such as AlphaGo [109], that were previously considered not possible. This work shows that a combination of machine learning and massively parallelized search can effectively optimize highly complex objective functions. Our main observation is that data cleaning is very similar to the search problems considered in AI [99]. The classic application is a computer chess program that must plan a sequence of chess moves that change the state of the board in order to maximize the likelihood of a checkmate.

Likewise, in data cleaning, one is given a dirty relation and a way to measure data quality (e.g., the number of integrity constraints violated), and the data cleaning problem is to find a sequence of modifications to the relation that maximizes the data quality metric.

Most non-trivial problems are very hard without good search heuristics. Advances in machine learning have made it possible to replace hand-crafted heuristics with automatically learned pruning functions that estimate the expected quality of candidate plans. AlphaGo learned pruning heuristics using a neural network trained on data generated through self-play [109]. This insight is highly relevant to data cleaning, where datasets tend to have structure amenable to learning heuristics adaptively. Data errors are often systematic, in the sense that they are correlated with certain attributes and values in the dataset [69, 98]. Consequently, as more data is cleaned, we can better identify common patterns to prioritize the search on future data. We may not necessarily need a neural network for data cleaning, but the overall algorithmic structure of AlphaGo, tree search combined with a learned heuristic, can be very useful.

AlphaClean is a new data cleaning system that is designed around a search-based model. It takes as input a quality function that models the data quality of a relation as a real value in [0, 1] and a language of parameterized data cleaning operators, and it outputs a sequence of data cleaning transformations (a cleaning program) from the language that seeks to maximize the quality function. This API imposes minimal restrictions on the quality function, giving it tremendous flexibility in terms of the data errors that it can express. As a comparison, a recent system called Holoclean [98] is similar in that it provides a rich interface to specify denial constraints and lookup tables, and integrates them within a probabilistic model to more accurately resolve constraint violations. In contrast, AlphaClean's quality function is a user-defined function that can express arbitrary combinations of denial constraints as well as statistical, quantitative, text formatting, and other classes of data errors. For instance, Section 4.4 shows how AlphaClean can perform cleaning for machine learning applications by embedding machine learning training and evaluation within the quality function, while Section 4.4 combines entity resolution, outlier cleaning, and functional dependency violations within the same function.

AlphaClean uses a best-first search that greedily appends data transformations to the set of best candidate programs seen so far, and it adopts parallelization and pruning ideas from the search-based planning literature. In contrast to traditional search problems, where the search state (e.g., a chess board) is compact and largely trivial to parallelize in a distributed setting, the data cleaning search state is the size of the input dataset and introduces a trade-off between the communication costs to share intermediate state and the degree of parallelism possible. To further accelerate its runtime, AlphaClean can also encode problem-specific optimizations as search pruning rules (e.g., disallowed transformation sequences) or modifications to the data representation (e.g., clustering similar records). We find that many optimizations in existing cleaning systems may be cast as search pruning rules.
AlphaClean can adaptively learn pruning rules to avoid unprofitable search branches during the search process.

While AlphaGo used a deep neural network and massive amounts of training data to model the heuristic, we found that a simple logistic regression classifier and training data gathered during the search process were sufficient to reduce the runtime by multiple orders of magnitude. It is important to acknowledge that AlphaClean loses many of the provable guarantees provided by specialized systems based on logical constraints and is, in some sense, best-effort. Across 8 real-world datasets used in prior data cleaning literature, we show that AlphaClean matches or exceeds the cleaning accuracy and exhibits competitive runtimes relative to state-of-the-art approaches that are specialized to a specific error domain (constraint, statistical, or quantitative errors). Beyond accuracy and runtime, a single search-based framework offers several additional benefits:

• Optimization: Existing cleaning pipelines must combine disparate cleaning solutions and manage glue code to transfer data between systems. However, data transfer can be a non-trivial portion of the total cleaning cost, and recent work [93] has shown that a common runtime can avoid data movement and improve runtimes by up to 30×.

• Generalization and Robustness: In contrast to existing cleaning systems that output a cleaned relation, AlphaClean outputs a composable cleaning program that is both simpler to understand, because it groups common fixes together, and applicable to new data. In addition, the cleaning program allows us to analyze cleaning systems for overfitting. We indeed find that the complexity of the quality function and cleaning language can lead to overfitting, and we believe this is the case for any data cleaning system. We also show that simple changes to the quality function can act as regularization terms to control the degree of overfitting.

• Software Engineering: Users do not need to write and manage glue code; do not need to manage ETL between systems; and do not need to learn system specific abstractions, languages, and assumptions. All of these are arguably intangible, but critical friction points in the highly iterative cleaning process.

• New Cleaning Applications: We envision that a single API makes it possible to support an ecosystem of domain-specific cleaning specification libraries. For instance, interactive visualization interfaces akin to [1, 54, 133, 134] can directly translate user interactions into a wide range of cleaning constraints.

4.1 Problem Setup

First, we overview the basic formalism of AlphaClean and present its relationship to related work. We focus on data transformations that concern a single relational table. Let R be a relation over a set of attributes A, let ℛ denote the set of all possible relations over A, and let r.a be the value of attribute a ∈ A for row r ∈ R. A data transformation T(R): ℛ ↦ ℛ maps an input relation instance R ∈ ℛ to a new (possibly cleaner) instance R′ ∈ ℛ that is union compatible with R. For instance, “replace all city attribute values equal to San Francisco with SF” may be one data transformation, while “delete the 10th record” may be another.

Aside from union compatibility, transformations are simply UDFs. Although it is desirable for transformations to be deterministic and idempotent, if they are not, it simply makes the cleaning problem harder. Data transformations can be composed using the binary operator ∘ as follows: (Ti ∘ Tj)(R) = Ti(Tj(R)). The composition of one or more data transformations is called a cleaning program p. If p = p′ ∘ T, then p′ is the parent of p; the parent of a single data transformation is a NOOP. In practice, users specify transformation templates T(θ1, …, θk), and every assignment of the parameters represents one possible transformation. Although T can in theory be an arbitrary deterministic template function, our current implementation makes several simplifying assumptions to bound the number of data transformations that it can output. We assume that a parameter θi is typed as an attribute or a value. The former means that θi's domain is the set of attribute names in the relation schema; the latter means that θi's domain is the set of cell values found in the relation, or otherwise provided by the user.

Example 1 The following City relation contains two attributes, city_name and city_code. Suppose there is a one-to-one relationship between the two attributes. In this case, the relation is inconsistent with respect to the relationship and contains errors highlighted in red.

   city_name        city_code
1  San Francisco    SF
2  New York         NY
3  New York City    NYC
4  San Francisc     SF
5  San Jose         SJ
6  San Mateo        SM
7  New York City    NY

The following transformation template uses three parameters: attr specifies an attribute, srcstr specifies a source string, and targetstr specifies a target string.

T = find_replace(srcstr, targetstr, attr)

The output of the above is a transformation T that finds all attr values equal to srcstr and replaces those cells with targetstr. For instance, find_replace(“NYC”, “NY”, “city_code”) returns a data transformation that finds records in City whose city_code is “NYC” and replaces that value with “NY”.
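A minimal sketch of this template, assuming relations are represented as lists of attribute-to-value dictionaries, is shown below; the representation and helper names are illustrative and not AlphaClean's actual API.

def find_replace(srcstr, targetstr, attr):
    """Template: every parameter assignment yields one data transformation."""
    def transform(relation):
        # relation is a list of dict rows; the output is union compatible with the input
        return [{**row, attr: targetstr} if row[attr] == srcstr else dict(row)
                for row in relation]
    return transform

def compose(*transforms):
    """(Ti o Tj)(R) = Ti(Tj(R)): apply the transformations right to left."""
    def program(relation):
        for t in reversed(transforms):
            relation = t(relation)
        return relation
    return program

# Usage on a fragment of the City relation from Example 1:
City = [{"city_name": "New York City", "city_code": "NYC"},
        {"city_name": "New York City", "city_code": "NY"}]
fix = find_replace("NYC", "NY", "city_code")
print(fix(City))   # both rows now have city_code == "NY"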

Let Σ be a set of distinct data transformations {T1, …, TN}, and let Σ* be the set of all finite compositions of transformations in Σ, i.e., Ti ∘ Tj. A formal language L over Σ is a subset of Σ*. A program p is valid if it is an element of L.

Example 2 Continuing Example 1, Σ is defined as all possible parameterizations of find_replace. Since many possible parameterizations are nonsensical (e.g., the source string does not exist in the relation), we may bound Σ to only the source and target strings present in each attribute's instance domain (a standard assumption in other work as well [33]). In this case, there are 61 possible data transformations, and Σ* defines any finite composition of these 61 transformations. The language L can be further restricted to compositions of up to k data transformations.

Finally, let Q(R): ℛ ↦ [0, 1] be a quality function where 1 implies that the instance R is clean, and a lower value corresponds to a dirtier table. Since running a program p ∈ L on the initial dirty table Rdirty returns another table, Q(p(Rdirty)) returns a quality score for each program in the language. Q is a UDF and we do not impose any restrictions on it. In fact, one experiment embeds training and evaluating a machine learning model within the quality function (Section 4.4). Another experiment shows that AlphaClean can be robust to random noise injected into the function (Section 4.4). Even so, we call out two special cases that provide optimization opportunities. We define two sub-classes of quality functions: row-separable and cell-separable quality functions. The former expresses the overall quality in terms of a row-wise quality function q(r): R ↦ [0, 1] where 0 implies that the record is clean:

Q(R) ∝ Σ_{r ∈ R} q(r).

Similarly, a cell-separable quality function means that there exists a cell-wise quality function q(r, a): (R × A) ↦ [0, 1] such that the quality function is the sum of each cell's quality: Q(R) ∝ Σ_{r ∈ R} Σ_{a ∈ A} q(r, a). These special cases are important because they can define hints on what types of transformations are irrelevant. For example, if the quality function is cell-separable and we have identified that a set of cells C is dirty (e.g., they violate constraints), then we can ignore transformations that do not modify cells in C. This restricts the size of the language and makes the problem much easier to solve. We are now ready to present data cleaning as the following optimization problem:

Problem 3 (clean(Q, Rdirty, L)) Given a quality function Q, a relation Rdirty, and a language L, find a valid program p* ∈ L that optimizes Q:

p* = argmax_{p ∈ L} Q(p(Rdirty)).

p*(Rdirty) returns the cleaned table, and p* can be applied to any table that is union compatible with Rdirty. A desirable property of this problem formulation is that it directly trades off runtime with cleaning accuracy and can be stopped at any time (though the cleaning program may be suboptimal). At the limit, AlphaClean simply explores L and identifies the optimal program.

Example 3 Continuing Example 1, let us assume the following functional dependencies over the example relation: city_name → city_code and city_code → city_name. We can efficiently identify inconsistencies by finding the city names that map to more than one city code, and vice versa. Let such city names and codes be denoted Dcity_name and Dcity_code, respectively. Q(R) is a cell-separable quality function where the cell-wise quality function is defined as q(r, a) = 1 − (r.a ∈ Da); that is, the cell quality is 1 if the attribute value does not violate a functional dependency, and 0 otherwise. By searching through all possible programs up to length 3 in L, we can find a cleaning program based on find_replace that resolves all inconsistencies:

find_replace(New York, New York City, city_name)
find_replace(San Francisc, San Francisco, city_name)
find_replace(NYC, NY, city_code)
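A minimal sketch of this cell-separable quality function, under the same list-of-dictionaries representation assumed earlier, is shown below; it is an illustration of the formalism rather than AlphaClean's internal implementation.

from collections import defaultdict

def fd_violating_values(relation, lhs, rhs):
    """Values of lhs that map to more than one rhs value (violations of lhs -> rhs)."""
    mapping = defaultdict(set)
    for row in relation:
        mapping[row[lhs]].add(row[rhs])
    return {value for value, targets in mapping.items() if len(targets) > 1}

def quality(relation):
    """Cell-separable quality: fraction of cells whose value is not in D_a."""
    D = {"city_name": fd_violating_values(relation, "city_name", "city_code"),
         "city_code": fd_violating_values(relation, "city_code", "city_name")}
    cells = [(row, attr) for row in relation for attr in D]
    clean = sum(1 for row, attr in cells if row[attr] not in D[attr])
    return clean / len(cells) if cells else 1.0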

Approach Overview and Challenges Our problem formulation is a direct instance of planning in AI [99], where an agent identifies a sequence of actions to achieve a goal. In our setting, the agent (AlphaClean) explores a state space (ℛ) from an initial state (the input relation) by following transitions (applying Ti ∈ Σ) such that the sequence of actions is valid (within Σ*) and the quality of the final state (Q(Rfinal)) is maximized. For readers familiar with stochastic processes, this search problem is equivalent to a deterministic Markov Decision Process (MDP), where the states are ℛ, the actions are Σ, the transition function updates the instance with the transformation, the initial state is the dirty instance Rdirty, and the reward function is Q.

One may be hesitant in adopting our problem formulation because, although it is sufficiently general to model many existing data cleaning problems, such generality often comes at the expense of runtime performance. The planning problem is APX-Hard, meaning there does not exist a polynomial-time approximation scheme unless P=NP. Let R be a single-attribute relation of Booleans, and let L be the set of all assignments to a single value. Given a list of N Boolean clauses over all the Boolean variables, let Q assign to each record one minus the fraction of clauses that evaluate to true. This formulation is equivalent to MAX-SAT, so solving the optimization problem would solve MAX-SAT.

For this reason, the key technical challenge is to show that AlphaClean can solve data cleaning problems with run-times comparable to existing specialized systems, and can be easily extended to support new optimizations. Despite the problem's worst-case complexity, recent successes in similar planning problems, ranging from AlphaGo [109] to automatically playing Atari video games [85], have shown that a prudent combination of machine learning and distributed search can find practical solutions by leveraging the structure of the problem. Not every problem instance is as pathological as the worst case complexity suggests, and there are many reasonable local optima.

[Figure 4.1 block diagram: high-level goals enter the Specification Interface (data transform, quality function, and feature libraries), which feeds the data, quality function, and language to the Tree Search (static pruners, dynamic pruners, parallelizer); the resulting cleaning program passes through the Program Optimizer (loop fusion, literal inlining, code generation) to produce the optimized cleaning program.]

Figure 4.1: AlphaClean transforms high level cleaning goals (e.g., integrity constraints, statistical models, cleaning operations) into a quality measure and cleaning language, and uses an optimized tree search to find and optimize a sequence of transformations (cleaning program) to maximize the quality measure.

4.2 Architecture and API

This section describes the three major components of the AlphaClean system architecture (Figure 4.1), split between the user interface to specify the cleaning problem, the core search algorithm, and the optimization of the resulting cleaning program. We detail the interface and program optimization in this section, and focus on the search algorithm and its optimizations in the rest of the section. Designing data cleaning interfaces [43] and query optimization for data cleaning [38, 58] are interesting problems in their own right, and we believe AlphaClean can enable simpler and more powerful approaches. However, these were not a focus of the current study.

The Specification Interface contains extensible libraries of domain-specific quality functions, cleaning data transformations, and pruning feature hints that the user can use to define a high-level cleaning goal. Our current implementation can translate a wide range of domain-specific goals—including type constraints, functional dependencies, denial constraints, lookup tables, parametric and non-parametric outlier models, and correlations between numeric attributes—into a single quality function. Users can also constrain the cleaning language, and we currently support any type of conditional cell and record transformation that applies a black-box transformation to all cells or records satisfying a predicate; both the transformation parameters and the predicate are learned by AlphaClean. Finally, users can optionally provide features that AlphaClean uses to dynamically learn pruning rules.

The Search component takes as input the quality function Q and language L, and performs a greedy search heuristic to find a program p* ∈ L that maximizes Q. Users can supply hints that exploit the problem structure to reduce the runtime by pruning the search space and parallelizing the search. To bound the search space, users specify a maximum program length k. AlphaClean also supports three classes of optimizations. Static pruning invalidates candidate programs based on the program structure (sequence of actions). For instance, composing the same idempotent transformation (e.g., find_replace(SFO, SF, city_name)) in succession is unnecessary. Dynamic pruning can access the result of a candidate program (search state) when making pruning decisions, and we propose a novel approach to automatically learn dynamic pruning rules that can reduce end-to-end runtimes by up to 75% at the expense of slightly lower recall. Finally, AlphaClean parallelizes the search in both shared memory and distributed settings. We describe the pruning rules and parallelization optimizations in detail in the following text.

Finally, once the search component outputs the cleaning program p*, the Program Optimizer performs query compilation optimizations and provides an API to add new optimizations. AlphaClean currently replaces variables with literal values whenever possible, inlines data transformations into loops that scan over the input relation, and uses loop fusion [25, 93] to avoid unnecessary scans over the input and intermediate relations for each transformation in p*. Consider the program in Example 3. Since the find_replace operations do not conflict, it is inefficient to execute them sequentially and iterate over the relation instance three separate times.
Instead, they can be fused into a single pass:

    for r in rows:
        if r['city_name'] == 'New York':
            r['city_name'] = 'New York City'
        elif r['city_name'] == 'San Francisc':
            r['city_name'] = 'San Francisco'
        if r['city_code'] == 'NYC':
            r['city_code'] = 'NY'

These simple optimizations improve the final program runtime by up to 20×, and we leave further improvements to future work (e.g., the program could be optimized with a framework like Weld [93]).

4.3 Search Algorithm

We now provide an overview of AlphaClean’s search algorithm and optimizations.

Naive Search Procedures In principle, any tree search algorithm over L would be correct. However, the traversal order and expansion policy are important in this search problem. We describe the algorithmic and practical reasons why two naive procedures—breadth-first search (BFS) and depth-first search (DFS)—exhibit poor search runtimes.

Figure 4.2: Main Algorithmic Loop

BFS: This approach extends each program in the search frontier with every possible data transformation in Σ. To extend a candidate program lc with T ∈ Σ, it evaluates Q((T ∘ lc)(R)). Unfortunately, the frontier grows exponentially with each iteration. Additionally, evaluating every new candidate program T ∘ lc can be expensive if the input relation is large. Although the cost can be reduced by materializing lc(R), it is not possible to materialize all candidate programs in the frontier for all but the most trivial cleaning problems. It is desirable to use a search procedure that bounds the size of the frontier and the materialization costs. The first problem with this algorithm is that each node o in this tree represents a sequence of transformations. Evaluating the value of o can be very expensive since it would have to evaluate the entire path to the root: o is a composition of many transformations and may require a number of passes over the dataset. This can be avoided if we can materialize (either to disk or memory) the frontier, that is, for each node in the priority queue o ∈ O, cache the result of o(R). However, with BFS, the frontier is exponential in the support of the language and the system would quickly run out of memory.

DFS: Depth-first search only needs to materialize the intermediate results for a single program at a time; however, it is highly inefficient since the vast majority of programs that it explores will have low quality scores.

Search Algorithm and Optimizations Best-first search expands the most promising nodes chosen according to a specified cost function. We consider a greedy version of this algorithm, which removes nodes on the frontier that are more than γ times worse than the current best solution. Making γ smaller makes the algorithm asymptotically consistent but uses more memory to store the frontier, whereas γ = 1 is a pure greedy search with minimal memory requirements.

The frontier is modeled as a priority queue P where the priority is the quality of the candidate program, and it is initialized with a NOOP program with quality Q(R). The algorithm iteratively extends all programs in the queue with fewer than k transformations; a program p is extended by composing it with a transformation T in Σ. If the resulting program p′ is in the language L, then we add it to the queue. Finally, let p̄ be the highest quality program in the queue. The algorithm removes all programs whose quality is < γ × Q(p̄(R)) from the frontier. This process repeats until the candidate programs cannot be improved, or all programs have k transformations.

In a naive and slow implementation, the above algorithm computes p's quality by fully running p on the input relation before running Q on the result, explores all possible data transformation sequences, and runs sequentially. One of the benefits of its simple structure is that it is amenable to a rich set of optimizations that prune the search space, incrementally compute quality functions, and parallelize the search. In fact, we find that many optimizations in existing specialized cleaning systems can be cast in terms of the following classes. We can materialize (either to disk or memory) the frontier, that is, for each node in the priority queue p ∈ P, we cache the result of p(R). Then, when we expand a node to p′ = p ∘ T, we only have to incrementally evaluate T over the cached p(R). After the node is expanded, the result is added to the cache if it is within γ of the best solution. The basic algorithm described above is well-suited for this problem: without the greediness, the frontier might be exponentially large, leading to an impractical amount of materialization. By tuning γ, the user can essentially set how much memory is used for materialization.

Static Pruning Rules are boolean functions that take a candidate program p as input and decide whether it should be pruned. AlphaClean currently models static rules as regular expressions over Σ. Static rules can be viewed as filters over L: static_rule(p) ↦ {0, 1}. For example, since the find-and-replace operations are idempotent, i.e., T(T(R)) = T(R), we may want to only consider the set of all sequences with no neighboring repeated transformations. Similarly, we may also want to prune all search branches that have no effect (i.e., find-and-replace New York with New York). These two regular expressions alone reduce the above example's language by 48% (from 226,981 to 120,050). Other rules, such as avoiding changes that undo previous changes, T⁻¹(T(R)) = R, are similarly easy to add.

Dynamic Pruning Rules also have access to the input relation and quality function, and can make instance-specific pruning decisions: dyn_rule(p, Q, R) ↦ {0, 1}.

For example, suppose Q is based on functional dependencies and is cell-separable, and we want to ensure that cell-level transformations made by a candidate program p individually improve Q. In this case, we find the cells C that initially violate the functional dependencies and ensure that the cells transformed by p are all in C. Applying this optimization, in addition to the others in AlphaClean, to the example reduces the search space by 143× from 226,981 candidate programs to only 1582.

Since it can be challenging to hand-write pruning rules, there is a dynamic approach that uses simple machine learning models to automatically identify the characteristics of candidate programs and decide whether a particular search branch will be promising. In essence, it generates and refines static pruning rules during the search process. For example, we may want to ensure that all the evaluations are “correlated” with the cost function, that is, that they make modifications that are likely to affect the cost. This is possible if the cost is separable, so that we have a score for each cell. In this case, we can find all the cells in violation of the functional dependencies and make sure that the “source” field of the find-and-replace operations only matches values that are in violation. These optimizations are called “dynamic” because they can be determined from the active domain (i.e., after applying a transformation, recalculate new optimization rules). Applying this optimization (in addition to the others) to the example reduces the search space to 1582 evaluations vs. 226,981 unoptimized (a 143× reduction).

Block-wise Cleaning: A major cost is that independent errors in the relation must be cleaned sequentially in the search algorithm. For instance, records 2, 3, and 4 in the example exhibit independent errors, and a fix for a given record does not affect the other records. Thus, if each record were cleaned in isolation, the search space would be O(|Σ|). Unfortunately, the entire relation requires a program of three transformations to fix the records, which increases the search space to O(|Σ|³). The main insight in block-wise cleaning is that many errors are local to a small number of records. In these cases, it is possible to partition R into a set of blocks B1, …, Bm, execute the search algorithm over each block independently, and concatenate the resulting programs to generate the final cleaning program. This gather-scatter approach can exponentially reduce the search space for each block, and it reduces the cost of evaluating super-linear quality functions that require, e.g., computing pair-wise similarity scores for the input relation. For example, quality functions derived from functional dependencies can define blocks by examining the violating tuples linked through the dependency. Similarly, users can define custom partitioning functions or learn them via, e.g., clustering algorithms. In our current implementation, we partition the input relation by cell or row if the quality function is cell- or row-separable.

Parallel Program Evaluation: Candidate programs can be evaluated and pruned in parallel across multiple cores and multiple machines, and this is one of the major innovations in modern planning systems. However, unlike classic planning problems where communication is small due to the compact size of each state, AlphaClean's state size is determined by the relation instance, and this can lead to prohibitively high communication costs and memory caching requirements.
We describe below how we manage these trade-offs when parallelizing AlphaClean in shared memory and distributed settings.

Materialization: Since each candidate program p′ = p ∘ T is the composition of a previous candidate program and a transformation, an obvious optimization is to materialize the output of p(R) and incrementally compute p′(R) as T(p(R)) over the materialized intermediate relation. Although this works well in a single-threaded setting, memory and communication challenges arise when combining materialization with parallelization.
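Putting these pieces together, the following is a minimal sketch of the greedy best-first loop with frontier materialization and γ-pruning described in this section; static and dynamic pruning, block-wise cleaning, and parallel evaluation are omitted for brevity, and the program representation is an assumption.

def alphaclean_search(R_dirty, transforms, quality, k=3, gamma=0.75):
    """Greedy best-first search over programs of up to k transformations.

    transforms is the language Sigma as a list of callables R -> R and
    quality is Q: R -> [0, 1]. Each frontier entry stores (program,
    materialized result, quality) so extending a program only applies
    the one new transformation to the cached intermediate relation.
    """
    q0 = quality(R_dirty)
    frontier = [([], R_dirty, q0)]                 # start from the NOOP program
    best_prog, best_q = [], q0

    for _ in range(k):
        candidates = []
        for prog, materialized, _ in frontier:
            for t in transforms:
                new_rel = t(materialized)          # incremental evaluation
                q = quality(new_rel)
                candidates.append((prog + [t], new_rel, q))
                if q > best_q:
                    best_prog, best_q = prog + [t], q
        # gamma-pruning: keep only candidates within gamma of the best quality
        frontier = [c for c in candidates if c[2] >= gamma * best_q]
        if not frontier:
            break
    return best_prog, best_q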

4.4 Experiments

Next, we present experimental results that suggest three main conclusions: (1) as a single framework, AlphaClean can achieve parity in terms of accuracy with state-of-the-art approaches to a variety of different problems, ranging from integrity constraint satisfaction to statistical data cleaning and data cleaning for machine learning, (2) it is possible to significantly reduce the runtime gap between AlphaClean and specialized frameworks using simple pruning and distributed parallelization techniques, and (3) AlphaClean enables automatic data cleaning for datasets containing a mixture of error types in an acceptable amount of time.

Datasets and Cleaning Tasks We list the main characteristics of the 8 experimental datasets.

Flight: The flight dataset [30] contains arrival time, departure time, and gate information aggregated from 3 airline websites (AA, UA, Continental), 8 airport websites (e.g., SFO, DEN), and 27 third-party websites. There are 1200 flight departures and arrivals at airline hubs recorded from each source. Each flight has a unique and globally consistent ID, and the task is to reconcile data from different sources using the functional dependency ID → arrival, departure, gate information.

FEC: The election contributions dataset has 6,410,678 records and 18 numerical, discrete, and text attributes. This dataset records the contribution amount and contributor demographic information, e.g., name, address, and occupation. The task is to enforce city → zipcode and match occupation to a codebook of canonical occupations. The quality function is 1 if the occupation is in the codebook and 0 otherwise; we penalize the edit distance between the original and edited occupation values.

Malasakit: This dataset contains 1493 disaster preparedness survey responses from the Philippines, with 15 numeric and discrete attributes. The task removes improper numerical values and dummy test records. This consists of domain integrity constraints that force the values to be within a certain dictionary.

Physician: This dataset from Medicare.gov contains 37k US physician records with 10 attributes, with spelling errors in city names, zipcodes, and other text attributes. We use the data cleaning rules described in [98], which consist of 9 functional dependencies.

Figure 4.3: Comparison with denial constraint systems on the Flight dataset. (A) AlphaClean (AC) matches or exceeds the accuracy of the specialized systems. (B) The structure of the data errors, along with learning and parallelization (AC+L x64), lets AlphaClean scale sub-linearly and outperform all but HoloClean's reported runtimes.

Census: This US adult census dataset contains 32k records and 15 numeric and discrete attributes. There are many missing values coded as 999999. The task is to clean numeric values to build a classification model that predicts whether an adult earns more than $50k annually.

EEG: The 2406 records are each a variable-length time series of EEG readings (16 numeric attributes), labeled as “Preictal” for pre-seizure and “Interictal” for non-seizure. The goal is to predict seizures based on 32 features computed as the mean and variance of the numeric EEG attributes. The task is to identify and remove outlier reading values.

Stock: There are 1000 ticker symbols from 55 sources for every trading day in a month [30]. The cleaning task involves (1) integrating the schemas by matching attributes across the sources (e.g., ‘prev. close’ vs ‘Previous Close’), and then (2) reconciling daily ticker values using primary-key-based functional dependencies akin to the flight dataset.

Terrorism: The Global Terrorism Database [35] is a dataset of terrorist attacks scraped from news sources. Each record contains the date, location, details about the attack, and the number of fatalities and injuries. The dataset contains a mixture of data errors: (1) there are many duplicates for the same terrorist incident, (2) many missing values in the fatalities and injuries attributes are encoded as zeros, which overlaps with attacks that did not have any fatalities or injuries, and (3) the location attributes are inconsistently encoded. We used the dataset from 1970, and there are 170,000 records. We downloaded this dataset and sought to understand whether terrorist attacks have become more lethal than they were in the 1970s. To do so, we hand-cleaned the records to create a gold standard. It turns out that, in this dataset, attacks have become more lethal, but fewer in number, than 50 years ago. This task was intentionally open-ended to represent the nature of the iterative analysis process.

Figure 4.4: Quantitative data cleaning on the Census and EEG datasets for a classification application. AlphaClean's transformations can clip outliers or set them to a default value. (A) AlphaClean has higher accuracy than outlier detection algorithms (MCD, dBoost) and than AlphaClean with a single transform template (DO). (B) Optimizations improve AlphaClean's runtime by over an order of magnitude.

Denial Constraints Denial constraints express a wide range of integrity constraints and form the basis of many data cleaning systems. Although AlphaClean may not fully enforce integrity constraints, we can compose a quality function that quantifies the number of constraint violations and find transformations that reduce the number of violations. We use the Flight dataset for these experiments. Baselines: We compare against (1) Llunatic, a denial constraint-based cleaning system [26] implemented in C++ on top of PostgreSQL (constraints are specified as tuple-generating dependencies), and (2) a restricted chase algorithm [12] implemented in Python. We compare against the chase because a large portion of denial constraints are functional dependencies and can be resolved using fixed-point iteration. We report numbers from the recent HoloClean publication [98], which used the same datasets and constraints, but we did not run that experiment ourselves. Results: Figure 4.3a shows the precision and recall of each approach based on known ground truth. AlphaClean matches or beats the accuracy of the baselines; however, its runtime without any learning (AC) scales poorly compared to the alternatives (Figure 4.3b). Using learning (AC+L) yields performance on par with Llunatic, and parallelization on 64 threads is comparable to HoloClean's reported runtime. The results suggest that learning exhibits sublinear scaling because AlphaClean learns more effective pruning rules as it sees more data. These performance gains come at the expense of slightly reduced accuracy. We also evaluated AlphaClean (single threaded, without learning) on the FEC, Malasakit, and Physician datasets. Their precision, recall, and runtimes are as follows: FEC: 94% prec, 68% rec, 5 hrs; Malasakit: 100% prec, 85% rec, 0.39 hrs; Physician: 100% prec, 84% rec, 3.4 hrs.
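As an illustration of a violation-counting quality function, here is a minimal sketch for the flight dataset's functional dependency ID → (arrival, departure, gate); the column names and the list-of-dicts relation format are assumptions for the example, not AlphaClean's internal representation.

    # Sketch: quality term counting violations of ID -> (arrival, departure, gate).
    from collections import defaultdict

    def fd_quality(rows, key="flight_id", dependents=("arrival", "departure", "gate")):
        groups = defaultdict(set)
        for r in rows:
            groups[r[key]].add(tuple(r[d] for d in dependents))
        # Each key should map to exactly one combination of dependent values.
        violations = sum(len(vals) - 1 for vals in groups.values())
        return -violations  # higher is better: zero violations gives quality 0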

Quantitative Data Cleaning This experiment performs numerical cleaning on machine learning data. In these applications, prediction labels and test records are typically clean and available (e.g., results of a sales lead), whereas the training features are often integrated from disparate sources and exhibit considerable noise (e.g., outliers). Our quality function is simply defined as the model's accuracy on a training hold-out set, and we report the test accuracy on a separate test set. We trained a binary classification model using sklearn on the Census and EEG datasets. We used standard featurizers (one-hot encoding for categorical data, bag-of-words for string data, numerical data as is) similar to [42]. We split the dataset into 20% test and 80% training, and further split the training portion 20/80 into hold-out and training sets. We run the search over the training data and evaluate the quality function using the hold-out set. Final numbers are reported on the test data. We defined the following three data transformation templates that set numerical attribute values in a record R if they satisfy a predicate (a minimal sketch of these templates follows the list):

• clip_gt(attr, thresh): R.attr = thresh if R.attr > thresh

• clip_lt(attr, thresh): R.attr = thresh if R.attr < thresh

• default(attr, badval): R.attr is set to the attribute's mean value if R.attr = badval
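The sketch below shows one way these templates could be expressed as parameterized row transformations; the closure-based style and signatures are illustrative rather than AlphaClean's exact API, and default() takes the precomputed column mean as an argument.

    # Sketch of the three templates as parameterized row transformations.
    def clip_gt(attr, thresh):
        def t(row):
            return dict(row, **{attr: thresh}) if row[attr] > thresh else row
        return t

    def clip_lt(attr, thresh):
        def t(row):
            return dict(row, **{attr: thresh}) if row[attr] < thresh else row
        return t

    def default(attr, badval, mean_val):
        # mean_val is assumed to be the precomputed mean of the column.
        def t(row):
            return dict(row, **{attr: mean_val}) if row[attr] == badval else row
        return t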

Baselines: We compare with 4 baselines: No Cleaning (NC); Minimum Covariance Determinant (MCD), a robust outlier detection algorithm used in [7] that sets all detected outliers to the mean value; dBoost, which uses a fast partitioned histogram technique to detect outliers [79]; and Default Only (DO), which runs AlphaClean with only the default() transformation. Results: The classifier achieves 82% and 79% accuracy on the uncleaned Census and EEG data, respectively. Most outliers in the census data are far from the mean, so MCD and dBoost can find them effectively. Further, setting census outliers to the mean is sensible. However, the same fix is not appropriate for the EEG data; it is better to clip the outlier values, so MCD, dBoost, and DO have negligible or negative impact on accuracy. When we realized this from running DO, it was straightforward to add the clipping transformations to the language and, with no other changes, re-run AlphaClean with drastically improved EEG accuracies. Vanilla AlphaClean (AC) is nearly 10x slower than MCD, but adding learning and 16-thread parallelization matches MCD's runtime. dBoost is specialized for fast outlier detection, and AlphaClean is unlikely to match its runtime.

Schema Integration We now evaluate cleaning on the stock dataset [30], which integrates 55 different sources and contains a mixture of schema integration and functional dependency errors.

Figure 4.5: Schema integration and functional dependency errors in the stock dataset. Cleaning schema integration (SI) or functional dependencies (DC) in isolation results in high precision but poor recall. AlphaClean cleans both types of errors with higher recall, and is as fast as DC.

Baselines: We compare approaches for each error type in isolation. Schema Integration (SI) matches attributes based on attribute-name string similarity, and Data Cleaning (DC) assumes schemas are consistent and enforces functional dependencies using a restricted chase [12]. Results: The specialized cleaning approaches exhibit high precision but low recall because they miss records with multiple errors (Figure 4.5). AlphaClean can mix both tasks in the quality function and has comparable precision with much higher recall. SI is significantly faster since it only reads schema metadata; however, AlphaClean with learning and 16 threads is competitive with DC.

Mixing Error Classes In this experiment, we use the multi-error Terrorism dataset [35]. Baselines: We hand-coded blocking-based entity matching (EM) and restricted chase (FD) algorithms in Python, and used dBoost for missing values. We also ran EM, dBoost, and the chase serially on the dataset (Combined). Results: Figure 4.7a shows that combining and cleaning all three classes of errors within the same AlphaClean framework (AC) achieves higher precision and recall than all baselines. In fact, the combined baseline still does not achieve AlphaClean's level of accuracy because the cleaning operations need to be interleaved differently for different blocks. Although AlphaClean is slower than any individual baseline, parallelizing AlphaClean to 8 threads makes it faster than the combined baseline by 2x.

Figure 4.6: The Global Terrorism Database is a dataset of terrorist attacks scraped from news sources since 1970. (A) AlphaClean can integrate many different forms of cleaning that were previously handled by disparate systems. (B) AlphaClean achieves a runtime competitive with running all of the different systems and accounting for data transfer time between them.

Figure 4.7: The terrorism dataset contains 3 classes of errors. (A) A unified cleaning framework outperforms individual cleaning approaches and even a serial combination. (B) AlphaClean is easily parallelized to 8 threads to outperform the combined baselines.

AlphaClean In Depth This subsection uses the FEC setup to study the parameters that affect AlphaClean's accuracy and runtime, the robustness of its cleaning programs, and its algorithmic properties.

Algorithmic Sensitivity

Block-wise Cleaning: Partitioning the dataset into smaller blocks effectively reduces the problem complexity. This can have tremendous performance benefits when each block exhibits very few errors that are independent of the other blocks. Figure 4.8a shows the performance benefits when varying the block size; we define the blocks by partitioning on three different attributes that have different domain sizes. Reducing the block size directly improves the runtime; the search is effectively non-terminating when blocking is not used. Language: Figure 4.8b fixes the input to a single block and evaluates the runtime based on the size of the transformation language |Σ|.

Figure 4.8: (A) The degree of block-wise partitioning directly affects search time, (B) increasing the transformation language Σ exponentially increases the search time, but learning is very effective at pruning the increased search space, (C) quality functions that couple errors between random records are significantly harder to optimize, (D) AlphaClean degrades gracefully when adding irrelevant error types to the problem.

Figure 4.9: (A) Regularization by increasing an editCost penalty makes AlphaClean more robust to noisy (mis-specified) quality functions, (B-C) overly expressive transformation templates can lead to overfitting (2 attr) or an infeasibly large search space (3 attr).

Increasing the number of transformations increases the branching factor of the search problem. The search time is exponential in the language; however, AlphaClean's learning optimization can identify a pruning model that reduces the runtime to linear. Coupling in the Quality Function: The complexity of the quality function directly affects search time. A cell-separable quality function is the simplest to optimize because each cell in the relation can be analyzed and cleaned in isolation. In contrast, a quality function that couples multiple records together is more challenging to optimize because a longer sequence of transformations may be needed to sufficiently clean the records and improve the quality function. We evaluate this by artificially coupling between 1 and 10 records together and creating a quality function that only improves when an attribute of the coupled records all have the same value. We perform this coupling in two ways: Random couples randomly selected records, whereas Correlated first sorts the relation on an attribute and couples records within a contiguous range. We expect that random coupling requires individual cleaning operations for each record based on their IDs, whereas the correlated setting both allows AlphaClean to exploit the correlated structure to learn effective pruning rules and lets the coupled records be cleaned with a single operation. Figure 4.8c shows that this is indeed the case when running AlphaClean on a single fixed-size block. Random slows down exponentially with increased coupling, whereas Correlated increases linearly. Quality Function Complexity: Finally, we incrementally increase the quality function's complexity and show how it affects the cleaning accuracy.

Figure 4.10: (A) Both the materialization (Cache) and distributed communication (Comm) opti- mizations contribute to improved scale-out runtimes. (B) The learned pruning rules improve the search costs for each subsequent block-wise partition. (C) Best-first search is better than BFS and DFS; reducing γ prunes more candidates at the expense of lower accuracy.

We add the following constraints in sequence: one functional dependency (FD), a second FD, an entity resolution similarity rule, and a third FD. We define the quality function as the sum of each constraint's quality function. Figure 4.8d shows that the F1 accuracy decreases as more constraints are added; however, the F1 score remains above 75%.

Cleaning Generalization and Overfitting An important characteristic of generating cleaning solutions as programs is that we can evaluate a program's robustness in terms of machine learning concepts such as overfitting and generalization. To this end, we examine two concepts in the context of data cleaning: regularization and overfitting. We also find that AlphaClean's high-level interface is helpful for iteratively tuning the cleaning process. Regularization: Misspecified quality functions can cause AlphaClean to output poorly performing cleaning programs. We simulate this by adding random noise to the output of the quality function. Figure 4.9A plots the F1-score of AlphaClean on the FEC experiment while varying the amount of noise. As expected, the output program's F1-score rapidly degrades as the noise increases. Machine learning uses regularizing penalty terms to prevent overfitting; we can similarly add a penalty to the quality function to discourage making too many edits. Each line in the plot shows a different edit-cost penalty: although the F1 is lower when there is no noise, AlphaClean becomes more robust to larger amounts of noise. Overfitting: In machine learning, over-parameterized models may be susceptible to overfitting. A similar property is possible if the language Σ is overly expressive. We use a transformation template that finds records matching a parameterized predicate and sets an attribute to another value in its instance domain. We then vary the language expressiveness by increasing the number of attributes in the predicate between 1 and 3. Finally, we run AlphaClean on a training sample of the dataset (x-axis) and report F1 accuracy on the training sample and on a separate test sample (Figure 4.9B-C). Note that overfitting occurs when the training accuracy is significantly higher than the test accuracy. Indeed, we find an interesting trade-off. The 1-attribute predicate performed worst on the training sample but outperformed the alternatives on the test sample. The 2-attribute predicate was more expressive and overfit to the training data. Finally, the 3-attribute predicate is overly expressive and computationally difficult to search; it did not sufficiently explore the search space to reliably identify high-quality cleaning programs. Discussion: We have shown that data cleaning can overfit, and we believe this is a potential issue in any cleaning procedure. These results highlight the importance of domain experts who judge and constrain the cleaning problem in ways that will likely generalize to future data. Further, they show the value of a high-level interface that experts can use to express these constraints by iteratively tuning the quality function and cleaning language.
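As a concrete illustration of the edit-cost regularization discussed above, the following sketch wraps a quality function with a penalty proportional to the number of cell edits; the weight lam and the list-of-dicts relation format are assumptions for the example.

    # Sketch: regularize a quality function with an edit-cost penalty,
    # analogous to adding a penalty term in machine learning.
    def regularized_quality(quality_fn, lam):
        def q(dirty_rows, cleaned_rows):
            # Count cell-level edits between the dirty and cleaned relations.
            edits = sum(1 for d, c in zip(dirty_rows, cleaned_rows)
                        for col in d if d[col] != c[col])
            return quality_fn(cleaned_rows) - lam * edits
        return q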

Scaling Next, we present preliminary results illustrating the scaling properties of AlphaClean.

Parallelization Optimizations: The experiments run on a cluster of 4 mx.large EC2 instances, and we treat each worker in a distributed (not shared-memory) fashion. Figure 4.10a shows the benefits of the materialization and communication optimizations. No Opt simply runs best-first search in parallel without any materialization; workers only synchronize at the end of an iteration by sending their top-γ candidate programs to the driver, which prunes and redistributes the candidates in the next iteration. Cache extends No Opt by locally materializing parent programs, and Cache+Comm further adds the communication optimizations for distributed parallelization. The single-threaded No Opt setting runs in 4432s, and the materialization optimization reduces the runtime by 10x to 432s. Scaling out improves all methods: at 64 workers, Cache+Comm takes 67s while Cache takes 137s. Surprisingly, although No Opt with 64 workers is slower than Cache+Comm by 10x, it scales the best because it only synchronizes at the end of an iteration and only communicates candidate programs and their quality values. In contrast, the alternative methods may communicate materialized relation instances. Within-Block Learning: Although we have shown how learning reduces the overall search runtime, we show here that learning also improves the search speed for individual blocks. We run single-threaded AlphaClean and report the time to evaluate each block. Figure 4.10b shows the ith block processed on the x-axis and the time to process it on the y-axis. We see that as more blocks are cleaned, the learned pruning classifier becomes more effective at pruning implausible candidate programs. This reduces the per-block search time by up to 75%. Search Algorithm Choice: Figure 4.10c shows that best-first search outperforms naive depth-first and breadth-first search. We also report AlphaClean with γ ∈ {0.5, 0.25}. We see that as the block size increases, DFS and BFS quickly become infeasible, whereas AlphaClean runs orders of magnitude more quickly. In addition, reducing γ improves the runtime, but can come at the cost of reduced accuracy by pruning locally sub-optimal but globally optimal candidate programs.

Program Structure Finally, we present results describing the structure of the data cleaning programs found by AlphaClean. The program found by AlphaClean is often a concise description of the needed data cleaning operations; that is, the total number of cell edits is much larger than the length of the program. We consider the FEC, EEG, and GTD datasets. Sometimes the program (FEC and GTD) encodes a significant number of literal values. This happens in entity-matching-type problems. For these problems, the program length is relatively large; however, the number of cells modified is even larger (up to 10x more). For datasets like the EEG dataset, the program is very concise (26x smaller than the number of cells modified). Numerical thresholds generalize better than categorical find-and-replace operations.

Dataset   Program Length   Cells Modified
FEC       6416             78342
EEG       6                161
GTD       1014             104992

Chapter 5

Reinforcement Learning For SQL Query Optimization

Joins are a ubiquitous primitive in data analytics. The rise of new tools, such as Tableau, that automatically generate queries has led to renewed interest in efficiently executing SQL queries with several hundred joined relations. Since most relational database management systems support only dyadic join operators (joining two tables) as primitive operations, a query optimizer must choose the “best” sequence of two-way joins to achieve the k-way join of tables requested by a query.

5.1 Problem Setup

The problem of optimally nesting joins is, of course, combinatorial and is known to be NP-Hard. To make matters worse, the problem is also practically challenging since a poor solution can execute several orders of magnitude slower than an optimal one. Thus, many database systems favor "exact" solutions when the number of relations is small but switch to approximations after a certain point. For example, PostgreSQL uses dynamic programming when the query has fewer than 12 relations and then switches to a genetic algorithm for larger queries. Similarly, IBM DB2 uses dynamic programming and switches to a greedy strategy when the number of relations grows. Traditionally, exact optimization is addressed with a form of sequential dynamic programming. The optimizer incrementally builds optimal joins on subsets of relations and memoizes the optimal join strategy as well as the cost-to-go based on an internal cost model. This dynamic programming algorithm still enumerates all possible join sequences but shares computation where possible. Naturally, this enumeration process is sensitive to inaccuracies in the underlying cost models of the RDBMS. Care has to be taken to avoid enumerating plans that are risky, or high variance, if table sizes or system parameters are not known exactly. Some work has considered incorporating feedback from execution into the cost model to improve it for future optimization instances.


As it stands, there are three challenges that modern join optimizers try to simultaneously address: (1) pruning a large space of possible plans, (2) hedging against uncertainty, and (3) incorporating feedback from query evaluation. For each of these three problems, the community has proposed numerous algorithms, approximations, and heuristics. Often the dominant techniques are sensitive to both the particular workload and the data. The engineering complexity of implementing robust solutions to all of these challenges can be very significant, and commercial solutions are often incomplete, requiring some level of query engineering (by creating views and hints) from the developer. Recent advances in artificial intelligence may provide an unexpected perspective on this impasse. The emergence of deep reinforcement learning, e.g., AlphaGo, has presented a pragmatic solution to many hard Markov Decision Processes (MDPs). In an MDP, a decision-making agent makes a sequence of decisions to effect change on a system that processes these decisions and updates its internal state (potentially non-deterministically). MDPs are Markovian in the sense that the system's current state and the agent's decision completely determine any future evolution. The solution to an MDP is a decision to make at every possible state. We show that join optimization can be posed as an MDP where the state is the join graph and decisions are edge contractions on this graph. Given this formulation, the Deep Q-Network (DQN) algorithm can be applied, and in essence this formulation gives us an approximate dynamic programming algorithm. Instead of exactly memoizing subplans in a table with cost-to-go estimates as in dynamic programming, DQN represents this table with a neural network of fixed size. This allows the algorithm to estimate the cost-to-go of subplans even if they have not been previously enumerated. This learning-based approach gives us flexibility in designing new intelligent enumeration strategies by manipulating how the neural network is trained and how it represents the subplans. For example, the neural network allows us to efficiently share query processing information across planning instances through its parameters and to learn from previously executed queries. The consequence is enumeration strategies tuned to the specific workload and data. We can also vary the features used to represent the subplans, allowing the network to capture different properties like interesting orders and physical operator selection. And we can train the neural network with observations of actual query execution times, making it robust to uncertainty through repeated executions. In short, deep RL provides a new algorithmic framework for thinking about join enumeration. We can architect an entire query processing stack around this deep RL framework. We explore this architecture and evaluate in which situations its behavior is more desirable (either in terms of cost or planning time) compared to classical approaches. Our system is built on Apache Calcite and integrates with Apache Spark, PostgreSQL, and an internal join planning simulator. We present results on a variety of experimental workloads and datasets.

5.2 Background

First, we will introduce the algorithmic connection between Reinforcement Learning and the join optimization problem.

Query Model

Consider the following query model. Let {R_1, ..., R_T} define a set of relations, and let R = R_1 × ... × R_T denote the Cartesian product. We define an inner join query q as a subset q ⊆ R defined by a conjunctive predicate ρ_1 ∧ ... ∧ ρ_j, where each expression ρ is a binary boolean function of the attributes of two relations in the set, ρ = (R_i.a ⟨op⟩ R_j.b) with op ∈ {=, ≠, >, <}. We assume that the number of conjunctive expressions is capped at some fixed size N. We further assume that the join operation is commutative, left associative, and right associative. Extending our work to consider left and right outer joins is straightforward, but we defer it to future work. We will use the following database of three relations denoting employee salaries as a running example throughout the chapter:

Emp(id, name, rank)

Pos(rank, title, code)
Sal(code, amount)

Consider the following join query:

SELECT * FROM Emp, Pos, Sal
WHERE Emp.rank = Pos.rank AND Pos.code = Sal.code

In this schema, R denotes the set of tuples {(e ∈ Emp, p ∈ Pos, s ∈ Sal)}. There are two predicates, ρ_1 = (Emp.rank = Pos.rank) and ρ_2 = (Pos.code = Sal.code), combined with a conjunction. There are several options for how to execute this query. For example, one could execute the query as Emp ⋈ (Sal ⋈ Pos), or as Sal ⋈ (Emp ⋈ Pos).

Graph Model of Enumeration Enumerating the set of all possible dyadic join plans can be expressed as operations on a graph representing the join relationships requested by a query.

Definition 5 (Query Graph) Let G define an undirected graph called the query graph, where each relation R is a vertex and each ρ defines an edge between vertices. The number of connected components of G is denoted by κ_G.

Each possible dyadic join is equivalent to a combinatorial operation called a graph contraction.

Definition 6 (Contraction) Let G = (V, E) be a query graph with V defining the set of relations and E defining the edges from the join predicates. A contraction c is a function of the graph parametrized by a tuple of vertices c = (v_i, v_j). Applying c to the graph G defines a new graph with the following properties: (1) v_i and v_j are removed from V, (2) a new vertex v_ij is added to V, and (3) the edges of v_ij are the union of the edges incident to v_i and v_j.

Each contraction reduces the number of vertices by 1, and every feasible dyadic join plan can be described as a sequence of such contractions c_1 ∘ c_2 ∘ ... ∘ c_T until |V| = κ_G. Going back to our running example, suppose we start with a query graph consisting of the vertices (Emp, Pos, Sal). Let the first contraction be c_1 = (Emp, Pos); this leads to a query graph whose vertices are (Emp + Pos, Sal). Applying the only remaining possible contraction, we arrive at a single remaining vertex Sal + (Emp + Pos), corresponding to the join plan Sal ⋈ (Emp ⋈ Pos). The join optimization problem is to find the best possible one of these contraction sequences. Assume that we have access to a cost model J, a function that estimates the incremental cost of a particular contraction, J(c) ↦ R+.
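The following sketch illustrates a contraction on a query graph stored as an adjacency map, applied to the running example; the dictionary representation and the "+"-joined vertex names are choices made for the example.

    # Sketch of a contraction on a query graph {vertex: set(neighbors)}.
    def contract(graph, vi, vj):
        merged = vi + "+" + vj
        new_graph = {}
        for v, nbrs in graph.items():
            if v in (vi, vj):
                continue
            # Redirect edges that pointed at vi or vj to the merged vertex.
            new_graph[v] = {merged if n in (vi, vj) else n for n in nbrs} - {v}
        new_graph[merged] = (graph[vi] | graph[vj]) - {vi, vj}
        return new_graph

    # Running example: contract (Emp, Pos), then (Emp+Pos, Sal).
    g = {"Emp": {"Pos"}, "Pos": {"Emp", "Sal"}, "Sal": {"Pos"}}
    g = contract(g, "Emp", "Pos")        # {"Sal": {"Emp+Pos"}, "Emp+Pos": {"Sal"}}
    g = contract(g, "Emp+Pos", "Sal")    # single remaining vertex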

Problem 4 (Join Optimization Problem) Let G define a query graph and J define a cost model. Find a sequence of contractions c_1 ∘ c_2 ∘ ... ∘ c_T terminating in |V| = κ_G that minimizes:

    min_{c_1,...,c_T} Σ_{i=1}^{T} J(c_i)

Note how the "Principle of Optimality" arises in Problem 4. Applying each contraction generates a subproblem of a smaller size (a query graph with one less vertex). Each of these subproblems must be solved optimally in any optimal solution. Also note that this model can capture physical operator selection as well: the set of allowed contractions can be typed with an eligible join type, e.g., c = (v_i, v_j, hash) or c = (v_i, v_j, index).

Greedy vs. Optimal

A greedy solution to this problem is to optimize each c_i independently. The algorithm proceeds as follows: (1) start with the query graph, (2) find the lowest-cost contraction, (3) update the query graph and repeat until only one vertex is left. The greedy algorithm, of course, does not consider how local decisions might affect future costs. Consider our running example query with the following costs (assume symmetry):

J(EP) = 100, J(SP) = 90, J((EP)S) = 10, J((SP)E) = 50

The greedy solution would result in a cost of 140 (because it neglects the future effects of a decision), while the optimal solution has a cost of 110. However, there is an upside: the greedy algorithm has a computational complexity of O(|V|^3), despite the super-exponential search space. The decision at each index needs to consider the long-term value of its actions, where one might have to sacrifice a short-term benefit for a long-term payoff. Consider the following way of expressing the optimization problem for a particular query graph G:

    V(G) = min_{c_1,...,c_T} Σ_{i=1}^{T} J(c_i)    (3)

We can concisely describe Equation 3 as the function V(G), i.e., given the initial graph G, the value of acting optimally until the end of the decision horizon. This function is conveniently named the value function. Bellman's "Principle of Optimality" notes that optimal behavior over an entire decision horizon implies optimal behavior from any starting index t > 1 as well, which is the basis of dynamic programming. So V(G) can be defined recursively for any subsequent graph G' generated by future contractions:

    V(G) = min_c { J(c) + γ · V(G') }    (4)

We usually write this value recursion in the following form:

    Q(G, c) = J(c) + γ · V(G')

leading to the following recursive definition of the Q-function (or cost-to-go function):

    Q(G, c) = J(c) + γ · min_{c'} Q(G', c')    (5)

The Q-function describes the long-term value of each contraction: at each subsequent graph G', local optimization of min_{c'} Q(G', c') is sufficient to derive an optimal sequence of decisions. Put another way, the Q-function is a hypothetical cost function on which greedy descent is optimal and equivalent to solving the original problem. If we revisit the greedy algorithm and revise it as follows: (1) start with the query graph, (2) find the lowest Q-value contraction, (3) update the query graph and repeat, then this algorithm has a computational complexity of O(|V|^3) (just like the greedy algorithm) but is provably optimal. The Q-function is implicitly stored as a table in classical dynamic programming approaches, such as the System R enumeration algorithm. In the System R algorithm, there is a hash table that maps sets of joined relations to their lowest-cost access path:

HashMap<Set, (Plan, Long)> bestJoins;

As the algorithm enumerates more subplans, if a particular relation set exists in the table, it replaces the access path if the enumerated plan has a lower cost than the one in the table. We could equivalently write a different hash table that rearranges the elements of the table described above. Instead of a set of relations mapping to a plan and a cost, we could consider a set of relations and a plan mapping to a cost:

HashMap<(Set, Plan), Long> bestQJoins;

Learning the Q-Function This change would be a less efficient implementation in classical dynamic programming, but it crucially allows us to set up a machine learning problem. What if we could regress from features of (Set, Plan) to a cost based on a small number of observations? Instead of representing the Q-function as a table, we parametrize it as a model:

Q_θ(f_G, f_c) ≈ Q(G, c)

where f_G is a feature vector representing the query graph, f_c is a feature vector representing the particular contraction on the graph, and θ denotes the neural network parameters that represent this function. The basic strategy is to generate observational data, that is, real executions of join plans, and to optimize θ to best explain the observations. Such an optimization problem is a key motivation of a general class of algorithms called reinforcement learning [121], where statistical machine learning techniques are used to approximate optimal behavior while observing substantially less data than full enumeration requires. For those familiar with the AI literature, this problem defines a Markov Decision Process: G is exactly a representation of the state and c is a representation of the action. The utility function (or reward) of this process is the negative overall runtime. The objective of the planning problem is to find a policy, a map from a query graph to the best possible join c. In the popular Q-learning approach [121], the algorithm enumerates random samples of decision sequences containing (G, c, runtime, G') tuples, forming a trajectory. From these tuples, one can calculate the following target value:

    y_i = runtime + min_{c'} Q_θ(G', c')

Each y_i can be used to define a loss, since if Q_θ were the true Q-function, the following recurrence would hold:

    Q(G, c) = runtime + min_{c'} Q(G', c')

So Q-learning defines the loss:

    L(Q_θ) = Σ_i || y_i − Q_θ(G_i, c_i) ||²

This loss can be optimized with gradient descent. The resulting algorithm is called the Deep Q-Network (DQN) algorithm [85] and was used to learn to autonomously play Atari games. The key implication is that the neural network gives the optimizer some ability to extrapolate the cost-to-go even for plans that have not been enumerated. This means that if the featurizations f_G and f_c are designed in a sufficiently general way, the neural network can represent cost-to-go estimates across an entire workload, not just a single planning instance.
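The following sketch shows how the Q-learning targets and squared loss could be assembled from observed transitions, written against the cost-to-go (minimization) form of Equation 5; q_model and featurize are assumed callables standing in for the neural network and the featurizer, and are not the system's actual API.

    # Sketch of Q-learning targets and loss over observed (G, c, runtime, G') transitions.
    import numpy as np

    def q_targets(transitions, q_model, featurize, gamma=1.0):
        ys = []
        for G, c, runtime, G_next, next_contractions in transitions:
            if next_contractions:  # non-terminal: bootstrap with the best next cost-to-go
                future = min(q_model(*featurize(G_next, c2)) for c2 in next_contractions)
            else:                  # terminal: no remaining joins
                future = 0.0
            ys.append(runtime + gamma * future)
        return np.array(ys)

    def q_loss(transitions, q_model, featurize, gamma=1.0):
        ys = q_targets(transitions, q_model, featurize, gamma)
        preds = np.array([q_model(*featurize(G, c)) for G, c, *_ in transitions])
        return float(np.sum((ys - preds) ** 2))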

5.3 Learning to Optimize

Next, we present the entire optimization algorithm for learning join enumeration. We first present the algorithm for the enumeration problem alone, and then we contextualize the optimizer for full Select-Project-Join queries, accounting for single-relation selections, projections, and sort orders.

Featurizing the Join Decision We need to featurize the query graph G and a particular contraction c, which is a tuple of two vertices from the graph. Participating Relations: The first step is to construct a set of features representing which relations participate in the query and in the particular contraction. Let A be the set of all attributes in the database (e.g., {Emp.id, Pos.rank, ..., Sal.code, Sal.amount}). Each relation r (including intermediate relations that are the result of a join) has a set of visible attributes, A_r ⊆ A, the attributes present in its output. So every query graph G can be represented by its visible attributes A_G. Each contraction is a tuple of two relations (l, r), and we can get the visible attributes A_l and A_r for each. Each of the subsets A_G, A_l, A_r can be represented with a binary one-hot encoding indicating which attributes are present. We call the concatenation of these vectors V_rel; it has dimensionality 3·|A|. Join Condition: The next piece is to featurize the join condition, the predicate that defines the join. Participating relations define vertices and the predicate defines an edge. As before, let A be the set of all attributes in the database. Each expression has two attributes and an operator. As with featurizing the vertices, we one-hot encode the attributes present; we additionally one-hot encode the binary operator {=, ≠, <, >}. Concatenating these binary vectors gives, for each expression ρ, a binary feature vector f_ρ. For each expression in the conjunctive predicate, we concatenate the binary feature vectors. Here is where the fixed-size assumption is used: since the maximum number of expressions in the conjunction is capped at N, we get a fixed-size feature vector for all predicates. We call this feature vector V_cond; it has dimensionality N·(|A| + 4). Representing the Q-Function: The Q-function is represented as a multi-layer perceptron (MLP) neural network. It takes as input the two feature vectors V_rel and V_cond. Experimentally, we found that a two-layer MLP gave the best performance relative to the training time. We implemented the training algorithm in DL4J, a Java framework for model training, with a standard DQN algorithm.
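A minimal sketch of this featurization for the running example schema follows; the attribute vocabulary, operator list, and cap N are assumptions chosen for illustration.

    # Sketch of the 1-hot featurization producing V_rel and V_cond.
    import numpy as np

    ATTRS = ["Emp.id", "Emp.name", "Emp.rank", "Pos.rank", "Pos.title",
             "Pos.code", "Sal.code", "Sal.amount"]
    OPS = ["=", "!=", "<", ">"]
    N_MAX = 4  # maximum number of conjunctive expressions

    def one_hot(items, vocab):
        v = np.zeros(len(vocab))
        for it in items:
            v[vocab.index(it)] = 1.0
        return v

    def featurize(visible_G, visible_l, visible_r, predicates):
        # V_rel: visible attributes of the graph and of both sides of the contraction.
        v_rel = np.concatenate([one_hot(visible_G, ATTRS),
                                one_hot(visible_l, ATTRS),
                                one_hot(visible_r, ATTRS)])
        # V_cond: one slot of size |A| + |OPS| per conjunctive expression, padded to N_MAX.
        slots = []
        for attr_a, op, attr_b in predicates[:N_MAX]:
            slots.append(np.concatenate([one_hot([attr_a, attr_b], ATTRS),
                                         one_hot([op], OPS)]))
        while len(slots) < N_MAX:
            slots.append(np.zeros(len(ATTRS) + len(OPS)))
        return v_rel, np.concatenate(slots)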

Generating the Training Data Experimentally, we found that the process of generating training data for a given workload was very important for robust learning. The basic challenge is that the Q-function must be able to accurately differentiate between good plans and bad plans. If the training data consisted only of optimal plans, the learned Q-function might not accurately score poor plans; likewise, if training purely sampled random plans, it might not see very many instances of good plans. We want an efficient algorithm that avoids exhaustive enumeration while sampling a variety of very good plans and poor plans. The data generation algorithm, which we denote sample(G, ε), takes as input a query graph G and a randomization parameter ε ∈ [0, 1]. sample(G, ε) proceeds as follows (a code sketch of this procedure appears after the description below):

1. If |V| = κ_G, return.

2. For all graph contractions c_i on G:

(a) G_i = c_i(G)

(b) V(c_i, G_i) = leftDeep(G_i)

3. If rand() > ε: yield c_j, G_j = arg min_{c_i} V(c_i, G_i); return sample(G_j, ε).

4. Else: yield a uniformly random c_j, G_j; return sample(G_j, ε).

The algorithm essentially acts as the Q-function algorithm described in the previous section. It proceeds as follows: (1) start with the query graph, (2) find the lowest-cost contraction, (3) update the query graph and repeat. The lowest-cost contraction is determined by assuming the remaining joins will be optimized with a left-deep strategy. Let leftDeep(G) be the function that calculates the final cost of the best left-deep plan given the query graph G; to implement leftDeep(G), we use the System R dynamic program. This acts as an approximation to the Q-function where the current step is locally optimal but future joins are efficiently approximated. To sample some number of poor plans in the training dataset, the ε parameter controls the number of randomly selected decisions. This trades off reasonable optimality during training against covering the space of plans. In this sense, we base the data generation algorithm on a particular workload. The workload is a distribution over join queries that we want to optimize. The optimizer samples some number of initial queries from the workload and is tested on unseen data.
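A sketch of sample(G, ε) in Python follows; left_deep_cost, contractions, contract, and num_components are assumed helper functions (the first standing in for the System R left-deep dynamic program), so this is an illustration of the procedure rather than the system's implementation.

    # Sketch of the epsilon-randomized data generation procedure.
    import random

    def sample(G, eps, left_deep_cost, contractions, contract, num_components):
        trajectory = []
        while len(G) > num_components(G):
            candidates = [(c, contract(G, *c)) for c in contractions(G)]
            scored = [(left_deep_cost(Gi), c, Gi) for c, Gi in candidates]
            if random.random() > eps:
                _, c, G = min(scored, key=lambda s: s[0])   # locally best under left-deep completion
            else:
                _, c, G = random.choice(scored)             # exploratory, possibly poor, decision
            trajectory.append((c, G))
        return trajectory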

Execution after Training

After training, we have a parametrized estimate of the Q-function, Q_θ(f_G, f_c). For execution, we simply return to the standard greedy algorithm, but instead of using the local costs, we use the learned Q-function: (1) start with the query graph, (2) featurize each candidate contraction, (3) find the contraction with the lowest estimated Q-value, (4) update the query graph and repeat.

5.4 Reduction Factor Learning

Classical approaches to reduction factor estimation include (1) histograms and heuristics-based analytical formulae, and (2) applying the predicate under estimation to sampled data, among others. In this section we explore the use of learning for reduction factor estimation. We train a neural network to learn the underlying data distributions of a pre-generated database, and evaluate the network on unseen, randomly generated selections that query the same database. To gather training data, we randomly generate a database of several relations, as well as random queries, each consisting of one predicate of the form "R.attr <op> literal". For numeric columns, the operators are {=, ≠, >, ≥, <, ≤}, whereas for string columns we restrict them to equality and inequality. Each attribute's values are drawn from a weighted Gaussian. To featurize each selection, we similarly use one-hot encodings of the participating attribute and of the operator. Numeric literals are then directly included in the feature vector, whereas for strings, we embed "hash(string_literal) % B", where B is a parameter controlling the number of hash buckets. The labels are the true reduction factors obtained by executing the queries on the generated database. A fully connected neural network (two hidden layers, 256 units each, with ReLU activation) is trained by stochastic gradient descent.
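A sketch of the selection featurization follows; the fixed-size layout (a numeric slot plus B hash buckets) and the vocabularies are assumptions for the example.

    # Sketch of featurizing a single-predicate selection "R.attr <op> literal".
    import numpy as np

    NUM_OPS = ["=", "!=", ">", ">=", "<", "<="]

    def featurize_selection(attr, op, literal, attrs, B=64):
        f_attr = np.zeros(len(attrs)); f_attr[attrs.index(attr)] = 1.0
        f_op = np.zeros(len(NUM_OPS)); f_op[NUM_OPS.index(op)] = 1.0
        f_num = np.zeros(1)
        f_str = np.zeros(B)
        if isinstance(literal, str):
            f_str[hash(literal) % B] = 1.0   # hash-bucket embedding for string literals
        else:
            f_num[0] = float(literal)        # numeric literals included directly
        return np.concatenate([f_attr, f_op, f_num, f_str])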

5.5 Optimizer Architecture

In the previous section, we described the join enumeration algorithm. Now we contextualize the enumeration algorithm in a more fully featured optimization stack. In particular, there are aspects of the featurization and graph definitions that have to change based on the nature of the queries.

Selections and Projections To handle single-relation selections and projections in the query, we have to tweak the feature representation, because upstream selections and projections change the cost properties of downstream joins. As in classical optimizers, we eagerly apply selections and projections to each relation. This means that each relation in the query graph potentially has a different number of attributes and a different cardinality than the base relation. While the proposed featurization does capture the visible attributes, it does not capture changes in cardinality due to upstream selections. Here, we leverage the table statistics present in most RDBMSs. For each relation in the query graph, we can estimate a reduction factor δ_r, an estimate of the fraction of tuples remaining after applying the selection to relation r.


In the featurization described in the previous section, we have a set of binary features V_rel that describe the participating attributes. We multiply the reduction factor δ_r for each table with the features corresponding to attributes derived from that relation.

Physical Operator Selection To add support for an optimizer that selects physical operators, we simply add "labeled" contractions, where certain physical operators are eligible. Suppose we have a set of possible physical operators, e.g., nestedLoop, sortMerge, and indexLoop. We simply add an additional feature to the Q-function input that captures which physical operator is selected.

Indexes and Sort Orders Similarly, adding support for indexes just means adding more features. We eagerly use index scans on single-relation selections, and can use them for joins if we are optimizing physical operators. We add an additional set of binary features V_ind that indicate which attributes have indexes built on them. Handling sort orders is similar: we add features describing which attributes need to be sorted in the final query output.

Summary and System Architecture Surprisingly, we found that we could build a relatively full-featured query optimizer based on a deep RL join enumeration strategy. Our system, called RL-QOPT, is built on Apache Calcite. The system connects to various database engines through a JDBC connector. It executes logically optimized queries by rewriting SQL expressions and setting hints. The system parses standard SQL, and the learning steps are implemented using DL4J. The optimizer has two modes, training and execution. In training mode, the optimizer collects data and, based on a user-set parameter, executes suboptimal plans. In execution mode, the optimizer leverages the trained model to improve its planning performance.

5.6 Experiments

We evaluate this framework on a standard join optimization benchmark called the Join Order Benchmark (JOB). This benchmark is derived from the Internet Movie Database (IMDB). It contains information about movies and related facts about actors, directors, production companies, etc. The dataset is 3.6 GB in size and consists of 21 relational tables. The largest table has 36M rows. The benchmark contains 33 queries which have between 3 and 16 joins, with an average of 8 joins per query.

Evaluation Methodology We first evaluate RL-QOPT against 3 different cost models for the workload and data in the Join Order Benchmark. Each of these cost models is designed to elicit a different set of optimal plans. We show that, as a learning optimizer, RL-QOPT adapts to the different search spaces efficiently. CM1: In the first cost model, we model a main-memory database that performs two types of joins: index nested-loop joins and in-memory hash joins. Let O_l be the left operator and O_r be the right operator; the costs are defined as follows:

    c_inlj = c(O_l) + rf(O_l, O_r) · |O_l|
    c_hj = c(O_l) + c(O_r)

where c denotes the cost estimation function and rf denotes the estimated reduction factor of the join. As we can see, this model favors index-based joins when available. The reduction factor rf(O_l, O_r) is always less than 1. More importantly, this cost model is a justification for the use of left-deep plans: in c_inlj, the right operator does not incur a scan cost, and a left-deep tree where the indexed relations are on the right exploits this structure. CM2: In the next cost model, we model a database that accounts for disk-memory relationships in the hash joins. We designate the left operator as the "build" operator and the right operator as the "probe" operator. If a previous join has already built a hash table on an attribute of interest, then the hash join does not incur another build cost:

    c_nobuild = c(O_r)

This model favors right-deep plans that maximize the reuse of the built hash tables. CM3: Finally, in the last cost model, we model temporary tables and memory capacity constraints. There is a budget of tuples that can fit in memory and an additional physical operator that allows materialization of a join result if memory is available. The downstream cost of reading from a materialized operator is then 0:

    c(O) = 0 if O is materialized

This model requires bushy plans due to the inherent non-linearity of the cost function and the memory constraints. The cost model encourages plans that group tables together in ways such that the join output can fit in the available memory.

Join Order Benchmark We train RL-QOPT on 90 queries and hold out 23 queries from the workload. In this initial experiment, we assume perfect selectivity estimates of single-table predicates. We compare three optimizers as baselines: left-deep, right-deep, and bushy (exhaustive). Figure 5.1 shows the aggregate results for the entire workload over random partitions of the data.

Figure 5.1: (A) The log cost of the different optimizers with CM1, (B) the log cost of the different optimizers with CM2, (C) the log cost of the different optimizers with CM3, and (D) the runtime of the different optimizers.

Figure 5.2: The learning curve as a function of the number of queries. The log normalized cost describes the cost difference between learning and the exhaustive bushy optimizer. data. We present results for each of the cost models. While the bushy optimizer works well in terms of cost in all three regimes, it is very expensive to run. The left-deep and the right-deep optimizers are far quicker, but require the prior understanding of costs in the database. We see that in each of the different cost models a different optimizer (left-deep or right-deep) dominates in terms of performance. On the other hand, the learning optimizer nearly matches the bushy performance in all of the regimes while retaining a very fast run time. For CM1, we plot the learning curve as a function of the number of training queries (Figure 5.2). We actually find that even with 50 training queries, we can consistently find plans competitive with exhaustive search. This suggests that the network is not simply memorizing and is generalizing to unseen subplans.

TPC-H The TPC-H benchmark is a dataset and workload of SQL queries. It consists of ad-hoc queries and concurrent data modifications, and the queries and the data populating the database are inspired by those seen in industry. Unlike the JOB, which has a fixed number of queries, TPC-H contains a query generator that produces an arbitrary number of queries from templates.


Figure 5.3: The performance improvement of learning over the Postgres query optimizer for each of the TPC-H template queries.

Figure 5.4: How exploration and discount parameters affect learning performance. queries, TPCH contains a query generator that generates an arbitrary number of queries from templates. In this experiment, we learn the selectivity estimates from data as well. Using our Apache Calcite connection, we also execute these queries on a real postgres database. Unlike the JOB, these are not idealized cost model. We train the RL algorithm with real runtimes. We first generate 10k training queries for the RL algorithm. Then, we evaluate the RL algorithm on a 100k queries from the generator. We aggregate results by query template. We find that after 10k queries, RL-QUOPT significantly improves on the postgres optimizer on several of the template (Figure 5.3). The real execution experiment also raises questions about training data collection. Dur- ing data collection the learner has to execute suboptimal query plans; this could make collecting data very expensive. We evaluate this tradeoff in Figure 5.4. As the exploration parameter  increases, the query plans executed during training are increasingly subopti- mal. However, the learning optimizer has to see a sufficient number of “bad plans” to learn an effective optimization policy. Similarly, the discount parameter also affected training performance. Lower discount settings lead to greedier cost attribution–and actually faster training. CHAPTER 5. QUERY OPTIMIZATION 126

Conclusion Reinforcement learning introduces a new approximate dynamic programming framework for SQL query optimization. Classical query optimizers leverage dynamic programs for optimally nesting join queries. This process creates a table that memoizes cost-to-go estimates of intermediate subplans. By representing the memoization table with a neural network, the optimizer can estimate the cost-to-go of even previously unseen plans, allowing for a vast expansion of the search space. I show that this process is a form of deep Q-learning where the state is a query graph and the actions are contractions on the query graph. One key result is that the same optimization algorithm can adaptively learn search strategies for both in-memory databases (where the search space is often heuristically restricted to maximize index accesses) and disk-based databases (where the search space is often heuristically restricted to maximize re-use of intermediate results).

Chapter 6

Reinforcement Learning for Surgical Tensioning

Robotic surgical assistants (RSAs), such as Intuitive Surgical's da Vinci, facilitate precise minimally invasive surgery [129]. These robots currently operate under pure tele-operation control, but introducing assistive autonomy has the potential to improve surgical training, assist surgeons, reduce fatigue, and facilitate tele-surgery. For example, when cutting thin tissue with surgical scissors, a surgeon may use a second or even a third tool to pin down and fixture the tissue. This technique, called tensioning (also called traction), adds additional constraints on the material to prevent deformations during the cutting from drastically changing the position of the desired cutting path. The optimal direction and magnitude of the tensioning force change as the cutting progresses, and these forces must adapt to any deformations that occur. In practice, however, the surgeon's tensioning policy is often suboptimal because: (1) the surgeon can manipulate only two of the tools at once, leaving any third arm stationary, and (2) asymmetric multilateral tasks (arms performing different procedures) are known to be challenging without significant training with the tele-operative system. It would therefore be beneficial to automatically synthesize tensioning policies to best assist an open-loop cutting trajectory. The first challenge is modeling the dynamics of cutting and tensioning. One way to model deformable sheets is with a 3-dimensional mass-spring-damper network. A sheet is a planar graph of point masses connected by damped springs at the edges. k of the point masses are constrained to be fixed at particular 3D locations, i.e., they are tensioned, and the positions of the remaining masses are determined by the dynamics induced by these constraints. At each time step, a constraint can be moved to a new location. Cutting is modeled by removing a single edge in the graph at each time step. For k assistive arms, the optimization objective is to plan a sequence of k movable boundary-value constraints to maximize cutting accuracy. This problem constitutes a highly non-convex optimal control problem, where cutting accuracy is a non-convex objective that measures the symmetric difference between a perfectly cut contour and the actual cut.

Furthermore, the underlying deformable manipulation system is also highly non-linear. Efficient policy search techniques are required if we want to effectively and seamlessly assist a human surgeon during a procedure. What is different about this problem is that we do not have access to "expert demonstrations" as in the previous chapters. Instead, we show that we can obtain an approximate suboptimal controller using physics-based approximations and then use deep RL to fine-tune this controller.

6.1 Motivation

Nienhuys and Van Der Stappen proposed the seminal work on modeling cutting with finite-element mesh models [90]. This work led to a bevy of follow-on work in robotics and graphics modeling the cutting problem, especially in the context of surgical simulation [14, 83, 108, 114]. In parallel, the graphics community studied fabric modeling with similar finite-element techniques [17, 37]. In our prior work, we implemented such an approach for the pattern cutting task in the Fundamentals of Laparoscopic Surgery (FLS) [123]. This chapter is inspired by the challenges noticed in that prior work, namely that state estimation of a deformable object is very challenging when there are occlusions. Therefore, we investigate whether synthesizing a tensioning plan a priori can mitigate this problem. Manipulation of deformable materials, particularly cutting, is a challenging area of research interest in robotic surgery [87, 90] as well as in computer graphics and computational geometry [23, 136]. The use of expert demonstrations has been considered in prior work as an alternative to explicit models and simulations when studying and handling deformations in the environment. For example, Van den Berg et al. [128], Osa et al. [92], and Schulman et al. [102] all approached manipulation of suture material using Learning From Demonstrations (LfD), an approach that uses demonstration trajectories to learn how to execute specific tasks. RL has been a popular control method in robotics when dynamics are unknown or uncertain [59]. There are a few examples of RL applied to deformable object manipulation, e.g., folding [8] and making pancakes [10]. The recent marriage of RL and neural networks (deep RL) opened a number of new opportunities for control in non-linear dynamical systems [76]. An obstacle to applying deep RL to physical robots is that large amounts of data are required for training, which makes sufficient collection difficult if not impossible using physical robotic systems, leading to our study of simulation-based learning.

6.2 Physics of Multi-Dimensional Spring Systems

As an introduction, we first describe the physics of a multi-dimensional spring system. This will provide the basic intuition for the deformable sheet model in the next section. Figure 6.1 illustrates the most basic multi-dimensional spring system: a point mass m connected by a spring and a damper to an infinitely strong block T.

Figure 6.1: This illustration describes a basic mass-spring-damper system with spring con- stant k and damping constant c.

The spring constant is k (let ℓ denote the resting length) and the damping constant is c. Let a = [ẍ ÿ z̈]ᵀ denote the acceleration vector of the point mass:

    m a = F_spring + F_damper

Hooke's law states that the magnitude of the force applied by a spring is proportional to the deviation of its current length D from the rest length ℓ:

    |F_spring| = k (D − ℓ)

Using unit vector notation, we can parametrize the force vector with spherical coordinates:

    F_spring = î k(D − ℓ) sin θ cos ψ + ĵ k(D − ℓ) sin θ sin ψ + k̂ k(D − ℓ) cos θ

where D = √((x − T_x)² + (y − T_y)² + (z − T_z)²), θ = cos⁻¹((z − T_z)/D), and ψ = tan⁻¹((y − T_y)/(x − T_x)). For the damper, let v = [ẋ ẏ ż]ᵀ denote the velocity vector:

    F_damper = c v

The resulting equations of motion are:

    m a = î (k(D − ℓ) sin θ cos ψ + c v_x) + ĵ (k(D − ℓ) sin θ sin ψ + c v_y) + k̂ (k(D − ℓ) cos θ + c v_z)    (6)
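The sketch below computes the per-edge spring-damper force in vector form, which is equivalent to Equation 6 written in spherical coordinates; the sign convention (restoring spring force, damping opposing the point's velocity) and the parameter values are assumptions chosen for the example.

    # Sketch of the spring-damper force on a point mass attached to a fixed anchor.
    import numpy as np

    def spring_damper_force(p, v, anchor, k=1.0, c=0.1, rest_len=1.0):
        d = p - anchor
        dist = np.linalg.norm(d)
        if dist == 0.0:
            return -c * v
        direction = d / dist
        f_spring = -k * (dist - rest_len) * direction   # Hooke's law along the spring axis
        f_damper = -c * v                               # damping opposes the velocity
        return f_spring + f_damper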

Simulator

We can use this model to design a simulator for cutting deformable sheets. Let G_{x,y,z} ⊂ R³ be a three-dimensional global coordinate frame. We denote the set of points on this sheet as Σ, and Σ^(G) denotes the locations of these points in the global frame.

The points are connected by a graph of springs and dampers. To apply the above equations of motion, we treat each neighboring vertex as a wall: for each p ∈ Σ, each neighboring vertex q ∈ N(p) applies a force F_pq. Cutting is modeled as removing an edge from the graph. We assume that all of the springs have the same constant and natural length, and that all the damping constants are the same. The simulator is initialized with some initial state Σ_0^(G) ∈ G_{x,y,z}. The state is then iteratively updated based on the equations of motion derived above. For each p ∈ Σ, the updates have the following form:

m p̈ = Σ_{q∈N(p)} F_pq + F_external   (7)

To update the position p, we use the implicit Adams method with a standard Python toolkit.¹ Tensioning is simulated as a position constraint for a chosen pinch point p′ ∈ Σ:

p′ = u

This means that regardless of the forces applied to this point it will remain at position u.
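The update rule in Eq. (7), the implicit Adams integrator from the SciPy toolkit cited in the footnote, and the pinch-point constraint can be combined into a small simulator step. The sketch below is illustrative only: the four masses connected in a ring, the spring parameters, and the pinch index are assumptions, and the constraint is enforced here by zeroing the pinch point's velocity and acceleration rather than by whatever mechanism the actual simulator uses.

```python
import numpy as np
from scipy.integrate import ode

N = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # spring graph; cutting removes edges
k, ell, c, m = 10.0, 1.0, 0.1, 1.0
pinch, u = 0, np.array([0.0, 0.0, 0.0])           # pinch point held at position u

def deriv(t, y):
    p = y[:3 * N].reshape(N, 3)                   # positions
    v = y[3 * N:].reshape(N, 3)                   # velocities
    a = np.zeros_like(p)
    for i, j in edges:                            # sum F_pq over neighbors (Eq. 7)
        d = p[i] - p[j]
        D = np.linalg.norm(d) + 1e-9
        f = -k * (D - ell) * d / D
        a[i] += f / m
        a[j] -= f / m
    a -= c * v / m                                # damping
    a[pinch] = 0.0                                # tensioning constraint: pinch is fixed
    v = v.copy()
    v[pinch] = 0.0
    return np.concatenate([v.ravel(), a.ravel()])

p0 = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
y0 = np.concatenate([p0.ravel(), np.zeros(3 * N)])
solver = ode(deriv).set_integrator("vode", method="adams")   # implicit Adams
solver.set_initial_value(y0, 0.0)
solver.integrate(0.1)                             # one simulator update step
print(solver.y[:3 * N].reshape(N, 3))
```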

Manipulation Actions We are given a rectangular planar sheet and a desired cutting contour, a simple algebraic curve of bounded curvature that can be either closed or open.

Cutting We assume that one arm of the robot is designated as the cutting arm and the other as the tensioning arm. A cutting contour is a sequence of points C on the surface of the sheet. The cutting arm follows an open-loop trajectory that attempts to cut along C_0, the position of the cutting contour in the global frame at time zero. Error is measured using the symmetric difference between the desired contour on the sheet and the achieved contour cut; these will differ due to deformation of the sheet during cutting. Let X be the set of points inside a closed intended trajectory and let Y be the set of points inside the closed simulated trajectory. The symmetric difference is then the exclusive-or X ⊕ Y of the two sets. For open contours, the contours are closed by forming boundaries with the edges of the sheet.
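Since the symmetric difference is the error metric used throughout this chapter, here is a short sketch of how it can be scored on rasterized contours; the grid size and the example circles are made-up values.

```python
import numpy as np

def symmetric_difference(intended_mask, achieved_mask):
    """Score a cut by the symmetric difference of two filled contours.

    Both arguments are boolean occupancy grids (True = inside the closed
    contour) rasterized on the same sheet.
    """
    return int(np.logical_xor(intended_mask, achieved_mask).sum())

# Example: a 100x100 sheet where the achieved cut drifted by two pixels.
yy, xx = np.mgrid[0:100, 0:100]
intended = (xx - 50) ** 2 + (yy - 50) ** 2 < 20 ** 2
achieved = (xx - 52) ** 2 + (yy - 50) ** 2 < 20 ** 2
print(symmetric_difference(intended, achieved))
```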

Tensioning Since the cutting is open-loop, it cannot account for deformation; this is why we need tensioning, which applies feedback based on the state of the sheet.

¹ https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.integrate.ode.html

Definition 7 (Tensioning) Let s ∈ Σ be called a pinch point. Tensioning is defined as constraining the position of this pinch point to a specific location u ∈ G_{x,y,z}:

T = ⟨s, u⟩

For each of the k tensioning arms of the robot, we can have one tuple Ti. We consider a single pinch point for each arm for an entire cutting trajectory. This allows us to define a tensioning policy:

Definition 8 (Tensioning Policy) For arm i ∈ {1, ..., k}, let Σ_t^(G) be the locations of all of the points on the sheet in the global coordinate frame at time t. For a fixed pinch point s, a tensioning policy π_i is a function where Δu = u_{t+1} − u_t:

π_i : Σ_t^(G) ↦ Δu

Problem. Tensioning Policy Search: For each arm i, find a tensioning policy that minimizes the symmetric difference of the desired vs. actual contour.

6.3 Reinforcement Learning For Policy Search

We chose an RL algorithm since the symmetric difference reward function is non-convex and the system is non-linear. We model the tensioning problem as a Markov Decision Process (MDP) ⟨S, A, ξ(·,·), R(·,·), T⟩, where the actions A are 1 mm movements of the tensioning arm in the x and y directions, and the states S are described below. The action space is tuned so the policy can generate sufficient tension to manipulate the cloth significantly over a few timesteps. Reward is measured using the symmetric difference between the desired contour and the achieved contour cut. The robot receives 0 reward at all time-steps prior to the last step, and at the last time-step T − 1 receives the symmetric difference. We do not shape the reward, as the symmetric difference is exactly the error metric used for evaluation. We use a neural network to parametrize the policy π_θ, which maps an observation vector to a discrete action; a multi-layer perceptron with two 32-unit hidden layers represents this mapping. To optimize θ, we leverage the TRPO implementation [103] in Rllab [32]. Since neural networks are differentiable, we can optimize the quantity R(θ). The state space is a tuple consisting of the time index of the trajectory t, the displacement vector u_t from the original pinch point, and the locations x_i ∈ R³ of fiducial points chosen randomly on the surface of the sheet. In all experiments we use 12 fiducial points. This is a sample-based approximation of tracking Σ_t^(G).
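As a concrete illustration of this MDP encoding, the sketch below assembles the observation vector and the terminal reward. The exact action set, the 2D displacement of the pinch point, the reward sign, and the helper names are assumptions for illustration rather than the code used in the experiments.

```python
import numpy as np

# Discrete actions: 1 mm moves of the tensioning arm in x and y (plus a no-op).
ACTIONS = np.array([[0.0, 0.0],
                    [1.0, 0.0], [-1.0, 0.0],
                    [0.0, 1.0], [0.0, -1.0]])   # millimeters

def observation(t, u_t, fiducials):
    """State: time index, displacement of the pinch point from its start,
    and the 3D locations of 12 randomly chosen fiducial points (12 x 3)."""
    return np.concatenate([[t], u_t, fiducials.ravel()])

def terminal_reward(intended_mask, achieved_mask):
    """Zero reward until the final step; the last step scores the cut by the
    symmetric difference (negated here so higher is better; the sign
    convention is an assumption)."""
    return -float(np.logical_xor(intended_mask, achieved_mask).sum())

obs = observation(t=0, u_t=np.zeros(2), fiducials=np.random.rand(12, 3))
print(obs.shape)   # (1 + 2 + 36,) = (39,)
```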

6.4 Approximate Solution

Directly applying the RL algorithm to the FEM simulator has two problems: (1) large sample complexity and (2) local minima. Even in an optimized simulator, every new contour required 5 minutes of learning before a viable policy was found. The key insight is that the simulator used in RL is ultimately based on an analytic model. Thus, we explored whether we could initialize learning with a prior developed from a simplified objective.

Small Deviation Approximation The equation of motion defines a non-linear system. We are, however, modeling point masses on a sheet that will have relatively small deformations from the resting length. We further assume that there is no damping (c = 0) and no external forces. For linearization, it is more convenient to work in Cartesian coordinates, so we can re-parametrize the equations using Pythagorean identities (with Δ_x = x − T_x, Δ_y = y − T_y, Δ_z = z − T_z):

ma_x = kΔ_x − kℓΔ_x / √(Δ_x² + Δ_y² + Δ_z²)
ma_y = kΔ_y − kℓΔ_y / √(Δ_x² + Δ_y² + Δ_z²)
ma_z = kΔ_z − kℓΔ_z / √(Δ_x² + Δ_y² + Δ_z²)

To linearize, we can first take the gradient:

∇(ma_x) = k [ 1 − ℓ(Δ_y² + Δ_z²)/(Δ_x² + Δ_y² + Δ_z²)^(3/2)
              ℓΔ_xΔ_y/(Δ_x² + Δ_y² + Δ_z²)^(3/2)
              ℓΔ_xΔ_z/(Δ_x² + Δ_y² + Δ_z²)^(3/2) ]

We linearize around the operating point Δ_x = Δ_y = Δ_z = ρ = ℓ/√3, and let λ = ℓ/(3^(3/2) ρ):

ma_x ≈ k [1−2λ, λ, λ] · [Δ_x−ρ, Δ_y−ρ, Δ_z−ρ]^T
ma_y ≈ k [λ, 1−2λ, λ] · [Δ_x−ρ, Δ_y−ρ, Δ_z−ρ]^T
ma_z ≈ k [λ, λ, 1−2λ] · [Δ_x−ρ, Δ_y−ρ, Δ_z−ρ]^T

We can define some notation to make the future algebra more concise:

L = (k/m) [ 1−2λ    λ      λ
             λ     1−2λ    λ
             λ      λ     1−2λ ]

The resulting system can be described in the state-space model for the position of a point mass p:

p̈ = Σ_{q∈N(p)} L ((p − q) − ρ)   (8)
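For concreteness, a small numerical sketch of the linearized model in Eq. (8) follows; the spring constant, mass, and neighbor positions are illustrative assumptions.

```python
import numpy as np

k, m, ell = 10.0, 1.0, 1.0
rho = ell / np.sqrt(3.0)                  # operating point
lam = ell / (3.0 ** 1.5 * rho)            # = 1/3 at this operating point
L = (k / m) * np.array([[1 - 2 * lam, lam, lam],
                        [lam, 1 - 2 * lam, lam],
                        [lam, lam, 1 - 2 * lam]])

def linear_accel(p, neighbors):
    """Linearized acceleration of the mass at p given its neighbor positions (Eq. 8)."""
    return sum(L @ ((p - q) - rho) for q in neighbors)

p = np.array([0.0, 0.0, 0.0])
neighbors = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
print(linear_accel(p, neighbors))
```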

Tensioning Problem The key trick to understanding tensioning based on this linearization is an efficient computation of the equilibrium state, i.e., p̈ = 0. Let X ∈ R^(N×3) denote the positional state of each of the masses. For N masses and E edges, let A be an E × N matrix where A_ij = 0 if edge i is not incident to mass j, and A_ij = ±1 for edges incident to the mass (the sign, pushing or pulling, can be selected arbitrarily). Finally, let R ∈ R^(E×3) be a matrix where each component is ρ, the rest-length term of the spring. It can be shown that the acceleration of all of the masses is:

Ẍ = AᵀAXL − AᵀRL

making the equilibrium condition:

AᵀAX = AᵀR

To enforce constraints, we simply require that some subset of the components of X are fixed. This restricts the equilibrium solution to the subspace of values for which X_{i,:} = u_i. These are exactly the tensioning constraints, and we can solve the system of equations by partitioning the masses into free and tensioned sets. First, we can re-write the above equation as:

BX = AᵀR

where B = AᵀA. This can in turn be written as:

B_free X_free = AᵀR + B_tens U

and solving for the least-squares solution:

X_free = (B_freeᵀ B_free)⁻¹ [B_freeᵀ AᵀR + B_freeᵀ B_tens U]

This expression is just an affine function of the tensioning constraints:

X_free = CU + D
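The equilibrium computation can be sketched on a tiny example: build the signed incidence matrix A, partition the masses into free and tensioned sets, and solve the least-squares system. The chain graph, the value of ρ, and the tensioned index below are made up, and the sign convention for the B_tens U term follows the equation above.

```python
import numpy as np

N, edges, rho = 3, [(0, 1), (1, 2)], 0.5
A = np.zeros((len(edges), N))
for e, (i, j) in enumerate(edges):         # signed incidence matrix
    A[e, i], A[e, j] = 1.0, -1.0
R = rho * np.ones((len(edges), 3))
B = A.T @ A

tensioned = [2]                            # index of the mass held at position U
free = [i for i in range(N) if i not in tensioned]
B_free, B_tens = B[:, free], B[:, tensioned]
U = np.array([[1.0, 0.0, 0.0]])            # constraint positions (1 x 3)

# Least-squares equilibrium for the free masses: X_free is affine in U.
rhs = A.T @ R + B_tens @ U
X_free, *_ = np.linalg.lstsq(B_free, rhs, rcond=None)
print(X_free)
```

Solving once with U = 0 recovers D, and solving with unit constraint vectors recovers the columns of C.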

Optimization To find tensioning directions and magnitudes, we have to pose an optimization problem to set the values of U as the grid is cut. The model and its equilibrium states give us a way to quantify the deformation induced by cutting. C and D depend on the structure of the graph at time t. We cannot directly optimize for the symmetric difference, so instead we minimize the total deformation from the original state:

I_t = ‖X_free[0] − X_free[t+1]‖₂²

Figure 6.2: This plot shows the expected reward over 50 trials for TRPO, TRPO initialized with the analytic approximation (TRPO AN), and the analytic model on its own. Three cutting curves are visualized of increasing difficulty with a single pinch point marked in red. Results suggest that initializing with an analytic model can greatly accelerate learning.

This measures the amount of change in the positions of the points after a cut, once the sheet has settled into a new equilibrium state. This sets up our control objective for synthesizing a set of tensions from a fixed pinch point.

min_{u_1,…,u_T} Σ_{t=0}^{T−1} I_t + q‖u_{t+1} − u_t‖₂²   (9)

where q‖u_{t+1} − u_t‖₂² is a control penalty on changing the tensioning between time-steps, with q > 0. This problem is a convex program that can be solved with standard solvers. Once this open-loop strategy is learned, we can initialize TRPO by training the policy network to predict u_t from the state. Our intuition is that this analytic model gets close to the optimal policy and TRPO simply has to refine it. This mitigates the effects of bad local minima far from the optimum and slow convergence.
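Problem (9) is a quadratic program once the affine maps are known, so it can be written directly in a modeling tool such as cvxpy. The sketch below uses random placeholders for C_t, D_t, the horizon, and q, and flattens X_free into a vector; it is meant to show the structure of the program, not the actual implementation.

```python
import numpy as np
import cvxpy as cp

T, d, n = 5, 3, 12                       # horizon, tension dim, # free coordinates
q = 0.1
rng = np.random.default_rng(0)
C = [rng.standard_normal((n, d)) * 0.1 for _ in range(T)]   # placeholder affine maps
D = [rng.standard_normal(n) for _ in range(T)]
X0 = rng.standard_normal(n)              # flattened initial free positions

U = cp.Variable((T, d))
# Deformation terms I_t under X_free[t+1] = C_t u_t + D_t, plus the control penalty.
deformation = sum(cp.sum_squares(C[t] @ U[t] + D[t] - X0) for t in range(T))
control = q * cp.sum_squares(U[1:] - U[:-1])
prob = cp.Problem(cp.Minimize(deformation + control))
prob.solve()
print(U.value)
```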

6.5 Experiments

Performance We present a set of illustrative initial experiments in this chapter. In the first experiment, we generated three contours of increasing difficulty and learned tensioning policies for a single pinch point (k = 1). We compare TRPO with no initialization, TRPO+An with the analytic initialization, and An, which is just the analytic model. Figure 6.2 plots the learning curves. We stopped training when the reward was no longer reliably improving. We find that in all three cases, the analytic initialization significantly reduces the time needed to learn a similarly successful policy. Furthermore, the analytic model is within 25% of the final reward that TRPO achieves, indicating that it is a very good initialization. The next experiment evaluates the run time of the algorithms as we increase the number of tensioning arms. Figure 6.3A measures the number of episodes needed for TRPO to cross over, i.e., match the performance of the analytic method, as a function of the number of tensioning arms to plan for. While the analytic method is greedy, TRPO requires nearly 350,000 episodes before it is at parity with this method for 4 tensioning arms. For comparison, the analytic optimization requires two orders of magnitude less time to reach the same result (Figure 6.3B). Finally, we explore using the combination of the two to achieve higher-accuracy results with fewer rollouts.

Figure 6.3: (A) We measure the number of episodes needed for TRPO to cross over, i.e., match the performance of the analytic method, as a function of the number of tensioning arms to plan for. (B) For comparison, we plot the wall clock time of TRPO and the analytic optimization on a log scale. The optimization is orders of magnitude faster, but may be suboptimal.

End-to-End Next, we compare TPS to alternative tensioning approaches. We first evaluate the techniques in terms of performance by calculating the symmetric difference between the target pattern and the actual cut. Then, we compare the techniques in terms of robustness by tuning them for one simulated parameter setting and then applying them to perturbations.

Cutting Accuracy We manually drew 17 different closed and open contours to evaluate in simulation, as illustrated in Table 6.1. For each of the contours, we evaluated four different tensioning methods:

• Not Tensioned: Single-arm cutting is performed with no assisted tensioning other than the stationary corner clips that fix the sheet in place.

• Fixed Tensioning: The material is pinched at a single point with the gripper arm (with corner points still fixed in place), but no directional tensioning is applied. We simulate area contact by pinching a circular disc.

• Centroid Tensioning: Tensioning is proportional to the direction and magnitude of the error between the 3D position of the cutting tool and the closest point on the desired contour. The gain was hand-tuned on randomly chosen contours and is fixed to 0.01 (a minimal sketch of this controller appears after this list).

• TPS: A separate policy is trained for each shape; this requires 20 iterations of TRPO in the simulator.
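A minimal sketch of the centroid (analytic) feedback rule described above follows: the contour array, how the returned displacement is applied, and the sign of the correction are assumptions; only the proportional structure and the 0.01 gain come from the text.

```python
import numpy as np

def centroid_tension_update(tool_pos, desired_contour, gain=0.01):
    """Proportional tensioning update: move the pinch point along the error
    between the cutting tool and the closest point on the desired contour."""
    dists = np.linalg.norm(desired_contour - tool_pos, axis=1)
    closest = desired_contour[np.argmin(dists)]
    return gain * (closest - tool_pos)       # displacement applied to the pinch point

contour = np.stack([np.linspace(0, 1, 50),
                    np.full(50, 0.5),
                    np.zeros(50)], axis=1)   # a straight-line contour on the sheet
print(centroid_tension_update(np.array([0.2, 0.45, 0.0]), contour))
```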

For the analytic tensioning model, the centroid of the contour is used as the pinch point. For the fixed policy, the point was chosen by random search over the feasible set of points. Performance is measured using the symmetric difference between the desired contour and the actual cut. The averaged symmetric difference scores for each instance and tensioning method are reported in Table 6.1. The success of the different tensioning algorithms is presented as the percentage improvement in symmetric difference score over the non-tensioned baseline. The average of the scores over all 17 contours is also included in the table. For the selected set of contours, TPS achieves the best average relative improvement of 43.30%, compared to 9.70% for the analytic method and 13.54% for the fixed approach.

dVRK: Hardware and Software We use the Intuitive Surgical da Vinci Research Kit (dVRK) surgical robot assistant, as in [40, 87, 105]. We interface with the dVRK using open-source electronics and software developed by [56]. The software system is integrated with ROS and allows direct robot pose control. We use the standard laparoscopic stereo camera for tracking fiducials on the surgical gauze. We equip the right arm with the curved scissors tooltip for cutting and the left arm with the needle driver tooltip for pinching and tensioning. The arms are located on either side of the surgical gauze to maximize the workspace for each without collision.

Physical Evaluation of TPS We show the physical results on 4 contours with Fixed Tensioning and Deep RL, using the symmetric difference as the evaluation metric, in Table 6.2. The tensioning policy for Deep RL was derived from simulation experiments by registering the physical gauze to a sheet in the simulator environment. In this set of experiments, we used fixed tensioning as the baseline. The no-tensioning policy frequently failed in the physical experiment, so we excluded it from the results table for a fair baseline comparison. We were unable to evaluate the analytic method in the physical experiments because the state estimation needed for feedback control was not possible outside of the simulated environment.

We do not attempt analytic tensioning on the physical system since it requires real-time tracking of the pattern to estimate local error. We observed that 3 out of 4 of the TPS experiments performed better than the fixed tensioning policy with respect to the symmetric difference of the cut and ideal trajectories. This is the same objective the trained policy was designed to minimize in simulation. A weakness of the policy was failure to address discrete failure modes, including discrete failure modes induced by the policy itself. Failure as a result of the scissors' entanglement in the gauze occurred in both experimental groups. The active manipulation and tensioning of TPS also caused increased deformation of the cloth, which occasionally resulted in entanglement as well. We did not optimize our policy to minimize these discrete failures, and our simulator does not model the robot with sufficient fidelity to do so. To address this, we would require very accurate models of the robot arms and tools.

Table 6.1: Evaluation of TPS: For the 17 contours shown, we evaluate the three tensioning policies described above. We measure and report performance in terms of relative percentage improvement in symmetric difference over a baseline of no tensioning. The 95% confidence interval over 10 simulated trials is shown for fixed, analytic, and Deep RL (TPS) tensioning, while the mean absolute symmetric difference error is reported for the no-tensioning baseline. The data suggest that TPS performs significantly better than the fixed and analytic baselines. The corresponding pinch points used for fixed and TPS are indicated in red; the analytic pinch point is the centroid of the shape.

Shape      No-Tension (abs.)   Fixed (%)        Analytic (%)       Deep RL (%)
1          17.4                -20.69 ± 4.44    -149.43 ± 10.24    64.37 ± 5.77
2          22.5                32.44 ± 1.74     -117.78 ± 0.00     55.11 ± 7.40
3          23.2                -21.12 ± 5.41    18.10 ± 1.78       38.36 ± 9.43
4          102.9               7.48 ± 0.62      30.42 ± 1.45       36.15 ± 3.97
5          41.1                0.49 ± 7.99      9.98 ± 0.00        52.31 ± 7.26
6          42.0                55.00 ± 1.77     11.43 ± 1.52       45.95 ± 6.60
7          40.2                22.64 ± 1.14     3.73 ± 1.46        33.83 ± 3.99
8          40.0                -1.25 ± 0.82     1.75 ± 2.83        35.50 ± 4.67
9          66.6                2.85 ± 2.60      34.68 ± 2.25       28.38 ± 4.98
10         63.6                25.63 ± 2.98     20.60 ± 3.60       41.35 ± 9.90
11         73.6                2.31 ± 1.66      24.32 ± 0.98       56.11 ± 8.99
12         79.3                22.82 ± 4.51     55.49 ± 0.38       63.56 ± 3.61
13         94.3                3.29 ± 2.05      27.15 ± 0.32       34.04 ± 5.87
14         71.7                3.07 ± 7.15      -2.51 ± 0.61       39.89 ± 8.20
15         178.7               74.71 ± 1.22     80.75 ± 0.88       81.25 ± 1.38
16         114.6               -8.03 ± 2.16     31.06 ± 0.62       29.06 ± 8.81
17         74.8                10.29 ± 2.08     28.34 ± 2.07       0.80 ± 7.03
Mean (%)                       13.54 ± 9.84     9.70 ± 21.96       43.30 ± 8.61

Table 6.2: dVRK Physical Experiments: This table reports the relative percentage improvement in symmetric difference of TPS (Deep RL) over a baseline of fixed tensioning in experiments performed on the dVRK robot. The black dot indicates the pinch point of the gripper arm.

Shape      1          2         3         4
Deep RL    118.63%    36.77%    75.33%    -44.59%

Chapter 7

Conclusion

This dissertation explores reinforcement learning, a family of algorithms for finding solutions to MDPs that assume query access to the underlying dynamical system:

q_t : (s_t, a_t) → system() → (s_{t+1}, r_t)

High-dimensional and long-horizon search problems can require a prohibitive number of such queries, and exploiting prior knowledge is crucial. In several domains of interest, a limited amount of expert knowledge is available, and my dissertation presents an argument that learning such structure can significantly improve the sample-efficiency, stability, and robustness of RL methods. With this additional supervision, the search process can be restricted to those sequences that are similar to the supervision provided by the expert. The dissertation explores several Deep RL systems for control of imprecise cable-driven surgical robots, automatically synthesizing data-cleaning programs to meet quality specifications, and generating efficient execution plans for relational queries. I describe algorithmic contributions, theoretical analysis of the implementations, and the architecture of the RL systems.

7.1 Challenges and Open Problems

The promise of (deep) reinforcement learning is a general algorithm that applies to a large variety of MDP settings. My dissertation focuses on the RL setting with minimal assumptions on the MDP. In general, there is a spectrum of assumptions one could make, and interesting sub-families of algorithms arise at different points on the spectrum.

A Vision For Reinforcement Learning I believe that the role of Reinforcement Learning is analogous to the role of Stochastic Gradient Descent in supervised learning problems. While it is not practical to expect that RL

will solve every MDP in the most efficient way, we would like it to be competitive with specialized algorithms. This is similar to SGD in supervised learning: while special-case alternatives exist that are often more performant (e.g., SDCA for SVM problems and Newton's Method), SGD is a reasonable general-purpose framework that works decently well across a variety of problems. This unification has allowed the community to build optimized frameworks around SGD, e.g., TensorFlow, and has ultimately greatly accelerated the rate of applications research using supervised learning. I envision a similar future for RL, one where a small number of RL algorithms are essentially competitive with classical baselines. Better alternatives may exist in special cases, but one can expect RL to work across problem domains. This requires that RL exploit much of the same problem structure that the classical algorithms exploit.

Reset Assumptions One assumption not discussed in this work is how the system can be reset to its initial state after querying it. In many problems, one has arbitrary control over the reset process, i.e., the system can be initialized in any state. This is the principle of backtracking search in constraint satisfaction and graph search. Designing RL algorithms that intelligently select initial states to begin their search is an interesting avenue of research. For example, one could imagine a Deep Q-Learning algorithm that, rather than sampling from the initial state distribution, begins its rollouts from a random point of a trajectory in its replay buffer. This would simulate some level of backtracking behavior in the algorithm. Optimizing the initialization process over a collection of tasks would be very valuable.
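As an illustration of this idea (not from the dissertation), a reset scheme of this kind might look like the following sketch, assuming the simulator can be reset to an arbitrary stored state; the function names and the mixing probability are hypothetical.

```python
import random

def sample_rollout_start(replay_buffer, initial_state_sampler, p_replay=0.5):
    """With probability p_replay, restart a rollout from a random state stored
    in the replay buffer instead of the environment's initial state
    distribution (a simple form of backtracking)."""
    if replay_buffer and random.random() < p_replay:
        transition = random.choice(replay_buffer)   # (s, a, r, s_next)
        return transition[0]                        # backtrack to a visited state
    return initial_state_sampler()

# Example usage with made-up states:
buffer = [((0.1, 0.2), 0, 1.0, (0.3, 0.2))]
print(sample_rollout_start(buffer, initial_state_sampler=lambda: (0.0, 0.0)))
```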

Distance Assumptions Another crucial assumption exploited in many discrete search settings is a priori knowledge of the distance between states. For example, the A* search algorithm uses this to prioritize expansion with an admissible heuristic that lower bounds the cost-to-go. This is functionally equivalent to a lower bound on the Q-function in RL. An interesting problem is developing techniques to learn such lower bounds and transfer them between problems. For example, are there ways to distill a Q network into a simpler model that simply provides a lower approximation and share it with other search problem instances?

Demonstration Assumptions State-action sequences are the most basic form of supervision from an expert. A more advanced expert may be able to provide much more information such as task segments, hierarchical information, and perhaps even hints to the learner. Thinking about how to formalize the notion of hints from the supervisor is another interesting research direction. What if the supervisor could give a partially written program to describe the control policy and the search algorithm could fill in the gaps? There are interesting questions of how we represent partial or incomplete knowledge in a framework like RL.

Transfer The “holy grail” of Deep Reinforcement Learning is to demonstrate non-trivial transfer of learned policies between MDPs or perturbations of the system. Ensuring safe policy deployments involves understanding the effects of applying a policy trained in simulation or in a training environment to the non-idealities of the real world. For example, in our query optimization project, to deploy the policies trained from a cost model, we had to perform significant fine-tuning in the real world. The challenge is that real-world observations often have very different properties than the data seen in training. When we execute an SQL query, we do not observe all of the intermediate costs without significant instrumentation of the database. We may only observe a final runtime (the simulator is a proxy for this). Executing subplans can be very expensive (queries can take several minutes to run). However, we can still use the final runtime to adjust plans that optimize runtime, while still leveraging what we have learned from the cost model. One can think about the fine-tuning process as first using the cost model to learn relevant features about the structure of subplans (i.e., which ones are generally beneficial). After this is learned, those features are fed into a predictor to project the effect of that decision on final runtime. Generalizing such ideas is an interesting avenue of research. Can Deep RL in a simulator be used to learn features that allow for rapid transfer to the real world?

Bibliography

[1] Trifacta. https://www.trifacta.com/.

[2] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, page 1. ACM, 2004.

[3] Pieter Abbeel and Andrew Y Ng. Inverse reinforcement learning. In Encyclopedia of machine learning, pages 554–558. Springer, 2011.

[4] Tamim Asfour, Pedram Azad, Florian Gyarfas, and Rüdiger Dillmann. Imitation learning of dual-arm manipulation tasks in humanoid robots. I. J. Humanoid Robotics, 5(2):183–202, 2008.

[5] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. arXiv preprint arXiv:1609.05140, 2016.

[6] Pierre-Luc Bacon and Doina Precup. Learning with options: Just deliberate and relax. In NIPS Bounded Optimality and Rational Metareasoning Workshop, 2015.

[7] Peter Bailis, Deepak Narayanan, and Samuel Madden. Macrobase: Analytic monitoring for the internet of things. In arXiv, 2016.

[8] Benjamin Balaguer and Stefano Carpin. Combining imitation and reinforcement learning to fold deformable planar objects. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1405–1412. IEEE, 2011.

[9] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.

[10] Michael Beetz, Ulrich Klank, Ingo Kresse, Alexis Maldonado, Lorenz Mösenlechner, Dejan Pangercic, Thomas Rühr, and Moritz Tenorth. Robotic roommates making pancakes. In Humanoids, pages 529–536. IEEE, 2011.

[11] Richard Bellman. Dynamic programming. Princeton University Press, 1957.

[12] Michael Benedikt, George Konstantinidis, Giansalvatore Mecca, Boris Motik, Paolo Papotti, Donatello Santoro, and Efthymia Tsamoura. Benchmarking the chase. In PODS, 2017.


[13] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. In Decision and Control, 1995., Proceedings of the 34th IEEE Confer- ence on, volume 1, pages 560–564. IEEE, 1995.

[14] Daniel Bielser, Pascal Glardon, Matthias Teschner, and Markus Gross. A state ma- chine for real-time cutting of tetrahedral meshes. Graphical Models, 66(6):398–417, 2004.

[15] Matthew M Botvinick. Hierarchical models of behavior and prefrontal function. Trends in cognitive sciences, 12(5):201–208, 2008.

[16] Matthew M Botvinick, Yael Niv, and Andrew C Barto. Hierarchically organized be- havior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262–280, 2009.

[17] Adam Brooks. A tearable cloth simulation using vertlet integration. https://github.com/Dissimulate/Tearable-Cloth, 2016.

[18] Rodney Brooks. A robust layered control system for a mobile robot. IEEE journal on robotics and automation, 2(1):14–23, 1986.

[19] Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping.

[20] Hung Hai Bui, Svetha Venkatesh, and Geoff West. Policy recognition in the abstract hidden Markov model. JAIR, 17:451–499, 2002.

[21] Sylvain Calinon and Aude Billard. Stochastic gesture production and recognition model for a humanoid robot. In 2004 IEEE/RSJ International Conference on Intel- ligent Robots and Systems, Sendai, Japan, September 28 - October 2, 2004, pages 2769–2774, 2004.

[22] Berk Çalli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M. Dollar. Benchmarking in manipulation research: The YCB object and model set and benchmarking protocols. CoRR, abs/1502.03143, 2015.

[23] Nuttapong Chentanez, Ron Alterovitz, Daniel Ritchie, Lita Cho, Kris Hauser, Ken Goldberg, Jonathan R Shewchuk, and James F O’Brien. Interactive Simulation of Surgical Needle Insertion and Steering. ACM Transactions on Graphics, 28(3), 2009.

[24] Xu Chu, Ihab F. Ilyas, Sanjay Krishnan, and Jiannan Wang. Data cleaning: Overview and emerging challenges. In SIGMOD, 2016.

[25] Andrew Crotty, Alex Galakatos, and Tim Kraska. Tupleware: Distributed machine learning on small clusters. In IEEE Data Eng. Bull., 2014.

[26] Michele Dallachiesa, Amr Ebaid, Ahmed Eldawy, Ahmed K. Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, and Nan Tang. Nadeef: a commodity data cleaning system. In SIGMOD, 2013.

[27] Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In AISTATS, pages 273–281, 2012.

[28] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In NIPS, pages 271–278, 1992.

[29] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. JAIR, 13:227–303, 2000.

[30] Xin Luna Dong. Data sets for data fusion experiments. http://lunadong.com/fusionDataSets.htm.

[31] Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in neural information processing systems, pages 1087–1098, 2017.

[32] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016.

[33] Wenfei Fan and Floris Geerts. Foundations of Data Quality Management. 2012.

[34] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hier- archical reinforcement learning. In ICLR, 2017.

[35] National Consortium for the Study of Terrorism and Responses to Terror- ism (START). Global terrorism database [data file]. https://www.start.umd.edu/gtd/, 2016.

[36] Roy Fox, Michal Moshkovitz, and Naftali Tishby. Principled option learning in Markov decision processes. arXiv preprint arXiv:1609.05524, 2016.

[37] Stuart Gale and Wanda J Lewis. Patterning of tensile fabric structures with a discrete element model using dynamic relaxation. Computers & Structures, 2016.

[38] Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian- Augustin Saita. Declarative data cleaning: Language, model, and algorithms. In PVLDB, 2001.

[39] Yixin Gao et al. The JHU-ISI gesture and skill assessment dataset (JIGSAWS): A surgical activity working set for human motion modeling. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2014.

[40] Animesh Garg, Siddarth Sen, Rishi Kapadia, Yiming Jen, Stephen McKinley, Lau- ren Miller, and Ken Goldberg. Tumor localization using automated palpation with gaussian process adaptive sampling. In CASE, 2016.

[41] Tim Genewein, Felix Leibfried, Jordi Grau-Moya, and Daniel Alexander Braun. Bounded rationality, abstraction, and hierarchical decision-making: An information- theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.

[42] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F Naughton, Narasimhan Rampalli, Jude Shavlik, and Xiaojin Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, 2014.

[43] Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, and Jeffrey Heer. Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In UIST, 2011.

[44] Mandana Hamidi, Prasad Tadepalli, Robby Goetschalckx, and Alan Fern. Active imitation learning of hierarchical policies. In IJCAI, pages 3554–3560, 2015.

[45] Nicolas Heess, Greg Wayne, Yuval Tassa, Timothy Lillicrap, Martin Riedmiller, and David Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.

[46] Joseph M Hellerstein. Quantitative data cleaning for large databases. In UNECE, 2008.

[47] Bernhard Hengst. Discovering hierarchy in reinforcement learning with HEXQ. In ICML, volume 2, pages 243–250, 2002.

[48] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.

[49] Ronald A Howard. Dynamic programming. Management Science, 12(5):317–348, 1966.

[50] Manfred Huber and Roderic A. Grupen. A feedback control structure for on-line learning tasks. Robotics and Autonomous Systems, 22(3-4):303–315, 1997.

[51] Stephen James, Andrew J Davison, and Edward Johns. Transferring end-to-end vi- suomotor control from simulation to real world for a multi-stage task. arXiv preprint arXiv:1707.02267, 2017.

[52] Anders Jonsson and Vicenç Gómez. Hierarchical linearly-solvable Markov decision problems. arXiv preprint arXiv:1603.03267, 2016.

[53] Leslie Pack Kaelbling. Hierarchical learning in stochastic domains: Preliminary results. In ICML, pages 167–173, 1993.

[54] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: interactive visual specification of data transformation scripts. In CHI, 2011.

[55] Sertac Karaman. Incremental sampling-based algorithms for optimal motion plan- ning. Robotics Science and Systems VI, 104:2.

[56] P Kazanzides, Z Chen, A Deguet, G.S. Fischer, R.H. Taylor, and S.P. DiMaio. An Open-Source Research Kit for the da Vinci Surgical System. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2014.

[57] Peter Kazanzides, Zihan Chen, Anton Deguet, Gregory S Fischer, Russell H Taylor, and Simon P DiMaio. An open-source research kit for the da vinci R surgical sys- tem. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 6434–6439. IEEE, 2014.

[58] Zuhair Khayyat, Ihab F Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Si Yin. Bigdansing: A system for big data cleansing. In SIGMOD, 2015.

[59] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, page 0278364913495721, 2013.

[60] J Zico Kolter, Pieter Abbeel, and Andrew Y Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. In NIPS, volume 20, 2007.

[61] George Konidaris and Andrew G Barto. Building portable options: Skill transfer in reinforcement learning. In IJCAI, volume 7, pages 895–900, 2007.

[62] George Konidaris and Andrew G Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS, pages 1015–1023, 2009.

[63] George Konidaris, Scott Kuindersma, Roderic A. Grupen, and Andrew G. Barto. Robot learning from demonstration by constructing skill trees. IJRR, 31(3):360– 375, 2012.

[64] Hanna Köpcke, Andreas Thor, and Erhard Rahm. Evaluation of entity resolution approaches on real-world match problems. In PVLDB, 2010.

[65] Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller, Florian T Pokorny, and Ken Goldberg. SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards. In WAFR, 2016.

[66] Sanjay Krishnan, Animesh Garg, Sachin Patil, Colin Lea, Gregory Hager, Pieter Abbeel, and Ken Goldberg. Transition state clustering: Unsupervised surgical tra- jectory segmentation for robot learning. In ISRR, 2015.

[67] Sanjay Krishnan*, Animesh Garg*, Sachin Patil, Colin Lea, Gregory Hager, Pieter Abbeel, Ken Goldberg, and (*denotes equal contribution). Transition State Cluster- ing: Unsupervised Surgical Trajectory Segmentation For Robot Learning. In Inter- national Symposium of Robotics Research. Springer STAR, 2015.

[68] Sanjay Krishnan, Daniel Haas, Michael J. Franklin, and Eugene Wu. Towards re- liable interactive data cleaning: A user survey and recommendations. In HILDA, 2016.

[69] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Gold- berg. Activeclean: Interactive data cleaning for statistical modeling. In PVLDB, 2016.

[70] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[71] Volker Kruger, Dennis Herzog, Sanmohan Baby, Ales Ude, and Danica Kragic. Learning actions from observations. Robotics & Automation Magazine, IEEE, 17(2):30–43, 2010.

[72] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hier- archical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In NIPS, pages 3675–3683, 2016.

[73] Aravind S Lakshminarayanan, Ramnandan Krishnamurthy, Peeyush Kumar, and Balaraman Ravindran. Option discovery in hierarchical reinforcement learning us- ing spatio-temporal clustering. arXiv preprint arXiv:1605.05359, 2016.

[74] Michael Laskey, Jonathan Lee, Wesley Hsieh, Richard Liaw, Jeffrey Mahler, Roy Fox, and Ken Goldberg. Iterative noise injection for scalable imitation learning. Conference on Robot Learning (CoRL) 2017, 2017.

[75] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 2015.

[76] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

[77] Kfir Y Levy and Nahum Shimkin. Unified inter and intra options learning using policy gradient methods. In European Workshop on Reinforcement Learning, pages 153–164. Springer, 2011.

[78] Richard Liaw, Sanjay Krishnan, Animesh Garg, Daniel Crankshaw, Joseph E Gon- zalez, and Ken Goldberg. Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning. 2017.

[79] Zelda Mariet, Rachael Harding, Sam Madden, et al. Outlier detection in heteroge- neous datasets using automatic tuple expansion. 2016.

[80] Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in rein- forcement learning using diverse density. In ICML, pages 361–368, 2001.

[81] Geoffrey McLachlan and Thriyambakam Krishnan. The EM algorithm and exten- sions, volume 382. John Wiley & Sons, 2007.

[82] Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In ECML, pages 295–306. Springer, 2002.

[83] Cesar Mendoza and Christian Laugier. Simulating soft tissue cutting using finite element models. In Robotics and Automation, 2003. Proceedings. ICRA’03. IEEE International Conference on, volume 1, pages 1109–1114. IEEE, 2003.

[84] Sebastian Mika, Bernhard Schölkopf, Alexander J. Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In NIPS, pages 536–542, 1998.

[85] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518, 2015.

[86] Adithyavairavan Murali, Animesh Garg, Sanjay Krishnan, Florian T. Pokorny, Pieter Abbeel, Trevor Darrell, and Ken Goldberg. Tsc-dl: Unsupervised trajectory segmen- tation of multi-modal surgical demonstrations with deep learning. In ICRA Confer- ence, 2016.

[87] Adithyavairavan Murali, Siddarth Sen, Ben Kehoe, Animesh Garg, Seth McFarland, Sachin Patil, W. Douglas Boyd, Susan Lim, Pieter Abbeel, and Kenneth Y. Goldberg. Learning by observation for surgical subtasks: Multilateral cutting of 3d viscoelastic and 2d orthotropic tissue phantoms. In IEEE International Conference on Robotics and Automation, ICRA 2015, Seattle, WA, USA, 26-30 May, 2015, pages 1202–1209, 2015.

[88] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.

[89] Scott Niekum, Sarah Osentoski, George Konidaris, and Andrew G. Barto. Learning and generalization of complex tasks from unstructured demonstrations. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2012, Vilamoura, Algarve, Portugal, October 7-12, 2012, pages 5239–5246, 2012.

[90] Han-Wen Nienhuys and A Frank van der Stappen. A Surgery Simulation supporting Cuts and Finite Element Deformation. In Medical Image Computing and Computer- Assisted Intervention–MICCAI 2001, pages 145–152. Springer, 2001.

[91] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.

[92] Takayuki Osa, Naohiko Sugita, and Mitsuishi Mamoru. Online Trajectory Planning in Dynamic Environments for Surgical Task Automation. In Robotics: Science and Systems (RSS), 2014.

[93] Shoumik Palkar, James J Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, Matei Zaharia, and Stanford InfoLab. Weld: A common runtime for high performance data analytics. In CIDR, 2017.

[94] Ronald Parr and Stuart J. Russell. Reinforcement learning with hierarchies of ma- chines. In NIPS, pages 1043–1049, 1997.

[95] Ronald Edward Parr. Hierarchical control and learning for Markov decision pro- cesses. PhD thesis, UNIVERSITY of CALIFORNIA at BERKELEY, 1998.

[96] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted bellman residual mini- mization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 549–564. Springer, 2014.

[97] Pravesh Ranchod, Benjamin Rosman, and George Konidaris. Nonparametric bayesian reward segmentation for skill discovery using inverse reinforcement learn- ing. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2015.

[98] Theodoros Rekatsinas, Xu Chu, Ihab F Ilyas, and Christopher Ré. Holoclean: Holis- tic data repairs with probabilistic inference. In arXiv, 2017.

[99] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995.

[100] Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. Optimization with EM and expectation-conjugate-gradient. In ICML, pages 672–679, 2003.

[101] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In ICML, pages 1312–1320, 2015.

[102] J. Schulman, A. Gupta, S. Venkatesan, M. Tayson-Frederick, and P. Abbeel. A Case Study of Trajectory Transfer through Non-Rigid Registration for a Simplified Suturing Scenario. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 4111–4117, 2013.

[103] John Schulman, Sergey Levine, Philipp Moritz, Michael I Jordan, and Pieter Abbeel. Trust region policy optimization. ICML, 2015.

[104] D Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: The high interest credit card of technical debt. 2014.

[105] S. Sen, A. Garg, D. V. Gealy, S. McKinley, Y. Jen, and K. Goldberg. Automating Multiple-Throw Multilateral Surgical Suturing with a Mechanical Needle Guide and Sequential Convex Optimization. In ICRA, 2016.

[106] Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.

[107] Sahil Sharma, Aravind S. Lakshminarayanan, and Balaraman Ravindran. Learning to repeat: Fine grained action repetition for deep reinforcement learning. In ICLR, 2017.

[108] Eftychios Sifakis, Kevin G Der, and Ronald Fedkiw. Arbitrary cutting of deformable tetrahedralized objects. In Proceedings of the 2007 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 73–80. Eurographics Association, 2007.

[109] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershel- vam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. In Nature. Nature Research, 2016.

[110] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

[111] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[112] Özgür Şimşek and Andrew G Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In ICML, page 95. ACM, 2004.

[113] Alec Solway, Carlos Diuk, Natalia Córdova, Debbie Yee, Andrew G Barto, Yael Niv, and Matthew M Botvinick. Optimal behavioral hierarchy. PLOS Comput Biol, 10(8):e1003779, 2014.

[114] Denis Steinemann, Matthias Harders, Markus Gross, and Gabor Szekely. Hybrid cutting of deformable solids. In Virtual Reality Conference, 2006, pages 35–42. IEEE, 2006.

[115] Ion Stoica, Dawn Song, Raluca Ada Popa, David Patterson, Michael W Mahoney, Randy Katz, Anthony D Joseph, Michael Jordan, Joseph M Hellerstein, Joseph E Gonzalez, et al. A berkeley view of systems challenges for ai. arXiv preprint arXiv:1712.05855, 2017.

[116] Martin Stolle. Automated discovery of options in reinforcement learning. PhD thesis, McGill University, 2004.

[117] Kaushik Subramanian, Charles L Isbell Jr, and Andrea L Thomaz. Exploration from demonstration for interactive reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 447– 456. International Foundation for Autonomous Agents and Multiagent Systems, 2016.

[118] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bag- nell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.

[119] Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, Pieter Abbeel, Wolfram Burgard, Michael Milford, et al. The limits and potentials of deep learning for robotics. The International Journal of Robotics Research, 37(4-5):405–420, 2018.

[120] Cynthia Sung, Dan Feldman, and Daniela Rus. Trajectory clustering for motion prediction. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 1547–1552. IEEE, 2012.

[121] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[122] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi- MDPs: A framework for temporal abstraction in reinforcement learning. AI, 112(1- 2):181–211, 1999.

[123] Brijen Thananjeyan, Animesh Garg, Sanjay Krishnan, Carolyn Chen, Lauren Miller, and Ken Goldberg. Multilateral surgical pattern cutting in 2d orthotropic gauze with deep reinforcement learning policies for tensioning.

[124] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, 1993.

[125] Sebastian Thrun and Anton Schwartz. Finding structure in reinforcement learning. In NIPS, pages 385–392, 1994.

[126] Aleš Ude, Bojan Nemec, Tadej Petrič, and Jun Morimoto. Orientation in cartesian space dynamic movement primitives. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 2997–3004. IEEE, 2014.

[127] Aleksandar Vakanski, Iraj Mantegh, Andrew Irish, and Farrokh Janabi-Sharifi. Trajectory learning for robot programming by demonstration using hidden markov model and dynamic time warping. IEEE Trans. Systems, Man, and Cybernetics, Part B, 42(4):1039–1052, 2012.

[128] J. Van Den Berg, S. Miller, D. Duckworth, H. Hu, A. Wan, X. Fu, K. Goldberg, and P. Abbeel. Superhuman Performance of Surgical Tasks by Robots using Iterative Learning from Human-Guided Demonstrations. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), pages 2074–2081, 2010.

[129] Ruben Veldkamp, Esther Kuhry, WC Hop, J Jeekel, G Kazemier, H Jaap Bonjer, Eva Haglind, L Pahlman, Miguel A Cuesta, Simon Msika, et al. Laparoscopic surgery versus open surgery for colon cancer: short-term outcomes of a randomised trial. Lancet Oncol, 6(7):477–484, 2005.

[130] Mikhail Volkov, Guy Rosman, Dan Feldman, John W Fisher, and Daniela Rus. Coresets for visual summarization with applications to loop closure. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 3638–3645. IEEE, 2015.

[131] Andrew Whiten, Emma Flynn, Katy Brown, and Tanya Lee. Imitation of hierarchical action structure by young children. Developmental science, 9(6):574–582, 2006.

[132] Chenxia Wu, Jiemi Zhang, Silvio Savarese, and Ashutosh Saxena. Watch-n-patch: Unsupervised understanding of actions and relations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4362–4370, 2015.

[133] Eugene Wu and Samuel Madden. Scorpion: Explaining away outliers in aggregate queries. In VLDB, 2013.

[134] Eugene Wu, Samuel Madden, and Michael Stonebraker. A demonstration of dbwipes: clean as you query. In VLDB, 2012.

[135] Jeffrey M Zacks, Christopher A Kurby, Michelle L Eisenberg, and Nayiri Haroutunian. Prediction error associated with the perceptual segmentation of naturalistic events. Journal of Cognitive Neuroscience, 23(12):4057–4066, 2011.

[136] Hui Zhang, Shahram Payandeh, and John Dill. On Cutting and Dissection of Virtual Deformable Objects. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), pages 3908–3913, 2004.

[137] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.