
The 2016 AAAI Fall Symposium Series: Artificial Intelligence for Human-Robot Interaction Technical Report FS-16-01

Dimensionality Reduced Reinforcement Learning for Assistive Robots

William Curran, Oregon State University, Corvallis, Oregon ([email protected])
Tim Brys, Vrije Universiteit Brussel, Belgium ([email protected])
David Aha, Navy Center for Applied Research in AI ([email protected])
Matthew Taylor, Washington State University, Pullman, Washington ([email protected])
William D. Smart, Oregon State University, Corvallis, Oregon ([email protected])

Abstract

State-of-the-art personal robots need to perform complex manipulation tasks to be viable in assistive scenarios. However, many of these robots, like the PR2, use manipulators with high degrees-of-freedom, and the problem is made worse in bimanual manipulation tasks. The complexity of these robots leads to large dimensional state spaces, which are difficult to learn in. We reduce the state space by using demonstrations to discover a representative low-dimensional hyperplane in which to learn. This allows the agent to converge quickly to a good policy. We call this Dimensionality Reduced Reinforcement Learning (DRRL). However, when performing dimensionality reduction, not all dimensions can be fully represented. We extend this work by first learning in a single low-dimensional hyperplane, and then transferring that knowledge to a higher-dimensional hyperplane. By using our Iterative DRRL (IDRRL) framework with an existing learning algorithm, the agent converges quickly to a better policy by iterating to increasingly higher dimensions. IDRRL is robust to demonstration quality and can learn efficiently using few demonstrations. We show that adding IDRRL to the Q-Learning algorithm leads to faster learning on a set of mountain car tasks and the robot swimmers problem.

1 Introduction

Our ultimate goal is to deploy personal robots into the world, and have members of the general public retask them without having to resort to explicitly programming them. Learning from demonstration (LfD) methods learn a policy using examples or demonstrations given by a human to speed up learning a custom task (Argall et al. 2009). However, these demonstrations must be consistent and accurately represent solving the task. These methods also solve for a specific complex task, rather than solve for general control (Argall et al. 2009). In this paper we present an approach to directly address the problem of learning good policies with RL in high-dimensional state spaces.

The first step in our research goals is to develop an efficient method for teaching the robot. Reinforcement learning is an ideal approach in our application. It can be used to teach a robot new skills that the human teacher cannot demonstrate, find novel ways to reach human-defined goals, and can be used to find solutions to difficult problems with no analytic formulation (Kormushev, Calinon, and Caldwell 2013). However, reinforcement learning does not scale well, and real-world robotics problems are high-dimensional (Kober and Peters 2012). Learning in high-dimensional spaces not only significantly increases the time and memory requirements of many algorithms, but also degenerates performance due to the curse of dimensionality (Kaelbling, Littman, and Moore 1996). Personal robots need to perform complex manipulation tasks to be viable in many scenarios. Complex manipulations require high degree-of-freedom arms and manipulators. For example, the PR2 robot has two 7 degree-of-freedom arms. When learning position and velocity control, this leads to a 14 dimensional state space per arm.

In this work, we focus on the core problem of high-dimensional state spaces. We introduce two algorithms: Dimensionality Reduced Reinforcement Learning (DRRL) and Iterative DRRL. In DRRL we use demonstrations to compute a projection to a low-dimensional hyperplane. In each learning iteration, we project the current state onto this hyperplane, compute and execute an action, project the new state onto the hyperplane, and perform a reinforcement learning update. This general approach has been shown to reduce the exploration needed and to accelerate the learning rate of reinforcement learning algorithms (Bitzer, Howard, and Vijayakumar 2010; Colome et al. 2014).
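As a concrete, simplified illustration of this projected learning loop, the sketch below pairs a demonstration-derived PCA projection with a basic tabular Q-learner. The environment interface (reset/step), the discretize helper, and the constants are assumptions for illustration only; the paper itself uses CMAC tile coding rather than this crude binning.

```python
import numpy as np
from collections import defaultdict

def drrl_episode(env, components, mean, q_table, n_actions,
                 alpha=0.1, gamma=0.99, epsilon=0.1, resolution=0.1):
    """Run one DRRL episode: act and learn entirely in the reduced state space.

    components : (d, D) PCA projection matrix computed from demonstrations.
    mean       : (D,) mean of the demonstration states, used for centering.
    q_table    : defaultdict(float) keyed by (discretized reduced state, action).
    env        : assumed interface -- reset() -> state, step(a) -> (state, reward, done).
    """
    def project(state):
        # Project the full D-dimensional state onto the d-dimensional hyperplane.
        return components @ (np.asarray(state, dtype=float) - mean)

    def discretize(z):
        # Crude stand-in for tile coding: bin the reduced state into a hashable key.
        return tuple(np.round(z / resolution).astype(int))

    state = env.reset()
    done = False
    while not done:
        s = discretize(project(state))
        if np.random.rand() < epsilon:                      # epsilon-greedy exploration
            action = np.random.randint(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q_table[(s, a)])
        next_state, reward, done = env.step(action)
        s_next = discretize(project(next_state))            # project the new state too
        best_next = max(q_table[(s_next, a)] for a in range(n_actions))
        # Standard Q-learning update, performed on the projected states.
        q_table[(s, action)] += alpha * (reward + gamma * best_next - q_table[(s, action)])
        state = next_state

# Example wiring (assumes env, components, mean, n_actions already exist):
# q_table = defaultdict(float)
# drrl_episode(env, components, mean, q_table, n_actions)
```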
The robot can learn more quickly in the low-dimensional hyperplane. However, this leads to a critical trade-off. By projecting onto a low-dimensional hyperplane, we are discarding potentially important data. By adding DRRL to an existing algorithm, we show that the robot can converge to a good policy much faster. However, since DRRL does not represent all dimensions, it could converge to a poor policy.

In many learning domains, poor policies are undesirable. In robotics in particular, bad controllers can damage the robot. We propose a novel framework, IDRRL, combining learning from demonstration techniques, dimensionality reduction, and transfer learning. Instead of learning entirely in one hyperplane, we iteratively learn in all hyperplanes by using transfer learning. The robot can quickly learn in a low-dimensional space d, and transfer that knowledge from d dimensions to the d+1 dimensional space using the known mapping between the spaces.

Our novel approach is a framework to improve other learning algorithms when working in high-dimensional spaces. It combines the speed of low-dimensional learning and the expressiveness of the full state space. We show in a set of mountain car tasks and the robot swimmers problem that reinforcement learning algorithms modified with DRRL or IDRRL can converge quickly to a better policy than learning entirely in the full dimensional space.

2 Background

To motivate our approach, we outline previous work performed in the fields of reinforcement learning, dimensionality reduction, transfer learning, and learning from demonstration.

2.1 Reinforcement Learning

In our reinforcement learning (RL) approach, we use the standard formulation of MDPs (Kaelbling, Littman, and Moore 1996). An MDP is a 4-tuple ⟨S, A, T, R⟩, where S is a set of states, A is a set of actions, T is a probabilistic state transition function T(s, a, s'), and R is the reward function R(s, a).

In this work, we use function approximation for generalization. CMACs (Albus 1981) partition a state space into a set of overlapping tiles, and maintain the weights (θ) of each tile. The accuracy of the generalization is improved as the number of tilings (n) increases. Each tile has an associated binary value (φ) to indicate whether that tile is present in the current state.

The estimate of the value function is:

Q_t(s, a) = Σ_{i=1}^{n} θ(i)φ(i)    (1)

where Q_t(s, a) is the estimated value function, θ is the weight vector, and φ is the binary feature vector. Given a learning example, we adjust the weights of the involved tiles by the same amount to reduce the error. We use standard model-free Q-Learning to update our function approximation:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α(R_{t+1}(s_t, a_t) + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t))    (2)

where α is the learning rate and γ is a discount factor between 0 and 1 that represents the importance of future rewards.
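A minimal sketch of Equations 1 and 2, assuming a simple hand-rolled tile coder; the hashing of active tiles, the per-tiling step-size split, and the parameter defaults are illustrative choices, not the CMAC implementation used in the paper.

```python
import numpy as np

class TileCodedQ:
    """Tile-coded action-value function (Eq. 1) with a Q-learning update (Eq. 2)."""

    def __init__(self, state_low, state_high, n_tilings=16, tiles_per_dim=10,
                 n_actions=3, alpha=0.1, gamma=0.99):
        self.low = np.asarray(state_low, dtype=float)
        self.high = np.asarray(state_high, dtype=float)
        self.n_tilings = n_tilings
        self.tiles = tiles_per_dim
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        dims = len(self.low)
        # One weight vector theta per action; phi is implicit in the active-tile indices.
        self.theta = np.zeros((n_actions, n_tilings * tiles_per_dim ** dims))

    def _active_tiles(self, state):
        """Indices of the single active tile in each tiling (the binary features phi)."""
        scaled = (np.asarray(state, dtype=float) - self.low) / (self.high - self.low)
        dims = len(self.low)
        indices = []
        for t in range(self.n_tilings):
            offset = t / self.n_tilings / self.tiles           # shift each tiling slightly
            coords = np.clip(((scaled + offset) * self.tiles).astype(int), 0, self.tiles - 1)
            flat = int(np.ravel_multi_index(coords, (self.tiles,) * dims))
            indices.append(t * self.tiles ** dims + flat)
        return indices

    def value(self, state, action):
        # Eq. 1: Q(s, a) = sum_i theta(i) phi(i); phi(i) is 1 only for the active tiles.
        return self.theta[action, self._active_tiles(state)].sum()

    def update(self, s, a, reward, s_next):
        # Eq. 2: compute the TD error and spread it equally over the tiles active in s.
        best_next = max(self.value(s_next, b) for b in range(self.n_actions))
        td_error = reward + self.gamma * best_next - self.value(s, a)
        for i in self._active_tiles(s):
            self.theta[a, i] += self.alpha / self.n_tilings * td_error
```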
A significant amount of research in RL focuses on increasing the speed of learning by taking advantage of domain knowledge. Some techniques include agent partitioning, which focuses mainly on how to divide the problem by the state space, actions, or goals (Curran, Agogino, and Tumer 2013; Jordan and Jacobs 1993; Reddy and Tadepalli 1997); generalizing over the state space with techniques such as tile coding (Whiteson, Taylor, and Stone 2007), neural networks (Haykin 1998), or k-nearest neighbors (Martin H., de Lope, and Maravall 2009); and learning with temporally defined actions, such as options (Sutton, Precup, and Singh 1999). Our framework is designed to be used with these approaches. In this paper, we use the tile coding generalization.

2.2 Dimensionality Reduction for Learning

Previous work in dimensionality reduction focuses on reducing the space for classification or function approximation. Principal Component Analysis (PCA) (Jolliffe 2002) is effective in many applications at extracting features from large data sets (Pechenizkiy, Puuronen, and Tsymbal 2003; Turk and Pentland 1991). Feature selection, which aims at reducing the dimensionality by selecting a subset of the most relevant features, has also been proven to be an effective and efficient way to handle high-dimensional data (Cobo et al. 2011).

In our work, we use PCA to discover the low-dimensional representation of the state space during learning. It does this by computing a transform that converts correlated data to linearly uncorrelated data. This transformation ensures that the first principal component captures the largest possible variance. Each additional component captures the largest possible variance uncorrelated with all previous components. Essentially, PCA represents as much of the demonstrated state space as possible in a lower dimension.

Rather than transferring knowledge from a simple representation to a complex one, Taylor, Kulis, and Sha (2011) learn directly which states are irrelevant. They collect data while the agent explores the environment, calculate a state similarity metric, and ignore state variables that do not add additional information and scale those that do. Similarly, Thomas et al. (2011) use demonstrations to develop a subset of features from the original space. The learning algorithm then uses this subspace to predict the action that a human expert would take. Both approaches find state variables to ignore or emphasize. This works well if there are unimportant state variables, but would have issues where there are infrequent or non-demonstrated state variables with critically important data. In our IDRRL framework, we iteratively add additional state information until the agent learns in the full state space, essentially combining the speed of low-dimensional learning and the expressiveness of the full state space.

Similar to our approach, Colomé et al. (2014) apply dimensionality reduction techniques to exploit underlying manifolds in the state space. They learn probabilistic motor primitives in a low-dimensional space discovered by probabilistic dimensionality reduction. Bitzer et al. (2010) also apply dimensionality reduction to solve reinforcement learning planning problems in a reduced space that automatically satisfies their task constraints.

Although these approaches greatly accelerate learning, they learn entirely in the low-dimensional space and could discard potentially important data.

2.3 Transfer Learning

The core idea of transfer learning is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. If the relationship between the first (source) task and the second (target) task is not trivial, there must be a mapping between the two tasks so that a learner can apply the older knowledge to the new task (Taylor and Stone 2009). An inter-task mapping is a general structure that defines how two tasks are related. The mappings χ_S and χ_A are defined as mappings between the state variables and the actions in the two tasks, respectively.

In this work, we transfer from a low-dimensional representation to a higher-dimensional representation. Therefore, the states between representations are not the same. Additionally, the number of state variables we represent changes as we change the number of dimensions in the problem. To perform the state projection, we compute a mapping using dimensionality reduction. This mapping is χ_S, which is what we use during transfer. Our action representation remains the same, and therefore we do not need to compute χ_A.

2.4 Learning from Demonstration

Learning a policy using traditional reinforcement learning is difficult in real-world applications. Initializing, or bootstrapping, the policy close to the desired robot behavior makes finding an optimal or near-optimal solution easier. LfD learns a policy using examples or demonstrations provided by a human. These examples are typically state-action pairs that are recorded during the teacher's demonstration. These state-action samples are used to initialize a policy that can then either be directly utilized, or improved using reinforcement learning (Argall et al. 2009).

There are a variety of techniques that LfD algorithms use when deriving a policy. The demonstrated data can be used to initialize a policy (Peters and Schaal 2006), develop a reward function (Thomaz and Breazeal 2006), or build a state transition function (Argall et al. 2009). Our approach takes the demonstrated data and uses it to discover a representative lower-dimensional hyperplane using dimensionality reduction.

3 Dimensionality Reduced Reinforcement Learning (DRRL)

To learn in high-dimensional state spaces, our algorithm first computes a mapping between the high-dimensional space and a lower-dimensional space. To perform this computation, we need trajectories across a representative set of the agent's state space. We can then use any dimensionality reduction technique to learn the transform. In this work, we use PCA. Note that we use PCA due to its popularity and ease of access; DRRL and IDRRL can use alternative dimensionality reduction techniques.

First, we project the state down onto a low-dimensional hyperplane. We then compute the action using the chosen RL algorithm, and execute that action in simulation. The simulation calculates the new state given the executed action, which we then project down to the same lower-dimensional space. We can then perform a learning update.

By learning in a smaller space, reinforcement learning algorithms should converge faster. However, in most cases one principal component cannot represent all of the variance in all of the demonstrations. Therefore, even given infinite time, the converged learning performance in a non-trivial case will always be strictly worse than learning in the full space. This leads to a critical trade-off. By projecting onto a low-dimensional hyperplane, we are throwing out low-variance, yet possibly critically important, data. However, we experimentally validate that with our DRRL framework, learning can still converge to a good policy much faster than with the reinforcement learning algorithm alone.

3.1 Iterative DRRL (IDRRL)

The key aspect of performing iterative learning is transferring what was learned in the low-dimensional space to the high-dimensional space. To do this, we borrow from the field of transfer learning. In this work, we want to transfer all of the knowledge from the source to the target task. Additionally, the mapping from the source to the target task (χ_S(s)) is given by the dimensionality reduction mapping. This eases the transfer problem greatly. We only need to choose when to transfer. If we transfer too early, the value function in the source domain is far from optimal. This bad knowledge can be spread throughout the value function in the target task.

To choose when to transfer, we borrow from the definition of convergence in Policy Iteration (Sutton and Barto 1998). In Policy Iteration, the value function is updated at each iteration until the policy does not change between updates. We make this policy comparison and ensure that the most recently executed policy is at least 95% converged (unchanged before and after an update) before transferring to the higher space.

Since IDRRL is a general framework, we can use any transfer learning approach within the constraints of the learning algorithm. We use Q-Value Reuse (Taylor, Whiteson, and Stone 2007). Q-Value Reuse is a simple technique applicable when the source and target task both use TD learning, as in our case. In Q-Value Reuse, a copy of the source task's value function is retained and used to calculate the target task's Q-value. This computed Q-value is a combination of the source task's saved value function and the target task's value function:

Q(s, a) = Q_source(χ_S(s), a) + Q_target(s, a)    (3)

where χ_S is the transfer function between the source and target tasks' states. This transfer function is the PCA projection. We then compute the Q-Learning update step as normal, but only the target's value function is updated.
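The sketch below illustrates both mechanisms just described: a policy-convergence check (at least 95% of sampled states keep the same greedy action before and after an update round) and the Q-Value Reuse combination of Equation 3, where only the target table is written to. The function names, the sampled-state test, and the defaultdict(float) value tables keyed by (state, action) are assumptions for illustration.

```python
def policy_converged(q_before, q_after, states, n_actions, threshold=0.95):
    """Return True if the greedy policy is unchanged on at least `threshold`
    of the sampled states, comparing value tables before and after updates."""
    unchanged = 0
    for s in states:
        greedy_before = max(range(n_actions), key=lambda a: q_before[(s, a)])
        greedy_after = max(range(n_actions), key=lambda a: q_after[(s, a)])
        unchanged += int(greedy_before == greedy_after)
    return unchanged / len(states) >= threshold


def q_value_reuse(q_source, q_target, project_to_source, s, a):
    """Eq. 3: combined value = Q_source(chi_S(s), a) + Q_target(s, a).
    q_source stays frozen from the lower-dimensional stage."""
    return q_source[(project_to_source(s), a)] + q_target[(s, a)]


def target_update(q_source, q_target, project_to_source, s, a, reward, s_next,
                  n_actions, alpha=0.1, gamma=0.99):
    """Q-learning update in the target space: bootstrap with the combined
    Q-value, but write the correction into the target table only."""
    combined = lambda state, action: q_value_reuse(q_source, q_target,
                                                   project_to_source, state, action)
    best_next = max(combined(s_next, b) for b in range(n_actions))
    td_error = reward + gamma * best_next - combined(s, a)
    q_target[(s, a)] += alpha * td_error
```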

3.2 Convergence Guarantees

Q-Value Reuse can be seen as a heuristic for action-value initialization. Strehl, Li, and Littman (2009) demonstrated that if the action-values are admissible, then this initialization can decrease the sample complexity while maintaining PAC-MDP (i.e., bounded convergence) guarantees. Admissible heuristics provide valuable prior knowledge to PAC-MDP RL algorithms, but the specified prior knowledge does not need to be exact. An initialization heuristic (H) is said to be admissible if:

V*(s) ≤ Q*(s, a) ≤ H*(s, a) ≤ R_max / (1 − γ)    (4)

where V*(s) is the optimal value function, Q*(s, a) is the optimal action-value function, H*(s, a) is the initialization heuristic, and R_max / (1 − γ) is the maximum possible value.

Mann and Choe (2012) extend PAC-MDP theory to intertask transfer learning. They introduce the concept of weakly admissible heuristics and show that they can still maintain PAC-MDP guarantees. To be weakly admissible, a heuristic needs to be admissible for only one action in each state. They combine weakly admissible heuristics and intertask mappings and prove they are also PAC-MDP if for each state s there is an action a such that:

V*_trg(s) − α ≤ Q*_trg(s, a) ≤ Q*_src(χ_S(s), χ_A(a))    (5)

where α is the smallest non-negative value satisfying this inequality.

Q-Learning does not provide PAC-MDP bounds. However, we still enforce an admissible heuristic to make use of the optimism in the face of uncertainty bias (Brafman and Tennenholtz 2003). This bias has been shown to reduce the chance of converging to a locally optimal policy. As it stands, Q-Value Reuse is not admissible. It is guaranteed to be less than R_max / (1 − γ), but is not guaranteed to be greater than Q*(s, a). We remove this issue by simply adding R_max to each Q_source(χ_S(s), a) value.

Although this paper uses Q-Learning, we remind the reader that IDRRL is a general framework. Thus, this proof is relevant when IDRRL is combined with PAC-MDP algorithms like R-Max (Jong and Stone 2007).

4 Experimental Setup

In our experiments, we combine both DRRL and IDRRL with Q-Learning (Sutton and Barto 1998). We show that DRRL combined with Q-Learning converges quickly, but could converge to a locally optimal solution due to the less informative state representation. Alternatively, we also show that IDRRL with Q-Learning converges quickly without reducing performance. We also show that our IDRRL framework scales well with the size of the state and action space.

4.1 Mountain Car

To test the efficacy of DRRL, we first consider the Mountain Car 3D domain. Mountain Car is a standard reinforcement learning domain (Dutech et al. 2005). In this problem an underpowered car must drive up a steep hill. The problem is engineered such that the car cannot overcome the effects of gravity, and cannot simply drive up the hill. Since the car starts in a valley, the agent must learn to build up enough inertia by driving partially up the opposite hill before it is able to make it to the goal.

In 2D Mountain Car, there are two state variables, defined as the continuous position (−1.2 ≤ x ≤ 0.6) and velocity (−0.007 ≤ v ≤ 0.007) of the car. There are three actions: accelerate left, accelerate right, and neutral. The starting state is a random position at a random velocity. Lastly, the reward is -1 at each time step, and 100 at the goal. The 3D and 4D variants of Mountain Car are similar. There are two new state variables (position and velocity) and two new actions (accelerate/decelerate in that direction) in the 3D variant, and four new state variables and actions in the 4D variant. Note that the 4D variant is not physically possible and is used to show DRRL and IDRRL scaling to a higher-dimensional state space.

4.2 Swimmers

The Swimmers domain (Coulom 2002) is a more complex system than Mountain Car. It includes complex physics with a large state and action space. In the Swimmers domain, there is a simple swimmer (Figure 1) connected by joints that moves in a two-dimensional pool. The action space is a torque applied at each joint. The goal of the Swimmers domain is to swim as fast as possible to the right, by using the friction of the water.

Figure 1: N-link swimmer. The swimmer must learn to leverage the viscous friction of the water to swim.

We define the state of the swimmer as the angular position and velocity at each joint. Therefore an n-link swimmer has 2n state variables. The action space consists of the n − 1 control torques at each joint. At each control step the learner chooses between a −3 Nm, 0 Nm, or 3 Nm torque, giving 3^(n−1) actions. We reward the swimmer for moving as fast as possible to the right (Δx). In this work we use a 3-link and a 6-link swimmer.

To move the swimmer, the reinforcement learner must learn to control an n-link object using the torques at each joint. It must learn to leverage the viscous friction and learn the nonlinear dynamics of the system. This is reminiscent of robot arms: the state and action spaces are analogous, and passive dynamics act on the agent. In future work we intend to apply the IDRRL framework to learning the control of a robot arm.
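To make these sizes concrete, here is a small sketch that enumerates the swimmer's state and action spaces as defined above (2n state variables and 3^(n−1) joint-torque combinations); this is bookkeeping only, not the physics simulator.

```python
from itertools import product

def swimmer_spaces(n_links):
    """State dimensionality and discrete action set for an n-link swimmer,
    following the definitions in Section 4.2."""
    # Per the paper's definition, an n-link swimmer has 2n state variables
    # (angular position and angular velocity).
    state_dim = 2 * n_links
    # One of {-3, 0, +3} Nm at each of the n - 1 actuated joints.
    torques = (-3.0, 0.0, 3.0)
    actions = list(product(torques, repeat=n_links - 1))
    return state_dim, actions

for n in (3, 6):
    state_dim, actions = swimmer_spaces(n)
    print(f"{n}-link swimmer: {state_dim} state dims, {len(actions)} actions")
    # 3-link: 6 state dims, 9 actions; 6-link: 12 state dims, 243 actions.
```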

5 Results and Analysis

We applied DRRL and IDRRL with Q-Learning to three domains: Mountain Car 3D, Mountain Car 4D, and Swimmers. For each experiment we use the following parameter settings for 20 statistical runs: α = 0.1 and γ = 0.99. Error bars are shown in each graph and represent the error in the mean. If an error bar is not visible, the error was negligible.
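A sketch of how this averaging might be wired up, using the stated settings (20 statistical runs, α = 0.1, γ = 0.99); make_agent, run_episode, and the episode count are placeholders rather than details from the paper.

```python
import numpy as np

ALPHA, GAMMA, N_RUNS = 0.1, 0.99, 20
N_EPISODES = 2000  # placeholder; the paper does not fix this value here

def evaluate(make_agent, run_episode):
    """Average learning curves over independent statistical runs and report
    the standard error of the mean (the error bars shown in the figures)."""
    returns = np.zeros((N_RUNS, N_EPISODES))
    for run in range(N_RUNS):
        agent = make_agent(alpha=ALPHA, gamma=GAMMA)
        for ep in range(N_EPISODES):
            returns[run, ep] = run_episode(agent)
    mean = returns.mean(axis=0)
    sem = returns.std(axis=0, ddof=1) / np.sqrt(N_RUNS)   # error in the mean
    return mean, sem
```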

5.1 Mountain Car

In our formulation of Mountain Car we used 16 tiles and a 10^n tiling, where n is the number of state variables. There are 4 state variables and 5 actions in the 3D variant, and 6 state variables and 7 actions in the 4D variant. We first learn using good demonstrations and single dimensions to test the efficacy of DRRL. We also show that if the hyperplane does not represent enough variance in the data, the learning algorithm will converge to a poor policy. We then demonstrate that this issue is alleviated by IDRRL. Lastly, we test robustness with respect to demonstration quality and amount. We then use the Mountain Car 4D domain to show scalability. To gather the demonstrations we learned good and bad policies with Q-Learning and computed random policies. We define good and bad policies by the reward they received during learning. Good demonstrations reached the goal within 300 time steps, and bad demonstrations within 500–1000 steps. Bad demonstrations reached the goal state, just less efficiently.

In 3D and 4D Mountain Car there was no single state variable more important than all other state variables. Our PCA analysis showed that the first two principal components weighted all of the state variables equally, independent of demonstration quality. Since the demonstrations explored much of the configuration space of the agent, this showed us which states were important to the agent's general movement.
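A minimal sketch of this PCA step, assuming the demonstration states are stacked row-wise in a NumPy array; the printed component weights are what would be inspected to reproduce the equal-weighting observation above.

```python
import numpy as np

def fit_projection(demo_states, d):
    """Fit a d-dimensional PCA projection from demonstration states.

    demo_states : (N, D) array, one demonstrated state per row.
    Returns the centering mean and a (d, D) projection matrix whose rows are
    the top-d principal components.
    """
    mean = demo_states.mean(axis=0)
    centered = demo_states - mean
    # SVD of the centered data: the rows of vt are the principal components.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    print("explained variance ratio:", np.round(explained, 3))
    print("component weights:\n", np.round(vt[:d], 3))  # per-variable weights
    return mean, vt[:d]

def project(state, mean, components):
    # chi_S: map a full state onto the demonstration hyperplane.
    return components @ (np.asarray(state, dtype=float) - mean)
```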
Learning in only one d-dimensional hyperplane converged faster than learning in the full state space (Figure 2). DRRL converged to the optimal solution using only a 2 or 3 dimensional hyperplane, rather than the full dimensional space of 4. This tells us that the Mountain Car domain is simple enough to be learned in a 2 dimensional space. It also converged very quickly to a poor solution in a 1 dimensional hyperplane. This is because, by using only a single dimension, DRRL does not have a rich enough state space to learn optimally.

Figure 2: We compare Q-Learning to Q-Learning combined with DRRL using good quality demonstrations. Q-Learning combined with DRRL converged faster to an optimal solution when learning in the 2 and 3 dimensional hyperplanes. The single dimensional hyperplane did not contain enough information to learn effectively.

Combining standard reinforcement learning with IDRRL led the agent to converge faster with the same converged performance (Figure 3). DRRL converged slightly faster than IDRRL, but IDRRL benefits by eventually learning in the entire state space. This means IDRRL does not lose any state information to speed up learning. Moreover, it does not require the algorithm designer to know beforehand which is the best hyperplane to learn in.

Since IDRRL represents the state space in low-dimensional and sparse hyperplanes, it converges very quickly. With each additional dimension, it starts with a richer state space and the experience gained from all previous dimensions. By episode 1,000, IDRRL bootstrapped learning in the full dimensional space, and was near an optimal solution. This results in much faster convergence than learning entirely in the full dimensional space (Figure 3).

Figure 3: We compare Q-Learning to Q-Learning combined with IDRRL, using demonstrations of varying quality. IDRRL converged at the same speed to the optimal solution when given good, bad, or random demonstrations. In Mountain Car 3D, IDRRL is robust to suboptimal demonstrations.

To analyze the robustness of the approach, we varied the quality of demonstration data as well as the amount. Figure 3 shows the relationship between demonstration quality and performance. There is no significant difference between good, bad, and random demonstrations. This is due to the equal weighting of all state variables by PCA.

To test the robustness of IDRRL we also varied the amount of demonstration data. For this analysis, we used random demonstrations and varied the amount of demonstration data between 1,000 and 25,000 demonstration states. We only test with random demonstrations, since demonstration quality was not a factor in performance for Mountain Car 3D. None of the random demonstrations reached the goal state, and each demonstration trajectory was approximately 2,000 samples. The experiment with 1,000 demonstration points converged slightly slower, and there was no significant difference between 10,000 and 25,000 states.

Figure 4: In Mountain Car 3D, there is no significant difference in the performance of IDRRL when using 10,000 or 25,000 demonstration states.

IDRRL scales well with the size of the state space. We modified Mountain Car 3D to add an additional fourth dimension. Mountain Car 4D has 6 continuous state variables and 7 actions. There are position and velocity states and acceleration/deceleration actions for each of the x, y, and z dimensions. The trends seen previously in Mountain Car 3D are emphasized with the additional states (Figure 5). By using IDRRL, the agent converges much faster to a good solution.

Figure 5: In Mountain Car 4D, Q-Learning with IDRRL scales well with the size of the state space.

5.2 Swimmers

In our formulation of Swimmers we used 32 tiles and a 10^n tiling, where n is the number of state variables. There are 6 state variables and 9 actions in the 3-link swimmer, and 12 state variables and 243 actions per state in the 6-link swimmer. Similar to our Mountain Car experiments, we gather demonstrations by learning in the domain with Q-Learning and collecting demonstrations. These demonstrations represent the best policies found with Q-Learning. In Swimmers, the performance is measured by how far the swimmer has moved to the right (Δx). In the 3-link swimmer problem, the best policy performed well, but in the 6-link problem the policies were highly suboptimal due to the large state and action space.

The Swimmers domain is a more complex system than Mountain Car. It includes complex physics with a large state and action space. This is where IDRRL can greatly increase learning performance when added to an existing algorithm. When adding IDRRL to Q-Learning, the new learning algorithm can learn a controller for a 3-link swimmer faster (Figure 6).

Figure 6: Swimmers is a high-dimensional problem with many state dimensions. Q-Learning with IDRRL converges faster than standard Q-Learning when learning a control policy for a 3-link swimmer with 6 state dimensions and 9 actions.

IDRRL scales effectively to a state space of 12 dimensions with 243 possible actions at each state (Figure 7). By initially projecting the state space onto a hyperplane, IDRRL samples many of the actions in a smaller space. It then generalizes what was learned in the low-dimensional space to the higher dimensions. This generalization causes IDRRL to scale well with the size of both the state and action space.

Figure 7: A 6-link swimmer is difficult to control due to an extremely large state and action space. Q-Learning with IDRRL learns a swimmer control policy quickly in 12 state dimensions and 243 actions per state.

6 Conclusion and Future Work

The DRRL and IDRRL frameworks improve the performance of an existing algorithm by combining the speed of low-dimensional learning and the expressiveness of the full state space. By projecting the state space onto a low-dimensional hyperplane, our methods are able to represent a complex state with only a few state variables. Then, by incrementally transferring the knowledge from low-dimensional spaces into higher-dimensional ones, IDRRL learns good policies faster than the reinforcement learning algorithm alone.

In future work we would like to analyze task demonstrations for learning. We hypothesize that by using IDRRL, the robot will learn efficiently and be robust to demonstration quality, a classic issue in the learning from demonstration literature (Argall et al. 2009). Therefore, we would like to test this approach on a robot platform. We also want to give the robot agent additional state information and see if it can efficiently leverage the new states and learn an effective policy. However, as it stands, too many iterations are needed to be practical on robots. This is a function of the reinforcement learning algorithm used. In future work, we will use R-Max (Jong and Stone 2007) to greatly speed up learning. By combining R-Max with our IDRRL approach, we believe we can greatly improve performance in robot applications.

References

Albus, J. S. 1981. Brains, Behavior, and Robotics. Byte Books.

Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5):469–483.

Bitzer, S.; Howard, M.; and Vijayakumar, S. 2010. Using dimensionality reduction to exploit constraints in reinforcement learning. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 3219–3225.

Brafman, R. I., and Tennenholtz, M. 2003. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213–231.

Cobo, L. C.; Zang, P.; Isbell, C. L.; and Thomaz, A. L. 2011. Automatic state abstraction from demonstration. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence.

Colome, A.; Neumann, G.; Peters, J.; and Torras, C. 2014. Dimensionality reduction for probabilistic movement primitives. In 2014 IEEE-RAS International Conference on Humanoid Robots, 794–800.

Coulom, R. 2002. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. Ph.D. Dissertation, Institut National Polytechnique de Grenoble.

Curran, W. J.; Agogino, A.; and Tumer, K. 2013. Addressing hard constraints in the air traffic problem through partitioning and difference rewards. In Proceedings of the 2013 International Conference on Autonomous Agents and Multiagent Systems, 1281–1282.

Dutech, A.; Edmunds, T.; J. Kok, M. L.; Littman, M.; Riedmiller, M.; Russell, B.; Scherrer, B.; Sutton, R.; Timmer, S.; Vlassis, N.; White, A.; and Whiteson, S. 2005. NIPS workshop: Reinforcement Learning Benchmarks and Bake-offs II.

Haykin, S. 1998. Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2nd edition.

Jolliffe, I. 2002. Principal Component Analysis. Springer Series in Statistics. Springer.

Jong, N. K., and Stone, P. 2007. Model-based exploration in continuous state spaces. In The Seventh Symposium on Abstraction, Reformulation, and Approximation.

Jordan, M., and Jacobs, R. A. 1993. Hierarchical mixtures of experts and the EM algorithm. In Proceedings of the 1993 International Joint Conference on Neural Networks, volume 2, 1339–1344.

Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4(1):237–285.

Kober, J., and Peters, J. 2012. Reinforcement learning in robotics: A survey. In Reinforcement Learning, volume 12 of Adaptation, Learning, and Optimization. Springer Berlin Heidelberg. 579–610.

Kormushev, P.; Calinon, S.; and Caldwell, D. G. 2013. Reinforcement learning in robotics: Applications and real-world challenges. Robotics 2(3):122.

Mann, T. A., and Choe, Y. 2012. Directed exploration in reinforcement learning with transferred knowledge. JMLR Workshop and Conference Proceedings: EWRL 24:59–76.

Martin H., J.; de Lope, J.; and Maravall, D. 2009. The kNN-TD reinforcement learning algorithm. In Methods and Models in Artificial and Natural Computation. A Homage to Professor Mira's Scientific Legacy, volume 5601 of Lecture Notes in Computer Science. 305–314.

Pechenizkiy, M.; Puuronen, S.; and Tsymbal, A. 2003. Feature extraction for classification in knowledge discovery systems. In Knowledge-Based Intelligent Information and Engineering Systems, volume 2773 of Lecture Notes in Computer Science. Springer. 526–532.

Peters, J., and Schaal, S. 2006. Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2219–2225.

Reddy, C., and Tadepalli, P. 1997. Learning goal-decomposition rules using exercises. In Proceedings of the 14th International Conference on Machine Learning, 278–286.

Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research 10:2413–2444.

Sutton, R. S., and Barto, A. G. 1998. Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1st edition.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.

Taylor, M. E., and Stone, P. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research 10:1633–1685.

Taylor, M. E.; Kulis, B.; and Sha, F. 2011. Metric learning for reinforcement learning agents. In Proceedings of the 2011 International Conference on Autonomous Agents and Multiagent Systems.

Taylor, M. E.; Whiteson, S.; and Stone, P. 2007. Transfer via inter-task mappings in policy search reinforcement learning. In Proceedings of the 2007 International Conference on Autonomous Agents and Multiagent Systems.

Thomaz, A. L., and Breazeal, C. 2006. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In Proceedings of the 21st National Conference on Artificial Intelligence, 1000–1005.

Turk, M., and Pentland, A. 1991. Face recognition using eigenfaces. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 586–591.

Whiteson, S.; Taylor, M. E.; and Stone, P. 2007. Adaptive tile coding for value function approximation. Technical Report AI-TR-07-339, University of Texas at Austin.
