Online Feature Selection for Model-based Reinforcement Learning

Trung Thanh Nguyen [email protected]
Zhuoru Li [email protected]
Tomi Silander [email protected]
Tze-Yun Leong [email protected]

School of Computing, National University of Singapore, Singapore, 117417

Abstract

We propose a new framework for learning the world dynamics of feature-rich environments in model-based reinforcement learning. The main idea is formalized as a new, factored state-transition representation that supports efficient online-learning of the relevant features. We construct the transition models through predicting how the actions change the world. We introduce an online sparse coding learning technique for feature selection in high-dimensional spaces. We derive theoretical guarantees for our framework and empirically demonstrate its practicality in both simulated and real robotics domains.

1. Introduction

In model-based reinforcement learning (RL), factored state representations, often in the form of dynamic Bayesian networks (DBNs), are deployed to exploit structures of the world dynamics. This allows the agent to plan and act in large state spaces without actually visiting every state. However, learning the world dynamics of a complex environment is very difficult and often computationally infeasible. Most recent work in this area is based on the RMAX framework (Brafman & Tennenholtz, 2003), and focuses on sample-efficient learning of the optimal policies. This approach incurs heavy computational costs for maximizing information gain from every interaction, even in carefully designed, low-dimensional spaces.

We propose a variant formulation of the factored Markov decision process (MDP) that incorporates a principled way to compactly factorize the state space, while capturing comprehensive transition and reward dynamics information. We also propose an online multinomial logistic regression method with group lasso to automatically learn the relevant structure of the world dynamics model. While the regression models cannot capture the full conditional distributions like DBNs, their simplicity allows fast, online learning in very high dimensional spaces. Online feature selection is implemented by operating the regression algorithm in our variant MDP formulation.

In the rest of the paper, we will first introduce an MDP representation that captures the world dynamics as action effects. We then present an online learning algorithm for identifying relevant features via sparse coding, and show that in theory our framework should lead to computationally efficient learning of a near optimal policy. Due to the space limit, full proofs are placed in the supplementary materials. To back up the theoretical claims, we conduct experiments in both simulated and real robotics domains.

2. Method

In RL, a task is typically modelled as an MDP defined by a tuple (S, A, T, R, γ), where S is a set of states; A is a set of actions; T : S × A × S → [0, 1] is a transition function, such that T(s, a, s') = P(s' | s, a) specifies the probability of transiting to state s' upon taking an action a at state s; R is a reward function indicating the immediate expected reward after the state transition s →(a) s'; and a future reward that occurs t time steps in the future is discounted by γ^t. The agent's goal is to learn a policy π that specifies an action to perform at each state s, so that the expected discounted, cumulative future reward starting from s is maximized. In model-based RL, the optimal policy is estimated from the transition model T and the reward model R. In this paper we concentrate on learning the transition model, with a known reward model. The reward model, however, can be learned in a similar way.
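The planning step that recovers the optimal policy from an estimated model is used throughout the paper (Section 2.3 later notes that value iteration is used, as in Rmax). As a concrete reference, here is a minimal value-iteration sketch over an explicit transition table; it is only an illustration of the planning sub-problem, and the array layout (R collapsed to an expected reward per state-action pair) and the convergence tolerance are assumptions, not part of the paper.

    import numpy as np

    def value_iteration(T, R, gamma, tol=1e-6):
        """Greedy policy for a finite MDP.

        T: array of shape (|A|, |S|, |S|), T[a, s, s2] = P(s2 | s, a).
        R: array of shape (|A|, |S|), expected immediate reward of taking a in s.
        Returns (V, pi) with V[s] the estimated optimal value and pi[s] a greedy action.
        """
        n_actions, n_states, _ = T.shape
        V = np.zeros(n_states)
        while True:
            # Q[a, s] = R[a, s] + gamma * sum_s2 T[a, s, s2] * V[s2]
            Q = R + gamma * (T @ V)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q.argmax(axis=0)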

In a factored MDP, each state is represented by a vector of n state-attributes. The transition function for the factored states is commonly expressed using dynamic Bayesian networks (DBNs) in which T(s, a, s') = ∏_{i=1}^{n} P(s'_i | Pa^a_i(s), a), where Pa^a_i indicates a subset of state-attributes in s called the parents of s'_i (Fig. 1a). Learning T requires learning the subsets Pa^a_i and the parameters for the conditional distributions, or in other words the DBN local structures.

Learning DBN structures of the transition function online, i.e., while the agent is interacting with the environment, is however computationally prohibitive in most domains. On the other hand, recent studies (Xiao, 2009; Yang et al., 2010) have shown encouraging results in learning the structure of logistic regression models, which can effectively serve as local structures in the DBNs even in high dimensional spaces. While these regression models cannot fully capture the conditional distributions, their expressive power can be improved by augmenting the low dimensional state representation with non-linear features of the state vectors. We introduce an online sparse multinomial logistic regression method that supports efficient learning of the structured representation of the transition function.

2.1. CMDP - a factored MDP with feature-variables and action effects

We present a variant of the factored MDP that defines a "compact but comprehensive" factorization of the transition function and supports efficient learning of the relevant features. We consider two major approaches to modeling world dynamics: predicting changes and differentiating features.

First, we predict the relative changes of states instead of directly specifying the next states in a transition. Mediating state changes via action effects is a common strategy in situation calculus (McCarthy, 1963). Since the number of relative changes or action effects is usually much smaller than the size of the state space, the corresponding prediction task should be easier. The learning problem can then be expressed as a multi-class classification task of predicting the action effects.

Second, we differentiate the roles of attributes or features that characterize a state. In a regular factored MDP, the state-attributes or features serve to both define the state space and capture information about the transition model. For example, two state-attributes, the (x, y)-coordinates, uniquely identify a state and compactly factorize the state space in a grid-world. A policy can be learned on this factored space. The transition dynamics or action effects, however, may depend on other features of the state, such as the surface material at the location (state). Such features are often carefully included in the state representations. While essential in formulating the transition or reward models, these features may complicate the planning or learning processes by increasing the size and complexity of the state space.

We separate the state-identifying state-attributes from the "merely" informative state-features in our representation. This way, we can apply an efficient feature selection method on a large number of state features to capture the transition dynamics, while maintaining a compact state space.

More formally, a situation calculus MDP (CMDP) is defined by a tuple (S, f, A, T, E, R, γ), where S, A, T, R, γ have the same meaning as in a regular MDP. S = ⟨S_1, S_2, .., S_n⟩ is the state space implicitly represented by vectors of n state-attributes. The function f : S → R^m extracts m state-features from each state. E is an action effect variable such that the transition function can be factored as

    T(s, a, s') = P(s' \mid s, a) = \prod_{i=1}^{n} P(s'_i \mid s, f(s), a)
                = \prod_{i=1}^{n} \sum_{e \in E} P(s'_i \mid e, s)\, P(e \mid s, f(s), a).

Figure 1b shows an example of this decomposition. The agent uses the feature function f to identify the relevant features, and then uses both state attributes and features to predict the action effects. We also assume that the effect e and the current state s determine the next state s', thus P(s' | e, s) is either 0 or 1. This defines the semantic meaning of the effect, which is assumed to be known by the agent. The remaining task is to learn P(e | s, a) = P(e | x(s), a), where x(s) = (s, f(s)), which is a classification problem; we solve this problem using multinomial logistic regression methods.
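To make the factorization concrete, the following sketch assembles T(s, a, s') from a learned effect classifier P(e | x(s), a) and the deterministic effect semantics P(s' | e, s). It is a minimal illustration rather than the authors' code: the helper names apply_effect, effects and x, and the softmax parameterization of the classifier, are assumptions.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()

    def effect_probs(W_a, x_s):
        # P(e | x(s), a): one row of W_a per effect, x_s = x(s) = (s, f(s)) as a vector.
        return softmax(W_a @ x_s)

    def transition_prob(s, a, s_next, W, x, effects, apply_effect):
        # T(s, a, s') = sum_e P(s' | e, s) * P(e | x(s), a), where P(s' | e, s) is 0 or 1
        # because the effect e applied to s determines the next state deterministically.
        p = effect_probs(W[a], x(s))
        return sum(p_e for e, p_e in zip(effects, p) if apply_effect(s, e) == s_next)

Only the effect classifier has to be learned; the effect semantics apply_effect are assumed known by the agent, as stated above.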

[Figure 1: (a) a standard two-slice DBN over state-attributes S_1, ..., S_4 and next-step attributes S'_1, ..., S'_4; (b) our customized DBN for CMDP, in which the state-attributes and the extracted features f_1(.), f_2(.), f_3(.) feed an effect variable E that, together with the current state, determines the next-step attributes.]

Figure 1. a) Standard DBN. b) Our customized DBN for CMDP.

2.2. Online multinomial logistic regression with group lasso

We introduce a regularized online multinomial regression method with group lasso that allows us to learn a probabilistic multi-class classifier with online feature selection. We also show that the structure and parameters of the learnt classifier are likely to converge to those of the optimal classifier.

2.2.1. Multinomial logistic regression

Multinomial logistic regression is a simple yet effective classification method. Assuming K classes of d-dimensional vectors x ∈ R^d, we represent each class k with a d-dimensional prototype vector W_k. Classification of an input vector x is based on how "similar" it is to the prototype vectors. Similarity is measured with the inner product ⟨W_k, x⟩ = Σ_{i=1}^{d} W_{ki} x_i, where x_i denotes feature i. The log probability of a class is defined by log P(y = k | x; W_k) ∝ ⟨W_k, x⟩. The parameter vectors of the model form the rows of a matrix W = (W_1, ..., W_K)^T.

Let l_t(W) = −log P(y^t | x^t; W) denote the item-wise log-loss of a model with coefficient matrix W predicting a data point (y^t, x^t) observed at time t. A typical objective of an online learning system is to minimize the total loss by updating its W^t over time. However, the resulting model will often be very complicated and over-fitting. To achieve a parsimonious model, we express our a priori belief that most features are irrelevant or superfluous by introducing a regularization term Ψ(W) = λ Σ_{i=1}^{d} √K ||W_{·i}||_2, where ||W_{·i}||_2 denotes the 2-norm of the i-th column of W, and λ is a positive constant. This regularization is similar to that of group lasso. It communicates the idea that it is likely that a whole column of W has zero values (especially for large λ). A column of all zeros suggests that the corresponding feature is not necessary for classification.

The objective function can now be written as

    L(T) = \sum_{t=1}^{T} \big[ l_t(W^t) + \Psi(W^t) \big]
         = \sum_{t=1}^{T} \Big[ -\log \frac{e^{\langle W^t_{y^t}, x^t \rangle}}{\sum_k e^{\langle W^t_k, x^t \rangle}} + \lambda \sum_{i=1}^{d} \sqrt{K}\, \|W^t_{\cdot i}\|_2 \Big],

where W^t is the coefficient matrix learned using the t − 1 previously observed data items. The quality of a sequence of parameter matrices W^t, t ∈ (1, ..., T), with respect to a fixed parameter matrix W can be measured by the amount of extra loss, or regret

    R_T(W) = L(T) - L_W(T) = \sum_{t=1}^{T} \big( l_t(W^t) + \Psi(W^t) \big) - \sum_{t=1}^{T} \big( l_t(W) + \Psi(W) \big).

We want to learn a series of parameters W^t to achieve small regret with respect to a good model W that has a small loss L_W(T).

2.2.2. Online learning for regularized multinomial logistic regression

We introduce an update function, mDAGL-update (Algorithm 1), to extend the efficient dual averaging method (Xiao, 2009) for solving lasso and group lasso (Yang et al., 2010) logistic regression on binary classification to the multi-class case.

Let h(W) be a strongly convex function with modulus 1, W^0 = arg min_W h(W), and let W^1 be initialized to W^0. Let G^t_{ki} be the partial derivative of the function l_t(W) with respect to W_{ki} at W^t, i.e., G^t_{ki} = ∂l_t/∂W_{ki} (W^t). We define \bar{G}^t to be the matrix of average partial derivatives, i.e., \bar{G}^t_{ki} = (1/t) Σ_{τ=1}^{t} G^τ_{ki}, where

    G^{\tau}_{ki} = -x^{\tau}_i \big( I(y^{\tau} = k) - P(k \mid x^{\tau}; W^{\tau}) \big).    (1)

For any data observed at time t, we update the coefficient matrix via

    W^{t+1} = \arg\min_W \Big\{ \langle \bar{G}^t, W \rangle + \Psi(W) + \frac{\beta_t}{t} h(W) \Big\},    (2)

where β_t is a non-negative, non-decreasing sequence, and ⟨·,·⟩ denotes the inner product between two matrices; ⟨\bar{G}^t, W⟩ = Σ_{k,i} \bar{G}^t_{ki} W_{ki}.

Theorem 1 (Update Rule) Given h(W) = (1/2)||W||^2_2, a K × d average gradient matrix \bar{G}^t, and a regularization parameter λ > 0, the optimal solution of (2) is achieved column-wise as follows:

    W^{t+1}_{\cdot i} = \vec{0}    if \|\bar{G}^t_{\cdot i}\|_2 \le \lambda\sqrt{K},
    W^{t+1}_{\cdot i} = \frac{t}{\beta_t} \Big( \frac{\lambda\sqrt{K}}{\|\bar{G}^t_{\cdot i}\|_2} - 1 \Big) \bar{G}^t_{\cdot i}    otherwise.    (3)

This rule dictates that when the length of an average gradient matrix column is small enough, the corresponding parameter column should be truncated to zero. This amounts to feature selection.

Algorithm 1  The mDAGL update
  Input: t, y^t, x^t, W^t, \bar{G}^{t-1}, λ, α
  G^t ← use equation (1) with (y^t, x^t), W^t
  \bar{G}^t ← ((t-1)/t) \bar{G}^{t-1} + (1/t) G^t
  W^{t+1} ← use equation (3) with \bar{G}^t, β_t = α√t, λ
  return (W^{t+1}, \bar{G}^t)
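A minimal NumPy sketch of one mDAGL step (equations (1)-(3), mirroring Algorithm 1) is given below. The array-based bookkeeping and the function names are illustrative assumptions; this is a sketch of the update rule, not the authors' implementation.

    import numpy as np

    def class_probs(W, x):
        # P(k | x; W) for a multinomial logistic model with prototype rows W_k.
        z = W @ x
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()

    def mdagl_update(t, y, x, W, G_bar, lam, alpha):
        # One dual-averaging step with group-lasso truncation (Algorithm 1).
        # W, G_bar: K x d arrays; y: class index in {0, ..., K-1}; x: length-d vector.
        K, d = W.shape
        p = class_probs(W, x)
        indicator = np.zeros(K)
        indicator[y] = 1.0
        G = -np.outer(indicator - p, x)                 # equation (1)
        G_bar = ((t - 1) * G_bar + G) / t               # running average of gradients
        beta_t = alpha * np.sqrt(t)
        thresh = lam * np.sqrt(K)
        norms = np.linalg.norm(G_bar, axis=0)           # column 2-norms ||G_bar_{.i}||_2
        W_new = np.zeros((K, d))
        keep = norms > thresh                           # columns at or below the threshold stay zero
        W_new[:, keep] = (t / beta_t) * (thresh / norms[keep] - 1.0) * G_bar[:, keep]   # equation (3)
        return W_new, G_bar

Columns whose average gradient norm stays at or below λ√K are truncated to zero, which is exactly the online feature selection behaviour described above.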

The following regret analysis confirms that the solution will converge and that the average maximal regret asymptotically approaches zero with rate O(1/√t).

Theorem 2 (Regret Bound) Let the sequence {W^t}_{t≥1} be generated by the update rule (3), and assume that there exists a constant G such that ||\bar{G}^t||^2_2 ≤ G^2 for all t ≥ 1. If we choose β_t = α√t, where α > 0, then for any t ≥ 1 and for any W that satisfies h(W) ≤ D^2, where D is a constant, the average regret is bounded as

    \frac{R_t(W)}{t} \le \frac{\Delta}{\sqrt{t}}, \quad t = 1, 2, 3, \ldots,    (4)

where Δ = αD^2 + G^2/α.

[Proof Sketch] The item-wise loss function l_t(W) of multinomial logistic regression is convex, thus the techniques used for the binary case (Xiao, 2009) can be applied to the multinomial case as well.

Since the average regret goes asymptotically to zero, it may look very feasible that the sequence (W^t) also converges to some optimal W^*. However, the regret analysis is valid for any sequence of data, and without additional assumptions about the data generating process there may not be any asymptotically optimal classifier W^*, thus convergence is not meaningful. To study convergence, we assume the data to be sampled independently from some joint distribution p for the data vector (y, x). In this case we try to find a W that minimizes the expected loss E_p[l(W)] + Ψ(W). Now, assuming that the optimal solution W^* is sparse, and some other technical assumptions, it is indeed possible to show that

    P\big( \|W^t - W^*\|_2 > \epsilon \big) < \epsilon^{-1}\big( c^{-1} + r^{-1} + \Delta^2 \big)\, t^{-1/4},    (5)

where r and c are constants (see Lemma 13 in (Lee & Wright, 2012) for the result and its assumptions).

2.3. Model-based RL with feature selection

Our main task is to turn transition model learning into the learning of the conditional distributions P(E | s, f(s), a) using multinomial logistic regression, for which attention to relevant features can be efficiently implemented online via mDAGL.

The key steps of our method, called loreRL (RL with regularized logistic regression), are presented in Algorithm 2. Inputs to loreRL are the CMDP components (except the transition function), the regularization parameters λ and α of the mDAGL algorithm, and the exploration parameter ε that determines the probability of taking a random action. We first initialize the logistic regression parameters W_a and the average gradient matrices \bar{G}_a for each action a ∈ A. We also randomly select a starting state s^0.

Algorithm 2  The loreRL algorithm
  Input: mDAGL regularization parameters λ, α; CMDP variables S, f, A, E, R, γ; exploration ε
  Let W = (W_1, W_2, ..., W_{|A|}) = (W^0, W^0, ..., W^0)
  Let \bar{G} = (\bar{G}_1, \bar{G}_2, ..., \bar{G}_{|A|}) = (\vec{0}, \vec{0}, ..., \vec{0})
  s^0 ← random initial state
  for t = 1, 2, 3, ... do
    π ← Solve MDP using transition model T(W)
    a ← π(s^t, ε)    # ε-greedy action selection
    Take action a yielding effect e, next state s^{t+1}
    (W_a, \bar{G}_a) ← mDAGL(t, e, x(s^t), W_a, \bar{G}_a, λ, α)
  end for

At each time step, a random action a is chosen with a small probability ε, but otherwise we calculate the optimal policy π for an MDP with the transition model T(W) that is based on the current effect predictors. While we have used value iteration (like in Rmax) for finding the optimal policy, any other model-based RL technique can be used as well. We do not focus on the planning part of RL here, but Dyna-Q or Prioritized Sweeping can be deployed for a more scalable algorithm. After performing an action a in state s^t and observing its effect e, the experience (e, s^t, f(s^t)) is presented to the mDAGL algorithm, which updates the parameter matrix W_a and the gradient matrix \bar{G}_a.
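Read as code, Algorithm 2 is essentially the following loop. This is a schematic sketch under stated assumptions: the environment interface (env.step, env.random_state), the planner solve_mdp, and the use of a per-action step counter for the gradient averages (Algorithm 2 passes the global step index) are illustrative choices, not the authors' implementation. mdagl_update is the function sketched in Section 2.2.2.

    import numpy as np

    def lore_rl(env, actions, effects, x, d, lam, alpha, eps, n_steps, solve_mdp):
        # One multinomial logistic effect predictor (and its average gradient) per action.
        K = len(effects)
        W = {a: np.zeros((K, d)) for a in actions}
        G_bar = {a: np.zeros((K, d)) for a in actions}
        counts = {a: 0 for a in actions}
        s = env.random_state()
        for _ in range(n_steps):
            pi = solve_mdp(W)                                    # e.g. value iteration on T(W)
            if np.random.rand() < eps:
                a = actions[np.random.randint(len(actions))]     # random exploration
            else:
                a = pi(s)                                        # greedy action from the current plan
            e, s_next = env.step(s, a)                           # observed effect and next state
            counts[a] += 1
            W[a], G_bar[a] = mdagl_update(counts[a], effects.index(e), x(s),
                                          W[a], G_bar[a], lam, alpha)
            s = s_next
        return W

In practice the plan does not have to be recomputed from scratch at every step; as noted above, Dyna-Q or Prioritized Sweeping could replace the full solve for scalability.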

As we just do ε-greedy random sampling, it is impossible to guarantee PAC convergence to an optimal policy. Assuming that the observed data is i.i.d., we can prove that the difference in optimal value functions of two CMDPs with different logistic regression based transition functions is bounded by the difference in their parameters. This leads to a corollary for convergence to a near optimal policy.

Theorem 3 (Difference in Value Function) Let M_1 = (S, f, A, T(W^{M_1}), E, R, γ) and M_2 = (S, f, A, T(W^{M_2}), E, R, γ) be two CMDPs with optimal policies π_1 and π_2 respectively. Let us denote by V^M_π the value function for policy π in CMDP M. Let

    \epsilon_1 = 2 \max_{a \in A,\, e \in E} \|W_e^{(a),M_1} - W_e^{(a),M_2}\|_1 \, \sup_s \|x(s)\|_1,

then

    \max_{s \in S} \big( V^{M_2}_{\pi_2}(s) - V^{M_2}_{\pi_1}(s) \big) \le \frac{2\gamma V_{max}\, \epsilon_1}{1 - \gamma},

where W_e^{(a),M_1} and W_e^{(a),M_2} refer to the vectors of coefficients corresponding to class E = e under action a in models M_1 and M_2 respectively, ||·||_1 is the 1-norm of a vector, and V_max is the maximum value of any state for any policy in either of the CMDPs.
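The bound of Theorem 3 is directly computable from two sets of per-action coefficient matrices. The sketch below evaluates it; the dictionary layout (one K × d matrix per action, one row per effect) and the argument names are illustrative assumptions rather than anything prescribed by the paper.

    import numpy as np

    def value_gap_bound(W1, W2, x_max_l1, gamma, v_max):
        # eps_1 = 2 * max over (a, e) of ||W_e^(a),M1 - W_e^(a),M2||_1 * sup_s ||x(s)||_1
        eps_1 = 2 * max(
            np.abs(W1[a] - W2[a]).sum(axis=1).max()   # row-wise 1-norms, maximized over effects e
            for a in W1
        ) * x_max_l1
        # Theorem 3: loss of running M1's optimal policy in M2 is at most this much.
        return 2 * gamma * v_max * eps_1 / (1 - gamma)

As the mDAGL estimates approach a sparse W^*, eps_1 shrinks and the guaranteed value gap vanishes with it, which is the argument developed next.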

By taking M_2 to be a CMDP based on the optimal W^* and M_1 an estimated CMDP based on mDAGL, the vanishing bound given in equation (5) can be translated into a vanishing bound for the value difference of policies. In case the true transition model is representable by a sparse W^*, we would most probably converge to a near optimal policy.

When we cannot express the true transition dynamics as logistic regression based on the available state features, it is hard to give guarantees of performance. However, we can still have some confidence in doing well. The logistic regression model P^*_l closest (in Kullback-Leibler distance) to the true model P_true (possibly not a logistic regression model)^1 is the one that has the smallest expected log-loss. While our optimality criterion is the expected regularized log-loss, we expect the regularized log-loss optimal model P^*_Ψ to be close to P^*_l, thus almost as close to P_true as we can get. This relatively small KL-distance can be converted to relatively small distances in actual transition probabilities, which can then further be converted to a relatively small bound on value differences by the same arguments used in proving Theorem 3. Therefore, since our model would very likely converge close to P^*_Ψ, we can expect to do almost as well as P^*_Ψ.

^1 Such a model may not always exist since the parameter set is open. However, for our argument, any model with almost infimum distance to the true model will do.

3. Related work

DBNs have been a popular choice for factoring and approximating transition models. In DBN learning, feature selection is equivalent to picking the parents of the state variables from the previous time slice. Recent studies have led to improvements in sample complexity for learning the optimal policy. Those studies assume a maximum number of possible parents for a node (Strehl et al., 2007; Diuk et al., 2009), or knowledge of a planning horizon that satisfies certain conditions (Chakraborty & Stone, 2011). However, the improvements in sample complexity are achieved at the expense of actual computational complexity, since these methods have to search through a large number of parent sets. Hence, these methods appear feasible only in manually designed, low-dimensional state-spaces.

Instead of searching for an optimal model with a minimal number of samples at almost any cost, our method attempts to save costs from early on, and gradually improves the model, acknowledging that the true model may actually be unattainable. In this spirit the structure learning study by Degris et al. (2006) resembles our work, but they do not address online learning with large feature sets. Ross and Pineau (2008) have used Markov chain Monte Carlo to sample from all possible DBN structures. However, the Markov chain used has a very long burn-in period and slow mixing time, making sampling computationally prohibitive in large problems.

Kroon and Whiteson (2009) present a feature selection method for learning state values. This method, however, assumes that the DBN structures of the transition and reward models are known. Our work, on the other hand, does not make such assumptions.

Leffler et al. (2007) also suggest predicting relative changes in states, which corresponds to the action effects in our formulation. However, they manually select important features to aggregate information from similar states for action effect prediction. Our work focuses on learning those features automatically. Hester and Stone (2009; 2012) later employ Quinlan's C4.5 (Quinlan, 1993) to learn a decision tree for predicting the relative changes of every state variable. This works better than the method by Degris et al. Despite adapting C4.5 for online learning, the method is still very slow, as a costly tree induction procedure has to be repeated many times in a large feature space. In addition, all the data needs to be stored for this purpose, which is undesirable in some applications.

Strehl and Littman (2007) and Walsh et al. (2009) have also proposed an online linear regression method with L2-regularization to approximate the transition model in continuous MDPs. L2, however, does not implement feature selection.

4. Experiments

We present an empirical evaluation of loreRL on both simulated and real robotic domains. The experiments aim to demonstrate that loreRL can a) generalize and approximate the transition model to achieve fast convergence to a near optimal policy, and b) with feature selection, perform well in complex, feature rich environments. We also want to see if the theoretical promises derived under the assumption of i.i.d. sampling can be realized in practice. We compare the accumulated rewards of loreRL with factored Rmax (fRmax), in which the network structures of the transition models are known (Strehl et al., 2007), and with factored ε-greedy (fEpsG), in which the optimistic Rmax exploration of fRmax is replaced by an ε-greedy strategy. We also compare our method with RL-DT (Hester & Stone, 2009) and LSE-Rmax (Chakraborty & Stone, 2011), which are the state of the art model-based RL algorithms for learning transition models.

4.1. Grid-world domain

In this domain, the agent tries to reach its goal in the grid-world while consuming as little energy as possible. Each cell in the world has one of five surface materials: sand, soil, water, brick, and fire; there may be walls between cells. Surface and walls are features that determine the stochastic dynamics of the world. In addition, to test the variable selection aspect, we attach hundreds of random binary features to the environment; the agent has to learn to focus on the relevant features to quickly achieve its goal. The agent can perform four actions (move up, down, left, right), which will lead it to one of the four states around it or leave it in its current state. Effects of the actions are captured in five outcomes (moved up, left, down, right, did not move). The states are defined by the (x, y)-coordinates of the agent. To perform an action, the agent spends 0.01 units of energy. It loses 1 unit if falling into a state of fire, but gains 1 unit when successfully reaching an exit door. A task ends when the agent reaches a terminal state, i.e., any exit door or state with fire.

We generated the environment transition models from four random multinomial logistic distributions (one for each action); every different combination of cell surfaces and walls around the cell will lead to different transition dynamics at the cell. The probability of going through a wall is rounded to zero and the freed probability mass is evenly distributed to the other effects. The agent's starting position is randomly picked in each episode.

We ran the experiments with loreRL having α = 1.5, λ = 0.05, γ = 0.95, exploration ε = 0.05, parameter m = 10 for fRmax, m = 5 for Rmax (m = 5 is small for Rmax, but increasing it did not yield better results), and fixed m = 10, σ = 0.99 for LSE-Rmax (similar to the authors' report). All results are averaged over 20 runs, and we report the 95% confidence intervals.

Generalization and convergence. We first show that when the feature space is small, loreRL performs as efficiently as the state of the art methods. RL-DT employs a decision tree to generalize transition dynamics knowledge over states, but it is implemented with an ε-greedy exploration strategy. LSE-Rmax appears to be the best structure learning method in ergodic factored MDPs (Chakraborty & Stone, 2011). fRmax and fEpsG have correct DBN structures provided by an oracle. All the methods are implemented with our customized DBN to utilize domain knowledge. Rmax is included as a reference to show the effect of knowledge generalization.

As seen in Figure 2a, loreRL can approximate the world dynamics using samples from all the states, thus it converges as fast as fEpsG and RL-DT to a near optimal policy. Although fRmax is provided with the correct DBN structure, its accumulated reward is lower due to aggressive exploration to find the optimal model. After exploration the policy is guaranteed to be near optimal, but it may still take a long time (or forever) to catch up with loreRL. While LSE-Rmax follows the Rmax scheme, it starts with a simple model and explores a bit less aggressively than fRmax, gaining some advantage in early episodes. However, LSE-Rmax appears to require much more data to choose a more complex model. Its accumulated reward drops below fRmax's after 150 episodes, and the angle of the curve suggests that its DBN structure is still not correct. We did not run LSE-Rmax for more episodes, as the algorithm has a very high computational demand (Table 1).

When the feature set has many irrelevant features (Figure 2b), loreRL is able to learn the relevant ones and still gain nearly as high an accumulated reward as fEpsG, which has the relevant features provided by an oracle. Also, loreRL's running time is not much longer than fRmax's or fEpsG's (Table 1). Other methods are too slow to be run in this high-dimensional environment. These results also suggest that with ε-greedy exploration and random restarts, a near optimal policy can be found even without i.i.d. data sampling.

Table 1. Average running time per episode in 800 episodes when acting in an environment with 210 features. (The slow RL-DT and LSE-Rmax could only be run with 10 features.) Run on an Intel Xeon CPU 2.13GHz, 32GB RAM.

  Algorithm    fRmax   fEpsG   RL-DT   LSE-R.   bloreRL   loreRL
  Time (sec.)  0.26    0.25    9.09    67.53    4.3       0.55

Feature selection. To model real-life situations, the feature space is usually exponentially large. The ability to focus only on the (most) relevant features is required to achieve effective learning. To understand the role of feature selection, we focus on comparing loreRL with bloreRL, which is based on multinomial logistic regression without feature selection (without the regularization term). fEpsG and fRmax are baselines.

[Figure 2: plots of accumulated reward (y-axis) against the number of episodes (panels a and b) and against the number of irrelevant features (panel c), comparing fRmax, fEpsG, RL-DT, LSE-Rmax, Rmax, bloreRL, and loreRL.]

Figure 2. Accumulated rewards in a 900 state CMDP for various model-based RL methods. (a) CMDP with 10 features. (b) Extra 200 irrelevant features. (c) Rewards after 800 episodes.

Figure 2b shows the accumulated reward when the environment has 200 irrelevant binary features. As seen, loreRL is still able to converge fast to the optimal policy, and outperforms fRmax and bloreRL. Figure 2c shows the performance of loreRL and bloreRL after 800 episodes as a function of the number of irrelevant features. Only minimally affected by the actual number of irrelevant features, loreRL can quickly select the relevant features and outperform bloreRL. loreRL does not lose much to fEpsG either. While fRmax may find an optimal policy before loreRL due to aggressive exploration, its accumulated reward is still lower than loreRL's. In our experiments, we also observed that loreRL, which selects a small set of features, is much faster than bloreRL (Table 1).

4.2. Robotics domain

Our second experiment is conducted in a real environment where we cannot expect the effect of the actions to follow a logistic regression model. The domain (Figure 3) consists of a 5 × 5-feet environment made of various materials such as beans, soil, hay, leaves, and carpet that cause the agent's actions, move-forward, turn-left, and turn-right, to have different effects at different locations. The environment also contains a blue target ball, and there are marks painted in green, blue, and red colors on the cardboard surface. The three wheel robot was built with a LEGO Mindstorms NXT kit, and a camera was installed above the area so that the robot can fully observe the environment.

Figure 3. A real environment.

To learn the transition function for the robot, we discretized the environment into a state space of 8 × 8 (x, y)-locations and 8 different orientations of the robot, which yields a state space of 512 states. The actions may change the robot's relative location in four different ways and its orientation in five different ways, resulting in a total of 20 different effects. However, in different states the actions' tendency to produce effects may be different due to the differences in surface materials. To capture this variation, the agent describes each state with a long vector of binary features. The "green" binary indicator f^G_i(s) of a state s is set to 1 iff there is a green mark that is further than i units but closer than i + 1 units from the xy-center of the state s (i ∈ {0, ..., 99}). Similar features are defined for blue marks and for the blue target ball, yielding 300 binary features. Eight indicators for different robot orientations are also included in the feature-base, together with four intentionally redundant "there is/is-not a green/blue mark in the state grid" bits. All together these yield 312 binary features per state. The intuition behind these features is that they serve as proxies for surface materials, slopes on the surfaces, obstacles, etc., which are likely to be important factors determining the dynamics in the environments, but which the robot's sensors cannot capture. Although only few among these 312 features are important for modeling the robot's actions, the robot does not know those critical features. It has to learn to select them based on feedback while interacting with the environment.

The robot's task is to travel in the environment from a random start and reach the blue ball, which will earn it a reward of 2 points. The robot will receive −1 point if it falls out of the area or into the death places marked with rectangles, and −0.05 points for an action at any other states. An episode ends if the robot reaches a terminal state, or gets stuck for four consecutive actions.

The robot battery did not allow us to compare our algorithm with the slow RL-DT and LSE-Rmax; we could only compare with the fine-tuned fRmax, fEpsG, and man-loreRL algorithms, in which we manually selected important features and specified the DBN structures for the transition models. man-loreRL is based on multinomial logistic regression models with 12 manually selected features. We ran the experiments with loreRL and man-loreRL having α = 0.5, λ = 0.05, γ = 0.95, exploration ε = 0.05, and parameter m = 10 for fRmax. All results are averaged over 10 runs, and we report the 95% confidence intervals.
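The 312-dimensional binary description above can be computed mechanically from the mark positions detected by the overhead camera. The sketch below illustrates the distance-ring indicators; the geometry helpers (state centers, distances in "units") and the exact ordering of the feature vector are assumptions for illustration only, not the authors' feature extractor.

    import numpy as np

    N_RINGS = 100   # i in {0, ..., 99}: further than i units but closer than i + 1 units

    def ring_indicators(center, mark_positions):
        # One binary indicator per distance ring around the state's xy-center.
        bits = np.zeros(N_RINGS, dtype=int)
        for mx, my in mark_positions:
            d = np.hypot(mx - center[0], my - center[1])
            if d < N_RINGS:
                bits[int(d)] = 1
        return bits

    def state_features(center, orientation, green_marks, blue_marks, ball_positions,
                       green_in_cell, blue_in_cell):
        # 3 x 100 ring indicators + 8 orientation bits
        # + 4 redundant "is / is not a green/blue mark in the state grid" bits = 312 features.
        orient = np.zeros(8, dtype=int)
        orient[orientation] = 1
        rings = np.concatenate([ring_indicators(center, m)
                                for m in (green_marks, blue_marks, ball_positions)])
        presence = np.array([green_in_cell, not green_in_cell,
                             blue_in_cell, not blue_in_cell], dtype=int)
        return np.concatenate([rings, orient, presence])

Such a vector plays the role of x(s) = (s, f(s)) in the robot experiments; mDAGL is then expected to truncate most of its 312 columns to zero.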

[Figure 4: accumulated reward (y-axis) against the number of episodes (x-axis, 0 to 50) for fRmax, fEpsG, loreRL, and man-loreRL.]

Figure 4. Accumulated rewards of various methods.

In Figure 4, loreRL appears to quickly capture the environment dynamics and outperform the other methods. Even with manually selected features, fRmax and fEpsG require more exploration to learn the dynamics. man-loreRL gains rewards a bit faster, but in the end it slightly loses to loreRL, possibly due to the (unforeseen) insufficiency of the manually selected features. Table 2 further shows that loreRL is fast. Its average running time per episode with 312 features is only slightly slower than with 12 manually selected features.

Table 2. Average running time per episode in 50 episodes. Run on an Intel Centrino Duo T2400 (1.83GHz), 1GB RAM.

  Algorithm    fRmax    fEpsG    man-loreRL   loreRL
  Time (sec.)  134.37   127.69   93.54        108.10

5. Conclusions

We have demonstrated how online multinomial logistic regression with group lasso can be used to quickly obtain a parsimonious transition model in model-based RL. The method leads to fast learning since a single transition model can be learnt using samples from all the states with a small set of features.

The efficiency is gained, however, at the expense of losing generality. Not all transition functions can be accurately represented as predicting action effects using state features via logistic regression. Nevertheless, we believe that this compromise between scalability and generality is often a useful one. The generality problem may also be alleviated by introducing non-linear features that are combinations of the original ones. Other generalizations such as stochastic features and vector valued effects are also possible, but are left for future work. The proposed framework also opens an opportunity for knowledge transfer. While different environments often have different states, the effects of actions are likely to persist from environment to environment. mDAGL would allow an agent to learn transferrable sparse action models. For instance, this method could be incorporated into the TES framework (Nguyen et al., 2012) to implement an intelligent agent that can learn, accumulate and transfer knowledge automatically between environments.

Acknowledgments

We thank Dr. Haiqin Yang for helpful comments on our sparse multinomial logistic regression. This research is supported by Academic Research Grants MOE2010-T2-2-071 and T1 251RES1005 from the Ministry of Education in Singapore.

References

Brafman, Ronen I. and Tennenholtz, Moshe. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2003.

Chakraborty, Doran and Stone, Peter. Structure learning in ergodic factored MDPs without knowledge of the transition function's in-degree. In Proceedings of the International Conference on Machine Learning, ICML'11, pp. 737-744, 2011.

Degris, Thomas, Sigaud, Olivier, and Wuillemin, Pierre-Henri. Learning the structure of factored Markov decision processes in reinforcement learning problems. In Proceedings of the International Conference on Machine Learning, ICML'06, pp. 257-264, 2006.

Diuk, Carlos, Li, Lihong, and Leffler, Bethany R. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the International Conference on Machine Learning, ICML'09, pp. 249-256, 2009.

Hester, Todd and Stone, Peter. Generalized model learning for reinforcement learning in factored domains. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, AAMAS'09, volume 3, pp. 717-724, 2009.

Hester, Todd and Stone, Peter. TEXPLORE: real-time sample-efficient reinforcement learning for robots. Machine Learning, pp. 1-45, 2012.

Kroon, Mark and Whiteson, Shimon. Automatic feature selection for model-based reinforcement learning in factored MDPs. In Proceedings of the International Conference on Machine Learning and Applications, ICMLA'09, 2009.

Lee, S. and Wright, S. Manifold identification of dual averaging methods for regularized stochastic online learning. Journal of Machine Learning Research, 13:1705-1744, 2012.

Leffler, Bethany R., Littman, Michael L., and Edmunds, Timothy. Efficient reinforcement learning with relocatable action models. In Proceedings of the National Conference on Artificial Intelligence, AAAI'07, pp. 572-577, 2007.

McCarthy, John. Situations, actions, and causal laws. Technical Report Memo 2, Stanford Artificial Intelligence Project, Stanford University, 1963.

Nguyen, Trung T., Silander, Tomi, and Leong, Tze Yun. Transferring expectations in model-based reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, NIPS'12, 2012.

Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. ISBN 1-55860-238-0.

Ross, Stéphane and Pineau, Joelle. Model-based Bayesian reinforcement learning in large structured domains. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI'08, pp. 476-483, 2008.

Strehl, Alexander L. and Littman, Michael L. Online linear regression and its application to model-based reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, NIPS'07, pp. 737-744, 2007.

Strehl, Alexander L., Diuk, Carlos, and Littman, Michael L. Efficient structure learning in factored-state MDPs. In Proceedings of the National Conference on Artificial Intelligence, AAAI'07, 2007.

Walsh, Thomas J., Szita, István, Diuk, Carlos, and Littman, Michael L. Exploring compact reinforcement-learning representations with linear regression. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI'09, pp. 591-598, 2009.

Xiao, Lin. Dual averaging methods for regularized stochastic learning and online optimization. In Proceedings of the Advances in Neural Information Processing Systems, NIPS'09, 2009.

Yang, Haiqin, Xu, Zenglin, King, Irwin, and Lyu, Michael R. Online learning for group lasso. In Proceedings of the International Conference on Machine Learning, ICML'10, pp. 1191-1198, 2010.