Online Feature Selection for Model-based Reinforcement Learning

Trung Thanh Nguyen [email protected]
Zhuoru Li [email protected]
Tomi Silander [email protected]
Tze-Yun Leong [email protected]

School of Computing, National University of Singapore, Singapore, 117417

Abstract

We propose a new framework for learning the world dynamics of feature-rich environments in model-based reinforcement learning. The main idea is formalized as a new, factored state-transition representation that supports efficient online-learning of the relevant features. We construct the transition models through predicting how the actions change the world. We introduce an online sparse coding learning technique for feature selection in high-dimensional spaces. We derive theoretical guarantees for our framework and empirically demonstrate its practicality in both simulated and real robotics domains.

1. Introduction

In model-based reinforcement learning (RL), factored state representations, often in the form of dynamic Bayesian networks (DBNs), are deployed to exploit structures of the world dynamics. This allows the agent to plan and act in large state spaces without actually visiting every state. However, learning the world dynamics of a complex environment is very difficult and often computationally infeasible. Most recent work in this area is based on the RMAX framework (Brafman & Tennenholtz, 2003), and focuses on sample-efficient learning of the optimal policies. This approach incurs heavy computational costs for maximizing information gain from every interaction, even in carefully designed, low-dimensional spaces.

We propose a variant formulation of the factored Markov decision process (MDP) that incorporates a principled way to compactly factorize the state space, while capturing comprehensive transition and reward dynamics information. We also propose an online multinomial logistic regression method with group lasso to automatically learn the relevant structure of the world dynamics model. While the regression models cannot capture the full conditional distributions like DBNs, their simplicity allows fast, online learning in very high dimensional spaces. Online feature selection is implemented by operating the regression algorithm in our variant MDP formulation.

In the rest of the paper, we will first introduce an MDP representation that captures the world dynamics as action effects. We then present an online learning algorithm for identifying relevant features via sparse coding, and show that in theory our framework should lead to computationally efficient learning of a near optimal policy. Due to the space limit, full proofs are placed in the supplementary materials. To back up the theoretical claims, we conduct experiments in both simulated and real robotics domains.

2. Method

In RL, a task is typically modelled as an MDP defined by a tuple (S, A, T, R, γ), where S is a set of states; A is a set of actions; T : S × A × S → [0, 1] is a transition function, such that T(s, a, s') = P(s' | s, a) specifies the probability of transiting to state s' upon taking an action a at state s; R is a reward function indicating the immediate expected reward after the state transition s →(a) s'; and a future reward that occurs t time steps in the future is discounted by γ^t. The agent's goal is to learn a policy π that specifies an action to perform at each state s, so that the expected discounted, cumulative future reward starting from s is maximized. In model-based RL, the optimal policy is estimated from the transition model T and the reward model R. In this paper we concentrate on learning the transition model, with a known reward model. The reward model, however, can be learned in a similar way.
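The planning step that recovers the optimal policy from an estimated model is used throughout the paper (Section 2.3 later notes that value iteration is used, as in Rmax). As a concrete reference, here is a minimal value-iteration sketch over an explicit transition table; it is only an illustration of the planning sub-problem, and the array layout (R collapsed to an expected reward per state-action pair) and the convergence tolerance are assumptions, not part of the paper.

    import numpy as np

    def value_iteration(T, R, gamma, tol=1e-6):
        """Greedy policy for a finite MDP.

        T: array of shape (|A|, |S|, |S|), T[a, s, s2] = P(s2 | s, a).
        R: array of shape (|A|, |S|), expected immediate reward of taking a in s.
        Returns (V, pi) with V[s] the estimated optimal value and pi[s] a greedy action.
        """
        n_actions, n_states, _ = T.shape
        V = np.zeros(n_states)
        while True:
            # Q[a, s] = R[a, s] + gamma * sum_s2 T[a, s, s2] * V[s2]
            Q = R + gamma * (T @ V)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q.argmax(axis=0)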

In a factored MDP, each state is represented by a vector of n state-attributes. The transition function for the factored states is commonly expressed using dynamic Bayesian networks (DBNs) in which T(s, a, s') = ∏_{i=1}^{n} P(s'_i | Pa^a_i(s), a), where Pa^a_i indicates a subset of state-attributes in s called the parents of s'_i (Fig. 1a). Learning T requires learning the subsets Pa^a_i and the parameters for the conditional distributions, or in other words the DBN local structures.

Learning DBN structures of the transition function online, i.e., while the agent is interacting with the environment, is however computationally prohibitive in most domains. On the other hand, recent studies (Xiao, 2009; Yang et al., 2010) have shown encouraging results in learning the structure of logistic regression models, which can effectively serve as local structures in the DBNs even in high dimensional spaces. While these regression models cannot fully capture the conditional distributions, their expressive power can be improved by augmenting the low dimensional state representation with non-linear features of the state vectors. We introduce an online sparse multinomial logistic regression method that supports efficient learning of the structured representation of the transition function.

2.1. CMDP - a factored MDP with feature-variables and action effects

We present a variant of the factored MDP that defines a "compact but comprehensive" factorization of the transition function and supports efficient learning of the relevant features. We consider two major approaches to modeling world dynamics: predicting changes and differentiating features.

First, we predict the relative changes of states instead of directly specifying the next states in a transition. Mediating state changes via action effects is a common strategy in situation calculus (McCarthy, 1963). Since the number of relative changes or action effects is usually much smaller than the size of the state space, the corresponding prediction task should be easier. The learning problem can then be expressed as a multi-class classification task of predicting the action effects.

Second, we differentiate the roles of attributes or features that characterize a state. In a regular factored MDP, the state-attributes or features serve to both define the state space and capture information about the transition model. For example, two state-attributes, the (x, y)-coordinates, uniquely identify a state and compactly factorize the state space in a grid-world. A policy can be learned on this factored space. The transition dynamics or action effects, however, may depend on other features of the state, such as the surface material at the location (state). Such features are often carefully included in the state representations. While essential in formulating the transition or reward models, these features may complicate the planning or learning processes by increasing the size and complexity of the state space.

We separate the state-identifying state-attributes from the "merely" informative state-features in our representation. This way, we can apply an efficient feature selection method on a large number of state features to capture the transition dynamics, while maintaining a compact state space.

More formally, a situation calculus MDP (CMDP) is defined by a tuple (S, f, A, T, E, R, γ), where S, A, T, R, γ have the same meaning as in a regular MDP. S = ⟨S_1, S_2, .., S_n⟩ is the state space implicitly represented by vectors of n state-attributes. The function f : S → R^m extracts m state-features from each state. E is an action effect variable such that the transition function can be factored as

    T(s, a, s') = P(s' \mid s, a) = \prod_{i=1}^{n} P(s'_i \mid s, f(s), a)
                = \prod_{i=1}^{n} \sum_{e \in E} P(s'_i \mid e, s)\, P(e \mid s, f(s), a).

Figure 1b shows an example of this decomposition. The agent uses the feature function f to identify the relevant features, and then uses both state attributes and features to predict the action effects. We also assume that the effect e and the current state s determine the next state s', thus P(s' | e, s) is either 0 or 1. This defines the semantic meaning of the effect, which is assumed to be known by the agent. The remaining task is to learn P(e | s, a) = P(e | x(s), a), where x(s) = (s, f(s)), which is a classification problem; we solve this problem using multinomial logistic regression methods.
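To make the factorization concrete, the following sketch assembles T(s, a, s') from a learned effect classifier P(e | x(s), a) and the deterministic effect semantics P(s' | e, s). It is a minimal illustration rather than the authors' code: the helper names apply_effect, effects and x, and the softmax parameterization of the classifier, are assumptions.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()

    def effect_probs(W_a, x_s):
        # P(e | x(s), a): one row of W_a per effect, x_s = x(s) = (s, f(s)) as a vector.
        return softmax(W_a @ x_s)

    def transition_prob(s, a, s_next, W, x, effects, apply_effect):
        # T(s, a, s') = sum_e P(s' | e, s) * P(e | x(s), a), where P(s' | e, s) is 0 or 1
        # because the effect e applied to s determines the next state deterministically.
        p = effect_probs(W[a], x(s))
        return sum(p_e for e, p_e in zip(effects, p) if apply_effect(s, e) == s_next)

Only the effect classifier has to be learned; the effect semantics apply_effect are assumed known by the agent, as stated above.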

[Figure 1: (a) a standard two-slice DBN over state-attributes S_1, ..., S_4 and next-step attributes S'_1, ..., S'_4; (b) our customized DBN for CMDP, in which the state-attributes and the extracted features f_1(.), f_2(.), f_3(.) feed an effect variable E that, together with the current state, determines the next-step attributes.]

Figure 1. a) Standard DBN. b) Our customized DBN for CMDP.

2.2. Online multinomial logistic regression with group lasso

We introduce a regularized online multinomial regression method with group lasso that allows us to learn a probabilistic multi-class classifier with online feature selection. We also show that the structure and parameters of the learnt classifier are likely to converge to those of the optimal classifier.

2.2.1. Multinomial logistic regression

Multinomial logistic regression is a simple yet effective classification method. Assuming K classes of d-dimensional vectors x ∈ R^d, we represent each class k with a d-dimensional prototype vector W_k. Classification of an input vector x is based on how "similar" it is to the prototype vectors. Similarity is measured with the inner product ⟨W_k, x⟩ = Σ_{i=1}^{d} W_{ki} x_i, where x_i denotes feature i. The log probability of a class is defined by log P(y = k | x; W_k) ∝ ⟨W_k, x⟩. The parameter vectors of the model form the rows of a matrix W = (W_1, ..., W_K)^T.

Let l_t(W) = −log P(y^t | x^t; W) denote the item-wise log-loss of a model with coefficient matrix W predicting a data point (y^t, x^t) observed at time t. A typical objective of an online learning system is to minimize the total loss by updating its W^t over time. However, the resulting model will often be very complicated and over-fitting. To achieve a parsimonious model, we express our a priori belief that most features are irrelevant or superfluous by introducing a regularization term Ψ(W) = λ Σ_{i=1}^{d} √K ||W_{·i}||_2, where ||W_{·i}||_2 denotes the 2-norm of the i-th column of W, and λ is a positive constant. This regularization is similar to that of group lasso. It communicates the idea that it is likely that a whole column of W has zero values (especially for large λ). A column of all zeros suggests that the corresponding feature is not necessary for classification.

The objective function can now be written as

    L(T) = \sum_{t=1}^{T} \big[ l_t(W^t) + \Psi(W^t) \big]
         = \sum_{t=1}^{T} \Big[ -\log \frac{e^{\langle W^t_{y^t}, x^t \rangle}}{\sum_k e^{\langle W^t_k, x^t \rangle}} + \lambda \sum_{i=1}^{d} \sqrt{K}\, \|W^t_{\cdot i}\|_2 \Big],

where W^t is the coefficient matrix learned using the t − 1 previously observed data items. The quality of a sequence of parameter matrices W^t, t ∈ (1, ..., T), with respect to a fixed parameter matrix W can be measured by the amount of extra loss, or regret

    R_T(W) = L(T) - L_W(T) = \sum_{t=1}^{T} \big( l_t(W^t) + \Psi(W^t) \big) - \sum_{t=1}^{T} \big( l_t(W) + \Psi(W) \big).

We want to learn a series of parameters W^t to achieve small regret with respect to a good model W that has a small loss L_W(T).

2.2.2. Online learning for regularized multinomial logistic regression

We introduce an update function, mDAGL-update (Algorithm 1), to extend the efficient dual averaging method (Xiao, 2009) for solving lasso and group lasso (Yang et al., 2010) logistic regression on binary classification to the multi-class case.

Let h(W) be a strongly convex function with modulus 1, W^0 = arg min_W h(W), and let W^1 be initialized to W^0. Let G^t_{ki} be the partial derivative of the function l_t(W) with respect to W_{ki} at W^t, i.e., G^t_{ki} = ∂l_t/∂W_{ki} (W^t). We define \bar{G}^t to be the matrix of average partial derivatives, i.e., \bar{G}^t_{ki} = (1/t) Σ_{τ=1}^{t} G^τ_{ki}, where

    G^{\tau}_{ki} = -x^{\tau}_i \big( I(y^{\tau} = k) - P(k \mid x^{\tau}; W^{\tau}) \big).    (1)

For any data observed at time t, we update the coefficient matrix via

    W^{t+1} = \arg\min_W \Big\{ \langle \bar{G}^t, W \rangle + \Psi(W) + \frac{\beta_t}{t} h(W) \Big\},    (2)

where β_t is a non-negative, non-decreasing sequence, and ⟨·,·⟩ denotes the inner product between two matrices; ⟨\bar{G}^t, W⟩ = Σ_{k,i} \bar{G}^t_{ki} W_{ki}.

Theorem 1 (Update Rule) Given h(W) = (1/2)||W||^2_2, a K × d average gradient matrix \bar{G}^t, and a regularization parameter λ > 0, the optimal solution of (2) is achieved column-wise as follows:

    W^{t+1}_{\cdot i} = \vec{0}    if \|\bar{G}^t_{\cdot i}\|_2 \le \lambda\sqrt{K},
    W^{t+1}_{\cdot i} = \frac{t}{\beta_t} \Big( \frac{\lambda\sqrt{K}}{\|\bar{G}^t_{\cdot i}\|_2} - 1 \Big) \bar{G}^t_{\cdot i}    otherwise.    (3)

This rule dictates that when the length of an average gradient matrix column is small enough, the corresponding parameter column should be truncated to zero. This amounts to feature selection.

Algorithm 1  The mDAGL update
  Input: t, y^t, x^t, W^t, \bar{G}^{t-1}, λ, α
  G^t ← use equation (1) with (y^t, x^t), W^t
  \bar{G}^t ← ((t-1)/t) \bar{G}^{t-1} + (1/t) G^t
  W^{t+1} ← use equation (3) with \bar{G}^t, β_t = α√t, λ
  return (W^{t+1}, \bar{G}^t)
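A minimal NumPy sketch of one mDAGL step (equations (1)-(3), mirroring Algorithm 1) is given below. The array-based bookkeeping and the function names are illustrative assumptions; this is a sketch of the update rule, not the authors' implementation.

    import numpy as np

    def class_probs(W, x):
        # P(k | x; W) for a multinomial logistic model with prototype rows W_k.
        z = W @ x
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()

    def mdagl_update(t, y, x, W, G_bar, lam, alpha):
        # One dual-averaging step with group-lasso truncation (Algorithm 1).
        # W, G_bar: K x d arrays; y: class index in {0, ..., K-1}; x: length-d vector.
        K, d = W.shape
        p = class_probs(W, x)
        indicator = np.zeros(K)
        indicator[y] = 1.0
        G = -np.outer(indicator - p, x)                 # equation (1)
        G_bar = ((t - 1) * G_bar + G) / t               # running average of gradients
        beta_t = alpha * np.sqrt(t)
        thresh = lam * np.sqrt(K)
        norms = np.linalg.norm(G_bar, axis=0)           # column 2-norms ||G_bar_{.i}||_2
        W_new = np.zeros((K, d))
        keep = norms > thresh                           # columns at or below the threshold stay zero
        W_new[:, keep] = (t / beta_t) * (thresh / norms[keep] - 1.0) * G_bar[:, keep]   # equation (3)
        return W_new, G_bar

Columns whose average gradient norm stays at or below λ√K are truncated to zero, which is exactly the online feature selection behaviour described above.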

The following regret analysis confirms that the solution will converge and that the average maximal regret asymptotically approaches zero with rate O(1/√t).

Theorem 2 (Regret Bound) Let the sequence {W^t}_{t≥1} be generated by the update rule (3), and assume that there exists a constant G such that ||\bar{G}^t||^2_2 ≤ G^2 for all t ≥ 1. If we choose β_t = α√t, where α > 0, then for any t ≥ 1 and for any W that satisfies h(W) ≤ D^2, where D is a constant, the average regret is bounded as

    \frac{R_t(W)}{t} \le \frac{\Delta}{\sqrt{t}}, \quad t = 1, 2, 3, \ldots,    (4)

where Δ = αD^2 + G^2/α.

[Proof Sketch] The item-wise loss function l_t(W) of multinomial logistic regression is convex, thus the techniques used for the binary case (Xiao, 2009) can be applied to the multinomial case as well.

Since the average regret goes asymptotically to zero, it may look very feasible that the sequence (W^t) also converges to some optimal W^*. However, the regret analysis is valid for any sequence of data, and without additional assumptions about the data generating process there may not be any asymptotically optimal classifier W^*, thus convergence is not meaningful. To study convergence, we assume the data to be sampled independently from some joint distribution p for the data vector (y, x). In this case we try to find a W that minimizes the expected loss E_p[l(W)] + Ψ(W). Now, assuming that the optimal solution W^* is sparse, and some other technical assumptions, it is indeed possible to show that

    P\big( \|W^t - W^*\|_2 > \epsilon \big) < \epsilon^{-1}\big( c^{-1} + r^{-1} + \Delta^2 \big)\, t^{-1/4},    (5)

where r and c are constants (see Lemma 13 in (Lee & Wright, 2012) for the result and its assumptions).

2.3. Model-based RL with feature selection

Our main task is to turn transition model learning into the learning of the conditional distributions P(E | s, f(s), a) using multinomial logistic regression, for which attention to relevant features can be efficiently implemented online via mDAGL.

The key steps of our method, called loreRL (RL with regularized logistic regression), are presented in Algorithm 2. Inputs to loreRL are the CMDP components (except the transition function), the regularization parameters λ and α of the mDAGL algorithm, and the exploration parameter ε that determines the probability of taking a random action. We first initialize the logistic regression parameters W_a and the average gradient matrices \bar{G}_a for each action a ∈ A. We also randomly select a starting state s^0.

Algorithm 2  The loreRL algorithm
  Input: mDAGL regularization parameters λ, α; CMDP variables S, f, A, E, R, γ; exploration ε
  Let W = (W_1, W_2, ..., W_{|A|}) = (W^0, W^0, ..., W^0)
  Let \bar{G} = (\bar{G}_1, \bar{G}_2, ..., \bar{G}_{|A|}) = (\vec{0}, \vec{0}, ..., \vec{0})
  s^0 ← random initial state
  for t = 1, 2, 3, ... do
    π ← Solve MDP using transition model T(W)
    a ← π(s^t, ε)    # ε-greedy action selection
    Take action a yielding effect e, next state s^{t+1}
    (W_a, \bar{G}_a) ← mDAGL(t, e, x(s^t), W_a, \bar{G}_a, λ, α)
  end for

At each time step, a random action a is chosen with a small probability ε, but otherwise we calculate the optimal policy π for an MDP with the transition model T(W) that is based on the current effect predictors. While we have used value iteration (like in Rmax) for finding the optimal policy, any other model-based RL technique can be used as well. We do not focus on the planning part of RL here, but Dyna-Q or Prioritized Sweeping can be deployed for a more scalable algorithm. After performing an action a in state s^t and observing its effect e, the experience (e, s^t, f(s^t)) is presented to the mDAGL algorithm, which updates the parameter matrix W_a and the gradient matrix \bar{G}_a.
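Read as code, Algorithm 2 is essentially the following loop. This is a schematic sketch under stated assumptions: the environment interface (env.step, env.random_state), the planner solve_mdp, and the use of a per-action step counter for the gradient averages (Algorithm 2 passes the global step index) are illustrative choices, not the authors' implementation. mdagl_update is the function sketched in Section 2.2.2.

    import numpy as np

    def lore_rl(env, actions, effects, x, d, lam, alpha, eps, n_steps, solve_mdp):
        # One multinomial logistic effect predictor (and its average gradient) per action.
        K = len(effects)
        W = {a: np.zeros((K, d)) for a in actions}
        G_bar = {a: np.zeros((K, d)) for a in actions}
        counts = {a: 0 for a in actions}
        s = env.random_state()
        for _ in range(n_steps):
            pi = solve_mdp(W)                                    # e.g. value iteration on T(W)
            if np.random.rand() < eps:
                a = actions[np.random.randint(len(actions))]     # random exploration
            else:
                a = pi(s)                                        # greedy action from the current plan
            e, s_next = env.step(s, a)                           # observed effect and next state
            counts[a] += 1
            W[a], G_bar[a] = mdagl_update(counts[a], effects.index(e), x(s),
                                          W[a], G_bar[a], lam, alpha)
            s = s_next
        return W

In practice the plan does not have to be recomputed from scratch at every step; as noted above, Dyna-Q or Prioritized Sweeping could replace the full solve for scalability.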

As we just do ε-greedy random sampling, it is impossible to guarantee PAC convergence to an optimal policy. Assuming that the observed data is i.i.d., we can prove that the difference in optimal value functions of two CMDPs with different logistic regression based transition functions is bounded by the difference in their parameters. This leads to a corollary for convergence to a near optimal policy.

Theorem 3 (Difference in Value Function) Let M_1 = (S, f, A, T(W^{M_1}), E, R, γ) and M_2 = (S, f, A, T(W^{M_2}), E, R, γ) be two CMDPs with optimal policies π_1 and π_2 respectively. Let us denote by V^M_π the value function for policy π in CMDP M. Let

    \epsilon_1 = 2 \max_{a \in A,\, e \in E} \|W_e^{(a),M_1} - W_e^{(a),M_2}\|_1 \, \sup_s \|x(s)\|_1,

then

    \max_{s \in S} \big( V^{M_2}_{\pi_2}(s) - V^{M_2}_{\pi_1}(s) \big) \le \frac{2\gamma V_{max}\, \epsilon_1}{1 - \gamma},

where W_e^{(a),M_1} and W_e^{(a),M_2} refer to the vectors of coefficients corresponding to class E = e under action a in models M_1 and M_2 respectively, ||·||_1 is the 1-norm of a vector, and V_max is the maximum value of any state for any policy in either of the CMDPs.
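The bound of Theorem 3 is directly computable from two sets of per-action coefficient matrices. The sketch below evaluates it; the dictionary layout (one K × d matrix per action, one row per effect) and the argument names are illustrative assumptions rather than anything prescribed by the paper.

    import numpy as np

    def value_gap_bound(W1, W2, x_max_l1, gamma, v_max):
        # eps_1 = 2 * max over (a, e) of ||W_e^(a),M1 - W_e^(a),M2||_1 * sup_s ||x(s)||_1
        eps_1 = 2 * max(
            np.abs(W1[a] - W2[a]).sum(axis=1).max()   # row-wise 1-norms, maximized over effects e
            for a in W1
        ) * x_max_l1
        # Theorem 3: loss of running M1's optimal policy in M2 is at most this much.
        return 2 * gamma * v_max * eps_1 / (1 - gamma)

As the mDAGL estimates approach a sparse W^*, eps_1 shrinks and the guaranteed value gap vanishes with it, which is the argument developed next.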

By taking M_2 to be a CMDP based on the optimal W^* and M_1 an estimated CMDP based on mDAGL, the vanishing bound given in equation (5) can be translated into a vanishing bound for the value difference of policies. In case the true transition model is representable by a sparse W^*, we would most probably converge to a near optimal policy.

When we cannot express the true transition dynamics as logistic regression based on the available state features, it is hard to give guarantees of performance. However, we can still have some confidence in doing well. The logistic regression model P^*_l closest (in Kullback-Leibler distance) to the true model P_true (possibly not a logistic regression model)^1 is the one that has the smallest expected log-loss. While our optimality criterion is the expected regularized log-loss, we expect the regularized log-loss optimal model P^*_Ψ to be close to P^*_l, thus almost as close to P_true as we can get. This relatively small KL-distance can be converted to relatively small distances in actual transition probabilities, which can then further be converted to a relatively small bound on value differences by the same arguments used in proving Theorem 3. Therefore, since our model would very likely converge close to P^*_Ψ, we can expect to do almost as well as P^*_Ψ.

^1 Such a model may not always exist since the parameter set is open. However, for our argument, any model with almost infimum distance to the true model will do.

3. Related work

DBNs have been a popular choice for factoring and approximating transition models. In DBN learning, feature selection is equivalent to picking the parents of the state variables from the previous time slice. Recent studies have led to improvements in sample complexity for learning the optimal policy. Those studies assume a maximum number of possible parents for a node (Strehl et al., 2007; Diuk et al., 2009), or knowledge of a planning horizon that satisfies certain conditions (Chakraborty & Stone, 2011). However, the improvements in sample complexity are achieved at the expense of actual computational complexity, since these methods have to search through a large number of parent sets. Hence, these methods appear feasible only in manually designed, low-dimensional state-spaces.

Instead of searching for an optimal model with a minimal number of samples at almost any cost, our method attempts to save costs from early on, and gradually improves the model, acknowledging that the true model may actually be unattainable. In this spirit the structure learning study by Degris et al. (2006) resembles our work, but they do not address online learning with large feature sets. Ross and Pineau (2008) have used Markov chain Monte Carlo to sample from all possible DBN structures. However, the Markov chain used has a very long burn-in period and slow mixing time, making sampling computationally prohibitive in large problems.

Kroon and Whiteson (2009) present a feature selection method for learning state values. This method, however, assumes that the DBN structures of the transition and reward models are known. Our work, on the other hand, does not make such assumptions.

Leffler et al. (2007) also suggest predicting relative changes in states, which corresponds to the action effects in our formulation. However, they manually select important features to aggregate information from similar states for action effect prediction. Our work focuses on learning those features automatically. Hester and Stone (2009; 2012) later employ Quinlan's C4.5 (Quinlan, 1993) to learn a decision tree for predicting the relative changes of every state variable. This works better than the method by Degris et al. Despite adapting C4.5 for online learning, the method is still very slow, as a costly tree induction procedure has to be repeated many times in a large feature space. In addition, all the data needs to be stored for this purpose, which is undesirable in some applications.

Strehl and Littman (2007) and Walsh et al. (2009) have also proposed an online linear regression method with L2-regularization to approximate the transition model in continuous MDPs. L2, however, does not implement feature selection.

4. Experiments

We present an empirical evaluation of loreRL on both simulated and real robotic domains. The experiments aim to demonstrate that loreRL can a) generalize and approximate the transition model to achieve fast convergence to a near optimal policy, and b) with feature selection, perform well in complex, feature rich environments. We also want to see if the theoretical promises derived under the assumption of i.i.d. sampling can be realized in practice. We compare the accumulated rewards of loreRL with factored Rmax (fRmax), in which the network structures of the transition models are known (Strehl et al., 2007), and with factored ε-greedy (fEpsG), in which the optimistic Rmax exploration of fRmax is replaced by an ε-greedy strategy. We also compare our method with RL-DT (Hester & Stone, 2009) and LSE-Rmax (Chakraborty & Stone, 2011), which are the state of the art model-based RL algorithms for learning transition models.

4.1. Grid-world domain

In this domain, the agent tries to reach its goal in the grid-world while consuming as little energy as possible. Each cell in the world has one of five surface materials: sand, soil, water, brick, and fire; there may be walls between cells. Surface and walls are features that determine the stochastic dynamics of the world. In addition, to test the variable selection aspect, we attach hundreds of random binary features to the environment; the agent has to learn to focus on the relevant features to quickly achieve its goal. The agent can perform four actions (move up, down, left, right), which will lead it to one of the four states around it or leave it in its current state. Effects of the actions are captured in five outcomes (moved up, left, down, right, did not move). The states are defined by the (x, y)-coordinates of the agent. To perform an action, the agent spends 0.01 units of energy. It loses 1 unit if falling into a state of fire, but gains 1 unit when successfully reaching an exit door. A task ends when the agent reaches a terminal state, i.e., any exit door or state with fire.

We generated the environment transition models from four random multinomial logistic distributions (one for each action); every different combination of cell surfaces and walls around the cell will lead to different transition dynamics at the cell. The probability of going through a wall is rounded to zero and the freed probability mass is evenly distributed to the other effects. The agent's starting position is randomly picked in each episode.

We ran the experiments with loreRL having α = 1.5, λ = 0.05, γ = 0.95, exploration ε = 0.05, parameter m = 10 for fRmax, m = 5 for Rmax (m = 5 is small for Rmax, but increasing it did not yield better results), and fixed m = 10, σ = 0.99 for LSE-Rmax (similar to the authors' report). All results are averaged over 20 runs, and we report the 95% confidence intervals.

Generalization and convergence. We first show that when the feature space is small, loreRL performs as efficiently as the state of the art methods. RL-DT employs a decision tree to generalize transition dynamics knowledge over states, but it is implemented with an ε-greedy exploration strategy. LSE-Rmax appears to be the best structure learning method in ergodic factored MDPs (Chakraborty & Stone, 2011). fRmax and fEpsG have correct DBN structures provided by an oracle. All the methods are implemented with our customized DBN to utilize domain knowledge. Rmax is included as a reference to show the effect of knowledge generalization.

As seen in Figure 2a, loreRL can approximate the world dynamics using samples from all the states, thus it converges as fast as fEpsG and RL-DT to a near optimal policy. Although fRmax is provided with the correct DBN structure, its accumulated reward is lower due to aggressive exploration to find the optimal model. After exploration the policy is guaranteed to be near optimal, but it may still take a long time (or forever) to catch up with loreRL. While LSE-Rmax follows the Rmax scheme, it starts with a simple model and explores a bit less aggressively than fRmax, gaining some advantage in early episodes. However, LSE-Rmax appears to require much more data to choose a more complex model. Its accumulated reward drops below fRmax's after 150 episodes, and the angle of the curve suggests that its DBN structure is still not correct. We did not run LSE-Rmax for more episodes, as the algorithm has a very high computational demand (Table 1).

When the feature set has many irrelevant features (Figure 2b), loreRL is able to learn the relevant ones and still gain nearly as high an accumulated reward as fEpsG, which has the relevant features provided by an oracle. Also, loreRL's running time is not much longer than fRmax's or fEpsG's (Table 1). Other methods are too slow to be run in this high-dimensional environment. These results also suggest that with ε-greedy exploration and random restarts, a near optimal policy can be found even without i.i.d. data sampling.

Table 1. Average running time per episode in 800 episodes when acting in an environment with 210 features. (The slow RL-DT and LSE-Rmax could only be run with 10 features.) Run on an Intel Xeon CPU 2.13GHz, 32GB RAM.

  Algorithm    fRmax   fEpsG   RL-DT   LSE-R.   bloreRL   loreRL
  Time (sec.)  0.26    0.25    9.09    67.53    4.3       0.55

Feature selection. To model real-life situations, the feature space is usually exponentially large. The ability to focus only on the (most) relevant features is required to achieve effective learning. To understand the role of feature selection, we focus on comparing loreRL with bloreRL, which is based on multinomial logistic regression without feature selection (without the regularization term). fEpsG and fRmax are baselines.

[Figure 2: plots of accumulated reward (y-axis) against the number of episodes (panels a and b) and against the number of irrelevant features (panel c), comparing fRmax, fEpsG, RL-DT, LSE-Rmax, Rmax, bloreRL, and loreRL.]

Figure 2. Accumulated rewards in a 900 state CMDP for various model-based RL methods. (a) CMDP with 10 features. (b) Extra 200 irrelevant features. (c) Rewards after 800 episodes.

Figure 2b shows the accumulated reward when the environment has 200 irrelevant binary features. As seen, loreRL is still able to converge fast to the optimal policy, and outperforms fRmax and bloreRL. Figure 2c shows the performance of loreRL and bloreRL after 800 episodes as a function of the number of irrelevant features. Only minimally affected by the actual number of irrelevant features, loreRL can quickly select the relevant features and outperform bloreRL. loreRL does not lose much to fEpsG either. While fRmax may find an optimal policy before loreRL due to aggressive exploration, its accumulated reward is still lower than loreRL's. In our experiments, we also observed that loreRL, which selects a small set of features, is much faster than bloreRL (Table 1).

4.2. Robotics domain

Our second experiment is conducted in a real environment where we cannot expect the effect of the actions to follow a logistic regression model. The domain (Figure 3) consists of a 5 × 5-feet environment made of various materials such as beans, soil, hay, leaves, and carpet that cause the agent's actions, move-forward, turn-left, and turn-right, to have different effects at different locations. The environment also contains a blue target ball, and there are marks painted in green, blue, and red colors on the cardboard surface. The three wheel robot was built with a LEGO Mindstorms NXT kit, and a camera was installed above the area so that the robot can fully observe the environment.

Figure 3. A real environment.

To learn the transition function for the robot, we discretized the environment into a state space of 8 × 8 (x, y)-locations and 8 different orientations of the robot, which yields a state space of 512 states. The actions may change the robot's relative location in four different ways and its orientation in five different ways, resulting in a total of 20 different effects. However, in different states the actions' tendency to produce effects may be different due to the differences in surface materials. To capture this variation, the agent describes each state with a long vector of binary features. The "green" binary indicator f^G_i(s) of a state s is set to 1 iff there is a green mark that is further than i units but closer than i + 1 units from the xy-center of the state s (i ∈ {0, ..., 99}). Similar features are defined for blue marks and for the blue target ball, yielding 300 binary features. Eight indicators for different robot orientations are also included in the feature-base, together with four intentionally redundant "there is/is-not a green/blue mark in the state grid" bits. All together these yield 312 binary features per state. The intuition behind these features is that they serve as proxies for surface materials, slopes on the surfaces, obstacles, etc., which are likely to be important factors determining the dynamics in the environments, but which the robot's sensors cannot capture. Although only few among these 312 features are important for modeling the robot's actions, the robot does not know those critical features. It has to learn to select them based on feedback while interacting with the environment.

The robot's task is to travel in the environment from a random start and reach the blue ball, which will earn it a reward of 2 points. The robot will receive −1 point if it falls out of the area or into the death places marked with rectangles, and −0.05 points for an action at any other states. An episode ends if the robot reaches a terminal state, or gets stuck for four consecutive actions.

The robot battery did not allow us to compare our algorithm with the slow RL-DT and LSE-Rmax; we could only compare with the fine-tuned fRmax, fEpsG, and man-loreRL algorithms, in which we manually selected important features and specified the DBN structures for the transition models. man-loreRL is based on multinomial logistic regression models with 12 manually selected features. We ran the experiments with loreRL and man-loreRL having α = 0.5, λ = 0.05, γ = 0.95, exploration ε = 0.05, and parameter m = 10 for fRmax. All results are averaged over 10 runs, and we report the 95% confidence intervals.
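The 312-dimensional binary description above can be computed mechanically from the mark positions detected by the overhead camera. The sketch below illustrates the distance-ring indicators; the geometry helpers (state centers, distances in "units") and the exact ordering of the feature vector are assumptions for illustration only, not the authors' feature extractor.

    import numpy as np

    N_RINGS = 100   # i in {0, ..., 99}: further than i units but closer than i + 1 units

    def ring_indicators(center, mark_positions):
        # One binary indicator per distance ring around the state's xy-center.
        bits = np.zeros(N_RINGS, dtype=int)
        for mx, my in mark_positions:
            d = np.hypot(mx - center[0], my - center[1])
            if d < N_RINGS:
                bits[int(d)] = 1
        return bits

    def state_features(center, orientation, green_marks, blue_marks, ball_positions,
                       green_in_cell, blue_in_cell):
        # 3 x 100 ring indicators + 8 orientation bits
        # + 4 redundant "is / is not a green/blue mark in the state grid" bits = 312 features.
        orient = np.zeros(8, dtype=int)
        orient[orientation] = 1
        rings = np.concatenate([ring_indicators(center, m)
                                for m in (green_marks, blue_marks, ball_positions)])
        presence = np.array([green_in_cell, not green_in_cell,
                             blue_in_cell, not blue_in_cell], dtype=int)
        return np.concatenate([rings, orient, presence])

Such a vector plays the role of x(s) = (s, f(s)) in the robot experiments; mDAGL is then expected to truncate most of its 312 columns to zero.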

[Figure 4: accumulated reward (y-axis) against the number of episodes (x-axis, 0 to 50) for fRmax, fEpsG, loreRL, and man-loreRL.]

Figure 4. Accumulated rewards of various methods.

In Figure 4, loreRL appears to quickly capture the environment dynamics and outperform the other methods. Even with manually selected features, fRmax and fEpsG require more exploration to learn the dynamics. man-loreRL gains rewards a bit faster, but in the end it slightly loses to loreRL, possibly due to the (unforeseen) insufficiency of the manually selected features. Table 2 further shows that loreRL is fast. Its average running time per episode with 312 features is only slightly slower than with 12 manually selected features.

Table 2. Average running time per episode in 50 episodes. Run on an Intel Centrino Duo T2400 (1.83GHz), 1GB RAM.

  Algorithm    fRmax    fEpsG    man-loreRL   loreRL
  Time (sec.)  134.37   127.69   93.54        108.10

5. Conclusions

We have demonstrated how online multinomial logistic regression with group lasso can be used to quickly obtain a parsimonious transition model in model-based RL. The method leads to fast learning since a single transition model can be learnt using samples from all the states with a small set of features.

The efficiency is gained, however, at the expense of losing generality. Not all transition functions can be accurately represented as predicting action effects using state features via logistic regression. Nevertheless, we believe that this compromise between scalability and generality is often a useful one. The generality problem may also be alleviated by introducing non-linear features that are combinations of the original ones. Other generalizations such as stochastic features and vector valued effects are also possible, but are left for future work. The proposed framework also opens an opportunity for knowledge transfer. While different environments often have different states, the effects of actions are likely to persist from environment to environment. mDAGL would allow an agent to learn transferrable sparse action models. For instance, this method could be incorporated into the TES framework (Nguyen et al., 2012) to implement an intelligent agent that can learn, accumulate and transfer knowledge automatically between environments.

Acknowledgments

We thank Dr. Haiqin Yang for helpful comments on our sparse multinomial logistic regression. This research is supported by Academic Research Grants MOE2010-T2-2-071 and T1 251RES1005 from the Ministry of Education in Singapore.

References

Brafman, Ronen I. and Tennenholtz, Moshe. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2003.

Chakraborty, Doran and Stone, Peter. Structure learning in ergodic factored MDPs without knowledge of the transition function's in-degree. In Proceedings of the International Conference on Machine Learning, ICML'11, pp. 737-744, 2011.

Degris, Thomas, Sigaud, Olivier, and Wuillemin, Pierre-Henri. Learning the structure of factored Markov decision processes in reinforcement learning problems. In Proceedings of the International Conference on Machine Learning, ICML'06, pp. 257-264, 2006.

Diuk, Carlos, Li, Lihong, and Leffler, Bethany R. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the International Conference on Machine Learning, ICML'09, pp. 249-256, 2009.

Hester, Todd and Stone, Peter. Generalized model learning for reinforcement learning in factored domains. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, AAMAS'09, volume 3, pp. 717-724, 2009.

Hester, Todd and Stone, Peter. TEXPLORE: real-time sample-efficient reinforcement learning for robots. Machine Learning, pp. 1-45, 2012.

Kroon, Mark and Whiteson, Shimon. Automatic feature selection for model-based reinforcement learning in factored MDPs. In Proceedings of the International Conference on Machine Learning and Applications, ICMLA'09, 2009.

Lee, S. and Wright, S. Manifold identification of dual averaging methods for regularized stochastic online learning. Journal of Machine Learning Research, 13:1705-1744, 2012.

Leffler, Bethany R., Littman, Michael L., and Edmunds, Timothy. Efficient reinforcement learning with relocatable action models. In Proceedings of the National Conference on Artificial Intelligence, AAAI'07, pp. 572-577, 2007.

McCarthy, John. Situations, actions, and causal laws. Technical Report Memo 2, Stanford Artificial Intelligence Project, Stanford University, 1963.

Nguyen, Trung T., Silander, Tomi, and Leong, Tze Yun. Transferring expectations in model-based reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, NIPS'12, 2012.

Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. ISBN 1-55860-238-0.

Ross, Stéphane and Pineau, Joelle. Model-based Bayesian reinforcement learning in large structured domains. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI'08, pp. 476-483, 2008.

Strehl, Alexander L. and Littman, Michael L. Online linear regression and its application to model-based reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, NIPS'07, pp. 737-744, 2007.

Strehl, Alexander L., Diuk, Carlos, and Littman, Michael L. Efficient structure learning in factored-state MDPs. In Proceedings of the National Conference on Artificial Intelligence, AAAI'07, 2007.

Walsh, Thomas J., Szita, István, Diuk, Carlos, and Littman, Michael L. Exploring compact reinforcement-learning representations with linear regression. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI'09, pp. 591-598, 2009.

Xiao, Lin. Dual averaging methods for regularized stochastic learning and online optimization. In Proceedings of the Advances in Neural Information Processing Systems, NIPS'09, 2009.

Yang, Haiqin, Xu, Zenglin, King, Irwin, and Lyu, Michael R. Online learning for group lasso. In Proceedings of the International Conference on Machine Learning, ICML'10, pp. 1191-1198, 2010.