Non-Parametric Policy Gradients: A Unified Treatment of Propositional and Relational Domains

Kristian Kersting [email protected] Dept. of Knowledge Discovery, Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany Kurt Driessens [email protected] Dept. of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium

Abstract

Policy gradient approaches are a powerful instrument for learning how to interact with the environment. Existing approaches have focused on propositional and continuous domains only. Without extensive feature engineering, it is difficult – if not impossible – to apply them within structured domains, in which e.g. there is a varying number of objects and relations among them. In this paper, we describe a non-parametric policy gradient approach – called NPPG – that overcomes this limitation. The key idea is to apply Friedman's gradient boosting: policies are represented as a weighted sum of regression models grown in a stage-wise optimization. Employing off-the-shelf regression learners, NPPG can deal with propositional, continuous, and relational domains in a unified way. Our experimental results show that it can even improve on established results.

1. Introduction

Acting optimally under uncertainty is a central problem of artificial intelligence. If an agent learns to act solely on the basis of the rewards associated with the actions taken, this is called reinforcement learning (Sutton & Barto, 1998). More precisely, the agent's learning task is to find a policy for action selection that maximizes its reward over the long run.

The dominant reinforcement learning (RL) approach for the last decade has been the value-function approach. An agent uses the reward it occasionally receives to estimate a value-function indicating the expected value of being in a state or of taking an action in a state. The policy is represented only implicitly, for instance as the policy that selects in each state the action with the highest estimated value. As Sutton et al. (2000) point out, the value function approach, however, has several limitations. First, it seeks to find deterministic policies, whereas in real world applications the optimal policy is often stochastic, selecting different actions with specific probabilities. Second, a small change in the value-function parameters can push the value of one action over that of another, causing a discontinuous change in the policy, the states visited, and the overall performance. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas & Tsitsiklis, 1996). Finally, value-functions can often be much more complex to represent than the corresponding policy, as they encode information about both the size of and the distance to the appropriate rewards. Therefore, it is not surprising that so-called policy gradient methods have been developed that attempt to avoid learning a value function explicitly (Williams, 1992; Baxter et al., 2001; Konda & Tsitsiklis, 2003). Given a space of parameterized policies, they compute the gradient of the expected reward with respect to the policy's parameters, move the parameters in the direction of the gradient, and repeat this until they reach a local optimum. This direct approximation of the policy overcomes the limitations of the value function approach stated above. For instance, Sutton et al. (2000) show convergence even when using function approximation.

Current policy gradient methods have focused on propositional and continuous domains, assuming the environment of the learning agent to be representable as a vector space. Nowadays, the role of structure and relations in the data, however, becomes more and more important (Getoor & Taskar, 2007): information about one object can help the agent to reach conclusions about other objects. Such domains are hard to represent meaningfully using a fixed set of features. Therefore, relational RL approaches have been developed (Džeroski et al., 2001), which seek to avoid the explicit state and action enumeration that is – in principle – traditionally done in RL, through a symbolic representation of states and actions. Existing relational RL approaches, however, have focused on value functions and, hence, suffer from the same problems as their propositional counterparts listed above.

In this paper, we present the first model-free policy gradient approach that deals with relational and propositional domains. Specifically, we present a non-parametric approach to policy gradients, called NPPG. Triggered by the observation that finding many rough rules of thumb of how to change the way to act can be a lot easier than finding a single, highly accurate policy, we apply Friedman's (2001) gradient boosting. That is, we represent policies as weighted sums of regression models grown in a stage-wise optimization. Such a functional gradient approach has recently been used to efficiently train conditional random fields for labeling (relational) sequences using boosting (Dietterich et al., 2004; Gutmann & Kersting, 2006) and for policy search in continuous domains (Bagnell & Schneider, 2003). In contrast to the sequence labeling setting, feedback on the performance is received only at the end of an action sequence in the policy search setting. The benefits of a boosting approach to functional policy gradients are twofold. First, interactions among states and actions are introduced only as needed, so that the potentially infinite search space is not explicitly considered. Second, existing off-the-shelf regression learners can be used to deal with propositional and relational domains in a unified way. To the best of the authors' knowledge, this is the first time that such a unified treatment is established. As our experimental results show, NPPG can even significantly improve upon established results in relational domains.

We proceed as follows. We will start off by reviewing policy gradients and their mathematical background. Afterwards, we will develop NPPG in Section 3. In Section 4, we will present our experimental results. Before concluding, we will touch upon related work.
Copy- and relations in the data, however, becomes more and right 2008 by the author(s)/owner(s). Non-Parametric Policy Gradients: A Unified Treatment of Propositional and Relational Domains more important (Getoor & Taskar, 2007): information Consider the standard RL framework (Sutton & Barto, about one object can help the agent to reach conclu- 1998), where an agent interacts with a Markov Deci- sions about other objects. Such domains are hard to sion Process (MDP). The MDP is defined by a number represent meaningfully using a fixed set of features. of states s ∈ S, a number of actions a ∈ A, state- Therefore, relational RL approaches have been devel- transition probabilities δ(s, a, s0): S × A × S → [0, 1] oped (Dˇzeroski et al., 2001), which seek to avoid ex- that represent the probability that taking action a in plicit state and action enumeration as – in principle state s will result in a transition to state s0 and a re- – traditionally done in RL through a symbolic repre- ward function r(s, a): S × A 7→ IR. When the reward sentation of states and actions. Existing relational RL function is nondeterministic, we will use the expected approaches, however, have focused on value functions rewards R(s, a) = Es,a [r(s, a)], where Es,a denotes the and, hence, suffer from the same problems as their expectation over all states s and actions a. The state, propositional counterparts as listed above. action, and reward at time t are denoted as st ∈ S, a ∈ A and r (or R ) ∈ IR. In this paper, we present the first model-free pol- t t t icy gradient approach that deals with relational and The agent selects which action to execute in a state propositional domains. Specifically, we present a following a policy function π(s, a): S × A → [0, 1]. non-parametric approach to policy gradients, called Current policy gradient approaches assume that the NPPG. Triggered by the observation that finding policy function π is parameterized by a (weight) vector many rough rules of thumb of how to change the way θ ∈ IRn and that this policy function is differentiable to act can be a lot easier than finding a single, highly ∂π(s,a,θ) with respect to its parameters, i.e., that ∂θ ex- accurate policy, we apply Friedmann’s (2001) gradient ists. A common choice is a Gibbs distribution based boosting. That is, we represent policies as weighted on a linear combination of features: sums of regression models grown in a stage-wise op- X  π(s, a) = eΨ(s,a)/ eΨ(s,b) , (1) timization. Such a functional gradient approach has b recently been used to efficiently train conditional ran- T dom fields for labeling (relational) sequences using where the potential function Ψ(s, a) = θ φsa with φsa boosting (Dietterich et al., 2004; Gutmann & Ker- the feature vector describing state s and action a. This sting, 2006) and for policy search in continuous do- representation guarantees that the policy specifies a mains (Bagnell & Schneider, 2003). In contrast to probability distribution independent of the exact form the setting of the sequence labeling of the function Ψ. Choosing a parameterization, cre- n task, feedback on the performance is received only at ates an explicit policy space IR with n equal to size the end of an action sequence in the policy search set- of the parameter vector θ. This space can be traversed ting. The benefits of a boosting approach to functional by an appropriate search algorithm. policy gradients are twofold. First, interactions among Policy gradients are computed w.r.t. 
Policy gradients are computed w.r.t. a function ρ that expresses the value of a policy in an environment,

    ρ(π) = Σ_{s∈S} d^π(s) Σ_{a∈A} π(s, a) · Q^π(s, a),

where Q^π is the usual (possibly discounted) state-action value function, e.g., Q^π(s, a) = E_π[ Σ_{i=0}^{∞} γ^i R_{t+i} | s_t = s, a_t = a ], and d^π(s) is the (possibly discounted) stationary distribution of states under policy π. We assume a fully ergodic environment so that d^π(s) exists and is independent of any starting state s_0 for all policies.

A property of ρ, for both average-reward reinforcement learning and episodic tasks with a fixed starting state s_0, is that (Sutton et al., 2000, Theorem 1)

    ∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s, a)/∂θ · Q^π(s, a).    (2)

Important to note is that this gradient does not include the term ∂d^π(s)/∂θ. Eq. (2) allows the computation of an approximate policy gradient through exploration. By sampling states s through exploration of the environment following policy π, the distribution d^π(s) is automatically represented in the generated sample of encountered states. The sum Σ_a ∂π(s, a)/∂θ · Q^π(s, a) then becomes an unbiased estimate of ∂ρ/∂θ and can be used in a gradient ascent algorithm.

Of course, the value Q^π is unknown and must be estimated for each visited state-action pair, either by using a Monte Carlo approach or by building an explicit representation of Q, although in this latter case care must be taken when choosing the parameterization of the Q-function (Sutton et al., 2000).
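As a minimal illustration of the Monte Carlo option, the toy sketch below (our own scaffolding, not the authors' implementation) estimates Q^π(s, a) for the visited state-action pairs of one episode by the discounted return that followed them.

```python
def monte_carlo_q_targets(episode, gamma=0.9):
    """Estimate Q(s, a) for each visited (s, a) by the discounted return.

    episode : list of (state, action, reward) triples, in the order visited
    returns : dict mapping (state, action) -> sampled return
    """
    q_hat, g = {}, 0.0
    # Walk the episode backwards, accumulating the discounted return.
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        # First-visit convention: keep the return of the earliest visit.
        q_hat[(state, action)] = g
    return q_hat

# Toy episode: three visited pairs, with a positive reward only at the end.
episode = [("s0", "a", -1.0), ("s1", "b", -1.0), ("s2", "a", 10.0)]
print(monte_carlo_q_targets(episode))
```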
3. Non-Parametric Policy Gradients

A drawback of a fixed, finite parameterization of a policy such as in Eq. (1) is that it assumes each feature makes an independent contribution to the policy. Of course, it is possible to define more features to capture combinations of the basic features, but this leads to a combinatorial explosion in the number of features and, hence, in the dimensionality of the optimization problem. Moreover, in continuous and in relational environments it is not clear at all which features to choose, as there are infinitely many possibilities.

To overcome these problems, we introduce a different policy gradient approach based on Friedman's (2001) gradient boosting. In our case, the potential function Ψ in the Gibbs distribution¹ of Eq. (1) is represented as a weighted sum of regression models grown in a stage-wise optimization. Each regression model can be viewed as defining several new feature combinations. The resulting policy is still a linear combination of features, but the features can be quite complex. Formally, gradient boosting is based on the idea of functional gradient ascent, which we will now describe.

3.1. Functional Gradient Ascent

Traditional gradient ascent estimates the parameters θ of a policy iteratively as follows. Starting with some initial parameters θ_0, the parameters θ_m in the next iteration are set to the current parameters plus the gradient of ρ w.r.t. θ, i.e., θ_m = θ_0 + δ_1 + ... + δ_m, where δ_m = η_m · ∂ρ/∂θ_{m−1} is the gradient multiplied by a constant η_m, which is obtained by doing a line search along the gradient. Functional gradient ascent is a more general approach, see e.g. (Friedman, 2001; Dietterich et al., 2004). Instead of assuming a linear parameterization for Ψ, it just assumes that Ψ will be represented by a linear combination of functions. Specifically, one starts with some initial function Ψ_0, e.g. based on the zero potential, and iteratively adds corrections Ψ_m = Ψ_0 + ∆_1 + ... + ∆_m. In contrast to the standard gradient approach, ∆_m here denotes the so-called functional gradient, i.e., ∆_m = η_m · E_{s,a}[∂ρ/∂Ψ_{m−1}]. Interestingly, this functional gradient coincides with what traditional policy gradient approaches estimate, namely (2), as the following theorem says.

Theorem 3.1 (Functional Policy Gradient) For any MDP, in either the average-reward or start-state formulation,

    E_{s,a}[∂ρ/∂Ψ] = Σ_{s,a} d^π(s) · ∂π(s, a)/∂Ψ · Q(s, a).    (3)

This is a straightforward adaptation of Theorem 1 in (Sutton et al., 2000) and is also quite intuitive: the functional policy gradient indicates how we would like the policy to change in all states and actions in order to increase the performance measure ρ.
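The practical consequence is that the potential is no longer a weight vector but a growing sum of fitted models. A minimal sketch of this data structure follows; it is our own illustration and assumes each fitted regression model exposes a `predict(state, action)` method.

```python
import math

class BoostedPotential:
    """Potential Psi kept as Psi_0 + eta_1*f_1 + ... + eta_m*f_m (Psi_0 = 0)."""

    def __init__(self):
        self.models = []   # fitted regression models f_1, ..., f_m
        self.steps = []    # step sizes eta_1, ..., eta_m

    def add(self, model, eta):
        self.models.append(model)
        self.steps.append(eta)

    def psi(self, state, action):
        # Each correction contributes eta_m * f_m(s, a).
        return sum(eta * f.predict(state, action)
                   for f, eta in zip(self.models, self.steps))

    def policy(self, state, actions):
        # Gibbs distribution of Eq. (1) over the applicable actions.
        psis = [self.psi(state, a) for a in actions]
        top = max(psis)
        weights = [math.exp(p - top) for p in psis]  # shifted for numerical stability
        z = sum(weights)
        return [w / z for w in weights]
```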

Unfortunately, we do not know the distribution d^π(s) of how often we visit each state s under policy π. This is, however, easy to approximate from the empirical distribution of states visited when following π:

    E_{s,a}[∂ρ/∂Ψ] ≈ E_{s∼π}[ Σ_a ∂π(s, a)/∂Ψ · Q(s, a) ],    (4)

where the state s is sampled according to π, denoted as s ∼ π, and we write f_m(s, a) := ∂π(s, a)/∂Ψ · Q(s, a) for the summands. We now have a set of training examples from the distribution d^π(s), so we can compute the value f_m(s, a) of the functional gradient at each of the training data points, for all actions a applicable in s.² We can then use these point-wise functional gradients to define a set of training examples {((s, a), f_m(s, a))} and train a function f_m : S × A → ℝ so that it minimizes the squared error over the training examples. Although the fitted function f_m is not exactly the same as the desired functional gradient in Eq. (3), it will point in the same general direction, assuming there are enough training examples. So, taking a step ∆_m = η_m · f_m will approximate the true functional policy gradient ascent.

¹ Other distributions are possible but are subject to future research. For instance, it would be interesting to investigate modeling joint actions of multiple agents in relational domains along the lines of Guestrin et al. (2002).

² Here, an update is made for all actions possible in each state encountered, irrespective of which action was actually taken. Alternatively, we can make an update only for the one action actually taken (Baxter et al., 2001). To compensate for the fact that some actions are selected more frequently than others, we then divide by the probability of choosing the action, i.e., we use f_m(s, a)/π(s, a) as functional gradient training examples and do not run over all actions.

As an example, we will now derive the point-wise functional gradient of a policy parameterized as a Gibbs distribution (1). For the sake of readability, we denote Ψ(s, a) as Ψ_a and Ψ(s, b) as Ψ_b.

Proposition 3.1 The point-wise functional gradient of ρ with respect to a policy parameterized as a Gibbs distribution (1) equals Q(s, a) · ∂π(s, a)/∂Ψ with

    ∂π(s, a)/∂Ψ(s′, a′) =   π(s, a) (1 − π(s, a))   if s = s′ and a = a′,
                           −π(s, a) π(s, a′)         if s = s′ and a ≠ a′,
                            0                        otherwise.

To see this, consider the first case; the other cases can be derived in a similar way. Due to the Gibbs distribution, we can write

    ∂π(s, a)/∂Ψ(s, a) = ∂/∂Ψ_a [ e^{Ψ_a} / Σ_b e^{Ψ_b} ]
                      = [ e^{Ψ_a} · Σ_b e^{Ψ_b} − e^{Ψ_a} · Σ_b ∂e^{Ψ_b}/∂Ψ_a ] / [ Σ_b e^{Ψ_b} ]².

Assuming ∂Ψ_b/∂Ψ_a = 0 for b ≠ a, i.e., that Ψ_a and Ψ_b are independent, it holds that Σ_b ∂e^{Ψ_b}/∂Ψ_a = e^{Ψ_a}, and we can rewrite the state-action gradient as

    = e^{Ψ_a} · [ Σ_b e^{Ψ_b} − e^{Ψ_a} ] / [ Σ_b e^{Ψ_b} ]²
    = ( e^{Ψ_a} / Σ_b e^{Ψ_b} ) · ( 1 − e^{Ψ_a} / Σ_b e^{Ψ_b} ),

which simplifies – due to the definition of π(s, a) – to

    ∂π(s, a)/∂Ψ(s, a) = π(s, a) (1 − π(s, a)).

The key point is the assumption ∂Ψ_b/∂Ψ_a = 0 for b ≠ a. It actually means that we model Ψ with one function per action a, i.e., Ψ(s, a) = f_a(s), and, in turn, estimate one regression model f_m^a per action in each iteration. In the experiments, however, learning a single regression model f_m did not decrease performance, so we stick here to the conceptually easier variant of learning a single regression model f_m.

We call policy gradient methods that follow the outlined functional gradient approach non-parametric policy gradients, or NPPG for short. They are non-parametric because the number of parameters can grow with the number of episodes.

Algorithm 1: Non-Parametric Policy Gradient
  1: Let Ψ_0 be the zero potential (the empty tree)
  2: for m = 1 to N do
  3:   Choose a starting state s
  4:   Generate episodes E_m starting in s, following the policy π_{m−1} with
         π_{m−1}(s, a) = e^{Ψ_{m−1}(s,a)} / Σ_b e^{Ψ_{m−1}(s,b)}
  5:   Generate functional gradient examples R_m = {((s_i, a_ij), f_m(s_i, a_ij))} based on E_m
  6:   Induce regression model f_m based on R_m
  7:   Set the potential to Ψ_m = Ψ_{m−1} + ∆_m, where ∆_m = η_m · f_m with local step size η_m
  8: return the final potential Ψ_N = Ψ_0 + ∆_1 + ... + ∆_N
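As one possible reading of lines 5–6, the sketch below (our illustration, not the authors' code) turns the states encountered in line 4 into the regression examples R_m. It assumes a `policy` callable and a value estimate `q_hat` available for all applicable actions (e.g. from an explicit Q representation, cf. Section 2), and it takes the point-wise functional gradient at each (s, a) by combining both cases of Proposition 3.1: f_m(s, a) = Σ_{a′} Q(s, a′) · ∂π(s, a′)/∂Ψ(s, a) = π(s, a) (Q(s, a) − Σ_{a′} π(s, a′) Q(s, a′)).

```python
def functional_gradient_examples(visited_states, policy, actions, q_hat):
    """Build the regression set R_m of Algorithm 1 (one possible reading).

    visited_states : states encountered while following pi_{m-1}
    policy         : callable, policy(state) -> dict action -> probability (Eq. (1))
    actions        : actions applicable in every state (kept uniform for simplicity)
    q_hat          : callable, q_hat(state, action) -> estimated Q value
    """
    examples = []
    for s in visited_states:
        pi = policy(s)
        # Expected value of s under the current policy, used via Proposition 3.1.
        baseline = sum(pi[a] * q_hat(s, a) for a in actions)
        for a in actions:
            target = pi[a] * (q_hat(s, a) - baseline)
            examples.append(((s, a), target))
    return examples
```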

3.2. Gradient Tree Boosting

NPPG as summarized in Alg. 1 actually describes a family of approaches. In the following, we will develop a particular instance, called TreeNPPG, which uses regression tree learners to estimate f_m in line 6.

In TreeNPPG, the policy is represented by sums of regression trees. Each regression tree can be viewed as defining several new feature combinations, one corresponding to each path in the tree from the root to a leaf. The resulting policies still have the form of a linear combination of features, but the features can be quite complex. The trees are grown using a regression tree learner such as CART (Breiman et al., 1984), which in principle runs as follows. It starts with the empty tree and repeatedly searches for the best test for a node according to some splitting criterion such as weighted variance. Next, the examples R in the node are split into R_s (success) and R_f (failure) according to the test. For each split, the procedure is applied recursively, obtaining subtrees for the respective splits. As splitting criterion, we use the weighted variance on R_s and R_f. We stop splitting if the variance in a node is small enough or a depth limit is reached. In the leaves, the average regression value is predicted.

We propose to use regression tree learners because a rich variety of variants exists that can deal with finite, continuous and even relational data. Depending on the type of the problem domain at hand, one can instantiate the TreeNPPG algorithm by choosing the appropriate regression tree learner. We will give several examples in the following experimental section.
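For propositional or continuous state-action descriptions, one TreeNPPG iteration can be sketched as below. This is our own stand-in that uses scikit-learn's DecisionTreeRegressor in place of a CART or Tilde implementation, together with the hypothetical `featurize` helper and the classes and functions from the earlier sketches.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for CART / Tilde

class TreeCorrection:
    """One boosting correction f_m, fitted on the examples R_m."""

    def __init__(self, featurize, max_depth=4):
        self.featurize = featurize          # featurize(state, action) -> 1-d array
        self.tree = DecisionTreeRegressor(max_depth=max_depth)

    def fit(self, examples):                # examples: [((state, action), target), ...]
        X = np.array([self.featurize(s, a) for (s, a), _ in examples])
        y = np.array([t for _, t in examples])
        self.tree.fit(X, y)
        return self

    def predict(self, state, action):
        return float(self.tree.predict([self.featurize(state, action)])[0])

# One TreeNPPG iteration (lines 3-7 of Algorithm 1), given a BoostedPotential `Psi`:
#   R_m = functional_gradient_examples(visited, policy, actions, q_hat)
#   Psi.add(TreeCorrection(featurize).fit(R_m), eta)
```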

4. Experimental Evaluation

Our intention is to investigate how well NPPG works. To this aim, we implemented it and investigated the following questions: (Q1) Does TreeNPPG work and, if so, are there cases where it yields better results than current state-of-the-art methods? (Q2) Is TreeNPPG applicable across finite, continuous, and relational domains? In the following, we will describe the experiments carried out to investigate these questions and their results.

(Q1) Blocks World: A Relational Domain

As a complex domain for the empirical evaluation of TreeNPPG, we consider the well-known blocks world (Slaney & Thiébaux, 2001). To be able to compare the performance of TreeNPPG to other relational RL (RRL) systems, we adopt the same experimental environment as used for RRL by e.g. Driessens and Džeroski (2005). We learn in a world with 10 blocks and try to accomplish the on(A, B) goal. This goal is parameterized, and appropriate values for A and B are chosen at the same time as a starting state is generated. Thus, the reinforcement learner learns a single policy that stacks any two blocks. Although this is a relatively simple goal in a planning context, both RRL and our TreeNPPG algorithm use a model-free approach and only learn from interactions with the environment. For the given setup this means that there are approximately 55.7 million reachable states³, of which 1.44 million are goal states. The minimal number of steps to the goal is 3.9 on average. The percentages of states for other minimal solution sizes are given in Fig. 2. The agent only receives a reward of 1 if it reaches the goal in the minimal number of steps. The probability of receiving a reward using a random strategy is approximately 1.3%. To counter this difficulty of reaching any reward, we adopted the active guidance approach as proposed by Driessens and Džeroski (2004), presenting the RL agent with 10% of expert traces during exploration in all experiments.
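To illustrate what a varying number of objects and relations means for the learner, a blocks-world state can be represented as a set of ground facts rather than a fixed-length feature vector. The tiny sketch below is our own illustration, not the RRL/Tilde representation itself.

```python
# A 4-block state: a is on b, b is on the floor, c is on d, d is on the floor.
state = {
    ("on", "a", "b"), ("on", "c", "d"),
    ("on_floor", "b"), ("on_floor", "d"),
    ("clear", "a"), ("clear", "c"),
}

def goal_reached(state, A, B):
    """The parameterized on(A, B) goal: block A rests directly on block B."""
    return ("on", A, B) in state

def move(state, block, dest):
    """Put a clear block onto another clear block (a minimal, illustrative action)."""
    assert ("clear", block) in state and ("clear", dest) in state and block != dest
    new = {f for f in state if not (f[0] in ("on", "on_floor") and f[1] == block)}
    new.add(("on", block, dest))
    new.discard(("clear", dest))
    # Whatever the block was standing on becomes clear again.
    below = next((f[2] for f in state if f[0] == "on" and f[1] == block), None)
    if below is not None:
        new.add(("clear", below))
    return new

print(goal_reached(move(state, "a", "c"), "a", "c"))  # True
```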
We apply TreeNPPG to this relational domain by simply employing the relational regression tree learner Tilde (Blockeel & De Raedt, 1998). Rather than using attribute-value or threshold tests in the nodes of the tree, Tilde employs logical queries. Furthermore, a placeholder for domain elements (such as blocks) can occur in different nodes, meaning that all occurrences denote the same domain element. Indeed, this slightly complicates the induction process, for example when generating the possible tests to be placed in a node. To this aim, Tilde employs a classical refinement operator under θ-subsumption. The operator basically adds a literal, unifies variables, and grounds variables. When a node is to be split, the set of all refinements is computed and evaluated according to the chosen heuristic. Except for the representational differences, Tilde uses the same approach to tree-building as the generic tree learner. Because single episodes are too short and thus generate too few examples for Tilde to learn meaningful trees, we postpone calling Tilde until an episode brings the cumulated number of examples over 100. As a language bias for Tilde, we employ the same language bias as used in published RRL experiments, including predicates such as clear, on, above and compheight⁴. Each learned model ∆_m updates the potential function Ψ using a step-size η_m = 1. We count on the self-correcting property of tree boosting to correct over- or under-stepping the target on the next iteration.

[Figure 1. Comparison of the learning curves (percentage of solved test-cases versus number of learning episodes) for Non-Parametric Policy Gradient and various RRL implementations (TG, RIB, Trendi), with standard deviations, on the on(A,B) task in a world with 10 blocks.]

We ran experiments using three versions of the RRL system, i.e., TG (Driessens et al., 2001), RIB (Driessens & Ramon, 2003), and Trendi (Driessens & Džeroski, 2005), which represent the current state of the art of RRL systems, and our TreeNPPG algorithm. After every 50 learning episodes, we fixed the strategy learned by RRL and tested the performance on 1000 randomly generated starting states. For TreeNPPG, fixing the strategy is not required as it uses the learned strategy for exploration directly. Fig. 1 shows the resulting learning curves averaged over 10 test-runs as well as the standard deviations of these curves. As shown, TreeNPPG outperforms all tested versions of the RRL system. After approximately 2000 learning episodes, TreeNPPG solves 99% of all presented test-cases in the minimal number of steps. With less than 4 steps per learning episode on average, this means that TreeNPPG generalizes the knowledge it collected by visiting less than 8000 states to solve 99% of 54.3 million states. Around this time, TreeNPPG has generated a list of 500 trees on average.

³ We do not consider states in which the goal has already been satisfied as starting states. Therefore, a number of states where the goal is satisfied are not reachable, i.e., the states where on(A, B) is satisfied and there are extra blocks on top of A.

⁴ For a full overview and the semantics of these predicates, we refer to (Driessens & Džeroski, 2005).

[Figure 2. Detailed view of the learning performance of TreeNPPG, showing the percentage of solved test-cases with respect to the minimal solution size (1 to 10 steps to the goal) for varying amounts of learning experience (100 to 1800 episodes). The graph also shows the percentage of test-cases for each solution length.]

One observation that cannot be made directly from the graph is the stability of the TreeNPPG algorithm. Where the performance of the RRL system using TG for regression can vary substantially between experiments, TreeNPPG follows an extremely similar path in each iteration of the experiment. The only variation between experiments is the exact time-frame of the phase transition in the results, as shown by the peak in TreeNPPG's standard deviation curve.

Fig. 2 shows the results in more detail, plotting the percentage of solved test-cases with respect to the minimal solution length. As one can see, TreeNPPG gradually learns to solve problems with growing solution sizes. The graph also shows the percentage of test-cases for each solution size.

To summarize, the results clearly show that question Q1 can be answered affirmatively.

(Q2) Corridor World: A Continuous Domain

To qualitatively test whether TreeNPPG is also applicable in continuous domains, we considered a simple continuous corridor domain. The task is to navigate a robot from any position pos_0 ∈ [0, 10] in a one-dimensional corridor [0, 10] to one of the exits at both ends (0 and 10). At each time t = 0, 1, 2, ..., the robot can go either one step to the left (a = −1) or one step to the right (a = 1); the outcome, however, is uncertain, with pos_{t+1} = pos_t + a + N(1, 1). The robot is given a reward after each step equal to −1 if pos ∈ (0, 10) and equal to 20 otherwise, i.e., when it reaches one of the exits.
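A minimal simulation of this corridor task, written directly from the description above (the noise term is kept as stated; the policy interface and episode length cap are our own assumptions), could look as follows.

```python
import random

def corridor_episode(policy, pos0=None, max_steps=200):
    """Roll out one episode of the continuous corridor task.

    policy : callable, policy(pos) -> action in {-1, +1}
    Returns the list of (position, action, reward) triples.
    """
    pos = random.uniform(0.0, 10.0) if pos0 is None else pos0
    episode = []
    for _ in range(max_steps):
        a = policy(pos)
        nxt = pos + a + random.gauss(1, 1)          # noisy outcome, as in the text
        reward = 20.0 if nxt <= 0.0 or nxt >= 10.0 else -1.0
        episode.append((pos, a, reward))
        if reward == 20.0:                          # an exit was reached
            break
        pos = nxt
    return episode

# A hand-coded reference policy: head towards the nearer exit.
print(len(corridor_episode(lambda p: -1 if p < 5.0 else +1)))
```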
Fig. 3 shows how the stochastic policy evolves with the number of iterations, i.e., calls to the tree learner. The results are averaged over 30 reruns. In each run, we selected a random starting position pos_0 uniformly in (0, 10), gathered learning examples from 30 episodes in each iteration, and used a step size η_m = 0.7.

[Figure 3. Learning performance of TreeNPPG on the continuous corridor task. Shown is how the probability of going left at all corridor positions evolves with the number of iterations. The shading indicates the standard deviation over the 30 runs of the experiment.]

As one can see, the robot gradually learns to go left in the left section of the corridor and right in the right section. Quite intuitively, the uncertainty is highest in the middle part of the corridor.

We also ran experiments in a grid-world version of this problem. We do not report on the successful results here because finite domains are a special case of the relational and continuous cases. To summarize, question Q2 can also be answered affirmatively.

5. Related Work

Within reinforcement learning (RL), there are two main classes of solution methods: value-function methods seek to estimate the value of states and actions, whereas policy-based methods search for a good policy within some class of policies.

Within policy-based methods, policy gradients have received increased attention, see e.g. (Williams, 1992; Baxter et al., 2001; Guestrin et al., 2002; Konda & Tsitsiklis, 2003; Munos, 2006) as a non-exhaustive list. Most closely related to NPPG is the work of Bagnell and Schneider (2003), see also (Bagnell, 2004). They proposed a functional gradient policy algorithm in reproducing kernel Hilbert spaces for continuous domains. So far, however, this line of work has not considered (relationally) structured domains, and the connection to gradient boosting was not employed.

Within value function approaches, the situation is slightly different. NPPG can be viewed as automatically generating a "variable" propositionalization or discretization of the domain at hand. In this sense, it is akin to tree-based state discretization RL approaches such as (Chapman & Kaelbling, 1991; McCallum, 1996; Uther & Veloso, 1998) and related approaches. Within this line of research, some boosting methods have been proposed. Ernst et al. (2005) showed how to estimate Q-functions with ensemble methods based on regression trees. Riedmiller (2005) keeps all regression examples and re-weights them according to some heuristic. Neither considers policy gradients or relational domains.

Recently, there have been some exciting new developments in combining the rich relational representations of classical knowledge representation with RL. While traditional RL requires (in principle) explicit state and action enumeration, these symbolic approaches seek to avoid it through a symbolic representation of states and actions. Most work in this context, however, has focused on value function approaches. Basically, a number of relational regression algorithms have been developed for use in the RRL system that employ relational regression trees (Driessens et al., 2001), relational instance based regression (Driessens & Ramon, 2003), graph kernels and Gaussian processes (Gärtner et al., 2003), and relational model-trees (Driessens & Džeroski, 2005). Finally, there is an increasing number of dynamic programming approaches for solving relational MDPs (Kersting et al., 2004; Sanner & Boutilier, 2005; Wang et al., 2007). In contrast to NPPG, they assume a model of the domain. It would be interesting, however, to combine NPPG with these approaches along the lines of (Wang & Dietterich, 2003). A parametric step in this direction has already been taken by Aberdeen (2006).

6. Conclusions

We have introduced the framework of non-parametric policy gradient (NPPG) methods. It approaches the policy selection problem from a gradient boosting perspective. NPPG is fast and straightforward to implement, combines the expressive power of relational RL with the benefits of policy gradient methods, and can deal with finite, continuous, and relational domains in a unified way. Moreover, the experimental results show a significant improvement over established results; for the first time, a (model-free) relational RL approach learns to solve on(A, B) in a world with 10 blocks.

NPPG suggests several interesting directions for future research, such as using more advanced regression models, developing actor-critic versions of NPPG that estimate a value function in parallel to reduce the variance of the gradient estimates, and exploiting NPPG's ability to learn in hybrid domains with both discrete and continuous variables within real-world applications such as robotics and network routing. Most interesting, however, is to address the more general problem of learning how to interact with (relationally) structured environments in the presence of observation noise. NPPG is naturally applicable in this case and, hence, paves the way towards (model-free) solutions of what can be called relational POMDPs. This is a topic of high current interest since it combines the expressive representations of classical AI with the decision-theoretic emphasis of modern AI.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments. The material is based upon work performed while KK was with CSAIL@MIT and is supported by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under Contract No. NBCHD030010, and by a Fraunhofer ATTRACT fellowship. KD is a post-doctoral research fellow of the Research Fund – Flanders (FWO).
References

Aberdeen, D. (2006). Policy-gradient methods for planning. Advances in Neural Information Processing Systems 18 (pp. 9–17).

Bagnell, J. (2004). Learning decisions: Robustness, uncertainty, and approximation. Doctoral dissertation, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA.

Bagnell, J., & Schneider, J. (2003). Policy search in reproducing kernel Hilbert space (Technical Report CMU-RI-TR-03-45). Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA.

Baxter, J., Bartlett, P., & Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research (JAIR), 15, 351–381.

Bertsekas, D. P., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.

Blockeel, H., & De Raedt, L. (1998). Top-down induction of first order logical decision trees. Artificial Intelligence, 101, 285–297.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont: Wadsworth.

Chapman, D., & Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. Proceedings of the 12th International Joint Conference on Artificial Intelligence (pp. 726–731). Sydney, Australia.

Dietterich, T., Ashenfelter, A., & Bulatov, Y. (2004). Training conditional random fields via gradient tree boosting. Proceedings of the 21st International Conference on Machine Learning (pp. 217–224). Banff, Canada.

Driessens, K., & Džeroski, S. (2005). Combining model-based and instance-based learning for first order regression. Proceedings of the 22nd International Conference on Machine Learning (pp. 193–200). Bonn, Germany.

Driessens, K., & Džeroski, S. (2004). Integrating guidance into relational reinforcement learning. Machine Learning, 57, 271–304.

Driessens, K., & Ramon, J. (2003). Relational instance based regression for relational reinforcement learning. Proceedings of the 20th International Conference on Machine Learning (pp. 123–130). Washington, DC, USA.

Driessens, K., Ramon, J., & Blockeel, H. (2001). Speeding up relational reinforcement learning through the use of an incremental first order decision tree learner. Proceedings of the 12th European Conference on Machine Learning (pp. 97–108). Freiburg, Germany.

Džeroski, S., De Raedt, L., & Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43, 7–52.

Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6, 503–556.

Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.

Gärtner, T., Driessens, K., & Ramon, J. (2003). Graph kernels and Gaussian processes for relational reinforcement learning. International Conference on Inductive Logic Programming (pp. 146–163). Szeged, Hungary.

Getoor, L., & Taskar, B. (2007). An introduction to statistical relational learning. MIT Press.

Guestrin, C., Lagoudakis, M., & Parr, R. (2002). Coordinated reinforcement learning. Proceedings of the 19th International Conference on Machine Learning (pp. 227–234). Sydney, Australia.

Gutmann, B., & Kersting, K. (2006). TildeCRF: Conditional random fields for logical sequences. Proceedings of the 17th European Conference on Machine Learning (pp. 174–185). Berlin, Germany.

Kersting, K., van Otterlo, M., & De Raedt, L. (2004). Bellman goes relational. Proceedings of the Twenty-First International Conference on Machine Learning (ICML-2004) (pp. 465–472). Banff, Canada.

Konda, V. R., & Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42, 1143–1166.

McCallum, A. (1996). Reinforcement learning with selective perception and hidden state. Doctoral dissertation, University of Rochester, Rochester, NY, USA.

Munos, R. (2006). Policy gradient in continuous time. Journal of Machine Learning Research (JMLR), 7, 771–791.

Riedmiller, M. (2005). Neural fitted Q iteration - First experiences with a data efficient neural reinforcement learning method. Proceedings of the 16th European Conference on Machine Learning (pp. 317–328). Porto, Portugal.

Sanner, S., & Boutilier, C. (2005). Approximate linear programming for first-order MDPs. Proceedings of the 21st Conference on Uncertainty in AI (UAI) (pp. 509–517). Edinburgh, Scotland.
Slaney, J., & Thiébaux, S. (2001). Blocks world revisited. Artificial Intelligence, 125, 119–153.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA: The MIT Press.

Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12 (pp. 1057–1063). MIT Press.

Uther, W., & Veloso, M. (1998). Tree based discretization for continuous state space reinforcement learning. Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98) (pp. 769–774). Madison, WI, USA.

Wang, C., Joshi, S., & Khardon, R. (2007). First order decision diagrams for relational MDPs. Proceedings of the 20th International Joint Conference on Artificial Intelligence (pp. 1095–1100). Hyderabad, India: AAAI Press.

Wang, X., & Dietterich, T. (2003). Model-based policy gradient reinforcement learning. Proceedings of the 20th International Conference on Machine Learning (pp. 776–783). Washington, DC, USA.

Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.