Improving Action Selection in MDP’s via Knowledge Transfer

Alexander A. Sherstov and Peter Stone
Department of Computer Sciences, The University of Texas at Austin
Austin, TX 78712 USA
{sherstov, pstone}@cs.utexas.edu

Abstract

Temporal-difference reinforcement learning (RL) has been successfully applied in several domains with large state sets. Large action sets, however, have received considerably less attention. This paper demonstrates the use of knowledge transfer between related tasks to accelerate learning with large action sets. We introduce action transfer, a technique that extracts the actions from the (near-)optimal solution to the first task and uses them in place of the full action set when learning any subsequent tasks. When optimal actions make up a small fraction of the domain's action set, action transfer can substantially reduce the number of actions and thus the complexity of the problem. However, action transfer between dissimilar tasks can be detrimental. To address this difficulty, we contribute randomized task perturbation (RTP), an enhancement to action transfer that makes it robust to unrepresentative source tasks. We motivate RTP action transfer with a detailed theoretical analysis featuring a formalism of related tasks and a bound on the suboptimality of action transfer. The empirical results in this paper show the potential of RTP action transfer to substantially expand the applicability of RL to problems with large action sets.

Introduction

Temporal-difference reinforcement learning (RL) (Sutton & Barto 1998) has proven to be an effective approach to sequential decision making. However, large state and action sets remain a stumbling block for RL. While large state sets have seen much work in recent research (Tesauro 1994; Crites & Barto 1996; Stone & Sutton 2001), large action sets have been explored to but a limited extent (Santamaria, Sutton, & Ram 1997; Gaskett, Wettergreen, & Zelinsky 1999).

Our work aims to leverage similarities between tasks to accelerate learning with large action sets. We consider cases in which a learner is presented with two or more related tasks with identical action sets, all of which must be learned; since real-world problems are rarely handled in isolation, this setting is quite common. This paper explores the idea of extracting the subset of actions that are used by the (near-)optimal solution to the first task and using them instead of the full action set to learn more efficiently in any subsequent tasks, a method we call action transfer. In many domains with large action sets, significant portions of the action set are irrelevant from the standpoint of optimal behavior. Consider, for example, a pastry chef experimenting with a new recipe. Several parameters, such as oven temperature and time to rise, need to be determined. But based on past experience, only a small range of values is likely to be worth testing. Similarly, when driving a car, the same safe-driving practices (gradual acceleration, minor adjustments to the wheel) apply regardless of the terrain or destination. Finally, a bidding agent in an auction can raise a winning bid by any amount, but past experience may suggest that only a small number of raises are worth considering. In all these settings, action transfer reduces the action set and thereby accelerates learning.

Action transfer relies on the similarity of the tasks involved; if the first task is not representative of the others, action transfer can handicap the learner. If many tasks are to be learned, a straightforward remedy would be to transfer actions from multiple tasks, learning each from scratch with the full action set. However, in some cases the learner may not have access to a representative sample of tasks in the domain. Furthermore, the cost of learning multiple tasks with the full action set could be prohibitive.

We therefore focus on the harder problem of identifying the domain's useful actions by learning as few as one task with the full action set, and tackling all subsequent tasks with the resulting reduced action set. We propose a novel algorithm, action transfer with randomized task perturbation (RTP), that performs well even when the first task is misleading. In addition to action transfer and RTP, this paper contributes: (i) a formalism of related tasks that augments the MDP definition and decomposes it into task-specific and domain-wide components; and (ii) a bound on the suboptimality of regular action transfer between related tasks, which motivates RTP action transfer theoretically. We present empirical results in several learning settings, showing the superiority of RTP action transfer to regular action transfer and to learning with the full action set.
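Before the formal development, the core idea of plain action transfer can be stated in a few lines. The sketch below assumes the auxiliary task has already been solved and its Q-values stored in a table q_aux[(state, action)]; all identifiers are illustrative, not the paper's.

```python
# A minimal sketch of plain action transfer, assuming a tabular auxiliary solution
# stored as q_aux[(state, action)].  All identifiers are illustrative.

def extract_optimal_actions(q_aux, states, actions):
    """Return the set of actions that are greedy in at least one auxiliary state."""
    transferred = set()
    for s in states:
        transferred.add(max(actions, key=lambda a: q_aux[(s, a)]))
    return transferred

# Subsequent (primary) tasks would then be learned with the reduced action set:
#   reduced = extract_optimal_actions(q_aux, aux_states, full_action_set)
#   learn(primary_task, action_set=reduced)   # hypothetical training call
```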

Preliminaries

A Markov decision process (MDP), illustrated in Figure 1, is a quadruple ⟨S, A, t, r⟩, where S is a set of states; A is a set of actions; t : S × A → Pr(S) is a transition function indicating a probability distribution over the next states upon taking a given action in a given state; and r : S × A → R is a reward function indicating the immediate payoff upon taking a given action in a given state. Given a sequence of rewards r_0, r_1, ..., r_n, the associated return is Σ_{i=0}^{n} γ^i r_i, where 0 ≤ γ ≤ 1 is the discount factor. Given a policy π : S → A for acting, its associated value function V^π : S → R yields, for every state s ∈ S, the expected return from starting in state s and following π. The objective is to find an optimal policy π* : S → A whose value function dominates that of any other policy at every state.

Figure 1: MDP formalism. [diagram not reproduced]

The learner experiences the world as a sequence of states, actions, and rewards, with no prior knowledge of the functions t and r. A practical vehicle for learning in this setting is the Q-value function Q : S × A → R, defined as Q^π(s, a) = r(s, a) + γ Σ_{s'∈S} t(s' | s, a) V^π(s'). The widely used Q-learning algorithm (Watkins 1989) incrementally approximates the Q-value function of the optimal policy.

As a running example and experimental testbed, we introduce a novel grid world domain (Figure 2) featuring discrete states but continuous actions. Some cells are empty; others are occupied by a wall or a bed of quicksand. One cell is designated as a goal. The actions are of the form (d, p), where d ∈ {NORTH, SOUTH, EAST, WEST} is an intended direction of travel and p ∈ [0.5, 0.9] is a continuous parameter. The intuitive meaning of p is as follows. Small values of p are safe in that they minimize the probability of a move in an undesired direction, but result in slow progress (i.e., no change of cell is a likely outcome). By contrast, large values of p increase the likelihood of movement, albeit sometimes in the wrong direction. Formally, the move succeeds in the requested direction d with probability p; lateral movement (in one of the two lateral directions, chosen at random) takes place with probability (2p − 1)/8; and no change of cell results with probability (9 − 10p)/8. Note that p = 0.5 and p = 0.9 are the extreme cases: the former prevents lateral movement; the latter forces a change of cell. Moves into walls or off the grid-world edge cause no change of cell.

Figure 2: The grid world domain; cell types are empty, wall, quicksand, and goal. [diagram not reproduced]
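The move dynamics above can be summarized in a few lines. This is a sketch under the stated probabilities; the function and constant names are our own choices, not the paper's.

```python
import random

# A sketch of the grid-world move dynamics described above: the move succeeds in the
# requested direction with probability p, slips to one of the two lateral directions
# with total probability (2p - 1)/8, and otherwise stays put with probability
# (9 - 10p)/8.  Names are ours.

LATERAL = {"NORTH": ("EAST", "WEST"), "SOUTH": ("EAST", "WEST"),
           "EAST": ("NORTH", "SOUTH"), "WEST": ("NORTH", "SOUTH")}

def sample_outcome(d, p):
    """Sample the realized outcome of action (d, p), with 0.5 <= p <= 0.9."""
    roll = random.random()
    if roll < p:
        return d                          # intended direction
    if roll < p + (2 * p - 1) / 8:
        return random.choice(LATERAL[d])  # lateral slip
    return "STAY"                         # no change of cell
```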
The reward dynamics are as follows. The discount rate is γ = 0.95. The goal and quicksand cells are absorbing states with reward 0.5 and −0.5, respectively. All other actions generate a reward of −p², making fast actions more expensive than slow ones. The optimal policy is always to move toward the goal, taking slow, inexpensive actions (0.5 ≤ p ≤ 0.6) far from the goal or near quicksand, and faster, more expensive actions (0.6 < p ≤ 0.65) when close to the goal. The fastest 62% of the actions (0.65 < p ≤ 0.9) do not prove useful in this domain. Thus, ignoring them cannot hurt the quality of the best attainable policy. In fact, eliminating them decreases the complexity of the problem and can speed up learning considerably, a key premise in our work.

This research pertains to large action sets but does not require that they be continuous. In all experiments, we discretize the p range at 0.01 increments, resulting in a full action set of size 164. Since nearby actions have similar effects, generalization in the action space remains useful. The above intuitive grid world domain serves to simplify the exposition and to enable a precise, focused empirical study of our methods. However, our work applies broadly to any domain in which the actions are not equally relevant.

A Formalism for Related Tasks

The traditional MDP definition as a quadruple ⟨S, A, t, r⟩ is adequate for solving problems in isolation. However, it is not expressive enough to capture similarities across problems and is thus poorly suited for analyzing knowledge transfer. As an example, consider two grid world maps. The abstract reward and transition dynamics are the same in both cases. However, the MDP definition postulates t and r as functions over S × A. Since different maps give rise to different state sets, their functions t and r are formally distinct and largely incomparable, failing to capture the similarity of the reward and transition dynamics in the two cases. Our new MDP formalism overcomes this difficulty by using outcomes and classes to remove the undesirable dependence of the model description (t and r) on the state set.

Outcomes
Rather than specifying the effects of an action as a probability distribution Pr(S) over next states, we specify it as a probability distribution Pr(O) over outcomes (Boutilier, Reiter, & Price 2001). O is the set of "nature's choices," or deterministic actions under nature's control. In our domain, these are NORTH, SOUTH, EAST, WEST, and STAY. Corresponding to every action a ∈ A available to the learner is a probability distribution (possibly different in different states) over O. When a is taken, nature "chooses" an outcome for execution according to that probability distribution. In the new definition t : S × A → Pr(O), the range Pr(O) is common to all tasks, unlike the original range Pr(S). The semantics of the outcome set is made rigorous in the definitions below.

Note that the qualitative effect of a given outcome differs from state to state. From many states, the outcome EAST corresponds to a transition to the cell just right of the current location. However, when standing to the left of a wall, the outcome EAST leads to a "transition" back to the current state. How an outcome in a state is mapped to the actual next state is map-specific and will be part of a task description, rather than the domain definition.

Classes
Classes C, common to all tasks, generalize the remaining occurrences of S in t and r. Each state in a task is labeled with a class from among C. An action's reward and transition dynamics are identical in all states of the same class. Formally, for all a ∈ A and s_1, s_2 ∈ S, κ(s_1) = κ(s_2) implies r(s_1, a) = r(s_2, a) and t(s_1, a) = t(s_2, a), where κ(·) denotes the class of a state. Classes allow the definition of t and r as functions over C × A, a set common to all tasks, rather than the task-specific set S × A. Combining classes with outcomes enables a task-independent description of the transition and reward dynamics: t : C × A → Pr(O) and r : C × A → R.
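The resulting separation between what is shared by all tasks and what is task-specific can be pictured as two records, anticipating Definitions 1 and 2 below. This is a sketch whose field names and types are ours, not the paper's.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Set

# A sketch of the split between domain-wide and task-specific components,
# anticipating Definitions 1 and 2 below.  Field names and types are ours.

@dataclass
class Domain:
    actions: Set[Hashable]                                    # A
    classes: Set[Hashable]                                    # C
    outcomes: Set[Hashable]                                   # O
    t: Callable[[Hashable, Hashable], Dict[Hashable, float]]  # t : C x A -> Pr(O)
    r: Callable[[Hashable, Hashable], float]                  # r : C x A -> R

@dataclass
class Task:
    states: Set[Hashable]                          # S
    kappa: Callable[[Hashable], Hashable]          # state classification, S -> C
    eta: Callable[[Hashable, Hashable], Hashable]  # next-state function, S x O -> S
```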

To illustrate the finalized descriptions of t and r, consider the grid world domain. It features three classes, corresponding to the empty, goal, and quicksand cells. The reward and transition dynamics are the same in each class. Namely, the reward for action (d, p) is −p² in cells of the "empty" class, 0.5 in cells of the "goal" class, and −0.5 in cells of the "quicksand" class. Likewise, an action (NORTH, p) has the same distribution over the outcome set {NORTH, SOUTH, EAST, WEST, STAY} within each class: it is [0 0 0 0 1]^T for all s in the "goal" and "quicksand" classes, and [p 0 (p − 0.5)/8 (p − 0.5)/8 (9 − 10p)/8]^T for states in class "empty"; similarly for (SOUTH, p), etc.

Complete Formalism
The above discussion casts the transition and reward dynamics of a domain abstractly in terms of outcomes and classes. A task within a domain is fully specified by its state set S, a mapping κ : S → C from its states to the classes, and a specification η : S × O → S of the next state given the current state and an outcome. Thus, the defining feature of a task is its state set S, which the functions κ and η interface to the abstract domain model. Figure 3 illustrates the complete formalism, emphasizing the separation of what is common to all tasks in the domain from the specifics of individual tasks. Note the contrast with the original MDP formalism in Figure 1. Formally, domains and tasks are defined as follows:

Definition 1  A domain is a quintuple ⟨A, C, O, t, r⟩, where A is a set of actions; C is a set of state classes; O is a set of action outcomes; t : C × A → Pr(O) is a transition function; and r : C × A → R is a reward function.

Definition 2  A task within the domain ⟨A, C, O, t, r⟩ is a triple ⟨S, κ, η⟩, where S is a set of states; κ : S → C is a state classification function; and η : S × O → S is a next-state function.

Figure 3: The formalism of related tasks in a domain. [diagram not reproduced]
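To make Definition 1 concrete for the grid world, here is a sketch of its class-level model (t and r over C × A) under the dynamics described earlier; the dictionary layout and function names are ours.

```python
# A sketch of the grid world's class-level model (t and r over C x A).  Outcome and
# class labels mirror the text; the dictionary layout and function names are ours.

OUTCOMES = ("NORTH", "SOUTH", "EAST", "WEST", "STAY")
LATERAL = {"NORTH": ("EAST", "WEST"), "SOUTH": ("EAST", "WEST"),
           "EAST": ("NORTH", "SOUTH"), "WEST": ("NORTH", "SOUTH")}

def reward(c, action):
    d, p = action
    return {"empty": -p * p, "goal": 0.5, "quicksand": -0.5}[c]

def transition(c, action):
    """Distribution over OUTCOMES for taking (d, p) in a state of class c."""
    d, p = action
    if c in ("goal", "quicksand"):
        return {o: (1.0 if o == "STAY" else 0.0) for o in OUTCOMES}  # absorbing
    dist = {o: 0.0 for o in OUTCOMES}
    dist[d] = p
    for side in LATERAL[d]:
        dist[side] = (p - 0.5) / 8.0          # each lateral direction
    dist["STAY"] = (9.0 - 10.0 * p) / 8.0
    return dist
```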
Action Transfer: A Suboptimality Bound

Let Ã* = {a ∈ A : π*(s) = a for some s ∈ S} be the optimal action set of an auxiliary task, and let A* be the true optimal action set of the primary task. In action transfer, the primary task is learned using the transferred action set Ã*, in the hope that Ã* is "similar" to A*. If A* ⊈ Ã*, the best policy π̃* achievable with the transferred action set in the primary task may be suboptimal. This section bounds the decrease in the highest attainable value of a state of the primary task due to the replacement of the full action set A with Ã*. The bound will suggest a principled way to cope with unrepresentative auxiliary experience.

In the related-task formalism above, a given state s can be succeeded by at most |O| states s_1, s_2, ..., s_{|O|} (not necessarily distinct), where s_i denotes the state that results if the i-th outcome occurs. Suppose an oracle were to reveal the optimal values of these successor states; given a task, these values are well-defined. We refer to the resulting vector v = [V*(s_1) V*(s_2) ... V*(s_{|O|})]^T as the outcome value vector (OVV) of state s. OVV's are intimately linked to optimal actions: v immediately identifies the optimal action at s, π*(s) = arg max_{a∈A} {r(c, a) + γ t(c, a) · v}, where c = κ(s) is the class of s. Consider now the set of all OVV's of a task, grouped by the classes of their corresponding states: U = ⟨U_{c_1}, U_{c_2}, ..., U_{c_{|C|}}⟩, where U_{c_i} denotes the set of OVV's of states of class c_i. Together, the OVV's determine the task's optimal action set in its entirety.

Definition 3  Let U = ⟨U_{c_1}, U_{c_2}, ..., U_{c_{|C|}}⟩ and Ũ = ⟨Ũ_{c_1}, Ũ_{c_2}, ..., Ũ_{c_{|C|}}⟩ be the OVV sets of the primary and auxiliary tasks, respectively. The dissimilarity of the primary and auxiliary tasks, denoted ∆(U, Ũ), is

∆(U, Ũ) := max_{c∈C} max_{u∈U_c} min_{ũ∈Ũ_c} ||u − ũ||_2.

Intuitively, dissimilarity ∆(U, Ũ) is the worst-case distance between an OVV in the primary task and the nearest OVV of the same class in the auxiliary task. The notion of dissimilarity allows us to establish the desired suboptimality bound (see Appendix for a proof):

Theorem 1  Let Ã* be the optimal action set of the auxiliary task. Replacing the full action set A with Ã* reduces the highest attainable value of a state in the primary task by at most ∆(U, Ũ) · √2 γ/(1 − γ), where U and Ũ are the OVV sets of the primary and auxiliary tasks, respectively.
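Definition 3 and Theorem 1 translate directly into a few lines of code. The sketch below assumes the OVV sets are stored as dictionaries mapping each class to a list of equal-length numeric vectors; the function names are ours.

```python
import math

# A sketch of the dissimilarity measure of Definition 3 and the bound of Theorem 1.
# OVV sets are dictionaries {class: [vector, ...]}; function names are ours.

def dissimilarity(U, U_tilde):
    """Worst-case distance from a primary OVV to the nearest auxiliary OVV of its class."""
    worst = 0.0
    for c, vectors in U.items():
        for u in vectors:
            nearest = min(math.dist(u, u_t) for u_t in U_tilde[c])
            worst = max(worst, nearest)
    return worst

def suboptimality_bound(U, U_tilde, gamma):
    """Theorem 1: maximum loss in attainable state value from using transferred actions."""
    return dissimilarity(U, U_tilde) * math.sqrt(2) * gamma / (1.0 - gamma)
```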

Randomized Task Perturbation

Theorem 1 implies that learning with the transferred actions is safe if every OVV in the primary task has in its vicinity an OVV of the same class in the auxiliary task. We confirm this expectation below with action transfer across similar tasks. However, two dissimilar tasks can have very different OVV makeups and thus possibly different optimal action sets. This section studies a detrimental instance of action transfer in light of Theorem 1 and proposes a more sophisticated approach that is robust to misleading auxiliary tasks.

Detrimental Action Transfer
Consider the auxiliary and primary tasks in Figure 4. In one case, the goal is in the southeast corner; in the other, it is moved to a northwesterly location. The optimal policy for the auxiliary task, shown in Figure 4, includes only SOUTH and EAST actions. The primary task features all four directions of travel in its optimal policy. Learning the primary task with actions transferred from the auxiliary task is thus a largely doomed endeavor: the goal will be practically unreachable from most cells.

Figure 4: A pair of auxiliary and primary tasks, along with their optimal policies and value functions (rounded to integers). [diagram not reproduced]

RTP action transfer
To do well with unrepresentative auxiliary experience, the learner must sample the domain's OVV space not reflected in the auxiliary task. Randomized task perturbation (RTP) allows for a more thorough exposure to the domain's OVV space while learning in the same auxiliary task. The method works by internally distorting the optimal value function of the auxiliary task, thereby inducing an artificial new task while operating in the same environment. RTP action transfer learns the optimal policy and optimal actions in the artificial and original tasks.

Figure 5 illustrates the workings of RTP action transfer. RTP distorts the optimal value function of the original task (Figure 5a) by randomly selecting a small fraction φ of the states and labeling them with randomly chosen values, drawn uniformly from [v_min, v_max]. Here v_min = r_min/(1 − γ) and v_max = r_max/(1 − γ) are the smallest and largest state values in the domain; the smallest and largest one-step rewards r_min and r_max are estimated or learned. The selected states form a set F of fixed-valued states. Figure 5b shows these states and their assigned values on a sample run with φ = 0.2. RTP action transfer learns the value function of the artificial task by treating the values of the states in F as constant, and by iteratively refining the other states' values via Q-learning. Figure 5c illustrates the resulting optimal values. Note that the fixed-valued states have retained their assigned values, and the other states' values have been computed with regard to these fixed values.

RTP created an artificial task quite different from the original. The optimal policy in Figure 5d features all four directions of travel, despite the goal's southeast location. We ignore the action choices in F since those states are semantically absorbing. The p components (not shown in the figure) of the resulting actions are in the useful range [0.5, 0.65], a marked improvement over the full action set, in which 62% of the actions are in the useless range (0.65, 0.9].

Figure 5: RTP action transfer at work: original auxiliary task (a); random choice of fixed-valued states and their values (b); new optimal value function (c, rounded to integers) and policy (d). [diagram not reproduced]

In terms of the formal analysis above, the combined (original + artificial) OVV set in RTP action transfer is closer to, or at least no farther from, the primary task's OVV set than is the OVV set of the original auxiliary task alone. The algorithm thereby reduces the dissimilarity of the two tasks and improves the suboptimality guarantees of Theorem 1. Figure 6 specifies RTP transfer embedded in Q-learning:

    add each s ∈ S to F with probability φ
    foreach s ∈ F do
        random-value ← rand(v_min, v_max)
        Q+(s, a) ← random-value for all a ∈ A
    repeat
        s ← current state, a ← π(s)
        take action a; observe reward r and new state s'
        Q(s, a) ←α r + γ max_{a'∈A} Q(s', a')
        if s ∈ S \ F then Q+(s, a) ←α r + γ max_{a'∈A} Q+(s', a')
    until converged
    A* ← ∪_{s∈S} {arg max_{a∈A} Q(s, a)}
    A+ ← ∪_{s∈S\F} {arg max_{a∈A} Q+(s, a)}
    return A* ∪ A+

Figure 6: RTP action transfer in pseudocode. The left arrow indicates regular assignment; x ←α y denotes x ← (1 − α)x + αy.
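The following is one possible Python rendering of Figure 6, assuming a small tabular, episodic environment that exposes reset() returning a state and step(action) returning (next_state, reward, done). All identifiers (env, num_steps, epsilon, and so on) are ours, not the paper's.

```python
import random

# A sketch of RTP action transfer embedded in tabular Q-learning (Figure 6).
# The environment interface and all identifiers are assumptions of ours.

def rtp_action_transfer(env, states, actions, gamma, alpha, phi,
                        v_min, v_max, num_steps, epsilon=0.1):
    F = {s for s in states if random.random() < phi}        # fixed-valued states
    Q  = {(s, a): 0.0 for s in states for a in actions}     # original task
    Qp = {(s, a): 0.0 for s in states for a in actions}     # perturbed (artificial) task
    for s in F:
        value = random.uniform(v_min, v_max)                # one random value per state
        for a in actions:
            Qp[(s, a)] = value

    def greedy(table, s):
        return max(actions, key=lambda a: table[(s, a)])

    s = env.reset()
    for _ in range(num_steps):
        a = random.choice(actions) if random.random() < epsilon else greedy(Q, s)
        s2, r, done = env.step(a)
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, greedy(Q, s2))] - Q[(s, a)])
        if s not in F:                                       # states in F keep their values
            Qp[(s, a)] += alpha * (r + gamma * Qp[(s2, greedy(Qp, s2))] - Qp[(s, a)])
        s = env.reset() if done else s2

    A_star = {greedy(Q, s) for s in states}                  # optimal actions, original task
    A_plus = {greedy(Qp, s) for s in states if s not in F}   # optimal actions, artificial task
    return A_star | A_plus
```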
Notes on RTP action transfer
RTP action transfer is easy to use. The algorithm's only parameter, φ, offers a trade-off: φ ≈ 0 results in an artificial task almost identical to the original; φ ≈ 1 induces an OVV space that ignores the domain's transition and reward dynamics and is thus not representative of tasks in the domain. Importantly, RTP action transfer requires no environmental interaction of its own; it reuses the ⟨s, a, r, s'⟩ quadruples generated while learning the unmodified auxiliary task. It may be useful to run RTP action transfer several times, using the combined action set over all runs. A data-economical implementation learns all artificial Q-value functions Q+_1, Q+_2, etc., within the same algorithm. The data requirement is thus the same as in traditional Q-learning. The space and running time requirements are a modest multiple k of those of Q-learning, where k is the number of artificial tasks learned.

While RTP action transfer is a product of the related-task formalism and suboptimality analysis above, it does not rely on knowledge of the classes, outcomes, or state classification and next-state functions. As such, it is applicable to any two MDP's with a shared action set. In the case of tasks that do obey the proposed formalism, the number of outcomes is the dimension of the domain's OVV space, and the number of classes is a measure of the heterogeneity of the domain's dynamics (few classes means large regions of the state space with uniform dynamics). RTP action transfer thrives in the presence of few outcomes and few classes. RTP action transfer will also work well if the same action is optimal for many OVV's, increasing the odds of its discovery and inclusion in the transferred action set.

Extensions to Continuous Domains
RTP transfer readily extends to continuous state spaces. In this case, the set F cannot be formed from individual states; instead, F should encompass regions of the state space, each with a fixed value, whose aggregate area is a fraction φ of the state space. A practical implementation of RTP can use, e.g., tile coding (Sutton & Barto 1998), a popular function-approximation technique that discretizes the state space into regions and generalizes updates in each region to nearby regions. The method can be readily adapted to ensure that fixed-valued regions retain their values (e.g., by resetting them after every update).
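As one illustration of the adaptation just mentioned, here is a hypothetical tile-coding update that re-pins fixed-valued regions after every step. The approximator interface (active_tiles, weights) and the pinned mapping are assumptions of ours, not the paper's.

```python
# A hypothetical tile-coding update in which fixed-valued regions are re-pinned
# after every update, as suggested above.  The approximator interface is assumed.

def pinned_td_update(approx, pinned, s, a, target, alpha):
    """One TD-style update that afterwards restores the weights of pinned tiles."""
    tiles = approx.active_tiles(s, a)
    estimate = sum(approx.weights[i] for i in tiles)
    for i in tiles:
        approx.weights[i] += alpha * (target - estimate) / len(tiles)
    for i, frozen in pinned.items():       # fixed-valued regions keep their values
        approx.weights[i] = frozen
```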

Empirical Results

This section puts RTP action transfer to the test in several learning contexts, confirming its effectiveness.

Relevance-weighted action selection
A valuable vehicle for exploiting action transfer is action relevance, which we define to be the fraction of states at which an action is optimal: RELEVANCE(a) = |{s ∈ S : π*(s) = a}| / |S|. (In the case of continuous-state domains, the policy π* and the relevance computation are over a suitable discretization of the state space.) The ε-greedy action selection creates a substantial opportunity for exploiting the actions' relevances: exploratory action choices should select an action with probability equal to its relevance (estimated from the optimal solution to the auxiliary task and to its perturbed versions), rather than uniformly. The intuition here is that the likelihood of a given action a being optimal in state s is RELEVANCE(a), and it is to the learner's advantage to explore its action options in s in proportion to their optimality potential in s.

We have empirically verified the benefits of relevance-weighted action selection and used it in all experiments below. This technique allows action transfer to accelerate learning even if it does not reduce the number of actions. In this case, information about the actions' relevances alone gives the learner an appreciable advantage over the default (learning with the full action set and uniform relevances).
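A sketch of relevance-weighted ε-greedy selection as described above follows. Here pi_star maps auxiliary states to optimal actions and Q is a tabular value estimate; all identifiers are ours.

```python
import random

# A sketch of relevance-weighted epsilon-greedy exploration: exploratory choices are
# drawn in proportion to each action's relevance rather than uniformly.  Names are ours.

def relevance(pi_star, states, actions):
    """RELEVANCE(a): fraction of states at which a is optimal."""
    counts = {a: 0 for a in actions}
    for s in states:
        counts[pi_star[s]] += 1
    return {a: counts[a] / len(states) for a in actions}

def select_action(Q, s, actions, rel, epsilon=0.1):
    """Exploit greedily; explore in proportion to each action's relevance."""
    if random.random() < epsilon:
        return random.choices(actions, weights=[rel[a] for a in actions], k=1)[0]
    return max(actions, key=lambda a: Q[(s, a)])
```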
Methodology and Parameter Choices
We used Q-learning with ε = 0.1, α = 0.1, and optimistic initialization (to 10, the largest value in the domain) to compare the performance of the optimal, transferred, and full action sets in the primary task shown in Figure 2. The optimal action set was the actual set of optimal actions on the primary task, in the given discretization of the action space. The transferred action sets were obtained from the auxiliary tasks of Figure 7 by regular transfer in one case and by RTP transfer in the other (φ = 0.1 and 10 trials, picked heuristically and not optimized). Regular and RTP action transfer required 1 million episodes and an appropriate annealing regime to solve the auxiliary tasks optimally. That many episodes would be needed in any event to solve the auxiliary tasks, so the knowledge transfer generated no overhead.

The experiments used relevance-weighted ε-greedy action selection. All the 164 actions in the full set were assigned the default relevance of 1/164. In the transferred action sets, the relevance of an action was computed by definition from the optimal policy of the auxiliary task; in the case of RTP transfer, the relevances were averaged over all trials.

For function approximation in the p dimension, we used tile coding (Sutton & Barto 1998). Grid world episodes started in a random cell and ran for 100 time steps, to avoid spinning indefinitely in the absorbing goal/quicksand states. The performance criterion was the highest average state value under any policy discovered, vs. the number of episodes. The average state value was computed from the learner's policies using an external policy evaluator (value iteration) and was unrelated to the learner's own imperfect value estimates.

Results
Figure 8 plots the performance of the four action sets with different auxiliary tasks. The top of the graph (average state value ≈ 4.28) corresponds to optimal behavior. The optimal and full action-set curves are repeated in all graphs because they do not depend on the auxiliary task (however, note the different y-scale in Figure 8a).

The optimal action set is a consistent leader. The performance of regular transfer strongly depends on the auxiliary map. The first map's optimal action set features only EAST and SOUTH actions, leaving the learner unprepared for the test task and resulting in worse performance than with the full action set. Performance with the second auxiliary map is not as abysmal but is far from optimal. This is because map b does not feature slow EAST and SOUTH actions, which are common on the test map. The other two auxiliary tasks' optimal action sets resemble the test task's, allowing regular action transfer to tie with the optimal set.

RTP transfer, by contrast, consistently rivals the optimal action set. The effect of the auxiliary task on RTP transfer is minor, resulting in performance superior to the full action set even with misleading auxiliary experience. These results show the effectiveness of RTP transfer and the comparative undesirability of learning with the full and transferred action sets. We have verified that RTP transfer substantially improves on random selection of actions for the partial set. In fact, such randomly constructed action sets perform more poorly than even the full set, past an initial transient.

Figure 8: Comparative performance on auxiliary maps a-d (four panels labeled "AUXILIARY MAP: A" through "D"; curves: optimal, RTP, transferred, full; x-axis: episodes, 0 to 50,000; y-axis: average state value). Each curve is a point-wise average over 100 runs. At a 0.01 significance level, the ordering of the curves is [truncated in the source]. [plot not reproduced]

Figure 7: Auxiliary maps a, b, c, and d used in the experiments. [diagram not reproduced]

Related Work

Knowledge transfer has been applied to hierarchical (Hauskrecht et al. 1998; Dietterich 2000), first-order (Boutilier, Reiter, & Price 2001), and factored (Guestrin et al. 2003) MDP's. A limitation of this related research is the reliance on a human designer for an explicit description of the regularities in the domain's dynamics, be it in the form of matching state regions in two problems, a hierarchical policy graph, relational structure, or situation-calculus fluents and operators. RTP action transfer, while inspired by an analysis using outcomes, classes, and state classification and next-state functions, requires none of this information. It discovers and exploits the domain's regularities to the extent that they are present and requires no human guidance along the way. Furthermore, our method is robust to unrepresentative auxiliary experience.

In addition, the longstanding tradition in RL has been to attack problem complexity on the state side. For example, the above methods identify regions of the state space with similar optimal behavior. By contrast, our method simplifies the problem by identifying useful actions. A promising approach would be to combine these two lines of work.

Conclusion

This paper presents action transfer, a novel approach to knowledge transfer across tasks in domains with large action sets. The algorithm rests on the idea that actions relevant to an optimal policy in one task are likely to be relevant in other tasks. The contributions of this paper are: (i) a formalism isolating the commonalities and differences among tasks within a domain, (ii) a formal bound on the suboptimality of action transfer, and (iii) action transfer with randomized task perturbation (RTP), a more sophisticated and empirically successful knowledge-transfer approach inspired by the analysis of regular transfer. We demonstrate the effectiveness of RTP empirically in several learning settings. We intend to exploit RTP's potential to handle truly continuous action spaces, rather than merely large, discretized ones.

Acknowledgments

The authors are thankful to Raymond Mooney, Lilyana Mihalkova, and Yaxin Liu for their feedback on earlier versions of this manuscript. This research was supported in part by NSF CAREER award IIS-0237699, DARPA award HR0011-04-1-0035, and an MCD fellowship.

References

Boutilier, C.; Reiter, R.; and Price, B. 2001. Symbolic dynamic programming for first-order MDPs. In Proc. 17th International Joint Conference on Artificial Intelligence (IJCAI-01), 690–697.

Crites, R. H., and Barto, A. G. 1996. Improving elevator performance using reinforcement learning. In Touretzky, D. S.; Mozer, M. C.; and Hasselmo, M. E., eds., Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.

Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.

Gaskett, C.; Wettergreen, D.; and Zelinsky, A. 1999. Q-learning in continuous state and action spaces. In Australian Joint Conference on Artificial Intelligence, 417–428.

Guestrin, C.; Koller, D.; Gearhart, C.; and Kanodia, N. 2003. Generalizing plans to new environments in relational MDPs. In Proc. 18th International Joint Conference on Artificial Intelligence (IJCAI-03).

Hauskrecht, M.; Meuleau, N.; Kaelbling, L. P.; Dean, T.; and Boutilier, C. 1998. Hierarchical solution of Markov decision processes using macro-actions. In Proc. Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98), 220–229.

Santamaria, J. C.; Sutton, R. S.; and Ram, A. 1997. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior 6(2):163–217.

Stone, P., and Sutton, R. S. 2001. Scaling reinforcement learning toward RoboCup soccer. In Proc. 18th International Conference on Machine Learning (ICML-01), 537–544. San Francisco, CA: Morgan Kaufmann.

Sutton, R., and Barto, A. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Tesauro, G. 1994. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2):215–219.

Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, Cambridge University.

Proof of Theorem 1

Lemma 1  Let Ũ = ⟨Ũ_{c_1}, Ũ_{c_2}, ..., Ũ_{c_{|C|}}⟩ be the auxiliary task's OVV set, and let Ã* be the corresponding action set. Then for all v ∈ R^{|O|} and c ∈ C,

max_{a∈A} {r(c, a) + γ t(c, a)v} − max_{a∈Ã*} {r(c, a) + γ t(c, a)v} ≤ √2 γ min_{u∈Ũ_c} ||v − u||_2.

Proof: Let a_v = arg max_{a∈A} {r(c, a) + γ t(c, a)v}, and let a_u = arg max_{a∈A} {r(c, a) + γ t(c, a)u} for an arbitrary u ∈ Ũ_c, so that a_u ∈ Ã*. We immediately have r(c, a_v) + γ t(c, a_v)u ≤ r(c, a_u) + γ t(c, a_u)u. Therefore,

max_{a∈A} {r(c, a) + γ t(c, a)v} − max_{a∈Ã*} {r(c, a) + γ t(c, a)v}
  ≤ [r(c, a_v) + γ t(c, a_v)v] − [r(c, a_u) + γ t(c, a_u)v]
  = [r(c, a_v) − r(c, a_u)] − [γ t(c, a_u)v − γ t(c, a_v)v]
  ≤ [γ t(c, a_u)u − γ t(c, a_v)u] − [γ t(c, a_u)v − γ t(c, a_v)v]
  = γ [t(c, a_u) − t(c, a_v)] · [u − v]
  ≤ γ ||t(c, a_u) − t(c, a_v)||_2 · ||u − v||_2 ≤ √2 γ ||u − v||_2.

Since the choice of u ∈ Ũ_c was arbitrary and any other member of Ũ_c could have been chosen in its place, the lemma holds.

Let V* and Ṽ* be the optimal value functions for the primary task ⟨S, κ, η⟩ using A and Ã*, respectively, and let δ = max_{s∈S} {V*(s) − Ṽ*(s)}. Then for all s ∈ S,

Ṽ*(s) = max_{a∈Ã*} { r(κ(s), a) + γ Σ_{o∈O} t(κ(s), a, o) Ṽ*(η(s, o)) }
      ≥ max_{a∈Ã*} { r(κ(s), a) + γ Σ_{o∈O} t(κ(s), a, o) V*(η(s, o)) } − γδ.

Applying Lemma 1 and denoting by v the OVV corresponding to s in U, we obtain:

Ṽ*(s) ≥ V*(s) − √2 γ min_{ũ∈Ũ_{κ(s)}} ||v − ũ||_2 − γδ
      ≥ V*(s) − √2 γ max_{c∈C} max_{u∈U_c} min_{ũ∈Ũ_c} ||u − ũ||_2 − γδ
      = V*(s) − √2 γ ∆(U, Ũ) − γδ.

Hence, V*(s) − Ṽ*(s) ≤ δ ≤ √2 γ ∆(U, Ũ) + γδ, and therefore V*(s) − Ṽ*(s) ≤ ∆(U, Ũ) · √2 γ/(1 − γ) for all s ∈ S.
