Reinforcement Learning As Classification: Leveraging Modern Classifiers
Michail G. Lagoudakis [email protected]
Ronald Parr [email protected]
Department of Computer Science, Duke University, Durham, NC 27708 USA

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Abstract

The basic tools of machine learning appear in the inner loop of most reinforcement learning algorithms, typically in the form of Monte Carlo methods or function approximation techniques. To a large extent, however, current reinforcement learning algorithms draw upon machine learning techniques that are at least ten years old and, with a few exceptions, very little has been done to exploit recent advances in classification learning for the purposes of reinforcement learning. We use a variant of approximate policy iteration based on rollouts that allows us to use a pure classification learner, such as a support vector machine (SVM), in the inner loop of the algorithm. We argue that the use of SVMs, particularly in combination with the kernel trick, can make it easier to apply reinforcement learning as an "out-of-the-box" technique, without extensive feature engineering. Our approach opens the door to modern classification methods, but does not preclude the use of classical methods. We present experimental results in the pendulum balancing and bicycle riding domains using both SVMs and neural networks for classifiers.

1. Introduction

Reinforcement learning provides an intuitively appealing framework for addressing a wide variety of planning and control problems (Sutton & Barto, 1998). Compelling convergence results exist for small state spaces (Jaakkola et al., 1994) and there has been some success in tackling large state spaces through the use of value function approximation and/or search through a space of parameterized policies (Williams, 1992).

Despite the successes of reinforcement learning, some frustration remains about the extent and difficulty of feature engineering required to achieve success. This is true both of value function methods, which require a rich set of features to represent accurately the value of a policy, and policy search methods, which require a parameterized policy function that is both rich enough to express interesting policies, yet smooth enough to ensure that good policies can be discovered. The validity of such criticisms is debatable (since nearly all practical applications of machine learning methods require some feature engineering), but there can be no question that recent advances in classifier learning have raised the bar. For example, it is not uncommon to hear anecdotal reports of the naive application of support vector machines matching or exceeding the performance of classical approaches with careful feature engineering.

We are not the first to note the potential benefits of modern classification methods to reinforcement learning. For example, Yoon et al. (2002) use inductive learning techniques, including bagging, to generalize across similar problems. Dietterich and Wang (2001) also use a kernel-based approximation method to generalize across similar problems. The novelty in our approach is its orientation towards the application of modern classification methods within a single, noisy problem at the inner loop of a policy iteration algorithm. By using rollouts and a classifier to represent policies, we avoid the sometimes problematic step of value function approximation. Thus we aim to address the critiques of value function methods raised by the proponents of direct policy search, while avoiding the confines of a parameterized policy space.

We note that in recent work, Fern, Yoon and Givan (2003) also examine policy iteration with rollouts and an inductive learner at the inner loop. However, their emphasis is different. They focus on policy space bias as a means of searching a rich space of policies, while we emphasize modern classifiers as a method of recovering high-dimensional structure in policies.
2. Basic definitions and algorithms

In this section we review the basic definitions for Markov Decision Processes (MDPs), policy iteration, approximate policy iteration and rollouts. This is intended primarily as a review and to familiarize the reader with our notation. More extensive discussions of these topics are widely available (Bertsekas & Tsitsiklis, 1996).

2.1. MDPs

An MDP is defined as a 6-tuple (S, A, P, R, γ, D) where: S is the state space of the process; A is a finite set of actions; P is a Markovian transition model, where P(s, a, s′) is the probability of making a transition to state s′ when taking action a in state s; R is a reward (or cost) function, such that R(s, a) is the expected reward for taking action a in state s; γ ∈ [0, 1) is the discount factor for future rewards; and D is the initial state distribution from which states are drawn when the process is initialized.

In reinforcement learning, it is assumed that the learner can observe the state of the process and the immediate reward at every step; however, P and R are completely unknown. In this paper, we also make the assumption that our learning algorithm has access to a generative model of the process, which is a black box that takes a state s and an action a as inputs and outputs a next state s′ drawn from P and a reward r. Note that this is not the same as having the model (P and R) itself.

A deterministic policy π for an MDP is a mapping π : S → A from states to actions, where π(s) is the action the agent takes at state s. The value V_π(s) of a state s under a policy π is the expected, total, discounted reward when the process begins in state s and all decisions at all steps are made according to policy π:

    V_π(s) = E[ ∑_{t=0}^{∞} γ^t R(s_t, π(s_t)) | s_0 = s ] .

The goal of the decision maker is to find an optimal policy π* that maximizes the expected, total, discounted reward from the initial state distribution D:

    π* = arg max_π η(π, D) = E_{s∼D}[ V_π(s) ] .

It is known that for every MDP, there exists at least one optimal deterministic policy.

2.2. Policy Iteration

Policy iteration is a method of discovering such a policy by iterating through a sequence π_1, π_2, ..., π_k of improving policies. The iteration terminates when there is no change in the policy (π_k = π_{k−1}), and π_k is the optimal policy. This is typically achieved by computing V_{π_i}, which can be done iteratively or by solving a system of linear equations, and then determining a set of Q-values:

    Q_{π_i}(s, a) = R(s, a) + γ ∑_{s′} P(s, a, s′) V_{π_i}(s′) ,

and the improved policy as

    π_{i+1}(s) = arg max_{a∈A} Q_{π_i}(s, a) .

In practice, policy iteration terminates in a surprisingly small number of steps. However, it relies on the availability of the full model of the MDP, exact evaluation of each policy, and exact representation of each policy.
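For concreteness, here is a minimal Python sketch of exact (tabular) policy iteration as described above, assuming the full model is available as NumPy arrays P (shape S x A x S, holding P(s, a, s′)) and R (shape S x A); the array layout and names are illustrative, not part of the original description.

import numpy as np

def policy_iteration(P, R, gamma):
    # P: transition model, P[s, a, s'] = probability of reaching s' from s under a
    # R: expected rewards, R[s, a]
    # gamma: discount factor in [0, 1)
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    while True:
        # Policy evaluation: solve the linear system (I - gamma * P_pi) V = R_pi
        P_pi = P[np.arange(n_states), pi]              # (S, S) transitions under pi
        R_pi = R[np.arange(n_states), pi]              # (S,) rewards under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Q-values: Q(s, a) = R(s, a) + gamma * sum_{s'} P(s, a, s') V(s')
        Q = R + gamma * P @ V
        # Policy improvement: act greedily with respect to Q
        pi_next = Q.argmax(axis=1)
        if np.array_equal(pi_next, pi):                # no change: pi is optimal
            return pi, V
        pi = pi_next

The loop stops when the greedy policy no longer changes, mirroring the termination condition π_k = π_{k−1} above.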
2.3. Approximate methods

Approximate methods are frequently used when the state space of the underlying MDP is extremely large and exact (solution or learning) methods fail. A general framework of using approximation within policy iteration is known as approximate policy iteration (API). In its most general form, API assumes a policy π̂_i at each step, which may not necessarily be an exact implementation of the improved policy from the previous time step. Typically, the error introduced is bounded by the maximum norm (L∞) error in what is presumed to be an approximate Q̂_{π_{i−1}} that was used to generate π̂_i (Bertsekas & Tsitsiklis, 1996). API cannot guarantee monotonic improvement and convergence to the optimal policy. However, in practice it often finds very good policies in a few iterations, since it normally makes big steps in the space of possible policies. This is in contrast to policy gradient methods which, despite acceleration methods, are often forced to take very small steps.

LSPI (Lagoudakis & Parr, 2001) is an example of an API algorithm. It uses temporal differences of sample transitions between state-action pairs to approximate each Q_π. It is also possible to use a technique closer to pure Monte Carlo evaluation, called rollouts. Rollouts estimate Q_π(s, a) by executing action a in state s and following policy π thereafter, while recording the total discounted reward obtained during the run. This process is repeated several times, and it requires a simulator with resets to arbitrary states (or a generative model), since a large number of trajectories is required to obtain an accurate estimate of Q_π(s, a). Rollouts were first used by Tesauro and Galperin (1997) to improve the online performance of a backgammon player. (A supercomputer did rollouts before selecting each move.) However, they can also be used to specify a set of target values for a function approximator used within API to estimate Q_π. The rollout algorithm is summarized in Figure 1.

3. API without Value Functions

The dependence of typical API algorithms on value functions places continuous function approximation methods at the inner loop of most approximate policy iteration algorithms. These methods typically minimize L2 error, which is a poor match with the L∞ bounds for API. This problem is not just theoretical. Efforts to improve performance, such as adding new features to a neural network or new basis functions to LSPI, don't always have the expected effect. Increasing expressive power can sometimes lead to a

Rollout (M, s, a, γ, π, K, T)
// M : Generative model
// (s, a) : State-action pair whose value is sought
// γ : Discount factor
// π : Policy
// K : Number of trajectories
// T : Length of each trajectory
for k = 1 to K
    (s′, r) ← SIMULATE(M, s, a)
    Q̃_k ← r
    x ← s′                          // x holds the current state of this trajectory
    for t = 1 to T − 1
        (s′, r) ← SIMULATE(M, x, π(x))
        Q̃_k ← Q̃_k + γ^t r
        x ← s′
Q̃ ← (1/K) ∑_{k=1}^{K} Q̃_k
return Q̃

Figure 1. The rollout algorithm for estimating Q_π(s, a).

The corresponding approximate policy iteration loop, which uses rollouts and a classifier (Learn) in place of a value function approximator:

(M, Dρ, γ, π_0, K, T)
// M : Generative model
// Dρ : Source of rollout states
// γ : Discount factor
// π_0 : Initial policy (default: uniformly random)
// K : Number of trajectories
// T : Length of each trajectory
π′ = π_0
repeat
    π = π′
    TS = ∅
    for each s ∈ Dρ
        for each a ∈ A
            Q̃^π(s, a) ← Rollout(M, s, a, γ, π, K, T)
        a* = arg max_{a∈A} Q̃^π(s, a)
        if ∀ a ∈ A, a ≠ a* : Q̃^π(s, a) < Q̃^π(s, a*)
            TS ← TS ∪ {(s, a*)+}
        for each a ∈ A : Q̃^π(s, a) < Q̃^π(s, a*)
            TS ← TS ∪ {(s, a)−}
    π′ = Learn(TS)
until π ≈ π′
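As a rough illustration of the two algorithms above, the Python sketch below estimates Q_π(s, a) by rollouts and builds a classification training set from the rollout states. It assumes a generative model exposed as a function simulate(s, a) returning (s′, r), folds the positive and negative examples of the loop above into a single multi-class label per state, and uses a simple margin where the figure compares rollout estimates; all names and the margin value are illustrative, not taken from the paper.

import numpy as np

def rollout_q(simulate, s, a, gamma, policy, K, T):
    # Monte Carlo estimate of Q_pi(s, a): take action a in state s, then
    # follow policy for T - 1 more steps; average over K trajectories.
    total = 0.0
    for _ in range(K):
        state, reward = simulate(s, a)
        q = reward
        for t in range(1, T):
            state, reward = simulate(state, policy(state))
            q += (gamma ** t) * reward
        total += q
    return total / K

def build_training_set(simulate, rollout_states, actions, gamma, policy,
                       K, T, margin=1e-3):
    # One improvement step: keep a state only if its best action beats all
    # other actions by the given margin, and label the state with that action.
    X, y = [], []
    for s in rollout_states:
        q = {a: rollout_q(simulate, s, a, gamma, policy, K, T) for a in actions}
        a_star = max(q, key=q.get)
        if all(q[a] < q[a_star] - margin for a in actions if a != a_star):
            X.append(np.asarray(s, dtype=float))
            y.append(a_star)
    return np.array(X), np.array(y)

# A new policy is obtained by fitting any classifier with a fit/predict
# interface (e.g., an SVM) to (X, y) and acting on its predictions:
#     clf.fit(X, y)
#     new_policy = lambda s: clf.predict(np.asarray(s, dtype=float).reshape(1, -1))[0]
# Repeating the labeling and refitting until the policy stops changing gives
# the rollout-based approximate policy iteration loop sketched above.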