
Reinforcement Learning as Classification: Leveraging Modern Classifiers

Michail G. Lagoudakis [email protected]
Ronald Parr [email protected]
Department of Computer Science, Duke University, Durham, NC 27708 USA

Abstract

The basic tools of machine learning appear in the inner loop of most reinforcement learning algorithms, typically in the form of Monte Carlo methods or function approximation techniques. To a large extent, however, current reinforcement learning algorithms draw upon machine learning techniques that are at least ten years old and, with a few exceptions, very little has been done to exploit recent advances in classification learning for the purposes of reinforcement learning. We use a variant of approximate policy iteration based on rollouts that allows us to use a pure classification learner, such as a support vector machine (SVM), in the inner loop of the algorithm. We argue that the use of SVMs, particularly in combination with the kernel trick, can make it easier to apply reinforcement learning as an "out-of-the-box" technique, without extensive feature engineering. Our approach opens the door to modern classification methods, but does not preclude the use of classical methods. We present experimental results in the pendulum balancing and bicycle riding domains using both SVMs and neural networks as classifiers.

1. Introduction

Reinforcement learning provides an intuitively appealing framework for addressing a wide variety of planning and control problems (Sutton & Barto, 1998). Compelling convergence results exist for small state spaces (Jaakkola et al., 1994), and there has been some success in tackling large state spaces through the use of value function approximation and/or search through a space of parameterized policies (Williams, 1992).

Despite the successes of reinforcement learning, some frustration remains about the extent and difficulty of feature engineering required to achieve success. This is true both of value function methods, which require a rich set of features to represent accurately the value of a policy, and policy search methods, which require a parameterized policy function that is both rich enough to express interesting policies, yet smooth enough to ensure that good policies can be discovered. The validity of such criticisms is debatable (since nearly all practical applications of machine learning methods require some feature engineering), but there can be no question that recent advances in classifier learning have raised the bar. For example, it is not uncommon to hear anecdotal reports of the naive application of support vector machines matching or exceeding the performance of classical approaches with careful feature engineering.

We are not the first to note the potential benefits of modern classification methods to reinforcement learning. For example, Yoon et al. (2002) use inductive learning techniques, including bagging, to generalize across similar problems. Dietterich and Wang (2001) also use a kernel-based approximation method to generalize across similar problems. The novelty in our approach is its orientation towards the application of modern classification methods within a single, noisy problem at the inner loop of a policy iteration algorithm. By using rollouts and a classifier to represent policies, we avoid the sometimes problematic step of value function approximation. Thus we aim to address the critiques of value function methods raised by the proponents of direct policy search, while avoiding the confines of a parameterized policy space.

We note that in recent work, Fern, Yoon and Givan (2003) also examine policy iteration with rollouts and an inductive learner at the inner loop. However, their emphasis is different. They focus on policy space bias as a means of searching a rich space of policies, while we emphasize modern classifiers as a method of recovering high-dimensional structure in policies.

2. Basic definitions and algorithms

In this section we review the basic definitions for Markov Decision Processes (MDPs), policy iteration, approximate policy iteration and rollouts. This is intended primarily as a review and to familiarize the reader with our notation. More extensive discussions of these topics are widely available (Bertsekas & Tsitsiklis, 1996).

2.1. MDPs

An MDP is defined as a 6-tuple (S, A, P, R, γ, D) where: S is the state space of the process; A is a finite set of actions; P is a Markovian transition model, where P(s, a, s') is the probability of making a transition to state s' when taking action a in state s; R is a reward (or cost) function, such that R(s, a) is the expected reward for taking action a in state s; γ ∈ [0, 1) is the discount factor for future rewards; and D is the initial state distribution from which states are drawn when the process is initialized.

In reinforcement learning, it is assumed that the learner can observe the state of the process and the immediate reward at every step; however, P and R are completely unknown. In this paper, we also make the assumption that our learning algorithm has access to a generative model of the process, which is a black box that takes a state s and an action a as inputs and outputs a next state s' drawn from P and a reward r. Note that this is not the same as having the model (P and R) itself.

A deterministic policy π for an MDP is a mapping π : S → A from states to actions, where π(s) is the action the agent takes at state s. The value V_π(s) of a state s under a policy π is the expected, total, discounted reward when the process begins in state s and all decisions at all steps are made according to policy π:

    V_π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t, π(s_t)) | s_0 = s ] .

The goal of the decision maker is to find an optimal policy π* that maximizes the expected, total, discounted reward from the initial state distribution D:

    π* = arg max_π η(π, D) , where η(π, D) = E_{s∼D}[ V_π(s) ] .

It is known that for every MDP, there exists at least one optimal deterministic policy.

2.2. Policy Iteration

Policy iteration is a method of discovering such a policy by iterating through a sequence π_1, π_2, ..., π_k of improving policies. The iteration terminates when there is no change in the policy (π_k = π_{k−1}), and π_k is the optimal policy. This is typically achieved by computing V_{π_i}, which can be done iteratively or by solving a system of linear equations, and then determining a set of Q-values:

    Q_{π_i}(s, a) = R(s, a) + γ Σ_{s'} P(s, a, s') V_{π_i}(s') ,

and the improved policy as

    π_{i+1}(s) = arg max_a Q_{π_i}(s, a) .

In practice, policy iteration terminates in a surprisingly small number of steps. However, it relies on the availability of the full model of the MDP, exact evaluation of each policy, and exact representation of each policy.
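As a concrete illustration of exact policy iteration, the sketch below evaluates each policy by solving the linear system above and then improves it greedily on a small tabular MDP. This is illustrative code only, not from the paper; the array-based representation of P and R is our own assumption.

import numpy as np

def policy_iteration(P, R, gamma):
    """Exact policy iteration for a tabular MDP.
    P: |S| x |A| x |S| transition probabilities, R: |S| x |A| expected rewards."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # initial (arbitrary) deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi as a linear system.
        P_pi = P[np.arange(n_states), pi]        # |S| x |S| transitions under pi
        R_pi = R[np.arange(n_states), pi]        # |S| rewards under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q_pi.
        Q = R + gamma * P @ V                    # |S| x |A|
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):           # no change: pi is optimal
            return pi, V
        pi = pi_new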
2.3. Approximate methods

Approximate methods are frequently used when the state space of the underlying MDP is extremely large and exact (solution or learning) methods fail. A general framework for using approximation within policy iteration is known as approximate policy iteration (API). In its most general form, API assumes an approximate policy π̂_i at each step, which may not necessarily be an exact implementation of the improved policy from the previous step. Typically, the error introduced is bounded by the maximum norm (L∞) error in what is presumed to be an approximate Q̂_{π_{i−1}} that was used to generate π̂_i (Bertsekas & Tsitsiklis, 1996). API cannot guarantee monotonic improvement and convergence to the optimal policy. However, in practice it often finds very good policies in a few iterations, since it normally makes big steps in the space of possible policies. This is in contrast to policy gradient methods which, despite acceleration methods, are often forced to take very small steps.

LSPI (Lagoudakis & Parr, 2001) is an example of an API algorithm. It uses temporal differences of sample transitions between state-action pairs to approximate each Q_π. It is also possible to use a technique closer to pure Monte Carlo evaluation called rollouts. Rollouts estimate Q_π(s, a) by executing action a in state s and following policy π thereafter, while recording the total discounted reward obtained during the run. This process is repeated several times, and it requires a simulator with resets to arbitrary states (or a generative model), since a large number of trajectories is required to obtain an accurate estimate of Q_π(s, a). Rollouts were first used by Tesauro and Galperin (1997) to improve the online performance of a backgammon player. (A supercomputer did rollouts before selecting each move.) However, they can also be used to specify a set of target values for a function approximator used within API to estimate Q_π. The rollout algorithm is summarized in Figure 1.
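A minimal Python sketch of the rollout estimate just described. The generative model is assumed to be a function simulate(s, a) returning (s', r), matching the black-box interface of Section 2.1; the function name and signature are our own illustration, not the paper's.

def rollout_q(simulate, s, a, policy, gamma, K, T):
    """Monte Carlo (rollout) estimate of Q_pi(s, a) using a generative model.
    simulate(s, a) -> (s_next, r) is a black-box generative model."""
    total = 0.0
    for _ in range(K):                       # K independent trajectories
        s_next, r = simulate(s, a)           # first step takes the queried action a
        q_k = r
        state = s_next
        for t in range(1, T):                # then follow the policy for T - 1 steps
            s_next, r = simulate(state, policy(state))
            q_k += (gamma ** t) * r
            state = s_next
        total += q_k
    return total / K                         # average over the K trajectories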

3. API without Value Functions

The dependence of typical API algorithms on value functions places continuous function approximation methods at the inner loop of most approximate policy iteration algorithms. These methods typically minimize L2 error, which is a poor match with the L∞ bounds for API. This problem is not just theoretical. Efforts to improve performance, such as adding new features to a neural network or new basis functions to LSPI, don't always have the expected effect. Increasing expressive power can sometimes lead to a decrease in performance.

Figure 1. The rollout algorithm for estimating Q̃^π(s, a).

Rollout (M, s, a, γ, π, K, T)
// M : Generative model
// (s, a) : State-action pair whose value is sought
// γ : Discount factor
// π : Policy
// K : Number of trajectories
// T : Length of each trajectory
for k = 1 to K
    (s', r) ← SIMULATE(M, s, a)
    Q̃^π_k ← r
    s ← s'
    for t = 1 to T − 1
        (s', r) ← SIMULATE(M, s, π(s))
        Q̃^π_k ← Q̃^π_k + γ^t r
        s ← s'
Q̃^π(s, a) ← (1/K) Σ_{k=1}^{K} Q̃^π_k
return Q̃^π(s, a)

Figure 2. Approximate policy iteration with rollouts and a classification learner (Learn).

// M : Generative model
// Dρ : Source of rollout states
// γ : Discount factor
// π0 : Initial policy (default: uniformly random)
// K : Number of trajectories
// T : Length of each trajectory
π' = π0
repeat
    π = π'
    TS = ∅
    for each s ∈ Dρ
        for each a ∈ A
            Q̃^π(s, a) ← Rollout(M, s, a, γ, π, K, T)
        a* = arg max_{a∈A} Q̃^π(s, a)
        if Q̃^π(s, a*) > Q̃^π(s, a), ∀ a ∈ A, a ≠ a*
            TS ← TS ∪ {(s, a*)^+}                      // positive example for a*
            TS ← TS ∪ {(s, a)^−, ∀ a ∈ A, a ≠ a*}      // negative examples for all other actions
    π' = Learn(TS)                                     // train a classifier on TS
until (π ≈ π')
return π
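The following Python sketch shows one way the loop of Figure 2 could be realized with a modern classifier, using the rollout_q estimator sketched earlier and scikit-learn's SVC as the learner. It simplifies the figure's positive/negative training set into a single multi-class classifier; the scikit-learn usage, the sample_rollout_state sampler, and the fixed iteration count are our own illustrative assumptions, not the paper's implementation.

import numpy as np
from sklearn.svm import SVC

def learn_policy(simulate, sample_rollout_state, actions, gamma,
                 n_rollout_states=200, K=10, T=100, n_iterations=5):
    """Approximate policy iteration with rollouts and a classifier (cf. Figure 2)."""
    policy = lambda s: actions[np.random.randint(len(actions))]   # pi_0: uniformly random
    for _ in range(n_iterations):
        X, y = [], []
        for _ in range(n_rollout_states):
            s = sample_rollout_state()
            q = [rollout_q(simulate, s, a, policy, gamma, K, T) for a in actions]
            best = int(np.argmax(q))
            # Keep the state only if the best action clearly dominates all others.
            if all(q[best] > q[j] for j in range(len(actions)) if j != best):
                X.append(s)
                y.append(best)
        clf = SVC(kernel='poly', degree=2, C=1.0)      # the classifier is the new policy
        clf.fit(np.array(X), np.array(y))
        policy = lambda s, clf=clf: actions[int(clf.predict(np.array(s).reshape(1, -1))[0])]
    return policy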

... until the observed performance of the policy, as measured with experiments with the simulator, decreased. Since approximate policy iteration does not ensure monotonically improving policies, it is possible that continuing to run policy iteration beyond an initial setback could still result in better policies, but we did not explore this possibility.

7.1. Inverted pendulum

In the inverted pendulum domain, the task is to balance a pendulum of unknown length and mass at the upright position by applying forces to the cart to which it is attached. Three actions are allowed: left force LF (−50 Newtons), right force RF (+50 Newtons), or no force NF (0 Newtons). All three actions are noisy; uniform noise in [−10, 10] is added to the chosen action. The state space of the problem is continuous and consists of the vertical angle θ and the angular velocity θ̇ of the pendulum. The transitions are governed by the nonlinear dynamics of the system (Wang et al., 1996) and depend on the current state and the current (noisy) control u:

    θ̈ = ( g sin(θ) − α m l (θ̇)² sin(2θ)/2 − α cos(θ) u ) / ( 4l/3 − α m l cos²(θ) ) ,

where g is the gravity constant (g = 9.8 m/s²), m is the mass of the pendulum (m = 2.0 kg), M is the mass of the cart (M = 8.0 kg), l is the length of the pendulum (l = 0.5 m), and α = 1/(m + M). A reward of 1 is given as long as the angle of the pendulum does not exceed π/2 in absolute value (the pendulum is above the horizontal line). An angle greater than π/2 signals the end of the episode and a reward (penalty) of 0. The discount factor of the process is set to 0.95.

... policy that selects actions randomly with uniform probability. Such "excellent" policies balance the pendulum for more than 3 simulated minutes (in practice, we found that such policies could balance essentially indefinitely). The choice of the sampling distribution did not affect the results significantly. For illustration, we used uniform sampling for rollout states. Figure 3 shows the training data obtained for the LF action. A positive example indicates a state where LF was found to be the best action, and a negative example is a state where LF was found to be a bad choice. It is easy to see that positive and negative examples are easily separated. The same figure also shows the resulting support vectors for the LF classifier using a polynomial kernel of degree 2.

Figure 3. Training data (+ : positive, x : negative) and support vectors (o) for the LF action, plotted over angle and angular velocity.
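To illustrate how such training data might be turned into an action-selection policy, the sketch below fits one binary SVM per action on rollout-labeled states and picks the action whose classifier is most confident. The use of scikit-learn, the decision_function tie-breaking, and the data layout are our own assumptions; the kernel, degree, and C follow the settings reported here.

import numpy as np
from sklearn.svm import SVC

def fit_action_classifiers(X, y, n_actions):
    """Fit one binary SVM per action: positive where that action was best, negative elsewhere.
    X: rollout states (n_samples x 2 for the pendulum), y: index of the best action per state."""
    classifiers = []
    for a in range(n_actions):
        clf = SVC(kernel='poly', degree=2, C=1.0)
        clf.fit(X, (y == a).astype(int))          # 1 = positive example for action a
        classifiers.append(clf)
    return classifiers

def classify_action(classifiers, state):
    """Policy: choose the action whose classifier gives the largest decision value."""
    s = np.asarray(state).reshape(1, -1)
    scores = [clf.decision_function(s)[0] for clf in classifiers]
    return int(np.argmax(scores))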

Figure 4 shows the entire learned policies (blue/dark-gray for LF, red/medium-gray for RF, and green/light-gray for NF) for all three classifiers: SVM with a polynomial kernel, SVM with a Gaussian kernel, and a neural network classifier with 5 hidden units. Interestingly, in the case of the polynomial kernel, the policy does not use the NF action at all, whereas the other policies do. This is due to the limited abilities of the degree-2 polynomial kernel. All policies are excellent in the sense that they can all balance the pendulum for a long time, perhaps indefinitely. In all cases, the input to the SVM or the neural network was just the 2-dimensional state description. For SVMs, the number of support vectors was normally smaller than the number of rollout states. The constant C, which trades off training error against margin, was set to 1.

We note that pendulum balancing is a relatively simple problem. The classes are nearly linearly separable, so good classification performance here should not be surprising to those familiar with modern classification methods. Noteworthy features from the reinforcement learning perspective are the small number of policy iteration steps required and the non-parametric representation of the policy. Figure 3 shows the ability of the SVM to adapt the representation to match the training data, since only the support vectors are used to represent the policy.

Figure 4. Pendulum: policies learned with the polynomial kernel SVM, the Gaussian kernel SVM, and the neural network classifier (angle vs. angular velocity).
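A small sketch of how balancing performance could be measured: run the learned classifier policy in the simulator and count steps until the pendulum falls. The episode cap, the initial perturbation, and the reuse of pendulum_simulate, DT, and classify_action from the earlier sketches are our own assumptions.

import numpy as np

def balancing_time(classifiers, max_steps=10000):
    """Run the classifier policy from a near-upright start; return simulated seconds balanced."""
    state = np.random.uniform(-0.1, 0.1, size=2)        # small random initial perturbation
    for step in range(max_steps):
        action = classify_action(classifiers, state)
        state, reward = pendulum_simulate(state, action)
        if reward == 0.0:                                # pendulum fell below horizontal
            return step * DT
    return max_steps * DT                                # balanced for the whole episode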

7.2. Bicycle riding

In the bicycle balancing and riding problem (Randløv & Alstrøm, 1998), the goal is to learn to balance and ride a bicycle to a target position located 1 km away from the starting location. Initially, the bicycle's orientation is at an angle of 90° to the goal. The state description is a six-dimensional real-valued vector (θ, θ̇, ω, ω̇, ω̈, ψ), where θ is the angle of the handlebar, ω is the vertical angle of the bicycle, and ψ is the angle of the bicycle to the goal. The actions are the torque τ applied to the handlebar (discretized to {−2, 0, +2}) and the displacement of the rider υ (discretized to {−0.02, 0, +0.02}). In our experiments, actions are restricted so that either τ = 0 or υ = 0, giving a total of 5 actions. The noise in the system is a uniformly distributed term in [−0.02, +0.02] added to the displacement component of the action. The dynamics of the bicycle are based on the model of Randløv and Alstrøm (1998), and the time step of the simulation is set to 0.02 seconds. As is typical with this problem, we used a shaping reward (Ng et al., 1999).

Our experiments with the bicycle did show some sensitivity to the parameters of the problem as well as the parameters of our learner. This made it difficult for us to find parameters that consistently produced good performance. Some of this may simply be reflective of our inexperience in tuning the parameters of SVMs. It is also possible that we did not consider enough samples.

For our SVM experiments, we used a shaping reward of r_t given at each time step, where r_t = (d_{t−1} − γ d_t) as long as |ω| < π/15, and r_t = 0 otherwise; d_t is the distance of the back wheel of the bicycle to the goal position at time t. The discount factor was set to 0.95.

In our preliminary experiments with this domain, we were able to solve the problem with uniform sampling and polynomial kernels of low degree. However, this required a large number of rollout states (about 5,000). With sampling from the distribution of the next policy, we were able to solve the problem with fewer rollout states and both RBF and polynomial kernels. However, we did not find kernels that consistently produced good policies with reasonable sample sizes. (The balancing problem is solved easily using any of the classification methods, but riding to the goal proved more difficult.)

Figure 5 shows a sample trajectory from the final policy of one of our better policy iteration runs using SVMs. The bicycle moves in the 2-dimensional plane from the initial position (0, 0) (left side) to the goal position (1000, 0) (right side). This policy was produced with a polynomial kernel of degree 3 and 4000 rollout states. In the final policy, the bicycle rides to the goal, then turns around toward the goal in a very tight radius. This policy was obtained in just two API iterations, starting with a uniformly random action selection policy. Similarly to the pendulum, the input to the SVM was the raw 6-dimensional state description and C = 1.

Figure 5. Bicycle: trajectory of an SVM policy from the starting position to the goal position.

For our neural network experiments, we used a shaping reward of r_t given at each time step, where r_t = 1 + (d_{t−1} − γ d_t) as long as |ω| < π/15, and r_t = 0 otherwise; d_t is defined as before. The discount factor was set to 0.99. Since our neural network learner only uses positive examples and not all states successfully produce positive training instances, we used 8000 rollout states.

Figure 6 shows sample trajectories of one of our better neural network policy iteration runs using 30 hidden units. After the first iteration, the learned policy can only balance the bicycle for a few steps and it crashes. The policy at the second iteration reaches the goal, but fails to return to it. Finally, the policy at the third iteration reaches the goal faster and stays there. The best neural network policy is not as good as the best SVM policy, but it is more illustrative of the progress of policy iteration because it takes an extra iteration.
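A sketch of the two shaping rewards described above, assuming the caller tracks the previous and current distance of the back wheel to the goal and the current vertical angle ω; the function name and the flag distinguishing the two variants are our own.

import math

def bicycle_shaping_reward(d_prev, d_curr, omega, gamma, nn_variant=False):
    """Shaping reward used in the bicycle experiments.
    SVM variant: r_t = d_{t-1} - gamma * d_t while |omega| < pi/15, else 0.
    NN variant:  r_t = 1 + (d_{t-1} - gamma * d_t) under the same condition."""
    if abs(omega) >= math.pi / 15:       # bicycle has effectively fallen: no shaping reward
        return 0.0
    base = d_prev - gamma * d_curr
    return 1.0 + base if nn_variant else base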
Figure 6. Bicycle: policies at successive iterations (NN classifier); iteration 1 crashes, iterations 2 and 3 reach the goal.

8. Discussion

We have presented a case for an approach to RL that combines policy iteration and pure classification learning with rollouts. The emphasis of the approach in this paper is the ability to use modern classification techniques, such as SVMs, to alleviate some of the burden of feature engineering from the practitioner of reinforcement learning. However, our empirical results also suggest that more traditional methods such as neural networks can be used successfully. We believe that these initial successes will help open the door to greater exploitation of modern classification methods on more challenging reinforcement learning domains. Of course, many questions remain. More thorough investigation of issues relating to sample complexity and restart distributions is an important area for future work.

Acknowledgments

This research was supported in part by NSF grant 0209088. We also thank Alan Fern, Bob Givan, Carlos Guestrin and Ryan Deering for helpful discussions.

References

Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont, Massachusetts: Athena Scientific.

Collobert, R., & Bengio, S. (2001). SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1, 143–160.

Dietterich, T. G., & Wang, X. (2001). Batch value function approximation via support vectors. Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference. Vancouver, British Columbia: MIT Press.

Fern, A., Yoon, S., & Givan, R. (2003). Approximate policy iteration with a policy language bias: Learning control knowledge in planning domains (Technical report TR-ECE-03-11). Purdue University School of Electrical and Computer Engineering.

Guestrin, C. E., Koller, D., & Parr, R. (2001). Max-norm projections for factored MDPs. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01) (pp. 673–680). Seattle, Washington: Morgan Kaufmann.

Jaakkola, T., Jordan, M., & Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1185–1201.

Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. Advances in Neural Information Processing Systems 7 (pp. 345–352). Cambridge, Massachusetts: MIT Press.

Kaelbling, L. P. (1993). Learning in embedded systems. Cambridge, Massachusetts: MIT Press.

Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. The Nineteenth International Conference on Machine Learning (ICML-2002). Sydney, Australia.

Kearns, M., Mansour, Y., & Ng, A. Y. (1999). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99) (pp. 1324–1331). Stockholm, Sweden: Morgan Kaufmann.

Lagoudakis, M., & Parr, R. (2001). Model-free least-squares policy iteration. 14th Neural Information Processing Systems (NIPS-14). Vancouver, Canada.

Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. Proceedings of the Sixteenth International Conference on Machine Learning (pp. 278–287). San Francisco, CA: Morgan Kaufmann.

Randløv, J., & Alstrøm, P. (1998). Learning to drive a bicycle using reinforcement learning and shaping. The Fifteenth International Conference on Machine Learning. Madison, Wisconsin: Morgan Kaufmann.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Tesauro, G., & Galperin, G. R. (1997). On-line policy improvement using Monte-Carlo search. 9th Neural Information Processing Systems (NIPS-9). Denver, Colorado.

Wang, H., Tanaka, K., & Griffin, M. (1996). An approach to fuzzy control of nonlinear systems: Stability and design issues. IEEE Transactions on Fuzzy Systems, 4, 14–23.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Yoon, S. W., Fern, A., & Givan, R. (2002). Inductive policy selection for first-order MDPs. Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI-02). Edmonton, Canada: Morgan Kaufmann.