
Value Function Approximation in Reinforcement Learning using the Fourier Basis

George Konidaris^{1,3}   Sarah Osentoski^{2,3}*   Philip Thomas^3
^1 MIT CSAIL   ^2 Department of Computer Science, Brown University   ^3 Autonomous Learning Laboratory, University of Massachusetts Amherst
[email protected]   [email protected]   [email protected]

Abstract

We describe the Fourier basis, a linear value function approximation scheme based on the Fourier series. We empirically demonstrate that it performs well compared to radial basis functions and the polynomial basis, the two most popular fixed bases for linear value function approximation, and is competitive with learned proto-value functions.

Introduction

Reinforcement learning (RL) in continuous state spaces requires function approximation. Most work in this area focuses on linear function approximation, where the value function is represented as a weighted linear sum of a set of features (known as basis functions) computed from the state variables. Linear function approximation results in a simple update rule and a quadratic error surface, even though the basis functions themselves may be arbitrarily complex.

RL researchers have employed a wide variety of basis function schemes, most commonly radial basis functions (RBFs) and CMACs (Sutton and Barto, 1998). Often, choosing the right basis function set is critical for successful learning. Unfortunately, most approaches require significant design effort or problem insight, and no basis function set is both simple and sufficiently reliable to be generally satisfactory. Recent work (Mahadevan and Maggioni, 2007) has focused on learning basis function sets from experience, removing the need to design an approximator but introducing the need to gather data to create one.

The most common continuous function approximation method in the applied sciences is the Fourier series. Although the Fourier series is simple, effective, and has solid theoretical underpinnings, it is almost never used for value function approximation. This paper describes the Fourier basis, a simple linear function approximation scheme using the terms of the Fourier series as basis functions.[1] We empirically demonstrate that it performs well compared to RBFs and the polynomial basis, the most common fixed bases, and is competitive with learned proto-value functions even though no extra experience or computation is required.

Copyright (c) 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
* Sarah Osentoski is now with the Robert Bosch LLC Research and Technology Center in Palo Alto, CA.
[1] Open-source, RL-Glue (Tanner and White, 2009) compatible Java source code for the Fourier basis can be downloaded from http://library.rl-community.org/wiki/Sarsa_Lambda_Fourier_Basis_(Java).

Background

A d-dimensional continuous-state Markov decision process (MDP) is a tuple M = (S, A, P, R), where S ⊆ R^d is a set of possible state vectors, A is a set of actions, P is the transition model (with P(x, a, x') giving the probability of moving from state x to state x' given action a), and R is the reward function (with R(x, a, x') giving the reward obtained from executing action a in state x and transitioning to state x'). Our goal is to learn a policy, π, mapping state vectors to actions so as to maximize return (the discounted sum of rewards). When P is known, this can be achieved by learning a value function, V, mapping state vectors to return, and selecting actions that result in the states with highest V. When P is not available, the agent will typically either learn it, or instead learn an action-value function, Q, that maps state-action pairs to expected return. Since the theory underlying the two cases is similar, we consider only the value function case.

V is commonly approximated as a weighted sum of a set of basis functions φ_1, ..., φ_m:

    V̄(x) = Σ_{i=1}^{m} w_i φ_i(x).

This is termed linear value function approximation, since V̄ is linear in the weights w = [w_1, ..., w_m]; learning entails finding the w corresponding to an approximate optimal value function, V̄*. Linear function approximation is attractive because it results in simple update rules (often using gradient descent) and possesses a quadratic error surface with a single minimum (except in degenerate cases). Nevertheless, we can represent complex value functions since the basis functions themselves can be arbitrarily complex.

The Polynomial Basis. Given d state variables x = [x_1, ..., x_d], the simplest linear scheme uses each variable directly as a basis function along with a constant function, setting φ_0(x) = 1 and φ_i(x) = x_i, 1 ≤ i ≤ d. However, most interesting value functions are too complex to be represented this way. This scheme was therefore generalized to the polynomial basis (Lagoudakis and Parr, 2003):

    φ_i(x) = Π_{j=1}^{d} x_j^{c_{i,j}},

where each c_{i,j} is an integer between 0 and n. We describe such a basis as an order n polynomial basis. For example, a 2nd order polynomial basis defined over two state variables x and y would have feature vector Φ = [1, x, y, xy, x^2, y^2, x^2 y, x y^2, x^2 y^2]. Note the features that are a function of both variables; these features model the interaction between those variables.

Radial Basis Functions. Another common scheme is RBFs, where each basis function is a Gaussian:

    φ_i(x) = (1 / sqrt(2πσ^2)) e^{-||c_i - x||^2 / (2σ^2)},

for a given collection of centers c_i and variance σ^2. The centers are typically distributed evenly along each dimension, leading to n^d centers for d state variables and a given order n; σ^2 can be varied but is often set to 2/(n-1). RBFs only generalize locally—changes in one area of the state space do not affect the entire state space. Thus, they are suitable for representing value functions that might have discontinuities. However, this limited generalization is often reflected in slow initial performance.

Proto-Value Functions. Recent research has focused on learning basis functions given experience. The most prominent learned basis is proto-value functions, or PVFs (Mahadevan and Maggioni, 2007). In their simplest form, an agent builds an adjacency matrix, A, from experience and then computes the Laplacian, L = (D - A), of A, where D is a diagonal matrix with D(i, i) being the out-degree of state i. The eigenvectors of L form a set of bases that respect the topology of the state space (as reflected in A), and can be used as a set of orthogonal bases for a discrete domain. Mahadevan et al. (2006) extended PVFs to continuous domains, using a local distance metric to construct the graph and an out-of-sample method for obtaining the values of each basis function at states not represented in A. Although the given results are promising, PVFs in continuous spaces require samples to build A, an eigenvector decomposition to build the basis functions, and pose several potentially difficult design decisions.

The Fourier Basis

In this section we describe the Fourier series for one and multiple variables, and use it to define the univariate and multivariate Fourier bases.

The Univariate Fourier Series

The Fourier series is used to approximate a periodic function; a function f is periodic with period T if f(x + T) = f(x), ∀x. The nth degree Fourier expansion of f is:

    f̄(x) = a_0/2 + Σ_{k=1}^{n} [a_k cos(k (2π/T) x) + b_k sin(k (2π/T) x)],   (1)

with a_k = (2/T) ∫_0^T f(x) cos(2πkx/T) dx and b_k = (2/T) ∫_0^T f(x) sin(2πkx/T) dx. Thus a full nth order Fourier approximation to a one-dimensional value function results in a linear function approximator with 2n + 1 terms. However, as we shall see below, we can usually use only n + 1 terms.

Even, Odd and Non-Periodic Functions

If f is known to be even (that is, f(x) = f(-x), so that f is symmetric about the y-axis), then ∀i > 0, b_i = 0, so the sin terms can be dropped. This results in a function guaranteed to be even, and reduces the terms required for an nth order Fourier approximation to n + 1. Similarly, if f is known to be odd (that is, f(x) = -f(-x), so that f is symmetric with respect to the origin), then ∀i > 0, a_i = 0, so we can omit the cos terms. These cases are depicted in Figure 1.

Figure 1: Even (a) and odd (b) functions.

However, in general, value functions are not even, odd, or periodic (or known to be so in advance). In such cases, we can define our approximation over [-1, 1] but only project the input variable to [0, 1]. This results in a function periodic on [-1, 1] but unconstrained on (0, 1]. We are now free to choose whether or not the function is even or odd over [-1, 1], and can drop half of the terms in the approximation. In general, we expect that it will be better to use the "half-even" approximation and drop the sin terms, because this causes only a slight discontinuity at the origin. Thus, we define the univariate nth order Fourier basis as:

    φ_i(x) = cos(iπx),   (2)

for i = 0, ..., n. Figure 2 depicts a few of the resulting basis functions. Note that frequency increases with i; thus, high order basis functions will correspond to high frequency components of the value function.

Figure 2: Univariate Fourier basis functions for i = 1, 2, 3 and 4.
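As a concrete illustration, the univariate Fourier basis of Equation 2 and the weighted sum V̄(x) = Σ_i w_i φ_i(x) can be sketched as below. This is our own minimal sketch, not the authors' released Java implementation; it assumes the state variable has already been scaled to [0, 1], and the function names are illustrative.

```python
import math

def fourier_basis(n):
    """Univariate order-n Fourier basis (Equation 2): phi_i(x) = cos(i*pi*x)
    for i = 0, ..., n, assuming x has been scaled to [0, 1]."""
    # The default argument i=i binds the loop variable at definition time.
    return [lambda x, i=i: math.cos(i * math.pi * x) for i in range(n + 1)]

def v_approx(x, w, basis):
    """Linear value function approximation: V(x) = sum_i w_i * phi_i(x)."""
    return sum(wi * phi(x) for wi, phi in zip(w, basis))

# An order-4 basis has n + 1 = 5 features; phi_0 is the constant function.
basis = fourier_basis(4)
v = v_approx(0.5, [0.1] * 5, basis)
```

Note that the frequency of φ_i grows with i, matching the observation above that high-order features capture high-frequency components of the value function.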
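The order-n polynomial basis from the Background section admits a similarly short sketch: enumerating every exponent vector (c_{i,1}, ..., c_{i,d}) with entries in {0, ..., n} yields (n + 1)^d features. Again a hedged illustration under our own naming, not code from the paper.

```python
import itertools
import math

def polynomial_basis(n, d):
    """Order-n polynomial basis over d state variables (Lagoudakis and
    Parr, 2003). Each feature maps x to prod_j x_j^{c_j}, with every
    exponent c_j an integer in {0, ..., n}: (n + 1)^d features in total."""
    exponent_vectors = itertools.product(range(n + 1), repeat=d)

    def make_feature(c):
        # Factory function so each lambda captures its own exponent vector.
        return lambda x: math.prod(xj ** cj for xj, cj in zip(x, c))

    return [make_feature(c) for c in exponent_vectors]

# Order 2 over two variables gives the nine features
# 1, x, y, xy, x^2, y^2, x^2 y, x y^2, x^2 y^2 (in enumeration order).
basis = polynomial_basis(2, 2)
```

The cross terms such as x^2 y^2 are exactly the interaction features highlighted in the text.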
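The RBF scheme can be sketched in the same style. We assume states lie in the unit cube [0, 1]^d and place n evenly spaced centers per dimension (n^d in total), using the σ^2 = 2/(n − 1) heuristic mentioned in the text; the layout and names are our own illustration.

```python
import itertools
import math

def rbf_basis(n, d):
    """n^d evenly spaced Gaussian RBFs over [0, 1]^d, with the common
    heuristic sigma^2 = 2 / (n - 1). Assumes n >= 2."""
    grid = [i / (n - 1) for i in range(n)]   # n centers per dimension
    centers = itertools.product(grid, repeat=d)
    sigma2 = 2.0 / (n - 1)
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma2)

    def make_feature(c):
        # Gaussian bump centered at c; decays with squared distance to c.
        return lambda x: norm * math.exp(
            -sum((xj - cj) ** 2 for xj, cj in zip(x, c)) / (2.0 * sigma2))

    return [make_feature(c) for c in centers]

basis = rbf_basis(3, 2)   # 3^2 = 9 Gaussians on the unit square
```

Each feature peaks at its own center and falls off with distance, which is the local generalization property discussed above.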