
Effective sketching methods for value function approximation

Yangchen Pan, Erfan Sadeqi Azer and Martha White
Department of Computer Science
Indiana University Bloomington
[email protected], [email protected], [email protected]

arXiv:1708.01298v1 [cs.LG] 3 Aug 2017

Abstract

High-dimensional representations, such as radial basis function networks or tile coding, are common choices for policy evaluation in reinforcement learning. Learning with such high-dimensional representations, however, can be expensive, particularly for matrix methods, such as least-squares temporal difference learning or quasi-Newton methods that approximate matrix step-sizes. In this work, we explore the utility of sketching for these two classes of algorithms. We highlight issues with sketching the high-dimensional features directly, which can incur significant bias. As a remedy, we demonstrate how to use sketching more sparingly, with only a left-sided sketch, which can still enable significant computational gains and the use of these matrix-based learning algorithms, which are less sensitive to parameters. We empirically investigate these algorithms in four domains, with a variety of representations. Our aim is to provide insights into effective use of sketching in practice.

1 INTRODUCTION

A common strategy for function approximation in reinforcement learning is to overparametrize: generate a large number of features to provide a sufficiently complex function space. For example, one typical representation is a radial basis function network, where the centers for each radial basis function are chosen to exhaustively cover the observation space. Because the environment is unknown—particularly for the incremental learning setting—such an overparameterized representation is more robust to this uncertainty, because a reasonable representation is guaranteed for any part of the space that might be visited. Once interacting with the environment, however, it is likely that not all features will become active, and that only a lower-dimensional subspace will be visited.

A complementary approach for this high-dimensional representation expansion in reinforcement learning, therefore, is to use projections. In this way, we can over-parameterize for robustness, but then use a projection to a lower-dimensional space to make learning feasible. For an effectively chosen projection, we can avoid discarding important information, and benefit from the fact that the agent only visits a lower-dimensional subspace of the environment in the feature space.

Towards this aim, we investigate the utility of sketching: projecting with a random matrix. Sketching has been extensively used for efficient communication and for solving large linear systems, with a solid theoretical foundation and a variety of different sketches (Woodruff, 2014). Sketching has been previously used in reinforcement learning, specifically to reduce the dimension of the features. Bellemare et al. (2012) replaced the standard biased hashing function used for tile coding (Sutton, 1996), instead using count-sketch.[1] Ghavamzadeh et al. (2010) investigated sketching features to reduce the dimensionality and make it feasible to run least-squares temporal difference learning (LSTD) for policy evaluation. In LSTD, the value function is estimated by incrementally computing a $d \times d$ matrix $A$, where $d$ is the number of features, and a $d$-dimensional vector $b$, where the parameters are estimated as the solution to this linear system. Because $d$ can be large, they randomly project the features to reduce the matrix size to $k \times k$, with $k \ll d$. For both of these previous uses of sketching, however, the resulting value function estimates are biased.

[1] They called the sketch the tug-of-war sketch, but it is more standard to call it count-sketch.

This bias, as we show in this work, can be quite significant, resulting in significant estimation error in the value function for a given policy. As a result, any gains from using LSTD methods—over stochastic temporal difference (TD) methods—are largely overcome by this bias. A natural question is if we can benefit from sketching, with minimal bias or without incurring any bias at all.

In this work, we propose to instead sketch the linear system in LSTD. The key idea is to only sketch the constraints of the system (the left side of $A$) rather than the variables (the right side of $A$). Sketching features, on the other hand, by design sketches both constraints and variables. We show that even with a straightforward linear system solution, the left-sided sketch can significantly reduce bias. We further show how to use this left-sided sketch within a quasi-Newton algorithm, providing an unbiased policy evaluation algorithm that can still benefit from the computational improvements of sketching. The key novelty in this work is designing such system-sketching algorithms when also incrementally computing the linear system solution. There is a wealth of literature on sketching linear systems to reduce computation. In general, however, many sketching approaches cannot be applied to the incremental policy evaluation problem, because the approaches are designed for a static linear system. For example, Gower & Richtárik (2015) provide a host of possible solutions for solving large linear systems. However, they assume access to $A$ upfront, so the algorithm design, in memory and computation, is not suitable for the incremental setting. Some popular sketching approaches, such as Frequent Directions (Ghashami et al., 2014), have been successfully used in the online setting for quasi-Newton algorithms (Luo et al., 2016); however, they sketch symmetric matrices that grow with the number of samples.
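To make the distinction concrete, the following NumPy sketch contrasts the shapes of the two approaches on synthetic stand-ins for $A$ and $b$ (their actual definitions are given in Section 2). The dimensions, the Gaussian sketch, and the random placeholder system are invented purely for illustration, and the plain minimum-norm solve of the left-sketched system is only one simple way to use it, not the incremental algorithm developed later in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 1000, 50

    # Random stand-ins for the d x d matrix A and d-vector b described above.
    A = rng.normal(size=(d, d)) / np.sqrt(d)
    b = rng.normal(size=d)
    S = rng.normal(size=(k, d)) / np.sqrt(k)       # a dense Gaussian sketch

    # Sketching the features (prior work): both constraints and variables are
    # projected, leaving a small k x k system and k-dimensional weights.
    A_feat = S @ A @ S.T                           # k x k
    w_feat = S.T @ np.linalg.solve(A_feat, S @ b)  # lift back to d dimensions

    # Sketching only the constraints (left-sided sketch): the system becomes
    # k x d, but the d-dimensional weight vector is retained.
    A_left = S @ A                                 # k x d
    b_left = S @ b
    w_left = np.linalg.lstsq(A_left, b_left, rcond=None)[0]   # minimum-norm solve

The point is only the difference in what gets compressed: the feature sketch solves for weights in $\mathbb{R}^k$, whereas the left-sided sketch keeps $w \in \mathbb{R}^d$ and only reduces the number of constraints.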
This paper is organized as follows. We first introduce the policy evaluation problem—learning a value function for a fixed policy—and provide background on sketching methods. We then illustrate issues with only sketching features, in terms of quality of the value function approximation. We then introduce the idea of using asymmetric sketching for policy evaluation with LSTD, and provide an efficient incremental algorithm that is $O(dk)$ on each step. We finally highlight settings where we expect sketching to perform particularly well in practice, and investigate the properties of our algorithm on four domains, with a variety of representation properties.

2 PROBLEM FORMULATION

We address the policy evaluation problem within reinforcement learning, where the goal is to estimate the value function for a given policy.[2] As is standard, the agent-environment interaction is formulated as a Markov decision process $(\mathcal{S}, \mathcal{A}, \mathrm{Pr}, r)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, and $\mathrm{Pr} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, \infty)$ is the one-step state transition dynamics. On each time step $t = 1, 2, 3, \ldots$, the agent selects an action according to its policy $\pi$, $A_t \sim \pi(S_t, \cdot)$, with $\pi : \mathcal{S} \times \mathcal{A} \to [0, \infty)$, transitions into a new state $S_{t+1} \sim \mathrm{Pr}(S_t, A_t, \cdot)$ and obtains scalar reward $R_{t+1} \overset{\text{def}}{=} r(S_t, A_t, S_{t+1})$.

[2] To focus the investigation on sketching, we consider the simpler on-policy setting in this work. Many of the results, however, generalize to the off-policy setting, where data is generated according to a behavior policy different than the given target policy we wish to evaluate.

For policy evaluation, the goal is to estimate the value function, $v_\pi : \mathcal{S} \to \mathbb{R}$, which corresponds to the expected return when following policy $\pi$:

$$v_\pi(s) \overset{\text{def}}{=} \mathbb{E}_\pi[G_t \mid S_t = s],$$

where $\mathbb{E}_\pi$ is the expectation over future states when selecting actions according to $\pi$. The return, $G_t \in \mathbb{R}$, is the discounted sum of future rewards given actions are selected according to $\pi$:

$$G_t \overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1}\gamma_{t+2} R_{t+3} + \cdots = R_{t+1} + \gamma_{t+1} G_{t+1} \qquad (1)$$

where $\gamma_{t+1} \in [0, 1]$ is a scalar that depends on $S_t, A_t, S_{t+1}$ and discounts the contribution of future rewards exponentially with time. A common setting, for example, is a constant discount. This recent generalization to state-dependent discount (Sutton et al., 2011; White, 2016) enables either episodic or continuing problems, and so we adopt this more general formalism here.

We consider linear function approximation to estimate the value function. In this setting, the observations are expanded to a higher-dimensional space, such as through tile coding, radial basis functions or a Fourier basis. Given this nonlinear encoding $x : \mathcal{S} \to \mathbb{R}^d$, the value is approximated as $v_\pi(S_t) \approx w^\top x_t$ for $w \in \mathbb{R}^d$ and $x_t \overset{\text{def}}{=} x(S_t)$. One algorithm for estimating $w$ is least-squares temporal difference learning (LSTD). The goal in LSTD($\lambda$) (Boyan, 1999) is to minimize the mean-squared projected Bellman error, which can be represented as solving the following linear system:

$$A \overset{\text{def}}{=} \mathbb{E}_\pi\left[e_t (x_t - \gamma_{t+1} x_{t+1})^\top\right], \qquad b \overset{\text{def}}{=} \mathbb{E}_\pi\left[R_{t+1} e_t\right],$$

where $e_t \overset{\text{def}}{=} \gamma_{t+1}\lambda e_{t-1} + x_t$ is called the eligibility trace, for trace parameter $\lambda \in [0, 1]$. To obtain $w$, the system $A$ and $b$ are incrementally estimated, to solve $Aw = b$. For a trajectory $\{(S_t, A_t, S_{t+1}, R_{t+1})\}_{t=0}^{T-1}$, let $d_t \overset{\text{def}}{=} x_t - \gamma_{t+1} x_{t+1}$; then the above two expected terms are usually computed via a sample average, which can be recursively computed, in a numerically stable way, as

$$A_{t+1} = A_t + \tfrac{1}{t+1}\left(e_t d_t^\top - A_t\right), \qquad b_{t+1} = b_t + \tfrac{1}{t+1}\left(e_t R_{t+1} - b_t\right).$$
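As a reference point, the following NumPy sketch implements these sample-average updates directly; the trajectory interface (tuples of $S_t$, $S_{t+1}$, $R_{t+1}$, $\gamma_{t+1}$ together with a feature map $x$) is an assumed convenience for the example, not an API prescribed by the paper.

    import numpy as np

    def lstd_lambda(trajectory, x, lam, d):
        """Sample-average LSTD(lambda): returns A_T, b_T and w solving A w = b.

        trajectory : iterable of (s, s_next, r, gamma_next) transitions
        x          : feature map returning a length-d numpy array
        lam        : trace parameter lambda in [0, 1]
        """
        A = np.zeros((d, d))
        b = np.zeros(d)
        e = np.zeros(d)                                # eligibility trace, e_{-1} = 0
        for t, (s, s_next, r, gamma_next) in enumerate(trajectory):
            x_t, x_next = x(s), x(s_next)
            e = gamma_next * lam * e + x_t             # e_t = gamma_{t+1} lambda e_{t-1} + x_t
            d_t = x_t - gamma_next * x_next            # d_t = x_t - gamma_{t+1} x_{t+1}
            A += (np.outer(e, d_t) - A) / (t + 1.0)    # running average of e_t d_t^T
            b += (e * r - b) / (t + 1.0)               # running average of R_{t+1} e_t
        w = np.linalg.pinv(A) @ b                      # A need not be invertible early on
        return A, b, w

Each update costs $O(d^2)$ time and memory for the full $d \times d$ matrix $A_t$, which is precisely the expense that motivates sketching in the remainder of the paper.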
3 ISSUES WITH SKETCHING THE FEATURES

One approach to make LSTD more feasible is to project—sketch—the features. Sketching involves sampling a random matrix $S \in \mathbb{R}^{k \times d}$ from a family of matrices $\mathcal{S}$, to project a given $d$-dimensional vector $x$ to a (much smaller) $k$-dimensional vector $Sx$.
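For concreteness, the snippet below builds two common examples of such a sketch matrix—a dense Gaussian sketch and a count-sketch—and applies them to a feature vector. The dimensions and these particular families are illustrative choices for this example only, not necessarily the ones evaluated in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 4096, 64                            # illustrative dimensions

    # (a) Dense Gaussian sketch: i.i.d. N(0, 1/k) entries, so E[S^T S] = I.
    S_gauss = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))

    # (b) Count-sketch: each feature index is hashed to one row with a random sign,
    #     so computing S x costs O(d) (or O(nnz(x)) for sparse features).
    rows = rng.integers(0, k, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)

    def count_sketch(x_vec):
        out = np.zeros(k)
        np.add.at(out, rows, signs * x_vec)    # accumulate signed features per bucket
        return out

    x_vec = rng.random(d)                      # stand-in feature vector x
    x_gauss = S_gauss @ x_vec                  # k-dimensional projection S x
    x_cs = count_sketch(x_vec)                 # k-dimensional projection S x

Both reduce the $d$-dimensional features to $k$ numbers; the rest of this section examines the bias that sketching the features in this way introduces into the value function estimate.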