
Effective sketching methods for value function approximation

Yangchen Pan, Erfan Sadeqi Azer and Martha White
Department of Computer Science
Indiana University Bloomington
[email protected], [email protected], [email protected]

arXiv:1708.01298v1 [cs.LG] 3 Aug 2017

Abstract

High-dimensional representations, such as radial basis function networks or tile coding, are common choices for policy evaluation in reinforcement learning. Learning with such high-dimensional representations, however, can be expensive, particularly for matrix methods, such as least-squares temporal difference learning or quasi-Newton methods that approximate matrix step-sizes. In this work, we explore the utility of sketching for these two classes of algorithms. We highlight issues with sketching the high-dimensional features directly, which can incur significant bias. As a remedy, we demonstrate how to use sketching more sparingly, with only a left-sided sketch, which can still enable significant computational gains and the use of these matrix-based learning algorithms, which are less sensitive to parameters. We empirically investigate these algorithms in four domains, with a variety of representations. Our aim is to provide insights into effective use of sketching in practice.

1 INTRODUCTION

A common strategy for function approximation in reinforcement learning is to overparametrize: generate a large number of features to provide a sufficiently complex function space. For example, one typical representation is a radial basis function network, where the centers for each radial basis function are chosen to exhaustively cover the observation space. Because the environment is unknown—particularly for the incremental learning setting—such an overparameterized representation is more robust to this uncertainty, because a reasonable representation is guaranteed for any part of the space that might be visited. Once interacting with the environment, however, it is likely that not all features will become active, and that only a lower-dimensional subspace will be visited.

A complementary approach for this high-dimensional representation expansion in reinforcement learning, therefore, is to use projections. In this way, we can over-parameterize for robustness, but then use a projection to a lower-dimensional space to make learning feasible. For an effectively chosen projection, we can avoid discarding important information, and benefit from the fact that the agent only visits a lower-dimensional subspace of the environment in the feature space.

Towards this aim, we investigate the utility of sketching: projecting with a random matrix. Sketching has been extensively used for efficient communication and for solving large linear systems, with a solid theoretical foundation and a variety of different sketches (Woodruff, 2014). Sketching has been previously used in reinforcement learning, specifically to reduce the dimension of the features. Bellemare et al. (2012) replaced the standard biased hashing function used for tile coding (Sutton, 1996), instead using count-sketch.[1] Ghavamzadeh et al. (2010) investigated sketching features to reduce the dimensionality and make it feasible to run least-squares temporal difference learning (LSTD) for policy evaluation. In LSTD, the value function is estimated by incrementally computing a $d \times d$ matrix $A$, where $d$ is the number of features, and a $d$-dimensional vector $b$, where the parameters are estimated as the solution to this linear system. Because $d$ can be large, they randomly project the features to reduce the matrix size to $k \times k$, with $k \ll d$. For both of these previous uses of sketching, however, the resulting value function estimates are biased.

[1] They called the sketch the tug-of-war sketch, but it is more standard to call it count-sketch.

This bias, as we show in this work, can be quite significant, resulting in significant estimation error in the value function for a given policy. As a result, any gains from using LSTD methods—over stochastic temporal difference (TD) methods—are largely overcome by this bias. A natural question is if we can benefit from sketching, with minimal bias or without incurring any bias at all.

In this work, we propose to instead sketch the linear system in LSTD. The key idea is to only sketch the constraints of the system (the left side of $A$) rather than the variables (the right side of $A$). Sketching features, on the other hand, by design sketches both constraints and variables. We show that even with a straightforward linear system solution, the left-sided sketch can significantly reduce bias. We further show how to use this left-sided sketch within a quasi-Newton algorithm, providing an unbiased policy evaluation algorithm that can still benefit from the computational improvements of sketching. The key novelty in this work is designing such system-sketching algorithms when also incrementally computing the linear system solution. There is a wealth of literature on sketching linear systems to reduce computation. In general, however, many sketching approaches cannot be applied to the incremental policy evaluation problem, because the approaches are designed for a static linear system. For example, Gower & Richtárik (2015) provide a host of possible solutions for solving large linear systems. However, they assume access to $A$ upfront, so the algorithm design, in memory and computation, is not suitable for the incremental setting. Some popular sketching approaches, such as Frequent Directions (Ghashami et al., 2014), have been successfully used in the online setting for quasi-Newton algorithms (Luo et al., 2016); however, they sketch symmetric matrices that grow with the number of samples.
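To make the distinction concrete, the following NumPy sketch contrasts the shapes of the two approaches on synthetic stand-ins for $A$ and $b$ (their actual definitions are given in Section 2). The dimensions, the Gaussian sketch, and the random placeholder system are invented purely for illustration, and the plain minimum-norm solve of the left-sketched system is only one simple way to use it, not the incremental algorithm developed later in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 1000, 50

    # Random stand-ins for the d x d matrix A and d-vector b described above.
    A = rng.normal(size=(d, d)) / np.sqrt(d)
    b = rng.normal(size=d)
    S = rng.normal(size=(k, d)) / np.sqrt(k)       # a dense Gaussian sketch

    # Sketching the features (prior work): both constraints and variables are
    # projected, leaving a small k x k system and k-dimensional weights.
    A_feat = S @ A @ S.T                           # k x k
    w_feat = S.T @ np.linalg.solve(A_feat, S @ b)  # lift back to d dimensions

    # Sketching only the constraints (left-sided sketch): the system becomes
    # k x d, but the d-dimensional weight vector is retained.
    A_left = S @ A                                 # k x d
    b_left = S @ b
    w_left = np.linalg.lstsq(A_left, b_left, rcond=None)[0]   # minimum-norm solve

The point is only the difference in what gets compressed: the feature sketch solves for weights in $\mathbb{R}^k$, whereas the left-sided sketch keeps $w \in \mathbb{R}^d$ and only reduces the number of constraints.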
This paper is organized as follows. We first introduce the policy evaluation problem—learning a value function for a fixed policy—and provide background on sketching methods. We then illustrate issues with only sketching features, in terms of quality of the value function approximation. We then introduce the idea of using asymmetric sketching for policy evaluation with LSTD, and provide an efficient incremental algorithm that is $O(dk)$ on each step. We finally highlight settings where we expect sketching to perform particularly well in practice, and investigate the properties of our algorithm on four domains, with a variety of representation properties.

2 PROBLEM FORMULATION

We address the policy evaluation problem within reinforcement learning, where the goal is to estimate the value function for a given policy.[2] As is standard, the agent-environment interaction is formulated as a Markov decision process $(\mathcal{S}, \mathcal{A}, \mathrm{Pr}, r)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, and $\mathrm{Pr} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, \infty)$ is the one-step state transition dynamics. On each time step $t = 1, 2, 3, \ldots$, the agent selects an action according to its policy $\pi$, $A_t \sim \pi(S_t, \cdot)$, with $\pi : \mathcal{S} \times \mathcal{A} \to [0, \infty)$, transitions into a new state $S_{t+1} \sim \mathrm{Pr}(S_t, A_t, \cdot)$ and obtains scalar reward $R_{t+1} \overset{\text{def}}{=} r(S_t, A_t, S_{t+1})$.

[2] To focus the investigation on sketching, we consider the simpler on-policy setting in this work. Many of the results, however, generalize to the off-policy setting, where data is generated according to a behavior policy different than the given target policy we wish to evaluate.

For policy evaluation, the goal is to estimate the value function, $v_\pi : \mathcal{S} \to \mathbb{R}$, which corresponds to the expected return when following policy $\pi$:

$$v_\pi(s) \overset{\text{def}}{=} \mathbb{E}_\pi[G_t \mid S_t = s],$$

where $\mathbb{E}_\pi$ is the expectation over future states when selecting actions according to $\pi$. The return, $G_t \in \mathbb{R}$, is the discounted sum of future rewards given actions are selected according to $\pi$:

$$G_t \overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1}\gamma_{t+2} R_{t+3} + \cdots = R_{t+1} + \gamma_{t+1} G_{t+1} \qquad (1)$$

where $\gamma_{t+1} \in [0, 1]$ is a scalar that depends on $S_t, A_t, S_{t+1}$ and discounts the contribution of future rewards exponentially with time. A common setting, for example, is a constant discount. This recent generalization to state-dependent discount (Sutton et al., 2011; White, 2016) enables either episodic or continuing problems, and so we adopt this more general formalism here.

We consider linear function approximation to estimate the value function. In this setting, the observations are expanded to a higher-dimensional space, such as through tile coding, radial basis functions or a Fourier basis. Given this nonlinear encoding $x : \mathcal{S} \to \mathbb{R}^d$, the value is approximated as $v_\pi(S_t) \approx w^\top x_t$ for $w \in \mathbb{R}^d$ and $x_t \overset{\text{def}}{=} x(S_t)$. One algorithm for estimating $w$ is least-squares temporal difference learning (LSTD). The goal in LSTD($\lambda$) (Boyan, 1999) is to minimize the mean-squared projected Bellman error, which can be represented as solving the following linear system:

$$A \overset{\text{def}}{=} \mathbb{E}_\pi\left[e_t (x_t - \gamma_{t+1} x_{t+1})^\top\right], \qquad b \overset{\text{def}}{=} \mathbb{E}_\pi\left[R_{t+1} e_t\right],$$

where $e_t \overset{\text{def}}{=} \gamma_{t+1}\lambda e_{t-1} + x_t$ is called the eligibility trace, for trace parameter $\lambda \in [0, 1]$. To obtain $w$, the system $A$ and $b$ are incrementally estimated, to solve $Aw = b$. For a trajectory $\{(S_t, A_t, S_{t+1}, R_{t+1})\}_{t=0}^{T-1}$, let $d_t \overset{\text{def}}{=} x_t - \gamma_{t+1} x_{t+1}$; then the above two expected terms are usually computed via a sample average, which can be recursively computed, in a numerically stable way, as

$$A_{t+1} = A_t + \tfrac{1}{t+1}\left(e_t d_t^\top - A_t\right), \qquad b_{t+1} = b_t + \tfrac{1}{t+1}\left(e_t R_{t+1} - b_t\right).$$
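As a reference point, the following NumPy sketch implements these sample-average updates directly; the trajectory interface (tuples of $S_t$, $S_{t+1}$, $R_{t+1}$, $\gamma_{t+1}$ together with a feature map $x$) is an assumed convenience for the example, not an API prescribed by the paper.

    import numpy as np

    def lstd_lambda(trajectory, x, lam, d):
        """Sample-average LSTD(lambda): returns A_T, b_T and w solving A w = b.

        trajectory : iterable of (s, s_next, r, gamma_next) transitions
        x          : feature map returning a length-d numpy array
        lam        : trace parameter lambda in [0, 1]
        """
        A = np.zeros((d, d))
        b = np.zeros(d)
        e = np.zeros(d)                                # eligibility trace, e_{-1} = 0
        for t, (s, s_next, r, gamma_next) in enumerate(trajectory):
            x_t, x_next = x(s), x(s_next)
            e = gamma_next * lam * e + x_t             # e_t = gamma_{t+1} lambda e_{t-1} + x_t
            d_t = x_t - gamma_next * x_next            # d_t = x_t - gamma_{t+1} x_{t+1}
            A += (np.outer(e, d_t) - A) / (t + 1.0)    # running average of e_t d_t^T
            b += (e * r - b) / (t + 1.0)               # running average of R_{t+1} e_t
        w = np.linalg.pinv(A) @ b                      # A need not be invertible early on
        return A, b, w

Each update costs $O(d^2)$ time and memory for the full $d \times d$ matrix $A_t$, which is precisely the expense that motivates sketching in the remainder of the paper.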
3 ISSUES WITH SKETCHING THE FEATURES

One approach to make LSTD more feasible is to project—sketch—the features. Sketching involves sampling a random matrix $S \in \mathbb{R}^{k \times d}$ from a family of matrices $\mathcal{S}$, to project a given $d$-dimensional vector $x$ to a (much smaller) $k$-dimensional vector $Sx$.
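For concreteness, the snippet below builds two common examples of such a sketch matrix—a dense Gaussian sketch and a count-sketch—and applies them to a feature vector. The dimensions and these particular families are illustrative choices for this example only, not necessarily the ones evaluated in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 4096, 64                            # illustrative dimensions

    # (a) Dense Gaussian sketch: i.i.d. N(0, 1/k) entries, so E[S^T S] = I.
    S_gauss = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))

    # (b) Count-sketch: each feature index is hashed to one row with a random sign,
    #     so computing S x costs O(d) (or O(nnz(x)) for sparse features).
    rows = rng.integers(0, k, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)

    def count_sketch(x_vec):
        out = np.zeros(k)
        np.add.at(out, rows, signs * x_vec)    # accumulate signed features per bucket
        return out

    x_vec = rng.random(d)                      # stand-in feature vector x
    x_gauss = S_gauss @ x_vec                  # k-dimensional projection S x
    x_cs = count_sketch(x_vec)                 # k-dimensional projection S x

Both reduce the $d$-dimensional features to $k$ numbers; the rest of this section examines the bias that sketching the features in this way introduces into the value function estimate.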