Effective sketching methods for value function approximation

Yangchen Pan, Erfan Sadeqi Azer and Martha White
Department of Computer Science, Indiana University Bloomington
[email protected], [email protected], [email protected]

Abstract

High-dimensional representations, such as radial basis function networks or tile coding, are common choices for policy evaluation in reinforcement learning. Learning with such high-dimensional representations, however, can be expensive, particularly for matrix methods, such as least-squares temporal difference learning or quasi-Newton methods that approximate matrix step-sizes. In this work, we explore the utility of sketching for these two classes of algorithms. We highlight issues with sketching the high-dimensional features directly, which can incur significant bias. As a remedy, we demonstrate how to use sketching more sparingly, with only a left-sided sketch, that can still enable significant computational gains and the use of these matrix-based learning algorithms that are less sensitive to parameters. We empirically investigate these algorithms in four domains, with a variety of representations. Our aim is to provide insights into effective use of sketching in practice.

1 INTRODUCTION

A common strategy for function approximation in reinforcement learning is to overparametrize: generate a large number of features to provide a sufficiently complex function space. For example, one typical representation is a radial basis function network, where the centers for each radial basis function are chosen to exhaustively cover the observation space. Because the environment is unknown—particularly for the incremental learning setting—such an overparameterized representation is more robust to this uncertainty, because a reasonable representation is guaranteed for any part of the space that might be visited. Once interacting with the environment, however, it is likely that not all features will become active, and that a lower-dimensional subspace will be visited.

A complementary approach for this high-dimensional representation expansion, therefore, is to use projections. In this way, we can over-parameterize for robustness, but then use a projection to a lower-dimensional space to make learning feasible. For an effectively chosen projection, we can avoid discarding important information, and benefit from the fact that the agent only visits a lower-dimensional subspace of the environment in the feature space.

Towards this aim, we investigate the utility of sketching: projecting with a random matrix. Sketching has been extensively used for efficient communication and solving large linear systems, with a solid theoretical foundation and a variety of different sketches (Woodruff, 2014). Sketching has been previously used in reinforcement learning, specifically to reduce the dimension of the features. Bellemare et al. (2012) replaced the standard biased hashing function used for tile coding (Sutton, 1996), instead using count-sketch.¹ Ghavamzadeh et al. (2010) investigated sketching features to reduce the dimensionality and make it feasible to run least-squares temporal difference learning (LSTD) for policy evaluation. In LSTD, the value function is estimated by incrementally computing a d × d matrix A, where d is the number of features, and a d-dimensional vector b, where the parameters are estimated as the solution to this linear system. Because d can be large, they randomly project the features to reduce the matrix size to k × k, with k ≪ d. For both of these previous uses of sketching, however, the resulting value function estimates are biased. This bias, as we show in this work, can be quite significant, resulting in significant estimation error in the value function for a given policy. As a result, any gains from using LSTD methods—over stochastic temporal difference (TD) methods—are largely overcome by this bias. A natural question is if we can benefit from sketching, with minimal bias or without incurring any bias at all.

¹They called the sketch the tug-of-war sketch, but it is more standard to call it count-sketch.

In this work, we propose to instead sketch the linear system in LSTD. The key idea is to only sketch the constraints of the system (the left side of A) rather than the variables (the right side of A). Sketching features, on the other hand, by design sketches both constraints and variables. We show that even with a straightforward linear system solution, the left-sided sketch can significantly reduce bias. We further show how to use this left-sided sketch within a quasi-Newton algorithm, providing an unbiased policy evaluation algorithm that can still benefit from the computational improvements of sketching. The key novelty in this work is designing such system-sketching algorithms when also incrementally computing the linear system solution. There is a wealth of literature on sketching linear systems to reduce computation. In general, however, many sketching approaches cannot be applied to the incremental policy evaluation problem, because the approaches are designed for a static linear system. For example, Gower & Richtárik (2015) provide a host of possible solutions for solving large linear systems. However, they assume access to A upfront, so the algorithm design, in memory and computation, is not suitable for the incremental setting. Some popular sketching approaches, such as Frequent Directions (Ghashami et al., 2014), have been successfully used in the online setting, for quasi-Newton algorithms (Luo et al., 2016); however, they sketch symmetric matrices that grow with the number of samples.

This paper is organized as follows. We first introduce the policy evaluation problem—learning a value function for a fixed policy—and provide background on sketching methods. We then illustrate issues with only sketching features, in terms of quality of the value function approximation. We then introduce the idea of using asymmetric sketching for policy evaluation with LSTD, and provide an efficient incremental algorithm that is O(dk) on each step. We finally highlight settings where we expect sketching to perform particularly well in practice, and investigate the properties of our algorithm on four domains, and with a variety of representation properties.

2 PROBLEM FORMULATION

We address the policy evaluation problem within reinforcement learning, where the goal is to estimate the value function for a given policy.² As is standard, the agent-environment interaction is formulated as a Markov decision process (S, A, Pr, r), where S is the set of states, A is the set of actions, and Pr : S × A × S → [0, ∞) is the one-step state transition dynamics. On each time step t = 1, 2, 3, ..., the agent selects an action according to its policy π, A_t ∼ π(S_t, ·), with π : S × A → [0, ∞), transitions into a new state S_{t+1} ∼ Pr(S_t, A_t, ·) and obtains scalar reward R_{t+1} = r(S_t, A_t, S_{t+1}).

²To focus the investigation on sketching, we consider the simpler on-policy setting in this work. Many of the results, however, generalize to the off-policy setting, where data is generated according to a behavior policy different than the given target policy we wish to evaluate.

For policy evaluation, the goal is to estimate the value function, v_π : S → R, which corresponds to the expected return when following policy π,

  v_π(s) = E_π[G_t | S_t = s],

where E_π is the expectation over future states when selecting actions according to π. The return, G_t ∈ R, is the discounted sum of future rewards given actions are selected according to π:

  G_t = R_{t+1} + γ_{t+1} R_{t+2} + γ_{t+1} γ_{t+2} R_{t+3} + ...   (1)
      = R_{t+1} + γ_{t+1} G_{t+1}

where γ_{t+1} ∈ [0, 1] is a scalar that depends on S_t, A_t, S_{t+1} and discounts the contribution of future rewards exponentially with time. A common setting, for example, is a constant discount. This recent generalization to state-dependent discount (Sutton et al., 2011; White, 2016) enables either episodic or continuing problems, and so we adopt this more general formalism here.

We consider linear function approximation to estimate the value function. In this setting, the observations are expanded to a higher-dimensional space, such as through tile coding, radial basis functions or a Fourier basis. Given this nonlinear encoding x : S → R^d, the value is approximated as v_π(S_t) ≈ w^T x_t for w ∈ R^d and x_t = x(S_t). One algorithm for estimating w is least-squares temporal difference learning (LSTD). The goal in LSTD(λ) (Boyan, 1999) is to minimize the mean-squared projected Bellman error, which can be represented as solving the following linear system

  A = E_π[e_t (x_t − γ_{t+1} x_{t+1})^T]
  b = E_π[R_{t+1} e_t]

where e_t = γ_{t+1} λ e_{t−1} + x_t is called the eligibility trace for trace parameter λ ∈ [0, 1]. To obtain w, the system A and b are incrementally estimated, to solve Aw = b. For a trajectory {(S_t, A_t, S_{t+1}, R_{t+1})}_{t=0}^{T−1}, let d_t = x_t − γ_{t+1} x_{t+1}; then the above two expected terms are usually computed via sample averages that can be recursively computed, in a numerically stable way, as

  A_{t+1} = A_t + (1/(t+1)) (e_t d_t^T − A_t)
  b_{t+1} = b_t + (1/(t+1)) (e_t R_{t+1} − b_t)

with A_0 = 0 and b_0 = 0. The incremental estimates A_t and b_t converge to A and b. A naive algorithm, where w = A_t^{-1} b_t is recomputed on each step, would result in O(d³) computation to compute the inverse A_t^{-1}. Instead, A_t^{-1} is incrementally updated using the Sherman-Morrison formula, with A_0^{-1} = ξ I for a small ξ > 0:

  A_t^{-1} = ( ((t−1)/t) A_{t−1} + (1/t) e_t d_t^T )^{-1}
           = (t/(t−1)) ( A_{t−1}^{-1} − (A_{t−1}^{-1} e_t d_t^T A_{t−1}^{-1}) / (t − 1 + d_t^T A_{t−1}^{-1} e_t) )

requiring O(d²) storage and computation per step. Unfortunately, this quadratic cost is prohibitive for many incremental learning settings. In our experiments, even d = 10,000 was prohibitive, since d² = 100 million.
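As a concrete illustration of this incremental computation, the following Python sketch accumulates b_t and keeps A_t^{-1} up to date with Sherman-Morrison, using the notation above. It is only an illustrative sketch—here using unnormalized sums A_t = ξI + Σ_i e_i d_i^T, which give the same solution A_t^{-1}b_t up to how the ξI initialization is scaled—and not the implementation used in the experiments.

```python
import numpy as np

class IncrementalLSTD:
    """Illustrative LSTD(lambda): accumulate A, b and keep A^{-1} via Sherman-Morrison."""

    def __init__(self, d, lam=0.0, xi=0.01):
        self.lam = lam
        self.Ainv = np.eye(d) / xi      # inverse of the xi * I initialization
        self.b = np.zeros(d)
        self.e = np.zeros(d)

    def update(self, x, reward, xnext, gamma_next):
        self.e = gamma_next * self.lam * self.e + x       # eligibility trace e_t
        dvec = x - gamma_next * xnext                     # d_t = x_t - gamma_{t+1} x_{t+1}
        self.b += reward * self.e
        # Sherman-Morrison update for (A + e_t d_t^T)^{-1}: O(d^2) instead of O(d^3)
        Ae = self.Ainv.dot(self.e)
        dA = dvec.dot(self.Ainv)
        self.Ainv -= np.outer(Ae, dA) / (1.0 + dA.dot(self.e))
        return self.Ainv.dot(self.b)                      # current weights w_t
```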
A natural approach to improve computation when solving for w is to use stochastic methods, such as TD(λ) (Sutton, 1988). This algorithm incrementally updates with w_{t+1} = w_t + α δ_t e_t for stepsize α > 0 and TD-error δ_t = R_{t+1} + (γ_{t+1} x_{t+1} − x_t)^T w_t. The expectation of this update is E_π[δ_t e_t] = b − A w_t; the fixed-point solutions are the same for both LSTD and TD, but LSTD corresponds to a batch solution whereas TD corresponds to a stochastic update. Though more expensive than TD—which is only O(d)—LSTD does have several advantages. Because LSTD is a batch method, it summarizes all samples (within A), and so can be more sample efficient. Additionally, LSTD has no step-size parameter, using a closed-form solution for w.

Recently, there has been some progress in better balancing between TD and LSTD. Pan et al. (2017) derived a quasi-Newton algorithm, called accelerated gradient TD (ATD), giving an unbiased algorithm that has some of the benefits of LSTD, but with significantly reduced computation, because they only maintain a low-rank approximation to A. The key idea is that A provides curvature information, and so can significantly improve step-size selection for TD and so improve the convergence rate. The approximate A can still provide useful curvature information, but can be significantly cheaper to compute. We use the ATD update to similarly obtain an unbiased algorithm, but use sketching approximations instead of low-rank approximations. First, however, we investigate some of the properties of sketching.

3 ISSUES WITH SKETCHING THE FEATURES

One approach to make LSTD more feasible is to project—sketch—the features. Sketching involves sampling a random matrix S ∈ R^{k×d} from a family of matrices S, to project a given d-dimensional vector x to a (much smaller) k-dimensional vector Sx. The goal in defining this class of sketching matrices is to maintain certain properties of the original vector. The following is a standard definition for such a family.

Definition 1 (Sketching). Let d and k be positive integers, δ ∈ (0, 1), and ε ∈ R. Then, S ⊂ R^{k×d} is called a family of sketching matrices with parameters (ε, δ) if, for a random matrix S chosen uniformly at random from this family, we have that for all x ∈ R^d

  P[ (1 − ε) ||x||²₂ ≤ ||Sx||²₂ ≤ (1 + ε) ||x||²₂ ] ≥ 1 − δ,

where the probability is w.r.t. the distribution over S.

We will explore the utility of sketching the features with several common sketches. These sketches all require k = Ω(ε^{-2} ln(1/δ) ln d).

Gaussian random projections, also known as the JL-transform (Johnson & Lindenstrauss, 1984), has each entry in S i.i.d. sampled from a Gaussian, N(0, 1/k).

Count sketch selects exactly one uniformly picked nonzero entry in each column, and sets that entry to either 1 or −1 with equal probability (Charikar et al., 2002; Gilbert & Indyk, 2010). The Tug-of-War sketch (Alon et al., 1996) performs very similarly to Count sketch in our experiments, and so we omit it.

Combined sketch is the product of a count sketch matrix and a Gaussian projection matrix (Wang, 2015; Boutsidis & Woodruff, 2015).

Hadamard sketch—the Subsampled Randomized Hadamard Transform—is computed as S = (1/√(kd)) D H_d P, where D ∈ R^{d×d} is a diagonal matrix with each diagonal element uniformly sampled from {1, −1}, H_d ∈ R^{d×d} is a Hadamard matrix and P ∈ R^{d×k} is a column sampling matrix (Ailon & Chazelle, 2006).

Sketching provides a low-error recovery S^T Sx of the original x, with high probability. For the above families, the entries in S are zero-mean i.i.d. with variance 1/k, giving E[S^T S] = I over all possible S. Consequently, in expectation, the recovery S^T Sx is equal to x. For a stronger result, a Chernoff bound can be used to bound the deviation of S^T S from this expected value: for the parameters (ε, δ) of the matrix family, we get that

  P[ (1 − ε) I ≺ S^T S ≺ (1 + ε) I ] ≥ 1 − δ.
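To make two of these sketch families concrete, the snippet below constructs a Gaussian (JL) sketch and a count sketch and checks the norm-preservation property of Definition 1 on a random vector. This is a standalone illustration, not code from the paper; the dimensions d = 1024 and k = 50 simply mirror the experiments.

```python
import numpy as np

def gaussian_sketch(k, d, rng):
    """JL-transform: i.i.d. N(0, 1/k) entries, so E[S^T S] = I."""
    return rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

def count_sketch(k, d, rng):
    """Count sketch: one +/-1 entry per column, in a uniformly chosen row."""
    S = np.zeros((k, d))
    rows = rng.integers(0, k, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    S[rows, np.arange(d)] = signs
    return S

rng = np.random.default_rng(0)
d, k = 1024, 50
x = rng.normal(size=d)
for name, S in [("gaussian", gaussian_sketch(k, d, rng)),
                ("count sketch", count_sketch(k, d, rng))]:
    # ||Sx|| / ||x|| should concentrate around 1, per Definition 1
    print(name, np.linalg.norm(S @ x) / np.linalg.norm(x))
```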

Figure 1: Efficacy of different sketches (gaussian, count sketch, combined, hadamard) for sketching the features for LSTD, with k = 50: (a) Mountain Car, tile coding; (b) Mountain Car, RBF; (c) Puddle World, tile coding; (d) Puddle World, RBF. The RMSE is w.r.t. the optimal value function, computed using rollouts. LSTD(λ) is included as the baseline, with w = A^{-1}b, with the other curves corresponding to different sketches of the features, to give w = (SAS^T)^{-1}Sb as used for the random-projections LSTD algorithm. The RBF width in Mountain Car is σ = 0.12 times the range of the state space and in Puddle World is σ = √0.0072. The 1024 centers for RBFs are chosen to uniformly cover the 2-d space in a grid. For tile coding, we discretize each dimension by 10, giving 10 × 10 grids, use 10 tilings, and set the memory size as 1024. The bias is high for tile coding features, and much better for RBF features, though still quite large. The different sketches perform similarly.

These properties suggest that using sketching for the feature vectors should provide effective approximations. Bellemare et al. (2012) showed that they could use these projections for tile coding, rather than the biased hashing function that is typically used, to improve learning performance in the control setting. The efficacy, however, of sketching given features, versus using the unsketched features, is less well understood.

We investigate the properties of sketching the features, shown in Figure 1, with a variety of sketches in two benchmark domains for RBF and tile-coding representations (see (Sutton & Barto, 1998, Chapter 8) for an overview of these representations). For both domains, the observation space is 2-dimensional, with expansion to d = 1024 and k = 50. The results are averaged over 50 runs, with ξ, λ swept over 13 values, with ranges listed in Appendix C. We see that sketching the features can incur significant bias, particularly for tile coding, even with a reasonably large k = 50 to give O(dk) runtimes. This bias reduces with k, but remains quite high and so is likely too unreliable for practical use.

4 SKETCHING THE LINEAR SYSTEM

All of the work on sketching within reinforcement learning has investigated sketching the features; however, we can instead consider sketching the linear system, Aw = b. For such a setting, we can sketch the left and right subspaces of A with different sketching matrices, S_L ∈ R^{k_L×d} and S_R ∈ R^{k_R×d}. Depending on the choices of k_L and k_R, we can then solve the smaller system S_L A S_R^T S_R w = S_L b efficiently. The goal is to better take advantage of the properties of the different sides of an asymmetric matrix A.

One such natural improvement should be in one-sided sketching. By only sketching from the left, for example, and setting S_R = I, we do not project w. Rather, we only project the constraints of the linear system Aw = b. Importantly, this does not introduce bias: the original solution w to Aw = b is also a solution to SAw = Sb for any sketching matrix S. The projection, however, removes uniqueness in terms of the solutions w, since the system is under-constrained. Conversely, by only sketching from the right, and setting S_L = I, we constrain the space of solutions to a unique set, and do not remove any constraints. For this setting, however, it is unlikely that the w with Aw = b satisfies AS^T w = b.

The conclusion from many initial experiments is that the key benefit from asymmetric sketching is when only sketching from the left. We experimented with all pairwise combinations of Gaussian random projections, Count sketch, Tug-of-War sketch and Hadamard sketch for S_L and S_R. We additionally experimented with only sketching from the right, setting S_L = I. In all of these experiments, we found asymmetric sketching provided little to no benefit over using S_L = S_R, and that sketching only from the right also performed similarly to using S_L = S_R. We further investigated column and row selection sketches (see Wang (2015) for a thorough overview), but also found these to be ineffective. We therefore proceed with an investigation into effectively using left-side sketching. In the next section, we provide an efficient O(dk) algorithm to compute (S_L A)^†, to enable computation of w = (S_L A)^† S_L b and for use within an unbiased quasi-Newton algorithm.
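The difference between sketching the features (both sides) and sketching only the constraints can be seen directly on a fixed system. The snippet below illustrates, on a synthetic positive semi-definite system standing in for the LSTD matrices, the two claims above: left-sided sketching keeps the unsketched solution feasible (but no longer unique), while two-sided sketching confines the solution to the random subspace spanned by S^T. It is a standalone illustration, not the experimental setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 200, 40
M = rng.normal(size=(d, d))
A = M @ M.T / d                      # a generic positive semi-definite system
w_star = rng.normal(size=d)
b = A @ w_star                       # so w_star solves A w = b exactly

S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))   # Gaussian sketch

# (1) Left-sided sketching keeps w_star feasible: S A w_star = S b by construction.
print(np.linalg.norm(S @ A @ w_star - S @ b))         # ~0: no bias introduced

# (2) But the sketched system is underconstrained: the minimum-norm solution of
#     (SA) w = Sb also satisfies the sketched constraints, yet differs from w_star.
w_min = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
print(np.linalg.norm(S @ A @ w_min - S @ b), np.linalg.norm(w_min - w_star))

# (3) Two-sided sketching restricts w to range(S^T); w_star itself is generally far
#     from that random subspace, so projecting the features cannot recover it.
proj = S.T @ np.linalg.lstsq(S.T, w_star, rcond=None)[0]   # projection onto range(S^T)
print(np.linalg.norm(w_star - proj) / np.linalg.norm(w_star))
```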
We conclude this section with an interesting connection to a data-dependent projection method that has been used for policy evaluation, which further motivates the utility of sketching only from the left. This algorithm, called truncated LSTD (tLSTD) (Gehring et al., 2016), incrementally maintains a rank-k approximation of the A matrix, using an incremental singular value decomposition. We show below that this approach corresponds to projecting A from the left with the top k left singular vectors. This is called a data-dependent projection, because the projection depends on the observed data, as opposed to the data-independent projection—the sketching matrices—which is randomly sampled independently of the data.

Proposition 1. Let A = UΣV^T be the singular value decomposition of the true A. Assume the singular values are in decreasing order and let Σ_k be the top k singular values, with corresponding k left singular vectors U_k and k right singular vectors V_k. Then the solution w = V_k Σ_k^† U_k^T b (used for tLSTD) corresponds to LSTD using asymmetric sketching with S_L = U_k^T and S_R = I.

Proof. We know U = [u_1, ..., u_d] for singular vectors u_i ∈ R^d with u_i^T u_i = 1 and u_i^T u_j = 0 for i ≠ j. Since U_k = [u_1, ..., u_k], we get that U_k^T U = [I_k 0_{d−k}] ∈ R^{k×d} for k-dimensional identity matrix I_k and zero matrix 0_{d−k} ∈ R^{k×(d−k)}. Then we get that S_L b = U_k^T b and S_L A = [I_k 0_{d−k}] Σ V^T = Σ_k V_k^T. Therefore, w = (S_L A)^† S_L b = V_k Σ_k^† U_k^T b.

5 LEFT-SIDED SKETCHING ALGORITHM

In this section, we develop an efficient approach to use the smaller, sketched matrix SA for incremental policy evaluation. The most straightforward way to use SA is to incrementally compute SA, and periodically solve w = (SA)^† Sb. This costs O(dk) per step, and O(d²k) every time the solution is recomputed. To maintain O(dk) computation per step, this full solution could only be computed every d steps, which is too infrequent to provide a practical incremental policy evaluation approach. Further, because it is an underconstrained system, there are likely to be infinitely many solutions to SAw = Sb; amongst those solutions, we would like to sub-select amongst the unbiased solutions to Aw = b.

We first discuss how to efficiently maintain (SA)^†, and then describe how to use that matrix to obtain an unbiased algorithm. Let Ã = SA ∈ R^{k×d}. For this under-constrained system with b̃ = Sb, the minimum norm solution to Ãw = b̃ is³ w = Ã^T(ÃÃ^T)^{-1}b̃, with Ã^† = Ã^T(ÃÃ^T)^{-1} ∈ R^{d×k}. To maintain Ã_t^† incrementally, therefore, we simply need to maintain Ã_t incrementally and the k × k matrix (Ã_t Ã_t^T)^{-1} incrementally. Let ẽ_t = S e_t and h_t = Ã_t d_t. We can update the sketched system in O(dk) time and space:

  Ã_{t+1} = Ã_t + (1/(t+1)) (ẽ_t d_t^T − Ã_t)
  b̃_{t+1} = b̃_t + (1/(t+1)) (ẽ_t R_{t+1} − b̃_t)

³We show in Proposition 2, Appendix A, that Ã is full row rank with high probability. This property is required to ensure that the inverse of ÃÃ^T exists. In practice, this is less of a concern, because we initialize the matrix Ã_0 Ã_0^T with a small positive value, ensuring invertibility of Ã_t Ã_t^T for finite t.

To maintain (Ã_t Ã_t^T)^{-1} incrementally, notice that the unnormalized update is

  Ã_{t+1} Ã_{t+1}^T = (Ã_t + ẽ_t d_t^T)(Ã_t + ẽ_t d_t^T)^T
                    = Ã_t Ã_t^T + ẽ_t h_t^T + h_t ẽ_t^T + ||d_t||²₂ ẽ_t ẽ_t^T.

Hence, (Ã_{t+1} Ã_{t+1}^T)^{-1} can be updated from (Ã_t Ã_t^T)^{-1} by applying the Sherman-Morrison update three times. For a normalized update, based on sample averages, the update is

  Ã_{t+1} Ã_{t+1}^T = (t/(t+1))² Ã_t Ã_t^T + (t/(t+1)²) (ẽ_t h_t^T + h_t ẽ_t^T) + (1/(t+1)²) ||d_t||²₂ ẽ_t ẽ_t^T.

We can then compute w_t = Ã_t^T (Ã_t Ã_t^T)^{-1} b̃_t on each step.
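The following sketch implements the incremental maintenance just described: Ã_t = S A_t, b̃_t = S b_t and (Ã_t Ã_t^T)^{-1} via three Sherman-Morrison updates, at O(dk) per step. As before it uses unnormalized sums rather than running averages, and the regularized initialization from footnote 3; it is a minimal sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def sherman_morrison(Minv, u, v):
    """Return (M + u v^T)^{-1} given Minv = M^{-1}."""
    Mu = Minv.dot(u)
    vM = v.dot(Minv)
    return Minv - np.outer(Mu, vM) / (1.0 + vM.dot(u))

class LeftSketchedLSTD:
    """Illustrative LSTD-L: maintain S*A, S*b and (SA (SA)^T)^{-1} in O(dk) per step."""

    def __init__(self, S, lam=0.0, xi=0.01):
        k, d = S.shape
        self.S, self.lam = S, lam
        self.At = np.zeros((k, d))           # \tilde A_t = S A_t
        self.bt = np.zeros(k)                # \tilde b_t = S b_t
        self.Minv = np.eye(k) / xi           # regularized init of (\tilde A \tilde A^T)^{-1}
        self.e = np.zeros(d)

    def update(self, x, reward, xnext, gamma_next):
        self.e = gamma_next * self.lam * self.e + x     # eligibility trace
        dvec = x - gamma_next * xnext                   # d_t
        e_s = self.S.dot(self.e)                        # sketched trace, O(dk)
        h = self.At.dot(dvec)                           # h_t = \tilde A_t d_t, O(dk)
        # three rank-one corrections to \tilde A \tilde A^T, one Sherman-Morrison each
        self.Minv = sherman_morrison(self.Minv, e_s, h)
        self.Minv = sherman_morrison(self.Minv, h, e_s)
        self.Minv = sherman_morrison(self.Minv, (dvec @ dvec) * e_s, e_s)
        self.At += np.outer(e_s, dvec)
        self.bt += reward * e_s
        # minimum-norm solution of (SA) w = Sb: w = (SA)^T (SA (SA)^T)^{-1} Sb
        return self.At.T.dot(self.Minv.dot(self.bt))
```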
This solution, however, will provide the minimum norm solution, rather than the unbiased solution, even though the unbiased solution is feasible for the underconstrained system. To instead push the preference towards this unbiased solution, we use the stochastic approximation algorithm called ATD (Pan et al., 2017). This method is a quasi-second-order method that relies on a low-rank approximation Â_t to A_t; using this approximation, the update is w_{t+1} = w_t + (α_t Â_t^† + ηI) δ_t e_t. Instead of being used to explicitly solve for w, the approximation matrix is used to provide curvature information. The inclusion of η constitutes a small regularization component that pushes the solution towards the unbiased solution.

We show in the next theorem that for our alternative approximation, we still obtain unbiased solutions. We use results for iterative methods for singular linear systems (Shi et al., 2011; Wang & Bertsekas, 2013), since A may be singular. A has been shown to be positive semi-definite under standard assumptions on the MDP (Yu, 2015); for simplicity, we assume A is positive semi-definite, instead of providing these MDP assumptions.

Assumption 1. For S ∈ R^{k×d} and B = α(SA)^† S + ηI with B ∈ R^{d×d}, the matrix BA is diagonalizable.

Assumption 2. A is positive semi-definite.

Assumption 3. α ∈ (0, 1/2) and 0 < η ≤ 1/(2λ_max(A)), where λ_max(A) is the maximum eigenvalue of A.

Theorem 1. Under Assumptions 1-3, the expected updating rule w_{t+1} = w_t + E_π[B δ_t e_t] converges to the fixed-point w* = A^† b.

Proof. The expected updating rule is E_π[B δ_t e_t] = B(b − A w_t). As in the proof of convergence for ATD (Pan et al., 2017, Theorem 1), we similarly verify the conditions from (Shi et al., 2011, Theorem 1.1).

Notice first that BA = α(SA)^† SA + ηA. For the singular value decomposition SA = UΣV^T, we have that (SA)^† SA = VΣ^†U^T UΣV^T = V[I_k̃ 0_{d−k̃}]V^T, where k̃ ≤ k is the rank of SA. The maximum eigenvalue of (SA)^† SA is therefore 1. Because (SA)^† SA and A are both positive semi-definite, BA is positive semi-definite. By Weyl's inequalities,

  λ_max(BA) ≤ α λ_max((SA)^† SA) + η λ_max(A).

Therefore, the eigenvalues of I − BA have absolute value strictly less than 1, because η ≤ (2λ_max(A))^{-1} and α < 1/2 = (2λ_max((SA)^† SA))^{-1} by assumption.

For the second condition, since BA is positive semi-definite and diagonalizable, we can write BA = QΛQ^{-1} for some matrix Q and diagonal matrix Λ with eigenvalues greater than or equal to zero. Then (BA)² = QΛQ^{-1}QΛQ^{-1} = QΛ²Q^{-1} has the same rank.

For the third condition, because BA is the sum of two positive semi-definite matrices, the nullspace of BA is a subset of the nullspace of each of those matrices individually: nullspace(BA) = nullspace(α(SA)^† SA + ηA) ⊆ nullspace(ηA) = nullspace(A). In the other direction, for all w such that Aw = 0, it is clear that BAw = 0, and so nullspace(A) ⊆ nullspace(BA). Therefore, nullspace(A) = nullspace(BA).
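A single step of the resulting ATD-L algorithm, using the quantities maintained in the previous snippet, might look as follows. The matrix B = α(SA)^† S + ηI is never formed explicitly; only O(dk) matrix-vector products are needed. Again this is an illustrative sketch under the stated assumptions, not the authors' code.

```python
import numpy as np

def atd_left_step(w, e, x, reward, xnext, gamma_next, S, At, Minv, alpha, eta, lam):
    """One ATD-L step: w <- w + (alpha * (SA)^+ S + eta * I) * delta_t * e_t.

    At = S A_t and Minv = (S A_t (S A_t)^T)^{-1} are maintained as in the
    previous snippet; only matrix-vector products are used, keeping the step O(dk).
    """
    e = gamma_next * lam * e + x                             # eligibility trace
    delta = reward + gamma_next * xnext.dot(w) - x.dot(w)    # TD error
    direction = alpha * At.T.dot(Minv.dot(S.dot(e))) + eta * e
    return w + delta * direction, e
```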
6 WHEN SHOULD SKETCHING HELP?

To investigate the properties of these sketching approaches, we need to understand when we expect sketching to have the most benefit. Despite the wealth of literature on sketching and strong theoretical results, there seem to be fewer empirical investigations into when sketching has most benefit. In this section, we elucidate some hypotheses about when sketching should be most effective, which we then explore in our experiments.

In the experiments for sketching the features in Section 3, it was clear that sketching the RBF features was much more effective than sketching the tile coding features. A natural investigation, therefore, is into the properties of representations that are more amenable to sketching. The key differences between these two representations are in terms of smoothness, density and overlap. The tile coding representation has non-smooth 0-1 features, which do not overlap in each grid. Rather, the overlap for tile coding results from overlapping tilings. This differs from RBF overlap, where centers are arranged in a grid and only edges of the RBF features overlap. The density of RBF features is significantly higher, since more RBFs are active for each input. Theoretical work in sketching for regression (Maillard & Munos, 2012), however, does not require features to be smooth. We empirically investigate these three properties—smoothness, density and overlap.

There are also some theoretical results that suggest sketching could be more amenable to more distinct features—less overlap or potentially fewer tilings. Balcan et al. (2006) showed a worst-case setting where data-independent sketching results in poor performance. They propose a two-stage projection, to maintain separability in classification. The first stage uses a data-dependent projection, to ensure features are not highly correlated, and the second uses a data-independent projection (a sketch) to further reduce the dimensionality after the orthogonal projection. The implied conclusion from this result is that, if the features are not highly correlated, then the first step can be avoided and the data-independent sketch should similarly maintain classification accuracy. This result suggests that sketching for feature expansions with less redundancy should perform better.

We might also expect sketching to be more robust to the condition number of the matrix. For sketching in regression, Fard et al. (2012) found a bias-variance trade-off when increasing k, where for large k, estimation error from a larger number of parameters became a factor. Similarly, in our experiments above, LSTD using an incremental Sherman-Morrison update has periodic spikes in the learning curve, indicating some instability. The smallest eigenvalue of the sketched matrix should be larger than that of the original matrix; this improvement in condition number compensates for the loss in information. Similarly, we might expect that maintaining an incremental singular value decomposition, for ATD, could be less robust than ATD with left-side sketching.

100 100 ods. The initialization for the matri- TD 90 90 TD(0) ces in the ATD methods is fixed to LSTD-P 80 TD(0) 80 the identity. The range for the regular- 70 70 TD ization term η is 0.1 times the range 60 60 ATD-SVD ATD-P 50 LSTD-P 50 for α. As before, the sketching ap- ATD-P LSTD-L 40 LSTD 40 ATD-L proaches with RBFs perform better 30 30 than with tile coding. The sensitivity 20 20 ATD-SVD ATD-L LSTD 10 10 of the left-side projection methods is LSTD-L 0 0 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14 significantly lower than the TD meth- Alpha/Eta Alpha/Eta ods. ATD-L also seems to be less sen- (c) Mountain Car, RBF, Sensitivity (d) Mountain Car, Tile coding, Sensitivity sitive than ATD-SVD, and incurs less bias than LSTD-L.

(a) Puddle World, RBF, k = 25  (b) Puddle World, RBF, k = 50  (c) Puddle World, RBF, k = 75

Figure 3: Change in performance when increasing k from 25 to 75. Two-sided projection (i.e., projecting the features) significantly improves with larger k, but is strictly dominated by left-side projection. At k = 50, the left-side projection methods outperform TD and have lower variance. ATD-SVD seems to gain less with increasing k, though in general we found ATD-SVD to perform more poorly than ATD-P, particularly for RBF representations.

7 EXPERIMENTS

In this section, we test the efficacy of sketching for LSTD and ATD in four domains: Mountain Car, Puddle World, Acrobot and Energy Allocation. We set k = 50, unless otherwise specified, average all results over 50 runs and sweep parameters for each algorithm. Detailed experimental settings, such as parameter ranges, are in Appendix C. To distinguish projections, we add -P for two-sided and -L for left-sided to the algorithm name.

We conclude that 1) two-sided projection—projecting the features—generally does much worse than only projecting the left side of A, 2) higher feature density is more amenable to sketching, particularly for two-sided sketching, 3) smoothness of features only seems to impact two-sided sketching, 4) ATD with sketching decreases bias relative to its LSTD variant and 5) ATD with left-sided sketching typically performs as well as ATD-SVD, but is significantly faster.

Performance and parameter sensitivity for RBFs and Tile coding. We first more exhaustively compare the algorithms in Mountain Car and Puddle World, in Figures 2 and 3, with additional such results in the appendix. As has been previously observed, TD with a well-chosen stepsize can perform almost as well as LSTD in terms of sample efficiency, but is quite sensitive to the stepsize.

(a) RBF, k = 50  (b) Spline, k = 50  (c) Tile coding, k = 50  (d) RBFs with tilings, k = 50

Figure 4: The effect of varying the representation properties, in Puddle World with d = 1024. In (a) and (b), we examine the impact of varying the overlap, for both smooth features (RBFs) and 0-1 features (Spline). For spline, the feature is 1 if ||x − ci|| < σ and otherwise 0. The spline feature represents a bin, like for tile coding, but here we adjust the widths of the bins so that they can overlap and do not use tilings. The x-axis has four width values, to give a corresponding feature vector norm of about 20, 40, 80, 120. In (c) and (d), we vary the redundancy, where number of tilings is increased and the total number of features kept constant. We generate tilings for RBFs like for tile coding, but for each grid cell use an RBF similarity rather than a spline similarity. We used 4 × 16 × 16, 16 × 8 × 8 and 64 × 4 × 4.

Here, therefore, we explore whether our matrix-based learning algorithms can reduce this parameter sensitivity. In Figure 2, we can indeed see that this is the case. The LSTD algorithms look a bit more sensitive, because we sweep over small initialization values for completeness. For tile coding, the range is a bit more narrow, but for RBFs, in the slightly larger range, the LSTD algorithms are quite insensitive. Interestingly, LSTD-L seems to be more robust. We hypothesize that the reason for this is that LSTD-L only has to initialize a smaller k × k symmetric matrix, (SA(SA)^T)^{-1} = ηI, and so is much more robust to this initialization. In fact, across settings, we found initializing to I was effective. Similarly, ATD-L benefits from this robustness, since it needs to initialize the same matrix, and then further overcomes bias using the approximation to A only for curvature information.

Impact of the feature properties. We explored the feature properties—smoothness, density, overlap and redundancy—where we hypothesized sketching should help, shown in Figure 4. The general conclusions are 1) the two-sided sketching methods improve—relative to LSTD—with increasing density (i.e., increasing overlap and increasing redundancy), 2) the smoothness of the features (RBF versus spline) seems to affect the two-sided projection methods much more, 3) the shape of the left-side projection methods follows that of LSTD and 4) ATD-SVD appears to follow the shape of TD more. Increased density generally seemed to degrade TD, and so ATD-SVD similarly suffered more in these settings. In general, the ATD methods had less gain over their corresponding LSTD variants with increasing density.

Experiments on high-dimensional domains. We finally apply our sketching techniques on two high-dimensional domains to illustrate practical usability: Acrobot and Energy allocation. The Acrobot domain (Sutton & Barto, 1998) is a four-dimensional episodic task, where the goal is to raise an arm to a certain height. The Energy allocation domain (Salas & Powell, 2013) is a five-dimensional continuing task, where the goal is to store and allocate energy to maximize profit. For Acrobot, we used 14,400 uniformly-spaced centers and for Energy allocation, we used the same tile coding of 8192 features as Pan et al. (2017). We summarize the results in the caption of Figure 5, with the overall conclusion that ATD-L provides an attractive way to reduce the parameter sensitivity of TD, and to benefit from sketching to reduce computation.

8 CONCLUSION AND DISCUSSION

In this work, we investigated how to benefit from sketching approaches for incremental policy evaluation. We first showed that sketching features can have significant bias issues, and proposed to instead sketch the linear system, enabling better control over how much information is lost. We highlighted that sketching for radial basis features seems to be much more effective than for tile coding, and further that a variety of natural asymmetric sketching approaches for sketching the linear system are not effective. We then showed that more carefully using sketching—particularly with left-side sketching within a quasi-Newton update—enables us to obtain an unbiased approach that can improve sample efficiency without incurring significant computation. Our goal in this work was to provide practical methods that can benefit from sketching, and to start a focus on empirically investigating settings in which sketching is effective.

Sketching has been used for quasi-Newton updates in online learning; a natural question is if those methods are applicable for policy evaluation.

(a) Acrobot, RBF  (b) Acrobot, RMSE vs Time (runtimes labeled in plot: TD 0.9s, TD(0) 0.9s, ATD-P 7.5s, LSTD-P 9.5s, ATD-L 17s, LSTD-L 20s, ATD-SVD 685s)  (c) Energy allocation, Tile coding

Figure 5: Results in domains with high-dimensional features, using k = 50 and with results averaged over 30 runs. For Acrobot, the (left-side) sketching methods perform well and are much less sensitive to parameters than TD. For runtime, we show RMSE versus time, allowing the algorithms to process up to 25 samples per second, to simulate a real-time learning setting; slow algorithms cannot process all 25 within a second. With computation taken into account, ATD-L has a bigger win over ATD-SVD, and does not lose relative to TD. Total runtime in seconds for one run for each algorithm is labeled in the plot. ATD-SVD is much slower, because of the incremental SVD. For the Energy Allocation domain, the two-sided projection methods (LSTD-P, ATD-P) are significantly worse than other algorithms. Interestingly, here ATD-SVD has a bigger advantage, likely because sketching the tile coding features is less effective.

Luo et al. (2016) consider sketching approaches for an online Newton update, for general functions rather than just the linear function approximation case we consider here. They similarly have to consider updates amenable to incrementally approximating a matrix (a Hessian in their case). In general, however, porting these quasi-Newton updates to policy evaluation for reinforcement learning is problematic for two reasons. First, the objective function for temporal difference learning is the mean-squared projected Bellman error, which is the product of three expectations. It is not straightforward to obtain an unbiased sample of this gradient, which is why Pan et al. (2017) propose a slightly different quasi-Newton update that uses A as a preconditioner. Consequently, it is not straightforward to apply quasi-Newton online algorithms that assume access to unbiased gradients. Second, the Hessian can be nicely approximated in terms of gradients, and is symmetric; both are exploited when deriving the sketched online Newton update (Luo et al., 2016). We, on the other hand, have an asymmetric matrix A.

In the other direction, we could consider if our approach could be beneficial for the online regression setting. In this setting, with γ = 0, the matrix A actually corresponds to the Hessian. In contrast to previous approaches that sketched the features (Maillard & Munos, 2012; Fard et al., 2012; Luo et al., 2016), therefore, one could instead sketch the system and maintain (SA)^†. Since the second-order update is A^{-1}g_t for gradient g_t on iteration t, an approximate second-order update could be computed as ((SA)^† S + ηI)g_t.
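A hypothetical version of this sketched second-order update for online least-squares regression is sketched below; the running Hessian estimate plays the role of A, and the preconditioner (SA)^† S + ηI follows the update suggested above. For clarity it recomputes the pseudo-inverse on each step; an incremental version would maintain it as in Section 5. This is only an illustrative sketch, not a method evaluated in the paper.

```python
import numpy as np

def sketched_newton_regression(X, y, S, alpha=0.5, eta=0.01):
    """Hypothetical sketched second-order update for online least squares.

    A_t here is the running Hessian estimate (1/t) sum_i x_i x_i^T, and the update
    w <- w - (alpha (S A_t)^+ S + eta I) g_t follows the suggestion above.
    """
    n, d = X.shape
    k = S.shape[0]
    w = np.zeros(d)
    SA = np.zeros((k, d))                          # running S A_t
    for t in range(n):
        x, target = X[t], y[t]
        SA += (np.outer(S.dot(x), x) - SA) / (t + 1.0)
        g = (x.dot(w) - target) * x                # gradient of 0.5 * (x^T w - y)^2
        w -= alpha * np.linalg.pinv(SA).dot(S.dot(g)) + eta * g
    return w
```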
In our experiments, we found sketching both sides of A to be less effective and found little benefit from modifying the chosen sketch; however, these empirical conclusions warrant further investigation. With more understanding of the properties of A, it could be possible to benefit from this variety. For example, sketching the left side of A could be seen as sketching the eligibility trace, and the right side as sketching the difference between successive features. For some settings, there could be properties of either of these vectors that are particularly suited to a certain sketch. As another example, the key benefit of many of the sketches over Gaussian random projections is in enabling the dimension k to be larger, by using (sparse) sketching matrices where dot products are efficient. We could not easily benefit from these properties, because SA could be dense, and computing matrix-vector products and incremental inverses would be expensive for larger k. For sparse A, or when SA has specialized properties, it could be more possible to benefit from different sketches.

Finally, the idea of sketching fits well into a larger theme of random representations within reinforcement learning. A seminal paper on random representations (Sutton & Whitehead, 1993) demonstrates the utility of random threshold units, as opposed to more carefully learned units. Though end-to-end training has become more popular in recent years, there is evidence that random representations can be quite powerful (Aubry & Jaffard, 2002; Rahimi & Recht, 2007, 2008; Maillard & Munos, 2012), or even combined with descent strategies (Mahmood & Sutton, 2013). For reinforcement learning, this learning paradigm is particularly suitable, because data cannot be observed upfront. Data-independent representations, such as random representations and sketching approaches, are therefore particularly appealing and warrant further investigation for the incremental learning setting within reinforcement learning.

ACKNOWLEDGEMENTS

This research was supported by NSF CRII-RI-1566186.

References

Ailon, Nir and Chazelle, Bernard. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. ACM Symposium on Theory of Computing, 2006.

Alon, Noga, Matias, Yossi, and Szegedy, Mario. The space complexity of approximating the frequency moments. In ACM Symposium on Theory of Computing, 1996.

Aubry, Jean-Marie and Jaffard, Stéphane. Random Wavelet Series. Communications in Mathematical Physics, 2002.

Balcan, Maria-Florina, Blum, Avrim, and Vempala, Santosh. Kernels as features: On kernels, margins, and low-dimensional mappings, 2006.

Bellemare, Marc, Veness, Joel, and Bowling, Michael. Sketch-Based Linear Value Function Approximation. In Advances in Neural Information Processing Systems, 2012.

Boutsidis, Christos and Woodruff, David P. Communication-optimal distributed principal component analysis in the column-partition model. arXiv:1504.06729, 2015.

Boyan, J and Moore, A W. Generalization in Reinforcement Learning: Safely Approximating the Value Function. Advances in Neural Information Processing Systems, 1995.

Boyan, J A. Least-squares temporal difference learning. International Conference on Machine Learning, 1999.

Charikar, M, Chen, K, and Farach-Colton, M. Finding frequent items in data streams. Theoretical Computer Science, 2002.

Fard, Mahdi Milani, Grinberg, Yuri, Pineau, Joelle, and Precup, Doina. Compressed Least-Squares Regression on Sparse Spaces. AAAI, 2012.

Gehring, Clement, Pan, Yangchen, and White, Martha. Incremental Truncated LSTD. International Joint Conference on Artificial Intelligence, 2016.

Ghashami, M, Desai, A, and Phillips, J M. Improved practical matrix sketching with guarantees. European Symposium on Algorithms, 2014.

Ghavamzadeh, M, Lazaric, A, Maillard, O A, and Munos, R. LSTD with random projections. In Advances in Neural Information Processing Systems, 2010.

Gilbert, A and Indyk, P. Sparse recovery using sparse matrices. Proceedings of the IEEE, 2010.

Gower, Robert M and Richtárik, Peter. Randomized Iterative Methods for Linear Systems. arXiv:1506.03296, 2015.

Johnson, William B and Lindenstrauss, Joram. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 1984.

Luo, Haipeng, Agarwal, Alekh, Cesa-Bianchi, Nicolò, and Langford, John. Efficient Second Order Online Learning by Sketching. Advances in Neural Information Processing Systems, 2016.

Mahmood, A R and Sutton, R.S. Representation search through generate and test. In Proceedings of the AAAI Workshop on Learning Rich Representations from Low-Level Sensors, 2013.

Maillard, Odalric-Ambrym and Munos, Rémi. Linear Regression With Random Projections. Journal of Machine Learning Research, 2012.

Pan, Yangchen, White, Adam, and White, Martha. Accelerated Gradient Temporal Difference Learning. In AAAI Conference on Artificial Intelligence, 2017.

Rahimi, A and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.

Rahimi, Ali and Recht, Benjamin. Uniform approximation of functions with random bases. In Annual Allerton Conference on Communication, Control, and Computing, 2008.

Salas, D F and Powell, W B. Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy Storage Problems. Dept Oper Res Financial Eng, 2013.

Shi, X, Wei, Y, and Zhang, W. Convergence of general non-stationary iterative methods for solving singular linear equations. SIAM Journal on Matrix Analysis and Applications, 2011.

Sutton, Richard S, Modayil, J, Delp, M, Degris, T, Pilarski, P.M., White, A, and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, 2011.

Sutton, R.S. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

Sutton, R.S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1996.

Sutton, R.S. and Barto, A G. Reinforcement Learning: An Introduction. MIT Press, 1998.

Sutton, R.S. and Whitehead, S.D. Online learning with random representations. In International Conference on Machine Learning, 1993.

Wang, Mengdi and Bertsekas, Dimitri P. On the convergence of simulation-based iterative methods for solving singular linear systems. Stochastic Systems, 2013.

Wang, Shusen. A Practical Guide to Randomized Matrix Computations with MATLAB Implementations. arXiv:1505.07570, 2015.

White, Martha. Unifying task specification in reinforcement learning. arXiv:1609.01995, 2016.

Woodruff, David P. Sketching as a tool for numerical linear algebra. arXiv:1411.4357, 2014.

Yu, Huizhen. On convergence of emphatic temporal-difference learning. In Annual Conference on Learning Theory, 2015.

A Row-rank properties of SA

To ensure the right pseudo-inverse is well-defined in Section 5, we show that the projected matrix SA is full row-rank with high probability, if A has sufficiently high rank. We know that the probability measure of row-rank-deficient matrices for S has zero mass. In the following, however, we prove a stronger and practically more useful claim: that SA is far from being row-rank deficient. Formally, we define a matrix to be δ-full row-rank if there is no row that can be replaced by another row within distance at most δ to make that matrix row-rank deficient.

Proposition 2. Let S ∈ R^{k×d} be any Gaussian matrix with 0 mean and unit variance. For r_A = rank(A) and for any δ > 0, SA is δ-full row-rank with probability at least 1 − exp(−2 (r_A(1 − 0.8δ) − k)² / r_A).

Proof. Let A = UΣV^T be the SVD of A. Since U is an orthonormal matrix, S' = SU has the same distribution as S, and the rank of SA is the same as that of S'Σ. Moreover, notice that the last d − r_A columns of S' get multiplied by all-zero rows of Σ. Therefore, in what follows, we assume we draw a random matrix S' ∈ R^{k×r_A} (similar to how S is drawn), and that Σ ∈ R^{r_A×r_A} is a full-rank diagonal matrix. We study the rank of S'Σ.

Consider iterating over the rows of S'; the probability that any new row is δ-far from being a linear combination of the previous ones is at least 1 − 0.8δ. To see why, assume that you currently have i rows and sample another vector v, with entries sampled i.i.d. from a standard Gaussian, as the candidate for the next row in S'. The length corresponding to the projection of any row S'_{j:} onto v, i.e., S'_{j:}v ∈ R, is a Gaussian random variable. Thus, the probability of S'_{j:}v being within δ is at most 0.8δ. This follows from the fact that the area under the probability density function of a standard Gaussian random variable over [0, x] is at most 0.4x, for any x > 0.

This stochastic process is a Bernoulli trial with success probability of at least 1 − 0.8δ. The trial stops when there are k successes or when the number of iterations reaches r_A. The Hoeffding inequality bounds the probability of failure by exp(−2 (r_A(1 − 0.8δ) − k)² / r_A).

B Alternative iterative updates

In addition to the proposed iterative algorithm using a left-sided sketch of A, we experimented with a variety of alternative updates that proved ineffective. We list them here for completeness.

We experimented with a variety of iterative updates. For a linear system Aw = b, one can iteratively update using w_{t+1} = w_t + α(b − A w_t), and w_t will converge to a solution of the system (under some conditions). We tested the following ways to use sketched linear systems.

First, for the two-sided sketched A, we want to solve S_L A S_R^T w = S_L b. If Ã_t = S_L A_t S_R^T is square, we can use the iterative update

  Ã_{t+1} = Ã_t + (1/(t+1)) (S_L e_t (S_R d_t)^T − Ã_t)
  b̃_{t+1} = b̃_t + (1/(t+1)) (R_{t+1} S_L e_t − b̃_t)
  w_{t+1} = w_t + α_t (b̃_{t+1} − Ã_{t+1} w_t)
          = w_t + α_t (S_L b_{t+1} − S_L A_{t+1} S_R^T w_t)

and use w for prediction on the sketched features. Another option is to maintain the inverse incrementally, using Sherman-Morrison:

  a_d^T = (S_R d_t)^T Ã_t^{-1},  a_u = Ã_t^{-1} S_L e_t
  Ã_{t+1}^{-1} = Ã_t^{-1} − (a_u a_d^T) / (1 + (S_R d_t)^T a_u)
  b̃_{t+1} = b̃_t + (1/(t+1)) (R_{t+1} S_L e_t − b̃_t)
  w_t = Ã_t^{-1} b̃_t

If S_L A S_R^T is not square (e.g., S_R = I), we instead solve S_L^T S_L A S_R^T S_R w = S_L^T S_L b, where applying S_L^T provides the recovery from the left and S_R^T the recovery from the right.

Second, with the same sketching, we also experimented with S_L^†, instead of S_L^T, for the recovery, and similarly for S_R, but this provided no improvement. For this square system, the iterative update is

  w_{t+1} = w_t + α_t S_L^† (b̃_{t+1} − Ã_{t+1} S_R w_t)
          = w_t + α_t S_L^† (S_L b_{t+1} − S_L A_{t+1} S_R^T S_R w_t)

for the same b̃_t and Ã_t, which can be efficiently kept incrementally, while the pseudo-inverse of S_L only needs to be computed once at the beginning.

Third, we tried to solve the system S_L^T S_L A w = b, using the updating rule w_{t+1} = w_t + α_t (b_{t+1} − S_L^T S_L A_{t+1} w_t), where the matrix S_L A_{t+1} can be incrementally maintained at each step by using a simple rank-one update.

Fourth, we tried to explicitly regularize these iterative updates by adding a small step in the direction of δ_t e_t.

In general, none of these iterative methods performed well. We hypothesize this may be due to difficulties in choosing stepsize parameters. Ultimately, we found the sketched update within ATD to be the most effective.

C Experimental details

Mountain Car is a classical episodic task with the goal of driving the car to the top of the mountain. The state is 2-dimensional, consisting of the (position, velocity) of the car.
We used the specification from Sutton & Barto (1998). We compute the true values of 2000 states, where each testing state is sampled from a trajectory generated by the given policy. From each test state, we estimate the value—the expected return—by computing the average over 1000 returns, generated by rollouts. The policy for Mountain Car is the energy-pumping policy with 20% randomness, starting from slightly random initial states. The discount rate is 1.0, and is 0 at the end of the episode, and the reward is always −1.

Puddle World (Boyan & Moore, 1995) is an episodic task, where the goal is for a robot in a continuous gridworld to reach a goal state in as few steps as possible. The state is 2-dimensional, consisting of (x, y) positions. We use the same setting as described in (Sutton & Barto, 1998), with a discount of 1.0 and reward −1 per step, except when going through a puddle, which gives a higher-magnitude negative reward. We compute the true values from 2000 states in the same way as in Mountain Car. A simple heuristic policy, choosing the action leading to the shortest Euclidean distance, with 10% randomness, is used.

Acrobot is a four-dimensional episodic task, where the goal is to raise an arm to a certain level. The reward is −1 for non-terminal states and 0 for the goal state, again with discount set to 1.0. We use the same tile coding as described in (Sutton & Barto, 1998), except that we use memory size 2^15 = 32,768. To get a reasonable policy, we used true-online Sarsa(λ) for 15000 episodes with stepsize α = 0.1/48 and bootstrap parameter λ = 0.9. Each episode starts with slight randomness. The policy is ε-greedy with respect to the state value, with ε = 0.05. The way we compute true values and generate training trajectories is the same as described for the above two domains.

Energy allocation (Salas & Powell, 2013) is a continuing task with a five-dimensional state, where we use the same settings as in Pan et al. (2017). The matrix A was shown to have a low-rank structure (Pan et al., 2017) and hence matrix approximation methods are expected to perform well.

For radial basis functions, we used the form k(x, c) = exp(−||x − c||²₂ / (2σ²)), where σ is called the RBF width and c is a center. On Mountain Car, because the position and velocity have different ranges, we set the bandwidth separately for each dimension, using k(x, c) = exp(−((x₁ − c₁)/(0.12 r₁))² − ((x₂ − c₂)/(0.12 r₂))²), where r₁ is the range of the first state variable and r₂ is the range of the second state variable.

In Figure 4, we used a relatively rarely used representation which we call the spline feature. For a sample x, the i-th spline feature is set to 1 if ||x − c_i|| < δ and otherwise set to 0. The centers are selected in exactly the same way as for the RBFs.
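For concreteness, the snippet below constructs RBF and spline features following the definitions above, with a per-dimension bandwidth of 0.12 times the state range. The Mountain Car bounds and the 32 × 32 grid of centers (giving d = 1024) are illustrative assumptions, not necessarily the exact configuration used in the experiments.

```python
import numpy as np

def rbf_features(x, centers, ranges, scale=0.12):
    """RBF features with a per-dimension bandwidth of `scale` times the state range."""
    z = (x[None, :] - centers) / (scale * ranges[None, :])
    return np.exp(-np.sum(z ** 2, axis=1))

def spline_features(x, centers, width):
    """0-1 'spline' features: 1 if the sample is within `width` of a center, else 0."""
    return (np.linalg.norm(x[None, :] - centers, axis=1) < width).astype(float)

# Example: a 32 x 32 grid of centers over a 2-d state space such as Mountain Car,
# whose position/velocity bounds are assumed here for illustration.
lows, highs = np.array([-1.2, -0.07]), np.array([0.6, 0.07])
grid = np.linspace(0.0, 1.0, 32)
centers = lows + np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2) * (highs - lows)
phi = rbf_features(np.array([-0.5, 0.0]), centers, highs - lows)
```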
Parameter optimization. We swept the following ranges for the stepsize (α), bootstrap parameter (λ), regularization parameter (η), and initialization parameter ξ for all domains:

1. α ∈ {0.1 × 2.0^j | j = −7, −6, ..., 4, 5}, divided by the l1 norm of the feature representation, 13 values in total.
2. λ ∈ {0.0, 0.1, ..., 0.9, 0.93, 0.95, 0.97, 0.99, 1.0}, 15 values in total.
3. η ∈ {0.01 × 2.0^j | j = −7, −6, ..., 4, 5}, divided by the l1 norm of the feature representation, 13 values in total.
4. ξ ∈ {10^j | j = −5, −4.25, −3.5, ..., 2.5, 3.25, 4.0}, 13 values in total.

To choose the best parameter setting for each algorithm, we used the sum of RMSE across all steps for all the domains except Energy allocation. For this domain, optimizing based on the whole range causes TD to pick an aggressive step-size to improve early learning at the expense of later learning. Therefore, for Energy allocation, we instead select the best parameters based on the sum of the RMSE for the second half of the steps.

For the ATD algorithms, as done in the original paper, we set α_t = 1/t and only swept the regularization parameter η, which can also be thought of as a (smaller) final stepsize. For this reason, the range of η is set to 0.1 times the range of α, to adjust this final stepsize range to an order of magnitude lower.

Additional experimental results. In the main paper, we demonstrated a subset of the results to highlight conclusions. For example, we showed the learning curves and parameter sensitivity in Mountain Car, for RBFs and tile coding. Due to space, we did not show the corresponding results for Puddle World in the main paper; we include these experiments here. Similarly, we only showed Acrobot with RBFs in the main text, and include results with tile coding here.

(a) Puddle World, tile coding, k = 50  (b) Puddle World, RBF, k = 50  (c) Puddle World, tile coding, k = 50  (d) Puddle World, RBF, k = 50

Figure 6: The two sensitivity figures correspond to the above two learning curves on the Puddle World domain. Note that we sweep the initialization for LSTD-P, but keep the initialization parameter fixed across all other settings. The one-sided projection is almost insensitive to initialization and the corresponding ATD version is insensitive to regularization. Though ATD-SVD also shows insensitivity, the performance of ATD-SVD is much worse than the sketching methods for the RBF representation. One should also note that ATD-SVD is much slower, as shown in the figures below.

(a) Mountain Car, Tile coding, k = 25  (b) Mountain Car, Tile coding, k = 50  (c) Mountain Car, Tile coding, k = 75

Figure 7: Change in performance when increasing k from 25 to 75. We can draw similar conclusions to the same experiments in Puddle World in the main text. Here, the unbiasedness of ATD-L is even more evident; even with as low a dimension as 25, it performs similarly to LSTD.

(a) Acrobot, tile coding, k = 50  (b) Acrobot, tile coding, k = 50  (c) Acrobot, RBF, k = 75

Figure 8: Additional experiments in Acrobot, for tile coding with k = 50 and for RBFs with k = 75.