From: AAAI Technical Report WS-97-10. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Estimator Variance in Reinforcement Learning: Theoretical Problems and Practical Solutions

Mark D. Pendrith and Malcolm R. K. Ryan
School of Computer Science and Engineering
The University of New South Wales
Sydney 2052 Australia
{pendrith,malcolmr}@cse.unsw.edu.au

Abstract

In reinforcement learning, as in many on-line search techniques, a large number of estimation parameters (e.g. Q-value estimates for 1-step Q-learning) are maintained and dynamically updated as information comes to hand during the learning process. Excessive variance of these estimators can be problematic, resulting in uneven or unstable learning, or even making effective learning impossible. Estimator variance is usually managed only indirectly, by selecting global learning algorithm parameters (e.g. λ for TD(λ) based methods) that are a compromise between an acceptable level of estimator perturbation and other desirable system attributes, such as reduced estimator bias. In this paper, we argue that this approach may not always be adequate, particularly for noisy and non-Markovian domains, and present a direct approach to managing estimator variance, the new ccBeta algorithm. Empirical results in an autonomous robotics domain are also presented showing improved performance using the ccBeta method.

Introduction

Many domains of interest in AI are too large to be searched exhaustively in reasonable time. One approach has been to employ on-line search techniques, such as reinforcement learning (RL). In RL, as in many on-line search techniques, a large number of estimation parameters (e.g. Q-value estimates for 1-step Q-learning) are maintained and dynamically updated as information comes to hand during the learning process. Excessive variance of these estimators during the learning process can be problematic, resulting in uneven or unstable learning, or even making effective learning impossible.

Normally, estimator variance is managed only indirectly, by selecting global learning algorithm parameters (e.g. λ for TD(λ) based methods) that trade off the level of estimator perturbation against other system attributes, such as estimator bias or rate of adaptation. In this paper, we give reasons why this approach may sometimes run into problems, particularly for noisy and non-Markovian domains, and present a direct approach to managing estimator variance, the ccBeta algorithm.

RL as On-line Dynamic Programming

RL has been characterised as an asynchronous, on-line form of Dynamic Programming (DP) (Watkins 1989; Sutton, Barto, & Williams 1992). This characterisation works well if the domain is well modelled by a Markov Decision Process (MDP) (Puterman 1994).

Formally, an MDP can be described as a quintuple (S, A, α, T, ρ). S is the set of process states, which may include a special terminal state. From any non-terminal state, an action can be selected from the set of actions A, although not all actions may be available from all states.

At time-step t = 0 the process starts at random in one of the states according to a starting probability distribution function α. At any time-step t > 0 the process is in exactly one state, and selecting an action from within state s_t results in a transition to a successor state s_{t+1} according to a transition probability function T. Also, an immediate scalar payoff (or reward) r_t is received, the expected value of which is determined by ρ, a (possibly stochastic) mapping of S × A into the reals. Once started, the process continues indefinitely or until a terminal state is reached.

By a Markov process, we mean that both the state transition probability function T and the payoff expectation ρ depend only upon the action selected and the current state. In particular, the history of states/actions/rewards of the process leading to the current state does not influence T or ρ. We will also consider non-Markov Decision Processes (NMDPs), which are formally identical except that the Markov assumption is relaxed; for a variety of reasons NMDPs often better model complex real-world RL domains than do MDPs (Pendrith & Ryan 1996).
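The quintuple above maps naturally onto a small data structure. The following sketch is illustrative only and is not taken from the paper (which defines the MDP purely mathematically); the class and attribute names are assumptions made here for concreteness.

# Illustrative sketch of the MDP quintuple (S, A, alpha, T, rho) as a
# samplable object. All names are assumptions, not the paper's code.
import random

class MDP:
    def __init__(self, states, actions, start_dist, trans, reward, terminal):
        self.states = states          # S, possibly including a terminal state
        self.actions = actions        # actions[s]: actions available from state s
        self.start_dist = start_dist  # alpha: {state: probability} at t = 0
        self.trans = trans            # T: trans[(s, a)] = {s_next: probability}
        self.reward = reward          # rho: reward(s, a) -> scalar payoff r_t
        self.terminal = terminal      # set of terminal states

    def reset(self):
        """Draw the starting state according to alpha."""
        states, probs = zip(*self.start_dist.items())
        return random.choices(states, weights=probs)[0]

    def step(self, s, a):
        """Sample the successor state from T and the immediate payoff from rho."""
        succ, probs = zip(*self.trans[(s, a)].items())
        s_next = random.choices(succ, weights=probs)[0]
        return s_next, self.reward(s, a)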

Generally, when faced with a decision process, the problem is to discover a mapping S → A (or policy) that maximises the expected total future discounted reward R = Σ_{t=0}^{∞} γ^t r_t for some discount factor γ ∈ [0,1]. If γ = 1, then the total future discounted reward is just the total reward R = Σ_{t=0}^{∞} r_t, where all future rewards are weighted equally; otherwise, rewards received sooner are weighted more heavily than those received later.

Q-learning as on-line value iteration

If an RL method like 1-step Q-learning (QL) (Watkins 1989) is used to find the optimal policy for an MDP, the method resembles an asynchronous, on-line form of the DP value iteration method. QL can be viewed as a relaxation method that successively approximates the so-called Q-values of the process, the value Q^π(s_t, a_t) being the expected value of the return from taking action a_t in state s_t and following policy π from that point on.

We note that if an RL agent has access to the Q-values for an optimal policy for the system it is controlling, it is easy to act optimally without planning; simply selecting the action from each state with the highest Q-value will suffice. The QL algorithm has been shown to converge, under suitable conditions¹, to just these Q-values. The algorithm is briefly recounted below.

At each step, QL updates an entry in its table of Q-value estimates according to the following rule (presented here in a "delta rule" form):

    Q(s_t, a_t) ← Q(s_t, a_t) + β Δ_t    (1)

where β is a step-size parameter, and

    Δ_t = r_t^(1) − Q(s_t, a_t)    (2)

where r_t^(1) is the 1-step corrected truncated return (CTR):

    r_t^(1) = r_t + γ max_a Q(s_{t+1}, a)    (3)

The 1-step CTR is a special case of the n-step CTR. Using Watkins' (1989) notation,

    r_t^(n) = r_t^[n] + γ^n max_a Q(s_{t+n}, a)    (4)

where r_t^[n] is the simple uncorrected n-step truncated return (UTR)

    r_t^[n] = Σ_{i=0}^{n−1} γ^i r_{t+i}    (5)

As n → ∞, both r_t^(n) and r_t^[n] approach the infinite horizon or actual return

    r_t = Σ_{i=0}^{∞} γ^i r_{t+i}    (6)

as a limiting case. We note that if in (2) r_t^(1) were replaced with the actual return, then this would form the update rule for a Monte Carlo estimation procedure. However, rather than waiting for the completed actual return, QL instead employs a "virtual return" that is an estimate of the actual return. This makes the estimation process resemble an on-line value iteration method. One view of the n-step CTR is that it bridges at its two extremes value iteration and Monte Carlo methods. One could also make the observation that such a Monte Carlo method would be an on-line, asynchronous analog of policy iteration, another important DP technique.

The central conceptual importance of the CTR to RL techniques is that virtually all RL algorithms estimate the value function of the state/action pairs of the system using either single or multi-step CTRs directly, as in the case of QL or the C-Trace algorithm (Pendrith & Ryan 1996), or as returns that are equivalent to weighted sums of varying length n-step CTRs, such as the TD(λ) return (Sutton 1988; Watkins 1989).

CTR Bias and Variance

For RL in Markovian domains, the choice of the length of the CTR is usually viewed as a trade-off between bias and variance of the returns to the estimation parameters, and hence of the estimation parameters themselves (e.g. Watkins 1989).

The idea is that shorter CTRs should exhibit less variance but more bias than longer CTRs. The increased bias will be due to the increased weight in the return values of estimators that will, in general, be inaccurate while learning is still taking place. The expected reduction in estimator variance is due to the fact that for a UTR the variance of the return will be strictly non-decreasing as n increases.

Applying this reasoning uncritically to CTRs is problematic, however. In the case of CTRs, we note that initial estimator inaccuracies that are responsible for return bias may also result in high return variance in the early stages of learning. Thus, in the early stages of learning, shorter CTRs may actually result in the worst of both worlds: high bias and high variance.

¹ Perhaps most fundamentally, the Markov assumption must hold; it is known (Pendrith & Ryan 1996) that the optimality properties of QL do not generalise to non-Markov systems. The original proof of QL convergence and optimality properties can be found in an expanded form in (Watkins & Dayan 1992).
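As a concrete illustration of the quantities defined in (1)-(5) above, the update rule and the n-step CTR can be written for a tabular Q estimate as follows. This sketch is not code from the paper; the function names and data layout are assumptions.

# Sketch of the 1-step QL update (1)-(3) and the n-step CTR (4)-(5) for a
# tabular Q function stored as a dict keyed by (state, action).
# Assumes s_next (resp. s_n) is non-terminal, so actions[...] is non-empty.

def one_step_q_update(Q, s, a, r, s_next, actions, beta, gamma):
    """Apply eq. (1) using the 1-step CTR of eq. (3) as the return."""
    ctr_1 = r + gamma * max(Q[(s_next, b)] for b in actions[s_next])   # eq. (3)
    delta = ctr_1 - Q[(s, a)]                                          # eq. (2)
    Q[(s, a)] += beta * delta                                          # eq. (1)

def n_step_ctr(rewards, s_n, Q, actions, gamma):
    """n-step corrected truncated return, eq. (4): the UTR of eq. (5) plus a
    discounted correction term taken from the current Q estimates at s_{t+n}."""
    utr = sum(gamma ** i * r for i, r in enumerate(rewards))           # eq. (5)
    n = len(rewards)
    return utr + gamma ** n * max(Q[(s_n, b)] for b in actions[s_n])   # eq. (4)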

Figure 1: A 4-state/1-action MDP. State A is the starting state; states B and C are equiprobable successor states after taking action 0. Actions from B and C immediately lead to termination. The immediate reward at each step is zero. If 1-step CTRs are used, the variance as well as the bias of the estimator returns for state A/action 0 depends on the difference of the initial estimator values for states B and C. On the other hand, if CTRs of length 2 or greater are used, the estimator returns will be unbiased and have zero variance.

Figure 2: A 3-state NMDP, with two available actions from starting state A, and one available from the successor state B. The action from state B immediately leads to termination and a reward; the decision process is non-Markov because the reward depends on what action was previously selected from state A.

By way of illustration, consider the simple MDP depicted in Figure 1. The expected variance of the sampled returns for action 0 from state A will be arbitrarily high depending upon the difference between the initial estimator values for actions from states B and C. In this case, the estimator for (A, 0) would experience both high bias and high variance if 1-step CTRs were to be used. On the other hand, using CTRs of length 2 or greater would result in unbiased estimator returns with zero variance at all stages of learning for this MDP.

In general, as estimators globally converge to their correct values, the variance of an n-step CTR for an MDP will become dominated by the variance in the terms comprising the UTR component of the return value, and so the relation

    i < j  ⟹  var[r_t^(i)] ≤ var[r_t^(j)]    (7)

will be true in the limit. However, the point we wish to make here is that a bias/variance trade-off involving n for CTRs is not as clear cut as may be often assumed, particularly in the early stages of learning, or at any stage of learning if the domain is noisy², even if the domain is Markovian.

CTR Bias and Variance in NMDPs

Perhaps more importantly, if the domain is not Markovian, then the relation expressed in (7) is not guaranteed to hold at any stage of the learning. To demonstrate this possibly surprising fact, we consider the simple 3-state non-Markov Decision Process (NMDP) depicted in Figure 2.

For this NMDP, the expected immediate reward from taking action 0 in state B depends upon which action was taken from state A. Suppose the (deterministic) reward function ρ is as follows:

    ρ(A, 0) = 0
    ρ(A, 1) = 0
    ρ(B, 0) = 0  (if the action from state A was 0)
    ρ(B, 0) = 1  (if the action from state A was 1)

If Monte Carlo returns are used (or, equivalently in this case, n-step CTRs where n > 1), the estimator returns for state/action pairs (A, 0) and (A, 1) will exhibit zero variance at all stages of learning, and the corresponding estimators should rapidly converge to their correct values of 0 and 1 respectively.

On the other hand, if 1-step CTRs are used, the variance of the (A, 0) and (A, 1) estimators will be non-zero while the variance of the estimator for (B, 0) is non-zero. The estimator for (B, 0) will exhibit non-zero variance as long as both actions continue to be tried from state A, which would normally be for all stages of learning for the sake of active exploration. Finally, note that the variance for (B, 0) will be the same in this case for all n-step CTRs, n ≥ 1. Hence, the overall estimator variance for this NMDP is strictly greater at all stages of learning for 1-step CTRs than for any n-step CTRs with n > 1.

In previously published work studying RL in noisy and non-Markovian domains (Pendrith & Ryan 1996), excessive estimator variance appeared to be causing problems for 1-step QL in domains where using Monte Carlo style returns improved matters. These unexpected experimental results did not (and still do not) fit well with the "folk wisdom" concerning estimator bias and variance in RL. We present these analyses firstly as a tentative partial explanation for the unexpected results in those experiments.

Secondly, the foregoing analysis is intended to provide some theoretical motivation for an entirely different approach to managing estimator variance in RL, in which attention is shifted away from CTR length and is instead focused on the step-size parameter β. In the next part of this paper, we discuss a new algorithm (perhaps more accurately a family of algorithms) we call ccBeta, which results from taking such an approach.

² The argument here is that for noisy domains, estimator bias is continually being reintroduced, taking the process "backwards" towards the conditions of early learning.
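The claim made about the Figure 2 NMDP is easy to check numerically. The following simulation is an illustrative sketch only, not the paper's code, and it assumes the reward values listed above; it compares the spread of the 1-step CTR targets for (A, 0) and (A, 1) against the corresponding Monte Carlo returns while both actions are being explored.

# Sketch: sampling returns on the Figure 2 NMDP under uniform exploration.
# The 1-step CTR targets for (A, a) inherit the fluctuation of Q(B, 0); the
# Monte Carlo (full-return) targets are constant for each action.
import random
from statistics import pvariance

def run(episodes=5000, beta=0.1, gamma=1.0, seed=0):
    random.seed(seed)
    Q = {("A", 0): 0.0, ("A", 1): 0.0, ("B", 0): 0.0}
    ctr1 = {0: [], 1: []}   # 1-step CTR targets observed for (A, a)
    mc = {0: [], 1: []}     # actual (Monte Carlo) returns observed for (A, a)
    for _ in range(episodes):
        a = random.choice([0, 1])        # keep exploring both actions from A
        r_A, r_B = 0.0, float(a)         # reward at B depends on the action at A
        ctr1[a].append(r_A + gamma * Q[("B", 0)])   # target of eq. (3)
        mc[a].append(r_A + gamma * r_B)             # actual return, eq. (6)
        # ordinary 1-step QL updates, eqs. (1)-(3); B's action terminates
        Q[("A", a)] += beta * (r_A + gamma * Q[("B", 0)] - Q[("A", a)])
        Q[("B", 0)] += beta * (r_B - Q[("B", 0)])
    for a in (0, 1):
        print(a, round(pvariance(ctr1[a]), 4), round(pvariance(mc[a]), 4))

run()   # the Monte Carlo return variance is 0.0; the 1-step CTR targets are not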

β: Variance versus Adaptability

In RL, finding a good value of the step-size parameter β for a particular algorithm in a particular domain usually involves a trial-and-error process for the experimenter, which can be time consuming. The resulting choice is usually a trade-off between fast adaptation (large β) and low estimator variance (small β). In RL, and in adaptive parameter estimation systems generally, there emerges a natural tension between the issues of convergence and adaptability.

Stochastic convergence theory (Kushner & Clark 1978) suggests that a reducing β series (such as β_i = 1/i) with the properties

    Σ_{i=1}^{∞} β_i = ∞  and  Σ_{i=1}^{∞} β_i² < ∞    (8)

may be used to adjust an estimator's value for successive returns; this will guarantee in-limit convergence under suitable conditions. However, this is in general not a suitable strategy for use in non-stationary environments, i.e. environments in which the schedule of payoffs may vary over time. While convergence properties for stationary environments are good using this technique, re-adaptation to changing environmental conditions can be far too slow.

In practice, a constant value for the β series is usually chosen. This has the advantage of being constantly sensitive to environment changes, but has poorer convergence properties, particularly in noisy or stochastic environments. It appears the relatively poor convergence properties of a constant β series can lead to instabilities in learning in some situations, making an effective trade-off between learning rate and variance difficult.

A method for automatically varying the step-size parameter β by a simple on-line statistical analysis of the estimate error is presented here. The resulting β series will be neither constant nor strictly decreasing, but will vary as conditions indicate.

The ccBeta Algorithm

At each update step for the parameter estimate, we assume we are using a "delta-rule" or on-line LMS style update rule along the lines of

    Δ_i ← z_i − Q_{i−1}
    Q_i ← Q_{i−1} + β_i Δ_i

where z_i is the ith returned value in the series we are trying to estimate and β_i is the ith value of the step-size schedule series used. Q_i is the ith estimate of this series, and Δ_i is the ith value of the error series.

The idea behind ccBeta is quite straightforward. If the series of estimate errors for a parameter is positively auto-correlated, this indicates a persistent over- or under-estimation of the underlying value to be estimated is occurring, and suggests β should be increased to facilitate rapid adaptation. If, on the other hand, the estimate errors are serially uncorrelated, then this may be taken as an indication that there is no systematic error occurring, and β can be safely decreased to minimise variance while these conditions exist.

So, for each parameter we are trying to estimate, we keep a separate set of autocorrelation statistics for its error series as follows, where cc_i is derived as an exponentially weighted autocorrelation coefficient:

    sum_square_err_i ← K·sum_square_err_{i−1} + Δ_i²    (9)

    sum_product_i ← K·sum_product_{i−1} + Δ_i·Δ_{i−1}    (10)

    cc_i ← sum_product_i / √(sum_square_err_i · sum_square_err_{i−1})    (11)

At the start of learning, the sum_square_err and sum_product variables are initialised to zero; but this potentially leads to a divide-by-zero problem on the RHS of (11). We explicitly check for this situation, and when detected, cc_i is set to 1.³

We note that if in (9) and (10) the exponential decay parameter K ∈ [0, 1] is less than 1, two desirable properties emerge: firstly, the values of the sum_square_err and sum_product series are finitely bounded, and secondly the correlation coefficients are biased with respect to recency. While the first property is convenient for practical implementation considerations with regard to possible floating point representation overflow conditions etc., the second property is essential for effective adaptive behaviour in non-stationary environments. Setting K to a value of 0.9 has been found to be effective in all domains tested so far; experimentally this does not seem to be a particularly sensitive parameter.

It is also possible to derive an autocorrelation coefficient not of the error series directly, but instead of the sign of the error series, i.e. replacing the Δ_i and Δ_{i−1} terms in (9) and (10) with sgn(Δ_i) and sgn(Δ_{i−1}). This variant may prove to be generally more robust in very noisy environments. In such a situation an error series may be so noisy that, even if the error signs are consistent, a good linear regression is not possible, and so β will be small even when there is evidence of persistent over- or under-estimation. This approach proved to be successful when applied to the extremely noisy real robot domain described in the next section. Based on our results to date, this version could be recommended as a good "general purpose" version of ccBeta.

³ This may seem arbitrary, but the reasoning is simply that if you have exactly one sample from a population to work with, the best estimate you can make for the mean of that population is the value of that sample.

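The statistics (9)-(11) are cheap to maintain incrementally, one instance per estimated parameter. The sketch below is illustrative only, not the paper's implementation; it keeps the two accumulators and applies the divide-by-zero rule described above.

# Sketch of the exponentially weighted autocorrelation of eqs. (9)-(11).
import math

class ErrorAutocorrelation:
    def __init__(self, K=0.9):
        self.K = K
        self.sum_square_err = 0.0   # accumulator of eq. (9)
        self.sum_product = 0.0      # accumulator of eq. (10)
        self.prev_delta = 0.0       # previous error term

    def update(self, delta):
        """Feed the latest estimate error and return cc_i of eq. (11)."""
        prev_square_err = self.sum_square_err
        self.sum_square_err = self.K * self.sum_square_err + delta * delta       # (9)
        self.sum_product = self.K * self.sum_product + delta * self.prev_delta   # (10)
        self.prev_delta = delta
        denom = math.sqrt(self.sum_square_err * prev_square_err)
        return self.sum_product / denom if denom > 0.0 else 1.0                  # (11)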

Table 1: Results for simulation experiment 1 (noisy zero fn).

                    T.S.E.   Std. Dev.
    ccBeta           17.95   0.042
    beta = 0.1       10.42   0.032
    beta = 0.2       22.30   0.047
    beta = 0.3       35.72   0.060
    beta = 0.4       50.85   0.071
    reducing beta     0.17   0.004

Table 2: Results for simulation experiment 2 (noisy step fn).

                    T.S.E.   Steps to Crossing
    ccBeta           21.60   10
    beta = 0.1       16.19   35
    beta = 0.2       25.84   15
    beta = 0.3       38.56   9
    beta = 0.4       53.35   8
    reducing beta   395.73   >9,600

Once an autocorrelation coefficient is derived, β_i is set as follows:

    if (cc_i > 0)
        β_i ← cc_i · MAX_BETA
    else
        β_i ← 0

    if (β_i < MIN_BETA)
        β_i ← MIN_BETA

First, we note that in the above pseudo-code, negative and zero auto-correlations are treated the same for the purposes of weighting β_i. A strongly negative autocorrelation indicates alternating error signs, suggesting fluctuations around a mean value. Variance minimisation is also desirable in this situation, motivating a small β_i.

On the other hand, a strongly positive cc_i results from a series of estimation errors of the same sign, indicating a persistent over- or under-estimation, suggesting a large β_i is appropriate to rapidly adapt to changed conditions.

Setting the scaling parameters MIN_BETA to 0.01 and MAX_BETA to 1.0 has been found to be effective, and these values are used in the experiments that follow. Although values of 0 and 1 respectively might be more "natural", as this corresponds to the automatic scaling of the correlation coefficient, in practice a small non-zero MIN_BETA value was observed to have the effect of making an estimate series less discontinuous (although whether this offers any real advantages has not been fully determined at this stage).

Finally, we note that prima facie it would be reasonable to use cc_i² rather than cc_i to weight β_i. Arguably, cc_i² is the more natural choice, since from statistical theory it is the square of the correlation coefficient that indicates the proportion of the variance that can be attributed to the change in a correlated variable's value. Both the cc_i and cc_i² variations have been tried, and while in simulations marginally better results both in terms of total square error (TSE) and variance were obtained by using cc_i², a corresponding practical advantage was not evident when applied to the robot experiments.
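Putting the pieces together, a complete ccBeta estimator in the "general purpose" configuration used in the experiments that follow (K = 0.9, sgn-normalised errors in (9)-(10), cc_i weighting, MIN_BETA = 0.01, MAX_BETA = 1.0) might look like the following sketch. This is an illustrative rendering of the published equations and pseudo-code, not the authors' implementation.

# Sketch of a self-contained ccBeta estimator: eqs. (9)-(11) plus the
# beta-setting pseudo-code above, driving the delta-rule update.
import math

def sgn(x):
    return (x > 0) - (x < 0)

class CcBetaEstimator:
    def __init__(self, K=0.9, min_beta=0.01, max_beta=1.0, use_sgn=True):
        self.K = K
        self.min_beta = min_beta
        self.max_beta = max_beta
        self.use_sgn = use_sgn       # sgn-normalised variant of (9)-(10)
        self.Q = 0.0                 # current estimate Q_i
        self.sum_square_err = 0.0
        self.sum_product = 0.0
        self.prev_term = 0.0         # previous error term (or its sign)

    def update(self, z):
        """Incorporate the next returned value z_i and return the new estimate."""
        delta = z - self.Q                                  # error: z_i - Q_{i-1}
        term = sgn(delta) if self.use_sgn else delta
        prev_square_err = self.sum_square_err
        self.sum_square_err = self.K * self.sum_square_err + term * term        # (9)
        self.sum_product = self.K * self.sum_product + term * self.prev_term    # (10)
        self.prev_term = term
        denom = math.sqrt(self.sum_square_err * prev_square_err)
        cc = self.sum_product / denom if denom > 0.0 else 1.0                   # (11)
        beta = cc * self.max_beta if cc > 0.0 else 0.0      # weight beta by cc_i
        beta = max(beta, self.min_beta)                     # floor at MIN_BETA
        self.Q += beta * delta                              # Q_i = Q_{i-1} + beta_i * delta_i
        return self.Q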

Experimental Results

In all the experiments that follow we use the same variant of ccBeta; this has a K parameter of 0.9, and uses sgn(Δ) normalisation to calculate cc_i. For weighting β_i, we use cc_i rather than cc_i², and have MIN_BETA and MAX_BETA set to 0.01 and 1.0 respectively.

Simulation experiment 1

In this section, we describe some simulation studies comparing the variance and re-adaptation sensitivity characteristics of the ccBeta method for generating a step-size series against standard regimes of fixed β and reducing β, where β_i = 1/i.

The learning system for these experiments was a single-unit on-line LMS estimator which was set up to track an input signal for 10,000 time steps. In the first experiment, the signal was stochastic but with stationary mean: a zero function perturbed by uniform random noise in the range [−0.25, +0.25]. The purpose of this experiment was to assess asymptotic convergence properties, in particular estimator error and variance.

As can be seen from Table 1, the reducing beta schedule of β_i = 1/i was superior to fixed beta and ccBeta in terms of both total square error (TSE) and estimator variance for this experiment. As we would expect, variance (normalised to standard deviations in the tabled results) increased directly with the magnitude of beta for the fixed beta series. In this experiment ccBeta performed at a level between fixed beta set at 0.1 and 0.2.

Simulation experiment 2

In the second experiment, a non-stationary stochastic signal was used to assess re-adaptation performance. The signal was identical to that used in the first experiment, except that after the first 400 time steps the mean changed from 0 to 1.0, resulting in a noisy step function. The adaptation responses for the various beta series over 50 time steps around the time of the change in mean are plotted in figures 3 and 4.
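For concreteness, the two test signals and comparison schedules described above can be reproduced with a small driver. The harness below is an illustrative sketch that reuses the CcBetaEstimator sketch given earlier; the noise seed and the assumption that TSE is measured against the true underlying mean are not stated in the paper, so its output will only roughly track Tables 1 and 2.

# Sketch of the simulation set-up: track a noisy signal for 10,000 time steps
# with fixed-beta, reducing-beta (1/i) and ccBeta step-size schedules.
import random

class FixedBetaEstimator:
    def __init__(self, beta):
        self.beta, self.Q = beta, 0.0
    def update(self, z):
        self.Q += self.beta * (z - self.Q)
        return self.Q

class ReducingBetaEstimator:
    def __init__(self):
        self.i, self.Q = 0, 0.0
    def update(self, z):
        self.i += 1
        self.Q += (1.0 / self.i) * (z - self.Q)   # beta_i = 1/i
        return self.Q

def run(estimator, steps=10000, step_at=None, seed=0):
    """Noisy zero fn, or noisy step fn (mean 0 -> 1.0 at step_at).
    TSE here is taken against the true underlying mean (an assumption)."""
    random.seed(seed)
    tse = 0.0
    for i in range(steps):
        mean = 1.0 if (step_at is not None and i >= step_at) else 0.0
        z = mean + random.uniform(-0.25, 0.25)
        estimator.update(z)
        tse += (estimator.Q - mean) ** 2
    return tse

# e.g. experiment 2 (noisy step fn, mean changes at t = 400):
# print(run(CcBetaEstimator(), step_at=400))
# print(run(FixedBetaEstimator(0.1), step_at=400))
# print(run(ReducingBetaEstimator(), step_at=400))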


Figure 3: Fixed beta and reducing beta step response plots.

Figure 4: ccBeta step response plot.

To get an index of responsiveness to the changed mean using the different beta series, we have measured the number of time steps from the time the mean level was changed to the time the estimator values first crossed the new mean level, i.e. when the estimator first reaches a value ≥ 1.

As can be seen from Table 2, the reducing beta schedule of β_i = 1/i was far worse than either fixed beta or ccBeta in terms of TSE and re-adaptation performance ("Steps to Crossing", column 2). Indeed, the extreme sluggishness of the reducing beta series was such that the estimator level had risen to only about 0.95 after a further 9,560 time steps past the end of the time-step window shown in figure 3. The relatively very high TSE for the reducing beta series was also almost entirely due to this very long re-adaptation time. The inherent unsuitability of such a regime for learning in a non-stationary environment is clearly illustrated in this experiment. Despite having "nice" theoretical properties, it represents an impractical extreme in the choice between good in-limit convergence properties and adaptability.

In the figure 3 plots, the trade-off of responsiveness versus estimator variance for a fixed beta series is clearly visible. We note however that the re-adaptation response curve for ccBeta (figure 4) resembles that of the higher values of fixed beta, while its TSE (table 2) corresponds to the lower values, which gives some indication the algorithm is working as intended.

Experiment 3: Learning to Walk

For the next set of experiments we have scaled up to a real robot-learning problem: the gait coordination of a six-legged insectoid walking robot.

The robot (affectionately referred to as "Stumpy" in the UNSW AI lab) is faced with the problem of learning to walk with a forward motion, minimising backward and lateral movements. In this domain, ccBeta was compared to using hand-tuned fixed beta step-size constants for two different versions of the C-Trace RL algorithm.

Stumpy is, by robotics standards, built to an inexpensive design. Each leg has two degrees of freedom, powered by two hobbyist servo motors (see Figure 5). A 68HC11 Miniboard converts instructions from a 486 PC, sent via an RS-232-C serial port connection, into the pulses that fire the servo motors. The primary motion sensor is a cradle-mounted PC mouse dragged along behind the robot. This provides for very noisy sensory input, as might be appreciated. The quality of the signal has been found to vary quite markedly with the surface the robot is traversing.

As well as being noisy, this domain was non-Markovian by virtue of the compact but coarse discretized state-space representation. This compact representation⁴ meant learning was fast, but favoured an RL algorithm that did not rely heavily on the Markov assumption; in earlier work (Pendrith & Ryan 1996) C-Trace had been shown to be well-suited for this domain.

The robot was given a set of primitive "reflexes" in the spirit of Rodney Brooks' "subsumption" architecture (Brooks 1991). A leg that is triggered will incrementally move to lift up if on the ground and forward if already lifted. A leg that does not receive activation will tend to drop down if lifted and backwards if already on the ground. In this way a basic stepping motion was encoded in the robot's "reflexes".

⁴ 1024 "boxes" in 6 dimensions: alpha and beta motor positions for each leg group (4 continuous variables each discretized into 4 ranges), plus 2 boolean variables indicating the triggered or untriggered state of each leg group at the last control action.
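The 1024-box representation of footnote 4 (4⁴ × 2² = 1024) can be indexed very simply. The helper below is purely illustrative; the paper does not give range boundaries or a variable ordering, so those details are assumptions.

# Illustrative sketch of one way to map the 6-dimensional sensor/trigger
# description of footnote 4 onto a box index in 0..1023. Range limits and
# variable ordering are assumptions, not taken from the paper.

def discretize(value, low, high, n_bins=4):
    """Map a continuous motor position into one of n_bins equal ranges."""
    idx = int((value - low) * n_bins / (high - low))
    return min(max(idx, 0), n_bins - 1)

def box_index(alpha_a, beta_a, alpha_b, beta_b, trig_a, trig_b,
              low=0.0, high=1.0):
    """Four discretized motor positions (4 ranges each) plus two trigger
    booleans: 4**4 * 2**2 = 1024 distinct boxes."""
    idx = 0
    for v in (alpha_a, beta_a, alpha_b, beta_b):
        idx = idx * 4 + discretize(v, low, high)
    idx = idx * 2 + int(bool(trig_a))
    idx = idx * 2 + int(bool(trig_b))
    return idx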

Figure 5: The robot learning to walk.

Figure 6: Grouping the legs into two tripods.

The legs were grouped to move in two groups of three to form two tripods (Figure 6). The learning problem was to discover an efficient walking policy by triggering or not triggering each of the tripod groups from each state. Thus the action set to choose from in each discretized state consisted of four possibilities:

• Trigger both groups of legs
• Trigger group A only
• Trigger group B only
• Do not trigger either group

The robot received positive reinforcement for forward motion as detected by the PC mouse, and negative reinforcement for backward and lateral movements.

Although quite a restricted learning problem, interesting non-trivial behaviours and strategies have been seen to emerge.

The RL algorithms

As mentioned earlier, C-Trace is an RL algorithm that uses multi-step CTRs to estimate the state/action value function. While one C-Trace variant, multiple-visit C-Trace, has been described in earlier work in application to this domain, the other, first-visit C-Trace, is a variant that has not been previously described.

It is easiest to understand the difference between multiple-visit and first-visit C-Trace in terms of the difference between a multiple-visit and a first-visit Monte Carlo algorithm. Briefly, a first-visit Monte Carlo algorithm will selectively ignore some returns for the purposes of learning in order to get a more truly independent sample set. In Singh & Sutton (1996), first-visit versus multiple-visit returns are discussed in conjunction with a new "first-visit" version of the TD(λ) algorithm, and significant improvements in learning performance are reported for a variety of domains using the new algorithm.

For these reasons, a first-visit version of C-Trace seemed likely to be particularly well-suited for a ccBeta implementation, as the "purer" sampling methodology for the returns should (in theory) enhance the sensitivity of the on-line statistical tests.
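The first-visit versus multiple-visit distinction is easy to see in code. The sketch below is illustrative only and is written for plain Monte Carlo returns rather than C-Trace itself; it simply shows which (state, action) returns each scheme would pass on to the estimator for a single episode.

# Sketch: collecting Monte Carlo returns from one episode under the
# multiple-visit (every occurrence) versus first-visit conventions.

def episode_returns(trajectory, gamma=1.0, first_visit=True):
    """trajectory: list of (state, action, reward) triples for one episode.
    Returns a list of ((state, action), return) pairs to learn from."""
    # accumulate the return G_t for every step, working backwards
    returns = [0.0] * len(trajectory)
    G = 0.0
    for t in range(len(trajectory) - 1, -1, -1):
        G = trajectory[t][2] + gamma * G
        returns[t] = G
    updates, seen = [], set()
    for t, (s, a, _) in enumerate(trajectory):
        if first_visit and (s, a) in seen:
            continue            # first-visit: ignore repeat occurrences
        seen.add((s, a))
        updates.append(((s, a), returns[t]))
    return updates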

Discussion of results

The average forward walking speeds over the first hour of learning for two versions of C-Trace, using both ccBeta and fixed step-size parameters, are presented in the plots in figures 7 and 8.

In the case of multiple-visit C-Trace (figure 7), we notice immediately that the learning performance is much more stable using ccBeta than with the fixed beta series. This shows up as obviously reduced variance in the average forward speed: significantly, the overall learning rate doesn't seem to have been adversely affected, which is encouraging. It would not be unreasonable to expect some trade-off between learning stability and raw learning rate, but such a trade-off is not apparent in these results.

Interestingly, the effects of estimator variance seem to manifest themselves in a subtly different way in the first-visit C-Trace experiments (figure 8). We notice that first-visit C-Trace even without ccBeta seems to have had a marked effect in reducing the step-to-step variance in performance that was seen with multiple-visit C-Trace. This is very interesting in itself, and calls for further theoretical and experimental investigation.⁵

However, we also notice that at around time-step 8000 in the first-visit fixed-beta plot there is the start of a sharp decline in performance, where the learning seems to have become suddenly unstable.

⁵ At this point, we will make the following brief observations on this effect: a) The reduction in variance theoretically makes sense inasmuch as the variance of the sum of several random variables is equal to the sum of the variances of the variables if the variables are not correlated, but will be greater than this if they are positively correlated. In multiple-visit returns, the positive correlation between returns is what prevents them being statistically independent. b) This raises the interesting possibility that the observed improved performance of "replacing traces" owes as much if not more to a reduction in estimator variance than to reduced estimator bias, which is the explanation proposed by (Singh & Sutton 1996).


Figure 7: The robot experiment using RL algorithm 1 (multiple-visit C-Trace).

Figure 8: The robot experiment using RL algorithm 2 (first-visit C-Trace).

These results were averaged over several (n = 5) runs for each plot, so this appears to be a real effect. If so, it is conspicuous by its absence in the first-visit ccBeta plot: ccBeta would appear to have effectively rectified the problem.

Overall, the combination of first-visit C-Trace and ccBeta seems to be the winning combination for these experiments, which is encouragingly in agreement with prediction.

Conclusions

Excessive estimator variance during on-line learning can be a problem, resulting in learning instabilities of the sort we have seen in the experiments described here.

In RL, estimator variance is traditionally dealt with only indirectly via the general process of tuning various learning parameters, which can be a time consuming trial-and-error process. Additionally, the theoretical results presented here indicate that some of the pervasive ideas regarding the trade-offs involved in the tuning process need to be critically examined. In particular, the relationship between CTR length and estimator variance needs reassessment, particularly in the case of learning in a non-Markov domain.

The ccBeta algorithm has been presented as a practical example of an alternative approach to managing estimator variance in RL. This algorithm has been designed to actively minimise estimator variance while avoiding the degradation in re-adaptation response times characteristic of passive methods. ccBeta has been shown to perform well both in simulation and in real-world learning domains.

A possible advantage of this algorithm is that, since it is not particularly closely tied in its design or assumptions to RL algorithms, Markov or otherwise, it may turn out to be usable with a fairly broad class of on-line search methods. Basically, any method that uses some form of "delta rule" for parameter updates might potentially benefit from using a ccBeta-style approach to managing on-line estimator variance.

Acknowledgements

The authors are indebted to Paul Wong for his assistance in performing many of the robot experiments.

References

Brooks, R. 1991. Intelligence without reason. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, 569-595.

Kushner, H., and Clark, D. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag.

Pendrith, M., and Ryan, M. 1996. Actual return reinforcement learning versus Temporal Differences: Some theoretical and experimental results. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth International Conference. Morgan Kaufmann.

Puterman, M. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: John Wiley & Sons.

Singh, S., and Sutton, R. 1996. Reinforcement learning with replacing eligibility traces. Machine Learning 22:123-158.

Sutton, R.; Barto, A.; and Williams, R. 1992. Reinforcement learning is direct adaptive optimal control. IEEE Control Systems Magazine 12(2):19-22.

Sutton, R. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3:9-44.

Watkins, C., and Dayan, P. 1992. Technical note: Q-learning. Machine Learning 8:279-292.

Watkins, C. 1989. Learning from Delayed Rewards. Ph.D. thesis, King's College, Cambridge.
