
arXiv:cs/9901001v1 [cs.LG] 5 Jan 1999

TDLeaf(λ): Combining Temporal Difference Learning with Game-Tree Search

Jonathan Baxter, Andrew Tridgell & Lex Weaver
Department of Systems Engineering and Department of Computer Science
Australian National University, Canberra 0200, Australia
{Jon.Baxter,Andrew.Tridgell,Lex.Weaver}@anu.edu.au

ABSTRACT

In this paper we present TDLeaf(λ), a variation on the TD(λ) algorithm that enables it to be used in conjunction with minimax search. We present some experiments in both chess and backgammon which demonstrate its utility and provide comparisons with TD(λ) and another, less radical variant, TD-directed(λ). In particular, our chess program, "KnightCap," used TDLeaf(λ) to learn its evaluation function while playing on the Free Internet Chess Server (FICS, fics.onenet.net). It improved from a 1650 rating to a 2100 rating in just 308 games. We discuss some of the reasons for this success and the relationship between our results and Tesauro's results in backgammon.

1. Introduction

TD(λ), developed by Sutton [6], has its roots in the learning algorithm of Samuel's checkers program [4]. It is an elegant algorithm for approximating the expected long term future cost of a stochastic dynamical system as a function of the current state. The mapping from states to future cost is implemented by a parameterised function approximator such as a neural network. The parameters are updated online after each state transition, or in batch updates after several state transitions. The goal of the algorithm is to improve the cost estimates as the number of observed state transitions and associated costs increases.

Tesauro's TD-Gammon is perhaps the most remarkable success of TD(λ). It is a neural network backgammon player that has proven itself to be competitive with the best human backgammon players [8].

Many authors have discussed the peculiarities of backgammon that make it particularly suitable for Temporal Difference learning with self-play [7, 5, 3]. Principal among these are speed of play: TD-Gammon learnt from several hundred thousand games of self-play; representation smoothness: the evaluation of a backgammon position is a reasonably smooth function of the position (viewed, say, as a vector of piece counts), making it easier to find a good neural network approximation; and stochasticity: backgammon is a random game, which forces at least a minimal amount of exploration of the search space.

As TD-Gammon in its original form only searched one-ply ahead, we feel this list should be appended with: shallow search is good enough against humans. There are two possible reasons for this; either one does not gain a lot by searching deeper in backgammon (questionable given that recent versions of TD-Gammon search to three-ply for a significant performance improvement), or humans are incapable of searching deeply and so TD-Gammon is only competing in a pool of shallow searchers.

In contrast, finding a representation for chess, othello or Go which allows a small neural network to order moves at one-ply with near human performance is a far more difficult task [9, 11, 5]. For these games, reliable tactical evaluation is difficult to achieve without deep search. This requires an exponential increase in the number of positions evaluated as the search depth increases. Consequently, the computational cost of the evaluation function has to be low and hence, most chess and othello programs use linear functions.

In the next section we look at reinforcement learning (the broad category into which TD(λ) falls), and then in subsequent sections we look at TD(λ) in some detail and introduce two variations on the theme: TD-directed(λ) and TDLeaf(λ). The first uses minimax search to generate better training data, and the second, TDLeaf(λ), is used to learn an evaluation function for use in deep minimax search.

2. Reinforcement Learning

The popularly known and best understood learning techniques fall into the category of supervised learning. This category is distinguished by the fact that for each input upon which the system is trained, the "correct" output is known. This allows us to measure the error and use it to train the system.

For example, if our system maps input X_i to output Y_i′ then, with Y_i as the "correct" output, we can use (Y_i′ − Y_i)² as a measure of the error corresponding to X_i. Summing this value across a set of training examples yields an error measure of the form Σ_i (Y_i′ − Y_i)², which can be used by training techniques such as back propagation.

Reinforcement learning differs substantially from supervised learning in that the "correct" output is not known. Hence, there is no direct measure of error; instead a scalar reward is given for the responses to a series of inputs.

Consider an agent reacting to its environment (a generalisation of the two-player scenario). Let S denote the set of all possible environment states. Time proceeds with the agent performing actions at discrete time steps t = 1, 2, .... At time t the agent finds the environment in state x_t ∈ S, and has available a set of actions A_{x_t}. The agent chooses an action a_t ∈ A_{x_t}, which takes the environment to state x_{t+1} with probability p(x_t, x_{t+1}, a_t). After a determined series of actions in the environment, perhaps when a goal has been achieved or has become impossible, the scalar reward r(x_N), where N is the number of actions in the series, is awarded to the agent. These rewards are often discrete, e.g. "1" for success, "-1" for failure, and "0" otherwise. For ease of notation we will assume all series of actions have a fixed length of N (this is not essential). If we assume that the agent chooses its actions according to some function a(x) of the current state x (so that a(x) ∈ A_x), the expected reward from each state x ∈ S is given by

    J*(x) := E_{x_N|x} r(x_N),    (1)

where the expectation is with respect to the transition probabilities p(x_t, x_{t+1}, a(x_t)).

Once we have J*, we can ensure that actions are chosen optimally in any state by using the following equation to minimise the expected reward for the environment, i.e. the other player in the game:

    a*(x) := argmin_{a∈A_x} J*(x′_a).    (2)

For very large state spaces S it is not possible to store the value of J*(x) for every x ∈ S, so instead we might try to approximate J* using a parameterised function class J̃ : S × R^k → R, for example linear functions, splines, neural networks, etc. J̃(·, w) is assumed to be a differentiable function of its parameters w = (w_1, ..., w_k). The aim is to find w so that J̃(x, w) is "close to" J*, at least in so far as it generates the correct ordering of moves. This approach to learning is quite different from that of supervised learning, where the aim is to minimise an explicit error measurement for each data point.

Another significant difference between the two paradigms is the nature of the data used in training. With supervised learning it is fixed, whilst with reinforcement learning the states which occur during training are dependent upon the agent's choice of action, and thus on the training algorithm which is modifying the agent. This dependency complicates the task of proving convergence for TD(λ) in the general case [2].
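To make the preceding notation concrete, here is a minimal sketch (ours, not the paper's code) of a linear parameterised evaluation J̃(x, w) = w · φ(x) together with the greedy one-ply move choice of equation (2), using the learned J̃ in place of the unknown J*. The feature vectors, move names and weight values below are hypothetical.

```python
# A minimal sketch (ours, not the paper's code) of a linear parameterised
# evaluation J~(x, w) = w . phi(x) and the greedy one-ply move choice of
# equation (2), with the learned J~ standing in for the unknown J*.
# The feature vectors, move names and weights below are hypothetical.

from typing import Dict, Sequence

def J_tilde(phi_x: Sequence[float], w: Sequence[float]) -> float:
    """Linear evaluation: inner product of a position's features with the weights."""
    return sum(wi * fi for wi, fi in zip(w, phi_x))

def choose_move(successor_features: Dict[str, Sequence[float]],
                w: Sequence[float]) -> str:
    """Pick the move whose successor position minimises the opponent's
    expected reward, as in equation (2) with J~ approximating J*."""
    return min(successor_features, key=lambda a: J_tilde(successor_features[a], w))

# Toy usage: two candidate moves described by made-up material feature vectors.
w = [1.0, 4.0, 4.0]                     # e.g. pawn/knight/bishop weights
succ = {"Nxe5": [0.0, 1.0, 0.0],        # features of the position after each move
        "Bxe5": [1.0, 0.0, 1.0]}
print(choose_move(succ, w))             # -> "Nxe5" (4.0 is less than 5.0)
```

With a linear J̃ of this kind the gradient ∇J̃(x, w) is simply the feature vector φ(x), a simplification the later sketches also rely on.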

3. The TD(λ) algorithm

Temporal Difference learning, or TD(λ), is perhaps the best known of the reinforcement learning algorithms. It provides a way of using the scalar rewards such that existing supervised training techniques can be used to tune the function approximator. Tesauro's TD-Gammon, for example, uses back propagation to train a neural network function approximator, with TD(λ) managing this process and calculating the necessary error values.

Here we consider how TD(λ) would be used to train an agent playing a two-player game, such as chess or backgammon.

Suppose x_1, ..., x_{N−1}, x_N is a sequence of states in one game. For a given parameter vector w, define the temporal difference associated with the transition x_t → x_{t+1} by

    d_t := J̃(x_{t+1}, w) − J̃(x_t, w).    (3)

Note that d_t measures the difference between the reward predicted by J̃(·, w) at time t+1, and the reward predicted by J̃(·, w) at time t. The true evaluation function J* has the property

    E_{x_{t+1}|x_t}[J*(x_{t+1}) − J*(x_t)] = 0,

so if J̃(·, w) is a good approximation to J*, E_{x_{t+1}|x_t} d_t should be close to zero. For ease of notation we will assume that J̃(x_N, w) = r(x_N) always, so that the final temporal difference satisfies

    d_{N−1} = J̃(x_N, w) − J̃(x_{N−1}, w) = r(x_N) − J̃(x_{N−1}, w).

That is, d_{N−1} is the difference between the true outcome of the game and the prediction at the penultimate move.

At the end of the game, the TD(λ) algorithm updates the parameter vector w according to the formula

    w := w + α Σ_{t=1}^{N−1} ∇J̃(x_t, w) [ Σ_{j=t}^{N−1} λ^{j−t} d_j ],    (4)

where ∇J̃(·, w) is the vector of partial derivatives of J̃ with respect to its parameters. The positive parameter α controls the learning rate and would typically be "annealed" towards zero during the course of a long series of games. The parameter λ ∈ [0, 1] controls the extent to which temporal differences propagate backwards in time. To see this, compare equation (4) for λ = 0:

    w := w + α Σ_{t=1}^{N−1} ∇J̃(x_t, w) d_t
       = w + α Σ_{t=1}^{N−1} ∇J̃(x_t, w) [J̃(x_{t+1}, w) − J̃(x_t, w)]    (5)

and λ = 1:

    w := w + α Σ_{t=1}^{N−1} ∇J̃(x_t, w) [r(x_N) − J̃(x_t, w)].    (6)

Consider each term contributing to the sums in equations (5) and (6). For λ = 0 the parameter vector is being adjusted in such a way as to move J̃(x_t, w), the predicted reward at time t, closer to J̃(x_{t+1}, w), the predicted reward at time t+1. In contrast, TD(1) adjusts the parameter vector in such a way as to move the predicted reward at time step t closer to the final reward at time step N. Values of λ between zero and one interpolate between these two behaviours. Note that (6) is equivalent to gradient descent on the error function

    E(w) := Σ_{t=1}^{N−1} [r(x_N) − J̃(x_t, w)]².

Tesauro [7, 8] and those who have replicated his work with backgammon report that the results are insensitive to the value of λ and commonly use a value around 0.7. Recent work by Beal and Smith [1], however, suggests that in the domain of chess there is greater sensitivity to the value of λ, with it perhaps being profitable to dynamically tune λ.

Successive parameter updates according to the TD(λ) algorithm should, over time, lead to improved predictions of the expected reward J̃(·, w). Provided the actions a(x_t) are independent of the parameter vector w, it can be shown that for linear J̃(·, w), the TD(λ) algorithm converges to a near-optimal parameter vector [10]. Unfortunately, there is no such guarantee if J̃(·, w) is non-linear [10], or if a(x_t) depends on w [2].
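As an illustration of how equation (4) is applied at the end of a game, the sketch below (again ours, assuming a linear J̃(x, w) = w · φ(x) so that ∇J̃(x, w) = φ(x)) computes the temporal differences of equation (3) and accumulates the λ-weighted update. The feature vectors and reward are toy values.

```python
# Minimal sketch of the end-of-game TD(lambda) update of equation (4), assuming
# a linear evaluation J~(x, w) = w . phi(x) so that grad_w J~(x, w) = phi(x).
# The feature vectors and reward are toy values, not the paper's code.

from typing import List, Sequence

def td_lambda_update(w: Sequence[float],
                     features: Sequence[Sequence[float]],  # phi(x_1), ..., phi(x_N)
                     reward: float,                         # r(x_N), e.g. +1, -1 or 0
                     alpha: float = 1.0,
                     lam: float = 0.7) -> List[float]:
    N = len(features)
    # Predictions J~(x_t, w), with the convention J~(x_N, w) = r(x_N).
    J = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
    J[N - 1] = reward
    # Temporal differences d_t = J~(x_{t+1}, w) - J~(x_t, w)  (equation (3)).
    d = [J[t + 1] - J[t] for t in range(N - 1)]
    w_new = list(w)
    for t in range(N - 1):
        # Inner sum of equation (4): sum over j >= t of lambda^(j-t) * d_j.
        err = sum(lam ** (j - t) * d[j] for j in range(t, N - 1))
        grad = features[t]              # gradient of a linear J~ at x_t is phi(x_t)
        for k in range(len(w_new)):
            w_new[k] += alpha * grad[k] * err
    return w_new

# Toy usage: a three-position "game" that was eventually won (reward +1).
print(td_lambda_update(w=[0.1, -0.2], features=[[1, 0], [0.5, 0.5], [0, 1]], reward=1.0))
```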

4. Two New Variants

For argument's sake, assume any action a taken in state x leads to a predetermined state which we will denote by x′_a. Once an approximation J̃(·, w) to J* has been found, we can use it to choose actions in state x by picking the action a ∈ A_x whose successor state x′_a minimizes the opponent's expected reward¹:

    ã(x) := argmin_{a∈A_x} J̃(x′_a, w).    (7)

¹If successor states are only determined stochastically by the choice of a, we would choose the action minimizing the expected reward over the choice of successor states.

This was the strategy used in TD-Gammon. Unfortunately, for games like othello and chess it is difficult to accurately evaluate a position by looking only one move or ply ahead. Most programs for these games employ some form of minimax search. In minimax search, one builds a tree from position x by examining all possible moves for the computer in that position, then all possible moves for the opponent, and then all possible moves for the computer and so on to some predetermined depth d. The leaf nodes of the tree are then evaluated using a heuristic evaluation function (such as J̃(·, w)), and the resulting scores are propagated back up the tree by choosing at each stage the move which leads to the best position for the player on the move. See figure 1 for an example game tree and its minimax evaluation. With reference to the figure, note that the evaluation assigned to the root node is the evaluation of the leaf node of the principal variation: the sequence of moves taken from the root to the leaf if each side chooses the best available move.

Our TD-directed(λ) variant utilises minimax search by allowing play to be guided by minimax, but still defines the temporal differences to be the differences in the evaluations of successive board positions occurring during the game, as per equation (3).

Let J̃_d(x, w) denote the evaluation obtained for state x by applying J̃(·, w) to the leaf nodes of a depth d minimax search from x. Our aim is to find a parameter vector w such that J̃_d(·, w) is a good approximation to the expected reward J*. One way to achieve this is to apply the TD(λ) algorithm to J̃_d(x, w). That is, for each sequence of positions x_1, ..., x_N in a game we define the temporal differences

    d_t := J̃_d(x_{t+1}, w) − J̃_d(x_t, w)    (8)

as per equation (3), and then the TD(λ) algorithm (4) for updating the parameter vector w becomes

    w := w + α Σ_{t=1}^{N−1} ∇J̃_d(x_t, w) [ Σ_{j=t}^{N−1} λ^{j−t} d_j ].    (9)

One problem with equation (9) is that for d > 1, J̃_d(x, w) is not necessarily a differentiable function of w for all values of w, even if J̃(·, w) is everywhere differentiable. This is because for some values of w there will be "ties" in the minimax search, i.e. there will be more than one best move available in some of the positions along the principal variation, which means that the principal variation will not be unique. Thus, the evaluation assigned to the root node, J̃_d(x, w), will be the evaluation of any one of a number of leaf nodes.

Fortunately, under some mild technical assumptions on the behaviour of J̃(x, w), it can be shown that for all states x and for "almost all" w ∈ R^k, J̃_d(x, w) is a differentiable function of w. Note that J̃_d(x, w) is also a continuous function of w whenever J̃(x, w) is a continuous function of w. This implies that even for the "bad" pairs (x, w), ∇J̃_d(x, w) is only undefined because it is multi-valued. Thus we can still arbitrarily choose a particular value for ∇J̃_d(x, w) if w happens to land on one of the bad points.

Based on these observations we modified the TD(λ) algorithm to take account of minimax search: instead of working with the root positions x_1, ..., x_N, the TD(λ) algorithm is applied to the leaf positions found by minimax search from the root positions. We call this algorithm TDLeaf(λ).
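The extra machinery TDLeaf(λ) needs on top of TD(λ) is a depth-d minimax search that returns both the backed-up value J̃_d(x, w) and the leaf of the principal variation, so that ∇J̃ can be evaluated at that leaf in the update (9). The sketch below is our illustration rather than KnightCap's implementation; the `children` and `evaluate` callables are hypothetical stand-ins for move generation and the heuristic evaluation J̃(·, w).

```python
# Sketch (ours, not KnightCap's code) of the minimax machinery TDLeaf(lambda)
# relies on: a depth-d search returning both the backed-up value J~_d(x, w) and
# the principal-variation leaf, at which grad J~ is then evaluated in update (9).
# `children` and `evaluate` are hypothetical stand-ins for move generation and
# the heuristic evaluation J~(., w).

from typing import Callable, List, Tuple, TypeVar

State = TypeVar("State")

def minimax_pv(x: State,
               depth: int,
               our_move: bool,
               children: Callable[[State], List[State]],
               evaluate: Callable[[State], float]) -> Tuple[float, State]:
    """Return (J~_d(x, w), leaf of the principal variation) for a depth-`depth`
    search from x.  Ties are broken by the first best child, which amounts to
    picking one value of the (possibly multi-valued) gradient, as noted above."""
    succ = children(x)
    if depth == 0 or not succ:
        return evaluate(x), x                     # leaf: heuristic score J~(x, w)
    results = [minimax_pv(c, depth - 1, not our_move, children, evaluate)
               for c in succ]
    pick = max if our_move else min               # max at our nodes, min at theirs
    return pick(results, key=lambda r: r[0])      # (backed-up value, PV leaf)
```

In TDLeaf(λ), the values returned by such a search define the temporal differences of equation (8), and the gradients in equation (9) are taken at the returned leaves rather than at the root positions.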

[Figure 1: full-breadth, 3-ply search tree. Root A = 4; opponent nodes B = -5, C = 4; our nodes D = 3, E = -5, F = 4, G = 5; leaf nodes H, I, J, K, L, M, N, O scored 3, -9, -5, -6, 4*, 2, -9, 5, the asterisk marking the principal-variation leaf L.]

Fig. 1: Full breadth, 3-ply search tree illustrating the minimax rule for propagating values. Each of the leaf nodes (H–O) is given a score by the evaluation function, J̃(·, w). These scores are then propagated back up the tree by assigning to each opponent's internal node the minimum of its children's values, and to each of our internal nodes the maximum of its children's values. The principal variation is then the sequence of best moves for either side starting from the root node, and this is illustrated by a dashed line in the figure. Note that the score at the root node A is the evaluation of the leaf node (L) of the principal variation. As there are no ties between any siblings, the derivative of A's score with respect to the parameters w is just ∇J̃(L, w).
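As a quick check of the figure, the toy snippet below (ours) encodes the leaf scores H–O from Figure 1 and backs them up with the minimax rule of the caption; it reproduces the root value of 4 and the principal-variation leaf L.

```python
# Toy reconstruction of Figure 1 (our illustration): the leaf scores H-O from the
# figure, with the minimum taken at the opponent's nodes (B, C) and the maximum
# at ours (A, D-G).

leaves = {"H": 3, "I": -9, "J": -5, "K": -6, "L": 4, "M": 2, "N": -9, "O": 5}
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
        "D": ["H", "I"], "E": ["J", "K"], "F": ["L", "M"], "G": ["N", "O"]}

def backup(node: str, our_move: bool):
    """Return (minimax value, principal-variation leaf) below `node`."""
    if node in leaves:
        return leaves[node], node
    pick = max if our_move else min
    return pick((backup(c, not our_move) for c in tree[node]), key=lambda r: r[0])

value, pv_leaf = backup("A", our_move=True)
print(value, pv_leaf)   # -> 4 L, matching the root evaluation and PV leaf in Fig. 1
```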

5. Experiments with Chess

In this section we describe several experiments in which the TDLeaf(λ) and TD-directed(λ) algorithms were used to train the weights of a linear evaluation function for our chess program, called KnightCap.

For our main experiment we took KnightCap's evaluation function and set all but the material parameters to zero. The material parameters were initialised to the standard "computer" values². With these parameter settings KnightCap was started on the Free Internet Chess Server (FICS, fics.onenet.net). To establish its rating, 25 games were played without modifying the evaluation function, after which it had a blitz (fast time control) rating of 1650 ± 50³. We then turned on the TDLeaf(λ) learning algorithm, with λ = 0.7 and the learning rate α = 1.0. The value of λ was chosen arbitrarily, while α was set high enough to ensure rapid modification of the parameters.

After only 308 games, KnightCap's rating climbed to 2110 ± 50. This rating puts KnightCap at the level of US Master.

We repeated the experiment using TD-directed(λ), and observed a 200 point rating rise over 300 games: a significant improvement, but slower than TDLeaf(λ).

There are a number of reasons for KnightCap's remarkable rate of improvement.

1. KnightCap started out with intelligent material parameters. This put it close in parameter space to many far superior parameter settings.

2. Most players on FICS prefer to play opponents of similar strength, and so KnightCap's opponents improved as it did. Hence it received both positive and negative feedback from its games.

3. KnightCap was not learning by self-play.

To investigate the importance of some of these reasons, we conducted several more experiments.

Good initial conditions.
A second experiment was run in which KnightCap's coefficients were all initialised to the value of a pawn. Playing with this initial weight setting KnightCap had a blitz rating of 1260 ± 50. After more than 1000 games on FICS KnightCap's rating has improved to about 1540 ± 50, a 280 point gain. This is a much slower improvement than the original experiment, and makes it clear that starting near a good set of weights is important for fast convergence.

Self-Play
Learning by self-play was extremely effective for TD-Gammon, but a significant reason for this is the stochasticity of backgammon. However, chess is a deterministic game, and self-play by a deterministic algorithm tends to result in a large number of substantially similar games. This is not a problem if the games seen in self-play are "representative" of the games played in practice; however, KnightCap's self-play games with only non-zero material weights are very different to the kind of games humans of the same level would play.

To demonstrate that learning by self-play for KnightCap is not as effective as learning against real opponents, we ran another experiment in which all but the material parameters were initialised to zero again, but this time KnightCap learnt by playing against itself. After 600 games (twice as many as in the original FICS experiment), we played the resulting version against the good version that learnt on FICS, in a 100 game match with the weight values fixed. The FICS trained version won 89 points to the self-play version's 11.

²1 for a pawn, 4 for a knight, 4 for a bishop, 6 for a rook and 12 for a queen.
³After some experimentation, we have estimated the standard deviation of FICS ratings to be 50 rating points.

6. Backgammon Experiment

For our backgammon experiment we were fortunate to have Mark Land (University of California, San Diego) provide us with the source code for his LGammon program, which has been implemented along the lines of Tesauro's TD-Gammon [7, 8].

Along with the code for LGammon, Land also provided a set of weights for the neural network. The weights were used by LGammon when playing on the First Internet Backgammon Server (FIBS, fibs.com), where LGammon achieved a rating which ranged from 1600 to 1680, significantly above the mean rating across all players of about 1500. For convenience, we refer to the weights as the FIBS weights.

Using LGammon and the FIBS weights to directly compare searching to two-ply against searching to one-ply, we observed that two-ply is stronger by 0.25 points-per-game, a significant difference in backgammon. Further analysis showed that in 24% of positions, the move recommended by a two-ply search differed from that recommended by a one-ply search.

Subsequently, we decided to investigate how well TD-directed(λ) and TDLeaf(λ), both of which can search more deeply, might perform. Our experiment sought to determine whether either TD-directed(λ) or TDLeaf(λ) could find better weights than standard TD(λ).

To test this, we suitably modified the algorithms to account for the stochasticity inherent in the game, and took two copies of the FIBS weights (the end product of a standard TD(λ) training run of 270,000 games). We trained one copy using TD-directed(λ) and the other using TDLeaf(λ). Each network was trained for 50,000 games and then played against the unmodified FIBS weights for 1600 games, with both sides searching to two-ply and the match score recorded.

The results fluctuated around parity with the FIBS weights (the product of TD(λ) training), with no statistically significant change in performance being observed. This suggests that the solution found by TD(λ) is either at or near the optimal for two-ply play.

7. Discussion and Conclusion

We have introduced TDLeaf(λ), a variant of TD(λ) for training an evaluation function used in minimax search. The only extra requirement of the algorithm is that the leaf nodes of the principal variations be stored throughout the game.

We presented some experiments in which a chess evaluation function was trained by on-line play against a mixture of human and computer opponents. The experiments show both the importance of "on-line" sampling (as opposed to self-play), and the need to start near a good solution for fast convergence.

We compared training using leaf nodes (TDLeaf(λ)) with training using root nodes, both in chess with a linear evaluation function and 5-10 ply search, and in backgammon with a one-hidden-layer neural network evaluation function and 2-ply search. We found a significant improvement training on the leaf nodes in chess, which can be attributed to the substantially different distribution over leaf nodes compared to root nodes. No such improvement was observed for backgammon, which suggests that the optimal network to use in 1-ply search is close to the optimal network for 2-ply search.

On the theoretical side, it has recently been shown that TD(λ) converges for linear evaluation functions [10]. An interesting avenue for further investigation would be to determine whether TDLeaf(λ) has similar convergence properties.

References

[1] D. F. Beal and M. C. Smith. Learning Piece Values Using Temporal Differences. Journal of the International Computer Chess Association, September 1997.

[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[3] Jordan Pollack, Alan Blair, and Mark Land. Co-evolution of a Backgammon Player. In Proceedings of the Fifth Artificial Life Conference, Nara, Japan, 1996.

[4] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development, 3:210–229, 1959.

[5] Nicol Schraudolph, Peter Dayan, and Terrence Sejnowski. Temporal Difference Learning of Position Evaluation in the Game of Go. In Jack Cowan, Gerry Tesauro, and Josh Alspector, editors, Advances in Neural Information Processing Systems 6, San Francisco, 1994. Morgan Kaufmann.

[6] Richard Sutton. Learning to Predict by the Method of Temporal Differences. Machine Learning, 3:9–44, 1988.

[7] Gerald Tesauro. Practical Issues in Temporal Difference Learning. Machine Learning, 8:257–278, 1992.

[8] Gerald Tesauro. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation, 6:215–219, 1994.

[9] Sebastian Thrun. Learning to Play the Game of Chess. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, San Francisco, 1995. Morgan Kaufmann.

[10] John N. Tsitsiklis and Benjamin Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

[11] Steven Walker, Raymond Lister, and Tom Downs. On Self-Learning Patterns in the Othello Board Game by the Method of Temporal Differences. In C. Rowles, H. Liu, and N. Foo, editors, Proceedings of the 6th Australian Joint Conference on Artificial Intelligence, pages 328–333, Melbourne, 1993. World Scientific.