Learning to Play Draughts using Temporal Difference Learning with Neural Networks and Databases

Jan Peter Patist and Marco Wiering
Cognitive Artificial Intelligence, Utrecht University
contact: [email protected]

Abstract

This paper describes several aspects of using temporal difference learning (TD) and neural networks to learn game evaluation functions, and the benefits of using databases. Experiments in tic-tac-toe and international draughts have been done to measure the effectiveness of using databases. The tic-tac-toe experiment showed that training from database games resulted in better play than learning from playing against a random player. In the international draughts experiment, the program reached, after just a few hours of training, a better level than a strong computer playing program, and it occasionally drew against a very strong program. Thus, using temporal difference learning and neural networks on database games is a time-efficient way to reach a considerable level of play.

1 Introduction

Reinforcement learning (Sutton, 1988; Kaelbling et al., 1996) is a learning method which enables agents to learn to act in an environment by getting rewards. The agent chooses between actions based on the current sensory information and a function approximator that maps sensory information to an output. This output can be an action or a value used in selecting an action. The aim is a function approximator whose output results in the correct behavior. Rewards are used by the agent to adjust its function approximator to maximize its future cumulative reward. In game-playing the player or agent chooses a move which maximizes the chance to win minus the chance to lose.

Several ways exist to collect the games or examples needed to learn from, for example self-play, playing against humans, or considering a database of already played games. However, in games like draughts much look-ahead is needed to create a game which is good enough to learn from. Therefore, learning by playing an actual game is time consuming. Using databases of existing games is much less time consuming because look-ahead is not needed. Another benefit of using databases over self-play is that the examples are representative of good game-playing and do not suffer from start-up problems, as in self-play when a randomly initialized evaluation function is used.

Games. For a long time now board games like draughts, checkers, backgammon, chess etc. have attracted the interest of computer science. Nowadays there are programs which play at a high level. For example, in chess the number one human player, Kasparov, has been beaten by Deep Blue [1997] and Deep Junior [2003]. Also in other board games the computer has gained ground, like in backgammon, othello, international draughts and checkers. In international draughts several grandmasters were defeated by the programs Buggy [2003] and Flits [2002].

The success of most of the strongest programs in these board games is due to the possibility of examining many thousands of positions per second by using computer power and efficient search algorithms. Because the number of positions grows exponentially with the search depth, it is impossible to calculate all possible continuations until a position which ends by the rules of the game. Several algorithms and heuristics have been developed to cut the continuations that are examined. Although the computer can search many positions, it cannot reach an end position from most of the positions. Therefore a good evaluation function is very important to estimate 'the chance' to win given a certain board position. In most of the mentioned board games the evaluation function is tuned by the system designer, and this may cost a lot of time. Automated learning can probably result in better evaluation functions in less time. For example, the backgammon program TD-Gammon (Tesauro, 1995) reached the level of a world-class player by learning using neural networks and TD-learning (Sutton, 1988).

Learning from database games. In reinforcement learning, agents act in an environment and get feedback on their performance by reward signals. They use that feedback to adjust themselves to maximize their performance. However, when an agent learns from database games it does not apply its newly acquired insights directly to the board and is therefore not confronted with its behavior. Database games are only labelled with the end-results of a game (i.e. win, loss, or draw). We use a reinforcement learning method to estimate values of board positions that have occurred during a game. In many board games like draughts, chess, etc. a database exists. Using temporal difference learning is a good way to make use of this rich collection of knowledge.

Overview. In the next section we briefly describe and explain the ingredients of a game-playing program and some learning game-programs. Reinforcement learning will be discussed in section 3. Experiments of learning tic-tac-toe using TD(λ)-learning and databases will be presented in section 4, and in section 5 we present results on learning international draughts. Finally, section 6 concludes and describes possible future work.

2 Game playing

2.1 Parts of a game-playing program

Almost all programs which play board games like draughts have the same architecture. The different parts of the programs are a move generator, a search engine and an evaluation function. Given a certain position, the move generator generates all possible successor positions. The search engine is an algorithm which calculates possible continuations. Examples of search algorithms are Alpha-Beta, Negascout, Principal Variation Search etc. See (Plaat, 1996) for more information on game search algorithms. Normally search algorithms are extended with heuristics to enhance speed.

An evaluation function is used to value a position. In our experiments we used a neural network because of its ability to represent non-linear functions (Cybenko, 1989), which is important in draughts because a good evaluation function can probably not be represented by a linear function. It should be noted that the use of neural networks slows down the search speed. Because it is very difficult for programs to play well in the opening and end-game phases, they are usually equipped with a database for both phases. In most programs the parameters of the evaluation function are set by hand. However, this costs a lot of time.

2.2 Learning Programs

Examples of automatically learned evaluation functions are TD-Gammon in backgammon, KnightCap (Baxter et al., 1997; Baxter et al., 1998; Baxter et al., 2000) and NeuroChess (Thrun, 1995) in chess, and NeuroDraughts (Lynch and Griffith, 1997) in checkers. TD-Gammon became a world-class player by learning from self-play. KnightCap learned by TDleaf, a TD-method for game-trees, and reached a reasonable level of play in fast-play chess. NeuroChess used an explanation-based neural network (EBNN) which predicted the position after some moves and was trained on grandmaster games. Gradient information of the EBNN was used to adjust the weights of another neural network V trained by self-play. The EBNN was trained before V. The level of play was weak. In NeuroDraughts a cloning strategy was used to test, select and train networks with different input features. It is not known how strong this program plays against other checkers playing programs or humans.

3 Reinforcement Learning

3.1 Markov Decision Problems

Most work done in Reinforcement Learning (RL) is about solving Markov decision problems. A problem is a Markov decision problem whenever it satisfies the Markov property. The Markov property states that the transition probabilities to possible next states only depend on the current state and the selected action. A Markov decision problem is characterized by the possible states and actions, a transition function, a reward function, and a discount factor. The transition function contains the probabilities of all transitions to a next state under the influence of some action. The reward function is determined by the system designer. The discount factor affects the greediness towards immediate or future reward. Playing draughts can be seen as a Markov decision problem as long as the opponent uses a fixed strategy for selecting moves.

3.2 Temporal Difference Learning

To estimate the values of positions we used the temporal difference method (Sutton, 1988). The temporal difference method is an RL method driven by the difference between two successive state values to adjust former state values, which decreases the difference between ...

3.3 RL and Neural Networks

Because in draughts the number of possible states or positions is too large to directly store and learn from, a function approximator must be used to approximate the value of the positions. In our research we used a feed-forward neural network as a function approximator. To learn from a game, all positions are scored by propagating them through the neural network. New values of the positions are estimated after each game by using the TD-update rule. These new values are the target values of the positions seen in the game. Extended error back-propagation (Sperduti and Starita, 1993) is used to adjust the weights of the network in order to minimize the difference between the target value and the estimated value of the position, called the error. In our experiments we used the activation function βx/(1 + abs(βx)) with neuron sensitivity β.
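The activation function named above, and the idea of turning one finished game into target values for the network, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The activation function βx/(1 + abs(βx)) is taken from the text. Everything else is an assumption made for the example: the function names, the choice of the standard λ-return without discounting or intermediate rewards as the "TD-update rule" (the exact rule is not given in this excerpt), and the coding of the game result as a single number such as +1, 0 or -1.

```python
def activation(x, beta=1.0):
    """Squashing function beta*x / (1 + |beta*x|) from the text; output lies in (-1, 1)."""
    z = beta * x
    return z / (1.0 + abs(z))


def activation_derivative(x, beta=1.0):
    """Derivative with respect to x, beta / (1 + |beta*x|)^2, as back-propagation would need."""
    return beta / (1.0 + abs(beta * x)) ** 2


def td_lambda_targets(values, result, lam):
    """Target values for the positions of one finished game.

    ASSUMPTION: standard lambda-return with no discounting and no
    intermediate rewards; the paper's exact update rule is not shown here.

    values: network outputs V(s_0), ..., V(s_T) for the positions of the game
    result: final outcome of the game (hypothetical coding: +1 win, 0 draw, -1 loss)
    lam:    the TD(lambda) parameter in [0, 1]
    """
    if not values:
        return []
    targets = [0.0] * len(values)
    # The last position is trained directly towards the game result.
    targets[-1] = result
    # Earlier positions mix the next target with the next network estimate.
    for t in range(len(values) - 2, -1, -1):
        targets[t] = lam * targets[t + 1] + (1.0 - lam) * values[t + 1]
    return targets


if __name__ == "__main__":
    print(activation(1.0, beta=2.0))
    print(td_lambda_targets([0.0, 0.2, 0.5], result=1.0, lam=0.5))
```

Under these assumptions, lam = 1 sets every target to the final result of the game, while lam = 0 makes each position's target the network's current estimate of the next position. The network weights would then be adjusted to reduce the difference between each target and the corresponding network output.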