Teaching Computers to Play Chess Through Deep Reinforcement Learning

Dorian Van den Heede

Supervisors: Prof. dr. ir. Francis wyffels, Prof. dr. ir. Joni Dambre
Counsellors: Ir. Jeroen Burms, Prof. dr. ir. Francis wyffels, Prof. dr. ir. Joni Dambre

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle
Faculty of Engineering and Architecture
Academic year 2016-2017

To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of disprized love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.
Foreword

After a long and sporadically lonesome road, an incredible journey has come to an end in the form of this master's dissertation. Exploiting experience from exploration is what makes one fail less and less.

I had the amazing opportunity to pursue my thesis in a subject I am extremely passionate about, for which I want to thank my supervisor Prof. Francis wyffels. Last year, and especially during the last three months, I needed a lot of advice and help. Luckily, Jeroen Burms has always been a great counsellor and helped me whenever I had a question or idea. Thank you.

I want to express my gratitude to 'grand papa et grand maman'. During the fortnight of my isolation, they spoiled me more than my Facebook wall spoiled the Game of Thrones story.

Furthermore, I want to show appreciation to my fellow students with whom I spent countless hours of hard labor. I am thanking Jonas for lifting me up, Loic for always remaining casual, Timon for always being equally happy and Vincent for infecting invincibility.

Closer to me, there are my parents, whom I love very much. I am sincerely sorry if I do not show this enough.

Dorian Van den Heede, August 2017

Permission of use on loan

"The author(s) gives (give) permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation."

Dorian Van den Heede, August 2017

Teaching Computers to Play Chess with Deep Reinforcement Learning

Dorian Van den Heede

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Academic year 2016-2017

Supervisors: Prof. dr. ir. F. wyffels and Prof. dr. ir. J. Dambre
Counsellor: J.
Burms
Faculty of Engineering and Architecture, Ghent University
Department of Electronics and Information Systems
Chair: Prof. dr. ir. R. Van de Walle

Abstract

Chess computers have reached a superhuman level at playing chess by combining tree search with heuristics, obtained from expert knowledge, that are incorporated into the value function used to evaluate board positions. Although these approaches add a large amount of tactical knowledge, they do not tackle the problem of effectively solving chess, a deterministic game. We try to lay the first stone by solving chess endgames that are relatively simple from a human point of view with deep reinforcement learning (DRL). The proposed DRL algorithm consists of two main components: (i) self-play, and (ii) translating the data collected in phase (i) into a supervised learning framework with deep learning. In light of this, we present a novel TD-learning algorithm, TD-Stem(λ), and compare it with the state of the art, TD-Leaf(λ).

Keywords: Chess, deep reinforcement learning, TD-learning

Teaching Computers to Play Chess with Deep Reinforcement Learning

Dorian Van den Heede
Supervisor(s): Francis wyffels, Jeroen Burms, Joni Dambre

Abstract— Chess computers have reached a superhuman level at playing chess by combining tree search with heuristics, obtained from expert knowledge, that are incorporated into the value function used to evaluate board positions. Although these approaches add a large amount of tactical knowledge, they do not tackle the problem of effectively solving chess, a deterministic game. We try to lay the first stone by solving chess endgames that are relatively simple from a human point of view with deep reinforcement learning (DRL). The proposed DRL algorithm consists of two main components: (i) self-play, and (ii) translating the data collected in phase (i) into a supervised learning (SL) framework with deep learning. In light of this, we present a novel temporal difference (TD) learning algorithm, TD-Stem(λ), and compare it with the state of the art, TD-Leaf(λ).

Keywords— Chess, deep reinforcement learning, TD-learning

I. INTRODUCTION

In the early 50s, scientists like Shannon and Turing were already musing about how a machine would be able to play chess autonomously [1], [2]. The enormous complexity (∼10^120 different states) combined with its popularity and rich history made it the holy grail of artificial intelligence (AI) research in the second half of the 20th century. Research attained its peak in 1997 when Deep Blue succeeded at winning a duel with the [...]

[...] different modifications of the well-known TD(λ) algorithm may unite strategic and tactical information into the value function in section IV. The value function itself is discussed in section V. We compare our novel algorithm with the state-of-the-art TD-learning algorithm for chess in section VI and form an appropriate conclusion in section VII.

II. RELATED WORK

Already in 1959, a TD-learning algorithm was used to create a simplistic checkers program [5]. The first real success story of using TD-learning (arguably of RL in general) with a neural network (NN) for value function approximation in board games was Tesauro's backgammon program TD-Gammon, which used a simple fully connected feedforward neural network (FNN) architecture learned by incremental TD(λ) updates [6]. One interesting consequence of TD-Gammon is its major influence on the strategic game play of humans in backgammon today. As chess is a game where tactics seem to have the upper hand over strategic thinking, game tree search is used to calculate the temporal difference updates in TD-Leaf(λ), proposed by Baxter et al.
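The incremental TD(λ) updates mentioned above can be made concrete with a minimal sketch. This is the classic tabular version with eligibility traces, not the thesis' actual setup (which uses a neural network value function and tree search); the toy three-state episode, the state indexing, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.7):
    """Update the value table V in place from one episode.

    episode: list of (state, reward) pairs; the reward is the one
    received on *entering* that state (terminal reward on the last step).
    """
    z = np.zeros_like(V)                        # eligibility traces
    for t in range(len(episode) - 1):
        s, _ = episode[t]
        s_next, r = episode[t + 1]
        delta = r + gamma * V[s_next] - V[s]    # temporal-difference error
        z *= gamma * lam                        # decay all existing traces
        z[s] += 1.0                             # bump the trace of the visited state
        V += alpha * delta * z                  # credit earlier states via their traces
    return V

# Toy run: a 3-state chain 0 -> 1 -> 2, with a win (reward 1) on reaching state 2.
V = np.zeros(3)
episode = [(0, 0.0), (1, 0.0), (2, 1.0)]
for _ in range(50):
    td_lambda_episode(V, episode)
# V[0] and V[1] approach 1, the value of the certain win.
```

TD-Leaf(λ) keeps this same update structure but replaces V[s] with the value of the principal-variation leaf found by a minimax search from s, which is how tactical information enters the update.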