Evaluating Knowledge Transferability in Chess Endgames using Deep Neural Networks

by

Frímann Kjerúlf

Thesis of 30 ECTS credits submitted to the School of Computer Science at Reykjavík University in partial fulfillment of the requirements for the degree of Master of Science (M.Sc.) in Computer Science

June 2019

Examining Committee: Yngvi Björnsson, Supervisor, Professor, Reykjavík University, Iceland

Stephan Schiffel, Assistant Professor, Reykjavík University, Iceland

David Thue, Assistant Professor, Reykjavík University, Iceland

Copyright Frímann Kjerúlf June 2019

Evaluating Knowledge Transferability in Chess Endgames using Deep Neural Networks

Frímann Kjerúlf

June 2019

Abstract

Transfer learning is becoming an essential part of modern machine learning, especially in the field of deep neural networks. In the domain of image recognition there are known methods to evaluate the transferability of features, based on evaluating to what degree a feature extractor can be considered general to the domain or specific to the task at hand. This is of high importance when aiming for a successful knowledge transfer, since one typically wants to transfer only the general feature extractors and leave the specific ones behind. The general features in the case of image classification can be considered local with respect to each pixel, since the feature extractors in early layers activate on simple features like edges, which are localized within a certain radius of a given pixel. One might then ask whether similar methods are also applicable in domains other than image classification, and of special interest are domains characterized by non-local features. Chess is an excellent example of such a domain since a square's locality cannot be defined by the adjacent squares alone; one needs to take into account that a single piece can traverse the whole board in a single move. We show that this method is applicable in the case of tablebases, in spite of structural differences in the feature space, and that the distribution of the learned information within the network is similar to that in the case of image classification.

Titill

Frímann Kjerúlf

júní 2019

Útdráttur (Abstract)

Transfer learning has become an essential part of modern machine learning, especially in the field of deep neural networks. In the field of image recognition there are known methods for evaluating the transferability of features in transfer learning, based on assessing which parts of the neural network detect general features, such as lines and colour transitions, and which parts detect specific features, such as faces or houses. This is important when performing a successful knowledge transfer, since it often proves best to transfer only the general parts of the network. In image recognition these general features are considered local with respect to each pixel, since whether a pixel is part of a line can be determined by looking only at the pixels within a certain distance of it. One may then ask whether these same methods apply in other domains characterized by non-local general features. Chess is a very good example of such a domain, since the proximity of a square cannot be defined by looking only at its neighbours; one must take into account that some pieces can cross the whole board in a single move. Our conclusion is that the aforementioned methods from the image recognition domain for evaluating feature transferability apply well in the domain of chess endgames, in spite of this difference between the domains' features, and that the shape and distribution of information within the network is similar.

Evaluating Knowledge Transferability in Chess Endgames using Deep Neural Networks

Frímann Kjerúlf

Thesis of 30 ECTS credits submitted to the School of Computer Science at Reykjavík University in partial fulfillment of the requirements for the degree of Master of Science (M.Sc.) in Computer Science

June 2019

Student:

Frímann Kjerúlf

Examining Committee:

Yngvi Björnsson

Stephan Schiffel

David Thue

The undersigned hereby grants permission to the Reykjavík University Library to reproduce single copies of this Thesis entitled Evaluating Knowledge Transferability in Chess Endgames using Deep Neural Networks and to lend or sell such copies for private, scholarly or scientific research purposes only. The author reserves all other publication and other rights in association with the copyright in the Thesis, and except as herein before provided, neither the Thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

date

Frímann Kjerúlf Master of Science

- Dedicated to Loki -

Acknowledgements

I want to thank my supervisor Dr. Yngvi Björnsson for all the support and goodwill throughout this project.

This work was funded by 2014 RANNIS grant “Hermi- og brjóstvitstrjáleit í alhliða leikjaspilun og öðrum flóknum ákvörðunartökuvandamálum”.

Contents

Acknowledgements

Contents

List of Figures

List of Tables

List of Abbreviations

List of Symbols

1 Introduction

2 Background
   2.1 Convolutional Neural Networks
   2.2 Transfer Learning
   2.3 Transferability
   2.4 Endgame Tablebases

3 Methods
   3.1 Board State Representation
   3.2 WDL Values
   3.3 Transferability
   3.4 Network Design
   3.5 Transfer Learning
   3.6 Expansion Learning

4 Results and Discussions
   4.1 Tuning Hyperparameters
       4.1.1 Final Hyperparameters
   4.2 Experimental Setup
   4.3 Transferability
       4.3.1 Performance Loss Due to Co-Adaption
       4.3.2 Performance Loss Due to Specification
   4.4 Transfer Learning
   4.5 Expansion Learning

5 Conclusion
   5.1 Summary
   5.2 Future Work

Bibliography

List of Figures

2.1 An example of a Convolutional Neural Network (CNN) with 4 convolutional layers (Conv) and one fully connected layer (FC) (from Medium.com [3])
2.2 Difference between traditional ML (left) and transfer learning ML (right) (from Pan and Yang, 2010 [4])

3.1 Chess table with numbering of squares, where a1 is mapped to 0 and h8 to 63
3.2 Example board state with 7 pieces
3.3 Example from Figure 3.2 in vector state representation
3.4 For full state representation. Each piece type has its own 8 × 8 bit-array
3.5 Example from Figure 3.2 in full state representation. Each piece has its own 8 × 8 bit-array, resulting in an 8 × 8 × 4 tensor
3.6 Image classification performance after knowledge transfer from domain A to B (AnB) and B to B (BnB) as a function of n, the number of transferred layers. Transferred weights of AnB and BnB are kept frozen while transferred weights of AnB+ and BnB+ are allowed to fine-tune (from Yosinski et al. [1])
3.7 Comparing accuracy of 2x2 and 3x3 filters
3.8 Evaluating optimal number of convolutional layers

4.1 Comparing performance of Adam vs Adadelta
4.2 Evaluating effect of batch normalization on accuracy
4.3 Comparing accuracy of 16 and 32 bit floating point precision
4.4 Evaluating the number of epochs needed for convergence
4.5 Evaluating the possibility of overfitting
4.6 Evaluating effect of batch size on accuracy and training speed
4.7 Co-adaption splitting at layers 3, 4 and 5
4.8 Performance drop due to co-adaption and specification
4.9 Knowledge transfer seeds better initial accuracy

List of Tables

2.1 Definition of WDL values in the Syzygy tablebase, giving the game-theoretical values for a given board state
2.2 Syzygy endgame tablebase information, showing both all possible states and states with only pawns and kings

3.1 Breakdown by WDL values showing the number of states for each WDL value and the corresponding ratio of the whole dataset
3.2 Evaluating optimal number of convolutional layers
3.3 Evaluating optimal network size (250 epochs)

4.1 Evaluating effect of batch size on accuracy after 100 epochs
4.2 Evaluating effect of batch size on accuracy and training time for 10 epochs
4.3 Evaluating effect of batch size on accuracy after 30 minute training time
4.4 Chosen hyperparameters and network information
4.5 Network model showing layer type, filter size and number of parameters
4.6 Final label prediction accuracy
4.7 Final accuracy of φ3 on D3 by WDL values
4.8 Final accuracy of φ3→4 on D4 by WDL values

List of Abbreviations

ANN   Artificial Neural Network
CNN   Convolutional Neural Network
Conv  Convolutional
DL    Deep Learning
DTM   Depth to Mate Value
DTZ   Depth to Zeroing-Move Value
EL    Expansion Learning
FC    Fully Connected
ML    Machine Learning
TL    Transfer Learning
WDL   Win/Draw/Loss

List of Symbols

Symbol           Description
m                Number of pieces in a given state
mp               Number of piece types in a given state
X                The feature space of all possible board states
Xm               The feature space of all m-piece board states
Y                WDL value label space
Dm               Domain of all m-piece board states
X                Board state random variable
x                A given board state
y                A given WDL value
Pm(X)            The probability of sampling an m-piece board state from X
Dm               Dataset of m-piece board states
dm,i = (xi, yi)  Datapoint i in Dm
Tm               Training task
fm(·)            The objective predictive function mapping states in Xm to labels
φ(x) = y         A neural net with input x and output y
φm               A network trained on Dm
φm→k             A network trained on Dk with transfer from φm
φrnd             An untrained network initialized with random weights
ceil(z)          Rounds z ∈ R up to the next integer
Prnd(y)          Probability of φrnd outputting label y given a random input
PXm(y)           The probability of sampling a state with label y from Xm
n                Number of transferred layers
φAnB             A network trained on DB with n transferred layers from φA
AnB              Same as φAnB; transferred layers frozen
AnB+             Same as φAnB; transferred layers fine-tuned

Chapter 1

Introduction

In recent history, deep learning (DL) has been the most promising subfield of machine learning (ML) and has been responsible for the majority of notable breakthroughs in modern AI. Its applications span many different fields, including computer vision, speech recognition, natural language processing, self-driving cars and board-game agents, to name just a few. Within the DL field there is a wide range of architectures, with convolutional neural networks (CNNs) at the forefront. This architecture works in some ways similarly to a biological neural network, and bears some resemblance to the visual information processing systems found in animals. CNNs are composed of filters which convolve over the image and activate once they find a pattern that fits the given filter. In image recognition with CNNs the first layers look for low-level general features like lines and simple patterns, while layers further down the network activate on more complex, higher-level features, which can be considered specific to the given task.

Though DL has proven itself to be the most prominent ML method of modern times, it is not free of flaws. Its main drawback is the demand for large amounts of data in order to perform well, compared to classical ML methods. When a human takes on learning a new task, the process typically consists of adapting previous knowledge. This seeds the idea of reusing information which has already been learned, a method which commonly goes by the name of transfer learning (TL) and is widely used in ML. It can easily be applied to DL by reusing the layers of a pretrained network. One can choose to transfer all the layers, but in most cases it has proven more useful to transfer only the first n layers. In order to achieve a successful knowledge transfer one would therefore like to know how many layers of the network to transfer, i.e. which part of the network to reuse.
Essentially one would want to identify which layers are responsible for general feature extraction. In a recent study by Yosinski et al. [1] a method was devised to evaluate the information distribution within a neural network, the gradient from general to specific parts of the network, in order to quantify which parts of the network are most prominent for transfer. In essence, to evaluate at which layer n we start to get degraded performance due to specification of the transferred layers. Their method consists of evaluating final accuracy as a function of n, the number of transferred layers, while either freezing the transferred layers or allowing them to fine-tune. Their research was focused on the domain of image recognition and the results were specific to the problem at hand. Due to the short-sightedness of early layers in CNNs in terms of visibility within the feature space, the same methods might not necessarily be applicable in domains with non-local features, i.e. the transition from general to specific feature extractors might not be as clear as in the domain of image classification. Here we explore the feasibility of using Yosinski et al.'s method to evaluate the information distribution in a network in a domain with non-local features. This brings us to the research question we set out to answer:

How transferable is knowledge in domains with non-local features?

Chess is an excellent example of such a domain since a piece can traverse the whole board in a single move, meaning that its proximity cannot be defined by the adjacent squares alone, and its features can therefore be considered non-local. We therefore use chess as a test bench for our study and source our data from the field of chess endgame tablebases. Our measurements show that, in spite of the apparent non-locality of features in our domain of chess, we are still able to quantify the distribution of the learned information within the network, which has a structure similar to that seen in image classification. Furthermore, in order to evaluate the feasibility of expanding learned knowledge to a domain with limited or no label information, we take a special look at the edge case of doing a full transfer and freezing all layers, which resembles doing knowledge transfer without retraining on the target dataset. A use case might be one where we do not have access to the ground truth of the target domain. Our goal with this method is to use the given information from the source domain to expand our knowledge to the target domain, a technique we label expansion learning.

The thesis is organized as follows. In Chapter 2 we introduce the terminology used and provide the necessary background knowledge to set the stage for what follows. In Chapter 3 we cover the methodology used, including state representations, the CNN architecture, and a method for quantifying knowledge distribution within a network. In Chapter 4 we present and discuss our empirical results, based on the methods set up in the previous chapter. Finally, in Chapter 5 we conclude by summarizing our findings and discussing future work.

Chapter 2

Background

2.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) were initially designed for analyzing features in images and were inspired by the design of the visual cortex [2]. CNNs are a form of artificial neural network (ANN) and consist of layers of nodes, where nodes in adjacent layers are connected by neurons which allow information to flow between layers. The input data is passed into the network and flows through it layer by layer towards the output. Each layer works as a feature extractor abstracting the given input, progressively extracting higher and higher level features as the data passes through the network. In the case of facial image recognition, an image in matrix form is fed to the input of the network. The first layer extracts localized features like edges, while the second layer extracts features in the form of collections of edges, and so forth. These are all features that can be considered general to the domain of image recognition and apply to most image recognition tasks, not only facial recognition. But as we traverse the network, the feature extractors start to react more and more to specialized features like noses and eyes, which are specific to the task at hand. It is then logical to assume that the layers, initially extracting general features, progressively move towards extracting more and more specific features with respect to the domain. An example of a CNN is depicted in Figure 2.1. The network consists of an input layer, a sequence of convolutional layers which serve as feature extractors, followed by a fully connected layer which acts as a classifier, and finally an output layer.
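The convolution-and-activation behaviour described above can be sketched in a few lines of plain Python. The example below is purely illustrative (a hand-crafted vertical-edge filter applied to a toy image, rather than a filter learned by training): it shows how an early-layer filter produces strong activations only where its pattern, here an edge, actually occurs.

```python
# Minimal 2D convolution (valid padding, stride 1) with a vertical-edge filter.
# Illustrative sketch only; real CNNs learn their filter values during training.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)
            )
    return out

# A vertical edge: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-crafted vertical-edge filter (a crude, Gabor-like detector).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

activations = conv2d(image, kernel)
print(activations)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The filter responds with 0 on the flat regions and with 3 at the dark-to-bright boundary, which is exactly the "activate once a matching pattern is found" behaviour of a first-layer feature extractor.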

2.2 Transfer Learning

The main drawback of DL with respect to classical ML algorithms is the need for large amounts of data in order to learn effectively. It is therefore of high importance to be able to reuse previously learned knowledge, i.e. to transfer the current knowledge. In transfer learning the aim is to extract knowledge from one or more tasks (source tasks) and apply it to a different, albeit related, task (target task). In the context of deep learning and neural networks this is typically accomplished by transferring a part of a network (typically a fixed number of early network layers) trained on a source task to a network that is being trained on a related target task. For a successful transfer it is beneficial to transfer only the first n layers, which are considered general, in order to disregard specific knowledge which is not applicable in the target domain.

Figure 2.1: An example of a Convolutional Neural Network (CNN) with 4 convolutional layers (Conv) and one fully connected layer (FC) (from Medium.com [3])

Figure 2.2: Difference between traditional ML (left) and transfer learning ML (right). (from Pan and Yang, 2010 [4])

There can be different impetus for doing such a transfer, for example, to expedite the learning in the target domain. In particular, transfer learning has been used successfully in modern deep neural networks for supervised image classification and recognition [5], where the early network layers exhibit the ability to learn general features that resemble Gabor filters or colour blobs. Gabor filters are an approximation of the functionality of the primary visual cortex in the brain [6], and if an image classification neural network is not learning these filters in its first layers, that is usually taken as an indication that something is wrong with the network. So at least for the task of image classification it seems like a waste of time to relearn the general features every time one trains a deep neural network. This effect has been observed not only across different natural image data sets, but also across different learning objectives [7], which is why we are interested in seeing how this methodology transfers to the domain of simple board games. Following is a mathematical definition of transfer learning, its components, and an introduction to the terminology used throughout the thesis.

Domain We define our domain for an m-piece dataset as Dm = {X, Pm(X)}, where X ∈ X. Here X is the feature space of all possible board states and Pm is the marginal probability distribution, which gives the probability of sampling an m-piece board state from X. An m-piece subset of X is defined as Xm = {x ∈ X | Pm(x) > 0}.

Datasets We define the dataset for an m-piece tablebase as Dm = {d1, ..., d#}, where # is the size of the m-piece tablebase. Here di = (xi, yi) is a single data-point, where xi ∈ Xm and yi ∈ Y, where Y is the label space. In our case the size of Dm could be the same as that of Xm, since we have access to all the data in Xm, but we keep this distinction since Dm incorporates both states and labels, while Xm refers only to the state space. Later on we omit some states from Dm for simplicity of representing the board states, as explained in Section 3.1.

Task Our task is defined as Tm = {Y, fm(·)}, where fm(·) : Xm → Y is the objective predictive function, mapping states to labels. Given a data-point (x, y), then fm(x) = y. In our case Y = {−1, 0, 1} for loss, draw and win, respectively. It is our objective to learn fm(·) by training a neural network φ. The goal is:

∀(x, y) ∈ Dm : φ(x) ≈ fm(x)

That is, to have the output of φ approximate the output of fm for all datapoints. In our case of working with endgame tablebases the task is to predict the outcome of the game.

Transfer learning Here we do knowledge transfer from Dm to Dk where m < k. Our predictive functions are not the same, i.e. fm(·) ≠ fk(·). The transfer consists of using fm(·) as a basis when learning to approximate fk(·). We define φm as the network trained on Dm with no knowledge transfer, and φm→k as a network trained on Dk with knowledge transfer from Dm. We estimate fm(·) by training a network φm with data Dm sampled from Xm. This network is used as a starting point to train a network φm→k on dataset Dk. We are interested in knowing if a transfer from m to k has any benefits in terms of initial training speed and/or final accuracy of φm→k, in contrast to training a network φk on Dk with no knowledge transfer from Dm. We now take a look at our adaption of a known method for evaluating at which n one should cut off the network for a successful transfer, and cover our main agenda, which is to evaluate the effectiveness of this method in a domain with non-local features, like a square's locality in the game of chess.
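The mechanics of the φm to φm→k transfer just described can be sketched with a toy example. Below, a "network" is reduced to a list of per-layer weight vectors (all names and numbers are our own illustration; the networks actually used in this thesis are CNNs trained by backpropagation): the first n layers are copied from a pretrained source network and optionally frozen, and a dummy gradient step then updates only the unfrozen layers.

```python
# Toy illustration of layer transfer: copy the first n layers of a trained
# source network into a target network and freeze them during fine-tuning.
# "Layers" are just flat weight lists here; all values are illustrative.

def transfer(source_layers, target_layers, n, freeze=True):
    """Return target layers with the first n layers copied from the source,
    plus a per-layer trainable mask (False = frozen)."""
    layers = [list(w) for w in target_layers]
    for i in range(n):
        layers[i] = list(source_layers[i])  # reuse learned general features
    trainable = [not (freeze and i < n) for i in range(len(layers))]
    return layers, trainable

def sgd_step(layers, grads, trainable, lr=0.5):
    """One gradient step that skips frozen layers."""
    return [
        [w - lr * g for w, g in zip(layer, grad)] if t else layer
        for layer, grad, t in zip(layers, grads, trainable)
    ]

phi_m = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # pretrained on D_m
phi_k = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]   # fresh init for D_k

layers, trainable = transfer(phi_m, phi_k, n=2)   # AnB-style: first 2 frozen
grads = [[1.0, 1.0]] * 3
layers = sgd_step(layers, grads, trainable)

print(layers[0])  # frozen transferred layer, unchanged: [1.0, 1.0]
print(layers[2])  # trainable layer, updated:            [2.0, 2.0]
```

Allowing the transferred layers to fine-tune (the AnB+ setting) simply corresponds to calling `transfer(..., freeze=False)` so that every layer receives updates.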

2.3 Transferability

When doing transfer learning with neural networks it is important to know which parts of the network to reuse, since reusing the specific parts can hinder performance. In a recent work by Yosinski et al. [1], a method was demonstrated for quantifying the transferability of features from each layer of a neural network, clearly showing how successive layers in the network adapt from detecting general to more specific features.

It was shown that transferability is negatively affected by, on the one hand, breaking up co-adapted layers in the original network, and, on the other hand, by transferring layers that have already become too specialized towards the original task. These two effects seem to be localized in the network, where co-adaption dominates performance issues around the middle of the network, while specification is the main culprit for performance loss when transferring from the latter part of the network. Furthermore, it was demonstrated that initializing a network with transferred features may improve generalization of the final network, that is, the accuracy of a fully trained network that was initialized with transferred layers is better than that of a network trained on the task at hand from scratch. Their domain of choice was image classification on the ImageNet dataset [8]. The previously mentioned study was performed in a domain characterized by local features. We are interested in evaluating the effectiveness of this method for quantifying the transferability of non-local features. Here we perform similar transfer learning experiments, but in a radically different problem domain: chess endgames, which are an excellent example of a domain characterized by non-local features, since a piece's locality is not just the adjacent squares. Interestingly, the transfer properties of our network are consistent with the ones observed in the previously mentioned image-classification experiment, further validating the effectiveness of the layer transferring approach in neural networks, even for non-image related tasks.

2.4 Endgame Tablebases

There are immense amounts of chess related data available online, but we decided on working with endgame tablebases. The main reason is that this dataset is characterized by sparsity of states, since tablebase states hold at most 7 pieces. Furthermore we use only a limited subset of the tablebases, focusing on states with two kings and some pawns. Endgame tablebases hold, among other things, the WDL value (win, draw, loss) for a given board state in the game of chess, given perfect play. The WDL value is the game-theoretical value of who will win, lose, or whether we have a draw. It takes into account the 50 move drawing rule, which states that a draw can be claimed by either player if no capture has taken place, and no pawn has moved, for the past 50 moves. As can be seen in Table 2.1, the possible WDL values range from -2 to 2. For loss we have -2 and -1 and for win we have 2 and 1. The difference is that 2 and -2 mean that a win or loss is guaranteed, while 1 and -1 mean that a win or loss is possible, but a draw can be claimed given the 50 move rule. Finally, a value of 0 refers to a draw. These values have been generated by retrograde analysis, which involves working backwards from a checkmate or a stalemate position, a technique proposed by Richard Bellman in 1965 [9]. This method of working backwards towards a state, instead of analyzing forward from a state, was a new way of solving games like chess and checkers. Given the exponential complexity of chess, endgame tablebases have only been solved for up to a certain number of pieces. At the time of this writing, endgame tablebases for chess have been solved for up to 7 pieces. Here we used data from the Syzygy tablebase, which is available online at the Syzygy website [10].

Table 2.1: Definition of WDL values in the Syzygy tablebase, giving the game-theoretical values for a given board state.

WDL value   Meaning
-2          Loss guaranteed
-1          Loss possible, draw can be claimed by the 50 move rule
 0          Draw guaranteed
 1          Win possible, draw can be claimed by the 50 move rule
 2          Win guaranteed

The Syzygy tablebase also holds, for a given board state, the depth to mate value (DTM) and the depth to zeroing-move value (DTZ), which were not used in this research. The number of possible board states is exponential with respect to the number of pieces. For example, the number of states for 7 pieces is 4 orders of magnitude greater than for 5 pieces. So these tablebases grow fast, as can be seen in Table 2.2. In this study we are only using states composed of two kings and some pawns. This gives us two advantages. First of all we are able to shrink our dataset substantially, which makes training feasible given the time restrictions and the hardware at our avail. Secondly, given only kings and pawns, the 50 move counter is bound to reset quite frequently. We can therefore ignore the 50 move rule, which shrinks the possible WDL values from 5 to 3, so our neural network has 3 outputs instead of 5 and fewer weights to train.
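The collapse from five WDL values to three outcome labels can be made explicit with a small helper. The function below is our own sketch (the thesis does not publish its code): it maps the Syzygy values {-2, -1, 0, 1, 2} onto the three network labels {-1, 0, 1}.

```python
# Collapse the five Syzygy WDL values {-2, -1, 0, 1, 2} into the three outcome
# labels {-1, 0, 1} (loss, draw, win) used as network outputs. With only kings
# and pawns on the board the 50 move rule can be ignored, so "win/loss possible"
# (WDL = ±1) is treated the same as "win/loss guaranteed" (WDL = ±2).

def collapse_wdl(wdl):
    if wdl not in (-2, -1, 0, 1, 2):
        raise ValueError("invalid WDL value: %r" % (wdl,))
    return (wdl > 0) - (wdl < 0)  # the sign of the WDL value

print([collapse_wdl(v) for v in (-2, -1, 0, 1, 2)])  # [-1, -1, 0, 1, 1]
```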

Table 2.2: Syzygy endgame tablebase information, showing both all possible states and states with only pawns and kings.

Number     Number of States                          Size
of Pieces  Pawns        All           Ratio     Pawns     All
2                       462                               incl. in 5
3          163 × 10^3   368 × 10^3    2                   incl. in 5
4          7 × 10^6     125 × 10^6    18        244 KB    incl. in 5
5          160 × 10^6   25 × 10^9     156       7 MB      939 MB
6          13 × 10^9    3 × 10^12     231       591 MB    150 GB
7                       423 × 10^12                       17 TB
8                       38 × 10^15                        1 PB

Chapter 3

Methods

Here we describe the architecture used for representing board states and the parameters used for designing our CNN, as well as the methods used for evaluating transferability. The representation of our training data is covered in Section 3.1 and the extraction of WDL values from the Syzygy tablebase in Section 3.2. In Section 3.3 we introduce methods to evaluate the general transferability of non-local features by transfer learning, a method we adopt from Yosinski et al. [1], which assesses reasons for performance drop due to co-adaption and specification of layers. Our method for doing full transfer learning is described in Section 3.5, and an extension of transfer learning called expansion learning is defined in Section 3.6.

3.1 Board State Representation

All possible board states for King vs Pawn chess were generated up to 7 pieces, and two suitable board state representations were designed. In both representations a square is referenced by a number between 0 and 63, where a1 is mapped to 0 and h8 to 63, as can be seen in Figure 3.1, while Figure 3.2 depicts an example state.

Figure 3.1: Chess table with numbering of squares, where a1 is mapped to 0 and h8 to 63.

Figure 3.2: Example board state with 7 pieces.

Figure 3.3: Example from Figure 3.2 in vector state representation.

Figure 3.4: For full state representation. Each piece type has its own 8 × 8 bit-array.

a) White pawns b) Black pawns

c) White king d) Black king

Figure 3.5: Example from Figure 3.2 in full state representation. Each piece has its own 8 × 8 bit-array, resulting in an 8 × 8 × 4 tensor.

Vector state representation is a dense representation of a board state. For a dataset of n pieces, the state is represented as an n-dimensional integer vector. In our case, working only with pawns and kings, this representation is unambiguous since we have defined rules about which piece governs which element in the vector. The first n − 2 elements are the pawns, where the first ceil((n − 2)/2) are white pawns and the rest are black pawns. The last two elements are the white and black king, respectively. As can be seen from the above, when n − 2 is odd white has the majority of pawns. This representation of the example from Figure 3.2 is depicted in Figure 3.3.
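The vector encoding just described can be sketched as follows (the function and variable names are our own illustration; the thesis does not publish its code):

```python
# Sketch of the vector state representation: each square name is mapped to an
# index 0..63 (a1 -> 0, h8 -> 63), and a state is the integer vector
# [white pawns..., black pawns..., white king, black king].

def square_index(name):
    file, rank = name[0], int(name[1])
    return (rank - 1) * 8 + (ord(file) - ord('a'))

def to_vector(white_pawns, black_pawns, white_king, black_king):
    squares = white_pawns + black_pawns + [white_king, black_king]
    return [square_index(s) for s in squares]

print(square_index('a1'), square_index('h8'))  # 0 63

# Illustrative 5-piece state: white pawns e4 and d5, black pawn e5,
# white king e1, black king e8.
state = to_vector(['e4', 'd5'], ['e5'], 'e1', 'e8')
print(state)  # [28, 35, 36, 4, 60]
```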

Full state representation follows the tradition of Leela [11] and AlphaZero [12] of using a sparse, image-like representation of chess states, which is well suited as input to convolutional neural networks. The state is defined by an 8 × 8 × mp tensor, where mp is the number of piece types. Each piece type has its own 8 × 8 bit-plane in this tensor, with values 0 or 1 representing an empty square or a square with a piece, respectively. So for pawn chess we have four types of pieces (mp = 4): white pawn, black pawn, white king and black king, which gives us an 8 × 8 × 4 tensor. This state representation is especially suitable for convolutional neural networks due to the image-like nature of the chess board. An example of this representation can be seen in Figure 3.5.
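A minimal builder for this bit-plane tensor might look as follows (nested lists stand in for a real tensor; names and layout order are our own assumptions):

```python
# Sketch of the full state representation: an 8 x 8 x 4 tensor of bit-planes,
# one plane per piece type (white pawn, black pawn, white king, black king).
# Here the tensor is nested lists indexed [plane][rank][file]; illustrative only.

PLANES = {'P': 0, 'p': 1, 'K': 2, 'k': 3}  # piece letter -> plane index

def to_planes(pieces):
    """pieces: list of (piece_letter, square_name) pairs, e.g. ('P', 'e4')."""
    tensor = [[[0] * 8 for _ in range(8)] for _ in PLANES]
    for piece, sq in pieces:
        file = ord(sq[0]) - ord('a')
        rank = int(sq[1]) - 1
        tensor[PLANES[piece]][rank][file] = 1
    return tensor

t = to_planes([('P', 'e4'), ('p', 'e5'), ('K', 'e1'), ('k', 'e8')])
print(t[0][3][4], t[3][7][4])  # 1 1  (white pawn on e4, black king on e8)
```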

It is worth noting that by defining the number of white pawns in an n-piece state to be equal to ceil((n − 2)/2), we are leaving out a subset of the corresponding Dn domain. In our research we worked with a maximum of n = 5, so the outcome is trivial for the omitted states where one player has a king and three pawns while the opponent has a single king, and leaving out those states should not change much. But with higher n the outcome is not as trivial, and it is therefore not recommended to limit the dataset as much in this regard. This rule also disregards states where black has more pawns than white, but for symmetry reasons we are not losing any information. One can always swap white and black in the board state and multiply the WDL value by −1 to get an estimation of a board state from black's perspective. This is not true in the general case, but since we are only working with pawns and kings there is complete symmetry between black and white. This also justifies the omission of the side-to-move information from the board state, so we always assume that white has the move. This comes in handy since feeding this information into a CNN is not as straightforward as feeding in a raw board state.
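The colour-swap symmetry argued above can be made concrete on the vector representation. A convenient bit trick: with a1 = 0 and h8 = 63, XOR-ing a square index with 56 mirrors it vertically (a1 to a8, e4 to e5, and so on). The helper below is our own illustration and assumes the [white pawns..., black pawns..., white king, black king] layout:

```python
# Colour-swap symmetry sketch: mirror the board vertically, exchange the roles
# of white and black, and negate the WDL label. Squares are 0..63 with a1 = 0;
# XOR-ing an index with 56 flips its rank (a1 <-> a8, e4 <-> e5, ...).

def swap_colours(state, wdl, num_white_pawns):
    pawns, kings = state[:-2], state[-2:]
    white_pawns = [sq ^ 56 for sq in pawns[:num_white_pawns]]
    black_pawns = [sq ^ 56 for sq in pawns[num_white_pawns:]]
    # Black's pieces become white's and vice versa; the kings swap as well.
    new_state = black_pawns + white_pawns + [kings[1] ^ 56, kings[0] ^ 56]
    return new_state, -wdl

# e4=28, d5=35, e5=36, e1=4, e8=60: white pawns e4 and d5, black pawn e5.
swapped, label = swap_colours([28, 35, 36, 4, 60], 2, num_white_pawns=2)
print(swapped, label)  # [28, 36, 27, 4, 60] -2
```

Applying the transform twice (with the pawn count of the swapped side) recovers the original state and label, confirming that no information is lost.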

3.2 WDL Values

The WDL values needed to be extracted from the Syzygy endgame tablebase, which can be downloaded from the Syzygy website [10]. In order to extract the WDL values we used the python-chess package, which has built-in methods to probe the Syzygy database. Before extracting the WDL values one needs to create a board state class with the python-chess package. Loading a state from disk took only 1% of the processing time, generating the class took 18%, but probing the Syzygy tablebase took up 81% of the time. In total, probing for one state took 50 µs. Here we focused on tablebases with m ∈ {3, 4, 5}, since working with larger m is quite time consuming, and this range for m should be considered a good proof of concept for the general ideas. In Table 3.1 we have a breakdown by the frequency of each WDL value for states including pawns and kings. From the table we see that, apart from the Loss WDL value for m = 3, no WDL value can be considered sparse. This is of high importance since if one label dominates the label space, one can get relatively decent accuracy from always guessing that label.

Table 3.1: Breakdown by WDL values showing the number of states for each WDL value and the corresponding ratio of the whole dataset.

Number of           LOSS                    DRAW                    WIN
pieces              states      ratio       states      ratio       states      ratio
3                   0           0.000       38,368      0.235       124,960     0.765
4                   1,737,970   0.234       2,485,090   0.334       3,213,028   0.432

3.3 Transferability

It has been shown that leaving out some of the more specific parts of the network when doing transfer learning can result in better final accuracy, as well as avoiding overfitting when working with small target datasets. This is especially important when the tasks in the source and target domain are dissimilar [13]. It is therefore important to develop methods to maximize the transferability of knowledge between domains by evaluating which parts of the network to reuse.

Figure 3.6: Image classification performance after knowledge transfer between domain A to B (AnB) and B to B (BnB) as a function of n, the number of transferred layers. Transferred weights of AnB and BnB are kept frozen while transferred weights of AnB+ and BnB+ are allowed to fine-tune (from Yosinski et al. [1]).

For a classifier network with a softmax output, it is natural to assume that the latter layers are related to the task at hand, since they are in close connection to the classifier, and are therefore considered specific, while the early layers in a network have been shown to detect simple features, at least in the case of image classification [13], and are therefore considered general. Given this different classification of initial and final layers, it is logical to assume a transition from general to specific as we traverse the network.

Here we adopt methods developed by Yosinski et al. [1] to measure the transferability of layers in a neural network and evaluate the applicability of this method in domains other than image recognition. This is done by quantifying to what degree a layer can be considered general or specific, and by evaluating whether this transition is sudden or evolves gradually over many layers. Yosinski et al. demonstrate their method by doing transfer learning in the domain of image classification, where they split the ImageNet [8] dataset into two approximately equal parts A and B by dividing the classes randomly between them. The results of Yosinski et al.'s measurements can be seen in Figure 3.6 and serve as a reference for how to evaluate the transferability of layers, as covered below. For a detailed explanation of their method please refer to Yosinski et al. [1].

Although there are similarities between chess states and image data, as both consist of layered 2D arrays with a certain local proximity surrounding each square/pixel, the locality of chess is more complex. Yosinski et al. did their research in the domain of image recognition, in which the general feature extractors activate on features that can be considered local. Take for example the task of deciding whether a single pixel is part of a line: one needs to look no further than the adjacent pixels.
But in the game of chess, a square's value cannot be determined by looking at the adjacent squares alone, since for example the queen can traverse the whole board in a single move and capture a piece 7 squares away. And a pawn with a high rank is much more likely to be promoted to a queen by reaching rank 8 than a pawn with a low rank, so its relative location on the board with respect to rank is important. These are all features that can be considered non-local.

Our goal is to evaluate whether the previously mentioned method by Yosinski et al. is applicable in domains other than image recognition. We wanted to select a domain that retains some local features but still differs in some regard, and the existence of non-local features in chess is an interesting addition. We did not, however, want to deviate too far from the original domain by introducing features too dissimilar in structure from image recognition. Therefore we chose to work only with kings-vs-pawns chess endgames. By limiting ourselves to this sub-domain of the whole tablebase, we are lessening the amount of non-locality, but still retaining some.

In our adoption of this method, we are not splitting the dataset in two, but rather comparing networks trained on subsets of the Syzygy tablebase split with respect to the number of pieces. As in Yosinski et al.'s method, two networks are trained as a base for the transfer: network φA is trained on DA while network φB is trained on DB, where in our case B = A+1. Transfer learning is then applied by training a source network on the source dataset, reusing the first n layers of the source network, randomly initializing the remaining layers, and retraining the network on the target dataset. A transfer is done both from A to B and from B to B by creating two networks:

• A transfer network φAnB (AnB for short) created by copying the first n layers from φA and retraining on DB.

• A selffer network φBnB (BnB for short) resulting from copying the first n layers from φB and retraining on DB.

The transfer network is used to evaluate the benefits of the transfer as a function of n, while the selffer network is used as a control case. The n reused layers of AnB and BnB are then either frozen, i.e. not updated by the backpropagation algorithm, or fine-tuned by allowing them to retrain. The fine-tuned networks are differentiated from the frozen ones by a plus sign, so we end up with four networks: AnB, BnB, AnB+ and BnB+. By studying the evolution of the accuracy, and the difference in accuracy between the networks as a function of n, we are able to estimate to what degree the layers are general or specific. A few general assumptions taken from Yosinski et al. [1] can be made given the accuracy of the resulting networks, evaluating the reasons behind performance loss based on which part of the network is transferred. The following assumptions are best understood by referring to Figure 3.6.

• First part of the network: If the accuracy of network AnB in predicting labels in DB is no different from the accuracy of BnB at the same task, we can assume that layers 1 to n are general with respect to DB. If, on the other hand, the performance of AnB is worse than that of BnB, we can assume that at least layer n is specific to DA. This effect is observed in Figure 3.6 at layers 1–3, which seem to be general.

• Middle part of the network: If we see a drop in accuracy of AnB and BnB in the middle of the network, while AnB+ and BnB+ seem to recover with retraining, we can assume that the performance drop is due to co-adaption of neurons in layer n with neighbouring layers. In the case of the frozen networks the backpropagation algorithm is not able to compensate and relearn the lost knowledge embedded in the co-adaption, while fine-tuning is able to adapt and get accuracy back on track. At layers 4 and 5 in Figure 3.6 we see clear evidence of performance loss dominated by the co-adaption effect.

• Latter part of the network: If the accuracy of network AnB is lower than that of network BnB for high n, we can assume that the drop in AnB's accuracy is due to specification of layers in φA with respect to DA. If BnB converges to the accuracy of BnB+, we can assume that performance loss due to co-adaption has diminished. The continued drop of AnB and the rise in performance of BnB in Figure 3.6 at layers 6 and 7 is an example of this effect, where specification is the dominant performance retardant.
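The construction of the four networks described above can be sketched in Python as an illustrative toy (not the thesis code; layer weights are flat lists here): copy the first n layers from the source, randomize the rest, and freeze the copied layers unless fine-tuning.

```python
import random

def init_transfer_net(source_layers, n, fine_tune):
    """Initial state of an AnB/BnB-style network: the first n layers are
    copied from the source network and frozen (unless fine-tuning, i.e.
    AnB+/BnB+), while the remaining layers are randomly re-initialized
    and are always trainable."""
    layers = []
    for i, weights in enumerate(source_layers):
        if i < n:
            layers.append({"weights": list(weights),        # copied layer
                           "trainable": fine_tune})         # frozen unless '+'
        else:
            layers.append({"weights": [random.gauss(0.0, 0.1)
                                       for _ in weights],   # fresh random init
                           "trainable": True})
    return layers
```

With `fine_tune=False` this corresponds to AnB/BnB; with `fine_tune=True` to AnB+/BnB+.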

3.4 Network Design

We used a Convolutional Neural Network (CNN) as our network architecture; an example can be seen in Figure 2.1. The network consists of an input layer, a sequence of convolutional layers which serve as feature extractors, followed by a fully connected layer which acts as a classifier, and finally an output layer. The shape of the input layer is 8 × 8 × 4, defined by the full state representation, but we had to decide on the shape of the convolutional filters and the number of convolutional layers. Note that in these comparisons we aimed at having the same number of weights for each network, in order to give a fair comparison. Unless stated otherwise, experiments were carried out using the D4 dataset, the subset of 4-piece states from the whole tablebase.

Figure 3.7: Comparing accuracy of 2x2 and 3x3 filters

We compared the accuracy of 2 × 2 filters and 3 × 3 filters, with results shown in Figure 3.7. After deciding on a 2 × 2 filter we tested the optimal number of convolutional layers, reported in Figure 3.8 and Table 3.2, where 7 layers gave us the best result. This comes as no surprise since we are using 2 × 2 filters, so the dimensions of the "board" shrink by one square for every convolutional layer. We therefore need 7 convolutional layers for the last one to output a tensor of shape 1 × 1 × i, where i is an integer defined by the number of filters in each layer. Remember that in the general case one piece can traverse the whole board in one move, so for classification it makes sense that the fully connected layer works with bottleneck feature extractions from the whole board. The shape of the output layer is 3, defined by the number of WDL values we need to predict.
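The shrinking of the board dimensions can be verified with a short helper (hypothetical, for illustration): each unpadded ("valid") convolution reduces the spatial size by kernel − 1, so seven 2 × 2 layers take 8 × 8 down to 1 × 1.

```python
def conv_output_size(size, kernel, layers):
    """Spatial size after `layers` unpadded ('valid') square convolutions."""
    for _ in range(layers):
        size -= kernel - 1  # each valid conv shrinks the board by kernel-1
    return size
```

For example, `conv_output_size(8, 2, 7)` gives 1, matching the 7-layer design above.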

Table 3.2: Evaluating optimal number of convolutional layers

Layers    Weights    Accuracy
3         191,000    0.922
4         176,000    0.951
6         187,000    0.975
7         182,000    0.977

Figure 3.8: Evaluating optimal number of convolutional layers

To decide on the size of the network (number of weights) we measured the accuracy with respect to training time for a few network sizes, as seen in Table 3.3. The network size was tuned by gradually increasing the number of filters per convolutional layer. For 512,000 weights convergence took much longer and did not reach as good a final accuracy as the other candidates (results not reported). We see that the mapping from network size to training time is non-linear,

Table 3.3: Evaluating optimal network size (250 epochs)

Weights    Accuracy    Training time [min]
35,000     0.956       167
70,000     0.975       192
182,000    0.977       282
256,000    0.983       306

and given the amount of decrease in final accuracy, it is not beneficial to go for a smaller network. We decided on 182,000 weights rather than 256,000 weights, even though the latter network gives a further increase in accuracy for the small cost of an approximately 8.5% increase in training time. The reason is that when a network's capacity is much higher than the information content of the domain, there is a higher risk of overfitting, and we would also be learning the D3 dataset with this network, so we opted for 182,000 weights. However, it turned out later that we were probably far from overfitting, so in retrospect we could have chosen the larger network without risk.

3.5 Transfer Learning

In our transferability study we did not have the means to let the training converge, due to lack of time and computational resources. We therefore want to evaluate the final accuracy of doing a full transfer and allowing the network to converge, to get an estimation of the final accuracy. We are not expecting a big domain difference between chess variants, since they are mostly played with a similar board structure and related rules. It is therefore likely that transferring knowledge between chess variants will prove successful. To evaluate the efficiency of transfer learning in this domain we do a knowledge transfer from D3 with the goal of enhancing accuracy in predicting labels in D4. This is done by fully training a network φ3 on D3, allowing the test accuracy to converge. We then initialize a network φ4 with the weights from φ3 and train it on D4. We are interested in two things: whether the knowledge transfer seeds better initial accuracy for φ4, and whether the transfer affects final accuracy compared to no knowledge transfer. Better final accuracy would be optimal, as has often been the case in other studies, e.g. Caruana, 1995 [13]. But no benefit, or even a negative transfer, where the transfer actually reduces final performance, could also be observed.

3.6 Expansion Learning

Usually the main application of transfer learning has been to transfer from a larger domain to a smaller one, in order to boost final accuracy and to avoid overfitting due to small sample size when retraining on the target domain. But we turn this around and evaluate the feasibility of transferring knowledge from a smaller domain to a larger domain without retraining, essentially expanding the knowledge in the source domain to the target domain. In order to evaluate the effectiveness of our method we take a special look at the edge case of AnB with n = 8, meaning that we do a full transfer and no retraining, a method we label expansion learning. The biggest advantage of vanilla transfer learning is that given a small sample from the target domain, one is able to avoid overfitting and get better accuracy in comparison to no knowledge transfer. In the case of expansion learning we can go even further and get estimations without any label information. The method consists of training a network φm on a source dataset Dm and using it, without retraining, to evaluate states from the target dataset Dk where m < k.
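In code, expansion learning reduces to evaluating a fixed, source-trained predictor on labelled target-domain states with no retraining; a minimal sketch (function names hypothetical):

```python
def expansion_accuracy(predict, dataset):
    """Accuracy of a frozen, source-trained predictor on a labelled
    target dataset, with no retraining (expansion learning)."""
    correct = sum(1 for state, label in dataset if predict(state) == label)
    return correct / len(dataset)
```

Here `predict` would be the trained network φm and `dataset` the target state-label pairs from Dk.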

This has possible applications for expanding the current tablebases beyond the m = 7 piece limit, by training a network on the current tablebase and using it to evaluate states in Xk with k > 7. Since we would have no label information there, we would have no way to evaluate whether our expansion is any good. But here we do a feasibility study of that concept by selecting k ≤ 7 and m = k − 1, for example m = 3 and k = 4, since we have the labels for both of those datasets. If our method is successful in evaluating labels in D4 given information in D3, as judged by comparing real and estimated labels, we might expect similar results when expanding beyond 7 pieces. As a base case for comparison we chose an untrained network φrnd initialized with random weights and measured its accuracy on the same task. The theoretical expected accuracy of a random network is P(φrnd(X) = Y) = 1/3, which might seem obvious, but it is interesting to note that this is independent of fm(·), the objective predictive function for state labels in Xm. This can be seen from the following deduction:

We define PXm(l) as the probability of sampling a state with label l from Xm. Consider an untrained network φrnd with randomized initial weights, and datapoints dm = (xi, yi) ∈ Dm, where xi ∈ Xm and yi ∈ Y = {−1, 0, 1}. Given a random state, and assuming that the network's output is completely random, the probability of the network outputting a label l is Prnd(l) = 1/3 for all l ∈ {−1, 0, 1}. Then the probability of our network guessing the right label is given by

P(φrnd(xi) = yi) = Σ_{l=−1}^{1} Prnd(l) · PXm(l)
                 = (1/3) · Σ_{l=−1}^{1} PXm(l)
                 = 1/3

As can be seen, the probability of guessing the right answer is 1/3, independent of PXm(l) and therefore of fm(·).

Chapter 4

Results and Discussions

Here we present and discuss the empirical results from our experiments. We begin by covering the reasons for selecting our final hyperparameters and the methods we used to tune our model for best performance. The model was optimized with respect to our experimental setup, which is covered in the section thereafter. We then move on to our study of transferability and the method's applicability in the domain of endgame tablebases. Since full convergence was not feasible for our transferability study, we finally take a closer look at two edge cases where n = 8: full transfer learning, by converging 3n4+, and expansion learning, by evaluating the predictive power of the 3n4 and 4n5 networks on their target datasets.

4.1 Tuning Hyperparameters

To tune our network for best performance we had to search for the best hyperparameters. All tests were done on the D4 dataset. Note that the results are from single runs, since averaging over many runs was time consuming. Also note that the accuracies observed in these tests are lower than the final scores we present later in this chapter. This is because here we are only optimizing one parameter at a time, so we have not reached the network's full potential, while in our final runs we are using a fully optimized version of the network. First of all we needed to choose an optimizer. Two popular choices are Adam and Adadelta. A comparison between the two can be seen in Figure 4.1, where Adadelta is clearly the better option. The effect of introducing batch normalization can be seen in Figure 4.2. Since batch normalization did not give an obvious performance boost, we opted to skip it. According to a recent study [14], for most neural networks, using 16 bit precision should not affect final accuracy much with respect to 32 bit precision, while resulting in much faster training times, since we are essentially reducing the size of the network by a factor of 2. As can be seen in Figure 4.3, we get a slight decrease in final accuracy. As a further downside we did not notice a decrease in training time; on the contrary, training with 16 bit precision took 40% more time than with 32 bit. This is surprising and possibly a limitation of Keras, the library we used for training.

Figure 4.1: Comparing performance of Adam vs Adadelta

Figure 4.2: Evaluating effect of batch normalization on accuracy

Figure 4.3: Comparing accuracy of 16 and 32 bit floating point precision

Finally we needed to decide on the number of epochs and evaluate whether our networks were overfitting. As can be seen in Figure 4.4, which shows the training and validation accuracy when training on D4, full convergence of the validation accuracy is reached between 100 and 150 epochs; so for those cases where full convergence is needed we decided on 150 epochs. One way to evaluate whether a network is overfitting is to monitor the validation loss, which should gradually go down with each epoch. If the validation error starts (on average) to rise again after reaching a minimum, this is an indication of overfitting. Also, if the training loss continues to drop while the validation loss has converged, that is an indication that we are limited by the capacity of our model [15]. As can be seen from Figure 4.5 we are clearly not overfitting and could thus benefit from using a bigger model.
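The monitoring rule described above can be sketched as a small helper (hypothetical, not the thesis code): track how many epochs have passed since the validation-loss minimum; a count that keeps growing after convergence suggests overfitting has begun.

```python
def epochs_since_best(val_losses):
    """Number of epochs elapsed since the minimum validation loss.
    A value that keeps growing while training loss still drops is a sign
    of overfitting; if validation loss merely plateaus while training
    loss falls, the model is likely capacity-limited instead."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch
```

One would typically combine this with a patience threshold, stopping training once the count exceeds it.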

Figure 4.4: Evaluating the number of epochs needed for convergence

Figure 4.5: Evaluating the possibility of overfitting

Figure 4.6: Evaluating effect of batch size on accuracy and training speed

In Figure 4.6 we see how changing the batch size affects accuracy. We checked the final accuracy after 100 epochs, shown in Table 4.1, and the accuracy after 10 epochs, shown in Table 4.2. We clearly see that if the batch size is too big it can hinder final performance. There does not seem to be a big difference in final accuracy for batch sizes of 256, 512 or 1,024, but one could assume from Figure 4.6 that 256 is initially faster at learning. From Table 4.2, where we train for only 10 epochs, we see that there is a dramatic difference in training time per epoch. From Table 4.3 we see that if we limit ourselves not to 10 epochs but to 30 minutes, a batch size of 1,024 reaches the highest accuracy within that time period. We nevertheless opted for a batch size of 256 based on Figure 4.6, since the information in Tables 4.2 and 4.3 was not available until we had already begun our measurements.

Table 4.1: Evaluating effect of batch size on accuracy after 100 epochs

Batch size    Accuracy
256           0.975
512           0.973
1,024         0.977
2,048         0.969
4,096         0.974
8,194         0.961

Table 4.2: Evaluating effect of batch size on accuracy and training time for 10 epochs

Batch size    Accuracy    Training time [mm:ss]
256           0.963       28:34
512           0.954       15:24
1,024         0.955       09:13
2,048         0.941       07:53
4,096         0.936       06:42
8,194         0.914       06:02

Table 4.3: Evaluating effect of batch size on accuracy after 30 minute training time

Batch size    Accuracy
256           0.963
512           0.962
1,024         0.970
2,048         0.961
4,096         0.966
8,194         0.952

4.1.1 Final Hyperparameters

In Table 4.4 we see the relevant chosen hyperparameters as well as important network information. A breakdown of the network structure can be seen in Table 4.5. Due to time restrictions and limited computational resources, we limited training to 10 epochs for all transferability measurements in Section 4.3.

Table 4.4: Chosen hyperparameters and network information

Hyperparameter         Value
Filter size            2 × 2
Conv layers            7
Input size             8 × 8 × 4
Output size            3
Number of weights      181,651
Optimizer              Adadelta
Batch normalization    not used
Precision              32 bit
Batch size             256
Network size           757 KB

Table 4.5: Network model showing layer type, filter size and number of parameters

Layer type          Filters    Parameters
Input               -          -
Conv 1              16         272
Conv 2              32         928
Conv 3              32         4,128
Conv 4              64         8,256
Conv 5              128        32,896
Conv 6              128        65,664
Conv 7              128        65,664
Flatten             -          -
Fully connected     -          3,843
Output              -          -
Total parameters               181,651
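Most rows of Table 4.5 follow the standard parameter count for a convolutional layer, kernel² × c_in × c_out weights plus one bias per filter; a quick check (helper name hypothetical):

```python
def conv_params(kernel, c_in, c_out):
    """Trainable parameters in a 2-D conv layer with a square kernel:
    kernel*kernel weights per (input channel, filter) pair, plus one
    bias per output filter."""
    return kernel * kernel * c_in * c_out + c_out
```

For example, `conv_params(2, 4, 16)` reproduces the 272 parameters of Conv 1 (a 2 × 2 kernel over the 4 input planes).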

4.2 Experimental Setup

We used two Nvidia GeForce GTX 1080 Ti GPUs with 11 GB of memory each. All neural networks were programmed in Python 3.6 using Jupyter Notebook. Our neural network framework was Keras 2.2.4 on top of TensorFlow 1.11.0. For creating board states we used the python-chess package, which was also used to probe the Syzygy tablebase for WDL values.

4.3 Transferability

To evaluate the transferability of knowledge in domains with non-local features, and to evaluate which parts of the network are feasible for transfer, we first compare 4n4 with 4n4+ to check for performance loss due to co-adaption between neighbouring layers. We then compare the accuracy of 3n4 with respect to 3n4+ to quantify to what degree a layer can be considered specific, by evaluating performance loss due to specification.

4.3.1 Performance Loss Due to Co-Adaption

We trained networks φ4, initialized with random weights, on D4 for 150 epochs and averaged over 5 runs, reaching a mean accuracy of (98.8 ± 0.1)%. Then networks 4n4 for all n ∈ {1, ..., 8} were initialized by transferring the first n layers of network φ4; the remaining layers were randomly initialized and the network retrained on D4 for 10 epochs. We see the results in Figure 4.7, where for 4n4 we freeze the transferred layers, while in 4n4+ we allow them to fine-tune by retraining. For reference, the red dotted line at 0.988 and the corresponding 95% confidence interval represent the final accuracy of a fully trained source network φ4. In the figure we see clear signs of co-adaption in certain parts of the network, where two or more neurons in neighbouring layers have co-evolved and collectively serve as a feature extractor. The signs come in the form of performance loss of the frozen networks (4n4) with respect to the networks allowed to fine-tune (4n4+). A statistically significant split between 4n4 and 4n4+ at layers 3 to 5 is an indication of this, while for other n there is not much difference in the networks' accuracy.

Figure 4.7: Co-adaption splitting at layers 3, 4 and 5

This performance decrease around the middle layers for 4n4 with respect to 4n4+ shows a similar trend to that reported by Yosinski et al., who attributed it to co-adaption. While the fine-tuning of 4n4+ seems able to compensate for this co-adaption, the backpropagation algorithm in the case of 4n4 does not seem able to relearn the lost knowledge.

Due to training time and computing power restrictions we were only able to train each data-point for 10 epochs on the target datasets, while around 150 epochs would be needed to fully converge the accuracy to its final value. From Figure 4.7 we see that performance generally increases with n. This trend is most likely due to the lack of convergence, explained by the fact that for low n we are throwing away more knowledge from the source network than for high n. This trend was not observed by Yosinski et al. [1], where the data-points were allowed to fully converge. Consequently, there is no reliable basis for determining the optimal n for our domain, but the observed co-adaption split should still be valid, and there is no reason to attribute it to lack of convergence.

One thing worth mentioning is that the same fully trained φ4 network was used as a seed for all runs, due to training time restrictions; it would of course have been optimal to retrain a new source network for every run. From Figure 4.7 we see a small but statistically significant performance increase for 4n4 at n = 7 compared to the converged φ4, which would possibly have increased given more training on the target network. From Table 4.6 we see that the final accuracy of the converged φ4 is (98.8 ± 0.1)%, as depicted by the red dotted line in Figure 4.7, while for 4n4 with n = 7 the final accuracy is (99.005 ± 0.007)%, depicted by the solid blue line. So in the case of 4n4, retraining only the final layer for 10 epochs seems to work better than continuing to train all the layers of φ4.
In contrast we do not seem to observe the same trend for 4n4+, which could possibly be explained by the fact that we are retraining the whole network; the trend might appear given enough training. The measured performance increase was unexpected, but might just be an artifact of always using the same source network φ4 as a seed for 4n4, instead of retraining a new φ4 network for every run. The small uncertainty of 4n4 at n = 7 is an indication that the same seed was used over and over. It is highly likely that a different value and uncertainty would be measured given fresh seeds, so this difference should be taken with a grain of salt.

4.3.2 Performance Loss Due to Specification

To evaluate the gradient from general to specific parts of the network, we train a network φ3 on D3 for 150 epochs. Then networks 3n4 are initialized by transferring the first n layers from φ3, and trained on D4 for 10 epochs. We see the results in Figure 4.8, averaged over 5 runs, where for 3n4 we freeze the first n layers, while for 3n4+ we allow them to retrain and fine-tune. The red dotted line at (0.963 ± 0.001) shows the accuracy of φ4 networks and the corresponding 95% confidence interval. These networks were trained for 10 epochs and averaged over 10 runs, the same number of epochs as in the training of 3n4+ on the target dataset, in order to give a fair comparison when evaluating the possibility of an initial accuracy boost from knowledge transfer.

Looking at the accuracy of 3n4 in Figure 4.8, a clear and gradual drop can be seen from the very start. This is an indication of two things: performance drop due to co-adaption of neurons between layers, and performance loss due to specification of the latter layers. With reference to Figure 4.7, we can safely say that this performance drop can partially be attributed to co-adaption at layers 3, 4 and 5. But since co-adaption was not detected above n = 5, at layers 6 to 8 the performance drop should be solely due to specification of the transferred layers. We have to take into account that with frozen weights one might expect faster learning than with fine-tuning, since we are updating more weights in the latter case, but given the big accuracy difference for n ∈ {3, 4, 5}, it is safe to assume that co-adaption is the cause. This is in tune with Yosinski et al.'s [1] results in the field of image classification, with one big difference: in their case the first 2 layers show no signs of performance drop, while in our case we start to see a performance drop at the first copied layer.
Since it is unlikely that specification contributes to the performance drop at the first layers, we might assume that this drop is due to co-adaption early on. But given our previous evaluation of Figure 4.7, where the verdict was that co-adaption was only observed at layers 3, 4 and 5, co-adaption is an unlikely cause. A plausible candidate is the fact that we are not allowing the test accuracy to converge but are capping the training at 10 epochs due to training time restrictions. Given convergence we might see a similar trend as in the image classification study on the ImageNet dataset. We might also be observing an artifact attributable to the non-local features of this domain: performance loss due to specification might stretch much further into the general part of the network than in the case of image classification, but full convergence is necessary for any concrete speculation.

To evaluate the possible effect of knowledge transfer on the boost of initial accuracy during training, in Figure 4.9 we take a closer look at 3n4+ and φ4. It seems that 3n4+ achieves better accuracy than the φ4 baseline for all n except n = 8.

Figure 4.8: Performance drop due to co-adaption and specification

As both these networks have been trained for 10 epochs, with the only difference that 3n4+ received a transfer from φ3, we are safe to say that we get an initial boost in training accuracy when doing a transfer from D3 to D4 for all n except n = 8, which is an indication that leaving out at least the final layer is feasible for a successful transfer.

Although we were not able to fully converge our measurements, we see a similar trend in our results as Yosinski et al. [1] did in the domain of image classification, with respect to performance loss due both to co-adaption and to specification. Transferability in domains with non-local features, at least in the case of chess, seems to be quantifiable with similar methods as in the case of image classification.

Figure 4.9: Knowledge transfer seeds better initial accuracy

4.4 Transfer Learning

The results from our transferability study were, as previously mentioned, not allowed to converge. We therefore evaluate the effect of doing a fully converged transfer by training for 150 epochs, at which point the final accuracy had converged. We do this in order to give a fair comparison between transfer and non-transfer, to evaluate whether there are any benefits from doing a transfer. The final results for each network were averaged over 5 runs, since the training result from a single run can vary slightly due to the non-deterministic nature of the backpropagation algorithm in conjunction with the Adadelta optimizer. Confidence intervals at the 95% confidence level were used for error estimation.

We fully trained a source network φ3 on D3. This network was used as a seed to fully train a target network φ3→4 on D4. For comparison we fully trained a network φ4 on D4 starting from random weights. The results can be seen in Table 4.6. In Tables 4.7 and 4.8 we can see the accuracy of the networks when predicting labels for D3 and D4 respectively, split by WDL value. Note that when calculating the accuracy for individual classes, we counted only true positives and false negatives within each WDL value, while false positives were not taken into account; this metric is also known as recall. This was not intentional, and taking false positives into account would of course have given us better insight into the datasets.
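The per-class figures in Tables 4.7 and 4.8 therefore correspond to recall, which can be computed as follows (an illustrative sketch, names hypothetical):

```python
def per_class_recall(y_true, y_pred, classes=(-1, 0, 1)):
    """Recall per WDL value: true positives / (true positives + false
    negatives). False positives are ignored, matching the per-class
    accuracy reported in the tables."""
    recall = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        recall[c] = tp / (tp + fn) if tp + fn else None
    return recall
```

Classes absent from the data (such as Loss in D3) get `None`, mirroring the dash in Table 4.7.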

In Table 4.6 we see the accuracy of the network φ3→4, the result of doing a fully converged transfer from D3 to D4. This is compared to the network φ4, initialized with random weights and trained on data from X4, which should give a good baseline for evaluating the effect of the transfer. As mentioned before, each measurement was averaged over 5 runs, and for each training of a new φ3→4 network a new φ3 seed was used, eliminating the possibility that the same bad initial φ3 seed was hindering performance. As can be seen from the results in Table 4.6, we are experiencing negative transfer: the performance of φ3→4 is lower than that of φ4, meaning that the transfer is actually hindering performance. This behaviour is expected when transferring between dissimilar domains [16], but in our case it comes as a surprise, since one might expect the task of estimating labels in D3 to be quite similar to estimating labels in D4. Given the results of our transferability study, we might be transferring too much specific information from the source domain. This negative transfer might have been avoided by dropping at least the final layer, and with the correct choice of n we might even have observed a positive transfer. The task of evaluating labels in a dataset with an odd number of pieces is to some degree dissimilar to evaluating labels in a dataset with an even number of pieces, so transferring from even to even or odd to odd might also give better transfer results. It could also be speculated, given the high accuracy of (99.50 ± 0.09)% of our source network φ3, that the culprit is overfitting of φ3 on D3: that the network is not learning to approximate f3(·) but rather memorizing the data. This matters because there is no basis for doing transfer from a network that suffers from overfitting.
In the case of φ3, overfitting would not come as a surprise, since the capacity of our network is optimized for learning X4. From Table 2.2 we see that the D4 dataset is approximately 42 times bigger than the D3 dataset, so the network's capacity runs the risk of being too high for this task. It is worth mentioning, however, that none of the networks showed signs of overfitting. This was evaluated by monitoring the validation error, which for φ4 had converged at around 150 epochs. So overfitting of φ3 is an unlikely cause of the observed negative transfer. It is worthwhile to note that for X4, as can be seen in Table 4.8, no WDL value is dominant and the dataset is relatively evenly distributed. In the case of X3, however, Loss is not an option, and Win makes up about three quarters of the dataset, which might contribute to the high accuracy of φ3, since this imbalance could benefit even a biased random predictor.
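The seeding setup described in this section can be sketched schematically: every layer of the target network starts as a copy of the source network's weights, as opposed to a fresh random initialization. The weight representation below (a list of per-layer vectors) is a stand-in for the real CNN weights, not the actual training code:

```python
import copy
import random

def init_random(n_layers, width):
    """Stand-in for a randomly initialized network phi_4: one weight
    vector per layer, drawn from a narrow Gaussian."""
    return [[random.gauss(0.0, 0.1) for _ in range(width)]
            for _ in range(n_layers)]

def seed_from_source(source_weights):
    """Full transfer, as in phi_3 -> phi_3->4: every target layer starts as
    a copy of the corresponding source layer. Dropping the final layer
    before retraining would mean replacing the last entry with a freshly
    initialized one instead."""
    return copy.deepcopy(source_weights)
```

The deep copy matters: retraining the seeded target must not mutate the stored source weights, since a fresh φ3 seed is compared across runs.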

Table 4.6: Final label prediction accuracy.

Task            Accuracy (%)

φ3 on D3        99.50 ± 0.09
φ4 on D4        98.8 ± 0.1
φ3→4 on D4      97.2 ± 0.5

Table 4.7: Final accuracy of φ3 on D3 by WDL values.

WDL value    Accuracy (%)    Proportion of dataset (%)
Loss         -                 0.00
Draw         98.99             23.50
Win          99.71             76.50
Total        99.50 ± 0.09     100.0

Table 4.8: Final accuracy of φ3→4 on D4 by WDL values.

WDL value    Accuracy (%)    Proportion of dataset (%)
Loss         97.40             23.4
Draw         95.97             33.4
Win          98.06             43.2
Total        97.2 ± 0.5       100.0

4.5 Expansion Learning

To evaluate the feasibility of expanding knowledge from one domain onto a related domain with limited to no label information, we measure how well a network φm, trained on data from Xm, predicts labels in Dm+1. We do expansion from D3 to D4, and from D4 to D5. As a control, we measure the effectiveness of a randomly initialized network at estimating labels in the target domain. As previously mentioned, this could prove useful for extending the current tablebases.
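Since expansion involves no retraining at all, the measurement reduces to evaluating the frozen source network directly on target-domain positions. A minimal sketch (where `model` is any callable mapping a state to a predicted label; the names are illustrative):

```python
def expansion_accuracy(model, states, labels):
    """Empirical P(model(x) = y) on the target domain: the fraction of
    target states whose label the (unretrained) source network predicts
    correctly."""
    correct = sum(1 for x, y in zip(states, labels) if model(x) == y)
    return correct / len(labels)
```

The same function serves for the control, by passing in the untrained network φrnd instead of the pretrained one.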

Accuracy of expanding from D3 to D4

Given an untrained network φrnd and a state-label pair (xi, yi) ∈ D4, where xi ∈ X4 and yi ∈ Y, the empirical probability of guessing the right label was measured to be

P(φrnd(xi) = yi) = 0.318

but for a network φ3 pretrained on data from X3 and a state-label pair (xi, yi) ∈ D4, the empirical probability of guessing the right label was measured as

P(φ3(xi) = yi) = 0.570

Accuracy of expanding from D4 to D5

Similarly, for (xi, yi) ∈ D5, where xi ∈ X5 and yi ∈ Y, the empirical probability of guessing the right label was measured as

P(φrnd(xi) = yi) = 0.301

But given a network φ4 pretrained on D4 and a state-label pair (xi, yi) ∈ D5, the empirical probability of guessing the right label was measured as

P(φ4(xi) = yi) = 0.500

In both cases, our expansion method outperformed a random predictor network at evaluating labels in the target domain, although not by a large margin. From the results we see that in expansion from D3 to D4, the label prediction accuracy of our expansion network φ3 is 0.570, which compared to the random estimator network φrnd at 0.318 is an increase of 79%. With expansion from D4 to D5, the prediction accuracy of our expansion network φ4 is 0.500, which compared to the random estimator network φrnd at 0.301 is an increase of 66%. This is a substantial increase, but still not high enough to be feasible for tablebase expansion beyond the m = 7 piece limit. We have to take into account, though, that in both cases we are extending between an even and an odd number of pieces. This might be a hindrance, since we are introducing a relatively high imbalance between the players, especially given the low piece count. It might prove more useful, and serve as a better proof of concept, to do expansion from even to even, or odd to odd, with a higher piece count.
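The quoted relative improvements follow directly from the measured accuracies; a quick check of the arithmetic:

```python
def relative_gain(accuracy, baseline):
    """Percentage increase of the expansion network's accuracy over the
    random-predictor baseline."""
    return (accuracy / baseline - 1.0) * 100.0

d3_to_d4 = relative_gain(0.570, 0.318)  # expansion from D3 to D4, ~79% increase
d4_to_d5 = relative_gain(0.500, 0.301)  # expansion from D4 to D5, ~66% increase
```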

Our results show that there is indeed some information transfer with this method, in spite of no retraining on the target domain, but our verdict is that more research is needed in order to give a reliable answer as to whether it is feasible to make predictions about domains with no label information by extending knowledge gained from related domains.

Chapter 5

Conclusion

Here we give our conclusions on the research question we set out to answer, "How transferable is knowledge in domains with non-local features?", by summarizing the results and giving our thoughts on possible future work.

5.1 Summary

The layers of CNNs can be thought of as feature extractors, where early layers react to general features, while later layers react to specific features more closely tied to the learning task. In order to speed up the learning process, as well as to gain better final accuracy, one typically does transfer learning from a related task by reusing parts of a source network. To maximize the transfer efficiency, one generally aims to transfer only the first n layers, which are attributed to the general part of the network, while leaving out the specific part.

Yosinski et al. [1] devised a method to evaluate transferability and to quantify the distribution of the learned knowledge within the neural network with respect to co-adaptation and specification of layers, as well as to determine at which n to cut. Their research was done in the domain of image classification, a domain characterized by local features. We wanted to study the effectiveness of this method in a domain characterized by non-local features, and chose chess endgame tablebases as our test case.

We observe a similar distribution of the learned knowledge within our networks as in the case of image classification, with co-adaptation and specification dominating performance loss at approximately the same parts of the network. In order to get precise results with low uncertainty, every measurement had to be averaged over many runs. This was time consuming, and as a compromise we did not allow our runs to converge, but rather capped them at a certain number of epochs. As a consequence, we did not get final converged transfer accuracy results for our transferability measurements. Interestingly, we seemed to be able to boost the final accuracy of a fully pretrained network by throwing out the last layer and retraining the network for 10 epochs, but we would like to do a fully converged study in order to check whether this is a real phenomenon or just an artifact of our experimental setup.
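The layer-reuse scheme summarized above can be sketched as follows. This is a schematic with toy per-layer weight lists, not the experimental code; `init_layer` stands in for whatever random initializer the fresh layers use:

```python
def make_transfer_net(source_layers, n, init_layer):
    """Yosinski-style transfer: keep the first n layers of the source
    network and give every remaining layer a fresh initialization.
    The resulting network is then retrained on the target task."""
    kept = [list(layer) for layer in source_layers[:n]]      # transferred, general part
    fresh = [init_layer() for _ in source_layers[n:]]        # reinitialized, specific part
    return kept + fresh
```

Sweeping n from 1 to the network depth, retraining, and recording the final accuracy at each cut is what produces the transferability curves discussed in Chapter 4.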
To evaluate the full effect of a transfer with respect to no transfer, we did a converged transfer of all layers from the m = 3 piece tablebase to the m = 4 piece tablebase. This resulted in negative transfer: performance was hindered by doing the transfer, with respect to no transfer at all. This might be an indication that we are transferring information that is too specific, so this negative transfer might have been avoided by not including the final layer in the transfer.

We also measured the effectiveness of expanding knowledge from a source domain onto a target domain with no label information. Keeping in mind the possible application of expanding the current tablebases, we measured the expandability within the current tablebase. Our results indicate that while knowledge is transferred, the transfer rate is not high enough to expand the current tablebases.

As an end result, and as an answer to our research question, we conclude that the method's evaluation of transferability in domains other than image classification is promising, and that the information distribution within the CNN is not much affected by the introduction of a different domain with moderate amounts of non-local features.

5.2 Future Work

The main limitation of our measurements was the lack of convergence in our study on transferability. There is a clear need to fully run these measurements in order to get more meaningful results and give a precise answer to our research question. This would tell us whether specification reaches further into the network when working with non-local features than in the case of local ones, as well as allow us to dive further into the observed effect of boosting a network by throwing out the last layer and retraining.

One potential drawback in our transfer learning study is the fact that we were in all cases doing knowledge transfer from a dataset with an odd number of pieces to one with an even number of pieces. This might introduce an imbalance between those datasets. Comparing datasets which both have an even, or both an odd, number of pieces would be optimal in the future, to check whether this has any effect on our measurements. This could help us evaluate whether the negative transfer is an actual effect, and give us a better evaluation of the feasibility of expanding the current tablebases.

As an interesting addition, a state representation could be designed that is more suitable for capturing the non-locality of squares in chess, which could prove highly beneficial and would be a valid direction to take in the future.

Last, we would like to mention a promising addition we would like to implement in future studies: dropout layers. These could lessen the amount of co-adaptation between layers by forcing them to work more independently, something we should be able to verify by utilizing the methods used in this research.

Bibliography

[1] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?", in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.

[2] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position", Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.

[3] A. Dertat. (Nov. 2017). Applied deep learning - part 4: Convolutional neural networks. Accessed: May 5, 2019, [Online]. Available: https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2.

[4] S. J. Pan and Q. Yang, "A survey on transfer learning", IEEE Trans. on Knowl. and Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010, issn: 1041-4347. doi: 10.1109/TKDE.2009.191.

[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks", Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017, issn: 0001-0782. doi: 10.1145/3065386.

[6] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images", Nature, vol. 381, no. 6583, pp. 607–609, 1996, issn: 1476-4687. doi: 10.1038/381607a0.

[7] A. Hyvärinen, J. Hurri, and P. O. Hoyer, Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, 1st edition. Springer Publishing Company, Incorporated, 2009, isbn: 9781848824904.

[8] Stanford Vision Lab, Stanford University. (2016). Imagenet dataset, [Online]. Available: http://www.image-net.org.

[9] R. Bellman, "On the application of dynamic programming to the determination of optimal play in chess and checkers", Proceedings of the National Academy of Sciences of the United States of America, vol. 53, no. 2, p. 244, 1965.

[10] R. de Man. (May 2019). Syzygy endgame tablebases webpage. Accessed: Nov 7, 2018, [Online]. Available: https://syzygy-tables.info.

[11] G.-C. Pascutto and G. Linscott. (Mar. 2019). [Online]. Available: https://lczero.org.

[12] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play", Science, vol. 362, no. 6419, pp. 1140–1144, 2018.

[13] R. Caruana, "Learning many related tasks at the same time with backpropagation", in Proceedings of the 7th International Conference on Neural Information Processing Systems, ser. NIPS'94, Denver, Colorado: MIT Press, 1994, pp. 657–664.

[14] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision", CoRR, vol. abs/1502.02551, 2015. arXiv: 1502.02551.

[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[16] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich, "To transfer or not to transfer", in NIPS 2005 Workshop on Transfer Learning, vol. 898, 2005, pp. 1–4.