Journal of Information Processing Vol.29 347–359 (Apr. 2021)

[DOI: 10.2197/ipsjjip.29.347] Regular Paper

Visualizing and Understanding Policy Networks of Computer Go

Yuanfeng Pang1,a) Takeshi Ito1,b)

Received: July 26, 2020, Accepted: January 12, 2021

Abstract: Deep learning for the game of Go achieved considerable success with the victory of AlphaGo against Ke Jie in May 2017. Thus far, there is no clear understanding of why deep learning performs so well in the game of Go. In this paper, we introduce visualization techniques used in image recognition that provide insights into the function of intermediate layers and the operation of the Go policy network. When used as a diagnostic tool, these visualizations enable us to understand what occurs during the training process of policy networks. Further, we introduce a visualization technique that performs a sensitivity analysis of the classifier output by occluding portions of the input Go board, revealing the parts that are important for predicting the next move. Finally, we attempt to identify important areas through Grad-CAM and combine it with the Go board to provide explanations for next move decisions.

Keywords: computer Go, deep learning, visualization, policy network, grad-CAM

1. Introduction

For the longest time, computer Go has been a significant challenge in artificial intelligence. Go is difficult because of its high branching factors and subtle board situations that are sensitive to small changes. Owing to a combination of these two causes, a massive search with a prohibitive amount of resources, such as Monte Carlo rollouts and Monte Carlo tree search (MCTS), was used. Monte Carlo rollouts sample long sequences of actions at high speed for both players using a simple policy. Averaging over such rollouts provides an effective position evaluation, thus achieving weak amateur level play in Go. MCTS uses Monte Carlo rollouts [1] to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. However, even with cutting-edge hardware, the simulation still cannot achieve the strength required to beat the leading human professional player.

Fortunately, since their introduction by Clark and Storkey [2], convolutional networks have demonstrated excellent performance at computer Go. Several studies have reported that convolutional networks can deliver performance similar to regular MCTS-based approaches. This idea was extended in the bot named Darkforest. Darkforest [3] relies on a deep convolutional neural network (DCNN) designed for long-term predictions. Darkforest substantially improves the win rate of pattern matching approaches against MCTS-based approaches, even with looser search budgets. Next, AlphaGo [4] was introduced, which combines MCTS with a policy and a value network; since 2016, AlphaGo has defeated three human Go champions. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go.

Despite the encouraging fact that deep neural networks are better at recognizing shapes in the game of Go than Monte Carlo rollouts, there is little insight regarding the internal operation and behavior of these complex networks, or how they achieve such good performance. Without a clear understanding of how and why they work, we cannot effectively build on DCNN-based computer Go and the encouraging progress it has made.

Though the Go policy and ImageNet classification networks are both convolutional networks, their inputs and outputs are entirely different. In this paper, we introduce visualization techniques used in image recognition that provide insight into the function of intermediate layers, and we apply them to visualize the operation of the Go policy network. When used as a diagnostic tool, these visualizations allow us to understand what occurs during the training process. Further, we introduce a visualization technique that performs a sensitivity analysis of the classifier output by occluding portions of the input Go board, revealing the parts that are important for predicting the next move. Further, we attempt to identify important areas through Grad-CAM and combine it with the Go board to provide explanations for next move decisions. Zeiler and Fergus [5] reported that, when used in a diagnostic role, these visualization techniques enabled them to identify model architectures that outperformed former models in terms of the ImageNet classification benchmark. However, in the case of training a policy network, the usefulness of this method remains unclear.
Based on the problems indicated by the visualization results, we attempt to change the learning rate of the policy network to improve its performance. Then, by using the visualization techniques, we explore the changed policy network to understand what changed during the training process.

1 The University of Electro-Communications, Chofu, Tokyo 182–8585, Japan
a) [email protected]
b) [email protected]


Our work is divided into three parts: In Section 3, we train a policy network from zero and save the networks at different epochs to explain the training process. In Section 4, we apply the visualization techniques used in image recognition to the Go policy network to aid its interpretation; further, we analyze what occurs during the training process. As these interpretations may prompt ideas for improved networks, in Section 5, we present approaches to improve the performance of Go policy networks based on the visualized experiment results.

This paper makes the following two contributions:
• Most papers on visualization techniques focus only on the visualization results of already trained convolutional neural networks (classic networks such as AlexNet or VGG). Since we trained a new Go policy network and kept the network information from different epochs, we can explain the training process in this work.
• Most visualization techniques that we introduce were previously used only in image captioning or visual question answering (VQA). However, based on our visualization results, similar results can be achieved even with a Go policy network. Under a limited network architecture, visualization techniques can help researchers focus their efforts on creating better policy networks for computer Go.

2. Related Work

2.1 Visualization
Recent work in image recognition has demonstrated the considerable advantages of using deep convolutional networks over alternative architectures. Meanwhile, there has been progress in understanding how these models work. One relatively simple and common approach is to visualize the filters. However, this approach is limited to the first layer, where projections can be made onto pixel space. Because the first layer computes a direct inner product between the weights of the convolution layer and the pixels of the image, we can gauge what these filters are looking for by visualizing their learned weights. Clark and Storkey [2] visualize the weights of some randomly selected channels from randomly selected convolution filters of a five-layer convolutional neural network trained on the GoGoD dataset. They reported that some filters learn to acquire a symmetric property. However, because visualized filters of intermediate layers are not connected directly to the input image, they are less interpretable, and in higher layers, alternate methods must be used. One method is to visualize the activation maps of intermediate layers. In convolutional networks, the filters take the underlying geometry of the input into account. Visualizing activation maps is a simple approach to gain intuition about the type of element in the input that each feature in that layer is searching for. In 2014, Matthew and Rob [5] introduced a novel visualization technique (deconvolution) that provides insight into the function of intermediate feature layers and the operation of the classifier. Their approach provides a nonparametric view of invariance, showing which patterns from the training set activate the feature map. They also conduct an occlusion experiment on an image dataset: a portion of the image is masked before feeding it to the CNN, and a heatmap of the probability is then drawn. Matthew and Rob [5] reported that there is a distinct drop in the activity within the feature map when the parts of the original input image corresponding to the pattern are occluded. When used in a diagnostic role, these visualizations allowed them to identify model architectures that outperform older ones. Chu, Yang, and Tadinada [6] explored residual networks by using visualization and empirical analysis. Further, they presented the purpose of residual skip connections. Selvaraju and Cogswell [7] introduced a new method (Grad-CAM) for combining feature maps by using a gradient signal that does not require any modification of the network architecture.

Visualization techniques for convolutional neural network behavior are applied not only to image recognition tasks but also to other artificial intelligence tasks. For example, Laurens, Elise, and Zeynep analyzed game AI behaviors using visualization techniques [8]. They visualized the evidence on which the agent bases its decision; further, they explained the importance of producing a justification for a black-box decision agent.

Recent studies on CNN visualization have achieved considerable results. The generality of these visualization techniques suggests that they may perform well in other "visual" domains such as computer Go.

2.2 Policy Network for Computer Go
The most successful current programs in Go are based on MCTS with policy and value networks. The strongest programs such as AlphaGo and Darkforest apply convolutional networks to construct a move selection policy, which is then used to bias the exploration when training the value network. AlphaGo achieved a 57% move prediction accuracy using supervised learning based on a database of human professional games [9]. The probability that the move of the expert is within the top-5 predictions of the network is over 87%. Lately, it has become possible to avoid overfitting to the values by using a combined policy and value network architecture and a low weight on the value component. After 72 h, the move prediction accuracy exceeded that of the state of the art reported in previous work, reaching 60.4% on the KGS test set [10]. Meanwhile, some open source DNN implementations such as Leela Zero's networks, which are publicly available and have proven performance, achieved an accuracy of more than 54%.

Unlike the input to an image recognition neural network (images use the RGB color model, which has only three channels), the input to a Go neural network usually comprises a 19 × 19 × 17 image stack of 17 binary feature channels with Go board information; each player's moves are stored as records up to T = −7, and the final feature channel represents the colour to play. Therefore, if we visualize a regular Go policy network, it is more difficult to localize the important area and explain how the network extracts knowledge from the Go board information. In this study, we trained a deep convolutional neural network (DCNN) that predicts the next move when the current board scenario is presented as an input and treats the 19 × 19 board as a 19 × 19 image with 5 channels.


3. Network

Recent studies that use DCNNs for next-move prediction show some improvement over shallow networks based on simple patterns extracted from previous games. In this paper, we train a DCNN that predicts the next move given the current board scenario as an input. Each channel encodes information from a different aspect of the board (e.g., player stones, opponent stones). Compared to previous works, we use a simpler feature set and a shallower convolutional neural network.

Table 1 Features used as inputs to the CNN.

3.1 Data
The dataset used in this work was obtained from the Games of Go on Disk (GoGoD) [11]. It consists of sequences of board positions for complete games played between professional human players. The data are saved in SGF format. A move is encoded as an indicator (1 of 361) for each position on the 19 × 19 board. The board state information includes the positions of all stones on the 19 × 19 board, and the record allows one to determine the sequence of moves. We collected 17.6 million board-state next-move pairs corresponding to 86,329 games and used 100,000 next-move pairs as the test dataset.

The features used in our method are obtained directly from the raw representation of the game rules. The feature planes are listed in Table 1. There are three differences between our features and those used in the policy network of AlphaGo. First, considering future visualization work, we omitted the feature that represents the color to play and added an empty position plane to the stone color features so that the first three features present the board. Further, a feature with three planes can be easily visualized using an RGB color model. Second, to accelerate the training process, we reduced the last move feature planes to two (only the last two moves are considered). Third, we encode the stone color features as player stones and opponent stones (rather than as black stones and white stones), and we omit the feature that records whether the turn is black or white.
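To make the feature set concrete, the following is a minimal sketch of how the five input planes of Table 1 could be encoded. The function name, the board convention (1 = player to move, −1 = opponent, 0 = empty), and the handling of the two last-move planes are our own illustrative assumptions, not code from the paper:

```python
import numpy as np

def encode_board(board, last_moves):
    """Encode a Go position as a 5-plane 19 x 19 input (cf. Table 1).

    board      -- 19x19 int array: 1 = stone of the player to move,
                  -1 = opponent stone, 0 = empty (our own convention)
    last_moves -- ((row, col) of the most recent move, (row, col) of the
                  move before it); entries may be None early in the game
    """
    planes = np.zeros((5, 19, 19), dtype=np.float32)
    planes[0] = (board == 1)     # plane 1: player stones
    planes[1] = (board == -1)    # plane 2: opponent stones
    planes[2] = (board == 0)     # plane 3: empty positions
    # Planes 4 and 5: one-hot markers of the last two moves.
    for plane, move in zip((3, 4), last_moves):
        if move is not None:
            planes[plane][move] = 1.0
    return planes
```

Note that planes 1–3 are one-hot: at every point exactly one of them is 1, which is what makes the RGB visualization in Section 4.1 possible.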

3.2 Network Architecture
To accelerate the training process, we use a lite policy network. Each convolution layer is followed by batch normalization. All layers use the same width (w = 64) with no pooling. We do not use a pooling layer because it reduces the input information and negatively affects performance. We use only one SoftMax layer to predict the next move. Figure 1 shows the architecture of our model. Differences between our lite policy network and the AlphaGo Zero network are listed in Table 2. Neural network parameters are optimized using stochastic gradient descent with momentum and learning rate annealing. After feeding the current position into the network, the move with the maximum probability in the SoftMax output is selected.

Fig. 1 Our network structure (d = CNN × 2 and residual block × 10, and w = 64). The input is the current board situation. The output is the predicted next move.

Table 2 Differences between the two networks.
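The description above leaves some details open (kernel sizes, the exact output head), so the following PyTorch sketch should be read as one plausible instantiation of Fig. 1 rather than the authors' exact code; the 3 × 3 kernels follow the filter size mentioned in Section 4.1, and the 1 × 1 convolution head producing 361 logits is our assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        self.conv1 = nn.Conv2d(width, width, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(width)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)                      # skip connection

class LitePolicyNet(nn.Module):
    """Two conv layers + 10 residual blocks, width 64, no pooling."""
    def __init__(self, in_planes=5, width=64, blocks=10):
        super().__init__()
        self.conv1 = nn.Conv2d(in_planes, width, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        self.blocks = nn.Sequential(*[ResidualBlock(width)
                                      for _ in range(blocks)])
        self.head = nn.Conv2d(width, 1, 1)        # one logit per board point

    def forward(self, x):                         # x: (N, 5, 19, 19)
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.blocks(x)
        logits = self.head(x).flatten(1)          # (N, 361)
        return F.log_softmax(logits, dim=1)
```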

3.3 Training Results
To determine the performance of our network, we measured the prediction accuracy and the loss on the training set and test set during 10 training epochs (1 epoch = 137,500 training steps). Figure 2 shows the evolution of the accuracy and the loss on the training and test sets.

Fig. 2 Evolution of the accuracy (probability that the network predicts the next correct move) and loss (calculated by the log likelihood cost function) on the training set (blue line) and test set (red line) during training (1 epoch = 137,500 training steps). To make the figure clearer, we smoothed the curve with smoothing = 0.8.
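The top-n statistic reported below in Fig. 3 can be computed directly from the SoftMax output; a minimal sketch (the function name is ours):

```python
import torch

def top_n_accuracy(log_probs, targets, n=9):
    """Fraction of positions whose played move is among the n most
    probable moves of the network (cf. Fig. 3).

    log_probs -- (batch, 361) output of the policy network
    targets   -- (batch,) index 0..360 of the move actually played
    """
    top_moves = log_probs.topk(n, dim=1).indices           # (batch, n)
    hits = (top_moves == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```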


Fig. 3 Probability that the next correct move is within the top-n predictions of the network.

Figure 3 shows the probability that the next correct move is within the top-n predictions of the network. We find that the top-n performance of the DCNN can be very high: the network can predict the correct next move 80% of the time when n = 9. We believe that the accuracy of our network is sufficient to conduct the occlusion sensitivity experiment.

4. Visualization

To understand our policy network, we applied the five main visualization approaches used in image recognition. All our visualizations focus on understanding how the model changes over the process of training.
1) The first approach is to plot the filters of a trained model so that we can understand the behavior of the filters of the first layer.
2) Filters of intermediate layers are not directly connected to the input image. Therefore, for higher layers, alternate methods such as visualizing the activation maps must be used. We can apply the filters over an input image and then plot the activation map. This allows us to identify input patterns that activate a particular portion of a filter.
3) Visualization with a transposed convnet (deconvnet) can help interpret feature activity in intermediate layers. This method maps these activities back to the input pixel space, thereby indicating the input pattern that originally caused the target activation in the feature maps.
4) In a Go next-move problem, a natural question is whether the model truly identifies the stones on the Go board, or whether it only uses the surrounding patterns. Occlusion-based methods attempt to answer this question by systematically occluding different portions of the input board with an occluding square, and then monitoring the output of the classifier.
5) The last approach considered is Grad-CAM. This approach allows us to determine how the output category value changes with respect to a small change in the input Go board. Positive values in the gradients indicate the parts of the board that will increase the predicted next move value. Hence, visualizing these gradients, which have the same shape as the Go board, can help provide information on the attention mechanism.

Fig. 4 Evolution of the first ten filter weights of the first layer through training. Filters of the first three features (stone color) and of the fourth (last move of the player) and fifth (last move of the opponent) features are displayed in separate blocks. Within each block, we show the visualized weights at epochs 1, 5, and 10. We use the RGB color model to visualize the filters of the first three features and gray scale to visualize the filters of the fourth and fifth features.

4.1 Visualizing the Filter Weights
The visualized filter weights are shown in Fig. 4. These visualizations obtained through training allow us to identify the evolution process of the model. The following points can be observed in Fig. 4:
(i) If the Go board is reflected, the output of the classifier should be reflected in the same manner. Therefore, Clark and Storkey believed that each filter learns to acquire a symmetric property. A smaller filter (3 × 3) does not learn to acquire a symmetric property, unlike the larger filters used previously (7 × 7, Clark and Storkey [2]). We therefore believe that when small filters are used, the receptive field is even smaller, which allows more local patterns to be extracted. Therefore, the symmetric patterns extracted by larger filters are decomposed in our policy network, and the symmetric property no longer needs to be acquired.
(ii) The first three features represent the player stone, opponent stone, and empty position using one-hot encoding. This means that, at each position, exactly one of the three planes takes the value 1. The red, green, and blue visualized weights detect the existence of a player stone, an opponent stone, or an empty position, respectively. The visualization shows that only a few weights have a clear assignment, and most filters do not learn to acquire a clear recognition of the Go board as normally acquired by a human player.
(iii) Through training, the weights of the filters of the stone color features (planes 1, 2, and 3) change less significantly than the weights of the filters of the last move features. After 5 epochs, unlike the filters of the stone color features, the filters of the last move features continue trying to converge to optimal weights (e.g., filter block 2, row 3, column 6 and block 3, row 3, column 10). To evaluate the changes in the different filter planes among epochs, we calculate the average absolute differences of all filters among epochs (Table 3). The convergence speed of planes 4 and 5 (the filters of the last move features) is lower than that of planes 1, 2, and 3 (the filters of stone color).

While the visualization of a trained model provides insight into its operation, it can also assist us with selecting good architectures. For example, because planes 4 and 5 (the filters of the last move features) changed very significantly during training, we can attempt to extend the training process or decrease the learning rate of the first layer. Thus, using a low learning rate is a good approach to ensure that the training process does not miss any local minima.


Table 3 Average absolute differences of different filter planes.

4.2 Visualizing the Activations
The second visualization method is visualizing the activations: plotting the activation values of the neurons in each layer of a convnet in response to a specific board. In convolutional networks, filters are applied considering the underlying geometry of the input in lower layers, and for each channel, the activations are arranged spatially. Figures 5 and 6 show examples of this type of visualization. Except for the last layer, which has a size of 2 × 19 × 19 and which we depict as 2 separate 19 × 19 grayscale images, all conv layers have a size of 64 × 19 × 19. We depict each of them as 64 separate 19 × 19 grayscale images. Each of the 64 small images contains activations in the same (1–19)–(A–T) spatial layout as the input board data. As shown in Fig. 5, we tiled the 64 images into an 8 × 8 grid in row-major order.

Visualization of the Activation Maps of Different Layers: Fig. 5 shows the visualization of the activation maps of different layers.

Fig. 5 Visualization of the activation maps from different layers in our trained model. We analyze one move (move 10) in the game played by AlphaGo Zero (white) and Ke Jie (black). Seeing the activation of the first layer (a), it is apparent that many of the lower layers encode information about the original board and partially recognize the board edge (though the board edge is not one of the input features). Since they are not directly connected to the input image, the activation maps of intermediate layers are less interpretable, such as the activations from the fifth resnet block (c) and from the last (tenth) resnet block (d); however, (c) still has some properties different from (d). The feature maps in activation maps (c), compared with the feature maps in the deeper activation maps (d), have more blank space in the middle area of the board, which is also blank on the original board.

Visualization of Successive Moves on the Activation Map: Fig. 6 shows the visualization of the activation maps in our trained model when we feed successive moves into the policy network. As stated previously (shown in Fig. 7), the first three features on the top present the board, and the fourth and fifth features at the bottom present the last moves of the player and the opponent. Compared with activation map 2 shown at the bottom, activation map 1 on the top represents more information from the board. As shown in Fig. 6, activation map 1 seems to achieve a more local solution compared to activation map 2. Furthermore, the bottom right corner of activation map 1 and the original Go board appear to be strongly linked. "Locality" and "linked to original input" are two phenomena that appear when visualizing an image recognition neural network. We assume that the difference in inputs is the reason why these two phenomena did not appear in activation map 2.
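A forward hook is one straightforward way to capture and tile a layer's 64 activation maps as in Fig. 5; this sketch assumes the hypothetical LitePolicyNet layout from Section 3.2 and is illustrative only:

```python
import numpy as np
import torch

def activation_grid(model, x, layer):
    """Tile the 64 activation maps of one layer into an 8 x 8 grid
    (row-major), as in Fig. 5.

    model -- a LitePolicyNet-like module
    x     -- (1, 5, 19, 19) encoded position
    layer -- submodule to inspect, e.g. model.blocks[4]
    """
    captured = {}
    hook = layer.register_forward_hook(
        lambda mod, inp, out: captured.setdefault("a", out.detach()))
    with torch.no_grad():
        model(x)
    hook.remove()
    maps = captured["a"][0].cpu().numpy()              # (64, 19, 19)
    rows = [np.concatenate(list(maps[r*8:(r+1)*8]), axis=1)
            for r in range(8)]
    return np.concatenate(rows, axis=0)                # 152 x 152 image
```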

4.3 Visualization with a Transposed Convnet
Visualization with a transposed convnet (deconvnet) can interpret the feature activity in intermediate layers. This method maps these activities back to the input pixel space, thereby showing what input pattern originally caused a given activation in the feature maps. It is performed using a transposed convnet (deconvnet [5]) that can be considered as a convnet model using the same components but in reverse. Figure 8 shows an original image and the reconstructed version from the last maxpool layer of AlexNet. We used rectification and flipped filters (each filter flipped vertically and horizontally) to reconstruct the activity in the layer beneath the one that caused the selected activation, similar to Ref. [5]. Moreover, we used an upresidual layer to invert the residual layer. This reconstruction is repeated until the input space is reached.

Upresidual: Unlike the image convnet used by Zeiler and Fergus [5], we did not use a pooling layer in our policy network, and therefore, we did not use an unpooling layer in our transposed convnet. However, we used an upresidual layer to invert the residual layer. We obtain a set of variables by recording the gradients added to the layer output; in the transposed convnet (deconvnet [5]), the upresidual operation removes these variables from the gradients.
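For a network without pooling, the deconvnet backward step for one convolution reduces to rectification followed by a transposed convolution, which applies the vertically and horizontally flipped filters. A minimal PyTorch sketch of that single step (the upresidual bookkeeping described above is omitted, and the function name is ours):

```python
import torch.nn.functional as F

def invert_conv(activation, conv):
    """Project a feature map back one conv layer in the spirit of the
    deconvnet of Ref. [5]: rectify, then apply flipped filters.

    activation -- (1, C_out, 19, 19) feature map to project back
    conv       -- the nn.Conv2d (3x3, padding 1) that produced it
    """
    rectified = F.relu(activation)        # keep positive evidence only
    # conv_transpose2d with the same weights is equivalent to convolving
    # with each filter flipped vertically and horizontally.
    return F.conv_transpose2d(rectified, conv.weight, padding=1)
```

Chaining such steps (plus the upresidual correction at each skip connection) back to the input yields the reconstructed 5-channel inputs, of which Figs. 9 and 10 show the first three channels.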
Unlike the input (RGB model) of an image convnet, the input of the Go policy network has more than 3 channels. Therefore, it is impossible to present all reconstructed inputs at one time. In Figs. 9 and 10, we visualized only the first three channels of the reconstructed input, which are the stone color features that provide the board information.

In Figs. 9 and 10, the reconstructions started from the last conv layer (size 2 × 19 × 19) are depicted as 2 separate 19 × 19 RGB images. Each small image contains reconstructed inputs in the same (1–19)–(A–T) spatial layout as the input board data. In Fig. 10, the first conv layer has a size of 64 × 19 × 19, and we depict this layer as 64 separate 19 × 19 RGB images, tiled into an 8 × 8 grid in row-major order.

Reconstructed Input Evolution During Training: Fig. 9 shows the feature visualizations from our model during training.


Fig. 6 Visualization of the activation maps in our trained model. We analyze four consecutive correctly predicted moves (move 17 to move 20) in the game played by AlphaGo Zero (white) and Ke Jie (black), which was won by AlphaGo Zero. These visualizations are not samples from the model but grayscale maps of the given features for a specific input move. In column (b), we visualized the board ownership calculated by the Monte Carlo tree search (MCTS). In columns (c) and (d), we visualized the convnet activations of the last layer features. Although most of the lower computation is robust to small changes, the last layer is more sensitive. Because we encode the stone color features as player stone and opponent stone, the inputs from neighbouring moves are nearly opposite, and the inputs from separated moves are similar. In activation map 1, although the inputs (e.g., move 17 and move 18, move 19 and move 20) fed to the network are nearly opposite, activation map 1 remains similar except at the lower right, where the furikawari (exchange) occurred. Moreover, in activation map 2, inputs from separated moves (move 17 and move 19, move 18 and move 20) are similar.

Reconstructed Input Visualization: Fig. 10 shows the feature visualizations from our model when predicting the same next move.

We found that the image structure of the reconstructed input visualization of the last layer corresponds to that of the input feature map, except for the exaggeration of discriminative parts near the predicted next move, e.g., the area near the next move D-12 in Fig. 9. In Fig. 8, the same effect occurs when we visualize the features of an image recognition network; however, the discriminative parts of the image are more vivid, e.g., the eyes of the wolf in Fig. 8 (c).


Fig. 7 Relationship between input feature planes and activation maps in Fig. 6.

Fig. 8 (a) Original image, (b) reconstructed version from the last maxpool layer of AlexNet generated using the transposed convnet, (c) enlarged image of (b).

Fig. 9 Visualization of the reconstructed input in our trained model at epochs 1, 5, and 10. (b), (c), and (d) show that the color contrast is enhanced around the next move position (D-12) during training.

Further, we found mosaic effects in both discriminative parts. Since the eyes of the wolf are an entire object in image recognition, we believe the discriminative part of the Go policy network is related to a standard pattern extracted by computer Go. In Fig. 10, column (b), we visualized the reconstructed input started from the first layer. Each of the reconstructed inputs of the first layer corresponds to the original board in different colors. Each of these inputs has different directions, which are similar to the textures of the image convnet. In columns (c) and (d), we visualized the reconstructed input started from the last layer. Though these boards from three games are different, when they are fed into the policy network, we achieve the same output (predicted next move at position R-2). Thus, it is clear that each feature does not have a strong grouping; however, there is a clear invariance to input deformations. By focusing on color brightness, we found that some parts of the reconstructed input are more discriminative (e.g., the lower right corner of the image in rows 1, 3, and 4, and the area near C-13 and the bottom corner of the image in row 2). The Go situation in the discriminative area is more fragile. Compared with other areas that show a stable shape of Go stones, the discriminative area is more likely to show a vital point (an important shape point for both players, which is usually urgent) on the Go board. Because the transposed convnet was developed to visualize the patterns in the image that activate the feature map in image recognition neural networks, it is possible that vital points on the board, as interpretable patterns in Go games, activate the feature maps of a policy network.

4.4 Occlusion Experiment
When using a DCNN, it is unclear whether the model can truly identify a certain area on the board like an image classifier does, or whether it only focuses on the surrounding area. We attempt to answer this question by occluding different portions of the input board in a systematic manner with an empty square, and then recalculating the features of the occluded board as a new input to the network. We then monitor the rank of the correct move among the top-n confident predictions. As shown in Fig. 11, we can visualize the rank of the correct move using gradation squares.
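The occlusion scan can be written as a loop over board positions: empty a small square, re-encode the features, and record where the true move now ranks. A sketch reusing the hypothetical encode_board() from Section 3.1:

```python
import torch

def occlusion_rank_map(model, board, last_moves, target, size=1):
    """Rank of the true next move after emptying each size x size patch
    (Section 4.4). Returns a 19 x 19 map; rank 0 means top-1.

    board, last_moves -- raw position, re-encoded per patch
    target            -- index 0..360 of the correct next move
    """
    ranks = torch.zeros(19, 19, dtype=torch.long)
    for r in range(19):
        for c in range(19):
            occluded = board.copy()
            occluded[max(0, r - size // 2):r + size // 2 + 1,
                     max(0, c - size // 2):c + size // 2 + 1] = 0
            x = torch.from_numpy(encode_board(occluded, last_moves))[None]
            with torch.no_grad():
                log_probs = model(x)[0]
            order = log_probs.argsort(descending=True)
            ranks[r, c] = (order == target).nonzero()[0, 0]
    return ranks
```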


Fig. 10 Visualization of the reconstructed input in our trained model. We analyze four moves (move 13 of match 1, move 90 of match 2, and move 174 of match 3) in the games played by AlphaGo Zero.

Fig. 11 Gradations representing the rank of the correct move in the top-n confident predictions.

In Go games, board situations are sensitive to small changes; however, we found that in most cases a single stone hardly has a significant effect on the rank of the correct move in the top-n confident predictions. We believe this is because removing even a single stone can change a regular group of stones into a strange group. Because of the lack of training, the features of such a group cannot activate the deeper layers of our network, and the changed area is ignored.

Figure 12 (a) shows an example of a board where we systematically cover up different portions of the board with an empty square and observe how the predictions of the network change. Figure 12 (b) shows the rank of the correct move in the top-n confident predictions as a function of removing one stone from the position (occlusion area size 1 × 1). When a specific area (e.g., white stones H12 and K12, black stones G13 and M12) of the board is obscured, the rank of the correct next move "K14" drops significantly.


Fig. 13 Current board (last move 24–J5, next move 25–H5) and maps showing the rank of the correct move in the top-n confident predictions as a function of removing stones from the position.

Fig. 12 Maps showing the rank of the correct move in the top-n confident predictions as a function of removing stones from the position.

Fig. 14 Current board (last move 21–D9, next move 22–H8) and maps showing the rank of the correct move in the top-n confident predictions as a function of removing stones from the position.

Table 4 Changed position rate for different occlusion area sizes.

We evaluated the effect of the occlusion area size on our network. The results are shown in Fig. 12 (c) and (d). These results suggest that specific stones (e.g., white stones H12 and K12, black stones G13 and M12) or specific areas have a strong effect on move prediction, and when more than one of these stones or areas is occluded, the DCNN is unable to output the correct next move. The example clearly shows that the model localizes the pattern within the board when the upper-center of the board is obscured. In Fig. 12 (b, c, d), the visualized area (positions whose prediction rank changed) increases with the occlusion area size. Therefore, in Table 4, we randomly pick 100 moves from the test set and explore the changed position rate for different occlusion area sizes. Because larger occlusion areas have a higher probability of blocking critical stones, a larger occlusion area leads to an increase in changed positions.

Fig. 15 Current board (last move 69–L10, next move 70–G14) and maps showing the rank of the correct move in the top-n confident predictions as a function of removing stones from the position.

In Figs. 13, 14, and 15, we set the occlusion area size to 3 × 3, and we analyze three moves in the game played by AlphaGo (white) and Lee Sedol (black), which was won by AlphaGo. The DCNN can implicitly understand many sophisticated concepts of Go because it is activated by specific areas. As shown in Fig. 13, treating the one-space jump of black (H6 and H4) with the peep (J5) allows the black stones to be connected with H5. The figure shows that removing the one-space-jump good shape (stones H6 and H4) affects the next move prediction more than when only H6 or H4 is removed. In Fig. 14, after pushing once on the left, AlphaGo capped black on the lower side and stopped the two stones (H6 and H4) from running out to the center. Together with the white stone L4, this makes it difficult for black to make a connection between H4 and O3. The figure shows that the left area and the stones H6, H4, and L4 have a strong effect on the next move prediction. As shown in Fig. 15, for black, move L10 was not ideal but necessary, which gave white sente and an opportunity to return to the left side. We set the recent move as one of the input features; however, once it gets sente (here, the white player), the network can turn to the left side, where a move is very necessary.

4.5 Grad-CAM
Grad-CAM, developed by Ramprasaath et al. [7], uses the gradient information flowing into the last convolutional layer of the CNN to assign importance values to each neuron for a particular decision of interest. To obtain the class-discriminative localization map, Grad-CAM computes the gradient of $y^c$ (the score for class c) with respect to the feature maps $A$ of a convolutional layer. These gradients flowing back are global-average-pooled to obtain the importance weights $\alpha_k^c$, given by the following equation:


Table 5 Average activated rate of the center area and the surrounding area.

Fig. 16 (cited from Fig. 1 in Ref. [7]): Results in a coarse heat-map. Grad-CAM is a technique for making CNN-based models more transparent by visualizing the regions of the input (e.g., cat body and dog head) that are "important" for predictions. This clearly shows that Grad-CAM can localize regions to predict a particular answer in the same way as a human.

$$\alpha_k^c = \underbrace{\frac{1}{Z}\sum_i\sum_j}_{\text{global average pooling}} \underbrace{\frac{\partial y^c}{\partial A_{ij}^k}}_{\text{gradients via backprop}} \qquad (1)$$

Ramprasaath et al. also developed a counterfactual explanation by negating the gradient of $y^c$ (the score for class c) with respect to the feature maps $A$ of the convolutional layer. Thus, in this instance, function (1) becomes function (2). Using function (2), we can localize negatively important areas whose absence leads to an increase in the score for the correct class (in this paper, the correct move):

$$\alpha_k^c = -\frac{1}{Z}\sum_i\sum_j \underbrace{\frac{\partial y^c}{\partial A_{ij}^k}}_{\text{negative gradients}} \qquad (2)$$

Similar to CAM, a Grad-CAM heat-map is a weighted combination of feature maps, but it is followed by a ReLU (Rectified Linear Unit), as in (3). Figure 16 shows Grad-CAM heat-maps derived using (3):

$$L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\underbrace{\Bigl(\sum_k \alpha_k^c A^k\Bigr)}_{\text{linear combination}} \qquad (3)$$

Grad-CAM Visualization: Fig. 17 shows the Grad-CAM results (positively important areas) and counterfactual explanations (negatively important areas) from our model (epoch 5 version and epoch 10 version).
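Equations (1)–(3) translate into a few lines of autograd code. The following sketch hooks the last residual block of the hypothetical LitePolicyNet from Section 3.2 (the paper does not specify its exact implementation, so the names and the hooked layer are our assumptions); setting counterfactual=True negates the gradients as in Eq. (2):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, move_index, counterfactual=False):
    """Grad-CAM map of Eqs. (1)-(3) over a late conv layer's features.

    x          -- (1, 5, 19, 19) encoded position
    move_index -- class c, the move whose score y^c is explained
    """
    feats = {}
    layer = model.blocks[-1]                       # last residual block
    hook = layer.register_forward_hook(
        lambda m, i, o: feats.setdefault("A", o))
    log_probs = model(x)
    hook.remove()
    A = feats["A"]                                 # (1, 64, 19, 19)
    score = log_probs[0, move_index]               # y^c
    grads = torch.autograd.grad(score, A)[0]       # dy^c / dA^k
    if counterfactual:
        grads = -grads                             # Eq. (2): negate gradients
    alpha = grads.mean(dim=(2, 3), keepdim=True)   # Eq. (1): global avg pool
    cam = F.relu((alpha * A).sum(dim=1))           # Eq. (3): weighted sum+ReLU
    return cam[0].detach()                         # 19 x 19 heat-map
```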
ing the training, and therefore, we can attempt to decrease the As shown in Fig. 17 col (c), Grad-CAM of the Go policy net- learning rate of the first layer. work does not localize the same regions like a Go player as op- To test this idea, we decrease the learning rate of the first layer posed to the Grad-CAM of an image recognition network which size from 10−2 to 10−4, and we trained a new policy network. The is good at localizing regions for predicting a particular answer data and training parameters are the same as that used in Sec- (image class). The localized regions can be sparse (col (c) move tion 3. 10 and move 14) or dense (col (c) move 12 and move 13). Most As shown in Fig. 18, when using the new learning rate, the Grad-CAM results show that the Go policy network ignores the policy network can converge to better local minima than that us- center area of the board, which is very common in a regular Go ing by the original learning rate. Although the lower learning game. Further, we randomly pick 100 moves from the test set rate converged slower in the first 2 epochs, lower learning rate and use Grad-CAM to visualize them. To analyze the difference improved the accuracy performance of the policy network from between the center area (a 7 × 7 area on the Go board, circled by 46.24% to 46.51% at the end of epoch 10. To explore the re- G13, N13, G7, and N7) and the surrounding area, we calculate lationship between learning rates and the accuracy and average the average activated rate (marked as positively important area by absolute differences of the first layer. We train two other net- Grad-CAM) of the center area and the surrounding area (Table 5). works using the same conditions expect the learning rate (10−3


Fig. 17 From rows 1 to 5, we analyze five consecutive correctly predicted moves (move 10 to move 14) in the game played by AlphaGo (white) and Lee Sedol (black). Column (a): Grad-CAM using the epoch 5 version network; column (b): counterfactual explanations using the epoch 5 version network; column (c): Grad-CAM using the epoch 10 version network; column (d): counterfactual explanations using the epoch 10 version network.

In Fig. 19, the average absolute differences of planes 4 and 5 decrease and approach a flat line. The average absolute difference of planes 1, 2, and 3 (stone color features) remains at a comparatively low level. Compared with planes 1, 2, and 3 (stone color features), it is clear that planes 4 and 5 (last move features) are more sensitive to the learning rate. In Fig. 20, the accuracy increases when a learning rate higher than 10⁻⁴ is used and decreases when a lower learning rate is used.

To observe the benefits of lowering the learning rate of the first layer, we briefly examined the middle layer of the original Go policy network (learning rate 10⁻²) and the newly trained policy network (learning rate 10⁻⁴).


Fig. 18 Evolution of the accuracy and loss on the test set during training (1 epoch = 137,500 training steps) for the original policy network (blue line) and the newly trained policy network (red line). To make the figure clearer, we smoothed the curve with smoothing = 0.8.

Fig. 21 Reconstructed input of the middle layer. (a) and (c) are the reconstructed inputs of move 6 from a random Go game. (b) and (d) are the reconstructed inputs of move 44 from a random Go game.

Table 6 Average and standard deviation of the reconstructed input of the middle layer.

Fig. 19 Average absolute differences (epoch 1–epoch 10) of the different planes for different learning rates.

Fig. 20 Accuracy of the policy network for different learning rates.

By using the transposed convnet, we reconstructed the input of the middle layer to visualize the change that occurred during the new training process. Filters from the new policy network have a greater variety in the middle layers, as shown in Fig. 21 (c) and (d). Further, we randomly picked 100 moves from the test set and used the transposed convnet to reconstruct the input of the middle layer. Table 6 summarizes the average standard deviation for the original network and the newly trained network. A high standard deviation indicates that the reconstructed input is spread out over a wider range. This implies that the feature maps can be activated by several different types of input patterns, which indicates that the newly trained middle layer can extract more information from the Go board than the original middle layer.

Although the original policy network and the newly trained policy network have nearly the same accuracy, their visualization experiment results show different properties, which we could not correctly identify without performing visualization experiments.

6. Conclusion and Future Work

In this paper, we examined our model through visualization experiments. While training to predict the next move, the DCNN of Go is highly sensitive to local patterns on the board and does not rely only on the board information. During training, some special points receive attention.

First, through training, we found that the weights of the filters of the stone color features changed less significantly than the weights of the filters of the last move features. Unlike the filters of the stone color features, it is more difficult for the filters of the last move features to converge to optimal weights. Second, we found that the image structure of the reconstructed input visualization of the last layer corresponds to the input feature map, apart from the exaggeration of the discriminative parts near the predicted next move. We can see that the color contrast around the next move position is enhanced during training. Third, most Grad-CAM results show that the Go policy network ignores the center area of the board, which is very common in a regular Go game. Moreover, during the training process, the localized regions become sparser, which indicates that the network can more accurately decide the next move by focusing on a smaller area.

Finally, the results of the experiment in Section 5 show how to improve the performance of neural networks. Under a limited network architecture, it is possible to help researchers create better policy networks for computer Go.

Further, the experimental results indicated that the variety of the middle layer changed in the newly trained policy network.

A considerable amount of research can be performed to extend this work. We limited the size of our networks to obtain policy networks at different epochs and ensure a manageable training time. However, the latest policy networks have better performance and more complex architectures. In this study, we only completed a preliminary exploration of the policy networks. In the future, with a partial understanding of how and why deep neural networks work, we hope to develop better models instead of using trial-and-error approaches. We plan to use the visualization methods as a diagnostic tool to identify problems that occur during training and further improve the performance of the policy network. In addition, we speculate that it would be possible to find a method to identify important neurons of policy networks through visualization techniques and to provide an approach to obtain more reasonable explanations for next move decisions.

Acknowledgments We would like to thank Mr. Fukashi Murakami, a qualified Go instructor, for his expert knowledge of Go used in the experiments. This work was supported by JSPS KAKENHI Grant Number 18H03347.

References
[1] Schadd, M.P., Winands, M.H., Tak, M.J. and Uiterwijk, J.W.: Single-player Monte-Carlo Tree Search for SameGame, Knowledge-Based Systems, Vol.34, pp.3–11 (2012).
[2] Clark, C. and Storkey, A.: Training Deep Convolutional Neural Networks to Play Go, Proc. 32nd International Conference on Machine Learning (ICML-15), pp.1766–1774 (2015).
[3] Tian, Y. and Zhu, Y.: Better Computer Go Player with Neural Network and Long-Term Prediction, arXiv preprint arXiv:1511.06410 (2015).
[4] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S.: Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature, Vol.529, No.7587, pp.484–489 (2016).
[5] Zeiler, M.D. and Fergus, R.: Visualizing and Understanding Convolutional Networks, European Conference on Computer Vision, pp.818–833, Springer (2014).
[6] Chu, B., Yang, D. and Tadinada, R.: Visualizing Residual Networks, arXiv preprint arXiv:1701.02362 (2017).
[7] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. and Batra, D.: Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Proc. IEEE International Conference on Computer Vision, pp.618–626 (2017).
[8] Weitkamp, L., van der Pol, E. and Akata, Z.: Visual Rationalizations in Deep Reinforcement Learning for Atari Games, Benelux Conference on Artificial Intelligence, Springer (2018).
[9] Maddison, C.J., Huang, A., Sutskever, I. and Silver, D.: Move Evaluation in Go using Deep Convolutional Neural Networks, arXiv preprint arXiv:1412.6564 (2014).
[10] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. and Chen, Y.: Mastering the Game of Go Without Human Knowledge, Nature, Vol.550, No.7676, pp.354–359 (2017).
[11] GoGoD (2019), available from https://gogodonline.co.uk.

Yuanfeng Pang received his M.S. degree from the University of Electro-Communications, Tokyo. He is currently pursuing his Ph.D. degree in the Graduate School of Informatics and Engineering, the University of Electro-Communications. His research interests include game AI and machine learning.

Takeshi Ito received his Doctor of Engineering degree in Informatics from Nagoya University, Japan in 1994. He is an associate professor in the Graduate School of Informatics and Engineering, the University of Electro-Communications. His research interests include human cognitive processes and learning processes in playing thinking games or solving difficult problems.
