Learning Position Evaluation Functions Used in Monte Carlo Softmax Search

HARUKAZU IGARASHI†1 YUICHI MORIOKA KAZUMASA YAMAMOTO

Abstract: This paper makes two proposals for Monte Carlo Softmax Search (MCSS), a recently proposed method that is classified as a selective search like Monte Carlo Tree Search. The first proposal separately defines the node-selection and backup policies, which allows researchers to freely design a node-selection policy based on their searching strategies and makes the principal variation produced by Monte Carlo Softmax Search conform to that produced by a minimax search. The second proposal modifies commonly used learning methods for positional evaluation functions. In our new proposals, evaluation functions are learned by Monte Carlo sampling, which is performed with the backup policy in the search tree produced by Monte Carlo Softmax Search. The learning methods under consideration include supervised learning, reinforcement learning, regression learning, and search bootstrapping. Our sampling-based learning uses not only current positions and principal variations but also the internal nodes and important variations of a search tree, which reduces the number of games necessary for learning. New learning rules are derived for sampling-based learning based on Monte Carlo Softmax Search, and combinations of the modified learning methods are also proposed in this paper.

Keywords: Computer shogi, Selective search, Monte Carlo Tree Search, Softmax search, Boltzmann distribution

1. Introduction

Recently, Monte Carlo Tree Search (MCTS) [1], which is a type of selective search, has been applied to computer shogi (Japanese chess). Based on its wide success, it has become one of the leading methods applied to computer go, and AlphaZero, which is based on MCTS, scored fairly well against the strongest programs based on traditional search algorithms even in chess and shogi [2].

MCTS's typical search method is the UCT algorithm (Upper Confidence Bound 1 Applied to Trees) [3], which deterministically selects the child node with the largest upper confidence bound (UCB) among its brother nodes. The Monte Carlo Softmax Search (MCSS) is another proposed selective search scheme [4]. Unlike MCTS, MCSS uses continuous functions such as Boltzmann distribution functions for node selection and backup calculation. That makes it possible to calculate the gradient vectors of the positional evaluation functions with respect to the parameters included in those functions, which introduces a new dimension that is not expected from MCTS.

This paper makes two proposals on MCSS: (1) separation between a node-selection policy and a backup policy, and (2) new sampling-based learning algorithms performed in a search tree produced by MCSS. In MCSS, the policy for selecting a child node while searching used the same Boltzmann distribution function as the backup policy that propagates the values of the leaf nodes toward the upper nodes [4]. This caused a problem: the optimal sequence of moves estimated by MCSS with a finite temperature parameter did not always conform to the principal variation (PV) defined by minimax search. We separately define the two policies and use a backup policy with a low temperature to find the PV. A node-selection policy can then be freely designed to control the depth and width of a search tree based on the system designer's strategies.

Second, we show that positional evaluation functions can be learned by Monte Carlo sampling with backup policies in a search tree produced by MCSS. Many methods have been proposed for learning the evaluation functions of chess, go, and shogi. Most works are based on supervised learning, reinforcement learning, regression learning, and search bootstrapping. These learning methods (except bootstrapping) exploit only the current positions and the principal leaves after the legal moves at the current positions. A current position refers to a state that actually appears in the games used for learning, and a principal leaf denotes a leaf node of a PV. This paper pays attention not only to the PV but also to the leaves of the important variations around it, which means that the feature parameters highly related to the positions of those leaf nodes can be updated toward their optimal values. Internal nodes in a search tree are learned by bootstrapping.

†1 Shibaura Institute of Technology

Usually, for learning positional evaluation functions, the parameters to be learned must be included in or highly related to the positions prepared for learning. A huge amount of learning data, such as the records of games between professional shogi players or the self-play games of the learning system, must be prepared to increase the learning accuracy. However, the learning method proposed in this paper will considerably reduce the number of games required for learning.

For explanatory simplicity, we mainly restrict the scope of our theory's application to computer shogi in this paper.

2. General Flow of Monte Carlo Softmax Search

In the search method of MCSS [4], a softmax operation selects a child node based on a certain probability distribution function instead of a deterministic minimax calculation over the values of the child nodes. The serial processes that stochastically select the child nodes from the root node down to a leaf node in a search tree are considered as Monte Carlo sampling. For this reason, the method proposed in Ref. 4 is called a Monte Carlo Softmax Search (MCSS).

The following is the general MCSS flow [4][a]:

1) Initialization: Set current state u to the root node.
2) Selection: One child node v = child(u) is selected by a stochastic node-selection policy.
3) Expansion: If v has already been expanded, then go to 2) after replacing u with v. Otherwise, expand all the child nodes of v by only one step and evaluate them by a quiescence search.
4) Backup: Trace back the path from v to the root node and update all the node values on the path.

MCTS has almost the same flow. However, it uses a deterministic node-selection policy in step 2 and uses playout for evaluating the leaf nodes in step 3. The detailed processes and characteristics of MCSS are described in the next section.

3. Details of MC Softmax Algorithm

3.1 Softmax operation in a backup policy

In shogi, the search's objective is to determine the principal variation. The deeper the search, the more accurate the PV found under the constraints of the thinking time. An iterative deepening search algorithm is usually used to expand all the next child nodes by only one step. In this algorithm, the child node values are back-propagated to calculate the values of their parent nodes. The leaf nodes of the search tree are evaluated by a positional evaluation function. The updating procedure of the values of the internal nodes is called backup and is repeated up to the root node.

Minimax calculation is used in the backup process to search for the PV. Instead of minimax calculation, the MC softmax search uses the following softmax operation that defines an internal node's value as the expectation of its child nodes' values:

E_s(s) = \sum_{a \in A(s)} P(a; s) E_s(v(a; s)),   (3.1)

where E_s(s) is the value of node s, A(s) is the set of legal moves at s, and v(a;s) is the child node produced from s by move a \in A(s). Weight P(a;s) (\geq 0) is a probability distribution function that satisfies

\sum_{x \in A(s)} P(x; s) = 1.   (3.2)

In steps 1 to 4 of the search algorithm in Section 2 [4], node-selection policy function π(a|s) was defined by a Boltzmann distribution function of its children's values E_s(v(a;s)). The same function was used as P(a;s) in Eq. (3.2). In this paper, we call P(a;s) a backup policy to distinguish it from node-selection policy π(a|s) in step 2 [b]. P(a;s) gives the weight values in the expectation operator that propagates the values of the child nodes up to their parent nodes in step 4.

Minimax calculation propagates only the value of the minimum/maximum node to its parent node. The information on the important variations other than the PV is completely discarded. Moreover, minimax calculation complicates propagating the values of the gradient vectors toward the root node. Even a small variation of the parameters included in an evaluation function discontinuously changes the PV's path and the derivatives of the root node's value. On the other hand, a softmax operator makes it possible to express the values of all the nodes as continuous functions of the parameters in an evaluation function. This is a desirable property for a backup operator because calculating the gradient vectors of the node values is necessary for learning a positional evaluation function.

Therefore, the softmax operator in a backup policy makes it possible to consider the leaf nodes of important variations other than the PV when learning an evaluation function [c]. In Sections 4 and 5, we show that Monte Carlo sampling with the backup policy on a search tree produced by MCSS can take account of the important variations in supervised learning, reinforcement learning, regression learning, and search bootstrapping.

a) The Shibaura Shogi Softmax team used MCSS in the 27th World Computer Shogi Championship in 2017.
b) The term Softmax Policy is used in reinforcement learning as a policy through which an agent selects its action. The node-selection policy corresponds to an agent's action selection policy.
c) We can use a minimax operator or set T at a small value close to zero when no learning is being done.
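To make the flow of Section 2 and the softmax backup of Eq. (3.1) concrete, here is a minimal Python sketch of one MCSS iteration. It is not the implementation of Ref. 4: the Node class, the expand() callback (standing in for one-step expansion plus quiescence-search evaluation), and the use of simple Boltzmann policies for both selection and backup are assumptions of this sketch, and the sign handling needed for alternating maximizing/minimizing players is omitted.

```python
import math
import random

class Node:
    """A search-tree node holding a softmax-backed-up value E_s(s)."""
    def __init__(self, state, value=0.0):
        self.state = state
        self.value = value        # E_s(s) in Eq. (3.1)
        self.children = []        # list of (move, child Node) pairs
        self.parent = None
        self.expanded = False

def softmax_backup(node, temperature):
    """Eq. (3.1): the parent value is the expectation of the child values
    under a Boltzmann backup policy P(a;s) computed from those values."""
    if not node.children:
        return node.value
    vals = [child.value for _, child in node.children]
    m = max(vals)                                   # stabilize exp()
    weights = [math.exp((v - m) / temperature) for v in vals]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, vals)) / z

def boltzmann_select(node, temperature):
    """A stochastic node-selection policy pi(a|s) over the child values."""
    vals = [child.value for _, child in node.children]
    m = max(vals)
    weights = [math.exp((v - m) / temperature) for v in vals]
    return random.choices(node.children, weights=weights, k=1)[0]

def mcss_iteration(root, expand, t_select=1.0, t_backup=0.1):
    """One selection/expansion/backup cycle (steps 2-4 of Section 2).
    'expand(node)' must create node.children (setting each child's parent)
    and evaluate them, e.g. by a quiescence search; it is a placeholder."""
    u = root
    while u.expanded and u.children:        # 2) stochastic selection
        _, u = boltzmann_select(u, t_select)
    expand(u)                               # 3) expand one step and evaluate
    u.expanded = True
    while u is not None:                    # 4) back up values to the root
        u.value = softmax_backup(u, t_backup)
        u = u.parent
```

Using a lower temperature for the backup than for the selection reflects the separation of the two policies proposed in this paper.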

3.2 Stochastic selective search in node selection

Minimax search, which intends to find the PV in the whole search tree defined by minimax calculation, is classified as a full-width search. This approach is one of the leading search algorithms for shogi. A search that restricts the searching region by heuristics is called a selective search. Gekisashi is a typical shogi program that uses selective search based on the heuristic features of moves [5].

In computer go, MCTS is the most successful searching method [1]. UCT is a typical example of MCTS and deterministically selects the child node with the largest upper confidence bound (UCB) among its brother nodes [3]. UCT is a deterministic selective search method. AlphaZero [2] proved that MCTS is applicable even to chess and shogi as well as go [d]. Recently, selective search has been applied to shogi more extensively. Many researchers are interested in whether such a new approach as MCTS outperforms traditional approaches based on full-width searches like alpha-beta search. MCSS (explained above in Section 2) is classified as a selective search like MCTS. However, MCSS is different from MCTS because it uses positional evaluation functions to calculate the values of the leaf nodes instead of the playout simulations conducted in MCTS. Another difference is that MCSS stochastically selects a child node with node-selection probability π(a|s) [e]. In UCT, which is a typical MCTS method, the upper bounds of the child nodes are estimated [3]. Upper bound I_a (UCB1) [7] is defined by

I_a \equiv \bar{X}_a + c \sqrt{\frac{2 \ln n_s}{n_a}},   (3.3)

where n_s and n_a are the numbers of times of visiting s and of selecting a at s, \bar{X}_a is the sampling average of the payoff after selecting a up to this time [7][8], and c is a weight coefficient that provides a balance between exploitation and exploration [f].

The child node with the largest I_a value among its brother nodes is selected deterministically [3][g]. Upper bound I_a includes the number of node visits so that infrequently visited nodes are preferentially selected. However, the MCSS algorithm directly uses \bar{X}_a instead of estimated upper bound I_a and selects a child node in a stochastic manner. Temperature parameter T and strategies with heuristics on the game can be embedded into the node-selection policies. That controls the depth of the search and the balance between exploitation and exploration and expresses playing styles [h].

The stochastic selective search in MCSS has two advantages. First, it is suitable for parallel and autonomous computing. A series of node selections from the root node to a leaf node is a stochastic sampling process. These sampling calculations can be executed asynchronously in parallel. Stochastic selection prevents the same node from being accessed at the same time. It also has an advantage in mini-batch searching. Parallel searching in MCSS is a multiagent system where search agents share the search tree as a common memory like a blackboard. The search tree can be completely divided and assigned to agents.

Second, explicit pruning techniques in a search tree, which are devised in full-width searches such as alpha-beta search, are not necessary in MCSS. Nodes with low scores are not selected so frequently, and the production of arcs from those nodes is naturally suppressed.

3.3 Boltzmann distribution functions applied to policies

In thermodynamics and statistical physics, the principle of maximum entropy proposed by Jaynes derives the probability distribution function necessary for calculating such physical quantities as the energy of a physical system [9]. In this section, we show that the principle also derives a Boltzmann distribution function and can be applied to the stochastic action selection of agents.

Consider the problem of how an agent selects an optimal action at the current state using heuristics E that evaluate the agent's actions. E_i refers to the score of action i, which is selected with probability p_i calculated from E_i. The expectation value of {E_i} (i=1,2,…,n) is specified to value E_T to maintain the selection quality.

The principle of maximum entropy argues that an action should be selected based on the probability function p_i that maximizes entropy f(\{p_i\}) \equiv -\sum_i p_i \ln p_i under the constraints imposed on \{p_i\}, which maintains the largest uncertainty about action decisions; such a decision seems the most unbiased. The Lagrange multiplier method gives the Boltzmann distribution function

p_i = e^{E_i/T} / \sum_{i=1}^{n} e^{E_i/T}   (3.4)

by maximizing entropy f(\{p_i\}) under constraints \sum_i p_i = 1 and \langle E \rangle \equiv \sum_i p_i E_i = E_T [4][i].

When temperature T approaches zero, only the p_i of the action i that has the largest E_i increases to 1, and the stochastic selection policy in (3.4) is reduced to a deterministic policy [j]. On the other hand, the policy approaches a random policy at infinite temperature.

d) AlphaZero used a neural network model for approximating the positional evaluation functions. This model does not go well with sequential processing such as alpha-beta search because it takes too much computing time.
e) Even in MCSS, playout can be used for node evaluation, and learning the simulation policy used in the playout is proposed in Ref. 6.
f) c is set to 1 in Ref. 7, which proposed the UCB1 algorithm.
g) As the number of visits increases, sample average \bar{X}_a approaches the true value and the second term in Eq. (3.3) gradually reduces.
h) A node-selection policy can be defined independently from a backup policy, as will be mentioned in Section 3.3.
i) T is calculated by Eq. (3.4) and \sum_i p_i E_i = E_T. Therefore, T or E_T is a measure of the randomness of exploration.
j) The policy becomes a deterministic selection that minimizes E_i if E_i is replaced with -E_i.
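The temperature behaviour described above is easy to verify numerically. The following sketch evaluates Eq. (3.4) for a few illustrative action scores; the values of E_i and T are made up for this example.

```python
import math

def boltzmann(scores, temperature):
    """Eq. (3.4): p_i = exp(E_i/T) / sum_j exp(E_j/T), computed stably."""
    m = max(scores)
    w = [math.exp((e - m) / temperature) for e in scores]
    z = sum(w)
    return [x / z for x in w]

scores = [1.0, 0.5, -0.2]          # heuristic action scores E_i (illustrative)
for T in (10.0, 1.0, 0.1):
    p = boltzmann(scores, T)
    print(f"T={T:>4}: " + ", ".join(f"{x:.3f}" for x in p))
# At large T the policy is nearly uniform; as T -> 0 it concentrates
# on the action with the largest E_i (a deterministic, greedy policy).
```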


The stochastic policy for selecting an action based on the Boltzmann distribution function is widely used as Boltzmann selection in reinforcement learning and as the Boltzmann-exploration-based search in multiarmed bandit problems [3][10]. In this paper, we propose two Boltzmann-type functions for node-selection policy π(a|s) and backup policy P(a;s). However, E_i and T are not necessarily identical in the two functions.

3.4 Recursive definition of move and node values

Let a_t be a move at the t-th turn (t=1,2,…,L_a) in a two-player game like shogi. In MCSS, backup policy P_a(a_t;u_t) is given by a Boltzmann distribution function:

P_a(a_t; u_t) = \exp(Q_a(u_t, a_t)/T_a) / Z_a,   (3.5)

Z_a \equiv \sum_{x \in A(u_t)} \exp(Q_a(u_t, x)/T_a).   (3.6)

T_a is a temperature parameter. Q_a(u,a) is called a move value, which indicates the value of move a at position u.

Suppose that the opponent's backup policy is given by P_b(b_t;v_t), where b_t is the opponent's move at the t-th turn (Fig. 1). d (=0,1,…,D) denotes the search depth in Fig. 1. Superscripts are omitted when d=0; for example, u_t^0 is denoted as u_t. P_b(b_t;v_t) is given by

P_b(b; v) = \exp(-Q_b(v, b)/T_b) / Z_b,   (3.7)

Z_b \equiv \sum_{x \in A(v)} \exp(-Q_b(v, x)/T_b),   (3.8)

using move value function Q_b(v, b) [k].

Fig. 1  Search tree (the agent's nodes u_t^d and moves a_t^d and the opponent's nodes v_t^d and moves b_t^d at successive search depths d, d+1, d+2).

We define node values V_a(u_t^d) and V_b(v_t^d) in the search tree using move values:

V_a(u_t^d) \equiv \sum_{a_t^d} P_a(a_t^d; u_t^d) Q_a(u_t^d, a_t^d)   (3.9)

and

V_b(v_t^d) \equiv \sum_{b_t^d} P_b(b_t^d; v_t^d) Q_b(v_t^d, b_t^d).   (3.10)

Since a state after a move is determined uniquely in shogi, move value Q_a(u, x) is replaced by node value V_b(v(x; u)) as

Q_a(u_t^d, a_t^d) = V_b(v_t^d)   (3.11)

and

Q_b(v_t^d, b_t^d) = V_a(u_t^{d+1}),   (3.12)

where v(x; u) is the opponent's state after move x at u [l]. Using Eqs. (3.9) to (3.12), the move values are written recursively:

Q_a(u_t^d, a_t^d) = V_b(v_t^d)   (3.13)
= \sum_{b_t^d} P_b(b_t^d; v_t^d) Q_b(v_t^d, b_t^d)   (3.14)
= \sum_{b_t^d} P_b(b_t^d; v_t^d) V_a(u_t^{d+1})   (3.15)
= \sum_{b_t^d} P_b(b_t^d; v_t^d) \sum_{a_t^{d+1}} P_a(a_t^{d+1}; u_t^{d+1}) Q_a(u_t^{d+1}, a_t^{d+1}).   (3.16)

When node u_t^d is a leaf node, i.e., d=D-1, the move value is defined using positional evaluation function H_a(u; ω) [m]:

Q_a(u_t^{D-1}, a_t^{D-1}) \equiv \sum_{b_t^{D-1}} P_b(b_t^{D-1}; v_t^{D-1}) H_a(u_t^D; ω).   (3.17)

Similarly, the node values are written recursively from (3.7) to (3.10):

V_a(u_t^d) = \sum_{a_t^d} P_a(a_t^d; u_t^d) Q_a(u_t^d, a_t^d)
= \sum_{a_t^d} P_a(a_t^d; u_t^d) V_b(v_t^d)   (3.18)
= \sum_{a_t^d} P_a(a_t^d; u_t^d) \sum_{b_t^d} P_b(b_t^d; v_t^d) Q_b(v_t^d, b_t^d)   (3.19)
= \sum_{a_t^d} P_a(a_t^d; u_t^d) \sum_{b_t^d} P_b(b_t^d; v_t^d) V_a(u_t^{d+1}).   (3.20)

If node u_t^D is a leaf,

V_a(u_t^D) \equiv H_a(u_t^D; ω).   (3.21)

Recursive Eqs. (3.16) and (3.20) calculate the backup operations of the move and node values if backup policy functions P_a(a; u) and P_b(b; v) are given. H_a(u; ω) is assumed to be a linear function of the position's features or a neural network function with weight vector ω.

k) The backup policy and the move value function of the opponent agents are not necessarily required to satisfy Eq. (3.7).
l) Transition probabilities are necessary if the state transition is probabilistic after making a move.
m) Eq. (3.14) is a recursive definition of the node values, and Eq. (3.5) is an example of the backup policy functions based on the node values.
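The mutual recursion of Eqs. (3.13)–(3.21) translates directly into two functions that walk a game tree. The sketch below is only an illustration under simplifying assumptions: the tree is given as nested dictionaries, the evaluation function H_a is a stand-in that returns a stored score, and both backup policies use the Boltzmann forms of Eqs. (3.5)–(3.8) with illustrative temperatures.

```python
import math

T_A, T_B = 1.0, 1.0     # backup temperatures of the two players (illustrative)

def softmax_weights(values, temperature, sign=+1.0):
    """Boltzmann weights exp(sign*Q/T)/Z as in Eqs. (3.5)-(3.8)."""
    scaled = [sign * v / temperature for v in values]
    m = max(scaled)
    w = [math.exp(x - m) for x in scaled]
    z = sum(w)
    return [x / z for x in w]

def value_a(node, h):
    """V_a(u): softmax backup over the learning agent's moves, Eqs. (3.18)-(3.21)."""
    if not node["children"]:                  # leaf: V_a(u^D) = H_a(u^D; w)
        return h(node)
    q = [value_b(child, h) for child in node["children"]]    # Q_a(u,a) = V_b(v)
    p = softmax_weights(q, T_A, sign=+1.0)                    # P_a prefers large Q_a
    return sum(pi * qi for pi, qi in zip(p, q))

def value_b(node, h):
    """V_b(v): softmax backup over the opponent's moves, Eq. (3.10)."""
    if not node["children"]:
        return h(node)
    q = [value_a(child, h) for child in node["children"]]    # Q_b(v,b) = V_a(u')
    p = softmax_weights(q, T_B, sign=-1.0)                    # P_b prefers small Q_b
    return sum(pi * qi for pi, qi in zip(p, q))

# A depth-2 toy tree: root (agent to move) -> opponent nodes -> leaf positions.
tree = {"score": 0.0, "children": [
    {"score": 0.0, "children": [{"score": 0.8, "children": []},
                                {"score": -0.3, "children": []}]},
    {"score": 0.0, "children": [{"score": 0.2, "children": []},
                                {"score": 0.1, "children": []}]},
]}
print(round(value_a(tree, lambda node: node["score"]), 4))
```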

Stepping down to the bottom nodes u \in U_D(u_t, a_t) of the recursion in Eqs. (3.16) and (3.20), move values Q_a(u_t, a_t) and node values V_a(u_t) are expressed using transition probabilities:

P(u | u_t, a_t) \equiv \sum_{\{a_t^d\}} \sum_{\{b_t^d\}} \prod_{d=0}^{D-2} P_b(b_t^d; v_t^d) P_a(a_t^{d+1}; u_t^{d+1}) \cdot P_b(b_t^{D-1}; v_t^{D-1}),   (3.22)

Q_a(u_t, a_t) = \sum_{u \in U_D(u_t, a_t)} P(u | u_t, a_t) H_a(u; ω),   (3.23)

and

V_a(u_t) = \sum_{a_t} P_a(a_t; u_t) \sum_{u \in U_D(u_t, a_t)} P(u | u_t, a_t) H_a(u; ω).   (3.24)

U_D(u_t, a_t) is the set of leaf nodes at d=D in the subtree after selecting a_t at u_t. P(u | u_t, a_t) is the probability that the node value of leaf node u is backed up to the node value of root node u_t. For this reason, P(u | u_t, a_t) is called a backup probability. On the other hand, the transition probability where P_a(a; u) and P_b(b; v) are replaced with π_a(a|u) and π_b(b|v) in P(u | u_t, a_t) is the probability that node u is selected from root u_t through move a_t; it is called a realization probability [n]. The realization probabilities control the search depth in Gekisashi [5][11].

3.5 MCSS in multiplayer games

In Section 3.4, we applied MCSS to two-player games like shogi. This section deals with n-player games with perfect information. Assume that player 1 plays with players i (=2,3,…,n) who take turns in ascending order of i. v(i) is the state of player i's turn, and b(i) is its move. If the backup policy of player i is P_{b(i)}(b(i); v(i)), then the move and node values are defined in such recursive expressions as Eqs. (3.16) and (3.20) as follows:

Q_a(u_t^d, a_t^d) = \sum_{b_t^d(2),…,b_t^d(n)} \prod_{i=2}^{n} P_{b(i)}(b_t^d(i); v_t^d(i)) \cdot \sum_{a_t^{d+1}} P_a(a_t^{d+1}; u_t^{d+1}) Q_a(u_t^{d+1}, a_t^{d+1})   (3.25)

and

V_a(u_t^d) = \sum_{a_t^d} P_a(a_t^d; u_t^d) \cdot \sum_{b_t^d(2),…,b_t^d(n)} \prod_{i=2}^{n} P_{b(i)}(b_t^d(i); v_t^d(i)) V_a(u_t^{d+1}).   (3.26)

Only defining the move and node values of the leaves by Eqs. (3.17) and (3.21) is necessary, as in a two-player game. Therefore, the theories in Sections 3 and 4 are easily applied to n-player games.

4. Learning Positional Evaluation Functions Based on MCSS

Many methods have been proposed for learning the evaluation functions of computer chess, go, and shogi, based on supervised learning, reinforcement learning, regression learning, and search bootstrapping. In this section, we use backup policy functions P_a(a; u), node values V_a(s), and move values Q_a(s, a) in these learning algorithms.

4.1 Supervised learning

The Bonanza method, a well-known supervised learning algorithm for positional evaluation functions in shogi, uses game records between professional shogi players as teaching data [12][13]. As an extension of this method, a supervised learning method using policy gradients was proposed by Konemura et al. [14].

In Ref. 14, the loss function to be minimized was defined by

\delta_{KLD}(S; π^*, π) \equiv \sum_{s \in S} \sum_{a \in A(s)} π^*(a|s) \ln \{ π^*(a|s) / π(a|s; ω) \}.   (4.1)

π^* and π are the probability distribution functions of the teacher and learning systems. The Kullback-Leibler divergence is used to measure the difference between the two distribution functions in Eq. (4.1). S is a set of training positions, and A(s) is the set of legal moves at position s. ω denotes the parameter vector included in a positional evaluation function.

Policy π, which determines the actions of the learning system, is given by a Boltzmann distribution function in Ref. 14 as follows:

π(a|s; ω) = \exp(H_a(s^*|a, s; ω)/T) / Z   (4.2)

and

Z \equiv \sum_{x \in A(s)} \exp(H_a(s^*|x, s; ω)/T).   (4.3)

Position s^* is the leaf node of the PV after move a is selected at position s. Its static value is given by H_a(s^*|a, s; ω) [o].

n) The realization probability defined in Gekisashi is represented by the features of moves instead of the positional evaluation functions and is used as a node-selection policy. However, MCSS computes a node-selection policy only from positional evaluation functions [4].
o) Quiescence search was used in Ref. 14 instead of searching for PVs to reduce the computation time.

In this paper, we use backup policy P_a(a_t; u_t) in Eq. (3.5) as policy π in Eq. (4.1), considering important variations other than PVs. Therefore, we define the loss function

\delta_{KLD}(S; π^*, P_a) \equiv \sum_{s \in S} \sum_{a \in A(s)} π^*(a|s) \ln \{ π^*(a|s) / P_a(a; s) \},   (4.4)

which leads to the following learning rule:

\Delta ω = -\varepsilon \nabla_ω \delta_{KLD}(S; π^*, P_a)   (4.5)

= \varepsilon \sum_{s \in S} \sum_{a \in A(s)} π^*(a|s) \nabla_ω \ln P_a(a; s)   (4.6)

= \frac{\varepsilon}{T_a} \sum_{s \in S} \sum_{a \in A(s)} \{ π^*(a|s) - P_a(a; s) \} \nabla_ω Q_a(s, a).   (4.7)

The calculation of gradient vector ∇_ω Q_a(s, a) is described in Section 5.

4.2 Value-based reinforcement learning

TD(λ) is a typical method using value-based reinforcement learning algorithms [15]. We replace the state value function used in the TD(λ) method with node value V_a(u_t), defined by Eq. (3.20), to derive the following learning rule:

\Delta ω = \varepsilon \sum_{t=1}^{L_a-1} e^{\lambda}(t) \delta_t,   (4.8)

e^{\lambda}(t) = \gamma \lambda e^{\lambda}(t-1) + \nabla_ω V_a(u_t),   (4.9)

\delta_t \equiv r_{t+1} + \gamma V_a(u_{t+1}) - V_a(u_t).   (4.10)

t is the t-th time step when the learning agent takes its turn. L_a is the number of steps until the end of the game. γ and λ are constants between 0 and 1. r_t is a reward signal, such as win-or-lose information digitalized to 0 or 1. The computing method of ∇_ω V_a(u_t) will be described in Section 5.

Q-learning, which is another typical value-based reinforcement learning algorithm, uses an action value function instead of a state value function. We propose a softmax operation as a backup policy in Q-learning. That is, we use move value function Q_a(s, a), defined in Eq. (3.16), instead of the usual Q-function.

First, we define a loss function of learning:

\delta_Q(S, A) \equiv \frac{1}{2} \sum_{s \in S, a \in A} \left[ Q_a(s, a) - \sum_b P_b(b; v) \sum_{a'} P_a(a'; s') Q_a(s', a') \right]^2.   (4.11)

v is the state of the opponent's turn after the learning agent makes move a at s, and s' is the next state after the opponent makes b at v. The second term in the square bracket in Eq. (4.11) becomes V_b(v) using Eqs. (3.9) to (3.12). We also use a steepest descent method to derive a learning rule here. The second term is the target value of Q_a(s, a) and should not be differentiated with respect to ω. The learning rule is expressed as

\Delta ω = -\varepsilon \nabla_ω \delta_Q(S, A)   (4.13)

= -\varepsilon \sum_{s \in S, a \in A} [ Q_a(s, a) - V_b(v) ] \nabla_ω Q_a(s, a).   (4.14)

The calculation of gradient vector ∇_ω Q_a(s, a) is described in Section 5. Note that s is not restricted to an actual position that appears during games; it can also be an internal node of a search tree.
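The TD(λ) rule of Eqs. (4.8)–(4.10) becomes especially simple when V_a(u_t) is approximated by a linear function ω·φ(u_t), so that ∇_ω V_a(u_t) = φ(u_t). The following sketch is a minimal illustration under that assumption; the feature vectors and rewards are invented for the example.

```python
import numpy as np

def td_lambda_update(features, rewards, omega, eps=0.1, gamma=1.0, lam=0.7):
    """Eqs. (4.8)-(4.10) with a linear value V_a(u_t) = omega . phi(u_t),
    so that grad_omega V_a(u_t) = phi(u_t).  'features' is [phi(u_1),...,phi(u_La)]
    and rewards[t] is r_{t+1} (e.g. 0 except +1/-1 at the end of the game)."""
    e = np.zeros_like(omega)                     # eligibility trace e^lambda(t)
    delta_omega = np.zeros_like(omega)
    values = [phi @ omega for phi in features]
    for t in range(len(features) - 1):
        e = gamma * lam * e + features[t]                        # Eq. (4.9)
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # Eq. (4.10)
        delta_omega += eps * delta * e                           # Eq. (4.8)
    return omega + delta_omega

# Illustrative three-position game ending in a win (reward 1 at the end).
phis = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
print(td_lambda_update(phis, rewards=[0.0, 1.0], omega=np.zeros(2)))
```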


4.3 Policy-based reinforcement learning

REINFORCE by Williams is a method of policy-based reinforcement learning [16]. The PG expectation method is a learning scheme based on REINFORCE and softmax search. Policy Gradient with Principal Leaf (PGLeaf) is a fast approximate version of the PG expectation method [17][18].

Using backup policy P_a(a_t; u_t) in Eq. (3.5) and move value Q_a(u, a), the learning rule in the PG expectation method is expressed as

\Delta ω = \varepsilon \cdot r \sum_{t=0}^{L_a-1} e_ω(t),   (4.15)

e_ω(t) \equiv \nabla_ω \ln P_a(a_t; u_t)
= \left[ \nabla_ω Q_a(u_t, a_t) - \sum_{a \in A(u_t)} P_a(a; u_t) \nabla_ω Q_a(u_t, a) \right] / T_a.   (4.16)

The calculation of ∇_ω Q_a(s, a) on the right-hand side is described in Section 5.

4.4 Regression learning

Let r(σ) be a variable that takes 1 if the learning agent wins game σ and 0 otherwise. Next, we define function P_{win}(s), which predicts winning the game at state s, using node value function V_a(s):

P_{win}(s) = 1 / [1 + e^{-V_a(s)/\tau}].   (4.17)

If a loss function is defined as

\delta_{reg}(σ) \equiv \frac{1}{2} \sum_{t=1}^{L_a-1} [ r(σ) - P_{win}(u_t) ]^2,   (4.18)

then the steepest descent method gives the following learning rule:

\Delta ω = \varepsilon \sum_{t=1}^{L_a-1} [ r(σ) - P_{win}(u_t) ] \nabla_ω P_{win}(u_t),   (4.19)

where

\nabla_ω P_{win}(u_t) = (1/\tau) [ 1 - P_{win}(u_t) ] P_{win}(u_t) \nabla_ω V_a(u_t).   (4.20)

AlphaZero [2] also uses a similar regression learning. However, it used a positional evaluation function H_a(u_t; ω) instead of V_a(u_t) in Eq. (4.17).

4.5 Learning by bootstrapping

Veness et al. proposed for chess a learning algorithm called Search Bootstrapping [19], which uses deep search results to learn evaluation function H_a(u_t; ω). We apply node values V_a(s) to their bootstrapping approach and define a loss function on a set of states S:

\delta_{bts}(S) \equiv \frac{1}{2} \sum_{s \in S} [ H_a(s; ω) - V_a(s) ]^2.   (4.21)

The steepest descent method gives the following learning rule if V_a(s) is the target value of H_a(s; ω):

\Delta ω = -\varepsilon \sum_{s \in S} [ H_a(s; ω) - V_a(s) ] \nabla_ω H_a(s; ω).   (4.22)

In Eq. (4.22), s represents an internal node of a search tree. The value of leaf s^* of a PV, H_a(s^*), was used in the past bootstrapping algorithm as the target value of H_a(s; ω) in Eq. (4.21) instead of V_a(s) [p].

Moreover, we consider a bootstrapping method that matches the distribution of the children's values of s with the deep search results. P_H(a; s, ω) and P_a(a; s) are distribution functions used for the backup operation based on H_a(s; ω) and V_a(s), respectively. We define the following loss function to make P_H(a; s, ω) close to P_a(a; s):

\delta_{PP}(S; P_a, P_H) \equiv \sum_{s \in S} \sum_{a \in A(s)} P_a(a; s) \ln \{ P_a(a; s) / P_H(a; s, ω) \}.   (4.23)

P_a(a; s) is the deep search result and is used as a target value. The learning rule becomes

\Delta ω = -\varepsilon \nabla_ω \delta_{PP}(S; P_a, P_H)   (4.24)

= \varepsilon \sum_{s \in S} \sum_{a \in A(s)} P_a(a; s) \nabla_ω \ln P_H(a; s, ω)   (4.25)

= \frac{\varepsilon}{T_a} \sum_{s \in S} \sum_{a \in A(s)} \{ P_a(a; s) - P_H(a; s, ω) \} \nabla_ω H_a(v(a; s)),   (4.26)

where v(a; s) is the state after making move a at state s. Eq. (4.25) is the cross entropy between two probability distribution functions and was used in the loss function of AlphaZero [2].

5. Learning by Monte Carlo Sampling with Backup Policies

5.1 Gradient vectors of move and node values

We proposed a backup policy in Section 3 and used the move and node values calculated with the backup policy in the traditional learning methods in Section 4. The gradient vectors of move value functions, ∇_ω Q_a(s, a), are required for supervised learning in Section 4.1, Q-learning in Section 4.2, and policy gradient reinforcement learning in Section 4.3. The gradient vectors of node value functions, ∇_ω V_a(s), are needed for TD(λ) learning in Section 4.2 and regression learning in Section 4.4.

We describe the calculation of these gradient vectors in this section. The calculation of the move values when state s is a position u_t that appears in the actual games was derived in Refs. 4 and 17. Monte Carlo sampling from root node u_t of a search tree was proposed in Ref. 4 to calculate the move values. In Ref. 4, ∇_ω Q_a(s, a) is expressed as

\nabla_ω Q_a(u_t^d, a_t^d) = \sum_{b_t^d} P_b(b_t^d; v_t^d) \sum_{a_t^{d+1}} P_a(a_t^{d+1}; u_t^{d+1}) \cdot [ \{ Q_a(u_t^{d+1}, a_t^{d+1}) - V_a(u_t^{d+1}) \} / T_a + 1 ] \cdot \nabla_ω Q_a(u_t^{d+1}, a_t^{d+1}).   (5.1)

If u_t^d is a leaf node of the search tree, i.e., d=D-1, then

\nabla_ω Q_a(u_t^{D-1}, a_t^{D-1}) = \sum_{b_t^{D-1}} P_b(b_t^{D-1}; v_t^{D-1}) \nabla_ω H_a(u_t^D; ω).   (5.2)

V_a(u_t^{d+1}) in Eq. (5.1) is written as in Eq. (3.9) using Q_a(u_t^{d+1}, a_t^{d+1}). To derive Eqs. (5.1) and (5.2) in Ref. 4, backup policy function P_b(b_t^d; v_t^d) of the opponent is fixed during the learning, assuming that ∇_ω P_b(b_t^d; v_t^d) = 0. That leads to a very simple expression of the learning rule.

The method for calculating ∇_ω V_a(s) is expressed recursively:

\nabla_ω V_a(u_t^d) = \sum_{a_t^d} P_a(a_t^d; u_t^d) \sum_{b_t^d} P_b(b_t^d; v_t^d) \cdot [ \{ Q_a(u_t^d, a_t^d) - V_a(u_t^d) \} / T_a + 1 ] \nabla_ω V_a(u_t^{d+1}).   (5.3)

If u_t^d is a leaf node of the search tree, i.e., d=D, then

\nabla_ω V_a(u_t^D) = \nabla_ω H_a(u_t^D; ω).   (5.4)

The derivation of Eq. (5.3) is shown in the appendix. Q_a(u_t^d, a_t^d) can be expressed by V_a(u_t^{d+1}) using Eq. (3.15).

p) This algorithm is respectively called RootStrap and TreeStrap when node s is the root node and an internal node on the path of a PV.
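Equations (5.1)–(5.4) suggest a simple single-path estimator: sample moves with the backup policies P_a and P_b, accumulate the bracketed correction factors, and return the accumulated factor times ∇_ω H_a at the reached leaf; averaging many such samples approximates ∇_ω V_a at the starting node. The sketch below illustrates this estimator; the tree layout, the field names, and the grad_h callback are assumptions made for this illustration, not the authors' data structures.

```python
import random

def sample_gradient_va(node, grad_h, temperature=1.0):
    """One Monte Carlo sample of grad_omega V_a(u) following Eqs. (5.3)-(5.4).

    Assumed (hypothetical) tree layout:
      agent node:    {"V": V_a(u), "moves": [(Q_a(u,a), P_a(a;u), opp_node), ...]}
      opponent node: {"replies": [(P_b(b;v), next_agent_node), ...]}
      leaf:          an agent node whose "moves" list is empty
    grad_h(leaf) must return grad_omega H_a(leaf; omega) as a list of floats.
    Averaging many samples approximates grad_omega V_a(node)."""
    factor = 1.0
    while node["moves"]:                                   # not yet a leaf
        q, _, opp = random.choices(
            node["moves"], weights=[m[1] for m in node["moves"]], k=1)[0]
        factor *= (q - node["V"]) / temperature + 1.0      # bracket of Eq. (5.3)
        _, node = random.choices(
            opp["replies"], weights=[r[0] for r in opp["replies"]], k=1)[0]
    g = grad_h(node)                                       # Eq. (5.4) at the leaf
    return [factor * gi for gi in g]
```

The same trick with the first move fixed instead of sampled gives a sample of ∇_ω Q_a at the starting node, following Eqs. (5.1)–(5.2).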


Equations (5.1) and (5.3) imply that ∇_ω Q_a(s, a) and ∇_ω V_a(s) are given recursively by products of the transition probabilities and the factors in the brackets. This means that these gradient vectors can be calculated by Monte Carlo sampling with the backup policies on a search tree where the move and node values are given.

Equations (5.2) and (5.4) indicate that the parameters ω included in H_a(u_t^D; ω) at leaf nodes u_t^D are updated by learning. The leaf node of a PV and the leaves of its important variations are frequently visited by Monte Carlo sampling. Therefore, the parameters ω that are highly related to those nodes are updated frequently. This is a huge advantage in reducing the number of games necessary for sufficient learning.

5.2 Learning rules expressed by gradient vectors of node value functions

In the last section, the learning rules were expressed by ∇_ω Q_a(s, a) and ∇_ω V_a(s). The space dimension of the latter vector is much smaller than that of the former. If the former is written in terms of the latter, much memory space can be saved. We therefore rewrote all the learning rules in Section 4 using only ∇_ω V_a(s).

We consider supervised learning in Eq. (4.7). Using (A.1), the updating rule in Eq. (4.7) is rewritten as

\Delta ω = \frac{\varepsilon}{T_a} \sum_{a \in A(u)} \{ π^*(a|u) - P_a(a; u) \} \cdot \sum_b P_b(b; v(a; u)) \nabla_ω V_a(u(b; v)),   (5.5)

where v(a; u) is the state after move a at state u and u(b; v) is the state after move b at state v.

Next, updating rules (4.15) and (4.16) of the policy gradient reinforcement learning are rewritten as

\Delta ω = \varepsilon \cdot r \sum_{t=0}^{L_a-1} e_ω(t),   (5.6)

e_ω(t) = \frac{1}{T_a} \left[ \sum_{b_t} P_b(b_t; v_t(a_t; u_t)) \nabla_ω V_a(u_t^1(b_t; v_t(a_t; u_t))) - \sum_{a \in A(u_t)} P_a(a; u_t) \sum_{b_t} P_b(b_t; v_t(a; u_t)) \nabla_ω V_a(u_t^1(b_t; v_t(a; u_t))) \right].   (5.7)

The learning rules of the TD(λ) method and regression learning are represented only by ∇_ω V_a(s), as shown in Eqs. (4.9) and (4.20).

Q-learning in Section 4.2 is reduced to a bootstrapping method using the states where the opponent selects a move, for the following reason. Q_a(s, a) is the value of move a at state s and should be approximated by H_a(v(a; s); ω) as precisely as possible. For this purpose, we replace Q_a(s, a) with H_a(v(a; s); ω) in Eq. (4.14):

\Delta ω = -\varepsilon \sum_{s \in S, a \in A} [ H_a(v(a; s); ω) - V_b(v) ] \cdot \nabla_ω H_a(v(a; s); ω).   (5.8)

This updating rule is reduced to the bootstrapping rule of Eq. (4.22) when s is replaced with the opponent's state v.

For bootstrapping, the gradient vectors of the node and move values are not necessary for the learning rules in Eqs. (4.22) and (4.26). Note that bootstrapping methods can learn not only the positions that appear in real games but also the internal nodes visited by the Monte Carlo sampling proposed in this section.

5.3 Combination of learning methods: including supervised learning

We expressed the updating rules of the major learning methods in shogi using ∇_ω V_a(s) and carried them out by Monte Carlo sampling with backup policies on the search tree. The next problem is how to combine these learning methods. In this paper, we divide that problem into two cases depending on whether supervised learning is included or not.

In the former case, we propose the following combination of supervised learning using policy gradients [14], Q-learning, and bootstrapping:

【Example of learning with a teacher signal】
1) Prepare training data that consist of a set of states S and target distributions π^*(a|s) for supervised learning.
2) Do the following procedure for ∀s ∈ S:
{
 2-1) Search the game tree by MC Softmax Search from root node s and preserve the node values at all the nodes visited during the search.
 2-2) Repeat Monte Carlo sampling with backup policies P_a(a; s) and P_b(b; v) from the state after making move a at s. During the sampling, do the following procedures:
 {
  - Compute Δω in Eqs. (4.22) and (5.8) at each node on the sampled path for bootstrapping and Q-learning.
  - Compute Δω in Eq. (5.5), using Eqs. (5.3) and (5.4), at the leaf node of the sampled path for supervised learning.
 }
 2-3) ω is updated to ω + Δω.
}
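The driver loop implied by steps 1)–2-3) above can be sketched as follows. Every callable passed in is a placeholder: mc_softmax_search() stands for the search of Section 2, sample_path() for sampling with the backup policies, and the three *_delta() functions for the updates of Eqs. (4.22), (5.8), and (5.5). None of them is specified at the code level in this paper, so this is only an orchestration sketch under those assumptions.

```python
def combined_supervised_learning(training_set, omega, num_samples,
                                 mc_softmax_search, sample_path,
                                 bootstrap_delta, q_learning_delta,
                                 supervised_delta):
    """Driver loop for the 'learning with a teacher signal' procedure.

    training_set is a list of (s, teacher_pi) pairs (step 1).  All other
    callables are hypothetical placeholders described in the lead-in text."""
    for s, teacher_pi in training_set:              # step 2) for every s in S
        tree = mc_softmax_search(s, omega)          # step 2-1) search, back up values
        delta_omega = 0.0
        for _ in range(num_samples):                # step 2-2) repeated MC sampling
            path = sample_path(tree)
            for node in path:                       # bootstrapping and Q-learning
                delta_omega += bootstrap_delta(node, omega)     # Eq. (4.22)
                delta_omega += q_learning_delta(node, omega)    # Eq. (5.8)
            delta_omega += supervised_delta(path[-1], teacher_pi, omega)  # Eq. (5.5)
        omega = omega + delta_omega                 # step 2-3)
    return omega
```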


5.4 Combination of learning methods: without teacher signals

If there is no teacher signal on the states and the moves, we combine reinforcement and regression learning based on the game's results with Q-learning and bootstrapping as follows:

【Example of learning without teacher signals】
1) Play a game using MC Softmax Search. Preserve the search trees whose root nodes are the positions {u_t} that appeared in the game and the values of all the nodes included in the search trees.
2) Conduct Monte Carlo sampling from each u_t with backup policies P_a(a; s) and P_b(b; v). During the sampling, do the following procedures:
{
 - Compute Δω in Eqs. (4.22) and (5.8) at each node on the sampled path for bootstrapping and Q-learning.
 - Compute ∇_ω V_a(u_t) and ∇_ω V_a(u_t^1) using Eqs. (5.3) and (5.4) from gradient vector ∇_ω H_a(u^D; ω) at leaf node u^D of the sampled path.
}
3) Compute Δω using Eqs. (4.8), (5.6), and (4.19), respectively, from the results computed in 2) for TD(λ) learning, policy gradient learning, and regression learning. Combine the three values of Δω by an integrating calculation such as a linear summation.
4) Update ω to ω + Δω and return to 1).

6. Conclusion

We made two proposals for MCSS. First, we separated the node-selection policy from the backup policy. This separation allows researchers to freely design a node-selection policy based on their searching strategies and to find the exact principal variation given by minimax search. Second, we proposed new sampling-based learning algorithms for a positional evaluation function. This sampling is performed with backup policies in the search tree produced by MCSS. The proposed learning algorithm is applicable to supervised learning, reinforcement learning, regression learning, and search bootstrapping.

The new learning methods consider not only the PV leaves but also the leaves of the important variations around the PVs. The feature parameters highly related to the positions of those leaf nodes can be updated toward their optimal values. The internal nodes in a search tree are learned by bootstrapping. We showed examples of combining these learning methods to reduce the number of games necessary for learning. We are planning to implement our proposed search and learning methods in shogi programs to evaluate their effectiveness. Designing and learning selection strategies in a node-selection policy are also future problems.

References
[1] Browne, C. B., et al. A Survey of Monte Carlo Tree Search Methods. IEEE Trans. of Comp. Intell. and AI in Games. 2012, vol. 4, no. 1, pp. 1-43.
[2] Silver, D., et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. 2017, arXiv:1712.01815.
[3] Kocsis, L. and Szepesvari, C. Bandit based Monte-Carlo Planning. In: Furnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) Machine Learning: ECML 2006, pp. 282-293.
[4] Kirii, A., Hara, Y., Igarashi, H., Morioka, Y. and Yamamoto, K. Stochastic Selective Search Applied to Shogi. Proceedings of the 22nd Game Programming Workshop (GPW2017), 2017, pp. 26-23. (in Japanese)
[5] Tsuruoka, Y., Yokoyama, D. and Chikayama, T. Game-tree Search Algorithm Based on Realization Probability. ICGA Journal, 2002, pp. 146-153.
[6] Igarashi, H., Morioka, Y. and Yamamoto, K. Learning Positional Evaluation Functions and Simulation Policies by Policy Gradient Algorithm. IPSJ SIG Technical Report. 2013, Vol. 2013-GI-30, No. 6, pp. 1-8. (in Japanese) This paper was translated into English and published as Igarashi, H., Morioka, Y. and Yamamoto, K. Reinforcement Learning of Positional Evaluation Functions and Simulation Policies by Policy Gradient Algorithm. The Research Reports of Shibaura Institute of Technology, Natural Sciences and Engineering. 2015, Vol. 58, No. 1, pp. 1-9. ISSN 0386-3115.
[7] Auer, P., Cesa-Bianchi, N. and Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning. 2002, vol. 47, pp. 235-256.
[8] Sato, Y. and Takahashi, D. A Shogi Program Based on Monte-Carlo Tree Search. IPSJ Journal. 2009, Vol. 50, No. 11, pp. 2740-2751. (in Japanese)
[9] Jaynes, E. T. Information Theory and Statistical Mechanics. Physical Review, 1957, vol. 106, no. 4, pp. 620-630.
[10] Peret, L. and Garcia, F. On-line Search for Solving Markov Decision Processes via Heuristic Sampling. In: de Mantaras, R. L. (ed.), ECAI, 2004, pp. 530-534.
[11] Hara, Y., Igarashi, H., Morioka, Y. and Yamamoto, K. A Simple Game-Tree Search Algorithm Based on Softmax Strategy and Search Depth Control Using Realization Probability. Proceedings of the 21st Game Programming Workshop (GPW2016), 2016, pp. 108-111. (in Japanese)
[12] Hoki, K. Source Codes of Bonanza 4.1.3. In: Matsubara, H. (ed.), Advances in Computer Shogi 6, Kyoritsu Shuppan, 2012, Chap. 1, pp. 1-23. (in Japanese)
[13] Hoki, K. and Kaneko, T. Large-Scale Optimization for Evaluation Functions with Minimax Search. Journal of Artificial Intelligence Research. 2014, vol. 49, pp. 527-568.
[14] Konemura, H., Yamamoto, K., Morioka, Y. and Igarashi, H. Policy Gradient Supervised Learning of Positional Evaluation Function in Shogi: Using Quiescence Search and AdaGrad. Proceedings of the 22nd Game Programming Workshop (GPW2017), 2017, pp. 1-7. (in Japanese)
[15] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Massachusetts, 1998, 322p.
[16] Williams, R. J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992, vol. 8, pp. 229-256.


[17] Igarashi, H., Morioka, Y. and Yamamoto, K. Learning Static Evaluation Functions Based on Policy Gradient Reinforcement Learning. Proceedings of the 17th Game Programming Workshop (GPW2012), 2012, pp. 118-121. (in Japanese)
[18] Morioka, Y. and Igarashi, H. Reinforcement Learning Algorithm that Combines Policy Gradient Method with Alpha-Beta Search. Proceedings of the 17th Game Programming Workshop (GPW2012), 2012, pp. 122-125. (in Japanese)
[19] Veness, J., Silver, D., Uther, W. and Blair, A. Bootstrapping from Game Tree Search. Advances in Neural Information Processing Systems 22, 2009.

Appendix
A.1 Recursive expression on gradient vectors of node values

The recursive expression for ∇_ω Q_a(s, a) in Eq. (5.1) was given in Ref. 4. This section derives Eq. (5.3) from Eq. (5.1). The left-hand side of the latter is written using Eq. (3.15) as:

\nabla_ω Q_a(u_t^d, a_t^d) = \sum_{b_t^d} P_b(b_t^d; v_t^d) \cdot \nabla_ω V_a(u_t^{d+1}).   (A.1)

If d is replaced with d+1 in (A.1),

\nabla_ω Q_a(u_t^{d+1}, a_t^{d+1}) = \sum_{b_t^{d+1}} P_b(b_t^{d+1}; v_t^{d+1}) \cdot \nabla_ω V_a(u_t^{d+2}).   (A.2)

Substituting (A.2) into the right-hand side of Eq. (5.1), we get

\nabla_ω Q_a(u_t^d, a_t^d) = \sum_{b_t^d} P_b(b_t^d; v_t^d) \sum_{a_t^{d+1}} P_a(a_t^{d+1}; u_t^{d+1}) \cdot [ \{ Q_a(u_t^{d+1}, a_t^{d+1}) - V_a(u_t^{d+1}) \} / T_a + 1 ] \cdot \sum_{b_t^{d+1}} P_b(b_t^{d+1}; v_t^{d+1}) \cdot \nabla_ω V_a(u_t^{d+2}).   (A.3)

Comparing the operand of \sum_{b_t^d} P_b(b_t^d; v_t^d) \cdot in (A.1) with that in (A.3),

\nabla_ω V_a(u_t^{d+1}) = \sum_{a_t^{d+1}} P_a(a_t^{d+1}; u_t^{d+1}) \cdot [ \{ Q_a(u_t^{d+1}, a_t^{d+1}) - V_a(u_t^{d+1}) \} / T_a + 1 ] \cdot \sum_{b_t^{d+1}} P_b(b_t^{d+1}; v_t^{d+1}) \cdot \nabla_ω V_a(u_t^{d+2}).   (A.4)

If d+1 is replaced with d in (A.4), Eq. (5.3) is derived.
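A quick numerical check of this derivation is possible on a depth-1 tree with a linear evaluation H_a(u; ω) = ω·φ(u) and, as assumed in deriving Eq. (5.3), a frozen opponent policy P_b. The sketch below (an illustration with randomly generated features, not part of the original derivation) compares the gradient given by Eqs. (5.3)–(5.4) against central finite differences of the backed-up value.

```python
import numpy as np

T_A, T_B = 0.7, 0.7      # backup temperatures (illustrative)

def softmax(x, temperature, sign=+1.0):
    z = np.exp(sign * np.asarray(x) / temperature)
    return z / z.sum()

def backed_up_value(omega, phi, p_b):
    """V_a(u) of a depth-1 tree: leaf values H = omega.phi, opponent weights p_b fixed."""
    h = phi @ omega                          # h[i, j] = H_a(leaf_ij; omega)
    q = (p_b * h).sum(axis=1)                # Q_a(u, a_i), Eq. (3.17)
    p_a = softmax(q, T_A)                    # P_a(a_i; u), Eq. (3.5)
    return p_a @ q, q, p_a                   # V_a(u) and intermediate quantities

def analytic_grad(omega, phi, p_b):
    """grad_omega V_a(u) by Eqs. (5.3)-(5.4) for the same depth-1 tree."""
    v, q, p_a = backed_up_value(omega, phi, p_b)
    bracket = (q - v) / T_A + 1.0            # factor [{Q_a - V_a}/T_a + 1]
    # sum_i P_a(i) * bracket(i) * sum_j P_b(i,j) * phi(i,j)
    return np.einsum("i,i,ij,ijk->k", p_a, bracket, p_b, phi)

# Two legal moves, two opponent replies each, three features per leaf (made up).
rng = np.random.default_rng(0)
phi = rng.normal(size=(2, 2, 3))
omega = rng.normal(size=3)
h0 = phi @ omega
p_b = np.stack([softmax(h0[i], T_B, sign=-1.0) for i in range(2)])  # frozen P_b

num = np.zeros(3)
for k in range(3):                           # central finite differences
    d = np.zeros(3); d[k] = 1e-5
    num[k] = (backed_up_value(omega + d, phi, p_b)[0]
              - backed_up_value(omega - d, phi, p_b)[0]) / 2e-5
print(np.allclose(num, analytic_grad(omega, phi, p_b), atol=1e-6))   # expect True
```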
