Measuring the Solution Strength of Learning Agents in Adversarial Perfect Information Games
Total Page:16
File Type:pdf, Size:1020Kb
Measuring the Solution Strength of Learning Agents in Adversarial Perfect Information Games Zaheen Farraz Ahmad, Nathan Sturtevant, Michael Bowling Department of Computing Science University of Alberta, Amii Edmonton, AB fzfahmad, nathanst, [email protected] Abstract were all evaluated against humans and demonstrated strate- gic capabilities that surpassed even top-level professional Self-play reinforcement learning has given rise to capable human players. game-playing agents in a number of complex domains such A commonly used method to evaluate the performance of as Go and Chess. These players were evaluated against other agents in two-player games is to have the agents play against state-of-the-art agents and professional human players and have demonstrated competence surpassing these opponents. other agents with theoretical guarantees of performance, or But does strong competition performance also mean the humans in a number of one-on-one games and then mea- agents can (weakly or strongly) solve the game? Or even suring the proportions of wins and losses. The agent which approximately solve the game? No existing work has con- wins the most games against the others is said to be the more sidered this question. We propose aligning our evaluation of capable agent. However, this metric only provides a loose self-play agents with metrics of strong/weakly solving strate- ranking among the agents — it does not provide a quantita- gies to provide a measure of an agent’s strength. Using small tive measure of the actual strength of the agents or how well games, we establish methodology on measuring the strength they generalize to their respective domains. of a self-play agent and its gap between a strongly-solving Alternatively, in the realm of two-player, perfect- agent, one which plays optimally regardless of an opponent’s decisions. We provide metrics that use ground-truth data from information games, one can evaluate the strength of an agent small, solved games to quantify the strength of an agent and with regard to certain solution concepts. These solution con- its ability to generalize to a domain. We then perform an anal- cepts are defined with respect to the game theoretical value ysis of a self-play agent using scaled-down versions of Chi- obtainable by different strategies (Allis et al. 1994). For in- nese checkers. stance, a “strong” agent would be able to obtain the value of a position of a game from any legal position while a “weaker” agent can do so only in a smaller subset of states. Introduction These measures of an agent’s skill are more principled and Adversarial games have become widely used environments informative than solely their ranking with regards to other for developing and testing the performance of learning agents. However, finding such measures are computation- agents. They possess many of the same properties of real- ally expensive and usually require exhaustively searching world decision-making problems in which we would want over all sequences of play. While they have been used in a deploy said agents. However, unlike highly complex real- number of games to different degrees of success (Allis et al. world environments, games can be more readily modelled 1994; Schaeffer et al. 2007), the computational requirements and have very clear notions of success and failure. As prohibit them from being employed in larger games such as such, games make an excellent training ground for design- Chess or Go. ing and evaluating AI intended to be scaled to more realistic We focus our investigation solely on two-player, zero- decision-making scenarios. sum, perfect-information games. In this paper, we propose aligning our understanding of the strength of a player using Recent advances in AI research have birthed a number metrics of strongly/weakly solved and error rates. We build of agents that imitate intelligent behavior in games with an AlphaZero agent that learns to play Chinese checkers and large state spaces and complex branching factors. Through a use it to learn to play small board sizes. We then use an ex- combination of different techniques and approaches, highly isting Chinese checkers solver to evaluate the strength of the performant agents were developed to play games such AlphaZero player using ground truth data. as Checkers (Schaeffer et al. 1992), Chess (Campbell, Hoane Jr, and Hsu 2002; Silver et al. 2018), Poker (Bowl- ing et al. 2017; Moravcˇ´ık et al. 2017), Go (Silver et al. 2016, Background 2017, 2018) and Starcraft (Vinyals et al. 2019). These agents Adversarial Games Copyright © 2021, Association for the Advancement of Artificial An adversarial game is a sequential decision-making set- Intelligence (www.aaai.org). All rights reserved. ting in which two players alternate taking actions at differ- ent game states until a terminating state is reached at which or draw but there need not be a strategy produced that guar- point each of the players observe some utility. We model a antees this outcome. For example, for the game of Hex, it zero-sum, deterministic, perfect information game as a set has been proved that the first player will win on all square of states, S, where at each state a player i 2 f0; 1g, se- boards given both players play optimally. However, there is lects an action from the set of all actions, A.A policy is a no constructive strategy that is guaranteed to reach this out- mapping π : S! ∆A from a state to a distribution over ac- come1. tions and we say π(ajs) is the probability of taking an action A weakly solved game is one where there is a strategy a 2 A at s 2 S. There is a deterministic transition function that guarantees the game-theoretic value of the game when T : S × A ! S that returns a new state s0 2 S when action both players play optimally. Checkers is a weakly-solved a is taken at s. game (Schaeffer et al. 2007) and is shown to be a draw for We define Z ⊂ S to be the set of all the terminal states. both players under perfect play — there is an explicit strat- At a terminal state, no players can take an action and the egy that is shown to never lose and will always at least draw game ends. A utility function u : S! R associates a real- with any opponent. However, a strategy that weakly solves valued utility (or reward) to each state so that the utility ob- the game does not necessitate that the player following said served by each player i at a state s is ui(s); 8i. The utility for strategy will capitalize on situations where the opponent all non-terminal states is 0 and the utility of a terminal state plays sub-optimally. For instance, if the opponent makes a can be 1, 0 or -1. These are known as the game-theoretic val- mistake and moves to a losing position, a strategy that guar- ues of the terminal states. A utility of 1 signifies a win, -1 a antees a draw will not necessarily win from there. loss and 0 represents a draw (if there are any.) In a zero-sum In a strongly solved game, the game theoretic value of ev- game the utilities of both players sum to 0 (u1(s) = −u2(s)) ery legal position is known and there is a strategy that guar- and so if one player wins the other will lose. Typically in antees that a player will observe that outcome from that po- most combinatorial games, the player who takes an action sition. Examples of solved games are Tic-tac-toe, Connect- that transitions to a terminal state is the winner. Four (Allis 1988) and smaller variants of Chinese check- Each state is also associated with a state-value, V (s), ers (Sturtevant 2019). A player must capitalize on any mis- which is the expected utility observed by a player reaching take an opponent may make and play optimally from every state s and then assuming both players follow their respec- legal position. tive policies until a terminal state is reached where V (z) = Strongly solving a game requires exhaustively enumer- u(z); 8z2Z . An action-value or Q-value is the value asso- ating all sequences of play from the initial position of the ciated with a state-action pair. We denote the Q-value of a game and, through backwards induction, eliminating from state-action pair as Q(s; a) and it is defined as the expected the strategy paths that lead to sub-optimal outcomes. Pre- utility of taking action a at s and then having both players dictably, the intense computational requirements currently follow their policies until termination. As we are only ex- renders it intractable to solve games such as Chess or Go amining games with deterministic transitions, we can define due to the sheer size of their state space (approximately 1050 170 Q(s; a) = V (T (s; a)) and V (s) = a2A π(ajs) · Q(s; a). states for Chess and 10 for Go.) However, knowledge that The goal of an agent in a game is to maximize the utility it a strategy is weakly or strongly solving is an irrefutable mea- achieves when playing the game (thatP is, it strives to win the sure of the competency of an agent’s behavior and so it is game.) An agent does so by finding policies that selects ac- desirable to solve games. tions with higher Q-values at any state of the game ensuring that it has a higher chance of reaching terminal states that Related Work provide positive utilities.