
Beyond Minimax: Nonzero-Sum Game Tree Search with Knowledge Oriented Players

Tian Sang and Jiun-Hung Chen
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
{sang,jhchen}@cs.washington.edu

Abstract

The classic minimax search for two-player zero-sum games such as chess has been thoroughly studied since the early years of AI; however, for more general nonzero-sum games, minimax is non-optimal, given a player's knowledge about the opponent. Previously, a few opponent models and algorithms such as M∗ were introduced to improve minimax by simulating the opponent's search, given the opponent's strategy. Unfortunately, they all assume that one player knows the other's strategy but not vice versa, which is a very special relationship between knowledge and strategies. In this paper, we characterize minimax and M∗ and show examples where they can be non-optimal. We propose a new theoretical framework of Knowledge Oriented Players (KOP), using the powerful S5 axiom system to model and reason about players' knowledge. With KOP, the general relationship between knowledge and strategies can be analyzed, which is not handled by traditional game theory. We show how strategies are constrained by knowledge and present new strategies for various problem settings. We also show how to achieve desired knowledge update via communication so that players can achieve the best possible outcome. With respect to players' knowledge, our strategies always dominate minimax and work well for the general knowledge cases where previous algorithms do not apply.

Introduction

Minimax search on two-player game trees (Shannon 1950) has been a classic topic, thoroughly studied since the early years of AI. Assuming players have perfect information of game trees and play perfectly, minimax is the optimal strategy in zero-sum games including chess, where both players use the same evaluation function for all leaf nodes. It has been one of the bases for the hugely successful computer chess programs, and eventually IBM Deep Blue defeated the human world chess champion Kasparov in 1997.

However, minimax is no longer optimal in more general nonzero-sum games, where players can have different evaluation functions that are not necessarily correlated. For example, a leaf good for one player is not necessarily good or bad for the other. Intuitively, if a player knows the opponent's evaluation function, he could potentially do better by taking into account the opponent's possible responses. This is done in the Opponent Models (OM) presented in (Carmel & Markovitch 1993; Iida et al. 1993). It is assumed that player A uses the evaluation function EA while B uses EB, and A knows that B uses EB but not vice versa. So B plays minimax using EB, but A plays OM search, which takes advantage of knowing B's exact response to every possible move by A. Given that assumption, OM search dominates minimax for A.

M∗ search (Carmel & Markovitch 1993; 1996) is a recursive extension of OM search, in which each player has his own OM of the other, but A's OM strictly contains B's OM. For example, both A and B know the initial strategy of each other, but A also knows that B knows A's initial strategy; so A knows that B will adjust his strategy to take advantage of A's initial strategy, and thus A will adjust his initial strategy as well to take advantage of B's behavior. Their opponent models can be nested to any level, but A always knows B's OM, so that A can use M∗ search, which accurately predicts B's responses and thus is at least as good as minimax.

Unfortunately, both OM search and M∗ search assume knowledge asymmetry: A's knowledge strictly subsumes B's knowledge. In particular, they assume that A has correct knowledge about B's strategy while B has no knowledge or wrong knowledge about A's strategy, which is a very strong and "unfair" assumption, not necessarily true in general. These issues were partly addressed in (Donkers 2004), in which the author suggested considering opponent models with knowledge symmetry and showed an example in chess where, with knowledge symmetry, two players could reach a state of common interests, though the notion of knowledge symmetry was fairly vague.

In this paper, we propose a new theoretical framework of Knowledge Oriented Players (KOP), capable of modeling both knowledge asymmetry and knowledge symmetry by the powerful S5n axiom system. With KOP, we are able to define and analyze the general relationship between two key characteristics of players, knowledge and strategies, in finite-horizon two-player nonzero-sum games. As an important step forward, we show how strategies are constrained by knowledge. We present new strategies, M^h and HM^0M^d, for various problem settings. They dominate minimax with respect to players' knowledge and are much more general than M∗ search. We also propose communication schemes that can achieve desired knowledge update, and show how they can help players select better strategies.

Figure 1 (Cox & Raja 2007) can be used to illustrate the high level playing processes of a player in nonzero-sum games. At the meta-level, the player needs to do metareasoning to select his best strategies, based on his knowledge about the possible outcomes, both players' preferences over the outcomes, both players' knowledge about each other's knowledge, etc. At the object level, actions (moves) are selected for the current game state, based on the selected strategies. At the ground level, actions (moves) are executed; the corresponding effects/outcomes are observed and passed to the object level, where they are further interpreted/evaluated and passed to the meta-level for knowledge/strategy update.

In the rest of the paper, first we describe minimax and M∗ and illustrate their differences by example; then we introduce our KOP framework, new algorithms, communication schemes and analytical results; finally we conclude.

[Figure 1: The high level playing processes of a player (after Cox & Raja 2007). The figure shows three layers: the ground level (doing), the object level (reasoning) and the meta-level (metareasoning), linked by action selection and perception between the ground and object levels, and by strategy selection and monitoring between the object level and the meta-level, where multistrategy metareasoning takes place.]

Minimax and M∗ Algorithms

For ease of understanding, we will describe the minimax (Shannon 1950) and the M∗ (Carmel & Markovitch 1996) algorithms in a bottom-up fashion, which is equivalent to the top-down description and can be implemented as depth-first search. We assume that both players know the game tree perfectly but have different evaluation functions for the leaf nodes, which are the possible outcomes of the game. Throughout this paper we will use the following notations.

Objects:
t: root of a finite game tree, also used to represent the tree.
n: a node of t. At a non-leaf node, we know who moves.
l: a leaf node of t, as a possible outcome of the game.
A: max-player, who maximizes the evaluation function EA.
B: min-player, who minimizes the evaluation function EB.

Function values (references that can be updated):
EA, EB: real-valued evaluations for leaves. Without loss of generality, we assume that there is a fixed deterministic tie-break scheme reflected in leaf values, so we have distinct leaf values: ∀l ≠ l′, Ei(l) ≠ Ei(l′), i ∈ {A, B}.
C(n): set of pointers to the children of n.
height(n): height of node n. height(l) = 0 for leaf l; otherwise, height(n) = 1 + max_c height(c), c ∈ C(n).
next(n): pointer to a child of n, or null for a leaf.
value(n): value of node n, which is a pair (a, b) where a and b are the evaluations to A and B respectively.

Definition 1 A strategy S is an algorithm that, given a game tree t, ∀n ∈ t computes next(n) and value(n) and maintains an invariant for all non-leaf n:
(∃c ∈ C(n), next(n) = c) ∧ value(n) = value(next(n))

According to the estimate by a given strategy S, at node n the move is next(n) and the estimated outcome is value(n); so the first move is next(t) and the final outcome is value(t).
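To make these objects concrete, here is a minimal Python sketch (ours, not the paper's; all names are hypothetical) of the tree representation and the invariant of Definition 1, with a strategy given as next/value tables. The later sketches in this paper reuse these definitions.

    from typing import Dict, Optional, Tuple

    class Node:
        """A game-tree node: player is 'A', 'B', or None for a leaf."""
        def __init__(self, player: Optional[str], children=None, value=None):
            self.player = player
            self.children = children or []   # C(n)
            self.value = value               # (a, b) = (EA(l), EB(l)) at a leaf

    Value = Tuple[float, float]

    def leaf(a: float, b: float) -> Node:
        return Node(None, value=(a, b))

    def height(n: Node) -> int:
        """height(l) = 0 for a leaf; otherwise 1 + the maximum over children."""
        return 0 if n.player is None else 1 + max(height(c) for c in n.children)

    def satisfies_definition_1(n: Node, nxt: Dict[Node, Node],
                               val: Dict[Node, Value]) -> bool:
        """Definition 1's invariant at every node of the subtree rooted at n:
        for non-leaf n, next(n) is a child of n and value(n) = value(next(n))."""
        if n.player is None:
            return val[n] == n.value
        return (nxt[n] in n.children
                and val[n] == val[nxt[n]]
                and all(satisfies_definition_1(c, nxt, val) for c in n.children))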

The Minimax Algorithm

Algorithm 1: minimax(t, EA)
  for every leaf l
    next(l) := null; value(l) := (EA(l), EA(l))
  for h = 1 to height(t)
    for every node n with height h
      if A moves at n // to maximize
        next(n) := argmax_c value(c).a, c ∈ C(n)
      else // B moves, to minimize
        next(n) := argmin_c value(c).a, c ∈ C(n)
      value(n) := value(next(n))

Algorithm 1 is minimax, where A uses EA to maximize and assumes that B uses EA to minimize. We assume that all nodes with the same height can be accessed sequentially.

Definition 2 A play for players A and B on a game tree t where A moves first is a function that maps the input (t, EA, EB, SA, SB) to a leaf value, where SA and SB are strategies for A and B. That is, play(t, EA, EB, SA, SB) = (a, b), where a = EA(l), b = EB(l), l ∈ t.

How play works is trivial: starting from the root, at node n it follows next(n) by SA if A moves and next(n) by SB otherwise, until it finally reaches a leaf and returns the value of it.

In a zero-sum game with EA = EB, it is well known that if it is a perfect information game and both A and B play perfectly, minimax is optimal in terms of Subgame Perfect Equilibrium (SPE) in game theory (Kuhn 1953). That means both players will stay with minimax to get the best outcome:

∀t, EA, SA, play(t, EA, EB, minimax, minimax).a ≥ play(t, EA, EB, SA, minimax).a   (1)
∀t, EA, SB, play(t, EA, EB, minimax, minimax).b ≤ play(t, EA, EB, minimax, SB).b   (2)

In a nonzero-sum game with EA ≠ EB, minimax is no longer optimal, because it wrongly assumes that both players use the same evaluation function. Nonetheless, A's minimax does guarantee the worst-case outcome for A, because it proceeds as if B would always choose the worst possible moves against A. Therefore, minimax is used as the baseline for comparisons in our examples. More generally, we consider imperfect information nonzero-sum games, in which players can have incomplete mutual knowledge and thus SPE does not apply. A player's best strategy often depends on his knowledge about the opponent, so we describe the M∗ algorithm for nonzero-sum games next.
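Continuing the sketch above (again our own illustration, not the paper's code), Algorithm 1 and the play of Definition 2 can be written as follows. minimax_strategy takes a single leaf-evaluation function E and returns the next pointers it computes.

    def minimax_strategy(root: Node, E) -> Dict[Node, Node]:
        """Algorithm 1, bottom-up: A maximizes E and B minimizes E.
        E maps a leaf's (a, b) pair to a real; Algorithm 1 uses E = EA
        for both players' moves."""
        nxt: Dict[Node, Node] = {}
        def val(n: Node) -> float:
            if n.player is None:
                return E(n.value)
            vals = [val(c) for c in n.children]
            best = max(vals) if n.player == 'A' else min(vals)
            nxt[n] = n.children[vals.index(best)]
            return best
        val(root)
        return nxt

    def play(root: Node, SA: Dict[Node, Node], SB: Dict[Node, Node]) -> Value:
        """Definition 2: follow next(n) by SA at A's nodes and by SB at B's
        nodes, and return the (a, b) value of the leaf that is reached."""
        n = root
        while n.player is not None:
            n = (SA if n.player == 'A' else SB)[n]
        return n.value

A's minimax of Algorithm 1 is minimax_strategy(t, lambda v: v[0]); B's own minimax, used below as the base case of M∗, is minimax_strategy(t, lambda v: v[1]).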

The M∗ Algorithm

Algorithm 2 is equivalent to the recursive M∗ algorithm described in (Carmel & Markovitch 1996). We often refer to M∗(..., d) by M^d, to OM search by M^1 and to minimax by M^0.

Algorithm 2: M∗(t, EA, EB, caller, d) // requires d ≥ 0
  if (caller = A) XOR (d is even)
    minimax(t, EA); baseA := true // base case: A's minimax
  else
    minimax(t, EB); baseA := false // base case: B's minimax
  if d ≥ 1
    for every leaf l // update leaf values for simulation
      value(l) := (EA(l), EB(l))
    for i := 1 to d // run alternated simulations for A and B
      for h := i to height(t)
        for every node n with height h
          if (A moves at n) and (i is odd XOR baseA)
            // now at max-node in simulation for A
            next(n) := argmax_c value(c).a, c ∈ C(n)
          else if (B moves at n) and (i is even XOR baseA)
            // now at min-node in simulation for B
            next(n) := argmin_c value(c).b, c ∈ C(n)
          value(n) := value(next(n))

M∗ assumes knowledge asymmetry: A's OM contains B's OM, so A playing M^d means B playing M^(d−1), and the outcome is play(t, EA, EB, M^d, M^(d−1)). To compute it, M∗ needs a series of simulations: A's M^d, B's M^(d−1), A's M^(d−2), ..., down to A's M^0 or B's M^0 (depending on the caller XOR d's parity). Therefore, algorithm 2 runs alternated bottom-up simulations from M^0 to M^(d−1). M∗ tries to take advantage of the opponent's strategy, and it dominates minimax for A under certain conditions, as stated in (Carmel & Markovitch 1996), which can be rephrased as:

∀t, EA, EB, d ≥ 1, play(t, EA, EB, M^d, M^(d−1)).a ≥ play(t, EA, EB, minimax, M^(d−1)).a   (3)

Limitations of M∗. (I) In essence, M∗ is the same as OM search, because they both assume that A knows B's strategy but not vice versa, no matter how complicated the strategy can be. Therefore in M∗, B's strategy does not have to be the recursive form of minimax at all. (II) The assumption of M∗ is too strong. Proposition 3 does not guarantee any bound for B, so why would B even use such a strategy? Instead, it is quite possible for B to use an alternative strategy that we will show later, which is guaranteed to be at least as good as minimax according to B's knowledge.

An Illustrative Example of Nonzero-sum Games

The following example is used to illustrate how minimax (M^0), OM search (M^1) and M∗ search work. Both players know the game tree and their own evaluation function EA or EB. We assume that EA and EB are atomic: either fully known or unknown, and players use M^d, d ≥ 0.

[Figure 2: A nonzero-sum game tree. Max-player A tries to maximize EA and min-player B tries to minimize EB. The root (A to move) has branches R, S and T; the subtree under S contains the max-nodes U and V and the min-nodes W, X, Y and Z. The leaves carry (EA, EB) pairs: EA 4 8 2 9 1 over EB 7 9 5 4 5 at the upper level, and EA 7 11 5 10 3 1 6 2 over EB 4 5 6 3 8 9 7 2 at the lower level.]

In Figure 2, triangles represent max-nodes where max-player A moves, inverted triangles represent min-nodes where min-player B moves, and squares represent leaves. A and B have different evaluation functions EA and EB for leaves. A leaf is referenced by the pair of its EA and EB values, while an internal node is referenced by its label. A's goal is to maximize EA of the outcome and B's goal is to minimize EB of it. We consider the following scenarios, in which the players have different imperfect mutual knowledge.

(1) A does not know EB and B does not know EA. Since A has no idea of EB, A uses minimax (M^0) against the worst case, assuming B plays minimax using EA too (which is wrong). By A's minimax from the root, the R branch yields EA = 4, the S branch yields EA ≤ 2, and the T branch yields EA = 1. So A moves to R. At R, B moves to (4,7), which B prefers to (8,9). The final outcome is (4,7).

(2) A knows EB and B does not know EA. Without idea of EA, B uses M^0 with EB. Knowing EB, A uses M^1, which takes into account B's M^0. At the root, A can choose to:
(2.1) move to R. A knows that B will choose (4,7) at R.
(2.2) move to T. A knows that B will choose (9,4) at T.
(2.3) move to S. To determine B's response at S, A simulates B's M^0: for B, the values are W–(7,4), X–(10,3), Y–(3,8), Z–(2,2), U–(7,4), V–(3,8) and S–(7,4). So A knows that at S B will move to U, and at U A will move to X, because A knows that if he moves to W then B will move to (7,4), not preferred by A. At X, B will move to (10,3).
Considering all three choices, A moves from the root to S, which is the best for A. The final outcome is (10,3).

(3) A knows EB and B knows EA. Since A's knowledge is the same as in (2), A does the same reasoning and still moves to S. Knowing EA, B now uses M^1 (instead of M^0 as in (2)), which takes into account A's M^0. From B's point of view, according to A's M^0, some node values are Y–(1,9), Z–(2,2) and V–(2,2). So B moves from S to V (out of expectation by A in (2)), expecting to reach Z and then (2,2), which has the best EB value. At V, however, A moves to Y instead of Z (out of expectation by B), expecting to reach (3,8), because A knows that if he moves to Z then B will move to (2,2), bad for A. At Y, B moves to (3,8), and the final outcome is (3,8), which is neither A's nor B's initial expectation.

(4) A knows EB, B knows EA, and A knows that B knows EA. Again A's knowledge subsumes B's knowledge, so as in (2), A can correctly simulate and predict B's moves.

Using M^2, which takes into account B's M^1, A knows that if he moves to S, he will finally reach (3,8) as in (3), instead of (10,3) as in (2); so A moves to T for (9,4), which is the best achievable outcome given B's knowledge. Since at T B prefers (9,4) to (1,5), the final outcome is (9,4).

Summary of Results and Motivation. We use a shorthand of play (see Definition 2) with only strategy parameters. The outcomes of the scenarios are: (1) play(M^0, M^0) = (4,7); (2) play(M^1, M^0) = (10,3); (3) play(M^1, M^1) = (3,8); (4) play(M^2, M^1) = (9,4). The outcome value of (1) by minimax is the baseline. Scenarios (2) and (4) are better than (1) for A, consistent with proposition 3. However, scenario (3) is worse than (1) for both A and B, even though they have more knowledge in (3) than in (1). That violates our intuition that "the more knowledge, the better the outcome": M∗ does not guarantee any bound on the outcomes when there is no strict knowledge asymmetry. Yet we must have working strategies for all the cases, and we must model the complicated knowledge scenarios (including knowledge about evaluation functions, strategies, knowledge about knowledge, etc.) in a rigorous way that can be reasoned about. This is our motivation for a new general framework that models players with knowledge.
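The four scenarios can be replayed with the earlier sketch. The tree below is our reconstruction of Figure 2 from the walkthrough (the topology is inferred from the text, so treat it as an assumption), and strategy_M is a simplified stand-in for Algorithm 2: M^0 is the player's own minimax, and M^d best-responds to a simulated opponent playing M^(d−1).

    # Figure 2, reconstructed from the text: (a, b) = (EA, EB) at each leaf.
    W = Node('B', [leaf(7, 4), leaf(11, 5)])
    X = Node('B', [leaf(5, 6), leaf(10, 3)])
    Y = Node('B', [leaf(3, 8), leaf(1, 9)])
    Z = Node('B', [leaf(6, 7), leaf(2, 2)])
    U, V = Node('A', [W, X]), Node('A', [Y, Z])
    R = Node('B', [leaf(4, 7), leaf(8, 9)])
    S = Node('B', [U, leaf(2, 5), V])
    T = Node('B', [leaf(9, 4), leaf(1, 5)])
    t = Node('A', [R, S, T])

    def strategy_M(root: Node, player: str, d: int) -> Dict[Node, Node]:
        """M^d for `player`: M^0 is his minimax; for d >= 1 the opponent is
        modeled as playing M^(d-1) and the player best-responds to it."""
        me = 0 if player == 'A' else 1
        pick = max if player == 'A' else min
        if d == 0:
            return minimax_strategy(root, lambda v: v[me])
        opp = 'B' if player == 'A' else 'A'
        model = strategy_M(root, opp, d - 1)   # simulated opponent strategy
        nxt: Dict[Node, Node] = {}
        def val(n: Node) -> Value:
            if n.player is None:
                return n.value
            if n.player == player:             # own move: optimize own component
                vals = [val(c) for c in n.children]
                best = pick(vals, key=lambda v: v[me])
                nxt[n] = n.children[vals.index(best)]
                return best
            nxt[n] = model[n]                  # predicted opponent move
            return val(model[n])
        def walk(n: Node):                     # make next(n) total, as in Algorithm 2
            if n.player is not None:
                val(n)
                for c in n.children:
                    walk(c)
        walk(root)
        return nxt

    for dA, dB in [(0, 0), (1, 0), (1, 1), (2, 1)]:
        print((dA, dB), play(t, strategy_M(t, 'A', dA), strategy_M(t, 'B', dB)))
    # Prints (4, 7), (10, 3), (3, 8) and (9, 4), matching scenarios (1)-(4).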

The Framework of Knowledge Oriented Players (KOP)

We propose a new framework of Knowledge Oriented Players (KOP), which includes all the elements that determine the outcomes of finite-horizon nonzero-sum game trees and allows us to reason about knowledge and strategies. We should emphasize that, since the players' mutual knowledge can be imperfect in general, the notion of SPE for perfect information games in game theory (Kuhn 1953; Fudenberg & Tirole 1991) does not apply here in general.

Definition 3 A Knowledge Oriented Player (KOP) P is a 3-tuple (EP, SP, K(P)), where EP is the evaluation function, SP is the strategy, and K(P) is the knowledge of P represented by a set of formulas, that is, K(P) = {ϕ | KP ϕ} (KP ϕ is read as "P knows ϕ"). K(P) is required to be a maximal consistent set in the S5n axiom system.

The S5n axiom system is sound and complete; its main axioms are listed below (with n = 2 and P = A, B). For more details, please refer to (Halpern & Moses 1992).
A1. All tautologies of the propositional calculus.
A2. (KP ϕ ∧ KP(ϕ ⇒ ψ)) ⇒ KP ψ (deduction).
A3. KP ϕ ⇒ ϕ (the knowledge axiom).
A4. KP ϕ ⇒ KP KP ϕ (positive introspection).
A5. ¬KP ϕ ⇒ KP ¬KP ϕ (negative introspection).

We use KP ϕ as a proposition for "ϕ ∈ K(P)". We have KP true ∧ ¬KP false by the axioms. We assume that (a) each player P is a KOP; (b) EP and SP are atoms that can be fully known or unknown: KP EP ↔ ∀t ∀l ∈ t ∀r ∈ Real, (KP(EP(l) = r) ∨ KP(EP(l) ≠ r)), and KP SP ↔ ∀t ∀n ∈ t, (C(n) = ∅) ∨ (∀c ∈ C(n), KP(next(n) by SP = c) ∨ KP(next(n) by SP ≠ c)); (c) KP EP ∧ KP SP.

Common knowledge is incorporated in KOP. A formula ϕ is common knowledge iff everybody knows ϕ, everybody knows that everybody knows ϕ, and so on (van Benthem, van Eijck, & Kooi 2005). Any common knowledge ϕ is just viewed as part of the tautologies included in axiom A1.

Definition 4 A formula set F is a core of K(P) iff F can be extended to the maximal consistent set K(P) by using the axioms of S5n; cores-K(P) is the set of all cores of K(P).

By definition 4, a potentially infinite K(P) can be represented by a core F, which can be small but not necessarily minimal, though K(P) may have many minimal cores.

Definition 5 A strategy S is t-selfish for player P on game tree t, iff ∀n ∈ t where P moves at n, if P is a max-player then value(n).a = max_c value(c).a; otherwise value(n).b = min_c value(c).b, c ∈ C(n). S is selfish for P iff ∀t, S is t-selfish for P; P is selfish iff P always uses selfish strategies.

Naturally, players are always assumed to be selfish, and they have common knowledge that everybody is selfish. Although we only discuss two-player games in this paper, KOP works for multi-player games and most of our results can be extended to them.

Definition 6 Given selfish players A, B, a game tree t and n ∈ t, (A, B, t) K-decides n, iff n is a leaf, or A and B agree on next(n) and value(n), by K(A) and K(B).

We use "n is decided" as a shorthand for (A, B, t) K-decides n. Intuitively, there can be some decided nodes (i.e., game states) in the game tree such that both players agree on what will happen afterwards once the game is in one of these states. Players can reason about decided nodes by their knowledge. Suppose A moves at n, with leaves l1 and l2 as n's children. Since A is selfish, A chooses l1 if EA(l1) ≥ EA(l2), and otherwise l2. So given t and EA, n is decided for A, but whether n is decided for B depends on B's knowledge: B will be certain of A's choice if B knows EA, i.e., KB EA. Similarly, consider n's parent p, where B moves and has another choice n′ with the same height as n: p is decided if n and n′ are decided, plus (1) B knows where to move at p (trivial for B, because B just chooses the one with the smaller EB value) and (2) A knows where B moves at p; so the necessary knowledge for A is that A knows that B knows that n and n′ are decided, which is equivalent to KA KB EA. The same reasoning applies to the nodes all the way up the tree. Therefore, we can derive the following algorithm.

The Decide Algorithm

Algorithm 3 computes whether n is decided, and next(n) and value(n) for decided n, by checking a class of knowledge: if KA EB ∧ KB EA, then all the nodes with height one are decided; if KA EB ∧ KB EA ∧ KB KA EB ∧ KA KB EA, then all the nodes with height two are decided; and so on.

Algorithm 3: decide(A, B, t, n)
  // returns true or false, and computes next(n) and value(n)
  if height(n) = 0 return true // leaves are don't cares
  if height(n) = 1
    if A moves at n ∧ KB EA
      next(n) := argmax_c value(c).a, c ∈ C(n)
    else if B moves at n ∧ KA EB
      next(n) := argmin_c value(c).b, c ∈ C(n)
    else return false // negative base case
    value(n) := value(next(n))
    return true // positive base case
  if ∃c ∈ C(n), ¬decide(A, B, t, c)
    return false // not all children decided
  if n has only one child c
    next(n) := c; value(n) := value(c)
    return true // the only child is decided
  if A moves at n ∧ ∀c ∈ C(n), KB(decide(A, B, t, c))
    next(n) := argmax_c value(c).a, c ∈ C(n)
  else if B moves at n ∧ ∀c ∈ C(n), KA(decide(A, B, t, c))
    next(n) := argmin_c value(c).b, c ∈ C(n)
  else return false // negative recursive case
  value(n) := value(next(n))
  return true // positive recursive case

Levels of Knowledge: f(A, d)

Next, we directly construct a set f that includes the desired class of formulas, in regular-expression form, given a level parameter d ≥ 0:

f(A, 0) = {KA EA, KA SA};
f(A, d) = {KA (KB KA)^⌊(d′−1)/2⌋ EB | 1 ≤ d′ ≤ d} ∪ {(KA KB)^⌊d′/2⌋ EA | 2 ≤ d′ ≤ d} ∪ f(A, 0)   (4)

Similarly, f(B, d) can be defined by switching A and B. For the different scenarios of the example in Figure 2, the cores of knowledge can be expressed as follows:
(1) A: f(A, 0) = {KA EA, KA SA}; B: f(B, 0) = {KB EB, KB SB}
(2) A: f(A, 1) = f(A, 0) ∪ {KA EB}; B: f(B, 0)
(3) A: f(A, 1); B: f(B, 1) = f(B, 0) ∪ {KB EA}
(4) A: f(A, 2) = f(A, 1) ∪ {KA KB EA}; B: f(B, 1)

By our construction, f(A, d) has some nice properties; for example, the following can be verified: ∀d ≥ 1,

f(A, d − 1) ⊆ f(A, d) ∧ f(B, d − 1) ⊆ f(B, d)   (5)
f(A, d) ⊆ K(A) → f(B, d − 1) ⊆ K(B)   (6)
f(A, d) ∈ cores-K(A) → f(A, d − 1) ∉ cores-K(A)   (7)

Property 5 is trivial. Property 6 holds because the first KA of the formulas in f(A, d) can be removed by axiom A3, which yields formulas in f(B, d − 1). Axioms A1-A5 cannot increase alternations of K, and the largest number of alternations of K in f(A, d) is d: the corresponding formula cannot be extended from f(A, d − 1), so property 7 holds.

Using these "levels of knowledge", we can give some formal results for algorithm 3, stated as theorem 1.

Theorem 1 For algorithm 3, ∀A, B, t, n ∈ t, d ≥ 1,
1.1 If decide(A, B, t, n) returns true, then (A, B, t) K-decides n, and decide(A, B, t, n) computes the next(n) and value(n) that A and B agree on.
1.2 If f(A, d) ⊆ K(A) ∧ f(B, d) ⊆ K(B) ∧ height(n) ≤ d, then decide(A, B, t, n) returns true.
1.3 If ∃d ≥ 1, f(A, d) ∈ cores-K(A), then decide(A, B, t, n) returns true iff (A, B, t) K-decides n.

Proof sketch: By 1.1, decide is sound; this holds because it exactly follows the reasoning process for decided nodes, as described before. 1.2 can be proved by mathematical induction: n = 1 is trivial; assume it holds for n = m; for n = m + 1, f(P, m + 1) has one formula with m + 1 alternations of K, which makes the knowledge check in decide true, so it returns true. By 1.3, decide is sound and complete if f(A, d) is a core of K(A); this holds because when A has no extra knowledge for reasoning, the only nodes that A and B can agree on are those decided by the alternations of K and those with a single decided child, as done in decide. □

Definition 7 A strategy S is t-informed for player P on game tree t, iff ∀n ∈ t, if (A, B, t) K-decides n then S computes the next(n) and value(n) that A and B agree on. S is informed for P iff ∀t, S is t-informed for P; P is informed iff P always uses informed strategies.

Players are always assumed to be informed and to have common knowledge of that fact. Therefore a player's strategy is always constrained by his knowledge; that is, his strategy has to be selfish and informed for him if he is rational.

Definition 8 Given selfish and informed players A and B, a game tree t where A moves first, and strategies S and S′, S K-dominates S′ on t for A, iff by K(A) A knows that play(t, EA, EB, S, SB).a ≥ play(t, EA, EB, S′, SB).a. S K-dominates S′ for A, iff by K(A) A knows that S K-dominates S′ on t for all t where A moves first.

In definition 8, according to K(A), A knows that SB is constrained by K(B) in a way that makes using S better than S′ for A, in contrast to an unconstrained SB.

The M^h and Hybrid-minimax-M^h Algorithms

In this section we present two algorithms that K-dominate minimax, given players' complete or incomplete knowledge. Furthermore, we show how communication can help update knowledge from incomplete to complete for better outcomes. The new algorithms work well for the general knowledge scenarios (e.g., case (3) of the example in Figure 2) where M∗ does not apply. First we study the case where every node is decided and thus the decide algorithm can be simplified to the M^h algorithm.

The M^h Algorithm

Algorithm 4: M^h(t, EA, EB) // computes all next(n) and value(n)
  for every leaf l
    next(l) := null; value(l) := (EA(l), EB(l))
  for h = 1 to height(t)
    for every node n with height h
      if A moves at n // to maximize
        next(n) := argmax_c value(c).a, c ∈ C(n)
      else // B moves, to minimize
        next(n) := argmin_c value(c).b, c ∈ C(n)
      value(n) := value(next(n))

Compared to minimax, M^h has another parameter, EB, which enables it to determine the true leaf values for B, rather than assuming EA = EB. M^h correctly determines both players' optimal moves if they have enough knowledge about their evaluation functions.
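As a sketch (ours, under the same assumptions as the earlier code), Algorithm 4 is a single bottom-up pass: it propagates full (EA, EB) pairs, and the only difference from minimax_strategy is that B's choices use the EB component. On the reconstructed Figure 2 tree it yields the root value (10,3).

    def strategy_Mh(root: Node) -> Dict[Node, Node]:
        """Algorithm 4 (M^h): A maximizes EA at A's nodes and B minimizes EB
        at B's nodes, over the true (EA(l), EB(l)) leaf pairs."""
        nxt: Dict[Node, Node] = {}
        def val(n: Node) -> Value:
            if n.player is None:
                return n.value
            vals = [val(c) for c in n.children]
            best = (max(vals, key=lambda v: v[0]) if n.player == 'A'
                    else min(vals, key=lambda v: v[1]))
            nxt[n] = n.children[vals.index(best)]
            return best
        val(root)
        return nxt

    Mh = strategy_Mh(t)
    print(play(t, Mh, Mh))   # (10, 3): both players following M^h on Figure 2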

We can view M^h as the nonzero-sum version of minimax, or K-minimax: M^h determines the best outcome for selfish and informed players with respect to their knowledge.

Theorem 2 Given selfish and informed players A, B, and a game tree t, ∀n ∈ t, if decide(A, B, t, n) returns true, then M^h computes the same next(n) and value(n) as decide(A, B, t, n).

Proof: The M^h algorithm simplifies the decide algorithm in that all the conditions involving knowledge checks are set to true, so the two compute the same results. □

By theorem 2, the M^h algorithm can be viewed as a special case of algorithm 3. However, we must emphasize that algorithm 3 itself is not a strategy (see definition 1), because it may not compute next(n) and value(n) for every node n; M^h is a strategy, because it computes next(n) and value(n) for every n.

Theorem 3 M^h and M∗ compute the same results for all game trees iff M∗ has the parameter d = ∞.

Proof: M∗ with parameter d = ∞ corresponds to the case where both players simulate each other for an infinite number of levels; therefore, for a tree t of any given depth, the simulation process converges to M^h, which can be verified by comparing algorithm 2 and algorithm 4. □

Even though M∗ can converge to M^h under certain conditions, they are still quite different algorithms in general.

Differences of M∗ and M^h. (I) M∗ requires expensive recursive simulations while M^h does not; consequently, in (Carmel & Markovitch 1996) the multi-pass M∗ algorithm needs to evaluate a node multiple times (compared to one evaluation per node in M^h), and the one-pass M∗ algorithm needs to compute vector values (compared to single-value computation in M^h). (II) M∗ represents a family of algorithms indexed by the parameter d, which relies on special assumptions about the opponent's strategies; M^h does not need such assumptions. The conditions under which M^h works will be elaborated in the following theorems.

Definition 9 A leaf l of t is the global optimum of t, iff ∀l′ ∈ t, (l′ ≠ l) → (EA(l) > EA(l′) ∧ EB(l) < EB(l′)).

Any game tree has at most one global optimum, because of our assumption of distinct leaf values reflecting a fixed deterministic tie-break scheme, while some game trees may not have any global optimum at all. When a global optimum exists, it is worth knowing whether a strategy always finds it. Surprisingly, although M^h can find the global optimum, if there is any, minimax and M∗ are not guaranteed to do so, as stated in theorem 4.

Theorem 4 If l ∈ t is the global optimum of t, M^h finds it: M^h(t, EA, EB) yields value(t) = (EA(l), EB(l)); however, minimax and M∗ do not guarantee value(t) = (EA(l), EB(l)).

Proof: M^h always has next(p) = l for l's parent p, no matter who moves at p, because l is the best choice for both. Similarly, p's parent has its next pointer to p, and so on. Therefore, value(l) is passed all the way up to the root, that is, value(t) = value(l). In minimax, one player is not even aware of the other's evaluation function, so there is no way for him to recognize the global optimum, let alone to find it. Similarly, since M∗ is a recursive simulation process of minimax, it is also possible for it to miss the global optimum. □

Corollary 1 If t has a global optimum leaf l and both players use M^h, then play(t, EA, EB, M^h, M^h) = (EA(l), EB(l)).

Corollary 1 holds because when both players use M^h, they agree on the final outcome value(l), which is the same global optimum computed by M^h.

Corollary 2 If EA and EB are uncorrelated and unbounded, in general there does not exist a sound and complete pruning algorithm that does not check every leaf.

Corollary 2 holds because any leaf not checked by a pruning algorithm could be a global optimum, given unconstrained EA and EB. This is bad news for any general pruning attempt: to be able to prune, one has to assume correlations between EA and EB, as done in (Carmel & Markovitch 1996).

The following theorem reveals the key property of M^h.

Theorem 5 M^h dominates any strategy if the opponent uses M^h:
∀t, EA, EB, SA, play(t, EA, EB, M^h, M^h).a ≥ play(t, EA, EB, SA, M^h).a   (8)
∀t, EA, EB, SB, play(t, EA, EB, M^h, M^h).b ≤ play(t, EA, EB, M^h, SB).b   (9)

Proof: For (8), let SA ≠ M^h. play(t, EA, EB, SA, M^h) follows a path p of next pointers from the root to a leaf, and play(t, EA, EB, M^h, M^h) follows another path q ≠ p. Consider the node n with smallest height where p deviates from q. B does not move at n, because B uses the same next(n) by M^h on both paths; so A moves at n. Since next(n) of M^h points to n's child with the best value for A, it must be that SA's next(n) points to a child with a worse value; thus SA can be improved by changing its next(n) to that of M^h for a better outcome. This improving process can be repeated until p matches q, which means q leads to a better outcome than p. By symmetry, (9) holds. □

Since M^h can also be viewed as backward induction (Kuhn 1953), it is not surprising that propositions (8) and (9) about M^h are identical to propositions (1) and (2) about minimax. By theorem 5, using M^h is mutually enforced by both players in nonzero-sum games, just like using minimax in zero-sum games.
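On a tree as small as the running example, proposition (8) can be verified exhaustively: enumerate every pure strategy for A (a choice of child at each of A's nodes) and check that none beats M^h when B plays M^h. A sketch, reusing the definitions above:

    from itertools import product

    def nodes_of(player: str, n: Node) -> list:
        """All internal nodes where `player` moves, in DFS order."""
        mine = [n] if n.player == player else []
        for c in n.children:
            if c.player is not None:
                mine += nodes_of(player, c)
        return mine

    A_nodes = nodes_of('A', t)                     # the root, U and V
    baseline = play(t, Mh, Mh)[0]                  # EA of the M^h outcome: 10
    for combo in product(*(n.children for n in A_nodes)):
        SA = dict(zip(A_nodes, combo))             # one pure strategy for A
        assert play(t, SA, Mh)[0] <= baseline      # proposition (8) holds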

Then it is important to show under what knowledge scenarios M^h is the optimal strategy.

Theorem 6 For any game tree t where A moves first, if height(t) ≤ 1 ∨ f(A, height(t) − 1) ⊆ K(A), then M^h K-dominates any other strategy on t for A.

Proof: The base case of height(t) ≤ 1 is trivial, because with M^h A chooses the best move to a leaf according to EA. Given f(A, height(t) − 1) ⊆ K(A), by property 6 we have f(B, height(t) − 2) ⊆ K(B), and then all n with height(n) ≤ height(t) − 2 are decided, by theorem 1.2. Given that B is selfish and informed, B's strategy computes the same next(n) and value(n) as M^h does for every n with height(n) ≤ height(t) − 2; so by theorem 5, for these n, A's strategy has to match M^h or it can be improved. Because B moves at every n with height(n) = height(t) − 1, B's move is decided for B, and A knows that exactly by K(A); therefore A knows that A's first move at the root has to match that of M^h too, or it can be improved. Overall, A's strategy has to match M^h, or it can be improved. □

By theorem 6, if a player has "enough" knowledge, then he knows that M^h is the optimal strategy. Common knowledge is more than enough, so the following corollary holds.

Corollary 3 If both A and B have common knowledge about EA and EB, that is, ∀d ≥ 1, f(A, d) ⊆ K(A), then M^h K-dominates any strategy for both A and B.

Intuitively, the more knowledge a player has, the better and more accurate an outcome he can estimate. When the conditions in theorem 6 or in corollary 3 are met, certainly A will estimate the outcome by M^h. We claim that this estimated outcome is at least as good as the one estimated by minimax using only EA, when A knows nothing about EB.

Theorem 7 For any game tree t where A moves first, let v1 = value(t) computed by M^h(t, EA, EB) and v2 = value(t) computed by minimax(t, EA); then v1.a ≥ v2.a.

Proof: At each node where B moves, minimax (as in algorithm 1) always lets B choose the child with the smallest EA value, whereas M^h lets B choose the child with the smallest EB value. Unless the same child is chosen, M^h always chooses a child better for A than what minimax chooses, so M^h yields a better estimated final outcome for A. □

When player A does not have enough knowledge about B, the outcome estimated by M^h can be inaccurate, because B may not be aware that B should play M^h according to B's knowledge. To solve this problem, we introduce the hybrid-minimax-M^h algorithm (algorithm 5), which works well with incomplete knowledge.

The Hybrid-minimax-M^h Algorithm

Algorithm 5: Hybrid-minimax-M^h(t, EA, EB, d) // requires d ≥ 0
  for every leaf l
    next(l) := null; value(l) := (EA(l), EB(l))
  for h = 1 to d
    for every node n with height h
      if A moves at n // to maximize
        next(n) := argmax_c value(c).a, c ∈ C(n)
      else // B moves, to minimize
        next(n) := argmin_c value(c).b, c ∈ C(n)
      value(n) := value(next(n))
  for h = d + 1 to height(t)
    for every node n with height h
      if max-player moves at n // to maximize
        next(n) := argmax_c value(c).a, c ∈ C(n)
      else // min-player moves, to minimize
        next(n) := argmin_c value(c).a, c ∈ C(n)
      value(n) := value(next(n))

Next we abbreviate Hybrid-minimax-M^h(t, EA, EB, d) of algorithm 5 as HM^0M^d. It can be viewed as a combination of minimax (M^0) and M^h with parameter d: it reduces to A's M^0 (ignoring the unused EB) if d = 0, and for a given t it reduces to M^h if d = height(t). In other words, M^h is applied to the nodes with height at most d, while M^0 is applied to the nodes above them. The reason is that the lower-level nodes are decided by mutual knowledge, so M^h yields the best result there, while the higher-level nodes are not decided, so conservative minimax is used to guarantee the worst-case outcome. The following theorem guarantees that the hybrid strategy is at least as good as minimax, given the player's knowledge.

Theorem 8 For any game tree t where A moves first, ∀d ≥ 1, if f(A, d) ⊆ K(A) then HM^0M^d K-dominates minimax on t for A.

Proof: For any n with height d, A moves at n or at parent(n), so by theorem 5, M^h K-dominates any strategy for A in the subtrees rooted at all such n. Since HM^0M^d is exactly the same as minimax for the nodes with height more than d, HM^0M^d is at least as good as minimax overall. □

Extending theorem 7 to HM^0M^d, we have corollary 4, which can be proved similarly.

Corollary 4 For any game tree t where A moves first, let v1 = value(t) by M^h(t, EA, EB), v2 = value(t) by HM^0M^d, v3 = value(t) by HM^0M^d′ (height(t) ≥ d ≥ d′ ≥ 0), and v4 = value(t) by minimax(t, EA); then v1.a ≥ v2.a ≥ v3.a ≥ v4.a.

Once again, our intuition is confirmed: the more knowledge a player has, the better the outcome he can expect.

Complexity. Given t, M^h and HM^0M^d can be computed by depth-first search (DFS) in O(height(t)) space and O(size(t)) time. These bounds are already optimal for sound and complete strategies that are guaranteed to find the global optimum if there is one, because first, DFS requires O(height(t)) space for its stack; and second, any sound and complete strategy needs O(size(t)) time, due to the lack of general pruning algorithms, by corollary 2.
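A sketch of Algorithm 5 under the same assumptions as the earlier code: M^h is applied below height d and worst-case minimax on EA above it. Pairing A's HM^0M^1 against B's minimax in scenario (2) of the running example (our choice of pairing, for illustration) reaches (9,4), the outcome reported for the no-communication case in the next section.

    def strategy_hybrid(root: Node, d: int) -> Dict[Node, Node]:
        """Algorithm 5 (HM^0M^d) for A: M^h on nodes of height <= d, and
        conservative minimax on EA (for both players' moves) above them."""
        nxt: Dict[Node, Node] = {}
        def val(n: Node) -> Value:
            if n.player is None:
                return n.value
            vals = [val(c) for c in n.children]
            if height(n) <= d:      # decided region: each side uses his own eval
                best = (max(vals, key=lambda v: v[0]) if n.player == 'A'
                        else min(vals, key=lambda v: v[1]))
            else:                   # undecided region: worst case for A, on EA
                best = (max(vals, key=lambda v: v[0]) if n.player == 'A'
                        else min(vals, key=lambda v: v[0]))
            nxt[n] = n.children[vals.index(best)]
            return best
        val(root)
        return nxt

    print(play(t, strategy_hybrid(t, 1), strategy_M(t, 'B', 0)))   # (9, 4)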

Knowledge Update via Communication

By theorem 8 and corollary 4, given f(A, d) ⊆ K(A), a strategy dominating minimax for A is HM^0M^d, which could still be worse than M^h; but using M^h requires knowledge that the players may not have. So knowledge update should be used to increase the players' knowledge to a level that allows them to use M^h. We suggest two communication schemes for desired knowledge update, assuming the communication channel is perfect and has no delay.

(I) If d ≥ 1 ∧ f(A, d) ⊆ K(A), A can announce both EA and EB to B.

With (I), after A's announcement, EA and EB become common knowledge for A and B, so M^h becomes the dominant strategy for both A and B, by corollary 3. This would require a logic with both relativized common knowledge and public announcements (PAL-RC); more details can be found in (van Benthem, van Eijck, & Kooi 2005).

(II) If d ≥ 2 ∧ f(A, d) ⊆ K(A), A can announce to B that A is using M^h.

With (II), given d − 1 ≥ 1 and f(B, d − 1) ⊆ K(B), by property 6 A knows that KB EA ∈ K(B), and thus A knows that B can verify whether A really uses M^h during the play. By theorem 5, M^h is mutually enforced, in that A cannot benefit from lying to B that he uses M^h: once B is forced to use M^h, A has to use M^h as well for his own best. Therefore, neither A nor B has an incentive to deviate from M^h if the opponent appears to be using it.

The Example Revisited with M^h and HM^0M^d. For the example in Figure 2, the baseline value is (4,7), by minimax in scenario (1). If communication is available, in scenarios (2)-(4) A will first use the above schemes for desired knowledge update. Then A will use M^h to force B to use M^h too, as explained before, and finally they will reach the outcome (10,3), which is better than M∗ in (3) and (4) and the same in (2).

If communication is not available, A will use HM^0M^d (with d = 1 for (2) and (3), and d = 2 for (4)) based on A's knowledge in each scenario, which is described right before theorem 1. Finally, the players will reach the outcome (9,4). With or without communication, both our strategies dominate minimax in all the scenarios and outperform M∗ in scenario (3), where the players' knowledge does not subsume each other's.

Conclusions and Future Work

This paper is focused on finite-horizon nonzero-sum game tree search with potentially imperfect mutual knowledge between the players, a topic little addressed by traditional game theory. We have proposed a new theoretical framework of Knowledge Oriented Players, which uses the S5n axiom system to model and reason about players' knowledge. The expressiveness of our framework has allowed us, for the first time, to analyze the general relationship between knowledge and strategies in such games, in particular, how strategies are constrained by and can benefit from knowledge. Leveraging this analysis, we have developed two new algorithms, M^h and HM^0M^d, which provably dominate minimax under a class of knowledge scenarios that is significantly more general than those handled by previous algorithms for such problems. Furthermore, our communication schemes allow players to achieve desired knowledge update, so that they can use the mutually-enforced M^h algorithm to reach the best possible outcome, given their knowledge.

There is plenty of room for future work. Players may have partial knowledge of the opponents' evaluation functions; so far we only consider atomic knowledge of evaluation functions, either fully known or unknown. Another direction is to study other classes of knowledge that can impose constraints on strategies. Furthermore, knowledge may be not absolutely true but hold only with some probability; how can such uncertainty be incorporated? Finally, can interesting topics in game theory, such as mixed strategies, be relevant to our discussions of knowledge under certain circumstances?

References

Carmel, D., and Markovitch, S. 1993. Learning models of opponents' strategies in game playing. In AAAI Fall Symposium on Games: Planning and Learning, 140–147.
Carmel, D., and Markovitch, S. 1996. Incorporating opponent models into adversary search. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI).
Cox, M. T., and Raja, A. 2007. Metareasoning: A manifesto. Technical Report BBN TM-2028, BBN Technologies. www.mcox.org/metareasoning/manifesto.
Donkers, H. 2004. Opponent models and knowledge symmetry in game tree search. In Proceedings of the First Knowledge and Games Workshop, 82–88.
Fudenberg, D., and Tirole, J. 1991. Game Theory. MIT Press.
Halpern, J. Y., and Moses, Y. 1992. A guide to completeness and complexity for modal logics of knowledge and belief. Artificial Intelligence 54(3):319–379.
Iida, H.; Uiterwijk, J.; Herik, H.; and Herschberg, I. 1993. Potential applications of opponent-model search. Part 1: The domain of applicability. ICCA Journal 4(16):201–208.
Kuhn, H. 1953. Extensive games and the problem of information. In Contributions to the Theory of Games, Volume II, Annals of Mathematics Studies, 193–216.
Shannon, C. 1950. Programming a computer for playing chess. Philosophical Magazine 256–275.
van Benthem, J.; van Eijck, J.; and Kooi, B. 2005. Common knowledge in update logics. In Proceedings of the 10th Conference on Theoretical Aspects of Rationality and Knowledge (TARK).
