CSC384: Intro to Game Tree Search

● Chapter 6.1, 6.2, 6.3, and 6.6 cover some of the material we cover here.
● Section 6.6 has an interesting overview of state-of-the-art game playing programs.
● Section 6.5 extends the ideas to games with uncertainty (we won't cover that material, but it makes for interesting reading).

Generalizing Search Problems

● So far: our search problems have assumed the agent has complete control of the environment
■ the state does not change unless the agent (robot) changes it. All we need to compute is a single path to a goal state.
● This assumption is not always reasonable
■ stochastic environments (e.g., the weather, traffic accidents)
■ other agents whose interests conflict with yours
■ Problem: you might not traverse the path you were expecting.


Generalizing Search Problems

● In these cases, we need to generalize our view of search to handle state changes that are not in the control of the agent.
● One generalization yields game tree search
■ agent and some other agents
■ the other agents are acting to maximize their profits; this might not have a positive effect on your profits.

Two-person Zero-Sum Games

● Two-person, zero-sum games
■ chess, checkers, tic-tac-toe, backgammon, go, Doom, "find the last parking space"
■ your winning means that your opponent loses, and vice-versa
■ zero-sum means the sum of your payoff and your opponent's payoff is zero---anything you gain comes at your opponent's cost (and vice-versa).
● Key insight: how you act depends on how the other agent acts (or how you think they will act), and vice versa (if your opponent is a rational player).

More General Games

● What makes something a game?
■ there are two (or more) agents influencing state change
■ each agent has their own interests
e.g., goal states are different; or we assign different values to different paths/states
■ each agent tries to alter the state so as to best benefit itself.

More General Games

● What makes games hard?
■ how you should play depends on how you think the other person will play; but how they play depends on how they think you will play; so how you should play depends on how you think they think you will play; but how they play should depend on how they think you think they think you will play; …


More General Games

● Zero-sum games are "fully competitive"
■ if one player wins, the other player loses
■ e.g., the amount of money I win (lose) at poker is the amount of money you lose (win)
● More general games can be "cooperative"
■ some outcomes are preferred by both of us, or at least our values aren't diametrically opposed
● We'll look in detail at zero-sum games
■ but first, some examples of simple zero-sum and cooperative games

Game 1: Rock, Paper Scissors

● Scissors cut paper, paper covers rock, rock smashes scissors
● Represented as a matrix: Player I chooses a row, Player II chooses a column
● Payoff to each player in each cell (P.I / P.II)
● 1: win, 0: tie, -1: loss
■ so it's zero-sum

                  Player II
               R      P      S
           R  0/0   -1/1   1/-1
Player I   P  1/-1   0/0  -1/1
           S  -1/1   1/-1   0/0
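
To make the matrix representation concrete, here is a minimal sketch (not from the slides; the predicate names are mine) encoding the RPS matrix as Prolog facts payoff(RowMove, ColMove, PayoffI, PayoffII), together with a check that the game really is zero-sum:

    % one fact per cell of the matrix above
    payoff(r, r,  0,  0).    payoff(r, p, -1,  1).    payoff(r, s,  1, -1).
    payoff(p, r,  1, -1).    payoff(p, p,  0,  0).    payoff(p, s, -1,  1).
    payoff(s, r, -1,  1).    payoff(s, p,  1, -1).    payoff(s, s,  0,  0).

    % zero-sum: there is no cell whose two payoffs fail to sum to zero
    zero_sum :- \+ ( payoff(_, _, U1, U2), U1 + U2 =\= 0 ).

The query ?- zero_sum. succeeds, confirming that every cell's payoffs sum to zero.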

Game 2: Prisoner's Dilemma

● Two prisoners in separate cells; the DA doesn't have enough evidence to convict them
● If one confesses and the other doesn't:
■ confessor goes free
■ other sentenced to 4 years
● If both confess (both defect)
■ both sentenced to 3 years
● Neither confesses (both cooperate)
■ sentenced to 1 year on a minor charge
● Payoff: 4 minus sentence

          Coop   Def
   Coop   3/3    0/4
   Def    4/0    1/1

Game 3: Battlebots

● Two robots: Blue (Craig's), Red (Fahiem's)
■ one cup of coffee, one tea left
■ both C & F prefer coffee (value 10)
■ tea acceptable (value 8)
● Both robots go for coffee
■ collide and get no payoff
● Both go for tea: same
● One goes for coffee, the other for tea:
■ coffee robot gets 10
■ tea robot gets 8

           C      T
   C      0/0   10/8
   T     8/10    0/0


Two Player Zero Sum Games

● Key point of the previous games: what you should do depends on what the other guy does
● The previous games are simple "one shot" games
■ single move each
■ in game theory: strategic or normal form games
● Many games extend over multiple moves
■ e.g., chess, checkers, etc.
■ in game theory: extensive form games
● We'll focus on the extensive form
■ that's where the computational questions emerge

Two-Player, Zero-Sum Game: Definition

● two players A (Max) and B (Min)
● a set of positions P (states of the game)
● a starting position s ∈ P (where the game begins)
● terminal positions T ⊆ P (where the game can end)
● a set of directed edges E_A between states (A's moves)
● a set of directed edges E_B between states (B's moves)
● a utility or payoff function U : T → ℝ (how good each terminal state is for player A)
■ why don't we need a utility function for B? (see the encoding below)
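
As a concrete (hypothetical) instance of this definition, the sketch below encodes a trivial two-move game in Prolog; all names are mine, not the slides':

    % positions P, the start position s, and terminal positions T ⊆ P:
    position(p0).  position(p1).  position(p2).
    position(q1).  position(q2).  position(q3).
    start(p0).
    terminal(q1).  terminal(q2).  terminal(q3).

    % A's (Max's) moves E_A and B's (Min's) moves E_B, as directed edges:
    edgeA(p0, p1).  edgeA(p0, p2).
    edgeB(p1, q1).  edgeB(p1, q2).
    edgeB(p2, q3).

    % the utility function U : T → ℝ, from A's point of view; B's payoff is
    % just -U, which is why no separate utility function for B is needed:
    terminalUtility(q1, 3).  terminalUtility(q2, -2).  terminalUtility(q3, 5).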

Intuitions

● Players alternate moves (starting with Max)
■ the game ends when some terminal p ∈ T is reached
● A game state: a position-player pair
■ tells us what position we're in and whose move it is
● Utility function and terminals replace goals
■ Max wants to maximize the terminal payoff
■ Min wants to minimize the terminal payoff
● Think of it as:
■ Max gets U(t), Min gets -U(t) for terminal node t
■ this is why it's called zero (or constant) sum

Tic-tac-toe: States

[Figure: example tic-tac-toe states, from the start state (Turn=Max(X)) through intermediate states (Turn=Min(O), Turn=Max(X)), down to two terminal states: one terminal with U = +1 (a win for X) and another terminal with U = -1 (a win for O).]

Tic-tac-toe: Game Tree

[Figure: part of the tic-tac-toe game tree. From the root, Max moves to a state a; from a, Min can move to b, c, or d; Max and Min layers alternate down to a terminal board with U = +1.]

Game Tree

● The game tree looks like a search tree
■ layers reflect the alternating moves
● But Max doesn't decide where to go alone
■ after Max moves to state a, Min decides whether to move to state b, c, or d
● Thus Max must have a strategy
■ must know what to do next no matter what move Min makes (b, c, or d)
■ a sequence of moves will not suffice: Max may want to do something different in response to b, c, or d
● What is a reasonable strategy?

Strategy: Intuitions

[Figure: a game tree with max node s0 at the root; min nodes s1, s2, s3; and terminals t1, t2, t3 (children of s1, with utilities 7, -6, 4), t4, t5 (children of s2, with utilities 3, 9), and t6, t7 (children of s3, with utilities -10, 2).]

● The terminal nodes have utilities. But we can compute a "utility" for the nonterminal states, by assuming both players always play their best move.

Minimax Strategy: Intuitions

[Figure: the same tree as above.]

● If Max goes to s1, Min goes to t2
■ U(s1) = min{U(t1), U(t2), U(t3)} = -6
● If Max goes to s2, Min goes to t4
■ U(s2) = min{U(t4), U(t5)} = 3
● If Max goes to s3, Min goes to t6
■ U(s3) = min{U(t6), U(t7)} = -10
● So Max goes to s2: U(s0) = max{U(s1), U(s2), U(s3)} = 3


Minimax Strategy

● Build the full game tree (all leaves are terminals)
■ the root is the start state, edges are possible moves, etc.
■ label terminal nodes with utilities
● Back values up the tree
■ U(t) is defined for all terminals (part of the input)
■ U(n) = min{U(c) : c a child of n} if n is a min node
■ U(n) = max{U(c) : c a child of n} if n is a max node

Minimax Strategy

● The values labeling each state are the values that Max will achieve in that state if both he and Min play their best moves.
■ Max plays a move to change the state to the highest-valued min child.
■ Min plays a move to change the state to the lowest-valued max child.
● If Min plays poorly, Max could do better, but never worse.
■ If Max, however, knows that Min will play poorly, there might be a better strategy of play for Max than minimax!

Depth-first Implementation of MinMax

utility(N,U) :- terminal(N), terminalUtility(N,U).
utility(N,U) :- maxMove(N), children(N,CList),
                utilityList(CList,UList),
                max(UList,U).
utility(N,U) :- minMove(N), children(N,CList),
                utilityList(CList,UList),
                min(UList,U).

● Depth-first evaluation of the game tree
■ terminal(N) holds if the state (node) N is a terminal node. Similarly for maxMove(N) (Max player's move) and minMove(N) (Min player's move).
■ the utility of terminals is specified as part of the input (given here as terminalUtility(N,U) facts, so that the first clause does not call itself and loop).

Depth-first Implementation of MinMax

utilityList([],[]).
utilityList([N|R],[U|UList])
    :- utility(N,U), utilityList(R,UList).

■ utilityList simply computes a list of utilities, one for each node on the list.
■ The way Prolog executes implies that this will compute utilities using a depth-first, post-order traversal of the game tree.
post-order (visit children before visiting parents).
(A small set of example facts to run these clauses appears below.)
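
A minimal sketch of how these clauses can be run: the facts below encode the tree from the "Minimax Strategy: Intuitions" slide, and max/2 and min/2 are read as list operations, defined here via SWI-Prolog's built-ins max_list/2 and min_list/2 (an assumption; the slides leave them unspecified):

    max(L, M) :- max_list(L, M).
    min(L, M) :- min_list(L, M).

    maxMove(s0).                       % Max to move at the root
    minMove(s1).  minMove(s2).  minMove(s3).
    children(s0, [s1, s2, s3]).
    children(s1, [t1, t2, t3]).
    children(s2, [t4, t5]).
    children(s3, [t6, t7]).
    terminal(t1).  terminal(t2).  terminal(t3).  terminal(t4).
    terminal(t5).  terminal(t6).  terminal(t7).
    terminalUtility(t1, 7).   terminalUtility(t2, -6).  terminalUtility(t3, 4).
    terminalUtility(t4, 3).   terminalUtility(t5, 9).   terminalUtility(t6, -10).
    terminalUtility(t7, 2).

The query ?- utility(s0, U). then yields U = 3, matching the hand computation on the earlier slide.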


Depth-first Implementation of MinMax

● Notice that the game tree has to have finite depth for this to work
● Advantage of the DF implementation: it is space efficient

Visualization of DF-MinMax

[Figure: depth-first evaluation of a game tree rooted at s0, with subtrees rooted at s1, s13, and s16. Once s17 is evaluated, there is no need to store its subtree: s16 only needs its value. Once s24's value is computed, we can evaluate s16.]

Pruning

● It is not necessary to examine the entire tree to make a correct minimax decision
● Assume depth-first generation of the tree
■ after generating values for only some of n's children, we can prove that we'll never reach n in a minimax strategy
■ so we needn't generate or evaluate any further children of n!
● Two types of pruning (cuts):
■ pruning of max nodes (α-cuts)
■ pruning of min nodes (β-cuts)

Cutting Max Nodes (Alpha Cuts)

● At a Max node n:
■ let β be the lowest value of n's siblings examined so far (siblings to the left of n that have already been searched)
■ let α be the highest value of n's children examined so far (changes as children are examined)

[Figure: max node s0 with min-node children s1, s13, s16; min node s1 has children s2 and s6, where s2 has already been searched and has value 5, so at max node s6, β = 5 (only one sibling value known). As s6's children T3, T4, T5 (values 8, 10, 5) are explored, α takes the sequence of values 8, 10, 10.]

Cutting Max Nodes (Alpha Cuts)

● If α becomes ≥ β, we can stop expanding the children of n
■ Min will never choose to move from n's parent to n, since it would choose one of n's lower-valued siblings first.

[Figure: a min node P with β = 8 (its already-searched children have values 14, 12, and 8) and a max-node child n whose children s1, s2, s3 have values 2, 4, 9; as they are examined, α takes the sequence 2, 4, 9, at which point α ≥ β and the cut condition holds.]

Cutting Min Nodes (Beta Cuts)

● At a Min node n:
■ let β be the lowest value of n's children examined so far (changes as children are examined)
■ let α be the highest value of n's siblings examined so far (fixed when evaluating n)

Cutting Min Nodes (Beta Cuts)

● If β becomes ≤ α, we can stop expanding the children of n
■ Max will never choose to move from n's parent to n, since it would choose one of n's higher-valued siblings first.

[Figure: a max node P with α = 7 (its already-searched children have values 6, 2, and 7) and a min-node child n whose children s1, s2, s3 have values 9, 8, 3; as they are examined, β takes the sequence 9, 8, 3, at which point β ≤ α and the cut condition holds.]

Alpha-Beta Algorithm

Evaluate(startNode):
  /* assume Max moves first */
  return MaxEval(startNode, -infinity, +infinity)

MaxEval(node, alpha, beta):
  if terminal(node), return U(node)
  for each c in childlist(node):
    val ← MinEval(c, alpha, beta)
    alpha ← max(alpha, val)
    if alpha ≥ beta, return alpha
  return alpha

MinEval(node, alpha, beta):
  if terminal(node), return U(node)
  for each c in childlist(node):
    val ← MaxEval(c, alpha, beta)
    beta ← min(beta, val)
    if alpha ≥ beta, return beta
  return beta

● Pseudo-code that associates a value with each node (a runnable Prolog rendering follows below).
● A strategy is extracted by moving to the best-valued child at each step (the highest-valued child if you are player Max).
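
One way to render this pseudo-code as runnable Prolog (a sketch, not code from the slides), reusing the terminal/1, children/2, and terminalUtility/2 facts from the minimax example:

    maxEval(N, _, _, V) :- terminal(N), terminalUtility(N, V).
    maxEval(N, Alpha, Beta, V) :-
        \+ terminal(N), children(N, Cs),
        maxLoop(Cs, Alpha, Beta, V).

    % fold over the children, raising alpha; stop early once alpha >= beta
    maxLoop([], Alpha, _, Alpha).
    maxLoop([C|Cs], Alpha, Beta, V) :-
        minEval(C, Alpha, Beta, Val),
        Alpha1 is max(Alpha, Val),
        (  Alpha1 >= Beta
        -> V = Alpha1              % α-cut: prune remaining children of this max node
        ;  maxLoop(Cs, Alpha1, Beta, V)
        ).

    minEval(N, _, _, V) :- terminal(N), terminalUtility(N, V).
    minEval(N, Alpha, Beta, V) :-
        \+ terminal(N), children(N, Cs),
        minLoop(Cs, Alpha, Beta, V).

    % fold over the children, lowering beta; stop early once alpha >= beta
    minLoop([], _, Beta, Beta).
    minLoop([C|Cs], Alpha, Beta, V) :-
        maxEval(C, Alpha, Beta, Val),
        Beta1 is min(Beta, Val),
        (  Alpha >= Beta1
        -> V = Beta1               % β-cut: prune remaining children of this min node
        ;  minLoop(Cs, Alpha, Beta1, V)
        ).

On the example tree, the query ?- maxEval(s0, -1000, 1000, V). (with ±1000 standing in for ±infinity on this small example) yields V = 3, and the β-cut at min node s3 means terminal t7 is never evaluated.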


Rational Opponents

● This all assumes that your opponent is rational
■ e.g., will choose moves that minimize your score
● What if your opponent doesn't play rationally?
■ will it affect the quality of the outcome?

Rational Opponents

● Storing your strategy is a potential issue:
■ you must store "decisions" for each node you can reach by playing optimally (a sketch of extracting one such decision appears below)
■ if your opponent has unique rational choices, this is a single branch through the game tree
■ if there are "ties", the opponent could choose any one of the "tied" moves: you must store a strategy for each subtree
● What if your opponent doesn't play rationally? Will your stored strategy still work?
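
A minimal sketch (mine, not the slides') of extracting Max's decision at a max node from the minimax values: any child achieving the parent's backed-up value is an optimal move. It recomputes utility/2 on each call, which is fine for the small example tree:

    bestMove(N, C) :-
        maxMove(N), children(N, Cs),
        utility(N, U),
        member(C, Cs),
        utility(C, U).       % a child achieving the parent's value

With the earlier facts, ?- bestMove(s0, M). yields M = s2; on backtracking it enumerates all "tied" optimal moves, which is exactly the ties issue noted above.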

Practical Matters

● All "real" games are too large to enumerate the tree
■ e.g., chess has a branching factor of roughly 35
■ a depth-10 tree has about 2,700,000,000,000,000 nodes
■ even alpha-beta pruning won't help here!

Practical Matters

● We must limit the depth of the search tree
■ we can't expand all the way to terminal nodes
■ we must make heuristic estimates of the values of the (nonterminal) states at the leaves of the tree; see the sketch below
■ "evaluation function" is an often-used term for these estimates
■ evaluation functions are often learned
● Depth-first expansion is almost always used for game trees because of the sheer size of the trees
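
A minimal sketch of this depth-bounded idea (not from the slides): recurse as before, but when the depth bound reaches 0 at a nonterminal node, fall back on a heuristic evaluation function. Here eval/2 is a hypothetical, game-specific predicate that the programmer must supply:

    % dlUtility(Node, Depth, U): minimax value of Node, searching at most Depth plies
    dlUtility(N, _, U) :- terminal(N), terminalUtility(N, U).
    dlUtility(N, 0, U) :- \+ terminal(N), eval(N, U).   % heuristic estimate at the bound
    dlUtility(N, D, U) :-
        D > 0, \+ terminal(N),
        D1 is D - 1,
        children(N, Cs),
        findall(V, (member(C, Cs), dlUtility(C, D1, V)), Vs),
        ( maxMove(N) -> max_list(Vs, U) ; min_list(Vs, U) ).

With the earlier example facts and any eval/2 definition, ?- dlUtility(s0, 1, U). scores s1, s2, s3 by eval/2 instead of searching down to the terminals.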


Heuristics

● Think of a few games and suggest some heuristics for estimating the "goodness" of a position
■ chess?
■ checkers?
■ your favorite video game?
■ "find the last parking spot"?

Some Interesting Games

● Tesauro's TD-Gammon
■ a champion backgammon player which learned its evaluation function; stochastic component (dice)
● Checkers (Samuel, 1950s; Chinook, 1990s, Schaeffer)
● Chess (which you all know about)
● Bridge, Poker, etc.
● Check out Jonathan Schaeffer's Web page:
■ www.cs.ualberta.ca/~games
■ they've studied lots of games (you can play too)

An Aside on Large Search Problems

● Issue: the inability to expand the tree to terminal nodes is relevant even in standard search
■ often we can't expect A* to reach a goal by expanding the full frontier
■ so we often limit our lookahead and make moves before we actually know the true path to the goal
■ sometimes called online or realtime search
● In this case, we use the heuristic function not just to guide our search, but also to commit to the moves we actually make
■ in general, guarantees of optimality are lost, but we reduce computational/memory expense dramatically

Realtime Search Graphically

1. We run A* (or our favorite search algorithm) until we are forced to make a move or run out of memory. Note: no leaves are goals yet.
2. We use the evaluation function f(n) to decide which path looks best (let's say it is the red one).
3. We take the first step along the best path (red), by actually making that move.
4. We restart the search at the node we reach by making that move. (We may actually cache the results of the relevant part of the first search tree if it's hanging around, as it would with A*.)
