Artificial Intelligence CS 165A Feb 13, 2020

Instructor: Prof. Yu-Xiang Wang

• Games and search
• Intro to RL

1 Midterm

• The TAs have finished grading.

• Midterm:
  – Max is in the 80s
  – Min is in the 30s
  – Many fall in the 40-60 range

• We will be releasing the grades on Gauchospace.
  – Talk to me if you did not come to the midterm for some reason.

• The discussion class will be about the midterm.

2 HW3

• HW3 will be released at midnight today.

• Topics covered include:
  – Game playing
  – Markov Decision Processes
  – Bandits

• Programming question:
  – Solve PACMAN with ghosts moving around.

3 Today

• Games / Adversarial Search

• Intro to Reinforcement Learning

4 Illustrative example of a simple game (1 min discussion)

(Example taken from Liang and Sadigh)

5 Game as a search problem

• S0 The initial state

• PLAYER(s): Returns which player has the move

• ACTIONS(s): Returns the legal moves.

• RESULT(s, a): Returns the state we transition to.

• TERMINAL-TEST(s): Returns True if the game is over.

• UTILITY(s,p): The payoff of player p at terminal state s.
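As a concrete illustration of this formulation, here is a minimal Python sketch of the game interface; the class and method names are mine and only mirror the functions listed above, not a required API.

```python
# A minimal sketch of the game-as-search formulation.
# The class and method names here are illustrative, not a prescribed interface.

class Game:
    def initial_state(self):        # S0: the initial state
        raise NotImplementedError

    def player(self, s):            # PLAYER(s): which player has the move
        raise NotImplementedError

    def actions(self, s):           # ACTIONS(s): the legal moves in state s
        raise NotImplementedError

    def result(self, s, a):         # RESULT(s, a): the state we transition to
        raise NotImplementedError

    def terminal_test(self, s):     # TERMINAL-TEST(s): is the game over?
        raise NotImplementedError

    def utility(self, s, p):        # UTILITY(s, p): payoff of player p at terminal s
        raise NotImplementedError
```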

6 Recap: Two-player, Perfect-Information, Deterministic, Zero-Sum Games

• Two-player: Tic-Tac-Toe, Chess, Go!

• Perfect information: The State is known to everyone

• Deterministic: Nothing is random

• Zero-sum: The total payoff for all players is a constant.

• The 8-puzzle is a one-player, perfect-information, deterministic, zero-sum game.
• How about Monopoly?
• How about Starcraft?

7 Tic-Tac-Toe

• The first player is X and the second is O
• Object of game: get three of your symbol in a horizontal, vertical, or diagonal row on a 3x3 game board
• X always goes first
• Players alternate placing Xs and Os on the game board
• Game ends when a player has three in a row (a win) or all nine squares are filled (a draw)
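As a rough sketch under my own (hypothetical) encoding choices, the Tic-Tac-Toe state, player, actions, transition, and terminal test could look like this in Python; the assignment may well use a different representation.

```python
# Hypothetical Tic-Tac-Toe encoding: the board is a tuple of 9 cells,
# each 'X', 'O', or None. This is one possible choice, not a required one.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def initial_state():
    return (None,) * 9

def player(s):
    # X moves first, so X is to move whenever the mark counts are equal.
    return 'X' if s.count('X') == s.count('O') else 'O'

def actions(s):
    return [i for i, cell in enumerate(s) if cell is None]

def result(s, a):
    board = list(s)
    board[a] = player(s)
    return tuple(board)

def winner(s):
    for i, j, k in WIN_LINES:
        if s[i] is not None and s[i] == s[j] == s[k]:
            return s[i]
    return None

def terminal_test(s):
    return winner(s) is not None or all(cell is not None for cell in s)
```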

What's the state, action, transition, and payoff for Tic-Tac-Toe?

8 Partial game tree for Tic-Tac-Toe

[Figure: partial game tree with alternating levels labeled start, X's turn, O's turn, …, ending at a terminal node where X wins]

9 Game trees

• A game tree is like a search tree in many ways …
  – nodes are search states, with full details about a position
    • characterize the arrangement of game pieces on the game board
  – edges between nodes correspond to moves
  – leaf nodes correspond to a set of goals
    • { win, lose, draw }
    • usually determined by a score for or against the player
  – at each node it is one or the other player's turn to move
• A game tree is not like a search tree because you have an opponent!

10 Two players: MIN and MAX

• In a zero-sum game:
  – payoff to Player 1 = - payoff to Player 2

• The goal of Player 1 is to maximize her payoff.

• The goal of Player 2 is to maximize her payoff as well
  – Equivalent to minimizing Player 1's payoff.

11 Minimax search

• Assume that both players play perfectly
  – do not assume a player will miss good moves or make mistakes
• Score(s): the score that MAX will get at the end if both players play perfectly from s onwards
• Consider MIN's strategy:
  – MIN's best strategy:
    • choose the move that minimizes the score that will result when MAX chooses the maximizing move
  – MAX does the opposite

12 Minimaxing

• Your opponent will choose smaller numbers
• If you move left, your opponent will choose 3
• If you move right, your opponent will choose -8
• Thus your choices are effectively only 3 or -8
• You should move left

[Figure: MAX root with value 3 over two MIN nodes with values 3 and -8; leaf values 7, 3, -8, 50]

13 Minimax example

Which move to choose?

The minimax decision is move A1

14 Another example

• In the game, it’s your move. Which move will the minimax algorithm choose – A, B, C, or D? What is the minimax value of the root node and nodes A, B, C, and D?

[Figure: MAX root with value 4; MIN nodes A = 1, B = 2, C = 4, D = 3; leaf values 1, 7, 2 | 5, 2, 8 | 9, 4, 6 | 3, 3, 5]

15 Minimax search

• The minimax decision maximizes the utility under the assumption that the opponent seeks to minimize it (if it uses the same evaluation function)

• Generate the tree of minimax values
  – Then choose the best (maximum) move
  – Don't need to keep all values around
    • Good memory property

• Depth-first search is used to implement minimax
  – Expand all the way down to the leaf nodes
  – Recursive implementation (see the sketch below)
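A minimal recursive sketch of that depth-first minimax computation, assuming game-interface functions like those sketched earlier and a utility(s, p) function measured from player p's point of view (names are illustrative):

```python
# Minimal recursive minimax sketch; assumes the game-interface functions
# (terminal_test, actions, result, player, utility) sketched earlier, with
# utility(s, p) giving the payoff from player p's point of view.

def minimax_value(s, max_player):
    if terminal_test(s):
        return utility(s, max_player)
    values = [minimax_value(result(s, a), max_player) for a in actions(s)]
    return max(values) if player(s) == max_player else min(values)

def minimax_decision(s):
    # The player to move picks the action whose successor has the best value.
    me = player(s)
    return max(actions(s), key=lambda a: minimax_value(result(s, a), me))
```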

16 Minimax properties

• Optimal? Yes, against an optimal opponent, if the tree is finite
• Complete? Yes, if the tree is finite
• Time complexity? Exponential: O(b^m)
• Space complexity? Polynomial: O(bm)

17 But this could take forever…

• Exact search is intractable
  – Tic-Tac-Toe: 9! = 362,880
  – For chess, b ≈ 35 and m ≈ 100 for "reasonable" games
  – Go: 361! ≈ 10^750

• Idea 1: Pruning

• Idea 2: Cut off early and use a heuristic function

18 Pruning

• What's really needed is "smarter," more efficient search
  – Don't expand "dead-end" nodes!
• Pruning: eliminating a branch of the search tree from consideration
• Alpha-beta pruning, applied to a minimax tree, returns the same "best" move, while pruning away unnecessary branches
  – Many fewer nodes might be expanded
  – Hence, a smaller effective branching factor
  – …and deeper search
  – …and better performance
• Remember, minimax is depth-first search
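A hedged sketch of alpha-beta pruning layered on the same depth-first recursion; alpha and beta track the best values the maximizing and minimizing players can already guarantee, and a branch is cut off once it cannot affect the decision (function names follow the earlier sketches, not a prescribed interface):

```python
import math

# Alpha-beta pruning sketch on the recursive minimax above. Assumes the same
# game-interface functions (terminal_test, actions, result, player, utility).

def alphabeta_value(s, max_player, alpha=-math.inf, beta=math.inf):
    if terminal_test(s):
        return utility(s, max_player)
    if player(s) == max_player:
        value = -math.inf
        for a in actions(s):
            value = max(value, alphabeta_value(result(s, a), max_player, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:   # the minimizer will never let play reach here
                break
        return value
    else:
        value = math.inf
        for a in actions(s):
            value = min(value, alphabeta_value(result(s, a), max_player, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:   # the maximizer already has a better option
                break
        return value
```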

19 Alpha pruning

[Figure: alpha pruning example. MAX root A with value ≥ 10; MIN children B, C, D with B = 10; leaf values 10, 25, 15, 5]

20 Beta pruning

[Figure: beta pruning example. MIN root A with value ≤ 25; MAX children B, C, D with B = 25; leaf values 10, 25, 15, 50]

21 Improvements via alpha/beta pruning

• Depends on the ordering of expansion

• Perfect ordering: O(b^(m/2))

• Random ordering: O(b^(3m/4))

• For specific games like Chess, you can get to almost perfect ordering.

22 Heuristic function

• Rather, cut the search off early and apply a heuristic evaluation function to the leaves
  – h(s) estimates the expected utility of the game from a given position (node/state) s

• The performance of a game-playing program depends on the quality (and speed!) of its evaluation function

23 Heuristics (Evaluation function)

• Typical evaluation function for game: weighted linear function

  – h(s) = w1·f1(s) + w2·f2(s) + … + wd·fd(s)
  – i.e., weights · features [dot product]; see the sketch below
• For example, in chess:
  – W = { 1, 3, 3, 5, 8 }
  – F = { # pawns advantage, # bishops advantage, # knights advantage, # rooks advantage, # queens advantage }

• More complex evaluation functions may involve learning
  – Adjusting weights based on outcomes
  – Perhaps non-linear functions
  – How to choose the features?
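A tiny sketch of such a weighted linear evaluation; the linear_eval helper and the chess-like weights/features in the comment are placeholders for illustration, not what a real chess engine uses:

```python
# Weighted linear evaluation: h(s) = w1*f1(s) + w2*f2(s) + ... + wd*fd(s).
# The weights and feature functions below are illustrative placeholders.

def linear_eval(state, weights, features):
    return sum(w * f(state) for w, f in zip(weights, features))

# Hypothetical usage mirroring the W and F sets on this slide:
#   weights  = [1, 3, 3, 5, 8]
#   features = [pawn_advantage, bishop_advantage, knight_advantage,
#               rook_advantage, queen_advantage]   # each a function of the state
#   h = linear_eval(s, weights, features)
```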

24 Tic-Tac-Toe revisited

a partial game tree for Tic-Tac-Toe

25 Evaluation functions

• It is usually impossible to solve games completely
• This means we cannot search the entire game tree
  – we have to cut off search at a certain depth
    • like depth-bounded depth-first search, we lose completeness
• Instead we have to estimate the cost of internal nodes
• We do this using an evaluation function
  – evaluation functions are heuristics
• Explore the game tree using a combination of evaluation function and search

26 Evaluation function for Tic-Tac-Toe

• A simple evaluation function for Tic-Tac-Toe:
  – count the number of rows where X can win
  – subtract the number of rows where O can win
• The value of the evaluation function at the start of the game is zero
  – on an empty game board there are 8 possible winning rows for both X and O

[Figure: empty board; 8 - 8 = 0]

27 Evaluating Tic-Tac-Toe

evalX = (number of rows where X can win) - (number of rows where O can win)

• After X moves in the center, the score for X is +4
• After O moves, the score for X is +2
• After X's next move, the score for X is +4

[Figure: board sequence with evalX = 8 - 8 = 0, 8 - 4 = 4, 6 - 4 = 2, 6 - 2 = 4]

28 Evaluating Tic-Tac-Toe

evalO = (number of rows where O can win) - (number of rows where X can win)

• After X moves in the center, the score for O is -4
• After O moves, the score for O is -2
• After X's next move, the score for O is -4

[Figure: board sequence with evalO = 8 - 8 = 0, 4 - 8 = -4, 4 - 6 = -2, 2 - 6 = -4]
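A rough sketch of this row-counting heuristic, reusing the tuple-of-cells board encoding and WIN_LINES assumed earlier: a line still counts as winnable for a player if it contains none of the opponent's marks.

```python
# evalX(s) = (# lines X can still win) - (# lines O can still win).
# Uses the WIN_LINES and tuple-of-9-cells encoding assumed earlier.

def open_lines(s, p):
    opponent = 'O' if p == 'X' else 'X'
    return sum(1 for line in WIN_LINES
               if all(s[i] != opponent for i in line))

def eval_x(s):
    return open_lines(s, 'X') - open_lines(s, 'O')

# On the empty board both counts are 8, so eval_x(initial_state()) == 0;
# after X takes the center, the counts are 8 and 4, giving +4 as on the slide.
```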

29 Search depth cutoff

Tic-Tac-Toe with search depth 2

[Figure: Tic-Tac-Toe searched to depth 2; evaluations shown for X; value -2]

30 HW3: Programming assignment

• PACMAN with ghosts

• You will generalize minimax search to multiple opponents
• You will exploit the benign opponents

31 Small Pacman Example

MAX nodes (under the Agent's control):  V(s) = max_{s' ∈ successors(s)} V(s')
MIN nodes (under the Opponent's control):  V(s) = min_{s' ∈ successors(s)} V(s')
Terminal states: V(s) = known

[Figure: small Pacman game tree with root value -8; second-level values -8 and -10; leaf values -8, -5, -10, +8]

32 Expectimax: Playing against a benign opponent

• Sometimes the ghosts are not clever.
• They behave randomly.

• Examples of games of chance:
  – Slot machines
  – Tetris

33 Expectimax

• Your opponent behaves randomly with a given probability distribution
• If you move left, your opponent will select actions with probability [0.5, 0.5]
• If you move right, your opponent will select actions with probability [0.6, 0.4]

[Figure: MAX root with value 15.2 over two AVERAGE (chance) nodes with values 5 and 15.2; leaf values 7, 3, -8, 50]
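A hedged sketch of expectimax on the same interface: MAX nodes maximize, while chance nodes average successor values under the opponent's action distribution; opponent_probs is a hypothetical helper returning (action, probability) pairs.

```python
# Expectimax sketch: MAX nodes take the maximum, chance nodes take the
# probability-weighted average. `opponent_probs(s)` is a hypothetical helper
# returning (action, probability) pairs for the randomly acting opponent.

def expectimax_value(s, max_player):
    if terminal_test(s):
        return utility(s, max_player)
    if player(s) == max_player:
        return max(expectimax_value(result(s, a), max_player)
                   for a in actions(s))
    return sum(p * expectimax_value(result(s, a), max_player)
               for a, p in opponent_probs(s))
```

For the tree above, the left chance node averages 7 and 3 with probabilities [0.5, 0.5] to get 5, the right one averages -8 and 50 with [0.6, 0.4] to get 15.2, so MAX moves right.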

34 Summary of game playing

• Minimax search

• Game tree

• Alpha-beta pruning

• Early stop with an evaluation function

• Expectimax.

35 More reading / resources about game playing

• Required reading: AIMA 5.1-5.3

• Stochastic games / Expectiminimax: AIMA 5.5
  – Backgammon, TD-Gammon

• Famous game AIs: read AIMA Ch. 5.7
  – Deep Blue
  – TD-Gammon

• AlphaGo: https://www.nature.com/articles/nature16961

36 Reinforcement Learning Lecture Series

• Overview (Today)

• Markov Decision Processes

• Bandits problems and exploration

• Reinforcement Learning Algorithms

37 Reinforcement learning in the animal world

Ivan Pavlov (1849 - 1936) Nobel Laureate

• Learn from rewards
• Reinforce the states that yield positive rewards

38 Reinforcement learning: Applications

[Figure: recommendations example — buy or not buy]

39 Reinforcement learning problem setup

• State, Action, Reward
• Unknown reward function, unknown state transitions
• Agents might not even observe the state

40 Reinforcement learning problem setup

• State, Action, Reward and Observation

  – S_t ∈ S,  A_t ∈ A,  R_t ∈ R,  O_t ∈ O

• Policy:
  – When the state is observable:  π : S → A
  – When the state is not observable, the policy maps the history of observations, actions, and rewards to an action:  π : (O × A × R)^(t-1) × O → A

• Learn the best policy that maximizes the expected reward
  – Finite-horizon (episodic) RL:  π* = argmax_{π ∈ Π} E[ Σ_{t=1}^{T} R_t ],   T: horizon
  – Infinite-horizon RL:  π* = argmax_{π ∈ Π} E[ Σ_{t=1}^{∞} γ^(t-1) R_t ],   γ: discount factor
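As a rough illustration of this setup (not any particular algorithm), the agent-environment interaction loop and the discounted return being maximized might look like the sketch below; env and policy are hypothetical objects with a reset/step and an action-selection interface.

```python
# Generic RL interaction-loop sketch. `env` and `policy` are hypothetical:
# env.reset() -> initial observation; env.step(a) -> (obs, reward, done);
# policy(obs) -> action. Accumulates the discounted return sum_t gamma^(t-1) R_t.

def run_episode(env, policy, gamma=0.99, max_steps=1000):
    obs = env.reset()
    discounted_return = 0.0
    discount = 1.0                      # gamma^(t-1), starting at t = 1
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        discounted_return += discount * reward
        discount *= gamma
        if done:
            break
    return discounted_return
```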

41 RL for robot control

• States: the physical world, e.g., location/speed/acceleration and so on
• Observations: camera images, joint angles
• Actions: joint torques
• Rewards: stay balanced, navigate to target locations, serve and protect humans, etc.

42 RL for Inventory Management

• States: inventory level, customer demand, competitor's inventory
• Observations: current inventory levels and sales history
• Actions: amount of each item to purchase
• Rewards: profit

43 Demonstrating the learning process

• Mountain car: https://www.youtube.com/watch?v=U5w9PoKCOeM

44 Reading materials for RL

• Introduction:
  – Sutton and Barto: Chapter 1
• Markov Decision Processes:
  – AIMA Section 17.1; Sutton and Barto: Ch 3
• Policy iterations / value iterations:
  – AIMA Chapter 17.2-17.3; Sutton and Barto: Ch 4
• Bandits:
  – Sutton and Barto: Ch 2; AIMA Ch. 21.4
• RL algorithms:
  – Sutton and Barto: Ch 4, Ch 5, Ch 6, Ch 13

• Next Tuesday: – Markov Decision Processes

45