Effective Short-Term Opponent Exploitation in Simplified Poker

Effective Short-Term Opponent Exploitation in Simplified Poker Bret Hoehn Valeriy Bulitko Finnegan Southey Centre for Science, Athabasca University Robert C. Holte University of Alberta, Dept. of Computing Science Abstract game trees and the strategies involve many parameters (e.g. two-player, limit Texas Hold'em requires O(1018) Uncertainty in poker stems from two key sources, the shuffled deck and an adversary whose strategy is un- parameters (Billings et al. 2003)). The game also has known. One approach is to find a pessimistic game theo- high variance, stemming from the deck and stochas- retic solution (i.e. a Nash equilibrium), but human play- tic opponents, and folding gives rise to partial obser- ers have idiosyncratic weaknesses that can be exploited vations. Strategically complex, the aim is not simply if a model of their strategy can be learned by observing to win but to maximize winnings by enticing a weakly- their play. However, games against humans last for at positioned opponent to bet. Finally, we cannot expect most a few hundred hands so learning must be fast to a large amount of data when playing human opponents. be effective. We explore two approaches to opponent You may play only 50 or 100 hands against a given op- modelling in the context of Kuhn poker, a small game ponent and want to quickly learn how to exploit them. for which game theoretic solutions are known. Param- eter estimation and expert algorithms are both studied. This research explores how rapidly we can gain an Experiments demonstrate that, even in this small game, advantage by observing opponent play given that only convergence to maximally exploitive solutions in a small a small number of hands will be played in total. Two number of hands is impractical, but that good (i.e. better learning approaches are studied: maximum a posteri- than Nash or breakeven) performance can be achieved in ori parameter estimation (parameter learning), and an a short period of time. Finally, we show that amongst a “experts” method derived from Exp3 (Auer et al. 1995) set of strategies with equal game theoretic value, in par- (strategy learning). Both will be described in detail. ticular the set of Nash equilibrium strategies, some are preferable because they speed learning of the opponent's While existing poker opponent modelling research strategy by exploring it more effectively. focuses on real-world games (Korb & Nicholson 1999; Billings et al. ), we systematically study a simpler ver- Introduction sion, reducing the game's intrinsic difficulty to show that, even in what might be considered a best case, the Poker is a game of imperfect information against an ad- problem is still hard. We start by assuming that the op- versary with an unknown, stochastic strategy. It rep- ponent's strategy is fixed. Tracking a non-stationary resents a tough challenge to artificial intelligence re- strategy is a hard problem and learning to exploit a search. Game theoretic approaches seek to approximate fixed strategy is clearly the first step. Next, we con- the Nash equilibrium (i.e. minimax) strategies of the sider the game of Kuhn poker (Kuhn 1950), a tiny game game (Koller & Pfeffer 1997; Billings et al. 2003), for which complete game theoretic analysis is available. but this represents a pessimistic worldview where we Finally, we evaluate learning in a two-phase manner; assume optimality in our opponent. Human players the first phase exploring and learning, while the sec- have weaknesses that can be exploited to obtain win- ond phase switches to pure exploitation based on what nings higher than the game-theoretic value of the game. was learned. We use this simplified framework to show Learning by observing their play allows us to exploit that learning to maximally exploit an opponent in a their idiosyncratic weaknesses. This can be done ei- small number of hands is not feasible. However, we ther directly, by learning a model of their strategy, or also demonstrate that some advantage can be rapidly at- indirectly, by identifying an effective counter-strategy. tained, making short-term learning a winning proposi- Several factors render this difficult in practice. First, tion. Finally, we observe that, amongst the set of Nash real-world poker games like Texas Hold'em have huge strategies for the learner (which are “safe” strategies), Copyright c 2005, American Association for Artificial Intel- the exploration inherent in some strategies facilitates ligence (www.aaai.org). All rights reserved. faster learning compared with other members of the set. AAAI-05 / 783 J|Q J|K Q|J Q|K K|J K|Q pass bet pass bet pass pass pass bet pass bet α 1−γ γ 1−γ γ 1−α α 1−α 1 1 pass pass bet bet bet pass bet bet pass bet pass pass pass bet 1 1−η η 1 1 1−ξ ξ 1 1−ξ ξ 1 1 1−η η -1 +1 -2 -2 +1 +1 +1 +1 +1 +2 pass pass bet pass bet bet 1 1−β β 1−β β 1 -1 -1 +2 -1 -2 +2 Player 1 Choice/Leaf Node Player 2 Choice Node Figure 1: Kuhn Poker game tree with dominated strategies removed Kuhn Poker ξ). The decisions governed by these parameters are Kuhn poker (Kuhn 1950) is a very simple, two-player shown in Figure 1. Kuhn determined that the set of game (P1 - Player 1, P2 - Player 2). The deck consists equilibrium strategies for P1 has the form (α, β; γ) = of three cards (J - Jack, Q - Queen, and K - King). There (γ=3; (1 + γ)=3; γ) for 0 ≤ γ ≤ 1. Thus, there is a are two actions available: bet and pass. The value of continuum of Nash strategies for P1 governed by a sin- each bet is 1. In the event of a showdown (players have gle parameter. There is only one Nash strategy for P2, matched bets), the player with the higher card wins the η = 1=3 and ξ = 1=3; all other P2 strategies can be pot (the King is highest and the Jack is lowest). A game exploited by P1. If either player plays an equilibrium proceeds as follows: strategy (and neither play dominated strategies), then P1 expects to lose at a rate of −1=18 per hand. Thus • Both players initially put an ante of 1 into the pot. P1 can only hope to win in the long run if P2 is playing • Each player is dealt a single card and the remaining suboptimally and P1 deviates from playing equilibrium card is unseen by either player. strategies to exploit errors in P2's play. Our discussion focuses on playing as P1 and exploiting P2, so all ob- • After the deal, P1 has the opportunity to bet or pass. servations and results are from this perspective. – If P1 bets in round one, then in round two P2 can: The strategy-space for P2 can be partitioned into the ∗ bet (calling P1's bet) and the game then ends in a 6 regions shown in Figure 2. Within each region, a sin- showdown, or gle P1 pure strategy gives maximal value to P1. For ∗ pass (folding) and forfeit the pot to P1. points on the lines dividing the regions, the bordering – If P1 passes in round one, then in round two P2 maximal strategies achieve the same value. The in- can: tersection of the three dividing lines is the Nash strategy for P2. Therefore, to maximally exploit P2, it is ∗ bet (in which case there is a third action where P1 sufficient to identify the region in which their strat- can bet and go to showdown, or pass and forfeit egy lies and then to play the corresponding P1 pure to P2), or strategy. Note that there are 8 pure strategies for P1: ∗ pass (game proceeds to a showdown). S1 = (0; 0; 0); S2 = (0; 0; 1); S3 = (0; 1; 0); : : : ; S7 = Figure 1 shows the game tree with P1's value for each (1; 1; 0); S8 = (1; 1; 1). Two of these (S1 and S8) are outcome. Note that the dominated strategies have been never the best response to any P2 strategy, so we need removed from this tree already. Informally, a dominated only consider the remaining six. strategy is one for which there exists an alternative strat- This natural division of P2's strategy space was used egy that offers equal or better value in any given situa- to obtain the suboptimal opponents for our study. Six tion. We eliminate these obvious sources of suboptimal opponent strategies were created by selecting a point play but note that non-dominated suboptimal strategies at random from each of the six regions. They are remain, so it is still possible to play suboptimally with O1 = (:25; :67); O2 = (:75; :8); O3 = (:67; :4); O4 = respect to a specific opponent. (:5; :29); O5 = (:25; :17); O6 = (:17; 2). All experi- The game has a well-known parametrization, in ments were run against these six opponents, although which P1's strategy can be summarized by three pa- we only have space to show results against representa- rameters (α, β, γ), and P2's by two parameters (η, tive opponents here. AAAI-05 / 784 ξ of how they play. A Beta distribution is characterized by two parameters, θ ≥ 0 and ! ≥ 0. The distribu- 1 tion can be understood as pretending that we have observed the opponent's choices several times in the past, θ ! S and that we observed choices one way and choices S 3 the other way. Thus, low values for this pair of param- 7 eters (e.g.

Effective Short-Term Opponent Exploitation in Simplified Poker

Game Theory 10

An Essay on Assignment Games

Improving Fictitious Play Reinforcement Learning with Expanding Models

Using Counterfactual Regret Minimization to Create Competitive Multiplayer Poker Agents

Games Lectures.Key

Successful Nash Equilibrium Agent for a 3-Player Imperfect-Information Game

Game Theory 10

Online Enhancement of Existing Nash Equilibrium Poker Agents

ECE 586BH: Problem Set 4: Problems and Solutions Extensive Form Games

Heads-Up Limit Hold'em Poker Is Solved

Α-Rank: Multi-Agent Evaluation by Evolution

Optimistic Regret Minimization for Extensive-Form Games Via Dilated Distance-Generating Functions∗