ULTIMATE TIC-TAC-TOE

Scott Powell, Alex Merrill

Professor: Professor Christman

An algorithmic solver for Ultimate Tic-Tac-Toe

May 2021

ABSTRACT

Ultimate Tic-Tac-Toe is a deterministic game played by two players where each player's turn has a direct effect on what options their opponent has. Each player's viable moves are determined by their opponent's previous turn, so players must decide whether the best move in the short term is actually the best move overall. Ultimate Tic-Tac-Toe relies entirely on strategy and decision-making. There are no random variables such as dice rolls to interfere with each player's strategy. This is relatively rare in the field of board games, which often use chance to determine turns. Because Ultimate Tic-Tac-Toe uses no random elements, it is a great choice for adversarial search algorithms. We may use the deterministic aspect of the game to our advantage by pruning the search trees to only contain moves that result in a good board state for the intelligent agent, and to only consider strong moves from the opponent. This improves the efficiency of the algorithm, allowing for an artificial intelligence capable of winning the game without spending extended periods of time evaluating each potential move. We create an intelligent agent capable of playing the game with strong moves using adversarial minimax search. We propose novel heuristics for evaluating the state of the game at any given point, and evaluate them against each other to determine the strongest heuristics.

TABLE OF CONTENTS

1 Introduction
  1.1 Problem Statement
  1.2 Related Work

2 Methods
  2.1 Simple Heuristic: Greedy
  2.2 New Heuristic
  2.3 Alpha-Beta Pruning and Depth Limit

3 Results

4 Discussion

5 Conclusion

Bibliography

LIST OF FIGURES

1.1 Example of a game move. Players alternate placing marks, and each player's mark determines which board their opponent will place a mark in next. In this figure, one player has placed a circle in the top-left corner of the center board. The second player must then place their next mark on the top-left board (outlined with a square).

1.2 Complete game of Ultimate Tic-Tac-Toe. O has three boards in a row, winning the game.

1.3 There is no winning strategy that involves winning the bottom-left board for any player. All moves should be focused on winning one of the two other remaining boards, but previous heuristics would determine that winning the bottom-left board is still a good move.

2.1 Each cell is assigned a unique number that can be used to specify a move. Each turn there is a range of valid moves, so if the player must move in the center board, they are prompted for a number between 36 and 44.

CHAPTER 1 INTRODUCTION

Ultimate Tic-Tac-Toe (UTTT) is a deterministic game with perfect information. Each move requires the consideration of nine individual Tic-Tac-Toe boards. While traditional Tic-Tac-Toe (TTT) has been solved thoroughly such that the best moves are almost universally known, the ultimate variant adds enough variables that the game is more similar to popular games like Chess or Checkers than to the single-board version of TTT. Each player's turn determines which board their opponent will place a mark on, adding additional levels of depth to decision making.

Deterministic games are interesting to study because people have an innate curiosity for finding optimal moves in board games, and artificial intelligence algorithms are capable of processing millions of game scenarios to evaluate moves at a level humans cannot compete with. Additionally, computers may be used as a tool by people looking to improve their gameplay. Players can select multiple difficulty levels to simulate opponents that provide the right amount of challenge. Many games, including Chess, Checkers, Gomoku, and Blokus, already have algorithms capable of finding optimal moves, but there is very little research on UTTT [4]. We propose a minimax algorithm that plays strong moves using a novel evaluation function.

UTTT is an interesting problem for artificial intelligence because it is one of the relatively few deterministic games with perfect information. This means that, among the valid moves for a turn, at least one move must lead to an outcome at least as good as any other. It is exceedingly difficult for a human to determine the best move, as it may depend on up to 80 subsequent moves. Hence, UTTT is a problem that humans are incapable of fully understanding, but a computer has no problem calculating the many potential board states that result from moves. There are other games that have similar rules (for example Chess, Checkers, and Gomoku), but they have received substantially more research. Thus, UTTT is a problem AI is well-equipped to solve, but there are not many intelligent agents capable of playing it.

1.1 Problem Statement

The Ultimate Tic-Tac-Toe board is composed of nine regular Tic-Tac-Toe boards, arranged in a 3x3 grid. This grid is stylized to look like one large Tic-Tac-Toe board, with each cell of the grid filled with an empty TTT board. These boards are played simultaneously. As in traditional TTT, players alternate selecting a cell to mark. However, in UTTT each mark also determines which of the nine TTT boards the opponent's next move must be played in. If on a turn player 1 is permitted to place a mark in the center board of the 3x3 grid, and they choose to place a mark in the top-left corner of that board, their opponent's next move would need to be on the top-left board of the 3x3 grid (see Figure 1.1). Play continues in this fashion until a player has won a board on the 3x3 grid. There is some debate about how players handle winning a small board. For example, what happens when a player places a mark that would result in their opponent's next move being in a board that has already been won? Early variants used a rule that required the opponent to place marks in the already-won board, to no effect [2], and if no moves are available there, the opponent may place a mark on any of the nine boards, in any unfilled position. Moves that may go on any board are known as wildcards. It has been proven that player 1 has a strategy to win the game in at most 43 moves if wildcards are only permitted when no valid move exists [2]. An alternative variation allows wildcards whenever a player's next move would be in an already-won board [3]. To our knowledge there is no guaranteed winning solution in this variation, which makes it an ideal problem for an intelligent agent to solve. When one player wins three boards in a row, they win the game (see Figure 1.2).

Figure 1.1: Example of a game move. Players alternate placing marks, and each player's mark determines which board their opponent will place a mark in next. In this figure, one player has placed a circle in the top-left corner of the center board. The second player must then place their next mark on the top-left board (outlined with a square).

We focus on minimax search trees to create an intelligent agent capable of playing Ultimate Tic-Tac-Toe at a high level. We propose a new minimax evaluation function that is capable of beating human players.

1.2 Related Work

To our knowledge, there is only one published paper on UTTT [2]. Other papers have either not been submitted for publication or have not been accepted into a journal [3].

The only published paper focuses on the older variant with fewer wildcards. Additionally, that paper focuses on an algorithmic solution to the game, forcing the opponent to play in useless squares. Essentially, it exploited a strategy where the first player could remove all agency from the second player and eventually win. The updated rules specifically address the exploit used in the paper. Using the updated ruleset, the guaranteed winning solution is no longer guaranteed.

Figure 1.2: Complete game of Ultimate Tic-Tac-Toe. O has three boards in a row, winning the game.

Other unpublished papers that focus on the new ruleset use artificial intelligence to create an intelligent agent that plays the game. As the game is an adversarial environment, common approaches focus on minimax search as well, using a variety of heuristics [3][1]. Many of these heuristics place a strong emphasis on having as many "won" boards as possible, meaning that oftentimes these heuristics will make greedy decisions that result in a win on a small board, even if winning the board does not further the goal of winning the game (winning a board is only useful if it prevents an opponent win or it can be used to make three in a row). Our goal is to focus on minimax search and create new heuristics that do not greedily focus on winning as many boards as possible. We do this by calculating a score for a given board state, based on both the smaller board states and the overall state of the game across boards. A board that does not block an opponent's win and does not help you win is given fewer points than a board that is critical for both players' winning strategies.

Figure 1.3: There is no winning strategy that involves winning the bottom-left board for any player. All moves should be focused on winning one of the two other remaining boards, but previous heuristics would determine that winning the bottom-left board is still a good move.

This results in more intelligent moves overall, as it avoids "traps" that arise when there is an available board to win that contributes nothing (see Figure 1.3).

CHAPTER 2 METHODS

Our project focuses on new heuristics and new ways of evaluating board states, and uses minimax search to find the optimal move. Our first task was to create a playable version of the game. Because our efforts are focused on the intelligent agent, we made a simple playable version using Python. The game is playable in the command line: it prints the board and requests a move from the player. To make a move, the player types in a number that corresponds to a specific cell on the board. For simplicity, a board containing each cell's number is printed whenever a move is requested (see Figure 2.1).

To create a data structure that represents the game board, we use a simple, one-dimensional array of size 81. Each index of this array corresponds to a cell on the UTTT board, with the indices described by Figure 2.1. This ordering is beneficial because it is easy to grab slices of the array that correspond to the smaller boards. We can also describe winning board states as combinations of cell indices: if one player had marks at the zeroth, first, and second indices of a board, they would have three in a row and would have won that board. Once we have a playable board, we create multiple heuristics to decide what the best move is for a given board state.
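The following is a minimal sketch of this representation. The helper names (small_board, board_winner, next_board) and the mark values are illustrative rather than the exact ones in our implementation, and it assumes cells within each small board are numbered in reading order as in Figure 2.1. It shows how a slice recovers one small board, how the index combinations describe wins, and how the position of a mark within its small board names the board the opponent is sent to.

    # Sketch of the flat 81-cell representation; EMPTY/X/O values are assumptions.
    EMPTY, X, O = 0, 1, 2

    # Index triples that complete a row on a single 3x3 board.
    WIN_CONDITIONS = [[0, 1, 2], [3, 4, 5], [6, 7, 8],
                      [0, 3, 6], [1, 4, 7], [2, 5, 8],
                      [0, 4, 8], [2, 4, 6]]

    def small_board(board, b):
        """Return the nine cells of small board b (0-8) as a slice of the flat array."""
        return board[9 * b: 9 * b + 9]

    def board_winner(cells):
        """Return the mark with three in a row on these nine cells, or None."""
        for a, b, c in WIN_CONDITIONS:
            if cells[a] != EMPTY and cells[a] == cells[b] == cells[c]:
                return cells[a]
        return None

    def next_board(move):
        """A mark's position inside its small board names the board the opponent must play in."""
        return move % 9

    board = [EMPTY] * 81
    board[36] = O              # O takes the top-left cell of the center board (cells 36-44)
    print(next_board(36))      # -> 0: the opponent is sent to the top-left board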

2.1 Simple Heuristic: Greedy

Figure 2.1: Each cell is assigned a unique number that can be used to specify a move. Each turn there is a range of valid moves, so if the player must move in the center board, they are prompted for a number between 36 and 44.

The first heuristic we created checks the board for the optimal move on a small board, disregarding the larger board's state. In other words, the move output by this heuristic is the move that results in the best state of the respective small board. To determine what makes a board state good, we simply count the number of winning lines that rely on a given square: there are four wins that involve the center square, three wins for each corner, and two wins for each edge. After the first move has occurred on a board, the number of remaining winning states is recalculated. If a player places a mark in the center, the opponent has only four possible winning states left. Each tile is assigned a new number corresponding to the remaining number of wins possible. If a tile wins the board for a player, it is assigned an arbitrarily high score to ensure it is chosen.
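A rough sketch of this scoring follows, reusing WIN_CONDITIONS from the earlier representation sketch. The function name greedy_score and the constant 1000 for an immediately winning cell are illustrative choices, not the exact values in our implementation.

    def greedy_score(cells, cell, player, opponent):
        """Score one empty cell of a single small board for `player`.
        `cells` is the nine-cell slice of that board."""
        score = 0
        for line in WIN_CONDITIONS:
            if cell not in line:
                continue
            others = [cells[i] for i in line if i != cell]
            if opponent in others:
                continue                 # this line can no longer be won by the player
            if others.count(player) == 2:
                return 1000              # placing here completes three in a row
            score += 1                   # line through this cell is still open
        return score

On an empty board this reproduces the 4/3/2 weighting of the center, corners, and edges, and the counts shrink as the opponent blocks lines.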

This heuristic performs better than random moves, but is susceptible to falling into the types of traps shown in Figure 1.3.

2.2 New Heuristic

Our new heuristic creates categories of board states, and adds or subtracts points depending on the number of times each category is present on the board. We compute how close each player is to winning a board using the following code:

    def count_one_away(board, agent, opponent):
        """Count, for each player, the lines on a nine-cell board that are one move from completion."""
        win_conditions = [[0, 1, 2], [3, 4, 5], [6, 7, 8],
                          [0, 3, 6], [1, 4, 7], [2, 5, 8],
                          [0, 4, 8], [2, 4, 6]]

        # Arrays representing how many moves are left for each possible win
        agent_dist = [3] * 8
        opp_dist = [3] * 8

        for x, wincon in enumerate(win_conditions):
            for i in wincon:
                if board[i] == opponent:
                    # Opponent's mark: their distance shrinks, and the
                    # line becomes impossible for the agent
                    opp_dist[x] -= 1
                    agent_dist[x] = 100
                elif board[i] == agent:
                    # Agent's mark: its distance shrinks, and the
                    # line becomes impossible for the opponent
                    agent_dist[x] -= 1
                    opp_dist[x] = 100

        # Count the lines each player can complete with a single move
        agent_ones = 0
        opp_ones = 0
        for i in range(8):
            if agent_dist[i] == 1:
                agent_ones += 1
            if opp_dist[i] == 1:
                opp_ones += 1

        return (agent_ones, opp_ones)

This code calculates, for each player, the number of lines on a board that are one move away from completion, and hence how close each player is to winning the board. If the agent can win in one move, the state of the board is better than if the agent had no marks on the board. This piece of code is surprisingly flexible, as it can be used to evaluate either a small board or the larger board of won boards. By applying this evaluation function to the larger board, we put a focus on pursuing moves that result in connectable wins, rather than the traps seen in Figure 1.3. Board states that result in a greater number of potential next-turn wins are evaluated as preferable, which is a key part of maintaining a lead in the game.
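To illustrate this flexibility, the same count can be applied to a nine-cell summary of the large board. The construction below is a sketch that reuses small_board and board_winner from the earlier representation sketch and count_one_away from the code above, and assumes agent and opponent hold the two players' marks.

    # Build a nine-cell "macro" board: one entry per small board, holding the
    # winner's mark if that board has been won and EMPTY otherwise.
    macro = [board_winner(small_board(board, b)) or EMPTY for b in range(9)]

    # The same one-move-from-completion count now describes the large board:
    # how many lines of won boards each player could complete with one more board.
    agent_threats, opp_threats = count_one_away(macro, agent, opponent)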

With this information, we can assign point values to the following board states:

1. Winning the game is worth infinity points. Losing the game is worth negative infinity points.

2. Winning or losing a board results in a gain or loss of 100 points.

3. If a board is won and it results in two won boards in a row (i.e. winning one more board would result in a won game), then an additional 200 points are added (this may occur multiple times if there are multiple paths to victory).

4. Winning a board that results in blocking three in a row for the opponent results in 150 points.

5. Winning a board that is already blocked by the opponent's boards results in -150 points.

6. Making two marks in a row on a small board adds 5 points.

7. Blocking an opponent win on a small board adds 20 points.

8. Making a move in a board that has no benefit to the player subtracts 20 points.

Our algorithm applies these rules to select the best moves according to minimax search. Assigning infinity points for winning is logical, as that is the final goal of the game. Additionally, winning a board generally awards points, but if the board contributes no value towards a win, 150 points are subtracted, resulting in a net -50 points for winning a useless board. This is consistent with the idea of the game, as once a board is won it can be a source of wildcards, which may result in unfavorable board states. Hence it is better to keep the useless boards open. This algorithm places a strong emphasis on winning boards that contribute to a win. It is better to have two connected boards than it is to have three boards that do not threaten victory. This strong emphasis is due to the nature of wildcards. The ability to threaten victory if given a wildcard prevents many moves from the opponent, resulting in more favorable board states over time. This means two boards in a row restrict the opponent's options more than three unconnected boards.

We also assign a value for preventing a win. If an opponent has two boards in a row, preventing the win not only results in an additional board for the player, but it also means that those two boards are unconnected, and the opponent needs two more boards to get any value out of the previously won boards. This means it is much harder for an opponent to win, and so the intelligent agent should actively seek out these positions. For the smaller boards, points are awarded for blocking a win and setting up a win, on a much smaller scale. This is consistent with the goals of the game, as the best move for a small board often results in an unfavorable board state elsewhere. The low point values mean the agent will select the best move on a board if it has no effect on the entire game, but a move that helps win on other boards is preferable to the move that best sets up victory on the small board, unless the small board contributes to victory.
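A condensed sketch of how such point values can be combined into a single score is shown below. It implements only rules 1, 2, and 6, uses a generic penalty for open opponent threats as a crude stand-in for the blocking-related rules, and reuses small_board, board_winner, and count_one_away from the earlier sketches; the bookkeeping in our actual implementation is more detailed.

    import math

    def evaluate(board, agent, opponent):
        """Simplified sketch of the point-value evaluation described above."""
        # Reduce the position to a nine-cell macro board of won small boards.
        macro = [board_winner(small_board(board, b)) or EMPTY for b in range(9)]

        winner = board_winner(macro)
        if winner == agent:                # rule 1: winning the game
            return math.inf
        if winner == opponent:             # rule 1: losing the game
            return -math.inf

        # Rule 2: 100 points gained or lost per won small board.
        score = 100 * (macro.count(agent) - macro.count(opponent))

        for b in range(9):
            if macro[b] != EMPTY:
                continue
            agent_ones, opp_ones = count_one_away(small_board(board, b), agent, opponent)
            score += 5 * agent_ones        # rule 6: two marks in an open line on a small board
            score -= 5 * opp_ones          # crude stand-in for the blocking-related rules

        return score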

2.3 Alpha-Beta Pruning and Depth Limit

Ultimate Tic-Tac-Toe has a very high branching factor. At the beginning of the game, each move has a branching factor of nine, resulting in a huge number of boards to evaluate. This gets even more problematic once a board has been won. Wildcard moves have a worst-case branching factor of 69, as marks cannot be placed in any won board and a wildcard can occur as early as the sixth move. This results in a huge increase in complexity and can slow the algorithm down, particularly as more opportunities for wildcards present themselves (when more boards are won). As a result, it is important that we prune the search tree to dramatically speed up decision making.

We do this using the standard alpha-beta pruning algorithm for minimax. While considering a move, the resulting board states are compared to the best board state encountered thus far. It is expected that the agent will always pick the best move for itself and the opponent will always choose the move that is worst for the agent. Thus, any move that is already known not to be the best (or worst) move does not need to be further explored. This dramatically speeds up decision-making.

Additionally, we impose a variable depth limit for minimax search. Especially early on, evaluating games to completion is prohibitively expensive. Each game may last up to 80 moves, which means that the total number of boards to evaluate is on the order of 9^80. This takes far too long to evaluate, so a depth limit is necessary. We found that a depth limit of seven allowed for smart moves in a reasonable period of time (about ten seconds). A depth limit of seven means 9^7 = 4,782,969 possible board states without pruning and without any wildcards. Increasing the limit results in an exponential increase in decision-making time.
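A compact sketch of this search is shown below. It assumes helper functions legal_moves (returning the cells a player may mark, handling the forced board and wildcards), apply_move, undo_move, and game_over exist, and reuses evaluate from the previous section; it is standard depth-limited alpha-beta minimax rather than our exact implementation.

    import math

    def minimax(board, depth, alpha, beta, maximizing, agent, opponent, last_move):
        """Depth-limited minimax with alpha-beta pruning (sketch)."""
        if depth == 0 or game_over(board):
            return evaluate(board, agent, opponent)

        if maximizing:
            best = -math.inf
            for move in legal_moves(board, last_move):
                apply_move(board, move, agent)
                best = max(best, minimax(board, depth - 1, alpha, beta, False,
                                         agent, opponent, move))
                undo_move(board, move)
                alpha = max(alpha, best)
                if alpha >= beta:          # opponent would never allow this line; prune
                    break
            return best
        else:
            best = math.inf
            for move in legal_moves(board, last_move):
                apply_move(board, move, opponent)
                best = min(best, minimax(board, depth - 1, alpha, beta, True,
                                         agent, opponent, move))
                undo_move(board, move)
                beta = min(beta, best)
                if beta <= alpha:          # agent would never allow this line; prune
                    break
            return best

At the root, the agent applies each legal move and keeps the one with the highest minimax value at the configured depth (seven in our strongest setting).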

CHAPTER 3 RESULTS

One way we tested our agent was by pitting it against different agents with different weights for blocking or winning the board. This is useful because it allows us to optimize the point values while evaluating its performance against other iterations of the AI. We additionally tested the agent against random moves and the simple heuristic detailed in Section 2.1. We had the agent play 100 games against each of these heuristics with a depth level of three to increase speed. Against random moves, our algorithm won 99 games, with one tie. Against the simple heuristic, our algorithm won 98 games, with two losses. At a depth level of seven, the agent beat both the random agent and the greedy heuristic 100% of the time. We (human players) were unable to beat our algorithm within five trials using a depth of eight, although the algorithm begins to take extended periods of time to evaluate (roughly 30 seconds per move).
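A hedged sketch of the kind of self-play harness behind these numbers is given below. The player functions random_player and minimax_player and the loop itself are illustrative (the real harness also records ties and losses and varies the depth limit), and the sketch reuses the helpers assumed in the search sketch above.

    import random

    def random_player(board, last_move, mark):
        """Baseline: pick uniformly among the currently legal cells."""
        return random.choice(legal_moves(board, last_move))

    def play_game(choose_x, choose_o):
        """Play one game between two move-selection functions; return X, O, or None for a tie."""
        board = [EMPTY] * 81
        last_move = None
        players = [(X, choose_x), (O, choose_o)]
        turn = 0
        while not game_over(board):
            mark, choose = players[turn % 2]
            move = choose(board, last_move, mark)
            apply_move(board, move, mark)
            last_move = move
            turn += 1
        macro = [board_winner(small_board(board, b)) or EMPTY for b in range(9)]
        return board_winner(macro)

    wins = sum(play_game(minimax_player, random_player) == X for _ in range(100))
    print(f"minimax agent won {wins} of 100 games against random moves")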

CHAPTER 4 DISCUSSION

The results indicate that this heuristic for evaluating the state of a board is promising, and that minimax is a good approach for creating an intelligent agent for Ultimate Tic-Tac-Toe. The superb performance against random moves and the greedy heuristic indicates that it is capable of playing the game at a high level. The 98% win rate against the greedy heuristic, even at a low depth, shows that this evaluation function prioritizes moves beyond a single board's strongest move. If the agent were not capable of seeing the big picture of the game, then we would expect it to tie with the greedy heuristic, and that is evidently not the case. While the algorithm performs well, there are still areas where it can be improved.

One potential improvement to the efficiency of the algorithm pertains to the usage of wildcards. It is not uncommon for two players to each be one wildcard away from winning. However, if the opponent is one move away from winning, granting any wildcard is an instant loss. We could further improve pruning by noting the point in a game where wildcards translate to a loss, and then pruning any decision that results in a wildcard. Because wildcards have such a large branching factor, it often takes time before the winning move is encountered. Currently, the algorithm will prune the rest of the tree once the winning (or losing) move is discovered, but this can still result in evaluating thousands of boards before that move is found, which is inefficient. Increased speed would allow for greater depth limits and stronger moves overall.

CHAPTER 5 CONCLUSION

We have proposed a new heuristic for evaluating board states for Ultimate Tic-Tac-Toe. We have used this heuristic to create an intelligent agent that plays the game competitively at a high level. The evaluation function we use incorporates not only the moves that bring the agent closer to winning a board, but also compares the utility of the available boards and prioritizes moves that result in winning more useful boards. At a depth of seven, this agent is capable of making a decision in under ten seconds, meaning live players do not have to wait long periods of time to complete a game against a strong opponent. Additionally, lowering the depth limit allows for tweaking the difficulty of the agent, with lower depth limits resulting in an easier computer opponent.

BIBLIOGRAPHY

[1] Eran Amar and Adi Ben Binyamin. AI agent for Ultimate Tic Tac Toe game. URL: https://www.cs.huji.ac.il/~ai/projects/2013/U2T3P/files/AI_Report.pdf.

[2] Guillaume Bertholon, Rémi Géraud-Stewart, Axel Kugelmann, Théo Lenoir, and David Naccache. At most 43 moves, at least 29: Optimal strategies and bounds for Ultimate Tic-Tac-Toe. CoRR, abs/2006.02353, 2020. URL: https://arxiv.org/abs/2006.02353.

[3] Eytan Lifshitz and David Tsurel. AI approaches to Ultimate Tic-Tac-Toe. URL: https://www.cs.huji.ac.il/~ai/projects/2013/UlitmateTic-Tac-Toe/files/report.pdf.

[4] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140-1144, 2018. URL: https://science.sciencemag.org/content/362/6419/1140, doi:10.1126/science.aar6404.
