Intelligent Bomberman with Reinforcement Learning

Ngo Hung Minh Triet
University of Twente
P.O. Box 217, 7500AE Enschede
The Netherlands
[email protected]

ABSTRACT
Bomberman is a strategic, maze-based game in which players defeat their enemies by placing multi-directional count-down bombs that explode and destroy obstacles and other players. In this paper, a simplified version of Bomberman is implemented in Java, in which differently controlled agents are placed. Each agent represents one of four reinforcement learning methods: Q-Learning, Sarsa, Double Q-Learning, and Deep Q Neural Network, combined with one of two state representations: 5-tiles information and complete information. We then investigate whether a specific reinforcement learning method can successfully learn to play Bomberman efficiently by evaluating the agents against ad-hoc agents and finally against each other. The configuration of 5-tiles information with Sarsa achieves the best overall quantitative results.

Keywords
Reinforcement Learning, Bomberman, Computer Game, Q-Learning, Neural network, Deep Reinforcement Learning

1. INTRODUCTION
In the past decades, Reinforcement Learning has been gaining more attention. Inspired by animal learning theories, reinforcement learning (RL) is built on the idea that an agent can deduce from its experience which actions to take in order to maximize predefined values. The agent learns by interacting with the environment and receiving feedback that indicates the values or rewards of its actions. The objective is that, after the learning process, the agent will perceive a specific state of an environment and reason about which actions would yield the most reward [11, 16].

Nowadays, there are many studies on creating reinforcement learning agents for various genres of games, ranging from traditional grid games such as shogi, Go, and chess [13, 15, 9, 12, 14] to modern computer games such as Mario, Atari, Bomberman, and Ms. Pac-Man [8, 4, 2], in order to improve the efficiency and win rate of reinforcement learning agents against other reinforcement learning agents or against human champions of the corresponding games.

In this study, we show the correlation between reinforcement learning algorithms with different state representations and the ability to learn and play the game Bomberman. We used four reinforcement learning algorithms: Q-Learning, Sarsa, Double Q-Learning, and Deep Q Neural Network, and two state representations: 5-tiles information and complete information. Each configuration is tested with three tests: the learnability test, the advance test, and the comparative test. In each of the tests, the agents are evaluated per generation, where each generation consists of 10000 training episodes and 100 evaluation episodes. Win rates and average rewards are then extracted to make comparisons. The results show that although simple Q-Learning works, its performance deteriorates in scenarios with a larger state space, whereas a preliminary version of the Deep Q network shows superior results and looks promising compared to Q-Learning, owing to its use of a neural network to tackle the ample state space. In the end, the 5-tiles information with Sarsa configuration achieves the best overall quantitative results.

In Section 2, we introduce the game and describe the implementation of simple Bomberman in Java with different state representations. Then, in Section 3, we give an overview of the four reinforcement learning algorithms we used for the agents. After that, we present the experimental setups and results in Section 4. In Section 5, we discuss the research process and draw reflections from our research. Finally, we end the paper by summarising our work and discussing some future improvements.

2. BOMBERMAN
In 1983, Hudson Soft published the original variant of Bomberman, and new generations of Bomberman are still being published to this day. Nowadays, there are almost 100 different variants of Bomberman games released on various platforms [7]. In the original variant of Bomberman, which is named Bakudan Otoko in Japan but is called Bomber Man in most releases, the player is trapped in a maze with monsters. In each stage, the objective is to find the exit and eliminate all monsters by moving around the maze, placing bombs, and obtaining power-ups that empower the player's bombs. When the goal is met, the player can proceed to the next stage. While single-player campaign modes occupy a large portion of Bomberman games, we focus on the multiplayer variant in this research, where players are trapped in the same maze and their only objective is to eliminate each other by placing bombs strategically until the last player standing is the winner. Furthermore, since a match could otherwise go on forever, challenges that help end the game appear over the iterations, such as extra power-ups or randomly placed bombs, to speed up the termination of the game.

2.1 Environment
We implemented an environment that represents a simplified version of Bomberman. The core of the environment is written in Java, with JavaFX as the framework that visualizes the game. The grid size can be customized, but in our experiments a board of size 7x7 (including the unbreakable border) is used, with its environment configured as in Figure 1. The number of players differs per test type but ranges from two to four. There are two types of agents: ad-hoc agents and learning agents. Ad-hoc agents consist of stationary agents and random agents; this type of agent is used to assess the learnability and compatibility (the quality of the agents' policies) of a learning agent playing Bomberman. Each learning agent is implemented as a combination of a reinforcement learning algorithm and a state representation. Each turn, the agents receive the state representation from the environment (the game), which they use to determine their next move. After receiving the moves of all active agents, the environment executes them, sends the results and rewards back to the agents, and the cycle repeats until at most one agent remains on the field.
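To make this turn cycle concrete, the sketch below expresses the interaction loop in Java, the language of our implementation. It is a minimal illustration under stated assumptions, not the paper's actual code: all type and method names (Action, GameState, Agent, Environment, EpisodeRunner) are hypothetical.

import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; the real implementation may differ.
enum Action { UP, DOWN, LEFT, RIGHT, PLACE_BOMB, WAIT }

interface GameState { } // marker for whichever state representation is in use

interface Agent {
    Action chooseAction(GameState observation);            // decide the next move
    void receiveFeedback(GameState result, double reward); // learning agents update their policy here
}

interface Environment {
    int agentsAlive();
    GameState observe(Agent agent);   // the state representation as seen by this agent
    double rewardFor(Agent agent);    // reward earned in the last executed turn
    void execute(List<Action> moves); // apply all moves of the turn at once
}

final class EpisodeRunner {
    // Runs one episode: all agents act each turn until at most one remains.
    static void runEpisode(Environment env, List<Agent> agents) {
        while (env.agentsAlive() > 1) {
            List<Action> moves = new ArrayList<>();
            for (Agent agent : agents) {
                moves.add(agent.chooseAction(env.observe(agent)));
            }
            env.execute(moves);
            for (Agent agent : agents) {
                agent.receiveFeedback(env.observe(agent), env.rewardFor(agent));
            }
        }
    }
}

Note the synchronous structure: all moves of a turn are collected first and then applied at once, so no agent gains an advantage from acting earlier within a turn.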
Figure 1. The game configuration used for all tests in this study. Blue tiles are breakable walls, deep brown tiles are unbreakable walls, and the learning agents (in this case, four) are represented by human-like figures distinguished by color. This configuration separates the agents with walls, and each agent is contained in a smaller sub-grid, which turns out to be a balance between complexity and fairness. To win the game, the agents need to learn how to break walls, approach their opponents, and then eliminate them.

The rules of the game are as follows: each turn, the agent [...] for all the agents in the game. This probability increases linearly with the turn count up to 1.0, which prevents the game from running forever. Therefore, to win the game, the agents must learn how to dodge bombs and finish their opponents as soon as possible.

2.2 State representation
According to Figure 1, there are 36 changeable tiles (free/breakable/unbreakable), 36 tiles on which a bomb can be present or absent, and 36 tiles for the agents (ranging from two to four) to move into, so the number of states can be approximated as 3^36 · 2^36 · 4^36 ≈ 4.9 · 10^49 different game states in total. In this part, we introduce two ways to represent the game. One of them shows the complete information of the game, which results in the enormous state space mentioned above. The other is an attempt to minimize the state space by limiting the agent's knowledge (how much the agent can perceive from the game/environment) and showing the agent only its surroundings.

2.2.1 Complete information
In this configuration, the game is converted into a flattenable 2D array of the same size as the board that contains all the information of the current game state. Each cell in the array holds the information of the respective tile in the game, computed from the following values:

• Tile type: Free, Breakable wall, or Obstacle (Unbreakable wall or Bomb).
• Whether the tile contains the agent itself.
• Whether the tile contains any of the other agents.
• The foreseeable explosion countdown of the tile. In other words, the agent knows whether there will be an explosion on this tile, together with the precise countdown.

As mentioned in the game rules section, the bombs in this game can be chained to create a bigger explosion. Therefore, the environment uses BFS (Breadth-First Search) to calculate the foreseeable explosion countdown: tiles that will explode together are grouped, and the countdown value of the group is equal to that of the bomb with the lowest countdown in the group (a sketch of this grouping is given below). This way, the agent knows whether a tile is safe based not merely on the countdown of the nearest bomb but on the precise countdown of the explosion that will actually reach the tile, even when it is triggered in a chain by another bomb. In this configuration, the number of states is equal to the calculated number of game states, which is ≈ 4.9 · 10^49.
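As an illustration, a minimal Java sketch of this BFS grouping follows. It assumes a simple bomb model (board position, blast range, countdown) and a blocked-tile grid whose outer border is unbreakable; the names and exact data layout are hypothetical rather than the paper's actual code, and bomb-to-bomb reachability is treated symmetrically, matching the group-minimum rule described above.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical bomb model: board position, blast range, turns until detonation.
record Bomb(int x, int y, int range, int countdown) { }

final class ExplosionMap {
    private static final int[][] DIRS = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };

    // For every tile, compute the foreseeable explosion countdown (-1 when no
    // explosion threatens the tile). Bombs whose blasts reach one another are
    // grouped with BFS, and every tile covered by a group inherits the group's
    // lowest countdown. Assumes the border tiles of `blocked` are walls.
    static int[][] countdowns(List<Bomb> bombs, boolean[][] blocked) {
        int[][] result = new int[blocked.length][blocked[0].length];
        for (int[] row : result) Arrays.fill(row, -1);

        boolean[] visited = new boolean[bombs.size()];
        for (int i = 0; i < bombs.size(); i++) {
            if (visited[i]) continue;
            List<Bomb> group = new ArrayList<>();
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(i);
            visited[i] = true;
            int groupCountdown = Integer.MAX_VALUE;
            while (!queue.isEmpty()) { // BFS over chained bombs
                Bomb bomb = bombs.get(queue.poll());
                group.add(bomb);
                groupCountdown = Math.min(groupCountdown, bomb.countdown());
                for (int j = 0; j < bombs.size(); j++) {
                    Bomb other = bombs.get(j);
                    if (!visited[j] && blastCovers(bomb, other.x(), other.y(), blocked)) {
                        visited[j] = true; // this bomb sets off the other one
                        queue.add(j);
                    }
                }
            }
            for (Bomb bomb : group) { // the whole group explodes together
                markBlast(bomb, groupCountdown, result, blocked);
            }
        }
        return result;
    }

    // True when tile (tx, ty) is the bomb's own tile or lies on one of its four blast rays.
    private static boolean blastCovers(Bomb b, int tx, int ty, boolean[][] blocked) {
        if (b.x() == tx && b.y() == ty) return true;
        for (int[] d : DIRS) {
            for (int r = 1; r <= b.range(); r++) {
                int x = b.x() + d[0] * r, y = b.y() + d[1] * r;
                if (blocked[y][x]) break; // walls stop the blast ray
                if (x == tx && y == ty) return true;
            }
        }
        return false;
    }

    // Write the group countdown onto every tile the bomb's blast reaches.
    private static void markBlast(Bomb b, int countdown, int[][] result, boolean[][] blocked) {
        mark(result, b.x(), b.y(), countdown);
        for (int[] d : DIRS) {
            for (int r = 1; r <= b.range(); r++) {
                int x = b.x() + d[0] * r, y = b.y() + d[1] * r;
                if (blocked[y][x]) break;
                mark(result, x, y, countdown);
            }
        }
    }

    private static void mark(int[][] result, int x, int y, int countdown) {
        if (result[y][x] == -1 || countdown < result[y][x]) result[y][x] = countdown;
    }
}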
2.2.2 5-tiles information
Since the main goal of this study is to evaluate how well simple reinforcement learning algorithms perform in Bomberman, we think it is not feasible to store or explore a state space whose size is a 50-digit number.
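As a rough illustration of the idea of restricting the agent to its surroundings, the sketch below builds such a reduced observation, assuming (as the name suggests) that the five tiles are the agent's own tile plus its four orthogonal neighbours; the type and field names are hypothetical, mirroring the per-tile values listed in Section 2.2.1, and the paper's actual encoding may differ.

import java.util.List;

// Hypothetical per-tile observation, mirroring the values of the
// complete-information representation.
record TileInfo(int tileType, boolean hasSelf, boolean hasOtherAgent, int explosionCountdown) { }

final class FiveTilesView {
    // Reduced observation: the agent's own tile plus its four orthogonal
    // neighbours (the assumed meaning of "5-tiles information"). The
    // unbreakable border guarantees that all four neighbours exist.
    static List<TileInfo> observe(TileInfo[][] board, int ax, int ay) {
        return List.of(
                board[ay][ax],      // the tile the agent stands on
                board[ay - 1][ax],  // up
                board[ay + 1][ax],  // down
                board[ay][ax - 1],  // left
                board[ay][ax + 1]); // right
    }
}

Under this assumption, the observation depends only on five tiles rather than the whole board, which shrinks the reachable state space by many orders of magnitude compared with the ≈ 4.9 · 10^49 complete-information states.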