Playing Frogger using Multi-step Q-learning

Bachelor's Project Thesis

Jip Maijers, s2605376, [email protected], Supervisor: Dr M.A. Wiering, [email protected]

Abstract: This thesis is an extension of the Bachelor's thesis by Tomas van der Velde about training an agent to play the game Frogger using Q-learning. In this thesis 2-step Q-learning and 4-step Q-learning with two different reward functions (an action-based reward function and a distance-based reward function) are used to see if it is possible for agents trained with these algorithms to outperform agents trained with regular Q-learning. The game Frogger can be described as a two-part game, where the aim of the first part is crossing a road and the goal of the second part is crossing a river. In previous research it was found that crossing the river was the biggest obstacle for agents, which is why two separate neural networks were again used. The results show that the action-based reward function outperforms the distance-based reward function and that both 4-step and 2-step Q-learning improved upon the performance of regular Q-learning, with 4-step Q-learning performing the best of the three in terms of road completion, and 2-step Q-learning in terms of points gained and overall win rate.

Keywords: Machine learning, Reinforcement learning, Q-learning, Multi-step Q-learning, neural networks, video games, vision grids

1 Introduction

Reinforcement Learning (RL) (Sutton and Barto, 1998) is one of the three machine learning paradigms and is usually applied in systems with large state spaces or dynamic environments, as it does not depend on searching through large sets of future states when determining what action to take, but rather acts on an exploration-versus-exploitation principle. Perhaps the best-known example of reinforcement learning in use comes from the team of Google DeepMind, for example in projects like AlphaGo (Silver, Huang, Maddison, Guez, Sifre, van den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, Dieleman, Grewe, Nham, Kalchbrenner, Sutskever, Lillicrap, Leach, Kavukcuoglu, Graepel, and Hassabis, 2016), but also in games through their project "DQN", whose team also wrote an article (Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, Petersen, Beattie, Sadik, Antonoglou, King, Kumaran, Wierstra, Legg, and Hassabis, 2015) in which reinforcement learning was used to train a computer to play different Atari 2600 games better than humans.

Q-learning (Watkins and Dayan, 1992) is one of the most common reinforcement learning algorithms and has been used in a variety of environments, from traffic control (Abdoos, Mozayani, and Bazzan, 2011) to stock trading (Lee, Park, O, Lee, and Hong, 2007). The application of Q-learning as a training method for neural networks has also been studied with interesting results, for example in the arcade game Ms. Pac-Man (Bom, Henken, and Wiering, 2013). In all of these applications Q-learning succeeds in creating well-performing agents, often even outperforming humans (Mnih et al., 2015). Combining neural networks with Q-learning in games works well because Q-learning is a model-free algorithm, so a modeller is free to build a state representation that fits the game of choice.

Multi-step Q-learning (Peng and Williams, 1996) is a variant of Q-learning in which the algorithm looks a predetermined number of steps further ahead to update its Q-values, which allows it to perform well in fuzzier environments. In some environments with coarser state spaces Multi-step Q-learning performs better than regular Q-learning, which is why it can be used to determine an optimal number of future states to take into consideration when building an agent for such environments. We will go into more detail about the workings of both Q-learning and Multi-step Q-learning in section 3.
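To make the difference between the two update targets concrete, the sketch below contrasts the regular one-step Q-learning target with an n-step target in the spirit of Multi-step Q-learning. It is only an illustrative outline: the function names, the discount factor of 0.99 and the learning rate of 0.1 are placeholders invented for this example, and the exact update rules and reward functions used in this thesis are the ones described in section 3.

    import numpy as np

    def one_step_target(reward, next_q_values, gamma=0.99):
        # Regular Q-learning: one observed reward, then bootstrap from the
        # highest Q-value of the next state.
        return reward + gamma * np.max(next_q_values)

    def n_step_target(rewards, q_values_after_n_steps, gamma=0.99):
        # Multi-step Q-learning with n = len(rewards): sum the n observed,
        # discounted rewards, then bootstrap from the highest Q-value of the
        # state reached after n steps.
        discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
        return discounted + gamma ** len(rewards) * np.max(q_values_after_n_steps)

    def q_update(old_q, target, learning_rate=0.1):
        # Move the stored estimate for the chosen state-action pair towards the target.
        return old_q + learning_rate * (target - old_q)

With a single reward the two targets coincide; with two or four rewards the agent propagates reward information over two or four moves at once, which is the comparison made in this thesis.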
1.1 Frogger

Frogger is an arcade game that was released in 1981 by Konami. In the game the player takes control of a frog that has to move vertically to cross a road and a river. Along the way the frog has to avoid obstacles and in the end fill five goals. A visual representation of our version of the game can be found in figure 2.1. The game runs at 60 frames per second (fps) and a player has five options at each frame of the game: move up, move down, move left, move right or do nothing. The places the frog can occupy in the game world can be represented by a grid of 13 rows and 13 columns, and each action the player takes moves the frog to the corresponding space, meaning that from the start of the game a player can move six spaces to the left or right and 13 spaces up.

To win the game a player generally follows a fairly straightforward set of actions. First, the player needs to cross a road. The road is made up of five "lanes" on which different combinations of agents, "cars", move horizontally. Each lane has a different type of car, which moves at a speed corresponding to that type, and cars in odd-numbered lanes (counted from the lane on which the frog starts) move to the left while those in even-numbered lanes move to the right. The player needs to avoid getting hit by these cars, which means they have to perform a combination of the following strategies: wait for a car to pass, move up, or move away, for example to the left. Being hit by a car causes the frog to die, which resets the frog to its starting space and empties all of the filled goals, meaning that to win a player needs to fill all five goals without dying once.

After the player has successfully traversed the road they have to cross a river. The river is again made up of five lanes, but in these lanes the frog can only cross by jumping on the agents in the lane, as touching the "water" kills the frog. The agents in these lanes are logs and groups of turtles of varying lengths, and each lane has its own speed and direction in which the agents move. Finally, the turtles "dive" at a certain interval, which makes them kill the player if the frog jumps on them. In addition, a frog dies when jumping on a goal it has already filled, meaning the player has to learn not to just fill one goal. A difficult part of the game is that the five goals are above the top lane, which is filled with turtles. This means that a player has to plan their moves in such a way that the agents in that lane carry them to the correct position to fill a goal, which can be quite hard for the outer goals.

In the game of Frogger the player has three lives, and each life has a timer of 60 seconds to complete the level. Players are awarded points based on how well they do: 10 points for each row the frog has not reached before, 50 points plus 10 points for every half second of time left when the frog reaches a goal, and 1000 points for filling all five goals. In the original game, 200 bonus points are awarded for eating flies that randomly spawn on the map or for jumping on a frog that spawns randomly on a log before jumping on a goal, and a player who scores more than 20,000 points earns an additional life. We chose to leave these ways of earning bonus points out of our implementation of the game, as they add an extra factor of randomness to the state space which we decided would unnecessarily complicate the learning process.

The process described above covers the first level of the arcade game. In the original game there are further levels, which swap some logs for dangerous crocodiles, add extra threats to regions that were safe before and add power-ups. For this project we will focus exclusively on the first level of the game.
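Read literally, the scoring rules above fit in a few lines. The sketch below only illustrates those rules as stated; the function and argument names are invented for this example and do not come from our implementation, which is described in section 2.

    def points_for_move(entered_new_row, reached_goal, seconds_left, filled_all_goals):
        # Points awarded for a single move, following the rules listed above.
        points = 0
        if entered_new_row:
            points += 10                                # a row the frog has not reached before
        if reached_goal:
            points += 50 + 10 * int(seconds_left * 2)   # 10 points per half second remaining
        if filled_all_goals:
            points += 1000                              # bonus for filling the fifth and final goal
        return points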
1.2 Reasons for study

This thesis will serve as an extension of the Bachelor's thesis by Tomas van der Velde (Van der Velde, 2018). That thesis made an attempt to solve Frogger using Q-learning and a multi-layer perceptron (MLP) (Rumelhart, Hinton, and Williams, 1986), but unfortunately only succeeded in solving the road part of the game: the agents proved incapable of successfully crossing the river. In this thesis we will tweak the input vector of that thesis in the hope that this will guide the agents to the goals better, and we will employ Multi-step Q-learning to see if more planning ahead can solve the game in its entirety. The research question we will answer in this thesis is as follows: "Does an agent trained with Multi-step Q-learning perform better at playing the game Frogger than an agent trained with regular Q-learning?"

1.3 Outline of thesis

In this thesis we will describe the research we did to try to solve the game Frogger. In section 2 we will go into the implementation of the game, the input vector and the neural networks used.