Using Computer Vision Techniques to Play an Existing Video Game
CALIFORNIA STATE UNIVERSITY SAN MARCOS

PROJECT SIGNATURE PAGE

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE MASTER OF SCIENCE IN COMPUTER SCIENCE

PROJECT TITLE: Using Computer Vision Techniques to Play an Existing Video Game
AUTHOR: Christopher E. Erdelyi
DATE OF SUCCESSFUL DEFENSE: May 6, 2019

THE PROJECT HAS BEEN ACCEPTED BY THE PROJECT COMMITTEE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE.

Xin Ye          PROJECT COMMITTEE CHAIR          SIGNATURE          DATE
Shaun-bill Wu          PROJECT COMMITTEE MEMBER          SIGNATURE          DATE

Using Computer Vision Techniques to Play an Existing Video Game

Presented to the faculty of the College of Science and Mathematics at California State University, San Marcos

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Christopher Erdelyi
[email protected]
March 2019

Abstract

Game-playing algorithms are commonly implemented in video games to control non-player characters (hereafter, “NPCs”) in order to provide a richer or more competitive game environment. However, directly programming opponent algorithms into the game can cause the game-controlled NPCs to become predictable to human players over time. This can negatively impact player enjoyment and interest in the game, especially when the algorithm is supposed to compete against human opponents. To extend the revenue-generating lifespan of a game, the developers may wish to continually refine these algorithms, but each update would need to be downloaded to every player’s installed copy of the game. It would therefore be beneficial for the game-playing algorithm to run independently of the game itself, on a server that the developers can easily access and update. Furthermore, the same basic setup could be reused across many of the developer’s games by using computer vision to determine game state, rather than title-specific Application Program Interfaces (hereafter, “APIs”).

In this paper, we propose a method for playing a racing game using computer vision, controlling the game only through the inputs available to human players. Using the Open Source Computer Vision Library (hereafter known by its common name, “OpenCV”) to take screenshots of the game and apply various image processing techniques, we represent the game world in a form that an external driving algorithm can process. The driving algorithm then makes decisions based on the state of the processed image and sends inputs back to the game via keyboard emulation. The driving algorithm created for this project was tuned through more than 50 separate adjustments, and it was run multiple times at each adjustment to measure how far the player’s vehicle could travel before crashing or stalling. These results were then compared against a set of baseline tests in which random input was used to steer the vehicle. The results show that our computer vision-based approach is indeed promising and, if enhanced, could be used to compete successfully against human players.
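As a concrete illustration of the closed-loop pipeline summarized above (capture the emulator display, reduce the screenshot with OpenCV, let a driving algorithm choose an action, and emit that action as a keypress), the sketch below shows one way such a loop could be wired together. It is a minimal sketch under stated assumptions, not this project's implementation: Python with the opencv-python, numpy, mss, and pynput packages, the GAME_REGION coordinates, and the placeholder choose_action policy are all hypothetical choices introduced here for illustration.

```python
# Minimal sketch of the capture -> process -> decide -> act loop described in the
# abstract. The libraries (opencv-python, numpy, mss, pynput), screen region, and
# decision policy are assumptions for illustration, not the project's actual code.
import time

import cv2
import numpy as np
from mss import mss
from pynput.keyboard import Controller, Key

# Hypothetical location of the emulator window on the desktop.
GAME_REGION = {"top": 100, "left": 100, "width": 640, "height": 480}
keyboard = Controller()


def process_frame(frame_bgr):
    """Reduce a raw screenshot to line segments (grayscale -> threshold -> Canny
    -> Gaussian blur -> Hough), mirroring the processing chain shown in Figure 6."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    edges = cv2.Canny(binary, 50, 150)
    blurred = cv2.GaussianBlur(edges, (5, 5), 0)
    return cv2.HoughLinesP(blurred, 1, np.pi / 180, threshold=50,
                           minLineLength=20, maxLineGap=10)


def choose_action(lines):
    """Placeholder policy: steer away from the side with more detected segments.
    The real driving algorithm is the subject of Section 5.5."""
    if lines is None:
        return Key.up  # nothing detected; keep accelerating
    mid = GAME_REGION["width"] / 2
    left = sum(1 for seg in lines if (seg[0][0] + seg[0][2]) / 2 < mid)
    right = len(lines) - left
    return Key.left if right > left else Key.right


with mss() as screen:
    while True:                                              # closed-loop cycle (Figure 5)
        shot = np.array(screen.grab(GAME_REGION))            # capture emulator display
        frame = cv2.cvtColor(shot, cv2.COLOR_BGRA2BGR)       # mss returns BGRA pixels
        action = choose_action(process_frame(frame))         # decide the next input
        keyboard.press(action)                               # emulate a keypress
        time.sleep(0.05)
        keyboard.release(action)
```

In the project itself, the image processing chain (grayscale conversion, thresholding, Canny edge detection, Gaussian blurring, and Hough lines) and the driving decision logic are described in Sections 3.2, 5.3, and 5.5, and those are the components that were tuned in the experiments of Section 6.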
Acknowledgements

Thank you, Dr. Ye, for your suggestions on computer vision and driving algorithm design, and for guiding me throughout the research project. I also thank my family, friends, and coworkers for their patience and support while I completed the Master's program.

Table of Contents

List of Abbreviations and Definitions
1. Introduction and Background
2. Related Work
   2.1 DeepMind: Capture the Flag
   2.2 OpenCV: Grand Theft Auto V
3. Program Flow Explanation and Diagrams
   3.1 Overall Process Flow for Experiment
   3.2 OpenCV Image Manipulation: Functions and Order of Operations
   3.3 Visual Analysis Processing Steps
4. Hardware and Software Utilized
5. Approach and Implementation
   5.1 Approach
   5.2 Capture Emulator Display
   5.3 Overlay Mask on Screen Capture
   5.4 Examine Processed Image
   5.5 Driving Algorithm Chooses Next Input Action
   5.6 Emulate System Keypresses
6. Experimental Results
   6.1 Setup and Baseline Results
   6.2 Driving Algorithm Tuning: Iterative Results
   6.3 Experiment Results on Static Drive Algorithm Configuration
   6.4 Driving Behavior
7. Conclusion and Future Work
References
External Figure References

Table of Figures

Figure 1. Screengrab of a video demo for DeepMind playing Capture the Flag. [External Figure 1]
Figure 2. Canny edge detection on GTA V. [External Figure 2]
Figure 3. Hough lines on a GTA V image. [External Figure 3]
Figure 4. Lane marker overlay in GTA V. [External Figure 4]
Figure 5. Closed-loop process cycle.
Figure 6. OpenCV processing steps for emulator screenshots.
Figure 7. Visual analysis steps.
Figure 8. Grayscale image conversion.
Figure 9. Threshold function generated black and white image.
Figure 10. Processed game image after Canny edge detection is applied.
Figure 11. Processed game image after Gaussian blurring has been applied to Canny edges.
Figure 12. Processed game image with Hough lines.
Figure 13. Processed game image after second application of Hough lines.
Figure 14. Turns navigated vs. algorithm tuning iteration.
Figure 15. Trial results over 30 attempts.

List of Abbreviations and Definitions

API: Application Program Interface. In this paper, it refers to communication definitions or tools that allow one program to interact with another directly.
CPU: Central Processing Unit. The general-purpose computing cores used in personal computers.
GPU: Graphics Processing Unit. Computing cores architected to specialize in computer graphics generation.
NPC: Non-Player Character. An in-game avatar which may act and look similar to a human player's avatar, but is controlled by the game itself.
OpenCV: Open Source Computer Vision Library. An open-source library of functions that allow for real-time computer vision.
OS: Operating System. Software which manages computer hardware, software, and services.
PAL: Phase Alternating Line. An analogue television encoding standard with a resolution of 576 interlaced lines.
RAM: Random Access Memory. Computer memory used for temporary program storage.
ROM: Read-Only Memory. In this paper, it refers to the test game's program file. The name originated with cartridge-based video games, which were stored on solid-state memory chips that could not be written to.

1. Introduction and Background

In today's video games, one common requirement of the main game program is to control a wide variety of non-player characters that interact with the human player. These non-player characters, or "NPCs," can be cooperative characters, enemies, or environmental figures that add decoration and flair to the game's world. Traditionally, computer-controlled enemy players, or "bots," are controlled by hard-coded logic within a game [25]. Games have typically implemented various forms of pathfinding algorithms to control their NPCs. These pathfinding methods require a full understanding of