Sprite Tracking in Two-Dimensional Video Games

by Elizabeth M. Shen

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2017

© Massachusetts Institute of Technology 2017. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science May 23, 2017

Certified by...... Philip Tan Research Scientist Thesis Supervisor May 23, 2017

Accepted by ...... Christopher J. Terman Chairman, Masters of Engineering Thesis Committee May 23, 2017

Submitted to the Department of Electrical Engineering and Computer Science on May 23, 2017, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

I explore various computer vision techniques and their application towards processing and extracting information from two-dimensional video games. The bulk of existing research is designed to work on real-world images, and thus makes assumptions about the world that do not translate to synthetic, stylized environments. Processing footage has promising applications in competitive gaming, such as analyzing strategy in multiplayer online games, or optimizing routes in speedrunning. I present the exploratory results, details of a successful algorithm, and some sample applications.

Thesis Supervisor: Philip Tan Title: Research Scientist

Acknowledgments

I would like to thank my advisor Philip Tan for his guidance of my research. I thought it unlikely to get support for my niche interest, but I was happy to be wrong. I am also tremendously grateful towards the Institute; the past five years have been challenging but equally rewarding, and I wouldn’t have it any other way. Finally, I would like to thank my parents. I would not have reached this point without their support of my education.

Contents

1 Introduction 8
  1.1 Motivation ...... 9
  1.2 Existing Work ...... 10

2 Comparing 2D and 3D Environments 12
  2.1 Camera Position and Movement ...... 12
  2.2 Object borders ...... 13
  2.3 Object similarity ...... 14

3 Image Stitching 15
  3.1 SIFT ...... 15
  3.2 Phase correlation ...... 16

4 Motion tracking 20
  4.1 Optical Flow ...... 20
    4.1.1 Results ...... 21
  4.2 Tracking with phase correlation ...... 21

5 Algorithm 22
  5.1 Introduction ...... 22
  5.2 Implementation ...... 24
  5.3 Evaluation ...... 24
    5.3.1 Performance ...... 24
    5.3.2 Correctness ...... 24

6 Results and Applications 26
  6.1 PAC-MAN ...... 26
  6.2 Defender ...... 27
  6.3 Dota 2 ...... 27

7 Conclusion 30
  7.1 Future work ...... 30
    7.1.1 Log-polar cross-correlation ...... 30
    7.1.2 Handling reflection ...... 30
    7.1.3 Handling discontinuity ...... 31

List of Figures

1-1 Example of information in StarCraft provided to the AI versus in-game image...... 10

2-1 In Stardew Valley [1], the camera is centered on the player unless they are on the boundary of the screen...... 13
2-2 The player sprite is not distinct enough from the background. The regions marked with colored dots were detected as points of interest for having sharp contrast and corners...... 13

3-1 Four frames from Crypt of the Necrodancer (Brace Yourself Games [6]). Left to right represents movement over time...... 16
3-2 Left: Panorama generated from a SIFT implementation. Right: Panorama generated from manual stitching...... 16
3-3 Calculating the offset between image subsections...... 18
3-4 Three sampled frames from Crypt of the Necrodancer. The boxed regions were used to register their relative position...... 18
3-5 The panorama resulting from joining the frames in Figure 3-4...... 19

5-1 Due to the difference in orientation, the algorithm incorrectly places the sprite location at the orange ghost...... 23
5-2 Plot of the scoring function. Highlighted points are the true location of the PAC-MAN (Namco [7]) sprite and the mistaken identification...... 23

6-1 Automated tracking...... 26
6-2 Manual tracking...... 26

6-3 Consecutively fused frames from Defender. The old components are marked in purple and the new components are green for contrast...... 28
6-4 Our phase correlation method fails to pinpoint the sprite accurately; it should mark the top left...... 29
6-5 Several detected keypoints and their paths over time...... 29

Chapter 1

Introduction

The goal of this work was to find an automated method to convert live footage or recordings of third-person perspective, stylized video games into a single panorama representing the game world and the path taken by the player. To do this, we needed to solve two problems: tracking the player location, and stitching video frames together. First, we provide necessary background information on important characteristics of video games, competitive gaming, and the surrounding industries. Next, we explain the differences observed in the environments generated by two-dimensional video games which make existing computer vision algorithms unreliable. Chapter 3 describes methods used for image stitching, and Chapter 4 describes several approaches tested to track object movement. Chapter 5 describes our phase-correlation algorithm, which was found to be the most successful; it also provides implementation details and analyzes the results. Chapter 6 presents results applied to several video games, and describes additional applications. Finally, Chapter 7 summarizes the work and suggests directions for future work.

1.1 Motivation

In many video games, the player can only see a small portion of the world, allowing for a sense of discovery and preventing an overflow of information. However, for aesthetic or competitive reasons, it may be beneficial to the player to see a view of the whole world. For example, team-based multiplayer online battle arena games (MOBAs) such as League of Legends and Dota 2 have several billion hours played every month, and support tournaments and professional players with millions of dollars in prize money. Like professional physical sports, these leagues involve coaches, playbooks, and live commentators. However, statistics can be difficult to extract, as important metrics are difficult to measure unless explicitly provided by the game. Another niche of competitive gaming is speedrunning, where players attempt to complete a game as quickly as possible. For many genres of games, the player controls a sprite and must move to a target destination while avoiding enemies or environmental hazards. To analyze and optimize the fastest paths, referred to as routes, the current approach involves memory scanning tools to find and read data from the game executable, which must be done on a case-by-case basis. Being able to condense a video into an image has several desirable properties. First, it summarizes information concisely. In the context of MOBAs, this allows for easy discussion of strategy or pointing out specific events. For speedrunning, it is difficult to evaluate runs without watching hours of gameplay footage. A static image effectively communicates a route that would otherwise be difficult to explain. In addition, suppose a runner (speedrunning player) wishes to find which of two paths is faster. Overlaying and comparing two panoramic images is easier than attempting to time the splits. Secondly, there may be no detailed world map provided to the player. Most games only provide a mini-map or an outline. 
In procedurally generated games, where the world is different each time, it would be otherwise impossible to get a complete view. Finally, there are less competitive applications. In games with world-building

mechanics, the player may wish to generate a high-resolution image of their work. This is often not achievable without custom modifications or manually joined images.

1.2 Existing Work

There has been significant work in developing artificial intelligence that can play games. Some classic examples are the Super Mario games by Nintendo. Older speedrunning tools implemented shortest-path algorithms to find the best route. However, this only finds one solution per level. More modern approaches use evolutionary algorithms to train neural networks that can learn to play any Mario level [2]. StarCraft is a 1v1 real-time strategy series by Blizzard Entertainment. It has become a benchmark game for testing artificial intelligence due to its complex resource management, limited map visibility, and simultaneous decision making. With academic research beginning in the early 2000s, it is now being tackled by DeepMind at Google [5]. However, algorithms are given special access to internal data representations rather than having to extract the data from the game visually, thus differing from a human player's interaction with the game.

Figure 1-1: Example of information in StarCraft provided to the AI versus in-game image.

Yahoo Research developed another application to automatically detect highlights or important moments in live matches [8]. Their approach uses convolutional neural networks to detect strong visual effects, which are correlated with highlights. Note that all of the above methods use deep learning and neural networks, which require large amounts (on the order of hundreds or thousands of hours) of tagged video for training. This work is more focused on image processing and correlating a sprite with location over time, and could be incorporated into automatically tagging sprites in a deep learning approach.

Chapter 2

Comparing 2D and 3D Environments

Processing computer-synthesized media is a relatively niche concern compared to real-world images and video. Thus, it is unsurprising that most related research assumes and deals with a three-dimensional world. We will focus on games with a specific set of properties and explain the corresponding challenges.

2.1 Camera Position and Movement

Many video processing algorithms are based on stationary cameras. We assert, however, that in many two-dimensional games the camera follows the player's movement while attempting to portray a subset of the surroundings. This is frequently implemented as position-locking, or centering the camera at the player location along one or more axes. A variant on this implementation is adding a “window” around the player sprite, and moving the camera when the player reaches the edge of the window. We use the assumption that the object of interest to be tracked is roughly centered as a search heuristic, discussed further in Chapter 5.

Figure 2-1: In Stardew Valley [1], the camera is centered on the player unless they are on the boundary of the screen.

2.2 Object borders

In the real world, objects can be delineated by continuous line boundaries and color patches. This property fails to hold in heavily stylized and/or pixel art games. This suggests that statistical methods may be more successful than object-based classification.

Figure 2-2: The player sprite is not distinct enough from the background. The regions marked with colored dots were detected as points of interest for having sharp contrast and corners.

2.3 Object similarity

Depending on lighting and perspective, the same object can look extremely different when captured in the real world. In two-dimensional games, the perspective is fixed and objects are generated from a template, and are therefore nearly identical. However, this can cause its own problems when finding a match. For example, when matching a repetitive, tiled background, the algorithm may be unable to distinguish the correct alignment, as there are multiple exact local matches.

Chapter 3

Image Stitching

The state-of-the-art panorama generation technologies use feature detection to pick out “interesting” regions in images, such as edges (regions with a large gradient) or corners (regions with high curvature in the gradient function). Images are then matched based on corresponding features. This requires that features can be matched despite differences in perspective or lighting.

3.1 SIFT

Scale-invariant feature transform (SIFT) is impressive in that it is robust against uniform scaling, orientation, and brightness. In many games, the player sprite is shown in third-person and often centered on the screen. When doing panoramic stitching, this causes SIFT to overlay the images on the centered sprite, aligning regions that should actually be shifted relative to each other. Another reason for failure is the recurrence of identical sprites. Video game characters and textures are programmatically generated from template images. Under these conditions, SIFT performs poorly because the extracted features are identical. Examining the SIFT-generated panorama (Figure 3-2), we see that the text and center tiles are aligned correctly. However, the sides of the panorama are warped due to incorrect feature matching. In particular, the repetition of a wall tile sprite causes two different ones to be overlaid.

Figure 3-1: Four frames from Crypt of the Necrodancer (Brace Yourself Games [6]). Left to right represents movement over time.

Figure 3-2: Left: Panorama generated from a SIFT implementation. Right: Panorama generated from manual stitching.

Finally, the strengths of SIFT in compensating for scaling and orientation are not utilized. Since these games consistently maintain scale and perspective, we will take a different approach.

3.2 Phase correlation

Phase correlation is an approach that can be used to estimate the overlap between two images. It operates in the frequency domain, taking the cross-correlation of the two images’ signals. We essentially slide one image across another, multiplying the signals

of overlapping regions. Regions that are similarly positive or negative will result in a higher cross-correlation. Thus, the location of the cross-correlation's maximum gives the relative shift. Suppose we have two signals x(n) and y(n) that are approximately shifted by k. Then, the cross-correlation of the signals can be expressed as:

(x ⋆ y)(n) = Σ_m x*[m] y[m + n]    (3.1)

where ⋆ denotes cross-correlation and x* is the complex conjugate of x. This is also equivalent to x*(−n) ∗ y(n), where ∗ denotes convolution. We map this to the frequency domain, as the Fast Fourier Transform (FFT) allows for much faster computations. Letting F denote the Fourier transform, we have:

F{x*(−n) ∗ y(n)} = F{x(n)}* · F{y(n)}    (3.2)
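As a concrete illustration of Equation 3.1, the following pure-Python sketch (our own illustrative code, not the thesis's MATLAB implementation) evaluates the cross-correlation of two real 1-D signals directly and reads the relative shift off the peak; the FFT route of Equation 3.2 computes the same quantity much faster for large signals.

```python
def cross_correlation(x, y, max_shift):
    # Direct evaluation of (x ⋆ y)(n) = sum_m x[m] * y[m + n] for real
    # signals, restricted to shifts |n| <= max_shift.
    scores = {}
    for n in range(-max_shift, max_shift + 1):
        scores[n] = sum(x[m] * y[m + n]
                        for m in range(len(x)) if 0 <= m + n < len(y))
    return scores

def estimate_shift(x, y, max_shift):
    # The argmax of the cross-correlation is the estimated relative shift.
    scores = cross_correlation(x, y, max_shift)
    return max(scores, key=scores.get)

x = [0, 0, 0, 1, 2, 3, 2, 1, 0, 0]
y = [0, 0, 0, 0, 0, 1, 2, 3, 2, 1]   # x delayed by 2 samples
print(estimate_shift(x, y, 4))       # -> 2
```

The peak at n = 2 corresponds to the delay between the two signals, which is exactly the registration information needed for stitching.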

To apply phase correlation towards image stitching, we compare small regions of consecutive video frames that stay relatively constant. We can subtract the relative coordinates of the overlap to calculate the absolute position. The images can then be overlapped and blended together.

Suppose the coordinates of the top-left corner of the first frame region are (a1, b1), and the coordinates of the second region are (a2, b2). If the phase correlation suggests that the relative offset is (x2 − x1, y2 − y1), then the total offset of the second frame from the first is (x2 − x1 + a1 − a2, y2 − y1 + b1 − b2). See Figure 3-3 for visual intuition.
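In code, this bookkeeping is a one-liner; the sketch below (variable names are ours) combines the region offset returned by phase correlation with the regions' corner coordinates:

```python
def frame_offset(region_shift, corner1, corner2):
    # region_shift: (dx, dy) between the two sampled regions, as estimated
    # by phase correlation.
    # corner1, corner2: top-left corners (a1, b1) and (a2, b2) of the
    # sampled regions within their respective frames.
    dx, dy = region_shift
    (a1, b1), (a2, b2) = corner1, corner2
    # Total offset of frame 2 relative to frame 1.
    return (dx + a1 - a2, dy + b1 - b2)

# When the regions are sampled at the same place in both frames, the
# total offset is just the region shift itself.
print(frame_offset((5, -3), (40, 40), (40, 40)))  # -> (5, -3)
```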

Figure 3-3: Calculating the offset between image subsections.

Figure 3-4: Three sampled frames from Crypt of the Necrodancer. The boxed regions were used to register their relative position.

Figure 3-5: The panorama resulting from joining the frames in Figure 3-4.

Chapter 4

Motion tracking

Motion tracking aims to detect an object's movement over time in a video. Several situations make this task difficult: the target may be obstructed by objects in the foreground; it may blend into a cluttered environment; and if the object rotates and takes on a different appearance, it is non-trivial to recognize it between frames. In the video game domain we have described, we must primarily handle a cluttered environment. Generally, we observe that the player sprite is not obstructed, and maintains the same planar orientation with minimal differences.

4.1 Optical Flow

We can measure the movement of objects in a video with respect to distance in pixels, and over time. Assuming that a pixel has constant intensity, we can calculate the gradient to predict future movement. Suppose that a pixel at position (x, y) and time t has intensity I(x, y, t). Then, in the next frame, taken at t + dt, it moves to (x + dx, y + dy). We can approximate the intensity at the new position and time by the Taylor Series expansion:

I(x + dx, y + dy, t + dt) = I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt + ···

Since we assume that intensity is preserved, we have I(x + dx, y + dy, t + dt) = I(x, y, t), and thus:

(∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt ≈ 0

Dividing by dt gives:

(∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) = −∂I/∂t

The partial derivatives of the intensity can be calculated from the image, but we can only estimate the change in position over time. There are many methods to do so, but we tested the Lucas-Kanade method as implemented in OpenCV.
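To make the estimation step concrete, here is a minimal single-patch Lucas-Kanade sketch in pure Python (our own illustration, not the OpenCV implementation used in our experiments): it accumulates the brightness-constancy constraints Ix·u + Iy·v = −It over a patch and solves the resulting 2×2 least-squares system for a single flow vector.

```python
import math

def make_frame(shift_x, shift_y, size=20):
    # Smooth synthetic frame whose intensity varies along both axes, so the
    # 2x2 normal-equation matrix below is well conditioned.
    return [[math.sin(0.3 * (x - shift_x)) + math.cos(0.2 * (y - shift_y))
             for x in range(size)] for y in range(size)]

def lucas_kanade_patch(I1, I2):
    # Accumulate sums for the normal equations of Ix*u + Iy*v = -It,
    # using central differences for the spatial gradients.
    sxx = sxy = syy = sxt = syt = 0.0
    h, w = len(I1), len(I1[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            Ix = (I1[y][x + 1] - I1[y][x - 1]) / 2.0
            Iy = (I1[y + 1][x] - I1[y - 1][x]) / 2.0
            It = I2[y][x] - I1[y][x]
            sxx += Ix * Ix; sxy += Ix * Iy; syy += Iy * Iy
            sxt += Ix * It; syt += Iy * It
    det = sxx * syy - sxy * sxy
    # Cramer's rule for [[sxx, sxy], [sxy, syy]] [u, v]^T = [-sxt, -syt].
    u = (sxy * syt - syy * sxt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

# The second frame is the first shifted by (0.4, -0.3) pixels; the
# recovered flow is close to (0.4, -0.3).
u, v = lucas_kanade_patch(make_frame(0, 0), make_frame(0.4, -0.3))
```

OpenCV's implementation additionally uses image pyramids and per-feature windows; this sketch estimates one flow vector for the whole patch, which is enough to show the mechanics.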

4.1.1 Results

Testing across three games, Stardew Valley, Crypt of the Necrodancer, and Dota 2 (Valve Corporation [4]), produced mixed results. As illustrated in Figure 2-2, the player sprite is difficult to distinguish and therefore difficult to track against other background animations. Crypt of the Necrodancer has dynamic lighting, which violates the constant-intensity assumption and makes this approach inapplicable. Finally, we observed moderate success when applied to tracking positions on the Dota 2 minimap (Figure 6-5 of Section 6.3).

4.2 Tracking with phase correlation

For games with distinctly colored and relatively constant sprites, we can use phase correlation, as described in Chapter 3, to also track movement. By simply connecting the estimated position of the sprite in each frame, we generate the path.

Chapter 5

Algorithm

5.1 Introduction

As discussed previously, the key to our algorithm is cross-correlation. Given one or more sprites to track, the algorithm performs a cross-correlation of the sprite against the whole image and saves the position of the best match. To stitch consecutive video frames together, we sample a corner of the frames and cross-correlate to find the relative offset. Then, the images are overlaid and blended together for the final result. The algorithm takes two primary inputs: first, the source video, and second, the search target, which can either be a sprite image, or a user-selected region from a frame. Another parameter that can be set is the sampling rate. A higher sampling rate would be slower but more accurate. On occasion, due to occlusion or the target disappearing for gameplay reasons, the cross-correlation may incorrectly ‘locate’ the target in a different region. To improve robustness and performance, we can prioritize searching a small region around the neighborhood of prior matches. Given that game objects tend to move short distances between samples, it is more likely for the true target location to be close to the last found coordinates. First, we establish a scoring function which balances the image similarity obtained from cross-correlation with the pixel distance of the coordinates of the last match.

22 Figure 5-1: Due to the difference in orientation, the algorithm incorrectly places the sprite location at the orange ghost.

Figure 5-2: Plot of the scoring function. Highlighted points are the true location of the PAC-MAN (Namco [7]) sprite and the mistaken identification.

Then, we find the believed location of the target on the first frame and save the score as a reference. For successive frames, we search a small neighborhood around the believed location. If the score is below a set percentage of the reference, this suggests the target is occluded or out of the window. We then repeatedly expand the search window until a good match is found, or we skip the frame.
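A sketch of this search loop in Python (illustrative only; `frame_score`, the initial radius, and the acceptance threshold are hypothetical names and parameters, and a real implementation would score candidates with the cross-correlation described earlier):

```python
def find_target(frame_score, last_pos, ref_score, frame_size,
                radius=20, accept=0.7, max_expand=3):
    # Search a window centered on the last match; if no candidate clears
    # accept * ref_score, double the radius and retry. Return None when
    # the target appears occluded or off-screen, so the frame is skipped.
    w, h = frame_size
    for _ in range(max_expand + 1):
        xs = range(max(0, last_pos[0] - radius), min(w, last_pos[0] + radius + 1))
        ys = range(max(0, last_pos[1] - radius), min(h, last_pos[1] + radius + 1))
        best = max(((x, y) for x in xs for y in ys),
                   key=lambda p: frame_score(*p))
        if frame_score(*best) >= accept * ref_score:
            return best
        radius *= 2
    return None

# Toy scoring function peaked at (30, 12), standing in for correlation.
peak = lambda x, y: 1.0 / (1 + abs(x - 30) + abs(y - 12))
print(find_target(peak, (28, 12), 1.0, (100, 50)))  # -> (30, 12)
```

Because game objects move short distances between samples, the small initial window usually succeeds on the first pass, which is where the performance gain over a full-frame search comes from.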

5.2 Implementation

We chose MATLAB as the implementation language because of its ease of use and existing library functions. However, our work is easily portable to other languages, namely Python with the OpenCV library [3]. For broader usability, converting to Python is a future goal of our project. We will compare the performance of three tracking and stitching methods: our base implementation, the implementation with search heuristics, and manual tracking.

5.3 Evaluation

The testing was conducted on a MacBook Pro with a 2.9 GHz Intel Core i5 processor.

5.3.1 Performance

The tracking algorithm performs at real-time or near-real-time speeds. Manually marking the location of the target sprite took an average of 0.4 seconds per frame. The naive tracking algorithm took 0.2522 seconds per frame, and the heuristic search reduced the time to 0.0782 seconds per frame. It is crucial for performance to narrow down the size of the region in which we search for the target.

5.3.2 Correctness

Since sprite position is game-specific and difficult to extract, the baseline we used for measuring tracking correctness is manually constructed and thus subject to error.

Providing a meaningful measure of line similarity is difficult. The accumulated deviation of an estimated path depends on the resolution of the image, the size of the target sprite, and the length of the video. A score that weighs these factors is difficult to design. We propose the metric of the average pixel difference between corresponding points in the baseline and generated paths, normalized over the maximum dimension of the sprite. For future work, it would be useful to develop a metric and standardized benchmark for testing.
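The proposed metric can be stated in a few lines; this sketch (function and parameter names are ours) uses Euclidean pixel distance between corresponding path points:

```python
import math

def path_error(baseline, generated, sprite_w, sprite_h):
    # Average pixel distance between corresponding points of the two
    # paths, normalized by the sprite's maximum dimension so that scores
    # are comparable across resolutions and sprite sizes.
    assert len(baseline) == len(generated)
    total = sum(math.hypot(bx - gx, by - gy)
                for (bx, by), (gx, gy) in zip(baseline, generated))
    return total / len(baseline) / max(sprite_w, sprite_h)

# A 3-4-5 error on one of two points, 10-pixel-wide sprite:
# (5 + 0) / 2 / 10 = 0.25.
print(path_error([(0, 0), (10, 0)], [(3, 4), (10, 0)], 10, 5))  # -> 0.25
```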

Chapter 6

Results and Applications

6.1 PAC-MAN

PAC-MAN, the iconic Japanese arcade game by Namco from 1980, serves as a simple example for testing motion tracking. The player navigates the yellow PAC-MAN sprite around a maze, eating dots while avoiding the ghosts. The camera is fixed as the map is static. While the sprite can rotate and deform, it maintains a similar shape throughout. Finally, the background is uncluttered and the sprites are distinct colors.

Figure 6-1: Automated tracking.
Figure 6-2: Manual tracking.

6.2 Defender

Defender, created by Williams Electronics in 1981, is a horizontally scrolling space shooter where the player can move in two directions, but the camera only moves horizontally, staying to the right of the player. The unidirectional camera movement makes it ideal for testing. We observed good results by fusing frames with phase correlation, sampling the background below the player sprite (Figure 6-3). Due to parallax scrolling of the background layers, regions to the right of the screen do not correspond exactly.

6.3 Dota 2

Dota 2 is a MOBA game where two teams of five players each control a character. In “spectator mode”, all player positions are visible. The ability to track all players' movement could be used to gain insight and develop more advanced strategies. We found that in this case, using correlation produced poor results, possibly due to the clutter of the map and the low opacity of the sprites (Figure 6-4). Since the camera is fixed, optical flow analysis produces good results (Figure 6-5).

Figure 6-3: Consecutively fused frames from Defender. The old components are marked in purple and the new components are green for contrast.

Figure 6-4: Our phase correlation method fails to pinpoint the sprite accurately; it should mark the top left.

Figure 6-5: Several detected keypoints and their paths over time.

Chapter 7

Conclusion

This thesis analyzes how computer vision techniques may be used to help programmatically understand two-dimensional video game footage, and its applications towards casual and competitive gaming. Using a novel combination of existing techniques, we present an algorithm that, given video footage, can track a target sprite with moderate success and create a panoramic image of its movement through the world. As this area is relatively unexplored, there is a plethora of improvements and extensions that could be made.

7.1 Future work

7.1.1 Log-polar cross-correlation

The log-polar transform allows image matching with rotation, which would be beneficial in many games where the character can assume different orientations.

7.1.2 Handling reflection

The above case does not account for mirror-imaged sprites, which frequently occur in games when the player faces different directions. With some modification, it should be possible to make the algorithm invariant to mirror reflection along a single axis.

7.1.3 Handling discontinuity

We assumed throughout that the movement of the player between frames is small and continuous. However, it is possible for the player to suddenly change locations. It would be more robust to recognize a large change in environment and abort, or start a new panorama.

Bibliography

[1] Eric Barone. Stardew Valley, 2016.

[2] Seth Bling. MarI/O - Machine Learning for Video Games. https://www. youtube.com/watch?v=qv6UVOQ0F44, 2015.

[3] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.

[4] Valve Corporation. Dota 2, 2013.

[5] DeepMind. StarCraft II DeepMind feature layer API. https://www.youtube. com/watch?v=5iZlrBqDYPM, 2016.

[6] Brace Yourself Games. Crypt of the Necrodancer, 2015.

[7] Namco. PAC-MAN, 1980.

[8] Yale Song. Real-Time Video Highlights for Yahoo Esports. 2016.
