Sprite Tracking in Two-Dimensional Video Games

by Elizabeth M. Shen

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2017

© Massachusetts Institute of Technology 2017. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science May 23, 2017

Certified by...... Philip Tan Research Scientist Thesis Supervisor May 23, 2017

Accepted by ...... Christopher J. Terman Chairman, Masters of Engineering Thesis Committee May 23, 2017

Submitted to the Department of Electrical Engineering and Computer Science on May 23, 2017, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

I explore various computer vision techniques and their application towards processing and extracting information from two-dimensional video games. The bulk of existing research is designed to work on real-world images, and thus makes assumptions about the world that do not translate to synthetic, stylized environments. Processing footage has promising applications in competitive gaming, such as analyzing strategy in multiplayer online games, or optimizing routes in speedrunning. I present the exploratory results, details of a successful algorithm, and some sample applications.

Thesis Supervisor: Philip Tan Title: Research Scientist

Acknowledgments

I would like to thank my advisor Philip Tan for his guidance of my research. I thought it unlikely to get support for my niche interest, but I was happy to be wrong. I am also tremendously grateful towards the Institute; the past five years have been challenging but equally rewarding, and I wouldn’t have it any other way. Finally, I would like to thank my parents. I would not have reached this point without their support of my education.

Contents

1 Introduction 8
  1.1 Motivation ...... 9
  1.2 Existing Work ...... 10

2 Comparing 2D and 3D Environments 12
  2.1 Camera Position and Movement ...... 12
  2.2 Object borders ...... 13
  2.3 Object similarity ...... 14

3 Image Stitching 15
  3.1 SIFT ...... 15
  3.2 Phase correlation ...... 16

4 Motion tracking 20
  4.1 Optical Flow ...... 20
    4.1.1 Results ...... 21
  4.2 Tracking with phase correlation ...... 21

5 Algorithm 22
  5.1 Introduction ...... 22
  5.2 Implementation ...... 24
  5.3 Evaluation ...... 24
    5.3.1 Performance ...... 24
    5.3.2 Correctness ...... 24

6 Results and Applications 26
  6.1 PAC-MAN ...... 26
  6.2 Defender ...... 27
  6.3 Dota 2 ...... 27

7 Conclusion 30
  7.1 Future work ...... 30
    7.1.1 Log-polar cross-correlation ...... 30
    7.1.2 Handling reflection ...... 30
    7.1.3 Handling discontinuity ...... 31

List of Figures

1-1 Example of information in StarCraft provided to the AI versus in-game image...... 10

2-1 In Stardew Valley [1], the camera is centered on the player unless they are on the boundary of the screen...... 13
2-2 The player sprite is not distinct enough from the background. The regions marked with colored dots were detected as points of interest for having sharp contrast and corners...... 13

3-1 Four frames from Crypt of the Necrodancer (Brace Yourself Games [6]). Left to right represents movement over time...... 16
3-2 Left: Panorama generated from a SIFT implementation. Right: Panorama generated from manual stitching...... 16
3-3 Calculating the offset between image subsections...... 18
3-4 Three sampled frames from Crypt of the Necrodancer. The boxed regions were used to register their relative position...... 18
3-5 The panorama resulting from joining the frames in Figure 3-4...... 19

5-1 Due to the difference in orientation, the algorithm incorrectly places the sprite location at the orange ghost...... 23
5-2 Plot of the scoring function. Highlighted points are the true location of the PAC-MAN (Namco [7]) sprite and the mistaken identification...... 23

6-1 Automated tracking...... 26
6-2 Manual tracking...... 26

6-3 Consecutively fused frames from Defender. The old components are marked in purple and the new components are green for contrast...... 28
6-4 Our phase correlation method fails to pinpoint the sprite accurately; it should mark the top left...... 29
6-5 Several detected keypoints and their paths over time...... 29

Chapter 1

Introduction

The goal of this work was to find an automated method to convert live footage or recordings of third-person perspective, stylized video games into a single panorama representing the game world and the path taken by the player. To do this, we needed to solve two problems: tracking the player location, and stitching video frames together. First, we provide necessary background information on important characteristics of video games, competitive gaming, and the surrounding industries. Next, we explain the differences observed in the environments generated by two-dimensional video games which make existing computer vision algorithms unreliable. Chapter 3 describes methods used for image stitching, and Chapter 4 describes several approaches tested to track object movement. Chapter 5 describes our phase-correlation algorithm, which was found to be the most successful; it also provides implementation details and analyzes the results. Chapter 6 presents results applied to several video games, and describes additional applications. Finally, Chapter 7 summarizes the work and suggests directions for future work.

1.1 Motivation

In many video games, the player can only see a small portion of the world, allowing for a sense of discovery and preventing an overflow of information. However, for aesthetic or competitive reasons, it may be beneficial to the player to see a view of the whole world. For example, team-based multiplayer online battle arena games (MOBAs) such as League of Legends and Dota 2 have several billion hours played every month, and support tournaments and professional players with millions of dollars in prize money. Like professional physical sports, these leagues involve coaches, playbooks, and live commentators. However, statistics can be difficult to extract, as important metrics are difficult to measure unless explicitly provided by the game. Another niche of competitive gaming is speedrunning, where players attempt to complete a game as quickly as possible. For many genres of games, the player controls a sprite and must move to a target destination while avoiding enemies or environmental hazards. To analyze and optimize the fastest paths, referred to as routes, the current approach involves memory scanning tools to find and read data from the game executable, which must be done on a case-by-case basis. Being able to condense a video into an image has several desirable properties. First, it summarizes information concisely. In the context of MOBAs, this allows for easy discussion of strategy or pointing out specific events. For speedrunning, it is difficult to evaluate runs without watching hours of gameplay footage. A static image effectively communicates a route that would otherwise be difficult to explain. In addition, suppose a runner (speedrunning player) wishes to find which of two paths is faster. Overlaying and comparing two panoramic images is easier than attempting to time the splits. Secondly, there may be no detailed world map provided to the player. Most games only provide a mini-map or an outline. 
In procedurally generated games, where the world is different each time, it would be otherwise impossible to get a complete view. Finally, there are less competitive applications. In games with world-building

mechanics, the player may wish to generate a high-resolution image of their work. This is often not achievable without custom modifications or manually joined images.

1.2 Existing Work

There has been significant work in developing artificial intelligence that can play games. Some classic examples are the Super Mario games by Nintendo. Older speedrunning tools implemented shortest-path algorithms to find the best route. However, this only finds one solution per level. More modern approaches use evolutionary algorithms to train neural networks that can learn to play any Mario level [2]. StarCraft is a 1v1 real-time strategy series by Blizzard Entertainment. It has become a benchmark game for testing artificial intelligence due to its complex resource management, limited map visibility, and simultaneous decision making. With academic research beginning in the early 2000s, it is now being tackled by DeepMind at Google [5]. However, algorithms are given special access to internal data representations rather than having to extract the data from the game visually, thus differing from a human player's interaction with the game.

Figure 1-1: Example of information in StarCraft provided to the AI versus in-game image.

Yahoo Research developed another application to automatically detect highlights or important moments in live matches [8]. Their approach uses convolutional neural networks to detect strong visual effects, which are correlated with highlights. Note that all of the above methods use deep learning and neural networks, which require large amounts (on the order of hundreds or thousands of hours) of tagged video for training. This work is more focused on image processing and correlating a sprite with location over time, and could be incorporated into automatically tagging sprites in a deep learning approach.

Chapter 2

Comparing 2D and 3D Environments

Processing computer-synthesized media is a relatively niche concern compared to real-world images and video. Thus, it is unsurprising that most related research assumes and deals with a three-dimensional world. We will focus on games with a specific set of properties and explain the corresponding challenges.

2.1 Camera Position and Movement

Many video processing algorithms are based on stationary cameras. We assert, however, that in many two-dimensional games the camera follows the player's movement while attempting to portray a subset of the surroundings. This is frequently implemented as position-locking, or centering the camera at the player location along one or more axes. A variant on this implementation is adding a “window” around the player sprite, and moving the camera when the player reaches the edge of the window. We use the assumption that the object of interest to be tracked is roughly centered as a search heuristic, discussed further in Chapter 5.

Figure 2-1: In Stardew Valley [1], the camera is centered on the player unless they are on the boundary of the screen.

2.2 Object borders

In the real world, objects can be delineated by continuous line boundaries and color patches. This property fails to hold in heavily stylized and/or pixel art games. This suggests that statistical methods may be more successful than object-based classification.

Figure 2-2: The player sprite is not distinct enough from the background. The regions marked with colored dots were detected as points of interest for having sharp contrast and corners.

2.3 Object similarity

Depending on lighting and perspective, the same object can look extremely different when captured in the real world. In two-dimensional games, the perspective is fixed and objects are generated from a template, and are therefore nearly identical. However, this can cause its own problems when finding a match. For example, when matching a repetitive, tiled background, the algorithm may be unable to distinguish the correct alignment, as there are multiple exact local matches.

Chapter 3

Image Stitching

The state-of-the-art panorama generation technologies use feature detection to pick out “interesting” regions in images, such as edges (regions with a large gradient) or corners (regions with high curvature in the gradient function). Images are then matched based on corresponding features. This requires that features can be matched despite differences in perspective or lighting.

3.1 SIFT

Scale-invariant feature transform (SIFT) is impressive in that it is robust against uniform scaling, orientation, and brightness. In many games, the player sprite is shown in third-person and often centered on the screen. When doing panoramic stitching, this causes SIFT to overlay the images on the centered sprite, aligning regions that should actually be shifted relative to each other. Another reason for failure is the recurrence of identical sprites. Video game characters and textures are programmatically generated from template images. Under these conditions, SIFT performs poorly because the extracted features are identical. Examining the SIFT-generated panorama (Figure 3-2), we see that the text and center tiles are aligned correctly. However, the sides of the panorama are warped due to incorrect feature matching. In particular, the repetition of a wall tile sprite causes two different ones to be overlaid.

Figure 3-1: Four frames from Crypt of the Necrodancer (Brace Yourself Games [6]). Left to right represents movement over time.

Figure 3-2: Left: Panorama generated from a SIFT implementation. Right: Panorama generated from manual stitching.

Finally, the strengths of SIFT in compensating for scaling and orientation are not utilized. Since these games consistently maintain scale and perspective, we will take a different approach.

3.2 Phase correlation

Phase correlation is an approach that can be used to estimate the overlap between two images. It operates in the frequency domain, taking the cross-correlation of the two images’ signals. We essentially slide one image across another, multiplying the signals

of overlapping regions. Regions that are similarly positive or negative will result in a higher cross-correlation. Thus, the location of the cross-correlation's maximum gives the relative shift. Suppose we have two signals x(n) and y(n) that are approximately shifted by k. Then, the cross-correlation of the signals can be expressed as:

(x ⋆ y)(n) = Σ_m x*[m] y[m + n]    (3.1)

where ⋆ denotes cross-correlation and x* is the complex conjugate of x. This is also equivalent to x*(−n) ∗ y(n), where ∗ denotes convolution. We map this to the frequency domain, as the Fast Fourier Transform (FFT) allows for much faster computations. Letting F denote the Fourier transform, we have:

F{x*(−n) ∗ y(n)} = F{x(n)}* · F{y(n)}    (3.2)
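As a concrete illustration of Equation 3.1, the following pure-Python sketch (our own illustrative code, not the thesis's MATLAB implementation) evaluates the cross-correlation of two real 1-D signals directly and reads the relative shift off the peak; the FFT route of Equation 3.2 computes the same quantity much faster for large signals.

```python
def cross_correlation(x, y, max_shift):
    # Direct evaluation of (x ⋆ y)(n) = sum_m x[m] * y[m + n] for real
    # signals, restricted to shifts |n| <= max_shift.
    scores = {}
    for n in range(-max_shift, max_shift + 1):
        scores[n] = sum(x[m] * y[m + n]
                        for m in range(len(x)) if 0 <= m + n < len(y))
    return scores

def estimate_shift(x, y, max_shift):
    # The argmax of the cross-correlation is the estimated relative shift.
    scores = cross_correlation(x, y, max_shift)
    return max(scores, key=scores.get)

x = [0, 0, 0, 1, 2, 3, 2, 1, 0, 0]
y = [0, 0, 0, 0, 0, 1, 2, 3, 2, 1]   # x delayed by 2 samples
print(estimate_shift(x, y, 4))       # -> 2
```

The peak at n = 2 corresponds to the delay between the two signals, which is exactly the registration information needed for stitching.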

To apply phase correlation towards image stitching, we compare small regions of consecutive video frames that stay relatively constant. We can subtract the relative coordinates of the overlap to calculate the absolute position. The images can then be overlapped and blended together.

Suppose the coordinates of the top-left corner of the first frame region are (a1, b1), and the coordinates of the second region are (a2, b2). If the phase correlation suggests that the relative offset is (x2 − x1, y2 − y1), then the total offset of the second frame from the first is (x2 − x1 + a1 − a2, y2 − y1 + b1 − b2). See Figure 3-3 for visual intuition.
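In code, this bookkeeping is a one-liner; the sketch below (variable names are ours) combines the region offset returned by phase correlation with the regions' corner coordinates:

```python
def frame_offset(region_shift, corner1, corner2):
    # region_shift: (dx, dy) between the two sampled regions, as estimated
    # by phase correlation.
    # corner1, corner2: top-left corners (a1, b1) and (a2, b2) of the
    # sampled regions within their respective frames.
    dx, dy = region_shift
    (a1, b1), (a2, b2) = corner1, corner2
    # Total offset of frame 2 relative to frame 1.
    return (dx + a1 - a2, dy + b1 - b2)

# When the regions are sampled at the same place in both frames, the
# total offset is just the region shift itself.
print(frame_offset((5, -3), (40, 40), (40, 40)))  # -> (5, -3)
```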

Figure 3-3: Calculating the offset between image subsections.

Figure 3-4: Three sampled frames from Crypt of the Necrodancer. The boxed regions were used to register their relative position.

Figure 3-5: The panorama resulting from joining the frames in Figure 3-4.

Chapter 4

Motion tracking

Motion tracking aims to detect an object's movement over time in a video. Several situations make this task difficult: the target may be obstructed by objects in the foreground; it may blend into a cluttered environment; and if the object rotates and takes on a different appearance, it is non-trivial to recognize it between frames. In the video game domain we have described, we must primarily handle a cluttered environment. Generally, we observe that the player sprite is not obstructed, and maintains the same planar orientation with minimal differences.

4.1 Optical Flow

We can measure the movement of objects in a video with respect to distance in pixels, and over time. Assuming that a pixel has constant intensity, we can calculate the gradient to predict future movement. Suppose that a pixel at position (x, y) and time t has intensity I(x, y, t). Then, in the next frame, taken at t + dt, it moves to (x + dx, y + dy). We can approximate the intensity at the new position and time by the Taylor Series expansion:

I(x + dx, y + dy, t + dt) = I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt + ···

Since we assume that intensity is preserved, we have I(x + dx, y + dy, t + dt) = I(x, y, t), and thus:

(∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt ≈ 0

Dividing by dt gives:

(∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) = −∂I/∂t

The partial derivatives of the intensity can be calculated from the image, but we can only estimate the change in position over time. There are many methods to do so, but we tested the Lucas-Kanade method as implemented in OpenCV.
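To make the estimation step concrete, here is a minimal single-patch Lucas-Kanade sketch in pure Python (our own illustration, not the OpenCV implementation used in our experiments): it accumulates the brightness-constancy constraints Ix·u + Iy·v = −It over a patch and solves the resulting 2×2 least-squares system for a single flow vector.

```python
import math

def make_frame(shift_x, shift_y, size=20):
    # Smooth synthetic frame whose intensity varies along both axes, so the
    # 2x2 normal-equation matrix below is well conditioned.
    return [[math.sin(0.3 * (x - shift_x)) + math.cos(0.2 * (y - shift_y))
             for x in range(size)] for y in range(size)]

def lucas_kanade_patch(I1, I2):
    # Accumulate sums for the normal equations of Ix*u + Iy*v = -It,
    # using central differences for the spatial gradients.
    sxx = sxy = syy = sxt = syt = 0.0
    h, w = len(I1), len(I1[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            Ix = (I1[y][x + 1] - I1[y][x - 1]) / 2.0
            Iy = (I1[y + 1][x] - I1[y - 1][x]) / 2.0
            It = I2[y][x] - I1[y][x]
            sxx += Ix * Ix; sxy += Ix * Iy; syy += Iy * Iy
            sxt += Ix * It; syt += Iy * It
    det = sxx * syy - sxy * sxy
    # Cramer's rule for [[sxx, sxy], [sxy, syy]] [u, v]^T = [-sxt, -syt].
    u = (sxy * syt - syy * sxt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

# The second frame is the first shifted by (0.4, -0.3) pixels; the
# recovered flow is close to (0.4, -0.3).
u, v = lucas_kanade_patch(make_frame(0, 0), make_frame(0.4, -0.3))
```

OpenCV's implementation additionally uses image pyramids and per-feature windows; this sketch estimates one flow vector for the whole patch, which is enough to show the mechanics.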

4.1.1 Results

Testing across three games, Stardew Valley, Crypt of the Necrodancer, and Dota 2 (Valve Corporation [4]), produced mixed results. As illustrated in Figure 2-2, the player sprite is difficult to distinguish and therefore difficult to track against other background animations. Crypt of the Necrodancer has dynamic lighting, which violates the constant-intensity assumption and makes this approach inapplicable. Finally, we observed moderate success when applied to tracking positions on the Dota 2 minimap (Figure 6-5 of Section 6.3).

4.2 Tracking with phase correlation

For games with distinctly colored and relatively constant sprites, we can use phase correlation, as described in Chapter 3, to also track movement. By simply connecting the estimated position of the sprite in each frame, we generate the path.

Chapter 5

Algorithm

5.1 Introduction

As discussed previously, the key to our algorithm is cross-correlation. Given one or more sprites to track, the algorithm performs a cross-correlation of the sprite against the whole image and saves the position of the best match. To stitch consecutive video frames together, we sample a corner of the frames and cross-correlate to find the relative offset. Then, the images are overlaid and blended together for the final result. The algorithm takes two primary inputs: first, the source video, and second, the search target, which can either be a sprite image, or a user-selected region from a frame. Another parameter that can be set is the sampling rate. A higher sampling rate would be slower but more accurate. On occasion, due to occlusion or the target disappearing for gameplay reasons, the cross-correlation may incorrectly ‘locate’ the target in a different region. To improve robustness and performance, we can prioritize searching a small region around the neighborhood of prior matches. Given that game objects tend to move short distances between samples, it is more likely for the true target location to be close to the last found coordinates. First, we establish a scoring function which balances the image similarity obtained from cross-correlation with the pixel distance of the coordinates of the last match.

22 Figure 5-1: Due to the difference in orientation, the algorithm incorrectly places the sprite location at the orange ghost.

Figure 5-2: Plot of the scoring function. Highlighted points are the true location of the PAC-MAN (Namco [7]) sprite and the mistaken identification.

Then, we find the believed location of the target on the first frame and save the score as a reference. For successive frames, we search a small neighborhood around the believed location. If the score is below a set percentage of the reference, this suggests the target is occluded or out of the window. We then repeatedly expand the search window until a good match is found, or we skip the frame.
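A sketch of this search loop in Python (illustrative only; `frame_score`, the initial radius, and the acceptance threshold are hypothetical names and parameters, and a real implementation would score candidates with the cross-correlation described earlier):

```python
def find_target(frame_score, last_pos, ref_score, frame_size,
                radius=20, accept=0.7, max_expand=3):
    # Search a window centered on the last match; if no candidate clears
    # accept * ref_score, double the radius and retry. Return None when
    # the target appears occluded or off-screen, so the frame is skipped.
    w, h = frame_size
    for _ in range(max_expand + 1):
        xs = range(max(0, last_pos[0] - radius), min(w, last_pos[0] + radius + 1))
        ys = range(max(0, last_pos[1] - radius), min(h, last_pos[1] + radius + 1))
        best = max(((x, y) for x in xs for y in ys),
                   key=lambda p: frame_score(*p))
        if frame_score(*best) >= accept * ref_score:
            return best
        radius *= 2
    return None

# Toy scoring function peaked at (30, 12), standing in for correlation.
peak = lambda x, y: 1.0 / (1 + abs(x - 30) + abs(y - 12))
print(find_target(peak, (28, 12), 1.0, (100, 50)))  # -> (30, 12)
```

Because game objects move short distances between samples, the small initial window usually succeeds on the first pass, which is where the performance gain over a full-frame search comes from.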

5.2 Implementation

We chose MATLAB as the implementation language because of its ease of use and existing library functions. However, our work is easily portable to other languages, namely Python with the OpenCV library [3]. For broader usability, converting to Python is a future goal of our project. We will compare the performance of three tracking and stitching methods: our base implementation, the implementation with search heuristics, and manual tracking.

5.3 Evaluation

The testing was conducted on a MacBook Pro with a 2.9 GHz Intel Core i5 processor.

5.3.1 Performance

The tracking algorithm performs at real-time or near-real-time speeds. Manually marking the location of the target sprite took an average of 0.4 seconds per frame. The naive tracking algorithm took 0.2522 seconds per frame, and the heuristic search reduced the time to 0.0782 seconds per frame. It is crucial for performance to narrow down the size of the region in which we search for the target.

5.3.2 Correctness

Since sprite position is game-specific and difficult to extract, the baseline we used for measuring tracking correctness is manually constructed and thus subject to error.

Providing a meaningful measure of line similarity is difficult. The accumulated deviation of an estimated path depends on the resolution of the image, the size of the target sprite, and the length of the video. A score that weighs these factors is difficult to design. We propose the metric of the average pixel difference between corresponding points in the baseline and generated paths, normalized over the maximum dimension of the sprite. For future work, it would be useful to develop a metric and standardized benchmark for testing.
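The proposed metric can be stated in a few lines; this sketch (function and parameter names are ours) uses Euclidean pixel distance between corresponding path points:

```python
import math

def path_error(baseline, generated, sprite_w, sprite_h):
    # Average pixel distance between corresponding points of the two
    # paths, normalized by the sprite's maximum dimension so that scores
    # are comparable across resolutions and sprite sizes.
    assert len(baseline) == len(generated)
    total = sum(math.hypot(bx - gx, by - gy)
                for (bx, by), (gx, gy) in zip(baseline, generated))
    return total / len(baseline) / max(sprite_w, sprite_h)

# A 3-4-5 error on one of two points, 10-pixel-wide sprite:
# (5 + 0) / 2 / 10 = 0.25.
print(path_error([(0, 0), (10, 0)], [(3, 4), (10, 0)], 10, 5))  # -> 0.25
```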

Chapter 6

Results and Applications

6.1 PAC-MAN

PAC-MAN, the iconic Japanese arcade game by Namco from 1980, serves as a simple example for testing motion tracking. The player navigates the yellow PAC-MAN sprite around a maze, eating dots while avoiding the ghosts. The camera is fixed as the map is static. While the sprite can rotate and deform, it maintains a similar shape throughout. Finally, the background is uncluttered and the sprites are distinct colors.

Figure 6-1: Automated tracking.
Figure 6-2: Manual tracking.

6.2 Defender

Defender, created by Williams Electronics in 1981, is a horizontally scrolling space shooter where the player can move in two directions, but the camera only moves horizontally, staying to the right of the player. The unidirectional camera movement makes it ideal for testing. We observed good results by fusing frames with phase correlation, sampling the background below the player sprite (Figure 6-3). Due to parallax scrolling of the background layers, regions to the right of the screen do not correspond exactly.

6.3 Dota 2

Dota 2 is a MOBA game where two teams of five players each control a character. In “spectator mode”, all player positions are visible. The ability to track all players' movement could be used to gain insight and develop more advanced strategies. We found that in this case, using correlation produced poor results, possibly due to the clutter of the map and the low opacity of the sprites (Figure 6-4). Since the camera is fixed, optical flow analysis produces good results (Figure 6-5).

Figure 6-3: Consecutively fused frames from Defender. The old components are marked in purple and the new components are green for contrast.

Figure 6-4: Our phase correlation method fails to pinpoint the sprite accurately; it should mark the top left.

Figure 6-5: Several detected keypoints and their paths over time.

Chapter 7

Conclusion

This thesis analyzes how computer vision techniques may be used to help programmatically understand two-dimensional video game footage, and its applications towards casual and competitive gaming. Using a novel combination of existing techniques, we present an algorithm that, given video footage, can track a target sprite with moderate success and create a panoramic image of its movement through the world. As this area is relatively unexplored, there is a plethora of improvements and extensions that could be made.

7.1 Future work

7.1.1 Log-polar cross-correlation

The log-polar transform allows image matching with rotation, which would be beneficial in many games where the character can assume different orientations.

7.1.2 Handling reflection

The above case does not account for mirror-imaged sprites, which frequently occur in games when the player faces different directions. With some modification, it should be possible to make the algorithm invariant to mirror reflection along a single axis.

7.1.3 Handling discontinuity

We assumed throughout that the movement of the player between frames is small and continuous. However, it is possible for the player to suddenly change locations. It would be more robust to recognize a large change in environment and abort, or start a new panorama.

Bibliography

[1] Eric Barone. Stardew Valley, 2016.

[2] Seth Bling. MarI/O - Machine Learning for Video Games. https://www. youtube.com/watch?v=qv6UVOQ0F44, 2015.

[3] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.

[4] Valve Corporation. Dota 2, 2013.

[5] DeepMind. StarCraft II DeepMind feature layer API. https://www.youtube. com/watch?v=5iZlrBqDYPM, 2016.

[6] Brace Yourself Games. Crypt of the Necrodancer, 2015.

[7] Namco. PAC-MAN, 1980.

[8] Yale Song. Real-Time Video Highlights for Yahoo Esports. 2016.
