DEVELOPING COOPERATIVE AGENTS FOR NBA JAM

A Project

Presented to the

Faculty of

California State Polytechnic University, Pomona

In Partial Fulfillment

Of the Requirements for the Degree

Master of Science

In

Computer Science

By

Charlson So

2020

SIGNATURE PAGE

Project: DEVELOPING COOPERATIVE AGENTS FOR NBA JAM

Author: Charlson So

Date Submitted: Spring 2020

Department of Computer Science

Dr. Adam Summerville Project Committee Chair Computer Science ______

Dr. Amar Raheja Computer Science ______


ACKNOWLEDGEMENTS

I would like to give a special thanks to Professor Adam Summerville for his lessons and advice throughout my project. I am extremely grateful to have such a caring and passionate advisor. I would also like to express my gratitude to Professor Amar Raheja. His class, Digital Image Processing, is one I will remember throughout my career.

To my dad, Kyong Chin So, my mom, Jae Hyun So, and my sister, Katherine So, it was only through your love and support that I was able to succeed. Through all the rough times and struggles, here's to a brighter future.

Charlson So


ABSTRACT

As artificial intelligence has rapidly advanced, the goal of creating artificial agents that mimic human behavior is becoming a reality. Artificial agents are becoming capable of reflecting human behavior and decision making, such as drawing creative art pieces and playing video games [10][24]. Therefore, they should be able to mimic one of the greatest human strengths: cooperation. Cooperation is an integral skill that allows humans to achieve feats they cannot accomplish alone. It is also a highly valuable skill to develop for artificial agents as intelligent software becomes integrated into human society. As advanced neural network architectures become widely available, cooperative artificial agents will aid humans in a wide variety of fields. Thus, it becomes vital to discuss the quality of interactions between society and artificial systems and to analyze how these systems should interact with the public.

This study attempts to emulate past experimentation with cooperative agents and evaluate how well such an agent performs in the context of a cooperative game, NBA Jam for the Super Nintendo Entertainment System. NBA Jam is a great testbed for cooperative agents because it is well known in the gaming community for its difficulty [25]. NBA Jam includes a complex set of inputs and combinations that allow the player to play with a unique style. This experiment seeks to explore the results of training an intelligent artificial agent that attempts to maximize cooperation between itself and its teammate.

The inclusion of well-programmed cooperative agents should allow players to win more games and have a more enjoyable experience. By adjusting the reward systems of neural networks, this project attempts to explore the nuances of developing cooperative agents for video games.


TABLE OF CONTENTS

SIGNATURE PAGE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: COOPERATIVE ARTIFICIAL AGENTS AND NEURAL NETWORKS
1.1 Introduction to Designing a Cooperative Agent
1.2 History of Artificial Agents Within Video Games
1.3 Machine Learning and Reinforcement Learning
1.4 Markov Decision Process
1.5 Q-Learning and Deep Q-Learning
1.6 Actor Critic
1.7 A2C (Advantage Actor Critic)
1.8 Training Cooperative Agents
CHAPTER 2: EXPERIMENTAL SETUP
2.1 NBA Jam for the Super Nintendo Entertainment System and Gym Retro Integration
2.2 Training Loop
CHAPTER 3: ANALYSIS OF DATA AND DISCUSSION OF RESULTS
3.1 Results
3.2 Analysis
3.3 Adjustments for Future Experimentation
CHAPTER 4: CONCLUSION
REFERENCES


LIST OF TABLES

Table 1: NBA Jam Controls
Table 2: Variable and Memory Addresses
Table 3: Description of Files Required for Integration of a ROM for Gym Retro
Table 4: Experiment Reward Functions
Table 5: Results of Deterministic Policy Player Statistics
Table 6: Results of Non-Deterministic Policy Player Statistics
Table 7: Results of Deterministic Model Team Scores
Table 8: Results of Non-Deterministic Model Team Scores


LIST OF FIGURES

Figure 1: Markov Decision Process
Figure 2: Comparison between Q Learning and Deep Q Learning
Figure 3: Actor Critic Model
Figure 4: Advantage Actor Critic
Figure 5: Menu Screen for NBA Jam
Figure 6: Instance of a Game of NBA Jam
Figure 7: Gym Retro Integration Application
Figure 8: Command to Train pytorch-a2c-ppo-acktr on HPC
Figure 9: Mean Reward For A2C For Reward #2
Figure 10: Agent Blocking a Shot (Reward Function #3, Deterministic Policy)
Figure 11: Mean Reward For Reward Function #1
Figure 12: Mean Reward For Reward Function #2
Figure 13: Mean Reward For Reward Function #3
Figure 14: Mean Reward For Reward Function #4


CHAPTER 1: COOPERATIVE ARTIFICIAL AGENTS AND NEURAL NETWORKS

1.1 Introduction to Designing a Cooperative Agent

Cooperation arises for a multitude of reasons, including but not limited to material trading, culturally instilled behaviors, competition, and expression of emotion. It is theorized to be a primary characteristic that allowed humans to create an advanced society. Therefore, sociologists have studied human cooperation in depth and have devised a set of criteria in order to define cooperation. According to Carl Couch's theory of cooperative action, cooperation is based on "elements of sociation." These elements include acknowledged attentiveness, mutual responsiveness, congruent functional identities, shared focus, and social objective [1]. Reciprocally acknowledged attention is a form of interconnectedness between people that includes a fluid, shared consciousness.

Humans who cooperate share an understanding that allows them to focus on the goal and make decisions that will ultimately benefit the group. Mutual responsiveness is commonly expressed through verbal statements or visual actions; two friends can establish communication and meaning through a phrase or a nod. Shared focus and social objective describe past-bound or future-oriented connectedness within a group. When agents share similar goals, cooperation creates discernible outcomes that ultimately tie them together.

But many of these traits are unique to human-to-human cooperation. Humans are capable of verbal and nonverbal communication and share cultural backgrounds that may predicate certain actions. Human-to-artificial-intelligence cooperation is more limited, since the artificial agent must include features that allow it to understand these verbal and nonverbal cues. Therefore, a new set of criteria based on Couch's theory of cooperative action should be formed in order to define a reward system for high-performing cooperative agents that emulate human behavior. In the context of NBA Jam on the SNES, a highly cooperative agent should be able to set up its teammate to score and generate a high number of assists. The agent should also show mutual responsiveness by taking advantage of passes from its teammate in order to score. Since the goal of NBA Jam is ultimately to win the game, the cooperative agent should learn the complex moveset programmed into the game and maximize its defensive capabilities in order to increase the chance of winning. This experiment aims to create an artificial agent capable of maximizing a user's playing experience by cooperating with them efficiently based on these criteria.

1.2 History of Artificial Agents Within Video Games

Video games have been a huge testing ground for developing artificial agents. Since games provide a reliable simulated environment that can run faster than real time, agents can be trained on a variety of parameters without the need for human intervention. The first well-known example of artificial intelligence algorithms tested against humans in games is Deep Blue, developed by IBM [20]. IBM's Deep Blue supercomputer defeated world chess champion Garry Kasparov in 1997, proving that computer calculations are capable of solving problems previously thought solvable only by humans. Deep Blue won a match under regulation time control, meaning the game had to play out at roughly three minutes per move. Its architecture included 480 chess chips, each capable of searching 2 to 2.5 million chess positions per second in parallel. Deep Blue also utilized the minimax algorithm, which minimizes the possible loss in a worst-case scenario by choosing the action that maximizes the guaranteed value of its future state.

Deep Blue led to a proving ground where algorithms and computer architectures were tested against humans in order to highlight the effectiveness of computers and algorithms with decision-making capabilities. Although Deep Blue was proof that computer and software technology had advanced enough to challenge human intelligence, exhaustive search for a game with greater depth was not possible with the hardware of the time. For example, the ancient Chinese board game Go is played on a 19x19 board, compared to the 8x8 grid of a chess board. The search space in Go is considerably wider: Go has approximately 250^150 possible game sequences, while chess has only about 35^80 [10]. Therefore, it was no surprise when researchers developed an agent capable of defeating world champions of Go using deep reinforcement learning [10]. AlphaGo, developed by DeepMind, was the first computer program to defeat a world-class professional Go player, beating Lee Sedol in 2016. AlphaGo utilized a supervised learning (SL) policy network that included convolutional layers, rectifier nonlinearities, and a final softmax output layer that generated a probability distribution over all legal moves. With this neural architecture, AlphaGo showed that artificial intelligence had developed the ability to tackle new, challenging problems. A growth in artificial intelligence, complemented by advances in parallel computing hardware, led to a new age in which scientists began testing different neural networks in new and unique mediums.


Board games pose different problems than video games. In Go and chess, the environment is fully observable, competitive, static, discrete, and deterministic. This differs considerably from a game of NBA Jam, which is a partially observable, multi-agent, continuous, collaborative, competitive, and deterministic environment. The artificial agent is allowed access to the input information shown on the screen and to data specified by the developer. Unlike architectures developed for Go, a collaborative agent for NBA Jam is required to coordinate with another AI agent in order to compete against another team. This agent has to create defensive plays, such as stealing the ball or blocking shots, while generating offensive momentum, such as assisting its teammate and scoring efficiently. By utilizing neural architectures that have been successful in problems similar to NBA Jam, a cooperative agent capable of the aforementioned skills will be developed and tested.

1.3 Machine Learning and Reinforcement Learning

Machine learning is the process of learning an algorithm that improves automatically through experience [11]. These algorithms teach themselves from training data in order to make predictions or decisions without being explicitly programmed to do so. Reinforcement learning, a subset of machine learning, has shown significant success in decision-making domains such as robotics and video games. OpenAI was able to develop the first AI system capable of defeating world champions in the esports game Dota 2 [33]. That system tackled challenges such as long time horizons, imperfect information, and complex, continuous state-action spaces, and proved that, simply by using self-play reinforcement learning, a neural architecture could achieve human-level performance on a task. Reinforcement learning learns a policy for making decisions by choosing the action that will maximize its reward. The two ways of learning the policy are (1) learning the value of taking an action in a given state and (2) optimizing a parameterized policy to maximize the expected reward. Models such as Q-Learning, Deep Q-Learning, Actor Critic, and Asynchronous Advantage Actor Critic will be analyzed in order to explain the network utilized in this project, Advantage Actor Critic.

1.4 Markov Decision Process

Most control-based problems can be stated in the form of a Markov Decision Process (MDP), a discrete-time stochastic control process. This mathematical framework is useful for describing decision making in situations where outcomes are stochastic. In a Markov Decision Process, a policy is learned in the hope of navigating the system to maximize its reward. An agent navigates the MDP by receiving state information from its environment, taking available actions based on the state, and receiving a reward for making the right decision.

A Markov Decision Process is defined by:

● A set of environment and agent states, S

● A set of actions, A, of the agent

● P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a), the probability of transitioning (at time t) from state s to state s' under action a

● R_a(s, s'), the reward received after transitioning from s to s' with action a

● A set of rules that describe what the agent observes

A reinforcement learning agent interacts with its environment in discrete time steps. At each time step, the agent receives an observation and chooses an action from a set of available actions. Based on this action, the environment moves to a new state and the agent is rewarded according to the transition. The core problem of reinforcement learning is to find a policy for the agent: a function that specifies which action to take in a given state.

Figure 1: Markov Decision Process (Figure reproduced from [18])

Models that attempt to optimize a policy, which is a mapping from perceived states to actions, can be useful for stochastic environments. These policies can be deterministic or stochastic. With a stochastic policy, an agent can handle stochastic environments (environments that contain randomness): an input state is mapped to a probability distribution over the actions. A deterministic policy works well in deterministic environments, where there is no randomness. A stochastic policy also lets an agent behave differently in states that appear identical, which helps mitigate perceptual aliasing [37]. After a state-action pair is fed into a parameterized policy, a policy score function can be used to describe the maximum expected reward. If parameters are found that maximize this score function, the agent will be able to solve the task.
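To make this loop concrete, the following minimal sketch (in Python, using the OpenAI Gym interface that Gym Retro also follows) shows an agent observing a state, choosing an action from a placeholder policy, and receiving a reward at each discrete time step. The environment name and the random policy are illustrative assumptions only, and the exact return values of reset() and step() depend on the Gym version installed.

    import gym

    env = gym.make("CartPole-v1")    # placeholder environment; any Gym environment works

    def policy(observation):
        # Placeholder policy: sample a random action from the action set A.
        return env.action_space.sample()

    observation = env.reset()        # initial state s_0
    done = False
    episode_return = 0.0
    while not done:
        action = policy(observation)                        # choose a_t based on s_t
        observation, reward, done, info = env.step(action)  # environment returns s_{t+1} and r_t
        episode_return += reward
    print("Episode return:", episode_return)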

1.5 Q-Learning and Deep Q-Learning

Q-learning is a type of model-free reinforcement learning. In model-free learning, the model does not use a transition probability distribution and reward function to determine the next action. Rather, Q-learning finds an optimal policy by maximizing the expected value of the total reward per discrete time step. "Q," or quality, describes the function that returns the reward and represents the quality of an action. In Q-learning, a q-table, representing the policy of the agent, is a matrix of state-action pairs (s_t, a_t) that map to a potential future reward. The agent interacts with the environment in one of two ways. The first is known as exploiting: the agent selects the action with the maximum expected reward among the possible actions in its current state, processing the information available to it to make a decision. The second is known as exploring, where the agent takes a random action. Exploring is used to visit parts of the state space that would otherwise not be discovered during the exploitation step.

Q-Learning Algorithm

1. The agent begins in state s_t and takes action a_t, selected from its policy either by referencing the Q-table entry with the highest value (exploiting) or at random (exploring), and receives reward r_t.

2. Update the q-value in the q-table using the observed reward:

Q(s_t, a_t) ← Q(s_t, a_t) + α * (r_t + γ * max_a' Q(s_{t+1}, a') − Q(s_t, a_t))

3. Repeat until the state reaches a terminal (done) state.

The learning rate α defines how much the newly observed reward affects the current estimate in the policy. γ is the discount factor and models the fact that future rewards are worth less than immediate rewards. A low discount factor makes the agent myopic, or short-sighted, while a factor closer to 1 makes it strive for a long-term high reward. Using the Q-learning algorithm, an agent is capable of solving the Mountain Car problem [35]. But Q-learning does not scale well: a large number of state-action pairs requires a large amount of memory, and the tabular form cannot handle high-dimensional inputs such as images.
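As an illustration of the update rule above, the short Python sketch below maintains a q-table for a small discretized task and applies the epsilon-greedy exploration strategy described earlier; the state and action counts are arbitrary placeholders.

    import numpy as np

    n_states, n_actions = 40, 3           # placeholder sizes for a discretized problem
    alpha, gamma, epsilon = 0.1, 0.99, 0.1
    Q = np.zeros((n_states, n_actions))   # the q-table of state-action values

    def choose_action(s):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the q-table.
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    def q_update(s, a, r, s_next):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])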

Deep Q-Learning, a model that uses a deep convolutional neural network in place of a q-table, can be utilized to solve a much broader set of problems. For example, Deep Q-Learning has been applied to several games using high-level features learned from raw sensory data such as pixels; researchers were able to successfully develop an agent capable of playing the game Breakout [12]. Deep Q-Learning takes frames of a game and passes them through a convolutional neural network to output a vector of Q-values, one for each action possible in a state. The use of convolutional layers also allows the network to exploit spatial relationships in images; with stacked frames, the concept of motion can be learned by a Deep Q-Learning network.


Figure 2: Q Learning and Deep Q Learning (Figure reproduced from [38])

Reinforcement learning is unstable when a nonlinear function approximator such as a neural network is used to represent Q. Deep Q-Learning therefore utilizes experience replay, a biologically inspired mechanism that trains on a random sample of prior transitions instead of only the most recent one. Experience replay allows the network to avoid forgetting previous experiences and reduces correlations between experiences. By utilizing a replay buffer that stores (state, action, reward, next state) tuples and drawing small batches from it, an agent can reuse experience gathered earlier in training. Because the replay buffer is sampled at random, the network can be trained without the bias introduced by sequentially ordered inputs.
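The sketch below illustrates the replay-buffer idea in Python; the capacity and batch size are arbitrary choices, not values taken from this project.

    import random
    from collections import deque

    class ReplayBuffer:
        # Stores (state, action, reward, next_state, done) tuples and samples them uniformly at random.
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded automatically

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Random sampling breaks the correlation between consecutive frames.
            return random.sample(self.buffer, batch_size)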

Q-Learning and Deep Q-Learning require an exploration strategy and use a value function to determine the optimal policy by which the agent selects an action in a given state. They are able to compare the expected utility of the available actions without requiring a model of the environment. However, value-function-based models have a major disadvantage: they tend to require many environment interactions, which wastes time in online applications.

1.6 Actor Critic

A major disadvantage of Monte Carlo policy-gradient methods is that the return is only calculated at the end of an episode and is averaged over all of the actions the agent took. Therefore, if the agent takes a bad action among a group of good actions, the policy update still treats the whole set of actions as good policy. Generating an optimal policy this way requires a large number of samples to train, which results in slow learning.

Rather than waiting for an update at the end of each episode, an update can be made at each step. The Actor-Critic model attempts to find the optimal policy without a Q-value as a middleman. It has two parts: the critic updates the value-function parameters (estimating either an action value or a state value), and the actor updates the policy parameters in the direction suggested by the critic. The two components improve each other at every timestep, and the resulting Actor-Critic model learns to play the game more efficiently than either method alone.


Figure 3: Actor Critic Model (Figure reproduced from [31])

1.7 A2C (Advantage Actor Critic)

Advantage Actor Critic (A2C), the model utilized in this project, is a simple, lightweight architecture based on Asynchronous Advantage Actor Critic (A3C), in which training is implemented with multiple workers in parallel, independent environments in order to update a global value function [34]. In A3C, there is a global network and multiple worker agents, each with its own set of network parameters, and each worker interacts with its own copy of the environment. This allows the asynchronous actors to explore the state space efficiently and effectively. The approach differs from simple online reinforcement learning algorithms because it can handle sequential, non-stationary data. In order to tackle the high variability that often appears in value-function-based methods, the advantage function is used to compare the value of the action taken to the average value of that state as given by the value function. When the advantage is positive, the gradient is pushed in that direction, while a negative advantage indicates that the current action does worse than the average value of that state. Rather than utilizing experience replay, the asynchronous execution of multiple agents in parallel on multiple instances of the environment allows the experiment to run on a single machine with a standard multi-core CPU, rather than relying on specialized hardware or massively distributed architectures.

Asynchronous Advantage Actor Critic Algorithm

1. The worker resets to the global network parameters.

2. The worker interacts with its environment.

3. The worker calculates the value and policy loss.

4. The worker gets gradients from the losses.

5. The worker updates the global network with its gradients.

Advantage Actor Critic is a synchronous, deterministic version of A3C. It reduces variance and increases stability by subtracting a baseline from the cumulative rewards, making the updates smaller and more stable. In A3C, each agent updates the global network independently, but because of the asynchronous nature of the architecture, individual agents may be acting under different (older) versions of the policy, so the aggregated update is not optimal. In order to remove these inconsistencies, a coordinator in A2C waits for all parallel workers to finish before updating the global parameters; within the next iteration, the parallel agents all start from the same policy.


Figure 4: Advantage Actor Critic (Figure reproduced from [26])
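For clarity, the following simplified PyTorch sketch shows how the advantage A(s, a) ≈ R − V(s) drives the two loss terms in an advantage actor-critic update for a single rollout. It is an illustration only and not the implementation used by the pytorch-a2c-ppo-acktr repository; the coefficient values are common defaults, not values taken from this project.

    import torch

    def a2c_loss(log_probs, values, rewards, next_value, gamma=0.99, value_coef=0.5):
        # log_probs, values: lists of scalar tensors collected during one rollout.
        # rewards: list of floats; next_value: scalar tensor with the critic's estimate V(s_T).
        returns = []
        R = next_value
        for r in reversed(rewards):          # discounted returns computed backwards over the rollout
            R = r + gamma * R
            returns.insert(0, R)
        returns = torch.stack(returns).detach()
        values = torch.stack(values)
        log_probs = torch.stack(log_probs)

        advantages = returns - values        # A(s,a) = R - V(s)

        policy_loss = -(log_probs * advantages.detach()).mean()  # actor: favor positive-advantage actions
        value_loss = advantages.pow(2).mean()                     # critic: regress V(s) toward the return
        return policy_loss + value_coef * value_loss

In A2C, a loss of this form is computed after all synchronized workers finish their rollouts, and a single gradient step updates the shared parameters.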

1.8 Training Cooperative Agents

Researchers have successfully developed agents capable of complex cooperative behaviors in pursuit of an objective. OpenAI released a study in which agents played a game of hide and seek with moveable and static obstacles [32]. The agents were split into two teams, hiders and seekers. Agents were given the ability to interact with moveable objects such as walls and ramps, had a frontal vision cone representing line of sight, and could sense distance to objects, walls, and other agents using a lidar-like sensor. The reward system gave hiders a reward of +1 if all hiders were hidden and -1 if any hider was seen by a seeker. Seekers were given the opposing reward: -1 if all hiders were hidden and +1 otherwise. After 480 million timesteps, the agents had developed highly complex behaviors and strategies, including using obstacles to block entrances, using ramps, and even exploiting a bug known as box surfing in order to overcome obstacles. Agents were placed in a two-versus-two competition and would cooperate to prevent resources from being used by the other team. The agents were able to collaborate even though they acted independently, using their own observations and hidden memory state.

Shah and Carroll also attempted to understand how artificial intelligence could coordinate with others using a cooking video game called Overcooked [9]. A simplified version of Overcooked was developed that focused on coordination challenges. The game is a two-player kitchen in which the players need to take three onions to a pot, plate the resulting soup, and finally deliver it to a serving location. What they discovered is that the agent would struggle to coordinate with a player who was playing suboptimally. In this case, the AI would attempt to pass onions to the partner (the optimal route) while the human player would take onions directly to the pot (a suboptimal route). Since the human player was unaware of the optimal route, the artificial agent would struggle to play efficiently, which affected the performance of the team. They concluded that self-play "makes poor assumptions about its human partners (or opponents)" [9] and theorized that, in collaborative play, self-play would not be a sufficient learning technique. They hypothesized three claims about self-play agents: a self-play agent will perform much more poorly when partnered with a human; a human-aware agent will achieve higher performance than a self-play agent; and, when partnered with a human, a human-aware agent will achieve higher performance than an agent trained via imitation learning.

In another case, DeepMind developed agents to play Quake III Arena Capture the Flag. These FTW (For The Win) agents were paired with humans, and the humans "rated the FTW agents as more collaborative than fellow humans" [23]. After closer inspection, however, observers could tell that the FTW agents were behaving as if they had AI teammates. One human player commented that the agents "are perfectly selfless," and the participants could clearly understand that the lack of cooperation was an "'us' issue." Because FTW agents were trained with other AI agents, the agents' model of their teammates was incorrect: it assumed that human teammates would coordinate perfectly, just as AIs would.

The experiments conducted by Shah, Carroll, OpenAI, and DeepMind show that an optimal cooperative intelligent agent requires training with human participants in order for the agent to play optimally with humans. Without prior exposure to suboptimal play, agents will not be able to act optimally in suboptimal situations.


CHAPTER 2: EXPERIMENTAL SETUP

2.1 NBA Jam for the Super Nintendo Entertainment System and Gym Retro Integration

NBA Jam is a basketball video game initially developed as an arcade game and later released for the Super Nintendo Entertainment System (SNES). It is characterized by its over-the-top presentation and exaggerated style of two-on-two basketball. NBA Jam has four different game modes. The most useful of these is Head to Head, where a single player teams up with an NBA Jam AI agent and plays against two other NBA Jam AI agents. This mode is the underlying testbed for the cooperative agent.

Figure 5: Menu screen for NBA JAM

From the start screen, the player is presented with 24 NBA basketball teams and chooses two players from one of those teams. Each basketball player is imbued with attributes such as speed, power, 3-point shooting, steal, dunk, block, pass, and clutch. Therefore, it can be theorized that certain characters have a higher performance ceiling than others because of hard-coded characteristics that make them naturally stronger performers. Rather than explore combinations of teammates that would maximize performance, the experimental setup fixes two teams with two fixed players each, Peeler and Divac from the Los Angeles Lakers and Miller and Smits from the Indiana Pacers, in order to test A2C on this video game. As seen in Figure 6, each player or agent is denoted by an arrow with the controller id; the agent plays as Peeler in this project. The camera follows the basketball, so if a player is on the opposite side of the court, they appear offscreen; the player's x position is then indicated at the edge of the screen by an arrow labeled with the player id.

Figure 6: Instance of a Game of NBA Jam

The SNES controller has a set of 12 valid buttons: L, R, UP, DOWN, RIGHT, LEFT, START, SELECT, X, Y, B, and A. Start, Select, and specific combinations of key presses were disabled in order to prevent pausing and other unwanted inputs; valid key presses are specified in scenario.json. NBA Jam has complex movesets that vary depending on the state of the player. Basic keypresses cause the player to take the basic actions shown in Table 1, but combinations of keypresses can result in a variety of actions. For example, a head fake can be generated by tapping the shoot button once, and a dunk can be performed by holding A or B while running. The same keypress can also result in different actions based on whether the team is on offense or defense. When the team has control of the basketball, it is on offense and keypresses take on offensive behaviors such as shooting or passing the ball. When the agent's team does not have the ball, the team is on defense, in which case keypresses result in attempts to steal the basketball or block shots made by the opposing team.

Table 1: NBA Jam Controls

Keypress Function

D-PAD (UP, DOWN, RIGHT, LEFT) Movement

L, R Turbo

X Pass (Offense) | Steal (Defense)

A Shoot (Offense) | Block (Defense)

In order to train a cooperative agent, in-game values must be used to calculate the reward function. Metrics such as points scored, team assists, blocks, and steals are recorded within the game and can be utilized to train an efficient artificial agent. These characteristics are also very similar to the major talking points used when analyzing player performance in real NBA games [15]. Therefore, the following metrics are critical when analyzing an agent's cooperativeness: winning the game, a high number of assists, and the individual and teammate scores.

These metrics can be mapped onto aspects of Couch's theory of cooperative action. A shared social objective between the agents is ultimately to win the game, which happens when the agent's team outscores the opposing team by the end of the three-minute quarter. This requires the agent to score efficiently, block shots, and generate opportunities for its teammate to score. By measuring assists, developers can measure the shared focus and mutual responsiveness between agent and player: an assist in basketball occurs when a player passes the ball to another player and that pass results in a score, so for an assist to occur, an agent must recognize that a pass will produce a high-quality shot. Lastly, the individual and teammate scores are significant metrics because they show that both players are actively participating in order to win the game. These metrics are weighted in the reward function in order to maximize cooperation between the agents.


Figure 7: Gym Retro Integration Application


Finding the memory values within NBA Jam requires the Gym Retro Integration application. This application has a tool that searches the emulated console's memory for specific values. With an instance of the game loaded into the Integration application, the user inputs a known value of the variable they are attempting to find, and the application returns a list of candidate memory addresses that may match. As the game progresses, the user monitors these memory locations for changes that correspond to events in the game. By narrowing down the candidates, the user can find the correct memory address for a specific metric.

Table 2: Variable and Memory Addresses

Variable Address

player_1_assist 5719

player_1_score 5703

player_2_assist 5721

player_2_score 5705

quarter_clock 3465

team_1_score 5958

team_2_score 5960

2.2 Training Loop

In order to train an agent for NBA Jam, a simulated environment capable of running faster than real time is required. OpenAI has developed a training framework called Gym Retro that can emulate SNES games and generate an environment for reinforcement learning. The first step is the integration of the game into the Gym Retro package. Table 3 lists the files that must be generated in order to integrate NBA Jam.

Table 3: Description of Files Required for Integration of a ROM for Gym Retro

Filename Description

data.json Memory addresses that correspond to variables required for reward

lakers-pacers-peeler-divac.state A state file that indicates where the training loop should begin

metadata.json Contains the default state information

rom.sha A hash of the SNES game

scenario.json Describes valid key presses, done condition, reward

The Gym Retro Integration application can be used to generate the majority of these files [27]. The hash of the ROM can be found online, and metadata.json contains the name of the state used in the training loop. Memory searching, mentioned previously, is used to find the memory locations for data within the game. The files that were used can be found at https://github.com/so0p/retro_gym_nba_jam. Then, by running an installation script provided by the Gym Retro package, NBA Jam can be added to the available Gym Retro environments. Note that before running the installation script, the Python requirements within the repository should be installed.
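Assuming the integration has been installed, an environment can then be created and stepped through in Python. The sketch below is illustrative only: the game identifier 'NBAJam-Snes' and the state name are assumptions based on the files listed in Table 3, and the actual names depend on how the integration directory was registered.

    import retro

    # Game and state names are assumed for illustration; use the names under which
    # the integration was actually installed.
    env = retro.make(game='NBAJam-Snes', state='lakers-pacers-peeler-divac')

    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()          # random button presses, just to exercise the loop
        obs, reward, done, info = env.step(action)
        # 'info' exposes the variables defined in data.json, e.g. info['player_1_score']
    env.close()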

Next, install the A2C architecture from the repository referenced in [19]. This version of the A2C architecture is an implementation of OpenAI's A2C [28][29]. After installing the A2C dependencies, the neural network can begin training. Training for this project was done on the High Performance Computing (HPC) cluster at Cal Poly Pomona, which allows resources such as the number of processors and the amount of memory to be allocated.

Figure 8: Command to Train pytorch-a2c-ppo-acktr on HPC

A number of models were trained in order to determine an approximate number of timesteps that would allow the model to learn cooperative behaviors. Through much experimentation, models were run with the reward functions shown in Table 4. The reward functions were selected based on a set of minor goals leading up to a final agent. The first reward function was created to develop an agent that maximizes its own scoring. The second aimed at observing the differences between a reward function that maximizes the team's score and the maximum-scoring agent. The third reward function was created to develop a completely selfless agent that maximizes assists. The last reward function combines the first and third so that the agent maximizes both scoring and cooperative capabilities. Each reward function penalized the agent every time the opposing team scored, with the aim of developing an agent that would also learn defensive qualities such as blocking shots and stealing the ball.

Table 4: Experiment Reward Functions

ID Reward Function Number of Timesteps

1 100 * player_1_score - team_2_score 12e6

2 100 * team_1_score - team_2_score 12e6

3 100 * player_1_assist + team_2_score 12e6

4 10 * player_1_score + 100 * player_1_assist - team_2_score 12e6
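In Gym Retro the reward is normally described declaratively in scenario.json, but as an illustration, reward function #4 can be expressed equivalently as a per-step Python function over the change in the variables from Table 2 (prev_info and info are the info dictionaries from consecutive steps; this helper is hypothetical and not part of the project's code):

    def reward_fn_4(prev_info, info):
        # Reward #4 from Table 4: 10*player_1_score + 100*player_1_assist - team_2_score,
        # applied to the per-step change (delta) of each variable.
        d_score  = info['player_1_score']  - prev_info['player_1_score']
        d_assist = info['player_1_assist'] - prev_info['player_1_assist']
        d_opp    = info['team_2_score']    - prev_info['team_2_score']
        return 10 * d_score + 100 * d_assist - d_opp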

Each model was run for 12 million timesteps. This value was chosen after a successful test model trained for 4 million timesteps learned to score; although that test model was able to score, the agent's team still lost to the opposing team, so three times as many timesteps was chosen as an appropriate training length. Once a model had completed training, it was saved to a pickle file and later rerun with a script in the repository called enjoy.py to finalize the results. The final videos can be found in the videos subdirectory, while the models can be found in the trained_models/a2c/ directory. The models were run with both deterministic and non-deterministic policies to view the results, and video recordings are posted on the GitHub repository.


CHAPTER 3: ANALYSIS OF DATA AND DISCUSSION OF RESULTS

3.1 Results

Across multiple training runs, the network showed little progress in learning to score. Over more than twenty training cycles with the first reward function, only one model learned the ability to score. Therefore, the first and second heuristics, α*player_1_score and α*team_1_score, are not sufficient for the agent to learn to score, regardless of whether the policy was stochastic or deterministic. The major issue with the agent can be seen in det-1.mov, where the agent cannot cross the half-court line; it either shoots at the half-court line or passes to its teammate. With a deterministic policy, however, reward function 3 was capable of getting the agent to score, even though reward function 3 did not include heuristics aimed at developing a scoring agent. The agent in this model crossed half court a few times and was able to hit three 3-pointers. Agents with non-deterministic policies were able to shoot but did not score.

Table 5: Results of Deterministic Policy Player Statistics

ID player_1_score player_2_score player_1_assist player_2_assist

1 0 7 1 0

2 0 7 0 0

3 9 6 0 3

4 0 14 3 0


Table 6: Results of Non-Deterministic Policy Player Statistics

ID player_1_score player_2_score player_1_assist player_2_assist

1 0 8 3 0

2 0 8 3 0

3 0 8 2 0

4 0 11 4 0

Interestingly, the agent run with a deterministic policy under reward function 3 did not record any assists, even though that reward function explicitly rewarded the agent for making assists. Although the agent does make passes throughout the game, it did not learn to maximize the number of assists by finding opportunities to set up its teammate. Therefore, reward function #3 does not accurately describe a method for generating a cooperative agent. In fact, there was no discernible difference between the reward functions attempting to maximize assists. With a deterministic policy, reward functions #1, #2, and #3 were unable to generate the number of assists that the non-deterministic policy achieved. The non-deterministic policy consistently generated assists but did not score.

The agent was fairly capable on the defensive end and can be seen attempting steals and blocking shots. However, the agent did not defend the opposing team well enough. Looking at the final scores, it can be determined that, regardless of the agent's offensive performance, a punishment for every score the opposing team made was not a strong enough heuristic.


Figure 10: Agent Blocking a Shot (Reward Function #3, Deterministic Policy)

Table 7: Results of Deterministic Model Team Scores

ID team_1_score team_2_score

1 7 16

2 7 18

3 15 19

4 14 16


Table 8: Results of Non-Deterministic Model Team Scores

ID team_1_score team_2_score

1 8 18

2 8 24

3 8 16

4 11 20


Figure 11: Mean Reward For Reward Function #1


Figure 12: Mean Reward For Reward Function #2



Figure 13: Mean Reward For Reward Function #3


Figure 14: Mean Reward for Reward Function #4

3.2 Analysis

The agent's inability to score is a major setback for the experiment, as it means the agent lacks mutual responsiveness: even when its teammate created a strong opportunity to assist it, the agent would not go ahead and score. It is possible that the defensive rating of the opposing team was too high and the agent had not yet trained enough to score. There are also scoring tricks, such as releasing the shot at the peak of the jump, that increase the probability of making a basket. Further exploration of the agent will be required to find the source of its scoring inability.

It is also possible that the teammate is completely greedy and does not set up the agent to score. If the agent is trained to maximize team score and the teammate is a very competent scorer, the agent may rely on the teammate for points in order to maximize its reward. Other explanations include a lack of training or an inability to converge on an optimal policy. The final trained artificial agent earned around four assists per three-minute quarter, which projects to a high number of assists for a single player over a full game. Although the agent exhibits many characteristics of a cooperative agent, it was not trained well enough to win games and increase the scoring ability of its teammate.

3.3 Adjustments for Future Experimentation

Major changes to the current setup would be required in order to fulfill the original intentions of the experiment. Increasing the quarter length in the training loop to 15 minutes would allow the agent to find more rewards per episode. Increasing the number of training steps to 480 million timesteps, similar to OpenAI's multi-agent interaction study, could also show more positive results [32]. This type of training was not possible on the HPC because training an agent for 12 million timesteps took approximately 12 hours.


The agent can also be improved by adding a reward term that encourages it to move across the half-court line. Throughout the experiment, the agent always passed the ball before the half-court timer and shot clock ran out; a heuristic that rewards the agent for crossing the line would raise the probability that the agent scores and creates assists. Another useful heuristic would be maximizing the amount of time the agent's team holds onto the ball, since the agent often had the ball stolen and thus lost opportunities to maximize its offensive capabilities.

Finally, an analysis should be done to see whether the agent can utilize complex inputs such as head fakes, super dunks, layups, throwing elbows, rebounds, and alley-oops. There are many instances under each reward function where the agent seems to move in place in circles. This may be due to the architecture being unable to represent holding a button for an extended period of time.


CHAPTER 4: CONCLUSION

By utilizing A2C, the experimenter was able to develop an artificial agent that could score, assist, block, and steal in NBA Jam. Although superhuman performance was not observed, a semi-cooperative agent was developed that was capable of basic movements and actions. Future experimentation should look into the limitations of keypresses with this architecture and develop more complex heuristics that encourage the agent to navigate to favorable positions on the court. The training of the agents should also be done with human participants in order to generate an agent capable of cooperating with suboptimal teammates.

Training cooperative agents is becoming an important subject for video game development tools. Companies such as Unity have developed machine learning toolkits that game developers can use to program agents in their games [13]. As these architectures become more readily accessible, cooperative agents will need to be developed that coordinate with both human and non-human players.


REFERENCES

[1] Couch, C. J. Symbolic Interaction and Generic Sociological Principles. University of Iowa, Iowa City. https://onlinelibrary.wiley.com/doi/abs/10.1525/si.1984.7.1.1

[2] Williams, J. P., Kirschner, D. Elements of Social Action: A MicroAnalytic Approach to the Study of Collaborative Behavior in Digital Games. http://www.digra.org/wp-content/uploads/digital-library/paper_254.pdf

[3] Chen, M. G. Communication, Coordination and Camaraderie in World of Warcraft. University of Washington. Games and Culture, January 2009. https://doi.org/10.1177/1555412008325478

[4] So, C. Retro NBA Jam Repository. https://github.com/so0p/retro_gym_nba_jam

[5] OpenAI Gym Retro Repository. https://github.com/openai/retro

[6] SNES NBA Jam Manual. https://www.retrogames.cz/manualy/SNES/NBA_Jam_-_SNES_-_Manual.pdf

[7] Bansal, T., Pachocki, J., Sidor, S., Sutskever, I., Mordatch, I. Emergent Complexity via Multi-agent Competition. ICLR 2018. https://arxiv.org/pdf/1710.03748.pdf

[8] Shah, R., Carroll, M. Collaborating with Humans Requires Understanding Them. Berkeley Artificial Intelligence Research Blog, Oct 21, 2019. https://bair.berkeley.edu/blog/2019/10/21/coordination/

[9] Carroll, M., Shah, R., Ho, M. K., Griffiths, T. L., Seshia, S. A., Abbeel, P., Dragan, A. On the Utility of Learning about Humans for Human-AI Coordination. Jan 9, 2020. https://arxiv.org/pdf/1910.05789.pdf

[10] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D. Mastering the Game of Go with Deep Neural Networks and Tree Search.

[11] Mitchell, T. Machine Learning. McGraw Hill, 1997. http://www.cs.cmu.edu/~tom/mlbook.html

[12] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M. Playing Atari with Deep Reinforcement Learning.

[13] Unity ML-Agents Toolkit. https://github.com/Unity-Technologies/ml-agents

[14] Hessel, M., Modayil, J., Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., Silver, D. Rainbow: Combining Improvements in Deep Reinforcement Learning. https://arxiv.org/pdf/1710.02298.pdf

[15] Kobe Bryant vs. Tim Duncan: Which Hall of Famer Had the Better Career? https://www.youtube.com/watch?v=9mfOzcW8Jss

[16] Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Jozefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., Ponde de Oliveira, H., Raiman, J., Salimans, T., Schlatter, J. Dota 2 with Large Scale Deep Reinforcement Learning.

[17] Google Atari Breakout. https://cdn.elg.im/breakout/google-atati-breakout.jpg

[18] Markov Decision Process. https://en.wikipedia.org/wiki/Markov_decision_process#/media/File:Markov_Decision_Process.svg

[19] So, C. Pytorch A2C PPO ACKTR. https://github.com/so0p/pytorch-a2c-ppo-acktr

[20] Hsu, F. IBM's Deep Blue Chess Grandmaster Chips. http://www.csis.pace.edu/~ctappert/dps/pdf/ai-chess-deep.pdf

[21] Sutton, R., McAllester, D., Singh, S., Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf

[22] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D. Mastering the Game of Go without Human Knowledge. https://www.nature.com/articles/nature24270

[23] Jaderberg, M., Czarnecki, W., Dunning, I., Marris, L., Lever, G., Castaneda, A. Human-level Performance in 3D Multiplayer Games with Population-based Reinforcement Learning. https://science.sciencemag.org/content/364/6443/859.full?ijkey=rZC5DWj2KbwNk&keytype=ref&siteid=sci

[24] Cao, N., Yan, X., Shi, Y., Chen, C. AI-Sketcher: A Deep Generative Model for Producing High-Quality Sketches. The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). https://idvxlab.com/papers/2019AAAI_Sketcher_Cao.pdf

[25] AI Always Wins at NBA Jam. https://www.gamespot.com/forums/games-discussion-1000000/ai-always-wins-at-nba-jam-26524646/

[26] Mnih, V., Badia, A., Mirza, M., Graves, A., Lillicrap, T., Silver, D., Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. https://arxiv.org/abs/1602.01783

[27] Gym Retro Documentation. https://retro.readthedocs.io/en/latest/integration.html

[28] Kostrikov, I. pytorch-a2c-ppo-acktr-gail. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail

[29] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D. Deep Reinforcement Learning that Matters.

[30] Allis, L. V. Searching for Solutions in Games and Artificial Intelligence. Ph.D. thesis, University of Limburg, Maastricht, The Netherlands (1994).

[31] Actor-Critic Methods. http://incompleteideas.net/book/first/ebook/node66.html

[32] Emergent Tool Use from Multi-Agent Interaction. OpenAI, September 17, 2019. https://openai.com/blog/emergent-tool-use/

[33] Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Jozefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., Ponde de Oliveira Pinto, H., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., Zhang, S. Dota 2 with Large Scale Deep Reinforcement Learning. OpenAI. https://arxiv.org/pdf/1912.06680.pdf

[34] Mnih, V., Badia, A., Mirza, M., Graves, A., Harley, T., Lillicrap, T., Silver, D., Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. https://arxiv.org/pdf/1602.01783.pdf

[35] Sullivan, T. Solving Mountain Car with Q-Learning. https://medium.com/@ts1829/solving-mountain-car-with-q-learning-b77bf71b1de2

[36] Manju, S., Punithavalli, M. An Analysis of Q-Learning Algorithms with Strategies of Reward Function.

[37] Chrisman, L. Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. https://www.aaai.org/Papers/AAAI/1992/AAAI92-029.pdf

[38] Choudhary, A. A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python. https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
