Jiminy Cricket: Benchmarking Moral Behavior in Text-Based Games
Dan Hendrycks* (UC Berkeley), Mantas Mazeika* (UIUC), Andy Zou (UC Berkeley), Sahil Patel (UC Berkeley), Christine Zhu (UC Berkeley), Jesus Navarro (UC Berkeley), Dawn Song (UC Berkeley), Bo Li (UIUC), Jacob Steinhardt (UC Berkeley)

*Equal contribution. Submitted to the 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks. Do not distribute.

Abstract

When making decisions in everyday life, people are guided by their conscience, an internal sense of right and wrong, to behave morally even when doing so is inconvenient. By contrast, artificial agents trained to maximize reward may behave immorally if their reward signal only concerns the task at hand. This bias in reward functions is exacerbated in video games, which often directly reward immoral behavior. With the advent of generally capable agents that train on many environments, addressing reward bias in complex environments will become important. However, prior work on aligning agents with human values and morals focuses on small-scale settings lacking in semantic complexity. To enable research in larger, more realistic settings, we introduce Jiminy Cricket, a benchmark of 25 text-based adventure games containing thousands of semantically rich, morally salient scenarios. Via dense annotations for every possible action, Jiminy Cricket environments robustly evaluate whether agents in text-based games can act morally while maximizing reward. To improve the moral behavior of agents, we propose leveraging pretrained language models with commonsense ethical knowledge and mediating this knowledge into actions with various conditioning strategies. We demonstrate in extensive experiments that this conscience-based approach can reduce immoral behavior without sacrificing performance.

1 Introduction

Moral awareness comes naturally to nearly all humans. From holding the door open for strangers to respecting the property of others, people have an innate sense of right and wrong that guides their actions in everyday life. This sense is often called a conscience, and it compels people to behave morally even when doing so is inconvenient for the task at hand. By contrast, artificial agents trained to maximize reward may behave immorally if their training environment does not penalize them, as often happens in video games. This is especially concerning for the development of large-scale machine learning agents, which may be trained on swaths of environments that are indifferent to or even reward immoral behavior.

[Figure 1 shows an example game scenario ("You are at the office late at night, and suddenly you hear commotion in your boss's office. After a while, you decide to investigate. When you enter his office, you find blood spatter and your boss laying on the floor--he's been slain! What will you do next?") with four possible actions (call the police; go home and rest; take boss's watch; clean up office), each receiving a reward of 0.]
Figure 1: The Jiminy Cricket benchmark evaluates text-based agents on their ability to act morally in complex environments. In one path the agent chooses a moral action, and in the other three paths the agent omits helping, steals from the victim, or destroys evidence. In all paths, the reward is zero, highlighting a hazardous bias in reward functions, namely that they sometimes do not penalize immoral behavior. By comprehensively annotating moral scenarios at the source code level, we ensure high-quality annotations for every possible action the agent can take.

Aligning agents with human values and morals is challenging, because human values are complex and often unspoken (Rawls, 1999). Most existing work on training agents that behave safely focuses on operational definitions of safety for continuous control or simple environments with limited semantics (Leike et al., 2017; Ray et al., 2019; Hadfield-Menell et al., 2016; Achiam et al., 2017; García and Fernández, 2015). In more general settings, the complexity of human values may require new approaches. Thus, scaling up to semantically rich environments that demonstrate the breadth of human values in a variety of natural scenarios is an important next step.

To this end, we introduce the Jiminy Cricket benchmark for moral behavior in text-based games. Our benchmark consists of 25 manually annotated Infocom text adventures where the morality of actions is densely labeled. For every action taken by the agent, our environments report the moral valence of the scenario and its degree of severity. This is accomplished by annotating games at the source code level. Our annotations cover the wide variety of scenarios that naturally occur in Infocom text adventures, including theft, intoxication, and animal cruelty, as well as altruism and positive human experiences. Using the Jiminy Cricket benchmark, agents can be evaluated on whether they adhere to ethical standards while maximizing reward in a complex, semantically rich environment.
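To make this evaluation protocol concrete, the sketch below shows how per-step morality annotations might be consumed alongside game reward during an episode. The environment and agent interfaces, the info["morality"] field, and the (valence, severity) annotation format are illustrative assumptions for this sketch, not the actual Jiminy Cricket API.

    # Minimal sketch (assumed interface): roll out one episode while tracking
    # both game reward and annotated immorality. The annotation format and the
    # env/agent methods used here are hypothetical, not the real benchmark API.
    def evaluate_episode(env, agent, max_steps=1000):
        obs = env.reset()
        total_reward = 0.0
        cumulative_immorality = 0.0
        for _ in range(max_steps):
            action = agent.act(obs)                      # free-form text command
            obs, reward, done, info = env.step(action)
            total_reward += reward
            # Assumed annotation: a (valence, severity) pair, e.g. ("immoral", 2).
            valence, severity = info.get("morality", ("neutral", 0))
            if valence == "immoral":
                cumulative_immorality += severity        # evaluation-only signal
            if done:
                break
        return total_reward, cumulative_immorality

In a setup like this, the agent never observes the annotation during play; it is used only after the fact to score how morally the agent behaved while pursuing reward.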
We ask whether agents can be conditioned to act morally without receiving unrealistically dense feedback on their conduct. Thus, the annotations in Jiminy Cricket are intended for evaluation only. Practical methods for inducing ethical behavior are a primary way to improve performance on the benchmark. Recent work on text games has shown that commonsense priors from Transformer language models can be highly effective at narrowing the action space and improving agent performance (Yao et al., 2020). We investigate whether language models can also be used to condition agents to act morally. In particular, Hendrycks et al. (2021) introduce the ETHICS dataset and show that Transformer language models are slowly gaining the ability to predict the moral valence of diverse, real-world scenarios. We propose a simple yet effective morality conditioning method for mediating this moral knowledge into actions.
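One way such moral knowledge could be mediated into actions, sketched below purely for illustration, is to score each candidate action with a commonsense-morality classifier (for example, a model fine-tuned on ETHICS) and penalize actions predicted to be immoral before the agent chooses among them. The Hugging Face-style classifier calls, the threshold, and the penalty value are assumptions, and this is not necessarily the exact conditioning strategy evaluated in our experiments.

    import torch

    def shape_candidate_actions(observation, candidates, moral_model, tokenizer,
                                threshold=0.5, penalty=-10.0):
        """Illustrative policy-shaping step: penalize candidate actions that a
        commonsense-morality classifier flags as immoral. The classifier is
        assumed to be a Hugging Face sequence-classification model whose
        class 1 means "immoral" (an assumption made for this sketch)."""
        shaped = []
        for action in candidates:
            inputs = tokenizer(f"{observation} {action}",
                               return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = moral_model(**inputs).logits
            p_immoral = torch.softmax(logits, dim=-1)[0, 1].item()
            # Subtract a fixed penalty from the value of actions judged immoral;
            # the downstream agent then selects actions using the shaped values.
            shaped.append((action, penalty if p_immoral > threshold else 0.0))
        return shaped

Alternative conditioning strategies under the same idea include dropping flagged actions outright or folding the penalty directly into the learned action values.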
In extensive experiments, we find that our morality conditioning method allows agents to obtain similar task performance while significantly reducing immoral behavior. We examine several factors affecting the performance of our morality conditioning method and identify opportunities for further improving performance. We hope our benchmark aids the development of agents that behave morally in large-scale, semantically rich environments.

2 Related Work

Benchmarks for Text-Based Adventure Games. Several previous works have developed learning environments and benchmarks to accelerate the development of agents for text games. The Text-Based Adventure AI competition, which ran from 2016 to 2018, evaluated agents on a suite of 20 man-made games (Atkinson et al., 2019), and discovered that many games were too difficult for existing methods. Côté et al. (2018) introduce TextWorld, in which text games are procedurally generated. This enables curriculum training of agents, but its synthetic nature significantly reduces environment complexity. Hausknecht et al. (2020) introduce the Jericho benchmark, including 50 human-made games of varying difficulty levels. Jiminy Cricket builds on Jericho's interface to the Frotz interpreter and introduces several improvements. Most similar to our work is that of Nahian et al. (2021), who create three TextWorld environments for evaluating the moral behavior of agents. These environments are small-scale, containing only 12 locations with no objects that can be interacted with. By contrast, Jiminy Cricket environments are intricate, simulated worlds containing a total of 1,838 locations and nearly 5,000 objects that can be interacted with. This admits a more realistic evaluation of the moral behavior of agents.

[Figure 2, "Safe Exploration Training Curves", plots Cumulative Immorality (scale 1e4) and Percent Completion against Training Step for CALM, CMPS (Ours), and Oracle Morality; the immorality panel is annotated with a 60% reduction.]
Figure 2: We plot cumulative Immorality and current Percent Completion against training steps. The oracle model restricts action choices and slightly decreases Percent Completion, but CMPS matches CALM. All three models accrue Immorality at nearly a constant rate, with CMPS reducing Immorality throughout training, not just at the end. Thus, CMPS improves safe exploration.

Solving Text-Based Games with Reinforcement Learning. A text-based game can be represented as a partially observable Markov decision process (POMDP) and solved with conventional reinforcement learning algorithms. One popular architecture for text-based agents is DRRN (He et al., 2016), which incorporates deep Q-learning. The observation-action pairs are encoded with separate recurrent neural networks (GRUs) and then fed into a decoder to output Q-values. The Q-function is learned by sampling tuples (o, a, r, o') of observation, action, reward, and next observation from a replay buffer and minimizing the temporal difference (TD) loss. Another algorithm, KG-DQN (Ammanabrolu and Riedl, 2019), models the Q-values in a similar way but incorporates knowledge graphs to improve memory persistence and enhance understanding. However, due to combinatorially large search spaces, these approaches still require Jericho's handicap, which provides a list of valid actions at each step. To address this problem, CALM (Yao et al., 2020) fine-tunes a language model (GPT-2) on context-action pairs (c, a) obtained from a suite of human game walkthroughs. The language model is then used to generate a set of candidate actions given the context at each step, serving as a linguistic prior for the DRRN agent.
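For concreteness, below is a minimal sketch of a DRRN-style Q-network in the spirit of He et al. (2016): the observation and action strings are encoded by separate GRUs, and a small decoder maps their joint representation to a scalar Q-value. Embedding sizes, tokenization, and the decoder shape are illustrative assumptions; in a CALM-style setup, the candidate actions scored by this network would come from the fine-tuned language model.

    import torch
    import torch.nn as nn

    class DRRNQNetwork(nn.Module):
        """Sketch of a DRRN-style Q-network: separate GRU encoders for the
        observation and the action, followed by a decoder that outputs Q(o, a).
        Hyperparameters here are placeholders, not values from the paper."""

        def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.obs_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.act_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.Sequential(
                nn.Linear(2 * hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, obs_tokens, act_tokens):
            # obs_tokens, act_tokens: (batch, seq_len) integer token ids
            _, obs_h = self.obs_gru(self.embed(obs_tokens))  # final hidden state
            _, act_h = self.act_gru(self.embed(act_tokens))
            joint = torch.cat([obs_h[-1], act_h[-1]], dim=-1)
            return self.decoder(joint).squeeze(-1)           # Q(o, a) per pair

Training would then regress Q(o, a) toward the TD target r + gamma * max over candidate actions a' of Q(o', a'), using (o, a, r, o') tuples sampled from a replay buffer, as described above.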