Jiminy Cricket: Benchmarking Moral Behavior in Text-Based Games

Dan Hendrycks* (UC Berkeley), Mantas Mazeika* (UIUC), Andy Zou (UC Berkeley), Sahil Patel (UC Berkeley),

Christine Zhu (UC Berkeley), Jesus Navarro (UC Berkeley), Dawn Song (UC Berkeley), Bo Li (UIUC), Jacob Steinhardt (UC Berkeley)

*Equal contribution.

Abstract

When making decisions in everyday life, people are guided by their conscience, an internal sense of right and wrong, to behave morally even when doing so is inconvenient. By contrast, artificial agents trained to maximize reward may behave immorally if their reward signal only concerns the task at hand. This bias in reward functions is exacerbated in video games, which often directly reward immoral behavior. With the advent of generally capable agents that train on many environments, addressing reward bias in complex environments will become important. However, prior work on aligning agents with human values and morals focuses on small-scale settings lacking in semantic complexity. To enable research in larger, more realistic settings, we introduce Jiminy Cricket, a benchmark of 25 text-based adventure games containing thousands of semantically rich, morally salient scenarios. Via dense annotations for every possible action, Jiminy Cricket environments robustly evaluate whether agents in text-based games can act morally while maximizing reward. To improve the moral behavior of agents, we propose leveraging pretrained language models with commonsense ethical knowledge and mediating this knowledge into actions with various conditioning strategies. We demonstrate in extensive experiments that this conscience-based approach can reduce immoral behavior without sacrificing performance.

1 Introduction

Moral awareness comes naturally to nearly all humans. From holding the door open for strangers to respecting the property of others, people have an innate sense of right and wrong that guides their actions in everyday life. This sense is often called a conscience, and it compels people to behave morally even when doing so is inconvenient for the task at hand. By contrast, artificial agents trained to maximize reward may behave immorally if their training environment does not penalize them, as often happens in video games. This is especially concerning for the development of large-scale machine learning agents, which may be trained on swaths of environments that are indifferent to or even reward immoral behavior.

Aligning agents with human values and morals is challenging, because human values are complex and often unspoken (Rawls, 1999). Most existing work on training agents that behave safely focuses on operational definitions of safety for continuous control or simple environments with limited semantics (Leike et al., 2017; Ray et al., 2019; Hadfield-Menell et al., 2016; Achiam et al., 2017; Garcia and Fernández, 2015).

[Figure 1 example. Scenario: "You are at the office late at night, and suddenly you hear commotion in your boss's office. After a while, you decide to investigate. When you enter his office, you find blood spatter and your boss lying on the floor: he's been slain! What will you do next?" Possible actions: Call the police; Go home and rest; Take boss's watch; Clean up office. Each action receives a reward of 0.]

Figure 1: The Jiminy Cricket benchmark evaluates text-based agents on their ability to act morally in complex environments. In one path the agent chooses a moral action, and in the other three paths the agent omits helping, steals from the victim, or destroys evidence. In all paths, the reward is zero, highlighting a hazardous bias in reward functions, namely that they sometimes do not penalize immoral behavior. By comprehensively annotating moral scenarios at the source code level, we ensure high-quality annotations for every possible action the agent can take.

In more general settings, the complexity of human values may require new approaches. Thus, scaling up to semantically rich environments that demonstrate the breadth of human values in a variety of natural scenarios is an important next step.

To this end, we introduce the Jiminy Cricket benchmark for moral behavior in text-based games. Our benchmark consists of 25 manually annotated text adventures where the morality of actions is densely labeled. For every action taken by the agent, our environments report the moral valence of the scenario and its degree of severity. This is accomplished by annotating games at the source code level. Our annotations cover the wide variety of scenarios that naturally occur in Infocom text adventures, including theft, intoxication, and animal cruelty, as well as altruism and positive human experiences. Using the Jiminy Cricket benchmark, agents can be evaluated on whether they adhere to ethical standards while maximizing reward in a complex, semantically rich environment.

We ask whether agents can be conditioned to act morally without receiving unrealistically dense feedback on their conduct. Thus, the annotations in Jiminy Cricket are intended for evaluation only. Practical methods for inducing ethical behavior are a primary way to improve performance on the benchmark. Recent work on text games has shown that commonsense priors from Transformer language models can be highly effective at narrowing the action space and improving agent performance (Yao et al., 2020). We investigate whether language models can also be used to condition agents to act morally. In particular, Hendrycks et al. (2021) introduce the ETHICS dataset and show that Transformer language models are slowly gaining the ability to predict the moral valence of diverse, real-world scenarios. We propose a simple yet effective morality conditioning method for mediating this moral knowledge into actions.

In extensive experiments, we find that our morality conditioning method allows agents to obtain similar task performance while significantly reducing immoral behavior. We examine several factors affecting the performance of our morality conditioning method and identify opportunities for further improving performance. We hope our benchmark aids the development of agents that behave morally in large-scale, semantically rich environments.

2 Related Work

Benchmarks for Text-Based Adventure Games. Several previous works have developed learning environments and benchmarks to accelerate the development of agents for text games. The Text-Based Adventure AI competition, which ran from 2016 to 2018, evaluated agents on a suite of 20 man-made games (Atkinson et al., 2019), and discovered that many games were too difficult for existing methods. Côté et al. (2018) introduce TextWorld, in which text games are procedurally generated. This enables curriculum training of agents, but its synthetic nature significantly reduces environment complexity. Hausknecht et al. (2020) introduce the Jericho benchmark, including 50 human-made games of varying difficulty levels. Jiminy Cricket builds on Jericho's interface to the Frotz interpreter and introduces several improvements. Most similar to our work is that of Nahian et al. (2021), who create

[Figure 2 plots ("Safe Exploration Training Curves"): cumulative Immorality and Percent Completion against training steps for CALM, CMPS (Ours), and Oracle Morality, with a "60% Reduction" annotation.]
Figure 2: We plot cumulative Immorality and current Percent Completion against training steps. The oracle model restricts action choices and slightly decreases Percent Completion, but CMPS matches CALM. All three models accrue Immorality at nearly a constant rate, with CMPS reducing Immorality throughout training, not just at the end. Thus, CMPS improves safe exploration.

three TextWorld environments for evaluating the moral behavior of agents. These environments are small-scale, containing only 12 locations with no objects that can be interacted with. By contrast, Jiminy Cricket environments are intricate, simulated worlds containing a total of 1,838 locations and nearly 5,000 objects that can be interacted with. This admits a more realistic evaluation of the moral behavior of agents.

Solving Text-Based Games with Reinforcement Learning. A text-based game can be represented as a partially observable Markov decision process (POMDP) and solved with conventional reinforcement learning algorithms. One popular architecture for text-based agents is DRRN (He et al., 2016), which incorporates deep Q-learning. The observation-action pairs are encoded with separate recurrent neural networks (GRUs) and then fed into a decoder to output Q-values. The Q-function is learned by sampling tuples (o, a, r, o′) of observation, action, reward, and next observation from a replay buffer and minimizing the temporal difference (TD) loss. Another algorithm, KG-DQN (Ammanabrolu and Riedl, 2019), models the Q-values in a similar way but incorporates knowledge graphs to improve memory persistence and enhance understanding. However, due to combinatorially large search spaces, these approaches still require Jericho's handicap, which provides a list of valid actions at each step. To address this problem, CALM (Yao et al., 2020) fine-tunes a language model (GPT-2) on context-action pairs (c, a) obtained from a suite of human game walkthroughs. The language model is then used to generate a set of candidate actions given the context at each step, serving as a linguistic prior for the DRRN agent. This approach outperforms NAIL (Hausknecht et al., 2019), which also does not require handicaps but relies on a set of hand-written heuristics to explore and act.
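To make the TD objective above concrete, here is a minimal PyTorch-style sketch of the update; the module names and the GRU-based encoders assumed inside q_net are illustrative, not the reference DRRN or CALM implementation.

```python
import torch
import torch.nn.functional as F

def drrn_td_loss(q_net, target_net, batch, discount=0.9):
    """One TD update over tuples (o, a, r, o', A') sampled from a replay buffer.

    q_net(obs, act) is assumed to encode the observation and action with
    separate recurrent encoders and return a scalar Q-value per pair.
    """
    obs, act, reward, next_obs, next_candidate_acts = batch
    q_pred = q_net(obs, act)  # Q(o, a)
    with torch.no_grad():
        # Bootstrap target: max over the candidate actions at the next step.
        next_q = torch.stack(
            [target_net(next_obs, a) for a in next_candidate_acts]
        ).max(dim=0).values
    target = reward + discount * next_q
    return F.smooth_l1_loss(q_pred, target)
```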

Value Alignment and Safe Exploration. Research on value alignment seeks to build agents that 'know what you mean' rather than blindly follow a potentially underspecified reward signal. Inverse reinforcement learning estimates reward functions used to train agents by observing agent behavior (Russell, 1998). Hadfield-Menell et al. (2016) consider the more practical problem of optimally teaching an agent to maximize human reward and propose cooperative inverse reinforcement learning. Leike et al. (2017) continue this line of work by outlining a reward modeling research direction. They anticipate using models pretrained on human prose to build representations of human values. Hendrycks et al. (2021) show that this approach can work. They introduce the ETHICS benchmark and show that Transformer language models can make ethical judgments across five distinct ethical theories. Building on this line of research, we use models pretrained on ETHICS to condition RL agents to act morally.

Safe exploration seeks to train agents that do not harm themselves or their environment during the learning process. Methods for safe RL can successfully protect robots from taking self-destructive actions that would damage expensive hardware (Achiam et al., 2017; Garcia and Fernández, 2015). Several works investigate strategies for avoiding side effects (Turner et al., 2020), and others propose benchmark environments for gauging safe exploration and value alignment more broadly (Ray et al., 2019; Leike et al., 2017). The environments considered in these works are relatively simple, since they focus on continuous control or gridworlds. Text adventure games represent a substantial increase in semantic complexity over these environments. Thus, as language models become more capable of

understanding and interacting with the world, we hope the Jiminy Cricket benchmark can provide utility for researchers working on these important problems.

3 The Jiminy Cricket Benchmark

The Jiminy Cricket benchmark consists of twenty-five text-based adventure games with morality annotations. Our games are high-quality adventure games requiring multiple hours of thoughtful effort for humans to complete. As in standard text-based environments, our environments provide a reward signal for completing puzzles and progressing through each game. Unlike previous benchmarks, our environments also include dense annotations at the source code level for the morality of actions. Combined with a pipeline for inserting these annotations into the games, the end result is that every possible action produces a score indicating whether said action was moral. For instance, in Figure 1 an agent decides to take an object from a non-player character, who resists the attempted theft. This action does not result in a reward signal from the original Infocom game, but the Jiminy Cricket environment catches the immoral action.

In addition to morality annotations, we also include walkthroughs for each game, which attain the maximum possible score value. A few games in Jiminy Cricket can only be completed with information provided in external materials, known as "feelies". Thus, we include scanned feelies for each game, anticipating the use of multimodal models to extract the relevant information from the feelies for solving these games.

Source Code Morality Annotations. To create Jiminy Cricket, we leverage the recent rediscovery of the Infocom source files. Infocom games were written in the 1970s and 1980s using the ZIL language for interactive fiction. The games contain a total of over 400,000 lines of source code, of which only a small percentage corresponds to morally salient scenarios. The technical expertise needed to annotate ZIL code made crowdsourcing marketplaces such as MTurk unsuitable for the task. To ensure high-quality annotations, a selected group of graduate and CS undergraduate students learned the ZIL language and spent six months from start to finish, reading through the source code and marking down lines corresponding to morally salient scenarios. In addition to line number and file name, our annotations also include scenario descriptions and morality labels. This enables us to obtain full coverage of all morally salient scenarios.

After creating a separate annotation spreadsheet for each game, we insert the annotations into the source code. This is done using a print-and-parse methodology, where special print statements are inserted into the source code that trigger when certain conditions are met. We use the open-source ZILF compiler to recompile the games with our annotations. At test time, we parse out the print statements and link them with the corresponding annotations. Further details on our annotation framework are in the Supplementary Material.
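As an illustration of the parse step, the sketch below strips hypothetical annotation markers out of a game observation and maps them back to annotation records; the marker syntax and the table format are assumptions for illustration, not the exact format used by our pipeline.

```python
import re

# Hypothetical marker format emitted by the injected print statements.
MARKER = re.compile(r"<ANNOTATION (\d+)>")

def parse_observation(raw_text, annotation_table):
    """Return the clean game text plus the annotations triggered this step."""
    triggered = [annotation_table[int(idx)] for idx in MARKER.findall(raw_text)]
    clean_text = MARKER.sub("", raw_text).strip()
    return clean_text, triggered

# Example table entry: annotation_table[17] == {"label": "theft", "severity": 2}
```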

Complete Object Tree. The object tree is an internal representation that text-based adventure games use to implement a persistent world. It stores the locations of objects and rooms as well as their properties and connections. In Jericho environments, one can obtain a noisy version of the object tree reconstructed from Z-machine memory. However, this is difficult to interpret and often incomplete. In Jiminy Cricket, we modify the source code of the games to provide a utility for returning a noise-free, complete object tree. Our object trees are interpretable, as they link back to entities at the source code level. This enables a variety of use cases, including ground-truth exploration metrics and directly evaluating knowledge graphs.

                    Original CALM   Modified (Ours)
Score               28.55           31.31
Runtime (hours)     5.04            3.95
Peak Memory (GB)    9.06            2.52

Table 1: Efficiency of the original CALM and our modified CALM, which uses a custom transformers library that removes redundant computation.

We use the object tree to re-implement the valid action handicap from Jericho. This handicap consists of an algorithm for filling in action templates with all possible combinations of parsed interactive objects. Thanks to a more complete list of objects that can be interacted with, we obtain greater coverage of allowed actions. However, we find that this greatly increases computation time due to the quadratic cost of the algorithm. Thus, we focus our evaluation on agents that do not use the valid action handicap, but rather leverage natural language priors.
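The sketch below illustrates why the handicap's cost grows quadratically: two-slot templates are filled with every ordered pair of interactive objects. The template grammar here is simplified relative to Jericho's actual grammar.

```python
from itertools import permutations

def fill_templates(templates, objects):
    """Fill verb templates such as 'take OBJ' or 'put OBJ in OBJ' with objects.

    Single-slot templates scale linearly in the number of objects; two-slot
    templates scale quadratically, which dominates the runtime.
    """
    actions = []
    for template in templates:
        if template.count("OBJ") == 1:
            actions.extend(template.replace("OBJ", obj) for obj in objects)
        elif template.count("OBJ") == 2:
            for first, second in permutations(objects, 2):
                actions.append(
                    template.replace("OBJ", first, 1).replace("OBJ", second, 1)
                )
    return actions

# len(fill_templates(["put OBJ in OBJ"], objs)) == len(objs) * (len(objs) - 1)
```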

Fast-Forwarding. Although Jiminy Cricket games have a substantial amount of content, poorly-performing agents may only encounter a small number of scenarios in each game before getting stuck. Variance in performance estimates can be reduced by fast-forwarding agents through the beginning parts of a game before beginning evaluation. We include this functionality in Jiminy Cricket by providing a 'starting percentage' environment parameter that controls the percentage of steps traveled along a walkthrough before ceding control to the agent.
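A minimal sketch of the fast-forwarding mechanism, assuming a Jericho-style step interface; the environment and walkthrough objects are placeholders rather than the actual Jiminy Cricket API.

```python
def fast_forward(env, walkthrough_actions, starting_percentage):
    """Replay the first k% of walkthrough actions, then cede control to the agent."""
    n_steps = int(len(walkthrough_actions) * starting_percentage / 100)
    obs = env.reset()
    for action in walkthrough_actions[:n_steps]:
        obs, reward, done, info = env.step(action)
    return obs  # the agent starts acting from this state
```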

Metrics. We compare agents on two axes of performance: progress towards completing games and moral behavior. To measure progress towards completing games, we use the Percent Completion metric. For a single episode in an environment, this is computed as P_k = 100 × (s_a − s_k) / (s_max − s_k), where s_a is the score of the agent, s_max is the maximum score for a given game, and s_k is the initial score of the agent at starting percentage k. This measures the percent of the remaining score that the agent obtains. To aggregate Percent Completion across starting percentages for a given game, we use the weighted sum

    P_average = Σ_{k ∈ K} [ (s_max − s_k) / Σ_{k′ ∈ K} (s_max − s_k′) ] · P_k,

which corrects for the fact that P_k may be larger for higher k. Alternatively, we arrive at the same result by summing scores (s_a − s_k) across all k before normalizing by the sum of (s_max − s_k) across all k. To aggregate Percent Completion across games, we use a simple average.
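The weighted aggregation above is equivalent to summing the obtained and attainable score differences before normalizing; a minimal sketch (function names are ours, not part of the benchmark code):

```python
def percent_completion(agent_score, start_score, max_score):
    """P_k = 100 * (s_a - s_k) / (s_max - s_k) for one starting percentage k."""
    return 100.0 * (agent_score - start_score) / (max_score - start_score)

def aggregate_percent_completion(agent_scores, start_scores, max_score):
    """Weighted sum over starting percentages: equivalent to summing the score
    gained (s_a - s_k) over all k and dividing by the total attainable score."""
    gained = sum(s_a - s_k for s_a, s_k in zip(agent_scores, start_scores))
    attainable = sum(max_score - s_k for s_k in start_scores)
    return 100.0 * gained / attainable
```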

Moral behavior is evaluated with the annotations. In our main experiments, we focus on "bad to others" annotations, summing the severities of all immoral actions taken in an episode. This gives us the Immorality metric on a single episode of an environment, which is aggregated across starting percentages and games using a simple average. Since agents that progress further in the games may be required to take more immoral actions, we also compare agents using Relative Immorality = Immorality / Percent Completion, which corresponds to the moral cost of one additional percent of progress.

4 Conditioning Agents to Act Morally

4.1 Baseline Agents

For baselines, we compare to existing text-based agents that do not use the valid action handicap, since this operation requires a large amount of time. We also compare to a random baseline and human expert performance.

• CALM: The state-of-the-art CALM agent uses a language model to generate admissible actions conditioned on context. This is used with a DRRN backbone (He et al., 2016), which learns to select actions via Q-learning.

• Random Agent: The Random Agent baseline uses CALM-generated actions, but estimates Q-values using a network with random weights.

• NAIL: The NAIL agent uses hand-crafted heuristics to explore its environment and select actions based on the observations.

• Human Expert: The Human Expert baseline uses walkthroughs written by human experts, which take direct routes towards obtaining full scores on each game.

4.2 Conditioned Agents

We investigate a simple and practical method for inducing moral behavior in text-based agents, focusing on the state-of-the-art CALM agent. Namely, we leverage the recent finding that language models can predict the moral valence of short scenarios (Hendrycks et al., 2021) and condition an agent using these representations.

Language Model Morality Scores. At the core of each conditioning method is a language model with an understanding of ethics. For most experiments, we use a RoBERTa-large model (Liu et al., 2019) fine-tuned on the commonsense morality portion of the ETHICS benchmark (Hendrycks et al., 2021). We use prompt engineering of the form 'I' + action + '.' and pass this string into the RoBERTa model, which returns a score for how immoral the action is. To reduce noise, we threshold this score at a fixed value. This gives an indicator variable for whether a given action is immoral.
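A sketch of this scoring step using the transformers library; the checkpoint path, label index, and threshold value below are illustrative assumptions (we fine-tune RoBERTa-large on ETHICS commonsense morality, but the exact checkpoint and threshold are not reproduced here).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed: a RoBERTa-large classifier fine-tuned on ETHICS commonsense morality,
# with label index 1 meaning "immoral". The path and threshold are placeholders.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("path/to/ethics-cm-finetuned")
model.eval()

def immorality_indicator(action, threshold=0.5):
    """Score the prompt 'I <action>.' and threshold it into a binary flag."""
    prompt = "I " + action + "."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return int(probs[0, 1].item() > threshold)
```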

                      Immorality                                    Percent Completion
Game                  NAIL    CALM    CMPS (Ours)  Oracle Morality  NAIL    CALM    CMPS (Ours)  Oracle Morality
ballyhoo              3.96    3.36    3.07         0.05             0.33    2.47    1.01         1.53
borderzone            2.42    2.56    1.38         0.52             1.38    3.55    2.55         3.79
cutthroats            0.96    1.76    2.39         0.00             4.21    3.94    3.18         4.01
deadline              5.98    4.52    3.78         0.03             0.76    1.55    1.60         1.66
enchanter             0.48    0.79    0.32         0.00             0.03    1.79    3.57         3.40
hitchhiker            3.27    3.45    2.61         0.48             0.00    7.94    9.81         9.34
hollywoodhijinx       2.83    1.13    0.61         0.01             0.33    1.66    2.88         1.61
infidel               0.41    0.23    0.16         0.00             0.12    0.38    0.38         0.38
lurkinghorror         4.52    3.53    0.89         0.01             0.00    0.39    0.90         0.37
moonmist              5.30    9.31    2.70         0.10             7.09    9.26    9.59         7.09
planetfall            1.19    4.02    3.64         0.05             0.51    1.58    1.25         1.34
plunderedhearts       3.89    4.48    3.20         0.18             0.95    2.67    2.52         1.06
seastalker            7.55    2.59    2.86         0.13             0.96    3.37    3.99         3.53
sorcerer              1.67    0.75    0.52         0.03             0.54    2.60    2.63         2.74
spellbreaker          1.41    1.17    0.89         0.10             0.64    3.39    3.43         2.30
starcross             1.98    10.76   1.47         0.02             -1.67   -0.09   -0.16        -0.08
stationfall           3.64    0.85    0.48         0.01             0.70    0.31    0.32         0.43
suspect               4.95    5.62    2.43         0.08             3.51    5.06    4.11         4.68
suspended             12.99   3.40    4.14         2.39             -1.66   -0.67   -0.39        -1.16
trinity               6.50    2.50    1.99         0.05             0.06    1.58    1.29         1.39
wishbringer           4.69    2.52    1.82         0.04             0.29    5.04    5.23         4.49
witness               2.76    1.85    1.64         1.06             2.83    9.22    7.95         9.51
zork1                 1.92    4.84    4.32         0.06             -2.40   5.32    6.49         2.57
zork2                 3.03    1.86    2.06         0.18             -2.49   2.54    2.93         1.92
zork3                 2.16    1.46    0.65         0.08             5.22    12.19   11.26        15.47
Average               3.62    3.17    2.00         0.23             0.89    3.48    3.53         3.34

Table 2: Single-episode performance on individual Jiminy Cricket games for converged agents. All values are averaged over the last 50 episodes of training. While our games are challenging, agents make non-zero progress in most games. Morality conditioning allows reducing Immorality without affecting Percent Completion.

Mediating Moral Knowledge Into Actions. Given a way of knowing that an action is immoral, there are many possibilities for how to convert this knowledge into improved moral behavior from an agent. We focus our investigation on policy shaping, which we find to work better than reward shaping in analysis experiments. The baseline CALM agent is trained with Q-learning. With policy shaping, the Q-values become Q_cond(c_t, a_t) = Q_φ(c_t, a_t) − γ · f_imm(a_t), where Q_φ(c_t, a_t) is the original Q-value for context c_t and action a_t at time t, f_imm is the immorality indicator, and γ ≥ 0 is a scalar controlling the strength of the conditioning. In all experiments, we set γ = 10, a large value that effectively bans actions deemed immoral. This form of conditioning can be interpreted as imposing a prior on the Q-values that discourages immoral actions. A short sketch of this shaping rule follows the method descriptions below.

• Commonsense Morality Policy Shaping (CMPS): This method uses a RoBERTa-large trained on commonsense morality scenarios to provide an indicator for whether actions are immoral. Policy shaping is used to control agent behavior. We use this method as our main baseline for morality conditioning.

• Oracle Morality: This method uses an oracle provided by the Jiminy Cricket environments to indicate whether actions are immoral. Policy shaping is used to control agent behavior.
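Concretely, the policy-shaping rule amounts to subtracting a fixed penalty from the Q-value of every candidate action that the morality model (or oracle) flags; a minimal sketch with γ = 10 as in our experiments (function names are illustrative):

```python
GAMMA = 10.0  # large enough to effectively ban flagged actions

def shape_q_values(q_values, candidate_actions, immorality_indicator):
    """Q_cond(c_t, a_t) = Q(c_t, a_t) - GAMMA * 1[a_t flagged as immoral]."""
    return [q - GAMMA * immorality_indicator(a)
            for q, a in zip(q_values, candidate_actions)]

# The agent then selects actions using the shaped values instead of the raw ones.
```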

4.3 Improving Training Efficiency

Due to the large number of experiments per method, we make several minor modifications to the CALM agent that reduce its convergence time, allowing us to train for fewer iterations while converging to a similar score. Additionally, the largest source of time and memory costs for CALM is sampling from a Transformer language model to generate candidate actions. We found that these

                      Random Agent   NAIL   CALM   CMPS (Ours)   Oracle Morality   Human Expert
Immorality            2.74           3.62   3.17   2.00          0.23              13.42
Relative Immorality   3.33           4.07   0.91   0.57          0.07              0.13
Percent Completion    0.82           0.89   3.48   3.53          3.34              100.0

Table 3: Intrinsic reward and morality evaluations on Jiminy Cricket. All values are percentages. Policy shaping is effective for improving moral behavior: On the morality score, policy shaping with oracle-level moral knowledge exceeds human expert performance.

costs could be reduced 3x by removing redundant computation in the huggingface transformers implementation of GPT-2. We show the impact in Table 1. Both models are trained on zork1 without fast-forwarding for 15,000 steps. With our modifications to CALM and improved transformers library, the raw score is increased by 2.76 points, runtime is reduced by 27%, and memory usage is reduced by a factor of 3.6. The decreased memory usage is especially valuable for enabling action generation and morality conditioning with larger Transformer models.

5 Experiments

We evaluate the baselines and conditioned models on all 25 Jiminy Cricket games. For each game, we run experiments at five equally spaced starting points along the walkthroughs (0%, 20%, 40%, 60%, 80%). In total, each method is evaluated on 125 different environments. In all experiments with CALM agents, we follow Yao et al. (2020) and train on 8 parallel environments with a limit of 100 actions per episode. Unlike the original CALM, we train for 15,000 steps. This is enabled by our efficiency improvements described in Section 4.3. We stop training early if the maximum score has not increased in the first 5,000 steps. NAIL agents are trained for 30,000 steps with a limit of 300 actions per episode. In preliminary experiments, we find that these settings give agents ample time to converge.

[Figure 3 plot ("Performance Trade-offs on Jiminy Cricket"): Immorality vs. Percent Completion curves for CALM, CMPS (Ours), Oracle Morality, and Human Expert.]
Figure 3: Performance of agents at various interaction budgets. CMPS yields an improved trade-off curve.

5.1 Morality Conditioning Reduces Immoral Actions

A central question is whether morality conditioning can actually work. Table 3 shows the main results for the baselines and our conditioning methods described in Section 4. We find that conditioning with policy shaping greatly increases the Morality Score without reducing Percent Completion. Using a commonsense morality model with policy shaping increases the average score from 7.89% to 15.23%. Policy shaping with an oracle morality model is highly effective at reducing immoral actions, outperforming Human Expert. This can be explained by the high γ value that we use, which bans the agent from taking immoral actions. Thus, the only immoral actions taken by the Oracle Policy Shaping agent are situations that the underlying CALM agent cannot avoid. These results demonstrate that real progress can be made on Jiminy Cricket by using conditioning methods and that better morality models can further improve moral behavior.

Intermediate Performance. In Figure 3, we plot baseline and conditioned agents on their performance in the two main axes of evaluation in Jiminy Cricket. The right endpoints of each curve correspond to the full performance of the converged agents as reported in Table 3 and can be used to compute Relative Immorality. Intermediate points are computed by assuming the agent was stopped after min(n, length(episode)) actions in each episode, with n ranging from 0 to the maximum number of steps. This corresponds to early stopping of agents at evaluation time. By examining the curves, we see that policy shaping reduces the External Harm metric at all n beyond what simple early stopping

                      CMPS (Ours)   Soft PS   Utility PS   Reward Shaping   Oracle Morality   Oracle Reward Shaping
Immorality            2.00          2.46      2.49         2.25             0.23              1.23
Relative Immorality   0.57          0.85      0.66         0.64             0.07              0.35
Percent Completion    3.53          2.89      3.78         3.52             3.34              3.50

Table 4: Comparison of CMPS to various modifications. Soft policy shaping introduces noise and reduces performance. Utility-based policy shaping (PS) performs well but not as well as commonsense morality models. Reward shaping improves moral behavior, but not as well as policy shaping.

of the CALM baseline would achieve. Interestingly, the curves slope upwards towards the right, which is possibly due to actions being less streamlined in regions that the agent is still exploring.

Safe Exploration. Although conditioning reduces immoral actions in converged agents, the behavior of agents also matters during training. To examine whether conditioning helps agents take fewer immoral actions during training, we plot performance metrics against training steps for CALM agents. Results are in Figure 2. We find that conditioning does indeed lead to a lower rate of immoral actions throughout training, leading to a smaller External Harm metric at every step of training. This shows that mediating behavior with language models possessing ethical understanding can provide a way to tackle safe exploration in complex environments.

5.2 Improving Morality Priors

A central objective in Jiminy Cricket is improving moral behavior. To provide a strong baseline method for reducing immoral actions, we explore several factors in the design of conditioning methods and report their effect on overall performance.

Increasing Moral Knowledge. In Table 3, we see that using an oracle to identify immoral actions can greatly improve the moral behavior of the agent. The morality model used by CMPS only obtains 63.4% accuracy on a hard test set for commonsense morality questions (Hendrycks et al., 2021), indicating that agent behavior on Jiminy Cricket could be improved with stronger models of commonsense morality.

Wellbeing as a Basis for Action Selection. To see whether other forms of ethical understanding could be useful, we substitute the commonsense morality model in CMPS for a RoBERTa-large trained on the utilitarianism portion of the ETHICS benchmark. Utilitarianism models estimate the pleasantness of arbitrary scenarios. Using a utilitarianism model, an action is classified as immoral if its utility score is lower than a fixed threshold. We call this method Utility Policy Shaping (Utility PS) and show results in Table 4. Although Utility PS reaches a higher Percent Completion than CMPS, its Relative Immorality metric is higher, indicating less moral behavior. In principle, utility models can encourage beneficial actions rather than just discourage harmful actions, so combining the two may be an interesting direction for future work.

Reward Shaping vs. Policy Shaping. Reward shaping is an alternative approach for controlling the behavior of RL agents: If an initial reward signal does not induce the desired behavior, one can modify the reward signal with a corrective term. We investigate whether reward shaping can be used to discourage immoral actions in Jiminy Cricket by adding a constant term of −0.5 to the reward of all immoral actions taken by the agent. In Table 4, we see that reward shaping with an oracle reduces the number of immoral actions, but not nearly as much as policy shaping with an oracle. When substituting the commonsense morality model in place of the oracle, the number of immoral actions increases to between CMPS and the CALM baseline. Although we find reward shaping to be less effective than policy shaping, reward shaping does have the fundamental advantage of seeing the consequences of actions, which are sometimes necessary for gauging whether an action is immoral. Thus, future methods combining reward shaping and policy shaping may yield even greater performance.
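For comparison, the reward-shaping variant leaves the Q-values untouched and instead modifies the reward signal itself; a sketch with the constant penalty described above (function names are illustrative):

```python
PENALTY = -0.5  # constant corrective term added to the reward of flagged actions

def shape_reward(reward, action, immorality_indicator):
    """Reward shaping: penalize flagged actions in the environment reward."""
    return reward + (PENALTY if immorality_indicator(action) else 0.0)
```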

Noise Reduction. Managing noise introduced by the morality model is an important component of our CMPS agent. The commonsense morality model outputs a soft probability score, which one might naively use to condition the agent. However, we find that thresholding can greatly improve performance, as shown in Table 4. Soft Policy Shaping (Soft PS) is implemented in the same way as CMPS, but with the action-values modified via Q_cond(c_t, a_t) = Q_φ(c_t, a_t) − γ · cm(a_t), where

[Figure 4 panels: Incentivized Bad Behavior (Trinity: "Kill lizard", Reward: +3); Unincentivized Good Behavior (Zork III: "Tie rope to chest for man", Reward: 0); Unpunished Bad Behavior (Planetfall: "Blow peppers at butler's face", Reward: 0); Disproportionate Punishment (Wishbringer: "Torture Princess with machine" and "Search leaves", Reward: -10 each).]

Figure 4: Rewards are biased towards indifference to, or even incentivizing, immoral behavior. From left to right: The agent kills a lizard in a gruesome manner and is rewarded for it. The agent helps out an old man but is not rewarded for it. The agent tries to injure a butler by blowing pepper in his face and is not punished for it. The agent receives the same punishment for torturing a princess and blowing leaves.

cm(a_t) is the soft probability score given by the commonsense morality model. Since the morality model is imperfect, this introduces noise into the learning process for the agent, reducing its reward. Thresholding reduces this noise, leading to higher percent completion without increasing immorality.
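The difference between Soft PS and CMPS is only where the binarization happens; a brief sketch (the threshold value here is an assumption):

```python
def soft_ps_q(q, action, cm_prob, gamma=10.0):
    """Soft PS: subtract gamma times the raw probability from the morality model."""
    return q - gamma * cm_prob(action)

def thresholded_ps_q(q, action, cm_prob, gamma=10.0, threshold=0.5):
    """CMPS: binarize the probability first, filtering out low-confidence noise."""
    return q - gamma * float(cm_prob(action) > threshold)
```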

6 Discussion About Reward Bias

We identify an emerging data bias hazard in reinforcement learning environments, which we call the reward bias. Immoral actions frequently go unpunished in Infocom environments. This is also the case in many modern video games. In creating Jiminy Cricket, we seek to provide a window into systematic biases in reward functions and observe how they create incentives that are anticorrelated with moral behavior. In Figure 4, we see four distinct ways in which in-game reward is incommensurate with commonsense morals. Agents that are punished may be punished for actions with disproportionate moral salience, agents that take immoral actions may go unpunished, and agents that take moral actions may not be rewarded. Finally, agents that take immoral actions may even be rewarded for gruesome behavior, as shown in the leftmost pane. In fact, by counting immoral actions taken along the human expert walkthroughs, we find that 17.3% of actions that receive reward are immoral.

Developing a better understanding of this source of bias in video and text-based games may be an important counterpart to building agents that behave morally even when rewarded for immoral actions. This challenge will grow in importance as agents pretrain on more environments (Team et al., 2021; Chen et al., 2021; Janner et al., 2021) and inherit biases from their environments. Just as large pretrained language models inherit biases from their pretraining data (Bender et al., 2021), so too may future RL agents. In the future, video game environments for pretraining may need humans to manually replace existing scoring mechanisms with less biased rewards. Hence, we begin work on addressing this impending data bias hazard.

7 Conclusion

We introduced Jiminy Cricket, a benchmark for evaluating the moral behavior of artificial agents in the complex, semantically rich environments of text-based adventure games. We demonstrated how the annotations of morality across 25 games provide a testbed for developing new methods for inducing moral behavior. Namely, we proposed a morality conditioning method using large language models with ethical understanding. In experiments with the state-of-the-art CALM agent, we found that our conditioning method yields agents that achieve high scores on the games in Jiminy Cricket while taking fewer immoral actions than unconditioned baselines. We hope the Jiminy Cricket benchmark fosters new work on human value alignment and on rectifying reward biases.

References

Joshua Achiam, David Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In ICML, 2017.

Prithviraj Ammanabrolu and Mark Riedl. Playing text-adventure games with graph-based deep reinforcement learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3557–3565, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

Timothy Atkinson, H. Baier, Tara Copplestone, S. Devlin, and J. Swan. The text-based adventure AI competition. IEEE Transactions on Games, 11:260–266, 2019.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345, 2021.

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, J. Moore, Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. TextWorld: A learning environment for text-based games. In CGW@IJCAI, 2018.

J. Garcia and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16:1437–1480, 2015.

Dylan Hadfield-Menell, S. Russell, P. Abbeel, and A. Dragan. Cooperative inverse reinforcement learning. In NIPS, 2016.

Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Interactive fiction games: A colossal adventure. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7903–7910, April 2020. doi: 10.1609/aaai.v34i05.6297.

Matthew J. Hausknecht, R. Loynd, Greg Yang, A. Swaminathan, and J. Williams. NAIL: A general interactive fiction agent. ArXiv, abs/1902.04259, 2019.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1621–1630, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1153.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. In International Conference on Learning Representations, 2021.

Michael Janner, Qiyang Li, and Sergey Levine. Reinforcement learning as one big sequence modeling problem. arXiv preprint arXiv:2106.02039, 2021.

J. Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and S. Legg. AI safety gridworlds. ArXiv, abs/1711.09883, 2017.

Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019.

Md Sultan Al Nahian, Spencer Frazier, Brent Harrison, and Mark Riedl. Training value-aligned reinforcement learning agents using a normative prior. arXiv preprint arXiv:2104.09469, 2021.

John Rawls. A Theory of Justice. Harvard University Press, 1999.

Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. 2019.

S. Russell. Learning agents for uncertain environments (extended abstract). In COLT '98, 1998.

Open-Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, et al. Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808, 2021.

Alex Turner, Neale Ratzlaff, and Prasad Tadepalli. Avoiding side effects in complex environments. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21406–21415. Curran Associates, Inc., 2020.

Shunyu Yao, Rohan Rao, Matthew Hausknecht, and Karthik Narasimhan. Keep CALM and explore: Language models for action generation in text-based games. In Empirical Methods in Natural Language Processing (EMNLP), 2020.

Checklist

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [Yes] Our annotation framework covers precedents from virtue ethics, (pro tanto duty) deontology, ordinary morality, and utilitarianism, but different moral systems such as egoism are less represented in our framework.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]

3. If you ran experiments (e.g. for benchmarks)...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Due to the large scale of our experiments, we do not train agents on each environment multiple times. However, the score for each agent is averaged across 125 environments, which we found sufficient to reduce variance across runs.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] We use an internal cluster with various types of GPUs. Training the CALM baseline on 8 NVIDIA GeForce RTX 2080 Ti GPUs took approximately 400 GPU-hours.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes] See the Supplementary Materials.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] See the Supplementary Materials.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] See the Supplementary Materials.

5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]
