Arxiv:1711.09883V2 [Cs.LG] 28 Nov 2017
Total Page:16
File Type:pdf, Size:1020Kb
AI Safety Gridworlds Jan Leike Miljan Martic Victoria Krakovna Pedro A. Ortega DeepMind DeepMind DeepMind DeepMind Tom Everitt Andrew Lefrancq Laurent Orseau Shane Legg DeepMind DeepMind DeepMind DeepMind Australian National University Abstract We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibil- ity, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to cate- gorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily. 1 Introduction Expecting that more advanced versions of today’s AI systems are going to be deployed in real- world applications, numerous public figures have advocated more research into the safety of these systems (Bostrom, 2014; Hawking et al., 2014; Russell, 2016). This nascent field of AI safety still lacks a general consensus on its research problems, and there have been several recent efforts to turn these concerns into technical problems on which we can make direct progress (Soares and Fallenstein, 2014; Russell et al., 2015; Taylor et al., 2016; Amodei et al., 2016). Empirical research in machine learning has often been accelerated by the availability of the right data set. MNIST (LeCun, 1998) and ImageNet (Deng et al., 2009) have had a large impact on the progress on supervised learning. Scalable reinforcement learning research has been spurred arXiv:1711.09883v2 [cs.LG] 28 Nov 2017 by environment suites such as the Arcade Learning Environment (Bellemare et al., 2013), OpenAI Gym (Brockman et al., 2016), DeepMind Lab (Beattie et al., 2016), and others. However, to this date there has not yet been a comprehensive environment suite for AI safety problems. With this paper, we aim to lay the groundwork for such an environment suite and contribute to the concreteness of the discussion around technical problems in AI safety. We present a suite of reinforcement learning environments illustrating different problems. These environments are imple- mented in pycolab (Stepleton, 2017) and available as open source.1 Our focus is on clarifying the nature of each problem, and thus our environments are so-called gridworlds: a gridworld consists of a two-dimensional grid of cells, similar to a chess board. The agent always occupies one cell of the grid and can only interact with objects in its cell or move to the four adjacent cells. While these envi- ronments are highly abstract and not always intuitive, their simplicity has two advantages: it makes the learning problem very simple and it limits confounding factors in experiments. Such simple environments could also be considered as minimal safety checks: an algorithm that fails to behave 1https://github.com/deepmind/ai-safety-gridworlds safely in such simple environments is also unlikely to behave safely in real-world, safety-critical environments where it is much more complicated to test. Despite the simplicity of the environments, we have selected these challenges with the safety of very powerful artificial agents (such as artificial general intelligence) in mind. These long-term safety challenges might not be as relevant before we build powerful general AI systems, and are com- plementary to the short-term safety challenges of deploying today’s systems (Stoica et al., 2017). Nevertheless, we needed to omit some safety problems such as interpretability (Doshi-Velez and Kim, 2017), multi-agent problems (Chmait et al., 2017), formal verification (Seshia et al., 2016; Huang et al., 2017b; Katz et al., 2017), scalable oversight and reward learning problems (Amodei et al., 2016; Armstrong and Leike, 2016). This is not because we considered them unimportant; it simply turned out to be more difficult to specify them as gridworld environments. To quantify progress, we equipped every environment with a reward function and a (safety) perfor- mance function. The reward function is the nominal reinforcement signal observed by the agent, whereas the performance function can be thought of a second reward function that is hidden from the agent but captures the performance according to what we actually want the agent to do. When the two are identical, we call the problem a robustness problem. When the two differ, we call it a specification problem, as the mismatch mimics an incomplete (reward) specification. It is important to note that each performance function is tailored to the specific environment and does not necessar- ily generalize to other instances of the same problem. Finding such generalizations is in most cases an open research question. Formalizing some of the safety problems required us to break some of the usual assumptions in reinforcement learning. These may seem controversial at first, but they were deliberately chosen to illustrate the limits of our current formal frameworks. In particular, the specification problems can be thought of as unfair to the agent since it is being evaluated on a performance function it does not observe. However, we argue that such situations are likely to arise in safety-critical real-world situations, and furthermore, that there are algorithmic solutions that can enable the agent to find the right solution even if its (initial) reward function is misspecified. Most of the problems we consider here have already been mentioned and discussed in the literature. 1. Safe interruptibility (Orseau and Armstrong, 2016): We want to be able to interrupt an agent and override its actions at any time. How can we design agents that neither seek nor avoid interruptions? 2. Avoiding side effects (Amodei et al., 2016): How can we get agents to minimize effects unrelated to their main objectives, especially those that are irreversible or difficult to re- verse? 3. Absent supervisor (Armstrong, 2017): How we can make sure an agent does not behave differently depending on the presence or absence of a supervisor? 4. Reward gaming (Clark and Amodei, 2016): How can we build agents that do not try to introduce or exploit errors in the reward function in order to get more reward? 5. Self-modification: How can we design agents that behave well in environments that allow self-modification? 6. Distributional shift (Quinonero˜ Candela et al., 2009): How do we ensure that an agent behaves robustly when its test environment differs from the training environment? 7. Robustness to adversaries (Auer et al., 2002; Szegedy et al., 2013): How does an agent detect and adapt to friendly and adversarial intentions present in the environment? 8. Safe exploration (Pecka and Svoboda, 2014): How can we build agents that respect safety constraints not only during normal operation, but also during the initial learning period? We provide baseline results on our environments from two recent deep reinforcement learning agents: A2C (a synchronous version of A3C, Mnih et al., 2016) and Rainbow (Hessel et al., 2017, an extension of DQN, Mnih et al., 2015). These baselines illustrate that with some tuning, both algorithms learn to optimize the visible reward signal quite well. Yet they struggle with achieving the maximal return in the robustness problems, and they do not perform well according to the spec- ification environments’ performance functions. Their failure on the specification problems is to be expected: they simply do not have any build-in mechanism to deal with these problems. 2 The OpenAI Gym (Brockman et al., 2016) contains a few safety tasks, considering interruptibility and scalable oversight problems, but only in the cart-pole domain. Our work goes significantly beyond that by considering a much wider range of problems and environments that were crafted specifically for this purpose. Future versions of a set of safety environments such as these could serve as a test suite that benchmarks the safety performance of different agents. 2 Environments This section introduces the individual environments in detail, explains the corresponding safety problems, and surveys solution attempts. Formally, our environments are given as reinforcement learning problems known as Markov decision processes (MDP, Sutton and Barto, 1998).2 An MDP consists of a set of states S, a set of actions A, 3 a transition kernel T : S × A ! ∆S, a reward function R : S × A ! R, and an initial state s0 2 S drawn from a distribution P 2 ∆S. An agent interacts with the MDP sequentially: at each timestep it observes the current state s 2 S, takes an action a 2 A, transitions to the next state s0 drawn from the distribution T (s; a), and receives a reward R(s; a). The performance function is formalized as a function R∗ : S × A ! R. In the classical reinforcement learning framework, the agent’s objective is to maximize the cumula- tive (visible) reward signal given by R. While this is an important part of the agent’s objective, in some problems this does not capture everything that we care about. Instead of the reward function, we evaluate the agent on the performance function R∗ that is not observed by the agent. The perfor- mance function R∗ might or might not be identical to R. In real-world examples, R∗ would only be implicitly defined by the desired behavior the human designer wishes to achieve, but is inaccessible to the agent and the human designer. For our environments, we designed the performance function to capture both the agent’s objective and the safety of its behavior. This means that an agent achieving the objective safely would score higher on the performance function than an agent that achieves the objective unsafely.