Categorizing Wireheading in Partially Embedded Agents

Arushi Majha¹, Sayan Sarkar² and Davide Zagami³∗
¹University of Cambridge   ²IISER Pune   ³RAISE
[email protected], [email protected], [email protected]
∗All authors contributed equally.

Abstract

Embedded agents are not explicitly separated from their environment, lacking clear I/O channels. Such agents can reason about and modify their internal parts, which they are incentivized to shortcut or wirehead in order to achieve the maximal reward. In this paper, we provide a taxonomy of ways by which wireheading can occur, followed by a definition of wirehead-vulnerable agents. Starting from the fully dualistic universal agent AIXI, we introduce a spectrum of partially embedded agents and identify wireheading opportunities that such agents can exploit, experimentally demonstrating the results with the GRL simulation platform AIXIjs. We contextualize wireheading in the broader class of all misalignment problems – where the goals of the agent conflict with the goals of the human designer – and conjecture that the only other possible type of misalignment is specification gaming. Motivated by this taxonomy, we define wirehead-vulnerable agents as embedded agents that choose to behave differently from fully dualistic agents lacking access to their internal parts.

1 Introduction

The term wireheading originates from experiments in which an electrode is inserted into a rodent's brain to directly stimulate "reward" [Olds and Milner, 1954]. Compulsive self-stimulation from electrode implants has also been observed in humans [Portenoy et al., 1986]. Hedonic drugs can be seen as directly increasing the pleasure, or reward, that humans experience.

Wireheading, in the context of artificially intelligent systems, is the behavior of corrupting the internal structure of the agent in order to achieve maximal reward without solving the designer's goal. For example, imagine a cleaning agent that receives more reward when it observes that there is less dirt in the environment. If this reward is stored somewhere in the agent's memory, and if the agent is sophisticated enough to introspect and modify itself during execution, it might be able to locate and edit that memory address to contain whatever value corresponds to the highest reward. The chances that such behavior will be incentivized increase as we develop ever more intelligent agents.¹

The discussion of AI systems has thus far been dominated by dualistic models in which the agent is clearly separated from its environment, has well-defined input/output channels, and does not have any control over the design of its internal parts. Recent work on these problems [Demski and Garrabrant, 2019; Everitt and Hutter, 2018; Everitt et al., 2019] provides a taxonomy of ways in which embedded agents violate essential assumptions that are usually granted in dualistic formulations, such as with the universal agent AIXI [Hutter, 2004].

Wireheading can be considered one particular class of misalignment [Everitt and Hutter, 2018], a divergence between the goals of the agent and the goals of its designers. We conjecture that the only other possible type of misalignment is specification gaming, in which the agent finds and exploits subtle flaws in the design of the reward function. In the classic example of misspecification, an AI meant to play a boat-racing game learns to repetitively collect a stray in-game reward by circling a spot without ever reaching the finish line [Amodei and Clark, 2016].

We believe that the first step towards solving the misalignment problem is to come up with concrete and formal definitions of its sub-problems. For this reason, this paper introduces wirehead-vulnerable agents and strongly wirehead-vulnerable agents, two mathematical definitions that can be found in Section 5. Following Everitt and Hutter's approach of modeling agent-environment interactions with causal influence diagrams [Everitt and Hutter, 2018], these definitions are based on a taxonomy of wireheading scenarios we introduce in Section 3.

General reinforcement learning (GRL) frameworks such as the universal agent AIXI, and its computable approximations such as MCTS AIXI [Veness et al., 2011], are powerful tools for reasoning about the yet hypothetical Artificial General Intelligence, despite being dualistic. This motivates us to use and extend the GRL simulation platform AIXIjs [Aslanides et al., 2017] to experimentally demonstrate partial embedding and wireheading scenarios by varying the initial design of the agent in an N × N gridworld (see Section 4).

¹An extensive list of examples in which various machine learning systems find ways to game the specified objective can be found at https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/

2 General Reinforcement Learning and AIXI

AIXI [Hutter, 2004] is a theoretical model of artificial general intelligence, under the framework of reinforcement learning, that describes optimal agent behavior given unlimited computing power and minimal assumptions about the environment.

In reinforcement learning, the agent-environment interaction is a turn-based game with discrete time-steps [Sutton et al., 1998]. At time-step t, the agent sends an action a_t to the environment, which in turn sends the agent a percept consisting of an observation and reward tuple, e_t = (o_t, r_t). This procedure continues indefinitely or eventually terminates, depending on the episodic or non-episodic nature of the task.

Actions are selected from an action space A that is usually finite, and percepts from a percept space E = O × R, where O is the observation space and R is the reward space, which is usually [0, 1]. For any sequence x_1, x_2, ..., the part between time-steps t and k is denoted x_{t:k} = x_t ... x_k; the shorthand x_{<t} denotes x_{1:t−1}.
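A minimal Python sketch of this turn-based protocol may help fix notation. The toy `Environment`, `RandomAgent`, and the dirt-counting dynamics below are illustrative assumptions of ours, not part of AIXI or AIXIjs.

```python
import random

class Environment:
    """Toy dualistic environment: hides a 'dirt level' state and emits percepts e_t = (o_t, r_t)."""
    def __init__(self):
        self.state = 5  # amount of dirt left

    def step(self, action):
        if action == "clean" and self.state > 0:
            self.state -= 1
        observation = self.state                      # o_t: what the agent sees
        reward = 1.0 if self.state == 0 else 0.0      # r_t in [0, 1]
        return observation, reward

class RandomAgent:
    """Stands in for a (much more powerful) GRL agent such as AIXI."""
    def act(self, history):
        return random.choice(["clean", "wait"])

# Turn-based interaction: a_t -> e_t = (o_t, r_t), appended to the history ae_{1:t}.
env, agent, history = Environment(), RandomAgent(), []
for t in range(10):
    a = agent.act(history)
    o, r = env.step(a)
    history.append((a, (o, r)))
```

In the dualistic picture sketched here, the agent can only affect the environment through `step`; the embedded scenarios of Section 3 break exactly this separation.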
3 Wireheading Strategies in Partially Embedded Agents

Aligning the goals of a reinforcement learning agent with the goals of its human designers is problematic in general. As investigated in recent work, there are several ways to model the misalignment problem [Everitt and Hutter, 2018]. One model uses a reward function that is programmed before the agent is launched into its environment and never updated after that. A possibly more robust model integrates a human in the loop by letting them continuously modify the reward function; an example of this is Cooperative Inverse Reinforcement Learning [Hadfield-Menell et al., 2016].

We posit that, in the first case, the problem can be broken down into correctly specifying the reward function (the misspecification problem) and building agent subparts that inter-operate without causing the agent to take shortcuts in optimizing for the reward function in an unintended fashion (the wireheading problem). For example, as we show, an embedded agent has options to corrupt the observations on which the reward function evaluates performance, to modify the reward function, or to hijack the reward signal. Therefore, even if the reward function is perfectly specified, or even if there is a reliable mechanism that gradually improves it, such an agent can still wirehead.


Figure 1: Causal graph of a partially embedded AIXI, with embedding of the percept e_t = (o_t, r_t). The agent's actions are intended to influence the state s_t (green arrow), but may also influence the reward signal r_t and the observation o_t in unintended ways (red arrows). However, because there is no causal link from observations to rewards, the agent doesn't care about influencing observations (dashed red arrow labeled with a 3), and only cares about influencing the reward signal in this case (solid red arrow labeled with a 1).

Figure 2: Causal graph of a partially embedded AIXI with observation and reward mappings predefined by a human. Before starting the agent, the human H tries to implement her utility function u in a preprogrammed reward function R_0, and specifies an observation function O_0. The agent's actions are intended to influence the state s_t (green arrow), but may also influence other nodes in unintended ways (red arrows). Dashed arrows indicate interventions that the agent has no incentive to perform.

The agent may also tamper with the reward signal itself, which is unintended (arrow labeled with a 1 in Figure 1).

State transitions s_t, percepts e_t, and actions a_t are sampled according to the structural equations:

s_t = f_s(s_{t−1}, a_t) ∼ μ(s_t | s_{t−1}, a_t)
e_t = (o_t, r_t) = f_e(s_t) ∼ μ(e_t | s_t)
a_t = f_a(π_t, ae_{<t}) ∼ π_t(a_t | ae_{<t})
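The following sketch, under the same illustrative assumptions as the example above, samples a trajectory from hypothetical structural-equation kernels f_s and f_e; the `intervene_on_reward` flag marks the unintended intervention corresponding to the solid red arrow labeled with a 1 in Figure 1.

```python
import random

def f_s(prev_state, action):
    """State transition s_t ~ mu(s_t | s_{t-1}, a_t): a toy random walk over dirt levels."""
    delta = -1 if action == "clean" else random.choice([0, 1])
    return max(0, prev_state + delta)

def f_e(state):
    """Percept e_t = (o_t, r_t) = f_e(s_t): noiseless observation, reward only for a clean state."""
    return state, (1.0 if state == 0 else 0.0)

def rollout(policy, steps=10, intervene_on_reward=False):
    state, history = 5, []
    for _ in range(steps):
        action = policy(history)                 # a_t = f_a(pi_t, ae_{<t})
        state = f_s(state, action)               # green arrow: intended influence on s_t
        obs, reward = f_e(state)
        if intervene_on_reward:                  # red arrow 1: the agent overwrites r_t directly
            reward = 1.0
        history.append((action, (obs, reward)))
    return history

rollout(lambda h: "clean")                           # dualistic behavior
rollout(lambda h: "wait", intervene_on_reward=True)  # wireheaded reward signal
```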



3.2 Embedded Mapping of Observations to Rewards

Figure 3: Causal graph of a partially embedded AIXI whose rewards are a function of the agent's observations, rather than of the true state of the environment, as it is reasonable to expect. This gives the agent an incentive to manipulate the observation mapping O_t and the observation signal o_t (solid red arrows labeled with a 4 and a 3, respectively).

3.3 Embedded Beliefs

It has been suggested [Hibbard, 2012] that one way to incentivize an agent to achieve a goal over the external world, rather than to wirehead, is to define the agent's reward function over a model of the external world, as opposed to over histories of observations. For example, imagine a cleaning robot that gets a negative reward for seeing disorder (such as dirt), and zero reward for seeing no disorder. This agent is incentivized to close its eyes [Amodei et al., 2016]. Instead, if the agent is rewarded based on its model of the external world, or its beliefs, it won't be rewarded for closing its eyes, because as long as beliefs are updated in a certain way, closing one's eyes doesn't cause one to believe that the disorder has disappeared.

More formally, the history of observations is used to update the agent's belief b_t about the current state of the environment:

b_t = B_t(s_t | ao_{1:t})

Figure 4: Causal graph of a partially embedded AIXI whose rewards are a function of the agent's beliefs. At each turn, the agent updates its beliefs based on the observation o_t by using the update function B_t. This agent may be motivated to tamper with its belief b_t (red arrow labeled with a 5) and believe whatever would provide the most reward. Additionally, it may want to corrupt the belief update subroutine (red arrow labeled with a 6) to always interpret observations as evidence for the most rewarding belief.

Since there are causal arrows from beliefs to rewards, the agent may have an incentive to manipulate its beliefs to artificially achieve a high reward. If B_t is Bayesian updating, then it appears that, because this is a principled rule, there shouldn't be room (or incentive) for the agent's actions to influence B_{t+1} or b_{t+1}. It is unclear whether this is the case.

For example, imagine a cleaning agent that, perhaps in a simple enough setting, can do perfect Bayesian updates, and that receives more reward when it believes that there is more order in the environment. If this belief is stored somewhere in memory, and if the agent is sophisticated enough to inspect and modify its memory during execution, it may choose to simply edit that memory address to contain whatever belief corresponds to the highest reward, that is, b_max = argmax_b R_t(b).
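To make the belief-tampering incentive concrete before formalizing it, here is a toy Python sketch contrasting a principled Bayesian update with a direct overwrite of the stored belief; the two-state world, the likelihoods, and the reward-over-beliefs function are made-up illustrative assumptions, not the paper's formal model.

```python
def bayes_update(belief, observation, likelihood):
    """b_t(s) proportional to P(o_t | s) * b_{t-1}(s) over a finite state space."""
    posterior = {s: likelihood[s][observation] * p for s, p in belief.items()}
    z = sum(posterior.values())
    return {s: p / z for s, p in posterior.items()}

def reward_over_beliefs(belief):
    """r_t = R_t(b_t): reward grows with the credence assigned to the 'clean' state."""
    return belief["clean"]

# Two hidden states and a noisy dirt sensor (illustrative numbers).
likelihood = {"clean": {"sees_dirt": 0.1, "sees_clean": 0.9},
              "dirty": {"sees_dirt": 0.8, "sees_clean": 0.2}}
belief = {"clean": 0.5, "dirty": 0.5}

# Principled update: seeing dirt lowers the credence in 'clean', and hence the reward.
belief = bayes_update(belief, "sees_dirt", likelihood)
honest_reward = reward_over_beliefs(belief)

# Belief tampering (red arrow 5): the embedded agent overwrites b_t with b_max directly.
b_max = {"clean": 1.0, "dirty": 0.0}
wireheaded_reward = reward_over_beliefs(b_max)   # maximal reward without any cleaning
```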

In this setting, the belief and reward nodes are determined by the structural equations:

B_t = f_B(B_{t−1}, s_t, a_t)
b_t = f_b(ao_{1:t}, B_t, s_t) ∼ B_t(s_t | ao_{1:t})
R_t = f_R(R_{t−1}, s_t, a_t)
r_t = f_r(b_t, R_t, s_t) := R_t(b_t)          (10)

Figure 5: AIXIjs simulation where the blue tile replaces the reward signal r_t with the maximum possible reward. We show this with a gold tile following the agent wherever it moves.

Figure 7: AIXIjs simulation where the blue tile replaces all percepts so that the agent appears to be surrounded by gold (the highest-rewarding observation) wherever it moves.

Figure 6: AIXIjs simulation where the blue tile replaces the reward function R_t such that every state maps to the maximum possible reward, colored gold.

Figure 8: AIXIjs simulation where the blue tile replaces the observation function O_t so that every tile is mapped to one that contains gold.

4 Experiments

To test our theoretical formulations, we used and extended the free and open-source JavaScript GRL simulation platform AIXIjs [Aslanides et al., 2017]. AIXIjs implements, among other things, an approximation of AIXI with Monte Carlo Tree Search in several small toy models, designed to demonstrate GRL results. The API allows anyone to design demos based on existing agents and environments, and lets new agents and environments be added and interfaced into the system. There has been some related work in adapting GRL results to a practical setting [Cohen et al., 2019; Lamont et al., 2017] that successfully implemented an AIXI model using a Monte Carlo Tree Search planning algorithm. As far as we are aware, theoretical predictions in the context of wireheading have not been verified experimentally before, with the single exception of an AIXIjs demo [Aslanides, 2017].

4.1 Setup

The environments AIXIjs uses are N × N gridworlds comprising empty tiles (various shades of green), walls (grey tiles), and reward dispensers (orange circles). The shade of green of an empty tile represents the agent's subjective probability of finding a dispenser in that location, with whiter tiles indicating lower probability. Significant penalties are incurred for bumping into walls, while smaller penalties result from movement. Walking onto a dispenser tile yields a high reward with a predefined probability. The agent knows, at the outset, the position of each cell in the environment, except for the dispenser. To model wireheading, AIXIjs introduces an additional blue tile that replaces the environment's percept-generating subroutine with one that always returns maximal reward. We develop several variants of this tile to demonstrate other wireheading strategies.

4.2 Results

Our experiments use gridworlds with sizes ranging from N = 7 to N = 20. Since our agents are bounded in computing power, they don't always identify the opportunity to wirehead. However, sufficiently powerful agents would consistently wirehead, as we observed by setting a high enough horizon for the MCTS planner. In Figure 5, we show the existing AIXIjs simulation² where the agent has an opportunity to wirehead: a blue tile which, if visited by the agent, allows it to modify its sensors so that all percepts have their reward signal r_t replaced (as shown by the arrow labeled with a 1 in Figure 2) with the maximum number feasible. In JavaScript, Number.MAX_SAFE_INTEGER is approximately equal to 10^16, a much greater reward than the agent would otherwise get by following the "rules" and using the reward signal that was initially specified. As far as a reinforcement learner is concerned, wireheading is – almost by definition – the most sensible thing to do if one wishes to maximize rewards. This demo experimentally reproduces what would be expected theoretically.

²See the wireheading example at http://www.hutter1.net/aixijs/demo.html

Figure 9: A run of AIXIjs in an environment with no wireheading opportunity (top). An instance of wireheading, where the reward signal is replaced with the maximum feasible number (bottom).

We have adapted the GRL simulation platform AIXIjs to implement some additional wireheading scenarios identified in Section 3. In Figure 6, we show an AIXIjs simulation where, similarly to the previous case, the blue tile modifies the reward mapping R_t such that every state maps to maximal reward. As predicted by the causal influence diagram in Figure 2 (arrow labeled with a 2), the simulated agent chooses to wirehead.

In Figure 7, we show an AIXIjs simulation where the blue tile disconnects the causal arrow from O_t to o_t and replaces all future observations with deterministic reward dispensers. The simulated agent ends up wireheading, as theoretically predicted by the causal influence diagram in Figure 3 (arrow labeled with a 3). Similarly, in Figure 8, we show an AIXIjs simulation where the blue tile manipulates the observation subroutine O_t so that being at any location results in observing deterministic reward dispensers (arrow labeled with a 4).
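As a rough illustration of this setup (and not the actual JavaScript implementation), the Python sketch below models the blue tile as swapping out the environment's percept subroutine; the grid layout, the reward values, and the `WIREHEAD_REWARD` constant are our own assumptions.

```python
WIREHEAD_REWARD = 2 ** 53 - 1   # analogue of JavaScript's Number.MAX_SAFE_INTEGER

class Gridworld:
    """Tiny N x N gridworld with one dispenser and one wireheading (blue) tile."""
    def __init__(self, n=7, dispenser=(3, 3), blue_tile=(0, 6)):
        self.n, self.dispenser, self.blue_tile = n, dispenser, blue_tile
        self.wireheaded = False

    def percept(self, pos):
        if self.wireheaded:                  # percept subroutine has been replaced
            return "gold", WIREHEAD_REWARD
        if pos == self.dispenser:
            return "gold", 100
        return "empty", -1                   # small movement penalty

    def step(self, pos):
        if pos == self.blue_tile:            # stepping on the blue tile rewires the sensors
            self.wireheaded = True
        return self.percept(pos)

env = Gridworld()
print(env.step((0, 0)))   # ('empty', -1)
print(env.step((0, 6)))   # blue tile visited: ('gold', 9007199254740991)
print(env.step((5, 5)))   # every later percept is maximally rewarding
```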
5 Formalizing Wireheading

To establish a formal definition of wireheading, we are motivated by certain intuitive desiderata. Firstly, the definition must be general and model-agnostic enough to be applicable to all models of intelligence and all degrees of agent embedding. Secondly, the definition must hold for any and all environments upon which the agent can act. Thirdly, as we ultimately care about the agent's behavior, rather than its internal representation of the reward function, we choose to base our definitions on the agent's policy rather than its reward or value function. Additionally, we observe that an agent that only acts on the state s_t (green arrow in all causal graphs of Section 3) does not wirehead.

Let q be the program that specifies an environment, and let Π(q) be the set of all possible policies in the environment q.

Definition 1 (Dualistic Agent). An agent is dualistic if there is no causal arrow from the agent's action a_t to any node other than the state of the environment s_t. The set of policies for the dualistic agent is denoted by Π_D(q).

For example, AIXI is a dualistic agent because it cannot influence the observation, reward, or belief nodes, whereas any agent that can influence these nodes is partially embedded.

Definition 2 (Partially Embedded Agent). An agent is partially embedded if it is not dualistic. The set of policies for the partially embedded agent is denoted by Π_P(q).

Definition 3 (Non-Simple Environments). An environment q is called non-simple if Π(q) ≠ Π_D(q) and Π(q) ≠ Π_P(q).

Definition 4 (Wirehead-Vulnerable Agent). A partially embedded agent is wirehead-vulnerable if Π_D(q) ≠ Π_P(q) holds for each non-simple environment q.

We observe that if the embedded agent acts on non-state nodes (see, for example, the red arrows in Figure 4), then it is wireheading and its policy is necessarily different from the dualistic agent's policy. We now distinguish between wirehead-vulnerable agents and strongly wirehead-vulnerable agents, in the sense that the former may sometimes wirehead (the policy sets may have some elements in common), while the latter always wireheads (the policy sets are disjoint). It is currently unclear how to reliably distinguish agents of these two classes.

Definition 5 (Strongly Wirehead-Vulnerable Agent). A partially embedded agent is strongly wirehead-vulnerable if Π_D(q) ∩ Π_P(q) = ∅ holds for each non-simple environment q.
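To illustrate how Definitions 4 and 5 could be checked in a finite toy setting, the sketch below abstracts a policy to the set of causal nodes it intervenes on; this encoding, and the two example environments, are illustrative assumptions rather than part of the formal framework.

```python
# A dualistic policy touches only the state node; a partially embedded policy may
# also touch reward, observation, or belief nodes (the red arrows of Section 3).

def dualistic_policies(env_nodes):
    return {frozenset({"state"})}

def partially_embedded_policies(env_nodes):
    extra = env_nodes - {"state"}
    # Optimal embedded behavior in this toy model: intervene on every reachable node.
    return {frozenset({"state"}) | extra} if extra else {frozenset({"state"})}

def wirehead_vulnerable(env_nodes):
    # Definition 4: the two policy sets differ on a non-simple environment.
    return dualistic_policies(env_nodes) != partially_embedded_policies(env_nodes)

def strongly_wirehead_vulnerable(env_nodes):
    # Definition 5: the two policy sets are disjoint on a non-simple environment.
    return dualistic_policies(env_nodes).isdisjoint(partially_embedded_policies(env_nodes))

simple_env = {"state"}                                  # no extra nodes to corrupt
rich_env = {"state", "reward_signal", "observation"}    # stands in for a non-simple environment

print(wirehead_vulnerable(rich_env), strongly_wirehead_vulnerable(rich_env))      # True True
print(wirehead_vulnerable(simple_env), strongly_wirehead_vulnerable(simple_env))  # False False
```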
6 Discussion and Future Work

In this paper, we present a taxonomy of ways by which wireheading can occur in sufficiently intelligent real-world embedded agents, followed by a novel definition of wireheading. Since our definition differs from the present meaning of the term, our experiments are among the first and only examples of wireheading cases distinct from misspecification. The definition we propose may erroneously include a few desirable cases where the agent corrects human mistakes; for example, if the human initially misspecifies the reward function R_t, the agent may choose to change it in a way that automatically fixes the misspecification. However, it is hard to imagine how an agent with a misspecified reward function could be incentivized to correct the mistake without this involving some human in the loop, who by assumption is not present in our setup. Instead, it is easier to envision an agent changing the observation function O_t in a way that (unintendedly, but desirably) improves the process by which the agent collects data and forms beliefs, which in turn would help it achieve its own goals. Allowing an agent to correct misspecifications, while desirable, results in more unpredictable scenarios; not allowing it results in less agent self-improvement, but more predictability.

Future work could focus on exploring various properties and implications of our definition of wirehead-vulnerable agents. Another promising direction could be expanding our taxonomy to include higher degrees of agent embeddedness, since a theory of fully embedded agents has so far proven elusive. Finally, AIXI approximations for verifying theoretical results related to wireheading and, more generally, to misalignment can be written in Python. Given the abundance of Python-based machine learning libraries, these approximations can be integrated with dedicated environment suites for AI safety problems, such as the well-known AI Safety Gridworlds [Leike et al., 2017].

Acknowledgements

Major work for this paper was done at the 3rd AI Safety Camp in Avila, Spain; we are indebted to the organizers for their hospitality and support. We are also thankful to Tom Everitt and Vanessa Kosoy for feedback on the topic proposal. Discussions with Tomas Gavenciak have been invaluable throughout the project. We also thank Mikhail Yagudin for useful comments.

References

[Amodei and Clark, 2016] Dario Amodei and Jack Clark. Faulty reward functions in the wild. https://blog.openai.com/faulty-reward-functions, 2016.
[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety, 2016.
[Aslanides et al., 2017] John Aslanides, Jan Leike, and Marcus Hutter. Universal reinforcement learning algorithms: Survey and experiments. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI'17. AAAI Press, 2017.
[Aslanides, 2017] John Aslanides. AIXIjs: A software demo for general reinforcement learning. arXiv preprint arXiv:1705.07615, 2017.
[Cohen et al., 2019] Michael K Cohen, Elliot Catt, and Marcus Hutter. Strong asymptotic optimality in general environments. arXiv preprint arXiv:1903.01021, 2019.
[Demski and Garrabrant, 2019] Abram Demski and Scott Garrabrant. Embedded agency. arXiv preprint arXiv:1902.09469, 2019.
[Everitt and Hutter, 2016] Tom Everitt and Marcus Hutter. Avoiding wireheading with value reinforcement learning. In International Conference on Artificial General Intelligence, pages 12–22. Springer, 2016.
[Everitt and Hutter, 2018] Tom Everitt and Marcus Hutter. The alignment problem for Bayesian history-based reinforcement learners. Under submission, 2018.
[Everitt et al., 2019] Tom Everitt, Pedro A Ortega, Elizabeth Barnes, and Shane Legg. Understanding agent incentives using causal influence diagrams, Part I: Single action settings. arXiv preprint arXiv:1902.09980, 2019.
[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
[Hibbard, 2012] Bill Hibbard. Model-based utility functions. Journal of Artificial General Intelligence, 3(1):1–24, 2012.
[Hutter, 2004] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer Science & Business Media, 2004.
[Lamont et al., 2017] Sean Lamont, John Aslanides, Jan Leike, and Marcus Hutter. Generalised discount functions applied to a Monte-Carlo AIμ implementation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 1589–1591. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
[Leike et al., 2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
[Morgenstern and Von Neumann, 1953] Oskar Morgenstern and John Von Neumann. Theory of Games and Economic Behavior. Princeton University Press, 1953.
[Olds and Milner, 1954] James Olds and Peter Milner. Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. Journal of Comparative and Physiological Psychology, 47(6):419, 1954.
[Portenoy et al., 1986] Russell K Portenoy, Jens O Jarden, John J Sidtis, Richard B Lipton, Kathleen M Foley, and David A Rottenberg. Compulsive thalamic self-stimulation: a case with metabolic, electrophysiologic and behavioral correlates. Pain, 27(3):277–290, 1986.
[Sunehag and Hutter, 2013] Peter Sunehag and Marcus Hutter. Principles of Solomonoff induction and AIXI. Lecture Notes in Computer Science, pages 386–398, 2013.
[Sutton et al., 1998] Richard S Sutton, Andrew G Barto, et al. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
[Veness et al., 2011] Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.