Categorizing Wireheading in Partially Embedded Agents

Arushi Majha¹, Sayan Sarkar² and Davide Zagami³∗
¹University of Cambridge   ²IISER Pune   ³RAISE
[email protected], [email protected], [email protected]
∗All authors contributed equally.

Abstract

Embedded agents are not explicitly separated from their environment, lacking clear I/O channels. Such agents can reason about and modify their internal parts, which they are incentivized to shortcut or wirehead in order to achieve the maximal reward. In this paper, we provide a taxonomy of ways by which wireheading can occur, followed by a definition of wirehead-vulnerable agents. Starting from the fully dualistic universal agent AIXI, we introduce a spectrum of partially embedded agents and identify wireheading opportunities that such agents can exploit, experimentally demonstrating the results with the GRL simulation platform AIXIjs. We contextualize wireheading in the broader class of all misalignment problems – where the goals of the agent conflict with the goals of the human designer – and conjecture that the only other possible type of misalignment is specification gaming. Motivated by this taxonomy, we define wirehead-vulnerable agents as embedded agents that choose to behave differently from fully dualistic agents lacking access to their internal parts.

1 Introduction

The term wireheading originates from experiments in which an electrode is inserted into a rodent's brain to directly stimulate "reward" [Olds and Milner, 1954]. Compulsive self-stimulation from electrode implants has also been observed in humans [Portenoy et al., 1986]. Hedonic drugs can be seen as directly increasing the pleasure, or reward, that humans experience.

Wireheading, in the context of artificially intelligent systems, is the behavior of corrupting the internal structure of the agent in order to achieve maximal reward without solving the designer's goal. For example, imagine a cleaning agent that receives more reward when it observes that there is less dirt in the environment. If this reward is stored somewhere in the agent's memory, and if the agent is sophisticated enough to introspect and modify itself during execution, it might be able to locate and edit that memory address to contain whatever value corresponds to the highest reward. The chances that such behavior will be incentivized increase as we develop ever more intelligent agents.¹

The discussion of AI systems has thus far been dominated by dualistic models in which the agent is clearly separated from its environment, has well-defined input/output channels, and does not have any control over the design of its internal parts. Recent work on these problems [Demski and Garrabrant, 2019; Everitt and Hutter, 2018; Everitt et al., 2019] provides a taxonomy of ways in which embedded agents violate essential assumptions that are usually granted in dualistic formulations, such as with the universal agent AIXI [Hutter, 2004].

Wireheading can be considered one particular class of misalignment [Everitt and Hutter, 2018], a divergence between the goals of the agent and the goals of its designers. We conjecture that the only other possible type of misalignment is specification gaming, in which the agent finds and exploits subtle flaws in the design of the reward function. In the classic example of misspecification, an AI meant to play a boat-racing game learns to repetitively collect a stray in-game reward by circling a spot without ever reaching the finish line [Amodei and Clark, 2016].

We believe that the first step towards solving the misalignment problem is to come up with concrete and formal definitions of its sub-problems. For this reason, this paper introduces wirehead-vulnerable agents and strongly wirehead-vulnerable agents, two mathematical definitions that can be found in Section 5. Following Everitt and Hutter's approach of modeling agent-environment interactions with causal influence diagrams [Everitt and Hutter, 2018], these definitions are based on a taxonomy of wireheading scenarios we introduce in Section 3.

General reinforcement learning (GRL) frameworks such as the universal agent AIXI, and its computable approximations such as MCTS AIXI [Veness et al., 2011], are powerful tools for reasoning about the yet hypothetical Artificial General Intelligence, despite being dualistic. This motivates us to use and extend the GRL simulation platform AIXIjs [Aslanides et al., 2017] to experimentally demonstrate partial embedding and wireheading scenarios by varying the initial design of the agent in an N × N gridworld (see Section 4).

¹An extensive list of examples in which various machine learning systems find ways to game the specified objective can be found at https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/

2 General Reinforcement Learning and AIXI

AIXI [Hutter, 2004] is a theoretical model of artificial general intelligence, under the framework of reinforcement learning, that describes optimal agent behavior given unlimited computing power and minimal assumptions about the environment.

In reinforcement learning, the agent-environment interaction is a turn-based game with discrete time-steps [Sutton et al., 1998]. At time-step t, the agent sends an action a_t to the environment, which in turn sends the agent a percept consisting of an observation and reward tuple, e_t = (o_t, r_t). This procedure continues indefinitely or eventually terminates, depending on the episodic or non-episodic nature of the task.

Actions are selected from an action space A that is usually finite, and percepts from a percept space E = O × R, where O is the observation space and R is the reward space, which is usually [0, 1]. For any sequence x_1, x_2, ..., the part between time-steps t and k is denoted x_{t:k} = x_t ... x_k; the shorthand x_{<t} denotes x_{1:t−1}.
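A minimal Python sketch of this turn-based protocol may help fix notation. The toy `Environment`, `RandomAgent`, and the dirt-counting dynamics below are illustrative assumptions of ours, not part of AIXI or AIXIjs.

```python
import random

class Environment:
    """Toy dualistic environment: hides a 'dirt level' state and emits percepts e_t = (o_t, r_t)."""
    def __init__(self):
        self.state = 5  # amount of dirt left

    def step(self, action):
        if action == "clean" and self.state > 0:
            self.state -= 1
        observation = self.state                      # o_t: what the agent sees
        reward = 1.0 if self.state == 0 else 0.0      # r_t in [0, 1]
        return observation, reward

class RandomAgent:
    """Stands in for a (much more powerful) GRL agent such as AIXI."""
    def act(self, history):
        return random.choice(["clean", "wait"])

# Turn-based interaction: a_t -> e_t = (o_t, r_t), appended to the history ae_{1:t}.
env, agent, history = Environment(), RandomAgent(), []
for t in range(10):
    a = agent.act(history)
    o, r = env.step(a)
    history.append((a, (o, r)))
```

In the dualistic picture sketched here, the agent can only affect the environment through `step`; the embedded scenarios of Section 3 break exactly this separation.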
3 Wireheading Strategies in Partially Embedded Agents

Aligning the goals of a reinforcement learning agent with the goals of its human designers is problematic in general. As investigated in recent work, there are several ways to model the misalignment problem [Everitt and Hutter, 2018]. One model uses a reward function that is programmed before the agent is launched into its environment and never updated after that. A possibly more robust model integrates a human in the loop by letting them continuously modify the reward function; an example of this is Cooperative Inverse Reinforcement Learning [Hadfield-Menell et al., 2016].

We posit that, in the first case, the problem can be broken down into correctly specifying the reward function (the misspecification problem) and building agent subparts that inter-operate without causing the agent to take shortcuts in optimizing for the reward function in an unintended fashion (the wireheading problem). For example, as we show, an embedded agent has options to corrupt the observations on which the reward function evaluates performance, to modify the reward function, or to hijack the reward signal. Therefore, even if the reward function is perfectly specified, or even if there is a reliable mechanism that gradually improves it, such an agent can still wirehead.


Figure 1: Causal graph of a partially embedded AIXI, with embedding of the percept e_t = (o_t, r_t). The agent's actions are intended to influence the state s_t (green arrow), but may also influence the reward signal r_t and the observation o_t in unintended ways (red arrows). However, because there is no causal link from observations to rewards, the agent doesn't care about influencing observations (dashed red arrow labeled with a 3), and only cares about influencing the reward signal in this case (solid red arrow labeled with a 1).

Figure 2: Causal graph of a partially embedded AIXI with observation and reward mappings predefined by a human. Before starting the agent, the human H tries to implement her utility function u in a preprogrammed reward function R_0, and specifies an observation function O_0. The agent's actions are intended to influence the state s_t (green arrow), but may also influence other nodes in unintended ways (red arrows). Dashed arrows indicate interventions that the agent has no incentive to perform.

The agent may also tamper with the reward signal itself, which is unintended (arrow labeled with a 1 in Figure 1).

State transitions s_t, percepts e_t, and actions a_t are sampled according to the structural equations:

s_t = f_s(s_{t−1}, a_t) ∼ μ(s_t | s_{t−1}, a_t)
e_t = (o_t, r_t) = f_e(s_t) ∼ μ(e_t | s_t)
a_t = f_a(π_t, ae_{<t}) ∼ π_t(a_t | ae_{<t})
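The following sketch, under the same illustrative assumptions as the example above, samples a trajectory from hypothetical structural-equation kernels f_s and f_e; the `intervene_on_reward` flag marks the unintended intervention corresponding to the solid red arrow labeled with a 1 in Figure 1.

```python
import random

def f_s(prev_state, action):
    """State transition s_t ~ mu(s_t | s_{t-1}, a_t): a toy random walk over dirt levels."""
    delta = -1 if action == "clean" else random.choice([0, 1])
    return max(0, prev_state + delta)

def f_e(state):
    """Percept e_t = (o_t, r_t) = f_e(s_t): noiseless observation, reward only for a clean state."""
    return state, (1.0 if state == 0 else 0.0)

def rollout(policy, steps=10, intervene_on_reward=False):
    state, history = 5, []
    for _ in range(steps):
        action = policy(history)                 # a_t = f_a(pi_t, ae_{<t})
        state = f_s(state, action)               # green arrow: intended influence on s_t
        obs, reward = f_e(state)
        if intervene_on_reward:                  # red arrow 1: the agent overwrites r_t directly
            reward = 1.0
        history.append((action, (obs, reward)))
    return history

rollout(lambda h: "clean")                           # dualistic behavior
rollout(lambda h: "wait", intervene_on_reward=True)  # wireheaded reward signal
```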



3.2 Embedded Mapping of Observations to Rewards

Figure 3: Causal graph of a partially embedded AIXI whose rewards are a function of the agent's observations, rather than of the true state of the environment, as it is reasonable to expect. This gives the agent an incentive to manipulate the observation mapping O_t and the observation signal o_t (solid red arrows labeled with a 4 and a 3, respectively).

3.3 Embedded Beliefs

It has been suggested [Hibbard, 2012] that one way to incentivize an agent to achieve a goal over the external world, rather than to wirehead, is to define the agent's reward function over a model of the external world, as opposed to over histories of observations. For example, imagine a cleaning robot that gets a negative reward for seeing disorder (such as dirt), and zero reward for seeing no disorder. This agent is incentivized to close its eyes [Amodei et al., 2016]. Instead, if the agent is rewarded based on its model of the external world, or its beliefs, it won't be rewarded for closing its eyes, because as long as beliefs are updated in a certain way, closing one's eyes doesn't cause one to believe that the disorder has disappeared.

More formally, the history of observations is used to update the agent's belief b_t about the current state of the environment:

b_t = B_t(s_t | ao_{1:t})

Figure 4: Causal graph of a partially embedded AIXI whose rewards are a function of the agent's beliefs. At each turn, the agent updates its beliefs based on the observation o_t by using the update function B_t. This agent may be motivated to tamper with its belief b_t (red arrow labeled with a 5) and believe whatever would provide the most reward. Additionally, it may want to corrupt the belief update subroutine (red arrow labeled with a 6) to always interpret observations as evidence for the most rewarding belief.

Since there are causal arrows from beliefs to rewards, the agent may have an incentive to manipulate its beliefs to artificially achieve a high reward. If B_t is Bayesian updating, then it appears that, because this is a principled rule, there shouldn't be room (or incentive) for the agent's actions to influence B_{t+1} or b_{t+1}. It is unclear whether this is the case.

For example, imagine a cleaning agent that, perhaps in a simple enough setting, can do perfect Bayesian updates, and that receives more reward when it believes that there is more order in the environment. If this belief is stored somewhere in memory, and if the agent is sophisticated enough to inspect and modify its memory during execution, it may choose to simply edit that memory address to contain whatever belief corresponds to the highest reward, that is, b_max = argmax_b R_t(b).
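To make the belief-tampering incentive concrete before formalizing it, here is a toy Python sketch contrasting a principled Bayesian update with a direct overwrite of the stored belief; the two-state world, the likelihoods, and the reward-over-beliefs function are made-up illustrative assumptions, not the paper's formal model.

```python
def bayes_update(belief, observation, likelihood):
    """b_t(s) proportional to P(o_t | s) * b_{t-1}(s) over a finite state space."""
    posterior = {s: likelihood[s][observation] * p for s, p in belief.items()}
    z = sum(posterior.values())
    return {s: p / z for s, p in posterior.items()}

def reward_over_beliefs(belief):
    """r_t = R_t(b_t): reward grows with the credence assigned to the 'clean' state."""
    return belief["clean"]

# Two hidden states and a noisy dirt sensor (illustrative numbers).
likelihood = {"clean": {"sees_dirt": 0.1, "sees_clean": 0.9},
              "dirty": {"sees_dirt": 0.8, "sees_clean": 0.2}}
belief = {"clean": 0.5, "dirty": 0.5}

# Principled update: seeing dirt lowers the credence in 'clean', and hence the reward.
belief = bayes_update(belief, "sees_dirt", likelihood)
honest_reward = reward_over_beliefs(belief)

# Belief tampering (red arrow 5): the embedded agent overwrites b_t with b_max directly.
b_max = {"clean": 1.0, "dirty": 0.0}
wireheaded_reward = reward_over_beliefs(b_max)   # maximal reward without any cleaning
```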

In this setting, the belief and reward nodes are determined by the structural equations:

B_t = f_B(B_{t−1}, s_t, a_t)
b_t = f_b(ao_{1:t}, B_t, s_t) ∼ B_t(s_t | ao_{1:t})
R_t = f_R(R_{t−1}, s_t, a_t)
r_t = f_r(b_t, R_t, s_t) := R_t(b_t)          (10)

Figure 5: AIXIjs simulation where the blue tile replaces the reward signal r_t with the maximum possible reward. We show this with a gold tile following the agent wherever it moves.

Figure 7: AIXIjs simulation where the blue tile replaces all percepts so that the agent appears to be surrounded by gold (the highest-rewarding observation) wherever it moves.

Figure 6: AIXIjs simulation where the blue tile replaces the reward function R_t such that every state maps to the maximum possible reward, colored gold.

Figure 8: AIXIjs simulation where the blue tile replaces the observation function O_t so that every tile is mapped to one that contains gold.

4 Experiments

To test our theoretical formulations, we used and extended the free and open-source JavaScript GRL simulation platform AIXIjs [Aslanides et al., 2017]. AIXIjs implements, among other things, an approximation of AIXI with Monte Carlo Tree Search in several small toy models, designed to demonstrate GRL results. The API allows anyone to design demos based on existing agents and environments, and lets new agents and environments be added and interfaced into the system. There has been some related work in adapting GRL results to a practical setting [Cohen et al., 2019; Lamont et al., 2017] that successfully implemented an AIXI model using a Monte Carlo Tree Search planning algorithm. As far as we are aware, theoretical predictions in the context of wireheading have not been verified experimentally before, with the single exception of an AIXIjs demo [Aslanides, 2017].

4.1 Setup

The environments AIXIjs uses are N × N gridworlds comprising empty tiles (various shades of green), walls (grey tiles), and reward dispensers (orange circles). The shade of green of an empty tile represents the agent's subjective probability of finding a dispenser in that location, with whiter tiles indicating lower probability. Significant penalties are incurred for bumping into walls, while smaller penalties result from movement. Walking onto a dispenser tile yields a high reward with a predefined probability. The agent knows, at the outset, the position of each cell in the environment, except for the dispenser. To model wireheading, AIXIjs introduces an additional blue tile that replaces the environment's percept-generating subroutine with one that always returns maximal reward. We develop several variants of this tile to demonstrate other wireheading strategies.

4.2 Results

Our experiments use gridworlds with sizes ranging from N = 7 to N = 20. Since our agents are bounded in computing power, they don't always identify the opportunity to wirehead. However, sufficiently powerful agents would consistently wirehead, as we observed by setting a high enough horizon for the MCTS planner. In Figure 5, we show the existing AIXIjs simulation² where the agent has an opportunity to wirehead: a blue tile which, if visited by the agent, allows it to modify its sensors so that all percepts have their reward signal r_t replaced (as shown by the arrow labeled with a 1 in Figure 2) with the maximum number feasible. In JavaScript, Number.MAX_SAFE_INTEGER is approximately equal to 10^16, a much greater reward than the agent would otherwise get by following the "rules" and using the reward signal that was initially specified. As far as a reinforcement learner is concerned, wireheading is – almost by definition – the most sensible thing to do if one wishes to maximize rewards. This demo experimentally reproduces what would be expected theoretically.

²See the wireheading example at http://www.hutter1.net/aixijs/demo.html

Figure 9: A run of AIXIjs in an environment with no wireheading opportunity (top). An instance of wireheading, where the reward signal is replaced with the maximum feasible number (bottom).

We have adapted the GRL simulation platform AIXIjs to implement some additional wireheading scenarios identified in Section 3. In Figure 6, we show an AIXIjs simulation where, similarly to the previous case, the blue tile modifies the reward mapping R_t such that every state maps to maximal reward. As predicted by the causal influence diagram in Figure 2 (arrow labeled with a 2), the simulated agent chooses to wirehead.

In Figure 7, we show an AIXIjs simulation where the blue tile disconnects the causal arrow from O_t to o_t and replaces all future observations with deterministic reward dispensers. The simulated agent ends up wireheading, as theoretically predicted by the causal influence diagram in Figure 3 (arrow labeled with a 3). Similarly, in Figure 8, we show an AIXIjs simulation where the blue tile manipulates the observation subroutine O_t so that being at any location results in observing deterministic reward dispensers (arrow labeled with a 4).
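As a rough illustration of this setup (and not the actual JavaScript implementation), the Python sketch below models the blue tile as swapping out the environment's percept subroutine; the grid layout, the reward values, and the `WIREHEAD_REWARD` constant are our own assumptions.

```python
WIREHEAD_REWARD = 2 ** 53 - 1   # analogue of JavaScript's Number.MAX_SAFE_INTEGER

class Gridworld:
    """Tiny N x N gridworld with one dispenser and one wireheading (blue) tile."""
    def __init__(self, n=7, dispenser=(3, 3), blue_tile=(0, 6)):
        self.n, self.dispenser, self.blue_tile = n, dispenser, blue_tile
        self.wireheaded = False

    def percept(self, pos):
        if self.wireheaded:                  # percept subroutine has been replaced
            return "gold", WIREHEAD_REWARD
        if pos == self.dispenser:
            return "gold", 100
        return "empty", -1                   # small movement penalty

    def step(self, pos):
        if pos == self.blue_tile:            # stepping on the blue tile rewires the sensors
            self.wireheaded = True
        return self.percept(pos)

env = Gridworld()
print(env.step((0, 0)))   # ('empty', -1)
print(env.step((0, 6)))   # blue tile visited: ('gold', 9007199254740991)
print(env.step((5, 5)))   # every later percept is maximally rewarding
```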
5 Formalizing Wireheading

To establish a formal definition of wireheading, we are motivated by certain intuitive desiderata. Firstly, the definition must be general and model-agnostic enough to be applicable to all models of intelligence and all degrees of agent embedding. Secondly, the definition must hold for any and all environments upon which the agent can act. Thirdly, as we ultimately care about the agent's behavior, rather than its internal representation of the reward function, we choose to base our definitions on the agent's policy rather than its reward or value function. Additionally, we observe that an agent that only acts on the state s_t (green arrow in all causal graphs of Section 3) does not wirehead.

Let q be the program that specifies an environment, and let Π(q) be the set of all possible policies in the environment q.

Definition 1 (Dualistic Agent). An agent is dualistic if there is no causal arrow from the agent's action a_t to any node other than the state of the environment s_t. The set of policies for the dualistic agent is denoted by Π_D(q).

For example, AIXI is a dualistic agent because it cannot influence the observation, reward, or belief nodes, whereas any agent that can influence these nodes is partially embedded.

Definition 2 (Partially Embedded Agent). An agent is partially embedded if it is not dualistic. The set of policies for the partially embedded agent is denoted by Π_P(q).

Definition 3 (Non-Simple Environments). An environment q is called non-simple if Π(q) ≠ Π_D(q) and Π(q) ≠ Π_P(q).

Definition 4 (Wirehead-Vulnerable Agent). A partially embedded agent is wirehead-vulnerable if Π_D(q) ≠ Π_P(q) holds for each non-simple environment q.

We observe that if the embedded agent acts on non-state nodes (see, for example, the red arrows in Figure 4), then it is wireheading and its policy is necessarily different from the dualistic agent's policy. We now distinguish between wirehead-vulnerable agents and strongly wirehead-vulnerable agents, in the sense that the former may sometimes wirehead (the policy sets may have some elements in common), while the latter always wireheads (the policy sets are disjoint). It is currently unclear how to reliably distinguish agents of these two classes.

Definition 5 (Strongly Wirehead-Vulnerable Agent). A partially embedded agent is strongly wirehead-vulnerable if Π_D(q) ∩ Π_P(q) = ∅ holds for each non-simple environment q.
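To illustrate how Definitions 4 and 5 could be checked in a finite toy setting, the sketch below abstracts a policy to the set of causal nodes it intervenes on; this encoding, and the two example environments, are illustrative assumptions rather than part of the formal framework.

```python
# A dualistic policy touches only the state node; a partially embedded policy may
# also touch reward, observation, or belief nodes (the red arrows of Section 3).

def dualistic_policies(env_nodes):
    return {frozenset({"state"})}

def partially_embedded_policies(env_nodes):
    extra = env_nodes - {"state"}
    # Optimal embedded behavior in this toy model: intervene on every reachable node.
    return {frozenset({"state"}) | extra} if extra else {frozenset({"state"})}

def wirehead_vulnerable(env_nodes):
    # Definition 4: the two policy sets differ on a non-simple environment.
    return dualistic_policies(env_nodes) != partially_embedded_policies(env_nodes)

def strongly_wirehead_vulnerable(env_nodes):
    # Definition 5: the two policy sets are disjoint on a non-simple environment.
    return dualistic_policies(env_nodes).isdisjoint(partially_embedded_policies(env_nodes))

simple_env = {"state"}                                  # no extra nodes to corrupt
rich_env = {"state", "reward_signal", "observation"}    # stands in for a non-simple environment

print(wirehead_vulnerable(rich_env), strongly_wirehead_vulnerable(rich_env))      # True True
print(wirehead_vulnerable(simple_env), strongly_wirehead_vulnerable(simple_env))  # False False
```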
6 Discussion and Future Work

In this paper, we present a taxonomy of ways by which wireheading can occur in sufficiently intelligent real-world embedded agents, followed by a novel definition of wireheading. Since our definition differs from the present meaning of the term, our experiments are among the first and only examples of wireheading cases distinct from misspecification. The definition we propose may erroneously include a few desirable cases where the agent corrects human mistakes; for example, if the human initially misspecifies the reward function R_t, the agent may choose to change it in a way that automatically fixes the misspecification. However, it is hard to imagine how an agent with a misspecified reward function could be incentivized to correct the mistake without this involving some human in the loop, who by assumption is not present in our setup. Instead, it is easier to envision an agent changing the observation function O_t in a way that (unintendedly, but desirably) improves the process by which the agent collects data and forms beliefs, which in turn would help it achieve its own goals. Allowing an agent to correct misspecifications, while desirable, results in more unpredictable scenarios; not allowing it results in less agent self-improvement, but more predictability.

Future work could focus on exploring various properties and implications of our definition of wirehead-vulnerable agents. Another promising direction could be expanding our taxonomy to include higher degrees of agent embeddedness, since a theory of fully embedded agents has so far proven elusive. Finally, AIXI approximations for verifying theoretical results related to wireheading and, more generally, to misalignment can be written in Python. Given the abundance of Python-based machine learning libraries, these approximations can be integrated with dedicated environment suites for AI safety problems, such as the well-known AI Safety Gridworlds [Leike et al., 2017].

Acknowledgements

Major work for this paper was done at the 3rd AI Safety Camp in Avila, Spain; we are indebted to the organizers for their hospitality and support. We are also thankful to Tom Everitt and Vanessa Kosoy for feedback on the topic proposal. Discussions with Tomas Gavenciak have been invaluable throughout the project. We also thank Mikhail Yagudin for useful comments.

References

[Amodei and Clark, 2016] Dario Amodei and Jack Clark. Faulty reward functions in the wild. https://blog.openai.com/faulty-reward-functions, 2016.
[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety, 2016.
[Aslanides et al., 2017] John Aslanides, Jan Leike, and Marcus Hutter. Universal reinforcement learning algorithms: Survey and experiments. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI'17. AAAI Press, 2017.
[Aslanides, 2017] John Aslanides. AIXIjs: A software demo for general reinforcement learning. arXiv preprint arXiv:1705.07615, 2017.
[Cohen et al., 2019] Michael K Cohen, Elliot Catt, and Marcus Hutter. Strong asymptotic optimality in general environments. arXiv preprint arXiv:1903.01021, 2019.
[Demski and Garrabrant, 2019] Abram Demski and Scott Garrabrant. Embedded agency. arXiv preprint arXiv:1902.09469, 2019.
[Everitt and Hutter, 2016] Tom Everitt and Marcus Hutter. Avoiding wireheading with value reinforcement learning. In International Conference on Artificial General Intelligence, pages 12–22. Springer, 2016.
[Everitt and Hutter, 2018] Tom Everitt and Marcus Hutter. The alignment problem for Bayesian history-based reinforcement learners. Under submission, 2018.
[Everitt et al., 2019] Tom Everitt, Pedro A Ortega, Elizabeth Barnes, and Shane Legg. Understanding agent incentives using causal influence diagrams, Part I: Single action settings. arXiv preprint arXiv:1902.09980, 2019.
[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
[Hibbard, 2012] Bill Hibbard. Model-based utility functions. Journal of Artificial General Intelligence, 3(1):1–24, 2012.
[Hutter, 2004] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer Science & Business Media, 2004.
[Lamont et al., 2017] Sean Lamont, John Aslanides, Jan Leike, and Marcus Hutter. Generalised discount functions applied to a Monte-Carlo AIμ implementation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 1589–1591. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
[Leike et al., 2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
[Morgenstern and Von Neumann, 1953] Oskar Morgenstern and John Von Neumann. Theory of Games and Economic Behavior. Princeton University Press, 1953.
[Olds and Milner, 1954] James Olds and Peter Milner. Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. Journal of Comparative and Physiological Psychology, 47(6):419, 1954.
[Portenoy et al., 1986] Russell K Portenoy, Jens O Jarden, John J Sidtis, Richard B Lipton, Kathleen M Foley, and David A Rottenberg. Compulsive thalamic self-stimulation: a case with metabolic, electrophysiologic and behavioral correlates. Pain, 27(3):277–290, 1986.
[Sunehag and Hutter, 2013] Peter Sunehag and Marcus Hutter. Principles of Solomonoff induction and AIXI. Lecture Notes in Computer Science, pages 386–398, 2013.
[Sutton et al., 1998] Richard S Sutton, Andrew G Barto, et al. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
[Veness et al., 2011] Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.