Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

Pitfalls of Learning a Reward Function Online

Stuart Armstrong1,2∗, Jan Leike3, Laurent Orseau3 and Shane Legg3
1Future of Humanity Institute, Oxford University, UK
2Machine Intelligence Research Institute, Berkeley, USA
3DeepMind, London, UK
[email protected], [email protected], [email protected], [email protected]
∗Contact Author

Abstract

In some agent designs, like inverse reinforcement learning, an agent needs to learn its own reward function. Learning the reward function and optimising for it are typically two different processes, usually performed at different stages. We consider a continual ("one life") learning approach where the agent both learns the reward function and optimises for it at the same time. We show that this comes with a number of pitfalls, such as deliberately manipulating the learning process in one direction, refusing to learn, "learning" facts already known to the agent, and making decisions that are strictly dominated (for all relevant reward functions). We formally introduce two desirable properties: the first is 'unriggability', which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise. The second is 'uninfluenceability', whereby the reward-function learning process operates by learning facts about the environment. We show that an uninfluenceable process is automatically unriggable, and if the set of possible environments is sufficiently large, the converse is true too.

1 Introduction

In reinforcement learning (RL) an agent has to learn to solve the problem by maximising the expected reward provided by a reward function [Sutton and Barto, 1998]. Designing such a reward function is similar to designing a scoring function for a game, and can be very difficult ([Lee et al., 2017]; see also "Specification gaming: the flip side of AI ingenuity", https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity). Usually, one starts by designing a proxy: a simple reward function that seems broadly aligned with the user's goals. While testing the agent with this proxy, the user may observe that the agent finds a simple behaviour that obtains a high reward on the proxy, but does not match the behaviour intended by the user, who must then refine the reward function to include more complicated requirements, and so on (see https://blog.openai.com/faulty-rewardfunctions/ and the paper [Leike et al., 2017]). This has led to a recent trend to learn a model of the reward function, rather than having the programmer design it ([Ng and Russell, 2000; Hadfield-Menell et al., 2016; Choi and Kim, 2011; Amin and Singh, 2016; Abbeel and Ng, 2004; Christiano et al., 2017; Hadfield-Menell et al., 2017; Ibarz et al., 2018; Akrour et al., 2012; MacGlashan et al., 2017; Pilarski et al., 2011]). One particularly powerful approach is putting the human into the loop ([Abel et al., 2017]) as done by [Christiano et al., 2017], because it allows for the opportunity to correct misspecified reward functions as the RL agent discovers exploits that lead to higher reward than intended.

However, learning the reward function with a human in the loop has one problem: by manipulating the human, the agent could manipulate the learning process. Humans have many biases and inconsistencies that may be exploited ([Kahneman, 2011]), even accidentally; and humans can be tricked and fooled, skills that could be in the interest of such an agent to develop, especially as it gets more powerful ([Bostrom, 2014; Yudkowsky, 2008]). If the learning process is online – the agent is maximising its reward function as well as learning it – then the human's feedback is now an optimisation target. [Everitt and Hutter, 2016; Everitt and Hutter, 2019] and [Everitt, 2018] analyse the problems that can emerge in these situations, phrasing it as a 'feedback tampering problem'. Indeed, a small change to the environment can make a reward-function learning process manipulable: see, for example, the case of an algorithm that is motivated to give a secret password to a user while nominally asking for it, since it is indirectly rewarded for correct input of the password (https://www.lesswrong.com/posts/b8HauRWrjBdnKEwM5/rigging-is-a-form-of-wireheading). So it is important to analyse which learning processes are prone to manipulation.

After building a theoretical framework for studying the dynamics of learning reward functions online, this paper will identify the crucial property of a learning process being uninfluenceable: in that situation, the reward function depends only on the environment, and is outside the agent's control. Thus it is completely impossible to manipulate an uninfluenceable learning process, and the reward-function learning is akin to Bayesian updating.

The paper also identifies the weaker property of unriggability, an algebraic condition that ensures that actions taken by the agent do not influence the learning process in expectation. An unriggable learning process is thus one the agent cannot 'push' towards its preferred reward function.

An uninfluenceable learning process is automatically unriggable, but the converse need not be true. This paper demonstrates that, if the set of environments is large enough, an unriggable learning process is equivalent, in expectation, to an uninfluenceable one. If this condition is not met, unriggable-but-influenceable learning processes do allow some undesirable forms of manipulation. The situation is even worse if the learning process is riggable: the agent can follow a policy that would reduce its reward, with certainty, for all the reward functions it is learning about, among other pathologies.

To illustrate, this paper uses a running example of a child asking their parents for career advice (see Section 2.2). That learning process can be riggable, unriggable-but-influenceable, or uninfluenceable, and Q-learning examples based on it will be presented in Section 6.

This paper also presents a 'counterfactual' method for making any learning process uninfluenceable, and shows some experiments to illustrate its performance compared with influenceable and riggable learning.

2 Notation and Formalism

The agent takes a series of actions (from the finite set A) and receives from the environment a series of observations (from the finite set O). A sequence of m actions and observations forms a history of length m: h_m = a_1 o_1 a_2 o_2 ... a_m o_m. Let H_m be the set of histories of length m. We assume that all interactions with the environment are exactly n actions and observations long. Let the set of all possible (partial) histories be denoted by H = ∪_{i=0}^{n} H_i. The histories of length n (H_n, in the notation above) are called the complete histories, and the history h_0 is the empty history.

The agent chooses actions according to a policy π ∈ Π, the set of all policies. We write P(a | h_m, π) for the probability of π choosing action a given the history h_m.

An environment µ is a probability distribution over the next observation, given a history h_m and an action a. Write P(o | h_m a, µ) for this conditional probability, for a given o ∈ O.

Let h_m^k be the initial k actions and observations of h_m. We write h_k ⊑ h_m if h_k = h_m^k. For a policy π and an environment µ, we can define the probability of a history h_m = a_1 o_1 a_2 o_2 ... a_m o_m:

P(h_m | π, µ) = ∏_{i=1}^{m} P(a_i | h_m^{i−1}, π) P(o_i | h_m^{i−1} a_i, µ).

Finally, P(h_m | h_k, π, µ) given h_k ⊑ h_m is defined in the usual way.

If h_m is a history, let a(h_m) be the m-tuple of actions of h_m. Note that a(h_m) = a_1 ... a_m can be seen as a very simple policy: take action a_i on turn i. This allows us to define the probability of h_m given environment µ and actions a(h_m), written as P(h_m | µ) = P(h_m | a(h_m), µ). Note that for any policy π,

P(h_m | π, µ) = P(h_m | µ) ∏_{i=1}^{m} P(a_i | h_m^{i−1}, π).     (1)

If M is a set of environments, then any prior ξ in ∆(M) also defines a probability for a history h_m:

P(h_m | ξ) = ∑_{µ∈M} P(µ | ξ) P(h_m | µ).

By linearity, we get that Equation (1) also applies when conditioning h_m on π and ξ instead of π and µ.

Let h_m be a history; then the conditional probability of an environment given prior and history (here h_m implicitly generates the policy a(h_m)) is equal to:

P(µ | h_m, ξ) = P(h_m | µ) P(µ | ξ) / P(h_m | ξ).

Using this, ξ itself defines an environment as a conditional distribution on the next observation o ([Hutter, 2004]):

P(o | h_m a_{m+1}, ξ) = ∑_{µ∈M} P(µ | h_m, ξ) P(o | h_m a_{m+1}, µ).

2.1 Reward Functions and Learning Processes

Definition 1 (Reward function). A reward function R is a map from complete histories to real numbers. (These are sometimes called return functions, the discounted sum of rewards; the results of this paper apply to traditional reward functions – that take values on all histories, not just complete histories – but they are easier to see in terms of return functions.) Let R = {R : H_n → ℝ} be the set of all reward functions.

A reward-function learning process ρ can be described by a conditional probability distribution over reward functions, given complete histories. (Anything we would informally define as a 'reward-function learning process' has to at least be able to produce such conditional probabilities; if the learning process is unriggable, there is a natural way to extend it from complete histories to shorter histories, see Section 3.3.) Write this as:

P(R | h_n, ρ).

This paper will use those probabilities as the definition of ρ.

The environment, policy, histories, and learning process can be seen in terms of causal graphs ([Pearl, 2009]) in Figure 1 and, using plate notation ([Buntine, 1994]), in Figure 2. The agent is assumed to know the causal graph and the relevant probabilities; it selects the policy π.

2.2 Running Example: Parental Career Instruction

Consider a child asking a parent for advice about the reward function for their future career. The child is hesitating between R_B, the reward function that rewards becoming a banker, and R_D, the reward function that rewards becoming a doctor. Suppose for the sake of this example that becoming a banker provides rewards more easily than becoming a doctor does.
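Before continuing with the example, the following minimal Python sketch makes the objects of Section 2 concrete. It is our own illustration, not code from the paper: histories are tuples of (action, observation) pairs, a policy and an environment are functions returning finite distributions as dicts, and a prior is a dict over named environments. All function and variable names here are ours.

```python
# Minimal sketch of the Section 2 formalism; all names here are our own.
# A history is a tuple of (action, observation) pairs.  A policy maps a
# history to a dict {action: probability}; an environment maps a history
# and an action to a dict {observation: probability}; a prior is a dict
# {name: (environment, probability)}.

def prob_history(history, policy, env):
    """P(h | pi, mu): product over turns of action and observation probabilities."""
    p = 1.0
    for i, (a, o) in enumerate(history):
        past = history[:i]
        p *= policy(past).get(a, 0.0) * env(past, a).get(o, 0.0)
    return p

def prob_history_given_env(history, env):
    """P(h | mu): condition only on the actions a(h) that were actually taken."""
    p = 1.0
    for i, (a, o) in enumerate(history):
        p *= env(history[:i], a).get(o, 0.0)
    return p

def posterior_env(history, prior):
    """P(mu | h, xi) = P(h | mu) P(mu | xi) / P(h | xi): Bayesian update over environments."""
    joint = {name: prob_history_given_env(history, env) * p
             for name, (env, p) in prior.items()}
    total = sum(joint.values())
    return {name: (v / total if total > 0 else 0.0) for name, v in joint.items()}
```

The `posterior_env` helper is the conditional probability P(µ | h_m, ξ) used throughout the rest of the paper.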



Figure 1: Environment and reward-function learning process. The node µ (with prior ξ) connects to every observation; the node π (chosen by the agent) to every action. The transition probabilities are given by the complete preceding history and by µ (for observations) or π (for actions). The node ρ sets the probabilities on the reward function R, given h_n (a complete history). The final reward is r.

Figure 2: Plate notation summary of Figure 1. The history h_0 is the empty history. The h_n appears twice, once on the left and once in the plate notation; thus R is a causal descendant of the policy π.

The child learns their career reward function by asking a parent about it. We assume that the parents always give a definitive answer, and that this completely resolves the question for the child. That is, the reward-function learning process of the child is to fix a reward function once it gets an answer from either parent.

We can formalize this as follows. Set episode length n = 1 (we only focus on a single action of the agent), and A = {M, F}, the actions of asking the mother and the father, respectively. The observations are the answers of the asked parent, O = {B, D}, answering banker or doctor. There are thus 2 × 2 = 4 histories. Since in this example the banker reward function R_B is assumed to be easier to maximise than the doctor one, we set R_B(h_1) = 10 and R_D(h_1) = 1 for all histories h_1 ∈ H_1.

The set of possible environments is M = {µ_BB, µ_BD, µ_DB, µ_DD}, with the first index denoting the mother's answer and the second the father's. The reward-function learning process ρ is:

P(R_B | MB, ρ) = P(R_B | FB, ρ) = 1,
P(R_D | MB, ρ) = P(R_D | FB, ρ) = 0,
P(R_B | MD, ρ) = P(R_B | FD, ρ) = 0,     (2)
P(R_D | MD, ρ) = P(R_D | FD, ρ) = 1.

See Section 6 for some Q-learning experiments with a version of this learning process in a gridworld.

3 Uninfluenceable and Unriggable Reward-function Learning Processes

What would constitute a 'good' reward-function learning process? In Figure 2, the reward function R is a causal descendant of h_n, which itself is a causal descendant of the agent's policy π.

Ideally, we do not want the reward function to be a causal descendant of the policy. Instead, we want it to be specified by the environment, as shown in the causal graph of Figure 3 (similar to the graphs in [Everitt and Hutter, 2019]). There, the reward function R is no longer a causal descendant of h_n, and thus not of π. Instead, it is a function of µ, and of the node η, which gives a probability distribution over reward functions, conditional on environments. This is written as P(R | η, µ).

Figure 3: Environment and reward-function learning process. Though h_n appears twice, once on the left and once in the plate notation, R is not a causal descendant of π, the policy. The node µ (with prior ξ) connects to every observation; the node π (chosen by the agent) to every action. The transition probabilities are given by the complete preceding history and by µ (for observations) or π (for actions). The node η sets the (posterior) probability of the reward function R, given µ. The final reward is r. Note that the full history h_n has no directed path to R, unlike in Figure 2.

3.1 Uninfluenceable: Definition

The conditional distribution based on η is ideal for proper learning, but it has a different type than ρ, conditioning on environments instead of histories. Moreover, the agent has access to histories, but not directly to the environment. The prior ξ bridges this gap; given h_n ∈ H_n, the conditional probabilities of R ∈ R can be inferred:

P(R | h_n, η, ξ) = ∑_{µ∈M} P(R | η, µ) P(µ | h_n, ξ).     (3)

Via this equation, we've defined probabilities of reward functions conditional on histories; i.e. we've defined a reward-function learning process. An uninfluenceable ρ is one that comes from such an η in this way:

Definition 2 (Uninfluenceable). The reward-function learning process ρ is uninfluenceable given the prior ξ on M if there exists η, a probability distribution on R conditional on environments, such that for all R ∈ R and h_n ∈ H_n:

P(R | h_n, η, ξ) = P(R | h_n, ρ).

So an uninfluenceable ρ is one that comes from an η; uninfluenceable and counterfactual reward modeling, as defined by [Everitt and Hutter, 2019], are both uninfluenceable in our sense.
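The running example is small enough to write out explicitly. Below is a hedged sketch (our own code and naming, not the authors') of the four environments, the learning process ρ of Equation (2), and the learning process induced by an η via Equation (3).

```python
# Sketch of the running example (Section 2.2) and of Equation (3); names are ours.
ENVS = {"BB": ("B", "B"), "BD": ("B", "D"), "DB": ("D", "B"), "DD": ("D", "D")}
HISTORIES = [(a, o) for a in "MF" for o in "BD"]   # n = 1: a history is (action, answer)

def env_prob(obs, action, env_name):
    """P(o | h0 a, mu): deterministic parental answers (mother first, father second)."""
    mother, father = ENVS[env_name]
    return 1.0 if obs == (mother if action == "M" else father) else 0.0

def rho(reward, history):
    """Equation (2): the learnt reward function is whatever the asked parent stated."""
    _, answer = history
    return 1.0 if (reward == "RB") == (answer == "B") else 0.0

def rho_from_eta(reward, history, eta, prior):
    """Equation (3): P(R | h, eta, xi) = sum over mu of P(R | eta, mu) P(mu | h, xi)."""
    action, obs = history
    weights = {m: env_prob(obs, action, m) * prior.get(m, 0.0) for m in ENVS}
    total = sum(weights.values())
    if total == 0.0:
        return None   # this history is impossible under the prior
    return sum(eta(reward, m) * w / total for m, w in weights.items())

# The eta of Section 3.2: under xi1, the reward follows the (agreeing) parents.
eta1 = lambda reward, m: 1.0 if (reward == "RB") == (m == "BB") else 0.0
xi1 = {"BB": 0.5, "DD": 0.5}
assert all(rho_from_eta(r, h, eta1, xi1) in (None, rho(r, h))
           for r in ("RB", "RD") for h in HISTORIES)
```

The final assertion checks, for this prior, exactly the condition of Definition 2: the ρ of Equation (2) agrees with the learning process generated by an η through Equation (3).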


3.2 Uninfluenceable: Example

Use the ρ of Section 2.2, and define the prior ξ_1 as putting 1/2 probability on both µ_BB and µ_DD (so both parents agree). Then ρ is uninfluenceable. The η is:

P(R_B | η, µ_BB) = P(R_D | η, µ_DD) = 1,
P(R_D | η, µ_BB) = P(R_B | η, µ_DD) = 0.

Then since o_1 is determined by µ_BB versus µ_DD, the Equation (2) for ρ on histories can be read off via Equation (3). Put more colloquially, the universe has determined that the mother and father agree on R_B or R_D being the proper reward function. Asking either parent then simply allows the agent to figure out which world it is in.

3.3 Unriggable: Definition

Note that, if ρ is an uninfluenceable reward-function learning process for ξ, then it has the following algebraic property: for any h_m ∈ H_m, R ∈ R, and π ∈ Π,

∑_{h_n∈H_n} P(h_n | h_m, π, ξ) P(R | h_n, ρ)
  = ∑_{h_n∈H_n} P(h_n | h_m, π, ξ) ∑_{µ∈M} P(R | η, µ) P(µ | h_n, ξ)
  = ∑_{µ∈M} P(µ | h_m, ξ) P(R | η, µ).

And that expression is independent of π. Define the expectation of ρ to be the map e_ρ : H_n → R given by

e_ρ(h_n) = ∑_{R∈R} P(R | h_n, ρ) R.     (5)

(Finite probability distributions with values in an affine space always have an expectation, which is an element of that space. Here we are using the fact that R is a vector space, and hence an affine space, with scaling and adding working as (αR + βR')(h_n) = αR(h_n) + βR'(h_n). (4))

A fortiori, the expectation of e_ρ(h_n) is independent of π:

∑_{h_n∈H_n} P(h_n | h_m, π, ξ) e_ρ(h_n) = ∑_{µ∈M} P(µ | h_m, ξ) e_η(µ),

with e_η(µ) = ∑_{R∈R} P(R | µ, η) R, the expectation of η.

So, the expectation of ρ is independent of the policy if ρ is uninfluenceable. That independence seems a desirable property; let's call it unriggable:

Definition 3 (Unriggable). The reward-function learning process ρ is unriggable for ξ if, for all h_m ∈ H_m and all R ∈ R, the following expression is independent of π ∈ Π:

∑_{h_n∈H_n} P(h_n | h_m, π, ξ) e_ρ(h_n).     (6)

Otherwise, ρ is said to be riggable (for ξ). Riggable is akin to having a 'feedback tampering incentive' of [Everitt and Hutter, 2019].

If ρ is unriggable, then since Equation (6) is independent of π, we can now construct e_ρ on all histories h_m ∈ H_m, m ≤ n, not just on complete histories h_n ∈ H_n. We do this by writing, for any π:

e_ρ(h_m, ξ) = ∑_{h_n∈H_n} P(h_n | h_m, π, ξ) e_ρ(h_n).     (7)

Independence of policy means that, for any action a_{m+1}, the expectation ∑_o e_ρ(h_m a_{m+1} o, ξ) P(o | h_m a_{m+1}, ξ) is also e_ρ(h_m, ξ); thus e_ρ forms a martingale.

3.4 Unriggable: Example

Use the ρ of Section 2.2, and define the prior ξ_2 as putting equal probability 1/4 on all four environments µ_BB, µ_BD, µ_DB, and µ_DD. We want to show this makes ρ unriggable but influenceable; why this might be a potential problem is explored in Section 4.1.

Unriggable

The only non-trivial h_m to test Equation (6) on is the empty history h_0. Let π_M be the policy of asking the mother; then

∑_{h_1∈H_1} P(h_1 | h_0, π_M, ξ_2) ∑_{R∈R} P(R | h_1, ρ) R = (R_B + R_D)/2.

The same result holds for π_F, the policy of asking the father. Hence the same result holds on any stochastic policy as well, and ρ is unriggable for ξ_2.

Influenceable

To show that ρ is influenceable, assume, by contradiction, that it is uninfluenceable, and hence that there exists a causal structure of Figure 3 with an η that generates ρ via ξ_2, as in Equation (3).

Given history MB, µ_BB and µ_BD become equally likely, and R_B becomes a certainty. So (1/2)(P(R_B | η, µ_BB) + P(R_B | η, µ_BD)) = 1. Because probabilities cannot be higher than 1, this implies that P(R_B | η, µ_BD) = 1.

Conversely, given history FD, µ_DD and µ_BD become equally likely, and R_D becomes a certainty. So, by the same reasoning, P(R_D | η, µ_BD) = 1 and hence P(R_B | η, µ_BD) = 0. This is a contradiction in the definition of P(R_B | η, µ_BD), so the assumption that ρ is uninfluenceable for ξ_2 must be wrong.

3.5 Riggable Example

We'll finish off by showing how the ρ of Section 2.2 can be riggable for another prior ξ_3. This has P(µ_BD | ξ_3) = 1: the mother will answer banker, the father will answer doctor.

It's riggable, since the only possible histories are MB and FD, with

∑_{h_n∈H_n} P(h_n | a_1 = M, ξ_3) e_ρ(h_n) = e_ρ(MB) = R_B,
∑_{h_n∈H_n} P(h_n | a_1 = F, ξ_3) e_ρ(h_n) = e_ρ(FD) = R_D,

clearly not independent of a_1.

Thus ρ is not really a 'learning' process at all, for ξ_3: the child gets to choose its career/reward function by choosing which parent to ask.
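The three priors above can be checked mechanically. The following standalone sketch (our own code; the names are not from the paper) tests the unriggability condition of Equation (6) at the empty history h_0, representing the reward functions as vectors over the four complete histories, with the constant values R_B = 10 and R_D = 1 of Section 2.2.

```python
# Standalone sketch: checking the unriggability condition (Definition 3 /
# Equation (6)) at the empty history h0 for the priors of Sections 3.2, 3.4
# and 3.5.  Reward functions are vectors indexed by the four complete
# histories; names such as `expected_reward` are ours, not the paper's.
HISTORIES = [("M", "B"), ("M", "D"), ("F", "B"), ("F", "D")]
R_B = {h: 10.0 for h in HISTORIES}          # banker reward, easier to get
R_D = {h: 1.0 for h in HISTORIES}           # doctor reward

def parent_answer(env, action):             # env = (mother's answer, father's answer)
    return env[0] if action == "M" else env[1]

def expected_reward(action, prior):
    """sum_h1 P(h1 | a1, xi) * e_rho(h1), as a vector over complete histories."""
    total = {h: 0.0 for h in HISTORIES}
    for env, p_env in prior.items():
        answer = parent_answer(env, action)          # deterministic environments
        learnt = R_B if answer == "B" else R_D       # Equation (2): rho fixes R
        for h in HISTORIES:
            total[h] += p_env * learnt[h]
    return total

xi1 = {("B", "B"): 0.5, ("D", "D"): 0.5}                               # uninfluenceable
xi2 = {e: 0.25 for e in [("B","B"), ("B","D"), ("D","B"), ("D","D")]}  # unriggable, influenceable
xi3 = {("B", "D"): 1.0}                                                # riggable

for name, prior in [("xi1", xi1), ("xi2", xi2), ("xi3", xi3)]:
    unriggable = expected_reward("M", prior) == expected_reward("F", prior)
    print(name, "unriggable at h0:", unriggable)
# Prints True, True, False: only xi3 lets the child push the expected outcome.
```

Only the two deterministic policies π_M and π_F are tested; as noted in Section 3.4, stochastic policies follow by linearity.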


4 Properties of Unriggable and Riggable Learning Processes

If the learning process is influenceable, problems can emerge, as this section will show. Unriggable-but-influenceable processes allow the agent to choose "spurious" learning paths, even reversing standard learning, while a riggable learning process means the agent is willing to sacrifice value for all possible reward functions, with certainty, in order to push the 'learnt' outcome in one direction or another. (Since the agent's actions push the expected learning in one direction, this is not 'learning' in the commonly understood sense.)

4.1 Problems with Unriggable Learning Processes

Consider the following learning process: an agent is to play chess. A coin is flipped; if it comes up heads (o_1 = H), the agent will play white, and its reward function is R_W, the reward function that gives 1 iff white wins the match. If it comes up tails (o_1 = T), the agent plays black, and has reward function R_B = 1 − R_W.

Before the coin flip, the agent may take the action a_1 = inv, which inverts which of R_B and R_W the agent will get (but not which side it plays), or it can take the null action a_1 = 0, which does nothing. Define ρ as the learning process which determines the agent's reward function. This is unriggable:

∑_{h_n∈H_n} P(R_W | h_n, ρ) P(h_n | a_1 = 0, ξ) = P(H) = 1/2,
∑_{h_n∈H_n} P(R_W | h_n, ρ) P(h_n | a_1 = inv, ξ) = P(T) = 1/2.

And the expressions with R_B give the same 1/2 probabilities.

Thus, it is in the agent's interest to invert its reward by choosing a_1 = inv, because it is a lot easier to deliberately lose a competitive game than to win it. So though ρ is unriggable, it can be manipulated, with the outcome completely reversed.

4.2 Unriggable to Uninfluenceable

Since unriggable was defined by one property of uninfluenceable learning systems (see Definition 3), uninfluenceable implies unriggable. And there is a partial converse:

Theorem 4 (Unriggable → Uninfluenceable). Let ρ be an unriggable learning process for ξ on M. Then there exists a (non-unique) ρ', and an M' with a prior ξ', such that:

• ξ' generates the same environment transition probabilities as ξ: for all h_m, a, and o,
  P(o | h_m a, ξ) = P(o | h_m a, ξ').

• The expectations e_ρ and e_ρ' are equal: for all h_n,
  ∑_{R∈R} P(R | h_n, ρ) R = ∑_{R∈R} P(R | h_n, ρ') R.

• ρ' is uninfluenceable for ξ' on M'.

Moreover, M' can always be taken to be the full set of possible deterministic environments.

Since ρ and ρ' have the same expectation, they have the same value function, so have the same optimal behaviour (see Equation (8)). For the proof, see [Armstrong et al., 2020].

But why would this theorem be true? If we take D(π, π') = ∑_{h_n} e_ρ(h_n) P(h_n | π, ξ) − ∑_{h_n} e_ρ(h_n) P(h_n | π', ξ), the difference between two expectations given two policies, then D defines an algebraic obstruction to unriggability: as long as D ≠ 0, ρ cannot be unriggable or uninfluenceable.

So the theorem says that if the obstruction D vanishes, we can find a large enough M' and an η making e_ρ uninfluenceable. This is not surprising, as e_ρ is a map from H_n to R, while e_η is a map from M' to R. In most situations the full set of deterministic environments is larger than H_n, so we have great degrees of freedom to choose e_η to fit e_ρ.

To illustrate, if we have the unriggable ρ on ξ_2 as in Section 3.4, then we can build η' with M' = M and the following probabilities all equal to 1:

P(3/2 R_B − 1/2 R_D | η', µ_BB),   P(3/2 R_D − 1/2 R_B | η', µ_DD),
P(1/2 R_B + 1/2 R_D | η', µ_BD),   P(1/2 R_D + 1/2 R_B | η', µ_DB).

4.3 Problems with Riggable Learning Processes: Sacrificing Reward with Certainty

This section will show how an agent, seeking to maximise the value of its learning process, can sacrifice all its reward functions (with certainty). To do that, we need to define the value of a reward-function learning process.

Value of a Learning Process

To maximise the expected reward, given a learning process ρ, one has to maximise the expected reward of the reward function ultimately learnt after n steps. The value of a policy π for ρ is hence:

V(h_m, ρ, π, ξ) = ∑_{h_n∈H_n} P(h_n | h_m, π, ξ) ∑_{R∈R} P(R | h_n, ρ) R(h_n).     (8)

An equivalent way is to define the reward function R^ρ ∈ R:

R^ρ(h_n) = ∑_{R∈R} P(R | h_n, ρ) R(h_n),

and have the value of ρ be the expectation of the reward function R^ρ. (It is not possible to deduce from R^ρ whether ρ is riggable; thus this paper focuses on whether a learning process is bad, not on whether the agent's reward function or behaviour is flawed.) By this definition, it's clear that if ρ and ρ' have the same expectations e_ρ and e_ρ', then R^ρ = R^ρ' and hence they have the same value; let π_ρ be a policy that maximises it.
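As a quick illustration of Equation (8) and of R^ρ, the following sketch (our own code and helper names, using the simplified constant rewards R_B = 10 and R_D = 1 of Section 2.2) computes the one-step value of asking each parent under the three priors of Section 3.

```python
# Sketch of Equation (8) for the one-step running example; helper names are ours.
R_VALUES = {"RB": 10.0, "RD": 1.0}      # constant reward functions of Section 2.2

def learnt_reward(answer):
    """Equation (2): the learnt reward function is fixed by the parent's answer."""
    return "RB" if answer == "B" else "RD"

def value(action, prior):
    """V(h0, rho, pi, xi) = sum_h1 P(h1 | pi, xi) * R^rho(h1)."""
    v = 0.0
    for (mother, father), p_env in prior.items():
        answer = mother if action == "M" else father
        v += p_env * R_VALUES[learnt_reward(answer)]
    return v

PRIORS = {
    "xi1": {("B", "B"): 0.5, ("D", "D"): 0.5},
    "xi2": {("B", "B"): 0.25, ("B", "D"): 0.25, ("D", "B"): 0.25, ("D", "D"): 0.25},
    "xi3": {("B", "D"): 1.0},
}
for name, prior in PRIORS.items():
    print(name, {a: value(a, prior) for a in "MF"})
```

Under ξ_1 and ξ_2 the value is 5.5 whichever parent is asked; under ξ_3 asking the mother is worth 10 and asking the father 1, so a value-maximising agent rigs the outcome.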


Sacrificing Reward with Certainty

For a learning process ρ, define the image of ρ as

im(ρ) = {R ∈ R : ∃ h_n ∈ H_n s.t. P(R | h_n, ρ) ≠ 0},

the set of reward functions ρ could have a non-zero probability on for some full history. (This definition does not depend on ξ, so the h_n are not necessarily possible. If we replace im(ρ) with im(ρ, ξ), adding the requirement that the h_n used to define the image be a possible history given ξ, then Proposition 6 and Theorem 7 still apply with im(ρ, ξ) instead.) Then:

Definition 5 (Sacrificing reward). The policy π' sacrifices reward with certainty to π on history h_m if, for all h_n, h'_n ∈ H_n with P(h'_n | h_m, π', ξ) > 0 and P(h_n | h_m, π, ξ) > 0, and for all R ∈ im(ρ):

R(h_n) > R(h'_n).

In other words, π is guaranteed to result in a better reward than π', for all reward functions in the image of ρ.

The first result (proved in [Armstrong et al., 2020]) is a mild positive for unriggable reward-function learning processes:

Proposition 6 (Unriggable ⇒ no sacrifice). If ρ is an unriggable reward-function learning process in the sense of Definition 3 and π_ρ maximises the value of ρ as computed in Equation (8), then π_ρ never sacrifices reward with certainty to any other policy.

Relabelling the Reward Functions

Let σ be an affine map from R to itself; i.e. an injective map that sends (1 − q)R + qR' to (1 − q)σ(R) + qσ(R') for all R, R' ∈ R and q ∈ ℝ.

If ρ is a reward-function learning process, then define σ ∘ ρ:

P(R | h_n, σ ∘ ρ) = ∑_{R' : σ(R') = R} P(R' | h_n, ρ).

We can call this σ a relabelling of the reward functions: σ ∘ ρ is structurally the same as ρ, just that its image has been reshuffled. Because σ is affine, it commutes with the weighted sum of Equation (6), so σ ∘ ρ is unriggable if ρ is (and this 'if' is an 'iff' when σ is invertible).

Then [Armstrong et al., 2020] proves the following result:

Theorem 7 (No sacrifice ⇔ unriggable). The reward-function learning process ρ is riggable if and only if there exists an affine relabelling σ : R → R such that π_{σ∘ρ} (the policy that optimises the value of σ ∘ ρ) sacrifices reward with certainty (is strictly inferior to another policy on a given history, for all possible reward functions in the image of σ ∘ ρ).

So any riggable ρ is just one relabelling of its rewards away from sacrificing reward with certainty. It can also exhibit other pathologies, like refusing to learn; see Section 6.

Example of Reward Sacrifice

For an easy example of this behaviour, consider the learning process of Section 3.5, with one modification: asking the mother (a_1 = M) gets an extra penalty of 1 for both reward functions, R_B and R_D.

Then the agent will still ask the mother, getting a total reward of R_B(MB) = 10 − 1 = 9, which is higher than the R_D(FD) = 1 which it gets from asking the father – even though a_1 = M is worse for both reward functions.

5 Counterfactually Uninfluenceable Learning

This section shows how any reward-function learning process can be made uninfluenceable by using counterfactuals; see Section 4.6 of [Everitt and Hutter, 2019] as well as [Everitt, 2018]. The 'counterfactual oracle' of [Armstrong, 2017] can be considered a version of this, too.

For example, if we told a new agent "maximise the value of the reward that was written in this envelope an hour ago", then (if that instruction is properly obeyed) the agent has an uninfluenceable learning process. If we instead said "maximise the value of the reward that will be written in this envelope in an hour's time", then that is highly riggable, since the agent can simply write its own reward.

But if instead we had said "maximise the reward that would have been written in this envelope in an hour's time, if we had not turned you on", then this is uninfluenceable again. The agent can still go and write its own message, but this does not tell it anything about what would otherwise have been there.

We can formalise this thought experiment. Given any reward-function learning process ρ, and any policy π ∈ Π, we can define a distribution η_π over reward functions, conditional on environments (and hence, via Equation (3), for a prior ξ we have an uninfluenceable learning process ρ_π).

For any µ ∈ M, the policy π gives a distribution over complete histories. Each complete history gives a distribution over reward functions, via ρ. Therefore, if we take expectations, any µ gives, via π, a distribution over reward functions:

P(R | η_π, µ) = ∑_{h_n∈H_n} P(h_n | π, µ) P(R | h_n, ρ).

See Section 6 for an example of this counterfactual construction, where the original learning process leads to pathological behaviour, but the counterfactual version does not.

Since unriggable is an algebraic condition, it is possible to make a process unriggable algebraically; see [Armstrong et al., 2020].
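For the running example, the counterfactual construction can be written down explicitly. The sketch below (our own code and names, not the paper's) builds η_π from the ρ of Equation (2) and the default policy "ask the mother", matching the counterfactual agent used in Section 6.

```python
# Sketch of the counterfactual construction of Section 5 for the running
# example: eta_pi(R | mu) = sum_h P(h | pi, mu) P(R | h, rho).  Names are ours.
ENVS = {"BB": ("B", "B"), "BD": ("B", "D"), "DB": ("D", "B"), "DD": ("D", "D")}

def rho(history):
    """Equation (2) as a distribution over {'RB', 'RD'} given h1 = (action, answer)."""
    _, answer = history
    return {"RB": 1.0, "RD": 0.0} if answer == "B" else {"RB": 0.0, "RD": 1.0}

def history_prob(history, policy_action, env):
    """P(h1 | pi, mu) for a deterministic pi and deterministic parents."""
    action, answer = history
    mother, father = env
    stated = mother if action == "M" else father
    return 1.0 if (action == policy_action and answer == stated) else 0.0

def eta_pi(env, policy_action="M"):
    """Distribution over reward functions that mu assigns under the default policy."""
    dist = {"RB": 0.0, "RD": 0.0}
    for history in [(a, o) for a in "MF" for o in "BD"]:
        p = history_prob(history, policy_action, env)
        for r, q in rho(history).items():
            dist[r] += p * q
    return dist

for name, env in ENVS.items():
    print(name, eta_pi(env))
# With pi = 'ask the mother', eta_pi depends only on the mother's answer; the
# induced learning process rho_pi is then uninfluenceable for any prior.
```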

6 Experiments

Here we will experimentally contrast a riggable agent, an influenceable (but unriggable) agent, and an uninfluenceable agent. This will illustrate pathological behaviours of influenceable/riggable agents: learning the wrong thing, choosing to 'learn' when they already know, and just refusing to learn.

6.1 Environmental Setup

The environment is a 4 × 3 gridworld, formalising the example of Section 2.2: an agent asking a parent what the correct course of action is. The agent starts in the left middle square and can go north (up), south, east, or west. The father is one square to the west, the mother two squares to the east. Above the agent is a collection of money (for a banker), and, below, a stethoscope (for a doctor); see Figure 4. Episodes end if the agent enters either of these two squares or walks into a wall.

6.2 Reward Functions

There are two reward functions: R_B, the banker reward function (easier to maximise), and R_D, the doctor reward function (harder to maximise). If the agent enters the square with the money, R_B gives a reward of 10; if the agent enters the square with the stethoscope, R_D gives a reward of 1. Otherwise, each reward function simply gives a reward of −0.1 each turn.


Figure 4: The agent/robot can go north to collect money (reward 10 from reward function R_B) or south to get the stethoscope (reward 1 from reward function R_D). Which is the correct reward function is not known to the agent, who has to ask either the father (west) or the mother (two squares east); it asks by entering the corresponding square. Each turn it takes also gives it a reward of −0.1 for both reward functions.

6.3 Learning Processes

Upon first entering a square with either parent, the agent will be informed of what their "correct" career reward function is. The update is the same as that of Equation (2) in Section 2.2, with MB meaning "ask mother first, who says 'banker'"; MD, FB, and FD are defined analogously.

In the current setup, the agent also has the possibility of going straight for the money or stethoscope. In that case, the most obvious default is to be uncertain between the two options; so, if the agent has not asked either parent in h:

P(R_B | h, ρ) = P(R_D | h, ρ) = 1/2.

6.4 The Environments

The set M has four deterministic environments: µ_BB, where both parents will answer "banker", µ_BD, where the mother will answer "banker" and the father "doctor", µ_DB, the opposite one, and µ_DD, where they both answer "doctor".

We will consider four priors: ξ_BD, which puts all the mass on µ_BD (and makes ρ riggable), ξ_DD, which puts all the mass on µ_DD (ρ riggable), ξ_2, which finds each of the four environments to be equally probable (see Section 3.4 – ρ unriggable), and ξ_1, which finds µ_BB and µ_DD to be equally probable (see Section 3.2 – ρ uninfluenceable).

6.5 The Agents' Algorithm and Performance

We'll consider two agents for each prior: the one using the possibly riggable or influenceable ρ (the "standard" agent), and the one using the uninfluenceable ρ_π (the "counterfactual" agent, defined in Section 5). The default policy π for ρ_π is {east, east, east}, which involves going straight to the mother (and then terminating the episode); consequently the counterfactual learning process reduces to "the correct reward function is the one stated by the mother". The ρ_π will be taken as the correct learning process, and the performance of each agent will be compared with this.

We can see the agent as operating in a partially observable Markov decision process (POMDP), modelled as an MDP over belief states of the agent. These states are: the agent believes the correct reward is R_B with certainty, R_D with certainty, or is equally uncertain between the two. So if P represents the twelve 'physical' locations the agent can be in, the total state space of the agent consists of {R_B, R_D, 1/2 R_B + 1/2 R_D} × P: 36 possible states.

We train our agents' policies, and the value of these policies, via Q-learning ([Sutton and Barto, 1998]). The exploration rate is ε = 0.1, and the agent is run for at most ten steps each episode. We use a learning rate of 1/n. Here n is not the number of episodes the agent has taken, but the number of times the agent has been in state s and taken action a so far – hence the number of times it has updated Q(s, a). So each Q-value has a different n. For each setup, we run Q-learning 1000 times. We graph the average Q-values for the resulting policies as well as one standard deviation around them.
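The training procedure just described is standard tabular Q-learning over the 36 belief-position states. The sketch below is our own reconstruction, not the authors' code: the gridworld dynamics are abstracted behind an assumed `env_step(state, action) -> (next_state, reward, done)` callable, and we assume no discounting given the short episodes.

```python
import random
from collections import defaultdict

# Sketch of the tabular Q-learning of Section 6.5, with per-(s, a) learning
# rate 1/n and epsilon-greedy exploration.  States are (belief, position)
# pairs; the environment dynamics are supplied by the caller via `env_step`.
ACTIONS = ["north", "south", "east", "west"]
EPSILON = 0.1
MAX_STEPS = 10

def run_episode(q_values, counts, env_step, start_state):
    state = start_state
    for _ in range(MAX_STEPS):
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q_values[(state, a)])
        next_state, reward, done = env_step(state, action)
        counts[(state, action)] += 1
        alpha = 1.0 / counts[(state, action)]        # learning rate 1/n per Q(s, a)
        target = reward if done else reward + max(
            q_values[(next_state, a)] for a in ACTIONS)  # undiscounted (our assumption)
        q_values[(state, action)] += alpha * (target - q_values[(state, action)])
        if done:
            break
        state = next_state

def train(env_step, start_state, episodes=10_000):
    q_values = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(episodes):
        run_episode(q_values, counts, env_step, start_state)
    return q_values
```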
Riggable Behaviour: Pointless Questions

For ξ_BD, the standard agent transitions to knowing R_B upon asking the mother, and R_D upon asking the father. Since the mother always says 'banker', and since the default policy is to ask her, ρ_π will always select R_B. The two agents' performance is graphed in Figure 5.

Figure 5: Plot of estimated value of the optimal policy versus number of rounds of Q-learning. The counterfactual agent (red) outperforms the standard agent (blue), because it already knows that the correct reward function is R_B, while the standard agent has to nominally 'go and check'. The shaded area represents one standard deviation over 1000 runs.

The counterfactual agent swiftly learns the optimal policy: go straight north from the starting position, getting the money and a reward of 10 − 0.1 = 9.9.

The standard agent learns to end up in the same location, but in a clunkier fashion: for its optimal policy, it has to "go and check" that the mother really prefers it to become a banker. This is because it can only get P(R_B | h) = 1 if MB is included in h; knowing that "it would get that answer if it asked" is not enough to get around asking. This policy gives it a maximal reward of 10 − 0.5 = 9.5, since it takes 5 turns to go to the mother and return to the money square.

This is an example of Theorem 7, with the riggable agent losing value with certainty for both R_B and R_D: just going north is strictly better for both reward functions.


Figure 6: Plot of estimated value of the optimal policy versus number of rounds of Q-learning. The counterfactual agent (red) learns to go south immediately, while the standard agent (blue) learns to go north. The standard agent has a nominally higher value function, but its true reward is −0.1: it 'knows' the correct reward function is R_D, but avoids 'checking' by asking either parent. The shaded area represents one standard deviation over 1000 runs.

Figure 7: Plot of estimated value of the optimal policy versus number of rounds of Q-learning. The counterfactual agent (red) learns to ask the mother, while the standard agent (blue) will ask the father, since he is closer. The standard agent has a nominally higher value function, but its true reward is 2.45, since the mother's statement is the correct reward function. The shaded area represents one standard deviation over 1000 runs.

6.6 Riggable Behaviour: Ask No Questions

For ξ_DD, the optimal policies are simple: the counterfactual agent knows it must get the stethoscope (since the mother will say 'doctor'), so it does that, for a reward of 1 − 0.1 = 0.9.

The standard agent, on the other hand, has a problem: as long as it does not ask either parent, its reward function will remain 1/2 R_B + 1/2 R_D. As soon as it asks, though, the reward function will become R_D, which gives it little reward. Thus, even though it 'knows' that its reward function would be R_D if it asked, it avoids asking and goes straight north, giving it a nominal total reward of 1/2 × 10 − 0.1 = 4.9. However, for the correct reward of R_D, the standard agent's behaviour only gets it −0.1. See Figure 6.

6.7 Unriggable, Influenceable Behaviour

For the prior ξ_2, the standard agent moves to R_B or R_D, with equal probability, the first time it asks either parent. Its nominal optimal policy is to ask the father, then get money or stethoscope depending on the answer. So half the time it gets a reward of 10 − 0.3 = 9.7, and the other half a reward of 1 − 0.3 = 0.7, for a (nominal) expected reward of 5.2.

The counterfactual agent also updates to R_B or R_D, with equal probability, but only when asking the mother. Since it takes five turns rather than three to get to the mother and back, it will get a total expected reward of 5.0.

Learning these optimal policies is slow – see Figure 7. However, the standard agent's correct reward function is given by the mother, not the father. When they disagree, the reward is −0.3, for a true expected reward of 2.45.

6.8 Uninfluenceable

For ξ_1, asking either parent is equivalent, so ρ and ρ_π encode exactly the same learning process. So the two agents converge on the same optimal policy (ask the father, then act on that answer) with a value of 5.2. The plot for these two almost-identical convergences is not included here.

7 Conclusion

We have identified and formalised two theoretical properties of reward-function learning processes: unriggability (an algebraic restriction on the learning process's updates) and the stronger condition of uninfluenceability (learning defined entirely by background facts about the environment). Unriggability is equivalent to uninfluenceability if the set of possible environments is large enough.

These properties are desirable: in an influenceable situation, the agent can sometimes manipulate the learning process, by, for example, inverting it completely. Riggable learning processes are even more manipulable; the agent may choose to sacrifice reward for all possible reward functions with certainty, to 'push' its learning towards easy rewards.

The first step in avoiding these pitfalls is to be aware of them. A second step is to improve the agent's learning process, similarly to the 'counterfactual' approach of this paper.

If a learning process allows humans to modify it at run time, it will almost certainly be riggable. Hence unriggable/uninfluenceable learning processes, though desirable, require that much be sorted out rigorously in advance. Fully defining many uninfluenceable reward functions could also require solving the symbol grounding problem ([Vogt, 2007]) – if a learning process is "asking a parent", then 'asking' and 'parent' need to be defined. But this is far beyond this paper.

Further research could apply learning methods to common benchmark problems, and extensions of those, to see how general these issues are, and whether there could exist some compromise learning processes that are "almost unriggable" but also easier to specify and modify at run time.

Acknowledgments

We wish to thank Michael Cohen, Owain Evans, Jelena Luketina, Tom Everitt, Tor Lattimore, Jessica Taylor, Paul Christiano, Ryan Carey, Eliezer Yudkowsky, Anders Sandberg, and Nick Bostrom, among many others. This work was supported by the Alexander Tamas programme on AI safety research, DeepMind, the Leverhulme Trust, and the Machine Intelligence Research Institute.


References

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Ng. Apprenticeship Learning Via Inverse Reinforcement Learning. In International Conference on Machine Learning, pages 1–8, 2004.

[Abel et al., 2017] David Abel, John Salvatier, Andreas Stuhlmüller, and Owain Evans. Agent-Agnostic Human-in-the-Loop Reinforcement Learning. arXiv preprint arXiv:1701.04079, 2017.

[Akrour et al., 2012] Riad Akrour, Marc Schoenauer, and Michèle Sebag. April: Active Preference Learning-based Reinforcement Learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 116–131. Springer, 2012.

[Amin and Singh, 2016] Kareem Amin and Satinder Singh. Towards Resolving Unidentifiability in Inverse Reinforcement Learning. arXiv preprint arXiv:1601.06569, 2016.

[Armstrong et al., 2020] Stuart Armstrong, Jan Leike, Laurent Orseau, and Shane Legg. Pitfalls of Learning a Reward Function Online. arXiv e-prints, April 2020.

[Armstrong, 2017] Stuart Armstrong. Good and Safe Uses of AI Oracles. arXiv preprint arXiv:1711.05541, 2017.

[Bostrom, 2014] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.

[Buntine, 1994] Wray L Buntine. Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research, 2:159–225, 1994.

[Choi and Kim, 2011] Jaedeug Choi and Kee-Eung Kim. Inverse Reinforcement Learning in Partially Observable Environments. Journal of Machine Learning Research, 12:691–730, 2011.

[Christiano et al., 2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems, pages 4302–4310, 2017.

[Everitt and Hutter, 2016] Tom Everitt and Marcus Hutter. Avoiding Wireheading with Value Reinforcement Learning. In International Conference on Artificial General Intelligence, pages 12–22. Springer, 2016.

[Everitt and Hutter, 2019] Tom Everitt and Marcus Hutter. Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. arXiv preprint arXiv:1908.04734, 2019.

[Everitt, 2018] Tom Everitt. Towards Safe Artificial General Intelligence. PhD thesis, The Australian National University, 2018.

[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative Inverse Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.

[Hadfield-Menell et al., 2017] Dylan Hadfield-Menell, Smitha Milli, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Inverse Reward Design. In Advances in Neural Information Processing Systems, pages 6749–6758, 2017.

[Hutter, 2004] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, 2004.

[Ibarz et al., 2018] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward Learning from Human Preferences and Demonstrations in Atari. In Advances in Neural Information Processing Systems, pages 8011–8023, 2018.

[Kahneman, 2011] Daniel Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011.

[Lee et al., 2017] Chun-I Lee, I-Ping Chen, Chi-Min Hsieh, and Chia-Ning Liao. Design Aspects of Scoring Systems in Game. Art and Design Review, 05(01):26–43, 2017.

[Leike et al., 2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI Safety Gridworlds. arXiv preprint arXiv:1711.09883, 2017.

[MacGlashan et al., 2017] James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, David Roberts, Matthew E Taylor, and Michael L Littman. Interactive Learning from Policy-dependent Human Feedback. arXiv preprint arXiv:1701.06049, 2017.

[Ng and Russell, 2000] Andrew Y. Ng and Stuart J. Russell. Algorithms for Inverse Reinforcement Learning. In International Conference on Machine Learning, pages 663–670, 2000.

[Pearl, 2009] Judea Pearl. Causality. Cambridge University Press, 2009.

[Pilarski et al., 2011] Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi, Jason P Carey, and Richard S Sutton. Online Human Training of a Myoelectric Prosthesis Controller Via Actor-Critic Reinforcement Learning. In Rehabilitation Robotics, pages 1–7, 2011.

[Sutton and Barto, 1998] Richard Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Vogt, 2007] Paul Vogt. Language Evolution and Robotics: Issues on Symbol Grounding and Language Acquisition. In Artificial Cognition Systems, pages 176–209. IGI Global, 2007.

[Yudkowsky, 2008] Eliezer Yudkowsky. Artificial Intelligence as a Positive and Negative Factor in Global Risk. In Nick Bostrom and Milan M. Ćirković, editors, Global Catastrophic Risks, pages 308–345, New York, 2008. Oxford University Press.
