
Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda

In The Technological Singularity: Managing the Journey. Springer. 2017

Nate Soares and Benya Fallenstein
Machine Intelligence Research Institute
{nate,benya}@intelligence.org

Contents

1 Introduction
  1.1 Why These Problems?
2 Highly Reliable Agent Designs
  2.1 Realistic World-Models
  2.2 Decision Theory
  2.3 Logical Uncertainty
  2.4 Vingean Reflection
3 Error-Tolerant Agent Designs
4 Value Specification
5 Discussion
  5.1 Toward a Formal Understanding of the Problem
  5.2 Why Start Now?

Research supported by the Machine Intelligence Research Institute (intelligence.org). The final publication is available at Springer via http://www.springer.com/us/book/9783662540312

1 Introduction

The property that has given humans a dominant advantage over other species is not strength or speed, but intelligence. If progress in artificial intelligence continues unabated, AI systems will eventually exceed humans in general reasoning ability. A system that is "superintelligent" in the sense of being "smarter than the best human brains in practically every field" could have an enormous impact upon humanity (Bostrom 2014). Just as human intelligence has allowed us to develop tools and strategies for controlling our environment, a superintelligent system would likely be capable of developing its own tools and strategies for exerting control (Muehlhauser and Salamon 2012). In light of this potential, it is essential to use caution when developing AI systems that can exceed human levels of general intelligence, or that can facilitate the creation of such systems.

Since artificial agents would not share our evolutionary history, there is no reason to expect them to be driven by human motivations such as lust for power. However, nearly all goals can be better met with more resources (Omohundro 2008). This suggests that, by default, superintelligent agents would have incentives to acquire resources currently being used by humanity. (Just as artificial agents would not automatically acquire a lust for power, they would not automatically acquire a human sense of fairness, compassion, or conservatism.) Thus, most goals would put the agent at odds with human interests, giving it incentives to deceive or manipulate its human operators and resist interventions designed to change or debug its behavior (Bostrom 2014, chap. 8).

Care must be taken to avoid constructing systems that exhibit this default behavior. In order to ensure that the development of smarter-than-human intelligence has a positive impact on the world, we must meet three formidable challenges: How can we create an agent that will reliably pursue the goals it is given? How can we formally specify beneficial goals? And how can we ensure that this agent will assist and cooperate with its programmers as they improve its design, given that mistakes in early AI systems are inevitable?

This agenda discusses technical research that is tractable today, which the authors think will make it easier to confront these three challenges in the future. Sections 2 through 4 motivate and discuss six research topics that we think are relevant to these challenges. We call a smarter-than-human system that reliably pursues beneficial goals "aligned with human interests" or simply "aligned."1 To become confident that an agent is aligned in this way, a practical implementation that merely seems to meet the challenges outlined above will not suffice. It is also important to gain a solid formal understanding of why that confidence is justified. This technical agenda argues that there is foundational research we can make progress on today that will make it easier to develop aligned systems in the future, and describes ongoing work on some of these problems.

1. A more careful wording might be "aligned with the interests of sentient beings." We would not want to benefit humans at the expense of sentient non-human animals—or (if we build them) at the expense of sentient machines.

Of the three challenges, the one giving rise to the largest number of currently tractable research questions is the challenge of finding an agent architecture that will reliably and autonomously pursue a set of objectives—that is, an architecture that can at least be aligned with some end goal. This requires theoretical knowledge of how to design agents which reason well and behave as intended even in situations never envisioned by the programmers. The problem of highly reliable agent designs is discussed in Section 2.

The challenge of developing agent designs which are tolerant of human error also gives rise to a number of tractable problems. We expect that smarter-than-human systems would by default have incentives to manipulate and deceive human operators, and that special care must be taken to develop agent architectures which avert these incentives and are otherwise tolerant of programmer error. This problem and some related open questions are discussed in Section 3.

Reliable and error-tolerant agent designs are only beneficial if the resulting agent actually pursues desirable outcomes. The difficulty of concretely specifying what is meant by "beneficial behavior" implies a need for some way to construct agents that reliably learn what to value (Bostrom 2014, chap. 12). A solution to this "value learning" problem is vital; attempts to start making progress are reviewed in Section 4.

Why work on these problems now, if smarter-than-human AI is likely to be decades away? This question is touched upon briefly below, and is discussed further in Section 5. In short, the authors believe that there are theoretical prerequisites for designing aligned smarter-than-human systems over and above what is required to design misaligned systems. We believe that research can be done today that will make it easier to address alignment concerns in the future.

1.1 Why These Problems?

This technical agenda primarily covers topics that the authors believe are tractable, uncrowded, focused, and unable to be outsourced to forerunners of the target AI system.

By tractable problems, we mean open problems that are concrete and admit immediate progress. Significant effort will ultimately be required to align real smarter-than-human systems with beneficial values, but in the absence of working designs for smarter-than-human systems, it is difficult if not impossible to begin most of that work in advance. This agenda focuses on research that can help us gain a better understanding today of the problems faced by almost any sufficiently advanced AI system. Whether practical smarter-than-human systems arise in ten years or in one hundred years, we expect to be better able to design safe systems if we understand solutions to these problems.

This agenda further limits attention to uncrowded domains, where there is not already an abundance of research being done, and where the problems may not be solved over the course of "normal" AI research. For example, program verification techniques are absolutely crucial in the design of extremely reliable programs (Sotala and Yampolskiy 2017), but program verification is not covered in this agenda primarily because a vibrant community is already actively studying the topic.

This agenda also restricts consideration to focused tools, ones that would be useful for designing aligned systems in particular (as opposed to intelligent systems in general). It might be possible to design generally intelligent AI systems before developing an understanding of highly reliable reasoning sufficient for constructing an aligned system. This could lead to a risky situation where powerful AI systems are built long before the tools needed to safely utilize them. Currently, significant research effort is focused on improving the capabilities of artificially intelligent systems, and comparatively little effort is focused on superintelligence alignment (Bostrom 2014, chap. 14). For that reason, this agenda focuses on research that improves our ability to design aligned systems in particular.

Lastly, we focus on research that cannot be safely delegated to machines. As AI algorithms come to rival humans in scientific inference and planning, new possibilities will emerge for outsourcing intellectual labor to AI algorithms themselves. This is a consequence of the fact that intelligence is the technology we are designing: on the path to great intelligence, much of the work may be done by smarter-than-human systems.2

2. Since the Dartmouth Proposal (McCarthy et al. 1955), it has been a standard idea in AI that a sufficiently smart machine intelligence could be intelligent enough to improve itself. In 1965, I. J. Good observed that this might create a positive feedback loop leading to an "intelligence explosion" (Good 1965). Sotala and Yampolskiy (2017) and Bostrom (2014, chap. 4) have observed that an intelligence explosion is especially likely if the agent has the ability to acquire more hardware, improve its software, or design new hardware.

As a result, the topics discussed in this agenda are ones that we believe are difficult to safely delegate to AI systems. Error-tolerant agent design is a good example: no AI problem (including the problem of error-tolerant agent design itself) can be safely delegated to a highly intelligent artificial agent that has incentives to manipulate or deceive its programmers. By contrast, a sufficiently capable automated engineer would be able to make robust contributions to computer vision or natural language processing even if its own visual or linguistic abilities were initially lacking. Most intelligent agents optimizing for some goal would also have incentives to improve their visual and linguistic abilities so as to enhance their ability to model and interact with the world.

It would be risky to delegate a crucial task before attaining a solid theoretical understanding of exactly what task is being delegated. It may be possible to use our understanding of ideal Bayesian inference to task a highly intelligent system with developing increasingly effective approximations of a Bayesian reasoner, but it would be far more difficult to delegate the task of "finding good ways to revise how confident you are about claims" to an intelligent system before gaining a solid understanding of probability theory. The theoretical understanding is useful to ensure that the right questions are being asked.

2 Highly Reliable Agent Designs

Bird and Layzell (2002) describe a genetic algorithm which, tasked with making an oscillator, re-purposed the printed circuit board tracks on the motherboard as a makeshift radio to amplify oscillating signals from nearby computers. This is not the kind of solution the algorithm would have found if it had been simulated on a virtual circuit board possessing only the features that seemed relevant to the problem. Intelligent search processes in the real world have the ability to use resources in unexpected ways, e.g., by finding "shortcuts" or "cheats" not accounted for in a simplified model.

When constructing intelligent systems which learn and interact with all the complexities of reality, it is not sufficient to verify that the algorithm behaves well in test settings. Additional work is necessary to verify that the system will continue working as intended in application. This is especially true of systems possessing general intelligence at or above the human level: superintelligent machines might find strategies and execute plans beyond both the experience and imagination of the programmers, making the clever oscillator of Bird and Layzell look trite. At the same time, unpredictable behavior from smarter-than-human systems could cause catastrophic damage if they are not aligned with human interests (Yudkowsky 2008).

Because the stakes are so high, testing combined with a gut-level intuition that the system will continue to work outside the test environment is insufficient, even if the testing is extensive. It is important to also have a formal understanding of precisely why the system is expected to behave well in application.

What constitutes a formal understanding? It seems essential to us to have both (1) an understanding of precisely what problem the system is intended to solve; and (2) an understanding of precisely why this practical system is expected to solve that abstract problem. The latter must wait for the development of practical smarter-than-human systems, but the former is a theoretical research problem that we can already examine.

A full description of the problem would reveal the conceptual tools needed to understand why practical heuristics are expected to work. By analogy, consider the game of chess. Before designing practical chess algorithms, it is necessary to possess not only a predicate describing checkmate, but also a description of the problem in terms of trees and backtracking algorithms: Trees and backtracking do not immediately yield a practical solution—building a full game tree is infeasible—but they are the conceptual tools of computer chess. It would be quite difficult to justify confidence in a chess heuristic before understanding trees and backtracking.
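To make the analogy concrete, the sketch below (ours, not Shannon's) shows the bare backtracking idea over a game tree; the Game interface, the score convention, and the depth cutoff are illustrative assumptions rather than any particular chess program.

    # Illustrative sketch of game-tree backtracking (negamax), the conceptual
    # tool Shannon described for chess. The Game protocol is a hypothetical
    # stand-in, not an implementation of any particular game.
    from typing import Iterable, Protocol

    class Game(Protocol):
        def legal_moves(self) -> Iterable[object]: ...
        def play(self, move: object) -> "Game": ...  # successor position
        def is_terminal(self) -> bool: ...
        def score(self) -> float: ...  # value from the player to move

    def negamax(position: Game, depth: int) -> float:
        """Evaluate a position by recursively evaluating its successors,
        to a fixed depth (building the full game tree is infeasible)."""
        if depth == 0 or position.is_terminal():
            return position.score()
        return max(-negamax(position.play(m), depth - 1)
                   for m in position.legal_moves())

Even this toy version makes the point: the tree and the backtracking rule are the conceptual scaffolding that justify whatever pruning heuristics a practical engine layers on top.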
While these conceptual tools may seem obvious in hindsight, they were not clear to foresight. Consider the famous essay by Edgar Allan Poe about Maelzel's Mechanical Turk (Poe 1836). It is in many ways remarkably sophisticated: Poe compares the Turk to "the calculating machine of Mr. Babbage" and then remarks on how the two systems cannot be of the same kind, since in Babbage's algebraical problems each step follows of necessity, and so can be represented by mechanical gears making deterministic motions; while in a chess game, no move follows with necessity from the position of the board, and even if our own move followed with necessity, the opponent's would not. And so (argues Poe) we can see that chess cannot possibly be played by mere mechanisms, only by thinking beings. From Poe's state of knowledge, Shannon's (1950) description of an idealized solution in terms of backtracking and trees constitutes a great insight. Our task is to put theoretical foundations under the field of general intelligence, in the same sense that Shannon put theoretical foundations under the field of computer chess.

It is possible that these foundations will be developed over time, during the normal course of AI research: in the past, theory has often preceded application. But the converse is also true: in many cases, application has preceded theory. The claim of this technical agenda is that, in safety-critical applications where mistakes can put lives at risk, it is crucial that certain theoretical insights come first.

A smarter-than-human agent would be embedded within and computed by a complex universe, learning more about its environment and bringing about desirable states of affairs. How is this formalized? What metric captures the question of how well an agent would perform in the real world?3

3. Legg and Hutter (2007) provide a preliminary answer to this question, by defining a "universal measure of intelligence" which scores how well an agent can learn the features of an external environment and maximize a reward function. This is the type of formalization we are looking for: a scoring metric which describes how well an agent would achieve some set of goals. However, while the Legg-Hutter metric is insightful, it makes a number of simplifying assumptions, and many difficult open questions remain (Soares 2015).

Not all parts of the problem must be solved in advance: the task of designing smarter, safer, more reliable systems could be delegated to early smarter-than-human systems, if the research done by those early systems can be sufficiently trusted. It is important, then, to focus research efforts particularly on parts of the problem where an increased understanding is necessary to construct a minimal reliable generally intelligent system. Moreover, it is important to focus on aspects which are currently tractable, so that progress can in fact be made today, and on issues relevant to alignment in particular, which would not otherwise be studied over the course of "normal" AI research.

In this section, we discuss four candidate topics meeting these criteria: (1) realistic world-models, the study of agents learning and pursuing goals while embedded within a physical world; (2) decision theory, the study of idealized decision-making procedures; (3) logical uncertainty, the study of reliable reasoning with bounded deductive capabilities; and (4) Vingean reflection, the study of reliable methods for reasoning about agents that are more intelligent than the reasoner. We will now discuss each of these topics in turn.

2.1 Realistic World-Models

Formalizing the problem of computer intelligence may seem easy in theory: encode some set of preferences as a utility function, and evaluate the expected utility that would be obtained if the agent were implemented. However, this is not a full specification: What is the set of "possible realities" used to model the world? Against what distribution over world models is the agent evaluated? How is a given world model used to score an agent? To ensure that an agent would work well in reality, it is first useful to formalize the problem faced by agents learning (and acting in) arbitrary environments.

Solomonoff (1964) made an early attempt to tackle these questions by specifying an "induction problem" in which an agent must construct world models and promote correct hypotheses based on the observation of an arbitrarily complex environment, in a manner reminiscent of scientific induction. In this problem, the agent and environment are separate. The agent gets to see one bit from the environment in each turn, and must predict the bits which follow.

Solomonoff's induction problem answers all three of the above questions in a simplified setting: The set of world models is any computable environment (e.g., any Turing machine). In reality, the simplest hypothesis that predicts the data is generally correct, so agents are evaluated against a simplicity distribution. Agents are scored according to their ability to predict their next observation. These answers were insightful, and led to the development of many useful tools, including algorithmic probability and Kolmogorov complexity.

However, Solomonoff's induction problem does not fully capture the problem faced by an agent learning about an environment while embedded within it, as a subprocess. It assumes that the agent and environment are separated, save only for the observation channel. What is the analog of Solomonoff induction for agents that are embedded within their environment?

This is the question of naturalized induction (Bensinger 2013). Unfortunately, the insights of Solomonoff do not apply in the naturalized setting. In Solomonoff's setting, where the agent and environment are separated, one can consider arbitrary Turing machines to be "possible environments." But when the agent is embedded in the environment, consideration must be restricted to environments which embed the agent. Given an algorithm, what is the set of environments which embed that algorithm? Given that set, what is the analogue of a simplicity prior which captures the fact that simpler hypotheses are more often correct?

Technical problem (Naturalized Induction). What, formally, is the induction problem faced by an intelligent agent embedded within and computed by its environment? What is the set of environments which embed the agent? What constitutes a simplicity prior over that set? How is the agent scored? For discussion, see Soares (2015).

Just as a formal description of Solomonoff induction answered the above three questions in the context of an agent learning an external environment, a formal description of naturalized induction may well yield answers to those questions in the context where agents are embedded in and computed by their environment.

Of course, the problem of computer intelligence is not simply an induction problem: the agent must also interact with the environment. Hutter (2000) extends Solomonoff's induction problem to an "interaction problem," in which an agent must both learn and act upon its environment. In each turn, the agent both observes one input and writes one output, and the output affects the behavior of the environment. In this problem, the agent is evaluated in terms of its ability to maximize a reward function specified in terms of inputs. While this model does not capture the difficulties faced by agents which are embedded within their environment, it does capture a large portion of the problem faced by agents interacting with arbitrarily complex environments. Indeed, the interaction problem (and AIXI [Hutter 2000], its solution) are the basis for the "universal measure of intelligence" developed by Legg and Hutter (2007).
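For concreteness, the standard formulations from the cited literature can be written as follows (our notation): Solomonoff's simplicity distribution weights each program for a universal prefix machine U by its length, and the Legg-Hutter measure scores a policy by its simplicity-weighted expected reward across computable environments.

    M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}, \qquad
    \Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} V^{\pi}_{\mu}

Here the first sum ranges over programs p whose output begins with the string x, \ell(p) is the length of p, K(\mu) is the Kolmogorov complexity of environment \mu, and V^{\pi}_{\mu} is the expected total reward that policy \pi obtains when interacting with \mu. Both definitions presuppose the agent/environment separation discussed above, which is exactly what the naturalized setting gives up.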

However, even barring problems arising from the agent/environment separation, the Legg-Hutter metric does not fully characterize the problem of computer intelligence. It scores agents according to their ability to maximize a reward function specified in terms of observation. Agents scoring well by the Legg-Hutter metric are extremely effective at ensuring their observations optimize a reward function, but these high-scoring agents are likely to be the type that find clever ways to seize control of their observation channel rather than the type that identify and manipulate the features in the world that the reward function was intended to proxy for (Soares 2015). The techniques available which punish the agent for attempting to take control would only incentivize the agent to deceive and mollify the programmers until it found a way to gain a decisive advantage (Bostrom 2014, chap. 8).

The Legg-Hutter metric does not characterize the question of how well an algorithm would perform if implemented in reality: to formalize that question, a scoring metric must evaluate the resulting environment histories, not just the agent's observations (Soares 2015).

But human goals are not specified in terms of environment histories, either: they are specified in terms of high-level notions such as "money" or "flourishing humans." Leaving aside problems of philosophy, imagine rating a system according to how well it achieves a straightforward, concrete goal, such as by rating how much diamond is in an environment after the agent has acted on it, where "diamond" is specified concretely in terms of an atomic structure. Now the goals are specified in terms of atoms, and the environment histories are specified in terms of Turing machines paired with an interaction history. How is the environment history evaluated in terms of atoms? This is the ontology identification problem.

Technical problem (Ontology Identification). Given goals specified in some ontology and a world model, how can the ontology of the goals be identified in the world model? What types of world models are amenable to ontology identification? For a discussion, see Soares (2015).

To evaluate world models, the world models must be evaluated in terms of the ontology of the goals. This may be difficult in cases where the ontology of the goals does not match reality: it is one thing to locate atoms in a Turing machine using an atomic model of physics, but it is another thing altogether to locate atoms in a Turing machine modeling quantum physics. De Blanc (2011) further motivates the idea that explicit mechanisms are needed to deal with changes in the ontology of the system's world model.

Agents built to solve the wrong problem—such as optimizing their observations—may well be capable of attaining superintelligence, but it is unlikely that those agents could be aligned with human interests (Bostrom 2014, chap. 12). A better understanding of naturalized induction and ontology identification is needed to fully specify the problem that intelligent agents would face when pursuing human goals while embedded within reality, and this increased understanding could be a crucial tool when it comes to designing highly reliable agents.

2.2 Decision Theory

Smarter-than-human systems must be trusted to make good decisions, but what does it mean for a decision to be "good"? Formally, given a description of an environment and an agent embedded within, how is the "best action" identified, with respect to some set of preferences? This is the question of decision theory.

The answer may seem trivial, at least in theory: simply iterate over the agent's available actions, evaluate what would happen if the agent took that action, and then return whichever action leads to the most expected utility. But this is not a full specification: How are the "available actions" identified? How is what "would happen" defined?

The difficulty is easiest to illustrate in a deterministic setting. Consider a deterministic agent embedded in a deterministic environment. There is exactly one action that the agent will take. Given a set of actions that it "could take," it is necessary to evaluate, for each action, what would happen if the agent took that action. But the agent will not take most of those actions. How is the counterfactual environment constructed, in which a deterministic algorithm "does something" that, in the real environment, it doesn't do? Answering this question requires a theory of counterfactual reasoning, and counterfactual reasoning is not well understood.

Technical problem (Theory of Counterfactuals). What theory of counterfactual reasoning can be used to specify a procedure which always identifies the best action available to a given agent in a given environment, with respect to a given set of preferences? For discussion, see Soares and Fallenstein (2015b).

Decision theory has been studied extensively by philosophers. The study goes back to Pascal, and has been picked up in modern times by Lehmann (1950), Wald (1939), Jeffrey (1983), Joyce (1999), Lewis (1981), Pearl (2000), and many others. However, no satisfactory method of counterfactual reasoning yet answers this particular question. To give an example of why counterfactual reasoning can be difficult, consider a deterministic agent playing against a perfect copy of itself in the classic prisoner's dilemma (Rapoport and Chammah 1965). The opponent is guaranteed to do the same thing as the agent, but the agents are "causally separated," in that the action of one cannot physically affect the action of the other.

What is the counterfactual world in which the agent on the left cooperates? It is not sufficient to consider changing the action of the agent on the left while holding the action of the agent on the right constant, because while the two are causally disconnected, they are logically constrained to behave identically. Standard causal reasoning, which neglects these logical constraints, misidentifies "defection" as the best strategy available to each agent even when they know they have identical source codes (Lewis 1979).4

4. As this is a multi-agent scenario, the problem of counterfactuals can also be thought of as game-theoretic. The goal is to define a procedure which reliably identifies the best available action; the label of "decision theory" is secondary. This goal subsumes both game theory and decision theory: the desired procedure must identify the best action in all settings, even when there is no clear demarcation between "agent" and "environment." Game theory informs, but does not define, this area of research.
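The following toy sketch (ours; the payoff numbers are assumptions with the usual ordering temptation > reward > punishment > sucker) contrasts the naive counterfactual, which varies one copy's move while holding the other fixed, with an evaluation that respects the constraint that an exact copy plays the same move.

    # Two exact copies of one deterministic agent play a one-shot prisoner's
    # dilemma. Payoff values are illustrative assumptions.
    PAYOFF = {  # (my_move, their_move) -> my_payoff
        ("C", "C"): 3, ("C", "D"): 0,
        ("D", "C"): 5, ("D", "D"): 1,
    }

    def causal_best_response(their_move_held_fixed: str) -> str:
        """Naive counterfactual: vary my move while holding the copy's move
        fixed. Defection dominates under this (incoherent) assumption."""
        return max("CD", key=lambda my: PAYOFF[(my, their_move_held_fixed)])

    def copy_respecting_choice() -> str:
        """Counterfactual respecting the logical constraint that an exact
        copy makes the same move: compare only the consistent outcomes."""
        return max("CD", key=lambda move: PAYOFF[(move, move)])

    print(causal_best_response("C"), causal_best_response("D"))  # D D
    print(copy_respecting_choice())                              # C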

Satisfactory counterfactual reasoning must respect these logical constraints, but how are logical constraints formalized and identified? It is fine to say that, in the counterfactual where the agent on the left cooperates, all identical copies of it also cooperate; but what counts as an identical copy? What if the right agent runs the same algorithm written in a different programming language? What if it only does the same thing 98% of the time?

These questions are pertinent in reality: practical agents must be able to identify good actions in settings where other actors base their actions on imperfect (but well-informed) predictions of what the agent will do. Identifying the best action available to an agent requires taking the non-causal logical constraints into account. A satisfactory formalization of counterfactual reasoning requires the ability to answer questions about how other deterministic algorithms behave in the counterfactual world where the agent's deterministic algorithm does something it doesn't. However, the evaluation of "logical counterfactuals" is not yet well understood.

Technical problem (Logical Counterfactuals). Consider a counterfactual in which a given deterministic decision procedure selects a different action from the one it selects in reality. What are the implications of this counterfactual on other algorithms? Can logical counterfactuals be formalized in a satisfactory way? A method for reasoning about logical counterfactuals seems necessary in order to formalize a more general theory of counterfactuals. For a discussion, see Soares and Fallenstein (2015b).

Unsatisfactory methods of counterfactual reasoning (such as the causal reasoning of Pearl [2000]) seem powerful enough to support smarter-than-human intelligent systems, but systems using those reasoning methods could systematically act in undesirable ways (even if otherwise aligned with human interests).

To construct practical heuristics that are known to make good decisions, even when acting beyond the oversight and control of humans, it is essential to understand what is meant by "good decisions." This requires a formulation which, given a description of an environment, an agent embedded in that environment, and some set of preferences, identifies the best action available to the agent. While modern methods of counterfactual reasoning do not yet allow for the specification of such a formula, recent research has pointed the way towards some promising paths for future research.

For example, Wei Dai's "updateless decision theory" (UDT) is a new take on decision theory that systematically outperforms causal decision theory (Hintze 2014), and two of the insights behind UDT highlight a number of tractable open problems (Soares and Fallenstein 2015b).

Recently, Bárász et al. (2014) developed a concrete model, together with a Haskell implementation, of multi-agent games where agents have access to each others' source code and base their decisions on what they can prove about their opponent. They have found that it is possible for some agents to achieve robust cooperation in the one-shot prisoner's dilemma while remaining unexploitable (Bárász et al. 2014).

These results suggest a number of new ways to approach the problem of counterfactual reasoning, and we are optimistic that continued study will prove fruitful.
2.3 Logical Uncertainty

Consider a reasoner encountering a black box with one input chute and two output chutes. Inside the box is a complex Rube Goldberg machine that takes an input ball and deposits it in one of the two output chutes. A probabilistic reasoner may have uncertainty about where the ball will exit, due to uncertainty about which Rube Goldberg machine is in the box. However, standard probability theory assumes that if the reasoner did know which machine the box implemented, they would know where the ball would exit: the reasoner is assumed to be logically omniscient, i.e., to know all logical consequences of any hypothesis they entertain.

By contrast, a practical bounded reasoner may be able to know exactly which Rube Goldberg machine the box implements without knowing where the ball will come out, due to the complexity of the machine. This reasoner is logically uncertain. Almost all practical reasoning is done under some form of logical uncertainty (Gaifman 2004), and almost all reasoning done by a smarter-than-human agent must be some form of logically uncertain reasoning. Any time an agent reasons about the consequences of a plan, the effects of running a piece of software, or the implications of an observation, it must do some sort of reasoning under logical uncertainty. Indeed, the problem of an agent reasoning about an environment in which it is embedded as a subprocess is inherently a problem of reasoning under logical uncertainty.

Thus, to construct a highly reliable smarter-than-human system, it is vitally important to ensure that the agent's logically uncertain reasoning is reliable and trustworthy. This requires a better understanding of the theoretical underpinnings of logical uncertainty, to more fully characterize what it means for logically uncertain reasoning to be "reliable and trustworthy" (Soares and Fallenstein 2015a).
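As a toy illustration of the distinction (ours, not from the agenda): the reasoner below knows the exact "machine," a short deterministic program, yet until it spends the compute to run it, it can do no better than a calibrated guess about which chute the ball exits.

    # A deterministic "Rube Goldberg machine": which chute (0 or 1) the ball
    # exits is a logical fact, fully determined by the code below.
    import hashlib

    def machine(ball: int) -> int:
        digest = hashlib.sha256(str(ball).encode()).digest()
        return digest[-1] & 1  # chute 0 or chute 1

    # A logically omniscient reasoner would assign probability 1 to the true
    # chute for every ball. A bounded reasoner that has not yet done the
    # computation is logically uncertain: an even split is reasonable until
    # compute is spent, at which point the belief collapses to 0 or 1.
    belief_before = {0: 0.5, 1: 0.5}
    belief_after = {0: 0.0, 1: 0.0}
    belief_after[machine(17)] = 1.0
    print(belief_before, belief_after)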

It is natural to consider extending standard probability theory to include the consideration of worlds which are "logically impossible" (e.g., where a deterministic Rube Goldberg machine behaves in a way that it doesn't). This gives rise to two questions: What, precisely, are logically impossible possibilities? And, given some means of reasoning about impossible possibilities, what is a reasonable prior probability distribution over impossible possibilities?

The problem is difficult to approach in full generality, but a study of logical uncertainty in the restricted context of assigning probabilities to logical sentences goes back at least to Łoś (1955) and Gaifman (1964), and has been investigated by many, including Halpern (2003), Hutter et al. (2013), Demski (2012), Russell (2014), and others. Though it isn't clear to what degree this formalism captures the kind of logically uncertain reasoning a realistic agent would use, logical sentences in, for example, the language of Peano Arithmetic are quite expressive: for example, given the Rube Goldberg machine discussed above, it is possible to form a sentence which is true if and only if the machine deposits the ball into the top chute. Thus, considering reasoners which are uncertain about logical sentences is a useful starting point. The problem of assigning probabilities to sentences of logic naturally divides itself into two parts.

First, how can probabilities consistently be assigned to sentences? An agent assigning probability 1 to short contradictions is hardly reasoning about the sentences as if they are logical sentences: some of the logical structure must be preserved. But which aspects of the logical structure? Preserving all logical implications requires that the reasoner be deductively omnipotent, as some implications φ → ψ may be very involved. The standard answer in the literature is that a coherent assignment of probabilities to sentences corresponds to a probability distribution over complete, consistent logical theories (Gaifman 1964; Christiano 2014a); that is, an "impossible possibility" is any consistent assignment of truth values to all sentences. Deductively limited reasoners cannot have fully coherent distributions, but they can approximate these distributions: for a deductively limited reasoner, "impossible possibilities" can be any assignment of truth values to sentences that looks consistent so far, so long as the assignment is discarded as soon as a contradiction is introduced.

Technical problem (Impossible Possibilities). How can deductively limited reasoners approximate reasoning according to a probability distribution over complete theories of logic? For a discussion, see Christiano (2014a).
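Following Gaifman, one standard way to make "preserving the logical structure" precise is to call an assignment P of probabilities to sentences coherent when it is the expectation of some probability distribution over complete consistent theories, or, equivalently, when it satisfies constraints such as:

    P(\varphi) = 1 \text{ for provable } \varphi, \qquad
    P(\varphi) = 0 \text{ for refutable } \varphi, \qquad
    P(\varphi) = P(\varphi \wedge \psi) + P(\varphi \wedge \neg\psi)

A deductively limited reasoner can only enforce these constraints over the growing set of logical facts it has verified so far, which is what the approximation problem above asks to formalize.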
Second, what is a satisfactory prior probability distribution over logical sentences? If the system is intended to reason according to a theory at least as powerful as Peano Arithmetic (PA), then that theory will be incomplete (Gödel, Kleene, and Rosser 1934). What prior distribution places nonzero probability on the set of complete extensions of PA? Deductively limited agents would not be able to literally use such a prior, but if it were computably approximable, then they could start with a rough approximation of the prior and refine it over time. Indeed, the process of refining a logical prior—getting better and better probability estimates for given logical sentences—captures the whole problem of reasoning under logical uncertainty in miniature. Hutter et al. (2013) have defined a desirable prior, but Sawin and Demski (2013) have shown that it cannot be computably approximated. Demski (2012) and Christiano (2014a) have also proposed logical priors, but neither seems fully satisfactory. The specification of satisfactory logical priors is difficult in part because it is not yet clear which properties are desirable in a logical prior, nor which properties are possible.

Technical problem (Logical Priors). What is a satisfactory set of priors over logical sentences that a bounded reasoner can approximate? For a discussion, see Soares and Fallenstein (2015a).

Many existing tools for studying reasoning, such as game theory, standard probability theory, and Bayesian networks, all assume that reasoners are logically omniscient. A theory of reasoning under logical uncertainty seems necessary to formalize the problem of naturalized induction, and to generate a satisfactory theory of counterfactual reasoning. If these tools are to be developed, extended, or improved, then a better understanding of logically uncertain reasoning is required.

2.4 Vingean Reflection

Instead of specifying superintelligent systems directly, it seems likely that humans may instead specify generally intelligent systems that go on to create or attain superintelligence. In this case, the reliability of the resulting superintelligent system depends upon the reasoning of the initial system which created it (either anew or via self-modification).

If the agent reasons reliably under logical uncertainty, then it may have a generic ability to evaluate various plans and strategies, only selecting those which seem beneficial. However, some scenarios put that logically uncertain reasoning to the test more than others. There is a qualitative difference between reasoning about simple programs and reasoning about human-level intelligent systems. For example, modern program verification techniques could be used to ensure that a "smart" military drone obeys certain rules of engagement, but it would be a different problem altogether to verify the behavior of an artificial military general which must run an entire war. A general has far more autonomy, ability to come up with clever unexpected strategies, and opportunities to impact the future than a drone.

A self-modifying agent (or any that constructs new agents more intelligent than itself) must reason about the behavior of a system that is more intelligent than the reasoner. This type of reasoning is critically important to the design of self-improving agents: if a system will attain superintelligence through self-modification, then the impact of the system depends entirely upon the correctness of the original agent's reasoning about its self-modifications (Fallenstein and Soares 2015).

Before trusting a system to attain superintelligence, it seems prudent to ensure that the agent uses appropriate caution when reasoning about successor agents.5 That is, it seems necessary to understand the mechanisms by which agents reason about smarter systems.

5. Of course, if an agent reasons perfectly under logical uncertainty, it would also reason well about the construction of successor agents. However, given the fallibility of human reasoning and the fact that this path is critically important, it seems prudent to verify the agent's reasoning methods in this scenario specifically.

Naive tools for reasoning about plans including smarter agents, such as backwards induction (Ben-Porath 1997), would have the reasoner evaluate the smarter agent by simply checking what the smarter agent would do. This does not capture the difficulty of the problem: a parent agent cannot simply check what its successor agent would do in all scenarios, for if it could, then it would already know what actions its successor will take, and the successor would not in any way be smarter.

Yudkowsky and Herreshoff (2013) call this observation the "Vingean principle," after Vernor Vinge (1993), who emphasized how difficult it is for humans to predict the behavior of smarter-than-human agents. Any agent reasoning about more intelligent successor agents must do so abstractly, without pre-computing all actions that the successor would take in every scenario. We refer to this kind of reasoning as Vingean reflection.

Technical problem (Vingean Reflection). How can agents reliably reason about agents which are smarter than themselves, without violating the Vingean principle? For discussion, see Fallenstein and Soares (2015).

It may seem premature to worry about how agents reason about self-improvements before developing a theoretical understanding of reasoning under logical uncertainty in general. However, it seems to us that work in this area can inform understanding of what sort of logically uncertain reasoning is necessary to reliably handle Vingean reflection.

Given the high stakes when constructing systems smarter than themselves, artificial agents might use some form of extremely high-confidence reasoning to verify the safety of potentially dangerous self-modifications. When humans desire extremely high reliability, as is the case for (e.g.) autopilot software, we often use formal logical systems (US DoD 1985; UK MoD 1991). High-confidence reasoning in critical situations may require something akin to formal verification (even if most reasoning is done using more generic logically uncertain reasoning), and so studying Vingean reflection in the domain of formal logic seems like a good starting point.

Logical models of agents reasoning about agents that are "more intelligent," however, run into a number of obstacles. By Gödel's second incompleteness theorem (1934), sufficiently powerful formal systems cannot rule out the possibility that they may be inconsistent. This makes it difficult for agents using formal logical reasoning to verify the reasoning of similar agents which also use formal logic for high-confidence reasoning; the first agent cannot verify that the latter agent is consistent. Roughly, it seems desirable to be able to develop agents which reason as follows:

    This smarter successor agent uses reasoning similar to mine, and my own reasoning is sound, so its reasoning is sound as well.

However, Gödel, Kleene, and Rosser (1934) showed that this sort of reasoning leads to inconsistency, and these problems do in fact make Vingean reflection difficult in a logical setting (Fallenstein and Soares 2015; Yudkowsky 2013).

Technical problem (Löbian Obstacle). How can agents gain very high confidence in agents that use similar reasoning systems, while avoiding paradoxes of self-reference? For discussion, see Fallenstein and Soares (2015).
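The obstacle can be stated precisely. Writing \Box\varphi for "\varphi is provable in the agent's theory T," Löb's theorem says:

    \text{if } T \vdash \Box\varphi \rightarrow \varphi, \text{ then } T \vdash \varphi

So a consistent theory at least as strong as PA cannot prove the soundness schema \Box\varphi \rightarrow \varphi for all sentences \varphi, on pain of proving everything; in particular, an agent reasoning in T cannot, in this naive way, endorse the conclusions of a successor that also reasons in T.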
These results may seem like artifacts of models rooted in formal logic, and may seem irrelevant given that practical agents must eventually use logical uncertainty rather than formal logic to reason about smarter successors. However, it has been shown that many of the Gödelian obstacles carry over into early probabilistic logics in a straightforward way, and some results have already been shown to apply in the domain of logical uncertainty (Fallenstein 2014).

Studying toy models in this formal logical setting has led to partial solutions (Fallenstein and Soares 2014). Recent work has pushed these models towards probabilistic settings (Fallenstein and Soares 2014; Yudkowsky 2014; Soares 2014). Further research may continue driving the development of methods for reasoning under logical uncertainty which can handle Vingean reflection in a reliable way (Fallenstein and Soares 2015).

3 Error-Tolerant Agent Designs

Incorrectly specified superintelligent agents could be dangerous (Yudkowsky 2008). Correcting a modern AI system involves simply shutting the system down and modifying its source code. Modifying a smarter-than-human system may prove more difficult: a system attaining superintelligence could acquire new hardware, alter its software, create subagents, and take other actions that would leave the original programmers with only dubious control over the agent. This is especially true if the agent has incentives to resist modification or shutdown. If intelligent systems are to be safe, they must be constructed in such a way that they are amenable to correction, even if they have the ability to prevent or avoid correction.

This does not come for free: by default, agents have incentives to preserve their own preferences, even if those conflict with the intentions of the programmers (Omohundro 2008; Soares and Fallenstein 2015a). Special care is needed to specify agents that avoid the default incentives to manipulate and deceive (Bostrom 2014, chap. 8).

Restricting the actions available to a superintelligent agent may be quite difficult (Bostrom 2014, chap. 9). Intelligent optimization processes often find unexpected ways to fulfill their optimization criterion using whatever resources are at their disposal; recall the evolved oscillator of Bird and Layzell (2002). Superintelligent optimization processes may well use hardware, software, and other resources in unanticipated ways, making them difficult to contain if they have incentives to escape.

We must learn how to design agents which do not have incentives to escape, manipulate, or deceive in the first place: agents which reason as if they are incomplete and potentially flawed in dangerous ways, and which are therefore amenable to online correction. Reasoning of this form is known as "corrigible reasoning."

Technical problem (Corrigibility). What sort of reasoning can reflect the fact that an agent is incomplete and potentially flawed in dangerous ways? For discussion, see Soares and Fallenstein (2015a).

Naïve attempts at specifying corrigible behavior are unsatisfactory. For example, "moral uncertainty" frameworks could allow agents to learn values through observation and interaction, but would still incentivize agents to resist changes to the underlying moral uncertainty framework if it happened to be flawed. Simple "penalty terms" for manipulation and deception also seem doomed to failure: agents subject to such penalties would have incentives to resist modification while cleverly avoiding the technical definitions of "manipulation" and "deception." The goal is not to design systems that fail in their attempts to deceive the programmers; the goal is to construct reasoning methods that do not give rise to deception incentives in the first place.

A formalization of the intuitive notion of corrigibility remains elusive. Active research is currently focused on small toy problems, in the hopes that insight gained there will generalize. One such toy problem is the "shutdown problem," which involves designing a set of preferences that incentivize an agent to shut down upon the press of a button without also incentivizing the agent to either cause or prevent the pressing of that button (Soares and Fallenstein 2015a). Stuart Armstrong's utility indifference technique (2015) provides a partial solution, but not a fully satisfactory one.

Technical problem (Utility Indifference). Can a utility function be specified such that agents maximizing that utility function switch their preferences on demand, without having incentives to cause or prevent the switching? For discussion, see Armstrong (2015).
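One simplified rendering of Armstrong's proposal (our notation, omitting details of the full construction): let the agent optimize its normal utility function U_N in worlds where the shutdown button stays unpressed and a shutdown utility function U_S plus a compensation term in worlds where it is pressed, with the compensation chosen so that the two branches have equal expected value given the agent's action:

    U(a, o) =
    \begin{cases}
      U_N(a) & \text{if } o = \text{unpressed} \\
      U_S(a) + \theta(a) & \text{if } o = \text{pressed}
    \end{cases}
    \qquad
    \theta(a) = \mathbb{E}[U_N \mid a] - \mathbb{E}[U_S \mid a]

With \theta chosen this way the agent has no incentive to cause or prevent the button press, though, as noted above, this yields only a partial solution to the shutdown problem.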
A better understanding of corrigible reasoning is essential to design agent architectures that are tolerant of human error. Other research could also prove fruitful, including research into reliable containment mechanisms. Alternatively, agent designs could somehow incentivize the agent to have a "low impact" on the world. Specifying "low impact" is trickier than it may seem: How do you tell an agent that it can't affect the physical world, given that its RAM is physical? How do you tell it that it can only use its own hardware, without allowing it to use its motherboard as a makeshift radio? How do you tell it not to cause big changes in the world when its behavior influences the actions of the programmers, who influence the world in chaotic ways?

Technical problem (Domesticity). How can an intelligent agent be safely incentivized to have a low impact? Specifying such a thing is not as easy as it seems. For a discussion, see Armstrong, Sandberg, and Bostrom (2012).

Regardless of the methodology used, it is crucial to understand designs for agents that could be updated and modified during the development process, so as to ensure that the inevitable human errors do not lead to catastrophe.

4 Value Specification

A highly-reliable, error-tolerant agent design does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing appropriate goals.

A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given. Imagine a superintelligent system designed to cure cancer which does so by stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping test subjects: the intended goal may have been "cure cancer without doing anything bad," but such a goal is rooted in cultural context and shared human knowledge.

It is not sufficient to construct systems that are smart enough to figure out the intended goals. Human beings, upon learning that natural selection "intended" sex to be pleasurable only for purposes of reproduction, do not suddenly decide that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being completely unmotivated to alter their preferences. For similar reasons, when developing AI systems, it is not sufficient to develop a system intelligent enough to figure out the intended goals; the system must also somehow be deliberately constructed to pursue them (Bostrom 2014, chap. 8).

However, the "intentions" of the operators are a complex, vague, fuzzy, context-dependent notion (Yudkowsky 2011; cf. Sotala and Yampolskiy 2017). Concretely writing out the full intentions of the operators in a machine-readable format is implausible if not impossible, even for simple tasks. An intelligent agent must be designed to learn and act according to the preferences of its operators.6 This is the value learning problem.

6. Or of all humans, or of all sapient creatures, etc. There are many philosophical concerns surrounding what sort of goals are ethical when aligning a superintelligent system, but a solution to the value learning problem will be a practical necessity regardless of which philosophical view is the correct one.

Directly programming a rule which identifies cats in images is implausibly difficult, but specifying a system that inductively learns how to identify cats in images is possible. Similarly, while directly programming a rule capturing complex human intentions is implausibly difficult, intelligent agents could be constructed to inductively learn values from training data.

Inductive value learning presents unique difficulties. The goal is to develop a system which can classify potential outcomes according to their value, but what sort of training data allows this classification? The labeled data could be given in terms of the agent's world-model, but this is a brittle solution if the ontology of the world-model is liable to change. Alternatively, the labeled data could come in terms of observations, in which case the agent would have to first learn how the labels in the observations map onto objects in the world-model, and then learn how to classify outcomes. Designing algorithms which can do this likely requires a better understanding of methods for constructing multi-level world-models from sense data.

Technical problem (Multi-Level World-Models). How can multi-level world-models be constructed from sense data in a manner amenable to ontology identification? For a discussion, see Soares (2016).

Standard problems of inductive learning arise, as well: how could a training set be constructed which allows the agent to fully learn the complexities of value? It is easy to imagine a training set which labels many observations of happy humans as "good" and many observations of needlessly suffering humans as "bad," but the simplest generalization from this data set may well be that humans value human-shaped things mimicking happy emotions: after training on this data, an agent may be inclined to construct many simple animatronics mimicking happiness. Creating a training set that covers all relevant dimensions of human value is difficult for the same reason that specifying human value directly is difficult. In order for inductive value learning to succeed, it is necessary to construct a system which identifies ambiguities in the training set—dimensions along which the training set gives no information—and queries the operators accordingly.

Technical problem (Ambiguity Identification). Given a training data set and a world model, how can dimensions which are neglected by the training data be identified? For discussion, see Soares (2016).

This problem is not unique to value learning, but it is especially important for it. Research into the programmatic identification of ambiguities, and the generation of "queries" which are similar to previous training data but differ along the ambiguous axis, would assist in the development of systems which can safely perform inductive value learning.

Intuitively, an intelligent agent should be able to use some of its intelligence to assist in this process: it does not take a detailed understanding of the human psyche to deduce that humans care more about some ambiguities (are the human-shaped things actually humans?) than others (does it matter if there is a breeze?). To build a system that acts as intended, the system must model the intentions of the operators and act accordingly. This adds another layer of indirection: the system must model the operators in some way, and must extract "preferences" from the operator-model and update its preferences accordingly (in a manner robust against improvements to the model of the operator). Techniques such as inverse reinforcement learning (Ng and Russell 2000), in which the agent assumes that the operator is maximizing some reward function specified in terms of observations, are a good start, but many questions remain unanswered.

Technical problem (Operator Modeling). By what methods can an operator be modeled in such a way that (1) a model of the operator's preferences can be extracted; and (2) the model may eventually become arbitrarily accurate and represent the operator as a subsystem embedded within the larger world? For a discussion, see Soares (2016).
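As a concrete illustration of the kind of operator model inverse reinforcement learning provides (standard in that literature, not specific to this agenda), the agent can maintain a posterior over candidate reward functions R given the operator's observed behavior, often using a noisily rational ("Boltzmann") likelihood:

    P(R \mid s_{1:n}, a_{1:n}) \propto P(R) \prod_{i=1}^{n}
      \frac{\exp(\beta \, Q^{*}_{R}(s_i, a_i))}{\sum_{a'} \exp(\beta \, Q^{*}_{R}(s_i, a'))}

Here Q^{*}_{R} is the optimal action-value function under R and \beta models how reliably the operator optimizes. The open problems above concern what happens when this operator model must itself live inside the agent's world model and be refined as that model improves.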
A system which acts as the operators intend may still have significant difficulty answering questions that the operators themselves cannot answer: imagine humans trying to design an artificial agent to do what they would want, if they were better people. How can normative uncertainty (uncertainty about moral claims) be resolved? Bostrom (2014, chap. 13) suggests an additional level of indirection: task the system with reasoning about what sorts of conclusions the operators would come to if they had more information and more time to think. Formalizing this is difficult, and the problems are largely still in the realm of philosophy rather than technical research. However, Christiano (2014b) has sketched one possible method by which the volition of a human could be extrapolated, and Soares (2016) discusses some potential pitfalls.

Philosophical problem (Normative Uncertainty). What ought one do when one is uncertain about what one ought to do? What norms govern uncertainty about normative claims? For a discussion, see MacAskill (2014).

10 extraordinary potential of superintelligence gives rise to This theoretical understanding might not be devel- many ethical questions. When constructing autonomous oped by default. Causal counterfactual reasoning, de- agents that will have a dominant ability to determine spite being suboptimal, might be good enough to enable the future, it is important to design the agents to not the construction of a smarter-than-human system. Sys- only act according to the wishes of the operators, but tems built from poorly understood heuristics might be also in others’ common interest. Here we largely leave capable of creating or attaining superintelligence for the philosophical questions aside, and remark only that reasons we don’t quite understand—but it is unlikely those who design systems intended to surpass human in- that such systems could then be aligned with human telligence will take on a responsibility of unprecedented interests. scale. Sometimes theory precedes application, but some- times it does not. The goal of much of the research outlined in this agenda is to ensure, in the domain of 5 Discussion superintelligence alignment—where the stakes are in- credibly high—that theoretical understanding comes Sections 2 through 4 discussed six research topics where first. the authors think that further research could make it easier to develop aligned systems in the future. This section discusses reasons why we think useful progress 5.2 Why Start Now? can be made today. It may seem premature to tackle the problem of AI goal alignment now, with superintelligent systems still firmly 5.1 Toward a Formal Understanding of the in the domain of futurism. However, the authors think Problem it is important to develop a formal understanding of AI alignment well in advance of making design decisions Are the problems discussed above tractable, uncrowded, about smarter-than-human systems. By beginning our focused, and unlikely to be solved automatically in the work early, we inevitably face the risk that it may turn course of developing increasingly intelligent artificial out to be irrelevant; yet failing to make preparations at systems? all poses substantially larger risks. They are certainly not very crowded. They also We have identified a number of unanswered founda- appear amenable to progress in the near future, though tional questions relating to the development of general it is less clear whether they can be fully solved. intelligence, and at present it seems possible to make When it comes to focus, some think that problems of some promising inroads. We think that the most respon- decision theory and logical uncertainty sound more like sible course, then, is to begin as soon as possible. generic theoretical AI research than alignment-specific Weld and Etzioni (1994) directed a “call to arms” research. A more intuitive set of alignment problems to computer scientists, noting that “society will reject might put greater emphasis on AI constraint (Sotala autonomous agents unless we have some credible means and Yampolskiy 2017) or value learning. of making them safe.” We are concerned with the oppo- Progress on the topics outlined in this agenda might site problem: what if society fails to reject systems that indeed make it easier to design intelligent systems in are unsafe? What will be the consequences if someone general. 
But while these advancements may provide tools useful for designing intelligent systems in general, they would make it markedly easier to design aligned systems in particular. Developing a general theory of highly reliable decision-making, even if it is too idealized to be directly implemented, gives us the conceptual tools needed to design and evaluate safe heuristic approaches. Conversely, if we must evaluate real systems composed of practical heuristics before formalizing the theoretical problems that those heuristics are supposed to solve, then we will be forced to rely on our intuitions.

This theoretical understanding might not be developed by default. Causal counterfactual reasoning, despite being suboptimal, might be good enough to enable the construction of a smarter-than-human system. Systems built from poorly understood heuristics might be capable of creating or attaining superintelligence for reasons we don't quite understand—but it is unlikely that such systems could then be aligned with human interests.

Sometimes theory precedes application, but sometimes it does not. The goal of much of the research outlined in this agenda is to ensure, in the domain of superintelligence alignment—where the stakes are incredibly high—that theoretical understanding comes first.

5.2 Why Start Now?

It may seem premature to tackle the problem of AI goal alignment now, with superintelligent systems still firmly in the domain of futurism. However, the authors think it is important to develop a formal understanding of AI alignment well in advance of making design decisions about smarter-than-human systems. By beginning our work early, we inevitably face the risk that it may turn out to be irrelevant; yet failing to make preparations at all poses substantially larger risks.

We have identified a number of unanswered foundational questions relating to the development of general intelligence, and at present it seems possible to make some promising inroads. We think that the most responsible course, then, is to begin as soon as possible.

Weld and Etzioni (1994) directed a "call to arms" to computer scientists, noting that "society will reject autonomous agents unless we have some credible means of making them safe." We are concerned with the opposite problem: what if society fails to reject systems that are unsafe? What will be the consequences if someone believes a smarter-than-human system is aligned with human interests when it is not?

This is our call to arms: regardless of whether research efforts follow the path laid out in this document, significant effort must be focused on the study of superintelligence alignment as soon as possible.

References

Armstrong, Stuart. 2015. "Motivated Value Selection for Artificial Agents." In 1st International Workshop on AI and Ethics at AAAI-2015. Austin, TX.

Armstrong, Stuart, Anders Sandberg, and Nick Bostrom. 2012. "Thinking Inside the Box: Controlling and Using an Oracle AI." Minds and Machines 22 (4): 299–324.

Bárász, Mihály, Patrick LaVictoire, Paul F. Christiano, Benja Fallenstein, Marcello Herreshoff, and Eliezer Yudkowsky. 2014. "Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic." arXiv: 1401.5577 [cs.GT].

Ben-Porath, Elchanan. 1997. "Rationality, Nash Equilibrium, and Backwards Induction in Perfect-Information Games." Review of Economic Studies 64 (1): 23–46.

Bensinger, Rob. 2013. "Building Phenomenological Bridges." Less Wrong (blog), December 23. http://lesswrong.com/lw/jd9/building_phenomenological_bridges/.

Bird, Jon, and Paul Layzell. 2002. "The Evolved Radio and Its Implications for Modelling the Evolution of Novel Sensors." In Congress on Evolutionary Computation, CEC-'02, 2:1836–1841. Honolulu, HI: IEEE.

Bostrom, Nick. 2014. Superintelligence: Paths, Dangers, Strategies. New York: Oxford University Press.

Christiano, Paul. 2014a. Non-Omniscience, Probabilistic Inference, and Metamathematics. Technical report 2014–3. Berkeley, CA: Machine Intelligence Research Institute. http://intelligence.org/files/Non-Omniscience.pdf.

Christiano, Paul. 2014b. "Specifying 'enlightened judgment' precisely (reprise)." Ordinary Ideas (blog). http://ordinaryideas.wordpress.com/2014/08/27/specifying-enlightened-judgment-precisely-reprise/.

de Blanc, Peter. 2011. "Ontological Crises in Artificial Agents' Value Systems" (May 19). arXiv: 1105.3821 [cs.AI].

Demski, Abram. 2012. "Logical Prior Probability." In Artificial General Intelligence: 5th International Conference, AGI 2012, 50–59. Lecture Notes in Artificial Intelligence 7716. New York: Springer.

Fallenstein, Benja. 2014. Procrastination in Probabilistic Logic. Working Paper. Berkeley, CA: Machine Intelligence Research Institute. http://intelligence.org/files/ProbabilisticLogicProcrastinates.pdf.

Fallenstein, Benja, and Nate Soares. 2014. "Problems of Self-Reference in Self-Improving Space-Time Embedded Intelligence." In Artificial General Intelligence: 7th International Conference, AGI 2014, edited by Ben Goertzel, Laurent Orseau, and Javier Snaider, 21–32. Lecture Notes in Artificial Intelligence 8598. New York: Springer.

Fallenstein, Benja, and Nate Soares. 2015. Vingean Reflection: Reliable Reasoning for Self-Improving Agents. Technical report 2015–2. Berkeley, CA: Machine Intelligence Research Institute. https://intelligence.org/files/VingeanReflection.pdf.

Gaifman, Haim. 1964. "Concerning Measures in First Order Calculi." Israel Journal of Mathematics 2 (1): 1–18.

Gaifman, Haim. 2004. "Reasoning with Limited Resources and Assigning Probabilities to Arithmetical Statements." Synthese 140 (1–2): 97–119.

Gödel, Kurt, Stephen Cole Kleene, and John Barkley Rosser. 1934. On Undecidable Propositions of Formal Mathematical Systems. Princeton, NJ: Institute for Advanced Study.

Good, Irving John. 1965. "Speculations Concerning the First Ultraintelligent Machine." In Advances in Computers, edited by Franz L. Alt and Morris Rubinoff, 6:31–88. New York: Academic Press.

Halpern, Joseph Y. 2003. Reasoning about Uncertainty. Cambridge, MA: MIT Press.

Hintze, Daniel. 2014. "Problem Class Dominance in Predictive Dilemmas." PhD diss., Arizona State University. http://hdl.handle.net/2286/R.I.23257.

Hutter, Marcus. 2000. "A Theory of Universal Artificial Intelligence based on Algorithmic Complexity." arXiv: cs/0004001 [cs.AI].

Hutter, Marcus, John W. Lloyd, Kee Siong Ng, and William T. B. Uther. 2013. "Probabilities on Sentences in an Expressive Logic." Journal of Applied Logic 11 (4): 386–420.

Jeffrey, Richard C. 1983. The Logic of Decision. 2nd ed. Chicago: Chicago University Press.

Joyce, James M. 1999. The Foundations of Causal Decision Theory. Cambridge Studies in Probability, Induction and Decision Theory. New York: Cambridge University Press.

Legg, Shane, and Marcus Hutter. 2007. "Universal Intelligence: A Definition of Machine Intelligence." Minds and Machines 17 (4): 391–444.

Lehmann, E. L. 1950. "Some Principles of the Theory of Testing Hypotheses." Annals of Mathematical Statistics 21 (1): 1–26.

Lewis, David. 1979. "Prisoners' Dilemma is a Newcomb Problem." Philosophy & Public Affairs 8 (3): 235–240.

Lewis, David. 1981. "Causal Decision Theory." Australasian Journal of Philosophy 59 (1): 5–30.

Łoś, Jerzy. 1955. "On the Axiomatic Treatment of Probability." Colloquium Mathematicae 3 (2): 125–137.

MacAskill, William. 2014. "Normative Uncertainty." PhD diss., St Anne's College, University of Oxford. http://ora.ox.ac.uk/objects/uuid:8a8b60af-47cd-4abc-9d29-400136c89c0f.

McCarthy, John, Marvin Minsky, Nathan Rochester, and Claude Shannon. 1955. A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence. Stanford, CA: Formal Reasoning Group, Stanford University, August 31.

Muehlhauser, Luke, and Anna Salamon. 2012. "Intelligence Explosion: Evidence and Import." In Singularity Hypotheses: A Scientific and Philosophical Assessment, edited by Amnon Eden, Johnny Søraker, James H. Moor, and Eric Steinhart. The Frontiers Collection. Berlin: Springer.

Ng, Andrew Y., and Stuart J. Russell. 2000. "Algorithms for Inverse Reinforcement Learning." In 17th International Conference on Machine Learning (ICML-'00), edited by Pat Langley, 663–670. San Francisco: Morgan Kaufmann.

Omohundro, Stephen M. 2008. "The Basic AI Drives." In Artificial General Intelligence 2008: 1st AGI Conference, edited by Pei Wang, Ben Goertzel, and Stan Franklin, 483–492. Frontiers in Artificial Intelligence and Applications 171. Amsterdam: IOS.

Pearl, Judea. 2000. Causality: Models, Reasoning, and Inference. 1st ed. New York: Cambridge University Press.

Poe, Edgar Allan. 1836. "Maelzel's Chess-Player." Southern Literary Messenger 2 (5): 318–326.

Rapoport, Anatol, and Albert M. Chammah. 1965. Prisoner's Dilemma: A Study in Conflict and Cooperation. Vol. 165. Ann Arbor Paperbacks. Ann Arbor: University of Michigan Press.

Russell, Stuart J. 2014. "Unifying Logic and Probability: A New Dawn for AI?" In Information Processing and Management of Uncertainty in Knowledge-Based Systems: 15th International Conference IPMU 2014, Part I, 442:10–14. Communications in Computer and Information Science, Part 1. Springer.

Sawin, Will, and Abram Demski. 2013. Computable Probability Distributions Which Converge on Π1 Will Disbelieve True Π2 Sentences. Technical report 2013–10. Berkeley, CA: Machine Intelligence Research Institute. http://intelligence.org/files/Pi1Pi2Problem.pdf.

Shannon, Claude E. 1950. "XXII. Programming a Computer for Playing Chess." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, Series 7, 41 (314): 256–275.

Soares, Nate. 2014. Tiling Agents in Causal Graphs. Technical report 2014–5. Berkeley, CA: Machine Intelligence Research Institute. https://intelligence.org/files/TilingAgentsCausalGraphs.pdf.

Soares, Nate. 2015. Formalizing Two Problems of Realistic World-Models. Technical report 2015–3. Berkeley, CA: Machine Intelligence Research Institute. https://intelligence.org/files/RealisticWorldModels.pdf.

Soares, Nate. 2016. "The Value Learning Problem." In Ethics for Artificial Intelligence Workshop at IJCAI-16. New York, NY.

Soares, Nate, and Benja Fallenstein. 2015a. Questions of Reasoning Under Logical Uncertainty. Technical report 2015–1. Berkeley, CA: Machine Intelligence Research Institute. https://intelligence.org/files/QuestionsLogicalUncertainty.pdf.

Soares, Nate, and Benja Fallenstein. 2015b. "Toward Idealized Decision Theory." arXiv: 1507.01986 [cs.AI].

Solomonoff, Ray J. 1964. "A Formal Theory of Inductive Inference. Part I." Information and Control 7 (1): 1–22.

Sotala, Kaj, and Roman Yampolskiy. 2017. "Responses to the Journey to the Singularity." Chap. 3 in The Technological Singularity: Managing the Journey, edited by Victor Callaghan, Jim Miller, Roman Yampolskiy, and Stuart Armstrong. The Frontiers Collection. Springer.

United Kingdom Ministry of Defense. 1991. Requirements for the Procurement of Safety Critical Software in Defence Equipment. Interim Defence Standard 00-55. United Kingdom Ministry of Defense, April 5.

United States Department of Defense. 1985. Department of Defense Trusted Computer System Evaluation Criteria. Department of Defense Standard DOD 5200.28-STD. United States Department of Defense, December 26. http://csrc.nist.gov/publications/history/dod85.pdf.

Vinge, Vernor. 1993. "The Coming Technological Singularity: How to Survive in the Post-Human Era." In Vision-21: Interdisciplinary Science and Engineering in the Era of Cyberspace, 11–22. NASA Conference Publication 10129. NASA Lewis Research Center. http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940022856.pdf.

Wald, Abraham. 1939. "Contributions to the Theory of Statistical Estimation and Testing Hypotheses." Annals of Mathematical Statistics 10 (4): 299–326.

Weld, Daniel, and Oren Etzioni. 1994. "The First Law of Robotics (A Call to Arms)." In 12th National Conference on Artificial Intelligence (AAAI-1994), edited by Barbara Hayes-Roth and Richard E. Korf, 1042–1047. Menlo Park, CA: AAAI Press.

Yudkowsky, Eliezer. 2008. "Artificial Intelligence as a Positive and Negative Factor in Global Risk." In Global Catastrophic Risks, edited by Nick Bostrom and Milan M. Ćirković, 308–345. New York: Oxford University Press.

Yudkowsky, Eliezer. 2011. Complex Value Systems are Required to Realize Valuable Futures. The Singularity Institute, San Francisco, CA. http://intelligence.org/files/ComplexValues.pdf.

Yudkowsky, Eliezer. 2013. The Procrastination Paradox. Brief Technical Note. Berkeley, CA: Machine Intelligence Research Institute. http://intelligence.org/files/ProcrastinationParadox.pdf.

Yudkowsky, Eliezer. 2014. Distributions Allowing Tiling of Staged Subjective EU Maximizers. Technical report 2014–1. Berkeley, CA: Machine Intelligence Research Institute, May 11. Revised May 31, 2014. http://intelligence.org/files/DistributionsAllowingTiling.pdf.

Yudkowsky, Eliezer, and Marcello Herreshoff. 2013. Tiling Agents for Self-Modifying AI, and the Löbian Obstacle. Early Draft. Berkeley, CA: Machine Intelligence Research Institute. http://intelligence.org/files/TilingAgents.pdf.
