The Landscape of AI Safety and Beneficence Research
Input for Brainstorming at Beneficial AI 2017∗

Executive Summary

Success in the quest for artificial intelligence has the potential to bring unprecedented benefits to humanity, and it is therefore worthwhile to research how to maximize these benefits while avoiding potential pitfalls. This document gives numerous examples of research topics aimed at ensuring that AI remains robust and beneficial.

[Figure: a landscape map of the AI safety and beneficence research topics covered in this document; the individual topics are enumerated in the table of contents and the sections below.]
∗This document borrows heavily from MIRI's agent foundations technical agenda [403], Amodei et al.'s concrete problems review [11], MIRI's machine learning technical agenda [437], and FLI's previous research priorities document [371]. This document was written by Richard Mallah and was last edited January 2, 2017. It reflects feedback from Abram Demski, Andrew Critch, Anthony Aguirre, Bart Selman, Jan Leike, Jessica Taylor, Max Tegmark, Meia Chita-Tegmark, Nate Soares, Paul Christiano, Roman Yampolskiy, Stuart Armstrong, Tsvi Benson-Tilsen, Vika Krakovna, and Vincent Conitzer. (ver. 0.41)

Contents

1 Introduction 6
2 Foundations 7
2.1 Foundations of Rational Agency 7
2.1.1 Logical Uncertainty 7
2.1.2 Theory of Counterfactuals 7
2.1.3 Universal Algorithmic Intelligence 7
2.1.4 Theory of Ethics 8
2.1.4.1 Ethical Motivation 8
2.1.4.2 Ontological Value 8
2.1.4.3 Human Preference Aggregation 8
2.1.4.4 Infinite Ethics 8
2.1.4.5 Normative Uncertainty 8
2.1.5 Bounded Rationality 9
2.2 Consistent Decision Making 9
2.2.1 Decision Theory 9
2.2.1.1 Logical Counterfactuals 9
2.2.1.2 Open Source Game Theory 9
2.2.2 Safer Self-Modification 9
2.2.2.1 Vingean Reflection 10
2.2.2.1.1 Abstractly Reason About Superior Agents 10
2.2.2.1.2 Reflective Induction Confidence 10
2.2.2.1.3 Löbian Obstacle 10
2.2.2.2 Optimal Policy Preservation 10
2.2.2.3 Safety Technique Awareness 10
2.2.3 Goal Stability 10
2.2.3.1 Nontransitive Options 10
2.3 Projecting Behavioral Bounds 11
2.3.1 Computational Complexity 11
3 Verification 11
3.1 Formal Software Verification 11
3.1.1 Verified Component Design Approaches 12
3.1.2 Adaptive Control Theory 12
3.1.3 Verification of Cyberphysical Systems 12
3.1.4 Making Verification More User Friendly 12
3.2 Automated Vulnerability Finding 12
3.3 Verification of Intelligent Systems 13
3.3.1 Verification of Whole AI Systems 13
3.3.2 Verification of Machine Learning Components 13
3.4 Verification of Recursive Self-Improvement 13
3.5 Implementation Testing 13
4 Validation 14
4.1 Averting Instrumental Incentives 14
4.1.1 Error-Tolerant Agent Design 15
4.1.2 Domesticity 15
4.1.2.1 Impact Measures 15
4.1.2.1.1 Impact Regularizers 15
4.1.2.1.1.1 Defined Impact Regularizer 15
4.1.2.1.1.2 Learned Impact Regularizer 15
4.1.2.1.2 Follow-on Analysis 15
4.1.2.1.3 Avoiding Negative Side Effects 16
4.1.2.1.3.1 Penalize Influence 16
4.1.2.1.3.2 Multi-agent Approaches 16
4.1.2.1.3.3 Reward Uncertainty 16
4.1.2.1.4 Values-Based Side Effect Evaluation 16
4.1.2.2 Mild Optimization 16
4.1.2.2.1 Optimization Measures 16
4.1.2.2.2 Quantilization 17
4.1.2.2.3 Multiobjective Optimization 17
4.1.2.3 Safe Exploration 17
4.1.2.3.1 Risk-Sensitive Performance Criteria 17
4.1.2.3.2 Simulated Exploration 17
4.1.2.3.3 Bounded Exploration 17
4.1.2.3.4 Multiobjective Reinforcement Learning 17
4.1.2.3.5 Human Oversight 17
4.1.2.3.6 Buried Honeypot Avoidance 18
4.1.2.4 Computational Humility 18
4.1.2.4.1 Generalized Fallibility Awareness 18
4.1.2.4.2 Resource Value Uncertainty 18
4.1.2.4.3 Temporal Discount Rates 18
4.1.3 Averting Paranoia 18
4.2 Avoiding Reward Hacking 18
4.2.1 Pursuing Environmental Goals 18
4.2.1.1 Environmental Goals 19
4.2.1.2 Predicting Future Observations 19
4.2.1.3 Model Lookahead 19
4.2.1.4 History Analysis 19
4.2.1.5 Detecting Channel Switching 19
4.2.2 Counterexample Resistance 19
4.2.3 Adversarial Reward Functions 19
4.2.4 Generalizing Avoiding Reward Hacking 19
4.2.5 Adversarial Blinding 19
4.2.6 Variable Indifference 19
4.2.7 Careful Engineering 20
4.2.8 Reward Capping 20
4.2.9 Reward Pretraining 20
4.2.10 Multiple Rewards 20
4.3 Value Alignment 20
4.3.1 Ethics Mechanisms ...