Towards Learning in Probabilistic Action Selection: Markov Systems and Markov Decision Processes

Total Page:16

File Type:pdf, Size:1020Kb

Towards Learning in Probabilistic Action Selection: Markov Systems and Markov Decision Processes Towards Learning in Probabilistic Action Selection: Markov Systems and Markov Decision Processes Manuela Veloso Carnegie Mellon University Computer Science Department Planning - Fall 2001 Remember the examples on the board. 15-889 – Fall 2001 Veloso, Carnegie Mellon Different Aspects of “Machine Learning” Supervised learning • – Classification - concept learning – Learning from labeled data – Function approximation Unsupervised learning • – Data is not labeled – Data needs to be grouped, clustered – We need distance metric Control and action model learning • – Learning to select actions efficiently – Feedback: goal achievement, failure, reward – Search control learning, reinforcement learning 15-889 – Fall 2001 Veloso, Carnegie Mellon Search Control Learning Improve search efficiency, plan quality • Learn heuristics • (if (and (goal (enjoy weather)) (goal (save energy)) (state (weather fair))) (then (select action RIDE-BIKE))) 15-889 – Fall 2001 Veloso, Carnegie Mellon Learning Opportunities in Planning Learning to improve planning efficiency • Learning the domain model • Learning to improve plan quality • Learning a universal plan • Which action model, which planning algorithm, which heuristic control is the most efficient for a given task? 15-889 – Fall 2001 Veloso, Carnegie Mellon Reinforcement Learning A variety of algorithms to address: • – learning the model – converging to the optimal plan. 15-889 – Fall 2001 Veloso, Carnegie Mellon Discounted Rewards “Reward” today versus future (promised) reward • $100K + $100K + $100K + . • INFINITE! Future rewards not worth as much as current. • Assume reality . : discount factor , say γ. • $100K + γ $100K + γ2 $100K + . • CONVERGES! 15-889 – Fall 2001 Veloso, Carnegie Mellon Markov Systems with Rewards Finite set of n states - vector n - si • Probabilistic state matrix, P - n n - pij • × “Goal achievement” - Reward for each state, • vector n -ri Discount factor - γ • Process: • – Start at state si – Receive immediate reward ri – Move randomly to a new state according to the probability transition matrix – Future rewards (of next state) are discounted by γ 15-889 – Fall 2001 Veloso, Carnegie Mellon Solving a Markov Systems with Rewards V ∗(si) expected discounted sum of future rewards • − starting in state si • V ∗(si) = ri + γ[pi1V ∗(s1) + pi2V ∗(s2) + : : : pinV ∗(sn)] 15-889 – Fall 2001 Veloso, Carnegie Mellon Value Iteration to Solve a Markov Systems with Rewards 1 V (si) expected discounted sum of future rewards • − starting in state si for one step. 2 V (si) expected discounted sum of future rewards • − starting in state si for two steps. ... • k V (si) expected discounted sum of future rewards • − starting in state si for k steps. k As k V (si) V ∗(si) • ! 1 ! Stop when difference of k + 1 and k values is smaller • than some . 15-889 – Fall 2001 Veloso, Carnegie Mellon Markov Decision Processes Finite set of states, s1; : : : ; sn • Finite set of actions, a1; : : : ; am • Probabilistic state,action transitions: • k p = prob (next = sj current = si and take action ak) ij j Reward for each state, r1; : : : ; rn • Process: • – Start in state si – Receive immediate reward ri – Choose action ak A 2 k – Change to state sj with probability pij. – Discount future rewards 15-889 – Fall 2001 Veloso, Carnegie Mellon Deterministic Example 10 G 10 10 (Reward on unlabelled transitions is zero.) 15-889 – Fall 2001 Veloso, Carnegie Mellon Nondeterministic Example 0.9 0.1 R 1.0 1.0 S1 D S2 D Unemployed Industry R 0.9 1.0 0.9 S3 D S4 D Grad School Academia 0.1 0.1 0.1 R R 0.9 1.0 15-889 – Fall 2001 Veloso, Carnegie Mellon Solving an MDP A policy is a mapping from states to actions. • Optimal policy - for every state, there is no other • action that gets a higher sum of discounted future rewards. For every MDP there exists an optimal policy. • Solving an MDP is finding an optimal policy. • A specific policy converts an MDP into a plain • Markov system with rewards. 15-889 – Fall 2001 Veloso, Carnegie Mellon Policy Iteration Start with some policy π0(si). • Such policy transforms the MDP into a plain Markov • system with rewards. Compute the values of the states according to • current policy. Update policy: • a π0 π1(si) = argmax ri + γ p V (sj) af X ij g j Keep computing • Stop when πk+1 = πk. • 15-889 – Fall 2001 Veloso, Carnegie Mellon Value Iteration V ∗(si) = expected discounted future rewards, if we • start from si and we follow the optimal policy. Compute V ∗ with value iteration: • k – V (si) = maximum possible future sum of rewards starting from state si for k steps. Bellman’s Equation: • N n+1 k n V (si) = maxk ri + γ p V (sj) f X ij g j=1 Dynamic programming • 15-889 – Fall 2001 Veloso, Carnegie Mellon.
Recommended publications
  • A Survey on Data Collection for Machine Learning a Big Data - AI Integration Perspective
    1 A Survey on Data Collection for Machine Learning A Big Data - AI Integration Perspective Yuji Roh, Geon Heo, Steven Euijong Whang, Senior Member, IEEE Abstract—Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research. Index Terms—data collection, data acquisition, data labeling, machine learning F 1 INTRODUCTION E are living in exciting times where machine learning expertise. This problem applies to any novel application that W is having a profound influence on a wide range of benefits from machine learning.
    [Show full text]
  • Classical Planning in Deep Latent Space
    Journal of Artificial Intelligence Research 1 (2020) 1-15 Submitted 1/1; published 1/1 (Table of Contents is provided for convenience and will be removed in the camera-ready.) Contents 1 Introduction 1 2 Preliminaries and Important Background Concepts 4 2.1 Notations . .4 2.2 Propositional Classical Planning . .4 2.3 Processes Required for Obtaining a Classical Planning Model . .4 2.4 Symbol Grounding . .5 2.5 Autoencoders and Variational Autoencoders . .7 2.6 Discrete Variational Autoencoders . .9 3 Latplan: Overview of the System Architecture 10 4 SAE as a Gumbel-Softmax/Binary-Concrete VAE 13 4.1 The Symbol Stability Problem: Issues Caused by Unstable Propositional Symbols . 15 4.2 Addressing the Systemic Uncertainty: Removing the Run-Time Stochasticity . 17 4.3 Addressing the Statistical Uncertainty: Selecting the Appropriate Prior . 17 5 AMA1: An SAE-based Translation from Image Transitions to Ground Actions 19 6 AMA2: Action Symbol Grounding with an Action Auto-Encoder (AAE) 20 + 7 AMA3/AMA3 : Descriptive Action Model Acquisition with Cube-Space AutoEncoder 21 7.1 Cube-Like Graphs and its Equivalence to STRIPS . 22 7.2 Edge Chromatic Number of Cube-Like Graphs and the Number of Action Schema in STRIPS . 24 7.3 Vanilla Space AutoEncoder . 27 7.3.1 Implementation . 32 7.4 Cube-Space AE (CSAE) . 33 7.5 Back-to-Logit and its Equivalence to STRIPS . 34 7.6 Precondition and Effect Rule Extraction . 35 + arXiv:2107.00110v1 [cs.AI] 30 Jun 2021 8 AMA4 : Learning Preconditions as Effects Backward in Time 36 8.1 Complete State Regression Semantics .
    [Show full text]
  • Discovering Underlying Plans Based on Shallow Models
    Discovering Underlying Plans Based on Shallow Models HANKZ HANKUI ZHUO, Sun Yat-Sen University YANTIAN ZHA, Arizona State University SUBBARAO KAMBHAMPATI, Arizona State University XIN TIAN, Sun Yat-Sen University Plan recognition aims to discover target plans (i.e., sequences of actions) behind observed actions, with history plan libraries or action models in hand. Previous approaches either discover plans by maximally “matching” observed actions to plan libraries, assuming target plans are from plan libraries, or infer plans by executing action models to best explain the observed actions, assuming that complete action models are available. In real world applications, however, target plans are often not from plan libraries, and complete action models are often not available, since building complete sets of plans and complete action models are often difficult or expensive. In this paper we view plan libraries as corpora and learn vector representations of actions using the corpora; we then discover target plans based on the vector representations. Specifically, we propose two approaches, DUP and RNNPlanner, to discover target plans based on vector representations of actions. DUP explores the EM-style (Expectation Maximization) framework to capture local contexts of actions and discover target plans by optimizing the probability of target plans, while RNNPlanner aims to leverage long-short term contexts of actions based on RNNs (recurrent neural networks) framework to help recognize target plans. In the experiments, we empirically show that our approaches are capable of discovering underlying plans that are not from plan libraries, without requiring action models provided. We demonstrate the effectiveness of our approaches by comparing its performance to traditional plan recognition approaches in three planning domains.
    [Show full text]
  • Learning to Plan from Raw Data in Grid-Based Games
    EPiC Series in Computing Volume 55, 2018, Pages 54{67 GCAI-2018. 4th Global Con- ference on Artificial Intelligence Learning to Plan from Raw Data in Grid-based Games Andrea Dittadi, Thomas Bolander, and Ole Winther Technical University of Denmark, Lyngby, Denmark fadit, tobo, [email protected] Abstract An agent that autonomously learns to act in its environment must acquire a model of the domain dynamics. This can be a challenging task, especially in real-world domains, where observations are high-dimensional and noisy. Although in automated planning the dynamics are typically given, there are action schema learning approaches that learn sym- bolic rules (e.g. STRIPS or PDDL) to be used by traditional planners. However, these algorithms rely on logical descriptions of environment observations. In contrast, recent methods in deep reinforcement learning for games learn from pixel observations. However, they typically do not acquire an environment model, but a policy for one-step action selec- tion. Even when a model is learned, it cannot generalize to unseen instances of the training domain. Here we propose a neural network-based method that learns from visual obser- vations an approximate, compact, implicit representation of the domain dynamics, which can be used for planning with standard search algorithms, and generalizes to novel domain instances. The learned model is composed of submodules, each implicitly representing an action schema in the traditional sense. We evaluate our approach on visual versions of the standard domain Sokoban, and show that, by training on one single instance, it learns a transition model that can be successfully used to solve new levels of the game.
    [Show full text]
  • Learning from Noisy Labels with Deep Neural Networks: a Survey Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, Jae-Gil Lee
    UNDER REVIEW 1 Learning from Noisy Labels with Deep Neural Networks: A Survey Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, Jae-Gil Lee Abstract—Deep learning has achieved remarkable success in 100% 100% numerous domains with help from large amounts of big data. Noisy w/o. Reg. Noisy w. Reg. 75% However, the quality of data labels is a concern because of the 75% Clean w. Reg Gap lack of high-quality labels in many real-world scenarios. As noisy 50% 50% labels severely degrade the generalization performance of deep Noisy w/o. Reg. neural networks, learning from noisy labels (robust training) is 25% Accuracy Test 25% Training Accuracy Training Noisy w. Reg. becoming an important task in modern deep learning applica- Clean w. Reg tions. In this survey, we first describe the problem of learning with 0% 0% 1 30 60 90 120 1 30 60 90 120 label noise from a supervised learning perspective. Next, we pro- Epochs Epochs vide a comprehensive review of 57 state-of-the-art robust training Fig. 1. Convergence curves of training and test accuracy when training methods, all of which are categorized into five groups according WideResNet-16-8 using a standard training method on the CIFAR-100 dataset to their methodological difference, followed by a systematic com- with the symmetric noise of 40%: “Noisy w/o. Reg.” and “Noisy w. Reg.” are parison of six properties used to evaluate their superiority. Sub- the models trained on noisy data without and with regularization, respectively, sequently, we perform an in-depth analysis of noise rate estima- and “Clean w.
    [Show full text]
  • Quantum Reinforcement Learning During Human Decision-Making
    ARTICLES https://doi.org/10.1038/s41562-019-0804-2 Quantum reinforcement learning during human decision-making Ji-An Li 1,2, Daoyi Dong 3, Zhengde Wei1,4, Ying Liu5, Yu Pan6, Franco Nori 7,8 and Xiaochu Zhang 1,9,10,11* Classical reinforcement learning (CRL) has been widely applied in neuroscience and psychology; however, quantum reinforce- ment learning (QRL), which shows superior performance in computer simulations, has never been empirically tested on human decision-making. Moreover, all current successful quantum models for human cognition lack connections to neuroscience. Here we studied whether QRL can properly explain value-based decision-making. We compared 2 QRL and 12 CRL models by using behavioural and functional magnetic resonance imaging data from healthy and cigarette-smoking subjects performing the Iowa Gambling Task. In all groups, the QRL models performed well when compared with the best CRL models and further revealed the representation of quantum-like internal-state-related variables in the medial frontal gyrus in both healthy subjects and smok- ers, suggesting that value-based decision-making can be illustrated by QRL at both the behavioural and neural levels. riginating from early behavioural psychology, reinforce- explained well by quantum probability theory9–16. For example, one ment learning is now a widely used approach in the fields work showed the superiority of a quantum random walk model over Oof machine learning1 and decision psychology2. It typi- classical Markov random walk models for a modified random-dot cally formalizes how one agent (a computer or animal) should take motion direction discrimination task and revealed quantum-like actions in unknown probabilistic environments to maximize its aspects of perceptual decisions12.
    [Show full text]
  • Introduction to Deep Learning in Signal Processing & Communications with MATLAB
    Introduction to Deep Learning in Signal Processing & Communications with MATLAB Dr. Amod Anandkumar Pallavi Kar Application Engineering Group, Mathworks India © 2019 The MathWorks, Inc.1 Different Types of Machine Learning Type of Machine Learning Categories of Algorithms • Output is a choice between classes Classification (True, False) (Red, Blue, Green) Supervised Learning • Output is a real number Regression Develop predictive (temperature, stock prices) model based on both Machine input and output data Learning Unsupervised • No output - find natural groups and Clustering Learning patterns from input data only Discover an internal representation from input data only 2 What is Deep Learning? 3 Deep learning is a type of supervised machine learning in which a model learns to perform classification tasks directly from images, text, or sound. Deep learning is usually implemented using a neural network. The term “deep” refers to the number of layers in the network—the more layers, the deeper the network. 4 Why is Deep Learning So Popular Now? Human Accuracy Source: ILSVRC Top-5 Error on ImageNet 5 Vision applications have been driving the progress in deep learning producing surprisingly accurate systems 6 Deep Learning success enabled by: • Labeled public datasets • Progress in GPU for acceleration AlexNet VGG-16 ResNet-50 ONNX Converter • World-class models and PRETRAINED PRETRAINED PRETRAINED MODEL MODEL CONVERTER MODEL MODEL TensorFlow- connected community Caffe GoogLeNet IMPORTER PRETRAINED Keras Inception-v3 MODEL IMPORTER MODELS 7
    [Show full text]
  • Learning from Minimally Labeled Data with Accelerated Convolutional Neural Networks Aysegul Dundar Purdue University
    Purdue University Purdue e-Pubs Open Access Dissertations Theses and Dissertations 4-2016 Learning from minimally labeled data with accelerated convolutional neural networks Aysegul Dundar Purdue University Follow this and additional works at: https://docs.lib.purdue.edu/open_access_dissertations Part of the Computer Sciences Commons, and the Electrical and Computer Engineering Commons Recommended Citation Dundar, Aysegul, "Learning from minimally labeled data with accelerated convolutional neural networks" (2016). Open Access Dissertations. 641. https://docs.lib.purdue.edu/open_access_dissertations/641 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. Graduate School Form 30 Updated ¡¢ ¡£¢ ¡¤ ¥ PURDUE UNIVERSITY GRADUATE SCHOOL Thesis/Dissertation Acceptance This is to certify that the thesis/dissertation prepared By Aysegul Dundar Entitled Learning from Minimally Labeled Data with Accelerated Convolutional Neural Networks For the degree of Doctor of Philosophy Is approved by the final examining committee: Eugenio Culurciello Chair Anand Raghunathan Bradley S. Duerstock Edward L. Bartlett To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy of Integrity in Research” and the use of copyright material. Approved by Major Professor(s): Eugenio Culurciello Approved by: George R. Wodicka 04/22/2016 Head of the Departmental Graduate Program Date LEARNING FROM MINIMALLY LABELED DATA WITH ACCELERATED CONVOLUTIONAL NEURAL NETWORKS A Dissertation Submitted to the Faculty of Purdue University by Aysegul Dundar In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy May 2016 Purdue University West Lafayette, Indiana ii To my family: Nezaket, Cengiz and Deniz.
    [Show full text]
  • Latest Snapshot.” to Modify an Algo So It Does Produce Multiple Snapshots, find the Following Line (Which Is Present in All of the Algorithms)
    Spinning Up Documentation Release Joshua Achiam Feb 07, 2020 User Documentation 1 Introduction 3 1.1 What This Is...............................................3 1.2 Why We Built This............................................4 1.3 How This Serves Our Mission......................................4 1.4 Code Design Philosophy.........................................5 1.5 Long-Term Support and Support History................................5 2 Installation 7 2.1 Installing Python.............................................8 2.2 Installing OpenMPI...........................................8 2.3 Installing Spinning Up..........................................8 2.4 Check Your Install............................................9 2.5 Installing MuJoCo (Optional)......................................9 3 Algorithms 11 3.1 What’s Included............................................. 11 3.2 Why These Algorithms?......................................... 12 3.3 Code Format............................................... 12 4 Running Experiments 15 4.1 Launching from the Command Line................................... 16 4.2 Launching from Scripts......................................... 20 5 Experiment Outputs 23 5.1 Algorithm Outputs............................................ 24 5.2 Save Directory Location......................................... 26 5.3 Loading and Running Trained Policies................................. 26 6 Plotting Results 29 7 Part 1: Key Concepts in RL 31 7.1 What Can RL Do?............................................ 31 7.2
    [Show full text]
  • Exploring Semi-Supervised Variational Autoencoders for Biomedical Relation Extraction
    Exploring Semi-supervised Variational Autoencoders for Biomedical Relation Extraction Yijia Zhanga,b and Zhiyong Lua* a National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland 20894, USA b School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China Corresponding author: Zhiyong Lu ([email protected]) Abstract The biomedical literature provides a rich source of knowledge such as protein-protein interactions (PPIs), drug-drug interactions (DDIs) and chemical-protein interactions (CPIs). Biomedical relation extraction aims to automatically extract biomedical relations from biomedical text for various biomedical research. State-of-the-art methods for biomedical relation extraction are primarily based on supervised machine learning and therefore depend on (sufficient) labeled data. However, creating large sets of training data is prohibitively expensive and labor-intensive, especially so in biomedicine as domain knowledge is required. In contrast, there is a large amount of unlabeled biomedical text available in PubMed. Hence, computational methods capable of employing unlabeled data to reduce the burden of manual annotation are of particular interest in biomedical relation extraction. We present a novel semi-supervised approach based on variational autoencoder (VAE) for biomedical relation extraction. Our model consists of the following three parts, a classifier, an encoder and a decoder. The classifier is implemented using multi-layer convolutional neural networks (CNNs), and the encoder and decoder are implemented using both bidirectional long short-term memory networks (Bi-LSTMs) and CNNs, respectively. The semi-supervised mechanism allows our model to learn features from both the labeled and unlabeled data.
    [Show full text]
  • Learning Complex Action Models with Quantifiers and Logical Implications
    Artificial Intelligence 174 (2010) 1540–1569 Contents lists available at ScienceDirect Artificial Intelligence www.elsevier.com/locate/artint Learning complex action models with quantifiers and logical implications ∗ Hankz Hankui Zhuo a,b, Qiang Yang b, , Derek Hao Hu b,LeiLia a Department of Computer Science, Sun Yat-sen University, Guangzhou, China, 510275 b Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong article info abstract Article history: Automated planning requires action models described using languages such as the Planning Received 2 January 2010 Domain Definition Language (PDDL) as input, but building action models from scratch is a Received in revised form 4 September 2010 very difficult and time-consuming task, even for experts. This is because it is difficult to Accepted 5 September 2010 formally describe all conditions and changes, reflected in the preconditions and effects of Available online 29 September 2010 action models. In the past, there have been algorithms that can automatically learn simple Keywords: action models from plan traces. However, there are many cases in the real world where Action model learning we need more complicated expressions based on universal and existential quantifiers, Machine learning as well as logical implications in action models to precisely describe the underlying Knowledge engineering mechanisms of the actions. Such complex action models cannot be learned using many Automated planning previous algorithms. In this article, we present a novel algorithm called LAMP (Learning Action Models from Plan traces), to learn action models with quantifiers and logical implications from a set of observed plan traces with only partially observed intermediate state information.
    [Show full text]
  • The Architecture of a Multilayer Perceptron for Actor-Critic Algorithm with Energy-Based Policy Naoto Yoshida
    The Architecture of a Multilayer Perceptron for Actor-Critic Algorithm with Energy-based Policy Naoto Yoshida To cite this version: Naoto Yoshida. The Architecture of a Multilayer Perceptron for Actor-Critic Algorithm with Energy- based Policy. 2015. hal-01138709v2 HAL Id: hal-01138709 https://hal.archives-ouvertes.fr/hal-01138709v2 Preprint submitted on 19 Oct 2015 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. The Architecture of a Multilayer Perceptron for Actor-Critic Algorithm with Energy-based Policy Naoto Yoshida School of Biomedical Engineering Tohoku University Sendai, Aramaki Aza Aoba 6-6-01 Email: [email protected] Abstract—Learning and acting in a high dimensional state- when the actions are represented by binary vector actions action space are one of the central problems in the reinforcement [4][5][6][7][8]. Energy-based RL algorithms represent a policy learning (RL) communities. The recent development of model- based on energy-based models [9]. In this approach the prob- free reinforcement learning algorithms based on an energy- based model have been shown to be effective in such domains. ability of the state-action pair is represented by the Boltzmann However, since the previous algorithms used neural networks distribution and the energy function, and the policy is a with stochastic hidden units, these algorithms required Monte conditional probability given the current state of the agent.
    [Show full text]