
Q-Learning Issues and Related Models

Reinforcement Learning

Davide Bacciu

Computational Intelligence & Machine Learning Group
Dipartimento di Informatica, Università di Pisa
[email protected]

Introduzione all'Intelligenza Artificiale - A.A. 2012/2013

Lecture Outline

1 Reinforcement Learning: The Problem, Definitions, Challenges

2 Q-Learning: The Problem, Algorithm, Example

3 Issues and Related Models: Q-Learning Issues, SARSA Learning, Summary

Motivation

In some applications it is difficult or impossible to provide the learner with a gold standard $(x_n, y_n)$
- Game playing: what is the right (target) move given the current situation? It is easier to provide information about wins and losses
- Optimal control
- Planning
- Resource allocation (job scheduling, channel allocation)

Reinforcement Learning (RL)

Learning to choose the best action based on rewards or punishments from the interacting environment
- Actions need to maximize the amount of received reward

Data Characterization

Observations $s_t$
- The environment state $s_t$ at time $t$ as perceived by the learner
- Similar to the input $x_n$ in supervised/unsupervised learning
- Assume finite states here

Actions $a_t$
- An action $a_t$ that, when performed, can alter the current state $s_t$
- The target action $a_t^*$ is unknown (no supervision)
- Assume finite actions here

Rewards $r(s, a)$
- A reward function $r(s, a)$ that assigns a numerical value to an action $a$ performed in state $s$
- The actual reward might not be available at every time step $t$

The Reinforcement Learning Task

A learning agent can perceive a set of environment states S and has a set of actions A it can perform

At each time instant $t$ the agent is in a state $s_t$, chooses an action $a_t$ and performs it
- The environment provides a reward $r_t = r(s_t, a_t)$ and produces a new state $s_{t+1} = \delta(s_t, a_t)$
- $r$ and $\delta$ belong to the environment and can be unknown to the agent

Markov Decision Process (MDP)

$r(\cdot)$ and $\delta(\cdot)$ depend only on the current state $s_t$ and action $a_t$

Learning Task
Find a policy $\pi : S \rightarrow A$ that will maximize the cumulative reward

What is a Cumulative Reward?

The total reward accumulated over time by following the actions generated by a policy π, starting from an initial state

Several ways of cumulating rewards:
- Finite horizon reward: $\sum_{i=0}^{h} r_{t+i}$
- Average reward: $\lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h-1} r_{t+i}$
- Infinite discounted reward: $\sum_{i=0}^{\infty} \gamma^i r_{t+i}$, with $0 \le \gamma < 1$
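As an illustration (not part of the original slides), the three notions of cumulative reward can be computed directly from a reward sequence; the `rewards`, `h` and `gamma` values below are made-up examples.

```python
# Hypothetical reward sequence r_t, r_{t+1}, ..., used only for illustration
rewards = [1.0, 0.0, 2.0, 1.0, 0.0, 3.0]
gamma = 0.9   # discount factor, 0 <= gamma < 1
h = 4         # finite horizon

# Finite horizon reward: sum of the rewards for i = 0..h
finite_horizon = sum(rewards[: h + 1])

# Average reward: mean over a horizon, approximating the limit for h -> infinity
average = sum(rewards[:h]) / h

# Infinite discounted reward, truncated to the available sequence
discounted = sum(gamma ** i * r for i, r in enumerate(rewards))

print(finite_horizon, average, discounted)
```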

RL Task Formalization

Assume the infinite discounted reward generated by policy $\pi$ starting from state $s_t$

$$V^\pi(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i} = \sum_{i=0}^{\infty} \gamma^i r(s_{t+i}, a_{t+i})$$

where $\gamma$ determines the contribution of delayed rewards vs immediate rewards

The RL task is to find

$$\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in S$$

Components of an RL Agent

Any RL agent includes one or several of these components:
- Policy $\pi$: determines the agent's behavior, i.e. which actions are selected in a given state
- Value function $V^{\cdot}(\cdot)$: measures how rewarding each state and/or action is
- Model: the agent's internal representation of the environment

Maze Example

- Environment: a grid with inaccessible points, a starting and a goal location
- States: accessible locations in the grid
- Actions: move N, S, W, E
- Rewards: −1 for each move (time step)

Maze - Policy

Agent-defined moves in each maze location

Maze - Value Function

A state-dependent value function $V^*(s)$ measuring the time required to reach the goal from each location

Maze - Model

The agent's internal representation of the environment
- State-change function $s' = \delta(s, a)$
- Rewards $r(s, a)$ for each state
- Can be based on a very partial and approximate view of the world

Exploration vs Exploitation

How does the agent choose the action? Should we always follow only the policy $\pi$?
- Exploitation: choose the action that maximizes the reward according to your learned knowledge
- Exploration: choose an action that can help improve your learned knowledge

In general, it is important to explore as well as to exploit

Exploration Strategies

How to explore?
- Choose a random action
- Choose an action that has never been selected
- Choose an action that leads to unexplored states

When to explore?

- Until time $T_{explore}$
- $\epsilon$-greedy
- Stop exploring when enough reward has been gathered
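As a rough illustration (not from the slides), the first two "how to explore" options can be implemented on top of a table of visit counts; the actions and counts below are made up.

```python
import random

ACTIONS = ["N", "S", "W", "E"]
# Hypothetical visit counts of each action in the current state
visit_count = {"N": 3, "S": 0, "W": 1, "E": 5}

def explore_random():
    # Option 1: pick an action uniformly at random
    return random.choice(ACTIONS)

def explore_never_selected():
    # Option 2: prefer an action that has never been selected, if any
    untried = [a for a in ACTIONS if visit_count[a] == 0]
    return random.choice(untried) if untried else explore_random()

print(explore_random(), explore_never_selected())
```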

Model-based vs Model-free

Model-based
- Maximal exploitation of experience
- Optimal provided enough experience
- Often computationally intractable

Model-free
- Appropriate when the model is too complex to store, learn and compute
- E.g. learn the rewards associated to state-action pairs (Q function)

The Q-Learning Problem

The RL task is to find the optimal policy $\pi^*$ s.t.

$$\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in S$$

By exploiting the definition of reward and state-dependent value function, we obtain a reformulated problem

$$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(s') \right] \quad \forall s \in S$$

where $s' = \delta(s, a)$

The Problem
Learn to choose the optimal action in a state $s$ as the action $a$ that maximizes the sum of the immediate reward $r(s, a)$ plus the discounted cumulative reward from the successor state
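A minimal sketch of this reformulation, assuming the model ($r$ and $\delta$) and the optimal value function $V^*$ are known and stored in made-up lookup tables:

```python
GAMMA = 0.9

# Hypothetical lookup tables for r(s, a), delta(s, a) and V*(s), illustration only
R = {("s1", "right"): 0.0, ("s1", "down"): 0.0}
DELTA = {("s1", "right"): "s2", ("s1", "down"): "s3"}
V_STAR = {"s2": 100.0, "s3": 81.0}

def optimal_action(s, actions):
    # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    return max(actions, key=lambda a: R[(s, a)] + GAMMA * V_STAR[DELTA[(s, a)]])

print(optimal_action("s1", ["right", "down"]))  # -> "right"
```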

The Optimal Q-function

Define $Q^*(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$

The RL problem for state $s$ becomes

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$

$Q^*(s, a)$ denotes the maximum discounted reward that can be achieved starting in state $s$ and choosing action $a$

Assume that we continue to choose actions optimally

Grid Example ($\gamma = 0.9$)

[Figure: the grid world with reward function $r(S, A)$, optimal value function $V^*(s)$, optimal Q-function $Q^*(s, a)$ and optimal policy $\pi^*$]

Q-Learning

What if we don't know the optimal $V^*(s)$?

We do not need to know it if we assume to know $Q^*(s, a)$, because
$$V^*(s') = \max_{a'} Q^*(s', a')$$
Therefore we can rewrite
$$Q^*(s, a) = r(s, a) + \gamma V^*(\delta(s, a)) = r(s, a) + \gamma \max_{a'} Q^*(\delta(s, a), a')$$

A recursive formulation based on $Q^*$ only!
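When the model ($r$ and $\delta$) is known, this recursion can be solved iteratively as a fixed point; the sketch below (not from the slides) applies a value-iteration-style update on a made-up two-state deterministic MDP.

```python
GAMMA = 0.9
STATES = ["s1", "s2"]
ACTIONS = ["a", "b"]
# Hypothetical deterministic model (rewards and transitions), for illustration only
R = {("s1", "a"): 0.0, ("s1", "b"): 5.0, ("s2", "a"): 10.0, ("s2", "b"): 0.0}
DELTA = {("s1", "a"): "s2", ("s1", "b"): "s1", ("s2", "a"): "s1", ("s2", "b"): "s2"}

# Start from Q = 0 and repeatedly apply Q(s,a) = r(s,a) + gamma * max_a' Q(delta(s,a), a')
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(200):
    Q = {(s, a): R[(s, a)] + GAMMA * max(Q[(DELTA[(s, a)], a2)] for a2 in ACTIONS)
         for s in STATES for a in ACTIONS}

print({k: round(v, 2) for k, v in Q.items()})
```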

Learning Q

What if we don't know the optimal Q-function?

Key Idea
Use approximate $Q(s, a)$ values that are updated with experience

Suppose the RL agent has an experience $\langle s, a, r, s' \rangle$
The experience provides a piece of data $\Delta Q$ computed using the current $Q$ estimate

$$\Delta Q = r + \gamma \max_{a'} Q(s', a')$$
The current $Q$ estimate can be updated using the new piece of data $\Delta Q$

Q Update

A moving average computed each time $t$ a new experience $\Delta Q_t$ arrives
$$Q_{t+1} = (1 - \alpha) Q_t + \alpha \Delta Q_t$$

The Q terms are

- $Q_t$: current estimate at time $t$
- $Q_{t+1}$: updated estimate
- $\Delta Q_t$: experience at time $t$

$\alpha \in [0, 1]$ is the learning rate
- A smaller $\alpha$ determines longer-term averages: slower convergence but fewer fluctuations
- It can be a function of time $t$

Q-Learning Algorithm

1 Initialize $Q(S, A)$ for all states $S$ and actions $A$
2 Observe the current state $s$
3 Repeat forever
  A Select an action $a$ (How?) and execute it
  B Receive a reward $r(s, a)$
  C Observe the new state $s' \leftarrow \delta(s, a)$
  D Update the estimate: $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a'} Q(s', a') \right]$

  E Update the state $s \leftarrow s'$
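A minimal tabular sketch following these steps; the tiny deterministic environment (states, transitions and rewards) is made up for illustration and is not the grid world from the slides.

```python
import random
from collections import defaultdict

# Hypothetical deterministic environment: delta(s, a) and r(s, a) as lookup tables
DELTA = {("s1", "right"): "s2", ("s1", "left"): "s1",
         ("s2", "right"): "goal", ("s2", "left"): "s1"}
REWARD = {("s2", "right"): 100.0}  # all other rewards are 0

ACTIONS = ["left", "right"]
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2

Q = defaultdict(float)  # step 1: initialize Q(S, A) to 0 for all pairs
s = "s1"                # step 2: observe the current state

for step in range(1000):               # step 3: repeat
    if random.random() < EPSILON:      # A: select an action (epsilon-greedy)
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
    r = REWARD.get((s, a), 0.0)        # B: receive the reward r(s, a)
    s_next = DELTA[(s, a)]             # C: observe the new state s' = delta(s, a)
    # D: moving-average update towards r + gamma * max_a' Q(s', a')
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target
    s = "s1" if s_next == "goal" else s_next  # E: update the state (restart at the goal)

print(dict(Q))
```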

Are Q-Estimates Sufficient?

Theorem (Watkins & Dayan, 1992)

$$Q \longrightarrow Q^*$$
Q-learning will eventually converge to the optimal policy, for any deterministic Markov Decision Process

Grid Example

[Figure: reward function $r(S, A)$ and current Q-estimates $Q_t(S, A)$ on the grid]

Agent A is in state $s_1$

Grid Example


Agent A selects the action "move right" $a_{right}$, leading to the new state $s_2 = \delta(s_1, a_{right})$

Grid Example


The reward is $r(s_1, a_{right}) = 0$. Assuming $\gamma = 0.9$ and $\alpha = 1$ yields

$$\begin{aligned} Q_t(s_1, a_{right}) &\leftarrow 0 + 1 \cdot \left[ r(s_1, a_{right}) + 0.9 \cdot \max_{a'} Q_t(s', a') \right] \\ &\leftarrow 0 + 0.9 \cdot \max\{66, 81, 100\} \\ &\leftarrow 90 \end{aligned}$$
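A one-line check of this arithmetic (the successor Q-values 66, 81, 100 come from the example above):

```python
alpha, gamma = 1.0, 0.9
q_old, r = 0.0, 0.0  # current estimate and immediate reward from the example
q_new = (1 - alpha) * q_old + alpha * (r + gamma * max(66, 81, 100))
print(q_new)  # 90.0
```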

Grid Example

[Figure: reward function $r(S, A)$ and updated Q-estimates $Q_{t+1}(S, A)$ on the grid]

The Q-estimate for state $s_1$ and action $a_{right}$ is updated

Q-Learning Issues (I)

Delayed feedback
- Rewards for an action may not be received until several time steps later
- Credit assignment

Finite search space
- Assume states and actions are discrete and finite
- What about generalization? The Q-function is basically a lookup table
- An unrealistic assumption in large or infinite spaces
- Generalize from experiences to unseen states

Q-Learning Issues (II)

Exploration strategies
- Convergence to the optimal policy is ensured only if we explore enough
- Initialize $Q(S, A)$ with values that encourage exploration
- $\epsilon$-greedy strategy: choose a random action with probability $\epsilon$ and the best action with probability $1 - \epsilon$ (see the sketch below)

Off-policy learning
- Q-learning learns the value of the optimal policy, no matter what it does
- Can lead to dangerous policies
- Looking ahead in time for the next action can be a solution
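A minimal $\epsilon$-greedy selection sketch (illustrative names only; `Q` is assumed to be a state-action lookup table):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Random action with probability epsilon, greedy action otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

# Made-up usage example with a tiny Q lookup table
Q = {("s1", "left"): 0.0, ("s1", "right"): 90.0}
print(epsilon_greedy(Q, "s1", ["left", "right"]))
```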

On-policy Learning

- On-policy procedures learn the value of the policy being followed
- The SARSA algorithm does on-policy learning
- The experience is now $\langle s, a, r, s', a' \rangle$ (SARSA)
- Looking ahead at the next action $a'$ being performed
- The next action is chosen according to the current policy

SARSA Learning Algorithm

1 Initialize $Q(S, A)$ for all states $S$ and actions $A$
2 Observe the current state $s$
3 Select an action $a$ using a policy based on the current $Q(S, A)$
4 Repeat forever
  A Execute action $a$
  B Receive a reward $r(s, a)$
  C Observe the new state $s' \leftarrow \delta(s, a)$
  D Select an action $a'$ using a policy based on the current $Q(S, A)$
  E Update the estimate

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ r(s, a) + \gamma Q(s', a') \right]$$

  F Update the current state $s \leftarrow s'$ and action $a \leftarrow a'$
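A minimal SARSA sketch under the same made-up environment conventions used in the Q-learning sketch above (dict-based $\delta$ and $r$, $\epsilon$-greedy behavior policy); note the update bootstraps on the action $a'$ actually selected, not on $\max_{a'} Q$.

```python
import random
from collections import defaultdict

DELTA = {("s1", "right"): "s2", ("s1", "left"): "s1",
         ("s2", "right"): "goal", ("s2", "left"): "s1"}  # hypothetical delta(s, a)
REWARD = {("s2", "right"): 100.0}                        # hypothetical r(s, a)
ACTIONS = ["left", "right"]
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2

def policy(Q, s):
    # epsilon-greedy policy based on the current Q estimates
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

Q = defaultdict(float)
s = "s1"
a = policy(Q, s)                        # select the first action
for step in range(1000):
    r = REWARD.get((s, a), 0.0)         # receive the reward
    s_next = DELTA[(s, a)]              # observe the new state
    s_next = "s1" if s_next == "goal" else s_next   # restart at the goal
    a_next = policy(Q, s_next)          # on-policy: next action from the same policy
    # SARSA update: the target uses Q(s', a') for the action actually chosen
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * Q[(s_next, a_next)])
    s, a = s_next, a_next               # update state and action

print(dict(Q))
```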

Take Home Messages

Reinforcement Learning allows learning when a gold standard is not available
- An agent interacting with the environment
- The environment can provide rewards
- The environment is (partially) observable
- An RL agent includes at least one of: a policy, a value function, an environment model

Q-Learning
- A practical and simple algorithm
- Learns the maximum discounted reward that can be achieved starting in a state $s$ and choosing action $a$
- Theoretical guarantees of optimality if the problem is an MDP

Take Home Issues

Exploration vs Exploitation
- Choose the action to maximize reward
- Choose the action to maximize knowledge acquisition

More open questions...
- Credit assignment
- Continuous state and action spaces
- Generalization
- Learning the value function (TD learning)
- Model-based reinforcement learning