
Reinforcement Learning
Markov decision process & Dynamic programming
value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value iteration, policy iteration
Vien Ngo
MLR, University of Stuttgart

Outline
• Reinforcement learning problem
  – Elements of reinforcement learning
  – Markov process
  – Markov reward process
  – Markov decision process
• Dynamic programming
  – Value iteration
  – Policy iteration

Reinforcement Learning Problem

Elements of the Reinforcement Learning Problem
• Agent vs. Environment
• State, Action, Reward, Goal, Return
• The Markov property
• Markov decision process
• Bellman equations
• Optimality and Approximation

Agent vs. Environment
• The learner and decision-maker is called the agent.
• The thing it interacts with, comprising everything outside the agent, is called the environment.
• The environment is formally modeled as a Markov decision process, a mathematically principled framework for sequential decision problems.
(from the Introduction to RL book, Sutton & Barto)

The Markov property
A state that summarizes past sensations compactly yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property. (Introduction to RL book, Sutton & Barto)
• Formally,
  Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t, …, s_0, a_0, r_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t)
• Examples: the current configuration of the chess board when predicting the next moves; the position and velocity of the cart together with the angle and angular velocity of the pole in the cart-pole domain.
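A minimal numerical sketch of the Markov property: simulate a small two-state chain and check empirically that conditioning on extra history does not change the next-state distribution. The two-state transition matrix below is an assumed illustration value, not one of the examples from the slides.

```python
import numpy as np

# Assumed two-state Markov chain (states 0 and 1); the transition
# probabilities are arbitrary values chosen only for illustration.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)

# Simulate a long trajectory s_0, s_1, ..., s_{T-1}.
T = 200_000
states = np.empty(T, dtype=int)
states[0] = 0
for t in range(T - 1):
    states[t + 1] = rng.choice(2, p=P[states[t]])

# Empirical Pr(s_{t+1} = 1 | s_t = 0): condition on the current state only.
cur0 = states[1:-1] == 0
p_given_cur = np.mean(states[2:][cur0] == 1)

# Empirical Pr(s_{t+1} = 1 | s_t = 0, s_{t-1} = 1): condition on more history.
cur0_prev1 = (states[1:-1] == 0) & (states[:-2] == 1)
p_given_hist = np.mean(states[2:][cur0_prev1] == 1)

# For a Markov chain both estimates are close to P[0, 1] = 0.1:
# the extra history adds no information beyond the current state.
print(p_given_cur, p_given_hist)
```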
Markov Process
• A Markov process (Markov chain) is defined as a 2-tuple (S, P).
  – S is a state space.
  – P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s).

Markov Process: Example
Recycling Robot's Markov chain
[Figure: two states, "Battery: high" and "Battery: low", connected by the transitions search, wait, recharge and stop, with probabilities such as 0.9, 0.5, 0.1 and 1.0 on the edges.]

Markov Reward Process
• A Markov reward process is defined as a 4-tuple (S, P, R, γ).
  – S is a state space of n states.
  – P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s).
  – R is a reward matrix with entries R_s.
  – γ is a discount factor, γ ∈ [0, 1].
• The total return is
  ρ_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + …

Markov Reward Process: Example
[Figure: the recycling robot's chain with each transition labeled by a probability and a reward, e.g. 0.5; 0.0 and 0.1; −10.0.]

Markov Reward Process: Bellman Equations
• The value function V(s):
  V(s) = E[ρ_t | s_t = s] = E[R_t + γ V(s_{t+1}) | s_t = s]
• In vector form V = R + γPV, hence V = (I − γP)^{-1} R (checked numerically in the sketch after the MDP slides below).
  We will revisit this for MDPs.

Markov Reward Process: Discount Factor?
Many meanings:
• Weighing the importance of differently timed rewards, with higher importance on rewards received sooner.
• Representing uncertainty about whether future rewards will be received at all, i.e. a geometrically distributed horizon.
• Representing a human's or animal's preference over the ordering of received rewards.

Markov decision process
• A reinforcement learning problem that satisfies the Markov property is called a Markov decision process, or MDP.
• MDP = {S, A, T, R, P_0, γ}.
  – S: the set of all possible states.
  – A: the set of all possible actions.
  – T: a transition function which defines the probability T(s', s, a) = Pr(s' | s, a).
  – R: a reward function which defines the reward R(s, a).
  – P_0: the probability distribution over initial states.
  – γ ∈ [0, 1]: a discount factor.

Example: Recycling Robot MDP
[Figure: the recycling robot formulated as an MDP.]

• A policy is a mapping from state space to action space, µ : S ↦ A.
[Figure: a trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, …]
• Objective function:
  – Expected average reward:
    η = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(s_t, a_t, s_{t+1}) ]
  – Expected discounted reward:
    η_γ = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}) ]
• Singh et al., 1994:
  η_γ = (1 / (1 − γ)) η
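The closed-form Markov-reward-process solution V = (I − γP)^{-1} R stated above can be checked numerically, as in the sketch below. The 3-state transition matrix and rewards are assumed illustration values, only loosely inspired by the recycling-robot example, not the numbers from its diagram.

```python
import numpy as np

# Assumed 3-state Markov reward process; P and R are illustration values.
P = np.array([[0.7, 0.3, 0.0],
              [0.4, 0.5, 0.1],
              [0.0, 0.0, 1.0]])   # row-stochastic transition matrix P_ss'
R = np.array([1.0, 0.0, -10.0])   # expected reward R_s received in each state
gamma = 0.9

# Closed-form solution of the Bellman equation V = R + gamma * P V,
# i.e. V = (I - gamma * P)^{-1} R:
V_exact = np.linalg.solve(np.eye(3) - gamma * P, R)

# The same fixed point is reached by iterating the Bellman backup.
V = np.zeros(3)
for _ in range(2000):
    V = R + gamma * P @ V

print(V_exact)
print(V)   # both agree up to numerical tolerance
```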
Dynamic Programming
• State value functions
• Bellman equations
• Value iteration
• Policy iteration

State value function
• The value (expected discounted return) of policy π when started in state s:
  V^π(s) = E_π[ r_0 + γ r_1 + γ^2 r_2 + ⋯ | s_0 = s ]     (1)
  with discount factor γ ∈ [0, 1].
• Definition of optimality: a behavior π* is optimal iff
  ∀s : V^{π*}(s) = V*(s),  where V*(s) = max_π V^π(s)
  (simultaneously maximizing the value in all states).
  (In MDPs there always exists at least one optimal deterministic policy.)

Bellman optimality equation
• Unrolling the return one step:
  V^π(s) = E[ r_0 + γ r_1 + γ^2 r_2 + ⋯ | s_0 = s, π ]
         = E[ r_0 | s_0 = s, π ] + γ E[ r_1 + γ r_2 + ⋯ | s_0 = s, π ]
         = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) E[ r_1 + γ r_2 + ⋯ | s_1 = s', π ]
         = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V^π(s')
• We can write this in vector notation:
  V^π = R^π + γ P^π V^π
  with vectors V^π_s = V^π(s), R^π_s = R(π(s), s) and matrix P^π_{s's} = P(s' | π(s), s).
• For a stochastic policy π(a|s):
  V^π(s) = Σ_a π(a|s) R(a, s) + γ Σ_{s',a} π(a|s) P(s' | a, s) V^π(s')
• Bellman optimality equation:
  V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  (Sketch of proof: if π selected another action than argmax_a[·], then the policy π' which equals π everywhere except π'(s) = argmax_a[·] would be better.)
• This is the principle of optimality in the stochastic case (related to the Viterbi / max-product algorithm).
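The Bellman optimality equation above can be turned directly into an algorithm by repeatedly applying the backup V(s) ← max_a [ R(a, s) + γ Σ_{s'} P(s'|a, s) V(s') ] until convergence, which is the value iteration scheme listed in the outline. The sketch below does this for a small MDP whose transition tensor and rewards are assumed illustration values (not the recycling-robot numbers).

```python
import numpy as np

# Assumed toy MDP with 3 states and 2 actions; P[a, s, s'] = P(s' | a, s)
# and R[a, s] = R(a, s). All numbers are illustration values only.
P = np.array([[[0.8, 0.2, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.5, 2.0]])
gamma = 0.9

# Value iteration: iterate the Bellman optimality backup
# V(s) <- max_a [ R(a, s) + gamma * sum_s' P(s' | a, s) V(s') ].
V = np.zeros(3)
for _ in range(10_000):
    Q = R + gamma * P @ V      # Q[a, s] = R(a, s) + gamma * sum_s' P(s'|a,s) V(s')
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Greedy (optimal) policy: pi*(s) = argmax_a [ ... ].
pi_star = Q.argmax(axis=0)
print(V, pi_star)
```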