
Reinforcement Learning
Markov decision process & Dynamic programming
value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value iteration, policy iteration
Vien Ngo
MLR, University of Stuttgart

Outline
• Reinforcement learning problem
  – Elements of reinforcement learning
  – Markov process
  – Markov reward process
  – Markov decision process
• Dynamic programming
  – Value iteration
  – Policy iteration

Reinforcement Learning Problem

Elements of the Reinforcement Learning Problem
• Agent vs. Environment
• State, Action, Reward, Goal, Return
• The Markov property
• Markov decision process
• Bellman equations
• Optimality and Approximation

Agent vs. Environment
• The learner and decision-maker is called the agent.
• The thing it interacts with, comprising everything outside the agent, is called the environment.
• The environment is formally modeled as a Markov decision process, a mathematically principled framework for sequential decision problems.
(from the Introduction to RL book, Sutton & Barto)

The Markov property
A state that summarizes past sensations compactly yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property. (Introduction to RL book, Sutton & Barto)
• Formally,
  Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t, …, s_0, a_0, r_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t)
• Examples: the current configuration of the chess board when predicting the next moves; the position and velocity of the cart together with the angle and angular velocity of the pole in the cart-pole domain.
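A minimal numerical sketch of the Markov property: simulate a small two-state chain and check empirically that conditioning on extra history does not change the next-state distribution. The two-state transition matrix below is an assumed illustration value, not one of the examples from the slides.

```python
import numpy as np

# Assumed two-state Markov chain (states 0 and 1); the transition
# probabilities are arbitrary values chosen only for illustration.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)

# Simulate a long trajectory s_0, s_1, ..., s_{T-1}.
T = 200_000
states = np.empty(T, dtype=int)
states[0] = 0
for t in range(T - 1):
    states[t + 1] = rng.choice(2, p=P[states[t]])

# Empirical Pr(s_{t+1} = 1 | s_t = 0): condition on the current state only.
cur0 = states[1:-1] == 0
p_given_cur = np.mean(states[2:][cur0] == 1)

# Empirical Pr(s_{t+1} = 1 | s_t = 0, s_{t-1} = 1): condition on more history.
cur0_prev1 = (states[1:-1] == 0) & (states[:-2] == 1)
p_given_hist = np.mean(states[2:][cur0_prev1] == 1)

# For a Markov chain both estimates are close to P[0, 1] = 0.1:
# the extra history adds no information beyond the current state.
print(p_given_cur, p_given_hist)
```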
Markov Process
• A Markov process (Markov chain) is defined as a 2-tuple (S, P).
  – S is a state space.
  – P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s).

Markov Process: Example
Recycling Robot's Markov chain
[Figure: two states, "Battery: high" and "Battery: low", connected by the transitions search, wait, recharge and stop, with probabilities such as 0.9, 0.5, 0.1 and 1.0 on the edges.]

Markov Reward Process
• A Markov reward process is defined as a 4-tuple (S, P, R, γ).
  – S is a state space of n states.
  – P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s).
  – R is a reward matrix with entries R_s.
  – γ is a discount factor, γ ∈ [0, 1].
• The total return is
  ρ_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + …

Markov Reward Process: Example
[Figure: the recycling robot's chain with each transition labeled by a probability and a reward, e.g. 0.5; 0.0 and 0.1; −10.0.]

Markov Reward Process: Bellman Equations
• The value function V(s):
  V(s) = E[ρ_t | s_t = s] = E[R_t + γ V(s_{t+1}) | s_t = s]
• In vector form V = R + γPV, hence V = (I − γP)^{-1} R (checked numerically in the sketch after the MDP slides below).
  We will revisit this for MDPs.

Markov Reward Process: Discount Factor?
Many meanings:
• Weighing the importance of differently timed rewards, with higher importance on rewards received sooner.
• Representing uncertainty about whether future rewards will be received at all, i.e. a geometrically distributed horizon.
• Representing a human's or animal's preference over the ordering of received rewards.

Markov decision process
• A reinforcement learning problem that satisfies the Markov property is called a Markov decision process, or MDP.
• MDP = {S, A, T, R, P_0, γ}.
  – S: the set of all possible states.
  – A: the set of all possible actions.
  – T: a transition function which defines the probability T(s', s, a) = Pr(s' | s, a).
  – R: a reward function which defines the reward R(s, a).
  – P_0: the probability distribution over initial states.
  – γ ∈ [0, 1]: a discount factor.

Example: Recycling Robot MDP
[Figure: the recycling robot formulated as an MDP.]

• A policy is a mapping from state space to action space, µ : S ↦ A.
[Figure: a trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, …]
• Objective function:
  – Expected average reward:
    η = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(s_t, a_t, s_{t+1}) ]
  – Expected discounted reward:
    η_γ = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}) ]
• Singh et al., 1994:
  η_γ = (1 / (1 − γ)) η
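The closed-form Markov-reward-process solution V = (I − γP)^{-1} R stated above can be checked numerically, as in the sketch below. The 3-state transition matrix and rewards are assumed illustration values, only loosely inspired by the recycling-robot example, not the numbers from its diagram.

```python
import numpy as np

# Assumed 3-state Markov reward process; P and R are illustration values.
P = np.array([[0.7, 0.3, 0.0],
              [0.4, 0.5, 0.1],
              [0.0, 0.0, 1.0]])   # row-stochastic transition matrix P_ss'
R = np.array([1.0, 0.0, -10.0])   # expected reward R_s received in each state
gamma = 0.9

# Closed-form solution of the Bellman equation V = R + gamma * P V,
# i.e. V = (I - gamma * P)^{-1} R:
V_exact = np.linalg.solve(np.eye(3) - gamma * P, R)

# The same fixed point is reached by iterating the Bellman backup.
V = np.zeros(3)
for _ in range(2000):
    V = R + gamma * P @ V

print(V_exact)
print(V)   # both agree up to numerical tolerance
```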
Dynamic Programming
• State value functions
• Bellman equations
• Value iteration
• Policy iteration

State value function
• The value (expected discounted return) of policy π when started in state s:
  V^π(s) = E_π[ r_0 + γ r_1 + γ^2 r_2 + ⋯ | s_0 = s ]     (1)
  with discount factor γ ∈ [0, 1].
• Definition of optimality: a behavior π* is optimal iff
  ∀s : V^{π*}(s) = V*(s),  where V*(s) = max_π V^π(s)
  (simultaneously maximizing the value in all states).
  (In MDPs there always exists at least one optimal deterministic policy.)

Bellman optimality equation
• Unrolling the return one step:
  V^π(s) = E[ r_0 + γ r_1 + γ^2 r_2 + ⋯ | s_0 = s, π ]
         = E[ r_0 | s_0 = s, π ] + γ E[ r_1 + γ r_2 + ⋯ | s_0 = s, π ]
         = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) E[ r_1 + γ r_2 + ⋯ | s_1 = s', π ]
         = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V^π(s')
• We can write this in vector notation:
  V^π = R^π + γ P^π V^π
  with vectors V^π_s = V^π(s), R^π_s = R(π(s), s) and matrix P^π_{s's} = P(s' | π(s), s).
• For a stochastic policy π(a|s):
  V^π(s) = Σ_a π(a|s) R(a, s) + γ Σ_{s',a} π(a|s) P(s' | a, s) V^π(s')
• Bellman optimality equation:
  V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  (Sketch of proof: if π selected another action than argmax_a[·], then the policy π' which equals π everywhere except π'(s) = argmax_a[·] would be better.)
• This is the principle of optimality in the stochastic case (related to the Viterbi / max-product algorithm).
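The Bellman optimality equation above can be turned directly into an algorithm by repeatedly applying the backup V(s) ← max_a [ R(a, s) + γ Σ_{s'} P(s'|a, s) V(s') ] until convergence, which is the value iteration scheme listed in the outline. The sketch below does this for a small MDP whose transition tensor and rewards are assumed illustration values (not the recycling-robot numbers).

```python
import numpy as np

# Assumed toy MDP with 3 states and 2 actions; P[a, s, s'] = P(s' | a, s)
# and R[a, s] = R(a, s). All numbers are illustration values only.
P = np.array([[[0.8, 0.2, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.5, 2.0]])
gamma = 0.9

# Value iteration: iterate the Bellman optimality backup
# V(s) <- max_a [ R(a, s) + gamma * sum_s' P(s' | a, s) V(s') ].
V = np.zeros(3)
for _ in range(10_000):
    Q = R + gamma * P @ V      # Q[a, s] = R(a, s) + gamma * sum_s' P(s'|a,s) V(s')
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Greedy (optimal) policy: pi*(s) = argmax_a [ ... ].
pi_star = Q.argmax(axis=0)
print(V, pi_star)
```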