
Foundations of Machine Learning
Reinforcement Learning
Mehryar Mohri
Courant Institute and Google Research
[email protected]

Reinforcement Learning

An agent exploring an environment. Interactions with the environment: the agent observes a state, takes an action, and receives a reward from the environment.
Problem: find an action policy that maximizes the cumulative reward over the course of the interactions.

Key Features

Contrast with supervised learning:
• no explicit labeled training data.
• distribution defined by the actions taken.
Delayed rewards or penalties.
RL trade-off:
• exploration (of unknown states and actions) to gain more reward information; vs.
• exploitation (of known information) to optimize the reward.

Applications

Robot control, e.g., Robocup Soccer Teams (Stone et al., 1999).
Board games, e.g., TD-Gammon (Tesauro, 1995).
Elevator scheduling (Crites and Barto, 1996).
Ads placement.
Telecommunications.
Inventory management.
Dynamic radio channel assignment.

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Markov Decision Process (MDP)

Definition: a Markov Decision Process is defined by:
• a set of decision epochs {0, ..., T};
• a set of states S, possibly infinite;
• a start state or initial state s_0;
• a set of actions A, possibly infinite;
• a transition probability Pr[s' | s, a]: distribution over destination states s' = δ(s, a);
• a reward probability Pr[r | s, a]: distribution over rewards returned r = r(s, a).

Model

State observed at time t: s_t ∈ S.
Action taken at time t: a_t ∈ A.
State reached: s_{t+1} = δ(s_t, a_t).
Reward received: r_{t+1} = r(s_t, a_t).

    s_t --a_t / r_{t+1}--> s_{t+1} --a_{t+1} / r_{t+2}--> s_{t+2}

MDPs - Properties

Finite MDPs: A and S are finite sets.
Finite horizon when T < ∞.
Reward r(s, a): often a deterministic function.

Example - Robot Picking up Balls

[Figure: state-transition diagram of a ball-collecting robot MDP with states start and other; edges are labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], pickup/[1, R2], carry/[.5, R3], carry/[.5, -1].]

Policy

Definition: a policy is a mapping π : S → A.
Objective: find a policy π maximizing the expected return:
• finite horizon return: Σ_{t=0}^{T-1} r(s_t, π(s_t));
• infinite horizon return: Σ_{t=0}^{+∞} γ^t r(s_t, π(s_t)).
Theorem: for any finite MDP, there exists an optimal policy (for any start state).

Policy Value

Definition: the value of a policy π at state s is
• finite horizon:
    V_π(s) = E[ Σ_{t=0}^{T-1} r(s_t, π(s_t)) | s_0 = s ].
• infinite horizon: discount factor γ ∈ [0, 1),
    V_π(s) = E[ Σ_{t=0}^{+∞} γ^t r(s_t, π(s_t)) | s_0 = s ].
Problem: find a policy π with maximum value for all states.

Policy Evaluation

Analysis of the policy value:
    V_π(s) = E[ Σ_{t=0}^{+∞} γ^t r(s_t, π(s_t)) | s_0 = s ]
           = E[r(s, π(s))] + γ E[ Σ_{t=0}^{+∞} γ^t r(s_{t+1}, π(s_{t+1})) | s_0 = s ]
           = E[r(s, π(s))] + γ E[V_π(δ(s, π(s)))].
Bellman equations (system of linear equations):
    V_π(s) = E[r(s, π(s))] + γ Σ_{s'} Pr[s' | s, π(s)] V_π(s').
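Since the Bellman equations form a finite linear system, a fixed policy can be evaluated directly. Below is a minimal NumPy sketch (the function name, array layout, and the small example are illustrative assumptions, not from the lecture): it solves V_π = R_π + γ P_π V_π, anticipating the closed-form solution (I - γP)^{-1} R established in the theorem that follows.

```python
import numpy as np

def evaluate_policy(P, R, gamma):
    """Solve the Bellman equations V = R + gamma * P @ V for a fixed policy.

    P[s, s'] : transition probability Pr[s' | s, pi(s)] under the policy.
    R[s]     : expected reward E[r(s, pi(s))].
    gamma    : discount factor in [0, 1).
    """
    n = P.shape[0]
    # (I - gamma * P) is invertible because ||gamma * P||_inf = gamma < 1.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Illustrative two-state policy (numbers are not from the slides):
# from state 0 move to state 1 with reward 1; from state 1 stay with reward 2.
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
R = np.array([1.0, 2.0])
print(evaluate_policy(P, R, gamma=0.5))   # [3., 4.]
```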
Bellman Equation - Existence and Uniqueness

Notation:
• transition probability matrix: P_{s,s'} = Pr[s' | s, π(s)].
• value column matrix: V = V_π(s).
• expected reward column matrix: R = E[r(s, π(s))].
Theorem: for a finite MDP, Bellman's equation admits a unique solution, given by
    V_0 = (I - γP)^{-1} R.

Proof: Bellman's equation can be rewritten as V = R + γPV.
• P is a stochastic matrix, thus
    ‖P‖_∞ = max_s Σ_{s'} |P_{s,s'}| = max_s Σ_{s'} Pr[s' | s, π(s)] = 1.
• This implies that ‖γP‖_∞ = γ < 1. The eigenvalues of γP are all less than one in magnitude, and (I - γP) is invertible.
Notes: general shortest-distance problem (MM, 2002).

Optimal Policy

Definition: a policy π* with maximal value for all states s ∈ S.
• value of π* (optimal value): ∀s ∈ S, V_{π*}(s) = max_π V_π(s).
• optimal state-action value function: expected return for taking action a at state s and then following the optimal policy:
    Q*(s, a) = E[r(s, a)] + γ E[V*(δ(s, a))]
             = E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V*(s').

Optimal Values - Bellman Equations

Property: the following equalities hold:
    ∀s ∈ S, V*(s) = max_{a∈A} Q*(s, a).
Proof: by definition, for all s, V*(s) ≤ max_{a∈A} Q*(s, a).
• If for some s we had V*(s) < max_{a∈A} Q*(s, a), then the maximizing action would define a better policy.
Thus,
    V*(s) = max_{a∈A} { E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V*(s') }.

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Known Model

Setting: the environment model is known.
Problem: find the optimal policy.
Algorithms:
• value iteration.
• policy iteration.
• linear programming.

Value Iteration Algorithm

    Φ(V)(s) = max_{a∈A} { E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V(s') }
    Φ(V) = max_π { R_π + γ P_π V }.

ValueIteration(V_0)
 1  V ← V_0                (V_0 arbitrary value)
 2  while ‖V - Φ(V)‖ ≥ (1-γ)ε/γ do
 3      V ← Φ(V)
 4  return Φ(V)

VI Algorithm - Convergence

Theorem: for any initial value V_0, the sequence defined by V_{n+1} = Φ(V_n) converges to V*.
Proof: we show that Φ is γ-contracting for ‖·‖_∞; this gives existence and uniqueness of the fixed point of Φ.
• For any s ∈ S, let a*(s) be the maximizing action defining Φ(V)(s). Then, for s ∈ S and any U,
    Φ(V)(s) - Φ(U)(s) ≤ Φ(V)(s) - ( E[r(s, a*(s))] + γ Σ_{s'∈S} Pr[s' | s, a*(s)] U(s') )
                      = γ Σ_{s'∈S} Pr[s' | s, a*(s)] [V(s') - U(s')]
                      ≤ γ Σ_{s'∈S} Pr[s' | s, a*(s)] ‖V - U‖_∞ = γ ‖V - U‖_∞.

Complexity and Optimality

Complexity: convergence in O(log(1/ε)) iterations. Observe that
    ‖V_{n+1} - V_n‖_∞ ≤ γ ‖V_n - V_{n-1}‖_∞ ≤ γ^n ‖Φ(V_0) - V_0‖_∞.
Thus,
    γ^n ‖Φ(V_0) - V_0‖_∞ ≤ (1-γ)ε/γ  ⇒  n = O(log(1/ε)).
ε-Optimality: let V_{n+1} be the value returned. Then,
    ‖V* - V_{n+1}‖_∞ ≤ ‖V* - Φ(V_{n+1})‖_∞ + ‖Φ(V_{n+1}) - V_{n+1}‖_∞
                     ≤ γ ‖V* - V_{n+1}‖_∞ + γ ‖V_{n+1} - V_n‖_∞.
Thus,
    ‖V* - V_{n+1}‖_∞ ≤ γ/(1-γ) ‖V_{n+1} - V_n‖_∞ ≤ ε.

VI Algorithm - Example

[Figure: two-state MDP (states 1 and 2), edges labeled action/[probability, reward]: from state 1, action a/[3/4, 2] (stay) and a/[1/4, 2] (to state 2), and action b/[1, 2] (to state 2); from state 2, action c/[1, 2] (stay) and action d/[1, 3] (to state 1).]

    V_{n+1}(1) = max{ 2 + γ( 3/4 V_n(1) + 1/4 V_n(2) ),  2 + γ V_n(2) }
    V_{n+1}(2) = max{ 3 + γ V_n(1),  2 + γ V_n(2) }.

For V_0(1) = -1, V_0(2) = 1, γ = 1/2: V_1(1) = V_1(2) = 5/2.
But V*(1) = 14/3, V*(2) = 16/3.
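A minimal tabular sketch of the ValueIteration pseudocode above (the array layout P[a, s, s'], R[s, a] and the function names are my own conventions, not from the slides). The stopping test mirrors the condition ‖V - Φ(V)‖ ≥ (1-γ)ε/γ, so the returned Φ(V) is ε-close to V* in ‖·‖_∞; on an encoding of the two-state example it converges to roughly (14/3, 16/3), in line with the V* values stated above.

```python
import numpy as np

def bellman_operator(V, P, R, gamma):
    # Phi(V)(s) = max_a { E[r(s, a)] + gamma * sum_{s'} Pr[s' | s, a] V(s') }
    Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
    return Q.max(axis=1)

def value_iteration(P, R, gamma, eps=1e-6):
    """Tabular value iteration.

    P[a, s, s'] : transition probabilities Pr[s' | s, a].
    R[s, a]     : expected rewards E[r(s, a)].
    gamma       : discount factor in (0, 1).
    """
    V = np.zeros(P.shape[1])                 # arbitrary initial value V_0
    threshold = (1.0 - gamma) * eps / gamma
    while True:
        V_next = bellman_operator(V, P, R, gamma)
        if np.max(np.abs(V_next - V)) < threshold:
            return V_next                    # this is Phi(V), as in the pseudocode
        V = V_next

# Encoding of the two-state example: state 1 -> index 0, state 2 -> index 1;
# action index 0 stands for a in state 1 and d in state 2,
# action index 1 stands for b in state 1 and c in state 2.
P = np.array([[[0.75, 0.25],    # a from state 1
               [1.00, 0.00]],   # d from state 2
              [[0.00, 1.00],    # b from state 1
               [0.00, 1.00]]])  # c from state 2
R = np.array([[2.0, 2.0],       # state 1: rewards of (a, b)
              [3.0, 2.0]])      # state 2: rewards of (d, c)
print(value_iteration(P, R, gamma=0.5))   # ~ [4.667, 5.333] = [14/3, 16/3]
```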
Policy Iteration Algorithm

PolicyIteration(π_0)
 1  π ← π_0                (π_0 arbitrary policy)
 2  π' ← nil
 3  while (π ≠ π') do
 4      V ← V_π            (policy evaluation: solve (I - γP_π)V = R_π)
 5      π' ← π
 6      π ← argmax_π { R_π + γ P_π V }    (greedy policy improvement)
 7  return π

PI Algorithm - Convergence

Theorem: let (V_n)_{n∈N} be the sequence of policy values computed by the algorithm. Then,
    V_n ≤ V_{n+1} ≤ V*.
Proof: let π_{n+1} be the policy improvement at the nth iteration. Then, by definition,
    R_{π_{n+1}} + γ P_{π_{n+1}} V_n ≥ R_{π_n} + γ P_{π_n} V_n = V_n.
• Therefore, R_{π_{n+1}} ≥ (I - γ P_{π_{n+1}}) V_n.
• Note that (I - γ P_{π_{n+1}})^{-1} preserves ordering:
    X ≥ 0 ⇒ (I - γ P_{π_{n+1}})^{-1} X = Σ_{k=0}^{∞} (γ P_{π_{n+1}})^k X ≥ 0.
• Thus, V_{n+1} = (I - γ P_{π_{n+1}})^{-1} R_{π_{n+1}} ≥ V_n.

Notes

Two consecutive policy values can be equal only at the last iteration.
The total number of possible policies is |A|^{|S|}, thus this is the maximal possible number of iterations.
• Best upper bound known: O(|A|^{|S|} / |S|).

PI Algorithm - Example

[Figure: the same two-state MDP as in the VI example, with edges labeled action/[probability, reward]: a/[3/4, 2], a/[1/4, 2], b/[1, 2], c/[1, 2], d/[1, 3].]

Initial policy: π_0(1) = b, π_0(2) = c.
Evaluation:
    V_{π_0}(1) = 1 + γ V_{π_0}(2)
    V_{π_0}(2) = 2 + γ V_{π_0}(2).
Thus,
    V_{π_0}(1) = (1+γ)/(1-γ),   V_{π_0}(2) = 2/(1-γ).

VI and PI Algorithms - Comparison

Theorem: let (U_n)_{n∈N} be the sequence of policy values generated by the VI algorithm, and (V_n)_{n∈N} the one generated by the PI algorithm. If U_0 = V_0, then,
    ∀n ∈ N, U_n ≤ V_n ≤ V*.
Proof: we first show that Φ is monotonic.
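To close the planning part, here is a matching NumPy sketch of the PolicyIteration pseudocode (same assumed tabular encoding as the value iteration sketch; names are illustrative, not from the slides): each round evaluates the current policy exactly by solving (I - γP_π)V = R_π and then improves it greedily, stopping when the policy no longer changes.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Tabular policy iteration.

    P[a, s, s'] : transition probabilities Pr[s' | s, a].
    R[s, a]     : expected rewards E[r(s, a)].
    Returns the final greedy policy (one action index per state) and its value.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy pi_0
    prev_pi = None
    while prev_pi is None or np.any(pi != prev_pi):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = P[pi, np.arange(n_states), :]      # P_pi[s, s'] = Pr[s' | s, pi(s)]
        R_pi = R[np.arange(n_states), pi]         # R_pi[s]     = E[r(s, pi(s))]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedy policy improvement: argmax_a { E[r(s,a)] + gamma * sum_s' Pr[s'|s,a] V(s') }.
        prev_pi = pi
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi = Q.argmax(axis=1)
    return pi, V
```

On the two-state encoding from the value iteration sketch with γ = 1/2, this returns a value of about (14/3, 16/3), matching the V* computed in the VI example; per the Notes above, the number of rounds is at most |A|^{|S|}.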