
Foundations of Machine Learning
Reinforcement Learning
Mehryar Mohri
Courant Institute and Google Research
[email protected]

Reinforcement Learning

An agent exploring an environment. Interactions with the environment: the agent observes a state, takes an action, and receives a reward from the environment.
Problem: find an action policy that maximizes the cumulative reward over the course of the interactions.

Key Features

Contrast with supervised learning:
• no explicit labeled training data.
• distribution defined by the actions taken.
Delayed rewards or penalties.
RL trade-off:
• exploration (of unknown states and actions) to gain more reward information; vs.
• exploitation (of known information) to optimize the reward.

Applications

Robot control, e.g., Robocup Soccer Teams (Stone et al., 1999).
Board games, e.g., TD-Gammon (Tesauro, 1995).
Elevator scheduling (Crites and Barto, 1996).
Ads placement.
Telecommunications.
Inventory management.
Dynamic radio channel assignment.

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Markov Decision Process (MDP)

Definition: a Markov Decision Process is defined by:
• a set of decision epochs {0, ..., T};
• a set of states S, possibly infinite;
• a start state or initial state s_0;
• a set of actions A, possibly infinite;
• a transition probability Pr[s' | s, a]: distribution over destination states s' = δ(s, a);
• a reward probability Pr[r | s, a]: distribution over rewards returned r = r(s, a).

Model

State observed at time t: s_t ∈ S.
Action taken at time t: a_t ∈ A.
State reached: s_{t+1} = δ(s_t, a_t).
Reward received: r_{t+1} = r(s_t, a_t).

    s_t --a_t / r_{t+1}--> s_{t+1} --a_{t+1} / r_{t+2}--> s_{t+2}

MDPs - Properties

Finite MDPs: A and S are finite sets.
Finite horizon when T < ∞.
Reward r(s, a): often a deterministic function.

Example - Robot Picking up Balls

[Figure: state-transition diagram of a ball-collecting robot MDP with states start and other; edges are labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], pickup/[1, R2], carry/[.5, R3], carry/[.5, -1].]

Policy

Definition: a policy is a mapping π : S → A.
Objective: find a policy π maximizing the expected return:
• finite horizon return: Σ_{t=0}^{T-1} r(s_t, π(s_t));
• infinite horizon return: Σ_{t=0}^{+∞} γ^t r(s_t, π(s_t)).
Theorem: for any finite MDP, there exists an optimal policy (for any start state).

Policy Value

Definition: the value of a policy π at state s is
• finite horizon:
    V_π(s) = E[ Σ_{t=0}^{T-1} r(s_t, π(s_t)) | s_0 = s ].
• infinite horizon: discount factor γ ∈ [0, 1),
    V_π(s) = E[ Σ_{t=0}^{+∞} γ^t r(s_t, π(s_t)) | s_0 = s ].
Problem: find a policy π with maximum value for all states.

Policy Evaluation

Analysis of the policy value:
    V_π(s) = E[ Σ_{t=0}^{+∞} γ^t r(s_t, π(s_t)) | s_0 = s ]
           = E[r(s, π(s))] + γ E[ Σ_{t=0}^{+∞} γ^t r(s_{t+1}, π(s_{t+1})) | s_0 = s ]
           = E[r(s, π(s))] + γ E[V_π(δ(s, π(s)))].
Bellman equations (system of linear equations):
    V_π(s) = E[r(s, π(s))] + γ Σ_{s'} Pr[s' | s, π(s)] V_π(s').
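Since the Bellman equations form a finite linear system, a fixed policy can be evaluated directly. Below is a minimal NumPy sketch (the function name, array layout, and the small example are illustrative assumptions, not from the lecture): it solves V_π = R_π + γ P_π V_π, anticipating the closed-form solution (I - γP)^{-1} R established in the theorem that follows.

```python
import numpy as np

def evaluate_policy(P, R, gamma):
    """Solve the Bellman equations V = R + gamma * P @ V for a fixed policy.

    P[s, s'] : transition probability Pr[s' | s, pi(s)] under the policy.
    R[s]     : expected reward E[r(s, pi(s))].
    gamma    : discount factor in [0, 1).
    """
    n = P.shape[0]
    # (I - gamma * P) is invertible because ||gamma * P||_inf = gamma < 1.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Illustrative two-state policy (numbers are not from the slides):
# from state 0 move to state 1 with reward 1; from state 1 stay with reward 2.
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
R = np.array([1.0, 2.0])
print(evaluate_policy(P, R, gamma=0.5))   # [3., 4.]
```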
Bellman Equation - Existence and Uniqueness

Notation:
• transition probability matrix: P_{s,s'} = Pr[s' | s, π(s)].
• value column matrix: V = V_π(s).
• expected reward column matrix: R = E[r(s, π(s))].
Theorem: for a finite MDP, Bellman's equation admits a unique solution, given by
    V_0 = (I - γP)^{-1} R.

Proof: Bellman's equation can be rewritten as V = R + γPV.
• P is a stochastic matrix, thus
    ‖P‖_∞ = max_s Σ_{s'} |P_{s,s'}| = max_s Σ_{s'} Pr[s' | s, π(s)] = 1.
• This implies that ‖γP‖_∞ = γ < 1. The eigenvalues of γP are all less than one in magnitude, and (I - γP) is invertible.
Notes: general shortest-distance problem (MM, 2002).

Optimal Policy

Definition: a policy π* with maximal value for all states s ∈ S.
• value of π* (optimal value): ∀s ∈ S, V_{π*}(s) = max_π V_π(s).
• optimal state-action value function: expected return for taking action a at state s and then following the optimal policy:
    Q*(s, a) = E[r(s, a)] + γ E[V*(δ(s, a))]
             = E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V*(s').

Optimal Values - Bellman Equations

Property: the following equalities hold:
    ∀s ∈ S, V*(s) = max_{a∈A} Q*(s, a).
Proof: by definition, for all s, V*(s) ≤ max_{a∈A} Q*(s, a).
• If for some s we had V*(s) < max_{a∈A} Q*(s, a), then the maximizing action would define a better policy.
Thus,
    V*(s) = max_{a∈A} { E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V*(s') }.

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Known Model

Setting: the environment model is known.
Problem: find the optimal policy.
Algorithms:
• value iteration.
• policy iteration.
• linear programming.

Value Iteration Algorithm

    Φ(V)(s) = max_{a∈A} { E[r(s, a)] + γ Σ_{s'∈S} Pr[s' | s, a] V(s') }
    Φ(V) = max_π { R_π + γ P_π V }.

ValueIteration(V_0)
 1  V ← V_0                (V_0 arbitrary value)
 2  while ‖V - Φ(V)‖ ≥ (1-γ)ε/γ do
 3      V ← Φ(V)
 4  return Φ(V)

VI Algorithm - Convergence

Theorem: for any initial value V_0, the sequence defined by V_{n+1} = Φ(V_n) converges to V*.
Proof: we show that Φ is γ-contracting for ‖·‖_∞; this gives existence and uniqueness of the fixed point of Φ.
• For any s ∈ S, let a*(s) be the maximizing action defining Φ(V)(s). Then, for s ∈ S and any U,
    Φ(V)(s) - Φ(U)(s) ≤ Φ(V)(s) - ( E[r(s, a*(s))] + γ Σ_{s'∈S} Pr[s' | s, a*(s)] U(s') )
                      = γ Σ_{s'∈S} Pr[s' | s, a*(s)] [V(s') - U(s')]
                      ≤ γ Σ_{s'∈S} Pr[s' | s, a*(s)] ‖V - U‖_∞ = γ ‖V - U‖_∞.

Complexity and Optimality

Complexity: convergence in O(log(1/ε)) iterations. Observe that
    ‖V_{n+1} - V_n‖_∞ ≤ γ ‖V_n - V_{n-1}‖_∞ ≤ γ^n ‖Φ(V_0) - V_0‖_∞.
Thus,
    γ^n ‖Φ(V_0) - V_0‖_∞ ≤ (1-γ)ε/γ  ⇒  n = O(log(1/ε)).
ε-Optimality: let V_{n+1} be the value returned. Then,
    ‖V* - V_{n+1}‖_∞ ≤ ‖V* - Φ(V_{n+1})‖_∞ + ‖Φ(V_{n+1}) - V_{n+1}‖_∞
                     ≤ γ ‖V* - V_{n+1}‖_∞ + γ ‖V_{n+1} - V_n‖_∞.
Thus,
    ‖V* - V_{n+1}‖_∞ ≤ γ/(1-γ) ‖V_{n+1} - V_n‖_∞ ≤ ε.

VI Algorithm - Example

[Figure: two-state MDP (states 1 and 2), edges labeled action/[probability, reward]: from state 1, action a/[3/4, 2] (stay) and a/[1/4, 2] (to state 2), and action b/[1, 2] (to state 2); from state 2, action c/[1, 2] (stay) and action d/[1, 3] (to state 1).]

    V_{n+1}(1) = max{ 2 + γ( 3/4 V_n(1) + 1/4 V_n(2) ),  2 + γ V_n(2) }
    V_{n+1}(2) = max{ 3 + γ V_n(1),  2 + γ V_n(2) }.

For V_0(1) = -1, V_0(2) = 1, γ = 1/2: V_1(1) = V_1(2) = 5/2.
But V*(1) = 14/3, V*(2) = 16/3.
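A minimal tabular sketch of the ValueIteration pseudocode above (the array layout P[a, s, s'], R[s, a] and the function names are my own conventions, not from the slides). The stopping test mirrors the condition ‖V - Φ(V)‖ ≥ (1-γ)ε/γ, so the returned Φ(V) is ε-close to V* in ‖·‖_∞; on an encoding of the two-state example it converges to roughly (14/3, 16/3), in line with the V* values stated above.

```python
import numpy as np

def bellman_operator(V, P, R, gamma):
    # Phi(V)(s) = max_a { E[r(s, a)] + gamma * sum_{s'} Pr[s' | s, a] V(s') }
    Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
    return Q.max(axis=1)

def value_iteration(P, R, gamma, eps=1e-6):
    """Tabular value iteration.

    P[a, s, s'] : transition probabilities Pr[s' | s, a].
    R[s, a]     : expected rewards E[r(s, a)].
    gamma       : discount factor in (0, 1).
    """
    V = np.zeros(P.shape[1])                 # arbitrary initial value V_0
    threshold = (1.0 - gamma) * eps / gamma
    while True:
        V_next = bellman_operator(V, P, R, gamma)
        if np.max(np.abs(V_next - V)) < threshold:
            return V_next                    # this is Phi(V), as in the pseudocode
        V = V_next

# Encoding of the two-state example: state 1 -> index 0, state 2 -> index 1;
# action index 0 stands for a in state 1 and d in state 2,
# action index 1 stands for b in state 1 and c in state 2.
P = np.array([[[0.75, 0.25],    # a from state 1
               [1.00, 0.00]],   # d from state 2
              [[0.00, 1.00],    # b from state 1
               [0.00, 1.00]]])  # c from state 2
R = np.array([[2.0, 2.0],       # state 1: rewards of (a, b)
              [3.0, 2.0]])      # state 2: rewards of (d, c)
print(value_iteration(P, R, gamma=0.5))   # ~ [4.667, 5.333] = [14/3, 16/3]
```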
Policy Iteration Algorithm

PolicyIteration(π_0)
 1  π ← π_0                (π_0 arbitrary policy)
 2  π' ← nil
 3  while (π ≠ π') do
 4      V ← V_π            (policy evaluation: solve (I - γP_π)V = R_π)
 5      π' ← π
 6      π ← argmax_π { R_π + γ P_π V }    (greedy policy improvement)
 7  return π

PI Algorithm - Convergence

Theorem: let (V_n)_{n∈N} be the sequence of policy values computed by the algorithm. Then,
    V_n ≤ V_{n+1} ≤ V*.
Proof: let π_{n+1} be the policy improvement at the nth iteration. Then, by definition,
    R_{π_{n+1}} + γ P_{π_{n+1}} V_n ≥ R_{π_n} + γ P_{π_n} V_n = V_n.
• Therefore, R_{π_{n+1}} ≥ (I - γ P_{π_{n+1}}) V_n.
• Note that (I - γ P_{π_{n+1}})^{-1} preserves ordering:
    X ≥ 0 ⇒ (I - γ P_{π_{n+1}})^{-1} X = Σ_{k=0}^{∞} (γ P_{π_{n+1}})^k X ≥ 0.
• Thus, V_{n+1} = (I - γ P_{π_{n+1}})^{-1} R_{π_{n+1}} ≥ V_n.

Notes

Two consecutive policy values can be equal only at the last iteration.
The total number of possible policies is |A|^{|S|}, thus this is the maximal possible number of iterations.
• Best upper bound known: O(|A|^{|S|} / |S|).

PI Algorithm - Example

[Figure: the same two-state MDP as in the VI example, with edges labeled action/[probability, reward]: a/[3/4, 2], a/[1/4, 2], b/[1, 2], c/[1, 2], d/[1, 3].]

Initial policy: π_0(1) = b, π_0(2) = c.
Evaluation:
    V_{π_0}(1) = 1 + γ V_{π_0}(2)
    V_{π_0}(2) = 2 + γ V_{π_0}(2).
Thus,
    V_{π_0}(1) = (1+γ)/(1-γ),   V_{π_0}(2) = 2/(1-γ).

VI and PI Algorithms - Comparison

Theorem: let (U_n)_{n∈N} be the sequence of policy values generated by the VI algorithm, and (V_n)_{n∈N} the one generated by the PI algorithm. If U_0 = V_0, then,
    ∀n ∈ N, U_n ≤ V_n ≤ V*.
Proof: we first show that Φ is monotonic.
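To close the planning part, here is a matching NumPy sketch of the PolicyIteration pseudocode (same assumed tabular encoding as the value iteration sketch; names are illustrative, not from the slides): each round evaluates the current policy exactly by solving (I - γP_π)V = R_π and then improves it greedily, stopping when the policy no longer changes.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Tabular policy iteration.

    P[a, s, s'] : transition probabilities Pr[s' | s, a].
    R[s, a]     : expected rewards E[r(s, a)].
    Returns the final greedy policy (one action index per state) and its value.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy pi_0
    prev_pi = None
    while prev_pi is None or np.any(pi != prev_pi):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = P[pi, np.arange(n_states), :]      # P_pi[s, s'] = Pr[s' | s, pi(s)]
        R_pi = R[np.arange(n_states), pi]         # R_pi[s]     = E[r(s, pi(s))]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedy policy improvement: argmax_a { E[r(s,a)] + gamma * sum_s' Pr[s'|s,a] V(s') }.
        prev_pi = pi
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi = Q.argmax(axis=1)
    return pi, V
```

On the two-state encoding from the value iteration sketch with γ = 1/2, this returns a value of about (14/3, 16/3), matching the V* computed in the VI example; per the Notes above, the number of rounds is at most |A|^{|S|}.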