Foundations of Machine Learning: Reinforcement Learning
Mehryar Mohri, Courant Institute and Google Research, [email protected]

Reinforcement Learning
Agent exploring an environment. Interactions with the environment:
[Figure: agent-environment loop; the agent sends an action to the environment, which returns a new state and a reward.]
Problem: find an action policy that maximizes cumulative reward over the course of the interactions.
Key Features
Contrast with supervised learning:
• no explicit labeled training data.
• distribution defined by actions taken.
Delayed rewards or penalties.
RL trade-off:
• exploration (of unknown states and actions) to gain more reward information; vs.
• exploitation (of known information) to optimize reward.
Applications
Robot control, e.g., RoboCup soccer teams (Stone et al., 1999).
Board games, e.g., TD-Gammon (Tesauro, 1995).
Elevator scheduling (Crites and Barto, 1996).
Ads placement.
Telecommunications.
Inventory management.
Dynamic radio channel assignment.
This Lecture
Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem
Markov Decision Process (MDP)
Definition: a Markov Decision Process (MDP) is defined by:
• a set of decision epochs $\{0, \ldots, T\}$;
• a set of states $S$, possibly infinite;
• a start state or initial state $s_0$;
• a set of actions $A$, possibly infinite;
• a transition probability $\Pr[s' \mid s, a]$: distribution over destination states $s' = \delta(s, a)$;
• a reward probability $\Pr[r \mid s, a]$: distribution over rewards returned $r = r(s, a)$.
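A minimal sketch of how such a finite MDP can be represented numerically, assuming integer-indexed states and actions; the array names P and R below are illustrative, not from the lecture:

```python
import numpy as np

n_states, n_actions = 2, 3   # illustrative sizes

# P[a, s, s2] = Pr[s2 | s, a]: one row-stochastic matrix per action
# (uniform here only so that each row sums to 1).
P = np.ones((n_actions, n_states, n_states)) / n_states

# R[s, a] = E[r(s, a)]: the reward distribution summarized by its expectation.
R = np.zeros((n_states, n_actions))
```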
Model
State observed at time $t$: $s_t \in S$.
Action taken at time $t$: $a_t \in A$.
State reached: $s_{t+1} = \delta(s_t, a_t)$.
Reward received: $r_{t+1} = r(s_t, a_t)$.
[Figure: agent-environment loop and the resulting trajectory $s_t \xrightarrow{a_t / r_{t+1}} s_{t+1} \xrightarrow{a_{t+1} / r_{t+2}} s_{t+2}$.]
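A hedged sketch of one step of this interaction, reusing the illustrative P and R arrays above: the next state is sampled from $\Pr[\cdot \mid s, a]$, and the reward is treated as a deterministic function $r(s, a)$ (a common simplification, see the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def step(P, R, s, a):
    """One interaction: sample s_{t+1} ~ Pr[. | s_t, a_t] and return r_{t+1} = r(s_t, a_t)."""
    s_next = rng.choice(P.shape[-1], p=P[a, s])
    return s_next, R[s, a]
```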
MDPs - Properties
Finite MDPs: $A$ and $S$ finite sets.
Finite horizon when $T < +\infty$.
Reward $r(s, a)$: often a deterministic function.
Example - Robot Picking up Balls
[Figure: two-state MDP with states start and other; edges labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], carry/[.5, R3], carry/[.5, -1], pickup/[1, R2].]
Policy
Definition: a policy is a mapping $\pi \colon S \to A$.
Objective: find a policy $\pi$ maximizing the expected return:
• finite horizon return: $\sum_{t=0}^{T-1} r\big(s_t, \pi(s_t)\big)$;
• infinite horizon return: $\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big)$.
Theorem: for any finite MDP, there exists an optimal policy (for any start state).
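For concreteness, a small sketch (the function name is assumed) of the discounted return of a finite reward sequence, truncating the infinite-horizon sum at the trajectory length:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over a finite reward trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: discounted_return([1.0, 0.0, 1.0], gamma=0.5) == 1.0 + 0.0 + 0.25 == 1.25.
```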
Policy Value
Definition: the value of a policy $\pi$ at state $s$ is
• finite horizon: $V_\pi(s) = \mathbb{E}\big[\sum_{t=0}^{T-1} r\big(s_t, \pi(s_t)\big) \,\big|\, s_0 = s\big]$;
• infinite horizon: for a discount factor $\gamma \in [0, 1)$, $V_\pi(s) = \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big) \,\big|\, s_0 = s\big]$.
Problem: find a policy $\pi$ with maximum value for all states.
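Since these values are expectations over random trajectories, one way to approximate them is to average truncated discounted returns over simulated episodes; a hedged sketch reusing step and discounted_return from the sketches above:

```python
def mc_policy_value(P, R, pi, s, gamma=0.9, horizon=100, n_episodes=1000):
    """Monte Carlo estimate of V_pi(s); pi is an array mapping each state to an action."""
    total = 0.0
    for _ in range(n_episodes):
        state, rewards = s, []
        for _ in range(horizon):            # truncate the infinite-horizon sum
            state, r = step(P, R, state, pi[state])
            rewards.append(r)
        total += discounted_return(rewards, gamma)
    return total / n_episodes
```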
Policy Evaluation
Analysis of the policy value:
$$V_\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big) \,\Big|\, s_0 = s\Big]$$
$$= \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_{t+1}, \pi(s_{t+1})\big) \,\Big|\, s_0 = s\Big]$$
$$= \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\big[V_\pi\big(\delta(s, \pi(s))\big)\big].$$
Bellman equations (system of linear equations):
$$V_\pi(s) = \mathbb{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s').$$
Bellman Equation - Existence and Uniqueness
Notation:
• transition probability matrix $\mathbf{P}$: $\mathbf{P}_{s,s'} = \Pr[s' \mid s, \pi(s)]$;
• value column matrix $\mathbf{V}$: $\mathbf{V}_s = V_\pi(s)$;
• expected reward column matrix $\mathbf{R}$: $\mathbf{R}_s = \mathbb{E}[r(s, \pi(s))]$.
Theorem: for a finite MDP, Bellman's equation admits a unique solution, given by
$$\mathbf{V} = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}.$$
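A small sketch of exact policy evaluation via this closed form, on an illustrative 2-state chain with made-up numbers; solving the linear system is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],    # P[s, s2] = Pr[s2 | s, pi(s)]
              [0.3, 0.7]])
R = np.array([1.0, 0.0])     # R[s] = E[r(s, pi(s))]

# Solve (I - gamma P) V = R rather than computing the inverse.
V = np.linalg.solve(np.eye(2) - gamma * P, R)
```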
Bellman Equation - Existence and Uniqueness
Proof: Bellman's equation can be rewritten as $\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}$.
• $\mathbf{P}$ is a stochastic matrix, thus
$$\|\mathbf{P}\|_\infty = \max_s \sum_{s'} |\mathbf{P}_{s,s'}| = \max_s \sum_{s'} \Pr[s' \mid s, \pi(s)] = 1.$$
• This implies that $\|\gamma \mathbf{P}\|_\infty = \gamma < 1$. The eigenvalues of $\gamma \mathbf{P}$ are therefore all less than one in modulus, and $(\mathbf{I} - \gamma \mathbf{P})$ is invertible.
Notes: general shortest distance problem (MM, 2002).
Optimal Policy
Definition: a policy $\pi^*$ with maximal value for all states $s \in S$:
• value of $\pi^*$ (optimal value): $\forall s \in S,\ V_{\pi^*}(s) = \max_\pi V_\pi(s)$;
• optimal state-action value function $Q^*$: the expected return for taking action $a$ at state $s$ and then following the optimal policy:
$$Q^*(s, a) = \mathbb{E}[r(s, a)] + \gamma\, \mathbb{E}[V^*(\delta(s, a))] = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s').$$
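Given the model and $V^*$, $Q^*$ can thus be tabulated directly; a hedged sketch with the array layout assumed in the earlier sketches:

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    """Q[s, a] = R[s, a] + gamma * sum_{s2} P[a, s, s2] * V[s2]."""
    return R + gamma * (P @ V).T   # P @ V has shape (n_actions, n_states); transpose to (s, a)
```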
Optimal Values - Bellman Equations
Property: the following equalities hold:
$$\forall s \in S,\quad V^*(s) = \max_{a \in A} Q^*(s, a).$$
Proof: by definition, for all $s$, $V^*(s) \le \max_{a \in A} Q^*(s, a)$.
• If for some $s$ we had $V^*(s) < \max_{a \in A} Q^*(s, a)$, then taking the maximizing action at $s$ and following the optimal policy thereafter would define a better policy. Thus,
$$V^*(s) = \max_{a \in A} \Big\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') \Big\}.$$
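This last equality is the Bellman optimality equation; applying its right-hand side as an operator to an arbitrary value vector is the basic update behind the value iteration algorithm introduced below. A minimal sketch, with the same assumed array layout:

```python
import numpy as np

def bellman_optimality_backup(P, R, V, gamma):
    """(T V)[s] = max_a { R[s, a] + gamma * sum_{s2} P[a, s, s2] * V[s2] }."""
    return np.max(R + gamma * (P @ V).T, axis=1)
```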
This Lecture
Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem
Known Model
Setting: environment model known.
Problem: find the optimal policy.
Algorithms:
• value iteration.
• policy iteration.
• linear programming.
Value Iteration Algorithm