
Foundations of Machine Learning

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Reinforcement Learning

Agent exploring an environment. Interactions with the environment: the agent observes a state, takes an action, and receives a reward.

[Figure: agent-environment loop - the agent sends an action to the environment, which returns the new state and a reward.]

Problem: find an action policy that maximizes the cumulative reward over the course of the interactions.

Key Features

Contrast with supervised learning:
• no explicit labeled training data;
• the distribution is defined by the actions taken.
Delayed rewards or penalties.
RL trade-off:
• exploration (of unknown states and actions) to gain more reward information; versus
• exploitation (of known information) to optimize the reward.

Applications

Robot control, e.g., Robocup Soccer Teams (Stone et al., 1999).
Board games, e.g., TD-Gammon (Tesauro, 1995).
Elevator scheduling (Crites and Barto, 1996).
Ads placement.
Inventory management.
Dynamic radio channel assignment.

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Markov Decision Process (MDP)

Definition: a Markov Decision Process is defined by:
• a set of decision epochs $\{0, \ldots, T\}$;
• a set of states $S$, possibly infinite;
• a start state or initial state $s_0$;
• a set of actions $A$, possibly infinite;
• a transition probability $\Pr[s' \mid s, a]$: distribution over destination states $s' = \delta(s, a)$;
• a reward probability $\Pr[r' \mid s, a]$: distribution over rewards returned $r' = r(s, a)$.

Model

State observed at time $t$: $s_t \in S$.
Action taken at time $t$: $a_t \in A$.
State reached: $s_{t+1} = \delta(s_t, a_t)$.
Reward received: $r_{t+1} = r(s_t, a_t)$.

[Figure: agent-environment loop and the resulting trajectory $s_t \xrightarrow{a_t / r_{t+1}} s_{t+1} \xrightarrow{a_{t+1} / r_{t+2}} s_{t+2}$.]

MDPs - Properties

Finite MDPs: $A$ and $S$ are finite sets.
Finite horizon when $T < \infty$.
Reward $r(s, a)$: often a deterministic function.

Example - Robot Picking up Balls

[Figure: two-state MDP with states "start" and "other"; transitions labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], pickup/[1, R2], carry/[.5, R3], carry/[.5, -1].]

Policy

Definition: a policy is a mapping $\pi\colon S \to A$.
Objective: find a policy $\pi$ maximizing the expected return:
• finite horizon return: $\sum_{t=0}^{T-1} r\big(s_t, \pi(s_t)\big)$;
• infinite horizon return: $\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big)$.
Theorem: for any finite MDP, there exists an optimal policy (for any start state).

Policy Value

Definition: the value of a policy $\pi$ at state $s$ is:
• finite horizon: $V_\pi(s) = \mathbb{E}\big[\sum_{t=0}^{T-1} r\big(s_t, \pi(s_t)\big) \mid s_0 = s\big]$;
• infinite horizon, with discount factor $\gamma \in [0, 1)$: $V_\pi(s) = \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big) \mid s_0 = s\big]$.
Problem: find a policy $\pi$ with maximum value for all states.

Policy Evaluation

Analysis of the policy value:
$$
\begin{aligned}
V_\pi(s) &= \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big) \,\Big|\, s_0 = s\Big] \\
&= \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_{t+1}, \pi(s_{t+1})\big) \,\Big|\, s_0 = s\Big] \\
&= \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\big[V_\pi\big(\delta(s, \pi(s))\big)\big].
\end{aligned}
$$
Bellman equations (system of linear equations):
$$
V_\pi(s) = \mathbb{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s').
$$

Bellman Equation - Existence and Uniqueness

Notation:
• transition probability matrix $\mathbf{P}$: $\mathbf{P}_{s,s'} = \Pr[s' \mid s, \pi(s)]$;
• value column vector $\mathbf{V}$: $\mathbf{V}_s = V_\pi(s)$;
• expected reward column vector $\mathbf{R}$: $\mathbf{R}_s = \mathbb{E}[r(s, \pi(s))]$.
Theorem: for a finite MDP, Bellman's equation admits a unique solution given by
$$
\mathbf{V} = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}.
$$
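As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of exact policy evaluation via the closed form above; the transition matrix, reward vector, and discount factor below are hypothetical placeholders.

```python
import numpy as np

def evaluate_policy(P, R, gamma):
    """Exact policy evaluation: solve (I - gamma * P) V = R for a fixed policy.

    P: (n_states, n_states) transition matrix under the policy (rows sum to 1).
    R: (n_states,) expected immediate reward under the policy.
    gamma: discount factor in [0, 1).
    """
    n = P.shape[0]
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Hypothetical 2-state example.
P = np.array([[0.9, 0.1],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
print(evaluate_policy(P, R, gamma=0.5))
```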

Bellman Equation - Existence and Uniqueness

Proof: Bellman's equation can be rewritten as $\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}$.
• $\mathbf{P}$ is a stochastic matrix, thus
$$
\|\mathbf{P}\|_\infty = \max_s \sum_{s'} |\mathbf{P}_{s,s'}| = \max_s \sum_{s'} \Pr[s' \mid s, \pi(s)] = 1.
$$
• This implies that $\|\gamma \mathbf{P}\|_\infty = \gamma < 1$. The eigenvalues of $\gamma \mathbf{P}$ are all less than one in absolute value, and $(\mathbf{I} - \gamma \mathbf{P})$ is invertible.

Notes: connection with the general shortest-distance problem (Mohri, 2002).

Optimal Policy

Definition: a policy $\pi^*$ with maximal value for all states $s \in S$.
• value of $\pi^*$ (optimal value): $\forall s \in S$, $V^*(s) = \max_\pi V_\pi(s)$.
• optimal state-action value function $Q^*$: the expected return for taking action $a$ at state $s$ and then following the optimal policy,
$$
Q^*(s, a) = \mathbb{E}[r(s, a)] + \gamma\, \mathbb{E}\big[V^*\big(\delta(s, a)\big)\big]
          = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s').
$$

Optimal Values - Bellman Equations

Property: the following equalities hold:
$$
\forall s \in S, \quad V^*(s) = \max_{a \in A} Q^*(s, a).
$$
Proof: by definition, for all $s$, $V^*(s) \le \max_{a \in A} Q^*(s, a)$. If for some $s$ we had $V^*(s) < \max_{a \in A} Q^*(s, a)$, then the maximizing action would define a better policy. Thus,
$$
V^*(s) = \max_{a \in A} \Big\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') \Big\}.
$$

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Known Model

Setting: the environment model is known.
Problem: find the optimal policy.
Algorithms:
• value iteration;
• policy iteration;
• linear programming.

Value Iteration

The Bellman optimality operator $\Phi$:
$$
\Phi(\mathbf{V})(s) = \max_{a \in A} \Big\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s') \Big\},
\qquad
\Phi(\mathbf{V}) = \max_\pi \{ \mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V} \}.
$$

ValueIteration(V0)
    V ← V0                              ▷ V0 arbitrary value
    while ‖V − Φ(V)‖ ≥ (1 − γ)ε/γ do
        V ← Φ(V)
    return Φ(V)
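The pseudocode translates directly into a short NumPy sketch; the tabular layout (rewards indexed by state and action, transitions by state, action, next state) and the stopping threshold follow the slide, while the function and variable names are illustrative.

```python
import numpy as np

def value_iteration(R, P, gamma, eps=1e-6):
    """Value iteration for a finite MDP.

    R: (n_states, n_actions) expected rewards E[r(s, a)].
    P: (n_states, n_actions, n_states) transition probabilities Pr[s' | s, a].
    Returns an eps-optimal value vector.
    """
    V = np.zeros(R.shape[0])
    while True:
        V_new = np.max(R + gamma * (P @ V), axis=1)   # Bellman optimality operator Phi(V)
        if np.max(np.abs(V_new - V)) < (1 - gamma) * eps / gamma:
            return V_new
        V = V_new
```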

VI Algorithm - Convergence

Theorem: for any initial value $\mathbf{V}_0$, the sequence defined by $\mathbf{V}_{n+1} = \Phi(\mathbf{V}_n)$ converges to $\mathbf{V}^*$.
Proof: we show that $\Phi$ is $\gamma$-contracting for $\|\cdot\|_\infty$; the existence and uniqueness of the fixed point of $\Phi$ then follow.
• For any $s \in S$, let $a^*(s)$ be the maximizing action defining $\Phi(\mathbf{V})(s)$. Then, for $s \in S$ and any $\mathbf{U}$,
$$
\begin{aligned}
\Phi(\mathbf{V})(s) - \Phi(\mathbf{U})(s)
&\le \Phi(\mathbf{V})(s) - \Big( \mathbb{E}[r(s, a^*(s))] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, U(s') \Big) \\
&= \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, \big[ V(s') - U(s') \big] \\
&\le \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, \|\mathbf{V} - \mathbf{U}\|_\infty
= \gamma\, \|\mathbf{V} - \mathbf{U}\|_\infty.
\end{aligned}
$$

VI Algorithm - Complexity and ε-Optimality

Complexity: convergence in $O\big(\log \frac{1}{\epsilon}\big)$ iterations. Observe that
$$
\|\mathbf{V}_{n+1} - \mathbf{V}_n\| \le \gamma\, \|\mathbf{V}_n - \mathbf{V}_{n-1}\| \le \gamma^n\, \|\Phi(\mathbf{V}_0) - \mathbf{V}_0\|.
$$
Thus, requiring $\frac{\gamma^n}{1-\gamma}\, \|\Phi(\mathbf{V}_0) - \mathbf{V}_0\| \le \epsilon$ gives $n = O\big(\log \frac{1}{\epsilon}\big)$.
ε-Optimality: let $\mathbf{V}_{n+1}$ be the value returned. Then,
$$
\|\mathbf{V}^* - \mathbf{V}_{n+1}\|
\le \|\mathbf{V}^* - \Phi(\mathbf{V}_{n+1})\| + \|\Phi(\mathbf{V}_{n+1}) - \mathbf{V}_{n+1}\|
\le \gamma\, \|\mathbf{V}^* - \mathbf{V}_{n+1}\| + \gamma\, \|\mathbf{V}_{n+1} - \mathbf{V}_n\|.
$$
Thus, $\|\mathbf{V}^* - \mathbf{V}_{n+1}\| \le \frac{\gamma}{1-\gamma}\, \|\mathbf{V}_{n+1} - \mathbf{V}_n\|$.

VI Algorithm - Example

[Figure: two-state MDP with states 1 and 2; transitions labeled action/[probability, reward]: from state 1, a/[3/4, 2] (stay in 1), a/[1/4, 2] (to 2), b/[1, 2] (to 2); from state 2, c/[1, 2] (self-loop), d/[1, 3] (to 1).]

$$
V_{n+1}(1) = \max\Big\{ 2 + \gamma\big(\tfrac{3}{4} V_n(1) + \tfrac{1}{4} V_n(2)\big),\; 2 + \gamma V_n(2) \Big\},
\qquad
V_{n+1}(2) = \max\big\{ 3 + \gamma V_n(1),\; 2 + \gamma V_n(2) \big\}.
$$
For $V_0(1) = -1$, $V_0(2) = 1$, $\gamma = 1/2$: $V_1(1) = V_1(2) = 5/2$.
But $V^*(1) = 14/3$, $V^*(2) = 16/3$.
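A quick numeric check of this example (a sketch, not from the slides): iterating the two update equations above with γ = 1/2 indeed converges to V*(1) = 14/3 ≈ 4.667 and V*(2) = 16/3 ≈ 5.333.

```python
gamma = 0.5
V1, V2 = -1.0, 1.0   # V_0(1), V_0(2)
for _ in range(100):
    V1, V2 = (max(2 + gamma * (0.75 * V1 + 0.25 * V2), 2 + gamma * V2),
              max(3 + gamma * V1, 2 + gamma * V2))
print(V1, V2)         # approaches 14/3 and 16/3
```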

Policy Iteration Algorithm

PolicyIteration(π0)
    π ← π0                              ▷ π0 arbitrary policy
    π′ ← nil
    while π ≠ π′ do
        V ← V_π                         ▷ policy evaluation: solve (I − γP_π)V = R_π
        π′ ← π
        π ← argmax_π { R_π + γP_π V }   ▷ greedy policy improvement
    return π
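A minimal NumPy sketch of the same loop, using the tabular representation assumed in the value-iteration sketch above (names are illustrative):

```python
import numpy as np

def policy_iteration(R, P, gamma):
    """Policy iteration for a finite MDP.

    R: (n_states, n_actions) expected rewards; P: (n_states, n_actions, n_states) transitions.
    """
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = P[np.arange(n_states), pi]              # (n_states, n_states)
        R_pi = R[np.arange(n_states), pi]              # (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedy policy improvement.
        pi_new = np.argmax(R + gamma * (P @ V), axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```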

PI Algorithm - Convergence

Theorem: let $(\mathbf{V}_n)_{n \in \mathbb{N}}$ be the sequence of policy values computed by the algorithm. Then,
$$
\mathbf{V}_n \le \mathbf{V}_{n+1} \le \mathbf{V}^*.
$$
Proof: let $\pi_{n+1}$ be the policy improvement at the $n$-th iteration; then, by definition,
$$
\mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} \mathbf{V}_n \ge \mathbf{R}_{\pi_n} + \gamma \mathbf{P}_{\pi_n} \mathbf{V}_n = \mathbf{V}_n.
$$
• Therefore, $\mathbf{R}_{\pi_{n+1}} \ge (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})\, \mathbf{V}_n$.
• Note that $(\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1}$ preserves ordering:
$$
\mathbf{X} \ge 0 \implies (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} \mathbf{X} = \sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{n+1}})^k\, \mathbf{X} \ge 0.
$$
• Thus, $\mathbf{V}_{n+1} = (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} \mathbf{R}_{\pi_{n+1}} \ge \mathbf{V}_n$.

Notes

Two consecutive policy values can be equal only at the last iteration.
The total number of possible policies is $|A|^{|S|}$; thus, this is the maximal possible number of iterations.
• best upper bound known: $O\big(\frac{|A|^{|S|}}{|S|}\big)$.

PI Algorithm - Example

[Figure: two-state MDP with states 1 and 2; transitions labeled action/[probability, reward].]

Initial policy: $\pi_0(1) = b$, $\pi_0(2) = c$.
Evaluation:
$$
V_{\pi_0}(1) = 1 + \gamma\, V_{\pi_0}(2), \qquad V_{\pi_0}(2) = 2 + \gamma\, V_{\pi_0}(2).
$$
Thus,
$$
V_{\pi_0}(1) = \frac{1 + \gamma}{1 - \gamma}, \qquad V_{\pi_0}(2) = \frac{2}{1 - \gamma}.
$$

VI and PI Algorithms - Comparison

Theorem: let $(\mathbf{U}_n)_{n \in \mathbb{N}}$ be the sequence of policy values generated by the VI algorithm, and $(\mathbf{V}_n)_{n \in \mathbb{N}}$ the one generated by the PI algorithm. If $\mathbf{U}_0 = \mathbf{V}_0$, then
$$
\forall n \in \mathbb{N}, \quad \mathbf{U}_n \le \mathbf{V}_n \le \mathbf{V}^*.
$$
Proof: we first show that $\Phi$ is monotonic. Let $\mathbf{U}$ and $\mathbf{V}$ be such that $\mathbf{U} \le \mathbf{V}$, and let $\pi$ be the policy such that $\Phi(\mathbf{U}) = \mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{U}$. Then,
$$
\Phi(\mathbf{U}) \le \mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V} \le \max_{\pi'} \{ \mathbf{R}_{\pi'} + \gamma \mathbf{P}_{\pi'} \mathbf{V} \} = \Phi(\mathbf{V}).
$$

VI and PI Algorithms - Comparison

• The proof is by induction on $n$. Assume $\mathbf{U}_n \le \mathbf{V}_n$; then, by the monotonicity of $\Phi$,
$$
\mathbf{U}_{n+1} = \Phi(\mathbf{U}_n) \le \Phi(\mathbf{V}_n) = \max_\pi \{ \mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V}_n \}.
$$
• Let $\pi_{n+1}$ be the maximizing policy: $\pi_{n+1} = \operatorname{argmax}_\pi \{ \mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V}_n \}$.
• Then,
$$
\Phi(\mathbf{V}_n) = \mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} \mathbf{V}_n
\le \mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} \mathbf{V}_{\pi_{n+1}}
= \mathbf{V}_{\pi_{n+1}} = \mathbf{V}_{n+1}.
$$

Notes

The PI algorithm converges in a smaller number of iterations than the VI algorithm, in view of the comparison above. But each iteration of the PI algorithm requires computing a policy value, i.e., solving a system of linear equations, which is more expensive than an iteration of the VI algorithm.

Primal Linear Program

LP formulation: choose $\alpha(s) > 0$, with $\sum_s \alpha(s) = 1$.
$$
\min_{V} \sum_{s \in S} \alpha(s)\, V(s)
\quad \text{subject to} \quad
\forall s \in S, \forall a \in A, \;
V(s) \ge \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s').
$$
Parameters:
• number of rows: $|S|\,|A|$;
• number of columns: $|S|$.
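A sketch of how this primal LP could be set up with scipy.optimize.linprog (which minimizes c·x subject to A_ub x ≤ b_ub): each constraint V(s) ≥ E[r(s,a)] + γ Σ_{s'} Pr[s'|s,a] V(s') is rewritten as γ P[s,a]·V − V(s) ≤ −R[s,a]. The uniform weights α(s) and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(R, P, gamma):
    """Solve the primal LP for the optimal values of a finite MDP.

    R: (n_states, n_actions) expected rewards; P: (n_states, n_actions, n_states) transitions.
    """
    n_states, n_actions = R.shape
    alpha = np.full(n_states, 1.0 / n_states)               # uniform weights alpha(s)
    # One inequality row per (s, a): gamma * P[s, a] . V - V(s) <= -R[s, a].
    A_ub = gamma * P.reshape(n_states * n_actions, n_states)
    A_ub -= np.repeat(np.eye(n_states), n_actions, axis=0)
    b_ub = -R.reshape(n_states * n_actions)
    res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
    return res.x                                            # optimal values V*(s)
```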

Dual Linear Program

LP formulation:
$$
\max_{x} \sum_{s \in S, a \in A} \mathbb{E}[r(s, a)]\, x(s, a)
\quad \text{subject to} \quad
\forall s' \in S, \; \sum_{a \in A} x(s', a) = \alpha(s') + \gamma \sum_{s \in S, a \in A} \Pr[s' \mid s, a]\, x(s, a);
\quad \forall s \in S, \forall a \in A, \; x(s, a) \ge 0.
$$
Parameters: more favorable number of rows.
• number of rows: $|S|$;
• number of columns: $|S|\,|A|$.

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Problem

Unknown model:
• transition and reward probabilities not known;
• realistic scenario in many practical problems.
Training information: sequence of immediate rewards based on the actions taken.
Learning approaches:
• model-free: learn the policy directly;
• model-based: learn the model, then use it to learn the policy.

Problem

How do we estimate the reward and transition probabilities?
• use the equations derived for the policy values and Q-functions;
• but those equations are given in terms of expectations;
• this is an instance of a stochastic approximation problem.

Stochastic Approximation

Problem: find the solution of $x = H(x)$ with $x \in \mathbb{R}^N$, while
• $H(x)$ cannot be computed, e.g., $H$ is not accessible;
• i.i.d. noisy observations $H(x_i) + w_i$ are available, $i \in [1, m]$, with $\mathbb{E}[w] = 0$.
Idea: algorithm based on an iterative technique:
$$
x_{t+1} = (1 - \alpha_t)\, x_t + \alpha_t\, [H(x_t) + w_t]
        = x_t + \alpha_t\, [H(x_t) + w_t - x_t];
$$
• more generally, $x_{t+1} = x_t + \alpha_t\, D(x_t, w_t)$.

Mean Estimation

Theorem: let $X$ be a random variable taking values in $[0, 1]$ and let $x_0, \ldots, x_m$ be i.i.d. values of $X$. Define the sequence $(\mu_m)_{m \in \mathbb{N}}$ by
$$
\mu_{m+1} = (1 - \alpha_m)\, \mu_m + \alpha_m\, x_m, \quad \text{with } \mu_0 = x_0.
$$
Then, for $\alpha_m \in [0, 1]$ with $\sum_{m \ge 0} \alpha_m = +\infty$ and $\sum_{m \ge 0} \alpha_m^2 < +\infty$,
$$
\mu_m \xrightarrow{\text{a.s.}} \mathbb{E}[X].
$$
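A tiny simulation sketch (illustrative only) with the step size α_m = 1/(m+1), for which the recursion reduces to the running empirical mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)    # i.i.d. samples of X, uniform on [0, 1]

mu = x[0]
for m in range(1, len(x)):
    alpha = 1.0 / (m + 1)                  # satisfies sum alpha = inf, sum alpha^2 < inf
    mu = (1 - alpha) * mu + alpha * x[m]
print(mu, x.mean())                        # both approach E[X] = 0.5
```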

Proof

Proof: by the independence assumption, for $m \ge 0$,
$$
\operatorname{Var}[\mu_{m+1}] = (1 - \alpha_m)^2 \operatorname{Var}[\mu_m] + \alpha_m^2 \operatorname{Var}[x_m]
\le (1 - \alpha_m)\operatorname{Var}[\mu_m] + \alpha_m^2.
$$
• We have $\alpha_m \to 0$ since $\sum_{m \ge 0} \alpha_m^2 < +\infty$.
• Let $\epsilon > 0$ and suppose there exists $N \in \mathbb{N}$ such that for all $m \ge N$, $\operatorname{Var}[\mu_m] \ge \epsilon$. Then, for $m \ge N$,
$$
\operatorname{Var}[\mu_{m+1}] \le \operatorname{Var}[\mu_m] - \epsilon\, \alpha_m + \alpha_m^2,
$$
which implies
$$
\operatorname{Var}[\mu_{m+N}] \le \operatorname{Var}[\mu_N] - \epsilon \sum_{n=N}^{m+N} \alpha_n + \sum_{n=N}^{m+N} \alpha_n^2 \to -\infty
$$
when $m \to \infty$, contradicting $\operatorname{Var}[\mu_{m+N}] \ge 0$.

Mean Estimation

• Thus, for all $N \in \mathbb{N}$, there exists $m_0 \ge N$ such that $\operatorname{Var}[\mu_{m_0}] < \epsilon$. Choose $N$ large enough so that $\forall m \ge N$, $\alpha_m \le \epsilon$. Then,
$$
\operatorname{Var}[\mu_{m_0+1}] \le (1 - \alpha_{m_0})\, \epsilon + \alpha_{m_0}\, \epsilon = \epsilon.
$$
• Therefore, $\operatorname{Var}[\mu_m] \le \epsilon$ for all $m \ge m_0$ ($L^2$ convergence).

Notes

• Special case: $\alpha_m = \frac{1}{m}$.
• Strong law of large numbers.
• Connection with stochastic approximation.

TD(0) Algorithm

Idea: recall Bellman's linear equations giving $V_\pi$:
$$
V_\pi(s) = \mathbb{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s')
         = \mathbb{E}_{s'}\big[ r(s, \pi(s)) + \gamma V_\pi(s') \mid s \big].
$$
Algorithm: temporal difference (TD).
• sample a new state $s'$;
• update ($\alpha$ depends on the number of visits of $s$):
$$
V(s) \leftarrow (1 - \alpha)\, V(s) + \alpha\, [r(s, \pi(s)) + \gamma V(s')]
      = V(s) + \alpha\, \underbrace{[r(s, \pi(s)) + \gamma V(s') - V(s)]}_{\text{temporal difference of } V \text{ values}}.
$$

TD(0) Algorithm

TD(0)(π)
    V ← V0                              ▷ initialization
    for t ← 0 to T do
        s ← SelectState()
        for each step of epoch t do
            r ← Reward(s, π(s))
            s′ ← NextState(π, s)
            V(s) ← (1 − α)V(s) + α[r + γV(s′)]
            s ← s′
    return V
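A tabular TD(0) sketch in Python; the environment interface (reset() returning a start state, step(state, action) returning a reward and next state) and the per-state step-size schedule are assumptions, not part of the slides.

```python
from collections import defaultdict

def td0(env, policy, gamma, n_epochs, steps_per_epoch):
    """Tabular TD(0) policy evaluation; env.step(s, a) -> (reward, next_state) is assumed."""
    V = defaultdict(float)                 # V_0 = 0
    visits = defaultdict(int)
    for _ in range(n_epochs):
        s = env.reset()                    # SelectState()
        for _ in range(steps_per_epoch):
            visits[s] += 1
            alpha = 1.0 / visits[s]        # step size depending on the number of visits of s
            r, s_next = env.step(s, policy(s))
            V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
            s = s_next
    return V
```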

Q-Learning Algorithm

Idea: assume deterministic rewards. Then
$$
Q^*(s, a) = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s')
          = \mathbb{E}_{s'}\big[ r(s, a) + \gamma \max_{a' \in A} Q^*(s', a') \big].
$$
Algorithm: $\alpha \in [0, 1]$ depends on the number of visits.
• sample a new state $s'$;
• update:
$$
Q(s, a) \leftarrow \alpha\, Q(s, a) + (1 - \alpha)\, \big[ r(s, a) + \gamma \max_{a' \in A} Q(s', a') \big].
$$

Q-Learning Algorithm (Watkins, 1989; Watkins and Dayan, 1992)

Q-Learning(π)
    Q ← Q0                              ▷ initialization, e.g., Q0 = 0
    for t ← 0 to T do
        s ← SelectState()
        for each step of epoch t do
            a ← SelectAction(π, s)      ▷ policy π derived from Q, e.g., ε-greedy
            r ← Reward(s, a)
            s′ ← NextState(s, a)
            Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
            s ← s′
    return Q
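A minimal tabular Q-learning sketch matching the pseudocode; the environment interface and the ε-greedy selection are assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma, epsilon, n_epochs, steps_per_epoch):
    """Tabular Q-learning; env.step(s, a) -> (reward, next_state) is assumed."""
    Q = defaultdict(float)                 # Q_0 = 0, keyed by (state, action)
    visits = defaultdict(int)
    for _ in range(n_epochs):
        s = env.reset()
        for _ in range(steps_per_epoch):
            # epsilon-greedy action selection from the current Q estimate.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]
            r, s_next = env.step(s, a)
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```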

Notes

Can be viewed as a stochastic formulation of the value iteration algorithm.
Convergence for any policy, so long as states and actions are visited infinitely often.
How should the action be chosen at each iteration? Maximize the reward? Explore other actions? Q-learning is an off-policy method: no control over the policy.

Policies

ε-greedy strategy:
• with probability $1 - \epsilon$, take the greedy action from $s$;
• with probability $\epsilon$, take a random action.
Epoch-dependent strategy (Boltzmann exploration):
$$
p_t(a \mid s, Q) = \frac{e^{Q(s, a)/\tau_t}}{\sum_{a' \in A} e^{Q(s, a')/\tau_t}},
$$
• $\tau_t \to 0$: greedy selection;
• larger $\tau_t$: more random action selection.
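A sketch of the two selection rules, assuming a tabular Q given as a dictionary keyed by (state, action); names are illustrative.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability 1 - epsilon take the greedy action, otherwise a uniformly random one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, tau):
    """Boltzmann exploration: sample an action with probability proportional to exp(Q(s, a)/tau)."""
    weights = [math.exp(Q[(s, a)] / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```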

Convergence of Q-Learning

Theorem: consider a finite MDP. Assume that for all $s \in S$ and $a \in A$,
$$
\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty,
$$
with $\alpha_t(s, a) \in [0, 1]$. Then, the Q-learning algorithm converges to the optimal value $Q^*$ (with probability one).
• Note: the conditions on $\alpha_t(s, a)$ impose that each state-action pair is visited infinitely many times.

SARSA: On-Policy Algorithm

SARSA(π)
    Q ← Q0                              ▷ initialization, e.g., Q0 = 0
    for t ← 0 to T do
        s ← SelectState()
        a ← SelectAction(π(Q), s)       ▷ policy derived from Q, e.g., ε-greedy
        for each step of epoch t do
            r ← Reward(s, a)
            s′ ← NextState(s, a)
            a′ ← SelectAction(π(Q), s′)  ▷ policy derived from Q, e.g., ε-greedy
            Q(s, a) ← Q(s, a) + α_t(s, a)[r + γQ(s′, a′) − Q(s, a)]
            s ← s′
            a ← a′
    return Q
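An on-policy counterpart of the Q-learning sketch above (same assumed environment interface, reusing the epsilon_greedy helper sketched earlier); the only change is that the bootstrapped target uses the action actually selected in the next state rather than the maximum.

```python
from collections import defaultdict

def sarsa(env, actions, gamma, epsilon, n_epochs, steps_per_epoch):
    """Tabular SARSA; env.step(s, a) -> (reward, next_state) is assumed."""
    Q = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(n_epochs):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        for _ in range(steps_per_epoch):
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]
            r, s_next = env.step(s, a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```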

Notes

Differences with Q-learning:
• two states are used at each update: the current one and the next one;
• the maximum reward over actions at the next state is not used; instead, the value of the action actually selected is.
SARSA: the name is derived from the sequence (state, action, reward, state, action) used in the updates.

TD(λ) Algorithm

Idea:
• TD(0) or Q-learning only use the immediate reward;
• instead, use multiple steps ahead; for $n$ steps:
$$
R_t^n = r_{t+1} + \gamma\, r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \big( R_t^n - V(s_t) \big).
$$
• TD(λ) uses $R_t^\lambda = (1 - \lambda) \sum_{n=0}^{\infty} \lambda^n R_t^n$.
Algorithm: $V(s_t) \leftarrow V(s_t) + \alpha\, \big( R_t^\lambda - V(s_t) \big)$.

TD(λ) Algorithm

TD(λ)(π)
    V ← V0                              ▷ initialization
    e ← 0
    for t ← 0 to T do
        s ← SelectState()
        for each step of epoch t do
            s′ ← NextState(π, s)
            δ ← r(s, π(s)) + γV(s′) − V(s)
            e(s) ← e(s) + 1
            for u ∈ S do
                if u ≠ s then
                    e(u) ← γλe(u)
                V(u) ← V(u) + αδe(u)
            s ← s′
    return V
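A tabular TD(λ) sketch with eligibility traces, following the standard accumulating-traces formulation (minor bookkeeping details may differ from the pseudocode above); the environment interface and constant step size are assumptions.

```python
from collections import defaultdict

def td_lambda(env, policy, gamma, lam, alpha, n_epochs, steps_per_epoch):
    """Tabular TD(lambda); env.step(s, a) -> (reward, next_state) is assumed."""
    V = defaultdict(float)
    e = defaultdict(float)                 # eligibility traces
    for _ in range(n_epochs):
        s = env.reset()
        for _ in range(steps_per_epoch):
            r, s_next = env.step(s, policy(s))
            delta = r + gamma * V[s_next] - V[s]
            e[s] += 1.0                    # accumulating trace for the visited state
            for u in list(e):              # update every state with a non-zero trace
                V[u] += alpha * delta * e[u]
                e[u] *= gamma * lam        # decay the traces
            s = s_next
    return V
```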

TD-Gammon (Tesauro, 1995)

Large or costly action sets: use a regression algorithm to estimate Q for unseen values.
Backgammon:
• large number of positions: 30 pieces, 24-26 locations;
• large number of moves.
TD-Gammon: used neural networks;
• non-linear form of TD(λ); 1.5M games played;
• almost as good as world-class humans (master level).

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Multi-Armed Bandit Problem (Robbins, 1952)

Problem: a gambler must decide which arm of an N-slot machine to pull to maximize his total reward in a series of trials.
• stochastic setting: N lever reward distributions;
• adversarial setting: reward selected by an adversary aware of all the past.

Applications

Clinical trials.
Adaptive routing.
Ads placement on pages.
Games.

Multi-Armed Bandit Game

For $t = 1$ to $T$ do
• the adversary determines the outcome $y_t \in Y$;
• the player selects a probability distribution $p_t$ and pulls a lever $I_t \in \{1, \ldots, N\}$, $I_t \sim p_t$;
• the player incurs the loss $L(I_t, y_t)$ (the adversary is informed of $p_t$ and $I_t$).
Objective: minimize the regret
$$
\operatorname{Regret}(T) = \sum_{t=1}^{T} L(I_t, y_t) - \min_{i = 1, \ldots, N} \sum_{t=1}^{T} L(i, y_t).
$$

Notes

The player is informed only of the loss (or reward) corresponding to his own action.
The adversary knows the past, but not the action selected.

Stochastic setting: the loss vector $\big(L(1, y_t), \ldots, L(N, y_t)\big)$ is drawn according to some distribution $D = D_1 \otimes \cdots \otimes D_N$. The regret definition is modified by taking expectations.
Exploration/Exploitation trade-off: playing the best arm found so far versus seeking to find an arm with a better payoff.

Notes

Equivalent views:
• special case of learning with partial information;
• one-state MDP learning problem.
Simple strategy (ε-greedy): play the arm with the best empirical reward with probability $1 - \epsilon_t$, a random arm with probability $\epsilon_t$.

Exponentially Weighted Average

Algorithm: Exp3, defined for $\eta, \gamma > 0$ by
$$
p_{i,t} = (1 - \gamma)\, \frac{\exp\big( -\eta \sum_{s=1}^{t-1} \widetilde{l}_{i,s} \big)}{\sum_{j=1}^{N} \exp\big( -\eta \sum_{s=1}^{t-1} \widetilde{l}_{j,s} \big)} + \frac{\gamma}{N},
\quad \text{with } \forall i \in [1, N], \;
\widetilde{l}_{i,t} = \frac{L(I_t, y_t)}{p_{I_t,t}}\, 1_{I_t = i}.
$$
Guarantee: expected regret of $O\big(\sqrt{NT \log N}\big)$.
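A sketch of an Exp3-style sampler for losses in [0, 1]; the class and parameter names (eta, gamma_mix) and the loss-based sign convention are assumptions consistent with the formula above, not a definitive implementation.

```python
import math
import random

class Exp3:
    """Exp3-style bandit algorithm for losses in [0, 1] over N arms (illustrative sketch)."""

    def __init__(self, n_arms, eta, gamma_mix):
        self.n = n_arms
        self.eta = eta                     # learning rate
        self.gamma_mix = gamma_mix         # uniform-exploration mixing weight
        self.cum_loss = [0.0] * n_arms     # importance-weighted cumulative loss estimates

    def probabilities(self):
        m = min(self.cum_loss)             # shift for numerical stability
        w = [math.exp(-self.eta * (l - m)) for l in self.cum_loss]
        total = sum(w)
        return [(1 - self.gamma_mix) * wi / total + self.gamma_mix / self.n for wi in w]

    def pull(self):
        p = self.probabilities()
        arm = random.choices(range(self.n), weights=p, k=1)[0]
        return arm, p

    def update(self, arm, loss, p):
        # Unbiased importance-weighted loss estimate: only the pulled arm is updated.
        self.cum_loss[arm] += loss / p[arm]
```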

Exponentially Weighted Average

Proof: similar to the one for the (full-information) Exponentially Weighted Average algorithm, with the additional observation that
$$
\mathbb{E}\big[\widetilde{l}_{i,t}\big] = \sum_{j=1}^{N} p_{j,t}\, \frac{L(j, y_t)}{p_{j,t}}\, 1_{j = i} = L(i, y_t).
$$

References

• Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 2 vols. Belmont, MA: Athena Scientific, 2007.
• Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321-350, 2002.
• Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.
• Herbert Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.
• Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

References

• Gerald Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3), 1995.
• Christopher J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.
• Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4), 1992.

Appendix

Stochastic Approximation

Problem: find the solution of $x = H(x)$ with $x \in \mathbb{R}^N$, while
• $H(x)$ cannot be computed, e.g., $H$ is not accessible;
• an i.i.d. sample of noisy observations $H(x_i) + w_i$ is available, $i \in [1, m]$, with $\mathbb{E}[w] = 0$.
Idea: algorithm based on an iterative technique:
$$
x_{t+1} = (1 - \alpha_t)\, x_t + \alpha_t\, [H(x_t) + w_t]
        = x_t + \alpha_t\, [H(x_t) + w_t - x_t];
$$
• more generally, $x_{t+1} = x_t + \alpha_t\, D(x_t, w_t)$.

Supermartingale Convergence

Theorem: let $X_t, Y_t, Z_t$ be non-negative random variables such that $\sum_{t=0}^{\infty} Y_t < \infty$. If the condition $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] \le X_t + Y_t - Z_t$ holds, then:
• $X_t$ converges to a limit (with probability one);
• $\sum_{t=0}^{\infty} Z_t < \infty$.

Convergence Analysis

Convergence of $x_{t+1} = x_t + \alpha_t\, D(x_t, w_t)$, with the history $\mathcal{F}_t$ defined by
$$
\mathcal{F}_t = \big\{ (x_{t'})_{t' \le t}, (\alpha_{t'})_{t' \le t}, (w_{t'})_{t' < t} \big\},
$$
under the step-size conditions $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$ (and suitable assumptions on $D$, stated in terms of the potential function $\Psi$ used in the proof below). Then,
$$
x_t \xrightarrow{\text{a.s.}} x^*.
$$

Convergence Analysis

Proof: since $\Psi$ is a quadratic function,
$$
\Psi(x_{t+1}) = \Psi(x_t) + \nabla\Psi(x_t)^\top (x_{t+1} - x_t)
             + \tfrac{1}{2}\, (x_{t+1} - x_t)^\top \nabla^2\Psi(x_t)\, (x_{t+1} - x_t).
$$
Thus,
$$
\begin{aligned}
\mathbb{E}\big[\Psi(x_{t+1}) \mid \mathcal{F}_t\big]
&= \Psi(x_t) + \alpha_t\, \nabla\Psi(x_t)^\top\, \mathbb{E}\big[D(x_t, w_t) \mid \mathcal{F}_t\big]
 + \tfrac{\alpha_t^2}{2}\, \mathbb{E}\big[\|D(x_t, w_t)\|^2 \mid \mathcal{F}_t\big] \\
&\le \Psi(x_t) - \alpha_t\, c\, \Psi(x_t) + \tfrac{\alpha_t^2}{2}\big(K_1 + K_2\, \Psi(x_t)\big) \\
&= \Psi(x_t) + \tfrac{\alpha_t^2 K_1}{2}
 - \underbrace{\alpha_t \Big( c - \tfrac{\alpha_t K_2}{2} \Big)}_{\text{non-neg. for large } t} \Psi(x_t).
\end{aligned}
$$
By the supermartingale convergence theorem, $\Psi(x_t)$ converges and
$$
\sum_{t=0}^{\infty} \alpha_t \Big( c - \tfrac{\alpha_t K_2}{2} \Big) \Psi(x_t) < \infty.
$$
Since $\alpha_t > 0$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, $\Psi(x_t)$ must converge to 0.