
Agents that Learn: Reinforcement Learning

Vasant Honavar
Artificial Intelligence Research Laboratory
College of Information Sciences and Technology and Graduate Program, The Huck Institutes of the Life Sciences
Pennsylvania State University

[email protected] http://faculty.ist.psu.edu/vhonavar

Agent in an environment

[Diagram: the agent acts on the environment; the environment returns a state and a reward or punishment.]

Learning from Interaction with the world

• An agent receives sensations (percepts) from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment.
• The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (pain) as it stumbles along in an a priori unknown, uncertain environment.

Markov Decision Processes
• Assume
  – a finite set of states S
  – a finite set of actions A
• At each discrete time step:

– the agent observes state s_t ∈ S and chooses action a_t ∈ A, receiving immediate reward r_t
– the environment state changes to s_{t+1}

• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r may not be known to the agent; this is the reinforcement learning setting
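As a sketch of this formulation (not from the slides; the corridor layout, state set, and reward values below are hypothetical), a deterministic MDP is just a pair of functions δ(s, a) and r(s, a) over a finite S and A:

# A minimal deterministic MDP: finite states, finite actions,
# a transition function delta(s, a) and a reward function r(s, a).
# The 1-D "corridor" layout below is purely illustrative.
STATES = [0, 1, 2, 3]          # state 3 is an absorbing goal state
ACTIONS = ["left", "right"]

def delta(s, a):
    """Deterministic state transition delta(s, a) -> next state."""
    if s == 3:                  # absorbing goal state
        return 3
    return max(0, s - 1) if a == "left" else min(3, s + 1)

def r(s, a):
    """Immediate reward r(s, a): +1 for stepping into the goal."""
    return 1.0 if (s, a) == (2, "right") else 0.0

# One step of agent-environment interaction:
s, a = 0, "right"
print(s, a, r(s, a), delta(s, a))   # -> 0 right 0.0 1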

Acting rationally in the presence of delayed rewards

Agent and environment interact at discrete time steps t = 0, 1, 2, …
The agent observes the state at step t: s_t ∈ S,
produces an action at step t: a_t ∈ A(s_t),
gets the resulting reward r_{t+1} ∈ ℝ,
and the resulting next state s_{t+1}.

[Diagram: trajectory … s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → …]


Key elements of an RL System

• Policy – what to do
• Reward – what is good
• Value – what is good because it predicts reward
• Model of the environment – what follows what

The Agent Uses a Policy to select actions

Policy at step t, π_t: a mapping from states to action probabilities

  π_t(s, a) = probability that a_t = a when s_t = s

• A rational agent’s goal is to get as much reward as it can over the long run


Goals and Rewards

• Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal is typically outside the agent's direct control.
• The agent must be able to measure success:
  • explicitly
  • frequently during its lifespan

Rewards for Continuing Tasks

Continuing tasks: interaction does not have natural episodes

Cumulative discounted reward

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate.
  shortsighted 0 ← γ → 1 farsighted
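As a quick illustration (a sketch, not part of the slides), the discounted return for a finite reward sequence can be computed by truncating the sum:

def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, truncated to the rewards given."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))   # shortsighted: 1.0
print(discounted_return(rewards, 0.9))   # farsighted: 1 + 0.9 + 0.81 + 0.729 = 3.439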

Rewards

Suppose the sequence of rewards after step t is:

  r_{t+1}, r_{t+2}, r_{t+3}, …

What do we want to maximize?

In general, we want to maximize the expected return, E{R_t}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game or trips through a maze.

  R_t = r_{t+1} + r_{t+2} + … + r_T,

where T is the final time step at which a terminal state is reached, ending the episode.

Markov Decision Processes

• If a reinforcement learning task has the Markov Property, it is called a Markov Decision Process (MDP).
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to specify:
  – the state and action sets;
  – the one-step dynamics defined by transition probabilities:

    P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }   ∀ s, s′ ∈ S, a ∈ A(s)

  – the expected rewards:

    R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }   ∀ s, s′ ∈ S, a ∈ A(s)

Example – Pole Balancing Task

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
  reward = +1 for each step before failure ⇒ return = number of steps before failure
As a continuing task with discounted return:
  reward = −1 upon failure, 0 otherwise ⇒ return = −γ^k, for k steps before failure
In either case, the return is maximized by avoiding failure for as long as possible.

Example – Driving task

Get to the top of the hill as quickly as possible.

reward = −1 for each step when not at top of hill ⇒ return = − number of steps before reaching top of hill

Return is maximized by minimizing the number of steps taken to reach the top of the hill


Finite MDP Example

Recycling Robot
• At each step, the robot has to decide whether to (a) actively search for a can, (b) wait for someone to bring it a can, or (c) go to its home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions are made on the basis of the current energy level: high or low.
• Reward = number of cans collected

Some Notable RL Applications

• TD-Gammon – world's best backgammon program
• Elevator scheduling
• Inventory management – 10%–15% improvement over state-of-the-art methods
• Dynamic channel assignment – high-performance assignment of radio channels to mobile telephone calls

The Markov Property

• By the state at step t, we mean whatever information is available to the agent at step t about its environment.
• The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all essential information – it should have the Markov Property:

  Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0 } = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }

  for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0.

Reinforcement learning

• The learner is not told which actions to take
• Rewards and punishments may be delayed
  – Sacrifice short-term gains for greater long-term gains
• There is a need to trade off exploration against exploitation
• The environment may be unobservable or only partially observable
• The environment may be deterministic or stochastic


Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy π:

  V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

• The value of taking an action in a state under policy π is the expected cumulative reward starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:

  Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }

Bellman Equation for a Policy π

The basic idea:

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + …
      = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + … )
      = r_{t+1} + γ R_{t+1}

So:

  V^π(s) = E_π{ R_t | s_t = s }
         = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

  V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
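The Bellman equation suggests a simple fixed-point computation when P^a_{ss′} and R^a_{ss′} are known. The sketch below runs iterative policy evaluation on a hypothetical two-state, two-action MDP (all numbers are illustrative, not from the lecture):

# Iterative policy evaluation: repeatedly apply the Bellman equation
# V(s) <- sum_a pi(s,a) sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))
# Hypothetical MDP: states 0, 1; actions 0, 1.
P = {0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
     1: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.5}}}
R = {0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}},
     1: {0: {0: 0.0, 1: 0.0}, 1: {0: 2.0, 1: 0.0}}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}   # equiprobable policy
gamma = 0.9

V = {s: 0.0 for s in P}
for _ in range(1000):
    V = {s: sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                               for s2 in P)
                for a in P[s])
         for s in P}
print(V)   # approximate V^pi for the equiprobable policy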

Optimal Value Functions
• For finite MDPs, policies can be partially ordered: π ≥ π′ if and only if V^π(s) ≥ V^{π′}(s) for all s ∈ S.
• There is always at least one policy (possibly many) that is better than or equal to all the others; this is an optimal policy. We denote them all π*.
• Optimal policies share the same optimal state-value function:
  V*(s) = max_π V^π(s) for all s ∈ S
• Optimal policies also share the same optimal action-value function:
  Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
  This is the expected cumulative reward for taking action a in state s and thereafter following an optimal policy.


Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, a one-step-ahead search produces the long-term optimal actions.

Example: Tic-Tac-Toe

[Diagram: game tree of tic-tac-toe positions, alternating X moves and O moves.]

Assume an imperfect opponent: it sometimes makes mistakes.

A Simple RL Approach to Tic-Tac-Toe

Make a table with one entry per state: V(s) = estimated probability of winning.
[Table sketch: ordinary states start at V(s) = 0.5; states with three x's in a row (a win) get V(s) = 1; states with three o's in a row (a loss) get V(s) = 0; a drawn position is shown with V(s) = 0.5.]

Play lots of games. To pick our moves, look ahead one step from the current state to the possible next states:
• Pick the next state with the highest estimated probability of winning, i.e., the largest V(s) – a greedy move.
• Occasionally pick a move at random – an exploratory move.

Learning Rule for Tic-Tac-Toe

[Diagram: a sequence of positions during one game, alternating our moves (mostly greedy, one marked as an exploratory move) and the opponent's moves; s denotes the state before our greedy move and s′ the state after it.]

We increment each V(s) toward V(s′) – a backup:

  V(s) ← V(s) + α [ V(s′) − V(s) ]
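A sketch of this backup in code; the board encoding and the default value are hypothetical, and only the update rule itself comes from the slide:

# TD-style backup for the tic-tac-toe value table:
# V(s) <- V(s) + alpha * (V(s') - V(s)), applied after each greedy move.
from collections import defaultdict

V = defaultdict(lambda: 0.5)    # estimated probability of winning; 0.5 by default
alpha = 0.1

def backup(s, s_next):
    """Move V(s) a fraction alpha of the way toward V(s')."""
    V[s] += alpha * (V[s_next] - V[s])

# States are represented here simply as strings of the 9 board cells (hypothetical).
s, s_next = "x.o......", "x.o..x..."
V[s_next] = 1.0                 # suppose s' turned out to be a winning position
backup(s, s_next)
print(V[s])                     # 0.5 + 0.1 * (1.0 - 0.5) = 0.55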

Why is Tic-Tac-Toe Too Easy?

• The number of states is small and finite
• One-step look-ahead is always possible
• The state is completely observable


What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

  π*(s) = argmax_{a ∈ A(s)} Q*(s, a)

Simpler Setting: The n-Armed Bandit Problem

• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where

  E{ r_t | a_t } = Q*(a_t)

• The distribution of r_t depends only on a_t
• There is no dependence on state
• The objective is to maximize the reward in the long term, e.g., over 1000 plays

The Exploration – Exploitation Dilemma

• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at time t is a*_t = argmax_a Q_t(a)
  a_t = a*_t ⇒ exploitation
  a_t ≠ a*_t ⇒ exploration
• You can't exploit all the time; you can't explore all the time
• You can never stop exploring, but you could reduce exploring

Action-Value Methods

• Adapt action-value estimates and nothing else.

• Suppose that by the t-th play, action a has been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}. Then

  Q_t(a) = ( r_1 + r_2 + … + r_{k_a} ) / k_a

  lim_{k_a → ∞} Q_t(a) = Q*(a)

ε-Greedy Action Selection
§ Greedy:
  a_t = a*_t = argmax_a Q_t(a)

§ ε-Greedy:
  a_t = a*_t with probability 1 − ε
  a_t = a random action with probability ε

§ Boltzmann:
  Pr(choosing action a at time t) = e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}
  where τ is the computational temperature
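A sketch of these selection rules applied to a set of current estimates Q_t(a) (the bandit values below are hypothetical; the ε-greedy and Boltzmann formulas are the ones on the slide):

import math, random

Q = [0.2, 0.5, 0.1]            # current action-value estimates Q_t(a) for n = 3 arms

def epsilon_greedy(Q, epsilon=0.1):
    """With prob. 1 - eps pick argmax_a Q(a); otherwise a uniformly random action."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def boltzmann(Q, tau=0.5):
    """Pick action a with probability exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    weights = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=weights, k=1)[0]

print(epsilon_greedy(Q), boltzmann(Q))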

Solving the Bellman Optimality Equation

• Finding an optimal policy by solving the Bellman Optimality Equation requires:
  – accurate knowledge of the environment dynamics;
  – enough space and time to do the computation;
  – the Markov Property.
• How much space and time do we need?
  – polynomial in the number of states (via dynamic programming methods),
  – BUT the number of states is often huge
• We usually have to settle for approximations
• Many RL methods can be understood as approximately solving the Bellman Optimality Equation
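As one concrete instance, when the dynamics are known the Bellman optimality equation can be solved approximately by value iteration, sketched below on the same hypothetical two-state MDP used in the earlier policy-evaluation sketch (all numbers illustrative):

# Value iteration: V(s) <- max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))
# Each sweep costs time polynomial in |S| and |A|, but |S| is often huge in practice.
P = {0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
     1: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.5}}}
R = {0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}},
     1: {0: {0: 0.0, 1: 0.0}, 1: {0: 2.0, 1: 0.0}}}
gamma = 0.9

V = {s: 0.0 for s in P}
for _ in range(1000):
    V = {s: max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P)
                for a in P[s])
         for s in P}
# A greedy policy with respect to the converged V is (approximately) optimal.
greedy = {s: max(P[s], key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                         for s2 in P))
          for s in P}
print(V, greedy)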


Agents that Act Rationally in the presence of delayed rewards

Agent and environment interact at discrete time steps t = 0, 1, 2, …
The agent observes the state at step t: s_t ∈ S,
produces an action at step t: a_t ∈ A(s_t),
gets the resulting reward r_{t+1} ∈ ℝ,
and the resulting next state s_{t+1}.

[Diagram: trajectory … s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → …]

The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities

  π_t(s, a) = probability that a_t = a when s_t = s

• Reinforcement learning methods specify how the agent changes its policy as a result of its experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.

Reinforcement learning

[Diagram: the agent acts on the environment; the environment returns a state and a reward.]

Markov Decision Processes
• Assume
  – a finite set of states S
  – a finite set of actions A
• At each discrete time step:

– the agent observes state s_t ∈ S and chooses action a_t ∈ A, receiving immediate reward r_t
– the environment state changes to s_{t+1}

• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r may not be known to the agent; this is the reinforcement learning setting

Agent's learning task

• Execute actions in the environment and observe the results
• Learn an action policy π : S → A that maximizes

  E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ]

  from any starting state in S
• Here 0 ≤ γ < 1 is the discount factor for future rewards

• The target function to be learned is π : S → A
• But we have no training examples of the form 〈s, a〉
• Training examples are of the form 〈〈s, a〉, r〉

Reinforcement learning problem

• Goal: learn to choose actions that maximize r_0 + γ r_1 + γ² r_2 + …, where 0 ≤ γ < 1

Value function

• To begin with, consider deterministic worlds…
• For each possible policy π the agent might adopt, we can define an evaluation function over states:

  V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + … ≡ Σ_{i=0}^{∞} γ^i r_{t+i}

  where r_t, r_{t+1}, … are generated by following policy π starting at state s
• Restated, the task is to learn the optimal policy π*:

  π* ≡ argmax_π V^π(s), (∀s)


What to learn
• We might try to have the agent learn the evaluation function V^{π*} (which we write as V*)
• It could then do a look-ahead search to choose the best action from any state s, because

  π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

A problem:
• This works if the agent knows δ : S × A → S and r : S × A → ℝ
• But when it doesn't, it can't choose actions this way


Action-Value function – Q function
• Define a new function very similar to V*:

  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

• If the agent learns Q, it can choose the optimal action even without knowing δ:

  π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
  π*(s) ≡ argmax_a Q(s, a)

• Q is the evaluation function the agent will learn

Training rule to learn Q
• Note that Q and V* are closely related:

  V*(s) = max_{a′} Q(s, a′)

• This allows us to write Q recursively as

  Q(s_1, a_1) = r(s_1, a_1) + γ V*(δ(s_1, a_1))
              = r(s_1, a_1) + γ max_{a′} Q(δ(s_1, a_1), a′)

• Let Q̂ denote the learner's current approximation to Q, and consider the training rule

  Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

  where s′ is the state resulting from applying action a in state s.

Q-Learning

  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Q Learning for Deterministic Worlds
• For each s, a, initialize the table entry Q̂(s, a) ← 0
• Do forever:
  – Observe the current state s
  – Select an action a and execute it
  – Receive immediate reward r
  – Observe the new state s′
  – Update the table entry for Q̂(s, a) as follows:
      Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  – Update the state: s ← s′
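A sketch of this algorithm in Python on a hypothetical four-state deterministic corridor (the environment, exploration scheme, and episode count are illustrative; the update line is the one above):

import random
from collections import defaultdict

# Deterministic corridor environment (hypothetical): states 0..3, goal at 3.
ACTIONS = ["left", "right"]
def delta(s, a):
    if s == 3:
        return 3
    return max(0, s - 1) if a == "left" else min(3, s + 1)
def r(s, a):
    return 1.0 if (s, a) == (2, "right") else 0.0

gamma = 0.9
Q = defaultdict(float)                    # table entries Q_hat(s, a), initialized to 0

for episode in range(200):
    s = 0
    while s != 3:
        a = random.choice(ACTIONS)        # any exploration that visits all (s, a) works here
        reward, s_next = r(s, a), delta(s, a)
        # Deterministic Q-learning update: Q_hat(s,a) <- r + gamma * max_a' Q_hat(s',a')
        Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print({k: round(v, 2) for k, v in sorted(Q.items())})
# Expected to approach Q(2,"right") = 1.0, Q(1,"right") = 0.9, Q(0,"right") = 0.81.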

Updating Q

  Q̂(s_1, a_right) ← r + γ max_{a′} Q̂(s_2, a′)
                 ← 0 + 0.9 × max{63, 81, 100}
                 ← 90

Notice that if the rewards are non-negative, then
  (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)   and
  (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Convergence theorem

• Theorem: Q̂ converges to Q.
• Consider the case of a deterministic world with bounded immediate rewards, where each 〈s, a〉 is visited infinitely often.
• Proof: Define an interval during which each 〈s, a〉 is visited at least once. During each full interval, the largest error in the table is reduced by a factor of γ.
• Let Q̂_n be the table after n updates, and Δ_n the maximum error in Q̂_n:

  Δ_n = max_{s,a} | Q̂_n(s, a) − Q(s, a) |

Convergence theorem
• For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

  | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a′} Q̂_n(s′, a′)) − (r + γ max_{a′} Q(s′, a′)) |
                              = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |


Convergence theorem

  | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a′} Q̂_n(s′, a′)) − (r + γ max_{a′} Q(s′, a′)) |
                              = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
                              ≤ γ max_{a′} | Q̂_n(s′, a′) − Q(s′, a′) |
                              ≤ γ max_{s″,a′} | Q̂_n(s″, a′) − Q(s″, a′) |

  Hence | Q̂_{n+1}(s, a) − Q(s, a) | ≤ γ Δ_n

Note that we used the general fact that

  | max_a f_1(a) − max_a f_2(a) | ≤ max_a | f_1(a) − f_2(a) |


Non-deterministic case
• What if the reward and the next state are non-deterministic?
• We redefine V and Q by taking expected values:

  V^π(s) ≡ E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ] ≡ E[ Σ_{i=0}^{∞} γ^i r_{t+i} ]

  Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

Nondeterministic case
Q learning generalizes to nondeterministic worlds. Alter the training rule to

  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a′} Q̂_{n−1}(s′, a′) ]

where

  α_n = 1 / (1 + visits_n(s, a))

Convergence of Q̂ to Q can be proved [Watkins and Dayan, 1992].
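A sketch of this decaying-learning-rate update (the noisy environment below is hypothetical; the rule and the α_n schedule follow the slide, with the discount γ included as in the deterministic case):

import random
from collections import defaultdict

gamma = 0.9
Q = defaultdict(float)        # current estimate Q_hat_{n-1}(s, a)
visits = defaultdict(int)     # visits_n(s, a)

def q_update(s, a, reward, s_next, actions):
    """Q_n(s,a) <- (1 - a_n) Q_{n-1}(s,a) + a_n [r + gamma max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha_n = 1.0 / (1.0 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha_n) * Q[(s, a)] + alpha_n * target

# Hypothetical noisy transition: reward for (0, "go") is 0 or 1 at random; state 1 has value 0.
for _ in range(1000):
    q_update(0, "go", random.choice([0.0, 1.0]), 1, ["go"])
print(round(Q[(0, "go")], 2))   # settles near the expected reward, 0.5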


Sarsa
Always update the policy to be greedy with respect to the current action-value estimate; the update itself backs up the value of the action actually taken in the next state (on-policy TD control).
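The Sarsa pseudocode itself is not reproduced on the slide; as a hedged sketch, the standard on-policy update uses the action a′ actually selected (e.g., ε-greedily) in the next state rather than the max:

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: Q(s,a) <- Q(s,a) + alpha [r + gamma Q(s',a') - Q(s,a)],
    where a' is the action the behaving policy actually takes in s'."""
    Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Hypothetical table and transition:
Q = {(0, "go"): 0.0, (1, "go"): 0.5}
sarsa_update(Q, 0, "go", 1.0, 1, "go")
print(Q[(0, "go")])   # 0.0 + 0.1 * (1.0 + 0.9 * 0.5 - 0.0) = 0.145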

Temporal Difference Learning

• Temporal Difference (TD) learning methods
  – can be used when accurate models of the environment are unavailable, i.e., neither the state transition function nor the reward function is known
  – can be extended to work with implicit representations of action-value functions
  – are among the most useful reinforcement learning methods

Q-learning: The simplest TD Method, TD(0)

[Backup diagram for TD(0): from state s_t, a single sampled transition leads via reward r_{t+1} to the next state s_{t+1}; T marks terminal states.]


Example – TD-Gammon

• Learn to play Backgammon (Tesauro, 1995)
• Immediate reward:
  • +100 if win
  • −100 if lose
  • 0 for all other states
• Trained by playing 1.5 million games against itself
• Now comparable to the best human players

Temporal difference learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:

  Q^{(1)}(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

  Q^{(2)}(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)

Or n?

  Q^{(n)}(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

  Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^{(1)}(s_t, a_t) + λ Q^{(2)}(s_t, a_t) + λ² Q^{(3)}(s_t, a_t) + … ]


Temporal difference learning

  Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^{(1)}(s_t, a_t) + λ Q^{(2)}(s_t, a_t) + λ² Q^{(3)}(s_t, a_t) + … ]

Equivalent recursive expression:

  Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]

• The TD(λ) algorithm uses the above training rule
• Sometimes converges faster than Q learning
• Converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm
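A sketch that computes the n-step estimates Q^{(n)} and their λ-weighted blend from a short recorded trajectory (the trajectory, the Q̂ table, and the renormalization used to truncate the infinite blend are all illustrative assumptions):

# n-step lookahead estimates Q^(n) and their lambda-weighted blend,
# computed from a finite recorded trajectory (hypothetical values).
gamma, lam = 0.9, 0.5
actions = ["left", "right"]
Q_hat = {(s, a): 0.5 for s in ["s1", "s2", "s3"] for a in actions}  # current estimates
rewards = [0.0, 0.0, 1.0]         # r_t, r_{t+1}, r_{t+2}
next_states = ["s1", "s2", "s3"]  # s_{t+1}, s_{t+2}, s_{t+3}

def q_n(n):
    """Q^(n) = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n max_a Q_hat(s_{t+n}, a)."""
    ret = sum((gamma ** k) * rewards[k] for k in range(n))
    return ret + (gamma ** n) * max(Q_hat[(next_states[n - 1], a)] for a in actions)

N = len(rewards)
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, N + 1)]
# Truncated blend, renormalized over the available n-step estimates:
q_lambda = sum(w * q_n(n) for w, n in zip(weights, range(1, N + 1))) / sum(weights)
print([round(q_n(n), 3) for n in range(1, N + 1)], round(q_lambda, 3))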

Advantages of TD Learning

• TD methods do not require a model of the environment, only experience
• TD methods can be fully incremental
  – You can learn before knowing the final outcome
    • less memory
    • less peak computation
  – You can learn without the final outcome
    • from incomplete sequences
• TD converges (under reasonable assumptions)


Optimality of TD(0)

• Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
• Compute updates according to TD(0), but only update the estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.

Handling Large State Spaces

• Replace the Q̂ table with a neural network or other function approximator

• Virtually any function approximator would work provided it can be updated in an online fashion

Learning state-action values

• Training examples of the form:

  〈 (s_t, a_t), v_t 〉

• The general gradient-descent rule:

  θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t)
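With a linear approximator Q_t(s, a) = θᵀφ(s, a), the gradient ∇_θ Q_t(s, a) is just the feature vector φ(s, a); a sketch of the resulting update (the feature map and numbers are hypothetical):

# Gradient-descent update for a linear action-value approximator:
# theta <- theta + alpha * (v_t - Q_theta(s,a)) * grad_theta Q_theta(s,a),
# where Q_theta(s,a) = theta . phi(s,a) and the gradient is phi(s,a).
alpha = 0.1

def phi(s, a):
    """Hypothetical feature vector for a state-action pair."""
    return [1.0, float(s), 1.0 if a == "right" else 0.0]

def q_value(theta, s, a):
    return sum(t * f for t, f in zip(theta, phi(s, a)))

def gd_update(theta, s, a, v_target):
    error = v_target - q_value(theta, s, a)
    return [t + alpha * error * f for t, f in zip(theta, phi(s, a))]

theta = [0.0, 0.0, 0.0]
theta = gd_update(theta, 2, "right", v_target=1.0)
print(theta)   # [0.1, 0.2, 0.1]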

Gradient Descent Methods

  θ_t = ( θ_t(1), θ_t(2), …, θ_t(n) )^T

Assume V_t is a (sufficiently smooth) differentiable function of θ_t, for all s ∈ S.

Assume, for now, training examples of this form:

  { s_t , V^π(s_t) }


Gradient Descent

  θ_{t+1} = θ_t − ½ α ∇_θ MSE(θ_t)
          = θ_t − ½ α ∇_θ Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ]²
          = θ_t + α Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ] ∇_θ V_t(s)

Gradient Descent

Use just the per-sample gradient:

  θ_{t+1} = θ_t − ½ α ∇_θ [ V^π(s_t) − V_t(s_t) ]²
          = θ_t + α [ V^π(s_t) − V_t(s_t) ] ∇_θ V_t(s_t)

Since each per-sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t:

  E{ [ V^π(s_t) − V_t(s_t) ] ∇_θ V_t(s_t) } = Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ] ∇_θ V_t(s)

But We Don't have these Targets

Suppose we just have targets v_t instead:

  θ_{t+1} = θ_t + α [ v_t − V_t(s_t) ] ∇_θ V_t(s_t)

If each v_t is an unbiased estimate of V^π(s_t), i.e., E{v_t} = V^π(s_t), then gradient descent converges to a local minimum (provided α decreases appropriately).

What about TD(λ) Targets?

  θ_{t+1} = θ_t + α [ R^λ_t − V_t(s_t) ] ∇_θ V_t(s_t)

which is implemented incrementally using an eligibility trace e_t:

  θ_{t+1} = θ_t + α δ_t e_t,  where:

  δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t), as usual, and
  e_t = γ λ e_{t−1} + ∇_θ V_t(s_t)
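A sketch of these updates for a linear state-value approximator V_t(s) = θᵀφ(s), so that ∇_θ V_t(s) = φ(s) (the feature map, trajectory, and parameter values are hypothetical):

# TD(lambda) with an accumulating eligibility trace for a linear value function:
#   delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
#   e_t     = gamma * lambda * e_{t-1} + grad_theta V(s_t)   (= phi(s_t) here)
#   theta   = theta + alpha * delta_t * e_t
alpha, gamma, lam = 0.1, 0.9, 0.8

def phi(s):
    """Hypothetical two-dimensional state features."""
    return [1.0, float(s)]

def value(theta, s):
    return sum(t * f for t, f in zip(theta, phi(s)))

theta = [0.0, 0.0]
e = [0.0, 0.0]
transitions = [(0, 0.0, 1), (1, 1.0, 2)]   # (s_t, r_{t+1}, s_{t+1}), illustrative

for s, reward, s_next in transitions:
    delta = reward + gamma * value(theta, s_next) - value(theta, s)
    e = [gamma * lam * ei + fi for ei, fi in zip(e, phi(s))]
    theta = [t + alpha * delta * ei for t, ei in zip(theta, e)]
print([round(t, 3) for t in theta])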


Linear Gradient Descent Watkins’ Q(λ)

Not covered

• Using hierarchical state and action representations
• Coping with partial observability
• Coping with extremely large state spaces
• Neural basis of reinforcement learning
