
Agents that Learn: Reinforcement Learning

Vasant Honavar
Artificial Intelligence Research Laboratory
College of Information Sciences and Technology and Graduate Program, The Huck Institutes of the Life Sciences
Pennsylvania State University

[email protected] http://faculty.ist.psu.edu/vhonavar

Agent in an environment

[Diagram: the agent acts on the environment; the environment returns a state and a reward or punishment.]

Learning from Interaction with the world

• An agent receives sensations (percepts) from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment.
• The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (pain) as it stumbles along in an a priori unknown, uncertain environment.

Markov Decision Processes
• Assume
  – a finite set of states S
  – a finite set of actions A
• At each discrete time step:

– the agent observes state s_t ∈ S and chooses action a_t ∈ A, receiving immediate reward r_t
– the environment state changes to s_{t+1}

• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r may not be known to the agent; this is the reinforcement learning setting
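As a sketch of this formulation (not from the slides; the corridor layout, state set, and reward values below are hypothetical), a deterministic MDP is just a pair of functions δ(s, a) and r(s, a) over a finite S and A:

# A minimal deterministic MDP: finite states, finite actions,
# a transition function delta(s, a) and a reward function r(s, a).
# The 1-D "corridor" layout below is purely illustrative.
STATES = [0, 1, 2, 3]          # state 3 is an absorbing goal state
ACTIONS = ["left", "right"]

def delta(s, a):
    """Deterministic state transition delta(s, a) -> next state."""
    if s == 3:                  # absorbing goal state
        return 3
    return max(0, s - 1) if a == "left" else min(3, s + 1)

def r(s, a):
    """Immediate reward r(s, a): +1 for stepping into the goal."""
    return 1.0 if (s, a) == (2, "right") else 0.0

# One step of agent-environment interaction:
s, a = 0, "right"
print(s, a, r(s, a), delta(s, a))   # -> 0 right 0.0 1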

Acting rationally in the presence of delayed rewards

Agent and environment interact at discrete time steps t = 0, 1, 2, …
The agent observes the state at step t: s_t ∈ S,
produces an action at step t: a_t ∈ A(s_t),
gets the resulting reward r_{t+1} ∈ ℝ,
and the resulting next state s_{t+1}.

[Diagram: trajectory … s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → …]


Key elements of an RL System

• Policy – what to do
• Reward – what is good
• Value – what is good because it predicts reward
• Model of the environment – what follows what

The Agent Uses a Policy to select actions

Policy at step t, π_t: a mapping from states to action probabilities

  π_t(s, a) = probability that a_t = a when s_t = s

• A rational agent’s goal is to get as much reward as it can over the long run


Goals and Rewards

• Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal is typically outside the agent's direct control.
• The agent must be able to measure success:
  • explicitly
  • frequently during its lifespan

Rewards for Continuing Tasks

Continuing tasks: interaction does not have natural episodes

Cumulative discounted reward

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate.
  shortsighted 0 ← γ → 1 farsighted
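As a quick illustration (a sketch, not part of the slides), the discounted return for a finite reward sequence can be computed by truncating the sum:

def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, truncated to the rewards given."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))   # shortsighted: 1.0
print(discounted_return(rewards, 0.9))   # farsighted: 1 + 0.9 + 0.81 + 0.729 = 3.439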

Rewards

Suppose the sequence of rewards after step t is:

  r_{t+1}, r_{t+2}, r_{t+3}, …

What do we want to maximize?

In general, we want to maximize the expected return, E{R_t}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game or trips through a maze.

  R_t = r_{t+1} + r_{t+2} + … + r_T,

where T is the final time step at which a terminal state is reached, ending the episode.

Markov Decision Processes

• If a reinforcement learning task has the Markov Property, it is called a Markov Decision Process (MDP).
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to specify:
  – the state and action sets;
  – the one-step dynamics defined by transition probabilities:

    P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }   ∀ s, s′ ∈ S, a ∈ A(s)

  – the expected rewards:

    R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }   ∀ s, s′ ∈ S, a ∈ A(s)

Example – Pole Balancing Task

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
  reward = +1 for each step before failure ⇒ return = number of steps before failure
As a continuing task with discounted return:
  reward = −1 upon failure, 0 otherwise ⇒ return = −γ^k, for k steps before failure
In either case, the return is maximized by avoiding failure for as long as possible.

Example – Driving task

Get to the top of the hill as quickly as possible.

reward = −1 for each step when not at top of hill ⇒ return = − number of steps before reaching top of hill

Return is maximized by minimizing the number of steps taken to reach the top of the hill


Finite MDP Example

Recycling Robot
• At each step, the robot has to decide whether to (a) actively search for a can, (b) wait for someone to bring it a can, or (c) go to its home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions are made on the basis of the current energy level: high or low.
• Reward = number of cans collected

Some Notable RL Applications

• TD-Gammon – world's best backgammon program
• Elevator scheduling
• Inventory management – 10%–15% improvement over state-of-the-art methods
• Dynamic channel assignment – high-performance assignment of radio channels to mobile telephone calls

The Markov Property

• By the state at step t, we mean whatever information is available to the agent at step t about its environment.
• The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all essential information – it should have the Markov Property:

  Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0 } = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }

  for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0.

Reinforcement learning

• The learner is not told which actions to take
• Rewards and punishments may be delayed
  – Sacrifice short-term gains for greater long-term gains
• There is a need to trade off exploration against exploitation
• The environment may be unobservable or only partially observable
• The environment may be deterministic or stochastic


Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy π:

  V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

• The value of taking an action in a state under policy π is the expected cumulative reward starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:

  Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }

Bellman Equation for a Policy π

The basic idea:

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + …
      = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + … )
      = r_{t+1} + γ R_{t+1}

So:

  V^π(s) = E_π{ R_t | s_t = s }
         = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

  V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
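The Bellman equation suggests a simple fixed-point computation when P^a_{ss′} and R^a_{ss′} are known. The sketch below runs iterative policy evaluation on a hypothetical two-state, two-action MDP (all numbers are illustrative, not from the lecture):

# Iterative policy evaluation: repeatedly apply the Bellman equation
# V(s) <- sum_a pi(s,a) sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))
# Hypothetical MDP: states 0, 1; actions 0, 1.
P = {0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
     1: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.5}}}
R = {0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}},
     1: {0: {0: 0.0, 1: 0.0}, 1: {0: 2.0, 1: 0.0}}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}   # equiprobable policy
gamma = 0.9

V = {s: 0.0 for s in P}
for _ in range(1000):
    V = {s: sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                               for s2 in P)
                for a in P[s])
         for s in P}
print(V)   # approximate V^pi for the equiprobable policy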

Optimal Value Functions
• For finite MDPs, policies can be partially ordered: π ≥ π′ if and only if V^π(s) ≥ V^{π′}(s) for all s ∈ S.
• There is always at least one policy (possibly many) that is better than or equal to all the others; this is an optimal policy. We denote them all π*.
• Optimal policies share the same optimal state-value function:
  V*(s) = max_π V^π(s) for all s ∈ S
• Optimal policies also share the same optimal action-value function:
  Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
  This is the expected cumulative reward for taking action a in state s and thereafter following an optimal policy.


Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, a one-step-ahead search produces the long-term optimal actions.

Example: Tic-Tac-Toe

[Diagram: game tree of tic-tac-toe positions, alternating X moves and O moves.]

Assume an imperfect opponent: it sometimes makes mistakes.

A Simple RL Approach to Tic-Tac-Toe

Make a table with one entry per state: V(s) = estimated probability of winning.
[Table sketch: ordinary states start at V(s) = 0.5; states with three x's in a row (a win) get V(s) = 1; states with three o's in a row (a loss) get V(s) = 0; a drawn position is shown with V(s) = 0.5.]

Play lots of games. To pick our moves, look ahead one step from the current state to the possible next states:
• Pick the next state with the highest estimated probability of winning, i.e., the largest V(s) – a greedy move.
• Occasionally pick a move at random – an exploratory move.

Learning Rule for Tic-Tac-Toe

[Diagram: a sequence of positions during one game, alternating our moves (mostly greedy, one marked as an exploratory move) and the opponent's moves; s denotes the state before our greedy move and s′ the state after it.]

We increment each V(s) toward V(s′) – a backup:

  V(s) ← V(s) + α [ V(s′) − V(s) ]
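A sketch of this backup in code; the board encoding and the default value are hypothetical, and only the update rule itself comes from the slide:

# TD-style backup for the tic-tac-toe value table:
# V(s) <- V(s) + alpha * (V(s') - V(s)), applied after each greedy move.
from collections import defaultdict

V = defaultdict(lambda: 0.5)    # estimated probability of winning; 0.5 by default
alpha = 0.1

def backup(s, s_next):
    """Move V(s) a fraction alpha of the way toward V(s')."""
    V[s] += alpha * (V[s_next] - V[s])

# States are represented here simply as strings of the 9 board cells (hypothetical).
s, s_next = "x.o......", "x.o..x..."
V[s_next] = 1.0                 # suppose s' turned out to be a winning position
backup(s, s_next)
print(V[s])                     # 0.5 + 0.1 * (1.0 - 0.5) = 0.55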

Why is Tic-Tac-Toe Too Easy?

• The number of states is small and finite
• One-step look-ahead is always possible
• The state is completely observable


What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

  π*(s) = argmax_{a ∈ A(s)} Q*(s, a)

Simpler Setting: The n-Armed Bandit Problem

• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where

  E{ r_t | a_t } = Q*(a_t)

• The distribution of r_t depends only on a_t
• There is no dependence on state
• The objective is to maximize the reward in the long term, e.g., over 1000 plays

The Exploration – Exploitation Dilemma

• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at time t is a*_t = argmax_a Q_t(a)
  a_t = a*_t ⇒ exploitation
  a_t ≠ a*_t ⇒ exploration
• You can't exploit all the time; you can't explore all the time
• You can never stop exploring, but you could reduce exploring

Action-Value Methods

• Adapt action-value estimates and nothing else.

• Suppose that by the t-th play, action a has been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}. Then

  Q_t(a) = ( r_1 + r_2 + … + r_{k_a} ) / k_a

  lim_{k_a → ∞} Q_t(a) = Q*(a)

ε-Greedy Action Selection
§ Greedy:
  a_t = a*_t = argmax_a Q_t(a)

§ ε-Greedy:
  a_t = a*_t with probability 1 − ε
  a_t = a random action with probability ε

§ Boltzmann:
  Pr(choosing action a at time t) = e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}
  where τ is the computational temperature
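A sketch of these selection rules applied to a set of current estimates Q_t(a) (the bandit values below are hypothetical; the ε-greedy and Boltzmann formulas are the ones on the slide):

import math, random

Q = [0.2, 0.5, 0.1]            # current action-value estimates Q_t(a) for n = 3 arms

def epsilon_greedy(Q, epsilon=0.1):
    """With prob. 1 - eps pick argmax_a Q(a); otherwise a uniformly random action."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def boltzmann(Q, tau=0.5):
    """Pick action a with probability exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    weights = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=weights, k=1)[0]

print(epsilon_greedy(Q), boltzmann(Q))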

Solving the Bellman Optimality Equation

• Finding an optimal policy by solving the Bellman Optimality Equation requires:
  – accurate knowledge of the environment dynamics;
  – enough space and time to do the computation;
  – the Markov Property.
• How much space and time do we need?
  – polynomial in the number of states (via dynamic programming methods),
  – BUT the number of states is often huge
• We usually have to settle for approximations
• Many RL methods can be understood as approximately solving the Bellman Optimality Equation
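As one concrete instance, when the dynamics are known the Bellman optimality equation can be solved approximately by value iteration, sketched below on the same hypothetical two-state MDP used in the earlier policy-evaluation sketch (all numbers illustrative):

# Value iteration: V(s) <- max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))
# Each sweep costs time polynomial in |S| and |A|, but |S| is often huge in practice.
P = {0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
     1: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.5}}}
R = {0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}},
     1: {0: {0: 0.0, 1: 0.0}, 1: {0: 2.0, 1: 0.0}}}
gamma = 0.9

V = {s: 0.0 for s in P}
for _ in range(1000):
    V = {s: max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P)
                for a in P[s])
         for s in P}
# A greedy policy with respect to the converged V is (approximately) optimal.
greedy = {s: max(P[s], key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                         for s2 in P))
          for s in P}
print(V, greedy)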


Agents that Act Rationally in the presence of delayed rewards

Agent and environment interact at discrete time steps t = 0, 1, 2, …
The agent observes the state at step t: s_t ∈ S,
produces an action at step t: a_t ∈ A(s_t),
gets the resulting reward r_{t+1} ∈ ℝ,
and the resulting next state s_{t+1}.

[Diagram: trajectory … s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → …]

The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities

  π_t(s, a) = probability that a_t = a when s_t = s

• Reinforcement learning methods specify how the agent changes its policy as a result of its experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.

Reinforcement learning

[Diagram: the agent acts on the environment; the environment returns a state and a reward.]

Markov Decision Processes
• Assume
  – a finite set of states S
  – a finite set of actions A
• At each discrete time step:

– the agent observes state s_t ∈ S and chooses action a_t ∈ A, receiving immediate reward r_t
– the environment state changes to s_{t+1}

• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r may not be known to the agent; this is the reinforcement learning setting

Agent's learning task

• Execute actions in the environment and observe the results
• Learn an action policy π : S → A that maximizes

  E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ]

  from any starting state in S
• Here 0 ≤ γ < 1 is the discount factor for future rewards

• The target function to be learned is π : S → A
• But we have no training examples of the form 〈s, a〉
• Training examples are of the form 〈〈s, a〉, r〉

Reinforcement learning problem

• Goal: learn to choose actions that maximize r_0 + γ r_1 + γ² r_2 + …, where 0 ≤ γ < 1

Value function

• To begin with, consider deterministic worlds…
• For each possible policy π the agent might adopt, we can define an evaluation function over states:

  V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + … ≡ Σ_{i=0}^{∞} γ^i r_{t+i}

  where r_t, r_{t+1}, … are generated by following policy π starting at state s
• Restated, the task is to learn the optimal policy π*:

  π* ≡ argmax_π V^π(s), (∀s)


What to learn
• We might try to have the agent learn the evaluation function V^{π*} (which we write as V*)
• It could then do a look-ahead search to choose the best action from any state s, because

  π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

A problem:
• This works if the agent knows δ : S × A → S and r : S × A → ℝ
• But when it doesn't, it can't choose actions this way


Action-Value function – Q function
• Define a new function very similar to V*:

  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

• If the agent learns Q, it can choose the optimal action even without knowing δ:

  π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
  π*(s) ≡ argmax_a Q(s, a)

• Q is the evaluation function the agent will learn

Training rule to learn Q
• Note that Q and V* are closely related:

  V*(s) = max_{a′} Q(s, a′)

• This allows us to write Q recursively as

  Q(s_1, a_1) = r(s_1, a_1) + γ V*(δ(s_1, a_1))
              = r(s_1, a_1) + γ max_{a′} Q(δ(s_1, a_1), a′)

• Let Q̂ denote the learner's current approximation to Q, and consider the training rule

  Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

  where s′ is the state resulting from applying action a in state s.

Q-Learning

  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Q Learning for Deterministic Worlds
• For each s, a, initialize the table entry Q̂(s, a) ← 0
• Do forever:
  – Observe the current state s
  – Select an action a and execute it
  – Receive immediate reward r
  – Observe the new state s′
  – Update the table entry for Q̂(s, a) as follows:
      Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  – Update the state: s ← s′
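A sketch of this algorithm in Python on a hypothetical four-state deterministic corridor (the environment, exploration scheme, and episode count are illustrative; the update line is the one above):

import random
from collections import defaultdict

# Deterministic corridor environment (hypothetical): states 0..3, goal at 3.
ACTIONS = ["left", "right"]
def delta(s, a):
    if s == 3:
        return 3
    return max(0, s - 1) if a == "left" else min(3, s + 1)
def r(s, a):
    return 1.0 if (s, a) == (2, "right") else 0.0

gamma = 0.9
Q = defaultdict(float)                    # table entries Q_hat(s, a), initialized to 0

for episode in range(200):
    s = 0
    while s != 3:
        a = random.choice(ACTIONS)        # any exploration that visits all (s, a) works here
        reward, s_next = r(s, a), delta(s, a)
        # Deterministic Q-learning update: Q_hat(s,a) <- r + gamma * max_a' Q_hat(s',a')
        Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print({k: round(v, 2) for k, v in sorted(Q.items())})
# Expected to approach Q(2,"right") = 1.0, Q(1,"right") = 0.9, Q(0,"right") = 0.81.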

Updating Q

  Q̂(s_1, a_right) ← r + γ max_{a′} Q̂(s_2, a′)
                 ← 0 + 0.9 × max{63, 81, 100}
                 ← 90

Notice that if the rewards are non-negative, then
  (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)   and
  (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Convergence theorem

• Theorem: Q̂ converges to Q.
• Consider the case of a deterministic world with bounded immediate rewards, where each 〈s, a〉 is visited infinitely often.
• Proof: Define an interval during which each 〈s, a〉 is visited at least once. During each full interval, the largest error in the table is reduced by a factor of γ.
• Let Q̂_n be the table after n updates, and Δ_n the maximum error in Q̂_n:

  Δ_n = max_{s,a} | Q̂_n(s, a) − Q(s, a) |

Convergence theorem
• For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

  | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a′} Q̂_n(s′, a′)) − (r + γ max_{a′} Q(s′, a′)) |
                              = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |


Convergence theorem

  | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a′} Q̂_n(s′, a′)) − (r + γ max_{a′} Q(s′, a′)) |
                              = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
                              ≤ γ max_{a′} | Q̂_n(s′, a′) − Q(s′, a′) |
                              ≤ γ max_{s″,a′} | Q̂_n(s″, a′) − Q(s″, a′) |

  Hence | Q̂_{n+1}(s, a) − Q(s, a) | ≤ γ Δ_n

Note that we used the general fact that

  | max_a f_1(a) − max_a f_2(a) | ≤ max_a | f_1(a) − f_2(a) |


Non-deterministic case
• What if the reward and the next state are non-deterministic?
• We redefine V and Q by taking expected values:

  V^π(s) ≡ E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ] ≡ E[ Σ_{i=0}^{∞} γ^i r_{t+i} ]

  Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

Nondeterministic case
Q learning generalizes to nondeterministic worlds. Alter the training rule to

  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a′} Q̂_{n−1}(s′, a′) ]

where

  α_n = 1 / (1 + visits_n(s, a))

Convergence of Q̂ to Q can be proved [Watkins and Dayan, 1992].
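A sketch of this decaying-learning-rate update (the noisy environment below is hypothetical; the rule and the α_n schedule follow the slide, with the discount γ included as in the deterministic case):

import random
from collections import defaultdict

gamma = 0.9
Q = defaultdict(float)        # current estimate Q_hat_{n-1}(s, a)
visits = defaultdict(int)     # visits_n(s, a)

def q_update(s, a, reward, s_next, actions):
    """Q_n(s,a) <- (1 - a_n) Q_{n-1}(s,a) + a_n [r + gamma max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha_n = 1.0 / (1.0 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha_n) * Q[(s, a)] + alpha_n * target

# Hypothetical noisy transition: reward for (0, "go") is 0 or 1 at random; state 1 has value 0.
for _ in range(1000):
    q_update(0, "go", random.choice([0.0, 1.0]), 1, ["go"])
print(round(Q[(0, "go")], 2))   # settles near the expected reward, 0.5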


Sarsa
Always update the policy to be greedy with respect to the current action-value estimate; the update itself backs up the value of the action actually taken in the next state (on-policy TD control).
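The Sarsa pseudocode itself is not reproduced on the slide; as a hedged sketch, the standard on-policy update uses the action a′ actually selected (e.g., ε-greedily) in the next state rather than the max:

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: Q(s,a) <- Q(s,a) + alpha [r + gamma Q(s',a') - Q(s,a)],
    where a' is the action the behaving policy actually takes in s'."""
    Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Hypothetical table and transition:
Q = {(0, "go"): 0.0, (1, "go"): 0.5}
sarsa_update(Q, 0, "go", 1.0, 1, "go")
print(Q[(0, "go")])   # 0.0 + 0.1 * (1.0 + 0.9 * 0.5 - 0.0) = 0.145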

Temporal Difference Learning

• Temporal Difference (TD) learning methods
  – can be used when accurate models of the environment are unavailable, i.e., neither the state transition function nor the reward function is known
  – can be extended to work with implicit representations of action-value functions
  – are among the most useful reinforcement learning methods

Q-learning: The simplest TD Method, TD(0)

[Backup diagram for TD(0): from state s_t, a single sampled transition leads via reward r_{t+1} to the next state s_{t+1}; T marks terminal states.]


Example – TD-Gammon

• Learn to play Backgammon (Tesauro, 1995)
• Immediate reward:
  • +100 if win
  • −100 if lose
  • 0 for all other states
• Trained by playing 1.5 million games against itself
• Now comparable to the best human players

Temporal difference learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:

  Q^{(1)}(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

  Q^{(2)}(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)

Or n?

  Q^{(n)}(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

  Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^{(1)}(s_t, a_t) + λ Q^{(2)}(s_t, a_t) + λ² Q^{(3)}(s_t, a_t) + … ]


Temporal difference learning

  Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^{(1)}(s_t, a_t) + λ Q^{(2)}(s_t, a_t) + λ² Q^{(3)}(s_t, a_t) + … ]

Equivalent recursive expression:

  Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]

• The TD(λ) algorithm uses the above training rule
• Sometimes converges faster than Q learning
• Converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm
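A sketch that computes the n-step estimates Q^{(n)} and their λ-weighted blend from a short recorded trajectory (the trajectory, the Q̂ table, and the renormalization used to truncate the infinite blend are all illustrative assumptions):

# n-step lookahead estimates Q^(n) and their lambda-weighted blend,
# computed from a finite recorded trajectory (hypothetical values).
gamma, lam = 0.9, 0.5
actions = ["left", "right"]
Q_hat = {(s, a): 0.5 for s in ["s1", "s2", "s3"] for a in actions}  # current estimates
rewards = [0.0, 0.0, 1.0]         # r_t, r_{t+1}, r_{t+2}
next_states = ["s1", "s2", "s3"]  # s_{t+1}, s_{t+2}, s_{t+3}

def q_n(n):
    """Q^(n) = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n max_a Q_hat(s_{t+n}, a)."""
    ret = sum((gamma ** k) * rewards[k] for k in range(n))
    return ret + (gamma ** n) * max(Q_hat[(next_states[n - 1], a)] for a in actions)

N = len(rewards)
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, N + 1)]
# Truncated blend, renormalized over the available n-step estimates:
q_lambda = sum(w * q_n(n) for w, n in zip(weights, range(1, N + 1))) / sum(weights)
print([round(q_n(n), 3) for n in range(1, N + 1)], round(q_lambda, 3))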

Advantages of TD Learning

• TD methods do not require a model of the environment, only experience
• TD methods can be fully incremental
  – You can learn before knowing the final outcome
    • less memory
    • less peak computation
  – You can learn without the final outcome
    • from incomplete sequences
• TD converges (under reasonable assumptions)


Optimality of TD(0)

• Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
• Compute updates according to TD(0), but only update the estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.

Handling Large State Spaces

• Replace the Q̂ table with a neural network or other function approximator

• Virtually any function approximator would work provided it can be updated in an online fashion

Learning state-action values

• Training examples of the form:

  〈 (s_t, a_t), v_t 〉

• The general gradient-descent rule:

  θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t)
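With a linear approximator Q_t(s, a) = θᵀφ(s, a), the gradient ∇_θ Q_t(s, a) is just the feature vector φ(s, a); a sketch of the resulting update (the feature map and numbers are hypothetical):

# Gradient-descent update for a linear action-value approximator:
# theta <- theta + alpha * (v_t - Q_theta(s,a)) * grad_theta Q_theta(s,a),
# where Q_theta(s,a) = theta . phi(s,a) and the gradient is phi(s,a).
alpha = 0.1

def phi(s, a):
    """Hypothetical feature vector for a state-action pair."""
    return [1.0, float(s), 1.0 if a == "right" else 0.0]

def q_value(theta, s, a):
    return sum(t * f for t, f in zip(theta, phi(s, a)))

def gd_update(theta, s, a, v_target):
    error = v_target - q_value(theta, s, a)
    return [t + alpha * error * f for t, f in zip(theta, phi(s, a))]

theta = [0.0, 0.0, 0.0]
theta = gd_update(theta, 2, "right", v_target=1.0)
print(theta)   # [0.1, 0.2, 0.1]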

Gradient Descent Methods

  θ_t = ( θ_t(1), θ_t(2), …, θ_t(n) )^T

Assume V_t is a (sufficiently smooth) differentiable function of θ_t, for all s ∈ S.

Assume, for now, training examples of this form:

  { s_t , V^π(s_t) }


Gradient Descent

  θ_{t+1} = θ_t − ½ α ∇_θ MSE(θ_t)
          = θ_t − ½ α ∇_θ Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ]²
          = θ_t + α Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ] ∇_θ V_t(s)

Gradient Descent

Use just the per-sample gradient:

  θ_{t+1} = θ_t − ½ α ∇_θ [ V^π(s_t) − V_t(s_t) ]²
          = θ_t + α [ V^π(s_t) − V_t(s_t) ] ∇_θ V_t(s_t)

Since each per-sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t:

  E{ [ V^π(s_t) − V_t(s_t) ] ∇_θ V_t(s_t) } = Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ] ∇_θ V_t(s)

But We Don't have these Targets

Suppose we just have targets v_t instead:

  θ_{t+1} = θ_t + α [ v_t − V_t(s_t) ] ∇_θ V_t(s_t)

If each v_t is an unbiased estimate of V^π(s_t), i.e., E{v_t} = V^π(s_t), then gradient descent converges to a local minimum (provided α decreases appropriately).

What about TD(λ) Targets?

  θ_{t+1} = θ_t + α [ R^λ_t − V_t(s_t) ] ∇_θ V_t(s_t)

which is implemented incrementally using an eligibility trace e_t:

  θ_{t+1} = θ_t + α δ_t e_t,  where:

  δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t), as usual, and
  e_t = γ λ e_{t−1} + ∇_θ V_t(s_t)
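A sketch of these updates for a linear state-value approximator V_t(s) = θᵀφ(s), so that ∇_θ V_t(s) = φ(s) (the feature map, trajectory, and parameter values are hypothetical):

# TD(lambda) with an accumulating eligibility trace for a linear value function:
#   delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
#   e_t     = gamma * lambda * e_{t-1} + grad_theta V(s_t)   (= phi(s_t) here)
#   theta   = theta + alpha * delta_t * e_t
alpha, gamma, lam = 0.1, 0.9, 0.8

def phi(s):
    """Hypothetical two-dimensional state features."""
    return [1.0, float(s)]

def value(theta, s):
    return sum(t * f for t, f in zip(theta, phi(s)))

theta = [0.0, 0.0]
e = [0.0, 0.0]
transitions = [(0, 0.0, 1), (1, 1.0, 2)]   # (s_t, r_{t+1}, s_{t+1}), illustrative

for s, reward, s_next in transitions:
    delta = reward + gamma * value(theta, s_next) - value(theta, s)
    e = [gamma * lam * ei + fi for ei, fi in zip(e, phi(s))]
    theta = [t + alpha * delta * ei for t, ei in zip(theta, e)]
print([round(t, 3) for t in theta])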


Linear Gradient Descent Watkins’ Q(λ)

Not covered

• Using hierarchical state and action representations
• Coping with partial observability
• Coping with extremely large state spaces
• Neural basis of reinforcement learning
