Reinforcement Learning

PROF XIAOHUI XIE SPRING 2019

CS 273P: Machine Learning and Data Mining

Slides courtesy of Sameer Singh, David Silver

Machine Learning

Intro to Reinforcement Learning

Markov Processes

Markov Reward Processes

Markov Decision Processes

2 Reinforcement Learning

3 What makes it different?

• No direct supervision, only rewards
• Feedback is delayed, not instantaneous
• Time really matters, i.e. data is sequential
• Agent’s actions affect what data it will receive

Examples
• Fly stunt maneuvers in a helicopter
• Defeat the world champion at Backgammon or Go
• Manage an investment portfolio
• Control a power station
• Make a humanoid robot walk
• Play many different Atari games better than humans

4 Agent-Environment Interface

Agent

• decides on an action
• receives next observation
• receives next reward

Environment

• executes the action
• computes next observation
• computes next reward
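A minimal sketch of this interaction loop; the env.reset, env.step, and agent.act interfaces are hypothetical placeholders for illustration, not a specific library's API.

```python
# Sketch of the agent-environment loop described above (hypothetical interfaces).

def run_episode(env, agent, max_steps=1000):
    """Alternate between the agent acting and the environment responding."""
    obs, reward = env.reset(), 0.0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(obs, reward)        # agent decides on an action
        obs, reward, done = env.step(action)   # env computes next observation and reward
        total_reward += reward                 # the agent tries to maximize this sum
        if done:
            break
    return total_reward
```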

5 Reward, Rt

How well the agent is doing: +, positive (Good); −, negative (Bad).
Says nothing about WHY it is doing well, and could have little to do with At-1.

Agent is trying to maximize its cumulative reward

6 Example of Rewards

• Fly stunt maneuvers in a helicopter
  • +ve reward for following desired trajectory
  • −ve reward for crashing
• Defeat the world champion at Backgammon
  • +/−ve reward for winning/losing a game
• Manage an investment portfolio
  • +ve reward for each $ in bank
• Control a power station
  • +ve reward for producing power
  • −ve reward for exceeding safety thresholds
• Make a humanoid robot walk
  • +ve reward for forward motion
  • −ve reward for falling over
• Play many different Atari games better than humans
  • +/−ve reward for increasing/decreasing score

7 Sequential Decision Making

Actions have long term consequences

Rewards may be delayed

May be better to sacrifice short term reward for long term benefit

Examples
• A financial investment (may take months to mature)
• Refueling a helicopter (might prevent a crash later)
• Blocking opponent moves (might eventually help win)
• Spend a lot of money and go to college (earn more later)
• Don’t commit crimes (rewarded by not going to jail)
• Get started on final project early (avoid stress later)

A key aspect of intelligence: How far ahead are you able to plan?

8 Reinforcement Learning

Given an environment (produces observations and rewards)

Reinforcement Learning

Automated agent that selects actions to maximize total rewards in the environment

9 Let’s look at the Agent

What does the choice of action depend on?

• Can you ignore Ot completely?

• Is just Ot enough? Or (Ot,At)?
• Is it the last few observations?
• Is it all observations so far?

10 Agent State, St

History: everything that happened so far

Ht = O1 R1 A1 O2 R2 A2 O3 R3 … At-1 Ot Rt

State St can be:
• Ot
• Ot Rt
• At-1 Ot Rt
• Ot-3 Ot-2 Ot-1 Ot

In general, St = f(Ht). You, as AI designer, specify this function.
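As a hedged illustration, one possible choice of f: keep only the last few observations (the Ot-3 Ot-2 Ot-1 Ot option above), as is often done for frame-based games.

```python
# Hypothetical state function S_t = f(H_t): the k most recent observations.

def make_state(history, k=4):
    """history: list of (observation, reward, action) tuples, oldest first."""
    recent = history[-k:]
    return tuple(obs for (obs, reward, action) in recent)
```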

11 Agent Policy, π

The policy π maps the current state to the next action: St → π → At

Good policy: leads to larger cumulative reward
Bad policy: leads to smaller cumulative reward
(we will explore this later)
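As an illustration (with made-up states and actions loosely echoing the Student example later in the lecture), a policy can be a deterministic mapping or a distribution over actions per state.

```python
import random

# Hypothetical deterministic policy: a fixed mapping from state to action.
deterministic_policy = {"Class1": "Study", "Class2": "Study"}

# Hypothetical stochastic policy: a distribution over actions for each state.
stochastic_policy = {
    "Class1": {"Study": 0.9, "Facebook": 0.1},
    "Class2": {"Study": 0.8, "Facebook": 0.2},
}

def sample_action(policy, state):
    """Draw A_t ~ pi(.|S_t) from a stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```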

12 Example: Atari

Rules are unknown
• What makes the score increase?
Dynamics are unknown
• How do actions change pixels?

13 Video Time!

https://www.youtube.com/watch?v=V1eYniJ0Rnk

14 Example: Robotic Soccer https://www.youtube.com/watch?v=CIF2SBVY-J0

15 AlphaGo

https://www.youtube.com/watch?v=I2WFvGl4y8c

16 Overview of Lecture

States → Markov Processes

+ Rewards → Markov Reward Processes

+ Actions → Markov Decision Processes

17 Machine Learning

Intro to Reinforcement Learning

Markov Processes

Markov Reward Processes

Markov Decision Processes

18 Markov Property

19 Markov Property

20 Markov Property

21 State Transition Matrix

22 State Transition Matrix

23 Markov Processes

24 Markov Processes

25 Student

26 Student MC: Episodes

27 Student MC: Episodes

28 Student MC: Transition Matrix

29 Machine Learning

Intro to Reinforcement Learning

Markov Processes

Markov Reward Processes

Markov Decision Processes

30 Markov Reward Process

31 Markov Reward Process

32 The Student MRP

33 Return

34 Return

35 Return

36 Why discount?

37 Why discount?

38 Why discount?

39 Why discount?

40 Why discount?

41 Value Function

42 Value Function

43 Student MRP: Returns

44 Student MRP: Returns

45 Student MRP: Value Function

46 Student MRP: Value Function

47 Student MRP: Value Function

48 Bellman Equations for MRP

49 Backup Diagrams for MRP

50 Student MRP: Bellman Eq

51 Matrix Form of Bellman Eq

52 Solving the Bellman Equation

53 Solving the Bellman Equation

v2(C1) = -2 + γ ( 0.5 v1(FB) + 0.5 v1(C2) )
v2(FB) = -1 + γ ( 0.9 v1(FB) + 0.1 v1(C1) )  …
v3(FB) = -1 + γ ( 0.9 v2(FB) + 0.1 v2(C1) )  …

γ = 0.5   C1      C2      C3     Pass   Pub    FB      Sleep
t=1       -2      -2      -2     10     1      -1      0
t=2       -2.75   -2.8    1.2    10     0      -1.55   0
t=3       -3.09   -1.52   1      10     0.41   -1.83   0
t=4       -2.84   -1.6    1.08   10     0.59   -1.98   0
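The sweeps above can be reproduced in a few lines of NumPy. The transition probabilities below are an assumption (the standard Student MRP from David Silver's lecture, not shown in the extracted slide); with them the iterates match the table, and because the MRP Bellman equation is linear it can also be solved directly.

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
R = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)
# Assumed Student MRP transition matrix (rows sum to 1).
P = np.array([
    #  C1   C2   C3   Pass Pub  FB   Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
])
gamma = 0.5

# Iterative Bellman backups: v_{k+1} = R + gamma * P * v_k
v = np.zeros(len(states))
for t in range(1, 5):
    v = R + gamma * P @ v
    print(f"t={t}", np.round(v, 2))

# Linear equation, so a direct solution also exists: v = (I - gamma*P)^(-1) R
v_exact = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
```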

55 Machine Learning

Intro to Reinforcement Learning

Markov Processes

Markov Reward Processes

Markov Decision Processes

56

57 Markov Decision Process

58 The Student MDP

59 Policies

60 Policies

61 Policies

62 MPs → MRPs → MDPs

63 MPs → MRPs → MDPs

64 Value Function

65 Value Function

66 Student MDP: Value Function

67 Bellman Expectation Equation

68 Bellman Expectation Equation

69 Bellman Expectation Equation, V

70 Bellman Expectation Equation, Q

71 Bellman Expectation Equation, V

72 Bellman Expectation Equation, Q

73 Student MDP: Bellman Exp Eq.

74 Bellman Exp Eq: Matrix Form

75 Optimal Value Function

76 Optimal Value Function

77 Optimal Value Function

78 Student MDP: Optimal V

79 Student MDP: Optimal Q

80 Optimal Policy

81 Optimal Policy

82 Finding Optimal Policy

83 Student MDP: Optimal Policy

84 Bellman Optimality Eq, V

85 Bellman Optimality Eq, Q

86 Bellman Optimality Eq, V

87 Bellman Optimality Eq, Q

88 Student MDP: Bellman Optimality

90 Solving Bellman Equations in MDP

Not easy:
● Not a linear equation
● No “closed-form” solutions
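Because the max over actions makes the equation nonlinear, it is solved iteratively instead, e.g. by value iteration (previewed in the overview that follows). A minimal sketch, assuming a hypothetical MDP encoding P[s][a] = list of (probability, next state, reward); this data layout and the toy MDP are illustrations, not from the slides.

```python
import numpy as np

def value_iteration(P, num_states, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until the values converge."""
    v = np.zeros(num_states)
    while True:
        v_new = np.array([
            max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s])
            for s in range(num_states)
        ])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Tiny hypothetical 2-state MDP for illustration.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 5.0)]},
    1: {"stay": [(1.0, 1, 1.0)]},
}
print(value_iteration(P, num_states=2))
```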

90 Overview

MDPs: States, Transitions, Actions, Rewards

Prediction: Given a policy π, estimate the state value function and action value function

Control: Estimate the optimal value functions and the optimal policy

Does the agent know the MDP?
• Yes! It’s “planning”: the agent knows everything
• No! It’s “Model-free RL”: the agent observes everything as it goes

91 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Q-Learning

92 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration    (“Planning”)
MDP Unknown   MC and TD Learning      Q-Learning

93 Dynamic Programming

94 Requirements for Dynamic Programming

95 Requirements for Dynamic Programming

96 Planning by Dynamic Programming

97 Applications of DPs

98 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Sarsa + Q-Learning

99 Iterative Policy Evaluation

100 Iterative Policy Evaluation

101 Iterative Policy Evaluation

102 Iterative Policy Evaluation

103 Random Policy: Grid World

104 Policy Evaluation: Grid World

Time 0: do nothing, stop; no cost.

Time 1: move (reward -1); then use the k=0 values. Unless in goal: reward 0.

Time 2: move (reward -1); then use the k=1 values.
  Most states: move (-1) + [v1 = -1] = -2
  Some states (next to the goal): move (-1) + ¾ [v1 = -1] + ¼ [v1 = 0] = -1.75
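A minimal sketch of these sweeps, assuming the standard small gridworld setup (4x4 grid, terminal corners, reward -1 per move, uniform random policy over four moves); the grid itself is an assumption, only the -1 / -2 / -1.75 pattern above comes from the slide.

```python
import numpy as np

N = 4
terminals = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, m):
    """Move within the grid; bumping into a wall leaves the state unchanged."""
    r, c = s[0] + m[0], s[1] + m[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

v = np.zeros((N, N))
for k in range(1, 3):                       # two sweeps: k=1, k=2
    v_new = np.zeros((N, N))
    for r in range(N):
        for c in range(N):
            if (r, c) in terminals:
                continue                    # goal states keep value 0
            # Uniform random policy: average over the four moves.
            v_new[r, c] = np.mean([-1 + v[step((r, c), m)] for m in moves])
    v = v_new
    print(f"k={k}\n", np.round(v, 2))       # k=1: -1 everywhere; k=2: -2 and -1.75
```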

105 Policy Evaluation: Grid World

106 Policy Evaluation: Grid World

107 Policy Evaluation: Grid World

In general: this gives the best policy & value for “one step, then follow the random policy” (always a better policy than random!)

108 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Sarsa + Q-Learning

109 Improving a Policy!

110 Improving a Policy!

111 Improving a Policy!

112 Policy Iteration

113 Jack’s Car Rental

114 Policy Iteration in Car Rental

115 Policy Improvement

116 Policy Improvement

117 Policy Improvement

118 Modified Policy Iteration

119 Modified Policy Iteration

120 Modified Policy Iteration

121 Modified Policy Iteration

122 Generalized Policy Iteration

123 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Sarsa + Q-Learning

124 Monte Carlo RL

125 Monte Carlo RL

126 Monte Carlo RL

127 Monte Carlo Policy Evaluation

128 Every-Visit MC Policy Evaluation

Equivalent, “incremental tracking” form:
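A minimal sketch of this incremental form, assuming episodes are given as lists of (state, reward) pairs where each reward is the one received after leaving that state; the data format and function name are hypothetical.

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0):
    """Every-visit Monte Carlo policy evaluation with an incremental mean update."""
    V = defaultdict(float)     # value estimate per state
    N = defaultdict(int)       # visit count per state
    for episode in episodes:
        G = 0.0
        # Walk backwards so G is the return from each visited state onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # incremental tracking of the mean
    return V
```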

Looks like SGD to minimize MSE from the mean value…

129 Blackjack Example


130 Blackjack Value Function


131 Blackjack Value Function


132 Temporal Difference Learning

133 Temporal Difference Learning

134 MC and TD

135 MC and TD

136 MC and TD

137 MC and TD

138 Driving Home Example

139 Driving Home: MC vs TD

140 Driving Home: MC vs TD

141 Finite Episodes: AB Example

MC & TD can give different answers on fixed data:

V(B) = 6 / 8

V(A) = 0 ? (Direct MC estimate)

V(A) = 6 / 8? (TD estimate)
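These numbers can be reproduced assuming the standard eight-episode dataset for this example (the episodes are not listed on the slide): one episode A,0,B,0; six episodes B,1; one episode B,0.

```python
# Assumed eight episodes, each a list of (state, reward) pairs.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Monte Carlo: average the observed returns from each visit to A.
returns_A = [sum(r for _, r in ep) for ep in episodes if ep[0][0] == "A"]
V_A_mc = sum(returns_A) / len(returns_A)    # = 0

# TD / certainty-equivalence: fit the Markov model, then bootstrap.
rewards_B = [r for ep in episodes for s, r in ep if s == "B"]
V_B = sum(rewards_B) / len(rewards_B)       # = 6/8
V_A_td = 0 + 1.0 * V_B                      # A always transitions to B with reward 0
print(V_A_mc, V_B, V_A_td)                  # 0.0, 0.75, 0.75
```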

142 MC vs TD

Monte Carlo:
• Wait till end of episode to learn
• Only for terminating worlds
• High variance, low bias
• Not sensitive to initial value
• Good convergence properties
• Doesn’t exploit Markov property
• Minimizes squared error

Temporal Difference:
• Learn online after every step
• Non-terminating worlds ok
• Low variance, high bias
• Sensitive to initial value
• Much more efficient
• Exploits Markov property
• Maximizes log-likelihood

143 Unified View: Monte Carlo

144 Unified View: TD Learning

145 Unified View: Dynamic Prog.

146 Unified View of RL (Prediction)

147 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Sarsa + Q-Learning

148 Model-free Control

Learn a policy π to maximize rewards in the environment

149 On and Off Policy Learning

150 Generalized Policy Iteration

151 Gen Policy Improvement?

152 Not quite!

153 Learn Q function directly…

154 Greedy Action Selection?

155 Greedy Action Selection?

156 Greedy Action Selection?

157 ∊-Greedy Exploration

158 ∊-Greedy Exploration

159 Which Policy Evaluation?

160 Sarsa: TD for Policy Evaluation

161 On-Policy Control w/ Sarsa

162 Off-Policy Learning

163 Off-Policy Learning

164 Q-Learning

165 Q-Learning

166 Q-Learning

167 Off-Policy w/ Q-Learning

168 Off-Policy w/ Q-Learning

169 Off-Policy w/ Q-Learning

170 Q-Learning Control

(SARSAMAX)

171 Relation between DP and TD

172 Update Eqns for DP and TD

173