
Function approximation

Mario Martin

CS-UPC

April 15, 2020

Goal of this lecture

Methods we have seen so far work well when we have a tabular representation for each state, that is, when we represent the value function with a lookup table. This is not reasonable in most cases:

- In large state spaces: there are too many states and/or actions to store in memory (e.g., Backgammon: 10^20 states, Go: 10^170 states)
- And in continuous state spaces (e.g., robotic examples)

In addition, we want to generalize from/to similar states to speed up learning. It is too slow to learn the value of each state individually.

We will now see methods to learn policies for large state spaces by using function approximation to estimate value functions:

Vθ(s) ≈ V^π(s)    (1)
Qθ(s, a) ≈ Q^π(s, a)    (2)

θ is the set of parameters of the function approximation method (with size much smaller than |S|). Function approximation allows us to generalize from seen states to unseen states and to save space. Now, instead of storing V values, we will update the θ parameters using MC or TD learning so that they fulfill (1) or (2).

Which Function Approximation?

There are many function approximators, e.g.

- Artificial neural network
- Decision tree
- Nearest neighbor
- Fourier/wavelet bases
- Coarse coding

In principle, any function approximator can be used. However, the choice may be affected by some properties of RL:

- Experience is not i.i.d.: the agent's actions affect the subsequent data it receives
- During control, the value function V(s) changes with the policy (non-stationarity)

Incremental methods

Which Function Approximation?

Incremental methods allow us to directly apply the control methods of MC, Q-learning and Sarsa; that is, the backup is done using the "on-line" sequence of data from the trial collected by the agent while following the policy. The most popular method in this setting is gradient descent, because it adapts to changes in the data (non-stationary condition).

Gradient Descent

Let L(θ) be a differentiable function of the parameter vector θ that we want to minimize. Define the gradient of L(θ) to be:

∇θ L(θ) = ( ∂L(θ)/∂θ_1, …, ∂L(θ)/∂θ_n )^T

To find a local minimum of L(θ), the gradient descent method adjusts the parameters in the direction of the negative gradient:

∆θ = −(1/2) α ∇θ L(θ)

where α is a step-size parameter.


Value Function Approx. by SGD

Minimizing the error of the approximation
Goal: find the parameter vector θ that minimizes the mean-squared error between the approximate value function Vθ(s) and the true value function V^π(s):

L(θ) = E_π[(V^π(s) − Vθ(s))^2] = Σ_{s∈S} μ^π(s) [V^π(s) − Vθ(s)]^2

where μ^π(s) is the time spent in state s while following π.

Gradient descent finds a local minimum:

∆θ = −(1/2) α ∇θ L(θ) = α E_π[(V^π(s) − Vθ(s)) ∇θ Vθ(s)]

Stochastic gradient descent (SGD) samples the gradient:

∆θ = α (V^π(s) − Vθ(s)) ∇θ Vθ(s)

Subsection 1

Linear approximation

Linear representation of the state

Represent the state by a feature vector:

φ(s) = ( φ_1(s), …, φ_n(s) )^T

Represent the value function by a linear combination of features:

Vθ(s) = φ(s)^T θ = Σ_{j=1}^{n} φ_j(s) θ_j    (3)

Linear representation of the state

For example:

- Distance of the robot from landmarks
- Trends in the stock market
- Piece and pawn configurations in chess

Example: RoboCup soccer keepaway (Stone, Sutton & Kuhlmann, 2005)

The state is encoded in 13 continuous variables:
- 11 distances among the players, the ball, and the center of the field
- 2 angles to takers along passing lanes

Linear representation of the state

Table lookup is a special case of linear value function approximation, using table-lookup features:

φ^table(S) = ( 1(S = s_1), …, 1(S = s_n) )^T

The parameter vector is then exactly the value of each individual state:

Vθ(S) = ( 1(S = s_1), …, 1(S = s_n) ) · ( θ_1, …, θ_n )^T
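To make this special case concrete, here is a minimal Python sketch (the state indices and the values stored in θ are made up for illustration) showing that with one-hot features the linear value estimate reduces to reading a table entry:

import numpy as np

n_states = 5
theta = np.array([0.0, 1.5, -2.0, 0.3, 4.1])   # one parameter per state

def phi_table(s):
    """One-hot 'table lookup' feature vector: 1(S = s_i) for each state."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

s = 2
v = phi_table(s) @ theta        # linear value estimate Vθ(s) = φ(s)^T θ
assert v == theta[s]            # ...is exactly the stored table entry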

Linear representation of the state

Another obvious way of reducing the number of states is to group some of them using a grid. The drawback is that all states in the same cell are treated as equal and you don't learn "softly" from neighboring cells. A better approach is coarse coding: coarse coding provides a large feature vector φ(s) whose features "overlap".

Caution
When using linear FA, we should ask ourselves whether V can be approximated by a linear function and what the error of this approximation is. Usually value functions are smooth (compared with the reinforcement function). However, the linear FA approximation error can be large, depending on the features selected.

Coarse coding using RBFs

Each circle is a Radial Basis Function (with center c_i and width σ) that represents a feature. The value of each feature is:

φ_i(s) = exp( −‖s − c_i‖^2 / (2σ^2) )

Coarse coding using RBFs

Parameters of the encoding:

- Number of RBFs (density) and their positions (c)
- Radius of the RBFs (width σ)
- Possibly a different width for each variable of the state
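As an illustration, here is a minimal Python sketch of RBF coarse coding under assumed settings (a 2-D state space in [0, 1]^2, a 5x5 grid of centers, a shared width σ); the function name rbf_features and the concrete numbers are mine, not from the slides:

import numpy as np

# Hypothetical 2-D continuous state space in [0, 1]^2, with RBF centers on a 5x5 grid
centers = np.array([[i / 4.0, j / 4.0] for i in range(5) for j in range(5)])
sigma = 0.15                      # shared width; could differ per state variable

def rbf_features(s):
    """phi_i(s) = exp(-||s - c_i||^2 / (2 sigma^2)) for every RBF center c_i."""
    d2 = np.sum((centers - s) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

theta = np.zeros(len(centers))    # one weight per feature
s = np.array([0.3, 0.7])
value = rbf_features(s) @ theta   # linear value estimate Vθ(s)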

Coarse coding using Tiles

RBFs return a real value for each feature. Tiles define a binary feature for each tile.

- Binary features mean the weighted sum is easy to compute
- The number of features active at any time step is constant
- It is easy to compute the indexes of the active features

You can use irregular tilings or a superposition of several offset tilings.
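A minimal sketch of such a superposition of offset tilings, under assumed settings (a 2-D state in [0, 1]^2, four offset 8x8 tilings); the helper name active_tiles and the offset scheme are illustrative choices, not the slides':

import numpy as np

# Hypothetical 2-D state in [0, 1]^2, covered by several slightly shifted tilings
n_tilings, tiles_per_dim = 4, 8

def active_tiles(s):
    """Return one active tile index per tiling (a sparse binary feature vector)."""
    idx = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)               # shift each tiling slightly
        coords = np.floor((s + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        flat = coords[0] * tiles_per_dim + coords[1]           # tile index within this tiling
        idx.append(t * tiles_per_dim ** 2 + flat)              # global feature index
    return idx

theta = np.zeros(n_tilings * tiles_per_dim ** 2)
value = sum(theta[i] for i in active_tiles(np.array([0.3, 0.7])))  # weighted sum is cheap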

Going back to SGD

First nice property of SGD with linear F.A.
In the case of linear function approximation, the objective function is quadratic:

L(θ) = E_π[(V^π(s) − φ(s)^T θ)^2]

so SGD converges to the global optimum. Notice equation (3).

Why does it converge? A quadratic objective (a paraboloid) has no local minima other than the global one.

Going back to SGD

Second nice property of SGD with linear F.A.
The gradient of the value function is just the vector of feature values:

∂Vθ(s)/∂θ_i = ∂( Σ_{j=1}^{n} φ_j(s) θ_j )/∂θ_i = φ_i(s)

So the update rule is particularly simple:

∆θ_i = α (V^π(s) − Vθ(s)) φ_i(s)
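As a sketch, this update can be written as a one-line function, assuming the feature vector and a target value standing in for V^π(s) are already available:

import numpy as np

def linear_sgd_update(theta, phi_s, target, alpha=0.1):
    """One SGD step: theta <- theta + alpha * (target - phi(s)^T theta) * phi(s)."""
    error = target - phi_s @ theta     # "target" stands in for V^pi(s)
    return theta + alpha * error * phi_s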

Subsection 2

Prediction for linear case

Prediction algorithms for the linear case

So far we have assumed that the true value function V^π(s) is given in some way.

But in RL we only have the rewards of the trials, not supervised examples. Solution: substitute V^π(s) by a target that is an estimate of it. In practice,

- For MC, the target is the return R_t
- For TD(0), the target is given by the Bellman equation

MC prediction for the linear case

The long-term return of a trial, R_t, is an unbiased, noisy sample of the true value V^π(s).

Using R_t as a target, we have linear Monte-Carlo policy evaluation:

∆θ = α (R_t − Vθ(s)) ∇θ Vθ(s) = α (R_t − Vθ(s)) φ(s)

Monte-Carlo evaluation converges to the optimum (the θ with globally minimal error). Moreover, MC converges even when using non-linear value function approximation, but in that case to a local optimum.

MC prediction algorithm for the linear case

Monte Carlo policy evaluation
Given π, the policy to be evaluated, initialize parameters θ as appropriate (e.g., θ = 0)
repeat
  Generate a trial using π
  for each s_t in the trial do
    R_t ← return following the first occurrence of s_t
    θ ← θ + α (R_t − Vθ(s_t)) φ(s_t)   // Notice θ and φ are vectors
  end for
until false
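A possible Python rendering of this algorithm, assuming a Gym-like episodic environment whose env.reset() returns a state and env.step(a) returns (next state, reward, done), plus a policy function and a feature map phi; for brevity it uses every-visit updates rather than the first-visit rule of the pseudocode above:

import numpy as np

def mc_policy_evaluation(env, policy, phi, n_features, episodes=1000,
                         alpha=0.01, gamma=1.0):
    """Gradient MC prediction with linear FA (every-visit variant)."""
    theta = np.zeros(n_features)
    for _ in range(episodes):
        states, rewards, s, done = [], [], env.reset(), False
        while not done:                          # generate a trial following the policy
            a = policy(s)
            s_next, r, done = env.step(a)        # assumed Gym-like API
            states.append(s)
            rewards.append(r)
            s = s_next
        G = 0.0
        for t in reversed(range(len(states))):   # returns computed backwards
            G = rewards[t] + gamma * G
            f = phi(states[t])
            theta += alpha * (G - f @ theta) * f  # theta <- theta + alpha (R_t - V(s_t)) phi(s_t)
    return theta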

TD prediction algorithm for the linear case

Changes when applying TD to the linear case:

1. The function approximation is now for the Q value function:

   Q^π(s, a) ≈ Qθ(s, a) = φ(s, a)^T θ = Σ_{j=1}^{n} φ_j(s, a) θ_j

2. The loss function is also now for the Q value function:

   L(θ) = E_π[(Q^π(s, a) − φ(s, a)^T θ)^2]

TD prediction algorithm for the linear case

In TD(0) we use the Q value of the next state to estimate the Q value of the current state via the Bellman equation. So, in general,

∆θ = α (Q^π(s, a) − Qθ(s, a)) ∇θ Qθ(s, a) = α (r + γ Qθ(s', π(s')) − Qθ(s, a)) ∇θ Qθ(s, a)

And, in particular, for the linear case:

∂Qθ(s, a)/∂θ_i = ∂( Σ_{j=1}^{n} φ_j(s, a) θ_j )/∂θ_i = φ_i(s, a)

and so,

∆θ_i = α (r + γ Qθ(s', π(s')) − Qθ(s, a)) φ_i(s, a)

TD prediction algorithm for the linear case

Caution! We no longer have the guarantees that MC had, because a bootstrapped estimate of Q(s_t, a) is used as the target. Notice that TD targets are not independent of the parameters. In TD(0), the target

r + γ Qθ(s', π(s'))

depends on θ. Bootstrapping methods are not true gradient descent: they take into account the effect of changing θ on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, we call them semi-gradient methods.

However, it can be proved that linear TD(0) policy evaluation converges (close) to the global optimum.

TD(0) prediction algorithm for the linear case

TD(0) policy evaluation
Given π, initialize parameters θ arbitrarily (e.g., θ = 0)
repeat
  s ← initial state of episode
  repeat
    a ← π(s)
    Take action a and observe s' and r
    θ ← θ + α (r + γ Qθ(s', π(s')) − Qθ(s, a)) φ(s, a)
    s ← s'
  until s is terminal
until convergence
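A sketch of this semi-gradient TD(0) evaluation in Python, under the same assumed Gym-like environment interface as before and a state-action feature map phi_sa (both names are assumptions of this sketch):

import numpy as np

def td0_q_evaluation(env, policy, phi_sa, n_features, episodes=500,
                     alpha=0.05, gamma=0.99):
    """Semi-gradient TD(0) evaluation of Q^pi with linear FA: Q(s,a) = phi(s,a)^T theta."""
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)             # assumed Gym-like API
            q_sa = phi_sa(s, a) @ theta
            q_next = 0.0 if done else phi_sa(s_next, policy(s_next)) @ theta
            # Bootstrapped target; the gradient ignores the target's dependence on theta
            theta += alpha * (r + gamma * q_next - q_sa) * phi_sa(s, a)
            s = s_next
    return theta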

Subsection 3

Control algorithms for the linear case


Like the control methods we used with tabular learning algorithms, we will build algorithms that iterate the two following steps:
1. Policy evaluation: follow a method for approximate policy evaluation, Qθ ≈ Q^π
2. Policy improvement: do policy improvement of the policy

Depending on the policy evaluation procedure used (MC, TD, etc.), we get a different method.

Examples of Control using PI

Linear FA Monte Carlo
Initialize parameters θ as appropriate (e.g., θ = 0)
repeat
  Generate a trial using the ε-greedy policy derived from Qθ
  for each s_t in the trial do
    R_t ← return following the first occurrence of s_t
    θ ← θ + α (R_t − Qθ(s_t, a_t)) φ(s_t, a_t)
  end for
until false

Function Qθ(s, a)
Given θ, state s and action a
return θ^T φ(s, a)

Select action following policy

Function πθ(s)
Given θ and state s
return arg max_a θ^T φ(s, a)

Function implementing ε-greedy
Given θ, ε ≤ 1 and state s
Select p from a uniform distribution in the range [0, 1]
if p ≤ ε then
  a ← random action from A
else
  a ← πθ(s)
end if
return a
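These two helpers might look as follows in Python; the names q_value and epsilon_greedy are mine, and a finite action list is assumed:

import numpy as np

def q_value(theta, phi_sa, s, a):
    """Q_theta(s, a) = theta^T phi(s, a)."""
    return theta @ phi_sa(s, a)

def epsilon_greedy(theta, phi_sa, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() <= epsilon:
        return actions[np.random.randint(len(actions))]
    return max(actions, key=lambda a: q_value(theta, phi_sa, s, a))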

Examples of Control using PI

Linear FA Q-learning
Initialize parameters θ arbitrarily (e.g., θ = 0)
for each episode do
  Choose initial state s
  repeat
    Choose a from s using the policy πθ derived from Qθ (e.g., ε-greedy)
    Execute action a, observe r, s'
    θ ← θ + α (r + γ Qθ(s', πθ(s')) − Qθ(s, a)) φ(s, a)
    s ← s'
  until s is terminal
end for
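A hedged Python sketch of this semi-gradient Q-learning loop, again assuming a Gym-like environment interface, a finite action list, and a state-action feature map phi_sa:

import numpy as np

def linear_q_learning(env, phi_sa, actions, n_features, episodes=500,
                      alpha=0.05, gamma=0.99, epsilon=0.1):
    """Semi-gradient Q-learning with linear FA and an epsilon-greedy behaviour policy."""
    theta = np.zeros(n_features)
    q = lambda s, a: phi_sa(s, a) @ theta
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if np.random.rand() <= epsilon:                     # epsilon-greedy action
                a = actions[np.random.randint(len(actions))]
            else:
                a = max(actions, key=lambda b: q(s, b))
            s_next, r, done = env.step(a)                       # assumed Gym-like API
            target = r if done else r + gamma * max(q(s_next, b) for b in actions)
            theta += alpha * (target - q(s, a)) * phi_sa(s, a)  # semi-gradient update
            s = s_next
    return theta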

Example of Control using PI

Linear FA Sarsa: on-line learning
Initialize parameters θ arbitrarily (e.g., θ = 0)
for each episode do
  Choose initial state s
  Choose a from s using the policy derived from Qθ (e.g., ε-greedy)
  repeat
    Execute action a, observe r, s'
    Choose a' from s' using the policy derived from Qθ (e.g., ε-greedy)
    θ ← θ + α (r + γ Qθ(s', a') − Qθ(s, a)) φ(s, a)
    s ← s'; a ← a'
  until s is terminal
end for

Convergence of methods

Experiments show that it is desirable to bootstrap (TD is in practice better than MC). But now we should consider convergence issues: when do incremental prediction algorithms converge?

- When using bootstrapping (i.e., TD with λ < 1)?
- When using linear function approximation?
- When using off-policy learning?
- When using non-linear approximation?

Ideally, we would like algorithms that converge in all cases.

Convergence of Gradient methods

We have examples of TD divergence even when the exact solution is representable with a linear function. Fortunately, in practice, TD(0) works well... but we don't have guarantees. The problem can be solved if we update the parameters following an on-policy distribution (we have a proof of that), which is good for Sarsa. Unfortunately, convergence guarantees for incremental TD methods only hold for linear approximation. The main cause is that TD does not follow the true gradient.

Deadly triad

The risk of divergence arises whenever we combine three things:
- Function approximation: significantly generalizing from large numbers of examples.
- Bootstrapping: learning value estimates from other value estimates, as in dynamic programming and temporal-difference learning.
- Off-policy learning: learning about a policy from data not generated by that policy, as in Q-learning, where we learn about the greedy policy from data collected with a necessarily more exploratory policy.

Any two without the third is ok.

Convergence of incremental Gradient methods for prediction

Convergence of incremental Gradient methods for control

Conclusions and final notes about convergence

Value-function approximation by stochastic gradient descent enables RL to be applied to arbitrarily large state spaces. Most algorithms just carry over the targets from the tabular case. With bootstrapping (TD), we don't get true gradient descent methods:
- this complicates the analysis,
- but the linear, on-policy case is still guaranteed to converge,
- and learning is still much faster.

For continuous state spaces, coarse/tile coding is a good strategy.

There are still other possible approaches: Gradient-TD (convergent for off-policy linear FA) and batch methods.

Mountain Car demonstration

Q-learning with linear FA and semi-gradient methods:
- Aggregating states
- Tiling

Batch methods

Batch Reinforcement Learning

Gradient descent is simple and appealing:
- It is computationally efficient (one update per sample)
- ... but it is not sample efficient (it does not extract all the information from the samples)

We can do better at the cost of more computation time.

Batch methods seek to find the best-fitting value function for the given experience of the agent ("training data") in a supervised way.

Subsection 1

Least Squares (LS) Prediction and Least Squares Policy Iteration (LSPI)

Least Squares Prediction

Given a value function approximation method with parameters θ, and experience D consisting of (state, value) pairs:

D = {⟨s_1, V_1^π⟩, ⟨s_2, V_2^π⟩, …, ⟨s_T, V_T^π⟩}

find the parameters θ that give the best fit of the value function Vθ(s). The least squares algorithm finds the parameter vector θ minimizing the sum-squared error between Vθ and the target values V^π:

LS(θ) = Σ_{t=1}^{T} (V_t^π − Vθ(s_t))^2 = E_D[(V_t^π − Vθ(s_t))^2]

SGD with Experience Replay

Given experience consisting of (state, value) pairs,

D = {⟨s_1, V_1^π⟩, ⟨s_2, V_2^π⟩, …, ⟨s_T, V_T^π⟩}

Repeat:
1. Sample a (state, value) pair from the experience: ⟨s_i, V_i^π⟩
2. Apply the stochastic gradient descent update:

   ∆θ = α (V_i^π − Vθ(s_i)) ∇θ Vθ(s_i)

This converges to the least squares solution:

θ^π = arg min_θ LS(θ)

Prediction

Experience replay finds the least squares solution, but it may take many iterations. Using linear value function approximation, Vθ(s) = φ(s)^T θ, we can solve for the least squares solution directly.

Linear Least Squares Prediction

At the minimum of LS(θ), the expected update must be zero:

E_D[∆θ] = 0
α Σ_{t=1}^{T} φ(s_t) (V_t^π − φ(s_t)^T θ) = 0
Σ_{t=1}^{T} φ(s_t) V_t^π = Σ_{t=1}^{T} φ(s_t) φ(s_t)^T θ
θ = ( Σ_{t=1}^{T} φ(s_t) φ(s_t)^T )^{−1} Σ_{t=1}^{T} φ(s_t) V_t^π

For N features, the direct solution time is O(N^3). This is extensible to the Q value function and (s, a) pairs.
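In code, the direct solution is a single linear solve. A minimal numpy sketch (the small ridge term reg is an extra assumption added for numerical stability, not part of the slides):

import numpy as np

def linear_ls_prediction(Phi, v_targets, reg=1e-6):
    """Solve theta = (sum phi phi^T)^-1 sum phi V^pi directly.

    Phi:        T x N matrix whose rows are phi(s_t)
    v_targets:  length-T vector of target values (e.g., MC returns for LSMC)
    reg:        small ridge term for invertibility (assumption of this sketch)
    """
    A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])   # sum_t phi(s_t) phi(s_t)^T
    b = Phi.T @ v_targets                          # sum_t phi(s_t) V_t
    return np.linalg.solve(A, b)                   # O(N^3) direct solution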

Linear Least Squares Prediction

We do not know the true values V_t^π. As always, we substitute V_t^π in the equation by an estimate obtained from samples:
- LSMC (Least Squares Monte-Carlo) uses the return: V_t^π ≈ R_t
- LSTD (Least Squares Temporal-Difference) uses the TD target: V_t^π ≈ r_t + γ Vθ(s_{t+1})

The data in D are now trials obtained following policy π. In each case we solve directly for the fixed point of MC / TD.

Linear Least Squares Prediction for TD(0)

LSTDQ algorithm: solve for total update = zero

α Σ_{t=1}^{T} φ(s_t, a_t) (V_t^π − φ(s_t, a_t)^T θ) = 0
α Σ_{t=1}^{T} φ(s_t, a_t) (r_{t+1} + γ φ(s_{t+1}, π(s_{t+1}))^T θ − φ(s_t, a_t)^T θ) = 0
θ = ( Σ_{t=1}^{T} φ(s_t, a_t) (φ(s_t, a_t) − γ φ(s_{t+1}, π(s_{t+1})))^T )^{−1} Σ_{t=1}^{T} φ(s_t, a_t) r_{t+1}
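A possible numpy sketch of LSTDQ over a batch of transitions; the data layout (s, a, r, s', done) and the ridge term are assumptions of this sketch, not the slides':

import numpy as np

def lstdq(data, phi_sa, policy, n_features, gamma=0.99, reg=1e-6):
    """LSTDQ: solve A theta = b for the policy to be evaluated.

    data: iterable of (s, a, r, s_next, done) tuples collected with any behaviour policy
    reg:  small ridge term added for invertibility (assumption of this sketch)
    """
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next, done in data:
        f = phi_sa(s, a)
        f_next = np.zeros(n_features) if done else phi_sa(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)   # sum phi (phi - gamma phi')^T
        b += f * r                             # sum phi r
    return np.linalg.solve(A, b)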

Convergence of LS for prediction

Linear Least Squares Policy Iteration (LSPI)

How do we turn the least squares prediction algorithm into a control algorithm?

LSPI Algorithm: two iterated steps
- Policy evaluation: policy evaluation by least squares Q-learning
- Policy improvement: greedy policy improvement

As in Q-learning, we now use the Q value function to get rid of the transition probabilities.

Linear Least Squares Policy Iteration (LSPI)

For policy evaluation, we want to use all the experience efficiently. For control, we also want to improve the policy. This experience is generated by many policies, so to evaluate Q^π we must learn off-policy. We use the same idea as Q-learning:

- Use experience generated by an old policy: S_t, A_t, R_{t+1}, S_{t+1} ∼ π_old
- Consider the alternative successor action A' = π_new(S_{t+1})
- Update Qθ(S_t, A_t) towards the value of the alternative action: R_{t+1} + γ Qθ(S_{t+1}, A')

Linear Least Squares Policy Iteration (LSPI)

LSPI-TD uses LSTDQ for policy evaluation. It repeatedly re-evaluates the experience D with different policies. D is obtained with any probabilistic policy (e.g., a random policy).

LSPI-TD algorithm
Given D, initialize π'
repeat
  π ← π'
  θ ← LSTDQ(D, π)
  for each s ∈ S do
    π'(s) ← arg max_{a∈A} φ(s, a)^T θ   // i.e., arg max_{a∈A} Qθ(s, a)
  end for
until π ≈ π'
return π
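A sketch of the LSPI loop in Python; it reuses the lstdq function from the previous sketch, assumes a finite action list, and compares parameter vectors (rather than policies) as a stopping test:

import numpy as np

def lspi(data, phi_sa, actions, n_features, gamma=0.99, max_iters=50, tol=1e-4):
    """LSPI: iterate LSTDQ policy evaluation and greedy policy improvement."""
    theta = np.zeros(n_features)
    greedy = lambda s: max(actions, key=lambda a: phi_sa(s, a) @ theta)
    for _ in range(max_iters):
        new_theta = lstdq(data, phi_sa, greedy, n_features, gamma)  # evaluate current greedy policy
        if np.linalg.norm(new_theta - theta) < tol:                 # pi approx pi' -> stop
            theta = new_theta
            break
        theta = new_theta
    return theta, greedy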

Convergence of LSPI

Fitted Q-learning: Non-linear approximation

Incremental Q-learning with FA

Q-learning with FA
Initialize parameters θ arbitrarily (e.g., θ = 0)
for each episode do
  Choose initial state s
  repeat
    Choose a from s using the policy πθ derived from Qθ (e.g., ε-greedy)
    Execute action a, observe r, s'
    θ ← θ + α (r + γ Qθ(s', πθ(s')) − Qθ(s, a)) ∇θ Qθ(s, a)
    s ← s'
  until s is terminal
end for

Problems with incremental Q-learning with FA

Essence of off-policy learning
repeat
  Choose a using any probabilistic policy, execute it, and observe r and s' (i.e., collect (s, a, r, s'))
  θ ← θ + α (r + γ Qθ(s', πθ(s')) − Qθ(s, a)) ∇θ Qθ(s, a)
  s ← s'
until s is terminal

Several problems arise with incremental off-policy TD learning:
- SGD may not converge because the update does not follow the true gradient
- The target value is always changing, so SGD does not converge
- The data are not even close to i.i.d. (they are strongly correlated), which is another problem for SGD convergence

How can we solve all these problems?

Generalization of off-policy learning

Let's generalize the method:

Generalization of off-policy learning
Get D = {⟨s, a, r, s'⟩} using any probabilistic policy
repeat
  Set SD to N samples randomly taken from D
  for each sample i in SD do
    y_i ← r_i + γ max_a Qθ(s'_i, a)
  end for
  θ ← arg min_θ Σ_i (Qθ(s_i, a_i) − y_i)^2   // Any ML regression method
until convergence

Mario Martin (CS-UPC) Reinforcement Learning April 15, 2020 60 / 63 Each difference improves convergence 1 Samples obtained randomly reduce correlation between them and stabilize Q value function for the regression learner 2 Computation of exact solution avoid the true gradient problem

Generalizarion of off-policy learning

Notice several differences: 1 Sample a set of N examples instead of only 1 2 Don’t use 1-step of gradient descent but compute exact solution (regression problem)

Mario Martin (CS-UPC) Reinforcement Learning April 15, 2020 61 / 63 Generalizarion of off-policy learning

Notice several differences: 1 Sample a set of N examples instead of only 1 2 Don’t use 1-step of gradient descent but compute exact solution (regression problem)

Each difference improves convergence 1 Samples obtained randomly reduce correlation between them and stabilize Q value function for the regression learner 2 Computation of exact solution avoid the true gradient problem

Fitted Q-learning

Fitted Q-learning implements fitted value iteration: given a dataset of experience tuples D, solve a sequence of regression problems.

- At iteration i, build an approximation Q_i over a dataset obtained by applying the Bellman operator to the previous approximation (T Q_{i−1})

This allows using a large class of regression methods, e.g.:

- Kernel averaging
- Regression trees
- Fuzzy regression

With other regression methods it may diverge. In practice, good results are also obtained with neural networks.

Fitted Q-learning

Fitted Q-learning

Given D of size T with examples (s_t, a_t, r_{t+1}, s_{t+1}), and a regression algorithm,
set N to zero and Q_0(s, a) = 0 for all s and a
repeat
  N ← N + 1
  Build the training set TS = {⟨(s_t, a_t), r_{t+1} + γ max_a Q_{N−1}(s_{t+1}, a)⟩}_{t=1}^{T}
  Q_N ← regression algorithm on TS
until Q_N ≈ Q_{N−1} or N > limit
return π based on greedy evaluation of Q_N

It works especially well with feed-forward neural networks as regressors (Neural Fitted Q-learning).
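A sketch of fitted Q iteration with a generic regressor exposing fit/predict (scikit-learn style); the encode function mapping (s, a) to an input vector and the (s, a, r, s', done) data layout are assumptions of this sketch:

import numpy as np

def fitted_q_iteration(data, make_regressor, actions, encode, gamma=0.99, iters=50):
    """Fitted Q-learning with any regressor offering fit(X, y) / predict(X)."""
    X = np.array([encode(s, a) for s, a, r, s_next, done in data])
    q = None                                                    # Q_0 = 0 everywhere
    for _ in range(iters):
        targets = []
        for s, a, r, s_next, done in data:
            if q is None or done:
                q_next = 0.0
            else:                                               # max_a' Q_{N-1}(s', a')
                q_next = max(q.predict(np.asarray([encode(s_next, b)]))[0]
                             for b in actions)
            targets.append(r + gamma * q_next)
        q = make_regressor()                                    # fresh regression problem
        q.fit(X, np.array(targets))                             # Q_N <- regress(TS)
    policy = lambda s: max(actions, key=lambda a: q.predict(np.asarray([encode(s, a)]))[0])
    return q, policy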
