Policy Gradient Algorithms

Ashwin Rao, ICME, Stanford University

Overview
1. Motivation and Intuition
2. Definitions and Notation
3. Policy Gradient Theorem and Proof
4. Policy Gradient Algorithms
5. Compatible Function Approximation Theorem
6. Natural Policy Gradient

Why do we care about Policy Gradient (PG)?
Let us review how we got here
We started with Markov Decision Processes and Bellman Equations
Next we studied several variants of DP and RL algorithms
We noted that the idea of Generalized Policy Iteration (GPI) is key
Policy Improvement step: $\pi(s, a)$ derived from $\arg\max_a Q(s, a)$
How do we do argmax when action space is large or continuous?
Idea: Do Policy Improvement step with a Gradient Ascent instead

"Policy Improvement with a Gradient Ascent??"
We want to find the Policy that fetches the "Best Expected Returns"
Gradient Ascent on "Expected Returns" w.r.t params of Policy func
So we need a func approx for (stochastic) Policy Func: $\pi(s, a; \theta)$
In addition to the usual func approx for Action Value Func: $Q(s, a; w)$
$\pi(s, a; \theta)$ called Actor and $Q(s, a; w)$ called Critic
Critic parameters $w$ are optimized w.r.t $Q(s, a; w)$ loss function min
Actor parameters $\theta$ are optimized w.r.t Expected Returns max
We need to formally define "Expected Returns"
But we already see that this idea is appealing for continuous actions
GPI with Policy Improvement done as Policy Gradient (Ascent)

Value Function-based and Policy-based RL
Value Function-based:
Learn Value Function (with a function approximation)
Policy is implicit - readily derived from Value Function (eg: $\epsilon$-greedy)
Policy-based:
Learn Policy (with a function approximation)
No need to learn a Value Function
Actor-Critic:
Learn Policy (Actor)
Learn Value Function (Critic)

Advantages and Disadvantages of Policy Gradient approach
Advantages:
Finds the best Stochastic Policy (Optimal Deterministic Policy, produced by other RL algorithms, can be unsuitable for POMDPs)
Naturally explores due to Stochastic Policy representation
Effective in high-dimensional or continuous action spaces
Small changes in $\theta$ $\Rightarrow$ small changes in $\pi$, and in state distribution
This avoids the convergence issues seen in argmax-based algorithms
Disadvantages:
Typically converge to a local optimum rather than a global optimum
Policy Evaluation is typically inefficient and has high variance
Policy Improvement happens in small steps $\Rightarrow$ slow convergence

Notation
Assume episodic with $0 \leq \gamma \leq 1$ or non-episodic with $0 \leq \gamma < 1$
Usual notation of discrete-time, countable-spaces, stationary MDPs
We lighten $\mathcal{P}(s, a, s')$ notation to $\mathcal{P}^a_{s,s'}$ and $\mathcal{R}(s, a)$ notation to $\mathcal{R}^a_s$
Initial State Probability Distribution denoted as $p_0 : \mathcal{N} \rightarrow [0, 1]$
Policy Function Approximation $\pi(s, a; \theta) = \mathbb{P}[A_t = a \mid S_t = s; \theta]$
PG coverage is quite similar for non-discounted non-episodic, by considering average-reward objective (we won't cover it)

"Expected Returns" Objective
Now we formalize the "Expected Returns" Objective $J(\theta)$
$$J(\theta) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \cdot R_{t+1}\Big]$$
Value Function $V^\pi(s)$ and Action Value function $Q^\pi(s, a)$ defined as:
$$V^\pi(s) = \mathbb{E}_\pi\Big[\sum_{k=t}^{\infty} \gamma^{k-t} \cdot R_{k+1} \mid S_t = s\Big] \text{ for all } t = 0, 1, 2, \ldots$$
$$Q^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{k=t}^{\infty} \gamma^{k-t} \cdot R_{k+1} \mid S_t = s, A_t = a\Big] \text{ for all } t = 0, 1, 2, \ldots$$
Advantage Function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
Also, $p(s \rightarrow s', t, \pi)$ will be a key function for us - it denotes the probability of going from state $s$ to $s'$ in $t$ steps by following policy $\pi$

Discounted-Aggregate State-Visitation Measure
$$J(\theta) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \cdot R_{t+1}\Big] = \sum_{t=0}^{\infty} \gamma^t \cdot \mathbb{E}_\pi[R_{t+1}]$$
$$= \sum_{t=0}^{\infty} \gamma^t \cdot \sum_{s \in \mathcal{N}} \Big(\sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot p(S_0 \rightarrow s, t, \pi)\Big) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}^a_s$$
$$= \sum_{s \in \mathcal{N}} \Big(\sum_{S_0 \in \mathcal{N}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(S_0) \cdot p(S_0 \rightarrow s, t, \pi)\Big) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}^a_s$$
Definition
$$J(\theta) = \sum_{s \in \mathcal{N}} \rho^\pi(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}^a_s$$
where $\rho^\pi(s) = \sum_{S_0 \in \mathcal{N}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(S_0) \cdot p(S_0 \rightarrow s, t, \pi)$ is the key function (for PG) we'll refer to as Discounted-Aggregate State-Visitation Measure.

Policy Gradient Theorem (PGT)
Theorem
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^\pi(s) \cdot \sum_{a \in \mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q^\pi(s, a)$$
Note: $\rho^\pi(s)$ depends on $\theta$, but there's no $\nabla_\theta \rho^\pi(s)$ term in $\nabla_\theta J(\theta)$
Note: $\nabla_\theta \pi(s, a; \theta) = \pi(s, a; \theta) \cdot \nabla_\theta \log \pi(s, a; \theta)$
So we can simply generate sampling traces, and at each time step, calculate $(\nabla_\theta \log \pi(s, a; \theta)) \cdot Q^\pi(s, a)$ (probabilities implicit in paths)
Note: $\nabla_\theta \log \pi(s, a; \theta)$ is Score function (Gradient of log-likelihood)
We will estimate $Q^\pi(s, a)$ with a function approximation $Q(s, a; w)$
We will later show how to avoid the estimate bias of $Q(s, a; w)$
This numerical estimate of $\nabla_\theta J(\theta)$ enables Policy Gradient Ascent
Let us look at the score function of some canonical $\pi(s, a; \theta)$

Canonical $\pi(s, a; \theta)$ for finite action spaces
For finite action spaces, we often use Softmax Policy
$\theta$ is an $m$-vector $(\theta_1, \ldots, \theta_m)$
Features vector $\phi(s, a) = (\phi_1(s, a), \ldots, \phi_m(s, a))$ for all $s \in \mathcal{N}, a \in \mathcal{A}$
Weight actions using linear combinations of features: $\phi(s, a)^T \cdot \theta$
Action probabilities proportional to exponentiated weights:
$$\pi(s, a; \theta) = \frac{e^{\phi(s, a)^T \cdot \theta}}{\sum_{b \in \mathcal{A}} e^{\phi(s, b)^T \cdot \theta}} \text{ for all } s \in \mathcal{N}, a \in \mathcal{A}$$
The score function is:
$$\nabla_\theta \log \pi(s, a; \theta) = \phi(s, a) - \sum_{b \in \mathcal{A}} \pi(s, b; \theta) \cdot \phi(s, b) = \phi(s, a) - \mathbb{E}_\pi[\phi(s, \cdot)]$$

Canonical $\pi(s, a; \theta)$ for continuous action spaces
For continuous action spaces, we often use Gaussian Policy
$\theta$ is an $m$-vector $(\theta_1, \ldots, \theta_m)$
State features vector $\phi(s) = (\phi_1(s), \ldots, \phi_m(s))$ for all $s \in \mathcal{N}$
Gaussian Mean is a linear combination of state features $\phi(s)^T \cdot \theta$
Variance may be fixed $\sigma^2$, or can also be parameterized
Policy is Gaussian, $a \sim \mathcal{N}(\phi(s)^T \cdot \theta, \sigma^2)$ for all $s \in \mathcal{N}$
The score function is:
$$\nabla_\theta \log \pi(s, a; \theta) = \frac{(a - \phi(s)^T \cdot \theta) \cdot \phi(s)}{\sigma^2}$$

Proof of Policy Gradient Theorem
We begin the proof by noting that:
$$J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot V^\pi(S_0) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0)$$
Calculate $\nabla_\theta J(\theta)$ by parts $\pi(S_0, A_0; \theta)$ and $Q^\pi(S_0, A_0)$
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \nabla_\theta Q^\pi(S_0, A_0)$$

Proof of Policy Gradient Theorem
Now expand $Q^\pi(S_0, A_0)$ as:
$$\mathcal{R}^{A_0}_{S_0} + \sum_{S_1 \in \mathcal{N}} \gamma \cdot \mathcal{P}^{A_0}_{S_0,S_1} \cdot V^\pi(S_1) \text{ (Bellman Policy Equation)}$$
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \nabla_\theta \Big(\mathcal{R}^{A_0}_{S_0} + \sum_{S_1 \in \mathcal{N}} \gamma \cdot \mathcal{P}^{A_0}_{S_0,S_1} \cdot V^\pi(S_1)\Big)$$
Note: $\nabla_\theta \mathcal{R}^{A_0}_{S_0} = 0$, so remove that term
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \nabla_\theta \Big(\sum_{S_1 \in \mathcal{N}} \gamma \cdot \mathcal{P}^{A_0}_{S_0,S_1} \cdot V^\pi(S_1)\Big)$$

Proof of Policy Gradient Theorem
Now bring the $\nabla_\theta$ inside the $\sum_{S_1 \in \mathcal{N}}$ to apply only on $V^\pi(S_1)$
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \sum_{S_1 \in \mathcal{N}} \gamma \cdot \mathcal{P}^{A_0}_{S_0,S_1} \cdot \nabla_\theta V^\pi(S_1)$$
Now bring $\sum_{S_0 \in \mathcal{N}}$ and $\sum_{A_0 \in \mathcal{A}}$ inside the $\sum_{S_1 \in \mathcal{N}}$
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_1 \in \mathcal{N}} \sum_{S_0 \in \mathcal{N}} \gamma \cdot p_0(S_0) \cdot \Big(\sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \mathcal{P}^{A_0}_{S_0,S_1}\Big) \cdot \nabla_\theta V^\pi(S_1)$$

Policy Gradient Theorem
Note that $\sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \mathcal{P}^{A_0}_{S_0,S_1} = p(S_0 \rightarrow S_1, 1, \pi)$
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_1 \in \mathcal{N}} \sum_{S_0 \in \mathcal{N}} \gamma \cdot p_0(S_0) \cdot p(S_0 \rightarrow S_1, 1, \pi) \cdot \nabla_\theta V^\pi(S_1)$$
Now expand $V^\pi(S_1)$ to $\sum_{A_1 \in \mathcal{A}} \pi(S_1, A_1; \theta) \cdot Q^\pi(S_1, A_1)$
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_1 \in \mathcal{N}} \sum_{S_0 \in \mathcal{N}} \gamma \cdot p_0(S_0) \cdot p(S_0 \rightarrow S_1, 1, \pi) \cdot \nabla_\theta \Big(\sum_{A_1 \in \mathcal{A}} \pi(S_1, A_1; \theta) \cdot Q^\pi(S_1, A_1)\Big)$$

Proof of Policy Gradient Theorem
We are now back to when we started calculating gradient of $\sum_a \pi \cdot Q^\pi$.
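
The two score functions stated on the canonical-policy slides (Softmax for finite action spaces, Gaussian for continuous action spaces) can be sanity-checked numerically. The following is a minimal sketch, not part of the original slides: it assumes NumPy, randomly generated features, and illustrative helper names (softmax_policy, softmax_score, gaussian_score, numerical_grad), and it compares each closed-form score against a central finite-difference gradient of the log-probability.

```python
import numpy as np

def softmax_policy(phi, theta):
    """Action probabilities for one state; phi has shape (num_actions, m)."""
    logits = phi @ theta
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def softmax_score(phi, theta, a):
    """Closed form: phi(s, a) - sum_b pi(s, b; theta) * phi(s, b)."""
    return phi[a] - softmax_policy(phi, theta) @ phi

def gaussian_score(phi_s, theta, a, sigma):
    """Closed form: (a - phi(s)^T theta) * phi(s) / sigma^2."""
    return (a - phi_s @ theta) * phi_s / sigma**2

def numerical_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Hypothetical sizes and random features, chosen only for illustration.
rng = np.random.default_rng(0)
m, num_actions = 4, 3
theta = rng.normal(size=m)

# Softmax policy: compare the closed-form score with d/dtheta log pi.
phi = rng.normal(size=(num_actions, m))
a = 1
softmax_gap = np.max(np.abs(
    softmax_score(phi, theta, a)
    - numerical_grad(lambda th: np.log(softmax_policy(phi, th)[a]), theta)))

# Gaussian policy with fixed sigma: compare with d/dtheta log density.
phi_s = rng.normal(size=m)
action, sigma = 0.7, 0.5

def gaussian_log_density(th):
    mean = phi_s @ th
    return (-0.5 * np.log(2 * np.pi * sigma**2)
            - (action - mean) ** 2 / (2 * sigma**2))

gaussian_gap = np.max(np.abs(
    gaussian_score(phi_s, theta, action, sigma)
    - numerical_grad(gaussian_log_density, theta)))

print(softmax_gap, gaussian_gap)  # both should be close to zero
```

Both gaps should be at the level of finite-difference noise, which is consistent with the score formulas as reconstructed above.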
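
The Policy Gradient Theorem itself can also be checked on a small example where everything is computable exactly. The sketch below is an illustration under stated assumptions, not the author's code: it builds a random non-episodic MDP with 4 states, 3 actions and $\gamma = 0.9$, uses a tabular Softmax policy (one-hot features, so $\theta$ has one entry per state-action pair), and relies on NumPy linear solves for $V^\pi$ and $\rho^\pi$. All names (policy, evaluate, pgt_gradient, numerical_gradient) are hypothetical. It compares the theorem's right-hand side with a finite-difference gradient of $J(\theta)$, and also checks the Definition $J(\theta) = \sum_s \rho^\pi(s) \sum_a \pi(s, a; \theta) \cdot \mathcal{R}^a_s$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, gamma = 4, 3, 0.9            # states, actions, discount (assumed values)

P = rng.random((n, k, n))          # P[s, a, s'] transition probabilities
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n, k))        # R[s, a] expected rewards
p0 = np.full(n, 1.0 / n)           # uniform initial state distribution

def policy(theta):
    """Tabular softmax policy: pi[s, a] from theta[s, a]."""
    z = theta - theta.max(axis=1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Exact V, Q and discounted-aggregate state-visitation measure rho."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state matrix under pi
    R_pi = (pi * R).sum(axis=1)
    A = np.eye(n) - gamma * P_pi
    v = np.linalg.solve(A, R_pi)            # V = (I - gamma P_pi)^{-1} R_pi
    q = R + gamma * P @ v                   # Bellman Policy Equation for Q
    rho = np.linalg.solve(A.T, p0)          # rho^T = p0^T (I - gamma P_pi)^{-1}
    return v, q, rho

def J(theta):
    v, _, _ = evaluate(policy(theta))
    return p0 @ v

def pgt_gradient(theta):
    """sum_s rho(s) sum_b pi(s, b) * score(s, b) * Q(s, b)."""
    pi = policy(theta)
    _, q, rho = evaluate(pi)
    grad = np.zeros_like(theta)
    for s in range(n):
        for b in range(k):
            score = -pi[s].copy()
            score[b] += 1.0                 # one-hot features: e_b - pi(s, .)
            grad[s] += rho[s] * pi[s, b] * score * q[s, b]
    return grad

def numerical_gradient(theta, eps=1e-6):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

theta = rng.normal(size=(n, k))
gap = np.max(np.abs(pgt_gradient(theta) - numerical_gradient(theta)))

pi = policy(theta)
v, q, rho = evaluate(pi)
j_direct = p0 @ v
j_via_rho = rho @ (pi * R).sum(axis=1)

print(gap, abs(j_direct - j_via_rho))  # both should be close to zero
```

Note that pgt_gradient never differentiates rho, matching the slide's remark that no $\nabla_\theta \rho^\pi(s)$ term appears in $\nabla_\theta J(\theta)$.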