Policy Gradient

Lecture 7: Policy Gradient
David Silver

Outline
  1 Introduction
  2 Finite Difference Policy Gradient
  3 Monte-Carlo Policy Gradient
  4 Actor-Critic Policy Gradient

Introduction

Policy-Based Reinforcement Learning
  • In the last lecture we approximated the value or action-value function using parameters θ,
      $V_\theta(s) \approx V^{\pi}(s)$
      $Q_\theta(s, a) \approx Q^{\pi}(s, a)$
  • A policy was generated directly from the value function, e.g. using ε-greedy
  • In this lecture we will directly parametrise the policy
      $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
  • We will focus again on model-free reinforcement learning

Value-Based and Policy-Based RL
  • Value Based
      Learnt Value Function
      Implicit policy (e.g. ε-greedy)
  • Policy Based
      No Value Function
      Learnt Policy
  • Actor-Critic
      Learnt Value Function
      Learnt Policy
  [Figure: Venn diagram of Value Function and Policy, with regions Value-Based, Actor-Critic and Policy-Based]

Advantages of Policy-Based RL
  Advantages:
  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies
  Disadvantages:
  • Typically converge to a local rather than global optimum
  • Evaluating a policy is typically inefficient and high variance

Example: Rock-Paper-Scissors
  • Two-player game of rock-paper-scissors
      Scissors beats paper
      Rock beats scissors
      Paper beats rock
  • Consider policies for iterated rock-paper-scissors
      A deterministic policy is easily exploited
      A uniform random policy is optimal (i.e. Nash equilibrium)

Example: Aliased Gridworld (1)
  • The agent cannot differentiate the grey states
  • Consider features of the following form (for all N, E, S, W)
      $\phi(s, a) = \mathbf{1}(\text{wall to N},\ a = \text{move E})$
  • Compare value-based RL, using an approximate value function
      $Q_\theta(s, a) = f(\phi(s, a), \theta)$
  • To policy-based RL, using a parametrised policy
      $\pi_\theta(s, a) = g(\phi(s, a), \theta)$

Example: Aliased Gridworld (2)
  • Under aliasing, an optimal deterministic policy will either
      move W in both grey states (shown by red arrows)
      move E in both grey states
  • Either way, it can get stuck and never reach the money
  • Value-based RL learns a near-deterministic policy
      e.g. greedy or ε-greedy
  • So it will traverse the corridor for a long time

Example: Aliased Gridworld (3)
  • An optimal stochastic policy will randomly move E or W in grey states
      $\pi_\theta(\text{wall to N and S, move E}) = 0.5$
      $\pi_\theta(\text{wall to N and S, move W}) = 0.5$
  • It will reach the goal state in a few steps with high probability
  • Policy-based RL can learn the optimal stochastic policy

Policy Search

Policy Objective Functions
  • Goal: given policy πθ(s, a) with parameters θ, find best θ
  • But how do we measure the quality of a policy πθ?
  • In episodic environments we can use the start value
      $J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$
  • In continuing environments we can use the average value
      $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$
  • Or the average reward per time-step
      $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}_s^a$
  • where $d^{\pi_\theta}(s)$ is stationary distribution of Markov chain for πθ

Policy Optimisation
  • Policy based reinforcement learning is an optimisation problem
  • Find θ that maximises J(θ)
  • Some approaches do not use gradient
      Hill climbing
      Simplex / amoeba / Nelder Mead
      Genetic algorithms
  • Greater efficiency often possible using gradient
      Gradient descent
      Conjugate gradient
      Quasi-newton
  • We focus on gradient descent, many extensions possible
  • And on methods that exploit sequential structure

Finite Difference Policy Gradient

Policy Gradient
  • Let J(θ) be any policy objective function
  • Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy, w.r.t. parameters θ
      $\Delta\theta = \alpha \nabla_\theta J(\theta)$
  • Where $\nabla_\theta J(\theta)$ is the policy gradient
      $\nabla_\theta J(\theta) = \begin{pmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{pmatrix}$
  • and α is a step-size parameter
  [Figure: illustration accompanying the slide; its embedded text did not survive extraction]

Computing Gradients By Finite Differences
  • To evaluate policy gradient of πθ(s, a)
  • For each dimension k ∈ [1, n]
      Estimate kth partial derivative of objective function w.r.t. θ
      By perturbing θ by small amount ε in kth dimension
        $\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$
      where $u_k$ is unit vector with 1 in kth component, 0 elsewhere
  • Uses n evaluations to compute policy gradient in n dimensions
  • Simple, noisy, inefficient - but sometimes effective
  • Works for arbitrary policies, even if policy is not differentiable

AIBO example

Training AIBO to Walk by Finite Difference Policy Gradient

Excerpt from the paper shown on the slide:

  [...] those parameters to an Aibo and instructing it to time itself as it walked between two fixed landmarks (Figure 5). More efficient parameters resulted in a faster gait, which translated into a lower time and a better score. After completing an evaluation, the Aibo sent the resulting score back to the host computer and prepared itself for a new set of parameters to evaluate.

  When implementing the algorithm described above, we chose to evaluate t = 15 policies per iteration. Since there was significant noise in each evaluation, each set of parameters was evaluated three times. The resulting score for that set of parameters was computed by taking the average of the three evaluations. Each iteration therefore consisted of 45 traversals between pairs of beacons and lasted roughly 7 1/2 minutes. Since even small adjustments to a set of parameters could lead to major differences in gait speed, we chose relatively small values for each ε_j. To offset the small values of each ε_j, we accelerated the learning process by using a larger value of η = 2 for the step size.

  [Figure: "Velocity of Learned Gait during Training"; x-axis Number of Iterations (0 to 25), y-axis Velocity (mm/s, 180 to 300); series: Learned Gait (UT Austin Villa), Learned Gait (UNSW), Hand-tuned Gait (UNSW), Hand-tuned Gait (UT Austin Villa), Hand-tuned Gait (German Team)]
  Fig. 6. The velocity of the best gait from each iteration during training, compared to previous results. We were able to learn a gait significantly faster than both hand-coded gaits and previously learned gaits.
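The finite-difference recipe, combined with the excerpt's practice of averaging three noisy evaluations per parameter set, can be sketched as follows. This is only an illustration under stated assumptions: `evaluate` is a hypothetical toy objective (a concave quadratic with optional Gaussian noise), not the robot's timed walk, and the names `averaged`, `finite_difference_gradient` and `ascend` as well as the values of `eps`, `alpha` and `steps` are chosen for this example rather than taken from the lecture.

```python
import random

def evaluate(theta, rng, noise=0.0):
    """Hypothetical noisy objective J(theta): a concave quadratic peaked at (1, -2)."""
    target = (1.0, -2.0)
    j = -sum((t - g) ** 2 for t, g in zip(theta, target))
    return j + rng.gauss(0.0, noise)

def averaged(theta, rng, noise, repeats):
    """Average several noisy evaluations of the same parameters."""
    return sum(evaluate(theta, rng, noise) for _ in range(repeats)) / repeats

def finite_difference_gradient(theta, rng, eps=0.1, noise=0.0, repeats=1):
    """Estimate dJ/dtheta_k as (J(theta + eps*u_k) - J(theta)) / eps for each k.

    Costs one baseline evaluation plus one per dimension (times `repeats`).
    """
    base = averaged(theta, rng, noise, repeats)
    grad = []
    for k in range(len(theta)):
        perturbed = list(theta)
        perturbed[k] += eps              # step along the unit vector u_k
        grad.append((averaged(perturbed, rng, noise, repeats) - base) / eps)
    return grad

def ascend(theta, rng, alpha=0.1, steps=200, **kwargs):
    """Gradient ascent: theta <- theta + alpha * estimated gradient."""
    theta = list(theta)
    for _ in range(steps):
        grad = finite_difference_gradient(theta, rng, **kwargs)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta

rng = random.Random(0)
# Three evaluations per parameter set, mirroring the averaging in the excerpt.
theta_star = ascend([0.0, 0.0], rng, eps=0.1, noise=0.05, repeats=3)
```

On this particular quadratic the forward difference underestimates each partial derivative by exactly `eps`, so without noise the iterate settles at about (0.95, -2.05) rather than at the true peak (1, -2). A smaller `eps` reduces that bias but magnifies evaluation noise, since the noise is divided by `eps`; averaging repeated evaluations is one way to trade extra rollouts for a steadier estimate.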
  • Goal: learn a fast AIBO walk (useful for Robocup)
  • AIBO walk policy is controlled by 12 numbers (elliptical loci)
  • Adapt these parameters by finite difference policy gradient

Excerpts from the paper shown on the slide ([...] marks text cut off in the source):

  Our team (UT Austin Villa [10]) first approached the gait optimization problem by hand-tuning a gait described by half-elliptical loci. This gait performed comparably to those of other teams participating in RoboCup 2003. The work reported in this paper uses the hand-tuned UT Austin Villa walk as a starting point for learning. Figure 1 compares the reported speeds of the gaits mentioned above, both hand-tuned and learned, including that of our starting point, the UT Austin Villa walk. The latter walk is described fully in a team technical report [10]. The remainder of this section describes the details of the UT Austin Villa walk that are important to understand for the purposes of this paper.

  The half-elliptical locus used by our team is shown in [...]

  During the American Open tournament in May of 2003, UT Austin Villa used a simplified version of the parameterization described above that did not allow the front and rear heights of the robot to differ. Hand-tuning these parameters generated a gait that allowed [...]

  IV. RESULTS
  Our main result is that using the algorithm described in Section III, we were able to find one of the fastest known Aibo gaits. Figure 6 shows the performance of the best policy of each iteration during the learning process. After 23 iterations [...]

  There are a couple of possible explanations for the amount of variation in the learning process, visible as the spikes in the learning curve shown in Figure 6.

  Fig. 1. Maximum forward velocities of the best current gaits (in mm/s) for different teams, both learned and hand-tuned.

    Gait                              Velocity (mm/s)
    Hand-tuned: CMU (2002)            200
    Hand-tuned: German Team           230
    Hand-tuned: UT Austin Villa       245
    Hand-tuned: UNSW                  254
    Learned: Hornby 1999              170
    Learned: UNSW                     270

  Fig. 2. The elliptical locus of the Aibo's foot. The half-ellipse is defined by length, height, and position in the x-y plane.

  Fig. 5. The training environment for our experiments. Each Aibo times itself as it moves back and forth between a pair of landmarks (A and A', B and B', or C and C').

  Fig. 7. The initial policy, the amount of change (ε) for each parameter, and best policy learned after 23 iterations. All values are given in centimeters, except time to move through locus, which is measured in seconds, and time on ground, which is a fraction.

    Parameter                     Initial Value    ε        Best Value
    Front locus: (height)         4.2              0.35     4.081
    Front locus: (x offset)       2.8              0.35     0.574
    Front locus: (y offset)       4.9              0.35     5.152
    Rear locus: (height)          5.6              0.35     6.02
    Rear locus: (x offset)        0.0              0.35     0.217
    Rear locus: (y offset)        -2.8             0.35     -2.982
    Locus length                  4.893            0.35     5.285
    Locus skew multiplier         0.035            0.175    0.049
    Front height                  7.7              0.35     7.483
    Rear height                   11.2             0.35     10.843
    Time to move through locus    0.704            0.016    0.679
    Time on ground                0.5              0.05     0.430
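The excerpts mention evaluating a batch of policies per iteration, a per-parameter change ε, and a step size η, but the slide does not spell out the update rule. The sketch below shows one plausible batch-perturbation scheme of that general kind, and every detail in it is an assumption made for illustration: `score` is a hypothetical toy objective standing in for a timed walk, `perturbation_step` is a name invented here, a single `eps` is used for all parameters, and the rule of zeroing a dimension when the unperturbed group scores best, as well as the normalised step, are illustrative choices rather than confirmed features of the paper's algorithm.

```python
import random

def score(params, rng, noise=0.0):
    """Hypothetical stand-in for one timed walk: higher is better, best at all ones."""
    return -sum((p - 1.0) ** 2 for p in params) + rng.gauss(0.0, noise)

def perturbation_step(params, rng, eps=0.05, eta=0.1, num_policies=15, noise=0.0):
    """One iteration: score random nearby policies, then step towards better ones."""
    n = len(params)
    trials = []
    for _ in range(num_policies):
        # Shift every parameter by -eps, 0 or +eps, chosen independently.
        signs = [rng.choice((-1, 0, 1)) for _ in range(n)]
        candidate = [p + s * eps for p, s in zip(params, signs)]
        trials.append((signs, score(candidate, rng, noise)))

    def mean_score(k, sign):
        """Mean score of the trials whose k-th parameter was shifted by sign*eps."""
        vals = [sc for signs, sc in trials if signs[k] == sign]
        return sum(vals) / len(vals) if vals else None

    adjustment = []
    for k in range(n):
        plus, zero, minus = (mean_score(k, s) for s in (1, 0, -1))
        if plus is None or minus is None:
            adjustment.append(0.0)      # too few samples to compare directions
        elif zero is not None and zero > plus and zero > minus:
            adjustment.append(0.0)      # leaving this parameter alone looks best
        else:
            adjustment.append(plus - minus)

    norm = sum(a * a for a in adjustment) ** 0.5
    if norm == 0.0:
        return list(params)
    # Normalise so every iteration moves a fixed distance eta.
    return [p + eta * a / norm for p, a in zip(params, adjustment)]

rng = random.Random(1)
gait = [0.0] * 12                       # 12 numbers, as on the slide
initial = score(gait, rng)              # noise-free score of the starting point
for _ in range(30):
    gait = perturbation_step(gait, rng, noise=0.05)
final = score(gait, rng)                # noise-free score after 30 iterations
```

Compared with perturbing one coordinate at a time, this variant spends a fixed budget of `num_policies` evaluations per iteration regardless of the number of parameters, at the cost of a noisier direction estimate because each trial mixes the effects of all dimensions.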