Implicit Differentiation
[Cover figure: map from "There and Back Again: A Tale of Slopes and Expectations", a NeurIPS 2020 tutorial by Marc Deisenroth and Cheng Soon Ong, with labelled regions including the Swamps of Numerical Integration, Monte Carlo Heights, Forward-Backward Hills, Normalizing Flow Beach, Unrolling Backprop Bay, and the Vale of Implicit Differentiation, which this part of the tutorial covers.]

Implicit differentiation
Cheng Soon Ong
Marc Peter Deisenroth
December 2020

Motivation

- In machine learning, we use gradients to train predictors.
- For a function f(x) we can directly obtain its gradient \nabla f(x).
- How do we represent a constraint? G(x, y) = 0, e.g. conservation of mass.

High school gradients

This is an equation:

    y = x^3 + 2x^2 + x + 4

What is the gradient dy/dx?

    dy/dx = 3x^2 + 4x + 1

Observe that we can write the equation as

    x^3 + 2x^2 + x + 4 - y = 0,

which is of the form G(x, y) = 0.

Optimization with constraints

Given a constrained continuous optimization problem

    \min_{x,y} F(x, y)  subject to  G(x, y) = 0,

we can solve the equality constraint to get y = g(x), substitute it into the objective to obtain F(x, g(x)), and calculate the gradient

    \frac{d}{dx} F(x, g(x)).

Advanced high school gradients

This is an equation:

    y = x^3 + 2x^2 + xy + 4

What is the gradient dy/dx?

Implicit function theorem:
- Solve for y = g(x) and use the quotient rule, or
- directly differentiate and use the product rule.

Function of two variables

Consider a function G(x, y) = 0 where x, y \in \mathbb{R}. Assume that near a particular point x_0 we can write a closed-form expression for y in terms of x, that is,

    y = g(x).

Solve for one variable

Substituting y = g(x) into G(x, y), near x_0 we get

    G(x, g(x)) = 0.

We calculate the derivative of G with respect to x using the chain rule,

    G_x(x, g(x)) + G_y(x, g(x)) \, g'(x) = 0.

If G_y \neq 0, then

    g'(x) = -\frac{G_x}{G_y}.

Implicit function theorem

The implicit function theorem provides conditions under which we can write G(x, y) = 0 as y = g(x), and also conditions under which g'(x) = -G_x / G_y.

Hand-wavy intuition: given three variables x, y, z (of any topology) and

    G(x, y) = z,

what are the conditions for the following to be well behaved?

    g(x) = y,    G(x, g(x)) = z.
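As a concrete check of g'(x) = -G_x / G_y, here is a minimal JAX sketch (not part of the original slides) on the example y = x^3 + 2x^2 + xy + 4 from above, for which the constraint can still be solved by hand as y = (x^3 + 2x^2 + 4)/(1 - x); the function names and the evaluation point x_0 = 0.5 are illustrative choices.

```python
import jax

def G(x, y):
    # Constraint form of the running example y = x^3 + 2x^2 + x*y + 4.
    return x**3 + 2.0 * x**2 + x * y + 4.0 - y

def g(x):
    # Explicit solution y = g(x), available for this example; valid away from x = 1.
    return (x**3 + 2.0 * x**2 + 4.0) / (1.0 - x)

x0 = 0.5          # illustrative evaluation point
y0 = g(x0)

# Implicit derivative g'(x) = -G_x / G_y, with G_x and G_y from autodiff.
G_x = jax.grad(G, argnums=0)(x0, y0)
G_y = jax.grad(G, argnums=1)(x0, y0)
dy_dx_implicit = -G_x / G_y

# Direct derivative of the explicit solution, for comparison.
dy_dx_explicit = jax.grad(g)(x0)

print(dy_dx_implicit, dy_dx_explicit)  # both should be approximately 24.0
```

Note that G_x and G_y are obtained by differentiating the constraint alone, so the same two autodiff calls work even when no closed-form g(x) is available.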
Implicit function theorem as ansatz

Ansatz:
1. Explicitly solve one variable in terms of another.
2. Chain the gradients.

- An ansatz is a way to look at problems.
- The implicit function theorem is a way to solve equations.

Generalizations, depending on where the equations live (Krantz and Parks, 2013):
- Inverse function theorem
- Constant rank theorem
- Banach fixed point theorem
- Nash-Moser theorem

Pointers to literature

- From linear algebra to calculus (Hubbard and Hubbard, 2015; Spivak, 2008)
- The implicit function theorem from variational analysis (Dontchev and Rockafellar, 2014)
- Many versions of the implicit function theorem (Krantz and Parks, 2013)
- Deep Declarative Networks: https://anucvml.github.io/ddn-cvprw2020/

[Image: Augustin-Louis Cauchy, 1916]

Gradient of loss w.r.t. optimal value

- Consider the problem of structured prediction (Nowozin et al., 2014).
- Let a sample be (x_n, y_n).
- The energy, given the parameters w of the learner, is E(y_n, x_n; w).
- We assume that the best predictor is found by an optimization algorithm over y,

      y^*(x_n, w) = opt_y E(y, x_n; w).

- We measure the error with \ell(y, y_n).
- The loss per sample is L(x_n, y_n, w) (assume we predict with y^*).
- We want to take the gradient of the loss L with respect to the parameters w,
- but there is an optimization problem inside (to find y^*).

Gradient of L w.r.t. w

By implicit differentiation (Do et al., 2007; Samuel and Tappen, 2009), the gradient dL/dw of the loss L(x_n, y_n, w) with respect to the parameters w has a closed form.

Theorem. Let y^*(w) = argmin_y E(y, w) and L(w) = \ell(y^*(w)). Then

    \frac{dL}{dw} = -\frac{\partial^2 E}{\partial w \, \partial y^\top} \left( \frac{\partial^2 E}{\partial y \, \partial y^\top} \right)^{-1} \frac{d\ell}{dy}.

Sketch: chain rule

    L(w) = \ell( opt_y E(y, w) )

By the chain rule,

    \frac{\partial L}{\partial w} = \frac{\partial y}{\partial w} \frac{\partial \ell}{\partial y}.

Sketch: gradient with respect to the energy E

Denote by

    g(y, w) = \frac{\partial E(y, w)}{\partial y}.

The gradient of g with respect to w is given by the chain rule again. Note that y is itself a function of w, i.e. we differentiate g(y(w), w):

    \frac{\partial}{\partial w} g(y(w), w) = \frac{\partial y}{\partial w} \frac{\partial g}{\partial y} + \frac{\partial g}{\partial w}.

Sketch: stationarity conditions

At an optimum of E(y, w) its gradient is zero, i.e. g(y(w), w) = 0 for all w, so the expression above also vanishes. Solving for \partial y / \partial w,

    \frac{\partial y}{\partial w} = -\frac{\partial g}{\partial w} \left( \frac{\partial g}{\partial y} \right)^{-1}.

Substituting \partial y / \partial w into \partial L / \partial w,

    \frac{\partial L}{\partial w} = -\frac{\partial g}{\partial w} \left( \frac{\partial g}{\partial y} \right)^{-1} \frac{\partial \ell}{\partial y}.

Sketch: substitute back to obtain \partial L / \partial w

Recall the definition of g as the gradient of the energy E,

    g(y, w) = \frac{\partial E(y, w)}{\partial y},

and hence the derivatives of g are second derivatives (Hessians) of the energy:

    \frac{\partial L}{\partial w} = -\frac{\partial^2 E}{\partial w \, \partial y^\top} \left( \frac{\partial^2 E}{\partial y \, \partial y^\top} \right)^{-1} \frac{d\ell}{dy}.

Result: Gradient of L w.r.t. w

The gradient of L(w) = \ell(argmin_y E(y, w)) with respect to w has a closed form.

Theorem. Let y^*(w) = argmin_y E(y, w) and L(w) = \ell(y^*(w)). Then

    \frac{dL}{dw} = -\frac{\partial^2 E}{\partial w \, \partial y^\top} \left( \frac{\partial^2 E}{\partial y \, \partial y^\top} \right)^{-1} \frac{d\ell}{dy}.

(Domke, 2012)
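To make the closed form above concrete, here is a minimal JAX sketch (again not part of the original slides) for a scalar toy energy whose minimiser y^*(w) = sin(w) is known explicitly, so the implicit-differentiation gradient can be checked against differentiating through the explicit solution; the energy, the loss target of 1.0, and the evaluation point w_0 = 0.3 are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def E(y, w):
    # Toy energy, strictly convex in y, with known minimiser y*(w) = sin(w).
    return 0.5 * (y - jnp.sin(w)) ** 2

def loss(y):
    # Error of the prediction against an assumed target of 1.0.
    return (y - 1.0) ** 2

def y_star(w):
    # Explicit minimiser of E(., w); in general this would be an inner solver.
    return jnp.sin(w)

w0 = 0.3
y0 = y_star(w0)

# g(y, w) = dE/dy; its derivatives are second derivatives of the energy.
g = jax.grad(E, argnums=0)
d2E_dydy = jax.grad(g, argnums=0)(y0, w0)   # second derivative of E in y
d2E_dwdy = jax.grad(g, argnums=1)(y0, w0)   # mixed derivative: d/dw of dE/dy
dl_dy = jax.grad(loss)(y0)

# Closed form from the theorem (scalar case: the inverse is a division).
dL_dw_implicit = -d2E_dwdy * (1.0 / d2E_dydy) * dl_dy

# Reference value: differentiate L(w) = loss(y*(w)) directly through y*.
dL_dw_direct = jax.grad(lambda w: loss(y_star(w)))(w0)

print(dL_dw_implicit, dL_dw_direct)  # the two numbers should agree
```

In the scalar case the inverse in the theorem is just a division; for vector-valued y one would solve a linear system with the Hessian \partial^2 E / \partial y \partial y^\top rather than forming its inverse.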
Summary

- We want to take gradients in the presence of an equality constraint.
- The implicit function theorem gives conditions under which we can "invert" a derivative.
- The implicit function theorem is a way to solve equations.
- It is useful for taking a gradient through an optimum.

Backpropagation is just... the implicit function theorem.

References

Do, C. B., Foo, C.-S., and Ng, A. (2007). Efficient multiple hyperparameter learning for log-linear models. In Neural Information Processing Systems.

Domke, J. (2012). Generic methods for optimization-based modeling. In AISTATS.

Dontchev, A. L. and Rockafellar, R. T. (2014). Implicit Functions and Solution Mappings: A View from Variational Analysis. Springer, 2nd edition.

Hubbard, J. H. and Hubbard, B. B. (2015). Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach. Matrix Editions, 5th edition.

Krantz, S. G. and Parks, H. R. (2013). The Implicit Function Theorem: History, Theory, and Applications. Springer.

Nowozin, S., Gehler, P. V., Jancsary, J., and Lampert, C. H. (2014). Advanced Structured Prediction. MIT Press.

Samuel, K. and Tappen, M. (2009). Learning optimized MAP estimates in continuously-valued MRF models. In CVPR.

Spivak, M. (2008). Calculus. Publish or Perish, Inc., 4th edition.