
DIFFERENTIAL CALCULUS

Paul Schrimpf
October 31, 2018
University of British Columbia
Economics 526

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

In this lecture, we will define derivatives for functions on vector spaces. We will show that all the familiar properties of derivatives (the mean value theorem, the chain rule, etc.) hold in any vector space. We will primarily focus on $\mathbb{R}^n$, but we also discuss infinite dimensional spaces (we will need to differentiate in them to study optimal control later). All of this material is also covered in chapter 4 of Carter. Chapter 14 of Simon and Blume and chapter 9 of Rudin's Principles of Mathematical Analysis cover differentiation on $\mathbb{R}^n$. Simon and Blume is better for general understanding and applications, but Rudin is better for proofs and rigor.

1. Derivatives

1.1. Partial derivatives. We have discussed limits of sequences, but perhaps not limits of functions. To be complete, we define limits as follows.

Definition 1.1. Let $X$ and $Y$ be metric spaces and $f: X \to Y$. Then
\[ \lim_{x \to x_0} f(x) = c, \]

where $x, x_0 \in X$ and $c \in Y$, means that $\forall \epsilon > 0$ $\exists \delta > 0$ such that $d(x, x_0) < \delta$ implies $d(f(x), c) < \epsilon$. Equivalently, we could say $\lim_{x \to x_0} f(x) = c$ means that for any sequence $\{x_n\}$ with $x_n \to x_0$, $f(x_n) \to c$.

You are probably already familiar with the derivative of a function of one variable. Let $f: \mathbb{R} \to \mathbb{R}$. $f$ is differentiable at $x_0$ if

\[ \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} = \frac{df}{dx}(x_0) \]
exists. Similarly, if $f: \mathbb{R}^n \to \mathbb{R}$, we define its $i$th partial derivative as follows.

Definition 1.2. Let $f: \mathbb{R}^n \to \mathbb{R}$. The $i$th partial derivative of $f$ is

\[ \frac{\partial f}{\partial x_i}(x_0) = \lim_{h \to 0} \frac{f(x_{01}, \ldots, x_{0i} + h, \ldots, x_{0n}) - f(x_0)}{h}. \]
The $i$th partial derivative tells you how much the function changes as its $i$th argument changes.


Example 1.1. Let $f: \mathbb{R}^n \to \mathbb{R}$ be a production function. Then we call $\frac{\partial f}{\partial x_i}$ the marginal product of $x_i$. If $f$ is Cobb-Douglas, $f(k, l) = A k^\alpha l^\beta$, where $k$ is capital and $l$ is labor, then the marginal products of capital and labor are

\[ \frac{\partial f}{\partial k}(k, l) = A \alpha k^{\alpha - 1} l^\beta, \qquad \frac{\partial f}{\partial l}(k, l) = A \beta k^\alpha l^{\beta - 1}. \]
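To make this concrete, here is a small numerical sketch (my addition, not from the original notes) comparing these analytic marginal products with finite-difference approximations of the partial derivatives; the parameter values are arbitrary choices for illustration.

```python
def f(k, l, A=1.0, alpha=0.3, beta=0.7):
    """Cobb-Douglas production function f(k, l) = A k^alpha l^beta."""
    return A * k**alpha * l**beta

def partial(func, args, i, h=1e-6):
    """Finite-difference approximation of the ith partial derivative at args."""
    shifted = list(args)
    shifted[i] += h
    return (func(*shifted) - func(*args)) / h

k, l, A, alpha, beta = 2.0, 3.0, 1.0, 0.3, 0.7
# Analytic marginal products from the example
mpk = A * alpha * k**(alpha - 1) * l**beta
mpl = A * beta * k**alpha * l**(beta - 1)
print(mpk, partial(f, (k, l), 0))  # nearly equal
print(mpl, partial(f, (k, l), 1))  # nearly equal
```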

1.2. Examples.

Example 1.2. If $u: \mathbb{R}^n \to \mathbb{R}$ is a utility function, then we call $\frac{\partial u}{\partial x_i}$ the marginal utility of $x_i$. If $u$ is CRRA,
\[ u(c_1, \ldots, c_T) = \sum_{t=1}^T \beta^t \frac{c_t^{1 - \gamma}}{1 - \gamma}, \]
then the marginal utility of consumption in period $t$ is
\[ \frac{\partial u}{\partial c_t} = \beta^t c_t^{-\gamma}. \]

Example 1.3. The price elasticity of demand is the percentage change in demand divided by the percentage change in its price. Let $q_1: \mathbb{R}^3 \to \mathbb{R}$ be a demand function with three arguments: own price $p_1$, the price of another good, $p_2$, and consumer income, $y$. The own price elasticity is
\[ \epsilon_{q_1, p_1} = \frac{\partial q_1}{\partial p_1} \frac{p_1}{q_1(p_1, p_2, y)}. \]
The cross price elasticity is the percentage change in demand divided by the percentage change in the other good's price, i.e.
\[ \epsilon_{q_1, p_2} = \frac{\partial q_1}{\partial p_2} \frac{p_2}{q_1(p_1, p_2, y)}. \]
Similarly, the income elasticity of demand is
\[ \epsilon_{q_1, y} = \frac{\partial q_1}{\partial y} \frac{y}{q_1(p_1, p_2, y)}. \]
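As an illustration (my addition, not in the original notes), the sketch below computes these elasticities for a hypothetical demand function $q_1 = p_1^{-1.5} p_2^{0.5} y$, for which the elasticities are the exponents; the functional form and numbers are made up.

```python
def q1(p1, p2, y):
    """Hypothetical demand function with constant elasticities -1.5, 0.5, 1.0."""
    return p1**-1.5 * p2**0.5 * y**1.0

def elasticity(i, args, h=1e-6):
    """Finite-difference elasticity of q1 with respect to its ith argument."""
    shifted = list(args)
    shifted[i] += h
    dq = (q1(*shifted) - q1(*args)) / h
    return dq * args[i] / q1(*args)

args = (2.0, 3.0, 10.0)  # (p1, p2, y)
print(elasticity(0, args))  # approximately -1.5 (own price)
print(elasticity(1, args))  # approximately  0.5 (cross price)
print(elasticity(2, args))  # approximately  1.0 (income)
```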

1.3. Total derivatives. Derivatives of univariate functions have a number of useful properties that partial derivatives do not always share. Examples of useful properties include univariate derivatives giving the slope of a tangent line, the implicit function theorem, and Taylor series approximations. We would like the derivatives of multivariate functions to have these properties, but partial derivatives are not enough for this.


Example 1.4. Consider $f: \mathbb{R}^2 \to \mathbb{R}$,
\[ f(x, y) = \begin{cases} x^2 + y^2 & \text{if } xy < 0 \\ x + y & \text{if } xy \geq 0 \end{cases} \]

[Figure: surface plot of $f(x, y) = (x^2 + y^2) \cdot 1\{xy < 0\} + (x + y) \cdot 1\{xy \geq 0\}$]

The partial derivatives of this function at $0$ are $\frac{\partial f}{\partial x}(0, 0) = 1$ and $\frac{\partial f}{\partial y}(0, 0) = 1$. However, there are points arbitrarily close to zero with $\frac{\partial f}{\partial x}(x, y) = 2x$ and $\frac{\partial f}{\partial y}(x, y) = 2y$. If we were to try to draw a tangent plane to the function at zero, we would find that we cannot. Although the partial derivatives of this function exist everywhere, it is in some sense not differentiable at zero (or anywhere with $xy = 0$).

Partially motivated by the preceding example, we define the total derivative (or just the derivative; we're saying "total" to emphasize the difference between partial derivatives and the derivative).

Definition 1.3. Let $f: \mathbb{R}^n \to \mathbb{R}$. The derivative (or total derivative or differential) of $f$ at $x_0$ is a linear mapping, $Df_{x_0}: \mathbb{R}^n \to \mathbb{R}^1$, such that

\[ \lim_{h \to 0} \frac{\left| f(x_0 + h) - f(x_0) - Df_{x_0} h \right|}{\|h\|} = 0. \]
The $h$ in this definition is an $n$-vector in $\mathbb{R}^n$. This is in contrast to the $h$ in the definition of partial derivatives, which was just a scalar. The fact that $h$ is now a vector is important because $h$ can approach $0$ along any path. Partial derivatives only look at the limits

as h approaches 0 along the axes. This allows partial derivatives to exist for strange functions like the one in example 1.4. We can see that the function from the example is not differentiable by letting h approach 0 along a path that switches from xy < 0 to xy ≥ 0 infinitely many times close to 0. The limit in the definition of the derivative does not exist along such a path, so the derivative does not exist.
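A quick numerical sketch (my addition, not from the notes) of this failure: along the diagonal the difference quotients of the example's function match the candidate derivative $(1, 1)$ built from the partials, but along a direction with $xy < 0$ the limit in the definition of the total derivative does not go to zero.

```python
import numpy as np

def f(x, y):
    """The function from example 1.4."""
    return x**2 + y**2 if x * y < 0 else x + y

# Candidate total derivative at 0 built from the partials: Df = [1, 1]
Df = np.array([1.0, 1.0])

for v in [np.array([1.0, 1.0]), np.array([2.0, -1.0])]:  # xy >= 0 vs xy < 0 direction
    for t in [1e-1, 1e-3, 1e-5]:
        h = t * v
        err = abs(f(*h) - f(0.0, 0.0) - Df @ h) / np.linalg.norm(h)
        print(v, t, err)
# Along (1, 1) the error goes to 0; along (2, -1) it approaches 1/sqrt(5) ~ 0.447,
# so no single linear map works and f is not (totally) differentiable at 0.
```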

Comment 1.1. In proofs, it will be useful to define $r(x, h) = f(x + h) - f(x) - Df_x h$. We will then repeatedly use the fact that $\lim_{h \to 0} \frac{|r(x, h)|}{\|h\|} = 0$.

If the derivative of $f$ at $x_0$ exists, then so do the partial derivatives, and the total derivative is simply the $1 \times n$ matrix of partial derivatives.

Theorem 1.1. Let $f: \mathbb{R}^n \to \mathbb{R}$ be differentiable at $x_0$. Then $\frac{\partial f}{\partial x_i}(x_0)$ exists for each $i$ and
\[ Df_{x_0}(h) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(x_0) & \cdots & \frac{\partial f}{\partial x_n}(x_0) \end{pmatrix} h. \]

Proof. Since $f$ is differentiable at $x_0$, we can take $h = e_i t$ for $e_i$ the $i$th standard basis vector and $t$ a scalar. The definition of the derivative says that

\[ \lim_{t \to 0} \frac{\left| f(x_0 + e_i t) - f(x_0) - Df_{x_0}(e_i t) \right|}{\|e_i t\|} = 0. \]
Let
\[ r_i(x_0, t) = f(x_0 + e_i t) - f(x_0) - t Df_{x_0} e_i \]
and note that $\lim_{t \to 0} \frac{|r_i(x_0, t)|}{|t|} = 0$. Rearranging and dividing by $t$,

\[ \frac{f(x_0 + e_i t) - f(x_0)}{t} = Df_{x_0} e_i + \frac{r_i(x_0, t)}{t} \]
and taking the limit
\[ \lim_{t \to 0} \frac{f(x_0 + e_i t) - f(x_0)}{t} = Df_{x_0} e_i \]
we get the exact same expression as in the definition of the partial derivative. Therefore, $\frac{\partial f}{\partial x_i}(x_0) = Df_{x_0} e_i$. Finally, as when we first introduced matrices, we know that the linear transformation $Df_{x_0}$ must be represented by
\[ Df_{x_0}(h) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(x_0) & \cdots & \frac{\partial f}{\partial x_n}(x_0) \end{pmatrix} h. \quad \square \]

We know from example 1.4 that the converse of this theorem is false. The existence of partial derivatives is not enough for a function to be differentiable. However, if the partial derivatives exist and are continuous in a neighborhood, then the function is differentiable.

Theorem 1.2. Let $f: \mathbb{R}^n \to \mathbb{R}$ and suppose its partial derivatives exist and are continuous in $N_\delta(x_0)$ for some $\delta > 0$. Then $f$ is differentiable at $x_0$ with
\[ Df_{x_0} = \begin{pmatrix} \frac{\partial f}{\partial x_1}(x_0) & \cdots & \frac{\partial f}{\partial x_n}(x_0) \end{pmatrix}. \]

Proof. Let $h = (h_1, \ldots, h_n)$ with $\|h\| < r$. Notice that
\begin{align}
f(x_0 + h) - f(x_0) = & f(x_0 + h_1 e_1) - f(x_0) + f(x_0 + h_1 e_1 + h_2 e_2) - f(x_0 + h_1 e_1) + \ldots \tag{1} \\
& + f(x_0 + h) - f\left( x_0 + \sum_{i=1}^{n-1} h_i e_i \right) \tag{2} \\
= & \sum_{j=1}^n \left[ f\left( x_0 + \sum_{i=1}^j h_i e_i \right) - f\left( x_0 + \sum_{i=1}^{j-1} h_i e_i \right) \right]. \tag{3}
\end{align}
By the mean value theorem (1.5),
\[ f\left( x_0 + \sum_{i=1}^j h_i e_i \right) - f\left( x_0 + \sum_{i=1}^{j-1} h_i e_i \right) = h_j \frac{\partial f}{\partial x_j}\left( x_0 + \sum_{i=1}^{j-1} h_i e_i + \bar{h}_j e_j \right) \]

for some $\bar{h}_j$ between $0$ and $h_j$. The partial derivatives are continuous by assumption, so by making $r$ small enough, we can make

\[ \left| \frac{\partial f}{\partial x_j}\left( x_0 + \sum_{i=1}^{j-1} h_i e_i + \bar{h}_j e_j \right) - \frac{\partial f}{\partial x_j}(x_0) \right| < \epsilon / n \]
for any $\epsilon > 0$. Combined with equation (3), we now have
\[ f(x_0 + h) - f(x_0) = \sum_{j=1}^n h_j \left( \frac{\partial f}{\partial x_j}(x_0) + \epsilon_j \right), \quad |\epsilon_j| < \epsilon / n \tag{4} \]

so
\begin{align}
\left| f(x_0 + h) - f(x_0) - \sum_{j=1}^n h_j \frac{\partial f}{\partial x_j}(x_0) \right| & \leq \sum_{j=1}^n |h_j| \epsilon / n \tag{5} \\
\left| f(x_0 + h) - f(x_0) - Df_{x_0} h \right| & \leq \epsilon \|h\|. \tag{6}
\end{align}
Dividing by $\|h\|$ and taking the limit,

\[ \lim_{h \to 0} \frac{\left| f(x_0 + h) - f(x_0) - Df_{x_0} h \right|}{\|h\|} \leq \epsilon. \]
This is true for any $\epsilon > 0$, so the limit must be $0$. $\square$

A minor modification of this proof would show the stronger result that $f: \mathbb{R}^n \to \mathbb{R}$ has a continuous derivative on an open set $U \subseteq \mathbb{R}^n$ if and only if its partial derivatives are continuous on $U$. We call such a function continuously differentiable on $U$ and denote the set of all such functions as $C^1(U)$.

1.4. Mean value theorem. The mean value theorem in $\mathbb{R}^1$ says that $f(x + h) - f(x) = f'(\bar{x}) h$ for some $\bar{x}$ between $x + h$ and $x$. The same theorem holds for multivariate functions. To prove it, we will need a couple of intermediate results.

Theorem 1.3. Let $f: \mathbb{R}^n \to \mathbb{R}$ be continuous and $K \subset \mathbb{R}^n$ be compact. Then $\exists x^* \in K$ such that $f(x^*) \geq f(x)$ $\forall x \in K$.

Proof. In the last set of notes. $\square$

Definition 1.4. Let $f: \mathbb{R}^n \to \mathbb{R}$. We say that $f$ has a local maximum at $x$ if $\exists \delta > 0$ such that $f(y) \leq f(x)$ for all $y \in N_\delta(x)$.

Next, we need a result that relates derivatives to maxima.

Theorem 1.4. Let $f: \mathbb{R}^n \to \mathbb{R}$ and suppose $f$ has a local maximum at $x$ and is differentiable at $x$. Then $Df_x = 0$.

Proof. Choose $\delta$ as in the definition of a local maximum. Since $f$ is differentiable, we can write
\[ \frac{f(x + h) - f(x)}{\|h\|} = \frac{Df_x h}{\|h\|} + \frac{r(x, h)}{\|h\|} \]
where $\lim_{h \to 0} \frac{|r(x, h)|}{\|h\|} = 0$. Let $h = tv$ for some $v \in \mathbb{R}^n$ with $\|v\| = 1$ and $t \in \mathbb{R}$. If $Df_x v > 0$, then for $t > 0$ small enough, we would have
\[ \frac{f(x + tv) - f(x)}{|t|} = Df_x v + \frac{r(x, tv)}{|t|} > Df_x v / 2 > 0 \]
and $f(x + tv) > f(x)$, in contradiction to $x$ being a local maximum. Similarly, if $Df_x v < 0$, then for $t < 0$ and small, we would have
\[ \frac{f(x + tv) - f(x)}{|t|} = -Df_x v + \frac{r(x, tv)}{|t|} > -Df_x v / 2 > 0 \]
and $f(x + tv) > f(x)$. Thus, it must be that $Df_x v = 0$ for all $v$, i.e. $Df_x = 0$. $\square$

Now we can prove the mean value theorem.

Theorem 1.5 (mean value). Let $f: \mathbb{R}^n \to \mathbb{R}^1$ be continuously differentiable on some open set $U$ (i.e. $f \in C^1(U)$). Let $x, y \in U$ be such that the line connecting $x$ and $y$, $\ell(x, y) = \{z \in \mathbb{R}^n : z = \lambda x + (1 - \lambda) y, \lambda \in [0, 1]\}$, is also in $U$. Then there is some $\bar{x} \in \ell(x, y)$ such that

\[ f(x) - f(y) = Df_{\bar{x}}(x - y). \]
Proof. Let $z(t) = y + t(x - y)$ for $t \in [0, 1]$ (i.e. $t = \lambda$). Define
\[ g(t) = f(y) - f(z(t)) + \left( f(x) - f(y) \right) t. \]
Note that $g(0) = g(1) = 0$. The set $[0, 1]$ is closed and bounded, so it is compact. It is easy to verify that $g(t)$ is continuously differentiable since $f$ is continuously differentiable. Hence, $g$ must attain its maximum on $[0, 1]$, say at $\bar{t}$. If $\bar{t} = 0$ or $1$, then either $g$ is constant, in which case any $\bar{t} \in (0, 1)$ is also a maximum, or $g$ must have an interior minimum, and we can look at the maximum of $-g$ instead. When $\bar{t}$ is not $0$ or $1$, the previous theorem shows that $g'(\bar{t}) = 0$. Simple calculation shows that
\[ g'(\bar{t}) = -Df_{z(\bar{t})}(x - y) + f(x) - f(y) = 0, \]
so $Df_{\bar{x}}(x - y) = f(x) - f(y)$ where $\bar{x} = z(\bar{t})$. $\square$

1.5. Functions from $\mathbb{R}^n$ to $\mathbb{R}^m$. So far we have only looked at functions from $\mathbb{R}^n$ to $\mathbb{R}$. Functions to $\mathbb{R}^m$ work essentially the same way.

Definition 1.5. Let $f: \mathbb{R}^n \to \mathbb{R}^m$. The derivative (or total derivative or differential) of $f$ at $x_0$ is a linear mapping, $Df_{x_0}: \mathbb{R}^n \to \mathbb{R}^m$, such that

\[ \lim_{h \to 0} \frac{\left\| f(x_0 + h) - f(x_0) - Df_{x_0} h \right\|}{\|h\|} = 0. \]

Theorems 1.1 and 1.2 still hold with no modification. The total derivative of $f$ can be represented by the $m$ by $n$ matrix of partial derivatives,
\[ Df_{x_0} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(x_0) & \cdots & \frac{\partial f_1}{\partial x_n}(x_0) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(x_0) & \cdots & \frac{\partial f_m}{\partial x_n}(x_0) \end{pmatrix}. \]
This matrix of partial derivatives is often called the Jacobian of $f$.

The mean value theorem 1.5 holds for each of the component functions of $f: \mathbb{R}^n \to \mathbb{R}^m$. Meaning, $f$ can be written as $f(x) = \begin{pmatrix} f_1(x) & \cdots & f_m(x) \end{pmatrix}^T$ where each $f_j: \mathbb{R}^n \to \mathbb{R}$. The mean value theorem is true for each $f_j$, but the $\bar{x}$'s will typically differ with $j$.

Corollary 1.1 (mean value for $\mathbb{R}^n \to \mathbb{R}^m$). Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be in $C^1(U)$ for some open $U$. Let $x, y \in U$ be such that the line connecting $x$ and $y$, $\ell(x, y) = \{z \in \mathbb{R}^n : z = \lambda x + (1 - \lambda) y, \lambda \in [0, 1]\}$, is also in $U$. Then there are $\bar{x}_j \in \ell(x, y)$ such that

\[ f_j(x) - f_j(y) = Df_{j, \bar{x}_j}(x - y) \]
and
\[ f(x) - f(y) = \begin{pmatrix} Df_{1, \bar{x}_1} \\ \vdots \\ Df_{m, \bar{x}_m} \end{pmatrix} (x - y). \]
Slightly abusing notation, we might at times write $Df_{\bar{x}}$ instead of $\begin{pmatrix} Df_{1, \bar{x}_1} & \cdots & Df_{m, \bar{x}_m} \end{pmatrix}^T$ with the understanding that we mean the latter.

1.6. Chain rule. For univariate functions, the chain rule says that the derivative of $f(g(x))$ is $f'(g(x)) g'(x)$. The same is true for multivariate functions.

Theorem 1.6. Let $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^k \to \mathbb{R}^n$. Let $g$ be continuously differentiable on some open set $U$ and $f$ be continuously differentiable on $g(U)$. Then $h: \mathbb{R}^k \to \mathbb{R}^m$, $h(x) = f(g(x))$ is continuously differentiable on $U$ with
\[ Dh_x = Df_{g(x)} Dg_x. \]

Proof. Let $x \in U$. Consider

\[ \frac{f(g(x + d)) - f(g(x))}{\|d\|}. \]
Since $g$ is differentiable, by the mean value theorem, $g(x + d) = g(x) + Dg_{\bar{x}(d)} d$, so

\begin{align*}
\left\| f(g(x + d)) - f(g(x)) \right\| & = \left\| f(g(x) + Dg_{\bar{x}(d)} d) - f(g(x)) \right\| \\
& \leq \left\| f(g(x) + Dg_x d) - f(g(x)) \right\| + \epsilon
\end{align*}

where the inequality follows from the continuity of $Dg_x$ and $f$, and holds for any $\epsilon > 0$. $f$ is differentiable, so
\[ \lim_{Dg_x d \to 0} \frac{\left\| f(g(x) + Dg_x d) - f(g(x)) - Df_{g(x)} Dg_x d \right\|}{\left\| Dg_x d \right\|} = 0. \]

Using the Cauchy-Schwarz inequality, $\left\| Dg_x d \right\| \leq \left\| Dg_x \right\| \|d\|$, we get

\[ \frac{\left\| f(g(x) + Dg_x d) - f(g(x)) - Df_{g(x)} Dg_x d \right\|}{\left\| Dg_x \right\| \|d\|} \leq \frac{\left\| f(g(x) + Dg_x d) - f(g(x)) - Df_{g(x)} Dg_x d \right\|}{\left\| Dg_x d \right\|} \]

so
\[ \lim_{d \to 0} \frac{\left\| f(g(x) + Dg_x d) - f(g(x)) - Df_{g(x)} Dg_x d \right\|}{\|d\|} = 0. \quad \square \]
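Here is a small numerical sketch (my addition) of the chain rule for a pair of maps $\mathbb{R}^2 \to \mathbb{R}^2$: the finite-difference Jacobian of $h = f \circ g$ matches the product of the analytic Jacobians $Df_{g(x)} Dg_x$. The functions chosen are arbitrary smooth examples.

```python
import numpy as np

def g(x):
    return np.array([x[0] * x[1], x[0] + x[1]**2])

def Dg(x):
    return np.array([[x[1], x[0]], [1.0, 2.0 * x[1]]])

def f(y):
    return np.array([np.sin(y[0]), y[0] * np.exp(y[1])])

def Df(y):
    return np.array([[np.cos(y[0]), 0.0],
                     [np.exp(y[1]), y[0] * np.exp(y[1])]])

def jacobian(func, x, step=1e-6):
    """Finite-difference Jacobian of func at x, one column per coordinate."""
    cols = []
    for i in range(len(x)):
        e = np.zeros(len(x))
        e[i] = step
        cols.append((func(x + e) - func(x)) / step)
    return np.column_stack(cols)

x = np.array([0.5, -1.2])
h = lambda x: f(g(x))
print(jacobian(h, x))    # numerical Dh_x
print(Df(g(x)) @ Dg(x))  # chain rule: the two agree to ~1e-5
```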

1.7. Higher order derivatives. We can take higher order derivatives of multivariate functions just like of univariate functions. If $f: \mathbb{R}^n \to \mathbb{R}^m$, then $f$ has $nm$ partial first derivatives. Each of these has $n$ partial derivatives, so $f$ has $n^2 m$ partial second derivatives, written $\frac{\partial^2 f_k}{\partial x_i \partial x_j}$.

Theorem 1.7. Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be twice continuously differentiable on some open set $U$. Then

\[ \frac{\partial^2 f_k}{\partial x_i \partial x_j}(x) = \frac{\partial^2 f_k}{\partial x_j \partial x_i}(x) \]

for all i, j, k and x ∈ U.

Proof. Using the definition of partial derivative, twice, we have

\[ \frac{\partial^2 f}{\partial x_i \partial x_j} = \lim_{t_j \to 0} \frac{\lim_{t_i \to 0} \frac{f(x + t_i e_i + t_j e_j) - f(x + t_j e_j)}{t_i} - \lim_{t_i \to 0} \frac{f(x + t_i e_i) - f(x)}{t_i}}{t_j} \]

\[ = \lim_{t_j \to 0} \lim_{t_i \to 0} \frac{f(x + t_j e_j + t_i e_i) - f(x + t_j e_j) - f(x + t_i e_i) + f(x)}{t_j t_i} \]

from which it is apparent that we get the same expression for $\frac{\partial^2 f}{\partial x_j \partial x_i}$.$^1$ $\square$

The same argument shows that in general the order of partial derivatives does not matter.

Corollary 1.2. Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be $k$ times continuously differentiable on some open set $U$. Then

\[ \frac{\partial^k f}{\partial x_1^{j_1} \cdots \partial x_n^{j_n}} = \frac{\partial^k f}{\partial x_{p(1)}^{j_{p(1)}} \cdots \partial x_{p(n)}^{j_{p(n)}}} \]
where $\sum_{i=1}^n j_i = k$ and $p: \{1, \ldots, n\} \to \{1, \ldots, n\}$ is any permutation (i.e. reordering).

$^1$This proof is not completely correct. We should carefully show that we can interchange the order of taking limits. Interchanging limits is not always possible, but the assumed continuity makes it possible here.
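As a numerical sketch (my addition) of theorem 1.7, the finite-difference second partials of an arbitrary smooth function form a symmetric matrix, up to discretization error.

```python
import numpy as np

def f(x):
    return np.sin(x[0]) * x[1]**2 + np.exp(x[0] * x[1])

def second_partial(f, x, i, j, h=1e-4):
    """Central finite-difference approximation of d^2 f / dx_i dx_j at x."""
    ei = np.zeros(len(x)); ei[i] = h
    ej = np.zeros(len(x)); ej[j] = h
    return (f(x + ei + ej) - f(x + ei - ej)
            - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)

x = np.array([0.7, -0.3])
H = np.array([[second_partial(f, x, i, j) for j in range(2)] for i in range(2)])
print(H)
print(np.allclose(H, H.T, atol=1e-5))  # True: mixed partials commute
```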

1.8. Taylor series. You have probably seen Taylor series for univariate functions before. A function can be approximated by a polynomial whose coefficients are the function’s derivatives.

Theorem 1.8. Let $f: \mathbb{R} \to \mathbb{R}$ be $k + 1$ times continuously differentiable on some open set $U$, and let $a, a + h \in U$. Then

\[ f(a + h) = f(a) + f'(a) h + \frac{f''(a)}{2} h^2 + \ldots + \frac{f^{(k)}(a)}{k!} h^k + \frac{f^{(k+1)}(\bar{a})}{(k + 1)!} h^{k+1} \]

where $\bar{a}$ is between $a$ and $a + h$.

The same theorem is true for multivariate functions.

Theorem 1.9. Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be $k$ times continuously differentiable on some open set $U$ and $a, a + h \in U$. Then there exists a $k$ times continuously differentiable function $r_k(a, h)$ such that

\[ f(a + h) = f(a) + \sum_{i=1}^{k} \sum_{j_1 + \cdots + j_n = i} \frac{1}{j_1! \cdots j_n!} \frac{\partial^i f}{\partial x_1^{j_1} \cdots \partial x_n^{j_n}}(a) h_1^{j_1} h_2^{j_2} \cdots h_n^{j_n} + r_k(a, h) \]

and $\lim_{h \to 0} \frac{\left\| r_k(a, h) \right\|}{\|h\|^k} = 0$.

Proof. Follows from the mean value theorem. For $k = 1$, the mean value theorem says that

\[ f(a + h) - f(a) = Df_{\bar{a}} h \]

\[ f(a + h) = f(a) + Df_{\bar{a}} h = f(a) + Df_a h + \underbrace{(Df_{\bar{a}} - Df_a) h}_{r_1(a, h)} \]

$Df_a$ is continuous as a function of $a$, and as $h \to 0$, $\bar{a} \to a$, so $\lim_{h \to 0} \frac{\|r_1(a, h)\|}{\|h\|} = 0$, and the theorem is true for $k = 1$. For general $k$, suppose we have proven the theorem up to $k - 1$. Then repeating the same argument with the $k-1$st derivative of $f$ in place of $f$ shows that the theorem is true for $k$. The only complication is the division by $k!$. To see where it comes from, we will just focus on $f: \mathbb{R} \to \mathbb{R}$. The idea is the same for $\mathbb{R}^n$, but the notation gets messy. Suppose we want a second order approximation to $f$ at $a$,
\[ \hat{f}(h) = f(a) + f'(a) h + c_2 f''(a) h^2 \]
and pretend that we do not know $c_2$. Consider $f(a + h) - \hat{f}(h)$. Applying the mean value theorem to the difference of these functions twice, we have
\begin{align*}
f(a + h) - \hat{f}(h) & = \underbrace{\left( f(a) - \hat{f}(0) \right)}_{= 0} + \left( f'(a + \bar{h}_1) - \hat{f}'(\bar{h}_1) \right) h \\
& = \Big( \underbrace{f'(a) - \hat{f}'(0)}_{= 0} + \left( f''(a + \bar{h}_2) - \hat{f}''(\bar{h}_2) \right) \bar{h}_1 \Big) h \\
& = f''(a) (1 - 2 c_2) \bar{h}_1 h + f'''(a + \bar{h}_3) \bar{h}_2 \bar{h}_1 h.
\end{align*}
If we set $c_2 = \frac{1}{2}$, we can eliminate one term and
\[ \left| f(a + h) - \hat{f}(h) \right| \leq \underbrace{\left| f'''(a + \bar{h}_3) h^3 \right|}_{r_2(a, h)}. \]
Repeating this sort of argument, we will see that setting $c_k = \frac{1}{k!}$ ensures that $\lim_{h \to 0} \frac{\|r_k(a, h)\|}{\|h\|^k} = 0$. $\square$
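A quick numerical sketch (my addition) of theorem 1.9's error bound for a univariate function: the remainder of the order-$k$ Taylor polynomial, divided by $h^k$, shrinks as $h \to 0$. Here $f = \exp$ expanded around $a = 0$, where all derivatives equal $1$.

```python
import numpy as np

a = 0.0
for h in [0.1, 0.01, 0.001]:
    taylor1 = 1.0 + h             # k = 1 polynomial
    taylor2 = 1.0 + h + h**2 / 2  # k = 2 polynomial
    r1 = np.exp(a + h) - taylor1
    r2 = np.exp(a + h) - taylor2
    print(h, r1 / h, r2 / h**2)
# Both ratios -> 0 as h -> 0, i.e. r_k(a, h) / h^k -> 0.
```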

Example 1.5. The mean value theorem is used often in econometrics to show asymptotic normality. Many estimators can be written as
\[ \hat{\theta}_n \in \arg\min_{\theta \in \Theta} Q^n(\theta) \]
where $Q^n(\theta)$ is some objective function that depends on the sampled data. Examples include least squares, maximum likelihood, and the generalized method of moments. Suppose there is also a population version of the objective function, $Q^0(\theta)$, and $Q^n(\theta) \overset{p}{\to} Q^0(\theta)$ as $n \to \infty$. There is a true value of the parameter, $\theta_0$, that satisfies
\[ \theta_0 \in \arg\min_{\theta \in \Theta} Q^0(\theta). \]
For example, for OLS,
\[ Q^n(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - x_i \theta)^2 \]
and
\[ Q^0(\theta) = \mathrm{E}\left[ (Y - X \theta)^2 \right]. \]
If $Q^n$ is continuously differentiable$^a$ on $\Theta$ and $\hat{\theta}_n \in \mathrm{int}(\Theta)$, then from theorem 1.4,
\[ DQ^n_{\hat{\theta}_n} = 0. \]


Applying the mean value theorem,
\begin{align*}
0 = DQ^n_{\hat{\theta}_n} & = DQ^n_{\theta_0} + D^2 Q^n_{\bar{\theta}} (\hat{\theta}_n - \theta_0) \\
(\hat{\theta}_n - \theta_0) & = -\left( D^2 Q^n_{\bar{\theta}} \right)^{-1} DQ^n_{\theta_0}.
\end{align*}
Typically, some variant of the central limit theorem implies $\sqrt{n} DQ^n_{\theta_0} \overset{d}{\to} N(0, \Sigma)$. For example, for OLS,
\[ \sqrt{n} DQ^n_{\theta_0} = \frac{1}{\sqrt{n}} \sum_i -2 (y_i - x_i \theta_0) x_i. \]
Also, typically $D^2 Q^n_{\bar{\theta}} \overset{p}{\to} D^2 Q^0_{\theta_0}$, so by Slutsky's theorem,$^b$
\[ \sqrt{n} (\hat{\theta}_n - \theta_0) = -\left( D^2 Q^n_{\bar{\theta}} \right)^{-1} \sqrt{n} DQ^n_{\theta_0} \overset{d}{\to} N\left( 0, \left( D^2 Q^0_{\theta_0} \right)^{-1} \Sigma \left( D^2 Q^0_{\theta_0} \right)^{-1} \right). \]

$^a$Essentially the same argument works if you expand $Q^0$ instead of $Q^n$. This is sometimes necessary because there are some models, like quantile regression, where $Q^n$ is not differentiable, but $Q^0$ is differentiable.
$^b$Part of Slutsky's theorem says that if $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} c$, then $X_n / Y_n \overset{d}{\to} X / c$.
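The following Monte Carlo sketch (my addition, with arbitrary simulation settings) illustrates the conclusion for scalar OLS: $\sqrt{n}(\hat{\theta}_n - \theta_0)$ is approximately normal with the sandwich variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 1.0, 500, 2000
draws = []
for _ in range(reps):
    x = rng.normal(size=n)
    y = theta0 * x + rng.normal(size=n)   # y_i = x_i theta0 + eps_i
    theta_hat = (x @ y) / (x @ x)         # scalar OLS estimator
    draws.append(np.sqrt(n) * (theta_hat - theta0))
draws = np.array(draws)
# With E[X^2] = Var(eps) = 1: D^2 Q0 = 2 and Sigma = Var(-2 eps X) = 4,
# so the sandwich variance (D^2 Q0)^{-1} Sigma (D^2 Q0)^{-1} = 1.
print(draws.mean(), draws.var())  # approximately 0 and 1
```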

2. Functions on vector spaces

To analyze infinite dimensional optimization problems, we need to differentiate functions on infinite dimensional vector spaces. We already did this when studying optimal control, but we glossed over the details. Anyway, we can define the derivative of a function between any two vector spaces as follows.

Definition 2.1. Let $f: V \to W$. The Fréchet derivative of $f$ at $x_0$ is a continuous$^2$ linear mapping, $Df_{x_0}: V \to W$, such that

\[ \lim_{h \to 0} \frac{\left\| f(x_0 + h) - f(x_0) - Df_{x_0} h \right\|}{\|h\|} = 0. \]
Note that this definition is the same as the definition of the total derivative.

Example 2.1. Let $V = L^\infty(0, 1)$ and $W = \mathbb{R}$. Suppose $f$ is given by
\[ f(x) = \int_0^1 g(x(\tau), \tau) d\tau \]
for some continuously differentiable function $g: \mathbb{R}^2 \to \mathbb{R}$. Then $Df_x$ is a linear transformation from $V$ to $\mathbb{R}$. How can we calculate $Df_x$? If $V$ were $\mathbb{R}^n$, we would calculate the partial derivatives of $f$ and then maybe check that they are continuous so that theorem 1.2 holds. For an infinite dimensional space there are infinitely many partial derivatives, so we cannot possibly compute them all. However, we can look at directional derivatives.

$^2$If $V$ and $W$ are finite dimensional, then all linear functions are continuous. In infinite dimensions, there can be discontinuous linear functions.

Definition 2.2. Let $f: V \to W$, $v \in V$, and $x \in U \subseteq V$ for some open $U$. The directional derivative (or Gâteaux derivative, when $V$ is infinite dimensional) in direction $v$ at $x$ is
\[ df(x; v) = \lim_{\alpha \to 0} \frac{f(x + \alpha v) - f(x)}{\alpha} \]
where $\alpha \in \mathbb{R}$ is a scalar.
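As a numerical sketch (my addition), we can approximate a Gâteaux derivative of the integral functional from example 2.1 on a grid: the difference quotient $(f(x + \alpha v) - f(x))/\alpha$ approaches $\int_0^1 \frac{\partial g}{\partial x}(x(\tau), \tau) v(\tau) d\tau$, the formula derived in the continued example below. The choices of $g$, $x$, and $v$ here are arbitrary.

```python
import numpy as np

tau = np.linspace(0.0, 1.0, 2001)
dtau = tau[1] - tau[0]

def g(x, t):
    return np.sin(x) * t   # an arbitrary smooth integrand g(x, tau)

def gx(x, t):
    return np.cos(x) * t   # its partial derivative dg/dx

def F(x):
    """f(x) = integral_0^1 g(x(tau), tau) dtau, via a Riemann sum."""
    return np.sum(g(x, tau)) * dtau

x = tau**2                 # a point x in L^infinity(0, 1)
v = np.cos(3 * tau)        # a direction v

for alpha in [1e-2, 1e-4]:
    print((F(x + alpha * v) - F(x)) / alpha)  # difference quotients
print(np.sum(gx(x, tau) * v) * dtau)          # integral of g_x(x, tau) v(tau)
# The three numbers agree, matching df(x; v) computed below.
```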

Analogs of theorems 1.1 and 1.2 relate the Gâteaux derivative to the Fréchet derivative.

Lemma 2.1. If $f: V \to W$ is Fréchet differentiable at $x$, then the Gâteaux derivative, $df(x; v)$, exists for all $v \in V$, and

\[ df(x; v) = Df_x v. \]
The proof of theorem 1.2 relies on the fact that $\mathbb{R}^n$ is finite dimensional. In fact, in an infinite dimensional space it is not enough that all the directional derivatives be continuous on an open set around $x$ for the function to be differentiable at $x$; we also require the directional derivatives to be linear in $v$. In finite dimensions we can always create a linear transformation from the partial derivatives by arranging the partial derivatives in a matrix. In infinite dimensions, we cannot do that.

Lemma 2.2. If $f: V \to W$ has Gâteaux derivatives that are linear in $v$ and "continuous" in $x$ in the sense that $\forall \epsilon > 0$ $\exists \delta > 0$ such that if $\|x_1 - x\| < \delta$, then

\[ \sup_{v \in V} \frac{\left\| df(x_1; v) - df(x; v) \right\|}{\|v\|} < \epsilon, \]
then $f$ is Fréchet differentiable with $Df_x v = df(x; v)$.

Comment 2.1. This continuity in $x$ is actually a very natural definition. If $V$ and $W$ are normed vector spaces, then the set of all bounded (or equivalently continuous) linear transformations is also a normed vector space with norm
\[ \|A\| \equiv \sup_{v \in V} \frac{\|Av\|_W}{\|v\|_V}. \]
We are requiring $df(x; v)$, as a function of $x$, to be continuous with respect to this norm.

Proof. Note that
\[ f(x + h) - f(x) = \int_0^1 df(x + th; h) dt \]
by the fundamental theorem of calculus (which we should really prove, but do not have time for, so we will take it as given). Then,

\begin{align*}
\left\| f(x + h) - f(x) - df(x; h) \right\| & = \left\| \int_0^1 \left( df(x + th; h) - df(x; h) \right) dt \right\| \\
& \leq \int_0^1 \left\| df(x + th; h) - df(x; h) \right\| dt
\end{align*}

By the definition of $\sup$,
\[ \left\| df(x + th; h) - df(x; h) \right\| \leq \left( \sup_{v \in V} \frac{\left\| df(x + th; v) - df(x; v) \right\|}{\|v\|} \right) \|h\|. \]
The continuity in $x$ implies for any $\epsilon > 0$ $\exists \delta > 0$ such that if $\|th\| < \delta$, then

\[ \sup_{v \in V} \frac{\left\| df(x + th; v) - df(x; v) \right\|}{\|v\|} < \epsilon. \]
Thus,

\[ \left\| f(x + h) - f(x) - df(x; h) \right\| < \int_0^1 \epsilon \|h\| dt = \epsilon \|h\|. \]
In other words, for any $\epsilon > 0$ $\exists \delta > 0$ such that if $\|h\| < \delta$, then

\[ \frac{\left\| f(x + h) - f(x) - df(x; h) \right\|}{\|h\|} < \epsilon, \]
and we can conclude that $df(x; h) = Df_x h$. $\square$

Example (Example 2.1 continued). Motivated by lemmas 2.1 and 2.2, we can find the Fréchet derivative of $f$ by computing its Gâteaux derivatives. Let $v \in V$. Remember that both $x$ and $v$ are functions in this example. Then,
\[ f(x + \alpha v) = \int_0^1 g(x(\tau) + \alpha v(\tau), \tau) d\tau \]
and
\[ df(x; v) = \lim_{\alpha \to 0} \frac{\int_0^1 g(x(\tau) + \alpha v(\tau), \tau) d\tau - f(x)}{\alpha} = \int_0^1 \frac{\partial g}{\partial x}(x(\tau), \tau) v(\tau) d\tau. \]
Now, we can either check that these derivatives are linear and continuous, or just guess and verify that
\[ Df_x(v) = \int_0^1 \frac{\partial g}{\partial x}(x(\tau), \tau) v(\tau) d\tau. \]
Note that this expression is linear in $v$, as it must be for it to be the derivative. Now, we check that the limit in the definition of the derivative is zero,
\begin{align*}
\lim_{h \to 0} \frac{\left\| f(x + h) - f(x) - Df_x(h) \right\|}{\|h\|} & = \lim_{h \to 0} \frac{\left| \int_0^1 \left( g(x(\tau) + h(\tau), \tau) - g(x(\tau), \tau) - \tfrac{\partial g}{\partial x}(x(\tau), \tau) h(\tau) \right) d\tau \right|}{\|h\|} \\
& \leq \lim_{h \to 0} \frac{\int_0^1 \left| g(x(\tau) + h(\tau), \tau) - g(x(\tau), \tau) - \tfrac{\partial g}{\partial x}(x(\tau), \tau) h(\tau) \right| d\tau}{\|h\|}
\end{align*}
where the inequality follows from the triangle inequality. To simplify, let us assume that $g$ and $\frac{\partial g}{\partial x}$ are bounded. Then, by the dominated convergence theorem,$^a$ we can


interchange the integral and the limit. We then have

\[ \leq \int_0^1 \lim_{h \to 0} \frac{\left| g(x(\tau) + h(\tau), \tau) - g(x(\tau), \tau) - \frac{\partial g}{\partial x}(x(\tau), \tau) h(\tau) \right|}{\|h\|} d\tau. \]
The definition of $\frac{\partial g}{\partial x}$ says that
\[ \frac{\left| g(x(\tau) + h(\tau), \tau) - g(x(\tau), \tau) - \frac{\partial g}{\partial x}(x(\tau), \tau) h(\tau) \right|}{|h(\tau)|} \to 0. \]
Also, $\frac{|h(\tau)|}{\|h\|} \leq 1$ for all $\tau$ because in $L^\infty(0, 1)$, $\|h\| = \sup_{0 \leq \tau \leq 1} |h(\tau)|$. Thus, we can conclude that

\[ \lim_{h \to 0} \frac{\left| g(x(\tau) + h(\tau), \tau) - g(x(\tau), \tau) - \frac{\partial g}{\partial x}(x(\tau), \tau) h(\tau) \right|}{\|h\|} = \lim_{h \to 0} \frac{\left| g(x(\tau) + h(\tau), \tau) - g(x(\tau), \tau) - \frac{\partial g}{\partial x}(x(\tau), \tau) h(\tau) \right|}{|h(\tau)|} \frac{|h(\tau)|}{\|h\|} = 0, \]

so $f$ is Fréchet differentiable at $x$ with derivative $Df_x$.

$^a$We have not covered the dominated convergence theorem. Unless specifically stated otherwise, on homeworks and exams you can assume that interchanging limits and integrals is allowed. However, do not forget that this is not always allowed. The issue is the order of taking limits. Integrals are defined in terms of limits (either Riemann sums or integrals of simple functions). It is not difficult to come up with examples where $a_{m,n}$ is a doubly indexed sequence and $\lim_{m \to \infty} \lim_{n \to \infty} a_{m,n} \neq \lim_{n \to \infty} \lim_{m \to \infty} a_{m,n}$.

With this definition of the derivative, almost everything that we proved above for functions from $\mathbb{R}^n$ to $\mathbb{R}^m$ also holds for functions on Banach spaces. In particular, the chain rule, Taylor's theorem, the implicit function theorem, and the inverse function theorem hold. The proofs of these theorems on Banach spaces are essentially the same as above, so we will not go over them. If you find this interesting, you may want to go through the proofs of all these claims, but it is not necessary to do so. The mean value theorem is slightly more delicate. It still holds for functions $f: V \to \mathbb{R}^m$, but must be modified when the target space of $f$ is infinite dimensional. We start with the mean value theorem for $f: V \to \mathbb{R}$.

Theorem 2.1 (Mean value theorem (onto $\mathbb{R}$)). Let $f: V \to \mathbb{R}^1$ be continuously differentiable on some open set $U$. Let $x, y \in U$ be such that the line connecting $x$ and $y$, $\ell(x, y) = \{z \in V : z = \lambda x + (1 - \lambda) y, \lambda \in [0, 1]\}$, is also in $U$. Then there is some $\bar{x} \in \ell(x, y)$ such that

\[ f(x) - f(y) = Df_{\bar{x}}(x - y). \]

Proof. Identical to the proof of 1.5. □

This result can then be generalized to $f: V \to \mathbb{R}^n$ in the same way as corollary 1.1.

Now, let's consider a first order Taylor expansion for $f: V \to W$:

\[ f(x + h) - f(x) = Df_{\bar{x}} h \]

\[ f(x + h) = f(x) + Df_{\bar{x}} h = f(x) + Df_x h + \underbrace{(Df_{\bar{x}} - Df_x) h}_{r_1(x, h)} \]
To show that first order Taylor expansions have small approximation error, we need to show that $(Df_{\bar{x}} - Df_x) h$ is uniformly small for all possible $h$. When $W$ is $\mathbb{R}^1$, there is a single $\bar{x}$. When $W = \mathbb{R}^m$, then $\bar{x}$ is different for each dimension (as in corollary 1.1). However, this is okay because we just take the maximum over the finitely many different $\bar{x}$

to get an upper bound on $\left\| (Df_{\bar{x}} - Df_x) h \right\|$. When $W$ is infinite dimensional, this argument does not work. Instead, we must use a result called the Hahn-Banach extension theorem, which is closely related to the separating hyperplane theorem.

Theorem 2.2 (Hahn-Banach theorem). Let $V$ be a vector space and $g: V \to \mathbb{R}$ be convex. Let $S \subseteq V$ be a linear subspace. Suppose $f_0: S \to \mathbb{R}$ is linear and

\[ f_0(x) \leq g(x) \]
for all $x \in S$. Then $f_0$ can be extended to $f: V \to \mathbb{R}$ such that

\[ f_0(x) = f(x) \]
for all $x \in S$ and $f(x) \leq g(x)$ for all $x \in V$.

Proof. We will apply the separating hyperplane theorem to two sets in $V \times \mathbb{R}$. Let $A = \{(x, a) : a > g(x), x \in V, a \in \mathbb{R}\}$. Since $g$ is a convex function, $A$ is a convex set. Let
\[ B = \{(x, a) : a = f_0(x), x \in S, a \in \mathbb{R}\}. \]
Since $f_0$ is linear, $B$ is convex. Clearly $A \cap B = \emptyset$. Also, $\mathrm{int}\, A \neq \emptyset$ because $A$ is open, so $\mathrm{int}\, A = A$, and $A \neq \emptyset$ since e.g. $(x, g(x) + 1) \in A$. Therefore, by the separating hyperplane theorem, $\exists \xi \in (V \times \mathbb{R})^*$ and $c \in \mathbb{R}$ such that

\[ \xi z_A > c \geq \xi z_B \]

for all $z_A \in A$ and $z_B \in B$. $B$ is a linear subspace, so it must be that $\xi z_B = 0$ for all $z_B \in B$, so we can take $c = 0$. Note that since $(0, 1) \in A$, $\xi(0, 1) > 0$. Let $f: V \to \mathbb{R}$ be given by
\[ f(x) = -\frac{\xi(x, 0)}{\xi(0, 1)}. \]
To conclude, we must show that $f(x) = f_0(x)$ for all $x \in S$ and $f(x) \leq g(x)$ for all $x \in V$. To show that $f(x) = f_0(x)$, by linearity of $\xi$ we have
\[ \xi(x, y) = \xi(x, 0) + y \xi(0, 1), \qquad \frac{\xi(x, y)}{\xi(0, 1)} = -f(x) + y. \]

If $x \in S$, then $\xi(x, f_0(x)) = 0$, so $f(x) = f_0(x)$ for all $x \in S$. Similarly, for any $y > g(x)$ we have $(x, y) \in A$, so $\frac{\xi(x, y)}{\xi(0, 1)} = -f(x) + y > 0$, and $y > f(x)$. Therefore, $f(x) \leq g(x)$. $\square$

With this result in hand, we can now prove the mean value theorem for arbitrary Banach spaces.

Theorem 2.3 (Mean value inequality). Let $f: V \to W$ be continuously differentiable on some open set $U$. Let $x, y \in U$ be such that the line connecting $x$ and $y$, $\ell(x, y) = \{z \in V : z = \lambda x + (1 - \lambda) y, \lambda \in [0, 1]\}$, is also in $U$. Then there is some $\bar{x} \in \ell(x, y)$ such that

\[ \left\| f(x) - f(y) \right\|_W \leq \left\| Df_{\bar{x}}(x - y) \right\|_W \leq \left\| Df_{\bar{x}} \right\|_{BL(V, W)} \left\| x - y \right\|_V \]

Proof. Step 1: use the Hahn-Banach theorem to show that $\exists \phi \in W^*$ with $\|\phi\| = 1$ and

\[ \left\| f(x) - f(y) \right\| = \phi\left( f(x) - f(y) \right). \]
The relevant $f_0$ is $f_0\left( \alpha (f(x) - f(y)) \right) = \alpha \left\| f(x) - f(y) \right\|$. Define $g: [0, 1] \to \mathbb{R}$ by
\[ g(t) = \phi\left( f(tx + (1 - t) y) \right) \]
and apply the mean value theorem on $\mathbb{R}$ to get
\[ \left\| f(x) - f(y) \right\| = \phi\left( Df_{t^* x + (1 - t^*) y}(x - y) \right). \quad \square \]

3. Optimization in vector spaces

Recall from earlier that there are many sets of functions that are vector spaces. We talked a little bit about $L^p$ spaces of functions. The set of all continuous functions and the sets of all $k$ times continuously differentiable functions are also vector spaces. One of these vector spaces of functions will be appropriate for finding the solution to optimal control problems. Exactly which vector space is a slightly technical, problem dependent question, so we will not worry about that for now (and we may not worry about it at all). Similarly, there are vector spaces of infinite sequences. Little $\ell^p$ is similar to big $L^p$, but with sequences instead of functions:
\[ \ell^p = \left\{ \{x_t\}_{t=1}^\infty : \left( \sum_{t=1}^\infty |x_t|^p \right)^{1/p} < \infty \right\}. \]
There are others as well. Again, the right choice of vector space depends on the problem being considered, and we will not worry about it too much.

Given two Banach spaces (complete normed vector spaces), $V$ and $W$, let $BL(V, W)$ denote the set of all bounded linear transformations from $V$ to $W$. The derivative of $f: V \to W$ will be in $BL(V, W)$. We can show that $BL(V, W)$ is a vector space, and we can define a norm on $BL(V, W)$ by
\[ \left\| D \right\|_{BL(V, W)} = \sup_{\|v\|_V = 1} \left\| Dv \right\|_W \]
where $\|\cdot\|_V$ is the norm on $V$ and $\|\cdot\|_W$ is the norm on $W$. Moreover, we could show that $BL(V, W)$ is complete. Thus, $BL(V, W)$ is also a Banach space. Viewed as a function of $x$, $Df_x$ is a function from $V$ to $BL(V, W)$. As we just said, these are both Banach spaces, so we can differentiate $Df_x$ with respect to $x$. In this way, we can define the second and higher derivatives of $f: V \to W$.

In the previous section we saw that differentiation for functions on Banach spaces is the same as for functions on finite dimensional vector spaces. All of our proofs of first and second order conditions only relied on Taylor expansions and some properties of linear transformations. Taylor expansions and linear transformations are the same on Banach spaces as on finite dimensional vector spaces, so our results for optimization will still hold. Let's just state the first order condition for equality constraints. The other results are similar, but stating them gets to be slightly cumbersome.

Theorem 3.1 (First order condition for maximization with equality constraints). Let $f: U \to \mathbb{R}$ and $h: U \to W$ be continuously differentiable on $U \subseteq V$, where $V$ and $W$ are Banach spaces. Suppose $x^* \in \mathrm{interior}(U)$ is a local maximizer of $f$ on $U$ subject to $h(x) = 0$. Suppose that $Dh_{x^*}: V \to W$ is onto. Then, there exists $\mu^* \in BL(W, \mathbb{R})$ such that for

\[ L(x, \mu) = f(x) - \mu h(x) \]

we have

\begin{align*}
D_x L(x^*, \mu^*) & = Df_{x^*} - \mu^* Dh_{x^*} = 0_{BL(V, \mathbb{R})} \\
D_\mu L(x^*, \mu^*) & = -h(x^*) = 0_W
\end{align*}

There are a few differences compared to the finite dimensional case that are worth commenting on. First, in the finite dimensional case, we had $h: U \to \mathbb{R}^m$ and the condition that $\mathrm{rank}\, Dh_{x^*} = m$. This is the same as saying that $Dh_{x^*}: \mathbb{R}^n \to \mathbb{R}^m$ is onto. Rank is not well-defined in infinite dimensions, so we now state this condition as $Dh_{x^*}$ being onto instead of being rank $m$. Secondly, previously $\mu \in \mathbb{R}^m$, and the Lagrangian was

\[ L(x, \mu) = f(x) - \mu^T h(x). \]

Viewed as a $1$ by $m$ matrix, $\mu^T$ is a linear transformation from $\mathbb{R}^m$ to $\mathbb{R}$. Thus, in the abstract case, we just say $\mu \in BL(W, \mathbb{R})$, which, as when we defined transposes, is called the dual space of $W$ and is denoted $W^*$. Finally, we have subscripted the zeroes in the first order condition with $BL(V, \mathbb{R})$ and $W$ to emphasize that the first equation is for linear transformations from $V$ to $\mathbb{R}$, and the second equation is in $W$. $Df_{x^*}$ is a linear transformation from $V$ to $\mathbb{R}$. $Dh_{x^*}$ goes from $V$ to $W$. $\mu$ goes from $W$ to $\mathbb{R}$, so $\mu$ composed with $Dh_{x^*}$, which we just denoted by $\mu Dh_{x^*}$, is a linear transformation from $V$ to $\mathbb{R}$.
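A finite dimensional sketch (my addition) of the first order condition: maximizing $f(x) = x_1 + x_2$ subject to $h(x) = x_1^2 + x_2^2 - 1 = 0$ has solution $x^* = (1/\sqrt{2}, 1/\sqrt{2})$ with $\mu^* = 1/\sqrt{2}$, and the gradients satisfy $Df_{x^*} = \mu^* Dh_{x^*}$.

```python
import numpy as np

def Df(x):
    return np.array([1.0, 1.0])             # gradient of f(x) = x1 + x2

def Dh(x):
    return np.array([2 * x[0], 2 * x[1]])   # gradient of h(x) = x1^2 + x2^2 - 1

x_star = np.array([1.0, 1.0]) / np.sqrt(2.0)
mu_star = 1.0 / np.sqrt(2.0)
print(Df(x_star) - mu_star * Dh(x_star))    # approximately [0, 0]
```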

4. Inverse functions

Suppose $f: \mathbb{R}^n \to \mathbb{R}^m$. If we know $f(x) = y$, when can we solve for $x$ in terms of $y$? In other words, when is $f$ invertible? Well, suppose we know that $f(a) = b$ for some $a \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$. Then we can expand $f$ around $a$,

\[ f(x) = f(a) + Df_a(x - a) + r_1(a, x - a) \]

where $r_1(a, x - a)$ is small. Since $r_1$ is small, we can hopefully ignore it; then $y = f(x)$ can be rewritten as a linear equation:

\[ f(a) + Df_a(x - a) = y \]

\[ Df_a x = y - f(a) + Df_a a. \]
We know that this equation has a solution if $\mathrm{rank}\, Df_a = \mathrm{rank}\, \begin{pmatrix} Df_a & y - f(a) + Df_a a \end{pmatrix}$. It has a solution for any $y$ if $\mathrm{rank}\, Df_a = m$. Moreover, this solution is unique if $\mathrm{rank}\, Df_a = n$. This discussion is not entirely rigorous because we have not been very careful about what $r_1$ being small means. The following theorem makes it more precise.

Theorem 4.1 (Inverse function). Let $f: \mathbb{R}^n \to \mathbb{R}^n$ be continuously differentiable on an open set $E$. Let $a \in E$, $f(a) = b$, and $Df_a$ be invertible. Then
(1) there exist open sets $U$ and $V$ such that $a \in U$, $b \in V$, $f$ is one-to-one on $U$ and $f(U) = V$, and
(2) the inverse of $f$ exists and is continuously differentiable on $V$ with derivative $\left( Df_{f^{-1}(x)} \right)^{-1}$.

The open sets U and V are the areas where r1 is small enough. The continuity of f and its derivative are also needed to ensure that r1 is small enough. The proof of this theorem is a bit long, but the main idea is the same as the discussion preceding the theorem.

Comment 4.1. The proof uses the fact that the space of all continuous linear transformations between two normed vector spaces is itself a vector space. I do not think we have talked about this before. Anyway, it is a useful fact that already came up in the proof that continuous Gâteaux differentiability implies Fréchet differentiability last lecture. Let $V$ and $W$ be normed vector spaces with norms $\|\cdot\|_V$ and $\|\cdot\|_W$. Let $BL(V, W)$ denote the set of all continuous (or equivalently bounded) linear transformations from $V$ to $W$. Then $BL(V, W)$ is a normed vector space with norm
\[ \|A\|_{BL} \equiv \sup_{v \in V} \frac{\|Av\|_W}{\|v\|_V}. \]
This is sometimes called the operator norm on $BL(V, W)$. Last lecture, the proof that Gâteaux differentiable implies Fréchet differentiable required that the mapping from $V$ to $BL(V, W)$ defined by $Df_x$ as a function of $x \in V$ be continuous with respect to the above norm. We will often use the inequality
\[ \|Av\|_W \leq \|A\|_{BL} \|v\|_V, \]
which follows from the definition of $\|\cdot\|_{BL}$. We will also use the fact that if $V$ is finite dimensional and $f(x, v): V \times V \to W$ is continuous in $x$ and $v$ and linear in $v$ for each $x$, then $f(x, \cdot): V \to BL(V, W)$ is continuous in $x$ with respect to $\|\cdot\|_{BL}$.

Proof. For any $y \in \mathbb{R}^n$, consider $\varphi^y(x) = x + Df_a^{-1}(y - f(x))$. By the mean value theorem, for $x_1, x_2 \in U$, where $a \in U$ and $U$ is open,
\[ \varphi^y(x_1) - \varphi^y(x_2) = D\varphi^y_{\bar{x}}(x_1 - x_2) \]

Note that

\begin{align*}
D\varphi^y_{\bar{x}} & = I - Df_a^{-1} Df_{\bar{x}} \\
& = Df_a^{-1}(Df_a - Df_{\bar{x}}).
\end{align*}

Since $Df_x$ is continuous (as a function of $x$), if we make $U$ small enough, then $Df_a - Df_{\bar{x}}$ will be near $0$. Let $\lambda = \frac{1}{2 \left\| Df_a^{-1} \right\|_{BL}}$. Choose $U$ small enough that $\left\| Df_a - Df_x \right\| < \lambda$ for all $x \in U$. From above, we know that

\begin{align*}
\left\| \varphi^y(x_1) - \varphi^y(x_2) \right\| & = \left\| Df_a^{-1}(Df_a - Df_{\bar{x}})(x_1 - x_2) \right\| \\
& \leq \left\| Df_a^{-1} \right\|_{BL} \left\| Df_a - Df_{\bar{x}} \right\|_{BL} \left\| x_1 - x_2 \right\| \\
& \leq \frac{1}{2} \left\| x_1 - x_2 \right\| \tag{7}
\end{align*}

For any y ∈ f (U) we can start with an arbitrary x1 ∈ U, then create a sequence by setting

\[ x_{i+1} = \varphi^y(x_i). \]

From (7), this sequence satisfies

\[ \left\| x_{i+1} - x_i \right\| \leq \frac{1}{2} \left\| x_i - x_{i-1} \right\|. \]

Using this, it is easy to verify that the $x_i$ form a Cauchy sequence, so it converges. The limit $x$ satisfies $\varphi^y(x) = x$, i.e. $f(x) = y$. Moreover, this $x$ is unique because if $\varphi^y(x_1) = x_1$ and $\varphi^y(x_2) = x_2$, then we have $\|x_1 - x_2\| \leq \frac{1}{2} \|x_1 - x_2\|$, which is only possible if $x_1 = x_2$.$^3$ Thus for each $y \in f(U)$, there is exactly one $x$ such that $f(x) = y$. That is, $f$ is one-to-one on $U$. This proves the first part of the theorem and that $f^{-1}$ exists.

We now show that $f^{-1}$ is continuously differentiable with the stated derivative. Let $y, y + k \in V = f(U)$. Then $\exists x, x + h \in U$ such that $y = f(x)$ and $y + k = f(x + h)$. With $\varphi^y$ as defined above, we have

\begin{align*}
\varphi^y(x + h) - \varphi^y(x) & = h + Df_a^{-1}(f(x) - f(x + h)) \\
& = h - Df_a^{-1} k
\end{align*}

By (7), $\left\| h - Df_a^{-1} k \right\| \leq \frac{1}{2} \|h\|$. It follows that $\left\| Df_a^{-1} k \right\| \geq \frac{1}{2} \|h\|$ and

\[ \|h\| \leq 2 \left\| Df_a^{-1} \right\|_{BL} \|k\| = \lambda^{-1} \|k\|. \]

$^3$Functions like $\varphi^y$ that have $d(\varphi(x), \varphi(y)) \leq c d(x, y)$ for $c < 1$ are called contraction mappings. The $x$ with $x = \varphi(x)$ is called a fixed point of the contraction mapping. The argument in the proof shows that contraction mappings have at most one fixed point. It is not hard to show that contraction mappings always have exactly one fixed point.

Importantly, as $k \to 0$, we also have $h \to 0$. Now,

\[ \frac{f^{-1}(y + k) - f^{-1}(y) - Df_x^{-1} k}{\|k\|} = \frac{-Df_x^{-1}\left( f(x + h) - f(x) - Df_x h \right)}{\|k\|} \]

\begin{align*}
\left\| \frac{f^{-1}(y + k) - f^{-1}(y) - Df_x^{-1} k}{\|k\|} \right\| & \leq \lambda^{-1} \left\| Df_x^{-1} \right\|_{BL} \frac{\left\| f(x + h) - f(x) - Df_x h \right\|}{\|h\|} \\
\lim_{k \to 0} \frac{\left\| f^{-1}(y + k) - f^{-1}(y) - Df_x^{-1} k \right\|}{\|k\|} & \leq \lim_{k \to 0} \left\| Df_x^{-1} \right\|_{BL} \lambda^{-1} \frac{\left\| f(x + h) - f(x) - Df_x h \right\|}{\|h\|} = 0
\end{align*}
Finally, since $Df_x$ is continuous, so is $\left( Df_{f^{-1}(y)} \right)^{-1}$, which is the derivative of $f^{-1}$. $\square$

The proof of the inverse function theorem might be a bit confusing. The important idea is that if the derivative of a function is nonsingular at a point, then you can invert the function around that point because inverting the system of linear equations given by the mean value expansion around that point nearly gives the inverse of the function.
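The iteration $\varphi^y$ from the proof is itself a practical algorithm for inverting a function numerically. Here is a sketch (my addition) for a simple map $\mathbb{R}^2 \to \mathbb{R}^2$; the example function and starting point are arbitrary.

```python
import numpy as np

def f(x):
    return np.array([x[0] + 0.2 * np.sin(x[1]), x[1] + 0.2 * np.cos(x[0])])

def Df(x):
    return np.array([[1.0, 0.2 * np.cos(x[1])],
                     [-0.2 * np.sin(x[0]), 1.0]])

def invert(f, y, a, steps=50):
    """Solve f(x) = y by iterating phi^y(x) = x + Df_a^{-1} (y - f(x))."""
    Dfa_inv = np.linalg.inv(Df(a))
    x = a
    for _ in range(steps):
        x = x + Dfa_inv @ (y - f(x))
    return x

a = np.array([0.0, 0.0])
y = f(np.array([0.3, -0.4]))
x = invert(f, y, a)
print(x)         # approximately [0.3, -0.4]
print(f(x) - y)  # approximately [0, 0]
```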

5. Implicit functions

The implicit function theorem is a generalization of the inverse function theorem. In economics, we usually have some variables, say $x$, that we want to solve for in terms of some parameters, say $\beta$. For example, $x$ could be a person's consumption of a bundle of goods, and $\beta$ could be the prices of each good and the parameters of the utility function. Sometimes, we might be able to separate $x$ and $\beta$ so that we can write the conditions of our model as $f(x) = b(\beta)$. Then we can use the inverse function theorem to compute $\frac{\partial x_i}{\partial \beta_j}$ and other quantities of interest. However, it is not always easy, and sometimes not possible, to separate $x$ and $\beta$ onto opposite sides of the equation. In this case our model gives us equations of the form $f(x, \beta) = c$. The implicit function theorem tells us when we can solve for $x$ in terms of $\beta$ and what $\frac{\partial x_i}{\partial \beta_j}$ will be.

The basic idea of the implicit function theorem is the same as that for the inverse function theorem. We will take a first order expansion of $f$ and look at a linear system whose coefficients are the first derivatives of $f$. Let $f: \mathbb{R}^n \to \mathbb{R}^m$. Suppose $f$ can be written as $f(x, y)$ with $x \in \mathbb{R}^k$ and $y \in \mathbb{R}^{n-k}$. $x$ are endogenous variables that we want to solve for, and $y$ are exogenous parameters. We have a model that requires $f(x, y) = c$, and we know that some particular $x_0$ and $y_0$ satisfy $f(x_0, y_0) = c$. To solve for $x$ in terms of $y$, we can expand $f$ around $x_0$ and $y_0$:
\[ f(x, y) = f(x_0, y_0) + D_x f_{(x_0, y_0)}(x - x_0) + D_y f_{(x_0, y_0)}(y - y_0) + r(x, y) = c. \]

In this equation, $D_x f_{(x_0, y_0)}$ is the $m$ by $k$ matrix of first partial derivatives of $f$ with respect to $x$ evaluated at $(x_0, y_0)$. Similarly, $D_y f_{(x_0, y_0)}$ is the $m$ by $n - k$ matrix of first partial derivatives of $f$ with respect to $y$ evaluated at $(x_0, y_0)$. Then, if $r(x, y)$ is small enough, we have

\begin{align*}
f(x_0, y_0) + D_x f_{(x_0, y_0)}(x - x_0) + D_y f_{(x_0, y_0)}(y - y_0) & \approx c \\
D_x f_{(x_0, y_0)}(x - x_0) & \approx c - f(x_0, y_0) - D_y f_{(x_0, y_0)}(y - y_0)
\end{align*}
This is just a system of linear equations with unknowns $(x - x_0)$. If $k = m$ and $D_x f_{(x_0, y_0)}$ is nonsingular, then we have
\[ x \approx x_0 + \left( D_x f_{(x_0, y_0)} \right)^{-1} \left( c - f(x_0, y_0) - D_y f_{(x_0, y_0)}(y - y_0) \right) \]
which gives $x$ approximately as a function of $y$. The implicit function theorem says that you can make this approximation exact and get $x = g(y)$. The theorem also tells you what the derivative of $g(y)$ is in terms of the derivative of $f$.

Theorem 5.1 (Implicit function). Let $f: \mathbb{R}^{n+m} \to \mathbb{R}^n$ be continuously differentiable on some open set $E$ and suppose $f(x_0, y_0) = c$ for some $(x_0, y_0) \in E$, where $x_0 \in \mathbb{R}^n$ and $y_0 \in \mathbb{R}^m$. If $D_x f_{(x_0, y_0)}$ is invertible, then there exist open sets $U \subset \mathbb{R}^{n+m}$ and $W \subset \mathbb{R}^m$ with $(x_0, y_0) \in U$ and $y_0 \in W$ such that
(1) For each $y \in W$ there is a unique $x$ such that $(x, y) \in U$ and $f(x, y) = c$.
(2) Define this $x$ as $g(y)$. Then $g$ is continuously differentiable on $W$, $g(y_0) = x_0$, $f(g(y), y) = c$ for all $y \in W$, and
\[ Dg_{y_0} = -\left( D_x f_{(x_0, y_0)} \right)^{-1} D_y f_{(x_0, y_0)}. \]

Proof. We will show the first part by applying the inverse function theorem. Define $F: \mathbb{R}^{n+m} \to \mathbb{R}^{n+m}$ by $F(x, y) = (f(x, y), y)$. To apply the inverse function theorem we must

show that $F$ is continuously differentiable and $DF_{(x_0, y_0)}$ is invertible. To show that $F$ is continuously differentiable, note that
\begin{align*}
F(x + h, y + k) - F(x, y) & = (f(x + h, y + k) - f(x, y), k) \\
& = (Df_{(\bar{x}, \bar{y})}(h, k), k)
\end{align*}
where the second line follows from the mean value theorem. It is then apparent that

\[ \lim_{(h, k) \to 0} \frac{\left\| F(x + h, y + k) - F(x, y) - \begin{pmatrix} D_x f_{(x, y)} & D_y f_{(x, y)} \\ 0 & I_m \end{pmatrix} \begin{pmatrix} h \\ k \end{pmatrix} \right\|}{\left\| (h, k) \right\|} = 0. \]
So, $DF_{(x, y)} = \begin{pmatrix} D_x f_{(x, y)} & D_y f_{(x, y)} \\ 0 & I_m \end{pmatrix}$, which is continuous since $Df_{(x, y)}$ is continuous. Also,

$DF_{(x_0, y_0)}$ can be shown to be invertible by using the partitioned inverse formula because

$D_x f_{(x_0, y_0)}$ is invertible by assumption. Therefore, by the inverse function theorem, there exist open sets $U$ and $V$ such that $(x_0, y_0) \in U$ and $(c, y_0) \in V$, and $F$ is one-to-one on $U$. Let $W$ be the set of $y \in \mathbb{R}^m$ such that $(c, y) \in V$. By definition, $y_0 \in W$. Also, $W$ is open in $\mathbb{R}^m$ because $V$ is open in $\mathbb{R}^{n+m}$.

We can now complete the proof of 1. If $y \in W$ then $(c, y) = F(x, y)$ for some $(x, y) \in U$. If there is another $(x', y)$ such that $f(x', y) = c$, then $F(x', y) = (c, y) = F(x, y)$. We just showed that $F$ is one-to-one on $U$, so $x' = x$.

We now prove 2. Define $g(y)$ for $y \in W$ such that $(g(y), y) \in U$ and $f(g(y), y) = c$, so that $F(g(y), y) = (c, y)$. By the inverse function theorem, $F$ has an inverse on $U$. Call it $G$. Then
\[ G(c, y) = (g(y), y) \]

and $G$ is continuously differentiable, so $g$ must be as well. Differentiating the above equation with respect to $y$, we have
\[ D_y G_{(c, y)} = \begin{pmatrix} Dg_y \\ I_m \end{pmatrix} \]

On the other hand, from the inverse function theorem, the derivative of $G$ at $(x_0, y_0)$ is
\begin{align*}
DG_{(x_0, y_0)} & = \left( DF_{(x_0, y_0)} \right)^{-1} \\
& = \begin{pmatrix} D_x f_{(x_0, y_0)} & D_y f_{(x_0, y_0)} \\ 0 & I_m \end{pmatrix}^{-1} \\
& = \begin{pmatrix} D_x f_{(x_0, y_0)}^{-1} & -D_x f_{(x_0, y_0)}^{-1} D_y f_{(x_0, y_0)} \\ 0 & I_m \end{pmatrix}
\end{align*}
In particular,
\[ D_y G_{(c, y_0)} = \begin{pmatrix} -D_x f_{(x_0, y_0)}^{-1} D_y f_{(x_0, y_0)} \\ I_m \end{pmatrix} = \begin{pmatrix} Dg_{y_0} \\ I_m \end{pmatrix}, \]
so $Dg_{y_0} = -D_x f_{(x_0, y_0)}^{-1} D_y f_{(x_0, y_0)}$. $\square$
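A numerical sketch (my addition) of theorem 5.1: for the hypothetical model $f(x, y) = x^3 + xy - 2 = 0$ with $(x_0, y_0) = (1, 1)$, the formula $Dg_{y_0} = -(D_x f)^{-1} D_y f$ matches the slope obtained by re-solving $f(x, y) = 0$ at a nearby $y$.

```python
def f(x, y):
    return x**3 + x * y - 2.0

def solve_x(y, x=1.0):
    """Solve f(x, y) = 0 for x by Newton's method, starting near x0 = 1."""
    for _ in range(50):
        x = x - f(x, y) / (3 * x**2 + y)  # D_x f = 3x^2 + y
    return x

x0, y0 = 1.0, 1.0
Dg_formula = -(1.0 / (3 * x0**2 + y0)) * x0  # -(D_x f)^{-1} D_y f = -1/4
dy = 1e-6
Dg_numeric = (solve_x(y0 + dy) - solve_x(y0)) / dy
print(Dg_formula, Dg_numeric)  # both approximately -0.25
```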

6. Contraction mappings

One step of the proof of the inverse function theorem was to show that

\[ \left\| \varphi^y(x_1) - \varphi^y(x_2) \right\| \leq \frac{1}{2} \left\| x_1 - x_2 \right\|. \]
This property ensures that $\varphi(x) = x$ has a unique solution. Functions like $\varphi^y$ appear quite often, so they have a name.

Definition 6.1. Let $f: \mathbb{R}^n \to \mathbb{R}^n$. $f$ is a contraction mapping on $U \subseteq \mathbb{R}^n$ if for all $x, y \in U$,

\[ \left\| f(x) - f(y) \right\| \leq c \left\| x - y \right\| \]
for some $0 \leq c < 1$.

If $f$ is a contraction mapping, then an $x$ such that $f(x) = x$ is called a fixed point of the contraction mapping. Any contraction mapping has at most one fixed point.

Lemma 6.1. Let $f: \mathbb{R}^n \to \mathbb{R}^n$ be a contraction mapping on $U \subseteq \mathbb{R}^n$. If $x_1 = f(x_1)$ and $x_2 = f(x_2)$ for some $x_1, x_2 \in U$, then $x_1 = x_2$.

Proof. Since $f$ is a contraction mapping,

\[ \left\| f(x_1) - f(x_2) \right\| \leq c \left\| x_1 - x_2 \right\|. \]

$f(x_i) = x_i$, so

\[ \left\| x_1 - x_2 \right\| \leq c \left\| x_1 - x_2 \right\|. \]

Since $0 \leq c < 1$, the previous inequality can only be true if $\|x_1 - x_2\| = 0$. Thus, $x_1 = x_2$. $\square$

Starting from any $x_0$, we can construct a sequence, $x_1 = f(x_0)$, $x_2 = f(x_1)$, etc. When $f$ is a contraction, $\|x_n - x_{n+1}\| \leq c^n \|x_1 - x_0\|$, which approaches $0$ as $n \to \infty$. Thus, $\{x_n\}$ is a Cauchy sequence and converges to a limit. Moreover, this limit will be such that $x = f(x)$, i.e. it will be a fixed point.

Lemma 6.2. Let $f: \mathbb{R}^n \to \mathbb{R}^n$ be a contraction mapping on $U \subseteq \mathbb{R}^n$, and suppose that $f(U) \subseteq U$. Then $f$ has a unique fixed point.

Proof. Pick $x_0 \in U$. As in the discussion before the lemma, construct the sequence defined by $x_n = f(x_{n-1})$. Each $x_n \in U$ because $x_n = f(x_{n-1}) \in f(U)$ and $f(U) \subseteq U$ by assumption. Since $f$ is a contraction on $U$, $\|x_{n+1} - x_n\| \leq c^n \|x_1 - x_0\|$, so $\lim_{n \to \infty} \|x_{n+1} - x_n\| = 0$, and $\{x_n\}$ is a Cauchy sequence. Let $x = \lim_{n \to \infty} x_n$. Then

\[ \left\| x - f(x) \right\| \leq \left\| x - x_n \right\| + \left\| f(x) - f(x_{n-1}) \right\| \]

\[ \leq \left\| x - x_n \right\| + c \left\| x - x_{n-1} \right\|. \]
$x_n \to x$, so for any $\epsilon > 0$ $\exists N$ such that if $n \geq N$, then $\|x - x_n\| < \frac{\epsilon}{1 + c}$. Then,

\[ \left\| x - f(x) \right\| < \epsilon \]
for any $\epsilon > 0$. Therefore, $x = f(x)$. $\square$
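A short sketch (my addition) of lemma 6.2 in action: iterating an arbitrary contraction $f(x) = \frac{1}{2} \cos(x)$, which has Lipschitz constant $\frac{1}{2}$ on all of $\mathbb{R}$, converges to its unique fixed point.

```python
import numpy as np

def f(x):
    return 0.5 * np.cos(x)  # a contraction with c = 1/2 on all of R

x = 0.0                     # any starting point works
for _ in range(60):
    x = f(x)
print(x, f(x) - x)          # the fixed point, and a residual of ~0
```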
