
Chapter 3 - Multivariate Calculus

Justin Leduc†

These lecture notes are meant to be used by students entering the University of Mannheim Master program in Economics. They constitute the basis for a pre-course in mathematics; that is, they summarize elementary concepts with which all of our econ grad students must be familiar. More advanced concepts will be introduced later on in the regular coursework, and a thorough knowledge of these basic notions will be assumed throughout.

Although the wording is my own, the definitions of concepts and the ways to approach them are strongly inspired by various sources, which are mentioned explicitly in the text or at the end of the chapter.

Justin Leduc

I have slightly restructured and amended these lecture notes, provided by my predecessor Justin Leduc, so that they suit the 2016 course schedule. Any mistakes you might find are most likely my own.

Simona Helmsmueller

∗This version: 2018.
†Center for Doctoral Studies in Economic and Social Sciences.

Contents

1 Multivariate functions
  1.1 Invertibility
  1.2 Convexity, Concavity, and Multivariate Real-Valued Functions

2 Multivariate differential calculus
  2.1 Introduction
  2.2 Revision: single-variable differential calculus
  2.3 Partial Derivatives and the Gradient
  2.4 Differentiability of real-valued functions
  2.5 Differentiability of vector-valued functions
  2.6 Higher Order Partial Derivatives and the Taylor Approximation Theorems
  2.7 Characterization of convex functions

3 Integral theory
  3.1 Univariate integral calculus
  3.2 The definite integral and the fundamental theorem of calculus
  3.3 Multivariate extension

A Appendix - Directional Derivatives

Preview

What you should take away from this chapter:

1. Multivariate functions

• Know when an inverse to a function exists and when it does not
• Be able to determine, graphically and analytically, whether a function is convex, concave, or quasiconcave.

2. Multivariate differential calculus

• Have an intuition about the meaning of derivatives as measures of small change
• Know all the calculation rules for derivatives (sum, product, quotient, chain rule)
• Be able to calculate the gradient, Jacobian, and Hessian of a function (if defined)
• Know the difference between partial and total derivatives
• Be able to write down the Taylor approximation of a function
• Know what a function’s differentials tell us about its continuity and convexity

3. Integral theory

• Know and be able to apply the calculation rules for integrals, including integration by parts
• Know and be able to apply Fubini’s theorem

1 Multivariate functions

Before digging into the details of calculus, let us start with a chapter revising or introducing some concepts and key properties of functions in general. While not directly relevant for the following sections on differential and integral calculus, the definitions and theorems will be important for the chapter on optimization and for your following master classes.

1.1 Invertibility

Before we start, let us make sure that our vocabulary is in line. In this chapter, the term vector is to be understood in its classical sense, i.e., as a column vector composed of real numbers, unless otherwise specified. The use of a row vector will thus be explicitly indicated by a transposition symbol, as in a′. I required as a prerequisite that you be clear about the concept of a function, f, its domain, X, its codomain, Y, and its image, f(X). If Y ⊆ R, i.e., if f maps into the real line, we say that f is a real-valued function. If X ⊆ R, i.e., if f takes real numbers as inputs, we call f a function of a single variable, or univariate function. Finally, if X ⊆ R^n, n > 1, i.e., if f takes as input vectors of length n, we call f a multivariate function. In this chapter, we will look at the insights from calculus and derive generalizations thereof to real-valued multivariate functions, i.e., functions with domain X ⊆ R^n, n > 1, and codomain in R. Then, we will discuss functions going from R^n to R^m, with n > 1 and m > 1. Although we could further generalize the concepts to infinite-dimensional spaces (e.g., function spaces), and this generalization happens to be very close to that for finite-dimensional spaces (e.g., R^n, n ∈ N), we restrict ourselves to finite-dimensional vector spaces. The reason is twofold: (i) you will not be expected to work with infinite-dimensional spaces in the master’s curriculum; (ii) if you grasp the generalization for finite-dimensional spaces, then you will easily grasp that for infinite-dimensional spaces as it is presented in books.

Throughout the discussion, we will try to exploit geometric intuition as much as we can. The most common way to geometrically represent a function is to plot its graph. Let f be a function with domain X, codomain Y, and image f(X). Analytically, the graph is the collection of ordered pairs (x, f(x)), with x ∈ X and f(x) ∈ Y. Hence, the graph is a subset of the Cartesian product X × Y, a space with dimension dim(X × Y) = dim(X) + dim(Y). In the next chapter, we will discuss a different geometrical representation of functions which is slightly less demanding in terms of dimension.

Finally, let us come back to the promised discussion on inverse functions. Let f be a function with domain X, codomain Y, and image f(X). In the preliminary chapter, I said that, under some conditions, one can define an inverse function f⁻¹ mapping from Y to X and such that f⁻¹(f(x)) = x. What precisely are these conditions? There are two of them, both quite intuitive. The first one is that we want the inverse function to be well defined over the codomain Y, i.e., we want it to be defined for any element y in Y. This will not be the case if the image of f, f(X), is a proper subset of Y, for then we would not know what value to associate to f⁻¹(y) for y ∈ Y \ f(X). A first natural condition, then, is to require that the image and the codomain of f coincide, i.e., f(X) = Y. If a function satisfies this condition, it is said to be onto[1]. In other words, if f is onto, then, for any element, y, that we pick in the codomain, Y, of f, there exists an x, in the domain, X, of f, such that f(x) = y. It is a condition on the codomain of f[2].

Definition 1.1. (Surjective functions) A function f : X → Y is said to be surjective or onto, if for every y ∈ Y there exists at least one x ∈ X with f(x) = y.

The second requirement is intuitive too. Namely, we know that for any function f, if we take an element x in its domain X, then, by definition, there exists a unique y in the image, f(X), such that y = f(x). The converse, however, need not be true: take an element y in f(X) and consider its set of antecedents, i.e., the set of x such that f(x) = y; there is no a priori reason that this set be a singleton[3]. But if the converse is not true, then, once we consider the inverse mapping f⁻¹, we will end up having several images associated to a single y in f(X). That, in turn, would disqualify our inverse map from the label of “function”. The second condition, then, is that the converse holds, i.e., that every two distinct elements x of the domain X map into distinct elements y of the image f(X). If this is the case, we say that the function f is one-to-one[4].

Definition 1.2. (Injective functions) A function f : X → Y is said to be injective or one-to-one, if the following property holds for all x1, x2 ∈ X: f(x1) = f(x2) ⇒ x1 = x2.

These definitions allow us to derive the following important theorem about the inverse function:

Theorem 1.1. (Existence and definition of the inverse function) Let f : X → Y be onto and one-to-one (i.e. bijective). Then there exists a unique bijective function f −1 : Y → X with f −1(f(x)) = x. Conversely, if such a function exists, then f is bijective.
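To make these two conditions concrete, here is a minimal numerical sketch (Python; the function f(x) = x³ + x and the bisection-based inversion are my own illustrative choices, not part of the original notes). Since f′(x) = 3x² + 1 > 0, f is strictly increasing, hence one-to-one, and it is onto R; Theorem 1.1 then guarantees an inverse, which we can recover numerically:

```python
import numpy as np

def f(x):
    # Strictly increasing on R (f'(x) = 3x^2 + 1 > 0), hence injective;
    # its image is all of R, hence surjective onto R -> bijective.
    return x**3 + x

def f_inverse(y, lo=-10.0, hi=10.0, tol=1e-12):
    """Recover f^{-1}(y) by bisection, exploiting that f is increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = 1.7
y = f(x)
print(np.isclose(f_inverse(y), x))  # True: f^{-1}(f(x)) = x
```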

[1] a.k.a. surjective.
[2] Note that this is not a very demanding condition, as it suffices to properly choose the codomain of our function.
[3] A singleton is a set that contains a single element.
[4] a.k.a. injective.

1.2 Convexity, Concavity, and Multivariate Real-Valued Functions

Our second effort in this section is devoted to the notions of convexity and concavity of functions. Their importance stems from optimization and will thus be emphasized in the next chapter. For now, allow me to simply proceed with the formal discussion.

Definition 1.3. (Convex Real-Valued Function)
Let X ⊆ R^n. A function f : X → R is convex if and only if X is a convex set and for any two x, y ∈ X and λ ∈ [0, 1] we have

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

Moreover, if this statement holds strictly whenever y ≠ x and λ ∈ (0, 1), we say that f is strictly convex.

Definition 1.4. (Concave Real-Valued Function)
Let X ⊆ R^n. A function f : X → R is concave if and only if −f is convex. Similarly, f is strictly concave if and only if −f is strictly convex.

Remark 1. Note that the definition of a concave real-valued function also requires that the function be defined on a convex domain! You already know the geometrical signification of convexity and concavity for univariate real-valued functions: if one picks at random an x and a y in the domain of f and draws the line segment between (x, f(x)) and (y, f(y)), it will lie weakly below or weakly above the graph of f between x and y. This segment will lie weakly above the graph if and only if f is convex; it will lie weakly below if and only if f is concave. This is illustrated in Figure 1:

Figure 1: Concavity and convexity of univariate real-valued functions
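As a quick sanity check of Definition 1.3, here is a minimal Python sketch (my own illustrative choice of function, not from the notes) that samples random pairs of points and verifies the defining inequality for f(x_1, x_2) = x_1^2 + x_2^2, the convex function used as the running example below:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # f(x1, x2) = x1^2 + x2^2, a convex function on R^2
    return np.sum(x**2)

# Check f(l*x + (1-l)*y) <= l*f(x) + (1-l)*f(y) on random samples.
ok = True
for _ in range(10_000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    ok &= lhs <= rhs + 1e-12  # small tolerance for floating-point rounding
print(ok)  # True: no counterexample found
```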

Ideally, our generalized definition should preserve this geometrical insight. Does it? Let us have a look at a simple function defined on X ⊆ R^2, say, f(x_1, x_2) = x_1^2 + x_2^2. The graph of f, i.e., the collection of ordered pairs (x, f(x)) for all x in the domain of f, lies in R^3 (see Figure 2). If one draws the line segment that connects two points chosen at random on that surface, somehow, it does appear to lie “above” the graph of the function. Yet, as a line exhausts only one out of three dimensions, there are, in fact, two dimensions left free, and the notion of “above” I just used isn’t as clear as that which I used when we were considering a graph in the real plane. Therefore, if we wish to formalize the multidimensional geometric intuition, we must first specify a plane, i.e., a subspace of dimension 2, and make sure that, on that plane, the line lies above the graph, in the classical sense of the term “above”.

Figure 2: Lying “above” in 3 or more dimensions

Remember that, independently of the dimension of the input vectors x, y ∈ R^n, f is assumed to be real-valued and, therefore, f(λx + (1 − λ)y) as well as λf(x) + (1 − λ)f(y) belong to R. Therefore, the definition only makes a statement about some specific planes of R^{n+1}. Let e denote the vector which spans the output dimension. The definition tells us the following: consider any plane with a basis containing e; then, the restriction of f to that plane is a real-valued univariate convex function. More practically, we are asked to consider any two points x and y in the domain of f. Also, the domain is required to be convex. Therefore, the segment between x and y lies in the domain, and the vector z = y − x spans a line, {x + tz | t ∈ R, x + tz ∈ X}, that passes through both x and y. Combined with e (the vector spanning the vertical axis), z generates a plane in the (n+1)-dimensional space in which the restriction of f to that line lies. If one focuses on this plane, one finds oneself back in the simple situation of considering the graph of a univariate real-valued function in the real plane, and the definition of convexity requires that this function be convex. In fact, the converse can be shown to hold as well: given a multivariate real-valued function f with convex domain X, if, for any line lying in X, the restriction of f to this line is convex, then f is convex. This is stated formally in the following definition.

Figure 3: Multivariate Concavity

Definition 1.5. (Multivariate Convexity)
Let X be a convex subset of R^n. A real-valued function f : X → R is (strictly) convex if and only if, for every x ∈ X and every z ∈ {z ∈ R^n \ {0} | ‖z‖ = 1}, the function g(t) = f(x + tz) is (strictly) convex on {t ∈ R | x + tz ∈ X}.

In some settings, concavity or convexity are just too demanding. For instance, in some cases, the only reasonable assumption one can make is that the function of interest is increasing or decreasing. Yet, being increasing or decreasing guarantees neither convexity nor concavity. Further, concavity and convexity need not be preserved by monotonic transformations, which constitutes an undesirable feature when we recall that most of economists’ work is based on ordinal preferences. Therefore, it is interesting to define classes of functions which (i) preserve the most important properties of concave and convex functions and (ii) contain all increasing and decreasing functions and have properties which are preserved under monotonic transformations. Such functions are known respectively as quasiconcave and quasiconvex functions.

As you will see in the next chapter, the convexity of the upper-level sets (for concave functions) and the convexity of the lower-level sets (for convex functions) are the specific characteristics of concave and convex functions one would wish to preserve. As multivariate convexity and concavity can be reduced to univariate ones, let me illustrate these concepts for the univariate case. If one considers a convex function and draws a horizontal line (a “level” line) through it, the set of elements x in the domain with an image below this line is called a lower-level set of the function, and it is convex. Similarly, if one considers a concave function and draws a horizontal line through it, the set of elements x in the domain with an image above this line is called an upper-level set of the function, and it is convex.

Figure 4: Convexity and Concavity via lower- and upper-level sets

Quasiconvexity and quasiconcavity are defined so as to preserve precisely these two qualities:

Definition 1.6. (Quasiconvexity, Quasiconcavity)
Let X be a convex subset of R^n. A real-valued function f : X → R is quasiconvex if and only if for any c ∈ R the set

L⁻_c := {x ∈ X | f(x) ≤ c},

also referred to as f’s lower-level set at c, is convex. If f is such that

L⁺_c := {x ∈ X | f(x) ≥ c},

also referred to as f’s upper-level set at c, is convex for any c ∈ R, then f is said to be quasiconcave.

The following is an often more workable characterization:

Theorem 1.2. (Quasiconvexity, Quasiconcavity)
Let X be a convex subset of R^n. A real-valued function f : X → R is quasiconvex if and only if

∀x, y ∈ X ∀λ ∈ [0, 1] f(λx + (1 − λ)y) ≤ max{f(x), f(y)}

If f is such that

∀x, y ∈ X ∀λ ∈ [0, 1] f(λx + (1 − λ)y) ≥ min{f(x), f(y)},

then f is said to be quasiconcave.
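As an illustration, here is a minimal Python sketch (the function choice is mine, not from the notes): f(x) = √|x| is not convex, yet it satisfies the max-characterization of quasiconvexity from Theorem 1.2 on random samples:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sqrt(np.abs(x))  # quasiconvex on R, but not convex

quasi, convex = True, True
for _ in range(10_000):
    x, y, lam = rng.normal(), rng.normal(), rng.uniform()
    mid = f(lam * x + (1 - lam) * y)
    quasi &= mid <= max(f(x), f(y)) + 1e-12        # Theorem 1.2 inequality
    convex &= mid <= lam * f(x) + (1 - lam) * f(y) + 1e-12  # Definition 1.3
print(bool(quasi))   # True: the quasiconvexity inequality always holds
print(bool(convex))  # False: the convexity inequality fails for some samples
```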

This generalization allows us to include some non-convex and non-concave functions while ruling out too messy functions, such as “camel backs” (see Figure 5). All convex functions are quasiconvex. All concave functions are quasiconcave. Linear functions are both quasiconcave and quasiconvex, but they are not the only functions with this property: functions that are both quasiconvex and quasiconcave are called quasilinear functions[5]. Monotonic univariate functions are another instance of quasilinear functions.

Figure 5: Quasiconvexity and Quasiconcavity

Remark 2. Beware! Contrary to convexity and concavity, which imply continuity on the interior of the domain, quasi-concave and quasi-convex functions need not be continuous! The floor function is an example of a discontinuous quasi-convex function! However, if a quasi-convex (quasi-concave) function is continuous, then the lower- (respectively upper-) level set is not only convex but also closed!

2 Multivariate differential calculus

2.1 Introduction

Wikipedia provides a good explanation of what calculus actually is about:

“Calculus is the mathematical study of change, in the same way that geometry is the study of shape and algebra is the study of operations and their application to solving equations. It has two major branches, differential calculus (concerning rates of change and slopes of curves), and integral calculus (concerning accumulation of quantities and the areas under and between curves); these two branches are related to each other by the fundamental theorem of calculus, [...] [which] states that differentiation and integration are inverse operations.”

[5] An unfortunate term for economists, because it should absolutely not be confused with quasi-linear preferences / quasi-linear functions!!

In this chapter, we will focus on the first branch of calculus, that is, differential calculus, while only touching upon the very basics of integration theory at the end. Let us now try to understand what is precisely meant by “the mathematical study of change”. We consider the very simple case of real-valued functions of a single variable. Namely, let f be a function with domain X in R and codomain Y in R, let x̄ ∈ X, and assume we are asked what the instantaneous rate of change, or slope, of f at x = x̄ is. The question isn’t particularly easy to answer. Well, a general piece of advice when one doesn’t know how to solve a mathematical problem is the following: first try to look for a slightly different problem which you know how to solve. With a bit of luck, answering this second problem will help you find a solution to the first one. In the present case, for instance, one could try to answer first the following question: what is the average rate of change of f from x̄ to x̄ + h, h ∈ R? This second question is more familiar to us: we just compute the ratio of the change in f to the change in x:

∆f/∆x = (f(x̄ + h) − f(x̄)) / h

Could one use the answer to the second question as an answer to the first question? Well, in many cases, if h is large, it would be hard to claim that the average rate of change is a satisfactory answer to the first question. If h is set equal to zero, the situation is even worse: we end up with an undetermined expression, because we cannot divide by zero. Second piece of advice when facing a tough mathematical problem: in low-dimensional spaces, geometry can help us! Let us try to picture graphically what happens as we shrink h. For instance,

in Figure 6 below, we started with a large h_1 and then considered a smaller h_2:

Intuitively, the smaller h becomes, the more one can feel satisfied with using the average rate of change as an approximate answer to the question of the instantaneous rate of change. And yet, h should never reach zero. This is reminiscent of our limit concept! When asked about the slope of a function f at a given point x̄, one should try to see whether the following limit is well defined:

lim_{h→0} (f(x̄ + h) − f(x̄)) / h

If the limit exists, we denote it f′(x̄) and call it the derivative of f at x̄. f is then said to be differentiable at x̄. The derivative of f at x̄, provided it exists, is the best answer one has to the question about the instantaneous rate of change of f at x̄. A function differentiable at every point of its domain is simply called differentiable.
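A minimal numerical sketch of how the difference quotient settles down as h shrinks (the function f = sin and the point x̄ = 1 are my own illustrative choices):

```python
import numpy as np

f = np.sin      # illustrative choice; its derivative is cos
x_bar = 1.0

for h in [1.0, 0.1, 0.01, 0.001, 1e-6]:
    avg_rate = (f(x_bar + h) - f(x_bar)) / h   # average rate of change
    print(f"h = {h:>8}: {avg_rate:.8f}")

print(f"true slope : {np.cos(x_bar):.8f}")     # the limit the quotients approach
```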

Why would we care about the possibility to formally define the instantaneous rate of change of a function f? The answer to this question is linked to the idea that motivated the introduction of vector spaces and is twofold: (i) most functions we investigate have a domain (and also sometimes a codomain) that lies in a high-dimensional vector space, i.e., a vector space that cannot be pictured geometrically; (ii) the instantaneous rate of change, if defined, gives us detailed information about the local behavior of the function, and thereby allows us, in such spaces, to analytically see what we cannot geometrically see[6]. As a result, while a rigorous, analytical definition of the instantaneous rate of change appears redundant when working with low-dimensional functions – i.e., functions with domain and image within R^n, n ≤ 3 – it becomes our only tool in high-dimensional spaces.

[6] Providing a definition of “analytical” isn’t straightforward. The message I wish to convey here is the following: symbols and equations are our only eyes in high-dimensional spaces!

11 Figure 6: Average Rate of Change – an Answer to Instantaneous Rate of Change?


2.2 Revision: single-variable differential calculus

Let us start by revising what you have no doubt all learned in high school. From the very practical side, there are a number of calculation rules which make life so much easier.

Theorem 2.1. Let f, g : R → R be differentiable at x_0 with derivatives f′(x_0), g′(x_0), and λ, µ ∈ R. Then

1. λf + µg is differentiable at x_0 with (λf + µg)′(x_0) = λf′(x_0) + µg′(x_0),

2. fg is differentiable at x_0 with (fg)′(x_0) = (f′g + fg′)(x_0),

3. if g(x) ≠ 0 in a neighborhood of x_0, then f/g is differentiable at x_0 with

(f/g)′(x_0) = ((f′g − fg′)/g²)(x_0),

4. if all the following expressions are well-defined, then g ◦ f is differentiable at x_0 with

(g ◦ f)′(x_0) = g′(f(x_0)) f′(x_0).
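A small sketch (the functions sin and exp are chosen purely for illustration) that spot-checks the product and chain rules of Theorem 2.1 against finite differences:

```python
import numpy as np

def num_deriv(fn, x, h=1e-6):
    # symmetric difference quotient: O(h^2) approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

f, df = np.sin, np.cos
g, dg = np.exp, np.exp
x0 = 0.7

# product rule: (fg)' = f'g + fg'
print(np.isclose(num_deriv(lambda x: f(x) * g(x), x0),
                 df(x0) * g(x0) + f(x0) * dg(x0)))   # True

# chain rule: (g o f)' = g'(f(x)) f'(x)
print(np.isclose(num_deriv(lambda x: g(f(x)), x0),
                 dg(f(x0)) * df(x0)))                # True
```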

From the analytical side, differentiable functions have a set of very useful properties. These are exactly what I meant when writing “the derivative gives us detailed information about the local behavior of a function”. Let f be a function with domain and codomain in R. The existence and value of the derivative of f give us three important pieces of information about f:

• Let f be differentiable at x̄; then f is continuous at x̄.

Proof:

f′(x̄) := lim_{x→x̄} (f(x) − f(x̄)) / (x − x̄)

Thus, if f′(x̄) is well-defined, then there exists ε > 0 such that, for all x ≠ x̄ in B_ε(x̄), the quotient (f(x) − f(x̄))/(x − x̄) is well defined.

Let x ∈ B_ε(x̄); then f(x) − f(x̄) = [(f(x) − f(x̄))/(x − x̄)] · (x − x̄).

Let us take the limit of this expression:

lim_{x→x̄} [f(x) − f(x̄)] = lim_{x→x̄} [(f(x) − f(x̄))/(x − x̄)] · lim_{x→x̄} [x − x̄] = f′(x̄) × 0 = 0,

which is just what is required for continuity, as defined in the last chapter.

• Let f be differentiable at x̄; then there exists a good “linear approximation” of f in a neighborhood[7] of x̄.

[7] A “neighborhood” of x̄ is simply defined as a subset that contains an ε-open ball centered at x̄ for some ε > 0.

We like linear functions because they are simple and we know how they work. Unfortunately, it is not likely that the functions involved in our applications be linear. A good solution, then, is often to choose a function that has the properties required for the application and that is differentiable on the domain of interest. Such functions are “locally linear”, in the sense that we can locally assimilate their behavior to that of linear functions. More formally, the tangent of a function f : X ⊆ R → R at x̄ is the following affine function:

x ↦ f(x̄) + f′(x̄)(x − x̄)

Note that it is defined everywhere on R, as x̄, f(x̄), and f′(x̄) are real numbers. To see why the tangent is indeed a good approximation of f in a neighborhood of x̄, let x ≠ x̄ ∈ B_ε(x̄) and let ε(x) be our approximation error when using the tangent to approximate f at x. Then

ε(x)/(x − x̄) := [f(x) − f(x̄) − f′(x̄)(x − x̄)] / (x − x̄) = (f(x) − f(x̄))/(x − x̄) − f′(x̄)

Therefore, by definition of the derivative,

lim_{x→x̄} ε(x)/(x − x̄) = 0

In words, when x gets close to x̄, the approximation error becomes insignificant, in the sense that it is much smaller even than the discrepancy between x and x̄. Geometrically, the tangent may be seen as the limit of secants[8] of f:

Figure 7: The Tangent as the Limit of Secants

• Let a and b be real numbers such that a < b. If f is continuous and differentiable on (a, b), then:

(i) f′(x) = 0 for all x ∈ (a, b) iff f is constant on (a, b).
(ii) if f′(x) < 0 for all x ∈ (a, b), then f is strictly decreasing on (a, b); conversely, if f is decreasing on (a, b), then f′(x) ≤ 0 for all x ∈ (a, b).
(iii) if f′(x) > 0 for all x ∈ (a, b), then f is strictly increasing on (a, b); conversely, if f is increasing on (a, b), then f′(x) ≥ 0 for all x ∈ (a, b).

Think again of our tangent equation. Replacing f′(x) by any null, negative, or positive number at all points of (a, b) should convince you of the above fact. Picture it geometrically!

Exercise 1. Convince yourself of the above geometric properties by looking at the following four functions:

1. f_1(x) = x^2

2. f_2(x) = |x|

3. f_3(x) = max{k ∈ Z : k ≤ x} (the floor function)

4. f_4(x) = 1 if x ∈ Q and f_4(x) = 0 if x ∈ R \ Q

[8] A secant is simply a line going through two specified points of a curve.

Each of these geometrical insights is of great importance when optimizing functions with domain or codomain in a high-dimensional space. Only because of them can we claim that analytic formulas serve as substitute eyes in such spaces. But before they do, we must make sure we generalize them in a proper way. Therefore, the purpose of this chapter is to generalize the concept of a derivative in a manner that preserves precisely those pieces of information that the derivative of f at a point delivers to us.

2.3 Partial Derivatives and the Gradient

Let f : X ⊆ R^n → R be a function of n independent variables. The simplest way to proceed with our generalization is to transform that function into functions of one variable and to use our usual single-variable notion of derivation. We do this by “consciously forgetting” that actually n variables could be changing at the same time, and instead focus on what happens when only one of the n variables is changing at a time. This process is formally stated in the following definition.

Definition 2.1. (Partial Derivative)
Let X ⊆ R^n and suppose f : X → R. If x̄ is an interior point of X, then the partial derivative of f with respect to x_i at x̄ is defined as

∂f(x̄)/∂x_i := lim_{h→0} [f(x̄_1, ..., x̄_{i−1}, x̄_i + h, x̄_{i+1}, ..., x̄_n) − f(x̄_1, ..., x̄_i, ..., x̄_n)] / h

with h ∈ R, whenever the limit exists. Another common notation for the partial derivative of f with respect to x_i at x̄ is f_i(x̄).

Hence, the idea behind the partial derivative with respect to x_i at x̄ is to consider all (n − 1) x_j’s, j ≠ i, as fixed and to act as if f(·) were only a function of x_i. There are thus – at most! – n partial derivatives, one for each of the variables. Note that a partial derivative, if defined, is defined at a given point: x̄. Moreover, not only does the value of x̄_i matter, but also the values at which one fixes the x_j, j ≠ i. This is illustrated by the following example: let f(x_1, x_2) = x_1 x_2; then

∂f(x̄)/∂x_1 = x̄_2,

which obviously depends on where we fix x_2.
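A minimal finite-difference sketch of this example (the evaluation point x̄ = (3, 5) is my own choice; the difference quotient stands in for the exact limit):

```python
import numpy as np

def partial(f, x_bar, i, h=1e-6):
    """Finite-difference approximation of the i-th partial derivative at x_bar."""
    e = np.zeros_like(x_bar)
    e[i] = h                       # perturb only coordinate i, hold the others fixed
    return (f(x_bar + e) - f(x_bar - e)) / (2 * h)

f = lambda x: x[0] * x[1]          # the example f(x1, x2) = x1 * x2
x_bar = np.array([3.0, 5.0])

print(partial(f, x_bar, 0))        # ~5.0 = x2_bar, as computed above
print(partial(f, x_bar, 1))        # ~3.0 = x1_bar
```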

Figure 8: Partial derivative of f(x_1, x_2) with respect to x_1 at x̄

Remark 1. In the above figure, the partial derivative depends on the second variable and not on the first one (that is why I could draw the tangent line without mentioning which x̄_1 I was picking). It may as well depend on all variables, or on none of them, as in the following example: compute the partial derivative of f(x_1, x_2) = x_1 + x_2 with respect to x_2 at x̄:

∂f(x̄)/∂x_2 = 1

If the partial derivative of f with respect to x_i is defined at x̄, we say that f is partially differentiable at x̄ with respect to x_i. If the partial derivative of f with respect to x_i is defined at every point of f’s domain, we say that f is partially differentiable with respect to x_i. In such a case, as is conventional with univariate derivatives, the partial derivative can be seen as a function from X to R.

The following is an important definition of a row vector whose entries are the partial derivatives of f at x̄:

Definition 2.2. (Gradient)
Let f : R^n → R be a function which is partially differentiable with respect to all x_i, i = 1, ..., n. Then the row vector

∇f(x̄) := ( f_1(x̄)  f_2(x̄)  ···  f_n(x̄) )

is called the gradient of f at x̄.

Under the following link you can find a video which provides a graphic illustration of partial derivatives. You might find this helpful for developing an intuition for the concepts,

but note that a graphic illustration is again only possible for low-dimensional functions, whereas it was our aim to generalize the concept of differentials to higher-dimensional spaces.
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivatives/v/partial-derivatives-and-graphs

Consider now the more general case of a function f with domain X in R^n and codomain Y in R^m, with n and m greater than or equal to 2. The easiest way to proceed is to realize that this high-dimensional function can be seen as a vector of real-valued multivariate functions. Namely,

f = (f^1, f^2, ..., f^m)′

where each f^i is a real-valued multivariate function mapping from R^n to R, and hence we can define partial derivatives for each of these component functions. We can then extend our previous definition of the gradient in a very natural way to define the Jacobian matrix:

Definition 2.3. (Jacobian matrix)
Let f : R^n → R^m with partially differentiable component functions f^1, ..., f^m : R^n → R, and let x̄ ∈ R^n. Then the Jacobian matrix of f at x̄ is defined as

J_f(x̄) = [ ∂f^1/∂x_1 (x̄)   ∂f^1/∂x_2 (x̄)   ···   ∂f^1/∂x_n (x̄) ]
          [ ∂f^2/∂x_1 (x̄)   ∂f^2/∂x_2 (x̄)   ···   ∂f^2/∂x_n (x̄) ]
          [       ⋮               ⋮           ⋱          ⋮        ]
          [ ∂f^m/∂x_1 (x̄)   ∂f^m/∂x_2 (x̄)   ···   ∂f^m/∂x_n (x̄) ]
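A short sketch computing a Jacobian symbolically with sympy (the example map from R^2 to R^3 is my own illustrative choice):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
# An illustrative map f : R^2 -> R^3 with component functions f^1, f^2, f^3
f = sp.Matrix([x1 * x2,
               x1**2 + x2,
               sp.sin(x1)])

J = f.jacobian([x1, x2])       # 3 x 2 matrix of partial derivatives
print(J)                       # Matrix([[x2, x1], [2*x1, 1], [cos(x1), 0]])
print(J.subs({x1: 3, x2: 5}))  # the Jacobian evaluated at x_bar = (3, 5)
```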

The next two videos introduce you to functions mapping into R^m. Again, note that visualization is not possible for higher-dimensional spaces, but the above definition will still hold.
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivatives-of-vector-valued-functions/v/computing-the-partial-derivative-of-a-vector-valued-function
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivatives-of-vector-valued-functions/v/partial-derivative-of-a-parametric-surface-part-1

At this stage, the next subsection would usually cover directional derivatives, which form the next type of generalization. I skip these here in the interest of time and because I believe you will be able to grasp the concept in self-study if you properly understood the concept of partial derivatives. The interested reader is referred to the appendix for a short introduction.

2.4 Differentiability of real-valued functions

For simplicity, we again return to the case of f mapping from R^n to R, and for the moment let n = 2. We have seen above that if we hold y fixed at ȳ and vary x by a small amount ∆x, then the change in outcome is approximated by the partial derivative:

f(x̄ + ∆x, ȳ) − f(x̄, ȳ) ≈ ∂f/∂x (x̄, ȳ) ∆x.

Similarly, if we fix x and vary y by a small amount, we get

f(x̄, ȳ + ∆y) − f(x̄, ȳ) ≈ ∂f/∂y (x̄, ȳ) ∆y.

This is because we approximated the function by the tangent line to the graph at (x̄, ȳ) in the direction of x and y, respectively. If we now vary both variables at the same time and seek a linear approximation of the resulting change in functional value, it suggests itself to use the sum of the above changes. That is,

f(x̄ + ∆x, ȳ + ∆y) − f(x̄, ȳ) ≈ ∂f/∂x (x̄, ȳ) ∆x + ∂f/∂y (x̄, ȳ) ∆y.

We can rewrite this as

f(x̄ + ∆x, ȳ + ∆y) ≈ f(x̄, ȳ) + ∂f/∂x (x̄, ȳ) ∆x + ∂f/∂y (x̄, ȳ) ∆y.

The right-hand side is the parameterization of the so-called tangent plane.

Remark 1. If f(x̄, ȳ) = 0 and the partial derivatives are independent of each other, then this constitutes a two-dimensional vector space. If f(x̄, ȳ) ≠ 0, this vector space is shifted to go through this point, and we call it an affine space. This characterization of the linear approximation to a function also holds true if the function depends on more than two variables, in which case we can no longer visualize it as a plane.

Remark 2. By replacing the ∆ with d in the above equation, we obtain the following expression for the variation df on the tangent plane:

df = ∂f/∂x (x̄, ȳ) dx + ∂f/∂y (x̄, ȳ) dy

This is called the total differential of f at (x̄, ȳ), and it gives an approximation for the change ∆f on the graph of f. Put differently, the total differential is a function which depends on the point on the graph from which we start, and on the variation in all directions which we consider.

Remark 3. Again looking at the above equation from a slightly different angle, we note that

the change in outcome ∆f can be approximated by the linear mapping

(dx, dy) ↦ ∂f/∂x (x̄, ȳ) dx + ∂f/∂y (x̄, ȳ) dy.

Linear functions can be identified with matrices, and in this case, we can write:

( ∂f/∂x (x̄, ȳ)   ∂f/∂y (x̄, ȳ) ) (dx, dy)′

Note that the matrix representing this linear function is actually the gradient of f at (x̄, ȳ)!

Let me again refer to the geometric insights that I want to guide our generalization of univariate derivatives to the multivariate case. These were a) continuity, b) linear approximation, and c) the sign of the slope (i.e., whether a function increases, decreases, or is constant around a point). While we have made sure that we keep the linear approximation property, the generalization obtained so far is insufficient. Indeed, given a point x̄, the existence of all partial derivatives is NOT sufficient to guarantee continuity of the function at x̄. To see this, look at the following function: f : R^2 → R with f(x, y) = 0 if xy = 0 and f(x, y) = 1 otherwise. It is partially differentiable with respect to x and y at the origin, with ∂f/∂x (0, 0) = ∂f/∂y (0, 0) = 0, but it is not continuous at (0, 0).
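A tiny numerical illustration of this counterexample (a sketch only, with finite differences standing in for the exact limits):

```python
f = lambda x, y: 0.0 if x * y == 0 else 1.0

# Partial derivatives at the origin exist: along the axes f is identically 0.
h = 1e-6
print((f(h, 0.0) - f(0.0, 0.0)) / h)   # 0.0 -> df/dx(0,0) = 0
print((f(0.0, h) - f(0.0, 0.0)) / h)   # 0.0 -> df/dy(0,0) = 0

# Yet f is not continuous at (0,0): off the axes, f jumps to 1.
print(f(h, h))                          # 1.0, no matter how small h is
```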

We now investigate whether we can define a different expression which preserves continuity and is intuitively similar to the univariate case. For this, let Df(x̄) denote the generalized version of the derivative at x̄, i.e., f’s instantaneous rate of change at x̄. Continuity, as defined in the previous chapter, is about preservation of neighborhoods. As we are looking for continuity, we shall not specify any direction along which x moves. Rather, we specify the distance that separates the new x from our starting point x̄ and allow for movement in any direction within that distance. In the univariate case, f is differentiable at x̄ if and only if there exists a real number f′(x̄) such that:

lim_{h→0} [ (f(x̄ + h) − f(x̄))/h − f′(x̄) ] = 0

And because the norm is a continuous function, this implies:

lim_{‖h‖→0} ‖f(x̄ + h) − f(x̄) − f′(x̄)h‖ / ‖h‖ = 0

Using this expression to generalize f′(x̄) ensures that continuity will be preserved.

Definition 2.4. (Multivariate Derivative)
Let X ⊆ R^n and suppose f : X → R. If x̄ is an interior point of X, then f is differentiable at x̄ if and only if there exists a row vector Df(x̄) such that

lim_{‖h‖→0} ‖f(x̄ + h) − f(x̄) − Df(x̄) · h‖ / ‖h‖ = 0

where h is a vector in R^n. If such a vector Df(x̄) exists, we interpret it as the derivative of f at x̄.

Df(x̄) is to be interpreted as the generalization of our univariate derivative f′(x̄) at x̄. Yet, defined as it is, it is a bit difficult to see exactly what it is! The following result, by relating it to the geometrically intuitive partial derivatives, gives useful insights.

Theorem 2.2. (Total Differential, Partial Derivative, and Gradient)
Let X ⊆ R^n and suppose f : X → R. If x̄ is an interior point of X, and if f is differentiable at x̄, then:

(i) all partial derivatives[9] of f exist at x̄, and

(ii) ∀ z ∈ R^n with ‖z‖ = 1: df(x̄, z) := Df(x̄) · z = ∇f(x̄) · z

In words, Df(x̄), the true generalization of the univariate derivative, is nothing else than the vector of partial derivatives of f at x̄, i.e., the gradient of f at x̄, ∇f(x̄)! That is, the instantaneous rate of change of f in any direction coincides with the gradient of f applied to that direction, which is what we were aiming at when linearly approximating the function by a plane.

IN SHORT, IF YOU UNDERSTOOD WHAT THE GRADIENT IS, THEN YOU almost RULE THE WORLD!
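To see Theorem 2.2 in action, here is a small sketch (the function and the direction are my own illustrative choices) comparing the limit definition along an arbitrary unit direction with the gradient formula ∇f(x̄) · z:

```python
import numpy as np

f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]             # illustrative smooth function
grad = lambda x: np.array([2*x[0] + 3*x[1], 3*x[0]])  # its gradient, computed by hand

x_bar = np.array([1.0, 2.0])
z = np.array([3.0, 4.0]); z /= np.linalg.norm(z)      # unit vector: a direction

t = 1e-6
rate = (f(x_bar + t * z) - f(x_bar)) / t  # difference quotient along z
print(rate)                # ~7.2
print(grad(x_bar) @ z)     # same number: the rate of change along z is grad . z
```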

Proof. As f is differentiable at x̄, we know that there exists Df(x̄) in R^n such that:

lim_{‖h‖→0} ‖f(x̄ + h) − f(x̄) − Df(x̄) · h‖ / ‖h‖ = 0

where h is a vector in R^n. Alternatively, let h = tz, where z is any unit vector and t a scalar going to zero, i.e.,

lim_{t→0} ‖f(x̄ + tz) − f(x̄) − Df(x̄) · tz‖ / ‖tz‖ = 0

Using that ‖z‖ = 1 and rearranging a bit, one has:

lim_{t→0} ‖f(x̄ + tz) − f(x̄)‖ / |t| = ‖Df(x̄) · z‖

On each side, the signs of the terms inside the norms always coincide[10], and one has:

lim_{t→0} (f(x̄ + tz) − f(x̄)) / t = Df(x̄) · z

that is,

D_z f(x̄) = Df(x̄) · z

Hence, all directional derivatives are well defined, and so, in particular, are the partial derivatives. To prove the second part, observe that:

lim_{‖h‖→0} ‖f(x̄ + h) − f(x̄) − Df(x̄) · h‖ / ‖h‖ = 0

Df(x̄) is a vector here, and you may denote its entries by (a_i)_{i=1,...,n}. Let h = (h_1, ..., h_n); the limit then reads:

lim_{‖h‖→0} ‖f(x̄ + h) − f(x̄) − Σ_{i=1}^{n} a_i h_i‖ / ‖h‖ = 0

Now, let h be zero in all coordinates but one, for instance, coordinate j, so that h = h_j e_j with e_j the j-th standard basis vector. We get:

lim_{|h_j|→0} ‖f(x̄ + h_j e_j) − f(x̄) − a_j h_j‖ / |h_j| = 0

and an argument similar to that above allows us to take the a_j out:

lim_{h_j→0} (f(x̄ + h_j e_j) − f(x̄)) / h_j = a_j

or, otherwise stated:

f_j(x̄) = a_j

[9] Reminder: in fact, all directional derivatives; see the appendix.

As the example of a discontinuous yet partially differentiable function shows, the converse of this theorem need not hold. However, one can show the following:

Theorem 2.3. (Partial Differentiability and Differentiability)
Let X ⊆ R^n, suppose f : X → R, and let x̄ be an interior point of X. If all the partial derivatives of f at x̄ exist and are continuous, then f is differentiable.

Proof. See e.g. De la Fuente (2000), Chapter 4, Theorem 3.4.

[10] Realize that if f increases in the direction of z, then both the left-hand term and the right-hand term inside the norms are positive. If f decreases in the direction of z, then both are negative.

2.5 Differentiability of vector-valued functions

We have already extended the notion of partial derivatives to vector-valued functions and defined the Jacobian matrix as the equivalent of the gradient. We can also extend our previous definition of the multivariate derivative in a very natural way:

Definition 2.5. (Multivariate Derivative of Vector-Valued Functions)
Let X ⊆ R^n and suppose f : X → R^m. If x̄ is an interior point of X, then f is differentiable at x̄ if and only if there exists a matrix Df(x̄) such that

lim_{‖h‖→0} ‖f(x̄ + h) − f(x̄) − Df(x̄) · h‖ / ‖h‖ = 0

where h is a vector in R^n. If such a matrix Df(x̄) exists, we interpret it as the derivative of f at x̄.

Further, the following result holds:

Theorem 2.4. (Multivariate Derivative and Jacobian Matrix)
Let X ⊆ R^n, suppose f : X → R^m, and let x̄ be an interior point of X. Then, f is differentiable at x̄ if and only if each of its component functions is differentiable at x̄. Moreover, if f is differentiable at x̄, then:

(i) all partial derivatives of the component functions exist at x¯, and

(ii) the derivative of f at x¯ is the matrix of partial derivatives of the component functions at x¯:

J_f(x̄) = Df(x̄) = [ ∇f^1(x̄) ]   [ ∂f^1/∂x_1 (x̄)  ···  ∂f^1/∂x_n (x̄) ]
                  [    ⋮     ] = [       ⋮         ⋱         ⋮        ] ∈ R^{m×n}
                  [ ∇f^m(x̄) ]   [ ∂f^m/∂x_1 (x̄)  ···  ∂f^m/∂x_n (x̄) ]

Proof. See e.g. De la Fuente (2000), Chapter 4, Theorem 3.3.

The partial converse to this theorem, as well as the result on continuity, also extends to the high-dimensional case. The proofs indicated for them take the multidimensionality of the codomain into account. Let us conclude with a generalized chain rule in high-dimensional spaces. Once one has introduced the concept of a multivariate derivative, the multivariate chain rule is a straightforward extension of the univariate one.

Theorem 2.5. (Multivariate Chain Rule)
Let X ⊆ R^n and suppose g : X → Y, where Y ⊆ R^m. Further, suppose f : Y → Z, where Z ⊆ R^p. If x̄ is an interior point of X, g(x̄) an interior point of Y, and g and f are differentiable at x̄ and g(x̄), respectively, then f ◦ g is differentiable at x̄ and:

D[f ◦ g](x̄) = Df(g(x̄)) Dg(x̄)

Proof. See e.g. De la Fuente (2000), Chapter 4, Theorem 3.5.
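A quick sympy sketch verifying the multivariate chain rule D[f ◦ g](x̄) = Df(g(x̄)) · Dg(x̄); the two maps are chosen purely for illustration:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
u1, u2 = sp.symbols('u1 u2')

g = sp.Matrix([x1 * x2, x1 + x2])          # g : R^2 -> R^2
f = sp.Matrix([sp.exp(u1), u1 * u2**2])    # f : R^2 -> R^2

Dg = g.jacobian([x1, x2])
Df = f.jacobian([u1, u2])

# Left side: differentiate the composition f(g(x)) directly.
comp = f.subs({u1: g[0], u2: g[1]})
lhs = comp.jacobian([x1, x2])

# Right side: product of Jacobians, with Df evaluated at g(x).
rhs = Df.subs({u1: g[0], u2: g[1]}) * Dg

print(sp.simplify(lhs - rhs))  # zero matrix: the chain rule holds
```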

2.6 Higher Order Partial Derivatives and the Taylor Approximation Theorems

Let f : X ⊆ R^n → R be a function of n variables. Also, assume X is open. Earlier on, we suggested that, if the n partial derivatives of f are defined at each point of the open domain, then the partial derivatives could themselves be perceived as functions from X to R. Hence, one may attempt to compute their partial derivatives! If such derivatives are defined, we call them second-order partial derivatives. Repeating the reasoning, one can attempt to define third-, fourth-, fifth-order derivatives, and so on. Notationally, f_i(x) denotes first-order partial derivatives, f_{i,j}(x) := ∂f_i(x)/∂x_j denotes second-order partial derivatives, and so on.

Definition 2.6. (Hessian Matrix)
Let X ⊆ R^n be open, suppose f : X → R, and let x̄ be an element of X. If all second-order partial derivatives of f are defined at x̄, then, in the same way as the first-order partial derivatives could be gathered in a vector – the gradient –, all the second-order partial derivatives can be gathered in a matrix. Such a matrix, denoted H_f(x̄), is called the Hessian of f at x̄; it is square and should be thought of as a generalized second-order derivative for multivariate real-valued functions.

H_f(x̄) = [ ∇f_1(x̄) ]   [ f_{1,1}(x̄)  f_{1,2}(x̄)  ···  f_{1,n}(x̄) ]
          [ ∇f_2(x̄) ] = [ f_{2,1}(x̄)  f_{2,2}(x̄)  ···  f_{2,n}(x̄) ]
          [    ⋮     ]   [     ⋮            ⋮       ⋱        ⋮      ]
          [ ∇f_n(x̄) ]   [ f_{n,1}(x̄)  f_{n,2}(x̄)  ···  f_{n,n}(x̄) ]

Remark 1. The intermediate equality makes clear why the Hessian is an n × n matrix: it is the derivative of the gradient of f at x̄, which is a function from X ⊆ R^n to R^n. Hence, the Hessian is a Jacobian, but the converse is not true! Do not confuse the two concepts!

Before we go on, let me introduce the concept of C^k-differentiability. C^k-differentiable functions constitute the main class of functions economists work with, as they have many nice properties. For instance, as the next theorem states, C^k-differentiability constitutes a sufficient condition for consequence-free permutations of the order of derivation.

Definition 2.7. (Function of class C^k)
Let X ⊆ R^n be an open set, Y ⊆ R, and suppose f : X → Y. f is said to be of class C^k on X, denoted f ∈ C^k(X, Y)[11], if all partial derivatives of order less than or equal to k exist and are continuous on X.

Theorem 2.6. (Schwarz’s Theorem / Young’s Theorem)
If f ∈ C^k(X), then the order in which the derivatives up to order k are taken can be permuted.

Proof. See e.g., De la Fuente (2000), Chapter 4, Theorem 2.6.

Example 2.1. (Taken from Simon & Blume (1994), p. 330) Consider the general Cobb-Douglas production function Q = k x^a y^b. Then,

∂Q/∂x = a k x^{a−1} y^b
∂Q/∂y = b k x^a y^{b−1}
∂²Q/∂x∂y = a b k x^{a−1} y^{b−1}
∂²Q/∂y∂x = a b k x^{a−1} y^{b−1}
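A one-minute sympy check of this example, and of Schwarz’s theorem for it:

```python
import sympy as sp

x, y, k, a, b = sp.symbols('x y k a b', positive=True)
Q = k * x**a * y**b            # the Cobb-Douglas production function

Qxy = sp.diff(Q, x, y)         # differentiate w.r.t. x, then y
Qyx = sp.diff(Q, y, x)         # differentiate w.r.t. y, then x
print(sp.simplify(Qxy - Qyx))  # 0: the cross-partials coincide
```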

Remark 2. For instance, let X ⊆ R^n be open and suppose f : X → R. If f ∈ C²(X), then

f_{i,j}(x) = ∂f_i(x)/∂x_j = ∂f_j(x)/∂x_i = f_{j,i}(x)

Therefore, if f is C²-differentiable, then its Hessian matrix is symmetric!

We now have the equipment to discuss Taylor approximations! Many of the results we will see in the next chapter rely on Taylor approximations of the second order. Further, they are heavily used in macroeconomic analysis. Despite their rather heavy notation, they formalize a rather simple idea. Namely, in the same way that first-order derivatives provide valuable information, so do higher-order derivatives. Therefore, more precise, polynomial (rather than simply linear) approximations can be built in small neighborhoods of a point of interest. When the function under study is a function of a single variable, it is possible to exploit Taylor expansions of high orders.

Theorem 2.7. (nth Order Univariate Taylor Approximation)
Let X ⊆ R be an open set and consider f ∈ C^{n+1}(X). Then f is best approximated to nth order around x̄ by the nth-order Taylor expansion:

[11] Y could be equal to R. As real-valued functions are heavily used in applications, the short notation C^k(X) is taken as a substitute for C^k(X, R).

f(x̄ + h) ≈ f(x̄) + Σ_{k=1}^{n} f^{(k)}(x̄) h^k / k!

where h ∈ R is such that x̄ + h ∈ X and f^{(k)}(x̄) denotes f’s derivative of order k at x̄. The error of approximation, also known as the remainder of the Taylor approximation, is given by the following formula:

R_n(h | x̄) := f(x̄ + h) − f(x̄) − Σ_{k=1}^{n} f^{(k)}(x̄) h^k / k! = f^{(n+1)}(x̄ + λh) h^{n+1} / (n + 1)!

for some λ ∈ (0, 1).

Proof. The proof is an application of the mean value theorem, which we have not covered in this course. See e.g., De la Fuente (2000), Chapter 4, Theorem 1.9.

Remark 3. Note that the remainder approaches zero at a faster rate than h^n itself. This is what guarantees the quality of the approximation!

When the function under study is a multivariate function, it is computationally very demanding[12] and conceptually non-straightforward to exploit derivatives of order higher than 2. Therefore, we usually stick to the first and second orders.

Theorem 2.8. (Second Order Multivariate Taylor Approximation)
Let X ⊆ R^n be an open set and consider f ∈ C³(X). Then f is best approximated to second order around x̄ by the second-order Taylor expansion:

f(x̄ + h) ≈ f(x̄) + ∇f(x̄) · h + (1/2) h′ · H_f(x̄) · h

where h ∈ R^n is such that x̄ + h ∈ X. As ‖h‖ approaches zero, the remainder approaches zero at a faster rate than ‖h‖² itself.

Remark 4. Looking at the above theorem, shrinking away might seem the natural reaction. Even graduate students seem to need to sigh when they are asked to apply Taylor. But remember: Taylor is your friend, not your foe. While the formula above still looks cumbersome, its application is straightforward and, with some practice, easily mastered. And the result will make your life so much easier, as it turns some complicated function, about which you might have no idea what it looks like, into a polynomial of degree 2.
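A numerical sketch of the second-order expansion from Theorem 2.8 (the function, the expansion point, and the hand-computed gradient and Hessian are my own illustrative choices):

```python
import numpy as np

f = lambda x: np.exp(x[0]) * np.log(1.0 + x[1])   # an illustrative C^3 function

x_bar = np.array([0.0, 1.0])
grad = np.array([np.log(2.0), 0.5])               # gradient at x_bar, by hand
hess = np.array([[np.log(2.0), 0.5],
                 [0.5, -0.25]])                   # Hessian at x_bar, by hand

def taylor2(h):
    # f(x_bar) + grad . h + (1/2) h' H h
    return f(x_bar) + grad @ h + 0.5 * h @ hess @ h

h = np.array([0.1, -0.05])
print(f(x_bar + h))   # exact value
print(taylor2(h))     # close second-order approximation
```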

[12] The first-order generalized derivative of a real-valued multivariate function is a vector; the second-order generalized derivative, a matrix; the third-order generalized derivative, a matrix with multiple layers; and so on.

2.7 Characterization of convex functions

Let’s recap. We started by discussing geometric insights about the derivative of real-valued functions and listed three main properties which should carry over to a generalization to the multivariate case. The partial derivatives (the gradient) were quite useful for the linear approximation (and, in fact, for polynomial approximations via Taylor), and the total derivative ensured that differentiability preserves continuity. The third desirable property was that the sign of the derivative would carry information about the monotonicity of the function around a point. This property is kept by partial derivatives if we are interested in the monotonicity of the function in the direction of one variable. However, imagine a function f(x, y) : R² → R which increases in x and decreases in y. In the case of a multivariate function, we are no longer interested in whether or not the function is decreasing or increasing. Instead, we are interested in whether or not the function is convex or concave. We have already defined and found a useful characterization for these concepts in the first section. There, we concluded that f is convex if its restriction to any line in its domain is convex. Endowed with this geometrical insight, one can derive a convenient characterization of convexity which applies to functions of class C²(X). Therefore, assume from now on that f is twice continuously differentiable. By considering the restriction of f to the line spanned by z, we implicitly defined a univariate function g of t:

g(t) = f(x + tz)

where t ∈ R and x, z ∈ R^n such that ‖z‖ = 1. From there, one can restate the requirement that g, i.e., the restriction of f to a specified line, be (strictly) convex. More precisely, this is true if and only if ∀ t ∈ {t ∈ R | x + tz ∈ X}, g″(t) (>) ≥ 0, where:

g″(t) = (d/dt) g′(t) = (d/dt) [∇f(x + tz) · z] = z′ · H_f(x + tz) · z

Exercise 2. This is probably a good time to stop for a little while and, for once, focus on the “words instead of the plot”. As an exercise, you should apply the multivariate chain rule to the function g = f ∘ h, where h : R → R^n, t ↦ x + tz, and f : X → R, both twice continuously differentiable, and derive g″(t). Verify that the above formula indeed holds.

Because of the above equation, and because (x, t) ↦ x + tz is a bijection[13], requiring that g″(t) (>) ≥ 0 for every x and every t is equivalent to requiring that

∀ y ∈ Int(X) ∀ z ∈ R^n \ {0}: z′ · H_f(y) · z (>) ≥ 0

If you remember the end of chapter 2, this last formulation should look familiar. Back then, the definition of (semi-)definiteness for matrices seemed to drop out of the sky. Now,

[13] If it is not absolutely clear to you why this matters, it is worthwhile to stop again for a minute and think this through before continuing.

it becomes clearer why this is such an important concept! With it, we have the following third (and most useful) characterization of convexity:

Theorem 2.9. (Multivariate Convexity)
Let X be a convex subset of R^n. A real-valued function f : X → R that is also an element of C²(X) is convex if and only if H_f(x) is positive semidefinite for all x ∈ Int(X). Further, if H_f(x) is positive definite for all x ∈ Int(X), then f is strictly convex.

Remark 1. Please note that the characterization only applies to C² functions. Not every convex function is differentiable, though! For instance, the absolute value, a convex function, is not differentiable at 0.

Remark 2. We will not show it here, but a convex real-valued function defined on a finite-dimensional convex domain X is continuous on Int(X).

Analogous characterizations can be derived for concave functions.
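A minimal sketch applying Theorem 2.9 (the quadratic form f(x) = x′Ax is my own example): its Hessian is the constant matrix 2A, and we can check positive semidefiniteness via the eigenvalues:

```python
import numpy as np

# f(x) = x' A x with symmetric A; its Hessian is the constant matrix 2A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
hessian = 2 * A

eigenvalues = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix
print(eigenvalues)                 # all strictly positive here
print(np.all(eigenvalues >= 0))    # True -> f is convex
                                   # (all > 0 -> positive definite -> strictly convex)
```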

3 Integral theory

In the definition of the multivariate derivative, we saw that differentiation itself is a mapping D, operating on the space of all differentiable functions and mapping into another space of functions. To ease notation, we have previously noted that mappings taking functions as input and output are called operators. It is only natural to ask whether this operator has an inverse, i.e., whether there exists D⁻¹.

Exercise 1.
1. Test your understanding of the above by answering the following question: is the inverse of the Jacobian matrix a good candidate for D⁻¹?

2. Is the operator D a bijection?

If you got the last question right, you know that an inverse function of the derivative operator does not exist. This is similar to the case f(x) = x². As the function is not injective, you cannot find one value of x for each y = x²; indeed, there are two. There exists no inverse function, but one may well define an inverse correspondence, which maps each value y of the codomain of f to the set of all values x in the domain of f which are mapped onto y under f: y ↦ {x ∈ R : x² = y}. A similar procedure can be followed to reverse the derivative operator, and this yields the indefinite integral ∫, also called the antiderivative.

3.1 Univariate integral calculus

I assume that you are already familiar with univariate integration from high school, so the following proposition should hold no news for you. All the results can be derived from the laws of differentiation.

Theorem 3.1. (Calculation rules for indefinite integrals)
Let f, g be two integrable functions[14] and let a, C be constants, n ∈ N. Then

• ∫ (a f(x) + g(x)) dx = a ∫ f(x) dx + ∫ g(x) dx

• ∫ x^n dx = x^{n+1}/(n + 1) + C if n ≠ −1, and ∫ (1/x) dx = ln|x| + C

• ∫ e^x dx = e^x + C and ∫ e^{f(x)} f′(x) dx = e^{f(x)} + C

• ∫ (f(x))^n f′(x) dx = (f(x))^{n+1}/(n + 1) + C if n ≠ −1, and ∫ (f′(x)/f(x)) dx = ln|f(x)| + C

There is another calculation rule, which can be thought of as the reverse of the product rule: integration by parts.

[14] Although there is a more formal definition, let it suffice here to understand an integrable function as a function whose integral you can compute with the usual rules.

Theorem 3.2. (Integration by parts)
Let u, v be two differentiable functions. Then,

∫ u(x) v′(x) dx = u(x) v(x) − ∫ u′(x) v(x) dx.
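A quick sympy sketch (the integrands are chosen purely for illustration) checking the power rule and integration by parts:

```python
import sympy as sp

x = sp.Symbol('x')

# Power rule: integrate x^3 -> x^4 / 4 (sympy omits the constant C)
print(sp.integrate(x**3, x))                      # x**4/4

# Integration by parts with u = x, v' = exp(x):
lhs = sp.integrate(x * sp.exp(x), x)              # the left-hand side
rhs = x * sp.exp(x) - sp.integrate(sp.exp(x), x)  # u v - integral of u' v
print(sp.simplify(lhs - rhs))                     # 0: both sides agree
```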

3.2 The definite integral and the fundamental theorem of calculus

In school, you were probably introduced to the definite integral as an approximation of the area under the graph, i.e., as the limit of Riemann sums (see e.g. the graph in Simon & Blume (1994), p. 891). Above, we introduced the indefinite integral as the inverse operation of the derivative (albeit in a rather informal way). As the names already suggest, both concepts are related, and this result is so important that it is called the fundamental theorem of calculus. Put in simple words, it states that summing over small changes yields the net change. Put in the form of calculus, the sum is replaced by the definite integral and the small changes by the differential.

Theorem 3.3. Let f : [a, b] → R be continuous and define F(x) = ∫_a^x f(t) dt for all x ∈ [a, b]. Then, F is differentiable on (a, b) with

F′(x) = f(x) for all x ∈ (a, b).

Proof. (Sketch only; look up the mean value theorem for integration if you want to know more details.) For all x, x + ∆x ∈ [a, b] we have

F(x + ∆x) − F(x) = ∫_a^{x+∆x} f(t) dt − ∫_a^x f(t) dt
                 = ∫_x^{x+∆x} f(t) dt
                 = f(c_∆) ∆x,

where the last equality follows from the mean value theorem for integration, which guarantees

the existence of such a c_∆ ∈ [x, x + ∆x] for all ∆x. Dividing by ∆x and taking the limit as ∆x → 0 on both sides yields the desired result.

3.3 Multivariate extension

As we did with the derivatives, let us extend the notion of the integral to the multivariate case by first looking at a function mapping from R² to R. If, in the univariate case, the definite integral measured the area under the graph, it is now only natural to require the definite integral to measure the volume under the graph. In higher dimensions, we have to go on without graphic illustrations, but the concept of “summing up the function values over infinitely small areas of the domain” remains valid. Also, indefinite integrals should still be considered antiderivatives, but now with respect to the multivariate derivative, and again intuition might fail us here. Luckily, for probably all intents and purposes for which you will come across integrals in your master courses, the following theorem will be of practical help:

Theorem 3.4. (Fubini’s theorem)
Let I_x = [a, b] and I_y = [c, d] be two intervals in R and define I := I_x × I_y. Let f : I → R be continuous. Then

∫_I f(x, y) d(x, y) = ∫_{I_x} ( ∫_{I_y} f(x, y) dy ) dx,

and all the integrals on the right-hand side are well-defined.

Remark 1. The theorem is pretty powerful, as it only needs continuity of the function as a prerequisite, and it then allows you to reduce a multivariate integral to iterated lower-dimensional ones (here, to univariate integrals).

Remark 2. You can apply the theorem repeatedly if you are faced with higher-dimensional integrals.

The following is then a direct consequence of Fubini’s theorem:

Theorem 3.5. Let A, B ⊂ R be two closed intervals, and f : A → R, g : B → R continuous functions. Then

∫_{A×B} f(x) g(y) d(x, y) = ( ∫_A f(x) dx ) ( ∫_B g(y) dy ).

Proof. Define h : A × B → R with h(x, y) = f(x)g(y). This function is continuous on the interval A × B and we can apply Fubini’s theorem. Then

∫_{A×B} f(x) g(y) d(x, y) = ∫_A ( ∫_B f(x) g(y) dy ) dx
                          = ∫_A f(x) ( ∫_B g(y) dy ) dx
                          = ( ∫_A f(x) dx ) ( ∫_B g(y) dy )
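A numerical sanity check of Fubini’s theorem (the integrand and the rectangle are my own choices), comparing the double integral with both iterated orders:

```python
from scipy.integrate import dblquad, quad
import numpy as np

f = lambda x, y: np.exp(x) * np.cos(y)   # continuous on [0,1] x [0,2]

# dblquad integrates over y first, then x; note its func takes (y, x).
total, _ = dblquad(lambda y, x: f(x, y), 0.0, 1.0,
                   lambda x: 0.0, lambda x: 2.0)

# Iterated integrals in both orders.
inner_y = lambda x: quad(lambda y: f(x, y), 0.0, 2.0)[0]
inner_x = lambda y: quad(lambda x: f(x, y), 0.0, 1.0)[0]
order_xy = quad(inner_y, 0.0, 1.0)[0]
order_yx = quad(inner_x, 0.0, 2.0)[0]

print(total, order_xy, order_yx)   # all ~ (e - 1) * sin(2)
```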

A Appendix - Directional Derivatives

Now, it is important to realize that we did not proceed without loss of generality when generalizing the concept of derivation to multivariate functions. Remember, we “consciously forgot” that all variables could move at the same time! That is, our generalization could only claim to be proper for independent variables. In practice, variables often are indirectly dependent or simply happen to change simultaneously, and we may wish to evaluate the instantaneous rate of change of f if several of our variables change simultaneously. Let us consider first the geometric intuition. In the case of a derivative for univariate functions, it was possible to move away from the point in any desirable direction, for the simple reason that there was a unique direction along which to move, namely, that of the real line[15]! Then, the derivative was indicating the rate of change in our function as we move away from x along the real line, i.e., along all directions we could think of moving along. Yet, as soon as we have two or more dimensions, there are actually infinitely many possible directions along which one could move[16]! Depending on the “geography” of our multivariate function, which direction we focus on could matter! In the above definition, by requiring that all x_j’s, j ≠ i, be fixed, we have imposed that the direction along which to evaluate the rate of change should be that which is specified by the vector x_i (as drawn in Figure 8, where we picked the direction of x_1!).

Arguably, the real generalization of our derivative concept for univariate functions should not impose any restriction on the direction along which to evaluate the rate of change. This generalization exists and is called a multivariate derivative or, more simply, derivative. But before we get there, we should first look at an intermediate concept, called the directional derivative, which, if it exists, will give us the rate of change in any specified direction – and not only the directions induced by single variables. We already know from the previous chapter that a vector consists of a direction, a sense, and a magnitude. Thus, any vector z ∈ R^n specifies a direction along which we could try to evaluate the rate of change of our multivariate real-valued function. Yet, as an infinity of vectors could specify the same direction, we need to impose a bit of consistency, and convention suggests taking a unitary z, i.e., a z such that ‖z‖ = 1. From there, we can proceed with a rather intuitive definition:

Definition A.1. (Directional Derivative)
Let X ⊆ R^n and suppose f : X → R. If x̄ is an interior point of X, then the rate of change of f(x) at x̄ in the direction of the unit vector z = (z_1, ..., z_n) ∈ R^n is called the directional

[15] Please do not confuse direction and sense! A direction simply indicates a line along which we move; the sense, however, indicates toward which side of the line we are moving.
[16] If you have trouble picturing that, simply imagine a single point within a 2-dimensional space (i.e., a simple plane, like a piece of paper), and start counting the number of lines that you can draw through this point. The moment when you stop should come right after you realize there is an infinity of such lines ;)

derivative, is denoted D_z f(x̄), and is defined as

D_z f(x̄) = lim_{h→0} (f(x̄ + hz) − f(x̄)) / h

with h ∈ R, whenever the limit exists and whenever x̄ + hz ∈ X.

In particular, if z points in the direction of x_i, i.e., z = (0, ..., 0, 1, 0, ..., 0) ∈ R^n, where the only nonzero component is that in the direction of x_i, then the directional derivative coincides with the partial derivative with respect to x_i. To illustrate the concept, let us consider again the case f(x_1, x_2) = x_1 x_2:

Figure 9: Directional derivative of f(x_1, x_2) in the direction of (1, 1) at x̄
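A numerical sketch of this example (the evaluation point is my own choice; note that (1, 1) is normalized to unit length, as the definition requires):

```python
import numpy as np

f = lambda x: x[0] * x[1]                 # f(x1, x2) = x1 * x2
x_bar = np.array([1.0, 2.0])              # illustrative evaluation point
z = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit vector in the direction (1, 1)

h = 1e-6
rate = (f(x_bar + h * z) - f(x_bar)) / h  # difference quotient along z
print(rate)   # ~ (x2 + x1)/sqrt(2) = 3/sqrt(2) = 2.1213...
```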

In practice, this is not a very convenient definition: choosing a vector, finding the unit vector with the same direction, computing a limit... Luckily, there exists an important characterization of the directional derivative which emphasizes its link with partial derivatives and which is much more convenient for applications: that of the total differential.
