
Linear Algebra (part 3): Inner Product Spaces (by Evan Dummit, 2020, v. 2.00)

Contents

3 Inner Product Spaces
   3.1 Inner Product Spaces
      3.1.1 Inner Products on Real Vector Spaces
      3.1.2 Inner Products on Complex Vector Spaces
      3.1.3 Properties of Inner Products, Norms
   3.2 Orthogonality
      3.2.1 Orthogonality, Orthonormal Bases, and the Gram-Schmidt Procedure
      3.2.2 Orthogonal Complements and Orthogonal Projection
   3.3 Applications of Inner Products
      3.3.1 Least-Squares Estimates
      3.3.2 Fourier Series
   3.4 Linear Transformations and Inner Products
      3.4.1 Characterizations of Inner Products
      3.4.2 The Adjoint of a Linear Transformation

3 Inner Product Spaces

In this chapter we will study vector spaces having an additional kind of structure called an inner product, which generalizes the idea of the dot product of vectors in Rⁿ, and which will allow us to formulate notions of length and angle in more general vector spaces. We define inner products in real and complex vector spaces and establish some of their properties, including the celebrated Cauchy-Schwarz inequality, and survey some applications. We then discuss orthogonality of vectors and subspaces and in particular describe a method for constructing an orthonormal basis for any finite-dimensional inner product space, which provides an analogue of giving standard unit coordinate axes in Rⁿ. Next, we discuss a pair of very important practical applications of inner products and orthogonality: computing least-squares approximations and approximating periodic functions with Fourier series. We close with a discussion of the interplay between linear transformations and inner products.

3.1 Inner Product Spaces

• Notation: In this chapter, we will use parentheses around vectors rather than angle brackets, since we will shortly be using angle brackets to denote an inner product. We will also avoid using dots when discussing scalar multiplication, and reserve the dot notation for the dot product of two vectors.

3.1.1 Inner Products on Real Vector Spaces

• Our first goal is to describe the natural analogue, in an arbitrary real vector space, of the dot product in Rⁿ. We recall some basic properties of the dot product:

◦ The dot product distributes over addition and scalar multiplication: (v1 + cv2) · w = (v1 · w) + c(v2 · w) for any vectors v1, v2, w and scalar c.

◦ The dot product is commutative: v · w = w · v for any vectors v, w.
◦ The dot product is nonnegative: v · v ≥ 0 for any vector v.

• We will use these three properties to define an arbitrary inner product on a real vector space:
• Definition: If V is a real vector space, an inner product on V is a pairing that assigns a scalar to each ordered pair (v, w) of vectors in V. This pairing is denoted ⟨v, w⟩ and must satisfy the following properties:

[I1] Linearity in the first argument: ⟨v1 + cv2, w⟩ = ⟨v1, w⟩ + c⟨v2, w⟩.
[I2] Symmetry: ⟨w, v⟩ = ⟨v, w⟩.
[I3] Positive-definiteness: ⟨v, v⟩ ≥ 0 for all v, and ⟨v, v⟩ = 0 only when v = 0.

◦ The linearity and symmetry properties are fairly clear: if we fix the second component, the inner product behaves like a linear function in the first component, and we want both components to behave in the same way.
◦ The positive-definiteness property is intended to capture an idea about length: namely, the square of the length of a vector v should be the (inner) product of v with itself, and lengths are supposed to be nonnegative. Furthermore, the only vector of length zero should be the zero vector.

• Definition: A vector space V together with an inner product ⟨·, ·⟩ on V is called an inner product space.

◦ Any given vector space may have many different inner products.
◦ When we say "Let V be an inner product space", we intend this to mean that V is equipped with a particular (fixed) inner product.

• The entire purpose of defining an inner product is to generalize the notion of the dot product to more general vector spaces, so we should check that the dot product on Rⁿ is actually an inner product:
• Example: Show that the standard dot product on Rⁿ, defined as (x1, . . . , xn) · (y1, . . . , yn) = x1y1 + ··· + xnyn, is an inner product.

◦ [I1]-[I2]: We have previously verified the linearity and symmetry properties.
◦ [I3]: If v = (x1, . . . , xn), then v · v = x1² + x2² + ··· + xn². Since each square is nonnegative, v · v ≥ 0, and v · v = 0 only when all of the components of v are zero.

• There are other examples of inner products on Rⁿ beyond the standard dot product.
• Example: Show that the pairing ⟨(x1, y1), (x2, y2)⟩ = 3x1x2 + 2x1y2 + 2x2y1 + 4y1y2 on R² is an inner product.

◦ [I1]-[I2]: It is an easy algebraic computation to verify the linearity and symmetry properties.
◦ [I3]: We have ⟨(x, y), (x, y)⟩ = 3x² + 4xy + 4y² = 2x² + (x + 2y)², and since each square is nonnegative, the inner product is always nonnegative. Furthermore, it equals zero only when both squares are zero, and this clearly only occurs for x = y = 0.
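
• Numerical sketch (supplementary; this assumes NumPy is available, and the helper name inner is illustrative): the pairing above can be written as ⟨u, v⟩ = u^T A v for the symmetric matrix A = [3 2; 2 4], so the axioms can be spot-checked directly.

    import numpy as np

    A = np.array([[3.0, 2.0],
                  [2.0, 4.0]])     # Gram matrix of the pairing 3x1x2 + 2x1y2 + 2x2y1 + 4y1y2

    def inner(u, v):
        return u @ A @ v

    rng = np.random.default_rng(0)
    u, v, w = rng.standard_normal((3, 2))
    c = 1.7

    print(np.isclose(inner(u + c * v, w), inner(u, w) + c * inner(v, w)))   # [I1] linearity
    print(np.isclose(inner(u, v), inner(v, u)))                             # [I2] symmetry
    print(np.all(np.linalg.eigvalsh(A) > 0))                                # [I3] A is positive definite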

• We can define another class of inner products on function spaces using integration:
• Example: Let V be the vector space of continuous (real-valued) functions on the interval [a, b]. Show that ⟨f, g⟩ = ∫_a^b f(x)g(x) dx is an inner product on V.

◦ [I1]: We have ⟨f1 + cf2, g⟩ = ∫_a^b [f1(x) + cf2(x)] g(x) dx = ∫_a^b f1(x)g(x) dx + c ∫_a^b f2(x)g(x) dx = ⟨f1, g⟩ + c⟨f2, g⟩.
◦ [I2]: Observe that ⟨g, f⟩ = ∫_a^b g(x)f(x) dx = ∫_a^b f(x)g(x) dx = ⟨f, g⟩.
◦ [I3]: Notice that ⟨f, f⟩ = ∫_a^b f(x)² dx is the integral of a nonnegative function, so it is always nonnegative. Furthermore (since f is assumed to be continuous), the integral of f² cannot be zero unless f is identically zero.
◦ Remark: More generally, if w(x) is any fixed positive (weight) function that is continuous on [a, b], then ⟨f, g⟩ = ∫_a^b f(x)g(x) · w(x) dx is an inner product on V.

3.1.2 Inner Products on Complex Vector Spaces

• We would now like to extend the notion of an inner product to complex vector spaces.

◦ A natural first guess would be to use the same definition as in a real vector space. However, this turns out not to be the right choice!
◦ To explain why, suppose that we try to use the same definition of dot product to find the length of a vector of complex numbers, i.e., by computing v · v.
◦ We can see that the dot product of the vector (1, 0, i) with itself is 0, but this vector certainly is not the zero vector. Even worse, the dot product of (1, 0, 2i) with itself is −3, while the dot product of (1, 0, 1 + 2i) with itself is −2 + 4i. None of these choices for lengths seem sensible.
◦ Indeed, a more natural choice of length for the vector (1, 0, i) is √2: the first component has absolute value 1, the second has absolute value 0, and the last has absolute value 1, for an overall length of √(1² + 0² + 1²) = √2, in analogy with the real vector (1, 0, 1), which also has length √2. Similarly, (1, 0, 2i) should really have length √(1² + 0² + 2²) = √5.
◦ One way to obtain a nonnegative function that seems to capture this idea of length for a complex vector is to include a conjugation: notice that for any complex vector v, the dot product v · v̄ will always be a nonnegative real number.
◦ Using the example above, we can compute that (1, 0, i) · \overline{(1, 0, i)} = 1² + 0² + 1² = 2, so this modified dot product seems to give the square of the length of a complex vector, at least in this one case.¹
◦ All of this suggests that the right analogue of the Rⁿ dot product for a pair of complex vectors v, w is v · w̄ rather than v · w.
◦ We then lose the symmetry property of the real inner product, since now v · w̄ and w · v̄ are not equal in general. But notice we still have a relation between v · w̄ and w · v̄: namely, the latter is the complex conjugate of the former.
◦ We can therefore obtain the right definition by changing the symmetry property ⟨w, v⟩ = ⟨v, w⟩ to conjugate-symmetry: ⟨w, v⟩ = \overline{⟨v, w⟩}.

• With this modification, we can extend the idea of an inner product to a complex vector space:
• Definition: If V is a (real or) complex vector space, an inner product on V is a pairing that assigns a scalar in F to each ordered pair (v, w) of vectors in V. This pairing is denoted ⟨v, w⟩ and must satisfy the following properties:

[I1] Linearity in the first argument: ⟨v1 + cv2, w⟩ = ⟨v1, w⟩ + c⟨v2, w⟩.
[I2] Conjugate-symmetry: ⟨w, v⟩ = \overline{⟨v, w⟩} (where the bar denotes the complex conjugate).
[I3] Positive-definiteness: ⟨v, v⟩ ≥ 0 for all v, and ⟨v, v⟩ = 0 only when v = 0.

◦ This definition generalizes the one we gave for a real vector space earlier: if V is a real vector space then ⟨w, v⟩ equals its own complex conjugate, since ⟨w, v⟩ is always a real number.
◦ Important Warning: In other disciplines (particularly physics), inner products are often defined to be linear in the second argument, rather than the first. With this convention, the roles of the first and second components are reversed (relative to our definition). This does not make any difference in the general theory, but it can be extremely confusing, since the definitions in mathematics and physics are otherwise identical and the properties have very similar statements.

• The chief reason for using this definition is that our modified dot product for complex vectors is an inner product:

• Example: For complex vectors v = (x1, . . . , xn) and w = (y1, . . . , yn), show that the map ⟨v, w⟩ = x1ȳ1 + ··· + xnȳn is an inner product on Cⁿ.

¹ Another (perhaps more compelling) reason that symmetry is not the right property for complex vector spaces is that the only symmetric positive-definite complex bilinear pairing, by which we mean a complex vector space V with an inner product ⟨·, ·⟩ satisfying the three axioms we listed for a real inner product space, is the trivial inner product on the zero space. Therefore, because we want to develop an interesting theory, we instead require conjugate-symmetry.

◦ [I1]: If v1 = (x1, . . . , xn), v2 = (z1, . . . , zn), and w = (y1, . . . , yn), then

    ⟨v1 + cv2, w⟩ = (x1 + cz1)ȳ1 + ··· + (xn + czn)ȳn
                  = (x1ȳ1 + ··· + xnȳn) + c(z1ȳ1 + ··· + znȳn)
                  = ⟨v1, w⟩ + c⟨v2, w⟩.

◦ [I2]: Observe that ⟨w, v⟩ = y1x̄1 + ··· + ynx̄n = \overline{x1ȳ1 + ··· + xnȳn} = \overline{⟨v, w⟩}.
◦ [I3]: If v = (x1, . . . , xn), then ⟨v, v⟩ = x1x̄1 + x2x̄2 + ··· + xnx̄n = |x1|² + ··· + |xn|². Each term is nonnegative, so ⟨v, v⟩ ≥ 0, and clearly ⟨v, v⟩ = 0 only when all of the components of v are zero.
◦ This map is often called the standard inner product on Cⁿ, since it is fairly natural.

• Here is an example of a complex inner product on the space of n × n matrices:
• Example: Let V = M_{n×n}(C) be the vector space of complex n × n matrices. Show that ⟨A, B⟩ = tr(AB*) is an inner product on V, where M* = \overline{M^T} is the complex conjugate of the transpose of M (often called the conjugate transpose or the adjoint of M).

◦ [I1]: We have ⟨A + cC, B⟩ = tr[(A + cC)B*] = tr[AB* + cCB*] = tr(AB*) + c tr(CB*) = ⟨A, B⟩ + c⟨C, B⟩, where we used the facts that tr(M + N) = tr(M) + tr(N) and tr(cM) = c tr(M).
◦ [I2]: Observe that ⟨B, A⟩ = tr(BA*) = \overline{tr(B*A)} = \overline{tr(AB*)} = \overline{⟨A, B⟩}, where we used the facts that tr(MN) = \overline{tr(M*N*)} and that tr(MN) = tr(NM), both of which are easy algebraic calculations.
◦ [I3]: We have ⟨A, A⟩ = Σ_{j=1}^{n} (AA*)_{j,j} = Σ_{j=1}^{n} Σ_{k=1}^{n} A_{j,k}(A*)_{k,j} = Σ_{j=1}^{n} Σ_{k=1}^{n} A_{j,k}\overline{A_{j,k}} = Σ_{j=1}^{n} Σ_{k=1}^{n} |A_{j,k}|² ≥ 0, and equality can only occur when each element of A has absolute value zero (i.e., is zero).
◦ Remark: This inner product is often called the Frobenius inner product.
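
• Numerical sketch (supplementary; this assumes NumPy is available, and the helper name frob is illustrative): the Frobenius inner product ⟨A, B⟩ = tr(AB*) can be spot-checked on random complex matrices.

    import numpy as np

    def frob(A, B):
        return np.trace(A @ B.conj().T)          # <A, B> = tr(A B*)

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
    B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))

    print(np.isclose(frob(B, A), np.conj(frob(A, B))))      # conjugate-symmetry [I2]
    print(np.isclose(frob(A, A), np.sum(np.abs(A) ** 2)))   # <A, A> = sum of squared absolute values
    print(frob(A, A).real >= 0)                              # positive-definiteness [I3]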

3.1.3 Properties of Inner Products, Norms

• Our fundamental goal in studying inner products is to extend the notion of length in Rⁿ to a more general setting. Using the positive-definiteness property, we can define a notion of length in an inner product space.

• Definition: If V is an inner product space, we define the norm (or length) of a vector v to be ||v|| = √⟨v, v⟩.

◦ When V = Rⁿ with the standard dot product, the norm on V reduces to the standard notion of length of a vector in Rⁿ.

• Here are a few basic properties of inner products and norms:
• Proposition (Properties of Norms): If V is an inner product space with inner product ⟨·, ·⟩, the following hold:

1. For any vectors v, w1, and w2, ⟨v, w1 + w2⟩ = ⟨v, w1⟩ + ⟨v, w2⟩.
◦ Proof: We have ⟨v, w1 + w2⟩ = \overline{⟨w1 + w2, v⟩} = \overline{⟨w1, v⟩ + ⟨w2, v⟩} = ⟨v, w1⟩ + ⟨v, w2⟩ by [I1] and [I2].
2. For any vectors v and w and scalar c, ⟨v, cw⟩ = c̄⟨v, w⟩.
◦ Proof: We have ⟨v, cw⟩ = \overline{⟨cw, v⟩} = \overline{c⟨w, v⟩} = c̄⟨v, w⟩ by [I1] and [I2].
3. For any vector v, ⟨v, 0⟩ = 0 = ⟨0, v⟩.
◦ Proof: Apply property (2) and [I1] with c = 0, using the fact that 0w = 0 for any w.
4. For any vector v, ||v|| is a nonnegative real number, and ||v|| = 0 if and only if v = 0.
◦ Proof: Immediate from [I3].
5. For any vector v and scalar c, ||cv|| = |c| · ||v||.
◦ Proof: We have ||c · v|| = √⟨c · v, c · v⟩ = √(c c̄ ⟨v, v⟩) = |c| · ||v||, using [I1] and property (2).

• In Rⁿ, there are a number of fundamental inequalities about lengths, which generalize quite pleasantly to general inner product spaces.

• The following result, in particular, is one of the most fundamental inequalities in all of mathematics:
• Theorem (Cauchy-Schwarz Inequality): For any v and w in an inner product space V, we have |⟨v, w⟩| ≤ ||v|| ||w||, with equality if and only if the set {v, w} is linearly dependent.

◦ Proof: If w = 0 then the result is trivial (since both sides are zero, and {v, 0} is always dependent), so now assume w ≠ 0.
◦ Let t = ⟨v, w⟩ / ⟨w, w⟩. By properties of inner products and norms, we can write

    ||v − tw||² = ⟨v − tw, v − tw⟩ = ⟨v, v⟩ − t⟨w, v⟩ − t̄⟨v, w⟩ + t t̄ ⟨w, w⟩
                = ⟨v, v⟩ − |⟨v, w⟩|²/⟨w, w⟩ − |⟨v, w⟩|²/⟨w, w⟩ + |⟨v, w⟩|²/⟨w, w⟩
                = ⟨v, v⟩ − |⟨v, w⟩|²/⟨w, w⟩.

◦ Therefore, since ||v − tw||² ≥ 0 and ⟨w, w⟩ > 0, clearing denominators and rearranging yields |⟨v, w⟩|² ≤ ⟨v, v⟩ ⟨w, w⟩. Taking the square root yields the stated result.
◦ Furthermore, we will have equality if and only if ||v − tw||² = 0, which is in turn equivalent to v − tw = 0; namely, when v is a multiple of w. Since we also get equality if w = 0, equality occurs precisely when the set {v, w} is linearly dependent.
◦ Remark: As written, this proof is completely mysterious: why does making that particular choice for t work? Here is some motivation: in the special case where V is a real vector space, we can write ||v − tw||² = ⟨v, v⟩ − 2t⟨v, w⟩ + t²⟨w, w⟩, which is a quadratic function of t that is always nonnegative.
◦ To decide whether a quadratic function is always nonnegative, we complete the square to see that

    ⟨v, v⟩ − 2t⟨v, w⟩ + t²⟨w, w⟩ = ⟨w, w⟩ [t − ⟨v, w⟩/⟨w, w⟩]² + ⟨v, v⟩ − ⟨v, w⟩²/⟨w, w⟩.

◦ Thus, the minimum value of the quadratic function is ⟨v, v⟩ − ⟨v, w⟩²/⟨w, w⟩, and it occurs when t = ⟨v, w⟩/⟨w, w⟩.

• The Cauchy-Schwarz inequality has many applications (most of which are, naturally, proving other inequalities). Here are a few such applications:

• Theorem (Triangle Inequality): For any v and w in an inner product space V, we have ||v + w|| ≤ ||v|| + ||w||, with equality if and only if one of the vectors is a nonnegative real scalar multiple of the other.

◦ Proof: Using the Cauchy-Schwarz inequality and the fact that Re(z) ≤ |z| for any z ∈ C, we have

    ||v + w||² = ⟨v + w, v + w⟩ = ⟨v, v⟩ + 2 Re(⟨v, w⟩) + ⟨w, w⟩
               ≤ ⟨v, v⟩ + 2|⟨v, w⟩| + ⟨w, w⟩
               ≤ ⟨v, v⟩ + 2||v|| ||w|| + ⟨w, w⟩ = ||v||² + 2||v|| ||w|| + ||w||² = (||v|| + ||w||)².

Taking the square root of both sides yields the desired result.
◦ Equality will hold if and only if {v, w} is linearly dependent (for equality in the Cauchy-Schwarz inequality) and ⟨v, w⟩ is a nonnegative real number. If either vector is zero, equality always holds. Otherwise, we must have v = c · w for some nonzero constant c: then ⟨v, w⟩ = c⟨w, w⟩ will be a nonnegative real number if and only if c is a nonnegative real number.

• Example: Show that for any continuous function f on [0, 3], it is true that ∫_0^3 x f(x) dx ≤ 3 √(∫_0^3 f(x)² dx).

◦ Simply apply the Cauchy-Schwarz inequality to f and g(x) = x in the inner product space of continuous functions on [0, 3] with inner product ⟨f, g⟩ = ∫_0^3 f(x)g(x) dx.
◦ We obtain |⟨f, g⟩| ≤ ||f|| · ||g||, or, explicitly, |∫_0^3 x f(x) dx| ≤ √(∫_0^3 f(x)² dx) · √(∫_0^3 x² dx) = 3 √(∫_0^3 f(x)² dx).
◦ Since any real number is less than or equal to its absolute value, we immediately obtain the required inequality ∫_0^3 x f(x) dx ≤ 3 √(∫_0^3 f(x)² dx).

• Example: Show that for any positive reals a, b, c, it is true that √((a + 2b)/(a + b + c)) + √((b + 2c)/(a + b + c)) + √((c + 2a)/(a + b + c)) ≤ 3.

◦ Let v = (√(a + 2b), √(b + 2c), √(c + 2a)) and w = (1, 1, 1) in R³. By the Cauchy-Schwarz inequality, v · w ≤ ||v|| ||w||.
◦ We compute v · w = √(a + 2b) + √(b + 2c) + √(c + 2a), along with ||v||² = (a + 2b) + (b + 2c) + (c + 2a) = 3(a + b + c) and ||w||² = 3.
◦ Thus, we see √(a + 2b) + √(b + 2c) + √(c + 2a) ≤ √(3(a + b + c)) · √3 = 3√(a + b + c), and upon dividing through by √(a + b + c) we obtain the required inequality.
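
• Numerical sketch (supplementary; this assumes NumPy/SciPy are available, and the sample function f is an arbitrary choice): a quick check of the first example above, ∫_0^3 x f(x) dx ≤ 3 √(∫_0^3 f(x)² dx), for one particular continuous f.

    import numpy as np
    from scipy.integrate import quad

    f = lambda x: np.exp(x) * np.cos(2 * x)      # any continuous function on [0, 3]

    lhs = quad(lambda x: x * f(x), 0, 3)[0]
    rhs = 3 * np.sqrt(quad(lambda x: f(x) ** 2, 0, 3)[0])
    print(lhs <= rhs, lhs, rhs)                  # the inequality holds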

• Example (for those who like physics): Prove the position-momentum formulation of Heisenberg's uncertainty principle: σ_x σ_p ≥ ħ/2. (In words: the product of the uncertainties of position and momentum is greater than or equal to half of the reduced Planck constant.)

◦ It is a straightforward computation that, for two (complex-valued) observables X and Y, the pairing ⟨X, Y⟩ = E[X\overline{Y}], the expected value of X\overline{Y}, is an inner product on the space of observables.
◦ Assume (for simplicity) that x and p both have expected value 0.
◦ We assume as given the commutation relation xp − px = iħ.
◦ By definition, (σ_x)² = E[x x̄] = ⟨x, x⟩ and (σ_p)² = E[p p̄] = ⟨p, p⟩ are the variances of x and p respectively (in the statistical sense).
◦ By the Cauchy-Schwarz inequality, we can therefore write σ_x² σ_p² = ⟨x, x⟩ ⟨p, p⟩ ≥ |⟨x, p⟩|² = |E[x p̄]|² = |E[xp]|², since x and p are real observables.
◦ We can write xp = (1/2)(xp + px) + (1/2)(xp − px), where the first component is real and the second is imaginary, so taking expectations yields E[xp] = (1/2)E[xp + px] + (1/2)E[xp − px], and therefore |E[xp]| ≥ (1/2)|E[xp − px]| = (1/2)|iħ| = ħ/2.
◦ Combining with the inequality above yields σ_x² σ_p² ≥ ħ²/4, and taking square roots yields σ_x σ_p ≥ ħ/2.

3.2 Orthogonality

• Motivated by the Cauchy-Schwarz inequality, we can define a notion of angle between two nonzero vectors in a real inner product space:

• Definition: If V is a real inner product space, we define the angle between two nonzero vectors v and w to be the real number θ in [0, π] satisfying cos θ = ⟨v, w⟩ / (||v|| ||w||).

◦ By the Cauchy-Schwarz inequality, the quotient on the right is a real number in the interval [−1, 1], so there is exactly one such angle θ.

• Example: Compute the angle between the vectors v = (3, −4, 5) and w = (1, 2, −2) under the standard dot product on R³.

◦ We have v · w = −15, ||v|| = √(v · v) = 5√2, and ||w|| = √(w · w) = 3.
◦ Then the angle θ between the vectors satisfies cos(θ) = −15/(15√2) = −1/√2, so θ = 3π/4.

• Example: Compute the angle between p = 5x² − 3 and q = 3x − 2 in the inner product space of continuous functions on [0, 1] with inner product ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx.

◦ We have ⟨p, q⟩ = ∫_0^1 (5x² − 3)(3x − 2) dx = 23/12, ||p|| = √(∫_0^1 (5x² − 3)² dx) = 2, and ||q|| = √(∫_0^1 (3x − 2)² dx) = 1.
◦ Then the angle θ between the vectors satisfies cos(θ) = (23/12)/(2 · 1) = 23/24, so θ = cos⁻¹(23/24).
◦ The fact that this angle is so close to 0 suggests that these functions are nearly parallel in this inner product space. Indeed, the graphs of the two functions have very similar shapes.
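
• Numerical sketch (supplementary; this assumes NumPy/SciPy are available): recomputing the angle between p = 5x² − 3 and q = 3x − 2 under ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx, confirming cos(θ) = 23/24.

    import numpy as np
    from scipy.integrate import quad

    inner = lambda f, g: quad(lambda x: f(x) * g(x), 0, 1)[0]
    p = lambda x: 5 * x**2 - 3
    q = lambda x: 3 * x - 2

    cos_theta = inner(p, q) / np.sqrt(inner(p, p) * inner(q, q))
    print(cos_theta, 23 / 24, np.arccos(cos_theta))    # ≈ 0.9583, 0.9583, angle ≈ 0.289 radians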

3.2.1 Orthogonality, Orthonormal Bases, and the Gram-Schmidt Procedure

• A particular case of interest is when the angle between two vectors is π/2 (i.e., they are perpendicular).

◦ For nonzero vectors, by our discussion above, this is equivalent to saying that their inner product is 0.

• Definition: We say two vectors in an inner product space are orthogonal if their inner product is zero. We say a set S of vectors is an orthogonal set if every pair of vectors in S is orthogonal.

◦ By our basic properties, the zero vector is orthogonal to every vector. Two nonzero vectors will be orthogonal if and only if the angle between them is π/2. (This generalizes the idea of two vectors being perpendicular.)

◦ Example: In R³ with the standard dot product, the vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) are orthogonal.
◦ Example: In R³ with the standard dot product, the three vectors (−1, 1, 2), (2, 0, 1), and (1, 5, −2) form an orthogonal set, since the dot product of each pair is zero.
◦ The first orthogonal set above seems more natural than the second. One reason for this is that the vectors in the first set each have length 1, while the vectors in the second set have various different lengths (√6, √5, and √30 respectively).

• Definition: We say a set S of vectors is an orthonormal set if every pair of vectors in S is orthogonal, and every vector in S has norm 1.

◦ Example: In R3, {(1, 0, 0), (0, 1, 0), (0, 0, 1)} is an orthonormal set, but {(−1, 1, 2), (2, 0, 1), (1, 5, −2)} is not.

• In both examples above, notice that the given orthogonal sets are also linearly independent. This feature is not an accident:

• Proposition (Orthogonality and Independence): In any inner product space, every orthogonal set of nonzero vectors is linearly independent.

◦ Proof: Suppose we had a linear dependence a1v1 + ··· + anvn = 0 for an orthogonal set of nonzero vectors {v1, . . . , vn}.

◦ Then, for any j, 0 = ⟨0, vj⟩ = ⟨a1v1 + ··· + anvn, vj⟩ = a1⟨v1, vj⟩ + ··· + an⟨vn, vj⟩ = aj⟨vj, vj⟩, since each of the inner products ⟨vi, vj⟩ for i ≠ j is equal to zero.

◦ But now, since vj is not the zero vector, ⟨vj, vj⟩ is positive, so it must be the case that aj = 0. This holds for every j, so all the coefficients of the linear dependence are zero. Hence there can be no nontrivial linear dependence, so any orthogonal set of nonzero vectors is linearly independent.

• Corollary: If V is an n-dimensional vector space and S is an orthogonal set of n nonzero vectors, then S is a basis for V. (We refer to such a basis as an orthogonal basis.)

◦ Proof: By the proposition above, S is linearly independent, and by our earlier results, a linearly independent set of n vectors in an n-dimensional vector space is necessarily a basis.

• Given a basis of V, every vector in V can be written as a unique linear combination of the basis vectors. However, actually computing the coefficients of the linear combination can be quite cumbersome.

◦ If, however, we have an orthogonal basis for V, then we can compute the coefficients for the linear combination much more conveniently.

• Theorem (Orthogonal Decomposition): If V is an n-dimensional inner product space and S = {e1, . . . , en} is an orthogonal basis, then for any v in V, we can write v = c1e1 + ··· + cnen, where ck = ⟨v, ek⟩ / ⟨ek, ek⟩ for each 1 ≤ k ≤ n. In particular, if S is an orthonormal basis, then each ck = ⟨v, ek⟩.

◦ Proof: Since S is a basis, there do exist such coefficients ci and they are unique.

◦ We then compute ⟨v, ek⟩ = ⟨c1e1 + ··· + cnen, ek⟩ = c1⟨e1, ek⟩ + ··· + cn⟨en, ek⟩ = ck⟨ek, ek⟩, since each of the inner products ⟨ej, ek⟩ for j ≠ k is equal to zero.

◦ Therefore, we must have ck = ⟨v, ek⟩ / ⟨ek, ek⟩ for each 1 ≤ k ≤ n.

◦ If S is an orthonormal basis, then ⟨ek, ek⟩ = 1 for each k, so we get the simpler expression ck = ⟨v, ek⟩.

• Example: Write v = (7, 3, −4) as a linear combination of the basis {(−1, 1, 2), (2, 0, 1), (1, 5, −2)} of R3.

◦ We saw above that this set is an orthogonal basis, so let e1 = (−1, 1, 2), e2 = (2, 0, 1), and e3 = (1, 5, −2).

◦ We compute v · e1 = −12, v · e2 = 10, v · e3 = 30, e1 · e1 = 6, e2 · e2 = 5, and e3 · e3 = 30.
◦ Thus, per the theorem, v = c1e1 + c2e2 + c3e3, where c1 = −12/6 = −2, c2 = 10/5 = 2, and c3 = 30/30 = 1.
◦ Indeed, we can verify that (7, 3, −4) = −2(−1, 1, 2) + 2(2, 0, 1) + 1(1, 5, −2).
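
• Numerical sketch (supplementary; this assumes NumPy is available): recovering the coefficients ck = ⟨v, ek⟩/⟨ek, ek⟩ for the orthogonal basis used in the example above.

    import numpy as np

    v = np.array([7.0, 3.0, -4.0])
    basis = [np.array(e, dtype=float) for e in [(-1, 1, 2), (2, 0, 1), (1, 5, -2)]]

    coeffs = [np.dot(v, e) / np.dot(e, e) for e in basis]
    print(coeffs)                                                      # [-2.0, 2.0, 1.0]
    print(np.allclose(sum(c * e for c, e in zip(coeffs, basis)), v))   # the combination reproduces v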

• Given a basis, there exists a way to write any vector as a linear combination of the basis elements: the advantage of having an orthogonal basis is that we can easily compute the coefficients. We now give an algorithm for constructing an orthogonal basis for any finite-dimensional inner product space:

• Theorem (Gram-Schmidt Procedure): Let S = {v1, v2,... } be a basis of the inner product space V , and set Vk = span(v1,..., vk). Then there exists an orthogonal set of vectors {w1, w2,... } such that, for each k ≥ 1, span(w1,..., wk) = span(Vk) and wk is orthogonal to every vector in Vk−1. Furthermore, this sequence is unique up to multiplying the elements by nonzero scalars.

◦ Proof: We construct the sequence {w1, w2,... } recursively: we start with the simple choice w1 = v1.

◦ Now suppose we have constructed {w1, w2,..., wk−1}, where span(w1,..., wk−1) = span(Vk−1).

◦ Define the next vector as wk = vk − a1w1 − a2w2 − ··· − ak−1wk−1, where aj = ⟨vk, wj⟩ / ⟨wj, wj⟩.

◦ From the construction, we can see that each of w1,..., wk is a linear combination of v1,..., vk, and vice versa. Thus, by properties of span, span(w1,..., wk) = Vk.

◦ Then ⟨wk, wj⟩ = ⟨vk − a1w1 − a2w2 − ··· − ak−1wk−1, wj⟩ = ⟨vk, wj⟩ − a1⟨w1, wj⟩ − ··· − ak−1⟨wk−1, wj⟩ = ⟨vk, wj⟩ − aj⟨wj, wj⟩ = 0, because all of the inner products ⟨wi, wj⟩ are zero except for ⟨wj, wj⟩.

◦ Thus wk is orthogonal to each of w1, . . . , wk−1 and hence also to all linear combinations of these vectors.
◦ The uniqueness follows from the observation that (upon appropriate rescaling) we are essentially required to choose wk = vk − a1w1 − a2w2 − ··· − ak−1wk−1 for some scalars a1, . . . , ak−1: orthogonality then forces the choice of the coefficients aj that we used above.

• Corollary: Every finite-dimensional inner product space has an orthonormal basis.

◦ Proof: Choose any basis {v1,..., vn} for V and apply the Gram-Schmidt procedure: this yields an orthogonal basis {w1,..., wn} for V .

◦ Now simply normalize each vector in {w1,..., wn} by dividing by its norm: this preserves orthogonality, but rescales each vector to have norm 1, thus yielding an orthonormal basis for V .

• The proof of the Gram-Schmidt procedure may seem involved, but applying it in practice is fairly straightforward (if somewhat cumbersome computationally).

◦ We remark here that, although our algorithm above gives an orthogonal basis, it is also possible to perform the normalization at each step during the procedure, to construct an orthonormal basis one vector at a time. ◦ When performing computations by hand, it is generally disadvantageous to normalize at each step, because the norm of a vector will often involve square roots (which will then be carried into subsequent steps of the computation). ◦ When using a computer (with approximate arithmetic), however, normalizing at each step can avoid certain numerical instability issues. The particular description of the algorithm we have discussed turns out not to be especially numerically stable, but it is possible to modify the algorithm to avoid magnifying the error as substantially when iterating the procedure.

• Example: For V = R³ with the standard inner product, apply the Gram-Schmidt procedure to the vectors v1 = (2, 1, 2), v2 = (5, 4, 2), v3 = (−1, 2, 1). Use the result to find an orthonormal basis for R³.

◦ We start with w1 = v1 = (2, 1, 2) .

◦ Next, w2 = v2 − a1w1, where a1 = (v2 · w1)/(w1 · w1) = [(5, 4, 2) · (2, 1, 2)] / [(2, 1, 2) · (2, 1, 2)] = 18/9 = 2. Thus, w2 = (1, 2, −2).

◦ Finally, w3 = v3 − b1w1 − b2w2, where b1 = (v3 · w1)/(w1 · w1) = [(−1, 2, 1) · (2, 1, 2)] / [(2, 1, 2) · (2, 1, 2)] = 2/9 and b2 = (v3 · w2)/(w2 · w2) = [(−1, 2, 1) · (1, 2, −2)] / [(1, 2, −2) · (1, 2, −2)] = 1/9. Thus, w3 = (−14/9, 14/9, 7/9).
◦ For the orthonormal basis, we simply divide each vector by its length.
◦ We get w1/||w1|| = (2/3, 1/3, 2/3), w2/||w2|| = (1/3, 2/3, −2/3), and w3/||w3|| = (−2/3, 2/3, 1/3).
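
• Numerical sketch (supplementary; this assumes NumPy is available, and the function name gram_schmidt is illustrative): the procedure described above, applied to v1 = (2, 1, 2), v2 = (5, 4, 2), v3 = (−1, 2, 1).

    import numpy as np

    def gram_schmidt(vectors):
        orthogonal = []
        for v in vectors:
            # subtract the projection of v onto each previously constructed vector
            w = v - sum(np.dot(v, u) / np.dot(u, u) * u for u in orthogonal)
            orthogonal.append(w)
        return orthogonal

    vs = [np.array(v, dtype=float) for v in [(2, 1, 2), (5, 4, 2), (-1, 2, 1)]]
    ws = gram_schmidt(vs)
    print(ws)                                     # (2,1,2), (1,2,-2), (-14/9, 14/9, 7/9)
    print([w / np.linalg.norm(w) for w in ws])    # the orthonormal basis found above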

• Example: For V = P2(R) with inner product ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx, apply the Gram-Schmidt procedure to the polynomials p1 = 1, p2 = x, p3 = x².

◦ We start with w1 = p1 = 1.
◦ Next, w2 = p2 − a1w1, where a1 = ⟨p2, w1⟩/⟨w1, w1⟩ = [∫_0^1 x dx] / [∫_0^1 1 dx] = 1/2. Thus, w2 = x − 1/2.
◦ Finally, w3 = p3 − b1w1 − b2w2, where b1 = ⟨p3, w1⟩/⟨w1, w1⟩ = [∫_0^1 x² dx] / [∫_0^1 1 dx] = 1/3 and b2 = ⟨p3, w2⟩/⟨w2, w2⟩ = [∫_0^1 x²(x − 1/2) dx] / [∫_0^1 (x − 1/2)² dx] = (1/12)/(1/12) = 1. Thus, w3 = x² − x + 1/6.

3.2.2 Orthogonal Complements and Orthogonal Projection

• If V is an inner product space, W is a subspace, and v is some vector in V, we would like to study the problem of finding a best approximation of v in W.

◦ For two vectors v and w, the distance between v and w is ||v − w||, so what we are seeking is a vector w in W that minimizes the quantity ||v − w||.

◦ As a particular example, suppose we are given a point P in R² and wish to find the minimal distance from P to a particular line in R². Geometrically, the minimal distance is achieved by the segment PQ, where Q is chosen so that PQ is perpendicular to the line.
◦ In a similar way, the minimal distance between a point in R³ and a given plane is also achieved by the segment perpendicular to the plane.
◦ Both of these problems suggest that the solution to this optimization problem will involve some notion of perpendicularity to the subspace W.

• Definition: Let V be an inner product space. If S is a nonempty subset of V, we say a vector v in V is orthogonal to S if it is orthogonal to every vector in S. The set of all vectors orthogonal to S is denoted S⊥ (S-perpendicular, or often S-perp for short).

◦ We will typically be interested in the case where S is a subspace of V. It is easy to see via the subspace criterion that S⊥ is always a subspace of V, even if S itself is not.
◦ Example: In R³, if W is the xy-plane consisting of all vectors of the form (x, y, 0), then W⊥ is the z-axis, consisting of the vectors of the form (0, 0, z).
◦ Example: In R³, if W is the x-axis consisting of all vectors of the form (x, 0, 0), then W⊥ is the yz-plane, consisting of the vectors of the form (0, y, z).
◦ Example: In any inner product space V, V⊥ = {0} and {0}⊥ = V.

• When V is finite-dimensional, we can use the Gram-Schmidt process to compute an explicit basis of W⊥:
• Theorem (Basis for Orthogonal Complement): Suppose W is a subspace of the finite-dimensional inner product space V, and that S = {e1, . . . , ek} is an orthonormal basis for W. If {e1, . . . , ek, ek+1, . . . , en} is any extension of S to an orthonormal basis for V, then the set {ek+1, . . . , en} is an orthonormal basis for W⊥. In particular, dim(V) = dim(W) + dim(W⊥).

◦ Remark: It is always possible to extend the orthonormal basis S = {e1,..., ek} to an orthonormal basis for V : simply extend the linearly independent set S to a basis {e1,..., ek, xk+1,..., xn} of V , and then apply the Gram-Schmidt process to generate an orthonormal basis {e1,..., ek, ek+1,..., en}.

◦ Proof: For the first statement, the set {ek+1, . . . , en} is orthonormal and hence linearly independent. Since each vector is orthogonal to every vector in S, each of ek+1, . . . , en is in W⊥, and so it remains to show that {ek+1, . . . , en} spans W⊥.
◦ So let v be any vector in W⊥. Since {e1, . . . , ek, ek+1, . . . , en} is an orthonormal basis of V, by the orthogonal decomposition we know that v = ⟨v, e1⟩e1 + ··· + ⟨v, ek⟩ek + ⟨v, ek+1⟩ek+1 + ··· + ⟨v, en⟩en.
◦ But since v is in W⊥, ⟨v, e1⟩ = ··· = ⟨v, ek⟩ = 0: thus, v = ⟨v, ek+1⟩ek+1 + ··· + ⟨v, en⟩en, and therefore v is contained in the span of {ek+1, . . . , en}, as required.
◦ The statement about dimensions follows immediately from our explicit construction of the basis of V as a union of the basis for W and the basis for W⊥.

• Example: If W = span[(1/3)(1, 2, −2), (1/3)(−2, 2, 1)] in R³ with the standard dot product, find a basis for W⊥.

◦ Notice that the vectors e1 = (1/3)(1, 2, −2) and e2 = (1/3)(−2, 2, 1) form an orthonormal basis for W.
◦ It is straightforward to verify that if v3 = (1, 0, 0), then {e1, e2, v3} is a linearly independent set and therefore a basis for R³.

◦ Applying Gram-Schmidt to the set {e1, e2, v3} yields w1 = e1, w2 = e2, and w3 = v3 − ⟨v3, w1⟩w1 − ⟨v3, w2⟩w2 = (1/9)(4, 2, 4).
◦ Normalizing w3 produces the orthonormal basis {e1, e2, e3} for V, with e3 = (1/3)(2, 1, 2).
◦ By the theorem above, we conclude that {e3} = {(1/3)(2, 1, 2)} is an orthonormal basis of W⊥.

◦ Alternatively, we could have computed a basis for W⊥ by observing that dim(W⊥) = dim(V) − dim(W) = 1, and then simply finding one nonzero vector orthogonal to both (1/3)(1, 2, −2) and (1/3)(−2, 2, 1). (For this, we could have either solved the system of equations explicitly, or computed the cross product of the two given vectors.)

• We can give a simpler (although ultimately equivalent) method for finding a basis for W⊥ using matrices:
• Theorem (Orthogonal Complements and Matrices): If A is an m × n matrix, then the rowspace of A and the nullspace of A are orthogonal complements of one another in Rⁿ, with respect to the standard dot product.

◦ Proof: Let A be an m × n matrix, so that the rowspace and nullspace are both subspaces of Rⁿ.
◦ By definition, any vector in rowspace(A) is orthogonal to any vector in nullspace(A), so rowspace(A) ⊆ nullspace(A)⊥ and nullspace(A) ⊆ rowspace(A)⊥.
◦ Furthermore, since dim(rowspace(A)) + dim(nullspace(A)) = n from our results on the respective dimensions of these spaces, we see that dim(rowspace(A)) = dim(nullspace(A)⊥) and dim(nullspace(A)) = dim(rowspace(A)⊥).
◦ Since all these spaces are finite-dimensional, we must therefore have equality everywhere, as claimed.

• From the theorem above, when W is a subspace of Rn with respect to the standard dot product, we can easily compute a basis for W ⊥ by computing the nullspace of the matrix whose rows are a spanning set for W .

◦ Although this method is much faster, it will not generally produce an orthonormal basis of W⊥. It can also be adapted for subspaces of an arbitrary finite-dimensional inner product space, but this requires having an orthonormal basis for the space computed ahead of time.

• Example: If W = span[(1, 1, −1, 1), (1, 2, 0, −2)] in R⁴ with the standard dot product, find a basis for W⊥.
◦ We row-reduce the matrix whose rows are the given basis for W:

    [ 1  1  −1   1 ]  R2−R1   [ 1  1  −1   1 ]  R1−R2   [ 1  0  −2   4 ]
    [ 1  2   0  −2 ]  −−−−→   [ 0  1   1  −3 ]  −−−−→   [ 0  1   1  −3 ]

◦ From the reduced row-echelon form, we see that {(−4, 3, 0, 1), (2, −1, 1, 0)} is a basis for the nullspace and hence of W ⊥.
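
• Numerical sketch (supplementary; this assumes sympy is available): the same W⊥ computed directly as the nullspace of the matrix whose rows span W.

    import sympy as sp

    A = sp.Matrix([[1, 1, -1, 1],
                   [1, 2, 0, -2]])
    print(A.nullspace())    # two vectors spanning W⊥, namely (2, -1, 1, 0) and (-4, 3, 0, 1)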

• As we might expect from geometric intuition, if W is a subspace of the (finite-dimensional) inner product space V, we can decompose any vector uniquely as the sum of a component in W with a component in W⊥:
• Theorem (Orthogonal Components): Let V be an inner product space and W be a finite-dimensional subspace. Then every vector v ∈ V can be uniquely written in the form v = w + w⊥ for some w ∈ W and w⊥ ∈ W⊥, and furthermore, we have the Pythagorean relation ||v||² = ||w||² + ||w⊥||².

◦ Proof: First, we show that such a decomposition exists. Since W is finite-dimensional, it has some orthonormal basis {e1, . . . , ek}.
◦ Now set w = ⟨v, e1⟩e1 + ⟨v, e2⟩e2 + ··· + ⟨v, ek⟩ek, and then w⊥ = v − w.
◦ Clearly w ∈ W and v = w + w⊥, so we need only check that w⊥ ∈ W⊥.

◦ To see this, first observe that ⟨w, ei⟩ = ⟨v, ei⟩ since {e1, . . . , ek} is an orthonormal basis. Then, we see that ⟨w⊥, ei⟩ = ⟨v − w, ei⟩ = ⟨v, ei⟩ − ⟨w, ei⟩ = 0. Thus, w⊥ is orthogonal to each vector in the orthonormal basis of W, so it is in W⊥.
◦ For the uniqueness, suppose we had two decompositions v = w1 + w1⊥ and v = w2 + w2⊥.
◦ By subtracting and rearranging, we see that w1 − w2 = w2⊥ − w1⊥. Denoting this common vector by x, we see that x is in both W and W⊥: thus, x is orthogonal to itself, but the only such vector is the zero vector. Thus, w1 = w2 and w1⊥ = w2⊥, so the decomposition is unique.
◦ For the last statement, since ⟨w, w⊥⟩ = 0, we have ||v||² = ⟨v, v⟩ = ⟨w + w⊥, w + w⊥⟩ = ⟨w, w⟩ + ⟨w⊥, w⊥⟩ = ||w||² + ||w⊥||², as claimed.

• Definition: If V is an inner product space and W is a finite-dimensional subspace with orthonormal basis {e1, . . . , ek}, the orthogonal projection of v into W is the vector proj_W(v) = ⟨v, e1⟩e1 + ⟨v, e2⟩e2 + ··· + ⟨v, ek⟩ek.

◦ If instead we only have an orthogonal basis {u1, . . . , uk} of W, the corresponding expression is instead proj_W(v) = [⟨v, u1⟩/⟨u1, u1⟩] u1 + [⟨v, u2⟩/⟨u2, u2⟩] u2 + ··· + [⟨v, uk⟩/⟨uk, uk⟩] uk.

• Example: For W = span[(1, 0, 0), (1/5)(0, 3, 4)] in R³ under the standard dot product, compute the orthogonal projection of v = (1, 2, 1) into W, and verify the relation ||v||² = ||w||² + ||w⊥||².

◦ Notice that the vectors e1 = (1, 0, 0) and e2 = (1/5)(0, 3, 4) form an orthonormal basis for W.
◦ Thus, the orthogonal projection is w = proj_W(v) = ⟨v, e1⟩e1 + ⟨v, e2⟩e2 = 1·(1, 0, 0) + 2·(0, 3/5, 4/5) = (1, 6/5, 8/5).
◦ We see that w⊥ = v − w = (0, 4/5, −3/5) is orthogonal to both (1, 0, 0) and (0, 3/5, 4/5), so it is indeed in W⊥. Furthermore, ||v||² = 6, while ||w||² = 5 and ||w⊥||² = 1, so indeed ||v||² = ||w||² + ||w⊥||².
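
• Numerical sketch (supplementary; this assumes NumPy is available): the projection formula proj_W(v) = ⟨v, e1⟩e1 + ⟨v, e2⟩e2 applied to the example above, together with the Pythagorean check.

    import numpy as np

    v = np.array([1.0, 2.0, 1.0])
    e1 = np.array([1.0, 0.0, 0.0])
    e2 = np.array([0.0, 3/5, 4/5])

    w = np.dot(v, e1) * e1 + np.dot(v, e2) * e2
    w_perp = v - w
    print(w, w_perp)                              # (1, 6/5, 8/5) and (0, 4/5, -3/5)
    print(np.isclose(np.dot(v, v), np.dot(w, w) + np.dot(w_perp, w_perp)))   # ||v||^2 = ||w||^2 + ||w_perp||^2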

• The orthogonal projection gives the answer to the approximation problem we posed earlier:
• Corollary (Best Approximations): If W is a finite-dimensional subspace of the inner product space V, then for any vector v in V, the projection of v into W is closer to v than any other vector in W. Explicitly, if w is the projection, then for any w0 ∈ W, we have ||v − w|| ≤ ||v − w0||, with equality if and only if w = w0.

◦ Proof: By the theorem on orthogonal complements, we can write v = w + w⊥ where w ∈ W and w⊥ ∈ W⊥. Now, for any other vector w0 ∈ W, we can write v − w0 = (v − w) + (w − w0), and observe that v − w = w⊥ is in W⊥, and w − w0 is in W (since both w and w0 are, and W is a subspace).
◦ Thus, v − w0 = (v − w) + (w − w0) is a decomposition of v − w0 into orthogonal vectors. Taking norms, we see that ||v − w0||² = ||v − w||² + ||w − w0||².
◦ Then, if w0 ≠ w, since the norm ||w − w0|| is positive, we conclude that ||v − w|| < ||v − w0||.

• Example: Find the best approximation to v = (3, −3, 3) lying in the subspace W = span[(1/3)(1, 2, −2), (1/3)(−2, 2, 1)], where distance is measured under the standard dot product.

◦ Notice that the vectors e1 = (1/3)(1, 2, −2) and e2 = (1/3)(−2, 2, 1) form an orthonormal basis for W.
◦ Thus, the desired vector, the orthogonal projection, is proj_W(v) = ⟨v, e1⟩e1 + ⟨v, e2⟩e2 = −3e1 − 3e2 = (1, −4, 1).

• Example: Find the best approximation to the function f(x) = sin(πx/2) that lies in the subspace P2(R), under the inner product ⟨f, g⟩ = ∫_{−1}^{1} f(x)g(x) dx.

◦ First, by applying Gram-Schmidt to the basis {1, x, x²}, we can generate an orthogonal basis of P2(R) under this inner product: the result (after rescaling to eliminate denominators) is {1, x, 3x² − 1}.
◦ Now, with p1 = 1, p2 = x, p3 = 3x² − 1, we can compute ⟨f, p1⟩ = 0, ⟨f, p2⟩ = 8/π², ⟨f, p3⟩ = 0, and also ⟨p1, p1⟩ = 2, ⟨p2, p2⟩ = 2/3, and ⟨p3, p3⟩ = 8/5.
◦ Thus, the desired orthogonal projection is proj_{P2(R)}(f) = [⟨f, p1⟩/⟨p1, p1⟩] p1 + [⟨f, p2⟩/⟨p2, p2⟩] p2 + [⟨f, p3⟩/⟨p3, p3⟩] p3 = (12/π²) x.
◦ We can see from a plot of the orthogonal projection and f on [−1, 1] that this line is a reasonably accurate approximation of sin(πx/2) on the interval [−1, 1].

• Example: Find the linear polynomial p(x) that minimizes the expression ∫_0^1 (p(x) − e^x)² dx.

◦ Observe that the minimization problem is asking us to find the orthogonal projection of e^x into P1(R) under the inner product ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx.
◦ First, by applying Gram-Schmidt to the basis {1, x}, we can generate an orthogonal basis of P1(R) under this inner product: the result (after rescaling to clear denominators) is {1, 2x − 1}.
◦ Now, with p1 = 1 and p2 = 2x − 1, we can compute ⟨e^x, p1⟩ = e − 1, ⟨e^x, p2⟩ = 3 − e, ⟨p1, p1⟩ = 1, and ⟨p2, p2⟩ = 1/3.
◦ Then proj_{P1(R)}(e^x) = [⟨e^x, p1⟩/⟨p1, p1⟩] p1 + [⟨e^x, p2⟩/⟨p2, p2⟩] p2 = (4e − 10) + (18 − 6e)x ≈ 0.873 + 1.690x.
◦ We can see from a plot of the orthogonal projection polynomial and e^x on [0, 1] that this line is indeed a very accurate approximation of e^x on the interval [0, 1].

3.3 Applications of Inner Products

• In this section we discuss several practical applications of inner products and orthogonality.

3.3.1 Least-Squares Estimates

• A fundamental problem in applied mathematics and statistics is data fitting: finding a model that well approximates some set of experimental data. Problems of this type are ubiquitous in the physical sciences, social sciences, life sciences, and engineering.

◦ A common example is that of finding a regression line: a line y = mx + b that best fits a set of 2-dimensional data points {(x1, y1), . . . , (xn, yn)} when plotted in the plane.
◦ Of course, in many cases a linear model is not appropriate, and other types of models (polynomials, powers, exponential functions, logarithms, etc.) are needed instead.
◦ The most common approach to such regression analysis is the method of least squares, which minimizes the sum of the squared errors (the error being the difference between the model and the actual data).

• As we will discuss, many of these questions ultimately reduce to the following: if A is an m × n matrix such that the matrix equation Ax = c has no solution, what vector xˆ is the closest approximation to a solution?

◦ In other words, we are asking for the vector xˆ that minimizes the vector norm ||Axˆ − c||. ◦ Since the vectors of the form Axˆ are precisely those in the column space of A, from our analysis of best approximations in the previous section we see that the vector w = Axˆ will be the projection of c into the column space of A. ◦ Then, by orthogonal decomposition, we know that w⊥ = c − Axˆ is in the orthogonal complement of the column space of A.

◦ Since the column space of A is the same as the row space of A^T, by our theorem on orthogonal complements we know that the orthogonal complement of the column space of A is the nullspace of A^T.
◦ Therefore, w⊥ is in the nullspace of A^T, so A^T w⊥ = 0.
◦ Explicitly, this means A^T(c − Ax̂) = 0, or A^T A x̂ = A^T c: this is an explicit matrix system that we can solve for x̂.

• Definition: If A is an m × n matrix with m > n, a least-squares solution to the matrix equation Ax = c is a vector x̂ satisfying A^T A x̂ = A^T c.

◦ The system A^T A x̂ = A^T c for x̂ is always consistent for any matrix A, although it is possible for there to be infinitely many solutions (a trivial case would be when A is the zero matrix). Even in this case, the orthogonal projection w = Ax̂ onto the column space of A will always be unique.

• In typical cases, the rank of A is often equal to n. In this case, the matrix A^T A will always be invertible, and there is a unique least-squares solution:

• Proposition (Least-Squares Solution): If A is an m × n matrix and rank(A) = n, then A^T A is invertible and the unique least-squares solution to Ax = c is x̂ = (A^T A)^{−1} A^T c.

◦ Proof: Suppose that the rank of A is equal to n.
◦ We claim that the nullspace of A^T A is the same as the nullspace of A. Clearly, if Ax = 0 then (A^T A)x = 0, so it remains to show that if (A^T A)x = 0 then Ax = 0.
◦ Notice that the dot product x · y is the same as the matrix product y^T x, if x and y are column vectors.
◦ Now suppose that (A^T A)x = 0: then x^T(A^T A)x = 0, or equivalently, (Ax)^T(Ax) = 0.
◦ But this means (Ax) · (Ax) = 0, so that the dot product of Ax with itself is zero.
◦ Since the dot product is an inner product, this means Ax must itself be zero, as required.
◦ Thus, the nullspace of A^T A is the same as the nullspace of A. Since the dimension of the nullspace is the number of columns minus the rank, and A^T A and A both have n columns, rank(A^T A) = rank(A) = n.
◦ But since A^T A is an n × n matrix, this means A^T A is invertible. The second statement then follows immediately upon left-multiplying the normal equation A^T A x̂ = A^T c by (A^T A)^{−1}.

• Example: Find the least-squares solution to the inconsistent system x + 2y = 3, 2x + y = 4, x + y = 2.

 1 2   3  ◦ In this case, we have A =  2 1  and c =  4 . Since A clearly has rank 2, AT A will be invertible 1 1 2 and there will be a unique least-squares solution.  6 5  1  6 −5  ◦ We compute AT A = , which is indeed invertible and has inverse (AT A)−1 = . 5 6 11 −5 6  3  ◦ The least-squares solution is therefore xˆ = (AT A)−1AT c = . −1

 2   2  ◦ In this case, we see Axˆ =  5 , so the error vector is c − Axˆ =  1 . Our analysis above indicates 2 0 that this error vector has the smallest possible norm.

• We can apply these ideas to the problem of finding an optimal model for a set of data points.

◦ For example, suppose that we wanted to find a linear model y = mx + b that fits a set of data points {(x1, y1), . . . , (xn, yn)}, in such a way as to minimize the sum of the squared errors (y1 − mx1 − b)² + ··· + (yn − mxn − b)².
◦ If the data points happened to fit exactly on a line, then we would be seeking the solution to the system y1 = mx1 + b, y2 = mx2 + b, . . . , yn = mxn + b.

◦ In matrix form, this is the system Ax = c, where A is the n × 2 matrix whose rows are (1, x1), (1, x2), . . . , (1, xn), x = (b, m), and c = (y1, y2, . . . , yn).
◦ Of course, due to experimental errors and other random noise, it is unlikely for the data points to fit the model exactly. Instead, the least-squares estimate x̂ will provide the values of m and b that minimize the sum of the squared errors.
◦ In a similar way, to find a quadratic model y = ax² + bx + c for a data set {(x1, y1), . . . , (xn, yn)}, we would use the least-squares estimate for Ax = c, with A the n × 3 matrix whose rows are (1, xi, xi²), x = (c, b, a), and c = (y1, y2, . . . , yn).

◦ In general, to find a least-squares model of the form y = a1f1(x) + ··· + amfm(x) for a data set {(x1, y1), . . . , (xn, yn)}, we would want the least-squares estimate for the system Ax = c, with A the n × m matrix whose (i, j)-entry is fj(xi), x = (a1, . . . , am), and c = (y1, . . . , yn).

• Example: Use least-squares estimation to find the line y = mx + b that is the best model for the data points {(9, 24), (15, 45), (21, 49), (25, 55), (30, 60)}.

◦ We seek the least-squares solution for Ax = c, where A = [1 9; 1 15; 1 21; 1 25; 1 30] (rows listed in order), x = (b, m), and c = (24, 45, 49, 55, 60).
◦ We compute A^T A = [5 100; 100 2272], so the least-squares solution is x̂ = (A^T A)^{−1} A^T c ≈ (14.615, 1.599).
◦ Thus, to three decimal places, the desired line is y = 1.599x + 14.615. From a plot, we can see that this line is fairly close to all of the data points.

• Example: Use least-squares estimation to find the quadratic function y = ax² + bx + c best modeling the data points {(−2, 19), (−1, 7), (0, 4), (1, 2), (2, 7)}.

◦ We seek the least-squares solution for Ax = c, with A = [1 −2 4; 1 −1 1; 1 0 0; 1 1 1; 1 2 4] (rows listed in order), x = (c, b, a), and c = (19, 7, 4, 2, 7).

◦ We compute A^T A = [5 0 10; 0 10 0; 10 0 34], so the least-squares solution is x̂ = (A^T A)^{−1} A^T c = (2.8, −2.9, 2.5).

◦ Thus, the desired quadratic polynomial is y = 2.5x² − 2.9x + 2.8. From a plot, we can see that this quadratic function is fairly close to all of the data points.

• Example: Use least-squares estimation to find the trigonometric function y = a + b sin(x) + c cos(x) best modeling the data points {(π/2, 8), (π, −4), (3π/2, 2), (2π, 10)}.

◦ We seek the least-squares solution for Ax = c, with A = [1 1 0; 1 0 −1; 1 −1 0; 1 0 1] (rows listed in order), x = (a, b, c), and c = (8, −4, 2, 10).
◦ We compute A^T A = [4 0 0; 0 2 0; 0 0 2], so the least-squares solution is x̂ = (A^T A)^{−1} A^T c = (4, 3, 7).
◦ Thus, the desired function is y = 4 + 3 sin(x) + 7 cos(x). In this case, the model predicts the points {(π/2, 7), (π, −3), (3π/2, 1), (2π, 11)}, so it is a good fit to the original data.

3.3.2 Fourier Series

• Another extremely useful application of the general theory of orthonormal bases is that of Fourier series.

◦ Fourier analysis, broadly speaking, studies the problem of approximating a function on an interval by trigonometric functions. This problem is very similar to the question, studied in calculus, of approximating a function by a polynomial (the typical method is to use Taylor polynomials, although as we have already discussed, least-squares estimates provide another potential avenue).
◦ Fourier series have a tremendously wide variety of applications: solving partial differential equations (in particular, the famous wave equation and heat equation), studying acoustics and optics (decomposing an acoustic or optical waveform into simpler waves of particular frequencies), electrical engineering, and quantum mechanics.

• Although a full discussion of Fourier series belongs more properly to analysis, we can still provide an overview.

◦ A typical scenario in Fourier analysis is to approximate a continuous function on [0, 2π] using a trigonometric polynomial: a function that is a polynomial in sin(x) and cos(x).
◦ Using trigonometric identities, this question is equivalent to approximating a function f(x) by a (finite) Fourier series of the form s(x) = a0 + b1 cos(x) + b2 cos(2x) + ··· + bk cos(kx) + c1 sin(x) + c2 sin(2x) + ··· + ck sin(kx).
◦ Notice that, in the expression above, s(0) = s(2π) since each function in the sum has period 2π. Thus, we can only realistically hope to get close approximations to functions satisfying f(0) = f(2π).

• Let V be the vector space of continuous, real-valued functions on the interval [0, 2π] having equal values at 0 and 2π, and define an inner product on V via ⟨f, g⟩ = ∫_0^{2π} f(x)g(x) dx.

• Proposition: The functions {ϕ0, ϕ1, ϕ2, . . . } are an orthonormal set on V, where ϕ0(x) = 1/√(2π), and ϕ_{2k−1}(x) = (1/√π) cos(kx) and ϕ_{2k}(x) = (1/√π) sin(kx) for each k ≥ 1.

◦ Proof: Using the product-to-sum identities, such as sin(ax) sin(bx) = (1/2)[cos((a − b)x) − cos((a + b)x)], it is a straightforward exercise in integration to verify that ⟨ϕi, ϕj⟩ = 0 for each i ≠ j.
◦ Furthermore, we have ⟨ϕ0, ϕ0⟩ = (1/2π) ∫_0^{2π} 1 dx = 1, ⟨ϕ_{2k−1}, ϕ_{2k−1}⟩ = (1/π) ∫_0^{2π} cos²(kx) dx = 1, and ⟨ϕ_{2k}, ϕ_{2k}⟩ = (1/π) ∫_0^{2π} sin²(kx) dx = 1. Thus, the set is orthonormal.

• If it were the case that S = {ϕ0, ϕ1, ϕ2, . . . } were an orthonormal basis for V, then, given any other function f(x) in V, we could write f as a linear combination of functions in {ϕ0, ϕ1, ϕ2, . . . }, where we can compute the appropriate coefficients using the inner product on V.

◦ Unfortunately, S does not span V: we cannot, for example, write the function g(x) = Σ_{n=1}^{∞} (1/2ⁿ) sin(nx) as a finite linear combination of {ϕ0, ϕ1, ϕ2, . . . }, since doing so would require each of the infinitely many terms in the sum.
◦ Ultimately, the problem, as exemplified by the function g(x) above, is that the definition of basis only allows us to write down finite linear combinations.
◦ On the other hand, the finite sums Σ_{j=0}^{k} aj ϕj(x) for k ≥ 0, where aj = ⟨f, ϕj⟩, will represent the best approximation to f(x) inside the subspace of V spanned by {ϕ0, ϕ1, . . . , ϕk}. Furthermore, as we increase k, we are taking approximations to f that lie inside larger and larger subspaces of V, so as we take k → ∞, these partial sums will yield better and better approximations to f.
◦ Provided that f is a sufficiently nice function, it can be proven that in the limit, our formulas for the coefficients do give a formula for f(x) as an infinite sum:

• Theorem (Fourier Series): Let f(x) be a twice-differentiable function on [0, 2π] satisfying f(0) = f(2π), and define the Fourier coefficients of f as aj = ⟨f, ϕj⟩ = ∫_0^{2π} f(x)ϕj(x) dx, for the trigonometric functions ϕj(x) defined above. Then f(x) is equal to its Fourier series Σ_{j=0}^{∞} aj ϕj(x) for every x in [0, 2π].

◦ This result can be interpreted as a limiting version of the theorem we stated earlier giving the coefficients for the linear combination of a vector in terms of an orthonormal basis: it gives an explicit way to write the function f(x) as an infinite linear combination of the orthonormal basis elements {ϕ0, ϕ1, ϕ2, . . . }.

• Example: Compute the Fourier coefficients and Fourier series for f(x) = (x − π)² on the interval [0, 2π], and compare the partial sums of the Fourier series to the original function.

◦ First, we have a0 = ∫_0^{2π} f(x)·(1/√(2π)) dx = 2π³/(3√(2π)) = (√2/3) π^{5/2}.
◦ For the cosine coefficients, after integrating by parts twice, we have a_{2k−1} = ∫_0^{2π} f(x)·(1/√π) cos(kx) dx = 4√π/k².
◦ For the sine coefficients, in a similar manner we see a_{2k} = ∫_0^{2π} f(x)·(1/√π) sin(kx) dx = 0.
◦ Therefore, the Fourier series for f(x) is π²/3 + Σ_{k=1}^{∞} (4/k²) cos(kx).

◦ Here are some plots of the partial sums (up to the term involving cos(nx)) of the Fourier series along with f. As is clearly visible from the graphs, the partial sums give increasingly close approximations to the original function f(x) as we sum more terms.
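
• Numerical sketch (supplementary; this assumes NumPy/SciPy are available): recomputing a few Fourier coefficients of f(x) = (x − π)² numerically and comparing them with the closed forms 4√π/k² (cosine) and 0 (sine).

    import numpy as np
    from scipy.integrate import quad

    f = lambda x: (x - np.pi) ** 2

    for k in range(1, 4):
        cos_coeff = quad(lambda x: f(x) * np.cos(k * x) / np.sqrt(np.pi), 0, 2 * np.pi)[0]
        sin_coeff = quad(lambda x: f(x) * np.sin(k * x) / np.sqrt(np.pi), 0, 2 * np.pi)[0]
        print(k, cos_coeff, 4 * np.sqrt(np.pi) / k ** 2, sin_coeff)   # sine coefficients ≈ 0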

3.4 Linear Transformations and Inner Products

• We will now study linear transformations T : V → W where V and W are (finite-dimensional) vector spaces, at least one of which is equipped with an inner product.

• Most of our results will use the fact that inner products are linear in their first coordinate: ⟨v1 + v2, w⟩ = ⟨v1, w⟩ + ⟨v2, w⟩ and ⟨cv, w⟩ = c⟨v, w⟩.

3.4.1 Characterizations of Inner Products

• First, if we have an orthonormal basis for W , we can use it to compute the entries in the matrix associated to T .

• Proposition (Entries of Associated Matrix): Suppose T : V → W is a linear transformation of finite-dimensional vector spaces, and W is an inner product space. If β = {v1, . . . , vn} is an ordered basis for V and γ = {e1, . . . , em} is an orthonormal ordered basis for W, then the (i, j)-entry of [T]^γ_β is ⟨T(vj), ei⟩ for each 1 ≤ i ≤ m and 1 ≤ j ≤ n.

◦ Proof: By our results on vector spaces with an orthonormal basis, we know that T(vj) = Σ_{i=1}^{m} ai ei, where ai = ⟨T(vj), ei⟩.

◦ But this is precisely the statement that the ith entry of the coefficient vector [T(vj)]_γ, which is in turn the (i, j)-entry of [T]^γ_β, is ⟨T(vj), ei⟩.

• Example: Let T : P2(R) → R³ be defined by T(p) = ⟨p(0), p(1), p(2)⟩. Find the matrix associated to T with respect to the bases β = {1 − 2x + x², 2x + x², 3 + x − x²} and γ = {(1/3)⟨1, 2, 2⟩, (1/3)⟨2, 1, −2⟩, (1/3)⟨−2, 2, −1⟩} of P2(R) and R³ respectively.

◦ Notice that γ is an orthonormal basis of R³, so we need only compute the appropriate inner products.
◦ Since T(1 − 2x + x²) = ⟨1, 0, 1⟩, we compute ⟨1, 0, 1⟩ · (1/3)⟨1, 2, 2⟩ = 1, ⟨1, 0, 1⟩ · (1/3)⟨2, 1, −2⟩ = 0, and ⟨1, 0, 1⟩ · (1/3)⟨−2, 2, −1⟩ = −1.
◦ Next, since T(2x + x²) = ⟨0, 3, 8⟩, we compute ⟨0, 3, 8⟩ · (1/3)⟨1, 2, 2⟩ = 22/3, ⟨0, 3, 8⟩ · (1/3)⟨2, 1, −2⟩ = −13/3, and ⟨0, 3, 8⟩ · (1/3)⟨−2, 2, −1⟩ = −2/3.
◦ Finally, since T(3 + x − x²) = ⟨3, 3, 1⟩, we compute ⟨3, 3, 1⟩ · (1/3)⟨1, 2, 2⟩ = 11/3, ⟨3, 3, 1⟩ · (1/3)⟨2, 1, −2⟩ = 7/3, and ⟨3, 3, 1⟩ · (1/3)⟨−2, 2, −1⟩ = −1/3.
◦ Thus, [T]^γ_β = [1 22/3 11/3; 0 −13/3 7/3; −1 −2/3 −1/3].

• As we noted above, for a fixed w in V, the function T : V → F defined by T(v) = ⟨v, w⟩ is a linear transformation. In general, a linear transformation T : V → F from a vector space to its scalar field is called a linear functional.

◦ Perhaps surprisingly, when V is a finite-dimensional inner product space, it turns out that every linear transformation T : V → F is of the form above.

• Theorem (Riesz Representation Theorem): If V is a finite-dimensional inner product space with scalar field F, and T : V → F is linear, then there exists a unique vector w in V such that T(v) = ⟨v, w⟩ for every v in V.

◦ Proof: Let {e1, . . . , en} be an orthonormal basis for V, and let w = \overline{T(e1)}e1 + \overline{T(e2)}e2 + ··· + \overline{T(en)}en (the bars denoting complex conjugates).
◦ We claim that the linear transformation R(v) = ⟨v, w⟩ is equal to T.
◦ For each i, R(ei) = ⟨ei, \overline{T(e1)}e1 + \overline{T(e2)}e2 + ··· + \overline{T(en)}en⟩ = ⟨ei, \overline{T(ei)}ei⟩ = T(ei)⟨ei, ei⟩ = T(ei).

◦ But now, since linear transformations are characterized by their values on a basis, the fact that R(ei) = T(ei) for each i implies that R(v) = T(v) for each v in V, as claimed.
◦ For uniqueness, if ⟨v, w⟩ = ⟨v, w0⟩ for all v, then we would have ⟨v, w − w0⟩ = 0 for all v. Setting v = w − w0 yields ⟨w − w0, w − w0⟩ = 0, meaning that w − w0 = 0, so that w = w0.

• We can also define a matrix associated to an inner product in a similar way to how we define a matrix associated to a linear transformation.

• Definition: If V is a finite-dimensional inner product space with ordered bases β and γ, we define the matrix M^γ_β associated to the inner product on V to have (i, j)-entry equal to ⟨βj, γi⟩.

◦ Note that if β = γ is an orthonormal basis for V, then the associated matrix will be the identity matrix.

• The main reason we use this definition is the following result, which is the inner-product analogue of the structure theorem for finite-dimensional vector spaces we proved earlier:

• Theorem (Inner Products and Dot Products): If V is a finite-dimensional inner product space over F with ordered bases β and γ, and associated matrix M^γ_β, then for any vectors v and w in V, we have ⟨v, w⟩ = ⟨M^γ_β [v]_β, [w]_γ⟩, where the second inner product is the standard inner product on Fⁿ.

◦ If F = R, then the standard inner product is the standard dot product x · y, while if F = C then the standard inner product is the modified dot product x · ȳ.

◦ This theorem therefore says that the inner product structure of V is essentially the same as the (modified) dot product on Fⁿ, up to a factor of the matrix M^γ_β. In particular, if we choose β = γ to be an orthonormal basis, then the matrix factor is simply the identity matrix.
◦ Proof: Suppose β = {v1, . . . , vn} and γ = {w1, . . . , wn}, and take v = Σ_{i=1}^{n} ai vi and w = Σ_{j=1}^{n} bj wj for scalars ai and bj.
◦ Then by properties of inner products, ⟨v, w⟩ = ⟨Σ_{i=1}^{n} ai vi, Σ_{j=1}^{n} bj wj⟩ = Σ_{i=1}^{n} Σ_{j=1}^{n} ai \overline{bj} ⟨vi, wj⟩.
◦ Also, since [v]_β = (a1, . . . , an), we compute that M^γ_β [v]_β is the column vector whose ith entry is Σ_{k=1}^{n} ak ⟨vk, wi⟩.
◦ So, since [w]_γ = (b1, . . . , bn), we see ⟨M^γ_β [v]_β, [w]_γ⟩ = Σ_{j=1}^{n} \overline{bj} [Σ_{i=1}^{n} ai ⟨vi, wj⟩] = Σ_{i=1}^{n} Σ_{j=1}^{n} ai \overline{bj} ⟨vi, wj⟩.
◦ This expression is equal to the one we gave above, so we are done.

3.4.2 The Adjoint of a Linear Transformation

• As we have already noted, an inner product is linear in its first coordinate. Thus, if T : V → W is a linear transformation, the composition ⟨T(v), w⟩ will also be linear in its first coordinate, and as we have already seen, such compositions play an important role in the structure of an inner product.

• Theorem (Adjoint Transformations): Suppose that V and W are finite-dimensional inner product spaces with inner products ⟨·, ·⟩_V and ⟨·, ·⟩_W respectively, and that T : V → W is linear. Then there exists a unique function T* : W → V such that ⟨T(v), w⟩_W = ⟨v, T*(w)⟩_V for all v in V and w in W, and in fact T* is also a linear transformation.

◦ Proof: For a fixed vector w, the map R_w(v) = ⟨T(v), w⟩_W is a linear transformation R_w : V → F, since it is the composition of the linear transformations ⟨·, w⟩_W and T.
◦ Now, since V is finite-dimensional, by the Riesz representation theorem, there exists a vector w0 in V such that R_w(v) = ⟨v, w0⟩_V for every v in V.
◦ Therefore, if we define T*(w) to be this vector w0, we conclude that ⟨T(v), w⟩_W = ⟨v, T*(w)⟩_V for all v in V and w in W.
◦ For uniqueness, if we had another function T0* with ⟨T(v), w⟩_W = ⟨v, T0*(w)⟩_V, then we would have ⟨v, T*(w) − T0*(w)⟩_V = ⟨v, T*(w)⟩_V − ⟨v, T0*(w)⟩_V = ⟨T(v), w⟩_W − ⟨T(v), w⟩_W = 0, so setting v = T*(w) − T0*(w) immediately gives T*(w) = T0*(w).
◦ Finally, to see that T* is linear, observe that

    ⟨v, T*(w1 + w2)⟩_V = ⟨T(v), w1 + w2⟩_W = ⟨T(v), w1⟩_W + ⟨T(v), w2⟩_W = ⟨v, T*(w1) + T*(w2)⟩_V
    ⟨v, T*(cw)⟩_V = ⟨T(v), cw⟩_W = c̄⟨T(v), w⟩_W = c̄⟨v, T*(w)⟩_V = ⟨v, cT*(w)⟩_V

and so by the uniqueness of T* we must have T*(w1 + w2) = T*(w1) + T*(w2) and T*(cw) = cT*(w).

• Definition: If V and W are inner product spaces and T : V → W is linear, then, if it exists, the function T* : W → V with the property that ⟨T(v), w⟩_W = ⟨v, T*(w)⟩_V for all v in V and w in W is called the adjoint of T.

◦ The theorem above guarantees the existence and uniqueness of this map T* when the inner product spaces are finite-dimensional.
◦ When V and W are infinite-dimensional, there may not exist such a map T*.
◦ It is not so easy to give an explicit counterexample, but if V is the set of infinite sequences with only finitely many nonzero terms, and T is the transformation that maps the vector en to e1 + e2 + ··· + en, then T has no adjoint.

◦ If T does have an adjoint, however, the proof above shows that it is unique, and in any case, even when T* exists, it is not so obvious how to compute T* explicitly.

• Example: With V = C² with the standard inner product and T : V → V the linear transformation with T(x, y) = (ax + by, cx + dy), find the adjoint map T*.

◦ Let {e1, e2} = {⟨1, 0⟩, ⟨0, 1⟩} be the standard orthonormal basis for V. Since a linear transformation is characterized by its values on a basis, it is enough to find T*(e1) and T*(e2).
◦ By definition, ⟨(x, y), T*(e1)⟩ = ⟨T(x, y), e1⟩ = ⟨(ax + by, cx + dy), e1⟩ = ax + by.
◦ Therefore, T*(e1) must be the vector (ā, b̄), since that is the only vector with the property that ⟨(x, y), T*(e1)⟩ = ax + by for arbitrary x and y.
◦ Similarly, ⟨(x, y), T*(e2)⟩ = ⟨T(x, y), e2⟩ = ⟨(ax + by, cx + dy), e2⟩ = cx + dy, so by the same argument as above, T*(e2) must be the vector (c̄, d̄).

◦ Then we see T*(x, y) = xT*(e1) + yT*(e2) = (āx + c̄y, b̄x + d̄y).
◦ Notice here that the matrix associated to T is [T]^β_β = [a b; c d], while the matrix associated to T* is [T*]^β_β = [ā c̄; b̄ d̄], the conjugate-transpose of the matrix associated to T.
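
• Numerical sketch (supplementary; this assumes NumPy is available): checking that the conjugate-transpose matrix satisfies the adjoint identity ⟨T(v), w⟩ = ⟨v, T*(w)⟩ for the standard inner product on C².

    import numpy as np

    rng = np.random.default_rng(2)
    M = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))   # matrix of T
    v = rng.standard_normal(2) + 1j * rng.standard_normal(2)
    w = rng.standard_normal(2) + 1j * rng.standard_normal(2)

    inner = lambda a, b: np.vdot(b, a)    # <a, b> = sum a_i * conj(b_i): linear in the first argument
    print(np.isclose(inner(M @ v, w), inner(v, M.conj().T @ w)))         # True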

• It seems, based on the example above, that T* is related to the conjugate-transpose of a matrix. In fact, this is true in general (which justifies our use of the same notation for both):

• Proposition (Adjoint Matrix): If V and W are finite-dimensional inner product spaces with orthonormal bases β and γ respectively, and T : V → W is linear, then the matrix [T*]^β_γ associated to the adjoint T* is ([T]^γ_β)*, the conjugate-transpose of the matrix associated to T.

◦ Proof: Let β = {v1, . . . , vn} and γ = {w1, . . . , wm}.
◦ From our results on associated matrices in inner product spaces, the (i, j)-entry of [T*]^β_γ is ⟨T*(wj), vi⟩ = \overline{⟨vi, T*(wj)⟩} = \overline{⟨T(vi), wj⟩}.
◦ Now notice that the quantity ⟨T(vi), wj⟩ is the (j, i)-entry of [T]^γ_β: thus, [T*]^β_γ = ([T]^γ_β)*, as claimed.

• Here are a few basic properties of the adjoint, which each follow via the uniqueness property:

1. If S : V → W and T : V → W are linear, then (S + T)* = S* + T*.
◦ Proof: Observe that ⟨v, (S + T)*w⟩ = ⟨(S + T)v, w⟩ = ⟨S(v) + T(v), w⟩ = ⟨v, S*(w) + T*(w)⟩. Thus, since (S + T)* is unique, it must be equal to S* + T*.
2. If T : V → W is linear, then (cT)* = c̄ T*.
◦ Proof: Observe that ⟨v, (cT)*w⟩ = ⟨(cT)v, w⟩ = c⟨T(v), w⟩ = ⟨v, c̄ T*(w)⟩, so by uniqueness, (cT)* = c̄ T*.
3. If S : V → W and T : U → V are linear, then (ST)* = T*S*.
◦ Proof: Observe that ⟨u, (ST)*w⟩ = ⟨(ST)u, w⟩ = ⟨T(u), S*w⟩ = ⟨u, T*S*(w)⟩, so by uniqueness, (ST)* = T*S*.
4. If T : V → W is linear, then (T*)* = T.
◦ Proof: Observe that ⟨T*(v), w⟩ = \overline{⟨w, T*(v)⟩} = \overline{⟨T(w), v⟩} = ⟨v, T(w)⟩, so by uniqueness, (T*)* = T.

• As immediate corollaries, we can also deduce the analogous properties of conjugate-transpose matrices. (Of course, with the exception of the statement that (AB)* = B*A*, these are all immediate from the definition.)

Well, you're at the end of my handout. Hope it was helpful. Copyright notice: This material is copyright Evan Dummit, 2012-2020. You may not reproduce or distribute this material without my express permission.
