
A Review of the Fundamentals of Linear Algebra

Philip Etter

Contents

1 The Basics
  1.1 Vector Spaces
  1.2 Vector Spans
  1.3 Linear Independence
  1.4 Bases and Dimension
  1.5 Subspaces
  1.6 Vector Norms

2 Linear Maps
  2.1 Image, Kernel, and Rank
  2.2 Specifying Linear Maps
  2.3 Inverses
  2.4 Matrix Representations of Linear Maps
  2.5 Change of Basis Formulas for Linear Maps
  2.6 Operator Norms for Linear Maps
    2.6.1 Important Norm Inequalities

3 Multilinear Forms
  3.1 Matrix Representations of Bilinear Forms
  3.2 Symmetric Forms
  3.3 Anti-Symmetric Forms, Volume, and Determinants
    3.3.1 The Cofactor Formula for Matrix Inverses
  3.4 Inner Products
    3.4.1 Orthonormal Bases
    3.4.2 Adjoints
  3.5 Change of Basis Formula for Bilinear Forms

4 The Algebraic Structure of Linear Maps
  4.1 Classification Up To Equivalence: Rank-Nullity Theorem
  4.2 Spectra
    4.2.1 Eigenvalues and Eigenvectors
    4.2.2 The Characteristic Polynomial
  4.3 The Spectral Decomposition for Self-Adjoint Linear Maps
  4.4 The Singular Value Decomposition
  4.5 Classification Up To Similarity: The Jordan Decomposition
  4.6 Classification Up To Congruence: Signatures and the Gramm-Schmidt Process

5 Basic Numerics
  5.1 The Three Fundamental Linear Systems
    5.1.1 Diagonal Systems
    5.1.2 Orthogonal Systems
    5.1.3 Triangular Systems
  5.2 Solving with Gaussian Elimination
  5.3 Solving with the Gramm-Schmidt Process
  5.4 Solving with the Singular Value Decomposition
    5.4.1 Conditioning
  5.5 Least-Squares Problems
  5.6 Basic Eigenvector Computation: The Power Method

1 The Basics

We begin this review by recalling some of the basic definitions in linear algebra.

1.1 Vector Spaces

The fundamental mathematical object in linear algebra is the vector space. A vector space is an algebraic object with two fundamental operations: vector addition and scalar multiplication.

Definition 1.1 (Vector Space). A vector space V over a field¹ F is a set equipped with the following operations:

1. A binary addition operation + : V × V → V with the following properties:
   (a) Associativity: (u + v) + w = u + (v + w) for u, v, w ∈ V.
   (b) Commutativity: u + v = v + u for u, v ∈ V.
   (c) Identity: there exists 0 ∈ V such that v + 0 = v for v ∈ V.
   (d) Invertibility: for u ∈ V, there exists −u ∈ V such that u + (−u) = 0.

2. A scalar multiplication operation · : F × V → V with the following properties:
   (a) Compatibility: (αβ)v = α(βv) for α, β ∈ F and v ∈ V.
   (b) Identity: 1 · v = v for v ∈ V, where 1 ∈ F is the multiplicative identity.
   (c) Distributivity: α(v + u) = αv + αu and (α + β)v = αv + βv for α, β ∈ F and v, u ∈ V.

Example 1.1 (The vector space R^n). One example of a vector space over R is the set of n-tuples of R, R^n, with addition defined as

(v1, ..., vn) + (u1, ..., un) = (v1 + u1, ..., vn + un) ,

and scalar multiplication defined as

α · (v1, ..., vn) = (αv1, ..., αvn) .

Example 1.2 (Polynomials of degree ≤ n). Another example of a vector space over R is the set of polynomials over R with degree ≤ n, with addition defined as

(vn x^n + vn−1 x^{n−1} + ... + v0) + (un x^n + un−1 x^{n−1} + ... + u0) = (vn + un) x^n + (vn−1 + un−1) x^{n−1} + ... + (v0 + u0) ,

and scalar multiplication defined as

α(vn x^n + vn−1 x^{n−1} + ... + v0) = αvn x^n + αvn−1 x^{n−1} + ... + αv0 .

Example 1.3 (Real-Valued Functions on R). The set of real-valued functions R → R forms a vector space over R, with addition defined as

(f + g)(x) = f(x) + g(x) ,

and scalar multiplication defined as

(αf)(x) = αf(x) .

Example 1.4 (R over Q). A more exotic example is the set of real numbers R over the set of rational numbers Q, with addition and scalar multiplication defined in the natural way.

Example 1.5 (Boolean Vector Spaces). Another interesting example is (Z2)^n over Z2, where Z2 denotes the integers mod 2 (Z2 contains the elements 0 (false) and 1 (true), where addition is the exclusive-or operation and multiplication is the and operation). These kinds of vector spaces are sometimes seen in coding theory.

1.2 Vector Spans

In linear algebra, the vector span of a set of vectors v1, ..., vm denotes the set of all linear combinations of those vectors. You can think of this as the flat plane through the origin formed by the vectors v1, ..., vm.

Definition 1.2 (Span). The span of a set of vectors V ≡ {v1, v2, ..., vm} in an F-vector space V is defined as

span(V) ≡ {α1 v1 + α2 v2 + ... + αm vm | αi ∈ F} , (1)

i.e., all possible linear combinations of the vectors in V.

¹Roughly, a field is a set with both an addition and a multiplication operation, both of which can be inverted (i.e., via subtraction and division, respectively). Common examples of fields include the real numbers R, the rational numbers Q, and the complex numbers C.

1.3 Linear Independence

One of the fundamental structural properties of vector spaces is the notion of linear independence. In short, linear independence of a set of vectors v1, ..., vn means that no vector in the set can be formed from a linear combination of the others. The formal definition is usually given by

Definition 1.3 (Linear Independence). A set of vectors V ≡ {v1, v2, ..., vm} is linearly independent if

α1 v1 + α2 v2 + ... + αm vm = 0 (2)

implies

α1 = α2 = ... = αm = 0 , (3)

for any choice of αi ∈ F.

1.4 Bases and Dimension

Finally, there are sets of vectors V known as bases which have the special property that every vector in a vector space can be written as a unique linear combination of the vectors in V. For this to be the case, the vectors in V must span the vector space, but they must also be linearly independent (otherwise, there may be multiple different linear combinations which yield the same vector). Therefore, the definition of a basis is given by

Definition 1.4 (Basis). A basis for a vector space V is a linearly independent set V ≡ {v1, v2, ..., vm} that spans V, i.e., span(V) = V.

Bases are the fundamental cornerstone of linear algebra computations. The number of linearly independent elements we need in a set before it spans a vector space is known as the dimension,

Definition 1.5 (Dimension). The dimension of a vector space V, denoted dim(V), is the smallest size of a spanning set for V.

It turns out that all bases have the same size:

Lemma 1.1. If the dimension of a vector space V is finite, then all bases for V have the same size, equal to the dimension of the vector space.
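For example, the standard basis vectors e1 = (1, 0, ..., 0), e2 = (0, 1, 0, ..., 0), ..., en = (0, ..., 0, 1) form a basis for R^n, so dim(R^n) = n. Similarly, the monomials 1, x, x^2, ..., x^n form a basis for the space of polynomials of degree ≤ n, so that space has dimension n + 1.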

1.5 Subspaces

Sometimes we want to look at a small part of a vector space rather than the whole thing. In these situations, it is useful to study vector subspaces,

Definition 1.6 (Subspace). A subset W of a vector space V is called a subspace if it is closed under vector addition and scalar multiplication.

There are a number of operations we can do to subspaces, but perhaps the two most interesting are intersection and addition:

Lemma 1.2. If W1, W2 ⊂ V are subspaces, then the intersection W1 ∩ W2 is also a subspace.

Definition 1.7 (Subspace Sum). The subspace sum of two subspaces W1, W2 ⊂ V is the set of all combinations of a vector in W1 and a vector in W2, i.e.,

W1 + W2 = {w1 + w2 | w1 ∈ W1, w2 ∈ W2} . (4)

Lemma 1.3. The subspace sum is a vector subspace.

A particularly important variant of the subspace sum is the direct sum of subspaces:

Definition 1.8 (Direct Sum of Subspaces). If two subspaces W1, W2 ⊂ V have a trivial intersection W1 ∩ W2 = {0}, then the sum W1 + W2 is called a direct sum and is usually denoted W1 ⊕ W2.

Direct sums are useful because they allow us to unambiguously decompose vectors into more primitive components:

Lemma 1.4. Every element v of a direct sum W1 ⊕ W2 can be written as v = w1 + w2 for unique elements w1 ∈ W1 and w2 ∈ W2.

1.6 Vector Norms

Vector norms are a way of defining the length of a vector in a vector space. There are a number of different ways to quantify the notion of length. But all of them share a number of fundamental qualities,

Definition 1.9 (Vector norm). A vector norm on an R-vector space V is a function ‖·‖ : V → R satisfying:

1. Non-negativity: ‖u‖ ≥ 0 for all u ∈ V.
2. Triangle Inequality: ‖u + v‖ ≤ ‖u‖ + ‖v‖ for all u, v ∈ V.
3. Absolute Scaling: ‖αu‖ = |α| · ‖u‖ for all α ∈ R, u ∈ V.
4. Point-Separating: ‖u‖ = 0 if and only if u = 0.

Example 1.6 (L^p Norms). The most used class of norms on R^n are arguably the L^p norms for p ≥ 1. For u = (u1, u2, ..., un), these are given by

‖u‖_p ≡ ( Σ_i |ui|^p )^{1/p} . (5)

Among these norms, the L^1 and L^2 norms are particularly important. When applied to the difference of two vectors u and v, the former gives the Manhattan distance while the latter gives the Euclidean distance.

Example 1.7 (L^∞ Norm). The limiting case of the above L^p norms as p → ∞ is the L^∞ norm, given by

‖u‖_∞ ≡ max_i |ui| . (6)
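For example, for u = (3, −4) ∈ R^2, we have ‖u‖_1 = |3| + |−4| = 7, ‖u‖_2 = √(9 + 16) = 5, and ‖u‖_∞ = max(3, 4) = 4. Note that ‖u‖_∞ ≤ ‖u‖_2 ≤ ‖u‖_1, which in fact holds for every vector in R^n.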

2 Linear Maps

Every field of algebra is composed primarily of two types of mathematical constructs. First, there is always a type of algebraic object — in this case, these are vector spaces — and second, there are always relationships between those objects — maps that respect the underlying algebraic structure. In the case of vector spaces, the algebraic structure is encoded in vector addition and scalar multiplication, and so a map which respects the algebraic structure commutes with both vector addition and scalar multiplication. This gives rise to the notion of a linear map,

Definition 2.1 (Linear Map). Let V and W be two F-vector spaces. A linear map² A : V → W is a function which commutes with vector addition and scalar multiplication, i.e.,

A(u + v) = A(u) + A(v)   for all u, v ∈ V ,
A(αu) = αA(u)   for all α ∈ F, u ∈ V . (7)

2.1 Image, Kernel, and Rank

Definition 2.2 (Image and Preimage). For a linear map A : V → W, the image of a set S ⊂ V under A is denoted A(S) and defined as

A(S) ≡ {w | there exists v ∈ S such that A(v) = w} , (8)

likewise, the preimage of a set S ⊂ W under A is denoted A⁻¹(S) and defined as

A⁻¹(S) ≡ {v | there exists w ∈ S such that A(v) = w} . (9)

When the set S is not specified, the image of A is simply taken to be the image of the domain V, and is denoted by im(A).

The image of a linear map is a linear subspace, since it is closed under linear combinations. That is, if v, u ∈ im(A), then there exist v0, u0 such that A(v0) = v and A(u0) = u. As such,

A(αv0 + βu0) = αA(v0) + βA(u0) = αv + βu , (10)

so αv + βu ∈ im(A).

Definition 2.3 (Kernel). The kernel (alternatively, nullspace) of a linear map A : V → W, denoted ker(A), is the preimage of the identity element 0 ∈ W, i.e.,

ker(A) ≡ A⁻¹(0) = {v | A(v) = 0} . (11)

The kernel of a linear map is also a linear subspace, since it is closed under linear combinations. That is, if v, u ∈ ker(A), then

A(αv + βu) = αA(v) + βA(u) = 0 , (12)

so αv + βu ∈ ker(A).

A visual representation of the kernel and the image is given in fig. 1.

Definition 2.4. The rank of a linear map A : V → W is the dimension of its image,

rank(A) ≡ dim(im(A)) . (13)

If the rank of a linear map is equal to the dimension of its domain, then the map is referred to as full-rank.

Definition 2.5. The nullity of a linear map A : V → W is the dimension of its kernel,

null(A) ≡ dim(ker(A)) . (14)

²Sometimes these are referred to as linear transformations, linear operators, or even vector space morphisms.

Figure 1: A diagram of a linear map A : V → W with ker(A) and im(A) labeled.

2.2 Specifying Linear Maps

Once we specify a basis V = {v1, ..., vn} for a vector space V, note that any linear map A : V → W is completely specified by where it sends the basis vectors v1, ..., vn. This is because any vector u ∈ V can be written as a linear combination of the vectors in V, i.e.,

u = Σ_i αi vi , (15)

for some scalars α1, ..., αn ∈ F. Then, by linearity of the map A, we have that

A(u) = Σ_j αj A(vj) . (16)

Therefore, to specify a linear map A, it always suffices to specify what it does to a handful of basis vectors. This is very useful and shows us why bases are an important concept in linear algebra.
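As a concrete example, consider the differentiation map D on the space of polynomials of degree ≤ 2 (a linear map from that space to itself). Choosing the basis {1, x, x^2}, the map D is completely determined by D(1) = 0, D(x) = 1, and D(x^2) = 2x; by linearity, D(a + bx + cx^2) = b + 2cx for any coefficients a, b, c.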

2.3 Inverses

Definition 2.6 (Inverse). A linear map A⁻¹ : W → V is the inverse of a linear map A : V → W if

A⁻¹ ∘ A = Id_V ,
A ∘ A⁻¹ = Id_W , (17)

where Id_V is the identity map on V, which sends every element of V to itself, and Id_W is defined likewise. A linear map which is invertible is referred to as nonsingular and, correspondingly, a linear map which is not invertible is sometimes referred to as singular.

Nota Bene: In general, A ∘ B = Id does not imply that B ∘ A = Id, unless both V and W have the same finite dimension.

Lemma 2.1. If a linear map A : V → W has a trivial kernel, i.e., ker(A) = {0}, and the image of A is its codomain, i.e., im(A) = W, then the linear map has an inverse.

Proof. To show this, let V = {v1, ..., vn} be a basis for V and consider the vectors

A(V) = {A(v1), ..., A(vn)} . (18)

We claim that these vectors are linearly independent. Suppose, for contradiction, that they are not; then there exist coefficients αi ∈ F, not all zero, such that

Σ_i αi A(vi) = 0 . (19)

Using linearity,

A( Σ_i αi vi ) = 0 , (20)

which means

Σ_i αi vi ∈ ker(A) . (21)

But since the vi's are linearly independent and the αi are not all zero, this means that Σ_i αi vi ≠ 0, implying that ker(A) contains a nonzero element — a contradiction! Therefore, A(V) is linearly independent.

Moreover, we claim that A(V) must also span W. This is because the image of A is equal to W, and therefore for any w ∈ W, there exists some v ∈ V such that

A(v) = w . (22)

Since V is a basis for V, we can write v as

v = Σ_i βi vi , (23)

which, by linearity, means that

w = A(v) = Σ_i βi A(vi) . (24)

This tells us that the A(vi) form a basis for W.

Thus, since the A(vi) form a basis for W, we can use it to define a linear map. So, let us define A⁻¹ as the linear map which sends

A(vi) → vi . (25)

It is easy to check that A⁻¹ ∘ A and A ∘ A⁻¹ are the identity on V and W respectively. Therefore, an inverse exists!

Note that in the above lemma, the condition that im(A) = W is precisely the condition that the map A be surjective (i.e., onto). In algebra, the condition that ker(A) = {0} turns out to be equivalent to the condition that A be injective (i.e., one-to-one). To see this, suppose there are two distinct v1, v2 ∈ V with the property that

A(v1) = A(v2) . (26)

If this is the case, then linearity tells us that

A(v1 − v2) = A(v1) − A(v2) = 0 . (27)

But since v1 ≠ v2, this means that ker(A) contains a nontrivial vector. Conversely, if ker(A) contains a nontrivial vector, then there are two vectors with image 0 (namely, that vector and 0 itself), so the linear map cannot be injective.

2.4 Matrix Representations of Linear Maps

So far, we have talked about specifying a linear map A : V → W using a basis V for V. Now, if we have a second basis W = {w1, ..., wm} for the codomain W, then each of the vectors A(vj) can be written as

A(vj) = Σ_i aij wi , (28)

for some scalars aij ∈ F, and moreover, A(u) can also be written as

A(u) = Σ_i βi wi , (29)

for some scalars βi ∈ F. Therefore, expanding eq. (16) in terms of the above two expressions gives us

Σ_i βi wi = Σ_j αj Σ_i aij wi . (30)

By linear independence of the wi, the coefficients of each wi on both sides of the equality must match, and therefore we are left with

βi = Σ_j aij αj . (31)

This can be written more suggestively in the standard matrix form as

[ β1 ]   [ a11 ⋯ a1n ] [ α1 ]
[  ⋮ ] = [  ⋮  ⋱  ⋮  ] [  ⋮ ]   (32)
[ βm ]   [ am1 ⋯ amn ] [ αn ]

where this matrix-vector product is defined via eq. (31). Here, the 2D array [aij] is said to be the matrix representation of the linear map A with respect to the bases V and W, and the 1D arrays [αi] and [βi] are referred to as the coordinate representations of u and A(u) with respect to V and W, respectively. As we can see, matrices just amount to useful computational tools for doing calculations with linear maps on a computer or by hand. Indeed, matrix-vector multiplication is defined precisely so that it coincides with the underlying linear map that the matrix represents.

Finally, we can consider the composition of two linear maps A : V → W and B : W → X. The composition³ B ∘ A : V → X is also a linear map (you can check this!), and therefore if we have bases V, W, X for V, W, X respectively, and [aij], [bij], [cij] are the matrix representations of A, B, B ∘ A respectively, a similar calculation to the one we did above will show that

cij = Σ_k bik akj . (33)

Or more suggestively, in standard matrix form,

[ c11 ⋯ c1n ]   [ b11 ⋯ b1m ] [ a11 ⋯ a1n ]
[  ⋮  ⋱  ⋮  ] = [  ⋮  ⋱  ⋮  ] [  ⋮  ⋱  ⋮  ]   (34)
[ cs1 ⋯ csn ]   [ bs1 ⋯ bsm ] [ am1 ⋯ amn ]

where this matrix-matrix product is defined via eq. (33). All of the standard rules for matrix, vector, and scalar multiplication are therefore just consequences of the fact that they represent linear maps!

In the future, when referring to the matrix representation of a linear map A : V → W with respect to bases V and W, I will use the shorthand [A]_{V,W} to denote the matrix representation of A with respect to V and W. When A : V → V has the same domain and codomain, it is possible to use the same basis for the domain and codomain, in which case I will simply write [A]_V as the representation of A in the basis V, i.e., as shorthand for [A]_{V,V}.

³B ∘ A is defined to be the function such that (B ∘ A)(x) = B(A(x)).
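To illustrate, take the differentiation map D on the polynomials of degree ≤ 2 and use the basis V = {1, x, x^2} for both the domain and the codomain. Since D(1) = 0, D(x) = 1 = 1 · 1, and D(x^2) = 2x = 2 · x, the columns of the matrix representation record these coordinates:

[D]_V = [ 0 1 0 ]
        [ 0 0 2 ]
        [ 0 0 0 ] .

Multiplying [D]_V against the coordinate vector (a, b, c) of a + bx + cx^2 gives (b, 2c, 0), which are exactly the coordinates of b + 2cx, as expected.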

2.5 Change of Basis Formulas for Linear Maps

Now that we have seen how to write down a matrix representation of a linear map with respect to some basis V, it is time to consider how two different matrix representations of the same linear map are related. Let A : V → W be a linear map, let V1 = {v1^(1), ..., vn^(1)} and V2 = {v1^(2), ..., vn^(2)} be two bases for the domain V, and let W1 = {w1^(1), ..., wm^(1)} and W2 = {w1^(2), ..., wm^(2)} be two different bases for the codomain W. Then, the matrix representations [A]_{V1,W1} and [A]_{V2,W2} are related by a change of basis.

To see how to calculate a change of basis, let us first consider a vector y in V and its two coordinate representations [y]_X and [y]_Z with respect to two different bases X ≡ {x1, ..., xn} and Z ≡ {z1, ..., zn} of the vector space V. As we discussed above, a linear map can always be specified by mapping one set of basis vectors to another. Now, let us define the map P_{X,Z} : V → V which sends

P_{X,Z}(xi) = zi ,   for all i . (35)

This type of map is called a change of basis map. By definition, the matrix representation [P_{X,Z}]_X of this change of basis map with respect to X is given by the coefficients pij, where

zi = Σ_j pji xj . (36)

Therefore, if y = Σ_i βi zi, then we have that

y = Σ_i βi zi = Σ_i Σ_j βi pji xj = Σ_j ( Σ_i pji βi ) xj . (37)

Therefore, the Σ_i pji βi are precisely the entries of [y]_X by definition. They also happen to be the entries of the matrix product [P_{X,Z}]_X [y]_Z by definition. In other words, we have just shown that

[y]_X = [P_{X,Z}]_X [y]_Z . (38)

Or rather, the coordinate representation of y in X and the coordinate representation of y in Z are related by the matrix [P_{X,Z}]_X. This means the matrix [P_{X,Z}]_X translates coordinate representations in Z to coordinate representations in X. Therefore, we can extend this to linear maps by using these change of basis matrices to translate between coordinate representations, giving us

[A]_{V1,W1} = [P_{W1,W2}]_{W1} [A]_{V2,W2} [P_{V2,V1}]_{V2} . (39)

This relationship is often written more simply as

B = Q⁻¹AP , (40)

where A and B are two different matrix representations of the same linear map, and P and Q⁻¹ are change of basis matrices.

If two matrices A and B are related as in eq. 40, then we say that the matrices are equivalent. That is, A and B represent the same linear map, except with perhaps a different choice of basis for the domain and the codomain.

A particular case of interest is when the linear map A is from V to V. In this case, if X and Z are bases for V, we have the change of basis formula

[A]_Z = [P_{Z,X}]_Z [A]_X [P_{X,Z}]_X = [P_{X,Z}]_X⁻¹ [A]_X [P_{X,Z}]_X , (41)

where we have used the fact that [P_{X,Z}]_X⁻¹ = [P_{Z,X}]_Z (the first converts from Z to X and the other converts back). Oftentimes, this relationship is written more simply as

B = P⁻¹AP , (42)

where A and B are two different matrix representations of the same linear map and P is a change of basis matrix. Here, the operation of simultaneously multiplying on the right by P and on the left by P⁻¹ is known as a conjugation.

If two matrices are related by conjugation as in eq. 42, we say that the two matrices are similar. This relationship is sometimes written visually as

B ∼ A . (43)

Intuitively, two matrices being similar means that they represent the same underlying linear map, except perhaps in a different basis. Note that ∼ is an equivalence relationship. As we will see later in section 4, it is actually possible to classify all possible linear maps up to similarity.
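For example, take A = [[2, 1], [1, 2]] as the representation of a map in the standard basis of R^2, and change to the basis {(1, 1), (1, −1)}, so that

P = [ 1  1 ]       P⁻¹ = (1/2) [ 1  1 ]
    [ 1 −1 ] ,                 [ 1 −1 ] .

Then

B = P⁻¹AP = [ 3 0 ]
            [ 0 1 ] ,

so A is similar to the diagonal matrix diag(3, 1): the new basis vectors are simply scaled by 3 and 1, since A(1, 1) = (3, 3) and A(1, −1) = (1, −1). We will see this phenomenon again when we discuss eigendecompositions in section 4.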

2.6 Operator Norms for Linear Maps

Definition 2.7 (Operator Norm). If V and W are R-vector spaces and ‖·‖_V : V → R and ‖·‖_W : W → R are vector norms on V and W respectively, then the operator norm of a linear map A : V → W with respect to ‖·‖_V and ‖·‖_W is defined as

‖A‖_{V,W} ≡ sup_{‖v‖_V = 1} ‖A(v)‖_W . (44)

Recall that one can view m × n matrices as linear maps from R^n to R^m. Under this interpretation, matrices have a number of interesting operator norms:

Example 2.1 (L^1 Operator Norm). The L^1 operator norm ‖A‖_1 of an m × n matrix A (where both domain and codomain are under the L^1 vector norm defined in example 1.6) is given by the maximum L^1 norm of a column of that matrix.

Example 2.2 (L^2 Operator Norm). The L^2 operator norm ‖A‖_2 of an m × n matrix A (where both domain and codomain are under the L^2 vector norm defined in example 1.6) is given by the maximum singular value of that matrix. Singular values will be discussed later.

Example 2.3 (L^∞ Operator Norm). The L^∞ operator norm ‖A‖_∞ of an m × n matrix A (where both domain and codomain are under the L^∞ vector norm defined in example 1.7) is given by the maximum L^1 norm of a row of that matrix.
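For example, for

A = [ 1 −2 ]
    [ 3  4 ] ,

the column L^1 norms are |1| + |3| = 4 and |−2| + |4| = 6, so ‖A‖_1 = 6, while the row L^1 norms are |1| + |−2| = 3 and |3| + |4| = 7, so ‖A‖_∞ = 7.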

2.6.1 Important Norm Inequalities

It is sometimes important to know that all of the above norms are equivalent, up to a constant factor which depends on dimension. Indeed, it is well known that

Lemma 2.2 (L^p inequalities). For an m × n matrix A,

(1/√n) ‖A‖_∞ ≤ ‖A‖_2 ≤ √m ‖A‖_∞ ,
(1/√m) ‖A‖_1 ≤ ‖A‖_2 ≤ √n ‖A‖_1 . (45)

Another well known inequality is the Hölder inequality:

Lemma 2.3 (Hölder Inequality). For an m × n matrix A and an n × k matrix B, and p, q ∈ [1, ∞] with 1/p + 1/q = 1, we have

‖AB‖_1 ≤ ‖A‖_p ‖B‖_q . (46)

3 Multilinear Forms

Definition 3.1 (Multilinear Form). For an F-vector space V, a function f : V × V × ... × V → F is a multilinear form if it is linear in every entry, i.e.,

f(u1, ..., ui + v, ..., uk) = f(u1, ..., ui, ..., uk) + f(u1, ..., v, ..., uk)   for ui, v ∈ V ,
f(u1, ..., αui, ..., uk) = αf(u1, ..., ui, ..., uk)   for ui ∈ V, α ∈ F . (47)

In particular, if f has two inputs, then it is called a bilinear form.

3.1 Matrix Representations of Bilinear Forms

In the family of multilinear forms, bilinear forms are special because they can be represented by matrices. In fact, taking this perspective can oftentimes make thinking about certain matrices significantly easier. To see what I mean by this, let us say we have a bilinear form B(·, ·) : V × V → F and a basis V = {v1, ..., vn} for V. Then, suppose we have arbitrary vectors u, w ∈ V. These vectors can be written as

u = Σ_i αi vi ,
w = Σ_i βi vi , (48)

for some αi, βi ∈ F. Moreover, let us define

mij ≡ B(vi, vj) . (49)

Using the linearity properties of the bilinear form, we therefore have that

B(u, w) = Σ_i Σ_j αi B(vi, vj) βj = Σ_i Σ_j αi mij βj . (50)

Note that this can be written more suggestively in matrix form as

          [ α1 ]^T [ m11 ⋯ m1n ] [ β1 ]
B(u, w) = [  ⋮ ]   [  ⋮  ⋱  ⋮  ] [  ⋮ ] .   (51)
          [ αn ]   [ mn1 ⋯ mnn ] [ βn ]

If we let M be the matrix [mij], then we have that every bilinear form can be written as

B(u, w) = [u]_V^T M [w]_V , (52)

where

Mij = B(vi, vj) . (53)

Therefore, there is an immediate relationship between matrices and bilinear forms! This relationship can often come in handy. In the future, we will denote the matrix representation of a bilinear form B with respect to a basis V with the shorthand [B]_V, i.e.,

([B]_V)ij = B(vi, vj) . (54)
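For example, on R^2 with the standard basis, the bilinear form B(u, w) = u1 w1 + 2 u1 w2 + 3 u2 w2 has matrix representation

[B]_V = [ 1 2 ]
        [ 0 3 ] ,

since B(e1, e1) = 1, B(e1, e2) = 2, B(e2, e1) = 0, and B(e2, e2) = 3. In particular, the Euclidean dot product corresponds to the identity matrix.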

3.2 Symmetric Forms

The simplest example of a class of multilinear forms is the symmetric multilinear forms:

Definition 3.2 (Symmetric Form). A multilinear form f : V × V × ... × V → F is symmetric if interchanging any two inputs gives the same result, i.e.,

f(u1, ..., ui, ..., uj, ..., uk) = f(u1, ..., uj, ..., ui, ..., uk)   for all i, j . (55)

Note that symmetric forms are invariant under permutation of their arguments, i.e., if σ is a permutation of {1, ..., k}, then

f(u_{σ(1)}, ..., u_{σ(k)}) = f(u1, ..., uk) . (56)

A common example of a symmetric form is the Euclidean inner product (dot product) on R^n. The dot product · : R^n × R^n → R takes two vectors, produces a scalar, and is linear in both arguments.

12 3.3 Anti-Symmetric Forms, Volume, and Determinants

Figure 2: The notion of volume of a parallelotope is multilinear in the vectors which form its edges, as seen in this diagram. If f(v1, v2) denotes the volume of a parallelotope whose edges are given by v1 and v2, then we have that f(v1 + v3, v2) = f(v1, v2) + f(v3, v2).

It so happens that the notion of volume or area on a vector space fulfills the properties of a multilinear map. How do we define a notion of volume in a vector space? Well, one way "volume" is usually defined is as a map which takes in the edges v1, ..., vn of a parallelotope and spits out the volume of that parallelotope. This idea of volume happens to be a multilinear form because the volume of a parallelotope obeys linearity in the edges of the parallelotope (see fig. 2). Note however, for the volume to be fully multilinear, it must be signed — that is, it is possible for volume to be both positive and negative, so that if we scale a side of the parallelotope by a negative scalar, we flip the sign of its volume. The difference between a parallelotope with positive volume and one with negative volume is orientation. We typically say that a parallelotope with positive volume is positively oriented and one with negative volume is negatively oriented. To give a concrete example, in two dimensions, you can imagine that we can orient a parallelogram in one of two orientations by following its perimeter in either the clockwise or anti-clockwise direction, see fig. 3. By convention, switching the order of the edges v1, ..., vn flips the orientation of the associated parallelotope. This means that these volume forms have the property of anti-symmetry,

Definition 3.3 (Anti-Symmetric Form). A multilinear form f : V × V × ... × V → F is anti-symmetric (alternatively, skew-symmetric) if interchanging any two inputs negates the value of the form, i.e.,

f(u1, ..., ui, ..., uj, ..., uk) = −f(u1, ..., uj, ..., ui, ..., uk)   for all i, j . (57)

Unlike with symmetric forms, note that an anti-symmetric form is no longer invariant under permutations. Instead, repeated application of the above property can easily be used to show that, for any permutation σ of {1, ..., k},

f(u_{σ(1)}, ..., u_{σ(k)}) = sign(σ) f(u1, ..., uk) , (58)

where sign(σ) is the sign of the permutation (i.e., +1 if the permutation can be achieved by interchanging the elements of {1, ..., k} an even number of times, or −1 if the permutation can be achieved by interchanging the elements an odd number of times).

Another interesting property of anti-symmetric forms is that if any argument appears twice, then the form is zero, i.e.,

f(..., u, ..., u, ...) = 0 ,   for all u ∈ V . (59)

This is easy to prove, because interchanging u and u introduces a minus sign,

f(..., u, ..., u, ...) = −f(..., u, ..., u, ...) , (60)

which means the form must take the value zero when there are repeated arguments.


Figure 3: A diagram demonstrating how the sign of the volume of a parallelotope flips when its two edges are reversed. Since we interpret the order of the edges as giving us an orientation for the parallelotope (in this case, either clockwise or counterclockwise), flipping the order of the edges "reverses" the parallelotope. Hence, the signed volume is an anti-symmetric multilinear form.

We now return to the concept of signed volume forms as mentioned above. These are a special case of anti-symmetric multilinear forms f : V^n → F, where n is now the dimension of the space V. Intuitively, f(v1, ..., vn) is now a measure of the volume of a parallelotope with one vertex at zero and outgoing edges from that vertex given by the vectors v1, ..., vn. With the above two properties, we can actually show that an anti-symmetric form is determined completely by its value on a basis V = {v1, ..., vn} for V. Suppose we have an arbitrary set of vectors w1, ..., wn ∈ V. They can be written as

wi = Σ_j αij vj . (61)

Using the linearity properties of f, we have that

f(w1, ..., wn) = Σ_{i1} ⋯ Σ_{in} α_{1,i1} α_{2,i2} ⋯ α_{n,in} f(v_{i1}, v_{i2}, ..., v_{in}) . (62)

Note that, in the summation above, if any pair of the i1, i2, ..., in are ever equal, then the resulting value of f(v_{i1}, ..., v_{in}) will be zero. For all other choices of i1, ..., in, the tuple (i1, ..., in) represents a permutation of the numbers {1, ..., n} (i.e., every number appears exactly once). Therefore, since the non-permutation terms are zero, the above summation can be simplified to a summation over permutations,

f(w1, ..., wn) = Σ_{i1} ⋯ Σ_{in} α_{1,i1} α_{2,i2} ⋯ α_{n,in} f(v_{i1}, v_{i2}, ..., v_{in})
             = Σ_{σ∈Sn} α_{1,σ(1)} α_{2,σ(2)} ⋯ α_{n,σ(n)} f(v_{σ(1)}, v_{σ(2)}, ..., v_{σ(n)})   (63)
             = ( Σ_{σ∈Sn} α_{1,σ(1)} α_{2,σ(2)} ⋯ α_{n,σ(n)} sign(σ) ) f(v1, v2, ..., vn) ,

where Sn denotes the group of permutations on {1, ..., n}. Therefore, once we specify the signed volume of the parallelotope in V spanned by v1, v2, ..., vn, the signed volume of every other parallelotope in V is completely determined by the above formula. The formula actually has a name — it's called the Leibniz formula. It tells us that all signed volume forms on V are equivalent, up to a constant (here, given by f(v1, ..., vn)). This leads us to the notion of a determinant,

Definition 3.4 (Determinant). The determinant of a set of vectors is the unique anti-symmetric form V^n → R which assigns the volume 1 to the unit cube, i.e.,

det(e1, ..., en) = 1 , (64)

where the ei denote the Kronecker (standard) basis vectors. The determinant of a matrix is simply the determinant of its columns. And the determinant of a linear map A : R^n → R^n is the determinant of the images of the ei under A, i.e.,

det A = det(A(e1), ..., A(en)) . (65)

This is interpreted as the signed volume of the image of the unit cube under the map A. Note that, by definition, the determinant of the matrix representation of A in the Kronecker basis coincides with the determinant of A viewed as a linear map. As a consequence of eq. (63), we see that the determinant has an explicit formula,

det(v1, ..., vn) = Σ_{σ∈Sn} sign(σ) Π_i v_{i,σ(i)} , (66)

where the vij are the coordinate representations of the vi in the Kronecker basis e1, ..., en.

The signed volume forms can actually tell us when a set of vectors is not linearly independent. Again, intuitively, a signed volume form measures the signed volume of parallelotopes. And again, recall that if v1, ..., vn are linearly dependent, that means that one of the vectors vi can be expressed as a linear combination of the others, and therefore vi lies in the hyperplane spanned by the other vectors. Thus, the parallelotope spanned by these vectors is "flat" and has no volume, and hence the form should be zero. Indeed, this is not that difficult to prove,

Lemma 3.1. Let f : V^n → R be a nontrivial signed volume form (i.e., there exists a set of vectors w1, ..., wn such that f(w1, ..., wn) ≠ 0). Then a set of vectors v1, ..., vn is linearly independent if and only if f(v1, ..., vn) ≠ 0.

Proof. Suppose that they are linearly dependent; then, without loss of generality, we can say that v1 = Σ_{i=2}^n αi vi. Thus, by linearity of f,

f(v1, ..., vn) = Σ_{i=2}^n αi f(vi, v2, ..., vn) = 0 , (67)

where we have noted that in each of the terms on the right, vi appears twice in the form, and therefore eq. (59) tells us the result must be zero. Conversely, suppose they are linearly independent. Then the vectors form a basis for R^n, and hence the vectors w1, ..., wn can be written via some linear combination,

wi = Σ_j aij vj . (68)

Thus, using the Leibniz formula, we have that

f(w1, ..., wn) = ( Σ_{σ∈Sn} a_{1,σ(1)} a_{2,σ(2)} ⋯ a_{n,σ(n)} sign(σ) ) f(v1, ..., vn) . (69)

Now, since we know that the term on the left is nonzero by assumption, it follows that both of the factors on the right must also be nonzero. Ergo, f(v1, ..., vn) ≠ 0.

An intuitive way of thinking of the determinant is as the scaling factor of a linear map. That is, every linear map A : R^n → R^n scales the volume of the space R^n by some scalar factor — and this factor happens to be the determinant. From this interpretation of the determinant, it is reasonable to expect these scale factors to multiply if multiple linear maps are applied to R^n in sequence. Indeed, this is the case:

Lemma 3.2. The determinant is multiplicative under composition of linear maps, i.e.,

det(AB) = (det A) · (det B) . (70)

Proof. The idea behind a proof of this statement is to use the fact that any matrix A can be decomposed into a series of elementary column operations Ei. Elementary column operations either (1) scale a column of a matrix by a factor α, or (2) add a scalar multiple of one column of a matrix to another, or (3) interchange two columns of a matrix. You can check by writing out the Ei as matrices that these operations have determinants α, 1, and −1 respectively.

Now, consider det(AEi), i.e., the determinant of the matrix A when one of these operations Ei is applied to A.

1. Scaling a column by α: by linearity of the determinant, det(AEi) = α det(A).

2. Adding a scalar multiple of one column to another: note that

det(..., vi + αvj, ..., vj, ...) = det(..., vi, ..., vj, ...) + α det(..., vj, ..., vj, ...) = det(..., vi, ..., vj, ...) ,

where to get rid of the second term we used the fact that the inputs are not linearly independent (as vj appears twice). Therefore, det(AEi) = det(A).

3. Switching two columns: the anti-symmetry property of the determinant tells us that det(AEi) = −det(A).

Therefore, the scale factor between det(AEi) and det(A) is always equal to det(Ei) for any elementary column operation Ei; as such,

det(AEi) = det(A) det(Ei) . (71)

Now, using the fact that every invertible matrix B can be written as the product B = Π_i Ei of some elementary column operations Ei (we won't prove this here), repeated application of the above equation gives us

det(B) = det( Π_i Ei ) = Π_i det(Ei) . (72)

Using the same logic, we also have that

det(AB) = det( A Π_i Ei ) = det(A) Π_i det(Ei) = det(A) det(B) , (73)

where for the last equality we have substituted in eq. (72). (If B is singular, then AB is singular as well, so both sides of eq. (70) are zero and the identity holds trivially.)

There are two important practical consequences of this fact. The first is that the determinant of the inverse of a matrix is the inverse of the determinant. This follows from the fact that

1 = det(I) = det(AA⁻¹) = det(A) det(A⁻¹) . (74)

Another important practical consequence is that the notion of determinant is actually independent of basis. Indeed, if two matrices A and B are related by a change of basis A = P⁻¹BP, then we have that

det(A) = det(P⁻¹BP) = det(P)⁻¹ det(B) det(P) = det(B) . (75)
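As a concrete instance of eq. (66): for a 2 × 2 matrix with columns v1 = (a, c) and v2 = (b, d), the only permutations in S2 are the identity (sign +1) and the single swap (sign −1), so

det(v1, v2) = v11 v22 − v12 v21 = ad − bc ,

which is the familiar formula for the determinant of a 2 × 2 matrix.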

3.3.1 The Cofactor Formula for Matrix Inverses

The linear independence properties of the determinant can be used to show that the matrix inverse has an explicit formula given by

A⁻¹ = (1 / det A) adj(A) , (76)

where adj(A) is the adjugate of A, defined as

adj(A)ij ≡ (−1)^{i+j} Mji , (77)

where Mji is the (j, i)-minor of A, i.e., the determinant of the matrix formed from A by deleting the jth row and ith column.
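For example, for a 2 × 2 matrix

A = [ a b ]
    [ c d ] ,

the minors are M11 = d, M12 = c, M21 = b, and M22 = a, so the adjugate is

adj(A) = [  d −b ]
         [ −c  a ] ,

and eq. (76) reproduces the familiar inverse A⁻¹ = 1/(ad − bc) adj(A), valid whenever det A = ad − bc ≠ 0.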

3.4 Inner Products

The final important multilinear notion in linear algebra is that of the "similarity" of two vectors. Vector "similarity" is usually measured in terms of some normalized version of a dot product. The full generalization of the notion of a dot product from introductory linear algebra is an inner product. An inner product has a number of important properties it must satisfy so that it behaves like a dot product. The first of these properties is the notion of sesquilinearity:

Definition 3.5 (Sesquilinear Form). A sesquilinear form S : V × V → C on a C-vector space V is a function which is conjugate linear in the first argument, i.e.,

S(u + v, w) = S(u, w) + S(v, w)   for all u, v, w ∈ V ,
S(αu, v) = α* S(u, v)   for all u, v ∈ V, α ∈ C , (78)

and linear in the second argument, i.e., satisfying

S(u, v + w) = S(u, v) + S(u, w)   for all u, v, w ∈ V ,
S(u, αv) = α S(u, v)   for all u, v ∈ V, α ∈ C , (79)

where * denotes the complex conjugate.

One can do a calculation similar to the one done in section 3.1 to show that, with respect to a basis V of V, every sesquilinear form S(·, ·) : V × V → C can be written as

S(u, w) = [u]_V* M [w]_V , (80)

where

Mij ≡ S(vi, vj) , (81)

and where *, when used on matrices, denotes the conjugate transpose (i.e., take the transpose and then take the complex conjugate of each entry). However, to define the notion of an inner product, sesquilinearity is not enough by itself. Since the inner product of a vector with itself usually represents the "length" of that vector, we typically want the inner product of a vector with itself to be a real number. But this is not always the case with sesquilinear forms; in general, a sesquilinear product of a vector with itself may be complex valued. To ensure that the "length" of a vector is actually real, we require the additional property of Hermiticity,

Definition 3.6 (Hermitian Form). A sesquilinear form H : V × V → C is called Hermitian if it also satisfies conjugate symmetry,

H(v, u) = H(u, v)*   for all u, v ∈ V . (82)

Note that for a Hermitian form, the Hermitian product of a vector with itself will always be real, since

H(v, v) = H(v, v)∗ . (83)

However, we require one more thing to have a complete inner product. In general, since the inner product of a vector with itself represents “length,” it should be a non-negative number. This means that inner products also require the property of positive-definiteness,

Definition 3.7 (Inner Product). A Hermitian form ⟨·, ·⟩ : V × V → C is called an inner product if it satisfies positive-definiteness:

⟨u, u⟩ > 0   for all u ∈ V \ {0} . (84)

If an inner product ⟨·, ·⟩ is real valued, i.e., ⟨·, ·⟩ : V × V → R, then any matrix representation of that inner product will be a positive-definite symmetric matrix. These types of matrices are extremely important in numerical linear algebra, and thinking about these matrices in terms of representing an inner product is the basis for a number of iterative methods for solving linear systems.

The most common examples of inner products are the discrete and continuous L^2 inner products:

17 Example 3.1 (L2 Inner Product). By far the most common inner product is the L2 inner product, given by

⟨u, v⟩ = Σ_i ui* vi . (85)

Example 3.2 (L^2 Inner Product on Functions). Another common example, which shows up often in analysis, is the L^2 inner product for functions on R^n, given by

⟨f, g⟩ = ∫_{R^n} f(x)* g(x) dx . (86)

3.4.1 Orthonormal Bases

Some bases have the property that they are especially nice to work in for a given inner product ⟨·, ·⟩, because cross terms in the inner product always cancel, and self products are normalized to one. These bases are known as orthonormal bases,

Definition 3.8 (Orthonormal Basis). A basis V = {u1, ..., un} for a C-vector space V with inner product ⟨·, ·⟩ : V × V → C is orthonormal if the individual vectors in V are orthogonal and normalized, i.e.,

⟨ui, uj⟩ = 1 if i = j, and 0 if i ≠ j . (87)

One of the great things about orthonormal bases is that computing inner products with them is super easy because all cross-terms cancel:

Theorem 3.3. Let ⟨·, ·⟩ : V × V → C be an inner product and let {u1, ..., un} be an orthonormal basis for ⟨·, ·⟩. Then, for any v, w ∈ V with coordinate representations

v = Σ_i αi ui ,   w = Σ_i βi ui , (88)

we have that

⟨v, w⟩ = Σ_i αi* βi . (89)

Proof. This is a direct result of calculation and comes from the cancellation of cross terms:

⟨v, w⟩ = ⟨ Σ_i αi ui , Σ_j βj uj ⟩
       = Σ_i Σ_j αi* βj ⟨ui, uj⟩
       = Σ_i Σ_j αi* βj δij   (90)
       = Σ_i αi* βi .

A particularly nice consequence of the above is the Pythagorean Theorem,

Corollary 3.3.1 (Pythagorean Theorem). Let ⟨·, ·⟩ : V × V → C be an inner product and let {u1, ..., un} be an orthonormal basis for ⟨·, ·⟩. Then, for any v with coordinate representation

v = Σ_i αi ui , (91)

we have that

⟨v, v⟩ = Σ_i |αi|² . (92)

3.4.2 Adjoints

Another important concept in linear algebra is that of the adjoint operator:

Definition 3.9 (Adjoint). Consider a linear map A : V → W and inner products ⟨·, ·⟩_V : V × V → R and ⟨·, ·⟩_W : W × W → R. A map A* : W → V is called the adjoint of A if it satisfies

⟨Av, w⟩_W = ⟨v, A*w⟩_V   for all v ∈ V, w ∈ W . (93)

Essentially, an adjoint allows one to move the action of an operator from one side of an inner product to the other.

On R^n with the standard inner product, taking the adjoint of an operator is the same as taking the transpose of a matrix.

Note that the adjoint A* might in general be different from A. However, there are a number of cases where the operator A and the operator A* are actually equivalent. In these happy cases, the map is known as self-adjoint,

Definition 3.10 (Self-Adjoint Map). A linear map A : V → V is self-adjoint with respect to an inner product ⟨·, ·⟩ : V × V → C if A = A*, i.e.,

⟨Au, v⟩ = ⟨u, Av⟩   for all u, v ∈ V . (94)

Self-adjoint maps have a very nice structure, which we will examine in section 4.
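For example, on R^n with the standard inner product ⟨u, v⟩ = u^T v, a matrix A satisfies ⟨Au, v⟩ = (Au)^T v = u^T A^T v and ⟨u, Av⟩ = u^T A v, so A is self-adjoint precisely when A^T = A, i.e., when the matrix is symmetric. Similarly, on C^n with the L^2 inner product, the self-adjoint maps are exactly the Hermitian matrices (those with A* = A).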

3.5 Change of Basis Formula for Bilinear Forms

Bilinear forms have similar change of basis formulas to linear maps, but the formulas are still subtly different and worth mentioning. Let V and W be two bases for V. Then, the change of basis formula for a bilinear form B on V can easily be derived by observing,

[u]_W^T [B]_W [v]_W = B(u, v) = [u]_V^T [B]_V [v]_V = [u]_W^T [P_{V,W}]_V^T [B]_V [P_{V,W}]_V [v]_W . (95)

Hence, comparing terms on the far left and far right, we obtain that

[B]_W = [P_{V,W}]_V^T [B]_V [P_{V,W}]_V . (96)

Using less cluttered notation, the change of basis formula for bilinear forms can be written as

B = P^T A P , (97)

where the nonsingular matrix P ∈ R^{n×n} is a change of basis matrix. Note the similarity (no pun intended) with the change of basis formula for linear maps given in eq. 42. It is also easy to show that, for sesquilinear forms, this relationship reads

B = P* A P . (98)

When two matrices are related as in eq. 97, we say that they are congruent. This means that they represent the same bilinear form, except perhaps in different bases.

As a final note, it may also be the case that the first and second arguments of a bilinear form are not treated equivalently, such as in the case where the bilinear form is not symmetric. In this situation, it may make more sense to treat the first and second arguments on separate grounds and use separate bases for each. If we allow this, then the change of basis formula instead reads

B = Q^T A P , (99)

in analogy to the change of basis formula for linear maps given in eq. 40.

4 The Algebraic Structure of Linear Maps

In this section, we will begin to look at some of the fundamental structural results about linear maps and bilinear forms. These results largely fall into two categories: classification results, which say that there are only a certain number of different algebraic structures that a linear map can have, and spectral results, which describe the algebraic structure of invariant subspaces of linear maps.

4.1 Classification Up To Equivalence: Rank-Nullity Theorem

Let us start with an easy warm-up. Recall from section 2.5 that two m × n matrices A and B are equivalent if there exist invertible matrices Q and P such that

B = Q⁻¹AP . (100)

That is, B can be obtained from A by performing a change of basis in the domain and a change of basis in the codomain. Note that matrix equivalence is an equivalence relationship (i.e., it is reflexive, symmetric, and transitive), therefore it is a natural question to ask if we can characterize all of the possible equivalence classes of this relationship.⁴ It turns out that the rank of a matrix is the only thing which distinguishes different equivalence classes. To see why, let us prove the following theorem,

Theorem 4.1. Any matrix A ∈ R^{m×n} is equivalent to a diagonal matrix Σ ∈ R^{m×n} whose diagonal elements are either 1 or 0, i.e.,

A = Q⁻¹ΣP , (101)

for some Q ∈ R^{m×m} and P ∈ R^{n×n}. Moreover, the number of nonzero entries in Σ is precisely the rank of A.

Proof. Recall that ker(A) is a linear subspace. Therefore, let u1, ..., uk be a basis for this space. By repeatedly choosing additional vectors outside the span of this set, we can extend this basis for ker(A) to a basis for R^n, denoted u1, ..., uk, v1, ..., vr. Reorder these vectors as v1, ..., vr, u1, ..., uk and call this basis V. Now, we claim that A(v1), ..., A(vr) form a basis for the image of A. To see this, we note that the set of vectors A(v1), ..., A(vr), A(u1), ..., A(uk) clearly spans im(A), and that the vectors A(u1), ..., A(uk) can be discarded because they are all zero. So, A(v1), ..., A(vr) spans the image. Furthermore, these vectors must also be linearly independent, because if they are not, then we have

Σ_i αi A(vi) = 0 , (102)

for some αi not all identically zero. This means

A( Σ_i αi vi ) = 0 , (103)

so Σ_i αi vi ∈ ker(A). But let j ≥ 1 be the largest index such that αj ≠ 0. Then, we have that

vj = −(1/αj) Σ_{i=1}^{j−1} αi vi + k , (104)

for some k ∈ ker(A). But this means vj ∈ span{vj−1, ..., v1, u1, ..., uk}, which contradicts our assumption about how vj was chosen. Hence, we must have linear independence of A(v1), ..., A(vr).

Therefore, we can similarly extend A(v1), ..., A(vr) into a basis for R^m. We will write the vectors in this basis as A(v1), ..., A(vr), wr+1, ..., wm and call the basis W. Now, note that we have

A(vi) = 1 · A(vi) ,   for 1 ≤ i ≤ r ,
A(ui) = 0 · wi ,   for 1 ≤ i ≤ min(m − r, k) . (105)

Therefore, letting Q be the change of basis matrix from the Kronecker basis to the basis W (i.e., Q⁻¹ has columns given by the vectors in W), and letting P be the change of basis matrix from the Kronecker basis to the basis V, eq. 105 and the change of basis formula give us directly that

A = Q⁻¹ΣP , (106)

⁴Recall that an equivalence relationship on a set always divides that set into equivalence classes, wherein any two elements in the same class are equivalent and any two elements in different classes are not. When we want to characterize equivalence classes, we are asking the question: what are the fundamental ways that two elements can be non-equivalent?

where Σii = 1 for 1 ≤ i ≤ r and Σii = 0 otherwise. Moreover, to prove the last part of the theorem statement, note that r is precisely the dimension of the image im(A) (i.e., the rank of A), because A(v1), ..., A(vr) form a basis for im(A).

Corollary 4.1.1 (Rank-Nullity Theorem). For any A ∈ R^{m×n},

rank(A) + null(A) = n . (107)

Proof. This follows immediately from the above theorem because r is the rank of A, k is the nullity of A, and r + k = n.

Therefore, the moral of the story is: every R^{m×n} matrix is equivalent to a diagonal matrix with r ones on the diagonal and zeroes everywhere else. Since the positions in which the ones appear don't matter, as we can just permute the bases V and W, the equivalence classes of matrices under matrix equivalence are precisely given by matrix rank.
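For example, the matrix

A = [ 1 0 1 ]
    [ 0 1 1 ] ,

viewed as a map R^3 → R^2, has image all of R^2 (its first two columns already span R^2), so rank(A) = 2, and its kernel is spanned by (1, 1, −1), so null(A) = 1; indeed, 2 + 1 = 3 = n.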

4.2 Spectra

Another key structural property of linear maps is that of invariant subspaces,

Definition 4.1 (Invariant Subspace). Let A : V → V be a linear map. Then W ⊂ V is an invariant subspace of A if it satisfies

A(W) ⊂ W . (108)

Why are invariant subspaces so interesting? Well, suppose that we have a linear map A and a vector space V which can be decomposed into two A-invariant subspaces V = W1 ⊕ W2. If this is the case, then A does something to W1 and it also does something to W2, but crucially, there is not any communication between the two subspaces — A acts on each of them independently, and therefore, we can study the action of A on the two subspaces independently. So, we've reduced the problem of studying the action of A on V to the problem of studying the action of A on W1 and W2, each of which is strictly smaller than V.

The goal of studying the spectral structure of a linear map A is to find a decomposition of V into A-invariant subspaces Wi,

V = ⊕_i Wi , (109)

where each of the Wi is as small as possible. The smaller the subspaces Wi in the above decomposition, the more finely we understand the structure of A. In the luckiest case, we may be able to find a decomposition such that each of the Wi has dimension 1. Such a decomposition would be fortunate because the action of A on a one-dimensional A-invariant subspace is very easy to understand. To elucidate this point, suppose dim(Wi) = 1 and let wi be a vector spanning Wi. Then A(Wi) ⊂ Wi, so it must be the case that A acts as a scalar multiplier λi on Wi (as A can be restricted to a linear map from Wi to itself) with Awi = λiwi. So, if all of the Wi have dimension 1, then we can find a basis wi such that the representation of A in this basis is a diagonal matrix! This means that the structure of such a linear map A would be extremely easy to understand: it is essentially just scaling different parts of the space V by different values λi. Unfortunately, it is not always the case that we can find such a decomposition of V. But in a lot of cases, like self-adjoint linear maps, such decompositions — known as eigendecompositions — do actually exist.

4.2.1 Eigenvalues and Eigenvectors

To continue the above discussion, we must first define what eigenvalues and eigenvectors are.

Definition 4.2 (Eigenvalues and Eigenvectors). A nonzero vector v ∈ V in a C-vector space V is an eigenvector of a linear map A : V → V if we have that

A(v) = λv , (110)

for some scalar λ ∈ C called the corresponding eigenvalue.

As discussed previously, eigenvectors are useful because they form one-dimensional invariant subspaces. However, it is possible for these invariant subspaces to be larger than just one dimension. In general,

Definition 4.3 (Eigenspace). The set of all eigenvectors W ⊂ V for some eigenvalue λ (together with the zero vector) is called an eigenspace of A. We will denote this Eig_λ(A). It is easy to verify that an eigenspace is actually a vector subspace.

4.2.2 The Characteristic Polynomial

The next relevant question is: how can we actually find the eigenvalues of a linear map A? This is actually not that difficult, as we note that if v is an eigenvector of A with eigenvalue λ, then

(A − λI)v = Av − λv = λv − λv = 0 . (111)

Therefore, A has an eigenvalue λ if and only if the operator A − λI is singular. And the operator A − λI is singular if and only if det(A − λI) = 0. Therefore, we have that

Lemma 4.2. λ is an eigenvalue of a linear map A : V → V if and only if it is a root of the polynomial

P_A(x) ≡ det(A − xI) . (112)

The polynomial P_A(x) is referred to as the characteristic polynomial of A. In particular, this means an n-dimensional matrix A can have at most n different eigenvalues, since this determinant will be a polynomial in x of degree n. Note, however, that it need not be the case that P_A has n distinct roots. Sometimes P_A may have a root with multiplicity. This gives rise to the notion of algebraic multiplicity,

Definition 4.4 (Algebraic Multiplicity). The algebraic multiplicity of an eigenvalue λ of a linear map A is the multiplicity of λ as a root of the characteristic polynomial P_A(x).

Note, however, that there is another notion of multiplicity when it comes to eigenvalues.

Definition 4.5 (Geometric Multiplicity). The geometric multiplicity of an eigenvalue λ is the dimension of the eigenspace Eigλ(A).
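For example, for A = [[2, 1], [1, 2]], we compute P_A(x) = det(A − xI) = (2 − x)² − 1 = (x − 1)(x − 3), so the eigenvalues are 1 and 3, with eigenvectors (1, −1) and (1, 1) respectively; each eigenvalue has algebraic and geometric multiplicity 1.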

Nota Bene: In general, the geometric and algebraic multiplicities of eigenvalues need not be the same. For example, consider the matrix

A = [ 1 1 ]
    [ 0 1 ] .   (113)

The characteristic polynomial of this matrix is given by P_A(x) = (1 − x)², so the eigenvalue 1 has algebraic multiplicity 2. However, the geometric multiplicity of the eigenvalue 1 turns out to be 1, not 2 (its eigenspace is spanned by (1, 0) alone).

A final interesting point about the characteristic polynomial is that the operator A when substituted into its characteristic polynomial results in the null operator,

Theorem 4.3 (Cayley-Hamilton). For a linear map A : V → V,

P_A(A) = 0 , (114)

where P_A(A) is obtained by replacing the occurrences of x in P_A(x) with A.

Another important part of the spectral structure of a linear map is the notion of a minimal polynomial. As we saw above, substituting A into P_A(x) gives the null operator 0. However, one can use the structure of operator rings to get an interesting result about A,

Lemma 4.4. There exists a unique monic⁵ polynomial χ_A(x) with χ_A(A) = 0 which divides every polynomial p(x) satisfying p(A) = 0. This polynomial is referred to as the minimal polynomial of A.

Nota Bene: In general, the minimal and the characteristic polynomials are different. For example, consider the 2 × 2 identity matrix I2. The characteristic polynomial of I2 is

P_{I2}(x) = (1 − x)² , (115)

but the minimal polynomial of I2 is

χ_{I2}(x) = (1 − x) . (116)

⁵Monic means the leading coefficient is 1.

In some cases, however, they can be the same. For the matrix given in eq. 113, the two polynomials are the same.

4.3 The Spectral Decomposition for Self-Adjoint Linear Maps

As mentioned earlier, it may not always be the case that we are able to decompose a space V into individual one-dimensional eigenspaces. However, there are many situations where it is possible to do this. One such situation is the case of self-adjoint linear maps, as introduced in section 3.4.2. Fortunately, self-adjointness is a strong condition that enforces that a linear map have a very regular spectral structure. To see how this structure comes about, let us prove the spectral theorem,

Theorem 4.5 (Spectral Theorem for Self-Adjoint Operators). Suppose A : V → V is self-adjoint with respect to an inner product ⟨·, ·⟩ : V × V → C. Then, it is the case that:

1. Every eigenvalue of A is real.

2. There must exist a basis v_1, ..., v_n for V satisfying:

(a) Each v_i is an eigenvector of A.

(b) The v_i are orthonormal with respect to ⟨·, ·⟩.

Proof. First, we prove result (1). Let λ_i be an eigenvalue of A with eigenvector v_i. Then,

λ_i⟨v_i, v_i⟩ = ⟨λ_iv_i, v_i⟩ = ⟨Av_i, v_i⟩ = ⟨v_i, Av_i⟩ = ⟨v_i, λ_iv_i⟩ = λ_i^*⟨v_i, v_i⟩ . (117)

Since ⟨v_i, v_i⟩ ≠ 0, it must be the case that λ_i = λ_i^*, and hence λ_i is real.

The rest of the proof is by induction on the dimension of V. The base case is the case where dim V = 1. In this case A acts as a scalar and every vector in V is an eigenvector; hence, we can easily find the basis needed in (2). Now, for the inductive step, suppose w is an eigenvector of A with eigenvalue λ. Note that an eigenvector always exists because P_A(x) must have at least one root in C. Let W = span{w} and consider the perpendicular space W^⊥, defined as

W^⊥ ≡ { u | ⟨u, w⟩ = 0 for all w ∈ W } . (118)

Note that W^⊥ is a vector subspace. Furthermore, note that

A(W^⊥) ⊂ W^⊥ . (119)

This is because, for all u ∈ W^⊥ and w ∈ W,

⟨Au, w⟩ = ⟨u, Aw⟩ = ⟨u, λw⟩ = λ⟨u, w⟩ = 0 . (120)

Therefore, we can decompose V as

V = W ⊕ W^⊥ , (121)

where both W and W^⊥ are A-invariant. Moreover, since W^⊥ has strictly smaller dimension than V, our inductive hypothesis tells us that there exists a basis v_1, ..., v_{n−1} for W^⊥ satisfying the properties in (2). To extend this basis to a basis for V, we simply normalize w and let v_n = w. Then, since w ∈ W and v_i ∈ W^⊥ for 1 ≤ i ≤ n − 1, we have that ⟨w, v_i⟩ = 0 for 1 ≤ i ≤ n − 1, and hence property (2) is satisfied.

To convert this result into a statement about matrices, we consider the L2 inner product on C^n. The change of basis formula then gives us that

Corollary 4.5.1. Suppose A ∈ C^{n×n} is Hermitian (i.e., A^* = A). Then there exists a unitary⁶ matrix U ∈ C^{n×n} and a real diagonal matrix Σ ∈ R^{n×n} such that

A = UΣU^* . (122)

There is also the case to consider where A ∈ R^{n×n} is real symmetric. In this case, A is still Hermitian, so the spectral theorem applies. Moreover, the eigenvectors can be chosen to be real-valued, since the eigenvalues λ_i are real, hence the matrices A − λ_iI are real, and thus the kernels ker(A − λ_iI) = Eig_{λ_i}(A) admit real bases. So this gives us

⁶A unitary matrix U is one such that U^* = U^{−1}.

Corollary 4.5.2. Suppose A ∈ R^{n×n} is real symmetric (i.e., A^T = A). Then there exists an orthogonal⁷ matrix Q ∈ R^{n×n} and a real diagonal matrix Σ ∈ R^{n×n} such that

A = QΣQ^T . (123)

In general, to prove such a spectral theorem, it is actually sufficient for the linear map A to be normal (i.e., AA∗ = A∗A). However, a proof for this is slightly more complicated than the one given above for self-adjoint maps.

A spectral decomposition A = QΣQ^T is extremely useful when computing matrix powers of A, since we have that A^k = QΣ^kQ^T and Σ^k is very easy to compute, whereas A^k may be much more computationally expensive to compute.
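As a quick illustration (mine, not the author's), the following sketch computes a matrix power through the spectral decomposition using np.linalg.eigh, which returns the eigen-decomposition of a symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T                          # a symmetric test matrix

# Spectral decomposition A = Q diag(sigma) Q^T (eigh assumes symmetry).
sigma, Q = np.linalg.eigh(A)

k = 10
# A^k = Q diag(sigma^k) Q^T: only the n eigenvalues get raised to the power.
A_pow = Q @ np.diag(sigma**k) @ Q.T

# Compare against repeated matrix multiplication.
print(np.allclose(A_pow, np.linalg.matrix_power(A, k)))  # True
```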

4.4 The Singular Value Decomposition

The spectral decomposition of a matrix in the previous section is probably one of the nicest decompositions there is. Unfortunately, it does not exist for all matrices. However, there is a similar decomposition which does exist for all matrices, and this is the singular value decomposition. Roughly, the singular value decomposition is derived by noting that the image of the unit sphere S^n under a linear map A : R^n → R^m is an ellipsoid. Every ellipsoid has a set of perpendicular major axes. The vectors corresponding to these axes are known as the left singular vectors, the lengths of these axes are the singular values, and the preimages of these axes are the right singular vectors. To make this precise, we prove the following,

Theorem 4.6 (Singular Value Decomposition). Every matrix A ∈ R^{m×n} can be written as

A = UΣV^T , (124)

where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal matrices whose columns are the left and right singular vectors of A respectively, and Σ ∈ R^{m×n} is a diagonal matrix whose entries are the singular values of A.

Proof. By induction. Since V is orthogonal, the above theorem statement is equivalent to AV = UΣ. Therefore, it suffices to show that for k = min(m, n), there exist two sets of orthonormal vectors u_1, ..., u_k ∈ R^m and v_1, ..., v_k ∈ R^n such that

Av_i = σ_iu_i . (125)

The case where either n = 1 or m = 1 is therefore trivial. For m > 1 and n > 1, to find the first left singular vector, we want to find the longest axis of the image of the sphere S^n. Therefore, what we care about is

argmax_{‖v_1‖_2 = 1} ‖Av_1‖_2 . (126)

Let this v_1 be the first right singular vector, let σ_1 = ‖Av_1‖_2 be the first singular value, and let u_1 = Av_1/σ_1 be the first left singular vector. Let W ≡ span{v_1}. We claim that

A(W) ⊥ A(W^⊥) , (127)

that is, any element in A(W) is orthogonal to every element in A(W^⊥). To see why this must be the case, suppose for contradiction that A(W) is not perpendicular to A(W^⊥). Then there exists a vector w^⊥ ∈ W^⊥ with norm ‖w^⊥‖_2 = 1 such that

⟨Av_1, Aw^⊥⟩ > 0 , (128)

(replacing w^⊥ with −w^⊥ if necessary to make the inner product positive). Now, note that, because ⟨v_1, w^⊥⟩ = 0, we know that, for 0 ≤ α ≤ 1,

‖√(1 − α^2) v_1 + αw^⊥‖_2^2 = (1 − α^2)‖v_1‖_2^2 + α^2‖w^⊥‖_2^2 = (1 − α^2) + α^2 = 1 . (129)

⁷An orthogonal matrix Q is one such that Q^T = Q^{−1}.

We claim that we can choose α such that the image of the linear combination √(1 − α^2) v_1 + αw^⊥ under A has a strictly greater norm than Av_1, which will be a contradiction to our choice of v_1. Note that the square of this norm is given by

‖A(√(1 − α^2) v_1 + αw^⊥)‖_2^2 = (1 − α^2)‖Av_1‖_2^2 + 2α√(1 − α^2)⟨Av_1, Aw^⊥⟩ + α^2‖Aw^⊥‖_2^2
                               = ‖Av_1‖_2^2 + 2α⟨Av_1, Aw^⊥⟩ + O(α^2) , (130)

where O(α^2) hides higher order terms in α. As we can see now, since ⟨Av_1, Aw^⊥⟩ is positive, choosing α > 0 small enough will make the above term larger than ‖Av_1‖_2^2. But this contradicts the definition of v_1. Therefore, we must have A(W) ⊥ A(W^⊥).

Now, since W^⊥ has lower dimension than the full space, we know by the inductive hypothesis that a singular value decomposition exists for the restriction of A to the space W^⊥. This gives us two sets of orthonormal vectors u_2, ..., u_k ∈ A(W^⊥) and v_2, ..., v_k ∈ W^⊥ such that

Av_i = σ_iu_i . (131)

Noting that v_1 is orthogonal to v_2, ..., v_k, and Av_1 is orthogonal to u_2, ..., u_k (as we proved above), adding u_1, v_1, and σ_1 to this set of singular vectors and values completes the proof.

The singular value decomposition is an extremely useful construct in numerical linear algebra. It is, in general, also the most numerically stable way to solve a linear system. However, computing it is typically more expensive than computing other matrix decompositions. Note that, if a matrix is symmetric positive semi-definite, then the matrix's spectral decomposition and singular value decomposition are one and the same.

The ratio between the largest singular value and the smallest singular value is known as the condition number of a matrix, denoted

κ(A) = σ_1(A)/σ_n(A) . (132)

The condition number is an indicator of how difficult a matrix system is to solve, as well as how sensitive a matrix system is to noise.
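For example, a minimal NumPy sketch (my own) computing κ(A) from the singular values and checking it against the built-in np.linalg.cond:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

# Singular values come back sorted in descending order.
singular_values = np.linalg.svd(A, compute_uv=False)
kappa = singular_values[0] / singular_values[-1]

# NumPy's built-in 2-norm condition number should agree.
print(kappa, np.linalg.cond(A, 2))
```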

4.5 Classification Up To Similarity: The Jordan Decomposition

The singular value decomposition is a useful tool, but in general the eigenvalues and the singular values of a linear map can be quite different when the map is not self-adjoint. In this sense, the singular value decomposition doesn't tell us much about the spectral structure of a linear map. Fortunately, however, there is a theorem which completely classifies all possible spectral structures which a linear map can have. This theorem is called the Jordan-Normal Theorem, and the corresponding decomposition that goes with it is called the Jordan decomposition. We will not prove the Jordan-Normal Theorem here because it is relatively involved. However, I encourage the reader to work through a proof of the Jordan-Normal Theorem online because it is a fundamental result of linear algebra and is quite useful in understanding the behavior of iterative methods in numerical linear algebra. To set the stage for the Jordan decomposition, recall the change of basis formula,

B = P^{−1}AP . (133)

That is, the matrices B and A represent the same linear map up to a choice of basis if they are related by conjugation by an invertible matrix P. Recall that we say that A and B are similar if such a P exists. Moreover, similarity is an equivalence relation, and therefore it splits the set of matrices up into equivalence classes. Therefore, the natural question to ask is: can we characterize all equivalence classes of matrices under similarity, or rather, can we characterize all matrices up to a change of basis? The answer is given quite definitively by the Jordan-Normal Theorem. Before we give the theorem statement, let us define two things. First,

Definition 4.6 (The Shift Matrix). The shift matrix of dimension k, denoted N_k ∈ R^{k×k}, is the matrix with ones directly above the diagonal and zeroes everywhere else.

Note that the shift matrix N_k is nilpotent of degree k, since (N_k)^k = 0, but (N_k)^j ≠ 0 for j < k. Secondly, let us define

Definition 4.7 (Direct Sum). The direct sum A ⊕ B of two matrices A and B is given by the block matrix

A ⊕ B = [ A  0
          0  B ] . (134)

With these preliminaries, we can state the theorem,

Theorem 4.7 (Jordan-Normal Theorem). Let A ∈ R^{n×n} be a matrix. Then, A is similar (possibly over C, since the eigenvalues λ_i may be complex) to a direct sum of the following form:

A ∼ ⊕_i (λ_iI_{k_i} + N_{k_i}) . (135)

The blocks λ_iI_{k_i} + N_{k_i} are called Jordan blocks, and each Jordan block λ_iI_{k_i} + N_{k_i} represents an A-invariant subspace of dimension k_i. This theorem essentially says that R^n can be decomposed into A-invariant subspaces

R^n = ⊕_i W_i , (136)

where on each subspace W_i, the operator A acts like λ_iI_{k_i} + N_{k_i}.

If the above eq. (135) is the Jordan decomposition of a matrix A, then the characteristic polynomial of A is given by

P_A(x) = ∏_i (λ_i − x)^{k_i} . (137)

Moreover, the λ_i are the eigenvalues of A.

If the above eq. (135) is the Jordan decomposition of the matrix A, let Λ denote the set of distinct eigenvalues, and for each eigenvalue let q(λ) denote the maximum size of a Jordan block for that eigenvalue. Then we have that

χ_A(x) = ∏_{λ∈Λ} (λ − x)^{q(λ)} . (138)

For a matrix A, the geometric multiplicity of an eigenvalue λ_i is the number of Jordan blocks with that eigenvalue.

One easy result of the Jordan-Normal Theorem is that

det(A) = ∏_i λ_i , (139)

where the λ_i are the eigenvalues of A, counted with algebraic multiplicity. The easy way to see this is that A is similar to a direct sum of Jordan blocks, and hence A and the direct sum have the same determinant. Then, the statement follows from the fact that the Jordan blocks are upper triangular and the determinant of an upper triangular matrix is the product of the diagonal entries (you should check this yourself!). In this case, those diagonal entries are the eigenvalues of A.

Note also that the trace of A (the sum of all diagonal entries) happens to be invariant under conjugation as well. Therefore, we also have that

tr(A) = ∑_i λ_i . (140)
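Both identities are easy to check numerically; the following snippet (my own illustration) verifies them for a random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

eigenvalues = np.linalg.eigvals(A)   # complex in general

# det(A) is the product of the eigenvalues, tr(A) is their sum
# (both counted with algebraic multiplicity).
print(np.isclose(np.prod(eigenvalues), np.linalg.det(A)))   # True
print(np.isclose(np.sum(eigenvalues), np.trace(A)))         # True
```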

I find the Jordan-Normal Theorem to be an incredible result, as it essentially completely characterizes all the possible algebraic structures of linear maps from R^n to R^n. Moreover, it's incredible that there aren't actually that many possibilities: on each invariant subspace, every linear map essentially acts as the sum of a scaled identity operator and a nilpotent operator.

4.6 Classification Up To Congruence: Signatures and the Gramm-Schmidt Process

For this section, we take a look at the final equivalence relation between matrices, congruence:

B = P^T AP . (141)

Recall that A ∈ R^{n×n} and B ∈ R^{n×n} are congruent as above when they represent the same bilinear form, except under different bases. Moreover, in this section we will also make the assumption that the bilinear forms under consideration are symmetric and nondegenerate, and hence act like a pseudo inner product. If this is the case, then it turns out we can classify all such symmetric bilinear forms by using a concept called the "signature." To explain further, let me state the theorem,

Theorem 4.8 (Gramm-Schmidt). Let B : V × V → R be a symmetric bilinear form which is nondegenerate, i.e., such that there exists no nonzero v ∈ V for which

B(v, v) = 0 . (142)

Then, there exists a basis v_1, ..., v_n for V with the property that

B(v_i, v_j) = 0 for i ≠ j, and B(v_i, v_i) = ±1 . (143)

The signature of the bilinear form B is defined as the tuple (a_+, a_−), where a_+ is the number of v_i for which B(v_i, v_i) = 1 and a_− is the number of v_i for which B(v_i, v_i) = −1.

Proof. We build this basis incrementally using the following algorithm.

1. Assume that we already have chosen v_1, ..., v_{k−1}.

2. Let w_k be a vector which is not in span{v_1, ..., v_{k−1}}.

3. Project out all the components of w_k which lie in the direction of vectors we've already selected, i.e., let

ṽ_k ≡ w_k − ∑_{i=1}^{k−1} B(w_k, v_i) v_i / B(v_i, v_i) . (144)

4. Normalize ṽ_k:

v_k ≡ ṽ_k / √|B(ṽ_k, ṽ_k)| . (145)

5. Repeat until we have n vectors.

Note that since w_k is not in the span of v_1, ..., v_{k−1}, the vector ṽ_k must be nonzero and also not in the span of v_1, ..., v_{k−1}. And since v_k is a nonzero scalar multiple of ṽ_k, it therefore follows that v_k is not in the span of v_1, ..., v_{k−1}. Hence, v_1, ..., v_n form a basis for V. Moreover, since v_i is a nonzero multiple of ṽ_i, we have inductively that, for j < i,

B(ṽ_i, v_j) = B( w_i − ∑_{l=1}^{i−1} B(w_i, v_l) v_l / B(v_l, v_l) , v_j )
            = B(w_i, v_j) − ∑_{l=1}^{i−1} B(w_i, v_l) B(v_l, v_j) / B(v_l, v_l)        (146)
            = B(w_i, v_j) − B(w_i, v_j) B(v_j, v_j) / B(v_j, v_j) = 0 ,

and hence B(v_i, v_j) = 0, where we have used the inductive hypothesis that B(v_l, v_j) = 0 for l ≠ j and l < i (so only the l = j term survives in the sum), as well as the fact that v_j is normalized so that B(v_j, v_j) = ±1 ≠ 0.
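Below is a minimal NumPy sketch (mine; the function name signature_basis is hypothetical, and no attention is paid to numerical robustness) of the algorithm above for the bilinear form B(v, w) = v^T A w given by a symmetric matrix A. Anticipating the triangular choice discussed below, it simply tries w_k = e_k at each step and fails loudly if that choice hits a degenerate direction:

```python
import numpy as np

def signature_basis(A, tol=1e-12):
    """Gramm-Schmidt for the bilinear form B(v, w) = v^T A w (A symmetric).

    Returns P whose columns v_k satisfy B(v_i, v_j) = 0 for i != j and
    B(v_k, v_k) = +/-1, plus the signature (a_plus, a_minus).
    """
    n = A.shape[0]
    B = lambda v, w: v @ A @ w
    basis = []
    for k in range(n):
        w = np.eye(n)[:, k]                      # try w_k = e_k (triangular choice)
        for v in basis:                          # project out previously chosen directions
            w = w - (B(w, v) / B(v, v)) * v
        if abs(B(w, w)) < tol:
            raise ValueError("hit a degenerate direction; pick w_k differently")
        basis.append(w / np.sqrt(abs(B(w, w))))  # normalize so B(v_k, v_k) = +/-1
    P = np.column_stack(basis)
    diag = np.diag(P.T @ A @ P)
    return P, (int(np.sum(diag > 0)), int(np.sum(diag < 0)))

# Example: the space-time-like form diag(1, 1, 1, -1) has signature (3, 1).
A = np.diag([1.0, 1.0, 1.0, -1.0])
P, sig = signature_basis(A)
print(np.round(P.T @ A @ P, 10))   # diagonal matrix with entries +/-1
print(sig)                          # (3, 1)
```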

Fun fact: In the theory of general relativity, 4 × 4 matrices with signature (3, 1) play a very important role as metrics on space-time: the three positive directions are space-like, and the one negative direction is time-like.

In matrix terms the above theorem can be formulated as

Corollary 4.8.1. Let A ∈ R^{n×n} be a nonsingular symmetric matrix. Then, there exists a nonsingular matrix P such that

P^T AP = Σ , (147)

where Σ is a diagonal matrix with entries ±1.

Proof. Let B(v, w) = v^T Aw and let P be the matrix whose columns are the vectors v_k from the above proof of Gramm-Schmidt.

Note that this result can also be easily derived from the spectral theorem. However, what is less obvious, and cannot be proved directly from the spectral theorem, is that we can actually take the above P to be triangular. Why is this? Well, instead of choosing w_k arbitrarily from V, we can let w_k = e_k, where e_k is the k-th Kronecker basis vector. Everything about the proof still holds, and we end up with the property that

v_k ∈ span{e_1, ..., e_k} .

Corollary 4.8.2. Let A ∈ R^{n×n} be a nonsingular symmetric matrix. Then, there exists a nonsingular upper triangular matrix U ∈ R^{n×n} such that

U^T AU = Σ . (148)

Finally, if A is positive definite, then it must be the case that all of the entries of Σ are positive, so Σ = I. Moreover, note that the inverse of an upper triangular matrix is also upper triangular. The following corollary therefore follows,

Corollary 4.8.3 (Cholesky Decomposition). Let A ∈ R^{n×n} be a nonsingular positive-definite symmetric matrix. Then, there exists a nonsingular lower triangular matrix L ∈ R^{n×n} such that

A = LL^T . (149)
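As a quick aside (mine, not the author's), NumPy exposes this factorization directly: np.linalg.cholesky returns the lower triangular factor.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)         # symmetric positive-definite by construction

L = np.linalg.cholesky(A)           # lower triangular factor
print(np.allclose(A, L @ L.T))      # True
print(np.allclose(L, np.tril(L)))   # L really is lower triangular
```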

In conclusion, the equivalence classes of symmetric nonsingular matrices under congruence are given by the signatures (a_+, a_−).

5 Basic Numerics

Now that we have gained a strong appreciation of the basic structural properties of linear maps, let us turn to the central problem of numerical linear algebra, namely, finding an x ∈ R^n such that

Ax = b , (150)

for given A ∈ R^{n×n} and b ∈ R^n. How do we solve for such an x ∈ R^n? Well, the naive thing to do would be to rewrite the above formula as

x = A^{−1}b , (151)

then compute A^{−1} using the cofactor formula and apply it to b. Don't do this. In numerics, the explicit A^{−1} is almost never computed outright, because such a computation is usually quite unstable. Instead, we usually try to decompose A into factors corresponding to special types of linear systems which are much easier to invert, and then invert those. That means, instead of storing A^{−1} explicitly, we store a decomposition of A and then invert each factor using a special procedure. Luckily, there are only really three fundamental systems in linear algebra which we need to know how to solve outright.

5.1 The Three Fundamental Linear Systems

5.1.1 Diagonal Systems

The first type of fundamental linear system is the diagonal linear system, where A is a diagonal matrix Σ and we need to solve

Σx = b . (152)

This system is easy to solve because all the variables are decoupled. Writing the above matrix equation out, we get

σ_ix_i = b_i . (153)

And hence,

x_i = b_i / σ_i . (154)

System solved.

5.1.2 Orthogonal Systems

The second type of linear system, which is slightly more advanced (but not much), is the orthogonal linear system. In this case, we have

Qx = b , (155)

for an orthogonal matrix Q. Luckily, this is also really easy to solve because Q^{−1} = Q^T. Therefore, we simply apply Q^T to both sides,

x = Q^T b . (156)

System solved.

5.1.3 Triangular Systems

The third type of linear system is the triangular linear system. Let us consider lower triangular systems first. Suppose A is a lower triangular matrix L ∈ R^{n×n} and we want to solve

Lx = b . (157)

To see how this system can be easily solved, let us write it out more explicitly as

l_{11}x_1 = b_1 ,
l_{21}x_1 + l_{22}x_2 = b_2 ,
⋮                                  (158)
l_{n1}x_1 + l_{n2}x_2 + ... + l_{nn}x_n = b_n .

Some easy manipulation will give us

x_1 = (1/l_{11}) b_1 ,
x_2 = (1/l_{22}) (b_2 − l_{21}x_1) ,
⋮                                  (159)
x_n = (1/l_{nn}) (b_n − l_{n1}x_1 − l_{n2}x_2 − ... − l_{n,n−1}x_{n−1}) .

And we're done! Why? Well, because the i-th equation above for x_i only depends on the values of x_1, ..., x_{i−1}. Therefore, we can just forward substitute the values of the x_i into the above equations one by one, and we've solved the triangular system. If only all things in life were this easy.
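Here is a minimal forward-substitution sketch in NumPy following eq. (159) (my own; the function name forward_substitute is hypothetical):

```python
import numpy as np

def forward_substitute(L, b):
    """Solve Lx = b for lower triangular L by forward substitution (eq. 159)."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        # x_i depends only on the already-computed x_0, ..., x_{i-1}.
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 5.0, 6.0]])
b = np.array([2.0, 5.0, 32.0])
x = forward_substitute(L, b)
print(x)
print(np.allclose(L @ x, b))   # True
```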

5.2 Solving with Gaussian Elimination

Okay. Well, that's all good and dandy. But how do we solve a real matrix system? Well, there's a saying about mathematicians:

Extremely important: When a normal person wakes up and sees that their house is on fire, they call the fire department. When a mathematician wakes up and sees that their house is on fire, they call the fire department. When a normal person wakes up and sees that their house is not on fire, they go back to sleep. When a mathematician wakes up and sees that their house is not on fire, they set their house on fire and reduce to the previous problem.

And so in the spirit of the mathematician above (who was later jailed for arson), we will do what all great mathematicians do: be lazy and reduce this problem to something we’ve already solved. First, let’s write out the matrix system:

a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n = b_1 ,
a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = b_2 ,
⋮                                  (160)
a_{n1}x_1 + a_{n2}x_2 + ... + a_{nn}x_n = b_n .

Let's try to massage this system of equations. Assuming a_{11} ≠ 0, we can divide the first equation by a_{11} to get

x_1 + a'_{12}x_2 + ... + a'_{1n}x_n = b'_1 ,
a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = b_2 ,
⋮                                  (161)
a_{n1}x_1 + a_{n2}x_2 + ... + a_{nn}x_n = b_n .

I'm using a prime to denote that we've altered a value in the above system. Now, here comes the cool part: we can take the first equation, multiply it by a_{i1} (for i > 1), and subtract that scaled version of the first equation from equation i in the system above. This will give us

x_1 + a'_{12}x_2 + ... + a'_{1n}x_n = b'_1 ,
      a'_{22}x_2 + ... + a'_{2n}x_n = b'_2 ,
⋮                                  (162)
      a'_{n2}x_2 + ... + a'_{nn}x_n = b'_n .

And bam! Now, x_1 only shows up in equation number 1. This move is called a Gauss-Jordan transformation. And now, assuming that a'_{22} ≠ 0, we can do the same thing again and continue to transform the system until we arrive at

x_1 + u_{12}x_2 + u_{13}x_3 + ... + u_{1n}x_n = c_1 ,
      x_2 + u_{23}x_3 + ... + u_{2n}x_n = c_2 ,
            x_3 + ... + u_{3n}x_n = c_3 ,        (163)
⋮
                        x_n = c_n ,

for some values u_{ij} and c_i. But note that this is an upper triangular system! And we know how to solve such a system from the previous section. Therefore, we can now solve for x. However, note that this solution method depended critically on the assumption that the ii-th matrix entry of the system is nonzero when we reach it. This matrix entry is called a pivot. In practice, it is sometimes the case that the pivot is zero when we reach it. To combat this, one can use a strategy called pivoting: one can either reorder the equations (row pivoting) or reorder the x_i's (column pivoting) until the entry is nonzero. Moreover, it turns out a Gauss-Jordan transformation is a linear transformation and can be represented as a matrix E_i. Therefore, what Gaussian Elimination essentially does is apply Gauss-Jordan transformations E_i to both sides of the equation

Ax = b , (164)

until we end up with

Ux = E_mE_{m−1}...E_1b , (165)

where U is upper triangular. It turns out that the E_i are actually lower triangular matrices, so, moving the E_i's to the left-hand side in the above equation, A can be rewritten as

A = E_1^{−1}E_2^{−1}...E_m^{−1}U . (166)

Recall that the inverse of a lower triangular matrix is also lower triangular and that the composition of two lower triangular matrices is lower triangular. Therefore, the above can be written as

A = LU , (167)

for some lower triangular matrix L. This is called the LU-decomposition of A. In practice, the E_i have very simple forms, and their inverses and compositions are easy to compute. Basic methods for computing the LU-decomposition work by doing Gaussian Elimination on A to obtain U and, at the same time, composing the inverse Gauss-Jordan transforms E_i^{−1} to obtain L. Note that once an LU decomposition for A has been computed, we can solve any linear system involving A by using

x = U^{−1}L^{−1}b , (168)

where to apply L^{−1}, we use the forward substitution algorithm we saw in section 5.1.3, and to apply U^{−1} we use backward substitution (same principle as forward substitution, but in reverse).
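To make this concrete, here is a minimal unpivoted LU sketch in NumPy (my own illustration, using the common convention that L has a unit diagonal; production code such as scipy.linalg.lu_factor always pivots for stability):

```python
import numpy as np

def lu_decompose(A):
    """Unpivoted LU decomposition via Gaussian elimination (assumes nonzero pivots)."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]        # multiplier from the Gauss-Jordan step
            U[i, k:] -= L[i, k] * U[k, k:]     # eliminate x_k from equation i
    return L, U

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # keeps the pivots away from zero
L, U = lu_decompose(A)
print(np.allclose(A, L @ U))                      # True

# Solving Ax = b then amounts to one forward and one backward substitution;
# here we just use a generic solve on each triangular factor for brevity.
b = rng.standard_normal(4)
x = np.linalg.solve(U, np.linalg.solve(L, b))
print(np.allclose(A @ x, b))                      # True
```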

The LU-decomposition is not necessarily the most stable way to compute the solution to a linear system. But, to compensate, it is one of the cheaper direct methods available and it is easy to implement.

5.3 Solving with the Gramm-Schmidt Process

Remember the Gramm-Schmidt process from section 4.6? Of course you do! It turns out you can also use it (with the standard inner product as the bilinear form) to solve linear systems. Here's the basic premise: let a_1, ..., a_n be the columns of the matrix A. Now, suppose that we run the Gramm-Schmidt process on those vectors to obtain an orthonormal basis q_1, ..., q_n. Then, by construction (assuming the columns of A are linearly independent, which they must be if A is invertible), we know that

span{a_1, ..., a_i} = span{q_1, ..., q_i} , for all i . (169)

Therefore, it is the case that

a_1 = r_{11}q_1 ,
a_2 = r_{21}q_1 + r_{22}q_2 ,
a_3 = r_{31}q_1 + r_{32}q_2 + r_{33}q_3 ,        (170)
⋮
a_n = r_{n1}q_1 + r_{n2}q_2 + ... + r_{nn}q_n .

Because the q_i are orthonormal, the coefficients r_{ij} are precisely the inner products r_{ij} = ⟨a_i, q_j⟩ which we compute during the Gramm-Schmidt process. Letting R be the upper triangular matrix whose (j, i) entry is r_{ij} and letting Q be the matrix with columns q_i, we therefore get that

A = QR . (171)

This is called the QR-decomposition of the matrix A, where Q is orthogonal and R is upper triangular. Unlike the LU decomposition, here we've decomposed A into an orthogonal matrix and a triangular matrix. Therefore, to solve a linear system with A in it, we can perform

x = R^{−1}Q^T b , (172)

where R^{−1} is applied using back substitution (the upper triangular analogue of the forward substitution algorithm in section 5.1.3).
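Here is a minimal classical Gramm-Schmidt QR sketch in NumPy (my own; as the note below points out, practical codes prefer Householder reflections or Modified Gramm-Schmidt):

```python
import numpy as np

def gram_schmidt_qr(A):
    """QR decomposition via classical Gramm-Schmidt (assumes independent columns)."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        v = A[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]    # r_ij = <a_i, q_j>
            v -= R[j, i] * Q[:, j]          # project out the q_j direction
        R[i, i] = np.linalg.norm(v)
        Q[:, i] = v / R[i, i]               # normalize
    return Q, R

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)
Q, R = gram_schmidt_qr(A)
print(np.allclose(A, Q @ R))               # True

# Solve Ax = b as x = R^{-1} Q^T b (back substitution in practice).
x = np.linalg.solve(R, Q.T @ b)
print(np.allclose(A @ x, b))               # True
```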

The QR-decomposition is more stable than the LU decomposition. This is because orthogonal matrices are very well conditioned (they have condition number 1, the best possible). But the QR-decomposition is typically also more expensive to compute than the LU-decomposition.

While Gramm-Schmidt is useful for explaining the QR decomposition, the standard Gramm-Schmidt algorithm is usually not used for computing the QR decomposition. When the underlying matrix is square or fat, QR decompositions are usually computed using Householder reflections. And when the underlying matrix is skinny, QR decompositions are usually computed using a modified version of the Gramm-Schmidt algorithm known as Modified Gramm-Schmidt (MGS).

5.4 Solving with the Singular Value Decomposition

Recall the singular value decomposition from section 4.4,

A = UΣV^T , (173)

where U and V are orthogonal matrices and Σ is diagonal. We will not discuss how to actually compute the SVD here (that is an involved process, unfortunately). However, if we have the SVD, then a system is easy to invert, since

x = V Σ^{−1}U^T b . (174)
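In NumPy this is a one-liner once the SVD is available (a small sketch of mine):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

# Full SVD: A = U diag(s) V^T (NumPy returns V^T directly as Vt).
U, s, Vt = np.linalg.svd(A)

# x = V Sigma^{-1} U^T b, cf. eq. (174).
x = Vt.T @ ((U.T @ b) / s)
print(np.allclose(A @ x, b))   # True
```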

The SVD is typically the most numerically stable way to solve a linear system. To compensate, it is also typically the most expensive decomposition to compute.

5.4.1 Conditioning

The SVD is also useful for thinking about conditioning. Suppose that we are solving a linear system, but now there is a bit of noise or numerical error in the input vector b, i.e.,

Ax_err = b + e . (175)

Then, we have that

x_err = A^{−1}b + A^{−1}e . (176)

So the error between the solution we calculate and the true solution is going to be

x − x_err = −A^{−1}e . (177)

The magnitude of this error can be bounded by

‖x − x_err‖_2 = ‖A^{−1}e‖_2 ≤ ‖A^{−1}‖_2 ‖e‖_2 = σ_n^{−1} ‖e‖_2 , (178)

where we have used the fact that the 1st singular value of a matrix is its L2 operator norm, and that the 1st singular value of A^{−1} is the inverse of the n-th singular value of A (since A^{−1} has SVD V Σ^{−1}U^T). Conversely,

‖x‖_2 = ‖A^{−1}b‖_2 = √(b^T A^{−T}A^{−1} b) = √(b^T UΣ^{−1}V^T V Σ^{−1}U^T b)
      = √(b^T UΣ^{−2}U^T b) ≥ σ_1^{−1} √(b^T UU^T b) = σ_1^{−1} √(b^T b) = σ_1^{−1} ‖b‖_2 . (179)

Therefore, the relative error of the solution is bounded by

‖x − x_err‖_2 / ‖x‖_2 ≤ (σ_1/σ_n) ‖e‖_2/‖b‖_2 = κ(A) ‖e‖_2/‖b‖_2 , (180)

where recall that κ(A) is the condition number of the matrix A and ‖e‖_2/‖b‖_2 is the relative error in the input. Therefore, we see that error in the input to the linear system solver will, in the worst case, be amplified by a factor of κ(A) into error in the output. This is why the value κ(A) is a measure of how difficult a system is to solve: the bigger κ(A), the more little errors in the input can become large errors in the output.
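The following small experiment (mine) illustrates this amplification on a notoriously ill-conditioned matrix, a Hilbert matrix; note that floating-point rounding inside the solver contributes errors of roughly the same order, so the numbers are illustrative rather than exact:

```python
import numpy as np

n = 8
# Hilbert matrix: a classic ill-conditioned example.
i, j = np.indices((n, n))
A = 1.0 / (i + j + 1)

rng = np.random.default_rng(7)
x_true = rng.standard_normal(n)
b = A @ x_true

# Perturb the right-hand side by a tiny amount and re-solve.
e = 1e-10 * rng.standard_normal(n)
x_err = np.linalg.solve(A, b + e)

rel_in = np.linalg.norm(e) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_err - x_true) / np.linalg.norm(x_true)
print("kappa(A)            :", np.linalg.cond(A))
print("relative input error:", rel_in)
print("relative output err :", rel_out)   # can be up to ~kappa(A) times larger
```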

5.5 Least-Squares Problems

In some cases, instead of solving a linear system directly, we want to get Ax as close as possible to a target vector b, i.e., solve

min_x ‖Ax − b‖_2^2 . (181)

This is known as a least squares problem, and it shows up any time you have to fit some coefficients to some data, which, depending on what you're into, could be pretty often. For these types of problems, we usually assume that the matrix A is not square (typically it is tall, with more rows than columns, but has full column rank). One possible solution method is to resort to calculus and try to set the derivative of the objective function to zero. Define

f(x) ≡ ‖Ax − b‖_2^2 , (182)

and note that we can rearrange this expression,

f(x) = (Ax − b)^T(Ax − b) = x^T A^T Ax − 2b^T Ax + b^T b . (183)

Taking the derivative of this gives

∇f(x) = 2A^T Ax − 2A^T b . (184)

Setting this equal to zero gives us

A^T Ax = A^T b . (185)

These are called the normal equations, and they give us a square linear system for the best-fit minimizer x^*. The interpretation of the normal equations may be clearer if we rearrange some terms and instead write

A^T(Ax − b) = 0 . (186)

The vector Ax − b is called the residual; it is the difference between our solution Ax and our target b. What these equations say, therefore, is that we want the residual (i.e., our error) to be orthogonal to the range of A. This makes sense, since if Ax − b weren't orthogonal to the range of A, then we could perturb our Ax slightly in the range of A and get a result closer to b.

The normal equations may sometimes have advantages, but they are usually not the preferred method of solving least squares problems. The primary reason for this is that the matrix A^T A has condition number κ(A)^2, which means that the normal equations can be very ill-conditioned if A is ill-conditioned.

Another way of solving least squares problems is by using the QR-decomposition. Indeed, by rewriting the objective as

min_{y∈im(A)} ‖y − b‖_2^2 , (187)

we see that the image Ax of the solution to the minimization problem (181) is precisely the orthogonal projection of b onto the space im(A). Therefore, the problem of solving least squares can be reduced to the problem of projecting a vector onto the image of a matrix. How can we construct projectors onto im(A)? Well, if A is an orthogonal matrix Q, then the operator which projects onto im(Q) is very easy to write down, as it is given by QQ^T. Why is this? Well, let's look at what the normal equations are for the least squares problem

min_x ‖Qx − b‖_2^2 . (188)

They are given by

Q^T Qx = Q^T b . (189)

Therefore, since the projection of b onto im(Q) is given by Qx, rearranging the above gives us

Qx = QQ^T b . (190)

So QQ^T is a projection onto im(Q), and the least squares problem can be solved as above. But what happens when A is not orthogonal? Well, then we rely on the QR-decomposition,

A = QR , (191)

and note that the image of Q is the same as the image of A. Therefore, we know that

Ax = QQ^T b . (192)

And from this, we can substitute in A = QR,

QRx = QQ^T b . (193)

Now since the columns of Q are linearly independent, this must mean that

Rx = Q^T b , (194)

and thus,

x = R^{−1}Q^T b . (195)
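For instance, here is a minimal NumPy sketch (mine) solving a least squares problem through the thin QR decomposition and checking it against NumPy's SVD-based np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 3))    # tall, full column rank
b = rng.standard_normal(20)

# Thin QR: A = QR with Q (20 x 3) having orthonormal columns.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)  # eq. (195); back substitution in practice

# Reference solution from NumPy's least-squares routine.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_ref))     # True
```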

5.6 Basic Eigenvector Computation: The Power Method

As a final topic in our exploration of basic numerics, consider the problem of computing the largest eigenvalue and corresponding eigenvector of a symmetric positive-definite matrix A ∈ R^{n×n}. It turns out that this is equivalent to calculating the quantity

max_{‖x‖_2=1} ‖Ax‖_2 . (196)

To see why, note that for symmetric positive-definite matrices, the spectral decomposition and SVD are the same, so the first singular value is also the largest eigenvalue, and we know from our discussion on the SVD that σ_1 is given by the quantity in the above equation. One standard way of solving this problem is to use what is called the power method. The idea is that if we repeatedly apply A to a vector x, the component of x along the eigenvector corresponding to λ_1 will grow much faster than the components of x along the remaining eigenvectors. To see this, let's write

x = ∑_i α_iq_i , (197)

where q_i ∈ R^n are the eigenvectors of A and α_i ∈ R are some scalar coefficients. Then, repeated application of A to x gives

A^k x = ∑_i α_iA^k q_i = ∑_i α_iλ_i^k q_i . (198)

Hence, the error between A^k x / ‖A^k x‖_2 (our estimate of the λ_1-eigenvector, normalized) and q_1 (our target) is given by

‖A^k x / ‖A^k x‖_2 − q_1‖_2^2 = (1 − α_1λ_1^k / √(∑_i (α_iλ_i^k)^2))^2 + ∑_{i=2}^n (α_iλ_i^k / √(∑_i (α_iλ_i^k)^2))^2
                              = ((√(∑_i (α_iλ_i^k)^2) − α_1λ_1^k) / √(∑_i (α_iλ_i^k)^2))^2 + O(λ_2^{2k}/λ_1^{2k})        (199)
                              = ((α_1λ_1^k(1 + O(λ_2^{2k}/λ_1^{2k})) − α_1λ_1^k) / √(∑_i (α_iλ_i^k)^2))^2 + O(λ_2^{2k}/λ_1^{2k})
                              = O(λ_2^{2k}/λ_1^{2k}) .

Therefore, we can see that the L2 difference between our normalized estimate A^k x / ‖A^k x‖_2 and the first eigenvector q_1 is on the order of O(λ_2^k/λ_1^k). Our estimate converges exponentially fast to the target.

What about computing arbitrary eigenvalues and eigenvectors of the matrix A? Well, that's typically more involved. However, note that we can use tricks that transform the spectrum of A to get at eigenvalues other than the top one. In particular, if we have an estimate λ'_i for an eigenvalue λ_i such that the estimate λ'_i is closer to λ_i than to any other eigenvalue, then the matrix (A − λ'_iI)^{−1} shares its eigenvectors with the matrix A, and the top eigenvector of (A − λ'_iI)^{−1} is precisely the eigenvector of A corresponding to λ_i. Therefore, by using the power method on (A − λ'_iI)^{−1} we can compute other eigenvectors of A besides the top one!
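Below is a minimal power method sketch in NumPy (my own; the function name power_method is hypothetical), tested on a matrix with a known spectral gap so that convergence is fast:

```python
import numpy as np

def power_method(A, num_iters=100, seed=0):
    """Estimate the top eigenvalue/eigenvector of a symmetric positive-definite A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(num_iters):
        x = A @ x
        x /= np.linalg.norm(x)        # renormalize so the iterate stays on the unit sphere
    return x @ A @ x, x               # Rayleigh quotient and eigenvector estimate

# Build a test matrix with a known spectrum and a clear gap lambda_2 / lambda_1.
rng = np.random.default_rng(10)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([10.0, 3.0, 2.0, 1.0, 0.5]) @ Q.T

lam, q = power_method(A)
print(np.isclose(lam, 10.0))                      # True
print(np.allclose(A @ q, lam * q, atol=1e-8))     # True: q is (numerically) an eigenvector
```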
