Intermediate Linear Algebra

Version 2.1

Christopher Griffin © 2016-2020

Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

With Contributions By: Elena Kosygina

Contents

List of Figures

About This Document

Chapter 1. Vector Space Essentials
  1. Goals of the Chapter
  2. Fields and Vector Spaces
  3. Matrices, Row and Column Vectors
  4. Linear Combinations, Span, Linear Independence
  5. Basis
  6. Dimension

Chapter 2. More on Matrices and Change of Basis
  1. Goals of the Chapter
  2. More Matrix Operations: A Review
  3. Special Matrices
  4. Matrix Inverse
  5. Linear Equations
  6. Elementary Row Operations
  7. Computing a Matrix Inverse with the Gauss-Jordan Procedure
  8. When the Gauss-Jordan Procedure does not yield a Solution to Ax = b
  9. Change of Basis as a System of Linear Equations
  10. Building New Vector Spaces

Chapter 3. Linear Transformations
  1. Goals of the Chapter
  2. Linear Transformations
  3. Properties of Linear Transforms
  4. Image and Kernel
  5. Matrix of a Linear Transformation
  6. Applications of Linear Transforms
  7. An Application of Linear Algebra to Control Theory

Chapter 4. Determinants, Eigenvalues and Eigenvectors
  1. Goals of the Chapter
  2. Permutations
  3. Determinants
  4. Properties of the Determinant
  5. Eigenvalues and Eigenvectors
  6. Diagonalization and Jordan's Decomposition Theorem

Chapter 5. Orthogonality
  1. Goals of the Chapter
  2. Some Essential Properties of Complex Numbers
  3. Inner Products
  4. Orthogonality and the Gram-Schmidt Procedure
  5. QR Decomposition
  6. Orthogonal Projection and Orthogonal Complements
  7. Orthogonal Complement
  8. Spectral Theorem for Real Symmetric Matrices
  9. Some Results on A^T A

Chapter 6. Principal Components Analysis and Singular Value Decomposition
  1. Goals of the Chapter
  2. Some Elementary Statistics with Matrices
  3. Projection and Dimensional Reduction
  4. An Extended Example
  5. Singular Value Decomposition

Chapter 7. Linear Algebra for Graphs and Markov Chains
  1. Goals of the Chapter
  2. Graphs, Multi-Graphs, Simple Graphs
  3. Directed Graphs
  4. Matrix Representations of Graphs
  5. Properties of the Eigenvalues of the Adjacency Matrix
  6. Eigenvector Centrality
  7. Markov Chains and Random Walks
  8. Page Rank
  9. The Graph Laplacian

Chapter 8. Linear Algebra and Systems of Differential Equations
  1. Goals of the Chapter
  2. Systems of Differential Equations
  3. A Solution to the Linear Homogenous Constant Coefficient Differential Equation
  4. Three Examples
  5. Non-Diagonalizable Matrices

Bibliography

List of Figures

1.1 The subspace R^2 is shown within the subspace R^3.
2.1 (a) Intersection along a line of 3 planes of interest. (b) Illustration that the planes do not intersect in any common line.
2.2 The vectors for the change of basis example are shown. Note that v is expressed in terms of the basis given in the problem statement.
2.3 The intersection of two sub-spaces in R^3 produces a new sub-space of R^3.
2.4 The sum of two sub-spaces of R^2 that share only 0 in common recreates R^2.

3.1 The image and kernel of fA are illustrated in R^2.
3.2 Geometric transformations are shown in the figure above.
3.3 A mass moving on a spring is governed by Hooke's law, translated into the language of Newtonian physics as mẍ − kx = 0.
3.4 A mass moving on a spring given a push on a frictionless surface will oscillate indefinitely, following a sinusoid.
5.1 The orthogonal projection of the vector u onto the vector v.
5.2 The common plane shared by two vectors in R^3 is illustrated along with the triangle they create.
5.3 The orthogonal projection of the vector u onto the vector v.
5.4 A vector v generates the subspace W = span(v). Its orthogonal complement W⊥ is shown when v ∈ R^3.
6.1 An extremely simple data set that lies along the line y − 4 = x − 3, in the direction of ⟨1, 1⟩ and containing the point (3, 4).
6.2 The one dimensional nature of the data is clearly illustrated in this plot of the transformed data z.
6.3 A scatter plot of data drawn from a multivariable Gaussian distribution. The distribution density function contour plot is superimposed.
6.4 Computing Z = W^T Y^T creates a new uncorrelated data set that is centered at 0.
6.5 The data is shown projected onto a linear subspace (line). This is the best projection from 2 dimensions to 1 dimension under a certain measure of best.
6.6 A gray scale version of the image found at http://hanna-barbera.wikia.com/wiki/Scooby-Doo_(character)?file=Scoobydoo.jpg. Copyright Hanna-Barbera; used under the fair use clause of the Copyright Act.
6.7 The singular values of the image matrix corresponding to the image in Figure 6.6. Notice the steep decay of the singular values.
6.8 Reconstructed images from 15 and 50 singular values capture a substantial amount of detail for substantially smaller transmission sizes.
7.1 It is easier for explanation to represent a graph by a diagram in which vertices are represented by points (or squares, circles, triangles etc.) and edges are represented by lines connecting vertices.
7.2 A self-loop is an edge in a graph G that contains exactly one vertex. That is, an edge that is a one element subset of the vertex set. Self-loops are illustrated by loops at the vertex in question.
7.3 (a) A directed graph. (b) A directed graph with a self-loop. In a directed graph, edges are directed; that is, they are ordered pairs of elements drawn from the vertex set. The ordering of the pair gives the direction of the edge.
7.4 A walk (a) and a cycle (b) are illustrated.
7.5 A connected graph (a) and a disconnected graph (b).
7.6 The adjacency matrix of a graph with n vertices is an n × n matrix with a 1 at element (i, j) if and only if there is an edge connecting vertex i to vertex j; otherwise element (i, j) is a zero.
7.7 A graph with 4 vertices and 5 edges. Intuitively, vertices 1 and 4 should have the same eigenvector centrality score as vertices 2 and 3.
7.8 A Markov chain is a directed graph to which we assign edge probabilities so that the sum of the probabilities of the out-edges at any vertex is always 1.
7.9 An induced Markov chain is constructed from a graph by replacing every edge with a pair of directed edges (going in opposite directions) and assigning a probability equal to the out-degree of each vertex to every edge leaving that vertex.
7.10 A set of triangle graphs.
7.11 A simple social network.
7.12 A graph partition using positive and negative entries of the Fiedler vector.

8.1 The solution to the differential equation can be thought of as a vector of fixed unit length rotating about the origin.
8.2 A plot of representative solutions for x(t) and y(t) for the simple homogeneous linear system in Expression 8.25.
8.3 Representative solution curves for Expression 8.39 showing sinusoidal exponential growth of the system.
8.4 Representative solution curves for Expression 8.42 showing exponential decay of the system.


About This Document

This is a set of lecture notes. They are given away freely to anyone who wants to use them. You know what they say about free things, so you might want to get yourself a book. I like Serge Lang's Linear Algebra, which is part of the Springer Undergraduate Texts in Mathematics series. If you don't like Lang's book, I also like Gilbert Strang's Linear Algebra and its Applications. To be fair, I've only used the third edition of that book. The newer edition seems more like a tome, while the third edition was smaller and to the point.

The lecture notes were intended for SM361: Intermediate Linear Algebra, which is a breadth elective in the Mathematics Department at the United States Naval Academy. Since I use these notes while I teach, there may be typographical errors that I noticed in class but did not fix in the notes. If you see a typo, send me an e-mail and I'll add an acknowledgement. There may be many typos; that's why you should have a real textbook. (Because real textbooks never have typos, right?)

The material in these notes is largely based on Lang's excellent undergraduate linear algebra textbook. However, the applications are drawn from multiple sources outside of Lang. There are a few results that are stated but not proved in these notes:

• The formula det(AB) = det(A)det(B),
• The Jordan Normal Form Theorem, and
• The Perron-Frobenius theorem.

Individuals interested in using these notes as the middle part of a three-part Linear Algebra sequence should seriously consider proving these results in an advanced linear algebra course to complete the theoretical treatment begun here.

In order to use these notes successfully, you should have taken a course in matrices (elementary linear algebra). I review a substantial amount of the material you will need, but it's always good to have covered prerequisites before you get to a class. That being said, I hope you enjoy using these notes!


CHAPTER 1

Vector Space Essentials

1. Goals of the Chapter

(1) Review fields and vector spaces.
(2) Provide examples of vector spaces and sub-spaces.
(3) Introduce matrices and matrix/vector operations.
(4) Discuss linear combinations, span, and linear independence.
(5) Define basis. Prove uniqueness of dimension (finite case).

2. Fields and Vector Spaces

Definition 1.1 (Group). A group is a pair (S, ◦) where S is a set and ◦ : S × S → S is a binary operation so that:

(1) The binary operation ◦ is associative; that is, if s1, s2 and s3 are in S, then (s1 ◦ s2) ◦ s3 = s1 ◦ (s2 ◦ s3). (2) There is a unique identity element e ∈ S so that for all s ∈ S, e ◦ s = s ◦ e = s. (3) For every element s ∈ S there is an inverse element s−1 ∈ S so that s ◦ s−1 = s−1 ◦ s = e.

If ◦ is commutative, that is for all s1, s2 ∈ S we have s1 ◦ s2 = s2 ◦ s1, then (S, ◦) is called a commutative group (or abelian group).

Example 1.2. This course is not about group theory. If you're interested in groups in the more abstract sense, it's worth considering taking Abstract Algebra. One of the simplest examples of a group is the set of integers Z under the binary operation of addition.

Definition 1.3 (Sub-Group). Let (S, ◦) be a group. A subgroup of (S, ◦) is a group (T, ◦) so that T ⊆ S. The subgroup (T, ◦) shares the identity of the group (S, ◦).

Example 1.4. Consider the group (Z, +). If 2Z is the set of even integers, then (2Z, +) is a subgroup of (Z, +) because the even integers are closed under addition.

Definition 1.5 (Field). A field (or number field) is a tuple (S, +, ·, 0, 1) where:

(1) (S, +) is a commutative group with unit 0,
(2) (S \ {0}, ·) is a commutative group with unit 1, and
(3) the operation · distributes over the operation +, so that if a1, a2, and a3 are elements of S, then a1 · (a2 + a3) = a1 · a2 + a1 · a3.

Example 1.6. The archetypal example of a field is the field of real numbers R with addition and multiplication playing the expected roles. Another common field is the field of complex numbers C (numbers of the form a + bi with i = √−1 the imaginary unit) with their addition and multiplication rules defined as expected.

Exercise 1. Why is Z not a field under ordinary addition and multiplication? Is Q, the set of rational numbers, a field under the usual addition and multiplication operations?

Definition 1.7 (Vector Space). A vector space is a tuple V = (⟨F, +, ·, 0, 1⟩, V, +, ·) where

(1) ⟨F, +, ·, 0, 1⟩ is a field (with its own addition and multiplication operators defined) called the set of scalars,
(2) V is a set called the set of vectors,
(3) + : V × V → V is an addition operator defined on the set V, and
(4) · : F × V → V is a scalar-vector multiplication operator.

Further, the following properties hold for all vectors v1, v2 and v3 in V and scalars s, s1 and s2 in F:

(1) (V, +) is a commutative group (of vectors) with identity element 0.
(2) Multiplication of vectors by a scalar distributes over vector addition; i.e., s(v1 + v2) = sv1 + sv2.
(3) Multiplication of vectors by a scalar distributes over field addition; i.e., (s1 + s2) · v1 = s1v1 + s2v1.
(4) Multiplication of a vector by a scalar respects the field's multiplication; i.e., (s1 · s2) · v1 = s1 · (s2 · v1).
(5) The scalar identity is respected in the multiplication of vectors by a scalar; i.e., 1 · v1 = v1.

Remark 1.8. In general the set of vectors V is not distinguished from the vector space V. Thus, we will write v ∈ V to mean v is a vector in the vector space V.

Definition 1.9 (Cartesian Product of a Field). Let F be a field. Then

F^n = F × F × · · · × F   (n copies)

is the set of n-tuples of elements of F. If + is the field addition operation, we can define addition on F^n component-wise by:

(a1, . . . , an) + (b1, . . . , bn) = (a1 + b1, . . . , an + bn) The zero element for addition is then 0 = (0, 0,..., 0) ∈ Fn. Lemma 1.10. The set Fn forms a group under component-wise addition with zero element 0. Theorem 1.11. Given a field F, Fn is a vector space over F when vector-scalar multi- plication is defined so that:

c · (a1, . . . , an) = (ca1, . . . , can)

Exercise 2. Prove Lemma 1.10 and consequently Theorem 1.11. Remark 1.12. Generally speaking, we will not explicitly call out all the different opera- tions, vector sets and fields unless it is absolutely necessary. When referring to vector spaces over the field F with vectors Fn (n ≥ 1) we will generally just say the vector space Fn to mean the set of tuples with n elements from F over the field of scalars F. 2 Example 1.13. The simplest (and most familiar) example of a vector space has as its field R with addition and multiplication defined as expected and as its set of vectors n-tuples in Rn (n ≥ 1) with component-wise vector addition defined as one would expect. Remark 1.14. The vector space Rn is sometimes called Euclidean n space, because all the rules of Euclidean geometry work in this space. Remark 1.15. When most people think about an archetypal vector space, they do think of Rn. You can also think of Cn, the vector space of n-tuples of complex numbers. This vector space is very useful in Quantum Mechanics - where everything in sight seems to be a complex number.
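To make the component-wise operations of Lemma 1.10 and Theorem 1.11 concrete, here is a minimal Python sketch; the function names are illustrative (not from the notes), and ordinary floats stand in for the field R.

```python
# Component-wise operations on F^n (here F = R, represented by Python floats),
# as in Lemma 1.10 and Theorem 1.11.

def vec_add(a, b):
    """Component-wise addition of two n-tuples."""
    assert len(a) == len(b)
    return tuple(ai + bi for ai, bi in zip(a, b))

def scalar_mult(c, a):
    """Multiply every component of the n-tuple a by the scalar c."""
    return tuple(c * ai for ai in a)

u = (1.0, 2.0, 3.0)
v = (4.0, 5.0, 6.0)
zero = (0.0, 0.0, 0.0)

print(vec_add(u, v))            # (5.0, 7.0, 9.0)
print(scalar_mult(2.0, u))      # (2.0, 4.0, 6.0)
print(vec_add(u, zero) == u)    # True: 0 is the additive identity
```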

Example 1.16 (Function Space). Let F be the set of all functions from R to R. That is, if f ∈ F, then f : R → R. Suppose we define (f + g)(x) = f(x) + g(x) for all f, g ∈ F and, if c ∈ R, then (cf)(x) = cf(x) when f ∈ F. The constant function ϑ(x) = 0 is the zero in the group (F, +, ϑ). Then F is a vector space over the field R and this is an example of a function space. Here, the functions are the vectors and the reals are the scalars.

Remark 1.17. Vector spaces can become very abstract (we'll see some examples as we move along). For example, function spaces can be made much more abstract than the previous example. For now though it is easiest to remember that vector spaces behave like vectors of real numbers with some appropriate additions and multiplications defined. In general, all you need to define a vector space is a field (the scalars), a group (the vectors), and a multiplication operation (scalar-vector multiplication) that connects the two and that satisfies all the properties listed in Definition 1.7.
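The function-space operations of Example 1.16 can also be written out directly. The following sketch uses plain Python closures; the names are illustrative only.

```python
# The vectors here are functions R -> R; addition and scalar multiplication are
# defined pointwise, exactly as in Example 1.16.

def f_add(f, g):
    return lambda x: f(x) + g(x)

def f_scale(c, f):
    return lambda x: c * f(x)

theta = lambda x: 0.0          # the zero "vector"
f = lambda x: x ** 2
g = lambda x: 3.0 * x

h = f_add(f, f_scale(2.0, g))  # h(x) = x^2 + 6x
print(h(2.0))                  # 16.0
print(f_add(f, theta)(5.0) == f(5.0))  # True: theta acts as the zero vector
```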

Definition 1.18 (Subspace). If V = (hF, +, ·, 0, 1i,V, +, ·) is a vector space and U ⊆ V with U = (hF, +, ·, 0, 1i, U, +, ·) also a vector space then, U is called a subspace of V. Note that U must be closed under + and ·.

Example 1.19. If we consider R^3 as a vector space over the reals (as usual), then it has as a subspace several copies of R^2. The easiest is to consider the subset of vectors:

U = {(x, y, 0) : x, y ∈ R}

Clearly U is closed under the addition and scalar multiplication of the original vector space. This is illustrated in Figure 1.1.

Remark 1.20. Proving a set of vectors W is a subspace of a given vector space V requires checking three things: (i) Is 0 in W? (ii) Is W closed under vector addition? (iii) Is W closed under scalar-vector multiplication? All other properties of a vector space follow automatically.

Example 1.21 (Vector Space of Real Polynomials). Recall the function space example from Example 1.16. Confine our attention simply to those functions that are polynomials of a single variable with real coefficients. Denote this set by P. This set of functions is closed under addition and scalar multiplication, and the zero function is in P. Thus, it is a subspace of F.

Exercise 3. Show that the set of all polynomials of a single variable with real coefficients and degree at most k is a subspace of P, which we'll denote P[x^k].

Figure 1.1. The subspace R^2 is shown within the subspace R^3.

Remark 1.22. Just as we do not differentiate the vector space F^n from the Cartesian product group F^n, from now on we will just say that a vector v is an element of a vector space V, rather than the set V.

2.1. A Finite Field.

Remark 1.23. So far we have discussed fields that should be familiar: (1) (R, +, ·, 0, 1), the field of real numbers, or (2) (C, +, ·, 0, 1), the field of complex numbers. Fields do not necessarily have to have an infinite number of elements. The simplest example is GF(2), the Galois Field with 2 elements. We'll leave a proper introduction to Galois Fields for an Algebra class. We can, however, discuss the two-element field.

Definition 1.24. The field GF(2) consists of the following: (1) the set of two elements {0, 1}, (2) the addition operation +, (3) the multiplication operation ·, (4) the additive unit 0 and (5) the multiplicative unit 1. The addition and multiplication tables are:

+ | 0 1        · | 0 1
0 | 0 1        0 | 0 0
1 | 1 0        1 | 0 1

Remark 1.25. You should check that the operations are, in fact, commutative and that multiplication distributes over addition.

Remark 1.26. This particular field has an intimate relation to Computer Science through Boolean Logic. Suppose that 0 means false (off) while 1 means true (on). Then addition plays the role of exclusive or. The exclusive or operator works in English as follows: either it's raining or it's sunny (it cannot be both raining and sunny). Thus:

(1) It's neither sunny nor rainy is not true (false) since it must be either sunny or rainy: 0 + 0 = 0.
(2) It's rainy and not sunny is true (acceptable): 1 + 0 = 1.
(3) It's not rainy and sunny is also true (acceptable): 0 + 1 = 1.
(4) It's raining and sunny is not true: 1 + 1 = 0.

By the same token multiplication plays the role of and. This works in English as follows: Tom is a boy and Tom is tall means Tom must be both a boy and tall. Thus:

(1) Tom is neither a boy nor tall is not true: 0 · 0 = 0.
(2) Tom is a boy who is not tall is not true: 1 · 0 = 0.
(3) Tom is not a boy but is tall is not true: 0 · 1 = 0.
(4) Tom is a boy and tall is true: 1 · 1 = 1.

Remark 1.27. As it turns out, there is a finite field whose number of elements is any power of a prime, but there are no finite fields of any other size. In particular, there is no finite field with exactly 6 elements.
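Because GF(2) addition is exclusive or and multiplication is and, the field operations can be written with bitwise operators. The following sketch (not part of the notes) simply reproduces the tables of Definition 1.24.

```python
# GF(2) arithmetic: addition is XOR, multiplication is AND (Definition 1.24).

def gf2_add(a, b):
    return a ^ b          # 1 + 1 = 0, matching the addition table

def gf2_mult(a, b):
    return a & b          # only 1 * 1 = 1

# Reproduce the addition and multiplication tables.
for a in (0, 1):
    for b in (0, 1):
        print(f"{a} + {b} = {gf2_add(a, b)}   {a} * {b} = {gf2_mult(a, b)}")
```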

3. Matrices, Row and Column Vectors

Remark 1.28. We will not cover matrices in depth; if you're using these notes, you've had a course in matrices. This section is just going to establish some notational consistency for the rest of the notes and enable concrete examples. It is not a complete overview of matrices.

Definition 1.29 (Matrix). An m × n matrix is a rectangular array of values (scalars), drawn from a field. If F is the field, we write F^{m×n} to denote the set of m × n matrices with entries drawn from F.

Example 1.30. Here is an example of a 2 × 3 matrix drawn from R^{2×3}:

A = [ 3    1    7
      2   2√2   5 ]

Or a 2 × 2 matrix with entries drawn from C^{2×2}:

B = [ 3 + 2i    7
        6     3 − 2i ]

Remark 1.31. We will denote the element at position (i, j) of matrix A as Aij. Thus, in the example above, A2,1 = 2.

Definition 1.32 (Matrix Addition). If A and B are both in F^{m×n}, then C = A + B is the matrix sum of A and B in F^{m×n} and

(1.1) Cij = Aij + Bij for i = 1, . . . , m and j = 1, . . . , n

Here + is the field operation addition.

Example 1.33.

(1.2) [ 1 2 ] + [ 5 6 ] = [ 1 + 5   2 + 6 ] = [  6  8 ]
      [ 3 4 ]   [ 7 8 ]   [ 3 + 7   4 + 8 ]   [ 10 12 ]

Definition 1.34 (Scalar-Matrix Product). If A is a matrix from F^{m×n} and c ∈ F, then B = cA = Ac is the scalar-matrix product of c and A in F^{m×n} and:

(1.3) Bij = cAij for i = 1, . . . , m and j = 1, . . . , n

Example 1.35. Let:

B = [ 3 + 2i    7
        6     3 − 2i ]

Then we can multiply the scalar i ∈ C by B to obtain:

i [ 3 + 2i    7    ] = [ i(3 + 2i)    7i     ] = [ −2 + 3i    7i   ]
  [   6     3 − 2i ]   [    6i     i(3 − 2i) ]   [   6i     2 + 3i ]

Definition 1.36 (Row/Column Vector). A 1 × n matrix is called a row vector, and an m × 1 matrix is called a column vector. For the remainder of these notes, every vector will be thought of as a column vector unless otherwise noted. A column vector x in R^{n×1} (or R^n) is written x = ⟨x1, . . . , xn⟩.

Remark 1.37. It should be clear that any row of matrix A could be considered a row vector in R^n and any column of A could be considered a column vector in R^m. Also, any row/column vector is nothing more sophisticated than a tuple of numbers. You are free to think of these things however you like. Notationally, column vectors are used throughout these notes.
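A quick numerical check of matrix addition and the scalar-matrix product, assuming NumPy is available; the arrays reuse Examples 1.33 and 1.35.

```python
import numpy as np

# Matrix sum and scalar-matrix product (Definitions 1.32 and 1.34).
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)        # [[ 6  8] [10 12]], matching Example 1.33
print(3 * A)        # entry-wise: [[ 3  6] [ 9 12]]

# Complex entries work the same way, as in Example 1.35.
C = np.array([[3 + 2j, 7], [6, 3 - 2j]])
print(1j * C)       # [[-2.+3.j  0.+7.j] [ 0.+6.j  2.+3.j]]
```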

Example 1.38. We can now think of the vector space R2 over the field R as being composed of vectors in R2×1 (column vectors) with the field of real numbers. Let 0 = h0, 0i. Then (R2×1, +, 0) is a group and that’s sufficient for a set of vectors in a vector space. Exercise 4. Show that (R2×1, +, 0) is a group. 4. Linear Combinations, Span, Linear Independence

Definition 1.39. Suppose V is a vector space over the field F. Let v1, . . . , vm be vectors in V and let α1, . . . , αm ∈ F be scalars. Then

(1.4) α1v1 + ··· + αmvm

is a linear combination of the vectors v1, . . . , vm. Clearly, any linear combination of vectors in V is also a vector in V.

Definition 1.40 (Span). Let V be a vector space and suppose W = {v1,..., vm} is a set of vectors in V, then the span of W is the set: (1.5) span(W ) = {y ∈ V : y is a linear combination of vectors in W }

Proposition 1.41. Let V be a vector space and suppose W = {v1, . . . , vm} is a set of vectors in V. Then span(W) is a subspace of V.

Proof. We must check 3 things:

(1) The zero vector is in span(W): We know that:

0 = 0 · v1 + ··· + 0 · vm

is a linear combination of the vectors in W and therefore 0 ∈ span(W). Here 0 ∈ F is the zero-element in the field and 0 is the zero vector in the vector space.

(2) The set of vectors span(W) is closed under vector addition: Consider two vectors v and w in span(W). Then we have:

v = α1v1 + ··· + αmvm and w = β1v1 + ··· + βmvm, so that

v + w = (α1 + β1)v1 + ··· + (αm + βm)vm

is a linear combination of the vectors in W and therefore v + w ∈ span(W). We observed that this is true because scalar-vector multiplication must obey a distributive property.

(3) The set span(W) is closed under scalar-vector multiplication: If:

v = α1v1 + ··· + αmvm and r ∈ F, then:

rv = r(α1v1 + ··· + αmvm) = (rα1)v1 + ··· + (rαm)vm

This is a linear combination of the vectors in W and so is in span(W). We observed this was because scalar-vector multiplication respects scalar multiplication. Therefore, span(W) is a subspace of V. □

Definition 1.42 (Linear Independence). Let v1,..., vm be vectors in V. The vectors v1,..., vm are linearly dependent if there exists α1, . . . , αm ∈ F, not all zero, such that

(1.6) α1v1 + ··· + αmvm = 0

If the set of vectors v1, . . . , vm is not linearly dependent, then they are linearly independent and Equation 1.6 holds just in case αi = 0 for all i = 1, . . . , m. Here 0 is the zero-vector in V and 0 is the zero-element in the field.

Exercise 5. Consider the vectors v1 = ⟨0, 0⟩ and v2 = ⟨1, 0⟩. Are these vectors linearly independent? Explain why or why not.

Example 1.43. In R^3, consider the vectors:

v1 = ⟨1, 1, 0⟩,  v2 = ⟨1, 0, 1⟩,  v3 = ⟨0, 1, 1⟩

We can show these vectors are linearly independent: Suppose there are values α1, α2, α3 ∈ R such that

α1v1 + α2v2 + α3v3 = 0 Then:

⟨α1, α1, 0⟩ + ⟨α2, 0, α2⟩ + ⟨0, α3, α3⟩ = ⟨α1 + α2, α1 + α3, α2 + α3⟩ = ⟨0, 0, 0⟩

Thus we have the system of linear equations:

α1 +α2 = 0

α1 + α3 = 0

α2 + α3 = 0

From the third equation, we see α3 = −α2. Substituting this into the second equation, we obtain two equations:

α1 + α2 = 0

α1 − α2 = 0

This implies that α1 = α2 and 2α1 = 0, so α1 = α2 = 0. Therefore, α3 = 0 and thus these vectors are linearly independent.

Remark 1.44. It is worthwhile to note that the zero vector 0 makes any set of vectors a linearly dependent set.

Exercise 6. Prove the remark above.

Example 1.45. Consider the vectors:

v1 = ⟨1, 2, 3⟩,  v2 = ⟨4, 5, 6⟩

Determining linear independence requires us to solve the equation:

α1⟨1, 2, 3⟩ + α2⟨4, 5, 6⟩ = ⟨0, 0, 0⟩

or the system of equations:

α1 + 4α2 = 0

2α1 + 5α2 = 0

3α1 + 6α2 = 0

Thus α1 = −4α2. Substituting this into the second and third equations yield:

−3α2 = 0

−6α2 = 0

Thus α2 = 0 and consequently α1 = 0. Thus, the vectors are linearly independent.

Example 1.46. Consider the vectors:

v1 = ⟨1, 2⟩,  v2 = ⟨3, 4⟩,  v3 = ⟨5, 6⟩

As before, we can derive the system of equations:

α1 + 3α2 + 5α3 = 0

2α1 + 4α2 + 6α3 = 0 We have more unknowns than equations, so we suspect there may be many solutions to this system of equations. From the first equation, we see: α1 = −3α2 − 5α3. Consequently we can substitute this into the second equation to obtain:

−6α2 − 10α3 + 4α2 + 6α3 = −2α2 − 4α3 = 0

Thus, α2 = −2α3 and α1 = 6α3 − 5α3 = α3, which we obtain by substituting the expression for α2 into the expression for α1. It appears that α3 can be anything we like. Let's set α3 = 1. Then α2 = −2 and α1 = 1. We can now confirm that this set of values creates a linear combination of v1, v2 and v3 equal to 0:

1 · ⟨1, 2⟩ − 2 · ⟨3, 4⟩ + ⟨5, 6⟩ = ⟨1 − 6 + 5, 2 − 8 + 6⟩ = ⟨0, 0⟩

Thus, the vectors are not linearly independent and they must be linearly dependent.

Exercise 7. Show that the vectors

v1 = ⟨1, 2, 3⟩,  v2 = ⟨4, 5, 6⟩,  v3 = ⟨7, 8, 9⟩

are not linearly independent. [Hint: Following the examples, create a system of equations and show that there is a solution not equal to 0.]

Example 1.47. Consider the vector space of polynomials P. We can show that the set of vectors {3x, x^2 − x, 2x^2 − x} is linearly dependent. As before, we write:

α1(3x) + α2(x^2 − x) + α3(2x^2 − x) = ϑ(x) = 0

We can collect terms on x and x^2:

(α2 + 2α3)x^2 + (3α1 − α2 − α3)x = 0

This leads (again) to a system of linear equations:

3α1−α2 − α3 = 0

α2 + 2α3 = 0

We compute α2 = −2α3. Substituting into the first equation yields 3α1 + α3 = 0, or α1 = −(1/3)α3. Set α3 = 1, so that α1 = −1/3 and α2 = −2.

Exercise 8. Consider the vector space of polynomials with degree at most 2, P[x^2]. Suppose we change the quadratic polynomial a2x^2 + a1x + a0 into the vector ⟨a0, a1, a2⟩. Recast the previous example in terms of these vectors and show that the resulting system of equations is identical to the one derived.
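For readers who want to check examples like these numerically, here is a small sketch, assuming NumPy is available. It stacks the vectors as columns and compares the matrix rank with the number of vectors, which is equivalent to asking whether the homogeneous systems above have only the trivial solution; the helper name is illustrative.

```python
import numpy as np

# Checking linear (in)dependence numerically (mirrors Examples 1.43-1.47).
def linearly_independent(vectors):
    M = np.column_stack(vectors)
    return np.linalg.matrix_rank(M) == len(vectors)

print(linearly_independent([(1, 1, 0), (1, 0, 1), (0, 1, 1)]))   # True  (Example 1.43)
print(linearly_independent([(1, 2), (3, 4), (5, 6)]))            # False (Example 1.46)

# The polynomials 3x, x^2 - x, 2x^2 - x from Example 1.47, written as
# coefficient vectors <a0, a1, a2> as suggested in Exercise 8:
print(linearly_independent([(0, 3, 0), (0, -1, 1), (0, -1, 2)])) # False
```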

5. Basis

Definition 1.48 (Basis). Let B = {v1, . . . , vm} be a set of vectors in V. The set B is called a basis of V if B is a linearly independent set of vectors and every vector in V is in the span of B. That is, for any vector w ∈ V we can find scalar values α1, . . . , αm ∈ F such that

(1.7) w = ∑_{i=1}^{m} αi vi

Example 1.49. We can show that the vectors:

v1 = ⟨1, 1, 0⟩,  v2 = ⟨1, 0, 1⟩,  v3 = ⟨0, 1, 1⟩

form a basis of R^3. We already know that the vectors are linearly independent. To show that R^3 is in their span, choose an arbitrary vector ⟨a, b, c⟩ in R^3. Then we hope to find coefficients α1, α2 and α3 so that:

α1v1 + α2v2 + α3v3 = ⟨a, b, c⟩

Expanding this, we must find α1, α2 and α3 so that:

⟨α1, α1, 0⟩ + ⟨α2, 0, α2⟩ + ⟨0, α3, α3⟩ = ⟨a, b, c⟩

A little effort (in terms of algebra) will show that:

(1.8)  α1 = (1/2)(a + b − c),  α2 = (1/2)(a − b + c),  α3 = (1/2)(−a + b + c)

Thus the set {v1, v2, v3} is a basis for R^3.

Exercise 9. Why are the vectors

v1 = ⟨1, 2, 3⟩,  v2 = ⟨4, 5, 6⟩,  v3 = ⟨7, 8, 9⟩

 3     not a basis for R .

Lemma 1.50. Suppose B = {v1, . . . , vm} is a basis for a vector space V over a field F. Suppose that v ∈ V and:

α1v1 + ··· + αmvm = v = β1v1 + ··· + βmvm

Then αi = βi for i = 1, . . . , m. Proof. Trivially:

(α1v1 + ··· + αmvm) − (β1v1 + ··· + βmvm) = 0 This can be rewritten:

(α1 − β1)v1 + ··· + (αm − βm)vm = 0

The fact that B is a basis implies that αi − βi = 0 for i = 1, . . . , m. This completes the proof. □

Remark 1.51 (Coordinate Form). Lemma 1.50 shows that given a vector space V and a basis B for that vector space, we can assign to any vector v ∈ V a unique set of coordinates ⟨α1, . . . , αm⟩.

Remark 1.52. You are most familiar with this in R^n, where the vectors are usually identical to their coordinates when we use a standard basis consisting of {e1, . . . , en}, where ei ∈ R^n is:

ei = ⟨0, 0, . . . , 0, 1, 0, . . . , 0⟩,

with the 1 in the ith position (preceded by i − 1 zeros). Thus, the coordinates of the vector v = (v1, . . . , vn) are exactly ⟨v1, . . . , vn⟩. It does not always have to work this way, as we show in the next example.

Example 1.53. Consider the basis

v1 = ⟨1, 1, 0⟩,  v2 = ⟨1, 0, 1⟩,  v3 = ⟨0, 1, 1⟩

for R^3. Note first that the coordinates of v1 with respect to the basis B = {v1, v2, v3} are ⟨1, 0, 0⟩. That is, the coordinates of the first basis vector always look like the coordinates of the first standard basis vector. This can be confusing; what this means is that the vectors (the mathematical objects drawn with arrows) are independent of the basis, which simply gives you a way to assign coordinates to them. To continue the example, let's express the standard basis vector e1 ∈ R^3 in coordinates with respect to the basis B. To do this, we must find α1, α2, and α3 so that:

(1.9) α1⟨1, 1, 0⟩ + α2⟨1, 0, 1⟩ + α3⟨0, 1, 1⟩ = ⟨1, 0, 0⟩

Using Equation 1.8, we can substitute a = 1, b = 0 and c = 0 to obtain:

α1 = 1/2,  α2 = 1/2,  α3 = −1/2

Thus, the coordinate representation for the standard basis vector e1 ∈ R^3 with respect to the basis B is ⟨1/2, 1/2, −1/2⟩.

Exercise 10. Find coordinate representations for e2 and e3 in R^3 with respect to basis B.
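The coordinate computation of Example 1.53 is again just a linear solve. The following sketch (assuming NumPy) recovers ⟨1/2, 1/2, −1/2⟩ for e1; the coordinates of e2 and e3 are left to Exercise 10.

```python
import numpy as np

# Example 1.53 numerically: coordinates of e1 with respect to
# B = {<1,1,0>, <1,0,1>, <0,1,1>} come from B_mat^{-1} e1.
B_mat = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [0, 1, 1]], dtype=float)   # columns are the basis vectors

e1 = np.array([1.0, 0.0, 0.0])
coords = np.linalg.inv(B_mat) @ e1
print(coords)                                # [ 0.5  0.5 -0.5]

# Going back: the coordinates combined with the basis recover e1.
print(B_mat @ coords)                        # [1. 0. 0.]
```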

6. Dimension

Definition 1.54 (Maximal Linearly Independent Subset). Let W = {v1, . . . , vn} be a set of vectors from a vector space V. A set {v1, . . . , vm} is a maximal linearly independent subset of W if for any vr with r > m, the set {v1, . . . , vm, vr} is linearly dependent.

Lemma 1.55. Let W = {v1, . . . , vn} be a set of vectors from a vector space V with the property that span(W) = V (i.e., every vector in V can be expressed as a linear combination of the vectors in W). If B = {v1, . . . , vm} is a maximal linearly independent subset of W, then B is a basis.

Proof. Choose an arbitrary vector v ∈ V. There is a set of scalars α1, . . . , αn so that:

(1.10) v = ∑_{i=1}^{n} αi vi

If αi = 0 for i = m + 1, . . . , n then we have expressed v as a linear combination of the elements of B. Therefore, assume this is not the case and suppose that αr ≠ 0 for some r > m. The fact that {v1, . . . , vm, vr} is linearly dependent means that there are scalars β1, . . . , βm, βr, not all 0, so that:

βr vr + ∑_{i=1}^{m} βi vi = 0

We know that βr ≠ 0 (otherwise B would not be linearly independent) and therefore we have:

(1.11) vr + ∑_{i=1}^{m} (βi/βr) vi = 0

because the scalars are drawn from a field F for which there is a multiplicative inverse for non-zero elements (i.e., βr^{−1} = 1/βr). We can then replace vr in Expression 1.10 with Expression 1.11 for each αr ≠ 0 with r > m. The result is an expression of v as a linear combination of vectors in B. Thus, span(B) = V. This completes the proof. □

Lemma 1.56. Let {v1,..., vm+1} be a linearly dependent set of vectors in V and let W = {v1,..., vm} be a linearly independent set. Further assume that vm+1 6= 0. Assume α1, . . . , αm+1 are a set of scalars, not all zero, so that

(1.12) ∑_{i=1}^{m+1} αi vi = 0

For any j ∈ {1, . . . , m} such that αj ≠ 0, if we replace vj in the set W with vm+1, then this new set of vectors is linearly independent.

Proof. We know that αm+1 cannot be zero, since we assumed that W is linearly independent. Since vm+1 ≠ 0, we know there is at least one other αi (i = 1, . . . , m) not zero. Without loss of generality, assume that α1 ≠ 0 (if not, rearrange the vectors to make this true). We can solve for vm+1 using this equation to obtain:

(1.13) vm+1 = −∑_{i=1}^{m} (αi/αm+1) vi

Suppose, without loss of generality, we replace v1 by vm+1 in W. We now proceed by contradiction. Assume this new set is linearly dependent. Then there exist constants β2, . . . , βm, βm+1, not all zero, such that:

(1.14) β2v2 + ··· + βmvm + βm+1vm+1 = 0.

Again, we know that βm+1 ≠ 0 since the set {v2, . . . , vm} is linearly independent because W is linearly independent. Then using Equation 1.13 we see that:

(1.15) β2v2 + ··· + βmvm + βm+1 (−∑_{i=1}^{m} (αi/αm+1) vi) = 0.

We can rearrange the terms in this sum as:

(1.16) (β2 − βm+1α2/αm+1) v2 + ··· + (βm − βm+1αm/αm+1) vm − (βm+1α1/αm+1) v1 = 0

The fact that α1 ≠ 0 and βm+1 ≠ 0 and αm+1 ≠ 0 means we have found γ1, . . . , γm, not all zero, such that γ1v1 + ··· + γmvm = 0, contradicting our assumption that W was linearly independent. This contradiction completes the proof. □

Corollary 1.57. If B = {v1, . . . , vm} is a basis of V and vm+1 is another vector such that:

(1.17) vm+1 = −∑_{i=1}^{m} (αi/αm+1) vi

with the property that α1 ≠ 0, then B′ = {v2, . . . , vm+1} is also a basis of V.

Proof. The fact that B′ is linearly independent is established in Lemma 1.56, since clearly the set {v1, . . . , vm+1} is linearly dependent because B is a basis for V. Choose any vector v ∈ V; then there are scalars β1, . . . , βm so that:

v = ∑_{i=1}^{m} βi vi

From Equation 1.17 we can write:

v1 = (1/α1) vm+1 − ∑_{i=2}^{m} (αi/α1) vi

Then:

β1v1 + ∑_{i=2}^{m} βi vi = β1 ((1/α1) vm+1 − ∑_{i=2}^{m} (αi/α1) vi) + ∑_{i=2}^{m} βi vi = ∑_{i=2}^{m} (βi − β1αi/α1) vi + (β1/α1) vm+1

Thus we have expressed an arbitrary vector v as a linear combination of the elements of B′; therefore B′ is a basis of V. □

vm+1 = αivi i=1 X As long as αj 6= 0, then we can replace vj with vm+1 and still have a basis of V. 13 Exercise 11. Consider the bases: 1 0 0 B = 0 , 1 , 0        0 0 1  for 3. If v = h1, 1, 0i, which   elements of B can be replaced by v to obtain a new basis B0? R   Theorem 1.59. Suppose that B = {v1,..., vm} is a basis of V and W = {w1,..., wn} is a set of n vectors from V with n > m. Then W is linearly dependent. Proof. We proceed by induction. The fact that B is a basis means that we can replace (1) some vector in B by w1 to obtain a new basis B . Without loss of generality, assume that it is v1; this can be made true by rearranging the order of B, if needed. Now, assume we (k) can do this up to k < m times to obtain basis B = {w1,..., wk, vk+1,..., vm}. We show we can continue to k + 1. The fact that we can replace some vector in B(k) by a vector in W to obtain a new basis B(k+1) is clear from Lemma 1.56 and its corollary, but the nature of this exchange is what we must manage. Suppose we choose wk+1. There are now two possibilities: Case 1: There is no way to replace vi (for i ∈ {k + 1, . . . , m}) with wk+1 and maintain linear independence. This means that when we express: k m

wk+1 = αiwi + αivi i=1 i=k+1 X X That αi = 0 for i = k + 1, . . . , m. Thus, wk+1 is a linear combination of w1,..., wk and W is linearly dependent. Induction can stop at this point. Case 2: There is some vi (for i ∈ {k + 1, . . . , m}) satisfying the assumptions of the (k+1) exchange lemma. Then B = {w1,..., wk, wk+1, vk+2 ..., vm} assuming (as needed) a reordering of the elements vk+1,..., vm. By induction, we have shown that either W is linearly dependent or the {w1,..., wm} forms a basis for V, which means that W must be linearly dependent since (e.g.) wm+1 can be expressed as a linear combination of {w1,..., wm}. This completes the proof.  Theorem 1.60. If B and B0 are two bases for the vector space V, then |B| = |B0|. Exercise 12. Prove Theorem 1.60 under the assumption that the sets are finite size. Definition 1.61 (Dimension). The dimension of a vector space is the cardinality of any of its bases. If V is a vector space, we write this as dim(V). Example 1.62. The dimension of Fk is k. This can be seen using the standard basis vectors. Theorem 1.63. Let V be a vector space with base field F and dimension n. If B = {v1,..., vn} is a set of linearly independent vectors, then B is a basis for V. Proof. The set B is maximal in size, therefore it must constitute a basis by Theorem 1.59.  Corollary 1.64. Let V be a vector space with dimension n and suppose that W = {v1,..., vm} with m < n is a set of linearly independent vectors. Then there are vectors vm+1,..., vn so that B = {v1,..., vn} forms a basis for V. 14 Proof. Clearly, W cannot be a basis for V, thus there is at least one vector vm+1 ∈ V that cannot be expressed as a linear combination of the vectors in W . Thus, the set W 0 = {v1,..., vm+1} is linearly independent. We can repeat this argument to construct B. By the previous theorem, B must be a basis for V.  Exercise 13. Show that if W is a subspace of a vector space V and dim(W) = dim(V), then W = V. Exercise 14. Show that the vector space P[xk] consisting of all polynomials of a single variable with real coefficients and degree at most k has dimension k+1. [Hint: Look at Exercise8 and apply the same idea here.] Remark 1.65. Consider the vector space P of all polynomials on a single variable with real coefficients. This space does not have a finite dimension. Instead, it is an infinite dimensional vector space, which can be defined rigorously, if required. We will not define this rigorously for these notes. Exercise 15. Let F be a field and V be a vector space over F. Show that the set {0} is a subspace of V with dimension 0. Exercise 16. Consider the set of vectors C (the complex numbers). When taken C is also used as the scalar field, clearly this vector space has dimension 1. Show that when R is used as the scalar field with vectors C, the dimension of the resulting vector space is 2, thus illustrate that the dimension is affected by the choice of the field. [Hint: A basis is just a set of vectors. When C is both the field and the vector space, the “vector” 1 is the only basis element needed because (e.g.) the pure imaginary vector i can be constructed by multiplying the scalar i by the vector 1. Suppose we don’t have any imaginary scalars because we’re using R as the scalar field. Find exactly two “vectors” in C that can be used to generate all the complex numbers (vectors).]


CHAPTER 2

More on Matrices and Change of Basis

1. Goals of the Chapter

(1) Review matrix operations
(2) Fundamental Theorem of Linear Algebra
(3) Change of Basis Theorem
(4) Direct sum spaces
(5) Product spaces

2. More Matrix Operations: A Review

Remark 2.1. In the last chapter, we introduced matrices and some rudimentary operations. In a sense, this was simply to have some basic notation to work with column vectors and coordinate representations. In this section we introduce (review) additional matrix operations and notations. Most of these should be familiar to the reader.

Remark 2.2 (Some Matrix Notation). The following notation occurs in various sub-fields of mathematics and it is by no means universal. It is, however, convenient. Let A ∈ F^{m×n} for some appropriate base field F. The jth column of A can be written as A·j, where the · is interpreted as ranging over every value of i (from 1 to m). Similarly, the ith row of A can be written as Ai·. Note, these are column and row vectors respectively.

Definition 2.3 (). Recall that if x, y ∈ Fn are two n-dimensional vectors, then the dot product (scalar product) is:

n

(2.1) x · y = ∑_{i=1}^{n} xi yi

Definition 2.4 (Matrix Multiplication). If A ∈ Rm×n and B ∈ Rn×p, then C = AB is the matrix product of A and B and

(2.2) Cij = Ai· · B·j

Note, Ai· ∈ R^{1×n} (an n-dimensional vector) and B·j ∈ R^{n×1} (another n-dimensional vector), thus making the dot product meaningful.

Example 2.5.

(2.3) [ 1 2 ] [ 5 6 ] = [ 1(5) + 2(7)   1(6) + 2(8) ] = [ 19 22 ]
      [ 3 4 ] [ 7 8 ]   [ 3(5) + 4(7)   3(6) + 4(8) ]   [ 43 50 ]

Exercise 17. Prove that matrix multiplication distributes over addition. That is, if A ∈ F^{m×n} and B, C ∈ F^{n×p}, then:

A(B + C) = AB + AC

We will use this fact repeatedly.
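A numerical illustration of Definition 2.4 and Example 2.5, assuming NumPy is available; the last line also checks the distributive law of Exercise 17 on one concrete instance, which of course is not a proof.

```python
import numpy as np

# Matrix multiplication as "row dot column" (Definition 2.4), checked against
# Example 2.5.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B
print(C)                              # [[19 22] [43 50]]

# Entry (0, 0) really is the dot product of row A[0, :] with column B[:, 0].
print(np.dot(A[0, :], B[:, 0]))       # 19

# Distributivity from Exercise 17 holds on this instance as well:
D = np.array([[1, 0], [0, 1]])
print(np.array_equal(A @ (B + D), A @ B + A @ D))   # True
```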

Definition 2.6 (Matrix ). If A ∈ Rm×n is a m × n matrix, then the transpose of A dented AT is an m × n matrix defined as: T (2.4) Aij = Aji Example 2.7. 1 2 T 1 3 (2.5) = 3 4 2 4     Remark 2.8. The matrix transpose is a particularly useful operation and makes it easy to transform column vectors into row vectors, which enables multiplication. For example, suppose x is an n × 1 column vector (i.e., x is a vector in Fn) and suppose y is an n × 1 column vector. Then: (2.6) x · y = xT y

Exercise 18. Let A, B ∈ Rm×n. Use the definitions of matrix addition and transpose to prove that: (2.7) (A + B)T = AT + BT

[Hint: If C = A + B, then Cij = Aij + Bij, the element in the (i, j) position of matrix C. This element moves to the (j, i) position in the transpose. The (j, i) position of AT + BT is T T T Aji + Bji, but Aji = Aij. Reason from this point.] Exercise 19. Let A, B ∈ Rm×n. Prove by example that AB 6= BA; that is, matrix multiplication is not commutative. [Hint: Almost any pair of matrices you pick (that can be multiplied) will not commute.]

Exercise 20. Let A ∈ Fm×n and let, B ∈ Rn×p. Use the definitions of matrix multipli- cation and transpose to prove that: (2.8) (AB)T = BT AT

[Hint: Use similar reasoning to the hint in Exercise 18. But this time, note that Cij = Ai··B·j, which moves to the (j, i) position. Now figure out what is in the (j, i) position of BT AT .] Definition 2.9. Let A and B be two matrices with the same number of rows (so A ∈ Fm×n and B ∈ Fm×p). Then the augmented matrix [A|B] is:

a11 a12 . . . a1n b11 b12 . . . b1p a21 a22 . . . a2n b21 b22 . . . b2p (2.9)  ......  ......  a a . . . a b b . . . b   m1 m2 mn m1 m2 mp    Thus, [A|B] is a matrix in Rm×(n+p). 18 Example 2.10. Consider the following matrices: 1 2 7 A = , b = 3 4 8     Then [A|B] is: 1 2 7 [A|B] = 3 4 8   A Exercise 21. By analogy define the augmented matrix B . Note, this is not a fraction. In your definition, identify the appropriate requirements on the relationship between the number of rows and columns that the matrices must have. [Hint:  Unlike [A|B], the number of rows don’t have to be the same, since your concatenating on the rows, not columns. There should be a relation between the numbers of columns though.] 3. Special Matrices Definition 2.11 (Identify Matrix). The n × n identify matrix is: 1 0 ... 0 0 1 ... 0 (2.10) In =  . . .  . .. .  0 0 ... 1      Here 1 is the multiplicative unit in the field F from which the matrix entries are drawn. Definition 2.12 (Zero Matrix). The n × n zero matrix an n × n consisting entirely of 0 (the zero in the field). Exercise 22. Show that (Fn×n, +, 0) is a group with 0 the zero matrix. n×n Exercise 23. Let A ∈ F . Show that AIn = InA = A. Hence, I is an identify for the matrix multiplication operation on square matrices. [Hint: Do the multiplication out long hand.] Definition 2.13 (Symmetric Matrix). Let M ∈ Fn×n be a matrix. The matrix M is symmetric if M = MT . Definition 2.14 (Diagonal Matrix). A diagonal matrix is a (square) matrix with the property that Dij = 0 for i 6= j and Dii may take any value in the field on which D is defined. Remark 2.15. Thus, a diagonal matrix has (usually) non-zero entries only on its main diagonal. These matrices will play a critical roll in our analysis. 4. Matrix Inverse Definition 2.16 (). Let A ∈ Fn×n be a . If there is a matrix A−1 such that −1 −1 (2.11) AA = A A = In then matrix A is said to be invertible (or nonsingular) and A−1 is called its inverse. If A is not invertible, it is called a singular matrix. 19 Proposition 2.17. Suppose that A ∈ Fn×n. If there are matrices B, C ∈ Fn×n such that −1 AB = CA = In, then B = C = A . Proof. We can compute:

AB = In =⇒ CAB = CIn =⇒ InB = C =⇒ B = C  Proposition 2.18. Suppose that A ∈ Fn×n and both B ∈ Fn×n and C ∈ Fn×n are inverses of A, then B = C. Exercise 24. Prove Proposition 2.18. Remark 2.19. Propositions 2.17 and 2.18 show that the inverse of a square matrix is unique and there is not difference between a left inverse and a right inverse. It is worth noting this is not necessarily true in general m × n matrices. Remark 2.20. The set n × n invertible matrices over R is denoted GL(n, R). It forms a group called the general linear group under matrix multiplication with In the unit. The general linear group over a field F is defined analogously. Proposition 2.21. If both A and B are invertible in Fn×n, then AB is invertible and (AB)−1 = B−1A−1. Proof. Compute: −1 −1 −1 −1 −1 −1 (AB)(B A ) = ABB A = AInA = AA = In  n×n −1 Exercise 25. Prove that if A1,..., An ∈ F are invertible, then (A1,..., Am) = −1 −1 Am ··· A1 for m ≥ 1. 5. Linear Equations Remark 2.22. Recall that matrices can be used as a short hand way to represent linear equations. Consider the following system of equations:

(2.12)  a11x1 + a12x2 + ··· + a1nxn = b1
        a21x1 + a22x2 + ··· + a2nxn = b2
        ⋮
        am1x1 + am2x2 + ··· + amnxn = bm

Then we can write this in matrix notation as:

(2.13) Ax = b

where Aij = aij for i = 1, . . . , m, j = 1, . . . , n, x is a column vector in F^n with entries xj (j = 1, . . . , n), and b is a column vector in F^m with entries bi (i = 1, . . . , m).

Proposition 2.23. Suppose that A ∈ F^{n×n} and b ∈ F^{n×1}. If A is invertible, then the unique solution to the system of equations Ax = b is x = A^{−1}b.

Proof. If x = A^{−1}b, then Ax = AA^{−1}b = b and thus x is a solution to the system of equations. Suppose that y ∈ F^{n×1} is a second solution. Then Ax = b = Ay. This implies that x = A^{−1}b = y and thus x = y. □

Remark 2.24. The practical problem of solving linear systems has been considered for thousands of years. Only in the last 200 years has a method (called Gauss-Jordan elimination) been codified in the West. This method was known in China at least by the second century CE. In the next section, we discuss this approach and its theoretical ramifications.
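In code, Proposition 2.23 corresponds to either forming A^{-1}b or, more commonly, calling a linear solver directly. A small sketch assuming NumPy is available; the 2 × 2 system is only an illustration.

```python
import numpy as np

# Solving Ax = b for invertible A (Proposition 2.23).
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([7.0, 8.0])

x = np.linalg.solve(A, b)
print(x)                          # the unique solution
print(np.linalg.inv(A) @ b)       # same vector, computed as A^{-1} b
print(np.allclose(A @ x, b))      # True
```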

Proposition 2.25. Consider the special set of equations Ax = 0 with A ∈ F^{m×n}. That is, the right-hand-side is zero for every equation. Then any solution x to this system of equations has the property that Ai· · x = 0 for i = 1, . . . , m; i.e., each row of A is orthogonal to the solution x.

Exercise 26. Prove Proposition 2.25.

6. Elementary Row Operations

Definition 2.26 (Elementary Row Operation). Let A ∈ F^{m×n} be a matrix. Recall Ai· is the ith row of A. There are three elementary row operations:

(1) (Scalar Multiplication of a Row) Row Ai· is replaced by αAi·, where α ∈ F and α 6= 0. (2) (Row Swap) Row Ai· is swapped with Row Aj· for i 6= j. (3) (Scalar Multiplication and Addition) Row Aj· is replaced by αAi· + Aj· for α ∈ F and i 6= j. Example 2.27. Consider the matrix: 1 2 A = 3 4   defined over the field of real numbers. In an example of scalar multiplication of a row by a constant, we can multiply the second row by 1/3 to obtain: 1 2 B = 1 4  3  As an example of scalar multiplication and addition, we can multiply the second row by (−1) and add the result to the first row to obtain: 0 2 − 4 0 2 C = 3 = 3 1 4 1 4  3   3  We can then use scalar multiplication and multiply the first row by (3/2) to obtain: 0 1 D = 1 4  3  We can then use scalar multiplication and addition to multiply the first row by (−4/3) add it to the second row to obtain: 0 1 E = 1 0   21 Finally, we can swap row 2 and row 1 to obtain: 1 0 I = 2 0 1   Thus using elementary row operations, we have transformed the matrix A into the matrix I2. Theorem 2.28. Each elementary row operation can be accomplished by a matrix multi- plication. Sketch of Proof. We’ll show that scalar multiplication and row addition can be ac- complished by a matrix multiplication. In Exercise 27, you’ll be asked to complete the proof for the other two elementary row operations. Let A ∈ Fm×n. Without loss of generality, suppose we wish to multiply row 1 by α ∈ F and add it to row 2, replacing row 2 with the result. Let: 1 0 0 ... 0 α 1 0 ... 0 (2.14) E =  . . .  . . .. 0 0 0 0 ... 1     This is simply the identity Im with an α in the (2, 1) position instead of 0. Now consider T th EA. Let A·j = [a1j, a2j, . . . , amj] be the j column of A. Then :

1 0 0 ... 0 a1j a1j α 1 0 ... 0 a2j α(a1j) + a2j (2.15)  . . .   .  =  .  . . .. 0 . . 0 0 0 ... 1 a   a     mj  mj        That is, we have taken the first element of A·j and multiplied it by α and added it to the second element of A·j to obtain the new second element of the product. All other elements of A·j are unchanged. Since we chose an arbitrary column of A, it’s clear this will occur in each case. Thus EA will be the new matrix with rows the same as A except for the second row, which will be replaced by the first row of A multiplied by the constant α and added to the second row of A. To multiply the ith row of A and add it to the jth row, we would th simply make a matrix E by starting with Im and replacing the i element of row j with α.  Exercise 27. Complete the proof by showing that scalar multiplication and row swap- ping can be accomplished by a matrix multiplication. [Hint: Scalar multiplication should be easy, given the proof above. For row swap, try multiplying matrix A from Example 2.27 by: 0 1 1 0   and see what comes out. Can you generalize this idea for arbitrary row swaps?] Remark 2.29. Matrices of the kind we’ve just discussed are called elementary matrices. Theorem 2.28 will be important when we study efficient methods for solving linear program- ming problems. It tells us that any set of elementary row operations can be performed by finding the right matrix. That is, suppose I list 4 elementary row operations to perform on 22 matrix A. These elementary row operations correspond to for matrices E1,..., E4. Thus the transformation of A under these row operations can be written using only matrix multipli- cation as B = E4 ··· E1A. This representation is much simpler for a computer to keep track of in algorithms that require the transformation of matrices by elementary row operations. Definition 2.30 (Row Equivalence). Let A ∈ Fm×n and let B ∈ Fm×n. If there is a sequence of elementary matrices E1,..., Ek so that:

B = Ek ··· E1A then A and B are said to be row equivalent. Proposition 2.31. Every elementary matrix is invertible and its inverse is an elemen- tary matrix. Sketch of Proof. As before, we’ll only consider a single case. Consider the matrix; 1 0 0 ... 0 α 1 0 ... 0 E =  . . .  , . . .. 0 0 0 0 ... 1   which multiplies row 1 by α and adds it to row 2. Then we can compute the inverse as: 1 0 0 ... 0 −α 1 0 ... 0 −1 E =  . . .  , . . .. 0  0 0 0 ... 1   Multiplying the two matrices shows they yield the identity. The resulting inverse is an elementary matrix by inspection.  Remark 2.32. The fact that the elementary matrices are invertible is intuitively clear. An elementary matrices perform an action on a matrix and this action can be readily undone. That’s exactly what the inverse is doing. Exercise 28. Compute the inverses for the other two kinds of elementary matrices. Remark 2.33. The process we’ve illustrated in Example 2.27 is an instance of Gauss- Jordan elimination and can be used to find the to solve systems a system linear equations. This process is summarized in Algorithm1.

Definition 2.34 (Pivoting). In Algorithm1 when Aii 6= 0, the process performed in Steps 4 and 5 is called pivoting on element (i, i). ¯ ¯ Lemma 2.35 (Correctness). Suppose that Algorithm1 terminates with [In|b], where b is the result of the elementary row operations. Then b¯ = A−1b. Therefore, Algorithm1 terminates with the unique solution to the system of equations Ax = b, if it exists. Proof. The elementary row operations can be written as the product of elementary matrices on the left-hand-side of matrix X. Thus, the operation algorithm yields:

(2.16) En ··· E1X = [En ··· E1A|En ··· E1b] 23 Gauss-Jordan Elimination Solving Ax = b n×n n×1 (1) Let A ∈ F , b ∈ F . Let X = [A|b]. (2) Let i := 1 (3) If Xii = 0, then use row-swapping on X to replace row i with a row j (j > i) so that Xii =6 0. If this is not possible, then A, there is not a unique solution to Ax = b. (4) Replace Xi· by (1/Xii)Xi·. Element (i, i) of X should now be 1. −X (5) For each j =6 i, replace X by ji X + X . j· Xii i· j· (6) Set i := i + 1. −1 (7) If i > n, then A has been replaced by In and b has been replaced by A b in X. If i ≤ n, then goto Line 3. Algorithm 1. Gauss-Jordan Elimination for Systems

−1 ¯ −1 If En ··· E1A = In, then by the uniqueness of the inverse, A = En ··· E1 and b = A b. This must be a solution to the original system of equations.  Remark 2.36. We have now, effectively, proved the fundamental theorem of Matrix Algebra, which we state, but do not prove in detail (because we’ve already done it).

Theorem 2.37. Suppose that A ∈ Fn×n. The following are equivalent: (1) A is invertible. (2) The system of equations Ax = b has a unique solution for any b ∈ Fn×1. (3) The matrix A is row-equivalent to In. (4) Both the matrix A and A−1 can be written as the product of elementary matrices.  Exercise 29. Use Exercise 25 and Proposition 25 to explicitly show that if A−1 is the product of elementary matrices, then so is A.

Remark 2.38. The previous results can be obtained using elementary column operations, instead of elementary row operations.

Definition 2.39 (Elementary Column Operation). Let A ∈ Fm×n be a matrix. Recall th A·j is the j column of A. There are three elementary column operations:

(1) (Scalar Multiplication of a Column) Column A·j is replaced by αA·j, where α ∈ F and α 6= 0. (2) (Column Swap) Column A·j is swapped with Column A·k for j 6= k. (3) (Scalar Multiplication and Addition) Column A·k is replaced by αA·j + A·k for α ∈ F and j 6= k. Proposition 2.40. For each elementary column operation, there is an invertible elemen- tary matrix E ∈ Fn×n so that AE has the same effect as performing the column operation.

Exercise 30. Prove Proposition 2.40. 24 7. Computing a Matrix Inverse with the Gauss-Jordan Procedure Remark 2.41. The Gauss-Jordan algorithm can be modified easily to compute the in- verse of a matrix A, if it exists. Simply replace [A|b] with X = [A|In]. If the algorithm −1 terminates normally, then X will be replaced with [In|A ]. Example 2.42. Again consider the matrix A from Example 2.27. We can follow the steps in Algorithm1 to compute A−1. Step 1: 1 2 1 0 X := 3 4 0 1   Step 2: i := 1 Step 3 and 4 (i = 1):A 11 = 1, so no swapping is required. Furthermore, replacing X1· by (1/1)X1· will not change X. Step 5 (i = 1): We multiply row 1 of X by −3 and add the result to row 2 of X to obtain: 1 2 1 0 X := 0 −2 −3 1   Step 6: i := 1 + 1 = 2 and i = n so we return to Step 3. Steps 3 (i = 2): The new element A22 = −2 6= 0. Therefore, no swapping is required. Step 4 (i = 2): We replace row 2 of X with row 2 of X multiplied by −1/2. 1 2 1 0 X := 0 1 3 − 1  2 2  Step 5 (i = 2): We multiply row 2 of X by −2 and add the result to row 1 of X to obtain: 1 0 −2 1 X := 0 1 3 − 1  2 2  Step 6 (i = 2): i := 2 + 1 = 3. We now have i > n and the algorithm terminates. Thus using Algorithm1 we have computed: −2 1 A−1 = 3 − 1  2 2  Exercise 31. Does the matrix: 1 2 3 A = 4 5 6 7 8 9 have an inverse? [Hint: Use Gauss-Jordan elimination to find the answer.] Exercise 32 (Bonus). Implement Gauss-Jordan elimination in the programming lan- guage of your choice. Illustrate your implementation by using it to solve the previous exercise. [Hint: Implement sub-routines for each matrix operation. You don’t have to write them as matrix multiplication, though in Matlab, it might speed up your execution. Then use these subroutines to implement each step of the Gauss-Jordan elimination.] 25 7.1. Equations over GF(2). Remark 2.43. We have mentioned that all matrices are defined over a field F. It is therefore logical to note that linear equations Ax = b can be over an arbitrary field, not just R or C. Example 2.44. Consider the system of equations over GF(2):

x1 + x2 + x3 = 1

x1 + x3 = 0

x1 + x2 = 0

This set of equations has a single solution x1 = x2 = x3 = 1. We can show this by computing the inverse of the coefficient matrix in GF(2). Remember, all arithmetic is done in GF(2). Step 1: 1 1 1 1 0 0 X := 1 1 0 0 1 0  1 0 1 0 0 1  Step 2:Add the first row to the second row: 1 1 1 1 0 0 X := 0 0 1 1 1 0  1 0 1 0 0 1  Step 3:Add the first row to the third row: 1 1 1 1 0 0 X := 0 0 1 1 1 0  0 1 0 1 0 1  Step 4:Swap Row 2 and 3  1 1 1 1 0 0 X := 0 1 0 1 0 1  0 0 1 1 1 0  Step 5:Add Row 2 and Row 3 to Row 1 1 0 0 1 1 1 X := 0 1 0 1 0 1  0 0 1 1 1 0  Thus the coefficient matrix A has an inverse and multiplying: 1 1 1 1 1 x = A−1b = 1 0 1 0 = 1 1 1 0 0 1 as expected. It is worth noting that  if we remove the third equation to obtain the system of two equations:

x1 + x2 + x3 = 1

x1 + x3 = 0

We have at least two solutions: x1 = x2 = x3 = 1, and x1 = x3 = 0 with x2 = 1. We discuss cases like this in the next section.
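Exercise 32 above asks for an implementation of Gauss-Jordan elimination. The sketch below is one possible version, specialized to GF(2), where row arithmetic reduces to XOR; the function name and the choice to reduce the augmented matrix [A|b] directly are ours, not the notes'.

```python
import numpy as np

def gauss_jordan_gf2(M):
    """Reduce an augmented 0/1 matrix over GF(2) (a sketch, not the notes'
    algorithm verbatim): all arithmetic is mod 2, so adding one row to
    another is a bitwise XOR and the only nonzero pivot value is 1."""
    M = M.copy() % 2
    rows, cols = M.shape
    pivot_row = 0
    for col in range(cols - 1):              # the last column holds b
        candidates = np.nonzero(M[pivot_row:, col])[0]
        if candidates.size == 0:
            continue                          # no pivot in this column
        swap = pivot_row + candidates[0]
        M[[pivot_row, swap]] = M[[swap, pivot_row]]
        for r in range(rows):                 # clear every other 1 in the column
            if r != pivot_row and M[r, col] == 1:
                M[r] ^= M[pivot_row]
        pivot_row += 1
        if pivot_row == rows:
            break
    return M

# The GF(2) system from the example above: A x = b with b = (1, 0, 0).
A = np.array([[1, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=int)
b = np.array([[1], [0], [0]], dtype=int)
print(gauss_jordan_gf2(np.hstack([A, b])))   # reduces to [I | x] with x = (1, 1, 1)
```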

8. When the Gauss-Jordan Procedure does not yield a Solution to Ax = b −1 Remark 2.45. If Algorithm1 does not terminate with X := [In|A b], then suppose the algorithm terminates with X := [A0|b0]. There is at least one row in A0 with all zeros. That is A0 has the form: 1 0 ... 0 0 1 ... 0 . .. . 0 . . . (2.17) A =   0 0 ... 0 . . . . .. .   0 0 ... 0   In this case, there are two possibilities: (1) For every zero row in A0, the corresponding element in b0 is 0. In this case, there are an infinite number of alternative solutions to the system of equations Ax = b. (2) There is at least one zero row in A0 whose corresponding element in b0 is not zero. In this case, there are no solutions to the system of equations Ax = b.

Example 2.46. Consider the system of equations in R:

x1 + 2x2 + 3x3 = 7

4x1 + 5x2 + 6x3 = 8

7x1 + 8x2 + 9x3 = 9

This yields matrix:

A = [ 1  2  3 ]
    [ 4  5  6 ]
    [ 7  8  9 ]

and right hand side vector b = ⟨7, 8, 9⟩. Applying Gauss-Jordan elimination in this case yields:

(2.18) X := [ 1  0  -1 | -19/3 ]
            [ 0  1   2 |  20/3 ]
            [ 0  0   0 |    0  ]

Since the third row is all zeros, there are an infinite number of solutions. An easy way to solve for this set of equations is to let x3 = t, where t may take on any value in R. Variable x3 is called a free variable. Then, row 2 of Expression 2.18 tells us that:

(2.19) x2 + 2x3 = 20/3 =⇒ x2 + 2t = 20/3 =⇒ x2 = 20/3 - 2t

We then solve for x1 in terms of t. From row 1 of Expression 2.18 we have:

(2.20) x1 - x3 = -19/3 =⇒ x1 - t = -19/3 =⇒ x1 = t - 19/3

Thus every vector in the set:

(2.21) X = { ⟨t - 19/3, 20/3 - 2t, t⟩ : t ∈ R }

is a solution to Ax = b. This is illustrated in Figure 2.1(a).


Figure 2.1. (a) Intersection along a line of 3 planes of interest. (b) Illustration that the planes do not intersect in any common line.

Conversely, suppose we have the problem:

x1 + 2x2 + 3x3 = 7

4x1 + 5x2 + 6x3 = 8

7x1 + 8x2 + 9x3 = 10

The new right hand side vector is b = ⟨7, 8, 10⟩. Applying Gauss-Jordan elimination in this case yields:

(2.22) X := [ 1  0  -1 | 0 ]
            [ 0  1   2 | 0 ]
            [ 0  0   0 | 1 ]

Since row 3 of X has a non-zero element in the b′ column, we know this problem has no solution, since there is no way that we can find values for x1, x2 and x3 satisfying:

(2.23) 0x1 + 0x2 + 0x3 = 1

This is illustrated in Figure 2.1(b).

Remark 2.47. In general, the number of free variables will be equal to the number of rows of the augmented matrix that are all zero. The following theorem now follows directly from analysis of the Gauss-Jordan procedure.

Theorem 2.48. Let A ∈ Fn×n and b ∈ Fn×1. Then exactly one of the following holds:
(1) The system of linear equations Ax = b has exactly one solution given by x = A⁻¹b.
(2) The system of linear equations Ax = b has more than one solution and the number of free variables defining those solutions is equal to the number of zero rows in the augmented matrix after Gauss-Jordan elimination.
(3) The system of linear equations Ax = b has no solution. □

Remark 2.49. We will return to these systems of linear equations in the next chapter, but characterize them in a more general way in the language of vector spaces.

Exercise 33. Solve the problem

x1 + 2x2 = 7

3x1 + 4x2 = 8 using Gauss-Jordan elimination.
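The trichotomy of Theorem 2.48 can also be checked numerically by comparing ranks. The following NumPy sketch (the helper name and the rank-based test are our choices, not the notes' procedure) classifies the two 3 × 3 systems above and the 2 × 2 system of Exercise 33.

```python
import numpy as np

def classify_system(A, b, tol=1e-10):
    """Rough classification of A x = b (square A) in the spirit of Theorem
    2.48, using matrix ranks instead of inspecting the reduced matrix."""
    rank_A = np.linalg.matrix_rank(A, tol=tol)
    rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]), tol=tol)
    if rank_A == A.shape[1]:
        return "unique solution", np.linalg.solve(A, b)
    if rank_A == rank_Ab:
        free = A.shape[1] - rank_A
        return "infinitely many solutions (%d free variable(s))" % free, None
    return "no solution", None

A = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
print(classify_system(A, np.array([7., 8., 9.])))    # infinitely many solutions
print(classify_system(A, np.array([7., 8., 10.])))   # no solution
print(classify_system(np.array([[1., 2.], [3., 4.]]),
                      np.array([7., 8.])))           # Exercise 33: unique solution
```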

9. Change of Basis as a System of Linear Equations Remark 2.50. It may seem a little murky how all this matrix arithmetic is going to play into our rather theoretical discussion of bases from the previous section. Consider Remark 3 1.51 and Example 1.53 in which we expressed the standard basis vector e1 ∈ R in terms of a new basis B = {h1, 1, 0i, h1, 0, 1i, h0, 1, 1i}. We observed that the new coordinates of the 1 1 1 standard basis vector e1 = h1, 0, 0i were 2 , 2 , − 2 . To make that computation, we solved a system of linear equations (Equation 1.9), which can be written in matrix form as:

[ 1  1  0 ] [ α1 ]   [ 1 ]
[ 1  0  1 ] [ α2 ] = [ 0 ]
[ 0  1  1 ] [ α3 ]   [ 0 ]

Notice we had:

v1 = ⟨1, 1, 0⟩,  v2 = ⟨1, 0, 1⟩,  v3 = ⟨0, 1, 1⟩

for the basis vectors and these have become the columns of the matrix. Thus, we can write this more compactly as:

[ v1  v2  v3 ] α = e1

Remark 2.51. We can now tackle this problem in more generality. Let V be an n-dimensional vector space over a field F. Let B = {v1, . . . , vn} be a basis for V and let B′ = {w1, . . . , wn} be a second basis for V. Suppose we are given a vector v ∈ V and we know the coordinates for v in basis B are ⟨β1, . . . , βn⟩. That is:

v = β1v1 + · · · + βnvn

We will write:

[v]B = ⟨β1, . . . , βn⟩ (written as a column vector)

to mean that β1, . . . , βn are the coordinates of v expressed in basis B. We seek to compute coordinates for v with respect to B′; i.e., we want to find [v]B′. One way to do this is simply to solve explicitly for the necessary coordinates for v using B′. However, this is not a general solution. We'd have to repeat this process each time we want to convert from one coordinate system (basis) to another. Instead, suppose that we execute this procedure for just the basis elements in B (just as we did in Example 1.53). Then for each j ∈ {1, . . . , n} we compute:

α1jw1 + α2jw2 + · · · + αnjwn = vj

That is, the coordinates of basis vector vj ∈ B written in terms of the basis B′ are ⟨α1j, . . . , αnj⟩. WARNING: Here is where it gets a little complicated with notation! For each basis vector vj, we've generated a column vector ⟨α1j, . . . , αnj⟩ because we're solving the linear equations:

[ w1  w2  · · ·  wn ] ⟨α1j, α2j, . . . , αnj⟩ = vj

(with the coefficient vector written as a column). We can now express the original vector v in terms of the basis B′ by noting that:

v = β1v1 + · · · + βnvn = β1(α11w1 + · · · + αn1wn) + · · · + βn(α1nw1 + · · · + αnnwn)

For a fixed i, the coefficient of wi is β1αi1 + β2αi2 + · · · + βnαin. Thus, the vector of coordinates for v in terms of B′ is:

[v]B′ = ⟨β1α11 + · · · + βnα1n, β1α21 + · · · + βnα2n, . . . , β1αn1 + · · · + βnαnn⟩ (written as a column vector)

There is a simpler way to write this (using matrices)! Let:

(2.24) ABB′ = [αij] = [ α1 · · · αn ],

the n × n matrix whose j-th column αj = ⟨α1j, . . . , αnj⟩ is the column vector of the coordinates of vj expressed in basis B′. We can now prove the change of basis theorem.

Figure 2.2. The vectors for the change of basis example are shown. Note that v is expressed in terms of the standard basis in the problem statement.

Theorem 2.52. Let B = {v1, . . . , vn} and B′ = {w1, . . . , wn} be two bases for an n-dimensional vector space V over a field F. Suppose that:

α1jw1 + · · · + αnjwn = vj

for j = 1, . . . , n and let ABB′ be defined as in Equation 2.24. Then, if [v]B = ⟨β1, . . . , βn⟩ = β, we have:

[v]B′ = ABB′ β

Proof. The proof is by computation:

ABB′ β = ⟨β1α11 + · · · + βnα1n, . . . , β1αn1 + · · · + βnαnn⟩ = [v]B′,

as required. □

Corollary 2.53. The matrix ABB′ is invertible and furthermore AB′B = (ABB′)⁻¹.
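As a numerical check of Theorem 2.52, the sketch below builds ABB′ column by column by solving Wαj = vj, where W has the vectors of B′ as its columns. The two bases are the ones used in the example that follows; the variable names are ours.

```python
import numpy as np

# Sketch of Equation 2.24: the j-th column of A_BB' solves W alpha_j = v_j,
# where the columns of W are the basis vectors of B'.
B  = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]     # basis B
Bp = [np.array([-1.0, 0.0]), np.array([0.0, -1.0])]    # basis B'

W = np.column_stack(Bp)
A_BBp = np.column_stack([np.linalg.solve(W, v) for v in B])
print(A_BBp)                      # the change of basis matrix

beta = np.array([1.0, 1.0])       # coordinates of a vector v in basis B
print(A_BBp @ beta)               # coordinates of the same vector in basis B'
```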

Exercise 34. Sketch a proof of Corollary 2.53. [Hint: Remember that ABB0 [wi]B = −1 [wi]B0 . Therefore, [wi]B = ABB0 [wi]B0 . Assert this is sufficient based on the argument in the preceding remark.] Example 2.54. If this is confusing, don’t worry. Change of basis is very confusing because it’s notationally intense. Let’s try a simple example in R2. Let B = {h1, 1i, h1, −1i} and let B0 = {h−1, 0i, h0, −1i}. We are intentionally not using the standard basis here. Assume we have the vector h2, 0i, expressed in the standard basis. This is shown in Figure 2.2. Assume we are transforming from B to B0. We must express v in terms of B first. It’s easy to check that: 1 1 2 1 · +1 · = 1 −1 0       Thus: 1 [v] = , B 1   31 0 with color added for emphasis. Now, to express h1, 1i in B we are seeking hα1,1, α2,1i satis- fying: −1 0 α 1 1,1 = 0 −1 α2,1 1       The solution is (clearly) α1,1 = α2,1 = −1. Thus, [h1, 1i]B0 = h−1, −1i. For the second basis vector in B, we must compute: −1 0 α 1 1,2 = 0 −1 α2,2 −1       The solution is (clearly) α1,2 = −1, α2,2 = 1. Thus, [h1, −1i]B0 = h−1, 1i. Notice each time, we are forming a matrix whose columns are the basis vectors of B0 and the right hand side is the basis vector from B in question. We can now form ABB0 using α1,1, α2,1, α1,2, and α2,2: −1 −1 A 0 = BB −1 1   We can use ABB0 to compute [v]B0 using [v]B: −1 −1 1 −2 [v] 0 = A 0 [v] = = B BB B −1 1 1 0       Fortunately, this is exactly what we’d expect. To express the vector h2, 0i in terms of B0 we’d have: −1 0 2 −2 · + 0 · = . 0 −1 0       10. Building New Vector Spaces1 Remark 2.55. The proof of the next proposition is left as an exercise in checking that the requirements of a vector space are met.

Proposition 2.56. Let V be a vector space over a field F and suppose that W and U are both subspaces of V. Then W ∩ U = {v ∈ V : v ∈ W and v ∈ W} is a subspace of V. Exercise 35. Prove Proposition 2.56. Remark 2.57. It may seem that Proposition 2.56 is non sequitur after our discussions, but it is consistent with our work on linear equations. Any linear equation:

ai1x1 + ai2x2 + · · · + ainxn = 0

defines a (hyper)plane in Fn. Here, aij ∈ F. That is:

Hi = { x ∈ Fn : ai1x1 + · · · + ainxn = 0 }

[Footnote to Section 10: This material can be delayed until the discussion of orthogonality.] This is a subset of vectors in Fn and it can be shown that it is in fact a subspace of Fn over the field F. A system of linear equations Ax = b simply yields the set of vectors in the intersection of the subspaces. That is, any solution to Ax = 0 is in the space:

H = ⋂i Hi

Thus by Proposition 2.56 it must be a subspace of Fn.
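A small numerical illustration of this point (the matrix is invented, and the SVD-based null-space computation is our choice of tool): the solution set of Ax = 0, i.e. the intersection of the row hyperplanes, is closed under linear combinations.

```python
import numpy as np

# Each row a_i of A defines a hyperplane H_i = {x : a_i . x = 0}; the solution
# set of A x = 0 is their intersection.  A basis for it can be read off from
# the right singular vectors attached to (numerically) zero singular values.
A = np.array([[1., 2., 3.],
              [2., 4., 6.]])          # two coincident hyperplanes in R^3

_, s, Vt = np.linalg.svd(A)
padded = np.concatenate([s, np.zeros(A.shape[1] - len(s))])
N = Vt[padded < 1e-10]                # rows span the solution space of A x = 0

x = 2.0 * N[0] + 3.0 * N[1]           # any linear combination is still a solution
print(np.allclose(A @ x, 0))          # True: the intersection is a subspace
```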

Figure 2.3. The intersection of two sub-spaces in R3 produces a new sub-space of R3.

Exercise 36. Suppose

Hi = { x ∈ Fn : ai1x1 + · · · + ainxn = 0 }

Show that Hi is a subspace of Fn.

Remark 2.58. There are other ways to build new subspaces of a vector space from existing subspaces than simple intersection.

Definition 2.59 (Sum). Let V be a vector space over a field F. Suppose W1 and W2 are two subspaces of V. The sum of W1 and W2, written W1 + W2 is the set of all vectors in V with form v1 + v2 for v1 ∈ W1 and v2 ∈ W2.

Lemma 2.60. The set of vectors W1 + W2 from Definition 2.59 is a subspace of V.

Proof. The fact that 0 ∈ W1, W2 and 0 = 0 + 0 implies that 0 ∈ W1 + W2. If v1, v2 ∈ W1 and w1, w2 ∈ W2, then:

(v1 + w1) + (v2 + w2) = (v1 + v2) + (w1 + w2) ∈ W1 + W2

thus showing W1 + W2 is closed under vector addition. Finally, choose any a ∈ F. We know that if v ∈ W1 and w ∈ W2, then av ∈ W1 and aw ∈ W2. Thus a(v + w) ∈ W1 + W2. Thus W1 + W2 is a subspace of V. 

Definition 2.61. A vector space V is a direct sum of subspaces W1 and W2, written W1 ⊕ W2 = V, if for each v ∈ V there are two unique elements u ∈ W1 and w ∈ W2 such that v = u + w.

Figure 2.4. The sum of two sub-spaces of R2 that share only 0 in common recreates R2.

Theorem 2.62. Let V be a vector space with two subspaces W1 and W2 such that V = W1 + W2. If W1 ∩ W2 = {0}, then W1 ⊕ W2 = V.

Proof. Choose any vector v ∈ V. Since V = W1 + W2 there are two elements u ∈ W1 and w ∈ W2 so that v = u + w. Now suppose there are two other elements u′ ∈ W1 and w′ ∈ W2 so that v = u′ + w′. Then:

(u + w) - (u′ + w′) = (u - u′) + (w - w′) = 0.

Thus:

u - u′ = w′ - w

But u - u′ ∈ W1 and w′ - w ∈ W2. The fact that u - u′ = w′ - w holds implies that w′ - w ∈ W1 and u - u′ ∈ W2. The only element these subspaces share in common is 0, thus u - u′ = 0 = w′ - w. Therefore, u = u′ and w = w′. This completes the proof. □

Example 2.63. Theorem 2.62 is illustrated in Figure 2.4.

Theorem 2.64. Let V be a vector space with a subspace W1. Then there is a subspace W2 such that V = W1 ⊕ W2.

Proof. Assume dim(V) = n. If W1 = V, let W2 = {0}. Otherwise, let B1 = {v1, . . . , vm} be any basis for W1. By the same reasoning as in Corollary 1.64, we can build vectors vm+1, . . . , vn not in B1 so that B = {v1, . . . , vn} forms a basis for V. Let W2 = span({vm+1, . . . , vn}). At once we see that W1 ∩ W2 = {0}, and the result follows from Theorem 2.62. □

Theorem 2.65. Suppose that V is a vector space with subspaces W1 and W2 and V = W1 ⊕ W2. Then dim(V) = dim(W1) + dim(W2). 34 Proof. Let B1 = {v1,..., vm} be a basis for W1 and B2 = {w1,..., wp} be a basis for W2. Every vector in v ∈ V can be uniquely expressed as: m p

v = αivi + βjwj i=1 j=1 X X Thus, {v1,..., vm, w1,..., wp} must form a basis for V. The fact that dim(V) = dim(W1) + dim(W2) follows at once. 

Proposition 2.66. Let V1 and V2 be (any) two vector spaces over a common field F. Let V = V1 × V2 = {(v1, v2): v1 ∈ V1 and v2 ∈ V2}. Suppose 0i ∈ Vi is the zero vector in Vi, (i = 1, 2). Let 0 = (01, 02) ∈ V. If we define

(2.25) (v1, v2) + (u1, u2) = (v1 + u1, v2 + u2) and for any a ∈ F we have: a(v1, v2) = (av1, av2), then V is a vector space over F. 

Corollary 2.67. If V = V1 × V2, then dim(V) = dim(V1) + dim(V2).

Exercise 37. Verify that V = V1 × V2 is a vector space over F. Prove Corollary 2.67.

Definition 2.68. The vector space V = V1 × V2 is called the product space of V1 and V2. Example 2.69. We’ve already seen a number of examples. Clearly, C2 is the product space of C and C over the field C. However, you could easily have the product space of C × R over the field R. Remark 2.70. Product spaces and direct sums generalize to any number of vector spaces or vector subspaces. In general we’d write: n

V = V1 × · · · × Vn (written ∏ Vi). Similarly we'd have:

V = W1 ⊕ · · · ⊕ Wn (written ⊕ Wi)

to mean that every element of V is a unique sum of elements of the subspaces W1, . . . , Wn.


CHAPTER 3

Linear Transformations

1. Goals of the Chapter

(1) Introduce Linear Maps (or Linear Transforms)
(2) Discuss Image and Kernel
(3) Prove the dimension theorem
(4) Matrix of a Linear Transform
(5) Applications of Linear Transformations

2. Linear Transformations Definition 3.1 (Linear Transformation). Let V and W be two vector spaces over the same base field F. A linear transformation is a function f : V → W such that if v1 and v2 are vectors in V and α is a scalar in F, then:

(3.1) f(αv1 + v2) = αf(v1) + f(v2) Remark 3.2. A linear transformation is sometimes called a linear map. Exercise 38. Show that Equation 3.1 can be written as two equations:

f(v1 + v2) = f(v1) + f(v2)

f(αv1) = αf(v1)

Exercise 39. Show that if f : V → W is a linear transformation, then f(0) = 0 ∈ W.

Example 3.3. Let a be any real number and consider the vector space R (over itself). Then f(v) = av is a linear transformation.

Example 3.4. Generalizing from the previous example, let F be an arbitrary field and n m×n n consider the space F . If A ∈ F , then the function fA(v) = Av for v ∈ F (written as a column vector) is a linear transformation from Fn to Fm. Put more simply, matrix multi- plication by an m × n matrix transforms n-dimensional (column) vectors to m dimensional (column) vectors. n×1 To see this, note that if v1, v2 ∈ F and α ∈ F, then:

fA(αv1 + v2) = A (αv1 + v2) = αAv1 + Av2 = αfA(v1) + fA(v2), by simple matrix multiplication.
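A quick numerical spot-check of this identity, with randomly generated data chosen purely for illustration:

```python
import numpy as np

# Check linearity of f_A(v) = A v for one random choice of A, v1, v2 and alpha.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2))
v1, v2 = rng.normal(size=2), rng.normal(size=2)
alpha = 2.5

print(np.allclose(A @ (alpha * v1 + v2), alpha * (A @ v1) + A @ v2))  # True
```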

Remark 3.5 (Notation). Throughout the remainder of these notes, if A ∈ Fm×n, we will n use fA to denote the linear transform fA(v) = Av for some v ∈ F . 37 Example 3.6. Consider the vector space P[x2] of polynomials with real coefficients and degree at most 2 over the field of real numbers. We can show that differentiation is a linear transformation from P[x2] to P[x]. Let v be a vector in P[x2]. Then we know: v = ax2 + bx + c for some appropriately chosen a, b, c ∈ R. The derivative transformation is defined as: (3.2) D(v) = 2ax + b and we see at once that D(v) ∈ P[x]. To see that it’s a linear map note that if v1 = 2 2 2 a1x + b1x + c and v2 = a2x + b2x + c2 are both polynomials in P[x ] and α ∈ R, then: 2 αv1 + v2 = (αa1 + a2)x + (αb1 + b2)x + (αc1 + c2) after factorization. Then:

D(αv1+v2) = 2(αa1+a2)x+(αb1+b2) = α(2a1x+b1)+(2a2x+b2) = αD(v1)+D(v2). Thus, polynomial differentiation is a linear transformation. Example 3.7. Continuing the previous example, differentiation can even be accom- plished as a matrix multiplication. Recall the standard basis for P[x2] is {x2, x, 1}, so that a ha, b, ci is equivalent to the polynomial ax2 + bx + c. Let: 2 0 0 D = 0 1 0   Then writing an arbitrary polynomial (vector) in coordinate form: v = ha, b, ci we see: a 2 0 0 2a Dv = b = 0 1 0   b   c   This new vector is equivalent  to the polynomial 2ax + b in P[x] when the standard basis {x, 1} is used for that space. Thus, polynomial differentiation can be expressed as matrix multiplication. Exercise 40. Find the matrix that computes the derivative mapping from P[xn] to P[xn−1] where n ≥ 1. Definition 3.8 (One-to-One or Injective). Let V and W be two vector spaces over the same field F. Let f : V → W be a linear transformation from V to W. The function f is called one-to-one (or injective) if for any vectors v1 and v2 in V: if f(v1) = f(v2), then v1 = v2. Definition 3.9 (Onto or Surjective). Let V and W be two vector spaces over the same field F. Let f : V → W be a linear transformation from V to W. The function f is called onto (or surjective) if for any vectors w ∈ W there is a vector v ∈ V such that f(v) = w. Example 3.10. The derivative map from P[x2] to P[x] is surjective but not injective. 2 2 To see this, note that the polynomials ax + bx + c1 and ax + bx + c2 both map to 2ax + b in P[x] under D, even when c1 6= c2. Thus, D is not injective. 1 2 On the hand, given any ax+b in P[x], the quadratic polynomial 2 ax +bx maps to ax+b under D. Thus D is surjective. 38 n×n n n Exercise 41. Show that if A ∈ F is invertible, then the linear mapping fA : F → F defined by f(v) = Av is injective and surjective. Definition 3.11. A linear map that is both injective and surjective is bijective or is called a bijection. Definition 3.12 (Inverse). Suppose f : V → W is a bijection from vector space V to vector space W, both defined over common base field F. The inverse map, f −1 : W → V is defined as: (3.3) f −1(w) = v ⇐⇒ f(v) = w The fact that f is a bijection means that for each w in W there is a unique v ∈ V so that f(v) = w, thus f −1(w) is uniquely defined for each w in W. Proposition 3.13. If f is a linear transformation from vector space V to vector space W, both defined over common base field F and f is a bijection then f −1 is also a bijective linear transform of W to V. Exercise 42. Prove Proposition 3.13. Remark 3.14. Functions can be injective, surjective or bijective without being linear transformations. We focus on linear transformations because they are the most useful to us. Definition 3.15 (Isomorphism). Two vector spaces V and W over a common base field F are isomorphic if there is a bijective linear transform from V to W. The function is then called an isomorphism. Example 3.16. Consider the subspace V of R3 composed of all vectors of the form ht, 2t, 3ti where t ∈ R. This defines a line passing through the origin in 3-space (R3). Let f : ht, 2t, 3ti 7→ t, so: f : V → R. (1) This function is onto, since for each t ∈ R we have f(ht, 2t, 3t) = t. (2) This function is one-to-one, since if v1, v2 ∈ V and f(v1) = f(v2) = t, then v1 = v2 = ht, 2t, 3ti. Verifying that f is a linear transformation is left as an exercise. Therefore, V is isomorphic to R. Exercise 43. Verify that the function f in Example 3.16 is a linear transformation. Exercise 44. 
Prove that P[x2] is isomorphic to R3. 3. Properties of Linear Transforms Definition 3.17 (Composition). Let U, V and W be three vector spaces over a common base field F. If f : U → V and g : V → W, then g ◦ f : U → W is defined as: (3.4) (g ◦ f)(u) = g(f(u)) = w ∈ W where u ∈ U. Proposition 3.18. The composition operation is associative. That is: if U, V, W and X are four vector spaces over a common base field F. If f : U → V, g : V → W, and h : W → X , then: h ◦ g ◦ f = (h ◦ g) ◦ f = h ◦ (g ◦ f) 39 Exercise 45. Prove Proposition 3.18. Theorem 3.19. Let U, V and W be three vector spaces over a common base field F. Let f : U → V and g : V → W be linear transformations. Then (f ◦ g): U → W is a linear transformation.

Proof. Let u1, u2 ∈ U and let α ∈ F. Then:

(g ◦ f)(αu1 + u2) = g(f(αu1 + u2)) = g(αf(u1) + f(u2)) =

αg(f(u1)) + g(f(u2)) = α(g ◦ f)(u1) + (g ◦ f)(u2) This completes the proof.  Definition 3.20 (Identity/Zero Map). Let V be a vector space over a field F. The function ι : V → V defined by ι(v) = v is the identity function. Let W be a second vector space over the field F. The function ϑ : V → W defined by ϑ(v) = 0 ∈ W (the zero vector) is the zero function. Lemma 3.21. Both the identity function and the zero function are linear transformations.

Definition 3.22 (Automorphism). Let f : V → V be an isomorphism from a vector space V over field F to itself. Then f is called an automorphism.

Theorem 3.23. Let AutF(V) be the set of all automorphisms of V over F. Then (AutF(V), ◦, ι) forms a group. Proof. Function composition is associative, by Proposition 3.18. Furthermore, we know

from Theorem 3.19 that AutF(V) must be closed under composition. The identity map ι acts like a unit since: (ι ◦ f)(v) = ι(f(v)) = f(v) = f(ι(v)) = (f ◦ ι)(v)

for any v ∈ V. Finally, since each element of AutF(V) is a bijection, it must be invertible and thus every element of AutF(V) has an inverse. Thus AutF(V) is a group.  Definition 3.24 (Addition). Let V and W be vector spaces over the base field F if f, g : V → W, then: (3.5) (f + g)(v) = f(v) + g(v) where v ∈ V. Lemma 3.25. Let V and W be vector spaces over the base field F if f, g : V → W are linear transformations, then (f + g) is a linear transformation of V to W.

Proof. Let v1, v2 ∈ V and let α ∈ F. Then:

(f + g)(αv1 + v2) = f(αv1 + v2) + g(αv1 + v2) =

αf(v1) + f(v2) + αg(v1) + g(v2) = α (f(v1) + g(v1)) + (f(v2) + g(v2)) =

α(f + g)(v1) + (f + g)(v2)  40 Exercise 46. Prove Lemma 3.21 Lemma 3.26. Let L(V, W) be the set of all linear transformations from V to W, two vector spaces that share base field F. Then (L(V, W), +, ϑ) is a group. Proof. Lemma 3.25 shows that L(V, W) is closed under addition. Addition is clearly associative and the additive identity is obviously ϑ, the zero function, which is itself a linear transformation by Lemma 3.21. The additive inverse of f ∈ L(V, W) is obviousy −f, defined by (−f)(v) = −f(v) for any v ∈ V. This is a linear transformation if and only if f is a linear transformation. Thus, L(V, W) is a group.  Remark 3.27. The following theorem is easily verified. Theorem 3.28. Let L(V, W) be the set of all linear transformations from V to W, two vector spaces that share base field F. Then L(V, W) is a vector space over the basefield F, with scalar-vector multiplication defined as (αf)(v) = αf(v) for any f ∈ L(V, W).  4. Image and Kernel Definition 3.29 (Kernel). Let V and W be two vector spaces over the same field F. Let f : V → W be a linear transformation from V to W. Then the kernel of f, denoted Ker(f) is defined as: (3.6) Ker(f) = {v ∈ V : f(v) = 0 ∈ W} Example 3.30. Let A ∈ Fm×n for some positive integers m and n and a field F. The kernel of the linear map fA(v) = Av is the solution set of the system of linear equations: Ax = 0 To make this more concrete, consider the matrix: 1 2 A = 2 4   with elements from R. In this case, we have the matrix equation for the kernel of fA: 1 2 x 0 1 = 2 4 x2 0       The set of solutions is:

Ker(fA) = {h−2t, ti : t ∈ R} This is illustrated in Figure 3.1. n×n Exercise 47. Show that if A ∈ F is invertible, then Ker(fA) = {0}. Exercise 48. Show that the kernel of D on P[xn] is the set of all constant real polyno- mials. Definition 3.31 (Image). Let V and W be two vector spaces over the same field F. Let f : V → W be a linear transformation from V to W. Then the image of f, denoted Im(f) is defined as: (3.7) Im(f) = {w ∈ W : ∃v ∈ V (w = f(v))} That is, the set of all w ∈ W so that there is some v ∈ V so that f(v) = w. 41 Remark 3.32. Clearly if f is a surjective linear mapping from V to W, then Im(f) = W. m×n Example 3.33. Let A ∈ F be a matrix and fA be the linear transformation sending n m v ∈ F to Av in F ; i.e., fA(v) = Av. If:

A = a1 ··· an

where aj = A·j, then Im(fA) = span({a1,..., an}). To be more concrete, let 1 2 A = 2 4   We already know that the columns of A are linearly dependent, so we’ll focus exclusively on the first column. We have:

Im(fA) = span (h1, 2i, h2, 4i) = {ht, 2ti : t ∈ R} 2 2 We illustrate both the image and the kernel of fA in Figure 3.1. Note: since fA : R → R ,


Figure 3.1. The image and kernel of fA are illustrated in R2.

we are showing the image and kernel in the same plot. In general, the image is a subset of the W when the linear transform maps vector space V to W; i.e., W does not have to be the same as V. Remark 3.34. For the remainder of this section, we will assume that V and W are vector spaces over a common base field F. Furthermore, we will assume that f : V → W is a linear transformation. Theorem 3.35. The kernel of f is a subspace of V. Proof. In Exercise 39 it was shown that f(0) = 0 ∈ W, thus 0 ∈ Ker(f). Suppose that v1, v2 ∈ Ker(f). Then:

f(v1 + v2) = f(v1) + f(v2) = 0.

Thus v1 + v2 ∈ Ker(f). Finally, if v ∈ Ker(f) and α ∈ F, then: f(αv) = αf(v) = 0 42 Consequently Ker(f) is closed under vector addition and scalar-vector multiplication and contains 0 ∈ V Thus it is a subspace.  Theorem 3.36. The image of f is a subspace of W.  Exercise 49. Prove Theorem 3.36. [Hint: The proof is almost identical to the proof of Theorem 3.35.] Theorem 3.37. Let f : V → W be a linear transformation. The function f is a injective if and only if Ker(f) = {0}.

Proof. (⇐) Suppose that Ker(f) = {0}. Let v1, v2 ∈ V have the property that f(v1) = f(v2). Then:

0 = f(v1) − f(v2) = f(v1 − v2)

It follows that v1 − v2 ∈ Ker(f) and thus: v1 − v2 = 0. Thus v1 = v2. Therefore f is injective. (⇒) Suppose f is injective. We know from Exercise 39 that 0 ∈ Ker(f). Therefore, Ker(f) = {0}.  Corollary 3.38. If f is surjective and Ker(f) = {0}, then f is an isomorphism.  n×n Exercise 50. Suppose A ∈ F is invertible. Prove that fA is an isomorphism from Fn to itself.
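Before continuing, here is a small NumPy illustration of the kernel and image of the matrix used in Examples 3.30 and 3.33; the use of the SVD is our choice of numerical method, not something prescribed by the notes.

```python
import numpy as np

# The matrix from Examples 3.30 and 3.33.  Its kernel is spanned by <-2, 1>
# and its image by <1, 2>; both can be read off from the SVD.
A = np.array([[1., 2.],
              [2., 4.]])

U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))

kernel_basis = Vt[rank:]     # right singular vectors for zero singular values
image_basis = U[:, :rank]    # left singular vectors for nonzero singular values

print(kernel_basis)          # proportional to <-2, 1> (unit length)
print(image_basis)           # proportional to <1, 2>  (unit length)
print(np.allclose(A @ kernel_basis.T, 0))   # kernel vectors map to 0
```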

Lemma 3.39. Suppose Ker(f) = {0}, then dim(Im(f)) = dim(V) and if {v1,..., vn} is a basis for V, then {f(v1), . . . , f(vn)} is a basis for Im(f). Proof. We know from Theorem 3.37 f is injective. Let w ∈ Im(f). Then there is a unique v ∈ V such that f(v) = w. But there are scalars α1, . . . , αn such that: n

v = αivi i=1 X Then: n

w = f(v) = α1f(v1) + · · · + αnf(vn)

Thus {f(v1), . . . , f(vn)} spans Im(f). Suppose that:

αif(vi) = 0, i=1 X then:

f(α1v1 + ··· + αnvn) = 0 The fact that f is injective implies that:

α1v1 + ··· + αnvn = 0

But, {v1,..., vn} is a basis and thus α1 = ··· = αn = 0. Thus, {f(v1), . . . , f(vn)} is linearly independent and therefore a basis.  43 Exercise 51. Conclude a fortiori that if f is injective and {v1,..., vn} is linearly inde- pendent, then so is {f(v1), . . . , f(vn)}. Theorem 3.40. The following relationship between subspace dimensions holds: (3.8) dim(V) = dim(Ker(f)) + dim(Im(f))

Proof. Suppose that dim(V) = n. If Ker(f) = {0}, then dim(Ker(f)) = 0 and by Lemma 3.39, dim(Im(f)) = n and the result follows at once. Assume Ker(f) 6= {0} and suppose that dim(Ker(f)) = m < n. Let {v1,..., vm} be a basis for Ker(f) in V. By Corollary 1.64, there are vectors {vm+1,..., vn} so that {v1,..., vn} form a basis for V. We will show that {f(vm+1), . . . , f(vn)} forms a basis for Im(f). Let w ∈ Im(f). Then there is some v ∈ V so that f(v) = w. We can write: n

v = αivi i=1 X for some α1, . . . , αn ∈ F. Taking the linear transform of both sides yields: n

w = f(v) = αif(vi). i=1 X However, f(vi) = 0 for i = 1, . . . , m and thus we can write: n

w = αif(vi). i=m+1 X Thus, {f(vm+1), . . . , f(vn)} spans Im(f). Now suppose that: n

αif(vi) = 0. i=m+1 X Then:

f(αm+1vm+1 + ··· + αnvn) = 0

But, vm+1,..., vn are in a basis for V with v1,..., vm, the basis for Ker(f). Thus:

(1) αm+1vm+1 + ··· + αnvn is not in Ker(f), which means that: (2) αm+1vm+1 + ··· + αnvn = 0 ∈ V. In this case, αm+1 = ··· = αn = 0, otherwise {v1,..., vn} could not form a basis for V.

Thus {f(vm+1), . . . , f(vn)} is linearly independent and so it must be a basis for Im(f). Therefore we have shown that when dim(V) = n and dim(Ker(f)) = m < n, then dim(Im(f)) = n - m. Thus n = m + (n - m) and Equation 3.8 is proved. □

5. Matrix of a Linear Map

Proposition 3.41. The set of m × n matrices Fm×n with ordinary matrix addition forms a vector space over the field F.

Exercise 52. Prove Proposition 3.41.
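Since every map fA is linear, Theorem 3.40 can be checked numerically for matrices. A short sketch (the matrix is arbitrary, and SciPy's null_space is used only to obtain the kernel dimension):

```python
import numpy as np
from scipy.linalg import null_space

# Numerical illustration of Theorem 3.40 (rank-nullity) for f_A(x) = A x:
# dim(F^n) = dim(Ker(f_A)) + dim(Im(f_A)).  The matrix is an arbitrary example.
A = np.array([[1., 2., 3., 4.],
              [2., 4., 6., 8.],
              [1., 0., 1., 0.]])

dim_image = np.linalg.matrix_rank(A)        # dim(Im(f_A)) = rank(A)
dim_kernel = null_space(A).shape[1]         # dim(Ker(f_A)) = nullity of A
print(dim_image, dim_kernel, dim_image + dim_kernel == A.shape[1])   # 2 2 True
```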

v = αivi i=1 X n for α1, . . . , αn ∈ F, then πB(v) = hα1, . . . , αni. Then πB is an isomorphism of V to F .

Proof. The fact that πB is injective is proved in Lemma 1.50. Surjectivity is clear from construction. The fact that it is a linear transformation is clear from ordinary arithmetic in n F .  Theorem 3.43. Let V be a vector space with dimension n over field F and let W be a vector space with dimension m over F. Let B = {v1,..., vn} be a basis for V and let 0 B = {w1,..., wm} be a basis for W. Suppose that g : V → W is a linear transformation. g m×n th Then the matrix ABB0 ∈ F has j column defined by: g (3.9) A 0 = π 0 (g(v )) BB ·j B j n m defines the linear transformation fAg from to with the property that if w = g(v), BB0 F F then: g (3.10) ABB0 πB(v) = πB0 (w)

Remark 3.44. Before proving this theorem, let us discuss what’s going on here. First, recall that if f(v) = Av for A ∈ Fm×n and v ∈ Fn, we denote this linear transformation as fA, which is why we use g in the statement of the theorem. Secondly, the mappings can be visualized using a commutative diagram:

g fA πB n BB0 m V - F - F π−1 g B0 -  W Given a vector space V and basis B, V can be mapped to Fn. This space can be linearly m g transformed to a subspace of F by matrix multiplication with ABB0 . The resulting subspace maps to a subspace of W, which (we claim) is identical to the subspace generated by the action of the linear transform g on V, which produces vectors in W.

Proof of Theorem 3.43. For any basis vector vj ∈ B, we know that πB(vj) = ej = h0,..., 1, 0 ...i; here the 1 appears in the jth position. Then: g g g (3.11) A 0 π (v ) = A 0 e = A 0 = π 0 (g(v )) BB B j BB j BB ·j B j by definition, thus this transformation works for each basis vector in V. Let v be an arbitrary vector in V, then: n

v = αjvj j=1 X 45 for some α1, . . . , αn ∈ F. In particular, πB(v) = hα1, . . . , αni = α. But then: n n g g g A 0 π (v) = A 0 α = α A 0 = α π 0 (g(v )) = BB B BB j BB ·j j B j j=1 j=1 X X πB0 (f(α1v1 + ··· + αnvn)) = πB0 (g(v)) by Equation 3.11 and the properties of linear transformations.  Example 3.45. We will construct the matrix D corresponding to the linear transforma- tion D already discussed on P[x2]. Let x2, x, 1 be B the basis we will use for P[x2]. We know that D : P[x2] → P[x]. Thus, we B0 = {x, 1} be the basis to be used for P[x]. We will consider each basis element of P[x2] in turn:

D π 0 2 x2 −→ 2x −−→B 0   D π 0 0 x −→ 1 −−→B 1   D π 0 0 1 −→ 0 −−→B 0   Arranging these into the columns of a matrix, we obtain: 2 0 0 D = 0 1 0   Now, we can use matrix multiplication to differentiate quadratic polynomials:

a −1 π f 2a π 0 ax2 + bx + c −→B b −→D −−→B 2a + b   b c   Remark 3.46. The following  corollaries are stated without proof. Corollary 3.47 (First Corollary of Theorem 3.43). Given two vector spaces V and W of dimension n and m (respectively) over a common base field F and two bases B and B0 of m×n f these spaces (respectively), the mapping ϕBB0 : L(V, W) → F defined by ϕBB0 : f 7→ ABB0 is a vector space isomorphism. Corollary 3.48 (Second Corollary of Theorem 3.43). Let V be a vector space over

F with dimension n. Then AutF(V) is isomorphic to GL(n, F) the set of invertible n × n matrices with elements from F. Exercise 53. Prove the previous two corollaries. Remark 3.49. It should be relatively that each linear transform maps to a unique ma- trix assuming a given set of bases for the respectively vector spaces and every matrix can represent some linear transform. Thus, there is a one-to-one and onto mapping of matrices to linear transforms and this mapping must respect matrix addition and scalar multiplication. Therefore, it is an isomorphism. Definition 3.50 (Rank). The rank of a matrix A ∈ Fm×n is equal to the number of linearly independent columns of A and is denoted rank(A). 46 m×n n m Proposition 3.51. Let A ∈ F . Then ϕ : A 7→ fA ∈ L(F , ×, F ) where fA(x) = Ax is an isomorphism between Fm×n and L(Fn, Fm). Furthermore, rank(A) = dim(Im(A)). Proof. The isomorphism between Fm×n and L(Fn, Fm) is a consequence of Corollary 3.47. Let A be composed of columns a1,..., an. Suppose that A has k linearly independent columns and without loss of generality suppose they are the columns 1, . . . k. Then for every other index K > k we can write:

k

aK = αiai i=1 X Let x = hx1, . . . , xni. Then:

k

(3.12) fA(x) = βiaixi i=1 X Thus, if y ∈ Im(fA) is a linear combination of the k linearly independent vectors a1,..., ak, which form a basis for Im(fA). This completes the proof.  m×n Definition 3.52 (Nullity). If A ∈ F , then the nullity of A is dim(Ker(fA)). That is, it is the dimension of the vector space of solutions to the linear system Ax = 0. Remark 3.53. Consequently, Theorem 3.40 is sometimes called the rank-nullity theorem because it relates the rank of a matrix to its nullity. Restricting to matrices is now sensi- ble in the context of Theorem 3.43 because every linear transformation is simply a matrix multiplication in an appropriate coordinate space.

Remark 3.54. If A ∈ Rm×n, then the space spanned by the columns of A is called the column space. The space spanned by the rows is called the row space, and the kernel of the corresponding linear transform fA is called the null space.

6. Applications of Linear Transforms Remark 3.55. Linear transforms on Rn are particularly useful in graphics and mechan- ical modeling. We’ll focus on a few transforms that have physical meaning in two and three dimensional Euclidean space.

Definition 3.56 (Scaling). Any vector in Rn can be scaled by a factor of α by multiplying by the matrix A = αIn. Example 3.57. Consider the vector h1, 1i. If scaled by a factor of 2 we have: 2 0 1 2 = 0 2 1 2       This is illustrated in Figure 3.2. Definition 3.58 (Shear). A shearing matrix is an elementary matrix that corresponds to the addition of a multiple of one row to another. In two dimensional space, it is most 47 easily modeled by the matrices: 1 s A = 1 0 1   1 0 A = 2 s 1   Example 3.59. Shearing is visually similar to a slant operation. For example, take the matrix A1 with s = 2 and apply it to h1, 1i to obtain: 1 2 1 3 = 0 1 1 1       See Figure 3.2. Definition 3.60 (Rotation 2D). The two dimensional defined by: cos θ − sin θ R = θ sin θ cos θ   Rotates any vector v ∈ R2 by θ radians in the counter-clockwise direction. Example 3.61. Rotating the vector h1, 1i by counter-clockwise rotation by π/12 radians can be accomplished with the multiplication: cos π − sin π 1 cos π − sin π 12 12 = 12 12 sin π cos π 1 cos π + sin π  12 12     12 12 


Figure 3.2. Geometric transformations are shown in the figure above.

Remark 3.62. Rotation in 3D can be accomplished using three separate matrices, one for roll, pitch and yaw. In particular, if we define the coordinates in the standard x, y and z directions and agree on the right-hand-rule definition of counter-clockwise, then the 48 following three matrices rotate vectors counter-clockwise by θ radians around the x, y and z axes respectively: 1 0 0 cos θ 0 − sin θ x y Rθ = 0 cos θ − sin θ Rθ = 0 1 0 0 sin θ cos θ  sin θ 0 cos θ  cos θ − sin θ 0   z Rθ = sin θ cos θ 0  0 0 1 Combinations of these matrices will produce different effects. These matrices are useful in modeling aerodynamic motion in computers. Remark 3.63. Other transformations are possible, for example reflection can be accom- plished in a relatively straight-forward manner as well as projection onto a lower-dimensional space. All of these are used to manipulate polygons for the creation of high-end computer graphics including video game design and simulations used in high-fidelity flight simulators.
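The three planar transformations of Examples 3.57, 3.59 and 3.61 are easy to reproduce numerically; the sketch below applies each to the vector ⟨1, 1⟩ (variable names are ours).

```python
import numpy as np

# Scaling, shear and rotation from Examples 3.57, 3.59 and 3.61 applied to <1, 1>.
v = np.array([1.0, 1.0])

scale = 2.0 * np.eye(2)                          # scale by 2
shear = np.array([[1.0, 2.0], [0.0, 1.0]])       # shear A1 with s = 2
theta = np.pi / 12
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])   # rotate by pi/12

print(scale @ v)    # [2. 2.]
print(shear @ v)    # [3. 1.]
print(rotate @ v)   # [cos(pi/12) - sin(pi/12), cos(pi/12) + sin(pi/12)]
```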

7. An Application of Linear Algebra to Control Theory Differential equations are frequently used to model the continuous motion of an object in space. For example, the spring equation describes the behavior of a mass on a spring: (3.13) mx¨ + kx = 0 This is illustrated in Figure 3.3. We can translate this differential equation (Equation 3.13)

x

m

Figure 3.3. A mass moving on a spring is governed by Hooke’s law, translated into the language of Newtonian physics as mx¨ − kx = 0.

into a system of differential equations by setting v =x ˙. Thenv ˙ =x ¨ and we have: x˙ = v k v˙ = − x  m This can be written as a matrix differential equation:  x˙ 0 1 x (3.14) = v˙ − k 0 v    m    49 If we write x = hx, vi, then we can write: x˙ = Qx, where 0 1 Q = − k 0  m  Suppose we apply the Euler’s Method to solve this differential equation numerically, then we would have: x(t + ) − x(t) = Qx,  which leads to the difference equation: (3.15) x(t + ) = x(t) + Qx = Ix + Qx We can factor this equation so that: (3.16) x(t + ) = Ax(t)a where A = I + Q.

Example 3.64. Suppose m = k = 1 and  = 0.001 and let x0 = 0, v0 = 1. The resulting difference equations in long form are: x(t + 0.001) = x(t) + 0.001v(t) v(t + 0.001) = v(t) − 0.001x(t) We can compare the resulting difference equation solution curve (which is comparatively easy to compute) with the solution to the differential equation. This is shown in Figure 3.4. Notice with  small, for simple systems (like the one given) there is little error in approximating the continuous time system with a discrete time system. The error may grow in time, so care must be taken. 7.1. Controllability. Suppose at each time step, we may exert a control on the mass so that its motion is governed by the equation: x˙ = Qx + Bu 2,p Here B ∈ R where without loss of generality we may assume p ≤ 2 and u = hu1, . . . , upi is a vector of time-varying control signals. This problem is difficult to analyze in the continuous time case, but simpler in the discrete time case. (3.17) x(t + ) = Ax(t) + Bu(t) Suppose the goal is to drive x(t) to some value x∗. For example, suppose we wish to stop mass at x = 0, so then v = 0 as well. Definition 3.65 (Complete State Controllable). The state x is complete state control- lable in Equation 3.17 if given any initial condition x(0) = x0 and any desired final condition ∗ x there is a control function (policy) u(t) defined on a finite time interval [0, tf ] so that ∗ x(tf ) = x . Remark 3.66. Put simply, control means we can drive the system to any state we want and do it in finite time. 50 1.0
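The difference equation of Example 3.64 is straightforward to simulate. The sketch below (our variable names) iterates x(t + ε) = Ax(t) with A = I + εQ for one period and compares the result with the exact sinusoidal solution.

```python
import numpy as np

# Discrete-time spring of Example 3.64 (m = k = 1, eps = 0.001).  With
# x(0) = 0 and v(0) = 1 the exact solution is x(t) = sin(t), v(t) = cos(t),
# which we use to measure the discretization error.
eps = 0.001
Q = np.array([[0.0, 1.0],
              [-1.0, 0.0]])           # k/m = 1
A = np.eye(2) + eps * Q

state = np.array([0.0, 1.0])          # [x(0), v(0)]
steps = int(2 * np.pi / eps)          # roughly one full period
for _ in range(steps):
    state = A @ state

t = steps * eps
exact = np.array([np.sin(t), np.cos(t)])
print(state, exact, np.abs(state - exact).max())   # small but nonzero error
```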


Figure 3.4. A mass moving on a spring given a push on a frictionless surface will oscillate indefinitely, following a sinusoid.

Theorem 3.67. Suppose that x ∈ Rn, u ∈ Rp, A ∈ Rn×n and B ∈ Rn×p. Define the augmented matrix: (3.18) C = B AB ··· An−1B np and for a vector v ∈ R let:  fC(v) = Cv be the linear transformation from Rnp → Rn defined by matrix multiplication (as usual). Then the state dynamics in Equation 3.17 are completely controllable if (and only if):

(3.19) dim (Im(fC)) = n Or equivalently, rank(C) = n.

Proof. Suppose that x(0) = x0. Then:

(3.20) x() = Ax0 + Bu(0) By induction, k−1 k i (3.21) x(k) = A x0 + A Bu((k − 1 − i)) i=0 X In particular, when k = n, then: n n−1 n−2 n−3 (3.22) x(n) = A x0 + A Bu(0) + A Bu() + A u(2) + ··· + Bu((n − 1)) 51 Rearranging, we obtain: u((n − 1)) u((n − 2)) n n−1  .  (3.23) x(n) − A x0 = B AB ··· A B .  u()       u(0)    Here:   u((n − 1)) u((n − 2))  .  u = .  u()     u(0)      np ∗ n is a np × 1 vector of unknowns R . Setting x(n) = x ∈ R . There is at least one solution to this system of equations if and only if dim (Im(fC)) = n or alternatively rank(C) = n since:

(3.24) C ∼ In N by elementary row operations. Here ∼ denotes row equivalence. This completes the proof.  Example 3.68. Suppose we can adjust the velocity of the moving block on the spring so that: x(t + ) = x(t) + v(t) k v(t + ) = v(t) −  x(t) + u(t) m Written in matrix form this is: x(t + ) 1  x(t) 0 (3.25) = + u(t) v(t + ) − k 1 v(t) 1    m      Notice, B ∈ R1×2 and n = 2, thus: 1  0  AB = = − k 1 1 1  m      Consequently: 0  (3.26) C = 1 1   This is non-singular just in case  > 0, which is is necessarily. We can compute: 1  2 0 2 A2x = = 0 − k 1 1 1 − 2 k  m     m  We know that x∗ = h0, 0i. Therefore, we can compute: 0 2 −2 0  u(1) (3.27) x∗ − A2x = − = = = Cu 0 0 1 − 2 k 2 k − 1 1 1 u(0)    m   m      52 Computing C−1 and multiplying yields the solution: 1 +  k (3.28) u = m −2   Meaning: u(0) = −2 k u(1) = 1 +  m Using these values, we can compute: x(1) =  v(1) = −1 x(2) = 0 v(2) = 0 as required. 7.2. Observability. Suppose the system in question is now only observed through an imperfect measurement system so that we see: (3.29) y = Cx The observed output is y ∈ Rp, while the actual output is x ∈ Rn. We do not know x(0). Can we still control the system? Definition 3.69 (Observability). A system is observable if the state at time t can be computed even in the presence of imperfect observability; i.e., C 6= n. Assume u = 0. Then in discrete time: y(0) = Cx(0)(3.30) y() = CAx(0)(3.31) . .(3.32) . y((n − 1)) = Cx((n − 1)) = CA(n−1)x(0)(3.33) This can be written as the matrix equation: y(0) C y() CA (3.34)  .  =  .  x(0) . . y((n − 1)) CA(n−1)     Let:     C CA np×n (3.35) O =  .  ∈ R . CA(n−1)   This is the observability matrix containing np rows and n columns. Notice that Equation 3.34 has a solution if and only if:

(3.36) n = dim(fO) = rank(O) 53 To see this note, this simply says we can express the left hand side using all n columns of the right-hand-side. Thus we have the theorem:

Theorem 3.70. Suppose that x ∈ Rn, A ∈ Rn×n and C ∈ Rn×p. Define the augmented matrix: C CA (3.37) O =  .  . CA(n−1)     and for a vector v ∈ Rn let:

fO(v) = Ov be the linear transformation from Rn → Rnp defined by matrix multiplication (as usual). Then the state dynamics in Equation 3.17 are completely observable if (and only if):

(3.38) dim (Im(fO)) = n Or equivalently, rank(O) = n. Remark 3.71. Controllability and observability can be used together to derive a control when x0 is unknown. (1) Take n − 1 observations of the uncontrolled system. (2) Infer x0 (3) Compute x((n − 1)). (4) Translate time so that x0 = x((n − 1)). (5) Compute the control function as before. Example 3.72. Suppose that: 1 2 (3.39) C = 2 1   and consider our example. From before we have: 1  − k 1  m  Thus we compute: 1 − 2k  + 2 CA = m 2 − k 2 + 1  m  Consequently: 1 2 2 1 (3.40) O =  2k  1 − m  + 2  2 − k 2 + 1  m    54 Applying Gauss-Jordan elimination, we see that: 1 0 0 1 O ∼ 0 0 0 0   Thus it has rank 2. If we started with x0 = 0 and v0 = 1, then we would have: x() =  and v() = 1. Our observed values would be y(0) = h2, 1i and y() = h + 2, 2 + 1i. This leads to the equation: 2 1 2 1 2 1 x0 (3.41)   =  2k   + 2 1 − m  + 2 v0 k   2 + 1  2 − 2 + 1    m  In reality, it is sufficient  to solve the first two equations:

2 = x0 + 2v0

1 = 2x0 + v0 to obtain x0 = 0 and v0 = 1. Thus, control starts at time t =  at the point where x =  and v = 1.
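The rank conditions of Theorems 3.67 and 3.70 can be tested directly. The sketch below builds the controllability and observability matrices for the discrete-time spring system with the B and C matrices used in Examples 3.68 and 3.72 (the helper code and names are ours).

```python
import numpy as np

# Rank tests from Theorems 3.67 and 3.70 for the discrete-time spring system
# with m = k = 1 and eps = 0.001; B and C are taken from the examples
# (control enters the velocity equation, both states are observed through C).
eps, k, m = 0.001, 1.0, 1.0
A = np.array([[1.0, eps],
              [-eps * k / m, 1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 2.0], [2.0, 1.0]])
n = A.shape[0]

ctrl = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(n)])   # [B, AB]
obs = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(n)])    # [C; CA]

print(np.linalg.matrix_rank(ctrl) == n)   # True: completely controllable
print(np.linalg.matrix_rank(obs) == n)    # True: completely observable
```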

Exercise 54. Compute the control from the point x0 =  and v0 = 1. Remark 3.73. It turns out that continuous time systems work in precisely the same way; the proofs are just much harder and computing a control function is non-trivial; i.e., it is more complex than solving a system of linear equations.

55

CHAPTER 4

Determinants, Eigenvalues and Eigenvectors

1. Goals of the Chapter (1) Discuss Permutations (2) Introduce Determinants (3) Introduce Eigenvalues and Eigenvectors (4) Discuss Diagonalization (5) Discuss the Jordan Normal Form

2. Permutations1 Definition 4.1 (Permutation / Permutation Group). A permutation on a set V = {1, . . . , n} of n elements is a bijective mapping f from V to itself. A permutation group on a set V is a set of permutations with the binary operation of functional composition. Example 4.2. Consider the set V = {1, 2, 3, 4}. A permutation on this set that maps 1 to 2 and 2 to 3 and 3 to 1 can be written as: (1, 2, 3)(4) indicating the cyclic behavior that 1 → 2 → 3 → 1 and 4 is fixed. In general, we write (1, 2, 3) instead of (1, 2, 3)(4) and suppress any elements that do not move under the permutation. For the permutation taking 1 to 3 and 3 to 1 and 2 to 4 and 4 to 2 we write (1, 3)(2, 4) and say that this is the product of (1, 3) and (2, 4). When determining the impact of a permutation on a number, we read the permutation from right to left. Thus, if we want to determine the impact on 2, we read from right to left and see that 2 goes to 4. By contrast, if we had the permutation: (1, 3)(1, 2) then this permutation would take 2 to 1 first and then 1 to 3 thus 2 would be mapped to 3. The number 1 would be first mapped to 2 and then stop. The number 3 would be mapped to 1. Thus we can see that (1, 3)(1, 2) has the same action as the permutation (1, 2, 3). Definition 4.3 (Symmetric Group). Consider a set V with n elements in it. The permutation group Sn contains every possible permutation of the set with n elements.

Example 4.4. Consider the set V = {1, 2, 3}. The symmetric group on V is the set S3 and it contains the permutations: (1) The identity: (1)(2)(3) (2) (1,2)(3) (3) (1,3)(2) (4) (2,3)(1) (5) (1,2,3) (6) (1,3,2)

1This section is used purely to understand the general definition of the determinant of a matrix. 57 Proposition 4.5. For each n, |Sn| = n!. Exercise 55. Prove Proposition 4.5

Definition 4.6 (Transposition). A permutation of the form (a1, a2) is called a transpo- sition. Theorem 4.7. Every permutation can be expressed as the product of transpositions.

Proof. Consider the permutation (a1, a2, . . . , an). We may write:

(4.1) (a1, a2, . . . , an) = (a1, an)(a1, an−1) ··· (a1, a2)

Observe the effect of these two permutations on ai. For i 6= 1 and i 6= n, then reading from right to left (as the permutation is applied) we see that ai maps to a1, which reading further right to left is mapped to ai+1 as we expect. If i = 1, then a1 maps to a2 and there is no further mapping. Finally, if i = n, then we read left to right to the only transposition containing an and see that an maps to a1. Thus Equation 4.1 holds. This completes the proof.  Remark 4.8. The following theorem is useful for our work on matrices in the second part of this chapter, but its proof is outside the scope of these notes. The interested reader can see Chapter 2.2 of [Fra99]. Theorem 4.9. No permutation can be expressed as both a product of an even and an odd number of transpositions. 

Definition 4.10 (Even/Odd Permutation). Let σ ∈ Sn be a permutation. If σ can be expressed as an even number of transpositions, then it is even, otherwise σ is odd. The signature of the permutation is: −1 σ is odd (4.2) sgn(σ) = (1 σ is even 3. Determinant Definition 4.11 (Determinant). Let A ∈ Fn×n. The determinant of A is: n

(4.3) det(A) = sgn(σ) Aiσ(i) σ∈S i=1 Xn Y Here σ ∈ Sn represents a permutation over the set {1, . . . , n} and σ(i) represents the value to which i is mapped under σ. Example 4.12. Consider an arbitrary 2 × 2 matrix: a b A = c d   There are only two permutations in the set S2: the identity permutation (which is even) and the transposition (1, 2) which is odd. Thus, we have: a b det(A) = = A A − A A = ad − bc c d 11 22 12 21

This is the formula that one would expect from a course in matrices.

58 n×n Definition 4.13 (Triangular Matrix). A matrix A ∈ F is upper-triangular if Aij = 0 when i > j and lower-triangular if Aij = 0 when i < j. Lemma 4.14. The determinant of a triangular matrix is the product of the diagonal elements. Proof. Only the identity permutation can lead to a non-zero product in Equation 4.3. For suppose that σ(i) 6= i. Then for at least one i σ(i) > i and at least one i for which σ(i) < i and thus, n

Aiσ(i) = 0 i=1 Y  Example 4.15. It is easy to verify using Example 4.12 that the determinant of: 1 2 A = 0 3   is 1 · 3 = 3.

Proposition 4.16. The determinant of any identity matrix is 1.  Exercise 56. Prove the Proposition 4.16. Remark 4.17. Like many other definitions in mathematics, Definition 4.11 can be useful for proving things, but not very useful for computing determinants. Fortunately there is a recursive formula for computing the determinant, which we provide. Definition 4.18 ((i, j) Sub-Matrix). Consider the square matrix:

a11 ··· a1j ··· a1n a21 ··· a2j ··· a2n A =  . . . . .  ...... a ··· a ··· a   n1 nj nn   The (i, j) sub-matrix obtained from Row 1, Column j is derived by crossing out Row 1 and Column j as illustrated,

a11 ··· a1j ··· a1n a21 ··· a2j ··· a2n  . . . . .  ...... a ··· a ··· a   n1 nj nn   and forming a new matrix (n − 1) × (n − 1) with the remaining elements:

a21 ··· a2,j−1 a2,j+1 ··· a2n ...... A(1j) = ......   an1 ··· an,j−1 an,j+1 ··· ann   The sub-matrix A(i,j) is defined analogously. 59 Definition 4.19 (). Let A ∈ Fn×n. Then the (i, j) minor is:

(4.4) A(i,j) = det(A(i,j))

Definition 4.20 (Cofactor). Let A ∈ Fn×n as in Definition 4.18. Then the (i, j) co- factor is: i+j i+j (4.5) C(i,j) = (−1) A(i,j) = (−1) det(A(i,j)) Example 4.21. Consider the following matrix: 1 2 3 A = 4 5 6 7 8 9 We can compute the (1, 2) minor as: 1 2 3 4 6 4 5 6 = 4 · 9 − 6 · 7 = −6 7 9 7 8 9

In this case i = 1 and j = 2 so the co-factor is:

4 6 C = −1 · = (−1)(−6) = 6 (1,2) 7 9

Theorem 4.22 (Laplace Expansion Formula). Let A ∈ Fn×n as in Definition 4.19. Then: n

(4.6) det(A) = a1jC(1j) j=1 X Example 4.23. Before proving this result, we consider an example. We can use Laplace’s Formula to compute: 1 2 3 5 6 4 6 4 5 4 5 6 = 1 · (−1)1+1 · + 2 · (−1)1+2 · + 3 · (−1)1+3 · = 8 9 7 9 7 8 7 8 9

1 · ( −3) + 2 · (−1) · (−6) + 3 · (−3) = −3 + 12 − 9 = 0

Remark 4.24. Laplace’s Formula can be applied to other rows/columns to yield the same result. It is traditionally stated along the first row.

Proof of Laplace’s Formula. Let σ ∈ Sn map i to j. From Definition 4.11, the σ term of the determinant sum is:

sgn(σ)Aij Aiσ(i) = sgn(σ)aij Aiσ(i) i6=j i6=j Y Y where aij is the (i, j) element of A, usually denoted Aij. The elements of the product:

Aiσ(i) i6=j Y 60 are in the sub-matrix of A(i,j) (removing Row i and Column j from A). Let Sn(i, j) denote the sub-group of Sn consisting of permutations that always map i to j. This group can be put in bijection with Sn−1. Therefore the minor A(i,j) is:

(4.7) A(i,j) = sgn(σ) Aiσ(i) i6=j σ∈XSn(i,j) Y The bijection between Sn(i, j) and Sn−1 can be written explicitly in the following way. Sup- 0 0 pose that τ ∈ Sn−1 and σ ∈ Sn(i, j). Define: τ ∈ Sn so that τ (i) = τ(i) for i = 1, . . . , n − 1 0 0 (recall τ is defined on {1, . . . , n − 1}) and τ (n) = n. Thus, τ ∈ Sn. Then: (4.8) σ = (j, j + 1, . . . , n)τ 0(n, n − 1, . . . , i) in permutation notation. To see this, note that from right to left, i maps to n, which maps 0 to n under τ , which maps to j, as required. The remaining n − 1 elements of Sn are mapped to distinct elements depending entirely on τ. Note that the permutations (j, j + 1, . . . , n) and (n, n − 1, . . . , i) can both be written n − j and n − i transpositions respectively. Thus, sgn(σ) = (−1)n−i+n−jsgn(τ 0) = (−1)2n−i−jsgn(τ) = (−1)−i−jsgn(τ) = (−1)i+jsgn(τ) But then: det(A) = n

sgn(σ)a1j A1σ(i) = j=1 i6=j X σ∈XSn(1,j) Y n

a1j sgn(σ) A1σ(i) = j=1 i6=j X σ∈XSn(1,j) Y n n−1 1+j a1j (−1) sgn(τ) A(1,j)i,τ(i) = j=1 τ∈S i=1 X Xn−1 Y n n−1 1+j a1j(−1) sgn(τ) A(1,j)i,τ(i) = j=1 τ∈Sn−1 i=1 Xn X Y 1+j a1j(−1) det(A(1,j)) = j=1 Xn

a1jC(1,j) j=1 X This completes the proof.  Exercise 57. Use Laplace’s Formula to show that the determinant of the following matrix: a b c A = d e f g h i

  61 is

det(A) = aei + bfg + cdh − afh − bdi − ceg
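Theorem 4.22 also translates naturally into a recursive routine. The sketch below (function name ours, intended only for small matrices) expands along the first row and reproduces the zero determinant computed in Example 4.23.

```python
import numpy as np

def det_laplace(A):
    """Determinant by cofactor (Laplace) expansion along the first row, as in
    Theorem 4.22.  A recursive sketch for small matrices only."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)   # A_(1,j)
        # cofactor sign (-1)**(1 + (j+1)) = (-1)**j for 0-based j
        total += (-1) ** j * A[0, j] * det_laplace(minor)
    return total

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
print(det_laplace(A))        # 0, matching Example 4.23
```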

Exercise 58 (Project). A formula like Laplace’s formula can be used to invert a matrix. When used to solve a system of equations, this is called Cramer’s rule. Let A ∈ Fn×n be a matrix and let:

C(1,1) C(1,2) ··· C(1,n) . . . . C = . . .. .   C(n,1) C(n,2) ··· C(n,n)   Then: 1 A−1 = CT det(A)

Prove that this formula really does invert the matrix.

Example 4.25. We can use the rule given in Exercise 58 to invert the matrix:

10 12 14 A = 12 8 6 14 6 4    First we’ll use the rule in Exercise 57 to compute the determinant of the matrix as a whole:

det(A) = 10 · 8 · 4 + 12 · 6 · 14 + 14 · 12 · 6 − 14 · 8 · 14 − 12 · 12 · 4 − 10 · 6 · 6 = 10(8 · 4 − 6 · 6) + 14(12 · 6 + 12 · 6 − 8 · 14) − 12 · 12 · 4 = 20(4 · 4 − 6 · 3) + 28(6 · 6 + 6 · 6 − 8 · 7) − 144 · 4 = 20 · (−2) + 28(16) − 144 · (4) = 4 (5 · (−2) + 7 · 16 − 144) = 4 (−10 − 144 + 112) = 4 (−42) = −168

Now compute the matrix co-factors (from Laplace’s Rule): Cross Out First Row 8 6 C = −11+1 · = 32 − 36 = −4 (1,1) 6 4

12 6 C = −11+2 · = 48 − 84 = 36 (1,2) 14 4

12 8 C = −11+3 · = 72 − 112 = −40 (1,3) 14 6

62

Cross Out Second Row 12 14 C = −12+1 · = 36 (2,1) 6 4

10 14 C = −12+2 · = −156 (2,2) 14 4

10 12 C = −12+3 · = 108 (2,3) 14 6

Cross Out Third Row

12 14 C = −13+1 · = −40 (3,1) 8 6

10 14 C = −13+2 · = 108 (3,2) 12 6

10 12 C = −13+3 · = −64 (3,3) 12 8

The matrix of cofactors is:

C(1,1) C(1,2) C(1,3) C = C(2,1) C(2,2) C(2,3)   C(3,1) C(3,2) C(3,3) Cramer’s rule says:  −4 36 −40 1 −1 A−1 = CT = 36 −156 108 det(A) 168 −40 108 −64 Notice the symmetry in the solution, which you can exploit in your own solution, rather than working out every term.
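The arithmetic in Example 4.25 can be double-checked in floating point; this is only a spot-check of the numbers above.

```python
import numpy as np

# Check of Example 4.25: the determinant is -168 and the inverse equals
# C^T / det(A), where C is the matrix of cofactors computed above.
A = np.array([[10., 12., 14.],
              [12., 8., 6.],
              [14., 6., 4.]])
C = np.array([[-4., 36., -40.],
              [36., -156., 108.],
              [-40., 108., -64.]])

print(np.isclose(np.linalg.det(A), -168.0))            # True
print(np.allclose(np.linalg.inv(A), C.T / -168.0))     # True
```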

Exercise 59. Prove that if A ∈ Fn×n is symmetric and invertible, then A−1 is symmet- ric.

4. Properties of the Determinant Remark 4.26. Laplace Expansion can be done on any column or row, with minor com- puted appropriately. The choice of row 1 is for convenience only.

n× T Proposition 4.27. If A ∈ F , then det(A) = det(A ).  Exercise 60. Prove Corollary 4.27.

Proposition 4.28. Let A ∈ Fn×n. If A has two adjacent identical columns, then det(A) = 0. Proof. We proceed by induction on n. In the case when n = 2, the proof is by calcula- tion. Suppose the statement is true up to n − 1 and suppose that columns k and k + 1 are equal in A. Then in the Laplace expansion, the sub-matrix A(1,j) has two identical adjacent 63 columns just in case j 6= k and j 6= k + 1. Thus C(1,j) = 0. Thus the Laplace expansion reduces to the k and k + 1 terms: 1+k 1+k+1 det(A) = a1,k(−1) det(A(1,k)) + a1,k+1(−1) det(A(1,k+1)) but A(1,k) = A(1,k+1) and a1,k = a1,k+1 because column k is identical to column k + 1 but 1+k 1+k+1 −(−1) = (−1) . Thus, det(A) = 0 and the result follows by induction.  Proposition 4.29. Suppose that A ∈ Fn×n with: 0 A = a1 ··· aj + aj ··· an n×1 where each ai∈ F is a column vector. Let:

A1 = a1 ··· aj ··· an 0 A2 = a1 ··· aj ··· an Then det(A) = det(A1) + det(A2).  Proof. Rewrite the Laplace Expansion around the column j. Then we have: n n 0 (4.9) det(A) = AijC(ij) = (aij + aij)C(ij) = i=1 i=1 X X n n 0 aijC(ij) + aijC(ij) = det(A1) + det(A2). i=1 i=1 X X This completes the proof.  Remark 4.30. The same logic holds for rows. Proposition 4.31. If A ∈ Fn×n and matrix A0 ∈ Fn×n is constructed from A by multi- plying a column (or row) of A by α ∈ F, then: (4.10) det(A0) = αdet(A)

Exercise 61. Prove the Proposition 4.31. [Hint: Use the Laplace expansion with the altered column or row and factor.] Remark 4.32 (Multi-Linear Function). The two previous results taken together show that the determinant is what’s known as a multi-linear function when considered as: n n n det : F × F × · · · F → F n mapping the n vectors in n (the columns of the matrix) to . That is: | {z F } F 0 0 det(a1, . . . , αai + ai, ai+1,..., an) = αdet(a1,..., an) + det(a1,..., ai, ai+1,..., an) Proposition 4.33. If A ∈ Fn×n and matrix A0 ∈ Fn×n is constructed from A by ex- changing any pair of columns (row), then: (4.11) det(A0) = −1det(A) Proof. Exchanging two columns is equivalent to introducing a transpose into the per- mutation σ in Equation 4.3, which swaps the sign of all sgn(σ) computations.  64 Proposition 4.34. If A ∈ Fn×n and matrix A0 ∈ Fn×n is constructed from A by adding a multiple of column (row) i to any other column (row) j 6= i, then det(A0) = det(A). Proof. Without loss of generality, assume column i is added to column j with no multiple. Then A0 has form:

0 A = a1 ··· aj + ai ··· an Applying Propositions 4.29 and 4.28 we see that det(A0) = det(A) + 0, since the one of the two matrices in the multi-linear expansion of the determinant will have a repeated column. In the general case suppose we replace column j with column j plus α ∈ F times column i. Then first replace ai by αai; the determinant is multiplied by α. Applying the same logic as above. Then replace αai by ai again and the determinant is divided by α. This completes the proof.  Remark 4.35. Thus we have shown that elementary column/row operations can only modify the determinant of a matrix by either multiplying it by a constant or changing its sign. In particular, elementary row operations cannot make a matrix have non-zero determinant. Exercise 62. Use Laplace’s Formula to show that if any column or row of a matrix is all zeros, then the determinant must be zero. Exercise 63. Show that an n × n matrix has non-zero determinant if and only if it is column or row equivalent to the identity matrix. Thus prove the following corollary: Corollary 4.36. A square matrix A is invertible if and only if det(A) 6= 0. Remark 4.37. We state but do not prove the final theorem, which is useful in general and can also be used to prove the preceding corollary. The proof is in [Lan87], for the interested reader. It is a direct result of the multilinearity of the determinant function.

Theorem 4.38 (Matrix Product). Suppose A, B ∈ Fn×n. Then det(AB) = det(A)det(B).

 Exercise 64. Prove Corollary 4.36 using this theorem.
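These determinant properties are easy to sanity-check numerically. The following is a small sketch in Python with NumPy (NumPy is an assumption of these notes, which otherwise only mention Matlab and Mathematica); it spot-checks Theorem 4.38, Proposition 4.33 and Proposition 4.28 on a random 4 × 4 matrix.

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(4, 4)).astype(float)
B = rng.integers(-3, 4, size=(4, 4)).astype(float)

# Theorem 4.38: det(AB) = det(A) det(B)
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))

# Proposition 4.33: exchanging two rows flips the sign of the determinant
A_swap = A[[1, 0, 2, 3], :]
print(np.isclose(np.linalg.det(A_swap), -np.linalg.det(A)))

# Proposition 4.28: two identical adjacent columns force a zero determinant
C = A.copy()
C[:, 1] = C[:, 0]
print(np.isclose(np.linalg.det(C), 0.0))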

5. Eigenvalues and Eigenvectors

Definition 4.39 (Algebraic Closure). Let F be a field. The algebraic closure of F, denoted F̄, is an extension of F that (i) is also a field and (ii) contains every root of any polynomial with coefficients drawn from F.

Remark 4.40. A field F is called algebraically closed if F = F̄.

Remark 4.41. The following theorem is outside the scope of the course, but is useful for our work with eigenvalues.

Theorem 4.42. The algebraic closure of R is C. The field C is algebraically closed.

Definition 4.43 (Eigenvalue and (Right) Eigenvector). Let A ∈ F^{n×n}. An eigenvalue, eigenvector pair (λ, x) is a scalar and n × 1 vector such that:
(4.12) Ax = λx
and x ≠ 0. The eigenvalue may be drawn from F̄ and x from F̄^n.

Lemma 4.44. A value λ ∈ F̄ is an eigenvalue of A ∈ F^{n×n} if and only if λI_n − A is not invertible.

Proof. Suppose that λ is an eigenvalue of A. Then there is some x ∈ Fn such that Equation 4.12 holds. Then 0 = (λIn − A)x and x is in the kernel of the linear transform:

fλIn−A(x) = (λIn − A)x

The fact that x 6= 0 implies that fλIn−A is not one-to-one and thus λIn − A is not invertible.

Conversely suppose that λIn −A is not invertible. Then there is an x 6= 0 in Ker(fλIn−A). Thus:

(λIn − A)x = 0

which implies Equation 4.12. Thus λ is an eigenvalue of A.  Remark 4.45. A left eigenvector is defined analogously with xT A = λxT , when x is considered a column vector. We will deal exclusively with right eigenvectors and hence when we say “eigenvector” we mean a right eigenvector.

Definition 4.46 (Characteristic Polynomial). If A ∈ Fn×n then its characteristic poly- nomial is the degree n polynomial:

(4.13) det (λIn − A)

Theorem 4.47. A value λ is an eigenvalue for A ∈ Fn×n if and only if it satisfies the characteristic equation:

det (λIn − A) = 0 That is, λ is a root of the characteristic polynomial.

Proof. Assume λ is an eigenvalue of A. Then the matrix λI_n − A is singular and consequently the characteristic polynomial is zero by Corollary 4.36. Conversely, assume that λ is a root of the characteristic polynomial. Then λI_n − A is singular by Corollary 4.36 and thus, by Lemma 4.44, λ is an eigenvalue. □

Remark 4.48. We now see why λ may be in F̄, rather than F. It is possible the characteristic polynomial of a matrix does not have all (or any) of its roots in the field F; the definition of algebraic closure ensures that all eigenvalues are contained in the algebraic closure of F.

Corollary 4.49. If A ∈ F^{n×n}, then A and A^T share eigenvalues.

Exercise 65. Prove Corollary 4.49.

Example 4.50. Consider the matrix:
A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}
The characteristic polynomial is computed as:
det(λI_n − A) = \begin{vmatrix} λ − 1 & 0 \\ 0 & λ − 2 \end{vmatrix} = (λ − 1)(λ − 2) − 0

Thus the characteristic polynomial for this matrix is:

(4.14) λ^2 − 3λ + 2

The roots of this polynomial are λ_1 = 1 and λ_2 = 2. Using these eigenvalues, we can compute eigenvectors:
(4.15) x_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}
(4.16) x_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}
and observe that:
(4.17) Ax_1 = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} = λ_1 x_1
and
(4.18) Ax_2 = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = 2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} = λ_2 x_2
as required. Computation of eigenvalues and eigenvectors is usually accomplished by computer, and several algorithms have been developed. Interested readers should consult (e.g.) [Dat95].

Remark 4.51. You can use your calculator to return the eigenvalues and eigenvectors of a matrix, as can several software packages, like Matlab and Mathematica.

Remark 4.52. It is important to remember that eigenvectors are unique only up to scale. That is, if A is a square matrix and (λ, x) is an eigenvalue-eigenvector pair for A, then so is (λ, αx) for α ≠ 0. This is because:
(4.19) Ax = λx =⇒ A(αx) = λ(αx)

Definition 4.53 (Algebraic Multiplicity of an Eigenvalue). An eigenvalue has algebraic multiplicity greater than 1 if it is a multiple root of the characteristic polynomial. The multiplicity of the root is the multiplicity of the eigenvalue.

Example 4.54. Consider the identify matrix I2. It has characteristic polynomial (λ − 1)2, which has one multiple root 1 of multiplicity 2. However, this matrix does have two eigenvectors [1 0]T and [0 1]T .

Exercise 66. Show that every vector in F^2 is an eigenvector of I_2.

Example 4.55. Consider the matrix
A = \begin{pmatrix} 1 & 5 \\ 2 & 4 \end{pmatrix}
The characteristic polynomial is computed as:
\begin{vmatrix} λ − 1 & −5 \\ −2 & λ − 4 \end{vmatrix} = (λ − 1)(λ − 4) − 10 = λ^2 − 5λ − 6
Thus there are two distinct eigenvalues: λ = −1 and λ = 6, the two roots of the characteristic polynomial. We can compute the two eigenvectors in turn. Consider λ = −1. We solve:
\begin{pmatrix} λ − 1 & −5 \\ −2 & λ − 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} −2 & −5 \\ −2 & −5 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
Thus:
−2x_1 − 5x_2 = 0
We can set x_2 = t, a free variable. Consequently the solution is:
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} −(5/2)t \\ t \end{pmatrix} = t \begin{pmatrix} −5/2 \\ 1 \end{pmatrix}
Thus, any eigenvector of λ = −1 is a multiple of the vector ⟨−5/2, 1⟩. For the eigenvalue λ = 6, we have:
\begin{pmatrix} λ − 1 & −5 \\ −2 & λ − 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 5 & −5 \\ −2 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
From this we see that:
−2x_1 + 2x_2 = 0
or x_1 = x_2. Thus setting x_2 = t, we have the solution:
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} t \\ t \end{pmatrix} = t \begin{pmatrix} 1 \\ 1 \end{pmatrix}
Thus, any eigenvector of λ = 6 is a multiple of the vector ⟨1, 1⟩.
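As Remark 4.51 notes, such computations are normally done by software. A minimal check of Example 4.55 in Python/NumPy (an assumed tool); the eigenvectors returned by the solver are scaled to unit length, so they agree with those above only up to the scaling discussed in Remark 4.52.

import numpy as np

A = np.array([[1.0, 5.0],
              [2.0, 4.0]])
vals, vecs = np.linalg.eig(A)
print(vals)                       # the roots 6 and -1 (in some order)

# each column of vecs is a unit-length eigenvector; verify A v = lambda v
for lam, v in zip(vals, vecs.T):
    print(np.allclose(A @ v, lam * v))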

Theorem 4.56. Suppose that A ∈ F^{n×n} has eigenvalues λ_1, ..., λ_n, all distinct (i.e., λ_i ≠ λ_j if i ≠ j). Then the corresponding eigenvectors {v_1, ..., v_n} are linearly independent.

Proof. We proceed by induction on m to show that {v1, . . . , vm} is linearly independent for m = 1, . . . , n. In the base case, the set {v1} is clearly linearly independent. Now suppose this is true up to m < n, we’ll show it is true for m + 1. Suppose:

(4.20) α1v1 + ··· + αm+1vm+1 = 0

Multiplying Equation 4.20 by λ_{m+1} implies that:
(4.21) α_1 λ_{m+1} v_1 + ··· + α_{m+1} λ_{m+1} v_{m+1} = 0
On the other hand, multiplying Equation 4.20 on the left by A (and applying the fact that Av_i = λ_i v_i) yields:
(4.22) α_1 λ_1 v_1 + ··· + α_{m+1} λ_{m+1} v_{m+1} = 0
Subtracting Expression 4.21 from Expression 4.22 yields:
Σ_{i=1}^{m} α_i (λ_i − λ_{m+1}) v_i = 0
Since λ_i ≠ λ_{m+1} and, by induction, {v_1, ..., v_m} are linearly independent, it follows that α_1 = ··· = α_m = 0. But then Expression 4.20 reduces to:
α_{m+1} v_{m+1} = 0
Thus α_{m+1} = 0 and the result follows by induction. □

Definition 4.57. Let A ∈ F^{n×n} with eigenvectors {v_1, ..., v_n}. Then the vector space E = span({v_1, ..., v_n}) is called the eigenspace of A. When the eigenvectors are linearly independent, they form an eigenbasis for the space they span.

Remark 4.58. It is worth noting that if v_i is an eigenvector of A, then span(v_i) is called the eigenspace associated with v_i.

Corollary 4.59. The eigenvectors of A ∈ F^{n×n} form a basis for F^n when the eigenvalues of A are distinct.

Exercise 67. Prove Corollary 4.59.

6. Diagonalization and Jordan's Decomposition Theorem

Remark 4.60. In this section, we state but do not prove a number of results on eigenvalues and eigenvectors, with a specialization to matrices with real entries. Many of these results extend with minor modifications to complex matrices. Readers who are interested in the proofs can (and should) see, e.g., [Lan87].

Definition 4.61 (Diagonalization). Let A be an n × n matrix with entries from the field R. The matrix A can be diagonalized if there exists an n × n diagonal matrix D and an invertible n × n matrix P so that:
(4.23) P^{-1}AP = D
In this case, P^{-1}AP is the diagonalization of A.

Remark 4.62. Clearly if A is diagonalizable, then:
(4.24) A = PDP^{-1}

Theorem 4.63. A matrix A ∈ Rn×n is diagonalizable, if and only if the matrix has n linearly independent eigenvectors.

Proof. Suppose that A has a set of linearly independent eigenvectors p_1, ..., p_n. Then for each p_i (i = 1, ..., n) there is an eigenvalue λ_i so that Ap_i = λ_i p_i. Let P ∈ F^{n×n} have columns p_1, ..., p_n. Then we can see that:
AP = \begin{pmatrix} λ_1 p_1 | \cdots | λ_n p_n \end{pmatrix} = \begin{pmatrix} p_1 | \cdots | p_n \end{pmatrix} \begin{pmatrix} λ_1 & 0 & \cdots & 0 \\ 0 & λ_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & λ_n \end{pmatrix} = PD,
where:
(4.25) D = \begin{pmatrix} λ_1 & 0 & \cdots & 0 \\ 0 & λ_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & λ_n \end{pmatrix}
Since p_1, ..., p_n are linearly independent, it follows that P is invertible and thus A = PDP^{-1}. Conversely, suppose that A is diagonalizable, so that A = PDP^{-1} with D as in Equation 4.25. Then AP = PD and, reversing the reasoning above, each column of P must be an eigenvector of A with corresponding eigenvalue on the diagonal of D. □

Example 4.64. Consider the following matrix:
(4.26) A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
To diagonalize A, we compute its eigenvalues and eigenvectors, yielding:

λ_1 = i
λ_2 = −i
for the eigenvalues and:
v_1 = \begin{pmatrix} i \\ 1 \end{pmatrix}   v_2 = \begin{pmatrix} −i \\ 1 \end{pmatrix}
where i = \sqrt{−1} is the imaginary number. Notice this shows that the eigenvalues and eigenvectors have entries drawn from R̄ = C. We can now compute P and D as:
D = \begin{pmatrix} −i & 0 \\ 0 & i \end{pmatrix}   P = \begin{pmatrix} −i & i \\ 1 & 1 \end{pmatrix}
It is helpful to note that:
P^{−1} = \begin{pmatrix} i/2 & 1/2 \\ −i/2 & 1/2 \end{pmatrix}
Arithmetic manipulation shows us that:
PD = \begin{pmatrix} −1 & −1 \\ −i & i \end{pmatrix}
Thus:
PDP^{−1} = \begin{pmatrix} −1 & −1 \\ −i & i \end{pmatrix} \begin{pmatrix} i/2 & 1/2 \\ −i/2 & 1/2 \end{pmatrix} = \begin{pmatrix} 0 & −1 \\ 1 & 0 \end{pmatrix} = A
as required. (Remember that i^2 = −1.)

Remark 4.65. Matrix diagonalization is a basis transform in disguise. Suppose A ∈ R^{n×n} is diagonalizable. Let v_1, ..., v_n be the eigenvectors of A and consider the linear transform f_A(x) = Ax. If x is expressed in the eigenbasis E = {v_1, ..., v_n}, then:

(4.27) f_A(x) = Ax = A(Σ_{i=1}^{n} α_i v_i) = Σ_{i=1}^{n} α_i A v_i = Σ_{i=1}^{n} α_i λ_i v_i
Thus, computing the linear transformation f_A is especially easy in the basis E. Each coordinate ⟨α_1, ..., α_n⟩ is simply multiplied by the corresponding eigenvalue. That is:
(4.28) f_A(⟨α_1, ..., α_n⟩) = ⟨λ_1 α_1, ..., λ_n α_n⟩
Now consider the diagonalization A = PDP^{−1}. In E, [v_i]_E = e_i; that is, each eigenvector has as its coordinates the corresponding standard basis vector. By construction:
(4.29) Pe_i = v_i
for all i = 1, ..., n. This means that P is a basis transform matrix from E to the standard basis B. That is: P = A_{EB}. As a consequence:
A_{BE} = A_{EB}^{−1} = P^{−1}
Thus if [x]_B ∈ R^n is expressed in the standard basis, we have:
f_A(x) = A[x]_B = PDP^{−1}[x]_B = PD[x]_E = P(Σ_{i=1}^{n} λ_i [x]_{E_i}) = [Σ_{i=1}^{n} λ_i [x]_{E_i}]_B
Notice the sum Σ_{i=1}^{n} λ_i [x]_{E_i} is just Equation 4.28. Thus, the diagonalization is a recipe for first transforming a vector in the standard basis into the eigenbasis, then applying the linear transform f_A, and then transforming back to the standard basis.

Exercise 68. Use the remark above to prove Theorem 4.63. [Hint: The proof of one direction is essentially given. Think about the opposite direction.]

Definition 4.66 (Nilpotent Matrix). A matrix N is nilpotent if there is some integer k > 0 so that N^k = 0.

Remark 4.67. We generalize the notion of diagonalization in a concept called the Jordan Normal Form. The proof of the Jordan Normal Form theorem is outside the scope of the class, but it can be summarized in the following theorem.

Theorem 4.68. Let A be a square matrix with complex entries (i.e., A ∈ C^{n×n}). Then there exist matrices P, Λ and N so that: (1) Λ is a diagonal matrix with the eigenvalues of A appearing on the diagonal; (2) N is a nilpotent matrix; (3) P is a matrix whose columns are composed of pseudo-eigenvectors; and (4):
(4.30) A = P(Λ + N)P^{−1}
When A is diagonalizable, then N = 0 and P is a matrix whose columns are composed of eigenvectors.
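The diagonalization in Example 4.64 can be reproduced numerically. A sketch in Python/NumPy (an assumption); the matrix P returned by the solver may differ from the one above by column order and scaling, but P D P^{-1} still recovers A.

import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
vals, P = np.linalg.eig(A)             # complex eigenvalues i and -i
D = np.diag(vals)

# A is diagonalizable because its eigenvectors are linearly independent (Theorem 4.63)
print(np.allclose(P @ D @ np.linalg.inv(P), A))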

71

CHAPTER 5

Orthogonality

1. Goals of the Chapter (1) Introduce general inner products. (2) Define Orthogonality. (3) Introduce the Gram-Schmidt Orthogonalization Procedure. (4) Demonstrate the QR decomposition. (5) Discuss orthogonal projection and orthogonal complement. (6) Prove equality of row and column rank. (7) Prove the Spectral Theorem for Real Symmetric Matrices and related results. 2. Some Essential Properties of Complex Numbers Remark 5.1. In this chapter and the ensuring chapters, we restrict our the base field F to be either R or C. We review a few critical facts about complex numbers, which heretofore have been used in a very cursory way. Definition 5.2 (Complex Conjugate). Let z = a + ib be a complex number. The conjugate of z, denoted z = a − ib. Proposition 5.3. If w, z ∈ C, then w + z = w + z. √ Definition 5.4. The magnitude of a complex number z = a + ib is |z| = a2 + b2. Remark 5.5. Note the magnitude of a complex number is both real and non-negative. Proposition 5.6. If z ∈ C, then |z|2 = zz. Exercise 69. Prove Propositions 5.3 and 5.6. 3. Inner Products Definition 5.7 (Inner Product). An inner product on a vector space V over F is a mapping: h·, ·i : V × V → F such that: (1) Conjugate Symmetry: hx, yi = hy, xi (2) Linearity in Argument 1: hαx1 + x2, yi = αhx1, yi + hx2, yi. (3) Positive definiteness: hx, xi ≥ 0 and hx, xi = 0 if and only if x = 0. Remark 5.8 (Notional Point). There is now a possibility for confusion. This set of notes uses the notation a = ha1, . . . , ani for the vector:

a1 . a = .   an   73 However, now ha1, a2i might be the inner product of the vectors a1 and a2. It is up to the reader to know the difference. However to help remember vectors and matrices are always bold, while numbers and scalars are always not bold. Remark 5.9. Inner products are sometimes called scalar products. Example 5.10. The standard dot product defined in Chapter 2 meets the criteria of an inner product over Rn. Here hx, yi = x · y. n Exercise 70. Let z, w ∈ C . If x = hz1, . . . , zni and w = hw1, . . . , wni, where zi, wi ∈ C, note the standard dot product: n

z · w = ziwi i=1 X does not satisfy the conjugate symmetry rule. Find a variation for the dot product that works for complex vector spaces. [Hint: If z = a + bi and w = c + di then:

zw = (a + ib)(c − di) = (ac + bd) + i(bc − ad) = (ac + bd) − i(ad − bc) = (ac + bd) + i(ad − bc) = (a − ib)(c + id) = zw = wz Notice the last equality holds because C is a field and multiplication commutes. Use this fact along with Proposition 5.3.] Definition 5.11 (). Any inner product can be used to induce a norm (length) on the vector space in the following way: (5.1) kxk = hx, xi This in turn canp be used to induce a distance (metric) on the space in which the distance between two vectors x and y is simply kx − yk.

Exercise 71. Using the standard dot product on R^n, verify that Equation 5.1 returns the n-dimensional analog of the Pythagorean theorem.

Definition 5.12 (Unit Vector). A vector v in a vector space V with inner product ⟨·, ·⟩ is a unit vector if the induced norm of v is 1. That is, ‖v‖ = 1.

Proposition 5.13. If v is a vector in V with an inner product and hence a norm, then u = v/‖v‖ is a unit vector.

Exercise 72. Prove Proposition 5.13.

Example 5.14. The inner product need not be a simple function. Consider P2[x], the space of polynomials of degree at most 2 with real coefficients over the field R. We are free to define an inner product on P2[x] in the following way. Let f, g ∈ P2[x]: 1 hf, gi = f(x)g(x) dx Z−1 We could vary the integral bounds, if we like. 74 Definition 5.15 (Bilinear). An inner product h·, ·i on a vector space V is bilinear if:

⟨x, αy_1 + y_2⟩ = α⟨x, y_1⟩ + ⟨x, y_2⟩

Exercise 73. Show that the standard dot product is bilinear on R^n.

Definition 5.16. A matrix A ∈ R^{n×n} is positive definite if for all x ∈ R^n we have x^T Ax ≥ 0 and x^T Ax = 0 if and only if x = 0.

Theorem 5.17. Suppose ⟨·, ·⟩ is a bilinear inner product on a real vector space V (i.e., F = R) with basis B. Then there is a positive definite matrix A_B^{⟨·,·⟩} with the property that:
⟨x, y⟩ = [x]_B^T A_B^{⟨·,·⟩} [y]_B

Recall [x]B is the coordinate representation of x in B. Proof. The proof of this theorem is similar to the proof used to show how to build a matrix for an arbitrary linear transform. Let B = {v1,..., vn}. Write: n

x = αivi i=1 Xn

y = βivi i=1 X Then: n n n n n n

hx, yi = αivi, βjvj = αi vi, βjvj = αiβjhvi, vji * i=1 i=j + i=1 * i=j + i=1 j=1 X X X X X X Define:

aij = hvi, vji = hvj, vii = aji, by the symmetry of the inner product. Then: n n

hx, yi = αiβjaij i=1 j=1 X X In matrix notation, this can be written as:

a11 a12 ··· a1n β1 n n a21 a22 ··· a2n β2 αiβjaij = α1 α2 ··· αn  . . . .   .  ...... i=1 j=1 . . . . X X   a a ··· a  β   n1 n2 nn  n     Since [x]B = hα1, . . . , αni and [y]B = hβ, . . . , βi, it follows at once that if we define:

a11 a12 ··· a1n h·,·i a21 a22 ··· a2n AB =  . . . .  . . .. . a a ··· a   n1 n2 nn then:   T h·,·i hx, yi = [x]B AB [y]B 75 h·,·i The fact that AB is positive definite follows at once from the positive definiteness of the inner product.  Remark 5.18. We can relax the positive definiteness assumption, while retaining the bilinearity condition to obtain a generalized linear product. The previous theorem still holds in this case, without the positive definiteness result.

4. Orthogonality and the Gram-Schmidt Procedure Definition 5.19 (Orthogonal Vectors). In a vector space V over field F equipped with inner product h·, ·i, two vectors x and y are called orthogonal if hx, yi = 0. Example 5.20. Under the ordinary dot product, the standard basis of Rn is always composed of orthogonal vectors. For instance, in R2, we have: h1, 0i · h0, 1i = 0.

Definition 5.21 (Orthonormal Basis). A basis B = {v1,..., vn} is called orthogonal if for all i 6= j vi is orthogonal to vj and orthonormal if the basis is orthogonal and every vector in B is a unit vector. Example 5.22. The standard basis of Rn is orthonormal in the standard dot product, but that is not the only orthonormal basis as we’ll see.

Remark 5.23. Any orthogonal basis B = {v1,..., vn} can be converted to an orthonor- 0 mal basis by letting ui = vi/ kvik. The new basis B = {u1,..., un} is orthonormal. Theorem 5.24. Let V be a non-trivial n-dimensional vector space over field F with inner product h·, ·i. Then V contains an orthogonal basis.

Proof. We’ll proceed by induction. Choose a non-zero vector v1 ∈ V and let W1 = span(v1). That is:

W1 = {v ∈ V : v = αv1, α ∈ F}

A basis for W1 is u1 = v1. We can extend W1 to an m-dimensional subspace Wm of V using Corollary 1.64. Now assume that there is an orthogonal basis {u1,..., um} for Wm. Extend Wm to Wm+1 with vector vm+1. We will show how to build an orthonormal basis for Wm+1. Define:

u_{m+1} = v_{m+1} − (⟨u_1, v_{m+1}⟩/⟨u_1, u_1⟩) u_1 − ··· − (⟨u_m, v_{m+1}⟩/⟨u_m, u_m⟩) u_m

It now suffices to prove that {u1,..., um+1} is an orthogonal basis for Wm+1. To see this note by its construction um+1 cannot be a linear combination of {u1,..., um} since vm+1 is not a linear combination of {u1,..., um}. Therefore, the fact that Wm+1 is m + 1 dimensional implies that {u1,..., um+1} must be a basis for it. It remains to prove orthogonality. For any i = 1, . . . , m, compute:

⟨u_i, u_{m+1}⟩ = ⟨u_i, v_{m+1} − Σ_{j=1}^{m} (⟨u_j, v_{m+1}⟩/⟨u_j, u_j⟩) u_j⟩ = ⟨u_i, v_{m+1}⟩ − (⟨u_i, v_{m+1}⟩/⟨u_i, u_i⟩)⟨u_i, u_i⟩ − Σ_{j≠i} (⟨u_j, v_{m+1}⟩/⟨u_j, u_j⟩)⟨u_i, u_j⟩

hui, um+1i = hui, vm+1i − hui, vm+1i = 0, as required. The result follows by induction.  Corollary 5.25. Let V be a non-trivial n-dimensional vector space over field F with inner product h·, ·i. Then V contains an orthonormal basis. Proof. Normalize the basis identified in the theorem.  Remark 5.26 (Gram-Schmidt Procedure). The process identified in the induction proof of Theorem 5.24 can be distilled into the Gram-Schmidt Procedure for finding an orthogonal basis. We summarize the procedure as follows:

Gram-Schmidt Procedure Input: {v1,..., vn} a basis for V, a vector space with an inner product h·, ·i.

(1) Define:

u_1 = v_1
(2) For each i ∈ {2, ..., n} define:
u_i = v_i − Σ_{j=1}^{i−1} (⟨u_j, v_i⟩/⟨u_j, u_j⟩) u_j
Output: An orthogonal basis B = {u_1, ..., u_n} for V.
Algorithm 2. Gram-Schmidt Procedure
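Algorithm 2 translates directly into code. Below is a short sketch in Python/NumPy (an assumption), written for the standard dot product; a different inner product could be passed through the inner argument, which is a name introduced here purely for illustration. It reproduces the computation in the example that follows (Example 5.27).

import numpy as np

def gram_schmidt(vectors, inner=np.dot):
    # Orthogonalize a list of linearly independent vectors (Algorithm 2)
    basis = []
    for v in vectors:
        u = np.array(v, dtype=float)
        for b in basis:
            u = u - (inner(b, v) / inner(b, b)) * b   # subtract the projection of v onto each u_j
        basis.append(u)
    return basis

# starting from <1, 2> and <2, 1>: returns u1 = <1, 2>, u2 = <6/5, -3/5>
print(gram_schmidt([[1, 2], [2, 1]]))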

Example 5.27. We can use the Gram-Schmidt procedure to find an orthogonal basis for R^2, assuming we start with the vectors ⟨1, 2⟩ and ⟨2, 1⟩ and use the standard dot product. Let v_1 = ⟨1, 2⟩ and v_2 = ⟨2, 1⟩. Step 1: Define:
u_1 = v_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}
Step 2: Use the dot product to compute:

hu1, v2i = h1, 2i · h2, 1i = 4

⟨u_1, u_1⟩ = ⟨1, 2⟩ · ⟨1, 2⟩ = 5
Step 3: Compute:
u_2 = v_2 − (⟨u_1, v_2⟩/⟨u_1, u_1⟩) u_1 = \begin{pmatrix} 2 \\ 1 \end{pmatrix} − (4/5) \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 6/5 \\ −3/5 \end{pmatrix}

Exercise 74. Check that the basis found in the previous example is orthogonal.

Example 5.28. Consider P2[x], the space of polynomials of degree at most 2 with real coefficients over the field R. Let f, g ∈ P2[x] and define: 1 hf, gi = f(x)g(x) dx Z−1 77 We can use this information to find an orthogonal basis for P2[x]. Recall the standard basis 2 for P2[x] is {x , x, 1}. This is not orthogonal, since: 1 x3 1 1 −1 2 hx2, 1i = x2 dx = = − = 6= 0 3 3 3 3 Z−1 −1

On the other hand, we know that:

1 x2 1 1 1 hx, 1i = x dx = = − = 0 2 2 2 Z−1 −1

and 1 1 x4 1 1 hx2, xi = x3 dx = = − = 0 4 4 4 Z−1 −1

We will need the following piece of information:

1 x5 1 2 hx2, x2i = x4 dx = = 5 5 Z−1 −1 2 Applying the Gram-Schmidt procedure with v1 = x , v2 = x and v3 = 1 we have: 2 u1 = v1 = x

hu1, v2i 2 u2 = v2 − u1 = x − 0 · x = x hu1, u1i 2 hu1, v3i hu2, v3i 3 2 5 2 u3 = v3 − u1 − u2 = 1 − 2 x = 1 − x hu1, u1i hu2, u2i 5 3

Exercise 75. Find an orthogonal basis for P2[x] assuming: 1 hf, gi = f(x)g(x) dx Z0 5. QR Decomposition Remark 5.29. We will use the following lemma as we apply the Gram-Schmidt procedure to derive a new Lemma 5.30. Let P ∈ Rn×n be composed of columns of orthonormal vectors. Then T P P = In. Remark 5.31. A matrix like the kind in Lemma 5.30 is called an orthogonal matrix. These matrices are generalized by unitary matrices over the complex numbers. Exercise 76. Prove Lemma 5.30. [Hint: Use the dot product defintion of matrix mul- tiplication and the fact that the columns are orthonormal.] Definition 5.32 (QR decomposition). If A ∈ Rm×n with m ≥ n, then the QR- decomposition consists of an orthogonal matrix Q ∈ Rm×n and a upper triangular matrix R ∈ Rn×n so that A = QR. Remark 5.33. Let A ∈ Rm×n with m ≥ n. Algorithm3 illustrates how to compute the QR decomposition. 78 Gram-Schmidt Procedure m×n Input: A ∈ R with m ≥ n.

(1) Use Gram-Schmidt to orthogonalize the columns of A. (2) Construct an orthonormal basis for the column space of A (i.e., normalize the orthogonal basis you found). T (3) Denote Q as the matrix with these orthonormal columns. Notice Q Q = In. (4) If A = QR, then: QT A = QT QR = R. Output: The QR decomposition of A. Algorithm 3. QR decomposition

Exercise 77. Prove that R must be upper-triangular. [Hint: The columns of Q are constructed from the columns of A by the Gram-Schmidt procedure. Use this fact to obtain the structure of R = QT A.] Example 5.34. We can find the QR decomposition of: 1 0 A = 2 2 0 1 Applying the Gram-Schmidt procedure yields the orthogonal basis:

u1 = h1, 2, 0i 4 2 u2 = − 5 , 5 , 1 Normalizing those vectors yields:

1 2 q1 = √ , √ , 0 5 5  √ q = − √4 , √2 , 5 2 3 5 3 5 3 Then: D E √1 − √4 5 3 5 Q = √2 √2  5 3√ 5  0 5  3  Now compute:  √ 5 √4 R = QT A = 5 0 √3 " 5 # Remark 5.35. Since R is upper triangular and you can verify that QR = A. Efficient QR decomposition can be used to solve systems of equations. If we wish to solve Ax = b and we know A = QR. Then: Ax = QRx = b =⇒ Rx = QT b

Now the problem Rx = Q^T b can be back-solved efficiently; i.e., since R is upper-triangular, it is easy to solve for the last variable, then use that to solve for the next-to-last variable, and so on.
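The back-substitution step can be sketched concretely in Python/NumPy (an assumption). Using the matrix A from Example 5.34 and a right-hand side b chosen purely for illustration, we factor A = QR and then solve Rx = Q^T b from the bottom up.

import numpy as np

A = np.array([[1.0, 0.0],
              [2.0, 2.0],
              [0.0, 1.0]])
b = np.array([1.0, 4.0, 1.0])        # illustrative right-hand side (equals A @ [1, 1])

Q, R = np.linalg.qr(A)               # reduced QR: Q is 3x2 with orthonormal columns, R is 2x2
y = Q.T @ b

# back-substitution on the upper-triangular system R x = y
n = R.shape[0]
x = np.zeros(n)
for i in range(n - 1, -1, -1):
    x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
print(x)                             # [1. 1.]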

6. Orthogonal Projection and Orthogonal Complements

Definition 5.36. The operation:
proj_v(u) = (⟨u, v⟩/⟨v, v⟩) v
is the orthogonal projection of u onto v. We illustrate this in Figure 5.1.


Figure 5.1. The orthogonal projection of the vector u onto the vector v.

Remark 5.37. This makes the most geometric sense when the scalar product is the dot product and the space is Rn. To see this, we can show that if x, y ∈ Rn, then x · y = kxk kyk cos(θ), there cos(θ) is the cosine of the angle between x and y in the common plane they share. We can prove this result, if we assume the law of cosines. Lemma 5.38 (Law of Cosines). Consider a triangle with side lengths a, b, and c and let θ be the angle between the sides of length a and b. Then: c2 = a2 + b2 − 2ab cos(θ)  Theorem 5.39. If x, y ∈ Rn and θ is the angle between x and y in the common plane they share, then x · y = kxk kyk cos(θ)

Proof. The geometry is shown in Figure 5.2 for two vectors in R3. R3. Note: n 2 2 kxk = xi , i=1 X where x = hx1, . . . , xni. A similar result holds for y. On the other hand: n n n n n 2 2 2 2 2 2 kx − yk = (xi − yi) = xi − 2xiyi + yi = xi − 2 xiyi + yi i=1 i=1 i=1 i=1 i=1 X X  X X X Simplifying we have: kx − yk2 = kxk2 − 2x · y + kyk2 80 3 Figure 5.2. The common plane shared by two vectors in R is illustrated along with the triangle they create.

Using the law of cosines we see: kx − yk = kxk2 + kyk2 − 2 kxk kyk cos(θ) Therefore: kxk2 − 2x · y + kyk2 = kxk2 + kyk2 − 2 kxk kyk cos(θ) Simplifying this expression yields the familiar fact that: x · y = kxk kyk cos(θ)

Remark 5.40. We can now make sense of the term orthogonal projection, at least in R^n. Consider Figure 5.3. We know that v/‖v‖ is a unit vector. The length of the constructed hypotenuse in Figure 5.3 is ‖u‖. By trigonometry, the length of the base of the triangle is:
‖u‖ cos(θ) = (u · v)/‖v‖

Figure 5.3. The orthogonal projection of the vector u onto the vector v.

But proj_v(u) is a vector that points in the direction of v with length the result of projecting u "down" onto v (as illustrated). Therefore:
proj_v(u) = ((u · v)/‖v‖) · (v/‖v‖) = ((u · v)/‖v‖^2) v = ((u · v)/(v · v)) v
The general orthogonal decomposition formula with inner products is simply an extension of this formula.

Theorem 5.41. Let V be a vector space with inner product ⟨·, ·⟩ and let W be a subspace with B = {v_1, ..., v_m} an orthonormal basis for W. If v is a vector in V, then:
proj_W(v) = Σ_{i=1}^{m} ⟨v, v_i⟩ v_i
is the orthogonal projection of v onto the subspace W.

Proof. That projW (v) is an element of W is clear by its construction; it is a linear combination of the basis elements. It now suffices to show that the vector u = v − projW (v) is orthogonal to projW (v). To see this, compute:

hu, vji = hv − projW (v), vji = hv, vji − hprojW (v), vji = m m

hv, vji − hv, viivi, vj = hv, vji − hv, viihvi, vji * i=1 + i=1 X X But hvi, vji = 0 if i 6= j by basis orthogonality. So we have: 2 hu, vji = hv, vji − hv, vjihvj, vji = hv, vji − hv, vji kvjk = 0, since every basis vector is a unit vector. Therefore, u is orthogonal to every element in the basis B and consequently it must be orthogonal to projW (v).  Exercise 78. Assuming the inner product: 1 hf, gi = f(x)g(x) dx Z0 2 for P2[x], compute the orthogonal projection of x onto the subspace spanned by the vector x. Compare this to the Gram-Schmidt procedure. 7. Orthogonal Complement Definition 5.42. Let W be subspace of a vector space V with inner product h·, ·i. The orthogonal complement of W is the set: W⊥ = {v ∈ V : hv, wi = 0 for all w ∈ W} Example 5.43. We can illustrate a space and its orthogonal complement in low dimen- sions. Let v be an arbitrary vector in R3. Let W = span(v). Then W⊥ consists of the plane to which v is normal. This is illustrated in Figure 5.4. Proposition 5.44. If W is a subspace of a vector space V with inner product h·, ·i, then W⊥ is a subspace. Corollary 5.45. W ∩ W⊥ = {0}. 82 Figure 5.4. A vector v generates the linear subspace W = span(v). It’s orthogonal ⊥ 3 complement W is shown when v ∈ R .
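A small numerical check of Theorem 5.41 in Python/NumPy (an assumption), with an orthonormal basis of a two-dimensional subspace of R^3 chosen purely for illustration. The residual v − proj_W(v) is orthogonal to both basis vectors, so it lies in W^⊥ in the sense of Definition 5.42.

import numpy as np

w1 = np.array([1.0, 0.0, 0.0])
w2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)    # together with w1: an orthonormal basis of W

v = np.array([3.0, 2.0, -1.0])
proj = np.dot(v, w1) * w1 + np.dot(v, w2) * w2   # proj_W(v) as in Theorem 5.41
u = v - proj                                     # the component of v orthogonal to W

print(np.isclose(np.dot(u, w1), 0.0), np.isclose(np.dot(u, w2), 0.0))   # True True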

Exercise 79. Prove Proposition 5.44 and its corollary. Theorem 5.46. If W is a subspace of V a vector space, then any vector v ∈ V can be uniquely written as v = w + u where w ∈ W and u ∈ W⊥. Consequently V = W ⊕ W⊥. Proof. It suffices to show that any vector v ∈ V can be written as the sum of a vector in w ∈ W and a vector u ∈ W⊥. The result will then follow from Theorem 2.62 and Corollary 5.45. Let B = {w1,..., wm} be an orthonormal basis for W. We know one must exist. Using B, let: m

w = projW (v) = αiwi, i=1 X where the αi are constructed by the inner product as in Theorem 5.41. The vector w is entirely in the subspace W. The vector: u = v − w is orthogonal to w as proved in Theorem 5.41. Therefore, it must lie in the space W⊥. It is clear that: v = w + u. We may now apply Theorem 2.62 and Corollary 5.45 to see V = W ⊕ W⊥. This completes the proof.  Corollary 5.47. If W is a subspace of V a (finite dimensional) vector space V, then dim(V) = dim(W) + dim(W⊥).

Proof. This is an application of Theorem 2.65.  Corollary 5.48. If W is a subspace of V a (finite dimensional) vector space V, then (W⊥)⊥ = W. 83 Exercise 80. Prove Corollary 5.48. Remark 5.49. Consider Rn and Rm with the standard inner product (e.g., the dot product for Rn). Suppose that A ∈ Rm×n and consider the system of equations: Ax = 0 n m Any solution x to this is in the null space of the matrix. More specifically, if fA : R → R is defined as usual as fA(x) = Ax then we know the null space is the kernel of fA. Let the rows of A be a1,..., am. If x ∈ Ker(fA) then:

hai, xi = 0, because that is how we defined matrix multiplication in Definition 2.4. Then this implies that each row of A is orthogonal to any element in the kernel of fA (the null space of A). Remark 5.50. For notational simplicity, for the remainder of this chapter, let Ker(A) denote the kernel of fA, the corresponding linear transform; i.e., Ker(A) is the null space of the matrix. Also let Im(A) denote the corresponding image of fA. m×n Theorem 5.51. Suppose that A ∈ R . If A has rows a1,..., am, then ⊥ span({a1,..., am}) = Ker(A) .

Proof. The discussion in Remark 5.49 is sufficient to show that span({a1,..., am}) ⊇ ⊥ Ker(A) . To prove opposite containment, suppose that y ∈ span({a1,..., am}). Then:

y = β1a1 + ··· + βmam

for some β1, . . . , βm ∈ R. If x ∈ Ker(A), then by (bi)linearity:

⟨y, x⟩ = β_1⟨a_1, x⟩ + ··· + β_m⟨a_m, x⟩ = 0,
since ⟨a_i, x⟩ = 0 for i = 1, ..., m because Ax = 0. Thus span({a_1, ..., a_m}) ⊆ Ker(A)^⊥. This completes the proof. □

Corollary 5.53. The (column) rank of a matrix A ∈ Rm×n is equal to its row rank. Exercise 81. Use the rank-nullity theorem along with Theorem 5.51 to prove Corollary 5.53. [Hint: The dimension of Ker(A) and its orthogonal complement must add up to n.]
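Corollary 5.53 is easy to spot-check numerically; a sketch in Python/NumPy (an assumption) builds a matrix of rank 3 and compares the rank of A with that of A^T.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))   # a 5x7 matrix of rank (at most) 3
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A.T))     # both ranks agree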

8. Spectral Theorem for Real Symmetric Matrices Remark 5.54. The goal of this section is to prove the Spectral Theorem for Real Sym- metric Matrices.

Theorem 5.55 (Spectral Theorem for Real Symmetric Matrices). Suppose A ∈ Rn×n is a real, symmetric matrix. Then A is diagonalizable. Furthermore, if the diagonalization of A is: A = PDP−1 84 then P−1 = PT . Thus: A = PDPT As in Theorem 4.68, the columns of P are eigenvectors of A. Furthermore, they form an orthonormal basis for Rn. Remark 5.56. We will build the result in a series of lemmas and definitions. We require one theorem that is outside the scope of this class. We present that first.

Theorem 5.57 (Fundamental Theorem of Algebra). : Suppose that an 6= 0 and p(x) = n n−1 anx + an−1x + ··· + a0 is a polynomial with coefficients in C. Then p(x) has exactly n (possibly non-distinct) roots in C. Lemma 5.58. If A ∈ Rn×n is symmetric, it has at least one eigenvector and eigenvalue. Proof. This is a result of the fundamental theorem of algebra. Every matrix has a characteristic polynomial and that polynomial has at least one solution. (Note: You can also prove this with an optimization argument and Weierstraß Extreme Value Theorem.)  Lemma 5.59. If A ∈ Rn×n is symmetric, then all its eigenvalues are real. Proof. Suppose that λ is a complex eigenvalue with a complex eigenvector z. Then Az = λz. If λ = a + bi, then λ¯ = a − bi. Note λλ¯ = a2 + b2 ∈ R. Furthermore, if λ and µ are two complex numbers, it’s easy to show that λµ = λ¯µ¯. Since A is real (and symmetric) we can conclude that: Az = Az¯ = λ¯z¯ Now (Az¯)T = z¯T A = λ¯z¯T . We have: Az = λz =⇒ z¯T Az = λz¯T z z¯T A = λ¯z¯T =⇒ z¯T Az = λ¯z¯T z But then: λz¯T z = λ¯z¯T z ¯ and this implies λ = λ, which means λ must be real.  Corollary 5.60. If A ∈ Rn×n is symmetric and z ∈ Cn is a complex eigenvector so that z = x + iy, then either y = 0 or both x and y are real eigenvectors. Consequently, A has real eigenvectors. Proof. We can write: Az = Az + iAy = λx + iλy = λz Since λ is real, it follows that either y = 0 and z = x is real or y is a second real eigenvector of A with eigenvalue λ (i.e., λ may have geometric multiplicity at least 2). In either case, A has only real eigenvectors and all complex eigenvectors can be composed of them.  Definition 5.61. Let A ∈ Rn×n and suppose W is a subspace of Rn with the property that if v ∈ W, then Av ∈ W. Then W is called A-invariant. Lemma 5.62. If A ∈ Rn×n is symmetric and W is an A-invariant subspace of Rn, then so is W⊥. 85 Proof. Suppose that v ∈ W⊥. For every w ∈ W, w · v = vT w = 0. Let w = Au for some u ∈ W. Then: uT Av = vT AT u = vT Au = vT w = 0, by the symmetry of A. Therefore, Av is orthogonal to any arbitrary vector u ∈ W and so ⊥ ⊥ Av ∈ W and W is A-invariant.  Lemma 5.63. If A ∈ Rn×n is symmetric and W is an A-invariant subspace of Rn, then W has an eigenvector of A. Proof. Using the Gram-Schmidt procedure, we know that W has an orthonormal basis P = {p1,..., pm}. Let B be the matrix composed of these vectors. In particular for each pi:

(5.2) Api = β1p1 + ··· βmpm for some scalars β1, . . . , βm since Api ∈ W because W is A-invariant. In particular, if P is the matrix whose columns are the basis elements in P, then by Equation 5.2, there is some 1 matrix B (composed of scalars like β1, . . . , βm) so that : AP = PB Taking the transpose and multiplying by P yields: PT AP = BT AP = BT PT P T But we know that P is a matrix composed of orthonormal vectors. So P P = Im. Therefore: PT AP = BT On the other hand taking the transpose again (and exploiting the symmetry of A) means: PT AP = B = BT Therefore B is symmetric. As such, B has a real eigenvector/eigenvalue pair (λ, u). Then: Bu = λu. Then PBu = λPu is necessarily in W because P is a matrix whose columns form a basis for W and Pu is just a linear combination of those columns. Let v = Pu. Then: Av = APu = PBu = λPu = λv Therefore v = Pu is an eigenvector of A in W with eigenvalue λ. This proves the claim.  Remark 5.64. We now can complete the proof of the spectral theorem for real symmetric matrices.

Proof. (Spectral Theorem for Real Symmetric Matrices) Suppose A ∈ R^{n×n} is a real symmetric matrix. Then it has at least one real eigenvalue/eigenvector pair (λ_1, v_1). Let W_1 = span(v_1). If W_1 = R^n, we are done. Otherwise, assume we have constructed m orthonormal eigenvectors v_1, ..., v_m. Let W = span(v_1, ..., v_m). This space is necessarily A-invariant, since every vector in it is a linear combination of eigenvectors and A maps each eigenvector to a multiple of itself. We have proved that W^⊥ is then A-invariant, and so it must contain an eigenvector v_{m+1} of A that is orthogonal to v_1, ..., v_m. Normalizing it, we have constructed a larger set of orthonormal eigenvectors v_1, ..., v_m, v_{m+1}. The result now follows by induction: the eigenvectors of A form an orthogonal basis of R^n and this basis can be normalized to an orthonormal basis. □
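The Spectral Theorem can be checked on any real symmetric matrix. A sketch in Python/NumPy (an assumption) uses np.linalg.eigh, which returns real eigenvalues and orthonormal eigenvector columns for symmetric input.

import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2.0                            # a real symmetric matrix

vals, P = np.linalg.eigh(A)
print(np.allclose(P.T @ P, np.eye(4)))         # P^T P = I, so P^{-1} = P^T
print(np.allclose(P @ np.diag(vals) @ P.T, A)) # A = P D P^T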

9. Some Results on AT A Proposition 5.65. Let A ∈ Rm×n. Then: Ker(A) = Ker(AT A) Proof. Let x ∈ Ker(A), then Ax = 0 and consequently AT Ax = 0. Therefore x ∈ Ker(AT A). Conversely, let x ∈ Ker(AT A). We know that: Ker(B) = Im(BT )⊥ for any matrix B. Let y = Ax. Then y ∈ Im(A). We know AT y = 0. Therefore: y ∈ Ker(AT ) = Im(A)⊥ Thus y ∈ Im(A) ∩ Im(A)⊥, which implies y = 0. Thus, x ∈ Ker(A). This completes the proof.  Exercise 82. Use the rank-nullity theorem to prove that rank(A) = rank(AT A). Remark 5.66. The following results will be useful when we study the Singular Value Decomposition in the next chapter.

Lemma 5.67. If A ∈ Rm×n, then AT A ∈ Rn×n is a symmetric matrix. Exercise 83. Prove Proposition 5.67.

Theorem 5.68. Let A ∈ Rm×n. Every eigenvalue of AT A is non-negative. Proof. By Lemma 5.67, AT A is a real symmetric n × n matrix and therefore its eigen- vectors form an orthonormal basis of Rn. Suppose (λ, v) is an eigenvalue/eigenvector pair of AT A. Without loss of generality, assume kvk = 1. Then: kAvk2 = hAv, Avi Here the inner product is the regular dot product and so we may write: hAv, Avi = (Av)T Av = vT AT Av Notice: AT Av = λv Therefore: kAvk2 = vT λv = λhv, vi = λ kvk2 = λ

Therefore, λ ≥ 0 since ‖Av‖^2 ≥ 0. □

Theorem 5.69. Let A ∈ R^{m×n} and suppose that A^T A has eigenvectors v_1, ..., v_n with corresponding eigenvalues λ_1 ≥ ··· ≥ λ_n. If λ_1, ..., λ_r > 0, then rank(A) = r and Av_1, ..., Av_r form an orthogonal basis for Im(A).

Proof. Choose two eigenvectors v_i and v_j with i ≠ j and i, j ≤ r. Then:
⟨Av_i, Av_j⟩ = v_i^T A^T A v_j = v_i^T λ_j v_j = λ_j ⟨v_i, v_j⟩ = 0,
because v_i and v_j are orthogonal by assumption. Therefore Av_1, ..., Av_r are orthogonal and, being non-zero (‖Av_i‖^2 = λ_i > 0), linearly independent.

Now suppose that y = Ax for some x ∈ R^n. The vectors v_1, ..., v_n form a basis for R^n and hence there are α_1, ..., α_n such that:
x = Σ_{i=1}^{n} α_i v_i
Compute:
y = A(Σ_{i=1}^{n} α_i v_i) = Σ_{i=1}^{n} α_i A v_i = Σ_{i=1}^{r} α_i A v_i,
because λ_{r+1}, ..., λ_n = 0 and so ‖Av_i‖^2 = λ_i = 0 for i > r (see the proof of Theorem 5.68). Thus Av_1, ..., Av_r must form a basis for Im(A) and consequently rank(A) = r. This completes the proof. □

Remark 5.70. The following example illustrates these results.

Example 5.71. The matrix:
A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}
has rank 2. The matrix:
A^T A = \begin{pmatrix} 17 & 22 & 27 \\ 22 & 29 & 36 \\ 27 & 36 & 45 \end{pmatrix}
also has rank 2. This must be the case by Exercise 82, since rank(A) = rank(A^T A).
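A quick numerical confirmation of these facts in Python/NumPy (an assumption), using the matrix of Example 5.71.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
G = A.T @ A

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(G))   # 2 2  (Exercise 82)
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))              # True: eigenvalues of A^T A are non-negative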

88 CHAPTER 6

Principal Components Analysis and Singular Value Decomposition1

1. Goals of the Chapter
(1) Discuss covariance matrices.
(2) Show the meaning of the eigenvalues/eigenvectors of the covariance matrix.
(3) Introduce Principal Components Analysis (PCA).
(4) Illustrate the link between PCA and regression.
(5) Introduce Singular Value Decomposition (SVD).
(6) Illustrate how SVD can be used in image analysis.

Remark 6.1. Techniques that rely on matrix theory underlie a substantial part of modern data analysis. In this section we discuss a technique for performing dimensional reduction; i.e., taking data with many dimensions and simplifying it so that it has fewer dimensions. This approach can frequently help scientists and engineers understand the critical factors governing a system.

2. Some Elementary Statistics with Matrices

Definition 6.2 (Mean Vector). Let x1,..., xn be n observations of a vector valued 1×m process (e.g., samples from several sensors) so that for each i, xi ∈ R that is xi is an m dimensional row vector. Then the mean (vector) of the observations is the vector: 1 n (6.1) µ = x n i i=1 X Where µ ∈ R1×m. Definition 6.3 (Covariance Matrix). Let X ∈ Rn×m be the matrix formed from the th n×m observations x1,..., xn where the i row of X is xi. Let M ∈ R be the matrix of n rows each equal to µ. The covariance matrix of X is: 1 (6.2) C = (X − M)T (X − M) . n − 1 The matrix C ∈ Rm,m. Lemma 6.4. The mean of X − M is 0. Exercise 84. Prove Lemma 6.4

1This chapter assumes some familiarity with concepts from statistics. 89 Example 6.5. Suppose: 1 2 X = 3 4 5 6 Then:   µ = 3 4 which is just the vector of column means and 3 4 M = 3 4 3 4 Consequently:  −2 −2 X − M = 0 0  2 2  and the covariance matrix is: −2 −2 1 −2 0 2 4 4 C = 0 0 = 2 −2 0 2   4 4   2 2     Remark 6.6. The covariance matrix will not always consist of a single number. However, the following theorem is true, which we will not prove.

Exercise 85. Show that element Cii (i = 1, . . . , m) is a (maximum likelihood) estimators for the variance of the data from column i (taken from Sensor i). Lemma 6.7. The covariance matrix C is a real-valued, symmetric matrix. Remark 6.8. The covariance matrix essentially expresses the way the dimensions (sen- sor) are co-variate (i.e., vary with) each other. Higher covariance means there is a greater correlation between the numbers from one sensor and the numbers from a different sensor. As we will see, the relationships between the dimensions (and their strengths) can be captured by the eigenvalues and eigenvectors of the matrix C. Remark 6.9. The eigenvectors and their corresponding eigenvalues provide information about the linear relationships that can be found within the covariance matrix and as a result within the original data itself. To see this, it is helpful to think of a matrix A ∈ Rm×m as a mathematical object that transforms any vector in x ∈ Rm×1 into a new vector in Rm×1 by multiplying Ax. This operation, rotates and stretches x in m-dimensional space. However, when x is an eigenvector, there is no rotation, only stretching. In a very powerful sense, eigenvectors already point in a preferred direction of the transformation. The magnitude of the eigenvalue corresponding to that eigenvector provides information about the power of that direction. The m eigenvectors, provide a new coordinate system that is, in a certain sense, proper for the matrix. 90 Example 6.10. Consider the covariance matrix: 4 4 C = 4 4   Given a vector x = hx1, x2i, we have: 4 4 x 4x + 4x Cx = 1 = 1 2 4 4 x2 4x1 + 4x2       So this transformation pushes vectors in the direction of the vector h1, 1i (because the first and second elements of Cx are identical). Clearly h1, 1i is an eigenvector of C with eigenvalue 8. This matrix has a second (non-obvious) eigenvector h−1, 1i with eigenvalue 0. Thus all the power in C lies in the direction of h1, 1i. Incidentally, h−1, 1i is the second eigenvalue precisely because it is an orthogonal vector to h1, 1i. Thus the two eigenvectors form an orthogonal pair of vectors. If we scale them so they each have length 1, we obtain the orthonormal pair of eigenvectors: 1 1 1 1 w1 = √ , √ w2 = −√ , √ 2 2 2 2     Let us now look at a plot of the data in X from Example 6.5 that was used to compute C (see Figure 6.1). The data is shown plotted with the line point in the direction of h1, 1i (and

Figure 6.1. An extremely simple data set that lies along a line y − 4 = x − 3, in the direction of h1, 1i containing point (3, 4).

its negation) and containing point (3, 4), the column means. Thus, the data lies precisely along the direction of most power for the covariance matrix.
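The covariance matrix and its dominant eigenvector for this toy data set can be reproduced in a few lines of Python/NumPy (an assumption).

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
mu = X.mean(axis=0)                   # column means <3, 4>
Y = X - mu
C = Y.T @ Y / (X.shape[0] - 1)        # covariance matrix [[4, 4], [4, 4]]

vals, W = np.linalg.eigh(C)
print(vals)                           # eigenvalues 0 and 8
print(W[:, np.argmax(vals)])          # the dominant direction, a multiple of <1, 1>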

3. Projection and Dimensional Reduction

Proposition 6.11. Let Y = X − M be modified data matrix with mean 0. Let yi = Yi. th T be the i row of modified data. Then yi can be expressed as a vector with basis w1,..., wm, the eigenvectors of C making up the columns of W by solving: T (6.3) Wz = yi so that: T T (6.4) z = W yi 91 T Proof. If we express yi in the basis {w1,..., wm} we write: T (6.5) yi = z1w1 + z2w2 + ··· + zmwm m×1 By the Principal Axis Theorem, wi ∈ R for i = 1, . . . m. Furthermore, w1,..., wm form m T m×1 a basis for |R . Since yi ∈ R it follows that z1, . . . , zm must be real. Equation 6.5 can be re-written in matrix form as: T yi = Wz We know W−1 = WT . Thus: T T z = W yi  Example 6.12. Recall from our example we had: −2 −2 Y = X − M = 0 0  2 2  Our eigenvectors for C are h1, 1i and h−1, 1i. We can expression the rows of Y in terms of these eigenvectors. Notice: √ 1 1 −1 1 h−2, −2i = −2 2 · √ , √ + 0 · √ , √ 2 2 2 2     1 1 −1 1 h0, 0i = 0 · √ , √ + 0 · √ , √ 2 2 2 2     √ 1 1 −1 1 h2, 2i = 2 2 · √ , √ + 0 · √ , √ 2 2 2 2   √  √ Thus, the transformed points are (−2 2, 0), (0, 0), and (2 2, 0) and now we can truly see the 1-dimensionality of the data (see Figure 6.2). Notice that we do not really need the

Figure 6.2. The one dimensional nature of the data is clearly illustrated in this plot of the transformed data z.

second eigenvector at all (it has eigenvalue 0).√ We could√ simplify the problem by using only the coefficients of the first eigenvector: {−2 2, 0, 2 2}. This projects the transformed data zi onto the x-axis. 92 This projection process can be accomplished by using only the first column of W in Equation 6.4 (the first row of WT ). Note: √ √ √ √ 1/ 2 −1/ 2 1/ 2 1/ 2 W = √ √ WT = √ √ 1/ 2 1/ 2 −1/ 2 1/ 2     The first row of WT is: T √ √ W1 = 1/ 2 1/ 2 Thus we could write:  √ √ −2 √ √ √ (6.6) z0 = WT y = 1/ 2 1/ 2 · = −4/ 2 = −2 2/ 2 = −2 2 1 1 1 −2  √   as expected. The remaining values 0 and 2 2 can be constructed with the other two rows of Y. T Notice, the 1 in y1 refers to the first data point. The 1 in W1 refers to the fact we used T only 1 of the 2 eigenvectors. If we kept k out of m, we would right Wk . √ √ We can transform back to the original data by taking the 1-dimensional data {−2 2, 0, 2 2} T T and multiplying it by W1 (this undoes multiplying by W1 because W W = I2). To see this note: √ 1/ 2 √ −2 yT = W z0 = √ · 2 2 = 1 1 1 1/ 2 −2       We can get back to X by adding M again. Remark 6.13. In the simple example, we could reconstruct the original data exactly from the 1-dimensional data because the original data itself sat on a line (a 1-dimensional thing). In more complex cases, this will not be the case. However, we can use exactly this approach to reduce the dimension and complexity of a given data set. Doing this is called principal components analysis (PCA). Steps for executing this process is provided in Algorithm4.
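The whole procedure (summarized in Algorithm 4 below) is compact enough to sketch in Python/NumPy (an assumption). The function name pca_back_projection is introduced here purely for illustration; on the toy data of Example 6.5, which lies exactly on a line, keeping k = 1 component reconstructs X exactly.

import numpy as np

def pca_back_projection(X, k):
    # Algorithm 4: project rows of X onto the top-k principal directions, then map back
    mu = X.mean(axis=0)
    Y = X - mu
    C = Y.T @ Y / (X.shape[0] - 1)
    vals, W = np.linalg.eigh(C)                    # eigenvalues in ascending order
    Wk = W[:, np.argsort(vals)[::-1][:k]]          # columns: the k leading eigenvectors
    return (Wk @ Wk.T @ Y.T).T + mu                # Y_k^T = W_k W_k^T Y^T, then add the mean back

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(pca_back_projection(X, 1))                   # recovers X exactly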

Exercise 86. Show that WT YT will result in a matrix that contains n column vectors that are z1,..., zn. Thus the expression: 0 T T T Y = WkWk Y is correct.  4. An Extended Example Example 6.14. We give a more complex example for a randomly generated sample of data in 2-dimensional space. We will project the data onto a 1-dimensional sub-space (line). The data are generated using a two-dimensional Gaussian distribution with mean h2, 2i and covariance matrix: 0.2 −0.7 (6.7) Σ = −0.7 4   The data are shown below for completeness: 93 Principal Components Analysis n×m Input: X ∈ R a data matrix where each column is a different dimension (sensor) and each row is a sample (replicants). There are m dimensions (sensors) and n samples.

1×m (1) Compute the column mean vector µ ∈ R . n×m (2) Compute the mean matrix M ∈ R where each row of M is a copy of µ. (3) Compute Y = X − M. This matrix has column mean 0. 1 T (4) Compute the covariance matrix C = (n−1) Y Y (5) Compute (λ1, w1),..., (λm, wm) the eigenvalue/eigenvector pairs of C in descending order of eigenvalue; i.e., λ1 ≥ λ2 ≥ · · · ≥ λm. Let W be the matrix whose columns are the eigenvectors in order w1, w2,..., wm. (6) If a k < m dimensional representation (projection) of the data is desired, compute m×k Wk ∈ R consisting of the first k columns of W. T T T (7) Compute Yk = WkWk Y . This operation will compute transformations and projects for all rows of Y simultaneously.  (8) Compute Xk = Yk + M. m Output: Xk the reduced data set that lies entirely in a k-dimensional hyer-plane of R . Algorithm 4. Principal Components Analysis with Back Projection

{{1.1913, 4.05873}, {1.53076, 5.27513}, {1.85309, 3.23638}, {1.99963, 1.10533}, {1.79767, 3.86304}, {2.25872, 2.10381}, {1.50469, 4.12347}, {2.13699, 1.84288}, {1.20712, 5.59894}, {1.43594, 2.86787}, {1.50379, 4.92747}, {1.50437, 3.88367}, {2.22937, -0.0357163}, {1.60838, 4.80397}, {2.52479, 0.287635}, {1.84461, 3.33785}, {2.75705, -1.566}, {1.58677, 3.91416}, {2.10225, 0.689372}, {1.99164, 2.03114}, {2.3703, 1.48555}, {2.25813, -0.2236}, {2.76285, 0.0886777}, {2.16664, 2.36102}, {1.87554, -0.133408}, {2.52679, -0.492959}, {2.27623, -0.130207}, {2.7388, -1.36069}, {2.15687, 1.29411}, {1.90101, 0.671318}, {2.02191, 2.60927}, {1.46282, 1.63502}, {2.13333, 0.958677}, {1.86464, 3.07403}, {1.84389, 3.45468}, {1.60883, 3.33228}, {2.51706, 1.44357}, {1.05347, 7.59858}, {1.898, 1.00438}, {1.50151, 3.41193}, {2.05665, 2.41876}, {1.79544, 1.48661}, {2.23181, 1.63454}, {1.2492, 3.67311}, {1.82897, 1.41699}, {1.72701, 4.46551}, {1.64191, 6.38833}, {2.47254, -0.427444}, {2.15246, 4.79382}, {2.16991, 1.48283}, {2.2715, 2.54674}, {2.08859, 2.58774}, {1.98126, 1.38378}, {1.69199, 2.68088}, {1.25897, 5.48203}, {1.69802, 2.20615}, {2.4989, -2.0593}, {2.11843, 0.643992}, {1.96406, -0.664882}, {2.16071, 1.09063}, {1.83942, 3.84346}, {1.35287, 5.54837}, {1.32731, 3.55062}, {2.08264, 2.49115}, {2.12898, 0.818264}, {1.61345, 3.2065}, {2.11461, 2.96489}, {2.15123, 2.82889}, {2.07051, 1.76971}, {1.77957, 1.3183}, {2.2917, 1.90551}, {1.75408, 4.31078}, {2.25497, 0.88574}, {1.91065, 2.12505}, {2.11302, -0.318024}, {0.974176, 4.73707}, {1.84714, 1.75565}, {1.73322, 2.78468}, 94 {2.40627, 0.140563}, {2.17967, 1.2649}, {1.43098, 3.16606}, {1.91726, 1.74352}, {2.31406, 0.9825}, {1.693, 1.69997}, {2.09722, 2.70155}, {2.31961, 0.120007}, {1.89179, -0.463541}, {1.35839, 3.59431}, {2.29766, 0.141463}, {1.504, 2.51007}, {1.65115, 3.86479}, {1.4336, 6.26426}, {1.488, 4.14824}, {1.66953, 3.85173}, {2.82247, 1.30438}, {1.66312, 2.1701}, {1.8374, 2.87124}, {2.39617, -2.14277}, {2.2254, 0.222176}, {2.73372, 0.194578}} A scatter plot of the data with a contour plot of the probability density function for the multivariable Gaussian distribution is shown in Figure 6.3.

Figure 6.3. A scatter plot of data drawn from a multivariable Gaussian distribu- tion. The distribution density function contour plot is superimposed. The mean vector can be computed for this data set as: µˆ = h1.93236, 2.18539i Note it is close to the true mean of the distribution function used to generate the data, which was h2, 2i. We compute Y from X and µˆ and construct the covariance matrix: 1 0.161395 −0.592299 C = YT Y = 99 −0.592299 3.69488   The eigenvalues for this matrix are: λ1 = 3.79152 and λ2 = 0.0647539. Notice λ1 is much larger than λ2 because the data is largely stretched out along a line with negative slope. The corresponding eigenvectors are: −0.161033 −0.986949 w = w = 1 0.986949 2 −0.161033     This yields: −0.161033 −0.986949 W = 0.986949 −0.161033   95 We can compute the transformed data Z = WT YT and we see that it has been adjusted so that the data X is now uncorrelated and centered at (0, 0). This is shown in Figure 6.4. Since there is so much more power in the first eigenvector, we can reduce the dimension of

Figure 6.4. Computing Z = WT YT creates a new uncorrelated data set that is centered at 0. the data set without losing a substantial amount of information. This will project the data onto a line. We use: −0.161033 W = 1 0.986949   Then compute: T T X1 = W1W1 Y + M The result is shown in Figure 6.5.

Figure 6.5. The data is shown projected onto a linear subspace (line). This is the best projection from 2 dimensions to 1 dimension under a certain measure of best.

96 Remark 6.15. It is worth noting that this projection in many dimensions (k > 1) creates data that exists on a hyperplane, the multi-dimensional analogy to a line. Furthermore, there is a specific measurement in which this projection is the best projection. The discussion of this is outside the scope of these notes, but it is worth noting that this projection is not arbitrary. (See Theorem 6.30 for a formal analogous statement.)

T T T Remark 6.16. It is also worth noting that the projection Xk = WkWk Y is some- times called the Kosambi-Karhunen-Lo`eve transform.  5. Singular Value Decomposition Remark 6.17. Let A ∈ Rm×n. Recall from Theorem 5.68, every eigenvalue of AT A is non-negative. We use this theorem and its proof in the following discussion.

m×n Definition 6.18 (Singular Value). Let A ∈ R . Let λ1 ≥ λ2 ≥ · · ·√ ≥ λn be the T √eigenvalues of A A. If λ1, . . . , λr > 0, then the singular values of A are σ1 = λ1, . . . , σr = λr. m×n T Lemma 6.19. Let A ∈ R and v1,..., vn be the orthonormal eigenvectors of A A with corresponding eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn. If λ1, . . . , λr > 0, then for i = 1, . . . , r:

kAvik = σi

Proof. Recall from the proof of Theorem 5.68 that if v is an eigenvector of AT A with eigenvalue λ¡ then: kAvk2 = λ

Use this result with λi and vi. It follows that: 2 2 kAvik = λi = σi . Therefore:

kAvik = σi  m×n Proposition 6.20. Let A ∈ R . Then dim(Im(fA)) is the number of (non-zero) singular values. Exercise 87. Prove Proposition 6.20. [Hint: See Theorem 5.69]. Remark 6.21. Recall we proved the the eigenvalues of AT A are non-negative in Theorem 5.68. Therefore the singular values of A always exist.

Definition 6.22 (Singular Value Decomposition). Let A ∈ Rm×n. The singular value decomposition consists of a orthogonal matrix U ∈ Rm×m, a positive matrix Σ ∈ Rm×n n×n entries Σij = 0 if i 6= j and a orthogonal matrix V ∈ R so that: (6.8) A = UΣVT

Theorem 6.23. For any matrix A ∈ Rm×n, the singular value decomposition exists. 97 T Proof. Recall from Theorem 5.69 that the eigenvectors v1,..., vn of A A form an orthogonal basis of Rn. Without loss of generality, assume this basis is orthonormal. Fur- thermore, from Theorem 5.69, the vectors Av1,..., Avr form an orthogonal basis of Im(A), where r is the number of non-zero eigenvalues (singular values) of AT A. Scale the vectors Av1,..., Avr to form the orthonormal set {u1,..., ur}, where from Lemma 6.19: 1 1 ui = Avi = Avi kAvik σi

Therefore: Avi = σiui. Using the Gram-Schmidt procedure we can extend this to an orthonormal basis {u1,..., um}. Do this by choosing any vectors not in the image of A and applying Gram-Schmidt and normalizing. Let V be the matrix whose columns are v1,..., vn and let U be the matrix whose columns are u1,..., um. Both are orthonormal matrices. Let D be the r × r diagonal matrix of (non-zero) singular values so that by construction: D2 0 AT A = V VT 0 0   T 2 is the diagonalization of the n × n symmetric√ matrix A A. Notice we have D because for th i = 1, . . . , r, the i singular value σi = λi. Define Σ as the m × n matrix: D 0 Σ = , 0 0   where D is surrounded by a sufficient number of 0’s to make Σ m × n. Let U1 be the m × r matrix composed of columns u1,..., ur and let U2 be composed of the remaining columns ur+1,..., um. Likewise let V1 be the n × r matrix composed of v1,..., vr and let V2 be composed of the remaining columns. We know that: T A Avi = 0 T for i = r+1, . . . , n. Therefore vi ∈ Ker(A A) for i = r+1, . . . , n. It follows from Proposition 5.65 that vi ∈ Ker(A) for i = r + 1, . . . , n. Therefore:

AV2 = 0 Finally note that: D 0 UΣ = U U = U D 0 1 2 0 0 1   But:    

σ1 0 ··· 0 σ1 0 ··· 0 0 σ2 ··· 0 0 σ2 ··· 0 Av1 Av1 U1D = u1 ··· ur  . . . .  = ···  . . . .  = . . .. . σ1 σr . . .. .    0 0 ··· σ     0 0 ··· σ   r  r     Av1 ··· Avr = AV1 Therefore:  

UΣ = AV1 0 = AV1 AV2 = AV     98 By construction V is a unitary matrix, therefore we can multiply by VT on the right to see that: UΣVT = A

This completes the proof.  Remark 6.24. Notice we used results on AT A in constructing this. We could (however) have used AAT instead. It is therefore easy to see that AT A and AAT must share non-zero eigenvalues and thus the singular values of A are unique. The singular value decomposition however, is not unique. Exercise 88. Show that the matrix AAT could be used in this process instead. As a consequence, show that AAT and AT A share non-zero eigenvalues.
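Numerical SVD routines return exactly the factors constructed in this proof. A sketch in Python/NumPy (an assumption) for the matrix used in Example 6.26 below.

import numpy as np

A = np.array([[3.0, 2.0, 1.0],
              [1.0, 2.0, 3.0]])
U, s, Vt = np.linalg.svd(A)            # s lists the singular values in descending order
print(s)                               # approximately [2*sqrt(6), 2]

Sigma = np.zeros(A.shape)              # rebuild the m x n matrix Sigma
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))  # A = U Sigma V^T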

Theorem 6.25. Let A ∈ Rm×n. Then: T T (6.9) AA = UD1U T T (6.10) A A = VD2V where the columns of U and V are orthornormal eigenvectors. Furthermore, D1 and D2 share non-zero eigenvalues and Σ has non-zero elements corresponding to the singular values of A. Proof. We argue in Remark 6.24 that the non-zero eigenvalues of AAT and AAT are identical. By Theorem 6.23 the singular value of A exists and thus: A = UΣVT for some orthogonal matrices U and VT . Therefore: T T T T T T T AA = UΣV VΣ U = UΣΣ U = UD1U T T T T T T T A A = VΣ U UΣV = VΣ ΣV = VD2V

T It is easy to check that D1 and D2 are diagonal matrices with the eigenvalues of A A on the diagonal. This completes the proof.  T Exercise 89. Check that D1 and D2 are diagonal matrices with the eigenvalues of A A Example 6.26. Let: 3 2 1 A = 1 2 3   we can compute: 14 10 √1 − √1 24 0 √1 √1 AAT = = 2 2 2 2 10 14 √1 √1 0 4 − √1 √1    2 2     2 2  √1 − √1 √1 √1 √1 √1 10 8 6 3 2 6 24 0 0 3 3 3 1 1 T 1 2 − √ 0 √ A A = 8 8 8 = √ 0 − 0 4 0 2 2  3 3        1 2 1 6 8 10 √1 √1 √1 0 0 0 √ √ q − 3  3 2 6   6 6         q  99 Thus we have: √1 − √1 U = 2 2 √1 √1  2 2  √1 − √1 √1 3 2 6 V = √1 0 − 2  3 3  √1 √1 √q1  3 2 6    We construct Σ as: √ √ 24 0 0 2 6 0 0 Σ = √ = 0 4 0 0 2 0     Thus we have: √1 √1 √1 1 1 √ 3 3 3 √ − √ 1 1 T 2 2 2 6 0 0 − √ 0 √ A = UΣV = 2 2 √1 √1 0 2 0    2 2    √1 − 2 √1  6 3 6   q  Remark 6.27. One of the more useful elements of the singular value decomposition is its ability to project vectors into a lower dimensional space, just like in principal components analysis and to provide a reasonable approximation to the matrix A using less data. Another is to increase the sparsity of A by finding a matrix Ak that approximate A. To formalize this, we need a notion of a distance on among matrices.

Definition 6.28 (Frobenius Norm). Let A ∈ Rm×n. The Frobenius Norm of A is the real value:

m n 2 (6.11) kAk = |Aij| F v u i=1 j=1 uX X t Remark 6.29. The proof of the following theorem is outside the scope of these notes, but is accessible to a motivated student. Our interested in it is purely as an application, especially to image processing.

Theorem 6.30. Let A ∈ R^{m×n} with singular value decomposition A = U Σ V^T. Suppose the singular values are organized in descending order. Let Σ_k be the k × k diagonal matrix that retains the largest k singular values. Let U_k ∈ R^{m×k} retain the first k columns of U and let V_k ∈ R^{n×k} retain the first k columns of V. Then:

\[
(6.12)\quad A_k = U_k \Sigma_k V_k^T
\]
has the property that it is the m × n rank k matrix minimizing the matrix distance ||A − A_k||_F.
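As a concrete illustration of Theorem 6.30 (a sketch, not from the notes), A_k can be assembled from the k leading singular triplets:

import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation of A in the Frobenius norm (Theorem 6.30)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[3.0, 2.0, 1.0],
              [1.0, 2.0, 3.0]])
A1 = rank_k_approximation(A, 1)
print(A1)                                 # the rank-1 matrix of all 2's (see Example 6.32)
print(np.linalg.norm(A - A1, 'fro'))      # the smallest possible error for a rank-1 matrix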

Remark 6.31. The previous theorem gives a recipe for approximating an arbitrary matrix with a matrix containing less data. This can be particularly useful for image processing, as we'll illustrate below.
Example 6.32. Suppose we use only the largest singular value 2√6 to recover the matrix A from Example 6.26. Using Theorem 6.30 we keep only the first column of U and V and compute:
\[
(6.13)\quad A_1 = \begin{bmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{bmatrix}
\begin{bmatrix} 2\sqrt{6} \end{bmatrix}
\begin{bmatrix} \tfrac{1}{\sqrt{3}} & \tfrac{1}{\sqrt{3}} & \tfrac{1}{\sqrt{3}} \end{bmatrix}
= \begin{bmatrix} 2 & 2 & 2 \\ 2 & 2 & 2 \end{bmatrix}
\]
Obviously this matrix has rank 1 (only one linearly independent column). It is also an approximation of A (it is actually a matrix of column means).
Example 6.33 (Application). A grayscale image of width n pixels and height m pixels can be stored as an m × n matrix of grayscale values. If each grayscale value occupies 8 bits, then transmitting this image with no compression requires transmission of 8mn bits. Transmission from space is complex; early Viking missions to Mars required a way to reduce transmission size and so used a Singular Value Decomposition compression method2, which we illustrate below. Suppose we have the grayscale image shown in Figure 6.6. The original

Figure 6.6. A grayscale version of the image found at http://hanna-barbera.wikia.com/wiki/Scooby-Doo_(character)?file=Scoobydoo.jpg. Copyright Hanna-Barbera; used under the fair use clause of the Copyright Act.

image is 391 × 272 pixels. Transmission of this image (uncompressed) would require 850,816 bits, or approximately 106 kB (on the computer on which this document was written, it occupies 111 kB due to various technical aspects of storing data on a hard drive). Using a singular value decomposition, we see the singular values show a clear decay in their value. This is illustrated in Figure 6.7. The fact that the singular values exhibit such a steep decline means we can choose only a few (say between 15 and 50) and these (along with U_k and V_k) can be used to reconstruct the image. In particular, if we send just the vectors in U_k and V_k and the singular values, then:
(1) If we choose 15 singular values, then we need only transmit 391 · 15 + 15 + 272 · 15 = 9960 values to reconstruct the image, as compared to the original 106,352 = 391 · 272 values. This is a savings of about 90%.
(2) If we choose 50 singular values, then we need only transmit 33,200 values, a savings of about 67%.
The two reconstructed images using 15 and 50 singular values are shown in Figure 6.8.
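The bookkeeping in this example can be sketched in a few lines of numpy (an illustration under stated assumptions, not the code used to produce the figures):

import numpy as np

# A stand-in array for a 391 x 272 grayscale image; in practice the pixel
# matrix would be loaded from an image file (an assumption here, since the
# original image is not distributed with these notes).
image = np.random.rand(391, 272)

U, s, Vt = np.linalg.svd(image, full_matrices=False)

k = 15
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]       # what would actually be transmitted
values_sent = Uk.size + sk.size + Vtk.size     # 391*15 + 15 + 272*15 = 9960
reconstruction = Uk @ np.diag(sk) @ Vtk        # the approximate image at the receiver
print(values_sent, image.size)                 # 9960 versus 106,352 original values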

2This may be apocryphal; I cannot find a reference for this.

(a) Singular Values (b) Log Singular Values

Figure 6.7. The singular values of the image matrix corresponding to the image in Figure 6.6. Notice the steep decay of the singular values.

(a) 15 Singular Values (b) 50 Singular Values

Figure 6.8. Reconstructed images from 15 and 50 singular values capture a sub- stantial amount of detail for substantially smaller transmission sizes.

Remark 6.34. It is worth noting that there has been recent work on using singular value decomposition as a part of image steganography; i.e., hiding messages inside images [Wen06].

CHAPTER 7

Linear Algebra for Graphs and Markov Chains

1. Goals of the Chapter
(1) Introduce graphs.
(2) Define the adjacency matrix and its properties.
(3) Discuss the eigenvalue/eigenvector properties of the adjacency matrix.
(4) Define eigenvector centrality.
(5) Introduce Markov chains and stationary probabilities as an eigenvector.
(6) Introduce the Page Rank Algorithm and relate it to eigenvector centrality.
(7) Discuss the Graph Laplacian and its uses.

2. Graphs, Multi-Graphs, Simple Graphs Definition 7.1 (Graph). A graph is a tuple G = (V,E) where V is a (finite) set of vertices and E is a finite collection of edges. The set E contains elements from the union of the one and two element subsets of V . That is, each edge is either a one or two element subset of V . Example 7.2. Consider the set of vertices V = {1, 2, 3, 4}. The set of edges E = {{1, 2}, {2, 3}, {3, 4}, {4, 1}} Then the graph G = (V,E) has four vertices and four edges. It is usually easier to represent this graphically. See Figure 7.1 for the visual representation of G. These visualizations


Figure 7.1. It is easier for explanation to represent a graph by a diagram in which vertices are represented by points (or squares, circles, triangles etc.) and edges are represented by lines connecting vertices. are constructed by representing each vertex as a point (or square, circle, triangle etc.) and each edge as a line connecting the vertex representations that make up the edge. That is, let v1, v2 ∈ V . Then there is a line connecting the points for v1 and v2 if and only if {v1, v2} ∈ E.

103 Definition 7.3 (Self-Loop). If G = (V,E) is a graph and v ∈ V and e = {v}, then edge e is called a self-loop. That is, any edge that is a single element subset of V is called a self-loop. Exercise 90. Graphs occur in every day life, but often behind the scenes. Provide an example of a graph (or something that can be modeled as a graph) that appears in everyday life.

Definition 7.4 (Vertex Adjacency). Let G = (V,E) be a graph. Two vertices v1 and v2 are said to be adjacent if there exists an edge e ∈ E so that e = {v1, v2}. A vertex v is self-adjacent if e = {v} is an element of E.

Definition 7.5 (Edge Adjacency). Let G = (V,E) be a graph. Two edges e1 and e2 are said to be adjacent if there exists a vertex v so that v is an element of both e1 and e2 (as sets). An edge e is said to be adjacent to a vertex v if v is an element of e as a set. Definition 7.6 (Neighborhood). Let G = (V,E) be a graph and let v ∈ V . The neighbors of v are the set of vertices that are adjacent to v. Formally: (7.1) N(v) = {u ∈ V : ∃e ∈ E (e = {u, v} or u = v and e = {v})} In some texts, N(v) is called the open neighborhood of v while N[v] = N(v) ∪ {v} is called the closed neighborhood of v. This notation is somewhat rare in practice. When v is an element of more than one graph, we write NG(v) as the neighborhood of v in graph G. Exercise 91. Find the neighborhood of Vertex 1 in the graph in Figure 7.2. Remark 7.7. Expression 7.1 is read N(v) is the set of vertices u in (the set) V such that there exists an edge e in (the set) E so that e = {u, v} or u = v and e = {v}. The logical expression ∃x (R(x)) is always read in this way; that is, there exists x so that some statement R(x) holds. Similarly, the logical expression ∀y (R(y)) is read: For all y the statement R(y) holds. Admittedly this sort of thing is very pedantic, but logical notation can help immensely in simplifying complex mathematical expressions1. Remark 7.8. The difference between the open and closed neighborhood of a vertex can get a bit odd when you have a graph with self-loops. Since this is a highly specialized case, usually the author (of the paper, book etc.) will specify a behavior. Example 7.9. In the graph from Example 7.2, the neighborhood of Vertex 1 is Vertices 2 and 4 and Vertex 1 is adjacent to these vertices.

1When I was in graduate school, I always found Real Analysis to be somewhat mysterious until I got used to all the ’s and δ’s. Then I took a bunch of logic courses and learned to manipulate complex logical expressions, how they were classified and how mathematics could be built up out of Set Theory. Suddenly, Real Analysis (as I understood it) became very easy. It was all about manipulating logical sentences about those ’s and δ’s and determining when certain logical statements were equivalent. The moral of the story: if you want to learn mathematics, take a course or two in logic. 104 Definition 7.10 (Degree). Let G = (V,E) be a graph and let v ∈ V . The degree of v, written deg(v) is the number of non-self-loop edges adjacent to v plus two times the number of self-loops defined at v. More formally: deg(v) = |{e ∈ E : ∃u ∈ V (e = {u, v})}| + 2 |{e ∈ E : e = {v}}| Here if S is a set, then |S| is the cardinality of that set. Remark 7.11. Note that each vertex in the graph in Figure 7.1 has degree 2. Example 7.12. If we replace the edge set in Example 7.9 with: E = {{1, 2}, {2, 3}, {3, 4}, {4, 1}, {1}} then the visual representation of the graph includes a loop that starts and ends at Vertex 1. This is illustrated in Figure 7.2. In this example the degree of Vertex 1 is now 4. We obtain


Figure 7.2. A self-loop is an edge in a graph G that contains exactly one vertex. That is, an edge that is a one element subset of the vertex set. Self-loops are illustrated by loops at the vertex in question.

this by counting the number of non self-loop edges adjacent to Vertex 1 (there are 2) and adding two times the number of self-loops at Vertex 1 (there is 1) to obtain 2 + 2 × 1 = 4. Definition 7.13 (Simple Graph). A graph G = (V,E) is a simple graph if G has no edges that are self-loops and if E is a subset of two element subsets of V ; i.e., G is not a multi-graph. Remark 7.14. We will assume that every graph we discuss is a simple graph and we will use the term graph to mean simple graph. When a particular result holds in a more general setting, we will state it explicitly. 3. Directed Graphs Definition 7.15 (Directed Graph). A directed graph (digraph) is a tuple G = (V,E) where V is a (finite) set of vertices and E is a collection of elements contained in V × V . That is, E is a collection of ordered pairs of vertices. The edges in E are called directed edges to distinguish them from those edges in Definition 7.1 Definition 7.16 (Source / Destination). Let G = (V,E) be a directed graph. The source (or tail) of the (directed) edge e = (v1, v2) is v1 while the destination (or sink or head) of the edge is v2. 105 Remark 7.17. A directed graph (digraph) differs from a graph only insofar as we replace the concept of an edge as a set with the idea that an edge as an ordered pair in which the ordering gives some notion of direction of flow. In the context of a digraph, a self-loop is an ordered pair with form (v, v). We can define a multi-digraph if we allow the set E to be a true collection (rather than a set) that contains multiple copies of an ordered pair.

Remark 7.18. It is worth noting that the ordered pair (v1, v2) is distinct from the pair (v2, v1). Thus if a digraph G = (V,E) has both (v1, v2) and (v2, v1) in its edge set, it is not a multi-digraph. Example 7.19. We can modify the figures in Example 7.9 to make it directed. Suppose we have the directed graph with vertex set V = {1, 2, 3, 4} and edge set: E = {(1, 2), (2, 3), (3, 4), (4, 1)} This digraph is visualized in Figure 7.3(a). In drawing a digraph, we simply append arrow- heads to the destination associated with a directed edge. We can likewise modify our self-loop example to make it directed. In this case, our edge set becomes: E = {(1, 2), (2, 3), (3, 4), (4, 1), (1, 1)} This is shown in Figure 7.3(b).


Figure 7.3. (a) A directed graph. (b) A directed graph with a self-loop. In a directed graph, edges are directed; that is they are ordered pairs of elements drawn from the vertex set. The ordering of the pair gives the direction of the edge.

Definition 7.20 (In-Degree, Out-Degree). Let G = (V,E) be a digraph. The in-degree of a vertex v in G is the total number of edges in E with destination v. The out-degree of v is the total number of edges in E with source v. We will denote the in-degree of v by

degin(v) and the out-degree by degout(v). Remark 7.21. Notions like edge and vertex adjacency and neighborhood can be extended to digraphs by simply defining them with respect to the underlying graph of a digraph. Thus the neighborhood of a vertex v in a digraph G is N(v) computed in the underlying graph. Definition 7.22 (Walk). A walk on a directed graph G = (V,E) is a sequence w = (v1, e1, v2, . . . , vn−1, en−1, vn) with vi ∈ V for i = 1, . . . , n, ei ∈ E and and ei = (vi, vi+1) for i = 1, . . . , n − 1. A walk on an undirected graph is defined in the same way except ei = {vi, vi+1}. 106 Definition 7.23 (Walk Length). The length of a walk w is the number of edge it contains. Definition 7.24 (Path / Cycle). Let G = (V,E) be a (directed) graph. A walk w = (v1, e1, v2, . . . , vn−1, en−1, vn) is a path if for each i = 1, . . . , n, vi occurs only once in the 0 sequence w. A walk is a cycle if the walk w = (v1, e1, v2, . . . , vn−1) is a path and v1 = vn. Example 7.25. We illustrate a walk that is also a path and a different cycle in Figure 7.4. The walk has length 3, the path has length 4.


(a) Walk (b) Cycle

Figure 7.4. A walk (a) and a cycle (b) are illustrated.

Definition 7.26 (Connected Graph). A graph G = (V,E) is connected if for every pair of vertices v, u ∈ V there is at least one walk w that begins with v and ends with u. In the case of a directed graph, we say the graph is strongly connected when there is a (directed) walk from v to u. Example 7.27. Figure 7.5 we illustrate a connected graph, a disconnected graph and a connected digraph that is not strongly connected. 4. Matrix Representations of Graphs Definition 7.28 (Adjacency Matrix). Let G = (V,E) be a graph and assume that V = {v1, . . . , vn}. The adjacency matrix of G is an n × n matrix M defined as:

\[
M_{ij} = \begin{cases} 1 & \{v_i, v_j\} \in E \\ 0 & \text{else} \end{cases}
\]


(a) Connected (b) Disconnected

Figure 7.5. A connected graph (a) and a disconnected graph (b).


Figure 7.6. The adjacency matrix of a graph with n vertices is an n × n matrix with a 1 at element (i, j) if and only if there is an edge connecting vertex i to vertex j; otherwise element (i, j) is a zero.

Proposition 7.29. The adjacency matrix of a (simple) graph is symmetric. Exercise 92. Prove Proposition 7.29.

Theorem 7.30. Let G = (V,E) be a graph with V = {v_1, ..., v_n} and let M be its adjacency matrix. For k ≥ 0, the (i, j) entry of M^k is the number of walks of length k from v_i to v_j.
Proof. We will proceed by induction. By definition, M^0 is the n × n identity matrix and the number of walks of length 0 between v_i and v_j is 0 if i ≠ j and 1 otherwise; thus the base case is established.
Now suppose that the (i, j) entry of M^k is the number of walks of length k from v_i to v_j. We will show this is true for k + 1. We know that:
\[
(7.2)\quad M^{k+1} = M^k M
\]

Consider vertices v_i and v_j. The (i, j) element of M^{k+1} is:
\[
(7.3)\quad M^{k+1}_{ij} = M^k_{i\cdot} M_{\cdot j}
\]
Let:
\[
(7.4)\quad M^k_{i\cdot} = \begin{bmatrix} r_1 & \cdots & r_n \end{bmatrix}
\]

where r_l (l = 1, ..., n) is the number of walks of length k from v_i to v_l by the induction hypothesis. Let:

\[
(7.5)\quad M_{\cdot j} = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}
\]
where b_l (l = 1, ..., n) is 1 if and only if there is an edge {v_l, v_j} ∈ E and 0 otherwise. Then the (i, j) term of M^{k+1} is:
\[
(7.6)\quad M^{k+1}_{ij} = M^k_{i\cdot} M_{\cdot j} = \sum_{l=1}^{n} r_l b_l
\]
This is the total number of walks of length k leading to a vertex v_l (l = 1, ..., n) from vertex v_i such that there is also an edge connecting v_l to v_j. Thus M^{k+1}_{ij} is the number of walks of length k + 1 from v_i to v_j. The result follows by induction.
Example 7.31. Consider the graph in Figure 7.6. The adjacency matrix for this graph is:
\[
(7.7)\quad M = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}
\]
Consider M^2:
\[
(7.8)\quad M^2 = \begin{bmatrix} 3 & 1 & 1 & 2 \\ 1 & 2 & 2 & 1 \\ 1 & 2 & 2 & 1 \\ 2 & 1 & 1 & 3 \end{bmatrix}
\]
This tells us that there are three distinct walks of length 2 from vertex v_1 to itself. These walks are obvious:

(1) (v_1, {v_1, v_2}, v_2, {v_1, v_2}, v_1)
(2) (v_1, {v_1, v_3}, v_3, {v_1, v_3}, v_1)
(3) (v_1, {v_1, v_4}, v_4, {v_1, v_4}, v_1)

We also see there is 1 path of length 2 from v1 to v2:(v1, {v1, v4}, v4, {v2, v4}, v2). We can verify each of the other numbers of paths in M2. Definition 7.32 (Directed Adjacency Matrix). Let G = (V,E) be a directed graph and assume that V = {v1, . . . , vn}. The adjacency matrix of G is an n × n matrix M defined as:

\[
M_{ij} = \begin{cases} 1 & (v_i, v_j) \in E \\ 0 & \text{else} \end{cases}
\]

Theorem 7.33. Let G = (V,E) be a digraph with V = {v1, . . . , vn} and let M be its adjacency matrix. For k ≥ 0, the (i, j) entry of Mk is the number of directed walks of length k from vi to vj. Exercise 93. Prove Theorem 7.33. [Hint: Use the approach in the proof of Theorem 7.30.]
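Theorem 7.30 is easy to check numerically. A short sketch (not part of the notes) using the graph of Figure 7.6:

import numpy as np

# Adjacency matrix of the graph in Figure 7.6 (Example 7.31).
M = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 0]])

M2 = np.linalg.matrix_power(M, 2)
print(M2)          # matches Equation 7.8; entry (0, 0) = 3 walks of length 2 from v1 to v1
print(M2[0, 1])    # 1 walk of length 2 from v1 to v2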

5. Properties of the Eigenvalues of the Adjacency Matrix
Lemma 7.34 (Rational Root Theorem). Let a_n x^n + ··· + a_1 x + a_0 = 0 for x = p/q with gcd(p, q) = 1 and a_n, ..., a_0 ∈ Z. Then p is an integer factor of a_0 and q is an integer factor of a_n.
Theorem 7.35. Let G = (V,E) be a graph with adjacency matrix M. Then: (1) every eigenvalue of M is real, and (2) if λ is a rational eigenvalue of M, then it is an integer.
Proof. The first part follows at once from the Spectral Theorem for Real Matrices. The second observation follows from the fact that the characteristic polynomial has all integer coefficients, because the adjacency matrix consists only of ones and zeros. Consequently, any rational eigenvalue (root of the characteristic equation) x = p/q must have q a factor of 1 (the coefficient of λ^n, where n is the number of vertices). Therefore any rational eigenvalue is an integer.
Definition 7.36 (Irreducible Matrix). A matrix M ∈ R^{n×n} is irreducible if for each (i, j) pair, there is some k ∈ Z with k > 0 so that M^k_{ij} > 0.
Lemma 7.37. If G = (V,E) is a connected graph with adjacency matrix M, then M is irreducible.
Exercise 94. Prove Lemma 7.37.
Theorem 7.38 (Perron-Frobenius Theorem). If M is an irreducible matrix, then M has an eigenvalue λ_0 with the following properties:

(1) The eigenvalue λ_0 is positive and if λ is any other eigenvalue of M, then λ_0 ≥ |λ|,
(2) The matrix M has an eigenvector v_0 corresponding to λ_0 with only positive entries when properly scaled,
(3) The eigenvalue λ_0 is a simple root of the characteristic equation for M and therefore has a unique (up to scale) eigenvector v_0,
(4) The eigenvector v_0 is the only eigenvector of M that can have all positive entries when properly scaled.
Remark 7.39. The Perron-Frobenius theorem is a classical result in Linear Algebra with several proofs (see [Mey01]). Also, note the quote from Meyer that starts this chapter.
Corollary 7.40. If G = (V,E) is a connected graph with adjacency matrix M, then it has a unique largest eigenvalue which corresponds to an eigenvector that is positive when properly scaled.
Proof. Applying Lemma 7.37 we see that M is irreducible. Further, we know that there is an eigenvalue λ_0 of M that is (i) greater than or equal to, in absolute value, all other eigenvalues of M and (ii) a simple root. From Theorem 7.35, we know that all eigenvalues of M are real. But for (i) and (ii) to hold, no other (real) eigenvalue can have value equal to λ_0 (otherwise it would not be a simple root). Thus, λ_0 is the unique largest eigenvalue of M. This completes the proof.
6. Eigenvector Centrality
Remark 7.41. This approach to justifying eigenvector centrality comes from Leo Spizzirri [Spi11]. It is reasonably nice, and fairly rigorous. It is not meant to be any more than a justification; it is not a proof of correctness. Before proceeding, we recall the principal axis theorem:

Theorem 7.42 (Principal Axis Theorem). Let M ∈ R^{n×n} be a symmetric matrix. Then R^n has a basis consisting of the eigenvectors of M.
Remark 7.43 (Eigenvector Centrality). We can assign to each vertex of a graph G = (V,E) a score (called its eigenvector centrality) that will determine its relative importance in the graph. Here importance is measured in a self-referential way: important vertices are important precisely because they are adjacent to other important vertices. This self-referential definition can be resolved in the following way. Let x_i be the (unknown) score of vertex v_i ∈ V and let x_i = κ(v_i) with κ being the function returning the score of each vertex in V. We may define x_i as a pseudo-average of the scores of its neighbors. That is, we may write:
\[
(7.9)\quad x_i = \frac{1}{\lambda} \sum_{v \in N(v_i)} \kappa(v)
\]
Here λ will be chosen endogenously during computation. Recall that M_{i·} is the i-th row of the adjacency matrix M and contains a 1 in position j if and only if v_i is adjacent to v_j; that is to say v_j ∈ N(v_i). Thus we can rewrite Equation 7.9 as:
\[
x_i = \frac{1}{\lambda} \sum_{j=1}^{n} M_{ij} x_j
\]
This leads to n equations, one for each vertex in V (or each row of M). Written as a matrix expression we have:
\[
(7.10)\quad x = \frac{1}{\lambda} M x \implies \lambda x = M x
\]
Thus x is an eigenvector of M and λ is its eigenvalue. Clearly, there may be several eigenvectors and eigenvalues for M. The question is, which eigenvalue / eigenvector pair should be chosen? The answer is to choose the eigenvector with all positive entries corresponding to the largest eigenvalue. We know such an eigenvalue / eigenvector pair exists and is unique as a result of the Perron-Frobenius Theorem and Lemma 7.37.

Theorem 7.44.2 Let G = (V,E) be a connected graph with adjacency matrix M ∈ R^{n×n}. Suppose that λ_0 is the largest real eigenvalue of M and has corresponding eigenvector v_0. Further assume that |λ_0| > |λ| for any other eigenvalue λ of M. If x ∈ R^{n×1} is a column vector so that x · v_0 ≠ 0, then
\[
(7.11)\quad \lim_{k \to \infty} \frac{M^k x}{\lambda_0^k} = \alpha_0 v_0
\]

Proof. Applying Theorem 7.42 we see that the eigenvectors of M must form a basis for Rn. Thus, we can express:

\[
(7.12)\quad x = \alpha_0 v_0 + \alpha_1 v_1 + \cdots + \alpha_{n-1} v_{n-1}
\]
Multiplying both sides by M^k yields:

\[
(7.13)\quad M^k x = \alpha_0 M^k v_0 + \alpha_1 M^k v_1 + \cdots + \alpha_{n-1} M^k v_{n-1} = \alpha_0 \lambda_0^k v_0 + \alpha_1 \lambda_1^k v_1 + \cdots + \alpha_{n-1} \lambda_{n-1}^k v_{n-1}
\]

2This theorem has been corrected in Version 2.1 of the notes. Thanks to Prof. Elena Kosygina.
because M^k v_i = λ_i^k v_i for any eigenvector v_i. Dividing by λ_0^k yields:

\[
(7.14)\quad \frac{M^k x}{\lambda_0^k} = \alpha_0 v_0 + \alpha_1 \frac{\lambda_1^k}{\lambda_0^k} v_1 + \cdots + \alpha_{n-1} \frac{\lambda_{n-1}^k}{\lambda_0^k} v_{n-1}
\]

Applying the Perron-Frobenius Theorem (and Lemma 7.37) we see that λ0 is greater than the absolute value of any other eigenvalue and thus we have:

\[
(7.15)\quad \lim_{k \to \infty} \frac{\lambda_i^k}{\lambda_0^k} = 0
\]
for i ≠ 0. Thus:
\[
(7.16)\quad \lim_{k \to \infty} \frac{M^k x}{\lambda_0^k} = \alpha_0 v_0
\]

Exercise 95. Show that the previous theorem does not hold if there is some other eigenvalue λ of M so that |λ_0| = |λ|. To do this, consider the path graph with three vertices. Find its adjacency matrix, eigenvalues and principal eigenvector and confirm the theorem does not hold in this case3.

Remark 7.45. We can use Theorem 7.44 to justify our definition of eigenvector centrality as the eigenvector corresponding to the largest eigenvalue. Let x be a vector with a 1 at index i and 0 everywhere else. This vector corresponds to beginning at vertex v_i in a graph G with n vertices. If M is the adjacency matrix, then Mx is the i-th column of M, whose j-th index tells us the number of walks of length 1 leading from vertex v_j to vertex v_i and, by symmetry, the number of walks leading from vertex v_i to vertex v_j. We can repeat this logic to see that M^k x gives us a vector whose j-th element is the number of walks of length k from v_i to v_j. Note that for the remainder of this discussion, we will exploit the symmetry that the (i, j) element of M^k is both the number of walks from i to j and the number of walks from j to i. From Theorem 7.44 we know that (under some suitable conditions) no matter which vertex we choose in creating x:
\[
(7.17)\quad \lim_{k \to \infty} \frac{M^k x}{\lambda_0^k} = \alpha_0 v_0
\]
Reinterpreting Equation 7.17 we observe that as k → ∞, M^k x (rescaled by λ_0^k) will converge to some multiple of the eigenvector corresponding to the eigenvalue λ_0. That is, the eigenvector corresponding to the largest eigenvalue is, in the limit, a multiple of the vector counting walks of length k leading from some initial vertex i, since the Perron-Frobenius eigenvector is unique (up to scale).
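The limit in Remark 7.45 suggests the usual power-iteration computation. A minimal sketch (not from the notes), normalizing by the 1-norm at each step rather than dividing by λ_0^k:

import numpy as np

def eigenvector_centrality(M, iterations=100):
    """Power-iteration sketch of Remark 7.45: repeatedly apply M and renormalize."""
    x = np.ones(M.shape[0]) / M.shape[0]
    for _ in range(iterations):
        x = M @ x
        x = x / np.abs(x).sum()      # rescale by the 1-norm instead of lambda_0^k
    return x

M = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 0]], dtype=float)
print(eigenvector_centrality(M))     # approx [0.2808, 0.2192, 0.2192, 0.2808]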

Example 7.46. Consider the graph shown in Figure 7.7. Recall from Example 7.31 this

3This exercise is a result of a comment made by Prof. Elena Kosygina in correcting the statement of the previous theorem.

Figure 7.7. A graph with 4 vertices and 5 edges. Intuitively, vertices 1 and 4 should have the same eigenvector centrality score as vertices 2 and 3.

graph had adjacency matrix:
\[
M = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}
\]
We can use a computer to determine the eigenvalues and eigenvectors of M. The eigenvalues are:
\[
\left\{ 0,\ -1,\ \tfrac{1}{2} + \tfrac{1}{2}\sqrt{17},\ \tfrac{1}{2} - \tfrac{1}{2}\sqrt{17} \right\},
\]
while the corresponding eigenvectors are (in floating point approximation) the columns of the matrix:
\[
\begin{bmatrix}
0.0 & -1.0 & 1.0 & 1.000000001 \\
-1.0 & 0.0 & 0.7807764064 & -1.280776407 \\
1.0 & 0.0 & 0.7807764069 & -1.280776408 \\
0.0 & 1.0 & 1.0 & 1.0
\end{bmatrix}
\]
The largest eigenvalue is λ_0 = 1/2 + (1/2)√17, which has corresponding eigenvector:
\[
v_0 = \begin{bmatrix} 1.0 \\ 0.7807764064 \\ 0.7807764064 \\ 1.0 \end{bmatrix}
\]
We can normalize this vector to be:
\[
v_0 = \begin{bmatrix} 0.2807764065 \\ 0.2192235937 \\ 0.2192235937 \\ 0.2807764065 \end{bmatrix}
\]
illustrating that vertices 1 and 4 have identical (larger) eigenvector centrality scores and vertices 2 and 3 have identical (smaller) eigenvector centrality scores. By way of comparison, consider the vector:
\[
x = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\]
We consider M^k x / |M^k x|_1 for various values of k:
\[
\frac{M^{1} x}{|M^{1} x|_1} = \begin{bmatrix} 0.0 \\ 0.3333333333 \\ 0.3333333333 \\ 0.3333333333 \end{bmatrix}
\qquad
\frac{M^{10} x}{|M^{10} x|_1} = \begin{bmatrix} 0.2822190823 \\ 0.2178181007 \\ 0.2178181007 \\ 0.2821447163 \end{bmatrix}
\]
\[
\frac{M^{20} x}{|M^{20} x|_1} = \begin{bmatrix} 0.2807863651 \\ 0.2192136380 \\ 0.2192136380 \\ 0.2807863590 \end{bmatrix}
\qquad
\frac{M^{40} x}{|M^{40} x|_1} = \begin{bmatrix} 0.2807764069 \\ 0.2192235931 \\ 0.2192235931 \\ 0.2807764069 \end{bmatrix}
\]
It's easy to see that as k → ∞, M^k x / |M^k x|_1 approaches the normalized eigenvector centrality scores, as we expected.
7. Markov Chains and Random Walks
Remark 7.47. Markov Chains are a type of directed graph in which we assign to each edge a probability of walking along that edge given we imagine ourselves standing in a specific vertex adjacent to the edge. Our goal is to define Markov chains, and random walks on a graph in reference to a Markov chain, and show that some of the properties of graphs can be used to derive interesting properties of Markov chains. We'll then discuss another way of ranking vertices; this one is used (more-or-less) by Google for ranking webpages in their search.
Definition 7.48 (Markov Chain). A discrete time Markov Chain is a tuple M = (G, p) where G = (V,E) is a directed graph, the set of vertices is usually referred to as the states, the set of edges are called the transitions, and p : E → [0, 1] is a probability assignment function satisfying:
\[
(7.18)\quad \sum_{v' \in N_o(v)} p(v, v') = 1
\]
for all v ∈ V. Here, N_o(v) is the neighborhood reachable by out-edge from v. If there is no edge (v, v') ∈ E then p(v, v') = 0.
Remark 7.49. There are continuous time Markov chains, but these are not in the scope of these notes. When we say Markov chain, we mean discrete time Markov chain.
Example 7.50. A simple Markov chain is shown in Figure 7.8. We can think of a Markov chain as governing the evolution of state as follows. Think of the states as cities with airports. If there is an out-edge connecting the current city to another city, then we can fly from our current city to this next city and we do so with some probability. When we do fly (or perhaps don't fly and remain at the current location) our state updates to the next city. In this case, time is treated discretely.


Figure 7.8. A Markov chain is a directed graph to which we assign edge proba- bilities so that the sum of the probabilities of the out-edges at any vertex is always 1.

A walk along the vertices of a Markov chain governed by the probability function is called a random walk. Definition 7.51 (Stochastic Matrix). Let M = (G, p) be a Markov chain. Then the stochastic matrix (or probability transition matrix) of M is:

\[
(7.19)\quad M_{ij} = p(v_i, v_j)
\]
Example 7.52. The stochastic matrix for the Markov chain in Figure 7.8 is:
\[
M = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{7} & \tfrac{6}{7} \end{bmatrix}
\]
Thus a stochastic matrix is very much like an adjacency matrix where the 0's and 1's indicating the presence or absence of an edge are replaced by the probabilities associated to the edges in the Markov chain.
Definition 7.53 (State Probability Vector). If M = (G, p) is a Markov chain with n states (vertices), then a state probability vector is a vector x ∈ R^{n×1} such that x_1 + x_2 + ··· + x_n = 1 and x_i ≥ 0 for i = 1, ..., n, where x_i represents the probability that we are in state i (at vertex i).
Remark 7.54. The next theorem can be proved in exactly the same way that Theorem 7.30 is proved.
Theorem 7.55. Let M = (G, p) be a Markov chain with n states (vertices). Let x^(0) ∈ R^{n×1} be an (initial) state probability vector. Then assuming we take a random walk of length k in M using initial state probability vector x^(0), the final state probability vector is:
\[
(7.20)\quad x^{(k)} = \left(M^T\right)^k x^{(0)}
\]
Remark 7.56. If you prefer to remove the transpose, you can write x^(0) ∈ R^{1×n}; that is, x^(0) is a row vector. Then:
\[
(7.21)\quad x^{(k)} = x^{(0)} M^k
\]
with x^(k) ∈ R^{1×n}.
Exercise 96. Prove Theorem 7.55. [Hint: Use the same inductive argument from the proof of Theorem 7.30.]
Example 7.57. Consider the Markov chain in Figure 7.8. The state vector:
\[
x^{(0)} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}
\]
states that we will start in State 1 with probability 1. From Example 7.52 we know what M is. Then it is easy to see that:

\[
x^{(1)} = M^T x^{(0)} = \begin{bmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \end{bmatrix}
\]
which is precisely the state probability vector we would expect after a random walk of length 1 in M.
Definition 7.58 (Stationary Vector). Let M = (G, p) be a Markov chain. Then a vector x* is stationary for M if
\[
(7.22)\quad x^* = M^T x^*
\]
Remark 7.59. Expression 7.22 should look familiar. It says that M^T has an eigenvalue of 1 and a corresponding eigenvector whose entries are all non-negative (so that the vector can be scaled so its components sum to 1). Furthermore, this looks very similar to the equation we used for eigenvector centrality.
Lemma 7.60. Let M = (G, p) be a Markov chain with n states and with stochastic matrix M. Then:

\[
(7.23)\quad \sum_{j} M_{ij} = 1
\]
for all i = 1, ..., n.
Exercise 97. Prove Lemma 7.60.
Lemma 7.61. Let M = (G, p) be a Markov chain with n states and with stochastic matrix M. If G is strongly connected, then M and M^T are irreducible.

Proof. If G is strongly connected, then there is a directed walk from any vertex vi to any other vertex vj in V , the vertex set of G. Consider any length k walk connecting vi to th vj (such a walk exists for some k). Let ei be the vector with 1 in its i component and 0 T k everywhere else. Then (M ) ei is the final state probability vector associated with a walk of length k starting at vertex vi. Since there is a walk of length k from vi to vj, we know that the jth element of this vector must be non-zero. That is: T T k ej (M ) ei > 0 th T k where ej is defined just as ei is but with the 1 at the j position. Thus, (M )ij > 0 for some k for every (i, j) pair and thus MT is irreducible. The fact that M is irreducible follows T k k T immediately from the fact that (M ) = (M ) . This completes the proof.  Theorem 7.62 (Perron-Frobenius Theorem Redux). If M is an irreducible matrix, then M has an eigenvalue λ0 with the following properties: 116 (1) The eigenvalue λ0 is positive and if λ is an alternative eigenvalue of M, then λ0 ≥ |λ|, (2) The matrix M has an eigenvectors v0 corresponding to λ0 with only positive entries, (3) The eigenvalue λ is a simple root of the characteristic equation for M and therefore has a unique (up to scale) eigenvectors v0. (4) The eigenvector v0 is the only eigenvector of M that can have all positive entries when properly scaled. (5) The following inequalities hold:

\[
\min_i \sum_j M_{ij} \;\leq\; \lambda_0 \;\leq\; \max_i \sum_j M_{ij}
\]
Theorem 7.63. Let M = (G, p) be a Markov chain with stochastic matrix M. If M^T is irreducible, then M has a unique stationary probability distribution.
Proof. From Theorem 4.47 we know that M and M^T have identical eigenvalues. By the Perron-Frobenius theorem, M has a largest positive eigenvalue λ_0 that satisfies:

\[
\min_i \sum_j M_{ij} \;\leq\; \lambda_0 \;\leq\; \max_i \sum_j M_{ij}
\]
By Lemma 7.60, we know that:

\[
\min_i \sum_j M_{ij} = \max_i \sum_j M_{ij} = 1
\]
Therefore, by the squeezing lemma, λ_0 = 1. The fact that M^T has exactly one strictly positive eigenvector v_0 corresponding to λ_0 = 1 means that:
\[
(7.24)\quad M^T v_0 = v_0
\]

Thus v0 is the unique stationary state probability vector for M = (G, p). This completes the proof. 
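A numerical sketch of Theorem 7.63 for the chain of Figure 7.8 (not part of the notes; the eigenvector is recovered with numpy and rescaled to sum to 1):

import numpy as np

# Stochastic matrix of the Markov chain in Figure 7.8 (Example 7.52).
M = np.array([[1/2, 1/2],
              [1/7, 6/7]])

# The stationary distribution is the eigenvector of M^T for eigenvalue 1,
# rescaled so that its entries sum to 1.
vals, vecs = np.linalg.eig(M.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
x_star = v / v.sum()
print(x_star)                              # approximately [0.2222, 0.7778]
print(np.allclose(M.T @ x_star, x_star))   # True: it is stationary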

8. Page Rank
Definition 7.64 (Induced Markov Chain). Let G = (V,E) be a graph. Then the induced Markov chain from G is the one obtained by defining a new directed graph G' = (V,E') with each edge {v, v'} ∈ E replaced by two directional edges (v, v') and (v', v) in E' and defining the probability function p so that:
\[
(7.25)\quad p(v, v') = \frac{1}{\deg_{\mathrm{out},G'}(v)}
\]
Example 7.65. An induced Markov chain is shown in Figure 7.9. The Markov chain in the figure has the stationary state probability vector:
\[
x^* = \begin{bmatrix} \tfrac{3}{8} \\ \tfrac{2}{8} \\ \tfrac{2}{8} \\ \tfrac{1}{8} \end{bmatrix}
\]


Original Graph Induced Markov Chain

Figure 7.9. An induced Markov chain is constructed from a graph by replacing every edge with a pair of directed edges (going in opposite directions) and assigning to every edge leaving a vertex a probability equal to one over the out-degree of that vertex.

which is the eigenvector corresponding to the eigenvalue 1 in the matrix M^T. Arguing as we did in the proof of Theorem 7.44 and Example 7.46, we could expect that for any state vector x we would have:
\[
\lim_{k \to \infty} \left(M^T\right)^k x = x^*
\]
and we would be correct. When this convergence happens quickly (where we leave "quickly" poorly defined) the graph is said to have a fast mixing property. If we used the stationary probability of a vertex in the induced Markov chain as a measure of importance, then clearly vertex 1 would be most important, followed by vertices 2 and 3, and lastly vertex 4. We can compare this with the eigenvector centrality measure, which assigns a rank vector of:
\[
x^+ = \begin{bmatrix} 0.3154488065 \\ 0.2695944375 \\ 0.2695944375 \\ 0.1453623195 \end{bmatrix}
\]
Thus eigenvector centrality gives the same ordinal ranking as using the stationary state probability vector, but there are subtle differences in the values produced by these two ranking schemes. This leads us to PageRank [BP98].
Remark 7.66. Consider a collection of web pages, each with links. We can construct a directed graph G with the vertex set V consisting of the web pages and E consisting of the directed links among the pages. Imagine a random web surfer who will click among these web pages following links until a dead-end is reached (a page with no outbound links). In this case, the web surfer will type in a new URL (chosen from the set of web pages available) and the process will continue. From this model, we can induce a Markov chain in which we define a new graph G' with edge set E' so that if v ∈ V has out-degree 0, then we create an edge in E' to every other vertex in V, and we then define:
\[
(7.26)\quad p(v, v') = \frac{1}{\deg_{\mathrm{out},G'}(v)}
\]
exactly as before. In the absence of any further insight, the PageRank algorithm simply assigns to each web page a score equal to the stationary probability of its state in the induced Markov chain. For the remainder of this remark, let M be the stochastic matrix of the induced Markov chain. In general, however, PageRank assumes that surfers will get bored after some number of clicks (or new URL's) and will stop (and move to a new page) with some probability d ∈ [0, 1] called the damping factor. This factor is usually estimated. Assuming there are n web pages, let r ∈ R^{n×1} be the PageRank score for each page. Taking boredom into account leads to a new expression for rank (similar to Equation 7.9 for eigenvector centrality):
\[
(7.27)\quad r_i = \frac{1 - d}{n} + d \left( \sum_{j=1}^{n} M_{ji} r_j \right) \quad \text{for } i = 1, \ldots, n
\]
Here the d term acts like a damping factor on walks through the Markov chain. In essence, it stalls people as they walk, making it less likely a searcher will keep walking forever. The original system of equations 7.27 can be written in matrix form as:
\[
(7.28)\quad r = \frac{1 - d}{n} \mathbf{1} + d M^T r
\]
where 1 is an n × 1 vector consisting of all 1's. It is easy to see that when d = 1, r is precisely the stationary state probability vector for the induced Markov chain. When d ≠ 1, r is usually computed iteratively, by starting with an initial value of r_i^(0) = 1/n for all i = 1, ..., n and computing:
\[
r^{(k)} = \frac{1 - d}{n} \mathbf{1} + d M^T r^{(k-1)}
\]
The reason is that for large n, the analytic solution:

\[
(7.29)\quad r = \left( I_n - d M^T \right)^{-1} \frac{1 - d}{n} \mathbf{1}
\]
is not computationally tractable4.
Example 7.67. Consider the induced Markov chain in Figure 7.9 and suppose we wish to compute PageRank on these vertices with d = 0.85 (which is a common assumption). We might begin with:
\[
r^{(0)} = \begin{bmatrix} \tfrac{1}{4} \\ \tfrac{1}{4} \\ \tfrac{1}{4} \\ \tfrac{1}{4} \end{bmatrix}
\]
We would then compute:
\[
r^{(1)} = \frac{1 - d}{n} \mathbf{1} + d M^T r^{(0)} =
\begin{bmatrix} 0.462499999999999967 \\ 0.214583333333333320 \\ 0.214583333333333320 \\ 0.108333333333333337 \end{bmatrix}
\]
4Note, computing (I_n − d M^T)^{-1} requires a matrix inverse, which we reviewed briefly in Chapter 4. We should note that for stochastic matrices, this inverse is guaranteed to exist. For those interested, please consult [Dat95, Lan87, Mey01].
We would repeat this again to obtain:

\[
r^{(2)} = \frac{1 - d}{n} \mathbf{1} + d M^T r^{(1)} =
\begin{bmatrix} 0.311979166666666641 \\ 0.259739583333333302 \\ 0.259739583333333302 \\ 0.168541666666666673 \end{bmatrix}
\]
This would continue until the difference between the values of r^(k) and r^(k−1) was small. The final solution would be close to the exact solution:
\[
r^* = \begin{bmatrix} 0.366735867135100591 \\ 0.245927818588310476 \\ 0.245927818588310393 \\ 0.141408495688278513 \end{bmatrix}
\]
Note this is (again) very close to the stationary probabilities and the eigenvector centralities we observed earlier. This vector is normalized so that all the entries sum to 1.
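The iteration of Example 7.67 can be reproduced in a few lines (a sketch, with the stochastic matrix transcribed from Figure 7.9; it is not the author's code):

import numpy as np

# Stochastic matrix of the induced Markov chain in Figure 7.9 (vertex 1 has
# out-degree 3, vertices 2 and 3 have out-degree 2, vertex 4 has out-degree 1).
M = np.array([[0,   1/3, 1/3, 1/3],
              [1/2, 0,   1/2, 0  ],
              [1/2, 1/2, 0,   0  ],
              [1,   0,   0,   0  ]])

d, n = 0.85, 4
r = np.full(n, 1 / n)                      # r^(0)
for _ in range(200):
    r = (1 - d) / n + d * (M.T @ r)        # the iteration in Equation 7.28
print(r)   # approximately [0.3667, 0.2459, 0.2459, 0.1414], the exact solution above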

Exercise 98. Consider the Markov chain shown below. Suppose this is the induced Markov chain from 4 web pages. Compute the PageRank of these web pages using d = 0.85.

Exercise 99. Find an expression for r^(2) in terms of r^(0). Explain how the damping factor occurs and how it decreases the chance of taking long walks through the induced Markov chain. Can you generalize your expression for r^(2) to an expression for r^(k) in terms of r^(0)?

9. The Graph Laplacian Remark 7.68. In this last section, we return to simple graphs and discuss the Graph Laplacian matrix, which can be used to partition the vertices of a graph in a sensible way.

Definition 7.69 (Degree Matrix). Let G = (V,E) be a simple graph with V = {v1, . . . , vn}. The degree matrix is the diagonal matrix D with the degree of each vertex in the diagonal. That is Dii = deg(vi) and Dij = 0 if i 6= j.

Example 7.70. Consider the graph in Figure 7.10. It has degree matrix:

Figure 7.10. A set of triangle graphs.

\[
D = \begin{bmatrix}
2 & 0 & 0 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 & 0 & 0 \\
0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 2 & 0 & 0 \\
0 & 0 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 0 & 2
\end{bmatrix},
\]
because each of its vertices has degree 2.
Definition 7.71 (Graph Laplacian). Let G = (V,E) be a simple graph with V = {v_1, ..., v_n}, adjacency matrix M and degree matrix D. The Graph Laplacian Matrix is the matrix L = D − M.
Example 7.72. The graph shown in Figure 7.10 has adjacency matrix:
\[
M = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 1 & 0
\end{bmatrix}
\]
Therefore, it has Laplacian:
\[
L = \begin{bmatrix}
2 & -1 & -1 & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & 0 & 0 \\
-1 & -1 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 2 & -1 & -1 \\
0 & 0 & 0 & -1 & 2 & -1 \\
0 & 0 & 0 & -1 & -1 & 2
\end{bmatrix}
\]
Remark 7.73. Notice the row-sum of each row in the Laplacian matrix is zero. The Laplacian matrix is also symmetric. This is not an accident; it will always be the case.
Proposition 7.74. Let G be a graph with Laplacian matrix L; then L is symmetric.
Proof. Let D and M be the (diagonal) degree matrix and the adjacency matrix respectively. Both D and M are symmetric. Therefore L = D − M is symmetric, since L^T = (D − M)^T = D^T − M^T = D − M = L.
Lemma 7.75. The row-sum of the adjacency matrix of a simple graph is the degree of the corresponding vertex.
Exercise 100. Prove Lemma 7.75.
Corollary 7.76. The row-sum for each row of the Laplacian matrix of a simple graph is zero.
Theorem 7.77. If L ∈ R^{n×n} is the Laplacian matrix of a graph, then 1 = ⟨1, 1, ..., 1⟩ ∈ R^n is an eigenvector of L with eigenvalue 0.
Proof. Let:

\[
(7.30)\quad L = \begin{bmatrix}
d_{11} & -a_{12} & -a_{13} & \cdots & -a_{1n} \\
-a_{21} & d_{22} & -a_{23} & \cdots & -a_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-a_{n1} & -a_{n2} & -a_{n3} & \cdots & d_{nn}
\end{bmatrix}
\]
Let v = L1. Then from Equation 2.2, we know that:
\[
(7.31)\quad v_i = L_{i\cdot} \cdot \mathbf{1} = d_{ii} - a_{i1} - a_{i2} - \cdots - a_{in} = 0
\]
(here a_{ii} = 0), since the diagonal entry d_{ii} is exactly the sum of the off-diagonal entries of that row. Thus v_i = 0 for i = 1, ..., n and v = 0. Thus:
\[
L \mathbf{1} = \mathbf{0} = 0 \cdot \mathbf{1}
\]

Thus 1 is an eigenvector with eigenvalue 0. This completes the proof.  Remark 7.78. It is worth noting that 0 can be an eigenvalue, but the zero vector 0 cannot be an eigenvector. Definition 7.79 (Subgraph). Let G = (V,E) be a graph. If H = (V 0,E0) and V 0 ⊆ V and E0 ⊆ E, then H is a subgraph of G. Remark 7.80. We know from the Principal Axis Theorem (Theorem 5.55) that L must have n linearly independent (and orthogonal) eigenvectors that form a basis for Rn, since its a real symmetric matrix. We’ll use that fact shortly. Example 7.81. Consider the graph shown in Figure 7.10. One of the two triangles is a (proper) subgraph of this graph. The graph is a subgraph of itself (an improper) subgraph. Definition 7.82 (Component). Suppose G = (V,E) is not a connected graph. If H is a subgraph of G and H is connected and for any vertex v not in H, there is no path from v to any vertex in H, then H is a component of G. Example 7.83. The graph in Figure 7.10 has two components. Each triangle is a com- ponent. For example, let H be the left triangle. For any vertex v from the right triangle, there is not path from v to any vertex in the left triangle. Therefore, H is a component. 122 Theorem 7.84. Let G = (V,E) be a graph with V = {v1, . . . , vn} and with Laplacian L. Then the (algebraic) multiplicity of the eigenvalue 0 is equal to the number of components of G.

Proof. Assume G has more than 1 component; order the components H_1, ..., H_k and suppose that component H_i has n_i vertices. Then n_1 + n_2 + ··· + n_k = n. Each component has its own Laplacian matrix L_i for i = 1, ..., k, and the Laplacian matrix of G is the block diagonal matrix:

\[
L = \begin{bmatrix}
L_1 & 0 & \cdots & 0 \\
0 & L_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & L_k
\end{bmatrix}
\]
The fact that 1_i (a vector of 1's with dimension appropriate to L_i) is an eigenvector for L_i with eigenvalue 0 implies that:
\[
v_i = \langle 0, \cdots, \mathbf{1}_i, \cdots, 0 \rangle
\]
is an eigenvector for L with eigenvalue 0. Thus, L has eigenvalue 0 with at least multiplicity k. Now suppose v is an eigenvector with eigenvalue 0. Then:
\[
L v = \mathbf{0}
\]

That is, v ∈ Ker(fL), that is v is in the kernel of the linear transform fL(x) = Lx. We have so far proved:

dim (Ker(fL)) ≥ k since each eigenvector vi is linearly independent of any other eigenvector vj for i 6= j. Thus, the basis of Ker(fL) contains at least k vectors. On the other hand, it is clear by construction that the rank of the Laplacian matrix Li is exactly ni − 1. The structure of L ensures that the rank of L is:

n1 − 1 + n2 − 1 + ··· + nk − 1 = n − k But we know from the rank-nullity theorem that:

rank(L) = dim (Im(fL)) = n − k and:

n = dim (Im(fL)) + dim (Ker(fL)) = n − k + y and y ≥ k. But it follows that y must be exactly k. Therefore, the multiplicity of the eigenvalue 0 is precisely the number of components.  Remark 7.85. We state the following fact without proof. Its proof can be found in [GR01] (Lemma 13.1.1). It is a consequence of the fact that the Laplacian matrix is positive semi-definite, meaning that for any v ∈ Rn, the (scalar) quantity: vT Lv ≥ 0 Lemma 7.86. Let G be a graph with Laplacian matrix L. The eigenvalues of L are all non-negative.  123 Definition 7.87 (Fiedler Value/Vector). Let G with n vertices be a graph with Lapla- cian L and eigenvalues {λn, . . . , λ1} ordered from largest to smallest (i.e., so that λn ≥ λn−1 ≥ · · · ≥ λ1). The second smallest eigenvalue λ2 is called the Fiedler value and its corresponding eigenvector is called the Fiedler vector.
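A quick numerical check of Theorem 7.84 and Definition 7.87 on the graph of Figure 7.10 (a sketch, not part of the notes):

import numpy as np

# Adjacency matrix of the two-triangle graph in Figure 7.10.
M = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
D = np.diag(M.sum(axis=1))     # degree matrix
L = D - M                      # graph Laplacian

print(np.sort(np.linalg.eigvalsh(L)))
# [0, 0, 3, 3, 3, 3]: the eigenvalue 0 has multiplicity 2, matching the two
# components (Theorem 7.84); the Fiedler value is 0 since the graph is disconnected.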

Proposition 7.88. Let G be a graph with Laplacian matrix L. The Fiedler value λ2 > 0 if and only if G is connected. Proof. If G is connected, it has 1 component and therefore the multiplicity of the 0 eigenvalue is 1. By Lemma 7.86, λ2 > 0. On the other hand, suppose that λ2 > 0, then necessarily λ1 = 0 and has multiplicity 1.  Remark 7.89. We state a remarkable fact about the Fiedler vector, whose proof can be found in [Fie73].

Theorem 7.90. Let G = (V,E) be a graph with V = {v1, . . . , vn} and with Laplacian matrix L. If v is the eigenvector corresponding to the Fiedler value λ2 then the set of vertices:

V (v, c) = {vi ∈ V : vi ≥ c} and the edges between these vertices form a connected sub-graph.  Remark 7.91. In particular, this means that if c = 0, then the vertices whose indices correspond to the positive entries in v allow for a natural bipartition of the vertices of G. This bipartition is called a spectral cluster and it is useful in many areas of modern life. In particular, it can be useful for finding groupings of individuals in social networks. Example 7.92. Consider the social network shown in Figure 7.11. If we compute the


Figure 7.11. A simple social network.
Fiedler value for this graph we see it is λ_2 = 3 − √5 > 0, since the graph is connected. The corresponding Fiedler vector is:
\[
v = \left( \tfrac{1}{2}\left(-1 - \sqrt{5}\right),\ \tfrac{1}{2}\left(-1 - \sqrt{5}\right),\ \tfrac{1}{2}\left(\sqrt{5} - 3\right),\ 1,\ \tfrac{1}{2}\left(1 + \sqrt{5}\right),\ 1 \right)
\approx \{-1.61803,\ -1.61803,\ -0.381966,\ 1.,\ 1.61803,\ 1.\}
\]

124 Thus, setting c = 0 and assuming the vertices are in alphabetical order, a natural partition of this social network is:

V1 = {Alice, Bob, Cheryl}

V2 = {David, Edward, Finn}
That is, we have grouped together the vertices with negative entries in the Fiedler vector and grouped together the vertices with positive entries in the Fiedler vector. This is illustrated in Figure 7.12. It is worth noting that if an entry is 0 (i.e., on the border) that vertex can be


Figure 7.12. A graph partition using positive and negative entries of the Fiedler vector. placed in either partition or placed in a partition of its own. It usually bridges two distinct vertex groups together within the graph structure.
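The same spectral-bipartition recipe is easy to run on any small graph; below is a sketch on a made-up example graph (two triangles joined by a bridge edge, not the social network above), only to illustrate Remark 7.91:

import numpy as np

# A hypothetical 6-vertex graph: triangle {1,2,3}, triangle {4,5,6}, bridge 3-4.
M = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
L = np.diag(M.sum(axis=1)) - M

vals, vecs = np.linalg.eigh(L)       # eigenvalues returned in ascending order
fiedler_value = vals[1]
fiedler_vector = vecs[:, 1]
print(fiedler_value)                 # > 0, since this graph is connected
print(fiedler_vector >= 0)           # c = 0 splits the vertices into two groups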


CHAPTER 8

Linear Algebra and Systems of Differential Equations

1. Goals of the Chapter
(1) Introduce differential equations and systems of differential equations.
(2) Introduce Linear Homogeneous Systems of Equations.
(3) Show how Taylor series can be used to compute a matrix exponent.
(4) Show how the Jordan Decomposition can be used to solve a system of linear homogeneous equations.
(5) Explain the origin of multiples of t in systems with repeated eigenvalues.
2. Systems of Differential Equations
Definition 8.1 (System of Ordinary Differential Equations). A system of ordinary differential equations is an equation system involving (unknown) functions of a single common (independent) variable and their derivatives.
Remark 8.2. The notion of order for a system of ordinary differential equations is simply the order of the highest derivative (i.e., second derivative, third derivative etc.).
Remark 8.3. It is worth noting that any order n system of differential equations can be transformed to an equivalent order n − 1 system of differential equations. We illustrate using the equation
\[
(8.1)\quad \ddot{y} - \alpha \dot{y} = A y
\]
Define v = ẏ. Then we may rewrite Equation 8.1 as the system of differential equations:
\[
(8.2)\quad \begin{cases} \dot{v} = A y + \alpha v \\ \dot{y} = v \end{cases}
\]
Thus any order n system of differential equations can be reduced to a first order system of differential equations.
Definition 8.4 (Initial Value Problem). Consider a system of first order differential equations with unknown functions y_1, ..., y_n. If we are provided with information of the form y_1(a) = r_1, ..., y_n(a) = r_n, for some a ∈ R and constants r_1, ..., r_n, then the problem is called an initial value problem.
Definition 8.5 (Linear Differential Equation). A system of differential equations is linear if unknown functions and their derivatives appear only as monomials, possibly multiplied by known functions of the independent variable.
Example 8.6. While looking awfully non-linear, the following is a linear differential equation for y(x):
\[
(8.3)\quad y' + \sin(x)\, y = \cos(x)
\]
while the following system of differential equations for u(t) and v(t) is nonlinear:
\[
(8.4)\quad \begin{cases} \dot{u} = \alpha u - \beta u v \\ \dot{v} = \gamma u v - \delta v \end{cases}
\]

for α, β, γ, δ ∈ R. In general and in keeping with Remark 8.3, (and following Strogatz [Str01]), we will focus on differential equation systems with a special form. Let x1(t), . . . , xn(t) be n unknown functions with independent time variable t. We focus on the system:

\[
(8.5)\quad \begin{cases} \dot{x}_1 = f_1(x_1, \ldots, x_n) \\ \quad\vdots \\ \dot{x}_n = f_n(x_1, \ldots, x_n) \end{cases}
\]
Here we assume that for i = 1, ..., n, f_i(x_1, ..., x_n) has derivatives of all orders. In this case we say that f_i(x_1, ..., x_n) is smooth. Let:

\[
(8.6)\quad \mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}
\]
If we let F : R^n → R^n be a (smooth) vector valued function given by

\[
(8.7)\quad \mathbf{F}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_n(x_1, \ldots, x_n) \end{bmatrix}
\]
then we can write System 8.5 as:
\[
(8.8)\quad \dot{\mathbf{x}} = \mathbf{F}(\mathbf{x})
\]
Notice that t does not appear explicitly in these equations. Thus, Equation 8.3 cannot be described in this way, but the nonlinear equations given in System 8.4 most certainly fit this pattern.
Definition 8.7 (System of Autonomous Differential Equations). The system of differential equations defined by Equation 8.5 (or Equation 8.8) is called a system of autonomous differential equations. Notice that time (t) does not appear explicitly anywhere.
Definition 8.8 (Orbit). Consider the initial value problem
\[
\dot{\mathbf{x}} = \mathbf{F}(\mathbf{x})
\]

x(0) = x0

1 Any solution x(t; x0) to this problem is called an orbit .

1 n Some authors make a distinction between the parametric curve x(t; x0) and the set of points in R that are defined by this curve. The latter is called an orbit while the former is called a trajectory. We will not be that precise in these notes. 128 Example 8.9. Linear systems with constant coefficients fit the pattern of System 8.5. Consider the following differential equation system: x˙ = αx − βy (8.9) y˙ = γy − δx

In this case, we can rewrite F(x) using matrix notation. System 8.9 can be expressed as:

\[
(8.10)\quad \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix} = \begin{bmatrix} \alpha & -\beta \\ -\delta & \gamma \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}
\]
As we will see, many of the properties of the solutions of this system depend on the properties of the coefficient matrix.

Definition 8.10 (Linear System with Constant Coefficients). Let A be an n × n matrix and x be a vector of n unknown functions of a dependent variable (in our case t). Then:

(8.11) x˙ = Ax is a linear system of differential equations with constant coefficients.

Remark 8.11. For our purposes, we will always assume entries in A are drawn from R. Exercise 101. Consider the differential equation:

\[
(8.12)\quad \ddot{y}(t) + \dot{y}(t) + \cos(t)\, y = \sin(t)
\]

Use the technique of converting a second order differential equation into a first order dif- ferential equation to write this as a system of first order equations. Then show that the resulting system can be written in the form:

(8.13) y˙ = A(t)y + b(t) where y is a vector of unknown functions, A(t) is a matrix with non-constant entries (func- tions of t) and b(t) is a vector with non-constant entries (functions of t).

Remark 8.12 (Linear Homogeneous System). Let A(t) be an n × n matrix of time varying functions (that do not depend on x1,..., xn). The system, then:

(8.14) x˙ = A(t)x is a linear homogeneous system of differential equations. If b(t) is a vector of time varying functions (that do not depend on x1,..., xn), then:

\[
(8.15)\quad \dot{\mathbf{x}} = A(t)\mathbf{x} + \mathbf{b}(t)
\]
is just a (first order) linear system of differential equations. It turns out that there is a known form for the solution of equations of this type, and they are extremely useful in the field of control theory [Son98]. The interested reader can consult [Arn06].
3. A Solution to the Linear Homogeneous Constant Coefficient Differential Equation
Consider the differential equation:
\[
(8.16)\quad \dot{x} = a x
\]
Equation 8.16 can be easily solved:
\[
\dot{x} = a x \implies \frac{dx}{dt} = a x
\implies \frac{dx}{x} = a\, dt
\implies \int \frac{1}{x}\, dx = \int a\, dt
\implies \log(x) = a t + C
\implies x(t) = \exp(a t + C) = A \exp(a t)
\]

where A = exp(C). If we are given the initial value x(0) = x0, then the exact solution is:

\[
(8.17)\quad x(t) = x_0 \exp(a t)
\]
It is easy to see that when a > 0, the solution explodes as t → ∞ and if a < 0, the solution collapses to 0 as t → ∞. Consider now the natural generalization of Equation 8.16, given by the linear homogeneous differential equation in Expression 8.11,
\[
\dot{\mathbf{x}} = A\mathbf{x}
\]

and the associated initial value problem x(0) = x_0. In this section, we are interested in a general solution to this problem, and what this solution can tell us about non-linear differential equations. Given what we know about exponential growth already, we intuitively wish to write the solution:
\[
(8.18)\quad \mathbf{x}(t) = e^{At} \cdot \mathbf{x}_0 = \exp(At) \cdot \mathbf{x}_0,
\]
but it's not entirely clear what such an expression means. Certainly, we could use the Taylor series expansion for the exponential function and argue that:
\[
(8.19)\quad \exp(At) = (At)^0 + At + \frac{(At)^2}{2!} + \frac{(At)^3}{3!} + \cdots = I_n + At + \frac{A^2 t^2}{2!} + \frac{A^3 t^3}{3!} + \cdots
\]

Given this assumption, exp(At) · x_0 is a matrix / vector product. The following theorem can be proved using formal differentiation on the Taylor series expansion:
Theorem 8.13. The function:
\[
(8.20)\quad \mathbf{x}(t) = e^{At} \cdot \mathbf{x}_0 = \exp(At) \cdot \mathbf{x}_0 = \left( I_n + At + \frac{A^2 t^2}{2!} + \frac{A^3 t^3}{3!} + \cdots \right) \cdot \mathbf{x}_0
\]
is a solution to the initial value problem:
\[
\begin{cases} \dot{\mathbf{x}} = A\mathbf{x} \\ \mathbf{x}(0) = \mathbf{x}_0 \end{cases}
\]

130 Exercise 102. Use formal differentiation on the power series to prove the previous theorem.

Remark 8.14. Theorem 8.13 does not tell us when this solution exists (i.e., when the series converges) or in fact how to compute exp (At). Nor does it tell us what these solutions “look” like in so far as their long-term behavior. For example, in the case of Equation 8.16, we know that if a > 0 the solution tends to ∞ as t → ∞. However, it turns out that for many matrices, exp (At) can be computed rather easily using diagonalization. More importantly, the properties we’ll see for diagonalizable matrices carry over for any matrix A and, in fact, can be used to some extent in studying non-linear systems through linearization, as we’ll discuss in the next sections.
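Equation 8.19 can be turned directly into a (naive) computation; the sketch below is not from the notes and simply truncates the series after a fixed number of terms:

import numpy as np

def expm_taylor(A, t, terms=30):
    """Truncated Taylor series for exp(At) (Equation 8.19); a sketch only, not a
    numerically robust way to compute a matrix exponential."""
    n = A.shape[0]
    result = np.eye(n)
    term = np.eye(n)
    for j in range(1, terms):
        term = term @ (A * t) / j          # (At)^j / j!, built incrementally
        result = result + term
    return result

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
t = np.pi / 3
print(expm_taylor(A, t))
# approximately [[cos t, -sin t], [sin t, cos t]], as derived in Example 8.17 below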

Remark 8.15. The following theorem helps explain why we've expended so much energy discussing matrix diagonalization. It is, essentially, the key to understanding linear homogeneous systems.

Theorem 8.16. Suppose A is diagonalizable and A = PDP^{-1}. Then
\[
(8.21)\quad \exp(A) = P \exp(D) P^{-1}
\]
and if:

\[
(8.22)\quad D = \begin{bmatrix}
\lambda_1 & 0 & 0 & \cdots & 0 \\
0 & \lambda_2 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \cdots & \vdots \\
0 & 0 & 0 & \cdots & \lambda_n
\end{bmatrix}
\]
then:
\[
(8.23)\quad \exp(D) = \begin{bmatrix}
e^{\lambda_1} & 0 & 0 & \cdots & 0 \\
0 & e^{\lambda_2} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \cdots & \vdots \\
0 & 0 & 0 & \cdots & e^{\lambda_n}
\end{bmatrix}
\]
This gives us a very nice meaning to the seemingly impossible idea of raising a number to the power of a matrix. However, it is only for a diagonal matrix, such as D, that exp(D) can be assigned this meaning.

Sketch of Proof. Note that:
\[
(8.24)\quad A^n = \left(PDP^{-1}\right)^n = (PDP^{-1})(PDP^{-1}) \cdots (PDP^{-1}) = P D^n P^{-1}
\]

Applying this fact and Equation 8.20 (when t = 1) allows us to deduce the theorem.
Exercise 103. Finish the details of the proof of Theorem 8.16.
4. Three Examples
Example 8.17. We can now use the results of Theorem 8.16 to find a solution to:
\[
(8.25)\quad \begin{cases} \dot{x} = -y \\ \dot{y} = x \end{cases}
\]
with initial value x(0) = x_0 and y(0) = y_0. Note that the system of differential equations can be written in matrix form as:
\[
(8.26)\quad \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}
\]
Thus we know:
\[
(8.27)\quad A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}
\]
From Example 4.64 and Theorem 8.16 we know the solution is:
\[
(8.28)\quad \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} =
\begin{bmatrix} -i & i \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} e^{-it} & 0 \\ 0 & e^{it} \end{bmatrix}
\begin{bmatrix} \tfrac{i}{2} & \tfrac{1}{2} \\ -\tfrac{i}{2} & \tfrac{1}{2} \end{bmatrix}
\cdot \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
\]
Unfortunately, this is not such a convenient expression. To simplify it, we first expand the matrix product to obtain:
\[
(8.29)\quad \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} =
\begin{bmatrix} \tfrac{1}{2} e^{-it} + \tfrac{1}{2} e^{it} & -\tfrac{1}{2} i e^{-it} + \tfrac{1}{2} i e^{it} \\ \tfrac{1}{2} i e^{-it} - \tfrac{1}{2} i e^{it} & \tfrac{1}{2} e^{-it} + \tfrac{1}{2} e^{it} \end{bmatrix}
\cdot \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
\]
This simplifies to:
\[
(8.30)\quad \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = \frac{1}{2}
\begin{bmatrix} e^{-it} + e^{it} & i\left(e^{it} - e^{-it}\right) \\ i\left(e^{-it} - e^{it}\right) & e^{-it} + e^{it} \end{bmatrix}
\cdot \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
\]
Next, we remember Euler's Formula:
\[
(8.31)\quad e^{it} = \cos(t) + i \sin(t)
\]
and note that:
\[
(8.32)\quad e^{-it} + e^{it} = (\cos(t) - i\sin(t)) + (\cos(t) + i\sin(t)) = 2\cos(t)
\]
while
\[
(8.33)\quad e^{it} - e^{-it} = (\cos(t) + i\sin(t)) - (\cos(t) - i\sin(t)) = 2i\sin(t)
\]
Thus:
\[
(8.34)\quad i\left(e^{it} - e^{-it}\right) = 2 i^2 \sin(t) = -2\sin(t)
\]
Thus we conclude that:
\[
(8.35)\quad \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} =
\begin{bmatrix} \cos(t) & -\sin(t) \\ \sin(t) & \cos(t) \end{bmatrix}
\cdot \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
\]
But this matrix is none other than a (counter-clockwise) rotation matrix that, when multiplied by the vector [x_0, y_0]^T, will rotate it by an angle t; so any specific solution to the initial value problem is a vector of constant length rotating around the origin. The initial vector is the vector of initial conditions. This is illustrated in Figure 8.1. It is worth noting that for arbitrary x_0 and y_0, any parametric solution curve is an orbit. One can expand the expression


Figure 8.1. The solution to the differential equation can be thought of as a vector of fixed unit rotation about the origin.

for the orbits of this problem to obtain exact functional representations for x(t) and y(t) as needed:

\[
(8.36)\quad x(t) = x_0 \cos(t) - y_0 \sin(t)
\]
\[
(8.37)\quad y(t) = x_0 \sin(t) + y_0 \cos(t)
\]

For representative initial values x0 = 1 and y0 = 1, we plot the resulting functions in Figure 8.2.
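For these initial values the closed form can also be checked against the matrix exponential computed by diagonalization (a sketch, not from the notes):

import numpy as np

# Numerical check that exp(At) x0 agrees with Equations 8.36-8.37.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
x0 = np.array([1.0, 1.0])
t = 0.75

evals, P = np.linalg.eig(A)                                # eigenvalues are +i and -i
expAt = (P @ np.diag(np.exp(evals * t)) @ np.linalg.inv(P)).real
print(expAt @ x0)
print([x0[0] * np.cos(t) - x0[1] * np.sin(t),
       x0[0] * np.sin(t) + x0[1] * np.cos(t)])             # the same two numbers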


Figure 8.2. A plot of representative solutions for x(t) and y(t) for the simple homogeneous linear system in Expression 8.25.
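The closed-form solution in Equations 8.36–8.37 can also be checked against a direct evaluation of exp(At). The following Python sketch assumes NumPy and SciPy are available and uses the representative initial values $x_0 = y_0 = 1$ from Figure 8.2.

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
x0, y0 = 1.0, 1.0                       # initial values used in Figure 8.2

for t in np.linspace(0.0, 6.0, 7):
    # Closed-form solution from Equations 8.36-8.37.
    x = x0 * np.cos(t) - y0 * np.sin(t)
    y = x0 * np.sin(t) + y0 * np.cos(t)
    # Direct evaluation of exp(At) applied to the initial condition.
    direct = expm(A * t) @ np.array([x0, y0])
    assert np.allclose([x, y], direct)  # both agree at every sampled time
\end{verbatim}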

Exercise 104. Show that the second order differential equation $\ddot{x} + x = 0$ yields the system of differential equations just analyzed. This second order ODE is called a harmonic oscillator.

Exercise 105. Show that the total square velocity $\dot{x}^2 + \dot{y}^2$ is constant for the differential system just analyzed.

Example 8.18. Consider the second order differential equation:
(8.38) $\ddot{x} - \dot{x} + x = 0$
Letting $y = \dot{x}$, we can rewrite the previous differential equation (Equation 8.38) as $\dot{y} = y - x$ and obtain the first order system:
(8.39) $\begin{cases} \dot{x} = y \\ \dot{y} = -x + y \end{cases}$
Using the same methods as in Example 8.17, one can show that the eigenvalues of the matrix corresponding to System 8.39,
$\begin{bmatrix} 0 & 1 \\ -1 & 1 \end{bmatrix}$,
are:
$\frac{1}{2} \pm \frac{\sqrt{3}}{2}i$
Notice that these eigenvalues are the roots of the characteristic equation of Equation 8.38, which explains the importance of the characteristic equation; this is true in general. Now, using the same approach as in Example 8.17, one can show that for initial values $x(0) = x_0$ and $y(0) = y_0$ the solution to this differential equation is:
(8.40) $x(t) = \frac{1}{3}e^{t/2}\left(3x_0\cos\left(\frac{\sqrt{3}t}{2}\right) - \sqrt{3}(x_0 - 2y_0)\sin\left(\frac{\sqrt{3}t}{2}\right)\right)$
(8.41) $y(t) = \frac{1}{3}e^{t/2}\left(\sqrt{3}(y_0 - 2x_0)\sin\left(\frac{\sqrt{3}t}{2}\right) + 3y_0\cos\left(\frac{\sqrt{3}t}{2}\right)\right)$

Figure 8.3 illustrates the solution curves to this problem when x0 = y0 = 1.
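Equations 8.40–8.41 can be checked in the same way. The following minimal Python sketch assumes NumPy and SciPy are available and uses the initial values $x_0 = y_0 = 1$ from Figure 8.3.

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-1.0, 1.0]])
x0, y0 = 1.0, 1.0                       # initial values used in Figure 8.3
s = np.sqrt(3.0)

for t in np.linspace(0.0, 10.0, 11):
    # Closed-form solution from Equations 8.40-8.41.
    x = np.exp(t / 2) * (3 * x0 * np.cos(s * t / 2)
                         - s * (x0 - 2 * y0) * np.sin(s * t / 2)) / 3
    y = np.exp(t / 2) * (s * (y0 - 2 * x0) * np.sin(s * t / 2)
                         + 3 * y0 * np.cos(s * t / 2)) / 3
    # Direct evaluation of exp(At) applied to the initial condition.
    direct = expm(A * t) @ np.array([x0, y0])
    assert np.allclose([x, y], direct)
\end{verbatim}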


Figure 8.3. Representative solution curves for Expression 8.39 showing oscillations whose amplitude grows exponentially.

Example 8.19. Consider the following linear system of differential equations:
(8.42) $\begin{cases} \dot{x} = -x - y \\ \dot{y} = x - y \end{cases}$
Using the same methods as in Example 8.17, one can show that for initial values $x(0) = x_0$ and $y(0) = y_0$ the solution to this differential equation is:
(8.43) $x(t) = e^{-t}\cos(t)\,x_0 - e^{-t}\sin(t)\,y_0$
(8.44) $y(t) = e^{-t}\sin(t)\,x_0 + e^{-t}\cos(t)\,y_0$

We notice in this case that the eigenvalues of the matrix for Expression 8.42 are:
$\lambda_1 = -1 + i, \quad \lambda_2 = -1 - i$
In identifying the solution, it is important to remember that:
(8.45) $e^{(-1+i)t} = e^{-t}e^{it} = e^{-t}\left(\cos(t) + i\sin(t)\right)$,
explaining the origin of the $e^{-t}$ factors. Figure 8.4 illustrates the solution curves to this problem when $x_0 = y_0 = 1$. Notice the factor $e^{-t}$ causes the solution curves to decay to 0 exponentially.
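The decaying solution in Equations 8.43–8.44 can be confirmed in the same way. Below is a minimal Python sketch, assuming NumPy and SciPy are available, again with $x_0 = y_0 = 1$.

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

A = np.array([[-1.0, -1.0],
              [ 1.0, -1.0]])
x0, y0 = 1.0, 1.0                       # initial values used in Figure 8.4

for t in np.linspace(0.0, 6.0, 7):
    # Closed-form solution from Equations 8.43-8.44.
    x = np.exp(-t) * (np.cos(t) * x0 - np.sin(t) * y0)
    y = np.exp(-t) * (np.sin(t) * x0 + np.cos(t) * y0)
    # Direct evaluation of exp(At) applied to the initial condition.
    direct = expm(A * t) @ np.array([x0, y0])
    assert np.allclose([x, y], direct)
\end{verbatim}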


Figure 8.4. Representative solution curves for Expression 8.42 showing exponential decay of the system.

Exercise 106. Use matrix diagonalization to show that the solution given to System 8.42 is correct.

5. Non-Diagonalizable Matrices

Remark 8.20. In this final section, we discuss an odd result in ordinary differential equations. When the matrix of a linear system with constant coefficients has a repeated eigenvalue, we are often taught to conjure an extra solution by multiplying $e^{\lambda t}$ by $t$, so that the general solution is a linear combination of $e^{\lambda t}$ and $te^{\lambda t}$. In this section, we discover where that factor of $t$ comes from and dispel the notion that it is some kind of lucky guess.

Remark 8.21 (Non-Diagonalizable A). When $A \in \mathbb{R}^{n \times n}$ is not diagonalizable, we fall back on Theorem 4.68. From this we deduce:
(1) $\exp(At) = P\exp(\Lambda t)\exp(Nt)P^{-1}$, where $N$ is a nilpotent matrix.
(2) The expression $\exp(Nt)$ is a polynomial in $t$. (This may help explain some results you've seen in a class on Differential Equations.) In particular, suppose that $N^k = 0$. Then:
$\exp(Nt) = \sum_{j=0}^{k-1} \frac{N^j t^j}{j!}$,
where $N^0 = I_n$. Thus, we conclude that:
$x(t) = P\exp(\Lambda t)\left(\sum_{j=0}^{k-1} \frac{N^j t^j}{j!}\right)P^{-1} \cdot x_0$
Thus we see that the Jordan decomposition of the matrix $A$ is still very important. Furthermore, the eigenvalues of $A$ are still very important and, as we'll see in the next chapter, these drive the long-term behavior of the solutions of the differential equation.

Example 8.22. Consider the following linear system of differential equations$^2$:
$\dot{x} = 7x + y$
$\dot{y} = -4x + 3y$
We can see that this has the form $\dot{\mathbf{x}} = A\mathbf{x}$ when:
$A = \begin{bmatrix} 7 & 1 \\ -4 & 3 \end{bmatrix}$
The Jordan decomposition of this matrix yields:
$P = \begin{bmatrix} -1 & -\frac{1}{2} \\ 2 & 0 \end{bmatrix} \qquad \Lambda = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix} \qquad N = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$
It is easy to verify that $A$ has two identical eigenvalues, both equal to 5, and that:
$\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}^2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$
We can compute:
$P^{-1} = \begin{bmatrix} 0 & \frac{1}{2} \\ -2 & -1 \end{bmatrix}$
Thus, we can write the solution of the differential equation as:
$\begin{bmatrix} x \\ y \end{bmatrix} = \underbrace{\begin{bmatrix} -1 & -\frac{1}{2} \\ 2 & 0 \end{bmatrix}}_{P} \cdot \underbrace{\begin{bmatrix} e^{5t} & 0 \\ 0 & e^{5t} \end{bmatrix}}_{\exp(\Lambda t)} \cdot \underbrace{\left(\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} 0 & t \\ 0 & 0 \end{bmatrix}\right)}_{\exp(Nt)} \cdot \underbrace{\begin{bmatrix} 0 & \frac{1}{2} \\ -2 & -1 \end{bmatrix}}_{P^{-1}} \cdot \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$
Expanding and simplifying yields:
$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} e^{5t}(2t + 1) & e^{5t}t \\ -4e^{5t}t & e^{5t}(1 - 2t) \end{bmatrix} \cdot \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$
Suppose $x_0 = 2$ and $y_0 = -5$; then expanding we obtain:
$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2e^{5t}(2t + 1) - 5e^{5t}t \\ -5e^{5t}(1 - 2t) - 8e^{5t}t \end{bmatrix} = e^{5t}\begin{bmatrix} -t + 2 \\ 2t - 5 \end{bmatrix}$

Exercise 107. Verify the Jordan decomposition given in the previous example and confirm that the matrix $A$ does have a repeated eigenvalue equal to 5. Confirm (by differentiation) that the proposed solution does satisfy the differential equation.
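The expanded matrix exponential in Example 8.22, and the specific solution for $x_0 = 2$, $y_0 = -5$, can also be checked numerically. The following minimal Python sketch assumes NumPy and SciPy are available; the closed-form matrices come directly from the example above.

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

A = np.array([[7.0, 1.0],
              [-4.0, 3.0]])
x0y0 = np.array([2.0, -5.0])            # initial condition from Example 8.22

for t in np.linspace(0.0, 2.0, 5):
    # Hand-computed exp(At) from Example 8.22.
    hand = np.exp(5 * t) * np.array([[2 * t + 1, t],
                                     [-4 * t, 1 - 2 * t]])
    assert np.allclose(hand, expm(A * t))

    # Specific solution for x0 = 2, y0 = -5.
    specific = np.exp(5 * t) * np.array([-t + 2, 2 * t - 5])
    assert np.allclose(expm(A * t) @ x0y0, specific)
\end{verbatim}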

$^2$This example is taken from http://tutorial.math.lamar.edu/Classes/DE/RepeatedEigenvalues.aspx, where it is presented differently.

