
Course compendium, Applied analysis

Christopher Engström, Karl Lundengård, Sergei Silvestrov

November 14, 2018

Contents

1 Review of basic linear algebra
  1.1 Notation and elementary matrix operations
  1.2 Linear equation systems
    1.2.1 Determinants
  1.3 Eigenvalues and eigenvectors
  1.4 Exercises

2 Matrix factorization and canonical forms
  2.1 Important types of matrices and some useful properties
    2.1.1 Triangular and Hessenberg matrices
    2.1.2 Hermitian matrices
    2.1.3 Unitary matrices
    2.1.4 Positive definite matrices
  2.2 Factorization
    2.2.1 Spectral factorization
    2.2.2 Rank factorization
    2.2.3 LU and Cholesky factorization
    2.2.4 QR factorization
    2.2.5 Canonical forms
    2.2.6 Reduced row echelon form
    2.2.7 Jordan normal form
    2.2.8 Singular value decomposition (SVD)
  2.3 Applications
    2.3.1 Portfolio evaluation using Monte Carlo simulation

3 Non-negative matrices and graphs
  3.1 Introduction to graphs
  3.2 Connectivity and irreducible matrices
  3.3 The Laplacian matrix
  3.4 Exercises
  3.5 Applications
    3.5.1 Shortest path
    3.5.2 Resistance distance

4 Perron-Frobenius theory
  4.1 Perron-Frobenius for non-negative matrices
  4.2 Exercises
  4.3 Applications
    4.3.1 Leontief input-output model

5 Stochastic matrices and Perron-Frobenius
  5.1 Stochastic matrices and Markov chains
  5.2 Irreducible, primitive stochastic matrices
  5.3 Irreducible, imprimitive stochastic matrices
  5.4 Reducible, stochastic matrices
  5.5 Hitting times and hitting probabilities
  5.6 A short look at continuous time Markov chains
  5.7 Exercises
  5.8 Applications
    5.8.1 Return to the Leontief model

6 Linear spaces and projections
  6.1 Linear spaces
  6.2 Inner product spaces
  6.3 Projections
  6.4 Applications
    6.4.1 Fourier approximation
    6.4.2 Finding a QR decomposition using the Gram-Schmidt process

7 Principal component analysis (PCA) and dimension reduction
  7.1 Mean, variance and covariance
  7.2 Principal component analysis
  7.3 Choosing the number of dimensions

8 Linear transformations
  8.1 Linear transformations in geometry
  8.2 Surjection, injection and combined transformations
  8.3 Application: Transformations in computer graphics and homogeneous coordinates
  8.4 Kernel and image
  8.5 Isomorphisms

9 Regression and the Least Square Method
  9.1 Multiple linear regression
    9.1.1 Single linear regression
    9.1.2 Fitting several variables
  9.2 Finding the coefficients
  9.3 Weighted Least Squares Method (WLS)
  9.4 Non-linear regression
  9.5 Pseudoinverses

10 Matrix functions
  10.1 Matrix power function
    10.1.1 Calculating A^n
  10.2 Matrix root function
    10.2.1 Root of a …
    10.2.2 Finding the root of a matrix using the Jordan canonical form
  10.3 Matrix polynomials
    10.3.1 The Cayley-Hamilton theorem
    10.3.2 Calculating matrix polynomials
    10.3.3 Companion matrices
  10.4 Exponential matrix
    10.4.1 Solving matrix differential equations
  10.5 General matrix functions

11 Matrix equations
  11.1 Tensor products
  11.2 Solving matrix equations

12 Numerical methods for computing eigenvalues and eigenvectors
  12.1 The Power method
    12.1.1 Inverse Power method
  12.2 The QR-method
  12.3 Applications
    12.3.1 PageRank

Index

Notation

v              vector (usually represented by a row or column matrix)
|v|            length (Euclidean norm) of a vector
‖v‖            norm of a vector
M              matrix
M^T            transpose
M^{-1}         inverse
M^*            pseudoinverse
conj(M)        complex conjugate
M^H            Hermitian transpose
M^k            the k:th power of the matrix M, see section 10.1
m_{i,j}        the element on the i:th row and j:th column of M
m^{(k)}_{i,j}  the element on the i:th row and j:th column of M^k
M_{i.}         i:th row of M
M_{.j}         j:th column of M
⊗              Kronecker (tensor) product
a              scalar
|a|            absolute value of a scalar
⟨·, ·⟩         inner product (scalar product)
Z              set of integers
Z^+            set of positive integers
R              set of real numbers
C              set of complex numbers
M_{m×n}(K)     set of matrices with dimension m × n and elements from the set K

1 Review of basic linear algebra

The reader should already be familiar with the concepts and content of this chapter; it serves mainly as a reminder and overview of what we already know. We will start by giving some notation used throughout the book as well as reminding ourselves of the most basic matrix operations. We will then take a look at linear systems, eigenvalues and eigenvectors and related concepts as seen in elementary courses in linear algebra. While the methods presented here work, they are often very slow or unstable when used on a computer. Much of our later work will relate to finding better ways to compute these or similar quantities, or to giving other examples of where and why they are useful.

1.1 Notation and elementary matrix operations

We will here go through the most elementary matrix operations as well as give the notation which will be used throughout the book. We remind ourselves that a matrix is a rectangular array of numbers, symbols or expressions. Some examples of matrices can be seen below:

A = [ 1  0 ]      B = [ 0  1−i ]
    [ 0  4 ]

C = [  1  -2 ]    D = [ 1  0  0 ]
    [ -2   3 ]        [ 0  3  0 ]
    [  3   1 ]

We note that matrices need not be square, as in C and D. We also note that although matrices can be complex valued, in most of our examples and practical applications we only need to work with real valued matrices.

• We denote every element of A as a_{i,j}, where i is its row number and j is its column number. For example, looking at C we have c_{3,2} = 1.

• The size of a matrix is the number of rows and columns (or the index of the bottom right element): We say that C above is a 3 × 2 matrix.

• A matrix with only one row or column is called a row vector or column vector respectively. We usually denote vectors with an arrow v and we assume all vectors are column vectors unless stated otherwise.

• The diagonal of a matrix is the elements running diagonally from the top left corner towards the bottom right (a_{1,1}, a_{2,2}, . . . , a_{n,n}). A matrix where all elements except those on the diagonal are zero is called a diagonal matrix. For example A is a diagonal matrix, but so is D, although we will usually only consider square diagonal matrices.

• The trace of a matrix is the sum of the elements on the diagonal.

Matrix addition

Add every element in the first matrix to the corresponding element in the second matrix. Both matrices need to be of the same size for addition to be defined.

[ 1  0 ]   [ 1  -2 ]   [ 1+1  0-2 ]   [ 2  -2 ]
[ 0  2 ] + [ 2  -1 ] = [ 0+2  2-1 ] = [ 2   1 ]

Although addition between a scalar and a matrix is undefined, some authors write it as such, in which case they usually mean to add the scalar to every element of the matrix, as if the scalar were a matrix of the same size with every element equal to that constant.

The general definition of matrix addition is: Let A = B + C, where all three matrices have the same number of rows and columns. Then a_{i,j} = b_{i,j} + c_{i,j} for every element in A.

Matrix multiplication

Given the two matrices A and B we get the product AB as the matrix whose elements e_{i,j} are found by multiplying the elements in row i of A with the elements of column j of B and adding the results.

[ 3  2  0  1 ] [ 3  2  0  1 ]   [ 11  10   7  14 ]
[ 1  2  3  4 ] [ 1  2  3  4 ]   [  5   9  16  24 ]
[ 0  1  2  1 ] [ 0  1  2  1 ] = [  1   4   8   9 ]
[ 0  0  1  3 ] [ 0  0  1  3 ]   [  0   1   5  10 ]

For the highlighted element (row 2, column 1 of the product) we have: 1 · 3 + 2 · 1 + 3 · 0 + 4 · 0 = 5.

We note that we need the number of columns in A to be the same as the number of rows in B, but A can have any number of rows and B can have any number of columns. Also we have:

• Generally AB ≠ BA, why?

• The size of AB is the number of rows of A times the number of columns of B.

• The identity matrix I is the matrix with ones on the diagonal and zeros elsewhere. For the identity matrix we have: AI = IA = A.

Multiplying a scalar with a matrix is done by multiplying the scalar with every element of the matrix.

    [ 1  0 ]   [ 3·1  3·0 ]   [ 3  0 ]
3 · [ 0  2 ] = [ 3·0  3·2 ] = [ 0  6 ]

The general definition of matrix multiplication is: Let A = BC, where B has n columns and C has n rows. Then

a_{i,j} = Σ_{k=1}^{n} b_{i,k} c_{k,j}

for each element in A.
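To make the definition concrete, here is a minimal sketch in Python/NumPy (not part of the original text); the function name matmul_from_definition is our own choice, and the result is compared with NumPy's built-in product.

import numpy as np

def matmul_from_definition(B, C):
    """Compute A = BC from the definition a_ij = sum_k b_ik * c_kj."""
    m, n = B.shape            # B is m x n
    n2, p = C.shape           # C is n x p
    assert n == n2, "number of columns in B must equal number of rows in C"
    A = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                A[i, j] += B[i, k] * C[k, j]
    return A

B = np.array([[3, 2, 0, 1],
              [1, 2, 3, 4],
              [0, 1, 2, 1],
              [0, 0, 1, 3]], dtype=float)

print(matmul_from_definition(B, B))                      # the 4x4 product shown above
print(np.allclose(matmul_from_definition(B, B), B @ B))  # True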

Matrix transpose

The transpose of a matrix is obtained by flipping the rows and columns such that the elements of the first row become the elements of the first column. For example, given the matrix A below:

A = [  1  -2 ]
    [ -1   0 ]
    [  3   1 ]

We get the transpose of A, denoted A^T, as:

A^T = [  1  -1  3 ]
      [ -2   0  1 ]

The general definition of the transpose is: Let B = A^T, then b_{i,j} = a_{j,i} for all elements in B.

For complex valued matrices we usually instead talk about the Hermitian transpose, where we first take the transpose and then take the complex conjugate (change the sign of the imaginary part). We usually denote this by A^H (sometimes it is denoted by A^*, but here we will reserve that notation for pseudoinverses, see section 9.5).

The general definition of the Hermitian transpose (also known as the Hermitian conjugate or conjugate transpose) is: Let B = A^H, then b_{i,j} = (a_{j,i})^*, where * denotes the complex conjugate, for all elements in B.

A real matrix where A = A^T is called a symmetric matrix, and a complex matrix where A^H = A we call a Hermitian matrix.

1.2 Linear equation systems

Very often in applications we end up with a system of linear equations which we want to solve, or find as good a solution to as possible, either repeatedly for the same or similar systems or only once. Matrices are powerful tools which we can use to solve these problems; in fact, even if we have a nonlinear system we very often approximate it with a linear system and try to solve that instead. Linear systems will be treated repeatedly throughout the book, but here we take a short look at the problem itself, when we can solve it, and how, using what we should be familiar with from courses in linear algebra. Given a linear equation system

a_{1,1}x_1 + a_{1,2}x_2 + ... + a_{1,n}x_n = b_1
a_{2,1}x_1 + a_{2,2}x_2 + ... + a_{2,n}x_n = b_2
    ...
a_{m,1}x_1 + a_{m,2}x_2 + ... + a_{m,n}x_n = b_m

we can represent it using a matrix and column vectors

[ a_{1,1}  a_{1,2}  ...  a_{1,n} ] [ x_1 ]   [ b_1 ]
[ a_{2,1}  a_{2,2}  ...  a_{2,n} ] [ x_2 ]   [ b_2 ]
[   ...      ...    ...    ...   ] [ ... ] = [ ... ]
[ a_{m,1}  a_{m,2}  ...  a_{m,n} ] [ x_n ]   [ b_m ]

or in a more compact form as Ax = b. If A is invertible, solving this system of equations is the same as finding the inverse matrix A^{-1} such that A^{-1}A = I and multiplying both sides from the left with this matrix. The reader should be familiar with using Gaussian elimination to write the system in a form that is easy to solve, or going a step further and finding the inverse A^{-1} using Gauss-Jordan elimination. We remember that Gaussian elimination uses elementary row operations (adding and multiplying rows) to find pivot elements in order to bring our matrix to row echelon form, schematically:

[ 0  *  *  *  *  *  *  * ]
[ 0  0  0  *  *  *  *  * ]
[ 0  0  0  0  *  *  *  * ]
[ 0  0  0  0  0  *  *  * ]
[ 0  0  0  0  0  0  0  * ]
[ 0  0  0  0  0  0  0  0 ]
[ 0  0  0  0  0  0  0  0 ]
[ 0  0  0  0  0  0  0  0 ]

After we have done this we can easily solve the system as long as it is solvable. Not all systems can be solved exactly, for example the system below is obviously unsolvable:

[ 1  2 ] [ x_1 ]   [ 1 ]
[ 1  2 ] [ x_2 ] = [ 2 ]

For square matrices we can go one step further by using Gauss-Jordan elimination, reducing the elements above the diagonal to zero as well in the same manner. Similarly, some systems have more than one solution. For example, any point on the line y = b^{-1}(c − ax) will solve the linear equation system

[  4  -1 ] [ x_1 ]   [  2 ]
[ -8   2 ] [ x_2 ] = [ -4 ]

We are often interested in whether there is a single unique solution to a linear equation system.

Theorem 1.1. The following statements are equivalent for A ∈ Mn×n(K):

1. AX = B has a unique solution for all B ∈ Mn×m where X ∈ Mn×m. 2. AX = 0 only has the trivial solution (X = 0).

3. A has linearly independent row/column vectors.

4. A is invertible/non-degenerate/non-singular.

5. det(A) ≠ 0.

6. A has maximum rank (rank(A) = n).

7. A’s row/column space has dimension n.

8. A’s null space has dimension 0.

9. A's row/column vectors span K^n.

Most of these statements should be familiar already; some will be looked at in more depth later, especially when we look at linear spaces in chapter 6.
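The equivalences in Theorem 1.1 are easy to check numerically for a concrete matrix. A small sketch, assuming NumPy is available (the example matrix and vector are our own):

import numpy as np

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
b = np.array([1., 2., 3.])

print(np.linalg.det(A))            # non-zero, so A is invertible (statement 5)
print(np.linalg.matrix_rank(A))    # 3 = n, maximum rank (statement 6)
x = np.linalg.solve(A, b)          # the unique solution of Ax = b (statement 1)
print(np.allclose(A @ x, b))       # True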

1.2.1 Determinants

One of the ways to guarantee that a linear system Ax = b has a unique solution is that the determinant is non-zero, det(A) ≠ 0. We remember the determinant as the product of the eigenvalues. For a small matrix we can easily calculate the determinant by hand; we had the formulas for 2 × 2 and 3 × 3 matrices as:

| a  b |
| c  d | = ad − bc

| a  b  c |
| d  e  f | = aei − afh − bdi + bfg + cdh − ceg
| g  h  i |

And for larger matrices we could write it as a sum of determinants of smaller matrices:

| a  b  c |       | e  f |       | d  f |       | d  e |
| d  e  f | = a · | h  i | − b · | g  i | + c · | g  h |
| g  h  i |

And in general we get the determinant function

det : M_{n×n}(K) → K

det(A) = det(A_{.1}, A_{.2}, ..., A_{.k}, ..., A_{.n})

Some properties of the determinant:

• det(A^T) = det(A)

• det(AB) = det(A) det(B), but in general det(A + B) ≠ det(A) + det(B).

• det(cA) = c^n det(A) for an n × n matrix A

• det is multilinear

• det is alternating

• det(I) = 1

With multilinear we mean that it’s linear in each argument

det(A_{.1}, A_{.2}, ..., a·A_{.k} + b·B, ..., A_{.n})
    = a·det(A_{.1}, A_{.2}, ..., A_{.k}, ..., A_{.n}) + b·det(A_{.1}, A_{.2}, ..., B, ..., A_{.n})

For example, let A be a 3 × 3 matrix and let B, C be 3 × 1 matrices. Let M = [A_{.1}  A_{.2} + B  A_{.3} + C], then

det(M) = det(A_{.1}, A_{.2} + B, A_{.3} + C)
       = det(A_{.1}, A_{.2}, A_{.3} + C) + det(A_{.1}, B, A_{.3} + C)
       = det(A) + det(A_{.1}, B, A_{.3}) + det(A_{.1}, A_{.2}, C) + det(A_{.1}, B, C)

Since the determinant is alternating, switching two columns (or rows) of A switches the sign of det(A):

det(B) = det(A_{.1}, A_{.2}, ..., A_{.k+1}, A_{.k}, ..., A_{.n}) = −det(A)

But more important is what happens when we have two linearly dependent vectors:

det(A_{.1}, A_{.2}, ..., B, B, ..., A_{.n}) = 0

The multilinear and alternating properties then give: linear dependence between the columns (or rows) of A ⇔ det(A) = 0.

Apart from using column or row expansion to calculate the determinant there is an alternative formula for the determinant:

det(A) = Σ_{σ∈S_n} a_{σ(1),1} a_{σ(2),2} ... a_{σ(n),n} sign(σ)     (1)

Here σ is a permutation of the columns (rows), S_n is the set of all possible permutations and a_{σ(i),i} is the element that a_{i,i} is switched with. sign is a function that is equal to 1 if an even number of rows have been switched and −1 if an odd number of rows have been switched. For example, for a 2 × 2 matrix

A = [ a_{1,1}  a_{1,2} ]
    [ a_{2,1}  a_{2,2} ]

we get two possible permutations:

σ^a: do not switch places of the columns, a_{σ^a(1),1} = a_{1,1}, a_{σ^a(2),2} = a_{2,2}.

σ^b: switch places of the columns, a_{σ^b(1),1} = a_{2,1}, a_{σ^b(2),2} = a_{1,2}.

And we get the determinant

det(A) = a_{σ^a(1),1} a_{σ^a(2),2} sign(σ^a) + a_{σ^b(1),1} a_{σ^b(2),2} sign(σ^b) = a_{1,1}a_{2,2} − a_{2,1}a_{1,2}

The basic idea for the formula is as follows.

We start by rewriting the columns as sums of columns of the identity matrix:

A_{.k} = Σ_{σ_k=1}^{n} a_{σ_k,k} I_{.σ_k}

Using the multilinearity of the determinant function we get

det(A) = det( Σ_{σ_1=1}^{n} a_{σ_1,1} I_{.σ_1}, A_{.2}, ..., A_{.n} )

       = Σ_{σ_1=1}^{n} a_{σ_1,1} det(I_{.σ_1}, A_{.2}, ..., A_{.n})

       = Σ_{σ_1=1}^{n} a_{σ_1,1} Σ_{σ_2=1}^{n} a_{σ_2,2} det(I_{.σ_1}, I_{.σ_2}, ..., A_{.n})

       = Σ_{σ_1=1}^{n} a_{σ_1,1} Σ_{σ_2=1}^{n} a_{σ_2,2} ... Σ_{σ_n=1}^{n} a_{σ_n,n} det(I_{.σ_1}, I_{.σ_2}, ..., I_{.σ_n})

Terms where σ_i = σ_k for some i ≠ k, 1 ≤ i, k ≤ n, will be equal to zero, so only the genuine permutations remain:

det(A) = Σ_{σ∈S_n} a_{σ(1),1} a_{σ(2),2} ... a_{σ(n),n} det(I_{.σ(1)}, I_{.σ(2)}, ..., I_{.σ(n)})

The sign function captures the alternating behaviour; it can be written as the product

sign(σ) = Π_{1≤i<j≤n} (σ(j) − σ(i)) / (j − i)

While the formula given here is correct, it is extremely slow for large matrices; for an n × n matrix there are n! permutations. Instead we mostly calculate the determinant by column (row) expansion, as we are used to for 3 × 3 matrices:

det(A) = Σ_{k=1}^{n} a_{k,i} A_{ki}   (expansion along column i)
       = Σ_{k=1}^{n} a_{i,k} A_{ik}   (expansion along row i)        (2)

where A_{ki} is equal to (−1)^{k+i} det(Ã_{ki}) and Ã_{ki} is equal to A with the k:th row and i:th column removed. A_{ki} is called the ki-cofactor of A.
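As an illustration of formula (2), here is a small recursive cofactor expansion in Python/NumPy (a sketch for teaching purposes only; in practice determinants are computed via a factorization such as LU, which is what numpy.linalg.det does, and the example matrix is our own):

import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row, formula (2)."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for k in range(n):
        # remove the first row and the k:th column to get the minor
        A_tilde = np.delete(np.delete(A, 0, axis=0), k, axis=1)
        total += (-1) ** k * A[0, k] * det_cofactor(A_tilde)
    return total

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 10.]])
print(det_cofactor(A), np.linalg.det(A))   # the two values agree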

1.3 Eigenvalues and eigenvectors

We are often interested in finding eigenvalues or eigenvectors of a matrix.

Definition 1.1 If Av = λv, where A ∈ M_{n×n}(K), λ ∈ K, v ∈ M_{n×1}(K), then λ is an eigenvalue of A and v is a (right) eigenvector if v ≠ 0.

We call an eigenvalue λ with corresponding eigenvector v an eigenpair (λ, v).

A matrix can have (and usually does have) multiple eigenvalues and eigenvectors. A single eigenvalue can also be associated with multiple eigenvectors. Often we are interested in the "dominant" eigenvalue or the spectral radius of a matrix A.

Definition 1.2 The set of eigenvalues of a matrix A is called the spectrum of A and is denoted by Sp(A). The largest absolute value among the eigenvalues is called the spectral radius:

ρ(A) = max_{λ∈Sp(A)} |λ|

The eigenvalue with the largest absolute value (if there is only one) is often called the dominant eigenvalue. This eigenvalue and its corresponding eigenvector are often of particular interest, as we will see for example when working with stochastic matrices in chapter 5.

We can find the eigenvalues by solving det(A − λI) = 0, which is an n:th degree polynomial equation in λ. p_A(λ) = det(A − λI) is called the characteristic polynomial and p_A(λ) = 0 is called the characteristic equation.

Some properties of eigenvalues are:

• The trace of A is the sum of the eigenvalues of A: tr(A) = Σ_{i=1}^{n} a_{i,i} = Σ_{i=1}^{n} λ_i

• det(A) = Π_{i=1}^{n} λ_i

• If A ∈ M_{n×n}(K) has n different eigenvalues, then A is diagonalizable.

• If A has no eigenvalue λ = 0, then the determinant is non-zero and the linear system Ax = b has a single unique solution.
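These properties are easy to verify numerically for a concrete matrix; a small sketch assuming NumPy (the 2 × 2 matrix is our own example):

import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])

eigvals, eigvecs = np.linalg.eig(A)                   # columns of eigvecs are eigenvectors
print(eigvals)                                        # 5 and 2
print(np.isclose(eigvals.sum(), np.trace(A)))         # trace = sum of eigenvalues
print(np.isclose(eigvals.prod(), np.linalg.det(A)))   # determinant = product of eigenvalues

# check Av = lambda * v for the first eigenpair
v = eigvecs[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))             # True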

Often when interested in finding the eigenvalues or eigenvectors we first find a ”similar” matrix with the same unknown eigenvalues and then solve the problem for this new matrix instead.

Definition 1.3 Given two n × n matrices A, B, we say that they are similar if there is an invertible n × n matrix P such that

B = P^{-1}AP

Similar matrices share not only eigenvalues but also many other properties because of that.

Theorem 1.2. Let the matrices A, B be similar. Then A, B have the same

• Eigenvalues (but generally not the same eigenvectors)

• Determinant, det(A) = det(B).

• Rank

• Trace

We will look much more at eigenvalues and eigenvectors for special types of matrices in later chapters, as well as how to compute them efficiently in chapter 12.

1.4 Exercises

a) Show that this linear equation system has one single solution.

   −x + 2y −  z = 2
   −x +  y      = 3
   2x −  y + 2z = 1

b) Show that this linear equation system has infinitely many solutions.

    x +  y −  z = 1
   2x +  y + 2z = 2
   3x + 2y −  z = 3

c) Show that this linear equation system has no solution.

    x − 2y +  z = 3
    x + 3y − 2z = 2
   2x +  y −  z = 1

Calculate these determinants

a = det[ 5 ]

b = | 1  5 |
    | 3  2 |

c = | 2   3   4 |
    | 2  -1   5 |
    | 5   0  -1 |

1.4.1 Let

A = [ 3  3  4  4 ]    B = [ 3  3  4  4 ]    C = [ 3  3  4  4 ]
    [ 0  1  2  2 ]        [ 4  1  2  2 ]        [ 4  1  2  2 ]
    [ 0  0  3  3 ]        [ 2  1  3  3 ]        [ 2  1  3  3 ]
    [ 0  0  0  3 ]        [ 2  1  3  3 ]        [ 3  2  1  3 ]

with det(C) = −26. The following determinants should all be very simple to calculate. Calculate them.

a) det(A)

b) det(B)

c) det(C^T)

1.4.2 If the linear equation systems in 1.4 were written on the form Ax = y, for which systems would A−1 exist?

1.4.3 Find the eigenvalues and eigenvectors of

M = [ 4  1 ]
    [ 3  2 ]

1.4.4 Given

A = [  1  2  1 ]    B = [  1  1  2 ]    P = [ 1  0  0 ]
    [ -1  1  0 ]        [  0  3  2 ]        [ 0  0  1 ]
    [  0  2  3 ]        [ -1  0  1 ]        [ 0  1  0 ]

show that A and B are similar.

2 Matrix factorization and canonical forms

Matrix factorizations, also known as matrix decompositions, are methods in which we write one matrix as a product of several matrices. The aim is typically to change the problem into another, easier problem through the choice of these factor matrices. Matrix factorization is very often used in order to speed up and/or stabilize calculations in numerical algorithms, such as when finding eigenvalues of a matrix or solving a linear system.

We will give some examples of applications, but primarily we will introduce different factorizations which we will then use in later chapters. One example we should already be familiar with is the diagonalization of a diagonalizable matrix A by elementary row (or column) operations. Since elementary row operations consist of multiplying rows with non-zero scalars and adding them together, the application of a series of elementary row operations on a linear equation system, Ax = y, can be written as multiplication by a matrix, S, from the left.

Elementary row operations: Ax = y → SAx = Sy

The matrix S is always square, and since it is easy to revert any elementary row operation it will also be invertible. Thus it is always allowed to write

SAS^{-1}Sx = Sy  ⇔  SAS^{-1}x̂ = ŷ,   x̂ = Sx, ŷ = Sy

Suppose that the matrix A was diagonalized in this manner.

SAS^{-1} = D,   D a diagonal matrix

This relation makes the following definition of a diagonalizable matrix quite natural.

Definition 2.1 A matrix A is diagonalizable if there exists an invertible matrix S such that

A = S^{-1}DS

where D is a diagonal matrix. This is of course an interesting property, since the matrix equation Ax = y then has a unique solution whenever the diagonal of D contains no zeros (that is, whenever A has no zero eigenvalue). But there are many other important factorizations, such as:

• Spectral decomposition QΛQ^{-1}

• LU-factorization

• QR-factorization

• Rank factorization CF

• Jordan canonical form S^{-1}JS

• Singular value factorization UΣV^H

• Cholesky factorization GG^H

Before we look at how to find these factorizations and why they are useful we will start by looking at a couple of important types of matrices with some useful properties.

2.1 Important types of matrices and some useful properties

We have already looked extensively at non-negative matrices in the earlier chapters, however there are many other types of matrices which are useful when trying to find a suitable matrix factorization.

2.1.1 Triangular and Hessenberg matrices

A triangular matrix, as we should already know, is a matrix with all zeros above or below the diagonal. A matrix where all elements above the diagonal are zero we call lower triangular, and a matrix where all elements below the diagonal are zero, such as the one below, we call upper triangular.

A = [ *  *  ...  * ]
    [ 0  *  ...  * ]
    [ .  .   .   . ]
    [ 0  0  ...  * ]

We note that a triangular matrix does not need to be square, although that is when they are the most useful. For a non-square matrix the diagonal is always the one starting in the upper left corner.

To understand why triangular matrices are desirable we will consider two important properties. First of all, we can very easily solve a linear system Ax = b if A is triangular, since the last row is an equation of the form a x_n = b. Solving this, we can substitute the answer into the row above and get a new equation in just one variable, and so on.

A second important property of triangular matrices is that the eigenvalues of a triangular matrix are the same as the elements on the diagonal. You can easily check this by looking at what happens with the characteristic polynomial det(A − λI) when A is triangular. This gives rise to one important question: Is it possible to rewrite a matrix as a product of a triangular matrix and another matrix which doesn't change the eigenvalues when multiplied with this matrix, and if possible, can it be done fast?
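A minimal sketch of the back substitution described above, assuming NumPy (the function name back_substitution and the example system are our own):

import numpy as np

def back_substitution(U, b):
    """Solve Ux = b for an upper triangular, invertible U."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                 # start from the last row and work upwards
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

U = np.array([[2., 1., -1.],
              [0., 3.,  2.],
              [0., 0.,  4.]])
b = np.array([3., 7., 8.])
x = back_substitution(U, b)
print(x, np.allclose(U @ x, b))                    # solution and a consistency check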

Related to triangular matrices are the Hessenberg matrices. These are matrices that are 'almost' triangular: instead of demanding that all elements below the diagonal are zero, as for an upper triangular matrix, a matrix is an upper Hessenberg matrix if we also allow the elements immediately below the diagonal to be non-zero. This gives an upper Hessenberg matrix A the form:

A = [ *  *  *  ...  *  *  * ]
    [ *  *  *  ...  *  *  * ]
    [ 0  *  *  ...  *  *  * ]
    [ 0  0  *  ...  *  *  * ]
    [ .  .  .   .   .  .  . ]
    [ 0  0  0  ...  *  *  * ]
    [ 0  0  0  ...  0  *  * ]

While not as immediately useful as a triangular matrix, Hessenberg matrices are useful just from the fact that they are nearly triangular. The diagonal doesn't hold the eigenvalues as in a triangular matrix, but it often gives a good approximation or initial guess for iterative methods. Multiplying an upper Hessenberg matrix with an upper triangular matrix also results in a new upper Hessenberg matrix. We will see examples of how and why Hessenberg matrices are useful later, when we look at the QR-method in section 12.2.

2.1.2 Hermitian matrices

We start by defining the Hermitian conjugate of a matrix:

Definition 2.2 The Hermitian conjugate of a matrix A is denoted A^H and is defined by (A^H)_{i,j} = (A)_{j,i}^*, where * denotes the complex conjugate. In other words, we transpose the matrix and change the sign of the imaginary parts. So if A is real valued, the Hermitian conjugate is the same as the transpose.

Definition 2.3 A matrix is said to be Hermitian (or self-adjoint) if A^H = A.

Notice the similarity with a symmetric matrix, A^T = A; a Hermitian matrix can be seen as the complex extension of a real symmetric matrix. One property of Hermitian (and real symmetric) matrices is that they are normal and thus diagonalizable.

2.1.3 Unitary matrices Matrix factorizations using unitary matrices are often used over other factorization tech- niques because of their stability properties for numerical methods.

Definition 2.4 A matrix, A, is said to be unitary if A^H = A^{-1}.

We will look more at unitary matrices when we look at linear spaces and projections in chapter 6.2. Unitary matrices have a large number of ”good” properties.

Theorem 2.1. Let U be a , then

a) U is always invertible.

b) U−1 is also unitary.

c) | det(U)| = 1

d) (UV)H = (UV)−1 if V is also unitary.

e) For any λ that is an eigenvalue of U, λ = e^{iω} for some 0 ≤ ω ≤ 2π.

f) Let v be a vector, then |Uv| = |v| (for the Euclidean norm).

g) The rows/columns of U are orthonormal, that is U_{i.} U_{j.}^H = 0 for i ≠ j and U_{k.} U_{k.}^H = 1.

An important consequence is that a unitary similarity transformation preserves eigenvalues: if U is unitary, then U^H A U has the same eigenvalues as A. We will meet unitary matrices again in later chapters, where we will need them and can more easily illustrate why they are useful. One example of a unitary matrix is the matrix C below, which rotates a vector by the angle θ around the x-axis:

1 0 0  C = 0 cos(θ) − sin(θ) 0 sin(θ) cos(θ)

2.1.4 Positive definite matrices

Positive definite matrices often come up in practical applications, such as covariance matrices in statistics. Positive definite matrices have many properties which make them easier to work with than general matrices.

Definition 2.5 We consider a square symmetric real valued n × n matrix A, then:

• A is positive definite if x^T A x is positive for all non-zero vectors x.

• A is positive semidefinite if x^T A x is non-negative for all non-zero vectors x.

We define negative definite and negative semidefinite in the same way, except demanding that the quadratic form x^T A x is negative or non-positive instead. Similarly, we can define these notions for matrices with complex valued elements by replacing the transpose with the Hermitian transpose.

2.2 Factorization

As was mentioned in the introduction, factorization (also called decomposition) means taking a matrix and rewriting it as a product of several matrices. This can make it easier to solve certain problems, similarly to how factorization of numbers can make division easier.

19 A well known and previously visited example of factorization is rewriting a diagonal- izable matrix as a product of an invertible matrix and a diagonal matrix

A = S^{-1}DS,   D is diagonal

2.2.1 Spectral factorization

Spectral factorization is a special version of diagonal factorization. It is sometimes referred to as eigendecomposition.

Theorem 2.2. Let A be a square (n × n) matrix with n linearly independent eigenvectors. Then

A = QΛQ^{-1}

where Q is a square n × n matrix whose columns are the eigenvectors of A and Λ is a diagonal matrix with the corresponding eigenvalues.

The spectral factorization can be thought of as a 'standardized' version of diagonal factorization. It also relates to projections, see section 6.3.
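A small numerical sketch of Theorem 2.2 using NumPy (the matrix is our own example and is assumed to be diagonalizable):

import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])

eigvals, Q = np.linalg.eig(A)      # columns of Q are eigenvectors of A
Lam = np.diag(eigvals)             # diagonal matrix of eigenvalues

print(np.allclose(Q @ Lam @ np.linalg.inv(Q), A))   # A = Q Lambda Q^{-1}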

2.2.2 Rank factorization

Theorem 2.3. Let A be an m × n matrix with rank(A) = r (A has r linearly independent rows/columns). Then there exists a factorization A = CF, where C ∈ M_{m×r} and F ∈ M_{r×n}.

We take a short look at how to find such a factorization. We start by rewriting the matrix in reduced row echelon form, for example:

B = [ 0  1  *  0  0  0  *  0 ]
    [ 0  0  0  1  0  0  *  0 ]
    [ 0  0  0  0  1  0  *  0 ]
    [ 0  0  0  0  0  1  *  0 ]
    [ 0  0  0  0  0  0  0  1 ]
    [ 0  0  0  0  0  0  0  0 ]
    [ 0  0  0  0  0  0  0  0 ]
    [ 0  0  0  0  0  0  0  0 ]

We then create C by removing all columns in A that correspond to a non-pivot column in B. In this example we get:

C = [ A_{.2}  A_{.4}  A_{.5}  A_{.6}  A_{.8} ]

We then create F by removing all zero rows in B. In this example we get:

F = [ B_{1.} ]
    [ B_{2.} ]
    [ B_{3.} ]
    [ B_{4.} ]
    [ B_{5.} ]

This gives the rank factorization of A.
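A sketch of the same procedure in Python, using SymPy's rref to get the reduced row echelon form and the pivot columns (the example matrix is our own):

import numpy as np
from sympy import Matrix

A = np.array([[1, 2, 3],
              [2, 4, 6],
              [1, 0, 1]])

B, pivots = Matrix(A.tolist()).rref()          # reduced row echelon form + pivot column indices
r = len(pivots)                                # rank of A

C = A[:, list(pivots)]                         # columns of A at the pivot positions (m x r)
F = np.array(B.tolist(), dtype=float)[:r, :]   # non-zero rows of B (r x n)

print(np.allclose(C @ F, A))                   # True: A = CF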

20 2.2.3 LU and Cholesky factorization The LU factorization is a factorization of the form:

A = LU

where L is lower triangular and U is upper triangular. If we can find such a factorization we can, for example, easily find the solution to a linear system Ax = L(Ux) = b by first solving Ly = b and then solving Ux = y. Both these systems are easy to solve since L and U are both triangular. However, not every matrix A has an LU factorization, not even every square invertible matrix. Instead we often talk about the LUP factorization:

Theorem 2.4. If A is a square matrix, then there exist a factorization such that:

PA = LU

Where P is a , U is a upper triangular matrix and L is a lower triangular matrix. P−1LU is called the LUP factorization of A.

Provided that we have found such a factorization we can still, for example, easily solve the equation system Ax = b by instead solving PAx = Pb. The LU factorization can be seen as a variation of the Gaussian elimination which we often use when attempting to solve a linear system by hand.

LU decomposition can also be used to calculate the determinant of A = P^{-1}LU. We get the determinant as:

det(A) = det(P^{-1}) det(L) det(U) = det(P^{-1}) ( Π_{i=1}^{n} l_{i,i} ) ( Π_{i=1}^{n} u_{i,i} )

This since we remember that the eigenvalues of a triangular matrix are the values on the diagonal. The determinant of the permutation matrix can be found to be det(P^{-1}) = (−1)^S, where S is the number of row exchanges.
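A small sketch using SciPy's LU routines to factorize a matrix, compute its determinant from the factors and solve a linear system (the example matrix is our own; note that SciPy's convention is A = P L U, so its P plays the role of P^{-1} in Theorem 2.4):

import numpy as np
from scipy.linalg import lu, lu_factor, lu_solve

A = np.array([[0., 2., 1.],
              [1., 1., 0.],
              [2., 1., 3.]])
b = np.array([3., 2., 8.])

P, L, U = lu(A)                      # SciPy: A = P @ L @ U
print(np.allclose(P @ L @ U, A))     # True

# determinant from the triangular factors and the permutation
detA = np.linalg.det(P) * np.prod(np.diag(L)) * np.prod(np.diag(U))
print(np.isclose(detA, np.linalg.det(A)))   # True

# solving Ax = b through the factorization
lu_piv = lu_factor(A)
x = lu_solve(lu_piv, b)
print(np.allclose(A @ x, b))         # True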

If A is Hermitian and positive-definite, however, then we can find another related factorization, namely the Cholesky factorization:

Theorem 2.5. If A is a Hermitian, positive-definite matrix, then there exists a factorization

A = LL^H

where L is a lower triangular matrix. LL^H is called the Cholesky factorization of A.

The Cholesky factorization is mainly used instead of the LUP factorization, when possible, to solve linear systems. Some examples where such matrices naturally occur are:

21 • Any is positive-definite and any covariance matrix based on measured data is going to be symmetric and real-valued. This means it’s hermitian and we can use the Cholesky factorization.

• When simulating multiple correlated variables using Monte Carlo simulation the correlation matrix can be decomposed using the Cholesky factorization. L can then be used to simulate the dependent samples.

2.2.4 QR factorization

The QR-factorization is one of the most useful factorizations. While we will only show what it is and give one example of what it can be used for, chapter 12 is dedicated to how we can find the QR-factorization efficiently as well as its use in the QR-method, one of the most widely used methods to find the eigenvalues of a general square matrix.

Theorem 2.6. Every n × m matrix A has a matrix decomposition

A = QR where

• R is a n × m upper triangular matrix.

• Q is a n × n unitary matrix.

If A is a square real matrix, then Q is real and therefore an orthogonal matrix, Q^T = Q^{-1}. A QR-factorization can be computed using, for example, Householder transformations or the Gram-Schmidt process; we will look more into this in section 6.4.2 and chapter 12. The QR-factorization has uses in for example:

• Given a QR-factorization we can solve a linear system Ax = b by solving Rx = Q^{-1}b = Q^H b, which can be done fast since R is a triangular matrix.

• The QR-factorization can also be used for solving the linear least squares problem.

• It plays an important role in the QR-method used to calculate eigenvalues of a matrix.
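A small sketch using NumPy's QR factorization to solve a square system, as described in the first bullet point above (the example matrix and right-hand side are our own):

import numpy as np

A = np.array([[2., 1., 1.],
              [1., 3., 2.],
              [1., 0., 0.]])
b = np.array([4., 5., 6.])

Q, R = np.linalg.qr(A)          # A = QR, Q orthogonal and R upper triangular
y = Q.T @ b                     # Q^{-1} b = Q^T b since Q is orthogonal (real case)
x = np.linalg.solve(R, y)       # triangular system, cheap to solve
print(np.allclose(A @ x, b))    # True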

2.2.5 Canonical forms Some factorizations are referred to as canonical forms. A canonical form is a standard way of describing an object. A single object can have several different canonical forms and which one we use might depend on what we want to find or which canonical form we can easily calculate. We have already seen one example of a canonical form which we used when looking at reducible stochastic matrices. Some other examples of canonical forms for matrices are:

• Diagonal form (for diagonalizable matrices)

22 • Reduced row echelon form (for all matrices)

• Jordan canonical form (for square matrices)

• Singular value factorization form (for all matrices)

2.2.6 Reduced row echelon form

Definition 2.6 A matrix is in reduced row echelon form when it is in echelon form, its pivot elements are all equal to one and all other elements in a pivot column are zero.

B = [ 0  1  *  0  0  0  *  0 ]
    [ 0  0  0  1  0  0  *  0 ]
    [ 0  0  0  0  1  0  *  0 ]
    [ 0  0  0  0  0  1  *  0 ]
    [ 0  0  0  0  0  0  0  1 ]
    [ 0  0  0  0  0  0  0  0 ]
    [ 0  0  0  0  0  0  0  0 ]
    [ 0  0  0  0  0  0  0  0 ]

Theorem 2.7. Every matrix is row equivalent to some matrix in reduced row echelon form.

We recognize this as the result of applying Gauss-Jordan elimination to the original matrix.

2.2.7 Jordan normal form

The Jordan normal form, while not especially useful in practical applications because of its sensitivity to rounding errors in numerical calculations, is still good to be familiar with, since it is often useful for developing mathematical theory. We start by defining the building blocks with which we build the Jordan normal form.

Definition 2.7 (Jordan block) A Jordan block is a square matrix of the form

J_m(λ) = [ λ  1  0  ...  0 ]
         [ 0  λ  1  ...  0 ]
         [ .  .  .   .   . ]
         [ 0  0  ...  λ  1 ]
         [ 0  0  ...  0  λ ]

We call a matrix made up of Jordan blocks a Jordan matrix:

Definition 2.8 (Jordan matrix) A Jordan matrix is a square matrix of the form

J = [ J_{m_1}(λ_1)       0        ...       0        ]
    [      0        J_{m_2}(λ_2)  ...       0        ]
    [      .             .         .        .        ]
    [      0             0        ...  J_{m_k}(λ_k)  ]

Theorem 2.8. All square matrices are similar to a Jordan matrix. The Jordan matrix is unique except for the order of the Jordan blocks. This Jordan matrix is called the Jordan normal form of the matrix.

We remember that two matrices A, B are similar if there is an invertible matrix S such that A = S^{-1}BS. This means that if we have found a Jordan matrix J similar to A, they have the same eigenvalues. And since Jordan matrices are upper triangular, we can easily find the eigenvalues along the diagonal of J.

Theorem 2.9 (Some other interesting properties of the Jordan normal form). Let A = S−1JS

a) The eigenvalues of J are the same as the diagonal elements of J.

b) J has one eigenvector per Jordan block.

c) The rank of J is n minus the number of Jordan blocks with eigenvalue zero.

d) The normal form is sensitive to perturbations. This means that a small change in the normal form can mean a large change in the A matrix and vice versa.
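Since the Jordan form is mainly a theoretical tool, symbolic computation is the natural way to obtain it. A small sketch using SymPy's Matrix.jordan_form (the example matrix is our own):

from sympy import Matrix

A = Matrix([[1, 1],
            [-1, 3]])          # eigenvalue 2 with algebraic multiplicity 2, only one eigenvector
P, J = A.jordan_form()         # A = P * J * P**-1
assert A == P * J * P.inv()
print(J)                       # Matrix([[2, 1], [0, 2]]) -- a single 2x2 Jordan block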

2.2.8 Singular value decomposition (SVD) Singular value decomposition is a canonical form used for example in statistics and information processing in dimension reducing methods such as Principal component analysis or calculating pseudoinverses (see section 9.5).

Theorem 2.10. All A ∈ Mm×n can be factorized as

A = UΣV^H

where U and V are unitary matrices and

Σ = [ S_r  0 ]
    [ 0    0 ]

where S_r is a diagonal matrix with r = rank(A). The diagonal elements of S_r are called the singular values. The singular values are uniquely determined by the matrix A (but not necessarily their order).

24 The singular value decomposition is related to the spectral decomposition. Given a singular value decomposition A = UΣVH we have:

• The columns of U are the eigenvectors of AAH

• The columns of V are the eigenvectors of AH A

• The singular values on the diagonal of Σ are the square roots of the non-zero eigenvalues of both AA^H and A^H A.

Having the singular value decomposition it’s easy to solve a lot of problems found in many practical applications, for example if we have a singular values decomposition A = UΣVH we can:

• Calculate the pseudoinverse A^+ = VΣ^+U^H, where Σ^+ is easily found by taking the reciprocal (x^{-1}) of the non-zero elements of the diagonal and transposing the resulting matrix.

• It is also used in Principal component analysis (PCA) as we will see later.
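A small numerical sketch of the SVD and the pseudoinverse construction described above, assuming NumPy (the matrix is our own example with all singular values non-zero):

import numpy as np

A = np.array([[1., 0., 1.],
              [0., 1., 1.]])

U, s, Vh = np.linalg.svd(A)                 # A = U Sigma V^H, s holds the singular values

# build Sigma^+ by inverting the non-zero singular values and transposing
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))
Sigma_plus[:len(s), :len(s)] = np.diag(1.0 / s)

A_plus = Vh.conj().T @ Sigma_plus @ U.conj().T
print(np.allclose(A_plus, np.linalg.pinv(A)))   # agrees with NumPy's pseudoinverse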

2.3 Applications

2.3.1 Portfolio evaluation using Monte Carlo simulation

This example uses a large amount of statistics and terminology not otherwise described in this book. For the interested reader we refer to books or courses in Monte Carlo methods and stochastic processes for the details.

Suppose we have two stocks A, B with prices at time t given by S^a(t), S^b(t). Initially we start with n_a stocks of A and n_b stocks of B. This gives the initial wealth:

W_0 = n_a S^a(0) + n_b S^b(0)

We let the stocks be for a time T, after which we are interested in the new wealth of the portfolio, or more specifically how sure we can be that there has not occurred any great loss. We assume the stock prices can be modelled as geometric Brownian motions (GBM) with parameters S^a ∼ GBM(μ_a, σ_a) and S^b ∼ GBM(μ_b, σ_b). We have an estimated correlation ρ between the returns on A and B.

To evaluate our portfolio we are interested in the probability that the total wealth W_T of our portfolio at the end of the period T has dropped by more than 10%. We can write this as:

P( W_T / W_0 ≤ 0.9 )

Finding this probability analytically is rather hard, especially if we work with not 2 stocks but, say, 10 dependent stocks. Instead we use Monte Carlo simulation to simulate the process a large number of times, every time checking at the end whether W_T ≤ 0.9·W_0. The value of stock A at time T can be written:

S^a_T = S^a_0 exp( (μ_a − σ_a^2/2) T + √T · V_a )

where V_a, V_b are multivariate normally distributed variables, (V_a, V_b) ∼ N(0, Σ), with

Σ = [ σ_a^2       σ_a σ_b ρ ]
    [ σ_a σ_b ρ   σ_b^2     ]

So in order to generate one sample of S^a_T, S^b_T we need to simulate one sample from N(0, Σ). We can usually rather easily simulate independent normally distributed variables, N(0, I). Our goal is then to use these independent samples to generate the dependent samples N(0, Σ). The normal distribution has the property that a linear combination of normally distributed variables is itself normal. In fact, if we have a vector Z of normally distributed variables with mean zero and variance one, Z ∼ N(0, I), then C^T Z is distributed as follows.

C^T Z ∼ N(0, C^T C)

In other words, if we can find a C such that C^T C = Σ, then we only need to calculate the product C^T Z to get samples from the desired distribution. Σ is obviously real valued and symmetric, but it is also at least positive semidefinite. In fact, the only time it isn't strictly positive definite is when one of the variables is set to be an exact linear combination of the others. Since we can be almost sure that Σ is a real, symmetric, positive definite matrix, we can use the Cholesky factorization to find a decomposition C^T C = Σ.

We thus generate a large number of samples C^T Z, compute the corresponding W_T, and check how often W_T ≤ 0.9·W_0 to estimate the probability P(W_T ≤ 0.9·W_0). A sketch of this procedure is given after the questions below. Some things to think about:

• Can you think of another situation where you would like to take samples from another multivariate normal distribution?

• Can you think of why the covariance matrix Σ is always a real, symmetric, positive (semi)definite matrix?
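Returning to the simulation itself, here is a minimal sketch of the whole procedure, assuming NumPy; all parameter values are made up for illustration. Note that numpy.linalg.cholesky returns a lower triangular C with Σ = C C^T, which serves the same purpose as the C^T C = Σ factor in the text.

import numpy as np

rng = np.random.default_rng(0)

# made-up parameters
S0 = np.array([100.0, 50.0])       # initial prices S^a(0), S^b(0)
n_stocks = np.array([10.0, 20.0])  # number of stocks of A and B
mu = np.array([0.05, 0.03])
sigma = np.array([0.2, 0.3])
rho, T, n_sim = 0.4, 1.0, 100_000

W0 = n_stocks @ S0

# covariance matrix and its Cholesky factor
Sigma = np.array([[sigma[0]**2,           rho*sigma[0]*sigma[1]],
                  [rho*sigma[0]*sigma[1], sigma[1]**2          ]])
C = np.linalg.cholesky(Sigma)              # Sigma = C @ C.T

Z = rng.standard_normal((n_sim, 2))        # independent N(0, 1) samples
V = Z @ C.T                                # correlated N(0, Sigma) samples

ST = S0 * np.exp((mu - sigma**2 / 2) * T + np.sqrt(T) * V)
WT = ST @ n_stocks

print(np.mean(WT <= 0.9 * W0))             # estimate of P(W_T <= 0.9 W_0)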

3 Non-negative matrices and graphs

Non-negative matrices are common in many applications, such as when working with stochastic processes and Markov chains, as well as when working with things we can represent with a graph, such as electrical networks, water flows, random walks and so on. Non-negative and positive matrices (not to be confused with positive (semi)definite matrices) have many known properties, especially concerning their eigenvalues and eigenvectors, which make them easier to work with than general matrices.

This and the next two chapters are very much related, in that we will start by looking at some graph theory, how to represent a graph with a matrix and some things which we can calculate from it. After that we will take a look at the properties of non-negative matrices themselves in chapter 4, and finally look at a special type of non-negative matrix related to random walks on graphs in chapter 5.

3.1 Introduction to graphs

A graph is a representation of a set of objects (vertices) where some pairs of objects have links (edges) connecting them. The vertices can represent a state in a Markov chain, a point on a grid or a crossing in a road network, among other things. Likewise, the edges could in the same examples represent the possible transitions in the Markov chain, the lines on the grid or the actual roads in the road network. Sometimes in the literature vertices are called nodes and edges are called links. An example of two graphs can be seen in Figure 1.

Figure 1: A directed graph (left) and an undirected graph (right)

We will concern ourselves with mainly two types of graphs: directed graphs, as in the one to the left, where every edge connecting two vertices has a direction, and undirected graphs, as in the one to the right, where the edges don't have a direction but instead only denote that the two vertices are connected. Sometimes we will allow edges which go from a vertex back to itself (so called loops) and sometimes we won't; generally loops will be allowed for directed graphs but often not when working with undirected graphs. In both cases we can also assign weights to the edges, as we will see later.

We can represent graphs such as the ones in Figure 1 with a matrix by defining the adjacency matrix of a graph.

Definition 3.1 The adjacency matrix A of a graph G with n vertices (|V| = n) is a square n × n matrix with elements a_{i,j} such that:

a_{i,j} = 1, if there is an edge from vertex i to vertex j
a_{i,j} = 0, otherwise

If the graph is undirected we consider every edge between two vertices as linking in both directions. We note one important thing: we don't define in which order we take the vertices. Sometimes it's natural to order the vertices in a certain way, for example depending on vertex labels, but any order is valid as long as you know which order you chose.

A = [ 0  1  0  1  0 ]
    [ 0  1  0  1  0 ]
    [ 0  1  0  0  1 ]
    [ 1  0  0  0  0 ]
    [ 0  0  0  0  0 ]

Figure 2: A directed graph and corresponding adjacency matrix with vertices ordered {A, B, C, D, E}

If the graph instead was undirected we would get the graph and adjacency matrix in Figure 3. We note that we don't need to order the vertices in that order; for example the matrix

A = [ 0  1  0  0  1 ]
    [ 1  1  1  1  0 ]
    [ 0  1  0  1  0 ]
    [ 0  1  1  0  0 ]
    [ 1  0  0  0  0 ]

is also an adjacency matrix of the graph in Figure 3, where we instead have ordered the vertices as {C, B, A, D, E}.

A = [ 0  1  0  1  0 ]
    [ 1  1  1  1  0 ]
    [ 0  1  0  0  1 ]
    [ 1  1  0  0  0 ]
    [ 0  0  1  0  0 ]

Figure 3: An undirected graph and corresponding adjacency matrix with vertices ordered {A, B, C, D, E}

One common addition to graphs is that we also assign weights to the edges; usually we only allow the weights to be positive real values. The weights can for example be used to represent transition probabilities in a Markov chain, the distance of a stretch of road in a road network, the time needed to travel between two vertices or the capacity in a flow network. If we have a graph with weighted edges we often talk about the distance matrix rather than the adjacency matrix.

Definition 3.2 The distance matrix D of a graph G with n vertices and weighted edges E is a square n × n matrix with elements d_{i,j} such that:

d_{i,j} = E_{i,j}, if there is an edge from vertex i to vertex j
d_{i,j} = 0, otherwise

where E_{i,j} is the weight of the edge from vertex i to vertex j. As with the adjacency matrix, if the graph is undirected we consider every edge between two vertices as linking in both directions. An example of a weighted directed graph and corresponding distance matrix can be seen in Figure 4. As in the case with the adjacency matrix we don't have a defined order of the vertices, so we can use any order as long as we know which one we use. While there would be no problem using negative or non-real weights on the edges in the matrix representation of the graph, in practice we usually consider only the case with positive weights. This is because we have a lot of useful theory concerning specifically non-negative matrices (which we get if we allow only positive weights). In most real world examples where we are interested in representing the problem with a graph we also naturally end up with only positive weights, such as when the weights represent for example distance, time, probability, capacity or flow.

D = [ 0  1  0  1  0 ]
    [ 0  2  0  2  0 ]
    [ 0  1  0  0  2 ]
    [ 1  0  0  0  0 ]
    [ 0  0  0  0  0 ]

Figure 4: A weighted directed graph and corresponding distance matrix with vertices ordered {A, B, C, D, E}

3.2 Connectivity and irreducible matrices It’s often interesting to see not only if there is an edge between two vertices but also if there is a connection between two vertices if we are allowed to traverse more than one edge. This is interesting both in the case of directed graphs as well as undirected ones: can we get from A to B by following the edges? And if so in how many steps? Or what is the shortest distance between A and B?

We take a look at so called paths in a graph.

Definition 3.3 Given a weighted or unweighted graph.

• A path in a graph is a sequence of vertices v_1, v_2, . . . , v_n such that for every vertex in the sequence there is an edge to the next vertex in the sequence.

• If the first vertex in a path is the same as the last vertex we call it a cycle.

• The length of a path is the number of edges in the path (counting multiple uses of the same edge).

We take a look at the adjacency matrix although the same is true for the distance matrix as long as we only allow positive weights.

Theorem 3.1. If the element a^{(k)}_{i,j} of the adjacency matrix A to the power of k is non-zero, there is a path from vertex i to vertex j of length exactly k. In other words:

a^{(k)}_{i,j} > 0 in A^k  ⇒  there is a path of length k from vertex i to vertex j

30 The proof is left as an exercise.
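A quick numerical check of Theorem 3.1 for the directed graph in Figure 2, assuming NumPy:

import numpy as np

# adjacency matrix from Figure 2, vertices ordered {A, B, C, D, E}
A = np.array([[0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]])

A2 = np.linalg.matrix_power(A, 2)
print(A2[3, 1])    # > 0: there is a path of length 2 from D to B (D -> A -> B)
print(A2[0, 0])    # > 0: there is a path of length 2 from A back to A (A -> D -> A)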

The problem of finding a path, or determining the existence of a path, between two vertices is an important problem in many applications regarding graphs, for example when finding the shortest path in a road network, or when finding if and how likely it is to get from one state to another in a Markov chain. We divide graphs into two different groups depending on whether we can find a path from every vertex to every other vertex.

Definition 3.4 Given a weighted or unweighted graph:

• Two vertices are said to be connected if there exists a path from either vertex to the other in the undirected graph.

• A graph is said to be connected if it is connected for all pairs of vertices in the undirected graph.

• If it is also connected for all pairs in the directed graph we say that it’s strongly connected.

We see that if a graph is connected, or strongly connected in the case of directed graphs, we have a guarantee that there is a path from any vertex to any other. Before continuing we take a short look at connected parts of a graph.

Definition 3.5 Given a (possibly weighted) graph:

• A connected component in an undirected graph is a maximal part of the graph where all vertices are connected with each other.

• A strongly connected component is a part of the directed graph which is strongly connected.

An example of connected components in an undirected graph can be seen in Figure 5 below. We see that the connected components can themselves be considered connected graphs. An example of strongly connected components in a directed graph can be seen in Figure 6. How many edges would you need to add to make the graph strongly connected?

We are now ready to move over to the matrix side of things. We look at general square non-negative matrices, to which both the adjacency matrix and the distance matrix belong if we only allow positive weights.

Theorem 3.2. A non-negative square n × n matrix A is said to be irreducible if any of the following equivalent properties is true:

1. The graph corresponding to A is strongly connected.

2. For every pair i, j there exists a natural number m such that (A^m)_{i,j} > 0.

Figure 5: An undirected graph with 3 connected components

Figure 6: A directed graph with 3 strongly connected components

3. (I + A)^{n−1} is a positive matrix.

4. A cannot be conjugated into a block upper triangular matrix by a permutation matrix P.

We leave the last property for later and give a short motivation of why the first three are equivalent. It is clear that since the graph is strongly connected (1), there exists a path from any vertex to any other vertex. Conversely, if the length of the path between vertex i and vertex j is m, we know that (A^m)_{i,j} > 0. So if (1) is fulfilled so is (2), and if (2) is fulfilled we know there is a path between all vertices, so the graph needs to be strongly connected as well (1). For the third we can take the binomial expansion and get:

(I + A)^{n−1} = Σ_{k=0}^{n−1} C(n−1, k) A^k

Since the binomial coefficients are positive, the matrix is positive if for every pair i, j there is at least one 0 ≤ k ≤ n − 1 such that (A^k)_{i,j} > 0. However, since the longest possible path without repeating any vertices has length n − 1 (going through all vertices in the graph), we see that this is true exactly when (2) is true, since we need not check any m larger than n − 1.
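Property 3 gives a simple (if not very efficient) numerical test for irreducibility; a sketch assuming NumPy:

import numpy as np

def is_irreducible(A):
    """Check irreducibility of a non-negative square matrix via (I + A)^(n-1) > 0."""
    n = A.shape[0]
    M = np.linalg.matrix_power(np.eye(n) + A, n - 1)
    return bool(np.all(M > 0))

# the directed graph in Figure 2 is not strongly connected (E has no outgoing edges) ...
A = np.array([[0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]], dtype=float)
print(is_irreducible(A))      # False

# ... while a directed cycle on 3 vertices is strongly connected
C = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(is_irreducible(C))      # True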

The question of whether a non-negative matrix is irreducible or not is a very important one which we will return to repeatedly in the coming chapters, both with regard to graphs and to general non-negative matrices. Having a clear understanding of what is and isn't an irreducible matrix, and how irreducible matrices relate to graphs and their distance or adjacency matrices, will make it much easier to understand Perron-Frobenius theory and Markov chains later.

3.3 The Laplacian matrix

The Laplacian matrix is another matrix representation of graphs with no self loops (so called simple graphs). It can be used, among other things, to find the resistance between nodes in an electrical network, to estimate the robustness of a network or to find whether there are any "bottlenecks" in it. It is also useful for finding some other properties of the graph, and it appears naturally in, for example, the stiffness matrix of simple spring networks.

Before defining the Laplacian matrix we need to define another matrix, namely the of a graph.

Definition 3.6 The degree matrix of an undirected graph is a diagonal matrix where the elements on the diagonal are the number of edges connected to each vertex. In other words, given a graph G with n vertices, the degree matrix D is an n × n matrix with elements:

d_{i,j} := deg(i), if i = j
d_{i,j} := 0,      if i ≠ j

where deg(i) is the number of edges connected to vertex v_i.

• If there are loops present (vertices linking to themselves) we count those twice for the degree.

• If the graph is directed we use either the in-degree (number of edges pointing towards the vertex) or the out-degree (number of edges pointing out from the vertex). Which one we use depends on the application.

• With a directed graph we also only count loops once rather than twice.

An undirected graph with its degree matrix can be seen in Figure 7, and a directed graph with its (in-)degree matrix in Figure 8. Using the degree matrix and the adjacency matrix we can define the Laplacian matrix:

33 2

3 0 0 0 1 4 0 4 0 0 D =   0 0 1 0 0 0 0 2 3

Figure 7: Undirected graph and corresponding degree matrix.

D = [ 2  0  0  0 ]
    [ 0  3  0  0 ]
    [ 0  0  1  0 ]
    [ 0  0  0  0 ]

Figure 8: Directed graph and corresponding degree matrix using in-degree.

Definition 3.7 The Laplacian matrix L of an undirected graph with no self loops is defined as:

L := D − A

where D is the degree matrix and A is the adjacency matrix. This gives the elements l_{i,j} of L:

l_{i,j} := deg(i), if i = j
l_{i,j} := −1,     if i ≠ j and v_i links to v_j
l_{i,j} := 0,      otherwise

where deg(i) is the number of edges connected to vertex v_i. An example of a graph and its corresponding Laplacian matrix can be seen in Figure 9. Some properties of the Laplacian matrix L and its eigenvalues λ_1 ≤ λ_2 ≤ ... ≤ λ_n:

• L is positive semidefinite (x^T L x ≥ 0).

• λ_1 = 0, and its algebraic multiplicity equals the number of connected components.

34 2

 3 −1 −1 −1 1 4 −1 2 0 −1 L =   −1 0 1 0  −1 −1 0 2

3

Figure 9: Simple undirected graph and corresponding Laplacian matrix

• The smallest non zero eigenvalue is called the spectral gap.

• The second smallest eigenvalue of L is the algebraic connectivity of the graph.

We immediately see that the eigenvalues of the Laplacian give us a lot of information about the graph. This is especially important since L is positive semidefinite, which makes more efficient calculation of the eigenvalues possible; this can matter if we need to work with very large graphs. We will look more into how we can use the Laplacian matrix in the application part.
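A numerical sketch, assuming NumPy, of the eigenvalue properties above applied to the Laplacian in Figure 9:

import numpy as np

L = np.array([[ 3., -1., -1., -1.],
              [-1.,  2.,  0., -1.],
              [-1.,  0.,  1.,  0.],
              [-1., -1.,  0.,  2.]])

eigvals = np.sort(np.linalg.eigvalsh(L))    # L is symmetric, so eigvalsh applies
print(eigvals)                              # smallest eigenvalue is 0
print(np.sum(np.isclose(eigvals, 0)))       # 1 -> the graph has one connected component
print(eigvals[1])                           # algebraic connectivity (second smallest eigenvalue)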

3.4 Exercises

3.1.

3.5 Applications

3.5.1 Shortest path

A common problem when working with networks, such as a road network or routing on the internet, is the shortest path problem. You could for example want to find the best order in which to visit a number of places (the traveling salesman problem), or maybe you want to find the shortest path between two places, as in the case of GPS navigation in cars. There are a number of related problems and solutions depending on whether we need the shortest distance between all pairs of vertices or if it is enough with a subset, such as a group of vertices or a single pair of vertices.

One common algorithm for finding the shortest path from one vertex to all others is Dijkstra's algorithm, which most other similar methods are based upon. The method works by assigning a tentative distance from the initial vertex to every other vertex and then traveling through the graph, always choosing the next vertex among the unvisited vertices so that the distance is minimized. At every step we update the distance to every vertex linked to by the current one.

More exactly, the method works through the following steps, using the distance matrix of the graph:

1. Assign distance zero to the initial vertex and infinity to all other vertices.

2. Mark all vertices as unvisited and set the initial vertex as the current vertex.

3. For the current vertex, calculate the distance to each of its unvisited neighbors as the current distance plus the distance to that neighbor. Update these vertices' distances if the newly found distance is lower.

4. Mark current vertex as visited.

5. If the smallest distance among the unvisited vertices is infinity: stop! Otherwise set the unvisited vertex with the smallest distance as the current vertex and go to step 3.

If we are only interested in the shortest distance from the initial vertex to one specific vertex, we can stop as soon as the distance to the target vertex is less than the distance to the next current vertex. However, for the method to work we need to make a few assumptions, and there are also a couple of properties of the graph itself that can say something about the expected result of the algorithm. Take a few minutes to think through the following questions and how they affect the result or what assumptions we need; a sketch implementation of the algorithm is given after the questions.

• What could happen if there are negative weights in the graph?

• What if the graph is directed? Should a transportation network be considered directed or undirected?

• If we know the graph is or isn’t irreducible, can we say something about the result beforehand?
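As promised, a minimal sketch of Dijkstra's algorithm working directly on a distance matrix, assuming NumPy; zero entries are interpreted as "no edge" (as in Definition 3.2), and the implementation follows the numbered steps above without the usual priority-queue optimizations.

import numpy as np

def dijkstra(D, start):
    """Shortest distances from `start`; D[i, j] = weight of edge i -> j (0 = no edge)."""
    n = D.shape[0]
    dist = np.full(n, np.inf)
    dist[start] = 0.0                                  # step 1
    unvisited = set(range(n))                          # step 2
    current = start
    while True:
        for j in range(n):                             # step 3: update unvisited neighbours
            if D[current, j] > 0 and j in unvisited:
                dist[j] = min(dist[j], dist[current] + D[current, j])
        unvisited.remove(current)                      # step 4
        candidates = [v for v in unvisited if dist[v] < np.inf]
        if not candidates:                             # step 5: stop when nothing reachable remains
            return dist
        current = min(candidates, key=lambda v: dist[v])

# distance matrix from Figure 4, vertices ordered {A, B, C, D, E}
D = np.array([[0., 1., 0., 1., 0.],
              [0., 2., 0., 2., 0.],
              [0., 1., 0., 0., 2.],
              [1., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
print(dijkstra(D, 0))    # distances from A; C and E are unreachable (inf)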

3.5.2 Resistance distance

When planning a large electrical network it is important to try to get as low a resistance between vertices as possible. It is however not an easy task to use the common rules for combining resistors when the network gets large, and we might want to know how to expand the network in order to lower the resistance not just between two vertices but overall. The resistance distance between two vertices v_i, v_j in a graph G is defined as the effective resistance between them when each edge is replaced with a 1 Ω resistor. We define

Γ = L + (1/n) 1

where L is the Laplacian matrix, n is the number of vertices in G and 1 is the n × n matrix with all elements equal to one.

The elements of the resistance distance matrix Ω are then:

Ω_{i,j} = (Γ^{-1})_{i,i} + (Γ^{-1})_{j,j} − 2(Γ^{-1})_{i,j}

For example, take the graph and corresponding Laplacian matrix in Figure 10 below:

L = [  3  -1   0  -1  -1 ]
    [ -1   3  -1   0  -1 ]
    [  0  -1   3  -1  -1 ]
    [ -1   0  -1   3  -1 ]
    [ -1  -1  -1  -1   4 ]

Figure 10: Undirected graph and corresponding Laplacian matrix

We get Γ = L + (1/5)·1 and its inverse:

Γ^{-1} = [ 0.4267  0.1600  0.0933  0.1600  0.1600 ]
         [ 0.1600  0.4267  0.1600  0.0933  0.1600 ]
         [ 0.0933  0.1600  0.4267  0.1600  0.1600 ]
         [ 0.1600  0.0933  0.1600  0.4267  0.1600 ]
         [ 0.1600  0.1600  0.1600  0.1600  0.3600 ]

We get the resistance distance matrix by computing Ω_{i,j} = (Γ^{-1})_{i,i} + (Γ^{-1})_{j,j} − 2(Γ^{-1})_{i,j} for every element:

Ω = [ 0       0.5333  0.6667  0.5333  0.4667 ]
    [ 0.5333  0       0.5333  0.6667  0.4667 ]
    [ 0.6667  0.5333  0       0.5333  0.4667 ]
    [ 0.5333  0.6667  0.5333  0       0.4667 ]
    [ 0.4667  0.4667  0.4667  0.4667  0      ]

And we have found the effective resistance between the vertices as the values of the elements in the matrix. We can see that we have the largest resistance between the vertices A, C and between B, D, which is quite natural since they are the ones furthest from each other. Between such opposite vertices we can also easily check the result, since we can see the network as two resistors in series, each made up of three parallel connected 1 Ω resistors. This would give the resistance:

R_{A,C} = R_1R_2R_3/(R_1R_2 + R_1R_3 + R_2R_3) + R_4R_5R_6/(R_4R_5 + R_4R_6 + R_5R_6) = 1/(1+1+1) + 1/(1+1+1) = 2/3

since all the resistances are 1 Ω from the definition of resistance distance. Checking the result for the resistance between, for example, A and E is harder using the usual computation rules though.
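The computation above can be reproduced with a short NumPy sketch; the Laplacian L is the one from Figure 10 and the code simply implements Γ = L + (1/n)·1 and the formula for Ω.

```python
import numpy as np

# Laplacian of the graph in Figure 10 (vertices A, B, C, D, E)
L = np.array([[ 3, -1,  0, -1, -1],
              [-1,  3, -1,  0, -1],
              [ 0, -1,  3, -1, -1],
              [-1,  0, -1,  3, -1],
              [-1, -1, -1, -1,  4]], dtype=float)

n = L.shape[0]
Gamma = L + np.ones((n, n)) / n              # Gamma = L + (1/n) * all-ones matrix
Ginv = np.linalg.inv(Gamma)

# (Omega)_{i,j} = Ginv[i,i] + Ginv[j,j] - 2*Ginv[i,j]
d = np.diag(Ginv)
Omega = d[:, None] + d[None, :] - 2 * Ginv

print(np.round(Omega, 4))                    # should reproduce the matrix above,
                                             # e.g. Omega[0, 2] = 2/3 for the pair A, C
```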

4 Perron-Frobenius theory

4.1 Perron-Frobenius for non-negative matrices

Perron-Frobenius theory gives information about many of the most essential properties of non-negative square matrices. We will start by formulating Theorem 4.1, which is essential to this chapter. While it is not immediately easy to see why these properties are useful, we will see throughout the chapter why they are important and how and when we can use them. Be sure to review what we mean by irreducible matrices if you're not sure, since it's one of the recurring themes of this chapter.

Theorem 4.1. Let A be a square non-negative irreducible matrix with spectral radius ρ(A) = r. Then we have:

1. r, called the Perron-Frobenius eigenvalue, is an eigenvalue of A. r is a positive .

2. The (right)eigenvector x belonging to r, Ax = rx, called the (right)Perron vector, is positive ( x > 0).

3. The left eigenvector y belonging to r, called the left Perron vector, is positive (y^T A = r y^T, y > 0).

4. No other eigenvector except those belonging to r has only positive elements.

5. The Perron-Frobenius eigenvalue r has algebraic and geometric multiplicity 1.

6. The Perron-Frobenius eigenvalue r satisfies: min_i Σ_j a_{i,j} ≤ r ≤ max_i Σ_j a_{i,j}.

We remember that the spectral radius is the largest (in absolute value) eigenvalue. The theorem thus says a lot about the "largest" eigenvalue as well as the corresponding eigenvector. This eigenvalue is also often called the dominant eigenvalue, and its value and eigenvector are in many applications among the most important things to know about the system, as can be seen in the application part. We give a short motivation for the correctness of the theorem. Given (2) we can easily prove that the eigenvalue r is positive: if x is a positive vector and A is non-negative and irreducible, then Ax > 0. Since Ax = rx, where r is the eigenvalue belonging to the eigenvector x, we have r > 0: if r were negative, zero or complex then rx would be as well, but Ax is a positive vector.

We also prove (4), that the only positive eigenvectors are the ones belonging to r: let A be a non-negative irreducible square matrix, set ρ(A) = r and let y > 0 be the left Perron vector, y^T A = r y^T. Assume we have a (right) eigenpair (λ, v) with v ≥ 0, v ≠ 0. This gives:

r y^T v = y^T A v = y^T λ v = λ y^T v ⇒ r = λ,

where the last step uses that y^T v > 0 since y > 0 and v ≥ 0, v ≠ 0.

We give no further proof, but leave some of the other statements as exercises. Initially the theorem was stated for positive matrices only and was later generalised to irreducible non-negative matrices. It's easy to see that if A is positive it's also non-negative and irreducible.
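The statements of Theorem 4.1 are also easy to check numerically. The sketch below uses a small made-up irreducible non-negative matrix (not one from the text): it finds the eigenvalue on the spectral radius, scales the corresponding eigenvector to be positive, and checks the row-sum bounds in property 6.

```python
import numpy as np

A = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])            # hypothetical non-negative irreducible matrix

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(np.abs(eigvals))             # eigenvalue on the spectral radius
r = eigvals[k].real                        # Perron-Frobenius eigenvalue (real, positive)
x = eigvecs[:, k].real
x = x / x.sum()                            # scaled Perron vector: all elements positive

row_sums = A.sum(axis=1)
print("r =", r)
print("Perron vector =", x)
print(row_sums.min() <= r <= row_sums.max())   # property 6: True
```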

Theorem 4.2. If A is positive, in addition to all properties for irreducible non-negative matrices, the following is true as well:

• The Perron-Frobenius eigenvalue r has the strictly largest absolute value of all eigenvalues. I.e., for any eigenvalue λ ≠ r we have |λ| < r.

This last property, that there is only one eigenvalue on the spectral radius, is very useful as well. In the section on Markov chains we will see that it determines convergence properties. However, this was for positive matrices, and we see that it isn't true for general non-negative irreducible matrices, as in the example below.

We look at the permutation matrix:

P =
0 1
1 0

It's obviously non-negative and irreducible. From Perron-Frobenius we get that the spectral radius ρ(P) = r = 1, since both row sums are one. However, the eigenvalues can easily be found to be ±1, which are both on the unit circle.

While there are irreducible matrices for which there are more than one eigenvalue on the spectral radius, there are some irreducible matrices for which there is only one. We let this last property of having or not having only one eigenvalue on the spectral radius divide the non-negative irreducible matrices in two different classes. We say that a non-negative irreducible matrix is primitive if it has only one eigenvalue with absolute value equal to the spectral radius. If a non-negative irreducible matrix is not primitive, it is imprimitive.

Theorem 4.3. We let A be a square non-negative irreducible matrix, then:

• A is primitive if and only if there is only one eigenvalue r = ρ(A) on the spectral radius.

• A is primitive if and only if lim_{k→∞} (A/r)^k exists. For the limit we have:

lim_{k→∞} (A/r)^k = (x y^T)/(y^T x) > 0,

where x, y are the right and left Perron vectors respectively.

• A is imprimitive if and only if there are h > 1 eigenvalues on the spectral radius. h is called its index of imprimitivity.

We note the easily missed part that for a matrix to be either primitive or imprimitive it needs to first be non-negative and irreducible. We also see that having only one eigenvalue on the spectral radius is equivalent to the limit lim_{k→∞} (A/r)^k existing. We return to the permutation matrix:

P =
0 1
1 0

We consider the limit lim_{k→∞} (P/r)^k, given r = 1 which we got earlier. That means that if lim_{k→∞} P^k exists then P is primitive. However P² = I, and the sequence P, P², P³, ... alternates between P and I. Since it alternates between two matrices it obviously doesn't converge, and P is therefore imprimitive. In fact all irreducible permutation matrices are imprimitive, since for all of them there is a k such that P^k = I (why?) and r = 1 since every row sums to one.

There are two common ways to tell if a non-negative irreducible matrix is primitive:

Theorem 4.4. A non-negative irreducible matrix A is primitive if:

1. If any diagonal element a_{i,i} > 0 then A is primitive (but A could still be primitive even if all diagonal elements are zero).

2. A is primitive if and only if A^m > 0 for some m > 0.

3. A is primitive if and only if A is aperiodic.

Proof.

• If (1) is true it's obvious that there is an m > 0 such that A^m > 0. This since, if we look at the graph, irreducibility guarantees there is a path from every vertex to every other, and A^m > 0 means there is a path of length exactly m between every pair of vertices. Since we have a diagonal element a_{i,i} > 0, any path between two vertices passing through vertex i can have its length increased by one simply by looping at vertex i once, so for a large enough m there is a path of length exactly m between every pair of vertices.

• Suppose A^m > 0 and A had another eigenvalue λ ≠ r with |λ| = ρ(A). Then λ^m would be an eigenvalue on the spectral radius of A^m. However, since A^m is positive, the only eigenvalue on ρ(A^m) is positive and simple, hence A is primitive. Conversely, if A is primitive then the limit lim_{m→∞} A^m/r^m exists and is positive. But for this to be possible A^m must be positive for large enough m.

Especially the first statement often gives a way to immediately conclude that a matrix is primitive. When the first cannot be used you are left with trying the second; however, you might often need to try quite a large m before you can tell whether a matrix is primitive or not. To show that a matrix is imprimitive it's often easiest to show that it's periodic, as we did with the permutation matrix. The second statement can also be used, however, since it can be shown that if A is primitive, the smallest possible m which gives a positive matrix is bounded by m ≤ n² − 2n + 2. So if you try an m at least as large as this and the matrix still isn't positive, it's imprimitive.
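The second criterion together with the bound m ≤ n² − 2n + 2 gives a simple, if brute-force, primitivity test. A sketch in NumPy (the helper name is our own) that only tracks the zero/non-zero pattern of the powers:

```python
import numpy as np

def is_primitive(A):
    """Test primitivity of a non-negative irreducible matrix A by checking
    whether A^m > 0 for m = n^2 - 2n + 2 (the bound from the text)."""
    n = A.shape[0]
    m = n * n - 2 * n + 2
    B = (A > 0).astype(int)                 # zero/non-zero pattern of A
    P = np.eye(n, dtype=int)
    for _ in range(m):
        P = ((P @ B) > 0).astype(int)       # pattern of the next power, no overflow
    return bool((P > 0).all())

P = np.array([[0, 1],
              [1, 0]], dtype=float)         # the permutation matrix from before
print(is_primitive(P))                      # False (imprimitive)

C = np.array([[0, 1],
              [1, 1]], dtype=float)         # one positive diagonal element -> primitive
print(is_primitive(C))                      # True
```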

We give a short note on imprimitive matrices as well:

Theorem 4.5. Let A be an imprimitive matrix with h eigenvalues λ_1, λ_2, . . . , λ_h on the spectral radius ρ(A). Then the following are true:

• alg mult(λn) = 1, n = 1, 2, . . . , h.

• λ_1, λ_2, . . . , λ_h are ρ(A) times the h-th roots of unity, such that:

{λ_1, λ_2, . . . , λ_h} = {r, rω, rω², . . . , rω^{h−1}}, ω = e^{i2π/h}

We will not look at the imprimitive matrices more than this, except noting that the eigenvalues on the spectral radius for imprimitive matrices are actually very special. Not only do they all have algebraic multiplicity 1, they are evenly spaced around the spectral radius.

4.2 Exercises 4.1. Consider the matrices:

0 0 1 0 1/2 0 1/2 0  1 1 2 3 0 0 0 1  0 1/2 1/2 0  1 3 1 2 A =   B =   C =   0 1 0 0 1/3 1/3 0 1/3 2 1 1 1 1 0 0 0 0 0 0 1 3 2 3 1

For each matrix consider:

• Is it non-negative?

• Is it irreducible?

• Is it primitive?

• What is the maximum and minimum possible value of the Perron-Frobenius eigen- value?

• What can you say about the other eigenvalues?

4.2. Show that for all irreducible permutation matrices there is a k such that Pk = I. Prove that all irreducible permutation matrices are imprimitive.

4.3. Consider the matrix:

A =
1 1 2
1 1 1
2 2 1

• Calculate the eigenvalues of this matrix.

• Calculate the (right) eigenvector belonging to the Perron-Frobenius eigenvalue.

4.3 Applications

4.3.1 Leontief input-output model

Suppose we have a closed economic system made up of n large industries, each producing one commodity. Let a J-unit be the amount of the commodity produced by industry j that sells for one unit of money. We note that this could be a part of one commodity; for example a car would typically sell for many J-units. We create a vector S with elements s_j containing the produced commodities and a matrix A with the consumption of other commodities needed to create the commodities. We have:

• Elements sj ∈ S, 0 ≤ sj = number of J − units produced by industry j in one year.

• Elements ai,j ∈ A,0 ≤ ai,j = number of I − units needed to produce one J − unit.

• a_{i,j} s_j = number of I-units consumed by industry j in one year.

• If we sum over all j we can then find the total number of I-units that are available to the public (not consumed by the industries themselves); we call this d_i.

• d_i = s_i − Σ_{j=1}^{n} a_{i,j} s_j.

• S − AS is then a vector with the total amount of each commodity available to the public (elements d_i).

Suppose the public demands a certain amount of every commodity; we create a demand vector d = (d_1, d_2, . . . , d_n) ≥ 0. We want to find the minimal supply vector S = [s_1, s_2, . . . , s_n]^T ≥ 0 that satisfies the demand. We are also interested to know if we can guarantee that there actually is a solution to the system, or whether we need to make further assumptions.

We get the linear system:

S − AS = d ⇒ S = (I − A)^{-1} d

If (I − A) is invertible we can find a solution relatively easily. However, we don't know if it is, or if the solution could have negative values in S (which would be equivalent to producing a negative amount of a commodity).

• We use the fact that if the Neumann series Σ_{n=0}^{∞} A^n converges, then (I − A)^{-1} = Σ_{n=0}^{∞} A^n.

• If we assume ρ(A) < 1 we get lim_{k→∞} A^k = 0, which is the same as saying that the Neumann series converges.

• Since A is non-negative, Σ_{n=0}^{∞} A^n is as well, and we have guaranteed not only a solution, but also that S ≥ 0.

• The question is what assumptions we need to make in order for ρ(A) < 1.

We look at the column sums Σ_i a_{i,j}; such a sum can be seen as the total number of units required to create one J-unit. It seems reasonable to assume Σ_i a_{i,j} ≤ 1 with strict inequality for at least one industry, since otherwise some industry would go with a loss, and we can assume at least one goes with a profit.

Then there is a matrix E ≥ 0, E ≠ 0 with elements e_{i,j} such that Σ_i (a_{i,j} + e_{i,j}) = 1 (every column sum of A + E equal to one). If u is the vector with all elements equal to 1, we get:

u^T = u^T (A + E)

We assume ρ(A) ≥ 1 and use the Perron vector x of A to get:

u^T x = u^T (A + E) x = u^T A x + u^T E x

This can't be true: since x > 0 and E is non-negative and non-zero we have u^T E x > 0, and since Ax = ρ(A)x ≥ x (as ρ(A) ≥ 1) we have u^T A x ≥ u^T x, so the right-hand side would be strictly larger than the left. So if we assume the column sums of A to be less than or equal to one, with at least one strictly less than one, we get that ρ(A) < 1 and we are finished. We also note that if A is irreducible, then Σ_{n=0}^{∞} A^n is even positive and not only non-negative. In other words, if we increase the demand of any commodity, we will need to increase the supply of all commodities and not only the one we increased the demand for. Take a few minutes to think through the new assumptions of the system. Apart from assuming the industries aren't making a loss, and at least one makes a profit, we also assumed A to be irreducible. This corresponds to every commodity being directly or indirectly needed to produce every commodity, which doesn't seem unreasonable considering we work with large industries.
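A small numerical sketch of the model: the consumption matrix A and demand d below are made-up numbers (column sums at most one, none above one), only meant to show that S = (I − A)^{-1} d comes out non-negative and that it agrees with a truncated Neumann series.

```python
import numpy as np

A = np.array([[0.3, 0.4],          # hypothetical consumption matrix,
              [0.5, 0.5]])         # column sums 0.8 and 0.9 (no industry runs at a loss)
d = np.array([10.0, 20.0])         # demand from the public

I = np.eye(2)
S = np.linalg.solve(I - A, d)      # supply needed to satisfy the demand
print(S)                            # non-negative, as guaranteed above

# The same solution via a truncated Neumann series sum_{k=0}^{N} A^k d
S_series = sum(np.linalg.matrix_power(A, k) @ d for k in range(200))
print(np.allclose(S, S_series))     # True, since rho(A) < 1 here
```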

5 Stochastic matrices and Perron-Frobenius

5.1 Stochastic matrices and Markov chains

A stochastic matrix, sometimes also called a transition matrix or probability matrix, is used to describe the transitions between states in a Markov chain. Markov chains are used in applications in a large number of fields; some examples are:

• Markov chain Monte Carlo for generating numbers from very complex probability distributions in statistics.

• Modeling of market behaviour, such as market crashes or periods of high or low volatility in economics.

• Population processes in for example biology, queuing theory and finance.

• Other less obvious places like in speech recognition, error correction, thermody- namics and PageRank from the previous chapter.

Our aim here is to get an understanding of how and why Markov chains behave as they do using what we know of Perron-Frobenius theory. We start by defining stochastic matrices and Markov chains. Definition 5.1 A square non-negative matrix is • Row stochastic if every row sums to one.

• Column stochastic if every column sums to one.

• If it is both row and column stochastic we say that it is doubly stochastic. We will from now only say stochastic matrix when we in fact mean row stochastic matrix. We also notice that stochastic matrices need not be irreducible, however when they are we can immediately use Perron-Frobenius to find a lot of information regarding the matrix. A common example of doubly stochastic matrices are the permutation matrices since only one element in every row and column is non-zero and that element is one. Every stochastic matrix corresponds to one Markov chain: Definition 5.2 A Markov chain is a random process, with usually a finite number of states, where the probability of transitions between states only depend on the current state and not any previous state. A Markov chain is often said to be memoryless.

• Formal expression of the fact that a discrete time Markov chain is memoryless:

P(X_{k+1} = x | X_1 = x_1, X_2 = x_2, . . . , X_k = x_k) = P(X_{k+1} = x | X_k = x_k)

• Every stochastic matrix defines a Markov chain and every Markov chain defines a stochastic matrix.

You can see a discrete time Markov chain as a sequence of states X_0, X_1, X_2, . . . happening one after another in discrete time, where the state in the next step depends only on the current state. So if our Markov chain has 3 different states, then this corresponds to a 3 × 3 stochastic matrix with every element a_{i,j} corresponding to the transition probability from state x_i to state x_j, i.e., a_{i,j} = P(X_{n+1} = j | X_n = i). Another way to represent a Markov chain is with a directed graph by considering the stochastic matrix as the graph's distance matrix, where the distance is now a probability rather than a true distance. In this case, with the edges representing transition probabilities, it's natural to only consider graphs with positive weights. As an example of a Markov chain, let us consider this simple weather model:

• We know the weather on day zero to be sunny.

• If it's sunny one day it will be sunny the next with probability 0.7, cloudy with probability 0.2 and rainy with probability 0.1.

• If it's cloudy it will be sunny with probability 0.3, cloudy with probability 0.4 and rainy with probability 0.3.

• If it's rainy it will be sunny with probability 0.2, cloudy with probability 0.3 and rainy with probability 0.5.

Seeing this in text makes it quite hard to follow, especially if we would like to add more states. However, since the weather tomorrow only depends on the weather today, we can describe the model with a Markov chain and corresponding stochastic matrix. We order the states in the stochastic matrix from sunny to rainy and put in the transition probabilities to get the stochastic matrix P and its corresponding graph below. The vector X_0 represents the probability distribution of the weather on day zero.

X_0 = [1 0 0]^T (sunny, cloudy, rainy)

P =
0.7 0.2 0.1
0.3 0.4 0.3
0.2 0.3 0.5

(Graph: the three states sunny, cloudy and rainy, with the transition probabilities above as edge weights.)

When we worked with the distance matrix and adjacency matrix of graphs we saw that we could take powers of the matrix to find out if there is a path between vertices. In the same way we can take powers of the transition matrix P, and if an element p_{i,j} of P^k is non-zero then it's possible to get from state i to state j in k steps in the corresponding Markov chain. However, we can say even more in the case of stochastic matrices: we can tell not only if it's possible to be in a state after k steps, but also how likely it is. We can easily see that if we multiply the initial state (or distribution) X_0^T with P from the right, we get the probability to be in each of the states in the next step, X_1^T. Generally, since we work with probabilities, the initial distribution should be non-negative and sum to one.

0.7 0.2 0.1 > >     X1 = X0 P = 1 0 0 0.3 0.4 0.3 = 0.7 0.2 0.1 0.2 0.3 0.5 And we see there is only a probability 0.1 that it will rain tomorrow provided it is sunny today. Now we can of course multiply this new vector X1 in the same way to get the probabilities for the weather on the day after tomorrow.

0.7 0.2 0.1 > > 2 >     X2 = X0 P = X1 P = 0.7 0.2 0.1 0.3 0.4 0.3 = 0.57 0.25 0.18 0.2 0.3 0.5

And we get the probability for rain two days from now as 0.18 if it's sunny today. So to get the probability distribution X_k of the weather on day k we can take the transition matrix to the power of k and multiply with the initial distribution X_0^T from the left.

X_k^T = X_0^T P^k

Sometimes the transposed transition matrix is used instead, X_k = (P^T)^k X_0; both obviously work equally well.
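The small computations above are easy to reproduce; a short sketch using the weather matrix from the text:

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
X0 = np.array([1.0, 0.0, 0.0])            # sunny on day zero

X1 = X0 @ P                               # X_1^T = X_0^T P       -> [0.7, 0.2, 0.1]
X2 = X0 @ np.linalg.matrix_power(P, 2)    # X_2^T = X_0^T P^2     -> [0.57, 0.25, 0.18]
print(X1, X2)
```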

• One important question in the theory of Markov chains is whether a stationary distribution π^T = π^T P exists. Or, in other words, whether there is a distribution for which the distribution in the next step is the same as the distribution in the current step.

• We are also interested to know if the limit π^T = lim_{k→∞} X_0^T P^k exists. If it exists, it means that as long as we iterate the Markov chain for long enough, we will eventually arrive at a stationary distribution π^T = π^T P.

• The stationary distribution can also be seen as the distribution of the time the Markov chain spends in every state if it is iterated a large number of times.

• Last, when the limit exists, we are also interested to know whether it converges towards the same stationary distribution π regardless of initial state X_0.

In order to answer these questions we divide the stochastic matrices into 4 different classes:

1. Irreducible with lim_{k→∞} P^k existing (P is primitive).

2. Irreducible with lim_{k→∞} P^k not existing (P is imprimitive).

3. Reducible with lim_{k→∞} P^k existing.

4. Reducible with lim_{k→∞} P^k not existing.

5.2 Irreducible, primitive stochastic matrices

If P is irreducible we can use Perron-Frobenius for irreducible non-negative matrices. From this we get that there is one eigenvector with only positive values, namely the Perron-Frobenius eigenvector with eigenvalue r. From the same theorem we also have that r = 1, since r is bounded by the minimum and maximum row sums, both of which are equal to one since P is a stochastic matrix. The stationary distribution is then the same as the left Perron vector, since we have y^T = y^T P when r = 1. So if P is irreducible, a stationary distribution exists, and it is unique.

We remember that the requirement for an irreducible matrix to be primitive is that the limit lim_{k→∞} (P/r)^k exists. Since r = ρ(P) = 1, the limit lim_{k→∞} P^k exists if P is primitive, and thus π^T = lim_{k→∞} X_0^T P^k must exist as well. Using Perron-Frobenius for primitive matrices applied to P, we get the (right) Perron vector (Px = x) as x = e/n, where e is the vector with every element equal to one and n is the number of states. You can easily verify this since r = 1 and every row sums to one. If we let π = (π_1, π_2, . . . , π_n)^T be the left Perron vector (π^T = π^T P) we get the limit:

lim_{k→∞} P^k = (e/n) π^T / (π^T (e/n)) = e π^T > 0

Calculating the probability distribution of the limit gives:

lim_{k→∞} X_0^T P^k = X_0^T e π^T = π^T

And we see that if P is irreducible and primitive, not only does the stationary distri- bution π exist and is unique, we also get that the Markov chain converge towards this distribution regardless of initial distribution X0. Additionally Perron-Frobenius gives us an easy to calculate way to find the stationary distribution since it’s equal to the left Perron vector of P. We give a short example using the earlier weather model.

We had the transition matrix:

P =
0.7 0.2 0.1
0.3 0.4 0.3
0.2 0.3 0.5

Since P is positive, it's obviously irreducible and primitive as well, and as such we know that there is a stationary distribution and that the distribution of the Markov chain will converge towards it. For the limit we get:

lim_{k→∞} X_0^T P^k = X_0^T e π^T = π^T

To find the stationary distribution we calculate the left Perron vector π, which yields π ≈ [0.46 0.28 0.26]^T. From this we can say that on average it will be sunny 46% of the time, cloudy 28% of the time and rainy 26% of the time.
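The left Perron vector can be computed numerically as an eigenvector of P^T for the eigenvalue 1; a sketch (the values agree with the rounded π ≈ [0.46 0.28 0.26] above):

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

eigvals, eigvecs = np.linalg.eig(P.T)    # left eigenvectors of P = right eigenvectors of P^T
k = np.argmin(np.abs(eigvals - 1.0))     # the Perron-Frobenius eigenvalue r = 1
pi = eigvecs[:, k].real
pi = pi / pi.sum()                       # normalize to a probability distribution

print(np.round(pi, 4))                   # approximately [0.4565, 0.2826, 0.2609]
print(np.allclose(pi @ P, pi))           # stationary: pi^T P = pi^T
```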

5.3 Irreducible, imprimitive stochastic matrices

Since P is irreducible we know that there at least exists a stationary distribution and that it's unique. However, since P is imprimitive the limit lim_{k→∞} P^k doesn't exist, and the Markov chain therefore doesn't necessarily converge towards the stationary distribution. However, we can easily see that the (right) Perron vector is the same as in the primitive case (since every row sums to one). Similarly, the stationary distribution can be found in the same way for imprimitive matrices as for primitive ones, namely as the left Perron vector.

While we in the primitive case would also have been able to find the stationary distri- bution simply by iterating the chain a large number of times, this is not the case when working with imprimitive matrices.

Even though the Markov chain doesn't converge, we are still interested in seeing how the Markov chain behaves if we let it iterate for a long time. This is useful for example when we want to sample from a distribution using an imprimitive Markov chain in Markov chain Monte Carlo methods. While the limit lim_{k→∞} P^k doesn't exist, what we are interested in isn't necessarily the limit itself but the distribution of time the Markov chain spends in the different states if we let it iterate a large number of times. For primitive matrices this is the same as the limiting distribution. In the imprimitive case we can calculate the Cesàro sum instead:

lim_{k→∞} (1/k) Σ_{j=0}^{k} P^j

This converges and we get:

lim_{k→∞} (1/k) Σ_{j=0}^{k} P^j = e π^T > 0

We multiply the limit with the initial vector X_0 and get, just like in the primitive case:

X_0^T ( lim_{k→∞} (1/k) Σ_{j=0}^{k} P^j ) = X_0^T e π^T = π^T

In other words, the stationary distribution π exists and is the same regardless of the choice of initial state X_0. The Markov chain doesn't converge to it but rather 'oscillates' around it. More on why the Cesàro sum always exists for these matrices can be read in [?].

We give a short example using our previous small permutation matrix:

P =
0 1
1 0

We already saw that it's imprimitive and therefore the limit doesn't exist. However, the limit of the Cesàro sum does exist and we still get the stationary distribution π^T as the left Perron vector. This gives the stationary distribution π^T = [0.5 0.5]. So while the Markov chain won't converge, it will on average spend half the time in state 1 and half the time in state 2. In fact we can easily see that it spends exactly every second step in state 1 and the other in state 2.
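The Cesàro average for this permutation matrix can be checked directly; a sketch:

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

k = 1000
cesaro = sum(np.linalg.matrix_power(P, j) for j in range(k)) / k
print(cesaro)        # [[0.5, 0.5], [0.5, 0.5]] = e pi^T with pi = [0.5, 0.5]
```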

5.4 Reducible stochastic matrices

Reducible matrices are not as common to work with in practice as irreducible matrices, often because the model itself is constructed in such a way as to ensure it produces an irreducible matrix. However, sometimes it's desirable to use a reducible chain, for example an absorbing Markov chain (one or more states which we can't leave after the Markov chain ends up in them) in risk analysis.

Since we can't immediately apply Perron-Frobenius to reducible Markov chains, we for example don't know if there exists any stationary distribution, or maybe there even exist many. What we can do is try to get as close as possible to the irreducible case. We recall that a reducible matrix A can be permuted into block upper triangular form by a permutation matrix P:

P^T A P =
X Y
0 Z

If X or Z is reducible as well we can permute the original matrix further until we end up with:

P^T A P =
X_{1,1} X_{1,2} ···  X_{1,k}
0       X_{2,2} ···  X_{2,k}
...     ...     ...  ...
0       0       0    X_{k,k}

where X_{i,i} is either irreducible or a single zero element. One thing we can immediately see is that if we leave, for example, block X_{2,2} we can never come back, since none of the states in any block X_{i,i} have any connection to any of the previous blocks X_{j,j}, j < i.

Last we look for blocks where only the diagonal block X_{i,i} contains non-zero elements and permute those to the bottom, so we end up with:

P^T A P =
X_{1,1} X_{1,2} ··· X_{1,j} X_{1,j+1}   X_{1,j+2}   ··· X_{1,k}
0       X_{2,2} ··· X_{2,j} X_{2,j+1}   X_{2,j+2}   ··· X_{2,k}
...     ...     ··· ...     ...         ...         ··· ...
0       0       ··· X_{j,j} X_{j,j+1}   X_{j,j+2}   ··· X_{j,k}
0       0       ··· 0       X_{j+1,j+1} 0           ··· 0
0       0       ··· 0       0           X_{j+2,j+2} ··· 0
...     ...     ··· ...     ...         ...         ··· ...
0       0       ··· 0       0           0           ··· X_{k,k}

This is called the canonical form of the stochastic matrix, and we notice a couple of things. First we remind ourselves that this is still the same Markov chain: we have just relabeled the vertices by permuting the matrix with a permutation matrix, and the blocks X_{i,j} are matrices and not necessarily single elements. We also note that this form is generally not unique, since we could swap the position of two different blocks in the bottom right part, or sometimes in the top left as well. Using this form we can say some things about the behaviour of the Markov chain which might not be as apparent otherwise. We call the states in the upper left the transient states. Whenever we leave one of the diagonal blocks X_{i,i}, i ≤ j, we can never get back to the same or any previous block, since all elements below the diagonal blocks are zero. Also, for each block X_{i,i} in the top left, at least one of the blocks X_{i,i+1}, . . . , X_{i,k} is non-zero (or it would belong to the bottom right part). This means that if we start in any state in the top left part, we will eventually end up in one of the blocks in the bottom right.

We call the states in the bottom right the ergodic states or the absorbing states, since once you enter any of the blocks in the bottom right part you can never leave it. Either because it’s a single vertex only linking back to itself, or because it’s in itself an irre- ducible stochastic matrix.

The first thing to note is that the initial state obviously influences the behaviour of reducible Markov chains, for example if we start in one of the ergodic states we obviously get a different sequence than if we start in another ergodic state. So even if there would exist a stationary distribution we can expect it to depend on the initial state. Often with reducible Markov chains we aren’t as interested in finding where the chain is after a long time (especially if there is only one absorbing state), but we are often more

interested in how long we expect it to take to reach an absorbing state, or the probability to end up in a certain absorbing state when there is more than one. One example would be when modelling the failure rate of a machine: we know it will fail eventually, but for how long can we expect it to run?

Assume a reducible stochastic matrix A is written in canonical form:

A =
X Y
0 Z

We can then find the limit lim_{k→∞} A^k if all blocks in Z are primitive matrices, or the corresponding limit of the Cesàro sum if at least one of them is imprimitive. We look at the case where all blocks in Z are primitive. The limit can then be written as:

lim_{k→∞} A^k =
0   (I − X)^{-1} Y E
0   E

where

E =
e π_{j+1}^T  0            ···  0
0            e π_{j+2}^T  ···  0
...          ···          ...  ...
0            ···          0    e π_k^T

Obviously we don't find a unique stationary distribution (since every block in Z is absorbing); however, the limit is still very useful. First of all we see that E contains the individual Perron vectors for the blocks in Z. So if we start in one of the absorbing blocks we find a stationary distribution unique to that block. In fact, in that case we can obviously disregard the rest of the Markov chain and only concern ourselves with that one block if we have the transition matrix in canonical form.

Secondly, the elements a_{p,q} of (I − X)^{-1} Y hold the probability that, starting in transient state p, we leave the transient states and eventually hit state q in one of the absorbing blocks. If q is a single absorbing state, a_{p,q} is thus the probability to end up in q when starting in p. This is commonly called the hitting probability of q starting in p.

The vector (I − X)^{-1} e is of particular interest as well. It holds the average number of steps needed to leave the transient states; we call these the hitting times for reaching the absorbing states. We take a short look at hitting times and hitting probabilities in the next section from a slightly different perspective to understand why this is correct.

5.5 Hitting times and hitting probabilities

When we have a reducible Markov chain (with a finite number of states) we know that we will eventually reach an absorbing state. We consider a stochastic matrix P with transient states D and two absorbing states A, B. We want to find the hitting probability h_i that we end up in A starting in i ∈ D.

We illustrate how to find such a hitting probability with a short example, where we consider the Markov chain described by the graph below.

(Graph: states 1, 2, 3, 4 in a row. State 2 goes to 1 with probability 1/3 and to 3 with probability 2/3; state 3 goes to 2 with probability 2/3 and to 4 with probability 1/3; states 1 and 4 are absorbing.)

We ask ourselves: what is the probability to end up in state 1 starting in state 2? We write this probability as h_i = P_i(hit 1). After one step we are either in state 1 with probability 1/3 or in state 3 with probability 2/3. In other words we can write h_2 as:

h_2 = P_{2,1} h_1 + P_{2,3} h_3 = (1/3) h_1 + (2/3) h_3

But h_1 is the probability to hit 1 starting in 1; since we are already in 1 this probability is obviously 1. We then continue and find the expression for h_3, which can be written:

h_3 = P_{3,4} h_4 + P_{3,2} h_2 = (1/3) h_4 + (2/3) h_2

State 4 is absorbing, so the probability h_4 to hit 1 from 4 is zero, and we get the equation system:

h_2 = 1/3 + (2/3) h_3
h_3 = (2/3) h_2

Solving this gives the hitting probability h_2 = 3/5. While this method can be done by hand with a low number of states, it is mostly used to find a recurrence relation when we have an infinite number of states, such as a random walk on the integers. If we have a finite number of states and thus can write out our transition matrix P, we can find the hitting probability using the following:

Theorem 5.1. Consider a Markov chain with transition matrix P and n states. Let A be a subset of the states. The vector of hitting probabilities h_i^A = P_i(hit A) is the minimal non-negative solution to the linear system:

h_i^A = 1,                              i ∈ A
h_i^A = Σ_{j=1}^{n} p_{i,j} h_j^A,      i ∉ A

By a minimal solution we mean that for any other solution x to the system we have x_i ≥ h_i.

By writing P in canonical form we can verify that this corresponds to the result in the previous section. The minimality is needed in order to get h_i^A = 0 in absorbing states other than those in the absorbing block we seek. When we solve the system we can set h_i^A = 0 if i is any absorbing state not in A. We note that it does not matter whether A consists of absorbing states, since we are not interested in what happens after we have entered A. In terms of the previous example we could use the theorem to find the probability that we reach state 3 at some point before reaching an absorbing state.

Next we continue with our example but instead ask ourselves: What is the expected number of steps starting in 2 until we hit either of the absorbing states? We define the hitting time:

k_i = E_i(time to hit {1, 4})

As with the hitting probability we look at the next step, but now also add 1 for the time it takes to reach the next step:

k_2 = P_{2,1} k_1 + P_{2,3} k_3 + 1 = (1/3) k_1 + (2/3) k_3 + 1 = (2/3) k_3 + 1

where we noted that k_1 = 0 since we have then already reached one of the absorbing states. Next we continue with k_3 and get:

k_3 = P_{3,4} k_4 + P_{3,2} k_2 + 1 = (1/3) k_4 + (2/3) k_2 + 1 = (2/3) k_2 + 1

Solving this system gives the expected number of steps until absorption as k_2 = 3. In the same way as for hitting probabilities we get in general:

Theorem 5.2. Consider a Markov chain with transition matrix P, n states and absorbing states A. The vector of hitting times k_i^A = E_i(time to hit A) is the minimal non-negative solution to the linear system:

k_i^A = 0,                                  i ∈ A
k_i^A = 1 + Σ_{j=1}^{n} p_{i,j} k_j^A,      i ∉ A

Once again we can write the system in canonical form to verify that this corresponds to the result in the previous section. We do note however that care needs to be taken so that we don't have other absorbing states outside of A, since we would then get the relation k_j = 1 + k_j for a single absorbing state outside of A.
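For the four-state example the systems in Theorems 5.1 and 5.2 reduce to the block formulas from the previous section; a sketch with the transient block X (states 2 and 3) and the block Y of transitions into the absorbing states 1 and 4:

```python
import numpy as np

# Transient states 2, 3 and absorbing states 1, 4 (the chain from the example)
X = np.array([[0.0, 2/3],            # transitions among the transient states
              [2/3, 0.0]])
Y = np.array([[1/3, 0.0],            # transitions from {2, 3} into {1, 4}
              [0.0, 1/3]])

I = np.eye(2)
H = np.linalg.solve(I - X, Y)        # hitting probabilities (I - X)^{-1} Y
k = np.linalg.solve(I - X, np.ones(2))   # hitting times (I - X)^{-1} e

print(H)                             # H[0, 0] = 3/5: probability to hit state 1 from state 2
print(k)                             # k = [3, 3]: expected number of steps to absorption
```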

5.6 A short look at continuous time Markov chains

Here we will give a short introduction to continuous time Markov chains and their relation to what we already know about discrete time chains, without going too much into probability theory. This is not meant to give a full understanding of continuous time

chains, but rather to highlight similarities and relations with discrete time chains and previous chapters.

A continuous time Markov chain can be seen as a discrete time chain, but instead of jumping to a new state at fixed intervals 1, 2, 3, . . . we instead jump to a new state randomly in time. By randomly in time we mean that the time until a jump from a state is exponentially distributed with intensity λ (which might differ between vertices). Exponentially distributed jumps mean two important things. First, if we assume there is only one jump T in the interval (0, t), then the jump time is uniformly distributed on this interval. Second, the jumps are memoryless, which means that if there was no jump in (0, s) then the probability of a jump at a point in time after s is independent of s: P(T ∈ (0, t)) = P(T ∈ (s, s + t) | T ≥ s). We will only concern ourselves with chains with a finite state space. With an infinite state space there are more things that need to be taken into consideration, such as the chance for the chain to "explode", or the chance to eventually get back to where we started being less than one (even if the chain is irreducible).

For a continuous time chain with a finite state space we can make a graph of the chain. Now, however, we no longer have the probabilities to jump to another state as weights on the edges, but rather the intensity λ of the jump. We also naturally have no self-loops, since this is handled by the intensities of the jumps to other states. An example of what it could look like can be seen in the graph below.

(Graph: four states with jump intensities. State 1 jumps to 2 with intensity 1 and to 3 with intensity 2, state 2 to 4 with intensity 1, state 3 to 1 and to 4 each with intensity 1, and state 4 to 3 with intensity 2.)

In continuous time we talk about two different matrices. First we have the Q matrix, which is the negative Laplacian matrix where we take the weights of the edges into consideration. This means that we get elements q_{i,j} = λ_{i,j}, i ≠ j, where λ_{i,j} is the intensity of jumps from state i to state j. The diagonal elements are chosen such that every row sums to zero. For the Markov chain described by the graph above we get the Q matrix:

Q =
−3  1  2  0
 0 −1  0  1
 1  0 −2  1
 0  0  2 −2

If Q is irreducible and finite then there exists a stationary distribution π, which we can find from π^T Q = 0. In continuous time we can't have any periodicities, and irreducibility is actually enough for a finite chain to guarantee convergence towards this distribution. While we can solve this system to find the stationary distribution, there is also another more familiar matrix we can work with, namely the transition matrix P(t) = e^{tQ}. The element p_{i,j}(t) of P(t) describes the probability of being in state j at time t if the chain started in state i at time 0.

We will look more at the exponential of a matrix and how to calculate it later, in the chapter on matrix functions. For the moment it is enough to know that we get:

P(t) = e^{tQ} = Σ_{k=0}^{∞} (tQ)^k / k!

P(t) is then a stochastic matrix for all t ≥ 0, and we can find the stationary distribution (using any t > 0) in the same way as we did for a discrete time Markov chain. For the Q above we get, using t = 1:

0.1654 0.2075 0.3895 0.2376 0.0569 0.3841 0.2533 0.3057 P(1) =   0.1663 0.0982 0.4711 0.2645 0.1394 0.0569 0.4720 0.3317

And we can also find the stationary distribution by solving the linear system π^T P(1) = π^T, for example by finding the left Perron vector π. It's easy to check that π = [0.1429 0.1429 0.4286 0.2857]^T is a stationary distribution and that it satisfies both π^T P(t) = π^T and π^T Q = 0.
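With SciPy's matrix exponential the numbers above are straightforward to reproduce; a sketch:

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-3.0,  1.0,  2.0,  0.0],
              [ 0.0, -1.0,  0.0,  1.0],
              [ 1.0,  0.0, -2.0,  1.0],
              [ 0.0,  0.0,  2.0, -2.0]])

P1 = expm(Q)                            # P(1) = e^{Q}, matches the matrix above
print(np.round(P1, 4))

pi = np.array([1/7, 1/7, 3/7, 2/7])     # = [0.1429, 0.1429, 0.4286, 0.2857]
print(np.allclose(pi @ Q, 0))           # pi^T Q = 0
print(np.allclose(pi @ P1, pi))         # pi^T P(1) = pi^T
```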

5.7 Exercises

5.1. Consider the stochastic matrix

P =
0   0.5 0.5
0   0   1
0.5 0   0.5

• Show that P is irreducible.

• If we start in state 1, what is the probability to be in state 3 after 3 steps, P(X_3 = s_3 | X_0 = s_1)?

• What is the stationary distribution of this Markov chain?

5.2. Consider the stochastic matrix

T =
0    0.5  0    0.5
1    0    0    0
0    0    1    0
0.25 0.25 0.25 0.25

• Write T in canonical form.

• Calculate the hitting times of this Markov chain. What is the expected number of steps needed to reach the absorbing state if we start in the first state?

5.8 Applications

5.8.1 Return to the Leontief model

We return to the Leontief input-output model in Section 4.3.1 for modeling an economic system composed of n industries. We remember that we got a linear system S − AS = d. We found that if A is irreducible (every industry directly or indirectly depends on all others) and every column sums to one or less, with at least one column sum strictly less than one (at least one industry makes a profit and none makes a loss), we could guarantee that we can satisfy any given demand d.

We are now interested in answering some more questions about the system, such as: what is the gross national product (GNP) of the system? To answer this we will look at the system as a Markov chain with transition matrix Q = A^T. We note that we take the transpose of the previous system matrix so that every row now sums to one or less rather than every column, and elements q_{i,j} now correspond to the amount of J-units required by industry i to create one I-unit. To account for the demand vector, which essentially takes units of produced commodities out of the system, we introduce a new state d in the transition matrix. Since any produced commodity that enters the public can't come back into the system, we let d be an absorbing state such that q_{i,d} = 1 − Σ_j q_{i,j}. In other words, we modify the transition matrix by adding elements to the column corresponding to the demand d such that every row of A^T sums to one.

We get a transition matrix of the form:

Q =
q_{1,1} ... q_{1,n} q_{1,d}
...     ... ...     ...
q_{n,1} ... q_{n,n} q_{n,d}
0       ... 0       1

Since the top left n × n part is irreducible, and at least one of q_{1,d}, . . . , q_{n,d} > 0 since at least one industry makes a profit, we see that we can reach the absorbing state d from any other state. Our transition matrix therefore describes an absorbing Markov chain, and it's also easy to see that any initial supply S leads to supplying some demand d. GNP is the total supply S produced to satisfy both the demand and the internal demand of the system. We look at the mean number of steps required to reach the absorbing state from the transient states: T = (I − A^T)^{-1} u.

• We can see the elements t_i as the expected production, in units, needed to create one product unit of industry i.

• Multiplying the demand vector d with T, i.e. T^T d, gives the expected total number of produced units, i.e. the GNP.
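A sketch of the GNP computation, reusing the hypothetical two-industry matrix A and demand d from the earlier Leontief sketch (not data from the text):

```python
import numpy as np

A = np.array([[0.3, 0.4],            # hypothetical consumption matrix (as before)
              [0.5, 0.5]])
d = np.array([10.0, 20.0])           # demand vector

n = A.shape[0]
u = np.ones(n)
T = np.linalg.solve(np.eye(n) - A.T, u)   # expected total units produced per product unit
GNP = T @ d                                # T^T d = expected total production
print(T, GNP)
```

Note that T^T d = u^T (I − A)^{-1} d = u^T S, so the GNP computed here equals the sum of the supply vector S found in the earlier sketch, which is a useful consistency check.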

6 Linear spaces and projections

In this chapter, which will be a bit more theoretical, we will take a look at linear spaces and their relation to matrices. This in order to get a better understanding of some of the underlying theory as well as some commonly used methods or concepts used later. We will start by defining linear spaces and inner product spaces and later in the last part take a closer look at projections and how they can be used to for example find a basis of a matrix.

6.1 Linear spaces We start by defining a linear space:

Definition 6.1 A linear space or vector space over a field F is a set V together with two binary operations (+, ∗) defined such that:

u, v ∈ V ⇒ u + v ∈ V, and u ∈ V, a ∈ F ⇒ a ∗ u ∈ V

With the following being true, where we let u, v, w be vectors in V and a, b be scalars in F .

For vector addition (+) we have:

• Associativity u + (v + w) = (u + v) + w.

• Commutativity: u + v = v + u.

• Existence of identity element 0 ∈ V such that v + 0 = v for all v ∈ V .

• Existence of a inverse element −v ∈ V such that v + (−v) = 0.

For scalar multiplication (∗) we have:

• Distributivity with respect to vector addition: a ∗ (u + v) = a ∗ u + a ∗ v.

• Distributivity with respect to field addition: (a + b) ∗ v = a ∗ v + b ∗ v

• Compatibility with field multiplication: a ∗ (b ∗ v) = (ab) ∗ v.

• Existence of identity element 1 ∈ F such that 1 ∗ v = v for all v ∈ V .

Usually we omit the sign (∗) for multiplication as we are used to. In order to not go to much into algebraic structures we consider a field a set with two operations (+, ·) follow- ing the ”usual” rules such as associativity and commutativity. To read more about fields and other algebraic structures we refer to a book on algebra such as [?]. Two examples of fields are the real and complex numbers R, C.

We take a closer look at the definition of a linear space and the requirements on the two operations (+, ∗) for V to be a linear space. First of all we see that + is an operation between two vectors in V such that the result is a new vector also in V. We also need it to be associative and commutative, such that we can add in whatever order we want and/or swap the places of the two vectors. There also exists a zero (identity) element, and every vector has an inverse element. These are all things we are used to from the real numbers.

Second we have another operation ∗ between a scalar in F and a vector in V such that the result is a vector in V . Similarly to + we have a few requirements on the operation such as distributivity with respect to both vector and field addition. This means means for example we can either add two vectors and then multiply or multiply with both vectors and then add the result. We also have Compatibility with field multiplication as well as the existence of a multiplicative identity element.

Some examples of linear spaces using the common addition and multiplication operations are:

• R^n.

• C^n.

• Matrices of size n × n with elements in F: M_{n×n}(F).

• Polynomials.

In the future, unless otherwise noted, we will assume the common addition and multiplication operations are used. We take a look at R² with vectors u = (u_1, u_2), v = (v_1, v_2), w = (w_1, w_2) ∈ V and scalars a, b ∈ F. If we add two vectors we get (u_1, u_2) + (v_1, v_2) = (u_1 + v_1, u_2 + v_2); since the addition of two real numbers (for example u_1 + v_1) is a new real number, this obviously belongs to R² as well. Both associativity and commutativity immediately follow from addition in R as well. The identity element with respect to addition is obviously the zero vector 0 = (0, 0), since (u_1, u_2) + (0, 0) = (u_1 + 0, u_2 + 0) = (u_1, u_2). Similarly we find inverse elements simply by multiplying every element of the vector with −1.

For scalar multiplication a ∗ (v1, v2) = (av1, av2) we can show that it fullfills the requirements in a similar way and as such is left to the reader. We now introduce something called subspaces of linear spaces, these can be seen as another linear space contained within a linear space. Definition 6.2 If V is a vector space, a subset U ⊆ V is called a subspace of V if U is itself a vector space with the same operations (+, ∗) as V . We also get a way to test if a subset U is a subspace of a linear space V by checking a just a couple of different things rather than all the requirements in the original definition for a linear space.

Theorem 6.1. The subset U ⊆ V is a subspace of V if and only if it satisfies the following 3 conditions.

• 0 lies in U, where 0 is the zero vector (additive identity) of V .

• If u, v ∈ U then u + v ∈ U.

• If u ∈ U and a ∈ F then au ∈ U.

For example, we look at the linear space R² and want to see if the points on the line y = 2x, x, y ∈ R form a subspace of R². We can write the points on the line as:

U = {a(1, 2) | a ∈ R}

U is obviously a subset of R², so we only need to show the three requirements above. If we choose a = 0 it's clear that 0 ∈ V also lies in U. Adding two vectors in U gives a(1, 2) + b(1, 2) = (a + b)(1, 2), which is obviously in U since a + b ∈ R if a, b ∈ R. Last we multiply with a scalar: b ∗ (a(1, 2)) = (ba)(1, 2), which also obviously belongs to U since ba ∈ R.

Definition 6.3 A vector v such that:

v = c_1 v_1 + c_2 v_2 + . . . + c_m v_m = Σ_{j=1}^{m} c_j v_j

is called a linear combination of the vectors {v_1, v_2, . . . , v_m} ∈ V with coefficients in F if {c_1, c_2, . . . , c_m} ∈ F. We illustrate it with an example. Consider the two vectors u = (1, −2, 1) and v = (1, 1, 2). Then the vector (3, −3, 4) is a linear combination of u, v since 2u + v = 2(1, −2, 1) + (1, 1, 2) = (3, −3, 4). However, the vector (1, 0, 1) is not a linear combination of these vectors, since if we write it as an equation system su + tv = (1, 0, 1) we get:

1 = s + t
0 = −2s + t
1 = s + 2t

The only way to satisfy the first and third equations is s = 1, t = 0, but then the middle equation is not fulfilled. We often talk about not one linear combination of vectors but all linear combinations of a set of vectors:

Definition 6.4 If {v1, v2,..., vm} is a set of vectors in V . Then the set of all linear combinations of these vectors is called their span denoted:

span(v1, v2,..., vm)

• If V = span(v_1, v_2, . . . , v_m) then we call these vectors a spanning set for V.

For example given the two vectors u = (1, −2, 1) and v = (1, 1, 2) above we get their spanning set:

span(u, v) = {su + tv | s, t ∈ R} = {(s + t, −2s + t, s + 2t) | s, t ∈ R}

So the span is all vectors which we can write in this way. We recognize that the span above actually describes a plane in R³. In fact all spans in a vector space V are subspaces of V.

Theorem 6.2. For a span U = span(v1, v2,..., vm) in a vector space V we have:

• U is a subspace of V containing each of v1, v2,..., vm.

• U is the smallest subspace containing v1, v2,..., vm, in that all subspaces contain- ing v1, v2,..., vm must also contain U.

We prove the first statement:

Proof. We use the test described earlier to see if this subset is also a subspace.

• U obviously contains 0 = 0v1 + 0v2 + ... + 0vm.

• Given u = a1v1 + a2v2 + ... + amvm and v = b1v1 + b2v2 + ... + bmvm and c ∈ F .

• u + v = (a1 + b1)v1 + (a2 + b2)v2 + ... + (am + bm)vm ∈ U

• cu = (ca1)v1 + (ca2)v2 + ... + (cam)vm ∈ U.

• And we have proven that U is a subspace of V .

We take a look at the spanning set span{(1, 1, 1), (1, 0, 1), (0, 1, 1)} and want to find out if this spans the whole of R³. span{(1, 1, 1), (1, 0, 1), (0, 1, 1)} is obviously contained in R³, and it's clear that R³ = span{(1, 0, 0), (0, 1, 0), (0, 0, 1)}. We can then prove that (1, 0, 0), (0, 1, 0), (0, 0, 1) are contained in our spanning set and use Theorem 6.2 to prove that it contains R³. We do this by writing (1, 0, 0), (0, 1, 0), (0, 0, 1) as linear combinations of our vectors:

(1, 0, 0) = (1, 1, 1) − (0, 1, 1)

(0, 1, 0) = (1, 1, 1) − (1, 0, 1) Using the two first found vectors we find the last.

(0, 0, 1) = (1, 1, 1) − (1, 0, 0) − (0, 1, 0)

It's clear that there is generally more than one way to write a linear space as a span. We will now move on to look at two different types of spanning sets. We compare two spanning sets in R²:

span((1, −1), (1, 1), (1, 2)) = R²
span((1, 2), (2, 3)) = R²

In the first spanning set we can represent the vector (3, 2) as a linear combination of the spanning vectors in two ways:

(3, 2) = 1(1, −1) + 1(1, 1) + 1(1, 2)

(3, 2) = 0(1, −1) + 4(1, 1) − 1(1, 2) However in the second spanning set we get:

 s = −5 (3, 2) = s(1, 2) + t(2, 3) = (s + 2t, 2s + 3t) ⇒ t = 4

In fact, any vector (a, b) have only one unique representation (a, b) = s(1, 2) + t(2, 3). We introduce the notion of linearly dependent or independent sets of vectors. Definition 6.5 A set of vectors {v1, v2,..., vm} is called linearly indepependent if the following is true:

a1v1 + a2v2 + ... + amvm = 0 ⇒ a1 = a2 = ... = am = 0

In other words, a set of vectors {v1, v2,..., vm} are linearly independent if the only way to represent 0 in these vectors is with all coefficient zero:

0 = 0v1 + 0v2 + ... + 0vm One important consequence of this is that it also means that any other vector v can only be representes as a linear combination of {v1, v2,..., vm} in one way as well. Oth- erwise we could find two representations of 0 by subtracting v − v using two different representations of v.

If a set of vectors isn't linearly independent they are linearly dependent.

Definition 6.6 A set of vectors {v_1, v_2, . . . , v_m} is called linearly dependent iff some v_j is a linear combination of the others.

We consider three vectors {(1 + x), (2x + x²), (1 + x + x²)} in P₂ and want to find out whether they are linearly independent, by checking which linear combinations of them result in the zero vector.

0 = s(1 + x) + t(2x + x²) + u(1 + x + x²)

We get a set of linear equations:

s + u = 0
s + 2t + u = 0
t + u = 0

It's clear that the only solution to the system is s = t = u = 0, and we see that the vectors are linearly independent. Considering this relation between solving an equation system and linear independence, we can easily prove the following useful theorem.
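Before moving on to the theorem, the example above can be checked numerically. Writing the three polynomials as coefficient vectors, linear independence is equivalent to the matrix that has them as rows having full rank; a sketch:

```python
import numpy as np

# 1 + x, 2x + x^2, 1 + x + x^2 as coefficient vectors in the basis (1, x, x^2)
V = np.array([[1, 1, 0],
              [0, 2, 1],
              [1, 1, 1]])

print(np.linalg.matrix_rank(V))   # 3 -> only s = t = u = 0 solves sV1 + tV2 + uV3 = 0,
                                  # so the three polynomials are linearly independent
```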

Theorem 6.3. If a vector space V can be spanned by n vectors. Then for any set {v1, v2,..., vm} of linear independent vectors in V , m ≤ n.

In the proof, which we leave to the reader, we show that the m vectors are linearly dependent if m > n by setting up the equation system Σ_{j=1}^{m} x_j v_j = 0 and representing the m linearly independent vectors as linear combinations of a span of n vectors. This results in a system of n equations with m unknown variables, which has a nontrivial solution.

In practice the theorem means that every linearly independent set of vectors that spans a vector space V has the same number of vectors; we call such a set a basis.

Definition 6.7 A set of vectors {v1, v2,..., vm} is called a basis in V if the following is true:

•{ v1, v2,..., vm} is linear independent.

• V = span{v1, v2,..., vm}.

We summarize our findings of bases in a linear space.

• If {v1, v2,..., vm} and {w1, w2,..., wn} are two bases in V , then m = n.

• In other words, all bases in a vector space contain the same number of vectors.

• If V have a basis of n vectors. We say that V have the dimension n:

dim V = n

• We say that a vector space is finite dimensional if either V = 0 or dim V < ∞.

In fact not only for all bases in a linear space V is it true that they all have the the same amount of vectors, any set of that same amount of linear independent vectors in V is a basis in V .

Theorem 6.4. If dim V = n < ∞ and {v1, v2,..., vn} is a set of linear independent vectors, then {v1, v2,..., vn} is a basis in V .

Proof. The set is already linearly independent, so we only need to prove that {v_1, v_2, . . . , v_n} spans V. If we assume {v_1, v_2, . . . , v_n} does not span V, we could then choose a new vector v_{n+1} outside span{v_1, v_2, . . . , v_n}. Then {v_1, v_2, . . . , v_n, v_{n+1}} is linearly independent. But we already know there is a basis {w_1, w_2, . . . , w_n} which spans V since dim V = n, and for any set of linearly independent vectors {v_1, v_2, . . . , v_m} we have m ≤ n. So we get a contradiction, and {v_1, v_2, . . . , v_n} must span V.

One common solution to problems is to fit a new more appropriate basis to the problem and use that instead, this guarantee that any set of same amount of linear independent vectors is also a basis making it unnecessary to check if our new proposed basis is actually a basis or not. In essence it makes it easy to find a new basis if we already have one basis to start with. For example changeing to spherical coordinates, finding eigenvectors of a matrix, when finding many matrix decompositions, or making a fourier approximation of a function can all be seen as more or less as finding a new basis to the problem. Before continuing we give a short relation to matrices and how they can and are often used to represent a basis.

Theorem 6.5. For a (real) n × n matrix M the following properties are equivalent:

• The rows of M are linearly independent in R^n.

• The columns of M are linearly independent in R^n.

• The rows of M span R^n.

• The columns of M span R^n.

• M is invertible.

6.2 Inner product spaces We note that this far we have only seen vectors as a set of elements in F . However when we talk about vectors we are often talking about length of or angle between vectors. To be able to do this we need to add something to our linear space, namely we add another operation called a inner product and call it a inner product space.

Definition 6.8 An inner product on a vector space V over F is a map ⟨·, ·⟩ that takes two vectors u, v ∈ V and maps them to a scalar s ∈ F. We denote the inner product of two vectors u, v by ⟨u, v⟩, for which the following statements must be true:

• ⟨u, v⟩ = ⟨v, u⟩ (symmetric in R^n).

• ⟨au, v⟩ = a⟨u, v⟩.

• ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩.

• ⟨u, u⟩ > 0 for all u ≠ 0 in V.

• A vector space V with an inner product ⟨·, ·⟩ is called an inner product space.

The most commonly used inner product is the dot product sometimes called scalar product:

n u · v = u1v1 + u2v2 + . . . unvn ∈ R Using the definition above we can show that the dot product is in fact a inner product.

• For the first property we get u·v = u1v1 +u2v2 +. . . unvn = v1u1 +v2u2 +. . . vnun. Which is obviously true since xy = yx in R.

• For the second we get: au·v = au1v1+au2v2+. . . aunvn = a(v1u1+v2u2+. . . vnun). Which also immediately follows from multiplication in R.

• Next we get (u + v) · w = (u1 + v1)w1 + (u2 + v2)w2 + ... (un + vn)wn = (v1w1 + v2w2 + . . . vnwn) + (u1w1 + u2w2 + . . . unwn) = (v · w) + (u · w) and the third statement is fulfilled as well.

• Last we note that hu, ui is the sum of the square of all elements. Since the square of a real number r is positive if r 6= 0 the only way for the dot product to equal zero is if u = 0.

But although the dot product is the most commonly used inner product, there are also other inner products, such as for example the following on the polynomials of order n, P_n:

⟨p, q⟩ = ∫₀¹ p(x) q(x) dx

where p, q are polynomials in P_n. It's easy to show from the definition that this is an inner product as well, for the interested reader. You should be familiar with the Euclidean norm |v| = √(v^T v) of a vector; we define the norm of an inner product space in a similar way.
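This inner product is easy to evaluate numerically; a sketch using NumPy's polynomial class for two of the polynomials from the earlier example, which also evaluates the induced norm defined below:

```python
from numpy.polynomial import Polynomial as Poly

def inner(p, q):
    """<p, q> = integral of p(x) q(x) over [0, 1]."""
    F = (p * q).integ()          # an antiderivative of p(x) q(x)
    return F(1) - F(0)

p = Poly([1, 1])                  # 1 + x
q = Poly([0, 2, 1])               # 2x + x^2

print(inner(p, q))                # <p, q>
print(inner(p, p) ** 0.5)         # the induced norm ||p|| = sqrt(<p, p>)
```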

Definition 6.9 If ⟨·, ·⟩ is an inner product on the vector space V, the norm (sometimes called length) ||v|| of a vector v in V is defined as:

||v|| = √⟨v, v⟩

We immediately see that when using the dot product we get the Euclidean norm. The norm in an inner product space shares some properties with those of a vector norm:

66 •k vk ≥ 0, kvk = 0 if and only if v = 0.

•k cvk = |c| kvk.

•k v + wk ≤ kvk + kwk.

But we also have one additional important property of norms in a inner product space called the Schwarz inequality: • If u, v are vectors in V , then:

⟨u, v⟩² ≤ ||u||² ||v||²

• With equality if and only if one of u, v is a scalar multiple of the other. Now that we have a notion of length (norm) in our inner product space we might also want the distance between vectors. Once again we start by looking at how we calculate 3 distance using the dot product and the common geometric distance (in R ). Using the dot product we get the distance between two vectors d(u, v) as:

p 2 2 2 d(u, v) = (u1 − v1) + (u2 − v2) + (u3 − v3) = p(u − v) · (u − v) Which we recognize as the Euclidean norm |u − v|. We define the distance between vectors in a general inner product space in the same way: Definition 6.10 The distance between two vectors u, v in an inner product space V is defined as:

d(u, v) = ku − vk

For the distance d(u, v) we have the following properties for vectors u, v, w in the inner product space V:

• d(u, v) ≥ 0, with equality if and only if u = v.

• d(u, v) = d(v, u).

• d(u, w) ≤ d(u, v) + d(v, w).

While we can now calculate the length of a vector or the distance between two vectors, we still have no way to determine whether, for example, two vectors are parallel or orthogonal. To do this we need to define the angle between vectors. Using the Schwarz inequality we define the angle between two vectors u, v.

Definition 6.11
If 0 ≤ θ ≤ π is the angle between two vectors u, v, then:

cos θ = ⟨u, v⟩ / (‖u‖ ‖v‖)

We remember the Schwarz inequality saying:

⟨u, v⟩² ≤ ‖u‖² ‖v‖²

This gives:

(⟨u, v⟩ / (‖u‖ ‖v‖))² ≤ 1  ⇔  −1 ≤ ⟨u, v⟩ / (‖u‖ ‖v‖) ≤ 1,

which we then use to define the angle between two vectors. One very important case is when the inner product is zero (⟨u, v⟩ = 0) and therefore the angle between the vectors is π/2 radians. This is not only interesting geometrically but for general inner product spaces as well.

Definition 6.12
Two vectors u, v in an inner product space V are said to be orthogonal if:

⟨u, v⟩ = 0

More important is when we have a whole set of vectors, all of which are orthogonal to each other:

Definition 6.13 A set of vectors {v1, v2,..., vm} is called an orthogonal set of vectors if:

• vi ≠ 0 for all vi.

• ⟨vi, vj⟩ = 0 for all i ≠ j.

• If ‖vi‖ = 1 for all i as well, we call it an orthonormal set.

If we have an orthogonal set {v1, v2, ..., vm} and want an orthonormal set, we can easily obtain one simply by normalizing every vector:

v1/‖v1‖, v2/‖v2‖, ..., vm/‖vm‖

Given an orthogonal set we have a couple of important properties; first of all, it is also linearly independent.

Theorem 6.6. If {v1, v2, ..., vm} is an orthogonal set of vectors then it is linearly independent.

We now have another way to find a basis in an inner product space V: rather than trying to find a set of linearly independent vectors, we could instead look for an orthogonal set of vectors. The orthogonal set we find will then automatically be linearly independent as well. We will look more into how to do this later, when we take a closer look at projections. Additionally, we have one more property concerning the norm and orthogonal vectors.

Theorem 6.7. If {v1, v2, ..., vm} is an orthogonal set of vectors then:

‖v1 + v2 + ... + vm‖² = ‖v1‖² + ‖v2‖² + ... + ‖vm‖²

We give a short motivation showing it for two vectors; the proof can then easily be extended to m vectors as well. Given two orthogonal vectors v, u we get:

‖v + u‖² = ⟨v + u, v + u⟩ = ⟨v, v⟩ + ⟨u, u⟩ + 2⟨v, u⟩

Since v, u are orthogonal, ⟨v, u⟩ = 0 and we get:

‖v + u‖² = ⟨v, v⟩ + ⟨u, u⟩ = ‖v‖² + ‖u‖²

A matrix whose rows or columns form a basis of orthonormal vectors is called a unitary matrix. We remember that for a unitary matrix we have:

Theorem 6.8. For an n × n matrix U the following conditions are equivalent.

• U is invertible and U⁻¹ = U^H.
• The rows of U are orthonormal with respect to the inner product.
• The columns of U are orthonormal with respect to the inner product.
• U is a unitary matrix.

If U is a real matrix using the ordinary dot product, we say that U is an orthogonal matrix if it satisfies the conditions above. Worth noting is that if U is real, the Hermitian conjugate of the matrix is the same as the transpose, and we get U⁻¹ = U^T. One example of unitary matrices are the permutation matrices; in fact we can see the unitary matrices as a sort of generalization. In the same way as multiplying with a permutation matrix, PAP^T, can be seen as changing from one basis to another by reordering the basis vectors (relabelling a graph), multiplication with a unitary matrix, UAU^H, can also be seen as a change of basis. This is often used in practice: given M, find a suitable matrix decomposition M = UAU^H and then continue working with the presumably easier A instead.

6.3 Projections

We will now look closer at why we usually want to work with an orthogonal basis and how to construct such a basis. We still haven't given a reason why we would want to find a new orthogonal basis, given that we usually start with the standard orthonormal basis represented by the identity matrix. This, however, will mainly have to wait until the examples and applications.

One of the big advantages of orthogonal bases is the ease of finding the representation of any vector in such a basis. We have already seen that the representation is unique since the basis vectors are linearly independent, and we can therefore find the new representation by solving a linear system. However, if we have an orthogonal basis {e1, e2, ..., em} of an inner product space V, then for any vector v ∈ V:

• v = (⟨v, e1⟩/‖e1‖²) e1 + (⟨v, e2⟩/‖e2‖²) e2 + ... + (⟨v, em⟩/‖em‖²) em

• We call this the expansion of v as a linear combination of the basis vectors {e1, e2, ..., em}.

This gives us an easy way to express v given an orthogonal basis, without the need to solve a linear equation system. Even better, if the basis is orthonormal it is easier still, since then all ‖ei‖² = 1 and we simply need to take the inner product of v with each of the basis vectors.

The next step is constructing an orthogonal basis; we will take a look at projections and how to use them to create a basis. While we have already constructed bases many times before, for example every time you use Gauss elimination, we are now interested in how we can ensure that we create an orthogonal basis rather than only a linearly independent one. One method to do this is the Gram-Schmidt process; however, before looking at the method we must first show that such a basis actually exists.

Theorem 6.9. In every inner product space V with dim V < ∞ there exists an orthogonal basis.

Proof. We let V be an inner product space with dimension dim V = n and show the statement using induction.

• If n = 1 then any basis is orthogonal.

• We then assume every inner product space of dimension n has an orthogonal basis.

Now let dim V = n + 1, let {v1, v2, ..., vn+1} be any basis of V, let U = span(v1, v2, ..., vn) and let {e1, e2, ..., en} be an orthogonal basis of U.

It is then enough to find a vector v ≠ 0 such that ⟨v, ei⟩ = 0 for all i, since then {e1, e2, ..., en, v} will be an orthogonal basis of V. We consider:

v = vn+1 − t1e1 − t2e2 − ... − tnen

where we need to find all ti. Since vn+1 is not in U, v ≠ 0 for all choices of ti. So if we can find ti for all i such that ⟨v, ei⟩ = 0 we are finished.

For the orthogonality with ei we get ⟨v, ei⟩

= ⟨vn+1, ei⟩ − t1⟨e1, ei⟩ − ... − ti⟨ei, ei⟩ − ... − tn⟨en, ei⟩
= ⟨vn+1, ei⟩ − 0 − ... − ti‖ei‖² − ... − 0
= ⟨vn+1, ei⟩ − ti‖ei‖²

And we get ⟨v, ei⟩ = 0 if ti = ⟨vn+1, ei⟩/‖ei‖².

We notice that the proof not only shows that there is an orthogonal basis, but also gives us a way to find one given any other basis. It is recommended that you take some time to really understand the proof and why it works. We use these results to create the Gram-Schmidt process, which is used to find an orthogonal basis. We summarize the previous finding in

Theorem 6.10. Let {e1, e2, ..., en} be an orthogonal set of vectors in an inner product space V and let v be a vector not in span(e1, e2, ..., en).

• Then if:

en+1 = v − (⟨v, e1⟩/‖e1‖²) e1 − (⟨v, e2⟩/‖e2‖²) e2 − ... − (⟨v, en⟩/‖en‖²) en

• {e1, e2, ..., en, en+1} is an orthogonal set of vectors.

Using this we construct the Gram-Schmidt process:

• Let V be an inner product space, and {v1, v2,... vn} be a basis of V .

• We then construct the vectors e1, e2, ..., en of an orthogonal basis in V successively using:

• e1 = v1.

• e2 = v2 − (⟨v2, e1⟩/‖e1‖²) e1.

• e3 = v3 − (⟨v3, e1⟩/‖e1‖²) e1 − (⟨v3, e2⟩/‖e2‖²) e2.

• ...

• en = vn − (⟨vn, e1⟩/‖e1‖²) e1 − ... − (⟨vn, en−1⟩/‖en−1‖²) en−1.

We note that to get an orthonormal basis (and an orthogonal matrix) we also need to normalize the resulting vectors. We can do that either at the end, when we have found an orthogonal basis, or during the algorithm itself. While this works in theory, when implemented on a computer there can be some stability problems. When computing the vector ek there might be some small rounding error, resulting in ek not being completely orthogonal to the earlier vectors. This error is then compounded when we repeatedly orthogonalize later vectors against it. Luckily we can make a small adjustment of the method in order to handle these problems.
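Before looking at that adjustment, here is a minimal sketch of the classical process exactly as listed above, assuming Python with numpy; the function name gram_schmidt and the example vectors are ours.

    import numpy as np

    def gram_schmidt(V):
        """Classical Gram-Schmidt on the columns of V (assumed linearly independent).
        Returns a matrix whose columns form an orthogonal basis of span(V)."""
        V = np.asarray(V, dtype=float)
        E = np.zeros_like(V)
        for k in range(V.shape[1]):
            v = V[:, k]
            e_k = v.copy()
            for j in range(k):
                e = E[:, j]
                e_k -= (v @ e) / (e @ e) * e   # subtract <v_k, e_j>/||e_j||^2 * e_j
            E[:, k] = e_k
        return E

    V = np.array([[1., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 1.]]).T          # columns are v1, v2, v3
    E = gram_schmidt(V)
    print(np.round(E.T @ E, 10))            # off-diagonal entries are (numerically) zero
    Q = E / np.linalg.norm(E, axis=0)       # normalize at the end for an orthonormal basis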

Modified Gram-Schmidt works like Gram-Schmidt with a small modification.

• u1 = v1.

• u_k^(1) = v_k − (⟨v_k, u1⟩/‖u1‖²) u1.

• u_k^(2) = u_k^(1) − (⟨u_k^(1), u2⟩/‖u2‖²) u2.

• ...

• u_k^(n−1) = u_k^(n−2) − (⟨u_k^(n−2), un−1⟩/‖un−1‖²) un−1.

• In every step j we orthogonalize all vectors u_k^(j), k > j, against uj.

We note that although we assume that {v1, v2, ..., vn} is a basis, it does not actually need to be linearly independent: if we apply the process to a set of linearly dependent vectors, then for some i we get ui = 0, and we can discard such vectors and not orthogonalize further vectors against them. In this case the number of resulting vectors will be the same as the dimension of the space spanned by the original vectors. One example of where we can use Gram-Schmidt is in finding a QR decomposition of a matrix; we look more at that in the applications part here and in the QR method later. A small computational sketch of the modified process is given below.
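The sketch below is one way the modified process can be implemented, again assuming Python with numpy; the function name and the tolerance used to discard (numerically) zero vectors are our own choices.

    import numpy as np

    def modified_gram_schmidt(V, tol=1e-12):
        """Modified Gram-Schmidt on the columns of V.
        Columns that become (numerically) zero are discarded, so the number of
        returned vectors equals the dimension of span(V)."""
        U = np.asarray(V, dtype=float).copy()
        basis = []
        for j in range(U.shape[1]):
            u = U[:, j]
            if np.linalg.norm(u) <= tol:     # v_j was dependent on the earlier vectors
                continue
            basis.append(u.copy())
            # step j: orthogonalize all remaining columns against u immediately
            for k in range(j + 1, U.shape[1]):
                U[:, k] -= (U[:, k] @ u) / (u @ u) * u
        return np.column_stack(basis)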

While we have already used projections, we haven't actually defined what a projection is with respect to an inner product space.

Definition 6.14
Let v = v1 + v2 with v1 ∈ U and v2 ∈ U⊥, where U⊥ is the subspace orthogonal to U.

• The vector v1 is called the projection of v on U, denoted projU(v).

• The vector v2 is called the component of v orthogonal to U, given by v2 = v − projU(v).

Theorem 6.11. If V is an inner product space and U is a subspace of V, then every vector v ∈ V can be written uniquely as:

v = projU (v) + (v − projU (v))

• If e1, e2,..., em is any orthogonal basis of U, then:

projU(v) = (⟨v, e1⟩/‖e1‖²) e1 + (⟨v, e2⟩/‖e2‖²) e2 + ... + (⟨v, em⟩/‖em‖²) em

• Additionally, for the subspaces U, U⊥ we have:

dim V = dim U + dim U⊥

The projection of a vector v in V on a subspace U can be seen as the vector in U most similar to v. Looking at the Gram-Schmidt process earlier, we also see that another way to view the algorithm is that each new vector ei is the remainder after subtracting from vi its projection on the subspace spanned by the previous vectors:

ei = vi − projU(vi),  U = span(e1, ..., ei−1)

Theorem 6.12. If V is an inner product space and U is a subspace of V, then projU(v) is the vector in U closest to v ∈ V:

‖v − projU(v)‖ ≤ ‖v − u‖ for all u ∈ U.

Proof. For v − u we get:

v − u = (v − projU(v)) + (projU(v) − u)

Since the first term lies in U⊥ and the second lies in U, they are orthogonal and we can use Pythagoras' theorem:

‖v − u‖² = ‖v − projU(v)‖² + ‖projU(v) − u‖² ≥ ‖v − projU(v)‖²

We look at an example in R³ where we want to find the point on a plane closest to a point in space. We consider the plane given by the equation x − y − 2z = 0 and find the point on the plane closest to the point v = (3, 1, −2).

• For the plane we get x = y + 2z and the subspace U = {(s + 2t, s, t) | s, t ∈ R} = span((1, 1, 0), (2, 0, 1)).

• Using Gram-Schmidt we get: e1 = (1, 1, 0), e2 = (2, 0, 1) − (⟨(2, 0, 1), (1, 1, 0)⟩/‖(1, 1, 0)‖²) (1, 1, 0).

• e2 = (2, 0, 1) − (1, 1, 0) = (1, −1, 1).

• Then we find the closest point by finding the projection of v on U: projU(v) = (⟨v, e1⟩/‖e1‖²) e1 + (⟨v, e2⟩/‖e2‖²) e2

• = 2e1 + 0e2 = (2, 2, 0).

Since we have found an orthogonal basis for the plane, we can now also easily find the point on the plane closest to any general point v = (x, y, z) in space as:

projU(v) = (⟨v, e1⟩/‖e1‖²) e1 + (⟨v, e2⟩/‖e2‖²) e2 = ((x + y)/2) (1, 1, 0) + ((x − y + z)/3) (1, −1, 1)
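The worked example can be checked numerically with a few lines of Python (assuming numpy); the vectors below are the orthogonal basis e1, e2 found above.

    import numpy as np

    e1 = np.array([1., 1., 0.])
    e2 = np.array([1., -1., 1.])       # orthogonal basis of the plane x - y - 2z = 0
    v  = np.array([3., 1., -2.])

    proj = (v @ e1) / (e1 @ e1) * e1 + (v @ e2) / (e2 @ e2) * e2
    print(proj)                         # [2. 2. 0.], the point on the plane closest to v
    print(np.linalg.norm(v - proj))     # distance from v to the plane (sqrt(6))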

6.4 Applications

6.4.1 Fourier approximation

We inspect the space C[−π, π] of continuous functions on the interval [−π, π] using the inner product:

⟨f, g⟩ = ∫_{−π}^{π} f(x) g(x) dx

• Then {1, sin x, cos x, sin 2x, cos 2x, ...} is an orthogonal set.

• We can then approximate a function f by projecting it on a finite-dimensional subspace spanned by the first of these functions: Tm = span({1, sin x, cos x, sin 2x, cos 2x, ..., sin mx, cos mx})

This gives f(x) ≈ t_m f(x) = a0 + a1 cos x + b1 sin x + a2 cos 2x + b2 sin 2x + ... + am cos mx + bm sin mx.

• The coefficients ak, bk are given by:

• a0 = ⟨f(x), 1⟩/‖1‖² = (1/(2π)) ∫_{−π}^{π} f(x) dx.

• ak = ⟨f(x), cos(kx)⟩/‖cos(kx)‖² = (1/π) ∫_{−π}^{π} f(x) cos(kx) dx.

• bk = ⟨f(x), sin(kx)⟩/‖sin(kx)‖² = (1/π) ∫_{−π}^{π} f(x) sin(kx) dx.

• It is clear that as we take more coefficients we get a better approximation of f, since T1 ⊆ T2 ⊆ T3 ⊆ ....

• If we take an infinite number of coefficients we call this the Fourier series of f.
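As an illustration, the coefficient formulas above can be approximated numerically. The sketch below (Python with numpy; the function name and the simple Riemann-sum approximation of the integrals are ours) recovers the well-known coefficients of f(x) = x.

    import numpy as np

    def fourier_coefficients(f, m, n=200_000):
        """Approximate the Fourier coefficients of f on [-pi, pi].
        The integrals are approximated by a Riemann sum over one full period."""
        x = np.linspace(-np.pi, np.pi, n, endpoint=False)
        dx = 2 * np.pi / n
        fx = f(x)
        a0 = np.sum(fx) * dx / (2 * np.pi)
        a = np.array([np.sum(fx * np.cos(k * x)) * dx / np.pi for k in range(1, m + 1)])
        b = np.array([np.sum(fx * np.sin(k * x)) * dx / np.pi for k in range(1, m + 1)])
        return a0, a, b

    # f(x) = x has a pure sine series with b_k = 2 (-1)^(k+1) / k
    a0, a, b = fourier_coefficients(lambda x: x, m=3)
    print(np.round([a0, *a], 4))   # approximately [0, 0, 0, 0]
    print(np.round(b, 4))          # approximately [2, -1, 0.6667]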

6.4.2 Finding a QR decomposition using the Gram-Schmidt process

We remember that given a matrix M we can decompose it into two matrices, M = QR, such that Q is a unitary matrix and R is an upper triangular matrix. The QR decomposition can in turn be used to solve, or find a least squares solution to, the linear system defined by M. It can also be used to find the eigenvalues and eigenvectors of M more efficiently. Both of these uses of the QR decomposition are handled elsewhere in the book; however, one problem first needs to be solved.

If we want to use the QR decomposition to speed up calculations, we need to be able to find the decomposition itself as quickly (and stably) as possible. If we look at the columns of A as a (possibly linearly dependent) set of vectors, then QR can be seen as a change of basis Q with coefficients in R. If we apply the Gram-Schmidt process to the column vectors vi of A we get (before normalization):

ei = vi − (⟨vi, e1⟩/‖e1‖²) e1 − ... − (⟨vi, ei−1⟩/‖ei−1‖²) ei−1

⇔ vi = ei + (⟨vi, e1⟩/‖e1‖²) e1 + ... + (⟨vi, ei−1⟩/‖ei−1‖²) ei−1

So we write column i using i basis vectors. If we collect the coefficients in R and the basis vectors in Q we get a QR decomposition. You can easily check this by considering one column in A as we did above. In practice we usually first calculate Q and then find R using the multiplication R = Q^H A, since Q is unitary.

We give an example considering the matrix A below using the modified Gram-Schmidt process:

A = [ 1  −1  2  3
      2   0  2  1
      2   2  3  2 ]

This gives:

Q.1 = A.1/‖A.1‖ = (1/3) [1, 2, 2]^T,    A'.2 = A.2 − (Q.1^H A.2) Q.1 = (1/3) [−4, −2, 4]^T,

A'.3 = A.3 − (Q.1^H A.3) Q.1 = (1/3) [2, −2, 1]^T,    A'.4 = A.4 − (Q.1^H A.4) Q.1 = (1/3) [6, −3, 0]^T.

Continuing with the next column we get:

Q.2 = A'.2/‖A'.2‖ = (1/3) [−2, −1, 2]^T,    A''.3 = A'.3 − (Q.2^H A'.3) Q.2 = (1/3) [2, −2, 1]^T,

A''.4 = A'.4 − (Q.2^H A'.4) Q.2 = (1/3) [4, −4, 2]^T,

Q.3 = A''.3/‖A''.3‖ = (1/3) [2, −2, 1]^T.

We now have a full basis and can stop, since the remaining residual of the fourth column (which would have given Q.4) must be zero. For the QR decomposition A = QR we then get the matrices:

1 −2 2  1 Q = (Q ,Q ,Q ) = 2 −1 −2 .1 .2 .3 3   2 2 1

And 9 3 12 9  1 R = QH A = 0 6 0 −3 3   0 0 3 6 Here are a couple of things to consider and think about: • Can you recognize the Gram-Schmidt process when written in this form? Where are we doing the normalization?

• Can you find the QR decomposition using the classical (unmodified) algorithm?

• Using that Q is unitary we can write a linear system Ax = b as QRx = b ⇒ Rx = Q^H b, which is very easy to solve since R is triangular. How does this compare to other standard methods such as Gauss elimination?
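As a sanity check of the worked example, the following sketch (Python with numpy; the variable names are ours) rebuilds Q from the computation above and verifies that R = Q^H A matches the stated matrix and that A = QR.

    import numpy as np

    A = np.array([[1., -1., 2., 3.],
                  [2.,  0., 2., 1.],
                  [2.,  2., 3., 2.]])

    Q = np.array([[1., -2.,  2.],
                  [2., -1., -2.],
                  [2.,  2.,  1.]]) / 3.0

    R = Q.T @ A                              # R = Q^H A (Q is real here)
    print(np.round(R * 3))                   # 3R = [[9, 3, 12, 9], [0, 6, 0, -3], [0, 0, 3, 6]]
    print(np.allclose(Q @ R, A))             # True: A = QR
    print(np.allclose(Q.T @ Q, np.eye(3)))   # True: Q has orthonormal columns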

7 Principal component analysis (PCA) and dimension reduction

Let us consider m sets of measurements at times t = 1, 2, 3, ..., m. For example, the value of n stocks, the consumption of n commodities, rainfall at n locations, etc. While our analysis of the data might be easy if the dataset is small, as it grows the time needed for the analysis typically grows faster (for example quadratically). When the dataset grows large enough we reach a point where we can no longer do our analysis in a reasonable time. However, let us assume the measurements are dependent; this seems like a rather reasonable assumption. Our goal is then to find a new basis in which we can represent the m measurement sets using fewer basis vectors.

7.1 Mean, variance and covariance

Suppose that we have measured the x-coordinate of some object over m different observations. Let the vector x contain these values. We might for example take

x = (−1, 2, 4)^T.

Now we can compute the mean value over all observations:

Mean(x) = (1/m) Σ_{i=1}^{m} x_i,

which in our example means

Mean(x) = (1/3) Σ_{i=1}^{3} x_i = (1/3)(x1 + x2 + x3) = (1/3)(−1 + 2 + 4) = 5/3.

We might also be interested in how far the elements in x are from the mean x̄. To get a measure of this spread of the values we can compute the variance

Var(x) = (1/(m − 1)) Σ_{i=1}^{m} (x_i − x̄)²,

that is, we want values that are more distant from the mean to count more (square of the distance). We may think of the reason for dividing by m − 1 instead of simply by m as a result of losing one degree of freedom by comparing each element to x̄.

In our example we get

Var(x) = (1/2) Σ_{i=1}^{3} (x_i − x̄)² = (1/2) ((x1 − x̄)² + (x2 − x̄)² + (x3 − x̄)²)
       = (1/2) ((−1 − 5/3)² + (2 − 5/3)² + (4 − 5/3)²)
       = (1/(2 · 9)) ((−3 − 5)² + (6 − 5)² + (12 − 5)²)
       = (1/(2 · 9)) (8² + 1² + 7²) = (64 + 1 + 49)/(2 · 9) = 114/18 = 57/9 = 19/3.

We can see that the observation x2 = 2 contributes very little to the variance since it is close to the mean 5/3. Note that if x̄ is the vector we get from subtracting the mean x̄ from each element in x, that is

x̄ = (x̄1, ..., x̄m) = (x1 − x̄, ..., xm − x̄) = x − x̄1,    (3)

then we can write the variance as a dot product:

Var(x) = (1/(m − 1)) Σ_{i=1}^{m} (x_i − x̄)² = (1/(m − 1)) Σ_{i=1}^{m} x̄_i² = (1/(m − 1)) x̄^T x̄.    (4)

From the variance we may also define the standard deviation (of the sample)

s(x) = √Var(x) = √( (1/(m − 1)) Σ_{i=1}^{m} (x_i − x̄)² ).

The standard deviation has the same unit as the observations. If x is a vector of observations in meters (m) then the standard deviation is also in meters, while the variance is in square meters. From the mean and the variance of an observation vector x we can understand important things about the reality that is observed. On the other hand, if we at each observation observe two variables, giving the observation vectors x1 = (x_{11}, x_{21}, ..., x_{m1}), x2 = (x_{12}, x_{22}, ..., x_{m2}), then for each of them we can find the means x̄1, x̄2 and variances Var(x1), Var(x2), but we may also want to find out whether there is some relation between the observations of the two variables. A tool for this is the covariance. Let x̄1 = (x̄_{11}, x̄_{21}, ..., x̄_{m1}), x̄2 = (x̄_{12}, x̄_{22}, ..., x̄_{m2}) be the vectors we get from subtracting the mean x̄_i from each element in x_i, i = 1, 2. Then the covariance is

Cov(x1, x2) = (1/(m − 1)) Σ_{i=1}^{m} (x_{i1} − x̄1)(x_{i2} − x̄2) = (1/(m − 1)) Σ_{i=1}^{m} x̄_{i1} x̄_{i2} = (1/(m − 1)) x̄1^T x̄2.

Comparing with the expression for the variance we have

Var(x) = Cov(x, x).

Consider as an example the columns of the following matrix as 10 centered (the column sums are 0) observations of two variables.

X̄ = [x̄1 x̄2] = [ −2.8   −1.4
                  1.3    1.9
                  0.070 −1.0
                  2.7    2.4
                  1.3    0.30
                  1.4    2.6
                 −4.6   −3.9
                  1.9    2.0
                 −0.43  −1.2
                 −0.84  −1.7 ]    (5)

This data gives the following variances and covariance.

Var(x̄1) ≈ 5.1262

Var(x̄2) ≈ 4.7467

Cov(x̄1, x̄2) ≈ 4.5016

That the covariance between the two variables is about the same size as the variances of the variables can be interpreted, in loose terms, as x̄_{i1} and x̄_{i2} going "hand in hand", meaning that the assumption x̄_{i1} ≈ x̄_{i2} is somewhat correct. From Figure 11 we can see that this makes sense: all points seem to lie close to the line with direction (1, 1) through the origin. From this point on we will assume that all variable vectors are centered by Equation (3), so that the sum of the elements in x_i is zero for all i. To describe the variances and covariances between n variables, each observed on m occasions, we can form the n × n covariance matrix. Let X be the m × n matrix with columns x1, ..., xn, then:

COV(X) = COV(x1, ..., xn)

       = [ Cov(x1, x1)  Cov(x1, x2)  ...  Cov(x1, xn)
           Cov(x2, x1)  Cov(x2, x2)  ...  Cov(x2, xn)
           ...
           Cov(xn, x1)  Cov(xn, x2)  ...  Cov(xn, xn) ]

       = (1/(m − 1)) [ x1^T x1  x1^T x2  ...  x1^T xn
                       x2^T x1  x2^T x2  ...  x2^T xn
                       ...
                       xn^T x1  xn^T x2  ...  xn^T xn ]

       = (1/(m − 1)) X^T X.
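The covariance matrix of the example data in Equation (5) can be reproduced directly from this formula; the sketch below (Python with numpy; variable names are ours) also forms the corresponding correlation matrix discussed next.

    import numpy as np

    # Centered observations from Equation (5): 10 observations of 2 variables
    X = np.array([[-2.8, -1.4], [ 1.3,  1.9], [ 0.07, -1.0], [ 2.7,  2.4],
                  [ 1.3,  0.3], [ 1.4,  2.6], [-4.6, -3.9], [ 1.9,  2.0],
                  [-0.43, -1.2], [-0.84, -1.7]])

    m = X.shape[0]
    C = X.T @ X / (m - 1)            # COV(X) = X^T X / (m - 1)
    print(np.round(C, 4))            # diagonal ~ 5.1262, 4.7467; off-diagonal ~ 4.5016

    s = np.sqrt(np.diag(C))          # standard deviations s(x_i)
    S = np.diag(1.0 / s)
    print(np.round(S @ C @ S, 4))    # CORR(X) = S COV(X) S, with unit diagonal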

Figure 11: The points (x̄_{11}, x̄_{12}), (x̄_{21}, x̄_{22}), ..., (x̄_{m1}, x̄_{m2}) plotted.

The covariance matrix is thus symmetric since Cov(x_i, x_j) = Cov(x_j, x_i), or indeed since (X^T X)^T = X^T (X^T)^T = X^T X. The diagonal elements of COV(X) are the variances of x1, ..., xn. When computing the covariance matrix for n variables with different units, some problems arise. If we multiply the vector x_i by some constant c > 0 (representing a change of unit for the i-th variable), the i-th row and the i-th column of COV(X) will be multiplied by c, and indeed the element at (i, i) will be multiplied by c². This is somewhat bad, since we in general want our analysis of the data to be independent of the actual units used. Now, the standard deviation of c x_i is

s(c x_i) = √( (1/(m − 1)) (c x_i)^T (c x_i) ) = c √( (1/(m − 1)) x_i^T x_i ) = c · s(x_i),

and so the vectors x_i/s(x_i) are independent of the unit used since

(cxi)/(s(cxi)) = cxi/(c · s(xi)) = xi/s(xi).

Motivated by this we define the correlation matrix. Let S be the diagonal matrix with

diagonal elements 1/s(x_i), then:

CORR(X) = COV(x1/s(x1), ..., xn/s(xn))

        = (1/(m − 1)) [ x1^T x1/(s(x1)s(x1))  x1^T x2/(s(x1)s(x2))  ...  x1^T xn/(s(x1)s(xn))
                        x2^T x1/(s(x2)s(x1))  x2^T x2/(s(x2)s(x2))  ...  x2^T xn/(s(x2)s(xn))
                        ...
                        xn^T x1/(s(xn)s(x1))  xn^T x2/(s(xn)s(x2))  ...  xn^T xn/(s(xn)s(xn)) ]

        = (1/(m − 1)) S X^T X S = S ((1/(m − 1)) X^T X) S = S COV(X) S,

where the multiplications from the left and from the right by S simply scale the rows and columns by the elements on the diagonal of S (that is, by 1/s(x_i)). The entries of the correlation matrix are thus unit independent.

7.2 Principal component analysis

Suppose that we have a dataset in the m × n matrix

X = [ x1 ... xn ] = [ x̂1^T
                      ...
                      x̂m^T ],

with m measurements of n variables, each column vector (in R^m) belonging to one variable and the sum of the elements in any column being zero. The columns may or may not be normalized by division by their standard deviation. Our task now is to find the direction w ∈ R^n, |w| = 1, in which, if we compute the components of the rows x̂_i^T in this direction, the variance of these components is maximal. The components in the direction of w are

comp_w x̂_i = (x̂_i · w)/|w|.

The total variance of these components is thus, by Equation (4),

Var_w(X) = (1/(m − 1)) Σ_{i=1}^{m} (x̂_i · w / |w|)² = (1/(m − 1)) Σ_{i=1}^{m} (x̂_i^T w)²/(w^T w)
         = (1/(m − 1)) (1/(w^T w)) Σ_{i=1}^{m} (x̂_i^T w)²
         = (1/(m − 1)) (1/(w^T w)) Σ_{i=1}^{m} (Xw)_i²
         = (1/(m − 1)) (1/(w^T w)) (Xw)^T (Xw)
         = (1/(m − 1)) (w^T X^T X w)/(w^T w)
         = (w^T COV(X) w)/(w^T w).    (6)

Expression (6) can be written as the Rayleigh quotient R(COV(X), w). Now the magic trick. Since X^T X is symmetric it has an orthogonal basis of eigenvectors. To see this, consider any symmetric matrix M = M^T with two eigenvalues λ1 ≠ λ2, with corresponding eigenvectors v1, v2. We have

λ1 (v1^T v2) = (λ1 v1)^T v2 = (M v1)^T v2 = v1^T (M^T v2) = v1^T (M v2) = v1^T (λ2 v2) = λ2 (v1^T v2).

Now, since λ1 ≠ λ2, we must have v1^T v2 = 0, that is, v1 and v2 are orthogonal. If some eigenvalue has more than one associated eigenvector, then these eigenvectors form a subspace which has an orthogonal basis. Since any symmetric real matrix M = M^T is trivially normal, M^T M = M M^T, it is diagonalizable by a unitary matrix and has a basis of orthogonal eigenvectors. All in all, we can find a basis for R^n consisting of n orthogonal vectors, all of which are eigenvectors of a given symmetric matrix M, and so in particular of X^T X.

Let v1, ..., vn be a sequence of orthonormal eigenvectors of COV(X) with eigenvalues λ1, ..., λn, which thus forms a basis for R^n. Now, any vector w ∈ R^n can be represented uniquely by a sequence of coefficients c1, ..., cn such that

w = c1 v1 + ... + cn vn.

Using this we continue with the expression for the variance in the direction w:

Var_w(X) = (w^T COV(X) w)/(w^T w)
         = ((c1 v1 + ... + cn vn)^T COV(X) (c1 v1 + ... + cn vn)) / ((c1 v1 + ... + cn vn)^T (c1 v1 + ... + cn vn))
         = ((c1 v1 + ... + cn vn)^T (λ1 c1 v1 + ... + λn cn vn)) / (c1² + ... + cn²)
         = (λ1 c1² + ... + λn cn²) / (c1² + ... + cn²)

Note that since COV(X) = X^T X/(m − 1) is positive semi-definite (actually, in most real cases positive definite), all eigenvalues are non-negative (positive). For convenience we let the eigenvalues be ordered as λ1 ≥ ... ≥ λn. From this it is easy to see that the maximum variance is achieved when c1 = 1, c2 = 0, ..., cn = 0, that is, Var_w(X) is maximal in the direction w = v1 corresponding to the largest eigenvalue, and the variance in this direction is indeed λ1. We call this direction the (first) principal component of the data in X.

Define the matrix with the basis of eigenvectors by W = [v1 ... vn] and the diagonal matrix with the eigenvalues as D = [d_{ij}], d_{ij} = λ_j when i = j and d_{ij} = 0 otherwise. Now W is an orthonormal matrix (W^T W = I). Then the rows of XW express the coordinates of the rows of X in the basis W, that is, (XW)_{ij} = x̂_i^T v_j is the component of x̂_i in the direction v_j. Since W is orthonormal, XW is essentially a rotation of the rows of X, so all angles between rows (induced by the dot product) are preserved, and XW essentially represents the same data as X. We have the important relation

COV(XW) = (1/(m − 1)) (XW)^T XW = (1/(m − 1)) W^T X^T X W = W^T COV(X) W = W^T W D = I D = D.

We thus see that in the basis W there is no covariance between the columns of the data, and indeed the variance of column j is the eigenvalue λ_j, so the variances of the columns of XW appear in decreasing order.

7.3 Choosing the number of dimensions

Suppose now that instead of simply rotating the data so that its covariances are zero, we simply do not include some directions, preferably those with small variances. Let W_k be the choice of the first k < n eigenvectors in order of decreasing eigenvalue:

W_k = [v1 ... v_k].

Now, by XW_k ∈ R^{m×k}, we have reduced the dimension of the data from n dimensions to k dimensions, and the covariances between columns are still zero. Our hope now is that the first few eigenvalues λ1, ..., λ_k, k < n, capture most of the variance of the data, so that the remaining eigenvalues are comparatively small. How precisely to do this depends on the situation, but a good starting point is that

Σ_{r=1}^{k} λ_r ≫ Σ_{r=k+1}^{n} λ_r.

One possible way to decide how many components to use is then to require that no less than 90% of the variance is captured by the k first components, that is

Σ_{r=1}^{k} λ_r ≥ 9 Σ_{r=k+1}^{n} λ_r.

It might also be useful to form the function

v(k) = Σ_{r=1}^{k} λ_r,

and then plot the value of v(k) for k = 1, ..., n for visual inspection. This will help in situations such as when v(k) captures 89.9% of the variance and v(k + 1) captures 90.1%; in such a case we may settle for the k first components. Finally, having selected some suitable number of components, the principal components, we may want to return the data from our k dimensions to the original n dimensions by finding, for each row, the vector in R^n defined by its k coordinates in the basis W_k, that is, construct the matrix

[ Σ_{r=1}^{k} (XW_k)_{ir} (W_k)_{jr} ]_{1≤i≤m, 1≤j≤n} = X W_k W_k^T.

For the data in Equation (5), plotted in Figure 11, we have the following eigenvectors and eigenvalues:

W ≈ [ −0.7218   0.6921
      −0.6921  −0.7218 ],

λ1 ≈ 9.4420,  λ2 ≈ 0.4309.

Using the first eigenvector only (W_1), which captures about 95% of the variance, we compute the result in Figure 12. From Figure 12 it is clear that the eigenvector going up and to the right (or, equivalently, down and to the left) captures most of the variance of the data.
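The eigenvalues and the reconstruction shown in Figure 12 can be reproduced with the short sketch below (Python with numpy; the data matrix from Equation (5) is repeated so the snippet runs on its own).

    import numpy as np

    X = np.array([[-2.8, -1.4], [ 1.3,  1.9], [ 0.07, -1.0], [ 2.7,  2.4],
                  [ 1.3,  0.3], [ 1.4,  2.6], [-4.6, -3.9], [ 1.9,  2.0],
                  [-0.43, -1.2], [-0.84, -1.7]])

    C = X.T @ X / (X.shape[0] - 1)          # covariance matrix
    lam, W = np.linalg.eigh(C)              # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # reorder so that lambda_1 >= lambda_2
    lam, W = lam[order], W[:, order]
    print(np.round(lam, 4))                 # ~ [9.442, 0.4309]
    print(np.round(lam[0] / lam.sum(), 3))  # ~ 0.956: first component carries ~95% of the variance

    W1 = W[:, :1]                           # keep only the first principal direction
    X_reduced = X @ W1                      # 10 x 1: coordinates along the first component
    X_back = X_reduced @ W1.T               # back in the original 2-d coordinates (Figure 12)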

Figure 12: The points defined by the rows of XW_1 W_1^T with X from Equation (5) plotted as diamonds, and the original points of X as circles. The two orthogonal eigenvectors are also plotted as black lines.

8 Linear transformations

A linear transformation, also called a linear map, is a function mapping elements from one set to another (possibly the same) set while fulfilling certain conditions: namely, if the two sets are linear spaces then the map preserves vector addition and scalar multiplication. In this chapter we will take a look at linear transformations, what properties make them important, and some examples of where we can use them in applications. This chapter, as with the one before it, has a heavier focus on theory than other chapters. The definition of a linear transformation is as follows:

Definition 8.1
Given two vector spaces V, W over the same field F, a function T : V → W is a linear transformation if:

• T(v1 + v2) = T(v1) + T(v2), for all v1, v2 ∈ V.

• T(rv) = rT(v), for all v ∈ V, r ∈ F.

We see that the first condition says that T needs to preserve vector addition and the second that it preserves scalar multiplication. We note that the two vector spaces V, W need not be the same vector space, or even have the same dimension. This also means that the addition operations need not be the same, since v1 + v2 is computed in V while T(v1) + T(v2) is computed in W. From the definition a couple of properties follow immediately; it is recommended to stop and think about why, if it isn't immediately clear.

• T (0) = 0

• T (−v) = −T (v), for all v ∈ V .

• T(r1 v1 + r2 v2 + ... + rn vn) = T(r1 v1) + T(r2 v2) + ... + T(rn vn) for all vi ∈ V and all ri ∈ F.

Many important transformations are in fact linear, some important examples are:

• The identity map x → x and the zero map x → 0.

• Rotation, scaling and projection in geometry.

• Differentiation.

• Change of one basis to another.

• The expected value of a random variable.

The importance of many linear transformations gives us motivation to study them further and to try to find properties of them that we can use. In fact, just by knowing that a transformation is linear we already know two useful properties from the definition. For example, it doesn't matter if we first add two vectors and then change their basis, or if we first change both their bases and then add them. Often it is possible to check whether a transformation is linear using only the definition, as in the example below.

We define a function T : R² → R³ such that:

T([x, y]^T) = [x + y, 2x − y, y]^T

We then want to show that it is a linear transformation.

T([x1, y1]^T + [x2, y2]^T) = [(x1 + y1) + (x2 + y2), (2x1 − y1) + (2x2 − y2), y1 + y2]^T
= [x1 + y1, 2x1 − y1, y1]^T + [x2 + y2, 2x2 − y2, y2]^T = T([x1, y1]^T) + T([x2, y2]^T)

And the first requirement is true.

For the second requirement T(rv) = rT(v) we get:

T(r [x, y]^T) = T([rx, ry]^T) = [rx + ry, 2rx − ry, ry]^T

= r [x + y, 2x − y, y]^T = r T([x, y]^T)

And the second requirement is true as well. Next we look at how linear transformations act on basis vectors and how this can be used in order to create a linear transformation.

Theorem 8.1. Let V, W be vector spaces over the field F. Given a basis e1, e2, ..., en of V and any set of vectors {w1, w2, ..., wn} ⊂ W, there exists a unique linear transformation T : V → W such that T(ei) = wi.

• Additionally we have:

T (v1e1 + v2e2 + ... + vnen) = v1w1 + v2w2 + . . . vnwn

Proof. We start by looking at the uniqueness.

Suppose two transformations S, T exist such that T(ei) = S(ei) = wi. Then they must be the same, since e1, e2, ..., en spans V. You can easily see that they must be the same transformation by writing S(v) and T(v) for any v and expanding v in the basis vectors. We can conclude that the transformation is unique if it exists. Next we want to show that such a linear transformation always exists.

Given v = v1e1 + v2e2 + ... + vnen ∈ V , where v1, v2, . . . , vn are uniquely determined since e1, e2,..., en is a basis of V . We define a transformation T : V → W as:

T (v) = T (v1e1 + v2e2 + ... + vnen)

= v1w1 + v2w2 + ... + vnwn

For all v ∈ V . We clearly have that T (ei) = wi and showing that T is linear can be done simply by trying:

T (av1 + bv2) = aT (v1) + bT (v2)

Using this we see that it is quite easy to define a linear transformation, since we only need to specify where to send the basis vectors; everything else is then handled by the linearity. This result also gives an easy way to check whether two linear transformations are equal: simply check if they give the same result when applied to the basis vectors. If they do, they must be equal because of the uniqueness. We continue by looking at matrices multiplied with vectors acting as transformations.

Consider the m × n matrix A and define a function T_A : R^n → R^m such that T_A(v) = Av for all (column) vectors v ∈ R^n. Show that T_A is a linear transformation.

We use the definition and immediately get:

TA(v1 + v2) = A(v1 + v2) = Av1 + Av2 = TA(v1) + TA(v2)

TA(rv) = A(rv) = r(Av) = rTA(v)

And T_A must be a linear transformation.

We see that every m × n matrix A defines a linear transformation T_A : R^n → R^m. This gives us a new perspective from which to view matrices, namely we can view them as linear transformations on vector spaces. But even more importantly, we will also see that every linear transformation T : R^n → R^m can be written in this way using an m × n matrix.

Theorem 8.2. If T : R^n → R^m is a linear transformation, then there exists an m × n matrix A such that T(v) = Av for all column vectors v ∈ R^n. Additionally, if e1, e2, ..., en is the standard basis in R^n, then A can be written as:

A = [T (e1) T (e2) ...T (en)]

Where T (ei) are the columns of A. The matrix A is called the standard matrix of the linear transformation T .

8.1 Linear transformations in geometry

The matrix representation of linear transformations is especially useful in geometry and 3D graphics, especially since most commonly used transformations, such as rotation and scaling, are in fact linear. If we can find the standard matrix for a certain transformation we can then easily apply that transformation to a large number of vectors with a simple matrix-vector multiplication. This is especially important in, for example, computer graphics, where we often work with a huge number of points even for very primitive objects, which need to be moved whenever the object or camera (viewpoint) moves. Here we will give some examples of linear transformations in geometry and show a couple of methods we can use to find the standard matrix.

Using the previous result we find the standard matrix for a couple of common transformations R^n → R^m. We start in R² → R² by finding the standard matrix for rotation anti-clockwise about the origin. By drawing the unit circle we can easily see that if we rotate the first basis vector [1 0]^T by the angle θ we end up at [cos θ sin θ]^T, and if we rotate the second one, [0 1]^T, we end up at

[− sin θ cos θ]^T

Our standard matrix is thus

A = [ cos θ  − sin θ
      sin θ    cos θ ]

and we can use this to rotate any point by the angle θ about the origin. For example, remembering that the angle between two orthogonal vectors is θ = 90°, given a vector v = [x y]^T we can find an orthogonal vector in 2D by multiplying with:

A = [ cos 90°  − sin 90°     [ 0  −1
      sin 90°    cos 90° ] =   1   0 ]

This gives Av = [−y x]^T, which is easily seen to give zero when taking the dot product with v. But what if we now instead consider this rotation in the xy-plane in R³? Looking at rotation around the z-axis we get the matrix:

A = [ cos θ  − sin θ  0
      sin θ    cos θ  0
      0        0      1 ]

since for the first two basis vectors we rotate in the plane we had in two dimensions, and the last basis vector, in the direction of the z-axis, should remain unchanged. We can find the rotation matrices around the other axes in the same way.
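A small sketch of these rotation matrices in use (Python with numpy; the function and variable names are ours, and the angle is given in radians):

    import numpy as np

    def rot_z(theta):
        """Standard matrix for rotation by theta radians about the z-axis."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, 0.],
                         [s,  c, 0.],
                         [0., 0., 1.]])

    v = np.array([1., 0., 0.])
    print(np.round(rot_z(np.pi / 2) @ v, 10))   # [0, 1, 0]: a 90-degree rotation in the xy-plane

    # In 2D, rotating v = (x, y) by 90 degrees gives (-y, x), which is orthogonal to v
    A = np.array([[0., -1.], [1., 0.]])
    v2 = np.array([3., 2.])
    print((A @ v2) @ v2)                        # 0: the dot product vanishes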

Next we take a look at projection in R² → R², where we want to find the standard matrix for projection on the line y = ax. We let d = [1 a]^T be the direction vector of the line. Then, using the projection formula with the basis vectors, we get:

proj_d(1, 0) = (⟨(1, 0), (1, a)⟩/‖(1, a)‖²) d = (1/(1 + a²)) (1, a)

proj_d(0, 1) = (⟨(0, 1), (1, a)⟩/‖(1, a)‖²) d = (a/(1 + a²)) (1, a),

which gives the standard matrix

A = (1/(1 + a²)) [ 1  a
                   a  a² ]

But what if we instead want to find the standard matrix for projection on the plane described by x − z = 0 in R³? We introduce a slightly different method, more suitable for higher dimensions. We have an orthogonal basis U = {(1, 0, 1), (0, 1, 0)} for the points in the plane. Then, instead of finding the projection of each basis vector on the plane individually, we use the projection formula to find the projection of a general point onto this plane. Doing this yields:

proj_U(x, y, z) = (⟨(x, y, z), (1, 0, 1)⟩/‖(1, 0, 1)‖²) (1, 0, 1)

+ (⟨(x, y, z), (0, 1, 0)⟩/‖(0, 1, 0)‖²) (0, 1, 0) = ((x + z)/2, y, (x + z)/2)

We then split it up into the parts belonging to x, y, z to get the columns for the individual basis vectors:

A = [ 1/2  0  1/2
      0    1  0
      1/2  0  1/2 ]

While we probably could have found the standard matrix for this projection regardless, we can use the same methodology for more complex subspaces in higher dimensions.

• First find an orthogonal basis for the subspace.

• Then find the projection of the basis vectors in the original space on the subspace.

We then get a matrix for this transformation such that we can calculate the projection of any point onto the subspace with a simple matrix-vector multiplication. Before continuing with some more theory we return to R² and look at the reflection in the line y = ax.

• The reflection of v is equal to 2 proj_d(v) − v (draw a picture).

This gives:

A = (2/(1 + a²)) [ 1  a        [ 1  0
                   a  a² ]  −    0  1 ]

  = (1/(1 + a²)) [ 1 − a²   2a
                   2a       a² − 1 ]

We note that the methodology presented here is possible since the transformations in question are all linear. While most of the transformations needed here are linear, we will see later that there is at least one critical non-linear transformation we need to be able to perform in many applications. Can you think of which one?

8.2 Surjection, injection and combined transformations

While it is nice that we can, for example, rotate a point, what if we want to combine transformations? For example, maybe we want to rotate around a point which itself rotates around another point. We will also give a way to turn certain non-linear transformations into linear ones, a method essential to many applications in computer graphics. We start by defining what we mean by a surjection or injection.

Definition 8.2
A linear transformation (or more generally a function) F : V → W is a:

• Surjection, or onto, if every element w ∈ W can be written w = F(v) for at least one vector v ∈ V.

• Injection, or one-to-one, if every element in W is mapped to by at most one element in V.

• If it is both surjective and injective it is called a bijection.

One important property is that for F to be invertible it needs to be an injection, since otherwise F⁻¹(w) would correspond to at least two elements in V for some w ∈ W.

Theorem 8.3. Let F : V → W and G : W → U be two linear transformations. Then the combined transformation G◦F : V → U is linear. If F is invertible, then the inverse F −1 : W → V is linear.

Proof. For the first part we get:

G ◦ F (au + bv) = G(F (au + bv)) = G(F (au)) + G(F (bv))

= aG(F (u)) + bG(F (v)) For the second part we get:

F (F −1(au + bv)) = au + bv = aF (F −1(u)) + bF (F −1(v))

= F (aF −1(u) + bF −1(v))

since F is linear and injective. From this it follows that: F⁻¹(au + bv) = aF⁻¹(u) + bF⁻¹(v)

But if the composition of two linear transformations is itself a linear transformation, how do we represent the combined linear transformation G ◦ F ? If we look at the matrix representations of the transformations we get:

AF = [F (e1) F (e2) ...F (en)]

A_G = [G(e1) G(e2) ... G(en)]

Then:

A_{G∘F} = [G(F(e1)) G(F(e2)) ... G(F(en))] = A_G A_F

In other words, we can combine two linear transformations simply by multiplying their respective standard matrices. We note that since matrix multiplication is not commutative, the order in which we apply the linear transformations does affect the resulting transformation. The transformation we want to apply "first" should be to the right in the matrix multiplication.

8.3 Application: Transformations in computer graphics and homogeneous coordinates

We want to make a simple computer-graphics model of our solar system containing the sun and the planets as well as their moons. For simplicity we assume all bodies are spherical and all bodies move in circular paths around something else. To work with, we have the points of a unit sphere located at the origin. This is a good example of how we can combine relatively simple transformations in order to create very complex paths, such as the movement of someone on the Moon's surface relative to the sun.

We start by assuming the sun is stationary, positioned in the middle of the solar system (the origin), and everything else revolves around it. Then we only need to scale our unit sphere to the appropriate size as well as rotate all the points on the sphere around some axis. Since both rotation and scaling are linear transformations we can apply this transformation without further problems. We let t be a point in time, R_sun be the rotation of the sun around its own axis and S_sun be the standard matrix for the scaling needed to get the appropriate size; we then get the standard matrix for the points on the unit sphere representing the sun as

A_sun = R_sun S_sun

Let's say we choose to let the sun rotate around the z-axis; with the rotation matrix as before we get, at time t:

A_sun = [ cos θ_s t  − sin θ_s t  0     [ s_sun  0      0
          sin θ_s t    cos θ_s t  0   ·   0      s_sun  0
          0            0          1 ]     0      0      s_sun ]

      = [ s_sun cos θ_s t  − s_sun sin θ_s t  0
          s_sun sin θ_s t    s_sun cos θ_s t  0
          0                  0                s_sun ]

We are now finished with the sun; in every frame we multiply all the points on the unit sphere with the corresponding matrix to get their positions in our animation.

However, as soon as we want to animate our first planet we immediately run into problems: moving a point a fixed amount in some direction (translation) is not a linear transformation. You can easily realize this if you let T be translation in R: T(x) = x + a gives T(0 + 0) = a ≠ T(0) + T(0) = 2a. The same is obviously true in higher dimensions. While we could start by adding a vector to every point and then rotate around the origin, we need to take great care in order to get the rotation around the planet's own axis right (rotate the points, then add the vector, then finally rotate around the origin), and it now requires two matrix multiplications and one vector addition. Additionally, as soon as we want to add a moon or make more complex objects, the problem becomes more and more complicated. Luckily there is a way to handle these transformations, a trick you could say, in which we represent our n-dimensional vectors in n + 1 dimensions. We represent a vector v = (v1, v2, ..., vn) in n dimensions as the vector v = (v1, v2, ..., vn, 1) in n + 1 dimensions. For two dimensions this can be seen as representing the points in the xy-plane by the points in the plane one unit above the xy-plane in three dimensions.

Now we can represent the translation x′ = x + t_x, y′ = y + t_y, z′ = z + t_z (in three dimensions) using matrix multiplication:

[ x′     [ 1  0  0  t_x   [ x
  y′   =   0  1  0  t_y     y
  z′       0  0  1  t_z     z
  1  ]     0  0  0  1   ]   1 ]

This is commonly referred to as homogeneous coordinates. We note that the last element of every vector is always 1 and that the last row of the standard matrix is always the same as well. So by using one extra dimension, even if the value of every vector in this extra dimension is always one, we can turn translation into a linear transformation.

• We note that the origin is no longer represented by the zero vector (0, 0, 0) in homogeneous coordinates, since it is there represented by (0, 0, 0, 1).

• It is also easy to write any of the previous, already linear, transformations in homogeneous coordinates: we simply add another row and column with all zeros except for the bottom right corner, which we set to one.

We are now ready to return to our animation of the solar system, using homogeneous coordinates instead.

• We start by rewriting the transformation matrix for the points on the sun's surface, simply by adding the additional row and column:

A_sun = [ s_sun cos θ_s t  − s_sun sin θ_s t  0      0
          s_sun sin θ_s t    s_sun cos θ_s t  0      0
          0                  0                s_sun  0
          0                  0                0      1 ]

Next we want to find the transformation matrix for the points on the earth's surface, starting from another unit sphere at the origin. A good guideline is to start by applying all transformations of the object in reference to itself, and then apply the transformations with respect to the "closest" reference point, in this case the position of the sun.

• We start by applying the scaling S_E and the rotation around its own axis R_E, just like we did with the sun.

• Then we need to translate the planet away from the sun at the origin. We apply the translation matrix T_ES so that it is appropriately far away.

• We then apply another rotation, R_ES, to rotate all of these points around the sun (the origin), representing the rotation of the earth around the sun.

• Applying all of the transformations in order gives the transformation matrix A_E = R_ES T_ES R_E S_E.

In particular we note that the first two factors, R_ES T_ES, describe the position of the earth's centre relative to the sun, and the last two, R_E S_E, give the position of the points on the earth's surface relative to a unit sphere positioned at the earth's core. We can reuse the first (already calculated) half of the transformation when finding the transformation matrix for the moon. In the same way we get the transformations S_M, R_M applying the scaling and the rotation around the moon's own axis. We consider the position of the earth as the origin and translate and rotate around that, giving T_ME, R_ME. We notice that we now have the position of the moon relative to the centre of the earth; we then apply the transformations R_ES T_ES previously applied to the earth's centre to get the moon's coordinates relative to the origin. This gives the transformation matrix:

A_M = R_ES T_ES R_ME T_ME R_M S_M

We notice that we can interpret this as follows (from right to left):

• R_M S_M: the position of the points on the moon's surface relative to a unit sphere at the centre of the moon, times:

• R_ME T_ME: the position of the centre of the moon relative to the centre of the earth, times:

• R_ES T_ES: the position of the centre of the earth relative to the centre of the sun (the origin).

This methodology is used extensively in computer graphics and robotics in order to describe very complex movements using relatively simple transformations and hierarchies.
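A minimal sketch of this hierarchy in homogeneous coordinates is given below (Python with numpy). The helper functions and all numerical values (scales, distances, angular speeds) are made-up placeholders, not values from the text; only the structure of the matrix products follows the discussion above.

    import numpy as np

    def scale(s):
        """Homogeneous 4x4 uniform scaling matrix."""
        return np.diag([s, s, s, 1.0])

    def rot_z(theta):
        """Homogeneous 4x4 rotation about the z-axis."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, 0., 0.],
                         [s,  c, 0., 0.],
                         [0., 0., 1., 0.],
                         [0., 0., 0., 1.]])

    def translate(tx, ty, tz):
        """Homogeneous 4x4 translation matrix."""
        T = np.eye(4)
        T[:3, 3] = [tx, ty, tz]
        return T

    t = 1.0                               # time; speeds and distances below are made up
    A_earth = rot_z(0.5 * t) @ translate(10., 0., 0.) @ rot_z(3.0 * t) @ scale(1.5)
    A_moon = (rot_z(0.5 * t) @ translate(10., 0., 0.) @ rot_z(8.0 * t)
              @ translate(2., 0., 0.) @ rot_z(20.0 * t) @ scale(0.4))

    p = np.array([1., 0., 0., 1.])        # a point on the unit sphere, homogeneous coordinates
    print(A_earth @ p)                    # where that point on the earth's surface ends up
    print(A_moon @ p)                     # and the corresponding point on the moon's surface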

8.4 Kernel and image

To every linear transformation T : V → W we associate two linear spaces, one of them a subspace of V and the other a subspace of W.

Definition 8.3 Given the linear transformation T .

• R(T ) = {w|w = T (v), v ∈ V } ⊂ W , is called the image of T .

• In other words the subspace of W which is mapped to from V

• N(T ) = {v|T (v) = 0} ⊂ V . is called the Kernel of T or the Nullspace of T .

• In other words the subspace of V for which the linear transformation maps to the zero-vector in W

If we look at the transformation in terms of its standard matrix, we see that the image R(T) consists of the right-hand sides b for which the matrix equation Ax = b has a solution, while the kernel N(T) consists of the solutions to Ax = 0.

Theorem 8.4. Given the linear transformation T : V → W , the kernel N(T ) is a subspace of V and the image R(T ) is a subspace of W .

Proof. Since T (0) = 0 they both contain the zero vector. For the kernel we get:

T (av1 + bv2) = aT (v1) + bT (v2) = a0 + b0 = 0, v1, v2 ∈ N(T )

And av1 + bv2 lies in N(T ) since it satisfies the condition. So N(T ) is a subspace of V . For the image we use: w1 = T (v1), w2 = T (v2)

v1, v2 ∈ V This gives:

aw1 + bw2 = aT (v1) + bT (v2) = T (av1) + T (bv2) = T (av1 + bv2)

And aw1 + bw2 lies in R(T), so R(T) is a subspace of W. We can use this to easily check whether a linear transformation is injective, surjective and bijective (and invertible).

Theorem 8.5. A linear transformation T : V → W is:

• surjective ⇔ R(T ) = W

• injective ⇔ N(T ) = {0}

• bijective (and invertible) ⇔ N(T ) = {0} and R(T ) = W .

If the two vector spaces have the same dimension (square standard matrix), it is actually enough to show only that the transformation is injective or surjective; it will then always be bijective as well.

Theorem 8.6. Assume a linear transformation T : V → W , where dim V = dim W .

• If N(T ) = {0} or R(T ) = W then T is bijective.

In fact we can show even more, we can show something about the combined dimension of the kernel and image:

Theorem 8.7. Assume we have a linear transformation T : V → W , where V is finite dimensional, then:

• dim V = dim N(T ) + dim R(T )

We move on to a short example showing how we can use what we have learned here. Hermite interpolation is a method for finding a curve through a given number of points. This is often needed in computer graphics, as well as in construction for calculating the strength of materials. We will look at one subproblem where we want to fit a curve through two points in the plane where the slope is specified at both points. The simplest interpolation function we can think of in this case is a third order polynomial,

y = a + bx + cx² + dx³,

which we want to fit so that the curve goes through the points with the given slopes. In order to do this we get a system of four equations in four unknowns:

y(x1) = a + b x1 + c x1² + d x1³ = y1
y′(x1) = b + 2c x1 + 3d x1² = s1
y(x2) = a + b x2 + c x2² + d x2³ = y2
y′(x2) = b + 2c x2 + 3d x2² = s2

Writing it in matrix form we get a linear system in the four unknowns (a, b, c, d) with matrix A:

A = [ 1  x1  x1²   x1³
      0  1   2x1   3x1²
      1  x2  x2²   x2³
      0  1   2x2   3x2² ]

While we know how to solve this problem, we ask ourselves: is it always possible to solve this system? Obviously we could simply solve it and see, but doing so (especially by hand) is rather tedious work even for this relatively small matrix.

• You could also calculate the determinant, which is less work than solving the system but still a lot of work.

• Instead we look at the transformation

T : y → T(y) = [ y(x1)      [ a + b x1 + c x1² + d x1³
                 y′(x1)   =   b + 2c x1 + 3d x1²
                 y(x2)        a + b x2 + c x2² + d x2³
                 y′(x2) ]     b + 2c x2 + 3d x2² ]

as the transformation from V = {y | y(x) ∈ P3} to W = R⁴. Using the basis {1, x, x², x³} in V, this is clearly a linear transformation with the matrix A above. We take a look at the kernel of T. For a polynomial in N(T) we have

y(x1) = y(x2) = 0,  y′(x1) = y′(x2) = 0

So it has roots at x1 and x2 (with x1 ≠ x2), both with multiplicity two. However, y is a third order polynomial and can have at most 3 roots (counting multiplicity) unless it is the zero polynomial. So N(T) = {0} and T is injective. Since T is injective and dim P3 = dim R⁴, T is bijective and invertible, so the system has a unique solution.
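The invertibility can also be checked numerically: the sketch below (Python with numpy; the function name is ours) builds the matrix A for given points and solves for the coefficients.

    import numpy as np

    def hermite_cubic(x1, y1, s1, x2, y2, s2):
        """Coefficients (a, b, c, d) of y = a + b x + c x^2 + d x^3 passing through
        (x1, y1) with slope s1 and (x2, y2) with slope s2."""
        A = np.array([[1., x1, x1**2,    x1**3],
                      [0., 1., 2 * x1,   3 * x1**2],
                      [1., x2, x2**2,    x2**3],
                      [0., 1., 2 * x2,   3 * x2**2]])
        return np.linalg.solve(A, [y1, s1, y2, s2])

    a, b, c, d = hermite_cubic(0., 0., 1., 1., 1., 0.)
    print(a, b, c, d)        # approximately (0, 1, 1, -1), i.e. y = x + x^2 - x^3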

We remember the definition of rank for a square matrix:

Definition 8.4 For A ∈ Mn×n(F) the rank(A) is equivalent to: • The number of linearly independent rows or columns.

• The dimension of the column space of A.

But we can give even more definitions of rank; we now move on to define the rank of a matrix A ∈ M_{m×n}(F) using the kernel and image of A.

Definition 8.5
Given a linear transformation T : V → W with standard matrix A.

• The rank, or column rank, r(A) is the dimension of the image of A: dim R(A), with elements Ax, x ∈ F^n.

• The row rank is the dimension of the image of A^T: dim R(A^T), with elements A^T y, y ∈ F^m.

• The nullity, or right nullity, v(A) is the dimension of the kernel of A: dim N(A), with elements x ∈ F^n such that Ax = 0.

• The left nullity is the dimension of the kernel of A^T: dim N(A^T), with elements y ∈ F^m such that A^T y = 0.

From this immediately follows:

• Nullity zero: v(A) = 0 ⇔ A has a left inverse.

• Rank m: r(A) = m ⇔ A has a right inverse.

We can of course do the same for A^T; however, we will see that these quantities are related in such a way that, knowing only one of the four values (dim R(A), dim R(A^T), dim N(A), dim N(A^T)) together with the size of A, we can immediately find the others.

Theorem 8.8. For every matrix A we have:

dim R(A) = dim R(A>)

In other words, for non-square matrices, just as for square ones, the column rank and the row rank are the same. As with a square matrix, you can show this by writing the matrix in echelon form. We end this part with a short example.

Let A be an n × m matrix with rank r. We want to show that the space of all solutions to Ax = 0, a system of n homogeneous equations in m variables, has dimension m − r. This space is the kernel N(T) of the transformation T : R^m → R^n given by T(v) = Av. We use

dim V = dim N(T) + dim R(T)

We already have the dimension of the image R(T), which is the rank r. The dimension of V is m, and we immediately get the dimension of the kernel as dim N(T) = m − r.

8.5 Isomorphisms

Definition 8.6
A linear transformation T : V → W is called an isomorphism if it is both surjective and injective (bijective).

• Two vector spaces V and W are called isomorphic if there exists an isomorphism T : V → W.

• We denote this by writing V ≅ W.

We see that for two vector spaces that are isomorphic, we have a pairing v → T(v), v ∈ V, T(v) ∈ W. This pairing preserves the vector addition and scalar multiplication in the respective spaces. Because addition and multiplication in one space are completely determined by the same operations in the other, all vector space properties of one can be determined by those of the other. In essence you could say the two vector spaces are identical apart from notation.

Theorem 8.9. For two finite dimensional spaces V, W and a linear transformation T : V → W the following are equivalent:

• T is an isomorphism.

• If {e1, e2,..., en} is a basis of V then {T (e1),T (e2),...,T (en)} is a basis of W .

• There exists a basis {e1, e2, ..., en} of V such that {T(e1), T(e2), ..., T(en)} is a basis of W.

• There exists an inverse linear transformation T⁻¹ : W → V such that T⁻¹T = 1_V and TT⁻¹ = 1_W.

Here we see one of the important relations between isomorphic vector spaces: if we have a basis in one, then given an isomorphism we can also find a basis in the other. If both vector spaces have finite dimension we can say even more.

Theorem 8.10. For the finite dimensional spaces V,W the following are equivalent.

• V =∼ W ,(V and W isomorphic).

• dim V = dim W

In other words, the dimension of a vector space decides which spaces it is isomorphic to. For example, we can immediately conclude that the linear space R³ is isomorphic to P2, the polynomials of degree at most two (polynomials that can be written in the form a + bx + cx²). Some examples of important isomorphisms are:

• The Laplace transform, mapping (hard) differential equations into algebraic equations.

• Graph isomorphisms T : V → W such that if there is an edge between vertices u, v in V, there is an edge between T(u), T(v) in W.

• Others include isomorphisms on groups and rings (such as fields).

This also gives a little motivation for why the study of isomorphisms is useful: if we have a problematic vector space where something we want to do is very complicated, and we can find another isomorphic vector space where the corresponding operation is easier, then we can formulate our problem in this new vector space instead, do our now easier calculations and then change back, effectively sidestepping the whole issue.

9 Regression and the Least Square Method

Many of the properties and ideas we have examined apply to square matrices; a good example of this is Theorem 1.1, which lists nine conditions that are all equivalent to a square matrix being invertible. If a square matrix A is invertible then the linear equation system Ax = y is solvable. In this section we will discuss a scenario where we have a system of linear equations where the A-matrix has more rows than columns. This kind of system is called an overdetermined system (we have more equations than unknowns), and if A is an n × m matrix with n > m it is said to have n − m redundant equations. An overdetermined system is often not solvable. The condition that the equation system is solvable can be visualised in different ways depending on the dimensions of A. If m = 2 the system can be solved if yi = a_{i1} x1 + a_{i2} x2 for all i = 1, 2, ..., n. This can be interpreted as the vectors x and y representing points on a straight line; this property is called collinearity. If m = 3 the system is solvable if yi = a_{i1} x1 + a_{i2} x2 + a_{i3} x3 for all i = 1, 2, ..., n. This can be interpreted as the vectors x and y representing points (in three dimensional space) which all lie in a plane, see Figure 13c; this property is called coplanarity. In higher dimensions (m > 2) the condition that the system is solvable corresponds to all the points lying in some (m − 1)-dimensional volume (sometimes called a hyperplane).

In reality collinearity and coplanarity are rare. On the other hand, having a number of points near but not on a straight line is a common situation. One of the simplest ways to model a relation between two quantities is to use a straight line, y(x) = a + bx, and often experiments are designed in such a way that this kind of relation is expected, or the quantities measured are chosen such that the relation should appear. In reality any system, be it the temperature of a piece of machinery, the amplitude of a signal or the price of a bond, will be subject to some disturbance, often referred to as noise, that will move the measured values away from the ideal line. The images in Figures 13b and 13d were generated from the images in 13a and 13c in this manner. In this situation the two most common problems are:

• Hypothesis testing: does this data fit our theory?

• Parameter estimation: how well can we fit this model to our data?

Essentially these problems are about finding the best version of our model, in the sense that we want to use the data we have to construct a model that predicts the data as well as possible. Finding the 'best' version of the model is usually referred to as a regression problem. Often it is also said that we fit the model to the data. In this section we will discuss a few ways to do this. Naturally we need to define what we mean by the 'best' (fitted) version of the model. Usually this is done in what is called the least-square sense. To define what we mean by least-square sense we first note that for each x_i in x there

(a) A set of points on a line. (b) A set of points near but not on a line.

(c) A set of points in a plane. (d) A set of points near but not on a plane.

Figure 13: Illustrations of collinearity, coplanarity and linear regression.

is a corresponding value given by our model, f(x_i) = \sum_{j=1}^{m} a_j g_j(x_i), and a corresponding measurement y_i. How well our model fits the data can be described by examining the distances between the y_i and f(x_i) values. We can create a vector that describes these distances for each measurement, called the residual vector or the error vector, e:

e = Ax − y.

We can then measure the normal (Euclidean) length of this vector.

|e| = \sqrt{\sum_{i=1}^{n} |e_i|^2} = \sqrt{e^T e}.

Minimizing this length gives us the best-fitting line in the least-square sense. When we fit our model in the least-square sense we often say that we use the least square method (LSM). We will now see how we can fit various kinds of models.

9.1 Multiple linear regression

Linear functions of different kinds are simple ways to model a relation between different quantities. A common and familiar relationship between two quantities is the straight line, y(x) = a + bx. When we use this model it is often called single linear regression. This can also easily be expanded into other kinds of functions, not just straight lines. We can use any function of the form

f(x_i) = \sum_{j=1}^{m} a_j g_j(x_i)

where x_i for all i = 1, 2, ..., n are (not necessarily independent) variables and g_j are functions. This is called multiple linear regression (MLR). We will start with a closer look at single linear regression.

9.1.1 Single linear regression

Suppose we have two lists of values as in table 1

x x1 x2 . . . xn

y y1 y2 . . . yn

Table 1: Example table for a regression problem in one variable.

and we want to find a line l(x) = a + bx that fits the table as well as possible in the least square sense. Our problem can be described in the following way:

Xr = y, \quad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad r = \begin{bmatrix} a \\ b \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

Here x = [x_1 x_2 ... x_n]^T and y is the data from the table and r is the vector of coefficients of the line that we want to find. The residual vector, e, is as follows:

e = Xr − y = \begin{bmatrix} a + bx_1 − y_1 \\ a + bx_2 − y_2 \\ \vdots \\ a + bx_n − y_n \end{bmatrix}.

We can then measure the normal (Euclidean) length of this vector

|e| = \|e\|_2 = \sqrt{\sum_{i=1}^{n} |e_i|^2} = \sqrt{e^T e}.

Minimizing this length gives us the best-fitting line in the least-square sense. Exactly how this is calculated will be discussed in section 9.2 since the same principle can be used for both single and multiple regression.
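As a small illustration of the fit described above (not one of the compendium's own examples), the following Python sketch builds the matrix X for a straight line and solves the problem numerically; the data values and the use of numpy are assumptions made for the example.

```python
import numpy as np

# Minimal sketch: fit l(x) = a + b*x in the least-square sense.
# The data below is synthetic (a noisy line), chosen only for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x + np.random.normal(scale=0.3, size=x.size)

X = np.column_stack([np.ones_like(x), x])   # rows [1, x_i] as in the text
r, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes |Xr - y|
a, b = r
print("a ≈", a, " b ≈", b)
print("residual norm:", np.linalg.norm(X @ r - y))
```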

9.1.2 Fitting several variables

Consider the problem given in the previous section, but suppose we want to fit a parabola (a second degree polynomial, p(x) = a + bx + cx^2) instead of a line. Then the problem can be written:

Xr = y, \quad X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}, \quad r = \begin{bmatrix} a \\ b \\ c \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

This can be extended in the same way for a polynomial of any degree smaller than n − 1 where n is the number of measurements. Suppose we try to fit a polynomial of degree k (p_k(x) = r_0 + r_1 x + ... + r_k x^k) with 1 < k < n − 1, then we can write the problem as:

Xr = y, \quad X = \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^k \\ 1 & x_2 & x_2^2 & \dots & x_2^k \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^k \end{bmatrix}, \quad r = \begin{bmatrix} r_0 \\ r_1 \\ \vdots \\ r_k \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

If a polynomial of degree k = n − 1 is chosen we get the following situation:

Xr = y, \quad X = \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \dots & x_2^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^{n-1} \end{bmatrix}, \quad r = \begin{bmatrix} r_0 \\ r_1 \\ \vdots \\ r_{n-1} \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

Here the matrix X will be square and the system will be solvable as long as all the x-values are different. This means that for every set of n measurements there is a polynomial of degree n − 1 that passes through each point in the measurement. This is often referred to as interpolation of the data set (and in this case it is equivalent to Lagrange's method of interpolation). The matrix X here is known as the Vandermonde matrix. This (and similar) matrices appear in many different applications and it is also known for having a simple formula for calculating its determinant.

x1 g1(x11) g1(x21) . . . g1(xn1)
x2 g2(x12) g2(x22) . . . g2(xn2)
. . .
xm gm(x1m) gm(x2m) . . . gm(xnm)

y y1 y2 . . . yn

Table 2: Example table for a regression problem in m variables.

Suppose instead we have a more general problem with 1 < m ≤ n − 1 lists of data, see table 2. Previously we used LSM to fit polynomials to data. This is equivalent to saying that g_j(x_i) = x_i^{j-1} in table 2. There are many other functions that can be fitted using LSM. Consider a function written in the form

f(x) = \sum_{j=1}^{m} a_j g_j(x_{ij})

where x_{ij} for all i = 1, 2, ..., n, j = 1, 2, ..., m are (not necessarily independent) variables and g_j are functions. We then get the following matrix description of the problem

Xa = y, \quad X = \begin{bmatrix} g_1(x_{11}) & g_2(x_{12}) & \dots & g_m(x_{1m}) \\ g_1(x_{21}) & g_2(x_{22}) & \dots & g_m(x_{2m}) \\ \vdots & \vdots & \ddots & \vdots \\ g_1(x_{n1}) & g_2(x_{n2}) & \dots & g_m(x_{nm}) \end{bmatrix}, \quad a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

Note that the linearity in multiple linear regression does not mean that the function that we try to fit must be linear with respect to x_i, but instead with respect to the parameters a_i that we can change. For example, if we want to fit a trigonometric function, f(x), instead of a polynomial to some data and f(x) looks like this

f(x) = a sin(x) + b sin(2x) + c cos(x)

then the fitting problem can be described in this way

Xa = y, \quad X = \begin{bmatrix} \sin(x_1) & \sin(2x_1) & \cos(x_1) \\ \sin(x_2) & \sin(2x_2) & \cos(x_2) \\ \vdots & \vdots & \vdots \\ \sin(x_n) & \sin(2x_n) & \cos(x_n) \end{bmatrix}, \quad a = \begin{bmatrix} a \\ b \\ c \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

For this kind of problem we do not have the same simple guarantees that the X matrix is invertible though.
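A corresponding sketch for the trigonometric model above, with synthetic data (an assumption for the example, not taken from the compendium): each basis function g_j becomes one column of X.

```python
import numpy as np

# Sketch: multiple linear regression with the basis sin(x), sin(2x), cos(x).
x = np.linspace(0.0, 6.0, 40)
y = 1.5 * np.sin(x) - 0.7 * np.sin(2 * x) + 0.3 * np.cos(x) \
    + np.random.normal(scale=0.05, size=x.size)       # synthetic measurements

# Column j of X holds g_j evaluated at every x_i.
X = np.column_stack([np.sin(x), np.sin(2 * x), np.cos(x)])
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("a, b, c ≈", a_hat)
```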

9.2 Finding the coefficients

How can we actually find the coefficients that minimize the length of the residual vector? First we can define the square of the length of the residual vector as a function

s(e) = e^T e = \sum_{i=1}^{n} |e_i|^2 = (Ax − y)^T (Ax − y).

This kind of function is a positive second degree polynomial with no mixed terms and thus has a global minimum where ∂s/∂e_i = 0 for all 1 ≤ i ≤ n. We can find the global minimum by looking at the derivative of the function; e_i is determined by x_i and

\frac{∂e_i}{∂x_j} = A_{i,j}

thus

\frac{∂s}{∂x_j} = \sum_{i=1}^{n} 2 e_i \frac{∂e_i}{∂x_j} = \sum_{i=1}^{n} 2 (A_{i.}x − y_i) A_{i,j} = 0 \quad ⇔ \quad A^T A x = A^T y.

The linear equation system A^T A x = A^T y is called the normal equations. The normal equations are not overdetermined and will be solvable if the x-values are different. Finding the solution to the normal equations gives the coefficients that minimize the error function s. The reason that A^T A x = A^T y are called the normal equations is that B = A^T A is a normal matrix.

Definition 9.1
A normal matrix is a matrix for which A^H A = A A^H. An alternate definition would be A = U D U^H where U is a unitary matrix and D is a diagonal matrix.

Proof that the definitions are equivalent. Here we can use a type of factorization that we have not discussed previously in the course called Schur factorization:

A = U R U^H where U is unitary and R triangular.

A^H A = U R^H U^H U R U^H = U R^H R U^H = U R R^H U^H = A A^H if and only if R is diagonal (check this yourself using matrix multiplication).

A^H A = U D^H U^H U D U^H = U D^H D U^H = U D D^H U^H = A A^H since diagonal matrices always commute.

Normal matrices are always diagonalizable, which means they are invertible if D has full rank. If B = A^T A and A has full column rank (equivalent to rank(A) = m) then B = U D U^H with rank(D) = m. Thus: if rank(A) = m then x = (A^T A)^{-1} A^T y minimizes the error function, or 'solves' Ax = y in the least square sense.
Sometimes the normal equations can become easier to solve if the A-matrix is factorized. One useful example of a factorization that can simplify the calculation is the QR-factorization that you will learn how to do in section 12.2. Using QR-factorization we get A = QR with Q being an orthogonal matrix, Q^T = Q^{-1}, and R an upper triangular matrix. The normal equations can then be rewritten in this way:

(A^T A)x = (QR)^T QRx = R^T Q^T Q R x = R^T R x
A^T y = (QR)^T y = R^T Q^T y
⇒ R^T R x = R^T Q^T y ⇔ R x = Q^T y.
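The three routes to the coefficients discussed in this section (the normal equations, the QR rewrite Rx = Q^T y, and a library least-squares call) can be compared numerically. The sketch below uses numpy with random data; it is only meant to show that the approaches agree, not to be a recommended implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))        # overdetermined system: 50 equations, 3 unknowns
y = rng.normal(size=50)

# 1. Normal equations: A^T A x = A^T y
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# 2. QR factorization: R x = Q^T y
Q, R = np.linalg.qr(A)              # reduced QR, R is 3x3 and upper triangular
x_qr = np.linalg.solve(R, Q.T @ y)

# 3. Library routine
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_normal, x_qr), np.allclose(x_qr, x_lstsq))
```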

9.3 Weighted Least Squares Method (WLS)

It was previously described that the least square method is useful in situations where some data has been collected and we want to analyse the data or adapt a specific model to the data. In both these cases certain parts of the data can be more relevant than others. The data can be a compilation of data from many different sources, it can be measured by a tool that is more accurate for certain x, or there can be a certain region where it is more important that the model fits the data well than elsewhere; for example, if we describe a system that is intended to spend most of its time in a state corresponding to a value x, then it is probably more important that our model fits well near that x than far away from it. This can be taken into consideration by using the weighted least square method. When we use the weighted least square method we add a weight in the error function.

s(e) = e^T W e = \sum_{i=1}^{n} W_{i,i} |e_i|^2.

Then we get a new set of normal equations

A^T W A x = A^T W y.

How to choose W? For physical experiments W is dictated by physical properties of the measured system or physical properties of the measurement tools. In statistics the correct weights for independent variables are the reciprocal of the variance, w_i = 1/σ_i, and for variables with the correlation matrix Σ the weights are decided by W^T W = Σ^{-1} to get the best linear unbiased estimate. If the different variables in x are independent then W is diagonal, but if there is some correlation between them then there will be off-diagonal elements as well.
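A small sketch of the weighted normal equations, assuming a diagonal W with arbitrarily chosen weights; it also checks the equivalent formulation where each row of A and y is scaled by the square root of its weight.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 2))
y = rng.normal(size=30)
w = np.linspace(1.0, 3.0, 30)       # larger weight = that measurement matters more
W = np.diag(w)

# Weighted normal equations: A^T W A x = A^T W y
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)

# Equivalent: scale rows by sqrt(w_i) and use the ordinary least square method.
sw = np.sqrt(w)
x_check, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
print(np.allclose(x_wls, x_check))
```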

9.4 Non-linear regression

The regression problem can also be applied to situations where the model we try to fit is not linear in its parameters. This can be interpreted geometrically, analogously to the linear case, except that instead of a plane we have a more general surface that we try to fit to the data. This is illustrated in figure 14.

(a) A set of points on a curve. (b) A set of points near, but not on, a curve.

(c) A set of points on a surface. (d) A set of points near, but not on, a surface.

Figure 14: Geometric interpretation of non-linear regression.

This is referred to as non-linear regression. The concept of a fitted model is completely analogous to the linear case. Consider a function of m variables x = [x_1 x_2 ... x_m]^T defined by a set of parameters r = [r_1 r_2 ... r_k]^T that we denote by f(x; r). Suppose we have a set of values x_i, y_i, i = 1, 2, ..., n. Comparing the results of the model to the measured data y, the residual vector will be

e = \begin{bmatrix} f(x_1; r) − y_1 \\ f(x_2; r) − y_2 \\ \vdots \\ f(x_n; r) − y_n \end{bmatrix}.

Finding the coefficients r is usually much more difficult in this case than in the linear

case and is often done using some numerical method to minimize the norm of the residual vector

|e| = \|e\|_2 = \sqrt{\sum_{i=1}^{n} |e_i|^2} = \sqrt{e^T e}.

Examples of numerical methods include the gradient method, the Gauss-Newton method and the Levenberg-Marquardt method (also known as damped least squares). One case where the results for the linear regression problem can be applied is when the problem can be rewritten on the form

f(x) = \sum_{i} g_i(x).

For example, suppose we have a trigonometric function with a polynomial argument

f(x) = tan(a + bx + cx^2) ⇒ arctan(f(x)) = a + bx + cx^2,

and even though we have a non-linear model we have turned it into the linear problem

Xr = y, \quad X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}, \quad r = \begin{bmatrix} a \\ b \\ c \end{bmatrix}, \quad y = \begin{bmatrix} \arctan y_1 \\ \arctan y_2 \\ \vdots \\ \arctan y_n \end{bmatrix},

that can be solved with the regular least square method. Note though that when doing this we no longer fit the original function but instead the rewritten function. In some applications this is problematic and in some it is not. We will not concern ourselves further with this in this course, but in a real application this sort of rewrite should not be used indiscriminately.
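The linearization trick above can be sketched in a few lines; the synthetic data is kept where arctan(tan(·)) returns the original argument, which is an assumption of this toy example rather than something guaranteed in a real application.

```python
import numpy as np

# Fit f(x) = tan(a + b*x + c*x^2) by fitting the polynomial to arctan(y) instead.
x = np.linspace(-1.0, 1.0, 25)
a_true, b_true, c_true = 0.2, 0.8, -0.5
y = np.tan(a_true + b_true * x + c_true * x**2)        # synthetic, noise-free data

X = np.column_stack([np.ones_like(x), x, x**2])
r, *_ = np.linalg.lstsq(X, np.arctan(y), rcond=None)
print("recovered a, b, c ≈", r)                        # close to 0.2, 0.8, -0.5
```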

9.5 Pseudoinverses

For square matrices we could describe the solution of a linear equation system using the inverse matrix Ax = y ⇔ x = A^{-1}y. Not all matrices have an inverse (diagonalizable matrices where one eigenvalue is zero, for example), but the idea behind the inverse can be generalized and made to extend to these matrices as well. For the regular inverse we have

A A^{-1} = A^{-1} A = I.

In this section we are going to discuss pseudoinverses, which are denoted by * and obey the following rules

A^* A A^* = A^*, \quad A A^* A = A.

The inverse of a matrix is unique but a matrix can have many different pseudoinverses. For example, if

A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}

then all of the following are pseudoinverses

A^*_1 = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}, \quad A^*_2 = \begin{bmatrix} 0 & 0 \\ 1 & 1 \end{bmatrix}, \quad A^*_3 = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}

In engineering there is a very important pseudoinverse called the Moore-Penrose pseudoinverse.

Definition 9.2 The Moore-Penrose pseudoinverse of a matrix A is defined as follows

A^* = (A^T A)^{-1} A^T.

The Moore-Penrose pseudoinverse is interesting because it is closely connected to the least square method and writing

Ax = y ⇔ x = A^* y

is the same as solving Ax = y using the least square method. Another good property of the Moore-Penrose pseudoinverse is that if the matrix A is invertible then A^* = A^{-1}. In practice, calculating the Moore-Penrose pseudoinverse is often done using the singular value decomposition (SVD) or QR-factorization. Showing how to calculate the Moore-Penrose pseudoinverse using QR-factorization is very similar to showing how it is done for the least square method.

A^* = (A^T A)^{-1} A^T = ((QR)^T QR)^{-1} (QR)^T = (R^T Q^T Q R)^{-1} R^T Q^T = (R^T R)^{-1} R^T Q^T = R^{-1}(R^T)^{-1} R^T Q^T = R^{-1} Q^T.
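As a check of the definition, the sketch below computes (A^T A)^{-1} A^T directly for a random full column rank matrix and compares it with numpy's pinv (which uses the SVD route mentioned above); it also confirms that A* y agrees with the least square solution. The data is random and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 3))                 # full column rank with probability one
y = rng.normal(size=6)

A_star = np.linalg.inv(A.T @ A) @ A.T       # Moore-Penrose pseudoinverse, definition 9.2
print(np.allclose(A_star, np.linalg.pinv(A)))

x = A_star @ y                              # least square 'solution' of Ax = y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x, x_lstsq))
```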

10 Matrix functions

Functions are a very useful mathematical concept in many different areas.

Power function: f(x) = x^n, x ∈ C, n ∈ Z
Root function: f(x) = \sqrt[n]{x}, x ∈ R, n ∈ Z^+
Polynomials: p(x) = \sum_{k=0}^{n} a_k x^k, a_k, x ∈ C, n ∈ Z^+
Exponential function: f(x) = e^x, x ∈ C

Natural logarithm: f(x) = ln(x), x ∈ C

Table 3: Some common functions on the real and complex numbers.

Each of the functions in table 3 is based on some fundamental idea that makes it useful for solving problems involving numbers of different kinds. If new functions that have corresponding or similar properties to these functions but act on matrices instead of numbers could be defined, then there might be ways to use them to solve problems that involve matrices of some kind. This can actually be done to some extent and this section will explore this further.

10.1 Matrix power function

The normal power function with positive exponents looks like this

f(x) = x^n = \underbrace{x \cdot x \cdot \ldots \cdot x}_{n}, \quad x ∈ C, n ∈ Z^+

Inspired by this it is quite intuitive to define a corresponding matrix version of the power function.

Definition 10.1
For any square matrix A ∈ M_{k×k}, k ∈ Z^+, the matrix power function is defined as

A^n = \underbrace{A A A \ldots A}_{n}, \quad n ∈ Z^+.

The regular power function is not only defined for positive integers, but also for 0 and negative integers. How can definition 10.1 be extended to let n be any integer, not just a positive one? First consider that x^0 = 1 for any x. For matrices this corresponds to

Definition 10.2
For any square matrix A ∈ M_{k×k}, k ∈ Z^+,

A^0 = I where I is the identity matrix.

For negative exponents it is well known that

x^{-n} = \frac{1}{x^n}, \quad n ∈ Z^+

Here it is not as intuitive and simple to find a corresponding definition for matrices. But consider that

x \cdot \frac{1}{x} = x x^{-1} = 1

This means that 1/x can be viewed as the inverse of x, and that there is a natural way to define A^{-n}, n ∈ Z^+, for any invertible matrix A.

Definition 10.3
For any square and invertible matrix A ∈ M_{k×k}, k ∈ Z^+,

A^{-n} = (A^{-1})^n = \underbrace{A^{-1} A^{-1} A^{-1} \ldots A^{-1}}_{n}, \quad n ∈ Z^+

where A^{-1} is the inverse of A.

Now a matrix version of the power function A^n with integer exponent n ≥ 0 has been defined for any square matrix A. If A is also invertible, A^n is defined for all integer n. When discussing (real or complex) numbers, x, the function x^n is not limited to integer n. In fact, n can be any real or complex number. In the next section the definition of A^n is extended to fractional n and in section 10.5 we will show how to define matrix versions of more general functions, including A^n for any (real or complex) number n.

10.1.1 Calculating A^n

Calculating the matrix A^n can be done in a very straightforward manner by multiplying A with itself n times (or the inverse of A with itself −n times if n is negative). But the calculations can sometimes be simplified further. Previously, see page 13 and definition 2.1, the concept of a diagonalizable matrix has been examined. If A is a diagonalizable matrix, A = S^{-1}DS, then the power function (for positive integer exponents) becomes

A^n = A A \ldots A = S^{-1} D (S S^{-1}) D (S S^{-1}) \ldots (S S^{-1}) D S = S^{-1} D^n S

where D^n of course becomes

D^n = \begin{bmatrix} d_{1,1}^n & 0 & \dots & 0 \\ 0 & d_{2,2}^n & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & d_{m,m}^n \end{bmatrix}

Thus if a matrix A is simple to diagonalize then A^n becomes much less tedious to calculate for high n. The same idea can be generalized to similar matrices. Consider two matrices, A and B, that are similar, in the mathematical sense given in definition 1.3.

A = P^{-1} B P

In this case the power function of A can be calculated as follows

A^n = A A \ldots A = P^{-1} B (P P^{-1}) B (P P^{-1}) \ldots (P P^{-1}) B P = P^{-1} B^n P

In some cases, for example when B is a diagonal matrix and n is large, it can be easier to calculate A^n in this way. In theorem 2.8 we saw that even if a matrix was not diagonalizable it could be rewritten as a matrix which is almost diagonal

A = P^{-1} J P = P^{-1} \begin{bmatrix} J_{m_1}(\lambda_1) & 0 & \dots & 0 \\ 0 & J_{m_2}(\lambda_2) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & J_{m_k}(\lambda_k) \end{bmatrix} P

Here J_{m_i}, 1 ≤ i ≤ k, are all Jordan blocks, that is matrices with the value λ_i on the diagonal and 1 in the element above the diagonal, see definition 2.7. k is the number of different eigenvalues of A. This means that for any square matrix A the power function can be written like this

A^n = P^{-1} J^n P = P^{-1} \begin{bmatrix} J_{m_1}(\lambda_1)^n & 0 & \dots & 0 \\ 0 & J_{m_2}(\lambda_2)^n & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & J_{m_k}(\lambda_k)^n \end{bmatrix} P

Since all the Jordan blocks, J_{m_i}(λ_i), are either small or contain a lot of zeros, each J_{m_i}(λ_i)^n is simple to calculate and thus J^n will often be easier to calculate than A^n. One kind of matrix for which it can be easier to calculate A^n are nilpotent matrices.

Definition 10.4
A square matrix A is said to be a nilpotent matrix if there is some positive integer, n ∈ Z^+, such that A^n = 0. From this it follows that A^k = 0 for all k ≥ n.

For nilpotent matrices it is obviously easy to calculate A^n for sufficiently high exponents. One example of a nilpotent matrix is a strictly triangular matrix (a triangular matrix with zeros on the diagonal). Remember that any square matrix is similar to a matrix written on the Jordan canonical form, which is an upper triangular matrix with the matrix eigenvalues along the diagonal. This means that if all the eigenvalues for a matrix are equal to zero then the matrix will be similar to a strictly triangular matrix, which means that it will be nilpotent.
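A short numerical sketch of the diagonalization idea from this section: A^n is computed via the eigendecomposition and compared with repeated multiplication. The example matrix is arbitrary, and numpy's eig returns A = V D V^{-1}, which plays the role of S^{-1} D S above.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # symmetric, hence diagonalizable
n = 10

eigvals, V = np.linalg.eig(A)         # A = V D V^{-1}
Dn = np.diag(eigvals ** n)            # D^n: raise the diagonal entries to the n:th power
An_via_diag = V @ Dn @ np.linalg.inv(V)

An_direct = np.linalg.matrix_power(A, n)
print(np.allclose(An_via_diag, An_direct))
```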

10.2 Matrix root function

For real or complex numbers we have the following relation between the power function x^n and the root function \sqrt[n]{x}

y = x^n ⇔ x = \sqrt[n]{y} = y^{1/n}

For any positive integer, n ∈ Z^+, it is easy to make the corresponding definition for matrices.

Definition 10.5 If the following relation holds for two square matrices A, B ∈ Mk×k

A = \underbrace{B B B \ldots B}_{n}

then B is said to be the n:th root of A. This is denoted B = \sqrt[n]{A} = A^{1/n}.

Simply defining the function is not very useful though. In order to make \sqrt[n]{A} useful a few questions need to be answered:

• How do you find \sqrt[n]{A}?

• For how many different B is B = \sqrt[n]{A}?

In this course we will only consider the square root of a matrix, \sqrt[2]{A} = \sqrt{A}.

10.2.1 Root of a diagonal matrix

It is not obvious how to find the square root of a matrix; it is not even clear whether there exist any roots at all for a given matrix. For this reason it is suitable to first consider a specific kind of matrix for which the roots exist and are easy to find. Diagonal matrices (see page 6) are generally simple to work with. It is also easy to find their roots. Consider a diagonal matrix D ∈ M_{m×m}.

D = \begin{bmatrix} d_{1,1} & 0 & \dots & 0 \\ 0 & d_{2,2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & d_{m,m} \end{bmatrix}

Then we can define a matrix C in this way

C = \begin{bmatrix} \sqrt{d_{1,1}} & 0 & \dots & 0 \\ 0 & \sqrt{d_{2,2}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sqrt{d_{m,m}} \end{bmatrix} \;\Rightarrow\; CC = C^2 = \begin{bmatrix} d_{1,1} & 0 & \dots & 0 \\ 0 & d_{2,2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & d_{m,m} \end{bmatrix} \qquad (7)

From the definition it follows that C is a root of D. It is not the only root though. Each of the diagonal elements d_{i,i} has two roots, one positive and one negative, and regardless of which root is chosen in C, equation (7) will still hold. This means that there are 2^m different Cs that are roots of D. These are not the only possible roots though. For an example of how many roots a matrix can have, see the following theorem.

Theorem 10.1. All of the following matrices are roots of I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}:

\frac{1}{t}\begin{bmatrix} ∓s & ∓r \\ ∓r & ±s \end{bmatrix}, \quad \frac{1}{t}\begin{bmatrix} ±s & ∓r \\ ∓r & ∓s \end{bmatrix}, \quad \begin{bmatrix} ±1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \begin{bmatrix} 1 & 0 \\ 0 & ±1 \end{bmatrix}

if r^2 + s^2 = t^2.

Thus the simple matrix I_2 has infinitely many roots. Clearly roots are more complicated for matrices than they are for numbers. Fortunately though it is not always necessary to find all the roots of a matrix, often a single root can be enough, and there are several ways to make it easier to construct one. For a diagonal matrix D it is very simple to find a root: simply take the matrix that has the positive root of all the elements in the diagonal of D in its diagonal. This is often referred to as the principal root.

10.2.2 Finding the square root of a matrix using the Jordan canonical form

The technique for finding a root for a diagonal matrix can be extended to diagonalizable matrices. Consider a diagonalizable matrix A

A = S^{-1} D S = S^{-1} \sqrt{D}\sqrt{D} S = S^{-1}\sqrt{D} S S^{-1}\sqrt{D} S = \left(S^{-1}\sqrt{D} S\right)^2 \;\Rightarrow\; \sqrt{A} = S^{-1}\sqrt{D} S

By an analogous argument you can find a similar relation between any square matrix and its Jordan canonical form

\sqrt{A} = P^{-1} \sqrt{J} P

Finding the square root for J is not as simple as it is for a diagonal matrix, but there are some similarities

\sqrt{A} = P^{-1}\sqrt{J} P = P^{-1} \begin{bmatrix} \sqrt{J_{m_1}(\lambda_1)} & 0 & \dots & 0 \\ 0 & \sqrt{J_{m_2}(\lambda_2)} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sqrt{J_{m_k}(\lambda_k)} \end{bmatrix} P

This means that if a square root of each of the Jordan blocks is known then \sqrt{A} can be calculated. Let J be a Jordan block

J = \begin{bmatrix} \lambda & 1 & 0 & \dots & 0 \\ 0 & \lambda & 1 & \dots & 0 \\ 0 & 0 & \lambda & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & \lambda \end{bmatrix} = \lambda \begin{bmatrix} 1 & 1/\lambda & 0 & \dots & 0 \\ 0 & 1 & 1/\lambda & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix} = \lambda(I + K)

Consider the function f(x) = \sqrt{1 + x}; this function can be written as a Taylor series

f(x) = \sqrt{1 + x} = f(0) + \frac{f'(0)}{1!}x + \frac{f''(0)}{2!}x^2 + \frac{f^{(3)}(0)}{3!}x^3 + \frac{f^{(4)}(0)}{4!}x^4 + \ldots = 1 + \frac{1}{2}x − \frac{1}{8}x^2 + \frac{1}{16}x^3 − \frac{5}{128}x^4 + \ldots

It turns out that if you switch 1 for I and x for K the equality still holds (see page 125). K is also a strictly upper-triangular matrix and is therefore a nilpotent matrix (see definition 10.4). Thus the following formula will always converge

\sqrt{I + K} = I + \frac{1}{2}K − \frac{1}{8}K^2 + \frac{1}{16}K^3 − \frac{5}{128}K^4 + \ldots

which makes it possible to find a square root for any square matrix.
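The series above can be tried out on a single Jordan block. The sketch below is a toy example under assumed values (block size 4, λ = 4): because K is nilpotent the series terminates, and squaring the result recovers the block.

```python
import numpy as np

lam, m = 4.0, 4                        # Jordan block J_4(4), chosen arbitrarily
J = lam * np.eye(m) + np.eye(m, k=1)   # lambda on the diagonal, ones just above it
K = np.eye(m, k=1) / lam               # so that J = lam * (I + K), K nilpotent (K^4 = 0)

# First terms of sqrt(I + K) = I + K/2 - K^2/8 + K^3/16 - ...; enough since K^4 = 0.
coeffs = [1.0, 1/2, -1/8, 1/16]
sqrt_IK = sum(c * np.linalg.matrix_power(K, k) for k, c in enumerate(coeffs))
B = np.sqrt(lam) * sqrt_IK

print(np.allclose(B @ B, J))           # B is a square root of J
```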

10.3 Matrix polynomials

A polynomial of degree n over a set K (for example the real or complex numbers) is a function p(x) that can be written as

p(x) = \sum_{k=0}^{n} a_k x^k, \quad a_k ∈ K

where n is a finite non-negative integer and x is a variable taken from some set such that x^n and addition are defined in a sensible way (more specifically, x can only take values from a ring). Examples of sets that the variable x can belong to are real numbers, complex numbers and matrices.

Definition 10.6
A matrix polynomial (with complex coefficients) is a function p : M_{n×n} → M_{n×n} that can be written as follows

p(A) = \sum_{k=0}^{n} a_k A^k, \quad a_k ∈ C, A ∈ M_{n×n}

where n is a non-negative integer.

Matrix polynomials have a number of convenient properties; one of them can be said to be inherited from the power function.

Theorem 10.2. For two similar matrices, A and B, it is true for any polynomial p that

p(A) = p(P−1BP) = P−1p(B)P

Proof. This theorem is easily proved by replacing A with P^{-1}BP in the expression describing the polynomial and using A^n = P^{-1}B^nP.

10.3.1 The Cayley-Hamilton theorem

A very useful and important theorem for matrix polynomials is the Cayley-Hamilton theorem.

Theorem 10.3 (Cayley-Hamilton theorem). Let p_A(λ) = det(λI − A), then p_A(A) = 0.

The Cayley-Hamilton theorem can be useful both for simplifying the calculations of matrix polynomials and for solving regular polynomial equations. This will be demonstrated in the following sections. The rest of this section focuses on proving the Cayley-Hamilton theorem. Proving the Cayley-Hamilton theorem becomes easier using the following formula.

Theorem 10.4 (The adjugate formula). For any square matrix A the following equality holds A adj(A) = adj(A)A = det(A)I (8) where adj(A) is a square matrix such that

adj(A)_{ki} = A_{ik} where A_{ik} is the ik-cofactor of A (see equation (2)).

Proof of the adjugate formula. Using equation (2) gives

det(A) = \sum_{k=1}^{n} a_{i,k} A_{i,k}

If we collect all the cofactors, Aki, in a matrix

(adj(A))_{ki} = A_{ik}

Note the transpose. It can then easily be verified that

A\,adj(A) = \begin{bmatrix} \sum_{k=1}^{n} a_{1,k}A_{1k} & \sum_{k=1}^{n} a_{1,k}A_{2k} & \dots & \sum_{k=1}^{n} a_{1,k}A_{nk} \\ \sum_{k=1}^{n} a_{2,k}A_{1k} & \sum_{k=1}^{n} a_{2,k}A_{2k} & \dots & \sum_{k=1}^{n} a_{2,k}A_{nk} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{k=1}^{n} a_{n,k}A_{1k} & \sum_{k=1}^{n} a_{n,k}A_{2k} & \dots & \sum_{k=1}^{n} a_{n,k}A_{nk} \end{bmatrix} = \begin{bmatrix} \det(A) & \sum_{k=1}^{n} a_{1,k}A_{2k} & \dots & \sum_{k=1}^{n} a_{1,k}A_{nk} \\ \sum_{k=1}^{n} a_{2,k}A_{1k} & \det(A) & \dots & \sum_{k=1}^{n} a_{2,k}A_{nk} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{k=1}^{n} a_{n,k}A_{1k} & \sum_{k=1}^{n} a_{n,k}A_{2k} & \dots & \det(A) \end{bmatrix} \qquad (9)

In order to get equation (8) we need to show that all off-diagonal elements are equal to zero. To do this, consider a square matrix B. From this matrix we construct a new matrix, C, by replacing the i:th row with the j:th row (i ≠ j). This means that in C row i and row j are identical

c_{i,k} = c_{j,k} = b_{j,k}, \quad 1 ≤ k ≤ n

The matrices B and C will also share any cofactors created by removing row i

Cik = Bik Using (2) on C gives

det(C) = \sum_{k=1}^{n} c_{i,k} C_{ik} = \sum_{k=1}^{n} c_{j,k} C_{ik} = \sum_{k=1}^{n} b_{j,k} B_{ik}, \quad i ≠ j

Since C has two identical rows the rows are not linearly independent, which means

det(C) = 0 Thus, for any square matrix B it is true that

\sum_{k=1}^{n} b_{j,k} B_{ik} = 0

This means that all elements in matrix (9) except the diagonal elements are equal to zero. This gives the adjugate formula

A\,adj(A) = det(A)\,I

Checking that the same holds for adj(A)A can be done in an analogous way.

Now the adjugate formula can be used to prove the Cayley-Hamilton theorem

Proof of the Cayley-Hamilton theorem. Set pA(λ) = det(λI − A) and let B = adj(λI − A) then the adjugate formula gives

pA(λ)I = (λI − A)B

Every element in B is a polynomial of degree n − 1 or lower, since each element in B is the determinant of an (n − 1) × (n − 1) matrix. Thus B can be rewritten as a matrix polynomial

B = \sum_{k=0}^{n-1} λ^k B_k

where B_k is a matrix containing the coefficients of the polynomial terms with degree k in B. Combining the two formulas gives

(λI − A)B = (λI − A)\sum_{k=0}^{n-1} λ^k B_k = λ^n B_{n-1} + λ^{n-1}(B_{n-2} − AB_{n-1}) + \ldots + λ(B_0 − AB_1) + (−AB_0) =

= λ^n I + λ^{n-1} a_{n-1} I + \ldots + λ a_1 I + a_0 I = p_A(λ)I

If this expression holds for all λ then the following three conditions must be fulfilled

akI = Bk−1 − ABk for 1 ≤ k ≤ n − 1 (10)

I = Bn−1 (11)

a0I = −AB0 (12)

Next consider the polynomial pA(A)

p_A(A) = A^n + A^{n-1} a_{n-1} + \ldots + A a_1 + I a_0

Combining this with conditions (10)-(12) results in

p_A(A) = A^n + A^{n-1}(B_{n-2} − AB_{n-1}) + \ldots + A(B_0 − AB_1) + (−AB_0) =

= A^n(\underbrace{I − B_{n-1}}_{=0}) + A^{n-1}(\underbrace{B_{n-2} − B_{n-2}}_{=0}) + \ldots + A(\underbrace{B_0 − B_0}_{=0}) = 0
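The theorem is easy to verify numerically for a small matrix. The sketch below is only a sanity check (the matrix is arbitrary); np.poly is used here to obtain the coefficients of det(λI − A).

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 1.0]])

coeffs = np.poly(A)                    # [1, a_{n-1}, ..., a_1, a_0]
p_of_A = sum(c * np.linalg.matrix_power(A, k)
             for k, c in enumerate(coeffs[::-1]))
print(np.allclose(p_of_A, np.zeros_like(A)))   # p_A(A) = 0
```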

10.3.2 Calculating matrix polynomials

In section 10.1.1 a number of different techniques and properties worth considering when calculating the power function were described. Naturally all of this is applicable to matrix polynomials as well. But there are other techniques that can be used to simplify the calculation of a matrix polynomial. On page 116 it was mentioned that the Cayley-Hamilton theorem could be used to calculate matrix polynomials. It is very simple to see how the following theorem follows from Cayley-Hamilton.

Theorem 10.5. Let p_A(λ) = det(λI − A) where A is a square matrix. Then the following formulas hold

p_A(A) = 0
c · p_A(A) = 0 where c ∈ C
q(A) · p_A(A) = 0 where q is a polynomial

Many things that can be done with polynomials of numbers can also be done with matrix polynomials. Factorization is one such thing. If

p(x) = (x − λ1)(x − λ2) ... (x − λn) then

p(A) = (A − λ_1 I)(A − λ_2 I) \ldots (A − λ_n I)

This means that if a polynomial of ordinary numbers can be simplified by factorization, the corresponding matrix polynomial can be simplified in a corresponding way. It also means that polynomial division can be used for matrix polynomials as well as for normal polynomials. That is, any matrix polynomial, p(A), with degree n can be rewritten as p(A) = q(A)π(A) + r(A) where q(A), π(A) and r(A) are all polynomials of degree deg(q) = n − k, deg(π) = k and deg(r) ≤ n with 1 ≤ k ≤ n. If π(A) is chosen as one of the polynomials in theorem 10.5 then π(A) = 0 and p(A) = r(A), which potentially lowers the degree of p(A), making it easier to calculate. One especially interesting consequence of this is the following theorem.

Theorem 10.6. For two polynomials, a and b, with degrees deg(a) ≥ deg(b), such that a(A) = 0 and b(A) = 0, the following relation must apply: a(A) = c(A)b(A) where c(A) is a polynomial with degree deg(c) = deg(a) − deg(b).

Proof. Dividing a by b gives a(A) = c(A)b(A) + r(A). From b(A) = 0 it follows that a(A) = r(A), but a(A) = 0, thus r(A) = 0.

119 10.3.3 Companion matrices

If λ is an eigenvalue of a square matrix A then det(λI − A) = 0. Since an n × n matrix has n eigenvalues (some of which can be the same) and since pA(λ) = det(λI − A) is a polynomial of λ with degree n this means that each eigenvalue corresponds to a root of the polynomial pA(λ). This can be expressed as a formula

pA(λ) = det(λI − A) = (λ − λ1)(λ − λ2) ... (λ − λn)

where λ_i for 1 ≤ i ≤ n are eigenvalues of A. This means that if we have found the eigenvalues of some matrix we have also found the roots of some polynomial, and vice versa. That solving polynomials can be used to find eigenvalues of matrices is something that we have seen before; the standard method of solving det(λI − A) = 0 is exactly this. But there are other ways of finding the eigenvalues of a matrix; in section 12 a number of different ways of calculating the roots numerically will be shown, for example. There is also a simple way of constructing a matrix that has the roots of a given polynomial as eigenvalues. This matrix is called the companion matrix of a polynomial. The origin of this kind of matrix is linear differential equations. Consider a linear differential equation of order n

y^{(n)} + a_{n-1}y^{(n-1)} + \ldots + a_1 y^{(1)} + a_0 y = f

where y(k) is the k:th derivative of y and f is a function. When solving this kind of differential equation it is often helpful to find the roots of the characteristic polynomial

p(x) = x^n + a_{n-1}x^{n-1} + \ldots + a_1 x + a_0

To get the companion matrix we rewrite the differential equation using state variables

x_0 = y
x_1 = y^{(1)}
⋮
x_{n-1} = y^{(n-1)}
x_n = y^{(n)}

Using the state variables x_0, x_1, ..., x_n the differential equation can be written as a system of first order differential equations. If we denote a vector containing all x_i with x and a vector containing all x'_i with x', then the differential equation can be written as a matrix equation

x' = C(p)^T x \quad ⇔ \quad \begin{cases} x'_0 = x_1 \\ x'_1 = x_2 \\ \;\;\vdots \\ x'_{n-1} = x_n \\ x'_n = −a_0 x_0 − a_1 x_1 − \ldots − a_{n-1}x_n + f \end{cases}

With C(p) defined as follows.

Definition 10.7
Let p(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_{n-1}x^{n-1} + x^n, then the companion matrix of p is

C(p) = \begin{bmatrix} 0 & 0 & \dots & 0 & −a_0 \\ 1 & 0 & \dots & 0 & −a_1 \\ 0 & 1 & \dots & 0 & −a_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & −a_{n-1} \end{bmatrix}

In some literature C(p)^T is referred to as the companion matrix instead of C(p).

Theorem 10.7. If C(p) is a companion matrix to the polynomial p then the eigenvalues of C(p) are roots of p.

Proof. Let ei denote vectors in the standard basis, that is column vectors with a 1 in element i and 0 everywhere else. For these vectors multiplication with C(p) will result in the following

C(p)e_i = e_{i+1} \quad \text{for } 0 ≤ i < n \qquad (13)

C(p)e_n = −\sum_{i=1}^{n} a_i e_i \qquad (14)

From (13) it can be easily shown that

e_i = C(p)^i e_1 \quad \text{for } 0 ≤ i < n \qquad (15)

e_n = C(p)^n e_1 \qquad (16)

Combining equation (13)-(16) gives

C(p)^n e_1 = −\sum_{i=0}^{n-1} a_i C(p)^i e_1 \quad ⇔

⇔ \quad C(p)^n + a_{n-1}C(p)^{n-1} + \ldots + a_1 C(p) + a_0 I = 0

⇔ p(C(p)) = 0

By the Cayley-Hamilton theorem p_C(C(p)) = 0 with p_C(λ) = det(λI − C(p)). Thus we have two polynomials that are zero for the companion matrix.

p(C(p)) = 0

pC (C(p)) = 0

Then we can use theorem 10.6 to see that p(A) = c · p_C(A) (since p and p_C have the same degree). This means that p(λ) and p_C(λ) must have the same roots. The eigenvalues of C(p) are roots of p_C(λ) and thus the theorem is proved.

Calculating roots of polynomials using companion matrices can be quite efficient, especially since a polynomial of high degree gives a sparse matrix, that is a matrix with only a few non-zero elements. Sparse matrices can be used quite efficiently by computers and many algorithms can be specifically modified to work much faster on sparse matrices than non-sparse matrices. The power method described in section 12.1 can be used to efficiently find numerical eigenvalues of sparse matrices. Many computer programs that find roots of polynomials use companion matrices, for example the roots command in MATLAB (see [5]).
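A small sketch of the idea: build the companion matrix of definition 10.7 for a cubic with known roots and read off its eigenvalues. The polynomial is chosen for illustration; numpy's roots (the analogue of MATLAB's roots) is shown for comparison.

```python
import numpy as np

# p(x) = x^3 - 6x^2 + 11x - 6 has roots 1, 2 and 3.
a = np.array([-6.0, 11.0, -6.0])       # a_0, a_1, a_2 of the monic polynomial
n = a.size

C = np.zeros((n, n))
C[1:, :-1] = np.eye(n - 1)             # ones below the diagonal
C[:, -1] = -a                          # last column: -a_0, ..., -a_{n-1}

print(np.sort(np.linalg.eigvals(C)))               # approximately [1, 2, 3]
print(np.sort(np.roots([1.0, -6.0, 11.0, -6.0])))  # same roots via numpy's roots
```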

10.4 Exponential matrix

So far we have constructed matrix functions by taking the definition of a function for numbers and replacing numbers with matrices when it was possible. Now we are going to construct a matrix version of a function in a different way: we are going to identify the most important properties of a function and then construct a matrix function that has the same properties. The function that we are going to construct a matrix version of is the exponential function. A common definition for the exponential function is

e^x = \lim_{n→∞}\left(1 + \frac{x}{n}\right)^n

The exponential function is useful in many ways but most of its usefulness comes from the following three properties

\frac{d}{dt}(e^{at}) = a e^{at} = e^{at} a \qquad (17)

(e^{at})^{-1} = e^{−at} \qquad (18)

e^a \cdot e^b = e^{a+b} = e^{b+a} = e^b \cdot e^a \qquad (19)

Here a, b and t are all real or complex numbers. Multiplication of numbers is usually commutative (a · b = b · a) but matrix multiplication is often not commutative. In equations (17)-(19) the various places where commutativity comes in have been written out explicitly, since we want to have commutativity in the same way for the matrix version.

The most complex function we have defined so far is the matrix polynomial. Can we construct a polynomial such that it fulfills (17)-(19)? Consider what happens to a general polynomial of At when we take the first time derivative

p(At) = \sum_{k=0}^{n} a_k (At)^k \;\Rightarrow\; \frac{d}{dt}(p(At)) = \sum_{k=1}^{n} a_k k A^k t^{k-1} = A \sum_{l=0}^{n-1} a_{l+1}(l+1) A^l t^l

The last equality is achieved by simply taking l = k − 1. Compare this to

A\,p(At) = A \sum_{l=0}^{n} a_l A^l t^l

If the polynomial coefficients are chosen such that a_{l+1}(l + 1) = a_l these expressions would be identical except for the final term. Let us choose the correct coefficients

a_{l+1}(l + 1) = a_l \;⇔\; a_l = \frac{1}{l!}

and simply avoid the problem of the last term not matching up by making the polynomial have infinite degree.

Definition 10.8
The matrix exponential function is defined as

e^{At} = \sum_{k=0}^{∞} \frac{t^k}{k!} A^k, \quad A ∈ M_{n×n}

It is simple to create e^A from this by setting t = 1. An important thing to note here is that since the sum in the definition is infinite we should check for convergence. As it turns out the exponential function will always converge. This follows from a more general theorem presented in section 10.5.

Theorem 10.8. For the matrix exponential function the following is true

\frac{d}{dt}(e^{At}) = A e^{At} = e^{At} A \qquad (20)

(e^{At})^{-1} = e^{−At} \qquad (21)

e^A \cdot e^B = e^{A+B} = e^{B+A} = e^B \cdot e^A \;\text{ if } AB = BA \qquad (22)

where A and B are square matrices.

This definition of e^{At} has all the properties we want, with the exception that we need the extra condition on (22) compared to (19). There is no way around this unfortunately. That the definition gives a function that has property (20) is clear from the construction. Properties (21) and (22) can be shown by direct calculations using the definition. Calculating the exponential matrix can be challenging. First of all we should note that the exponential function for numbers can be written like this

e^x = \sum_{k=0}^{∞} \frac{x^k}{k!}

using a Taylor series. Using this we can see how to calculate the matrix exponential for a diagonal matrix D

e^D = \begin{bmatrix} \sum_{k=0}^{∞} \frac{d_{1,1}^k}{k!} & 0 & \dots & 0 \\ 0 & \sum_{k=0}^{∞} \frac{d_{2,2}^k}{k!} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sum_{k=0}^{∞} \frac{d_{n,n}^k}{k!} \end{bmatrix} = \begin{bmatrix} e^{d_{1,1}} & 0 & \dots & 0 \\ 0 & e^{d_{2,2}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & e^{d_{n,n}} \end{bmatrix}

Since the exponential matrix is essentially a polynomial it should not be surprising that the following relation holds.

A = S^{-1} B S \;\Rightarrow\; e^A = S^{-1} e^B S

On page 112 it was discussed that nilpotent matrices could make it easier to calculate the matrix power function. This is of course also true for the matrix exponential function, since nilpotency turns the infinite sum into a finite sum and thus turns the exponential function into a finite polynomial. There are also many ways of computing the exponential matrix function numerically. A nice collection can be found in [10].
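A short sketch of the relation e^A = S^{-1} e^B S for a diagonalizable matrix, compared with scipy.linalg.expm (assuming scipy is available); the example matrix and its eigendecomposition come from numpy, so the factorization is written A = V D V^{-1}.

```python
import numpy as np
from scipy.linalg import expm           # scipy's numerical matrix exponential

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])            # eigenvalues -1 and -2, so diagonalizable

eigvals, V = np.linalg.eig(A)
expA_via_diag = V @ np.diag(np.exp(eigvals)) @ np.linalg.inv(V)

print(np.allclose(expA_via_diag, expm(A)))
```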

10.4.1 Solving matrix differential equations

One of the most common applications of the exponential function for numbers is solving differential equations.

\frac{df}{dx} = k f(x) \;\Rightarrow\; f(x) = c \cdot e^{kx}

The matrix exponential function can be used in the same way.

Theorem 10.9. The exponential matrix e^{xA} is the solution to the initial value problem

\begin{cases} \dfrac{dU}{dx} = AU \\ U(0) = I \end{cases}

Proof. From the construction of e^{xA} we already know that it fulfills the differential equation. Checking that the initial condition is fulfilled can be done simply by setting x = 0 in the definition.

10.5 General matrix functions

There is a systematic way of creating a matrix version of a function. When the matrix exponential function was defined in the previous section (see page 123) we found a matrix polynomial of infinite degree that created the matrix version. One way of doing this is given below.

Definition 10.9 (Matrix function via Hermite interpolation)
If a function f(λ) is defined on all λ that are eigenvalues of a square matrix A then the matrix function f(A) is defined as

f(A) = \sum_{i=1}^{h} \sum_{k=0}^{s_i − 1} \frac{f^{(k)}(λ_i)}{k!} \varphi_{ik}(A)

where \varphi_{ik} are the Hermite interpolation polynomials (defined below), λ_i are the different eigenvalues and s_i is the multiplicity of each eigenvalue.

Definition 10.10
Let λ_i, 1 ≤ i ≤ n, be complex numbers and s_i, 1 ≤ i ≤ n, be positive integers. Let s be the sum of all s_i. The Hermite interpolation polynomials are a set of polynomials with degree lower than s that fulfil the conditions

\frac{\varphi_{ik}^{(l)}(λ_j)}{l!} = \begin{cases} 0 & \text{if } i ≠ j \text{ or } k ≠ l \\ 1 & \text{if } i = j \text{ and } k = l \end{cases}

Using this kind of definition to create a matrix function works as long as the function f is defined on the eigenvalues of the matrix. There is another definition of a matrix function that is equivalent to definition 10.9 but is only applicable to analytical functions, that is functions that can be written as a power series.

Definition 10.11 (Matrix function via Cauchy integral)
Let C be a curve that encloses all eigenvalues of the square matrix A. If the function f is analytical (can be written as a power series) on C and inside C then

f(A) = \frac{1}{2\pi i} \int_C f(z)(zI − A)^{-1} dz

The matrix R_A(z) = (zI − A)^{-1} is called the resolvent matrix of A. Proofs that these two definitions make sense and are equivalent can be found in [?] on pages 1-8. The formula in definition 10.11 is a matrix version of Cauchy's integral formula, which can be used to derive the formula for a Taylor series. As a consequence of this, we can construct the matrix version of any analytical function by replacing the number variable in a Taylor series expansion (or other convergent power series) with a square matrix. For an example of this, see the formula for the matrix square root (page 115) and compare the definition of the matrix exponential function with the Taylor series for the exponential function for numbers.

11 Matrix equations

A matrix equation is an equation that involves matrices and vectors instead of numbers. A simple example is Ax = y where A is a matrix, y is a known vector and x is an unknown vector. You are already familiar with some ways of solving this kind of equation, for example Gaussian elimination and matrix inverses. One example that we have seen before is the first order differential equation that was solved using the exponential matrix function, see page 124. There are many other kinds of matrix equations. A simple example would be exchanging the x and y vectors in the typical matrix equation for a pair of square matrices, X and Y. There are two variations of this kind of equation

AX = Y XA = Y

Another example of a matrix equation that is important in many applications (first and foremost in control theory) is Sylvester’s equation

AX + XB = C

There is also a version of Sylvester's equation that is especially important, known as the continuous Liapunov¹ equation.

AX + XA^H = −C

There are many methods for solving specific matrix equations, but here we will take a look at a basic tool for examining certain properties of the solutions of matrix equations and turning the more exotic kinds of matrix equations into the typical matrix-vector equation that we already know how to solve. For this a new tool will be introduced.

11.1 Tensor products

The tensor product, usually denoted by ⊗, is an interesting tool that can be defined for many kinds of mathematical objects, not only matrices. It is sometimes called the outer product or direct product. A ⊗ B is usually pronounced 'A tensor B'.

Definition 11.1 The tensor product for matrices is a bilinear function of matrices that has two properties:

1. Bilinearity: (µA + ηB) ⊗ C = µA ⊗ C + ηB ⊗ C A ⊗ (µB + ηC) = µA ⊗ B + ηA ⊗ C

2. Associativity: (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)

1sometimes spelled Lyapunov or Ljapunow

It can be possible to find different functions that fulfil these conditions for a set of matrices, but the most general (and most common) tensor product is the Kronecker product.

Definition 11.2
The Kronecker product ⊗ between two matrices, A ∈ M_{m×n} and B ∈ M_{p×q}, is defined as

A ⊗ B = \begin{bmatrix} a_{1,1}B & a_{1,2}B & \dots & a_{1,n}B \\ a_{2,1}B & a_{2,2}B & \dots & a_{2,n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1}B & a_{m,2}B & \dots & a_{m,n}B \end{bmatrix}

or

A ⊗ B = \begin{bmatrix} A b_{1,1} & A b_{1,2} & \dots & A b_{1,q} \\ A b_{2,1} & A b_{2,2} & \dots & A b_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ A b_{p,1} & A b_{p,2} & \dots & A b_{p,q} \end{bmatrix}

These two definitions are equivalent but not equal; one is a permutation of the other.

The Kronecker product is quite simple and intuitive to work with, even if it creates large matrices.

Theorem 11.1. For the Kronecker product the following is true

1. (µA + ηB) ⊗ C = µA ⊗ C + ηB ⊗ C

2. A ⊗ (µB + ηC) = µA ⊗ B + ηA ⊗ C

3. (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)

4. (A ⊗ B)^T = A^T ⊗ B^T

5. \overline{A ⊗ B} = \overline{A} ⊗ \overline{B} (complex conjugate)

6. (A ⊗ B)^H = A^H ⊗ B^H

7. (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, for all invertible A and B

8. det(A ⊗ B) = det(A)^k det(B)^n, with A ∈ M_{n×n}, B ∈ M_{k×k}

9. (A ⊗ B)(C ⊗ D) = (AC ⊗ BD), A ∈ Mm×n, B ∈ Mn×k, C ∈ Mp×q, D ∈ Mq×r

10. A ⊗ B = (A ⊗ I_{k×k})(I_{n×n} ⊗ B), A ∈ M_{n×n}, B ∈ M_{k×k}

The first 7 properties listed here show that the Kronecker product is bilinear, associative and reacts in a very intuitive way to transpose, complex conjugate, Hermitian conjugate and inverse operations. Number 8 shows that determinants are almost as intuitive, but not quite (note that the determinant of A has been raised to the number of rows in B and vice versa). Properties 9 and 10 are examples of how matrix multiplication and Kronecker products relate to one another.

Taking the Kronecker product of two matrices gives a new matrix. Many of the properties of the new matrix relate to the properties of the old matrices in a fairly uncomplicated manner. One property that this applies to is the eigenvalues.

Theorem 11.2. Let {λ} be the eigenvalues of A and {µ} be the eigenvalues of B. Then the following is true:

{λµ} are the eigenvalues of A ⊗ B
{λ + µ} are the eigenvalues of A ⊗ I + I ⊗ B

Sketch of proof. a) Use Schur's lemma, A = S^{-1}R_A S, B = T^{-1}R_B T, where R_A and R_B are triangular matrices with eigenvalues along the diagonal. Then

A ⊗ B = (S^{-1}R_A S) ⊗ (T^{-1}R_B T) = (S^{-1}R_A ⊗ T^{-1}R_B)(S ⊗ T) = (S ⊗ T)^{-1}(R_A ⊗ R_B)(S ⊗ T)

R_A ⊗ R_B will be triangular and will have the elements (R_A)_{i,i}(R_B)_{j,j} = λ_i µ_j on the diagonal, thus λ_i µ_j is an eigenvalue of A ⊗ B since similar matrices have the same eigenvalues. b) The same argument as above gives that A ⊗ I (B ⊗ I) has the same eigenvalues as A (B); adding the two terms together gives that λ + µ is an eigenvalue.

11.2 Solving matrix equations

The tensor product can be used to make it easy to solve equations like these

AX = Y XA = Y

Taking the tensor product between the identity matrix and some square matrix results in the following theorem.

Theorem 11.3.

AX = Y ⇔ (I ⊗ A)\tilde{X} = \tilde{Y}
XA = Y ⇔ (A^T ⊗ I)\tilde{X} = \tilde{Y}

where

B = \begin{bmatrix} B_{.1} & B_{.2} & \dots & B_{.n} \end{bmatrix}, \quad \tilde{B} = \begin{bmatrix} B_{.1} \\ B_{.2} \\ \vdots \\ B_{.n} \end{bmatrix}

This turns an equation consisting of n × n matrices into an equation with one n² × n² matrix and two n²-vectors. This equation can be solved the usual way. Solving the equation this way is equivalent to assigning each element in the unknown matrix a variable and then solving the resulting linear equation system. Using the Kronecker product can make subsequent manipulations of the matrices involved easier though, thanks to the properties shown in theorem 11.1.

Often it is interesting to know if there exists a solution to a given equation and if that solution is unique. Using the Kronecker product we can see for which matrices A and B there is a unique solution to Sylvester's equation.

Theorem 11.4. Sylvester's equation AX + XB = C has a unique solution if and only if A and −B have no common eigenvalues.

Proof. Rewrite the equation using the Kronecker product

AX + XB = C ⇔ \underbrace{(I ⊗ A + B^T ⊗ I)}_{K} \tilde{X} = \tilde{C}

This is a normal matrix equation which is solvable if λ = 0 is not an eigenvalue of K. The eigenvalues of K are λ + µ where λ is an eigenvalue of A and µ is an eigenvalue of B. Thus if A and −B have no common eigenvalues, the eigenvalues of K will all be ≠ 0.
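The proof suggests a direct (if expensive) way to solve Sylvester's equation numerically: form K with Kronecker products and solve one n²×n² linear system. The sketch below does exactly that for random matrices; the column-stacking order matters and is handled with Fortran-order reshapes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
C = rng.normal(size=(n, n))

# (I (x) A + B^T (x) I) vec(X) = vec(C), with vec stacking the columns.
K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(n))
x_vec = np.linalg.solve(K, C.flatten(order="F"))
X = x_vec.reshape((n, n), order="F")

print(np.allclose(A @ X + X @ B, C))     # X solves Sylvester's equation
```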

As mentioned previously, Sylvester's equation is important in control theory, especially the version of it that is called the continuous Liapunov equation. This will be illustrated using two theorems (neither theorem will be proved). A stable system is a system that will in some sense 'autocorrect' itself. If left alone it will approach some equilibrium state and not increase indefinitely or behave chaotically. If the system is described by a first order differential matrix equation then the following condition gives stability.

Definition 11.3
A matrix, A, is stable when all solutions to

\frac{dx}{dt} = Ax(t)

converge to 0, x(t) → 0 when t → ∞.

Equivalently, all the eigenvalues of A have negative real part, Re(λ) < 0 for every λ that is an eigenvalue of A.

Theorem 11.5. If A and B are stable matrices then

AX + XB = −C

has the unique solution

X = \int_{0}^{∞} e^{tA} C e^{tB} dt

Theorem 11.6. Let P be a positive definite matrix. A is a stable matrix if and only if there is a positive definite solution X to A^H X + XA = −P.

12 Numerical methods for computing eigenvalues and eigenvectors

In practice we can rarely use an analytical approach to finding eigenvalues or eigenvectors of a system: either because the system is too large, making Gaussian elimination, solving the characteristic polynomial or similar methods too slow, or because we only need one or a few eigenvalues or eigenvectors, or maybe a rough estimate is enough. You have probably seen by now how time consuming it is to find eigenvalues and eigenvectors analytically. Since there are a huge number of different problems where we want to find eigenvalues or eigenvectors there are also a large number of different methods with different strengths and weaknesses depending on the size and structure of the problem. We will look at two different related methods, the Power method and the QR-method, and a couple of variations of both.

While this chapter gives some insight into how to use or implement some methods, it also serves to show that there is no general method that somehow always works. If a general method doesn't work, there might be another specialized method for your type of problem. It is then important to know what to look for, both in our matrix as well as in what we are actually trying to find. It also serves to show how we can start with a very primitive algorithm and then carefully build upon it in order to remove some or most of its limitations while still keeping its strengths.

Some things to take into consideration when choosing which method to use are:

• The size of the matrix, especially very large matrices which we maybe can't even fit in memory.

• Density of non-zero elements: is the matrix a full matrix (few zeros) or a sparse matrix (few non-zero elements)?

• Does the matrix have any kind of useful structure we could use, for example if it is positive definite, symmetric, Hessenberg or a band matrix?

• What do we seek? Do we need all eigenvalues and eigenvectors or just one or a few? Do we need an exact result or is a good approximation enough? Or maybe we don't need both eigenvalues and eigenvectors?

12.1 The Power method

The Power method is a very simple algorithm for finding the eigenvalue λ_1 with the largest absolute value and its corresponding eigenvector. While the method has a couple of disadvantages making it unsuitable for general matrices, it proves its worth in a couple of specialized problems unsuitable for more general methods such as the QR-method which we will look at later. The Power method also acts as an introduction to many of the thoughts behind the QR-method and its variations.

The Power method in its basic form is a very simple and easy to implement method. We consider the matrix A for which we want to find the "largest" eigenvalue and corresponding eigenvector.

• We assume there is only one eigenvalue on the spectral radius and that A is diagonalizable.

• The method works by iterating x_{k+1} = \frac{Ax_k}{\|Ax_k\|}.

• If λ_1 ≠ 0 then \frac{Ax_k}{\|Ax_k\|} approaches a vector parallel to the eigenvector of the eigenvalue with largest absolute value λ_1.

• We get the eigenvalue λ_1 from the relation A^k x ≈ λ_1 A^{k-1} x when k is large.

It is easy to see that if the matrix A^T is a stochastic matrix, running the Power method can be seen as iterating the Markov chain. This also motivates why we need there to be only one eigenvalue on the spectral radius, since otherwise the Markov chain, and therefore the Power method, would not converge. We will not look into how fast the convergence is here, but be content by noting that the relative error can be shown to be of the order |λ_2/λ_1|^k. In other words it is not only other eigenvalues on the spectral radius that are problematic; if there are eigenvalues close to the spectral radius they will slow down the convergence as well.

We give a short example using the following matrix:

A = \begin{bmatrix} −2 & 1 & 1 \\ 3 & −2 & 0 \\ 1 & 3 & 1 \end{bmatrix}

We start with x_0 = [1 1 1]^T and iterate:

−2 1 1 1 0 Ax0 =  3 −2 0 1 = 1 1 3 1 1 5 Next we calculate the norm of this vector: √ kAx0k = 26

And last we use the norm \|Ax_0\| in order to normalize Ax_0:

x_1 = \frac{Ax_0}{\|Ax_0\|} = [0, 1, 5]^T/\sqrt{26}

Continuing for the next iteration we get:

Ax_1 = \frac{1}{\sqrt{26}} \begin{bmatrix} −2 & 1 & 1 \\ 3 & −2 & 0 \\ 1 & 3 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ 5 \end{bmatrix} = \frac{1}{\sqrt{26}} \begin{bmatrix} 6 \\ −2 \\ 8 \end{bmatrix}

x_2 = \frac{Ax_1}{\|Ax_1\|} = [6, −2, 8]^T/\sqrt{104}

And if we continue for a couple more iterations we get:

• x_2 = [0.59, −0.17, 0.78]^T.

• x_3 = [−0.25, 0.91, 0.33]^T.

• x_4 = [0.41, −0.61, 0.67]^T.

• x_7 = [−0.28, 0.85, −0.44]^T.

• x_{10} = [0.28, −0.79, 0.55]^T.

• For reference the true eigenvector is x_true = [0.267, −0.802, 0.525]^T.

We see that it does seem to converge, but note that it could take quite some time. However, since every step gives a slightly better approximation, if we only need a rough estimate we could stop considerably earlier. Because of the assumptions regarding the system, as well as the problem with convergence if λ_1 ≈ λ_2, this simple version of the algorithm is rarely used unless we already have a good understanding of the system.
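A minimal sketch of the iteration for the example above. The eigenvalue is estimated with a Rayleigh quotient, which is a common variation of the relation A^k x ≈ λ_1 A^{k-1} x used in the text; note that the sign of the iterate can keep flipping, just as in the iterates listed above.

```python
import numpy as np

A = np.array([[-2.0,  1.0, 1.0],
              [ 3.0, -2.0, 0.0],
              [ 1.0,  3.0, 1.0]])

x = np.array([1.0, 1.0, 1.0])
for _ in range(100):                  # more iterations than in the text
    Ax = A @ x
    x = Ax / np.linalg.norm(Ax)

lam = x @ A @ x / (x @ x)             # Rayleigh quotient estimate of lambda_1
print(x, lam)                         # x is parallel (up to sign) to the true eigenvector
```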

In short a couple of things to consider when using the power method are:

• If there is more than one eigenvalue on the spectral radius or if its multiplicity is larger than one, the method won't converge.

• Since the error is of the order |λ_2/λ_1|^k, if λ_2 is close to λ_1 the convergence can be very slow.

• We only get the (in absolute value) largest eigenvalue and corresponding eigenvector.

While these are some considerable disadvantages (we can’t use it on all matrices and the convergence speed is uncertain), there are a couple of things we can do to handle these problems. However while the method does have a couple of negatives, it does have a couple of very strong points as well.

• In every step we make only a single vector-matrix multiplication, making it very fast if we can ensure a fast convergence.

• If A is a sparse matrix (few non-zero elements) the iterations will be fast even for extremely large matrices.

• Since we do no matrix decomposition (as we do in most other common methods) we do not "destroy" the sparsity of the matrix.

In particular we see the advantage of the method on sparse matrices, since most other commonly used methods use a matrix decomposition of the matrix, usually destroying the sparsity of the matrix in the process. This means that while the method might not be as reliable or as fast as other methods such as the QR-method in the general case, if we have a very large sparse matrix we might not even be able to use many other methods since they would be too slow to be usable. In some cases the matrix itself might not even fit in memory anymore after matrix decomposition, further increasing the time needed.

12.1.1 Inverse Power method

In order to handle some of the problems found earlier we will take a look at something called the inverse Power method. While this fixes some of the problems it doesn't fix all of them, and there is a cost in that the method is no longer as suitable for sparse matrices as before.

We consider the invertible matrix A with one eigenpair λ, x. From the definition of eigenvalues and eigenvectors we then have Ax = λx. This gives:

Ax = λx ⇔ (A − µI)x = (λ − µ)x ⇔ x = (A − µI)^{-1}(λ − µ)x

⇔ (λ − µ)^{-1}x = (A − µI)^{-1}x

And we see that ((λ − µ)^{-1}, x) is an eigenpair of (A − µI)^{-1}. If we instead iterate using this new matrix (A − µI)^{-1} the method becomes:

x_{k+1} = \frac{(A − µI)^{-1}x_k}{\|(A − µI)^{-1}x_k\|}

And just like the ordinary Power method this will converge towards the eigenvector of the dominant eigenvalue of (A − µI)^{-1}. The dominant eigenvalue of (A − µI)^{-1} is not necessarily the same as the dominant eigenvalue of A. Rather, we get the dominant eigenvalue (λ − µ)^{-1} of (A − µI)^{-1} for the λ closest to µ. If we take a look at the error we get:

\left(\frac{(λ_{\text{second closest to } µ} − µ)^{-1}}{(λ_{\text{closest to } µ} − µ)^{-1}}\right)^{k} = \left(\frac{λ_{\text{closest to } µ} − µ}{λ_{\text{second closest to } µ} − µ}\right)^{k}

We can easily see that if µ is a good approximation of an eigenvalue of A then (λ_{\text{closest to } µ} − µ) ≈ 0 and the error will be close to zero as well; this especially improves the convergence with multiple close eigenvalues. With a good approximation of λ the method will converge very fast, often in only one or a couple of iterations. Especially in the case where we know the eigenvalue exactly beforehand, such as when working with irreducible stochastic matrices, we get an error equal to zero and the method should converge in only

134 one step. We also see that if we change µ we can find the eigenvectors of not only the dominant eigenvalue, but the eigenvectors to multiple eigenvalues. However we still have problems with eigenvalues with multiplicity larger than one, or when more than one have the same absolute value. The inverse Power method also requires us to solve a linear system or calculate the inverse of the matrix (A − µI). This makes the method unsuitable for large sparse matrices where the basic Power method shows it’s greatest strength, since calculating the inverse matrix usually destroys the sparcity of the matrix.
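Analogously, here is a minimal sketch of the inverse Power method (again Python/NumPy, not from the compendium; inverse_power_method is a name chosen here). Rather than forming the inverse explicitly it factorizes (A − µI) once with scipy.linalg.lu_factor and then solves a linear system in every iteration, which is one common way of implementing the method.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_power_method(A, mu, x0, tol=1e-12, max_iter=100):
    """Sketch of the inverse Power method with shift mu. Converges towards the
    eigenvector of the eigenvalue of A closest to mu (assumed simple and
    strictly closest); mu must not itself be an eigenvalue of A."""
    n = A.shape[0]
    lu, piv = lu_factor(A - mu * np.eye(n))  # factorize (A - mu*I) once
    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        y = lu_solve((lu, piv), x)           # y = (A - mu*I)^{-1} x
        y /= np.linalg.norm(y)
        if np.linalg.norm(y - np.sign(y @ x) * x) < tol:   # allow for sign flips
            x = y
            break
        x = y
    lam = x @ (A @ x)                        # Rayleigh quotient: eigenvalue estimate for A
    return lam, x
```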

We give a short example using the inverse power method to find the eigenvector to the dominant eigenvalue of:

3 3 1 0 1 1 0 3 A =   1 0 1 3 1 3 0 2 Since A is non-negative and irreducible (why?) we can use Perron Frobenius The- orem 4.1 to find bounds on the dominant eigenvalue. We remember this means the dominant eigenvalue is at least equal to the lowest row sum and at most equal to the largest row sum. Since at least one of the diagonal elements are non-zero we know that A is primitive as well so that we know that there is only one eigenvalue on the spectral radius. This ensures the convergence (when seeking the dominant eigenvalue).

We choose µ as the largest possible value of the dominant eigenvalue: µ = max row sum = 7. This guarantees that the dominant eigenvalue is the eigenvalue closest to µ, ensuring we converge towards the right eigenvector. We get:

−4 3 1 0   1 −6 0 3  A − µI = A − 7I =    1 0 −6 3  1 3 0 −5 And we get its inverse as:

−0.4038 −0.3173 −0.0673 −0.2308 −1 −0.1538 −0.3590 −0.0256 −0.2308 (A − µI) =   −0.1538 −0.1923 −0.1923 −0.2308 −0.1731 −0.2788 −0.0288 −0.3846 > Starting with x0 = [1 1 1 1] gives:

\[ x_1 = \frac{(A - \mu I)^{-1}x_0}{\|(A - \mu I)^{-1}x_0\|} = [-0.5913 \;\; -0.4463 \;\; -0.4463 \;\; -0.5020]^\top \]

If we continue to iterate a couple of times, multiplying any iterate that comes out negative by −1 so that we end up with a positive vector, we get:

• x_1 = [0.5913 0.4463 0.4463 0.5020]^⊤

• x_2 = [0.6074 0.4368 0.4368 0.4995]^⊤

• x_3 = [0.6105 0.4351 0.4351 0.4986]^⊤

• x_true = [0.6113 0.4347 0.4347 0.4983]^⊤

We see that we are very close to the true eigenvector after only three iterations; in fact we have a rather good approximation after only one iteration. Using the found eigenvector we can now also find a better approximation of the true dominant eigenvalue λ = 5.8445 from the relation Ax = λx. For example, using x_3 we get:

\[ Ax_3 = [3.5721 \;\; 2.5414 \;\; 2.5414 \;\; 2.9131]^\top \quad\Rightarrow\quad \lambda \approx \frac{3.5721}{0.6105} = 5.8511, \]

which is very close to the true eigenvalue λ = 5.8445.
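Assuming the inverse_power_method sketch above, the example can also be checked numerically; the printed values should agree with the iterates and the eigenvalue λ ≈ 5.8445 reported above, up to rounding.

```python
import numpy as np

A = np.array([[3.0, 3, 1, 0],
              [1, 1, 0, 3],
              [1, 0, 1, 3],
              [1, 3, 0, 2]])
mu = A.sum(axis=1).max()     # largest row sum = 7, the Perron-Frobenius upper bound
lam, x = inverse_power_method(A, mu, np.ones(4))
print(lam)                   # approximately 5.8445
print(np.abs(x))             # approximately [0.6113, 0.4347, 0.4347, 0.4983]
```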

The inverse Power method is often used when you have a good approximation of an eigenvalue and need to find an approximation of the corresponding eigenvector. This is often the case in real-time systems where a particular eigenvector might need to be computed multiple times during a very short time as the matrix changes.

Overall, both the Power method and the inverse Power method are unsuitable for general matrices, since we cannot guarantee convergence without further inspecting the matrix. In order to overcome the limitations of these methods we will need to calculate many eigenvalues at the same time, as is done for example in the QR-method below.

12.2 The QR-method

Often we might not know anything about the matrix beforehand and we might want to find not only one or a couple of eigenvalues/eigenvectors but all of them. We also don’t want to be limited to only finding the eigenpairs for simple eigenvalues.

The QR-method solves all of these problems: it finds all eigenpairs, and it is generally very fast unless we are working with very large matrices, where we might need specialized algorithms. Since the method is fast, reliable and finds all eigenpairs, it is almost always the QR-method (or a variation thereof) that is used whenever you calculate eigenvalues/eigenvectors using computational software. Unless you have very specific requirements or already know a more suitable method, it is usually a good idea to start by trying the QR-method and then continue looking at other methods if that is insufficient.

Like the Power method, the QR-method is an iterative method which after some time converges and gives us the eigenvalues and eigenvectors. As the name suggests, the method uses the QR-decomposition of the original matrix. We remember that we could find a QR decomposition A = QR, where Q is a unitary matrix and R is upper triangular, using for example the Gram-Schmidt process.

The method works by iterating:

• We start by assigning A0 := A

• In every step we then compute the QR-decomposition A_k = Q_kR_k.

• We then form A_{k+1} = R_kQ_k = Q_k^H Q_kR_kQ_k = Q_k^H A_kQ_k = Q_k^{-1} A_kQ_k.

• Since Q_k is unitary, all A_k are similar and therefore have the same eigenvalues.

• When the method converges, A_k will be a triangular matrix and therefore have its eigenvalues on the diagonal.

The QR-method can be seen as a variation of the Power method, but instead of working with only one vector we work with a complete basis of vectors, using the QR-decomposition to normalize and orthogonalize in every step. In practice a few modifications are needed in order to handle a couple of problems with its most basic form, some of which are:

• Computing the QR-decomposition of a matrix is expensive, and we need to do it in every step.

• The convergence itself, like with the Power method, can also be very slow.

• Like the Power method we have problems with complex and non-simple eigenvalues.

However, if we can handle these problems we have a very stable and fast algorithm for general matrices. Especially the stability of the algorithm is one of its huge selling points and the biggest reason why it is used over other similar methods using for example the LUP decomposition. But how do we improve the algorithm?
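Before turning to those improvements, a minimal sketch of the basic (unshifted) iteration just described might look as follows in Python/NumPy (not from the compendium; the example matrix is chosen symmetric so that all eigenvalues are real and the convergence problems listed above are avoided).

```python
import numpy as np

def qr_method_basic(A, max_iter=500, tol=1e-10):
    """Sketch of the unshifted QR-method: factorize A_k = Q_k R_k and form
    A_{k+1} = R_k Q_k. All A_k are similar to A; if the iteration converges,
    the eigenvalues appear on the diagonal."""
    Ak = np.array(A, dtype=float)
    for _ in range(max_iter):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q
        if np.linalg.norm(np.tril(Ak, -1)) < tol:   # (numerically) triangular?
            break
    return Ak

A = np.array([[4.0, 1, 2], [1, 3, 0], [2, 0, 1]])   # symmetric, so real eigenvalues
print(np.diag(qr_method_basic(A)))                  # approximations of the eigenvalues
```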

We start by looking at how we can speed up the computation of the QR-decomposition, since even if we use Gram-Schmidt or some other method it is still a very expensive operation. We remember that a matrix A is (upper) Hessenberg if all elements a_{i,j} = 0 for i > j + 1. In other words, A is of the form:

\[ A = \begin{pmatrix} \star & \star & \star & \cdots & \star & \star & \star \\ \star & \star & \star & \cdots & \star & \star & \star \\ 0 & \star & \star & \cdots & \star & \star & \star \\ 0 & 0 & \star & \cdots & \star & \star & \star \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \star & \star & \star \\ 0 & 0 & 0 & \cdots & 0 & \star & \star \end{pmatrix} \]

If A is Hessenberg then we can speed up the algorithm considerably since:

• Calculating the QR-factorization of a Hessenberg matrix is much faster than for a general matrix.

• The product of a Hessenberg matrix and a triangular matrix (both in the same direction) is itself a Hessenberg matrix, so if A_0 is Hessenberg then every A_k, k = 0, 1, 2, ... is Hessenberg as well.

• Since a Hessenberg matrix is nearly triangular, its convergence towards the triangular matrix will be improved, resulting in fewer total steps as well.

We don’t want to limit our algorithm to only Hessenberg matrices, however, so what if we can change the matrix we work with to one in Hessenberg form? This is possible since:

Theorem 12.1. Every square matrix A has a matrix decomposition A = U^H H U where

• U is a unitary matrix.

• H is a Hessenberg matrix.

Since U is a unitary matrix, A and H are similar and have the same eigenvalues, so in order to find the eigenvalues of A we can instead find the eigenvalues of H, which is, as we wanted, Hessenberg. A Hessenberg decomposition can be calculated efficiently using for example Householder transformations.
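A Hessenberg decomposition of this kind is available in standard libraries; for example, scipy.linalg.hessenberg returns H (and optionally the unitary factor) directly. A short sketch with an arbitrary example matrix could look like this; note that SciPy uses the convention A = U H U^H, which is the same decomposition as in Theorem 12.1 with the roles of U and U^H swapped.

```python
import numpy as np
from scipy.linalg import hessenberg

A = np.array([[3.0, 3, 1, 0],
              [1, 1, 0, 3],
              [1, 0, 1, 3],
              [1, 3, 0, 2]])

# H is upper Hessenberg and U is unitary (here real orthogonal), with A = U H U^H,
# so H is similar to A and has the same eigenvalues.
H, U = hessenberg(A, calc_q=True)

print(np.allclose(U @ H @ U.conj().T, A))   # True: the decomposition reproduces A
print(np.allclose(np.tril(H, -2), 0))       # True: H is zero below the first subdiagonal
```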

While we generally have a faster convergence using H, it can still be very slow. We remember that we could solve this problem in the Power method by shifting the eigenvalues of the matrix. In the same way we can shift the eigenvalues in the QR-method using (A_k − σI). For A_{k+1} we then get:

\[ A_{k+1} = Q_k^H(A_k - \sigma I)Q_k + \sigma I = Q_k^H A_kQ_k \]

We see that A_k and A_{k+1} are still similar and therefore have the same eigenvalues. As in the Power method we want σ to be as close as possible to one of the eigenvalues. Usually the diagonal elements of the Hessenberg matrix are used, since they prove to give rather good approximations. Since the diagonal elements of A_k should give increasingly accurate approximations of the eigenvalues, we can change our shift to another eigenvalue or improve the current one between iterations as well.

As soon as a subdiagonal element is close to zero, we have either found an eigenvalue (if it is one of the outer elements), or we can divide the problem into two smaller ones since we would have:

\[ A = \begin{pmatrix} A_{1,1} & A_{1,2} \\ 0 & A_{2,2} \end{pmatrix} \]

This still leaves the problem with complex eigenvalues of real matrices. In order to solve this problem, an implicit method using multiple shifts at the same time is used in practice. We leave this implicit method, also called the Francis algorithm, by noting that it exists and works by much the same principle as the inverse Power method.

We give a summary of the modified QR-method using a single shift:

1. Compute the Hessenberg form (A = U^H H U) using for example Householder transformations.

2. Assign A_0 = H. Then in every step, choose a shift σ as one of the diagonal elements of A_k.

3. Compute the QR-factorization Q_kR_k of (A_k − σI).

4. Compute A_{k+1} = R_kQ_k + σI.

5. Iterate steps 2–4 until A_k is (numerically) a triangular matrix, optionally dividing the problem in two whenever a subdiagonal element becomes zero.
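Putting the steps together, a minimal sketch of the single-shift method might look as follows (not from the compendium; it assumes real eigenvalues, leaves out the optional splitting in step 5, and ignores the implicit multi-shift handling of complex eigenvalues discussed above).

```python
import numpy as np
from scipy.linalg import hessenberg

def qr_method_shifted(A, max_iter=1000, tol=1e-12):
    """Sketch of the single-shift QR-method: Hessenberg reduction followed by
    shifted QR steps. Assumes real eigenvalues; returns approximations of them."""
    Ak = hessenberg(np.array(A, dtype=float))         # step 1: reduce to Hessenberg form
    n = Ak.shape[0]
    for _ in range(max_iter):
        sigma = Ak[-1, -1]                            # step 2: shift = a diagonal element
        Q, R = np.linalg.qr(Ak - sigma * np.eye(n))   # step 3: QR-factorize the shifted matrix
        Ak = R @ Q + sigma * np.eye(n)                # step 4: form A_{k+1}
        if np.linalg.norm(np.tril(Ak, -1)) < tol:     # step 5: stop when (numerically) triangular
            break
    return np.diag(Ak)

A = np.array([[4.0, 1, 2], [1, 3, 0], [2, 0, 1]])     # symmetric example: real eigenvalues
print(qr_method_shifted(A))
```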

12.3 Applications

12.3.1 PageRank

PageRank is the method originally designed by Google to rank homepages in their search engine in order to show the most relevant pages first. In the ranking algorithm we consider homepages as nodes in a directed graph, and a link between two homepages as an edge between the corresponding vertices in the graph. Since the number of homepages is huge (and rapidly increasing), this not only puts high demands on the accuracy of the model (few look at more than the first couple of pages of results), it also puts a huge demand on the speed of the algorithm. The principle used in the ranking is that homepages that are linked to by many other homepages and/or by other important homepages are considered more important (higher rank) than others. An image illustrating the principle can be seen in Figure ??.

PageRank uses the adjacency matrix 3.1 of the graph with a couple of modifications. First, we do not allow any homepage to link to itself. Secondly, any homepage which links to no other page, usually called a dangling node, needs to be handled separately. Usually such a page is assumed to instead link to all pages (even to itself, even though we otherwise don’t allow that). This is the method we will use here, although there are other methods available as well. Before calculating PageRank we make one more modification to the adjacency matrix: we weight the matrix by normalizing every non-zero row such that it sums to one. This is our matrix A. PageRank is then the eigenvector of the dominant eigenvalue of the matrix M, normalized such that its elements sum to one:

\[ M = c(A + gv^\top)^\top + (1 - c)ve^\top \]

[Figure: illustration of the PageRank principle; image by Felipe Micaroni Lalli]

where 0 < c < 1 is a scalar, usually c = 0.85, e is the one-vector and A, as above, is a non-negative square matrix whose non-zero rows all sum to one. v is a non-negative weight vector with the sum of all elements equal to one. g is a vector whose elements are zero except for those corresponding to dangling nodes, which are equal to one. We see that saying that dangling nodes link to all other nodes is not strictly true; rather, we assume they link according to a weight vector with sum one. Usually, however, all or at least most of the elements of v are equal, so this is more or less true.

To illustrate the method we give a short example where we consider the following system:

[Figure: a directed graph with five nodes, where node 1 links to nodes 2, 3 and 4, node 2 links to node 4, node 3 has no outgoing links (a dangling node), node 4 links to node 5, and node 5 links to node 4.]

We assume all elements of v are equal, such that no node is given a higher weight. Using the adjacency matrix of the graph, normalizing every row to sum to one and adding the change on the third row corresponding to the dangling node, we get the matrix (A + gv^⊤):

 0 1/3 1/3 1/3 0   0 0 0 1 0  >   (A + gv ) = 1/5 1/5 1/5 1/5 1/5    0 0 0 0 1  0 0 0 1 0

Using this we can now get our matrix M = c(A + gv^⊤)^⊤ + (1 − c)ve^⊤:

\[ M = \begin{pmatrix} 0 & c/3 & c/3 & c/3 & 0 \\ 0 & 0 & 0 & c & 0 \\ c/5 & c/5 & c/5 & c/5 & c/5 \\ 0 & 0 & 0 & 0 & c \\ 0 & 0 & 0 & c & 0 \end{pmatrix}^{\!\top} + \frac{1-c}{5}\begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix} \]

If we choose c = 0.85 and calculate the eigenvector to the dominant eigenvalue we end up with the PageRank vector p = [0.0384, 0.0492, 0.0492, 0.4458, 0.4173]^⊤. From the result we can see a couple of things. First, node one, which has no other node linking to it, has the lowest PageRank, and nodes two and three, which are only linked to by node one, have equal and the second lowest PageRank. Node four has a very high PageRank since it is linked to by many nodes, and node five has a high PageRank since it is linked to by node four, which has a very high PageRank.
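To check the example, a short Python/NumPy sketch (not part of the compendium) that builds M for the five-node graph and extracts the dominant eigenvector, normalized to sum to one, could look as follows; it should reproduce the PageRank vector p given above up to rounding.

```python
import numpy as np

c, n = 0.85, 5
# Adjacency matrix of the example graph: row i lists the links from node i+1;
# node 3 (row index 2) is the dangling node with no outgoing links.
A = np.array([[0, 1, 1, 1, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

v = np.ones(n) / n                       # uniform weight vector
g = (A.sum(axis=1) == 0).astype(float)   # indicator of dangling nodes
nz = A.sum(axis=1) > 0
A[nz] = A[nz] / A[nz].sum(axis=1, keepdims=True)   # normalize non-zero rows

M = c * (A + np.outer(g, v)).T + (1 - c) * np.outer(v, np.ones(n))

eigvals, eigvecs = np.linalg.eig(M)
p = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
p = p / p.sum()
print(p)   # approximately [0.0384, 0.0492, 0.0492, 0.4458, 0.4173]
```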

We see that the definition seems reasonable, but we still have a large problem to consider: how do we find the eigenvector to the dominant eigenvalue of such a huge matrix in a reasonable time? For reference, in 2008–2009 both Google and Bing reported indexing over 1 000 000 000 000 (i.e. 10^12) webpages, which in turn means our matrix M would have about 10^24 elements. Obviously something needs to be done.

We note that the matrix M has been constructed such that it is a non-negative matrix, and it also seems reasonable to assume that it is irreducible through our vector v. In fact, unless some nodes have a zero weight in v, M will be a positive matrix. We also have that the sum of every column of M is equal to one, so we can say that M is actually a (column) stochastic matrix. This gives us a more intuitive way to describe PageRank, looking at it from a Markov chain perspective.

Consider an Internet surfer surfing the web. The surfer starts at a random page; then, with probability c, it follows one of the links from the current page, ending up at a new page, or with probability 1 − c it starts over at a new random page. The surfer then repeatedly either follows a link from the current page or starts over at a new random page. The PageRank of a node (page) can then be seen as the probability that the surfer is at that page after a long time.

While this by itself doesn’t make it easier, we note something about A. Since every homepage on average only links to a couple of other homepages, A is an extremely sparse matrix. While A has maybe 10^24 elements, only about 10^12·m of them, where m is the average number of links on a homepage, are non-zero. If we have a sparse matrix, we could use the Power method to exploit the sparsity of the data rather than trying (and failing) to use other methods such as the QR-method. We note that even though M is a full matrix, the multiplication x_{n+1} = Mx_n in the Power method can be done fast by first computing the product with the sparse matrix A (a sketch of this is given after the questions below). Take a couple of minutes to think through the following:

• How can we calculate x_{n+1} = Mx_n by multiplying x_n with cA^⊤ instead, so that we can use that A is sparse?

• How can the weight vector v be used in order to combat link farms, where people try to artificially increase their PageRank?

• The attribute ’nofollow’ can be used on links to not increase the PageRank of the target page. Can you think of any sites where this could be useful?

• What influence does the value c have on the PageRank? For example what is important for a high PageRank if c is close to zero rather than 0.85?
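As a hint for the first question: if x_n is normalized so that e^⊤x_n = 1, then Mx_n = cA^⊤x_n + (c·(g^⊤x_n) + 1 − c)·v, so only the sparse matrix A is ever multiplied with a vector. A sketch using scipy.sparse (not from the compendium; pagerank_power is a name chosen here) could look as follows.

```python
import numpy as np
import scipy.sparse as sp

def pagerank_power(A, v, c=0.85, tol=1e-12, max_iter=200):
    """Power method for PageRank using only sparse products with A.
    A: sparse row-normalized link matrix (dangling rows are all zero),
    v: non-negative weight vector summing to one.
    Uses M x = c A^T x + (c (g.x) + 1 - c) v, valid when x sums to one."""
    n = A.shape[0]
    dangling = np.asarray(A.sum(axis=1)).ravel() == 0   # indicator of all-zero rows
    x = np.ones(n) / n
    for _ in range(max_iter):
        x_new = c * (A.T @ x) + (c * x[dangling].sum() + (1 - c)) * v
        if np.linalg.norm(x_new - x, 1) < tol:
            return x_new
        x = x_new
    return x

# The five-node example again, stored as a sparse matrix:
rows = [0, 0, 0, 1, 3, 4]
cols = [1, 2, 3, 3, 4, 3]
vals = [1/3, 1/3, 1/3, 1.0, 1.0, 1.0]
A = sp.csr_matrix((vals, (rows, cols)), shape=(5, 5))
v = np.ones(5) / 5
print(pagerank_power(A, v))   # again roughly [0.038, 0.049, 0.049, 0.446, 0.417]
```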

