
GG313 GEOLOGICAL DATA ANALYSIS LECTURE NOTES

PAUL WESSEL

SECTION 3: LINEAR (MATRIX) ALGEBRA

OVERVIEW OF MATRIX ALGEBRA (or All you ever wanted to know about Linear Algebra but were afraid to ask...)

A matrix is simply a rectangular array of "elements" arranged in a series of m rows and n columns. The order of a matrix is the specification of the number of rows by the number of columns. Elements of a matrix are given as $a_{ij}$, where the value of i specifies the row position and the value of j specifies the column position; so $a_{ij}$ identifies the element at position (i, j). An element can be a number (real or complex), an algebraic expression, or (with some restrictions) a matrix or matrix expression. For example:

$$A = \begin{bmatrix} 12 & 4 & 10 \\ 8 & 1 & 11 \\ 15 & 3 & 11 \\ 14 & 1 & 11 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{43} \end{bmatrix}$$

This matrix, A, has order 4 × 3; the element $a_{23} = 11$, $a_{13} = 10$, etc. The notation for matrices is not always consistent, but usually follows one of these schemes:
matrix - designated by a bold letter (most common), a capital letter, or a letter with an underscore, brackets, or hat (^). The order is also sometimes given: A(4,3) means A is 4 × 3.
order - always given as rows × columns, but the letters n, m, p are used differently: n(rows) × m(columns) or m(rows) × n(columns).
element - most commonly $a_{ij}$ with i = row, j = column (sometimes k, l, p).
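To make the indexing concrete, here is a minimal NumPy sketch (the array holds the example matrix above; note that NumPy indexes from 0, so $a_{ij}$ is A[i-1, j-1]):

```python
import numpy as np

# The 4 x 3 example matrix from the text
A = np.array([[12, 4, 10],
              [ 8, 1, 11],
              [15, 3, 11],
              [14, 1, 11]])

print(A.shape)    # (4, 3): order is rows x columns
print(A[1, 2])    # a_23 = 11
print(A[0, 2])    # a_13 = 10
```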

The advantages of matrix algebra lie mainly in the fact that it provides a concise and simple method for manipulating large sets of numbers or computations, making it ideal for computers. Also, (1) the compact form of matrices allows convenient notation for describing large tables of data; (2) matrix operations allow complex relationships to be seen which would otherwise be obscured by the sheer size of the data (i.e., it aids clarification); and (3) most matrix manipulation involves just a few standard operations for which standard subroutines are readily available. As a convention with data matrices (i.e., the elements represent data values), the columns usually represent the different variables (e.g., one column contains temperatures, another salinity, etc.) while rows contain the samples (e.g., the values of the variables at each depth or time). Since there are usually more samples than variables, such data matrices are usually rectangular, having more rows (m) than columns (n): order m × n where m > n. A column vector is a matrix containing only a single column of elements:

$$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}$$

Row vector is a matrix containing only a single row of elements:

$$\mathbf{a}^T = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}$$
The size of a vector is simply the number of elements it contains (= n in both examples above). The null matrix, written as 0 or 0(m,n), has all elements equal to 0; it plays the role of zero in matrix algebra. A square matrix has the same number of rows as columns, so its order is n × n. A diagonal matrix is a square matrix with zeros in all positions except along the principal (or leading) diagonal:
$$D = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 6 \end{bmatrix}$$
or
$$d_{ij} = \begin{cases} 0 & \text{for } i \neq j \\ \text{non-zero} & \text{for } i = j \end{cases}$$

This type of matrix is important for scaling rows or columns of other matrices. The identity matrix (I) is a diagonal matrix with all of the nonzero elements equal to one. Written as I or $I_n$, it plays the role of 1 in matrix algebra (A·I = I·A = A). A lower triangular matrix (L) is a square matrix with all elements equal to zero above the principal diagonal:
$$L = \begin{bmatrix} 1 & 0 & 0 \\ 3 & 7 & 0 \\ 8 & 2 & 6 \end{bmatrix}$$
or

$$l_{ij} = \begin{cases} 0 & \text{for } i < j \\ \text{non-zero} & \text{for } i \geq j \end{cases}$$

An upper triangular matrix is a square matrix with all elements equal to zero below the principal diagonal:

$$u_{ij} = \begin{cases} 0 & \text{for } i > j \\ \text{non-zero} & \text{for } i \leq j \end{cases}$$

(If one multiplies 2 triangular matrices of the same form, the result is a third matrix of the same form.) We also have the fully populated matrix, which is a matrix with all of its elements nonzero; the sparse matrix, which is a matrix with only a small proportion of its elements nonzero; and the scalar, which is simply a number (i.e., a matrix with a single element).
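These special matrix types are easy to generate and test numerically; a short sketch using NumPy's built-in helpers (the test matrix A is arbitrary):

```python
import numpy as np

D = np.diag([3, 1, 6])      # the diagonal matrix D from the text
print(D)                    # zeros everywhere off the principal diagonal

I = np.eye(3)               # identity matrix I_3
A = np.arange(1, 10).reshape(3, 3)
print(np.allclose(A @ I, A) and np.allclose(I @ A, A))   # A I = I A = A

L = np.array([[1, 0, 0],
              [3, 7, 0],
              [8, 2, 6]])   # the lower triangular matrix L from the text
LL = L @ L
print(np.array_equal(LL, np.tril(LL)))   # product of two lower triangular matrices is lower triangular
```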

The transpose of a matrix is obtained by interchanging the rows and columns of the matrix. So row i becomes column i and column j becomes row j (and the order of the matrix is reversed):
$$A = \begin{bmatrix} 1 & 14 \\ 6 & 7 \\ 8 & 2 \end{bmatrix};\qquad A^T = \begin{bmatrix} 1 & 6 & 8 \\ 14 & 7 & 2 \end{bmatrix}$$
A diagonal matrix is its own transpose: $D^T = D$. In general, we find

$$a_{ij} \Leftrightarrow a_{ji}$$
A symmetric matrix is a square matrix which is symmetrical about its principal diagonal, so $a_{ij} = a_{ji}$. Therefore a symmetric matrix is equal to its own transpose:
$$A = A^T = \begin{bmatrix} 1 & 2 & 5 \\ 2 & 6 & 3 \\ 5 & 3 & 4 \end{bmatrix} = \text{symmetric}$$

Skew symmetric matrix is a matrix in which:

$$a_{ij} = -a_{ji} \quad\text{so}\quad A^T = -A$$
$$a_{ii} = 0 \;\text{(principal diagonal elements are zero)}$$
$$A = \begin{bmatrix} 0 & 4 & -5 \\ -4 & 0 & 3 \\ 5 & -3 & 0 \end{bmatrix} = \text{skew symmetric}$$

Any square matrix can be split into the sum of a symmetric and a skew symmetric matrix:
$$A = \frac{1}{2}\left(A + A^T\right) + \frac{1}{2}\left(A - A^T\right)$$
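A quick numerical check of this decomposition (a minimal sketch; the matrix values are made up):

```python
import numpy as np

A = np.array([[4.0, 7.0, 1.0],
              [3.0, 2.0, 9.0],
              [5.0, 8.0, 6.0]])   # arbitrary square matrix

S = 0.5 * (A + A.T)   # symmetric part
K = 0.5 * (A - A.T)   # skew symmetric part

print(np.allclose(S, S.T))     # True: S^T = S
print(np.allclose(K, -K.T))    # True: K^T = -K (zeros on the diagonal)
print(np.allclose(S + K, A))   # True: the two parts sum back to A
```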

Basic Matrix Operations:

Matrix addition and subtraction require matrices of the same order, since the operation simply adds or subtracts corresponding elements. So, if A + B = C,
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix};\quad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix};\quad C = \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} \\ a_{21}+b_{21} & a_{22}+b_{22} \\ a_{31}+b_{31} & a_{32}+b_{32} \end{bmatrix}$$
and
(1) A + B = B + A

(2) (A + B) + C = A + (B + C)
where all matrices are of the same order. Scalar multiplication of a matrix multiplies the matrix by a constant (scalar):

$$\lambda A = \lambda\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} = \begin{bmatrix} \lambda a_{11} & \lambda a_{12} \\ \lambda a_{21} & \lambda a_{22} \\ \lambda a_{31} & \lambda a_{32} \end{bmatrix}$$
where $\lambda$ is a scalar. Every element is multiplied by the scalar. The scalar product (or dot product or inner product) is the product of 2 vectors of the same size: $\mathbf{a}\cdot\mathbf{b} = \lambda$, where a is a row vector (or the transpose of a column vector) of length n, b is a column vector (or the transpose of a row vector), also of length n, and $\lambda$ is the scalar product of a·b. Then:

$$\mathbf{a} = \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix};\qquad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$
and
$$\lambda = a_1b_1 + a_2b_2 + a_3b_3$$
Some people like to visualize this multiplication as:

$$\begin{bmatrix} 2 & 1 & 4 & 5 \end{bmatrix}\begin{bmatrix} 1 \\ 3 \\ 4 \\ 2 \end{bmatrix} = 2(1) + 1(3) + 4(4) + 5(2) = 31$$
Fig. 3-1. Dot product of two vectors.
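The same product computed numerically (a minimal sketch of the vectors in Fig. 3-1):

```python
import numpy as np

a = np.array([2, 1, 4, 5])
b = np.array([1, 3, 4, 2])

print(a @ b)           # 2*1 + 1*3 + 4*4 + 5*2 = 31
print(np.dot(a, b))    # the same scalar product
```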

Conceptually, this product can be thought of as multiplying the length of one vector by the component of the other vector which is parallel to the first:

Fig. 3-2. Graphical meaning of the dot product of two vectors: the projection of b onto a has length |b| cos(θ).

Think of b as a force and |a| as the magnitude of a displacement; the dot product is then the work done in the direction of a. Thus:
$$\mathbf{a}\cdot\mathbf{b} = |\mathbf{a}||\mathbf{b}|\cos(\theta)$$
where

$$|\mathbf{x}| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
The maximum principle says that the unit vector n making a·n a maximum is the unit vector pointing in the same direction as a: if n ∥ a then cos(θ) = cos(0°) = 1 and a·n = |a||n|cos(θ) = |a||n| = |a|. This is equally true where d is any vector of a given magnitude: the vector n which parallels d will give the largest scalar product. Parallel vectors thus have cos(θ) = 1, so a·b = |a||b| and $\mathbf{a} = \lambda\mathbf{b}$ (i.e., 2 vectors are parallel if one is simply a scalar multiple of the other; this property comes from equating direction cosines), where
$$\lambda = |\mathbf{a}|/|\mathbf{b}|$$
Perpendicular vectors have cos(θ) = cos 90° = 0, so a·b = 0 when a ⊥ b. Squaring a vector is simply:
$$\mathbf{a}^2 = \mathbf{a}\cdot\mathbf{a}^T \;\text{for row vectors}$$
$$\mathbf{a}^2 = \mathbf{a}^T\cdot\mathbf{a} \;\text{for column vectors}$$
Matrix multiplication requires "conformable" matrices. Conformable matrices are ones in which there are as many columns in the first as there are rows in the second:

$$C_{(m,n)} = A_{(m,p)}\cdot B_{(p,n)}$$

So, the product matrix C is of order m × n and has elements $c_{ij}$:
$$c_{ij} = \sum_{k=1}^{p} a_{ik}b_{kj}$$
This is an extension of the scalar product: each element of C is the scalar product of a row vector in A and a column vector in B.

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}$$

$$c_{12} = a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32}$$
In "box form":

Fig. 3-3. The matrix product of two matrices in "box form": each element of C is the scalar product of the corresponding row vector of A (m × p) and column vector of B (p × n), giving C of order m × n.

Order of multiplication is important: usually A·B ≠ B·A, and unless A and B are square matrices, or the order of $A^T$ is the same as the order of B (or vice versa), one of the two products cannot even be formed. Order is specified by stating:

A is pre-multiplied by B (for B·A)
A is post-multiplied by B (for A·B)
Multiple products:
$$D = A\cdot B\cdot C = (A\cdot B)\cdot C = A\cdot(B\cdot C)$$
(The order in which the pairs are multiplied is not important mathematically.)

Computational considerations:

$$C_{(m,n)} = A_{(m,p)}\cdot B_{(p,n)}$$
involves m × n × p multiplications and m × n × (p − 1) additions, so:

$$E_{(m,n)} = \left[A_{(m,p)}\cdot B_{(p,q)}\right]\cdot C_{(q,n)}$$
gives m × p × q multiplications to form $D_{(m,q)} = A\cdot B$, and then
$$E_{(m,n)} = D_{(m,q)}\cdot C_{(q,n)}$$
gives m × q × n more multiplications, while
$$E_{(m,n)} = A_{(m,p)}\cdot\left[B_{(p,q)}\cdot C_{(q,n)}\right]$$
gives p × q × n multiplications to form $D_{(p,n)} = B\cdot C$, and then
$$E_{(m,n)} = A_{(m,p)}\cdot D_{(p,n)}$$
gives m × p × n more multiplications.

Therefore:
1) (A·B)·C ⇒ mq(p + n) total multiplications
2) A·(B·C) ⇒ pn(m + q) total multiplications
If both A and B are 100 × 100 matrices and C is 100 × 1, then m = 100, p = 100, q = 100, and n = 1. Multiplying using form 1) involves ~1 × 10⁶ multiplications, whereas form 2) involves 2 × 10⁴; so computing B·C first, then pre-multiplying by A, saves almost a million multiplications and almost an equal number of additions in this example. Order is therefore extremely important computationally, for both speed and accuracy (more operations lead to a greater accumulation of round-off errors).
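This effect is easy to demonstrate. The sketch below times the two groupings for a modest problem size (500 instead of 100, to make the difference obvious; exact timings will vary by machine):

```python
import numpy as np
import time

m = p = q = 500
n = 1
A = np.random.rand(m, p)
B = np.random.rand(p, q)
C = np.random.rand(q, n)

t0 = time.perf_counter()
r1 = (A @ B) @ C           # form 1: ~ m q (p + n) multiplications
t1 = time.perf_counter()
r2 = A @ (B @ C)           # form 2: ~ p n (m + q) multiplications
t2 = time.perf_counter()

print(np.allclose(r1, r2))   # the same answer either way
print(t1 - t0, t2 - t1)      # grouping B @ C first is far cheaper here
```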

The transpose of a matrix product is simply the product of the transposes of the individual matrices, taken in reverse order:

$$D = A\cdot B\cdot C \quad\Rightarrow\quad D^T = C^T\cdot B^T\cdot A^T$$

Multiplication by I leaves the matrix unchanged: A·I = I·A = A. For example,
$$\begin{bmatrix} 3 & 6 & 9 \\ 2 & 8 & 7 \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 6 & 9 \\ 2 & 8 & 7 \end{bmatrix}$$

Pre-multiplication by a diagonal matrix: C = D·A, where D is a diagonal matrix, yields the A matrix with each row scaled by the corresponding diagonal element of D:
$$D = \begin{bmatrix} d_{11} & & \\ & d_{22} & \\ & & d_{33} \end{bmatrix};\qquad A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$
$$C = \begin{bmatrix} a_{11}d_{11} & a_{12}d_{11} & a_{13}d_{11} \\ a_{21}d_{22} & a_{22}d_{22} & a_{23}d_{22} \\ a_{31}d_{33} & a_{32}d_{33} & a_{33}d_{33} \end{bmatrix}\quad\begin{matrix} \leftarrow \text{each element} \times d_{11} \\ \leftarrow \text{each element} \times d_{22} \\ \leftarrow \text{each element} \times d_{33} \end{matrix}$$

Post-multiplication by a diagonal matrix produces a matrix in which each column has been scaled by a diagonal element: C = A·D
$$C = \begin{bmatrix} a_{11}d_{11} & a_{12}d_{22} & a_{13}d_{33} \\ a_{21}d_{11} & a_{22}d_{22} & a_{23}d_{33} \\ a_{31}d_{11} & a_{32}d_{22} & a_{33}d_{33} \end{bmatrix}$$
where each column in A has been scaled by the corresponding diagonal element $d_{ii}$.
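Both scalings in one short sketch (the matrix and scale factors are arbitrary):

```python
import numpy as np

A = np.arange(1.0, 10.0).reshape(3, 3)   # arbitrary 3 x 3 matrix
D = np.diag([2.0, 10.0, 0.5])

print(D @ A)   # pre-multiplication: each ROW i of A scaled by d_ii
print(A @ D)   # post-multiplication: each COLUMN j of A scaled by d_jj
```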

The determinant of a matrix is a single number representing a property of a square matrix (dependent upon what the matrix represents). The main use here is for finding the inverse of a matrix or solving simultaneous equations. Symbolically, the determinant is usually written det A, |A|, or ‖A‖ (to differentiate it from a magnitude). A 2 × 2 determinant is calculated as:

$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}$$
This is the difference of the cross products. The calculation of an n × n determinant is given by
$$|A| = a_{11}m_{11} - a_{12}m_{12} + a_{13}m_{13} - \cdots - (-1)^n a_{1n}m_{1n}$$
where $m_{11}$ is the determinant with the first row and column missing; $m_{12}$ is the determinant with the first row and second column missing; etc. (The determinant of a 1 × 1 matrix is just the particular element.) An example of a 3 × 3 determinant follows.

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$
$$m_{11} = \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} = a_{22}a_{33} - a_{23}a_{32}$$
$$m_{12} = \begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} = a_{21}a_{33} - a_{23}a_{31}$$
$$m_{13} = \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} = a_{21}a_{32} - a_{22}a_{31}$$
So

$$|A| = a_{11}m_{11} - a_{12}m_{12} + a_{13}m_{13}$$
$$= a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})$$

For a 4 × 4 determinant, each $m_{1i}$ would be an entire expansion as given above for the 3 × 3 determinant; one quickly needs a computer.

A singular matrix is a square matrix whose determinant is zero. A determinant is zero if:
(1) any row or column is zero;
(2) any row or column is equal to a linear combination of the other rows or columns.
For example:
$$A = \begin{bmatrix} 1 & 6 & 4 \\ 2 & 1 & 0 \\ 5 & -3 & -4 \end{bmatrix}$$
where row 1 = 3(row 2) − row 3.

$$|A| = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})$$
$$= 1[1(-4) - 0(-3)] - 6[2(-4) - 0(5)] + 4[2(-3) - 1(5)] = -4 + 48 - 44 = 0$$
The degree of clustering symmetrically about the principal diagonal is another (of many) properties reflected in a determinant: the more the values cluster about the diagonal, the higher the value of the determinant.
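Numerically, a singular matrix shows up as a determinant that is zero to within round-off; a sketch using the example above:

```python
import numpy as np

A = np.array([[1.0,  6.0,  4.0],
              [2.0,  1.0,  0.0],
              [5.0, -3.0, -4.0]])

print(np.linalg.det(A))                       # ~0 (within round-off): A is singular
print(np.allclose(A[0], 3 * A[1] - A[2]))     # True: row 1 = 3(row 2) - row 3
```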

Matrix division can be thought of as multiplying by the inverse. Consider scalar division:
$$\frac{x}{b} = x\frac{1}{b} = xb^{-1}$$
which we can write because
$$bb^{-1} = 1$$
Matrices can be effectively divided by multiplying by the inverse matrix. Nonsingular square matrices may have an inverse, symbolized as $A^{-1}$, with $AA^{-1} = I$. The calculation of a matrix inverse is usually done using elimination methods on the computer. For a simple 2 × 2 matrix, the inverse is given by:

$$A^{-1} = \frac{1}{|A|}\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}$$
An example follows.

$$A = \begin{bmatrix} 7 & 2 \\ 10 & 3 \end{bmatrix}$$
$$A^{-1} = \frac{1}{21 - 20}\begin{bmatrix} 3 & -2 \\ -10 & 7 \end{bmatrix} = \begin{bmatrix} 3 & -2 \\ -10 & 7 \end{bmatrix}$$
$$AA^{-1} = \begin{bmatrix} 7 & 2 \\ 10 & 3 \end{bmatrix}\begin{bmatrix} 3 & -2 \\ -10 & 7 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I$$
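The same inverse obtained numerically (in practice one calls a library routine rather than the 2 × 2 formula):

```python
import numpy as np

A = np.array([[ 7.0, 2.0],
              [10.0, 3.0]])
Ainv = np.linalg.inv(A)

print(Ainv)                               # [[3, -2], [-10, 7]] since |A| = 21 - 20 = 1
print(np.allclose(A @ Ainv, np.eye(2)))   # True: A A^-1 = I
```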

Solution of Simultaneous Equations

A system of n simultaneous equations in n unknowns:

$$a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + a_{14}x_4 = b_1$$
$$a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4 = b_2$$
$$a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4 = b_3$$
$$a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4 = b_4$$
can be written as Ax = b, where
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} = \text{coefficient matrix}$$

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix};\qquad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}$$
Then
$$A^{-1}A\mathbf{x} = A^{-1}\mathbf{b} \quad\text{(pre-multiplying both sides by } A^{-1}\text{)}$$
so
$$I\mathbf{x} = \mathbf{x} = A^{-1}\mathbf{b}$$
gives the solution for the values $x_1, x_2, x_3, x_4$ which solve the system. The following example solves 2 simultaneous equations. Consider 2 equations in 2 unknowns (e.g., equations of lines in the x-y plane):

$$5x_1 + 7x_2 = 19$$
$$3x_1 - 2x_2 = -1$$
In matrix form this translates to:
$$\begin{bmatrix} 5 & 7 \\ 3 & -2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 19 \\ -1 \end{bmatrix}$$
$$A\cdot\mathbf{x} = \mathbf{b}$$
To solve this system, we need the inverse of A:

$$A^{-1} = \frac{1}{-10 - 21}\begin{bmatrix} -2 & -7 \\ -3 & 5 \end{bmatrix} = \begin{bmatrix} \frac{2}{31} & \frac{7}{31} \\ \frac{3}{31} & -\frac{5}{31} \end{bmatrix}$$
Then x = A⁻¹·b, where

$$A^{-1}\mathbf{b} = \begin{bmatrix} \frac{2}{31} & \frac{7}{31} \\ \frac{3}{31} & -\frac{5}{31} \end{bmatrix}\begin{bmatrix} 19 \\ -1 \end{bmatrix} = \begin{bmatrix} \frac{38-7}{31} \\ \frac{57+5}{31} \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

So, the values $x_1 = 1$ and $x_2 = 2$ solve the above system, or
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$
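In practice one solves such a system with an elimination-based routine rather than forming A⁻¹ explicitly; a sketch:

```python
import numpy as np

A = np.array([[5.0,  7.0],
              [3.0, -2.0]])
b = np.array([19.0, -1.0])

x = np.linalg.solve(A, b)   # elimination-based; avoids forming A^-1 explicitly
print(x)                    # [1. 2.]
```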

Computational considerations

While this approach may seem burdensome, it is valuable because it is extremely general and allows simple handling and a straightforward solution of very large systems. However, direct (elimination) methods are in fact quicker for fully populated matrices:
1) The inverse-matrix approach involves n³ multiplications for the inversion and n²m more multiplications to finish the solution, where n is the number of equations per set and m is the number of sets of equations (each of the same form but with a different b matrix). The total number of multiplications is n³ + n²m.

2) Directly solving the equations by elimination involves n³/3 + n²m multiplications.

So, while the matrix form is easy to handle, one should not necessarily always use it blindly. We will consider many situations for which matrix solutions are ideal. For sparse or symmetric matrices, the above relationships may not hold. The rank of a matrix is the number of linearly independent vectors it contains (either row or column vectors):
$$A = \begin{bmatrix} 1 & 4 & 0 & 2 \\ 1 & 0 & 1 & -1 \\ -3 & -4 & -2 & 0 \end{bmatrix}$$

Since row 3 = −(row 1) − 2(row 2), or col 3 = col 1 − 1/4(col 2) and col 4 = −(col 1) + 3/4(col 2), the matrix A has rank 2 (i.e., it has only 2 linearly independent vectors, whether viewed by rows or by columns). The rank of a matrix product must be less than or equal to the smallest rank of the matrices being multiplied:

A (rank 2) · B (rank 1) = C (rank 1)
Therefore (from another angle), if a matrix has rank r, then any matrix factor of it must have rank of at least r. Since the rank cannot be greater than the smaller of m and n in an m × n matrix, this definition also limits the size (order) of factor matrices. (That is, one cannot factor a matrix of rank 2 into 2 matrices of which either has rank less than 2, so m and n of each factor must also be ≥ 2.)
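The rank of the example matrix can be verified numerically (np.linalg.matrix_rank estimates rank from the singular values):

```python
import numpy as np

A = np.array([[ 1.0,  4.0,  0.0,  2.0],
              [ 1.0,  0.0,  1.0, -1.0],
              [-3.0, -4.0, -2.0,  0.0]])

print(np.linalg.matrix_rank(A))   # 2: only two linearly independent rows (or columns)
```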

The trace of a square matrix is simply the sum of the elements along the principal diagonal. It is symbolized as tr A. This property is useful in calculating various quantities from matrices. Submatrices are smaller matrix partitions of a larger supermatrix:
$$F = \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \text{supermatrix}$$

Such partitioning will frequently be useful. Other useful matrix properties:

1. $(A^T)^T = A$
2. $(A^{-1})^{-1} = A$
3. $(A^{-1})^T = (A^T)^{-1} = A^{-T}$
4. If $D = ABC$, then $D^{-1} = C^{-1}B^{-1}A^{-1}$; recall that $D = ABC$ gives $D^T = C^TB^TA^T$.
This "reversal rule" for inverse products may be useful for eliminating or minimizing the number of matrix inverses requiring calculation. We will look at a few examples of matrix manipulations. For the data matrix A:
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$
and unit row vector $\mathbf{j} = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}$:
(1) Compute the mean of each column vector in A (each column has length n = 3):

$$\bar{\mathbf{x}}_c = \frac{1}{n}\mathbf{j}A = \frac{\mathbf{j}}{n}A = \begin{bmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix}\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix}$$

(2) Compute the mean of each row vector in A:
$$\bar{\mathbf{x}}_r = \frac{1}{n}A\mathbf{j}^T = A\frac{\mathbf{j}^T}{n} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}\begin{bmatrix} \frac{1}{3} \\ \frac{1}{3} \\ \frac{1}{3} \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \\ 8 \end{bmatrix}$$
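Both mean computations in a short NumPy sketch:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
n = A.shape[0]
j = np.ones(n)                 # the unit row vector j

print((j @ A) / n)             # column means: [4. 5. 6.]
print((A @ j) / n)             # row means:    [2. 5. 8.]
print(A.mean(axis=0), A.mean(axis=1))   # the same, using NumPy's built-in
```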

Last time we looked at the matrix equation Ax = b, and we found that the solution could be written
$$\mathbf{x} = A^{-1}\mathbf{b}$$
For a moment, let us just consider the left-hand side A·x. For any x, this product gives a new vector y. We can say that x is transformed to give y. This is a linear transformation, since there are only linear terms in the matrix multiplication; i.e., the vector y is

$$\mathbf{y} = \begin{bmatrix} a_{11}x_1 + a_{12}x_2 + a_{13}x_3 \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 \end{bmatrix}$$

Thus we call the operation T(x) = A·x a linear transformation. If we stick to three or fewer dimensions, it is possible to visualize vectors and operations on them graphically. Figure 3-4 shows an arbitrary vector x and the result y of the linear transformation y = A·x.

Fig. 3-4. The vector x is transformed into another vector y using a linear transformation.

Obviously, as we pick another x, we get another y. We might want to know if there are certain vectors that, when operated on, return a vector in the same direction, possibly longer or shorter than the original x. In other words, is there an x that satisfies

$$A\mathbf{x} = \lambda\mathbf{x} \tag{3.1}$$
We call $\lambda$ the eigenvalue and x the eigenvector. We can rewrite this as
$$A\mathbf{x} - \lambda\mathbf{x} = 0$$
$$A\mathbf{x} - \lambda I\mathbf{x} = 0$$
$$(A - \lambda I)\mathbf{x} = 0$$
or
$$B\mathbf{x} = 0 = \mathbf{n}$$
In general, the solution to this equation can be written $\mathbf{x} = B^{-1}\mathbf{n}$:

$$\mathbf{x} = B^{-1}\mathbf{n}$$
or
$$B\mathbf{x} = \mathbf{n} = 0$$
So, apart from the trivial solution x = [0 0 0]ᵀ, an answer exists only when $B^{-1}$ does not, i.e., when
$$|B| = 0$$
We know the determinant of B is

$$|B| = \begin{vmatrix} a_{11}-\lambda & a_{12} & a_{13} \\ a_{21} & a_{22}-\lambda & a_{23} \\ a_{31} & a_{32} & a_{33}-\lambda \end{vmatrix} = 0$$

Writing out what the determinant is and setting it to zero gives a polynomial in $\lambda$ of order n. For n = 3 this will in general give a cubic equation; for n = 2 a quadratic equation must be solved. The solutions $\lambda_1, \lambda_2, \ldots$ are called the eigenvalues of A, and the equation |B| = 0 is called the characteristic equation. For example, given

$$A = \begin{bmatrix} 17 & -6 \\ 45 & -16 \end{bmatrix}$$
let us find its eigenvalues. We set
$$|A - \lambda I| = \begin{vmatrix} 17-\lambda & -6 \\ 45 & -16-\lambda \end{vmatrix} = 0$$
or
$$-272 - 17\lambda + 16\lambda + \lambda^2 + 270 = \lambda^2 - \lambda - 2 = 0$$
We easily solve for $\lambda$:
$$\lambda = \frac{1 \pm \sqrt{1 - 4(-2)}}{2} = 2, -1$$

So the eigenvalues are $\lambda_1 = 2$, $\lambda_2 = -1$. We now know what $\lambda$ must be for (3.1) to be satisfied, but what about the vectors x? We still haven't found what they must be, but we can substitute the values for $\lambda$ into (3.1). Using $\lambda = 2$ first, we find
$$A\mathbf{x} = 2\mathbf{x}$$
$$(A - 2I)\mathbf{x} = 0$$
$$\begin{bmatrix} 15 & -6 \\ 45 & -18 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
or

$$15x_1 - 6x_2 = 0$$
$$45x_1 - 18x_2 = 0$$
which both give

$$x_2 = \frac{5}{2}x_1$$
So
$$\mathbf{x} = t\begin{bmatrix} 2 \\ 5 \end{bmatrix}$$
where t is any scalar. Similarly, for $\lambda = -1$, we find (A + I)·x = 0:
$$\begin{bmatrix} 18 & -6 \\ 45 & -15 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
which reduces to

$$3x_1 - x_2 = 0$$
which gives
$$\mathbf{x} = t\begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
It may happen that the characteristic equation gives solutions that are imaginary. However, if the matrix is symmetric it will always yield real eigenvalues, and as long as the matrix A is not singular, all the $\lambda$ will be non-zero and the corresponding eigenvectors will be orthogonal. The technique we have used applies to matrices of any size n × n, but finding the roots of large polynomials is painful. Usually, the $\lambda$ are found by matrix manipulations that involve successive approximations to the x. This is of course only practical on a computer. If we restrict our attention to 2-D geometry, certain properties of eigenvalues and eigenvectors become clearer. Consider the matrix A:

$$A = \begin{bmatrix} 4 & 8 \\ 8 & 4 \end{bmatrix}$$

Fig. 3-5. Graphical representation of two vectors, (4, 8) and (8, 4), in the x-y plane.

We can regard the matrix as two row vectors [4 8] and [8 4]. Let us find the eigenvalues and eigenvectors of A:

$$\begin{vmatrix} 4-\lambda & 8 \\ 8 & 4-\lambda \end{vmatrix} = 0$$
$$16 - 8\lambda + \lambda^2 - 64 = 0$$
$$\lambda^2 - 8\lambda - 48 = 0$$
$$\lambda = \frac{8 \pm \sqrt{64 + 4\cdot 48}}{2} = 12, -4$$
The eigenvectors are:
$$\begin{bmatrix} 4-12 & 8 \\ 8 & 4-12 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -8 & 8 \\ 8 & -8 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$-8x_1 + 8x_2 = 0 \;\Rightarrow\; x_1 = x_2 \qquad \mathbf{e}_1^T = \begin{bmatrix} 1 & 1 \end{bmatrix}$$

$$\begin{bmatrix} 4+4 & 8 \\ 8 & 4+4 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
$$8x_1 + 8x_2 = 0 \;\Rightarrow\; x_1 = -x_2 \qquad \mathbf{e}_2^T = \begin{bmatrix} -1 & 1 \end{bmatrix}$$
We find that the eigenvectors define the minor and major axes of the ellipse which goes through the two points (4, 8) and (8, 4). The lengths of these axes are given by the absolute values of the eigenvalues, $|\lambda_1| = 12$ and $|\lambda_2| = 4$.

Fig. 3-6. The eigenvectors, scaled by the eigenvalues ($\lambda_1 = 12$, $\lambda_2 = 4$), can be seen to represent the major and minor axes of the ellipse that goes through the two data vectors (8, 4) and (4, 8).
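A library eigensolver reproduces this result; note that np.linalg.eig returns eigenvectors already normalized to unit length (anticipating the next paragraph), and that their order and signs may differ from the hand calculation:

```python
import numpy as np

A = np.array([[4.0, 8.0],
              [8.0, 4.0]])
lam, V = np.linalg.eig(A)

print(lam)   # eigenvalues 12 and -4 (order may vary)
print(V)     # columns are unit-length eigenvectors, ~[1, 1]/sqrt(2) and [-1, 1]/sqrt(2) (signs may flip)
```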

It is customary to normalize the eigenvectors so that their length is unity. In our case we find

$$\mathbf{e}_1^T = \begin{bmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix} \quad\text{and}\quad \mathbf{e}_2^T = \begin{bmatrix} -\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix}$$
The axes of the ellipse are then simply

$$\mathbf{v}_1 = \lambda_1\mathbf{e}_1$$
$$\mathbf{v}_2 = \lambda_2\mathbf{e}_2$$
Since the sign of an eigenvector is indeterminate, we choose to make all eigenvalues positive and thus place the minus sign in $\lambda_2$ inside $\mathbf{e}_2$. You will notice that $\mathbf{v}_1\cdot\mathbf{v}_2 = 0$, i.e., they are orthogonal. The eigenvectors make up the columns in a new matrix V:

$$V = \begin{bmatrix} \mathbf{e}_1 & \mathbf{e}_2 \end{bmatrix} = \frac{\sqrt{2}}{2}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}$$

Let us expand the eigenvalue equation (3.1), A·x = $\lambda$x, to a full matrix equation. We have
$$A\mathbf{e}_1 = \lambda_1\mathbf{e}_1$$
$$A\mathbf{e}_2 = \lambda_2\mathbf{e}_2$$
We can combine these two matrix equations into one:
$$A\cdot V = V\cdot\Lambda \quad\text{where}\quad \Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}$$

From this we may learn two things.
1) Post-multiplying by V⁻¹:
$$A\cdot V\cdot V^{-1} = A = V\cdot\Lambda\cdot V^{-1}$$
The eigenvalue-eigenvector operation lets us split a symmetric matrix A into a diagonal matrix $\Lambda$ (with the eigenvalues along the diagonal) and the matrix V (with the eigenvectors as columns) and its inverse V⁻¹.

2) Pre-multiplying by V⁻¹:
$$\Lambda = V^{-1}\cdot A\cdot V$$
This operation transforms the A matrix into a diagonal matrix $\Lambda$. It corresponds to a rotation of the coordinate axes in which the eigenvectors in V become the new coordinate axes. Relative to the new coordinates, $\Lambda$ conveys the same information as A does in the old coordinates. Because $\Lambda$ is a simple diagonal matrix, the rotation (transformation) makes the relationships between rows and columns in A much clearer.
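Both factorizations can be checked numerically; a sketch using the 2 × 2 example above:

```python
import numpy as np

A = np.array([[4.0, 8.0],
              [8.0, 4.0]])
lam, V = np.linalg.eig(A)
Lam = np.diag(lam)

print(np.allclose(V @ Lam @ np.linalg.inv(V), A))   # 1) A = V Lambda V^-1
print(np.round(np.linalg.inv(V) @ A @ V, 10))       # 2) V^-1 A V is diagonal: diag(12, -4)
```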

Simple Regression and Curve Fitting

Whereas an interpolant fits each data point exactly, it is frequently advantageous to produce a smoothed fit to the data: not exactly fitting each point, but producing a "best" fit. A popular (and convenient) method for producing such fits is known as the method of least squares. The method of least squares fits a specified (usually continuous) basis to a set of data points by minimizing the sum of the squared mismatch (error) between the fitted curve and the data. The error can be measured as in Figure 3-7. This regression of y on x is the most common method. Less common methods (more work involved) are regression of x on y and orthogonal regression (to which we will return later).

Fig. 3-7. Graphical representation of the regression errors used in least-squares procedures.

Fig. 3-8. Two other regression methods: regressing x on y, and orthogonal regression.

Least squares simple example

Consider fitting a single "best" linear slope to n data points. This can be a scatter plot of y(t), x(t) plotted at similar values of t; or a simple f(x) relationship. At any rate, y is considered a function of x. We wish to fit a line of the form

$$y = a_1 + a_2(x - x_0) \tag{3.2}$$
and must therefore determine values for $a_1$ and $a_2$ which produce a line that minimizes the sum of the squared errors ($x_0$ is specified beforehand). So
$$\text{minimize} \sum_{i=1}^{n}\left(y_{\text{computed}} - y_{\text{observed}}\right)^2$$

Ideally, for each observation $y_i$ at $x_i$ we should have

$$a_1 + a_2(x_1 - x_0) = y_1$$
$$a_1 + a_2(x_2 - x_0) = y_2$$
$$a_1 + a_2(x_3 - x_0) = y_3$$
$$\vdots$$
$$a_1 + a_2(x_n - x_0) = y_n$$

There are many more equations (n, one for each observed value of y) than unknowns (2: $a_1$ and $a_2$). Such a system is over-determined and there exists no unique solution (unless all the $y_i$'s do lie exactly on a single line, in which case any two equations will uniquely determine $a_1$ and $a_2$). In matrix form (i.e., Ax = b):
$$\begin{bmatrix} 1 & (x_1 - x_0) \\ 1 & (x_2 - x_0) \\ \vdots & \vdots \\ 1 & (x_n - x_0) \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{3.3}$$

Since A is a non-square matrix it has no inverse, so the equation cannot be inverted and solved. Consider instead the error in the fit at each point:

$$a_1 + a_2(x_1 - x_0) - y_1 = e_1$$
$$a_1 + a_2(x_2 - x_0) - y_2 = e_2$$
$$\vdots$$
$$a_1 + a_2(x_n - x_0) - y_n = e_n$$

We wish to determine the values for $a_1$ and $a_2$ that minimize
$$\sum_{i=1}^{n}e_i^2$$
This will minimize the variance of the residuals about the regression line and give the least-squares fit. Notation:
$$E(a_1, a_2) = \sum_{i=1}^{n}e_i^2 \;\left(= \mathbf{e}^T\mathbf{e} \text{ in matrix form}\right) \tag{3.4}$$

The minimum of $E(a_1, a_2)$ with respect to the two unknown coefficients can be determined using simple differential calculus, where
$$\frac{\partial E(a_1,a_2)}{\partial a_1} = \frac{\partial E(a_1,a_2)}{\partial a_2} = 0 \quad\text{(at the minimum)}$$

$$\frac{\partial E}{\partial a_1} = \frac{\partial}{\partial a_1}\sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}\frac{\partial}{\partial a_1}\left[a_1 + a_2(x_i - x_0) - y_i\right]^2 = 2\sum_{i=1}^{n}\left[a_1 + a_2(x_i - x_0) - y_i\right] = 0$$
$$\frac{\partial E}{\partial a_2} = \frac{\partial}{\partial a_2}\sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}\frac{\partial}{\partial a_2}\left[a_1 + a_2(x_i - x_0) - y_i\right]^2 = 2\sum_{i=1}^{n}\left[a_1 + a_2(x_i - x_0) - y_i\right](x_i - x_0) = 0$$

These two equations can now be expanded into their individual terms, forming what are known as the normal equations:
$$\sum_{i=1}^{n}a_1 + a_2\sum_{i=1}^{n}(x_i - x_0) = \sum_{i=1}^{n}y_i$$
$$a_1\sum_{i=1}^{n}(x_i - x_0) + a_2\sum_{i=1}^{n}(x_i - x_0)^2 = \sum_{i=1}^{n}y_i(x_i - x_0)$$
The normal equations thus provide a system of 2 equations in 2 unknowns which can be solved uniquely. Rearranging,
$$na_1 + a_2\sum_{i=1}^{n}(x_i - x_0) = \sum_{i=1}^{n}y_i$$
$$a_1\sum_{i=1}^{n}(x_i - x_0) + a_2\sum_{i=1}^{n}(x_i - x_0)^2 = \sum_{i=1}^{n}y_i(x_i - x_0)$$
Notice that all sums are sums of known values that reduce to simple constants. Solving:
$$a_1 = \frac{1}{n}\sum_{i=1}^{n}y_i - \frac{a_2}{n}\sum_{i=1}^{n}(x_i - x_0)$$
substitute this into the second equation:

$$\left[\frac{1}{n}\sum_{i=1}^{n}y_i - \frac{a_2}{n}\sum_{i=1}^{n}(x_i - x_0)\right]\sum_{i=1}^{n}(x_i - x_0) + a_2\sum_{i=1}^{n}(x_i - x_0)^2 = \sum_{i=1}^{n}y_i(x_i - x_0)$$

Now solve for $a_2$:
$$\frac{1}{n}\sum_{i=1}^{n}y_i\sum_{i=1}^{n}(x_i - x_0) - \frac{a_2}{n}\left[\sum_{i=1}^{n}(x_i - x_0)\right]^2 + a_2\sum_{i=1}^{n}(x_i - x_0)^2 = \sum_{i=1}^{n}y_i(x_i - x_0)$$
$$a_2\left[\sum_{i=1}^{n}(x_i - x_0)^2 - \frac{1}{n}\left(\sum_{i=1}^{n}(x_i - x_0)\right)^2\right] = \sum_{i=1}^{n}y_i(x_i - x_0) - \frac{1}{n}\sum_{i=1}^{n}y_i\sum_{i=1}^{n}(x_i - x_0)$$

Finally,
$$a_2 = \frac{\displaystyle\sum_{i=1}^{n}y_i(x_i - x_0) - \frac{1}{n}\sum_{i=1}^{n}y_i\sum_{i=1}^{n}(x_i - x_0)}{\displaystyle\sum_{i=1}^{n}(x_i - x_0)^2 - \frac{1}{n}\left(\sum_{i=1}^{n}(x_i - x_0)\right)^2}$$

Substituting $a_2$ into the first equation then gives $a_1$. So, we compute the sums
$$\sum_{i=1}^{n}y_i,\quad \sum_{i=1}^{n}y_i(x_i - x_0),\quad \sum_{i=1}^{n}(x_i - x_0),\quad\text{and}\quad \sum_{i=1}^{n}(x_i - x_0)^2$$
and substitute into the above equations to give the $a_1$ and $a_2$ producing the best fit. In matrix form the normal equations are:

$$\begin{bmatrix} n & \sum_{i=1}^{n}(x_i - x_0) \\ \sum_{i=1}^{n}(x_i - x_0) & \sum_{i=1}^{n}(x_i - x_0)^2 \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n}y_i \\ \sum_{i=1}^{n}y_i(x_i - x_0) \end{bmatrix} \tag{3.5}$$

So, N·x = B, i.e., of the form Ax = b. Since N is square and of full rank, this equation is solved in the standard manner:
$$N^{-1}N\mathbf{x} = N^{-1}B \quad\text{or}\quad \mathbf{x} = N^{-1}B$$
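A short sketch of this recipe, with made-up data scattered roughly about y = 1 + 2x (here $x_0$ = 0):

```python
import numpy as np

# Made-up points scattered about y = 1 + 2(x - x0), with x0 = 0
x0 = 0.0
x  = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y  = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

dx = x - x0
N = np.array([[x.size,        dx.sum()],
              [dx.sum(), (dx**2).sum()]])
B = np.array([y.sum(), (y * dx).sum()])

a1, a2 = np.linalg.solve(N, B)   # the 2 x 2 normal equations (3.5)
print(a1, a2)                    # intercept ~1 and slope ~2
```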

This problem was simple enough (2 × 2) to solve by brute force for $a_1$ and $a_2$. For larger systems this becomes impractical, and a matrix solution to the non-square A·x = b equation must be sought. We will next look at the general linear least-squares problem and find a solution in matrix notation.

General Linear Least Squares

We have looked at a few special cases where we have sought to fit a "model" to "data" in a least-squares sense. Fitting a straight line to the x-y points was a very simple example of this technique. We will now look at the more general problem of finding the coefficients for any linear combination of basis functions that fits some data in a least squares sense. There are numerous situations where this is needed:

Situation            Model                                       Data
Curve fitting        Coefficients of polynomials, sin/cos, etc.  Points in x-y plane
Gravity modeling     Densities of subsurface polygons            Gravity observations
Hypocenter location  Small perturbations to hypocenter position  Seismic arrival times

While the basis functions in these cases are all vastly different, they are all used in a linear combination to fit the observed data. We will therefore take time to investigate how such a problem is set up, and how it can all be simplified with matrix algebra.

General (linear) least squares.

Consider the least squares fitting of any continuous basis of the form

$$x_1, x_2, x_3, \ldots, x_m$$
For example:
polynomial basis: $x_1 = x^0$, $x_2 = x^1$, $x_3 = x^2$, ..., $x_m = x^{m-1}$
Fourier sine basis: $x_1 = \sin(2\pi x/T)$, $x_2 = \sin(4\pi x/T)$, $x_3 = \sin(6\pi x/T)$, ..., $x_m = \sin(2m\pi x/T)$

We desire to fit an equation of the form

$$y = a_1x_1 + a_2x_2 + \cdots + a_mx_m$$
to a data set of n data points, where n > m, by minimizing E:
$$E = \sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}\left(a_1x_{i1} + a_2x_{i2} + \cdots + a_mx_{im} - y_i\right)^2 \tag{3.6}$$
where $y_i$ is the observed value. We can write this explicitly:

$$a_1x_{11} + a_2x_{12} + \cdots + a_mx_{1m} - y_1 = e_1$$
$$a_1x_{21} + a_2x_{22} + \cdots + a_mx_{2m} - y_2 = e_2$$
$$\vdots$$
$$a_1x_{n1} + a_2x_{n2} + \cdots + a_mx_{nm} - y_n = e_n$$
where $x_{ij}$ is the j'th basis function, evaluated at the value $x_i$. To minimize E, we set

$$\frac{\partial E(a_j)}{\partial a_j} = 0 \tag{3.7}$$
So
$$\frac{\partial E(a_1)}{\partial a_1} = \frac{\partial}{\partial a_1}\sum_{i=1}^{n}\left(a_1x_{i1} + a_2x_{i2} + \cdots + a_mx_{im} - y_i\right)^2 = 2\sum_{i=1}^{n}\left(a_1x_{i1} + a_2x_{i2} + \cdots + a_mx_{im} - y_i\right)x_{i1} = 0$$
$$\frac{\partial E(a_2)}{\partial a_2} = 2\sum_{i=1}^{n}\left(a_1x_{i1} + a_2x_{i2} + \cdots + a_mx_{im} - y_i\right)x_{i2} = 0$$
$$\vdots$$
$$\frac{\partial E(a_m)}{\partial a_m} = 2\sum_{i=1}^{n}\left(a_1x_{i1} + a_2x_{i2} + \cdots + a_mx_{im} - y_i\right)x_{im} = 0$$
or, for each j:
$$\frac{\partial E(a_j)}{\partial a_j} = 2\sum_{i=1}^{n}\left(a_1x_{i1} + a_2x_{i2} + \cdots + a_mx_{im} - y_i\right)x_{ij} = 0$$
Rearranging these normal equations gives

$$a_1\sum_{i=1}^{n}x_{i1}^2 + a_2\sum_{i=1}^{n}x_{i2}x_{i1} + \cdots + a_m\sum_{i=1}^{n}x_{im}x_{i1} = \sum_{i=1}^{n}y_ix_{i1}$$
$$a_1\sum_{i=1}^{n}x_{i1}x_{i2} + a_2\sum_{i=1}^{n}x_{i2}^2 + \cdots + a_m\sum_{i=1}^{n}x_{im}x_{i2} = \sum_{i=1}^{n}y_ix_{i2}$$
$$\vdots$$
$$a_1\sum_{i=1}^{n}x_{i1}x_{im} + a_2\sum_{i=1}^{n}x_{i2}x_{im} + \cdots + a_m\sum_{i=1}^{n}x_{im}^2 = \sum_{i=1}^{n}y_ix_{im}$$
or (for each j):
$$a_1\sum_{i=1}^{n}x_{i1}x_{ij} + a_2\sum_{i=1}^{n}x_{i2}x_{ij} + \cdots + a_m\sum_{i=1}^{n}x_{im}x_{ij} = \sum_{i=1}^{n}y_ix_{ij}$$

¶E (a )j = 0 for j =1, 2, ..., m. ¶a j In matrix form

$$\begin{bmatrix} \sum x_{i1}^2 & \sum x_{i2}x_{i1} & \cdots & \sum x_{im}x_{i1} \\ \sum x_{i1}x_{i2} & \sum x_{i2}^2 & \cdots & \sum x_{im}x_{i2} \\ \vdots & & & \vdots \\ \sum x_{i1}x_{im} & \sum x_{i2}x_{im} & \cdots & \sum x_{im}^2 \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix} = \begin{bmatrix} \sum y_ix_{i1} \\ \sum y_ix_{i2} \\ \vdots \\ \sum y_ix_{im} \end{bmatrix}$$
or simply
$$N\cdot\mathbf{x} = B \tag{3.8}$$
where N is the (known) coefficient matrix, x the vector with the unknowns ($a_j$), and B contains weighted sums of known (observed) quantities. Solve for the $a_j$ in the x vector (N is square and of full rank):
$$N^{-1}\cdot N\cdot\mathbf{x} = N^{-1}\cdot B$$
$$\mathbf{x} = N^{-1}\cdot B \tag{3.9}$$
where x is the solution for the $a_j$. The resulting $a_j$ values are the ones which satisfy (3.7) and therefore the same ones which, in combination with the chosen basis, produce the "best" fit to the n data points such that (3.6) is minimized.
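As a concrete sketch, here is the normal-equation solution for a quadratic polynomial basis fitted to made-up data; note that the matrix N and vector B below contain exactly the sums above:

```python
import numpy as np

# Fit y = a1 + a2 x + a3 x^2 (polynomial basis x^0, x^1, x^2) to made-up points
x = np.linspace(0.0, 4.0, 9)
y = 2.0 - 1.0 * x + 0.5 * x**2           # synthetic "observations" (noise-free here)

X = np.column_stack([x**0, x**1, x**2])  # x_ij: the j'th basis function at x_i
N = X.T @ X                              # the normal-equation matrix of sums
B = X.T @ y                              # the right-hand-side vector of sums

a = np.linalg.solve(N, B)
print(a)   # recovers [2.0, -1.0, 0.5]
```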

Last time we found that the solution to the general linear least squares problem led us to the matrix form N·x = B (3.8), with the solution x = N⁻¹·B (3.9). We will now look at a simpler approach to the same problem using matrix algebra. We have

$$\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$
$$A\cdot\mathbf{x} - \mathbf{b} = \mathbf{e}$$
Since m < n, the A matrix is rectangular. This can be written
$$\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} & -y_1 \\ x_{21} & x_{22} & \cdots & x_{2m} & -y_2 \\ \vdots & & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} & -y_n \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \\ 1 \end{bmatrix} = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$
$$C\cdot X = \mathbf{e}$$

These matrices describe the system listed earlier. We wish to find the $a_j$ values which minimize $E = \mathbf{e}^T\mathbf{e}$. The matrix form can be partitioned as
$$\begin{bmatrix} A & -\mathbf{b} \end{bmatrix}\begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} = \mathbf{e}$$
$$\begin{bmatrix} A & -\mathbf{b} \end{bmatrix}\begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} = A\cdot\mathbf{x} + (-\mathbf{b}\cdot 1) = A\cdot\mathbf{x} - \mathbf{b} = \mathbf{e}$$
We will, for now, refer to the [A −b] matrix as the C matrix and the [x 1]ᵀ column vector as the X vector. So, we have
$$C\cdot X = \mathbf{e}$$

(recall C is an n × (m+1) rectangular matrix). The i'th error, $e_i$, is the dot product
$$e_i = C_i\cdot X = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{im} & -y_i \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \\ 1 \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & \cdots & a_m & 1 \end{bmatrix}\begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{im} \\ -y_i \end{bmatrix}$$
where $C_i$ is the i'th row vector in C. The squared i'th error, $e_i^2$, is then
$$e_i\cdot e_i = \left(C_i\cdot X\right)^T\left(C_i\cdot X\right) = X^T\cdot C_i^T\cdot C_i\cdot X$$
or

$$e_i\cdot e_i = \begin{bmatrix} a_1 & a_2 & \cdots & a_m & 1 \end{bmatrix}\begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{im} \\ -y_i \end{bmatrix}\begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{im} & -y_i \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \\ 1 \end{bmatrix}$$

where we have used the reversal rule for transposed products. The sum of the $e_i^2$ over all i is thus
$$\mathbf{e}^T\cdot\mathbf{e} = X^T\cdot C^T\cdot C\cdot X = X^T\left[\sum_{i=1}^{n}C_i^T\,C_i\right]X$$
The product $C^TC$ can be computed to form a new matrix R. Since $C^T_{(m+1,\,n)}\,C_{(n,\,m+1)} = R_{(m+1,\,m+1)}$, the resulting R matrix is square and symmetric. So,
$$E = \mathbf{e}^T\cdot\mathbf{e} = X^T\cdot C^T\cdot C\cdot X = X^T\cdot R\cdot X$$
To minimize E, we find

$$\frac{\partial E(a_j)}{\partial a_j} = \frac{\partial}{\partial a_j}\left(X^T\cdot R\cdot X\right) = \frac{\partial X^T}{\partial a_j}\,R\cdot X + X^T\cdot R\,\frac{\partial X}{\partial a_j}$$
For the 2nd coefficient (as an example), we get

$$\frac{\partial X^T}{\partial a_2} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 & 0 \end{bmatrix} = \left(\frac{\partial X}{\partial a_2}\right)^T$$
Thus, the partial derivative of the error is

$$\frac{\partial E}{\partial a_2} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 & 0 \end{bmatrix}R\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \\ 1 \end{bmatrix} + \begin{bmatrix} a_1 & a_2 & \cdots & a_m & 1 \end{bmatrix}R\begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

For all coefficients we set all the partial derivatives to zero:
$$\frac{\partial E(a_j)}{\partial a_j} = 0 = \tilde{X}^T\cdot R\cdot X + X^T\cdot R\cdot\tilde{X}$$
where
$$\tilde{X}^T = \frac{\partial X^T}{\partial a_j} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} = \begin{bmatrix} I_m & O_m \end{bmatrix}$$
where $I_m$ is the m × m identity matrix and $O_m$ the null (column) vector of length m. Consider first R (= $C^TC$):

$$C = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} & -y_1 \\ x_{21} & x_{22} & \cdots & x_{2m} & -y_2 \\ \vdots & & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} & -y_n \end{bmatrix}$$
$$R = C^TC = \begin{bmatrix} \sum x_{i1}^2 & \sum x_{i2}x_{i1} & \cdots & \sum x_{im}x_{i1} & -\sum y_ix_{i1} \\ \sum x_{i1}x_{i2} & \sum x_{i2}^2 & \cdots & \sum x_{im}x_{i2} & -\sum y_ix_{i2} \\ \vdots & & & & \vdots \\ \sum x_{i1}x_{im} & \sum x_{i2}x_{im} & \cdots & \sum x_{im}^2 & -\sum y_ix_{im} \\ -\sum y_ix_{i1} & -\sum y_ix_{i2} & \cdots & -\sum y_ix_{im} & \sum y_i^2 \end{bmatrix}$$

The R matrix should look familiar. Consider the partitioned matrix multiplication:
$$R = C^TC = \begin{bmatrix} A^T \\ -\mathbf{b}^T \end{bmatrix}\begin{bmatrix} A & -\mathbf{b} \end{bmatrix} = \begin{bmatrix} A^TA & -A^T\mathbf{b} \\ -\mathbf{b}^TA & \mathbf{b}^T\mathbf{b} \end{bmatrix}$$
where $A^TA$ is m × m and R is (m+1) × (m+1).

Notice that $A^TA$ is the matrix N of the normal equations and $-A^T\mathbf{b} = (-\mathbf{b}^TA)^T$ is the matrix $-B$. Because R is symmetric (so R = $R^T$) we have
$$\left(\tilde{X}^T\cdot R\cdot X\right)^T = X^T\cdot R\cdot\tilde{X}$$
So
$$\frac{\partial E(a_j)}{\partial a_j} = \tilde{X}^TRX + X^TR\tilde{X} = 2\tilde{X}^TRX = 0 \quad\Rightarrow\quad \tilde{X}^TRX = 0$$

Consider $\tilde{X}^T\cdot R$:
$$\tilde{X}^TR = \begin{bmatrix} I_m & O_m \end{bmatrix}\begin{bmatrix} A^TA & -A^T\mathbf{b} \\ -\mathbf{b}^TA & \mathbf{b}^T\mathbf{b} \end{bmatrix} = \begin{bmatrix} A^TA & -A^T\mathbf{b} \end{bmatrix} = \begin{bmatrix} N & -B \end{bmatrix}$$
(the last row of R is always multiplied by zero).

N B And X T ×R ×X :

x m+1 1

m+1

T AT Ax - A b

T T m A A -A b Nx - B

Finally, since
$$\tilde{X}^TRX = A^TA\,\mathbf{x} - A^T\mathbf{b} = 0$$
$$A^TA\,\mathbf{x} = A^T\mathbf{b}$$
$$\left(A^TA\right)^{-1}\left(A^TA\right)\mathbf{x} = \left(A^TA\right)^{-1}A^T\mathbf{b}$$
$$\mathbf{x} = \left(A^TA\right)^{-1}A^T\mathbf{b} \tag{3.10}$$
or
$$\mathbf{x} = N^{-1}B$$
as before. Therefore, the unknown values of the $a_j$ (in the x vector) can be solved for directly from the system:

$$a_1x_{11} + a_2x_{12} + \cdots + a_mx_{1m} = y_1$$
$$a_1x_{21} + a_2x_{22} + \cdots + a_mx_{2m} = y_2$$
$$\vdots$$
$$a_1x_{n1} + a_2x_{n2} + \cdots + a_mx_{nm} = y_n$$
or simply A·x = b, where A is of order n × m with n > m. The least squares solution then becomes
$$\mathbf{x} = \left(A^T\cdot A\right)^{-1}A^T\cdot\mathbf{b}$$
where $A^TA$ is a square matrix of full rank, with order r = m, and thus invertible.
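A sketch comparing equation (3.10) with a library least-squares solver (same made-up data as before; in production code np.linalg.lstsq or a QR-based solver is preferred over forming AᵀA explicitly, for numerical stability):

```python
import numpy as np

x = np.linspace(0.0, 4.0, 9)
y = 2.0 - 1.0 * x + 0.5 * x**2
A = np.column_stack([x**0, x, x**2])
b = y

x_ne    = np.linalg.inv(A.T @ A) @ (A.T @ b)     # equation (3.10)
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]   # library least-squares solver

print(np.allclose(x_ne, x_lstsq))   # True: both recover [2.0, -1.0, 0.5]
```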

Fitting a straight line, revisited

We will again consider the best-fitting line problem, this time with errors $\sigma_i$ in the y-values. We want to measure how well the model agrees with the data, and for this purpose will use the $\chi^2$ function, i.e.
$$\chi^2(a,b) = \sum_{i=1}^{n}\left(\frac{y_i - a - bx_i}{\sigma_i}\right)^2 \tag{3.11}$$

Minimizing $\chi^2$ will give the best weighted least squares solution. Again we set the partial derivatives to 0:
$$\frac{\partial\chi^2}{\partial a} = 0 = -2\sum_{i=1}^{n}\frac{y_i - a - bx_i}{\sigma_i^2}$$
$$\frac{\partial\chi^2}{\partial b} = 0 = -2\sum_{i=1}^{n}\frac{\left(y_i - a - bx_i\right)x_i}{\sigma_i^2} \tag{3.12}$$

Let us define the following:

2 1 x i y i x i x iy i S = 2 S x = 2 S y = 2 S xx = 2 S xy = 2 i i i i i

Then (3.12) reduces to
$$aS + bS_x = S_y$$
$$aS_x + bS_{xx} = S_{xy}$$
With $\Delta = SS_{xx} - S_x^2$ we find

$$a = \frac{S_{xx}S_y - S_xS_{xy}}{\Delta} \qquad b = \frac{SS_{xy} - S_xS_y}{\Delta} \tag{3.13}$$

All this is swell, but we must also estimate the uncertainties in a and b. For the same $\sigma_i$ we may get large differences in the errors in a and b. Although not shown here, consideration of propagation of errors shows that the variance $\sigma_f^2$ in the value of any function f of the $y_i$ is
$$\sigma_f^2 = \sum_i\sigma_i^2\left(\frac{\partial f}{\partial y_i}\right)^2 \tag{3.14}$$

For our line we can directly find the derivatives of a and b with respect to $y_i$ from (3.13):
Figure 3-9. The uncertainty in the line fit depends to a large extent on the distribution of the x-positions.

$$\frac{\partial a}{\partial y_i} = \frac{S_{xx} - S_xx_i}{\sigma_i^2\,\Delta}$$
$$\frac{\partial b}{\partial y_i} = \frac{Sx_i - S_x}{\sigma_i^2\,\Delta}$$

Inserting this result into (3.14) then gives
$$\sigma_a^2 = \sum_i\sigma_i^2\left(\frac{S_{xx} - S_xx_i}{\sigma_i^2\Delta}\right)^2 = \frac{1}{\Delta^2}\sum_i\frac{S_{xx}^2 - 2S_{xx}S_xx_i + S_x^2x_i^2}{\sigma_i^2}$$
$$= \frac{1}{\Delta^2}\left(S_{xx}^2S - 2S_{xx}S_xS_x + S_x^2S_{xx}\right) = \frac{S_{xx}\left(SS_{xx} - S_x^2\right)}{\Delta^2} = \frac{S_{xx}}{\Delta} \tag{3.15}$$
and

$$\sigma_b^2 = \sum_i\sigma_i^2\left(\frac{Sx_i - S_x}{\sigma_i^2\Delta}\right)^2 = \frac{1}{\Delta^2}\sum_i\frac{S^2x_i^2 - 2SS_xx_i + S_x^2}{\sigma_i^2}$$
$$= \frac{1}{\Delta^2}\left(S^2S_{xx} - 2SS_xS_x + S_x^2S\right) = \frac{S\left(SS_{xx} - S_x^2\right)}{\Delta^2} = \frac{S}{\Delta} \tag{3.16}$$

Similarly, we can find the covariance $\sigma_{ab}$ from
$$\sigma_{ab} = \sum_i\sigma_i^2\,\frac{\partial a}{\partial y_i}\frac{\partial b}{\partial y_i} = -\frac{S_x}{\Delta}$$

Then, the correlation coefficient becomes
$$r = \frac{-S_x}{\sqrt{SS_{xx}}} \tag{3.17}$$

It is therefore useful to shift the origin to $\bar{x}$, where r = 0. Finally, we must check if the fit is meaningful. We take the computed $\chi^2$ value and obtain the critical $\chi_c^2(\alpha)$ for n − 2 degrees of freedom, which we use to see if $\chi^2$ exceeds this value. If it does not, we may say the fit is significant at the $\alpha$ level.
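The full weighted line fit, (3.13) through (3.16), in one sketch (the data and $\sigma_i$ values are made up):

```python
import numpy as np

# Weighted fit of y = a + b x with per-point errors sigma_i (made-up data)
x   = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y   = np.array([0.9, 3.1, 5.0, 7.2, 8.9])
sig = np.array([0.1, 0.2, 0.1, 0.3, 0.2])

w = 1.0 / sig**2
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy  = (w * x * x).sum(), (w * x * y).sum()
Delta = S * Sxx - Sx**2

a = (Sxx * Sy - Sx * Sxy) / Delta   # equation (3.13)
b = (S * Sxy - Sx * Sy) / Delta
sigma_a = np.sqrt(Sxx / Delta)      # equation (3.15)
sigma_b = np.sqrt(S / Delta)        # equation (3.16)
print(a, b, sigma_a, sigma_b)
```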

What if some data constraints are more reliable than others? We may give those residuals more weight than the others, e.g., doubling the weight on the second residual:
$$\mathbf{e}' = \begin{bmatrix} e_1 \\ 2e_2 \\ \vdots \\ e_n \end{bmatrix}$$

In general, we can use a weight $w_i$ for each error so that the new error is $e_i' = e_i\,w_i$. We do this by introducing a weight matrix w, which is a diagonal matrix:
$$\mathbf{w} = \begin{bmatrix} w_1 & & & \\ & w_2 & & \\ & & \ddots & \\ & & & w_n \end{bmatrix}$$

Now the sum of the squared errors, E, becomes
$$E = \mathbf{e}'^T\cdot\mathbf{e}' = \mathbf{e}^T\cdot\mathbf{w}^T\cdot\mathbf{w}\cdot\mathbf{e} = \mathbf{e}^T\cdot W\cdot\mathbf{e}$$
where we have introduced $W = \mathbf{w}^T\mathbf{w}$. Since $\mathbf{w}\cdot\mathbf{e} = \mathbf{w}(A\cdot\mathbf{x} - \mathbf{b})$, we obtain
$$E = \left(\mathbf{w}A\mathbf{x} - \mathbf{w}\mathbf{b}\right)^T\left(\mathbf{w}A\mathbf{x} - \mathbf{w}\mathbf{b}\right) = \left(\mathbf{x}^TA^T\mathbf{w}^T - \mathbf{b}^T\mathbf{w}^T\right)\left(\mathbf{w}A\mathbf{x} - \mathbf{w}\mathbf{b}\right)$$
$$= \mathbf{x}^TA^T\mathbf{w}^T\mathbf{w}A\mathbf{x} - \mathbf{x}^TA^T\mathbf{w}^T\mathbf{w}\mathbf{b} - \mathbf{b}^T\mathbf{w}^T\mathbf{w}A\mathbf{x} + \mathbf{b}^T\mathbf{w}^T\mathbf{w}\mathbf{b}$$
Substituting W, we then find

$$\frac{\partial E}{\partial a_j} = 0 = \tilde{\mathbf{x}}^TA^TWA\mathbf{x} + \mathbf{x}^TA^TWA\tilde{\mathbf{x}} - \tilde{\mathbf{x}}^TA^TW\mathbf{b} - \mathbf{b}^TWA\tilde{\mathbf{x}}$$

Since x contains only the $a_j$, we have $\tilde{\mathbf{x}} = \partial\mathbf{x}/\partial a_j = I$ (taken over all j). We find

$$A^TWA\mathbf{x} + \mathbf{x}^TA^TWA - A^TW\mathbf{b} - \mathbf{b}^TWA = 0$$
Again, the 2nd and 4th terms are the transposes of the 1st and 3rd. Because each term represents a symmetric matrix, our equation reduces to
$$2A^TWA\mathbf{x} - 2A^TW\mathbf{b} = 0$$
which gives us the weighted linear least squares solution
$$\mathbf{x} = \left(A^TWA\right)^{-1}A^TW\mathbf{b} \tag{3.18}$$
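Equation (3.18) in a short sketch; using the same made-up data as above, this matrix form reproduces the a and b from the S, Sx, ... formulation:

```python
import numpy as np

# Weighted linear least squares, x = (A^T W A)^-1 A^T W b
x   = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y   = np.array([0.9, 3.1, 5.0, 7.2, 8.9])
sig = np.array([0.1, 0.2, 0.1, 0.3, 0.2])

A = np.column_stack([np.ones_like(x), x])   # basis functions: 1 and x
W = np.diag(1.0 / sig**2)                   # W = w^T w with w = diag(1/sigma_i)

coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # equation (3.18)
print(coef)   # the same a and b as the S, Sx, ... formulation
```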