MASTER OF SCIENCE IN ANALYTICS 2014 EMPLOYMENT REPORT

Results at graduation, May 2014

Number of graduates: 79

Number of graduates seeking new employment: 75

Percent with one or more offers of employment by graduation: 100

Percent placed by graduation: 100

Number of employers interviewing: 138

Average number of initial job interviews per student: 13

Percent of all interviews arranged by Institute: 92

Percent of graduates with 2 or more job offers: 90

Percent of graduates with 3 or more job offers: 61

Percent of graduates with 4 or more job offers: 40

Average base salary offer ($): 96,600

Median base salary offer ($): 95,000

Average base salary offers – candidates with job experience ($): 100,600

Range of base salary offers – candidates with job experience ($): 80,000-135,000

Percent of graduates with prior professional work experience: 50

Average base salary offers – candidates without experience ($): 89,000

Range of base salary offers – candidates without experience ($): 75,000-110,000

Percent of graduates receiving a signing bonus: 65

Average amount of signing bonus ($): 12,200

Percent remaining in NC: 59

Percent of graduates sharing salary data: 95

Number of reported job offers: 246

Percent of reported job offers based in U.S.: 100


CONTENTS

1 The Basics
  1.1 Conventional Notation
    1.1.1 Matrix Partitions
    1.1.2 Special Matrices and Vectors
    1.1.3 n-space
  1.2 Vector Addition and Scalar Multiplication
  1.3 Exercises

2 Norms, Inner Products and Orthogonality
  2.1 Norms and Distances
  2.2 Inner Products
    2.2.1 Covariance
    2.2.2 Mahalanobis Distance
    2.2.3 Angular Distance
    2.2.4 Correlation
  2.3 Orthogonality
  2.4 Outer Products

3 Linear Combinations and Linear Independence
  3.1 Linear Combinations
  3.2 Linear Independence
    3.2.1 Determining Linear Independence
  3.3 Span of Vectors

4 Basis and Change of Basis

5 Least Squares

6 Eigenvalues and Eigenvectors
  6.1 Diagonalization
  6.2 Geometric Interpretation of Eigenvalues and Eigenvectors

7 Principal Components Analysis
  7.1 Comparison with Least Squares
  7.2 Covariance or Correlation Matrix?
  7.3 Applications of Principal Components
    7.3.1 PCA for dimension reduction

8 Singular Value Decomposition (SVD)
  8.1 Resolving a Matrix into Components
    8.1.1 Data Compression
    8.1.2 Noise Reduction
    8.1.3 Latent Semantic Indexing

9 Advanced Regression Techniques
  9.1 Biased Regression
    9.1.1 Principal Components Regression (PCR)
    9.1.2 Ridge Regression

CHAPTER 1

THE BASICS

1.1 Conventional Notation

Linear Algebra has some conventional ways of representing certain types of numerical objects. Throughout this course, we will stick to the following basic conventions:

• Bold and uppercase letters like A, X, and U will be used to refer to matrices.

• Occasionally, the size of the matrix will be specified by subscripts, like $A_{m\times n}$, which means that A is a matrix with m rows and n columns.

• Bold and lowercase letters like x and y will be used to reference vectors. Unless otherwise specified, these vectors will be thought of as columns, with $x^T$ and $y^T$ referring to the row equivalent.

• The individual elements of a vector or matrix will often be referred to with subscripts, so that $A_{ij}$ (or sometimes $a_{ij}$) denotes the element in the ith row and jth column of the matrix A. Similarly, $x_k$ denotes the kth element of the vector x. These references to individual elements are not generally bolded because they refer to scalar quantities.

• Scalar quantities are written as unbolded Greek letters like α, δ, and λ.

• The trace of a square matrix An×n, denoted Tr(A) or Trace(A), is the sum of the diagonal elements of A,

$$\mathrm{Tr}(A) = \sum_{i=1}^{n} A_{ii}.$$

Beyond these basic conventions, there are other common notational tricks that we will become familiar with. The first of these is writing a partitioned matrix.

1.1.1 Matrix Partitions

We will often want to consider a matrix as a collection of either rows or columns rather than individual elements. As we will see in the next chapter, when we partition matrices in this form, we can view their multiplication in a simplified form. This often leads us to a new view of the data which can be helpful for interpretation. When we write $A = (A_1|A_2|\dots|A_n)$ we are viewing the matrix A as a collection of column vectors, $A_i$, in the following way:

$$A = (A_1|A_2|\dots|A_n) = \begin{pmatrix} \uparrow & \uparrow & \uparrow & \dots & \uparrow \\ A_1 & A_2 & A_3 & \dots & A_n \\ \downarrow & \downarrow & \downarrow & \dots & \downarrow \end{pmatrix}$$

Similarly, we can write A as a collection of row vectors:

$$A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{pmatrix} = \begin{pmatrix} \longleftarrow & A_1 & \longrightarrow \\ \longleftarrow & A_2 & \longrightarrow \\ & \vdots & \\ \longleftarrow & A_m & \longrightarrow \end{pmatrix}$$

Sometimes, we will want to refer to both rows and columns in the same context. The above notation is not sufficient for this because $A_j$ could refer to either a column or a row. In these situations, we may use $A_{\star j}$ to reference the jth column and $A_{i\star}$ to reference the ith row:

$$A_{\star j} = \begin{pmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{pmatrix} \qquad \text{and} \qquad A_{i\star} = \begin{pmatrix} a_{i1} & a_{i2} & \dots & a_{in} \end{pmatrix}.$$
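For readers who like to follow along in software, the short numpy sketch below (an added illustration, not part of the original notes; the matrix A is made-up data) shows how the element, column, and row references above translate into array indexing.

```python
import numpy as np

# A small 3x4 matrix to illustrate the indexing conventions above.
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

print(A[1, 2])   # element A_ij with i = 2, j = 3 (numpy indexes from 0) -> 7
print(A[:, 2])   # the column A_{*3} -> [ 3  7 11]
print(A[1, :])   # the row    A_{2*} -> [5 6 7 8]
```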

1.1.2 Special Matrices and Vectors

The bold capital letter I is used to denote the identity matrix. Sometimes this matrix has a single subscript to specify the size of the matrix. More often, the size of the identity is implied by the matrix equation in which it appears.

$$I_4 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

The bold lowercase $e_j$ is used to refer to the jth column of I. It is simply a vector of zeros with a one in the jth position. We do not often specify the size of the vector $e_j$; the number of elements is generally assumed from the context of the problem.

  0  .   .     0    e = jthrow →  1  j    0     .   .  0 The vector e with no subscript refers to a vector of all ones.

1 1   1 e =    .   .  1

A diagonal matrix is a matrix for which the off-diagonal elements, $A_{ij}$ with $i \neq j$, are zero. For example:

$$D = \begin{pmatrix} \sigma_1 & 0 & 0 & 0 \\ 0 & \sigma_2 & 0 & 0 \\ 0 & 0 & \sigma_3 & 0 \\ 0 & 0 & 0 & \sigma_4 \end{pmatrix}$$

Since the off-diagonal elements are 0, we need only define the diagonal elements for such a matrix. Thus, we will frequently write

$$D = \mathrm{diag}\{\sigma_1, \sigma_2, \sigma_3, \sigma_4\} \quad \text{or simply} \quad D_{ii} = \sigma_i.$$
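As a quick illustration in code (a minimal numpy sketch with made-up diagonal entries, added here for reference), the identity matrix, the vectors $e_j$ and e, and a diagonal matrix can be constructed as follows.

```python
import numpy as np

I4 = np.eye(4)                        # the 4x4 identity matrix I_4
e2 = I4[:, 1]                         # e_2: zeros with a one in the 2nd position
e  = np.ones(4)                       # the vector of all ones
D  = np.diag([2.0, 3.0, 5.0, 7.0])    # D = diag{2, 3, 5, 7}

print(e2)          # [0. 1. 0. 0.]
print(np.diag(D))  # recover the diagonal elements: [2. 3. 5. 7.]
```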

1.1.3 n-space

You are already familiar with the concept of "ordered pairs" or coordinates $(x_1, x_2)$ on the two-dimensional plane (we call this plane "2-space"). Fortunately, we do not live in a two-dimensional world! Our data will more often consist of measurements on a number of variables (let's call that number n). Thus, our data points belong to what is known as n-space. They are represented by n-tuples, which are nothing more than ordered lists of numbers: $(x_1, x_2, x_3, \dots, x_n)$. An n-tuple defines a vector with the same n elements, and so these two concepts should be thought of interchangeably. The only difference is that the vector has a direction, away from the origin and toward the n-tuple. You will recall that the symbol R is used to denote the set of real numbers. R is simply 1-space: it is a set of vectors with a single element. In this sense any real number, x, has a direction: if it is positive, it is to one side of the origin; if it is negative, it is to the opposite side. That number, x, also has a magnitude: |x| is the distance between x and the origin, 0. n-space (the set of real n-tuples) is denoted $\mathbb{R}^n$. In set notation, the formal mathematical definition is simply:

$$\mathbb{R}^n = \{(x_1, x_2, \dots, x_n) : x_i \in \mathbb{R},\ i = 1, \dots, n\}.$$

We will often use this notation to define the size of an arbitrary vector. For example, x ∈ Rp simply means that x is a vector with p entries: x = (x1, x2,..., xp). Many (all, really) of the concepts we have previously considered in 2- or 3-space extend naturally to n-space and a few new concepts become useful as well. One very important concept is that of a norm or distance metric, as we will see in Chapter 2. Before discussing norms, let’s revisit the basics of vector addition and scalar multiplication.

1.2 Vector Addition and Scalar Multiplication

You’ve already learned how vector addition works algebraically: it occurs element-wise between two vectors of the same length:

$$a + b = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} = \begin{pmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{pmatrix}$$

Geometrically, vector addition is witnessed by placing the two vectors, a and b, tail-to-head. The result, a + b, is the vector from the open tail to the open head. This is called the parallelogram law and is demonstrated in Figure 1.1a.

[Figure 1.1: Vector Addition and Subtraction Geometrically: Tail-to-Head. Panel (a): addition of vectors, showing a, b, and a + b. Panel (b): subtraction of vectors, showing a, b, and a − b.]

When subtracting vectors as a − b, we simply add −b to a. The vector −b has the same length as b but points in the opposite direction. The result a − b has the same length as the vector which connects the two heads of a and b, as shown in Figure 1.1b.

Example 1.2.1: Vector Subtraction: Centering Data

One thing we will do frequently in this course is consider centered and/or standardized data. To center a group of variables, we merely subtract the mean of each variable from each observation. Geometrically, this amounts to a translation (shift) of the data so that its center (or mean) is at the origin. The following graphic illustrates this process using 4 data points.

[Figure: four data points before and after centering; subtracting the mean vector shifts the cloud of points so that its center lies at the origin.]

Scalar multiplication is another operation which acts element-wise:

$$\alpha a = \alpha \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} \alpha a_1 \\ \alpha a_2 \\ \vdots \\ \alpha a_n \end{pmatrix}$$

Scalar multiplication changes the length of a vector but not the overall direction (although a negative scalar will scale the vector in the opposite direction through the origin). We can see this geometric interpretation of scalar multiplication in Figure 1.2.

[Figure 1.2: Geometric Effect of Scalar Multiplication, showing a vector a along with 2a and −0.5a.]

1.3 Exercises

1. For a general matrix Am×n describe what the following products will provide. Also give the size of the result (i.e. "n × 1 vector" or "scalar").

a. $Ae_j$
b. $e_i^T A$
c. $e_i^T A e_j$
d. $Ae$
e. $e^T A$
f. $\frac{1}{n} e^T A$

2. Let Dn×n be a diagonal matrix with diagonal elements Dii. What effect does multiplying a matrix Am×n on the left by D have? What effect does multiplying a matrix An×m on the right by D have? If you cannot see this effect in a general sense, try writing out a simple 3 × 3 matrix as an example first.

3. What is the inverse of a diagonal matrix, D = diag{d11, d22,..., dnn}?

4. Suppose you have a matrix of data, An×p, containing n observations on p variables. Suppose the standard deviations of these variables are σ1, σ2, ... , σp. Give a formula for a matrix that contains the same data but with each variable divided by its standard deviation. Hint: you should use exercises 2 and 3.

5. Suppose we have a network/graph as shown in Figure 1.3. This particular network has 6 numbered vertices (the circles) and edges which connect the vertices. Each edge has a certain weight (perhaps reflecting some level of association between the vertices) which is given as a number.

[Figure 1.3: An example of a graph or network with 6 numbered vertices and weighted edges.]

a. The adjacency matrix of a graph is defined to be the matrix A such that element $A_{ij}$ reflects the weight of the edge connecting vertex i and vertex j. Write out the adjacency matrix for this graph.
b. The degree of a vertex is defined as the sum of the weights of the edges connected to that vertex. Create a vector d such that $d_i$ is the degree of node i.
c. Write d as a matrix-vector product in two different ways using the adjacency matrix, A, and e.

CHAPTER 2

NORMS, INNER PRODUCTS AND ORTHOGONALITY

2.1 Norms and Distances

In applied mathematics, Norms are functions which measure the magnitude or length of a vector. They are commonly used to determine similarities between observations by measuring the distance between them. As we will see, there are many ways to define distance between two points. Definition 2.1.1: Vector Norms and Distance Metrics

A Norm, or distance metric, is a function that takes a vector as input and returns a scalar quantity ($f: \mathbb{R}^n \to \mathbb{R}$). A vector norm is typically denoted by two vertical bars surrounding the input vector, $\|x\|$, to signify that it is not just any function, but one that satisfies the following criteria:

1. If c is a scalar, then $\|cx\| = |c|\,\|x\|$

2. The triangle inequality:

$$\|x + y\| \leq \|x\| + \|y\|$$

3. $\|x\| = 0$ if and only if $x = 0$.

4. $\|x\| \geq 0$ for any vector x

We will not spend any time on these axioms or on the theoretical aspects of norms, but we will put a couple of these functions to good use in our studies, the first of which is the Euclidean norm or 2-norm.

Definition 2.1.2: Euclidean Norm, $\|\cdot\|_2$

The Euclidean Norm, also known as the 2-norm, simply measures the Euclidean length of a vector (i.e. a point’s distance from the origin). Let $x = (x_1, x_2, \dots, x_n)$. Then,

$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

If x is a column vector, then
$$\|x\|_2 = \sqrt{x^T x}.$$

Often we will simply write $\|\cdot\|$ rather than $\|\cdot\|_2$ to denote the 2-norm, as it is by far the most commonly used norm.

This is merely the distance formula from undergraduate mathematics, measuring the distance between the point x and the origin. To compute the distance between two different points, say x and y, we’d calculate

$$\|x - y\|_2 = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$

Example 2.1.1: Euclidean Norm and Distance

Suppose I have two vectors in 3-space:

x = (1, 1, 1) and y = (1, 0, 0)

Then the magnitude of x (i.e. its length or distance from the origin) is

$$\|x\|_2 = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3}$$

and the magnitude of y is

$$\|y\|_2 = \sqrt{1^2 + 0^2 + 0^2} = 1$$

and the distance between point x and point y is

$$\|x - y\|_2 = \sqrt{(1-1)^2 + (1-0)^2 + (1-0)^2} = \sqrt{2}.$$

The Euclidean norm is crucial to many methods in data analysis as it measures the closeness of two data points.
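A brief numpy sketch (an added illustration, not part of the original example) reproduces the numbers from Example 2.1.1; np.linalg.norm computes the 2-norm by default.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0])
y = np.array([1.0, 0.0, 0.0])

print(np.linalg.norm(x))        # ||x||_2 = sqrt(3) ≈ 1.732
print(np.linalg.norm(y))        # ||y||_2 = 1
print(np.linalg.norm(x - y))    # ||x - y||_2 = sqrt(2) ≈ 1.414
print(np.sqrt(x @ x))           # same as ||x||_2, computed via sqrt(x^T x)
```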

To turn any vector into a unit vector, a vector with a length of 1, we need only divide each of the entries in the vector by its Euclidean norm. This is a simple form of standardization used in many areas of data analysis. For a unit vector x, $x^Tx = 1$. Perhaps without knowing it, we’ve already seen many formulas involving the norm of a vector. Examples 2.1.2 and 2.1.3 show how some of the most important concepts in statistics can be represented using vector norms.

Example 2.1.2: Standard Deviation and Variance

Suppose a group of individuals has the following heights, measured in inches: (60, 70, 65, 50, 55). The mean height for this group is 60 inches. The formula for the sample standard deviation is typically given as

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$

We want to subtract the mean from each observation, square the numbers, sum the result, take the square root, and divide by $\sqrt{n-1}$. If we let $\bar{\mathbf{x}} = \bar{x}\,e = (60, 60, 60, 60, 60)$ be a vector containing the mean, and $x = (60, 70, 65, 50, 55)$ be the vector of data, then the standard deviation in matrix notation is:

$$s = \frac{1}{\sqrt{n-1}}\|x - \bar{\mathbf{x}}\|_2 = 7.9$$

The sample variance of this data is merely the square of the sample standard deviation:

$$s^2 = \frac{1}{n-1}\|x - \bar{\mathbf{x}}\|_2^2$$
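The same calculation is easy to check numerically; the sketch below (an added illustration) reproduces the 7.9 from Example 2.1.2 using the norm of the centered data vector.

```python
import numpy as np

x = np.array([60.0, 70.0, 65.0, 50.0, 55.0])
n = len(x)
xbar = x.mean() * np.ones(n)               # the vector containing the mean, repeated

s = np.linalg.norm(x - xbar) / np.sqrt(n - 1)
print(s)                                   # ≈ 7.906, matching Example 2.1.2
print(np.isclose(s, x.std(ddof=1)))        # agrees with numpy's sample standard deviation
```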

Example 2.1.3: Residual Sums of Squares

Another place we’ve seen a similar calculation is in linear regression. You’ll recall the objective of our regression line is to minimize the sum of squared residuals between the predicted value ŷ and the observed value y:

$$\sum_{i=1}^{n}(\hat{y}_i - y_i)^2.$$

In vector notation, we’d let y be a vector containing the observed data and ŷ be a vector containing the corresponding predictions and write this summation as

$$\|\hat{y} - y\|_2^2$$

In fact, any situation where the phrase "sum of squares" is encountered, the 2-norm is generally implicated.

Example 2.1.4: Coefficient of Determination, R2

Since variance can be expressed using the Euclidean norm, so can the coefficient of determination or R2.

$$R^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\|\hat{y} - \bar{y}\|^2}{\|y - \bar{y}\|^2}$$

Other useful norms and distances

1-norm, $\|\cdot\|_1$. If $x = (x_1\ x_2\ \dots\ x_n)$ then the 1-norm of x is

$$\|x\|_1 = \sum_{i=1}^{n}|x_i|.$$

This metric is often referred to as Manhattan distance, city block distance, or taxicab distance because it measures the distance between points along a rectangular grid (as a taxicab must travel on the streets of Manhattan, for example). When x and y are binary vectors, the 1-norm is called the Hamming Distance, and simply measures the number of elements that are different between the two vectors.

Figure 2.1: The lengths of the red, yellow, and blue paths represent the 1-norm distance between the two points. The green line shows the Euclidean measurement (2-norm).

∞-norm, $\|\cdot\|_\infty$. The infinity norm, also called the supremum or max distance, is:

$$\|x\|_\infty = \max\{|x_1|, |x_2|, \dots, |x_n|\}$$

2.2 Inner Products

The inner product of vectors is a notion that you’ve already seen: it is what’s called the dot product in most physics and calculus textbooks.

Definition 2.2.1: Vector Inner Product

The inner product of two n × 1 vectors x and y is written $x^Ty$ (or sometimes as $\langle x, y\rangle$) and is the sum of the products of corresponding elements:

$$x^Ty = \begin{pmatrix} x_1 & x_2 & \dots & x_n \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = x_1y_1 + x_2y_2 + \cdots + x_ny_n = \sum_{i=1}^{n}x_iy_i.$$

When we take the inner product of a vector with itself, we get the square of the 2-norm:
$$x^Tx = \|x\|_2^2.$$

Inner products are at the heart of every matrix product. When we multiply two matrices, Xm×n and Yn×p, we can represent the individual elements of the result as inner products of rows of X and columns of Y as follows:

$$XY = \begin{pmatrix} X_{1\star} \\ X_{2\star} \\ \vdots \\ X_{m\star} \end{pmatrix}\begin{pmatrix} Y_{\star 1} & Y_{\star 2} & \dots & Y_{\star p} \end{pmatrix} = \begin{pmatrix} X_{1\star}Y_{\star 1} & X_{1\star}Y_{\star 2} & \dots & X_{1\star}Y_{\star p} \\ X_{2\star}Y_{\star 1} & X_{2\star}Y_{\star 2} & \dots & X_{2\star}Y_{\star p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{m\star}Y_{\star 1} & X_{m\star}Y_{\star 2} & \dots & X_{m\star}Y_{\star p} \end{pmatrix}$$

2.2.1 Covariance

Another important statistical measurement that is represented by an inner product is covariance. Covariance is a measure of how much two random variables change together. The statistical formula for covariance is given as

$$\mathrm{Covariance}(x, y) = E[(x - E[x])(y - E[y])] \qquad (2.1)$$

where $E[\cdot]$ is the expected value of the variable. If larger values of one variable correspond to larger values of the other variable and at the same time smaller values of one correspond to smaller values of the other, then the covariance between the two variables is positive. In the opposite case, if larger values of one variable correspond to smaller values of the other and vice versa, then the covariance is negative. Thus, the sign of the covariance shows the tendency of the linear relationship between variables; however, the magnitude of the covariance is not easy to interpret. Covariance is a population parameter - it is a property of the joint distribution of the random variables x and y. Definition 2.2.2 provides the mathematical formulation for the sample covariance. This is our best estimate for the population parameter when we have data sampled from a population.

Definition 2.2.2: Sample Covariance

If x and y are n × 1 vectors containing n observations for two different variables, then the sample covariance of x and y is given by

$$\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}(x - \bar{\mathbf{x}})^T(y - \bar{\mathbf{y}})$$

where again $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are vectors that contain $\bar{x}$ and $\bar{y}$ repeated n times. It should be clear from this formulation that

cov(x, y) = cov(y, x).

When we have p vectors, v1, v2, ... , vp, each containing n observations for p different variables, the sample covariances are most commonly given by the sample covariance matrix, Σ, where

Σij = cov(vi, vj).

This matrix is symmetric, since Σij = Σji. If we create a matrix V whose columns are the vectors v1, v2, ... vp once the variables have been centered to have mean 0, then the covariance matrix is given by:

$$\mathrm{cov}(V) = \Sigma = \frac{1}{n-1}V^TV.$$

The jth diagonal element of this matrix gives the variance of $v_j$ since

$$\Sigma_{jj} = \mathrm{cov}(v_j, v_j) = \frac{1}{n-1}(v_j - \bar{v}_j)^T(v_j - \bar{v}_j) \qquad (2.2)$$
$$= \frac{1}{n-1}\|v_j - \bar{v}_j\|_2^2 \qquad (2.3)$$
$$= \mathrm{var}(v_j) \qquad (2.4)$$

When two variables are completely uncorrelated, their covariance is zero.

This lack of correlation would be seen in a covariance matrix with a diagonal structure. That is, if $v_1, v_2, \dots, v_p$ are uncorrelated with individual variances $\sigma_1^2, \sigma_2^2, \dots, \sigma_p^2$ respectively, then the corresponding covariance matrix is:

 2  σ1 0 0 . . . 0 2  0 σ 0 . . . 0   2   .. .  Σ =  0 0 . . 0     . . . . .   ......  2 0 0 0 0 σp

Furthermore, for variables which are independent and identically distributed (take for instance the error terms in a linear regression model, which are assumed to be independent and normally distributed with mean 0 and constant variance $\sigma^2$), the covariance matrix is a multiple of the identity matrix:

$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{pmatrix} = \sigma^2 I$$

Transforming our variables in such a way that their covariance matrix becomes diagonal will be our goal in Chapter 7.

Theorem 2.2.1: Properties of Covariance Matrices

The following mathematical properties stem from Equation 2.1. Let Xn×p be a matrix of data containing n observations on p variables. If A is a constant matrix (or vector, in the first case) then

cov(XA) = ATcov(X)A and cov(X + A) = cov(X)
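As a numerical check of the formula $\Sigma = \frac{1}{n-1}V^TV$ (a small sketch using randomly generated data rather than data from the text), the hand-built covariance matrix can be compared against numpy's built-in estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 observations on 3 variables

V = X - X.mean(axis=0)               # center each column to have mean 0
Sigma = V.T @ V / (V.shape[0] - 1)   # sample covariance matrix, (1/(n-1)) V^T V

print(np.allclose(Sigma, np.cov(X, rowvar=False)))  # True: matches numpy's np.cov
```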

2.2.2 Mahalanobis Distance

Mahalanobis distance is similar to Euclidean distance, but takes into account the correlation of the variables. This metric is relatively common in data mining applications like classification. Suppose we have p variables which have some covariance matrix, Σ. Then the Mahalanobis distance between two observations, $x = (x_1\ x_2\ \dots\ x_p)^T$ and $y = (y_1\ y_2\ \dots\ y_p)^T$, is given by

$$d(x, y) = \sqrt{(x-y)^T\Sigma^{-1}(x-y)}.$$

If the covariance matrix is diagonal (meaning the variables are uncorrelated) then the Mahalanobis distance reduces to Euclidean distance normalized by the variance of each variable:

$$d(x, y) = \sqrt{\sum_{i=1}^{p}\frac{(x_i - y_i)^2}{s_i^2}} = \|\Sigma^{-1/2}(x-y)\|_2.$$
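A minimal sketch of the Mahalanobis calculation follows; the covariance matrix Sigma and the points used here are made up purely for illustration and are not taken from the text.

```python
import numpy as np

Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])            # an assumed covariance matrix for two variables
x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

diff = x - y
d = np.sqrt(diff @ np.linalg.solve(Sigma, diff))   # sqrt((x-y)^T Sigma^{-1} (x-y))
print(d)
# With a diagonal Sigma this reduces to Euclidean distance scaled by each variance.
```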

2.2.3 Angular Distance The inner product between two vectors can provide useful information about their relative orientation in space and about their similarity. For example, to find the cosine of the angle between two vectors in n-space, the inner product of their corresponding unit vectors will provide the result. This cosine is often used as a measure of similarity or correlation between two vectors. Definition 2.2.3: Cosine of Angle between Vectors

The cosine of the angle between two vectors in n-space is given by

$$\cos(\theta) = \frac{x^Ty}{\|x\|_2\|y\|_2}$$

[Figure: two vectors x and y with the angle θ between them.]

This angular distance is at the heart of Pearson’s correlation coefficient.

2.2.4 Correlation

Pearson’s correlation is a normalized version of the covariance, so that not only the sign of the coefficient but also its magnitude is meaningful in measuring the strength of the linear association.

Example 2.2.1: Pearson’s Correlation and Cosine Distance

You may recall the formula for Pearson’s correlation between variable x and y with a sample size of n to be as follows:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

If we let x¯ be a vector that contains x¯ repeated n times, like we did in Example 2.1.2, and let y¯ be a vector that contains y¯ then Pearson’s coefficient can be written as:

$$r = \frac{(x - \bar{\mathbf{x}})^T(y - \bar{\mathbf{y}})}{\|x - \bar{\mathbf{x}}\|\,\|y - \bar{\mathbf{y}}\|}$$

In other words, it is just the cosine of the angle between the two vectors once they have been centered to have mean 0. This makes sense: correlation is a measure of the extent to which the two variables share a line in space. If the cosine of the angle is positive or negative one, this means the angle between the two vectors is 0◦ or 180◦, thus, the two vectors are perfectly correlated or collinear.
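The identity between Pearson's r and the cosine of the angle between centered vectors is easy to verify numerically. The sketch below (an added illustration) uses the exercise/exam data that appears later in Example 5.0.2.

```python
import numpy as np

x = np.array([20., 100., 90., 70., 50., 10., 30.])   # % of exercises completed
y = np.array([55., 100., 100., 70., 75., 25., 60.])  # exam grades

xc, yc = x - x.mean(), y - y.mean()                        # center both variables
r = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))  # cosine of the angle

print(r)
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))              # matches Pearson's r
```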

It is difficult to visualize the angle between two variable vectors because they exist in n-space, where n is the number of observations in the dataset. Unless we have fewer than 3 observations, we cannot draw these vectors or even picture them in our minds. As it turns out, this angular measurement does translate into something we can conceptualize: Pearson’s correlation coefficient is the cosine of the angle formed between the two possible regression lines using the centered data: y regressed on x and x regressed on y. This is illustrated in Figure 2.2. To compute the matrix of pairwise correlations between variables $x_1, x_2, x_3, \dots, x_p$ (columns containing n observations for each variable), we’d first center them to have mean zero, then normalize them to have length $\|x_i\| = 1$ and then compose the matrix $X = [x_1|x_2|x_3|\dots|x_p]$. Using this centered and normalized data, the correlation matrix is simply

C = XTX.

2.3 Orthogonality

Orthogonal (or perpendicular) vectors have an angle between them of 90°, meaning that their cosine (and subsequently their inner product) is zero.

[Figure 2.2: Correlation Coefficient r and Angle between Regression Lines. The two regression lines, y = f(x) and x = f(y), form an angle θ, and r = cos(θ).]

Definition 2.3.1: Orthogonality

Two vectors, x and y, are orthogonal in n-space if their inner product is zero: xTy = 0

Combining the notion of orthogonality and unit vectors we can define an orthonormal set of vectors, or an orthonormal matrix. Remember, for a unit vector, xTx = 1. Definition 2.3.2: Orthonormal Sets

The n × 1 vectors {x1, x2, x3, ... , xp} form an orthonormal set if and only if

1. $x_i^Tx_j = 0$ when $i \neq j$, and
2. $x_i^Tx_i = 1$ (equivalently $\|x_i\| = 1$)

In other words, an orthonormal set is a collection of unit vectors which are mutually orthogonal.

If we form a matrix, X = (x1|x2|x3| ... |xp), having an orthonormal set of vectors as columns, we will find that multiplying the matrix by its transpose provides a nice result: 2.4. Outer Products 19

$$X^TX = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_p^T \end{pmatrix}\begin{pmatrix} x_1 & x_2 & \dots & x_p \end{pmatrix} = \begin{pmatrix} x_1^Tx_1 & x_1^Tx_2 & \dots & x_1^Tx_p \\ x_2^Tx_1 & x_2^Tx_2 & \dots & x_2^Tx_p \\ \vdots & \vdots & \ddots & \vdots \\ x_p^Tx_1 & x_p^Tx_2 & \dots & x_p^Tx_p \end{pmatrix} = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix} = I_p$$

We will be particularly interested in these types of matrices when they are square. If X is a square matrix with orthonormal columns, the arithmetic above means that the inverse of X is $X^T$ (i.e. X also has orthonormal rows):

XTX = XXT = I.

Square matrices with orthonormal columns are called orthogonal matrices. Definition 2.3.3: Orthogonal (or Orthonormal) Matrix

A square matrix, U, with orthonormal columns also has orthonormal rows and is called an orthogonal matrix. Such a matrix has an inverse which is equal to its transpose,

UTU = UUT = I

2.4 Outer Products

The outer product of two vectors $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, written $xy^T$, is an m × n matrix with rank 1. To see this basic fact, let’s just look at an example.

Example 2.4.1: Outer Product

Let $x = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}$ and let $y = \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}$. Then the outer product of x and y is:

1 2 1 3  T 2  4 2 6  xy =   2 1 3 =   3 6 3 9  4 8 4 12

which clearly has rank 1. It should be clear from this example that computing an outer product will always result in a matrix whose rows and columns are multiples of each other.
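A short numpy sketch (added for illustration) reproduces the outer product from Example 2.4.1 and confirms that its rank is 1.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 1, 3])

M = np.outer(x, y)                 # the 4x3 outer product x y^T from Example 2.4.1
print(M)
print(np.linalg.matrix_rank(M))    # 1: every row/column is a multiple of the others
```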

Example 2.4.2: Centering Data with an Outer Product

As we’ve seen in previous examples, many statistical formulas involve the centered data, that is, data from which the mean has been subtracted so that the new mean is zero. Suppose we have a matrix of data containing observations of individuals’ heights (h) in inches, weights (w), in pounds and wrist sizes (s), in inches:

$$A = \begin{pmatrix} 60 & 102 & 5.5 \\ 72 & 170 & 7.5 \\ 66 & 110 & 6.0 \\ 69 & 128 & 6.5 \\ 63 & 130 & 7.0 \end{pmatrix}$$

where the columns correspond to height (h), weight (w), and wrist size (s), and the rows to persons 1 through 5. The average values for height, weight, and wrist size are as follows:

$$\bar{h} = 66, \qquad \bar{w} = 128, \qquad \bar{s} = 6.5$$

To center all of the variables in this data set simultaneously, we could compute an outer product using a vector containing the means and a vector of all ones: 2.4. Outer Products 21

60 102 5.5 1 72 170 7.5 1      66 110 6.0 − 1 66 128 6.5     69 128 6.5 1 63 130 7.0 1 60 102 5.5 66 128 6.5 72 170 7.5 66 128 6.5     = 66 110 6.0 − 66 128 6.5     69 128 6.5 66 128 6.5 63 130 7.0 66 128 6.5

−6.0000 −26.0000 −1.0000  6.0000 42.0000 1.0000    =  0 −18.0000 −0.5000    3.0000 0 0  −3.0000 2.0000 0.5000

Exercises

1. Let $u = \begin{pmatrix} 1 \\ 2 \\ -4 \\ -2 \end{pmatrix}$ and $v = \begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix}$.

a. Determine the Euclidean distance between u and v.
b. Find a vector of unit length in the direction of u.
c. Determine the cosine of the angle between u and v.
d. Find the 1- and ∞-norms of u and v.
e. Suppose these vectors are observations on four independent variables, which have the following covariance matrix:

2 0 0 0 0 1 0 0 Σ =   0 0 2 0 0 0 0 1

Determine the Mahalanobis distance between u and v.

2. Let
$$U = \frac{1}{3}\begin{pmatrix} -1 & 2 & 0 & -2 \\ 2 & 2 & 0 & 1 \\ 0 & 0 & 3 & 0 \\ -2 & 1 & 0 & 2 \end{pmatrix}$$

a. Show that U is an orthogonal matrix.
b. Let $b = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$. Solve the equation Ux = b.

3. Write a matrix expression for the correlation matrix, C, for a matrix of centered data, X, where $C_{ij} = r_{ij}$ is Pearson's correlation measure between variables $x_i$ and $x_j$. To do this, we need more than an inner product; we need to normalize the rows and columns by the norms $\|x_i\|$. For a hint, see Exercise 2 in Chapter 1.

4. Suppose you have a matrix of data, An×p, containing n observations on p variables. Develop a matrix formula for the standardized data (where the mean of each variable should be subtracted from the corresponding column before dividing by the standard deviation). Hint: use Exercises 1(f) and 4 from Chapter 1 along with Example 2.4.2.

5. Explain why, for any norm or distance metric,

$$\|x - y\| = \|y - x\|$$

6. Find two vectors which are orthogonal to $x = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$

7. Pythagorean Theorem. Show that x and y are orthogonal if and only if

$$\|x + y\|_2^2 = \|x\|_2^2 + \|y\|_2^2$$

(Hint: Recall that $\|x\|_2^2 = x^Tx$)

CHAPTER 3

LINEAR COMBINATIONS AND LINEAR INDEPENDENCE

One of the most central ideas in all of Linear Algebra is that of linear independence. For regression problems, it is repeatedly stressed that multicollinearity is problematic. Multicollinearity is simply a statistical term for linear dependence. It’s bad. We will see the reason for this shortly, but first we have to develop the notion of a linear combination.

3.1 Linear Combinations

Definition 3.1.1: Linear Combination

A linear combination is constructed from a set of terms v1, v2, ... , vn by multiplying each term by a constant and adding the result:

$$c = \alpha_1v_1 + \alpha_2v_2 + \cdots + \alpha_nv_n = \sum_{i=1}^{n}\alpha_iv_i$$

The coefficients αi are scalar constants and the terms, {vi} can be scalars, vectors, or matrices.

If we dissect our formula for a system of linear equations, Ax = b, we will find that the right-hand side vector b can be expressed as a linear combination of the columns in the coefficient matrix, A. 3.1. Linear Combinations 24

$$b = Ax \qquad (3.1)$$
$$b = (A_1|A_2|\dots|A_n)\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \qquad (3.2)$$
$$b = x_1A_1 + x_2A_2 + \cdots + x_nA_n \qquad (3.3)$$

A concrete example of this expression is given in Example 3.1.1. Example 3.1.1: Systems of Equations as Linear Combinations

Consider the following system of equations:

3x1 + 2x2 + 9x3 = 1 (3.4)

4x1 + 2x2 + 3x3 = 5 (3.5)

2x1 + 7x2 + x3 = 0 (3.6)

We can write this as a matrix vector product Ax = b where
$$A = \begin{pmatrix} 3 & 2 & 9 \\ 4 & 2 & 3 \\ 2 & 7 & 1 \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \quad \text{and} \quad b = \begin{pmatrix} 1 \\ 5 \\ 0 \end{pmatrix}$$

We can also write b as a linear combination of columns of A:

$$x_1\begin{pmatrix} 3 \\ 4 \\ 2 \end{pmatrix} + x_2\begin{pmatrix} 2 \\ 2 \\ 7 \end{pmatrix} + x_3\begin{pmatrix} 9 \\ 3 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 5 \\ 0 \end{pmatrix}$$

Similarly, if we have a matrix-matrix product, we can write each column of the result as a linear combination of columns of the first matrix. Let $A_{m\times n}$, $X_{n\times p}$, and $B_{m\times p}$ be matrices. If we have AX = B then

$$(A_1|A_2|\dots|A_n)\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} = (B_1|B_2|\dots|B_p)$$

and we can write

Bj = AXj = x1jA1 + x2jA2 + x3jA3 + ··· + xnjAn.

A concrete example of this expression is given in Example 3.1.2. 3.1. Linear Combinations 25

Example 3.1.2: Linear Combinations in Matrix-Matrix Products

Suppose we have the following matrix formula:

AX = B

where $A = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 2 \\ 3 & 2 & 1 \end{pmatrix}$ and $X = \begin{pmatrix} 5 & 6 \\ 9 & 5 \\ 7 & 8 \end{pmatrix}$. Then

$$B = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 2 \\ 3 & 2 & 1 \end{pmatrix}\begin{pmatrix} 5 & 6 \\ 9 & 5 \\ 7 & 8 \end{pmatrix} \qquad (3.7)$$
$$= \begin{pmatrix} 2(5)+1(9)+3(7) & 2(6)+1(5)+3(8) \\ 1(5)+4(9)+2(7) & 1(6)+4(5)+2(8) \\ 3(5)+2(9)+1(7) & 3(6)+2(5)+1(8) \end{pmatrix} \qquad (3.8)$$

and we can immediately notice that the columns of B are linear combi- nations of columns of A:

2 1 3 B1 = 5 1 + 9 4 + 7 2 3 2 1

2 1 3 B2 = 6 1 + 5 4 + 8 2 3 2 1 We may also notice that the rows of B can be expressed as a linear combination of rows of X:

$$B_{1\star} = 2\begin{pmatrix} 5 & 6 \end{pmatrix} + 1\begin{pmatrix} 9 & 5 \end{pmatrix} + 3\begin{pmatrix} 7 & 8 \end{pmatrix}$$
$$B_{2\star} = 1\begin{pmatrix} 5 & 6 \end{pmatrix} + 4\begin{pmatrix} 9 & 5 \end{pmatrix} + 2\begin{pmatrix} 7 & 8 \end{pmatrix}$$
$$B_{3\star} = 3\begin{pmatrix} 5 & 6 \end{pmatrix} + 2\begin{pmatrix} 9 & 5 \end{pmatrix} + 1\begin{pmatrix} 7 & 8 \end{pmatrix}$$

Linear combinations are everywhere, and they can provide subtle but important meaning in the sense that they can break data down into a sum of parts. You should convince yourself of one final view of matrix multiplication, as the sum of outer products. In this case B is the sum of 3 outer products (3 matrices of rank 1) involving the columns of A and corresponding rows of X:
$$B = A_{\star 1}X_{1\star} + A_{\star 2}X_{2\star} + A_{\star 3}X_{3\star}.$$
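These different views of the product AX can be verified numerically. The sketch below (added for illustration) checks the column view and the sum-of-outer-products view for the matrices of Example 3.1.2.

```python
import numpy as np

A = np.array([[2, 1, 3],
              [1, 4, 2],
              [3, 2, 1]])
X = np.array([[5, 6],
              [9, 5],
              [7, 8]])

B = A @ X

# First column of B as a linear combination of the columns of A
print(np.allclose(B[:, 0], 5*A[:, 0] + 9*A[:, 1] + 7*A[:, 2]))    # True

# B as a sum of rank-1 outer products of columns of A with rows of X
S = sum(np.outer(A[:, k], X[k, :]) for k in range(3))
print(np.allclose(B, S))                                          # True
```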

Example 3.1.2 turns out to have important implications for our interpretation of matrix factorizations. In this context we’d call AX a factorization of the matrix B. We will see how to use these expressions to our advantage in later chapters. We don’t necessarily have to use vectors as the terms for a linear combination. Example 3.1.3 shows how we can write any m × n matrix as a linear combination of nm matrices with rank 1.

Example 3.1.3: Linear Combination of Matrices

Write the matrix $A = \begin{pmatrix} 1 & 3 \\ 4 & 2 \end{pmatrix}$ as a linear combination of the following matrices:
$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},\ \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},\ \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix},\ \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$$
Solution:

$$A = \begin{pmatrix} 1 & 3 \\ 4 & 2 \end{pmatrix} = 1\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} + 3\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} + 4\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} + 2\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$$

Now that we understand the concept of Linear Combination, we can develop the important concept of Linear Independence.

3.2 Linear Independence

Definition 3.2.1: Linear Dependence and Linear Independence

A set of vectors {v1, v2, ... , vn} is linearly dependent if we can express the zero vector, 0, as a non-trivial linear combination of the vectors. In other words, there exist some constants α1, α2, ... , αn (non-trivial means that these constants are not all zero) for which

α1v1 + α2v2 + ··· + αnvn = 0. (3.9)

A set of terms is linearly independent if Equation 3.9 has only the trivial solution (α1 = α2 = ··· = αn = 0).

Another way to express linear dependence is to say that we can write one of the vectors as a linear combination of the others. If there exists a non-trivial set of coefficients α1, α2,..., αn for which

α1v1 + α2v2 + ··· + αnvn = 0 3.2. Linear Independence 27

then for $\alpha_j \neq 0$ we could write

$$v_j = -\frac{1}{\alpha_j}\sum_{\substack{i=1 \\ i\neq j}}^{n}\alpha_iv_i$$

Example 3.2.1: Linearly Dependent Vectors

The vectors $v_1 = \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix}$, $v_2 = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$, and $v_3 = \begin{pmatrix} 3 \\ 6 \\ 7 \end{pmatrix}$ are linearly dependent because $v_3 = 2v_1 + v_2$ or, equivalently, because

2v1 + v2 − v3 = 0

3.2.1 Determining Linear Independence

You should realize that the linear combination expressed in Definition 3.2.1 can be written as a matrix vector product. Let $A_{m\times n} = (A_1|A_2|\dots|A_n)$ be a matrix. Then by Definition 3.2.1, the columns of A are linearly independent if and only if the equation
$$Ax = 0 \qquad (3.10)$$
has only the trivial solution, x = 0. Equation 3.10 is commonly known as the homogeneous equation. For this equation to have only the trivial solution, it must be the case that under Gauss-Jordan elimination, the augmented matrix (A|0) reduces to (I|0). We have already seen this condition in our discussion about matrix inverses - if a square matrix A reduces to the identity matrix under Gauss-Jordan elimination then it is equivalently called full rank, nonsingular, or invertible. Now we add an additional condition equivalent to the others - the matrix A has linearly independent columns (and rows). In Theorem 3.2.1 an important list of equivalent conditions regarding linear independence and invertibility is given.

Theorem 3.2.1: Equivalent Conditions for Matrix Invertibility

Let A be an n × n matrix. The following statements are equivalent. (If one of these statements is true, then all of these statements are true.)

• A is invertible ($A^{-1}$ exists)

• A has full rank (rank(A) = n)

• The columns of A are linearly independent

• The rows of A are linearly independent

• The system Ax = b, b ≠ 0 has a unique solution

• Ax = 0 =⇒ x = 0

• A is nonsingular

• $A \xrightarrow{\text{Gauss-Jordan}} I$
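In practice, linear independence of a set of column vectors is usually checked numerically through the rank condition in Theorem 3.2.1. The sketch below (an added illustration using the vectors from Example 3.2.1 and the matrix from Example 3.1.1) does exactly that.

```python
import numpy as np

# Columns are the vectors from Example 3.2.1; they are linearly dependent.
V = np.array([[1, 1, 3],
              [2, 2, 6],
              [2, 3, 7]])
print(np.linalg.matrix_rank(V))   # 2 < 3, so the columns are linearly dependent

# The coefficient matrix from Example 3.1.1, for contrast
W = np.array([[3, 2, 9],
              [4, 2, 3],
              [2, 7, 1]])
print(np.linalg.matrix_rank(W))   # 3, so the columns are linearly independent (W is invertible)
```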

3.3 Span of Vectors

Definition 3.3.1: Vector Span

The span of a single vector v is the set of all scalar multiples of v:

span(v) = {αv for any constant α}

The span of a collection of vectors, V = {v1, v2, ... , vn} is the set of all linear combinations of these vectors:

span(V) = {α1v1 + α2v2 + ··· + αnvn for any constants α1,..., αn}

Recall that addition of vectors can be done geometrically using the head-to- tail method shown in Figure 3.1.

Figure 3.1: Geometrical addition of vectors: Head-to-tail

If we have two linearly independent vectors on a coordinate plane, then any third vector can be written as a linear combination of them. This is because two linearly independent vectors are sufficient to span the entire 2-dimensional plane. You should take a moment to convince yourself of this geometrically. In 3-space, two linearly independent vectors can still only span a plane. Figure 3.2 depicts this situation. The set of all linear combinations of the two vectors a and b (i.e. the span(a, b)) carves out a plane. We call this two-dimensional collection of vectors a subspace of R3. A subspace is formally defined in Definition 3.3.2.

Figure 3.2: The span(a, b) in R3 creates a plane (a 2-dimensional subspace)

Definition 3.3.2: Subspace

A subspace, S of Rn is thought of as a “flat” (having no curvature) surface within Rn. It is a collection of vectors which satisfies the following conditions:

1. The origin (0 vector) is contained in S

2. If x and y are in S then the sum x + y is also in S

3. If x is in S and α is a constant then αx is also in S

The span of two vectors a and b is a subspace because it satisfies these three conditions. (Can you prove it? See exercise 4.)

Example 3.3.1: Span

1 3 Let a = 3 and b = 0. Explain why or why not each of the 4 1 following vectors is contained in the span(a, b)?

5 a. x = 6 9

• To determine if x is in the span(a, b) we need to find coeffi- cients α1, α2 such that

α1a + α2b = x.

Thus, we attempt to solve the system

$$\begin{pmatrix} 1 & 3 \\ 3 & 0 \\ 4 & 1 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 5 \\ 6 \\ 9 \end{pmatrix}.$$

After row reduction (Gaussian elimination), we find that the system is consistent with the solution

$$\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$$

and so x is in fact in the span(a, b).

2 b. y = 4 6

• We could follow the same procedure as we did in part (a) to learn that the corresponding system is not consistent and thus that y is not in the span(a, b).
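The span-membership question from Example 3.3.1 can also be settled numerically: a vector is in span(a, b) exactly when a least squares fit of it onto the columns a and b leaves zero residual. The sketch below is an added illustration; the helper function in_span is not from the text.

```python
import numpy as np

a = np.array([1., 3., 4.])
b = np.array([3., 0., 1.])
M = np.column_stack([a, b])

def in_span(M, x, tol=1e-10):
    # x is in the span of the columns of M if the least squares residual is zero
    coeffs, *_ = np.linalg.lstsq(M, x, rcond=None)
    return np.allclose(M @ coeffs, x, atol=tol), coeffs

print(in_span(M, np.array([5., 6., 9.])))   # (True, array([2., 1.]))  -- part (a)
print(in_span(M, np.array([2., 4., 6.])))   # (False, ...)             -- part (b)
```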

Exercises

1. Six views of matrix multiplication: Let $A_{m\times k}$, $B_{k\times n}$, and $C_{m\times n}$ be matrices such that AB = C.

a. Express the first column of C as a linear combination of the columns of A.
b. Express the first column of C as a matrix-vector product.
c. Express C as a sum of outer products.
d. Express the first row of C as a linear combination of the rows of B.
e. Express the first row of C as a matrix-vector product.
f. Express the element $C_{ij}$ as an inner product of row or column vectors from A and B.

2. Determine whether or not the vectors

1 0 2 x1 = 3 , x2 = 1 , x3 = 1 1 1 0

are linearly independent.

1 3 3. Let a = 3 and b = 0. 4 1

a. Show that the zero vector, $\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$, is in the span(a, b).
b. Determine whether or not the vector $\begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}$ is in the span(a, b).

4. Prove that the span of vectors is a subspace by showing that it satisfies the three conditions from Definition 3.3.2. You can simply show this fact for the span of two vectors and notice how the concept will hold for more than two vectors.

5. True/False Mark each statement as true or false. Justify your response.

• If Ax = b has a solution then b can be written as a linear combina- tion of the columns of A. • If Ax = b has a solution then b is in the span of the columns of A.

• If the vectors v1, v2, and v3 form a linearly dependent set, then v1 is in the span(v2, v3).

CHAPTER 4

BASIS AND CHANGE OF BASIS

When we think of coordinate pairs, or coordinate triplets, we tend to think of them as points on a grid where each axis represents one of the coordinate directions:

[Figure: the elementary basis vectors e1 and e2 and the axes they span, span(e1) and span(e2).]

When we think of our data points this way, we are considering them as linear combinations of the elementary basis vectors
$$e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \quad \text{and} \quad e_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.$$
For example, the point (2, 3) is written as
$$\begin{pmatrix} 2 \\ 3 \end{pmatrix} = 2\begin{pmatrix} 1 \\ 0 \end{pmatrix} + 3\begin{pmatrix} 0 \\ 1 \end{pmatrix} = 2e_1 + 3e_2. \qquad (4.1)$$

We consider the coefficients (the scalars 2 and 3) in this linear combination as coordinates in the basis B1 = {e1, e2}. The coordinates, in essence, tell us how much “information” from the vector/point (2, 3) lies along each basis direction: to create this point, we must travel 2 units along the direction of e1 and then 3 units along the direction of e2. We can also view Equation 4.1 as a way to separate the vector (2, 3) into orthogonal components. Each component is an orthogonal projection of the vector onto the span of the corresponding basis vector. The orthogonal projection of vector a onto the span of another vector v is simply the closest point to a contained on the span(v), found by “projecting” a toward v at a 90° angle. Figure 4.1 shows this explicitly for a = (2, 3).

[Figure 4.1: Orthogonal Projections onto basis vectors. The point a = (2, 3) is decomposed into its orthogonal projections onto span(e1) and span(e2).]

Definition 4.0.3: Elementary Basis

For any vector $a = (a_1, a_2, \dots, a_n)$, the basis $B = \{e_1, e_2, \dots, e_n\}$ (recall $e_i$ is the ith column of the identity matrix $I_n$) is the elementary basis, and a can be written in this basis using the coordinates $a_1, a_2, \dots, a_n$ as follows:
$$a = a_1e_1 + a_2e_2 + \dots + a_ne_n.$$

The elementary basis B1 is convenient for many reasons, one being its orthonormality:

$$e_1^Te_1 = e_2^Te_2 = 1 \qquad e_1^Te_2 = e_2^Te_1 = 0$$

However, there are many (infinitely many, in fact) ways to represent the data points on different axes. If I wanted to view this data in a different way, I could use a different basis. Let’s consider, for example, the following orthonormal basis, drawn in green over the original grid in Figure 4.2:

$$B_2 = \{v_1, v_2\} = \left\{\frac{\sqrt{2}}{2}\begin{pmatrix} 1 \\ 1 \end{pmatrix},\ \frac{\sqrt{2}}{2}\begin{pmatrix} 1 \\ -1 \end{pmatrix}\right\}$$

[Figure 4.2: New basis vectors, v1 and v2, shown on the original plane along with span(v1), span(v2), span(e1), and span(e2).]

The scalar multipliers $\frac{\sqrt{2}}{2}$ are simply normalizing factors so that the basis vectors have unit length. You can convince yourself that this is an orthonormal basis by confirming that

$$v_1^Tv_1 = v_2^Tv_2 = 1 \qquad v_1^Tv_2 = v_2^Tv_1 = 0$$

If we want to change the basis from the elementary B1 to the new green basis vectors in B2, we need to determine a new set of coordinates that direct us to the point using the green basis vectors as a frame of reference. In other words we need to determine (α1, α2) such that travelling α1 units along the direction v1 and then α2 units along the direction v2 will lead us to the point in question. For the point (2, 3) that means
$$\begin{pmatrix} 2 \\ 3 \end{pmatrix} = \alpha_1v_1 + \alpha_2v_2 = \alpha_1\frac{\sqrt{2}}{2}\begin{pmatrix} 1 \\ 1 \end{pmatrix} + \alpha_2\frac{\sqrt{2}}{2}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$
This is merely a system of equations Va = b:

$$\frac{\sqrt{2}}{2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$$

The 2 × 2 matrix V on the left-hand side has linearly independent columns and thus has an inverse. In fact, V is an orthonormal matrix which means its inverse is its transpose. Multiplying both sides of the equation by $V^{-1} = V^T$ yields the solution
$$a = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = V^Tb = \begin{pmatrix} \frac{5\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} \end{pmatrix}$$

This result tells us that in order to reach the red point (formerly known as (2, 3) in our previous basis), we should travel $\frac{5\sqrt{2}}{2}$ units along the direction of v1 and then $-\frac{\sqrt{2}}{2}$ units along the direction of v2 (note that v2 points toward the southeast corner and we want to move northwest, hence the coordinate is negative). Another way (a more mathematical way) to say this is that the length of the orthogonal projection of a onto the span of v1 is $\frac{5\sqrt{2}}{2}$, and the length of the orthogonal projection of a onto the span of v2 is $-\frac{\sqrt{2}}{2}$. While it may seem that these are difficult distances to plot, they work out quite well if we examine our drawing in Figure 4.2, because the diagonal of each square is $\sqrt{2}$. In the same fashion, we can re-write all 3 of the red points on our graph in the new basis by solving the same system simultaneously for all the points. Let B be a matrix containing the original coordinates of the points and let A be a matrix containing the new coordinates:

$$B = \begin{pmatrix} -4 & 2 & 5 \\ -2 & 3 & 2 \end{pmatrix} \qquad A = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \alpha_{13} \\ \alpha_{21} & \alpha_{22} & \alpha_{23} \end{pmatrix}$$

Then the new data coordinates on the rotated plane can be found by solving:

VA = B

And thus
$$A = V^TB = \frac{\sqrt{2}}{2}\begin{pmatrix} -6 & 5 & 7 \\ -2 & -1 & 3 \end{pmatrix}$$
Using our new basis vectors, our alternative view of the data is that in Figure 4.3. In the above example, we changed our basis from the original elementary basis to a new orthogonal basis which provides a different view of the data. All of this amounts to a rotation of the data around the origin. No real information has been lost - the points maintain their distances from each other in nearly every distance metric. Our new variables, v1 and v2, are linear combinations of our original variables e1 and e2; thus, we can transform the data back to its original coordinate system by again solving a linear system (in this example, we’d simply multiply the new coordinates again by V). In general, we can change bases using the procedure outlined in Theorem 4.0.1.

[Figure 4.3: Points plotted in the new basis, B2, with axes span(v1) and span(v2).]

Theorem 4.0.1: Changing Bases

Given a matrix of coordinates (in columns), A, in some basis, B1 = {x1, x2, ... , xn}, we can change the basis to B2 = {v1, v2, ... , vn} with the new set of coordinates in a matrix B by solving the system

XA = VB where X and V are matrices containing (as columns) the basis vectors from B1 and B2 respectively. Note that when our original basis is the elementary basis, X = I, our system reduces to A = VB. When our new basis vectors are orthonormal, the solution to this system is simply B = VTA.
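The change of basis in this chapter's example can be reproduced in a few lines of code. The sketch below (added for illustration) forms V from the basis B2 and recovers the new coordinates via V^T B, as in Theorem 4.0.1.

```python
import numpy as np

s = np.sqrt(2) / 2
V = np.array([[s,  s],
              [s, -s]])            # columns are the new basis vectors v1, v2

b = np.array([2., 3.])             # coordinates in the elementary basis
a = V.T @ b                        # new coordinates, since V is orthogonal

print(a)                           # [ 3.5355, -0.7071 ], i.e. (5*sqrt(2)/2, -sqrt(2)/2)
print(V @ a)                       # back to [2., 3.]: no information was lost

B = np.array([[-4., 2., 5.],
              [-2., 3., 2.]])
print(V.T @ B)                     # all three points in the new basis at once
```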

Definition 4.0.4: Basis Terminology

A basis for Rn can be any collection of n linearly independent vectors in Rn; n is said to be the dimension of the vector space Rn. When the basis vectors are orthonormal (as they were in our example), the collection is called an orthonormal basis.

The preceding discussion dealt entirely with bases for Rn (our example was for points in R2). However, we will need to consider bases for subspaces of Rn. Recall that the span of two linearly independent vectors in R3 is a plane. This plane is a 2-dimensional subspace of R3. Its dimension is 2 because 2 basis vectors are required to represent this space. However, not all points from R3 can be written in this basis - only those points which exist on the plane. In the next chapter, we will discuss how to proceed in a situation where the point we’d like to represent does not actually belong to the subspace we are interested in. This is the foundation for Least Squares.

Exercises

1. Show that the vectors $v_1 = \begin{pmatrix} 3 \\ 1 \end{pmatrix}$ and $v_2 = \begin{pmatrix} -2 \\ 6 \end{pmatrix}$ are orthogonal. Create an orthonormal basis for R2 using these two direction vectors.

2. Consider a1 = (1, 1) and a2 = (0, 1) as coordinates for points in the elementary basis. Write the coordinates of a1 and a2 in the orthonormal basis found in exercise 1. Draw a picture which reflects the old and new basis vectors.

3. Write the orthonormal basis vectors from exercise 1 as linear combinations of the original elementary basis vectors.

4. What is the length of the orthogonal projection of a1 onto v1?

CHAPTER 5

LEAST SQUARES

The least squares problem arises in almost all areas where mathematics is applied. Statistically, the idea is to find an approximate mathematical relationship between predictor and target variables such that the sum of squared errors between the true value and the approximation is minimized. In two dimensions, the goal would be to develop a line as depicted in Figure 5.1 such that the sum of squared vertical distances (the residuals, in green) between the true data (in red) and the mathematical prediction (in blue) is minimized.

[Figure 5.1: Least Squares Illustrated in 2 dimensions. The residual r1 is the vertical distance between an observed point (x1, y1) and its prediction (x1, ŷ1) on the line.]

If we let r be a vector containing the residual values (r1, r2, ... , rn) then the

sum of squared residuals can be written in linear algebraic notation as

$$\sum_{i=1}^{n}r_i^2 = r^Tr = (y - \hat{y})^T(y - \hat{y}) = \|y - \hat{y}\|^2$$

Suppose we want to regress our target variable y on p predictor variables, $x_1, x_2, \dots, x_p$. If we have n observations, then the ideal situation would be to find a vector of parameters β containing an intercept, $\beta_0$, along with p slope parameters, $\beta_1, \dots, \beta_p$ such that

$$\underbrace{\begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix}}_{X}\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{\beta} = \underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{y} \qquad (5.1)$$

With many more observations than variables, this system of equations will not, in practice, have a solution. Thus, our goal becomes finding a vector of parameters $\hat{\beta}$ such that $X\hat{\beta} = \hat{y}$ comes as close to y as possible. Using the design matrix, X, the least squares solution $\hat{\beta}$ is the one for which

$$\|y - X\hat{\beta}\|^2 = \|y - \hat{y}\|^2$$

is minimized. Theorem 5.0.2 characterizes the solution to the least squares problem.

Theorem 5.0.2: Least Squares Problem and Solution

For an n × m matrix X and n × 1 vector y, let $r = X\hat{\beta} - y$. The least squares problem is to find a vector $\hat{\beta}$ that minimizes the quantity
$$\sum_{i=1}^{n}r_i^2 = \|y - X\hat{\beta}\|^2.$$

Any vector $\hat{\beta}$ which provides a minimum value for this expression is called a least-squares solution.

• The set of all least squares solutions is precisely the set of solutions to the so-called normal equations,

$$X^TX\hat{\beta} = X^Ty.$$

• There is a unique least squares solution if and only if rank(X) = m (i.e. linear independence of variables or no perfect multicollinearity!), in which case $X^TX$ is invertible and the solution is given by
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$

Example 5.0.2: Solving a Least Squares Problem

In 2014, data was collected regarding the percentage of linear algebra exercises done by students and the grade they received on their examination. Based on this data, what is the expected effect of completing an additional 10% of the exercises on a student's exam grade?

ID   % of Exercises   Exam Grade
 1         20             55
 2        100            100
 3         90            100
 4         70             70
 5         50             75
 6         10             25
 7         30             60

To find the least squares regression line, we want to solve the equation Xβ = y:
$$\begin{pmatrix} 1 & 20 \\ 1 & 100 \\ 1 & 90 \\ 1 & 70 \\ 1 & 50 \\ 1 & 10 \\ 1 & 30 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} 55 \\ 100 \\ 100 \\ 70 \\ 75 \\ 25 \\ 60 \end{pmatrix}$$
This system is obviously inconsistent. Thus, we want to find the least squares solution $\hat{\beta}$ by solving $X^TX\hat{\beta} = X^Ty$:

$$\begin{pmatrix} 7 & 370 \\ 370 & 26900 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} 485 \\ 30800 \end{pmatrix}$$
Now, since multicollinearity was not a problem, we can simply find the inverse of $X^TX$ and multiply it on both sides of the equation:

$$\begin{pmatrix} 7 & 370 \\ 370 & 26900 \end{pmatrix}^{-1} = \begin{pmatrix} 0.5233 & -0.0072 \\ -0.0072 & 0.0001 \end{pmatrix}$$

and so

$$\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} 0.5233 & -0.0072 \\ -0.0072 & 0.0001 \end{pmatrix}\begin{pmatrix} 485 \\ 30800 \end{pmatrix} = \begin{pmatrix} 32.1109 \\ 0.7033 \end{pmatrix}$$

Thus, for each additional 10% of exercises completed, exam grades are expected to increase by about 7 points. The data along with the regression line

$$\text{grade} = 32.1109 + 0.7033\,(\text{percent exercises})$$

is shown below.

[Figure: scatter plot of the exercise/grade data with the fitted regression line.]
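The regression coefficients in Example 5.0.2 can be reproduced numerically. The sketch below (an added illustration, not part of the original example) solves the normal equations directly and with numpy's least squares routine.

```python
import numpy as np

percent = np.array([20., 100., 90., 70., 50., 10., 30.])
grade   = np.array([55., 100., 100., 70., 75., 25., 60.])

X = np.column_stack([np.ones_like(percent), percent])   # design matrix with intercept

# Solve the normal equations X^T X beta = X^T y directly ...
beta_normal = np.linalg.solve(X.T @ X, X.T @ grade)

# ... or use the built-in least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, grade, rcond=None)

print(beta_normal)   # approximately [32.11, 0.7033]
print(beta_lstsq)    # same result
```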

Why the normal equations? The solution of the normal equations has a nice geometrical interpretation. It involves the idea of orthogonal projection, a concept which will be useful for understanding future topics. In order for a system of equations, Ax = b to have a solution, b must be a linear combination of columns of A. That is simply the definition of matrix multiplication and equality. If A is m × n then

Ax = b =⇒ b = x1A1 + x2A2 + ··· + xnAn.

As discussed in Chapter 3, another way to say this is that b is in the span of the columns of A. The span of the columns of A is called the column space of A. In Least-Squares applications, the problem is that b is not in the column space of A. In essence, we want to find the vector $\hat{b}$ that is closest to b but exists in the column space of A. Then we know that $A\hat{x} = \hat{b}$ does have a unique solution, and that the right hand side of the equation comes as close to the original data as possible. By multiplying both sides of the original equation by $A^T$, what we are really doing is projecting b orthogonally onto the column space of A. We should think of the column space as a flat surface (perhaps a plane) in space, and b as a point that exists off of that flat surface. There are many ways to draw a line from a point to a plane, but the shortest distance would always be travelled perpendicular (orthogonal) to the plane. You may recall from undergraduate calculus or physics that a normal vector to a plane is a vector that is orthogonal to that plane. The normal equations, $A^TAx = A^Tb$, help us find the closest point to b that belongs to the column space of A by means of an orthogonal projection. This geometrical development is depicted in Figure 5.2.

[Figure 5.2: The normal equations yield the vector $\hat{b}$ in the column space of A which is closest to the original right hand side vector b. The projection is $\hat{b} = A\hat{x} = A(A^TA)^{-1}A^Tb$, and the residual $\|b - \hat{b}\|$ is orthogonal to the column space of A.]

CHAPTER 6

EIGENVALUES AND EIGENVECTORS

Definition 6.0.5: Eigenvalues and Eigenvectors

For a square matrix An×n, a scalar λ is called an eigenvalue of A if there is a nonzero vector x such that

Ax = λx.

Such a vector, x is called an eigenvector of A corresponding to the eigenvalue λ. We sometimes refer to the pair (λ, x) as an eigenpair.

Eigenvalues and eigenvectors have numerous applications throughout mathematics, statistics and other fields. First, we must get a handle on the definition, which we will do through some examples.

Determine whether $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$ is an eigenvector of $A = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}$ and if so, find the corresponding eigenvalue. To determine whether x is an eigenvector, we want to compute Ax and observe whether the result is a multiple of x. If this is the case, then the multiplication factor is the corresponding eigenvalue:

$$Ax = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \end{pmatrix} = 4\begin{pmatrix} 1 \\ 1 \end{pmatrix}$$

From this it follows that x is an eigenvector of A and the corresponding

eigenvalue is λ = 4.

Is the vector $y = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$ an eigenvector?

$$Ay = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} 2 \\ 2 \end{pmatrix} = \begin{pmatrix} 8 \\ 8 \end{pmatrix} = 4\begin{pmatrix} 2 \\ 2 \end{pmatrix} = 4y$$

Yes, it is and it corresponds to the same eigenvalue, λ = 4

Example 6.0.3 shows a very important property of eigenvalue-eigenvector pairs. If (λ, x) is an eigenpair then any scalar multiple of x is also an eigenvector corresponding to λ. To see this, let (λ, x) be an eigenpair for a matrix A (which means that Ax = λx) and let y = αx be any scalar multiple of x. Then we have,

$$Ay = A(\alpha x) = \alpha(Ax) = \alpha(\lambda x) = \lambda(\alpha x) = \lambda y$$

which shows that y (or any scalar multiple of x) is also an eigenvector associated with the eigenvalue λ. Thus, for each eigenvalue we have infinitely many eigenvectors. In the preceding example, the eigenvectors associated with λ = 4 will be scalar multiples of $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$. You may recall from Chapter 3 that the set of all scalar multiples of x is denoted span(x). The span(x) in this example represents the eigenspace of λ. Note: when using software to compute eigenvectors, it is standard practice for the software to provide the normalized/unit eigenvector. In some situations, an eigenvalue can have multiple eigenvectors which are linearly independent. The number of linearly independent eigenvectors associated with an eigenvalue is called the geometric multiplicity of the eigenvalue. Example 6.0.4 clarifies this concept.

Example 6.0.4: Geometric Multiplicity
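In software, eigenvalues and eigenvectors are typically computed with a routine such as numpy's eig. The sketch below (added for illustration) applies it to the matrix from Example 6.0.3 and, as noted above, the routine returns unit eigenvectors.

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 3.]])

vals, vecs = np.linalg.eig(A)
print(vals)            # the eigenvalues 4 and 2 (the order may vary)
print(vecs[:, 0])      # a unit eigenvector paired with vals[0]

# Verify the defining equation A x = lambda x for the first eigenpair
print(np.allclose(A @ vecs[:, 0], vals[0] * vecs[:, 0]))   # True
```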

Consider the matrix $A = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}$. It should be straightforward to see that $x_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $x_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$ are both eigenvectors corresponding to the eigenvalue λ = 3. x1 and x2 are linearly independent, therefore the geometric multiplicity of λ = 3 is 2.

What happens if we take a linear combination of x1 and x2? Is that also

an eigenvector? Consider $y = \begin{pmatrix} 2 \\ 3 \end{pmatrix} = 2x_1 + 3x_2$. Then

$$Ay = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 6 \\ 9 \end{bmatrix} = 3\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 3y$$

shows that y is also an eigenvector associated with λ = 3. The eigenspace corresponding to λ = 3 is the set of all linear combinations of x1 and x2, i.e. the span(x1, x2).

We can generalize the result that we saw in Example 6.0.4 for any square matrix and any geometric multiplicity. Let An×n have an eigenvalue λ with geometric multiplicity k. This means there are k linearly independent eigenvectors, x1, x2, ... , xk, such that Axi = λxi for each eigenvector xi. Now if we let y be a vector in span(x1, x2, ... , xk) then y is some linear combination of the xi's:
$$y = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k$$
Observe what happens when we multiply y by A:

Ay = A(α1x1 + α2x2 + ··· + αkxk)

= α1(Ax1) + α2(Ax2) + ··· + αk(Axk)

= α1(λx1) + α2(λx2) + ··· + αk(λxk)

= λ(α1x1 + α2x2 + ··· + αkxk) = λy
which shows that y (or any vector in span(x1, x2, ... , xk)) is an eigenvector of A corresponding to λ. This proof allows us to formally define the concept of an eigenspace. Definition 6.0.6: Eigenspace

Let A be a square matrix and let λ be an eigenvalue of A. The set of all eigenvectors corresponding to λ, together with the zero vector, is called the eigenspace of λ. The number of basis vectors required to form the eigenspace is called the geometric multiplicity of λ.

Now, let's attempt the eigenvalue problem from the other side. Given an eigenvalue, we will find the corresponding eigenspace in Example 6.0.5.

Example 6.0.5: Eigenvalues and Eigenvectors

Show that λ = 5 is an eigenvalue of $A = \begin{bmatrix} 1 & 2 \\ 4 & 3 \end{bmatrix}$ and determine the eigenspace of λ = 5.

Attempting the problem from this angle requires slightly more work. We want to find a vector x such that Ax = 5x. Setting this up, we have:

Ax = 5x.

What we want to do is move both terms to one side and factor out the vector x. In order to do this, we must use an identity matrix, otherwise the equation wouldn’t make sense (we’d be subtracting a constant from a matrix).

$$Ax - 5x = 0$$
$$(A - 5I)x = 0$$
$$\left(\begin{bmatrix} 1 & 2 \\ 4 & 3 \end{bmatrix} - \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix}\right)\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
$$\begin{bmatrix} -4 & 2 \\ 4 & -2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Clearly, the matrix A − λI is singular (i.e. does not have linearly independent rows/columns). This will always be the case by the definition Ax = λx, and is often used as an alternative definition. In order to solve this homogeneous system of equations, we use Gaussian elimination:

$$\left[\begin{array}{cc|c} -4 & 2 & 0 \\ 4 & -2 & 0 \end{array}\right] \longrightarrow \left[\begin{array}{cc|c} 1 & -\tfrac{1}{2} & 0 \\ 0 & 0 & 0 \end{array}\right]$$

This implies that any vector x for which $x_1 - \tfrac{1}{2}x_2 = 0$ satisfies the eigenvector equation. We can pick any such vector, for example $x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, and say that the eigenspace of λ = 5 is

$$\text{span}\left\{\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right\}$$

If we didn't know either an eigenvalue or eigenvector of A and instead wanted to find both, we would first find eigenvalues by determining all possible λ such that A − λI is singular and then find the associated eigenvectors. There are some tricks which allow us to do this by hand for 2 × 2 and 3 × 3 matrices, but beyond that the computation is not worth the effort by hand. Now that we have a good understanding of how to interpret eigenvalues and eigenvectors algebraically, let's take a look at some of the things that they can do, starting with one important fact. Theorem 6.0.3: Eigenvalues and the Trace of a Matrix

Let A be an n × n matrix with eigenvalues λ1, λ2, ... , λn. Then the sum of the eigenvalues is equal to the trace of the matrix (recall that the trace of a matrix is the sum of its diagonal elements).

$$\text{Trace}(A) = \sum_{i=1}^{n} \lambda_i.$$

Example 6.0.6: Trace of Covariance Matrix

Suppose that we had a collection of n observations on p variables, x1, x2, ... , xp. After centering the data to have zero mean, we can compute the sample variances as:

$$\text{var}(x_i) = \frac{1}{n-1}x_i^Tx_i = \frac{1}{n-1}\|x_i\|^2$$
These variances form the diagonal elements of the sample covariance matrix,
$$\Sigma = \frac{1}{n-1}X^TX$$
Thus, the total variance of this data is

$$\frac{1}{n-1}\sum_{i=1}^{p}\|x_i\|^2 = \text{Trace}(\Sigma) = \sum_{i=1}^{p}\lambda_i.$$
In other words, the sum of the eigenvalues of a covariance matrix provides the total variance in the variables x1, ... , xp.
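A small numerical illustration of Theorem 6.0.3 (a sketch with a made-up centered data matrix, assuming SAS/IML is available):

proc iml;
   /* Hypothetical centered data matrix X (n = 4 observations, p = 2 variables) */
   X = {-1 -2,
         0  1,
         2  0,
        -1  1};
   n = nrow(X);

   S = X`*X / (n-1);            /* sample covariance matrix (data already centered) */
   call eigen(lambda, V, S);    /* S is symmetric */

   totalVar = trace(S);         /* sum of the diagonal elements (the variances) */
   sumEig   = sum(lambda);      /* sum of the eigenvalues                       */
   print S lambda totalVar sumEig;   /* totalVar and sumEig agree               */
quit;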

6.1 Diagonalization

Let's take another look at Example 6.0.5. We already showed that λ1 = 5 and $v_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ is an eigenpair for the matrix $A = \begin{bmatrix} 1 & 2 \\ 4 & 3 \end{bmatrix}$. You may verify that λ2 = −1 and $v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$ is another eigenpair. Suppose we create a matrix of eigenvectors:
$$V = (v_1, v_2) = \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix}$$
and a diagonal matrix containing the corresponding eigenvalues:
$$D = \begin{bmatrix} 5 & 0 \\ 0 & -1 \end{bmatrix}$$
Then it is easy to verify that AV = VD:
$$AV = \begin{bmatrix} 1 & 2 \\ 4 & 3 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix} = \begin{bmatrix} 5 & -1 \\ 10 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix}\begin{bmatrix} 5 & 0 \\ 0 & -1 \end{bmatrix} = VD$$

If the columns of V are linearly independent, which they are in this case, we can write: V−1AV = D What we have just done is develop a way to transform a matrix A into a diagonal matrix D. This is known as diagonalization. Definition 6.1.1: Diagonalizable

An n × n matrix A is said to be diagonalizable if there exists an invertible matrix P and a diagonal matrix D such that

P−1AP = D

This is possible if and only if the matrix A has n linearly independent eigenvectors (known as a complete set of eigenvectors). The matrix P is then the matrix of eigenvectors and the matrix D contains the corresponding eigenvalues on the diagonal.

Determining whether or not a matrix An×n is diagonalizable is a little tricky. Having rank(A) = n is not a sufficient condition for having n linearly independent eigenvectors. The following matrix stands as a counterexample:
$$A = \begin{bmatrix} -3 & 1 & -3 \\ 20 & 3 & 10 \\ 2 & -2 & 4 \end{bmatrix}$$
This matrix has full rank but only two linearly independent eigenvectors. Fortunately, for our primary application of diagonalization, we will be dealing with a symmetric matrix, which can always be diagonalized. In fact, symmetric matrices have an additional property which makes this diagonalization particularly nice, as we will see in Chapter 7.
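Returning to the worked example above, the factorization V−1AV = D is easy to verify numerically. A minimal SAS/IML sketch using the eigenpairs already found for A:

proc iml;
   A = {1 2,
        4 3};
   V = {1  1,
        2 -1};             /* columns are the eigenvectors v1, v2 */
   D = diag({5 -1});       /* corresponding eigenvalues on the diagonal */

   print (A*V) (V*D);      /* AV = VD */
   print (inv(V)*A*V);     /* recovers D, since the columns of V are independent */
quit;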

6.2 Geometric Interpretation of Eigenvalues and Eigenvectors

Since any scalar multiple of an eigenvector is still an eigenvector, let's consider for the present discussion unit eigenvectors x of a square matrix A - those with length ‖x‖ = 1. By the definition, we know that Ax = λx: geometrically, if we multiply x by A, the resulting vector points in the same direction as x. It turns out that multiplying the unit circle or unit sphere by a matrix A carves out an ellipse or an ellipsoid. We can see eigenvectors visually by watching how multiplication by a matrix A changes the unit vectors. Figure 6.1 illustrates this. The blue arrows represent (a sampling of) the unit circle, all vectors x for which ‖x‖ = 1. The red arrows represent the image of the blue arrows after multiplication by A, or Ax for each vector x. We can see how almost every vector changes direction when multiplied by A, except the eigenvector directions, which are marked in black. Such a picture provides a nice geometrical interpretation of eigenvectors for a general matrix, but we will see in Chapter 7 just how powerful these eigenvector directions are when we look at symmetric matrices.


Figure 6.1: Visualizing eigenvectors (in black) using the image (in red) of the unit sphere (in blue) after multiplication by A.

Exercises

1. Show that v is an eigenvector of A and find the corresponding eigenvalue:

a. $A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$, $v = \begin{bmatrix} 3 \\ -3 \end{bmatrix}$
b. $A = \begin{bmatrix} -1 & 1 \\ 6 & 0 \end{bmatrix}$, $v = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$
c. $A = \begin{bmatrix} 4 & -2 \\ 5 & -7 \end{bmatrix}$, $v = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$

2. Show that λ is an eigenvalue of A and list two eigenvectors corresponding to this eigenvalue:

a. $A = \begin{bmatrix} 0 & 4 \\ -1 & 5 \end{bmatrix}$, λ = 4
b. $A = \begin{bmatrix} 0 & 4 \\ -1 & 5 \end{bmatrix}$, λ = 1

3. Based on the eigenvectors you found in exercise 2, can the matrix A be diagonalized? Why or why not? If diagonalization is possible, explain how it would be done.

CHAPTER 7

PRINCIPAL COMPONENTS ANALYSIS

We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves the analysis of eigenvalues and eigenvectors of the covariance or correlation matrix. Its development relies on the following important facts: Theorem 7.0.1: Diagonalization of Symmetric Matrices

All n × n real valued symmetric matrices (like the covariance and correlation matrix) have two very important properties: 1. They have a complete set of n linearly independent eigenvectors, {v1, ... , vn}, corresponding to eigenvalues

λ1 ≥ λ2 ≥ · · · ≥ λn.

2. Furthermore, these eigenvectors can be chosen to be orthonormal so that if V = [v1| ... |vn] then VTV = I

or equivalently, V−1 = VT.

Letting D be a diagonal matrix with Dii = λi, by the definition of eigenvalues and eigenvectors we have for any symmetric matrix S,

SV = VD

Thus, any symmetric matrix S can be diagonalized in the following way: VTSV = D

Covariance and correlation matrices (when there is no perfect multicollinearity in the variables) have the additional property that all of their eigenvalues are positive (nonzero). They are positive definite matrices.

Now that we know we have a complete set of eigenvectors, it is common to order them according to the magnitude of their corresponding eigenvalues. From here on out, we will use (λ1, v1) to represent the largest eigenvalue of a matrix and its corresponding eigenvector. When working with a covariance or correlation matrix, this eigenvector associated with the largest eigenvalue is called the first principal component and points in the direction for which the variance of the data is maximal. Example 7.0.1 illustrates this point. Example 7.0.1: Eigenvectors of the Covariance Matrix

Suppose we have a matrix of data for 10 individuals on 2 variables, x1 and x2. Plotted on a plane, the data appears as follows:

[Scatterplot of the 10 observations on the x1 and x2 axes.]

Our data matrix for these points is:
$$X = \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 2 & 4 \\ 3 & 1 \\ 4 & 4 \\ 5 & 2 \\ 6 & 4 \\ 6 & 6 \\ 7 & 6 \\ 8 & 8 \end{bmatrix}$$
The means of the variables in X are:

$$\bar{x} = \begin{bmatrix} 4.4 \\ 3.7 \end{bmatrix}.$$

When thinking about variance directions, our first step should be to center the data so that it has mean zero. Eigenvectors measure the spread of data around the origin, while variance measures the spread of data around the mean; thus, we need to equate the mean with the origin. To center the data, we simply compute
$$X_c = X - e\bar{x}^T = \begin{bmatrix} -3.4 & -2.7 \\ -2.4 & -2.7 \\ -2.4 & 0.3 \\ -1.4 & -2.7 \\ -0.4 & 0.3 \\ 0.6 & -1.7 \\ 1.6 & 0.3 \\ 1.6 & 2.3 \\ 2.6 & 2.3 \\ 3.6 & 4.3 \end{bmatrix},$$
where e is a column vector of ones.

Examining the new centered data, we find that we've only translated our data in the plane - we haven't distorted it in any fashion.


Thus the covariance matrix is:
$$\Sigma = \frac{1}{9}(X_c^TX_c) = \begin{bmatrix} 5.6 & 4.8 \\ 4.8 & 6.0111 \end{bmatrix}$$
The eigenvalue and eigenvector pairs of Σ are (rounded to 2 decimal places) as follows:
$$(\lambda_1, v_1) = \left(10.6100,\; \begin{bmatrix} 0.69 \\ 0.72 \end{bmatrix}\right) \quad \text{and} \quad (\lambda_2, v_2) = \left(1.0012,\; \begin{bmatrix} -0.72 \\ 0.69 \end{bmatrix}\right)$$
Let's plot the eigenvector directions on the same graph:

[Scatterplot of the centered data with the eigenvector directions v1 and v2 overlaid.]

The eigenvector v1 is called the first principal component. It is the direction along which the variance of the data is maximal. The eigenvector v2 is the second principal component. In general, the second principal component is the direction, orthogonal to the first, along which the variance of the data is maximal (in two dimensions, there is only one direction possible).
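The computations in Example 7.0.1 can be reproduced with a few lines of SAS/IML (a sketch; PROC PRINCOMP with the COV option would give the same directions, and the signs of the eigenvectors may be flipped relative to the text):

proc iml;
   /* Data from Example 7.0.1: 10 observations on x1 and x2 */
   X = {1 1, 2 1, 2 4, 3 1, 4 4, 5 2, 6 4, 6 6, 7 6, 8 8};
   n = nrow(X);

   xbar = X[:,];                    /* column means: 4.4 and 3.7 */
   Xc   = X - repeat(xbar, n, 1);   /* centered data             */

   S = Xc`*Xc / (n-1);              /* covariance matrix         */
   call eigen(lambda, V, S);        /* eigenvalues ~ 10.61 and 1.00 */

   print xbar S lambda V;
quit;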

Why is this important? Let’s consider what we’ve just done. We started with two variables, x1 and x2, which appeared to be correlated. We then derived new variables, v1 and v2, which are linear combinations of the original variables:

v1 = 0.69x1 + 0.72x2 (7.1)

v2 = −0.72x1 + 0.69x2 (7.2)

These new variables are completely uncorrelated. To see this, let’s represent our data according to the new variables - i.e. let’s change the basis from B1 = [x1, x2] to B2 = [v1, v2]. Example 7.0.2: The Principal Component Basis

Let’s express our data in the basis defined by the principal components. We want to find coordinates (in a 2 × 10 matrix A) such that our original (centered) data can be expressed in terms of principal components. This is done by solving for A in the following equation (see Chapter 4 and note that the rows of X define the points rather than the columns):

$$X_c = AV^T \tag{7.3}$$
$$\begin{bmatrix} -3.4 & -2.7 \\ -2.4 & -2.7 \\ -2.4 & 0.3 \\ -1.4 & -2.7 \\ -0.4 & 0.3 \\ 0.6 & -1.7 \\ 1.6 & 0.3 \\ 1.6 & 2.3 \\ 2.6 & 2.3 \\ 3.6 & 4.3 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \\ a_{41} & a_{42} \\ a_{51} & a_{52} \\ a_{61} & a_{62} \\ a_{71} & a_{72} \\ a_{81} & a_{82} \\ a_{91} & a_{92} \\ a_{10,1} & a_{10,2} \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} \tag{7.4}$$

Conveniently, our new basis is orthonormal, meaning that V is an orthogonal matrix, so A = XcV.

The new data coordinates reflect a simple rotation of the data around the origin:

[Scatterplot of the data expressed in the principal component basis, with axes v1 and v2.]

Visually, we can see that the new variables are uncorrelated. You may wish to confirm this by calculating the covariance. In fact, we can do this in a general sense. If A = XcV is our new data, then the covariance matrix is diagonal:

$$\Sigma_A = \frac{1}{n-1}A^TA = \frac{1}{n-1}(X_cV)^T(X_cV) = \frac{1}{n-1}V^T(X_c^TX_c)V = \frac{1}{n-1}V^T\big((n-1)\Sigma_X\big)V = V^T\Sigma_XV = V^T(VDV^T)V = D$$

where $\Sigma_X = VDV^T$ comes from the diagonalization in Theorem 7.0.1. By changing our variables to principal components, we have managed to "hide" the correlation between x1 and x2 while keeping the spatial relationships between data points intact. Transformation back to variables x1 and x2 is easily done by using the linear relationships in Equations 7.1 and 7.2.
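Continuing the sketch above, the scores A = XcV can be computed directly and the covariance matrix of the scores checked to be diagonal (a SAS/IML sketch):

proc iml;
   X = {1 1, 2 1, 2 4, 3 1, 4 4, 5 2, 6 4, 6 6, 7 6, 8 8};
   n = nrow(X);
   Xc = X - repeat(X[:,], n, 1);

   S = Xc`*Xc / (n-1);
   call eigen(lambda, V, S);

   A   = Xc * V;                  /* new coordinates (scores), A = Xc V      */
   S_A = A`*A / (n-1);            /* covariance of the scores                */
   print S_A lambda;              /* S_A is diagonal, with lambda on the diagonal */
quit;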

7.1 Comparison with Least Squares

In least squares regression, our objective is to maximize the amount of variance explained in our target variable. It may look as though the first principal component from Example 7.0.1 points in the direction of the regression line. This is not the case, however. The first principal component points in the direction of a line which minimizes the sum of squared orthogonal distances between the points and the line. Regressing x2 on x1, on the other hand, provides a line which minimizes the sum of squared vertical distances between points and the line. This is illustrated in Figure 7.1.


Figure 7.1: Principal Components vs. Regression Lines

The first principal component about the mean of a set of points can be represented by that line which most closely approaches the data points. In contrast, linear least squares tries to minimize the distance in the y direction only. Thus, although the two use a similar error metric, linear least squares is a method that treats one dimension of the data preferentially, while PCA treats all dimensions equally.

7.2 Covariance or Correlation Matrix?

Principal components analysis can involve eigenvectors of either the covariance matrix or the correlation matrix. When we perform this analysis on the covariance matrix, the geometric interpretation is simply centering the data and then determining the direction of maximal variance. When we perform this analysis on the correlation matrix, the interpretation is standardizing the data and then determining the direction of maximal variance. The correlation matrix is simply a scaled form of the covariance matrix. In general, these two methods give different results, especially when the scales of the variables are different. The covariance matrix is the default for R. The correlation matrix is the default in SAS. The covariance matrix method is invoked by the option:

proc princomp data=X cov;
var x1--x10;
run;

Choosing between the covariance and correlation matrix can sometimes pose problems. The rule of thumb is that the correlation matrix should be used when the scales of the variables vary greatly; otherwise, the variables with the largest variances will dominate the first principal component. The argument against automatically using correlation matrices is that it is quite a brutal way of standardizing your data.

7.3 Applications of Principal Components

Principal components have a number of applications across many areas of statistics. In the next sections, we will explore their usefulness in the context of dimension reduction. In Chapter 9 we will look at how PCA is used to solve the issue of multicollinearity in biased regression.

7.3.1 PCA for dimension reduction It is quite common for an analyst to have too many variables. There are two different solutions to this problem:

1. Feature Selection: Choose a subset of existing variables to be used in a model.

2. Feature Extraction: Create a new set of features which are combinations of original variables.

Feature Selection Let's think for a minute about feature selection. What are we really doing when we consider a subset of our existing variables? Take the two dimensional data in Example 7.0.2 (while two dimensions rarely necessitate dimension reduction, the geometrical interpretation extends to higher dimensions). The centered data appears as follows:

[Scatterplot of the centered data on the x1 and x2 axes.]

Now say we perform some kind of feature selection (there are a number of ways to do this, chi-square tests for instance) and we determine that the variable x2 is more important than x1. So we throw out x1 and we've reduced the dimensions from p = 2 to k = 1. Geometrically, what does our new data look like? By dropping x1 we set all of those horizontal coordinates to zero. In other words, we project the data orthogonally onto the x2 axis:


(a) Projecting Data Orthogonally (b) New One-Dimensional Data

Figure 7.2: Geometrical Interpretation of Feature Selection

Now, how much information (variance) did we lose with this projection? The total variance in the original data is

$$\|x_1\|^2 + \|x_2\|^2.$$

The variance of our data reduction is

$$\|x_2\|^2.$$

Thus, the proportion of the total information (variance) we’ve kept is

$$\frac{\|x_2\|^2}{\|x_1\|^2 + \|x_2\|^2} = \frac{6.01}{5.6 + 6.01} = 51.7\%.$$
Our reduced dimensional data contains only 51.7% of the variance of the original data. We've lost a lot of information! The fact that feature selection omits variance in our predictor variables does not make it a bad thing! Obviously, getting rid of variables which have no relationship to a target variable (in the case of supervised modeling like prediction and classification) is a good thing. But, in the case of unsupervised learning techniques, where there is no target variable involved, we must be extra careful when it comes to feature selection. In summary,

• Feature Selection is important. Examples include:

– Removing variables which have little to no impact on a target variable in supervised modeling (forward/backward/stepwise selection).
– Removing variables which have obvious strong correlation with other predictors.
– Removing variables that are not interesting in unsupervised learning (for example, you may not want to use the words "the" and "of" when clustering text).

• Feature Selection is an orthogonal projection of the original data onto the span of the variables you choose to keep.

• Feature selection should always be done with care and justification.

– In regression, could create problems of endogeneity (errors correlated with predictors - omitted variable bias).
– For unsupervised modelling, could lose important information.

Feature Extraction PCA is the most common form of feature extraction. The rotation of the space shown in Example 7.0.2 represents the creation of new features which are linear combinations of the original features. If we have p potential variables for a model and want to reduce that number to k, then the first k principal components combine the individual variables in such a way that is guaranteed to capture as much "information" (variance) as possible. Again, take our two-dimensional data as an example. When we reduce our data down to one dimension using principal components, we essentially do the same orthogonal projection that we did in Feature Selection, only in this case we conduct that projection in the new basis of principal components. Recall that for this data, our first principal component v1 was
$$v_1 = \begin{bmatrix} 0.69 \\ 0.72 \end{bmatrix}.$$
Projecting the data onto the first principal component is illustrated in Figure 7.3.


(a) Projecting Data Orthogonally (b) New One-Dimensional Data

Figure 7.3: Illustration of Feature Extraction via PCA
How much variance do we keep with k principal components? The proportion of variance explained by each principal component is the ratio of the corresponding eigenvalue to the sum of the eigenvalues (which gives the total amount of variance in the data). Theorem 7.3.1: Proportion of Variance Explained

The proportion of variance explained by the projection of the data onto principal component vi is
$$\frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j}.$$
Similarly, the proportion of variance explained by the projection of the data onto the first k principal components (k < p) is

$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{p}\lambda_j}$$

In our simple 2-dimensional example we were able to keep
$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{10.61}{10.61 + 1.00} = 91.38\%$$
of our variance in one dimension.
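The proportion-of-variance calculations are easy to script; a minimal SAS/IML sketch using the two eigenvalues from our example:

proc iml;
   /* Eigenvalues from the two-dimensional example */
   lambda = {10.61, 1.00};

   propEach  = lambda / sum(lambda);          /* proportion per component   */
   propCumul = cusum(lambda) / sum(lambda);   /* cumulative proportion      */
   print propEach propCumul;                  /* first entry is about 0.9138 */
quit;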

CHAPTER 8

SINGULAR VALUE DECOMPOSITION (SVD)

The Singular Value Decomposition (SVD) is one of the most important concepts in applied mathematics. It is used for a number of applications including dimension reduction and data analysis. Principal Components Analysis (PCA) is a special case of the SVD. Let's start with the formal definition, and then see how PCA relates to that definition. Definition 8.0.1: Singular Value Decomposition

For any m × n matrix A with rank(A) = r, there are orthogonal matrices Um×m and Vn×n and a diagonal matrix Dr×r = diag(σ1, σ2, ... , σr) such that
$$A = U\underbrace{\begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}}_{m \times n}V^T \quad \text{with} \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0 \tag{8.1}$$

The σi’s are called the nonzero singular values of A. (When r < p = min{m, n} (i.e. when A is not full-rank), A is said to have an additional p − r zero singular values). This factorization is called a singular value decomposition of A, and the columns of U and V are called the left- and right-hand singular vectors for A, respectively.

Properties of the SVD

• The left-hand singular vectors are a set of orthonormal eigenvec- tors for AAT.

• The right-hand singular vectors are a set of orthonormal eigenvec- tors for ATA. 8.1. Resolving a Matrix into Components 63

• The singular values are the square roots of the eigenvalues for ATA and AAT, as these matrices have the same nonzero eigenvalues.

When we studied PCA, one of the goals was to find the new coordinates, or scores, of the data in the principal components basis. If our original (centered or standardized) data was contained in the matrix X and the eigenvectors of the covariance/correlation matrix (XTX) were columns of a matrix V, then to find the scores (call these S) of the observations on the eigenvectors we used the following equation: X = SVT. This equation mimics Equation 8.1 because the matrix VT in Equation 8.1 is also a matrix of eigenvectors for ATA. This means that the principal component scores S are a set of unit eigenvectors for AAT scaled by the singular values in D:
$$S = U\begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}.$$
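The connection between the SVD of the centered data and the eigendecomposition of its covariance matrix can be checked numerically; a SAS/IML sketch using the Chapter 7 data (eigenvector signs may differ between the two routines):

proc iml;
   /* Centered data from Chapter 7 */
   X = {1 1, 2 1, 2 4, 3 1, 4 4, 5 2, 6 4, 6 6, 7 6, 8 8};
   n = nrow(X);
   Xc = X - repeat(X[:,], n, 1);

   /* SVD of the centered data: Xc = U * diag(q) * V` */
   call svd(U, q, V, Xc);

   /* The same V (up to sign) diagonalizes the covariance matrix, and
      the squared singular values divided by (n-1) are its eigenvalues */
   call eigen(lambda, Veig, Xc`*Xc/(n-1));
   print V Veig;
   print (q##2/(n-1)) lambda;
quit;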

8.1 Resolving a Matrix into Components

One of the primary goals of the singular value decomposition is to resolve the data in A into r mutually orthogonal components by writing the matrix factorization as a sum of outer products using the corresponding columns of U and rows of VT:

$$A = U\begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}V^T = \begin{bmatrix} u_1 & u_2 & \cdots & u_m \end{bmatrix}\begin{bmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \end{bmatrix}\begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{bmatrix} = \sigma_1u_1v_1^T + \sigma_2u_2v_2^T + \cdots + \sigma_ru_rv_r^T,$$

where σ1 ≥ σ2 ≥ ··· ≥ σr. For simplicity, let $Z_i = u_iv_i^T$ act as basis matrices for this expansion, so we have
$$A = \sum_{i=1}^{r}\sigma_iZ_i. \tag{8.2}$$
This representation can be regarded as a Fourier expansion. The coefficient (singular value) σi can be interpreted as the proportion of A lying in the

"direction" of Zi. When σi is small, omitting that term from the expansion will cause only a small amount of the information in A to be lost. This fact has important consequences for compression and noise reduction.

8.1.1 Data Compression We've already seen how PCA can be used to reduce the dimensions of our data while retaining as much of the variance as possible. The way this is done is by simply ignoring those components for which the proportion of variance is small. Supposing we keep k principal components, this amounts to truncating the sum in Equation 8.2 after k terms:

$$A \approx \sum_{i=1}^{k}\sigma_iZ_i. \tag{8.3}$$
As it turns out, this truncation has important consequences in many applications. One example is that of image compression. An image is simply an array of pixels. Supposing the image size is m pixels tall by n pixels wide, we can capture this information in an m × n matrix if the image is in grayscale, or an m × 3n matrix for a [r,g,b] color image (we'd need 3 values for each pixel to recreate the pixel's color). These matrices can get very large (a 6 megapixel photo is 6 million pixels). Rather than store the entire matrix, we can store an approximation to the matrix using only a few (well, more than a few) singular values and singular vectors. This is the basis of image compression. An approximated photo will not be as crisp as the original - some information will be lost - but most of the time we can store much less than the original matrix and still get a good depiction of the image.
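As a toy illustration of Equation 8.3, here is a SAS/IML sketch with a small made-up matrix standing in for an image:

proc iml;
   /* Hypothetical tiny "image": a 6 x 6 grayscale matrix */
   A = {9 9 9 1 1 1,
        9 9 9 1 1 1,
        9 9 9 1 1 1,
        1 1 1 9 9 9,
        1 1 1 9 9 9,
        1 1 1 9 9 9};

   call svd(U, q, V, A);
   print q;                            /* singular values, largest first */

   /* Rank-k approximation: keep only the first k terms of the expansion */
   k  = 2;
   Ak = U[,1:k] * diag(q[1:k]) * V[,1:k]`;
   print Ak;
quit;

Here rank(A) = 2, so keeping k = 2 terms reproduces A exactly; for a real photograph we would keep k far smaller than the rank and accept a small loss of crispness in exchange for storing far fewer numbers.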

8.1.2 Noise Reduction Many applications arise where the relevant information contained in a matrix is contaminated by a certain level of noise. This is particularly common with video and audio signals, but also arises in text data and other types of (usually high dimensional) data. The truncated SVD (Equation 8.3) can actually reduce the amount of noise in data and increase the overall signal-to-noise ratio under certain conditions. Let's suppose, for instance, that our matrix Am×n contains data which is contaminated by noise. If that noise is assumed to be random (or nondirectional) in the sense that the noise is distributed more or less uniformly across the components Zi, then there is just as much noise "in the direction" of one Zi as there is in the other. If the amount of noise along each direction is approximately the same, and the σi's tell us how much (relevant) information in A

is directed along each component Zi, then it must be that the ratio of “signal” (relevant information) to noise is decreasing across the ordered components, since σ1 ≥ σ2 ≥ · · · ≥ σr implies that the signal is greater in earlier components. So letting SNR(σiZi) denote the signal-to-noise ratio of each component, we have

SNR(σ1Z1) ≥ SNR(σ2Z2) ≥ · · · ≥ SNR(σrZr)

This explains why the truncated SVD,

$$A \approx \sum_{i=1}^{k}\sigma_iZ_i \quad \text{where } k < r$$
can, in many scenarios, filter out some of the noise without losing much of the significant information in A.

8.1.3 Latent Semantic Indexing Text mining is another area where the SVD is used heavily. In text mining, our data structure is generally known as a Term-Document Matrix. The documents are any individual pieces of text that we wish to analyze, cluster, summarize or discover topics from. They could be sentences, abstracts, webpages, or social media updates. The terms are the words contained in these documents. The term-document matrix represents what’s called the “bag-of-words” approach - the order of the words is removed and the data becomes unstructured in the sense that each document is represented by the words it contains, not the order or context in which they appear. The (i, j) entry in this matrix is the number of times term j appears in document i. Definition 8.1.1: Term-Document Matrix

Let m be the number of documents in a collection and n be the number of terms appearing in that collection. Then we create our term-document

matrix A as follows:

$$A_{m \times n} = \big[f_{ij}\big], \qquad i = 1, \dots, m \ (\text{documents}), \quad j = 1, \dots, n \ (\text{terms}),$$

where fij is the frequency of term j in document i. A binary term-document matrix will simply have Aij = 1 if term j is contained in document i.

Term-document matrices tend to be large and sparse. Term-weighting schemes are often used to downplay the effect of commonly used words and bolster the effect of rare but semantically important words. The most popular weighting method is known as "Term Frequency-Inverse Document Frequency" (TF-IDF). For this method, the raw term-frequencies fij in the matrix A are multiplied by global weights (inverse document frequencies), wj, for each term. These weights reflect the commonality of each term across the entire collection. The inverse document frequency of term j is:
$$w_j = \log\left(\frac{\text{total \# of documents}}{\text{\# of documents containing term } j}\right)$$
To put this weight in perspective, for a collection of 10,000 documents we have 0 ≤ wj ≤ 9.2, where wj = 0 means the word is contained in every document (i.e. it's not important semantically) and wj = 9.2 means the word is contained in only 1 document (i.e. it's quite important). The document vectors are often normalized to have unit 2-norm, since their directions (not their lengths) in the term space are what characterize them semantically.
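A sketch of the TF-IDF weighting and unit-norm scaling described above, using a tiny made-up term-document matrix (documents in rows, terms in columns; all numbers hypothetical):

proc iml;
   /* Hypothetical raw term-document matrix: 3 documents x 4 terms */
   F = {2 1 0 0,
        1 1 1 0,
        0 1 0 3};
   m = nrow(F);                      /* number of documents */

   /* Inverse document frequency: log(total docs / docs containing the term) */
   docFreq = (F > 0)[+, ];           /* column sums of the 0/1 indicator      */
   w = log(m / docFreq);             /* w = 0 for a term found in every document */

   A = F # repeat(w, m, 1);          /* TF-IDF weighted matrix */

   /* Normalize each document (row) to unit 2-norm */
   rowNorm = sqrt(A[, ##]);          /* row-wise sums of squares, then sqrt */
   A = A / repeat(rowNorm, 1, ncol(A));
   print w A;
quit;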

The noise-reduction property of the SVD was extended to text processing in 1990 by Susan Dumais et al., who named the effect Latent Semantic Indexing (LSI). LSI involves the singular value decomposition of the term-document matrix defined in Definition 8.1.1. In other words, it is like a principal components analysis using the unscaled, uncentered inner-product matrix ATA. If the documents are normalized to have unit length, this is a matrix of cosine similarities (see Chapter 2). In text mining, the cosine similarity is the most common measure of similarity between documents. If the term-document matrix is binary, this is often called the co-occurrence matrix because each entry gives the number of times two words occur in the same document. It certainly seems logical to view text data in this context as it contains both an informative signal and semantic noise. LSI quickly grew roots in the information retrieval community, where it is often used for query processing. The idea is to remove semantic noise, due to variation and ambiguity in vocabulary and presentation style, without losing significant amounts of information. For example, a human may not differentiate between the words "car" and "automobile", but indeed the words will become two separate entities in the raw term-document matrix. The main idea in LSI is that the realignment of the data into fewer directions should force related documents (like those containing "car" and "automobile") closer together in an angular sense, thus revealing latent semantic connections. Purveyors of LSI suggest that the use of the Singular Value Decomposition to project the documents into a lower-dimensional space results in a representation which reflects the major associative patterns of the data while ignoring less important influences. This projection is done with the simple truncation of the SVD shown in Equation 8.3. As we have seen with other types of data, the very nature of dimension reduction makes it possible for two documents with similar semantic properties to be mapped closer together. Unfortunately, the mixture of signs (positive and negative) in the singular vectors (think principal components) makes the decomposition difficult to interpret. While the major claims of LSI are legitimate, this lack of interpretability is still conceptually problematic for some folks. In order to make this point as clear as possible, consider the original "term basis" representation for the data, where each document (from a collection containing m total terms in the dictionary) could be written as:

$$A_j = \sum_{i=1}^{m} f_{ij}e_i$$

where fij is the frequency of term i in the document, and ei is the i-th column of the m × m identity matrix. The truncated SVD gives us a new set of coordinates (scores) and basis vectors (principal component features):

$$A_j \approx \sum_{i=1}^{r}\alpha_iu_i$$
but the features ui live in the term space, and thus ought to be interpretable as a linear combination of the original "term basis." However the linear combination, having both positive and negative coefficients, is semantically meaningless in context - these new features cannot, generally, be thought of as meaningful topics.

CHAPTER 9

ADVANCED REGRESSION TECHNIQUES

9.1 Biased Regression

When severe multicollinearity occurs between our predictor variables, least squares estimates are still unbiased, but their variances are large so they may be far from the true value. Biased regression techniques intentionally bias the estimation of the regression coefficients. By adding a degree of bias to the estimates, we can reduce the standard errors (increase the precision). It is hoped that the net effect will be more reliable parameter estimates.

The precision is generally measured by the mean-squared error of our estimate,
$$MSE(\hat{\beta}) = [\text{Bias}(\hat{\beta})]^2 + \text{Var}(\hat{\beta}).$$
Ordinary least squares regression assumes that the bias is zero. In biased regression techniques, we'll allow for some bias in order to minimize the variance of our estimate. Ideally, the criterion for deciding when biased regression techniques are better than OLS would depend on the true values of the parameters, which we do not know (we cannot even estimate the bias in our parameter estimates). Since this is not possible, there is no completely objective way to decide. Principal Components Regression (PCR) and Ridge Regression are two such techniques. Ridge regression tends to be the more popular of the two methods, but PCR is a little more straightforward.

9.1.1 Principal Components Regression (PCR) As we saw in Chapter 7, every linear regression model can be restated in terms of a new set of orthogonal predictor variables that are linear combinations of the original variables - the principal components. Let x1, x2, ... , xp be our predictor variables. Then the principal components (PCs) are just linear combinations of these predictor variables with coefficients from the rows of the eigenvector matrix:
$$PC_j = v_{1j}x_1 + v_{2j}x_2 + \cdots + v_{pj}x_p$$
The variance-covariance matrix of the principal components is diagonal, diag(λ1, ... , λp), because the principal components are orthogonal. If λj = 0 then the corresponding PC has no variance (i.e. it is constant). This reveals linear structure in the variables. For example, suppose one of our principal components is
$$PC_2 = -0.5x_1 + 2x_2$$
with corresponding eigenvalue λ2 = 0.

This means that when we form the combination 2x2 − 0.5x1 from the original data, the result has zero variability - it is constant for every observation. Thus it must be that, across all observations, x2 is completely determined by x1 and vice-versa: the two variables are perfectly correlated. When λj is nearly zero, we are very close to the same situation, which violates the assumptions of our regression model. A small numerical illustration appears below; after that, let's look at an applied example.
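A made-up SAS/IML sketch of this idea: when one predictor is an exact linear function of another, the covariance matrix has a zero eigenvalue and the corresponding eigenvector encodes the linear relationship (here x2 = 0.25 x1, matching the hypothetical PC2 above):

proc iml;
   /* Hypothetical predictors where x2 is an exact linear function of x1 */
   x1 = {1, 2, 3, 4, 5};
   x2 = 0.25 * x1;                   /* perfect collinearity */
   X  = x1 || x2;
   n  = nrow(X);

   Xc = X - repeat(X[:,], n, 1);
   S  = Xc`*Xc / (n-1);              /* covariance matrix */
   call eigen(lambda, V, S);
   print lambda V;                   /* one eigenvalue is (numerically) zero,   */
                                     /* and its eigenvector gives the relation  */
quit;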

Example: French Economy We are going to examine data from the French economy reported by Malinvaud (1968).

• The Variables:

1. Imports (Target)
2. Domestic Production
3. Stock Formation
4. Domestic Consumption

• All measured in billions of French francs between 1949 and 1966.

Let's try to run a simple linear regression to predict Imports using the 3 predictor variables above. We are assuming there is some underlying need to understand the relationship of all three variables to Imports - we do not want to drop any variables from the analysis. When we run the regression, we should always pay attention to the Variance Inflation Factors (VIFs) to see if any multicollinearity is affecting the variability in our parameter estimates.

proc reg data=advanced.french;
model Import = DoProd Stock Consum / vif;
run; quit;

proc princomp data=advanced.french out=frenchPC;
var DoProd Stock Consum;
run;

The VIF output from the regression model clearly indicates strong multi- collinearity. The principal component output in Figure 9.1 makes it clear that the difference between two of our variables is essentially constant or has no variability, illuminating the exact source of that multicollinearity.

Figure 9.1: SAS Output

Domestic Consumption is essentially equal to Domestic Production. This is something that matches realistic expectations. Now there is a "new" set of variables (the PCs) that are orthogonal to each other. Does this new set of variables eliminate multicollinearity concerns? No! In the first model listed in the next block of code (PC Model 1), we have not really changed anything! We've just rotated our data. Using all 3 principal components we are not incorporating bias into the model or removing the multicollinearity - we are just hiding it! It isn't until we drop some of the PCs (the second model) that we are able to introduce bias and eliminate the underlying multicollinearity.

/* First we must standardize our dependent variable. */
/* Be aware of covariance vs. correlation PCA! What would be the difference? */
proc standard data=frenchPC mean=0 std=1 out=frenchPC2;
var Import;
run;

proc reg data=frenchPC2;
PCModel1: model Import = Prin1 Prin2 Prin3 / vif;
PCModel2: model Import = Prin1 Prin2 / vif;
run; quit;

In order to compute meaningful coefficients we have to do some algebra and take into account the standard deviations of our variables (because both the independent variables and the dependent variable were centered and scaled when forming the principal components - there is a difference here if you use the covariance matrix, so understand and be careful!):

Yˆ = α1PC1 + α2PC2 + ··· + αpPCp + e

PCj = v1jx1 + v2jx2 + ··· + vpjxp

$$Y = \beta_0 + \beta_1x_1 + \cdots + \beta_px_p + e$$
where
$$\beta_j = \frac{s_y}{s_{x_j}}\left(v_{j1}\alpha_1 + v_{j2}\alpha_2 + \cdots + v_{jp}\alpha_p\right)$$

$$\beta_0 = \bar{Y} - \beta_1\bar{x}_1 - \cdots - \beta_p\bar{x}_p$$
SAS can actually do this in PROC PLS (Partial Least Squares) as demonstrated in the next block of code. The caveat is that this procedure can only drop the later PCs, keeping the first nfac=n components. Usually this is in fact what you want to accomplish, unless you have a principal component that is being driven by some variable that is not significant in your model and you wish to drop that component but keep others after it. In such cases, the coefficients will have to be computed by hand (a sketch of this computation follows the code below).

proc pls data=advanced.french method=pcr nfac=2;
model Import = DoProd Stock Consum / solution;
run; quit;
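For the hand computation, the following SAS/IML sketch illustrates the back-transformation formulas above using the same French economy variables. This is only a sketch, under the assumption that the advanced.french data set is available as in the earlier code; PROC PLS remains the standard route when you are keeping the leading components:

proc iml;
   use advanced.french;
   read all var {"DoProd" "Stock" "Consum"} into X;
   read all var {"Import"} into y;
   close advanced.french;
   n = nrow(X);

   /* Standardize the predictors and the response */
   xbar = X[:,];   ybar = y[:];
   Xc   = X - repeat(xbar, n, 1);
   sx   = sqrt(Xc[##,] / (n-1));          /* standard deviations of the predictors */
   sy   = sqrt(ssq(y - ybar) / (n-1));    /* standard deviation of the response    */
   Z    = Xc / repeat(sx, n, 1);
   yz   = (y - ybar) / sy;

   /* Principal components of the (correlation-scale) predictors */
   R = Z`*Z / (n-1);
   call eigen(lambda, V, R);
   scores = Z * V;

   /* Keep any subset of components, e.g. the first two */
   keep  = {1 2};
   S     = scores[, keep];
   alpha = inv(S`*S) * S` * yz;           /* regression of standardized y on the kept PCs */

   /* Back-transform: beta_j = (s_y / s_xj) * sum_k v_jk * alpha_k */
   beta  = (sy / sx`) # (V[, keep] * alpha);
   beta0 = ybar - xbar * beta;
   print lambda beta beta0;
quit;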

PCR - Cautions PCR may not always work, in the sense that it may have trouble explaining variability in the response variable. You should never blindly drop PCs - you should always be using the justifications set forth above. Outliers / influential observations can severely distort the principal components because they alter the variance-covariance matrix - you should be aware of this fact and always examine your principal components.

9.1.2 Ridge Regression Ridge regression is a biased regression technique to use in the presence of multicollinearity. It produces estimates that tend to have lower MSE (but higher bias) than the OLS estimates, and it works with standardized values for each of the variables in the model (similar to PCR):

Y˜ = θ1x˜1 + θ2x˜2 + ··· + θpx˜p where Y˜, x˜ represent the standardized values. Recall that solving for OLS estimates involves the normal equations:

ZTZθˆ = ZTY

Rearranging the normal equations leads to the following way to solve for the OLS estimates.

θ1 + r12θ2 + ··· + r1pθp = r1y

r21θ1 + θ2 + ··· + r2pθp = r2y ... = ...

rp1θ1 + rp2θ2 + ··· + θp = rpy where rij is the correlation between predictors i and j (so rij = rji) and rjy is the correlation between the response and predictor j.

Ridge Adjustments Solving for ridge estimates involves changing the normal equations to T T T T Z Zθˆ = Z Y −→ (Z Z + kI)θˆR = Z Y Rearranging the changed normal equations leads to the following way to solve for the ridge estimates:

(1 + k)θ1 + r12θ2 + ··· + r1pθp = r1y

r21θ1 + (1 + k)θ2 + ··· + r2pθp = r2y ... = ...

rp1θ1 + rp2θ2 + ··· + (1 + k)θp = rpy
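In matrix form these are just the equations $(R + kI)\hat{\theta}_R = r_y$, which are easy to solve directly. A SAS/IML sketch with made-up data (the value of k here is arbitrary, chosen only for illustration):

proc iml;
   /* Hypothetical raw data; standardize, then solve the ridge normal equations */
   X = {1 2, 2 3, 3 3, 4 5, 5 6, 6 6};
   y = {1, 2, 2, 4, 4, 6};
   n = nrow(X);   p = ncol(X);

   Xc = X - repeat(X[:,], n, 1);
   sx = sqrt(Xc[##,] / (n-1));
   Z  = Xc / repeat(sx, n, 1);
   yz = (y - y[:]) / sqrt(ssq(y - y[:]) / (n-1));

   R  = Z`*Z  / (n-1);        /* correlation matrix of the predictors         */
   ry = Z`*yz / (n-1);        /* correlations of the predictors with y        */

   k = 0.1;                                 /* ridge parameter                 */
   theta_ols   = solve(R, ry);              /* OLS solution (k = 0)            */
   theta_ridge = solve(R + k*I(p), ry);     /* ridge solution, shrunk toward 0 */
   print theta_ols theta_ridge;
quit;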

The higher the value of k, the more bias is introduced in the estimates of the model. The hardest part about ridge regression is choosing the appropriate value of k because many different ways have been proposed over the years:

• Fixed Point (1975 by Hoerl, Kennard, Baldwin)

$$k = \frac{p\hat{\sigma}^2}{\sum_{i=1}^{p}\hat{\theta}_{i,OLS}^2}$$

where $\hat{\sigma}^2$ is the MSE for the model. This is one of the most popular estimates, sometimes referred to as the HKB estimate.

• Iterative Method (1976 by Hoerl, Kennard)

$$k_0 = \frac{p\hat{\sigma}^2}{\sum_{i=1}^{p}\hat{\theta}_{i,OLS}^2}, \qquad k_1 = \frac{p\hat{\sigma}^2}{\sum_{i=1}^{p}\hat{\theta}_{i,k_0}^2}, \qquad \ldots, \qquad k_n = \frac{p\hat{\sigma}^2}{\sum_{i=1}^{p}\hat{\theta}_{i,k_{n-1}}^2}$$
This is repeated until the change is negligible. In practice, we expect to take very few iterations.

• Ridge Trace

– Plot of many different estimates of θˆi across a series of k values. – Use the plot to approximate when the estimates become stable.

Example: Fixed Point Method The code below highlights the method for implementing the Fixed Point method in SAS. Here we create macro variables to represent the MSE of the model and the value for k. The MSE of the model will be used for implementing the Iterative Method. The last PROC REG statement outputs the VIFs, the standard errors for the betas (SEB) and the parameter estimates to the output data set 'B'. The reason we must output these parameters to a dataset is that the SAS output will not show the VIF values for the ridge regression, only from the ordinary OLS model. The RIDGE option allows us to use our macro variable for the parameter k.

proc standard data=advanced.french mean=0 std=1 out=frenchstd;
var Import DoProd Stock Consum;
run;

proc reg data=frenchstd outest=B;
model Import = DoProd Stock Consum / vif;
run; quit;

data _null_;
set B;
call symput('MSE', RMSE**2);
call symput('k', 3*RMSE**2/(DoProd**2+Stock**2+Consum**2));
run;

proc reg data=frenchstd outvif outseb outest=B ridge=&k;
model Import = DoProd Stock Consum / vif;
run; quit;

Example: Iterative Method The code for the iterative method simply extends the code for the fixed point method. We again create macro variables to represent the MSE of the model and the resulting value for k.

proc reg data=frenchstd outvif outseb outest=B ridge=&k;
model Import = DoProd Stock Consum / vif;
run; quit;
proc print data=B;
run;

data _null_;
set B;
where _TYPE_='RIDGE';
call symput('k', 3*&MSE / (DoProd**2+Stock**2+Consum**2));
run;

proc reg data=frenchstd outvif outseb outest=B ridge=&k;
model Import = DoProd Stock Consum / vif;
run; quit;
proc print data=B;
run;

This code would be repeated using the latest ridge model. In practice, we don't often need to go beyond a few iterations before witnessing convergence. In this example, the VIF values drop below 1 in the second iteration, which is something we have to be careful about - we probably do not want to use this iteration.

Example: Ridge Trace Method The Ridge Trace method is implemented by simply inputting the RIDGE parameter as a sequence as shown below:

proc reg data=frenchstd outvif outest=B ridge=0 to 0.08 by 0.002;
model Import = DoProd Stock Consum / vif;
run; quit;

proc reg data=frenchstd outvif outseb outest=B ridge=0.04;
model Import = DoProd Stock Consum / vif;
run; quit;
proc print data=B;
run;

The code above will output the VIF values and the standardized coefficients for each variable in the model for a range of values of k. These values are given in the output plots shown in Figure 9.2. The goal is to choose a value for k where the lines on the graph become approximately horizontal.

Ridge Regression - Cautions Due to the uncertainty of how to calculate k, some dislike the use of ridge regression (or any other biased regression technique). Both Principal Components Regression and Ridge Regression should be used as a last resort. Deleting or combining variables is preferred because it doesn't introduce bias. These methods, however, should not be shunned.

Figure 9.2: SAS Output: Ridge Regression with Ridge Trace Method