MASTER OF SCIENCE IN ANALYTICS 2014 EMPLOYMENT REPORT
Results at graduation, May 2014
Number of graduates: 79
Number of graduates seeking new employment: 75
Percent with one or more offers of employment by graduation: 100
Percent placed by graduation: 100
Number of employers interviewing: 138
Average number of initial job interviews per student: 13
Percent of all interviews arranged by Institute: 92
Percent of graduates with 2 or more job offers: 90
Percent of graduates with 3 or more job offers: 61
Percent of graduates with 4 or more job offers: 40
Average base salary offer ($): 96,600
Median base salary offer ($): 95,000
Average base salary offers – candidates with job experience ($): 100,600
Range of base salary offers – candidates with job experience ($): 80,000-135,000
Percent of graduates with prior professional work experience: 50
Average base salary offers – candidates without experience ($): 89,000
Range of base salary offers – candidates without experience ($): 75,000-110,000
Percent of graduates receiving a signing bonus: 65
Average amount of signing bonus ($): 12,200
Percent remaining in NC: 59
Percent of graduates sharing salary data: 95
Number of reported job offers: 246
Percent of reported job offers based in U.S.: 100
North Carolina State University • 920 Main Campus Drive, Suite 530 • Raleigh, NC 27606 • http://analytics.ncsu.edu • © 2014

Linear Algebra
MSA 2015
Author: Shaina Race
CONTENTS
1 The Basics
  1.1 Conventional Notation
    1.1.1 Matrix Partitions
    1.1.2 Special Matrices and Vectors
    1.1.3 n-space
  1.2 Vector Addition and Scalar Multiplication
  1.3 Exercises
2 Norms, Inner Products and Orthogonality
  2.1 Norms and Distances
  2.2 Inner Products
    2.2.1 Covariance
    2.2.2 Mahalanobis Distance
    2.2.3 Angular Distance
    2.2.4 Correlation
  2.3 Orthogonality
  2.4 Outer Products
3 Linear Combinations and Linear Independence
  3.1 Linear Combinations
  3.2 Linear Independence
    3.2.1 Determining Linear Independence
  3.3 Span of Vectors
4 Basis and Change of Basis
5 Least Squares
6 Eigenvalues and Eigenvectors
  6.1 Diagonalization
  6.2 Geometric Interpretation of Eigenvalues and Eigenvectors
7 Principal Components Analysis
  7.1 Comparison with Least Squares
  7.2 Covariance or Correlation Matrix?
  7.3 Applications of Principal Components
    7.3.1 PCA for dimension reduction
8 Singular Value Decomposition (SVD)
  8.1 Resolving a Matrix into Components
    8.1.1 Data Compression
    8.1.2 Noise Reduction
    8.1.3 Latent Semantic Indexing
9 Advanced Regression Techniques
  9.1 Biased Regression
    9.1.1 Principal Components Regression (PCR)
    9.1.2 Ridge Regression
CHAPTER 1
THE BASICS
1.1 Conventional Notation
Linear Algebra has some conventional ways of representing certain types of numerical objects. Throughout this course, we will stick to the following basic conventions:
• Bold, uppercase letters like A, X, and U will be used to refer to matrices.
• Occasionally, the size of a matrix will be specified by subscripts, like $\mathbf{A}_{m\times n}$, which means that A is a matrix with m rows and n columns.
• Bold, lowercase letters like x and y will be used to reference vectors. Unless otherwise specified, these vectors will be thought of as columns, with $\mathbf{x}^T$ and $\mathbf{y}^T$ referring to the row equivalents.
• The individual elements of a vector or matrix will often be referred to with subscripts, so that $A_{ij}$ (or sometimes $a_{ij}$) denotes the element in the $i$th row and $j$th column of the matrix A. Similarly, $x_k$ denotes the $k$th element of the vector x. These references to individual elements are not generally bolded because they refer to scalar quantities.
• Scalar quantities are written as unbolded Greek letters like α, δ, and λ.
• The trace of a square matrix An×n, denoted Tr(A) or Trace(A), is the sum of the diagonal elements of A,
$$\operatorname{Tr}(\mathbf{A}) = \sum_{i=1}^{n} A_{ii}.$$
Beyond these basic conventions, there are other common notational tricks that we will become familiar with. The first of these is writing a partitioned matrix.
1.1.1 Matrix Partitions

We will often want to consider a matrix as a collection of either rows or columns rather than individual elements. As we will see in the next chapter, when we partition matrices in this form, we can view their multiplication in simplified form. This often leads us to a new view of the data which can be helpful for interpretation. When we write
$$\mathbf{A} = (\mathbf{A}_1|\mathbf{A}_2|\dots|\mathbf{A}_n)$$
we are viewing the matrix A as a collection of column vectors, $\mathbf{A}_i$, in the following way:
$$\mathbf{A} = (\mathbf{A}_1|\mathbf{A}_2|\dots|\mathbf{A}_n) = \begin{pmatrix} \uparrow & \uparrow & \uparrow & & \uparrow \\ \mathbf{A}_1 & \mathbf{A}_2 & \mathbf{A}_3 & \cdots & \mathbf{A}_n \\ \downarrow & \downarrow & \downarrow & & \downarrow \end{pmatrix}$$
Similarly, we can write A as a collection of row vectors:
$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_1 \\ \mathbf{A}_2 \\ \vdots \\ \mathbf{A}_m \end{pmatrix} = \begin{pmatrix} \longleftarrow & \mathbf{A}_1 & \longrightarrow \\ \longleftarrow & \mathbf{A}_2 & \longrightarrow \\ & \vdots & \\ \longleftarrow & \mathbf{A}_m & \longrightarrow \end{pmatrix}$$
Sometimes, we will want to refer to both rows and columns in the same context. The above notation is not sufficient for this, as we would have $\mathbf{A}_j$ referring to either a column or a row. In these situations, we may use $\mathbf{A}_{\star j}$ to reference the $j$th column and $\mathbf{A}_{i\star}$ to reference the $i$th row:
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & & \vdots \\ a_{i1} & a_{i2} & \cdots & a_{in} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$
where the columns, from left to right, are $\mathbf{A}_{\star 1}, \mathbf{A}_{\star 2}, \dots, \mathbf{A}_{\star n}$ and the rows, from top to bottom, are $\mathbf{A}_{1\star}, \mathbf{A}_{2\star}, \dots, \mathbf{A}_{m\star}$.
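To make the partition notation concrete, here is a minimal numpy sketch (numpy and the small toy matrix are assumptions for illustration, not part of the original notes) showing how the columns $\mathbf{A}_{\star j}$ and rows $\mathbf{A}_{i\star}$ can be pulled out of a stored matrix:

```python
import numpy as np

# A 3x4 matrix used purely for illustration
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

A_col2 = A[:, 1]   # the second column, A_{*2}
A_row3 = A[2, :]   # the third row,    A_{3*}

# Stacking the columns back together recovers A
A_rebuilt = np.column_stack([A[:, j] for j in range(A.shape[1])])
print(np.array_equal(A, A_rebuilt))   # True
```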
1.1.2 Special Matrices and Vectors

The bold capital letter I is used to denote the identity matrix. Sometimes this matrix has a single subscript to specify its size. More often, the size of the identity is implied by the matrix equation in which it appears.
$$\mathbf{I}_4 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
The bold lowercase $\mathbf{e}_j$ is used to refer to the $j$th column of I. It is simply a vector of zeros with a one in the $j$th position. We do not often specify the size of the vector $\mathbf{e}_j$; the number of elements is generally assumed from the context of the problem.
$$\mathbf{e}_j = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \leftarrow j\text{th position}$$
The vector e with no subscript refers to a vector of all ones:
$$\mathbf{e} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$
A diagonal matrix is a matrix whose off-diagonal elements, $A_{ij}$, $i \neq j$, are all zero. For example:
$$\mathbf{D} = \begin{pmatrix} \sigma_1 & 0 & 0 & 0 \\ 0 & \sigma_2 & 0 & 0 \\ 0 & 0 & \sigma_3 & 0 \\ 0 & 0 & 0 & \sigma_4 \end{pmatrix}$$
Since the off-diagonal elements are 0, we need only define the diagonal elements for such a matrix. Thus, we will frequently write
$$\mathbf{D} = \operatorname{diag}\{\sigma_1, \sigma_2, \sigma_3, \sigma_4\} \quad \text{or simply} \quad D_{ii} = \sigma_i.$$
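A short numpy sketch of these special matrices (the particular diagonal values are made up for illustration; numpy itself is assumed and is not part of the original notes):

```python
import numpy as np

I4 = np.eye(4)            # the 4x4 identity matrix I
e2 = I4[:, 1]             # e_2: zeros with a one in the 2nd position
e  = np.ones(4)           # e: the vector of all ones

D = np.diag([3.0, 1.0, 4.0, 2.0])   # diag{3, 1, 4, 2}
print(np.diag(D))                    # recovers the diagonal [3. 1. 4. 2.]
```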
1.1.3 n-space

You are already familiar with the concept of “ordered pairs” or coordinates $(x_1, x_2)$ on the two-dimensional plane (in Linear Algebra, we call this plane “2-space”). Fortunately, we do not live in a two-dimensional world! Our data will more often consist of measurements on a number (let’s call that number n) of variables, so our data points belong to what is known as n-space. They are represented by n-tuples, which are nothing more than ordered lists of numbers: $(x_1, x_2, x_3, \dots, x_n)$. An n-tuple defines a vector with the same n elements, and so these two concepts should be thought of interchangeably. The only difference is that the vector has a direction, away from the origin and toward the n-tuple. You will recall that the symbol $\mathbb{R}$ is used to denote the set of real numbers. $\mathbb{R}$ is simply 1-space: a set of vectors with a single element. In this sense any real number x has a direction: if it is positive, it lies to one side of the origin; if it is negative, it lies to the opposite side. That number x also has a magnitude: |x| is the distance between x and the origin, 0. n-space (the set of real n-tuples) is denoted $\mathbb{R}^n$. In set notation, the formal mathematical definition is simply:
$$\mathbb{R}^n = \{(x_1, x_2, \dots, x_n) : x_i \in \mathbb{R},\ i = 1, \dots, n\}.$$
We will often use this notation to define the size of an arbitrary vector. For example, x ∈ Rp simply means that x is a vector with p entries: x = (x1, x2,..., xp). Many (all, really) of the concepts we have previously considered in 2- or 3-space extend naturally to n-space and a few new concepts become useful as well. One very important concept is that of a norm or distance metric, as we will see in Chapter 2. Before discussing norms, let’s revisit the basics of vector addition and scalar multiplication.
1.2 Vector Addition and Scalar Multiplication
You’ve already learned how vector addition works algebraically: it occurs element-wise between two vectors of the same length:
$$\mathbf{a} + \mathbf{b} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{pmatrix} = \begin{pmatrix} a_1 + b_1 \\ a_2 + b_2 \\ a_3 + b_3 \\ \vdots \\ a_n + b_n \end{pmatrix}.$$
Geometrically, vector addition is witnessed by placing the two vectors, a and b, tail-to-head. The result, a + b, is the vector from the open tail to the open head. This is called the parallelogram law and is demonstrated in Figure 1.1a.
Figure 1.1: Vector Addition and Subtraction Geometrically: Tail-to-Head. (a) Addition of vectors; (b) Subtraction of vectors.
When subtracting vectors as a − b we simply add −b to a. The vector −b has the same length as b but points in the opposite direction. This vector has the same length as the one which connects the two heads of a and b as shown in Figure 1.1b. Example 1.2.1: Vector Subtraction: Centering Data
One thing we will do frequently in this course is consider centered and/or standardized data. To center a group of variables, we merely subtract the mean of each variable from each observation. Geometrically, this amounts to a translation (shift) of the data so that its center (or mean) is at the origin. The following graphic illustrates this process using 4 data points.
(Graphic: the same four data points shown before and after centering; subtracting the mean vector shifts the cloud so that its center lies at the origin.)
Scalar multiplication is another operation which acts element-wise:
$$\alpha\mathbf{a} = \alpha\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} \alpha a_1 \\ \alpha a_2 \\ \alpha a_3 \\ \vdots \\ \alpha a_n \end{pmatrix}$$
Scalar multiplication changes the length of a vector but not the overall direction (although a negative scalar will scale the vector in the opposite direction through the origin). We can see this geometric interpretation of scalar multiplication in Figure 1.2.
Figure 1.2: Geometric Effect of Scalar Multiplication (the vectors a, 2a, and −0.5a).
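If you want to experiment with these operations, a minimal numpy sketch (the particular vectors are illustrative assumptions, not part of the original notes) is:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, -1.0])

print(a + b)      # element-wise addition: [5. 2. 2.]
print(a - b)      # subtraction: add -b to a
print(2 * a)      # twice as long, same direction
print(-0.5 * a)   # half as long, opposite direction
```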
1.3 Exercises
1. For a general matrix Am×n describe what the following products will provide. Also give the size of the result (i.e. "n × 1 vector" or "scalar").
a. $\mathbf{A}\mathbf{e}_j$
b. $\mathbf{e}_i^T\mathbf{A}$
c. $\mathbf{e}_i^T\mathbf{A}\mathbf{e}_j$
d. $\mathbf{A}\mathbf{e}$
e. $\mathbf{e}^T\mathbf{A}$
f. $\frac{1}{n}\mathbf{e}^T\mathbf{A}$
2. Let Dn×n be a diagonal matrix with diagonal elements Dii. What effect does multiplying a matrix Am×n on the left by D have? What effect does multiplying a matrix An×m on the right by D have? If you cannot see this effect in a general sense, try writing out a simple 3 × 3 matrix as an example first.
3. What is the inverse of a diagonal matrix, D = diag{d11, d22,..., dnn}?
4. Suppose you have a matrix of data, An×p, containing n observations on p variables. Suppose the standard deviations of these variables are σ1, σ2, ... , σp. Give a formula for a matrix that contains the same data but with each variable divided by its standard deviation. Hint: you should use exercises 2 and 3.
5. Suppose we have a network/graph as shown in Figure 1.3. This particular network has 6 numbered vertices (the circles) and edges which connect the vertices. Each edge has a certain weight (perhaps reflecting some level of association between the vertices) which is given as a number.
Figure 1.3: An example of a graph or network (6 numbered vertices connected by weighted edges).
a. The adjacency matrix of a graph is defined to be the matrix A such that element $A_{ij}$ reflects the weight of the edge connecting vertex i and vertex j. Write out the adjacency matrix for this graph.
b. The degree of a vertex is defined as the sum of the weights of the edges connected to that vertex. Create a vector d such that $d_i$ is the degree of node i.
c. Write d as a matrix-vector product in two different ways using the adjacency matrix, A, and e.
CHAPTER 2
NORMS, INNER PRODUCTS AND ORTHOGONALITY
2.1 Norms and Distances
In applied mathematics, Norms are functions which measure the magnitude or length of a vector. They are commonly used to determine similarities between observations by measuring the distance between them. As we will see, there are many ways to define distance between two points. Definition 2.1.1: Vector Norms and Distance Metrics
A Norm, or distance metric, is a function that takes a vector as input and returns a scalar quantity ($f: \mathbb{R}^n \to \mathbb{R}$). A vector norm is typically denoted by two vertical bars surrounding the input vector, $\|\mathbf{x}\|$, to signify that it is not just any function, but one that satisfies the following criteria:
1. If c is a scalar, then $\|c\mathbf{x}\| = |c|\,\|\mathbf{x}\|$.
2. The triangle inequality: $\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$.
3. $\|\mathbf{x}\| = 0$ if and only if $\mathbf{x} = \mathbf{0}$.
4. $\|\mathbf{x}\| \geq 0$ for any vector x.
We will not spend any time on these axioms or on the theoretical aspects of norms, but we will put a couple of these functions to good use in our studies, the first of which is the Euclidean norm or 2-norm.
Definition 2.1.2: Euclidean Norm, $\|\cdot\|_2$
The Euclidean Norm, also known as the 2-norm, simply measures the Euclidean length of a vector (i.e., a point’s distance from the origin). Let $\mathbf{x} = (x_1, x_2, \dots, x_n)$. Then
$$\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
If x is a column vector, then
$$\|\mathbf{x}\|_2 = \sqrt{\mathbf{x}^T\mathbf{x}}.$$
Often we will simply write $\|\cdot\|$ rather than $\|\cdot\|_2$ to denote the 2-norm, as it is by far the most commonly used norm.

This is merely the distance formula from undergraduate mathematics, measuring the distance between the point x and the origin. To compute the distance between two different points, say x and y, we’d calculate
$$\|\mathbf{x} - \mathbf{y}\|_2 = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$
Example 2.1.1: Euclidean Norm and Distance
Suppose I have two vectors in 3-space:
x = (1, 1, 1) and y = (1, 0, 0)
Then the magnitude of x (i.e., its length or distance from the origin) is
$$\|\mathbf{x}\|_2 = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3}$$
and the magnitude of y is
$$\|\mathbf{y}\|_2 = \sqrt{1^2 + 0^2 + 0^2} = 1$$
and the distance between point x and point y is
$$\|\mathbf{x} - \mathbf{y}\|_2 = \sqrt{(1-1)^2 + (1-0)^2 + (1-0)^2} = \sqrt{2}.$$
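The same computations can be checked numerically. The sketch below assumes numpy (it is not part of the original notes) and simply reproduces Example 2.1.1:

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0])
y = np.array([1.0, 0.0, 0.0])

print(np.linalg.norm(x))       # sqrt(3) ~= 1.732
print(np.linalg.norm(y))       # 1.0
print(np.linalg.norm(x - y))   # sqrt(2) ~= 1.414
print(np.sqrt(x @ x))          # same as ||x||_2, computed via x^T x
```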
The Euclidean norm is crucial to many methods in data analysis as it measures the closeness of two data points.
Thus, to turn any vector into a unit vector, a vector with a length of 1, we need only divide each of its entries by its Euclidean norm. This is a simple form of standardization used in many areas of data analysis. For a unit vector x, $\mathbf{x}^T\mathbf{x} = 1$. Perhaps without knowing it, we’ve already seen many formulas involving the norm of a vector. Examples 2.1.2 and 2.1.3 show how some of the most important concepts in statistics can be represented using vector norms.
Example 2.1.2: Standard Deviation and Variance
Suppose a group of individuals has the following heights, measured in inches: (60, 70, 65, 50, 55). The mean height for this group is 60 inches. The formula for the sample standard deviation is typically given as
$$s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}$$
That is, we subtract the mean from each observation, square the numbers, sum the result, take the square root, and divide by $\sqrt{n-1}$. If we let $\bar{\mathbf{x}} = \bar{x}\mathbf{e} = (60, 60, 60, 60, 60)$ be a vector containing the mean, and $\mathbf{x} = (60, 70, 65, 50, 55)$ be the vector of data, then the standard deviation in matrix notation is:
$$s = \frac{1}{\sqrt{n-1}}\|\mathbf{x} - \bar{\mathbf{x}}\|_2 = 7.9$$
The sample variance of this data is merely the square of the sample standard deviation:
$$s^2 = \frac{1}{n-1}\|\mathbf{x} - \bar{\mathbf{x}}\|_2^2$$
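A quick numerical check of this example (assuming numpy; not part of the original text):

```python
import numpy as np

x = np.array([60.0, 70.0, 65.0, 50.0, 55.0])   # heights from Example 2.1.2
n = len(x)
xbar = np.mean(x) * np.ones(n)                  # the vector (60, 60, 60, 60, 60)

s = np.linalg.norm(x - xbar) / np.sqrt(n - 1)
print(s)                          # ~7.91, matching the 7.9 quoted above
print(np.std(x, ddof=1))          # numpy's sample standard deviation agrees
print(s**2, np.var(x, ddof=1))    # the sample variance, two ways
```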
Example 2.1.3: Residual Sums of Squares
Another place we’ve seen a similar calculation is in linear regression. You’ll recall the objective of our regression line is to minimize the sum of squared residuals between the predicted value $\hat{y}$ and the observed value y:
$$\sum_{i=1}^n (\hat{y}_i - y_i)^2.$$
In vector notation, we’d let y be a vector containing the observed data and $\hat{\mathbf{y}}$ be a vector containing the corresponding predictions and write this summation as
$$\|\hat{\mathbf{y}} - \mathbf{y}\|_2^2$$
In fact, any situation where the phrase "sum of squares" is encountered, the 2-norm is generally implicated.
Example 2.1.4: Coefficient of Determination, R2
Since variance can be expressed using the Euclidean norm, so can the coefficient of determination or R2.
$$R^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = \frac{\|\hat{\mathbf{y}} - \bar{\mathbf{y}}\|^2}{\|\mathbf{y} - \bar{\mathbf{y}}\|^2}$$
Other useful norms and distances

1-norm, $\|\cdot\|_1$. If $\mathbf{x} = (x_1\ x_2\ \dots\ x_n)$ then the 1-norm of x is
$$\|\mathbf{x}\|_1 = \sum_{i=1}^n |x_i|.$$
This metric is often referred to as Manhattan distance, city block distance, or taxicab distance because it measures the distance between points along a rectangular grid (as a taxicab must travel on the streets of Manhattan, for example). When x and y are binary vectors, the 1-norm of their difference is called the Hamming distance, and simply measures the number of elements that are different between the two vectors.

Figure 2.1: The lengths of the red, yellow, and blue paths represent the 1-norm distance between the two points. The green line shows the Euclidean measurement (2-norm).

∞-norm, $\|\cdot\|_\infty$. The infinity norm, also called the supremum or max distance, is:
$$\|\mathbf{x}\|_\infty = \max\{|x_1|, |x_2|, \dots, |x_n|\}$$
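A small numpy sketch of the 1-, 2-, and ∞-norms (the example vector is an illustrative assumption, not from the original notes):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))        # 1-norm: |3| + |-4| + |1| = 8
print(np.linalg.norm(x, np.inf))   # infinity-norm: largest absolute entry = 4
print(np.linalg.norm(x))           # 2-norm (the default): sqrt(26)
```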
2.2 Inner Products
The inner product of vectors is a notion that you’ve already seen; it is what’s called the dot product in most physics and calculus textbooks.
Definition 2.2.1: Vector Inner Product
The inner product of two n × 1 vectors x and y is written $\mathbf{x}^T\mathbf{y}$ (or sometimes as $\langle \mathbf{x}, \mathbf{y} \rangle$) and is the sum of the products of corresponding elements:
$$\mathbf{x}^T\mathbf{y} = \begin{pmatrix} x_1 & x_2 & \dots & x_n \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = x_1y_1 + x_2y_2 + \cdots + x_ny_n = \sum_{i=1}^n x_iy_i.$$
When we take the inner product of a vector with itself, we get the square of the 2-norm:
$$\mathbf{x}^T\mathbf{x} = \|\mathbf{x}\|_2^2.$$
Inner products are at the heart of every matrix product. When we multiply two matrices, Xm×n and Yn×p, we can represent the individual elements of the result as inner products of rows of X and columns of Y as follows:
$$\mathbf{X}\mathbf{Y} = \begin{pmatrix} \mathbf{X}_{1\star} \\ \mathbf{X}_{2\star} \\ \vdots \\ \mathbf{X}_{m\star} \end{pmatrix}\begin{pmatrix} \mathbf{Y}_{\star 1} & \mathbf{Y}_{\star 2} & \dots & \mathbf{Y}_{\star p} \end{pmatrix} = \begin{pmatrix} \mathbf{X}_{1\star}\mathbf{Y}_{\star 1} & \mathbf{X}_{1\star}\mathbf{Y}_{\star 2} & \dots & \mathbf{X}_{1\star}\mathbf{Y}_{\star p} \\ \mathbf{X}_{2\star}\mathbf{Y}_{\star 1} & \mathbf{X}_{2\star}\mathbf{Y}_{\star 2} & \dots & \mathbf{X}_{2\star}\mathbf{Y}_{\star p} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{X}_{m\star}\mathbf{Y}_{\star 1} & \mathbf{X}_{m\star}\mathbf{Y}_{\star 2} & \dots & \mathbf{X}_{m\star}\mathbf{Y}_{\star p} \end{pmatrix}$$
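A minimal numpy sketch of this fact, using randomly generated matrices purely for illustration (numpy is assumed), checks that element (i, j) of XY is the inner product of row i of X with column j of Y:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(3, 4)).astype(float)
Y = rng.integers(0, 5, size=(4, 2)).astype(float)

P = X @ Y
# element (i, j) is the inner product of row i of X with column j of Y
print(P[1, 0], X[1, :] @ Y[:, 0])   # the same number, twice
```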
2.2.1 Covariance

Another important statistical measurement that is represented by an inner product is covariance. Covariance is a measure of how much two random variables change together. The statistical formula for covariance is given as
$$\mathrm{Covariance}(x, y) = E[(x - E[x])(y - E[y])] \qquad (2.1)$$
where $E[\cdot]$ is the expected value of the variable. If larger values of one variable correspond to larger values of the other variable and at the same time smaller values of one correspond to smaller values of the other, then the covariance between the two variables is positive. In the opposite case, if larger values of one variable correspond to smaller values of the other and vice versa, then the covariance is negative. Thus, the sign of the covariance shows the tendency of the linear relationship between variables; however, the magnitude of the covariance is not easy to interpret. Covariance is a population parameter: it is a property of the joint distribution of the random variables x and y. Definition 2.2.2 provides the mathematical formulation for the sample covariance. This is our best estimate for the population parameter when we have data sampled from a population.
Definition 2.2.2: Sample Covariance
If x and y are n × 1 vectors containing n observations for two different variables, then the sample covariance of x and y is given by
$$\mathrm{cov}(\mathbf{x}, \mathbf{y}) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}(\mathbf{x} - \bar{\mathbf{x}})^T(\mathbf{y} - \bar{\mathbf{y}})$$
where again $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are vectors that contain $\bar{x}$ and $\bar{y}$ repeated n times. It should be clear from this formulation that
cov(x, y) = cov(y, x).
When we have p vectors, v1, v2, ... , vp, each containing n observations for p different variables, the sample covariances are most commonly given by the sample covariance matrix, Σ, where
Σij = cov(vi, vj).
This matrix is symmetric, since Σij = Σji. If we create a matrix V whose columns are the vectors v1, v2, ... vp once the variables have been centered to have mean 0, then the covariance matrix is given by:
$$\mathrm{cov}(\mathbf{V}) = \Sigma = \frac{1}{n-1}\mathbf{V}^T\mathbf{V}.$$
The $j$th diagonal element of this matrix gives the variance of $\mathbf{v}_j$ since
$$\Sigma_{jj} = \mathrm{cov}(\mathbf{v}_j, \mathbf{v}_j) = \frac{1}{n-1}(\mathbf{v}_j - \bar{\mathbf{v}}_j)^T(\mathbf{v}_j - \bar{\mathbf{v}}_j) \qquad (2.2)$$
$$= \frac{1}{n-1}\|\mathbf{v}_j - \bar{\mathbf{v}}_j\|_2^2 \qquad (2.3)$$
$$= \mathrm{var}(\mathbf{v}_j) \qquad (2.4)$$
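As a numerical sanity check (assuming numpy and some randomly generated data, neither of which is part of the original notes), the matrix formula above can be compared against numpy's built-in covariance estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))          # 50 observations on 3 variables

V = X - X.mean(axis=0)                # center each column to have mean 0
S = V.T @ V / (X.shape[0] - 1)        # 1/(n-1) V^T V

print(np.allclose(S, np.cov(X, rowvar=False)))            # True
print(np.allclose(np.diag(S), X.var(axis=0, ddof=1)))     # diagonal = sample variances
```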
When two variables are completely uncorrelated, their covariance is zero.
This lack of correlation would be seen in a covariance matrix with a diagonal structure. That is, if $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_p$ are uncorrelated with individual variances $\sigma_1^2, \sigma_2^2, \dots, \sigma_p^2$ respectively, then the corresponding covariance matrix is:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_p^2 \end{pmatrix}$$
Furthermore, for variables which are independent and identically distributed (take for instance the error terms in a linear regression model, which are assumed to be independent and normally distributed with mean 0 and constant variance $\sigma^2$), the covariance matrix is a multiple of the identity matrix:
$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{pmatrix} = \sigma^2\mathbf{I}$$
Transforming our variables in a such a way that their covariance matrix becomes diagonal will be our goal in Chapter 7. Theorem 2.2.1: Properties of Covariance Matrices
The following mathematical properties stem from Equation 2.1. Let Xn×p be a matrix of data containing n observations on p variables. If A is a constant matrix (or vector, in the first case) then
$$\mathrm{cov}(\mathbf{X}\mathbf{A}) = \mathbf{A}^T\mathrm{cov}(\mathbf{X})\mathbf{A} \qquad \text{and} \qquad \mathrm{cov}(\mathbf{X} + \mathbf{A}) = \mathrm{cov}(\mathbf{X})$$
2.2.2 Mahalanobis Distance

Mahalanobis distance is similar to Euclidean distance, but takes into account the correlation of the variables. This metric is relatively common in data mining applications like classification. Suppose we have p variables which have some covariance matrix, Σ. Then the Mahalanobis distance between two observations, $\mathbf{x} = (x_1\ x_2\ \dots\ x_p)^T$ and $\mathbf{y} = (y_1\ y_2\ \dots\ y_p)^T$, is given by
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T\Sigma^{-1}(\mathbf{x} - \mathbf{y})}.$$
If the covariance matrix is diagonal (meaning the variables are uncorrelated) then the Mahalanobis distance reduces to Euclidean distance normalized by the variance of each variable:
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^p \frac{(x_i - y_i)^2}{s_i^2}} = \|\Sigma^{-1/2}(\mathbf{x} - \mathbf{y})\|_2.$$
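A short numpy sketch of the Mahalanobis calculation; the covariance matrix and the two points below are made-up values used only for illustration:

```python
import numpy as np

# Toy covariance matrix (an assumption for illustration) and two points
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

diff = x - y
d_mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)
d_euclidean = np.linalg.norm(diff)
print(d_mahalanobis, d_euclidean)

# With a diagonal covariance, the same formula reduces to the
# variance-scaled Euclidean distance shown above
Sigma_diag = np.diag([4.0, 3.0])
print(np.sqrt(diff @ np.linalg.inv(Sigma_diag) @ diff),
      np.sqrt(np.sum(diff**2 / np.array([4.0, 3.0]))))
```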
2.2.3 Angular Distance The inner product between two vectors can provide useful information about their relative orientation in space and about their similarity. For example, to find the cosine of the angle between two vectors in n-space, the inner product of their corresponding unit vectors will provide the result. This cosine is often used as a measure of similarity or correlation between two vectors. Definition 2.2.3: Cosine of Angle between Vectors
The cosine of the angle between two vectors in n-space is given by
$$\cos(\theta) = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|_2\,\|\mathbf{y}\|_2}$$
(Figure: the angle θ between two vectors x and y.)
This angular distance is at the heart of Pearson’s correlation coefficient.
2.2.4 Correlation

Pearson’s correlation is a normalized version of the covariance, so that not only the sign of the coefficient is meaningful, but its magnitude is meaningful in measuring the strength of the linear association.
Example 2.2.1: Pearson’s Correlation and Cosine Distance
You may recall the formula for Pearson’s correlation between variable x and y with a sample size of n to be as follows:
$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\ \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$
If we let x¯ be a vector that contains x¯ repeated n times, like we did in Example 2.1.2, and let y¯ be a vector that contains y¯ then Pearson’s coefficient can be written as:
$$r = \frac{(\mathbf{x} - \bar{\mathbf{x}})^T(\mathbf{y} - \bar{\mathbf{y}})}{\|\mathbf{x} - \bar{\mathbf{x}}\|\,\|\mathbf{y} - \bar{\mathbf{y}}\|}$$
In other words, it is just the cosine of the angle between the two vectors once they have been centered to have mean 0. This makes sense: correlation is a measure of the extent to which the two variables share a line in space. If the cosine of the angle is positive or negative one, the angle between the two vectors is 0° or 180°; thus, the two vectors are perfectly correlated, or collinear.

It is difficult to visualize the angle between two variable vectors because they exist in n-space, where n is the number of observations in the dataset. Unless we have fewer than 3 observations, we cannot draw these vectors or even picture them in our minds. As it turns out, this angular measurement does translate into something we can conceptualize: Pearson’s correlation coefficient is the cosine of the angle formed between the two possible regression lines using the centered data: y regressed on x and x regressed on y. This is illustrated in Figure 2.2. To compute the matrix of pairwise correlations between variables $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \dots, \mathbf{x}_p$ (columns containing n observations for each variable), we’d first center them to have mean zero, then normalize them to have length $\|\mathbf{x}_i\| = 1$, and then compose the matrix $\mathbf{X} = [\mathbf{x}_1|\mathbf{x}_2|\mathbf{x}_3|\dots|\mathbf{x}_p]$. Using this centered and normalized data, the correlation matrix is simply
$$\mathbf{C} = \mathbf{X}^T\mathbf{X}.$$
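A minimal numpy sketch of this construction (random data assumed purely for illustration) verifies that centering and normalizing the columns before taking $\mathbf{X}^T\mathbf{X}$ reproduces the usual correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))              # 30 observations on 4 variables

Xc = X - X.mean(axis=0)                   # center each column
Xn = Xc / np.linalg.norm(Xc, axis=0)      # scale each column to unit length

C = Xn.T @ Xn                             # correlation matrix via inner products
print(np.allclose(C, np.corrcoef(X, rowvar=False)))   # True

# Each off-diagonal entry is the cosine of the angle between two centered columns
print(C[0, 1])
```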
2.3 Orthogonality
Orthogonal (or perpendicular) vectors have an angle between them of 90°, meaning that their cosine (and subsequently their inner product) is zero.
Figure 2.2: Correlation Coefficient r and Angle between Regression Lines. The two centered regression lines, y = f(x) and x = f(y), meet at an angle θ with r = cos(θ).
Definition 2.3.1: Orthogonality
Two vectors, x and y, are orthogonal in n-space if their inner product is zero:
$$\mathbf{x}^T\mathbf{y} = 0$$
Combining the notion of orthogonality and unit vectors, we can define an orthonormal set of vectors, or an orthonormal matrix. Remember, for a unit vector, $\mathbf{x}^T\mathbf{x} = 1$.
Definition 2.3.2: Orthonormal Sets
The n × 1 vectors {x1, x2, x3, ... , xp} form an orthonormal set if and only if
1. $\mathbf{x}_i^T\mathbf{x}_j = 0$ when $i \neq j$, and
2. $\mathbf{x}_i^T\mathbf{x}_i = 1$ (equivalently $\|\mathbf{x}_i\| = 1$).
In other words, an orthonormal set is a collection of unit vectors which are mutually orthogonal.
If we form a matrix, $\mathbf{X} = (\mathbf{x}_1|\mathbf{x}_2|\mathbf{x}_3|\dots|\mathbf{x}_p)$, having an orthonormal set of vectors as columns, we will find that multiplying the matrix by its transpose provides a nice result:
$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_p^T \end{pmatrix}\begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \dots & \mathbf{x}_p \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T\mathbf{x}_1 & \mathbf{x}_1^T\mathbf{x}_2 & \dots & \mathbf{x}_1^T\mathbf{x}_p \\ \mathbf{x}_2^T\mathbf{x}_1 & \mathbf{x}_2^T\mathbf{x}_2 & \dots & \mathbf{x}_2^T\mathbf{x}_p \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_p^T\mathbf{x}_1 & \mathbf{x}_p^T\mathbf{x}_2 & \dots & \mathbf{x}_p^T\mathbf{x}_p \end{pmatrix}$$
$$= \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix} = \mathbf{I}_p$$
We will be particularly interested in these types of matrices when they are square. If X is a square matrix with orthonormal columns, the arithmetic above means that the inverse of X is $\mathbf{X}^T$ (i.e., X also has orthonormal rows):
$$\mathbf{X}^T\mathbf{X} = \mathbf{X}\mathbf{X}^T = \mathbf{I}.$$
Square matrices with orthonormal columns are called orthogonal matrices. Definition 2.3.3: Orthogonal (or Orthonormal) Matrix
A square matrix, U, with orthonormal columns also has orthonormal rows and is called an orthogonal matrix. Such a matrix has an inverse which is equal to its transpose:
$$\mathbf{U}^T\mathbf{U} = \mathbf{U}\mathbf{U}^T = \mathbf{I}$$
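As a quick illustration (not tied to any example in the text, and assuming numpy), a 2×2 rotation matrix is orthogonal, and its transpose acts as its inverse:

```python
import numpy as np

theta = np.pi / 6
U = np.array([[np.cos(theta), -np.sin(theta)],    # a 2x2 rotation matrix:
              [np.sin(theta),  np.cos(theta)]])   # its columns are orthonormal

print(np.allclose(U.T @ U, np.eye(2)))     # True
print(np.allclose(U @ U.T, np.eye(2)))     # True
print(np.allclose(np.linalg.inv(U), U.T))  # the inverse equals the transpose
```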
2.4 Outer Products
The outer product of two vectors $\mathbf{x} \in \mathbb{R}^m$ and $\mathbf{y} \in \mathbb{R}^n$, written $\mathbf{x}\mathbf{y}^T$, is an m × n matrix with rank 1. To see this basic fact, let’s just look at an example.
Example 2.4.1: Outer Product
Let $\mathbf{x} = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}$ and let $\mathbf{y} = \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}$. Then the outer product of x and y is:
$$\mathbf{x}\mathbf{y}^T = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}\begin{pmatrix} 2 & 1 & 3 \end{pmatrix} = \begin{pmatrix} 2 & 1 & 3 \\ 4 & 2 & 6 \\ 6 & 3 & 9 \\ 8 & 4 & 12 \end{pmatrix}$$
which clearly has rank 1. It should be clear from this example that computing an outer product will always result in a matrix whose rows and columns are multiples of each other.
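The example above is easy to reproduce numerically; this sketch assumes numpy and uses only the x and y from Example 2.4.1:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 1, 3])

M = np.outer(x, y)                  # the 4x3 matrix from Example 2.4.1
print(M)
print(np.linalg.matrix_rank(M))     # 1: every row/column is a multiple of another
```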
Example 2.4.2: Centering Data with an Outer Product
As we’ve seen in previous examples, many statistical formulas involve the centered data, that is, data from which the mean has been subtracted so that the new mean is zero. Suppose we have a matrix of data containing observations of individuals’ heights (h) in inches, weights (w), in pounds and wrist sizes (s), in inches:
The rows of A correspond to the five people and the columns to the variables h, w, and s:

A:         h     w     s
person1   60   102   5.5
person2   72   170   7.5
person3   66   110   6.0
person4   69   128   6.5
person5   63   130   7.0

The average values for height, weight, and wrist size are as follows:
$$\bar{h} = 66 \qquad (2.5)$$
$$\bar{w} = 128 \qquad (2.6)$$
$$\bar{s} = 6.5 \qquad (2.7)$$
To center all of the variables in this data set simultaneously, we could compute an outer product using a vector containing the means and a vector of all ones:
$$\begin{pmatrix} 60 & 102 & 5.5 \\ 72 & 170 & 7.5 \\ 66 & 110 & 6.0 \\ 69 & 128 & 6.5 \\ 63 & 130 & 7.0 \end{pmatrix} - \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}\begin{pmatrix} 66 & 128 & 6.5 \end{pmatrix} = \begin{pmatrix} 60 & 102 & 5.5 \\ 72 & 170 & 7.5 \\ 66 & 110 & 6.0 \\ 69 & 128 & 6.5 \\ 63 & 130 & 7.0 \end{pmatrix} - \begin{pmatrix} 66 & 128 & 6.5 \\ 66 & 128 & 6.5 \\ 66 & 128 & 6.5 \\ 66 & 128 & 6.5 \\ 66 & 128 & 6.5 \end{pmatrix}$$
$$= \begin{pmatrix} -6.0 & -26.0 & -1.0 \\ 6.0 & 42.0 & 1.0 \\ 0 & -18.0 & -0.5 \\ 3.0 & 0 & 0 \\ -3.0 & 2.0 & 0.5 \end{pmatrix}$$
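A numpy sketch of the same centering computation, using the data of Example 2.4.2 (numpy is assumed; it is not part of the original notes):

```python
import numpy as np

A = np.array([[60, 102, 5.5],    # the height/weight/wrist data from Example 2.4.2
              [72, 170, 7.5],
              [66, 110, 6.0],
              [69, 128, 6.5],
              [63, 130, 7.0]])

n = A.shape[0]
means = np.ones(n) @ A / n           # (1/n) e^T A gives the column means (66, 128, 6.5)
A_centered = A - np.outer(np.ones(n), means)

print(A_centered)
print(A_centered.mean(axis=0))       # each column now has (numerically) zero mean
```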
Exercises

1. Let $\mathbf{u} = \begin{pmatrix} 1 \\ 2 \\ -4 \\ -2 \end{pmatrix}$ and $\mathbf{v} = \begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix}$.
a. Determine the Euclidean distance between u and v.
b. Find a vector of unit length in the direction of u.
c. Determine the cosine of the angle between u and v.
d. Find the 1- and ∞-norms of u and v.
e. Suppose these vectors are observations on four independent variables, which have the following covariance matrix:
$$\Sigma = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
Determine the Mahalanobis distance between u and v.

2. Let
$$\mathbf{U} = \frac{1}{3}\begin{pmatrix} -1 & 2 & 0 & -2 \\ 2 & 2 & 0 & 1 \\ 0 & 0 & 3 & 0 \\ -2 & 1 & 0 & 2 \end{pmatrix}$$
a. Show that U is an orthogonal matrix.
b. Let $\mathbf{b} = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$. Solve the equation $\mathbf{U}\mathbf{x} = \mathbf{b}$.
3. Write a matrix expression for the correlation matrix, C, for a matrix of centered data, X, where $C_{ij} = r_{ij}$ is Pearson’s correlation measure between variables $\mathbf{x}_i$ and $\mathbf{x}_j$. To do this, we need more than an inner product: we need to normalize the rows and columns by the norms $\|\mathbf{x}_i\|$. For a hint, see Exercise 2 in Chapter 1.
4. Suppose you have a matrix of data, An×p, containing n observations on p variables. Develop a matrix formula for the standardized data (where the mean of each variable should be subtracted from the corresponding column before dividing by the standard deviation). Hint: use Exercises 1(f) and 4 from Chapter 1 along with Example 2.4.2.
5. Explain why, for any norm or distance metric,
$$\|\mathbf{x} - \mathbf{y}\| = \|\mathbf{y} - \mathbf{x}\|$$

6. Find two vectors which are orthogonal to $\mathbf{x} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$.
7. Pythagorean Theorem. Show that x and y are orthogonal if and only if
$$\|\mathbf{x} + \mathbf{y}\|_2^2 = \|\mathbf{x}\|_2^2 + \|\mathbf{y}\|_2^2$$
(Hint: Recall that $\|\mathbf{x}\|_2^2 = \mathbf{x}^T\mathbf{x}$.)
CHAPTER 3
LINEAR COMBINATIONS AND LINEAR INDEPENDENCE
One of the most central ideas in all of Linear Algebra is that of linear indepen- dence. For regression problems, it is repeatedly stressed that multicollinearity is problematic. Multicollinearity is simply a statistical term for linear dependence. It’s bad. We will see the reason for this shortly, but first we have to develop the notion of a linear combination.
3.1 Linear Combinations
Definition 3.1.1: Linear Combination
A linear combination is constructed from a set of terms v1, v2, ... , vn by multiplying each term by a constant and adding the result:
$$\mathbf{c} = \alpha_1\mathbf{v}_1 + \alpha_2\mathbf{v}_2 + \cdots + \alpha_n\mathbf{v}_n = \sum_{i=1}^n \alpha_i\mathbf{v}_i$$
The coefficients αi are scalar constants and the terms, {vi} can be scalars, vectors, or matrices.
If we dissect our formula for a system of linear equations, Ax = b, we will find that the right-hand side vector b can be expressed as a linear combination of the columns in the coefficient matrix, A.
$$\mathbf{b} = \mathbf{A}\mathbf{x} \qquad (3.1)$$
$$\mathbf{b} = (\mathbf{A}_1|\mathbf{A}_2|\dots|\mathbf{A}_n)\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \qquad (3.2)$$
$$\mathbf{b} = x_1\mathbf{A}_1 + x_2\mathbf{A}_2 + \cdots + x_n\mathbf{A}_n \qquad (3.3)$$
A concrete example of this expression is given in Example 3.1.1. Example 3.1.1: Systems of Equations as Linear Combinations
Consider the following system of equations:
3x1 + 2x2 + 9x3 = 1 (3.4)
4x1 + 2x2 + 3x3 = 5 (3.5)
2x1 + 7x2 + x3 = 0 (3.6)
We can write this as a matrix-vector product $\mathbf{A}\mathbf{x} = \mathbf{b}$ where
$$\mathbf{A} = \begin{pmatrix} 3 & 2 & 9 \\ 4 & 2 & 3 \\ 2 & 7 & 1 \end{pmatrix}, \qquad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \qquad \mathbf{b} = \begin{pmatrix} 1 \\ 5 \\ 0 \end{pmatrix}$$
We can also write b as a linear combination of columns of A:
$$x_1\begin{pmatrix} 3 \\ 4 \\ 2 \end{pmatrix} + x_2\begin{pmatrix} 2 \\ 2 \\ 7 \end{pmatrix} + x_3\begin{pmatrix} 9 \\ 3 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 5 \\ 0 \end{pmatrix}$$
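A minimal numpy sketch (numpy is assumed; it is not part of the original notes) that solves this system and then confirms that b really is the stated combination of the columns of A:

```python
import numpy as np

A = np.array([[3.0, 2.0, 9.0],    # the coefficient matrix from Example 3.1.1
              [4.0, 2.0, 3.0],
              [2.0, 7.0, 1.0]])
b = np.array([1.0, 5.0, 0.0])

x = np.linalg.solve(A, b)         # solve Ax = b

# b is the linear combination x1*A1 + x2*A2 + x3*A3 of the columns of A
combo = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
print(np.allclose(combo, b))      # True
print(np.allclose(A @ x, combo))  # matrix-vector product = column combination
```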
Similarly, if we have a matrix-matrix product, we can write each column of the result as a linear combination of columns of the first matrix. Let $\mathbf{A}_{m\times n}$, $\mathbf{X}_{n\times p}$, and $\mathbf{B}_{m\times p}$ be matrices. If we have $\mathbf{A}\mathbf{X} = \mathbf{B}$ then
$$(\mathbf{A}_1|\mathbf{A}_2|\dots|\mathbf{A}_n)\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} = (\mathbf{B}_1|\mathbf{B}_2|\dots|\mathbf{B}_p)$$
and we can write
$$\mathbf{B}_j = \mathbf{A}\mathbf{X}_j = x_{1j}\mathbf{A}_1 + x_{2j}\mathbf{A}_2 + x_{3j}\mathbf{A}_3 + \cdots + x_{nj}\mathbf{A}_n.$$
A concrete example of this expression is given in Example 3.1.2.
Example 3.1.2: Linear Combinations in Matrix-Matrix Products
Suppose we have the following matrix formula:
AX = B
where
$$\mathbf{A} = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 2 \\ 3 & 2 & 1 \end{pmatrix}, \qquad \mathbf{X} = \begin{pmatrix} 5 & 6 \\ 9 & 5 \\ 7 & 8 \end{pmatrix}.$$
Then
$$\mathbf{B} = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 2 \\ 3 & 2 & 1 \end{pmatrix}\begin{pmatrix} 5 & 6 \\ 9 & 5 \\ 7 & 8 \end{pmatrix} \qquad (3.7)$$
$$= \begin{pmatrix} 2(5)+1(9)+3(7) & 2(6)+1(5)+3(8) \\ 1(5)+4(9)+2(7) & 1(6)+4(5)+2(8) \\ 3(5)+2(9)+1(7) & 3(6)+2(5)+1(8) \end{pmatrix} \qquad (3.8)$$
and we can immediately notice that the columns of B are linear combi- nations of columns of A:
$$\mathbf{B}_1 = 5\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} + 9\begin{pmatrix} 1 \\ 4 \\ 2 \end{pmatrix} + 7\begin{pmatrix} 3 \\ 2 \\ 1 \end{pmatrix}$$
$$\mathbf{B}_2 = 6\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} + 5\begin{pmatrix} 1 \\ 4 \\ 2 \end{pmatrix} + 8\begin{pmatrix} 3 \\ 2 \\ 1 \end{pmatrix}$$
We may also notice that the rows of B can be expressed as a linear combination of rows of X: