A (Very) Brief Primer on Topics in Linear Algebra

For statistics, the main topics from Math 313 that you need to remember are (i) operations such as addition, subtraction, etc., (ii) matrix-specific operations such as the transpose, inverse, and determinant and (iii) matrix properties such as symmetric and positive definite. This worksheet is designed to refresh these concepts in your mind and teach you how to do matrix operations in R. Complete each of the following first by hand, and then using R (for numeric calculations).

1 Basic Math for Matrices

A matrix is a rectangular or square array of numbers or variables. A vector is a matrix with one column and more than one row. A scalar is a $1 \times 1$ matrix. Traditional notation for matrices and vectors is bold face. For example, $\mathbf{A}$ represents a matrix or a vector while $a$ would represent a scalar. Because this notation can be ambiguous in distinguishing between a matrix and a vector, we will usually give the dimension of the matrix in words. For example, we could say, "Let $\mathbf{Y}$ denote a length $N$ vector," which would mean that $\mathbf{Y}$ is $N \times 1$. Depending on the style of the author, vectors and matrices can be either lower- or upper-case but will almost always be bolded (so $\mathbf{a}$ or $\mathbf{A}$ are matrices or vectors, and you know because they are bolded).

Because matrices are 2-dimensional arrays (they have rows and columns), we use two subscripts to denote the individual elements of a matrix. For example, if $A$ is a $3 \times 3$ matrix, $a_{13}$ would typically denote the number in the 1st row and 3rd column of $A$. Because vectors only have one column, we typically use only a single subscript. For example, $y_5$ would be the 5th element of $Y$ (the element in the 5th row and 1st column of the vector $Y$).

Matrix addition and subtraction is simply adding or subtracting matching elements. For example, the $ij$th element of $(A + B)$ is $a_{ij} + b_{ij}$. However, matrix multiplication is NOT multiplying each individual element. To multiply two matrices $A$ and $B$, you would (i) multiply the elements in each row of $A$ by the elements in each column of $B$ and then (ii) add the results. Notationally, if $C = AB$ (i.e., $C$ is the matrix that results from the matrix multiplication of $A$ and $B$), then

$$c_{ij} = \sum_{k} a_{ik} b_{kj}$$

for $i = 1, \dots, \text{nrow}(A)$ and $j = 1, \dots, \text{ncol}(B)$, so $C$ would have dimension $\text{nrow}(A) \times \text{ncol}(B)$.

Because of the way that matrix operations are defined, the dimensions of matrices have to match in a way that allows the operation to take place. For example, to add $A$ and $B$, the dimensions of $A$ have to be the same as the dimensions of $B$. Likewise, $A$ and $B$ can only be multiplied if the number of columns in $A$ is equal to the number of rows in $B$. If an operation can be performed on two matrices, then we call the matrices $A$ and $B$ conformable; otherwise they are non-conformable.

The syntax to work with matrices in R is a little weird (until you get used to it). To define the matrix $A$ which has $K$ rows and $J$ columns you would do something like:

A.elements <- c(a11, a21, a31, ..., aKJ)
A <- matrix(A.elements, nrow = K, ncol = J)
A

Notice that R puts the elements of A down the rows first, then moves across the columns. To change this you can use the byrow=TRUE option. For example, try:

A.elements <- 1:6
K <- 2
J <- 3
A1 <- matrix(A.elements, nrow = K, ncol = J)
A2 <- matrix(A.elements, nrow = K, ncol = J, byrow = TRUE)
A1
A2

Notice that in A1 the elements 1:6 are populated down the rows first, whereas in A2 the elements are populated across the columns first. Because matrix multiplication is different from traditional multiplication, R uses the operator %*% to perform matrix multiplication. For example, A%*%B would matrix multiply the matrices A and B. If these matrices are not conformable, R will return an error.
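As a minimal sketch of %*% using the matrices A1 and A2 defined above (the commented-out line is left as a comment because it would error):

# A1 is 2 x 3 and t(A2) is 3 x 2, so this product is conformable (result is 2 x 2)
A1 %*% t(A2)
# A1 and A2 are both 2 x 3, so the product below is non-conformable;
# uncommenting it would produce an error about non-conformable arguments
# A1 %*% A2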

Try the following practice problems both by hand and with R.

1. Consider the following matrices:

$$A = \begin{pmatrix} 1 & 3 \\ 5 & 2 \end{pmatrix}, \qquad B = \begin{pmatrix} 4 & 9 \\ 2 & 6 \end{pmatrix}$$

Compute the following:

(a) $A + B$  (b) $A - B$  (c) $AB$  (d) $BA$

2. Now, let
$$A = \begin{pmatrix} 9 & 1 \\ 1 & 4 \end{pmatrix}, \qquad B = \begin{pmatrix} 2 & 6 & 5 \\ 3 & 1 & 8 \end{pmatrix}$$
Compute each of the following, if possible. If non-conformable, write "non-conformable".

(a) $A + B$  (b) $A - B$  (c) $AB$  (d) $BA$
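If you want to sanity-check your conformability reasoning in R before computing, here is a minimal sketch (the matrices are made up for illustration, not the problem matrices):

M <- matrix(1:4, nrow = 2, ncol = 2)   # a made-up 2 x 2 matrix
N <- matrix(1:6, nrow = 2, ncol = 3)   # a made-up 2 x 3 matrix
dim(M); dim(N)   # check the dimensions before operating
M %*% N          # conformable: (2 x 2) times (2 x 3) gives a 2 x 3 result
# M + N          # non-conformable: addition needs matching dimensions, so this errors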

2 Matrix-Specific Operations

Because matrices are defined as 2-dimensional objects (they have rows and columns), there are a few math operations specific to matrices that we commonly use in statistics. The transpose of a matrix is when we switch the rows and columns. Notationally, we typically use the prime notation ($'$) to specify a matrix transpose. For example, $A'$ (read "A-prime") would be the transpose of $A$. The $ij$th element of $A'$ would simply be the $ji$th element of $A$. A matrix where $A = A'$ is called symmetric.

The trace of a matrix is just the sum of the diagonal elements, so $\text{tr}(A) = \sum_i a_{ii}$. I'll explain its use more when I introduce covariance matrices.

The inverse operation of a matrix, intuitively, is division for matrices. Before we get to defining the inverse of a matrix, however, we have to define the identity matrix, which is typically denoted by $I$. The identity matrix is a special square matrix with 1's along the diagonal and 0's everywhere else and is the equivalent of the number 1 in matrix algebra. Just like any number multiplied by 1 is just the number, any matrix multiplied by the identity matrix is just the original matrix. Mathematically, $AI = IA = A$ for any matrix $A$ where $I$ is the conforming identity matrix (i.e., $I$ is the identity matrix of appropriate size to do the multiplication). We say that a square matrix $A$ has an inverse $A^{-1}$ if $AA^{-1} = A^{-1}A = I$ (note that non-square matrices do not have inverses). You can see that this is the equivalent of division for matrices by noticing that for any non-zero number $x$, $xx^{-1} = x^{-1}x = x/x = 1$. It is important to keep in mind that not all matrices are invertible. If a matrix has an inverse we say it is invertible or non-singular. If a matrix inverse does not exist we call it singular. Most of the matrices that are interesting in statistics are non-singular (they have an inverse). Calculating $A^{-1}$ is a long and arduous process (which you may remember from Math 313) so I won't go over how to calculate an inverse by hand (we'll let the computer do that). However, it is important to keep in mind that calculating matrix inverses can be very time consuming, even for computers, when the dimensions of $A$ are large (i.e., $A$ has lots of rows and columns).

The last matrix operation that you need to know is the determinant of a matrix. The determinant of a square matrix $A$ (non-square matrices don't have determinants) is a single numerical summary that quantifies the "volume" of a matrix (i.e., how much space is taken up by the matrix in matrix-space). I know this definition isn't that intuitive, but it'll make more sense when we talk about covariance matrices, so I'll let that definition do for now. Notationally, the determinant is given by $|A|$ or $\det(A)$. To calculate the determinant you have to sum all possible products of the elements in the matrix in a certain fashion. Obviously, we aren't going to be doing that by hand; we'll let the computer do it for us. But know that calculating $|A|$ can be very slow (even for computers) when $A$ is big.

A few interesting properties of these matrix operations that are used repeatedly throughout statistical concepts are the following:

• $(A^{-1})^{-1} = A$

• $(A')^{-1} = (A^{-1})'$

• $(AB)^{-1} = B^{-1}A^{-1}$

• $(AB)' = B'A'$

• $|cA| = c^n|A|$ where $A$ is $n \times n$ and $c$ is a scalar.

• $|A| = \prod_{i=1}^{N} a_{ii}$ when $A$ is diagonal (zeros everywhere except the diagonal elements) or lower- or upper-triangular (meaning that everything below or above the diagonal is 0).

• $|A^{-1}| = 1/|A|$ if $A$ is non-singular

• $|A'| = |A|$

• $|AB| = |A||B|$

While I won't ask you to prove any of these properties (that is what Math 313 does), you will need to remember them because we'll use them frequently. In R, t(A) is the transpose of the matrix A, solve(A) is the inverse of A, sum(diag(A)) is the trace of A (there isn't a function set aside to do this, but diag() just extracts the diagonal elements of A), and det(A) is the determinant of A.
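As a minimal sketch of these functions (the matrices are made up for illustration), you can also verify a couple of the properties above numerically:

A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a made-up invertible matrix
B <- matrix(c(1, 0, 2, 1), nrow = 2)   # another made-up invertible matrix
t(A)           # transpose
solve(A)       # inverse
sum(diag(A))   # trace
det(A)         # determinant
all.equal(solve(A %*% B), solve(B) %*% solve(A))   # checks (AB)^(-1) = B^(-1) A^(-1)
all.equal(det(A %*% B), det(A) * det(B))           # checks |AB| = |A||B|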

In R, do the following problems (#3 you'll prove by hand, not in R).

1. For each of the following, find the transpose and specify whether or not the matrix is symmetric:

(a) $A = \begin{pmatrix} 4 & 3 \\ 11 & 2 \end{pmatrix}$

(b) $B = \begin{pmatrix} 2 & 8 \\ 8 & 12 \end{pmatrix}$

(c) $C = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 3 & 5 \\ 2 & 5 & 8 \end{pmatrix}$

(d) $D = \begin{pmatrix} 4 & 4 \\ 2 & 4 \end{pmatrix}$

2. Using matrices $A$ and $B$ from question 1, calculate the following inverses:

(a) $A^{-1}$  (b) $B^{-1}$  (c) $(AB)^{-1}$  (d) $B^{-1}A^{-1}$, and verify that $(AB)^{-1} = B^{-1}A^{-1}$

3. Using properties of inverses and transposes, show that the following is true for any invertible, square matrices $A$, $B$, $C$, and $D$ of the same size:

$$[(A + B)'D^{-1}C^{-1}]^{-1} = CD(A' + B')^{-1}$$

4. Find the determinant of each of the following:

(a) $A = \begin{pmatrix} 4 & 3 \\ 11 & 2 \end{pmatrix}$

(b) $B = \begin{pmatrix} 2 & 8 & 4 & 0 \\ 5 & 333 & 1 & 0 \\ 1 & 1 & 7 & 0 \\ 6 & 10 & 423 & 0 \end{pmatrix}$

(c) $C = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 3 & 5 \\ 2 & 5 & 8 \end{pmatrix}$

(d) $C'$, and verify $|C'| = |C|$

(e) $D = \begin{pmatrix} 4 & 0 \\ 0 & 2 \end{pmatrix}$

(f) $D^{-1}$, and verify $|D^{-1}| = 1/|D|$

3 Working with Matrices via Partitioning

Very often in statistics, we will end up partitioning (splitting up) matrices into groups of sub-matrices (or partitioning a vector into groups of sub-vectors). For example, we will often take a square matrix $A$ and partition it as
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$
To illustrate, let the $4 \times 4$ matrix $A$ be partitioned as
$$A = \begin{pmatrix} 7 & 2 & 5 & 8 \\ -3 & 4 & 0 & 2 \\ 9 & 3 & 6 & 5 \\ 3 & 1 & 2 & 1 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$
then
$$A_{11} = \begin{pmatrix} 7 & 2 & 5 \\ -3 & 4 & 0 \end{pmatrix}, \quad A_{12} = \begin{pmatrix} 8 \\ 2 \end{pmatrix}, \quad A_{21} = \begin{pmatrix} 9 & 3 & 6 \\ 3 & 1 & 2 \end{pmatrix}, \quad A_{22} = \begin{pmatrix} 5 \\ 1 \end{pmatrix}$$

To partition matrices in R, you use the bracket notation. For example, if A is a matrix that you created earlier, A[1:3,5:9] would grab rows 1-3 and columns 5-9 of A (a short sketch of this notation appears after problem 2 below). Try the following problems.

1. In R, create the matrix $A$ from above and partition it as shown. Assign appropriate values to A11, A12, A21 and A22.

2. Define $A$ and $B$ such that
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

where each $A_{ij}$ and $B_{lm}$ is a $p \times p$ matrix. Write out the following in terms of the sub-matrices:

(a) $AB$  (b) $A'$
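Here is the promised sketch of the bracket notation (the matrix is made up for illustration; problem 1 asks you to do the same with the $4 \times 4$ matrix $A$ above):

M <- matrix(1:16, nrow = 4, ncol = 4)   # a made-up 4 x 4 matrix
M11 <- M[1:2, 1:2]   # upper-left 2 x 2 block
M12 <- M[1:2, 3:4]   # upper-right 2 x 2 block
M21 <- M[3:4, 1:2]   # lower-left 2 x 2 block
M22 <- M[3:4, 3:4]   # lower-right 2 x 2 block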

4 Other Linear Algebra Topics

Eigenvalues and eigenvectors. Consider a square matrix $A$ of any dimension and a conforming vector $x$ (so if $A$ is $P \times P$ then $x$ is $P \times 1$) such that I can take the matrix product $Ax = y$ (where the dimension of $y$ would also be $P \times 1$). Taking the matrix product $Ax$ is linearly transforming $x$ to the vector $y$. Linearly transforming vectors in this way is something we do all the time in statistics (it turns out, as you'll learn in 535, that regression is just linearly transforming your response variable $y$ to get the $\beta$ coefficients). Often, we want to understand this linear transformation in greater detail, so we will look at the eigenvalue and eigenvector decomposition of the matrix $A$. A linear transformation of $x$ by the matrix $A$ does two things: (i) rotates the vector and (ii) scales it. To understand the rotations and scaling performed by $A$ on $x$, we decompose $A$ as

$$A = CDC' \tag{1}$$

where $C$ is a $P \times P$ matrix whose columns are called the eigenvectors of $A$, and $D$ is a $P \times P$ diagonal matrix with entries $\lambda_1, \dots, \lambda_P$, where the $\lambda$'s are called the eigenvalues of $A$ (this form of the decomposition, with the $C$ described below, holds when $A$ is symmetric, which covers the matrices we will care about most). For convenience we often order the eigenvalues such that $\lambda_1 > \lambda_2 > \cdots > \lambda_P$. Further, $C$ is an orthonormal matrix, which means that if $c_p$ is the $p$th eigenvector then $c_p'c_p = 1$ and $c_{p_1}'c_{p_2} = 0$ if $p_1 \neq p_2$ (intuitively, this just means that each eigenvector is unique and can't be created from any other eigenvector). Intuitively, the eigenvectors in $C$ quantify the rotations and the eigenvalues in $D$ quantify the scalings of $x$ by the matrix $A$ under the linear transformation $Ax$. In this way, we can carefully examine $C$ and $D$ to further understand the linear transformation. The eigenvectors and eigenvalues always come in pairs: the first eigenvector and first eigenvalue are paired together because the first eigenvector quantifies the first rotation and the first eigenvalue quantifies the first scaling of the vector $x$.

Calculating the eigendecomposition in Equation (1) makes other matrix operations easier. For example, $|A| = \prod_{p=1}^{P} \lambda_p$ and $A^{-1} = CD^{-1}C'$ where $D^{-1} = \text{diag}(\lambda_1^{-1}, \dots, \lambda_P^{-1})$. However, computing the eigendecomposition in (1) can be very time consuming (even for a computer) if the dimensions of $A$ are large. In R, the eigen() function calculates the eigendecomposition of a matrix and returns a list with components vectors and values.
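As a minimal sketch of eigen() (the symmetric matrix is made up for illustration):

A <- matrix(c(5, 2, 2, 3), nrow = 2)   # a made-up symmetric matrix
eig <- eigen(A)
C <- eig$vectors        # columns are the eigenvectors of A
D <- diag(eig$values)   # diagonal matrix of the eigenvalues
all.equal(A, C %*% D %*% t(C))                            # reconstructs A = CDC'
all.equal(det(A), prod(eig$values))                       # |A| equals the product of the eigenvalues
all.equal(solve(A), C %*% diag(1 / eig$values) %*% t(C))  # A^(-1) = C D^(-1) C'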

The Square Root of Matrices. In statistics we are often interested in finding the square root of a matrix. Notationally, we want to find $A^{1/2}$ such that $A = A^{1/2}(A^{1/2})'$. First, it is important to point out that not all matrices have a square root (just like the square roots of negative numbers don't exist, technically). For a matrix $A$ to have a square root it must be (i) square, (ii) symmetric and (iii) positive definite. I'll define what a positive definite matrix is later when I define a covariance matrix because I think the intuition is easier, but if you are super curious you can look it up on your own on Wikipedia.

We can define $A^{1/2}$ in a few different ways. The first is using the eigendecomposition of $A$ in (1) above: namely, we set $A^{1/2} = CD^{1/2}$ where $D^{1/2} = \text{diag}(\lambda_1^{1/2}, \dots, \lambda_P^{1/2})$ is the diagonal matrix of the square roots of the eigenvalues. You should be able to see that if we define $A^{1/2}$ this way then $A = A^{1/2}(A^{1/2})'$ (if you don't, then spend a minute to prove it). Because the eigendecomposition of the matrix $A$ can be slow to compute, a faster way to get the square root of a matrix is by using the Cholesky decomposition of a matrix. The French mathematician André-Louis Cholesky found that matrices where a square root exists can be decomposed as
$$A = LL' \tag{2}$$
where $L$ is a lower triangular matrix (meaning that everything above the diagonal is 0). This decomposition is particularly appealing because it is faster to compute than eigendecompositions, and lower triangular matrices are particularly easy to work with. So, often we will define $A^{1/2}$ as the Cholesky matrix $L$, from which it follows immediately that $A^{1/2}(A^{1/2})' = LL' = A$. In R, the chol() command returns $L'$ (the transpose of $L$ rather than $L$) if it exists.

Calculus results with vectors. In statistical methods, we often encounter two forms of vector and matrix products: (i) $a'x$ and (ii) $x'Ax$, where $a$ is a vector and $A$ is a matrix. We often wish to take the derivative of these products with respect to the vector $x$ (these forms come up, for example, when trying to find maximum likelihood estimates of regression coefficients, as you'll work out later). When taking (multivariate) derivatives of these forms with respect to the vector $x$ we get the following results:
$$\frac{\partial a'x}{\partial x} = a \tag{3}$$
$$\frac{\partial x'Ax}{\partial x} = 2Ax \quad \text{(for symmetric } A\text{)}. \tag{4}$$
We will use these properties of derivatives repeatedly when finding maxima, etc.

Practice Problem.

1. Using R, for the positive definite matrix
$$A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$$
(a) Find the eigendecomposition and the eigen-square root of $A$ and confirm $A^{1/2}(A^{1/2})' = A$.
(b) Find the Cholesky decomposition of $A$ and confirm the Cholesky square root satisfies $A^{1/2}(A^{1/2})' = A$.
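As a minimal sketch of both square-root constructions (using a made-up positive definite matrix; the practice problem asks you to do the same for the matrix above):

S <- matrix(c(4, 1, 1, 3), nrow = 2)   # a made-up positive definite matrix
# eigen-square root: S^(1/2) = C D^(1/2)
eig <- eigen(S)
S.half <- eig$vectors %*% diag(sqrt(eig$values))
all.equal(S, S.half %*% t(S.half))   # confirms S^(1/2) (S^(1/2))' = S
# Cholesky square root: chol() returns L', so transpose to get the lower triangular L
L <- t(chol(S))
all.equal(S, L %*% t(L))             # confirms L L' = S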
