
Data, Covariance, and Correlation Matrix

Nathaniel E. Helwig

Assistant Professor of Psychology and Statistics, University of Minnesota (Twin Cities)

Updated 16-Jan-2017

Nathaniel E. Helwig (U of Minnesota), Data, Covariance, and Correlation Matrix, Updated 16-Jan-2017

Copyright

Copyright © 2017 by Nathaniel E. Helwig

Outline of Notes

1) The Data Matrix
   Definition
   Properties
   R code

2) The Covariance Matrix
   Definition
   Properties
   R code

3) The Correlation Matrix
   Definition
   Properties
   R code

4) Miscellaneous Topics
   Crossproduct calculations
   Vec and Kronecker
   Visualizing data


The Data Matrix

The Organization of Data

The data matrix refers to the array of numbers
\[
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
x_{31} & x_{32} & \cdots & x_{3p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\]

where $x_{ij}$ is the j-th variable collected from the i-th item (e.g., subject).
  items/subjects are rows
  variables are columns

X is a data matrix of order n × p (# items by # variables).

Collection of Column Vectors

We can view a data matrix as a collection of column vectors:
\[
X = \begin{bmatrix} x_1 & x_2 & \cdots & x_p \end{bmatrix}
\]

where xj is the j-th column of X for j ∈ {1,..., p}.

The n × 1 vector xj gives the j-th variable’s scores for the n items.

Collection of Row Vectors

We can view a data matrix as a collection of row vectors:

 0  x1  x0  X =  2   .   .  0 xn

where $x_i'$ is the i-th row of X for i ∈ {1, …, n}.

The 1 × p vector $x_i'$ gives the i-th item’s scores for the p variables.

Calculating Variable (Column) Means

The sample mean of the j-th variable is given by

\[
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} = n^{-1} 1_n' x_j
\]
where

$1_n$ denotes an n × 1 vector of ones

$x_j$ denotes the j-th column of X

Calculating Item (Row) Means

The sample mean of the i-th item is given by

\[
\bar{x}_i = \frac{1}{p}\sum_{j=1}^{p} x_{ij} = p^{-1} x_i' 1_p
\]
where

$1_p$ denotes a p × 1 vector of ones

$x_i'$ denotes the i-th row of X
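The two matrix expressions above are easy to confirm numerically. This sketch assumes the mtcars data matrix X that is used in the R examples later in these notes:

```r
# Verify the matrix forms of the column and row means on mtcars
X <- as.matrix(mtcars)
n <- nrow(X)
p <- ncol(X)
ones_n <- rep(1, n)
ones_p <- rep(1, p)
# column means: n^{-1} 1_n' x_j, done for every column at once
all.equal(c(t(ones_n) %*% X / n), unname(colMeans(X)))  # TRUE
# row means: p^{-1} x_i' 1_p, done for every row at once
all.equal(c(X %*% ones_p / p), unname(rowMeans(X)))     # TRUE
```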

Data Frame and Matrix Classes in R

> data(mtcars)
> class(mtcars)
[1] "data.frame"
> dim(mtcars)
[1] 32 11
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> X <- as.matrix(mtcars)
> class(X)
[1] "matrix"

Row and Column Means

> # get row means (3 ways)
> rowMeans(X)[1:3]
    Mazda RX4 Mazda RX4 Wag    Datsun 710
     29.90727      29.98136      23.59818
> c(mean(X[1,]), mean(X[2,]), mean(X[3,]))
[1] 29.90727 29.98136 23.59818
> apply(X, 1, mean)[1:3]
    Mazda RX4 Mazda RX4 Wag    Datsun 710
     29.90727      29.98136      23.59818

> # get column means (3 ways)
> colMeans(X)[1:3]
      mpg       cyl      disp
 20.09062   6.18750 230.72188
> c(mean(X[,1]), mean(X[,2]), mean(X[,3]))
[1] 20.09062 6.18750 230.72188
> apply(X, 2, mean)[1:3]
      mpg       cyl      disp
 20.09062   6.18750 230.72188

Other Row and Column Functions

> # get column medians
> apply(X, 2, median)[1:3]
 mpg  cyl disp
19.2  6.0 196.3
> c(median(X[,1]), median(X[,2]), median(X[,3]))
[1] 19.2 6.0 196.3

> # get column ranges
> apply(X, 2, range)[,1:3]
      mpg cyl  disp
[1,] 10.4   4  71.1
[2,] 33.9   8 472.0
> cbind(range(X[,1]), range(X[,2]), range(X[,3]))
     [,1] [,2]  [,3]
[1,] 10.4    4  71.1
[2,] 33.9    8 472.0


The Covariance Matrix

The Covariation of Data

The covariance matrix refers to the symmetric array of numbers

 2  s1 s12 s13 ··· s1p 2 s21 s s23 ··· s2p  2  s s s2 ··· s  S =  31 32 3 3p  ......   . . . . .  2 sp1 sp2 sp3 ··· sp

where
$s_j^2 = (1/n)\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ is the variance of the j-th variable
$s_{jk} = (1/n)\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$ is the covariance between the j-th and k-th variables
$\bar{x}_j = (1/n)\sum_{i=1}^{n} x_{ij}$ is the mean of the j-th variable

Covariance Matrix from Data Matrix

We can calculate the covariance matrix as
\[
S = \frac{1}{n} X_c' X_c
\]
where $X_c = X - 1_n \bar{x}' = C X$ with
$\bar{x} = (\bar{x}_1, \ldots, \bar{x}_p)'$ denoting the vector of variable means
$C = I_n - n^{-1} 1_n 1_n'$ denoting a centering matrix

Note that the centered matrix $X_c$ has the form
\[
X_c = \begin{bmatrix}
x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1p} - \bar{x}_p \\
x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2p} - \bar{x}_p \\
x_{31} - \bar{x}_1 & x_{32} - \bar{x}_2 & \cdots & x_{3p} - \bar{x}_p \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{np} - \bar{x}_p
\end{bmatrix}
\]

Variances are Nonnegative

Variances are sums-of-squares, which implies that $s_j^2 \ge 0$ for all j.
$s_j^2 > 0$ as long as there does not exist an $\alpha$ such that $x_j = \alpha 1_n$.

This implies that. . .
$\mathrm{tr}(S) \ge 0$ where tr(·) denotes the matrix trace function
$\sum_{j=1}^{p} \lambda_j \ge 0$ where $(\lambda_1, \ldots, \lambda_p)$ are the eigenvalues of S

If n < p, then λj = 0 for at least one j ∈ {1,..., p}. If n ≥ p and the p columns of X are linearly independent, then λj > 0 for all j ∈ {1,..., p}.
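These trace and eigenvalue facts are easy to confirm numerically; a small sketch using the mtcars covariance matrix from the R examples in these notes:

```r
# tr(S) equals the sum of the eigenvalues of S, and every eigenvalue is
# positive here because mtcars has n = 32 > p = 11 linearly independent columns
S <- cov(as.matrix(mtcars))
evals <- eigen(S, symmetric = TRUE)$values
all.equal(sum(diag(S)), sum(evals))  # TRUE
min(evals) > 0                       # TRUE
```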

The Cauchy-Schwarz Inequality

From the Cauchy-Schwarz inequality we have that

\[
s_{jk}^2 \le s_j^2 s_k^2
\]

with the equality holding if and only if xj and xk are linearly dependent.

We could also write the Cauchy-Schwarz inequality as

\[
|s_{jk}| \le s_j s_k
\]

where sj and sk denote the standard deviations of the variables.
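A quick numeric illustration of this bound, sketched with two of the mtcars variables:

```r
# |s_jk| <= s_j s_k for the covariance of mpg and hp
X <- as.matrix(mtcars)
sjk <- cov(X[, "mpg"], X[, "hp"])
abs(sjk) <= sd(X[, "mpg"]) * sd(X[, "hp"])  # TRUE
```

Equality would hold only if mpg were an exact linear function of hp, which it is not.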

Covariance Matrix by Hand (hard way)

> n <- nrow(X)
> C <- diag(n) - matrix(1/n, n, n)
> Xc <- C %*% X
> S <- t(Xc) %*% Xc / (n-1)
> S[1:3,1:6]
             mpg        cyl       disp        hp        drat         wt
mpg    36.324103  -9.172379  -633.0972 -320.7321   2.1950635  -5.116685
cyl    -9.172379   3.189516   199.6603  101.9315  -0.6683669   1.367371
disp -633.097208 199.660282 15360.7998 6721.1587 -47.0640192 107.684204

# or #

> Xc <- scale(X, center=TRUE, scale=FALSE)
> S <- t(Xc) %*% Xc / (n-1)
> S[1:3,1:6]
             mpg        cyl       disp        hp        drat         wt
mpg    36.324103  -9.172379  -633.0972 -320.7321   2.1950635  -5.116685
cyl    -9.172379   3.189516   199.6603  101.9315  -0.6683669   1.367371
disp -633.097208 199.660282 15360.7998 6721.1587 -47.0640192 107.684204

Covariance Matrix using cov Function (easy way)

> # calculate covariance matrix
> S <- cov(X)
> dim(S)
[1] 11 11

> # check variance
> S[1,1]
[1] 36.3241
> var(X[,1])
[1] 36.3241
> sum((X[,1]-mean(X[,1]))^2) / (n-1)
[1] 36.3241

> # check covariance
> S[1:3,1:6]
             mpg        cyl       disp        hp        drat         wt
mpg    36.324103  -9.172379  -633.0972 -320.7321   2.1950635  -5.116685
cyl    -9.172379   3.189516   199.6603  101.9315  -0.6683669   1.367371
disp -633.097208 199.660282 15360.7998 6721.1587 -47.0640192 107.684204


The Correlation Matrix

The Correlation of Data

The correlation matrix refers to the symmetric array of numbers
\[
R = \begin{bmatrix}
1 & r_{12} & r_{13} & \cdots & r_{1p} \\
r_{21} & 1 & r_{23} & \cdots & r_{2p} \\
r_{31} & r_{32} & 1 & \cdots & r_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & r_{p3} & \cdots & 1
\end{bmatrix}
\]

where
\[
r_{jk} = \frac{s_{jk}}{s_j s_k} = \frac{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}\,\sqrt{\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2}}
\]

is the Pearson correlation coefficient between variables xj and xk .

Correlation Matrix from Data Matrix

We can calculate the correlation matrix as
\[
R = \frac{1}{n} X_s' X_s
\]
where $X_s = C X D^{-1}$ with
$C = I_n - n^{-1} 1_n 1_n'$ denoting a centering matrix
$D = \mathrm{diag}(s_1, \ldots, s_p)$ denoting a diagonal scaling matrix

Note that the standardized matrix $X_s$ has the form
\[
X_s = \begin{bmatrix}
(x_{11} - \bar{x}_1)/s_1 & (x_{12} - \bar{x}_2)/s_2 & \cdots & (x_{1p} - \bar{x}_p)/s_p \\
(x_{21} - \bar{x}_1)/s_1 & (x_{22} - \bar{x}_2)/s_2 & \cdots & (x_{2p} - \bar{x}_p)/s_p \\
(x_{31} - \bar{x}_1)/s_1 & (x_{32} - \bar{x}_2)/s_2 & \cdots & (x_{3p} - \bar{x}_p)/s_p \\
\vdots & \vdots & \ddots & \vdots \\
(x_{n1} - \bar{x}_1)/s_1 & (x_{n2} - \bar{x}_2)/s_2 & \cdots & (x_{np} - \bar{x}_p)/s_p
\end{bmatrix}
\]

Correlation of a Variable with Itself is One

Assuming that $s_j^2 > 0$ for all j ∈ {1, …, p}, we have that
\[
\mathrm{Cor}(x_j, x_k) = \frac{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}\,\sqrt{\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2}} = \begin{cases} 1 & \text{if } j = k \\ r_{jk} & \text{if } j \neq k \end{cases}
\]

Because $r_{jk} = 1$ whenever j = k, we know that
$\mathrm{tr}(R) = p$ where tr(·) denotes the matrix trace function
$\sum_{j=1}^{p} \lambda_j = p$ where $(\lambda_1, \ldots, \lambda_p)$ are the eigenvalues of R

We also know that the eigenvalues satisfy

λj = 0 for at least one j ∈ {1,..., p} if n < p

λj > 0 ∀j if columns of X are linearly independent
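These facts can also be checked numerically; a sketch using the mtcars correlation matrix from the R examples:

```r
# tr(R) = p and the eigenvalues of R sum to p (here p = 11)
R <- cor(as.matrix(mtcars))
p <- ncol(R)
all.equal(sum(diag(R)), as.numeric(p))                            # TRUE
all.equal(sum(eigen(R, symmetric = TRUE)$values), as.numeric(p))  # TRUE
```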

The Cauchy-Schwarz Inequality (revisited)

Reminder: the Cauchy-Schwarz inequality implies that

\[
s_{jk}^2 \le s_j^2 s_k^2
\]

with the equality holding if and only if xj and xk are linearly dependent.

Rearranging the terms, we have that

\[
\frac{s_{jk}^2}{s_j^2 s_k^2} \le 1 \quad \longleftrightarrow \quad r_{jk}^2 \le 1
\]

which implies that |rjk | ≤ 1 with equality holding if and only if xj = α1n + βxk for some scalars α ∈ R and β ∈ R.
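The equality condition is easy to demonstrate: if one variable is an exact linear function of another, the correlation is exactly ±1. A minimal sketch with made-up numbers:

```r
# r = 1 (or -1) when y = alpha*1_n + beta*x exactly
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x       # beta > 0, so r = 1
z <- 2 - 3 * x       # beta < 0, so r = -1
all.equal(cor(x, y), 1)   # TRUE
all.equal(cor(x, z), -1)  # TRUE
```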

Correlation Matrix by Hand (hard way)

> n <- nrow(X)
> C <- diag(n) - matrix(1/n, n, n)
> D <- diag(apply(X, 2, sd))
> Xs <- C %*% X %*% solve(D)
> R <- t(Xs) %*% Xs / (n-1)
> R[1:3,1:6]
           [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
[1,]  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
[2,] -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
[3,] -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799

# or #

> Xs <- scale(X, center=TRUE, scale=TRUE)
> R <- t(Xs) %*% Xs / (n-1)
> R[1:3,1:6]
            mpg        cyl       disp         hp       drat         wt
mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799

Correlation Matrix using cor Function (easy way)

> # calculate correlation matrix
> R <- cor(X)
> dim(R)
[1] 11 11

> # check correlation of mpg and cyl
> R[1,2]
[1] -0.852162
> cor(X[,1], X[,2])
[1] -0.852162
> cov(X[,1], X[,2]) / (sd(X[,1]) * sd(X[,2]))
[1] -0.852162

> # check correlations
> R[1:3,1:6]
            mpg        cyl       disp         hp       drat         wt
mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799


Miscellaneous Topics

Two Types of Matrix Crossproducts

We often need to calculate one of two different types of crossproducts:
X'Y = “regular” crossproduct of X and Y
XY' = “transpose” crossproduct of X and Y

The regular crossproduct is X' post-multiplied by Y.

The transpose crossproduct is X post-multiplied by Y'.

Simple and Efficient Crossproducts in R

> X <- matrix(rnorm(2*3), 2, 3)
> Y <- matrix(rnorm(2*3), 2, 3)
> t(X) %*% Y
          [,1]       [,2]      [,3]
[1,] 0.1342302 -1.8181837 -1.107821
[2,] 1.1014703 -0.6619466 -1.356606
[3,] 0.8760823 -1.0077151 -1.340044
> crossprod(X, Y)
          [,1]       [,2]      [,3]
[1,] 0.1342302 -1.8181837 -1.107821
[2,] 1.1014703 -0.6619466 -1.356606
[3,] 0.8760823 -1.0077151 -1.340044
> X %*% t(Y)
           [,1]      [,2]
[1,]  0.8364239  3.227566
[2,] -1.3899946 -2.704184
> tcrossprod(X, Y)
           [,1]      [,2]
[1,]  0.8364239  3.227566
[2,] -1.3899946 -2.704184

Turning a Matrix into a Vector

The vectorization (vec) operator turns a matrix into a vector:
\[
\mathrm{vec}(X) = (x_{11}, x_{21}, \ldots, x_{n1}, x_{12}, \ldots, x_{n2}, \ldots, x_{1p}, \ldots, x_{np})'
\]

where the vectorization is done column-by-column.

In R, we just use the combine function c to vectorize a matrix:
> X <- matrix(1:6, 2, 3)
> X
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> c(X)
[1] 1 2 3 4 5 6
> c(t(X))
[1] 1 3 5 2 4 6

vec Operator Properties

Some useful properties of the vec(·) operator include:
$\mathrm{vec}(a') = \mathrm{vec}(a) = a$ for any vector $a \in \mathbb{R}^m$
$\mathrm{vec}(ab') = b \otimes a$ for any vectors $a \in \mathbb{R}^m$ and $b \in \mathbb{R}^n$
$\mathrm{vec}(A)'\mathrm{vec}(B) = \mathrm{tr}(A'B)$ for any matrices $A, B \in \mathbb{R}^{m \times n}$
$\mathrm{vec}(ABC) = (C' \otimes A)\mathrm{vec}(B)$ if the product ABC exists

Note: ⊗ is the Kronecker product, which is defined on the next slide.
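These identities are straightforward to verify numerically. This sketch checks vec(ab') = b ⊗ a and vec(ABC) = (C' ⊗ A)vec(B) with small arbitrary matrices; c() plays the role of vec, and "Cm" is used as a name only to avoid masking base::C:

```r
# vec(ab') = b (x) a
a <- 1:3; b <- 4:5
all.equal(c(a %*% t(b)), c(kronecker(b, a)))  # TRUE
# vec(ABC) = (C' (x) A) vec(B)
A <- matrix(1:6, 2, 3)
B <- matrix(7:12, 3, 2)
Cm <- matrix(1:4, 2, 2)
all.equal(c(A %*% B %*% Cm), c(kronecker(t(Cm), A) %*% c(B)))  # TRUE
```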

Kronecker Product of Two Matrices

Given $X = \{x_{ij}\}_{n \times p}$ and $Y = \{y_{ij}\}_{m \times q}$, the Kronecker product is
\[
X \otimes Y = \begin{bmatrix}
x_{11}Y & x_{12}Y & \cdots & x_{1p}Y \\
x_{21}Y & x_{22}Y & \cdots & x_{2p}Y \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1}Y & x_{n2}Y & \cdots & x_{np}Y
\end{bmatrix}
\]
which is a matrix of order mn × pq.

In R, the kronecker function calculates Kronecker products:
> X <- matrix(1:4, 2, 2)
> Y <- matrix(5:10, 2, 3)
> kronecker(X, Y)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    5    7    9   15   21   27
[2,]    6    8   10   18   24   30
[3,]   10   14   18   20   28   36
[4,]   12   16   20   24   32   40

Kronecker Product Properties

Some useful properties of the Kronecker product include:
1. $A \otimes a = Aa = aA = a \otimes A$ for any $a \in \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$
2. $(A \otimes B)' = A' \otimes B'$ for any matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$
3. $a' \otimes b = ba' = b \otimes a'$ for any vectors $a \in \mathbb{R}^m$ and $b \in \mathbb{R}^n$
4. $\mathrm{tr}(A \otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B)$ for any matrices $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{p \times p}$
5. $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$ for any invertible matrices A and B
6. $(A \otimes B)^{\dagger} = A^{\dagger} \otimes B^{\dagger}$ where $(\cdot)^{\dagger}$ is the Moore-Penrose pseudoinverse
7. $|A \otimes B| = |A|^p |B|^m$ for any matrices $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{p \times p}$
8. $\mathrm{rank}(A \otimes B) = \mathrm{rank}(A)\,\mathrm{rank}(B)$ for any matrices A and B
9. $A \otimes (B + C) = A \otimes B + A \otimes C$ for any matrices A, B, and C
10. $(A + B) \otimes C = A \otimes C + B \otimes C$ for any matrices A, B, and C
11. $(A \otimes B) \otimes C = A \otimes (B \otimes C)$ for any matrices A, B, and C
12. $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ for any matrices A, B, C, and D
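Two of these properties (2 and 12) checked numerically; a sketch with arbitrary random matrices, again naming the third matrix "Cm" to avoid masking base::C:

```r
set.seed(1)
A <- matrix(rnorm(4), 2, 2);  B <- matrix(rnorm(9), 3, 3)
Cm <- matrix(rnorm(4), 2, 2); D <- matrix(rnorm(9), 3, 3)
# (A (x) B)' = A' (x) B'
all.equal(t(kronecker(A, B)), kronecker(t(A), t(B)))  # TRUE
# (A (x) B)(C (x) D) = (AC) (x) (BD)
all.equal(kronecker(A, B) %*% kronecker(Cm, D),
          kronecker(A %*% Cm, B %*% D))               # TRUE
```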

Common Application of Vec and Kronecker

Suppose the rows of X are iid samples from some multivariate distribution with mean $\mu = (\mu_1, \ldots, \mu_p)'$ and covariance matrix $\Sigma$:
\[
x_i \overset{iid}{\sim} (\mu, \Sigma)
\]
where $x_i$ is the i-th row of X.

If we let $y = \mathrm{vec}(X')$, then the expectation and covariance are
$E(y) = 1_n \otimes \mu$ is the mean vector
$V(y) = I_n \otimes \Sigma$ is the covariance matrix

Note that the covariance matrix is block diagonal
\[
I_n \otimes \Sigma = \begin{bmatrix}
\Sigma & 0 & \cdots & 0 \\
0 & \Sigma & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Sigma
\end{bmatrix}
\]
given that data from different subjects are assumed to be independent.
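In R this block-diagonal structure can be built directly with kronecker; a sketch with n = 3 items and a hypothetical 2 × 2 covariance matrix Sigma:

```r
# I_n (x) Sigma: Sigma repeats down the diagonal, zeros elsewhere
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)  # hypothetical covariance matrix
V <- kronecker(diag(3), Sigma)        # 6 x 6 block-diagonal matrix
V[1:2, 1:2]                           # first diagonal block equals Sigma
all(V[1:2, 3:6] == 0)                 # TRUE: off-diagonal blocks are zero
```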

Two Versions of a Scatterplot in R

[Figure: side-by-side scatterplots of MPG versus HP; left from plot(), right from car::scatterplot()]

plot(mtcars$hp, mtcars$mpg, xlab="HP", ylab="MPG")
library(car)
scatterplot(mtcars$hp, mtcars$mpg, xlab="HP", ylab="MPG")

Two Versions of a Scatterplot Matrix in R

[Figure: scatterplot matrices of mpg, disp, hp, and wt colored by cyl; left from pairs(), right from car::scatterplotMatrix()]

cylint <- as.integer(factor(mtcars$cyl))
pairs(~mpg+disp+hp+wt, data=mtcars, col=cylint, pch=cylint)
library(car)
scatterplotMatrix(~mpg+disp+hp+wt|cyl, data=mtcars)

Three-Dimensional Scatterplot in R

[Figure: three-dimensional scatterplots of MPG against HP and WT with a fitted regression plane, drawn with scatterplot3d]

library(scatterplot3d)
sp3d <- scatterplot3d(mtcars$hp, mtcars$wt, mtcars$mpg, type="h", color=cylint,
                      pch=cylint, xlab="HP", ylab="WT", zlab="MPG")
fitmod <- lm(mpg ~ hp + wt, data=mtcars)
sp3d$plane3d(fitmod)

Color Image (Heat Map) Plots in R

[Figure: heat maps of predicted MPG over a grid of HP and WT values; left from image(), right from bigsplines::imagebar() with a color legend]

fitmod <- lm(mpg ~ hp + wt, data=mtcars)
hpseq <- seq(50, 330, by=20)
wtseq <- seq(1.5, 5.4, length=15)
newdata <- expand.grid(hp=hpseq, wt=wtseq)
fit <- predict(fitmod, newdata)
fitmat <- matrix(fit, 15, 15)
image(hpseq, wtseq, fitmat, xlab="HP", ylab="WT")
library(bigsplines)
imagebar(hpseq, wtseq, fitmat, xlab="HP", ylab="WT", zlab="MPG",
         col=heat.colors(12), ncolor=12)

Correlation Matrix Plot in R

[Figure: mtcars correlation matrix plots; left circles from corrplot(), right numbers and ellipses from corrplot.mixed()]

cmat <- cor(mtcars)
library(corrplot)
corrplot(cmat, method="circle")
corrplot.mixed(cmat, lower="number", upper="ellipse")

Correlation Matrix Color Image (Heat Map) in R

[Figure: color image of the mtcars correlation matrix with correlation values printed in the upper triangle]

cmat <- cor(mtcars)
p <- nrow(cmat)
library(RColorBrewer)
imagebar(1:p, 1:p, cmat[,p:1], axes=F, zlim=c(-1,1), xlab="", ylab="",
         col=brewer.pal(7, "RdBu"))
axis(1, 1:p, labels=rownames(cmat))
axis(2, p:1, labels=colnames(cmat))
for(k in 1:p) {
  for(j in 1:k) {
    if(j < k) text(j, p+1-k, labels=round(cmat[j,k],2), cex=0.75)
  }
}
