Data Sample Matrices
Professor Dan A. Simovici, UMB

1 The Sample Matrix
2 The Covariance Matrix

Matrices as organizers for data sets

Data set: a sequence $E$ of $m$ vectors of $\mathbb{R}^n$, $(u_1, \ldots, u_m)$. The $j$-th components $(u_i)_j$ of these vectors correspond to the values of a random variable $V_j$, where $1 \le j \le n$.

This data series will be represented as a sample matrix having $m$ rows $u_1', \ldots, u_m'$ and $n$ columns $v_1, \ldots, v_n$. The number $m$ is the size of the sample.

Rows, Experiments, Attributes

Each row vector $u_i'$ corresponds to an experiment $E_i$ in the series of experiments $E = (E_1, \ldots, E_m)$; the experiment $E_i$ consists of measuring the $n$ components of $u_i' = (x_{i1}, \ldots, x_{in})$:

\[
\begin{array}{c|ccc}
       & v_1    & \cdots & v_n    \\ \hline
u_1'   & x_{11} & \cdots & x_{1n} \\
u_2'   & x_{21} & \cdots & x_{2n} \\
\vdots & \vdots &        & \vdots \\
u_m'   & x_{m1} & \cdots & x_{mn}
\end{array}
\]

The column vector

\[
v_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{mj} \end{pmatrix}
\]

represents the measurements of the $j$-th variable $V_j$ of the experiment, for $1 \le j \le n$, as shown in the table above. These variables are usually referred to as attributes or features of the series $E$.

Definition

The sample matrix of $E$ is the matrix $X \in \mathbb{R}^{m \times n}$ given by

\[
X = \begin{pmatrix} u_1' \\ \vdots \\ u_m' \end{pmatrix} = (v_1 \ \cdots \ v_n).
\]

Pairwise Distances

Pairwise distances between the row vectors of $X \in \mathbb{R}^{m \times n}$ can be computed with the MATLAB function pdist(X). pdist(X) returns a vector D having $\frac{m(m-1)}{2}$ components corresponding to the pairs of observations, arranged in the order

\[
d_2(u_2', u_1'),\ d_2(u_3', u_1'),\ \ldots,\ d_2(u_m', u_1'),\ d_2(u_3', u_2'),\ \ldots,\ d_2(u_m', u_{m-1}')
\]

(the lower triangle of the distance matrix, read column by column).

Example

Let $X$ be the data matrix

\[
X = \begin{pmatrix} 1 & 4 & 5 \\ 2 & 3 & 7 \\ 5 & 1 & 4 \\ 6 & 2 & 4 \end{pmatrix}.
\]

The function call D = pdist(X) returns

    D =
        2.4495    6.0000    5.4772    7.3485    5.0990    5.0990

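The ordering of the entries of D can be made concrete with a short sketch that uses only base MATLAB functions. It assumes the pair order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,m-1); the matrix P and the loop are hypothetical illustrations, not the way pdist is implemented.

```matlab
% Hypothetical data: four points in the plane (rows are observations).
P = [0 0; 3 0; 0 4; 3 4];
m = size(P, 1);

% Collect Euclidean distances in the order (2,1),(3,1),...,(m,1),(3,2),...
D = zeros(1, m*(m-1)/2);
k = 0;
for j = 1:m-1          % column index of the lower triangle
    for i = j+1:m      % row indices below the diagonal
        k = k + 1;
        D(k) = norm(P(i, :) - P(j, :));   % d_2 between rows i and j
    end
end
disp(D)   % 3 4 5 5 4 3
```

Entry k of D is the distance between rows i and j with i > j, so the first m-1 entries are the distances from the first observation to all the others.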
Example

A distance matrix can be obtained by writing:

    E = squareform(D)

    E =
             0    2.4495    6.0000    5.4772
        2.4495         0    7.3485    5.0990
        6.0000    7.3485         0    5.0990
        5.4772    5.0990    5.0990         0

Other distances

pdist(X,'cityblock') : $d_1(x_i, x_j)$;
pdist(X,'chebychev') : $d_\infty(x_i, x_j)$;
pdist(X,'minkowski',p) : Minkowski's distance $d_p$.

Linear Dimensionality-Reduction Mappings

A linear data mapping for a data sequence $(u_1, \ldots, u_m) \in \mathrm{Seq}_m(\mathbb{R}^n)$ is a linear mapping $r : \mathbb{R}^n \longrightarrow \mathbb{R}^q$. If $R \in \mathbb{R}^{n \times q}$ is the matrix that represents this mapping, then $r(u_i) = R'u_i$ for $1 \le i \le m$. If $q < n$, we refer to $r$ as a linear dimensionality-reduction mapping.

The reduced data matrix is

\[
r(X_E) = \begin{pmatrix} r(u_1)' \\ \vdots \\ r(u_m)' \end{pmatrix}
       = \begin{pmatrix} (R'u_1)' \\ \vdots \\ (R'u_m)' \end{pmatrix}
       = X_E R \in \mathbb{R}^{m \times q}.
\]

The reduced data set $r(X_E)$ has new variables $Y_1, \ldots, Y_q$.

The Sample Mean

Let $(u_1, \ldots, u_m)$ be a series of observations in $\mathbb{R}^n$. The sample mean of this sequence is the vector

\[
\tilde{u} = \frac{1}{m} \sum_{i=1}^{m} u_i \in \mathbb{R}^n.
\]

The series is centered if $\tilde{u} = 0_n$. The series $(u_1 - \tilde{u}, \ldots, u_m - \tilde{u})$ is always centered. Also, observe that

\[
\tilde{u} = \frac{1}{m} (u_1 \ \cdots \ u_m) 1_m.
\]

If $n = 1$, the series of observations is reduced to a vector $v \in \mathbb{R}^m$.

The Standard Deviation

The standard deviation of a vector $v \in \mathbb{R}^m$ is the number

\[
s_v = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} (v_i - \bar{v})^2},
\]

where $\bar{v}$ is the mean of the components of $v$. The standard deviation of a sample matrix $X \in \mathbb{R}^{m \times n}$, where $X = (v_1 \ \cdots \ v_n)$, is the row $s = (s_{v_1}, \ldots, s_{v_n})$.

Scaling

If the measurement scales for the variables $V_1, \ldots, V_n$ involved in the experiment are very different due to different measurement units, some variables may inappropriately influence the analysis process.
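A small sketch of this effect, using invented measurements (a height in metres and a weight in grams; both the data and the variable names are hypothetical): the Euclidean distance between two observations is determined almost entirely by the column with the larger numerical range.

```matlab
% Hypothetical data: column 1 in metres, column 2 in grams.
X = [1.60 52000;
     1.95 52100];

d = X(1, :) - X(2, :);        % difference between the two observations
dist = norm(d);               % Euclidean distance d_2

% Fraction of the squared distance contributed by each column.
share = d.^2 / sum(d.^2);
disp(share)                   % the second entry is very close to 1
```

Here a large relative difference in height contributes almost nothing to the distance, only because of the units chosen for the second column.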
Therefore, the columns of the data sample matrix need to be scaled to make the variables comparable. To scale a matrix we need to replace each column $v_i$ by $\frac{1}{s_{v_i}} v_i$. This will yield a matrix having the standard deviation of each column equal to 1.

The Effects of Centering on a Sample Matrix

Let $X \in \mathbb{R}^{m \times n}$ be a sample matrix

\[
X = \begin{pmatrix} u_1' \\ \vdots \\ u_m' \end{pmatrix}.
\]

The sample matrix that corresponds to the centered sequence is

\[
\hat{X} = \left( I_m - \frac{1}{m} 1_m 1_m' \right) X.
\]

The Centering Matrix

To center a data matrix $X \in \mathbb{R}^{m \times n}$ we need to multiply it at the left by the centering matrix

\[
H_m = I_m - \frac{1}{m} 1_m 1_m' \in \mathbb{R}^{m \times m},
\]

that is, $\hat{X} = H_m X$.

$H_m = I_m - \frac{1}{m} J_m$, where $J_m = 1_m 1_m'$ is the $m \times m$ matrix whose entries are all equal to 1.
$H_m$ is both symmetric and idempotent.
$H_m$ has the eigenvalue 0, because $H_m 1_m = 0_m$.

MATLAB Computation

If $X \in \mathbb{R}^{m \times n}$ is a matrix, the standard deviations are computed in MATLAB using the function std(X), which returns an $n$-dimensional row s containing the square roots of the sample variances of the columns of X, that is, their standard deviations. The means of the columns of X are computed in MATLAB using the function mean(X).

z-Scores

The MATLAB function Z = zscore(X) computes a centered and scaled version of a data sample matrix having the same format as X. If X is a matrix, then z-scores are computed using the mean and standard deviation along each column of X. The columns of Z have sample mean zero and sample standard deviation one (unless a column of X is constant, in which case that column of Z is constant at 0). [Z,mu,sigma] = zscore(X) also returns the vector of means mu and the vector of standard deviations sigma.

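The centering and scaling steps described above can be sketched with base MATLAB functions only. The 4-by-2 matrix below is hypothetical, and the code illustrates the formulas rather than the way zscore is implemented; it also assumes that no column of X is constant, since a zero standard deviation would cause a division by zero.

```matlab
% Hypothetical 4-by-2 sample matrix (rows are observations).
X = [2 10; 4 30; 6 20; 8 40];
m = size(X, 1);

% Centering matrix H_m = I_m - (1/m) 1_m 1_m'.
H = eye(m) - ones(m, m) / m;

Xc = H * X;                  % centered sample matrix
s  = std(X);                 % row of column standard deviations
Z  = Xc ./ repmat(s, m, 1);  % centered and scaled columns

% Properties of H_m stated above, checked numerically.
symErr  = norm(H - H', 'fro');     % symmetry
idemErr = norm(H*H - H, 'fro');    % idempotency
nullErr = norm(H * ones(m, 1));    % H_m 1_m = 0, so 0 is an eigenvalue

disp(mean(Z))   % numerically zero
disp(std(Z))    % 1 1
```

Forming H explicitly costs memory proportional to m^2, so for large samples subtracting the row of column means from every row of X is the more economical route to the same centered matrix.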
Example

    X =
         1    12    77
         3    15    80
         2    15    75
         5    18    98

The means and the standard deviations of the columns of X:

    >> m = mean(X)
    m =
        2.7500   15.0000   82.5000
    >> s = std(X)
    s =
        1.7078    2.4495   10.5357

Example

To compute together the mean, the standard deviation, and the matrix Z, we write

    >> [Z,m,s] = zscore(X)
    Z =
       -1.0247   -1.2247   -0.5220
        0.1464         0   -0.2373
       -0.4392         0   -0.7119
        1.3175    1.2247    1.4712
    m =
        2.7500   15.0000   82.5000
    s =
        1.7078    2.4495   10.5357

Inertia of a Set of Vectors

Let $u = (u_1, \ldots, u_m)$ be a sequence of vectors in $\mathbb{R}^n$. The inertia of this sequence relative to a vector $z \in \mathbb{R}^n$ is the number

\[
I_z(u) = \sum_{j=1}^{m} \| u_j - z \|_2^2.
\]

Huygens' Inertia Theorem

Let $u = (u_1, \ldots, u_m) \in \mathrm{Seq}_m(\mathbb{R}^n)$. We have

\[
I_z(u) - I_{\tilde{u}}(u) = m \, \| \tilde{u} - z \|_2^2
\]

for every $z \in \mathbb{R}^n$. The minimal value of the inertia $I_z(u)$ is achieved for $z = \tilde{u}$.

Covariance Coefficient

Let $u, w \in \mathbb{R}^m$, where $m > 1$, having the means $\bar{u}$ and $\bar{w}$, and the standard deviations $s_u$ and $s_w$, respectively. The covariance coefficient of $u$ and $w$ is

\[
\mathrm{cov}(u, w) = \frac{1}{m-1} \sum_{i=1}^{m} (u_i - \bar{u})(w_i - \bar{w}).
\]

The correlation coefficient of $u$ and $w$ is

\[
\rho(u, w) = \frac{\mathrm{cov}(u, w)}{s_u s_w}.
\]

By the Cauchy-Schwarz inequality, we have

\[
\left| \sum_{i=1}^{m} (u_i - \bar{u})(w_i - \bar{w}) \right|
\le \sqrt{\sum_{i=1}^{m} (u_i - \bar{u})^2} \cdot \sqrt{\sum_{i=1}^{m} (w_i - \bar{w})^2},
\]

which implies $-1 \le \rho(u, w) \le 1$.

Sample Covariance Matrix

Let $X \in \mathbb{R}^{m \times n}$ be a sample matrix and let $\hat{X}$ be the centered sample matrix corresponding to $X$. The sample covariance matrix is the matrix

\[
\mathrm{cov}(X) = \frac{1}{m-1} \hat{X}' \hat{X} \in \mathbb{R}^{n \times n}.
\]

Note that if $X$ is centered, $\mathrm{cov}(X) = \frac{1}{m-1} X'X$. If, moreover, $n = 1$, the matrix is reduced to one column $X = (v)$ and

\[
\mathrm{cov}(v) = \frac{1}{m-1} v'v \in \mathbb{R}.
\]

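The one-column formula can be checked numerically. The sketch below uses a hypothetical column v of five measurements, centers it, and compares $\frac{1}{m-1} v'v$ for the centered column with the values returned by the built-in functions var and std.

```matlab
% Hypothetical column of m = 5 measurements.
v = [2; 4; 4; 5; 10];
m = numel(v);

vc = v - mean(v);             % centered column
c  = (vc' * vc) / (m - 1);    % cov(v) = v'v/(m-1) for a centered v

% Compare with the built-in sample variance and standard deviation.
errVar = abs(c - var(v));
errStd = abs(sqrt(c) - std(v));
disp(c)                       % 9
```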
In the case $n = 1$ we refer to $\mathrm{cov}(v)$ as the variance of $v$; this number is denoted by $\mathrm{var}(v)$.

If $X = (v_1 \ \cdots \ v_n)$, then $(\mathrm{cov}(X))_{ij} = \mathrm{cov}(v_i, v_j)$ for $1 \le i, j \le n$. The covariance matrix can be written also as

\[
\mathrm{cov}(X) = \frac{1}{m-1} X' H_m H_m X = \frac{1}{m-1} X' H_m X.
\]

The sample correlation matrix is the matrix $\mathrm{corr}(X)$ given by $(\mathrm{corr}(X))_{ij} = \rho(v_i, v_j)$ for $1 \le i, j \le n$.
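As a sketch of how these formulas fit together, the code below takes the sample matrix X of the z-score example, forms $\frac{1}{m-1} X' H_m X$, derives the correlation matrix from it, and compares both with the built-in functions cov and corrcoef. The variable names are illustrative choices.

```matlab
% Sample matrix of the z-score example (m = 4 observations, n = 3 variables).
X = [1 12 77; 3 15 80; 2 15 75; 5 18 98];
m = size(X, 1);

H = eye(m) - ones(m, m) / m;     % centering matrix H_m
C = (X' * H * X) / (m - 1);      % cov(X) = X' H_m X / (m-1)

s = sqrt(diag(C))';              % column standard deviations
R = C ./ (s' * s);               % corr(X)_ij = cov(v_i, v_j) / (s_i s_j)

errCov  = max(max(abs(C - cov(X))));        % compare with built-in cov
errCorr = max(max(abs(R - corrcoef(X))));   % compare with built-in corrcoef
disp(C)
disp(R)
```

The diagonal of C holds the variances of the columns, so its square roots agree with the row returned by std(X) in the earlier example, and the diagonal of R consists of ones.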