Data Sample Matrices

Professor Dan A. Simovici

UMB


Matrices as organizers for data sets

Data set: a sequence E of m vectors of R^n, (u_1, ..., u_m). The j-th components (u_i)_j of these vectors correspond to the values of a random variable V_j, where 1 ≤ j ≤ n. This data series will be represented as a sample matrix having m rows u_1', ..., u_m' and n columns v_1, ..., v_n. The number m is the size of the sample.

Rows, Experiments, Attributes

Each row vector u_i' corresponds to an experiment E_i in the series of experiments E = (E_1, ..., E_m); the experiment E_i consists of measuring the n components of u_i' = (x_{i1}, ..., x_{in}):

\[
\begin{array}{c|ccc}
 & v_1 & \cdots & v_n \\ \hline
u_1' & x_{11} & \cdots & x_{1n} \\
u_2' & x_{21} & \cdots & x_{2n} \\
\vdots & \vdots & & \vdots \\
u_m' & x_{m1} & \cdots & x_{mn}
\end{array}
\]

The column vector
\[
v_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{mj} \end{pmatrix}
\]
represents the measurements of the j-th variable V_j of the experiment, for 1 ≤ j ≤ n, as shown in the table above. These variables are usually referred to as attributes or features of the series E.

Definition

The sample matrix of E is the matrix X ∈ C^{m×n} given by

\[
X = \begin{pmatrix} u_1' \\ \vdots \\ u_m' \end{pmatrix} = (v_1 \cdots v_n).
\]

Pairwise Distances

Pairwise distances between the row vectors of X ∈ R^{m×n} can be computed with the MATLAB function pdist(X). The call pdist(X) returns a vector D having m(m−1)/2 components corresponding to the pairs of observations arranged in the order

\[
d_2(u_2', u_1'),\; d_2(u_3', u_1'),\; \ldots,\; d_2(u_m', u_1'),\; d_2(u_3', u_2'),\; \ldots
\]

(the column-wise order of the lower triangle of the distance matrix).
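
This ordering can be checked with a short MATLAB sketch (assuming the Statistics and Machine Learning Toolbox, which provides pdist; the data matrix below is arbitrary illustrative data):

% Sketch: verify the ordering of pdist(X) for the Euclidean distance.
X = rand(5, 3);                        % 5 observations, 3 variables
D = pdist(X);                          % 5*4/2 = 10 pairwise distances
k = 1;
for j = 1:size(X,1)-1                  % column index of the lower triangle
    for i = j+1:size(X,1)              % row index below the diagonal
        d = norm(X(i,:) - X(j,:));     % d_2(u_i', u_j')
        fprintf('D(%d) = %.4f, d(u%d,u%d) = %.4f\n', k, D(k), i, j, d);
        k = k + 1;
    end
end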

Example

Let X be the data matrix
\[
X = \begin{pmatrix} 1 & 4 & 5 \\ 2 & 3 & 7 \\ 5 & 1 & 4 \\ 6 & 2 & 4 \end{pmatrix}.
\]

The function call D = pdist(X) returns

D = 2.4495 6.0000 5.4772 7.3485 5.0990 5.0990

Example

A distance matrix can be obtained by writing E = squareform(D), which gives

E =
         0    2.4495    6.0000    5.4772
    2.4495         0    7.3485    5.0990
    6.0000    7.3485         0    5.0990
    5.4772    5.0990    5.0990         0

Other distances

pdist(X,'cityblock'): the distance d_1(x_i, x_j);
pdist(X,'chebychev'): the distance d_∞(x_i, x_j);
pdist(X,'minkowski',p): Minkowski's distance d_p.
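
A minimal sketch comparing these calls on the same data, reusing the data matrix from the example above (the option names are the spellings accepted by pdist; note that MATLAB spells the Chebyshev option 'chebychev'):

% Sketch: pairwise distances under different metrics.
X = [1 4 5; 2 3 7; 5 1 4; 6 2 4];
D2   = pdist(X);                      % Euclidean d_2 (default)
D1   = pdist(X, 'cityblock');         % d_1
Dinf = pdist(X, 'chebychev');         % d_infinity
Dp   = pdist(X, 'minkowski', 3);      % Minkowski d_p with p = 3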

Linear Dimensionality-Reduction Mappings

A linear data mapping for a data sequence (u_1, ..., u_m) ∈ Seq_m(R^n) is a linear mapping r : R^n → R^q. If R ∈ R^{n×q} is the matrix that represents this mapping, then r(u_i) = R'u_i for 1 ≤ i ≤ m. If q < n, we refer to r as a linear dimensionality-reduction mapping. The reduced data matrix is
\[
r(X_E) = \begin{pmatrix} r(u_1)' \\ \vdots \\ r(u_m)' \end{pmatrix}
       = \begin{pmatrix} (R'u_1)' \\ \vdots \\ (R'u_m)' \end{pmatrix}
       = X_E R \in \mathbb{R}^{m \times q}.
\]

The reduced data set r(X_E) has the new variables Y_1, ..., Y_q.
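
A minimal MATLAB sketch of a dimensionality-reduction mapping (the data and the matrix R below are arbitrary illustrative choices):

% Sketch: reduce 4-dimensional data to q = 2 dimensions with a linear map.
m = 100; n = 4; q = 2;
X = randn(m, n);          % sample matrix X_E, rows are observations
R = randn(n, q);          % matrix representing r : R^n -> R^q
Xr = X * R;               % reduced data matrix r(X_E) = X_E * R
size(Xr)                  % returns [100 2]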

The Sample Mean

Let (u_1, ..., u_m) be a series of observations in R^n. The sample mean of this sequence is the vector

\[
\tilde{u} = \frac{1}{m} \sum_{i=1}^{m} u_i \in \mathbb{R}^n.
\]

The series is centered if ũ = 0_n. The series (u_1 − ũ, ..., u_m − ũ) is always centered. Also, observe that
\[
\tilde{u} = \frac{1}{m}(u_1 \cdots u_m)\, 1_m.
\]
If n = 1, the series of observations is reduced to a vector v ∈ R^m.
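
A short MATLAB sketch of the sample mean and of centering (base MATLAB only; the data matrix is illustrative):

% Sketch: sample mean and centering of a sequence of observations.
X = [1 12 77; 3 15 80; 2 15 75; 5 18 98];   % rows u_1', ..., u_m'
m = size(X, 1);
u_tilde = (1/m) * sum(X, 1);                % sample mean, same as mean(X)
Xc = X - ones(m,1) * u_tilde;               % centered series u_i - u_tilde
mean(Xc)                                    % each column mean is (numerically) 0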

The Standard Deviation

The standard deviation of a vector v ∈ R^m is the number

\[
s_v = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} (v_i - \bar{v})^2},
\]

where \bar{v} is the mean of the components of v. The standard deviation of a sample matrix X ∈ R^{m×n}, where

X = (v_1 · · · v_n), is the row vector s = (s_{v_1}, ..., s_{v_n}).

Scaling

If the measurement scales for the variables V_1, ..., V_n involved in the experiment are very different due to different measurement units, some variables may inappropriately influence the analysis process. Therefore, the columns of the data sample matrix need to be scaled to make them comparable. To scale a matrix we replace each column v_i by (1/s_{v_i}) v_i. This yields a matrix in which the standard deviation of each column equals 1.
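
A minimal sketch of column scaling (base MATLAB only; the data matrix is illustrative):

% Sketch: scale each column of X by its standard deviation.
X = [1 12 77; 3 15 80; 2 15 75; 5 18 98];
s = std(X);                            % standard deviations s_{v_1}, ..., s_{v_n}
Xs = X ./ (ones(size(X,1),1) * s);     % replace v_i by (1/s_{v_i}) v_i
std(Xs)                                % every column now has standard deviation 1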

The Effects of Centering on a Sample Matrix

Let X ∈ R^{m×n} be a sample matrix
\[
X = \begin{pmatrix} u_1' \\ \vdots \\ u_m' \end{pmatrix}.
\]

The sample matrix that corresponds to the centered sequence is

\[
\hat{X} = \left(I_m - \frac{1}{m} 1_m 1_m'\right) X.
\]

The Centering Matrix

To center a data matrix X ∈ R^{m×n} we need to multiply it on the left by the centering matrix

\[
H_m = I_m - \frac{1}{m} 1_m 1_m' \in \mathbb{R}^{m \times m},
\]

that is, X̂ = H_m X. Note that H_m = I_m − (1/m) J_m, where J_m = 1_m 1_m' is the m × m matrix of all ones. H_m is both symmetric and idempotent, and H_m has the eigenvalue 0 (with eigenvector 1_m).
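
These properties can be verified numerically with a short sketch (base MATLAB; the data matrix is illustrative):

% Sketch: the centering matrix H_m and its effect on a sample matrix.
X = [1 12 77; 3 15 80; 2 15 75; 5 18 98];
m = size(X, 1);
Hm = eye(m) - (1/m) * ones(m);      % H_m = I_m - (1/m) * 1_m * 1_m'
Xhat = Hm * X;                      % centered sample matrix
norm(Hm - Hm', 'fro')               % 0: H_m is symmetric
norm(Hm*Hm - Hm, 'fro')             % ~0: H_m is idempotent
mean(Xhat)                          % column means are (numerically) 0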

ML Computation

If X ∈ R^{m×n} is a matrix, the standard deviations are computed in MATLAB using the function std(X), which returns an n-dimensional row vector s containing the square roots of the sample variances of the columns of X, that is, their standard deviations. The means of the columns of X are computed in MATLAB using the function mean(X).

z-Scores

The MATLAB function Z = zscore(X) computes a centered and scaled version of a data sample matrix having the same format as X. If X is a matrix, then z-scores are computed using the mean and standard deviation along each column of X. The columns of Z have sample mean zero and sample standard deviation one (unless a column of X is constant, in which case that column of Z is constant at 0). [Z,mu,sigma] = zscore(X) also returns the mean vector mu and the vector of standard deviations sigma.

Example

X =
     1    12    77
     3    15    80
     2    15    75
     5    18    98

The means and the standard deviations of the columns of X:

>> m = mean(X)

m = 2.7500 15.0000 82.5000

>> s=std(X)

s = 1.7078 2.4495 10.5357

Example

To compute the mean, the standard deviation, and the matrix Z together, we write

>> [Z,m,s] = zscore(X)

Z =
   -1.0247   -1.2247   -0.5220
    0.1464         0   -0.2373
   -0.4392         0   -0.7119
    1.3175    1.2247    1.4712

m = 2.7500 15.0000 82.5000

s = 1.7078 2.4495 10.5357
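
The same z-scores can be obtained directly from the mean and the standard deviation, as a quick sketch shows (zscore requires the Statistics and Machine Learning Toolbox; the computation below uses base MATLAB only):

% Sketch: z-scores computed directly from the mean and standard deviation.
X = [1 12 77; 3 15 80; 2 15 75; 5 18 98];
mu = mean(X);
sigma = std(X);
Z = (X - ones(size(X,1),1)*mu) ./ (ones(size(X,1),1)*sigma);
% Z agrees with the output of zscore(X).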

Inertia of a Set of Vectors

Let u = (u_1, ..., u_m) be a sequence of vectors in R^n. The inertia of this sequence relative to a vector z ∈ R^n is the number

\[
I_z(u) = \sum_{j=1}^{m} \| u_j - z \|_2^2.
\]
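
A minimal MATLAB sketch of the inertia computation (the data and the point z are illustrative):

% Sketch: inertia of a sequence of vectors relative to a point z.
U = randn(50, 3);                            % rows are u_1', ..., u_m'
z = [1 2 3];
Iz = sum(sum((U - ones(50,1)*z).^2, 2));     % I_z(u) = sum_j ||u_j - z||^2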

Huygens' Inertia Theorem

Let u = (u_1, ..., u_m) ∈ Seq_m(R^n). We have

\[
I_z(u) - I_{\tilde{u}}(u) = m \, \| \tilde{u} - z \|_2^2
\]
for every z ∈ R^n. The minimal value of the inertia I_z(u) is achieved for z = ũ.
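
The theorem can be checked numerically with a short sketch (base MATLAB; illustrative data):

% Sketch: numerical check of Huygens' inertia theorem.
U = randn(50, 3);
m = size(U, 1);
u_tilde = mean(U);
z = [1 2 3];
Iz  = sum(sum((U - ones(m,1)*z      ).^2, 2));   % inertia relative to z
Iut = sum(sum((U - ones(m,1)*u_tilde).^2, 2));   % inertia relative to the mean
Iz - Iut                          % equals m * ||u_tilde - z||^2 up to rounding
m * norm(u_tilde - z)^2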

Covariance Coefficient

Let u, w ∈ R^m, where m > 1, having the means ū and w̄, and the standard deviations s_u and s_w, respectively. The covariance coefficient of u and w is
\[
\mathrm{cov}(u, w) = \frac{1}{m-1} \sum_{i=1}^{m} (u_i - \bar{u})(w_i - \bar{w}).
\]
The correlation coefficient of u and w is
\[
\rho(u, w) = \frac{\mathrm{cov}(u, w)}{s_u s_w}.
\]
By the Cauchy-Schwarz Inequality, we have

\[
\left| \sum_{i=1}^{m} (u_i - \bar{u})(w_i - \bar{w}) \right|
\le \sqrt{\sum_{i=1}^{m} (u_i - \bar{u})^2} \cdot \sqrt{\sum_{i=1}^{m} (w_i - \bar{w})^2},
\]

which implies −1 ≤ ρ(u,w ) ≤ 1.
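
A minimal sketch of the covariance and correlation coefficients computed from their definitions (base MATLAB; the vectors u and w are illustrative):

% Sketch: covariance and correlation coefficients of two vectors.
u = [1; 3; 2; 5];
w = [12; 15; 15; 18];
m = length(u);
c   = sum((u - mean(u)) .* (w - mean(w))) / (m - 1);   % cov(u, w)
rho = c / (std(u) * std(w));                           % correlation, in [-1, 1]
% c and rho agree with the (1,2) entries of cov(u, w) and corrcoef(u, w).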

Sample Covariance Matrix

Let X ∈ R^{m×n} be a sample matrix and let X̂ be the centered sample matrix corresponding to X. The sample covariance matrix is the matrix

\[
\mathrm{cov}(X) = \frac{1}{m-1} \hat{X}' \hat{X} \in \mathbb{R}^{n \times n}.
\]

Note that if X is centered, cov(X) = (1/(m−1)) X'X. If n = 1 the matrix is reduced to one column X = (v) and

\[
\mathrm{cov}(v) = \frac{1}{m-1} v'v \in \mathbb{R}.
\]
In this case we refer to cov(v) as the variance of v; this number is denoted by var(v).
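
A short sketch comparing the definition of cov(X) with MATLAB's built-in cov (the data matrix is the one from the z-score example):

% Sketch: sample covariance matrix from the centered sample matrix.
X = [1 12 77; 3 15 80; 2 15 75; 5 18 98];
m = size(X, 1);
Hm = eye(m) - (1/m) * ones(m);
Xhat = Hm * X;                       % centered sample matrix
C = (Xhat' * Xhat) / (m - 1);        % cov(X) = (1/(m-1)) * Xhat' * Xhat
% C agrees with the built-in cov(X).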

If X = (v_1 · · · v_n), then (cov(X))_{ij} = cov(v_i, v_j) for 1 ≤ i, j ≤ n. The covariance matrix can also be written as

\[
\mathrm{cov}(X) = \frac{1}{m-1} X' H_m' H_m X = \frac{1}{m-1} X' H_m X.
\]
The sample correlation matrix is the matrix corr(X) given by (corr(X))_{ij} = ρ(v_i, v_j) for 1 ≤ i, j ≤ n.

If X is centered, then cov(X) = (1/(m−1)) X'X. cov(X) is a symmetric, positive semidefinite matrix. The rank of cov(X) is the same as the rank of X̂ and, since m, the size of the sample, is usually much larger than n, we are often justified in assuming that rank(cov(X)) = n. Let X = (v_1 · · · v_n) ∈ R^{m×n} be a sample matrix. Note that

\[
H_m v_p = \left(I_m - \frac{1}{m} 1_m 1_m'\right) v_p = v_p - \frac{1}{m} 1_m 1_m' v_p = v_p - a_p 1_m,
\]

because (1/m) 1_m' v_p = a_p for 1 ≤ p ≤ n, where ũ' = (a_1, ..., a_n).

Total Variance

The covariance matrix can be written as
\[
\mathrm{cov}(X) = \frac{1}{m-1} (v_1 \cdots v_n)' H_m' H_m (v_1 \cdots v_n)
               = \frac{1}{m-1} (H_m v_1 \cdots H_m v_n)' (H_m v_1 \cdots H_m v_n),
\]
which implies that the (p, q)-entry of this matrix is

\[
\mathrm{cov}(X)_{pq} = \frac{1}{m-1} (H_m v_p)'(H_m v_q) = \frac{1}{m-1} (v_p - a_p 1_m)'(v_q - a_q 1_m).
\]
For a diagonal element we have
\[
\mathrm{cov}(X)_{pp} = \frac{1}{m-1} \sum_{i=1}^{m} \bigl((v_p)_i - a_p\bigr)^2,
\]
which shows that cov(X)_{pp} measures the scattering of the values of the p-th variable around the corresponding component a_p of the sample mean. This quantity is known as the p-th variance and is denoted by σ_p^2 for 1 ≤ p ≤ n. The total variance tvar(X) of X is trace(cov(X)).

For p ≠ q the element c_{pq} of the matrix C = cov(X) is referred to as the (p, q)-covariance. We have:

\[
\begin{aligned}
(\mathrm{cov}(X))_{pq} &= \frac{1}{m-1} (v_p - a_p 1_m)'(v_q - a_q 1_m) \\
 &= \frac{1}{m-1} \bigl( v_p' v_q - a_p 1_m' v_q - a_q v_p' 1_m + m\, a_p a_q \bigr) \\
 &= \frac{1}{m-1} \bigl( v_p' v_q - m\, a_p a_q \bigr).
\end{aligned}
\]

If cov(X)_{pq} = 0, then we say that the variables V_p and V_q are uncorrelated.
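
A minimal sketch of the total variance (base MATLAB; illustrative data):

% Sketch: total variance as the trace of the covariance matrix.
X = [1 12 77; 3 15 80; 2 15 75; 5 18 98];
C = cov(X);
tv = trace(C);            % total variance tvar(X)
sum(var(X))               % same value: the sum of the per-column variances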

Covariance matrix and Orthogonal Matrices

Let x 1  .  X = .   x m be a centered sample matrix and let R ∈ Rn×n be an . If Z ∈ Rm×n is a matrix such that Z = XR, then Z is centered, cov(Z)= R0cov(X )R and tvar(Z) = tvar(X ).

Properties of the Covariance Matrix

Since the covariance matrix of a centered matrix X, cov(X) = (1/(m−1)) X'X ∈ R^{n×n}, is symmetric, cov(X) is orthonormally diagonalizable, so there exists an orthogonal matrix R ∈ R^{n×n} such that R' cov(X) R = D, which corresponds to a sample matrix Z = XR. Let cov(Z) = D = diag(d_1, ..., d_n). The number d_p is the sample variance of the p-th variable of the data matrix Z, and the covariances of the form cov(Z)_{pq} with p ≠ q are 0. From a statistical point of view, this means that the variables Z_p and Z_q are uncorrelated. Without loss of generality we can assume that d_1 ≥ ··· ≥ d_n. The columns of the matrix Z correspond to the new variables Z_1, ..., Z_n.
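
A short sketch of this diagonalization using eig (base MATLAB; illustrative data):

% Sketch: diagonalizing cov(X) to obtain uncorrelated variables.
X = randn(100, 3);
Xc = X - ones(100,1) * mean(X);      % centered sample matrix
[R, D] = eig(cov(Xc));               % R orthogonal, D = diag of eigenvalues
[d, idx] = sort(diag(D), 'descend'); % order the variances d_1 >= ... >= d_n
R = R(:, idx);
Z = Xc * R;                          % columns of Z are the new variables Z_1, ..., Z_n
cov(Z)                               % (numerically) diagonal, with d on the diagonal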
