
34 | Principal component analysis

34.1 Introduction

Principal component analysis (PCA) is a useful method to reduce the dimensionality of a multivariate data set Y ∈ R^{n×m}. It is closely related to the singular value decomposition (SVD) of a data matrix. The SVD of a data matrix Y ∈ R^{n×m} is defined as

Y = UΣV^T,  (34.1)

where U ∈ R^{n×n} is an orthogonal matrix, Σ ∈ R^{n×m} is a rectangular diagonal matrix, and V ∈ R^{m×m} is an orthogonal matrix. The diagonal entries σ_ii of Σ for i = 1, ..., min(n, m), referred to as the singular values of Y, correspond to the square roots of the non-zero eigenvalues of both the matrices Y^T Y ∈ R^{m×m} and YY^T ∈ R^{n×n}. The n columns of U are referred to as the left-singular vectors and correspond to the eigenvectors of YY^T, while the m columns of V are referred to as the right-singular vectors and correspond to the eigenvectors of Y^T Y. The aim of the current chapter is to review the linear algebraic terminology involved in both SVD and PCA, and, more importantly, to endow the linear algebraic concepts of SVD with some data analytic intuition. Before we provide an outline of the current chapter, we give two examples for the application of PCA in functional neuroimaging.
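As a numerical illustration of the relation between the SVD of a data matrix and the eigendecomposition of Y^T Y, the following sketch (assuming Python with NumPy; the matrix is randomly generated and not part of the original text) checks that the singular values are the square roots of the eigenvalues of Y^T Y:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 3))            # a small data matrix Y ∈ R^{n×m} with n = 5, m = 3

# singular value decomposition Y = UΣV^T (eq. 34.1)
U, s, Vt = np.linalg.svd(Y)                # s holds the singular values σ_ii

# eigenvalues of Y^T Y, sorted in decreasing order
evals = np.linalg.eigh(Y.T @ Y)[0][::-1]

# the singular values are the square roots of the eigenvalues of Y^T Y
print(np.allclose(s**2, evals))            # True
```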

Example 1

In fMRI one is often interested in the BOLD signal time-series of anatomical regions of interest, for example as the data for biophysical modelling approaches (Chapter 43). If a region of interest comprises both voxels that exhibit MR signal increases for a given experimental perturbation and other voxels that exhibit MR signal decreases for the same perturbation, averaging the voxel time-series over space can artificially create an average time-series that exhibits no modulation by the experimental perturbation - despite the fact that both voxel populations were in fact responsive to the experimental perturbation (Figure 34.1A, left panel). This effect can be mitigated by using the first eigenvector of the region’s voxel-by-voxel covariance matrix, sometimes referred to as the first eigenmode, as a summary of the region’s MR signal time series instead (Figure 34.1A, right panel). On the other hand, if the voxel MR signal time-series within a region of interest are spatially coherent, then the average time-series and the first eigenvector of the voxel MR time-series matrix do not differ much.
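The following sketch (a hypothetical illustration in Python/NumPy, not taken from the chapter) mimics the scenario of Figure 34.1A: half of the voxels are positively and half negatively modulated by an event, so the spatial average is essentially flat, whereas the first eigenmode of the voxel-by-voxel covariance matrix retains the modulation:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(200)
event = np.sin(2 * np.pi * t / 50)                        # hypothetical event-related modulation

# voxel × time matrix: 10 positively and 10 negatively modulated voxels plus noise
Y = np.vstack([ event + 0.2 * rng.standard_normal(200) for _ in range(10)] +
              [-event + 0.2 * rng.standard_normal(200) for _ in range(10)])

average = Y.mean(axis=0)                                  # spatial average: modulations cancel

# first eigenmode: eigenvector of the voxel-by-voxel covariance matrix with largest eigenvalue
evals, evecs = np.linalg.eigh(np.cov(Y))                  # np.cov treats rows (voxels) as variables
eigenmode_timecourse = evecs[:, -1] @ Y                   # time-series summarized by the first eigenmode

print(abs(np.corrcoef(average, event)[0, 1]))               # close to 0: modulation lost
print(abs(np.corrcoef(eigenmode_timecourse, event)[0, 1]))  # close to 1: modulation retained
```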

Example 2

In biophysical modelling approaches for event-related potentials, such as dynamic causal modelling (Chapter 44), the data corresponds to a matrix whose dimensions are the number of electrodes and the number of peri-event time-bins. For computational efficiency, this potentially large matrix can be projected onto a smaller matrix of feature timecourses. Only this reduced matrix is then subjected to biophysical modelling using the DCM framework. As an example, the leftmost panel of Figure 34.1B visualizes an event-related potential EEG electrode × data sample matrix. The central panel of Figure 34.1B depicts the feature representation of these data comprising the five eigenvectors of the data covariance matrix that are associated with the largest variance, and the rightmost panel visualizes the reconstructed data based on these PCA results only. Notably, the reconstructed data based on the PCA-selected features is virtually identical to the original data.

To get at the inner workings of both PCA and SVD we have to revisit some elementary concepts from matrix theory and vector space theory. We proceed as follows: we first review some fundamentals of matrix eigenanalysis, including the notions of eigenvalues, eigenvectors, and the diagonalization of real symmetric matrices. We then review some essential prerequisites from vector space theory, including the notions of abstract vector spaces, linear vector combinations, vector space bases, orthogonal and orthonormal bases, vector projections, and vector coordinate transforms. In essence, PCA corresponds to a coordinate transform of a data set onto a basis that is formed by the eigenvectors of its empirical covariance matrix. In this transformed space, the data features have zero covariance and are hence maximally informative. This property can be used to remove redundant features from the data set and hence allows for data compression.


Figure 34.1. PCA applications in functional neuroimaging. (A) Eigenmode analysis as a spatial summary measure for region-of-interest timecourse extraction. For the current example, it is assumed that a region of interest comprises two voxel populations, one of which is positively modulated by some temporal event of interest (left panel, upper half of voxels), the other of which is negatively modulated by the same event of interest (left panel, lower half of voxels). The right panel depicts the resulting spatial average, which exhibits no systematic variation with the temporal event of interest, as well as the first eigenmode of the voxel × TR matrix shown on the left, which retains the event-related modulation. (B) PCA for feature selection and dimensionality reduction. The leftmost panel depicts an EEG event-related potential electrode × data samples matrix. Using PCA, these data can be compressed to the feature representation shown in the central panel. Notably, the reconstructed data based on this feature representation is virtually identical to the original data, as shown in the rightmost panel.

34.2 Eigenanalysis

An intuitive understanding of the concepts of eigenanalysis requires familiarity with differential equations. We will here thus strive only for a formal understanding. Let A ∈ R^{m×m} be a square matrix. Any vector v ∈ R^m, v ≠ 0 that fulfils the equation

Av = λv  (34.2)

for a λ ∈ R is called an eigenvector of A. The scalar λ is called an eigenvalue of A. Each eigenvector has an associated eigenvalue. Eigenvalues for different eigenvectors can be identical. Note that if v ∈ R^m is an eigenvector of A with eigenvalue λ, then av ∈ R^m with a ∈ R, a ≠ 0 is also an eigenvector of A with the same eigenvalue λ. Therefore, one assumes without loss of generality that eigenvectors have length one, i.e., v^T v = 1.
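A brief numerical check of eq. (34.2), assuming Python with NumPy (an illustrative sketch, not part of the original text); note that np.linalg.eig returns eigenvectors of unit length, in line with the convention above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # an arbitrary square matrix A ∈ R^{2×2}

eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(2):
    v = eigenvectors[:, i]                 # i-th eigenvector, normalized to unit length
    lam = eigenvalues[i]                   # associated eigenvalue
    print(np.allclose(A @ v, lam * v))     # Av = λv (eq. 34.2): True
    print(np.isclose(v @ v, 1.0))          # v^T v = 1: True
```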

Computing eigenvectors and eigenvalues

Eigenvectors and eigenvalues of a matrix A ∈ R^{m×m} can be computed as follows. First, from the definition of eigenvectors and eigenvalues we have

Av = λv ⇔ Av − λv = 0 ⇔ (A − λI) v = 0 (34.3)

This shows that we are interested in a vector v ∈ R^m and a scalar λ ∈ R, such that the matrix product of (A − λI) and v results in the zero vector 0 ∈ R^m. A trivial solution for this would be to set v = 0, but this is not allowed by the definition of the eigenvector. If v ≠ 0, we must adjust λ and v such that v is an element of the nullspace of A − λI. The nullspace of a matrix M ∈ R^{m×m}, here denoted by N(M), is the set of all vectors w ∈ R^m that are mapped onto the zero vector, i.e.,

N(M) := {w ∈ R^m | Mw = 0}.  (34.4)

If the nullspace of a matrix contains any element other than the zero vector, the matrix is noninvertible (singular). This holds, because the zero vector is always mapped onto the zero vector by premultiplication with any matrix. If another vector is also mapped onto the zero vector, it would be ambiguous which vector to assign to the zero vector when inverting the mapping, and the matrix in question therefore cannot be invertible. We know that the determinant can be used to check whether a matrix is invertible or not, and we can make use of this here: if a matrix is not invertible, then its determinant must be zero. Therefore, we are searching for all scalars λ ∈ R, such that

χ_A(λ) := det(A − λI) = 0.  (34.5)

The expression det(A − λI), conceived as a function of λ, is referred to as the characteristic polynomial of A, because, written in full, it corresponds to a polynomial in λ. Formulation of the characteristic polynomial then allows for the following strategy to compute the eigenvalues and eigenvectors of a matrix:

1. Solve χ_A(λ) = 0 for its zero-crossings (also referred to as roots) λ_i^*, i = 1, 2, .... The roots of the characteristic polynomial are the eigenvalues of A.

2. Substitute the values λ_i^* in (34.3), which yields the system of linear equations

(A − λ_i^* I) v_i = 0  (34.6)

and solve the system for the associated eigenvectors v_i, i = 1, 2, .... For small matrices with nice properties, the above strategy can be applied by hand. In practice, matrices are usually larger than 3 × 3 and eigenanalysis problems are usually solved using numerical computing.
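As a small worked example of this strategy (added here for illustration; the matrix is an arbitrary choice, not from the original text), consider

```latex
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\chi_A(\lambda)
= \det\begin{pmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{pmatrix}
= (2-\lambda)^{2} - 1
= (\lambda - 1)(\lambda - 3).
```

The roots of χ_A, and hence the eigenvalues of A, are λ_1^* = 3 and λ_2^* = 1. Substituting λ_1^* = 3 into (34.6) gives −v_11 + v_21 = 0, so v_1 = (1/√2)(1, 1)^T after normalization to unit length; similarly, λ_2^* = 1 gives v_11 + v_21 = 0 and thus v_2 = (1/√2)(1, −1)^T.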

Eigenvalues and eigenvectors of symmetric matrices

We next consider how eigenvalues and eigenvectors can be used to decompose or diagonalize matrices. To this end, assume that the square matrix A ∈ R^{m×m} is symmetric, for example, because A is a covariance matrix. A corollary of a fundamental result from linear algebra, which is known as the spectral theorem, asserts that symmetric matrices of size m × m have m real eigenvalues λ_1, ..., λ_m (not necessarily distinct) with m associated mutually orthogonal eigenvectors q_1, ..., q_m ∈ R^m. This implies that

Aq_i = λ_i q_i for i = 1, ..., m.  (34.7)

If we write the eigenvectors q_1, ..., q_m as the columns of a matrix

Q = ( q_1 ··· q_m ) ∈ R^{m×m},  (34.8)

then

AQ = ( λ_1 q_1 ··· λ_m q_m ) = ( q_1 λ_1 ··· q_m λ_m ) = QΛ,  (34.9)

where Λ ∈ R^{m×m} is the diagonal matrix

Λ = diag(λ_1, λ_2, ..., λ_m)  (34.10)

that contains the eigenvalues λ_1, ..., λ_m along its main diagonal and zeros elsewhere. Right-multiplication of both sides of eq. (34.9) by Q^T results in

AQQ^T = QΛQ^T ⇔ A = QΛQ^T,  (34.11)

while left-multiplication of both sides of eq. (34.9) by Q^T results in

Q^T AQ = Q^T QΛ ⇔ Q^T AQ = Λ.  (34.12)

In other words, by computing the eigenvalues and eigenvectors of a symmetric matrix A ∈ R^{m×m} we can find two matrices Q, Λ ∈ R^{m×m} which allow us to rewrite A as QΛQ^T (cf. (34.11)). This is an example for a general class of methods known as matrix decompositions. Below, we will introduce another matrix decomposition, the singular value decomposition. Equivalently, we see that pre-multiplying the symmetric matrix A ∈ R^{m×m} with the transpose of the matrix comprising its eigenvectors as columns and post-multiplying it with that matrix itself results in a diagonal matrix (cf. (34.12)). This is an example for a general class of methods known as matrix diagonalizations.
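A numerical sketch of the decomposition (34.11) and the diagonalization (34.12), assuming Python with NumPy (the symmetric matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # construct an arbitrary symmetric matrix A ∈ R^{4×4}

lam, Q = np.linalg.eigh(A)             # eigenvalues and orthonormal eigenvectors of A
Lam = np.diag(lam)                     # diagonal matrix Λ of eigenvalues

print(np.allclose(A, Q @ Lam @ Q.T))   # matrix decomposition A = QΛQ^T (eq. 34.11): True
print(np.allclose(Q.T @ A @ Q, Lam))   # matrix diagonalization Q^T AQ = Λ (eq. 34.12): True
```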

34.3 Vector space theory

Vector spaces

We define the concept of an abstract vector space as follows:

Vector space
Let F be a field, the elements of which we refer to as scalars, and let V be a non-empty set. If “+” denotes a mapping V × V → V, referred to as vector addition, and “·” denotes a mapping F × V → V, referred to as scalar multiplication, then the tuple (V, +, ·) is called a vector space on the field F, if for arbitrary elements v, w, u ∈ V and a, b ∈ F the following conditions hold:

1. Commutativity of addition: v + w = w + v.
2. Associativity of addition: (v + w) + u = v + (w + u).
3. Existence of an identity element of addition: there exists an element 0 ∈ V, referred to as the zero vector, such that v + 0 = 0 + v = v.
4. Existence of additive inverses: for every v ∈ V there exists an element −v ∈ V such that v + (−v) = 0.
5. Existence of an identity element of scalar multiplication: there exists an element 1 ∈ F such that 1 · v = v.
6. The following axioms for scalar multiplication apply:
   (a) Associativity: a · (b · v) = (ab) · v.
   (b) Distributivity with respect to vector addition: a · (v + w) = a · v + a · w.
   (c) Distributivity with respect to field element addition: (a + b) · v = a · v + b · v.

Note that no multiplication is defined between elements of V for V to be a vector space. We note without proof that (R^m, +, ·), i.e., the set of m-tuples with real entries, together with the addition between elements of R^m and the scalar multiplication between elements of R and R^m, is a vector space. The tuple of a vector space V and a scalar product is referred to as an inner product space. For two vectors v, w ∈ R^m the scalar product of v and w is defined as the scalar d ∈ R given by the matrix product of the transpose of v ∈ R^{m×1} and w ∈ R^{m×1},

d := v^T w.  (34.13)

The Euclidean length of a vector v ∈ R^m is defined as the square root of its scalar product with itself,

l := √(v^T v).  (34.14)

Two vectors v, w ∈ R^m are called orthogonal (cf. Chapter 28), if their scalar product is zero, i.e., if

v^T w = 0.  (34.15)

If, in addition, the length of the vectors v and w is 1, the vectors are called orthonormal. If a vector v ∈ R^m can be expressed as a scalar multiple of another vector w ∈ R^m, i.e., if

v = λw  (34.16)

for some scalar λ ∈ R, then the vectors v and w are called linearly dependent (cf. Chapter 28). If this is not possible, the vectors are called linearly independent. A linear combination of a set of n ∈ N vectors v_1 := (v_11, v_21, ..., v_m1)^T, v_2 := (v_12, v_22, ..., v_m2)^T, ..., v_n := (v_1n, v_2n, ..., v_mn)^T with coefficients a_1, ..., a_n ∈ R is a vector w ∈ R^m of the form

w := Σ_{i=1}^n a_i v_i = a_1 v_1 + a_2 v_2 + ... + a_n v_n = a_1 (v_11, ..., v_m1)^T + a_2 (v_12, ..., v_m2)^T + ... + a_n (v_1n, ..., v_mn)^T.  (34.17)

Vector space bases

A central result of basic linear algebra is that a set of n linearly independent vectors suffices to express any vector in an n-dimensional vector space. A set of n linearly independent vectors of an n-dimensional vector space is called a basis. Of particular interest for PCA are orthonormal bases. An orthonormal basis of the space R^m is a set of vectors v_1, v_2, ..., v_m which are of unit length and are mutually orthogonal. Using the Kronecker delta

δ_ij := 1 if i = j and δ_ij := 0 if i ≠ j,  (34.18)

an orthonormal basis is thus a set of vectors v_1, ..., v_m, such that for i = 1, ..., m and j = 1, ..., m

v_i^T v_j = δ_ij.  (34.19)

Writing an arbitrary vector x ∈ R^m as a linear combination of basis vectors is called a basis expansion of x, while the coefficients in this linear combination are referred to as the coordinates of x with respect to the basis in question.

As a first example for a vector space basis, we consider the so-called canonical basis of R^2 given by the set

{e_1, e_2} = {(1, 0)^T, (0, 1)^T}.  (34.20)

Note that

e_1^T e_1 = 1, e_1^T e_2 = 0, e_2^T e_1 = 0, and e_2^T e_2 = 1,  (34.21)

which shows that e_1 and e_2 are indeed orthonormal. The basis expansion of a vector x := (x_1, x_2)^T in terms of the canonical basis is given by

x = x_1 (1, 0)^T + x_2 (0, 1)^T,  (34.22)

and the coordinates in terms of the canonical basis are given by x_1 and x_2. As a generalization of the canonical basis of R^2, the canonical basis of R^m comprises m vectors e_1, ..., e_m with all zero entries except a 1 at the ith position, i = 1, ..., m. A second example for an orthonormal basis of R^2 is given by the set

{b_1, b_2} = {(1/√2, 1/√2)^T, (−1/√2, 1/√2)^T}.  (34.23)

Note that we have

b_1^T b_1 = 1/2 + 1/2 = 1, b_1^T b_2 = −1/2 + 1/2 = 0, b_2^T b_1 = −1/2 + 1/2 = 0, and b_2^T b_2 = 1/2 + 1/2 = 1,  (34.24)

which shows that b_1 and b_2 are indeed orthonormal. Because orthogonality implies linear independence, the set {b_1, b_2} thus indeed forms a basis of R^2.

Orthogonal matrices

Concatenating n orthonormal vectors of an n-dimensional vector space, i.e., an orthonormal basis, in a matrix results in an orthogonal matrix. In other words, an orthogonal matrix is a matrix whose columns are pairwise orthonormal, i.e., orthogonal and of unit length. Square matrices Q ∈ R^{m×m} comprising m orthonormal vectors q_1, ..., q_m ∈ R^m as columns fulfil the following equations:

Q^T Q = QQ^T = I_m.  (34.25)

Notably, as the inverse of a matrix A ∈ R^{m×m} is defined as the matrix A^{−1} ∈ R^{m×m} for which the equation

A^{−1} A = AA^{−1} = I_m  (34.26)

holds, we have the following identity for orthogonal matrices:

Q^T = Q^{−1}.  (34.27)

In words: the inverse of an orthogonal matrix is equal to its transpose.
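A short numerical check of eqs. (34.25) and (34.27), again assuming NumPy (the matrix comprises the orthonormal basis vectors of eq. (34.23) as columns):

```python
import numpy as np

# orthogonal matrix whose columns are the orthonormal basis vectors b_1 and b_2 of eq. (34.23)
Q = np.array([[1.0, -1.0],
              [1.0,  1.0]]) / np.sqrt(2)

print(np.allclose(Q.T @ Q, np.eye(2)))        # Q^T Q = I_2 (eq. 34.25)
print(np.allclose(Q @ Q.T, np.eye(2)))        # QQ^T = I_2 (eq. 34.25)
print(np.allclose(Q.T, np.linalg.inv(Q)))     # Q^T = Q^{-1} (eq. 34.27)
```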

Vector projections

The orthogonal projection of a vector y ∈ R^m onto a vector q ∈ R^m results in a projection vector ỹ ∈ R^m (Figure 34.2A). This projection vector can be thought of as the point on the line spanned by q ∈ R^m which is closest to the point y ∈ R^m. Because ỹ ∈ R^m can be thought of as a point on the vector q ∈ R^m, it must be a scalar multiple of it, i.e.,

ỹ = aq ∈ R^m for a ∈ R.  (34.28)

The fact that ỹ is the point on q closest to y in a Euclidean metric sense implies the orthogonality of the vectors y − ỹ and q (cf. Figure 34.2A):

q^T (y − ỹ) = 0.  (34.29)

It thus follows that the scalar multiple a is given as

a = q^T y / q^T q,  (34.30)

because (cf. Chapter 39)

q^T (y − ỹ) = 0 ⇔ q^T y − q^T ỹ = 0 ⇔ q^T y − a q^T q = 0 ⇔ a q^T q = q^T y ⇔ a = q^T y / q^T q.  (34.31)

Substitution of a = q^T y / q^T q in eq. (34.28) thus yields the following expression for the projection vector

ỹ = (q^T y / q^T q) q.  (34.32)

Figure 34.2A visualizes exemplary vectors y, q, ỹ ∈ R^m for the case m = 2. Note that if the vector q ∈ R^m onto which y ∈ R^m is projected has length 1, i.e., if q^T q = 1 and thus √(q^T q) = 1, expression (34.32) for the projection vector ỹ ∈ R^m of y onto q simplifies to

ỹ = (q^T y) q.  (34.33)
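The projection formulae (34.32) and (34.33) can be sketched numerically as follows (assuming NumPy; the specific vectors are arbitrary examples):

```python
import numpy as np

y = np.array([2.0, 1.0])
q = np.array([3.0, 0.0])                            # not of unit length

a = (q @ y) / (q @ q)                               # scalar a = q^T y / q^T q (eq. 34.30)
y_tilde = a * q                                     # projection of y onto q (eq. 34.32)

print(y_tilde)                                      # [2. 0.]
print(np.isclose(q @ (y - y_tilde), 0.0))           # y − ỹ is orthogonal to q (eq. 34.29): True

q_unit = q / np.linalg.norm(q)                      # unit-length version of q
print(np.allclose((q_unit @ y) * q_unit, y_tilde))  # eq. (34.33) yields the same projection: True
```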

Using projections, the coordinates of a vector x ∈ R^m with respect to an orthonormal basis {q_1, ..., q_m} can readily be obtained. Specifically, let x ∈ R^m, let {q_1, ..., q_m} denote an orthonormal basis of R^m, and let x be given in terms of the orthonormal expansion

x = Σ_{i=1}^m a_i q_i  (34.34)

for scalars a_i, i = 1, ..., m. Then the scalars a_i, i = 1, ..., m can be evaluated based on the vector x and the basis vectors q_i, i = 1, ..., m as

a_i = x^T q_i for i = 1, ..., m.  (34.35)

In other words, the orthonormal expansion of x in the q_i, i = 1, ..., m can be written

x = Σ_{i=1}^m (x^T q_i) q_i.  (34.36)


Figure 34.2. Vector projections and coordinate transforms. Panel A visualizes the projection ỹ of a vector y ∈ R^2 onto a vector q ∈ R^2. Note that ỹ can be conceived as that point on the line formed by the vector q for which the Euclidean distance to y is minimized, which implies that q and the vector y − ỹ are orthogonal. Panel B visualizes the notion of a vector coordinate transform in R^2. Specifically, the vector y is considered as being represented with respect to the canonical basis of R^2 by the coordinates y_1 = 0.33 and y_2 = 0.66 (black dot). The canonical basis of R^2 is indicated by the blue lines along the x- and y-axes. The same vector in R^2 is then considered with coordinates expressed with respect to the orthonormal basis {(−1/√2, 1/√2)^T, (1/√2, 1/√2)^T}, indicated by the red lines along the diagonal vectors (−1, 1) and (1, 1). Based on the procedure discussed in the main text, the transformed coordinates evaluate to ỹ_1 = 0.70 and ỹ_2 = 0.23. Note that ỹ is the same vector as y, but its coordinates are expressed with respect to a different basis of R^2.

Proof of eq. (34.35)

Equation (34.35) follows from the following equivalence relationships:

x = Σ_{i=1}^m a_i q_i ⇔ q_j^T x = q_j^T Σ_{i=1}^m a_i q_i ⇔ q_j^T x = Σ_{i=1}^m a_i q_j^T q_i ⇔ a_j = q_j^T x ⇔ a_j = (q_j^T x)^T ⇔ a_j = x^T q_j  (34.37)

for all j = 1, ..., m. Note that the third equivalence follows because q_j^T q_i = δ_ij, i.e., q_j^T q_i = 0 for j ≠ i and q_j^T q_j = 1. □

Coordinate transforms for orthonormal bases

An important result from linear algebra shows how the coordinates of a vector with respect to one orthonormal basis can be transformed into the coordinates of the same vector with respect to a second orthonormal basis. Specifically, if {v_1, ..., v_m} and {w_1, ..., w_m} are two orthonormal bases of the real vector space R^m, then the coordinates of a vector y with respect to the basis {v_1, ..., v_m} can be transformed into the coordinates with respect to the basis {w_1, ..., w_m} by means of a matrix multiplication. The matrix that changes the coordinates of a vector with respect to the first basis into coordinates with respect to the second basis can be found by the following steps:

1. Express the basis vectors w_1, ..., w_m as orthonormal expansions in the basis v_1, ..., v_m and write the coordinates into the columns of a matrix A.

2. The matrix that changes the coordinates from v_1, ..., v_m to w_1, ..., w_m is given by A^T, while the matrix that changes the coordinates from w_1, ..., w_m to v_1, ..., v_m is given by A.

We do not aim to prove the approach introduced above, but merely illustrate it with an example in R^2 (cf. Figure 34.2B). To this end, let the first orthonormal basis be given by the canonical basis of R^2, i.e.,

{v_1, v_2} = {(1, 0)^T, (0, 1)^T}.  (34.38)

Let the second orthonormal basis of R^2 be

{w_1, w_2} = {(1/√2, 1/√2)^T, (−1/√2, 1/√2)^T}.  (34.39)

Further, let y ∈ R^2 be the vector whose coordinates with respect to {v_1, v_2} we wish to express in terms of the basis {w_1, w_2}. To express the basis vectors w_1 and w_2 as orthonormal expansions in the basis {v_1, v_2}, we first set

(1/√2, 1/√2)^T = a_11 (1, 0)^T + a_21 (0, 1)^T and (−1/√2, 1/√2)^T = a_12 (1, 0)^T + a_22 (0, 1)^T.  (34.40)

Projecting the vectors w_1, w_2 onto v_1, v_2 then yields the following coefficients according to eq. (34.35):

a_11 = 1/√2, a_21 = 1/√2, a_12 = −1/√2, and a_22 = 1/√2.  (34.41)

For the coordinate transformation matrix A we thus obtain

A = ( a_11  a_12 ; a_21  a_22 ) = ( 1/√2  −1/√2 ; 1/√2  1/√2 ),  (34.42)

where semicolons separate the rows of the matrix.

Note that because the first orthonormal basis corresponds to the canonical basis of R^2, the matrix A merely comprises the basis vectors of the second orthonormal basis as columns. The matrix that by premultiplication expresses the coordinates of y with respect to the canonical basis {v_1, v_2} of R^2 as coordinates of y with respect to the basis {w_1, w_2} is then given as the transpose of A,

A^T = ( 1/√2  1/√2 ; −1/√2  1/√2 ).  (34.43)

To change the coordinates of an arbitrary vector y ∈ R^2 expressed with respect to the canonical basis of R^2 to a representation ỹ ∈ R^2 with coordinates expressed in terms of the orthonormal basis {w_1, w_2}, one thus computes

ỹ = (ỹ_1, ỹ_2)^T = ( 1/√2  1/√2 ; −1/√2  1/√2 ) (y_1, y_2)^T = A^T y.  (34.44)

This change of coordinates is visualized for y = (0.33, 0.66)^T in Figure 34.2B. Note that y does not change its location; the only thing that changes is its coordinate representation with respect to the two bases.
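The coordinate transform of Figure 34.2B can be reproduced numerically as follows (a sketch assuming NumPy, replicating the example from the text):

```python
import numpy as np

y = np.array([0.33, 0.66])                     # coordinates of y w.r.t. the canonical basis {v_1, v_2}

# columns of A are the second basis vectors w_1, w_2 expressed in the canonical basis (eq. 34.42)
A = np.array([[1.0, -1.0],
              [1.0,  1.0]]) / np.sqrt(2)

y_tilde = A.T @ y                              # coordinates of y w.r.t. {w_1, w_2} (eq. 34.44)
print(y_tilde)                                 # approximately [0.70, 0.23]

print(np.allclose(A @ y_tilde, y))             # A maps the new coordinates back to the old ones: True
```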

34.4 Principal component analysis

With the concepts of symmetric matrix decomposition by means of eigenanalysis, as well as vector coordinate transforms, formally in place, we are now in the position to review principal component analysis per se. To this end, we assume that n observations of m-dimensional data vectors y_1, ..., y_n ∈ R^{1×m} have been obtained, where the y_i, i = 1, ..., n are conceived as realizations of a not further specified 1 × m-dimensional random vector with positive-definite covariance matrix Σ ∈ R^{m×m}. We assume that the y_i, i = 1, ..., n form the rows of a data matrix Y ∈ R^{n×m}, i.e.,

Y := ( y_1 ; y_2 ; ... ; y_n ) = ( y_11  y_12  ···  y_1m ; y_21  y_22  ···  y_2m ; ... ; y_n1  y_n2  ···  y_nm ) ∈ R^{n×m}.  (34.45)

Note that this data arrangement is in accordance with data tables in which each column represents a certain data feature (e.g., age, height, etc.) and each row represents an experimental unit (e.g., participant, laboratory animal, etc.). Let

ȳ := (1/n) Σ_{i=1}^n y_i = (1/n) Σ_{i=1}^n ( y_i1  y_i2  ···  y_im ) ∈ R^{1×m}  (34.46)

denote the empirical mean vector of the n observations y_1, ..., y_n ∈ R^{1×m}. Further, let

Y_c := Y − (1_n ⊗ ȳ) ∈ R^{n×m}  (34.47)

denote the mean-centred data matrix. Note that the Kronecker product 1_n ⊗ ȳ, where 1_n ∈ R^{n×1} denotes the vector of all ones, evaluates to a matrix containing the empirical mean ȳ in each of its rows. Finally, let

C := (1/(n − 1)) Y_c^T Y_c ∈ R^{m×m}  (34.48)

denote the empirical covariance matrix of the data set. Because empirical covariance matrices are symmetric matrices, evaluating their eigenvalues and eigenvectors allows for their decomposition according to eq. (34.11). Specifically, C can be written as

C = QΛQ^T,  (34.49)

where Q ∈ R^{m×m} comprises the eigenvectors of C as columns and Λ ∈ R^{m×m} is a diagonal matrix with the corresponding eigenvalues along its main diagonal. If the eigenvalues and their associated eigenvectors are ordered in decreasing order of the eigenvalues, such that λ_1 ≥ λ_2 ≥ ··· ≥ λ_m, the matrix decomposition of an empirical covariance matrix (34.49) is also referred to as principal component analysis and the columns of the matrix Q are known as the principal components. As discussed above, the columns of Q are mutually orthogonal and of unit length and thus constitute an orthonormal basis of R^m. Notably, as shown below, in the coordinate system represented by this orthonormal basis, the data vectors are uncorrelated. Moreover, due to the ordering in decreasing eigenvalues, the data vectors expressed in terms of the eigenvector basis exhibit the largest variance along the first principal component (basis vector) and decreasing variance along the remaining principal components (basis vectors). This aspect of principal component analysis can be used for reducing the number of data dimensions, as described in the next section. In the coordinate system spanned by the eigenvectors of the covariance matrix C, i.e., in terms of the basis formed by the columns of Q ∈ R^{m×m}, the data vectors are uncorrelated. In this sense, PCA allows for obtaining a coordinate description of a data set in which redundant information between coordinates is removed. Formally, we have the following result: with (34.49), let

Ỹ := (Q^T Y_c^T)^T = Y_c Q  (34.50)

denote the matrix of coordinate-transformed data points, and let

ȳ̃ := ȳQ  (34.51)

denote the coordinate-transformed empirical mean. Then it holds that

C̃ := (1/(n − 1)) Ỹ^T Ỹ = Λ,  (34.52)

where, with (34.49),

Λ = Q^T CQ  (34.53)

and is hence a diagonal matrix.
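Before turning to the proof, the result (34.52) can be checked numerically; a sketch assuming Python with NumPy and an arbitrary random data matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 4
Y = rng.standard_normal((n, m)) @ rng.standard_normal((m, m))   # data matrix with correlated features

Yc = Y - Y.mean(axis=0)                    # mean-centred data matrix Y_c (eq. 34.47)
C = (Yc.T @ Yc) / (n - 1)                  # empirical covariance matrix (eq. 34.48)

lam, Q = np.linalg.eigh(C)                 # eigendecomposition C = QΛQ^T (eq. 34.49)
lam, Q = lam[::-1], Q[:, ::-1]             # reorder such that λ_1 ≥ λ_2 ≥ ... ≥ λ_m

Y_tilde = Yc @ Q                           # coordinate-transformed data (eq. 34.50)
C_tilde = (Y_tilde.T @ Y_tilde) / (n - 1)  # empirical covariance of the transformed data

print(np.allclose(C_tilde, np.diag(lam)))  # the transformed covariance equals Λ (eq. 34.52): True
```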

Proof of eq. (34.52)

We consider the empirical covariance matrix of the coordinate-transformed data vectors ỹ_i, i = 1, ..., n:

C̃ = (1/(n − 1)) Ỹ^T Ỹ
  = (1/(n − 1)) (Y_c Q)^T Y_c Q
  = (1/(n − 1)) Q^T Y_c^T Y_c Q      (34.54)
  = Q^T ((1/(n − 1)) Y_c^T Y_c) Q
  = Q^T CQ
  = Λ. □

Singular value decomposition and principal component analysis

Before discussing how PCA can be employed to reduce the dimensionality of a data set, we consider its efficient evaluation using singular value decomposition. To this end, we first emphasize again that a matrix decomposition of the form

C = QΛQ^T  (34.55)

with orthonormal matrix Q ∈ R^{m×m} and diagonal matrix Λ ∈ R^{m×m} can only be obtained for real and symmetric matrices C ∈ R^{m×m}. However, there exists an analogous matrix decomposition for arbitrary matrices Y ∈ R^{n×m} of the form

Y = UΣV^T,  (34.56)

such that U ∈ R^{n×n} is an orthonormal matrix, V ∈ R^{m×m} is an orthonormal matrix, and Σ ∈ R^{n×m} is a rectangular diagonal matrix. The matrix decomposition (34.56) is known as the singular value decomposition of the matrix Y, and the diagonal entries of Σ are known as the singular values of the matrix Y. Notably, as shown below, the columns of U are the eigenvectors of YY^T, whereas the columns of V are the eigenvectors of Y^T Y.

Proof of the eigendecompositions of YY^T and Y^T Y implied by eq. (34.56)

We first note without proof that YY^T and Y^T Y are real symmetric matrices. With eq. (34.56) we then have

YY^T = (UΣV^T)(UΣV^T)^T
     = UΣV^T V Σ^T U^T      (34.57)
     = UΣΣ^T U^T.

Because Σ is a (rectangular) diagonal matrix, S := ΣΣ^T is a diagonal matrix as well and comprises the squares of the diagonal entries of Σ along its main diagonal. By definition, U is an orthonormal matrix. We have thus rewritten the real symmetric matrix YY^T in the form USU^T for an orthonormal matrix U and a diagonal matrix S. Eq. (34.11) then implies that the columns of U are the eigenvectors of YY^T and the diagonal entries of Σ are the square roots of the corresponding eigenvalues. Similarly, we have

Y^T Y = (UΣV^T)^T (UΣV^T)
      = V Σ^T U^T UΣV^T      (34.58)
      = V Σ^T ΣV^T.

Here, we have rewritten the real symmetric matrix Y^T Y in the form V(Σ^T Σ)V^T for an orthonormal matrix V and the diagonal matrix Σ^T Σ, which likewise comprises the squares of the diagonal entries of Σ along its main diagonal. Again, eq. (34.11) then implies that the columns of V are the eigenvectors of Y^T Y and the diagonal entries of Σ are the square roots of the corresponding eigenvalues.

□

The PCA of a data matrix Y ∈ R^{n×m} can thus be computed by evaluating the singular value decomposition of its scaled and mean-centred variant

(1/√(n − 1)) Y_c = UΣV^T.  (34.59)

Here, the principal components appear in the columns of V and the square roots of the corresponding eigenvalues appear in Σ. PCA can thus be implemented in either of two ways: either by computing the covariance matrix of the data matrix and evaluating its eigenvectors and eigenvalues, or by computing the singular value decomposition of an appropriately scaled and mean-centred data matrix. We visualize PCA by singular value decomposition in Figure 34.3.

Figure 34.3. Principal component analysis by singular value decomposition. (A) The left panel depicts a data matrix Y ∈ R^{n×m} for n = 20 and m = 30. The right panel depicts the empirical covariance matrix C ∈ R^{m×m} of the data. (B) The left panel depicts the matrix Q ∈ R^{m×m} in the matrix decomposition C = QΛQ^T, which is equal to the matrix V ∈ R^{m×m} in the singular value decomposition (√(n − 1))^{−1} Y_c = UΣV^T. The right panel depicts the diagonal matrix Λ ∈ R^{m×m}, corresponding to Λ = Σ^T Σ with the diagonal matrix Σ ∈ R^{n×m} of the singular value decomposition (√(n − 1))^{−1} Y_c = UΣV^T and comprising the eigenvalues of C along its diagonal. (C) The right panel depicts the coordinate-transformed data matrix Ỹ = Y_c Q ∈ R^{n×m} and the left panel the empirical covariance matrix C̃ ∈ R^{m×m} of Ỹ. Note that the data features with index larger than 10 are virtually zero in Ỹ and that the empirical covariance matrix of the transformed data matrix is a diagonal matrix (cf. eq. (34.52)).
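The equivalence of the two implementations described above, eigendecomposition of the empirical covariance matrix on the one hand and SVD of the scaled, mean-centred data matrix (eq. 34.59) on the other, can be sketched numerically as follows (assuming NumPy; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 50, 3
Y = rng.standard_normal((n, m))
Yc = Y - Y.mean(axis=0)                              # mean-centred data matrix

# route 1: eigendecomposition of the empirical covariance matrix
C = (Yc.T @ Yc) / (n - 1)
lam, Q = np.linalg.eigh(C)
lam, Q = lam[::-1], Q[:, ::-1]                       # eigenvalues in decreasing order

# route 2: singular value decomposition of Y_c / sqrt(n − 1) (eq. 34.59)
U, s, Vt = np.linalg.svd(Yc / np.sqrt(n - 1), full_matrices=False)

print(np.allclose(s**2, lam))                        # squared singular values equal the eigenvalues
print(np.allclose(np.abs(Vt.T), np.abs(Q)))          # columns of V and Q agree up to sign
```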

Data dimensionality reduction with PCA

Consider a data matrix Y ∈ Rn×m such as the n = 350 × m = 64 dimensional event-related potential EEG data matrix in Figure 34.1B. Given the matrix decomposition

C = QΛQ^T  (34.60)

of the empirical covariance matrix C of the mean-centred data matrix Y_c, one may select the largest p ≤ m eigenvalues and their corresponding eigenvectors for further analysis. To this end, one may simply discard all columns of Q with index larger than p to form a matrix Q̃ ∈ R^{m×p}. The projection

Ỹ = Y_c Q̃  (34.61)

then results in a dimensionality-reduced data set Ỹ ∈ R^{n×p}, as shown in the central panel of Figure 34.1B. Because the p eigenvectors corresponding to the largest p eigenvalues account for the majority of the data covariance, the reconstructed data

Y̌ = Ỹ Q̃^T  (34.62)

is largely similar to the original data, as shown in the rightmost panel of Figure 34.1B. PCA is hence a useful tool for data compression.
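Dimensionality reduction and reconstruction as described above can be sketched as follows (assuming NumPy; the data and the choice p = 2 are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, p = 200, 10, 2
# data that effectively vary along only p directions, plus a small amount of noise
Y = (rng.standard_normal((n, p)) @ rng.standard_normal((p, m))
     + 0.01 * rng.standard_normal((n, m)))

Yc = Y - Y.mean(axis=0)                        # mean-centred data matrix Y_c
lam, Q = np.linalg.eigh((Yc.T @ Yc) / (n - 1))
Q = Q[:, ::-1]                                 # eigenvectors ordered by decreasing eigenvalue

Q_p = Q[:, :p]                                 # keep the first p principal components (eq. 34.61)
Y_reduced = Yc @ Q_p                           # dimensionality-reduced data, n × p
Y_recon = Y_reduced @ Q_p.T                    # reconstructed data (eq. 34.62), n × m

print(Y_reduced.shape)                         # (200, 2)
print(float(np.max(np.abs(Y_recon - Yc))))     # small reconstruction error
```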

34.5 Bibliographic remarks

PCA is covered in all modern textbooks on data analysis, such as Bishop (2006), Barber (2012), and Murphy (2012).

Study questions

1. Show that the vectors v_1 = (2, 1)^T and v_2 = (−1, 2)^T are orthogonal, but not orthonormal.

2. Evaluate the projections of the vector v = (3, 2)^T onto the vectors e_1 = (1, 0)^T and e_2 = (0, 1)^T.

3. Express the vector y = (1, 2)^T with respect to the orthonormal basis

   {b_1, b_2} := {(3/5, 4/5)^T, (−4/5, 3/5)^T}.  (34.63)

4. Give a verbal definition of the principal components of a data set Y ∈ R^{n×m}.

5. State the dimensions and properties of the matrices U, Σ, and V resulting from the singular value decomposition of a matrix Y ∈ R^{n×m}.

Study questions answers

1. We have

   v_1^T v_2 = ( 2  1 ) (−1, 2)^T = 2 · (−1) + 1 · 2 = 0.  (34.64)

   The vectors are thus orthogonal. We also have

   √(v_1^T v_1) = √(( 2  1 ) (2, 1)^T) = √(2 · 2 + 1 · 1) = √5 ≠ 1  (34.65)

   and thus v_1 is not of length 1, so the vectors cannot be orthonormal.

2. Because e_1^T e_1 = e_2^T e_2 = 1, we have

   ṽ_1 = (e_1^T v) e_1 = (( 1  0 ) (3, 2)^T) (1, 0)^T = 3 (1, 0)^T = (3, 0)^T  (34.66)

   and similarly,

   ṽ_2 = (e_2^T v) e_2 = (( 0  1 ) (3, 2)^T) (0, 1)^T = 2 (0, 1)^T = (0, 2)^T.  (34.67)

3. To change the coordinates of the vector y ∈ R^2, which is expressed with respect to the canonical orthonormal basis {e_1, e_2}, to the coordinates of a vector ỹ ∈ R^2 expressed in terms of the orthonormal basis {b_1, b_2}, we compute

   ỹ = (ỹ_1, ỹ_2)^T = A^T y,  (34.68)

   where A = ( 3/5  −4/5 ; 4/5  3/5 ) comprises b_1 and b_2 as columns, i.e.,

   ỹ = ( 3/5  4/5 ; −4/5  3/5 ) (1, 2)^T = ( 3/5 + 8/5, −4/5 + 6/5 )^T = ( 11/5, 2/5 )^T.  (34.69)

4. The principal components of Y ∈ R^{n×m} are the eigenvectors of the mean-centred data set's m × m empirical covariance matrix and represent a coordinate system, i.e., an orthonormal basis, with respect to which the data set has a diagonal empirical covariance matrix.

5. The singular value decomposition

   Y = UΣV^T  (34.70)

   of a matrix Y ∈ R^{n×m} comprises an orthonormal matrix U ∈ R^{n×n}, an orthonormal matrix V ∈ R^{m×m}, and a diagonal matrix Σ ∈ R^{n×m}.