Chapter 5

Some multivariate operations

In this chapter we study some signal processing operations for multivariable signals. We will consider applications where one collects a set of signals x_i(n), i = 1, 2, ..., M from some system. It is then important to study whether there are interdependencies between the signals. Such interdependencies cause redundancies, which can be exploited for data compression. Interdependencies between the individual signals can also contain useful information about the structure of the underlying systems that generated the set of signals. Firstly, when the number M of signals is large, dependencies between the individual signals make it possible to compress the data. This is for example the case when the M signals x_i represent pixels of a sequence of images, where M is the number of pixels in one image. If the images of interest depict a specific class of objects, such as human faces, it turns out that there are redundancies among the various x_i's which can be exploited to compress the images. Secondly, the individual signals x_i(n) are often mixtures of unknown source signals s_j(n). If the mixture is assumed linear, this implies that there are source signals s_j(n) such that

x_i(n) = a_{i1} s_1(n) + a_{i2} s_2(n) + \cdots + a_{i M_S} s_{M_S}(n), \quad i = 1, 2, \ldots, M     (5.1)

for n = 0, 1, ..., N − 1, where M_S is the number of source signals. Introducing the vectors

x(n) = \begin{bmatrix} x_1(n) \\ x_2(n) \\ \vdots \\ x_M(n) \end{bmatrix}, \quad s(n) = \begin{bmatrix} s_1(n) \\ s_2(n) \\ \vdots \\ s_{M_S}(n) \end{bmatrix}     (5.2)

and the matrix A with elements a_{ij}, (5.1) can be written compactly in vector form as

x(n) = A s(n), \quad n = 0, 1, \ldots, N − 1     (5.3)

In the following examples the measured signals are composed of source signals.
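Before turning to the examples, the following minimal sketch (in Python with NumPy, which is assumed available; the number of sources, the mixing matrix and the signal waveforms are invented for illustration only) shows how measured signals are formed from sources according to (5.3).

    import numpy as np

    # Illustrative setup (all values are made up): M_S = 2 source signals of length N.
    N = 1000
    n = np.arange(N)
    s = np.vstack([np.sin(2 * np.pi * 0.013 * n),            # s_1(n): a sinusoid
                   np.sign(np.sin(2 * np.pi * 0.031 * n))])  # s_2(n): a square wave

    # Hypothetical mixing matrix A (M = 3 measured signals from M_S = 2 sources).
    A = np.array([[1.0, 0.6],
                  [0.5, 1.2],
                  [0.9, -0.4]])

    # Measured signals x(n) = A s(n), stacked as a 3 x N matrix, cf. (5.3).
    x = A @ s
    print(x.shape)   # (3, 1000): one row per measured signal x_i(n)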

Example 5.1 The cocktail party problem. At a cocktail party, individual voices s_j(n) are mixed and only the mixed signals x_i(n) can be measured. The 'cocktail party problem' consists of computing the source signals s_j(n) from the measured sound signals x_i(n).

Example 5.2 Biomedical signal analysis. Magnetoencephalogram (MEG) signals used to analyze brain activity are measured using sensors placed at different positions on the head. As activities in the human body, such as heartbeats, breathing and eye blinking, also generate magnetic signals, the measured signals x_i(n) are superpositions of a number of source signals s_j(n). To better understand brain activity it is important to remove the signal components caused by heartbeat, breathing and eye blinking. This can be achieved by removing the associated source signals after these have been determined from the measured set of signals.

The problem of finding the source signals s(n) from a set of measured signals x_i(n) is called source signal separation. If the mixing matrix A is known, it is trivial to determine the source signals s(n) by inverting the linear relation (5.3). In many applications it is, however, not known how the source signals are mixed to produce the measured signals x_i(n). The problem of finding the source signals from the measured signals when the mixing matrix A is unknown is called blind signal separation. The classical example of blind signal separation is the cocktail party problem in Example 5.1. In order to solve the blind signal separation problem, some assumptions on the source signals have to be made. The most natural ones are that they are mutually uncorrelated or independent. In Section 5.1 Principal Component Analysis (PCA) is described, which can be used for signal decorrelation. Important applications of the technique are in data compression. In Section 5.2 Independent Component Analysis (ICA) is presented, which can be used to solve the blind signal separation problem.

5.1 Principal component analysis

Assume that we have a sequence of M signals x_i(n), n = 0, 1, ..., N − 1, i = 1, 2, ..., M. We should like to express the signals in the form (5.1) in such a way that the source signals s_j(n) are uncorrelated. In order to accomplish this, we consider signal variations about their mean values by defining the signals

w_i(n) = x_i(n) − m_i, \quad i = 1, 2, \ldots, M     (5.4)

where m_i is the mean value of {x_i(n)},

m_i = \frac{1}{N} \sum_{n=0}^{N-1} x_i(n), \quad i = 1, 2, \ldots, M     (5.5)

By construction, the signals w_i(n) have zero mean values. Our purpose is to express the signals w_i(n) in the form (5.1),

w_i(n) = a_{i1} s_1(n) + a_{i2} s_2(n) + \cdots + a_{i M_S} s_{M_S}(n), \quad i = 1, 2, \ldots, M     (5.6)

where now the source signals {s_j(n)} have zero mean values. Introducing the vector

w(n) = \begin{bmatrix} w_1(n) \\ w_2(n) \\ \vdots \\ w_M(n) \end{bmatrix}     (5.7)

we have, in analogy with (5.3),

w(n) = A s(n), \quad n = 0, 1, \ldots, N − 1     (5.8)

It is convenient to introduce the signal matrices

W = [ w(0)  w(1)  ···  w(N − 1) ]     (5.9)
S = [ s(0)  s(1)  ···  s(N − 1) ]     (5.10)

Relation (5.8) can then be written compactly as

W = A S     (5.11)

Blind signal decorrelation consists of finding the matrix A and uncorrelated source signals, such that their correlations r_{jk} vanish, i.e.,

r_{jk} = \frac{1}{N} \sum_{n=0}^{N-1} s_j(n) s_k(n) = 0, \quad j \neq k, \quad j, k = 1, 2, \ldots, M_S     (5.12)

Notice that the product s_j(n) s_k(n) is the (j, k)th element of s(n) s(n)^T. It follows, using (5.10), that relation (5.12) can be written compactly in matrix form as

\frac{1}{N} S S^T = \frac{1}{N} \left( s(0) s(0)^T + s(1) s(1)^T + \cdots + s(N − 1) s(N − 1)^T \right)

= diag(r_{11}, r_{22}, \ldots, r_{M_S M_S})     (5.13)

where diag(r_{11}, r_{22}, \ldots, r_{M_S M_S}) denotes a diagonal matrix with diagonal elements r_{kk}, k = 1, 2, \ldots, M_S, and zero off-diagonal elements. As the source signals can be scaled by incorporating the factor r_{kk} into the mixing parameters a_{ik} in equation (5.6), we can take the source signals to have unit variances, r_{kk} = 1, k = 1, 2, \ldots, M_S. This implies \frac{1}{N} S S^T = I, where I denotes the identity matrix. It is important to notice that the signal decorrelation problem is not unique. This can be seen by considering source signals with unit variances, \frac{1}{N} S S^T = I. Define the transformed source signals

s_Y(n) = Y s(n)     (5.14)

corresponding to the factorization of the signal matrix W as

W = A S = A Y^{-1} Y S     (5.15)

where Y S is the source signal matrix associated with the source signals s_Y(n). Then we have, for any matrix Y such that Y Y^T = I,

\frac{1}{N} (Y S)(Y S)^T = \frac{1}{N} Y S S^T Y^T = I     (5.16)

implying that the source signals s_Y(n) are also uncorrelated. In order to obtain a signal decorrelation procedure useful for data reduction, we impose a further condition on the source signals as follows. Introduce the sum of variances of the signal sequence {w(n)},

\|\{w(n)\}\|^2 = \sum_{n=0}^{N-1} \sum_{i=1}^{M} w_i(n)^2 = \sum_{n=0}^{N-1} w(n)^T w(n)     (5.17)

Then determine the M × 1 vector a_1 and the signal sequence {s_1(n)} such that the error variance

\|\{w(n) − a_1 s_1(n)\}\|^2

is minimized. Hence {s_1(n)} is the scalar source signal which gives the best approximation (in terms of smallest sum of error variances) of {w(n)}. Next, determine the M × 1 vector a_2 and the signal sequence {s_2(n)} such that the error variance

\|\{w(n) − a_1 s_1(n) − a_2 s_2(n)\}\|^2

is minimized. Hence {s_1(n)} and {s_2(n)} are the two source signals which give the best approximation of {w(n)}. Continuing this process, we can construct the vectors a_1, a_2, ..., a_r and the source signals {s_1(n)}, {s_2(n)}, ..., {s_r(n)} such that the error variance

\|\{w(n) − a_1 s_1(n) − \cdots − a_r s_r(n)\}\|^2

is minimized. Hence the signals {s_1(n)}, {s_2(n)}, ..., {s_r(n)} are the r source signals which give the best approximation of {w(n)}. It turns out that the source signals constructed in this way are uncorrelated. By construction, this decorrelation procedure is optimal for data compression in the sense that it gives the best approximation of {w(n)} with a given number of source signals. The solution of the optimal signal decorrelation problem described above is given by the singular value decomposition of the data matrix W. Recall that the signal decorrelation problem is equivalent to factoring the data matrix W according to (5.11) in such a way that S S^T is diagonal. In order to achieve this, it is convenient to introduce the normalized matrix

V^T = N^{-1/2} diag(r_{11}^{-1/2}, r_{22}^{-1/2}, \ldots, r_{M_S M_S}^{-1/2}) S     (5.18)

or

S = N^{1/2} diag(r_{11}^{1/2}, r_{22}^{1/2}, \ldots, r_{M_S M_S}^{1/2}) V^T     (5.19)

Relation (5.13) is then equivalent to

V^T V = I     (5.20)

where I is the identity matrix. Property (5.20) means that V has orthonormal columns (equivalently, V^T has orthonormal rows), i.e., the columns of V are orthogonal and have unit Euclidean norm. Introducing (5.19) into (5.11) reduces the signal decorrelation problem to finding a diagonal matrix Σ and a matrix V with orthonormal columns such that

W = A Σ V^T     (5.21)

The factorization (5.21) corresponding to optimal signal decorrelation can be determined by recalling the following standard result from matrix analysis.

Singular-value decomposition (SVD). Consider a real n × m matrix W. Let p = min(m, n). Then there exist an m × p matrix V with orthonormal columns,

V = [v_1, v_2, \ldots, v_p], \quad v_i^T v_i = 1, \quad v_i^T v_j = 0 \text{ if } i \neq j,     (5.22)

an n × p matrix U with orthonormal columns,

U = [u_1, u_2, \ldots, u_p], \quad u_i^T u_i = 1, \quad u_i^T u_j = 0 \text{ if } i \neq j,     (5.23)

and a diagonal matrix Σ with non-negative diagonal elements,

Σ = diag(σ_1, σ_2, \ldots, σ_p), \quad σ_1 ≥ σ_2 ≥ \cdots ≥ σ_p ≥ 0,     (5.24)

such that W can be written as

W = U Σ V^T     (5.25)

Such a decomposition is called the singular-value decomposition of W. The non-negative scalars σ_i are the singular values of W, the vector u_i is the ith left singular vector, and v_j is the jth right singular vector of W. Notice that (5.22) and (5.23) are equivalent to

V^T V = I, \quad U^T U = I     (5.26)

where I is the p × p identity matrix.
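As a quick numerical illustration of (5.22)–(5.26) (a sketch only, assuming NumPy and using a randomly generated example matrix), one can verify the orthonormality of U and V and the reconstruction W = U Σ V^T:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((5, 8))                       # an arbitrary 5 x 8 example matrix
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)  # economy-size SVD, p = min(5, 8)

    print(np.allclose(U.T @ U, np.eye(U.shape[1])))       # U^T U = I, cf. (5.26)
    print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))    # V^T V = I, cf. (5.26)
    print(np.allclose(W, U @ np.diag(sigma) @ Vt))        # W = U Sigma V^T, cf. (5.25)
    print(sigma)                                          # singular values in decreasing order, cf. (5.24)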

Remark 5.1 It is worthwhile to observe that the singular values and singular vectors can be characterized in terms of eigenvalue problems. More precisely, we have from (5.25) and (5.26),

W^T W = V Σ U^T U Σ V^T = V Σ^2 V^T     (5.27)

Hence, using (5.22), we have

W^T W v_i = σ_i^2 v_i     (5.28)

so that the right singular vectors v_i and the squared singular values σ_i^2 of the matrix W are the eigenvectors and eigenvalues, respectively, of the matrix W^T W. Similarly, we have

W W^T = U Σ V^T V Σ U^T = U Σ^2 U^T     (5.29)

and hence, using (5.23),

W W^T u_i = σ_i^2 u_i     (5.30)

so that the left singular vectors u_i and the squared singular values σ_i^2 of the matrix W are the eigenvectors and eigenvalues, respectively, of the matrix W W^T.

Remark 5.2 There is a simple connection between the left and right singular vectors. Multiplying (5.28) by W gives

W W^T W v_i = σ_i^2 W v_i

which is equal to (5.30) with (up to normalization by σ_i)

u_i = W v_i     (5.31)

In a similar way, multiplying (5.30) by W^T gives

W^T W W^T u_i = σ_i^2 W^T u_i

which is equal to (5.28) with (again up to normalization by σ_i)

v_i = W^T u_i     (5.32)

Hence the left and right singular vectors are related by (5.31) and (5.32). It is therefore sufficient to solve one of the eigenvalue problems (5.28), (5.30), naturally the one having the smaller dimension.
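The connection to the eigenvalue problems (5.28) and (5.30) can be checked numerically as well. The sketch below (NumPy assumed; the matrix is randomly generated) solves the smaller symmetric eigenproblem and recovers the other set of singular vectors via (5.31), normalizing by σ_i:

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.standard_normal((100, 6))        # tall matrix: W^T W (6 x 6) is the smaller problem

    # Eigen-decomposition of W^T W, cf. (5.27)-(5.28); eigh returns eigenvalues in ascending order.
    eigvals, V = np.linalg.eigh(W.T @ W)
    order = np.argsort(eigvals)[::-1]        # reorder to match the SVD convention (5.24)
    sigma = np.sqrt(eigvals[order])          # singular values sigma_i = sqrt(eigenvalues)
    V = V[:, order]

    U = (W @ V) / sigma                      # left singular vectors via (5.31), normalized by sigma_i

    U_ref, s_ref, Vt_ref = np.linalg.svd(W, full_matrices=False)
    print(np.allclose(sigma, s_ref))                        # same singular values
    print(np.allclose(np.abs(U.T @ U_ref), np.eye(6)))      # same singular vectors up to sign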

Singular value decomposition can be determined with the Matlab routine

[U,Sigma,V] = svd(W)

Now we can solve the optimal signal decorrelation problem as follows.

Optimal signal decorrelation. The signals {w(n)} can be decorrelated by introducing the singular value decomposition W = U Σ V^T of the signal matrix W in eq. (5.9). This has the form W = A S in (5.11) with A = U Σ and the source signal matrix

S = V^T     (5.33)

Relation (5.33) implies that the ith column v_i of V corresponds to the ith source signal sequence,

v_i = [ s_i(0)  s_i(1)  ···  s_i(N − 1) ]^T, \quad i = 1, 2, \ldots, M_S     (5.34)
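A minimal sketch of the whole decorrelation procedure (5.33)–(5.34), assuming NumPy and using invented example data, is given below. It centers the signals as in (5.4)–(5.5), computes the SVD of the signal matrix, and checks that the resulting source sequences are orthonormal, so that the correlations (5.12) vanish:

    import numpy as np

    def decorrelate(x):
        # Decorrelate an M x N signal matrix by SVD, cf. (5.33)-(5.34).
        m = x.mean(axis=1, keepdims=True)        # mean values m_i, cf. (5.5)
        W = x - m                                # zero-mean signals w_i(n), cf. (5.4)
        U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
        A = U * sigma                            # A = U Sigma, so that W = A S as in (5.11)
        S = Vt                                   # source signal matrix S = V^T, cf. (5.33)
        return A, S, m

    # Invented example data: three correlated measurement channels.
    rng = np.random.default_rng(2)
    z = rng.standard_normal((2, 500))
    x = np.vstack([z[0], 0.8 * z[0] + 0.3 * z[1], z[1] - 0.5 * z[0]])

    A, S, m = decorrelate(x)
    print(np.allclose(S @ S.T, np.eye(S.shape[0])))  # orthonormal source sequences => r_jk = 0 for j != k, cf. (5.12)
    print(np.allclose(A @ S + m, x))                 # exact reconstruction x(n) = A s(n) + m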

Orthogonality of the vectors v_i, equation (5.22), then implies orthogonality of the source signal sequences, equation (5.12). Next let us show that the singular value decomposition of the signal matrix solves the optimal signal decorrelation problem. Observe that (5.25) can be written in terms of the singular values and the columns of U and V as

W = \sum_{i=1}^{p} σ_i u_i v_i^T     (5.35)

This relation can be exploited for data compression by approximating the vector-valued signal sequence {w(n)} by a lower-dimensional one. In order to see how this can be achieved, we first show how the Euclidean norm of the signal sequence {w(n)} can be expressed in terms of the singular values σ_i of the associated signal matrix. For this purpose, recall the sum of variances of the signal sequence {w(n)} defined in equation (5.17),

\|\{w(n)\}\|^2 = \sum_{n=0}^{N-1} \sum_{i=1}^{M} w_i(n)^2 = \sum_{n=0}^{N-1} w(n)^T w(n)

Recalling that the trace of an m × m matrix A is defined as the sum of the diagonal elements a_{ii}, tr(A) = \sum_{i=1}^{m} a_{ii}, we have

w(n)^T w(n) = tr( w(n) w(n)^T )     (5.36)

Hence the sum of variances, or squared Euclidean norm, of the signal sequence {w(n)} can be expressed in terms of the signal matrix W as

\sum_{n=0}^{N-1} w(n)^T w(n) = tr\left( \sum_{n=0}^{N-1} w(n) w(n)^T \right) = tr( W W^T )     (5.37)

Recalling (5.29), we have

tr( W W^T ) = tr( U Σ^2 U^T ) = tr\left( \sum_{i=1}^{p} u_i σ_i^2 u_i^T \right) = \sum_{i=1}^{p} σ_i^2 u_i^T u_i = \sum_{i=1}^{p} σ_i^2     (5.38)

where in the last equality we have used the orthonormality (5.23) of the columns of U. The squared signal norm is therefore obtained as the sum of the squared singular values. Notice that (5.35), (5.34) corresponds to a decomposition of the signal vector w(n) according to

w(n) = \sum_{i=1}^{p} w_i(n)     (5.39)

where

w_i(n) = σ_i u_i s_i(n)     (5.40)

Here, the component w_i(n) has squared norm

\sum_{n=0}^{N-1} w_i(n)^T w_i(n) = σ_i^2 u_i^T u_i \sum_{n=0}^{N-1} s_i(n)^2 = σ_i^2     (5.41)

where we have used the fact that the source sequences (5.34) have unit norms. Hence (5.35) decomposes the signal w(n) into components w_i(n), which are generated by scalar source signals s_i(n), in such a way that the total squared Euclidean norm is the sum of the squared norms of the components. In particular, recalling the ordering of the singular values (5.24), the source signals have the property that of all possible scalar source signals, s_1(n) is the one which explains most of the total signal variance, s_2(n) is the one which explains most of the signal variance not explained by s_1(n), etc. Hence the source signals constructed in this way solve the optimal signal decorrelation problem. The components (5.35) constructed in this way are called the principal components of the matrix W. Principal components have important applications in a number of data reduction problems. Now consider approximating W by the first r columns of U and V in (5.25), i.e.,

W_r = U_r Σ_r V_r^T     (5.42)

where

V_r = [v_1, v_2, \ldots, v_r]     (5.43)

U_r = [u_1, u_2, \ldots, u_r]     (5.44)

and

Σ_r = diag(σ_1, σ_2, \ldots, σ_r)     (5.45)

Then we have for the approximation error

W − W_r = \sum_{i=r+1}^{p} σ_i u_i v_i^T     (5.46)

and, in analogy with (5.38),

tr\left( (W − W_r)(W − W_r)^T \right) = tr\left( \sum_{i=r+1}^{p} u_i σ_i^2 u_i^T \right) = \sum_{i=r+1}^{p} σ_i^2     (5.47)

It follows that if the sum in (5.47) is small compared to the one in (5.38), the signal matrix W can be accurately approximated by the matrix W_r, or equivalently, the signal sequence {w(n)} can be accurately approximated by the sequence {w_r(n)} defined by

w_r(n) = \sum_{i=1}^{r} σ_i u_i s_i(n)     (5.48)

If r is small compared to the signal dimension M, the approximation (5.48) can be used for data compression, as only the r source signal sequences {s_i(n)} and the M × r matrix U_r Σ_r need to be stored. A small numerical sketch of this truncation is given below; the following example then illustrates the idea on face images.
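The sketch (NumPy assumed; the data are made up) keeps the r largest singular values and checks that the squared approximation error equals the sum of the discarded σ_i^2, cf. (5.47):

    import numpy as np

    rng = np.random.default_rng(3)
    W = rng.standard_normal((20, 200))            # invented zero-mean signal matrix, M = 20, N = 200
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)

    r = 5                                          # keep the r dominant principal components
    W_r = U[:, :r] @ np.diag(sigma[:r]) @ Vt[:r]   # rank-r approximation W_r, cf. (5.42)

    err = np.sum((W - W_r) ** 2)                   # tr((W - W_r)(W - W_r)^T)
    print(np.allclose(err, np.sum(sigma[r:] ** 2)))  # equals the sum of the discarded sigma_i^2, cf. (5.47)

    # Storage needed: r sequences of length N plus the M x r matrix U_r Sigma_r,
    # instead of the full M x N matrix W.
    print(r * 200 + 20 * r, "values instead of", 20 * 200)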

Example 5.3 Eigenfaces. One field where principal component analysis has been found useful is image processing. An interesting example is the compression, detection and recognition of images of human faces. An image consists of a two-dimensional N_row × N_col array I(m, n), which can be represented as an N_row N_col-dimensional vector w by stacking the rows (or columns) of I(m, n) on top of each other. The dimension of w equals the number of pixels of the image, which may be quite large. However, as images of human faces comprise only a tiny fraction of all possible images, it can be assumed that it is possible to represent face images using a smaller set of parameters, obtained by suitable compression of the original image vector w. This can be achieved using principal component analysis. Consider a sequence of images represented by w(0), w(1), ..., w(N − 1), and define the mean image,

m = \frac{1}{N} \sum_{n=0}^{N-1} w(n)     (5.49)

Define the associated signal matrix W for the signal variation about the mean image,

W = [ w(0) − m w(1) − m ··· w(N − 1) − m ] (5.50)

Introduce the singular value decomposition (5.35) of W (denoting the left singular vectors by e_i),

W = \sum_{i=1}^{p} σ_i e_i v_i^T     (5.51)

By (5.39), (5.40) the face image signals w(n) − m can then be represented in terms of principal components as

w(n) = m + \sum_{i=1}^{p} w_i(n)     (5.52)

where

w_i(n) = σ_i e_i s_i(n)     (5.53)

Here the image vectors e_i are associated with the principal components of the image signal sequence, and resemble face images themselves. These vectors are called eigenfaces, as they are defined as the eigenvectors of the signal covariance matrix W W^T,

cf. equation (5.30). Observe, however, that as the number of rows of W is typically much larger than the number of columns, the matrix W W^T has a much larger dimension than the matrix W^T W. Therefore it is more efficient to solve the eigenvalue problem (5.28) and use relation (5.31) to find the eigenfaces. Obviously, data reduction according to (5.48) can be applied to face images, which can be approximated using the first r eigenfaces as

w_r(n) = m + \sum_{i=1}^{r} σ_i e_i s_i(n)     (5.54)

It turns out that regardless of the number N of images considered, good approximation accuracy can be achieved using only a limited number of the first eigenfaces. In the literature it has been reported that using r = 40 eigenfaces gives pixel-wise errors of about 2%. Next consider a new face described by the image vector w_{new}. We wish to study how well this image vector can be represented by the eigenface expansion (5.54), i.e., we should determine the r parameters s_{i,new} in the expansion

w_{new,r} = m + \sum_{i=1}^{r} e_i s_{i,new}     (5.55)

to make the approximation error

\tilde{w}_{new} = w_{new} − w_{new,r} = w_{new} − m − \sum_{i=1}^{r} e_i s_{i,new}     (5.56)

small. Using the fact that the vectors e_i are orthonormal, i.e., e_j^T e_i = 0 for j ≠ i and e_i^T e_i = 1, the squared Euclidean norm of the approximation error can be expanded as

\tilde{w}_{new}^T \tilde{w}_{new} = \sum_{i=1}^{r} \left( s_{i,new}^2 − 2 s_{i,new} e_i^T (w_{new} − m) \right) + (w_{new} − m)^T (w_{new} − m)
= \sum_{i=1}^{r} \left[ \left( s_{i,new} − e_i^T (w_{new} − m) \right)^2 − \left( e_i^T (w_{new} − m) \right)^2 \right] + (w_{new} − m)^T (w_{new} − m)     (5.57)

The parameters s_{i,new} which minimize the error (5.57) are given by

s_{i,new} = e_i^T (w_{new} − m)     (5.58)

(making the quadratic terms in the sum equal to zero). If the approximation error \tilde{w}_{new} in (5.56) is small, the face image vector w_{new} can be approximated by the eigenface expansion (5.55). In this case only the r parameters s_{i,new} are needed to represent the face

image vector. This implies a huge data compression ratio, as the original image signal dimension N_row N_col is usually much larger than r. Besides data compression, the expansion (5.55) can be applied to face detection and recognition. This can be achieved by computing the parameters s_{i,new} and the approximation defined by (5.55), and then applying the following tests (a small code sketch of the procedure is given after the list):

• If the approximation error \tilde{w}_{new} obtained when the image vector is approximated by the eigenface expansion (5.55) is less than some specified threshold value, we conclude that the image represents a face. Otherwise, it is not a face image.

• If the difference w_{new,r} − w(n) between the expansion (5.55) and some stored face vector w(n) is small (below a specified threshold value), we conclude that the face associated with w(n) has been recognized. Otherwise, the new face is not recognized.
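The following sketch outlines the procedure under stated assumptions (NumPy; randomly generated stand-ins for face images; the image size, the number r of eigenfaces and the two thresholds are all invented). It computes the eigenfaces through the smaller eigenvalue problem (5.28), projects a new image according to (5.58), and applies the two threshold tests above.

    import numpy as np

    rng = np.random.default_rng(4)
    Nrow, Ncol, N = 16, 16, 40                       # tiny invented "images" instead of real face photographs
    faces = rng.standard_normal((Nrow * Ncol, N))    # each column is a stacked image vector w(n)

    m = faces.mean(axis=1, keepdims=True)            # mean image, cf. (5.49)
    W = faces - m                                    # signal matrix, cf. (5.50)

    # Eigenfaces via the smaller eigenproblem W^T W (N x N), cf. (5.28) and (5.31).
    eigvals, V = np.linalg.eigh(W.T @ W)
    order = np.argsort(eigvals)[::-1]
    sigma = np.sqrt(np.maximum(eigvals[order], 0.0))
    E = (W @ V[:, order[:10]]) / sigma[:10]          # first r = 10 eigenfaces e_i (unit norm)

    # Project a new image on the eigenfaces, cf. (5.55) and (5.58).
    w_new = rng.standard_normal(Nrow * Ncol)
    s_new = E.T @ (w_new - m[:, 0])                  # coefficients s_{i,new} = e_i^T (w_new - m)
    w_new_r = m[:, 0] + E @ s_new                    # eigenface expansion w_{new,r}

    residual = np.linalg.norm(w_new - w_new_r)       # approximation error ||w_tilde_new||
    is_face = residual < 10.0                        # hypothetical detection threshold
    distances = np.linalg.norm(w_new_r[:, None] - faces, axis=0)  # distances to the stored faces
    recognized = distances.min() < 5.0               # hypothetical recognition threshold
    print(is_face, recognized)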

As Example 5.3 shows, principal component analysis and signal decorrelation are useful in data compression. Signal decorrelation is, however, insufficient for achieving source signal separation to solve problems of the type described in Examples 5.1 and 5.2. One restriction of signal decorrelation is that it is not unique, as shown by equations (5.14)–(5.16).

5.2 Independent component analysis

In order to decide whether two signals are truly independent it is not sufficient to determine their correlation (5.12), because two signals which are uncorrelated may still be dependent. In order to achieve separation of independent source components of a vector-valued signal, we need a stronger measure of independence. The concept of independence can be defined quantitatively in a statistical framework. Two random variables y_1 and y_2 are independent if knowledge of the value of y_1 does not give any information about the value of y_2 and vice versa. This is true if and only if the joint probability density function p(y_1, y_2) can be factored as p(y_1) p(y_2), where p(y_1) and p(y_2) are the probability density functions of y_1 and y_2, respectively. It can be shown that independent random variables have the useful property that for any functions h_1(·) and h_2(·),

E[h_1(y_1) h_2(y_2)] = E[h_1(y_1)] E[h_2(y_2)]     (5.59)

where E[·] is the expectation operator. The following example shows that two random variables may be uncorrelated although they are not independent.

Example 5.4 Uncorrelated but dependent variables. Consider two discrete-valued random variables y_1 and y_2 such that the combinations (y_1, y_2) ∈ {(0, 1), (0, −1), (1, 0), (−1, 0)} are equally probable (each with probability 1/4). Then

E[y_1 y_2] = 0

implying that the variables are uncorrelated. We also have

E[y_1^2 y_2^2] = 0

whereas

E[y_1^2] E[y_2^2] = \frac{1}{4}(0 + 0 + 1 + 1) \cdot \frac{1}{4}(1 + 1 + 0 + 0) = \frac{1}{4} \neq 0

Hence (5.59) does not hold for h_1(y) = h_2(y) = y^2, and the variables are not independent, although they are uncorrelated.
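A short numerical check of Example 5.4 (a sketch, assuming NumPy) confirms that the correlation vanishes while (5.59) fails for h_1(y) = h_2(y) = y^2:

    import numpy as np

    # The four equally probable outcomes (y1, y2) of Example 5.4.
    y = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)], dtype=float)
    y1, y2 = y[:, 0], y[:, 1]

    print(np.mean(y1 * y2))                    # E[y1 y2] = 0: uncorrelated
    print(np.mean(y1**2 * y2**2))              # E[y1^2 y2^2] = 0
    print(np.mean(y1**2) * np.mean(y2**2))     # E[y1^2] E[y2^2] = 1/4, so (5.59) fails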

Remark 5.3 A well-known property of gaussian variables is that two gaussian variables are uncorrelated if and only if they are independent. Therefore, the concept of independence does not bring in any new features as far as gaussian variables are concerned. In Section 5.1 we saw that signal decorrelation is not unique, and therefore signals having gaussian distributions cannot be separated uniquely into independent components either. For this reason, independent component analysis is possible only when the source signals are nongaussian, or more precisely, at most one of the independent components may have a gaussian distribution, whereas the others should be nongaussian.

In order to quantify independence, we recall the following measure of information. The information associated with a measurement having prior probability p is

I = \log_2(1/p) = − \log_2 p     (5.60)

where \log_2 denotes the logarithm with base 2. Hence the more improbable the measurement, the larger the information I. The choice of base 2 for the logarithm is convenient because it corresponds to a natural definition of the unit of information in terms of bits. In particular, an observation having probability p = 1/2, such as the outcome of a toss of a fair coin, is associated with I = − \log_2(1/2) = 1 bit of information. The definition of I as the logarithm of the inverse probability is natural, considering that it is reasonable to define the measure of information in such a way that the information associated with two independent measurements is the sum of the information associated with the individual measurements. Thus, consider two independent measurements A and B having prior probabilities p_A and p_B. Then the probability of observing both A and B is p_A p_B, with the associated information

I_{AB} = − \log_2(p_A p_B) = − \log_2 p_A − \log_2 p_B = I_A + I_B

Hence the information associated with observing the outcome of tossing two fair coins is 2 bits. Next consider a random variable Y. The entropy H(Y) is defined as the expected information obtained when an observation of the random variable is performed. For example, the entropy of a binary random variable Y with two possible outcomes a_1 and a_2 having equal probabilities P(Y = a_1) = P(Y = a_2) = 1/2 (such as the toss of a fair coin) is

H(Y) = − P(Y = a_1) \log_2 P(Y = a_1) − P(Y = a_2) \log_2 P(Y = a_2) = 1

This makes sense, as the information obtained when observing the variable Y is always 1 bit. On the other hand, the entropy of a certain event Y = a_0 having probability 1 is

H(Y) = − P(Y = a_0) \log_2 P(Y = a_0) = − \log_2 1 = 0

i.e., no information is obtained when observing a certain event.

Remark 5.4 The term entropy stems from physics, where it is a measure of randomness (of a gas, liquid, etc.). The connection with information is that the more random a variable is (i.e., the higher the entropy), the higher is the expected information obtained when observing an outcome of the variable.
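The definition (5.60) and the coin-toss entropies above can be reproduced with a few lines (a sketch, assuming NumPy):

    import numpy as np

    def information_bits(p):
        # Information I = -log2(p) of an observation with prior probability p, cf. (5.60).
        return -np.log2(p)

    def entropy_bits(probs):
        # Entropy of a discrete distribution, H = -sum_i p_i log2 p_i (terms with p_i = 0 contribute 0).
        probs = np.asarray(probs, dtype=float)
        nonzero = probs[probs > 0]
        return -np.sum(nonzero * np.log2(nonzero))

    print(information_bits(0.5))       # 1 bit for a fair coin toss
    print(information_bits(0.25))      # 2 bits for two independent fair coin tosses
    print(entropy_bits([0.5, 0.5]))    # 1 bit: fair coin
    print(entropy_bits([1.0, 0.0]))    # 0 bits: a certain event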

The entropy can be defined for random variables with more general distributions as follows. A discrete random variable Y with possible outcomes a_i has entropy

H(Y) = − \sum_i P(Y = a_i) \log_2 P(Y = a_i)     (5.61)

while a vector-valued random variable y with probability density function p(y) has entropy

H(y) = − \int p(y) \log_2 p(y) \, dy     (5.62)

It can be shown that among all random variables having a given covariance matrix, a normally distributed (gaussian) variable has the largest entropy. In this sense the gaussian distribution is the most random of all probability distributions. The entropy concept can be used to give a quantitative measure of the independence (or rather, dependence) of a set of random variables. Consider the random vector y = (y_1, y_2, \ldots, y_M). The mutual information between the random variables y_i, i = 1, 2, \ldots, M is defined as

I(y_1, y_2, \ldots, y_M) = \sum_{i=1}^{M} H(y_i) − H(y)     (5.63)

It is seen that the mutual information is the difference between the sum of the entropies of the random variables y_i considered individually and the entropy of the random vector y, in which dependencies between the individual random variables y_i are taken into account. As dependencies between the variables decrease the entropy, the mutual information is non-negative, and zero if and only if the variables are statistically independent. With the concepts defined above, we can now return to the problem of decomposing an observed vector-valued signal sequence {w(n)} into independent source signals {s_i(n)}. The definition of mutual information shows that independent source signals can be determined by minimizing their mutual information. Recall the signal decomposition (5.8), w(n) = A s(n). Without restriction, we can assume that the number of source signals s_i(n) is equal to the number of available signals w_i(n). Then (5.8) can be inverted, so that

s(n) = B w(n)     (5.64)

where B = A^{-1}. By (5.63), the mutual information of the source signal variables s_i(n) depends on their individual entropies H(s_i) and the total entropy H(s). It can be shown that for s given by (5.64), the entropy H(s) depends on the entropy H(w) according to

H(s) = H(w) + \log_2 |\det(B)|     (5.65)

where \det(B) is the matrix determinant. Now, as the source signals are required to be independent, they are obviously determined in such a way that they are uncorrelated. Without restriction, we can also assume that they have unit variance (this can always be achieved by scaling). Then

E[s s^T] = I     (5.66)

and by (5.64) we have

E[s s^T] = B E[w w^T] B^T = I     (5.67)

As det(I) = 1 it follows that

\det\left( B E[w w^T] B^T \right) = 1     (5.68)

Using the fact that the determinant of a matrix product is the product of the determinants of the factors, \det(A_1 A_2) = \det(A_1) \det(A_2), and the fact that the determinant of the transposed matrix B^T is equal to the determinant of B, we obtain

\det\left( B E[w w^T] B^T \right) = (\det B)(\det E[w w^T])(\det B^T) = (\det B)^2 (\det E[w w^T]) = 1     (5.69)

This implies that \det(B) is constant and depends only on the covariance matrix E[w w^T] of the signal {w(n)}. By (5.65), the entropy H(s) is then also constant, and from (5.63) it follows that the mutual information of the source signals s_i is given by

I(s_1, s_2, \ldots, s_M) = \sum_{i=1}^{M} H(s_i) + C     (5.70)

where C is a constant. The problem of finding the matrix B in (5.64) to minimize the mutual information of the uncorrelated unit-variance source signals s_i, i = 1, 2, \ldots, M is therefore equivalent to minimizing the sum of their entropies H(s_i). In independent component analysis, only the observed signal sequences {w_i(n)}, i = 1, 2, \ldots, M are available. As the entropy of a random variable depends on its probability density function according to (5.62), it is very difficult to estimate the entropy from an observed sequence of the variable. In practice various approximative approaches are therefore used to minimize the source signal entropies. Instead of considering the entropy directly, the negentropy is often used. The negentropy of a scalar random variable y is defined as

J(y) = H(y_{gauss}) − H(y)     (5.71)

where y_{gauss} is a gaussian random variable with the same variance as y. Negentropy is zero if and only if y is gaussian, and positive otherwise. Obviously, minimizing the entropy is equivalent to maximizing the negentropy. As it is difficult to estimate the negentropy of a random variable from a finite number of measurements, various approximations are used instead. It can be shown that negentropy can be approximated as

J(y) ≈ \left( E[G(y)] − E[G(y_{gauss})] \right)^2     (5.72)

where G(·) is a suitably selected non-quadratic function. Good choices for G are

G_1(y) = \frac{1}{a} \log \cosh(a y)
G_2(y) = − e^{−y^2/2}     (5.73)
G_3(y) = y^4

where 1 ≤ a ≤ 2 is a constant. The approximation (5.72) can now be used to construct an algorithm for separation of independent source signal components, by computing uncorrelated source signals {s_i(n)} with unit variances in such a way that the approximations (5.72) of the negentropies J(s_i) are maximized. The expectation in (5.72) can be estimated as the average over the available sequence,

E[G(s_i)] ≈ \frac{1}{N} \sum_{n=0}^{N-1} G(s_i(n))     (5.74)

The independent components can be computed successively as follows. Let the rows of B be denoted by b_i^T. Assume that we have estimated the first p independent components s_i(n) = b_i^T w(n), i = 1, \ldots, p. Then the independent component s_{p+1}(n) = b_{p+1}^T w(n) can be determined by computing b_{p+1} in such a way that the sequence {s_{p+1}(n)} maximizes the approximation (5.72) of the negentropy J(s_{p+1}), subject to the constraint that {s_{p+1}(n)} has unit variance and is uncorrelated with the sequences {s_i(n)}, i = 1, \ldots, p. This is a constrained optimization problem, which can be solved by standard methods. An efficient method to solve the optimization problem is the FastICA algorithm, where an iterative learning rule is applied to determine the vectors b_i^T.
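The text above only outlines the approach; the sketch below (assuming NumPy) implements a simple deflation-type fixed-point iteration in the spirit of FastICA, using the contrast function G_1 with a = 1 (so g(y) = tanh(y)). It is a minimal illustration under these assumptions, not the full FastICA algorithm as published; the data are first whitened so that the vectors b_i can be kept orthonormal.

    import numpy as np

    def ica_deflation(x, n_components, n_iter=200, seed=0):
        # Whiten the centered data so that E[w w^T] = I (decorrelation plus scaling).
        x = x - x.mean(axis=1, keepdims=True)
        d, E = np.linalg.eigh(np.cov(x))
        w = (E / np.sqrt(d)).T @ x                  # whitened signals

        rng = np.random.default_rng(seed)
        B = np.zeros((n_components, w.shape[0]))
        for i in range(n_components):
            b = rng.standard_normal(w.shape[0])
            b /= np.linalg.norm(b)
            for _ in range(n_iter):
                y = b @ w
                # Fixed-point update with g(y) = tanh(y), g'(y) = 1 - tanh(y)^2.
                b_new = (w * np.tanh(y)).mean(axis=1) - (1 - np.tanh(y) ** 2).mean() * b
                b_new -= B[:i].T @ (B[:i] @ b_new)  # decorrelate from the components found so far
                b = b_new / np.linalg.norm(b_new)
            B[i] = b
        return B @ w                                # estimated independent source signals s_i(n)

    # Invented example: two nongaussian sources mixed into two observed channels.
    n = np.arange(2000)
    s = np.vstack([np.sign(np.sin(2 * np.pi * 0.017 * n)),            # square wave
                   np.random.default_rng(1).uniform(-1, 1, n.size)])  # uniform noise
    x = np.array([[1.0, 0.7], [0.4, 1.1]]) @ s
    s_hat = ica_deflation(x, n_components=2)
    print(s_hat.shape)   # (2, 2000): recovered sources (up to sign, scaling and ordering)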

5.3 Literature

Principal component analysis is a classical technique for data analysis and data reduction. A comprehensive guide to PCA can be found in the book by Jackson [1]. The eigenface procedure was originally developed in computer vision applications [2]. The study of independent component analysis started in the eighties and nineties to overcome the restrictions of PCA for blind source signal separation, and it seems to have been first precisely formulated by Comon [3]. The FastICA algorithm was developed at the Helsinki University of Technology [4]. A tutorial on ICA can be found in [5], and a comprehensive treatment of ICA is given in the book [6]. Besides signal processing problems, both PCA and ICA can be applied to multivariate statistical process quality control and supervision [7].

Bibliography

[1] Jackson, J. E., A User's Guide to Principal Components, Wiley, 2003.

[2] Turk, M., and A. Pentland, 'Face recognition using eigenfaces', Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1991, pp. 586–591.

[3] Comon, P., 'Independent component analysis, A new concept?', Signal Processing 36 (1994), pp. 287–314.

[4] Hyvärinen, A. and E. Oja, 'A fast fixed-point algorithm for independent component analysis', Neural Computation 9 (1997), pp. 1483–1492.

[5] Hyvärinen, A. and E. Oja, 'Independent component analysis: algorithms and applications', Neural Networks 13 (2000), pp. 411–430.

[6] Hyvärinen, A., J. Karhunen and E. Oja, Independent Component Analysis, Wiley, 2001.

[7] Abrahamsson, J., Statistical Process Control of Multivariate Processes (in Swedish), MSc Thesis, Process Control Laboratory, Åbo Akademi, 2007.
