Independent Component Analysis: A Tutorial

Aapo Hyvärinen and Erkki Oja

Helsinki University of Technology

Laboratory of Computer and Information Science

P.O. Box 5400, FIN-02015 HUT, Espoo, Finland

aapo.hyvarinen@hut.fi, erkki.oja@hut.fi

http://www.cis.hut.fi/projects/ica/

A version of this paper will appear in Neural Networks
with the title "Independent Component Analysis: Algorithms and Applications"

April 1999

1 Motivation

Imagine that you are in a room where two people are speaking simultaneously. You have two microphones, which you hold in different locations. The microphones give you two recorded time signals, which we could denote by x_1(t) and x_2(t), with x_1 and x_2 the amplitudes, and t the time index. Each of these recorded signals is a weighted sum of the speech signals emitted by the two speakers, which we denote by s_1(t) and s_2(t). We could express this as a linear equation:

x_1(t) = a_{11} s_1(t) + a_{12} s_2(t)
x_2(t) = a_{21} s_1(t) + a_{22} s_2(t)

where a_{11}, a_{12}, a_{21}, and a_{22} are some parameters that depend on the distances of the microphones from the speakers. It would be very useful if you could now estimate the two original speech signals s_1(t) and s_2(t), using only the recorded signals x_1(t) and x_2(t). This is called the cocktail-party problem. For the time being, we omit any time delays or other extra factors from our simplified mixing model.
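To make the mixing model concrete, the following short sketch (Python/NumPy) generates two source waveforms, mixes them with a 2x2 matrix, and produces the two "microphone" signals. The waveforms and the mixing coefficients are arbitrary choices for illustration; they are not the ones used for the figures of this paper.

    import numpy as np

    t = np.linspace(0, 1, 200)                 # time index
    s1 = np.sign(np.sin(2 * np.pi * 5 * t))    # a square-wave-like source (arbitrary)
    s2 = np.sin(2 * np.pi * 3 * t)             # a sinusoidal source (arbitrary)
    S = np.vstack([s1, s2])                    # sources, shape (2, 200)

    A = np.array([[0.6, 0.4],                  # hypothetical mixing coefficients a_ij
                  [0.5, 0.9]])

    X = A @ S                                  # x_i(t) = a_i1 s_1(t) + a_i2 s_2(t)
    # X[0] and X[1] are the two recorded mixtures x_1(t) and x_2(t).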

As an illustration, consider the waveforms in Fig. 1 and Fig. 2. These are, of course, not realistic speech signals, but suffice for this illustration. The original speech signals could look something like those in Fig. 1 and the mixed signals could look like those in Fig. 2. The problem is to recover the data in Fig. 1 using only the data in Fig. 2.

Actually, if we knew the parameters a_{ij}, we could solve the linear equations above by classical methods. The point is, however, that if you don't know the a_{ij}, the problem is considerably more difficult.

One approach to solving this problem would be to use some information on the statistical properties of the signals s_i(t) to estimate the a_{ij}. Actually, and perhaps surprisingly, it turns out that it is enough to assume that s_1(t) and s_2(t), at each time instant t, are statistically independent. This is not an unrealistic assumption in many cases, and it need not be exactly true in practice. The recently developed technique of Independent Component Analysis, or ICA, can be used to estimate the a_{ij} based on the information of their independence, which allows us to separate the two original source signals s_1(t) and s_2(t) from their mixtures x_1(t) and x_2(t). Fig. 3 gives the two signals estimated by the ICA method. As can be seen, these are very close to the original source signals (their signs are reversed, but this has no significance).

Independent component analysis was originally developed to deal with problems that are closely related to the cocktail-party problem. Since the recent increase of interest in ICA, it has become clear that this principle has a lot of other interesting applications as well.


Figure 1: The original signals.


Figure 2: The observed mixtures of the source signals in Fig. 1.


Figure 3: The estimates of the original source signals, estimated using only the observed signals in Fig. 2. The original signals were very accurately estimated, up to multiplicative signs.

Consider, for example, electrical recordings of brain activity as given by an electroencephalogram (EEG). The EEG data consists of recordings of electrical potentials in many different locations on the scalp. These potentials are presumably generated by mixing some underlying components of brain activity. This situation is quite similar to the cocktail-party problem: we would like to find the original components of brain activity, but we can only observe mixtures of the components. ICA can reveal interesting information on brain activity by giving access to its independent components.

Another, very different application of ICA is feature extraction. A fundamental problem in digital signal processing is to find suitable representations for image, audio or other kinds of data for tasks like compression and denoising. Data representations are often based on (discrete) linear transformations. Standard linear transformations widely used in image processing are the Fourier, Haar and cosine transforms, etc. Each of them has its own favorable properties.

It would be most useful to estimate the linear transformation from the data itself, in which case the transform could be ideally adapted to the kind of data that is being processed. Figure 4 shows the basis functions obtained by ICA from patches of natural images. Each image window in the set of training images would be a superposition of these windows, so that the coefficients in the superposition are independent. Feature extraction by ICA will be explained in more detail later on.

All of the applications described above can actually be formulated in a unified mathematical framework, that of ICA. This is a very general-purpose method of signal processing and data analysis.

In this review, we cover the definition and underlying principles of ICA in Sections 2 and 3. Then, starting from Section 4, the ICA problem is solved on the basis of minimizing or maximizing certain contrast functions; this transforms the ICA problem into a numerical optimization problem. Many contrast functions are given and the relations between them are clarified. Section 5 covers a useful preprocessing that greatly helps solving the ICA problem, and Section 6 reviews one of the most efficient practical learning rules for solving the problem, the FastICA algorithm. Then, in Section 7, typical applications of ICA are covered: removing artefacts from brain signal recordings, finding hidden factors in financial time series, and reducing noise in natural images. Section 8 concludes the text.

Figure 4: Basis functions in ICA of natural images, estimated from small square image patches. These basis functions can be considered as the independent features of images.

2 Independent Component Analysis

Definition of ICA

To rigorously define ICA, we can use a statistical "latent variables" model. Assume that we observe n linear mixtures x_1, ..., x_n of n independent components:

x_j = a_{j1} s_1 + a_{j2} s_2 + ... + a_{jn} s_n,   for all j.

We have now dropped the time index t; in the ICA model, we assume that each mixture x_j as well as each independent component s_k is a random variable, instead of a proper time signal. The observed values x_j(t), e.g. the microphone signals in the cocktail-party problem, are then a sample of this random variable. Without loss of generality, we can assume that both the mixture variables and the independent components have zero mean: If this is not true, then the observable variables x_i can always be centered by subtracting the sample mean, which makes the model zero-mean.

It is convenient to use vector-matrix notation instead of the sums as in the previous equation. Let us denote by x the random vector whose elements are the mixtures x_1, ..., x_n, and likewise by s the random vector with elements s_1, ..., s_n. Let us denote by A the matrix with elements a_{ij}. Generally, bold lower-case letters indicate vectors and bold upper-case letters denote matrices. All vectors are understood as column vectors; thus x^T, the transpose of x, is a row vector. Using this vector-matrix notation, the above mixing model is written as

x = A s.

Sometimes we need the columns of the matrix A; denoting them by a_j, the model can also be written as

x = Σ_{i=1}^{n} a_i s_i.

The statistical model x = As is called independent component analysis, or the ICA model. The ICA model is a generative model, which means that it describes how the observed data are generated by a process of mixing the components s_i. The independent components are latent variables, meaning that they cannot be directly observed. Also the mixing matrix is assumed to be unknown. All we observe is the random vector x, and we must estimate both A and s using it. This must be done under as general assumptions as possible.

The starting point for ICA is the very simple assumption that the components s_i are statistically independent. Statistical independence will be rigorously defined in Section 3. It will be seen below that we must also assume that the independent components have nongaussian distributions. However, in the basic model we do not assume these distributions known (if they are known, the problem is considerably simplified). For simplicity, we are also assuming that the unknown mixing matrix is square, but this assumption can sometimes be relaxed, as explained in Section 4. Then, after estimating the matrix A, we can compute its inverse, say W, and obtain the independent components simply by

s = W x.

ICA is very closely related to the method called blind source separation (BSS) or blind signal separation. A "source" means here an original signal, i.e. an independent component, like the speaker in the cocktail-party problem. "Blind" means that we know very little, if anything, about the mixing matrix, and make few assumptions on the source signals. ICA is one method, perhaps the most widely used, for performing blind source separation.

In many applications, it would be more realistic to assume that there is some noise in the measurements, which would mean adding a noise term to the model. For simplicity, we omit any noise terms, since the estimation of the noise-free model is difficult enough in itself, and seems to be sufficient for many applications.

Ambiguities of ICA

In the ICA model x = As, it is easy to see that the following ambiguities will hold.

1. We cannot determine the variances (energies) of the independent components.

The reason is that, both s and A being unknown, any scalar multiplier in one of the sources s_i could always be cancelled by dividing the corresponding column a_i of A by the same scalar (see the sum form of the model above). As a consequence, we may quite as well fix the magnitudes of the independent components; as they are random variables, the most natural way to do this is to assume that each has unit variance: E{s_i^2} = 1. Then the matrix A will be adapted in the ICA solution methods to take into account this restriction. Note that this still leaves the ambiguity of the sign: we could multiply an independent component by -1 without affecting the model. This ambiguity is, fortunately, insignificant in most applications.

2. We cannot determine the order of the independent components.

The reason is that, again both s and A being unknown, we can freely change the order of the terms in the sum x = Σ_i a_i s_i, and call any of the independent components the first one. Formally, a permutation matrix P and its inverse can be substituted into the model to give x = A P^{-1} P s. The elements of Ps are the original independent variables s_j, but in another order. The matrix A P^{-1} is just a new unknown mixing matrix, to be solved by the ICA algorithms.

Illustration of ICA

To illustrate the ICA model in statistical terms, consider two independent components that have the following uniform distributions:

p(s_i) = 1/(2√3)   if |s_i| ≤ √3
p(s_i) = 0         otherwise.

The range of values for this uniform distribution was chosen so as to make the mean zero and the variance equal to one, as was agreed above. The joint density of s_1 and s_2 is then uniform on a square. This follows from the basic definition that the joint density of two independent variables is just the product of their marginal densities (see the definition of independence in Section 3): we need simply compute the product. The joint density is illustrated in Figure 5 by showing data points randomly drawn from this distribution.

Now let us mix these two independent components. Let us take a mixing matrix A with some fixed numerical values.

This gives us two mixed variables, x_1 and x_2. It is easily computed that the mixed data has a uniform distribution on a parallelogram, as shown in Figure 6. Note that the random variables x_1 and x_2 are not independent any more; an easy way to see this is to consider whether it is possible to predict the value of one of them, say x_2, from the value of the other. Clearly, if x_1 attains one of its maximum or minimum values, then this completely determines the value of x_2. They are therefore not independent. (For the variables s_1 and s_2 the situation is different: from Fig. 5 it can be seen that knowing the value of s_1 does not in any way help in guessing the value of s_2.)

The problem of estimating the data model of ICA is now to estimate the mixing matrix A using only information contained in the mixtures x_1 and x_2. Actually, you can see from Figure 6 an intuitive way of estimating A: the edges of the parallelogram are in the directions of the columns of A. This means that we could, in principle, estimate the ICA model by first estimating the joint density of x_1 and x_2, and then locating the edges. So, the problem seems to have a solution.

In reality, however, this would be a very poor method, because it only works with variables that have exactly uniform distributions. Moreover, it would be computationally quite complicated. What we need is a method that works for any distributions of the independent components, and works fast and reliably.

Next we shall consider the exact definition of independence before starting to develop methods for estimation of the ICA model.
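As a small numerical companion to this illustration, the sketch below (Python/NumPy) draws points from the uniform distribution above and mixes them. The mixing matrix values are hypothetical, since the ones used for the figures are not reproduced here. The sources come out uncorrelated with unit variances, while the mixtures become clearly correlated.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10000
    # Uniform on [-sqrt(3), sqrt(3)]: zero mean, unit variance
    S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))

    A = np.array([[2.0, 3.0],      # hypothetical mixing matrix for the illustration
                  [2.0, 1.0]])
    X = A @ S                      # mixed data, uniform on a parallelogram

    print(np.cov(S))               # close to the identity matrix
    print(np.cov(X))               # clearly non-diagonal: x_1 and x_2 are correlated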

Figure 5: The joint distribution of the independent components s_1 and s_2 with uniform distributions. Horizontal axis: s_1; vertical axis: s_2.

Figure 6: The joint distribution of the observed mixtures x_1 and x_2. Horizontal axis: x_1; vertical axis: x_2.

3 What is independence?

Definition and fundamental properties

To define the concept of independence, consider two scalar-valued random variables y_1 and y_2. Basically, the variables y_1 and y_2 are said to be independent if information on the value of y_1 does not give any information on the value of y_2, and vice versa. Above, we noted that this is the case with the variables s_1, s_2 but not with the mixture variables x_1, x_2.

Technically, independence can be defined by the probability densities. Let us denote by p(y_1, y_2) the joint probability density function (pdf) of y_1 and y_2. Let us further denote by p_1(y_1) the marginal pdf of y_1, i.e. the pdf of y_1 when it is considered alone:

p_1(y_1) = ∫ p(y_1, y_2) dy_2,

and similarly for y_2. Then we define that y_1 and y_2 are independent if and only if the joint pdf is factorizable in the following way:

p(y_1, y_2) = p_1(y_1) p_2(y_2).

This definition extends naturally for any number n of random variables, in which case the joint density must be a product of n terms.

The definition can be used to derive a most important property of independent random variables. Given two functions h_1 and h_2, we always have

E{h_1(y_1) h_2(y_2)} = E{h_1(y_1)} E{h_2(y_2)}.

This can be proven as follows:

E{h_1(y_1) h_2(y_2)} = ∫∫ h_1(y_1) h_2(y_2) p(y_1, y_2) dy_1 dy_2
                     = ∫∫ h_1(y_1) p_1(y_1) h_2(y_2) p_2(y_2) dy_1 dy_2
                     = ∫ h_1(y_1) p_1(y_1) dy_1 ∫ h_2(y_2) p_2(y_2) dy_2
                     = E{h_1(y_1)} E{h_2(y_2)}.

Uncorrelated variables are only partly independent

A weaker form of independence is uncorrelatedness. Two random variables y_1 and y_2 are said to be uncorrelated if their covariance is zero:

E{y_1 y_2} - E{y_1} E{y_2} = 0.

If the variables are independent, they are uncorrelated, which follows directly from the product property above, taking h_1(y_1) = y_1 and h_2(y_2) = y_2.

On the other hand, uncorrelatedness does not imply independence. For example, assume that y_1 and y_2 are discrete-valued and follow such a distribution that the pair takes, each with probability 1/4, one of the values (0, 1), (0, -1), (1, 0), (-1, 0). Then y_1 and y_2 are uncorrelated, as can be simply calculated. On the other hand,

E{y_1^2 y_2^2} = 0 ≠ 1/4 = E{y_1^2} E{y_2^2},

so the product property above is violated, and the variables cannot be independent.

Since independence implies uncorrelatedness, many ICA methods constrain the estimation procedure so that it always gives uncorrelated estimates of the independent components. This reduces the number of free parameters, and simplifies the problem.
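The discrete example above is easy to verify numerically; the following sketch (Python/NumPy) samples the four equiprobable value pairs and checks that the correlation vanishes while the product property needed for independence fails.

    import numpy as np

    rng = np.random.default_rng(1)
    pairs = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])  # equiprobable values of (y1, y2)
    idx = rng.integers(0, 4, size=100000)
    y1, y2 = pairs[idx, 0], pairs[idx, 1]

    print(np.mean(y1 * y2))                    # 0: the variables are uncorrelated
    print(np.mean(y1**2 * y2**2))              # exactly 0 here
    print(np.mean(y1**2) * np.mean(y2**2))     # ~0.25, so E{y1^2 y2^2} != E{y1^2}E{y2^2}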

Figure 7: The multivariate distribution of two independent gaussian variables.

Why Gaussian variables are forbidden

The fundamental restriction in ICA is that the independent components must be nongaussian for ICA to be possible.

To see why gaussian variables make ICA impossible, assume that the mixing matrix is orthogonal and the s_i are gaussian. Then x_1 and x_2 are gaussian, uncorrelated, and of unit variance. Their joint density is given by

p(x_1, x_2) = (1 / (2π)) exp( -(x_1^2 + x_2^2) / 2 ).



This distribution is illustrated in Fig. 7. The figure shows that the density is completely symmetric. Therefore, it does not contain any information on the directions of the columns of the mixing matrix A. This is why A cannot be estimated.

More rigorously, one can prove that the distribution of any orthogonal transformation of the gaussian (x_1, x_2) has exactly the same distribution as (x_1, x_2), and that x_1 and x_2 are independent. Thus, in the case of gaussian variables, we can only estimate the ICA model up to an orthogonal transformation. In other words, the matrix A is not identifiable for gaussian independent components. (Actually, if just one of the independent components is gaussian, the ICA model can still be estimated.)

4 Principles of ICA estimation

"Nongaussian is independent"

Intuitively speaking, the key to estimating the ICA model is nongaussianity. Actually, without nongaussianity the estimation is not possible at all, as mentioned in Section 3. This is at the same time probably the main reason for the rather late resurgence of ICA research: in most of classical statistical theory, random variables are assumed to have gaussian distributions, thus precluding any methods related to ICA.

The Central Limit Theorem, a classical result in probability theory, tells that the distribution of a sum of independent random variables tends toward a gaussian distribution, under certain conditions. Thus, a sum of two independent random variables usually has a distribution that is closer to gaussian than either of the two original random variables.

Let us now assume that the data vector x is distributed according to the ICA data model, i.e. it is a mixture of independent components. For simplicity, let us assume in this section that all the independent components have identical distributions. To estimate one of the independent components, we consider a linear combination of the x_i; let us denote this by y = w^T x = Σ_i w_i x_i, where w is a vector to be determined. If w were one of the rows of the inverse of A, this linear combination would actually equal one of the independent components. The question is now: How could we use the Central Limit Theorem to determine w so that it would equal one of the rows of the inverse of A? In practice, we cannot determine such a w exactly, because we have no knowledge of the matrix A, but we can find an estimator that gives a good approximation.

To see how this leads to the basic principle of ICA estimation, let us make a change of variables, defining z = A^T w. Then we have y = w^T x = w^T A s = z^T s. y is thus a linear combination of the s_i, with weights given by z_i. Since a sum of even two independent random variables is more gaussian than the original variables, z^T s is more gaussian than any of the s_i, and becomes least gaussian when it in fact equals one of the s_i. In this case, obviously, only one of the elements z_i of z is nonzero. (Note that the s_i were here assumed to have identical distributions.)

Therefore, we could take as w a vector that maximizes the nongaussianity of w^T x. Such a vector would necessarily correspond (in the transformed coordinate system) to a z which has only one nonzero component. This means that w^T x = z^T s equals one of the independent components!

Maximizing the nongaussianity of w^T x thus gives us one of the independent components. In fact, the optimization landscape for nongaussianity in the n-dimensional space of vectors w has 2n local maxima, two for each independent component, corresponding to s_i and -s_i (recall that the independent components can be estimated only up to a multiplicative sign). To find several independent components, we need to find all these local maxima. This is not difficult, because the different independent components are uncorrelated: we can always constrain the search to the space that gives estimates uncorrelated with the previous ones. This corresponds to orthogonalization in a suitably transformed (i.e. whitened) space.

Our approach here is rather heuristic, but it will be seen in the next subsection and in the subsection on mutual information below that it has a perfectly rigorous justification.

Measures of nongaussianity

To use nongaussianity in ICA estimation, we must have a quantitative measure of nongaussianity of a random variable, say y. To simplify things, let us assume that y is centered (zero-mean) and has variance equal to one. Actually, one of the functions of preprocessing in ICA algorithms, to be covered in Section 5, is to make this simplification possible.

Kurtosis

The classical measure of nongaussianity is kurtosis, or the fourth-order cumulant. The kurtosis of y is classically defined by

kurt(y) = E{y^4} - 3 (E{y^2})^2.

Actually, since we assumed that y is of unit variance, the right-hand side simplifies to E{y^4} - 3. This shows that kurtosis is simply a normalized version of the fourth moment E{y^4}. For a gaussian y, the fourth moment equals 3 (E{y^2})^2. Thus, kurtosis is zero for a gaussian random variable. For most (but not quite all) nongaussian random variables, kurtosis is nonzero.

Kurtosis can be both positive and negative. Random variables that have negative kurtosis are called subgaussian, and those with positive kurtosis are called supergaussian. In statistical literature, the corresponding expressions platykurtic and leptokurtic are also used. Supergaussian random variables typically have a "spiky" pdf with heavy tails, i.e. the pdf is relatively large at zero and at large values of the variable, while being small for intermediate values. A typical example is the Laplace distribution, whose pdf (normalized to unit variance) is given by

p(y) = (1/√2) exp(-√2 |y|).

This pdf is illustrated in Fig. 8. Subgaussian random variables, on the other hand, typically have a "flat" pdf, which is rather constant near zero and very small for larger values of the variable. A typical example is the uniform distribution used in the illustration of Section 2.

Figure 8: The density function of the Laplace distribution, which is a typical supergaussian distribution. For comparison, the gaussian density is given by a dashed line. Both densities are normalized to unit variance.

Typically, nongaussianity is measured by the absolute value of kurtosis. The square of kurtosis can also be used. These are zero for a gaussian variable, and greater than zero for most nongaussian random variables. There are nongaussian random variables that have zero kurtosis, but they can be considered as very rare.

Kurtosis, or rather its absolute value, has been widely used as a measure of nongaussianity in ICA and related fields. The main reason is its simplicity, both computational and theoretical. Computationally, kurtosis can be estimated simply by using the fourth moment of the sample data. Theoretical analysis is simplified because of the following linearity property: if x_1 and x_2 are two independent random variables, it holds that

kurt(x_1 + x_2) = kurt(x_1) + kurt(x_2)

and

kurt(α x_1) = α^4 kurt(x_1),

where α is a scalar. These properties can be easily proven using the definition.
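As a quick sanity check of these properties, the sketch below (Python/NumPy) estimates kurtosis from samples via kurt(y) = E{y^4} - 3(E{y^2})^2 and verifies, up to sampling error, the additivity for independent variables and the scaling by α^4.

    import numpy as np

    def kurt(y):
        # Sample kurtosis: E{y^4} - 3 (E{y^2})^2
        return np.mean(y**4) - 3 * np.mean(y**2)**2

    rng = np.random.default_rng(2)
    n = 200000
    x1 = rng.laplace(size=n)                             # supergaussian: kurtosis > 0
    x2 = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)    # subgaussian: kurtosis < 0

    print(kurt(rng.normal(size=n)))            # ~0 for a gaussian
    print(kurt(x1), kurt(x2))
    print(kurt(x1 + x2), kurt(x1) + kurt(x2))  # additivity for independent variables
    print(kurt(2 * x1), 16 * kurt(x1))         # kurt(alpha x) = alpha^4 kurt(x)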

To illustrate in a simple example what the optimization landscape for kurtosis looks like, and how independent components could be found by kurtosis minimization or maximization, let us look at a 2-dimensional model x = As. Assume that the independent components s_1, s_2 have kurtosis values kurt(s_1), kurt(s_2), respectively, both different from zero. Remember that we assumed that they have unit variances. We seek one of the independent components as y = w^T x.

Let us again make the transformation z = A^T w. Then we have y = w^T x = w^T A s = z^T s = z_1 s_1 + z_2 s_2. Now, based on the additive property of kurtosis, we have kurt(y) = kurt(z_1 s_1) + kurt(z_2 s_2) = z_1^4 kurt(s_1) + z_2^4 kurt(s_2). On the other hand, we made the constraint that the variance of y is equal to 1, based on the same assumption concerning s_1, s_2. This implies a constraint on z: E{y^2} = z_1^2 + z_2^2 = 1. Geometrically, this means that the vector z is constrained to the unit circle on the 2-dimensional plane. The optimization problem is now: what are the maxima of the function |kurt(y)| = |z_1^4 kurt(s_1) + z_2^4 kurt(s_2)| on the unit circle? For simplicity, you may consider that the kurtoses are of the same sign, in which case the absolute value operators can be omitted. The graph of this function is the "optimization landscape" for the problem.

It is not hard to show that the maxima are at the points where exactly one of the elements of the vector z is zero and the other nonzero; because of the unit circle constraint, the nonzero element must be equal to 1 or -1. But these points are exactly the ones where y equals one of the independent components ±s_i, and the problem has been solved.

In practice, we would start from some weight vector w, compute the direction in which the kurtosis of y = w^T x is growing most strongly (if kurtosis is positive) or decreasing most strongly (if kurtosis is negative), based on the available sample x(1), ..., x(T) of the mixture vector x, and use a gradient method or one of its extensions for finding a new vector w. The example can be generalized to arbitrary dimensions, showing that kurtosis can theoretically be used as an optimization criterion for the ICA problem.

However, kurtosis also has some drawbacks in practice, when its value has to be estimated from a measured sample. The main problem is that kurtosis can be very sensitive to outliers. Its value may depend on only a few observations in the tails of the distribution, which may be erroneous or irrelevant observations. In other words, kurtosis is not a robust measure of nongaussianity.

Thus, other measures of nongaussianity might be better than kurtosis in some situations. Below we shall consider negentropy, whose properties are rather opposite to those of kurtosis, and finally introduce approximations of negentropy that more or less combine the good properties of both measures.

Negentropy

A second very important measure of nongaussianity is given by negentropy. Negentropy is based on the information-theoretic quantity of (differential) entropy.

Entropy is the basic concept of information theory. The entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more "random", i.e. unpredictable and unstructured the variable is, the larger its entropy. More rigorously, entropy is closely related to the coding length of the random variable; in fact, under some simplifying assumptions, entropy is the coding length of the random variable. For introductions on information theory, see, e.g., standard textbooks.

Entropy H is defined for a discrete random variable Y as

H(Y) = -Σ_i P(Y = a_i) log P(Y = a_i),

where the a_i are the possible values of Y. This very well-known definition can be generalized for continuous-valued random variables and vectors, in which case it is often called differential entropy. The differential entropy H of a random vector y with density f(y) is defined as

H(y) = -∫ f(y) log f(y) dy.

A fundamental result of information theory is that a gaussian variable has the largest entropy among all random variables of equal variance. (For a proof, see, e.g., the references.) This means that entropy could be used as a measure of nongaussianity. In fact, this shows that the gaussian distribution is the "most random" or the least structured of all distributions. Entropy is small for distributions that are clearly concentrated on certain values, i.e., when the variable is clearly clustered, or has a pdf that is very "spiky".

To obtain a measure of nongaussianity that is zero for a gaussian variable and always nonnegative, one often uses a slightly modified version of the definition of differential entropy, called negentropy. Negentropy J is defined as follows:

J(y) = H(y_gauss) - H(y),

where y_gauss is a Gaussian random variable of the same covariance matrix as y. Due to the above-mentioned properties, negentropy is always nonnegative, and it is zero if and only if y has a Gaussian distribution. Negentropy has the additional interesting property that it is invariant for invertible linear transformations.

The advantage of using negentropy, or, equivalently, differential entropy, as a measure of nongaussianity is that it is well justified by statistical theory. In fact, negentropy is in some sense the optimal estimator of nongaussianity, as far as statistical properties are concerned. The problem in using negentropy is, however, that it is computationally very difficult. Estimating negentropy using the definition would require an estimate (possibly nonparametric) of the pdf. Therefore, simpler approximations of negentropy are very useful, as will be discussed next.

Approximations of negentropy

The estimation of negentropy is difficult, as mentioned above, and therefore this contrast function remains mainly a theoretical one. In practice, some approximations have to be used. Here we introduce approximations that have very promising properties, and which will be used in the following to derive an efficient method for ICA.

The classical method of approximating negentropy is using higher-order moments, for example as follows:

J(y) ≈ (1/12) E{y^3}^2 + (1/48) kurt(y)^2.

The random variable y is assumed to be of zero mean and unit variance. However, the validity of such approximations may be rather limited. In particular, these approximations suffer from the nonrobustness encountered with kurtosis.

To avoid the problems encountered with the preceding approximations of negentropy, new approximations were developed, based on the maximum-entropy principle. In general we obtain the following approximation:

J(y) ≈ Σ_{i=1}^{p} k_i [ E{G_i(y)} - E{G_i(ν)} ]^2,

where the k_i are some positive constants, and ν is a Gaussian variable of zero mean and unit variance (i.e., standardized). The variable y is assumed to be of zero mean and unit variance, and the functions G_i are some nonquadratic functions. Note that even in cases where this approximation is not very accurate, it can be used to construct a measure of nongaussianity that is consistent in the sense that it is always nonnegative, and equal to zero if y has a Gaussian distribution.

In the case where we use only one nonquadratic function G, the approximation becomes

J(y) ∝ [ E{G(y)} - E{G(ν)} ]^2

for practically any nonquadratic function G. This is clearly a generalization of the moment-based approximation above, if y is symmetric. Indeed, taking G(y) = y^4, one then obtains exactly a kurtosis-based approximation.

But the point here is that by choosing G wisely, one obtains approximations of negentropy that are much better than the moment-based one. In particular, choosing a G that does not grow too fast, one obtains more robust estimators. The following choices of G have proved very useful:

G_1(u) = (1/a_1) log cosh(a_1 u),    G_2(u) = -exp(-u^2/2),

where a_1 is some suitable constant.

Thus we obtain approximations of negentropy that give a very good compromise between the properties of the two classical nongaussianity measures given by kurtosis and negentropy. They are conceptually simple, fast to compute, yet have appealing statistical properties, especially robustness. Therefore, we shall use these contrast functions in our ICA methods. Since kurtosis can be expressed in this same framework, it can still be used by our ICA methods. A practical algorithm based on these contrast functions will be presented in Section 6.
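A minimal sketch of this one-function approximation (Python/NumPy), using G_1(u) = (1/a_1) log cosh(a_1 u) and estimating the expectation over the standardized gaussian ν by sampling; the proportionality constant is left out, since only relative comparisons matter here:

    import numpy as np

    def negentropy_approx(y, a1=1.0, rng=np.random.default_rng(3), n_gauss=100000):
        # J(y) is proportional to (E{G(y)} - E{G(nu)})^2 with G(u) = log(cosh(a1*u))/a1.
        y = (y - y.mean()) / y.std()              # standardize to zero mean, unit variance
        G = lambda u: np.log(np.cosh(a1 * u)) / a1
        nu = rng.normal(size=n_gauss)             # standardized gaussian reference
        return (G(y).mean() - G(nu).mean())**2

    rng = np.random.default_rng(4)
    print(negentropy_approx(rng.normal(size=50000)))   # close to zero for a gaussian
    print(negentropy_approx(rng.laplace(size=50000)))  # clearly positive for a nongaussian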

Minimization of Mutual Information

Another approach for ICA estimation, inspired by information theory, is minimization of mutual information. We will explain this approach here, and show that it leads to the same principle of finding most nongaussian directions as was described above. In particular, this approach gives a rigorous justification for the heuristic principles used above.

Mutual Information

Using the concept of differential entropy, we define the mutual information I between m (scalar) random variables y_i, i = 1, ..., m, as follows:

I(y_1, y_2, ..., y_m) = Σ_{i=1}^{m} H(y_i) - H(y).

Mutual information is a natural measure of the dependence between random variables. In fact, it is equivalent to the well-known Kullback-Leibler divergence between the joint density f(y) and the product of its marginal densities, which is a very natural measure of independence. It is always nonnegative, and zero if and only if the variables are statistically independent. Thus, mutual information takes into account the whole dependence structure of the variables, and not only the covariance, like PCA and related methods.

Mutual information can be interpreted by using the interpretation of entropy as code length. The terms H(y_i) give the lengths of codes for the y_i when these are coded separately, and H(y) gives the code length when y is coded as a random vector, i.e. all the components are coded in the same code. Mutual information thus shows what code length reduction is obtained by coding the whole vector instead of the separate components. In general, better codes can be obtained by coding the whole vector. However, if the y_i are independent, they give no information on each other, and one could just as well code the variables separately without increasing code length.

An important property of mutual information is that we have, for an invertible linear transformation y = Wx:

I(y_1, y_2, ..., y_n) = Σ_i H(y_i) - H(x) - log |det W|.

Now, let us consider what happens if we constrain the y_i to be uncorrelated and of unit variance. This means E{y y^T} = W E{x x^T} W^T = I, which implies

det I = 1 = det(W E{x x^T} W^T) = (det W)(det E{x x^T})(det W^T),

and this implies that det W must be constant. Moreover, for y_i of unit variance, entropy and negentropy differ only by a constant and the sign. Thus we obtain

I(y_1, y_2, ..., y_n) = C - Σ_i J(y_i),

where C is a constant that does not depend on W. This shows the fundamental relation between negentropy and mutual information.

Defining ICA by Mutual Information

Since mutual information is the natural information-theoretic measure of the independence of random variables, we could use it as the criterion for finding the ICA transform. In this approach, which is an alternative to the model estimation approach, we define the ICA of a random vector x as an invertible transformation s = Wx, where the matrix W is determined so that the mutual information of the transformed components s_i is minimized.

It is now obvious from the preceding relation that finding an invertible transformation W that minimizes the mutual information is roughly equivalent to finding directions in which the negentropy is maximized. More precisely, it is roughly equivalent to finding 1-D subspaces such that the projections in those subspaces have maximum negentropy. Rigorously speaking, the relation above shows that ICA estimation by minimization of mutual information is equivalent to maximizing the sum of nongaussianities of the estimates, when the estimates are constrained to be uncorrelated. The constraint of uncorrelatedness is in fact not necessary, but simplifies the computations considerably, as one can then use the simpler form involving negentropies instead of the more complicated general form.

Thus we see that the formulation of ICA as minimization of mutual information gives another rigorous justification of our more heuristically introduced idea of finding maximally nongaussian directions.

Maximum Likelihood Estimation

The likelihood

A very popular approach for estimating the ICA model is maximum likelihood estimation, which is closely connected to the infomax principle. Here we discuss this approach, and show that it is essentially equivalent to minimization of mutual information.

It is possible to formulate directly the likelihood in the noise-free ICA model, and then estimate the model by a maximum likelihood method. Denoting by W = (w_1, ..., w_n)^T the matrix A^{-1}, the log-likelihood takes the form

L = Σ_{t=1}^{T} Σ_{i=1}^{n} log f_i(w_i^T x(t)) + T log |det W|,

where the f_i are the density functions of the s_i (here assumed to be known), and the x(t), t = 1, ..., T, are the realizations of x. The term log |det W| in the likelihood comes from the classic rule for (linearly) transforming random variables and their densities: in general, for any random vector x with density p_x and for any matrix W, the density of y = Wx is given by p_x(W^{-1} y) |det W|^{-1}.

The Infomax Principle

Another related contrast function was derived from a neural network viewpoint. This was based on maximizing the output entropy (or information flow) of a neural network with nonlinear outputs. Assume that x is the input to the neural network whose outputs are of the form g_i(w_i^T x), where the g_i are some nonlinear scalar functions, and the w_i are the weight vectors of the neurons. One then wants to maximize the entropy of the outputs:

L_2 = H( g_1(w_1^T x), ..., g_n(w_n^T x) ).

If the g_i are well chosen, this framework also enables the estimation of the ICA model. Indeed, several authors proved the surprising result that the principle of network entropy maximization, or "infomax", is equivalent to maximum likelihood estimation. This equivalence requires that the nonlinearities g_i used in the neural network are chosen as the cumulative distribution functions corresponding to the densities f_i, i.e., g_i'(.) = f_i(.).

Connection to mutual information

To see the connection between likelihood and mutual information, consider the expectation of the log-likelihood:

(1/T) E{L} = Σ_{i=1}^{n} E{log f_i(w_i^T x)} + log |det W|.

Figure 9: An illustration of projection pursuit and the "interestingness" of nongaussian projections. The data in this figure is clearly divided into two clusters. However, the principal component, i.e. the direction of maximum variance, would be vertical, providing no separation between the clusters. In contrast, the strongly nongaussian projection pursuit direction is horizontal, providing optimal separation of the clusters.

T

Actually if the f were equal to the actual distributions of the rst term would be equal to w x

i

i

P

T

Thus the likeliho o d would b e equal up to an additive constant to the negativeofmutual H x w

i

i

information as given in Eq

Actually in practice the connection is even stronger This is because in practice we dont know the

distributions of the indep endent comp onents A reasonable approachwould b e to estimate the densityof

T

as part of the ML estimation metho d and use this as an approximation of the densityof s In this w x

i

i

case likeliho o d and mutual information are for all practical purp oses equivalent

Nevertheless, there is a small difference that may be very important in practice. The problem with maximum likelihood estimation is that the densities f_i must be estimated correctly. They need not be estimated with any great precision: in fact it is enough to estimate whether they are sub- or supergaussian. In many cases, in fact, we have enough prior knowledge on the independent components, and we don't need to estimate their nature from the data. In any case, if the information on the nature of the independent components is not correct, ML estimation will give completely wrong results. Some care must therefore be taken with ML estimation. In contrast, using reasonable measures of nongaussianity, this problem does not usually arise.

ICA and Projection Pursuit

It is interesting to note how our approach to ICA makes explicit the connection between ICA and projection pursuit. Projection pursuit is a technique developed in statistics for finding "interesting" projections of multidimensional data. Such projections can then be used for optimal visualization of the data, and for such purposes as density estimation and regression. In basic (1-D) projection pursuit, we try to find directions such that the projections of the data in those directions have interesting distributions, i.e., display some structure. It has been argued by Huber and by Jones and Sibson that the Gaussian distribution is the least interesting one, and that the most interesting directions are those that show the least Gaussian distribution. This is exactly what we do to estimate the ICA model.

The usefulness of finding such projections can be seen in Fig. 9, where the projection on the projection pursuit direction, which is horizontal, clearly shows the clustered structure of the data. The projection on the first principal component (vertical), on the other hand, fails to show this structure.

Thus, in the general formulation, ICA can be considered a variant of projection pursuit. All the nongaussianity measures and the corresponding ICA algorithms presented here could also be called projection pursuit "indices" and algorithms. In particular, the projection pursuit viewpoint allows us to tackle the situation where there are fewer independent components s_i than original variables x_i. Assuming that those dimensions of the space that are not spanned by the independent components are filled by gaussian noise, we see that by computing the nongaussian projection pursuit directions, we effectively estimate the independent components. When all the nongaussian directions have been found, all the independent components have been estimated. Such a procedure can be interpreted as a hybrid of projection pursuit and ICA.

However, it should be noted that in the formulation of projection pursuit, no data model or assumption about independent components is made. If the ICA model holds, optimizing the ICA nongaussianity measures produces independent components; if the model does not hold, then what we get are the projection pursuit directions.

5 Preprocessing for ICA

In the preceding section, we discussed the statistical principles underlying ICA methods. Practical algorithms based on these principles will be discussed in the next section. However, before applying an ICA algorithm on the data, it is usually very useful to do some preprocessing. In this section, we discuss some preprocessing techniques that make the problem of ICA estimation simpler and better conditioned.

Centering

The most basic and necessary preprocessing is to center x, i.e. subtract its mean vector m = E{x}, so as to make x a zero-mean variable. This implies that s is zero-mean as well, as can be seen by taking expectations on both sides of the mixing model x = As.

This preprocessing is made solely to simplify the ICA algorithms: it does not mean that the mean could not be estimated. After estimating the mixing matrix A with centered data, we can complete the estimation by adding the mean vector of s back to the centered estimates of s. The mean vector of s is given by A^{-1} m, where m is the mean that was subtracted in the preprocessing.

Whitening

Another useful preprocessing strategy in ICA is to first whiten the observed variables. This means that before the application of the ICA algorithm (and after centering), we transform the observed vector x linearly so that we obtain a new vector x~ which is white, i.e. its components are uncorrelated and their variances equal unity. In other words, the covariance matrix of x~ equals the identity matrix:

E{x~ x~^T} = I.

The whitening transformation is always possible. One popular method for whitening is to use the eigenvalue decomposition (EVD) of the covariance matrix, E{x x^T} = E D E^T, where E is the orthogonal matrix of eigenvectors of E{x x^T} and D is the diagonal matrix of its eigenvalues, D = diag(d_1, ..., d_n). Note that E{x x^T} can be estimated in a standard way from the available sample x(1), ..., x(T). Whitening can now be done by

x~ = E D^{-1/2} E^T x,

where the matrix D^{-1/2} is computed by a simple componentwise operation as D^{-1/2} = diag(d_1^{-1/2}, ..., d_n^{-1/2}). It is easy to check that now E{x~ x~^T} = I.
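A minimal sketch of EVD-based whitening (Python/NumPy), assuming the data matrix X holds one centered observation x(t) per column:

    import numpy as np

    def whiten(X):
        # X: centered data, shape (n, T); returns whitened data and the whitening matrix.
        C = np.cov(X)                            # estimate of E{x x^T}
        d, E = np.linalg.eigh(C)                 # C = E diag(d) E^T
        V = E @ np.diag(d ** -0.5) @ E.T         # V = E D^(-1/2) E^T
        return V @ X, V

    rng = np.random.default_rng(6)
    S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10000))
    X = np.array([[2.0, 1.0], [1.0, 1.0]]) @ S   # some mixed, correlated data
    Xw, V = whiten(X - X.mean(axis=1, keepdims=True))
    print(np.cov(Xw))                            # approximately the identity matrix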

Whitening transforms the mixing matrix into a new one, A~. We have, from the mixing model and the whitening transformation,

x~ = E D^{-1/2} E^T A s = A~ s.

The utility of whitening resides in the fact that the new mixing matrix A~ is orthogonal. This can be seen from

E{x~ x~^T} = A~ E{s s^T} A~^T = A~ A~^T = I.

Figure 10: The joint distribution of the whitened mixtures.

Here we see that whitening reduces the number of parameters to be estimated. Instead of having to estimate the n^2 parameters that are the elements of the original matrix A, we only need to estimate the new, orthogonal mixing matrix A~. An orthogonal matrix contains n(n-1)/2 degrees of freedom. For example, in two dimensions, an orthogonal transformation is determined by a single angle parameter. In larger dimensions, an orthogonal matrix contains only about half of the number of parameters of an arbitrary matrix. Thus one can say that whitening solves half of the problem of ICA. Because whitening is a very simple and standard procedure, much simpler than any ICA algorithm, it is a good idea to reduce the complexity of the problem this way.

It may also be quite useful to reduce the dimension of the data at the same time as we do the whitening. We then look at the eigenvalues d_j of E{x x^T} and discard those that are too small, as is often done in the statistical technique of principal component analysis. This often has the effect of reducing noise. Moreover, dimension reduction prevents overlearning, which can sometimes be observed in ICA.

A graphical illustration of the effect of whitening can be seen in Figure 10, in which the data in Figure 6 has been whitened. The square defining the distribution is now clearly a rotated version of the original square in Figure 5. All that is left is the estimation of a single angle that gives the rotation.

In the rest of this tutorial, we assume that the data has been preprocessed by centering and whitening. For simplicity of notation, we denote the preprocessed data just by x, and the transformed mixing matrix by A, omitting the tildes.

Further preprocessing

The success of ICA for a given data set may depend crucially on performing some application-dependent preprocessing steps. For example, if the data consists of time signals, some band-pass filtering may be very useful. Note that if we filter linearly the observed signals x_i(t) to obtain new signals, say x*_i(t), the ICA model still holds for x*_i(t), with the same mixing matrix.

This can be seen as follows. Denote by X the matrix that contains the observations x(1), ..., x(T) as its columns, and similarly for S. Then the ICA model can be expressed as:

X = A S.

Now, time filtering of X corresponds to multiplying X from the right by a matrix, let us call it M. This gives

X* = X M = A S M = A S*,

which shows that the ICA model remains valid.

6 The FastICA Algorithm

In the preceding sections, we introduced different measures of nongaussianity, i.e. objective functions for ICA estimation. In practice, one also needs an algorithm for maximizing the contrast function, for example the approximation of negentropy introduced in Section 4. In this section, we introduce a very efficient method of maximization suited for this task. It is here assumed that the data has been preprocessed by centering and whitening, as discussed in the preceding section.

FastICA for one unit

To begin with, we shall show the one-unit version of FastICA. By a "unit" we refer to a computational unit, eventually an artificial neuron, having a weight vector w that the neuron is able to update by a learning rule. The FastICA learning rule finds a direction, i.e. a unit vector w, such that the projection w^T x maximizes nongaussianity. Nongaussianity is here measured by the approximation of negentropy J(w^T x) given in Section 4. Recall that the variance of w^T x must here be constrained to unity; for whitened data this is equivalent to constraining the norm of w to be unity.

FastICA is based on a fixed-point iteration scheme for finding a maximum of the nongaussianity of w^T x, as measured by the negentropy approximation above. It can also be derived as an approximative Newton iteration, as shown below.

Denote by g the derivative of the nonquadratic function G used in the negentropy approximation; for example, the derivatives of the functions G_1 and G_2 given in Section 4 are:

g_1(u) = tanh(a_1 u),
g_2(u) = u exp(-u^2/2),

where a_1 is some suitable constant, often taken as a_1 = 1. The basic form of the FastICA algorithm is as follows:

1. Choose an initial (e.g. random) weight vector w.
2. Let w+ = E{x g(w^T x)} - E{g'(w^T x)} w.
3. Let w = w+ / ||w+||.
4. If not converged, go back to 2.

Note that convergence means that the old and new values of w point in the same direction, i.e. their dot-product is (almost) equal to 1. It is not necessary that the vector converges to a single point, since w and -w define the same direction. This is again because the independent components can be defined only up to a multiplicative sign. Note also that it is here assumed that the data is prewhitened.
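A minimal sketch of the one-unit iteration (Python/NumPy), using g(u) = tanh(u) and its derivative, and assuming the data X has already been centered and whitened (one observation per column):

    import numpy as np

    def fastica_one_unit(X, max_iter=200, tol=1e-8, rng=np.random.default_rng(7)):
        # X: whitened data, shape (n, T). Returns a unit vector w such that
        # w^T x is maximally nongaussian according to the tanh nonlinearity.
        n = X.shape[0]
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wx = w @ X                                        # projections w^T x(t)
            w_new = (X * np.tanh(wx)).mean(axis=1) - (1 - np.tanh(wx)**2).mean() * w
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1) < tol:                 # converged up to sign
                return w_new
            w = w_new
        return w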

The derivation of FastICA is as follows. First note that the maxima of the approximation of the negentropy of w^T x are obtained at certain optima of E{G(w^T x)}. According to the Kuhn-Tucker conditions, the optima of E{G(w^T x)} under the constraint E{(w^T x)^2} = ||w||^2 = 1 are obtained at points where

E{x g(w^T x)} - βw = 0.

Let us try to solve this equation by Newton's method. Denoting the function on the left-hand side by F, we obtain its Jacobian matrix JF(w) as

JF(w) = E{x x^T g'(w^T x)} - βI.

To simplify the inversion of this matrix, we decide to approximate the first term. Since the data is sphered, a reasonable approximation seems to be E{x x^T g'(w^T x)} ≈ E{x x^T} E{g'(w^T x)} = E{g'(w^T x)} I. Thus the Jacobian matrix becomes diagonal, and can easily be inverted. We thus obtain the following approximative Newton iteration:

w+ = w - [ E{x g(w^T x)} - βw ] / [ E{g'(w^T x)} - β ].

This algorithm can be further simplified by multiplying both sides of this equation by β - E{g'(w^T x)}. This gives, after algebraic simplification, the FastICA iteration.

In practice, the expectations in FastICA must be replaced by their estimates. The natural estimates are of course the corresponding sample means. Ideally, all the data available should be used, but this is often not a good idea because the computations may become too demanding. Then the averages can be estimated using a smaller sample, whose size may have a considerable effect on the accuracy of the final estimates. The sample points should be chosen separately at every iteration. If the convergence is not satisfactory, one may then increase the sample size.

FastICA for several units

The one-unit algorithm of the preceding subsection estimates just one of the independent components, or one projection pursuit direction. To estimate several independent components, we need to run the one-unit FastICA algorithm using several units (e.g. neurons) with weight vectors w_1, ..., w_n.

To prevent different vectors from converging to the same maxima we must decorrelate the outputs w_1^T x, ..., w_n^T x after every iteration. We present here three methods for achieving this.

A simple way of achieving decorrelation is a deflation scheme based on a Gram-Schmidt-like decorrelation. This means that we estimate the independent components one by one. When we have estimated p independent components, or p vectors w_1, ..., w_p, we run the one-unit fixed-point algorithm for w_{p+1}, and after every iteration step subtract from w_{p+1} the "projections" (w_{p+1}^T w_j) w_j, j = 1, ..., p, of the previously estimated p vectors, and then renormalize w_{p+1}:

1. Let w_{p+1} = w_{p+1} - Σ_{j=1}^{p} (w_{p+1}^T w_j) w_j
2. Let w_{p+1} = w_{p+1} / √(w_{p+1}^T w_{p+1})

In certain applications, however, it may be desirable to use a symmetric decorrelation, in which no vectors are "privileged" over others. This can be accomplished, e.g., by the classical method involving matrix square roots:

Let W = (W W^T)^{-1/2} W,

where W is the matrix (w_1, ..., w_n)^T of the vectors, and the inverse square root (W W^T)^{-1/2} is obtained from the eigenvalue decomposition of W W^T = F Λ F^T as (W W^T)^{-1/2} = F Λ^{-1/2} F^T. A simpler alternative is the following iterative algorithm:

1. Let W = W / √||W W^T||
2. Let W = (3/2) W - (1/2) W W^T W
Repeat step 2 until convergence.

The norm in step 1 can be almost any ordinary matrix norm, e.g., the 2-norm or the largest absolute row (or column) sum (but not the Frobenius norm).
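Both symmetric orthogonalization variants are easy to express in code; below is a sketch (Python/NumPy) of the EVD-based square-root method and of the iterative scheme, using the 2-norm for the initial scaling:

    import numpy as np

    def sym_decorrelate_evd(W):
        # W <- (W W^T)^(-1/2) W via the eigenvalue decomposition W W^T = F diag(lam) F^T
        lam, F = np.linalg.eigh(W @ W.T)
        return F @ np.diag(lam ** -0.5) @ F.T @ W

    def sym_decorrelate_iterative(W, n_iter=20):
        W = W / np.sqrt(np.linalg.norm(W @ W.T, ord=2))   # step 1: scale by the 2-norm
        for _ in range(n_iter):
            W = 1.5 * W - 0.5 * W @ W.T @ W               # step 2, repeated until convergence
        return W

    rng = np.random.default_rng(9)
    W = rng.normal(size=(3, 3))
    We, Wi = sym_decorrelate_evd(W), sym_decorrelate_iterative(W)
    print(np.allclose(We @ We.T, np.eye(3), atol=1e-6),
          np.allclose(Wi @ Wi.T, np.eye(3), atol=1e-6))    # both rows sets are orthonormal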

FastICA and maximum likelihood

Finally, we give a version of FastICA that shows explicitly the connection to the well-known infomax or maximum likelihood algorithm. If we express FastICA using the intermediate formula in the derivation above, and write it in matrix form (see the references for details), we see that FastICA takes the following form:

W+ = W + diag(α_i) [ diag(β_i) + E{g(y) y^T} ] W,

where y = Wx, β_i = -E{y_i g(y_i)}, and α_i = -1/(β_i - E{g'(y_i)}). The matrix W needs to be orthogonalized after every step. In this matrix version, it is natural to orthogonalize W symmetrically.

The above version of FastICA could be compared with the stochastic gradient method for maximizing likelihood:

W+ = W + μ [ I + g(y) y^T ] W,

where μ is the learning rate, not necessarily constant in time. Comparing the two update rules, we see that FastICA can be considered as a fixed-point algorithm for maximum likelihood estimation of the ICA data model. For details, see the references. In FastICA, convergence speed is optimized by the choice of the matrices diag(α_i) and diag(β_i). Another advantage of FastICA is that it can estimate both sub- and supergaussian independent components, which is in contrast to ordinary ML algorithms, which only work for a given class of distributions (see the discussion of maximum likelihood in Section 4).

Properties of the FastICA Algorithm

The FastICA algorithm and the underlying contrast functions have a number of desirable properties when compared with existing methods for ICA.

1. The convergence is cubic (or at least quadratic), under the assumption of the ICA data model (for a proof, see the references). This is in contrast to ordinary ICA algorithms based on (stochastic) gradient descent methods, where the convergence is only linear. This means a very fast convergence, as has been confirmed by simulations and experiments on real data.

2. Contrary to gradient-based algorithms, there are no step size parameters to choose. This means that the algorithm is easy to use.

3. The algorithm finds directly independent components of (practically) any non-Gaussian distribution using any nonlinearity g. This is in contrast to many algorithms, where some estimate of the probability distribution function has to be first available, and the nonlinearity must be chosen accordingly.

4. The performance of the method can be optimized by choosing a suitable nonlinearity g. In particular, one can obtain algorithms that are robust and/or of minimum variance. In fact, the two nonlinearities given above have some optimal properties; for details, see the references.

5. The independent components can be estimated one by one, which is roughly equivalent to doing projection pursuit. This is useful in exploratory data analysis, and decreases the computational load of the method in cases where only some of the independent components need to be estimated.

6. The FastICA has most of the advantages of neural algorithms: it is parallel, distributed, computationally simple, and requires little memory space. Stochastic gradient methods seem to be preferable only if fast adaptivity in a changing environment is required.

A Matlab implementation of the FastICA algorithm is available on the World Wide Web free of charge.

7 Applications of ICA

In this section we review some applications of ICA. The most classical application of ICA, the cocktail-party problem, was already explained in the opening section of this paper.

Separation of Artifacts in MEG Data

Magnetoencephalography (MEG) is a noninvasive technique by which the activity of the cortical neurons can be measured with very good temporal resolution and moderate spatial resolution. When using a MEG record, as a research or clinical tool, the investigator may face the problem of extracting the essential features of the neuromagnetic signals in the presence of artifacts. The amplitude of the disturbances may be higher than that of the brain signals, and the artifacts may resemble pathological signals in shape.

The authors have introduced a new method to separate brain activity from artifacts using ICA. The approach is based on the assumption that the brain activity and the artifacts, e.g. eye movements or blinks, or sensor malfunctions, are anatomically and physiologically separate processes, and this separation is reflected in the statistical independence between the magnetic signals generated by those processes. The approach follows the earlier experiments with EEG signals reported in the literature; a related approach has also been proposed elsewhere.

The MEG signals were recorded in a magnetically shielded room with a whole-scalp Neuromag neuromagnetometer. This device collects data at a large number of locations over the scalp, using orthogonal double-loop pickup coils that couple strongly to a local source just underneath. The test person was asked to blink and make horizontal saccades, in order to produce typical ocular (eye) artifacts. Moreover, to produce myographic (muscle) artifacts, the subject was asked to bite his teeth for an extended period. Yet another artifact was created by placing a digital watch one meter away from the helmet inside the shielded room.

Figure 11 presents a subset of the spontaneous MEG signals x_i(t) from the frontal, temporal, and occipital areas. The figure also shows the positions of the corresponding sensors on the helmet. Due to the dimension of the data (a large number of magnetic signals were recorded), it is impractical to plot all the MEG signals x_i(t) here. Also two electro-oculogram channels and the electrocardiogram are presented, but they were not used in computing the ICA.

The signal vector x in the ICA model now consists of the amplitudes x_i(t) of the signals at a certain time point, so the dimensionality n equals the number of magnetic channels. In the theoretical model, x is regarded as a random vector, and the measurements x(t) give a set of realizations of x as time proceeds. Note that in the basic ICA model that we are using, the temporal correlations in the signals are not utilized at all.

The x(t) vectors were whitened using PCA, and the dimensionality was decreased at the same time. Then, using the FastICA algorithm, a subset of the rows of the separating matrix W of the ICA model were computed. Once a vector w_i has become available, an ICA signal s_i(t) can be computed from s_i(t) = w_i^T x(t), with x(t) now denoting the whitened and lower-dimensional signal vector.
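To illustrate the whitening and dimension-reduction step just described, a minimal PCA-based sketch is given below; it is not the code used in the experiments, and the function name and the eigendecomposition-based formulation are assumptions made for this example.

```python
import numpy as np

def whiten_pca(X, n_keep):
    """PCA-whiten a (channels x time) data matrix X while reducing its
    dimensionality to n_keep. Returns the whitened signals Z = V X and the
    whitening matrix V."""
    Xc = X - X.mean(axis=1, keepdims=True)       # remove channel means
    d, E = np.linalg.eigh(np.cov(Xc))            # eigenvalues in ascending order
    idx = np.argsort(d)[::-1][:n_keep]           # keep the largest directions
    V = np.diag(d[idx] ** -0.5) @ E[:, idx].T    # whitening matrix
    return V @ Xc, V

# After FastICA has produced rows w_i of W for the whitened data Z, the
# component time courses are simply s_i(t) = w_i^T z(t), i.e. S = W @ Z.
```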

The figure showing the independent components (ICs) s_i(t) presents sections of the ICs found from the recorded data, together with the corresponding field patterns. The first two ICs are clearly due to the muscular activity originating from the biting. Their separation into two components seems to correspond, on the basis of the field patterns, to two different sets of muscles that were activated during the process. Two further ICs show the horizontal eye movements and the eye blinks, respectively, and another IC represents the cardiac artifact, which is very clearly extracted.

To find the remaining artifacts, the data were high-pass filtered. Next, an independent component was found that clearly shows the artifact originating at the digital watch, located to the right side of the magnetometer. The last independent component is related to a sensor presenting higher RMS (root mean squared) noise than the others.

These results clearly show that, using the ICA technique and the FastICA algorithm, it is possible to isolate both eye movement and eye blinking artifacts, as well as cardiac, myographic, and other artifacts, from MEG signals. The FastICA algorithm is an especially suitable tool, because artifact removal is an interactive technique and the investigator may freely choose how many of the ICs he or she wants.

In addition to reducing artifacts, ICA can be used to decompose evoked fields, which enables direct access to the underlying brain functioning; this is likely to be of great significance in neuroscientific research.

Figure: Samples of MEG signals showing artifacts produced by blinking, saccades, biting, and the cardiac cycle. For each of the positions shown, the two orthogonal directions of the sensors are plotted. (Scale bars: MEG 1000 fT/cm, EOG 500 µV, ECG 500 µV; time scale 10 s.)

Finding Hidden Factors in Financial Data

It is a tempting alternative to try ICA on financial data. There are many situations in that application domain in which parallel time series are available, such as currency exchange rates or daily returns of stocks, that may have some common underlying factors. ICA might reveal some driving mechanisms that otherwise remain hidden. In a recent study of a stock portfolio, it was found that ICA is a complementary tool to PCA, allowing the underlying structure of the data to be more readily observed.

In another study, we applied ICA to a different problem: the cash flow of several stores belonging to the same retail chain, trying to find the fundamental factors common to all stores that affect the cash flow data. Thus, the cash-flow effect of the factors specific to any particular store, i.e. the effect of the actions taken at the individual stores and in their local environments, could be analyzed.

The assumption of having some underlying independent components in this specific application may not be unrealistic. For example, factors like seasonal variations due to holidays and annual variations, and factors having a sudden effect on the purchasing power of the customers, like price changes of various commodities, can be expected to have an effect on all the retail stores, and such factors can be assumed to be roughly independent of each other. Yet, depending on the policy and skills of the individual manager, e.g. advertising efforts, the effects of the factors on the cash flow of specific retail outlets are slightly different. By ICA, it is possible to isolate both the underlying factors and the effect weights, thus also making it possible to group the stores on the basis of their managerial policies, using only the cash flow time series data.

The data consisted of the weekly cash flow in stores that belong to the same retail chain; the cash flow measurements cover a period of many weeks. Some examples of the original data x_i(t) are shown in the corresponding figure.

The prewhitening was performed so that the original signal vectors were projected onto the subspace spanned by their first five principal components, and the variances were normalized to unity; thus the dimension of the signal space was decreased to five. Using the FastICA algorithm, four ICs s_i(t) were estimated. As depicted in the figure of the estimated factors, the FastICA algorithm has found several clearly different fundamental factors hidden in the original data.
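For concreteness, a hypothetical end-to-end pipeline for this experiment, reusing the whitening and deflationary FastICA sketches given earlier in this paper, might look as follows; the matrix X standing for the stores-by-weeks cash flow data and all variable names are illustrative assumptions rather than the code used in the study, while the choice of five principal components and four independent components follows the text.

```python
# X: hypothetical (stores x weeks) cash flow matrix.
Z, V = whiten_pca(X, n_keep=5)             # project onto five principal components
W = fastica_deflation(Z, n_components=4)   # estimate four independent components
S = W @ Z                                  # rows of S: the "fundamental factor" series
```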

The factors have clearly different interpretations. The uppermost two factors follow the sudden changes that are caused by holidays etc.; the most prominent example is Christmas time. The factor on the bottom row, on the other hand, reflects the slower seasonal variation, with the effect of the summer holidays clearly visible. The factor on the third row could represent a still slower variation, something resembling a trend. The last factor, on the fourth row, is different from the others; it might be that this factor mostly follows the relative competitive position of the retail chain with respect to its competitors, but other interpretations are also possible.

More details on the experiments and their interpretation can be found in the references.

Reducing Noise in Natural Images

The third example deals with finding ICA filters for natural images and, based on the ICA decomposition, removing noise from images corrupted with additive Gaussian noise.

A set of digitized natural images was used. Denote the vector of pixel gray levels in an image window by x. Note that, contrary to the other two applications in the previous sections, we are not this time considering multivalued time series or images changing with time; instead, the elements of x are indexed by the location in the image window or patch. The sample windows were taken at random locations. The two-dimensional structure of the windows is of no significance here: row-by-row scanning was used to turn a square image window into a vector of pixel values. The independent components of such image windows are represented in the corresponding figure. Each window in that figure corresponds to one of the columns a_i of the mixing matrix A. Thus, an observed image window is a superposition of these windows, as in the basic ICA model, with independent coefficients.
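As a small illustration of this sampling and vectorization (not the original experimental code), the following sketch takes windows at random locations and scans them row by row into column vectors; the patch size and the number of windows are arbitrary choices.

```python
import numpy as np

def sample_patches(image, patch_size=8, n_patches=10000, seed=0):
    """Take square image windows at random locations and vectorize each one
    row by row; returns a (patch_size**2 x n_patches) data matrix."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    rows = rng.integers(0, h - patch_size, size=n_patches)
    cols = rng.integers(0, w - patch_size, size=n_patches)
    return np.stack(
        [image[r:r + patch_size, c:c + patch_size].reshape(-1)  # row-by-row scan
         for r, c in zip(rows, cols)],
        axis=1,
    )
```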

Now, suppose a noisy image model holds:

z = x + n

where n is uncorrelated noise, with elements indexed in the image window in the same way as x, and z is the measured image window corrupted with noise. Let us further assume that n is Gaussian and x is non-Gaussian. There are many ways to clean the noise; one example is to make a transformation to the spatial frequency space by DFT, do low-pass filtering, and return to the image space by IDFT. This is not very efficient, however. A better method is the recently introduced Wavelet Shrinkage method, in which a transform based on wavelets is used, or methods based on median filtering. None of these methods is explicitly taking advantage of the image statistics, however.
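For reference, the DFT-based baseline mentioned above can be sketched as follows; the fraction of retained low frequencies is an arbitrary illustrative parameter, and this is only the simple comparison method, not the shrinkage approach described next.

```python
import numpy as np

def dft_lowpass_denoise(image, keep_fraction=0.1):
    """Transform to the spatial frequency domain with the DFT, zero out the
    high frequencies, and return to the image space with the inverse DFT."""
    F = np.fft.fftshift(np.fft.fft2(image))      # low frequencies at the center
    h, w = F.shape
    kh, kw = int(h * keep_fraction / 2), int(w * keep_fraction / 2)
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 2 - kh : h // 2 + kh, w // 2 - kw : w // 2 + kw] = True
    F[~mask] = 0.0                               # low-pass filtering
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```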

We have recently introduced another, statistically principled method called Sparse Code Shrinkage. It is very closely related to independent component analysis. Briefly, if we model the density of x by ICA, and assume n Gaussian, then the Maximum Likelihood (ML) solution for x given the measurement z can be developed in the signal model.

The ML solution can be simply computed, albeit approximately, by using a decomposition that is an orthogonalized version of ICA. The transform is given by

Wz = Wx + Wn = s + Wn

where W is here an orthogonal matrix that is the best orthogonal approximation of the inverse of the ICA mixing matrix. The noise term Wn is still Gaussian and white. With a suitably chosen orthogonal transform W, however, the density of Wx = s becomes highly non-Gaussian, e.g. super-Gaussian with a high positive kurtosis. This depends of course on the original x signals, as we are in fact assuming that there exists a model x = W^T s for the signal, such that the source signals or elements of s have a positively kurtotic density, in which case the ICA transform gives highly super-Gaussian components. This seems to hold at least for image windows of natural scenes.
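As an aside, one standard way of obtaining an orthogonal matrix close to a given estimated transform is the SVD-based (polar) projection sketched below; whether this is exactly the orthogonal approximation used in the work cited here is left open, so the sketch should be read as an assumption.

```python
import numpy as np

def closest_orthogonal(M):
    """Return the orthogonal matrix nearest to M in the Frobenius norm,
    i.e. the polar factor U V^T of the SVD M = U S V^T, which equals
    (M M^T)^(-1/2) M."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt
```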

It has been shown that, assuming a Laplacian density for s_i, the ML solution for s_i is given by a shrinkage function ŝ_i = g([Wz]_i), or in vector form, ŝ = g(Wz). The function g has a characteristic shape: it is zero close to the origin and then linear after a cutting value that depends on the parameters of the Laplacian density and of the Gaussian noise density. Assuming other forms for the densities, other optimal shrinkage functions can be derived.

In the Sparse Code Shrinkage method, the shrinkage operation is performed in the rotated space, after which the estimate for the signal in the original space is given by rotating back:

x̂ = W^T ŝ = W^T g(Wz)

Thus, we get the Maximum Likelihood estimate for the image window x, in which much of the noise has been removed.
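A minimal sketch of the complete shrinkage step is given below, assuming an orthogonal transform W has already been estimated and that the component densities are Laplacian, so that the ML shrinkage reduces to a soft-thresholding function (zero near the origin, linear beyond a cutting value). The particular threshold formula, written in terms of an assumed noise standard deviation and component scale, is an illustrative choice rather than the exact expression of the original derivation.

```python
import numpy as np

def sparse_code_shrinkage(z, W, noise_std, comp_std):
    """Denoise a vectorized image window z: rotate with the orthogonal matrix
    W, shrink each component, and rotate back, i.e. x_hat = W^T g(W z)."""
    u = W @ z                                         # noisy components s + Wn
    cutoff = np.sqrt(2.0) * noise_std**2 / comp_std   # assumed cutting value
    s_hat = np.sign(u) * np.maximum(np.abs(u) - cutoff, 0.0)
    return W.T @ s_hat                                # estimate in image space
```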

The rotation operator W is such that the sparsity of the components s = Wx is maximized. This operator can be learned with a modification of the FastICA algorithm; see the references for details.

A noise-cleaning result is shown in the final figure. A noiseless image and a noisy version, in which the noise level is a substantial fraction of the signal level, are shown. The results of the Sparse Code Shrinkage method and of classic Wiener filtering are given, indicating that Sparse Code Shrinkage may be a promising approach. The noise is reduced without blurring edges or other sharp features as much as in Wiener filtering. This is largely due to the strongly nonlinear nature of the shrinkage operator, which is optimally adapted to the inherent statistics of natural images.

Telecommunications

Finally, we mention another emerging application area of great potential: telecommunications. An example of a real-world communications application where blind separation techniques are useful is the separation of the user's own signal from the interfering other users' signals in CDMA (Code-Division Multiple Access) mobile communications. This problem is semi-blind in the sense that certain additional prior information is available on the CDMA data model. But the number of parameters to be estimated is often so high that suitable blind source separation techniques, taking into account the available prior knowledge, provide a clear performance improvement over more traditional estimation techniques.

Conclusion

ICA is a very general-purpose statistical technique in which observed random data are linearly transformed into components that are maximally independent from each other and that simultaneously have interesting distributions. ICA can be formulated as the estimation of a latent variable model. The intuitive notion of maximum nongaussianity can be used to derive different objective functions whose optimization enables the estimation of the ICA model. Alternatively, one may use more classical notions like maximum likelihood estimation or minimization of mutual information to estimate ICA; somewhat surprisingly, these approaches are approximately equivalent. A computationally very efficient method for performing the actual estimation is given by the FastICA algorithm. Applications of ICA can be found in many different areas, such as audio processing, biomedical signal processing, image processing, telecommunications, and econometrics.

References

S.-I. Amari, A. Cichocki, and H.H. Yang. A new learning algorithm for blind source separation. In Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.

A.D. Back and A.S. Weigend. A first application of independent component analysis to extracting structure from stock returns. Int. J. on Neural Systems.

A.J. Bell and T.J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation.

J.-F. Cardoso. Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing.

J.-F. Cardoso and B. Hvam Laheld. Equivariant adaptive source separation. IEEE Trans. on Signal Processing.

A. Cichocki, R.E. Bogner, L. Moszczynski, and K. Pope. Modified Herault-Jutten algorithms for blind separation of sources. Digital Signal Processing.

P. Comon. Independent component analysis, a new concept? Signal Processing.

T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons.

N. Delfosse and P. Loubaton. Adaptive blind separation of independent sources: a deflation approach. Signal Processing.

D.L. Donoho, I.M. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society, Series B.

The FastICA MATLAB package. Available at http://www.cis.hut.fi/projects/ica/fastica/.

J.H. Friedman and J.W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. on Computers.

J.H. Friedman. Exploratory projection pursuit. J. of the American Statistical Association.

X. Giannakopoulos, J. Karhunen, and E. Oja. Experimental comparison of neural ICA algorithms. In Proc. Int. Conf. on Artificial Neural Networks (ICANN), Skövde, Sweden.

R. Gonzalez and P. Wintz. Digital Image Processing. Addison-Wesley.

P.J. Huber. Projection pursuit. The Annals of Statistics.

A. Hyvärinen. Independent component analysis in the presence of gaussian noise by maximizing joint likelihood. Neurocomputing.

A. Hyvärinen. New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems. MIT Press.

A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. on Neural Networks.

A. Hyvärinen. The fixed-point algorithm and maximum likelihood estimation for independent component analysis. Neural Processing Letters.

A. Hyvärinen. Gaussian moments for noisy independent component analysis. IEEE Signal Processing Letters.

A. Hyvärinen. Sparse code shrinkage: Denoising of nongaussian data by maximum likelihood estimation. Neural Computation.

A. Hyvärinen. Survey on independent component analysis. Neural Computing Surveys.

A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation.

A. Hyvärinen and E. Oja. Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing.

A. Hyvärinen, J. Särelä, and R. Vigário. Spikes and bumps: Artefacts generated by independent component analysis with insufficient sample size. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA), Aussois, France.

M.C. Jones and R. Sibson. What is projection pursuit? J. of the Royal Statistical Society, Series A.

C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing.

J. Karhunen, E. Oja, L. Wang, R. Vigário, and J. Joutsensalo. A class of neural networks for independent component analysis. IEEE Trans. on Neural Networks.

K. Kiviluoto and E. Oja. Independent component analysis for parallel financial time series. In Proc. ICONIP, Tokyo, Japan.

T.-W. Lee, M. Girolami, and T.J. Sejnowski. Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation.

D.G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons.

S. Makeig, A.J. Bell, T.-P. Jung, and T.J. Sejnowski. Independent component analysis of electroencephalographic data. In Advances in Neural Information Processing Systems. MIT Press.

S.G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. on PAMI.

J.-P. Nadal and N. Parga. Non-linear neurons in the low noise limit: a factorial code maximizes information transfer. Network.

A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 3rd edition.

B.A. Pearlmutter and L.C. Parra. Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In Advances in Neural Information Processing Systems.

D.-T. Pham, P. Garrat, and C. Jutten. Separation of a mixture of independent sources through a maximum likelihood approach. In Proc. EUSIPCO.

T. Ristaniemi and J. Joutsensalo. On the performance of blind source separation in CDMA downlink. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA), Aussois, France.

R. Vigário. Extraction of ocular artifacts from EEG using independent component analysis. Electroenceph. clin. Neurophysiol.

R. Vigário, V. Jousmäki, M. Hämäläinen, R. Hari, and E. Oja. Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In Advances in Neural Information Processing Systems. MIT Press.

R. Vigário, J. Särelä, and E. Oja. Independent component analysis in wave-decomposition of auditory evoked fields. In Proc. Int. Conf. on Artificial Neural Networks (ICANN), Skövde, Sweden.

Figure: Nine independent components IC1–IC9 found from the MEG data. For each component, the left, back, and right views of the field patterns generated by these components are shown; the full line stands for the magnetic flux coming out from the head, and the dotted line for the flux inwards. (Time scale: 10 s.)

Figure: Five samples of the original cash flow time series (mean removed, normalized to unit standard deviation). Horizontal axis: time in weeks.

Figure: Four independent components, or fundamental factors, found from the cash flow data.

Figure: An experiment in denoising. Upper left: the original image. Upper right: the original image corrupted with noise. Lower left: the recovered image after applying sparse code shrinkage. Lower right: for comparison, a Wiener-filtered image.