arXiv:1406.4765v2 [math.ST] 10 Sep 2015

Statistical Science
2015, Vol. 30, No. 3, 372-390
DOI: 10.1214/15-STS520
© Institute of Mathematical Statistics, 2015

Fourth Moments and Independent Component Analysis

Jari Miettinen, Sara Taskinen, Klaus Nordhausen and Hannu Oja

Jari Miettinen is a Postdoctoral Researcher, Department of Mathematics and Statistics, University of Jyväskylä, 40014 Jyväskylä, Finland (e-mail: jari.p.miettinen@jyu.fi). Sara Taskinen is an Academy Research Fellow, Department of Mathematics and Statistics, University of Jyväskylä, 40014 Jyväskylä, Finland (e-mail: sara.l.taskinen@jyu.fi). Klaus Nordhausen is a Senior Research Fellow, Department of Mathematics and Statistics, University of Turku, 20014 Turku, Finland (e-mail: klaus.nordhausen@utu.fi). Hannu Oja is a Professor, Department of Mathematics and Statistics, University of Turku, 20014 Turku, Finland (e-mail: hannu.oja@utu.fi).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2015, Vol. 30, No. 3, 372-390. This reprint differs from the original in pagination and typographic detail.

Abstract. In independent component analysis it is assumed that the components of the observed random vector are linear combinations of latent independent random variables, and the aim is then to find an estimate for a transformation matrix back to these independent components. In the engineering literature, there are several traditional estimation procedures based on the use of fourth moments, such as FOBI (fourth order blind identification), JADE (joint approximate diagonalization of eigenmatrices), and FastICA, but the statistical properties of these estimates are not well known. In this paper various independent component functionals based on the fourth moments are discussed in detail, starting with the corresponding optimization problems, deriving the estimating equations and estimation algorithms, and finding the asymptotic statistical properties of the estimates. Comparisons of the asymptotic variances of the estimates in wide independent component models show that in most cases JADE and the symmetric version of FastICA perform better than their competitors.

Key words and phrases: Affine equivariance, FastICA, FOBI, JADE, kurtosis.

1. INTRODUCTION

In his system of frequency curves, Pearson (1895) identified different types of distributions, and the classification was based on the use of the standardized third and fourth moments. A standard measure of the degree of kurtosis for the distribution of x can be written as

β = E([x − E(x)]^4) / [E([x − E(x)]^2)]^2,

and κ was defined as κ = β − 3. Pearson (1905) called the distribution platykurtic, leptokurtic, or mesokurtic depending on the value of κ (in the case of the normal distribution, κ = 0); in the same paper, Pearson also considered the probable error of κ̂. Later, kurtosis was generally understood simply as a property which is measured by β, which has raised questions such as "Is kurtosis really peakedness?"; see, for example, Darlington (1970). Van Zwet (1964) proposed kurtosis orderings for symmetrical distributions, and Oja (1981) defined measures of kurtosis as functionals which (i) are invariant under linear transformations and (ii) preserve the van Zwet partial ordering. Most measures of kurtosis, including β, can be written as a ratio of two scale measures. Recently, robust measures of kurtosis have also been proposed and considered in the literature; see, for example, Brys, Hubert and Struyf (2006).

It is well known that the mean of the sample variance depends on the population variance only, but the variance of the sample variance depends also on the shape of the distribution through β. The measure β has been used as an indicator of bimodality, for example, in identifying clusters in the data set (Peña and Prieto, 2001), or as a general indicator for non-Gaussianity, for example, in testing for normality or in independent component analysis (Hyvärinen, 1999). Classical tests for normality are based on the standardized third and fourth moments. See also DeCarlo (1997) for the meaning and use of kurtosis.

The concept and measures of kurtosis have been extended to the multivariate case as well. The classical skewness and kurtosis measures by Mardia (1970), for example, combine in a natural way the third and fourth moments of a standardized multivariate variable. Mardia's measures are invariant under affine transformations, that is, the p-variate random variables x and Ax + b have the same skewness and kurtosis values for all full-rank p × p matrices A and for all p-vectors b. For similar combinations of the standardized third and fourth moments, see also Móri, Rohatgi and Székely (1993). Let next V_1 and V_2 be two p × p affine equivariant scatter matrices (functionals); see Huber (1981) and Maronna (1976) for early contributions on scatter matrices. Then, in the invariant coordinate selection (ICS) in Tyler et al. (2009), one finds an affine transformation matrix W such that

W V_1 W' = I_p  and  W V_2 W' = D,

where D is a diagonal matrix with diagonal elements in decreasing order. The transformed p variables are then presented in a new invariant coordinate system, and the diagonal elements in D, that is, the eigenvalues of V_1^{-1} V_2, provide measures of multivariate kurtosis. This procedure is also sometimes called the generalized principal component analysis and has been used to find structures in the data. See Caussinus and Ruiz-Gazen (1993), Critchley, Pires and Amado (2006), Ilmonen, Nevalainen and Oja (2010), Peña, Prieto and Viladomat (2010), and Nordhausen, Oja and Ollila (2011). For the tests for multinormality based on these ideas, see Kankainen, Taskinen and Oja (2007). In independent component analysis, certain fourth moment matrices are used together with the covariance matrix in a similar way to find the transformations to independent components [FOBI by Cardoso (1989) and JADE by Cardoso and Souloumiac (1993)]. See also Oja, Sirkiä and Eriksson (2006).

In this paper, we consider the use of univariate and multivariate fourth moments in independent component analysis (ICA). The basic independent component (IC) model assumes that the observed components of x_i = (x_{i1},...,x_{ip})' are linear combinations of the latent independent components of z_i = (z_{i1},...,z_{ip})'. Hence, the model can be written as

x_i = μ + Ω z_i,  i = 1,...,n,

where the full-rank p × p matrix Ω is called the mixing matrix and z_1,...,z_n is a random sample from a distribution with independent components such that E(z_i) = 0 and Cov(z_i) = I_p. Similarly to the model of elliptically symmetric distributions, the IC model is a semiparametric model, as the marginal distributions of the components of z are left fully unspecified except for the first two moments. For the identifiability of the parameters, one further assumes that at most one of the components has a normal distribution. Notice also that Ω and z are still confounded in the sense that the order and the signs of the components of z are not uniquely defined. The location center, the p-vector μ, is usually considered a nuisance parameter, since the main goal in independent component analysis is, based on a p × n data matrix X = (x_1,...,x_n), to find an estimate for an unmixing matrix W such that W x_i has independent components. Note that all unmixing matrices W can be written as CΩ^{-1}, where each row and each column of the p × p matrix C has exactly one nonzero element.

The population quantity to be estimated is first defined as an independent component functional W(F). The estimate W(F_n), also denoted by W(X), is then obtained by applying the functional to the empirical distribution F_n of X = (x_1,...,x_n). In the engineering literature, several estimation procedures based on the fourth moments, such as FOBI (fourth order blind identification) (Cardoso, 1989), JADE (joint approximate diagonalization of eigenmatrices) (Cardoso and Souloumiac, 1993), and FastICA (Hyvärinen, 1999), have been proposed and widely used. In these approaches the marginal distributions are separated using various fourth moments. On the other hand, the estimators by Chen and Bickel (2006) and Samworth and Yuan (2012) only need the existence of the first moments and rely on efficient nonparametric estimates of the marginal densities. Efficient estimation methods based on residual signed ranks and residual ranks have been developed recently by Ilmonen and Paindaveine (2011) and Hallin and Mehta (2015). For a parametric model with a marginal Pearson system approach, see Karvanen and Koivunen (2002).

This paper describes in detail the independent component functionals based on fourth moments through corresponding optimization problems and estimating equations, provides fixed-point algorithms and the limiting statistical properties of the estimates, and specifies the needed assumptions. Also, a wide comparison study of the estimates is carried out. As far as we know, most of the results in the paper are new, including the asymptotical properties of the JADE estimate. The asymptotical properties of the FOBI estimate have been derived earlier in Ilmonen, Nevalainen and Oja (2010). The limiting variances and the limiting multinormality of the deflation-based version of the FastICA estimate have been studied in Ollila (2010) and Nordhausen et al. (2011), respectively.

2. NOTATION AND PRELIMINARY RESULTS

Throughout the paper, we use the following notation. First write, for independent z_{ik}, k = 1,...,p,

E(z_{ik}) = 0,  E(z_{ik}^2) = 1,  E(z_{ik}^3) = γ_k,  E(z_{ik}^4) = β_k,
κ_k = β_k − 3,  π_k = sign(κ_k)  and  Var(z_{ik}^3) = σ_k^2.

As seen later, the limiting distributions of the unmixing matrix estimates based on fourth moments depend on the joint limiting distribution of

     √n ŝ_kl = n^{-1/2} Σ_{i=1}^n z_{ik} z_{il},
(1)  √n r̂_kl = n^{-1/2} Σ_{i=1}^n (z_{ik}^3 − γ_k) z_{il}  and
     √n r̂_mkl = n^{-1/2} Σ_{i=1}^n z_{im}^2 z_{ik} z_{il}

for distinct k, l, m = 1,...,p. If the eighth moments of z_i exist, then the joint limiting distribution of √n ŝ_kl, √n r̂_kl, and √n r̂_mkl is a multivariate normal distribution with marginal zero means. The nonzero variances and covariances are

Var(√n ŝ_kl) = 1,  Var(√n r̂_kl) = σ_k^2,  Var(√n r̂_mkl) = β_m,

and

Cov(√n ŝ_kl, √n r̂_kl) = β_k,   Cov(√n r̂_kl, √n r̂_lk) = β_k β_l,
Cov(√n ŝ_kl, √n r̂_mkl) = 1,   Cov(√n r̂_kl, √n r̂_mkl) = β_k  and
Cov(√n r̂_lk, √n r̂_mkl) = β_l.

We also often refer to the following sets of p × p transformation matrices:

1. D = {diag(d_1,...,d_p) : d_1,...,d_p > 0} (heterogeneous rescaling),
2. J = {diag(j_1,...,j_p) : j_1,...,j_p = ±1} (heterogeneous sign changes),
3. P = {P : P is a permutation matrix},
4. U = {U : U is an orthogonal matrix},
5. C = {C : C = PJD, P ∈ P, J ∈ J, D ∈ D}.

Next, let e_i denote a p-vector with ith element one and other elements zero, and define E^{ij} = e_i e_j', i, j = 1,...,p, and

J_{p,p} = Σ_{i=1}^p Σ_{j=1}^p E^{ij} ⊗ E^{ij} = vec(I_p) vec(I_p)',
K_{p,p} = Σ_{i=1}^p Σ_{j=1}^p E^{ij} ⊗ E^{ji},
I_{p,p} = Σ_{i=1}^p Σ_{j=1}^p E^{ii} ⊗ E^{jj} = I_{p^2}  and
D_{p,p} = Σ_{i=1}^p E^{ii} ⊗ E^{ii}.

Then, for any p × p matrix A, J_{p,p} vec(A) = tr(A) · vec(I_p), K_{p,p} vec(A) = vec(A'), and D_{p,p} vec(A) = vec(diag(A)). The matrix K_{p,p} is sometimes called a commutation matrix. For a symmetric nonnegative definite matrix S, the matrix S^{-1/2} is taken to be symmetric and nonnegative definite and to satisfy S^{-1/2} S S^{-1/2} = I_p.
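The three identities above are easy to verify numerically. The following minimal NumPy sketch (our own illustration; the function names are ours) builds J_{p,p}, K_{p,p}, and D_{p,p} from the matrices E^{ij} and checks their action on vec(A), using the column-stacking convention for vec(.):

```python
import numpy as np

def E_ij(i, j, p):
    # E^{ij} = e_i e_j', a p x p matrix with a single unit element
    M = np.zeros((p, p))
    M[i, j] = 1.0
    return M

def J_pp(p):
    return sum(np.kron(E_ij(i, j, p), E_ij(i, j, p))
               for i in range(p) for j in range(p))

def K_pp(p):  # the commutation matrix
    return sum(np.kron(E_ij(i, j, p), E_ij(j, i, p))
               for i in range(p) for j in range(p))

def D_pp(p):
    return sum(np.kron(E_ij(i, i, p), E_ij(i, i, p)) for i in range(p))

vec = lambda M: M.reshape(-1, order="F")  # column-stacking vec(.)

p = 3
A = np.arange(1.0, p * p + 1).reshape(p, p)
assert np.allclose(J_pp(p), np.outer(vec(np.eye(p)), vec(np.eye(p))))
assert np.allclose(J_pp(p) @ vec(A), np.trace(A) * vec(np.eye(p)))
assert np.allclose(K_pp(p) @ vec(A), vec(A.T))
assert np.allclose(D_pp(p) @ vec(A), vec(np.diag(np.diag(A))))
```

The column-stacking convention matters here: with it, (B ⊗ C) vec(X) = vec(C X B'), which is exactly what the three checks exercise.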

3. INDEPENDENT COMPONENT MODEL AND FUNCTIONALS

3.1 Independent Component (IC) Model

Throughout the paper, our p-variate observations x_1,...,x_n follow the independent component (IC) model

(2)  x_i = μ + Ω z_i,  i = 1,...,n,

where μ is a mean vector, Ω is a full-rank p × p mixing matrix, and z_1,...,z_n are independent and identically distributed random vectors from a p-variate distribution such that:

Assumption 1. The components z_{i1},...,z_{ip} of z_i are independent.

Assumption 2. Second moments exist, E(z_i) = 0 and E(z_i z_i') = I_p.

Assumption 3. At most one of the components z_{i1},...,z_{ip} of z_i has a normal distribution.

If the model is defined using Assumption 1 only, then the mixing matrix Ω is not well-defined and can at best be identified only up to the order, the signs, and heterogeneous multiplications of its columns. Assumption 2 states that the second moments exist, and E(z_i) = 0 and E(z_i z_i') = I_p then serve as identification constraints for μ and the scales of the columns of Ω. Assumption 3 is needed, as, for example, if z ~ N_2(0, I_2), then also Uz ~ N_2(0, I_2) for all orthogonal U, and the independent components are not well-defined. Still, after these three assumptions, the order and the signs of the columns of Ω remain unidentified, but one can identify the set of the standardized independent components {±z_{i1},...,±z_{ip}}, which is naturally sufficient for practical data analysis. One of the key results in independent component analysis is the following.

Theorem 1. Let x = μ + Ωz be an observation from an IC model with mean vector μ and covariance matrix Σ = ΩΩ', and write x_st = Σ^{-1/2}(x − μ) for the standardized random variable. Then z = U x_st for some orthogonal matrix U = (u_1,...,u_p)'.

The result says that, starting with the standardized observations x_st, one only has to search for an unknown U ∈ U such that U x_st has independent components. Thus, after estimating Σ, the estimation problem can be reduced to the estimation problem of an orthogonal matrix U only.

3.2 Independent Component (IC) Functionals

Write next X = (x_1,...,x_n) for a random sample from the IC model (2) with the cumulative distribution function (c.d.f.) F_x. As mentioned in the Introduction, the aim of independent component analysis (ICA) is to find an estimate of some unmixing matrix W such that W x_i has independent components. It is easy to see that all unmixing matrices can be written as W = CΩ^{-1} for some C ∈ C. The population quantity, which we wish to estimate, is defined as the value of an independent component functional W(F) at the distribution F_x.

Definition 1. The p × p matrix-valued functional W(F) is said to be an independent component (IC) functional if (i) W(F_x)x has independent components in the IC model (2) and (ii) W(F_x) is affine equivariant in the sense that

{(W(F_{Ax+b})Ax)_1,...,(W(F_{Ax+b})Ax)_p} = {±(W(F_x)x)_1,...,±(W(F_x)x)_p}

for all nonsingular p × p matrices A and for all p-vectors b.

Notice that in the independent component model, W(F_x)x does not depend on the specific choices of z and Ω, up to the signs and the order of the components. Notice also that, in the condition (ii), any c.d.f. F is allowed to be used as an argument of W(F). The corresponding sample version W(F_n) is then obtained when the IC functional is applied to the empirical distribution function F_n of X = (x_1,...,x_n). We also sometimes write W(X) for the sample version. Naturally, the estimator is then also affine equivariant in the sense that, for all nonsingular p × p matrices A and for all p-vectors b,

W(AX + b 1_n') AX = P J W(X) X

for some J ∈ J and P ∈ P.

Remark 1. As mentioned before, if W is an unmixing matrix, then so is CW for all C ∈ C, and we then have a whole set of matrices {CW : C ∈ C} equivalent to W. To find a unique representative in the class, it is often required that Cov(CWx) = I_p, but still the order and the signs of the rows remain unidentified. Of course, the assumption on the existence of the second moments may sometimes be thought to be too restrictive. For alternative ways to identify the unmixing matrix, see then Chen and Bickel (2006), Ilmonen and Paindaveine (2011), and Hallin and Mehta (2015), for example. For a general discussion on this identification problem, see also Eriksson and Koivunen (2004).
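Theorem 1 rests on the observation that, when Cov(z) = I_p, the matrix Σ^{-1/2}Ω is orthogonal, so whitening reduces the search for an unmixing matrix to the search for an orthogonal matrix. A short NumPy sketch of this fact (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
Omega = rng.standard_normal((p, p))       # a full-rank mixing matrix
Sigma = Omega @ Omega.T                   # Cov(x) = Omega Omega' when Cov(z) = I_p

# symmetric inverse square root Sigma^{-1/2} via the eigendecomposition
evals, evecs = np.linalg.eigh(Sigma)
Sigma_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

# x_st = Sigma^{-1/2}(x - mu) = (Sigma^{-1/2} Omega) z, so z = U x_st with
U = (Sigma_isqrt @ Omega).T
assert np.allclose(U @ U.T, np.eye(p))    # U is orthogonal, as Theorem 1 states
```

The assertion holds for any full-rank Ω, which is exactly why, after estimating Σ, only an orthogonal matrix remains to be estimated.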

4. UNIVARIATE KURTOSIS AND INDEPENDENT COMPONENT ANALYSIS

4.1 Classical Measures of Univariate Skewness and Kurtosis

Let first x be a univariate random variable with mean value μ and variance σ^2. The standardized variable is then z = (x − μ)/σ, and the classical skewness and kurtosis measures are the standardized third and fourth moments, γ = E(z^3) and β = E(z^4). For symmetrical distributions, γ = 0, and for the normal distribution, κ = β − 3 = 0. For a random sample x_1,...,x_n from a univariate distribution, write

μ_j = E((x_i − μ)^j)  and  m_j = n^{-1} Σ_{i=1}^n (x_i − x̄)^j,  j = 2, 3, 4.

Then the limiting distribution of √n(m_2 − μ_2, m_3 − μ_3, m_4 − μ_4)' is a 3-variate normal distribution with mean vector zero and with the covariance matrix having (i, j) element

μ_{i+j+2} − μ_{i+1} μ_{j+1} − (i+1) μ_i μ_{j+2} − (j+1) μ_{i+2} μ_j + (i+1)(j+1) μ_i μ_j μ_2,  i, j = 1, 2, 3.

See Theorem 2.2.3.B in Serfling (1980). Then in the symmetric case with μ_2 = 1, for example,

√n(m_2 − 1, m_3, m_4 − μ_4)' →_d N_3(0, V),  where

V = ( μ_4 − 1      0                μ_6 − μ_4   )
    ( 0            μ_6 − 6μ_4 + 9   0           )
    ( μ_6 − μ_4    0                μ_8 − μ_4^2 ).

If the observations come from N(0, 1), we further obtain

√n(m_2 − 1, m_3, m_4 − 3)' →_d N_3( 0, ( 2   0   12 )
                                       ( 0   6   0  )
                                       ( 12  0   96 ) ).

The classical skewness and kurtosis statistics, the natural estimates of γ and β, are γ̂ = m_3/m_2^{3/2} and β̂ = m_4/m_2^2, and then

√n γ̂ = √n m_3 + o_P(1)  and
√n κ̂ = √n(β̂ − 3) = √n(m_4 − 3) − 6 √n(m_2 − 1) + o_P(1),

and we obtain, in the general N(μ, σ^2) case, that

√n(γ̂, κ̂)' = √n(γ̂, β̂ − 3)' →_d N_2(0, diag(6, 24)).

Consider next p-variate observations coming from an IC model. The important role of the fourth moments is stated in the following:

Theorem 2. Let the components of z = (z_1,...,z_p)' be independent and standardized so that E(z) = 0 and Cov(z) = I_p, and assume that at most one of the kurtosis values κ_i = E(z_i^4) − 3, i = 1,...,p, is zero. Then the following inequalities hold true:

(i) |E((u'z)^4) − 3| ≤ max{|E(z_1^4) − 3|,...,|E(z_p^4) − 3|} for all u such that u'u = 1. The equality holds only if u = ±e_i for i such that |E(z_i^4) − 3| = max{|E(z_1^4) − 3|,...,|E(z_p^4) − 3|}, and

(ii) |E[(u_1'z)^4] − 3| + ··· + |E[(u_p'z)^4] − 3| ≤ |E[z_1^4] − 3| + ··· + |E[z_p^4] − 3| for all orthogonal matrices U = (u_1,...,u_p)'. The equality holds only if U = JP for some J ∈ J and P ∈ P.

For the first part of the theorem, see Lemma 2 in Bugrien and Kent (2005). The theorem suggests natural strategies and algorithms in the search for independent components. It was seen in Theorem 1 that in the IC model x_st = U'z with an orthogonal U = (u_1,...,u_p)'. The first part of Theorem 2 then shows how the components can be found one by one just by repeatedly maximizing

|E((u_k' x_st)^4) − 3|,  k = 1,...,p

(projection pursuit approach), and the second part of Theorem 2 implies that the same components may be found simultaneously by maximizing

|E[(u_1' x_st)^4] − 3| + ··· + |E[(u_p' x_st)^4] − 3|.

In the engineering literature, these two approaches are well known and important special cases of the so-called deflation-based FastICA and symmetric FastICA; see, for example, Hyvärinen, Karhunen and Oja (2001). The statistical properties of these two estimation procedures will now be considered in detail.
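Behind Theorem 2(i) is the additivity of fourth cumulants: for independent standardized components and u'u = 1, E((u'z)^4) − 3 = Σ_j u_j^4 κ_j, so the bound follows from Σ_j u_j^4 ≤ (Σ_j u_j^2)^2 = 1. This makes the inequality easy to check numerically; a small sketch (our own illustration, with arbitrarily chosen kurtosis values):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
kappa = np.array([3.0, -1.2, 0.5])  # excess kurtoses of independent z_1,...,z_p

def excess_kurtosis(u, kappa):
    # E((u'z)^4) - 3 = sum_j u_j^4 kappa_j for independent standardized z, u'u = 1
    return np.sum(u ** 4 * kappa)

for _ in range(1000):
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)          # a random unit projection vector
    assert abs(excess_kurtosis(u, kappa)) <= np.max(np.abs(kappa)) + 1e-12
```

At u = ±e_i the bound is attained exactly, in line with the equality condition of the theorem.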

4.2 Projection Pursuit Approach—Deflation-Based FastICA

Assume that x is an observation from the IC model (2) and let again x_st = Σ^{-1/2}(x − μ) be the standardized random variable. Theorem 2(i) then suggests the following projection pursuit approach in searching for the independent components.

Definition 2. The deflation-based projection pursuit (or deflation-based FastICA) functional is W(F_x) = U Σ^{-1/2}, where Σ = Cov(x) and the rows of an orthogonal matrix U = (u_1,...,u_p)' are found one by one by maximizing

|E((u_k' x_st)^4) − 3|

under the constraint that u_k'u_k = 1 and u_j'u_k = 0, j = 1,...,k−1.

It is straightforward to see that W(F_x) is affine equivariant. In the independent component model (2), W(F_x)x has independent components if Assumption 3 is replaced by the following stronger assumption.

Assumption 4. The fourth moments of z_i exist, and at most one of the kurtosis values κ_k, k = 1,...,p, is zero.

Thus, under this assumption, W(F_x) is an independent component (IC) functional. Based on Theorem 2(i), the functional then finds the independent components in such an order that

|E((u_1' x_st)^4) − 3| ≥ ··· ≥ |E((u_p' x_st)^4) − 3|.

The solution order is unique if the kurtosis values are distinct.

The Lagrange multiplier technique can be used to obtain the estimating equations for U = (u_1,...,u_p)'. This is done in Ollila (2010) and Nordhausen et al. (2011), and the procedure is the following. After finding u_1,...,u_{k−1}, the solution u_k thus optimizes the Lagrangian function

L(u_k, θ_k) = |E((u_k' x_st)^4) − 3| − Σ_{j=1}^k θ_kj (u_j'u_k − δ_jk),

where θ_k = (θ_k1,...,θ_kk)' is the vector of Lagrangian multipliers and δ_jk = 1 (0) as j = k (j ≠ k) is the Kronecker delta. Write

T(u) = E[(u' x_st)^3 x_st].

The solution for u_k is then given by the p + k equations

4 π_k T(u_k) − Σ_{j=1}^{k−1} θ_kj u_j − 2 θ_kk u_k = 0  and  u_j'u_k = δ_jk,  j = 1,...,k,

where π_k = sign(κ_k). One then first finds the solutions for the Lagrange coefficients in θ_k, and substituting these into the first p equations, the following result is obtained.

Theorem 3. Write x_st = Σ^{-1/2}(x − μ) for the standardized random vector, and T(u) = E[(u' x_st)^3 x_st]. The orthogonal matrix U = (u_1,...,u_p)' solves the estimating equations

(u_k' T(u_k)) u_k = (I_p − Σ_{j=1}^{k−1} u_j u_j') T(u_k),  k = 1,...,p.

The theorem suggests the following fixed-point algorithm for the deflation-based solution. After finding u_1,...,u_{k−1}, the following two steps are repeated until convergence to get u_k:

Step 1: u_k ← (I_p − Σ_{j=1}^{k−1} u_j u_j') T(u_k),
Step 2: u_k ← ||u_k||^{-1} u_k.

The deflation-based estimate W(X) is obtained as above but by replacing the population quantities by the corresponding empirical ones. Without loss of generality, assume next that |κ_1| ≥ ··· ≥ |κ_p|. First note that, due to the affine equivariance of the estimate, W(X) = W(Z) Ω^{-1}. In the efficiency studies, it is therefore sufficient to consider Ŵ = W(Z) and the limiting distribution of √n(Ŵ − I_p) for a sequence Ŵ converging in probability to I_p. As the empirical and population criterion functions

D_n(u) = |n^{-1} Σ_{i=1}^n (u' x_{st,i})^4 − 3|  and  D(u) = |E[(u'z)^4] − 3|

are continuous and sup_{u'u=1} |D_n(u) − D(u)| →_P 0, one can choose a sequence of solutions such that û_1 →_P e_1, and similarly for û_2,...,û_{p−1}. Further, then also Ŵ = Û Ŝ^{-1/2} →_P I_p. One can next show that the limiting distribution of √n(Ŵ − I_p) is obtained if we only know the joint limiting distribution of √n(Ŝ − I_p) and √n off(R̂), where Ŝ = (ŝ_kl) is the sample covariance matrix, R̂ = (r̂_kl) is given in (1), and off(R̂) = R̂ − diag(R̂). We then have the following results; see also Ollila (2010) and Nordhausen et al. (2011).

Theorem 4. Let Z = (z_1,...,z_n) be a random sample from a distribution with finite eighth moments and satisfying the Assumptions 1, 2, and 4 with |κ_1| ≥ ··· ≥ |κ_p|. Then there exists a sequence of solutions such that Ŵ →_P I_p and

√n(ŵ_kk − 1) = −(1/2) √n(ŝ_kk − 1) + o_P(1),
√n ŵ_kl = (√n r̂_kl − (κ_k + 3) √n ŝ_kl)/κ_k + o_P(1),  k < l,  and
√n ŵ_kl = −√n ŵ_lk − √n ŝ_kl + o_P(1),  l < k.

Corollary 1. Under the assumptions of Theorem 4, the limiting distribution of √n vec(Ŵ − I_p) is a multivariate normal with zero mean vector and componentwise variances

ASV(ŵ_kk) = (κ_k + 2)/4,
ASV(ŵ_kl) = (σ_k^2 − (κ_k + 3)^2)/κ_k^2,  κ_k ≠ 0,  k < l,  and
ASV(ŵ_kl) = (σ_l^2 − (κ_l + 3)^2)/κ_l^2 + 1,  κ_l ≠ 0,  l < k.

Remark 2. Projection pursuit is used to reveal structures in the original data by selecting interesting low-dimensional orthogonal projections of interest. This is done, as above, by maximizing the value of an objective function (projection index). The term "projection pursuit" was first launched by Friedman and Tukey (1974). Huber (1985) considered projection indices with heuristic arguments that a projection is the more interesting, the less normal it is. All his indices were ratios of two scale functionals, that is, kurtosis functionals, with the classical kurtosis measure as a special case. He also discussed the idea of a recursive approach to find subspaces. Peña and Prieto (2001) used the projection pursuit algorithm with the classical kurtosis index for finding directions for cluster identification. For more discussion on the projection pursuit approach, see also Jones and Sibson (1987).

Remark 3. In the engineering literature, Hyvärinen and Oja (1997) were the first to propose the procedure based on the fourth moments, and later considered an extension with a choice among several alternative projection indices (measures of non-Gaussianity). The approach is called deflation-based or one-unit FastICA, and it is perhaps the most popular approach for the ICA problem in engineering applications. Note that the estimating equations in Theorem 3 and the resulting fixed-point algorithm do not fix the order of the components (the order is fixed by the original definition) and, as seen in Theorem 4, the limiting distribution of the estimate depends on the order in which the components are found. Using this property, Nordhausen et al. proposed a version of the procedure that (i) allows one to use different projection indices for different components and (ii) optimizes the order in which the components are extracted.

4.3 Symmetric Approach—Symmetric FastICA

In the symmetric approach, the rows of the matrix U are found simultaneously, and we have the following definition.

Definition 3. The symmetric projection pursuit (or symmetric FastICA) functional is W(F_x) = U Σ^{-1/2}, where Σ = Cov(x) and the orthogonal matrix U = (u_1,...,u_p)' maximizes

|E((u_1' x_st)^4) − 3| + ··· + |E((u_p' x_st)^4) − 3|

under the constraint that UU' = I_p.
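The deflation-based algorithm of Section 4.2 can be sketched in a few lines of NumPy. This is our own illustration, not the authors' code: it uses the classical kurtosis-based FastICA step T(u) − 3u, which, after the projection and normalization of Steps 1-2, has the same fixed points as the estimating equations of Theorem 3 but more reliable convergence for negative-kurtosis components.

```python
import numpy as np

def deflation_fastica(X, maxiter=100, tol=1e-8, seed=0):
    """Kurtosis-based deflation FastICA sketch for a p x n data matrix X."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    evals, evecs = np.linalg.eigh(np.cov(Xc))
    S_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # Sigma^{-1/2}
    Z = S_isqrt @ Xc                                     # standardized data
    U = np.zeros((p, p))
    for k in range(p):
        u = rng.standard_normal(p)
        u /= np.linalg.norm(u)
        for _ in range(maxiter):
            t = (Z @ (u @ Z) ** 3) / n - 3 * u           # empirical T(u) - 3u
            t -= U[:k].T @ (U[:k] @ t)                   # deflation: project out found rows
            u_new = t / np.linalg.norm(t)
            # the sign of u is not identified, so compare up to sign
            if min(np.linalg.norm(u_new - u), np.linalg.norm(u_new + u)) < tol:
                u = u_new
                break
            u = u_new
        U[k] = u
    return U @ S_isqrt                                   # unmixing estimate W = U Sigma^{-1/2}

# demo: mix a uniform, a Laplace and a centered exponential component
rng = np.random.default_rng(1)
n = 20000
Z0 = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), n),  # kappa = -1.2
                rng.laplace(0, 1 / np.sqrt(2), n),        # kappa = 3
                rng.exponential(1, n) - 1])               # kappa = 6
Omega = rng.standard_normal((3, 3))
W = deflation_fastica(Omega @ Z0)
G = W @ Omega   # close to a signed permutation matrix
```

In the demo, W Ω is close to a signed permutation matrix, reflecting the sign and order ambiguity of the IC model; note that this sketch does not enforce the kurtosis ordering of Definition 2.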
This optimization procedure is called symmetric FastICA in the signal processing community. The functional W(F_x) is again affine equivariant. Based on Theorem 2(ii), in the IC model with Assumption 4 the maximizer is unique up to the order and the signs of the rows of U, that is,

{z_1,...,z_p} = {±u_1' x_st,...,±u_p' x_st}.

As in the deflation-based case, we use the Lagrange multiplier technique to obtain the matrix U. The Lagrangian function to be optimized is now

L(U, Θ) = Σ_{k=1}^p |E((u_k' x_st)^4) − 3| − Σ_{k=1}^p θ_kk (u_k'u_k − 1) − Σ_{j=1}^{p−1} Σ_{k=j+1}^p θ_jk u_j'u_k,

where the symmetric matrix Θ = (θ_jk) contains all p(p + 1)/2 Lagrangian multipliers. Write again T(u) = E((u' x_st)^3 x_st). Then the solution U = (u_1,...,u_p)' satisfies

4 π_k T(u_k) = 2 θ_kk u_k + Σ_{j<k} θ_jk u_j + Σ_{j>k} θ_kj u_j,  k = 1,...,p,

and UU' = I_p. Solving the θ_jk and using the fact that θ_jk = θ_kj give π_k u_j' T(u_k) = π_j u_k' T(u_j), j, k = 1,...,p, and we get the following estimating equations.

Theorem 5. Let x_st = Σ^{-1/2}(x − μ) be the standardized random vector from the IC model (2), T(u) = E((u' x_st)^3 x_st), T(U) = (T(u_1),...,T(u_p))' and Π = diag(π_1,...,π_p). The estimating equations for the symmetric solution U are

U T(U)' Π = Π T(U) U'  and  UU' = I_p.

For the computation of U, the above estimating equations suggest a fixed-point algorithm with the updating step

U ← Π T (T'T)^{-1/2},  where T = T(U).

The symmetric version estimate W(X) is obtained by replacing the population quantities by their corresponding empirical ones in the estimating equations. Write again Ŵ = W(Z) and let Ŝ = (ŝ_kl) and R̂ = (r̂_kl) be as in (1). Then we have the following:

Theorem 6. Let Z = (z_1,...,z_n) be a random sample from a distribution of z satisfying the Assumptions 1, 2, and 4 with bounded eighth moments. Then there is a sequence of solutions such that Ŵ →_P I_p and

√n(ŵ_kk − 1) = −(1/2) √n(ŝ_kk − 1) + o_P(1)  and
√n ŵ_kl = (√n r̂_kl π_k − √n r̂_lk π_l − (κ_k π_k + 3 π_k − 3 π_l) √n ŝ_kl)/(|κ_k| + |κ_l|) + o_P(1),  k ≠ l,

where π_k = sign(κ_k).

Corollary 2. Under the assumptions of Theorem 6, the limiting distribution of √n vec(Ŵ − I_p) is a multivariate normal with zero mean vector and componentwise variances

ASV(ŵ_kk) = (κ_k + 2)/4  and
ASV(ŵ_kl) = (σ_k^2 + σ_l^2 − κ_k^2 − 6(κ_k + κ_l) − 18)/(|κ_k| + |κ_l|)^2,  k ≠ l.

Remark 4. The symmetric FastICA approach with other choices of projection indices was proposed in the engineering literature by Hyvärinen (1999). The computation of the symmetric FastICA estimate was done, as in our approach, by running p parallel one-unit algorithms, which were followed by a matrix orthogonalization step. A generalized symmetric FastICA algorithm that uses different projection indices for different components was proposed by Koldovský, Tichavský and Oja (2006). The asymptotical variances of the generalized symmetric FastICA estimates were derived in Tichavsky, Koldovsky and Oja (2006) under the assumption of symmetric independent component distributions.

5. MULTIVARIATE KURTOSIS AND INDEPENDENT COMPONENT ANALYSIS

5.1 Measures of Multivariate Skewness and Kurtosis

Let x be a p-variate random variable with mean vector μ and covariance matrix Σ, and write x_st = Σ^{-1/2}(x − μ). All the standardized third and fourth moments can now be collected into p × p^2 and p^2 × p^2 matrices

γ = E(x_st' ⊗ (x_st x_st'))  and  β = E((x_st x_st') ⊗ (x_st x_st')).

Unfortunately, these multivariate measures of skewness and kurtosis are not invariant under affine transformations: the transformation x → Ax + b induces, for some unspecified orthogonal matrix U, the transformations

x_st → U x_st,  γ → U γ (U' ⊗ U')  and  β → (U ⊗ U) β (U' ⊗ U').

Notice next that, for any p × p matrix A,

(3)  G(A) = E(x_st x_st' A x_st)  and  B(A) = E(x_st x_st' A x_st x_st')

provide selected p and p^2 linear combinations of the third and fourth moments as vec(G(A)) = γ vec(A) and vec(B(A)) = β vec(A). Further, the matrices

G^{ij} = G(E^{ij})  and  B^{ij} = B(E^{ij}),  i, j = 1,...,p,

list all possible third and fourth moments. Also,

G = G(I_p) = Σ_{i=1}^p G^{ii}  and  B = B(I_p) = Σ_{i=1}^p B^{ii}

appear to be natural measures of multivariate skewness and kurtosis. In the independent component model we then have the following straightforward result.

Theorem 7. At the distribution of z with independent components, E(z) = 0, Cov(z) = I_p, and κ_i = E(z_i^4) − 3, i = 1,...,p:

β = Σ_{i=1}^p κ_i (E^{ii} ⊗ E^{ii}) + I_{p,p} + J_{p,p} + K_{p,p},
B^{ij} = Σ_{k=1}^p κ_k (E^{kk} E^{ij} E^{kk}) + E^{ij} + E^{ji} + tr(E^{ij}) I_p,  i, j = 1,...,p,  and
B = Σ_{i=1}^p (κ_i + p + 2) E^{ii}.

Remark 5. The standardized third and fourth moments have been used as building bricks for invariant multivariate measures of skewness and kurtosis. The classical skewness and kurtosis measures by Mardia (1970) are

E((x_st' x̃_st)^3)  and  tr(B) = E((x_st' x_st)^2),

where x̃_st is an independent copy of x_st, whereas Móri, Rohatgi and Székely (1993) proposed, among others,

||G||^2 = E(x_st' x_st x_st' x̃_st x̃_st' x̃_st).

See also Kollo (2008) and Kollo and Srivastava (2004). In Sections 5.2 and 5.3, we first use B alone and then all the matrices B^{ij}, i, j = 1,...,p, together to find solutions to the independent component problem. In the signal processing literature, these approaches are called FOBI (fourth order blind identification) and JADE (joint approximate diagonalization of eigenmatrices), correspondingly.

5.2 Use of the Kurtosis Matrix B—FOBI

The independent component functional based on the covariance matrix Σ and the kurtosis matrix B defined in (3) is known as FOBI (fourth order blind identification) (Cardoso, 1989) in the engineering literature. It is one of the earliest approaches to the independent component problem and is defined as follows.

Definition 4. The FOBI functional is W(F_x) = U Σ^{-1/2}, where Σ = Cov(x) and the rows of U are the eigenvectors of B = E(x_st x_st' x_st x_st').

First recall that, in the independent component model, x_st = U'z for some orthogonal U. This implies that

B = E(x_st x_st' x_st x_st') = U' E(zz'zz') U,

where E(zz'zz') = Σ_{i=1}^p (κ_i + p + 2) E^{ii} is diagonal, and therefore the rows of U are the eigenvectors of B. The order of the eigenvectors is then given by the order of the corresponding eigenvalues, that is, by the kurtosis order. As W is also affine equivariant, it is an independent component functional if Assumption 3 is replaced by the following stronger assumption.

Assumption 5. The fourth moments of z exist, and the kurtosis values κ_1,...,κ_p are distinct.

Remark 6. Notice that Assumption 5 ⇒ Assumption 4 ⇒ Assumption 3. If Assumption 5 is not true and there are only m < p distinct kurtosis values, then only the corresponding m eigenspaces of B, not the individual independent components, can be identified.
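Definition 4 translates almost directly into code: whiten, form B as an average of ||x_st||^2 x_st x_st' (since x_st x_st' x_st x_st' = ||x_st||^2 x_st x_st'), and take its eigenvectors. A minimal sketch (our own illustration, not the authors' implementation), with the components ordered by decreasing eigenvalue:

```python
import numpy as np

def fobi(X):
    """FOBI sketch: W = U Sigma^{-1/2}, rows of U eigenvectors of B."""
    p, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    evals, evecs = np.linalg.eigh(np.cov(Xc))
    S_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # Sigma^{-1/2}
    Z = S_isqrt @ Xc                                     # standardized data
    r2 = np.sum(Z ** 2, axis=0)                          # ||x_st||^2 for each observation
    B = (Z * r2) @ Z.T / n                               # empirical B
    bvals, bvecs = np.linalg.eigh(B)
    U = bvecs[:, np.argsort(bvals)[::-1]].T              # rows = eigenvectors, kurtosis order
    return U @ S_isqrt

# demo with distinct kurtosis values (Assumption 5): uniform, normal, Laplace
rng = np.random.default_rng(2)
n = 100000
Z0 = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), n),
                rng.standard_normal(n),
                rng.laplace(0, 1 / np.sqrt(2), n)])
Omega = rng.standard_normal((3, 3))
G = fobi(Omega @ Z0) @ Omega   # close to a signed permutation matrix
```

Note that FOBI, unlike FastICA, tolerates one normal component here only because all three kurtosis values are distinct; with ties, Remark 6 applies.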

$\cdots > \kappa_p$. Then $\hat W \to_P I_p$ and
$$\sqrt{n}(\hat w_{kk} - 1) = -\frac{1}{2}\sqrt{n}(\hat s_{kk} - 1) + o_P(1)$$
and
$$\sqrt{n}\hat w_{kl} = \left(\sqrt{n}\hat r_{kl} + \sqrt{n}\hat r_{lk} + \sqrt{n}\sum_{m \neq k,l}\hat r_{mkl} - (\kappa_k + p + 4)\sqrt{n}\hat s_{kl}\right)\Big/(\kappa_k - \kappa_l) + o_P(1), \quad k \neq l.$$

For an alternative asymptotic presentation of the $\sqrt{n}\hat w_{kl}$, see Ilmonen, Nevalainen and Oja (2010). The joint limiting multivariate normality of $\sqrt{n}\operatorname{vec}(\hat S, \operatorname{off}(\hat R))$ then implies the following.

Corollary 3. Under the assumptions of Theorem 8, the limiting distribution of $\sqrt{n}\operatorname{vec}(\hat W - I_p)$ is a multivariate normal with zero mean vector and componentwise variances
$$\operatorname{ASV}(\hat w_{kk}) = (\kappa_k + 2)/4$$
and
$$\operatorname{ASV}(\hat w_{kl}) = \left(\sigma_k^2 + \sigma_l^2 - \kappa_k^2 - 6(\kappa_k + \kappa_l) - 22 + 2p + \sum_{j \neq k,l}\kappa_j\right)\Big/(\kappa_k - \kappa_l)^2, \quad k \neq l.$$

Remark 7. Let $x$ be a $p$-vector with mean vector $\mu$ and covariance matrix $\Sigma$. The FOBI procedure may then be seen also as a comparison of two scatter functionals, namely,
$$\operatorname{Cov}(x) = \Sigma \quad\text{and}\quad \operatorname{Cov}_4(x) = E((x-\mu)(x-\mu)'\Sigma^{-1}(x-\mu)(x-\mu)'),$$
and the FOBI functional then satisfies $W\operatorname{Cov}(x)W' = I_p$ and $W\operatorname{Cov}_4(x)W' \in \mathcal{D}$. Other independent component functionals are obtained if $\operatorname{Cov}$ and $\operatorname{Cov}_4$ are replaced by any scatter matrices with the independence property; see Oja, Sirkiä and Eriksson (2006) and Tyler et al. (2009).

5.3 Joint Use of Kurtosis Matrices B^{ij}—JADE

The approach in Section 5.2 was based on the fact that the kurtosis matrix $B$ is diagonal at $z$. As shown before, the fourth cumulant matrices
$$C^{ij} = B^{ij} - E^{ij} - (E^{ij})' - \operatorname{tr}(E^{ij})I_p, \quad i, j = 1, \dots, p,$$
are diagonal at $z$ as well, and the aim is to try to find an orthogonal matrix $U$ such that the matrices $UC^{ij}U'$, $i, j = 1, \dots, p$, are all "as diagonal as possible." In the engineering literature this approach is known as joint approximate diagonalization of eigenmatrices (JADE); see Cardoso and Souloumiac (1993). The functional is then defined as follows.

Definition 5. The JADE functional is $W(F_x) = U\Sigma^{-1/2}$, where $\Sigma = \operatorname{Cov}(x)$ and the orthogonal matrix $U$ maximizes
$$\sum_{i=1}^p\sum_{j=1}^p \|\operatorname{diag}(UC^{ij}U')\|^2.$$

First note that
$$\sum_{i=1}^p\sum_{j=1}^p \|\operatorname{diag}(UC^{ij}U')\|^2 + \sum_{i=1}^p\sum_{j=1}^p \|\operatorname{off}(UC^{ij}U')\|^2 = \sum_{i=1}^p\sum_{j=1}^p \|C^{ij}\|^2.$$
The solution thus minimizes the sum of squared off-diagonal elements of the $UC^{ij}U'$, $i, j = 1, \dots, p$. Notice that, at $z$, the only possible nonzero elements of $C^{ij}$, $i, j = 1, \dots, p$, are $(C^{ii})_{ii} = \kappa_i$. For the separation of the components, we therefore need Assumption 4 saying that at most one of the kurtosis values $\kappa_i$ is zero. The JADE functional $W(F_x)$ is an IC functional, as we prove in the following.

Theorem 9. (i) Write $x_{st} = \Sigma^{-1/2}(x - \mu)$ for the standardized random vector from the IC model (2) satisfying the Assumptions 1, 2, and 4. If $x_{st} = U'z$, then
$$D(V) = \sum_{i=1}^p\sum_{j=1}^p \|\operatorname{diag}(VC^{ij}V')\|^2, \quad V \in \mathcal{U},$$
is maximized by any $PJU$ where $P \in \mathcal{P}$ and $J \in \mathcal{J}$.
(ii) For any $F_x$ with finite fourth moments, $W(F_{Ax+b}) = PJW(F_x)A^{-1}$ for some $P \in \mathcal{P}$ and $J \in \mathcal{J}$.

In this case, the matrix $U = (u_1, \dots, u_p)'$ thus optimizes the Lagrangian function
$$L(U, \Theta) = \sum_{i=1}^p\sum_{j=1}^p\sum_{k=1}^p (u_k'C^{ij}u_k)^2 - \sum_{k=1}^p \theta_{kk}(u_k'u_k - 1) - \sum_{k=1}^{p-1}\sum_{l=k+1}^p \theta_{lk}u_k'u_l,$$
where the symmetric matrix $\Theta = (\theta_{ij})$ contains the $p(p+1)/2$ Lagrangian multipliers of the optimization problem. Write
$$T(u) = \sum_{i=1}^p\sum_{j=1}^p (u'C^{ij}u)C^{ij}u \quad\text{and}\quad T(U) = (T(u_1), \dots, T(u_p))'.$$
The Lagrangian function then yields the estimating equations
$$u_i'T(u_j) = u_j'T(u_i) \quad\text{and}\quad u_i'u_j = \delta_{ij}, \quad i, j = 1, \dots, p,$$
and the equations suggest a fixed-point algorithm with the steps $U \leftarrow T(T'T)^{-1/2}$. The estimating equations can also again be used to find the following asymptotical distribution of the JADE estimate $\hat W = W(Z)$.

Remark 9. The JADE estimate uses $p^2$ fourth moment matrices in order to be affine equivariant. Therefore, the computational load of JADE grows quickly with the number of components. Miettinen et al. (2013) suggested a quite similar, but faster method, called $k$-JADE. The $k$-JADE estimate at $F_x$ is $W = UW_0$, where $W_0$ is the FOBI estimate and the orthogonal matrix $U$ maximizes
$$\sum_{|i-j|<k} \|\operatorname{diag}(UC^{ij}U')\|^2.$$
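The claim that, at $z$, the only sizable elements of the cumulant matrices are $(C^{ii})_{ii} = \kappa_i$ is easy to check numerically. The sketch below computes sample versions of the $C^{ij}$ for a whitened sample of independent components; `cumulant_matrices` is our own hypothetical helper, not the authors' code, and it uses the representation $B^{ij} = E[x_i x_j\, x x']$.

```python
import numpy as np

def cumulant_matrices(Z):
    """Sample fourth cumulant matrices
    C^{ij} = B^{ij} - E^{ij} - E^{ji} - tr(E^{ij}) I_p (Section 5.3),
    with B^{ij} = E[x_i x_j x x'] estimated from an (n, p) sample Z."""
    n, p = Z.shape
    I = np.eye(p)
    C = np.empty((p, p, p, p))
    for i in range(p):
        for j in range(p):
            Bij = (Z * (Z[:, [i]] * Z[:, [j]])).T @ Z / n
            Eij = np.outer(I[:, i], I[:, j])
            C[i, j] = Bij - Eij - Eij.T - np.trace(Eij) * I
    return C

# at independent standardized components the only sizable entries
# should be (C^{ii})_{ii} = kappa_i (here roughly 6 and -1.2)
rng = np.random.default_rng(3)
n = 200000
Z = np.column_stack([rng.exponential(1.0, n) - 1.0,
                     rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)])
C = cumulant_matrices(Z)
```

The entries `C[0, 0, 0, 0]` and `C[1, 1, 1, 1]` estimate the excess kurtoses of the two components, while the off-diagonal blocks are close to zero, which is exactly what the joint diagonalization exploits.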

$$\sqrt{n}\hat w_{kl} = \frac{\kappa_k\sqrt{n}\hat r_{kl} - \kappa_l\sqrt{n}\hat r_{lk} + (3\kappa_l - 3\kappa_k - \kappa_k^2)\sqrt{n}\hat s_{kl}}{\kappa_k^2 + \kappa_l^2} + o_P(1), \quad k \neq l.$$

Corollary 4. Under the assumptions of Theorem 10, the limiting distribution of $\sqrt{n}\operatorname{vec}(\hat W - I_p)$ is a multivariate normal with zero mean vector and componentwise variances
$$\operatorname{ASV}(\hat w_{kk}) = (\kappa_k + 2)/4$$
and
$$\operatorname{ASV}(\hat w_{kl}) = \frac{\kappa_k^2(\sigma_k^2 - \kappa_k^2 - 6\kappa_k - 9) + \kappa_l^2(\sigma_l^2 - 6\kappa_l - 9)}{(\kappa_k^2 + \kappa_l^2)^2}, \quad k \neq l.$$

Remark 8. In the literature, there are several alternative algorithms available for an approximate diagonalization of several symmetric matrices, but the statistical properties of the corresponding estimates are not known. The most popular algorithm is perhaps the Jacobi rotation algorithm suggested in Clarkson (1988). It appeared in our simulations that the Jacobi rotation algorithm is computationally much faster and always provides the same solution as our fixed-point algorithm. The limiting distribution with variances and covariances of the elements of the JADE estimate (but without the standardization step) was considered also in Bonhomme and Robin (2009).

6. COMPARISON OF THE ASYMPTOTIC VARIANCES OF THE ESTIMATES

First notice that, for all estimates, $\sqrt{n}(W(X) - \Omega^{-1}) = \sqrt{n}(W(Z) - I_p)\Omega^{-1}$ and the comparisons can be made using $\hat W = W(Z)$ only. Second, for all estimates, $\sqrt{n}(\hat w_{kk} - 1) = -\frac{1}{2}\sqrt{n}(\hat s_{kk} - 1) + o_P(1)$, $k = 1, \dots, p$, and therefore the diagonal elements of $\hat W$ should not be used in the comparison. It is then natural to compare the estimates using the sum of asymptotic variances of the off-diagonal elements of $\hat W$, that is,
$$(4)\qquad \sum_{k=1}^{p-1}\sum_{l=k+1}^p (\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})).$$
Next note that, for all estimates except FOBI, the limiting variances of $\sqrt{n}\hat w_{kl}$, $k \neq l$, surprisingly depend only on the $k$th and $l$th marginal distributions (through $\kappa_k$, $\kappa_l$, $\sigma_k^2$, and $\sigma_l^2$) and do not depend either on the number or on the distributions of the other components. Based on the results in the earlier sections, we have the following conclusions:
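The asymptotic variance expressions above can be evaluated numerically. The sketch below codes the $\operatorname{ASV}(\hat w_{kl})$ of Corollary 4 for the JADE estimate and checks two Table 1 entries, assuming the moment conventions of the paper with $\sigma_k^2 = \operatorname{Var}(z_k^3)$; the helper name and the moment values for the standardized exponential and normal components are our own computations.

```python
def jade_asv_sum(kap_k, sig2_k, kap_l, sig2_l):
    """ASV(w_kl) + ASV(w_lk) for the JADE estimate, as in Corollary 4;
    sig2 denotes sigma^2 = Var(z^3) of the standardized component."""
    def asv(kk, sk, kl, sl):
        return (kk ** 2 * (sk - kk ** 2 - 6 * kk - 9)
                + kl ** 2 * (sl - 6 * kl - 9)) / (kk ** 2 + kl ** 2) ** 2
    return asv(kap_k, sig2_k, kap_l, sig2_l) + asv(kap_l, sig2_l, kap_k, sig2_k)

# standardized exponential: kappa = 6, Var(z^3) = mu_6 - mu_3^2 = 265 - 4 = 261
# standard normal: kappa = 0, Var(z^3) = E z^6 = 15
ex = (6.0, 261.0)
g = (0.0, 15.0)
print(jade_asv_sum(*ex, *ex))  # EX-EX entry of Table 1: 5.5
print(jade_asv_sum(*ex, *g))   # EX-G entry of Table 1: 11.0
```

The two printed values reproduce the JADE column of Table 1 for the EX–EX and EX–G pairs.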

1. $\sqrt{n}\hat w_{kl}$ of the symmetric FastICA estimate and that of the JADE estimate are asymptotically equivalent, that is, their difference converges to zero in probability, if the $k$th and $l$th marginal distributions are the same.

2. If the independent components are identically distributed, then the symmetric FastICA and JADE estimates are asymptotically equivalent. In this case, their criterium value (4) is one half of that of the deflation-based FastICA estimate. The FOBI estimate fails in this case.

3. $\operatorname{ASV}(\hat w_{kl})$ of the FOBI estimate is always larger than or equal to that for symmetric FastICA, $k \neq l$. This follows as $\kappa_k \geq -2$ for all $k$. The larger the other kurtosis values, the larger is the $\operatorname{ASV}(\hat w_{kl})$ of FOBI. The variances are equal when $p = 2$ and $\kappa_k > 0 > \kappa_l$.

4. $\sqrt{n}\hat w_{kp}$ of the deflation-based FastICA estimate and of the JADE estimate are asymptotically equivalent if the $p$th marginal distribution is normal.

The criterium value (4) is thus the sum of the pairwise terms $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$, which do not depend on the number or distributions of other components except for the FOBI estimate. So in most cases the comparison of the estimates can be made only through the values $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$. To make FOBI (roughly) comparable, we use the lower bound of the value $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$ with $\kappa_j = -2$, $j \neq k, l$; the lower bound is in fact the exact value in the bivariate case. In Table 1, the values $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$ are listed for pairs of independent components from the following five distributions: exponential distribution (EX), logistic distribution (L), uniform distribution (U), exponential power distribution with shape parameter value 4 (EP), and normal or Gaussian (G) distribution. The excess kurtosis values are $\kappa_{EX} = 6$, $\kappa_L = 1.2$, $\kappa_U = -1.2$, $\kappa_{EP} \approx -0.81$ and $\kappa_G = 0$, respectively.

Table 1
The values of $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$ for some selected $k$th and $l$th component distributions and for deflation-based FastICA (DFICA), symmetric FastICA (SFICA), FOBI, and JADE estimates. For FOBI, the lower bound of $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$ is used

            DFICA   SFICA    FOBI    JADE
EX–EX       11.00    5.50      ∞      5.50
EX–L        11.00    8.52   19.18    10.22
EX–U        11.00    7.69    7.69    10.17
EX–EP       11.00    8.63    8.63    10.61
EX–G        11.00   11.33   11.33    11.00
L–L         31.86   15.93      ∞     15.93
L–U         31.86    8.43    8.43     8.43
L–EP        31.86   12.38   12.38    15.63
L–G         31.86   40.19   40.19    31.86
U–U          1.86    0.93      ∞      0.93
U–EP         1.86    1.80   40.63     1.50
U–G          1.86   10.19   10.19     1.86
EP–EP        6.39    3.20      ∞      3.20
EP–G         6.39   34.61   34.61     6.39

The results in Table 1 are then nicely in accordance with our general notions above and show that none of the estimates outperforms all the other estimates. Further, in Figure 1, we plot the values $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$ when the independent components come from (i) the standardized (symmetric) exponential power distribution or (ii) from the standardized (skew) gamma distribution. The limiting variances then depend only on the shape parameters of the models. In the plot, the darker the point, the higher the value and the worse the estimate.

The density function for the exponential power distribution with zero mean and variance one and with shape parameter $\beta$ is
$$f(x) = \frac{\beta \exp(-(|x|/\alpha)^\beta)}{2\alpha\Gamma(1/\beta)},$$
where $\beta > 0$, $\alpha = (\Gamma(1/\beta)/\Gamma(3/\beta))^{1/2}$, and $\Gamma$ is the gamma function. Notice that $\beta = 2$ gives the normal (Gaussian) distribution, $\beta = 1$ gives the heavy-tailed Laplace distribution, and the density converges to an extremely low-tailed uniform density as $\beta \to \infty$. The family of skew distributions for the variables is coming from the gamma distribution with shape parameter $\alpha$ and shifted and rescaled to have mean value zero and variance one. For $\alpha = k/2$, the distribution is a chi-square distribution with $k$ degrees of freedom, $k = 1, 2, \dots$. For $\alpha = 1$, an exponential distribution is obtained, and the distribution is converging to a normal distribution as $\alpha \to \infty$.

For all estimates, Figure 1 shows that $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$ gets high values with $\beta$ close to 2 (normal distribution). Also, the variances are growing with increasing $\alpha$. The FOBI estimate is poor if the marginal kurtosis values are close to each other. The contours for the deflation-based FastICA estimate illustrate the fact that the criterium function $\operatorname{ASV}(\hat w_{12}) + \operatorname{ASV}(\hat w_{21})$ is not continuous at the points for which $\kappa_k + \kappa_l = 0$. This is due to the fact that the order in which the components are found changes at that point. The symmetric FastICA and
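The excess kurtosis of the exponential power family has a closed form that follows from the absolute moments $E|x|^k = \alpha^k\Gamma((k+1)/\beta)/\Gamma(1/\beta)$; the sketch below is our own derivation under those moment formulas, not a formula quoted from the paper.

```python
from math import gamma

def ep_excess_kurtosis(beta):
    """Excess kurtosis of the standardized exponential power density:
    Gamma(5/b) Gamma(1/b) / Gamma(3/b)^2 - 3, from the absolute-moment
    formula E|x|^k = alpha^k Gamma((k+1)/b) / Gamma(1/b)."""
    return gamma(5.0 / beta) * gamma(1.0 / beta) / gamma(3.0 / beta) ** 2 - 3.0

print(ep_excess_kurtosis(1.0))   # Laplace case (excess kurtosis 3)
print(ep_excess_kurtosis(2.0))   # Gaussian case (excess kurtosis approx. 0)
print(ep_excess_kurtosis(4.0))   # the EP column of Table 1 (approx. -0.81)
```

The values agree with the kurtosis figures used in the comparison, and for large $\beta$ the function approaches the uniform value $-1.2$, matching the limiting behavior described above.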

Fig. 1. Contour maps of $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$ for different estimates and for different independent component distributions. The distributions are either exponential power distributed (EP) or gamma distributed (Gamma) with varying shape parameter values. The estimates, from top to bottom, are deflation-based FastICA, symmetric FastICA, FOBI, and JADE. For FOBI, the lower bound of $\operatorname{ASV}(\hat w_{kl}) + \operatorname{ASV}(\hat w_{lk})$ is used. The lighter the color is, the lower is the variance.

JADE estimates are clearly the best estimates with minor differences.

7. DISCUSSION

Many popular methods to solve the independent component analysis problem are based on the use of univariate and multivariate fourth moments. Examples include FOBI (Cardoso, 1989), JADE (Cardoso and Souloumiac, 1993), and FastICA (Hyvärinen, 1999). In the engineering literature, these ICA methods have originally been formulated and regarded as algorithms only, and therefore the rigorous analysis and comparison of their statistical properties have been missing until very recently. The statistical properties of the deflation-based FastICA method were derived in Ollila (2010) and Nordhausen et al. (2011). The asymptotical behavior of the FOBI estimate was considered in Ilmonen, Nevalainen and Oja (2010), and the asymptotical distribution of the JADE estimate (without the standardization step) was considered in Bonhomme and Robin (2009). This paper describes in detail the independent component functionals based on fourth moments through corresponding optimization problems, estimating equations, fixed-point algorithms and the assumptions they need, and provides for the very first time the limiting statistical properties of the JADE estimate. Careful comparisons of the asymptotic variances revealed that, as was expected, JADE and the symmetric version of FastICA performed best in most cases. It was surprising, however, that the JADE and symmetric FastICA estimates are asymptotically equivalent if the components are identically distributed. The only noteworthy difference between these two estimators appeared when one of the components has a normal distribution. Then JADE outperforms symmetric FastICA. Recall that JADE requires the computation of $p^2$ matrices of size $p \times p$ and, thus, the use of JADE becomes impractical with a large number of independent components. On the other hand, FastICA estimates are sometimes difficult to find due to convergence problems of the algorithms, when the sample size is small.

In this paper we considered only the most basic IC model, where the number of independent components equals the observed dimension and where no additive noise is present. In further research we will consider also these cases. Note that some properties of JADE for noisy ICA were considered in Bonhomme and Robin (2009).

APPENDIX: PROOFS OF THE THEOREMS

Proof of Theorem 1. Let $\Omega = ODV'$ be the singular value decomposition of the full-rank $\Omega$. Then $\Sigma = \Omega\Omega' = OD^2O'$, and $\Sigma^{-1/2} = OJD^{-1}O'$ for some $J \in \mathcal{J}$. ($J$ is needed to make $\Sigma^{-1/2}$ positive definite.) Then
$$x_{st} = \Sigma^{-1/2}(x - \mu) = OJD^{-1}O'ODV'z = OJV'z = Uz$$
with an orthogonal $U = OJV'$. $\square$

Proof of Theorem 2. If $u'u = 1$, then it is straightforward to see that
$$E[(u'z)^4] - 3 = \sum_{i=1}^p u_i^4[E(z_i^4) - 3].$$
It then easily follows that
$$|E[(u'z)^4] - 3| \leq \sum_{i=1}^p u_i^4|E(z_i^4) - 3| \leq \max_{i=1,\dots,p}|E(z_i^4) - 3|$$
and that, for any orthogonal $U = (u_1, \dots, u_p)'$,
$$\sum_{j=1}^p |E[(u_j'z)^4] - 3| \leq \sum_{j=1}^p\sum_{i=1}^p u_{ji}^4|E(z_i^4) - 3| \leq \sum_{i=1}^p |E(z_i^4) - 3|.$$
For the first result, see also Lemma 2 in Bugrien and Kent (2005). $\square$

Proof of Theorem 6. As the functions
$$D_n(U) = \sum_{j=1}^p \left|n^{-1}\sum_{i=1}^n (u_j'x_{st,i})^4 - 3\right| \quad\text{and}\quad D(U) = \sum_{j=1}^p |E[(u_j'z)^4] - 3|$$
are continuous and $D_n(U) \to_P D(U)$ for all $U$, then, due to the compactness of $\mathcal{U}$, also
$$\sup_{U \in \mathcal{U}} |D_n(U) - D(U)| \to_P 0.$$
$D(U)$ attains its maximum at any $JP$, where $J \in \mathcal{J}$ and $P \in \mathcal{P}$. This further implies that there is a sequence of maximizers that satisfy $\hat U \to_P I_p$, and therefore also $\hat W = \hat U\hat S^{-1/2} \to_P I_p$.
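The key identity in the proof of Theorem 2, $E[(u'z)^4] - 3 = \sum_i u_i^4[E(z_i^4) - 3]$ for $u'u = 1$, can be checked exactly by enumeration when the independent components are discrete. The example distributions below (a Rademacher component and a three-point component) are our own choices, not taken from the paper.

```python
import itertools

# two standardized independent discrete components: z1 Rademacher
# (kappa_1 = -2) and z2 three-point on {-3, 0, 3} with P(+-3) = 1/18
# (kappa_2 = 6)
z1 = [(-1.0, 0.5), (1.0, 0.5)]
z2 = [(-3.0, 1.0 / 18.0), (0.0, 16.0 / 18.0), (3.0, 1.0 / 18.0)]

u = (0.6, 0.8)  # a unit vector, u'u = 1

# exact expectation E[(u'z)^4] by enumerating the joint support
m4 = sum(p1 * p2 * (u[0] * v1 + u[1] * v2) ** 4
         for (v1, p1), (v2, p2) in itertools.product(z1, z2))

lhs = m4 - 3.0                                # E[(u'z)^4] - 3
rhs = u[0] ** 4 * (-2.0) + u[1] ** 4 * 6.0    # sum_i u_i^4 kappa_i
```

Both sides equal the same value, and its absolute value is bounded by $\max_i|\kappa_i| = 6$, in line with the first inequality of the proof.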

For the estimate $\hat W$, the estimating equations are
$$\hat w_k'\hat T(\hat w_l)\hat\pi_l = \hat w_l'\hat T(\hat w_k)\hat\pi_k \quad\text{and}\quad \hat w_k'\hat S\hat w_l = \delta_{kl}, \quad k, l = 1, \dots, p,$$
where $\hat T(\hat w_k) = n^{-1}\sum_i (\hat w_k'(z_i - \bar z))^3(z_i - \bar z)$. It is straightforward to see that the second set of estimating equations gives
$$(5)\qquad \sqrt{n}(\hat w_{kk} - 1) = -\frac{1}{2}\sqrt{n}(\hat s_{kk} - 1) + o_P(1) \quad\text{and}\quad \sqrt{n}(\hat w_{kl} + \hat w_{lk}) = -\sqrt{n}\hat s_{kl} + o_P(1).$$
Consider then the first set of estimating equations for $k \neq l$. To shorten the notation, write $\hat T(\hat w_k) = \hat T_k$. Now
$$\sqrt{n}\hat w_k'\hat T_l = \sqrt{n}(\hat w_k - e_k)'\hat T_l + \sqrt{n}e_k'(\hat T_l - \beta_l e_l).$$
Using equation (2) in Nordhausen et al. (2011) and Slutsky's theorem, the above equation reduces to
$$\sqrt{n}\hat w_k'\hat T_l\hat\pi_l = \bigl(\sqrt{n}(\hat w_k - e_k)'\beta_l e_l + e_k'(\sqrt{n}\hat T_l^* - \gamma_l e_l e_l'\sqrt{n}\bar z + \Delta_l\sqrt{n}(\hat w_l - e_l))\bigr)\pi_l + o_P(1),$$
where $\hat T_l^* = n^{-1}\sum_i ((e_l'z_i)^3 - \gamma_l)z_i$ and $\Delta_l = 3E[(e_l'z_i)^2 z_i z_i']$. According to our estimating equation, the above expression should be equivalent to
$$\sqrt{n}\hat w_l'\hat T_k\hat\pi_k = \bigl(\sqrt{n}(\hat w_l - e_l)'\beta_k e_k + e_l'(\sqrt{n}\hat T_k^* - \gamma_k e_k e_k'\sqrt{n}\bar z + \Delta_k\sqrt{n}(\hat w_k - e_k))\bigr)\pi_k + o_P(1).$$
This further implies that
$$(\beta_l\pi_l - 3\pi_k)\sqrt{n}\hat w_{kl} - (\beta_k\pi_k - 3\pi_l)\sqrt{n}\hat w_{lk} = \sqrt{n}(\hat r_{kl}\pi_k + \hat r_{lk}\pi_l) + o_P(1),$$
where $\hat r_{kl} = n^{-1}\sum_i (z_{ik}^3 - \gamma_k)z_{il}$. Now using (5), we have that
$$(\beta_l\pi_l - 3\pi_k)\sqrt{n}\hat w_{kl} + (\beta_k\pi_k - 3\pi_l)(\sqrt{n}\hat s_{kl} + \sqrt{n}\hat w_{kl}) = \sqrt{n}(\hat r_{kl}\pi_k + \hat r_{lk}\pi_l) + o_P(1).$$
Thus,
$$(|\beta_k - 3| + |\beta_l - 3|)\sqrt{n}\hat w_{kl} = \sqrt{n}(\hat r_{kl}\pi_k + \hat r_{lk}\pi_l) + (3\pi_l - \beta_k\pi_k)\sqrt{n}\hat s_{kl} + o_P(1),$$
which gives the desired result. $\square$

Proof of Theorem 8. As mentioned in Remark 7, the FOBI functional diagonalizes the scatter matrices $\operatorname{Cov}(x)$ and $\operatorname{Cov}_4(x) = E[(x-\mu)(x-\mu)'\Sigma^{-1}(x-\mu)(x-\mu)']$ simultaneously. Then $\operatorname{Cov}(z) = I_p$ and $\operatorname{Cov}_4(z) = D$ with strictly decreasing diagonal elements. Next write $\hat S$ and $\hat S_4$ for the empirical scatter matrices. Then $\hat S \to_P I_p$ and $\hat S_4 \to_P D$, and, as $\hat W$ is a continuous function of $(\hat S, \hat S_4)$ in a neighborhood of $(I_p, D)$, also $\hat W \to_P I_p$.

Let $\tilde Z = (\tilde z_1, \dots, \tilde z_n) = (z_1 - \bar z, \dots, z_n - \bar z)$ denote the centered sample. Then
$$\sqrt{n}(\hat S_4 - D) = n^{-1/2}\sum_{i=1}^n (\tilde z_i\tilde z_i'\hat S^{-1}\tilde z_i\tilde z_i' - D) = -n^{-1}\sum_{i=1}^n \tilde z_i\tilde z_i'\sqrt{n}(\hat S - I_p)\tilde z_i\tilde z_i' + n^{-1/2}\sum_{i=1}^n (\tilde z_i\tilde z_i'\tilde z_i\tilde z_i' - D),$$
where the $(k,l)$ element, $k \neq l$, of the first matrix is
$$-n^{-1}\sum_{i=1}^n \tilde z_{ki}\tilde z_i'\sqrt{n}(\hat S - I_p)\tilde z_i\tilde z_{li} = -2n^{-1}\sum_{i=1}^n \tilde z_{ki}^2\tilde z_{li}^2\sqrt{n}\hat s_{kl} + o_P(1) = -2\sqrt{n}\hat s_{kl} + o_P(1),$$
and the $(k,l)$ element of the second matrix is
$$n^{-1/2}\sum_{i=1}^n \tilde z_{ki}^3\tilde z_{li} + n^{-1/2}\sum_{i=1}^n \tilde z_{ki}\tilde z_{li}^3 + n^{-1/2}\sum_{i=1}^n \sum_{m \neq k,l} \tilde z_{mi}^2\tilde z_{ki}\tilde z_{li}.$$
Thus,
$$\sqrt{n}(\hat S_4)_{kl} = \sqrt{n}\hat r_{kl} + \sqrt{n}\hat r_{lk} + \sqrt{n}\sum_{m \neq k,l}\hat r_{mkl} - 2\sqrt{n}\hat s_{kl} + o_P(1).$$
Then Theorem 3.1 of Ilmonen, Nevalainen and Oja (2010) gives
$$\sqrt{n}\hat w_{kl} = \frac{\sqrt{n}(\hat S_4)_{kl} - (\kappa_k + p + 2)\sqrt{n}\hat s_{kl}}{(\kappa_k + p + 2) - (\kappa_l + p + 2)} + o_P(1) = \left(\sqrt{n}\hat r_{kl} + \sqrt{n}\hat r_{lk} + \sqrt{n}\sum_{m \neq k,l}\hat r_{mkl} - (\kappa_k + p + 4)\sqrt{n}\hat s_{kl}\right)\Big/(\kappa_k - \kappa_l) + o_P(1). \qquad\square$$
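The estimating equations above suggest a symmetric fixed-point iteration. The following is a rough sketch of symmetric FastICA with the pow3 nonlinearity, written with our own starting value and a fixed iteration count rather than the authors' algorithm; for whitened data $E[g'(u'z)] = 3$ for $g(x) = x^3$, which gives the update used below.

```python
import numpy as np

def sym_fastica_pow3(X, iters=200, seed=1):
    """Symmetric FastICA sketch with g(x) = x^3 on an (n, p) data
    matrix X: whiten, then iterate a symmetric orthogonalization of
    the rows of T, with row k of T being E[(u_k'z)^3 z'] - 3 u_k'."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / n)
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ S_inv_sqrt
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((p, p)))
    for _ in range(iters):
        Y = Z @ U.T                       # rows u_k' z_i
        T = (Y ** 3).T @ Z / n - 3.0 * U  # estimating-equation side
        evals, evecs = np.linalg.eigh(T @ T.T)
        U = evecs @ np.diag(evals ** -0.5) @ evecs.T @ T  # orthogonalize rows
    return U @ S_inv_sqrt

rng = np.random.default_rng(2)
n = 20000
S = np.column_stack([rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n),
                     rng.exponential(1.0, n) - 1.0])
A = np.array([[3.0, 1.0], [1.0, 2.0]])
G = sym_fastica_pow3(S @ A.T) @ A   # near a signed permutation matrix
```

As with FOBI, success is read off the gain matrix: each row should be dominated by a single entry up to sign and permutation.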

To prove Theorem 9, we need the following lemma.

Lemma 1. Denote
$$C(x, A) = E[(x'Ax)xx'] - A - A' - \operatorname{tr}(A)I_p \quad\text{and}\quad C^{ij}(x) = E[(x'E^{ij}x)xx'] - E^{ij} - E^{ji} - \operatorname{tr}(E^{ij})I_p,$$
where $E^{ij} = e_ie_j'$, $i, j = 1, \dots, p$. Then $C(x, A)$ is additive in $A = (a_{ij})$, that is,
$$C(x, A) = \sum_{i=1}^p\sum_{j=1}^p a_{ij}C^{ij}(x).$$
Also, for an orthogonal $U$, it holds that
$$C(Ux, A) = UC(x, U'AU)U'.$$

Proof. For additivity, it is straightforward to see that, for all $A$, $A_1$, $A_2$, and $b$,
$$C(x, bA) = bC(x, A) \quad\text{and}\quad C(x, A_1 + A_2) = C(x, A_1) + C(x, A_2).$$
For orthogonal $U$, we obtain
$$C(Ux, A) = E[(x'U'AUx)(Uxx'U')] - A - A' - \operatorname{tr}(A)I_p = U(E[(x'(U'AU)x)xx'] - (U'AU) - (U'AU)' - \operatorname{tr}(U'AU)I_p)U' = UC(x, U'AU)U'. \qquad\square$$

Proof of Theorem 9. (i) First notice that
$$C^{ij}(z) = 0, \quad\text{for } i, j = 1, \dots, p \text{ and } i \neq j, \quad\text{and}\quad C^{ii}(z) = \kappa_iE^{ii}, \quad\text{for } i = 1, \dots, p.$$
It then follows that, for an orthogonal $U = (u_1, \dots, u_p)$,
$$C^{ij}(U'z) = U'C(z, UE^{ij}U')U = U'C\left(z, \sum_{k=1}^p\sum_{l=1}^p u_{ki}u_{lj}E^{kl}\right)U = U'\left(\sum_{k=1}^p\sum_{l=1}^p u_{ki}u_{lj}C(z, E^{kl})\right)U = U'\left(\sum_{k=1}^p \kappa_ku_{ki}u_{kj}E^{kk}\right)U.$$
Now
$$D(V) = \sum_{i=1}^p\sum_{j=1}^p \left\|\operatorname{diag}\left(VU'\left(\sum_{k=1}^p \kappa_ku_{ki}u_{kj}E^{kk}\right)(VU')'\right)\right\|^2.$$
If we write $G = VU' = (g_1, \dots, g_p)'$, then $D(V)$ simplifies to
$$D(V) = \sum_{i=1}^p\sum_{j=1}^p\sum_{k=1}^p\left(\sum_{l=1}^p g_{kl}^2\kappa_lu_{li}u_{lj}\right)^2 = \sum_{k=1}^p\sum_{l=1}^p\sum_{l^*=1}^p g_{kl}^2g_{kl^*}^2\kappa_l\kappa_{l^*}\left(\sum_{i=1}^p u_{li}u_{l^*i}\right)\left(\sum_{j=1}^p u_{lj}u_{l^*j}\right) = \sum_{k=1}^p\sum_{l=1}^p g_{kl}^4\kappa_l^2,$$
which is maximized by $V = PJU$ for any $P \in \mathcal{P}$ and $J \in \mathcal{J}$.

(ii) Write $y = Ax + b$, and let $\mu$ and $\Sigma$ denote the mean vector and covariance matrix of $x$, respectively. As
$$(A\Sigma A')^{-1/2}(A\Sigma A')(A\Sigma A')^{-1/2} = I_p,$$
we have that $(A\Sigma A')^{-1/2}A = Q\Sigma^{-1/2}$ for some $Q \in \mathcal{U}$, and therefore $y_{st} = Qx_{st}$ with the same $Q \in \mathcal{U}$. We thus define $W(F_x) = U\Sigma^{-1/2}$, where $U$ maximizes the function
$$D_{x_{st}}(V) = \sum_{i=1}^p\sum_{j=1}^p \|\operatorname{diag}(VC^{ij}(x_{st})V')\|^2.$$
The maximizer $U$ is not unique, as the maximum is then attained for any $PJU$ where $P \in \mathcal{P}$ and $J \in \mathcal{J}$. Consider next the criterium function for the standardized transformed random variable $y_{st}$. Then
$$D_{y_{st}}(V) = \sum_{i=1}^p\sum_{j=1}^p \|\operatorname{diag}(VC^{ij}(Qx_{st})V')\|^2 = \sum_{i=1}^p\sum_{j=1}^p \|\operatorname{diag}(VQC(x_{st}, Q'E^{ij}Q)Q'V')\|^2.$$
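The equivariance $C(Ux, A) = UC(x, U'AU)U'$ of Lemma 1 holds observation by observation, so it can be verified exactly on a sample version of $C(x, A)$. The helper below is our own notation for that sample analogue, not the authors' code.

```python
import numpy as np

def C_hat(X, A):
    """Sample analogue of C(x, A) = E[(x'Ax) xx'] - A - A' - tr(A) I_p
    from Lemma 1; X is an (n, p) matrix with observations as rows."""
    n, p = X.shape
    w = np.einsum('ni,ij,nj->n', X, A, X)   # quadratic forms x_i' A x_i
    return (X * w[:, None]).T @ X / n - A - A.T - np.trace(A) * np.eye(p)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
A = rng.standard_normal((3, 3))
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # a random orthogonal matrix

lhs = C_hat(X @ U.T, A)                 # C(Ux, A): each row x_i mapped to U x_i
rhs = U @ C_hat(X, U.T @ A @ U) @ U.T   # U C(x, U'AU) U'
```

The two matrices agree to machine precision for any sample, since the identity is algebraic rather than distributional.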

If we write $G = VQ = (g_1, \dots, g_p)'$, then
$$D_{y_{st}}(V) = \sum_{i=1}^p\sum_{j=1}^p\sum_{k=1}^p (g_k'C(x_{st}, Q'E^{ij}Q)g_k)^2 = \sum_{i,j,k,l,l^*,m,m^*,s,s^*,t,t^*} g_{ks}g_{kt}g_{ks^*}g_{kt^*}\,q_{il}q_{jm}q_{il^*}q_{jm^*}\,C(x_{st}, E^{lm})_{st}\,C(x_{st}, E^{l^*m^*})_{s^*t^*}$$
$$= \sum_{k,l,m,s,s^*,t,t^*} g_{ks}g_{kt}g_{ks^*}g_{kt^*}\,C(x_{st}, E^{lm})_{st}\,C(x_{st}, E^{lm})_{s^*t^*} = D_{x_{st}}(G),$$
using the additivity of Lemma 1 and the orthogonality of $Q$, that is, $\sum_i q_{il}q_{il^*} = \delta_{ll^*}$ and $\sum_j q_{jm}q_{jm^*} = \delta_{mm^*}$. Hence, $D_{y_{st}}(V) = D_{x_{st}}(VQ) \leq D_{x_{st}}(U)$, with equality if $V = PJUQ'$ for any $P \in \mathcal{P}$ and $J \in \mathcal{J}$. Thus, $W(F_y) = PJUQ'Q\Sigma^{-1/2}A^{-1} = PJU\Sigma^{-1/2}A^{-1}$ for any $P \in \mathcal{P}$ and $J \in \mathcal{J}$. $\square$

For Theorem 10 we need the following lemma.

Lemma 2. Assume that $\hat S_k$, $k = 1, \dots, K$, are $p \times p$ matrices such that $\sqrt{n}(\hat S_k - \Lambda_k)$ are asymptotically normal with mean zero and $\Lambda_k = \operatorname{diag}(\lambda_{k1}, \dots, \lambda_{kp})$. Let $\hat U = (\hat u_1, \dots, \hat u_p)$ be the orthogonal matrix that maximizes
$$\sum_{k=1}^K \|\operatorname{diag}(\hat U'\hat S_k\hat U)\|^2.$$
Then
$$\sqrt{n}\hat u_{ij} = \frac{\sum_{k=1}^K (\lambda_{ki} - \lambda_{kj})\sqrt{n}(\hat S_k)_{ij}}{\sum_{k=1}^K (\lambda_{ki} - \lambda_{kj})^2} + o_P(1).$$

Proof. The proof is similar to the proof of Theorem 4.1 of Miettinen et al. (2014b). $\square$

Proof of Theorem 10. As the criterium functions
$$D_n(U) = \sum_{i=1}^p\sum_{j=1}^p \|\operatorname{diag}(U\hat C^{ij}U')\|^2 \quad\text{and}\quad D(U) = \sum_{i=1}^p\sum_{j=1}^p \|\operatorname{diag}(UC^{ij}U')\|^2$$
are continuous and $D_n(U) \to_P D(U)$ for all $U$, then, due to the compactness of $\mathcal{U}$,
$$\sup_{U \in \mathcal{U}}|D_n(U) - D(U)| \to_P 0.$$
$D(U)$ attains its maximum at any $JP$ where $J \in \mathcal{J}$ and $P \in \mathcal{P}$. This further implies that there is a sequence of maximizers that satisfy $\hat U \to_P I_p$, and therefore also $\hat W = \hat U\hat S^{-1/2} \to_P I_p$.

Let $\tilde Z = (\tilde z_1, \dots, \tilde z_n) = (z_1 - \bar z, \dots, z_n - \bar z)$ denote the centered sample, and write
$$\hat\beta = \beta(\tilde Z) = n^{-1}\sum_{i=1}^n (\tilde z_i\tilde z_i') \otimes (\tilde z_i\tilde z_i').$$
As the eighth moments of $z$ exist, $\sqrt{n}(\hat\beta - \beta)$ is asymptotically normal with the expected value zero, and $\beta$ as in Theorem 7.

Consider first a general sample whitening matrix $\hat V$ satisfying $\sqrt{n}(\hat V - I_p) = O_P(1)$. For the whitened data we obtain
$$\tilde\beta = \beta(\hat V\tilde Z) = n^{-1}\sum_{i=1}^n (\hat V\tilde z_i\tilde z_i'\hat V') \otimes (\hat V\tilde z_i\tilde z_i'\hat V') = (\hat V \otimes \hat V)\hat\beta(\hat V' \otimes \hat V'),$$
and, further,
$$\sqrt{n}(\tilde\beta - \beta) = \sqrt{n}(\hat\beta - \beta) + [(\sqrt{n}(\hat V - I_p) \otimes I_p) + (I_p \otimes \sqrt{n}(\hat V - I_p))]\beta + \beta[(\sqrt{n}(\hat V' - I_p) \otimes I_p) + (I_p \otimes \sqrt{n}(\hat V' - I_p))] + o_P(1).$$

Write next
$$\hat B^{kl} = B(E^{kl}, \tilde Z) = n^{-1}\sum_{i=1}^n (\tilde z_i\tilde z_i'E^{kl}\tilde z_i\tilde z_i') \quad\text{and}\quad \hat T^{kl} = \operatorname{vec}(\hat B^{kl}) = \hat\beta\operatorname{vec}(E^{kl}).$$
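The transformation rule $\beta(\hat V\tilde Z) = (\hat V \otimes \hat V)\hat\beta(\hat V' \otimes \hat V')$ used above is again an exact per-sample identity of Kronecker products, so it can be checked directly. The helper below is a sketch in our own notation.

```python
import numpy as np

def beta_hat(Z):
    """Sample version of beta = E[(zz') kron (zz')], as used in the
    proof of Theorem 10; rows of the (n, p) matrix Z are observations."""
    return sum(np.kron(np.outer(z, z), np.outer(z, z)) for z in Z) / len(Z)

rng = np.random.default_rng(5)
Z = rng.standard_normal((50, 3))
V = rng.standard_normal((3, 3))

lhs = beta_hat(Z @ V.T)   # beta of the transformed rows V z_i
rhs = np.kron(V, V) @ beta_hat(Z) @ np.kron(V.T, V.T)
```

Since $(Vzz'V') \otimes (Vzz'V') = (V \otimes V)(zz' \otimes zz')(V' \otimes V')$ for every observation, the two $p^2 \times p^2$ matrices agree to machine precision.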

Then $\sqrt{n}(\hat T^{kl} - \operatorname{vec}(B^{kl}))$ is asymptotically normal with expected value zero and $B^{kl}$ as given in Theorem 7. Also,
$$\sqrt{n}\hat b^{kk}_{kl} = \sqrt{n}(\hat B^{kk})_{kl} = \sqrt{n}\hat r_{kl} + o_P(1).$$
Next, let
$$\tilde B^{kl} = B(E^{kl}, \hat V\tilde Z) = n^{-1}\sum_{i=1}^n (\hat V\tilde z_i\tilde z_i'\hat V'E^{kl}\hat V\tilde z_i\tilde z_i'\hat V') \quad\text{and}\quad \tilde T^{kl} = \operatorname{vec}(\tilde B^{kl}) = \tilde\beta\operatorname{vec}(E^{kl})$$
denote the standardized counterparts of $\hat B^{kl}$ and $\hat T^{kl}$, respectively. Then
$$\sqrt{n}(\tilde T^{kl} - \operatorname{vec}(B^{kl})) = \sqrt{n}(\hat T^{kl} - \operatorname{vec}(B^{kl})) + [(\sqrt{n}(\hat V - I_p) \otimes I_p) + (I_p \otimes \sqrt{n}(\hat V - I_p))]\operatorname{vec}(B^{kl}) + \beta[(\sqrt{n}(\hat V - I_p) \otimes I_p) + (I_p \otimes \sqrt{n}(\hat V - I_p))]\operatorname{vec}(E^{kl}).$$
It turns out that, for the asymptotics of $\hat W$, we only need
$$(6)\qquad \sqrt{n}(\tilde B^{kk} - B^{kk})_{kl} = \sqrt{n}(\hat B^{kk} - B^{kk})_{kl} + 3\sqrt{n}(\hat V - I_p)_{kl} + (\kappa_k + 3)\sqrt{n}(\hat V - I_p)_{lk}$$
and
$$(7)\qquad \sqrt{n}(\tilde B^{ll} - B^{ll})_{kl} = \sqrt{n}(\hat B^{ll} - B^{ll})_{kl} + 3\sqrt{n}(\hat V - I_p)_{lk} + (\kappa_l + 3)\sqrt{n}(\hat V - I_p)_{kl}.$$
Next, note that in the JADE procedure the matrices to be diagonalized are
$$\tilde C^{kl} = \tilde B^{kl} - E^{kl} - E^{lk} - \operatorname{tr}(E^{kl})I_p, \quad k, l = 1, \dots, p.$$
As $\sqrt{n}(\operatorname{vec}(\hat C^{kl}) - \operatorname{vec}(C^{kl}))$ are asymptotically normal with mean zero and $C^{kl} = 0$, for $k \neq l$, and $C^{kk} = \kappa_kE^{kk}$, then by Lemma 2, $\sqrt{n}\hat u_{kl}$ reduces to
$$(8)\qquad \sqrt{n}\hat u_{kl} = \frac{\kappa_k\sqrt{n}\tilde c^{kk}_{kl} - \kappa_l\sqrt{n}\tilde c^{ll}_{kl}}{\kappa_k^2 + \kappa_l^2} + o_P(1) = \frac{\kappa_k\sqrt{n}\tilde b^{kk}_{kl} - \kappa_l\sqrt{n}\tilde b^{ll}_{kl}}{\kappa_k^2 + \kappa_l^2} + o_P(1),$$
where $\tilde c^{kk}_{kl} = (\tilde C^{kk})_{kl}$ and $\tilde b^{kk}_{kl} = (\tilde B^{kk})_{kl}$. So, asymptotically, all the information is in the matrices $\tilde B^{kk}$, $k = 1, \dots, p$. As $\hat W = \hat U\hat V$, where $\hat U$ and $\hat V$ are the rotation matrix and the whitening matrix, respectively, we have that
$$\sqrt{n}(\hat W - I_p) = \sqrt{n}(\hat U\hat V - I_p) = \sqrt{n}(\hat U - I_p) + \sqrt{n}(\hat V - I_p) + o_P(1).$$
The asymptotics of the regular JADE unmixing matrix is then obtained with $\hat V = \hat S^{-1/2}$, where $\hat S$ is the sample covariance matrix. Notice first that
$$\sqrt{n}(\hat S^{-1/2} - I_p) = -\frac{1}{2}\sqrt{n}(\hat S - I_p) + o_P(1).$$
Then substituting (6) and (7) into (8), we have that, for $k \neq l$,
$$\sqrt{n}\hat w_{kl} = \frac{\kappa_k\sqrt{n}\hat r_{kl} - \kappa_l\sqrt{n}\hat r_{lk} + (3\kappa_l - 3\kappa_k - \kappa_k^2)\sqrt{n}\hat s_{kl}}{\kappa_k^2 + \kappa_l^2} + o_P(1).$$
For the diagonal elements we have simply
$$\sqrt{n}(\hat w_{kk} - 1) = -\frac{1}{2}\sqrt{n}(\hat s_{kk} - 1) + o_P(1). \qquad\square$$

ACKNOWLEDGMENTS

The authors wish to thank the editors and the referees for valuable comments and suggestions. Research supported in part by the Academy of Finland (Grants 251965, 256291 and 268703).

REFERENCES

Bonhomme, S. and Robin, J.-M. (2009). Consistent noisy independent component analysis. J. Econometrics 149 12–25. MR2515042

Brys, G., Hubert, M. and Struyf, A. (2006). Robust measures of tail weight. Comput. Statist. Data Anal. 50 733–759. MR2207005

Bugrien, J. B. and Kent, J. T. (2005). Independent component analysis: An approach to clustering. In Proceedings in Quantitative Biology, Shape Analysis and Wavelets (S. Barber, P. D. Baxter, K. V. Mardia and R. E. Walls, eds.) 111–114. Leeds Univ. Press, Leeds, UK.

Cardoso, J. F. (1989). Source separation using higher order moments. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing 2109–2112, Glasgow, UK.

Cardoso, J. F. and Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEE Proc. F 140 362–370.

Caussinus, H. and Ruiz-Gazen, A. (1993). Projection pursuit and generalized principal component analyses. In New Directions in Statistical Data Analysis and Robustness (Ascona, 1992). Monte Verità 35–46. Birkhäuser, Basel. MR1280272

Chen, A. and Bickel, P. J. (2006). Efficient independent component analysis. Ann. Statist. 34 2825–2855. MR2329469

Clarkson, D. B. (1988). A least squares version of algorithm AS 211: The F-G diagonalization algorithm. Appl. Stat. 37 317–321.

Critchley, F., Pires, A. and Amado, C. (2006). Principal axis analysis. Technical Report 06/14, The Open Univ., Milton Keynes, UK.

Darlington, R. B. (1970). Is kurtosis really "peakedness?" Amer. Statist. 24 19–22.

DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychol. Methods 2 292–307.

Eriksson, J. and Koivunen, V. (2004). Identifiability, separability and uniqueness of linear ICA models. IEEE Signal Process. Lett. 11 601–604.

Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. C 23 881–890.

Hallin, M. and Mehta, C. (2015). R-estimation for asymmetric independent component analysis. J. Amer. Statist. Assoc. 110 218–232. MR3338498

Huber, P. J. (1981). Robust Statistics. Wiley, New York. MR0606374

Huber, P. J. (1985). Projection pursuit. Ann. Statist. 13 435–525. MR0790553

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw. 10 626–634.

Hyvärinen, A., Karhunen, J. and Oja, E. (2001). Independent Component Analysis. Wiley, New York.

Hyvärinen, A. and Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Comput. 9 1483–1492.

Ilmonen, P., Nevalainen, J. and Oja, H. (2010). Characteristics of multivariate distributions and the invariant coordinate system. Statist. Probab. Lett. 80 1844–1853. MR2734250

Ilmonen, P. and Paindaveine, D. (2011). Semiparametrically efficient inference based on signed ranks in symmetric independent component models. Ann. Statist. 39 2448–2476. MR2906874

Jones, M. C. and Sibson, R. (1987). What is projection pursuit? J. Roy. Statist. Soc. Ser. A 150 1–36. MR0887823

Kankainen, A., Taskinen, S. and Oja, H. (2007). Tests of multinormality based on location vectors and scatter matrices. Stat. Methods Appl. 16 357–379. MR2399867

Karvanen, J. and Koivunen, V. (2002). Blind separation methods based on Pearson system and its extensions. Signal Process. 82 663–673.

Koldovský, Z., Tichavský, P. and Oja, E. (2006). Efficient variant of algorithm FastICA for independent component analysis attaining the Cramér–Rao lower bound. IEEE Trans. Neural Netw. 17 1265–1277.

Kollo, T. (2008). Multivariate skewness and kurtosis measures with an application in ICA. J. Multivariate Anal. 99 2328–2338. MR2463392

Kollo, T. and Srivastava, M. S. (2004). Estimation and testing of parameters in multivariate Laplace distribution. Comm. Statist. Theory Methods 33 2363–2387. MR2104118

Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika 57 519–530. MR0397994

Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. Ann. Statist. 4 51–67. MR0388656

Miettinen, J., Nordhausen, K., Oja, H. and Taskinen, S. (2013). Fast equivariant JADE. In Proc. 38th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013) 6153–6157. Vancouver, BC.

Miettinen, J., Nordhausen, K., Oja, H. and Taskinen, S. (2014a). Deflation-based FastICA with adaptive choices of nonlinearities. IEEE Trans. Signal Process. 62 5716–5724. MR3273526

Miettinen, J., Illner, K., Nordhausen, K., Oja, H., Taskinen, S. and Theis, F. J. (2014b). Separation of uncorrelated stationary time series using autocovariance matrices. Available at arXiv:1405.3388.

Móri, T. F., Rohatgi, V. K. and Székely, G. J. (1993). On multivariate skewness and kurtosis. Theory Probab. Appl. 38 547–551.

Nordhausen, K., Oja, H. and Ollila, E. (2011). Multivariate models and the first four moments. In Nonparametric Statistics and Mixture Models 267–287. World Scientific, Singapore. MR2838731

Nordhausen, K., Ilmonen, P., Mandal, A., Oja, H. and Ollila, E. (2011). Deflation-based FastICA reloaded. In Proc. 19th European Signal Processing Conference 2011 (EUSIPCO 2011) 1854–1858.

Oja, H. (1981). On location, scale, skewness and kurtosis of univariate distributions. Scand. J. Stat. 8 154–168. MR0633040

Oja, H., Sirkiä, S. and Eriksson, J. (2006). Scatter matrices and independent component analysis. Aust. J. Stat. 35 175–189.

Ollila, E. (2010). The deflation-based FastICA estimator: Statistical analysis revisited. IEEE Trans. Signal Process. 58 1527–1541. MR2758026

Pearson, K. (1895). Contributions to the mathematical theory of evolution, II: Skew variation in homogeneous material. Philos. Trans. R. Soc. 186 343–414.

Pearson, K. (1905). Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und Pearson. A Rejoinder. Biometrika 4 169–212.

Peña, D. and Prieto, F. J. (2001). Cluster identification using projections. J. Amer. Statist. Assoc. 96 1433–1445. MR1946588

Peña, D., Prieto, F. J. and Viladomat, J. (2010). Eigenvectors of a kurtosis matrix as interesting directions to reveal cluster structure. J. Multivariate Anal. 101 1995–2007. MR2671197

Samworth, R. J. and Yuan, M. (2012). Independent component analysis via nonparametric maximum likelihood estimation. Ann. Statist. 40 2973–3002. MR3097966

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. MR0595165

Tichavský, P., Koldovský, Z. and Oja, E. (2006). Performance analysis of the FastICA algorithm and Cramér–Rao bounds for linear independent component analysis. IEEE Trans. Signal Process. 54 1189–1203.

Tyler, D. E., Critchley, F., Dümbgen, L. and Oja, H. (2009). Invariant co-ordinate selection. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 549–592. MR2749907

Van Zwet, W. R. (1964). Convex Transformations of Random Variables. Mathematical Centre Tracts 7. Mathematical Centre, Amsterdam.