Hierarchical Probabilistic Model for Blind Source Separation via Legendre Transformation

Simon Luo1,2 Lamiae Azizi1,2 Mahito Sugiyama3

1School of Mathematics and Statistics, The University of Sydney, Sydney, Australia 2Data Analytics for Resources and Environments (DARE), Australian Research Council, Sydney, Australia 3National Institute of Informatics, Tokyo, Japan

Abstract 2010] extract a specified number of components with the largest variance under an orthogonal constraint. They are composed of a linear combination of variables, and create a We present a novel blind source separation (BSS) set of uncorrelated orthogonal basis vectors that represent method, called information geometric blind source the source signal. The basis vectors with the N largest separation (IGBSS). Our formulation is based on variance are called the principal components and are the the log-linear model equipped with a hierarchi- output of the model. PCA has shown to be effective for cally structured sample space, which has theoreti- many applications such as dimensionality reduction and cal guarantees to uniquely recover a set of source feature extraction. However, for BSS, PCA makes the signals by minimizing the KL divergence from a assumption that the source signals are orthogonal, which is set of mixed signals. Source signals, received sig- often not the case in most practical applications. nals, and mixing matrices are realized as different layers in our hierarchical sample space. Our em- Similarly, ICA also attempts to find the N components with pirical results have demonstrated on images and the largest variance by relaxing the orthogonality constraint. time series data that our approach is superior to Variations of ICA, such as infomax [Bell and Sejnowski, well established techniques and is able to separate 1995], FastICA [Hyvärinen and Oja, 2000], and JADE [Car- signals with complex interactions. doso, 1999], separate a multivariate signal into additive subcomponents by maximizing the statistical independence of each component. ICA assumes that each component is 1 INTRODUCTION non-gaussian and the relationship between the source signal and the mixed signal is an affine transformation. In addition The objective of blind source separation (BSS) is to identify to these assumptions, ICA is sensitive to the initialization of a set of source signals from a set of multivariate mixed sig- the weights as the optimization is non-convex and is likely nals1. BSS is widely used for applications which are consid- to converge to a local optimum. ered to be the “cocktail party problem”. Examples include Other potential methods which can perform BSS in- image/ [Isomura and Toyoizumi, 2016], clude non-negative matrix factorization (NMF) [Lee and

arXiv:1909.11294v3 [stat.ML] 11 Jun 2021 artifact removal in medical imaging [Vigário et al., 1998], Seung, 2001, Berne et al., 2007], dictionary learning and electroencephalogram (EEG) signal separation [Con- (DL) [Olshausen and Field, 1997], and reconstruction ICA gedo et al., 2008]. Currently, there are a number of solutions (RICA) [Le et al., 2011]. NMF, DL and RICA are degener- for the BSS problem. The most widely used approaches are ate approaches to recover the source signal from the mixed variations of principal component analysis (PCA) [Pearson, signal, which means that they lose information when recov- 1901, Murphy, 2012] and independent component analysis ering the source signal. These approaches are more typically (ICA) [Comon, 1994, Murphy, 2012]. However, they all used for feature extraction. NMF factorizes a matrix into two have limitations with their approaches. matrices with nonnegative elements representing weights PCA and its modern variations such as sparse and features. The features extracted by NMF can be used PCA (SPCA) [Zou et al., 2006], non-linear PCA to recover the source signal. More recently, there are more (NLPCA) [Scholz et al., 2005], and Robust PCA [Xu et al., advanced techniques that uses Short-time Fourier transform (STFT) to transform the signal into the frequency domain 1“Mixed signals” and “received signals” are used exchangeably to construct a spectrogram before applying NMF [Sawada throughout this article. et al., 2019]. However, NMF does not maximize statistical

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021). independence which is required to completely separate the a a  z z  x x  mixed signal into the source signal, and it is also sensitive 11 12 11 12 = 11 12 to initialization as the optimization is non-convex. Due to a21 a22 z21 z22 x21 x22 the non-convexity, additional constraints or heuristics for weight initialization is often applied to NMF to achieve Mixing Source Received layer layer layer better results [Ding et al., 2008, Boutsidis and Gallopou- los, 2008]. DL can be thought of as a variation of the ICA a11 z11 x11 approaches which requires an over-complete basis vector for the mixing matrix. DL may be advantageous because a12 z12 x12 additional constraints such as a positive code or a dictionary ⊥ can be applied to the model. However, since it requires an a21 z21 x21 over-complete basis vector, information may be lost when reconstructing the source signal. In addition, like all the a22 z22 x22 other approaches, DL is also non-convex and it is sensitive to the initialization of the weights. Figure 1: An example of our sample space. Dashed lines All previous approaches have limitations such as loss of show removed partial orders to allow for learning. The nodes information or non-convex optimization and require con- represent the state of each variable and the arrow shows the straints or assumptions such as orthogonality or an affine direction of the partial ordering. transformation which are not ideal for BSS. In the following, we introduce our approach to BSS, called IGBSS (Informa- recover the source signal, that is Z = A−1X = BX. tion Geometric BSS), using the log-linear model [Agresti, 2012], which can introduce relationships between possible Our strategy is to treat the three components, X, Z, and A, states into its sample space [Sugiyama et al., 2017]. Unlike of BSS as a joint distribution and model it by the log-linear the previous approaches that we mentioned above, our ap- model [Agresti, 2012], which is a well-known energy-based proach does not have the assumptions or limitations that they model. We can take non-affine transformation into account require. We provide a flexible solution by introducing a hi- and formulate BSS as a convex optimization problem. erarchical structure between signals into our model, which allows us to treat interactions between signals that are more complex than an affine transformation. Unlike other existing 2.1 LAYER CONFIGURATION methods, our approach does not require the inversion of the mixing matrix and is able to recover the sign of the signal. Let Ω be a sample space of distributions modeled by the Thanks to the well-developed information geometric analy- log-linear model, which is composed of possible states of a sis of the log-linear model [Amari, 2001], optimization of system of interest. Our key idea is to introduce a hierarchical our method is achieved via convex optimization, hence it layered structure into Ω to achieve BSS. We call this model always arrives at the globally optimal unique solution. We information geometric BSS (IGBSS) as its optimality is theoretically show that it always minimizes the Kullback– supported by the tight connection between the log-linear Leibler (KL) divergence from a set of mixed signals to a model and the information geometric properties of the space set of source signals. We empirically demonstrate that our of distributions (statistical manifold), which we will show hierarchical model leads to better separation of signals in- in the following subsections. We implement three layers of cluding complex interaction such as higher-order feature BSS, the mixing layer, the source layer, and the received interactions than existing methods. layer, into Ω in the form of partial orders and learn the joint representation on it using the log-linear model. The log- linear model on a partially ordered set (poset), a set equipped 2 FORMULATION with a partial order “” [Gierz et al., 2003], is proposed by Sugiyama et al. [2017], which includes a (higher-order) Boltzmann machines as an instance [Luo and Sugiyama, BSS is formulated as a function f that separates a set of 2019]. We use this model to achieve the task of BSS by received signals X into a set of source signals Z, i.e., introducing layered structure as partial orders. The received Z = f(X). For example, if one employs a ICA based layer and the source layer represent the input received signal formulation, the BSS problem reduces to X = AZ, where and the output source signal of BSS, respectively, and the the received signal X ∈ L×M with L signals and the R mixing layer encodes information of how to mix the source sample size M is an affine transformation of the source signal. In the following, we consistently assume that L is signal Z ∈ N×M with N signals and a mixing matrix R the number of received signals, M is the sample size, and A ∈ L×N . The objective is to estimate Z by learning A R N is the number of source signals. given X. Our approach is different from the classical formu- lation, where the inverse of the mixing matrix is learnt to Let us construct three layers in the sample space Ω as Ω = {⊥} ∪ A ∪ Z ∪ X and assume that these sets are model [Coull and Agresti, 2003] and F is called a model given as A = {a11, . . . , aLN }, Z = {z11, . . . , zNM }, and matrix, which represents relationship between states. The X = {x11, . . . , xLM }. The element ⊥ denotes the least assumption of the log-linear model is that F is needs to element, and it acts as a partition function of the log-linear be non-singular, and Sugiyama et al. [2017] showed that model. We use 2D indexing of elements in each layer to Equation (2) with a poset Ω always provides a non-singular make the correspondence between our formulation and ICA model matrix; that is, F is regular as long as each entry is based formulation clear; that is, these three layers A, Z, and given as fij = 1ωj ωi . X are analogue to a mixing matrix A ∈ L×N , a source R By inspecting Equation (2), we can see that the log-linear matrix Z ∈ N×M , and a received matrix X ∈ L×M , R R model belongs to the exponential family. In particular, each respectively2. We will also use symbols ω and s to denote θ corresponds to the natural parameter in the exponential elements of Ω, i.e., they can be ⊥, a , z , and x . Here s ln nm lm family, ψ (θ) represents the normalization constant, and we introduce a partial order  between layers. We intro- ω ∈ Ω represents the outcome of each state. duce the connection between each of the nodes in a similar fashion to the forward ICA model, X = AZ. We define The joint distribution for BSS is described by the log- linear model in Equation (2) over the sample space Ω =  0  0 a  z 0 0 if j = i , z  x 0 0 if j = j , ij i j ij i j (1) {⊥} ∪ A ∪ Z ∪ X equipped with the partial order defined aij 6 zi0j0 otherwise, zij 6 xi0j0 otherwise in Equation (1). In addition, it is always assumed that the parameter space of the log-linear model S = A ∪ Z ⊂ Ω, for each element in three layers A, Z, and X , and we do not meaning that mixing and source layers are used as param- have any ordering among elements in the same layer. Since eters to represent distributions in our model. If we learn it is a partial order, transitivity always holds, e.g., a  x 11 22 the joint distribution from a received signal X, we will ob- as a  z and z  x . The first condition encodes 11 12 12 22 tain probabilities on the source layer p(z ), . . . , p(z ), the structure such that the source layer is higher than the 11 NM which represents normalized source signals. The rational of mixing layer, and the second condition encodes that the our approach is given as follows: The connections between received layer is higher than the source layer. An example each layer is structured so that the log-linear model performs of our sample space with L = M = N = 2 is illustrated in a similar computation to the ICA based approach X = AZ. Figure 1. Our structure ensures that each p(xlm) is determined by

(θaln )n∈[N] and (θzmn )n∈[N] with [N] = {1,...,N}, as 2.2 LOG-LINEAR MODEL ON PARTIALLY we always have aln  xlm and znm  xlm. Moreover, ORDERED SETS more complex interaction than affine transformation, such as higher-order interactions, between signals can be treated We use the log-linear model given in the form of if we additionally include partial order structure into Z and/or A. These cannot be treated by a simple matrix multi- X log p(ω) = 1sωθs − ψ(θ), (2) plication. s∈S Since a poset can be also represented as a directed acyclic where p(ω) ∈ (0, 1) is the probability of each state ω ∈ Ω graph (DAG), the log-linear model on a poset has a close and S ⊆ Ω is a parameter space such that a parameter value relationship to that with a hypergraph [Ay et al., 2017, Sec- θ ∈ is associated with each s ∈ S, and ψ(θ) is the s R tion 2.9]. If we treat a poset as a DAG, each node of a DAG partition function such that P p(ω) = 1, where θ = ω∈Ω ⊥ is a state of sample space and edges represent the hierar- −ψ(θ) always holds. In this formulation, we assume that chical relationship between the states, that is, a path from the set Ω of possible states, equivalent to the sample space a node ω to a node ω0 exists if and only if ω  ω0. Note in the statistical sense, is a poset, and 1 = 1 if s  ω sω that this graph structure should not be confused with the and 0 otherwise. If we index Ω as Ω = {ω , ω , . . . , ω }, 1 2 |Ω| graph structure found in Markov Random Fields (MRF) we obtain the following matrix form: (undirected graph) or Bayesian Networks (directed graph), log p = Fθ − ψ(θ), where each node typically represents a random variable. A poset forms a simplicial complex that uses its combinatorial |Ω| |Ω| properties to represent the higher-order interaction effects where p ∈ (0, 1) with pi = p(ωi), θ ∈ R such in the model [Ay et al., 2017, Definition 2.13]. that θi = θωi if ωi ∈ S and θi = 0 otherwise, F = |Ω|×|Ω| (fij) ∈ {0, 1} with fij = 1ωj ωi , and ψ(θ) = (ψ(θ), . . . , ψ(θ)) ∈ R|Ω|. Each vector is treated as a col- umn vector, and log is an element-wise operation. This 2.3 OPTIMIZATION matrix form is often used as a general form of the log-linear We train the log-linear model by minimizing the KL di- 2 We use the same symbol for an entry xlm of X and its corre- vergence from an empirical distribution pˆ, which is identi- sponding state in X to avoid complicated notations. cal to the normalized received signal X ∈ RL×M , to the model distribution p given by Equation (2) or, equivalently, (ηω)ω∈Ω+ work as coordinate systems and determine a dis- maximizing the likelihood. More precisely, we normalize tribution in P. The Riemannian metric with respect to θ is a given X by dividing each entry by the sum of all entries; given as that is, an empirical distribution pˆ is obtained as pˆ(xlm) =   P ∂ηs ∂ log p(ω) ∂ log p(ω) xlm/ l,m xlm. If X contains negative values, an exponen- gss0 = = E tial kernel exp (x )/ P exp (x ) or min-max normal- ∂θs0 ∂θs ∂θs0 lm l,m lm X (6) ization (xlm +  − min(X))/(max(X) +  − min(X)) can = 1sω1s0ωp(ω) − ηsηs0 , be used, where  is some arbitrary small value to avoid ω∈Ω zero probability. We also assume that pˆ(a ) = 0 and ln which coincides with the Fisher information [Sugiyama pˆ(z ) = 0 for all a ∈ A and z ∈ Z. These transfor- nm ln nm et al., 2017, Theorem 3] and we use it for natural gradient. mations do not have a negative effect on the result of the model, because we then apply the reverse transformation on Now we consider two submanifolds Pθ, Pη ⊆ P, which the reconstructed signal or the source signal. we define as + The objective function is given as Pθ = { p ∈ P | θω = 0, ∀ω ∈ E } , E = Ω \S,

Pη = { p ∈ P | ηω =η ˆω, ∀ω ∈ M } , M = S. X pˆ(ω) argmin DKL (ˆpkp) = argmin pˆ(ω) log , (3) P P p(ω) Note that this Pθ coincides with that in Equation (3). The p∈ θ p∈ θ ω∈Ω submanifold Pθ is called an e-flat submanifold and Pη an m-flat submanifold in information geometry. The highlight where Pθ is the set of distributions that can be represented by Equation (2) with our structured sample space Ω = of considering these two types of submanifolds is that, if + {⊥} ∪ A ∪ Z ∪ X and S = A ∪ Z. E ∩ M = ∅ and E ∪ M = Ω , it is theoretically guaranteed that the intersection Pθ ∩ Pη is always a singleton and it The remarkable property of our model is that this optimiza- is the optimizer of Equation (3) [Amari, 2009, Theorem 3], tion problem is convex and it is guaranteed that gradient- that is, it is the globally optimal solution of our model. based methods can always arrive at the globally optimal unique solution. To show this, we analyze the geometric Optimization is achieved by e-projection, which seeks structure of the statistical manifold, the set of probabil- Pθ ∩ Pη in the e-flat submanifold Pθ. The e-projection ity distributions, generated by the log-linear model. Let is always convex optimization as Pθ is convex with respect Ω+ = Ω\{⊥}. First we introduce another parameterization to θ; this is because θ is a coordinate system of Pθ that is linearly constrained on θ. We can therefore use the standard (ηω)ω∈Ω+ of the log-linear model, which is defined as gradient descent strategy to optimize the log-linear model. X The derivative of the KL divergence with respect to θs is ηω = 1ωsp(s). (4) known to be the difference between expectation parameters s∈Ω η [Sugiyama et al., 2017, Theorem 2]: Note that η⊥ = 1 always holds and we do not include it ∂ as a parameter. In addition, for theoretical consistency we DKL(ˆp k p) = ηs − ηˆs = ∆ηs, (7) ∂θ change the parameter space used in Equation (2) from S s + to Ω and assume that θω = 0 if ω 6∈ S. Again we do not and the KL divergence DKL(ˆpkp) is minimized if and only include θ⊥ as a parameter as it is the partition function. Two if ηs =η ˆs for all s ∈ S. (θ ) + (η ) + parameters ω ω∈Ω and ω ω∈Ω have clear statistical From our definition of Ω in Equation (1), we have η = interpretation as it is widely known that any log-linear model zkl η for all z , z 0 ∈ Z. Therefore all elements in the θ η zk0l kl k l belongs to the exponential family, where and correspond source layer will learn the same value. This problem can be θ η to natural and expectation parameters, respectively. and avoided by removing some of partial orders between source are connected via a Legendre transformation which means and received layers. We propose to systematically remove that they are both differentiable and have a one-to-one cor- 0 the partial order z  x 0 0 if i = i to ensure η 6= η θˆ ij i j zkl zk0l respondence. To simplify the notation, we denote by and (see Figure 1), while other strategies are possible as long as ηˆ θ η pˆ the corresponding and of the empirical distribution . η 6= η is satisfied, for example, random deletion. Let zkl zk0l Using the above results, gradient descent can be directly P = {p | 0 < p(ω) < 1 for all ω ∈ Ω} (5) applied to achieve Equation (3). However, this may need a large number of iterations to reach convergence. To reduce be the set of all probability distributions. This set forms the number of iterations, we propose to use natural gradi- a statistical manifold with a dually flat structure, which ent [Amari, 1998], which is a second-order optimization is the canonical geometric structure in information ge- approach and will also always find the global optimum. Let ometry [Amari, 2016], with its dual coordinate system us re-index S = A ∪ Z as S = {s1, s2, . . . , s|S|} and as- T T + + + ((θω)ω∈Ω , (ηω)ω∈Ω ); that is, both of (θω)ω∈Ω and sume that θ = [θs1 , . . . , θs|S| ] and η = [ηs1 , . . . , ηs|S| ] . Algorithm 1 Information Geometric BSS 2.4 PARAMETER COMPUTATION FOR EACH 1: Function IGBSS(X, S): LAYER 2: Compute pˆ from X In the following, we give p, η, and the gradient for each 3: Compute ηˆ = (ˆηs)s∈S from pˆ layer, which are used in gradient descent. 4: Initialize (θs)s∈S (randomly or θs = 0) 5: repeat Received Layer (Input Layer): Probability p(x) on the 6: Compute p using the current parameter (θs)s∈S received layer x ∈ X is obtained as 7: Compute (ηs)s∈S from p X X 8: (∆ηω)ω∈Z ← (ηω)ω∈Z − (ˆηω)ω∈Z log p(x) = 1zxθz + 1axθa + θ⊥, (10) z∈Z a∈A 9: (∆ηω)ω∈A ← (ηω)ω∈A − (ˆηω)ω∈A X 0 10: Compute the Fisher information matrix for source ηx = 1xx0 p(x ) = p(x). (11) layer GZ and the mixing layer GA x0∈X −1 11: (θω)ω∈Z ← (θω)ω∈Z − GZ (∆ηω)ω∈Z −1 We do not need to compute gradient for this layer as there 12: (θω)ω∈A ← (θω)ω∈A − GA (∆ηω)ω∈A is no parameter on this layer and θx = 0 for all x ∈ X . 13: until convergence of (θs)s∈S 14: End Function Source Layer (Output Layer): Probability p(z) on the source layer for each z ∈ Z is given as X X In each step of natural gradient, the current θ is updated to log p(z) = 1z0zθz0 + 1azθa + θ⊥ z0∈Z, a∈A θnext by the following formula: X = θz + 1azθa + θ⊥, (12) θ = θ − G−1(η − ηˆ) next a∈A X X 0 |S|×|S| η = 1 p(x) + 1 0 p(z ) where G = (gij) ∈ R is the Fisher information z zx zz x∈X 0 matrix such that each gij is given as gs s in Equation (6). z ∈Z i j X Although the natural gradient requires less iterations than = 1zxp(x) + p(z). the gradient descent, matrix inversion G−1 is computation- x∈X ally expensive as it has O(|S|3) complexity. In addition, Thus the gradient for the source layer is given as FIM values are often too small and optimization becomes ∂ numerically unstable. To solve these problems, we separate DKL(ˆpkp) = ηz − ηˆz ∂θz the update steps in the source and the mixing layers: X = 1zx (p(x) − pˆ(x)) + p(z). −1 x∈X (θω,next)ω∈Z = (θω)ω∈Z − GZ (∆ηω)ω∈Z , (8) −1 Mixing Layer: Probability p(a) on this layer is given as (θω,next)ω∈A = (θω)ω∈A − GA (∆ηω)ω∈A, (9) X log p(a) = 1a0aθa0 + θ⊥ = θa + θ⊥, (13) where GZ and GA are the Fisher information matrices for 0 source and mixing layers, respectively. Note that this also a ∈A X X leads to the same global optimum. They are constructed by ηa = 1axp(x) + 1azp(z) assuming all the other parameters are fixed. This approach x∈X z∈Z reduces the time complexity to O(|Z|3 + |A|3). The full X 0 + 1aa0 p(a ) algorithm using natural gradient is given in Algorithm 1. a0∈A Computation of p from θ and η from p can be achieved using X X Equations (2) and (4). The time complexity to compute p = 1axp(x) + 1azp(z) + p(a). (14) in Algorithm 1 Line 6 is O(|Ω||S|). The complexity to x∈X z∈Z compute ∆η in Algorithm 1 Line 8 and Line 9 is O(|Z|) + The gradient of the mixing layer is given as O(|A|) = O(|S|). Therefore the total complexity of each ∂ iteration is O(|Z|3 + |A|3 + |Ω||S|). DKL(ˆpkp) = ηa − ηˆa ∂θa X X Note that, although our formulation always give globally = 1ax (p(x) − pˆ(x)) + 1azp(z) + p(a). optimal solution with respect to the optimization problem x∈X z∈Z given in Equation (3), the objective function is not the same (15) as other BSS formulations such as FastICA. Therefore it is Parameter values θa in the mixing layer represent the degree not theoretically guaranteed that our method always shows of mixing between source signals. Hence they can be used superior performance to other approaches. We therefore to perform feature selection and extraction. For example, if empirically evaluate our method in Section 3 and discuss its θa = 0 in the extreme case, the corresponding node a does performance. not have any contribution to the source mixing. 3 EXPERIMENTS larger than the ground truth for all images. Small residuals of each image can be seen on the other images. For instance, We empirically examine the effectiveness of IGBSS to per- in the airplane (F-16) image, residuals from the lake im- form BSS using real-world image and synthetic time-series age can be clearly seen. Compared to the reconstruction datasets for an affine transformation and higher-order in- of IGBSS with FastICA, DL and NMF, IGBSS performs teractions between signals. All experiments were run on significantly better as all the other approaches are unable to CentOS Linux 7 with Intel Xeon CPU E5-2623 v4 and clearly separate the mixed signal. FastICA was unable to Nvidia QuadroGP100 3. provide a reasonable reconstruction with 3 mixed signal. To overcome this limitation of FastICA, we randomly gener- ated another column of the mixing matrix and append it to 3.1 BLIND SOURCE SEPARATION FOR AFFINE the current mixing matrix to create 4 mixed signals as an TRANSFORMATIONS ON IMAGES input to FastICA to recover a more reasonable signal. In our experiments, we use three benchmark images widely The root mean square error (RMSE) of the Euclidean dis- used in computer vision from the University of Southern Cal- tance and the signal-to-noise ratio (SNR) between the re- ifornia’s Signal and Image Processing Institute (USC-SIPI)4, construction and the ground truth is calculated to quan- which include “airplane (F-16)”, “lake” and “peppers”. Each tify results of each method. The SNR is computed by image is standardized to have 32x32 pixels with red, green SNRdB = 20 log10(znorm/|(z − znorm)|). The full results and blue color channels with integer values between 0 and are shown in Table 1 (top row for each experiment). In 255 to represent the intensity of each pixel. These images the table, we present three experiments with different RGB shown in Figure 2a are the source signal Z which are un- images from USC-SIPI dataset, for each experiment we known to the model. They are only used as ground truth to generate a new mixing matrix, where the second and the evaluate the model’s output. The equation X = AZ is used third experiments uses images of “mandrill”, “splash”, “jelly to generate the received signal X by randomly generating beans” and “mandrill”, “lake”, “peppers”, respectively. Our values for a mixing matrix A using the uniform distribution results clearly show that IGBSS is superior to other methods, which generates real numbers between 1 and 6. The images that is, IGBSS has consistently produced the lowest RMSE are then rescaled to integer values within the range between error for every experiment. When looking at the SNR ratio, 0 and 255. The received signal X, which is the input to the our model has produced the highest SNR for the majority of model, is the three images shown in Figure 2b. The three the cases and is always able to recover the same result after images for the mixed signal may look visually similar, how- each run as it is formulated as a convex optimization. ever, they are actually superposition of the source signal with different intensity. The objective of our model is to reconstruct the source signal Z without knowing A. 3.2 BLIND SOURCE SEPARATION WITH HIGHER-ORDER FEATURE INTERACTIONS We compare our approach to FastICA [Hyvärinen and Oja, 2000] with the log cosh function as the signal prior, dic- In any real-world application, the interaction between sig- tionary learning (DL) [Olshausen and Field, 1997] with nals are usually more complex than an affine transformation. constraint for positive dictionary and positive code, and We demonstrate the ability of BSS for our model to include NMF with the coordinate descent solver and non-negative higher-order feature interactions in BSS. We use the same double singular value decomposition (NNDSVD) initial- benchmark images in the standard BSS as the source signal ization [Boutsidis and Gallopoulos, 2008] with zero values Z for our experiment. We generate the higher-order feature replaced with the mean of the input. interactions of the received signal by using the multiplica- Since BSS is an problem, the order of tive product of the source signal. If we take into account up the signal is not recovered. We identify the corresponding to kth order interaction (k ≤ N), signal by taking all permutations of the output and calcu- X late the minimum euclidean distance with the ground truth. xlm = alnznm The permutation which returns the minimum error is con- n X X sidered as the correct order of the image. The scale of the + aln1n2 zn1mzn2m output is also not recovered, thereby we have used min-max n1 n2>n1 normalization to the output of each model. X X X + aln1n2n3 zn1mzn2mzn3m Separation results for images are shown in Figure 2. Our n1 n2>n1 n3>n2 method IGBSS can recover majority of the “shape” of the X X + ··· + ··· aln1...nk zn1m . . . znkm. source signal, while the intensity of each image appears to n1 nk>nk−1

3https://github.com/sjmluo/IGLLM All the other known approaches take into account only first 4http://sipi.usc.edu/database/ order interactions (that is, affine transformation) between (a) (b) (c) (d) (e) (f) (a) (b) (c) (d) (e) (f) GT Mixed IGBSS ICA DL NMF GT Mixed IGBSS ICA DL NMF Figure 2: First-order interaction experiment. Figure 3: Third-order interaction experiment.

Table 1: Signal-to-Noise Ratio of reconstructed signal. (∗) Results for Figure 2. (†) Results for Figure 3. Scores are means ± standard deviation after 40 runs. We have applied different weight initialization after each run.

Root Mean Squared Error (RMSE) Signal-to-noise ratio (SNR) (units in dB) Exp. Order IGBSS FastICA DL NMF IGBSS FastICA DL NMF First∗ 0.252 ± 0.000 0.300 ± 0.089 0.394 ± 0.041 0.622 ± 0.000 12.588 ± 0.000 11.688 ± 4.829 6.810 ± 0.008 1.704 ± 0.000 1 Second 0.260 ± 0.000 0.285 ± 0.096 0.441 ± 0.080 0.662 ± 0.000 10.729 ± 0.000 12.353 ± 4.255 0.526 ± 0.448 -3.426 ± 0.000 Third† 0.252 ± 0.000 0.260 ± 0.111 0.362 ± 0.030 0.612 ± 0.000 12.588 ± 0.000 12.922 ± 5.590 1.471 ± 0.358 0.039 ± 0.000 First 0.133 ± 0.000 0.284 ± 0.064 0.474 ± 0.067 0.591 ± 0.000 14.215 ± 0.000 11.218 ± 1.964 2.098 ± 2.140 -0.940 ± 0.000 2 Second 0.256 ± 0.000 0.263 ± 0.066 0.576 ± 0.008 0.684 ± 0.000 10.612 ± 0.000 11.986 ± 2.157 -1.589 ± 0.269 -3.675 ± 0.000 Third 0.282 ± 0.000 0.239 ± 0.056 0.593 ± 0.007 0.665 ± 0.000 9.346 ± 0.000 11.475 ± 2.145 -2.274 ± 0.227 -4.073 ± 0.000 First 0.155 ± 0.000 0.699 ± 0.047 0.478 ± 0.121 0.628 ± 0.000 11.285 ± 0.000 10.785 ± 2.176 1.448 ± 4.249 0.628 ± 0.000 3 Second 0.200 ± 0.000 0.280 ± 0.049 0.515 ± 0.007 0.709 ± 0.000 10.862 ± 0.000 10.171 ± 2.353 0.529 ± 0.228 -5.579 ± 0.000 Third 0.203 ± 0.000 0.239 ± 0.056 0.536 ± 0.006 0.682 ± 0.000 11.075 ± 0.000 11.041 ± 2.708 -0.244 ± 0.185 -4.961 ± 0.000

Table 2: Quantitative results for time-series separation experiment (mean ± standard deviation with 40 runs). (a) Root Mean Squared Error (RMSE) (b) Signal-to-noise (SNR) (units in dB)

Order IGBSS (min-max) IGBSS (exp) FastICA Order IGBSS (min-max) IGBSS (exp) FastICA First 0.702 ± 0.000 0.703 ± 0.000 0.414 ± 0.286 First 3.596 ± 0.000 3.600 ± 0.000 15.391 ± 3.813 Second 0.921 ± 0.000 0.921 ± 0.000 1.700 ± 0.167 Second 0.291 ± 0.000 0.042 ± 0.000 -5.803 ± 1.124 Third 0.967 ± 0.000 0.961 ± 0.000 1.388 ± 0.178 Third 0.340 ± 0.000 0.128 ± 0.000 -3.427 ± 1.249 features. Differently, our model can directly incorporate the 3.3 TIME SERIES DATA ANALYSIS higher-order features as we do not assume that they are an affine transformation. When we consider up to kth order We demonstrate the effectiveness of our model on time se- interactions, we additionally include the elements corre- ries data. In our experiments, we create three signals with sponding to new mixing parameters into the mixing layer. 500 observations each using the sinusoidal function, sign For example, if k = 2, nodes for aln1n2 are added and function, and the sawtooth function. The synthetic data sim- aln1n2  znm if n1 = n or n2 = n. Figure 3 shows ex- ulates typical signals from a wide range of applications perimental results for the third-order feature experiment. including audio, medical and sensors. We randomly gener- Our approach IGBSS shows superior reconstruction of the ate a mixing matrix by drawing from a uniform distribution source signal to other approaches. All the other approaches with values between 0.5 and 2. In our experiment, we pro- except for NMF is able to achieve reasonable reconstruction. vide comparison of using both min-max normalization and NMF is able to recover the “shape” of the image, however, exponential kernel as a pre-processing step and compare our unlike IBSS, NMF is a degenerate approach, so it is unable approach with FastICA. to recover all color channels in the correct proportion, cre- ating discoloring for the image which is clearly shown in Experimental results are illustrated in Figure 4. These re- the SNR values. Since the proportion of the intensity of the sults show that IGBSS is superior to all the ICA approaches pixel is not recovered. In terms of both of the RMSE and because it is able to recover both the shape of the signal and the SNR shown in Table 1, IGBSS again shows the best re- the sign of the signal, while all the other ICA approaches sults for both second- and third-order interactions of signals are only able to recover the shape of the signal and are un- across the three experiments. able to recover the sign of the signal. This means that ICA could recover a flipped signal. We have paired the recovered Figure 4: Time series signal experiment. signal of ICA with the ground truth by finding the signal 104 103 and sign with the lowest RMSE error. In any practical ap- 103 Natural Gradient plication, this is not possible for ICA because the latent Gradient Descent 102 102 signal is unknown. Through visual inspection, IGBSS is

Rutime Rutime [sec.] Natural Gradient 101 able to recover all visual signals with high accuracy, while 101 Gradient Descent Number of iterations 2000 4000 6000 8000 2000 4000 6000 8000 FastICA is only able to recover the first-order interaction Number of parameters Number of paramters and it is unable to produce a reasonable recovery for second- 2000 100000 Natural Gradient and third-order interactions. In addition to our visual com- 80000 Gradient Descent 1500 parison, we have also performed a quantitative analysis on Natural Gradient 60000 1000 Gradient Descent the experimental results using RMSE error with the ground 40000 500 truth. Results are shown in Table 2. FastICA has shown to Runtime [sec.] 20000

Number of iterations 0 0 have better performance for First-Order interactions. How- 1 2 3 4 5 6 1 2 3 4 5 6 ever, for second- and third-order SNR results for FastICA Order of feature interactions Order of feature interactions is unable to recover a reasonable signal because the noise Figure 5: Experimental analysis of the scalability of number is more dominant. IGBSS has shown superior performance of parameters and higher-order features in the model for and is able to recover the signal for second- and third-order both natural gradient approach and gradient descent interactions with better scores for both RMSE and SNR. the baseline approaches. NMF is typically NP-hard with an exponential runtime complexity of O(2N LM) with respect 3.4 RUNTIME ANALYSIS to N per iteration. Note that N is usually small in BSS. Sim- ilarly, the time complexity of DL using K-SVD is O(L2M) In our experiment, we used a learning rate of 1.0 for gradient per iteration. Both the complexity of NMF and DL depends descent. Although the time complexity for each iteration of on M, which may be large for some applications of BSS. natural gradient is O(|Z|3 + |A|3 + |Ω||S|), which is larger FastICA is usually considered to be the fastest algorithm than O(|Ω||S|2) for gradient descent, natural gradient is for BSS as its complexity does not depend on M. FastICA able to reach convergence faster because it has quadratic con- overcomes the issue by taking the expectation with respect vergence and requires significantly less iterations compared to the samples before learning the mixing matrix to reduce to gradient descent, which linearly converges. Increasing its complexity to O(NL) per iteration. In contrast, the time the size of the input will increase the size of |Ω| only, while complexity of our approach is cubic with respect to M. the number of parameters |Z|, |A| remain this same. Since the complexity of natural gradient is linear with respect to the size |Ω| of the input, increasing |Ω| does not increase the 4 CONCLUSION runtime significantly. Our experimental analysis in Figure 5 supports this analysis: our model scales linearly for both We have proposed a blind source separation (BSS) method, natural gradient and gradient descent when increasing the called Information Geometric Blind Source Separation (IG- order of interactions in our model. This is because for prac- BSS). We have formulated our approach using the log-linear tical application it is unlikely that |A| > |Z|. The runtime model, which enables us to introduce a hierarchical structure difference between natural gradient and gradient descent into its sample space to achieve BSS. We have theoretically becomes larger as the order of interactions increases. shown that IGBSS has desirable properties for BSS such as Finally, we compare the time complexity of our approach to unique recover of source signals as it solves the convex op- timization problem by minimizing the KL divergence from Jean-François Cardoso. High-order contrasts for indepen- mixed signals to source signals. We have experimentally dent component analysis. Neural Computation, 11(1): shown that IGBSS recovers images and signals closer to 157–192, 1999. the ground truth than ICA, dictionary learning, and NMF. Thanks to the flexibility of the hierarchical structure, IGBSS Pierre Comon. Independent component analysis, a new is able to separate signals with complex interactions such concept? Signal Processing, 36(3):287–314, 1994. as higher-order interactions. Our model is superior to the other approaches because it is non-degenerate and is able to Marco Congedo, Cédric Gouy-Pailler, and Christian Jut- recover the sign of the signal. Since our approach is flexible ten. On the blind source separation of human electroen- and requires less assumptions than alternative approaches, cephalogram by approximate joint diagonalization of sec- it can be applied to various real world applications such as ond order statistics. Clinical Neurophysiology, 119(12): medical imaging, signal processing, and image processing. 2677–2686, 2008. B. A. Coull and A. Agresti. Generalized log-linear mod- Acknowledgements els with random effects, with application to smoothing contingency tables. Statistical Modelling, 3(4):251–271, This work was supported by JST, PRESTO Grant Number 2003. JPMJPR1855, Japan and JSPS KAKENHI Grant Number JP21H03503 (MS). Chris HQ Ding, Tao Li, and Michael I Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 32 References (1):45–55, 2008.

Alan Agresti. Categorical Data Analysis. Wiley, 3 edition, Gerhard Gierz, Karl Heinrich Hofmann, Klaus Keimel, Jim- 2012. mie D Lawson, Michael Mislove, and Dana S Scott. Con- tinuous Lattices and Domains, volume 93. Cambridge Shun-chi Amari. Information geometry on hierarchy of university press, 2003. probability distributions. IEEE Transactions on Informa- tion Theory, 47(5):1701–1711, 2001. Aapo Hyvärinen and Erkki Oja. Independent component Shun-Ichi Amari. Natural gradient works efficiently in analysis: algorithms and applications. Neural Networks, learning. Neural Computation, 10(2):251–276, 1998. 13(4-5):411–430, 2000. Shun-Ichi. Amari. Information geometry and its appli- Takuya Isomura and Taro Toyoizumi. A local learning rule cations: Convex function and dually flat manifold. In for independent component analysis. Scientific Reports, F. Nielsen, editor, Emerging Trends in Visual Computing: 6:28073, 2016. LIX Fall Colloquium, ETVC 2008, Revised Invited Papers, pages 75–102. Springer, 2009. Quoc V Le, Alexandre Karpenko, Jiquan Ngiam, and An- drew Y Ng. ICA with reconstruction cost for efficient Shun-Ichi. Amari. Information Geometry and Its Applica- overcomplete feature learning. In Advances in Neural tions. Springer, 2016. Information Processing Systems 24, pages 1017–1025, 2011. Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwach- höfer. Information Geometry, volume 64. Springer, 2017. Daniel D Lee and H Sebastian Seung. Algorithms for non- Anthony J Bell and Terrence J Sejnowski. An information- negative matrix factorization. In Advances in Neural In- maximization approach to blind separation and blind formation Processing Systems 13, pages 556–562, 2001. deconvolution. Neural Computation, 7(6):1129–1159, 1995. Simon Luo and Mahito Sugiyama. Bias-variance trade-off in hierarchical probabilistic models using higher-order Olivier Berne, C Joblin, Y Deville, JD Smith, M Rapacioli, feature interactions. In Proceedings of the 33rd AAAI JP Bernard, J Thomas, W Reach, and A Abergel. Anal- Conference on Artificial Intelligence, pages 4488–4495, ysis of the emission of very small dust particles from 2019. spitzer spectro-imagery data using blind signal separation methods. Astronomy & Astrophysics, 469(2):575–586, Kevin P Murphy. Machine Learning: A Probabilistic Per- 2007. spective. MIT press, 2012. Christos Boutsidis and Efstratios Gallopoulos. SVD based Bruno A Olshausen and David J Field. Sparse coding with initialization: A head start for nonnegative matrix factor- an overcomplete basis set: A strategy employed by V1? ization. Pattern Recognition, 41(4):1350–1362, 2008. Vision Research, 37(23):3311–3325, 1997. Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901. Hiroshi Sawada, Nobutaka Ono, Hirokazu Kameoka, Daichi Kitamura, and Hiroshi Saruwatari. A review of blind source separation methods: two converging routes to IL- RMA originating from ICA and NMF. APSIPA Transac- tions on Signal and Information Processing, 8, 2019. Matthias Scholz, Fatma Kaplan, Charles L Guy, Joachim Kopka, and Joachim Selbig. Non-linear PCA: a missing data approach. Bioinformatics, 21(20):3887–3895, 2005. Mahito Sugiyama, Hiroyuki Nakahara, and Koji Tsuda. Ten- sor balancing on statistical manifold. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3270–3279, 2017. Ricardo Vigário, Veikko Jousmäki, Matti Hämäläinen, Ri- itta Hari, and Erkki Oja. Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In Advances in Neural Information Process- ing Systems 10, pages 229–235, 1998. Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Ro- bust PCA via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.

Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006. A APPENDIX Table 3: Signal-to-Noise Ratio (SNR) and Root Mean Square Error (RMSE) between the recovered signal and A.1 FEATURE EXTRACTION FOR A 2D POINT the latent source signal for the 2-dimensional point cloud CLOUD EXPERIMENT experiment.

We demonstrate the effectiveness of IGBSS in identification Model PCA ICA IGBSS of independent components on a 2-dimensional point cloud RMSE 2.011 1.445 1.421 to be used for feature extraction or dimensionality reduction. SNR 25.997 27.431 27.503 In our experiment, we generate a 2-dimensional point cloud using two standard Student’s t-distribution with 1.3 degree of freedom and have scaled the first dimension by 1/5 and We run the experiment on the dataset used for the experi- the second dimension by 1/10 to the point cloud, illustrated ment 1 for the first order experiment and have shown the in Figure 7a. Then we have randomly generated a mixing output of several runs in FastICA to show the problem of matrix for our experiment to generate a mixed signal shown the sign inversion in Figure 6. For the 6 runs, we can see in Figure 7b. We run the experiment on our model IGBSS that none of the experiments were able to obtain the cor- using min-max normalization as a pre-processing step and rect sign of the signal. This means that applying FastICA compare it to PCA and ICA. We apply the reverse trans- to applications where the sign of the signal is important is formation of the min-max normalization on the recovered problematic. signal. We have plotted experimental results in Figure 7. From the results, we can see that PCA is able to recover the same scale of the point cloud. However, the sign of the signal is not recovered as we have recovered reversed sign of the signal. PCA also recovers signals which are orthogonal to the largest variance. Therefore the axes of the point cloud recovered by PCA does not align with the source signal in Figure 7a, that is, the axes do not run parallel to the x- and y-axes but instead is still in the same orientation as the mixed signal. This is not what we want as the signal is still mixed, and we would like to recover the signal in the same orientation as the source signal in blind source separation. ICA aims to recover statistically independent signals that are generally considered as the axes with the largest variances and not necessarily orthogonal to each other. However, the limitations of ICA is that it is unable to recover the sign and the scale of the signal. Therefore the scale of the recovered signal does not match with the source signal. In our experi- ment, we have plotted the results with unit variance as the recovered signal is generally unnormalized in ICA. Since our experiment is synthetically generated, we are able to quantitatively measure the the error in each approach by normalizing both the recovered signal and the source signal by its standard deviation then computing the root mean squared error (RMSE) and the signal-to-noise ratio (SNR). The results of this is shown in Table 3. Our proposed approach IGBSS has clear advantages, where it is able to recover the same orientation as the source signal as well as preserve the signal.

A.2 SIGN INVERSION IN ICA

We demonstrate the problem of the sign inversion in ICA. We use the same experimental set-up explained in Sec- tion 3.1 on blind source separation for affine transformation. (a) GT (b) Mixed (c) Run 1 (d) Run 2 (e) Run 3 (f) Run 4 (g) Run 5 (h) Run 6

Figure 6: Six different runs of FastICA with the same experimental input experimental dataset as exp1 with first order interactions. The different results can demonstrate that the FastICA model is non-convex leading to potential problemic results such as the sign inversion. (a) Source Signal (b) Mixed Signal

(c) PCA Recovery (d) ICA Recovery

(e) IGBSS Recovery

Figure 7: 2-dimensional point cloud experiment.