Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2008, Article ID 765615, 9 pages
doi:10.1155/2008/765615

Research Article

Complex-Valued Adaptive Signal Processing Using Nonlinear Functions

Hualiang Li and Tülay Adalı

Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250, USA

Correspondence should be addressed to Tülay Adalı, [email protected]

Received 16 October 2007; Accepted 14 February 2008

Recommended by Aníbal Figueiras-Vidal

We describe a framework based on Wirtinger calculus for adaptive signal processing that enables efficient derivation of algorithms by directly working in the complex domain and taking full advantage of the power of complex-domain nonlinear processing. We establish the basic relationships for optimization in the complex domain and the real-domain equivalences for first- and second-order derivatives by extending the work of Brandwood and van den Bos. Examples in the derivation of first- and second-order update rules are given to demonstrate the versatility of the approach.

Copyright © 2008 H. Li and T. Adalı. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Most of today's challenging signal processing applications require techniques that are nonlinear, adaptive, and capable of online processing. There is also a need for approaches that process complex-valued data, as such data arise in a good number of scenarios, for example, when processing radar and magnetic resonance data as well as communications data, and when working in a transform domain such as the frequency domain. Even though complex signals play such an important role, many engineering shortcuts have typically been taken in their treatment, preventing full utilization of the power of complex-domain processing as well as of the information in the real and imaginary parts of the signal.

The main difficulty arises from the fact that in the complex domain, analyticity, that is, differentiability in a given open set as described by the Cauchy-Riemann equations [1], imposes a strong structure on the function itself. Thus the analyticity condition is not satisfied for many functions of practical interest, most notably the cost (objective) functions used, as these are typically real valued and hence nonanalytic in the complex domain. Pseudogradients are defined instead, and still not through a consistent definition in the literature, and when vector gradients are needed, transformations C^N → R^{2N} are commonly used. These transformations are isomorphic and allow the use of real-valued calculus in the computations, including well-defined gradients and Hessians that can in the end be transformed back to the complex domain. The approach facilitates the computations but increases the dimensionality of the problem and might not be practical for functions that are nonlinear, since in this case the functional form might not be easily separable into real and imaginary parts.

Another issue that arises in the nonlinear processing of complex-valued data is due to the conflict between the boundedness and differentiability of complex functions. This result is stated by Liouville's theorem: a bounded entire function must be a constant in the complex domain [1]. Hence, to use a flexible nonlinear model such as the nonlinear regression model, one cannot identify a complex nonlinear function (C → C) that is bounded everywhere on the entire complex domain. A practical solution that satisfies the boundedness requirement has been to process the real and imaginary parts (or the magnitude and phase) separately through bounded real-valued nonlinearities (see, e.g., [2-6]). This solution provides reasonable approximation ability but is ad hoc and does not fully exploit the efficiency of complex representations, both in terms of parameterization (number of parameters to estimate) and in terms of learning algorithms to estimate the parameters, since we cannot define true gradients when working with these functions.

In this paper, we define a framework that allows taking full advantage of the power of complex-valued processing, in particular when working with nonlinear functions, and eliminates the need for either of the two common engineering practices mentioned above. The framework we develop is based on Wirtinger calculus [7] and extends the work of Brandwood [8] and van den Bos [9] to define the basic formulations for the derivation of algorithms and their analyses in the complex domain. We show how the framework also naturally admits the use of nonlinear functions that are analytic, rather than the pseudocomplex nonlinear functions defined using real-valued nonlinearities. Analytic complex nonlinear functions have been shown to provide efficient representations [10, 11] and to be universal approximators when used as activation functions in a single-hidden-layer multilayer perceptron (MLP) network [12].

The work by Brandwood [8] and van den Bos [9] emphasizes the importance of working with complex-valued gradient and Hessian operators rather than transforming the problem to the real domain. Both contributions, though it is acknowledged in neither of the papers, make use of Wirtinger calculus [7], which provides an elegant way to bypass the limitation imposed by the strict definition of differentiability in the complex domain. Wirtinger calculus relaxes the traditional definition of differentiability in the complex domain (which we refer to as complex differentiability) by defining a form that is much easier to satisfy and that includes almost all functions of practical interest, including functions that are C^N → R. The attractiveness of the formulation stems from the fact that, though the derivatives defined within the framework do not satisfy the Cauchy-Riemann conditions, they obey all the rules of calculus, including the chain rule and the differentiation of products and quotients. Thus all computations in the derivation of an algorithm can be carried out as in the real case. We provide the connections between the gradient and Hessian formulations given in [9], described in C^{2N} and R^{2N}, and the complex C^N-dimensional space, and establish the basic relationships for optimization in the complex domain, including first- and second-order Taylor series expansions.

Three specific examples are given to demonstrate the application of the framework to complex-valued adaptive signal processing and to show how the resulting algorithms enable the use of the true processing power of the complex domain. The examples include a multilayer perceptron filter design with the derivation of the gradient update (backpropagation) rule, independent component analysis using maximum likelihood, and the derivation of an efficient second-order learning rule, the conjugate gradient algorithm, for the complex domain.

The next section introduces the main tool, Wirtinger calculus for optimization in the complex domain, and the key results given in [8, 9], which we use to establish the main theory presented in Section 3. In Section 3, we consider both vector and matrix optimization and establish the equivalences between first- and second-order derivatives for the real and complex cases, providing the fundamental results for C^N and C^{N×M}. Section 4 presents the application examples and Section 5 gives a short discussion.
2. COMPUTATION OF GRADIENTS IN THE COMPLEX DOMAIN USING WIRTINGER CALCULUS

The fundamental result for the differentiability of a complex-valued function

    f(z) = u(x, y) + jv(x, y),    (1)

where z = x + jy, is given by the Cauchy-Riemann equations [1]:

    ∂u/∂x = ∂v/∂y,    ∂v/∂x = −∂u/∂y,    (2)

which summarize the conditions for the derivative to assume the same value regardless of the direction of approach as Δz → 0. These conditions, when considered carefully, make it clear that the definition of complex differentiability is quite stringent and imposes a strong structure on u(x, y) and v(x, y), the real and imaginary parts of the function, and consequently on f(z). Also, most cost (objective) functions obviously do not satisfy the Cauchy-Riemann equations, as these functions are typically f : C → R and thus have v(x, y) = 0.

An elegant approach due to Wirtinger [7] relaxes this strong requirement for differentiability and defines a less stringent form for the complex domain. More importantly, it describes how this new definition can be used to define complex differential operators that allow the computation of derivatives in a very straightforward manner in the complex domain, simply by using real differentiation results and procedures.

In this development, the commonly used definition of differentiability that leads to the Cauchy-Riemann equations is identified as complex differentiability, and functions that satisfy the condition on a specified open set are called complex analytic (or complex holomorphic). The more flexible form of differentiability is identified as real differentiability, and a function is called real differentiable when u(x, y) and v(x, y) are differentiable as functions of the real-valued variables x and y. One can then write the two real variables as x = (z + z*)/2 and y = −j(z − z*)/2 and use the chain rule to derive the operators for differentiation given in the theorem below.

The key step in the derivation is regarding the two variables z and z* as independent of each other, which is also the main trick that allows us to make use of the elegance of Wirtinger calculus. Hence, we consider a given function f : C → C as f : R × R → C by writing it as f(z) = f(x, y) and make use of the underlying R^2 structure. The main result in this context is stated by Brandwood as follows [8].

Theorem 1. Let f : R × R → C be a function of the real variables x and y such that g(z, z*) = f(x, y), where z = x + jy, and such that g is analytic with respect to z* and z independently. Then,

(i) the partial derivatives

    ∂g/∂z = (1/2)(∂f/∂x − j ∂f/∂y),    ∂g/∂z* = (1/2)(∂f/∂x + j ∂f/∂y)    (3)

can be computed by treating z* as a constant in g and by treating z as a constant, respectively;

(ii) a necessary and sufficient condition for f to have a stationary point is that ∂g/∂z = 0. Similarly, ∂g/∂z* = 0 is also a necessary and sufficient condition.

Therefore, when evaluating the gradient, we can directly compute the derivatives with respect to the complex argument, rather than calculating individual real-valued gradients as typically performed in the literature (see, e.g., [2, 6, 12, 13]). The requirement for the analyticity of g(z, z*) with respect to z and z* independently is equivalent to the condition of real differentiability of f(x, y), since we can move from one form of the function to the other using the simple linear transformation given above [1, 14]. When f(z) is complex analytic, that is, when the Cauchy-Riemann conditions hold, g(·) becomes a function of z only, and the two derivatives, the one given in the theorem and the traditional one, coincide.

The case we are typically interested in for the development of signal processing algorithms is f : R × R → R, which is a special case of the result stated in the theorem. Hence we can employ the same procedure, taking derivatives independently with respect to z and z*, in the optimization of a real-valued function as well. In the rest of the paper, we consider such functions, as these are the costs used in machine learning, though we identify the deviation, if any, from the general f : R × R → C case for completeness.

As a simple example, consider the function g(z, z*) = zz* = |z|^2 = x^2 + y^2 = f(x, y). We have (1/2)(∂f/∂x + j ∂f/∂y) = x + jy = z, which we can also evaluate as ∂g/∂z* = z, that is, by treating z as a constant in g when calculating the derivative.
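As a quick numerical illustration of Theorem 1 (our own sketch in Python, not part of the original text), the snippet below evaluates the operators in (3) through real central differences along x and y, and confirms ∂g/∂z = z* and ∂g/∂z* = z for the nonanalytic function g(z, z*) = zz*:

    import numpy as np

    def wirtinger(f, z, h=1e-6):
        # Wirtinger derivatives via (3): built from the real partials of f
        fx = (f(z + h) - f(z - h)) / (2 * h)            # ∂f/∂x
        fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # ∂f/∂y
        return 0.5 * (fx - 1j * fy), 0.5 * (fx + 1j * fy)  # ∂g/∂z, ∂g/∂z*

    g = lambda z: z * np.conj(z)     # |z|^2, real valued and nonanalytic
    z0 = 1.5 - 0.8j
    dg_dz, dg_dzc = wirtinger(g, z0)
    print(dg_dz, np.conj(z0))        # ∂g/∂z  matches z0*
    print(dg_dzc, z0)                # ∂g/∂z* matches z0, as in the text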
The complex gradient defined by Brandwood [8] has been extended by van den Bos [9], who defines a complex gradient and Hessian in C^{2N} through the mapping

    z ∈ C^N → ẑ = [z_1, z_1*, z_2, z_2*, ..., z_N, z_N*]^T ∈ C^{2N}.    (4)

Note that the mapping allows a direct extension of Wirtinger's result to the multidimensional space through N mappings of the form (z_{R,k}, z_{I,k}) → (z_k, z_k*), where z = z_R + jz_I, so that one can make use of Wirtinger derivatives. Since the transformation from R^2 to C^2 is a simple linear invertible mapping, one can work in either space, depending on the convenience offered by each. In [9], it is shown that such a transformation allows the definition of a Hessian, and hence of a Taylor series expansion, very similar to the one in the real case, and the Hessian matrix H defined in this manner is naturally linked to the complex C^{N×N} Hessian G in that if λ is an eigenvalue of G, then 2λ is the corresponding eigenvalue of H. The result implies that the positivity of the eigenvalues as well as the conditioning of the Hessian matrices are shared properties of the two representations. For example, in [15], this property has been utilized to derive the local stability conditions of the complex-valued maximization-of-negentropy algorithm for performing independent component analysis. In the next section, we establish the connections of the results of [9] to C^N for first- and second-order derivatives, such that efficient second-order optimization algorithms can be derived by directly working in the original C^N space where the problems are typically defined.

3. OPTIMIZATION IN THE COMPLEX DOMAIN

3.1. Vector case

We define ⟨·, ·⟩ as the scalar inner product between two matrices W and V,

    ⟨W, V⟩ = Trace(V^H W),    (5)

so that ⟨W, W⟩ = ‖W‖^2_Fro, where the subscript Fro denotes the Frobenius norm. For vectors, the definition simplifies to ⟨w, v⟩ = v^H w.
We define the gradient vector ∇_z = [∂/∂z_1, ∂/∂z_2, ..., ∂/∂z_N]^T for the vector z = [z_1, z_2, ..., z_N]^T with z_k = z_{R,k} + jz_{I,k} in order to write the first-order Taylor series expansion of a function g(z, z*) : C^N × C^N → R as

    Δg = ⟨Δz, ∇_{z*}g⟩ + ⟨Δz*, ∇_z g⟩ = 2Re⟨Δz, ∇_{z*}g⟩,    (6)

where the last equality follows because g(·, ·) is real valued. Using the Cauchy-Schwarz-Bunyakovski inequality [16], it is straightforward to show that the first-order change in g(·, ·) is maximized when Δz and the gradient ∇_{z*}g are collinear. Hence, it is the gradient with respect to the conjugate of the variable, ∇_{z*}g, that defines the direction of the maximum rate of change in g(·, ·) with respect to z, not ∇_z g as sometimes noted in the literature. Thus the gradient optimization of g(·, ·) should use the update

    Δz = z_{t+1} − z_t = −μ∇_{z*}g,    (7)

as this form leads to a nonpositive increment Δg = −2μ‖∇_{z*}g‖^2, while the update Δz = −μ∇_z g results in increments Δg = −2μRe⟨∇_{z*}g, ∇_z g⟩, which are not guaranteed to be nonpositive.
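The difference between the two candidate directions can be made concrete with a small experiment (our own sketch, not from the paper; the cost and step size are arbitrary choices). For g(z, z*) = ‖z − a‖^2 we have ∇_{z*}g = z − a and ∇_z g = (z − a)*; the update (7) converges, while stepping along −∇_z g lets the imaginary part of the error grow:

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.normal(size=3) + 1j * rng.normal(size=3)
    g = lambda z: np.sum(np.abs(z - a) ** 2)    # real-valued cost
    mu = 0.1
    z_conj = np.zeros(3, dtype=complex)
    z_plain = np.zeros(3, dtype=complex)
    for _ in range(100):
        z_conj = z_conj - mu * (z_conj - a)             # update (7)
        z_plain = z_plain - mu * np.conj(z_plain - a)   # wrong direction
    print(g(z_conj))    # near 0: monotone descent
    print(g(z_plain))   # grows: the increment is not sign definite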
Based on (6), and similar to a scalar function of two real vectors, the second-order term of the Taylor series expansion of g(z, z*) can be written as [17]

    Δ²g = (1/2) Δz^T (∂²g/∂z∂z^T) Δz + Δz^H (∂²g/∂z*∂z^T) Δz + (1/2) Δz^H (∂²g/∂z*∂z^H) Δz*.    (8)

Next, we derive the same complex gradient update rule using another approach, which provides the connection between the real and complex domains. We first introduce the following fundamental mappings, which are similar in nature to those introduced in [9].

Proposition 1. Given a function g(z, z*) : C^N × C^N → R that is real differentiable and f : R^{2N} → R such that g(z, z*) = f(w), where z = [z_1, z_2, ..., z_N]^T, w = [z_{R,1}, z_{I,1}, z_{R,2}, z_{I,2}, ..., z_{R,N}, z_{I,N}]^T, and z_k = z_{R,k} + jz_{I,k}, k ∈ {1, 2, ..., N}, then

    ∂f/∂w = U^H (∂g/∂z̃*),
    ∂²f/∂w∂w^T = U^H (∂²g/∂z̃*∂z̃^T) U,    (9)

where U is defined by z̃ ≜ [z^T, (z*)^T]^T = Uw and satisfies U^{-1} = (1/2)U^H.

Proof. Define the 2 × 2 matrix

    J = [1 j; 1 −j]    (10)

and the vector ẑ ∈ C^{2N} as ẑ = [z_1, z_1*, z_2, z_2*, ..., z_N, z_N*]^T. Then

    ẑ = U′w,    (11)

where U′ = diag{J, J, ..., J} ∈ C^{2N×2N} satisfies (U′)^{-1} = (1/2)(U′)^H [9]. Next, we can find a permutation matrix P such that

    z̃ ≜ [z_1, z_2, ..., z_N, z_1*, z_2*, ..., z_N*]^T = Pẑ = PU′w = Uw,    (12)

where U ≜ PU′ satisfies U^{-1} = (1/2)U^H since P^{-1} = P^T. Using the Wirtinger derivatives in (3), we obtain

    ∂g/∂z̃* = (1/2) U (∂f/∂w),    (13)

which establishes the first-order connection between the complex gradient and the real gradient. By applying the two derivatives in (3) recursively to obtain the second-order derivative of g, we obtain

    ∂²f/∂w∂w^T = (U′)^H (∂²g/∂ẑ*∂ẑ^T) U′ = U^H (∂²g/∂z̃*∂z̃^T) U.    (14)

The first equality is already proved in [18]. The second equality is obtained by simply rearranging the entries in ∂²g/∂ẑ*∂ẑ^T to form ∂²g/∂z̃*∂z̃^T. Therefore, the second-order Taylor expansion given in (8) can be rewritten as

    Δg = Δz̃^T (∂g/∂z̃) + (1/2) Δz̃^H (∂²g/∂z̃*∂z̃^T) Δz̃,    (15)

which demonstrates that the C^{2N×2N} Hessian in (15) can be decomposed into the three C^{N×N} Hessians in (8).

The mappings given in Proposition 1 are similar to those defined in [9]. However, the mappings given in [9] include redundancy since they operate in C^{2N} and the dimension cannot be further reduced. This is not convenient since the cost function g(z) is normally defined in C^N, and the C^{2N} mapping as described by ẑ cannot always be easily applied to define g(ẑ), as observed in [18]. In the following two propositions, we show how to use the same mappings we defined above to obtain first- and second-order derivatives, and hence algorithms, in C^N in an efficient manner.

Proposition 2. Given functions g and f defined as in Proposition 1, one has the complex gradient update rule

    Δz = −2μ (∂g/∂z*),    (16)

which is equivalent to the real gradient update rule

    Δw = −μ (∂f/∂w),    (17)

where z and w are as defined in Proposition 1 as well.

Proof. Assuming f is known, the gradient update rule in the real domain is

    Δw = −μ (∂f/∂w).    (18)

Mapping back into the complex domain, we obtain

    Δz̃ = UΔw = −μU (∂f/∂w) = −2μ (∂g/∂z̃*).    (19)

The dimension of the update rule can be further decreased as

    [Δz; Δz*] = −2μ [∂g/∂z*; ∂g/∂z]  ⟹  Δz = −2μ (∂g/∂z*).    (20)

Proposition 3. Given functions g and f defined as in Proposition 1, one has the complex Newton update rule

    Δz = −(H_2* − H_1* H_2^{-1} H_1)^{-1} (∂g/∂z* − H_1* H_2^{-1} ∂g/∂z),    (21)

which is equivalent to the real Newton update rule

    Δw = −(∂²f/∂w∂w^T)^{-1} (∂f/∂w),    (22)

where

    H_1 = ∂²g/∂z∂z^T,    H_2 = ∂²g/∂z∂z^H.    (23)

Proof. The pure Newton method in the real domain takes the form given in (22). Using the equalities given in Proposition 1, it can easily be shown that the Newton update in (22) is equivalent to

    (∂²g/∂z̃*∂z̃^T) Δz̃ = −(∂g/∂z̃*).    (24)

Using the definitions for H_1 and H_2 given in (23), we can rewrite (24) as

    [H_2*  H_1*; H_1  H_2] [Δz; Δz*] = −[∂g/∂z*; ∂g/∂z].    (25)

If the Hessian ∂²g/∂z̃*∂z̃^T is positive definite, we have

    [Δz; Δz*] = −[M_11  M_12; M_21  M_22] [∂g/∂z*; ∂g/∂z],    (26)

where

    M_11 = (H_2* − H_1* H_2^{-1} H_1)^{-1},
    M_12 = H_2^{−*} H_1* (H_1 H_2^{−*} H_1* − H_2)^{-1},
    M_21 = (H_1 H_2^{−*} H_1* − H_2)^{-1} H_1 H_2^{−*},
    M_22 = (H_2 − H_1 H_2^{−*} H_1*)^{-1},    (27)

and H^{−*} denotes (H*)^{-1}. Since ∂²g/∂z̃*∂z̃^T is Hermitian, we finally obtain the complex Newton rule as

    Δz = −(H_2* − H_1* H_2^{-1} H_1)^{-1} (∂g/∂z* − H_1* H_2^{-1} ∂g/∂z).    (28)

The expression for Δz* is the conjugate of (28).

3.2. Matrix case

The extension from the vector gradient to the matrix gradient is straightforward. For a real-differentiable g(W, W*) : C^{N×N} × C^{N×N} → R, we can write the first-order expansion as

    Δg = ⟨ΔW, ∂g/∂W*⟩ + ⟨ΔW*, ∂g/∂W⟩ = 2Re⟨ΔW, ∂g/∂W*⟩,    (29)

where ∂g/∂W is the N × N matrix whose (i, j)th entry is the partial derivative of g with respect to w_ij. By arranging the matrix gradient into a vector and using the Cauchy-Schwarz-Bunyakovski inequality [16], it is easy to show that the matrix gradient ∂g/∂W* defines the direction of the maximum rate of change in g with respect to W.

For local stability analysis, Taylor expansions up to the second order are also frequently needed. Since the first-order matrix gradient already takes a matrix form, here we only provide the second-order expansion with respect to the entries of the matrix W. From (8), we obtain

    Δ²g = (1/2) Σ_{i,j,k,l} [(∂²g/∂w_ij∂w_kl) dw_ij dw_kl + (∂²g/∂w_ij*∂w_kl*) dw_ij* dw_kl*] + Σ_{i,j,k,l} (∂²g/∂w_ij∂w_kl*) dw_ij dw_kl*.    (30)

We can use the first-order Taylor series expansion to derive the relative gradient [19] update rule for the complex case, which is usually extended directly to the complex case without a derivation [5, 13, 20]. To write the relative gradient rule, we consider an update of the parameter matrix W in the invariant form (ΔW)W [19]. We then write the first-order Taylor series expansion for the perturbation (ΔW)W as

    Δg = ⟨(ΔW)W, ∂g/∂W*⟩ + ⟨(ΔW)* W*, ∂g/∂W⟩ = 2Re⟨ΔW, (∂g/∂W*) W^H⟩    (31)

to determine the quantity that maximizes the rate of change in the function. The complex relative gradient of g at W is then written as (∂g/∂W*) W^H, and the relative gradient update term is

    ΔW = −μ (∂g/∂W*) W^H W.    (32)

Upon substitution of ΔW into (29), we observe that Δg = −2μ‖(∂g/∂W*) W^H‖^2_Fro, a nonpositive quantity, so (32) is a proper update term. The relative gradient can be regarded as a special case of the natural gradient [21] in the matrix space, but it provides the additional advantage that it can easily be extended to nonsquare matrices. In Section 4.2, we show how the relative gradient update rule for independent component analysis based on maximum likelihood can be derived in a very straightforward manner in the complex domain using (32) and Wirtinger calculus.
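Before turning to the applications, the complex Newton rule (28) can be checked numerically on a quadratic cost, for which a single Newton step must land on the stationary point. The sketch below is our own (the quadratic g(z, z*) = z^H A z + Re(z^T B z) + 2Re(b^H z) and all constants are arbitrary choices); for this cost, ∂g/∂z* = Az + B*z* + b, H_1 = B, and H_2 = A^T:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 4
    M = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
    A = M @ M.conj().T + N * np.eye(N)   # Hermitian positive definite
    B0 = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
    B = 0.1 * (B0 + B0.T)                # complex symmetric, kept small
    b = rng.normal(size=N) + 1j * rng.normal(size=N)

    grad_conj = lambda z: A @ z + B.conj() @ z.conj() + b   # ∂g/∂z*
    H1, H2 = B, A.T

    z0 = rng.normal(size=N) + 1j * rng.normal(size=N)
    gc = grad_conj(z0)
    gz = gc.conj()                       # ∂g/∂z = (∂g/∂z*)* for real g
    S = H2.conj() - H1.conj() @ np.linalg.solve(H2, H1)
    dz = -np.linalg.solve(S, gc - H1.conj() @ np.linalg.solve(H2, gz))  # (28)
    print(np.linalg.norm(grad_conj(z0 + dz)))   # ~0: one step solves it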
4. APPLICATION EXAMPLES

We demonstrate the application of the optimization framework introduced in Section 3 with three examples. The first two examples demonstrate the derivation of update rules for complex-valued nonlinear signal processing. In the third example, we show how the relationship for Newton updates given by Proposition 3 can be utilized to derive efficient update rules such as the conjugate gradient algorithm for the complex domain.

4.1. Fully complex MLP for nonlinear adaptive filtering

The multilayer perceptron filter, or network, provides a good example of the difficulties that arise in complex-valued processing as discussed in the introduction. These are due to the selection of activation functions for use in the filter structure and to the optimization procedure for deriving the weight update rule.

The first issue is due to the conflict between the boundedness and differentiability of functions in the complex domain. This result is stated by Liouville's theorem: a bounded entire function must be a constant in the complex domain [1], where entire refers to differentiability everywhere. For example, the sigmoid nonlinearity, which has been the most typically used activation function for real-valued MLPs, has periodic singular points. Since boundedness is deemed important for the stability of algorithms, a practical solution when designing MLPs for the complex domain has been to define nonlinear functions that process the real and imaginary parts separately through bounded real-valued nonlinearities, as in [2],

    f(z) ≜ f(x) + jf(y)    (33)

for a complex variable z = x + jy, using functions f : R → R. Another approach has been to define joint nonlinear complex activation functions as in [3, 4], respectively,

    f(z) ≜ z / (c + |z|/d),    f(re^{jθ}) ≜ tanh(r/m) e^{jθ}.    (34)

As shown in [10], these functions cannot utilize the phase information effectively and, in applications that introduce significant phase distortion such as the equalization of saturating-type channels, are not effective as complex-domain nonlinear filters.

Any function f(z) that is analytic for |z| < R can instead be used directly as a fully complex activation function within its domain of convergence. Such functions include polynomials and most trigonometric functions and their hyperbolic counterparts. In particular, all the elementary transcendental functions proposed in [12] satisfy the property and can be used as effective activation functions. These functions, though unbounded, provide significant performance advantages in challenging signal processing problems such as the equalization of highly nonlinear channels [10], in terms of superior convergence characteristics and better generalization abilities, through the efficient representation of the underlying problem structure. The singularities do not pose any practical problems in the implementation, except that some care is required in the selection of their parameters when training these networks.
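As an illustration (ours, not in the original), the fully complex tanh, one of the elementary transcendental functions of [12], is well behaved for typical arguments even though it has poles at z = j(π/2 + kπ):

    import numpy as np

    for z in [0.3 + 0.2j, 1.0 - 0.5j, 1j * np.pi / 2 + 1e-3]:
        print(z, np.tanh(z))
    # |tanh(z)| stays moderate except in a tiny neighborhood of a pole,
    # which is why the unboundedness rarely causes problems in training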

Motivated by these examples, a fundamental result for complex nonlinear approximation is given in [12], where the result on the approximation ability of the multilayer perceptron is extended to the complex domain by classifying nonlinear functions based on their singularities. To establish the universal approximation property in the complex domain, a number of elementary transcendental functions are first classified according to the nature of their singularities as those with removable, isolated, and essential singularities. Based on this classification, three types of approximation theorems are given. The approximation theorems for the first two classes of functions are very general and resemble the universal approximation theorem for the real-valued feedforward multilayer perceptron that was shown almost concurrently by multiple authors in 1989 [22-24]. The third approximation theorem for the complex multilayer perceptron is unique and is related to the power series approximation that can represent any analytic function arbitrarily closely in the deleted neighborhood of a singularity. This approximation is uniform only in the analytic domain of convergence, whose radius is defined by the closest singularity.

The second issue that arises when designing MLPs in the complex domain has to do with the optimization of the chosen cost function to derive the parameter update rule. As an example, consider the most commonly used MLP structure, with a single hidden layer, as shown in Figure 1.

Figure 1: A single hidden layer MLP filter. (Inputs z_1, ..., z_M; hidden units x_n = g(Σ_m v_nm z_m); outputs y_k = h(Σ_n w_kn x_n).)

For the MLP filter shown in Figure 1, where y_k is the output and z_m the input, when the activation functions g(·) and h(·) are chosen as functions that are C → C as in [11, 12], we can directly write the backpropagation update equations using Wirtinger derivatives.
If the cost function is chosen as the squared error at the output, we have

    J(V, W) = Σ_k (d_k − y_k)(d_k − y_k)*,    (35)

where y_k = h(Σ_n w_kn x_n) and x_n = g(Σ_m v_nm z_m). Note that if both activation functions h(·) and g(·) satisfy the property [f(z)]* = f(z*), then the cost function assumes the form J(V, W) = G(z)G(z*), making it clear how practical the derivation of the update rule will be using Wirtinger calculus, since we then treat the two variables z and z* as independent in the computation of the derivatives. On the other hand, when any of the activation functions given in (33) and (34) are used, it is clear that the evaluation of the gradients has to be performed through separate real and imaginary part evaluations as traditionally done, which can easily get quite cumbersome [2, 10].

For the output units, we have ∂y_k/∂w_kn* = 0, and therefore

    ∂J/∂w_kn* = (∂J/∂y_k*)(∂y_k*/∂w_kn*)
              = [∂((d_k − y_k)(d_k − y_k)*)/∂y_k*] [∂h*(Σ_n w_kn* x_n*)/∂w_kn*]    (36)
              = −(d_k − y_k) h′*(Σ_n w_kn* x_n*) x_n*.

We define δ_k ≜ −(d_k − y_k) h′*(Σ_n w_kn* x_n*) so that we can write ∂J/∂w_kn* = δ_k x_n*.

For the hidden layer, we first observe that v_nm is connected to every output y_k through x_n. Again, we have ∂y_k/∂v_nm* = 0 and ∂x_n/∂v_nm* = 0. Using the chain rule once again, we obtain

    ∂J/∂v_nm* = Σ_k (∂J/∂y_k*)(∂y_k*/∂x_n*)(∂x_n*/∂v_nm*)
              = (∂x_n*/∂v_nm*) Σ_k (∂J/∂y_k*)(∂y_k*/∂x_n*)
              = g′*(Σ_m v_nm* z_m*) z_m* Σ_k [−(d_k − y_k) h′*(Σ_l w_kl* x_l*) w_kn*]    (37)
              = z_m* g′*(Σ_m v_nm* z_m*) Σ_k δ_k w_kn*.

Thus, (36) and (37) define the gradient updates for computing the output and hidden layer coefficients, w_kn and v_nm, through backpropagation. Note that the derivations in this case are very similar to the real-valued case, as opposed to what is shown in [2, 10], where separate evaluations with respect to the real and imaginary parts are carried out.
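The updates (36) and (37) translate directly into code. The following is our own illustration (layer sizes, learning rate, and the choice of tanh for both activations are arbitrary); it trains the weights of Figure 1 on a single input-target pair by descending along the conjugate gradients, cf. update (7):

    import numpy as np

    rng = np.random.default_rng(2)
    M, N, K = 5, 4, 2                  # input, hidden, output dimensions
    V = 0.1 * (rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M)))
    W = 0.1 * (rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N)))
    act, dact = np.tanh, lambda s: 1.0 - np.tanh(s) ** 2   # fully complex
    z = rng.normal(size=M) + 1j * rng.normal(size=M)
    d = 0.5 * (rng.normal(size=K) + 1j * rng.normal(size=K))

    mu = 0.1
    for _ in range(500):
        sx = V @ z; x = act(sx)        # x_n = g(sum_m v_nm z_m)
        sy = W @ x; y = act(sy)        # y_k = h(sum_n w_kn x_n)
        delta = -(d - y) * np.conj(dact(sy))            # δ_k from (36)
        dJ_dWc = np.outer(delta, np.conj(x))            # ∂J/∂w_kn* = δ_k x_n*
        dJ_dVc = np.outer(np.conj(dact(sx)) * (W.conj().T @ delta),
                          np.conj(z))                   # eq. (37)
        W = W - mu * dJ_dWc
        V = V - mu * dJ_dVc
    print(np.sum(np.abs(d - y) ** 2))  # squared error (35), reduced by training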

4.2. Complex maximum likelihood approach to independent component analysis

Independent component analysis (ICA) for separating complex-valued signals is needed in a number of applications such as medical image analysis, radar, and communications. In ICA, the observed data are typically expressed as a linear combination of independent latent variables such that x = As, where s = [s_1, s_2, ..., s_N]^T is the vector of sources, x = [x_1, x_2, ..., x_N]^T is the vector of observed random variables, and A is the mixing matrix. We consider the simple case where the number of independent variables is the same as the number of observed mixtures. The main task of the ICA problem is to estimate a separating matrix W that yields the independent components through s = Wx. Nonlinear ICA approaches such as maximum likelihood provide practical and efficient solutions to the problem. When deriving the update rule in the complex domain, however, the optimization is not straightforward and can easily become cumbersome [13, 25]. To alleviate the problem, the relative gradient framework of [19] has been used along with isomorphic transformations C^N → R^{2N} to derive the update equations in [25]. As we show next, Wirtinger calculus allows a much more straightforward derivation procedure and, in addition, provides a convenient formulation for working with probabilistic descriptions such as the probability density function (pdf) in the complex domain.

We define the pdf of a complex random variable X ≡ X_R + jX_I as p_X(x) ≡ p_{X_R X_I}(x_R, x_I), and the expectation of g(X) is given by E{g(X)} = ∫∫ g(x_R + jx_I) p_X(x) dx_R dx_I for any measurable function g : C → C. The traditional ICA problem determines a weight matrix W such that y = Wx approximates the sources s subject to the permutation and scaling ambiguity. To write the density transformation, we consider the mapping C^N → R^{2N} such that ȳ = W̄x̄ = s̄, where ȳ = [y_R^T y_I^T]^T, W̄ = [W_R  −W_I; W_I  W_R], x̄ = [x_R^T x_I^T]^T, and s̄ = [s_R^T s_I^T]^T. Given T independent samples x(t), we write the log-likelihood function as [26]

    l(y, W) = log |det(W̄)| + Σ_{k=1}^{N} log p_k(y_k),    (38)

where p_k is the density function for the kth source. Maximization of l is equivalent to minimization of l′ ≜ −l. Simple algebraic and differential calculus yields

    dl′ = −tr(dW̄ W̄^{-1}) + ψ^T(ȳ) dȳ,    (39)

where ψ(ȳ) is a 2N × 1 column vector with components

    ψ(ȳ) = −[∂log p_1(y_1)/∂y_{R,1} ··· ∂log p_N(y_N)/∂y_{R,N}  ∂log p_1(y_1)/∂y_{I,1} ··· ∂log p_N(y_N)/∂y_{I,N}]^T.    (40)

We write log p_s(y_R, y_I) = log p_s(y, y*), and using Wirtinger calculus it is straightforward to show that

    ψ^T(ȳ) dȳ = ψ^T(y, y*) dy + ψ^H(y, y*) dy*,    (41)

where ψ(y, y*) is an N × 1 column vector with the complex components

    ψ_k(y_k, y_k*) = −∂log p_k(y_k, y_k*)/∂y_k.    (42)

Defining the 2N × 2N matrix P = (1/2)[I  jI; jI  I], we obtain

    tr(dW̄ W̄^{-1}) = tr(dW̄ P P^{-1} W̄^{-1})
                   = tr{[dW*  jdW; jdW*  dW] [W*  jW; jW*  W]^{-1}}    (43)
                   = tr(dW W^{-1}) + tr(dW* W^{−*}).

Therefore, we can write (39) as

    dl′ = −tr(dW W^{-1}) − tr(dW* W^{−*}) + ψ^T(y, y*) dy + ψ^H(y, y*) dy*.    (44)

Using y = Wx and defining dZ = (dW)W^{-1}, we obtain

    dy = (dW)x = (dW)W^{-1} y = dZ y,    dy* = dZ* y*.    (45)

By treating W as a constant matrix, the differential matrix dZ has components dz_ij that are linear combinations of the dw_ij, and it is a nonintegrable differential form. However, this transformation greatly simplifies the expression for the Taylor series expansion without changing the function value. It also provides an elegant approach for the derivation of the natural gradient update for maximum likelihood ICA [26]. Using this transformation, we can write (44) as

    dl′ = −tr(dZ) − tr(dZ*) + ψ^T(y, y*) dZ y + ψ^H(y, y*) dZ* y*.    (46)

Therefore, the gradient update rule for Z is given by

    ΔZ = −μ (∂l′/∂Z*) = μ(I − ψ*(y, y*) y^H),    (47)

which is equivalent to

    ΔW = μ(I − ψ*(y, y*) y^H) W    (48)

by using dZ = (dW)W^{-1}. Thus the complex score function is defined as ψ*(y, y*), as in [27], and takes a form very similar to that of the real case [26], with the difference that in the complex case the entries of the score function are defined using Wirtinger derivatives.
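As a sketch of how (48) operates in practice (our own minimal example; the source model, the score choice for p(y) ∝ e^{−c|y|} which gives ψ*(y, y*) ∝ y/|y|, and the step size are all illustrative assumptions), the following separates two supergaussian complex sources with the relative gradient update, averaged over a batch of samples:

    import numpy as np

    rng = np.random.default_rng(3)
    N, T = 2, 5000
    s = rng.laplace(size=(N, T)) + 1j * rng.laplace(size=(N, T))
    s /= np.sqrt((np.abs(s) ** 2).mean(axis=1, keepdims=True))  # unit power
    A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))  # mixing
    x = A @ s

    W, mu = np.eye(N, dtype=complex), 0.1
    for _ in range(300):
        y = W @ x
        psi_c = y / (np.abs(y) + 1e-12)       # score ψ*(y, y*), up to scale
        W = W + mu * (np.eye(N) - (psi_c @ y.conj().T) / T) @ W   # (48)
    print(np.round(np.abs(W @ A), 2))   # close to a scaled permutation matrix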
4.3. Complex conjugate gradient (CG) algorithm

The equivalence condition given by Proposition 3 allows for the easy derivation of second-order efficient update schemes, as we demonstrate next. As shown in Proposition 3, for a real differentiable function g(z, z*) : C^N × C^N → R and f : R^{2N} → R such that g(z, z*) = f(w), the update for the Newton method in R^{2N} is given by

    Δw = −(∂²f/∂w∂w^T)^{-1} (∂f/∂w)    (49)

and is equivalent to

    Δz = −(H_2* − H_1* H_2^{-1} H_1)^{-1} (∂g/∂z* − H_1* H_2^{-1} ∂g/∂z)    (50)

in C^N. To achieve convergence, we require that the search direction Δw be a descent direction when minimizing a cost function, which is the case if the Hessian ∂²f/∂w∂w^T is positive definite. However, if the Hessian is not positive definite, Δw may be an ascent direction. The line search Newton-CG method is one of the strategies for ensuring that the update is of good quality. In this strategy, we solve (49) using the CG method, terminating the updates if Δw^T (∂²f/∂w∂w^T) Δw ≤ 0.

When we do not have the definition of the function f but only have knowledge of g, we can obtain the complex conjugate gradient method through straightforward algebraic manipulations of the real CG algorithm (e.g., the one given in [28]) by using the three equalities given in (12), (13), and (14). We let s = ∂g/∂z* to write the complex CG method as shown in Algorithm 1; the complex line search Newton-CG algorithm is given in Algorithm 2.

    Given some initial gradient s_0;
    Set x_0 = 0, p_0 = −s_0, k = 0;
    while |s_k| ≠ 0
        α_k = (s_k^H s_k) / Re{p_k^T H_2 p_k* + p_k^T H_1 p_k};
        x_{k+1} = x_k + α_k p_k;
        s_{k+1} = s_k + α_k (H_2* p_k + H_1* p_k*);
        β_{k+1} = (s_{k+1}^H s_{k+1}) / (s_k^H s_k);
        p_{k+1} = −s_{k+1} + β_{k+1} p_k;
        k = k + 1;
    end (while)

Algorithm 1: Complex conjugate gradient algorithm.

    for k = 0, 1, 2, ...
        Compute a search direction Δz by applying the complex CG method,
        starting from x_0 = 0 and terminating when
        Re{p_k^T H_2 p_k* + p_k^T H_1 p_k} ≤ 0;
        Set z_{k+1} = z_k + μΔz, where μ satisfies a complex Wolfe condition.
    end (for)

Algorithm 2: Complex line search Newton-CG algorithm.

The complex Wolfe condition [28] can easily be obtained from the real Wolfe condition using a procedure similar to the one followed in Proposition 3. It should be noted that the complex conjugate gradient algorithm given here is the linear version, that is, it solves a linear system of equations. The procedure given in [28] can be used to obtain the version for a given nonlinear function.

5. DISCUSSION

We describe a framework for complex-valued adaptive signal processing based on Wirtinger calculus for the efficient computation of algorithms and their analyses. By enabling one to work directly in the complex domain without the need to increase the problem dimensionality, the framework facilitates the derivation of update rules and makes efficient second-order update procedures, such as the conjugate gradient rule, readily available for complex optimization. The examples we have provided demonstrate the simplicity offered by the approach in the derivation of both componentwise update rules, as in the case of the backpropagation algorithm for the MLP, and direct matrix updates for estimating the demixing matrix, as in the case of independent component analysis using maximum likelihood. The framework can also be used to perform the analysis of nonlinear adaptive algorithms, such as ICA using the relative gradient update given in (48), as shown in [29] in the derivation of local stability conditions.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation through Grants NSF-CCF 0635129 and NSF-IIS 0612076.

REFERENCES

[1] R. Remmert, Theory of Complex Functions, Springer, New York, NY, USA, 1991.
[2] H. Leung and S. Haykin, "The complex backpropagation algorithm," IEEE Transactions on Signal Processing, vol. 39, no. 9, pp. 2101-2104, 1991.
[3] G. M. Georgiou and C. Koutsougeras, "Complex backpropagation," IEEE Transactions on Circuits and Systems, vol. 39, no. 5, pp. 330-334, 1992.
[4] A. Hirose, "Continuous complex-valued backpropagation learning," Electronics Letters, vol. 28, no. 20, pp. 1854-1855, 1992.
[5] J. Anemüller, T. J. Sejnowski, and S. Makeig, "Complex independent component analysis of frequency-domain electroencephalographic data," Neural Networks, vol. 16, no. 9, pp. 1311-1323, 2003.
[6] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, no. 1-3, pp. 21-34, 1998.
[7] W. Wirtinger, "Zur formalen Theorie der Funktionen von mehr komplexen Veränderlichen," Mathematische Annalen, vol. 97, no. 1, pp. 357-375, 1927.
[8] D. H. Brandwood, "A complex gradient operator and its application in adaptive array theory," IEE Proceedings F: Communications, Radar and Signal Processing, vol. 130, no. 1, pp. 11-16, 1983.
[9] A. van den Bos, "Complex gradient and Hessian," IEE Proceedings: Vision, Image and Signal Processing, vol. 141, no. 6, pp. 380-382, 1994.
[10] T. Kim and T. Adalı, "Fully complex multi-layer perceptron network for nonlinear signal processing," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 32, no. 1-2, pp. 29-43, 2002.
[11] A. I. Hanna and D. P. Mandic, "A fully adaptive normalized nonlinear gradient descent algorithm for complex-valued nonlinear adaptive filters," IEEE Transactions on Signal Processing, vol. 51, no. 10, pp. 2540-2549, 2003.
[12] T. Kim and T. Adalı, "Approximation by fully complex multilayer perceptrons," Neural Computation, vol. 15, no. 7, pp. 1641-1666, 2003.
[13] J. Eriksson, A. Seppola, and V. Koivunen, "Complex ICA for circular and non-circular sources," in Proceedings of the 13th European Signal Processing Conference (EUSIPCO '05), Antalya, Turkey, September 2005.
[14] K. Kreutz-Delgado, "Lecture supplement on complex vector calculus," course notes for ECE275A: Parameter Estimation I, 2006.
[15] M. Novey and T. Adalı, "Stability analysis of complex-valued nonlinearities for maximization of nongaussianity," in Proceedings of the 31st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 5, pp. 633-636, Toulouse, France, May 2006.
[16] C. D. Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, Philadelphia, Pa, USA, 2000.
[17] T. J. Abatzoglou, J. M. Mendel, and G. A. Harada, "The constrained total least squares technique and its applications to harmonic superresolution," IEEE Transactions on Signal Processing, vol. 39, no. 5, pp. 1070-1087, 1991.
[18] A. van den Bos, "Estimation of complex parameters," in Proceedings of the 10th IFAC Symposium on System Identification (SYSID '94), vol. 3, pp. 495-499, Copenhagen, Denmark, July 1994.
[19] J.-F. Cardoso and B. H. Laheld, "Equivariant adaptive source separation," IEEE Transactions on Signal Processing, vol. 44, no. 12, pp. 3017-3030, 1996.
[20] V. Calhoun and T. Adalı, "Complex ICA for fMRI analysis: performance of several approaches," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 2, pp. 717-720, Hong Kong, April 2003.
[21] S.-I. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, no. 2, pp. 251-276, 1998.
[22] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems, vol. 2, no. 4, pp. 303-314, 1989.
[23] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[24] K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, no. 3, pp. 182-192, 1989.
[25] J.-F. Cardoso and T. Adalı, "The maximum likelihood approach to complex ICA," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 5, pp. 673-676, Toulouse, France, May 2006.
[26] S.-I. Amari, T.-P. Chen, and A. Cichocki, "Stability analysis of learning algorithms for blind source separation," Neural Networks, vol. 10, no. 8, pp. 1345-1351, 1997.
[27] T. Adalı and H. Li, "A practical formulation for computation of complex gradients and its application to maximum likelihood ICA," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), vol. 2, pp. 633-636, Honolulu, Hawaii, USA, 2007.
[28] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, New York, NY, USA, 2000.
[29] H. Li and T. Adalı, "Stability analysis of complex maximum likelihood ICA using Wirtinger calculus," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), Las Vegas, Nev, USA, April 2008.