
A Neural Transfer Function for a Smooth and Differentiable Transition Between Additive and Multiplicative Interactions

Sebastian Urban [email protected]
Institut für Informatik VI, Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany

Patrick van der Smagt [email protected]
fortiss — An-Institut der Technischen Universität München, Guerickestr. 25, 80805 München, Germany

Abstract

Existing approaches to combine both additive and multiplicative neural units either use a fixed assignment of operations or require discrete optimization to determine what function a neuron should perform. This leads either to an inefficient distribution of computational resources or an extensive increase in the computational complexity of the training procedure.

We present a novel, parameterizable transfer function based on the mathematical concept of non-integer functional iteration that allows the operation each neuron performs to be smoothly and, most importantly, differentiably adjusted between addition and multiplication. This allows the decision between addition and multiplication to be integrated into the standard backpropagation training procedure.

1. Introduction

In commonplace artificial neural networks (ANNs) the value of a neuron is given by a weighted sum of its inputs propagated through a non-linear transfer function. For illustration let us consider a simple neural network with multidimensional input and multivariate output. The input layer should be called x and the outputs y. Then the value of neuron y_i is

    y_i = \sigma\Big( \sum_j W_{ij} x_j \Big) .    (1)

The typical choice for the transfer function σ(t) is the sigmoid function σ(t) = 1/(1 + e^{−t}) or an approximation thereof. Matrix multiplication is used to jointly compute the values of all neurons in one layer more efficiently; we have

    y = \sigma(W x)    (2)

where the transfer function is applied element-wise, σ_i(t) = σ(t_i). In the context of this paper we will call such networks additive ANNs.

(Hornik et al., 1989) showed that additive ANNs with at least one hidden layer and a sigmoidal transfer function are able to approximate any function arbitrarily well given a sufficient number of hidden units. Even though an additive ANN is a universal function approximator, there is no guarantee that it can approximate a function efficiently. If the architecture is not a good match for a particular problem, a very large number of neurons is required to obtain acceptable results.

(Durbin & Rumelhart, 1989) proposed an alternative neural unit in which the weighted summation is replaced by a product, where each input is raised to a power determined by its corresponding weight. The value of such a product unit is given by

    y_i = \sigma\Big( \prod_j x_j^{W_{ij}} \Big) .    (3)

Using the laws of the exponential function this can be written as y_i = σ[exp(Σ_j W_ij log x_j)] and thus the values of a layer can also be computed efficiently using matrix multiplication, i.e.

    y = \sigma(\exp(W \log x))    (4)

where exp and log are taken element-wise. Since in general the incoming values x can be negative, the complex exponential and logarithm are used. Often no non-linearity is applied to the output of the product unit.
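For concreteness, a minimal NumPy sketch of an additive layer (2) and a product-unit layer (4) might look as follows; the layer sizes, weights and helper names are illustrative choices of this sketch, not part of the paper.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def additive_layer(W, x):
        """Additive layer, eq. (2): y = sigma(W x)."""
        return sigmoid(W @ x)

    def product_layer(W, x):
        """Product-unit layer, eq. (4): y = exp(W log x), using the complex log/exp
        so that negative inputs are allowed; no non-linearity on the output."""
        return np.exp(W @ np.log(x.astype(complex)))

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))
    x = rng.normal(size=4)                 # inputs may be negative
    print(additive_layer(W, x))
    print(product_layer(W, x))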

1.1. Hybrid summation-multiplication networks

Both types of neurons can be combined in a hybrid summation-multiplication network. Yet this poses the problem of how to distribute additive and multiplicative units over the network, i.e. how to determine whether a specific neuron should be an additive or a multiplicative unit to obtain the best results. A simple solution is to stack alternating layers of additive and product units, optionally with additional connections that skip over a product layer, so that each additive layer receives inputs from both the product layer and the additive layer beneath it. The drawback of this approach is that the resulting uniform distribution of product units will hardly be ideal.

A more adaptive approach is to learn the function of each neural unit from the provided training data. However, since addition and multiplication are different operations, until now there was no obvious way to determine the best operation during training of the network using standard neural network optimization methods such as backpropagation. An iterative algorithm to determine the optimal allocation could have the following structure: For initialization, randomly choose the operation each neuron performs. Train the network by minimizing the error function and then evaluate its performance on a validation set. Based on this performance, determine a new allocation using a discrete optimization algorithm (such as particle swarm optimization or genetic algorithms). Iterate the process until satisfactory performance is achieved. The drawback of this method is its computational complexity: to evaluate one allocation of operations the whole network must be trained, which takes from minutes to hours for moderately sized problems.
Here we propose an alternative approach, where the distinction between additive and multiplicative neurons is not discrete but continuous and differentiable. Hence the optimal distribution of additive and multiplicative units can be determined during standard gradient-based optimization.

Our approach is organized as follows: First, we introduce non-integer iterates of the exponential function in the real and complex domains. We then use these iterates to smoothly interpolate between addition (1) and multiplication (3). Finally, we show how this interpolation can be integrated and implemented in neural networks.

2. Iterates of the exponential function

2.1. Functional iteration

Let f : C → C be an invertible function. For n ∈ Z we write f^{(n)} for the n-times iterated application of f,

    f^{(n)}(z) = \underbrace{f \circ f \circ \cdots \circ f}_{n\ \text{times}}(z) .    (5)

Further let f^{(−n)} = (f^{−1})^{(n)} where f^{−1} denotes the inverse of f. We set f^{(0)}(z) = z to be the identity function. It can easily be verified that functional iteration with respect to the composition operation, i.e.

    f^{(n)} \circ f^{(m)} = f^{(n+m)}    (6)

for n, m ∈ Z, forms an Abelian group.

Equation (5) cannot be used to define functional iteration for non-integer n. Thus, in order to calculate non-integer iterates of a function, we have to find an alternative definition. The sought generalization should also extend the additive property (6) of the composition operation to non-integer n, m ∈ R.

2.2. Abel's functional equation

Consider the following functional equation given by (Abel, 1826),

    \psi(f(x)) = \psi(x) + \beta    (7)

with constant β ∈ C. We are concerned with f(x) = exp(x). A continuously differentiable solution for β = 1 and x ∈ R is given by

    \psi(x) = \log^{(k)}(x) + k    (8)

with k ∈ N s.t. 0 ≤ log^{(k)}(x) < 1. Note that for x < 0 we have k = −1 and thus ψ is well defined on the whole of R. The function is shown in Fig. 1. Since ψ : R → (−1, ∞) is strictly increasing, the inverse ψ^{−1} : (−1, ∞) → R exists and is given by

    \psi^{-1}(\psi) = \exp^{(k)}(\psi - k)    (9)

with k ∈ N s.t. 0 ≤ ψ − k < 1. For practical reasons we set ψ^{−1}(ψ) = −∞ for ψ ≤ −1. The derivative of ψ is given by

    \psi'(x) = \prod_{j=0}^{k-1} \frac{1}{\log^{(j)}(x)}    (10a)

with k ∈ N s.t. 0 ≤ log^{(k)}(x) < 1, and the derivative of its inverse is

    \psi'^{-1}(\psi) = \prod_{j=0}^{k-1} \exp^{(j)}\!\left( \psi^{-1}(\psi - j) \right)    (10b)

with k ∈ N s.t. 0 ≤ ψ − k < 1.

2.2.1. Non-integer iterates using Abel's equation

By inspection of Abel's equation (7), we see that the nth iterate of the exponential function can be written as

    \exp^{(n)}(x) = \psi^{-1}(\psi(x) + n) .    (11)

While this equation is equivalent to (5) for integer n, we are now also free to choose n ∈ R and thus (11) can be seen as a generalization of functional iteration to non-integer iterates. It can easily be verified that the composition property (6) holds. Hence we can understand the function ϕ(x) = exp^{(1/2)}(x) as the function that gives the exponential function when applied to itself. ϕ is called the functional square root of exp and we have ϕ(ϕ(x)) = exp(x) for all x ∈ R. Likewise exp^{(1/N)} is the function that gives the exponential function when iterated N times.

Since n is a continuous parameter in definition (11) we can take the derivative of exp^{(n)} with respect to its argument as well as n. They are given by

    \exp'^{(n)}(x) = \frac{\partial \exp^{(n)}(x)}{\partial x} = \psi'^{-1}(\psi(x) + n)\, \psi'(x)    (12a)

    \exp^{(n')}(x) = \frac{\partial \exp^{(n)}(x)}{\partial n} = \psi'^{-1}(\psi(x) + n) .    (12b)

Thus (8) provides a method to interpolate between the exponential function, the identity function and the logarithm in a continuous and differentiable way.
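To make the construction concrete, here is a minimal numerical sketch of (8), (9) and (11) for real arguments; the function names psi, psi_inv and exp_n are ours, and the handling of the k = −1 case follows the remarks above.

    import math

    def psi(x):
        """Abel function for exp, eq. (8): psi(x) = log^(k)(x) + k with 0 <= log^(k)(x) < 1."""
        if x < 0:
            return math.exp(x) - 1.0          # k = -1, since 0 <= exp(x) < 1 for x < 0
        k = 0
        while x >= 1.0:
            x = math.log(x)
            k += 1
        return x + k

    def psi_inv(p):
        """Inverse Abel function, eq. (9): psi_inv(p) = exp^(k)(p - k) with 0 <= p - k < 1."""
        if p <= -1.0:
            return float("-inf")              # convention from the text
        k = math.floor(p)
        x = p - k
        if k == -1:                           # exp^(-1) = log
            return math.log(x)
        for _ in range(k):
            x = math.exp(x)
        return x

    def exp_n(x, n):
        """Non-integer iterate of exp, eq. (11): exp^(n)(x) = psi_inv(psi(x) + n)."""
        return psi_inv(psi(x) + n)

    # exp_n(x, 1) == exp(x), exp_n(x, -1) == log(x), and the functional square root
    # composes to exp:  exp_n(exp_n(2.0, 0.5), 0.5) is approximately exp(2.0).
    print(exp_n(2.0, 1.0), math.exp(2.0))
    print(exp_n(exp_n(2.0, 0.5), 0.5), math.exp(2.0))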

Figure 1. A continuously differentiable solution ψ(x) to Abel's equation (7) for the exponential function in the real domain.

Figure 2. Iterates of the exponential function exp^{(n)}(x) for n ∈ {−1, −0.9, ..., 0, ..., 0.9, 1} obtained using the solution (11) of Abel's equation.

2.3. Schröder's functional equation

Motivated by the necessity to evaluate the logarithm for negative arguments, we derive a solution of Abel's equation for the complex exponential function. Applying the substitution

    \psi(x) = \frac{\beta}{\log \gamma} \log \chi(x)

in Abel's equation (7) gives a functional equation first examined by (Schröder, 1870),

    \chi(f(z)) = \gamma\, \chi(z)    (13)

with constant γ ∈ C. As before we are interested in solutions of this equation for f(x) = exp(x); we have

    \chi(\exp(z)) = \gamma\, \chi(z)    (14)

but now we are considering the complex exp : C → C. The complex exponential function is not injective, since

    \exp(z + 2\pi n i) = \exp(z),  n ∈ Z .

Thus the imaginary part of the codomain of its inverse, i.e. the complex logarithm, must be restricted to an interval of size 2π. Here we define log : C → {z ∈ C : β ≤ Im z < β + 2π} with β ∈ R. For now let us consider the principal branch of the logarithm, that is β = −π.

To derive a solution, we examine the behavior of exp around one of its fixed points. A fixed point of a function f is a point c with the property that f(c) = c. The exponential function has an infinite number of fixed points. Here we select the fixed point closest to the real axis in the upper complex half-plane. Since log is a contraction mapping, according to the Banach fixed-point theorem (Khamsi & Kirk, 2001) the fixed point of exp can be found by starting at an arbitrary point z ∈ C with Im z ≥ 0 and repetitively applying the logarithm until convergence. Numerically we find

    \exp(c) = c \approx 0.318132 + 1.33724\, i

where i = √−1 is the imaginary unit.

Close enough to c the exponential function behaves like an affine map. To show this, let z′ = z − c and consider

    \exp(c + z') - c = \exp(c)\exp(z') - c = c\,[\exp(z') - 1] = c\,[1 + z' + O(|z'|^2) - 1] = c\, z' + O(|z'|^2) .

Here we used the Taylor expansion of the exponential function, exp(z′) = 1 + z′ + O(z′²). Thus for any point z in a circle of radius r_0 around c, we have

    \exp(z) = c z + c - c^2 + O(r_0^2) .    (15)

By substituting this approximation into (14) it becomes apparent that a solution to Schröder's equation around c is given by

    \chi(z) = z - c \quad \text{for } |z - c| \le r_0    (16)

where we have set γ = c.

We will now compute the continuation of the solution to points outside the circle around c. From (14) we obtain

    \chi(z) = c\, \chi(\log(z)) .    (17)
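As an aside, the fixed point c can be computed numerically exactly as described above, by iterating the complex logarithm from any starting point in the upper half-plane; a minimal NumPy sketch:

    import numpy as np

    # Banach fixed-point iteration for the fixed point of exp closest to the real axis
    # in the upper half-plane: repeatedly apply the principal-branch complex logarithm.
    z = 1.0 + 1.0j                    # any starting point with Im z >= 0
    for _ in range(200):
        z = np.log(z)
    print(z)                          # approx. (0.31813150 + 1.33723570j)
    print(np.exp(z) - z)              # approx. 0, i.e. exp(c) = c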

Figure 3. Calculation of χ(z). Starting from point z_0 = z the series z_{n+1} = log z_n is evaluated until |z_n − c| ≤ r_0 for some n. Inside this circle of radius r_0 the function value can then be evaluated using χ(z) = c^n (z_n − c). The contours are generated by iterative application of exp to the circle of radius r_0 around c. Near its fixed point the exponential function behaves like a scaling by |c| ≈ 1.374 and a rotation of Im c ≈ 76.6° around c.

Figure 4. Domain coloring plot of χ(z). Discontinuities arise at 0, 1, e, e^e, ... and stretch into the negative complex half-plane. They are caused by log being discontinuous at the polar angles β and β + 2π.

If for a point z ∈ C repeated application of the logarithm leads to a point inside the circle of radius r_0 around c, we can obtain the function value of χ(z) from (16) via iterated application of (17). In the next section it will be shown that this is indeed the case for nearly every z ∈ C. Hence the solution to Schröder's equation is given by

    \chi(z) = c^k \left( \log^{(k)}(z) - c \right)    (18)

with k = min_{k′ ∈ N} k′ s.t. |log^{(k′)}(z) − c| ≤ r_0. Solving for z gives

    \chi^{-1}(\chi) = \exp^{(k)}\!\left( c^{-k} \chi + c \right)    (19)

with k = min_{k′ ∈ N} k′ s.t. |c^{−k′} χ| ≤ r_0. Obviously we have χ^{−1}(χ(z)) = z for all z ∈ C. However χ(χ^{−1}(ξ)) = ξ only holds if Im(c^{−k} ξ + c) ∈ [β, β + 2π). The derivative of χ is given by

    \chi'(z) = \prod_{j=0}^{k-1} \frac{c}{\log^{(j)} z}    (20a)

with k = min_{k′ ∈ N} k′ s.t. |log^{(k′)}(z) − c| ≤ r_0, and we have

    \chi'^{-1}(\chi) = \prod_{j=1}^{k} \frac{1}{c} \exp^{(j)}\!\left( \chi^{-1}\!\left( \frac{\chi}{c^{j}} \right) \right)    (20b)

with k = min_{k′ ∈ N} k′ s.t. |c^{−k′} χ| ≤ r_0.

2.3.1. The solution χ is defined on almost all of C

The principal branch of the logarithm, i.e. restricting its imaginary part to the interval [−π, π), has the drawback that iterated application of log starting from a point in the lower complex half-plane will converge to the complex conjugate c̄ instead of c. Thus χ(z) would be undefined for Im z < 0.

To avoid this problem, we use the branch defined by log : C → {z ∈ C : β ≤ Im z < β + 2π} with −1 < β < 0. Using such a branch the series z_n, where

    z_{n+1} = \log z_n ,

converges to c, provided that there is no n such that z_n = 0. Thus χ is defined on C \ D where D = {0, 1, e, e^e, ...}.

Proof. If Im z_n ≥ 0 then arg z_n ∈ [0, π] and thus Im z_{n+1} ≥ 0. Hence, if we have Im z_{n_0} ≥ 0 for some n_0 ∈ N, then Im z_n ≥ 0 for all n > n_0. Now, consider the conformal map

    \xi(z) = \frac{z - c}{z - \bar{c}}

which maps the upper complex half-plane to the unit disk, and define the series ξ_{n+1} = ζ(ξ_n) with

    \zeta(t) = \xi\!\left( \log \xi^{-1}(t) \right) .

We have ζ : D_1 → D_1, where D_1 = {t ∈ C : |t| < 1} is the unit disk; furthermore ζ(0) = 0. Thus by the Schwarz lemma |ζ(t)| < |t| for all t ∈ D_1 (since ζ(t) ≠ λt with λ ∈ C) and hence lim_{n→∞} ξ_n = 0 (Kneser, 1950). This implies lim_{n→∞} z_n = c. On the other hand, if Im z_n < 0 and Re z_n < 0, then Im log z_n > 0 and z_n converges as above. Finally, if Im z_n < 0 and Re z_n ≥ 0, then, using −1 < β, we have Re z_{n+1} ≤ |log z_n| ≤ 1 + log(Re z_n) < Re z_n and thus at some element n_0 in the series we will have Re z_{n_0} < 1, which leads to Re z_{n_0+1} < 0.
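The construction of χ and χ^{−1} can be sketched directly from (18), (19) and the branch choice above. In the following minimal numerical sketch the constant R0 (the linearization radius r_0) and the branch parameter BETA are arbitrary choices of ours, and points in the exceptional set D are not handled.

    import numpy as np

    C = 0.3181315052047641 + 1.3372357014306895j   # fixed point of exp (see above)
    BETA = -0.5                                     # branch parameter, -1 < beta < 0
    R0 = 0.1                                        # radius of the disk around c

    def log_b(z):
        """Complex logarithm with imaginary part restricted to [BETA, BETA + 2*pi)."""
        w = np.log(complex(z))                      # principal branch, Im in (-pi, pi]
        if w.imag < BETA:
            w += 2j * np.pi
        return w

    def chi(z):
        """Solution of Schroeder's equation, eq. (18): chi(z) = c^k (log^(k)(z) - c)."""
        z = complex(z)
        k = 0
        while abs(z - C) > R0:                      # iterate log until inside the disk
            z = log_b(z)
            k += 1
        return C ** k * (z - C)

    def chi_inv(x):
        """Inverse of chi, eq. (19): chi_inv(x) = exp^(k)(c^(-k) x + c)."""
        x = complex(x)
        k = 0
        while abs(x) > R0:                          # |c| > 1, so dividing shrinks x
            x /= C
            k += 1
        z = x + C
        for _ in range(k):
            z = np.exp(z)
        return z

    print(chi_inv(chi(1 + 1j)))                     # approx. 1 + 1j (round trip)

The accuracy of the round trip is limited by the linear approximation (16) inside the disk; a smaller R0 trades accuracy against more logarithm iterations.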

Figure 5. Structure of the function χ(z) and calculation of exp^{(n)}(1 + πi) for n ∈ [0, 1] in z-space (left) and χ-space (right). The uniform grid with the cross at the origin is mapped using χ(z). The points 1 + πi and ξ = χ(1 + πi) are shown as hollow magenta circles. The black line shows χ^{−1}(c^n ξ) and c^n ξ for n ∈ [0, 1]. The blue and purple points are placed at n = 1/2 and n = 1 respectively.

Figure 6. Function composition holds in the gray-shaded area for non-integer iteration numbers, i.e. exp^{(n)} ∘ exp^{(m)}(z) = exp^{(n+m)}(z) for n ∈ R and m ∈ [−1, 1]. We defined log such that Im log z ∈ [−1, −1 + 2π].

Figure 7. Iterates of the exponential function exp^{(n)}(x + 0.5i) for n ∈ {0, 0.1, ..., 0.9, 1} (upper plots) and n ∈ {0, −0.1, ..., −0.9, −1} (lower plots) obtained using the solution (22) of Schröder's equation. Exp, log and the identity function are highlighted in orange.

2.3.2. Non-integer iterates using Schröder's equation

Repetitive application of Schröder's equation (13) on an integer iterate (5) leads to

    \chi(f^{(n)}(z)) = \gamma^n \chi(z) .    (21)

Thus the nth iterate of the exponential function on the whole complex plane is given by

    \exp^{(n)}(z) = \chi^{-1}(c^n\, \chi(z))    (22)

where χ(z) and χ^{−1}(z) are given by (18) and (19) respectively. Since χ is injective we can think of it as a mapping from the complex plane, called z-plane, to another complex plane, called χ-plane. By (22) the operation of calculating the exponential of a number y in the z-plane corresponds to complex multiplication of χ(y) by the factor c in the χ-plane. This is illustrated in Fig. 5. Samples from exp^{(n)} are shown in Fig. 7.

While the definition of exp^{(n)} given by (22) can be evaluated on the whole complex plane C, it only has meaning as a non-integer iterate of exp if the composition property exp^{(n)}[exp^{(m)}(z)] = exp^{(n+m)}(z) holds. Since this requires that χ(χ^{−1}(ξ)) = ξ, let us define the sets E′ = {ξ ∈ C : χ[χ^{−1}(c^m ξ)] = c^m ξ ∀m ∈ [−1, 1]} and E = χ^{−1}(E′). Then, for z ∈ E, n ∈ R and m ∈ [−1, 1], the composition of the iterated exponential function is given by

    \exp^{(n)}[\exp^{(m)}(z)] = \chi^{-1}\big(c^n\, \chi(\chi^{-1}(c^m\, \chi(z)))\big) = \chi^{-1}\big(c^{n+m}\, \chi(z)\big) = \exp^{(n+m)}(z)

and the composition property is satisfied. The subset E of the complex plane where composition of exp^{(n)} holds for non-integer n is shown in Fig. 6.

The derivatives of exp^{(n)} defined using Schröder's equation are given by

    \exp'^{(n)}(z) = c^n\, \chi'^{-1}[c^n \chi(z)]\, \chi'(z)    (23a)

    \exp^{(n')}(z) = c^n\, \chi'^{-1}[c^n \chi(z)]\, \chi(z) \log(c) .    (23b)

Hence we defined the continuously differentiable function exp^{(n)} : C \ D → C on almost the whole complex plane and showed that it has the meaning of a non-integer iterate of exp on the subset E.
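A sketch of (22) is then a one-liner on top of the chi, chi_inv and C definitions from the previous sketch; as before, the accuracy is limited by the linearization radius R0, and the values checked below are only approximate.

    import numpy as np

    def exp_n(z, n):
        """Non-integer iterate of exp on the complex plane, eq. (22):
        exp^(n)(z) = chi_inv(c^n chi(z)). Uses chi, chi_inv and C from the sketch above."""
        return chi_inv(C ** n * chi(z))

    z = 1 + 1j
    print(exp_n(z, 1), np.exp(z))        # n = 1 approximately reproduces exp
    print(exp_n(z, -1), np.log(z))       # n = -1 approximately reproduces log
    print(exp_n(exp_n(z, 0.5), 0.5))     # composition: approximately exp(z)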
3. Interpolation between addition and multiplication

Using fundamental properties of the exponential function we can write every multiplication of two numbers x, y ∈ R as

    x y = \exp(\log x + \log y) = \exp\big( \exp^{(-1)}(x) + \exp^{(-1)}(y) \big) .

We define the operator ⊕_n for x, y ∈ R and n ∈ R as

    x \oplus_n y = \exp^{(n)}\big( \exp^{(-n)}(x) + \exp^{(-n)}(y) \big) .    (24)

Note that we have x ⊕_0 y = x + y and x ⊕_1 y = x y. Thus for 0 < n < 1 the above operator continuously interpolates between the elementary operations of addition and multiplication. We will refer to ⊕_n as the “addiplication operator”. Analogous to the n-ary sum and product we will employ the following notation for the n-ary addiplication operator,

    \bigoplus_{j=k;\, n}^{K} x_j = x_k \oplus_n x_{k+1} \oplus_n \cdots \oplus_n x_K .    (25)
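Using exp_n from the previous sketch, the addiplication operator (24) can be written directly; the check below mirrors the example of Fig. 8 (x = 2, y = 7), and the printed values are approximate and carry a small complex residue from the numerical χ.

    def addiplication(x, y, n):
        """x (+)_n y = exp^(n)( exp^(-n)(x) + exp^(-n)(y) ), eq. (24)."""
        return exp_n(exp_n(x, -n) + exp_n(y, -n), n)

    print(addiplication(2, 7, 0.0))   # approx. 9  (addition)
    print(addiplication(2, 7, 1.0))   # approx. 14 (multiplication)
    print(addiplication(2, 7, 0.5))   # an intermediate, complex-valued result (cf. Fig. 8)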

Figure 8. The interpolation between addition and multiplication using (24) with x = 2 and y = 7. The iterates of the exponential function are either calculated using Abel's equation (blue) or Schröder's equation (orange). In both cases the interpolated values exceed the range between 2 + 7 = 9 and 2 · 7 = 14 and therefore a local maximum exists.

The derivatives of the addiplication operator w.r.t. its operands and the interpolation parameter n are calculated using the chain rule. Using the shorthand

    E = \exp^{(-n)}(x) + \exp^{(-n)}(y)    (26a)

we have

    \frac{\partial (x \oplus_n y)}{\partial x} = \exp'^{(n)}(E)\, \exp'^{(-n)}(x)    (26b)

    \frac{\partial (x \oplus_n y)}{\partial y} = \exp'^{(n)}(E)\, \exp'^{(-n)}(y)    (26c)

    \frac{\partial (x \oplus_n y)}{\partial n} = \exp^{(n')}(E) + \exp'^{(n)}(E) \cdot \left[ \exp^{(-n')}(x) + \exp^{(-n')}(y) \right] .    (26d)

For positive arguments x, y > 0 we can use the iterates of exp based either on the solution of Abel's equation (11) or Schröder's equation (22). However, if we also want to deal with negative arguments, we must use iterates of exp based on Schröder's equation (22), since the real logarithm is only defined for positive arguments. From the exemplary addiplication shown in Fig. 8 we can see that the interpolations produced by these two methods are not monotonic functions w.r.t. the interpolation parameter n. In both cases local maxima exist; however, interpolation based on Schröder's equation has higher extrema in this case and also in general (personal experiments).
It is well known that the existence of local extrema can pose a problem for gradient-based optimizers.

4. Neurons that can add or multiply

We propose two methods to construct neural nets that have units whose operation can be adjusted using a continuous parameter.

The straightforward approach is to use neurons that use addiplication instead of summation, i.e. the value of neuron y_i is given by

    y_i = \sigma\Big( \bigoplus_{j;\, n_i} W_{ij}\, x_j \Big) = \sigma\Big( \exp^{(n_i)}\Big( \sum_j W_{ij} \exp^{(-n_i)}(x_j) \Big) \Big) .    (27)

For n_i = 0 the neuron behaves like an additive neuron and for n_i = 1 it computes the product of its inputs. Because we sum over exp^{(−n_i)}(x_j), which depends on the parameter n_i of neuron y_i, this calculation corresponds to a network in which each neuron in layer x has separate outputs for each neuron in the following layer y; see Fig. 9a. Compared to conventional neural nets this architecture has only one additional real-valued parameter per neuron (n_i) but also poses a significant increase in computational complexity due to the necessity of separate outputs. Since exp^{(−n_i)}(x_j) is complex it might be sensible (but is not required) to allow a complex weight matrix W_ij.

The computational complexity of separate output units can be avoided by calculating the value of a neuron according to

    y_i = \sigma\Big( \exp^{(\hat{n}_{y_i})}\Big( \sum_j W_{ij} \exp^{(\tilde{n}_{x_j})}(x_j) \Big) \Big) .    (28)

This corresponds to the architecture shown in Fig. 9b. The interpolation parameter n_i has been split into a pre-transfer-function part n̂_{y_i} and a post-transfer-function part ñ_{x_j}. Since n̂_{y_i} and ñ_{x_j} are not tied together, the network is free to implement arbitrary combinations of iterates of the exponential function. Addiplication occurs as the special case n̂_{y_i} = −ñ_{x_j}. Compared to conventional neural nets each neuron has two additional parameters, namely n̂_{y_i} and ñ_{x_j}; however, the asymptotic computational complexity of the network is unchanged. In fact, this architecture corresponds to a conventional, additive neural net, as defined by (1), with a neuron-dependent, parameterizable transfer function. For neuron z_i the transfer function is given by

    \sigma_{z_i}(t) = \exp^{(\tilde{n}_{z_i})}\Big( \sigma\big( \exp^{(\hat{n}_{z_i})}(t) \big) \Big) .    (29)

Consequently, implementation in existing neural network frameworks is possible by replacing the standard sigmoidal transfer function with this function and optionally using a complex weight matrix.
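As a rough sketch of how this could be dropped into an existing framework, the following forward pass implements (28) on top of the Schröder-based exp_n function sketched earlier. Applying σ to the real part of the complex activation is an assumption of this sketch, not something prescribed by the text; weights, inputs and parameter values are illustrative.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def hybrid_layer(W, x, n_hat, n_tilde):
        """Forward pass of eq. (28):
        y_i = sigma( exp^(n_hat_i)( sum_j W_ij exp^(n_tilde_j)(x_j) ) ).
        W may be real or complex; exp_n is the Schroeder-based iterate from the sketch above.
        Applying sigma to the real part is a simplification made by this sketch."""
        x_t = np.array([exp_n(xj, nj) for xj, nj in zip(x, n_tilde)])
        a = W @ x_t
        y_hat = np.array([exp_n(ai, ni) for ai, ni in zip(a, n_hat)])
        return sigmoid(y_hat.real)

    # Example: a layer with 2 outputs and 3 inputs. n_hat = n_tilde = 0 recovers an
    # ordinary additive layer; n_hat = 1 with n_tilde = -1 yields product-like units.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(2, 3))
    x = np.array([0.5, 2.0, 1.5])
    print(hybrid_layer(W, x, n_hat=np.zeros(2), n_tilde=np.zeros(3)))
    print(hybrid_layer(W, x, n_hat=np.ones(2), n_tilde=-np.ones(3)))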


Figure 9. Two proposed neural network architectures that can implement addiplication. (a) Neuron y_i calculates its value according to (27). We have y_i = σ(ŷ_i) and the subunits compute x̃_ij = exp^{(−n_i)}(x_j) and ŷ_i = exp^{(n_i)}(Σ_j W_ij x̃_ij). The weights between the subunits are shared. (b) Neuron y_i calculates its value according to (28). We have y_i = σ(ŷ_i) and the subunits compute x̃_j = exp^{(−ñ_j)}(x_j) and ŷ_i = exp^{(n̂_i)}(Σ_j W_ij x̃_j).

Figure 10. (a) Variable pattern shift problem. Given a random binary pattern x ∈ R^N and an integer m ∈ {0, 1, ..., N − 1} presented in one-hot encoding, the learner should output the pattern x circularly shifted to the right by m grid cells. (b) A neural net with two hidden layers that can solve this problem by employing the Fourier shift theorem. The first hidden layer is additive, the second is multiplicative; the output layer is additive. All neurons use linear transfer functions. The first hidden layer computes the DFT of the input pattern x and shift amount s. The second hidden layer applies the Fourier shift theorem and the output layer computes the inverse DFT of the shifted pattern.

5. Applications

5.1. Variable pattern shift

Consider the dynamic pattern shift task shown in Fig. 10a: The input consists of a binary vector x of N elements and an integer m ∈ {0, 1, ..., N − 1}. The desired output y is x circularly shifted by m elements to the right,

    y_n = x_{n-m} ,

where x is indexed modulo N, i.e. x_{n−m} rolls over to the right if n − m ≤ 0.

A method to efficiently implement this task in a neural architecture is based on the shift theorem of the discrete Fourier transform (DFT) (Brigham, 1988). Let F({x_n})_k denote the kth element of the DFT of x = {x_0, ..., x_{N−1}}. By definition we have

    \mathcal{F}(\{x_n\})_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i\, k n / N}

and its inverse is given by

    \mathcal{F}^{-1}(\{X_k\})_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{2\pi i\, k n / N} .

The shift theorem states that a shift by m elements in the time domain is equivalent to a multiplication by the factor e^{−2πi km/N} in the frequency domain,

    \mathcal{F}(\{x_{n-m}\})_k = \mathcal{F}(\{x_n\})_k\, e^{-2\pi i\, k m / N} .

Hence the shifted pattern can be calculated using

    y = \mathcal{F}^{-1}\big( \{ \mathcal{F}(\{x_n\})_k\, e^{-2\pi i\, k m / N} \} \big) .

Using the above definitions its vth component is given by

    y_v = \frac{1}{N} \sum_{k=0}^{N-1} e^{2\pi i\, k v / N}\, e^{-2\pi i\, k m / N} \Big[ \sum_{n=0}^{N-1} x_n\, e^{-2\pi i\, k n / N} \Big] .

If we encode the shift amount m as a one-hot vector s of length N, i.e. s_j = 1 if j = m else s_j = 0, we can further rewrite this as

    y_v = \frac{1}{N} \sum_{k=0}^{N-1} e^{2\pi i\, k v / N}\, S_k X_k    (30a)

with

    S_k = \sum_{m=0}^{N-1} s_m\, e^{-2\pi i\, k m / N} , \qquad X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i\, k n / N} .    (30b)

This corresponds to a neural network with two hidden layers (one additive, one multiplicative) and an additive output layer as shown in Fig. 10b. The optimal weights of this network are given by the corresponding coefficients from (30).

From this example we see that having the ability to automatically determine the function of each neuron is crucial for learning neural nets that are able to solve complex problems.
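The derivation in (30) can be checked numerically with a few lines of NumPy; the pattern length, shift amount and random seed below are arbitrary choices, and np.fft uses exactly the sign convention of (30b).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 8
    x = rng.integers(0, 2, size=N).astype(float)   # random binary pattern
    m = 3                                          # shift amount
    s = np.zeros(N)
    s[m] = 1.0                                     # one-hot encoding of the shift

    X = np.fft.fft(x)                              # X_k, eq. (30b)
    S = np.fft.fft(s)                              # S_k = e^{-2 pi i k m / N}, eq. (30b)
    y = np.fft.ifft(S * X).real                    # y_v, eq. (30a)

    print(np.allclose(y, np.roll(x, m)))           # True: x circularly shifted right by m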

6. Conclusion and future work

We proposed a method to continuously and differentiably interpolate between addition and multiplication and showed how it can be integrated into neural networks by replacing the standard sigmoidal transfer function with a parameterizable transfer function. In this paper we presented the mathematical formulation of these concepts and showed how to integrate them into neural networks.

We will perform simulations to see how these neural nets behave on real-world problems and how to train them most efficiently. While working on this theory, we already have two possible improvements in mind:

1. Our interpolation technique is based on non-integer iterates of the exponential function calculated using Abel's and Schröder's functional equations. We chose this method for our first explorations because it has the mathematically sound property that the calculated iterates form an Abelian group under functional composition. However, it results in a non-monotonic interpolation between addition and multiplication, which may lead to a challenging optimization landscape during training. Therefore we will try to find more ad-hoc interpolations with a monotonic transition between addition and multiplication.

2. Our method introduces one or two additional real-valued parameters per neuron for the transfer function. Using a suitable fixed transfer function might allow these parameters to be absorbed back into the bias.

While the specific implementation details proposed in this work may have their drawbacks, we believe that neurons which can implement operations beyond ad- dition are a key to new areas of application for neural computation.

Acknowledgments

This work has been supported in part by the TAC-MAN project, EC Grant agreement no. 610967, within the FP7 framework programme.

References

Abel, N. H. Untersuchung der Functionen zweier unabhängig veränderlichen Größen x und y, wie f(x, y), welche die Eigenschaft haben, daß f(z, f(x, y)) eine symmetrische Function von z, x und y ist. Journal für die reine und angewandte Mathematik, 1826(1):11–15, 1826.

Brigham, E. Fast Fourier Transform and Its Applications. Prentice Hall, 1988. ISBN 0133075052.

Durbin, Richard and Rumelhart, David E. Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks. Neural Computation, 1:133–142, 1989.

Hornik, Kurt, Stinchcombe, Maxwell, and White, Halbert. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Khamsi, Mohamed A. and Kirk, William A. An Introduction to Metric Spaces and Fixed Point Theory. Wiley-Interscience, 2001. ISBN 0471418250.

Kneser, H. Reelle analytische Lösungen der Gleichung φ(φ(x)) = e^x und verwandter Funktionalgleichungen. Journal für die reine und angewandte Mathematik, pp. 56–67, 1950.

Schröder, Ernst. Ueber iterirte Functionen. Mathematische Annalen, 3:296–322, 1870. ISSN 00255831. doi: 10.1007/BF01443992.