A Neural Transfer Function for a Smooth and Differentiable Transition Between Additive and Multiplicative Interactions

Sebastian Urban ([email protected])
Institut für Informatik VI, Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany

Patrick van der Smagt ([email protected])
fortiss | An-Institut der Technischen Universität München, Guerickestr. 25, 80805 München, Germany

arXiv:1503.05724v3 [stat.ML] 29 Mar 2016

Abstract

Existing approaches to combining additive and multiplicative neural units either use a fixed assignment of operations or require discrete optimization to determine what function a neuron should perform. This leads either to an inefficient distribution of computational resources or to an extensive increase in the computational complexity of the training procedure.

We present a novel, parameterizable transfer function based on the mathematical concept of non-integer functional iteration that allows the operation each neuron performs to be smoothly and, most importantly, differentiably adjusted between addition and multiplication. This allows the decision between addition and multiplication to be integrated into the standard backpropagation training procedure.

1. Introduction

In commonplace artificial neural networks (ANNs) the value of a neuron is given by a weighted sum of its inputs propagated through a non-linear transfer function. For illustration, consider a simple neural network with multidimensional input and multivariate output. We call the input layer $x$ and the outputs $y$. The value of neuron $y_i$ is then

$$y_i = \sigma\Bigl(\sum_j W_{ij} x_j\Bigr) \qquad (1)$$

The typical choice for the transfer function $\sigma(t)$ is the sigmoid function $\sigma(t) = 1/(1 + e^{-t})$ or an approximation thereof. Matrix multiplication is used to jointly compute the values of all neurons in one layer more efficiently; we have

$$y = \sigma(W x) \qquad (2)$$

where the transfer function is applied element-wise, $\sigma_i(t) = \sigma(t_i)$. In the context of this paper we call such networks additive ANNs.

Hornik et al. (1989) showed that additive ANNs with at least one hidden layer and a sigmoidal transfer function are able to approximate any function arbitrarily well given a sufficient number of hidden units. Even though an additive ANN is a universal function approximator, there is no guarantee that it can approximate a function efficiently. If the architecture is not a good match for a particular problem, a very large number of neurons is required to obtain acceptable results.

Durbin & Rumelhart (1989) proposed an alternative neural unit in which the weighted summation is replaced by a product, where each input is raised to a power determined by its corresponding weight. The value of such a product unit is given by

$$y_i = \sigma\Bigl(\prod_j x_j^{W_{ij}}\Bigr) \qquad (3)$$

Using the laws of the exponential function this can be written as $y_i = \sigma[\exp(\sum_j W_{ij} \log x_j)]$, and thus the values of a layer can also be computed efficiently using matrix multiplication, i.e.

$$y = \sigma(\exp(W \log x)) \qquad (4)$$

where $\exp$ and $\log$ are taken element-wise. Since in general the incoming values $x$ can be negative, the complex exponential and logarithm are used. Often no non-linearity is applied to the output of the product unit.
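As a minimal illustration of equations (2) and (4), one layer of additive units and one layer of product units can be computed as follows. This is a NumPy sketch with illustrative function names, not code from the authors; the inputs are cast to complex so that the logarithm is defined for negative values, as described above.

```python
import numpy as np

def sigmoid(t):
    # Logistic transfer function sigma(t) = 1 / (1 + e^{-t}), applied element-wise.
    return 1.0 / (1.0 + np.exp(-t))

def additive_layer(W, x):
    # Additive units, Eq. (2): y = sigma(W x).
    return sigmoid(W @ x)

def product_layer(W, x):
    # Product units, Eq. (4): y = sigma(exp(W log x)).
    # Complex log/exp handle negative inputs; note that the outer
    # non-linearity is often omitted for product units.
    return sigmoid(np.exp(W @ np.log(x.astype(complex))))

W = np.array([[0.5, -1.0, 2.0],
              [1.5,  0.3, -0.7]])
x = np.array([0.8, -1.2, 2.0])
print(additive_layer(W, x))  # real-valued outputs
print(product_layer(W, x))   # complex-valued in general
```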
1.1. Hybrid summation-multiplication networks

Both types of neurons can be combined in a hybrid summation-multiplication network. Yet this poses the problem of how to distribute additive and multiplicative units over the network, i.e. how to determine whether a specific neuron should be an additive or a multiplicative unit to obtain the best results. A simple solution is to stack alternating layers of additive and product units, optionally with additional connections that skip over a product layer, so that each additive layer receives inputs from both the product layer and the additive layer beneath it. The drawback of this approach is that the resulting uniform distribution of product units will hardly be ideal.

A more adaptive approach is to learn the function of each neural unit from the provided training data. However, since addition and multiplication are different operations, until now there was no obvious way to determine the best operation during training of the network using standard neural network optimization methods such as backpropagation. An iterative algorithm to determine the optimal allocation could have the following structure: for initialization, randomly choose the operation each neuron performs; train the network by minimizing the error function and then evaluate its performance on a validation set; based on the performance, determine a new allocation using a discrete optimization algorithm (such as particle swarm optimization or a genetic algorithm); iterate the process until satisfactory performance is achieved. The drawback of this method is its computational complexity: to evaluate one allocation of operations the whole network must be trained, which takes from minutes to hours for moderately sized problems.

Here we propose an alternative approach, in which the distinction between additive and multiplicative neurons is not discrete but continuous and differentiable. Hence the optimal distribution of additive and multiplicative units can be determined during standard gradient-based optimization. Our approach is organized as follows: First, we introduce non-integer iterates of the exponential function in the real and complex domains. We then use these iterates to smoothly interpolate between addition (1) and multiplication (3). Finally, we show how this interpolation can be integrated and implemented in neural networks.

2. Iterates of the exponential function

2.1. Functional iteration

Let $f: \mathbb{C} \to \mathbb{C}$ be an invertible function. For $n \in \mathbb{Z}$ we write $f^{(n)}$ for the $n$-times iterated application of $f$,

$$f^{(n)}(z) = \underbrace{f \circ f \circ \cdots \circ f}_{n \text{ times}}(z) \qquad (5)$$

Further, let $f^{(-n)} = (f^{-1})^{(n)}$, where $f^{-1}$ denotes the inverse of $f$. We set $f^{(0)}(z) = z$ to be the identity function. It can easily be verified that functional iteration with respect to the composition operator, i.e.

$$f^{(n)} \circ f^{(m)} = f^{(n+m)} \qquad (6)$$

for $n, m \in \mathbb{Z}$, forms an Abelian group.

Equation (5) cannot be used to define functional iteration for non-integer $n$. Thus, in order to calculate non-integer iterates of a function, we have to find an alternative definition. The sought generalization should also extend the additive property (6) of the composition operation to non-integer $n, m \in \mathbb{R}$.

2.2. Abel's functional equation

Consider the following functional equation given by Abel (1826),

$$\psi(f(x)) = \psi(x) + \beta \qquad (7)$$

with constant $\beta \in \mathbb{C}$. We are concerned with $f(x) = \exp(x)$. A continuously differentiable solution for $\beta = 1$ and $x \in \mathbb{R}$ is given by

$$\psi(x) = \log^{(k)}(x) + k \qquad (8)$$

with $k \in \mathbb{N}$ s.t. $0 \le \log^{(k)}(x) < 1$. Note that for $x < 0$ we have $k = -1$ and thus $\psi$ is well defined on the whole of $\mathbb{R}$. The function $\psi$ is shown in Fig. 1. Since $\psi: \mathbb{R} \to (-1, \infty)$ is strictly increasing, the inverse $\psi^{-1}: (-1, \infty) \to \mathbb{R}$ exists and is given by

$$\psi^{-1}(y) = \exp^{(k)}(y - k) \qquad (9)$$

with $k \in \mathbb{N}$ s.t. $0 \le y - k < 1$. For practical reasons we set $\psi^{-1}(y) = -\infty$ for $y \le -1$. The derivative of $\psi$ is given by

$$\psi'(x) = \prod_{j=0}^{k-1} \frac{1}{\log^{(j)}(x)} \qquad (10a)$$

with $k \in \mathbb{N}$ s.t. $0 \le \log^{(k)}(x) < 1$, and the derivative of its inverse is

$$\bigl(\psi^{-1}\bigr)'(y) = \prod_{j=0}^{k-1} \psi^{-1}(y - j) \qquad (10b)$$

with $k \in \mathbb{N}$ s.t. $0 \le y - k < 1$.

2.2.1. Non-integer iterates using Abel's equation

By inspection of Abel's equation (7), we see that the $n$th iterate of the exponential function can be written as

$$\exp^{(n)}(x) = \psi^{-1}(\psi(x) + n) \qquad (11)$$

While this equation is equivalent to (5) for integer $n$, we are now also free to choose $n \in \mathbb{R}$, and thus (11) can be seen as a generalization of functional iteration to non-integer iterates. It can easily be verified that the composition property (6) holds. Hence we can understand the function $\varphi(x) = \exp^{(1/2)}(x)$ as the function that gives the exponential function when applied to itself. $\varphi$ is called the functional square root of $\exp$.
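Equations (8), (9) and (11) translate directly into a small numerical routine. The following is a minimal sketch for real arguments with illustrative function names (the iterative evaluation of $\log^{(k)}$ and $\exp^{(k)}$ is our own formulation); it also checks the composition property (6) by applying $\exp^{(1/2)}$ twice.

```python
import math

def psi(x):
    # Eq. (8): psi(x) = log^(k)(x) + k, with k s.t. 0 <= log^(k)(x) < 1.
    # For x < 0 this yields k = -1, i.e. psi(x) = exp(x) - 1.
    k = 0
    while x < 0:
        x = math.exp(x)   # log^(-1) = exp
        k -= 1
    while x >= 1:
        x = math.log(x)
        k += 1
    return x + k

def psi_inv(y):
    # Eq. (9): psi^{-1}(y) = exp^(k)(y - k), with k s.t. 0 <= y - k < 1.
    if y <= -1:
        return -math.inf  # convention: psi^{-1}(y) = -infinity for y <= -1
    k = math.floor(y)
    x = y - k
    for _ in range(k):    # k >= 0: apply exp k times
        x = math.exp(x)
    if k == -1:           # -1 < y < 0: exp^(-1) = log
        x = math.log(x)
    return x

def exp_iter(x, n):
    # Eq. (11): exp^(n)(x) = psi^{-1}(psi(x) + n), valid for non-integer n.
    return psi_inv(psi(x) + n)

# Composition property (6): applying exp^(1/2) twice gives exp.
x = 0.7
print(exp_iter(exp_iter(x, 0.5), 0.5), math.exp(x))  # both approx 2.01375
```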
Figure 1. A continuously differentiable solution $\psi(x)$ to Abel's equation (7) for the exponential function in the real domain.

(Figure: $\exp^{(n)}(x)$ plotted for $n = 1$ and $n = -1$.)

2.3. Schröder's functional equation

Consider Schröder's functional equation,

$$\chi(f(x)) = \gamma\,\chi(x)$$

with constant $\gamma \in \mathbb{C}$. As before we are interested in solutions of this equation for $f(x) = \exp(x)$; we have

$$\chi(\exp(z)) = \gamma\,\chi(z) \qquad (14)$$

but now we are considering the complex $\exp: \mathbb{C} \to \mathbb{C}$. The complex exponential function is not injective, since

$$\exp(z + 2\pi n i) = \exp(z), \quad n \in \mathbb{Z}.$$

Thus the imaginary part of the codomain of its inverse, i.e. the complex logarithm, must be restricted to an interval of size $2\pi$. Here we define $\log: \mathbb{C} \to \{z \in \mathbb{C}: \beta \le \operatorname{Im} z < \beta + 2\pi\}$ with $\beta \in \mathbb{R}$. For now let us consider the principal branch of the logarithm, that is $\beta = -\pi$.

To derive a solution, we examine the behavior of $\exp$ around one of its fixed points. A fixed point of a function $f$ is a point $c$ with the property that $f(c) = c$. The exponential function has an infinite number of fixed points. Here we select the fixed point closest to the real axis in the upper complex half plane.
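Numerically, this fixed point $c \approx 0.3181 + 1.3372\,i$ can be located by iterating the principal branch of the logarithm, for which it is attracting (the derivative of $\log$ at $c$ is $1/c$, with modulus below one). The routine below is our own sketch, with an illustrative starting point in the upper half plane.

```python
import cmath

def exp_fixed_point(z0=1 + 1j, iterations=100):
    # Iterate the principal branch of the complex logarithm; starting in the
    # upper half plane, the iterates converge to the fixed point of exp that
    # lies closest to the real axis in the upper complex half plane.
    z = z0
    for _ in range(iterations):
        z = cmath.log(z)
    return z

c = exp_fixed_point()
print(c)                      # approx (0.3181315 + 1.3372357j)
print(abs(cmath.exp(c) - c))  # residual near 0, confirming exp(c) = c
```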
