Neural Networks: Nonlinear Optimization for Constrained Learning and Its Applications

STAVROS J. PERANTONIS Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research “Demokritos”, 153 10 Aghia Paraskevi, Athens, GREECE

Abstract: - Feedforward neural networks are highly non-linear systems whose training is usually carried out by performing optimization of a suitable cost function which is nonlinear with respect to the optimization variables (synaptic weights). This paper summarizes the considerable advantages that arise from using constrained optimization methods for training feedforward neural networks. A constrained optimization framework is presented that allows incorporation of suitable constraints in the learning process. This enables us to include additional knowledge in efficient learning algorithms that can be either general purpose or problem specific. We present general purpose first order and second order algorithms that arise from the proposed constrained learning framework and evaluate their performance on several benchmark classification problems. Regarding problem specific algorithms, we present recent developments concerning the numerical factorization and root identification of polynomials using sigma-pi feedforward networks trained by constrained optimization techniques.

Key-Words: - neural networks, constrained optimization, learning algorithms

1 Introduction

Artificial neural networks are computational systems loosely modeled after the human brain. Although it is not yet clearly understood exactly how the brain optimizes its connectivity to perform perception and decision making tasks, there have been considerable successes in developing artificial neural network training algorithms by drawing from the wealth of methods available from the well established field of non-linear optimization. Most of the methods used for supervised learning, including the original and highly cited back propagation method for multilayered feed forward networks [1], originate from unconstrained optimization techniques. However, research has shown that it is often beneficial to incorporate additional knowledge in the neural network architecture or learning rule [2]-[6]. Often, the additional knowledge can be encoded in the form of mathematical relations that have to be satisfied simultaneously with the demand for minimization of the cost function. Naturally, methods from the field of nonlinear constrained optimization are essential for solving these modified learning tasks.

In this paper, we present an overview of recent research results on neural network training using constrained optimization techniques. A general constrained optimization framework is introduced for incorporating additional knowledge in the neural network learning rule. It is subsequently shown how this general framework can be utilized to obtain efficient neural network learning algorithms. These can be either general purpose algorithms or problem specific algorithms. The general purpose algorithms incorporate additional information about the specific type of the neural network and the nature and characteristics of its cost function landscape, and are used to facilitate learning in broad classes of problems. They are further divided into first order algorithms (mainly using gradient information) and second order algorithms (also using information about second derivatives encoded in the Hessian matrix). Problem specific algorithms are also discussed. Special attention is paid to recent advancements in the numerical solution of the polynomial factorization and root finding problem, which can be efficiently solved by using suitable sigma-pi networks and incorporating relations among the polynomial coefficients or roots as additional constraints.

2 Constrained learning framework

Conventional unconstrained supervised learning in neural networks involves minimization, with respect to the synaptic weights and biases, of a cost function of the form

$$E = E[T_{ip} - O_{ip}(\mathbf{w})] \tag{1}$$

Here $\mathbf{w}$ is a column vector containing all synaptic weights and biases of the network, $p$ is an index running over the $P$ patterns of the training set, $O_{ip}$ is the network output corresponding to output node $i$ and $T_{ip}$ is the corresponding target.

However, it is desirable to introduce additional relations that represent the extra knowledge and involve the network's synaptic weights. Before introducing the form of the extra relations, we note that we will be adopting an epoch-by-epoch optimization framework with the following objectives:

• At each epoch of the learning process, the vector $\mathbf{w}$ is to be incremented by $d\mathbf{w}$, so that the search for an optimum new point in the space of $\mathbf{w}$ is restricted to a region around the current point. It is possible to restrict the search to the surface of a hypersphere, as, for example, is done with the steepest descent algorithm. However, in some cases it is convenient to adopt a more general approach whereby the search is done on a hyperelliptic surface centered at the point defined by the current $\mathbf{w}$:

$$d\mathbf{w}^T \mathbf{M}\, d\mathbf{w} = (\delta P)^2 \tag{2}$$

where $\mathbf{M}$ is a positive definite matrix and $\delta P$ is a known constant.

• At each epoch, the cost function must be decremented, i.e. changed by a prescribed negative quantity $\delta Q$, so that at the end of learning $E$ is rendered as small as possible. To first order, the change in $E$ can be substituted by its first differential, so that:

$$dE = \delta Q \tag{3}$$

Next, additional objectives are introduced in order to incorporate the extra knowledge into the learning formalism. The objective is to make the formalism as general as possible and capable of incorporating multiple optimization objectives. The following two cases of importance are therefore considered, which involve additional mathematical relations representing knowledge about learning in neural networks:

Case 1: There are additional constraints $\boldsymbol{\Phi} = \mathbf{0}$ which must be satisfied as well as possible upon termination of the learning process. Here $\boldsymbol{\Phi}$ is a column vector whose components are known functions of the synaptic weights. This problem is addressed by introducing a function $k$ and demanding at each epoch of the learning process the maximization of $k$ subject to the condition that $d\boldsymbol{\Phi} = -k\boldsymbol{\Phi}$. In this way, it is ensured that $\boldsymbol{\Phi}$ tends to $\mathbf{0}$ at a temporarily exponential rate. Based on these considerations, the following optimization problem can be formulated for each epoch of the algorithm, whose solution will determine the adaptation rule for $\mathbf{w}$:

Maximize $k > 0$ with respect to $d\mathbf{w}$, subject to the following constraints:

$$d\mathbf{w}^T \mathbf{M}\, d\mathbf{w} = (\delta P)^2 \tag{4}$$

$$dE = \delta Q \tag{5}$$

$$k = -d\Phi_m / \Phi_m, \quad m = 1, \ldots, S \tag{6}$$

where $S$ is the number of components of $\boldsymbol{\Phi}$.

Case 2: This case involves additional conditions whereby there is no specific final target for the vector $\boldsymbol{\Phi}$, but rather it is desired that all components of $\boldsymbol{\Phi}$ are rendered as large as possible at each individual epoch of the learning process. This is a multiobjective maximization problem, which is addressed by defining $k = d\Phi_m$ and demanding that $k$ assume the maximum possible value at each epoch. Thus the constrained optimization problem is as before, with equation (6) substituted by

$$k = d\Phi_m, \quad m = 1, \ldots, S \tag{7}$$

The solution to the above constrained optimization problems can be obtained by a method similar to the constrained gradient ascent technique introduced by Bryson and Denham [7] and leads to a generic update rule for $\mathbf{w}$. To this end, suitable Lagrange multipliers $L_1$ and $L_2$ are introduced to take account of equations (5) and (4) respectively, and a vector of multipliers $\boldsymbol{\Lambda}$ to take account of equation (6) or equation (7). The Lagrangian thus reads

$$k' = k + L_1(\mathbf{G}^T d\mathbf{w} - \delta Q) + L_2[d\mathbf{w}^T \mathbf{M}\, d\mathbf{w} - (\delta P)^2] + \boldsymbol{\Lambda}^T(\mathbf{F}\, d\mathbf{w} + \mathbf{1}k) \tag{8}$$

where the following quantities have been introduced:

1. A vector $\mathbf{G}$ with elements $G_i = \partial E / \partial w_i$.
2. A matrix $\mathbf{F}$ whose elements are defined by $F_{mi} = (1/\Phi_m)(\partial \Phi_m / \partial w_i)$ (for Case 1) or $F_{mi} = -\partial \Phi_m / \partial w_i$ (for Case 2).

To maximize $k$ under the required constraints, it is demanded that:

$$dk' = dk\,(1 + \boldsymbol{\Lambda}^T \mathbf{1}) + (L_1 \mathbf{G}^T + \boldsymbol{\Lambda}^T \mathbf{F} + 2 L_2\, d\mathbf{w}^T \mathbf{M})\, d^2\mathbf{w} = 0 \tag{9}$$

$$d^2k' = 2 L_2\, d^2\mathbf{w}^T \mathbf{M}\, d^2\mathbf{w} < 0 \tag{10}$$

Hence, the factors multiplying $d^2\mathbf{w}$ and $dk$ in equation (9) should vanish, and therefore:

$$d\mathbf{w} = -\frac{L_1}{2L_2}\mathbf{M}^{-1}\mathbf{G} - \frac{1}{2L_2}\mathbf{M}^{-1}\mathbf{F}^T\boldsymbol{\Lambda} \tag{11}$$

$$\boldsymbol{\Lambda}^T \mathbf{1} = -1 \tag{12}$$

Equation (11) constitutes the weight update rule for the neural network, provided that the Lagrange multipliers appearing in it have been evaluated in terms of known quantities. The result is summarized here, whereas the full evaluation is carried out in the Appendix. To complete the evaluation, it is necessary to introduce the following quantities:

$$I_{GG} = \mathbf{G}^T \mathbf{M}^{-1} \mathbf{G} \tag{13}$$

$$\mathbf{I}_{GF} = \mathbf{F}\mathbf{M}^{-1}\mathbf{G} \tag{14}$$

$$\mathbf{I}_{FF} = \mathbf{F}\mathbf{M}^{-1}\mathbf{F}^T \tag{15}$$

$$\mathbf{R} = \frac{1}{I_{GG}}\left(I_{GG}\,\mathbf{I}_{FF} - \mathbf{I}_{GF}\mathbf{I}_{GF}^T\right) \tag{16}$$

$$Z = 1 + \frac{(\mathbf{R}^{-1})_a\,(\mathbf{I}_{GF}^T \mathbf{R}^{-1} \mathbf{I}_{GF}) - (\mathbf{1}^T \mathbf{R}^{-1} \mathbf{I}_{GF})^2}{(\mathbf{R}^{-1})_a\, I_{GG}} \tag{17}$$

where $(\cdot)_a$ denotes the sum of all elements of a matrix and $\mathbf{1}$ is a column vector whose elements are all equal to 1. In terms of these known quantities, the Lagrange multipliers are evaluated using the relations:

$$L_2 = -\frac{1}{2}\left[\frac{I_{GG}}{(\mathbf{R}^{-1})_a\,[I_{GG}(\delta P)^2 - Z(\delta Q)^2]}\right]^{1/2} \tag{18}$$

$$\boldsymbol{\Lambda} = -\frac{\mathbf{R}^{-1}\mathbf{1}}{(\mathbf{R}^{-1})_a} - \frac{2L_2\,\delta Q}{I_{GG}}\left[\frac{\mathbf{1}^T \mathbf{R}^{-1}\mathbf{I}_{GF}}{(\mathbf{R}^{-1})_a}\,\mathbf{R}^{-1}\mathbf{1} - \mathbf{R}^{-1}\mathbf{I}_{GF}\right] \tag{19}$$

$$L_1 = -\frac{2L_2\,\delta Q + \boldsymbol{\Lambda}^T \mathbf{I}_{GF}}{I_{GG}} \tag{20}$$

In the Appendix, it is shown that $\delta Q$ must be chosen adaptively according to $\delta Q = -\xi\,\delta P\,(I_{GG})^{1/2}$. Here $\xi$ is a real parameter with $0 < \xi < 1$.

The two most common problems arise because

• of the occurrence of long, deep valleys or troughs that force gradient descent to follow zig-zag paths [8]
• of the possible existence of temporary minima in the cost function landscape.

In order to improve learning speed in long deep valleys, it is desirable to align current and previous epoch weight update vectors as much as possible, without compromising the need for a decrease in the cost function [6]. Thus, satisfaction of an additional condition is required, i.e. maximization of the quantity $\Phi = (\mathbf{w} - \mathbf{w}_t)^T(\mathbf{w}_t - \mathbf{w}_{t-1})$ with respect to the synaptic weight vector $\mathbf{w}$ at each epoch of the algorithm. Here $\mathbf{w}_t$ and $\mathbf{w}_{t-1}$ are the values of the weight vectors at the present and immediately preceding epoch respectively and are treated as known constant vectors. The generic update rule derived in the previous section can be applied. There is only one additional condition to satisfy (maximization of $d\Phi$ at each epoch), so that Case 2 of the previous section is applicable. It is readily seen that in this case $\boldsymbol{\Lambda}$ has only one component, equal to $-1$ (by equation (12)), and the weight update rule is quite simple:

$$d\mathbf{w} = -\frac{L_1}{2L_2}\mathbf{G} - \frac{1}{2L_2}\mathbf{u} \tag{22}$$

where

$$\mathbf{u} = \mathbf{w}_t - \mathbf{w}_{t-1}, \quad L_1 = -\frac{I_{Gu} + 2L_2\,\delta Q}{I_{GG}}, \quad L_2 = -\frac{1}{2}\left[\frac{I_{GG}I_{uu} - I_{Gu}^2}{I_{GG}(\delta P)^2 - (\delta Q)^2}\right]^{1/2} \tag{23}$$

with

$$I_{Gu} = \mathbf{u}^T\mathbf{G}, \quad I_{uu} = \mathbf{u}^T\mathbf{u} \tag{24}$$

Hence, weight updates are formed as linear combinations of the gradient $\mathbf{G}$ and the previous weight update $\mathbf{u}$.
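The single-objective update of equations (22)-(24) is compact enough to check numerically. The following is a minimal sketch, not the authors' implementation: it assumes M is the unit matrix, uses plain Python lists as vectors, and the function names `dot` and `constrained_momentum_step` are hypothetical. The fallback to a plain steepest descent step when the previous update is zero or parallel to the gradient (where $L_2$ vanishes) is an added assumption and is not discussed in the text.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def constrained_momentum_step(grad, prev_step, delta_p, xi):
    """Return (dw, delta_q) following equations (22)-(24) with M = identity.

    grad      : gradient G of the cost function at the current weights
    prev_step : previous weight update u = w_t - w_{t-1}
    delta_p   : step length, so that dw.dw = delta_p**2 (equation (4))
    xi        : parameter with 0 < xi < 1 that fixes delta_q
    """
    i_gg = dot(grad, grad)
    if i_gg == 0.0:
        # Zero gradient: nothing to descend along (assumption: no update).
        return [0.0] * len(grad), 0.0
    i_gu = dot(grad, prev_step)
    i_uu = dot(prev_step, prev_step)
    delta_q = -xi * delta_p * math.sqrt(i_gg)

    num = i_gg * i_uu - i_gu ** 2             # >= 0 by Cauchy-Schwarz
    den = i_gg * delta_p ** 2 - delta_q ** 2  # > 0 because 0 < xi < 1
    if i_uu == 0.0 or num <= 1e-12 * i_gg * i_uu:
        # Degenerate case (assumption): fall back to steepest descent
        # of length delta_p, for which dE = -delta_p * |G|.
        norm = math.sqrt(i_gg)
        return [-delta_p * g / norm for g in grad], -delta_p * norm

    l2 = -0.5 * math.sqrt(num / den)            # equation (23), negative root
    l1 = -(i_gu + 2.0 * l2 * delta_q) / i_gg    # enforces G.dw = delta_q
    dw = [-(l1 * g + u) / (2.0 * l2) for g, u in zip(grad, prev_step)]
    return dw, delta_q
```

For instance, calling `constrained_momentum_step([3.0, 4.0], [1.0, 0.0], 0.1, 0.5)` returns a step whose squared length equals $(\delta P)^2$ and whose inner product with the gradient equals $\delta Q$, which are exactly constraints (4) and (5). Because $L_2$ is negative, the coefficient of the previous update is positive, so the rule behaves like gradient descent with an adaptive momentum term.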

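The generic rule of equations (11)-(20) can be checked in the same numerical spirit for several simultaneous constraints. The sketch below is illustrative only: it assumes M is the unit matrix, takes the matrix F as a list of rows supplied by the caller, and assumes that G and the rows of F are linearly independent and that $\xi$ is small enough for the quantity under the square root in equation (18) to be positive. The helper names (`solve`, `generic_step`) are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def generic_step(grad, f_rows, delta_p, xi):
    """Return (dw, delta_q) from equations (11)-(20), assuming M = identity."""
    s = len(f_rows)
    ones = [1.0] * s
    i_gg = dot(grad, grad)                                    # (13)
    i_gf = [dot(row, grad) for row in f_rows]                 # (14)
    i_ff = [[dot(ra, rb) for rb in f_rows] for ra in f_rows]  # (15)
    r = [[i_ff[a][b] - i_gf[a] * i_gf[b] / i_gg for b in range(s)]
         for a in range(s)]                                   # (16)

    r_inv_one = solve(r, ones)   # R^{-1} 1
    r_inv_gf = solve(r, i_gf)    # R^{-1} I_GF
    r_sum = sum(r_inv_one)       # (R^{-1})_a, sum of all elements of R^{-1}
    one_r_gf = sum(r_inv_gf)     # 1^T R^{-1} I_GF

    z = 1.0 + (r_sum * dot(i_gf, r_inv_gf) - one_r_gf ** 2) / (r_sum * i_gg)  # (17)
    delta_q = -xi * delta_p * math.sqrt(i_gg)
    radicand = i_gg / (r_sum * (i_gg * delta_p ** 2 - z * delta_q ** 2))
    l2 = -0.5 * math.sqrt(radicand)                           # (18)

    c = 2.0 * l2 * delta_q / i_gg
    lam = [-r_inv_one[m] / r_sum
           - c * (one_r_gf / r_sum * r_inv_one[m] - r_inv_gf[m])
           for m in range(s)]                                 # (19)
    l1 = -(2.0 * l2 * delta_q + dot(lam, i_gf)) / i_gg        # (20)

    # F^T Lambda, then the update rule (11) with M = identity.
    ft_lam = [sum(f_rows[m][i] * lam[m] for m in range(s))
              for i in range(len(grad))]
    dw = [-(l1 * g + fl) / (2.0 * l2) for g, fl in zip(grad, ft_lam)]
    return dw, delta_q
```

With, say, `grad = [1.0, 2.0, 0.0, -1.0]` and `f_rows = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 2.0]]`, the returned step has squared length $(\delta P)^2$, gives $\mathbf{G}^T d\mathbf{w} = \delta Q$, and makes all components of $\mathbf{F}\,d\mathbf{w}$ equal to the same negative value $-k$, as required by the constraint $\mathbf{F}\,d\mathbf{w} + \mathbf{1}k = \mathbf{0}$ in the Lagrangian (8).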
Moreover, the algorithm has exhibited very good generalization performance in a number of benchmark and industrial tasks. Fig. 2 shows the result of supervised learning performed on the problem of classifying lithological regions and mineral alteration zones for mineral exploration purposes from satellite images. Among several neural and statistical classification algorithms tried on the problem, ALECO was most successful in correctly classifying mineral alteration zones, which are very important for mineral exploration [13].

Figure 2: Classification of lithologies and identification of mineral alteration zones using ALECO.

[...] of the Jacobian matrix $J_c$ of each cluster of redundant hidden nodes corresponding to the linearized system. It turns out that in the vicinity of the temporary minima, learning is slow because the largest eigenvalues of the Jacobian matrices of all formed clusters are very small, and therefore the system evolves very slowly when unconstrained back propagation is utilized. The magnitude of the largest eigenvalues gradually grows with learning and eventually a bifurcation of the eigenvalues occurs and the system follows a trajectory which allows it to move far away from the minimum. Instead of waiting for the growth of the eigenvalues, it is useful to raise these eigenvalues more rapidly in order to facilitate learning. It is therefore beneficial to incorporate in the learning rule additional knowledge related to the desire for rapid growth of these eigenvalues. Since it is difficult in the general case to express the maximum eigenvalues in closed form in terms of the weights, it is chosen to raise the values of appropriate lower bounds $\Phi_c = \mathbf{1}^T J_c \mathbf{1} \le \lambda_c$ for these eigenvalues. Details are given in [14], but here it suffices to say that this is readily achieved by the generic weight update rule (Case 2 of section 2) using the above $\Phi_c$ with $\mathbf{M} = \mathbf{1}$ (the unit matrix).

Figure 3: Parity 4 problem: Comparison of constrained learning algorithm DCBP with other first order algorithms in terms of success rate and complexity. (Bar chart; algorithms shown: DCBP, BP, RPROP, ConjGrad, DBD, QP; quantities shown: success rate and epochs/5.)

The algorithm thus obtained, DCBP (Dynamical Constrained Back Propagation), results in significant acceleration of learning in the vicinity of the temporary minima, also improving the success rate in comparison with other first order algorithms. The improvement achieved for the parity-4 problem is shown in Fig. 3. Moreover, in a standard breast cancer diagnosis benchmark problem (cancer3 problem of the PROBEN1 set [15]), DCBP was the only first order algorithm tried that succeeded in learning the task with 100% success starting from different initial weights, with resilient propagation achieving a 20% success rate and other first order algorithms failing to learn the task.

4 Second order constrained learning algorithms

Apart from information contained in the gradient, it is often useful to include second order information about the cost function landscape, which is contained in the Hessian matrix of second derivatives with respect to the weights. Constrained learning algorithms in this section combine advantages of first order unconstrained learning algorithms like conjugate gradient and second order unconstrained algorithms like the very successful Levenberg-Marquardt (LM) algorithm.

The main idea is that a one-dimensional minimization in the previous step direction $\mathbf{u} = d\mathbf{w}_{t-1}$ followed by a second minimization in the current direction $d\mathbf{w}$ does not guarantee that the function has been minimized on the subspace spanned by both of these directions. A solution to this problem is to choose minimization directions which are non-interfering and linearly independent. This can be achieved by the selection of conjugate directions, which form the basis of the Conjugate Gradient (CG) method [16]. The two vectors $d\mathbf{w}$ and $\mathbf{u}$ are non-interfering or mutually conjugate with respect to $\nabla^2 E$ when

$$d\mathbf{w}^T (\nabla^2 E)\, \mathbf{u} = 0 \tag{25}$$

Our objective is to decrease the cost function of equation (1) with respect to $\mathbf{w}$ as well as to maximize $\Phi = d\mathbf{w}^T (\nabla^2 E)\, \mathbf{u}$. Second order information is introduced by incrementing the weight vector $\mathbf{w}$ by $d\mathbf{w}$, so that

$$d\mathbf{w}^T (\nabla^2 E)\, d\mathbf{w} = (\delta P)^2 \tag{26}$$

Thus, at each iteration, the search for an optimum new point in the weight space is restricted to a small hyperellipse centered at the point defined by the current weight vector. The shape of such a hyperellipse reflects the scaling of the underlying problem, and allows for a more correct weighting among all possible directions [17]. Thus, in terms of the general constrained learning framework of section 2, this is a one-goal optimization problem, seeking to maximize $\Phi$, so that (26) is respected. This falls under Case 2 of the general framework, with the matrix $\mathbf{M}$ being replaced by the Hessian matrix $\nabla^2 E$. The derived weight update rule reads:

$$d\mathbf{w} = -\frac{L_1}{2L_2}[\nabla^2 E]^{-1}\mathbf{G} - \frac{1}{2L_2}\mathbf{u} \tag{27}$$

where

$$L_1 = -\frac{2L_2\,\delta Q + I_{Gu}}{I_{GG}}, \quad L_2 = -\frac{1}{2}\left[\frac{I_{uu}I_{GG} - I_{Gu}^2}{I_{GG}(\delta P)^2 - (\delta Q)^2}\right]^{1/2} \tag{28}$$

$$I_{GG} = \mathbf{G}^T(\nabla^2 E)^{-1}\mathbf{G}, \quad I_{Gu} = \mathbf{G}^T\mathbf{u}, \quad I_{uu} = \mathbf{u}^T(\nabla^2 E)\,\mathbf{u} \tag{29}$$

As a final touch, to save computational resources, the Hessian is evaluated using the LM trust region prescription for the Hessian [18]

$$\nabla^2 E \approx (\mathbf{J}^T\mathbf{J} + \mu\mathbf{I}) \tag{30}$$

where $\mathbf{J}$ is the Jacobian matrix and $\mu$ is a scalar that (indirectly) controls the size of the trust region. As is evident from equation (27), the resulting algorithm is an LM type algorithm with an additional adaptive momentum term, hence the algorithm is termed LMAM (Levenberg-Marquardt with Adaptive Momentum). An additional improvement over this type of algorithm is the so called OLMAM algorithm (Optimized Levenberg-Marquardt with Adaptive Momentum), which implements exactly the same weight update rule but also achieves independence from the externally provided parameter values $\delta P$ and $\xi$. This independence is achieved by automatically regulating analytical mathematical conditions that should hold in order to ensure the constant maintenance of the conjugacy between weight changes in successive epochs. Details are given in [19].

LMAM and OLMAM have been particularly successful in a number of problems, both in their ability to reach a solution and in their generalization ability.

In the famous 2-spiral benchmark, LMAM and OLMAM have achieved a remarkable success rate of 89-90% and the smallest mean number of epochs that (to the best of our knowledge) has ever been reported in the literature of neural networks for a feedforward network with just one hidden layer (30 hidden units) and no shortcut connections. As shown in Fig. 4, other algorithms like the original LM and conjugate gradient have failed to solve this type of problem in the great majority of cases, while the CPU time required by LMAM and OLMAM is also very competitive.

Figure 4: 2-spiral problem: Success rates and CPU times required for second-order constrained learning algorithms (LMAM and OLMAM) and two other algorithms (Levenberg-Marquardt and Conjugate Gradient). (Bar chart; algorithms shown: LMAM, OLMAM, LM, ConjGrad; quantities shown: success rate and CPU time/5 (sec).)

Regarding generalization ability, the second order constrained learning algorithms have been successfully applied to a medical problem concerning classification of Pap-Smear test images. In particular, OLMAM has achieved the best classification ability on the test set reported in the literature among a multitude of statistical and other classification methods. Some of the results of this comparison are shown graphically in Fig. 5, while a full account of this comparison is given in [20].

Figure 5: Comparison of classification rate achieved by different classification methods for the pap-smear classification problem. The constrained learning algorithm (OLMAM) outperforms all other methods achieving a rate of 98.8%.

5 Problem specific algorithms

Problem specific applications of constrained learning range from financial modeling and market analysis problems, where sales predictions can be made to conform to certain requirements of the retailers, to scientific problems like the numerical solution of simultaneous linear equations [21]. A scientific application of constrained learning that has reached a level of maturity is numerical factorization and root finding of polynomials. Polynomial factorization is an important problem with applications in various areas of mathematics, mathematical physics and signal processing ([22] and references cited therein).

Consider, for example, a polynomial of two variables $z_1$ and $z_2$:

$$A(z_1, z_2) = \sum_{i=0}^{N_A}\sum_{j=0}^{N_A} a_{ij}\, z_1^i z_2^j \tag{31}$$

with $N_A$ even, and $a_{00} = 1$. For the above polynomial, it is sought to achieve an exact or approximate factorization of the form

$$A(z_1, z_2) \approx \prod_{i=1,2} A^{(i)}(z_1, z_2) \tag{32}$$

where

$$A^{(i)}(z_1, z_2) = \sum_{j=0}^{M_A}\sum_{k=0}^{M_A} v_{jk}^{(i)}\, z_1^j z_2^k \tag{33}$$

with $M_A = N_A/2$. We can try to find the coefficients $v_{jk}^{(i)}$ by considering $P$ training patterns selected from the region $|z_1| < 1$, $|z_2| < 1$. The primary purpose of the learning rule is thus to minimize with respect to the $v_{jk}^{(i)}$ a cost function of the form

$$E = \sum_p \Big(\prod_{i=1,2} A^{(i)}(z_{1p}, z_{2p}) - A(z_{1p}, z_{2p})\Big)^2 \tag{34}$$

Note that this cost function corresponds to a sigma-pi neural network with the elements of $v^{(1)}$ and $v^{(2)}$ as its synaptic weights.

Unconstrained minimization of the cost function has been tried, but often leads to unsatisfactory results, because it can be easily trapped in flat minima. However, there is extra knowledge available for this problem. The easiest way to incorporate more knowledge is to take advantage of the constraints among the coefficients of the desired factor polynomials and the coefficients of the original polynomial. More explicitly, if it is assumed that $A(z_1, z_2)$ is factorable, then these constraints can be expressed as follows:

$$\Phi^{v}_{j+(N_A+1)i} = a_{ij} - \sum_{l=0}^{i}\sum_{m=0}^{j} v^{(1)}_{lm}\, v^{(2)}_{i-l,\,j-m} = 0$$

with $0 \le i \le N_A$, $0 \le j \le N_A$, where coefficients $v^{(1)}$, $v^{(2)}$ with an index outside the range $0, \ldots, M_A$ are taken to be zero. Thus, the objective is to reach a minimum of the cost function of equation (34) with respect to the variables $v_{jk}^{(i)}$, which satisfies as well as possible the constraints $\boldsymbol{\Phi}^{v} = \mathbf{0}$, where $\boldsymbol{\Phi}^{v} = (\Phi^{v}_{j+(N_A+1)i},\ 0 \le i \le N_A,\ 0 \le j \le N_A)$.

The constraints can be incorporated into the constrained optimization formalism as in Case 1 of section 2. It turns out that the constrained learning algorithm can determine the factor polynomials in factorable cases and gives good approximate solutions in cases where the original polynomial is non-factorable [23].

Recently, there have been interesting developments that allow us to achieve improved results in the problem of numerical factorization and root finding for polynomials using constrained learning techniques. In [24] additional constraints have been incorporated for ensuring stability of the resulting factor polynomials in filter factoring problems related to signal processing. More importantly, the basic constrained learning method has been applied to root finding of arbitrary polynomials of one variable. The parallel structure of the neural network has been exploited in order to obtain all roots of the original polynomial simultaneously or in distinct groups, resulting in efficient algorithms capable of handling problems with polynomials of high degree [25]. The method has also been extended for finding the roots of polynomials with arbitrary complex coefficients and with roots that can be close to each other and need extra effort to be resolved [26]. A recent extension incorporates constraints among the root moments of the original polynomial instead of the relations among the coefficients themselves [27]. These advances have rendered the constrained learning algorithms very competitive compared with well established numerical root finding methods like the Muller and Laguerre techniques, leading to better accuracies at a fraction of the CPU time.

6 Conclusion

An overview of recent research results on neural network training using constrained optimization techniques has been presented. A generic learning framework was derived in which many types of additional knowledge, codified as mathematical relations satisfied by the synaptic weights, can be incorporated. Specific examples were given of the application of this framework to neural network learning, including first and second order general purpose algorithms as well as problem specific methods. It is hoped that the constrained learning approach will continue to offer insight into learning in neural networks. It has potential to combine the merits of both connectionist and knowledge based approaches for developing successful applications.

Appendix: Derivation of constrained learning algorithm

In this Appendix, evaluation of the Lagrange multipliers $L_1$, $L_2$ and $\boldsymbol{\Lambda}$ involved in the general constrained learning framework of section 2 is carried out.

By multiplying both sides of equation (11) by $\mathbf{G}^T$ and by taking into account equation (5) we obtain:

$$\delta Q = -\frac{L_1}{2L_2} I_{GG} - \frac{1}{2L_2}\boldsymbol{\Lambda}^T \mathbf{I}_{GF} \tag{35}$$

Solving for $L_1$ readily yields equation (20), which evaluates $L_1$ in terms of $L_2$ and $\boldsymbol{\Lambda}$.

By left multiplication of both sides of equation (11) by $\mathbf{F}$ and taking into account equations (6) and (20), we obtain

$$k\mathbf{1} + \frac{\delta Q}{I_{GG}}\mathbf{I}_{GF} = \frac{\mathbf{R}\boldsymbol{\Lambda}}{2L_2} \tag{36}$$

where the matrix $\mathbf{R}$ is defined by equation (16). Solving equation (36) for $\boldsymbol{\Lambda}$ yields

$$\boldsymbol{\Lambda} = 2L_2 k\, \mathbf{R}^{-1}\mathbf{1} + \frac{2L_2\,\delta Q}{I_{GG}}\mathbf{R}^{-1}\mathbf{I}_{GF} \tag{37}$$

By substituting this equation into equation (12) we arrive at:

$$k = -\frac{1 + \dfrac{2L_2\,\delta Q}{I_{GG}}\,\mathbf{1}^T\mathbf{R}^{-1}\mathbf{I}_{GF}}{2L_2\,(\mathbf{R}^{-1})_a} \tag{38}$$

We can now substitute this equation into equation (37) to obtain equation (19), evaluating $\boldsymbol{\Lambda}$ in terms of $L_2$.

To evaluate $L_2$, we must substitute our expression for $d\mathbf{w}$ into equation (4). To make the algebra easier, we note that on account of equation (20), equation (11) can be written as:

$$d\mathbf{w} = \frac{\delta Q}{I_{GG}}\mathbf{M}^{-1}\mathbf{G} + \frac{1}{2L_2}\mathbf{M}^{-1}\mathbf{A} \tag{39}$$

where

$$\mathbf{A} = \frac{\boldsymbol{\Lambda}^T\mathbf{I}_{GF}}{I_{GG}}\mathbf{G} - \mathbf{F}^T\boldsymbol{\Lambda} \tag{40}$$

From the definition of $\mathbf{A}$ we can readily derive the following properties:

$$\mathbf{A}^T\mathbf{M}^{-1}\mathbf{A} = \boldsymbol{\Lambda}^T\mathbf{R}\boldsymbol{\Lambda}, \quad \mathbf{A}^T\mathbf{M}^{-1}\mathbf{G} = 0 \tag{41}$$

Substituting equation (39) into equation (4) and taking into account equation (41), we can obtain a relation involving only $L_2$ and $\boldsymbol{\Lambda}$:

$$L_2 = -\frac{1}{2}\left[\frac{I_{GG}\,(\boldsymbol{\Lambda}^T\mathbf{R}\boldsymbol{\Lambda})}{I_{GG}(\delta P)^2 - (\delta Q)^2}\right]^{1/2} \tag{42}$$

where the negative square root sign has been selected on account of inequality (10).

By substituting equation (19) into equation (42) and solving for $L_2$, equation (18) is obtained, with $Z$ given by equation (17). Evaluation of all Lagrange multipliers in terms of known quantities is now complete.

As a final note, let us discuss our choice for $\delta Q$. This choice is dictated by the demand that the quantity under the square root in equation (42) be positive. It can readily be seen from the first of equations (41) that $\boldsymbol{\Lambda}^T\mathbf{R}\boldsymbol{\Lambda} \ge 0$ provided that $\mathbf{M}$ is positive definite. Since $I_{GG} = \mathbf{G}^T\mathbf{M}^{-1}\mathbf{G} \ge 0$, it follows from equation (42) that care must be taken to ensure that $I_{GG}(\delta P)^2 > (\delta Q)^2$. The simplest way to achieve this is to set $\delta Q = -\xi\,\delta P \sqrt{I_{GG}}$ with $0 < \xi < 1$.

References:
[2] D. Barber and D. Saad, Does extra knowledge necessarily improve generalization?, Neural Computation, Vol. 8, 1996, pp. 202-214.
[3] Y. le Cun, L. D. Jackel, B. E. Boser, J. S. Denker, H-P. Graf, I. Guyon, D. Henderson, R. E. Howard and W. Hubbard, Handwritten digit recognition: Applications of neural network chips and automatic learning, IEEE Communications Magazine, Nov. 1989, pp. 41-46.
[4] P. Simard, Y. le Cun and J. Denker, Efficient pattern recognition using a new transformation distance, in: Advances in Neural Processing Systems, eds. S. J. Hanson, J. D. Cowan and C. L. Giles, Morgan Kaufmann, 1993, pp. V-50–V-58.
[5] S. Gold, A. Rangarajan and E. Mjolsness, Learning with preknowledge: clustering with point and graph matching distance, Neural Computation, Vol. 8, 1996, pp. 787-804.
[6] S. J. Perantonis and D. A. Karras, An efficient learning algorithm with momentum acceleration, Neural Networks, Vol. 8, 1995, pp. 237-249.
[7] A. E. Bryson and W. F. Denham, A steepest ascent method for solving optimum programming problems, Journal App. Mech., Vol. 29, 1962, pp. 247-257.
[8] S. S. Rao, Optimization Theory and Applications, New Delhi, Wiley Eastern, 1984.
[9] R. A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Networks, Vol. 1, 1988, pp. 295-307.
[10] M. Riedmiller and H. Braun, A direct adaptive method for faster backpropagation learning: The RPROP algorithm, Proceedings of the International Conference on Neural Networks, San Francisco, Vol. 1, 1993, pp. 586-591.
[11] E. M. Johansson, F. U. Dowla and D. M. Goodman, Backpropagation learning for multilayer feedforward networks using the conjugate gradient method, International Journal of Neural Systems, Vol. 2, 1992, pp. 291-301.