Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2015.11.009
Two-hidden-layer extreme learning machine for regression and classification
B.Y. Qu a,b, B.F. Lang a, J.J. Liang a,*, A.K. Qin c, O.D. Crisalle a

a School of Electrical Engineering, Zhengzhou University, Zhengzhou 450001, China
b School of Electric and Information Engineering, Zhongyuan University of Technology, Zhengzhou 450007, China
c School of Computer Science and Information Technology, RMIT University, Melbourne, 3001 Victoria, Australia

* Corresponding author. Tel.: +86 13526781788. E-mail addresses: [email protected] (B.Y. Qu), [email protected] (B.F. Lang), [email protected] (J.J. Liang), [email protected] (A.K. Qin), [email protected] (O.D. Crisalle).
Article history: Received 28 May 2015; received in revised form 9 November 2015; accepted 9 November 2015.

Keywords: Extreme learning machine; Two-hidden-layer; Regression; Classification; Neural network

Abstract: As a single-hidden-layer feedforward neural network, an extreme learning machine (ELM) randomizes the weights between the input layer and the hidden layer, as well as the biases of the hidden neurons, and analytically determines the weights between the hidden layer and the output layer using the least-squares method. This paper proposes a two-hidden-layer ELM (denoted TELM) that introduces a novel method for obtaining the parameters of the second hidden layer (the connection weights between the first and second hidden layers and the biases of the second hidden layer), thereby bringing the actual hidden-layer output closer to the expected hidden-layer output in the two-hidden-layer feedforward network. At the same time, the TELM method inherits the randomness of the ELM technique for the first hidden layer (the connection weights between the input layer and the first hidden layer and the biases of the first hidden layer). Experiments on several regression problems and on some popular classification datasets demonstrate that the proposed TELM consistently outperforms the original ELM, as well as some existing multilayer ELM variants, in terms of average accuracy and the number of hidden neurons required.

© 2015 Elsevier B.V. All rights reserved.
1. Introduction

Single-hidden-layer feedforward neural networks (SLFNs), one of the most popular neural network models [1,2], have a simple structure consisting of one input layer, one hidden layer, and one output layer. A wide range of applications has demonstrated the efficacy of SLFNs [3,4]. However, these networks suffer from a time-expensive training process that usually adopts gradient-based error back-propagation algorithms and is consequently prone to getting stuck in local minima. To address this issue, in 2004 Huang et al. [3] proposed the extreme learning machine (ELM), a technique aimed at reducing the computational costs incurred by the error back-propagation procedure during training. A distinguishing feature of ELMs is that both the connection weights from the input layer to the hidden layer and the biases of the hidden neurons are randomly generated, instead of being iteratively learned as in conventional SLFNs. Moreover, the connection weights from the hidden layer to the output layer are determined analytically using the time-efficient least-squares (LS) method [5]. As a result, an ELM features remarkably fast training speed and outstanding generalization performance. The ELM approach has demonstrated its advantages in various fields of application, including image recognition [6–10], power-load forecasting [11,12], wind speed forecasting [13], and protein structure prediction [14], among others. However, because of the random weights from the input layer to the hidden layer, as well as the random biases of the hidden neurons, the average accuracy of ELM variants is generally low, which calls for further investigation of better approaches for calculating the hidden-layer parameters.

Many ELM variants have been developed to improve specific aspects of the performance of the original algorithm. Examples include voting-based extreme learning machines (V-ELM) [15], regularized extreme learning machines (RELM) [16,17], evolutionary extreme learning machines (E-ELM) [18], online sequential extreme learning machines (OS-ELM) [19], fully complex extreme learning machines [4,20], sparse extreme learning machines [21], kernel-based extreme learning machines [22], and pruned extreme learning machines (P-ELM) [23], among others. However, how to achieve more satisfactory accuracy remains an open challenge.

To achieve the desired accuracy improvements, we propose a two-hidden-layer extreme learning machine (TELM) algorithm, which adds a hidden layer to the single-hidden-layer ELM architecture and utilizes a novel method to calculate the parameters related to the second hidden layer (namely, the connection weights between the first and second hidden layers and the biases of the second hidden layer). According to previous research, two-hidden-layer feedforward neural networks (TLFNs) [24] typically require fewer hidden neurons than SLFNs to achieve a desired performance level; this observation is the initial motivation for the proposed two-hidden-layer structure. The foundational ideas of the TELM algorithm are easiest to present by comparing and contrasting its features with those of other multilayer ELM algorithms.

First, consider the hierarchical extreme learning machine (HELM) approach presented in [25], which is based on a hierarchical feedforward neural network (HFNN) structure consisting of two parts, each comprised of one input layer, one hidden layer, and one output layer. It is therefore possible to regard the output of the first part as an input neuron of the second part. Unlike HELM, the proposed TELM contains only one output layer and is specifically designed for training the parameters of the hidden layers. Furthermore, HELM is tailored to solving real-time or on-line prediction problems that involve a time-sequence dataset (such as predicting the water quality in a wastewater treatment process), whereas TELM has no such restriction on the type of training dataset.

Next, consider the multilayer extreme learning machine (ML-ELM) [26] and the alternative H-ELM advanced in [27]. Both techniques use ELM-based auto-encoder schemes as their building blocks; in fact, the H-ELM method improves on ML-ELM by featuring a sparse ELM auto-encoder. Both schemes focus mainly on solving classification problems, as they are designed around feature extraction: in their mode of operation, the earlier hidden layers specialize in feature extraction, whereas the last hidden layer is mostly devoted to the least-squares operation. The focus of the proposed TELM is different, as it seeks to obtain improved performance using a reduced number of hidden neurons. Nevertheless, TELM can also incorporate ELM-based auto-encoder techniques, making it a suitable alternative for feature extraction problems in scenarios that call for a reduced number of neurons.

The experimental results presented in this paper for several regression and classification problems demonstrate the superiority of TELM over the original ELM, as well as over other multilayer ELM variants, in terms of average accuracy. Our experiments also investigate the different effects observed on regression and classification problems when an initial orthogonalization procedure is applied to the parameters of the first hidden layer (that is, the connection weights between the input layer and the first hidden layer and the biases of the first hidden layer).

The rest of this paper is organized as follows: Section 2 presents a brief review of the original ELM, Section 3 describes the proposed TELM technique, Section 4 reports and analyzes the experimental results, and Section 5 draws key conclusions and discusses future research plans.
2. Extreme learning machine

The ELM approach originally proposed by Huang et al. [3] aims at avoiding a time-consuming iterative training procedure while simultaneously improving the generalization performance. The idea is inspired by the biological observation that the human brain is a sophisticated system that can handle diverse tasks, day and night, without human intervention. Based on this reasoning, some researchers strongly support the idea that there must be some parts of the brain whose neuron configurations do not depend on the external environment [3,24,28–30]. The ELM algorithm takes advantage of this biological argument and employs tuning-free neurons in the hidden layer to resolve the adverse issues encountered by the back-propagation [31] and Levenberg–Marquardt [32] algorithms.

Consider $N$ arbitrary distinct samples $(x_i, t_i)$ $(i = 1, 2, \ldots, N)$, i.e., an input matrix $X = [x_1, x_2, \ldots, x_N]^T$ and a desired target matrix $T = [t_1, t_2, \ldots, t_N]^T$ of labeled samples, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in \mathbb{R}^n$, $t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T \in \mathbb{R}^m$, and the superscript $T$ denotes matrix/vector transposition. Let $L$ denote the number of hidden neurons, each with activation function $g(x)$. The ELM method selects in a random way the input-weight matrix $W = [W_1, W_2, \ldots, W_L]^T \in \mathbb{R}^{L \times n}$ that links the input layer to the hidden layer, as well as the bias vector $B = [b_1, b_2, \ldots, b_L]^T \in \mathbb{R}^L$ of the hidden-layer neurons. Both $W$ and $B$ are generated simultaneously and remain fixed during the training phase. This procedure transforms the original nonlinear neural-network system into a system described by the linear expression

$H\beta = T \qquad (1)$

where $\beta = [\beta_1, \beta_2, \ldots, \beta_L]^T \in \mathbb{R}^{L \times m}$ is the connection-weight matrix between the hidden layer and the output layer, with vector components $\beta_j = [\beta_{j1}, \beta_{j2}, \ldots, \beta_{jm}]^T$ $(j = 1, 2, \ldots, L)$ that denote the connection weights between the $j$th hidden neuron and the $m$ output neurons; $H = g(WX + B) \in \mathbb{R}^{N \times L}$ is the hidden-layer output matrix, whose scalar entries $h_{ij} = g(W_j x_i + b_j)$ $(i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, L)$ are interpreted as the output of the $j$th hidden neuron with respect to $x_i$; $W_j = [W_{j1}, W_{j2}, \ldots, W_{jn}]^T$ is the vector of connection weights between the $n$ input neurons and the $j$th hidden neuron; and $b_j$ is the bias of the $j$th hidden neuron. The product $W_j x_i$ is interpreted as the inner product of the vectors $W_j$ and $x_i$.
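To make the dimensions above concrete, the following short sketch (our illustration; the paper itself contains no code) builds the hidden-layer output matrix $H$ for a toy configuration with $N = 4$ samples, $n = 3$ input features, and $L = 5$ hidden neurons, and verifies that $H$ has the $N \times L$ shape required by Eq. (1). The sigmoid is used as an example choice of $g(x)$; the paper does not fix a specific activation at this point.

```python
import numpy as np

rng = np.random.default_rng(42)
N, n, L = 4, 3, 5                        # samples, input features, hidden neurons

X = rng.normal(size=(N, n))              # X = [x_1, ..., x_N]^T
W = rng.uniform(-1, 1, size=(L, n))      # random input weights, row j holds W_j
B = rng.uniform(-1, 1, size=L)           # random biases b_1, ..., b_L


def g(z):
    """Sigmoid activation, an example choice for g(x)."""
    return 1.0 / (1.0 + np.exp(-z))


H = g(X @ W.T + B)                       # h_ij = g(W_j . x_i + b_j)
assert H.shape == (N, L)                 # H is N x L, so Eq. (1) reads H @ beta = T
```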
The only parameter that remains to be calculated in the ELM is the output-weight matrix $\beta$. Using the least-squares method, it follows that

$\beta = H^{\dagger} T \qquad (2)$

where $H^{\dagger}$ is the Moore–Penrose (MP) generalized inverse of the matrix $H$, which can be calculated using the orthogonal projection method: if $H^T H$ is nonsingular, then $H^{\dagger} = (H^T H)^{-1} H^T$; otherwise, $H^{\dagger} = H^T (H H^T)^{-1}$ when $H H^T$ is nonsingular. A benefit of using the MP solution is that the formula above yields the solution $\beta$ of least two-norm when $H H^T$ is nonsingular, a valuable advantage in light of Bartlett's observation [33] that smaller weights lead to improved generalization performance.

The implementation of the original ELM proceeds according to the following steps, given $N$ training samples $(x_i, t_i)$ $(i = 1, 2, \ldots, N)$ and $L$ hidden neurons with activation function $g(x)$:

(i) Randomly assign the connection weights $W$ between the input layer and the hidden layer, and the bias vector $B$ of the hidden layer.
(ii) Calculate the hidden-layer output matrix $H = g(WX + B)$.
(iii) Obtain the weights between the hidden layer and the output layer using the least-squares method: $\beta = H^{\dagger} T$.
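The three steps above amount to a few lines of linear algebra. The sketch below is our illustration rather than the authors' code: the class name `ELM` and its `fit`/`predict` methods are hypothetical, and `np.linalg.pinv` is used for $H^{\dagger}$ because it computes the Moore–Penrose inverse via the SVD, which handles near-singular cases more robustly than the explicit orthogonal-projection formulas quoted above.

```python
import numpy as np


class ELM:
    """Minimal single-hidden-layer ELM sketch implementing steps (i)-(iii)."""

    def __init__(self, n_hidden, seed=None):
        self.L = n_hidden                          # number of hidden neurons L
        self.rng = np.random.default_rng(seed)

    @staticmethod
    def _g(z):
        return 1.0 / (1.0 + np.exp(-z))            # sigmoid activation g(x)

    def fit(self, X, T):
        # X: (N, n) training inputs; T: (N, m) targets.
        n = X.shape[1]
        # Step (i): random input weights W (L x n) and biases B (L,).
        self.W = self.rng.uniform(-1.0, 1.0, size=(self.L, n))
        self.B = self.rng.uniform(-1.0, 1.0, size=self.L)
        # Step (ii): hidden-layer output matrix H (N x L).
        H = self._g(X @ self.W.T + self.B)
        # Step (iii): beta = pinv(H) @ T, the minimum-norm least-squares solution
        # of Eq. (1); equals (H^T H)^{-1} H^T T when H^T H is nonsingular.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        return self._g(X @ self.W.T + self.B) @ self.beta


# Toy regression check: learn y = sin(x) from noisy samples.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X) + 0.05 * np.random.default_rng(0).normal(size=X.shape)
model = ELM(n_hidden=40, seed=1).fit(X, T)
print(np.mean((model.predict(X) - np.sin(X)) ** 2))  # small training MSE
```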
3. Two-hidden-layer extreme learning machine

In 1997, Tamura and Tateishi [34] demonstrated that two-hidden-layer feedforward networks (TLFNs) are superior to SLFNs in terms of the ability to use fewer hidden neurons to achieve a desired performance level. They claimed that a TLFN with only $N/2 + 3$ hidden neurons can learn $N$ training samples with negligible training error. Huang [24] further demonstrates that, by using $2\sqrt{(m+3)N}$ hidden neurons, a TLFN can learn $N$ training samples with an arbitrarily small training error (a numerical comparison of the two bounds is sketched below).
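To convey the scale of these results, the following arithmetic sketch (our addition, using the two bounds exactly as quoted above) compares the hidden-neuron counts for a hypothetical task with $N = 10{,}000$ training samples and $m = 1$ output neuron.

```python
import math

# Hidden-neuron counts given by the two TLFN results quoted above,
# for N training samples and m output neurons.
def tamura_tateishi(N):
    return N / 2 + 3                     # Tamura and Tateishi [34]

def huang(N, m):
    return 2 * math.sqrt((m + 3) * N)    # Huang [24], bound as quoted above

N, m = 10_000, 1
print(tamura_tateishi(N))                # 5003.0
print(huang(N, m))                       # 400.0, far fewer hidden neurons
```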
Such advantage of TLFNs motivates us to translate