
Extreme Learning Machines with Regularization for the Classification of Gene Expression Data

Dániel T. Várkonyi, Krisztián Buza

Eötvös Loránd University, Faculty of Informatics, Department of Data Science and Engineering, Telekom Innovation Laboratories, Budapest, Hungary
{varkonyid,buza}@inf.elte.hu
WWW home page: http://t-labs.elte.hu

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Extreme learning machine (ELM) is a special single-hidden-layer feed-forward neural network (SLFN) with only one hidden layer and randomly chosen weights between the input layer and the hidden layer. The advantage of ELM is that only the weights between the hidden layer and the output layer need to be trained; therefore, the computational costs are much lower, resulting in moderate training time. In this paper, we compare ELMs with different regularization strategies (no regularization, L1, L2) in the context of a binary classification task related to gene expression data. As L1 regularization is known to lead to sparse structures (i.e., many of the learned weights are zero) in the case of various models, we examine the distribution of the learned weights and the sparsity of the resulting structure in the case of ELM.

Keywords: Extreme Learning Machine, Classification, Logistic Regression, L1 Regularization, Gene Expression

1 Introduction

Recent advances in neural networks have led to breakthroughs in many applications in various domains, such as games, finance, medicine and engineering, see e.g. [6], [17], [21]. In most cases, gradient-based training is used to find appropriate values of the weights of the network. Gradients are usually calculated with back propagation (BP) [16]. However, gradient-based training may be too slow in certain applications.

For this reason, other training approaches have been proposed, such as subset selection [13], [4], second-order optimization [7], and global optimization [2], [19]; see also [1] for details. All the aforementioned algorithms may get stuck in local minima and suffer from slow convergence.

Extreme Learning Machines (ELM) were introduced by Huang et al. [10], [11] as a special single-layer feed-forward neural network. ELMs are general function approximators. ELMs overcome the main disadvantages of feed-forward neural networks (FNN). The training speed of ELM is much faster than that of FNN, since ELM has only one hidden layer and the input weights (i.e., the weights between the input layer and the hidden layer) are initialized once and not trained iteratively. With a well-chosen convex activation function, the issue of getting stuck in local minima can be avoided.

While neural networks are powerful, due to their complexity and in the absence of appropriate regularization, they tend to overfit the data. In the era of deep learning, L1 regularization became popular for various reasons: on the one hand, sparse structures resemble the brain; on the other hand, they lead to computationally cheap models, as the resulting zero weights correspond to the lack of connections and may therefore be omitted.

Regularized ELMs have been shown to outperform non-regularized ELMs [5], [12], [15], [20]. However, as opposed to our study, none of the aforementioned works focused on the classification of gene expression data and the sparsity of the learned weights.

In our study, we compare various regularization techniques – in particular, L1 and L2 regularization as well as the lack of regularization – in the context of the classification of gene expression data using ELM.

2 Basic Notation and Problem Formulation

First, we define the classification problem and introduce the basic notation used in this paper. We are given a set of training data X = {x^(1), x^(2), ..., x^(m)} containing instances x^(i) = (x_1^(i), x_2^(i), ..., x_n^(i)) ∈ ℝ^n. For each instance x^(i), its label y^(i) is also given. The set of labels is denoted by Y = {y^(1), y^(2), ..., y^(m)}. Each label y^(i) ∈ {0, 1}, where 0 denotes a negative instance and 1 denotes a positive instance.

We use x_1, x_2, ..., x_n to denote the input nodes. H is the only hidden layer, and the number of units in the hidden layer is denoted by L. We use h_i to denote the ith hidden node. The activation value of the ith hidden node for an instance x is h_i[x] ∈ ℝ, b_i ∈ ℝ is the bias of the ith hidden node, and a_{i,j} ∈ ℝ is the randomly initialized weight from x_i to the jth hidden node. The output layer contains only a single unit, β_i ∈ ℝ is the weight from h_i to the output unit, and b_o ∈ ℝ is the bias of the output node. ELM(x) ∈ ℝ is the activation of the output unit for an input instance x. The structure of ELM is shown in Fig. 1.

Figure 1: Structure of ELM for binary classification [14]

As described in Section 3.2, training an ELM leads to a minimization problem:

\min_{\beta_i} J(\beta)    (1)

where J(β) is the cost function of the ELM.
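To make the notation above concrete, the following minimal NumPy sketch (not from the paper; all array names, sizes, and the seed are illustrative assumptions) sets up the training data and the ELM parameters: the input weights a_{i,j} and the hidden biases b_i are drawn randomly and kept fixed, while the output weights β_j and the output bias b_o are the only trainable parameters.

```python
import numpy as np

rng = np.random.default_rng(42)      # assumed seed, for illustration only

# Hypothetical sizes: m training instances, n features, L hidden units.
m, n, L = 701, 71, 50

# Training data X (one row per instance x^(i)) and binary labels y^(i) in {0, 1}.
X = rng.standard_normal((m, n))      # placeholder standing in for gene expression features
y = rng.integers(0, 2, size=m)

# Randomly initialized, fixed input weights a_{i,j} and hidden node biases b_i.
A = rng.standard_normal((n, L))      # A[i, j]: weight from input x_i to hidden node j
b = rng.standard_normal(L)           # b[i]: bias of hidden node i

# Output weights beta_j and output bias b_o: the only parameters that are trained.
beta = np.zeros(L)
b_o = 0.0
```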
3 Methods

In this section, we introduce the fundamentals required to understand our work: ELM, logistic regression, and the regularization of logistic regression.

3.1 ELM

Extreme Learning Machine is a special kind of single-layer feed-forward network. The network has only one hidden layer. The weights between the input layer and the hidden layer (input weights, for short) are initialized once and not trained, i.e., they remain unchanged. The output weights between the hidden layer and the output layer are trained iteratively. As the input weights remain in their initial state and only the output weights are trained, the training time of an ELM is much lower than that of a comparable single-layer feed-forward neural network (SLFN) [11].

The output of an ELM is the value of the activation function applied to the weighted sum of the activation values of the hidden nodes:

ELM(x) = g\left( \sum_{j=1}^{L} \beta_j h_j[x] + b_o \right)    (2)

where g is the activation function. Since getting stuck in local minima needs to be avoided, a convex activation function, in particular the sigmoid function, was chosen:

g(z) = \frac{1}{1 + e^{-z}}    (3)

The activation value of the ith hidden node for an input x is:

h_i[x] = g\left( \sum_{j=1}^{n} a_{j,i} x_j + b_i \right)    (4)

3.2 Logistic Regression in the Output Layer of ELM

Logistic regression (LR) is one of the most frequently and effectively used binary classification methods. In our case, the output layer of the ELM implements logistic regression based on the activation values of the hidden units. Thus, the cost function without regularization is:

J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left( ELM(x^{(i)}), y^{(i)} \right)    (5)

where

\mathrm{Cost}(ELM(x), y) = \begin{cases} -\log(ELM(x)) & \text{if } y = 1 \\ -\log(1 - ELM(x)) & \text{if } y = 0. \end{cases}    (6)

Using (5) with (6), the cost function can be equivalently written as:

J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left( ELM(x^{(i)}) \right) + (1 - y^{(i)}) \log\left( 1 - ELM(x^{(i)}) \right) \right)    (7)

The partial derivative of the cost function w.r.t. the kth parameter β_k is:

\frac{\partial}{\partial \beta_k} J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( ELM(x^{(i)}) - y^{(i)} \right) h_k[x^{(i)}]    (8)

Logistic regression can be trained with gradient descent, that is, after initializing the parameters β_k, in every iteration all β_k are updated simultaneously according to the following rule:

\beta_k = \beta_k - \alpha \frac{\partial}{\partial \beta_k} J(\beta)    (9)

where α is the learning rate.

3.3 LASSO and Ridge Regression in the Output Layer of ELM

In logistic regression, and generally in all regression models, it is a common goal to keep the model as simple as possible. Regularization penalizes complex models; in particular, a penalty term is added to the cost function. Ridge regression (L2) adds the squared magnitude of the coefficients as a penalty term to the loss function. LASSO (Least Absolute Shrinkage and Selection Operator) regression adds the absolute value of the coefficients as a penalty term to the loss function. The key difference between these techniques is that LASSO shrinks the coefficients of the less important features to zero and thus leads to a model with a less complex structure.

In our case, the L1-regularized cost function is:

J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left( ELM(x^{(i)}), y^{(i)} \right) + \frac{\lambda}{m} \sum_{j=1}^{L} |\beta_j|    (10)

and its partial derivative w.r.t. the kth parameter β_k is:

\frac{\partial}{\partial \beta_k} J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( ELM(x^{(i)}) - y^{(i)} \right) h_k[x^{(i)}] + \frac{\lambda}{m} \mathrm{sign}(\beta_k)    (11)

The L2-regularized cost function is:

J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left( ELM(x^{(i)}), y^{(i)} \right) + \frac{\lambda}{m} \sum_{j=1}^{L} \beta_j^2    (12)

and its partial derivative w.r.t. the kth parameter β_k is:

\frac{\partial}{\partial \beta_k} J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( ELM(x^{(i)}) - y^{(i)} \right) h_k[x^{(i)}] + 2 \frac{\lambda}{m} \beta_k    (13)

where λ is the regularization coefficient, which controls the weight of the penalty term relative to the average cost.
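The following sketch illustrates how the training described in Sections 3.1–3.3 could be implemented. It is a minimal, non-authoritative illustration rather than the authors' code; the function names, the vectorized form, and the hyperparameter defaults are assumptions. It computes the forward pass of equations (2)–(4) and performs the gradient-descent update of equation (9) using the plain, L1-, or L2-regularized gradients of equations (8), (11), and (13).

```python
import numpy as np

def sigmoid(z):
    # Activation function g(z) = 1 / (1 + e^{-z}), eq. (3).
    return 1.0 / (1.0 + np.exp(-z))

def hidden_activations(X, A, b):
    # h_i[x] = g(sum_j a_{j,i} x_j + b_i), eq. (4), computed for all instances at once.
    return sigmoid(X @ A + b)                        # shape (m, L)

def elm_output(H, beta, b_o):
    # ELM(x) = g(sum_j beta_j h_j[x] + b_o), eq. (2).
    return sigmoid(H @ beta + b_o)                   # shape (m,)

def train_output_layer(X, y, A, b, reg=None, lam=0.1, alpha=0.5, n_iter=5000):
    """Gradient descent on the (optionally regularized) logistic cost.
    reg is None, 'l1' or 'l2'; lam and alpha play the roles of lambda and alpha.
    Hyperparameter defaults are illustrative assumptions."""
    m = X.shape[0]
    H = hidden_activations(X, A, b)                  # fixed: input weights are never updated
    beta, b_o = np.zeros(A.shape[1]), 0.0
    for _ in range(n_iter):
        p = elm_output(H, beta, b_o)
        err = p - y                                  # ELM(x^(i)) - y^(i)
        grad_beta = H.T @ err / m                    # eq. (8)
        grad_bo = err.mean()
        if reg == 'l1':
            grad_beta += (lam / m) * np.sign(beta)   # eq. (11)
        elif reg == 'l2':
            grad_beta += 2.0 * (lam / m) * beta      # eq. (13)
        beta -= alpha * grad_beta                    # update rule, eq. (9)
        b_o -= alpha * grad_bo
    return beta, b_o
```

With the arrays from the sketch after Section 2, train_output_layer(X, y, A, b, reg='l1') would train the L1-regularized variant; the returned β values correspond to the learned output weights whose distribution and sparsity are examined in this paper.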
Several of the features contained missing values (the BAD_N, BCL2_N, pCFOS_N, H3AcK18_N, EGR1_N, H3MeK4_N genes); we ignored these features. Some of the instances of the remaining dataset contained missing values in other features; these instances were also ignored, resulting in a dataset of 1047 instances and 71 gene expression features.

We split the data into train and test sets as follows: the test set contains 346 randomly selected instances, while the remaining 701 instances are assigned to the training set.

5 Experimental Settings

We compared three ELMs that differ in terms of the applied regularization technique: in the first model we did not use any regularization at all, while in the second and third models we applied L1 and L2 regularization, respectively.
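As a rough illustration of the split and of the three configurations described above (reusing the hypothetical train_output_layer helper from the sketch after Section 3; the seed, the λ value, and the number of hidden units are assumptions, and the random arrays merely stand in for the preprocessed gene expression data):

```python
import numpy as np

rng = np.random.default_rng(0)                 # assumed seed, for illustration only

# Placeholders standing in for the preprocessed data: 1047 instances, 71 features.
data = rng.standard_normal((1047, 71))
labels = rng.integers(0, 2, size=1047)

# Random split: 346 test instances, the remaining 701 for training.
idx = rng.permutation(len(data))
test_idx, train_idx = idx[:346], idx[346:]
X_train, y_train = data[train_idx], labels[train_idx]
X_test, y_test = data[test_idx], labels[test_idx]

# Shared, fixed random input weights and hidden biases for all three models.
L = 50                                         # assumed number of hidden units
A = rng.standard_normal((71, L))
b = rng.standard_normal(L)

# The three compared configurations: no regularization, L1, and L2.
models = {
    'none': train_output_layer(X_train, y_train, A, b, reg=None),
    'l1':   train_output_layer(X_train, y_train, A, b, reg='l1', lam=0.1),
    'l2':   train_output_layer(X_train, y_train, A, b, reg='l2', lam=0.1),
}
```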