
PAPER

An Improved Online Multiclass Classification Algorithm Based on Confidence-Weighted

Ji HU†, Chenggang YAN†, Jiyong ZHANG†a), Dongliang PENG†, Chengwei REN†, Nonmembers, and Shengying YANG††, Member

SUMMARY  Online learning is a method that updates the model gradually, modifying and strengthening the previous model so that the updated model can adapt to new data without relearning all the data. However, the accuracy of current online multiclass learning algorithms still has room for improvement, and their ability to produce sparse models is often not strong. In this paper, we propose a new Multiclass Truncated Gradient Confidence-Weighted online learning algorithm (MTGCW), which combines the Truncated Gradient algorithm and the Confidence-Weighted algorithm to achieve higher learning performance. The experimental results demonstrate that the accuracy of the MTGCW algorithm is consistently better than that of the original CW algorithm and other baseline methods. Based on these results, we applied our algorithm to phishing website recognition and image classification and obtained encouraging experimental results. We therefore have reason to believe that our classification algorithm is adept at handling unstructured data, which can promote the cognitive ability of computers to a certain extent.

key words: cognitive system, online learning, multiclass classification, streaming data, confidence-weighted

Manuscript received September 28, 2020. Manuscript revised January 7, 2021. Manuscript publicized March 15, 2021.
† The authors are with Hangzhou Dianzi University, Hangzhou, Zhejiang, China.
†† The author is with Zhejiang University of Science & Technology, Zhejiang, China.
a) E-mail: [email protected] (Corresponding author)
DOI: 10.1587/transinf.2020EDP7190
Copyright © 2021 The Institute of Electronics, Information and Communication Engineers

1. Introduction

Multiclass classification tasks are widely used in personal credit evaluation, user-portrait construction, image classification and so on. These scenarios rely on streaming data for a better user experience and lower latency. Given the scale of big data, it is effective to use efficient online learning algorithms to process streaming data in these application scenarios, and all of them demand the highest possible accuracy. For these reasons, improving the accuracy of multiclass classification tasks is the main problem that multiclass classifiers need to solve. Thus, in recent years, many online multiclass learning algorithms have been proposed, most of them extended from binary classification tasks [1], such as the Multiclass Perceptron [2], the Multiclass Confidence-Weighted algorithm [3], the Multiclass Passive-Aggressive algorithm and so on. Since these multiclass classification algorithms are based on classic online learning algorithms, they inherit their disadvantages: difficulty in reducing the dimension of streaming data and poor robustness. Among these classic algorithms, the Confidence-Weighted (CW) algorithm [4] has the advantages of relatively high classification accuracy and a large margin. We hope to overcome its shortcomings while retaining its advantages, so that both performance and capability can be improved.

In this paper, we extend the CW algorithm to multiclass classification and add a new operation to its update steps to enhance the feature selection capability of the model. After each weight update, our algorithm checks whether the gradient exceeds a threshold value; if it does, the gradient in that direction is truncated. Each truncation operation effectively reduces the complexity of the model and increases computing efficiency. In addition, we introduce a controllable parameter to trade off accuracy against robustness, which makes our algorithm perform better across different tasks.

The rest of this paper is organized as follows: Sect. 2 reviews related work; Sect. 3 introduces the previous work that most closely relates to our method; Sect. 4 proposes the Multiclass Truncated Gradient Confidence-Weighted algorithm and gives detailed operation steps; Sect. 5 conducts extensive experiments on our proposed algorithm and other state-of-the-art algorithms; Sect. 6 concludes this work.

2. Related Works

Online learning is a continuous training process in which input values are fed into the model in each round of training, and the model outputs prediction results based on the current parameters [5]. If the predicted class label equals the input class label, the model continues to be used for the next round of input values; if not, the model suffers a loss and updates itself to make better predictions for future data [21], [22].

Our proposed online learning algorithm for multiclass classification uses the One-versus-Rest (OvR) strategy [6] and is related to a variety of basic online learning methods, including the Perceptron algorithm [7], [8], CW learning, the Soft Confidence-Weighted (SCW) algorithm [9], the Passive-Aggressive (PA) algorithm [10], the Online Gradient Descent (OGD) algorithm [11]-[13] and the Truncated Gradient (TG) algorithm. We will introduce some of these algorithms below.
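All of the methods reviewed in this section instantiate the predict-suffer-update protocol just described. The following minimal sketch makes the loop concrete (here and throughout we use Python for illustration, although the paper's experiments were implemented in Matlab); the model object with its predict and update methods is a hypothetical interface, not code from the paper:

def run_online(model, stream):
    # Generic online learning loop: predict, compare, update.
    mistakes = 0
    for x, y in stream:
        y_hat = model.predict(x)   # prediction from the current parameters
        if y_hat != y:             # wrong prediction: the model suffers a loss
            mistakes += 1
            model.update(x, y)     # adjust the parameters for future rounds
    return mistakes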


2.1 One-versus-Rest Strategy

The OvR strategy solves a multiclass classification problem by splitting it into multiple binary classification problems. It trains a unique binary classifier for each class, with all samples belonging to that class treated as positive and the rest as negative. It requires that the binary classifiers not only predict a class label but also generate a real-valued confidence score for decision-making. In our method, we need the OvR strategy to generate the classifiers whose prediction results are used to calculate the suffered loss.

2.2 Perceptron Algorithm

The basic idea of the Perceptron algorithm is to find a hyperplane \omega^T x + b in the sample space that divides the dataset into two categories, so the function used to determine the class labels is formulated as:

\hat{y} = \mathrm{sign}(\omega^T x + b) = \begin{cases} +1, & \omega^T x + b \ge 0 \\ -1, & \omega^T x + b < 0 \end{cases} \quad (1)

\omega is a column vector of weight parameters and b is the bias. We can fix b and update the parameter \omega; the weight adjustment of Perceptron learning is then:

\Delta\omega = \eta(y_t - \hat{y}_t)x \quad (2)

\omega = \omega + \Delta\omega \quad (3)

\eta is the learning rate and is usually between 0 and 1.

When the Perceptron is used for multiclass classification tasks, it splits into three multiclass Perceptron algorithms according to different allocation strategies: max-score, uniform and proportion multiclass Perceptron. Here we only introduce the max-score multiclass Perceptron. According to the suffered loss, the update rule is expressed as:

\omega_{t+1,i} = \omega_{t,i} + \alpha_{t,i} x_t \quad (4)

\alpha_{t,i} = \begin{cases} -1 & \text{if } i = \arg\max_j \omega_{t,j}^T x_t \\ +1 & \text{if } i = y_t \\ 0 & \text{otherwise} \end{cases} \quad (5)

where \omega is a matrix, t represents the t-th round, and the i-th row of \omega is the linear classifier for the i-th label.

2.3 Online Gradient Descent Algorithm

OGD is an effective algorithm that is simple to operate and widely used in online learning. Two steps are performed in each round of update: first, perform a gradient descent step on the current model with the loss function evaluated on the data of the round; then, if the updated parameters leave the feasible domain, project them back onto it. Unlike offline gradient descent, which uses all the data to obtain the gradient and optimize the entire set of model parameters, the OGD algorithm only uses the current data to calculate the gradient once per update, so its advantage is low cost when used for multiclass classification tasks. Here we only present the suffered loss of OGD and its update rule.

The hinge loss:

\ell(\omega;(x_t,y_t)) = \max\Big(0,\ 1 - \omega_{y_t}^T x_t + \max_{j \ne y_t} \omega_j^T x_t\Big) \quad (6)

The update rule:

\omega_{t+1,i} = \omega_{t,i} + \alpha_{t,i} x_t \quad (7)

\alpha_{t,i} = \begin{cases} -1/\sqrt{t} & \text{if } i = \arg\max_j \omega_{t,j}^T x_t \\ +1/\sqrt{t} & \text{if } i = y_t \\ 0 & \text{otherwise} \end{cases} \quad (8)

2.4 Passive-Aggressive Algorithm

The Passive-Aggressive algorithm is an online learning algorithm proposed by Koby Crammer and colleagues in 2006. The idea is simple, but it has been proven superior to other learning algorithms on multiclass tasks. Similar to the Perceptron algorithm, a weight vector \omega is given, and the loss is based on the hinge loss function:

\ell(\omega;(x,y)) = \max(0,\ 1 - y(\omega^T x)) \quad (9)

where y(\omega^T x) is the margin. The optimization of PA learning on round t is:

\omega_{t+1} = \arg\min_\omega \frac{1}{2}\|\omega - \omega_t\|^2 \quad \text{s.t. } \ell(\omega;(x,y)) = 0 \quad (10)

This equality-constrained optimization problem has a closed-form update rule:

\omega_{t+1} = \omega_t + \tau_t y_t x_t, \quad \text{where } \tau_t = \frac{\ell_t}{\|x_t\|^2} \quad (11)

Further, a parameter C is introduced to let PA handle non-separable instances and be more robust:

\omega_{t+1} = \arg\min_\omega \frac{1}{2}\|\omega - \omega_t\|^2 + C\,\ell(\omega_t;(x_t,y_t)) \quad (12)

where C trades off passiveness against aggressiveness; a higher C value yields stronger aggressiveness. Similarly, in the multiclass case, the PA update rule can be extended as follows. If \ell(\omega_t;(x_t,y_t)) > 0:

\omega_{t+1,i} = \omega_{t,i} + \alpha_{t,i} x_t \quad (13)

\alpha_{t,i} = \begin{cases} -\min\left(C,\ \dfrac{\ell(\omega_t;(x_t,y_t))}{2\|x_t\|^2}\right) & \text{if } i = \arg\max_j \omega_{t,j}^T x_t \\ +\min\left(C,\ \dfrac{\ell(\omega_t;(x_t,y_t))}{2\|x_t\|^2}\right) & \text{if } i = y_t \\ 0 & \text{otherwise} \end{cases} \quad (14)
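The three update rules above all share the additive form of Eqs. (4), (7) and (13) and differ only in the step size \alpha_{t,i}. The following Python/NumPy sketch collects them in one routine; it is an illustration of the equations, not the authors' implementation, and it glosses over minor differences in when each rule triggers an update (the Perceptron updates only on a mistake, while PA updates whenever the hinge loss is positive):

import numpy as np

def multiclass_update(W, x, y, t, rule="perceptron", C=1.0):
    # W: (n_classes, n_features) weight matrix; x: feature vector; y: true label.
    scores = W @ x
    wrong = scores.copy()
    wrong[y] = -np.inf
    s = int(np.argmax(wrong))                     # highest-scoring wrong label
    loss = max(0.0, 1.0 - scores[y] + scores[s])  # multiclass hinge loss, Eq. (6)
    if loss <= 0.0:
        return W                                  # no update this round
    if rule == "perceptron":                      # Eqs. (4)-(5): unit step
        step = 1.0
    elif rule == "ogd":                           # Eqs. (7)-(8): 1/sqrt(t) step
        step = 1.0 / np.sqrt(t)
    else:                                         # "pa", Eqs. (13)-(14)
        step = min(C, loss / (2.0 * np.dot(x, x)))
    W[y] += step * x                              # pull the true class toward x
    W[s] -= step * x                              # push the wrong class away
    return W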

3. Preliminaries

The CW algorithm has advanced classification performance due to its introduction of a probabilistic model [24]. The TG algorithm truncates the gradient within a threshold, which is an effective method to simplify the model. In this section, we give a detailed introduction to the two algorithms on which our method is based.

3.1 Truncated Gradient Algorithm

With high-dimensional feature vectors and big datasets, it is very important that the model coefficients have good sparsity. The truncated gradient algorithm was proposed by John Langford, Lihong Li and Tong Zhang in 2009 [14]; it controls the sparsity of the feature vector well and greatly improves feature selection ability and interpretability at the same time. Thus, TG is often used in online learning to enhance learning performance and obtain sparse models.

3.2 Confidence-Weighted Algorithm

According to the update mode, online learning classifiers can be divided into the first-generation Perceptron algorithm and the second-generation online Passive-Aggressive learning algorithm; the CW algorithm can be considered the third-generation online learning algorithm [15].

CW combines confidence with the probability distribution and assumes that \omega follows a Gaussian distribution. The sign of \omega^T x is used as the prediction result, as in the Perceptron. The margin of an example (x,y) is given by y(\omega^T x), and if the margin value is positive, the model predicts correctly on this sample. The absolute value of the margin is used as the confidence, so a larger confidence value indicates a more accurate prediction. The optimization task of the algorithm is to update the weight distribution by minimizing the Kullback-Leibler divergence between the new weight distribution and the old one, while ensuring that the probability of a correct prediction on the training instance is no smaller than the confidence hyperparameter \eta [25].

In multiclass Confidence-Weighted learning, the coefficients are updated similarly to the binary case. The multiclass CW optimization problem has a closed-form update rule. If \alpha_t > 0:

\mu_{t+1} = \mu_t + \alpha_t \Sigma_t \Delta\psi_t \quad (15)

\Sigma_{t+1} = \Sigma_t - \beta_t \Sigma_t \Delta\psi_t \Delta\psi_t^T \Sigma_t \quad (16)

where

\Delta\psi_t = \psi(x_t, y_t) - \psi(x_t, \hat{y}_t) \quad (17)

\hat{y}_t = \arg\max_i (\mu_i^T x_t) \quad (18)

\alpha_t = \max\left\{0,\ \frac{-(1 + 2\phi m_t) + \sqrt{(1 + 2\phi m_t)^2 - 8\phi(m_t - \phi v_t)}}{4\phi v_t}\right\} \quad (19)

\beta_t = \frac{1}{1/(2\alpha_t\phi) + v_t} \quad (20)

v_t = \Delta\psi_t^T \Sigma_t \Delta\psi_t, \quad m_t = \mu_t^T \Delta\psi_t, \quad \phi = \Phi^{-1}(\eta) \quad (21)

Experiments show that introducing confidence can effectively guide online learning in multiclass classification tasks.
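As a concrete illustration of Eqs. (15)-(21), the following sketch performs one multiclass CW update. Variable names mirror the paper's notation, scipy's norm.ppf stands in for \Phi^{-1}, and the snippet is a sketch of the closed form rather than the authors' code:

import numpy as np
from scipy.stats import norm

def mcw_update(mu, Sigma, dpsi, eta=0.9):
    # mu: mean vector; Sigma: covariance matrix;
    # dpsi: feature difference psi(x_t, y_t) - psi(x_t, y_hat), Eq. (17).
    phi = norm.ppf(eta)                            # phi = Phi^{-1}(eta), Eq. (21)
    v = dpsi @ Sigma @ dpsi                        # v_t, Eq. (21); assumed > 0
    m = mu @ dpsi                                  # m_t, Eq. (21)
    disc = (1 + 2 * phi * m) ** 2 - 8 * phi * (m - phi * v)
    alpha = max(0.0, (-(1 + 2 * phi * m)
                      + np.sqrt(max(disc, 0.0))) / (4 * phi * v))  # Eq. (19)
    if alpha > 0:                                  # update only when alpha_t > 0
        beta = 1.0 / (1.0 / (2 * alpha * phi) + v) # Eq. (20)
        Sd = Sigma @ dpsi
        mu = mu + alpha * Sd                       # Eq. (15)
        Sigma = Sigma - beta * np.outer(Sd, Sd)    # Eq. (16)
    return mu, Sigma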
4. Proposed Methods

When dealing with multiclass classification problems, a common approach is to combine the OvR strategy with one of the above binary learning algorithms and modify the update rules to accommodate the multiclass case. However, using only the OvR strategy to extend the binary case makes it difficult for the resulting multiclass classifier to use the suffered loss to update its coefficients based on the prediction results, as the binary classification algorithm does. Moreover, this causes the generated classifiers to inherit and amplify the shortcomings of the binary classification algorithm.

To address these limitations, in this section we propose a new online learning algorithm suitable for multiclass classification of streaming data, named Multiclass Truncated Gradient Confidence-Weighted (MTGCW), which aims to overcome the shortcomings of CW and further improve the prediction accuracy and feature selection ability of the model. Like the previous transformation method, we also adopt the OvR strategy, but the difference is that we integrate the CW algorithm with the TG algorithm when generating each binary classifier.

The CW algorithm itself has some disadvantages: (1) its updating strategy is very aggressive; when there is noise in the streaming data, the CW algorithm greatly modifies the parameters, resulting in a decrease in accuracy, so when used for multiclass classification it does not obtain satisfactory performance; (2) the poor sparsity of the model parameters leads to poor interpretability and, to some extent, poor performance of the model. To overcome these disadvantages, we have done two things. (1) A parameter C, similar to that of the PA algorithm and SCW learning, is introduced to adjust aggressiveness and passiveness. The difference is that in our algorithm the parameter C can be fixed or dynamically changed. The following methods can be used to set C: setting a fixed value before training, or setting a series of candidate values and dynamically selecting the parameter according to the input data [16]. It should be noted that in this paper we use the first method, i.e., a fixed C. (2) Streaming data makes feature selection difficult, so we introduce the TG algorithm to truncate coefficients smaller than the threshold \theta in order to reduce the dimension of the streaming data.

We assume that the coefficients follow Gaussian distributions with mean vector \mu_i and covariance matrix \Sigma_i, the same as CW.

First, we need to use the OvR strategy to generate classifiers. Binary classifiers are extended to multiclass classifiers, and of course the weight vector \omega_{1\times i} must be expanded into a weight matrix \omega_{n\times i}, where n is the total number of categories and i is the number of attributes. Multiplying \omega_{n\times i} by x gives a column vector \delta_{n\times 1}. Each element of \delta is the product of a row vector of the matrix \omega_{n\times i} and x, so the j-th element of the column vector is the real-valued confidence score for classifying the current sample into the j-th class (j is an integer between 1 and n). This operation makes each row of \omega_{n\times i} a linear classifier labeled j. Thus, we select the maximum element of \delta, and the row number of that element is used as the prediction result [11]. Then we use the predicted label and the real label to calculate the loss function.

In the multiclass case, we set a threshold \eta to control the probability of the difference between the minimum score over all relevant classes and the maximum score over all irrelevant classes. The constraint is:

\Pr_{\omega_{t,r_t}\sim\mathcal{N}(\mu_{t,r_t},\Sigma_t),\ \omega_{t,s_t}\sim\mathcal{N}(\mu_{t,s_t},\Sigma_t)}\left[\omega_{t,r_t} x_t \ge \omega_{t,s_t} x_t\right] \ge \eta \quad (22)

where r_t = \arg\min_{r} \mu_{t,r} x_t over the relevant classes and s_t = \arg\max_{s} \mu_{t,s} x_t over the irrelevant classes.

Here we only use the optimization of multiclass SCW (MSCW) [17] as an example to introduce our work, for its better robustness [22]:

(\mu_{t+1,r_t}, \mu_{t+1,s_t}, \Sigma_{t+1}) = \arg\min\ D_{KL}\big(\mathcal{N}(\mu_r,\Sigma)\,\|\,\mathcal{N}(\mu_{t,r_t},\Sigma_t)\big) + D_{KL}\big(\mathcal{N}(\mu_s,\Sigma)\,\|\,\mathcal{N}(\mu_{t,s_t},\Sigma_t)\big) + C\,\ell^{\phi}\big(\mathcal{N}(\mu_r,\mu_s,\Sigma_t); (x_t,y_t)\big) \quad (23)

where the loss function is:

\ell^{\phi}\big(\mathcal{N}(\mu_r,\mu_s,\Sigma_t); (x_t,y_t)\big) = \max\left\{0,\ \frac{-(1+2\phi m_t) + \sqrt{(1+2\phi m_t)^2 - 8\phi(m_t-\phi v_t)}}{4\phi v_t}\right\} \quad (24)

where

v_t = \Delta\psi_t^T \Sigma_t \Delta\psi_t, \quad m_t = \mu_t^T \Delta\psi_t, \quad \phi = \Phi^{-1}(\eta) \quad (25)

In the original multiclass CW (MCW) algorithm, the constraint is:

\text{s.t. } \ell^{\phi}\big(\mathcal{N}(\mu_r,\mu_s,\Sigma_t); (x_t,y_t)\big) = 0 \quad (26)

We can see that in the optimization above, the loss term is multiplied by the parameter C and added to the objective, which reflects that soft-margin classification can tolerate a small number of errors and trades off between keeping the previous information and incorporating the current information while minimizing the objective function. This can prevent fluctuation and effectively reduces the effect of noise. Therefore, the parameter C in the objective function can also be regarded as a penalty parameter. The value of C can be estimated before model training based on the cost of misclassification: a large value of C increases the penalty for misclassification, and a small value decreases it.

The closed-form solution of the MSCW optimization problem can be expressed as:

\mu_{t+1,r_t} = \mu_{t,r_t} + \alpha_t y_t \Sigma_t x_t \quad (27)

\mu_{t+1,s_t} = \mu_{t,s_t} - \alpha_t y_t \Sigma_t x_t \quad (28)

\Sigma_{t+1} = \Sigma_t - \beta_t \Sigma_t x_t x_t^T \Sigma_t \quad (29)

The update rule of the coefficients is:

\alpha_t = \min\left\{C,\ \max\left\{0,\ \frac{1}{2 v_t\psi}\left(-m_t\psi + \sqrt{m_t^2\psi^2 - m_t^2\psi + 2 v_t\psi\phi^2}\right)\right\}\right\} \quad (30)

\beta_t = \frac{\alpha_t\phi}{\sqrt{2u_t} + \alpha_t v_t\phi} \quad (31)

where

u_t = \frac{1}{8}\left(-\alpha_t v_t\phi + \sqrt{\alpha_t^2 v_t^2\phi^2 + 8 v_t}\right)^2 \quad (32)

v_t = x_t^T \Sigma_t x_t, \quad m_t = \mu_{t,r_t}^T x_t - \mu_{t,s_t}^T x_t \quad (33)

\phi = \Phi^{-1}(\eta), \quad \psi = 1 + \frac{\phi^2}{2} \quad (34)
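The closed form of Eqs. (27)-(34) can be written as one update step, sketched below under the same caveats as the earlier snippets (a Python/NumPy illustration, not the authors' Matlab implementation); mu_r and mu_s are the mean vectors of the classes r_t and s_t:

import numpy as np
from scipy.stats import norm

def mscw_update(mu_r, mu_s, Sigma, x, y, C=1.0, eta=0.9):
    # x: feature vector; y: the +1/-1 sign of the update direction.
    phi = norm.ppf(eta)                            # Eq. (34)
    psi = 1.0 + phi ** 2 / 2.0                     # Eq. (34)
    v = x @ Sigma @ x                              # v_t, Eq. (33); assumed > 0
    m = mu_r @ x - mu_s @ x                        # m_t, Eq. (33)
    disc = m**2 * psi**2 - m**2 * psi + 2 * v * psi * phi**2
    alpha = min(C, max(0.0, (-m * psi
                             + np.sqrt(max(disc, 0.0))) / (2 * v * psi)))  # Eq. (30)
    if alpha > 0:
        u = (-alpha * v * phi
             + np.sqrt(alpha**2 * v**2 * phi**2 + 8 * v)) ** 2 / 8.0  # Eq. (32)
        beta = alpha * phi / (np.sqrt(2 * u) + alpha * v * phi)       # Eq. (31)
        Sx = Sigma @ x
        mu_r = mu_r + alpha * y * Sx               # Eq. (27)
        mu_s = mu_s - alpha * y * Sx               # Eq. (28)
        Sigma = Sigma - beta * np.outer(Sx, Sx)    # Eq. (29)
    return mu_r, mu_s, Sigma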

straint is: ωt+1 = TG(ωt+1) = ω − Δ ω φ N μ μ Σ = T( t+1 γ l( t+1;(xt+1,yt+1)),γg,θ) (37) s.t.  ( ( r, s, t); (xt,yt)) 0 (26) T ω + = ⎧1( t 1,α,θ) We can see that in the optimization above, this formula is ⎪ ⎪ max(0, ωt+1 − α) if ωt+1 ∈ [0,θ] multiplied by the parameter C and then added after the opti- ⎨ ⎪ min(0, ωt+1 + α) if ωt+1 ∈ [−θ, 0] (38) mization formula, which highly reflects that the soft-margin ⎩⎪ ωt+1 othersize classification can tolerate a small number of errors and the trade-off between keeping the previous information and up- Where   dating the current information while minimizing the objec- T l(ω + ;(x + ,y+ )) = max 0, 1 − y + ω + x (39) tive function. This can prevent fluctuation and effectively re- t 1 t 1 t 1 t 1 t 1 duces the effects of noise. Therefore, the parameter C in the γ is the learning rate, which controls the learning speed and objective function can also be regarded as a penalty param- the learning accuracy, usually between 0 and 1. eter. The value of C can be estimated before model training After truncating the gradient, the sparsity of the model IEICE TRANS. INF. & SYST., VOL.E104–D, NO.6 JUNE 2021 844

After truncating the gradient, the sparsity of the model achieves the expected progress. The MTGCW algorithm is shown in Table 1.

Table 1  MTGCW online learning algorithm

5. Experiments and Analysis

5.1 Datasets and Compared Algorithms

The main task of multiclass learning is to minimize cumulative errors and thus obtain higher accuracy. Therefore, to test the performance of our proposed MTGCW algorithm, we carefully selected 7 datasets from different fields from the UC Irvine Machine Learning Repository and the KEEL dataset repository. Iris, seeds and car evaluation are all classic datasets; the former two concern plant classification, and the last is a simple hierarchical evaluation of cars. The wine dataset is the result of a chemical analysis of wines from the same region of Italy but from three different varieties, covering 13 constituents. The glass dataset contains attributes of several glass types that can be used to drive criminological investigations. The thyroid dataset records whether a given patient is normal or has hyperthyroidism or hypothyroidism, and tests the performance of each algorithm in medical testing. The final robotnavigation dataset includes sensor data from a robot, and the algorithms need to analyze this data to determine the robot's motion.

Website phishing is an important issue in e-commerce and other industries, as it involves payment and transaction security. A website can be classified as Legitimate, Suspicious or Phishy according to phishing detection [18]. Therefore, deciding whether a website is a phishing website is a multiclass classification task, and the input attributes of the task can be streaming data. To test the effectiveness of our algorithm in this application scenario, we used the Website Phishing Dataset made by Neda Abdelhamid of the Auckland Institute of Studies as the online learning test dataset. Ten selected features are contained in this dataset: URL anchor, Request URL, SFH, URL length, Having '@', Prefix suffix, IP, Sub domain, Web traffic and DNS record. It includes 1353 instances, and all attributes are integer-valued.

Image classification is another very popular application field of machine learning and also the most basic task in computer vision. To further investigate the performance of these models, we apply our approach and other state-of-the-art methods to this task. We select 2 image datasets for this experiment, chosen for their heterogeneity in size and characteristics:

(1) The MNIST database of handwritten digits (MNIST). The MNIST dataset is one of the most common datasets used for image classification and is accessible from many different sources. The complete dataset contains 60,000 training images and 10,000 testing images, all of which are 0-9 handwritten digits taken from American Census Bureau employees and American high school students. Here we only adopt the test set of MNIST, which contains 10,000 images, for the experiment. The size of the MNIST test set is large enough to evaluate the performance of each algorithm on streaming big data.

(2) MAsked FAces (MAFA) [19]. MAFA is a masked face detection benchmark dataset whose images are collected from the Internet. The complete MAFA dataset contains 30,811 images and 35,806 masked faces. To make MAFA more suitable for our experiment, we randomly selected 4079 images from the complete dataset and divided them into four types according to whether a mask is worn correctly: not wearing a mask, wearing a mask correctly, wearing a mask incorrectly, and wearing a non-protective mask. This turns MAFA into a dataset that can be used to train models to detect whether a face is wearing a mask. MAFA is a very distinctive dataset: it contains quite a few strongly occluded samples, which greatly affects the classification performance of models. Moreover, side faces degrade performance as well, especially purely left or right faces, and even very simple masks have a strong impact on face detectors. In sum, all of these characteristics show that MAFA is a challenging dataset even for state-of-the-art face detectors.

Because the resolutions of the images in the two datasets are uneven but the input of the model is a fixed-length vector, we resized each picture, with anti-aliasing, into a standard 28*28 gray-scale image and then flattened it into a 784-dimensional row vector. In addition, to speed up the convergence of the models, the gray-scale value of each pixel is normalized to the range of -1 to 1. Table 2 shows the detailed statistics of the datasets we used.

Table 2  List of datasets used in the experiments

We compared our algorithm with other multiclass online learning algorithms, including the max-score Multiclass Perceptron (MPerceptron), state-of-the-art Multiclass Online Gradient Descent learning (MOGD) and Multiclass Passive-Aggressive (MPA) learning, and the algorithms on which ours is based, Multiclass CW (MCW) and Multiclass SCW (MSCW). MPerceptron, MOGD and MPA are classic and popular first-order online learning methods, and they are good comparison targets for their high efficiency. The last two algorithms, CW and SCW, as the most advanced second-order online learning algorithms, often represent the highest level of classification performance. For the details of each of the above algorithms, please refer to the descriptions in Sect. 2. Comparing against these five state-of-the-art algorithms provides a thorough test of the comprehensive performance of our method.

To ensure a fair comparison of the different algorithms, we use a parameter validation scheme to determine the best parameters for each algorithm before testing. This scheme randomly selects and permutes the data in the training set and searches for the best value within a given parameter range. The search range of the parameter C is {2^{-4}, 2^{-3}, ..., 2^{3}, 2^{4}} and that of the parameter \eta is {0.55, 0.60, ..., 0.90, 0.95} [20].

Table 3  The process of selecting the parameter \eta of the MTGCW algorithm as an example

Table 3 lists the process of selecting the parameter \eta of the MTGCW algorithm as an example. As can be seen in the table, when \eta is 0.900 the mistake rate is the smallest of all rounds, so the best value of \eta, bolded in the table, is 0.900. After determining the best parameters, 20 randomly permuted sequences are selected from the training set, each algorithm is trained 20 times on these sequences, and the average of these training results is taken as the final result. Three indicators are used to measure the performance of the algorithms: online cumulative mistake rate, number of updates and running time cost. The online cumulative mistake rate, calculated by dividing the number of misclassifications in a sequence by the total number of samples, is highly related to prediction accuracy and is the indicator we are most concerned with. All the experiments were implemented in Matlab and run on a regular PC.

5.2 Experimental Results and Analysis

Table 4 summarizes the results of our empirical evaluation of the cumulative performance of the proposed MTGCW and other multiclass learning methods. The bold entries in the table mark the best performance on the given indicator for the given dataset. From the table, it can be clearly seen that the proposed MTGCW algorithm achieves the best performance in terms of mistake rate on all datasets. We found that when the dataset is relatively small, like iris, wine, glass, thyroid and seeds, the accuracy of the MTGCW algorithm far surpasses the other algorithms except MSCW, and slightly exceeds MSCW. However, when the dataset is large, like robotnavigation, car evaluation and website phishing, the accuracy of the MTGCW algorithm outperforms all compared algorithms, including MSCW. Moreover, we can see from the last row of Table 4 that MTGCW significantly outperforms all compared models in online image classification tasks; its performance in terms of mistake rate is up to 22.5% better than the strongest baseline method in masked face detection, which means that when the number of samples handled in the mask detection dataset reaches 10,000, our approach will be correct in roughly 900 or more additional cases compared with the state-of-the-art algorithms. These results validate how effective our approach is, and we have reason to believe that when the dataset is much larger, the performance of our model will improve further. We also believe that if our algorithm is used to detect whether people entering public places wear masks during the epidemic, it will be able to achieve superior performance and make a real contribution.

Further, MTGCW also performs well on the number of updates, reaching the best of all algorithms on some datasets. However, the online cumulative time of MTGCW is sometimes more than that of some other algorithms. This is because we have added the operation of truncating the gradient every K rounds to our proposed algorithm, which increases the time consumption to a certain extent. Besides, we can see from Table 4 that on the first few datasets the standard deviation of MTGCW's mistake rate is slightly higher than that of the other algorithms. We believe that this is due to the introduction of the TG algorithm, which makes the convergence of MTGCW slightly worse than that of the other algorithms.
However, this problem disappears on the subsequent larger datasets in the experiment, because convergence is related to the size of the dataset, i.e., large datasets often lead to better convergence. So when MTGCW is applied to streaming big data, this disadvantage can be ignored. By examining the overall mistakes, we also found that the algorithms with the added "confidence" property, like MCW, MSCW and MTGCW, always outperform classic algorithms such as MPerceptron. Moreover, Fig. 1 shows the online results of the 6 algorithms on 10 datasets of different sizes and domains. This verifies that the addition of the "confidence" property has a great promoting effect on model training, and that the introduction of the TG operation raises the performance further still.

Table 4  Evaluation of cumulative performance of the proposed MTGCW and other algorithms

Fig. 1  Evaluation of cumulative performance of the proposed MTGCW and other algorithms

6. Conclusion

This paper investigates how to enhance the feature selection ability of models in online multiclass learning, with the aim of simultaneously improving the interpretability and performance of the models. We pointed out some shortcomings of the CW algorithm: (1) weak anti-noise capability and (2) poor sparsity of the coefficients. In this regard, we proposed the MTGCW algorithm based on CW, which introduces the parameter C to enhance robustness and the truncated gradient operation to reduce the dimension of streaming data. The experimental results show that MTGCW is quite effective in multiclass classification tasks on streaming data and is a state-of-the-art online multiclass learning algorithm. Further, our algorithm achieved exciting experimental results in the application areas of phishing website recognition and image classification, especially in face mask detection. This means that, compared with the baseline algorithms, our algorithm brings a significant improvement in the ability to analyze unstructured data such as images, which greatly benefits computer visual cognition; i.e., our algorithm is smarter than the others in terms of vision. We also point out that although we use the truncated gradient algorithm to reduce the feature dimension, this leads to an increase in time cost. Future work can address how to reduce the computation time of MTGCW and how to further improve the feature selection ability of online learning algorithms.

References

[1] M. Aly, "Survey on multiclass classification methods," Neural Netw., vol.19, pp.1-9, 2005.
[2] K. Crammer and Y. Singer, "Ultraconservative online algorithms for multiclass problems," Journal of Machine Learning Research, vol.3, pp.951-991, 2003.
[3] K. Crammer, M. Dredze, and A. Kulesza, "Multi-class confidence weighted algorithms," Proc. 2009 Conference on Empirical Methods in Natural Language Processing, vol.2, Association for Computational Linguistics, pp.496-504, 2009.
[4] M. Dredze, K. Crammer, and F. Pereira, "Confidence-weighted linear classification," Proc. 25th International Conference on Machine Learning, pp.264-271, 2008.
[5] S. Shalev-Shwartz, "Online learning and online convex optimization," Foundations and Trends in Machine Learning, vol.4, no.2, pp.107-194, 2012.
[6] D.M.J. Tax and R.P.W. Duin, "Using two-class classifiers for multiclass classification," Object Recognition Supported by User Interaction for Service Robots, IEEE, vol.2, pp.124-127, 2002.
[7] S.I. Gallant, "Perceptron-based learning algorithms," IEEE Trans. Neural Netw., vol.1, no.2, pp.179-191, 1990.
[8] Y. Freund and R.E. Schapire, "Large margin classification using the perceptron algorithm," Machine Learning, vol.37, no.3, pp.277-296, 1999.
[9] J. Wang, P. Zhao, and S.C.H. Hoi, "Exact soft confidence-weighted learning," arXiv preprint arXiv:1206.4612, 2012.
[10] K. Crammer, O. Dekel, J. Keshet, et al., "Online passive-aggressive algorithms," Journal of Machine Learning Research, vol.7, pp.551-585, March 2006.
[11] Y. Ying and M. Pontil, "Online gradient descent learning algorithms," Foundations of Computational Mathematics, vol.8, no.5, pp.561-596, 2008.
[12] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," Proc. 20th International Conference on Machine Learning (ICML-03), pp.928-936, 2003.
[13] S. Smale and Y. Yao, "Online learning algorithms," Foundations of Computational Mathematics, vol.6, no.2, pp.145-170, 2006.
[14] J. Langford, L. Li, and T. Zhang, "Sparse online learning via truncated gradient," Journal of Machine Learning Research, vol.10, pp.777-801, March 2009.
[15] J. Liu, G. Chi, and X. Luo, "Online classification algorithm based on confidence-weighted learning," Computer Engineering, vol.38, no.9, pp.180-182, 2012.
[16] J. Hu, C. Yan, X. Liu, J. Zhang, D. Peng, and Y. Yang, "Truncated gradient confidence-weighted based online learning for imbalance streaming data," 2019 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp.133-138, 2019.
[17] J. Wang, P. Zhao, and S.C.H. Hoi, "Soft confidence-weighted learning," ACM Transactions on Intelligent Systems and Technology (TIST), vol.8, no.1, pp.1-32, 2016.
[18] N. Abdelhamid, A. Ayesh, and F. Thabtah, "Phishing detection based associative classification data mining," Expert Systems with Applications, vol.41, no.13, pp.5948-5959, 2014.
[19] S. Ge, J. Li, Q. Ye, and Z. Luo, "Detecting masked faces in the wild with LLE-CNNs," 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp.426-434, 2017. doi:10.1109/CVPR.2017.53.
[20] S.C.H. Hoi, J. Wang, and P. Zhao, "LIBOL: A library for online learning algorithms," Journal of Machine Learning Research, vol.15, no.1, 2014.
[21] A. Gepperth and B. Hammer, "Incremental learning algorithms and applications," 2016 European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, 2016.
[22] P. Yang, P. Zhao, J. Zhou, and X. Gao, "Confidence weighted multitask learning," Proc. AAAI Conference on Artificial Intelligence, vol.33, pp.5636-5643, 2019.
[23] N.M. Jagirdar, "Online machine learning algorithms review and comparison in healthcare," Master's Thesis, University of Tennessee, 2018.
[24] M. Tan, Y. Yan, L. Wang, et al., "Learning sparse confidence-weighted classifier on very high dimensional data," Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[25] Y. Liu, Y. Yan, L. Chen, Y. Han, and Y. Yang, "Adaptive sparse confidence-weighted learning for online feature selection," Proc. AAAI Conference on Artificial Intelligence, vol.33, pp.4408-4415, 2019.

Appendix

Table A·1  Symbol description

Dongliang Peng is a Professor and PhD supervisor. He graduated from Zhejiang University with a doctorate degree in engineering, majoring in control science and engineering, in 2003, and visited and studied at George Mason University from 2012 to 2013. He is engaged in research on multi-sensor information fusion technology, detection and estimation.

Chengwei Ren is an undergraduate majoring in Intelligence Science and Technology at the School of Automation, Hangzhou Dianzi University. His research interests are the class imbalance and online classification problems in machine learning, as well as feature extraction.

Shengying Yang is a lecturer with a PhD. He received his PhD in Electronic Science and Technology from Hangzhou Dianzi University in 2020. He is mainly engaged in research on big data analysis and power electronic device design.

Ji Hu is a lecturer and PhD candidate. He received the M.S. degree in Electronics Science and Technology from Hangzhou Dianzi University in 2004. He visited McMaster University in Canada as a visiting scholar in 2016 and is currently pursuing the Ph.D. degree in intelligent information processing at Hangzhou Dianzi University. His research interests include artificial intelligence tools and applications, automatic control, computer vision and speech understanding, intelligent system architecture, deep neural networks and image processing.

Chenggang Yan is a Professor and PhD supervisor. He received the PhD degree at the University of Science and completed postdoctoral research at Tsinghua University. His main research directions include machine learning, pattern recognition, image processing, computer vision, computer graphics, medical image processing and biological information processing.

Jiyong Zhang is a Professor and PhD supervisor. He received his bachelor's degree from Tsinghua University in 1999, his master's degree from Tsinghua University in 2001, and his PhD degree from the Swiss Federal Institute of Technology Lausanne (EPFL) in 2008. He has long been committed to technical research in the fields of artificial intelligence, big data, cloud computing, and intelligent human-computer interaction.