IEICE TRANS. INF. & SYST., VOL.E104–D, NO.6 JUNE 2021 840

PAPER An Improved Online Multiclass Classification Algorithm Based on Confidence-Weighted

Ji HU†, Chenggang YAN†, Jiyong ZHANG†a), Dongliang PENG†, Chengwei REN†, Nonmembers, and Shengying YANG††, Member

SUMMARY Online learning is a method which updates the model these classic algorithms, Confidence-Weighted (CW) [4] al- gradually and can modify and strengthen the previous model, so that the up- gorithm has the advantages of relatively high classification dated model can adapt to the new data without having to relearn all the data. accuracy and large margin. We hope to overcome its short- However, the accuracy of the current online multiclass learning algorithm still has room for improvement, and the ability to produce sparse models is comings while retaining its advantages so that both perfor- often not strong. In this paper, we propose a new Multiclass Truncated Gra- mance and capability can be improved. dient Confidence-Weighted online learning algorithm (MTGCW), which In this paper, we extended the CW algorithm for mul- combine the Truncated Gradient algorithm and the Confidence-weighted ticlass classification and added a new operation to its up- algorithm to achieve higher learning performance. The experimental re- sults demonstrate that the accuracy of MTGCW algorithm is always better date steps to enhance the feature selection capability of the than the original CW algorithm and other baseline methods. Based on these model. After each weight update, our algorithm will check results, we applied our algorithm for phishing website recognition and im- whether the gradient exceeds the threshold value. If the gra- age classification, and unexpectedly obtained encouraging experimental re- dient value exceeds the threshold value, the gradient in that sults. Thus, we have reasons to believe that our classification algorithm is direction is truncated. Each truncation operation can effec- clever at handling unstructured data which can promote the cognitive abil- ity of computers to a certain extent. tively reduce the complexity of the model and increase the key words: cognitive system, online learning, multiclass classification, computing efficiency. In addition, we introduce a control- streaming data, confidence-weighted lable parameter to weigh the accuracy and robustness, which make our algorithm perform better in different tasks. 1. Introduction The rest of this paper is organized as following: Sect. 2 reviews related work; Sect. 3 introduces previous work that most closely relates to our method; Sect. 4 proposes the Multiclass classification tasks are widely used in personal Multiclass Truncated Gradient Confidence-Weighted algo- credit evaluation, depiction of user portrait, image classifi- rithm and gives detailed operation steps; Sect. 5 conducts cation and so on. These scenarios rely on streaming data extensive experiments on our proposed algorithm and other for a better user experience and lower latency. Under the state-of-the-art algorithms. Section 6 concludes this work. premise of big data, it is effective to use efficient online learning algorithms to process streaming data in these ap- 2. Related Works plication scenarios. And no doubt all of these application scenarios require the highest possible accuracy. For these Online learning is a continuous training process in which reasons, how to improve the accuracy of multiclass classi- input values are fed into the model in each round of training, fication tasks is the main problem that multiclass classifiers and the model outputs prediction results based on the current need to solve. Thus, in recent years, many online multiclass parameters [5]. If the predicted class label is equal to the learning algorithms were proposed and most of them were input class label value, the model will continue to be used extended from binary classification tasks [1], such as Multi- for the next round of input values; If not, the model will class [2], Multiclass Confidence-Weighted algo- suffer a loss and update to make better predictions for future rithm [3], Multiclass Passive-Aggressive algorithm and so data [21], [22]. on. Since the above multiclass classification algorithms are Our proposed online learning for multiclass classifi- based on classic online learning algorithms, they inherit the cation algorithm uses One-versus-Rest (OvR) strategy [6] disadvantages of difficulty in reducing the streaming data and is related to a variety of basic online learning meth- dimension and poor robustness. As the most advanced of ods, including Perceptron algorithm [7], [8], CW learning, Manuscript received September 28, 2020. Soft Confidence-Weighted (SCW) algorithm [9], Passive- Manuscript revised January 7, 2021. Aggressive (PA) algorithm [10], Online Gradient Descent Manuscript publicized March 15, 2021. (OGD) algorithm [11]–[13] and Truncated Gradient (TG) † The authors are with HangZhou DianZi University, algorithm. HangZhou, ZheJiang, China. We will introduce some of these algorithms below. ††The author is with Zhejiang University of Science & Technol- ogy, ZheJiang, China. a) E-mail: [email protected] (Corresponding author) DOI: 10.1587/transinf.2020EDP7190

Copyright c 2021 The Institute of Electronics, Information and Communication Engineers HU et al.: AN IMPROVED ONLINE MULTICLASS CLASSIFICATION ALGORITHM BASED ON CONFIDENCE-WEIGHTED 841

uses the current data to calculate the gradient once to update, 2.1 One-versus-Rest Strategy while offline gradient descent uses all the data to obtain the gradient to optimize the entire model parameters, so the ad- The OvR strategy is a way to solve multiclass classification vantage of the OGD algorithm is that the cost is small when problems by splitting them into multiple binary classifica- used for multiclass classification tasks. tion problems. This strategy involves training a unique bi- Here we only present the suffered loss of OGD and its nary classifier for each class, with all samples belonging to update rule: this category being positive and the rest negative. It requires The hinge loss:    that the two classifiers not only predict a class label, but (ω;(x,y)) = max 0, 1 − y ω T x − also generate a real-valued confidence score for decision- t t t T making. In our method, we need the OvR strategy to gen- max ω j xt (6) erate classifiers to get the prediction result to calculate the The update rule: suffered loss. ω + = ω + α x (7) t 1,i ⎧ t,i√ t,i t 2.2 Perceptron Algorithm ⎪ − =  ω ⎨⎪ 1/√ tifi ar max t, j xt α =⎪ 1/ tifi= y (8) t,i ⎩⎪ t The basic idea of Perceptron algorithm is to find a hyper- 0 othersize plane ωT x + b in the sample space to divide the dataset into two categories, so the function used to determine the class 2.4 Passive-Aggressive Algorithm labels is formulated as: +1, ωT x + b ≥ 0 The Passive-Aggressive algorithm is an online learning al- yˆ = sign(ωT x + b) = (1) t −1, ωT x + b < 0 gorithm proposed by Koby Crammer in 2006. This algo- rithm idea is simple but has been proven to be superior to ω is a column vector of weight parameter and b is the bias. other learning algorithms in multiclass tasks. Similar to the We can fix b and update the parameter ω, then the weight Perceptron algorithm, a weight vector ω is given and the adjustment of Perceptron learning is as follows: is based on the hinge loss function: ω = − ωT Δω = η(yt − yˆt)x (2) ( ;(x,y)) max(0, 1 y( x)) (9)  ω = ω + Δω (3) Where y(ωT x)isthemargin. And the optimization of the PA learning on round t is: η is and it is usually between 0 and 1. 1 2 When Perceptron is used for multiclass classification ωt+1 = armin ω − ωt (10) tasks, it is divided into three multiclass Perceptron algo- 2 ω = rithms according to different allocation strategies: max- s.t. ( ;(x,y)) 0 score multiclass Perceptron, uniform multiclass Perceptron The above equation-constrained optimization problem has a and proportion multiclass Perceptron. Here we only intro- closed-form up date rule: duce max-score multiclass Perceptron. According to the lt ff ω + = ω + x = su ered loss, the update rule is expressed as: t 1 t τt tyt where τt 2 (11) xt ω + = ω + α x (4) Further, introduce a parameter C to let PA be able to handle t 1,i ⎧ t,i t,i t ⎪ − =  ω non-separable instances and more robust. ⎪ 1 if i ar max t, j xt ⎨⎪ j = 1 αt,i ⎪ +1 if i = y (5) ω =  ω − ω 2 + ω ⎩⎪ t t+1 ar min t C( t;(xt,y)) (12) 0 othersize 2 Where C is a parameter to trade-off between passiveness Where ω is a matrix, t represents the t-th round, i-th row in and aggressiveness, higher C value yield stronger aggres- ω is the linear classifier for the i-th label. siveness. Similarly, in multiclass cases, PA update rule can be extended as: 2.3 Online Gradient Descent Algorithm If (ωt;(xt,yt)) > 0:

ω + = ω + α x (13) OGD is an effective algorithm, which is simple and easy to t 1,i ⎧ t,i t,i t ⎪ ω operate, and is widely used in online learning. The algo- ⎪ ( t;(xt,yt)) ⎪ − min C, if i = armax ω + x ⎪  2 t 1,i t rithm flow is very simple, two steps are performed for each ⎨⎪ 2 xt round of update, first perform a gradient descent on the cur- αt,i =⎪ (ω ;(x ,y)) ⎪ + min C, t t t if i = y rent model and the loss function according to the data of ⎪  2 t ⎩⎪ 2 xt the round. If the updated parameters exceed the definition 0 othersize domain, then project them back to the domain. Unlike the (14) previous offline gradient descent, the OGD algorithm only IEICE TRANS. INF. & SYST., VOL.E104–D, NO.6 JUNE 2021 842

= αt ⎧ ⎫ 3. Preliminaries ⎪ 2 ⎪ ⎨ −1 − 2φmt + (1 + 2φmt) − 8φ(mt − φvt)⎬ max ⎪0, ⎪ ⎩ 4φv ⎭ The CW algorithm has advanced classification performance t due to the introduction of the probabilistic model. [24] The (19) TG algorithm truncates the gradient within the threshold, 1 βt = (20) which is an effective method to simplify the model. In this 1/(2αtφ) + vt section, we will give a detailed introduction to these two = Δ T Σ Δ = μ T Δ = Φ−1 vt ψt t ψt, mt t ψt,φ (η) (21) algorithms we are based on. Experiments can prove that the introduction of the confi- 3.1 Truncated Gradient Algorithm dence can effectively induce online learning in multiclass classification tasks. In high-dimensional feature vectors and big datasets, it is very important that the model coefficients have good 4. Proposed Methods sparsity. The truncated gradient algorithm was proposed by When dealing with multiclass classification problems, a John Langford, Lihong Li and Tong Zhang in 2009 [14], common approach is to combine the OvR strategy with the the sparsity of the feature vector is well controlled and above binary learning algorithm and modify update rules to greatly improved the ability of feature selection and the in- accommodate multiclass cases. However, using only the terpretability at the same time by this algorithm. Thus, TG OvR strategy to extend the binary cases will make the re- is often used for online learning to enhance learning perfor- sulting mufti-classification to be difficult to suffer loss to up- mance and obtain sparse models. date the coefficients based on the prediction results as in the binary classification algorithm. Moreover, this will cause 3.2 Confidence-Weighted Algorithm the generated multiple classifiers to inherit and amplify the shortcomings of the binary classification algorithm. The online learning classifier can be divided into the first- To address the limitations above, in this section, generation perceptron algorithm and the second-generation we propose a new online learning algorithm suitable for online passive-aggressive learning algorithm according to multiclass classification of streaming data, named Multi- the update mode. The CW algorithm can be considered as class Truncated Gradient Confidence-Weighted (MTGCW), the third-generation online learning algorithm [15]. which aims to overcome the shortcomings of CW and fur- CW combines confidence with the probability distribu- ther improve prediction accuracy and feature selection abil- tion and assumes that ω follows the Gaussian distribution. ity of the model. Like the previous transformation method, ff The sign of ωT x is used as the prediction result like Percep- we also adopt the OvR strategy, but the di erence is that tron. The margin of an example (x,y) is given by y(ωT x) we integrate the CW algorithm with the TG algorithm when and if the margin value is positive, the model predicts cor- generating a single binary classifier. rectly on this simple. The absolute value of the margin is The CW algorithm itself has some disadvantages: used as the confidence, so a larger confidence value indi- (1) its updating strategy is very aggressive. When there is cates more accurate prediction results. noise in the streaming data, the CW algorithm will greatly The optimization task of the algorithm is to update the modify the parameters, resulting in a decrease in accuracy. weight distribution by minimizing the Kullback-Leibler di- So when used for multiclass classification, it will not ob- vergence between the new weight distribution and the old tain satisfactory performance. (2) the poor sparsity of model weight while ensuring that the probability of a correct pre- parameters will lead to the poor interpret-ability and per- diction for training instance is no smaller than the confi- formance of the model to some extent. To overcome these dence hyperparameter η. [25] disadvantages we have done two aspects of work. (1) a pa- In Multiclass Confidence-weighted learning, the coef- rameter C similar to PA algorithm and SCW learning is in- ficients are updated similarly with the binary cases. Mul- troduced to adjust the aggressiveness and passivity. But the ff ticlass CW optimization problem has a closed-form update di erence is that the parameter C can be fixed or dynami- rule: cally changed in our algorithm. The following methods can be used to set C: setting a fixed value before training; setting If αt > 0: a series of candidate values, dynamically selecting the pa- μ = μ + Σ Δ t+1 t αt t ψt (15) rameter according to the input data [16]. It should be noted T that in this paper we use the first method, i.e., the fixed C. Σt+1 =Σt − βtΣtΔψtΔψt Σt (16) (2) Streaming data makes feature selection difficult, Where so we introduce the TG algorithm to truncate coefficients smaller than the threshold θ to reduce the dimension of Δ = − ∧ ψt ψ(xt,yt) ψ(xt,y t) (17) streaming data. =  μ ffi yˆt ar max( i xt) (18) We assumes that the coe cient follows the Gaussian HU et al.: AN IMPROVED ONLINE MULTICLASS CLASSIFICATION ALGORITHM BASED ON CONFIDENCE-WEIGHTED 843

μ distribution with the mean vector i and the covariance ma- based on the cost of misclassification. A large value of C in- trix Σi same as CW. creases the penalty for misclassification, and a small value First, we need to use the OvR strategy to generate clas- of C decreases the penalty for misclassification. sifiers. Binary classifications are extended to multiclass The closed-form solution of MSCW optimization prob- classifications, and of course the weight vector ω1×i must lem can be expressed as: be expanded into a weight matrix ω × , where n is the to- n i μ = μ + α y Σ x tal number of categories and i is the number of attributes. t+1,rt t,rt t t t t (27) μ = μ − Σ Multiply ωn×i and x to get a column vector δn×1. Each el- t+1,st t,st αtyt t xt (28) δ ω × T ement of is the product of the row vector of matrix n i Σt+1 =Σt − βtΣt xt xtΣt (29) and x,sothe j-th element of the column vector is the real- value confidence score of the current sample classified into The update rule of the coefficient: the j-th class ( j is an integer between 1 and n). The above  = 1 − + operation makes each row of ωn×i a linear classifier labeled αt min C, max 0, mtψ δ 2vtψ j. Thus, We select the maximum value of the element in , and the row number of that element is used as the prediction 2 2 2 2 mt ψ − mt ψ + 2vtψφ (30) result [11]. Then we need to use the predicted label and the real label to calculate the loss function. α φ β = √ t (31) In multiclass cases, we set a threshold η to control the t 2ut + αtvtφ probability of the difference between the minimum score in all relevant classes and the maximum score in all irrelevant Where  classes. So the constrain is: 1 2 u = −α v φ + α 2v 2φ2 + 8v (32) t 8 t t t t t Prω ∼N μ Σ ω ∼N μ Σ [ω x ≥ ω x ] ≥ η (22) t,rt ( t,rt, t), t,st ( t,st, t) t,rt t t,st t = T Σ = μ − μ vt xt t xt, mt t,rt xt t,st xt (33) =  μ =  μ 2 Where, rt ar min t,r xt, st ar min t,s xt. φ φ = Φ−1(η),ψ= 1 + (34) Here we only use the optimization of multiclass SCW 2 (MSCW) [17] as an example to introduce our work for its better robustness. [22] Now that we have the update rules, then we will integrate the TG algorithm into the update steps. μ μ Σ = ( t+1,rt, t+1,st, t+1) The function of truncation can be expressed as fol-  N μ Σ N μ Σ + N μ Σ N μ Σ + lows: ar min DKL( ( r, ) ( t,rt, t) DKL( ( s, ) ( t,st, t) ⎧ ⎪ φ N μ μ Σ ⎪ max(0,vj − α) if v j ∈ [0,θ] C ( ( r, s, t); (xt,yt)) (23) ⎨⎪ T (v ,α,θ) =⎪ min(0,v + α) if v ∈ [−θ, 0] (35) 1 j ⎩⎪ j j

where the loss function is: v j othersize = − Δ φ(N(μ , μ , Σ ); (x ,y)) = f (ωi) T1(ωi γ 1L(ωi, zi),γgi,θ) (36) ⎧ r s t t t ⎫⎫ ⎪ ⎪⎪ ⎨⎪ −1 − 2φm + (1 + 2φm )2 − 8φ(m − φv )⎬⎪⎬⎪ The truncation is not performed every round, but every K ⎪ t t t t ⎪⎪ max⎩⎪0, ⎭⎪⎭⎪ online steps, i.e., if i/K is not an integer, gi = 0; if i/Kisan 4φvt integer, gi = Kg where g is a parameter greater than 0. The (24) gravity parameter g and the threshold parameter θ jointly where control the sparsity of the model. So every K round We bring the weight vector ωt+1 = Δ T Σ Δ = μ T Δ = Φ−1 vt ψt t ψt, mt t ψt,φ (η) (25) which has been updated by the rule above into the trunca- tion function as below: In the origin multiclass CW (MCW) algorithm, the con-

straint is: ωt+1 = TG(ωt+1) = ω − Δ ω φ N μ μ Σ = T( t+1 γ l( t+1;(xt+1,yt+1)),γg,θ) (37) s.t.  ( ( r, s, t); (xt,yt)) 0 (26) T ω + = ⎧1( t 1,α,θ) We can see that in the optimization above, this formula is ⎪ ⎪ max(0, ωt+1 − α) if ωt+1 ∈ [0,θ] multiplied by the parameter C and then added after the opti- ⎨ ⎪ min(0, ωt+1 + α) if ωt+1 ∈ [−θ, 0] (38) mization formula, which highly reflects that the soft-margin ⎩⎪ ωt+1 othersize classification can tolerate a small number of errors and the trade-off between keeping the previous information and up- Where   dating the current information while minimizing the objec- T l(ω + ;(x + ,y+ )) = max 0, 1 − y + ω + x (39) tive function. This can prevent fluctuation and effectively re- t 1 t 1 t 1 t 1 t 1 duces the effects of noise. Therefore, the parameter C in the γ is the learning rate, which controls the learning speed and objective function can also be regarded as a penalty param- the learning accuracy, usually between 0 and 1. eter. The value of C can be estimated before model training After truncating the gradient, the sparsity of the model IEICE TRANS. INF. & SYST., VOL.E104–D, NO.6 JUNE 2021 844

Table 1 MTGCW Online learning algorithm learning test dataset. Ten selected features are contained in this dataset: URL anchor, Request URL, SFH, URL length, Having ‘@’, Prefix suffix, IP, Sub domain, Web traffic and DNS record. It includes 1353 instances and the attribute characteristics are all integer. Image classification is another very popular applica- tion field of and also the most basic task in computer vision. To further investigate the performance of these models, we apply our approach and other state-of- the-art methods to this task to analyze performance. We se- lect 2 image datasets to serve this experiment, which are selected for their heterogeneity according to sizes and char- acteristics: (1) The MNIST database of handwritten digits (MNIST). The MNIST dataset is one of the most common datasets used for image classification and can be accessi- ble from many different sources. Complete dataset contains 60,000 training images and 10,000 testing images, all of which are 0-9 handwritten digits taken from American Cen- sus Bureau employees and American high school students. Here we only adopt the test set of MNIST which contains 10000 images for experiment. The size of MNIST test set is large enough to well evaluate the performance of each algo- has achieved the expected progress. The MTGCW algo- rithm in streaming big data. rithm is be shown in Table 1. (2) MAsked FAces (MAFA) [19] MAFA is a masked face detection benchmark dataset, of which images are col- 5. Experiments and Analysis lected from Internet images. Complete MAFA dataset con- tains 30,811 images and 35,806 masked faces. To make 5.1 Datasets and Compared Algorithms MafA more suitable for our experiment, we randomly se- lected 4079 images from the complete data set, and divided The main task of multiclass learning is to minimize cumu- them into four types according to whether to wear the mask lative errors to obtain higher accuracy. Therefore, in or- correctly: not wearing masks, wearing masks correctly, der to test the performance of our proposed MTGCW al- wearing the mask incorrectly, and wearing non-protective gorithm, we carefully selected 7 data sets from different masks. This translates MAFA into a dataset that can be used fields from the UC Irvine Machine Learning Repository and to train models to detect whether a face is wearing a mask. KEEL dataset repository. Iris, seeds, and car evaluation are MAFA is a very characteristic dataset, it contains quite all classic datasets, the former two are plant classification, a few strong occlusion samples, which will greatly affects and the last is simple hierarchical evaluation of cars. The the classification performance of models. Moreover, side wine dataset is the result of chemical analysis of wines from faces will degrade the performance as well, especially the the same region of Italy but from three different varieties purely left or right faces and even the very simple masks containing 13 ingredients. The glass dataset contains at- have strong impact to the face detectors. To sum up, all tributes about several glass types that can be used to drive of these characteristics shows that MAFA is a challenging criminological investigations. The thyroid dataset detects dataset even for the state-of-the-art face detectors. whether a given patient is normal or has hyperthyroidism or Because the resolutions of the images in the two hypothyroidism to tests the performance level of each algo- datasets are uneven, but the input of the model is a fixed- rithm in medical testing. The final robotnavigation dataset length vector, we anti-aliased each picture into a standard includes sensor data from the robot, and the algorithms need 28*28 gray-scale image, and then pulled it into 784 dimen- to analyze this data to determine the robot’s motion. Web- sional row vector. In addition, in order to speed up the con- site phishing is an important issue in e-commerce and other vergence of the models, the gray-scale value of each pixel industries, and it involves payment and transaction secu- of the image are normalized to the range of −1 to 1. Table 2 rity. A website can be divided into Legitimate, Suspicious shows the detailed statistics of the datasets we used. and Phishy according to the phishing detection [18]. There- We have compared our algorithm with other mul- fore, the problem of whether a website is a phishing web- ticlass online learning algorithms, including Max-score site involves multiclass classification task, and the input of Multiclass Perceptron (MPerceptron), state-of-the-art Mul- each attribute of the task can be streaming data. In order ticlass Online Gradient Descent learning (MOGD) and to test the effectiveness of our algorithm in this application Passive-Aggressive (MPA) learning, and our based algo- scenario, we used the Website Phishing Dataset made by rithms Multiclass CW (MCW) and SCW (MSCW). MPer- Neda Abdelhamid of Auckland Institute of Studies as online ceptron, MOGD and MPA are classic and popular first-order HU et al.: AN IMPROVED ONLINE MULTICLASS CLASSIFICATION ALGORITHM BASED ON CONFIDENCE-WEIGHTED 845

Table 2 List of datasets used in the experiments 5.2 Experimental Results and Analysis

Table 4 summarizes the results of our empirical evaluation of cumulative performance of the proposed MTGCW and other multiclass learning methods. The bold data in the table is the best performance of this indicator in this data set. From the table, It can be clearly seen that the proposed MTGCW algorithm achieve the best performance in terms of mistake rate in all datasets. We found that when the dataset is relatively small, like iris, wine, glass, thyroid and seeds, the accuracy of the MTGCW algorithm can far sur- pass other algorithms except MSCW, and slightly exceed Table 3 The process of selecting the parameter η of the MTGCW algo- MSCW. However, when the dataset is large like robotnav- rithm as an example igation, car evaluation and website phishing, the accuracy of the MTGCW algorithm can outperform all compared al- gorithms including MSCW. Moreover, we can see from the last row of Table 4 that MTGCW significantly outperform any other 6 models in online image classification tasks, its performance on mistake rate is up to 22.5% higher than the strongest baseline method in masked faces detection, which means that when handling the number of samples in the mask detection dataset reaches 10000, our approach will be up to about 900 or more cases more accurate than the state- of-the-art algorithms. These results validates how effective our approach is and we have reason to believe that when online learning methods and they are good comparison tar- the dataset is much larger, the performance of our model gets for their high efficiency. The last two algorithms, CW will be further improved. We also believe that if our algo- and SCW, as the most advanced second-order online learn- rithm is used to detect whether people going to public places ing algorithms, often represent the highest level of classifi- wear masks during the epidemic, it will definitely be able to cation tasks. For the details of each of the above algorithms, achieve superior performance and make a great contribution. please refer to the detailed descriptions in Sect. 2. Using Further, MTGCW also performed well on the number these five state-of-the-art algorithms to compare can better of updates, which can reach the best of all in some datasets. test the comprehensive performance of our method. To en- However, the online cumulative time of MTGCW is sure a fair comparison of different algorithms, we use a pa- sometimes more than some other algorithms. This is be- rameter validation scheme to determine the best parameters cause we have added the operation of truncating the gradi- for each algorithm before testing. This scheme randomly se- ent every K rounds to our proposed algorithm, which in- lects and arranges the data in the training set and searches for creases the time consumption to a certain extent. Besides, the best value within a given parameter range. The search we can see from Table 4, in the first few datasets, the stan- range of parameter C is {2−4, 2−3,...,23, 24} and parameter dard deviation of MTGCW’s mistake rate is slightly higher η is {0.55, 0.60,...,0.90, 0,95} [20]. than other algorithms. We believe that this is due to the Table 3 lists the process of selecting the parameter η of introduction of the TG algorithm, which results in the con- the MTGCW algorithm as an example. As we can see in vergence of MTGCW being slightly worse than other algo- the table, when η is 0.900, the mistake rate is the smallest rithms. However, this problem disappeared in the next few of all rounds, so the best value of η we selected has been large datasets in the experiment, because the convergence is bolded in the table, which is 0.900. After determining the somehow related to the size of the dataset, i.e., large data best parameters, 20 randomly arranged sequences are se- sets often lead to better convergence. So when MTGCW lected from the training set, then each algorithm uses these is applied to streaming big data, this disadvantage of sequences to train 20 times, and finally takes the average of MTGCW can be ignored. By examining the overall mis- these training results as the final result. Three indicators are takes, we also found that added “confidence” property al- used to measure the performance of the algorithm: online gorithms like MCW, MSCW and MTGCW can always out- cultivate mistake rate, number of updates and running time perform other classic algorithms such as MPerceptron, etc. cost. The online cultivate mistake rate is calculated by di- Moreover, Fig. 1 shows the online results of 6 algorithms on viding the number of misclassifications in the sequence by 10 datasets of different sizes and domains. And this verifies the total number, which is highly related to prediction accu- that the addition of the “confidence” property has a great racy, is our most concerned indicator. All the experiments promotion effect on model training, and the introduction of were implemented in Matlab and run on a regular PC. TG operation makes the performance even higher again. IEICE TRANS. INF. & SYST., VOL.E104–D, NO.6 JUNE 2021 846

Table 4 Evaluation of cumulative performance of the proposed MTGCW and other algorithms HU et al.: AN IMPROVED ONLINE MULTICLASS CLASSIFICATION ALGORITHM BASED ON CONFIDENCE-WEIGHTED 847

Fig. 1 Evaluation of cumulative performance of the proposed MTGCW and other algorithm IEICE TRANS. INF. & SYST., VOL.E104–D, NO.6 JUNE 2021 848

[11] Y. Ying and M. Pontil, “Online gradient descent learning algo- 6. Conclusion rithms,” Foundations of Computational Mathematics, vol.8, no.5, pp.561–596, 2008. [12] M. Zinkevich, “Online convex programming and generalized in- This paper investigates how to enhance the ability of fea- finitesimal gradient ascent,” Proc. 20th International Conference on ture selection of models in online multiclass learning, with Machine Learning (ICML-03), pp.928–936, 2003. the aim of simultaneously improving the interpretability and [13] S. Smale and Y. Yao, “Online learning algorithms,” Foundations of performance of models. At the same time, we pointed out computational mathematics, vol.6, no.2, pp.145–170, 2006. some shortcomings of the CW algorithm: (1) weak anti- [14] J. Langford, L. Li, and T. Zhang, “Sparse online learning via trun- ffi cated gradient,” Journal of Machine Learning Research, vol.10, noise capability (2) poor sparsity of the coe cient. In this pp.777–801, March 2009. regard, we proposed the MTGCW algorithm based on CW [15] J. Liu, G. Chi, and X. Luo, “Online Classification Algorithm Based which introduced the parameter C to enhance the robustness, on Confidence-weighted Learning,” Computer Engineering, vol.38, and the truncated gradient operation to reduce the dimen- no.9, pp.180–182, 2012. sion of streaming data. The experimental results show that [16] J. Hu, C. Yan, X. Liu, J. Zhang, D. Peng, and Y. Yang, “Truncated ff Gradient Confidence-Weighted Based Online Learning for Imbal- MTGCW is quite e ective in multiclass classification tasks ance Streaming Data,” 2019 IEEE International Conference on Mul- for streaming data and is a state-of-the-art online multiclass timedia and Expo (ICME), IEEE, pp.133–138, 2019. learning algorithm. Further, our algorithm achieved excit- [17] J. Wang, P. Zhao, and S.C.H. Hoi, “Soft confidence-weighted learn- ing experimental results in the application areas of phish- ing,” ACM Transactions on Intelligent Systems and Technology ing website recognition and image classification, especially (TIST), vol.8, no.1, pp.1–32, 2016. in the face mask detection. This means that compared [18] N. Abdelhamid, A. Ayesh, and F. Thabtah, “Phishing detection based associative classification ,” Expert Systems with with baseline algorithms, our algorithm has a significant Applications, vol.41, no.13, pp.5948–5959, 2014. improvement in the ability of analyzing unstructured data. [19] S. Ge, J. Li, Q. Ye, and Z. Luo, “Detecting Masked Faces in the Wild such as images, which has a great effect on the improvement with LLE-CNNs,” 2017 IEEE Conference on Computer Vision and of computer visual cognition, i.e, our algorithm is smarter Pattern Recognition, pp.426–434, 2017. 10.1109/CVPR.2017.53. than others in terms of vision. But we also point out that [20] S.C.H. Hoi, J. Wang, and P. Zhao, “Libol: A library for online learning algorithms,” Journal of Machine Learning Research, vol.15, although we use the truncated gradient algorithm to reduce no.1, 2014. principal component features, this also leads to an increase [21] A. Gepperth and B. Hammer, “ algorithms and in time cost. Future work can be extended to how to re- applications,” 2016 European Symposium on Artificial Neural Net- duce the computation time of MTGCW or how to tackle the works (ESANN), Bruges, Belgium, 2016. problem of improving the feature selection ability of online [22] GB/T 7714 P. Yang, P. Zhao, J. Zhou, and X. Gao, “Confidence learning algorithms. weighted multitask learning,” Proc. AAAI Conference on Artificial Intelligence, vol.33, pp.5636–5643, 2019. [23] N.M. Jagirdar, “Online Machine Learning Algorithms Review References and Comparison in Healthcare,” Master’s Thesis, University of Tennessee, 2018. [1] M. Aly, “Survey on multiclass classification methods,” Neural [24] M. Tan, Y. Yan, L. Wang, et al., “Learning sparse confidence- Netw., vol.19, pp.1–9, 2005. weighted classifier on very high dimensional data,” Thirtieth AAAI [2] K. Crammer and Y. Singer, “Ultraconservative online algorithms for Conference on Artificial Intelligence, 2016. multiclass problems,” Journal of Machine Learning Research, vol.3, [25] GB/T 7714 Y. Liu, Y. Yan, L. Chen, Y. Han, and Y. Yang, “Adap- pp.951–991, 2003. tive Sparse Confidence-Weighted Learning for Online Feature Se- [3] K. Crammer, M. Dredze, and A. Kulesza, “Multi-class confidence lection,” Proc. AAAI Conference on Artificial Intelligence, vol.33, weighted algorithms,” Proc. 2009 Conference on Empirical Methods pp.4408–4415, 2019. in Natural Language Processing, vol.2, Association for Computing Machinery, pp.496–504, 2009. [4] M. Dredze, K. Crammer, and F. Pereira, “Confidence-weighted lin- ear classification,” Proc. 25th international conference on Machine learning, pp.264–271, 2008. [5] S. Shalev-Shwartz, “Online learning and online convex optimiza- tion,” Foundations and TrendsR in Machine Learning, vol.4, no.2, pp.107–194, 2012. [6] D.M.J. Tax and R.P.W. Duin, “Using two-class classifiers for mul- ticlass classification,” Object recognition supported by user interac- tion for service robots, IEEE, vol.2, pp.124–127, 2002. [7] I. Stephen, “Perceptron-based learning algorithms,” IEEE Trans. Neural Netw., vol.50, no.2, p.179, 1990. [8] Y. Freund and R.E. Schapire, “Large margin classification using the perceptron algorithm,” Machine Learning, vol.37, no.3, pp.277–296, 1999. [9] J. Wang, P. Zhao, and S.C.H. Hoi, “Exact soft confidence-weighted learning,” arXiv preprint arXiv:1206.4612, 2012. [10] K. Crammer, O. Dekel, J. Keshet, et al., “Online passive-aggressive algorithms,” Journal of Machine Learning Research, vol.7, pp.551– 585, March 2006. HU et al.: AN IMPROVED ONLINE MULTICLASS CLASSIFICATION ALGORITHM BASED ON CONFIDENCE-WEIGHTED 849

Dongliang Peng is a Professor, PhD super- Appendix: visor. He graduated from Zhejiang University with a doctorate degree in engineering major- ing in control science and engineering in 2003, Table A· 1 Symbol description visited and studied at George Mason University from 2012 to 2013. He engaged in the research of multi-sensor information fusion technology, detection and estimation.

Chengwei Ren is an undergraduate ma- joring in Intelligence Science and Technology at the School of Automation, Hangzhou Dianzi University. His research interests are class im- balance and online classification problem in ma- chine learning and feature extraction problem in .

Shengying Yang is a PhD, lecturer. He graduated with a PhD in Electronic Science and Technology from Hangzhou Dianzi University in 2020. He mainly engaged in research on Ji Hu is a lecturer, PhD candidate. He big data analysis and power electronic device received the M.S. degree in Electronics Sci- design. ence and Technology Major from Hangzhou Dianzi University in 2004. He visited McMaster University in Canada as a visiting scholar in 2016 and he is currently pursuing the Ph.D. degree in intelligent information process- ing at Hangzhou Dianzi University. His research interest includes Artificial Intelligence Tools and Application, Automatic Control, Computer Vision and Speech Understanding, Intelligent System Architecture, Deep Neural Networks and Image processing.

Chenggang Yan is a Professor, PhD super- visor. He received the PhD degree in University of Science and Completed postdoctoral research at Tsinghua University. His main research direc- tions include machine learning, pattern recogni- tion, image processing, computer vision, com- puter graphics, medical image processing, bio- logical information processing.

Jiyong Zhang is a Professor, PhD super- visor. He graduated from the Department of Tsinghua University with a bach- elor’s degree in 1999, graduated from Tsinghua University with a master’s degree in 2001, and graduated from the Swiss Federal Institute of Technology Lausanne (EPFL) with a PhD de- gree in 2008. He has long been committed to technical research in the fields of artificial intel- ligence, big data, cloud computing, and intelli- gent human-computer interaction.