The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Semi-Supervised Learning for Maximizing the Partial AUC

Tomoharu Iwata, Akinori Fujino, Naonori Ueda
NTT Communication Science Laboratories, Kyoto, Japan
{tomoharu.iwata.gy, akinori.fujino.yh, naonori.ueda.fr}@hco.ntt.co.jp

Abstract

The partial area under a receiver operating characteristic curve (pAUC) is a performance measurement for binary classification problems that summarizes the true positive rate within a specific range of the false positive rate. Obtaining classifiers that achieve a high pAUC is important in a wide variety of applications, such as cancer screening and spam filtering. Although many methods have been proposed for maximizing the pAUC, existing methods require many labeled data for training. In this paper, we propose a semi-supervised learning method for maximizing the pAUC, which trains a classifier with a small amount of labeled data and a large amount of unlabeled data. To exploit the unlabeled data, we derive two approximations of the pAUC: the first is calculated from positive and unlabeled data, and the second is calculated from negative and unlabeled data. A classifier is trained by maximizing the weighted sum of the two approximations of the pAUC and the pAUC that is calculated from positive and negative data. With experiments using various datasets, we demonstrate that the proposed method achieves higher test pAUCs than existing methods.

1 Introduction

The area under a receiver operating characteristic (ROC) curve (AUC) is widely used for performance measurement with binary classification problems (Bradley 1997; Huang and Ling 2005). The ROC curve plots the true positive rate (TPR) as a function of the false positive rate (FPR). By maximizing the AUC, we can obtain a classifier that achieves high average TPRs over all FPR values from zero to one (Brefeld and Scheffer 2005; Wang and Tang 2009; Ding et al. 2015; Ying, Wen, and Lyu 2016; Han and Zhao 2010; Zhou, Lai, and Yen 2009; Zhao et al. 2011).

In many applications, we would like to achieve a high TPR within a specific FPR range. For example, in cancer screening applications, maintaining a low FPR is important if we are to eliminate unnecessary and costly biopsies (Baker and Pinsky 2001). In a spam detection system, we can accept only a low FPR if we are to prevent legitimate emails from being identified as spam. In such applications, a partial AUC (pAUC) is more appropriate than an AUC. The pAUC is the partial area under the ROC curve within a specific FPR range, as shown in Figure 1(a). Many methods for maximizing the pAUC have been proposed (Komori and Eguchi 2010; Ricamato and Tortorella 2011; Narasimhan and Agarwal 2013a; 2013b; Ueda and Fujino 2018; Wang and Chang 2010). However, existing pAUC maximization methods require many labeled data for training, which are expensive to prepare.

In this paper, we propose a semi-supervised learning method for maximizing the pAUC to achieve a high pAUC with a small amount of labeled data and a large amount of unlabeled data. Unlabeled data are usually easier to prepare than labeled data. To exploit unlabeled data, we define the unlabeled positive rate (UPR), which is the probability that the decision function score of unlabeled data is higher than a threshold. We then derive two approximations of the pAUC from the partial area under the curves of the UPR and the TPR or FPR: the first is calculated from positive and unlabeled data, and the second is calculated from negative and unlabeled data. A classifier is trained by maximizing the weighted sum of the two approximations of the pAUC and the pAUC that is calculated from positive and negative data. For classifiers, the proposed method can use any differentiable functions, such as logistic regression and neural networks. Although several semi-supervised methods for AUC maximization have been proposed (Fujino and Ueda 2016; Sakai, Niu, and Sugiyama 2018; Kiryo et al. 2017), they are inapplicable to pAUC maximization. The main contributions of this paper are as follows:

1. We derive two approximations of the pAUC that are calculated using unlabeled data (Section 4).

2. We propose a semi-supervised learning method for maximizing the pAUC based on the approximations of the pAUC (Section 5). To our knowledge, our work is the first attempt at semi-supervised pAUC maximization.

3. We empirically demonstrate that the proposed method performs better than existing supervised and semi-supervised methods using various datasets for anomaly detection (Section 6).

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

2 Preliminaries

Let $x \in \mathbb{R}^D$ be a $D$-dimensional feature vector, and $y \in \{\pm 1\}$ be a binary class label. Decision function $s: \mathbb{R}^D \to \mathbb{R}$ outputs a score for classification, where the score is used to estimate the class label by $\hat{y} = \mathrm{sign}(s(x) - h)$, where $h$ is a threshold and $\mathrm{sign}(\cdot)$ is the sign function; $\mathrm{sign}(A) = +1$ if $A \geq 0$ and $\mathrm{sign}(A) = -1$ otherwise. The true positive rate of threshold $h$ is defined by the probability that the score of positive data is higher than threshold $h$, $\mathrm{TPR}(h) = \int_{-\infty}^{\infty} f_{\mathrm{P}}(s) I(s > h)\, ds$, where $f_{\mathrm{P}}(s)$ is the score distribution of positive data, and $I(\cdot)$ is the indicator function; $I(A) = 1$ if $A$ is true and $I(A) = 0$ otherwise. Similarly, the false positive rate of threshold $h$ is defined by the probability that the score of negative data is higher than threshold $h$, $\mathrm{FPR}(h) = \int_{-\infty}^{\infty} f_{\mathrm{N}}(s) I(s > h)\, ds$, where $f_{\mathrm{N}}(s)$ is the score distribution of negative data.

Figure 1: (a) pAUC(α, β): Partial area under the curve of the true positive rate against the false positive rate between α and β. (b) pAUC(α, β) is calculated by the probability that the score of a positive sample is higher than that of a negative sample that lies between α and β when sorted by its score.

The AUC is the area under the curve of $\mathrm{TPR}(h)$ against $\mathrm{FPR}(h)$ with varying threshold $h$, and it is calculated by

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(u))\, du = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{\mathrm{P}}(s) f_{\mathrm{N}}(s') I(s > s')\, ds\, ds', \quad (1)$$

where $\mathrm{FPR}^{-1}(u) = \inf\{h \in \mathbb{R} \mid \mathrm{FPR}(h) \leq u\}$ (Cortes and Mohri 2004). The AUC is the probability that scores sampled from the positive distribution are higher than those from the negative distribution.

The partial AUC (pAUC) between $\alpha$ and $\beta$, where $0 \leq \alpha < \beta \leq 1$, is the normalized partial area under the curve of TPR against FPR where the FPR is between $\alpha$ and $\beta$ as follows,

$$\mathrm{pAUC}(\alpha, \beta) = \frac{1}{\beta - \alpha} \int_{\alpha}^{\beta} \mathrm{TPR}(\mathrm{FPR}^{-1}(u))\, du = \frac{1}{\beta - \alpha} \int_{\mathrm{FPR}^{-1}(\beta)}^{\mathrm{FPR}^{-1}(\alpha)} \int_{-\infty}^{\infty} f_{\mathrm{P}}(s) f_{\mathrm{N}}(s') I(s > s')\, ds\, ds' = \mathbb{E}[I(s(x^{\mathrm{P}}) > s(x^{\mathrm{N}}))], \quad (2)$$

where $\mathbb{E}[\cdot]$ represents the expectation, $x^{\mathrm{P}}$ is a sample from the positive data distribution, and $x^{\mathrm{N}}$ is a sample from the negative data distribution with the FPR between $\alpha$ and $\beta$. Figure 1(b) illustrates how the pAUC is calculated using positive and negative data.

Given a set of positive samples $\mathcal{P} = \{x^{\mathrm{P}}_m\}_{m=1}^{M_{\mathrm{P}}}$ and a set of negative samples $\mathcal{N} = \{x^{\mathrm{N}}_m\}_{m=1}^{M_{\mathrm{N}}}$, an empirical pAUC is calculated by

$$\widehat{\mathrm{pAUC}}(\alpha, \beta) = \frac{1}{(\beta - \alpha) M_{\mathrm{P}} M_{\mathrm{N}}} \sum_{x^{\mathrm{P}}_m \in \mathcal{P}} \Big[ (j_\alpha - \alpha M_{\mathrm{N}})\, I(s(x^{\mathrm{P}}_m) > s(x^{\mathrm{N}}_{(j_\alpha)})) + \sum_{j=j_\alpha+1}^{j_\beta} I(s(x^{\mathrm{P}}_m) > s(x^{\mathrm{N}}_{(j)})) + (\beta M_{\mathrm{N}} - j_\beta)\, I(s(x^{\mathrm{P}}_m) > s(x^{\mathrm{N}}_{(j_\beta+1)})) \Big], \quad (3)$$

where $j_\alpha = \lceil \alpha M_{\mathrm{N}} \rceil$, $j_\beta = \lfloor \beta M_{\mathrm{N}} \rfloor$, and $x^{\mathrm{N}}_{(j)}$ denotes the negative sample ranked in the $j$th position among the negatives $\mathcal{N}$ in a descending order of scores $s(x)$ (Dodd and Pepe 2003; Narasimhan and Agarwal 2013a). The empirical pAUC is the empirical probability that positive samples have higher scores than negative samples that are ranked between $j_\alpha$ and $j_\beta$. The computational complexity of calculating Eq.(3) is $O(M_{\mathrm{N}} \log M_{\mathrm{N}} + (\beta - \alpha) M_{\mathrm{P}} M_{\mathrm{N}})$, where the first term is for sorting the $M_{\mathrm{N}}$ negative samples, and the second term is for comparing the $M_{\mathrm{P}}$ positive samples with the $(\beta - \alpha) M_{\mathrm{N}}$ negative samples in the FPR range.

3 Problem formulation

Assume that we are given a set of positive samples $\mathcal{P} = \{x^{\mathrm{P}}_m\}_{m=1}^{M_{\mathrm{P}}}$, a set of negative samples $\mathcal{N} = \{x^{\mathrm{N}}_m\}_{m=1}^{M_{\mathrm{N}}}$, and a set of unlabeled samples $\mathcal{U} = \{x^{\mathrm{U}}_m\}_{m=1}^{M_{\mathrm{U}}}$. We would like to obtain a decision function that has a high pAUC with given $\alpha$ and $\beta$ for unseen samples.

We assume that the positive ratio of the unlabeled samples, $\theta_{\mathrm{P}}$, is known. When labeled and unlabeled data are generated from the same distribution $p(x, y)$, $\theta_{\mathrm{P}}$ can be easily estimated by using the empirical positive probability of the labeled data. When labeled and unlabeled data are generated from different distributions, $\theta_{\mathrm{P}}$ can be estimated by using methods described in (Saerens, Latinne, and Decaestecker 2002; Iyer, Nath, and Sarawagi 2014; Du Plessis and Sugiyama 2014).

4 Partial AUC with unlabeled data

In this section, we derive two approximations of the pAUC that are calculated using unlabeled data.
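For reference, the empirical pAUC of Eq.(3) is straightforward to compute from sorted scores. The following NumPy sketch (function and variable names are ours, not from the paper) implements Eq.(3), including the fractional endpoint terms:

```python
import numpy as np

def empirical_pauc(pos_scores, neg_scores, alpha, beta):
    """Empirical pAUC(alpha, beta) of Eq.(3): each positive score is
    compared against the negatives ranked (descending) between the
    alpha and beta quantiles, with fractional endpoint weights."""
    m_p, m_n = len(pos_scores), len(neg_scores)
    neg_sorted = np.sort(neg_scores)[::-1]                # descending order
    j_a, j_b = int(np.ceil(alpha * m_n)), int(np.floor(beta * m_n))
    total = 0.0
    for s in pos_scores:
        if j_a >= 1:                                      # weight j_a - alpha*M_N
            total += (j_a - alpha * m_n) * (s > neg_sorted[j_a - 1])
        total += np.sum(s > neg_sorted[j_a:j_b])          # ranks j_a+1 .. j_b
        if j_b < m_n:                                     # weight beta*M_N - j_b
            total += (beta * m_n - j_b) * (s > neg_sorted[j_b])
    return total / ((beta - alpha) * m_p * m_n)
```

With $\alpha = 0$ and $\beta = 1$ this reduces to the empirical AUC; the cost matches the $O(M_{\mathrm{N}} \log M_{\mathrm{N}} + (\beta - \alpha) M_{\mathrm{P}} M_{\mathrm{N}})$ complexity stated above.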
Intuitively speaking, we sort unlabeled data by their scores, consider unlabeled data within specific ranges as positive or negative, and approximate the pAUC using the estimated positive or negative data. Figure 2 shows an overview of the calculation of the two approximated pAUCs that use positive, negative and unlabeled data, which is explained in detail in this section as follows: We first define the unlabeled positive rate (UPR) and the classifier with the threshold using the UPR in Section 4.1. Then, in Section 4.2, we derive an approximated pAUC that is calculated using positive and unlabeled data by considering the partial area under the curve of the TPR against the UPR. Similarly, in Section 4.3, we derive another approximated pAUC that is calculated using negative and unlabeled data by considering the partial area under the curve of the UPR against the FPR.

Figure 2: (a) The approximated pAUC with positive and unlabeled data is calculated by the probability that the score of a positive sample is higher than that of an unlabeled sample between θP + αθN and θP + βθN when sorted by its score, where θP is the positive probability of unlabeled data and θN = 1 − θP is the negative probability of unlabeled data. (b) The approximated pAUC with negative and unlabeled data is calculated by the probability that the score of an unlabeled sample between 0 and θP is higher than that of a negative sample between α and β.

Figure 3: Top: Unlabeled score distribution fU(s). Middle: By splitting the unlabeled score distribution at threshold h = UPR−1(θP), we obtain the estimated positive score distribution f̂P(s) (red) and the estimated negative score distribution f̂N(s) (blue). Bottom: Unlabeled positive rate.

4.1 Unlabeled positive rate

We newly define the unlabeled positive rate of threshold $h$, $\mathrm{UPR}(h)$, as the probability that the score of unlabeled data is higher than threshold $h$, in a similar way to the TPR and FPR, as follows,

$$\mathrm{UPR}(h) = \int_{-\infty}^{\infty} f_{\mathrm{U}}(s) I(s > h)\, ds, \quad (4)$$

where $f_{\mathrm{U}}(s) = \theta_{\mathrm{P}} f_{\mathrm{P}}(s) + \theta_{\mathrm{N}} f_{\mathrm{N}}(s)$ is the score distribution of unlabeled data, $\theta_{\mathrm{P}}$ is the positive probability of unlabeled data, $\theta_{\mathrm{N}} = 1 - \theta_{\mathrm{P}}$ is the negative probability of unlabeled data, and $0 \leq \theta_{\mathrm{P}}, \theta_{\mathrm{N}} \leq 1$.

Given score $s$, we estimate the class label of unlabeled data with threshold $h = \mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ as follows,

$$\hat{y}(s) = \begin{cases} +1 & \text{if } s > \mathrm{UPR}^{-1}(\theta_{\mathrm{P}}), \\ -1 & \text{otherwise}, \end{cases} \quad (5)$$

where $\mathrm{UPR}^{-1}(\theta) = \inf\{h \in \mathbb{R} \mid \mathrm{UPR}(h) \leq \theta\}$, and $\mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ is the threshold when the UPR is $\theta_{\mathrm{P}}$. With this threshold, the probability that unlabeled data are classified as positive becomes $\theta_{\mathrm{P}}$ from the definition of the UPR in Eq.(4) as follows, $\int_{-\infty}^{\infty} f_{\mathrm{U}}(s) I(\hat{y}(s) = 1)\, ds = \int_{-\infty}^{\infty} f_{\mathrm{U}}(s) I(s > \mathrm{UPR}^{-1}(\theta_{\mathrm{P}}))\, ds = \theta_{\mathrm{P}}$. When the classifier in Eq.(5) is assumed, unlabeled data with scores higher than the threshold, $s > \mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$, are classified as positive, and unlabeled data with scores lower than the threshold, $s \leq \mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$, are classified as negative. Therefore, with Eq.(5), we obtain estimates of the positive and negative score distributions, $\hat{f}_{\mathrm{P}}(s)$ and $\hat{f}_{\mathrm{N}}(s)$, by the split of the score distribution of the unlabeled data $f_{\mathrm{U}}(s)$ at $s = \mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ as follows,

$$f_{\mathrm{U}}(s) \approx \begin{cases} \theta_{\mathrm{P}} \hat{f}_{\mathrm{P}}(s) & \text{if } s > \mathrm{UPR}^{-1}(\theta_{\mathrm{P}}), \\ \theta_{\mathrm{N}} \hat{f}_{\mathrm{N}}(s) & \text{otherwise}, \end{cases} \quad (6)$$

which is illustrated in the top and middle of Figure 3.

Since the area of the unlabeled score distribution $f_{\mathrm{U}}(s)$ between $\mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ and $\mathrm{UPR}^{-1}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}})$ is $\alpha\theta_{\mathrm{N}}$, as in the middle of Figure 3, the area of the estimated negative score distribution $\hat{f}_{\mathrm{N}}(s) \approx f_{\mathrm{U}}(s)/\theta_{\mathrm{N}}$ between $\mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ and $\mathrm{UPR}^{-1}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}})$ is $\alpha$. In addition, there are no estimated negative data, $\hat{f}_{\mathrm{N}}(s) = 0$, if $s > \mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ from Eq.(6). Therefore, the score when the UPR is $\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}$ is the same as the score when the estimated FPR is $\alpha$ as follows,

$$\mathrm{UPR}^{-1}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}) = \widehat{\mathrm{FPR}}^{-1}(\alpha), \quad (7)$$

where $\widehat{\mathrm{FPR}}$ is the false positive rate with the estimated negative score distribution $\hat{f}_{\mathrm{N}}(s)$. Similarly, the score when the UPR is $\theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}$ is the same as the score when the estimated FPR is $\beta$ as follows,

$$\mathrm{UPR}^{-1}(\theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}) = \widehat{\mathrm{FPR}}^{-1}(\beta). \quad (8)$$

The relationship between the estimated negative score distribution and the UPR is illustrated in the middle and bottom of Figure 3. With Eqs.(7) and (8), the unlabeled data between $\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}$ and $\theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}$ are assumed to be negative data between $\alpha$ and $\beta$.
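Empirically, the threshold $\mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ of Eq.(5) is simply the $(1 - \theta_{\mathrm{P}})$ quantile of the unlabeled scores, and the split of Eq.(6) follows from it. A minimal sketch (function names are ours):

```python
import numpy as np

def upr_threshold(unl_scores, theta_p):
    """Empirical UPR^{-1}(theta_p): the score above which a fraction
    theta_p of the unlabeled scores lies (Eq.(5))."""
    return np.quantile(unl_scores, 1.0 - theta_p)

def split_unlabeled(unl_scores, theta_p):
    """Split unlabeled scores into estimated positives and negatives,
    mirroring the split of f_U(s) at UPR^{-1}(theta_p) in Eq.(6)."""
    h = upr_threshold(unl_scores, theta_p)
    return unl_scores[unl_scores > h], unl_scores[unl_scores <= h]
```

A fraction $\theta_{\mathrm{P}}$ of the unlabeled samples (up to interpolation at the boundary) lands in the estimated-positive set, matching the construction above.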

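Combining the split above with rank-range comparisons yields the positive-unlabeled pAUC approximation developed in Sections 4.2 and 5. The following simplified sketch drops the fractional endpoint terms of the exact empirical estimator, Eq.(13) (names are ours; an illustrative approximation, not the paper's full estimator):

```python
import numpy as np

def pauc_pu(pos_scores, unl_scores, alpha, beta, theta_p):
    """Approximate pAUC(alpha, beta) from positive and unlabeled data:
    positives are compared against unlabeled samples ranked between the
    theta_p + alpha*theta_n and theta_p + beta*theta_n ratios of the
    descending unlabeled scores (fractional endpoints omitted)."""
    theta_n = 1.0 - theta_p
    a_bar, b_bar = theta_p + alpha * theta_n, theta_p + beta * theta_n
    m_u = len(unl_scores)
    unl_sorted = np.sort(unl_scores)[::-1]                 # descending order
    k_a, k_b = int(np.ceil(a_bar * m_u)), int(np.floor(b_bar * m_u))
    block = unl_sorted[k_a:k_b]        # unlabeled treated as in-range negatives
    if block.size == 0:
        return 0.0
    return float(np.mean(pos_scores[:, None] > block[None, :]))
```

Because the top $\theta_{\mathrm{P}}$ fraction of unlabeled samples is excluded, only the part of the unlabeled data treated as negative enters the comparison.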
4.2 Partial AUC with positive and unlabeled data

We define $\mathrm{pAUC}_{\mathrm{PU}}(\alpha', \beta')$, which is the normalized partial area under the curve of $\mathrm{TPR}(h)$ against $\mathrm{UPR}(h)$ with varying threshold $h$ where $\mathrm{UPR}(h)$ is between $\alpha'$ and $\beta'$, as follows,

$$\mathrm{pAUC}_{\mathrm{PU}}(\alpha', \beta') = \frac{1}{\beta' - \alpha'} \int_{\mathrm{UPR}^{-1}(\beta')}^{\mathrm{UPR}^{-1}(\alpha')} \int_{-\infty}^{\infty} f_{\mathrm{P}}(s) f_{\mathrm{U}}(s') I(s > s')\, ds\, ds', \quad (9)$$

which is calculated with the positive and unlabeled distributions. Then, the partial area between $\alpha' = \theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}$ and $\beta' = \theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}$ becomes an approximation of the pAUC as follows,

$$\mathrm{pAUC}_{\mathrm{PU}}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}, \theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}) = \frac{1}{(\beta - \alpha)\theta_{\mathrm{N}}} \int_{\mathrm{UPR}^{-1}(\theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}})}^{\mathrm{UPR}^{-1}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}})} \int_{-\infty}^{\infty} f_{\mathrm{P}}(s) f_{\mathrm{U}}(s') I(s > s')\, ds\, ds' \approx \frac{1}{(\beta - \alpha)\theta_{\mathrm{N}}} \int_{\widehat{\mathrm{FPR}}^{-1}(\beta)}^{\widehat{\mathrm{FPR}}^{-1}(\alpha)} \int_{-\infty}^{\infty} f_{\mathrm{P}}(s)\, \theta_{\mathrm{N}} \hat{f}_{\mathrm{N}}(s') I(s > s')\, ds\, ds' \approx \mathrm{pAUC}(\alpha, \beta), \quad (10)$$

where we used Eqs.(7), (8) and $f_{\mathrm{U}}(s) \approx \theta_{\mathrm{N}} \hat{f}_{\mathrm{N}}(s)$ if $s \leq \mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ in Eq.(6). An example of $\mathrm{pAUC}_{\mathrm{PU}}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}, \theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}})$ is shown in Figure 4(a). $\mathrm{pAUC}_{\mathrm{PU}}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}, \theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}})$ is the probability that scores sampled from positive data are higher than those of unlabeled data ranked between the $\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}$ and $\theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}$ ratios in a descending order of scores.

Figure 4: (a) pAUCPU(θP + αθN, θP + βθN): Partial area under the curve of the true positive rate against the unlabeled positive rate between θP + αθN and θP + βθN. (b) pAUCNU((0, θP), (α, β)): Partial area under the curve of the unlabeled positive rate against the false positive rate, between α and β of the false positive rate and between 0 and θP of the unlabeled positive rate.

4.3 Partial AUC with negative and unlabeled data

We define $\mathrm{pAUC}_{\mathrm{NU}}((\gamma, \eta), (\alpha', \beta'))$, which is the normalized partial area under the curve of $\mathrm{UPR}(h)$ against $\mathrm{FPR}(h)$ for different values of $h$ where $\mathrm{FPR}(h)$ is between $\alpha'$ and $\beta'$ and $\mathrm{UPR}(h)$ is between $\gamma$ and $\eta$, as follows,

$$\mathrm{pAUC}_{\mathrm{NU}}((\gamma, \eta), (\alpha', \beta')) = \frac{1}{(\beta' - \alpha')(\eta - \gamma)} \int_{\mathrm{FPR}^{-1}(\beta')}^{\mathrm{FPR}^{-1}(\alpha')} \int_{\mathrm{UPR}^{-1}(\eta)}^{\mathrm{UPR}^{-1}(\gamma)} f_{\mathrm{U}}(s) f_{\mathrm{N}}(s') I(s > s')\, ds\, ds', \quad (11)$$

which is calculated with the negative and unlabeled distributions. Then, the partial area between $(\gamma, \eta) = (0, \theta_{\mathrm{P}})$ and $(\alpha', \beta') = (\alpha, \beta)$ becomes an approximation of the pAUC as follows,

$$\mathrm{pAUC}_{\mathrm{NU}}((0, \theta_{\mathrm{P}}), (\alpha, \beta)) = \frac{1}{\theta_{\mathrm{P}}(\beta - \alpha)} \int_{\mathrm{FPR}^{-1}(\beta)}^{\mathrm{FPR}^{-1}(\alpha)} \int_{\mathrm{UPR}^{-1}(\theta_{\mathrm{P}})}^{\infty} f_{\mathrm{U}}(s) f_{\mathrm{N}}(s') I(s > s')\, ds\, ds' \approx \frac{1}{\theta_{\mathrm{P}}(\beta - \alpha)} \int_{\mathrm{FPR}^{-1}(\beta)}^{\mathrm{FPR}^{-1}(\alpha)} \int_{\mathrm{UPR}^{-1}(\theta_{\mathrm{P}})}^{\infty} \theta_{\mathrm{P}} \hat{f}_{\mathrm{P}}(s) f_{\mathrm{N}}(s') I(s > s')\, ds\, ds' \approx \mathrm{pAUC}(\alpha, \beta), \quad (12)$$

where we used $f_{\mathrm{U}}(s) = \theta_{\mathrm{P}} \hat{f}_{\mathrm{P}}(s)$ if $s > \mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$ in Eq.(6), and the start of the integration interval of variable $s$, $\mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$, can be $-\infty$ since $\hat{f}_{\mathrm{P}}(s) = 0$ if $s \leq \mathrm{UPR}^{-1}(\theta_{\mathrm{P}})$. An example of $\mathrm{pAUC}_{\mathrm{NU}}((0, \theta_{\mathrm{P}}), (\alpha, \beta))$ is shown in Figure 4(b). $\mathrm{pAUC}_{\mathrm{NU}}((0, \theta_{\mathrm{P}}), (\alpha, \beta))$ is the probability that scores of unlabeled data ranked between the 0 and $\theta_{\mathrm{P}}$ ratios are higher than those of negative data ranked between the $\alpha$ and $\beta$ ratios.

The pAUC can also be approximated using only unlabeled data by assuming that unlabeled samples between 0 and $\theta_{\mathrm{P}}$ are positive and those between $\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}$ and $\theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}$ are negative when sorted by their scores. However, this approximated pAUC is not useful for training a classifier, since both sets of labels are estimated by the very classifier to be trained. On the other hand, the approximated pAUCs with positive/negative and unlabeled data used in the proposed method are based on true positive/negative data.

5 Partial AUC maximization using unlabeled data

The empirical estimate of the partial area under the curve of the TPR against the UPR in Eq.(10) is given by

$$\widehat{\mathrm{pAUC}}_{\mathrm{PU}}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}, \theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}) = \frac{1}{(\beta - \alpha)\theta_{\mathrm{N}} M_{\mathrm{P}} M_{\mathrm{U}}} \sum_{x^{\mathrm{P}}_m \in \mathcal{P}} \Big[ (k_{\bar{\alpha}} - \bar{\alpha} M_{\mathrm{U}})\, I(s(x^{\mathrm{P}}_m) > s(x^{\mathrm{U}}_{(k_{\bar{\alpha}})})) + \sum_{k=k_{\bar{\alpha}}+1}^{k_{\bar{\beta}}} I(s(x^{\mathrm{P}}_m) > s(x^{\mathrm{U}}_{(k)})) + (\bar{\beta} M_{\mathrm{U}} - k_{\bar{\beta}})\, I(s(x^{\mathrm{P}}_m) > s(x^{\mathrm{U}}_{(k_{\bar{\beta}}+1)})) \Big], \quad (13)$$

where $\bar{\alpha} = \theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}$, $\bar{\beta} = \theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}$, $k_{\bar{\alpha}} = \lceil \bar{\alpha} M_{\mathrm{U}} \rceil$, $k_{\bar{\beta}} = \lfloor \bar{\beta} M_{\mathrm{U}} \rfloor$, and $x^{\mathrm{U}}_{(k)}$ denotes the unlabeled sample ranked in the $k$th position among the unlabeled samples $\mathcal{U}$ in a descending order of scores. $\widehat{\mathrm{pAUC}}_{\mathrm{PU}}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}, \theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}})$ is calculated from positive and unlabeled data. The computational complexity of calculating Eq.(13) is $O(M_{\mathrm{U}} \log M_{\mathrm{U}} + (\beta - \alpha)\theta_{\mathrm{N}} M_{\mathrm{P}} M_{\mathrm{U}})$, where the first term is for sorting the $M_{\mathrm{U}}$ unlabeled samples, and the second term is for comparing the $M_{\mathrm{P}}$ positive samples with the $(\beta - \alpha)\theta_{\mathrm{N}} M_{\mathrm{U}}$ unlabeled samples.

The empirical estimate of the partial area under the curve of the UPR against the FPR in Eq.(12) is given by

$$\widehat{\mathrm{pAUC}}_{\mathrm{NU}}((0, \theta_{\mathrm{P}}), (\alpha, \beta)) = \frac{1}{(\beta - \alpha)\theta_{\mathrm{P}} M_{\mathrm{U}} M_{\mathrm{N}}} \Big[ \sum_{k=1}^{k_{\theta_{\mathrm{P}}}} (j_\alpha - \alpha M_{\mathrm{N}})\, I(s(x^{\mathrm{U}}_{(k)}) > s(x^{\mathrm{N}}_{(j_\alpha)})) + \sum_{k=1}^{k_{\theta_{\mathrm{P}}}} \sum_{j=j_\alpha+1}^{j_\beta} I(s(x^{\mathrm{U}}_{(k)}) > s(x^{\mathrm{N}}_{(j)})) + \sum_{k=1}^{k_{\theta_{\mathrm{P}}}} (\beta M_{\mathrm{N}} - j_\beta)\, I(s(x^{\mathrm{U}}_{(k)}) > s(x^{\mathrm{N}}_{(j_\beta+1)})) + (\theta_{\mathrm{P}} M_{\mathrm{U}} - k_{\theta_{\mathrm{P}}}) \sum_{j=j_\alpha+1}^{j_\beta} I(s(x^{\mathrm{U}}_{(k_{\theta_{\mathrm{P}}}+1)}) > s(x^{\mathrm{N}}_{(j)})) + (\theta_{\mathrm{P}} M_{\mathrm{U}} - k_{\theta_{\mathrm{P}}})(\beta M_{\mathrm{N}} - j_\beta)\, I(s(x^{\mathrm{U}}_{(k_{\theta_{\mathrm{P}}}+1)}) > s(x^{\mathrm{N}}_{(j_\beta+1)})) \Big], \quad (14)$$

where $k_{\theta_{\mathrm{P}}} = \lfloor \theta_{\mathrm{P}} M_{\mathrm{U}} \rfloor$. $\widehat{\mathrm{pAUC}}_{\mathrm{NU}}((0, \theta_{\mathrm{P}}), (\alpha, \beta))$ is calculated from negative and unlabeled data. Figure 5 illustrates the terms in Eq.(14). The computational complexity of calculating Eq.(14) is $O(M_{\mathrm{U}} \log M_{\mathrm{U}} + M_{\mathrm{N}} \log M_{\mathrm{N}} + (\beta - \alpha)\theta_{\mathrm{P}} M_{\mathrm{U}} M_{\mathrm{N}})$, where the first and second terms are for sorting the $M_{\mathrm{U}}$ unlabeled and $M_{\mathrm{N}}$ negative samples, respectively, and the third term is for comparing the $\theta_{\mathrm{P}} M_{\mathrm{U}}$ unlabeled samples with the $(\beta - \alpha) M_{\mathrm{N}}$ negative samples.

Figure 5: The empirical partial area under the curve of the unlabeled positive rate against the false positive rate, pAUCNU((0, θP), (α, β)), in Eq.(14).

We learn the decision function by maximizing the empirical pAUCs. For the decision function, we can use any differentiable function, such as logistic regression models and neural networks. To make the empirical pAUCs differentiable, we used the sigmoid function $\sigma(A) = \frac{1}{1 + \exp(-A)}$ instead of the indicator function $I(A)$ in the empirical pAUCs, which is often used for a smooth approximation of the indicator function. For example, the second term in Eq.(3) is approximated by $I(s(x^{\mathrm{P}}_m) > s(x^{\mathrm{N}}_{(j)})) \approx \sigma(s(x^{\mathrm{P}}_m) - s(x^{\mathrm{N}}_{(j)}))$. Let $\widetilde{\mathrm{pAUC}}(\alpha, \beta)$ indicate the approximation of $\widehat{\mathrm{pAUC}}(\alpha, \beta)$ smoothed by the sigmoid function. Our objective function to be maximized is

$$\mathcal{L} = \lambda_1 \widetilde{\mathrm{pAUC}}(\alpha, \beta) + \lambda_2 \widetilde{\mathrm{pAUC}}_{\mathrm{PU}}(\theta_{\mathrm{P}} + \alpha\theta_{\mathrm{N}}, \theta_{\mathrm{P}} + \beta\theta_{\mathrm{N}}) + \lambda_3 \widetilde{\mathrm{pAUC}}_{\mathrm{NU}}((0, \theta_{\mathrm{P}}), (\alpha, \beta)), \quad (15)$$

where the first term is the smoothed empirical pAUC with positive and negative data in Eq.(3), the second term is the smoothed empirical pAUC with positive and unlabeled data in Eq.(13), the third term is the smoothed empirical pAUC with negative and unlabeled data in Eq.(14), and $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters, $\lambda_1, \lambda_2, \lambda_3 \geq 0$, $\sum_{i=1}^{3} \lambda_i = 1$. The hyperparameters can be tuned with validation data. When $\lambda_1 = 1, \lambda_2 = \lambda_3 = 0$, the proposed method corresponds to a supervised pAUC maximization method. We can robustly approximate the partial AUC by using unlabeled data as well as labeled data. The proposed method is applicable even when only positive (negative) and unlabeled data are available, since the second (third) term requires only positive (negative) and unlabeled data. In this sense, the proposed method is related to methods for learning from positive and unlabeled data (Du Plessis, Niu, and Sugiyama 2015; Lee and Liu 2003; Li and Liu 2003; Elkan and Noto 2008), although they are not for pAUC maximization.

6 Experiments

6.1 Data

We evaluated the effectiveness of the proposed method by using the following nine datasets for anomaly detection (Campos et al. 2016), obtained from http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/: Annthyroid, Cardiotocography, InternetAds, KDDCup99, PageBlocks, Pima, SpamBase, Waveform and Wilt, where the feature vector dimensionalities were 21, 21, 1555, 79, 10, 8, 57, 21 and 5, respectively. For each dataset, we used 50 labeled and 300 unlabeled samples for training, 50 labeled samples for validation, and the remaining samples for testing, where the positive ratio was set at 0.1. For each dataset, we randomly sampled 30 training, validation and test data sets, and calculated the average pAUC over the 30 sets.

6.2 Comparing methods

We compared the proposed semi-supervised learning method for maximizing the pAUC with the following eight methods: CE, MA, MPA, ST, SS, SSR, pSS and pSSR.
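Before turning to the baselines, the training objective of Eq.(15) can be sketched end-to-end with the sigmoid relaxation. The following simplified NumPy version replaces the exact Eq.(13)–(14) estimators by pseudo-labeled analogues built from the UPR split, and omits fractional endpoint terms (an illustrative approximation with our own names, not the paper's implementation, which trains a neural network scorer with Adam):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def smoothed_pauc(pos_scores, neg_scores, alpha, beta):
    """Sigmoid-smoothed empirical pAUC: the indicator I(s > s') in Eq.(3)
    is replaced by sigma(s - s'); fractional endpoints omitted."""
    m_n = len(neg_scores)
    neg_sorted = np.sort(neg_scores)[::-1]
    j_a, j_b = int(np.ceil(alpha * m_n)), int(np.floor(beta * m_n))
    block = neg_sorted[j_a:j_b]            # negatives in the FPR range
    return float(np.mean(sigmoid(pos_scores[:, None] - block[None, :])))

def objective(pos, neg, unl, alpha, beta, theta_p, lams):
    """Weighted objective in the spirit of Eq.(15): a supervised term plus
    positive-unlabeled and negative-unlabeled terms built from the UPR
    split of the unlabeled scores."""
    h = np.quantile(unl, 1.0 - theta_p)    # empirical UPR^{-1}(theta_p)
    est_pos, est_neg = unl[unl > h], unl[unl <= h]
    l1, l2, l3 = lams
    return (l1 * smoothed_pauc(pos, neg, alpha, beta)
            + l2 * smoothed_pauc(pos, est_neg, alpha, beta)   # pAUC_PU analogue
            + l3 * smoothed_pauc(est_pos, neg, alpha, beta))  # pAUC_NU analogue
```

Here fixed score vectors stand in for the decision function s(x); in the paper the scores are produced by a neural network and the objective is maximized by gradient ascent.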

Table 1: Test pAUCs with (a) α = 0, β = 0.1, (b) α = 0, β = 0.3 and (c) α = 0.1, β = 0.2. Values in bold typeface are not statistically different (at the 5% level) from the best performing method in each row according to a paired t-test. The bottom row of each sub-table shows the average test pAUC over all datasets, and values in bold typeface indicate the method that achieved the best average.

(a) pAUC(0.0, 0.1)
                  CE     MA     MPA    ST     SS     SSR    pSS    pSSR   Ours
Annthyroid        0.227  0.236  0.384  0.357  0.399  0.422  0.258  0.457  0.388
Cardiotocography  0.464  0.473  0.493  0.494  0.420  0.450  0.467  0.393  0.527
InternetAds       0.540  0.570  0.565  0.632  0.496  0.464  0.527  0.446  0.580
KDDCup99          0.880  0.868  0.874  0.890  0.837  0.832  0.867  0.802  0.884
PageBlocks        0.528  0.518  0.593  0.596  0.599  0.599  0.553  0.568  0.598
Pima              0.057  0.118  0.188  0.189  0.179  0.130  0.127  0.118  0.206
SpamBase          0.408  0.438  0.461  0.469  0.422  0.393  0.435  0.416  0.484
Waveform          0.270  0.253  0.288  0.283  0.268  0.281  0.305  0.226  0.306
Wilt              0.100  0.195  0.594  0.549  0.648  0.403  0.260  0.703  0.681
Average           0.386  0.408  0.493  0.496  0.474  0.442  0.422  0.459  0.517

(b) pAUC(0.0, 0.3)
                  CE     MA     MPA    ST     SS     SSR    pSS    pSSR   Ours
Annthyroid        0.442  0.436  0.517  0.494  0.516  0.445  0.428  0.506  0.503
Cardiotocography  0.680  0.705  0.698  0.701  0.661  0.665  0.686  0.637  0.725
InternetAds       0.664  0.697  0.695  0.723  0.629  0.631  0.621  0.590  0.672
KDDCup99          0.949  0.941  0.944  0.956  0.929  0.914  0.943  0.904  0.961
PageBlocks        0.679  0.677  0.717  0.724  0.746  0.744  0.729  0.753  0.727
Pima              0.255  0.324  0.387  0.383  0.384  0.364  0.327  0.346  0.355
SpamBase          0.698  0.690  0.691  0.691  0.663  0.627  0.662  0.617  0.687
Waveform          0.624  0.619  0.598  0.628  0.571  0.548  0.595  0.500  0.609
Wilt              0.326  0.440  0.813  0.780  0.803  0.687  0.539  0.790  0.845
Average           0.591  0.614  0.673  0.675  0.656  0.625  0.614  0.627  0.676

(c) pAUC(0.1, 0.2)
                  CE     MA     MPA    ST     SS     SSR    pSS    pSSR   Ours
Annthyroid        0.480  0.469  0.526  0.512  0.537  0.459  0.454  0.456  0.510
Cardiotocography  0.729  0.750  0.752  0.760  0.697  0.685  0.746  0.601  0.761
InternetAds       0.697  0.734  0.729  0.734  0.611  0.637  0.663  0.558  0.724
KDDCup99          0.982  0.977  0.982  0.986  0.967  0.956  0.973  0.963  0.988
PageBlocks        0.713  0.718  0.751  0.740  0.784  0.782  0.776  0.708  0.763
Pima              0.294  0.353  0.388  0.404  0.425  0.404  0.376  0.337  0.447
SpamBase          0.764  0.760  0.775  0.774  0.713  0.688  0.727  0.623  0.768
Waveform          0.708  0.695  0.626  0.634  0.536  0.594  0.683  0.522  0.654
Wilt              0.341  0.462  0.700  0.691  0.854  0.714  0.567  0.858  0.865
Average           0.634  0.658  0.692  0.693  0.681  0.658  0.663  0.625  0.720

[Figure 6: nine heat-map panels, one per dataset (Annthyroid, Cardiotocography, InternetAds, KDDCup99, PageBlocks, Pima, SpamBase, Waveform, Wilt), each with λ1 on the x-axis and λ2 on the y-axis.]
Figure 6: Test pAUCs with α =0,β =0.1 obtained by the proposed method with different hyperparameters λ1, λ2, λ3 = 1 − λ1 − λ2. X-axis is hyperparameter λ1, y-axis is λ2, and a darker color indicates a better test pAUC.
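Figure 6 sweeps λ1 and λ2 with λ3 = 1 − λ1 − λ2. Assuming the step of 0.2 from which the hyperparameters are selected in Section 6.2, the candidate simplex can be enumerated as (a sketch with our own naming):

```python
def lambda_grid(step=0.2):
    """Enumerate (lam1, lam2, lam3) on the simplex lam1 + lam2 + lam3 = 1
    with each weight a multiple of `step`, as in the validation search."""
    n = round(1.0 / step)
    grid = []
    for i in range(n + 1):
        for j in range(n + 1 - i):
            l1, l2 = i * step, j * step
            grid.append((l1, l2, round(1.0 - l1 - l2, 10)))
    return grid
```

Each candidate triple is then scored by the validation pAUC and the best one is kept.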

The CE, MA and MPA methods are supervised learning methods. The CE method learns neural network parameters by minimizing the cross-entropy loss. The MA and MPA methods learn parameters by maximizing the smoothed empirical AUC and pAUC, respectively.

The ST method is a self-training based semi-supervised learning method for maximizing the pAUC. With the ST method, at each step, the neural network is (re)trained by maximizing the smoothed empirical pAUC using labeled data, and then some unlabeled samples are added to the labeled data by assigning labels using the neural network, where unlabeled samples with high positive/negative probabilities are selected to be added.

The SS and SSR methods are semi-supervised learning methods for maximizing the AUC, where the SS method does not need to estimate the positive ratio (Xie and Li 2018), and the SSR method is based on positive-unlabeled learning (Sakai, Niu, and Sugiyama 2018). The pSS and pSSR methods are semi-supervised learning methods for maximizing the pAUC, where the pAUC is calculated by pAUC(α, β) = AUC − pAUC(0, α) − pAUC(β, 1). In particular, the pSS and pSSR methods calculate the AUC in the same way as the SS and SSR methods using labeled and unlabeled samples, respectively, and derive pAUC(α, β) by subtracting pAUC(0, α) and pAUC(β, 1), which are estimated using labeled samples, from the AUC. With the SSR and pSSR methods, we used non-negative estimators (Kiryo et al. 2017).

For the decision functions s(x) with all the methods including our proposed method, we used the same neural network architecture, which was a three-layer feed-forward neural network with 32 hidden units and rectified linear units (ReLU) for the activation functions. We optimized the neural network parameters using ADAM (Kingma and Ba 2015) with a learning rate of 0.1 and a batch size of 1,024. The empirical AUC and pAUC were calculated using the samples in each batch for training. The weight decay parameter was set at 10−3. The hyperparameters of the proposed, SS, SSR, pSS and pSSR methods were selected from {0, 0.2, 0.4, 0.6, 0.8, 1} using the validation pAUC. With the ST method, the number of unlabeled samples to be labeled at each step was tuned from {5, 10, 15, 20, 25} using the validation pAUC, where the number of epochs for each retraining step was 100. The validation pAUC was also used for early stopping with all methods, where the maximum number of training epochs was 3,000. We implemented all the methods based on PyTorch (Paszke et al. 2017).

6.3 Results

Table 1 shows the test pAUCs with three different pairs of α and β. The proposed method achieved performance that was not statistically different from the best performing method in most cases. The test pAUC with MPA was higher than that with the other supervised methods, i.e., CE and MA, since MPA directly optimized the pAUC. MA was better than CE because the AUC is more closely related to the pAUC than the cross-entropy loss. On average, the performance of ST was slightly better than MPA, but it was worse than the proposed method, especially when (α, β) was (0, 0.1) or (0.1, 0.2). ST decisively assigns part of the unlabeled data with high confidence scores to the positive or negative class, which changes the negative score distribution and makes it difficult to identify negative samples where the false positive rate is between α and β. Besides, ST treats assigned unlabeled data and labeled data equally. On the other hand, the proposed method provisionally assigns all unlabeled data to positive or negative, and treats assigned unlabeled data and labeled data unequally. These differences resulted in the better performance of the proposed method over ST. By using unlabeled data, the semi-supervised methods that maximize the AUC, i.e., SS and SSR, often performed better than the supervised MA method. However, they performed worse than the proposed method since they maximize the AUC but not the pAUC. The semi-supervised methods that maximize the pAUC, i.e., pSS and pSSR, sometimes performed worse than MPA, SS and SSR. This would be because the pAUC estimated by pSS and pSSR, which subtract the estimated partial areas from the estimated AUC, was not accurate. On the other hand, the proposed method achieved a robust pAUC estimation by sorting the unlabeled data by their scores. The average computational time for training with the proposed method on the InternetAds dataset, which had the largest feature vector dimensionality and took the longest time among the nine datasets, was 29.3 seconds on computers with 2.60GHz CPUs.

Figure 6 shows the test pAUC with α = 0, β = 0.1 when using the proposed method with different hyperparameters λ1, λ2, λ3 = 1 − λ1 − λ2. The best hyperparameter setting differed across datasets. The proposed method achieved good performance by selecting the hyperparameters using validation data.

7 Conclusion

In this paper, we derived two approximations of the partial AUC that are calculated using unlabeled data, and proposed a semi-supervised learning method for pAUC maximization that trains a classifier robustly by maximizing the two approximated partial AUCs as well as the partial AUC calculated using labeled data. We confirmed experimentally that our proposed method performed better than existing methods. For future work, we would like to evaluate the proposed method with different types of applications, such as anomaly detection (Yamanaka et al. 2019), and different types of classifiers, such as tree-based methods (Levatić et al. 2017; 2018).

References

Baker, S. G., and Pinsky, P. F. 2001. A proposed design and analysis for comparing digital and analog mammography: special receiver operating characteristic methods for cancer screening. Journal of the American Statistical Association 96(454):421–428.

Bradley, A. P. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7):1145–1159.

Brefeld, U., and Scheffer, T. 2005. AUC maximizing support vector learning. In Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning.

Campos, G. O.; Zimek, A.; Sander, J.; Campello, R. J.; Micenková, B.; Schubert, E.; Assent, I.; and Houle, M. E. 2016. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30(4):891–927.

Cortes, C., and Mohri, M. 2004. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, 313–320.

Ding, Y.; Zhao, P.; Hoi, S. C.; and Ong, Y.-S. 2015. An adaptive gradient method for online AUC maximization. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Dodd, L. E., and Pepe, M. S. 2003. Partial AUC estimation and regression. Biometrics 59(3):614–623.

Du Plessis, M. C., and Sugiyama, M. 2014. Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Networks 50:110–119.

Du Plessis, M.; Niu, G.; and Sugiyama, M. 2015. Convex formulation for learning from positive and unlabeled data. In International Conference on Machine Learning, 1386–1394.

Elkan, C., and Noto, K. 2008. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 213–220. ACM.

Fujino, A., and Ueda, N. 2016. A semi-supervised AUC optimization method with generative models. In 16th International Conference on Data Mining, 883–888. IEEE.

Han, G., and Zhao, C. 2010. AUC maximization linear classifier based on active learning and its application. Neurocomputing 73(7-9):1272–1280.

Huang, J., and Ling, C. X. 2005. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering 17(3):299–310.

Iyer, A.; Nath, S.; and Sarawagi, S. 2014. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In International Conference on Machine Learning, 530–538.

Kingma, D. P., and Ba, J. 2015. ADAM: A method for stochastic optimization. In International Conference on Learning Representations.

Kiryo, R.; Niu, G.; du Plessis, M. C.; and Sugiyama, M. 2017. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, 1675–1685.

Komori, O., and Eguchi, S. 2010. A boosting method for maximizing the partial area under the ROC curve. BMC Bioinformatics 11(1):314.

Lee, W. S., and Liu, B. 2003. Learning with positive and unlabeled examples using weighted logistic regression. In International Conference on Machine Learning, volume 3, 448–455.

Levatić, J.; Ceci, M.; Kocev, D.; and Džeroski, S. 2017. Self-training for multi-target regression with tree ensembles. Knowledge-Based Systems 123:41–60.

Levatić, J.; Kocev, D.; Ceci, M.; and Džeroski, S. 2018. Semi-supervised trees for multi-target regression. Information Sciences 450:109–127.

Li, X., and Liu, B. 2003. Learning to classify texts using positive and unlabeled data. In IJCAI, volume 3, 587–592.

Narasimhan, H., and Agarwal, S. 2013a. A structural SVM based approach for optimizing partial AUC. In International Conference on Machine Learning, 516–524.

Narasimhan, H., and Agarwal, S. 2013b. SVM pAUC tight: a new support vector method for optimizing partial AUC based on a tight convex upper bound. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 167–175. ACM.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Ricamato, M. T., and Tortorella, F. 2011. Partial AUC maximization in a linear combination of dichotomizers. Pattern Recognition 44(10-11):2669–2677.

Saerens, M.; Latinne, P.; and Decaestecker, C. 2002. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation 14(1):21–41.

Sakai, T.; Niu, G.; and Sugiyama, M. 2018. Semi-supervised AUC optimization based on positive-unlabeled learning. Machine Learning 107(4):767–794.

Ueda, N., and Fujino, A. 2018. Partial AUC maximization via nonlinear scoring functions. arXiv preprint arXiv:1806.04838.

Wang, R., and Tang, K. 2009. Feature selection for maximizing the area under the ROC curve. In International Conference on Data Mining Workshops, 400–405. IEEE.

Wang, Z., and Chang, Y.-C. I. 2010. Marker selection via maximizing the partial area under the ROC curve of linear risk scores. Biostatistics 12(2):369–385.

Xie, Z., and Li, M. 2018. Semi-supervised AUC optimization without guessing labels of unlabeled data. In Thirty-Second AAAI Conference on Artificial Intelligence.

Yamanaka, Y.; Iwata, T.; Takahashi, H.; Yamada, M.; and Kanai, S. 2019. Autoencoding binary classifiers for supervised anomaly detection. In PRICAI, 647–659.

Ying, Y.; Wen, L.; and Lyu, S. 2016. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, 451–459.

Zhao, P.; Hoi, S. C.; Jin, R.; and Yang, T. 2011. Online AUC maximization. In International Conference on Machine Learning.

Zhou, L.; Lai, K. K.; and Yen, J. 2009. Credit scoring models with AUC maximization based on weighted SVM. International Journal of Information Technology & Decision Making 8(04):677–696.
