Converting Output Scores from Outlier Detection Algorithms Into Probability Estimates
Jing Gao, Pang-Ning Tan
Dept. of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824
[email protected], [email protected]

Abstract

Current outlier detection schemes typically output a numeric score representing the degree to which a given observation is an outlier. We argue that converting the scores into well-calibrated probability estimates is more favorable for several reasons. First, the probability estimates allow us to select the appropriate threshold for declaring outliers using a Bayesian risk model. Second, the probability estimates obtained from individual models can be aggregated to build an ensemble outlier detection framework. In this paper, we present two methods for transforming outlier scores into probabilities. The first approach assumes that the posterior probabilities follow a logistic sigmoid function and learns the parameters of the function from the distribution of outlier scores. The second approach models the score distributions as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule. We evaluated the efficacy of both methods in the context of threshold selection and ensemble outlier detection. We also show that the calibration accuracy improves with the aid of some labeled examples.

1 Introduction

Outlier detection has been extensively studied for many years, resulting in the development of numerous algorithms [7, 3, 6]. These algorithms often produce a numeric-valued output score to represent the degree to which a given observation is unusual. In this paper, we argue that it is insufficient to obtain only the magnitude or rank of the outlier score for any given observation. There are many advantages to transforming the output scores into well-calibrated probability estimates. First, the probability estimates provide a more systematic approach for selecting the appropriate threshold for declaring outliers. Instead of requiring the user to choose the threshold in an ad-hoc manner, a Bayesian risk model can be employed, which takes into account the relative cost of misclassifying normal examples as outliers, and vice versa. Second, the probability estimates also provide a more robust approach for developing an ensemble outlier detection framework than methods based on aggregating the relative rankings of outlier scores [9]. Finally, the probability estimates are useful for determining the uncertainties in outlier prediction.

Obtaining calibrated probability estimates from supervised classifiers such as support vector machines (SVM), Naïve Bayes, and decision trees has been the subject of extensive research in recent years [12, 11, 1, 13]. The calibration methods generally fall under two categories: parametric and non-parametric. Parametric methods assume that the probabilities follow certain well-known distributions, whose parameters are to be estimated from the training data. The typical methods used for calibrating classifier outputs include logistic regression, asymmetric Laplace distribution, and piecewise logistic regression. Non-parametric methods, on the other hand, employ smoothing, binning, and bagging methods to infer probability estimates from the classifier's output scores.

Each of the preceding methods requires labeled examples to learn the appropriate calibration function. They are inapplicable to calibrating outlier scores because outlier detection is an unsupervised learning task. Therefore, a key challenge in this work is handling the missing label problem. Our solution is to treat the missing labels as hidden variables and apply the Expectation-Maximization (EM) algorithm [4] to maximize the expected likelihood of the data. We consider two approaches for modeling the data. The first approach models the posterior probability for outlier scores using a sigmoid function, while the second approach models the likelihoods for the normal and outlier classes separately.

Our previous work on semi-supervised outlier detection [5] suggests that adding a small number of labeled examples helps to improve the detection rate and false alarm rate of an outlier detection algorithm. In this paper, we further investigate the benefits of semi-supervised outlier detection in the context of improving the probability estimation of outlier scores, by modifying our proposed methods to maximize the joint likelihood of the labeled and unlabeled data.

In short, our main contributions are summarized below:

1. We develop two calibration methods for transforming outlier scores into probabilities. Unlike existing approaches, which are developed for supervised classifiers, our proposed methods do not require labeled examples. Instead, the labels are treated as hidden variables to be learnt together with the model parameters.

2. We devise a semi-supervised method to further improve the calibration of probability estimates.

3. We illustrate the benefits of converting outlier scores into probabilities in the context of threshold selection and ensemble outlier detection. Our results show that a better performance is achieved using the probability estimates instead of the raw outlier scores.

The rest of the paper is organized as follows. Section 2 describes our calibration method using a sigmoid function, while Section 3 presents an alternative method using a mixture of exponential and Gaussian distributions. Section 4 shows how to extend these methods to incorporate labeled examples. In Section 5, we describe the advantages of using calibrated probabilities for threshold selection and ensemble outlier detection. Experimental results are given in Section 6, while the conclusions are presented in Section 7.

2 Calibration Using Sigmoid Function

Logistic regression is a widely used method for transforming classification outputs into probability estimates. Converting outlier scores, on the other hand, is more challenging because there are no labeled examples available. This section describes our proposed method for learning the labels and model parameters simultaneously using an EM-based algorithm.

2.1 Modeling Sigmoid Posterior Probability

Let X = {x_1, x_2, ..., x_N} denote a set of N observations drawn from a d-dimensional space, R^d. Suppose the data is generated from two classes: the outlier class O and the normal class M. Let F = {f_1, f_2, ..., f_N} be the corresponding outlier scores assigned to each observation in X. Without loss of generality, we assume that the higher f_i is, the more likely x_i is an outlier.

Our objective is to estimate the probability that x_i is an outlier given its outlier score f_i, i.e., p_i = P(O|f_i). The probability that x_i is normal can be computed accordingly by P(M|f_i) = 1 - p_i. According to Bayes' theorem:

    P(O|f_i) = p(f_i|O)P(O) / [p(f_i|O)P(O) + p(f_i|M)P(M)]
             = 1 / (1 + exp(-a_i))                                (1)

where

    a_i = log [p(f_i|O)P(O) / (p(f_i|M)P(M))]                     (2)

As shown in [2], a_i can be considered as a discriminant function that classifies x_i into one of the two classes. For a Gaussian distribution with equal covariance matrices, a_i can be simplified to a linear function:

    a_i = A f_i + B                                               (3)

Substituting Equation 3 into 1 yields:

    p_i = P(O|f_i) = 1 / (1 + exp(-A f_i - B))                    (4)

Our task is to learn the parameters of the calibration function, A and B. Let t_i be a binary variable whose value is 1 if x_i belongs to the outlier class and 0 if it is normal. The probability of observing t_i is

    p(t_i|f_i) = p_i^{t_i} (1 - p_i)^{1 - t_i}                    (5)

which corresponds to a Bernoulli distribution. Let T = [t_i] denote an N-dimensional vector whose components represent the class labels assigned to the N observations. Assuming that the observations are drawn independently, the likelihood of observing T is then given by

    P(T|F) = prod_{i=1}^{N} p_i^{t_i} (1 - p_i)^{1 - t_i}         (6)

Maximizing the likelihood is equivalent to minimizing the following negative log-likelihood function:

    LL(T|F) = - sum_{i=1}^{N} [t_i log p_i + (1 - t_i) log(1 - p_i)]                       (7)

Substituting Equation 4 into 7, we obtain:

    LL(T|F) = sum_{i=1}^{N} [(1 - t_i)(A f_i + B) + log(1 + exp(-A f_i - B))]              (8)

In supervised classification, since labeled examples {x_i, t_i} are available, we may create a training set (f_i, t_i) and learn the parameters of the sigmoid function directly by minimizing the objective function in Equation 8 (see [11] and [13]). Unfortunately, because outlier detection is an unsupervised learning task, we do not know the actual values for t_i. To overcome this problem, we propose to treat the t_i's as hidden variables and employ the EM algorithm to simultaneously estimate the missing labels and the parameters of the calibration function. The learning algorithm is presented in the next section.

[Figure: histograms of outlier scores. (a) Outlier Score Distributions for Normal Examples.]

2.2 Learning Parameters of Sigmoid Function

The EM algorithm is a widely used method for finding maximum likelihood estimates in the presence of missing data. It utilizes an iterative procedure to produce a sequence of estimated parameter values: {θ^s | s = 1, 2, ...}. The procedure is divided into two steps. First, the missing label t_i is replaced by its expected value under the current parameter estimate, θ^s. A new parameter estimate is then computed by minimizing the objective function given the current values of T^s = [t_i^s]. Table 1 shows a pseudocode of the algorithm.
Table 1. EM algorithm
Input: the set of outlier scores, F = {f_1, ..., f_N}
Output: model parameters, θ = (A, B)
Method:
1.
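The sigmoid calibration function of Equation 4 and the objective of Equation 8 can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the function names and the example values in the usage note are invented.

```python
import math

def sigmoid_posterior(f, A, B):
    """Equation 4: p = P(O | f) = 1 / (1 + exp(-(A*f + B)))."""
    return 1.0 / (1.0 + math.exp(-(A * f + B)))

def negative_log_likelihood(scores, labels, A, B):
    """Equation 8: sum over i of (1 - t_i)*(A*f_i + B) + log(1 + exp(-(A*f_i + B))).

    Equivalent to Equation 7 with p_i given by Equation 4, but numerically
    safer because it avoids computing log(p_i) and log(1 - p_i) directly.
    """
    total = 0.0
    for f, t in zip(scores, labels):
        z = A * f + B
        total += (1 - t) * z + math.log1p(math.exp(-z))
    return total
```

With labels available (the supervised case discussed above), minimizing `negative_log_likelihood` over (A, B) recovers Platt-style sigmoid fitting; one can verify numerically that Equations 7 and 8 give the same value for any (A, B).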
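The body of Table 1 is truncated in this copy, so the following is only a plausible sketch of the two steps described in Section 2.2: an E-step that fills in the missing labels from the current posterior, and an M-step that re-minimizes Equation 8. The hard 0.5 threshold in the E-step, the gradient-descent optimizer, the initialization, and the iteration counts are all assumptions, not taken from the paper.

```python
import math

def fit_sigmoid(scores, labels, A=1.0, B=0.0, lr=0.5, steps=200):
    """M-step sketch: minimize Equation 8 by gradient descent.

    The gradient of Equation 8 with respect to z_i = A*f_i + B is (p_i - t_i),
    so the parameter gradients accumulate (p_i - t_i)*f_i and (p_i - t_i).
    """
    n = len(scores)
    for _ in range(steps):
        gA = gB = 0.0
        for f, t in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(A * f + B)))
            gA += (p - t) * f
            gB += (p - t)
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

def calibrate_em(scores, n_rounds=20):
    """EM-style sketch in the spirit of Table 1 (pseudocode truncated in source).

    E-step: assign labels from the current posterior (hard threshold at 0.5 --
    an assumption). M-step: refit (A, B) on the imputed labels.
    """
    # Arbitrary initialization (assumption): unit slope, centered at the mean score.
    A, B = 1.0, -sum(scores) / len(scores)
    for _ in range(n_rounds):
        labels = [1 if 1.0 / (1.0 + math.exp(-(A * f + B))) > 0.5 else 0
                  for f in scores]
        A, B = fit_sigmoid(scores, labels, A, B)
    return A, B
```

On a score distribution with a small cluster of high scores, the fitted slope A comes out positive, so higher outlier scores map to higher estimated probabilities, consistent with the assumption stated in Section 2.1.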