
Converting Output Scores from Outlier Detection Algorithms into Probability Estimates

Jing Gao and Pang-Ning Tan

Dept. of Computer Science and Engineering
Michigan State University
East Lansing, MI 48824
[email protected], [email protected]

Abstract

Current outlier detection schemes typically output a numeric score representing the degree to which a given observation is an outlier. We argue that converting the scores into well-calibrated probability estimates is more favorable for several reasons. First, the probability estimates allow us to select the appropriate threshold for declaring outliers using a Bayesian risk model. Second, the probability estimates obtained from individual models can be aggregated to build an ensemble outlier detection framework. In this paper, we present two methods for transforming outlier scores into probabilities. The first approach assumes that the posterior probabilities follow a logistic sigmoid function and learns the parameters of the function from the distribution of outlier scores. The second approach models the score distributions as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule. We evaluate the efficacy of both methods in the context of threshold selection and ensemble outlier detection. We also show that the calibration accuracy improves with the aid of some labeled examples.

1 Introduction

Outlier detection has been extensively studied for many years, resulting in the development of numerous algorithms [7, 3, 6]. These algorithms often produce a numeric-valued output score to represent the degree to which a given observation is unusual. In this paper, we argue that it is insufficient to obtain only the magnitude or rank of the outlier score for any given observation. There are many advantages to transforming the output scores into well-calibrated probability estimates. First, the probability estimates provide a more systematic approach for selecting the appropriate threshold for declaring outliers. Instead of requiring the user to choose the threshold in an ad-hoc manner, the Bayesian risk model can be employed, which takes into account the relative cost of misclassifying normal examples as outliers, and vice-versa. Second, the probability estimates also provide a more robust approach for developing an ensemble outlier detection framework than methods based on aggregating the relative rankings of outlier scores [9]. Finally, the probability estimates are useful for determining the uncertainties in outlier prediction.

Obtaining calibrated probability estimates from supervised classifiers such as support vector machines (SVM), Naïve Bayes, and decision trees has been the subject of extensive research in recent years [12, 11, 1, 13]. The calibration methods generally fall under two categories: parametric and non-parametric. Parametric methods assume that the probabilities follow certain well-known distributions, whose parameters are to be estimated from the training data. The typical methods used for calibrating classifier outputs include logistic regression, asymmetric Laplace distribution, and piecewise logistic regression. Non-parametric methods, on the other hand, employ smoothing, binning, and bagging methods to infer probability estimates from the classifier's output scores.

Each of the preceding methods requires labeled examples to learn the appropriate calibration function. They are inapplicable to calibrating outlier scores because outlier detection is an unsupervised learning task. Therefore a key challenge in this work is handling the missing label problem. Our solution is to treat the missing labels as hidden variables and apply the Expectation-Maximization (EM) algorithm [4] to maximize the expected likelihood of the data. We consider two approaches for modeling the data. The first approach models the posterior probability for outlier scores using a sigmoid function, while the second approach models the likelihoods for the normal and outlier classes separately.

Our previous work on semi-supervised outlier detection [5] suggests that adding a small number of labeled examples helps to improve the detection rate and false alarm rate of an outlier detection algorithm. In this paper, we further investigate the benefits of semi-supervised outlier detection in the context of improving the probability estimation of outlier scores by modifying our proposed methods to maximize the joint likelihoods of the labeled and unlabeled data.

In short, our main contributions are summarized below:

1. We develop two calibration methods for transforming outlier scores into probabilities. Unlike existing approaches, which are developed for supervised classifiers, our proposed methods do not require labeled examples. Instead, the labels are treated as hidden variables to be learnt together with the model parameters.

2. We devise a semi-supervised method to further improve the calibration of probability estimates.

3. We illustrate the benefits of converting outlier scores into probabilities in the context of threshold selection and ensemble outlier detection. Our results show that better performance is achieved using the probability estimates instead of the raw outlier scores.

The rest of the paper is organized as follows. Section 2 describes our calibration method using a sigmoid function, while Section 3 presents an alternative method using a mixture of exponential and Gaussian distributions. Section 4 shows how to extend these methods to incorporate labeled examples. In Section 5, we describe the advantages of using calibrated probabilities for threshold selection and ensemble outlier detection. Experimental results are given in Section 6, while the conclusions are presented in Section 7.

2 Calibration Using Sigmoid Function

Logistic regression is a widely used method for transforming classification outputs into probability estimates. Converting outlier scores, on the other hand, is more challenging because there are no labeled examples available. This section describes our proposed method for learning the labels and model parameters simultaneously using an EM-based algorithm.

2.1 Modeling Sigmoid Posterior Probability

Let $X = \{x_1, x_2, \cdots, x_N\}$ denote a set of $N$ observations drawn from a $d$-dimensional space, $\mathbb{R}^d$. Suppose the data is generated from two classes: the outlier class $O$ and the normal class $M$. Let $F = \{f_1, f_2, \cdots, f_N\}$ be the corresponding outlier scores assigned to each observation in $X$. Without loss of generality, we assume that the higher $f_i$ is, the more likely $x_i$ is an outlier.

Our objective is to estimate the probability that $x_i$ is an outlier given its outlier score $f_i$, i.e., $p_i = P(O|f_i)$. The probability that $x_i$ is normal can be computed accordingly by $P(M|f_i) = 1 - p_i$. According to Bayes' theorem:

$$P(O|f_i) = \frac{p(f_i|O)P(O)}{p(f_i|O)P(O) + p(f_i|M)P(M)} = \frac{1}{1 + \exp(-a_i)} \quad (1)$$

where

$$a_i = \log \frac{p(f_i|O)P(O)}{p(f_i|M)P(M)} \quad (2)$$

As shown in [2], $a_i$ can be considered as a discriminant function that classifies $x_i$ into one of the two classes. For a Gaussian distribution with equal covariance matrices, $a_i$ can be simplified to a linear function:

$$a_i = A f_i + B \quad (3)$$

Replacing Equation 3 into 1 yields:

$$p_i = P(O|f_i) = \frac{1}{1 + \exp(-A f_i - B)} \quad (4)$$

Our task is to learn the parameters of the calibration function, $A$ and $B$. Let $t_i$ be a binary variable whose value is 1 if $x_i$ belongs to the outlier class and 0 if it is normal. The probability of observing $t_i$ is

$$p(t_i|f_i) = p_i^{t_i}(1 - p_i)^{1 - t_i} \quad (5)$$

which corresponds to a Bernoulli distribution. Let $T = [t_i]$ denote an $N$-dimensional vector whose components represent the class labels assigned to the $N$ observations. Assuming that the observations are drawn independently, the likelihood of observing $T$ is then given by

$$P(T|F) = \prod_{i=1}^{N} p_i^{t_i}(1 - p_i)^{1 - t_i} \quad (6)$$

Maximizing the likelihood is equivalent to minimizing the following negative log likelihood function:

$$LL(T|F) = -\sum_{i=1}^{N} \Big[ t_i \log p_i + (1 - t_i)\log(1 - p_i) \Big] \quad (7)$$

Substituting Equation 4 into 7, we obtain:

$$LL(T|F) = \sum_{i=1}^{N} \Big[ (1 - t_i)(A f_i + B) + \log\big(1 + \exp(-A f_i - B)\big) \Big] \quad (8)$$
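To make the derivation concrete, the following sketch (our illustration, not code from the paper) evaluates the sigmoid posterior of Equation 4 and the negative log likelihood of Equation 8, assuming the scores and labels are held in NumPy arrays:

```python
import numpy as np

def sigmoid_posterior(f, A, B):
    """Equation 4: P(O|f_i) = 1 / (1 + exp(-A*f_i - B))."""
    return 1.0 / (1.0 + np.exp(-(A * f + B)))

def negative_log_likelihood(f, t, A, B):
    """Equation 8: sum_i (1 - t_i)(A*f_i + B) + log(1 + exp(-A*f_i - B))."""
    z = A * f + B
    # np.logaddexp(0, -z) evaluates log(1 + exp(-z)) in a numerically stable way.
    return np.sum((1.0 - t) * z + np.logaddexp(0.0, -z))
```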

In supervised classification, since labeled examples $\{x_i, t_i\}$ are available, we may create a training set $(f_i, t_i)$ and learn the parameters of the sigmoid function directly by minimizing the objective function in Equation 8 (see [11] and [13]). Unfortunately, because outlier detection is an unsupervised learning task, we do not know the actual values for $t_i$. To overcome this problem, we propose to treat the $t_i$'s as hidden variables and employ the EM algorithm to simultaneously estimate the missing labels and the parameters of the calibration function. The learning algorithm is presented in the next section.

2.2 Learning Parameters of Sigmoid Function

The EM algorithm is a widely used method for finding maximum likelihood estimates in the presence of missing data. It utilizes an iterative procedure to produce a sequence of estimated parameter values: $\{\theta^s \mid s = 1, 2, \ldots\}$. The procedure is divided into two steps. First, the missing label $t_i$ is replaced by its expected value under the current parameter estimate, $\theta^s$. A new parameter estimate is then computed by minimizing the objective function given the current values of $T^s = [t_i^s]$. Table 1 shows a pseudocode of the algorithm.

Table 1. EM algorithm framework

EM algorithm:
Input: The set of outlier scores, $F = \{f_1, \cdots, f_N\}$
Output: Model parameters, $\theta = (A, B)$
Method:
1. $s \leftarrow 0$.
2. Initialize the parameters to $\theta^0$.
3. Loop until the algorithm converges:
   3.1 E-step: Set $t_i^{s+1} = E(t_i \mid F, \theta^s)$.
   3.2 M-step: Compute $\theta^{s+1} = \arg\min_\theta LL(T|F)$.
   3.3 Set $s \leftarrow s + 1$.

$LL(T|F)$ is the negative log likelihood function to be minimized. During the E-step, the model parameters are fixed while $LL(T|F)$ is minimized with respect to $t_i$. Since $LL(T|F)$ is a linear function of $t_i$, it is minimized by setting:

$$t_i = \begin{cases} 1 & \text{if } A f_i + B > 0 \\ 0 & \text{if } A f_i + B \leq 0 \end{cases} \quad (9)$$

During the M-step, since $T^s = [t_i^s]$ is fixed, minimizing $LL(T|F)$ with respect to $A$ and $B$ is a two-parameter optimization problem, which can be solved using the model-trust algorithm described in [11].
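The sketch below mirrors the EM loop of Table 1, but it is only an illustration under our own assumptions: it uses SciPy's BFGS optimizer in place of the model-trust algorithm of [11], and it adds a small ridge penalty (not part of the paper's formulation) to keep the M-step bounded when the hard-assigned labels of Equation 9 are perfectly separable.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid_em(f, n_iter=50, ridge=1e-3):
    """EM for the sigmoid calibration parameters (A, B) of Equation 4."""
    f = np.asarray(f, dtype=float)
    A, B = 1.0, -float(np.median(f))   # rough initialization (assumption)
    for _ in range(n_iter):
        # E-step (Equation 9): hard-assign the hidden labels given (A, B).
        t = (A * f + B > 0).astype(float)

        # M-step (Equation 8): minimize the negative log likelihood over (A, B).
        def nll(params):
            a, b = params
            z = a * f + b
            penalty = ridge * (a * a + b * b)   # small stabilizer, not in the paper
            return np.sum((1.0 - t) * z + np.logaddexp(0.0, -z)) + penalty

        A_new, B_new = minimize(nll, x0=[A, B], method="BFGS").x
        if np.allclose([A_new, B_new], [A, B]):
            break
        A, B = A_new, B_new
    return A, B
```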

Figure 1. Outlier Score Distributions: (a) outlier score distributions for normal examples; (b) outlier score distributions for outliers.

3 Calibration Using Mixture Modeling of Outlier Scores

Although fitting posterior probabilities to a sigmoid function is simple, the method makes a strong assumption that the outliers and normal examples have similar forms of outlier score distributions. In this section, we present an alternative method for modeling the outlier scores using a mixture of exponential and Gaussian distributions.

3.1 Using Mixture Models to Describe Outlier Score Distributions

Consider a data set containing outliers that are uniformly distributed and normal observations drawn from a Gaussian distribution. We are interested in modeling the distribution of their outlier scores. Suppose the outlier score is computed based on the distance between a data point and its $k$-th nearest neighbor. Figure 1 shows the typical outlier score distributions for the outlier and normal classes.¹ Observe that the outlier scores for the normal class tend to have an exponential distribution, whereas those of the outlier class seem to follow a Gaussian distribution. This result suggests that a mixture model consisting of an exponential and a Gaussian component may fit the outlier score distributions well.

¹ The shapes of these histograms are consistently observed when the experiment is repeated several times.

At first glance, it may seem quite surprising that the scores for the outlier class follow a Gaussian distribution. In the following, we give a theoretical justification as to why the outliers may indeed have Gaussian distributed outlier scores.

Theorem 1 Suppose $X$ and $Y$ are 1-dimensional random variables. If $X \sim U(-\delta, \delta)$ and $Y \sim N(0, 1)$, then the distances between examples drawn from $X$ and $Y$ follow a Gaussian distribution.

Proof. Let $Z$ be the random variable for the distance between $X$ and $Y$, i.e., $Z = |X - Y|$. Then we need to prove that $Z$ follows a Gaussian distribution. We begin with the cumulative distribution function (c.d.f.) for $Z$:

$$F_Z(z) = P(Z \leq z) = P(|X - Y| \leq z) = \iint_R f(x, y)\, dx\, dy$$

where $R$ is the region defined by $|X - Y| \leq z$. If $X$ and $Y$ are independent, their joint probability density function in $R$ is:

$$f(x, y) = f(x)f(y) = \frac{1}{2\delta\sqrt{2\pi}} \exp\!\Big(-\frac{y^2}{2}\Big)$$

and zero elsewhere. Therefore

$$F_Z(z) = \iint_R \frac{1}{2\delta\sqrt{2\pi}} \exp\!\Big(-\frac{y^2}{2}\Big)\, dx\, dy = \int_{-\delta}^{\delta} \frac{1}{2\delta} \int_{-\delta-z}^{\delta+z} \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{y^2}{2}\Big)\, dy\, dx$$

The c.d.f. for the standard normal is often denoted as $\Phi(x)$, where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\!\Big(-\frac{u^2}{2}\Big)\, du$$

So

$$F_Z(z) = \int_{-\delta}^{\delta} \frac{1}{2\delta} \Big[\Phi(\delta + z) - \Phi(-\delta - z)\Big]\, dx = \Phi(\delta + z) - \Phi(-\delta - z)$$

The probability density function for $Z$ is obtained by taking the derivative of its c.d.f. with respect to $z$. Because the derivative of $\Phi(x)$ is a Gaussian density and a linear combination of two Gaussian densities is also Gaussian, $Z$ must follow a Gaussian distribution. □

If the outliers are uniformly distributed and the proportion of outliers is considerably smaller than the proportion of normal observations, it is reasonable to assume that the $k$-th nearest neighbor of an outlier corresponds to a normal observation. Theorem 1 suggests that the distance between a randomly chosen point from a uniform distribution and another randomly chosen point from a Gaussian distribution should follow a Gaussian distribution. While such analysis does not conclusively show that the outlier scores must follow a Gaussian distribution, it does suggest that modeling the scores of outliers using a Gaussian distribution may be quite a reasonable assumption. Furthermore, our empirical results also seem to support this assumption.

Therefore we will use a mixture of Gaussian and exponential distributions for modeling the outlier scores:

$$p_i = p(f_i|O) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big(-\frac{(f_i - \mu)^2}{2\sigma^2}\Big) \quad (10)$$

$$q_i = p(f_i|M) = \lambda \exp(-\lambda f_i) \quad (11)$$

where $\mu$, $\sigma$, and $\lambda$ are the parameters of the Gaussian and exponential distributions. The probability of observing $t_i$ can be written as:

$$p(t_i, f_i) = [\alpha p_i]^{t_i} \big[(1 - \alpha) q_i\big]^{1 - t_i} \quad (12)$$

where $\alpha$ is the prior probability of the outlier component. Using Bayes' rule:

$$p(t_i|f_i) = \frac{[\alpha p_i]^{t_i} \big[(1 - \alpha) q_i\big]^{1 - t_i}}{p(f_i)} \quad (13)$$

The model parameters $\theta = (\alpha, \mu, \sigma, \lambda)$ are estimated by minimizing the negative log likelihood function:

$$LL(T|F) = -\sum_{i=1}^{N} \Big[ t_i \log(\alpha p_i) + (1 - t_i) \log\big((1 - \alpha) q_i\big) \Big] \quad (14)$$

Given the estimate of the model parameters, the posterior probability of $x_i$ being an outlier can be computed using Bayes' rule:

$$P(O|f_i, \hat{\theta}) = \frac{\alpha\, p(f_i|O, \hat{\theta})}{\alpha\, p(f_i|O, \hat{\theta}) + (1 - \alpha)\, p(f_i|M, \hat{\theta})} \quad (15)$$
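The quantities in Equations 10, 11, and 15 translate directly into code. The sketch below is our own illustration (not from the paper), assuming the scores are stored in a NumPy array:

```python
import numpy as np

def gaussian_pdf(f, mu, sigma):
    """Equation 10: outlier-class density p(f_i|O)."""
    return np.exp(-((f - mu) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def exponential_pdf(f, lam):
    """Equation 11: normal-class density p(f_i|M)."""
    return lam * np.exp(-lam * f)

def outlier_posterior(f, alpha, mu, sigma, lam):
    """Equation 15: P(O|f_i, theta) via Bayes' rule."""
    num = alpha * gaussian_pdf(f, mu, sigma)
    return num / (num + (1.0 - alpha) * exponential_pdf(f, lam))
```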


3.2 Learning Parameters of Mixture Model

Similar to the previous method, we use the EM algorithm to minimize the negative log likelihood function given in Equation 14. During the E-step, the expected value of $t_i$ is:

$$t_i^{s+1} = E(t_i \mid F, \theta^s) = 1 \cdot P(t_i = 1 \mid F, \theta^s) + 0 \cdot P(t_i = 0 \mid F, \theta^s) = P(O|f_i, \theta^s)$$

where the posterior is calculated from Equation 15. During the M-step, the model parameters are re-computed by solving the following partial derivatives:

$$\frac{\partial LL(T|F)}{\partial \theta} = 0 \quad (16)$$

After some manipulation, this leads to the following update equations for the model parameters:

$$\mu^{s+1} = \frac{\sum_{i=1}^{N} t_i^{s+1} f_i}{\sum_{i=1}^{N} t_i^{s+1}} \quad (17)$$

$$\big(\sigma^{s+1}\big)^2 = \frac{\sum_{i=1}^{N} t_i^{s+1} (f_i - \mu^{s+1})^2}{\sum_{i=1}^{N} t_i^{s+1}} \quad (18)$$

$$\lambda^{s+1} = \frac{\sum_{i=1}^{N} t_i^{s+1}}{\sum_{i=1}^{N} t_i^{s+1} f_i} \quad (19)$$

$$\alpha^{s+1} = \frac{\sum_{i=1}^{N} t_i^{s+1}}{N} \quad (20)$$
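A compact EM sketch for the mixture parameters follows. It is a toy illustration under our own assumptions rather than the authors' implementation: the E-step uses the posterior of Equation 15, the Gaussian parameters and the prior follow Equations 17, 18, and 20, and the exponential rate is fitted to the complementary (normal-class) responsibilities.

```python
import numpy as np

def fit_mixture_em(f, n_iter=100, tol=1e-6):
    """EM for theta = (alpha, mu, sigma, lambda) of the exponential-Gaussian mixture."""
    f = np.asarray(f, dtype=float)
    # Crude initialization (assumption): treat the top 10% of scores as outliers.
    t = (f > np.quantile(f, 0.9)).astype(float)
    alpha, mu = t.mean(), f[t == 1].mean()
    sigma, lam = f.std(), 1.0 / f[t == 0].mean()
    for _ in range(n_iter):
        # E-step: responsibilities t_i = P(O|f_i, theta)  (Equation 15).
        g = np.exp(-((f - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        e = lam * np.exp(-lam * f)
        t_new = alpha * g / (alpha * g + (1 - alpha) * e)
        # M-step: re-estimate the parameters (cf. Equations 17-20).
        mu = np.sum(t_new * f) / np.sum(t_new)
        sigma = np.sqrt(np.sum(t_new * (f - mu) ** 2) / np.sum(t_new))
        lam = np.sum(1 - t_new) / np.sum((1 - t_new) * f)  # rate of the normal-class component
        alpha = t_new.mean()
        if np.max(np.abs(t_new - t)) < tol:
            break
        t = t_new
    return alpha, mu, sigma, lam
```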

4 Incorporating Labeled Examples

This section presents a framework for incorporating labeled examples to improve the calibration accuracy of probability estimates. Let $F^u$ and $F^l$ be the corresponding sets of outlier scores assigned to the unlabeled data $X^u$ and labeled data $X^l$, respectively. Furthermore, let $L = \{l_i\}$ denote the set of labels associated with the labeled data, where $l_i = 1$ if $x_i$ is an outlier, and 0 otherwise.

In Sections 2 and 3, we have shown that the model parameters are calculated by minimizing the negative log likelihood. For semi-supervised learning [8], we may decompose the negative log likelihood into two parts, each corresponding to the labeled and unlabeled data:

$$\hat{\theta} = \arg\min_{\theta} \Big[ LL(F^u|\theta) + LL(F^l, L|\theta) \Big] \quad (21)$$

Although the EM algorithm is still applicable to learn the model parameters under this framework, some modifications are needed. During the E-step, we only need to estimate the values of $t_i$ for the unlabeled data using Equation 9. The $t_i$ values for the labeled data are fixed, i.e., $t_i = l_i\ (\forall x_i \in X^l)$.

During the M-step, it can be shown that the parameters are updated using a combination of the parameter estimates obtained from the unlabeled data and the labeled data:

$$\mu^{s+1} = \frac{\sum_{f_i \in F^u} t_i^{s+1} f_i + \sum_{f_i \in F^l} 1\{l_i = 1\}\, f_i}{\sum_{f_i \in F^u} t_i^{s+1} + \sum_{f_i \in F^l} 1\{l_i = 1\}} \quad (22)$$

$$\big(\sigma^{s+1}\big)^2 = \frac{\sum_{f_i \in F^u} t_i^{s+1} (f_i - \mu^{s+1})^2 + \sum_{f_i \in F^l} 1\{l_i = 1\}\, (f_i - \mu^{s+1})^2}{\sum_{f_i \in F^u} t_i^{s+1} + \sum_{f_i \in F^l} 1\{l_i = 1\}} \quad (23)$$

$$\lambda^{s+1} = \frac{\sum_{f_i \in F^u} t_i^{s+1} + \sum_{f_i \in F^l} 1\{l_i = 1\}}{\sum_{f_i \in F^u} t_i^{s+1} f_i + \sum_{f_i \in F^l} 1\{l_i = 1\}\, f_i} \quad (24)$$

$$\alpha^{s+1} = \frac{\sum_{f_i \in F^u} t_i^{s+1} + \sum_{f_i \in F^l} 1\{l_i = 1\}}{N} \quad (25)$$

where

$$1\{l_i = 1\} = \begin{cases} 1 & \text{if } l_i = 1 \\ 0 & \text{otherwise} \end{cases}$$
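Written out, the combined M-step simply pools the unlabeled responsibilities with the fixed 0/1 indicators of the labeled data. The helper below is a hypothetical illustration of Equations 22-25 (the array names and interface are our own assumptions):

```python
import numpy as np

def semi_supervised_m_step(f_u, t_u, f_l, l, N):
    """Combined updates of Equations 22-25.

    f_u, t_u : unlabeled scores and their current responsibilities t_i^{s+1}
    f_l, l   : labeled scores and their 0/1 labels l_i
    N        : total number of observations
    """
    w = np.concatenate([t_u, (np.asarray(l) == 1).astype(float)])  # t_i and 1{l_i = 1}
    f = np.concatenate([f_u, f_l])
    mu = np.sum(w * f) / np.sum(w)                          # Equation 22
    sigma = np.sqrt(np.sum(w * (f - mu) ** 2) / np.sum(w))  # Equation 23
    lam = np.sum(w) / np.sum(w * f)                         # Equation 24, as written
    alpha = np.sum(w) / N                                   # Equation 25
    return alpha, mu, sigma, lam
```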

5 Applications

This section presents two potential applications that may benefit from using probability estimates for outlier scores: threshold selection and ensemble outlier detection.

5.1 Threshold Selection

Most outlier detection algorithms require the user to specify a threshold so that any observation whose outlier score exceeds the threshold will be declared an outlier. A standard approach for determining the threshold is to plot the sorted values of the outlier scores and then choose the knee point of the curve as the threshold. Such an ad-hoc method can be imprecise because the location of the knee point is subject to user interpretation.

This section illustrates a more principled approach for threshold selection using the Bayesian risk model, which minimizes the overall risk associated with some cost function. For a two-class problem, the Bayes decision rule for a given observation $x$ is to decide $w_1$ if:

$$(\lambda_{21} - \lambda_{11})\, P(w_1|x) > (\lambda_{12} - \lambda_{22})\, P(w_2|x) \quad (26)$$

where $w_1$ and $w_2$ are the two classes while $\lambda_{ij}$ is the cost of misclassifying $w_j$ as $w_i$. Since $P(w_2|x) = 1 - P(w_1|x)$, the preceding inequality suggests that the appropriate outlier threshold is automatically determined once the cost functions are known. For example, in the case of a zero-one loss function, where

$$\lambda_{ij} = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases} \quad (27)$$

we declare any observation whose estimated posterior probability $P(O|f)$ exceeds 0.5 as an outlier.
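Rearranging Equation 26 with $P(w_2|x) = 1 - P(w_1|x)$ gives an explicit threshold on $P(O|f)$. The snippet below is a small illustration only (the cost values are chosen just for the example):

```python
def bayes_threshold(cost_fp, cost_fn, cost_tp=0.0, cost_tn=0.0):
    """Threshold on P(O|f) implied by Equation 26.

    cost_fn = lambda_21 (missing an outlier), cost_fp = lambda_12 (false alarm),
    cost_tp = lambda_11 and cost_tn = lambda_22 (costs of correct decisions).
    """
    return (cost_fp - cost_tn) / ((cost_fn - cost_tp) + (cost_fp - cost_tn))

print(bayes_threshold(cost_fp=1.0, cost_fn=1.0))  # zero-one loss (Equation 27): 0.5
print(bayes_threshold(cost_fp=1.0, cost_fn=5.0))  # costly misses: threshold drops to 1/6
```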

5.2 Ensemble Outlier Detection

Recently, Lazarevic et al. [9] have proposed a feature bagging approach for combining outputs from multiple outlier detection algorithms. There were two approaches investigated in this study: breadth first and cumulative sum. In both of these approaches, the outputs from each detector in the ensemble are sorted and ranked according to decreasing order of their magnitudes. For each observation, the ranks of its outlier scores in the ensemble are aggregated to obtain the final outlier score.

Instead of merging the ranks, we propose to combine their probability estimates. We employ techniques from reliability theory to perform the probability aggregation. Each outlier detector in the ensemble is considered as a component in a complex system. The components can be arranged in series, in parallel, or in any combination of series-parallel configurations. Detection of an outlier by one of the detectors in the ensemble is analogous to having one of the components in the system fail. Using this framework, the overall probability that a given observation is an outlier is equivalent to the probability that the overall system fails.

Let $P(O|x)$ be the posterior probability that $x$ is an outlier according to the ensemble and $P_i(O|x)$ be its posterior probability estimated by detector $i$. For the series configuration, at least one of the components must fail in order to make the entire system fail. Therefore:

$$P(O|x) = 1 - \prod_{i=1}^{R} \big(1 - P_i(O|x)\big) \quad (28)$$

For the parallel configuration, the system fails only if all the components break down. Analogously, the probability that $x$ is an outlier is:

$$P(O|x) = \prod_{i=1}^{R} P_i(O|x) \quad (29)$$
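Once calibrated probabilities are available, the two combination rules are straightforward to apply. The sketch below (our illustration) assumes probs is an array of shape (R, n) holding $P_i(O|x)$ for R detectors and n observations:

```python
import numpy as np

def combine_series(probs):
    """Equation 28: the system 'fails' if at least one detector flags x as an outlier."""
    return 1.0 - np.prod(1.0 - probs, axis=0)

def combine_parallel(probs):
    """Equation 29: the system 'fails' only if every detector flags x as an outlier."""
    return np.prod(probs, axis=0)

# Example with three detectors and two observations.
probs = np.array([[0.9, 0.2],
                  [0.7, 0.1],
                  [0.8, 0.3]])
print(combine_series(probs))    # -> approx. [0.994, 0.496]
print(combine_parallel(probs))  # -> approx. [0.504, 0.006]
```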
6 Experimental Evaluation

We have conducted our experiments on several real and synthetic data sets to evaluate the performance of the calibration methods. We have also demonstrated the effectiveness of using the probability estimates in the context of threshold selection and ensemble outlier detection.

6.1 Experimental Setup

The data sets used for our experiments are summarized in Table 2. A description of each data set is given below:

Letter: This corresponds to the letter recognition data obtained from the UCI machine learning repository. We choose examples from two classes and designate one of the classes as normal and the other as outlier.

Optical: This is the optical handwritten data set from the UCI machine learning repository. Again, we randomly choose two classes and select data from one class as the normal examples and a small portion of the other class as outliers.

SVMguide1: This data set is used to test the libsvm software. The data originally comes from an astroparticle application. One of the two classes is chosen as the outlier class while the other is the normal class.

Cancer: This data set, which is obtained from the UCI machine learning repository, records the measurements for breast cancer cases. There are two classes: benign, which is considered normal, and malignant, which is the outlier class.

Lpr: This is one of UNM's benchmark data sets. For each trace generated by a user, an ordered list of the frequency counts together with their class label showing "intrusive" or "normal" is recorded.

Shuttle: We use a subset of the shuttle data set from the Statlog Project Database. We choose two of the large classes as normal and select examples from one of the remaining smaller classes as outliers.

Our calibration methods utilize the outlier scores produced by a distance-based outlier detection algorithm. The outlier score is computed based on the distance of each observation to its $k$-th nearest neighbor [6]. The value of $k$ is set to be 3 to 5 times the number of true outliers, depending on the data set. We choose this method as our underlying algorithm because it is easy to implement and is widely used by many authors for comparison purposes.
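The distance-to-$k$-th-nearest-neighbor score of [6] that underlies all of the experiments can be sketched in a few lines (an illustrative brute-force version, adequate for data sets of the sizes listed in Table 2):

```python
import numpy as np

def knn_distance_scores(X, k):
    """Outlier score of each observation = distance to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    # Brute-force pairwise Euclidean distances (O(n^2) memory).
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return np.sort(d, axis=1)[:, k - 1]  # k-th smallest distance per row
```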

Table 2. Description of Data Sets

Data set  | Total number of instances | Number of features | Number of outliers
Letter    | 845  | 16  | 79
Opt       | 623  | 64  | 55
SVMguide1 | 3244 | 4   | 155
Cancer    | 489  | 10  | 45
Lpr       | 2103 | 183 | 101
Shuttle   | 4132 | 9   | 132

6.2 Empirical Results

We have conducted several experiments to demonstrate that our proposed methods help to generate meaningful probability outputs as well as improve the effectiveness of threshold selection and ensemble outlier detection.

6.2.1 Probability Estimation

The posterior probabilities not only help us to determine the outlierness of an observation, they also provide estimates of the confidence in outlier prediction. This experiment aims to demonstrate that the probability estimates are pushed closer towards 0.5, which indicates low confidence in prediction, when the outliers are difficult to distinguish from the normal observations.

Figure 2. Plots of probability estimation for the synthetic data set with variance = 40: (a) mixture model approach; (b) sigmoid approach. (Axes: normalized outlier scores vs. probability estimates.)

Figure 3. Plots of probability estimation for the synthetic data set with variance = 150: (a) mixture model approach; (b) sigmoid approach. (Axes: normalized outlier scores vs. probability estimates.)

For this experiment, we use two synthetic data sets generated from a mixture of two distributions. The normal observations are generated from a Gaussian distribution while the outliers are assumed to be uniformly distributed. The Gaussian distributions for both synthetic data sets have the same mean but different variance (40 versus 150). Because of its higher variance, it is harder to distinguish the outliers from the normal observations in the second data set compared to the first data set.

Figures 2 and 3 show the calibrated probability estimates for the two data sets. For the first data set, because the outliers are well-separated from the normal examples, most of the calibrated values are close to either 0 or 1. For the second data set, because it is harder to distinguish outliers from normal observations, there are more observations with probability estimates around 0.5. Without converting the outlier scores into probabilities, we may not be able to obtain an estimate of the confidence in outlier detection. The calibration plots for both the sigmoid and mixture models look quite similar. However, note that the mixture model approach does not always yield a sigmoid-like curve. The shape of the curve depends on the parameters of the exponential and Gaussian distributions.

6.2.2 Threshold Selection

The purpose of this experiment is to compare the effectiveness of using probability estimates for threshold selection. The baseline method for threshold selection is obtained in the following way. For each data set, we first plot the outlier scores in increasing order of their magnitudes. Figure 4 shows an example of such a plot for the SVMguide1 data set. The knee of the curve, which is located somewhere between 0.05 and 0.1, is then chosen as the threshold for declaring outliers. We refer to this method as "knee" in the remainder of this section.

Tables 3 to 8 show the results of applying different methods for threshold selection. "sigmoid" and "mix" correspond to the sigmoid and mixture model approaches described in Section 2.1 and Section 3, respectively. "Ssigmoid" and "Smix" are the semi-supervised versions of these algorithms, which utilize some labeled examples to aid the calibration. The labeled examples account for 10% of the data set. We conduct our experiments using the 0-1 loss function, which means that the probability threshold for identifying outliers is 0.5.

Figure 4. Sorted Outlier Scores (for the SVMguide1 data set; the knee of the curve lies between 0.05 and 0.1).

Table 3. Letter
   | knee   | mix    | sigmoid | Smix   | Ssigmoid
R  | 0.9937 | 0.6582 | 0.7759  | 0.7165 | 0.7848
P  | 0.4642 | 1.0000 | 0.9903  | 1.0000 | 0.9810
F  | 0.6328 | 0.7939 | 0.8701  | 0.8348 | 0.8720
FA | 0.0007 | 0.0340 | 0.0226  | 0.0284 | 0.0217

Table 4. Opt
   | knee   | mix    | sigmoid | Smix   | Ssigmoid
R  | 0.4912 | 0.7018 | 0.9123  | 0.7193 | 0.9474
P  | 0.9655 | 0.9756 | 0.9123  | 0.9535 | 0.9000
F  | 0.6512 | 0.8163 | 0.9123  | 0.8200 | 0.9231
FA | 0.0483 | 0.0289 | 0.0087  | 0.0273 | 0.0053

Table 5. SVMguide1
   | knee   | mix    | sigmoid | Smix   | Ssigmoid
R  | 0.4903 | 0.7419 | 0.7548  | 0.7419 | 0.6968
P  | 0.8636 | 0.7516 | 0.7500  | 0.7516 | 0.7941
F  | 0.6255 | 0.7468 | 0.7524  | 0.7468 | 0.7423
FA | 0.0250 | 0.0129 | 0.0123  | 0.0129 | 0.0151

Table 6. Cancer
   | knee   | mix    | sigmoid | Smix   | Ssigmoid
R  | 0.4667 | 0.9778 | 0.8222  | 0.9778 | 0.8222
P  | 0.9130 | 0.6667 | 0.8222  | 0.6769 | 0.8222
F  | 0.6176 | 0.7928 | 0.8222  | 0.8000 | 0.8222
FA | 0.0515 | 0.0024 | 0.0180  | 0.0024 | 0.0180

Table 7. Lpr
   | knee   | mix    | sigmoid | Smix   | Ssigmoid
R  | 0.3564 | 1.0000 | 0.4554  | 1.0000 | 0.6139
P  | 1.0000 | 0.5372 | 1.0000  | 0.5372 | 1.0000
F  | 0.5255 | 0.6990 | 0.6259  | 0.6990 | 0.7607
FA | 0.0314 | 0      | 0.0267  | 0      | 0.0191

Table 8. Shuttle
   | knee   | mix    | sigmoid | Smix   | Ssigmoid
R  | 0.4242 | 0.5152 | 0.6894  | 0.5455 | 0.7045
P  | 0.7568 | 0.7010 | 0.6894  | 0.7129 | 0.6889
F  | 0.5437 | 0.5939 | 0.6894  | 0.6180 | 0.6966
FA | 0.0187 | 0.0159 | 0.0103  | 0.0149 | 0.0098

Table 9. Confusion Matrix
                  | True outlier | True normal
Predicted outlier | TP           | FP
Predicted normal  | FN           | TN
The following evaluation metrics are used to compare the effectiveness of the threshold selection methods: precision (P), recall (R), F-measure (F), and false alarm rate (FA). All of these metrics are computed from the confusion matrix shown in Table 9. The formulas for calculating these metrics are listed in Equation 30.

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \qquad F = \frac{2\,TP}{2\,TP + FP + FN} \qquad FA = \frac{FN}{FN + TN} \quad (30)$$
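For reference, the four metrics of Equation 30 map directly onto the confusion-matrix counts of Table 9 (a small helper sketch; note that FA is computed here exactly as defined in Equation 30):

```python
def threshold_metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure and false alarm rate, as defined in Equation 30."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    false_alarm = fn / (fn + tn)
    return precision, recall, f_measure, false_alarm
```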

A good threshold must balance both precision and recall; therefore a higher F-measure value is favored. Our results clearly show that the F-measure for all the data sets improved after probability calibration. Thus, converting the outlier scores into probabilities improves threshold selection in terms of balancing its precision and recall. The results also show that the addition of labeled examples tends to produce better calibration. More specifically, both the "Smix" and "Ssigmoid" approaches outperform their unsupervised counterparts in five out of six data sets in terms of their F-measure.

6.2.3 Outlier Detection Ensemble

This experiment compares the effectiveness of using probability estimates for combining outputs from multiple outlier detectors as opposed to the rank aggregation methods (denoted as breadth and sum) proposed by Lazarevic and Kumar [9]. The number of ensembles used in the experiments is 10.

Table 10. AUC values for different approaches to combining outputs from multiple outlier detectors
          | knee   | sum    | breadth | mix    | sigmoid | Smix   | Ssigmoid
Cancer    | 0.9885 | 0.9903 | 0.9883  | 0.9941 | 0.9921  | 0.9940 | 0.9932
Svmguide1 | 0.9750 | 0.9798 | 0.9694  | 0.9846 | 0.9841  | 0.9846 | 0.9823
Letter    | 0.9555 | 0.9549 | 0.9412  | 0.9592 | 0.8826  | 0.9657 | 0.9580
Opt       | 0.9605 | 0.9583 | 0.9503  | 0.9695 | 0.9643  | 0.9719 | 0.9640
Lpr       | 0.9999 | 1.0000 | 0.9999  | 0.9965 | 1.0000  | 0.9965 | 1.0000
Shuttle   | 0.9881 | 0.9885 | 0.9885  | 0.9900 | 0.9890  | 0.9900 | 0.9908

The receiver-operating characteristic (ROC) curve is often used to show the tradeoff between detection rate and false alarm rate. Alternatively, we may use the area under the ROC curve (AUC) as our evaluation metric, where the better scheme will have an AUC value closer to 1. From Table 10, it can be observed that combining probability estimates from multiple outlier detectors yields higher AUC values than combining the ranks of their outlier scores. The semi-supervised approaches ("Ssigmoid" and "Smix") also tend to produce higher AUC values than the unsupervised approaches in most of the data sets.

Another interesting observation can be made when comparing Tables 3 to 8 against Table 10. For threshold selection, the "sigmoid" approach tends to outperform the "mix" approach, whereas for ensemble outlier detection, the "mix" approach tends to produce higher AUC. One possible explanation is that the two applications (threshold selection and ensemble outlier detection) have different requirements concerning the calibration accuracy. Threshold selection is more concerned with determining whether the posterior probability is greater than or less than 0.5 and places less emphasis on how far the estimated probabilities deviate from their true probabilities. In contrast, the difference between the estimated probability and the true probability may affect the effectiveness of the ensemble outlier detection framework. Since the "mix" approach models the outlier score distributions separately for the normal and outlier classes, we expect its probability estimates to be more accurate. We plan to investigate this issue further as part of our future work.

6.2.4 Reliability Diagram

The reliability diagram [10] is a method for visualizing the calibration accuracy of a classifier. It can also be used in outlier detection to show the correspondence between outlier score values and their empirical probabilities. For the raw outlier scores, we normalize them to fall within the range [0, 1]. We then discretize the scores into equal-width bins (of width 0.1) and compute the fraction of examples in each bin that are true outliers. The latter corresponds to the empirical probability estimate for the corresponding bin.

Figure 5. Reliability Diagrams: (a) before calibration (normalized outlier scores vs. empirical probability); (b) after calibration; (c) after semi-supervised calibration (predicted probability vs. empirical probability).
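The empirical probabilities plotted in Figure 5 can be computed with a few lines of binning code. The sketch below is our own illustration, assuming the predicted values lie in [0, 1] and the ground-truth labels are 0/1:

```python
import numpy as np

def reliability_diagram(pred, is_outlier, n_bins=10):
    """Empirical outlier fraction per equal-width bin of the predicted values."""
    pred = np.asarray(pred, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(pred, edges) - 1, 0, n_bins - 1)
    centers, empirical = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():                     # skip empty bins
            centers.append(0.5 * (edges[b] + edges[b + 1]))
            empirical.append(is_outlier[mask].mean())
    return np.array(centers), np.array(empirical)
```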

Figure 5(a) shows the reliability diagram for the normalized raw outlier scores. If the outlier scores are in agreement with the estimated probabilities, the points should be close to the diagonal (dashed) line. This plot shows that the normalized scores obtained directly from distance-based outlier detection algorithms have little to do with posterior probabilities.

After calibration, Figure 5(b) shows that the predicted probabilities are closer to the empirical ones. However, the difference is still quite large since the calibration is done without using any labeled information. Finally, Figure 5(c) shows the reliability diagram for semi-supervised calibration. Note that most of the points have moved closer to the diagonal line, which means that the labeled examples help to improve the probability calibration.

7 Conclusions

In this paper, we study the problem of transforming outlier scores into probability estimates. Unlike existing calibration algorithms, our proposed methods do not require any labeled examples. In the first approach, we treat the labels as hidden variables and fit the outlier scores to a sigmoid function. An efficient EM-based algorithm is developed to learn the function parameters. To obtain a more accurate calibration, we propose an alternative approach that models the score distributions using a mixture of Gaussian and exponential distributions. The posterior probability is then calculated using Bayes' rule. We show that the probability estimates from outlier scores have many potential applications. We discuss the use of the probability estimates in selecting a more appropriate outlier threshold and in improving the performance of an outlier detection ensemble. Our experimental results suggest that our proposed methods can produce accurate calibration and can be used effectively for threshold selection and ensemble outlier detection. We further demonstrate that the calibration performance improves with the aid of some labeled examples.

References

[1] P. N. Bennett. Using asymmetric distributions to improve text classifier probability estimates. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 111-118, New York, NY, USA, 2003. ACM Press.

[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, USA, 1995.

[3] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104. ACM Press, 2000.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 34:1-38, 1977.

[5] J. Gao, H. Cheng, and P.-N. Tan. A novel framework for incorporating labeled examples into anomaly detection. In Proceedings of the Second SIAM International Conference on Data Mining, 2006.

[6] W. Jin, A. K. H. Tung, and J. Han. Mining top-n local outliers in large databases. In KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 293-298. ACM Press, 2001.

[7] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3-4):237-253, 2000.

[8] T. Lange, M. H. Law, A. K. Jain, and J. Buhmann. Learning with constrained and unlabelled data. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 730-777, 2005.

[9] A. Lazarevic and V. Kumar. Feature bagging for outlier detection. In KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 157-166, New York, NY, USA, 2005. ACM Press.

[10] M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. The Statistician, 32(1):12-22, 1982.

[11] J. C. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61-74. MIT Press, 2000.

[12] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Machine Learning, Proceedings of the Eighteenth International Conference (ICML 2001), pages 609-616. Morgan Kaufmann, San Francisco, CA, 2001.

[13] J. Zhang and Y. Yang. Probabilistic score estimation with piecewise logistic regression. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, page 115, New York, NY, USA, 2004. ACM Press.
