Interpreting and Unifying Outlier Scores

Interpreting and Unifying Outlier Scores Hans-Peter Kriegel Peer Kröger Erich Schubert Arthur Zimek Institut für Informatik, Ludwig-Maximilians Universität München http://www.dbs.ifi.lmu.de {kriegel,kroegerp,schube,zimek}@dbs.ifi.lmu.de Abstract outlier score is not easily interpretable. The scores pro- Outlier scores provided by different outlier models differ vided by varying methods differ widely in their scale, widely in their meaning, range, and contrast between their range, and their meaning. While some methods different outlier models and, hence, are not easily provide probability estimates as outlier scores (e.g. [31]), comparable or interpretable. We propose a unification for many methods, the scaling of occurring values of the of outlier scores provided by various outlier models and outlier score even differs within the same method from a translation of the arbitrary “outlier factors” to values data set to data set, i.e., outlier score x in one data set in the range [0, 1]interpretableasvaluesdescribingthe means, we have an outlier, in another data set it is not probability of a data object of being an outlier. As extraordinary at all. In many cases, even within one an application, we show that this unification facilitates data set, the identical outlier score x for two different enhanced ensembles for outlier detection. database objects can denote substantially different de- grees of outlierness, depending on different local data 1 Introduction distributions. Obviously, this makes the interpretation andcomparisonofdifferentoutlierdetectionmodelsa An outlier could be generally defined as being “an ob- very difficult task. Here, we propose scaling methods for servation (or subset of observations) which appears to be a range of different outlier models including a normal- inconsistent with the remainder of that set of data” [6]. ization to become independent from the specific data How and when an observation qualifies as being “incon- distribution in a given data set as well as a statisti- sistent” is, however, not as generally defined and differs cally sound motivation for a mapping the scores into from application to application as well as from algo- the range of [0, 1], readily interpretable as the proba- rithm to algorithm. From a systematic point of view, bility of a given database object for being an outlier. approaches can be roughly classified, e.g., as global ver- The unified scaling and better comparability of differ- sus local outlier models. This distinction refers to the ent methods could also facilitate a combined use to get scope of a database being considered when a method de- the best of different worlds, e.g. by means of setting up cides on the “outlierness” of a given object. Or we can an ensemble of outlier detection methods. distinguish supervised versus unsupervised approaches. In the remainder, we review existing outlier de- In this paper, we focus on unsupervised outlier detec- tection methods in Section 2 and follow their devel- tion. One can also discern between labeling and scor- opment from the original statistical intuition to mod- ing outlier detection methods. The former are leading ern database applications losing the probabilistic inter- to a binary decision of whether or not a given object pretability of the provided scores step-by-step. In Sec- is an outlier whereas the latter are rather assigning a tion 3, we introduce types of transformations for se- degree of outlierness to each object which is closer to lected prototypes of outlier models to a comparable, the original statistical intuition behind judging on the normalized value readily interpretable as a probability outlierness of observations. If a statistical test quali- value. As an application scenario, we discuss ensem- fies a certain threshold, a hard decision (or label) for a bles for outlier detection based on outlier probability given observation as being or not being an outlier could estimates in Section 4. In Section 5, we show how the be derived. As we will discuss in this study, however, proposed unification of outlier scores eases their inter- the data mining approaches to outlier detection have pretability as well as the comparison and evaluation of moved far from the original statistical intuition during different outlier models on a range of different data sets. the last decade. Eventually, any outlier score provided We will also demonstrate the gained improvement for by an outlier model should help the user to decide on the outlier ensembles. Section 6 concludes the paper. actual outlierness. For most approaches, however, the to appear at: 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ, 2011 2 Survey on Outlier Models the range [0, 1]. The model is based on statistical rea- 2.1 Statistical Intuition of Outlierness We are soning but simplifies the approach to outlier detection primarily interested in the nature and meaning of out- considerably, motivated by the need for scalable meth- lier scores provided by unsupervised scoring methods. ods for huge data sets. This, in turn, inspired many new Scores are closer than labels to discussing probabilities outlier detection methods within the database and data whether or not a single data point is generated by a mining community over the last decade. For most of suspicious mechanism. Statistical approaches to outlier these approaches, however, the connection to a statisti- detection are based on presumed distributions of ob- cal reasoning is not obvious any more. Scoring variants jects relating to statistical processes or generating mech- of the DB-outlier notion are proposed in [3,8,29,40,43] anisms. The classical textbook of Barnett and Lewis [6] basically using the distances to the k-nearest neighbors discusses numerous tests for different distributions de- (kNNs) or aggregates thereof. Yet none of these vari- pendingontheexpectednumberandlocationofout- ants can as easily translate to the original statistical liers. A commonly used rule of thumb is that points definition of an outlier as the original DB-outlier no- deviating more than three times the standard deviation tion. Scorings reflect distances and can grow arbitrar- from the mean of a normal distribution are considered ily with no upper bound. The focus of these papers outliers [27]. Problems of these classical approaches are is not on reflecting the meaning of the presented score obviously the required assumption of a specific distri- but on providing more efficient algorithms to compute bution in order to apply a specific test. There are tests the score, often combined with a top-n-approach, inter- forunivariateaswellasmultivariatedatadistributions ested in ranking outlier scores and reporting the top-n but all tests assume a single, known data distribution outliers rather than interpreting given values. Density- to determine an outlier. A classical approach is to fit a based approaches introduce the notion of local outliers Gaussian distribution to a data set, or, equivalently, to by considering ratios between the local density around use the Mahalanobis distance as a measure of outlier- an object and the local density around its neighboring ness. Sometimes, the data are assumed to consist of k objects. The basic local outlier factor (LOF) assigned Gaussian distributions and the means and standard de- to each object of the database D denotes a degree of viations are computed data driven. However, mean and outlierness [9]. The LOF compares the density of each standard deviation are rather sensitive to outliers and object o ∈Dwith the density of its kNNs. A LOF the potential outliers are still considered for the compu- value of approximately 1 indicates that the correspond- tation step. There are proposals of more robust estima- ing object is located within a region of homogeneous tions of these parameters, e.g. [19,45], but problems still density. The higher the LOF value of an object o is, remain as the proposed concepts require a distance func- the higher is the difference between the density around tion to assess the closeness of points before the adapted o and the density around its kNNs, i.e., the more dis- distance measure is available. Related discussions con- tinctly is o considered an outlier. Several extensions of sider the robust computation of PCA [14, 30]. Variants the basic LOF model and algorithms for efficient (top- of the statistical modeling of outliers have been pro- n) computation have been proposed [22, 23, 49, 50] that posed in computer graphics (“depth based”) [25,46,51] basically use different notions of neighborhoods. The as well as databases (“deviation based”) [4, 48]. Local Outlier Integral (LOCI) bases on the concept of a multi-granularity deviation factor and ε-neighborhoods 2.2 Prototypes and Variants The distance-based instead of kNNs [39]. The resolution-based outlier fac- notion of outliers (DB-outlier) may be the origin of de- tor (ROF) [16] is a mix of the local and the global out- veloping outlier detection algorithms in the context of lier paradigm based on the idea of a change of resolu- databases but is in turn still closely related to the statis- tion, i.e., the number of objects representing the neigh- tical intuition of outliers and unifies distribution-based borhood. As opposed to the methods reflecting kNN approaches under certain assumptions [27, 28]. The distances, in truly local methods the resulting outlier model relies on two input parameters, D and p.Ina score is adaptive to fluctuations in the local density and, database D,anobjectx ∈Dis an outlier if at least a thus, comparable over a data set with varying densities. fraction p of all data objects in D has a distance above The central contribution of LOF and related methods is D from x. By definition, the DB-outlier model is a la- to provide a normalization of outlier scores for a given beling approach. Adjusting the threshold can reflect the data set.

Interpreting and Unifying Outlier Scores

A Review on Outlier/Anomaly Detection in Time Series Data

Outlier Identification.Pdf

A Machine Learning Approach to Outlier Detection and Imputation of Missing Data1

Effects of Outliers with an Outlier

Outlier Detection in Graphs: a Study on the Impact of Multiple Graph Models

How Significant Is a Boxplot Outlier?

Lecture 14 Testing for Kurtosis

Comparison of Data Mining Methods on Different Applications: Clustering and Classiﬁcation Methods

A SYSTEMATIC REVIEW of OUTLIERS DETECTION TECHNIQUES in MEDICAL DATA Preliminary Study

Recommended Methods for Outlier Detection and Calculations Of

Identify the Outlier in the Data Set. Determine How the Outlier Affects the Mean, Median, and Mode of the Data

Topics in Biostatistics: Part II