Outlier Detection Techniques
Total Page:16
File Type:pdf, Size:1020Kb
LUDWIG- MAXIMILIANS- DATABASE UNIVERSITÄT INSTITUTE FOR SYSTEMS MÜNCHEN INFORMATICS GROUP The 2010 SIAM International Conference on Data Mining Outlier Detection Techniques Hans-Peter Kriegel, Peer Kröger, Arthur Zimek Ludwig-Maximilians-Universität München Munich, Germany http://www.dbs.ifi.lmu.de {kriegel,kroegerp,zimek}@dbs.ifi.lmu.de Tutorial Notes: SIAM SDM 2010, Columbus, Ohio DATABASE General Issues SYSTEMS GROUP 1. Please feel free to ask questions at any time during the presentation 2. Aim o f the tu tor ia l: ge t the big p ic ture – NOT in terms of a long list of methods and algorithms – BUT in terms of the basic algorithmic approaches – Sample algorithms for these basic approaches will be sketched • The selection of the presented algorithms is somewhat arbitrary • Please don’t mind if yygour favorite algorithm is missin g • Anyway you should be able to classify any other algorithm not covered here by means of which of the basic approaches is implemented 3. The revised version of tutorial notes will soon be available on our websites Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 2 DATABASE Introduction SYSTEMS GROUP What is an outlier? Definition of Hawkins [Hawkins 1980]: “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” Statistics-based intuition – Normal data objects follow a “generating mechanism”, e.g. some given statistical process – Abnormal objects deviate from this generating mechanism Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 3 DATABASE Introduction SYSTEMS GROUP • Example: Hadlum vs. Hadlum (1949) [Barnett 1978] • The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military serv ice. • Average human gestation period is 280 days (40 weeks). • Statistically, 349 days is an outlier. Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 4 DATABASE Introduction SYSTEMS GROUP • Example: Hadlum vs. Hadlum (1949) [Barnett 1978] − blue: statistical basis (13634 observations of gestation periods) − green: assumed underlying Gaussian process − Very low probability for the birth of Mrs. Hadlums child for being generated by this process − redtifMd: assumption of Mr. HdlHadlum (another Gaussian process responsible for the observed birth, where the ggpestation period responsible) − Under this assumption the gestation period has an average duration and highest -possible probability Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 5 DATABASE Introduction SYSTEMS GROUP • Sample applications of outlier detection – Fraud detection • Purchasing behavior of a credit card owner usually changes when the card is stolen • Abnormal buying patterns can characterize credit card abuse – Medicine • Unusual symptoms or test results may indicate potential health problems of a patient • Whether a particular test result is abnormal may depend on other characteristics of the patients (e.g. gender, age, …) – Public health • The occurrence of a particular disease, e.g. tetanus, scattered across various hospitals of a city indicate problems with the corresponding vaccination program in that city • Whether an occurrence is abnormal depends on different aspects like frequency, spatial correlation, etc. Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 6 DATABASE Introduction SYSTEMS GROUP • Sample applications of outlier detection (cont.) – Sports statistics • In many sports, various parameters are recorded for players in order to evaluate the players’ performances • Outstanding (in a positive as well as a negative sense) players may be identified as havinggp abnormal parameter values • Sometimes, players show abnormal values only on a subset or a special combination of the recorded parameters – Detecting measurement errors • Data derived from sensors (e.g. in a given scientific experiment) may contain measurement errors • Abnormal values could provide an indication of a measurement error • Removing such errors can be important in other data mining and data analysis tasks • “One person‘s noise could be another person‘s signal.” –… Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 7 DATABASE Introduction SYSTEMS GROUP • Discussion of the basic intuition based on Hawkins – Data is usually multivariate, i.e., multi-dimensional => basic model is univariate, i. e., 1-dimensional – There is usually more than one generating mechanism/statistical process underlying the data => bibasic mo dldel assumes on ly one “norma l” genera ting mec hihanism – Anomalies may represent a different class (generating mechanism) of objects, so there may be a large class of similar objects that are the outliers => basic model assumes that outliers are rare observations • Conseqqpuence: a lot of models and appproaches have evolved in the past years in order to exceed these assumptions and it is not easy to keep track with this evolution. • New mo dlftidels often invo ltillve typical, new, thhthough usua lly hidden assumptions and restrictions. Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 8 DATABASE Introduction SYSTEMS GROUP • General application scenarios – Supervised scenario • In some applications , training data with normal and abnormal data objects are provided • There may be multiple normal and/or abnormal classes • Often, the classification problem is highly unbalanced – Semi-supervised Scenario • In some applications, only training data for the normal class(es) (or only the abnormal class(es)) are provided – Unsupervised Scenario • In most applications there are no training data available • In this tutorial, we focus on the unsupervised scenario Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 9 DATABASE Introduction SYSTEMS GROUP • Are outliers just a side product of some clustering algorithms? – Many clustering algorithms do not assign all points to clusters but account for noise objects – Look for outliers by applying one of those algorithms and retrieve the noise set – Problem: • Clustering algorithms are optimized to find clusters rather than outliers • Accuracy of outlier detection depends on how good the clustering algorithm captures the structure of clusters • A set o f many a bnorma l da ta o bjec ts tha t are s im ilar to eac h o ther wou ld be recognized as a cluster rather than as noise/outliers Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 10 DATABASE Introduction SYSTEMS GROUP • We will focus on three different classification approaches – Global versus local outlier detection Considers the set of reference objects relative to which each point’ s “outlierness” is judged – Labeling versus scoring outliers Considers the output of an algorithm – Modeling properties Considers the concepts based on which “outlierness” is modeled NOTE: we f ocus on mod dlels and meth thdfods for E uclid ean d dtata btbut many of those can be also used for other data types (because they only require a distance measure) Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 11 DATABASE Introduction SYSTEMS GROUP • Global versus local approaches – Considers the resolution of the reference set w.r.t. which the “outlierness” of a particular data object is determined – Global approaches • The reference set contains all other data objects • Basic assumption: there is only one normal mechanism • Basic problem: other outliers are also in the reference set and may falsify the results – Local approaches • The reference contains a (small) subset of data objects • No assumption on the number of normal mechanisms • Basic problem: how to choose a proper reference set – NOTE: Some approaches are somewhat in between • The resolution of the reference set is varied e.g. from only a single object (local) to the entire database (global) automatically or by a user-defined input parameter Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 12 DATABASE Introduction SYSTEMS GROUP • Labeling versus scoring – Considers the output of an outlier detection algorithm – Labeling approaches • Binary output • Data objects are labeled either as normal or outlier – SiScoring approac hes • Continuous output • For each object an outlier score is computed (e.g. the probability for bibeing an ou tli)tlier) • Data objects can be sorted according to their scores – Notes • Many scoring approaches focus on determining the top-n outliers (parameter n is usually given by the user) • Scoring approaches can usually also produce binary output if necessary (e. g. by defining a suitable threshold on the scoring values) Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 13 DATABASE Introduction SYSTEMS GROUP • Approaches classified by the properties of the underlying modeling approach – Model-based Approaches • Rational – Apply a model to represent normal data points – Outliers are points that do not fit to that model • Sample approaches – Probabilistic tests based on statistical models – Depth-based appr oach es – Deviation-based approaches – Some subspace outlier detection approaches Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010) 14 DATABASE Introduction SYSTEMS GROUP – Proximity-based Approaches • Rational – Examine the spatial proximityyj of each object in the data space – If the proximity of an object considerably deviates from the proximity of other objects it is considered an outlier • Sample approaches – Dis tance-bdbased approac hes – Density-based approaches – Some subspace outlier detection approaches – Angle-based