Outlier Detection
Total Page:16
File Type:pdf, Size:1020Kb
OUTLIER DETECTION Short Course Session 1 Nedret BILLOR Auburn University Department of Mathematics & Statistics, USA Statistics Conference, Colombia, Aug 8‐12, 2016 OUTLINE Motivation and Introduction Approaches to Outlier Detection Sensitivity of Statistical Methods to Outliers Statistical Methods for Outlier Detection Outliers in Univariate data Outliers in Multivariate Classical and Robust Statistical Distance‐ based Methods PCA based Outlier Detection Outliers in Functional Data MOTIVATION & INTRODUCTION Hadlum vs. Hadlum (1949) [Barnett 1978] Ozone Hole Case I: Hadlum vs. Hadlum (1949) [Barnett 1978] The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service. Average human gestation period is 280 days (40 weeks). Statistically, 349 days is an outlier. Case I: Hadlum vs. Hadlum (1949) [Barnett 1978] − blue: statistical basis (13634 observations of gestation periods) − green: assumed underlying Gaussian process − Very low probability for the birth of Mrs. Hadlums child for being generated by this process − red: assumption of Mr. Hadlum (another Gaussian process responsible for the observed birth, where the gestation period responsible) − Under this assumption the gestation period has an average duration and highest‐possible probability Case II: The Antarctic Ozone Hole The History behind the Ozone Hole • The Earth's ozone layer protects all life from the sun's harmful radiation. Case II: The Antarctic Ozone Hole (cont.) . Human activities (e.g. CFS's in aerosols) have damaged this shield. Less protection from ultraviolet light will, over time, lead to higher skin cancer and cataract rates and crop damage. Case II: The Antarctic Ozone Hole (cont.) Molina and Rowland in 1974 (lab study) and many studies after this, demonstrated the ability of CFC's (Chlorofluorocarbons) to breakdown Ozone in the presence of high frequency UV light . Further studies estimated the ozone layer would be depleted by CFC's by about 7% within 60yrs. Case II: The Antarctic Ozone Hole (cont.) • Shock came in a 1985 field study by Farman, Gardinar and Shanklin. (Nature, May 1985) • British Antarctic Survey showing that ozone levels had dropped to 10% below normal January levels for Antarctica. Case II: The Antarctic Ozone Hole (cont.) • The authors had been somewhat hesitant about publishing this result because Nimbus‐7 satellite data had shown “NO such DROP” during the Antarctic spring! • More comprehensive observations from satellite instruments looking down had shown nothing unusual! Case II: The Antarctic Ozone Hole (cont.) • But NASA soon discovered that the Spring time ''ozone hole'' had been covered up by a computer‐program designed to “discard “ sudden, large drops in ozone concentrations as '‘ERRORS''. • The Nimbus‐7 data was rerun without the filter‐program. Evidence of the Ozone‐hole was seen as far back as 1976. “One person‘s noise could be another person‘s signal!” What is OUTLIER? No universally accepted definition! Hawkins (1980) – An observation (few) that deviates (differs) so much from other observations as to arouse suspicion that it was generated by a different mechanism. Barnett and Lewis (1994) An observation (few) which appears to be inconsistent (different) with the remainder of that set of data. What is OUTLIER? Statistics‐based intuition – Normal data objects follow a “generating mechanism”, e.g. some given statistical process – Abnormal objects deviate from this generating mechanism Applications of outlier detection – Fraud detection . Purchasing behavior of a credit card owner usually changes when the card is stolen . Abnormal buying patterns can characterize credit card abuse Applications of outlier detection (cont.) – Medicine . Unusual symptoms or test results may indicate potential health problems of a patient . Whether a particular test result is abnormal may depend on other characteristics of the patients (e.g. gender, age, …) Applications of outlier detection (cont.) – Intrusion Detection . Attacks on computer systems and computer networks Applications of outlier detection (cont.) – Sports statistics . In many sports, various parameters are recorded for players in order to evaluate the players’ performances . Outstanding (in a positive as well as a negative sense) players may be identified as having abnormal parameter Values . Sometimes, players show abnormal values only on a subset or a special combination of the recorded parameters Applications of outlier detection (cont.) – Ecosystem Disturbance . Hurricanes, floods, heatwaves, earthquakes Applications of outlier detection (cont.) – Detecting measurement errors • Data derived from sensors (e.g. in a given scientific experiment) may contain measurement errors • Abnormal values could provide an indication of a measurement error • Removing such errors can be important in other data mining and data analysis tasks • … What causes OUTLIERS? ‐ Data from Different Sources . Such outliers are often of interest and are the focus of outlier detection in the field of data mining. ‐ Natural variant . Outliers that represent extreme or unlikely variations are often interesting.. (Correct but extreme responses ‐Rare event syndrome) ‐ Data Measurement and Collection Error . Goal is to eliminate such anomalies since they provide no interesting information but only reduce the quality of the data and subsequent data analysis. Difference between Noise and Outlier Outlier Noise Approaches to OUTLIER detection Approaches to OUTLIER detection . Statistical (or model based) approaches . Proximity‐based . Clustering‐based . Classification‐based (One‐class and Semisupervised (i.e. Combining classification‐based and clustering‐based methods) . Reference: Data Mining: Concepts and Techniques, Han et al. 2012 Statistical (or model based) approaches . Assume that the regular data follow some statistical model. Outliers : The data not following the model Example: First use Gaussian distribution to model the regular data For each object y in region R, estimate gD(y), the probability of y fits the Gaussian distribution If gD(y) is very low, y is unlikely generated by the Gaussian model, thus an outlier Effectiveness of statistical methods: highly depends on whether the assumption of statistical model holds in the real data. There are rich alternatives to use various statistical models E.g., parametric vs. non‐parametric Statistical Approaches Parametric method . Assumes that the normal data is generated by a parametric distribution with parameter θ. The probability density function of the parametric distribution f(x, θ) gives the probability that object x is generated by the distribution . The smaller this value, the more likely x is an outlier Non‐parametric method . Not assume an a‐priori statistical model and determine the model from the input data . Not completely parameter free but consider the number and nature of the parameters are flexible and not fixed in advance . Examples: histogram and kernel density estimation Proximity‐based Outlier : if the proximity of the obs. significantly deviates from the proximity of most of the other obs. in the same data set. The effectiveness: highly relies on the proximity measure. In some applications, proximity or distance measures cannot be obtained easily. Two major types of proximity‐based outlier detection . Distance based: Based on distances (outlier if its neighborhood does not have enough other points) . Density‐based: if its density is relatively much lower than that of its neighbors. Clustering‐Based Methods • Normal data belong to large and dense clusters, • outliers to small or sparse clusters, or do not belong to any clusters Many clustering methods, therefore many clustering‐based outlier detection methods ! Clustering is expensive: straightforward adaption of a clustering method for outlier detection can be costly and does not scale up well for large data sets Classification‐Based Methods Idea: Train a classification model that can distinguish “normal” data from outliers . A brute‐force approach: Consider a training set that contains samples labeled as “normal” and others labeled as “outlier” . But, the training set is typically heavily biased: # of “normal” samples likely far exceeds # of outlier samples . Cannot detect unseen anomaly Two types: One‐class and Semi‐supervised Classification‐Based Methods (cont.) One‐class model A classifier is built to describe only the normal class. Learn the decision boundary of the normal class using classification methods such as SVM. Any samples that do not belong to the normal class (not within the decision boundary) are declared as outliers. Adv: can detect new outliers that may not appear close to any outlier objects in the training set Classification‐Based Methods (cont.) Semi‐supervised learning Combining classification‐based and clustering‐ based methods Method • Using a clustering‐based approach, find a large cluster, C, and a small cluster, C1 • Since some obs. in C carry the label “normal”, treat all obs. in C as normal • Use the one‐class model of this cluster to identify normal obs. in outlier detection • Since some obs. in cluster C1 carry the label “outlier”, declare all obs. in C1 as outliers • Any obs. that does not fall into the model for C (such as a) is considered an outlier as well. Sensitivity of Statistical Methods to Outliers Sensitivity of Statistical Methods to Outliers . Data often (always) contain outliers. Statistical methods are severely affected by outliers! Sensitivity of Statistical Methods to Outliers on the