CS570 Anomaly Detection

Cengiz Günay

Dept. Math & CS, Emory University

Fall 2013

Some slides courtesy of Li Xiong; Han, Kamber, Pei (2012); and Tan, Steinbach, Kumar (2006)

Today

- Midterm next week (Tuesday): in-class or take-home?
- Guest speaker (Thursday): Olgert Denas, "Feature extraction from deep models"
- Today's menu: anomaly detection, in the context of clustering and otherwise

Anomaly Detection

- Anomaly: a pattern in the data that does not conform to the expected behavior
- Also referred to as outliers, exceptions, peculiarities, surprises

Types of Anomalies

- Point anomalies
- Contextual anomalies
- Collective anomalies

Point Anomalies

- An individual data instance is anomalous with respect to the rest of the data

[Figure: 2-D scatter plot with normal regions N1 and N2; points o1 and o2 and the small cluster O3 are point anomalies]

Contextual Anomalies

- An individual data instance is anomalous within a context
- Requires a notion of context
- Also referred to as conditional anomalies*

[Figure: time series where the same value is an anomaly in one context but normal in another]

* Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka, "Conditional Anomaly Detection," IEEE Transactions on Knowledge and Data Engineering, 2006.

Collective Anomalies

- A collection of related data instances is anomalous
- Requires a relationship among data instances: sequential data, spatial data, graph data
- The individual instances within a collective anomaly are not anomalous by themselves

[Figure: time series with an anomalous subsequence highlighted]

Anomaly Detection

- Anomalies often have significant impact:
  - Cyber intrusions
  - Credit card fraud
  - Health condition anomalies
  - Crust deformation anomalies

Black Swan Theory

- The impact of rare events is huge and highly underrated
- A black swan event:
  - The event is a surprise
  - The event has a major impact
  - After its first recording, the event is rationalized by hindsight
- Almost all scientific discoveries and historical events are black swan events

Anomaly Detection

- Anomaly detection
- Related problems:
  - Rare class mining
  - Chance discovery
  - Novelty detection
  - Exception mining
  - Noise removal
  - Black swan

"Mining needle in a haystack. So much hay and so little time"

Intrusion Detection

- Intrusion detection: the process of monitoring the events occurring in a computer system or network for intrusions
- Intrusions are attempts to bypass the security mechanisms of a computer or network
- Approaches:
  - Traditional signature-based intrusion detection systems are based on signatures of known attacks
  - Anomaly detection

Fraud Detection

- Fraud detection: detection of criminal activities occurring in commercial organizations
- Types of fraud:
  - Credit card fraud
  - Insurance claim fraud
  - Mobile / cell phone fraud
  - Insider trading
- Challenges:
  - Fast and accurate real-time detection
  - Misclassification cost is very high

Image Processing

- Detecting outliers in an image monitored over time
- Detecting anomalous regions within an image
- Used in:
  - Mammography image analysis
  - Video surveillance
  - Satellite image analysis
- Key challenges:
  - Detecting collective anomalies
  - Data sets are very large

Anomaly Detection

- Supervised anomaly detection
  - Labels available for both normal data and anomalies
  - Classification
- Semi-supervised anomaly detection
  - Labels available only for normal data
  - Classification
- Unsupervised anomaly detection
  - No labels assumed
  - Based on the assumption that anomalies are very rare compared to normal data

Output of Anomaly Detection

- Label
  - Each test instance is given a normal or anomaly label
- Score
  - Each test instance is assigned an anomaly score
  - Allows the output to be ranked
  - Requires an additional threshold parameter (see the sketch below)
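A minimal sketch of the two output types, with hypothetical scores and an arbitrary threshold of 0.5:

```python
import numpy as np

# Hypothetical anomaly scores for five test instances (higher = more anomalous)
scores = np.array([0.1, 0.3, 0.9, 0.2, 0.7])

# Score output: instances can be ranked from most to least anomalous
ranking = np.argsort(-scores)

# Label output: converting scores to labels needs the extra threshold parameter
threshold = 0.5
labels = np.where(scores > threshold, "anomaly", "normal")
print(ranking, labels)   # [2 4 1 3 0] ['normal' 'normal' 'anomaly' 'normal' 'anomaly']
```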

Classification Based Techniques

- Main idea: build a classification model for normal (and, where labeled, anomalous/rare) events from labeled training data, and use it to classify each new, unseen event
- Categories:
  - Supervised classification techniques
    - Require knowledge of both normal and anomaly classes
    - Build a classifier to distinguish between normal and known anomalies
  - Semi-supervised classification techniques
    - Require knowledge of the normal class only!
    - Use a modified classification model to learn the normal behavior, then detect any deviations from it as anomalous

Advantages and disadvantages? (1) Supervised (2) Semi-supervised

Classification Based Techniques

- Advantages:
  - Supervised classification techniques
    - Models that can be easily understood
    - High accuracy in detecting many kinds of known anomalies
  - Semi-supervised classification techniques
    - Models that can be easily understood
    - Normal behavior can be accurately learned
- Drawbacks:
  - Supervised classification techniques
    - Require labels from both the normal and anomaly classes
    - Cannot detect unknown and emerging anomalies
  - Semi-supervised classification techniques
    - Require labels from the normal class
    - Possibly high false alarm rate: previously unseen (yet legitimate) data records may be recognized as anomalies

Supervised Anomaly Detection

- Challenges:
  - Classification models must be able to handle skewed (imbalanced) class distributions
  - Misclassification cost for the rare class tends to be high

Supervised Classification Techniques

- Blackbox approaches:
  - Manipulating data records (oversampling / undersampling / generating artificial examples)
- Whitebox approaches:
  - Adapt classification models
  - Design new classification models
  - Cost-sensitive classification techniques
  - Ensemble-based algorithms (SMOTEBoost, RareBoost)

Manipulating Data Records

- Over-sampling the rare class [Ling98]
  - Duplicate the rare events until the data set contains as many examples as the majority class, balancing the classes
- Down-sizing (undersampling) the majority class [Kubat97]
  - Sample the data records from the majority class: randomly, near-miss examples, or examples far from minority class examples (i.e., far from the decision boundaries)
- Generating artificial anomalies
  - SMOTE (Synthetic Minority Over-sampling TEchnique) [Chawla02]: new rare class examples are generated inside the regions of existing rare class examples (see the sketch below)
  - Artificial anomalies are generated around the edges of the sparsely populated data regions [Fan01]
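A minimal sketch of random over-sampling and SMOTE-style interpolation, using NumPy only; the function names, the choice of k, and the toy data are illustrative assumptions, not details from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X_rare, n_majority):
    """Duplicate rare-class records (with replacement) until the rare
    class has as many examples as the majority class."""
    idx = rng.integers(0, len(X_rare), size=n_majority)
    return X_rare[idx]

def smote(X_rare, n_new, k=5):
    """SMOTE-style interpolation: each synthetic example lies on the
    segment between a rare example and one of its k nearest rare-class
    neighbors, so new points stay inside existing rare-class regions."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(0, len(X_rare))
        d = np.linalg.norm(X_rare - X_rare[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]     # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                     # random position on the segment
        synthetic.append(X_rare[i] + gap * (X_rare[j] - X_rare[i]))
    return np.array(synthetic)

X_rare = rng.normal(0, 1, size=(20, 2))        # 20 rare-class records
print(smote(X_rare, n_new=80).shape)           # (80, 2): now 100 rare examples
```

Interpolating between rare-class neighbors, rather than duplicating records, gives the classifier a broader rare-class region to learn.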

Adapting Existing Rule Based Classifiers

- Case-specific feature weighting [Cardey97]
  - Increases the weight for rare class examples
  - Weights are dynamically generated based on the path taken by that example
- Case-specific rule weighting [Grzymala00]
  - LERS (Learning from Examples based on Rough Sets) increases the rule strength for all rules describing the rare class

Rare Class Detection Evaluation

- True positive rate, true negative rate, false positive rate, false negative rate
- Precision / recall
- Implications due to imbalanced class distribution (see the worked example below)
- Base rate fallacy
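A small worked example of these rates on a hypothetical skewed data set (10 anomalies among 10,000 records) shows why the imbalance matters:

```python
def rates(tp, fp, tn, fn):
    """Confusion-matrix rates used to evaluate rare-class detection."""
    return {
        "true positive rate (recall)": tp / (tp + fn),
        "true negative rate": tn / (tn + fp),
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (fn + tp),
        "precision": tp / (tp + fp),
    }

# Hypothetical detector output: it finds 8 of the 10 anomalies but also
# raises 90 false alarms among the 9,990 normal records
print(rates(tp=8, fp=90, tn=9900, fn=2))
# Recall is 0.80 and the false positive rate is only ~0.009, yet
# precision is 8/98 ~ 0.08: most alarms are false, because the normal
# class vastly outnumbers the rare class.
```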

Base Rate Fallacy (Axelsson, 1999)

- Even though the test is 99% certain, your chance of having the disease is only about 1/100, because the population of healthy people is much larger than that of sick people
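A worked Bayes computation makes this concrete; the prevalence of 1 in 10,000 below is an assumption chosen so the numbers reproduce the slide's 1/100 figure:

$$P(\text{sick} \mid +) = \frac{P(+ \mid \text{sick})\, P(\text{sick})}{P(+ \mid \text{sick})\, P(\text{sick}) + P(+ \mid \text{healthy})\, P(\text{healthy})} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx 0.0098 \approx \frac{1}{100}$$

The same arithmetic applies to intrusion detection: when genuine intrusions are rare, even a detector with a very low false positive rate produces mostly false alarms.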

Semi-supervised Classification Techniques

- Use a modified classification model to learn the normal behavior, then detect any deviations from normal behavior as anomalous
- Recent approaches:
  - Neural network based approaches
  - Support vector machine (SVM) based approaches
  - Markov model based approaches
  - Rule-based approaches

Using Support Vector Machines

- One-class classification problem
- Computes a spherically shaped decision boundary with minimal volume around a training set of objects (see the sketch below)
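A minimal sketch using scikit-learn's OneClassSVM, one implementation of this one-class idea; the training data, nu, and kernel settings are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))     # labeled-normal training data

# nu upper-bounds the fraction of training points allowed outside the
# boundary; the RBF kernel yields a tight region around the data, in the
# spirit of the minimal-volume sphere described above
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_test = np.array([[0.1, -0.2], [4.0, 4.0]])
print(clf.predict(X_test))                    # [ 1 -1]: +1 normal, -1 anomaly
```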

Anomaly Detection

- Supervised anomaly detection
- Semi-supervised anomaly detection
- Unsupervised anomaly detection:
  - Graphical based
  - Statistical based
  - Nearest neighbor based techniques

Graphical Approaches

- Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
- Limitations:
  - Time consuming
  - Subjective

Statistical Approaches

- Assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on:
  - Data distribution
  - Parameters of the distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)

Grubbs' Test

- Detects outliers in univariate data
- Assumes the data come from a normal distribution
- Detects one outlier at a time: remove the outlier and repeat
- H0: there is no outlier in the data
- HA: there is at least one outlier
- Grubbs' test statistic: $G = \frac{\max_i |X_i - \bar{X}|}{s}$
- Reject H0 if $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/N,\, N-2}}{N - 2 + t^2_{\alpha/N,\, N-2}}}$ (see the sketch below)
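A sketch of one round of the test, using SciPy for the Student-t quantile; it uses the two-sided critical value with α/(2N) (the rule above, with α/N, is the one-sided form), and the data are illustrative:

```python
import numpy as np
from scipy import stats

def grubbs_round(x, alpha=0.05):
    """One round of Grubbs' test: return the index of the most extreme
    value if it is a significant outlier, else None. Detects one outlier
    at a time; remove it and call again on the remaining data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = np.abs(x - x.mean())
    g = dev.max() / x.std(ddof=1)                       # test statistic G
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2   # critical t, squared
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return int(dev.argmax()) if g > g_crit else None

print(grubbs_round([5.1, 4.9, 5.0, 5.2, 9.8]))          # flags index 4
```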

Statistical-based: Likelihood Approach

- Assume the data set D contains samples from a mixture of two probability distributions:
  - M (the majority distribution)
  - A (the anomalous distribution)
- General approach:
  - Initially, assume all the data points belong to M
  - Let $L_t(D)$ be the log likelihood of D at time t
  - For each point $x_t$ that belongs to M, move it to A
  - Let $L_{t+1}(D)$ be the new log likelihood
  - Compute the difference $\Delta = L_t(D) - L_{t+1}(D)$
  - If $\Delta > c$ (some threshold), then $x_t$ is declared an anomaly and moved permanently from M to A (see the sketch below)
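A sketch under simple assumptions: M is a Gaussian refit to its current members and A is a flat density; lam, p_a, the threshold c, and the data are all illustrative. Since moving a genuine anomaly from M to A raises the likelihood, the sketch flags a point when the gain L_{t+1}(D) − L_t(D) exceeds c:

```python
import numpy as np
from scipy import stats

def log_likelihood(M, A, lam=0.05, p_a=1e-4):
    """Log likelihood of the two-component mixture: majority points M
    under a Gaussian refit to M, anomalies A under a flat density p_a."""
    M = np.asarray(M)
    mu, sigma = M.mean(), M.std() + 1e-9
    ll = len(M) * np.log(1 - lam) + stats.norm.logpdf(M, mu, sigma).sum()
    return ll + len(A) * (np.log(lam) + np.log(p_a))

def likelihood_anomalies(data, c=3.0):
    M, A = list(data), []
    moved = True
    while moved:                       # repeat until no move improves L enough
        moved = False
        for i, x in enumerate(M):
            trial_M = M[:i] + M[i + 1:]
            gain = log_likelihood(trial_M, A + [x]) - log_likelihood(M, A)
            if gain > c:               # declare x an anomaly, move it to A
                M, A = trial_M, A + [x]
                moved = True
                break
    return A

print(likelihood_anomalies([2.0, 2.1, 1.9, 2.2, 2.0, 9.5]))   # [9.5]
```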

Limitations of Statistical Approaches

- Most of the tests are for a single attribute
- In many cases, the data distribution may not be known
- For high-dimensional data, it may be difficult to estimate the true distribution

Methods for High Dimensional Data

- Mahalanobis distance incorporates interaction among dimensions via the covariance matrix S (see the sketch below): $D(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$

- Other methods: angle method
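A small NumPy sketch of the Mahalanobis distance; the covariance and the two test points are illustrative. Note that the point closer in Euclidean distance gets the larger Mahalanobis distance because it lies off the correlated direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: the cloud is stretched along the x = y direction
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=1000)

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))      # inverse covariance S^-1

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ S_inv @ d)                   # sqrt((x-mu)^T S^-1 (x-mu))

a, b = np.array([3.0, 2.0]), np.array([2.0, -2.0])
print(np.linalg.norm(a), mahalanobis(a))            # ~3.6 Euclidean, ~1.7 Mahalanobis
print(np.linalg.norm(b), mahalanobis(b))            # ~2.8 Euclidean, ~4.2 Mahalanobis
```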

Distance-based Approaches

- Data is represented as a vector of features
- Three major approaches:
  - Nearest-neighbor based
  - Density based
  - Clustering based

Nearest Neighbor Based Techniques

- Key assumption: normal points have close neighbors while anomalies are located far from other points
- General two-step approach:
  1. Compute the neighborhood for each data record
  2. Analyze the neighborhood to determine whether the data record is an anomaly or not
- Categories:
  - Distance based methods: anomalies are the data points most distant from other points
  - Density based methods: anomalies are data points in low density regions

Nearest Neighbor Based Techniques

- Distance based approaches
  - A point O in a data set is a DB(p, d) outlier if at least a fraction p of the points in the data set lie at a distance greater than d from O* (see the sketch after this slide)
- Density based approaches
  - Compute local densities of particular regions and declare instances in low density regions as potential anomalies
  - Approaches: Local Outlier Factor (LOF), Connectivity Outlier Factor (COF), Multi-Granularity Deviation Factor (MDEF)

* Knorr, Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets," VLDB 1998.
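A naive O(n²) sketch of the DB(p, d) definition; the thresholds and toy data are illustrative (the Knorr & Ng paper gives more scalable algorithms):

```python
import numpy as np

def db_outliers(X, p=0.95, d=2.0):
    """Flag DB(p, d) outliers: points for which at least a fraction p of
    the data set lies farther away than distance d. Naive O(n^2) scan."""
    X = np.asarray(X)
    n = len(X)
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        flags[i] = np.sum(dist > d) / n >= p
    return flags

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),   # dense normal cloud
               [[5.0, 5.0]]])                      # one distant point
print(np.nonzero(db_outliers(X))[0])               # [50]: the distant point
```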

Local Outlier Factor (LOF)*

- For each data point q, compute the distance to its k-th nearest neighbor (the k-distance)
- Compute the reachability distance of q with respect to each data example p as: $\text{reach-dist}(q, p) = \max\{\text{k-distance}(p),\, d(q, p)\}$
- Compute the local reachability density (lrd) of q as the inverse of the average reachability distance over the MinPts nearest neighbors of q: $\text{lrd}(q) = \frac{MinPts}{\sum_p \text{reach-dist}_{MinPts}(q, p)}$
- Compute LOF(q) as the ratio of the average local reachability density of q's nearest neighbors to the local reachability density of q: $\text{LOF}(q) = \frac{1}{MinPts} \sum_p \frac{\text{lrd}(p)}{\text{lrd}(q)}$ (see the sketch below)

* Breunig et al., "LOF: Identifying Density-Based Local Outliers," KDD 2000.
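A compact NumPy sketch of the formulas above, with MinPts = k; brute-force distance computation and illustrative data:

```python
import numpy as np

def lof(X, k=3):
    """LOF scores following the formulas above, with MinPts = k."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    knn = np.argsort(D, axis=1)[:, 1:k + 1]               # k nearest neighbors
    k_dist = D[np.arange(n), knn[:, -1]]                  # k-distance of each point
    # reach-dist(q, p) = max{ k-distance(p), d(q, p) } for each neighbor p of q
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    lrd = k / reach.sum(axis=1)            # inverse of average reach-dist
    return lrd[knn].mean(axis=1) / lrd     # avg neighbor lrd / own lrd

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), [[6.0, 6.0]]])
scores = lof(X)
print(scores.argmax(), round(scores.max(), 1))   # point 30 gets LOF >> 1
```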

Advantages of Density based Techniques

- Example: in the NN (distance-based) approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 to be outliers
- Conversely, the NN approach may consider p3 an outlier, but the LOF approach does not

[Figure: two clusters of different density with points p1, p2, p3 and their distances to nearest neighbors marked]

Using Support Vector Machines

- Main idea [Steinwart05]:
  - Normal data records belong to high density data regions
  - Anomalies belong to low density data regions
- Use an unsupervised approach to learn high density and low density data regions
- Use an SVM to classify the data density level

Clustering Based Techniques

 Use SVM to classify data density level Clustering Based Techniques

 Key assumption

 normal data records belong to large and dense clusters, while anomalies belong do not belong to any of the clusters or form very small clusters

 Anomalies detected using clustering based methods can be:

 Data records that do not fit into any cluster (residuals from clustering)

 Small clusters

 Low density clusters or local anomalies (far from other points within the same cluster)
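A minimal sketch of the first flavor (residuals from clustering), using DBSCAN, which labels points that fit no cluster as noise; eps, min_samples, and the toy data are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),   # large dense cluster
               rng.normal(4, 0.3, size=(100, 2)),   # second dense cluster
               [[2.0, 2.0], [8.0, 8.0]]])           # two residual points

# DBSCAN assigns label -1 to points that belong to no cluster; treating
# that residual set as the anomalies is one clustering-based scheme
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.nonzero(labels == -1)[0])                  # [200 201]
```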