OUTLIER DETECTION

Short Course Session 1

Nedret BILLOR, Department of Mathematics & Statistics, Auburn University, USA

Statistics Conference, Colombia, Aug 8‐12, 2016

OUTLINE

 Motivation and Introduction
 Approaches to Outlier Detection
 Sensitivity of Statistical Methods to Outliers
 Statistical Methods for Outlier Detection
 Outliers in Univariate Data
 Outliers in Multivariate Data
 Classical and Robust Statistical Distance‐based Methods
 PCA‐based Outlier Detection
 Outliers in Functional Data

MOTIVATION & INTRODUCTION

Case I: Hadlum vs. Hadlum (1949) [Barnett 1978]
Case II: The Antarctic Ozone Hole

Case I: Hadlum vs. Hadlum (1949) [Barnett 1978]

The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service. The average human gestation period is 280 days (40 weeks). Statistically, 349 days is an outlier.

Case I: Hadlum vs. Hadlum (1949) [Barnett 1978]

− Blue: statistical basis (13,634 observations of gestation periods)
− Green: assumed underlying Gaussian process. Under this process, the probability of the birth of Mrs. Hadlum's child is very low.
− Red: Mr. Hadlum's assumption (another Gaussian process is responsible for the observed birth). Under this assumption, the observed gestation period has the average duration and the highest possible probability.

Case II: The Antarctic Ozone Hole

The History behind the Ozone Hole

• The Earth's ozone layer protects all life from the sun's harmful radiation.

Case II: The Antarctic Ozone Hole (cont.)

• Human activities (e.g. CFC's in aerosols) have damaged this shield.

• Less protection from ultraviolet light will, over time, lead to higher skin cancer and cataract rates and crop damage.

Case II: The Antarctic Ozone Hole (cont.)

Molina and Rowland in 1974 (a lab study), and many studies after this, demonstrated the ability of CFCs (chlorofluorocarbons) to break down ozone in the presence of high‐frequency UV light.

Further studies estimated that the ozone layer would be depleted by CFCs by about 7% within 60 years.

Case II: The Antarctic Ozone Hole (cont.)

• The shock came in a 1985 field study by Farman, Gardiner and Shanklin (Nature, May 1985).

• The British Antarctic Survey data showed that ozone levels had dropped to 10% below normal January levels for Antarctica.

Case II: The Antarctic Ozone Hole (cont.)

• The authors had been somewhat hesitant about publishing this result, because Nimbus‐7 satellite data had shown “NO such DROP” during the Antarctic spring!

• More comprehensive observations from satellite instruments looking down had shown nothing unusual!

Case II: The Antarctic Ozone Hole (cont.)

• But NASA soon discovered that the springtime “ozone hole” had been covered up by a computer program designed to “discard” sudden, large drops in ozone concentrations as “ERRORS”.

• The Nimbus‐7 data were rerun without the filter program. Evidence of the ozone hole was seen as far back as 1976.

“One person's noise could be another person's signal!”

What is an OUTLIER?

No universally accepted definition!

Hawkins (1980) – An observation (few) that deviates (differs) so much from other observations as to arouse suspicion that it was generated by a different mechanism.

Barnett and Lewis (1994) – An observation (few) which appears to be inconsistent (different) with the remainder of that set of data.

What is an OUTLIER?

Statistics‐based intuition

– Normal data objects follow a “generating mechanism”, e.g. some given statistical process

– Abnormal objects deviate from this generating mechanism

Applications of outlier detection

– Fraud detection

• Purchasing behavior of a credit card owner usually changes when the card is stolen
• Abnormal buying patterns can characterize credit card abuse

Applications of outlier detection (cont.)

– Medicine

• Unusual symptoms or test results may indicate potential health problems of a patient

• Whether a particular test result is abnormal may depend on other characteristics of the patient (e.g. gender, age, …)

Applications of outlier detection (cont.)

– Intrusion Detection

• Attacks on computer systems and computer networks

Applications of outlier detection (cont.)

– Sports statistics

• In many sports, various parameters are recorded for players in order to evaluate the players' performances
• Outstanding (in a positive as well as a negative sense) players may be identified as having abnormal parameter values
• Sometimes, players show abnormal values only on a subset or a special combination of the recorded parameters

Applications of outlier detection (cont.)

– Ecosystem Disturbance

• Hurricanes, floods, heatwaves, earthquakes

Applications of outlier detection (cont.)

– Detecting measurement errors

• Data derived from sensors (e.g. in a given scientific experiment) may contain measurement errors
• Abnormal values could provide an indication of a measurement error
• Removing such errors can be important in other data mining and data analysis tasks

What causes OUTLIERS?

‐ Data from different sources
• Such outliers are often of interest and are the focus of outlier detection in the field of data mining.
‐ Natural variation
• Outliers that represent extreme or unlikely variations are often interesting (correct but extreme responses; the rare‐event syndrome).
‐ Data measurement and collection error
• The goal is to eliminate such anomalies, since they provide no interesting information and only reduce the quality of the data and subsequent data analysis.

Difference between Noise and Outlier

(Figure: illustration distinguishing outliers from noise.)

Approaches to OUTLIER detection

• Statistical (or model‐based) approaches
• Proximity‐based
• Clustering‐based
• Classification‐based (one‐class and semi‐supervised, i.e. combining classification‐based and clustering‐based methods)

• Reference: Data Mining: Concepts and Techniques, Han et al., 2012

Statistical (or model‐based) approaches

• Assume that the regular data follow some statistical model.
• Outliers: the data not following the model

Example: first use a Gaussian distribution to model the regular data.

 For each object y in the data set, estimate gD(y), the probability that y fits the Gaussian distribution
 If gD(y) is very low, y is unlikely to have been generated by the Gaussian model, and is thus an outlier (a minimal sketch follows)
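A minimal sketch of this idea in R: fit a normal model and flag points whose estimated density is very low. The simulated data and the 3‐SD density cutoff are illustrative choices, not part of the original example.

```r
# Fit a Gaussian to the data and flag points with very low estimated density.
y <- c(rnorm(100, mean = 280, sd = 10), 349)   # regular values plus one outlier
mu <- mean(y)
sigma <- sd(y)
g <- dnorm(y, mean = mu, sd = sigma)           # estimated g_D(y) for each point
cutoff <- dnorm(mu + 3 * sigma, mu, sigma)     # density at 3 SD from the mean
y[g < cutoff]                                  # nominated outliers
```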

 Effectiveness of statistical methods: depends highly on whether the assumed statistical model holds for the real data.

 There are many alternative statistical models to choose from
 E.g., parametric vs. non‐parametric

Statistical Approaches: Parametric method

• Assumes that the normal data are generated by a parametric distribution with parameter θ.
• The probability density function of the parametric distribution, f(x, θ), gives the probability that object x is generated by the distribution.
• The smaller this value, the more likely x is an outlier.

Non‐parametric method
• Does not assume an a priori statistical model; determines the model from the input data
• Not completely parameter‐free, but the number and nature of the parameters are flexible and not fixed in advance
• Examples: histogram and kernel density estimation

Proximity‐based

Outlier: an observation whose proximity significantly deviates from the proximity of most of the other observations in the same data set.

 The effectiveness relies highly on the proximity measure.
 In some applications, proximity or distance measures cannot be obtained easily.

Two major types of proximity‐based outlier detection

• Distance‐based: based on distances (an observation is an outlier if its neighborhood does not have enough other points); a minimal sketch follows
• Density‐based: an observation is an outlier if its density is relatively much lower than that of its neighbors
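A minimal sketch of the distance‐based idea in R, using the distance to the k‐th nearest neighbor as the outlier score; k = 5 and the 95% quantile cutoff are illustrative choices, not part of the original slides.

```r
# Distance-based outlier score: distance to the k-th nearest neighbor.
set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2), c(6, 6))   # 100 regular points + 1 outlier
D <- as.matrix(dist(X))                             # pairwise Euclidean distances
k <- 5
knn.dist <- apply(D, 1, function(d) sort(d)[k + 1]) # skip the zero self-distance
which(knn.dist > quantile(knn.dist, 0.95))          # the most isolated observations
```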

Clustering‐Based Methods

• Normal data belong to large and dense clusters
• Outliers belong to small or sparse clusters, or do not belong to any cluster

 Many clustering methods, therefore many clustering‐based outlier detection methods!
 Clustering is expensive: straightforward adaptation of a clustering method for outlier detection can be costly and does not scale up well for large data sets

Classification‐Based Methods

Idea: train a classification model that can distinguish “normal” data from outliers.
• A brute‐force approach: consider a training set that contains samples labeled as “normal” and others labeled as “outlier”
• But the training set is typically heavily biased: the number of “normal” samples likely far exceeds the number of outlier samples
• Cannot detect unseen anomalies

Two types: one‐class and semi‐supervised.

Classification‐Based Methods (cont.)

One‐class model: a classifier is built to describe only the normal class.

Learn the decision boundary of the normal class using classification methods such as SVM. Any samples that do not belong to the normal class (not within the decision boundary) are declared as outliers.

Adv: can detect new outliers that may not appear close to any outlier objects in the training set.

Classification‐Based Methods (cont.)

Semi‐supervised learning

Combining classification‐based and clustering‐based methods.

Method:
• Using a clustering‐based approach, find a large cluster C and a small cluster C1
• Since some observations in C carry the label “normal”, treat all observations in C as normal
• Use the one‐class model of this cluster to identify normal observations in outlier detection
• Since some observations in cluster C1 carry the label “outlier”, declare all observations in C1 as outliers
• Any observation that does not fall into the model for C (such as a point a in the figure) is considered an outlier as well

Sensitivity of Statistical Methods to Outliers

• Data often (always) contain outliers.

• Statistical methods are severely affected by outliers!

Sensitivity of Statistical Methods to Outliers

(Figure: effect of outliers on sample statistics, p = 2.)

Sensitivity of Statistical Methods to Outliers on Regression Analysis

(Figure: scatterplots of Y vs. X, (a) with the point (6, 20) and (b) without (6, 20).)

Sensitivity of Statistical Methods to Outliers on Classification

Forest Soil data (n1 = 11, n2 = 23, n3 = 24)

(Figure: 3D scatterplot of sodium (sod) vs. magnesium (mag) vs. potassium (pot), by group.)

Misclassification error rate: G1: 91% (10 obs.); G2: 4% (9 obs.); G3: 75% (18 obs.); overall: 50%.

Statistical Methods for OUTLIER Detection for Univariate & Multivariate Data

Statistical (or model‐based) approaches

• Assume that the regular data follow some statistical model.
• Outliers: the data not following the model

 Effectiveness of statistical methods: depends highly on whether the assumed statistical model holds for the real data.

 There are many alternative statistical models to choose from
 E.g., parametric vs. non‐parametric

NOTATION

X: data (gene expression levels) matrix of size n × p
n: number of patients (rows)
p: number of genes (columns)

$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$

p = 1: univariate data
p > 1: multivariate data
n > p: low‐dimensional data
n < p: high‐dimensional data

Standard Deviation (SD) METHOD

• The simplest classical approach to screening outliers is to use the SD method.

• 2 SD method: mean ± 2 SD
• 3 SD method: mean ± 3 SD,

• where the mean is the sample mean and SD is the sample standard deviation.
• Observations outside these intervals may be considered outliers!

OUTLIERS in Univariate Data (cont.)

Z‐scores method (Grubbs' test, 1969): not robust!

$Z_i = \dfrac{X_i - \bar{X}}{s}, \quad i = 1, 2, \ldots, n$

Robust Z‐scores (Iglewicz and Hoaglin, 1993):

$Z_i = \dfrac{0.675\,(X_i - \operatorname{median}(X))}{\operatorname{MAD}(X)}, \quad i = 1, 2, \ldots, n$

where MAD = median{ |Xi − median(X)| } and E(MAD) = 0.675σ for large normal data.

Rule: if |Zi| > 3, the ith observation is an outlier! A minimal sketch follows.
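A minimal sketch of the robust Z‐score rule in R. Base R's mad() already includes the 1/0.675 ≈ 1.4826 consistency factor, so dividing by mad(x) is equivalent to the 0.675(X − median)/MAD formula above; the simulated data are illustrative.

```r
# Robust Z-scores via median and MAD; |Z| > 3 flags an outlier.
x <- c(rnorm(50), 8)                  # 50 regular values plus one outlier
z <- (x - median(x)) / mad(x)         # mad() = 1.4826 * median(|x - median(x)|)
which(abs(z) > 3)                     # flagged observations
```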

Why?

• Data often (always) contain outliers.
• Classical estimates, such as the sample mean, the sample variance, sample covariances and correlations, or the LS (least‐squares) fit, are severely affected by outliers.
• Robust estimates provide a good fit to the bulk of the data, and this good fit helps to identify outlier(s) accurately!
• In this lecture, an overview of the concepts that will be covered is given.

Issues:

When is an observation “outlying enough” to be detected?

Deleting “good” observations results in inaccurate estimates

So,

Are there any other alternatives to mean and standard deviation?

One suggestion:

MEDIAN and Median Absolute Deviation (MAD)

Mean and Standard Deviation

Let x = (x1, x2, …, xn) be a set of observed values. Then the sample mean and variance are given as:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Example: the dot plot represents the data containing 24 determinations of the copper content in wholemeal flour (in parts per million):

          with 28.95   w/o 28.95
mean      4.28         3.21
sd        5.3          0.69

Median:

MAD:

$\operatorname{MAD}(X) = \operatorname{median}_i |x_i - \operatorname{median}(X)|$

To make the MAD comparable to the SD, use the normalized MAD (MADN):

$\operatorname{MADN}(X) = \operatorname{MAD}(X)/0.675$

(If X ~ N(μ, σ²), then MADN(X) = σ.)

          with 28.95   w/o 28.95                with 28.95   w/o 28.95
mean      4.28         3.21        median       3.38         3.37
sd        5.3          0.69        MAD          0.53         0.5

Replace observation 24 with numbers between −100 and 100, and calculate location measures:

(Figure: mean and median as the replaced value varies from −100 to 100.)

Replace observation 24 with numbers between −100 and 100, and calculate scale measures:

(Figure: sd and MAD as the replaced value varies from −100 to 100.)

Thus we can say that a single outlier has an unbounded influence on these two classical statistics.

Example: the following is the Q–Q plot for a dataset containing 20 determinations of the time (in microseconds) needed for light to travel a distance of 7442 m.

The SD method (i.e. the three‐sigma edit rule) fails to identify one of the observations revealed by the Q–Q plot: the observation −2 has z = −1.35, while the observation −44 has z = −3.73. The reason that −2 has such a small |zi| value is that both observations pull the sample mean x̄ to the left and inflate s; it is said that the value −44 “masks” the value −2.

Replacing the sample mean and sd in the rule by the median and MAD, we can identify both of the outliers (robust z‐scores −4.64 and −11.73, the values in parentheses).

OUTLIERS in Univariate Data (cont.)

BOXPLOT (Tukey’s method, 1977)

(Figure: boxplot of the data.)

A value between the inner fences ([Q1 − 1.5·IQR, Q3 + 1.5·IQR]) and the outer fences ([Q1 − 3·IQR, Q3 + 3·IQR]) is a possible outlier. An extreme value beyond the outer fences is a probable outlier. A minimal sketch of these fences follows.
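A minimal sketch of Tukey's fences in R; the simulated data are illustrative, and base R's boxplot() uses the same inner‐fence rule for its whiskers.

```r
# Tukey's fences: inner fences at 1.5 * IQR, outer fences at 3 * IQR.
x <- c(rnorm(50), 9)
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
inner <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
outer <- c(q[1] - 3 * iqr, q[2] + 3 * iqr)
x[x < inner[1] | x > inner[2]]   # possible outliers (beyond the inner fences)
x[x < outer[1] | x > outer[2]]   # probable outliers (beyond the outer fences)
```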

OUTLIERS in Univariate Data (cont.)

Adjusted BOXPLOT (Vanderviere and Huber, 2004)

Tukey's method is based on robust measures such as the lower and upper quartiles and the IQR, without considering the skewness of the data.

Vanderviere and Huber (2004) introduced an adjusted boxplot taking into account the medcouple (MC), a robust measure of skewness for a skewed distribution.

(xj  medk )(medk  xi ) MC(x1 ,...,xn )  med xj  xi

i and j have to satisfy xi ≤ medk ≤ xj , and xi ≠ xj . The interval of the adjusted boxplot is as follows (G. Bray et al. (2005)):

[L, U] = [Q1 − 1.5 · exp(−3.5·MC) · IQR, Q3 + 1.5 · exp(4·MC) · IQR] if MC ≥ 0
[L, U] = [Q1 − 1.5 · exp(−4·MC) · IQR, Q3 + 1.5 · exp(3.5·MC) · IQR] if MC ≤ 0

Adjusted BOXPLOT (Vanderviere and Huber, 2004)

R package ‘robustbase’ (2011) implements the medcouple (mc()) and the adjusted boxplot (adjbox()); a minimal sketch follows.
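A minimal sketch using the ‘robustbase’ functions named above; the right‐skewed simulated sample is illustrative.

```r
library(robustbase)
x <- rlnorm(100)   # right-skewed data, where Tukey's symmetric fences over-flag
mc(x)              # medcouple: robust measure of skewness
adjbox(x)          # adjusted boxplot with skewness-corrected fences
```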

OUTLIERS in Univariate Data (cont.)

MADe method
• uses the median and the Median Absolute Deviation (MAD)

2 MADe method: median ± 2 MADe
3 MADe method: median ± 3 MADe,

where MADe = 1.483 × MAD for large normal data.
• MAD is an estimator of the spread in the data, similar to the standard deviation, but with an approximately 50% breakdown point, like the median, where

$\operatorname{MAD} = \operatorname{median}(|x_i - \operatorname{median}(x)|), \quad i = 1, 2, \ldots, n$

A minimal sketch of the MADe rule follows.
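A minimal sketch of the 3 MADe rule in R on illustrative data.

```r
# 3 MADe rule: flag values outside median +/- 3 * MADe, MADe = 1.483 * MAD.
x <- c(rnorm(50), 7)
made <- 1.483 * median(abs(x - median(x)))
limits <- median(x) + c(-3, 3) * made
x[x < limits[1] | x > limits[2]]   # observations outside the interval
```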

(Figures: dot plot and histogram of variable C1, with the outlier visible at the far right.)

Outliers in Multivariate Data

Multivariate data: a data set involving two or more variables (n patients in the rows, p genes in the columns).

Idea:
• Transform the multivariate outlier detection task into a univariate outlier detection problem.

OUTLIERS in Multivariate Data: Visual tools

Scatter plots and 3D scatter plots

Higher dimensions???

OUTLIERS in Multivariate Data

Chernoff faces

(Chernoff, 1973 & Flury and Riedwyl, 1988)

OUTLIERS in Multivariate Data

Andrews’ curve

Coding and representing multivariate data by curves (Andrews, 1972).

Classical and Robust Statistical Distance‐based Methods

Statistical distance‐based methods (n > p)

Method. Detect outliers by computing a measure of how far a particular point is from the center of the data.

• The usual measure of “outlyingness” for a data point is the Mahalanobis (1936) distance:

$D_i = \sqrt{(x_i - \bar{x})'\, S^{-1}\, (x_i - \bar{x})}, \quad i = 1, 2, \ldots, n$

• Use Grubbs' test (the maximum normed residual test, another statistical method under the normal distribution) on this measure to detect outliers.

Statistical distance‐based methods

• The usual measure of “outlyingness” for a data point is the Mahalanobis (1936) distance:

$D_i = \sqrt{(x_i - \bar{x})'\, S^{-1}\, (x_i - \bar{x})}, \quad i = 1, 2, \ldots, n$

However, $\bar{x}$ and S are themselves sensitive to outliers; a sketch of the classical screen follows.
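A sketch of the classical screen in base R; the chi‐squared cutoff is a common convention, and the example data are illustrative. As just noted, x̄ and S here are computed from all observations, including the outliers themselves.

```r
# Classical Mahalanobis screening; mahalanobis() returns squared distances.
set.seed(2)
X <- rbind(matrix(rnorm(100), ncol = 2), c(5, 5))   # 50 regular points + 1 outlier
d2 <- mahalanobis(X, colMeans(X), cov(X))           # D_i^2 with classical x-bar, S
which(d2 > qchisq(0.975, df = ncol(X)))             # points beyond the 97.5% cutoff
```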

A robust version of this method is needed!

Robust Statistical Distance‐based Methods

Two Phases for Outlier Detection Methods (Rocke and Woodruff, 1996)

• Obtain robust estimates of location T and scatter C.

• Calculate a robust Mahalanobis‐type distance:

T 1 RDi  (xi  T) C (xi  T), i  1,...,n

• Outlier boundary: determine a separation boundary Q.

If RDi > Q, the ith observation is declared an outlier.

Robust Statistical Distance‐based Methods

• MVE (minimum volume ellipsoid)
• MCD (minimum covariance determinant)
• FAST‐MCD, by Rousseeuw & Van Zomeren (1990) and Rousseeuw & Van Driessen (1999)

R package ‘robustbase’ (covMcd())
https://cran.r‐project.org/web/packages/robustbase/vignettes/fastMcd‐kmini.pdf

https://cran.r‐project.org/web/packages/robustbase/robustbase.pdf

MCD Algorithm

Determine ellipsoid containing h ≥[(n+p+1)/2] points with minimum covariance determinant.

T 1 RDi  (xi  xmcd ) Smcd (xi  xmcd ), i  1,...,n

The exact solution requires a combinatorial search; approximations are used in practice. A minimal covMcd() sketch follows.
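A minimal sketch with robustbase::covMcd(), the implementation referenced above; the 97.5% chi‐squared cutoff and the simulated data are illustrative choices.

```r
library(robustbase)
set.seed(3)
X <- rbind(matrix(rnorm(100), ncol = 2),           # regular observations
           matrix(rnorm(10, mean = 6), ncol = 2))  # a small cluster of outliers
fit <- covMcd(X)                                   # FAST-MCD location and scatter
rd <- sqrt(mahalanobis(X, fit$center, fit$cov))    # robust distances RD_i
which(rd > sqrt(qchisq(0.975, df = ncol(X))))      # declared outliers
```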

Limitations:
• large p (very slow!)

An Application for MCD: average brain and body weights for 28 species of land animals.

P. J. Rousseeuw and A. M. Leroy (1987), Robust Regression and Outlier Detection, Wiley, p. 57.

“Whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her ways.”

Francis Bacon (1620), Novum Organum II 29.

Robust Statistical Distance‐based Methods

• BACON (‘Blocked Adaptive Computationally‐Efficient Outlier Nominators’)

(Billor, Hadi and Velleman, 2000)

R Package ‘robustX’ (depends on ‘robustbase’)

(https://cran.r‐project.org/web/packages/robustX/robustX.pdf)

A minimal call is sketched below.
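A minimal call to robustX::mvBACON(); the component name ‘subset’ follows the package documentation, and the simulated data are illustrative.

```r
library(robustX)
set.seed(4)
X <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(10, mean = 6), ncol = 2))
fit <- mvBACON(X)        # multivariate BACON: grows a clean basic subset
which(!fit$subset)       # observations left outside the final basic subset
```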

Algorithm 1: General BACON Algorithm

Step 1: Identify an initial basic subset of m > p observations that can safely be assumed free of outliers, where p is the dimension of the data and m is an integer chosen by the data analyst.

Step 2: Fit an appropriate model to the basic subset, and from that model compute discrepancies for each of the observations.

Step 3: Find a larger basic subset consisting of observations known (by their discrepancies) to be homogeneous with the basic subset. Generally, these are the observations with smallest discrepancies. This new basic subset may omit some of the previous basic subset observations, but it must be as large as the previous basic subset.

Step 4: Iterate Steps 2 and 3 to refine the basic subset, using a stopping rule that determines when the basic subset can no longer grow safely.

Step 5: Nominate the observations excluded by the final basic subset as outliers.

Algorithm 2: Initial Basic Subset in Multivariate Data

Input: an n × p data matrix X and a number, m, of observations to include in the initial basic subset.
Output: an initial basic subset of at least m observations.

Two versions:

Version 1 (V1): initial subset selected based on Mahalanobis distances
Version 2 (V2): initial subset selected based on distances from the medians

BACON Algorithm (Billor, Hadi and Velleman, 2000)

Therefore two versions of BACON:

One version:
• nearly affine equivariant,
• a high breakdown point (upwards of 40%), and
• computationally efficient even for very large datasets.

Another version:
• affine equivariant,
• at the expense of a somewhat lower breakdown point (about 20%),
• but with the advantage of even lower computational cost, even for very large datasets.

BACON Algorithm (cont.)

Step 1: Divide the observations according to a suitably chosen

initial distance di into two subsets:

Basic subset and Non‐basic subset

Basic subset: a small subset, initially of size m = cp observations with the smallest values of di
Non‐basic subset: the rest of the data

(Diagram: observations 1, 2, …, m form the basic subset; the remaining observations up to n form the non‐basic subset.)

BACON Algorithm (cont.)

T 1 Step 2. Compute di(xb ,Sb )  (xi  xb ) Sb (xi  xb ) ,i  1,...,n

where and xb Sb are the mean and covariance of the basic subset.

Step 3. Form a new basic subset by including all observations with

$d_i(\bar{x}_b, S_b) < c$

where c is a critical value (a chi‐squared table value).

Step 4. Repeat Steps 2–3 until the size of the basic subset equals n, or until all observations in the non‐basic subset have $d_i(\bar{x}_b, S_b) > c$, and declare the observations in the non‐basic subset (if any) as outliers.

An Application for BACON: average brain and body weights for 28 species of land animals.

Advantages of BACON algorithm

• Outlier detection methods have suffered in the past from a lack of generality and a computational cost that escalated rapidly with the sample size.
• Samples of a size sufficient to support sophisticated methods rapidly grow too large for previously published outlier detection methods to be practical.
• The BACON algorithms given here reliably detect multiple outliers at a cost that can be as low as four repetitions of the underlying fitting method. They are thus practical for data sets of even millions of cases.
• The BACON algorithms balance between affine equivariance and robustness.

Outlier Detection Methods based on Principal Component Analysis (PCA)

What is PCA?

Generally speaking

Two objectives:

• “Data” or “dimension” reduction ‐ moving from many original variables down to a few “composite” variables (linear combinations of the original variables).

• Interpretation ‐ which variables play a larger role in the explanation of total variance.

PCA: Principal Component Analysis

PCs are constructed by maximizing

$u_i = \arg\max_{u'u = 1} \{\operatorname{var}(u'X)\}, \quad \text{subject to } u_i' X' X u_j = 0,\ 1 \le j < i$

$Z_i = u_{i1}X_1 + \cdots + u_{ip}X_p = u_i'X$, i = 1, 2, …, k: the ith PC score, or the ith principal component. A small prcomp() sketch follows.
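A small base‐R sketch relating these quantities to prcomp() output; the simulated data are illustrative.

```r
set.seed(5)
X <- matrix(rnorm(200), ncol = 4)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
pc$rotation[, 1]   # u_1: loading vector of the 1st PC
pc$x[, 1]          # Z_1 = u_1' x: scores on the 1st PC
pc$sdev^2          # eigenvalues of the sample covariance matrix
```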

Geometric understanding of PCA for a 3D point cloud

The 1st PC direction (maximizing the variance of the projections) explains the cloud best. The 1st and 2nd directions form a plane.

(Figure: a 257 × 235 image X reconstructed from k = 2, 10, 50, 100, and 200 principal components.)

PCA & Outlier Detection

• Methods that use classical PCs to identify potential outliers, and
• Methods for robustly estimating PCs that may also be used to detect potential outliers.

Types of Outliers in PC space

(Figure: types of observations relative to the PC subspace: regular observations, good leverage points, orthogonal outliers, and bad leverage points.)

PCA based Outlier Detection

Diagnostic Plot: to detect the type of observations in high‐dimensional data, plot

$OD_i = \|x_i - \hat{\mu} - P z_i'\| \quad \text{vs.} \quad SD_i = \sqrt{\sum_{j=1}^{k} z_{ij}^2 / l_j}, \quad i = 1, 2, \ldots, n$

where zi is the ith row of the score matrix $T_{n \times k} = (X_{n \times p} - 1_n \hat{\mu}')\, P_{p \times k}$,
lj (j = 1, 2, …, k) are the eigenvalues of the variance–covariance matrix, and
P is the matrix of eigenvectors corresponding to those eigenvalues. A sketch computing both distances follows.
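A sketch computing both diagnostic distances from a classical PCA in base R; the number of retained components k = 2 and the simulated data are illustrative.

```r
set.seed(6)
X <- matrix(rnorm(300), ncol = 6)
k  <- 2
pc <- prcomp(X)
Z  <- pc$x[, 1:k]                            # score matrix T (n x k)
l  <- pc$sdev[1:k]^2                         # eigenvalues l_j
SD <- sqrt(rowSums(sweep(Z^2, 2, l, "/")))   # score distances
Xhat <- sweep(Z %*% t(pc$rotation[, 1:k]), 2, pc$center, "+")  # rank-k fit
OD <- sqrt(rowSums((X - Xhat)^2))            # orthogonal distances
plot(SD, OD, xlab = "Score distance", ylab = "Orthogonal distance")
```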

Diagnostic plot

(Figure: the (SD, OD) plane partitioned into regular observations, good leverage points, orthogonal outliers, and bad leverage points.)

Sensitivity of PCA to Outliers

Classical PCA (CPCA) would mislead the analyst in the presence of outliers, since the covariance matrix (or correlation matrix) is sensitive to outliers!

(Figure: a two‐dimensional example showing how outliers affect PCA.)

Robust PCA based Outlier Detection

A) Replace the classical covariance matrix by a robust covariance estimator

 M‐estimator: Maronna (1976), Campbell (1980)

 MCD method: Rousseuw & Van Driessen (1999)

 S estimator: Davies (1987), Rousseeuw and Leroy (1987)

PROBLEM: valid ONLY for p < n (not for p > n)!

Robust PCA

B) Approaches based on projection pursuit (pp)

• Li and Chen (1985) (high computational cost!)
• Croux and Ruiz‐Gazen (2000) (numerical inaccuracy!)
• Hubert et al. (2002) (searches for the direction on which the projected observations have the largest robust scale, then removes this dimension and repeats)

Suitable for data sets with large p and/or large n!

Problem: numerical inaccuracy and high computational cost!

Robust PCA

C) Based on combination of ideas of PP and robust covariance estimation

i) based on FAST‐MCD (Hubert et al., 2005) (ROBPCA)
ii) based on BACON (Blocked Adaptive Computationally‐Efficient Outlier Nominators; Billor et al., 2000) (RBPCA)

RBPCA: Steps in this algorithm

Case 1: n>p

Step 1. Use SVD (singular value decomposition) on the centered data matrix, that is,

$X_c = X - 1\hat{\mu}' = U D P'$

(an affine transformation of the data). Score matrix: $T = X_c P$.

Step 2. Determine the mean ($\hat{\mu}_B$) and the variance–covariance matrix ($\hat{\Sigma}_B$) based on the clean observations obtained from BACON.

Step 3. Find the robust PCs of the BACON‐based covariance matrix, and determine the number of PCs, say k.

Step 4. The new robust score matrix is $T^* = (X - 1\hat{\mu}_B')\, P^*$.

RBPCA: Steps in this algorithm

Case 2: n < p

Step 1. Use SVD (singular value decomposition) on the centered data matrix, that is,

$X_c = X - 1\hat{\mu}' = U D P'$

(an affine transformation of the data). Score matrix: $T = X_c P$.

RBPCA: Steps in this algorithm

Step 2. Find a clean set of observations (say h = ⌊(n + p + 1)/2⌋) from T (n < p):

Project the high‐dimensional data points onto many univariate directions v.

“Outlyingness” measure

• For every direction v, find a robust center, $\hat{\mu}_R$, and a robust standard deviation, $\hat{\sigma}_R$, of the projected data $x_j'v$ (j = 1, 2, …, n).
• Find out which points are outlying on the projection vector by the “outlyingness” measure:

$\operatorname{outl}(x_i) = \max_v \dfrac{|x_i'v - \hat{\mu}_R|}{\hat{\sigma}_R}, \quad i = 1, \ldots, n$

• Determine h clean observations, with n/2 < h < n (e.g. h = 0.75n, or $h = \lfloor (n+p+1)/2 \rfloor$). A small sketch follows.

Projecting points onto different projection vectors (dashed lines)

(Figure, two panels: on one direction B is outlying while A and C are not; on another direction B is not outlying.)

RBPCA: Steps in this algorithm

Step 3. Apply PCA on the data matrix of the h clean observations ($T_{h \times p}$) and obtain

$T_1 = (T - 1_n \hat{\mu}')\, P$

• Find the variance–covariance matrix of T1.
• Determine the number of PCs, k < p.

Step 4. Obtain the robust PCs based on the robust variance–covariance matrix from Step 3.

Issues for High Dimensional Data

Even for fast algorithms, the computation time increases linearly with n and cubically with p!

None of these methods works quite as well when the dimensionality is high!

PCOUT ALGORITHM (Filzmoser et al., 2008)

A recent outlier identification algorithm, effective in high dimensions.

• Based on the robust distances obtained from semi‐robust principal components for robustly sphered data.
• Separate weights for location and scatter outliers are computed based on these distances.

The combined weights are used for outlier identification (see R: pcout; a minimal call is sketched below).
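A minimal call in R; note that pcout() lives in the ‘mvoutlier’ package by Filzmoser et al., and the component name ‘wfinal01’ follows that package's documentation. The simulated data are illustrative.

```r
library(mvoutlier)
set.seed(8)
X <- rbind(matrix(rnorm(500), ncol = 10),
           matrix(rnorm(50, mean = 4), ncol = 10))
fit <- pcout(X)              # PCOUT: combined location and scatter weights
which(fit$wfinal01 == 0)     # weight below the cutoff => declared outlier
```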

xij  med(x1j , . . . , xnj ) Step 1. Robustly sphere data : xij  mad(x1j , . . . , xnj ) Calculate: sample covariance matrix of the transformed data Step 2. Compute PCs based on C, retain only those "p" PCs that contribute to at least 99% of the total variance.

zij  med(z1j , . . . , znj ) Robustly sphere data : zij  ,j  1,2,...,p mad(z1j , . . . , znj )

The values $\{z_{ij}^*\}$ are used in two phases of the further steps:
• finding location outliers
• finding scatter outliers

PCOUT Algorithm (Filzmoser et al., 2008)

• Calculate Mahalanobis distances dL (weighted, based on robust kurtosis) and dS (unweighted).
• Calculate scatter and location weights from the translated biweight function:

$w(d_i; c, M) = \begin{cases} 0, & d_i \ge c \\ \left(1 - \left(\dfrac{d_i - M}{c - M}\right)^2\right)^2, & M \le d_i < c \\ 1, & d_i < M \end{cases}$

For location: M is the 33⅓% (1/3) quantile of dL, and c = median(dL) + 2.5·MAD(dL).
For scale: M and c are the square roots of the $\chi^2_{p^*,0.25}$ and $\chi^2_{p^*,0.99}$ quantiles.

Final weights:

$w_i = \dfrac{(w_i^S + c)(w_i^L + c)}{(1 + c)^2}, \quad c = 0.25$

If $w_i < 0.25$, the ith observation is an outlier.

PCOUT: Leukemia Data (72 × 7129)

We will try to identify multivariate outliers among the 7129 genes, without using the information of the two leukemia types ALL and AML. The outlying genes will then be used for differentiating between the cases.

2609 genes were identified as outliers.

Outliers in Functional Data

Introduction

FDA: a collection of different methods in statistical analysis for analyzing curves, or functional data.

In standard statistical analysis: focus is on

the set of data vectors (univariate, multivariate).

In FDA, focus is on

the type of data structure, such as curves, shapes, images, or sets of functional observations.

What are Functional Data about?

Figure: the change in temperature over the course of a year, taken from thirty‐five weather stations across Canada (Ramsay and Silverman, 2001). Atlantic stations in red, Continental in blue, Pacific in green, and Arctic in black.

What Questions can we ask of the functional data?

Statisticians:

• How can I represent the temperature pattern of a Canadian city over the entire year instead of just looking at the twelve discrete points? Should I just "connect the dots," or is there a better way to do this?
• Do the summary statistics "mean" and "covariance" have any meaning when I'm dealing with curves?
• How can I determine the primary modes of variation in the data? How many typical modes can summarize these thirty‐five curves?
• Do these curves exhibit strictly sinusoidal behavior?
• Can I create an analysis of variance (ANOVA) or linear model with the curves as the response and the climate as the main effect?

Outliers in Functional Data

Problem:

If we have some curves that behave differently from the rest (i.e. outliers in functional form), what happens to the FDA techniques?

Outliers in Functional Data

The study of outlier detection in functional data has started only recently, and was mostly limited to univariate curves, i.e. p = 1.

Febrero‐Bande et al (2008) identified two reasons why outliers can be present in functional data:

1. Gross errors can be caused by errors in measurement and recording, or by typing mistakes; these should be identified and corrected if possible.
2. Outliers can be correctly observed data curves that are suspicious or surprising in the sense that they do not follow the same pattern as the majority of the curves.

• Functional Depth‐based • Functional PCA‐based • Functional Boxplot Functional Depth‐based Iso-Depth What is Data Depth? curves

• A notion of data depth for non‐parametric multivariate data analysis.

• Provides center‐outward orderings of points in Euclidean space of any dimension, and leads to a new non‐parametric multivariate statistical analysis in which no distributional assumptions are needed.

• A data depth measures how deep (or central) a given point x in Rp is relative to F, a probability distribution in Rp (assuming {X1, …, Xn} is a random sample from F), or relative to a given data cloud.

Consider x1(t), x2(t), …, xn(t): n functions defined on an interval I.

Question: which one is the deepest function?

Depth for functional data

• used to measure the centrality of a curve with respect to a set of curves. E.g.: to define the deepest function.

• The functional depth provides a center‐outward ordering of a sample of curves. Order statistics are defined.

• The idea of the deepest point of a set of data allows one to classify a new observation by using the distance to a class's deepest point.

Many Data Depth Functions in MV

1. The Mahalanobis depth (Mahalanobis, 1936).
2. The half‐space depth (Hodges, 1955; Tukey, 1975).
3. The Oja depth (Oja, 1983).
4. The simplicial depth (Liu, 1990).
5. The majority depth (Singh, 1991).
6. The projection depth (Zuo, 2003).

Example: the Mahalanobis depth:

$MD(y, F) = \left[1 + (y - \mu_F)'\, \Sigma_F^{-1}\, (y - \mu_F)\right]^{-1}$

The Fraiman and Muniz Depth (FMD)

A functional data depth (Fraiman and Muniz (2001)):

$F_{n,t}(x_i(t))$: the empirical cdf of the values of the curves x1(t), …, xn(t) at a given time point t ∈ [a, b], given by

$F_{n,t}(x_i(t)) = \frac{1}{n}\sum_{j=1}^{n} I(x_j(t) \le x_i(t))$

• The Fraiman and Muniz functional depth, hereafter FMD, of a curve xi with respect to the set x1, …, xn:

$FMD_n(x_i) = \int_a^b D_n(x_i(t))\, dt$

where $D_n(x_i(t))$ is the univariate depth of the point xi(t), given by

$D_n(x_i(t)) = 1 - \left|\tfrac{1}{2} - F_{n,t}(x_i(t))\right|$

Functional Depth based Method

Febrero‐Bande et al (2008) proposed the following outlier detection procedure for univariate functional data (i.e. p = 1)

Step 1. For each curve, calculate its functional depth (several versions exist).
Step 2. Delete observations with depth below a cutoff C.
Step 3. Go back to Step 1 with the reduced sample, and repeat until no outliers are found.

Step 3 was added in the hope of avoiding masking effects. The cutoff value C was obtained by a bootstrap procedure. A usage sketch follows.
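A sketch using the ‘fda.usc’ package, which implements this depth‐plus‐bootstrap procedure; the function outliers.depth.trim(), the Fraiman–Muniz depth depth.FM, and the poblenou dataset (the NOx curves discussed later) all follow that package, but treat the exact arguments as assumptions.

```r
library(fda.usc)                 # FDA toolbox by Febrero-Bande et al.
data(poblenou)                   # NOx emission curves (fdata object)
out <- outliers.depth.trim(poblenou$nox, dfunc = depth.FM, nb = 200)
out$outliers                     # curves nominated as outliers
```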

Example: Outlier Curves

(Figure: sample of curves with the detected outlier curves highlighted.)

Pros & Cons

• Pros: fast computation; non‐parametric.
• Cons: dependent on the chosen depth method; the choice of the cutoff C is complicated.

Functional PCA‐based

Functional Principal Components Analysis
• Similar to multivariate principal components analysis (PCA), but instead of vectors (xi1, …, xin), we use curves xi(t).
• Summations become integrals; e.g., the PC score corresponding to the kth eigenfunction $\phi_k$ is

$z_{ik} = \int (x_i(t) - \mu(t))\, \phi_k(t)\, dt$

• Finding the PCs is equivalent to finding the eigenfunctions/eigenvalues of the covariance operator G(s, t), that is, solving

$\int G(s, t)\, \phi(t)\, dt = \lambda\, \phi(s)$

where G(s, t) = Cov(x(s), x(t)).

Functional PCA using basis expansion

• Select an appropriate orthogonal basis of L2: Φk(t), k = 1, 2, …
• Each curve xi(t) can then be represented as

$x_i(t) = \sum_{k=1}^{K} c_{ik}\, \Phi_k(t)$

where the number of basis functions, K, is selected via generalized cross‐validation, for example.
• The coefficients can be estimated using least squares.
• Assume yij = xi(tj) + εij, where εij is a measurement error.
• Obtain the matrix C = (cik). Outliers in C will be the functional outliers.
• Apply robust multivariate PCA to C.

Robust PCA Examples

• Spherical PCA – projects each observation onto the unit sphere:

$\tilde{x}_i = \dfrac{x_i - \hat{\mu}}{\|x_i - \hat{\mu}\|}$

where $\hat{\mu}$ is a robust estimator of the location parameter.

• ROBPCA – uses the Minimum Covariance Determinant (MCD) for low‐dimensional data (n > p), or a combination of projection pursuit and MCD for high‐dimensional data (n < p).

• BACON PCA – based on the Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm.

Diagnostic Plots for Robust PCA

• Score distances:

$Sd_i = \sqrt{\sum_{j=1}^{k} \dfrac{z_{ij}^2}{\lambda_j}}, \quad i = 1, \ldots, n$

where zij are the PC scores and λj are the eigenvalues.

• Orthogonal distances:

$Od_i = \|x_i - \hat{\mu} - P\, z_i'\|, \quad i = 1, \ldots, n$

where $\hat{\mu}$ is the robust center estimator and P is the matrix of eigenvectors.

Diagnostic Plots for Robust PCA

• Sdi large and Odi small: good leverage points (1, 4);
• Sdi small and Odi large: orthogonal outliers (5);
• Sdi large and Odi large: bad leverage points (2, 3).

Poblenou NOx Data (Whole Dataset)

• Sawant et al (2012) identified 5 outlying curves using a robust functional PCA approach.
• The identified outliers were all working days: 03/09, 03/11, 03/18, 04/29 and 05/02.
• The outliers were on days leading to a long weekend or a vacation period, and hence there was increased traffic flow.

• NOx emissions in Poblenou, Barcelona (Spain), over 115 days: hourly measurements of NOx made from 23 February 2005 to 26 June 2005.

(Figure, three panels: CFPCA, RFPCA‐MCD, and RFPCA‐BACON results.)

Conclusion

After detecting the outliers, we checked for the sources of the abnormal values of these curves.
• It was found that the days detected as outliers were weekends or related to small vacation periods around weekends.
• So we conclude that abnormal observations on specific days can be attributed to an increase in traffic due to small vacation periods.
• We also detected an outlier on Wednesday, March 9.
• The observation on 10th March has missing data and thus was not included in the analysis.
• So we could not pinpoint the reason behind the abnormal observation on 9th March.

Functional Boxplot

Example (Functional Boxplot)

Data from monthly sea temperatures over the East‐Central tropical Pacific Ocean, 1951–2007.

Functional boxplot: extends the univariate boxplot using data depth.

The functional boxplot of sea surface temperature (SST) with:
1. blue curves denoting the envelopes,
2. a black curve representing the median curve, and
3. red dashed curves marking the outlier candidates detected by the 1.5 times the 50% central region rule.

Functional Box Plot

The enhanced functional boxplot of SST with:
1. dark magenta denoting the 25% central region,
2. magenta representing the 50% central region, and
3. pink indicating the 75% central region.

Functional & Pointwise Boxplots

The pointwise boxplots of SST with medians connected by a black line. Pros & Cons

Pros & Cons
• Pros: fast computation; clear visualization.
• Cons: the choice of BD and MBD may not be optimal; bad performance for shape outliers.

The command fbplot for functional boxplots is in the fda R package, and MATLAB code is also available; a minimal sketch follows.
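A minimal fbplot() sketch on simulated curves; the shifted curve and the MBD method argument are illustrative.

```r
library(fda)
t <- seq(0, 1, length = 50)
Y <- sapply(1:30, function(i) sin(2 * pi * t) + rnorm(1, sd = 0.3) +
                              rnorm(length(t), sd = 0.05))   # 30 regular curves
Y <- cbind(Y, sin(2 * pi * t) + 2)    # one clearly shifted curve
fbplot(Y, x = t, method = "MBD")      # rows = time points, columns = curves
```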

Selected References

1. Barnett, V., and Lewis, T. (1994). Outliers in Statistical Data, 3rd ed. New York: Wiley.
2. Billor, N., Hadi, A., and Velleman, P. (2000). BACON: Blocked adaptive computationally‐efficient outlier nominators. Computational Statistics and Data Analysis, 34, 279–298.
3. Filzmoser, P., Maronna, R., and Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52, 1694–1711.
4. Hyndman, R. J., and Shang, H. L. (2010). Rainbow plots, bagplots, and boxplots for functional data. Journal of Computational and Graphical Statistics, 19(1), 29–45.
5. López‐Pintado, S., and Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104(486), 718–734.
6. Sun, Y., and Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20, 316–334.
7. Sawant, P., Billor, N., and Shin, H. (2012). Functional outlier detection with robust functional principal component analysis. Computational Statistics, 27, 83–102.