Finding Multivariate Outlier

Finding Multivariate Outlier

Finding Multivariate Outlier Applied Multivariate Statistics – Spring 2012 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA A A A A A Goals . Concept: Detecting outliers with (robustly) estimated Mahalanobis distance and QQ-plot . R: chisq.plot, pcout from package “mvoutlier” Appl. Multivariate Statistics - Spring 2012 2 Outlier in one dimension - easy . Look at scatterplots . Find dimensions of outliers . Find extreme samples just in these dimensions . Remove outlier Appl. Multivariate Statistics - Spring 2012 3 2d: More tricky No outlier in x or y Outlier Appl. Multivariate Statistics - Spring 2012 4 Recap: Mahalanobis distance . True Mahalanobis distance: MD(x) = (x ¹)T § 1(x ¹) ¡ ¡ ¡ . Estimated Mahalanobisp distance: M^D(x) = (x ¹^)T §^ 1(x ¹^) ¡ ¡ ¡ q Sq. Mahalanobis Distance MD2(x) = Sq. distance from mean in standard deviations IN DIRECTION OF X Appl. Multivariate Statistics - Spring 2012 5 0 ¹ = ; 0 µ ¶ Mahalanobis distance: Example 25 0 § = 0 1 µ ¶ Appl. Multivariate Statistics - Spring 2012 6 0 ¹ = ; 0 µ ¶ Mahalanobis distance: Example 25 0 § = 0 1 µ ¶ MD = 4 (20,0) Appl. Multivariate Statistics - Spring 2012 7 0 ¹ = ; 0 µ ¶ Mahalanobis distance: Example 25 0 § = 0 1 µ ¶ (0,10) MD = 10 Appl. Multivariate Statistics - Spring 2012 8 0 ¹ = ; 0 µ ¶ Mahalanobis distance: Example 25 0 § = 0 1 µ ¶ (10, 7) MD = 7.3 Appl. Multivariate Statistics - Spring 2012 9 Theory of Mahalanobis Distance Assume data is multivariate normally distributed (d dimensions) Mahalanobis distance of samples follows a Chi-Square distribution with d degrees of freedom (“By definition”: Sum of d standard normal random variables has Chi-Square distribution with d degrees of freedom.) Appl. Multivariate Statistics - Spring 2012 10 Check for multivariate outlier . Are there samples with estimated Mahalanobis distance that don’t fit at all to a Chi-Square distribution? . Check with a QQ-Plot . Technical details: - Chi-Square distribution is still reasonably good for estimated Mahalanobis distance - use robust estimates for ¹; § Appl. Multivariate Statistics - Spring 2012 11 Robust Estimates: Income of 7 people Robust Scatter Std. Dev. Robust Std. Dev. Robust Std. Dev. Robust Estimates for outlier detection . If scatter is estimated robustly, outlier “stick out” much more . Robust Mahalanobis distance: Mean and Covariance matrix estiamted robustly Appl. Multivariate Statistics - Spring 2012 15 Outlier easily detected ! Example - continued Appl. Multivariate Statistics - Spring 2012 16 Outliers in >2d can be well hidden ! No outlier, right? Appl. Multivariate Statistics - Spring 2012 17 Outliers in >2d can be well hidden ! Wrong! Appl. Multivariate Statistics - Spring 2012 18 Outliers in >2d can be well hidden ! This outlier can’t be seen in the scatterplot- matrix (but in a 3d plot) Appl. Multivariate Statistics - Spring 2012 19 Method 1: Quantile of Chi-Sqaure distribution . Compute for each sample (in d dimensions) the robustly estimated Mahalanobis distance MD(xi) . Compute the 97.5%-Quantile Q of the Chi-Square distribution with d degrees of freedom . All samples with MD(xi) > Q are declared outlier Appl. Multivariate Statistics - Spring 2012 20 Method 2: Adjusted Quantile . Adjusted Quantile for outlier: Depends on distance between cdf of Chi-Square and ecdf of samples in tails . Simulate “normal” deviations in the tails . Outlier have “abnormally large” deviations in the tails (e.g. more than seen in 100 simulations without outliers) Appl. Multivariate Statistics - Spring 2012 21 Method 2: Adjusted Quantile ECDF leaves “plausible” range Defines adaptive cutoff Appl. Multivariate Statistics - Spring 2012 22 Method 2: Adjusted Quantile Function “aq.plot” Appl. Multivariate Statistics - Spring 2012 23 Method 3: State of the art - pcout . Complex method based on robust principal components . Pretty involved methodology . Very fast – good for high dimensions . R: Function “pcout” in package “mvoutlier” . $wfinal01: 0 is outlier . $wfinal: Small values are more severe outlier . P. Filzmoser, R. Maronna, M. Werner. Outlier identification in high dimensions, Computational Statistics and Data Analysis, 52, 1694-1711, 2008 Appl. Multivariate Statistics - Spring 2012 24 Automatic outlier detection . It is always better to look at a QQ-plot to find outlier ! Just find points “sticking out”; no distributional assumption . If you can’t: Automatic outlier detection - finds usually too many or too few outlier depending on parameter settings - depends on distribution assumptions (e.g. multivariate normality) + good for screening of large amounts of data Appl. Multivariate Statistics - Spring 2012 25 Concepts to know . Find multivariate outlier with robustly estimated Mahalanobis distance . Cutoff - by eye (best method) - quantile of Chi-Square distribution Appl. Multivariate Statistics - Spring 2012 26 R commands to know . chisq.plot, pcout in package “mvoutlier” Appl. Multivariate Statistics - Spring 2012 27 Next week . Missing values Appl. Multivariate Statistics - Spring 2012 28 .

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    28 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us