A Modified Energy Statistic for Unsupervised Anomaly Detection

A Modified Energy Statistic for Unsupervised Anomaly Detection

A Modified Energy Statistic for Unsupervised Anomaly Detection Rupam Mukherjee GE Research, Bangalore, Karnataka, 560066, India [email protected], [email protected] ABSTRACT value. Anomaly is flagged when the distance metric exceeds a threshold and an alert is generated, thereby preventing sud- For prognostics in industrial applications, the degree of an- den unplanned failures. Although as is captured in (Goldstein omaly of a test point from a baseline cluster is estimated using & Uchida, 2016), anomaly may be both supervised and un- a statistical distance metric. Among different statistical dis- supervised in nature depending on the availability of labelled tance metrics, energy distance is an interesting concept based ground-truth data, in this paper anomaly detection is consid- on Newton’s Law of Gravitation, promising simpler comput- ered to be the unsupervised version which is more convenient ation than classical distance metrics. In this paper, we re- and more practical in many industrial systems. view the state of the art formulations of energy distance and point out several reasons why they are not directly applica- Choice of the statistical distance formulation has an important ble to the anomaly-detection problem. Thereby, we propose impact on the effectiveness of RUL estimation. In literature, a new energy-based metric called the -statistic which ad- statistical distances are described to be of two types as fol- P dresses these issues, is applicable to anomaly detection and lows. retains the computational simplicity of the energy distance. We also demonstrate its effectiveness on a real-life data-set. 1. Divergence measures: These estimate the distance (or, similarity) between probability distributions. Some com- 1. INTRODUCTION mon divergence measures are Kullback-Leibler divergence, Jensen-Shannon divergence and Hellinger distance. Prognostics is a critical requirement in many industrial fields today owing to the potential of cost savings and operational 2. Distance measures: These measures estimate the dis- efficiency entitlement through elimination of unscheduled fail- tance between a single point and a distribution by com- ures and shut-downs. One of the many important modules paring it with a sample drawn from the distribution. The used in a typical Prognostics and Health Management (PHM) most well-known measure in this category is Mahalanobis application is a health indicator module for the system un- distance (Mahalanobis, 1936). A few other distance mea- der consideration which not only estimates a health metric sures are Bhattacharya distance (Bhattacharyya, 1943) but also tracks the evolution of this metric with time. The and the energy distance. health indicator is estimated from a collection of sensor read- ings at different timestamps. Generally, one of many statis- For the industrial anomaly detection problem, it is mostly the tical distances of a test-point from a baseline cluster may be second category which is more significant because most of used as this health indicator. The data-points can be multi- the times, we end up comparing a single point with a baseline dimensional resulting from multiple sensor outputs that can cluster. be tapped from the system under monitoring. This statisti- In this paper, we are interested in the class of distances called cal distance or some function of it may be used as the health energy distance. These are interesting because they are based metric or the health indicator. on the notion of Newton’s law of gravitational energy and Time trending of the health metric as a function of histori- considers statistical observations as celestial objects having cal and future forecast usage patterns give us a visibility into gravitational pull between each other. A distance metric called the Remaining Useful Life (RUL) which is the ultimate tar- -statistic may be written to represent the energy distance be- E get of PHM. However, even before hitting the ultimate target tween distributions.The -statistic can be used to test the sta- E of RUL, doing anomaly detection as an intermediate step has tistical hypothesis of equality of two distributions. This con- cept was proposed and developed in (Szekely,´ Rizzo, et al., Rupam Mukherjee et al. This is an open-access article distributed under the 2004; Szekely´ & Rizzo, 2013; Szekely & Rizzo, 2017) where terms of the Creative Commons Attribution 3.0 United States License, which it was shown the -statistic is more general and powerful than permits unrestricted use, distribution, and reproduction in any medium, pro- E vided the original author and source are credited. many classical statistics. https://doi.org/10.36001/IJPHM.2021.v12i1.1323 International Journal of Prognostics and Health Management, ISSN2153-2648, 2021 000 1 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT Application of energy distances to single and multi sample tively. Let X = X ,X ,...,X be a random sample of S { 1 2 n1 } goodness of fit have been explored in (Rizzo, 2002b, 2002a; size n1 drawn from the density function fX . Similarly, YS = Szekely´ et al., 2004; Baringhaus & Franz, 2004; Szekely´ & Y ,Y ,...,Y is a random sample of size n drawn from { 1 2 n2 } 2 Rizzo, 2005; Rizzo, 2009; Yang, 2012). Several other ap- the density function f . The energy distance (X, Y ) be- Y E plications have been shown in (Szekely, Rizzo, et al., 2005; tween the distributions fX and fY has been defined in (Szekely´ Szekely´ & Rizzo, 2009; Feuerverger, 1993; Matteson & James, et al., 2004; Szekely,´ 1989, 2002). It is a population statistic 2014; Kim, Marzban, Percival, & Stuetzle, 2009). for the pair of random variables X and Y . In this paper, we are interested in exploring it further be- Now, a sample statistic may be used as an estimate for En1n2 cause of its ability to work with Euclidean distances between the population statistic (X, Y ). It may be calculated from E data-points rather than with the data-points themselves. This the pair of samples (XS,YS) as defined in (Szekely´ et al., ability leads to reduced computational complexity compared 2004; Szekely,´ 1989, 2002) and has the following form. to the Mahalanobis distance and makes the -statistic poten- E tially advantageous for memory and computation challenged n n systems. n n 2 1 2 (X ,Y ,Y ,...Y )= 1 2 En1n2 S 1 2 n2 n + n n n However, despite the aforementioned advantages, we found 1 2 " 1 2 i=1 m=1 X X some problems with the -statistic when applied to anom- n1 n1 E 1 1 aly detection. These problems arise from the fact that the - X Y X X E || i − m||2 − n2 || i − j||2 − n2 ⇥ statistic has been developed primarily for comparing equality 1 i=1 j=1 2 X X between two distributions whereas for anomaly detection we n2 n2 need to compare a single test point with a distribution (base- Y Y . (1) || l − m||2 line). In this paper, we analyse these problems and propose l=1 m=1 # X X a new metric called the modified energy distance ( -statistic) P For a d-dimensional space, each observation Xi and Yj is a based on the notion of Newton’s law which addresses these d 1 array. Mathematically, the entire sample may be rep- problems and is more suitable to be used in an anomaly de- ⇥ resented as a d n matrix for X and d n matrix for tection problem. ⇥ 1 S ⇥ 2 YS. Here, we would like to mention that detailed comparison of In industrial systems, usually after a baseline data-set is ac- the proposed metric against all other classical distance met- cumulated, future test-points are compared against it and the rics is a task we would attempt in future work. In this work, baseline itself is not frequently replaced except when there is we focus on establishing the improvements over the -statistic. E a need to re-establish the baseline. Thus, although the base- This paper is laid out as follows. In Section 2, we briefly line cluster varies depending on which time instant it was ac- review the -statistic and mention the reasons why it is com- quired from sensor readings and hence has a random nature, E putationally simpler than other techniques. In Section 3, we it is considered a constant in the anomaly analysis once it is explain the issues which make the -statistic less suitable captured and saved. Hence, in subsequent analysis, it is con- E for anomaly detection and in Section 4, we propose the - sidered invariant. Going forward, in this paper, we will be P statistic and through the use of synthetic data, show that it referring to the sample XS as the baseline cluster (or base- addresses these issues. In Sections 5.1 and 5.2, we derive a line, for simplicity). method for estimating probability from the -statistic using a P For the anomaly detection problem, we want to compare a chosen parametric distribution. In Section 5.4, we show how single point against a baseline cluster having many members. the performance of this new metric compares in terms of dis- Hence, in (1), we assume that Y has sample-size 1 and con- crimination performance against that of the Mahalanobis Dis- S tains only one member Y . We discard the notations n and tance, a classical multi-dimensional distance metric. In Sec- 1 1 n as n =1. We represent n by n going forward as the tion 6, we demonstrate a simple application of the -statistic 2 2 1 P subscript in n is no longer needed. For sufficiently large sam- on real-life data. In Section 7, we show how the training time ple size for X , n n +1. Also, we use y in place of Y of the proposed metric compares against that of Mahalanobis S 1 ⇡ 1 1 in future analysis since the subscript is not crucial for clarity Distance for an incremental baseline update approach.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    15 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us