A Modified Energy Statistic for Unsupervised Anomaly Detection
Rupam Mukherjee
GE Research, Bangalore, Karnataka, 560066, India [email protected], [email protected]
ABSTRACT value. Anomaly is flagged when the distance metric exceeds a threshold and an alert is generated, thereby preventing sud- For prognostics in industrial applications, the degree of an- den unplanned failures. Although as is captured in (Goldstein omaly of a test point from a baseline cluster is estimated using & Uchida, 2016), anomaly may be both supervised and un- a statistical distance metric. Among different statistical dis- supervised in nature depending on the availability of labelled tance metrics, energy distance is an interesting concept based ground-truth data, in this paper anomaly detection is consid- on Newton’s Law of Gravitation, promising simpler comput- ered to be the unsupervised version which is more convenient ation than classical distance metrics. In this paper, we re- and more practical in many industrial systems. view the state of the art formulations of energy distance and point out several reasons why they are not directly applica- Choice of the statistical distance formulation has an important ble to the anomaly-detection problem. Thereby, we propose impact on the effectiveness of RUL estimation. In literature, a new energy-based metric called the -statistic which ad- statistical distances are described to be of two types as fol- P dresses these issues, is applicable to anomaly detection and lows. retains the computational simplicity of the energy distance. We also demonstrate its effectiveness on a real-life data-set. 1. Divergence measures: These estimate the distance (or, similarity) between probability distributions. Some com- 1. INTRODUCTION mon divergence measures are Kullback-Leibler divergence, Jensen-Shannon divergence and Hellinger distance. Prognostics is a critical requirement in many industrial fields today owing to the potential of cost savings and operational 2. Distance measures: These measures estimate the dis- efficiency entitlement through elimination of unscheduled fail- tance between a single point and a distribution by com- ures and shut-downs. One of the many important modules paring it with a sample drawn from the distribution. The used in a typical Prognostics and Health Management (PHM) most well-known measure in this category is Mahalanobis application is a health indicator module for the system un- distance (Mahalanobis, 1936). A few other distance mea- der consideration which not only estimates a health metric sures are Bhattacharya distance (Bhattacharyya, 1943) but also tracks the evolution of this metric with time. The and the energy distance. health indicator is estimated from a collection of sensor read- ings at different timestamps. Generally, one of many statis- For the industrial anomaly detection problem, it is mostly the tical distances of a test-point from a baseline cluster may be second category which is more significant because most of used as this health indicator. The data-points can be multi- the times, we end up comparing a single point with a baseline dimensional resulting from multiple sensor outputs that can cluster. be tapped from the system under monitoring. This statisti- In this paper, we are interested in the class of distances called cal distance or some function of it may be used as the health energy distance. These are interesting because they are based metric or the health indicator. on the notion of Newton’s law of gravitational energy and Time trending of the health metric as a function of histori- considers statistical observations as celestial objects having cal and future forecast usage patterns give us a visibility into gravitational pull between each other. A distance metric called the Remaining Useful Life (RUL) which is the ultimate tar- -statistic may be written to represent the energy distance be- E get of PHM. However, even before hitting the ultimate target tween distributions.The -statistic can be used to test the sta- E of RUL, doing anomaly detection as an intermediate step has tistical hypothesis of equality of two distributions. This con- cept was proposed and developed in (Szekely,´ Rizzo, et al., Rupam Mukherjee et al. This is an open-access article distributed under the 2004; Szekely´ & Rizzo, 2013; Szekely & Rizzo, 2017) where terms of the Creative Commons Attribution 3.0 United States License, which it was shown the -statistic is more general and powerful than permits unrestricted use, distribution, and reproduction in any medium, pro- E vided the original author and source are credited. many classical statistics. https://doi.org/10.36001/IJPHM.2021.v12i1.1323
International Journal of Prognostics and Health Management, ISSN2153-2648, 2021 000 1 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT
Application of energy distances to single and multi sample tively. Let X = X ,X ,...,X be a random sample of S { 1 2 n1 } goodness of fit have been explored in (Rizzo, 2002b, 2002a; size n1 drawn from the density function fX . Similarly, YS = Szekely´ et al., 2004; Baringhaus & Franz, 2004; Szekely´ & Y ,Y ,...,Y is a random sample of size n drawn from { 1 2 n2 } 2 Rizzo, 2005; Rizzo, 2009; Yang, 2012). Several other ap- the density function f . The energy distance (X, Y ) be- Y E plications have been shown in (Szekely, Rizzo, et al., 2005; tween the distributions fX and fY has been defined in (Szekely´ Szekely´ & Rizzo, 2009; Feuerverger, 1993; Matteson & James, et al., 2004; Szekely,´ 1989, 2002). It is a population statistic 2014; Kim, Marzban, Percival, & Stuetzle, 2009). for the pair of random variables X and Y . In this paper, we are interested in exploring it further be- Now, a sample statistic may be used as an estimate for En1n2 cause of its ability to work with Euclidean distances between the population statistic (X, Y ). It may be calculated from E data-points rather than with the data-points themselves. This the pair of samples (XS,YS) as defined in (Szekely´ et al., ability leads to reduced computational complexity compared 2004; Szekely,´ 1989, 2002) and has the following form. to the Mahalanobis distance and makes the -statistic poten- E tially advantageous for memory and computation challenged n n systems. n n 2 1 2 (X ,Y ,Y ,...Y )= 1 2 En1n2 S 1 2 n2 n + n n n However, despite the aforementioned advantages, we found 1 2 " 1 2 i=1 m=1 X X some problems with the -statistic when applied to anom- n1 n1 E 1 1 aly detection. These problems arise from the fact that the - X Y X X E || i m||2 n2 || i j||2 n2 ⇥ statistic has been developed primarily for comparing equality 1 i=1 j=1 2 X X between two distributions whereas for anomaly detection we n2 n2 need to compare a single test point with a distribution (base- Y Y . (1) || l m||2 line). In this paper, we analyse these problems and propose l=1 m=1 # X X a new metric called the modified energy distance ( -statistic) P For a d-dimensional space, each observation Xi and Yj is a based on the notion of Newton’s law which addresses these d 1 array. Mathematically, the entire sample may be rep- problems and is more suitable to be used in an anomaly de- ⇥ resented as a d n matrix for X and d n matrix for tection problem. ⇥ 1 S ⇥ 2 YS. Here, we would like to mention that detailed comparison of In industrial systems, usually after a baseline data-set is ac- the proposed metric against all other classical distance met- cumulated, future test-points are compared against it and the rics is a task we would attempt in future work. In this work, baseline itself is not frequently replaced except when there is we focus on establishing the improvements over the -statistic. E a need to re-establish the baseline. Thus, although the base- This paper is laid out as follows. In Section 2, we briefly line cluster varies depending on which time instant it was ac- review the -statistic and mention the reasons why it is com- quired from sensor readings and hence has a random nature, E putationally simpler than other techniques. In Section 3, we it is considered a constant in the anomaly analysis once it is explain the issues which make the -statistic less suitable captured and saved. Hence, in subsequent analysis, it is con- E for anomaly detection and in Section 4, we propose the - sidered invariant. Going forward, in this paper, we will be P statistic and through the use of synthetic data, show that it referring to the sample XS as the baseline cluster (or base- addresses these issues. In Sections 5.1 and 5.2, we derive a line, for simplicity). method for estimating probability from the -statistic using a P For the anomaly detection problem, we want to compare a chosen parametric distribution. In Section 5.4, we show how single point against a baseline cluster having many members. the performance of this new metric compares in terms of dis- Hence, in (1), we assume that Y has sample-size 1 and con- crimination performance against that of the Mahalanobis Dis- S tains only one member Y . We discard the notations n and tance, a classical multi-dimensional distance metric. In Sec- 1 1 n as n =1. We represent n by n going forward as the tion 6, we demonstrate a simple application of the -statistic 2 2 1 P subscript in n is no longer needed. For sufficiently large sam- on real-life data. In Section 7, we show how the training time ple size for X , n n +1. Also, we use y in place of Y of the proposed metric compares against that of Mahalanobis S 1 ⇡ 1 1 in future analysis since the subscript is not crucial for clarity Distance for an incremental baseline update approach. Fi- in representing a single member set Y . nally, in Section 8, we summarize the observations. S
2. REVIEW OF ENERGY DISTANCE AND ITS COMPUTA- TIONAL SIMPLICITY Let X and Y be two independent real-valued random vari- ables with probability density functions fX and fY respec-
2 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT
From (1), we write a simplified form for as cluster based on when that data set was captured in time, En En1n2 but once they are collected for any machinery, they are usu- (X ,Y )= (X ,y) En S 1 En S ally not changed during the comparison or anomaly detection 2 n phase. Thus, for all practical purposes, they are same as a set X y ⇡ n || i ||2 of multi-dimensional real-valued points. i=1 X 1 n n From the way anomaly detection is usually implemented in Xi Xj . (2) industrial systems, we intuitively desire the following condi- n2 || ||2 i=1 j=1 tions from any acceptable anomaly measure (say ⇤) which X X En may be used in a practical anomaly detection system. Reference (Szekely´ et al., 2004) mentions that for (X ,y) En S The desired characteristics with respect to cluster size and to be a valid distance metric, there must exist a c for every ↵ distance from centroid along with their mathematical expres- ↵ (0, 1) such that 2 sions are as follows.
P (Z>c↵)=↵. (3) Here, Z is a random variable. A random sample z drawn from 1. Relationship with cluster size the distribution of Z is a function of the baseline XS and a random sample x drawn from the distribution fX defined earlier. It takes a form (a) For the test-point y situated at a given distance from the centroid of the baseline cluster and outside the z = (X ,x). (4) En S baseline, it should appear less anomalous if it is closer to the outer boundary of the baseline cluster. As mentioned in Section 1 and in (Szekely´ et al., 2004; Szekely´ Given the saved baseline X , we define a random & Rizzo, 2013; Szekely & Rizzo, 2017), the -statistic is sim- S E variable R such that pler and more general than classical distance metrics. While writing this paper, on comparing (1) with the classical dis- R = X XS 2 and (5) tance metrics described in (Statistical distance, n.d.), the ad- || || vantages which specifically interested us are R has a probability density function fR whose model parameters may be estimated from the sample X . 1. The Euclidean distances in (1) can be computed in paral- S Here, X is the centroid of the baseline cluster which lel and hence the computation time of -statistic does not S E is nothing but the sample mean. scale with data size if implemented in a parallel manner. For the test-point y, let r = y X . Since, f || S||2 R 2. Covariance matrices are not required in (1). For high- is a function of the baseline cluster XS and from (5) dimensional data, this can take up a significant amount the domain of R is the set of Euclidean distances of of computation and memory all possible samples of X from XS, the probability density of any point with distance r from X may 3. Matrix inversion is not needed. In many classical meth- S be expressed as f (r, X ). ods, inversion of covariance matrices is required. Again, R S for high dimensional data, this can involve significantly This problem may now be cast in the form of a hy- heavy computation. pothesis test as follows.
Null hypothesis(H0): Random samples drawn from 3. SOME GAPS IDENTIFIED IN -STATISTIC E fR will be more extreme than r. y is flagged as an In this section, we analyse the general requirements from any anomaly with respect to XS if the null hypothesis is anomaly detection metric and point out some weaknesses in rejected. the -statistic which prevent it from satisfying some of these E Rejection criterion: The null hypothesis is rejected conditions. Going forward, these weaknesses form the basis if of the proposed innovation in this paper.
pR(XS)=P ( X XS 2 >r)=P (R>r) 3.1. Requirements from an Acceptable Metric for Anom- 1 aly Detection = fR(x, X S )dx < pth (6) Zr In anomaly detection, we usually obtain a single test sam- which is a pre-determined threshold. ple y and compare it against the baseline (say XS) which 0 is a cluster of samples drawn from the distribution fX . Al- Let there be a second cluster XS( ) which is formed though there is an element of randomness in the baseline by scaling the baseline with respect to its centroid
3 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT
0 such that its kth element XS( )k may be written as for data of any dimension.
X 0 ( ) = (X X )+X where (7) S k k S S 3.2. Synthetic Dataset >0. If >1, the number of points, relative ori- In order to examine how n performs with respect to the de- E entation of points and cluster shape are maintained sired characteristics stated in Section 3.1, we consider a d- 0 unchanged between XS and XS( ) with the second dimensional random variable X which is distributed as a uni- cluster occupying a larger spatial volume. Hence, form ball. We sample from this distribution to create a syn- the test-point will appear closer to the outer bound- thetic data-set following the method described in (Harman & 0 ary of XS( ) than to that of XS. In this scenario, Lacko, 2010). if ⇤ is a faithful indicator of anomaly, it should ap- En X can be written as pear to be less in the former case. This behaviour may be expressed mathematically as X = r (Z / Z ) Z1/d where (14) b 1 || 1||2 2 0 ⇤(X ( ),y) < ⇤(XS,y)if >1. (8) Z is sampled from a multi-variate uncorrelated standard nor- En S En 1 mal distribution, Z U(0, 1) and r is the desired radius of 2 ⇠ b Now, we saw that a larger radius of baseline is sup- the ball. posed to make y appear less anomalous. Also, larger For the purpose of this study, we restrict the dimension of X the anomaly, less should be the value of the inte- in (14) to 2 to keep the analysis and visualization simple. If gral in (6) since larger anomaly would mean less net the two dimensions of X are written as X(1) and X(2), they probability of having points more extreme. Hence, may be expressed in a simplified form as a function of two 0 random variables R and ✓ in the following way. pr(XS( )) >pr(XS)if >1. (9) X(1) = R cos ✓ and X(2) = R sin ✓ where (b) For an infinitesimally small baseline cluster, any test- R (0,r )1/2 and ✓ U(0, 2⇡) and (15) point would appear to have a very large anomaly, ⇠ b ⇠
irrespective of the value of the Euclidean distance rb is the chosen radius of the example cluster. It may be from the centroid. This is because the distance al- shown from (15) that (✓,R), the probability density func- ways appears large relative to the cluster size. 2 8 tion f✓,R =1/(⇡rb ). Thus, the distribution is uniform in Hence, nature. We also consider a fixed test-observation with loca- 0 tion ✓ =0and R =1. lim n⇤(XS( ),y)= (10) 0 E 1 ! With the anomaly metric increasing asymptotically, 3.3. Shortcomings of -statistic E we will have an asymptotic reduction of the integral We sweep r in (14) from 0 to 5 thereby progressively get- value in (6). b ting a different-sized baseline and thereby changing in (7). This implies different probabilities of the fixed test-point be- 2. Relationship with distance from cluster centroid longing to this cluster. We then evaluate the simplified in En (2). (a) If the test-point is further away from the centroid of the baseline cluster, it should appear more anoma- Figures 1 and 2 show the baselines and test-points for two lous and hence, the anomaly metric should increase. different values of r along with computed values of for b En Thus, if there are two test-points y1 and y2, both the cases. Figure 3 shows a continuous plot of how En varies with rb. n⇤(XS,y1) > n⇤(XS,y2)if (11) E E From these plots, the -statistic can be shown to violate the y1 XS > y2 XS . (12) E 2 2 desired characteristics mentioned in Section 3.1 in the follow- ing manner. (b) If the test-point is at the center, the anomaly metric should be close to zero. Hence, 1. Limiting condition behaviour: The -statistic violates E ⇤(X ,y) 0if y X 0. (13) the limiting conditions in the following manners. En S ! S 2 ! It should be noted that (7) and (11) indicate that the ideal (a) It has been shown in Appendix A (59) that the met- anomaly metric should follow a monotonic behaviour with ric n proposed in (2) has the following property. E respect to the baseline size and also distance of the test-point 0 lim n(XS( ),y)=2 XS y = . (16) from the baseline cluster. These ideal conditions hold good 0 E 2 6 1 ! 4 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT
flection point beyond which it increases instead of reducing and reaches significantly high values in- stead of very small ones. For the particular case in Figure 2, the value of 2.363 is not appropriate. It En should be very close to zero. Also, the should be En higher in Figure 3 compared to Figure 2. However, the opposite is actually observed.
(c) As stated above, in Section 3.1, ⇤ =0if and Figure 1. Baseline (blue) and test-point (red) with rb =0.5 En only if y = X for a viable distance-metric for and n =1.6107 c E anomaly detection. But this condition is violated by the -statistic in its present form because in (1), E =0if and only if the distributions X and En1n2 Y are identical i.e. X = Y . However the single test-point y considered here forms a single member distribution which cannot be identical to X under any circumstance.
2. Monotonic behaviour: From Figure 3, it is seen that the value of goes through an inflection point as r varies En instead of varying monotonically as required in Section Figure 2. Baseline (blue) and test-point (red) with rb =5.0 3.1 and =2.363 En The issues with -statistic raised in this section have been fur- E This violates the condition required for the ideal ther addressed and a modified algorithm suggested in Section metric ⇤ stated in (10). Thus, in its present 4. Since the target in this paper is anomaly detection, we have En En form cannot function as the ideal metric. also extended the analysis to reach a closed form analytical expression for the probability density function (PDF) based From Figure 3, it is clear that when the baseline on which the p-value of the test-point may be estimated. cluster becomes progressively smaller and close to 0, the variation of n is not asymptotic to infinity E ODIFIED NERGY ISTANCE but hits a finite value equal to the distance of the 4. M E D test-point from the baseline centroid. The reason why energy distance is computationally simple is that it works with distributions of Euclidean distances whereas other distance metrics first fit a multivariate distribution and (b) Appendix A (62) also proves the following. then work our probabilities from that. The -statistic is one of E 0 the first energy-based metric proposed for anomaly detection. lim n(XS( ),y)= . (17) E 1 !1 However, we saw in Section 3.3 that it has a few weaknesses. In this section, we borrow the concept of working with Eu- From Figure 3, it is clear that when baseline cluster clidean distances from a study of the -statistic and propose increases in radius, the value of does not always E En an alternative and slightly modified metric based on them. reduce tending towards 0. It goes through an in- The potential energy of a configuration consisting of two par- ticles having masses m1 and m2 respectively and having po- sitions X1 and X2 respectively is the work that needs to be done to move one particle from infinity to its current position against the gravitational force exerted by the other particle already in place. It can be written as Gm m PE(X ,X )= 1 2 . (18) 1 2 X X || 1 2||2 We define the incremental potential energy of y to be the Figure 3. Variation of n with baseline cluster radius (rb), work that needs to be done to move the test-point y from infin- E given a fixed test-point. ity to its present position against the collective force exerted
5 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT
by the baseline sample cluster XS. The incremental potential Here, ✏ is an infinitesimal positive number on the real line. energy may be written as The closer ✏ is to 0, the better is the match between (23) and
n (24). A guideline may be to use P (X ,y)= PE(X ,y) n S k 1 k=1 0 <✏<1e 04 X X . (25) X n k S 2 n k=1 C X = , (19) Xk y The reader may note that this is a guideline and one may use k=1 || ||2 X smaller values of ✏ if they want. However, larger values of where C is a positive constant if we assume that each data- ✏ may not be advisable. We would like the reader to know point in the baseline sample cluster XS as well as the test- that a rigorous evaluation of the impact of the choice of this point y represent particles of equal mass. value on the results needs to be conducted as future work but The net potential energy for assembling the baseline cluster at present, we do not see this as a big risk to this method. XS may be written as We now examine the limiting conditions or 0 and ! ! n which have been discussed with respect to in (16) and 1 En P (XS)= PE(Xk,Xj) (20) (17) respectively. Using the definition of as given in (7), j=2 k